# Devflow Implementation Plan v4 r1

## 0. Document Status

- **v4 r1: language migration TS → Python.** Major version bump; the TypeScript monorepo (apps/, packages/, tests/, scripts/, pnpm/tsconfig metadata) was deleted in `0e61b2d` after being re-implemented under `my-deepagent/`. v3 CC counters are preserved as historical context; v4 begins its own CC series (DR-1 below; CC-Py-1 onward as new change clarifications land).
- This document supersedes v2 and all earlier v3 drafts where conflicting.
- Single-user, single-machine assumption. No auth, no retention policy, no observability dashboards, no multi-tenancy.
- Target OS: macOS 13+ / Linux. No Windows.
- All paths are Unix-style. All times are stored UTC.
- Decisions in this document are locked unless explicitly marked `(provisional)`. Override requires updating this document, not only code.
- §1 Stack Decisions, §2 Directory Layout, §3 doctor checklist, §22 Decision Log have been rewritten for v4 r1. §4~§17 (DB schema, enums, hashing, template/persona/binding, session runtime, prompt envelope, artifact registry, run events, fake adapter, state machines, errors, write set, SSE contract) are language-neutral domain spec and remain valid for the Python implementation.
- v3 CC history (informational):
  - r1 applied CC-1 through CC-5.
  - r2 applied CC-6 through CC-10.
  - r3 applied CC-11 through CC-15.
  - r4 applies CC-16 through CC-18.
  - r5 applies CC-19.
  - r6 applies CC-20.
  - r7 applies CC-21 through CC-23.
  - r8 applies CC-24 through CC-26.
  - r9 applies CC-27 through CC-28.
  - r10 applies CC-29 through CC-31.
  - r11 applies CC-32.
  - r12 applies CC-33 through CC-35.
  - r13 applied CC-39 (final v3 revision; superseded by v4 r1).

## 1. Stack Decisions

### 1.1 Workspace

- **Python 3.12+**, managed by **uv** workspaces (`uv sync`, `uv add`, `uv run`).
- Pinned via `.python-version`. No Node, no pnpm, no tsc.
- `pyproject.toml` at repo root + per-package `pyproject.toml` under `packages//` (uv workspace members).
- Imports are absolute. No `from . import *`.

### 1.2 Tooling

| Concern | Choice | Notes |
|---------|--------|-------|
| Lint / format | **ruff** | One root `ruff.toml`. `ruff check .` + `ruff format --check .`. |
| Type check | **mypy --strict** | `mypy.ini` enables strict mode; tests relax `disallow_untyped_defs`. |
| Test | **pytest** + **pytest-asyncio** + **pytest-httpx** + **respx** | `pytest -q`. |
| Pre-commit | **pre-commit** (`.pre-commit-config.yaml`) | Runs ruff + mypy + pytest --collect-only. |
| Schema validation | **pydantic v2** + **pydantic-settings** | Replaces zod. |
| YAML | **PyYAML** | Persona/template YAML loaders. |
| JSON Schema | **jsonschema** (2020-12) | Artifact registry. |
| HTTP client | **httpx** (async) | OpenRouter / pricing fetch. |
| Logging | **structlog** + **rich** | Replaces pino. `_scrub_processor` redacts secrets before stderr / JSON sinks. |
| CLI | **typer** + **prompt_toolkit** | Replaces commander; prompt_toolkit drives the interactive REPL. |
| OS dirs | **platformdirs** | XDG data / state / config dirs. |
| Secrets | **keyring** | macOS Keychain / Linux Secret Service / Windows Credential Store. |

### 1.3 Database

- **v0.1.0: SQLite 3 (WAL mode)** via **aiosqlite**, ORM: **SQLAlchemy 2.0 async**.
- Migrations: **Alembic** (baseline + per-feature revisions).
- WAL + `busy_timeout=5000` + `PRAGMA foreign_keys=ON` enforced at connect.

**Why SQLite for v0.1.0** — single CLI process is the only writer. No Docker prerequisite for the `mydeepagent` install path, no service boot ordering, no network. WAL handles concurrent reads from the same process fine. The `ux_active_run_repo_base` partial unique index covers the active-run uniqueness invariant.

**Migration trigger → Postgres 16 (planned for v0.2 PR #1, ahead of M8-Py).** The bound is when *a second process* starts writing to `my-deepagent`'s `runs` / `run_phases` / `llm_calls` tables. That happens with FastAPI (M8-Py) or the web GUI port — not with Temporal.
Temporal worker (M5-Py) runs against its own backing store (`temporal` namespace, separate Postgres DB or sqlite-cluster) and does not touch `my-deepagent` ORM tables, so M5-Py does *not* force a DB switch by itself. Concretely the cut point is:

| Milestone | Writer count to my-deepagent DB | Verdict |
|-----------|--------------------------------|---------|
| v0.1.0 (CLI + REPL) | 1 (the CLI process) | SQLite OK |
| M5-Py (Temporal worker) | still 1 (worker = "CLI" continuation; Temporal DB is separate) | SQLite still OK |
| M8-Py (FastAPI HTTP server) | 2 (CLI + API server) | **Postgres required** — SQLite WAL allows only one concurrent writer |
| Web GUI / multi-tenant | 2+ | Postgres |

When the switch lands, the migration plan is: stop writers → `alembic downgrade base` against sqlite → re-generate the baseline against Postgres (SQLAlchemy stays the same; only dialect-specific bits — JSON columns, partial unique index syntax, UPSERT — need review) → `pg_isready` doctor check.

### 1.4 Logging

- **structlog**, JSON sink to stderr by default, rich pretty sink when stdout is a TTY.
- Standard fields: `time`, `level`, `module`, `run_id?`, `phase_id?`, `role?`, `event_id?`, `interactive_session_id?`.
- `_scrub_processor` redacts OpenRouter / Anthropic / OpenAI / LangSmith / GitHub / GitLab API keys and generic `Bearer …` tokens before emission.
- Levels: same semantics as v3 (`trace`/`debug`/`info`/`warn`/`error`).

### 1.5 Config

- Single `pydantic-settings` BaseSettings in `my_deepagent.config.Config` with `MYDEEPAGENT_` env prefix and optional TOML source.
- Source precedence (high → low): explicit overrides → `os.environ` (with `MYDEEPAGENT_` prefix) → `.env` → `config.toml` → schema defaults.
- Config is loaded once at process start, validated, frozen, and re-exported as an immutable typed `Config`.
- Validation failure is fatal (exit code 2).
- Required keys at v0.1.0:
  - `MYDEEPAGENT_DATABASE_URL` (default `sqlite+aiosqlite:////db.sqlite3`)
  - `MYDEEPAGENT_WORKSPACE_ROOT`
  - `MYDEEPAGENT_LOG_LEVEL`
  - `MYDEEPAGENT_OPENROUTER_API_KEY` when the OpenRouter backend is enabled (resolution order: config → env → OS keyring → error).
- Path canonicalization: `workspace_root` is resolved via `Path.resolve()` at config load. Any path entering the system is canonicalized before storage or hashing.

Backend registration (deepagents-flavored):

```python
from pydantic import BaseModel

# `Backend` is the closed-set enum from §5.1 (my_deepagent/enums.py).


class BackendConfig(BaseModel, frozen=True):
    id: Backend  # openrouter | anthropic | openai | google | fake
    enabled: bool
    api_base_url: str | None = None  # openrouter default https://openrouter.ai/api/v1
    api_key_env: str | None = None  # default MYDEEPAGENT_OPENROUTER_API_KEY
```

- `fake` is always available.
- `openrouter` is available only when enabled and the resolved key is present.
- Doctor warns on misconfig; binding fails fast at run start with `human_required:backend_unavailable`.

### 1.6 HTTP / SSE

- **FastAPI** + **uvicorn** + **sse-starlette** for the M8-Py REST + SSE surface (v3 r13 §17 contract unchanged: same event types, same headers, same `data: \n\n` framing).
- Body validation via the same pydantic v2 models used elsewhere.
- WebSocket remains deferred unless SSE fails under transcript volume.

## 2. Directory Layout

v4 r1 collapses the v3 multi-package monorepo into a single `my-deepagent/` project. The TS `apps/`, `packages/`, `tests/`, `scripts/` trees were deleted in `0e61b2d`; v3 §4~§17 module-by-module spec still applies but each module now lives under `my_deepagent/.py` instead of `packages//src/.ts`.
```text
/
├── docs/
│   ├── plan.md                      # this document
│   ├── plan-v4-draft.md             # v4 r1 design memo (informational)
│   └── schemas/
│       ├── artifacts/               # JSON Schema 2020-12 (language-neutral)
│       ├── personas/                # YAML persona seed (language-neutral)
│       └── templates/               # YAML workflow templates
├── docker-compose.yml               # Postgres + Temporal (still relevant for M5-Py)
├── .env.example
├── .gitignore
├── my-deepagent-seed/               # v0.1.0 bootstrap kit (historical, may be pruned)
└── my-deepagent/
    ├── pyproject.toml               # uv workspace root
    ├── uv.lock
    ├── ruff.toml
    ├── mypy.ini
    ├── alembic.ini
    ├── .pre-commit-config.yaml
    ├── CHANGELOG.md
    ├── alembic/
    │   ├── env.py
    │   └── versions/                # baseline + per-feature migrations
    ├── docs/schemas/                # mirror of repo-root docs/schemas for loader convenience
    ├── src/my_deepagent/
    │   ├── config.py                # pydantic-settings Config (replaces §1.5 zod schema)
    │   ├── enums.py                 # closed-set enums (§5)
    │   ├── errors.py                # error taxonomy (§18)
    │   ├── hash.py                  # content-addressed hashing (§6)
    │   ├── persona.py               # Persona + loader (§7.2)
    │   ├── workflow.py              # WorkflowTemplate + loader (§7.1)
    │   ├── binding.py               # autoSelect / override / consent store (§7.4)
    │   ├── artifact_schema.py       # JSON Schema 2020-12 registry (§10)
    │   ├── run_event.py             # event types + idempotency keys (§11, §13.1)
    │   ├── prompt_envelope.py       # envelope builder (§9)
    │   ├── budget.py                # BudgetTracker (v4-new)
    │   ├── secrets.py               # config → env → keyring resolution chain
    │   ├── keys.py                  # OS keyring wrapper
    │   ├── audit.py                 # append-only JSONL audit log (v4-new)
    │   ├── logging.py               # structlog + secret scrubber (§1.4)
    │   ├── governance.py            # first-run consent (v4-new)
    │   ├── i18n/                    # ko / en catalog
    │   ├── recovery.py              # sweep_orphan_runs (§19)
    │   ├── session.py               # deepagents adapter (§8.5, v4-new)
    │   ├── engine.py                # WorkflowEngine — phase loop (§15)
    │   ├── persistence/
    │   │   ├── db.py                # SQLAlchemy 2 async engine
    │   │   ├── models.py            # ORM models (§4)
    │   │   └── checkpointer.py      # LangGraph SqliteSaver context
    │   ├── middleware/
    │   │   ├── cost.py              # CostMiddleware (v4-new)
    │   │   ├── budget.py            # BudgetMiddleware (v4-new)
    │   │   ├── audit.py             # AuditToolMiddleware
    │   │   ├── safety.py            # SafetyShellMiddleware (deny-path / destructive command)
    │   │   └── artifact_watcher.py  # ArtifactWatcherMiddleware
    │   ├── monitoring/
    │   │   ├── pricing.py           # OpenRouter pricing cache
    │   │   └── cost_estimator.py    # pre-run preview
    │   ├── cli/                     # typer-driven CLI
    │   │   ├── main.py              # entry (interactive REPL when no subcommand)
    │   │   ├── doctor.py            # §3 doctor checks (Python/uv version)
    │   │   ├── init.py
    │   │   ├── keys_cmd.py
    │   │   ├── run.py
    │   │   ├── runs.py
    │   │   ├── stats.py
    │   │   └── interactive.py       # prompt_toolkit REPL
    │   ├── tui/
    │   │   └── approval.py          # tri-state approval prompt
    │   └── slash.py                 # REPL slash commands
    └── tests/
        ├── unit/                    # pure-Python unit tests
        └── integration/             # async + persistence + real OpenRouter (gated)
```

Future trees deferred:

- `apps/api/`, `apps/worker/` (M5-Py / M8-Py): FastAPI app and temporalio worker. v4 r1 keeps them out until M5 lands.
- `apps/web/`: Web GUI port is out of scope for v0.1.0 (separate milestone).

## 3. `mydeepagent doctor`

Exit codes:

- `0`: all green.
- `1`: one or more red checks.
- `2`: internal or unknown error.

Each check emits:

- `name`
- `status`: `pass` | `fail` | `warn`
- `detail`
- `remediation`

Closed check list (v4 r1, 8 checks — Node/pnpm/Docker/Drizzle dropped):

1. **python**: `python --version` satisfies `>=3.12,<3.14`.
2. **uv**: `uv --version` resolves (any).
3. **git**: `git --version` `>=2.40`.
4. **workspace_root**: `MYDEEPAGENT_WORKSPACE_ROOT` exists, is a directory, and is writable.
5. **config+governance**: `Config` loads from env + `.env` + `config.toml` without ValidationError; first-run governance consent file exists (or is created interactively on first run only).
6. **openrouter_api_key**: resolution chain (config → env → OS keyring) yields a non-empty value. Warn-only when the OpenRouter backend is not enabled.
7. **openrouter_ping + pricing upsert**: `GET https://openrouter.ai/api/v1/models` with the bearer key.
   - `200` → pass; pricing rows are upserted into `model_pricing` for use by the `mydeepagent run` cost preview.
   - `401` → fail.
   - any other non-200 / network error → warn.
8. **disk+sqlite integrity**:
   - Free disk on the `workspace_root` partition: warn under 10 GB, fail under 2 GB, green target ≥ 5 GB.
   - SQLite DB file (if present) opens and `PRAGMA integrity_check` returns `ok`.

Output:

- Rich human table by default.
- `--json` for machine-readable output.
- `--quiet` prints only nonzero results.

Notes:

- `tmux` / `Docker` / `Postgres` / `pg_isready` / drizzle migration checks from v3 §3 are dropped in v4 r1 — the v0.1.0 runtime is SQLite-only and tmux is out of scope for the deepagents-driven session model.
- `--list-orphans` and friends are owned by `mydeepagent runs list/show` (§19).

## 4. Database Schema

First migration prelude:

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;
```

All tables use `gen_random_uuid()` primary keys unless noted. All times are `timestamptz`. Mutable rows include `updated_at`. JSON columns use `jsonb`.
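The conventions above are phrased in Postgres terms, while the v0.1.0 runtime is SQLite (§1.3). The following is a minimal sketch of one way they could translate, using only the standard-library `sqlite3` module rather than the SQLAlchemy/aiosqlite stack the plan names. The column mapping (UUID as `TEXT` generated in Python, `jsonb` as JSON text, `timestamptz` as ISO-8601 UTC text) and the helper names `connect` and `insert_template` are assumptions made for illustration; this plan does not specify them.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Hypothetical SQLite rendering of the `workflow_templates` table (§4.1).
TEMPLATES_DDL = """
CREATE TABLE IF NOT EXISTS workflow_templates (
    id         TEXT PRIMARY KEY,   -- uuid4 string generated in Python (assumption)
    name       TEXT NOT NULL,
    version    INTEGER NOT NULL,
    hash       TEXT NOT NULL UNIQUE,
    definition TEXT NOT NULL,      -- JSON text standing in for jsonb (assumption)
    created_at TEXT NOT NULL,      -- ISO-8601 UTC standing in for timestamptz
    UNIQUE (name, version)
)
"""


def connect(path: str) -> sqlite3.Connection:
    """Open a connection and apply the connect-time pragmas listed in §1.3."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=5000")
    conn.execute("PRAGMA foreign_keys=ON")
    conn.execute(TEMPLATES_DDL)
    return conn


def insert_template(
    conn: sqlite3.Connection,
    name: str,
    version: int,
    content_hash: str,
    definition: dict,
) -> str:
    """Insert one template row and return its generated id."""
    row_id = str(uuid.uuid4())
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO workflow_templates VALUES (?, ?, ?, ?, ?, ?)",
            (row_id, name, version, content_hash, json.dumps(definition), created_at),
        )
    return row_id
```

Inserting a second row with the same `(name, version)` raises `sqlite3.IntegrityError`, which is the behavior the unique constraint in §4.1 asks for.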
### 4.1 `workflow_templates`

- `id uuid primary key default gen_random_uuid()`
- `name text not null`
- `version int not null`
- `hash text not null unique`
- `definition jsonb not null`
- `created_at timestamptz not null default now()`
- unique `(name, version)`

### 4.2 `agent_personas`

- `id uuid primary key default gen_random_uuid()`
- `name text not null`
- `version int not null`
- `hash text not null unique`
- `definition jsonb not null`
- `created_at timestamptz not null default now()`
- unique `(name, version)`

### 4.3 `runs`

- `id uuid primary key default gen_random_uuid()`
- `template_id uuid not null references workflow_templates(id)`
- `template_hash text not null`
- `state text not null`
- `repo_path text not null`
  - canonical absolute path
  - resolved through `fs.realpathSync` before insert
- `base_branch text not null`
- `worktree_root text not null`
  - canonical absolute path under `WORKSPACE_ROOT//`
- `current_phase_id uuid references run_phases(id)` nullable and deferrable
- `started_at timestamptz`
- `ended_at timestamptz`
- `final_report_path text`
- `paused_from_state text`
  - set when transitioning to `paused`
  - cleared on resume
  - null when state is not `paused`
- `created_at timestamptz not null default now()`
- `updated_at timestamptz`

Active-run uniqueness:

```sql
CREATE UNIQUE INDEX ux_active_run_repo_base
  ON runs (repo_path, base_branch)
  WHERE state NOT IN ('completed', 'failed', 'aborted');
```

### 4.4 `run_inputs`

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null unique references runs(id) on delete cascade`
- `requirements_md text not null`
- `objective jsonb`
- `extra jsonb`
- `input_hash text not null`

`input_hash` is based on:

- `requirements_md`
- `objective`
- `extra`
- canonical `repo_path`
- `base_branch`

### 4.5 `run_bindings`

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null references runs(id) on delete cascade`
- `role_id text not null`
- `persona_id uuid not null references agent_personas(id)`
- `persona_hash text not null`
- `backend text not null`
- `binding_hash text not null`
- unique `(run_id, role_id)`

### 4.6 `run_phases`

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null references runs(id) on delete cascade`
- `phase_key text not null`
- `seq int not null`
- `state text not null`
- `attempts int not null default 0`
- `started_at timestamptz`
- `ended_at timestamptz`
- unique `(run_id, phase_key)`

### 4.7 `run_events`

Append-only.

- `id bigserial primary key`
- `run_id uuid not null references runs(id) on delete cascade`
- `phase_id uuid references run_phases(id)`
- `seq bigint not null`
- `type text not null`
- `payload jsonb not null`
- `idempotency_key text not null`
- `ts timestamptz not null default now()`
- unique `(run_id, seq)`
- unique `(run_id, idempotency_key)`
- index `(run_id, ts)`

Concurrency:

- All inserts go through `RunEventRepository.append()`.
- Raw SQL inserts into `run_events` are forbidden.
- `append()` takes `pg_advisory_xact_lock(hash64('devflow:run-events', run_id))`.
- Inside that same transaction it assigns:

  ```sql
  seq := COALESCE(MAX(seq), 0) + 1
  ```

### 4.8 `approval_requests`

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null references runs(id)`
- `phase_id uuid references run_phases(id)`
- `gate_key text not null`
- `state text not null`
- `idempotency_key text not null`
- `payload jsonb not null`
- `created_at timestamptz not null default now()`
- `resolved_at timestamptz`
- unique `(idempotency_key)`

### 4.9 `approval_decisions`

Append-only and immutable.

- `id uuid primary key default gen_random_uuid()`
- `approval_request_id uuid not null references approval_requests(id)`
- `action text not null`
  - `approve`
  - `reject`
  - `request_changes`
  - `abort`
- `comment text`
- `decided_at timestamptz not null default now()`
- `idempotency_key text not null unique`

`pause` is not an approval decision.
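The §4.7 append contract (serialize writers per run, then assign `seq = COALESCE(MAX(seq), 0) + 1`) can be sketched against SQLite as follows. This is an illustration under stated assumptions, not the `RunEventRepository` implementation: it uses the standard-library `sqlite3` module, it substitutes `BEGIN IMMEDIATE` for the Postgres advisory lock (SQLite has no advisory locks), and it treats a repeated `idempotency_key` as a no-op that returns the existing `seq`, which the plan implies through the unique constraint but does not spell out. The function names are hypothetical.

```python
import json
import sqlite3


def open_events_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create a reduced run_events table (only the columns this sketch needs)."""
    # isolation_level=None puts sqlite3 in autocommit mode so BEGIN/COMMIT are explicit.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS run_events (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            run_id          TEXT NOT NULL,
            seq             INTEGER NOT NULL,
            type            TEXT NOT NULL,
            payload         TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            UNIQUE (run_id, seq),
            UNIQUE (run_id, idempotency_key)
        )
        """
    )
    return conn


def append(
    conn: sqlite3.Connection,
    run_id: str,
    event_type: str,
    payload: dict,
    idempotency_key: str,
) -> int:
    """Append one event and return its per-run sequence number."""
    # BEGIN IMMEDIATE takes the write lock up front, so the MAX(seq) read and
    # the INSERT below cannot interleave with another writer.
    conn.execute("BEGIN IMMEDIATE")
    try:
        existing = conn.execute(
            "SELECT seq FROM run_events WHERE run_id = ? AND idempotency_key = ?",
            (run_id, idempotency_key),
        ).fetchone()
        if existing is not None:
            conn.execute("COMMIT")
            return existing[0]  # assumed behavior: duplicate key is idempotent
        (next_seq,) = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM run_events WHERE run_id = ?",
            (run_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO run_events (run_id, seq, type, payload, idempotency_key) "
            "VALUES (?, ?, ?, ?, ?)",
            (run_id, next_seq, event_type, json.dumps(payload), idempotency_key),
        )
        conn.execute("COMMIT")
        return next_seq
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Sequence numbers are scoped per run, so two different runs each start at `1`, and re-appending with an already-used key leaves the table unchanged.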
### 4.10 `tui_sessions`

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null references runs(id) on delete cascade`
- `role_id text not null`
- `backend text not null`
- `cwd text not null`
- `expected_artifact_path text`
- `expected_schema text`
- `last_prompt_hash text`
- `last_prompt_at timestamptz`
- `last_capture_seq bigint not null default 0`
- `last_known_pane_pid int`
- `tmux_session text`
- `tmux_window text`
- `state text not null`
- `recovery_attempts int not null default 0`
- unique `(run_id, role_id)`

### 4.11 `tui_transcript_chunks`

Append-only.

- `id bigserial primary key`
- `session_id uuid not null references tui_sessions(id) on delete cascade`
- `seq bigint not null`
- `content text not null`
- `captured_at timestamptz not null default now()`
- unique `(session_id, seq)`

### 4.12 `artifacts`

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null references runs(id) on delete cascade`
- `phase_id uuid references run_phases(id)`
- `path text not null`
- `schema_id text not null`
- `hash text not null`
- `valid boolean not null`
- `validation_error jsonb`
- `created_at timestamptz not null default now()`
- unique `(run_id, path, hash)`

### 4.13 `commands`

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null references runs(id) on delete cascade`
- `phase_id uuid references run_phases(id)`
- `kind text not null`
  - `git`
  - `test`
  - `e2e`
  - `doctor`
  - `backtest`
  - `other`
- `argv text[] not null`
- `cwd text not null`
- `exit_code int`
- `stdout_path text`
- `stderr_path text`
- `started_at timestamptz`
- `ended_at timestamptz`

### 4.14 `review_findings`

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null references runs(id) on delete cascade`
- `phase_id uuid references run_phases(id)`
- `reviewer_role text not null`
- `severity text not null`
  - `info`
  - `low`
  - `medium`
  - `high`
  - `critical`
- `category text not null`
  - `correctness`
  - `evidence`
  - `style`
  - `security`
  - `performance`
  - `other`
- `file_path text`
- `line int`
- `summary text not null`
- `evidence text`
- `verifier_status text not null default 'unverified'`
  - `unverified`
  - `confirmed`
  - `rejected`
- `created_at timestamptz not null default now()`

### 4.15 Backtest Stub Tables

`backtest_iterations` and `backtest_metrics` are created at M1 as stub tables:

- `id uuid primary key default gen_random_uuid()`
- `run_id uuid not null references runs(id) on delete cascade`
- `payload jsonb`
- `created_at timestamptz not null default now()`

Full schema is deferred to M12.

## 5. Enums

All enums live in `packages/core/src/enums.ts` as TypeScript `const` objects and Zod enums.

### 5.1 `Backend`

- `codex`
- `claude`
- `fake`
- `openrouter`

openrouter is HTTP-based and has no tmux/PTY; see §8.5.

Future `gemini` support adds an enum entry and a `BackendProfile`; no design change.

### 5.2 `Capability`

- `spec_write`
- `phase_planning`
- `task_dag_planning`
- `code_edit`
- `test_first_development`
- `code_review`
- `evidence_check`
- `command_execute`
- `backtest_run`
- `metric_extract`
- `failure_mining`
- `objective_eval`
- `final_report_compose`

### 5.3 `RiskLevel`

- `low`
- `medium`
- `high`

Risk is declared per phase in the template. Persona has `maxRiskLevel`. Binding fails when `phase.risk > persona.maxRiskLevel`.

### 5.4 `ApprovalDecisionAction`

- `approve`
- `reject`
- `request_changes`
- `abort`

`pause` is a run-level control operation, not an approval decision.

### 5.5 `ApprovalState`

- `pending`
- `approved`
- `rejected`
- `changes_requested`
- `aborted`
- `paused`

`paused` is not an auto-decision.
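Two rules from §5.3 and §5.4 lend themselves to a short Python sketch: risk levels are ordered (`low` < `medium` < `high`) even though they are stored as strings, and `pause` is deliberately absent from `ApprovalDecisionAction`. The class layout and the helper names `risk_allowed` and `parse_decision` below are illustrative assumptions, not the contents of `my_deepagent/enums.py`.

```python
from enum import Enum


class RiskLevel(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class ApprovalDecisionAction(str, Enum):
    APPROVE = "approve"
    REJECT = "reject"
    REQUEST_CHANGES = "request_changes"
    ABORT = "abort"


# Explicit ranking: comparing the string values directly would sort
# alphabetically ("high" < "low" < "medium"), which is not the intended order.
_RISK_RANK = {RiskLevel.LOW: 0, RiskLevel.MEDIUM: 1, RiskLevel.HIGH: 2}


def risk_allowed(phase_risk: RiskLevel, persona_max: RiskLevel) -> bool:
    """Return False when phase.risk > persona.maxRiskLevel (binding must fail)."""
    return _RISK_RANK[phase_risk] <= _RISK_RANK[persona_max]


def parse_decision(raw: str) -> ApprovalDecisionAction:
    """Parse a decision string; raises ValueError for non-members such as 'pause'."""
    return ApprovalDecisionAction(raw)
```

Keeping the rank table separate from the string values leaves the stored representation (`low`, `medium`, `high`) unchanged while making the comparison in the binding rule explicit.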
### 5.6 `RunState`

- `created`
- `bound`
- `planning`
- `awaiting_approval`
- `executing`
- `paused`
- `completed`
- `failed`
- `aborted`

### 5.7 `RunPhaseState`

- `pending`
- `running`
- `awaiting_artifact`
- `validating`
- `awaiting_approval`
- `completed`
- `failed`
- `skipped`

### 5.8 `SessionState`

- `CREATED`
- `BOOTSTRAPPING`
- `READY`
- `BUSY`
- `WAITING_FOR_APPROVAL`
- `ARTIFACT_TIMEOUT`
- `HUNG`
- `CRASHED`
- `RESUMING`
- `REBOOTSTRAPPED`
- `FAILED_NEEDS_HUMAN`

## 6. Content-Addressed Hashing

### 6.1 Canonical JSON

- Object keys sorted lexicographically by UTF-16 code units.
- No insignificant whitespace.
- Strings use standard JSON escaping.
- No Unicode normalization.
- Numbers use shortest round-trippable representation.
- Integers have no decimal point.
- No leading zeros.
- Arrays preserve order.
- No trailing newline.

`packages/core/src/hash.ts` exports:

```ts
canonicalize(value: unknown): string
hash(value: unknown): string
```

`hash()` returns `sha256hex(canonicalize(value))`.

### 6.2 Hash Subjects

- Template hash:
  - `{ name, version, roles, phases, gates, capabilitiesRequired }`
- Persona hash:
  - `{ name, version, capabilities, backend, maxRiskLevel, allowedRoles, promptConfig, modelConfig }`
- Binding hash:
  - `{ runId, roleId, templateHash, personaHash, backend, override }`
- Run input hash:
  - `{ templateHash, bindings: sorted[bindingHash], requirementsMd, objective, repoPath, baseBranch, extra }`
- Prompt hash:
  - `{ runId, roleId, phaseKey, expectedArtifact, expectedSchema, instructions, attempt }`
- Artifact hash:
  - SHA-256 of file bytes.

Prompt hash uses `phaseKey`, not `phaseId`, because `PromptEnvelope` carries `phaseKey`.

## 7. Template, Persona, Binding

### 7.1 Template Schema

```ts
const TemplatePhase = z.object({
  key: z.string(),
  title: z.string(),
  risk: RiskLevel,
  roles: z.array(z.string()),
  expectedArtifact: z
    .object({
      path: z.string(),
      schema: z.string(),
    })
    .optional(),
  gates: z.array(z.string()).default([]),
  timeoutMs: z.number().int().positive().optional(),
});

const TemplateRole = z.object({
  id: z.string(),
  requiredCapabilities: z.array(Capability),
  preferredBackends: z.array(Backend).default([]),
  count: z.number().int().min(1).default(1),
  diversity: z
    .object({
      requireDifferentBackends: z.boolean().default(false),
    })
    .optional(),
});

const Template = z.object({
  name: z.string(),
  version: z.number().int().positive(),
  roles: z.array(TemplateRole),
  phases: z.array(TemplatePhase),
  defaultGates: z.array(z.string()).default([]),
});
```

### 7.2 Persona Schema

```ts
const Persona = z.object({
  name: z.string(),
  version: z.number().int().positive(),
  backend: Backend,
  capabilities: z.array(Capability),
  maxRiskLevel: RiskLevel,
  allowedRoles: z.array(z.string()).optional(),
  promptConfig: z
    .object({
      systemPrompt: z.string().optional(),
      instructionsPrelude: z.string().optional(),
    })
    .default({}),
  modelConfig: z.record(z.string(), z.unknown()).default({}),
});
```

modelConfig conventions:

- Personas bound to `openrouter` MUST set `modelConfig.model` to a routable OpenRouter model id, e.g. `anthropic/claude-sonnet-4-5`, `deepseek/deepseek-chat`, `meta-llama/llama-3.1-70b-instruct`.
- Other supported keys: `maxTokens`, `temperature`, `topP`. All optional.
- For tmux-based backends (`codex`, `claude`, `fake`), `modelConfig.model` is informational only and MAY be omitted.
- Binding fails fast with `human_required:model_unavailable` when an `openrouter` persona has no `modelConfig.model`.

### 7.3 Override Semantics

- Override may swap persona for a role.
- Override may constrain backend to a specific allowed backend.
- Override cannot add capabilities.
- Override cannot raise risk above persona `maxRiskLevel`.
- Diversity rules apply after override.
- Lock-time validation runs the full binding algorithm.
- On first binding failure, the run does not start.

### 7.4 Binding Algorithm

For each role:

1. Select override persona if present; otherwise run `autoSelect`.
2. Assert backend is enabled in `config.backends`.
3. Assert non-fake backend binary resolved at process start.
4. Assert role id is in `allowedRoles`, unless `allowedRoles` is absent.
5. Assert required capabilities are a subset of persona capabilities.
6. Assert every phase using the role has risk <= persona `maxRiskLevel`.
7. Expand roles with `count > 1` into `roleId#0`, `roleId#1`, etc.
8. Enforce diversity rules after expansion.
9. Compute and persist `binding_hash` per role instance.

`autoSelect` is deterministic. Sort candidates by:

1. role `preferredBackends` order.
2. `persona.version desc`.
3. `persona.name asc`.
4. `persona.hash asc`.

Personas whose backend is not in `preferredBackends` are eligible only if all preferred-backend personas fail capability or risk checks.

Binding fails with `human_required:no_eligible_persona` if no persona satisfies requirements.

### 7.5 Seeding

Personas:

- `docs/schemas/personas/@.yaml`
- filename encodes immutable identity.
- loader parses with Persona schema.
- loader computes `personaHash`.
- loader upserts keyed by `(name, version)`.
- hash mismatch on an existing row is fatal.

Templates:

- `docs/schemas/templates/@.yaml`
- same immutable version rule.

Deleting a published file is allowed only when no run references that hash.

## 8. Session Runtime

### 8.1 SessionAdapter Interface

```ts
export interface SessionAdapter {
  start(input: StartInput): Promise;
  sendPrompt(handle: SessionHandle, envelope: PromptEnvelope): Promise<{ promptId: string }>;
  probe(handle: SessionHandle): Promise;
  resume(handle: SessionHandle): Promise;
  rebootstrap(handle: SessionHandle): Promise;
  capture(handle: SessionHandle, fromSeq: bigint): AsyncIterable;
  dispose(handle: SessionHandle): Promise;
}

export interface StartInput {
  runId: string;
  roleId: string;
  backend: Backend;
  cwd: string;
  expectedArtifactPath?: string;
  expectedSchema?: string;
  envelopePrelude?: string;
}

export interface SessionHandle {
  sessionId: string;
  pid?: number;
  tmuxSession?: string;
  tmuxWindow?: string;
}

export interface ProbeResult {
  alive: boolean;
  paneActive: boolean;
  lastOutputAt?: Date;
  hint?: string;
}

export interface TranscriptChunk {
  seq: bigint;
  content: string;
  capturedAt: Date;
}
```

For HTTP backends (`openrouter`) the `SessionHandle.pid`, `tmuxSession`, and `tmuxWindow` fields are always `undefined`. See §8.5 for the HTTP adapter mapping.

### 8.2 Session State Machine

- `CREATED -> BOOTSTRAPPING -> READY`
- `READY <-> BUSY`
- `BUSY -> WAITING_FOR_APPROVAL`
- `BUSY -> ARTIFACT_TIMEOUT`
- `BUSY -> HUNG`
- `BUSY -> CRASHED`
- `HUNG | CRASHED | ARTIFACT_TIMEOUT -> RESUMING -> READY`
- `RESUMING -> REBOOTSTRAPPED -> READY`
- exhausted errors -> `FAILED_NEEDS_HUMAN`

### 8.3 Recovery Counters

- `sendPrompt` retry: 2.
  - Means one initial send plus two adapter-level retries, three physical send attempts max.
- `resume` retry: 2.
- `rebootstrap` retry: 1.
- artifact repair retry: 1.
- max hung time: configurable; default 20 minutes.

Exhaustion creates a human gate with `recoveryHint`.

### 8.4 SessionManager Singleton

- M4: hosted in `apps/api`.
- M5+: hosted in `apps/worker`.
- Only SessionManager may call mutating `SessionAdapter` methods.
- Holds in-memory `Map`.
- Takes `pg_advisory_lock(hash64('devflow:session-manager'))`.
- Second instance exits with code `3`.
- On start:
  - query non-terminal `tui_sessions`.
  - call `adapter.resume(handle)`.
  - success: place handle in map.
  - failure: session -> `FAILED_NEEDS_HUMAN`, append `session.failed`, create recovery gate.
- On SIGTERM/SIGINT:
  - refuse new prompts.
  - allow in-flight artifact polling up to 30s.
  - persist `last_capture_seq`.
  - release advisory lock.

### 8.5 OpenRouter Adapter — v4 r1 deepagents rewrite

**Supersedes the v3 marker-extraction HTTP adapter (CC-39).** In v4 the OpenRouter integration is a multi-turn, tool-using agent driven by LangChain `deepagents` 0.6.1 — no single-shot completions, no `<<>>` markers, no transcript replay reconstruction.

Construction — `my_deepagent.session.build_agent(persona, run_id, …)`:

```python
llm = ChatOpenAI(
    model=persona.model,  # e.g. "openrouter:deepseek/deepseek-chat"
    base_url=config.openrouter_api_base,  # https://openrouter.ai/api/v1
    api_key=resolve_openrouter_api_key(),
    timeout=persona.model_params.timeout,
)
agent = deepagents.create_deep_agent(
    model=llm,
    tools=[],  # base tools come from LocalShellBackend
    instructions=persona.system_prompt,
    subagents=[_subagent_to_dict(s) for s in persona.subagents],
    middleware=[
        SafetyShellMiddleware(...),  # destructive command + deny-path guard
        AuditToolMiddleware(...),  # append-only JSONL audit log
        ArtifactWatcherMiddleware(...),  # write_file/edit_file detection
        CostMiddleware(...),  # usage_metadata + budget ledger
    ],
    backend=LocalShellBackend(  # bash + read_file + write_file + edit_file + ls
        cwd=worktree_root,
        # `permissions` kwarg is intentionally omitted for local_shell backend
        # (deepagents 0.6.1 NotImplementedError workaround — enforcement moves
        # to SafetyShellMiddleware).
    ),
)
```

Method mapping (driven by `WorkflowEngine` rather than a v3-style adapter interface):

- **Start**: `create_deep_agent` returns a `CompiledStateGraph` per phase. No persistent session object is shared across phases — each phase is a fresh agent invocation parameterized by persona + envelope.
- **Send prompt**: `await agent.ainvoke({"messages": [HumanMessage(envelope)]})` where `envelope` is built by `WorkflowEngine._build_envelope` (§9 with the artifact JSON Schema inlined so the model sees the exact required fields).
- **Tool use**: native `read_file` / `write_file` / `edit_file` / `ls` / `bash` calls are emitted by the model and dispatched through LocalShellBackend, recorded by AuditToolMiddleware, gated by SafetyShellMiddleware.
- **Probe / resume / rebootstrap / dispose**: not applicable — the agent is ephemeral per phase. Crash recovery operates at the run/phase level via `sweep_orphan_runs` (§19), not at a session-adapter level.

Artifact production:

- The model writes the artifact directly to `expected_artifact_path` via the `write_file` tool. ArtifactWatcherMiddleware observes the tool call and notifies the engine.
- The envelope inlines the artifact's JSON Schema definition so the LLM has the exact required fields.
- Schema validation is performed by `ArtifactSchemaRegistry.validate` on the written file (§10). On failure, the engine retries once with a repair prompt; second failure raises `human_required:artifact_invalid_after_repair`.

Error mapping (preserved from CC-39, applied per-call by the LangChain exception path):

- HTTP `401` → `human_required:backend_auth_failed`.
- HTTP `429` → `recoverable:rate_limited` (exponential backoff: 1 s, 2 s, 4 s, max 30 s, owned by langchain-openai retries).
- HTTP `5xx` → `recoverable:network_blip`.
- HTTP `400` with `model_not_found` → `human_required:model_unavailable`.
- BudgetTracker pre-call rejection → `human_required:token_budget_exceeded`.
- SafetyShellMiddleware blocked tool call → `human_required:tool_quota_exceeded`.
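The HTTP rows of the error mapping above can be expressed as a small pure function, shown here as a sketch. The names `classify_http_error` and `backoff_delays` are hypothetical, and the second helper only illustrates the 1 s, 2 s, 4 s, capped-at-30 s schedule; in the plan the actual retries are owned by langchain-openai, not by code like this. The two non-HTTP rows (budget rejection and blocked tool calls) originate in middleware and are out of scope for the sketch.

```python
from __future__ import annotations


def classify_http_error(status: int, error_code: str | None = None) -> str | None:
    """Map an HTTP status (plus optional provider error code) to a Devflow error id.

    Returns None for responses the mapping above does not cover.
    """
    if status == 401:
        return "human_required:backend_auth_failed"
    if status == 429:
        return "recoverable:rate_limited"
    if 500 <= status <= 599:
        return "recoverable:network_blip"
    if status == 400 and error_code == "model_not_found":
        return "human_required:model_unavailable"
    return None


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(base * 2**i, cap) for i in range(attempts)]
```

Keeping the classification free of I/O makes it straightforward to unit-test each row of the mapping in isolation.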
Known v0.1.0 limitations:

- `usage_metadata` is sometimes empty on responses forwarded by OpenRouter (deepagents wraps the underlying ChatOpenAI response so token counts may not surface). The recorder still fires and `LlmCallRow` is persisted, but `input_tokens` / `output_tokens` may read 0. v0.2 will probe additional response shapes (raw chunks / callbacks).
- Anthropic models via OpenRouter currently fail with a `tool_calls.args` JSON-string vs dict ValidationError inside `langchain-openai` 1.2.1. Workaround: pin DeepSeek personas via `BindingOverride`. Tracking for v0.2.

## 9. Prompt Envelope

### 9.1 Wire Format

```text
DEVFLOW_PROMPT_BEGIN
Run:
Role:
Phase:
Attempt:
Expected artifact:
Expected schema:
Dedup-Key:
Instructions:
DEVFLOW_PROMPT_END
```

### 9.2 Schema

```ts
const PromptEnvelope = z.object({
  uuid: z.string().uuid(),
  runId: z.string().uuid(),
  roleId: z.string(),
  phaseKey: z.string(),
  attempt: z.number().int().nonnegative(),
  expectedArtifact: z.string(),
  expectedSchema: z.string(),
  dedupKey: z.string(),
  instructions: z.string(),
});
```

### 9.3 Rules

- Prompt identity is `dedupKey`.
- Adapter treats duplicate `dedupKey` for the same session within a run lifetime as idempotent success and does not reprocess the prompt.
- `attempt` increments only when the engine intentionally re-sends after timeout or repair.
- Adapter-level retry does not increment attempt.
- Completion is never inferred from transcript text.
- Completion requires a schema-valid artifact.

### 9.4 Backend Prelude

Sent once at session bootstrap before the first envelope.

Required structure:

1. Backend identity statement.
2. Persona `instructionsPrelude`.
3. Protocol declaration: completion is signaled only by writing expected artifact files.
4. Envelope marker contract.
5. Approval/probe contract: `DEVFLOW_PROBE` must respond with one line `READY` or `BUSY `.

Codex and Claude-specific addenda live in `packages/session/src/profiles/{codex,claude}.ts` and are populated at M10.

## 10. Artifact Schema Registry

### 10.1 Layout

JSON Schema 2020-12 documents live at:

```text
docs/schemas/artifacts/.json
```

`schema_id` format:

```text
/@
```

Examples:

- `dev/spec@1`
- `dev/phase-plan@1`
- `dev/dag@1`
- `dev/review-finding-batch@1`
- `bt/objective@1`
- `bt/iteration-result@1`
- `common/final-report@1`

### 10.2 Loader

`packages/core/src/artifact-schema.ts` exports:

```ts
function loadSchema(id: string): JsonSchema;

function validateArtifact(
  id: string,
  data: unknown
): { ok: true } | { ok: false; errors: ValidationError[] };
```

Unknown schema id is fatal.

### 10.3 Validation Flow

1. Engine waits for `expectedArtifactPath` to appear.
2. Debounce 500ms after last `mtime` change.
3. Read file.
4. Compute SHA-256.
5. Validate against `expectedSchema`.
6. Valid:
   - insert artifact row with `valid=true`.
   - append `artifact.validated`.
   - advance phase.
7. Invalid:
   - insert artifact row with `valid=false`.
   - append `artifact.invalid`.
   - trigger one repair prompt.
   - after repair exhaustion, create human gate.
8. Timeout:
   - append `artifact.timeout`.
   - probe session.
   - enter recovery flow.

### 10.4 Final Report

At terminal run state, write atomically:

- `//.report.md`
- `//.report.json`

Both are written even on `failed` or `aborted`, best-effort.

`common/final-report@1` minimum fields:

- `runId`
- `templateHash`
- `bindings[]`
- `inputs`
- `phases[]`
- `approvals[]`
- `findings[]`
- `commands[]`
- `artifacts[]`
- `events.tail`
- `unresolved[]`
- `endedAt`
- `status`

### 10.5 Backtest Objective Stub

`bt/objective@1`:

```json
{
  "targets": [
    { "metric": "sharpe", "op": "gte", "value": 1.5, "weight": 1.0 },
    { "metric": "mdd", "op": "lte", "value": 0.15, "weight": 1.0 }
  ],
  "stopWhen": "all"
}
```

- `op`: `gte` | `lte` | `eq` | `gt` | `lt`
- `stopWhen`: `all` | `weighted`
- `weighted` threshold is hardcoded at 0.8 at M12.
- Full DSL deferred to M12.

## 11. Run Events

Closed event types:

```text
run.created
run.started
run.paused
run.resumed
run.completed
run.failed
run.aborted
phase.started
phase.completed
phase.failed
phase.skipped
prompt.sent
prompt.repaired
artifact.expected
artifact.validated
artifact.invalid
artifact.timeout
approval.requested
approval.resolved
session.created
session.ready
session.busy
session.idle
session.crashed
session.recovered
session.failed
command.started
command.completed
command.failed
review.batch_recorded
finding.verifier_resolved
backtest.iteration_started
backtest.iteration_completed
backtest.objective_evaluated
```

### 11.1 Idempotency Keys

Every event append requires deterministic `idempotency_key`.

| Event family | Key formula |
|---|---|
| `run.created`, `run.started`, `run.completed`, `run.failed`, `run.aborted` | `:` |
| `run.paused` | `run.paused::` |
| `run.resumed` | `run.resumed::` |
| `phase.started`, `phase.completed`, `phase.failed`, `phase.skipped` | `::` |
| `prompt.sent`, `prompt.repaired` | `:` |
| `artifact.expected`, `artifact.timeout` | `:::` |
| `artifact.validated`, `artifact.invalid` | `:::` |
| `approval.requested` | `approval.requested:` |
| `approval.resolved` | `approval.resolved::` |
| `session.created`, `session.failed` | `:` |
| `session.busy`, `session.idle` | `::` |
| `session.ready`, `session.crashed`, `session.recovered` | `::` |
| `command.started`, `command.completed`, `command.failed` | `:` |
| `review.batch_recorded` | `review.batch_recorded:::` |
| `finding.verifier_resolved` | `finding.verifier_resolved:` |
| `backtest.iteration_started`, `backtest.iteration_completed`, `backtest.objective_evaluated` | `:` |

Definitions:

- `phase_attempt` is incremented before event append.
- `recovery_attempts` is incremented before event append.
- `prompt_dedup_key` is the envelope dedup key.
- `approval_idempotency_key` is from `approval_requests`.
- Artifact expected/timeout events are per-attempt.
- Artifact validated/invalid events are content-keyed by path + hash.

## 12. Fake Session Adapter

### 12.1 Behavior

- Deterministic.
- In-process.
- No PTY.
- No tmux.
- Drives engine end-to-end without real backends.

### 12.2 Sentinel Triggers

On `sendPrompt`, inspect `expectedSchema`.

Fixture path:

```text
tests/fixtures/fake-artifacts//.json
```

`scenarioName` comes from instruction header:

```text
Scenario:
```

Default scenario: `ok`.

Scenarios:

- `ok`: write fixture to `expectedArtifactPath` after 50ms by default.
- `invalid`: write deliberately schema-invalid payload.
- `timeout`: never write.
- `crash`: throw `RecoverableError`.

### 12.3 Transcript

Fake adapter emits chunks such as:

```text
[fake] received prompt ; will write in 50ms
```

## 13. State Machines

### 13.1 Run State

States:

- `created`
- `bound`
- `planning`
- `awaiting_approval`
- `executing`
- `paused`
- `completed`
- `failed`
- `aborted`

Transitions:

| From | Trigger | To | Side effects |
|---|---|---|---|
| `created` | `lockBindings ok` | `bound` | persist bindings; emit `run.started` |
| `created` | `lockBindings fail` | `failed` | emit `run.failed` |
| `bound` | phase plan needed | `planning` | emit `phase.started` |
| `planning` | plan artifact valid | `awaiting_approval` | request approval |
| `awaiting_approval` | approve | `executing` | emit `approval.resolved`, `run.resumed` |
| `awaiting_approval` | reject | `failed` | emit `run.failed` |
| `awaiting_approval` | request_changes | `planning` | increment phase attempts |
| `awaiting_approval` | timeout | `paused` | set `paused_from_state='awaiting_approval'` |
| `executing` | phase ok, more phases | `executing` | next phase |
| `executing` | normal workflow approval gate | `awaiting_approval` | request gate |
| `executing` | all phases done | `completed` | emit `run.completed`, write final report |
| `executing` | unrecoverable error | `failed` | emit `run.failed` |
| `executing` | manual `pauseRun` | `paused` | set `paused_from_state='executing'` |
| `planning` | manual `pauseRun` | `paused` | set `paused_from_state='planning'` |
| `paused` | resume | `paused_from_state` | emit `run.resumed`, clear `paused_from_state` |
| any non-terminal state | `abortRun` | `aborted` | emit `run.aborted`, dispose sessions |

Non-terminal states for `abortRun`:

- `created`
- `bound`
- `planning`
- `awaiting_approval`
- `executing`
- `paused`

### 13.2 Run Phase State

States:

- `pending`
- `running`
- `awaiting_artifact`
- `validating`
- `awaiting_approval`
- `completed`
- `failed`
- `skipped`

Transitions:

| From | Trigger | To |
|---|---|---|
| `pending` | start | `running` |
| `running` | prompt sent, artifact expected | `awaiting_artifact` |
| `awaiting_artifact` | artifact appears | `validating` |
| `awaiting_artifact` | timeout | `running` after probe/repair, or `failed` after exhaustion |
| `validating` | valid | `awaiting_approval` if gate, else `completed` |
| `validating` | invalid | `running` after one repair, else `failed` |
| `awaiting_approval` | approve | `completed` |
| `awaiting_approval` | reject / abort | `failed` |
| `awaiting_approval` | request_changes | `running`, attempt + 1 |

Replay rules:

- `phase.started.payload.repair === true` marks that attempt as the single allowed repair attempt. Replaying that attempt MUST use repair instructions, `prompt.repaired`, and must not start a third attempt.
- Repair replay from `running` may reuse an existing `READY` / bootstrapped session even if `last_prompt_hash` still contains the previous attempt's prompt hash; current-attempt prompt send has not happened yet.
- If phase state is `running`, existing artifact files are never accepted unless the current prompt event (`prompt.sent` or `prompt.repaired`) for the current dedup key is already recorded. Replay without prompt proof treats existing files as stale.
- If phase state is `running`, session state is `BUSY`, and `last_prompt_hash` matches the current prompt but the matching prompt event is missing, replay waits for the artifact with the current file signature as the baseline. This preserves idempotency without validating a stale pre-existing artifact.
- Baseline-protected waits must not synthesize durable prompt proof before the wait finishes. If replay crashes or is cancelled before validation, the next replay must still treat the existing artifact as baseline/stale unless real prompt proof already exists.
- If phase state is `validating` and no artifact row exists yet, replay re-reads and validates the current `expectedArtifactPath` instead of treating the state as corruption.
- If phase state is `validating` and artifact rows already exist for the same phase/path/schema, replay may reuse only an artifact row created at or after the current session `last_prompt_at`; older rows are treated as stale previous-attempt outputs and the file is revalidated.
- Session bootstrap DB row/state changes and `session.created` / `session.ready` events are written in one DB transaction after adapter start succeeds.

## 14. Approval State

States:

- `pending`
- `approved`
- `rejected`
- `changes_requested`
- `aborted`
- `paused`

### 14.1 Transitions

| From | Event | To | Side effects |
|---|---|---|---|
| `pending` | approve decision | `approved` | insert decision row |
| `pending` | reject decision | `rejected` | insert decision row; run -> `failed` |
| `pending` | request_changes decision | `changes_requested` | insert decision row; increment attempt |
| `pending` | abort decision | `aborted` | insert decision row; run -> `aborted` |
| `pending` | timeout | `paused` | run -> `paused`; no decision row |
| `paused` | unpause | `pending` | re-arm gate; no decision row |
| terminal states | any decision | unchanged | return 409 |

Rules:

- A `pending` request can transition to one non-pending state per pending epoch.
- Terminal approval states reject further decisions.
- `paused` may return to `pending` only through `unpause`.
- Manual pause is run-level `pauseRun`; it leaves approval gate in `pending`.
- Only `approve`, `reject`, `request_changes`, and `abort` create `approval_decisions` rows.
- Default timeout is null.
- Timeout never auto-approves or auto-rejects.

### 14.2 Decision Idempotency

- GUI:
  - UUIDv4 per click.
  - reused across automatic UI retries for the same logical action.
- CLI:
  - UUIDv4 per invocation.
  - `--client-token=` override for scripted retry.
- API:
  - existing `(approval_request_id, action, client_token)` returns existing row with status 200.
  - new decision inserts row and returns 201.
  - same token with different action returns 409.
  - decision on non-pending request returns 409.

### 14.3 Destructive Command Enforcement

Devflow-direct commands have hard enforcement. TUI-agent commands have best-effort enforcement.

Hard-blocked Devflow-direct patterns:

- `rm -rf`
- `git reset --hard`
- `git clean`
- `git push --force`
- `git push --force-with-lease`
- `git worktree remove --force`
- `git branch -D`
- `docker volume rm`
- `docker compose down -v`
- `DROP DATABASE`
- `DROP SCHEMA`
- migration rollback
- reads/writes touching `.env*`, `~/.ssh/`, `~/.aws/`, `~/.config/gcloud/`, `~/.kube/`
- files matching `*token*`, `*secret*`, `*credentials*`, `*.pem`, `*.key`

TUI-agent command enforcement is best-effort:

1. Prelude prohibits destructive operations.
2. Backend permission mode is set to safest available mode.
3. Transcript audit captures post-hoc evidence.
4. Human intervention goes through `devflow attach`.
5. Worktrees and branches are preserved by default.

v1 does not claim real-time blocking of TUI-internal commands.

## 15. Run Engine and Temporal Contract

The M4 `RunEngine` contract is frozen before M5. M5 reimplements the same interface through Temporal.
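Because `signalApproval` carries a `clientToken`, the engine contract inherits the API rules from §14.2. The sketch below is an in-memory illustration of those four status outcomes under stated assumptions: `ApprovalRequest` and `submit_decision` are hypothetical names, a dict stands in for the `approval_decisions` table, and the token lookup is scoped to a single approval request.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Decision actions that create approval_decisions rows (§14.1), mapped to the resulting state.
NEXT_STATE = {
    "approve": "approved",
    "reject": "rejected",
    "request_changes": "changes_requested",
    "abort": "aborted",
}


@dataclass
class ApprovalRequest:
    state: str = "pending"
    # client_token -> action; stands in for rows keyed by (request, action, token).
    decisions: dict[str, str] = field(default_factory=dict)


def submit_decision(req: ApprovalRequest, action: str, client_token: str) -> int:
    """Return the HTTP status the §14.2 API rules would produce."""
    if action not in NEXT_STATE:
        raise ValueError(f"unknown decision action: {action}")
    previous = req.decisions.get(client_token)
    if previous is not None:
        # Same token replayed: identical action is idempotent, different action conflicts.
        return 200 if previous == action else 409
    if req.state != "pending":
        return 409  # decision on a non-pending request
    req.decisions[client_token] = action
    req.state = NEXT_STATE[action]
    return 201


if __name__ == "__main__":
    req = ApprovalRequest()
    print(submit_decision(req, "approve", "token-1"))  # first decision
    print(submit_decision(req, "approve", "token-1"))  # scripted retry with the same token
```

The token lookup runs before the pending check on purpose: after the first decision the request is no longer `pending`, yet a retry with the same token and action must still return the existing row.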
### 15.1 Public API

```ts
interface RunEngine {
  startRun(input: RunStartInput): Promise<{ runId: string }>;
  signalApproval(
    runId: string,
    approvalRequestId: string,
    action: ApprovalDecisionAction,
    clientToken: string,
    comment?: string
  ): Promise;
  pauseRun(runId: string): Promise;
  resumeRun(runId: string): Promise;
  abortRun(runId: string, reason: string): Promise;
  getStatus(runId: string): Promise;
}
```

### 15.2 Temporal Shape

- Namespace: `devflow`.
- Task queue: `devflow-runs`.
- Single worker process: `apps/worker`.
- Workflow: `runWorkflow(input: RunStartInput)`.
- Signals:
  - `approve`
  - `pause`
  - `resume`
  - `abort`
  - `unpause`
- No Updates in M5.
- Status is read from DB.

Activities:

- M5 compatibility activity surface:
  - `prepareRunActivity(input)`
  - `lockBindingsActivity(runId)`
  - `failRunActivity(runId, reason)`
  - `advanceRunActivity(runId)`
  - `signalApprovalActivity(runId, approvalRequestId, action, clientToken, comment?)`
  - `pauseRunActivity(runId)`
  - `resumeRunActivity(runId)`
  - `abortRunActivity(runId, reason)`
  - `getStatusActivity(runId)`
  - `isRunTerminalActivity(runId)`
  - `composeFinalReportActivity(runId)`
- `advanceRunActivity` is the M5 parity wrapper over M4 phase advancement. It may internally perform prompt send, artifact wait/validation, event recording, and approval request creation through the same DB/idempotency contracts already locked in sections 8 through 14.
- The granular activity split (`sendPromptToSession`, `waitForArtifact`, `validateArtifact`, `recordEvent`, `requestApproval`, `runCommand`) is deferred to a later hardening ADR. It is not an M5 acceptance gate.
- Prompt/session mutation still occurs only inside worker-hosted activities through SessionManager. M5+ API code never mutates `SessionAdapter` directly.

Retry policy:

- Default: max attempts 3, exponential backoff start 1s, max 30s.
- `composeFinalReportActivity`: max attempts 1.
- Activity-level failures serialize `DevflowError`; non-recoverable Devflow errors are rethrown as non-retryable Temporal failures.
- `advanceRunActivity` is cancellation-aware and idempotent by DB state, event idempotency keys, prompt dedup keys, and artifact content keys.
- Already-applied approval signal replay repairs missing final reports for every terminal run state: `completed`, `failed`, and `aborted`, regardless of whether the replayed approval action was `approve`, `request_changes`, `reject`, or `abort`.
- API-side already-applied approval replay is report-repair only. It must not call `SessionAdapter` mutation methods; reject/abort session disposal belongs to the worker/session-manager path that originally applies the decision.
- If a workflow closes before the API observes an approval signal result, closed-workflow settlement must first verify the requested decision was applied, then replay approval side effects, then wait for the terminal report.

### 15.3 Hard Constraints

- Workflow code holds only serializable state.
- No tmux handles in workflow state.
- No PTY refs in workflow state.
- No DB clients in workflow state.
- M5+ session interaction happens through activities calling SessionManager in `apps/worker`.
- M5+ API never calls mutating `SessionAdapter` methods.
- SessionManager advisory lock prevents API/worker ownership conflict during M4 -> M5 transition.
- Workflow code uses deterministic clock/randomness only.

## 16. WriteSet and Worktree

### 16.1 WriteSet

- Each task declares `writeSet: string[]`.
- Patterns are relative to repo root.
- Glob engine: `fast-glob`.
- Options:

  ```ts
  {
    cwd: worktreeRoot,
    dot: true,
    followSymbolicLinks: false,
    onlyFiles: true,
    suppressErrors: false
  }
  ```

Conflict detection:

1. Expand writeSets.
2. Forbidden globs cause conflict if matched by more than one task:
   - `pnpm-lock.yaml`
   - `package-lock.json`
   - `**/migrations/**`
   - `**/*.generated.*`
   - root `tsconfig*.json`
   - `biome.json`
   - `lefthook.yml`
   - `.github/**`
   - `.gitlab-ci.yml`
3. Pairwise file intersections must be empty.

Conflict creates `parallel_dag_approved` gate.

### 16.2 Worktree Lifecycle

- Worktree root:
  - `WORKSPACE_ROOT//`
  - non-parallel main lane: `WORKSPACE_ROOT//main`
- Created via `git worktree add`.
- Branch name:

  ```text
  devflow//
  ```

- Terminal run state does not remove worktrees or branches.
- Output branches are deliverables.
- Disk growth is accepted.
- Cleanup is manual:

  ```bash
  devflow cleanup [--lane=]
  ```

Cleanup:

- uses `git worktree remove` without `--force` by default.
- refuses dirty worktrees.
- `--force` requires an additional gate.
- `git branch -D` is destructive and gated.
- `doctor --list-orphans` lists only; it never removes.

## 17. SSE Contract

Endpoints:

- `GET /sse/runs/:runId`
- `GET /sse/global`

Heartbeat every 15 seconds.

Events:

| Event | Scope |
|---|---|
| `run.state_changed` | both |
| `run.event_appended` | run |
| `phase.state_changed` | run |
| `approval.created` | both |
| `approval.resolved` | both |
| `session.state_changed` | run |
| `transcript.chunk_appended` | run |
| `artifact.validated` | run |

Reconnect:

- Run-scoped `/sse/runs/:runId`:
  - `Last-Event-ID` is last `run_events.seq` for that run.
  - server replays `run.event_appended` for `seq > lastSeq`.
  - derived non-`run.event_appended` SSE types are not replayed for historical rows; state is re-derived by fetch.
- Global `/sse/global`:
  - `Last-Event-ID` is last global `run_events.id`, because `run_events.seq` is only monotonic within a run.
  - fresh connects start at the latest global event id and emit only new summary events.
  - reconnects replay rows with `id > lastId`.
  - global stream emits only scope=`both` events: `run.state_changed`, `approval.created`, `approval.resolved`.
  - global stream never emits `run.event_appended`.

## 18. Errors

v4: `my_deepagent.errors.MyDeepAgentError` (replaces v3 `DevflowError` 1:1):

```python
from enum import StrEnum
from uuid import UUID


class ErrorClass(StrEnum):
    RECOVERABLE = "recoverable"
    HUMAN_REQUIRED = "human_required"
    FATAL = "fatal"


class MyDeepAgentError(Exception):
    error_class: ErrorClass
    code: str
    run_id: UUID | None
    phase_id: UUID | None
    recovery_hint: str | None
    cause: BaseException | None
```

Recoverable:

- `network_blip`
- `pane_briefly_unresponsive`
- `prompt_send_transient`
- `db_serialization_retry`
- `rate_limited`

Human required:

- `artifact_invalid_after_repair`
- `artifact_timeout_exhausted`
- `prompt_send_exhausted`
- `destructive_command_blocked`
- `secret_access_blocked`
- `backend_unavailable`
- `no_eligible_persona`
- `writeset_conflict`
- `merge_conflict`
- `objective_not_met`
- `review_dispute_unresolved`
- `backend_auth_failed`
- `model_unavailable`
- `token_budget_exceeded` *(v4 r1: BudgetTracker rejects a call whose estimated cost would breach the per-run, per-day, or per-persona-daily cap with `on_hit=block`.)*
- `tool_quota_exceeded` *(v4 r1: SafetyShellMiddleware blocked a tool call due to deny-path / destructive-command policy, or a per-phase tool-call cap was hit.)*

Fatal:

- `db_unreachable`
- `workspace_permissions`
- `internal_state_corruption`
- `template_load_failed`
- `artifact_schema_unknown`
- `artifact_schema_load_failed`
- `migration_pending`
- `config_invalid`

Mapping:

- recoverable -> retry; exhausted -> human_required.
- human_required / recovery gate -> run paused and gate created. This is distinct from normal workflow approval gates in §13.1, which use `awaiting_approval`.
- fatal -> run failed, sessions disposed, final report best-effort.

## 19. Concurrent Runs and Crash Recovery

### 19.1 Active Run Uniqueness

- `MAX_CONCURRENT_RUNS`, default 4.
- DB partial unique index is the source of truth:
  - one active run per `(repo_path, base_branch)`.
  - `repo_path` is canonicalized before insert.
- Advisory lock is auxiliary only:

  ```text
  pg_try_advisory_xact_lock(hash64('devflow:start-run', repoPath, baseBranch))
  ```

- Unique-index violation returns:

  ```json
  { "currentRunId": "...", "currentState": "..." }
  ```

  with HTTP 409.

### 19.2 Crash Recovery

M4, no Temporal:

- On `apps/api` startup, sweep non-terminal runs.
- Mark them `failed`.
- `final_report_path = null`.
- Append synthesized `run.failed` with reason `process_restart_unrecovered`.
- Cascade associated `tui_sessions` to `FAILED_NEEDS_HUMAN`.
- Append `session.failed`.
- This frees active-run uniqueness slots.

M5+:

- No sweep.
- Temporal durability owns in-flight workflow recovery.
- SessionManager resumes tmux sessions.
- Active-run partial index blocks duplicate runs until completion or explicit abort.

## 20. Milestones

### M1: Monorepo + Postgres + CLI Doctor

- Scaffold workspace.
- Add pnpm, tsconfig, biome, lefthook, Vitest.
- Add Docker Compose for Postgres.
- Add Drizzle and first migration.
- Add `devflow doctor`.
- Implement checks 1-9.
- Stub checks 10-12 as warn where needed.
- Add SSE compatibility smoke test:
  - minimal Fastify 5 server.
  - `fastify-sse-v2` plugin.
  - 30-second integration test.
  - receive 3 events and reconnect.
  - if plugin fails, implement native `reply.raw` SSE helper before M1 is green.

### M2: Core Schema + Registry + Binding

- Implement enums.
- Implement canonical hashing.
- Implement Template schema.
- Implement Persona schema.
- Implement seed loader.
- Implement binding algorithm.
- Implement artifact schema registry.
- Add first schemas:
  - `dev/spec@1`
  - `dev/phase-plan@1`
  - `common/final-report@1`
- Tests:
  - schema validation.
  - override semantics.
  - risk enforcement.
  - diversity enforcement.
  - deterministic auto-select.

### M3: Fake Session Runtime

- Implement `SessionAdapter`.
- Implement `FakeSessionAdapter`.
- Implement prompt envelope.
- Implement event recorder.
- Implement fake sentinel scenarios.
- Persist transcript chunks.
- Tests:
  - prompt correlation.
  - artifact validation.
  - invalid artifact.
  - timeout.
  - fake crash.

### M4: Minimal Run Engine

- Implement `packages/run-engine`.
- Used directly by `apps/api`.
- No Temporal.
- Supports:
  - start run.
  - lock bindings.
  - approval.
  - fake prompt.
  - artifact wait/validate.
  - final report.
- Freeze the `RunEngine` contract.
- Full fake `development@1` minus reviewers.

### M5: Temporal Integration

- Reimplement `RunEngine` through Temporal.
- Preserve M4 behavior.
- Add parity tests using the same M4 scenarios.
- M5+ SessionManager lives in `apps/worker`.

### M6: Real tmux SessionManager

- Implement `TmuxSessionAdapter`.
- Decoupled from M5.
- May begin after M3 is stable.
- Pre-M5 real tmux is opt-in smoke only.
- Production run path remains fake until both M5 and M6 are green.

### M7: TUI Recovery State Machine

- Implement session state transitions.
- Implement recovery counters.
- Implement escalation to human gates.

### M8: API + GUI Minimum

- Implement Fastify routes.
- Implement SSE.
- Implement GUI screens:
  - Dashboard.
  - Templates.
  - Personas.
  - New Run.
  - Run Detail.
  - Approvals.
  - TUI Sessions.

### M9: `development@1` Fake-Agent Full Run

- Add curated `development@1`.
- Add review consensus.
- Add verifier flow with fake reviewers.
- Add coverage gate >=70% lines for core/session/run-engine.

### M10: Codex/Claude Opt-In Real Run

- Implement profiles:
  - `packages/session/src/profiles/codex.ts`
  - `packages/session/src/profiles/claude.ts`
- Real backends become production-default only after both M5 and M6 are green.
- Until then real tmux/Codex/Claude are developer-flagged opt-in smoke only.

### M11: Parallel Lanes

- Add task DAG scheduler.
- Add writeSet detection.
- Add per-lane worktrees.
- Add merge coordinator.
- Add conflict gates.

### M12: Backtest Workflow

- Add `backtest-strategy@1`.
- Add objective evaluator.
- Add metric parser extension points.
- Add failure mining artifacts.
- Add Backtest Lab GUI.
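To make the M12 objective evaluator concrete, here is a minimal sketch against the `bt/objective@1` stub from §10.5. The function name `evaluate_objective` is hypothetical. The `weighted` semantics are an assumption, since the plan only fixes the 0.8 threshold: this sketch treats the objective as met when the met targets carry at least 80% of the total weight, and it counts a missing metric as unmet.

```python
from __future__ import annotations

import operator
from typing import Any, Callable

# The five comparison operators allowed by bt/objective@1.
OPS: dict[str, Callable[[float, float], bool]] = {
    "gte": operator.ge,
    "lte": operator.le,
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
}

WEIGHTED_THRESHOLD = 0.8  # hardcoded at M12 per §10.5


def _target_met(target: dict[str, Any], metrics: dict[str, float]) -> bool:
    value = metrics.get(target["metric"])
    if value is None:
        return False  # assumption: a metric that was not reported counts as unmet
    return OPS[target["op"]](value, target["value"])


def evaluate_objective(objective: dict[str, Any], metrics: dict[str, float]) -> bool:
    """Return True when the stop condition of the objective is satisfied."""
    targets = objective["targets"]
    met = [_target_met(t, metrics) for t in targets]
    mode = objective["stopWhen"]
    if mode == "all":
        return all(met)
    if mode == "weighted":
        total = sum(t["weight"] for t in targets)
        achieved = sum(t["weight"] for t, ok in zip(targets, met) if ok)
        return total > 0 and achieved / total >= WEIGHTED_THRESHOLD
    raise ValueError(f"unknown stopWhen: {mode}")


if __name__ == "__main__":
    objective = {
        "targets": [
            {"metric": "sharpe", "op": "gte", "value": 1.5, "weight": 1.0},
            {"metric": "mdd", "op": "lte", "value": 0.15, "weight": 1.0},
        ],
        "stopWhen": "all",
    }
    print(evaluate_objective(objective, {"sharpe": 1.6, "mdd": 0.12}))
```

A failed evaluation would presumably surface as `human_required:objective_not_met` (§18) once iterations are exhausted, but that wiring is outside this sketch.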
### M13: Template Factory

- Generate draft template from natural language and repo discovery.
- Add harness design.
- Add template review.
- Add dry-run and promote flow.

## 21. Out of Scope

- Authentication.
- Authorization.
- Multi-user support.
- Data retention or archival policy.
- Observability dashboards.
- Remote template/persona registries.
- Multi-machine deployment.
- HA.
- Managed backups.
- Web ingress.
- TLS.
- Reverse proxy.

## 22. Decision Log

### Open Questions Closed

| # | Question | Resolution |
|---|---|---|
| OQ-1 | Persona/template seeding format | Immutable YAML at `docs/schemas/{personas,templates}/@.yaml` |
| OQ-2 | Approval timeout default | `null`; timeout freezes only |
| OQ-3 | Final report format | Markdown and JSON |
| OQ-4 | Temporal namespace/queue | namespace `devflow`, task queue `devflow-runs` |
| OQ-5 | WriteSet glob engine | `fast-glob` |
| OQ-6 | Backtest objective DSL | Stub in M12, full DSL deferred |
| OQ-7 | Codex/Claude prompt prelude | Structure locked, exact text deferred to M10 |

### Blocking Corrections Applied

| # | Issue | Resolution |
|---|---|---|
| CC-1 | Terminal state deleted worktrees/branches | Preserve by default; manual gated cleanup only |
| CC-2 | SessionManager location conflict | M4 API, M5+ worker |
| CC-3 | Event duplicates under retry | `run_events.idempotency_key` |
| CC-4 | Destructive command enforcement overclaimed | Devflow-direct hard, TUI best-effort |
| CC-5 | UUID extension missing | `CREATE EXTENSION IF NOT EXISTS pgcrypto` |
| CC-6 | Advisory lock not enough for active-run uniqueness | partial unique index |
| CC-7 | Undefined transition sequence in event keys | cause-based keys |
| CC-8 | Approval paused transition missing | explicit approval transition table |
| CC-9 | AutoSelect order nondeterministic | deterministic sort |
| CC-10 | SSE plugin compatibility assumed | M1 smoke + native fallback |
| CC-11 | ApprovalAction included pause | split `ApprovalDecisionAction`; `pauseRun` is run-level |
| CC-12 | Artifact hash key collision | include phase id and path |
| CC-13 | Resume previous state not stored | `runs.paused_from_state` |
| CC-14 | repo path aliasing | canonical realpath storage |
| CC-15 | M4 sweep left tmux sessions ambiguous | cascade session state to `FAILED_NEEDS_HUMAN`; real tmux production-default only after M5+M6 |
| CC-16 | Prompt hash used phaseId but envelope uses phaseKey | prompt hash uses phaseKey |
| CC-17 | abortRun transition too narrow | abort from any non-terminal run state |
| CC-18 | approval pending transition wording conflicted with pause epoch | pending can transition once per pending epoch; paused may unpause to pending |
| CC-19 | `tsc -b --noEmit` is brittle with TypeScript 5.6 project references on clean worktrees | build still uses `tsc -b`; no-emit verification uses root `tsconfig.typecheck.json` |
| CC-20 | `sendPrompt` retry count was ambiguous against Temporal activity attempts | §8.3 now states retry budget means initial attempt plus retries; §15.2 remains Temporal-level attempts only |
| CC-21 | Duplicate prompt dedup handling conflicted with adapter retry idempotency | duplicate `dedupKey` returns idempotent success without reprocessing |
| CC-22 | Normal workflow approval gates and human-required recovery gates were easy to conflate | §13.1 names normal workflow gates; §18 keeps human_required recovery gates paused |
| CC-23 | Phase start and event append could diverge under retry/error | phase start and `phase.started` append occur in one DB transaction |
| CC-24 | Repair attempt replay lost repair prompt identity and one-repair budget | repair attempts are derived from `phase.started.payload.repair`, replay uses repair instructions and `prompt.repaired`, and cannot start attempt 3 |
| CC-25 | `validating` replay failed if crash happened before artifact row insert | replay revalidates the expected artifact file when state is `validating` but no artifact row exists |
| CC-26 | Session bootstrap state/events could diverge | session row/state and `session.created` / `session.ready` events are committed in one DB transaction |
| CC-27 | `validating` replay could reuse stale previous-attempt artifact rows | artifact-row replay requires `artifact.created_at >= tui_sessions.last_prompt_at`; otherwise the file is revalidated |
| CC-28 | repair `running` replay rejected existing READY sessions with previous attempt prompt hash | current-attempt repair prompt is considered unsent, so replay may reuse the session and send `prompt.repaired` |
| CC-29 | API Temporal approval replay omitted M4 approval side-effect repair | API approval signal reader now wires `replayAppliedApprovalSideEffects`, so already-applied terminal approval replays can repair missing final reports |
| CC-30 | `running` replay could validate stale artifacts without prompt proof | `running` replay requires matching prompt event proof; BUSY replay without prompt event uses current artifact signature as baseline and ignores stale files |
| CC-31 | M5 activity list over-specified granular activities not implemented by the M4 parity adapter | M5 locks the compatibility activity wrapper surface; granular activity split is deferred to a later hardening ADR |
| CC-32 | Already-applied `approve` / `request_changes` replay repaired missing reports for `completed` / `failed` but missed `aborted` | approval replay side-effect repair now composes missing final reports for all terminal states |
| CC-33 | API-side already-applied `reject` / `abort` replay tried to dispose sessions through DB-only replay validation runtime | API replay side effects are report-repair only; worker-side decision application owns session disposal |
| CC-34 | Closed-workflow approval settlement waited for reports but did not replay approval side effects | settlement now verifies the requested decision, replays side effects, then waits for the terminal report |
| CC-35 | Baseline-protected BUSY replay recorded synthetic prompt proof before the baseline wait was durable | baseline replay no longer records synthetic prompt events; replay without real prompt proof keeps treating existing files as stale |
| CC-36 | SSE reconnect wording used per-run `seq` for global stream even though `seq` is not globally monotonic | `/sse/runs/:runId` uses per-run `seq`; `/sse/global` uses global `run_events.id` and emits only scope=`both` summary events |
| CC-37 | Run SSE replay could emit historical derived events after the first page | run SSE drains historical rows up to a high-water `seq` with only `run.event_appended`, then switches to live derived events |
| CC-38 | Normal phase start changed run state to `planning` / `executing` without a summary event source | `phase.started` payload includes `runState`; SSE derives `run.state_changed` from that live event |
| CC-39 | No OpenRouter HTTP backend; users cannot pick cost-tuned per-persona models | add `openrouter` to Backend enum; HTTP `OpenRouterAdapter` in §8.5; persona `modelConfig.model` requirement; doctor check 13; new error codes `rate_limited`, `backend_auth_failed`, `model_unavailable` (final v3 entry; v4 reinterprets the OpenRouter integration as the deepagents-driven session adapter; the standalone HTTP `OpenRouterAdapter` from CC-39 is **superseded by DR-1**) |

### Decision Records (v4)

| ID | Decision | Rationale | Impact |
|----|----------|-----------|--------|
| DR-1 | **v3 → v4 major bump: delete TS monorepo, rewrite in Python on LangChain `deepagents`.** | (1) Claude/Anthropic direct API cost is prohibitive for a single-user toolchain. (2) OpenRouter cost-tuned models (DeepSeek, etc.) require a multi-turn, tool-using agent harness; `deepagents` is Python-only with no 1:1 TS port. (3) Switching languages is shorter than reimplementing the harness. | Step 0 (commit `0e61b2d`) deleted `apps/`, `packages/`, `tests/`, `scripts/`, pnpm/tsconfig metadata. The Python rewrite lives at `my-deepagent/` and reached Step 15 (real OpenRouter E2E PASS, ~$0.05/run) before the v3 codebase was removed. CC-39's separate `OpenRouterAdapter` is replaced by `my_deepagent.session.build_agent` (deepagents 0.6.1 with LocalShellBackend + SafetyShellMiddleware). v3 CC counters frozen; v4 begins its own series. Recovery: `git checkout pre-python-rewrite -- `. |
| DR-2 | **SQLite for v0.1.0; Postgres migration scheduled as v0.2 PR #1, ahead of M8-Py FastAPI.** | The v4 r1 first draft suggested Postgres should re-enter "with Temporal." That was wrong: Temporal (M5-Py) does not write to the `my-deepagent` ORM tables; it has its own backing store. The real trigger for Postgres is *a second writer on `runs` / `run_phases` / `llm_calls`*, which first appears with FastAPI (M8-Py) and the eventual web GUI. Until then, SQLite WAL handles single-process concurrent reads fine and saves new users a Docker prerequisite at install time. | Migration sequencing: v0.2 PR #1 = "stop writers → `alembic downgrade base` against SQLite → regenerate baseline against Postgres → adjust JSON column types / partial-unique-index syntax / UPSERT for the Postgres dialect → add `pg_isready` doctor check." M5-Py (Temporal) can be implemented on either SQLite or Postgres my-deepagent DB; the order (v0.2-PR-1 → M5-Py → M8-Py) is chosen for stack consistency, not necessity. Supersedes the "Postgres parked indefinitely" wording from earlier v4 r1 drafts. |

### Future Open Questions

- FOQ-1, M12: full backtest objective DSL.
- FOQ-2, M13: template factory generation prompts.
- FOQ-3, post-M10: optional third backend such as Gemini.
- FOQ-4, post-M8: WebSocket vs SSE if transcript pressure requires it.

## 23. Kickoff Order

v3 historical order (TS, completed up to M8 before the v4 pivot):

1. M1.1: repo + pnpm + tsconfig + biome + lefthook + vitest workspace.
2. M1.2: docker-compose + Postgres healthcheck + drizzle-kit + first migration.
3. M1.3: `apps/cli` skeleton + `devflow doctor`.
4. M1.4: `packages/core` skeleton with config, enums, errors, hash, prompt-envelope, run-event types.
5. M2.1: Zod schemas for Template/Persona, persona YAML loader, hashing.
6. M2.2: Binding algorithm + tests.
7. M2.3: Artifact schema registry + first three schemas.
8. M3.1: `SessionAdapter` interface + `FakeSessionAdapter`.
9. M3.2: Transcript chunk capture + DB persistence.
10. M3.3: engine-shaped harness running a single fake phase end-to-end.
11. M4: assemble run engine; lock contract; full fake `development@1` minus reviewers.
12. M5 in parallel with M6 once M4 is green.

v4 r1 order (Python, status as of v0.1.0):

| Step | Scope | Status |
|------|-------|--------|
| Step 0 | Scaffold `my-deepagent/` (uv workspace, ruff, mypy, alembic, .pre-commit) | DONE (`17ba5d7`) |
| Step 1 | `devflow_core` → `my_deepagent.{config,enums,errors,hash,persona,prompt_envelope,run_event}` | DONE |
| Step 2 | `devflow_db` → `my_deepagent.persistence.{db,models,checkpointer}` + Alembic baseline | DONE |
| Step 3 | `mydeepagent doctor` (typer) | DONE |
| Step 4 | Persona / workflow seeding + binding (`my_deepagent.{persona,workflow,binding}`) | DONE |
| Step 5 | Artifact schema registry (`my_deepagent.artifact_schema`) | DONE |
| Step 6 | Distribution: init/login/logout/keys, governance consent, i18n (ko/en) | DONE |
| Step 7 | WorkflowEngine + ArtifactWatcherMiddleware (replaces v3 §15 in-process engine) | DONE |
| Step 8 | Budget guardrails (`my_deepagent.budget` + cost preview + CostMiddleware) | DONE |
| Step 9 | Crash recovery + concurrency (`my_deepagent.recovery` + `mydeepagent runs …`) | DONE |
| Step 10 | Interactive REPL (`mydeepagent` no-subcommand + slash commands) | DONE |
| Step 11 | Audit log + structlog secret scrubbing | DONE |
| Step 12 | Doctor 8-check + OpenRouter pricing fetch + `mydeepagent pricing` | DONE |
| Step 13 | Tmux adapter (M6-Py) | DEFERRED: not in v0.1.0 |
| Step 14 | TUI recovery (M7-Py) | DEFERRED:
not in v0.1.0 | | Step 15 | End-to-end real OpenRouter integration test | DONE (`733c9be`) | | Step 0-purge | Delete v3 TS monorepo per DR-1 | DONE (`0e61b2d`) | | v0.2 PR #1 | **Postgres migration** — Alembic baseline regen against Postgres 16; SQLite removed. Triggered by upcoming M8-Py multi-process writes, sequenced *before* M8-Py for a clean cut. Adds `pg_isready` doctor check; `mydeepagent doctor` no longer offers SQLite fast-path. | PLANNED (next) | | M5-Py | Temporal worker (`apps/worker`). Temporal server uses its own backing DB (separate Postgres `temporal` namespace) and does not touch `my-deepagent`'s ORM tables, so M5-Py works on either SQLite or Postgres my-deepagent DB. Targeted post-v0.2-PR-1 for stack consistency. | PLANNED | | M8-Py | FastAPI + SSE (`apps/api`). Requires Postgres (see §1.3 trigger table). | PLANNED |
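
The `pg_isready` doctor check planned for v0.2 PR #1 could look roughly like the sketch below. This is an illustration only, not the shipped check: `CheckResult`, `check_postgres_ready`, the check name `postgres.pg_isready`, and the default host and port are all hypothetical names chosen for the example. The only external fact it relies on is the documented exit-code contract of the PostgreSQL `pg_isready` utility (0 accepting, 1 rejecting, 2 no response, 3 no attempt). The command runner is injectable so the check can be tested without a live server.

```python
import shutil
import subprocess
from dataclasses import dataclass

# Exit codes documented for the PostgreSQL `pg_isready` utility.
PG_ISREADY_MEANING = {
    0: "accepting connections",
    1: "rejecting connections (for example, during startup)",
    2: "no response",
    3: "no attempt made (for example, invalid parameters)",
}


@dataclass(frozen=True)
class CheckResult:
    """Hypothetical result record for one doctor check."""

    name: str
    ok: bool
    detail: str


def run_command(argv):
    """Default runner: execute argv and return its exit code."""
    try:
        completed = subprocess.run(
            list(argv), capture_output=True, timeout=10, check=False
        )
    except subprocess.TimeoutExpired:
        # Assumption: a hung probe is reported the same way as "no response".
        return 2
    return completed.returncode


def check_postgres_ready(
    host="localhost", port=5432, runner=run_command, which=shutil.which
):
    """Probe a Postgres server with `pg_isready` and map the exit code."""
    name = "postgres.pg_isready"  # hypothetical check name
    if which("pg_isready") is None:
        return CheckResult(name, False, "pg_isready not found on PATH")
    # -h host, -p port, -t connection timeout in seconds
    code = runner(["pg_isready", "-h", host, "-p", str(port), "-t", "3"])
    detail = PG_ISREADY_MEANING.get(code, f"unexpected exit code {code}")
    return CheckResult(name, code == 0, detail)
```

Keeping `runner` and `which` as parameters is a design choice for the sketch: the real `mydeepagent doctor` may structure its checks differently, but an injectable runner lets `pytest` cover every exit code without Docker or a running database.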