dev-puppeteer/docs/plan.md
2026-05-10 00:31:18 +09:00

Devflow Implementation Plan v3 r4

0. Document Status

  • This document supersedes v2 and all earlier v3 drafts where conflicting.
  • Single-user, single-machine assumption. No auth, no retention policy, no observability dashboards, no multi-tenancy.
  • Target OS: macOS 13+ / Linux. No Windows.
  • All paths are Unix-style. All times are stored UTC.
  • Decisions in this document are locked unless explicitly marked (provisional). Overriding a locked decision requires updating this document, not only the code.
  • r1 applied CC-1 through CC-5.
  • r2 applied CC-6 through CC-10.
  • r3 applied CC-11 through CC-15.
  • r4 applies CC-16 through CC-18.

1. Stack Decisions

1.1 Workspace

  • pnpm 9 with workspaces. No Turbo.
  • Node 22 LTS, pinned by .nvmrc and package.json#engines.
  • TypeScript 5.6 with project references via tsc -b.
  • strict: true.
  • No any unless accompanied by an explicit annotation comment explaining why.

1.2 Tooling

  • Build:
    • tsup for libraries, CJS + ESM dual output.
    • vite for apps/web.
    • tsx for apps/cli, apps/api, and apps/worker in dev.
    • node for prod-ish local runs.
  • Test:
    • vitest with workspace config.
    • Coverage via @vitest/coverage-v8.
    • No coverage gate at M1.
    • M9 adds coverage gate: >=70% lines on packages/core, packages/session, packages/run-engine.
  • Lint/format:
    • biome.
    • One root config.
  • Pre-commit:
    • lefthook.
    • Runs biome check --write on staged files.
    • Runs tsc -b --noEmit on changed packages.
    • Runs related Vitest tests on changed packages.

1.3 Database

  • Postgres 16 via Docker Compose.
  • Drizzle ORM + drizzle-kit generate.
  • Generated SQL migrations are committed.
  • Migrations are never auto-applied at runtime except through the explicit migration runner invoked by devflow up.
  • Migration runner:
    • scripts/migrate.ts.
    • Takes DATABASE_URL.
    • devflow up waits for Postgres health and then runs pending migrations.

1.4 Logging

  • pino.
  • pino-pretty in dev, JSON otherwise.
  • Standard fields:
    • time
    • level
    • module
    • runId?
    • phaseId?
    • role?
    • eventId?
  • Levels:
    • trace: transcript chunks only.
    • debug: internal state transitions.
    • info: run events.
    • warn: recoverable errors.
    • error: human-required or fatal errors.

1.5 Config

  • Single Zod schema in packages/core/src/config.ts.
  • Source precedence, high to low:
    • process.env
    • .env.local
    • .env
    • schema defaults
  • Config is loaded once at process start, validated, frozen, and exported as typed Config.
  • Config validation failure is fatal.
  • Required keys at M1:
    • DATABASE_URL
    • WORKSPACE_ROOT
    • LOG_LEVEL
  • M5 adds:
    • TEMPORAL_ADDRESS
  • Path canonicalization:
    • WORKSPACE_ROOT is resolved through fs.realpathSync and stored as an absolute path at config load.
    • Any path entering the system must be canonicalized before storage or hashing.
    • repo_path and worktree_root rules are defined in section 4.
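
The precedence and fail-fast rules above can be sketched as follows. This is a minimal illustration only: the plan specifies a Zod schema, while this sketch uses a plain required-key check, and mergeSources and loadConfig are hypothetical names. Path canonicalization is omitted.

```typescript
// Hypothetical helpers; the real loader is a Zod schema in packages/core/src/config.ts.
type RawSource = Record<string, string | undefined>;

/** Merge sources listed from highest to lowest precedence. */
function mergeSources(sources: RawSource[]): Record<string, string> {
  const merged: Record<string, string> = {};
  // Walk from lowest to highest precedence so later assignments win.
  for (const source of [...sources].reverse()) {
    for (const [key, value] of Object.entries(source)) {
      if (value !== undefined) merged[key] = value;
    }
  }
  return merged;
}

const REQUIRED_M1 = ["DATABASE_URL", "WORKSPACE_ROOT", "LOG_LEVEL"] as const;

/** Validate once, then freeze. A missing required key is fatal. */
function loadConfig(sources: RawSource[]): Readonly<Record<string, string>> {
  const merged = mergeSources(sources);
  const missing = REQUIRED_M1.filter((key) => !merged[key]);
  if (missing.length > 0) {
    throw new Error(`invalid config, missing: ${missing.join(", ")}`);
  }
  return Object.freeze(merged);
}
```

The sources would be passed in the order process.env, .env.local, .env, schema defaults.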

Backend registration:

const BackendConfig = z.object({
  id: Backend,                       // codex | claude | fake
  enabled: z.boolean(),
  binaryPath: z.string().optional(), // explicit path; resolved from PATH at process start if absent. codex/claude require a resolvable binary.
});
  • fake is always available.
  • codex and claude are available only when:
    • enabled=true
    • binary resolves at process start.
  • Resolution failure:
    • doctor warns.
    • binding fails fast at run start with human_required:backend_unavailable.
  • Binding reads from config.backends, never directly from PATH.

1.6 HTTP

  • fastify 5.
  • @fastify/sensible.
  • SSE primary strategy:
    • Try fastify-sse-v2.
    • Fastify 5 compatibility is not assumed.
    • M1 includes a smoke test.
  • SSE fallback:
    • Native reply.raw.
    • Headers:
      • content-type: text/event-stream
      • cache-control: no-cache
      • connection: keep-alive
    • Write data: <json>\n\n.
    • Manage heartbeats and reconnect manually.
  • WebSocket is deferred unless SSE fails under transcript volume.
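
A minimal sketch of the frame format used by the native fallback, assuming one JSON payload per event. SSE_HEADERS, formatSseData, and formatSseHeartbeat are hypothetical names, and the heartbeat text is an assumption; the plan only states that heartbeats are managed manually.

```typescript
// Headers listed for the native reply.raw fallback.
const SSE_HEADERS = {
  "content-type": "text/event-stream",
  "cache-control": "no-cache",
  connection: "keep-alive",
} as const;

/** Serialize one event as a single `data:` line followed by a blank line. */
function formatSseData(payload: unknown): string {
  const json = JSON.stringify(payload);
  if (json === undefined) {
    throw new TypeError("payload is not JSON-serializable");
  }
  // Compact JSON.stringify output contains no raw newlines, so one data: line suffices.
  return `data: ${json}\n\n`;
}

/** Lines starting with ":" are comments in the SSE format and are ignored by clients. */
function formatSseHeartbeat(): string {
  return ": heartbeat\n\n";
}
```

With the fallback, these strings would be written to reply.raw after the headers are sent.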

2. Directory Layout

devflow/
├── package.json
├── pnpm-workspace.yaml
├── tsconfig.base.json
├── biome.json
├── lefthook.yml
├── vitest.workspace.ts
├── docker-compose.yml
├── .nvmrc
├── .env.example
├── docs/
│   ├── plan.md
│   ├── adr/
│   └── schemas/
│       ├── artifacts/
│       ├── personas/
│       └── templates/
├── scripts/
│   ├── migrate.ts
│   └── seed.ts
├── packages/
│   ├── core/
│   │   └── src/
│   │       ├── config.ts
│   │       ├── enums.ts
│   │       ├── hash.ts
│   │       ├── errors.ts
│   │       ├── template.ts
│   │       ├── persona.ts
│   │       ├── binding.ts
│   │       ├── prompt-envelope.ts
│   │       ├── artifact-schema.ts
│   │       ├── run-event.ts
│   │       └── index.ts
│   ├── db/
│   │   └── src/
│   │       ├── schema/
│   │       ├── migrations/
│   │       ├── repositories/
│   │       └── client.ts
│   ├── session/
│   │   └── src/
│   │       ├── adapter.ts
│   │       ├── fake.ts
│   │       ├── tmux.ts
│   │       ├── profiles/
│   │       │   ├── codex.ts
│   │       │   └── claude.ts
│   │       ├── recovery.ts
│   │       └── transcript.ts
│   ├── harness/
│   │   └── src/
│   │       ├── git.ts
│   │       ├── worktree.ts
│   │       ├── runner.ts
│   │       ├── review.ts
│   │       └── backtest.ts
│   ├── run-engine/
│   │   └── src/
│   │       ├── engine.ts
│   │       ├── phase-executor.ts
│   │       └── approval.ts
│   └── workflows/
│       └── src/
│           ├── workflow.ts
│           └── activities.ts
├── apps/
│   ├── api/
│   ├── web/
│   ├── cli/
│   └── worker/
└── tests/
    ├── e2e/
    └── fixtures/

3. devflow doctor

Exit codes:

  • 0: all green.
  • 1: one or more red checks.
  • 2: internal or unknown error.

Each check emits:

  • name
  • status: pass | fail | warn
  • detail
  • remediation

Closed check list:

  1. Node version satisfies >=22.0.0 <23.
  2. pnpm version >=9.0.0.
  3. tmux exists, version >=3.3.
  4. git version >=2.40.
  5. Docker daemon reachable.
  6. Postgres container running, pg_isready ok, DATABASE_URL connects.
  7. No pending Drizzle migrations.
  8. WORKSPACE_ROOT exists and is writable.
  9. .env resolves to valid Config.
  10. codex in PATH, warn-only.
  11. claude in PATH, warn-only.
  12. Free disk on WORKSPACE_ROOT partition:
    • pass at >=10GB.
    • warn under 10GB.
    • fail under 2GB.

Output:

  • Human table by default.
  • --json for machine-readable output.
  • --quiet prints only checks whose status is warn or fail.
  • --list-orphans lists orphaned worktrees only; it never removes them.
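
The exit-code contract can be sketched as below. It assumes that warn results do not make the run red, which fits the warn-only checks in the list but is not stated explicitly; runDoctor and Check are hypothetical names.

```typescript
type CheckStatus = "pass" | "fail" | "warn";

interface CheckResult {
  name: string;
  status: CheckStatus;
  detail: string;
  remediation: string;
}

// Hypothetical shape: each check is a function returning one result.
type Check = () => CheckResult;

function runDoctor(checks: Check[]): { exitCode: 0 | 1 | 2; results: CheckResult[] } {
  const results: CheckResult[] = [];
  try {
    for (const check of checks) results.push(check());
  } catch {
    // A check that throws is an internal or unknown error, not a red check.
    return { exitCode: 2, results };
  }
  // Assumption: only fail counts as red; warn leaves the exit code at 0.
  const anyFail = results.some((r) => r.status === "fail");
  return { exitCode: anyFail ? 1 : 0, results };
}
```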

4. Database Schema

First migration prelude:

CREATE EXTENSION IF NOT EXISTS pgcrypto;

All tables use gen_random_uuid() primary keys unless noted. All times are timestamptz. Mutable rows include updated_at. JSON columns use jsonb.

4.1 workflow_templates

  • id uuid primary key default gen_random_uuid()
  • name text not null
  • version int not null
  • hash text not null unique
  • definition jsonb not null
  • created_at timestamptz not null default now()
  • unique (name, version)

4.2 agent_personas

  • id uuid primary key default gen_random_uuid()
  • name text not null
  • version int not null
  • hash text not null unique
  • definition jsonb not null
  • created_at timestamptz not null default now()
  • unique (name, version)

4.3 runs

  • id uuid primary key default gen_random_uuid()
  • template_id uuid not null references workflow_templates(id)
  • template_hash text not null
  • state text not null
  • repo_path text not null
    • canonical absolute path
    • resolved through fs.realpathSync before insert
  • base_branch text not null
  • worktree_root text not null
    • canonical absolute path under WORKSPACE_ROOT/<runId>/
  • current_phase_id uuid references run_phases(id) nullable and deferrable
  • started_at timestamptz
  • ended_at timestamptz
  • final_report_path text
  • paused_from_state text
    • set when transitioning to paused
    • cleared on resume
    • null when state is not paused
  • created_at timestamptz not null default now()
  • updated_at timestamptz

Active-run uniqueness:

CREATE UNIQUE INDEX ux_active_run_repo_base
ON runs (repo_path, base_branch)
WHERE state NOT IN ('completed', 'failed', 'aborted');

4.4 run_inputs

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null unique references runs(id) on delete cascade
  • requirements_md text not null
  • objective jsonb
  • extra jsonb
  • input_hash text not null

input_hash is the run input hash defined in section 6.2. It is based on:

  • template_hash
  • sorted binding hashes
  • requirements_md
  • objective
  • extra
  • canonical repo_path
  • base_branch

4.5 run_bindings

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null references runs(id) on delete cascade
  • role_id text not null
  • persona_id uuid not null references agent_personas(id)
  • persona_hash text not null
  • backend text not null
  • binding_hash text not null
  • unique (run_id, role_id)

4.6 run_phases

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null references runs(id) on delete cascade
  • phase_key text not null
  • seq int not null
  • state text not null
  • attempts int not null default 0
  • started_at timestamptz
  • ended_at timestamptz
  • unique (run_id, phase_key)

4.7 run_events

Append-only.

  • id bigserial primary key
  • run_id uuid not null references runs(id) on delete cascade
  • phase_id uuid references run_phases(id)
  • seq bigint not null
  • type text not null
  • payload jsonb not null
  • idempotency_key text not null
  • ts timestamptz not null default now()
  • unique (run_id, seq)
  • unique (run_id, idempotency_key)
  • index (run_id, ts)

Concurrency:

  • All inserts go through RunEventRepository.append().
  • Raw SQL inserts into run_events are forbidden.
  • append() takes pg_advisory_xact_lock(hash64('devflow:run-events', run_id)).
  • Inside that same transaction it assigns:
seq := COALESCE(MAX(seq), 0) + 1
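
The sequencing rule can be modelled in memory as below. This is not the Postgres implementation: there is no advisory lock, seq is a number rather than a bigint, and returning the existing row on a duplicate idempotency_key is an assumption (the schema only declares the unique constraint). InMemoryRunEventLog is a hypothetical name.

```typescript
interface RunEvent {
  runId: string;
  seq: number;
  type: string;
  idempotencyKey: string;
  payload: unknown;
}

class InMemoryRunEventLog {
  private readonly events: RunEvent[] = [];

  append(runId: string, type: string, idempotencyKey: string, payload: unknown): RunEvent {
    const forRun = this.events.filter((e) => e.runId === runId);
    // Assumption: a repeated key is a no-op that returns the original event.
    const existing = forRun.find((e) => e.idempotencyKey === idempotencyKey);
    if (existing) return existing;
    // Equivalent of COALESCE(MAX(seq), 0) + 1, scoped to one run.
    const maxSeq = forRun.reduce((max, e) => Math.max(max, e.seq), 0);
    const event: RunEvent = { runId, seq: maxSeq + 1, type, idempotencyKey, payload };
    this.events.push(event);
    return event;
  }

  count(runId: string): number {
    return this.events.filter((e) => e.runId === runId).length;
  }
}
```

In the real repository the advisory lock is what makes the read of MAX(seq) and the insert safe against concurrent appenders for the same run.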

4.8 approval_requests

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null references runs(id)
  • phase_id uuid references run_phases(id)
  • gate_key text not null
  • state text not null
  • idempotency_key text not null
  • payload jsonb not null
  • created_at timestamptz not null default now()
  • resolved_at timestamptz
  • unique (idempotency_key)

4.9 approval_decisions

Append-only and immutable.

  • id uuid primary key default gen_random_uuid()
  • approval_request_id uuid not null references approval_requests(id)
  • action text not null
    • approve
    • reject
    • request_changes
    • abort
  • comment text
  • decided_at timestamptz not null default now()
  • idempotency_key text not null unique

pause is not an approval decision.

4.10 tui_sessions

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null references runs(id) on delete cascade
  • role_id text not null
  • backend text not null
  • cwd text not null
  • expected_artifact_path text
  • expected_schema text
  • last_prompt_hash text
  • last_prompt_at timestamptz
  • last_capture_seq bigint not null default 0
  • last_known_pane_pid int
  • tmux_session text
  • tmux_window text
  • state text not null
  • recovery_attempts int not null default 0
  • unique (run_id, role_id)

4.11 tui_transcript_chunks

Append-only.

  • id bigserial primary key
  • session_id uuid not null references tui_sessions(id) on delete cascade
  • seq bigint not null
  • content text not null
  • captured_at timestamptz not null default now()
  • unique (session_id, seq)

4.12 artifacts

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null references runs(id) on delete cascade
  • phase_id uuid references run_phases(id)
  • path text not null
  • schema_id text not null
  • hash text not null
  • valid boolean not null
  • validation_error jsonb
  • created_at timestamptz not null default now()
  • unique (run_id, path, hash)

4.13 commands

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null references runs(id) on delete cascade
  • phase_id uuid references run_phases(id)
  • kind text not null
    • git
    • test
    • e2e
    • doctor
    • backtest
    • other
  • argv text[] not null
  • cwd text not null
  • exit_code int
  • stdout_path text
  • stderr_path text
  • started_at timestamptz
  • ended_at timestamptz

4.14 review_findings

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null references runs(id) on delete cascade
  • phase_id uuid references run_phases(id)
  • reviewer_role text not null
  • severity text not null
    • info
    • low
    • medium
    • high
    • critical
  • category text not null
    • correctness
    • evidence
    • style
    • security
    • performance
    • other
  • file_path text
  • line int
  • summary text not null
  • evidence text
  • verifier_status text not null default 'unverified'
    • unverified
    • confirmed
    • rejected
  • created_at timestamptz not null default now()

4.15 Backtest Stub Tables

backtest_iterations and backtest_metrics are created at M1 as stub tables:

  • id uuid primary key default gen_random_uuid()
  • run_id uuid not null references runs(id) on delete cascade
  • payload jsonb
  • created_at timestamptz not null default now()

Full schema is deferred to M12.

5. Enums

All enums live in packages/core/src/enums.ts as TypeScript const objects and Zod enums.

5.1 Backend

  • codex
  • claude
  • fake

Future gemini support adds an enum entry and a BackendProfile; no design change.

5.2 Capability

  • spec_write
  • phase_planning
  • task_dag_planning
  • code_edit
  • test_first_development
  • code_review
  • evidence_check
  • command_execute
  • backtest_run
  • metric_extract
  • failure_mining
  • objective_eval
  • final_report_compose

5.3 RiskLevel

  • low
  • medium
  • high

Risk is declared per phase in the template. Persona has maxRiskLevel. Binding fails when phase.risk > persona.maxRiskLevel.
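
Because the enum values are strings, the comparison phase.risk > persona.maxRiskLevel needs an explicit ordering; plain string comparison would sort high before low. A minimal sketch, with RISK_ORDER and riskAllowed as hypothetical names:

```typescript
// Explicit ordinal for each level; string comparison would give the wrong order.
const RISK_ORDER = { low: 0, medium: 1, high: 2 } as const;
type RiskLevel = keyof typeof RISK_ORDER;

/** True when a phase with phaseRisk may be bound to a persona with maxRiskLevel. */
function riskAllowed(phaseRisk: RiskLevel, maxRiskLevel: RiskLevel): boolean {
  return RISK_ORDER[phaseRisk] <= RISK_ORDER[maxRiskLevel];
}
```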

5.4 ApprovalDecisionAction

  • approve
  • reject
  • request_changes
  • abort

pause is a run-level control operation, not an approval decision.

5.5 ApprovalState

  • pending
  • approved
  • rejected
  • changes_requested
  • aborted
  • paused

paused is entered only by gate timeout; it never counts as an automatic approve or reject decision.

5.6 RunState

  • created
  • bound
  • planning
  • awaiting_approval
  • executing
  • paused
  • completed
  • failed
  • aborted

5.7 RunPhaseState

  • pending
  • running
  • awaiting_artifact
  • validating
  • awaiting_approval
  • completed
  • failed
  • skipped

5.8 SessionState

  • CREATED
  • BOOTSTRAPPING
  • READY
  • BUSY
  • WAITING_FOR_APPROVAL
  • ARTIFACT_TIMEOUT
  • HUNG
  • CRASHED
  • RESUMING
  • REBOOTSTRAPPED
  • FAILED_NEEDS_HUMAN

6. Content-Addressed Hashing

6.1 Canonical JSON

  • Object keys sorted lexicographically by UTF-16 code units.
  • No insignificant whitespace.
  • Strings use standard JSON escaping.
  • No Unicode normalization.
  • Numbers use shortest round-trippable representation.
  • Integers have no decimal point.
  • No leading zeros.
  • Arrays preserve order.
  • No trailing newline.

packages/core/src/hash.ts exports:

canonicalize(value: unknown): string
hash(value: unknown): string

hash() returns sha256hex(canonicalize(value)).
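
One way these two functions could look, given the rules in 6.1. This is a sketch under assumptions: inputs are plain JSON values, and undefined, functions, and non-finite numbers are rejected rather than dropped, which the plan does not specify.

```typescript
import { createHash } from "node:crypto";

function canonicalize(value: unknown): string {
  if (value === null) return "null";
  switch (typeof value) {
    case "boolean":
      return value ? "true" : "false";
    case "string":
      return JSON.stringify(value); // standard JSON escaping, no normalization
    case "number":
      if (!Number.isFinite(value)) {
        throw new TypeError("non-finite numbers are not valid JSON");
      }
      return JSON.stringify(value); // shortest round-trippable form
    case "object": {
      if (Array.isArray(value)) {
        // Array.from visits holes as undefined, so sparse arrays are rejected below.
        return `[${Array.from(value, (item) => canonicalize(item)).join(",")}]`;
      }
      const record = value as Record<string, unknown>;
      // The default sort compares strings by UTF-16 code units.
      const keys = Object.keys(record).sort();
      const members = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(record[k])}`);
      return `{${members.join(",")}}`;
    }
    default:
      // Assumption: undefined, bigint, symbol, and function are errors.
      throw new TypeError(`unsupported value type: ${typeof value}`);
  }
}

function hash(value: unknown): string {
  return createHash("sha256").update(canonicalize(value), "utf8").digest("hex");
}
```

Two objects that differ only in key order hash to the same value, which is the property the hash subjects in 6.2 rely on.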

6.2 Hash Subjects

  • Template hash:
    • { name, version, roles, phases, gates, capabilitiesRequired }
  • Persona hash:
    • { name, version, capabilities, backend, maxRiskLevel, allowedRoles, promptConfig, modelConfig }
  • Binding hash:
    • { runId, roleId, templateHash, personaHash, backend, override }
  • Run input hash:
    • { templateHash, bindings: sorted[bindingHash], requirementsMd, objective, repoPath, baseBranch, extra }
  • Prompt hash:
    • { runId, roleId, phaseKey, expectedArtifact, expectedSchema, instructions, attempt }
  • Artifact hash:
    • SHA-256 of file bytes.

Prompt hash uses phaseKey, not phaseId, because PromptEnvelope carries phaseKey.

7. Template, Persona, Binding

7.1 Template Schema

const TemplatePhase = z.object({
  key: z.string(),
  title: z.string(),
  risk: RiskLevel,
  roles: z.array(z.string()),
  expectedArtifact: z
    .object({
      path: z.string(),
      schema: z.string(),
    })
    .optional(),
  gates: z.array(z.string()).default([]),
  timeoutMs: z.number().int().positive().optional(),
});

const TemplateRole = z.object({
  id: z.string(),
  requiredCapabilities: z.array(Capability),
  preferredBackends: z.array(Backend).default([]),
  count: z.number().int().min(1).default(1),
  diversity: z
    .object({
      requireDifferentBackends: z.boolean().default(false),
    })
    .optional(),
});

const Template = z.object({
  name: z.string(),
  version: z.number().int().positive(),
  roles: z.array(TemplateRole),
  phases: z.array(TemplatePhase),
  defaultGates: z.array(z.string()).default([]),
});

7.2 Persona Schema

const Persona = z.object({
  name: z.string(),
  version: z.number().int().positive(),
  backend: Backend,
  capabilities: z.array(Capability),
  maxRiskLevel: RiskLevel,
  allowedRoles: z.array(z.string()).optional(),
  promptConfig: z
    .object({
      systemPrompt: z.string().optional(),
      instructionsPrelude: z.string().optional(),
    })
    .default({}),
  modelConfig: z.record(z.string(), z.unknown()).default({}),
});

7.3 Override Semantics

  • Override may swap persona for a role.
  • Override may constrain backend to a specific allowed backend.
  • Override cannot add capabilities.
  • Override cannot raise risk above persona maxRiskLevel.
  • Diversity rules apply after override.
  • Lock-time validation runs the full binding algorithm.
  • On first binding failure, the run does not start.

7.4 Binding Algorithm

For each role:

  1. Select override persona if present; otherwise run autoSelect.
  2. Assert backend is enabled in config.backends.
  3. Assert non-fake backend binary resolved at process start.
  4. Assert role id is in allowedRoles, unless allowedRoles is absent.
  5. Assert required capabilities are a subset of persona capabilities.
  6. Assert every phase using the role has risk <= persona maxRiskLevel.
  7. Expand roles with count > 1 into roleId#0, roleId#1, etc.
  8. Enforce diversity rules after expansion.
  9. Compute and persist binding_hash per role instance.

autoSelect is deterministic. Sort candidates by:

  1. role preferredBackends order.
  2. persona.version desc.
  3. persona.name asc.
  4. persona.hash asc.

Personas whose backend is not in preferredBackends are eligible only if all preferred-backend personas fail capability or risk checks.

Binding fails with human_required:no_eligible_persona if no persona satisfies requirements.
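
The deterministic ordering can be sketched as a comparator. It assumes the candidates have already passed the capability and risk checks, so ranking non-preferred backends last is enough to make them a fallback. Candidate and sortCandidates are hypothetical names.

```typescript
interface Candidate {
  name: string;
  version: number;
  hash: string;
  backend: string;
}

function backendRank(backend: string, preferredBackends: string[]): number {
  const index = preferredBackends.indexOf(backend);
  // Backends outside the preferred list sort after every preferred one.
  return index === -1 ? Number.MAX_SAFE_INTEGER : index;
}

// Code-unit comparison; avoids locale-dependent ordering.
function compareStrings(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

function sortCandidates(candidates: Candidate[], preferredBackends: string[]): Candidate[] {
  return [...candidates].sort(
    (a, b) =>
      backendRank(a.backend, preferredBackends) - backendRank(b.backend, preferredBackends) ||
      b.version - a.version || // version desc
      compareStrings(a.name, b.name) || // name asc
      compareStrings(a.hash, b.hash), // hash asc
  );
}
```

autoSelect would then take the first element, or fail with human_required:no_eligible_persona when the list is empty.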

7.5 Seeding

Personas:

  • docs/schemas/personas/<name>@<version>.yaml
  • filename encodes immutable identity.
  • loader parses with Persona schema.
  • loader computes personaHash.
  • loader upserts keyed by (name, version).
  • hash mismatch on an existing row is fatal.

Templates:

  • docs/schemas/templates/<name>@<version>.yaml
  • same immutable version rule.

Deleting a published file is allowed only when no run references that hash.

8. Session Runtime

8.1 SessionAdapter Interface

export interface SessionAdapter {
  start(input: StartInput): Promise<SessionHandle>;
  sendPrompt(handle: SessionHandle, envelope: PromptEnvelope): Promise<{ promptId: string }>;
  probe(handle: SessionHandle): Promise<ProbeResult>;
  resume(handle: SessionHandle): Promise<SessionHandle>;
  rebootstrap(handle: SessionHandle): Promise<SessionHandle>;
  capture(handle: SessionHandle, fromSeq: bigint): AsyncIterable<TranscriptChunk>;
  dispose(handle: SessionHandle): Promise<void>;
}

export interface StartInput {
  runId: string;
  roleId: string;
  backend: Backend;
  cwd: string;
  expectedArtifactPath?: string;
  expectedSchema?: string;
  envelopePrelude?: string;
}

export interface SessionHandle {
  sessionId: string;
  pid?: number;
  tmuxSession?: string;
  tmuxWindow?: string;
}

export interface ProbeResult {
  alive: boolean;
  paneActive: boolean;
  lastOutputAt?: Date;
  hint?: string;
}

export interface TranscriptChunk {
  seq: bigint;
  content: string;
  capturedAt: Date;
}

8.2 Session State Machine

  • CREATED -> BOOTSTRAPPING -> READY
  • READY <-> BUSY
  • BUSY -> WAITING_FOR_APPROVAL
  • BUSY -> ARTIFACT_TIMEOUT
  • BUSY -> HUNG
  • BUSY -> CRASHED
  • HUNG | CRASHED | ARTIFACT_TIMEOUT -> RESUMING -> READY
  • RESUMING -> REBOOTSTRAPPED -> READY
  • recovery retries exhausted (see 8.3) -> FAILED_NEEDS_HUMAN
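
The listed edges can be encoded as a transition table. This is a sketch: the plan does not say which states lead to FAILED_NEEDS_HUMAN or how WAITING_FOR_APPROVAL is left, so the edges marked as assumptions below are guesses, and SESSION_TRANSITIONS and canTransition are hypothetical names.

```typescript
type SessionState =
  | "CREATED"
  | "BOOTSTRAPPING"
  | "READY"
  | "BUSY"
  | "WAITING_FOR_APPROVAL"
  | "ARTIFACT_TIMEOUT"
  | "HUNG"
  | "CRASHED"
  | "RESUMING"
  | "REBOOTSTRAPPED"
  | "FAILED_NEEDS_HUMAN";

const SESSION_TRANSITIONS: Record<SessionState, readonly SessionState[]> = {
  CREATED: ["BOOTSTRAPPING"],
  BOOTSTRAPPING: ["READY"],
  READY: ["BUSY"],
  BUSY: ["READY", "WAITING_FOR_APPROVAL", "ARTIFACT_TIMEOUT", "HUNG", "CRASHED"],
  WAITING_FOR_APPROVAL: [], // outgoing edges are not listed in the plan
  ARTIFACT_TIMEOUT: ["RESUMING"],
  HUNG: ["RESUMING"],
  CRASHED: ["RESUMING"],
  // Assumption: exhaustion is detected while resuming or after a rebootstrap.
  RESUMING: ["READY", "REBOOTSTRAPPED", "FAILED_NEEDS_HUMAN"],
  REBOOTSTRAPPED: ["READY", "FAILED_NEEDS_HUMAN"],
  FAILED_NEEDS_HUMAN: [],
};

function canTransition(from: SessionState, to: SessionState): boolean {
  return SESSION_TRANSITIONS[from].includes(to);
}
```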

8.3 Recovery Counters

  • sendPrompt retry: 2.
  • resume retry: 2.
  • rebootstrap retry: 1.
  • artifact repair retry: 1.
  • max hung time: configurable; default 20 minutes.

Exhaustion creates a human gate with recoveryHint.

8.4 SessionManager Singleton

  • M4: hosted in apps/api.
  • M5+: hosted in apps/worker.
  • Only SessionManager may call mutating SessionAdapter methods.
  • Holds in-memory Map<sessionId, SessionHandle>.
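  • Takes a session-level advisory lock on hash64('devflow:session-manager') with pg_try_advisory_lock, so a second instance fails immediately instead of blocking.
  • A second instance that cannot take the lock exits with code 3.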
  • On start:
    • query non-terminal tui_sessions.
    • call adapter.resume(handle).
    • success: place handle in map.
    • failure: session -> FAILED_NEEDS_HUMAN, append session.failed, create recovery gate.
  • On SIGTERM/SIGINT:
    • refuse new prompts.
    • allow in-flight artifact polling up to 30s.
    • persist last_capture_seq.
    • release advisory lock.

9. Prompt Envelope

9.1 Wire Format

DEVFLOW_PROMPT_BEGIN <uuid>
Run: <run-id>
Role: <role-id>
Phase: <phase-key>
Attempt: <int>
Expected artifact: <absolute-path>
Expected schema: <schema-id>
Dedup-Key: <prompt-hash>
Instructions:
<freeform multi-line instructions>
DEVFLOW_PROMPT_END <uuid>

9.2 Schema

const PromptEnvelope = z.object({
  uuid: z.string().uuid(),
  runId: z.string().uuid(),
  roleId: z.string(),
  phaseKey: z.string(),
  attempt: z.number().int().nonnegative(),
  expectedArtifact: z.string(),
  expectedSchema: z.string(),
  dedupKey: z.string(),
  instructions: z.string(),
});
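
A sketch of how an envelope could be rendered into the wire format of 9.1. The interface mirrors the schema fields; renderEnvelope is a hypothetical name, and no escaping of marker text inside instructions is attempted.

```typescript
interface PromptEnvelopeFields {
  uuid: string;
  runId: string;
  roleId: string;
  phaseKey: string;
  attempt: number;
  expectedArtifact: string;
  expectedSchema: string;
  dedupKey: string;
  instructions: string;
}

function renderEnvelope(e: PromptEnvelopeFields): string {
  return [
    `DEVFLOW_PROMPT_BEGIN ${e.uuid}`,
    `Run: ${e.runId}`,
    `Role: ${e.roleId}`,
    `Phase: ${e.phaseKey}`,
    `Attempt: ${e.attempt}`,
    `Expected artifact: ${e.expectedArtifact}`,
    `Expected schema: ${e.expectedSchema}`,
    `Dedup-Key: ${e.dedupKey}`,
    "Instructions:",
    e.instructions, // freeform, may span multiple lines
    `DEVFLOW_PROMPT_END ${e.uuid}`,
  ].join("\n");
}
```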

9.3 Rules

  • Prompt identity is dedupKey.
  • Adapter refuses duplicate dedupKey for the same session within a run lifetime.
  • attempt increments only when the engine intentionally re-sends after timeout or repair.
  • Adapter-level retry does not increment attempt.
  • Completion is never inferred from transcript text.
  • Completion requires a schema-valid artifact.

9.4 Backend Prelude

Sent once at session bootstrap before the first envelope.

Required structure:

  1. Backend identity statement.
  2. Persona instructionsPrelude.
  3. Protocol declaration: completion is signaled only by writing expected artifact files.
  4. Envelope marker contract.
  5. Approval/probe contract: on DEVFLOW_PROBE, the agent must respond with exactly one line, either READY or BUSY <reason>.

Codex and Claude-specific addenda live in packages/session/src/profiles/{codex,claude}.ts and are populated at M10.

10. Artifact Schema Registry

10.1 Layout

JSON Schema 2020-12 documents live at:

docs/schemas/artifacts/<schema_id>.json

schema_id format:

<domain>/<name>@<version>

Examples:

  • dev/spec@1
  • dev/phase-plan@1
  • dev/dag@1
  • dev/review-finding-batch@1
  • bt/objective@1
  • bt/iteration-result@1
  • common/final-report@1
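
A sketch of parsing a schema_id and deriving its file location. The allowed character set (lowercase letters, digits, hyphens) is an assumption inferred from the examples; parseSchemaId and SchemaRef are hypothetical names.

```typescript
interface SchemaRef {
  domain: string;
  name: string;
  version: number;
  path: string;
}

// Assumption: domain and name are lowercase kebab-case; version is a positive integer.
const SCHEMA_ID = /^([a-z][a-z0-9-]*)\/([a-z][a-z0-9-]*)@([1-9][0-9]*)$/;

function parseSchemaId(id: string): SchemaRef {
  const match = SCHEMA_ID.exec(id);
  if (!match) {
    throw new Error(`malformed schema id: ${id}`);
  }
  return {
    domain: match[1],
    name: match[2],
    version: Number(match[3]),
    path: `docs/schemas/artifacts/${id}.json`,
  };
}
```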

10.2 Loader

packages/core/src/artifact-schema.ts exports:

function loadSchema(id: string): JsonSchema;
function validateArtifact(
  id: string,
  data: unknown
): { ok: true } | { ok: false; errors: ValidationError[] };

Unknown schema id is fatal.

10.3 Validation Flow

  1. Engine waits for expectedArtifactPath to appear.
  2. Debounce 500ms after last mtime change.
  3. Read file.
  4. Compute SHA-256.
  5. Validate against expectedSchema.
  6. Valid:
    • insert artifact row with valid=true.
    • append artifact.validated.
    • advance phase.
  7. Invalid:
    • insert artifact row with valid=false.
    • append artifact.invalid.
    • trigger one repair prompt.
    • after repair exhaustion, create human gate.
  8. Timeout:
    • append artifact.timeout.
    • probe session.
    • enter recovery flow.

10.4 Final Report

At terminal run state, write atomically:

  • <WORKSPACE_ROOT>/<runId>/<runId>.report.md
  • <WORKSPACE_ROOT>/<runId>/<runId>.report.json

Both are written even on failed or aborted, best-effort.

common/final-report@1 minimum fields:

  • runId
  • templateHash
  • bindings[]
  • inputs
  • phases[]
  • approvals[]
  • findings[]
  • commands[]
  • artifacts[]
  • events.tail
  • unresolved[]
  • endedAt
  • status

10.5 Backtest Objective Stub

bt/objective@1:

{
  "targets": [
    { "metric": "sharpe", "op": "gte", "value": 1.5, "weight": 1.0 },
    { "metric": "mdd", "op": "lte", "value": 0.15, "weight": 1.0 }
  ],
  "stopWhen": "all"
}
  • op: gte | lte | eq | gt | lt
  • stopWhen: all | weighted
  • weighted threshold is hardcoded at 0.8 at M12.
  • Full DSL deferred to M12.
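
A sketch of how the stub objective might be evaluated. The meaning of weighted is an assumption here: the summed weight of satisfied targets divided by the total weight must reach the 0.8 threshold. A missing metric is treated as unsatisfied. objectiveMet and the type names are hypothetical.

```typescript
type Op = "gte" | "lte" | "eq" | "gt" | "lt";

interface Target {
  metric: string;
  op: Op;
  value: number;
  weight: number;
}

interface Objective {
  targets: Target[];
  stopWhen: "all" | "weighted";
}

const WEIGHTED_THRESHOLD = 0.8;

function compare(op: Op, actual: number, value: number): boolean {
  switch (op) {
    case "gte": return actual >= value;
    case "lte": return actual <= value;
    case "eq": return actual === value;
    case "gt": return actual > value;
    case "lt": return actual < value;
  }
}

function objectiveMet(objective: Objective, metrics: Record<string, number | undefined>): boolean {
  const met = objective.targets.map((t) => {
    const actual = metrics[t.metric];
    return actual !== undefined && compare(t.op, actual, t.value);
  });
  if (objective.stopWhen === "all") return met.every(Boolean);
  // Assumed weighted rule: satisfied weight share must reach the threshold.
  const total = objective.targets.reduce((sum, t) => sum + t.weight, 0);
  if (total <= 0) return false;
  const achieved = objective.targets.reduce((sum, t, i) => sum + (met[i] ? t.weight : 0), 0);
  return achieved / total >= WEIGHTED_THRESHOLD;
}
```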

11. Run Events

Closed event types:

run.created
run.started
run.paused
run.resumed
run.completed
run.failed
run.aborted
phase.started
phase.completed
phase.failed
phase.skipped
prompt.sent
prompt.repaired
artifact.expected
artifact.validated
artifact.invalid
artifact.timeout
approval.requested
approval.resolved
session.created
session.ready
session.busy
session.idle
session.crashed
session.recovered
session.failed
command.started
command.completed
command.failed
review.batch_recorded
finding.verifier_resolved
backtest.iteration_started
backtest.iteration_completed
backtest.objective_evaluated

11.1 Idempotency Keys

Every event append requires a deterministic idempotency_key.

| Event family | Key formula |
| --- | --- |
| run.created, run.started, run.completed, run.failed, run.aborted | <type>:<run_id> |
| run.paused | run.paused:<run_id>:<cause> |
| run.resumed | run.resumed:<run_id>:<cause> |
| phase.started, phase.completed, phase.failed, phase.skipped | <type>:<phase_id>:<phase_attempt> |
| prompt.sent, prompt.repaired | <type>:<prompt_dedup_key> |
| artifact.expected, artifact.timeout | <type>:<phase_id>:<phase_attempt>:<expected_path> |
| artifact.validated, artifact.invalid | <type>:<phase_id>:<expected_path>:<artifact_hash> |
| approval.requested | approval.requested:<approval_idempotency_key> |
| approval.resolved | approval.resolved:<approval_request_id>:<action> |
| session.created, session.failed | <type>:<session_id> |
| session.busy, session.idle | <type>:<session_id>:<prompt_dedup_key> |
| session.ready, session.crashed, session.recovered | <type>:<session_id>:<recovery_attempts> |
| command.started, command.completed, command.failed | <type>:<command_id> |
| review.batch_recorded | review.batch_recorded:<phase_id>:<reviewer_role>:<phase_attempt> |
| finding.verifier_resolved | finding.verifier_resolved:<finding_id> |
| backtest.iteration_started, backtest.iteration_completed, backtest.objective_evaluated | <type>:<iteration_id> |

Definitions:

  • phase_attempt is incremented before event append.
  • recovery_attempts is incremented before event append.
  • prompt_dedup_key is the envelope dedup key.
  • approval_idempotency_key is from approval_requests.
  • Artifact expected/timeout events are per-attempt.
  • Artifact validated/invalid events are content-keyed by path + hash.
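
A few of the key formulas written as builder functions, to show the difference between per-attempt and content-keyed artifact events. The eventKey object and its member names are hypothetical; only the key formats come from the table.

```typescript
const eventKey = {
  // run.created, run.started, run.completed, run.failed, run.aborted
  runLifecycle: (type: string, runId: string): string => `${type}:${runId}`,
  // phase.started, phase.completed, phase.failed, phase.skipped
  phase: (type: string, phaseId: string, phaseAttempt: number): string =>
    `${type}:${phaseId}:${phaseAttempt}`,
  // prompt.sent, prompt.repaired
  prompt: (type: string, promptDedupKey: string): string => `${type}:${promptDedupKey}`,
  // artifact.expected, artifact.timeout: one key per attempt
  artifactAttempt: (type: string, phaseId: string, phaseAttempt: number, expectedPath: string): string =>
    `${type}:${phaseId}:${phaseAttempt}:${expectedPath}`,
  // artifact.validated, artifact.invalid: keyed by content, not by attempt
  artifactContent: (type: string, phaseId: string, expectedPath: string, artifactHash: string): string =>
    `${type}:${phaseId}:${expectedPath}:${artifactHash}`,
  approvalResolved: (approvalRequestId: string, action: string): string =>
    `approval.resolved:${approvalRequestId}:${action}`,
};
```

Re-validating an unchanged file on a later attempt produces the same artifact.validated key, so the unique (run_id, idempotency_key) constraint prevents a second event for identical content.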

12. Fake Session Adapter

12.1 Behavior

  • Deterministic.
  • In-process.
  • No PTY.
  • No tmux.
  • Drives engine end-to-end without real backends.

12.2 Sentinel Triggers

On sendPrompt, inspect expectedSchema.

Fixture path:

tests/fixtures/fake-artifacts/<expectedSchema>/<scenarioName>.json

scenarioName comes from instruction header:

Scenario: <name>

Default scenario: ok.

Scenarios:

  • ok: write fixture to expectedArtifactPath after 50ms by default.
  • invalid: write deliberately schema-invalid payload.
  • timeout: never write.
  • crash: throw RecoverableError.
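
A sketch of the scenario lookup. It assumes the Scenario: header sits on its own line inside the instructions and that an unrecognized name is an error; neither detail is fixed by the plan. parseScenario and fixturePath are hypothetical names.

```typescript
type Scenario = "ok" | "invalid" | "timeout" | "crash";

const SCENARIOS: readonly Scenario[] = ["ok", "invalid", "timeout", "crash"];

function parseScenario(instructions: string): Scenario {
  // Multiline flag: the header may appear on any line of the instructions.
  const match = /^Scenario:[ \t]*(\S+)[ \t]*$/m.exec(instructions);
  if (!match) return "ok"; // default scenario
  const name = match[1];
  const found = SCENARIOS.find((s) => s === name);
  if (!found) {
    // Assumption: unknown names are rejected rather than silently mapped to ok.
    throw new Error(`unknown fake scenario: ${name}`);
  }
  return found;
}

function fixturePath(expectedSchema: string, scenario: Scenario): string {
  return `tests/fixtures/fake-artifacts/${expectedSchema}/${scenario}.json`;
}
```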

12.3 Transcript

Fake adapter emits chunks such as:

[fake] received prompt <uuid>; will write <path> in 50ms

13. State Machines

13.1 Run State

States:

  • created
  • bound
  • planning
  • awaiting_approval
  • executing
  • paused
  • completed
  • failed
  • aborted

Transitions:

| From | Trigger | To | Side effects |
| --- | --- | --- | --- |
| created | lockBindings ok | bound | persist bindings; emit run.started |
| created | lockBindings fail | failed | emit run.failed |
| bound | phase plan needed | planning | emit phase.started |
| planning | plan artifact valid | awaiting_approval | request approval |
| awaiting_approval | approve | executing | emit approval.resolved, run.resumed |
| awaiting_approval | reject | failed | emit run.failed |
| awaiting_approval | request_changes | planning | increment phase attempts |
| awaiting_approval | timeout | paused | set paused_from_state='awaiting_approval' |
| executing | phase ok, more phases | executing | next phase |
| executing | phase needs gate | awaiting_approval | request gate |
| executing | all phases done | completed | emit run.completed, write final report |
| executing | unrecoverable error | failed | emit run.failed |
| executing | manual pauseRun | paused | set paused_from_state='executing' |
| planning | manual pauseRun | paused | set paused_from_state='planning' |
| paused | resume | paused_from_state | emit run.resumed, clear paused_from_state |
| any non-terminal state | abortRun | aborted | emit run.aborted, dispose sessions |

Non-terminal states for abortRun:

  • created
  • bound
  • planning
  • awaiting_approval
  • executing
  • paused
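
The pause, resume, and abort rows can be sketched as pure functions over a run row. RunRow and the function names are hypothetical, and event emission and session disposal are left out.

```typescript
type RunState =
  | "created" | "bound" | "planning" | "awaiting_approval"
  | "executing" | "paused" | "completed" | "failed" | "aborted";

interface RunRow {
  state: RunState;
  pausedFromState: RunState | null;
}

// States with a transition into paused in the table above.
const PAUSABLE: readonly RunState[] = ["planning", "awaiting_approval", "executing"];
const TERMINAL: readonly RunState[] = ["completed", "failed", "aborted"];

function pause(run: RunRow): RunRow {
  if (!PAUSABLE.includes(run.state)) {
    throw new Error(`cannot pause from ${run.state}`);
  }
  return { state: "paused", pausedFromState: run.state };
}

function resume(run: RunRow): RunRow {
  if (run.state !== "paused" || run.pausedFromState === null) {
    throw new Error("run is not paused");
  }
  // Return to the remembered state and clear the marker.
  return { state: run.pausedFromState, pausedFromState: null };
}

function abort(run: RunRow): RunRow {
  if (TERMINAL.includes(run.state)) {
    throw new Error(`cannot abort from terminal state ${run.state}`);
  }
  return { state: "aborted", pausedFromState: null };
}
```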

13.2 Run Phase State

States:

  • pending
  • running
  • awaiting_artifact
  • validating
  • awaiting_approval
  • completed
  • failed
  • skipped

Transitions:

| From | Trigger | To |
| --- | --- | --- |
| pending | start | running |
| running | prompt sent, artifact expected | awaiting_artifact |
| awaiting_artifact | artifact appears | validating |
| awaiting_artifact | timeout | running after probe/repair, or failed after exhaustion |
| validating | valid | awaiting_approval if gate, else completed |
| validating | invalid | running after one repair, else failed |
| awaiting_approval | approve | completed |
| awaiting_approval | reject / abort | failed |
| awaiting_approval | request_changes | running, attempt + 1 |

14. Approval State

States:

  • pending
  • approved
  • rejected
  • changes_requested
  • aborted
  • paused

14.1 Transitions

| From | Event | To | Side effects |
| --- | --- | --- | --- |
| pending | approve decision | approved | insert decision row |
| pending | reject decision | rejected | insert decision row; run -> failed |
| pending | request_changes decision | changes_requested | insert decision row; increment attempt |
| pending | abort decision | aborted | insert decision row; run -> aborted |
| pending | timeout | paused | run -> paused; no decision row |
| paused | unpause | pending | re-arm gate; no decision row |
| terminal states | any decision | unchanged | return 409 |

Rules:

  • A pending request can transition to one non-pending state per pending epoch.
  • Terminal approval states reject further decisions.
  • paused may return to pending only through unpause.
  • Manual pause is run-level pauseRun; it leaves approval gate in pending.
  • Only approve, reject, request_changes, and abort create approval_decisions rows.
  • Default approval timeout is null (no timeout).
  • Timeout never auto-approves or auto-rejects.

14.2 Decision Idempotency

  • GUI:
    • UUIDv4 per click.
    • reused across automatic UI retries for the same logical action.
  • CLI:
    • UUIDv4 per invocation.
    • --client-token=<uuid> override for scripted retry.
  • API:
    • existing (approval_request_id, action, client_token) returns existing row with status 200.
    • new decision inserts row and returns 201.
    • same token with different action returns 409.
    • decision on non-pending request returns 409.
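
The API rules can be modelled in memory as below. This is a sketch, not the handler: it assumes the replay check runs before the non-pending check, so that retrying the decision that resolved the request still returns 200. DecisionStore is a hypothetical name.

```typescript
type Action = "approve" | "reject" | "request_changes" | "abort";

interface Decision {
  approvalRequestId: string;
  action: Action;
  clientToken: string;
}

class DecisionStore {
  private readonly byToken = new Map<string, Decision>();
  private readonly resolvedRequests = new Set<string>();

  /** Returns the HTTP status the API would answer with. */
  submit(decision: Decision): 200 | 201 | 409 {
    const existing = this.byToken.get(decision.clientToken);
    if (existing) {
      const sameDecision =
        existing.approvalRequestId === decision.approvalRequestId &&
        existing.action === decision.action;
      // Exact replay returns the stored row; a reused token with other content conflicts.
      return sameDecision ? 200 : 409;
    }
    // Any recorded decision moves the request out of pending.
    if (this.resolvedRequests.has(decision.approvalRequestId)) return 409;
    this.byToken.set(decision.clientToken, decision);
    this.resolvedRequests.add(decision.approvalRequestId);
    return 201;
  }
}
```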

14.3 Destructive Command Enforcement

Devflow-direct commands have hard enforcement. TUI-agent commands have best-effort enforcement.

Hard-blocked Devflow-direct patterns:

  • rm -rf
  • git reset --hard
  • git clean
  • git push --force
  • git push --force-with-lease
  • git worktree remove --force
  • git branch -D
  • docker volume rm
  • docker compose down -v
  • DROP DATABASE
  • DROP SCHEMA
  • migration rollback
  • reads/writes touching .env*, ~/.ssh/, ~/.aws/, ~/.config/gcloud/, ~/.kube/
  • files matching *token*, *secret*, *credentials*, *.pem, *.key

TUI-agent command enforcement is best-effort:

  1. Prelude prohibits destructive operations.
  2. Backend permission mode is set to the safest available mode.
  3. Transcript audit captures post-hoc evidence.
  4. Human intervention goes through devflow attach.
  5. Worktrees and branches are preserved by default.

v1 does not claim real-time blocking of TUI-internal commands.

15. Run Engine and Temporal Contract

The M4 RunEngine contract is frozen before M5. M5 reimplements the same interface through Temporal.

15.1 Public API

interface RunEngine {
  startRun(input: RunStartInput): Promise<{ runId: string }>;
  signalApproval(
    runId: string,
    approvalRequestId: string,
    action: ApprovalDecisionAction,
    clientToken: string,
    comment?: string
  ): Promise<void>;
  pauseRun(runId: string): Promise<void>;
  resumeRun(runId: string): Promise<void>;
  abortRun(runId: string, reason: string): Promise<void>;
  getStatus(runId: string): Promise<RunStatus>;
}

15.2 Temporal Shape

  • Namespace: devflow.
  • Task queue: devflow-runs.
  • Single worker process: apps/worker.
  • Workflow: runWorkflow(input: RunStartInput).
  • Signals:
    • approve
    • pause
    • resume
    • abort
    • unpause
  • No Updates in M5.
  • Status is read from DB.

Activities:

  • lockBindings(input)
  • generatePhasePlan(runId, phaseKey, attempt)
  • sendPromptToSession(sessionId, envelope)
  • waitForArtifact(sessionId, expectedPath, expectedSchema, timeoutMs)
  • validateArtifact(artifactPath, expectedSchema)
  • recordEvent(runId, type, payload)
  • requestApproval(runId, gateKey, phaseId, payload, idempotencyKey)
  • runCommand(kind, argv, cwd, env)
  • composeFinalReport(runId)

Retry policy:

  • Default: max attempts 3, exponential backoff starting at 1s and capped at 30s.
  • requestApproval: max attempts 1.
  • composeFinalReport: max attempts 1.
  • sendPromptToSession: max attempts 2; further retry belongs to engine recovery.
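
To make the retry numbers concrete, the sketch below computes the wait before each retry for a given activity. It is a standalone illustration and does not call the Temporal SDK. The names `RetryPolicy`, `policyFor`, and `retryDelaysMs` are hypothetical, and the backoff coefficient of 2 is an assumption, since the policy above only states the starting and maximum intervals.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffCoefficient: number; // assumption: not specified in the plan
}

const DEFAULT_RETRY: RetryPolicy = {
  maxAttempts: 3,
  initialDelayMs: 1_000,
  maxDelayMs: 30_000,
  backoffCoefficient: 2,
};

// Per-activity overrides taken from the retry policy list.
const RETRY_OVERRIDES: Partial<Record<string, Partial<RetryPolicy>>> = {
  requestApproval: { maxAttempts: 1 },
  composeFinalReport: { maxAttempts: 1 },
  sendPromptToSession: { maxAttempts: 2 },
};

function policyFor(activity: string): RetryPolicy {
  return { ...DEFAULT_RETRY, ...RETRY_OVERRIDES[activity] };
}

// Delay before each retry; attempt 1 runs immediately, so a policy with
// maxAttempts N yields N - 1 delays.
function retryDelaysMs(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const raw = policy.initialDelayMs * policy.backoffCoefficient ** retry;
    delays.push(Math.min(raw, policy.maxDelayMs));
  }
  return delays;
}
```

Under these assumptions a default activity waits 1s and then 2s before giving up, `sendPromptToSession` waits 1s once, and `requestApproval` never retries. The 30s cap only matters for policies with more than five retries.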

15.3 Hard Constraints

  • Workflow code holds only serializable state.
  • No tmux handles in workflow state.
  • No PTY refs in workflow state.
  • No DB clients in workflow state.
  • M5+ session interaction happens through activities calling SessionManager in apps/worker.
  • M5+ API never calls mutating SessionAdapter methods.
  • SessionManager advisory lock prevents API/worker ownership conflict during M4 -> M5 transition.
  • Workflow code uses deterministic clock/randomness only.

16. WriteSet and Worktree

16.1 WriteSet

  • Each task declares writeSet: string[].
  • Patterns are relative to repo root.
  • Glob engine: fast-glob.
  • Options:
{
  cwd: worktreeRoot,
  dot: true,
  followSymbolicLinks: false,
  onlyFiles: true,
  suppressErrors: false
}

Conflict detection:

  1. Expand writeSets.
  2. Forbidden globs cause conflict if matched by more than one task:
    • pnpm-lock.yaml
    • package-lock.json
    • **/migrations/**
    • **/*.generated.*
    • root tsconfig*.json
    • biome.json
    • lefthook.yml
    • .github/**
    • .gitlab-ci.yml
  3. Pairwise file intersections must be empty.

A conflict creates a parallel_dag_approved gate.
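
The three conflict-detection steps can be sketched over already-expanded write sets. This is a minimal illustration, not the scheduler's code: `detectConflicts`, `ForbiddenRule`, and `WriteSetConflict` are hypothetical names, the input is assumed to be the result of step 1 (glob expansion with fast-glob), and forbidden globs are represented by caller-supplied predicates rather than real glob matching.

```typescript
interface ForbiddenRule {
  label: string; // the glob this rule stands for, e.g. '**/migrations/**'
  matches: (path: string) => boolean; // stand-in for real glob matching
}

interface WriteSetConflict {
  kind: 'forbidden' | 'overlap';
  detail: string; // glob label or shared file path
  tasks: string[];
}

function detectConflicts(
  expanded: Map<string, string[]>, // taskId -> files from step 1
  forbidden: ForbiddenRule[],
): WriteSetConflict[] {
  const conflicts: WriteSetConflict[] = [];
  const taskIds = Array.from(expanded.keys()).sort();

  // Step 2: a forbidden glob conflicts when more than one task matches it,
  // even if the tasks touch different files under that glob.
  for (const rule of forbidden) {
    const hitters = taskIds.filter((id) =>
      (expanded.get(id) ?? []).some((p) => rule.matches(p)),
    );
    if (hitters.length > 1) {
      conflicts.push({ kind: 'forbidden', detail: rule.label, tasks: hitters });
    }
  }

  // Step 3: pairwise file intersections must be empty.
  taskIds.forEach((first, i) => {
    const firstFiles = new Set(expanded.get(first) ?? []);
    for (const second of taskIds.slice(i + 1)) {
      for (const file of expanded.get(second) ?? []) {
        if (firstFiles.has(file)) {
          conflicts.push({ kind: 'overlap', detail: file, tasks: [first, second] });
        }
      }
    }
  });

  return conflicts;
}
```

Sorting task ids first keeps the conflict list deterministic, which makes the gate payload stable across retries.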

16.2 Worktree Lifecycle

  • Worktree root:
    • WORKSPACE_ROOT/<runId>/<laneId>
    • non-parallel main lane: WORKSPACE_ROOT/<runId>/main
  • Created via git worktree add.
  • Branch name:
devflow/<runId>/<laneId>
  • Terminal run state does not remove worktrees or branches.
  • Output branches are deliverables.
  • Disk growth is accepted.
  • Cleanup is manual:
devflow cleanup <run-id> [--lane=<id>]

Cleanup:

  • uses git worktree remove without --force by default.
  • refuses dirty worktrees.
  • --force requires an additional gate.
  • git branch -D is destructive and gated.
  • doctor --list-orphans lists only; it never removes.
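
The path and branch conventions in 16.2 are simple enough to express as helpers. The sketch below is illustrative only: `worktreePath`, `laneBranch`, and `worktreeAddArgv` are hypothetical names, and it assumes the non-parallel lane uses `main` as its lane id for the branch name as well as for the path.

```typescript
// WORKSPACE_ROOT/<runId>/<laneId>, with 'main' as the non-parallel lane.
function worktreePath(workspaceRoot: string, runId: string, laneId = 'main'): string {
  const root = workspaceRoot.replace(/\/+$/, ''); // drop trailing slashes
  return `${root}/${runId}/${laneId}`;
}

// devflow/<runId>/<laneId>
function laneBranch(runId: string, laneId = 'main'): string {
  return `devflow/${runId}/${laneId}`;
}

// argv for: git worktree add -b <branch> <path> <base-branch>
function worktreeAddArgv(
  workspaceRoot: string,
  runId: string,
  baseBranch: string,
  laneId = 'main',
): string[] {
  return [
    'git', 'worktree', 'add',
    '-b', laneBranch(runId, laneId),
    worktreePath(workspaceRoot, runId, laneId),
    baseBranch,
  ];
}
```

Building the command as an argv array keeps it compatible with the Devflow-direct enforcement in 14.3, which inspects commands before they run.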

17. SSE Contract

Endpoints:

  • GET /sse/runs/:runId
  • GET /sse/global

Heartbeat every 15 seconds.

Events:

| Event | Scope |
| --- | --- |
| run.state_changed | both |
| run.event_appended | run |
| phase.state_changed | run |
| approval.created | both |
| approval.resolved | both |
| session.state_changed | run |
| transcript.chunk_appended | run |
| artifact.validated | run |

Reconnect:

  • Last-Event-ID carries the last run_events.seq the client received.
  • The server replays run events with seq > lastSeq.
  • Non-run-event SSE types are not replayed; clients re-derive that state by fetching.
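
The reconnect rules can be sketched as a pure replay function. This is an illustration under stated assumptions rather than the route handler: `RunEvent`, `parseLastEventId`, and `replayFrames` are hypothetical names, the sketch assumes `run_events.seq` starts at 1 so that a missing or malformed header replays everything, and it assumes each replayed row is sent as a `run.event_appended` SSE event.

```typescript
interface RunEvent {
  seq: number;
  type: string;
  payload: unknown;
}

// Missing or malformed headers fall back to 0, i.e. replay from the start.
function parseLastEventId(header: string | undefined): number {
  const n = Number.parseInt(header ?? '', 10);
  return Number.isFinite(n) && n >= 0 ? n : 0;
}

// One SSE frame: id carries seq so the client can resume from it later.
function formatSseFrame(event: RunEvent): string {
  const data = JSON.stringify({ type: event.type, payload: event.payload });
  return `id: ${event.seq}\nevent: run.event_appended\ndata: ${data}\n\n`;
}

// Frames for every run event with seq > lastSeq, in seq order.
function replayFrames(events: RunEvent[], lastEventId: string | undefined): string[] {
  const lastSeq = parseLastEventId(lastEventId);
  return events
    .filter((e) => e.seq > lastSeq)
    .sort((a, b) => a.seq - b.seq)
    .map((e) => formatSseFrame(e));
}
```

Because only `run_events` rows carry a `seq`, frames for other event types would be written without an `id:` field, so they do not advance the client's Last-Event-ID.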

18. Errors

packages/core/src/errors.ts:

type ErrorClass = 'recoverable' | 'human_required' | 'fatal';

class DevflowError extends Error {
  readonly class: ErrorClass;
  readonly code: string;
  readonly runId?: string;
  readonly phaseId?: string;
  readonly recoveryHint?: string;
  readonly cause?: unknown;
}

Recoverable:

  • network_blip
  • pane_briefly_unresponsive
  • prompt_send_transient
  • db_serialization_retry

Human required:

  • artifact_invalid_after_repair
  • artifact_timeout_exhausted
  • destructive_command_blocked
  • secret_access_blocked
  • backend_unavailable
  • no_eligible_persona
  • writeset_conflict
  • merge_conflict
  • objective_not_met
  • review_dispute_unresolved

Fatal:

  • db_unreachable
  • workspace_permissions
  • internal_state_corruption
  • template_load_failed
  • migration_pending
  • config_invalid

Mapping:

  • recoverable -> retry; exhausted -> human_required.
  • human_required -> run paused and gate created.
  • fatal -> run failed, sessions disposed, final report best-effort.
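
The mapping above can be sketched as a small decision function. The names `ErrorOutcome` and `resolveOutcome` are hypothetical, and the sketch assumes that "exhausted" means the current attempt number has reached the configured maximum.

```typescript
type ErrorClass = 'recoverable' | 'human_required' | 'fatal';

// Hypothetical outcome labels for the three rows of the mapping.
type ErrorOutcome = 'retry' | 'pause_and_gate' | 'fail_run';

function resolveOutcome(
  errorClass: ErrorClass,
  attempt: number, // 1-based attempt that just failed
  maxAttempts: number,
): ErrorOutcome {
  switch (errorClass) {
    case 'recoverable':
      // Exhausted retries escalate to the human_required path.
      return attempt < maxAttempts ? 'retry' : 'pause_and_gate';
    case 'human_required':
      return 'pause_and_gate'; // run paused and gate created
    case 'fatal':
      return 'fail_run'; // run failed, sessions disposed, best-effort report
  }
}
```

For example, a `network_blip` on attempt 1 of 3 would retry, while the same error on attempt 3 would pause the run and open a gate.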

19. Concurrent Runs and Crash Recovery

19.1 Active Run Uniqueness

  • MAX_CONCURRENT_RUNS, default 4.
  • DB partial unique index is the source of truth:
    • one active run per (repo_path, base_branch).
  • repo_path is canonicalized before insert.
  • Advisory lock is auxiliary only:
pg_try_advisory_xact_lock(hash64('devflow:start-run', repoPath, baseBranch))
  • A unique-index violation returns HTTP 409 with body:
{ "currentRunId": "...", "currentState": "..." }

19.2 Crash Recovery

M4, no Temporal:

  • On apps/api startup, sweep non-terminal runs.
  • Mark them failed.
  • final_report_path = null.
  • Append synthesized run.failed with reason process_restart_unrecovered.
  • Cascade associated tui_sessions to FAILED_NEEDS_HUMAN.
  • Append session.failed.
  • This frees active-run uniqueness slots.
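
The M4 sweep steps can be sketched over in-memory records. This is a simplified illustration, not the startup code: `sweepOnStartup` and the record shapes are hypothetical, the set of terminal run states is passed in as a predicate rather than assumed, and the `sessionId` payload field on `session.failed` is an assumption.

```typescript
interface RunRecord {
  id: string;
  state: string;
  finalReportPath: string | null;
}

interface SessionRecord {
  id: string;
  runId: string;
  state: string;
}

interface AppendedEvent {
  runId: string;
  type: string;
  payload: Record<string, string>;
}

// Mutates the records in place and returns the events to append.
function sweepOnStartup(
  runs: RunRecord[],
  sessions: SessionRecord[],
  isTerminal: (state: string) => boolean,
): AppendedEvent[] {
  const events: AppendedEvent[] = [];
  for (const run of runs) {
    if (isTerminal(run.state)) continue; // only non-terminal runs are swept
    run.state = 'failed';
    run.finalReportPath = null;
    events.push({
      runId: run.id,
      type: 'run.failed',
      payload: { reason: 'process_restart_unrecovered' },
    });
    // Cascade every session that belongs to the swept run.
    for (const session of sessions) {
      if (session.runId !== run.id) continue;
      session.state = 'FAILED_NEEDS_HUMAN';
      events.push({
        runId: run.id,
        type: 'session.failed',
        payload: { sessionId: session.id }, // assumption: payload shape
      });
    }
  }
  return events;
}
```

A database-backed version would presumably run the updates and event inserts for each run in one transaction, so a second crash mid-sweep leaves no run half-failed.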

M5+:

  • No sweep.
  • Temporal durability owns in-flight workflow recovery.
  • SessionManager resumes tmux sessions.
  • Active-run partial index blocks duplicate runs until completion or explicit abort.

20. Milestones

M1: Monorepo + Postgres + CLI Doctor

  • Scaffold workspace.
  • Add pnpm, tsconfig, biome, lefthook, Vitest.
  • Add Docker Compose for Postgres.
  • Add Drizzle and first migration.
  • Add devflow doctor.
  • Implement checks 1-9.
  • Stub checks 10-12 as warn where needed.
  • Add SSE compatibility smoke test:
    • minimal Fastify 5 server.
    • fastify-sse-v2 plugin.
    • 30-second integration test.
    • receive 3 events and reconnect.
    • if plugin fails, implement native reply.raw SSE helper before M1 is green.

M2: Core Schema + Registry + Binding

  • Implement enums.
  • Implement canonical hashing.
  • Implement Template schema.
  • Implement Persona schema.
  • Implement seed loader.
  • Implement binding algorithm.
  • Implement artifact schema registry.
  • Add first schemas:
    • dev/spec@1
    • dev/phase-plan@1
    • common/final-report@1
  • Tests:
    • schema validation.
    • override semantics.
    • risk enforcement.
    • diversity enforcement.
    • deterministic auto-select.

M3: Fake Session Runtime

  • Implement SessionAdapter.
  • Implement FakeSessionAdapter.
  • Implement prompt envelope.
  • Implement event recorder.
  • Implement fake sentinel scenarios.
  • Persist transcript chunks.
  • Tests:
    • prompt correlation.
    • artifact validation.
    • invalid artifact.
    • timeout.
    • fake crash.

M4: Minimal Run Engine

  • Implement packages/run-engine.
  • Used directly by apps/api.
  • No Temporal.
  • Supports:
    • start run.
    • lock bindings.
    • approval.
    • fake prompt.
    • artifact wait/validate.
    • final report.
  • Freeze the RunEngine contract.
  • Full fake development@1 minus reviewers.

M5: Temporal Integration

  • Reimplement RunEngine through Temporal.
  • Preserve M4 behavior.
  • Add parity tests using the same M4 scenarios.
  • M5+ SessionManager lives in apps/worker.

M6: Real tmux SessionManager

  • Implement TmuxSessionAdapter.
  • Decoupled from M5.
  • May begin after M3 is stable.
  • Pre-M5 real tmux is opt-in smoke only.
  • Production run path remains fake until both M5 and M6 are green.

M7: TUI Recovery State Machine

  • Implement session state transitions.
  • Implement recovery counters.
  • Implement escalation to human gates.

M8: API + GUI Minimum

  • Implement Fastify routes.
  • Implement SSE.
  • Implement GUI screens:
    • Dashboard.
    • Templates.
    • Personas.
    • New Run.
    • Run Detail.
    • Approvals.
    • TUI Sessions.

M9: development@1 Fake-Agent Full Run

  • Add curated development@1.
  • Add review consensus.
  • Add verifier flow with fake reviewers.
  • Add coverage gate >=70% lines for core/session/run-engine.

M10: Codex/Claude Opt-In Real Run

  • Implement profiles:
    • packages/session/src/profiles/codex.ts
    • packages/session/src/profiles/claude.ts
  • Real backends become production-default only after both M5 and M6 are green.
  • Until then real tmux/Codex/Claude are developer-flagged opt-in smoke only.

M11: Parallel Lanes

  • Add task DAG scheduler.
  • Add writeSet conflict detection.
  • Add per-lane worktrees.
  • Add merge coordinator.
  • Add conflict gates.

M12: Backtest Workflow

  • Add backtest-strategy@1.
  • Add objective evaluator.
  • Add metric parser extension points.
  • Add failure mining artifacts.
  • Add Backtest Lab GUI.

M13: Template Factory

  • Generate draft template from natural language and repo discovery.
  • Add harness design.
  • Add template review.
  • Add dry-run and promote flow.

21. Out of Scope

  • Authentication.
  • Authorization.
  • Multi-user support.
  • Data retention or archival policy.
  • Observability dashboards.
  • Remote template/persona registries.
  • Multi-machine deployment.
  • HA.
  • Managed backups.
  • Web ingress.
  • TLS.
  • Reverse proxy.

22. Decision Log

Open Questions Closed

| # | Question | Resolution |
| --- | --- | --- |
| OQ-1 | Persona/template seeding format | Immutable YAML at docs/schemas/{personas,templates}/<name>@<version>.yaml |
| OQ-2 | Approval timeout | default null; timeout freezes only |
| OQ-3 | Final report format | Markdown and JSON |
| OQ-4 | Temporal namespace/queue | namespace devflow, task queue devflow-runs |
| OQ-5 | WriteSet glob engine | fast-glob |
| OQ-6 | Backtest objective DSL | Stub in M12, full DSL deferred |
| OQ-7 | Codex/Claude prompt prelude | Structure locked, exact text deferred to M10 |
Blocking Corrections Applied

| # | Issue | Resolution |
| --- | --- | --- |
| CC-1 | Terminal state deleted worktrees/branches | Preserve by default; manual gated cleanup only |
| CC-2 | SessionManager location conflict | M4 API, M5+ worker |
| CC-3 | Event duplicates under retry | run_events.idempotency_key |
| CC-4 | Destructive command enforcement overclaimed | Devflow-direct hard, TUI best-effort |
| CC-5 | UUID extension missing | CREATE EXTENSION IF NOT EXISTS pgcrypto |
| CC-6 | Advisory lock not enough for active-run uniqueness | partial unique index |
| CC-7 | Undefined transition sequence in event keys | cause-based keys |
| CC-8 | Approval paused transition missing | explicit approval transition table |
| CC-9 | AutoSelect order nondeterministic | deterministic sort |
| CC-10 | SSE plugin compatibility assumed | M1 smoke + native fallback |
| CC-11 | ApprovalAction included pause | split ApprovalDecisionAction; pauseRun is run-level |
| CC-12 | Artifact hash key collision | include phase id and path |
| CC-13 | Resume previous state not stored | runs.paused_from_state |
| CC-14 | repo path aliasing | canonical realpath storage |
| CC-15 | M4 sweep left tmux sessions ambiguous | cascade session state to FAILED_NEEDS_HUMAN; real tmux production-default only after M5+M6 |
| CC-16 | Prompt hash used phaseId but envelope uses phaseKey | prompt hash uses phaseKey |
| CC-17 | abortRun transition too narrow | abort from any non-terminal run state |
| CC-18 | approval pending transition wording conflicted with pause epoch | pending can transition once per pending epoch; paused may unpause to pending |

Future Open Questions

  • FOQ-1, M12: full backtest objective DSL.
  • FOQ-2, M13: template factory generation prompts.
  • FOQ-3, post-M10: optional third backend such as Gemini.
  • FOQ-4, post-M8: WebSocket vs SSE if transcript pressure requires it.

23. Kickoff Order

  1. M1.1: repo + pnpm + tsconfig + biome + lefthook + vitest workspace.
  2. M1.2: docker-compose + Postgres healthcheck + drizzle-kit + first migration.
  3. M1.3: apps/cli skeleton + devflow doctor.
  4. M1.4: packages/core skeleton with config, enums, errors, hash, prompt-envelope, run-event types.
  5. M2.1: Zod schemas for Template/Persona, persona YAML loader, hashing.
  6. M2.2: Binding algorithm + tests.
  7. M2.3: Artifact schema registry + first three schemas.
  8. M3.1: SessionAdapter interface + FakeSessionAdapter.
  9. M3.2: Transcript chunk capture + DB persistence.
  10. M3.3: engine-shaped harness running a single fake phase end-to-end.
  11. M4: assemble run engine; lock contract; full fake development@1 minus reviewers.
  12. M5 in parallel with M6 once M4 is green.