fix: use incremental diff per iteration instead of cumulative base diff
After each iteration's _commit_iteration, record the new HEAD SHA and use it as the diff anchor for the next iteration.

Previously, capture_diff always diffed against the initial base commit, so every iteration returned the same full cumulative diff. Reviewers couldn't see what changed between iterations, which led to repeated feedback and stuck FAIL loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 17:54:30 +09:00 · 2026-03-15 10:07:11 +09:00
15 changed files with 913 additions and 224 deletions
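
The difference between a cumulative and an incremental diff anchor, as described in the commit message, can be illustrated with a small git-free toy model. This is only a sketch under stated assumptions: `diff_snapshots` and `run_iterations` are hypothetical names that do not exist in cross-eval, plain dictionaries stand in for commits, and `difflib` stands in for `git diff`. The real change records the HEAD SHA returned by `_commit_iteration` instead.

```python
import difflib


def diff_snapshots(base: dict[str, str], head: dict[str, str]) -> str:
    """Unified diff between two {path: content} snapshots (toy stand-in for git diff)."""
    chunks: list[str] = []
    for path in sorted(set(base) | set(head)):
        old = base.get(path, "").splitlines(keepends=True)
        new = head.get(path, "").splitlines(keepends=True)
        # unified_diff yields nothing when old and new are identical.
        chunks.extend(difflib.unified_diff(old, new, f"a/{path}", f"b/{path}"))
    return "".join(chunks)


def run_iterations(
    initial: dict[str, str],
    edits: list[dict[str, str]],
) -> tuple[list[str], list[str]]:
    """Apply one edit per iteration and return (cumulative_diffs, incremental_diffs)."""
    tree = dict(initial)
    first_base = dict(initial)  # old behaviour: anchor never moves
    anchor = dict(initial)      # new behaviour: anchor advances after each iteration
    cumulative: list[str] = []
    incremental: list[str] = []
    for edit in edits:
        tree.update(edit)  # the coder modifies files in this iteration
        cumulative.append(diff_snapshots(first_base, tree))
        incremental.append(diff_snapshots(anchor, tree))
        anchor = dict(tree)  # hypothetical "commit": the next diff starts here
    return cumulative, incremental


cumulative, incremental = run_iterations(
    {"app.py": "x = 1\n"},
    [{"app.py": "x = 2\n"}, {"util.py": "y = 3\n"}],
)
```

In this model the second cumulative diff still contains the `app.py` change from the first iteration, while the second incremental diff shows only the new `util.py` file, which is the view a reviewer needs to judge what the latest fix actually changed.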
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -10,6 +10,8 @@ AI 에이전트 2개를 활용한 개발 워크플로우(기획→체크리스
 - Generator: `--permission-mode auto` (파일 읽기/쓰기 가능)
 - Reviewer: `--permission-mode plan` (읽기 전용 탐색)
 - subprocess의 `cwd`를 현재 작업 디렉토리로 설정
+- 기본 실행 모드는 **direct mode**다. 즉 agentic coder도 현재 작업트리에서 직접 수정한다.
+- `--worktree` 또는 `use_worktree: true`를 명시한 경우에만 isolated git worktree를 생성한다.

 ## 사용자 경험 (UX Flow)

@@ -34,6 +36,7 @@ ls output/v1/ v2/ final-report.md

 ```yaml
 output_dir: output
+use_worktree: false
 max_iterations: 3

 inputs:
@@ -51,10 +54,8 @@ agents:
    system_prompt: "You are a meticulous code reviewer."

 # 방법 1: 프리셋 사용 (사용자가 pipeline YAML 직접 작성할 필요 없음)
-pipeline: preset:simple          # "A 생성 → B 리뷰" (기본값)
-# pipeline: preset:cross-review  # "둘 다 생성 → 서로 리뷰"
+pipeline: preset:coding-plan-review   # "문서 기반 구현 → 코드/문서 리뷰 → 수정 → 재검증" (기본값)
 # pipeline: preset:plan-review        # "구현 전 문서 리뷰 → 수정 → 재검증 반복"
-# pipeline: preset:coding-review-fix  # "초기 코딩 1회 → 리뷰/수정 반복"

 # 방법 2: 직접 커스텀 (고급 사용자용)
 # pipeline:
@@ -75,10 +76,8 @@ pipeline: preset:simple          # "A 생성 → B 리뷰" (기본값)

 | 프리셋 | 설명 | 자동 생성되는 steps |
 |--------|------|-------------------|
-| `simple` | A 코딩 → B 리뷰 | coding(agent1) → review(agent2) |
-| `cross-review` | 둘 다 코딩, 서로 리뷰 | coding_a → coding_b → review_of_b(agent_a) → review_of_a(agent_b) |
 | `plan-review` | 구현 전 문서 리뷰/수정/재검증 반복 | plan_review_* → aggregate_review → plan_fix → verify |
-| `coding-review-fix` | 초기 코딩 후 리뷰/수정 반복 | initial_coding(coding) → review_fix(review* → aggregate → coding → verify) |
+| `coding-plan-review` | 문서 기반 구현 후 코드/문서 리뷰/수정 반복 | initial_coding(coding) → coding_plan_review(review* → aggregate → coding_plan_fix → verify) |

 프리셋은 내부적으로 적절한 pipeline steps + context_override를 자동 구성한다. agents에 정의된 순서대로 agent1, agent2가 배정된다. 프리셋이 불충분하면 직접 steps를 작성할 수 있다.

@@ -101,7 +100,7 @@ cross_eval/
 **models.py** — 순환 참조 방지, 모든 데이터클래스 집중:
 - `AgentConfig` (command, args, system_prompt, stdin_mode)
 - `StepConfig` (name, agent, role, prompt_template, output_key, verdict, verdict_pattern, context_override)
-- `PipelineConfig` (output_dir, max_iterations, inputs, agents, pipeline)
+- `PipelineConfig` (output_dir, use_worktree, max_iterations, inputs, agents, pipeline)
 - `AgentResult` (output, exit_code, agent_name, step_name, duration_seconds)
 - `IterationResult` (iteration, step_outputs, verdict, feedback)
 - `PipelineResult` (iterations, final_verdict, total_duration)
@@ -117,7 +116,7 @@ cross_eval/
 - `default:review` — 과최적화/오탐/누락 3기준 검토 + `VERDICT: PASS|FAIL` 출력 + **"프로젝트 디렉토리를 직접 탐색하여 코드를 검증하라"** 지시
 - `{variable}` 플레이스홀더, 누락 시 `(no {key} provided)` 출력
 - 사용자가 커스텀 .md 파일로 오버라이드 가능
-- `PIPELINE_PRESETS` dict: `simple`, `cross-review`, `plan-review` 등 프리셋별 StepConfig 리스트 정의
+- `PIPELINE_PRESETS` / `PHASED_PRESETS` dict: `plan-review`, `coding-plan-review` 프리셋별 StepConfig/PhaseConfig 정의

 **agent.py** — `invoke_agent(agent_config, prompt, cwd)`:
 - `cwd` 파라미터로 프로젝트 디렉토리 지정 → 에이전트가 해당 디렉토리에서 파일 탐색 가능
@@ -139,16 +138,21 @@ for iteration 1..max_iterations:
 final-report.md 생성
 ```

+agentic 실행 경로는 두 모드가 있다.
+- 기본: direct mode (`cwd`에서 직접 수정)
+- opt-in: isolated worktree mode (`--worktree` 또는 `use_worktree: true`)
+
 **report.py** — 최종 마크다운 리포트:
 - 요약 테이블 (반복 횟수, 판정, 소요시간)
 - 반복별 상세 (각 step 출력, 에이전트명, 소요시간)
 - 최종 판정

 **cli.py** — 서브커맨드:
-- `cross-eval init [--dir .] [--preset simple|cross-review|plan-review]` — 스캐폴딩 (기존 파일 안 덮어씀)
-- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...]`
+- `cross-eval init [--dir .] [--preset coding-plan-review|plan-review]` — 스캐폴딩 (기존 파일 안 덮어씀)
+- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...] [--worktree]`
 - `--input key=path`: config의 inputs 오버라이드/추가
 - `--dry-run`: 에이전트 호출 없이 렌더링된 프롬프트만 출력
+- `--worktree`: 기본 direct mode 대신 isolated git worktree에서 실행

 ## 수정할 파일 목록

@@ -172,10 +176,12 @@ final-report.md 생성
 4. plan.md/checklist.md에 간단한 내용 넣고 `cross-eval run --max-iter 2` 로 실제 실행
 5. `output/` 디렉토리에 v1/, final-report.md 생성 확인

+`--dry-run` 은 미리보기 전용이며 실제 verdict가 PASS가 아니어도 프로세스 종료 코드는 `0`으로 처리한다.
+

  cross-eval run \
    --docs /Users/chungyeong/Desktop/Dev/new-alpha-foundry/plans/TO_CLICKHOUSE \
-    --preset coding-review-fix \
+    --preset coding-plan-review \
    --coder claude \
    --reviewer codex \
    --reviewer codex \
@@ -187,4 +193,4 @@ final-report.md 생성
    --max-iter 10


-cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-review-fix --lang ko --max-iter 1
+cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-plan-review --lang ko --max-iter 1
--- a/README.md
+++ b/README.md
@@ -51,12 +51,15 @@ cp .cross-eval/checklist-sample.md .cross-eval/checklist.md
 ### 3. 실행

 ```bash
-# 기본 실행 (코딩 → 리뷰, 최대 3회 반복)
+# 기본 실행 (현재 작업트리 direct mode, 최대 3회 반복)
 cross-eval run

 # 프롬프트만 확인 (에이전트 호출 없이, 비용 절약)
 cross-eval run --dry-run

+# 격리된 git worktree에서 실행하고 싶을 때만 명시
+cross-eval run --worktree
+
 # 최대 반복 횟수 변경
 cross-eval run --max-iter 5

@@ -80,6 +83,9 @@ output/
 └── final-report.md    # 전체 요약 리포트
 ```

+기본값은 **direct mode**다. 즉 `cross-eval`은 현재 작업트리에서 직접 파일을 읽고 수정한다.
+별도 격리 실행이 필요할 때만 `--worktree`를 붙여 isolated git worktree를 사용한다.
+
 ## 설정 (`.cross-eval/config.yaml`)

 ```yaml
@@ -101,7 +107,8 @@ agents:
    args: ["-p", "--model", "opus", "--permission-mode", "plan"]
    system_prompt: "You are a meticulous code reviewer."

-pipeline: preset:simple
+pipeline: preset:coding-plan-review
+use_worktree: false        # 기본값. true면 isolated worktree 사용
 ```

 실행 중에 `config.yaml`을 수정하면 다음 반복부터 자동으로 반영됩니다.
@@ -110,16 +117,16 @@ pipeline: preset:simple

 | 프리셋 | 설명 |
 |--------|------|
-| `simple` | Agent A가 코딩, Agent B가 리뷰 (기본값) |
-| `cross-review` | 둘 다 코딩, 서로 교차 리뷰 |
 | `plan-review` | 구현 전 기획서/체크리스트/참고문서를 검토하고 문서를 수정한 뒤 재검증까지 반복 |
-| `review-only` | 기존 코드만 감사 용도로 검토 |
-| `review-fix` | 리뷰 결과를 취합한 뒤 자동 수정과 재검증까지 반복 |
-| `coding-review-fix` | 초기 코딩 1회 후 리뷰 결과를 취합해 자동 수정과 재검증을 반복 |
+| `coding-plan-review` | 입력 문서를 바탕으로 코드를 구현하고, 코드와 문서를 함께 리뷰/수정/재검증 반복 |
+
+두 프리셋은 역할만 다르고, 대부분의 CLI 옵션은 동일하게 동작한다. 예를 들어 `--plan`, `--checklist`, `--docs`, `--coder`, `--reviewer`, `--senior`, `--max-iter`, `--dry-run`, `--worktree`는 둘 다 같은 방식으로 사용할 수 있다.

 ```bash
 # 초기화 옵션
-cross-eval init --preset cross-review   # 교차 리뷰 프리셋
+cross-eval init --preset coding-plan-review  # 구현 + 코드/문서 리뷰 프리셋
 cross-eval init --preset plan-review         # 문서 리뷰/수정/재검증 프리셋
 cross-eval init --lang en               # 영어 템플릿
 ```
+
+`cross-eval run --dry-run` 은 프롬프트와 파이프라인 구성을 미리보기만 하며, 실제 판정이 PASS가 아니어도 종료 코드는 `0`이다.
--- a/checklist.md
+++ b/checklist.md
@@ -0,0 +1,31 @@
+# cross-eval CLI 사용성 리팩토링 체크리스트
+
+## 핵심 사용자 흐름
+- [ ] `cross-eval init` 이후 무엇을 해야 하는지 분명하게 안내한다.
+- [ ] `cross-eval doctor`를 언제 왜 써야 하는지 설명한다.
+- [ ] `cross-eval run` 실행 전 필요한 준비물이 명확하다.
+- [ ] 실행 후 결과가 `.cross-eval/output` 아래에 저장된다는 점이 안내된다.
+
+## `run` 커맨드 이해도
+- [ ] `--preset`별 차이가 빠르게 비교 가능하다.
+- [ ] `--coder`, `--reviewer`, `--senior`의 역할 차이가 설명된다.
+- [ ] config 기반 실행과 CLI 옵션 기반 실행의 관계가 명확하다.
+- [ ] 어떤 옵션이 config를 override하는지 혼동 없이 이해할 수 있다.
+
+## 예시 품질
+- [ ] 대표 사용 예시가 실제 사용자 목적 중심으로 정리되어 있다.
+- [ ] 예시가 너무 많아 산만하지 않고, 핵심 조합 위주로 압축되어 있다.
+- [ ] 초보자용 기본 예시와 고급 사용 예시가 구분되어 있다.
+- [ ] 예시만 복사해도 실제 실행 가능한 수준이다.
+
+## 리팩토링 범위 통제
+- [ ] 기존 명령 이름과 옵션 이름을 바꾸지 않는다.
+- [ ] 기능 동작을 불필요하게 변경하지 않는다.
+- [ ] 안내 문구 개선이 목적이지 새 기능 추가가 아님을 유지한다.
+- [ ] plan 범위를 넘는 UI/기능 확장을 하지 않는다.
+
+## 코드 품질
+- [ ] 기존 테스트가 깨지지 않도록 한다.
+- [ ] 도움말/문구 변경으로 인한 회귀를 확인한다.
+- [ ] 문자열 변경이 실제 출력 흐름과 모순되지 않는다.
+- [ ] 중복되거나 상충되는 설명이 생기지 않는다.
--- a/cross_eval/cli.py
+++ b/cross_eval/cli.py
@@ -38,7 +38,7 @@ coders: [claude-coder]
 reviewers: [claude-reviewer]
 # seniors: [codex-senior]

-# 파이프라인 종류: simple | cross-review | plan-review | review-only | review-fix | coding-review-fix
+# 파이프라인 종류: plan-review | coding-plan-review
 pipeline: preset:{preset}

 # 반복 설정
@@ -194,20 +194,12 @@ def main(argv: list[str] | None = None) -> int:
    )
    init_parser.add_argument(
        "--preset",
-        default="simple",
-        choices=[
-            "simple",
-            "cross-review",
-            "plan-review",
-            "review-only",
-            "review-fix",
-            "coding-review-fix",
-        ],
+        default="coding-plan-review",
+        choices=["plan-review", "coding-plan-review"],
        help=(
-            "파이프라인 종류 (기본: simple). "
-            "simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서리뷰수정재검증, "
-            "review-only=리뷰만, review-fix=리뷰수렴+자동수정, "
-            "coding-review-fix=초기코딩후리뷰수렴"
+            "파이프라인 종류 (기본: coding-plan-review). "
+            "plan-review=문서리뷰수정재검증, "
+            "coding-plan-review=문서기반구현후 코드+문서 리뷰/수정/재검증"
        ),
    )
    init_parser.add_argument(
@@ -252,9 +244,9 @@ def main(argv: list[str] | None = None) -> int:
    )
    demo_parser.add_argument(
        "--preset",
-        default="simple",
-        choices=["simple", "review-fix", "coding-review-fix"],
-        help="데모할 파이프라인 종류 (기본: simple)",
+        default="coding-plan-review",
+        choices=["plan-review", "coding-plan-review"],
+        help="데모할 파이프라인 종류 (기본: coding-plan-review)",
    )
    demo_parser.add_argument(
        "--escalate",
@@ -281,25 +273,12 @@ def main(argv: list[str] | None = None) -> int:
        ),
        epilog=(
            "파이프라인 종류 (--preset):\n"
-            "  ┌──────────────┬─────────────────────────────────────────────────────┐\n"
-            "  │ simple       │ Coder가 코드 작성 → Reviewer가 리뷰               │\n"
-            "  │ (기본값)     │ FAIL이면 피드백 반영해서 재코딩, PASS까지 반복     │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ review-fix   │ 2단계 파이프라인:                                  │\n"
-            "  │              │  Reviewer N명 병렬 리뷰 → 취합 → 수정 → 재검증   │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ coding-      │ 3단계 파이프라인:                                  │\n"
-            "  │ review-fix   │  초기 코딩 1회 → 리뷰 취합 → 수정 → 재검증 반복   │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ plan-review  │ 구현 전 기획서/체크리스트/문서를 검토하고       │\n"
-            "  │              │ 수정한 뒤 시니어가 재검증할 때까지 반복         │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ review-only  │ 코드 작성 없이 Reviewer N명이 기존 코드만 검토    │\n"
-            "  │              │ (이미 작성된 코드의 품질 감사용)                   │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ cross-review │ Coder 2명이 각각 구현 → 상대방 코드를 교차 리뷰   │\n"
-            "  │              │ (서로 다른 에이전트의 구현 비교용)                 │\n"
-            "  └──────────────┴─────────────────────────────────────────────────────┘\n"
+            "  ┌─────────────────────┬──────────────────────────────────────────────┐\n"
+            "  │ coding-plan-review  │ 입력 문서 기반 구현 → 코드+문서 리뷰/수정   │\n"
+            "  │ (기본값)            │ → 재검증 반복                                │\n"
+            "  ├─────────────────────┼──────────────────────────────────────────────┤\n"
+            "  │ plan-review         │ 구현 전 문서 리뷰 → 문서 수정 → 재검증 반복 │\n"
+            "  └─────────────────────┴──────────────────────────────────────────────┘\n"
            "\n"
            "기본 제공 에이전트:\n"
            "  ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n"
@@ -316,34 +295,13 @@ def main(argv: list[str] | None = None) -> int:
            "\n"
            "사용 예시:\n"
            "\n"
-            "  기본 실행 (Claude가 코딩하고 Claude가 리뷰):\n"
-            "    cross-eval run --plan plan.md\n"
-            "\n"
-            "  Codex가 코딩, Claude가 리뷰:\n"
-            "    cross-eval run --plan plan.md --coder codex --reviewer claude\n"
-            "\n"
-            "  리뷰어 2명 (Claude + Codex):\n"
-            "    cross-eval run --plan plan.md --reviewer claude --reviewer codex\n"
-            "\n"
-            "  리뷰 취합용 Senior 추가:\n"
-            "    cross-eval run --plan plan.md --preset review-fix \\\n"
-            "      --reviewer claude --reviewer codex --senior codex\n"
-            "\n"
-            "  리뷰 수렴 후 자동 수정 (review-fix):\n"
-            "    cross-eval run --plan plan.md --preset review-fix \\\n"
-            "      --reviewer claude --reviewer codex\n"
-            "\n"
-            "  초기 코딩 후 리뷰 수렴 + 자동 수정 (coding-review-fix):\n"
-            "    cross-eval run --plan plan.md --preset coding-review-fix \\\n"
-            "      --reviewer claude --reviewer codex\n"
-            "\n"
-            "  기존 코드 리뷰만 (review-only):\n"
-            "    cross-eval run --plan plan.md --preset review-only \\\n"
-            "      --reviewer claude --reviewer codex\n"
+            "  코드 + 문서 구현/리뷰 루프 (coding-plan-review):\n"
+            "    cross-eval run --plan plan.md --preset coding-plan-review \\\n"
+            "      --coder claude --reviewer codex --reviewer claude --senior codex\n"
            "\n"
            "  문서 리뷰 + 수정 + 재검증 반복 (plan-review):\n"
            "    cross-eval run --plan plan.md --preset plan-review \\\n"
-            "      --coder codex --reviewer codex\n"
+            "      --coder claude --reviewer codex --reviewer claude --senior codex\n"
            "\n"
            "  모델 변경:\n"
            "    cross-eval run --plan plan.md --model sonnet\n"
@@ -420,7 +378,11 @@ def main(argv: list[str] | None = None) -> int:
    )
    agent_group.add_argument(
        "--agentic", action="store_true", default=False,
-        help="Coder를 agentic 모드로 실행 (worktree에서 파일 직접 수정, git diff로 결과 캡처)",
+        help="Coder를 agentic 모드로 실행 (파일 직접 수정, git diff로 결과 캡처)",
+    )
+    agent_group.add_argument(
+        "--worktree", action="store_true", default=False,
+        help="기본 direct mode 대신 isolated git worktree에서 실행",
    )
    agent_group.add_argument(
        "--model", default=None, metavar="MODEL",
@@ -443,15 +405,8 @@ def main(argv: list[str] | None = None) -> int:
    pipe_group = run_parser.add_argument_group("파이프라인")
    pipe_group.add_argument(
        "--preset", default=None,
-        choices=[
-            "simple",
-            "cross-review",
-            "plan-review",
-            "review-only",
-            "review-fix",
-            "coding-review-fix",
-        ],
-        help="파이프라인 종류 (기본: simple). 각 종류 설명은 아래 참조",
+        choices=["plan-review", "coding-plan-review"],
+        help="파이프라인 종류 (기본: coding-plan-review). 각 종류 설명은 아래 참조",
    )
    pipe_group.add_argument(
        "--max-iter", type=int, default=None,
@@ -560,18 +515,11 @@ def cmd_demo(args: argparse.Namespace) -> int:
 # ---------------------------------------------------------------------------

 _PRESET_DESCRIPTIONS = {
-    "simple": "코딩 + 리뷰 (가장 기본)",
-    "review-fix": "리뷰 → 취합 → 수정 → 재검증 반복",
-    "coding-review-fix": "초기 코딩 + 리뷰 수렴 반복",
+    "coding-plan-review": "입력 문서 기반 구현 후 코드+문서 리뷰/수정 반복",
    "plan-review": "문서 리뷰 → 수정 → 재검증 반복",
-    "review-only": "기존 코드만 리뷰 (코딩 없음)",
-    "cross-review": "2명이 각각 구현 후 교차 리뷰",
 }

-_PRESET_ORDER = [
-    "simple", "review-fix", "coding-review-fix",
-    "plan-review", "review-only", "cross-review",
-]
+_PRESET_ORDER = ["coding-plan-review", "plan-review"]


 def _prompt_choice(
@@ -640,7 +588,7 @@ def _run_guided_init(target: Path) -> dict:
    coder = _prompt_text("  Coder 에이전트", default="claude")
    reviewer = _prompt_text("  Reviewer 에이전트", default="claude")

-    needs_senior = preset in ("review-fix", "coding-review-fix")
+    needs_senior = preset in ("coding-plan-review", "plan-review")
    senior = ""
    if needs_senior:
        senior = _prompt_text("  Senior 에이전트", default=reviewer)
@@ -899,10 +847,10 @@ def cmd_run(args: argparse.Namespace) -> int:
    need_rebuild = args.preset is not None or args.coders or args.reviewers or args.seniors
    if need_rebuild:
        from cross_eval.prompts import PHASED_PRESETS
-        preset = args.preset or "simple"
+        preset = args.preset or "coding-plan-review"
        # Determine which preset was configured (from YAML or defaults)
        if args.preset is None and config.phases:
-            preset = config.preset_name if config.preset_name != "custom" else "review-fix"
+            preset = config.preset_name if config.preset_name != "custom" else "coding-plan-review"
        elif args.preset is None and not args.coders and not args.reviewers and not args.seniors:
            pass  # no changes needed
        inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles(
@@ -929,8 +877,6 @@ def cmd_run(args: argparse.Namespace) -> int:
        elif preset in PIPELINE_PRESETS:
            config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
            config.phases = []
-            if preset == "review-only" and args.max_iter is None and args.min_iter is None:
-                config.max_iterations = 1

    sync_phased_iterations(config)
    if args.max_iter is not None:
@@ -951,6 +897,9 @@ def cmd_run(args: argparse.Namespace) -> int:
            if coder_name in config.agents:
                _make_agentic(config.agents[coder_name])

+    if args.worktree:
+        config.use_worktree = True
+
    ensure_fix_preset_agentic(config)

    # --model: apply to ALL agents
@@ -988,7 +937,7 @@ def cmd_run(args: argparse.Namespace) -> int:
            print(f"No files found in: {docs_dir}", file=sys.stderr)
            return 1
        config.inputs["docs"] = docs_content
-        config.inputs["docs_ref"] = str(docs_dir)
+        config.inputs["docs_ref"] = docs_dir

    if args.env_files:
        for env_file in args.env_files:
@@ -1062,6 +1011,9 @@ def cmd_run(args: argparse.Namespace) -> int:
    if not args.dry_run and result.run_dir:
        print(f"Output: {result.run_dir}/")

+    if args.dry_run:
+        return 0
+
    if result.final_verdict == "ESCALATE":
        from cross_eval.report import print_escalation_report
        print_escalation_report(config, result)
--- a/cross_eval/config.py
+++ b/cross_eval/config.py
@@ -31,7 +31,10 @@ DEFAULT_ROLE_REASONING_EFFORTS = {
    "reviewer": "medium",
    "senior": "high",
 }
-FIX_STYLE_PRESETS = {"plan-review", "review-fix", "coding-review-fix"}
+FIX_STYLE_PRESETS = {
+    "plan-review",
+    "coding-plan-review",
+}


 # ---------------------------------------------------------------------------
@@ -298,8 +301,7 @@ def _default_seniors_for_preset(
        isinstance(pipeline_raw, str)
        and pipeline_raw in {
            "preset:plan-review",
-            "preset:review-fix",
-            "preset:coding-review-fix",
+            "preset:coding-plan-review",
        }
        and reviewers
    ):
@@ -382,9 +384,11 @@ def default_config() -> PipelineConfig:
    coders = ["claude-coder"]
    reviewers = ["claude-reviewer"]
    seniors: list[str] = []
-    pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
+    pipeline: list[StepConfig] = []
+    phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
    return PipelineConfig(
        output_dir=Path(".cross-eval/output"),
+        use_worktree=False,
        max_iterations=3,
        language="ko",
        execution=ExecutionConfig(),
@@ -394,6 +398,8 @@ def default_config() -> PipelineConfig:
        reviewers=reviewers,
        seniors=seniors,
        pipeline=pipeline,
+        phases=phases,
+        preset_name="coding-plan-review",
    )


@@ -437,7 +443,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
        )

    # --- roles: explicit or inferred ---
-    pipeline_raw = raw.get("pipeline", "preset:simple")
+    pipeline_raw = raw.get("pipeline", "preset:coding-plan-review")
    coders_raw = raw.get("coders")
    reviewers_raw = raw.get("reviewers")
    seniors_raw = raw.get("seniors")
@@ -498,6 +504,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:

    config = PipelineConfig(
        output_dir=output_dir,
+        use_worktree=bool(raw.get("use_worktree", False)),
        max_iterations=int(raw.get("max_iterations", 3)),
        min_iterations=int(raw.get("min_iterations", 1)),
        verbose=bool(raw.get("verbose", False)),
@@ -555,10 +562,10 @@ def _resolve_pipeline(
    """Resolve pipeline from preset string or explicit step list.

    Returns (steps, phases) tuple.  Only one will be non-empty.
-    - Simple/cross-review/plan-review/review-only → steps populated, phases empty.
-    - Phased presets (review-fix) → steps empty, phases populated.
+    - plan-review → steps populated, phases empty.
+    - coding-plan-review → steps empty, phases populated.
    """
-    # Preset: "preset:simple" or "preset:review-fix"
+    # Preset: "preset:plan-review" or "preset:coding-plan-review"
    if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
        preset_name = pipeline_raw.split(":", 1)[1]
        if preset_name in PIPELINE_PRESETS:
@@ -592,7 +599,7 @@ def _resolve_pipeline(
        return steps, []

    raise ValueError(
-        f"'pipeline' must be a preset string (e.g. 'preset:simple') "
+        f"'pipeline' must be a preset string (e.g. 'preset:plan-review') "
        f"or a list of step definitions, got {type(pipeline_raw).__name__}"
    )

--- a/cross_eval/demo.py
+++ b/cross_eval/demo.py
@@ -165,7 +165,7 @@ CYAN = "\033[36m"
 RESET = "\033[0m"


-def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
+def run_mock_demo(preset: str = "coding-plan-review", show_escalate: bool = False) -> None:
    """Run a simulated demo showing the full pipeline lifecycle."""
    steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS

@@ -229,7 +229,7 @@ def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:


 def run_live_demo(
-    preset: str = "simple",
+    preset: str = "coding-plan-review",
    timeout: int | None = None,
 ) -> PipelineResult:
    """Run a live demo with real agents using the built-in plan."""
@@ -255,8 +255,9 @@ def run_live_demo(
        pipeline = []
        phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
    else:
-        pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
-        phases = []
+        pipeline = []
+        phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
+        

    with tempfile.TemporaryDirectory() as tmpdir:
        plan_path = Path(tmpdir) / "plan.md"
--- a/cross_eval/models.py
+++ b/cross_eval/models.py
@@ -62,6 +62,7 @@ class PipelineConfig:
    """Full cross-eval configuration."""

    output_dir: Path = field(default_factory=lambda: Path(".cross-eval/output"))
+    use_worktree: bool = False
    max_iterations: int = 3
    min_iterations: int = 1
    verbose: bool = False
--- a/cross_eval/pipeline.py
+++ b/cross_eval/pipeline.py
@@ -4,6 +4,7 @@ from __future__ import annotations
 import logging
 import os
 import re
+import shutil
 import subprocess
 import time
 from hashlib import sha256
@@ -34,6 +35,19 @@ from cross_eval.runtime_env import (
 logger = logging.getLogger(__name__)


+def _get_current_head(cwd: Path) -> str | None:
+    """Return the current HEAD SHA for an existing repository."""
+    result = subprocess.run(
+        ["git", "rev-parse", "HEAD"],
+        cwd=cwd,
+        capture_output=True,
+        text=True,
+    )
+    if result.returncode != 0:
+        return None
+    return result.stdout.strip() or None
+
+
 def run_pipeline(
    config: PipelineConfig,
    cwd: Path | None = None,
@@ -62,18 +76,20 @@ def _commit_iteration(
    label: str,
    iteration: int,
    verdict: str | None,
-) -> None:
+) -> str:
    """Intermediate commit after each agentic iteration.

    This resets the diff baseline so the next iteration only captures new changes.
+    Returns the new HEAD SHA to use as the base_commit for the next iteration.
    """
-    from cross_eval.worktree import commit_worktree
+    from cross_eval.worktree import commit_worktree, get_current_head
    committed = commit_worktree(
        worktree_path,
        f"cross-eval: {label} v{iteration} ({verdict or 'no-verdict'})",
    )
    if committed:
        logger.debug("  Intermediate commit: v%d (%s)", iteration, verdict)
+    return get_current_head(worktree_path)


 def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
@@ -122,8 +138,6 @@ def _copy_inputs_to_worktree(
    Updates ``config.inputs`` in-place so subsequent reference refreshes use
    worktree-local paths.
    """
-    import shutil
-
    base_root = base_cwd.resolve()
    track_external_inputs = config.preset_name == "plan-review"
    inputs_dir = worktree_path / ".cross-eval-inputs"
@@ -132,7 +146,7 @@ def _copy_inputs_to_worktree(
        # Exclude read-only input copies from git so they don't pollute code diffs.
        (inputs_dir / ".gitignore").write_text("*\n", encoding="utf-8")
    for key, val in list(config.inputs.items()):
-        if key.endswith("_ref") or not isinstance(val, Path):
+        if not isinstance(val, Path):
            continue
        if not val.exists():
            continue
@@ -141,17 +155,71 @@ def _copy_inputs_to_worktree(
            rel_path = resolved.relative_to(base_root)
        except ValueError:
            dest = inputs_dir / val.name
-            shutil.copy2(resolved, dest)
+            _copy_path(resolved, dest)
            config.inputs[key] = dest
            continue

        worktree_target = worktree_path / rel_path
        if not worktree_target.exists():
-            worktree_target.parent.mkdir(parents=True, exist_ok=True)
-            shutil.copy2(resolved, worktree_target)
+            _copy_path(resolved, worktree_target)
        config.inputs[key] = worktree_target


+def _snapshot_input_paths(config: PipelineConfig) -> dict[str, Path]:
+    """Capture original on-disk input paths before remapping into a worktree."""
+    return {
+        key: val
+        for key, val in config.inputs.items()
+        if isinstance(val, Path)
+    }
+
+
+def _apply_worktree_inputs_to_base(
+    config: PipelineConfig,
+    original_inputs: dict[str, Path],
+    *,
+    cwd: Path,
+) -> list[Path]:
+    """Copy the final worktree-edited inputs back onto the user-provided paths."""
+    restored: list[Path] = []
+    for key, original_path in original_inputs.items():
+        current_path = config.inputs.get(key)
+        if not isinstance(current_path, Path) or not current_path.exists():
+            continue
+        if current_path.resolve() == original_path.resolve():
+            continue
+        _copy_path(current_path, original_path)
+        restored.append(original_path)
+    return restored
+
+
+def _commit_base_repo_paths(cwd: Path, paths: list[Path], message: str) -> bool:
+    """Commit changed input paths in the base repository when they live under cwd."""
+    rel_paths: list[str] = []
+    for path in paths:
+        try:
+            rel_paths.append(str(path.resolve().relative_to(cwd.resolve())))
+        except ValueError:
+            continue
+
+    if not rel_paths:
+        return False
+
+    subprocess.run(
+        ["git", "add", "--", *rel_paths],
+        cwd=cwd,
+        capture_output=True,
+        check=True,
+    )
+    result = subprocess.run(
+        ["git", "commit", "-m", message],
+        cwd=cwd,
+        capture_output=True,
+        text=True,
+    )
+    return result.returncode == 0
+
+
 def _snapshot_repo_state(cwd: Path) -> dict[str, str]:
    """Capture the base repository working-tree state.

@@ -342,18 +410,26 @@ def _run_simple_pipeline(

    # Setup shared worktree for agentic mode
    worktree_path: Path | None = None
+    agent_execution_path: Path | None = None
    agentic_branch_name: str | None = None
    agentic_base_commit: str | None = None
+    original_input_paths: dict[str, Path] = {}
    base_repo_state: dict[str, str] | None = None
    base_repo_status: str | None = None
    if not dry_run and _has_agentic_steps(config, config.pipeline):
+        if config.use_worktree:
            worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
                cwd, run_dir, config.preset_name,
            )
+            original_input_paths = _snapshot_input_paths(config)
            _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
            _refresh_input_references(config, input_contents)
            base_repo_state = _snapshot_repo_state(cwd)
            base_repo_status = _snapshot_repo_status(cwd)
+            agent_execution_path = worktree_path
+        else:
+            agent_execution_path = cwd
+            agentic_base_commit = _get_current_head(cwd)

    feedback = "(no feedback — first iteration)"
    iterations: list[IterationResult] = []
@@ -379,7 +455,7 @@ def _run_simple_pipeline(
                config.pipeline, config, input_contents, feedback,
                i, config.max_iterations, cwd, timeout, dry_run,
                run_dir=run_dir, output_iter=i,
-                worktree_path=worktree_path,
+                worktree_path=agent_execution_path,
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
@@ -387,8 +463,8 @@ def _run_simple_pipeline(
            )

            # Intermediate commit so next iteration's diff only shows new changes
-            if worktree_path is not None:
-                _commit_iteration(worktree_path, config.preset_name, i, verdict)
+            if config.use_worktree and worktree_path is not None:
+                agentic_base_commit = _commit_iteration(worktree_path, config.preset_name, i, verdict)

            iter_result = IterationResult(
                iteration=i,
@@ -478,8 +554,25 @@ def _run_simple_pipeline(
                break

    finally:
+        if config.use_worktree and worktree_path is not None and original_input_paths:
+            restored_paths = _apply_worktree_inputs_to_base(
+                config, original_input_paths, cwd=cwd,
+            )
+            if restored_paths:
+                try:
+                    committed = _commit_base_repo_paths(
+                        cwd,
+                        restored_paths,
+                        f"cross-eval: {config.preset_name} ({final_verdict})",
+                    )
+                    if committed:
+                        logger.info("  Applied and committed final input changes in base repo.")
+                    else:
+                        logger.info("  Applied final input changes in base repo (no commit created).")
+                except Exception:
+                    logger.warning("  Failed to commit final input changes in base repo", exc_info=True)
        agentic_branch: str | None = None
-        if worktree_path is not None and agentic_branch_name is not None:
+        if config.use_worktree and worktree_path is not None and agentic_branch_name is not None:
            agentic_branch = _finalize_worktree(
                cwd, worktree_path, agentic_branch_name,
                config.preset_name, final_verdict,
@@ -521,18 +614,26 @@ def _run_phased_pipeline(
    # Setup shared worktree for agentic mode
    all_phase_steps = [s for p in config.phases for s in p.steps]
    worktree_path: Path | None = None
+    agent_execution_path: Path | None = None
    agentic_branch_name: str | None = None
    agentic_base_commit: str | None = None
+    original_input_paths: dict[str, Path] = {}
    base_repo_state: dict[str, str] | None = None
    base_repo_status: str | None = None
    if not dry_run and _has_agentic_steps(config, all_phase_steps):
+        if config.use_worktree:
            worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
                cwd, run_dir, config.preset_name,
            )
+            original_input_paths = _snapshot_input_paths(config)
            _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
            _refresh_input_references(config, input_contents)
            base_repo_state = _snapshot_repo_state(cwd)
            base_repo_status = _snapshot_repo_status(cwd)
+            agent_execution_path = worktree_path
+        else:
+            agent_execution_path = cwd
+            agentic_base_commit = _get_current_head(cwd)

    iterations: list[IterationResult] = []
    feedback = "(no feedback — first iteration)"
@@ -579,7 +680,7 @@ def _run_phased_pipeline(
                    phase.steps, config, input_contents, feedback,
                    pi, phase.max_iterations, cwd, timeout, dry_run,
                    run_dir=run_dir, output_iter=global_iter, phase_name=phase.name,
-                    worktree_path=worktree_path,
+                    worktree_path=agent_execution_path,
                    runtime_env=runtime_env,
                    base_repo_state=base_repo_state,
                    base_repo_status=base_repo_status,
@@ -587,8 +688,8 @@ def _run_phased_pipeline(
                )

                # Intermediate commit so next iteration's diff only shows new changes
-                if worktree_path is not None:
-                    _commit_iteration(
+                if config.use_worktree and worktree_path is not None:
+                    agentic_base_commit = _commit_iteration(
                        worktree_path, f"{config.preset_name}/{phase.name}",
                        global_iter, verdict,
                    )
@@ -715,8 +816,25 @@ def _run_phased_pipeline(
                final_verdict = "PASS" if phase_converged else "MAX_ITERATIONS_REACHED"

    finally:
+        if config.use_worktree and worktree_path is not None and original_input_paths:
+            restored_paths = _apply_worktree_inputs_to_base(
+                config, original_input_paths, cwd=cwd,
+            )
+            if restored_paths:
+                try:
+                    committed = _commit_base_repo_paths(
+                        cwd,
+                        restored_paths,
+                        f"cross-eval: {config.preset_name} ({final_verdict})",
+                    )
+                    if committed:
+                        logger.info("  Applied and committed final input changes in base repo.")
+                    else:
+                        logger.info("  Applied final input changes in base repo (no commit created).")
+                except Exception:
+                    logger.warning("  Failed to commit final input changes in base repo", exc_info=True)
        agentic_branch: str | None = None
-        if worktree_path is not None and agentic_branch_name is not None:
+        if config.use_worktree and worktree_path is not None and agentic_branch_name is not None:
            agentic_branch = _finalize_worktree(
                cwd, worktree_path, agentic_branch_name,
                config.preset_name, final_verdict,
@@ -750,6 +868,8 @@ def _load_inputs(config: PipelineConfig) -> dict[str, str]:
    for key, val in config.inputs.items():
        if key.endswith("_ref"):
            input_contents[key] = str(val)
+        elif key == "docs":
+            input_contents[key] = _load_docs_input(config, current_value=val)
        elif isinstance(val, str):
            input_contents[key] = val
        else:
@@ -765,6 +885,8 @@ def _refresh_inputs(
    for key, val in config.inputs.items():
        if key.endswith("_ref"):
            input_contents[key] = str(val)
+        elif key == "docs":
+            input_contents[key] = _load_docs_input(config, current_value=val)
        elif isinstance(val, str):
            input_contents[key] = val
        elif isinstance(val, Path) and val.exists():
@@ -772,6 +894,40 @@ def _refresh_inputs(
    _refresh_input_references(config, input_contents)


+def _load_docs_input(config: PipelineConfig, *, current_value: Path | str) -> str:
+    """Load docs content from docs_ref when available so edits are visible next iteration."""
+    docs_ref = config.inputs.get("docs_ref")
+    docs_path = docs_ref if isinstance(docs_ref, Path) else None
+    if docs_path is not None and docs_path.exists():
+        if docs_path.is_dir():
+            return _read_docs_tree(docs_path)
+        try:
+            return docs_path.read_text(encoding="utf-8")
+        except (UnicodeDecodeError, OSError):
+            return ""
+    if isinstance(current_value, str):
+        return current_value
+    if current_value.exists() and current_value.is_file():
+        return current_value.read_text(encoding="utf-8")
+    return ""
+
+
+def _read_docs_tree(docs_dir: Path) -> str:
+    """Read all visible text files under a docs tree and concatenate them."""
+    parts: list[str] = []
+    for f in sorted(
+        path for path in docs_dir.rglob("*")
+        if path.is_file() and not any(part.startswith(".") for part in path.relative_to(docs_dir).parts)
+    ):
+        try:
+            content = f.read_text(encoding="utf-8")
+        except (UnicodeDecodeError, OSError):
+            continue
+        rel_path = f.relative_to(docs_dir).as_posix()
+        parts.append(f"### {rel_path}\n{content}")
+    return "\n\n".join(parts)
+
+
 def _refresh_input_references(
    config: PipelineConfig,
    input_contents: dict[str, str],
@@ -1701,3 +1857,12 @@ def _save_report(run_dir: Path, config: PipelineConfig, result: PipelineResult)
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report_path.write_text(report, encoding="utf-8")
    logger.info("Report saved: %s", report_path)
+
+
+def _copy_path(src: Path, dest: Path) -> None:
+    """Copy a file or directory into the worktree, preserving structure."""
+    if src.is_dir():
+        shutil.copytree(src, dest, dirs_exist_ok=True)
+        return
+    dest.parent.mkdir(parents=True, exist_ok=True)
+    shutil.copy2(src, dest)
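The `_read_docs_tree` helper added earlier in this file's diff concatenates every visible text file under a docs directory. What follows is a minimal standalone sketch of that behavior for experimentation, not the project's code: `read_docs_tree`, `demo`, and the sample file names are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def read_docs_tree(docs_dir: Path) -> str:
    """Concatenate visible UTF-8 text files under docs_dir, sorted by path."""
    parts: list[str] = []
    for path in sorted(docs_dir.rglob("*")):
        rel = path.relative_to(docs_dir)
        # Skip directories and anything located in (or named as) a dot-prefixed entry.
        if not path.is_file() or any(p.startswith(".") for p in rel.parts):
            continue
        try:
            content = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable files are skipped, not fatal
        parts.append(f"### {rel.as_posix()}\n{content}")
    return "\n\n".join(parts)


def demo() -> str:
    """Build a throwaway docs tree (hypothetical file names) and read it back."""
    with tempfile.TemporaryDirectory() as td:
        root = Path(td)
        (root / "guide").mkdir()
        (root / ".cache").mkdir()
        (root / "guide" / "setup.md").write_text("install steps", encoding="utf-8")
        (root / "api.md").write_text("endpoints", encoding="utf-8")
        (root / ".cache" / "tmp.md").write_text("ignored", encoding="utf-8")
        (root / "logo.bin").write_bytes(b"\xff\xfe\x00")  # not valid UTF-8
        return read_docs_tree(root)
```

In this sketch the hidden `.cache` folder and the undecodable `logo.bin` are both left out, so only `api.md` and `guide/setup.md` reach the prompt, each under a `###` heading with its relative path.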
--- a/cross_eval/prompts.py
+++ b/cross_eval/prompts.py
@@ -512,6 +512,218 @@ PLAN_FIX_TEMPLATE_KO = """\
 8. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
 """

+PLAN_VERIFY_TEMPLATE = """\
+You are verifying the latest planning package after plan-only revisions.
+
+## Plan
+{plan}
+
+## Checklist
+{checklist}
+
+## Reference Documents
+{docs}
+
+## Previous Review (iteration {iteration} of {max_iterations})
+{feedback}
+
+## Execution Evidence
+{execution_evidence}
+
+## Verify Instructions
+Review the latest planning package itself: the plan, checklist, and reference documents.
+You MAY inspect the current repository to confirm that the documents describe the current reality accurately enough.
+Do NOT require production code, scripts, infrastructure, or external environments to already be fixed.
+
+For `plan-review`, PASS means the documents are now clear enough to execute without further document edits.
+A known implementation gap, repo mismatch, legacy script problem, external dependency, or environment blocker is NOT a FAIL by itself if:
+- the issue is described accurately in the planning package,
+- the affected scope or gate is documented clearly,
+- the required follow-up action or non-go condition is documented clearly, and
+- the package does not misrepresent unresolved work as already complete.
+
+Only mark FAIL when the planning package still needs correction, such as:
+- unresolved ambiguity or contradiction in the documents,
+- missing prerequisite, dependency, gate, ownership, or evidence rule,
+- a known blocker that is still described inaccurately or misleadingly,
+- conflicting source-of-truth rules across the planning documents,
+- checklist or status criteria that would cause an operator to make the wrong decision.
+
+Report implementation/repository problems that are already documented correctly under "Documented Risks / Out of Scope", not as FAIL reasons.
+
+## Output Format
+
+### Remaining Document Issues
+- [Major][Omission] Description (reference specific plan/checklist/doc item)
+(Write "None" if no document issue remains.)
+
+### Documented Risks / Out of Scope
+- Description of a real implementation/repository/environment risk that is already documented correctly
+(Write "None" if nothing notable remains.)
+
+### Summary
+- Remaining document issues: N
+- Documented risks / out-of-scope items: N
+- Overall quality: [BRIEF ASSESSMENT]
+
+### Verdict
+If the planning package no longer needs document changes, output: VERDICT: PASS
+Otherwise output: VERDICT: FAIL
+"""
+
+PLAN_VERIFY_TEMPLATE_KO = """\
+당신은 plan-only 수정 이후 최신 기획 패키지를 재검증하는 검토자입니다.
+
+## 기획서
+{plan}
+
+## 체크리스트
+{checklist}
+
+## 참고 문서
+{docs}
+
+## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
+{feedback}
+
+## 실행 증거
+{execution_evidence}
+
+## 검증 지침
+최신 기획 패키지 자체를 다시 검토하세요: 기획서, 체크리스트, 참고 문서를 함께 봅니다.
+현재 저장소를 살펴보며 문서가 현실을 정확히 설명하는지 확인할 수는 있지만, 프로덕션 코드, 스크립트, 인프라, 외부 환경이 이미 수정되어 있을 것을 요구하면 안 됩니다.
+
+`plan-review`에서 PASS의 뜻은 "이제 문서를 더 고칠 필요 없이 이 계획을 실행할 수 있다"입니다.
+즉 구현 공백, 저장소 불일치, legacy 스크립트 문제, 외부 의존성, 환경 blocker가 남아 있어도 아래 조건을 만족하면 FAIL 사유가 아닙니다.
+- 그 문제가 기획 패키지에 정확히 기록되어 있고
+- 어떤 범위/게이트에 영향을 주는지 분명히 적혀 있고
+- 필요한 후속 조치나 non-go 조건이 명확히 적혀 있고
+- 아직 해결되지 않은 일을 이미 해결된 것처럼 오해하게 만들지 않는 경우
+
+반대로 아래와 같은 경우에만 FAIL로 판정하세요.
+- 문서 안에 아직 모호성이나 모순이 남아 있는 경우
+- 선행조건, 의존성, 게이트, 담당 주체, evidence 규칙이 빠진 경우
+- 알려진 blocker가 여전히 부정확하거나 오해를 부르는 방식으로 서술된 경우
+- 기획 문서들 사이에서 source-of-truth 규칙이 충돌하는 경우
+- 체크리스트나 상태 판정 기준 때문에 실행자가 잘못된 결정을 내릴 수 있는 경우
+
+이미 문서에 정확히 기록된 구현/저장소 문제는 "범위 밖 이슈" 또는 "문서화된 리스크"로만 남기고, 그 자체를 FAIL 사유로 삼지 마세요.
+
+## 출력 형식
+
+### 남은 문서 이슈
+- [Major][누락] 이슈 설명 (관련 기획서/체크리스트/참고 문서 항목 참조)
+(남은 문서 이슈가 없으면 "없음"이라고 작성하세요.)
+
+### 문서화된 리스크 / 범위 밖 이슈
+- 실제 구현/저장소/환경 리스크이지만 문서에는 이미 정확히 반영된 항목
+(해당 사항이 없으면 "없음"이라고 작성하세요.)
+
+### 요약
+- 남은 문서 이슈 수: N
+- 문서화된 리스크 / 범위 밖 항목 수: N
+- 전체 품질: [간략한 평가]
+
+### 판정
+기획 패키지를 더 수정할 필요가 없으면: VERDICT: PASS
+그렇지 않으면: VERDICT: FAIL
+"""
+
+CODING_PLAN_REVIEW_TEMPLATE = """\
+You are reviewing both the implementation and the planning package together.
+
+## Artifact References
+{artifact_references}
+
+## Execution Evidence
+{execution_evidence}
+
+## Review Instructions
+Read the referenced plan/checklist/docs/review artifacts directly from disk. \
+Inspect the current repository and evaluate BOTH:
+1. whether the implementation matches the plan/checklist/docs, and
+2. whether the planning package still accurately describes the implementation target and constraints.
+
+Report only issues that matter to delivering the original plan correctly. \
+Do not invent new scope. Distinguish between code issues, document issues, and consistency gaps between them.
+
+For each issue found, classify it with BOTH severity AND category:
+- Severity: Critical / Major / Minor
+- Category: Over-engineering / Omission
+
+If previous review feedback is available among the referenced artifacts, mark each prior item as CONFIRMED or DISMISSED.
+If you find issues outside the original plan scope, report them separately under "Out of Scope Issues".
+
+### Verdict
+If the implementation satisfies the plan/checklist and the planning package no longer needs correction, output: VERDICT: PASS
+Otherwise output: VERDICT: FAIL
+"""
+
+CODING_PLAN_REVIEW_TEMPLATE_KO = """\
+당신은 구현 결과와 기획 문서 패키지를 함께 검토하는 리뷰어입니다.
+
+## 참조 아티팩트
+{artifact_references}
+
+## 실행 증거
+{execution_evidence}
+
+## 검토 지침
+참조된 plan/checklist/docs/review markdown를 직접 읽고 현재 저장소를 확인한 뒤, 아래 두 가지를 함께 평가하세요.
+1. 현재 구현이 plan/checklist/docs와 일치하는가
+2. 기획 문서 패키지가 현재 구현 목표와 제약을 여전히 정확하게 설명하는가
+
+원래 계획을 제대로 완수하는 데 필요한 이슈만 보고하세요. 새로운 범위를 만들지 마세요.
+코드 이슈, 문서 이슈, 코드-문서 불일치를 구분해서 적으세요.
+
+발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요.
+- 심각도: Critical / Major / Minor
+- 카테고리: 과최적화 / 누락
+
+이전 리뷰 피드백이 있으면 각 항목을 CONFIRMED 또는 DISMISSED로 판정하세요.
+원래 계획 범위 밖 이슈는 "범위 밖 이슈"로 별도 분리하세요.
+
+### 판정
+구현이 plan/checklist를 충족하고 기획 문서 패키지도 더 이상 수정할 필요가 없으면: VERDICT: PASS
+그렇지 않으면: VERDICT: FAIL
+"""
+
+CODING_PLAN_FIX_TEMPLATE = """\
+You are fixing confirmed issues in both the implementation and the planning package.
+
+## Artifact References
+{artifact_references}
+
+## Current Review Feedback
+{feedback}
+
+## Instructions
+1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
+2. Fix ONLY the confirmed issues from the current review feedback.
+3. You may update both implementation files and planning artifacts when needed.
+4. Preserve the original plan intent and scope. Do not silently broaden requirements.
+5. Keep code, plan, checklist, and supporting docs consistent after edits.
+6. After editing, briefly summarize what you changed and any blocker that still needs human input.
+"""
+
+CODING_PLAN_FIX_TEMPLATE_KO = """\
+당신은 현재 리뷰에서 확정된 이슈를 코드와 기획 문서 패키지에 함께 반영하는 수정 담당자입니다.
+
+## 참조 아티팩트
+{artifact_references}
+
+## 현재 리뷰 피드백
+{feedback}
+
+## 지침
+1. 참조된 plan/checklist/docs/review markdown를 직접 읽으세요.
+2. 현재 리뷰 피드백에서 확정된 이슈만 수정하세요.
+3. 필요하면 코드와 기획 문서를 모두 수정할 수 있습니다.
+4. 최초 plan의 의도와 범위를 유지하세요. 요구사항을 몰래 넓히지 마세요.
+5. 수정 후 코드, plan, checklist, 참고 문서가 서로 모순되지 않게 유지하세요.
+6. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
+"""
+
 AGGREGATE_REVIEW_TEMPLATE = """\
 You are adjudicating multiple review results and turning them into an actionable decision.

@@ -645,6 +857,9 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
        "review": REVIEW_TEMPLATE,
        "plan-review": PLAN_REVIEW_TEMPLATE,
        "plan-fix": PLAN_FIX_TEMPLATE,
+        "plan-verify": PLAN_VERIFY_TEMPLATE,
+        "coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE,
+        "coding-plan-fix": CODING_PLAN_FIX_TEMPLATE,
        "review-only": REVIEW_ONLY_TEMPLATE,
        "aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
    },
@@ -653,6 +868,9 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
        "review": REVIEW_TEMPLATE_KO,
        "plan-review": PLAN_REVIEW_TEMPLATE_KO,
        "plan-fix": PLAN_FIX_TEMPLATE_KO,
+        "plan-verify": PLAN_VERIFY_TEMPLATE_KO,
+        "coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE_KO,
+        "coding-plan-fix": CODING_PLAN_FIX_TEMPLATE_KO,
        "review-only": REVIEW_ONLY_TEMPLATE_KO,
        "aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
    },
@@ -961,7 +1179,7 @@ def _build_plan_review_preset(
            name="verify",
            agent=senior_agent,
            role="review",
-            prompt_template="default:plan-review",
+            prompt_template="default:plan-verify",
            output_key="verify_result",
            verdict=True,
        ),
@@ -1065,16 +1283,97 @@ def _build_coding_review_fix_preset(
    ]


+def _build_coding_plan_review_preset(
+    coders: list[str], reviewers: list[str], seniors: list[str],
+) -> list[PhaseConfig]:
+    """Implement from plan/docs, then review and fix code+docs together."""
+    if not coders:
+        raise ValueError("'coding-plan-review' preset requires at least 1 coder")
+    if not reviewers:
+        raise ValueError("'coding-plan-review' preset requires at least 1 reviewer")
+
+    review_steps: list[StepConfig] = []
+    reviewer_keys = _unique_safe_keys(reviewers)
+    for reviewer, rk in zip(reviewers, reviewer_keys):
+        review_steps.append(
+            StepConfig(
+                name=f"review_{rk}",
+                agent=reviewer,
+                role="review",
+                prompt_template="default:coding-plan-review",
+                output_key=f"review_{rk}",
+                verdict=False,
+                parallel=True,
+            ),
+        )
+
+    senior_agent = seniors[0] if seniors else reviewers[0]
+    review_step_names = [f"review_{rk}" for rk in reviewer_keys]
+    review_output_keys = [f"review_{rk}" for rk in reviewer_keys]
+
+    return [
+        PhaseConfig(
+            name="initial_coding",
+            steps=[
+                StepConfig(
+                    name="coding",
+                    agent=coders[0],
+                    role="coding",
+                    prompt_template="default:coding",
+                    output_key="coding_output",
+                ),
+            ],
+            max_iterations=1,
+            consecutive_pass=1,
+        ),
+        PhaseConfig(
+            name="coding_plan_review",
+            steps=review_steps + [
+                StepConfig(
+                    name="aggregate_review",
+                    agent=senior_agent,
+                    role="review",
+                    prompt_template="default:aggregate-review",
+                    output_key="aggregate_review",
+                    context_override={
+                        "candidate_outputs": (
+                            "Current implementation and planning package under review "
+                            "(code + plan/checklist/reference docs)."
+                        ),
+                        "reviews_bundle": _build_named_bundle(
+                            reviewers, review_step_names, review_output_keys, "Review",
+                        ),
+                    },
+                ),
+                StepConfig(
+                    name="coding_plan_fix",
+                    agent=coders[0],
+                    role="coding",
+                    prompt_template="default:coding-plan-fix",
+                    output_key="coding_plan_fix_output",
+                    context_override={"feedback": "{aggregate_review}"},
+                ),
+                StepConfig(
+                    name="verify",
+                    agent=senior_agent,
+                    role="review",
+                    prompt_template="default:coding-plan-review",
+                    output_key="verify_result",
+                    verdict=True,
+                ),
+            ],
+            max_iterations=5,
+            consecutive_pass=1,
+        ),
+    ]
+
+
 PIPELINE_PRESETS: dict[str, Callable] = {
-    "simple": _build_simple_preset,
-    "cross-review": _build_cross_review_preset,
    "plan-review": _build_plan_review_preset,
-    "review-only": _build_review_only_preset,
 }

 PHASED_PRESETS: dict[str, Callable] = {
-    "review-fix": _build_review_fix_preset,
-    "coding-review-fix": _build_coding_review_fix_preset,
+    "coding-plan-review": _build_coding_plan_review_preset,
 }

 ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
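The step layout that `_build_coding_plan_review_preset` produces for its review phase can be sketched in isolation. This is a simplified illustration under stated assumptions: `Step` and `build_review_phase` are hypothetical stand-ins for `StepConfig` and the preset builder, step names use a plain index instead of `_unique_safe_keys`, and prompt templates and `context_override` are omitted.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    agent: str
    verdict: bool = False   # assumed: marks the step whose VERDICT line decides the iteration
    parallel: bool = False


def build_review_phase(
    coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[Step]:
    """One review step per reviewer, then aggregate, fix, and verify."""
    if not coders:
        raise ValueError("at least 1 coder is required")
    if not reviewers:
        raise ValueError("at least 1 reviewer is required")
    # Fall back to the first reviewer when no senior agent is configured.
    senior = seniors[0] if seniors else reviewers[0]
    steps = [
        Step(name=f"review_{i}", agent=reviewer, parallel=True)
        for i, reviewer in enumerate(reviewers, start=1)
    ]
    steps.append(Step("aggregate_review", senior))
    steps.append(Step("coding_plan_fix", coders[0]))
    steps.append(Step("verify", senior, verdict=True))
    return steps


phase = build_review_phase(
    ["claude-coder"], ["codex-reviewer", "claude-reviewer"], [],
)
```

With two reviewers and no senior, the sketch yields `review_1`, `review_2`, `aggregate_review`, `coding_plan_fix`, `verify`, and only the final `verify` step carries a verdict.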
--- a/cross_eval/worktree.py
+++ b/cross_eval/worktree.py
@@ -101,19 +101,18 @@ def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> tuple[P


 def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
-    """Capture all changes made in the worktree as a unified diff.
+    """Capture all changes made in the worktree since ``base_commit``.

-    Includes both tracked modifications, new untracked files, and changes
-    that the agent may have committed on its own.
+    Handles two scenarios:
+    1. Agent left changes uncommitted → staged with ``git add -A``, committed as a snapshot, then ``git diff base HEAD``
+    2. Agent committed its own changes → HEAD advanced, diff base..HEAD captures them

    Args:
-        base_commit: The commit SHA from when the worktree was created.
-                     If provided, diffs against this fixed base instead of HEAD.
-                     This is critical because agents (e.g. Claude in interactive
-                     mode) may create their own commits, advancing HEAD and
-                     making ``git diff --cached HEAD`` return empty.
+        base_commit: The diff anchor, typically the worktree HEAD *before* this
+                     iteration started (set by ``get_current_head`` after each
+                     ``_commit_iteration``). Falls back to ``HEAD~1`` if not given.
    """
-    # Stage any uncommitted changes so they're included in the diff
+    # Stage any uncommitted changes
    subprocess.run(
        ["git", "add", "-A"],
        cwd=worktree_path,
@@ -121,35 +120,33 @@ def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
        check=True,
    )

-    if base_commit:
-        # Diff everything (committed + staged) against the original base.
-        # This captures changes regardless of whether the agent committed them.
-        result = subprocess.run(
-            ["git", "diff", base_commit, "--cached"],
+    # Commit staged changes so everything is reachable via HEAD
+    # (if nothing is staged, git commit exits non-zero and changes nothing; the exit code is not checked)
+    subprocess.run(
+        ["git", "commit", "-m", "cross-eval: capture-diff snapshot", "--allow-empty-message"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
    )
-        diff = result.stdout.strip()
-        if diff:
-            return diff

-        # Also check committed changes (agent may have committed and left
-        # nothing staged)
+    ref = base_commit or "HEAD~1"
    result = subprocess.run(
-            ["git", "diff", base_commit, "HEAD"],
+        ["git", "diff", ref, "HEAD"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()

-    # Fallback: no base_commit, use original behavior
+
+def get_current_head(worktree_path: Path) -> str:
+    """Return the current HEAD SHA of the worktree."""
    result = subprocess.run(
-        ["git", "diff", "--cached", "HEAD"],
+        ["git", "rev-parse", "HEAD"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
+        check=True,
    )
    return result.stdout.strip()
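The effect of moving the diff anchor after each iteration can be illustrated without git. The sketch below is a simulation under stated assumptions: it uses `difflib` on in-memory strings instead of `git diff`, and `diff_text` and `run_iterations` are hypothetical names that do not exist in the project.

```python
import difflib


def diff_text(old: str, new: str) -> str:
    """Unified diff between two text snapshots (stand-in for `git diff old new`)."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="before",
            tofile="after",
        )
    )


def run_iterations(base: str, snapshots: list[str], incremental: bool) -> list[str]:
    """Return the diff a reviewer would see at each iteration."""
    anchor = base
    diffs: list[str] = []
    for snapshot in snapshots:
        diffs.append(diff_text(anchor, snapshot))
        if incremental:
            # Analogous to storing the new HEAD returned after the iteration commit.
            anchor = snapshot
    return diffs


base = "a\n"
snapshots = ["a\nb\n", "a\nb\nc\n"]
cumulative = run_iterations(base, snapshots, incremental=False)
incremental = run_iterations(base, snapshots, incremental=True)
```

With a fixed anchor, the second diff still shows the first iteration's added line `b`. With the anchor advanced, it shows only the newly added `c`, which is the behavior the commit message describes.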

--- a/plan.md
+++ b/plan.md
@@ -0,0 +1,47 @@
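+# cross-eval CLI usability refactoring
+
+## Goal
+Refactor the CLI experience of `cross-eval` so that users can quickly understand what each option means and easily choose the option combination that fits their purpose.
+
+## Background
+`cross-eval` currently provides main commands such as `init`, `run`, `demo`, and `doctor` along with many options, but first-time users cannot easily tell at a glance which option to use in which situation. In particular, the relationship among `run` presets, agent combinations, config-based execution, and direct option-based execution can feel complicated.
+
+## Requirements
+1. Refactor the CLI help or onboarding text so that even beginners can quickly understand the main flow.
+2. Users must be able to easily find the right option combination for each representative usage scenario.
+3. The roles of the main `run` options (preset, coder/reviewer/senior, config, output-related) must be more clearly visible.
+4. After `init`, the guidance must lead users naturally to what they should do next.
+5. Existing functionality must be preserved; the focus is on improving the explanation structure and usage flow rather than changing behavior itself.
+
+## User scenarios
+1. A user who has just installed the tool wants to know what to do after `cross-eval init`.
+2. A user about to execute `run` wants to quickly compare the differences between `--preset` values.
+3. A user wants to understand, with examples, in which situations to use the `claude`, `codex`, and `senior` combinations.
+4. A user wants to decide whether to use config-based execution or CLI option-based execution.
+5. A user wants to know where run results are saved and how to inspect them.
+
+## Constraints
+- Keep the existing CLI command names and core option names.
+- Do not modify the existing pipeline behavior logic unnecessarily.
+- Focus on improving guidance structure, help text, examples, and explanation flow rather than adding features.
+- Keep the documentation easy to understand for Korean-speaking users without breaking the existing project tone and structure.
+
+## Scope
+### Included
+- Improving `argparse` help/description/epilog text
+- Improving the next-step guidance shown after `init`
+- Organizing `run` usage examples and adding representative combination examples
+- Restructuring the explanation of the preset/agent/config/output concepts
+- Tidying parts of the README or onboarding text where needed
+
+### Excluded
+- Adding new presets
+- Adding new CLI options
+- Changing the pipeline execution algorithm
+- Changing how agents are invoked
+
+## Success criteria
+1. The basic usage flow is clear from reading `--help` alone.
+2. Users can copy and use run examples for each representative scenario right away.
+3. The `init → write → doctor → run → check output` flow connects naturally.
+4. Option descriptions are not merely long but are structured to help with actual selection decisions.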
--- a/tests/test_agentic.py
+++ b/tests/test_agentic.py
@@ -490,6 +490,8 @@ class TestMakeAgenticCodex(unittest.TestCase):
 def _make_agentic_config(
    run_dir: Path,
    agentic_coder: bool = True,
+    *,
+    use_worktree: bool = False,
 ) -> PipelineConfig:
    """Build a config with an agentic coder + non-agentic reviewer."""
    coder = AgentConfig(
@@ -521,6 +523,7 @@ def _make_agentic_config(
    ]
    return PipelineConfig(
        output_dir=run_dir,
+        use_worktree=use_worktree,
        max_iterations=2,
        min_iterations=1,
        language="en",
@@ -551,7 +554,7 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            run_dir = Path(td)
-            config = _make_agentic_config(run_dir)
+            config = _make_agentic_config(run_dir, use_worktree=True)

            wt_path = run_dir / "work"
            wt_path.mkdir()
@@ -573,6 +576,44 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
            mock_setup.assert_called_once()


+class TestDirectAgenticMode(unittest.TestCase):
+    """Agentic coders run in the current working tree by default."""
+
+    @patch("cross_eval.pipeline._setup_worktree")
+    @patch("cross_eval.pipeline.invoke_agent_agentic")
+    @patch("cross_eval.pipeline.invoke_agent")
+    def test_agentic_uses_current_worktree_by_default(
+        self,
+        mock_invoke: MagicMock,
+        mock_invoke_agentic: MagicMock,
+        mock_setup: MagicMock,
+    ) -> None:
+        with tempfile.TemporaryDirectory() as td:
+            repo = Path(td)
+            _init_git_repo(repo)
+            run_dir = repo / ".cross-eval" / "output"
+            run_dir.mkdir(parents=True, exist_ok=True)
+            config = _make_agentic_config(run_dir)
+
+            mock_invoke_agentic.return_value = AgentResult(
+                output="diff output", exit_code=0,
+                agent_name="claude-coder", step_name="coding",
+                duration_seconds=0.1,
+            )
+            mock_invoke.return_value = AgentResult(
+                output="VERDICT: PASS", exit_code=0,
+                agent_name="claude-reviewer", step_name="review",
+                duration_seconds=0.1,
+            )
+
+            run_pipeline(config, cwd=repo)
+
+            mock_setup.assert_not_called()
+            self.assertEqual(mock_invoke_agentic.call_args.kwargs["worktree_path"], repo)
+            reviewer_call = mock_invoke.call_args
+            self.assertEqual(reviewer_call.kwargs["cwd"], repo)
+
+
 class TestSetupWorktreeLocation(unittest.TestCase):
    """_setup_worktree places agentic worktrees outside the base repo."""

@@ -618,7 +659,7 @@ class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            run_dir = Path(td)
-            config = _make_agentic_config(run_dir)
+            config = _make_agentic_config(run_dir, use_worktree=True)

            wt_path = run_dir / "work"
            wt_path.mkdir()
@@ -660,7 +701,7 @@ class TestCommitIterationCalled(unittest.TestCase):
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            run_dir = Path(td)
-            config = _make_agentic_config(run_dir)
+            config = _make_agentic_config(run_dir, use_worktree=True)

            wt_path = run_dir / "work"
            wt_path.mkdir()
@@ -702,7 +743,7 @@ class TestFinalizeWorktreeCalled(unittest.TestCase):
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            run_dir = Path(td)
-            config = _make_agentic_config(run_dir)
+            config = _make_agentic_config(run_dir, use_worktree=True)

            wt_path = run_dir / "work"
            wt_path.mkdir()
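The direct-mode default that `TestDirectAgenticMode` exercises comes down to one decision: run in the current working tree unless a worktree is explicitly requested. A minimal sketch of that decision follows. `select_execution_path` and `fake_setup` are hypothetical stand-ins and do not reproduce the pipeline's actual setup code.

```python
from pathlib import Path
from typing import Callable


def select_execution_path(
    cwd: Path,
    use_worktree: bool,
    setup_worktree: Callable[[Path], Path],
) -> Path:
    """Return the directory an agentic step should run in."""
    if use_worktree:
        # Opt-in: create an isolated worktree and run there.
        return setup_worktree(cwd)
    # Default (direct mode): edit the current working tree in place.
    return cwd


created: list[Path] = []


def fake_setup(cwd: Path) -> Path:
    """Stand-in that records the call instead of running git."""
    created.append(cwd)
    return cwd / "worktree"


direct = select_execution_path(Path("repo"), False, fake_setup)
isolated = select_execution_path(Path("repo"), True, fake_setup)
```

In the sketch, the setup callback runs only for the opt-in call, mirroring the `mock_setup.assert_not_called()` check in the direct-mode test.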
--- a/tests/test_config.py
+++ b/tests/test_config.py
@@ -331,7 +331,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
        )
        self.assertEqual(
            _default_seniors_for_preset(
-                "preset:review-fix",
+                "preset:coding-plan-review",
                ["codex-reviewer", "claude-reviewer"],
                BUILTIN_AGENTS,
            ),
@@ -339,7 +339,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
        )
        self.assertEqual(
            _default_seniors_for_preset(
-                "preset:review-fix",
+                "preset:coding-plan-review",
                ["claude-reviewer"],
                BUILTIN_AGENTS,
            ),
@@ -347,15 +347,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
        )
        self.assertEqual(
            _default_seniors_for_preset(
-                "preset:coding-review-fix",
-                ["codex-reviewer"],
-                BUILTIN_AGENTS,
-            ),
-            ["codex-senior"],
-        )
-        self.assertEqual(
-            _default_seniors_for_preset(
-                "preset:simple",
+                "preset:unknown",
                ["codex-reviewer"],
                BUILTIN_AGENTS,
            ),
@@ -1019,7 +1011,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
                "  checklist: checklist.md\n"
                "coders: [claude-coder]\n"
                "reviewers: [claude-reviewer]\n"
-                "pipeline: preset:review-fix\n"
+                "pipeline: preset:coding-plan-review\n"
                f"max_iterations: {max_iterations}\n"
                "language: en\n"
            ),
@@ -1031,8 +1023,9 @@ class FixPresetBehaviorTest(unittest.TestCase):
        with tempfile.TemporaryDirectory() as tmpdir:
            config = load_config(self._write_fix_config(Path(tmpdir), max_iterations=7))

-        self.assertEqual(config.preset_name, "review-fix")
-        self.assertEqual(config.phases[0].max_iterations, 7)
+        self.assertEqual(config.preset_name, "coding-plan-review")
+        self.assertEqual(config.phases[0].max_iterations, 1)
+        self.assertEqual(config.phases[1].max_iterations, 7)
        self.assertTrue(config.agents["claude-coder"].agentic)
        self.assertNotIn("-p", config.agents["claude-coder"].args)

@@ -1042,7 +1035,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
            captured: dict[str, object] = {}

            def _fake_run_pipeline(config, **kwargs):
-                captured["phase_max"] = config.phases[0].max_iterations
+                captured["phase_max"] = config.phases[1].max_iterations
                captured["agentic"] = config.agents[config.coders[0]].agentic
                return PipelineResult(
                    iterations=[],
@@ -1062,13 +1055,13 @@ class FixPresetBehaviorTest(unittest.TestCase):
        self.assertEqual(captured["phase_max"], 9)
        self.assertTrue(captured["agentic"])

-    def test_run_preset_review_fix_auto_enables_agentic_without_flag(self) -> None:
+    def test_run_preset_coding_plan_review_auto_enables_agentic_without_flag(self) -> None:
        captured: dict[str, object] = {}

        def _fake_run_pipeline(config, **kwargs):
            captured["preset"] = config.preset_name
            captured["agentic"] = config.agents[config.coders[0]].agentic
-            captured["phase_max"] = config.phases[0].max_iterations
+            captured["phase_max"] = config.phases[1].max_iterations
            return PipelineResult(
                iterations=[],
                final_verdict="PASS",
@@ -1076,10 +1069,10 @@ class FixPresetBehaviorTest(unittest.TestCase):
            )

        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
-            exit_code = main(["run", "--preset", "review-fix", "--dry-run"])
+            exit_code = main(["run", "--preset", "coding-plan-review", "--dry-run"])

        self.assertEqual(exit_code, 0)
-        self.assertEqual(captured["preset"], "review-fix")
+        self.assertEqual(captured["preset"], "coding-plan-review")
        self.assertTrue(captured["agentic"])
        self.assertEqual(captured["phase_max"], 3)

@@ -1089,6 +1082,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
        def _fake_run_pipeline(config, **kwargs):
            captured["preset"] = config.preset_name
            captured["agentic"] = config.agents[config.coders[0]].agentic
+            captured["use_worktree"] = config.use_worktree
            captured["seniors"] = list(config.seniors)
            captured["steps"] = [step.name for step in config.pipeline]
            captured["max_iter"] = config.max_iterations
@@ -1104,6 +1098,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
        self.assertEqual(exit_code, 0)
        self.assertEqual(captured["preset"], "plan-review")
        self.assertTrue(captured["agentic"])
+        self.assertFalse(captured["use_worktree"])
        self.assertEqual(captured["seniors"], ["claude-senior"])
        self.assertEqual(
            captured["steps"],
@@ -1111,6 +1106,36 @@ class FixPresetBehaviorTest(unittest.TestCase):
        )
        self.assertEqual(captured["max_iter"], 3)

+    def test_run_worktree_flag_enables_isolated_worktree_mode(self) -> None:
+        captured: dict[str, object] = {}
+
+        def _fake_run_pipeline(config, **kwargs):
+            captured["use_worktree"] = config.use_worktree
+            return PipelineResult(
+                iterations=[],
+                final_verdict="PASS",
+                run_dir=Path(".cross-eval/output"),
+            )
+
+        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
+            exit_code = main(["run", "--preset", "plan-review", "--dry-run", "--worktree"])
+
+        self.assertEqual(exit_code, 0)
+        self.assertTrue(captured["use_worktree"])
+
+    def test_run_dry_run_returns_zero_even_when_not_pass(self) -> None:
+        def _fake_run_pipeline(config, **kwargs):
+            return PipelineResult(
+                iterations=[],
+                final_verdict="MAX_ITERATIONS_REACHED",
+                run_dir=Path(".cross-eval/output"),
+            )
+
+        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
+            exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
+
+        self.assertEqual(exit_code, 0)
+
    def test_run_senior_model_override_applies_only_to_seniors(self) -> None:
        captured: dict[str, list[str]] = {}

@@ -1127,7 +1152,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
            exit_code = main([
                "run",
-                "--preset", "review-fix",
+                "--preset", "coding-plan-review",
                "--coder", "claude",
                "--reviewer", "claude",
                "--senior", "claude",
@@ -1155,7 +1180,7 @@ class OutputDirectoryResolutionTest(unittest.TestCase):
                    "  plan: plan.md\n"
                    "coders: [claude-coder]\n"
                    "reviewers: [claude-reviewer]\n"
-                    "pipeline: preset:simple\n"
+                    "pipeline: preset:coding-plan-review\n"
                    "output_dir: .cross-eval/output\n"
                ),
                encoding="utf-8",
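
Reviewer note on the CLI tests above: they check two behaviors, that `--worktree` sets `config.use_worktree` and that `--dry-run` exits with 0 even when the verdict is not `PASS`. The sketch below shows one way such wiring could look with `argparse`. The names `build_parser` and `exit_code_for` are hypothetical, and the non-zero exit for a non-`PASS` verdict outside dry-run is an assumption inferred from the test name, not taken from `cross_eval`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the flags exercised by the tests are modeled."""
    parser = argparse.ArgumentParser(prog="cross-eval-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run")
    run.add_argument("--preset", default="coding-plan-review")
    run.add_argument("--dry-run", action="store_true")
    # Direct mode is the default; an isolated worktree is opt-in.
    run.add_argument("--worktree", dest="use_worktree", action="store_true")
    return parser


def exit_code_for(verdict: str, dry_run: bool) -> int:
    """Map a final verdict to a process exit code (assumed policy)."""
    if dry_run:
        # A dry run never fails the process, whatever the verdict says.
        return 0
    return 0 if verdict == "PASS" else 1


args = build_parser().parse_args(
    ["run", "--preset", "plan-review", "--dry-run", "--worktree"]
)
code = exit_code_for("MAX_ITERATIONS_REACHED", dry_run=args.dry_run)
```

Using `store_true` keeps `use_worktree` at `False` unless the flag is passed, which matches the `assertFalse(captured["use_worktree"])` check in the plan-review test.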
--- a/tests/test_onboarding.py
+++ b/tests/test_onboarding.py
@@ -55,7 +55,7 @@ class DoctorCheckInstalledTest(unittest.TestCase):
            config_path = ce_dir / "config.yaml"
            config_path.write_text(
                "inputs:\n  plan: plan.md\ncoders: [claude-coder]\n"
-                "reviewers: [claude-reviewer]\npipeline: preset:simple\n",
+                "reviewers: [claude-reviewer]\npipeline: preset:coding-plan-review\n",
                encoding="utf-8",
            )
            # Also create plan.md so validation passes
@@ -137,22 +137,22 @@ class DemoTest(unittest.TestCase):
    def test_mock_demo_runs_without_error(self) -> None:
        # Should not raise
        with patch("sys.stdout"):
-            run_mock_demo(preset="simple")
+            run_mock_demo(preset="coding-plan-review")

    def test_mock_demo_escalate_runs_without_error(self) -> None:
        with patch("sys.stdout"):
-            run_mock_demo(preset="simple", show_escalate=True)
+            run_mock_demo(preset="coding-plan-review", show_escalate=True)

    def test_cmd_demo_mock_default(self) -> None:
        with patch("cross_eval.demo.run_mock_demo") as mock:
            exit_code = main(["demo"])
-        mock.assert_called_once_with(preset="simple", show_escalate=False)
+        mock.assert_called_once_with(preset="coding-plan-review", show_escalate=False)
        self.assertEqual(exit_code, 0)

    def test_cmd_demo_escalate_flag(self) -> None:
        with patch("cross_eval.demo.run_mock_demo") as mock:
            exit_code = main(["demo", "--escalate"])
-        mock.assert_called_once_with(preset="simple", show_escalate=True)
+        mock.assert_called_once_with(preset="coding-plan-review", show_escalate=True)
        self.assertEqual(exit_code, 0)

    def test_cmd_demo_live_requires_confirmation(self) -> None:
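
Reviewer note: the commit message describes moving the diff anchor forward after every iteration, and the updated `_commit_iteration` test in `tests/test_runtime_misc.py` checks that the new HEAD is returned. The toy simulation below illustrates why this matters. It uses plain strings as stand-ins for committed tree states and `difflib.ndiff` in place of `git diff`, so it is a sketch of the idea only; `changed_lines` and `review_diffs` are hypothetical names, not functions from the project.

```python
import difflib


def changed_lines(old: str, new: str) -> list[str]:
    """Return only the added/removed lines between two snapshots."""
    delta = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in delta if line.startswith(("+ ", "- "))]


def review_diffs(snapshots: list[str], incremental: bool) -> list[list[str]]:
    """Compute the diff a reviewer would see after each iteration.

    snapshots[0] is the base state; each later entry is the state
    committed at the end of one iteration.
    """
    anchor = snapshots[0]
    diffs = []
    for state in snapshots[1:]:
        diffs.append(changed_lines(anchor, state))
        if incremental:
            # Move the anchor to the state just committed, so the next
            # diff shows only what the next iteration changes.
            anchor = state
    return diffs


states = ["a", "a\nb", "a\nb\nc"]
cumulative = review_diffs(states, incremental=False)
per_iteration = review_diffs(states, incremental=True)
```

With the anchor fixed at the base, the second diff contains the first iteration's change again (`+ b` and `+ c`). With the anchor advanced, it contains only `+ c`, which is the part a reviewer has not yet seen.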
--- a/tests/test_runtime_misc.py
+++ b/tests/test_runtime_misc.py
@@ -16,13 +16,17 @@ from cross_eval.agent import (
 )
 from cross_eval.models import AgentConfig, AgentResult, ExecutionConfig, PipelineConfig, StepConfig
 from cross_eval.pipeline import (
+    _apply_worktree_inputs_to_base,
+    _commit_base_repo_paths,
    _copy_inputs_to_worktree,
    _commit_iteration,
    _execute_parallel_batch,
    _execute_step,
    _finalize_worktree,
    _format_runtime_error_markdown,
+    _load_inputs,
    _maybe_save_step_transcript,
+    _refresh_inputs,
    _snapshot_repo_state,
 )
 from cross_eval.runtime_env import (
@@ -155,6 +159,110 @@ class TestWorktreeInputMapping(unittest.TestCase):
                    capture_output=True,
                )

+    def test_plan_review_docs_ref_maps_to_worktree_and_refreshes_docs(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            repo = Path(tmpdir) / "repo"
+            repo.mkdir()
+            _init_git_repo(repo)
+            docs_dir = repo / "plans"
+            docs_dir.mkdir()
+            (docs_dir / "A.md").write_text("A v1\n", encoding="utf-8")
+            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
+            subprocess.run(
+                ["git", "commit", "-m", "add docs"],
+                cwd=repo,
+                capture_output=True,
+                check=True,
+            )
+
+            config = PipelineConfig(
+                inputs={
+                    "docs": "stale snapshot",
+                    "docs_ref": docs_dir,
+                },
+                preset_name="plan-review",
+            )
+            input_contents = _load_inputs(config)
+            self.assertIn("A.md", input_contents["docs"])
+
+            worktree_dir = Path(tmpdir) / "wt"
+            branch = "cross-eval/test-docs-ref"
+            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
+            try:
+                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
+                self.assertEqual(config.inputs["docs_ref"], worktree_path / "plans")
+
+                updated = worktree_path / "plans" / "A.md"
+                updated.write_text("A v2\n", encoding="utf-8")
+                _refresh_inputs(config, input_contents)
+                self.assertIn("A.md", input_contents["docs"])
+                self.assertIn("A v2", input_contents["docs"])
+            finally:
+                remove_worktree(base_cwd=repo, work_dir=worktree_path)
+                subprocess.run(
+                    ["git", "branch", "-D", branch],
+                    cwd=repo,
+                    capture_output=True,
+                )
+
+    def test_worktree_doc_changes_apply_back_and_commit_in_base_repo(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            repo = Path(tmpdir) / "repo"
+            repo.mkdir()
+            _init_git_repo(repo)
+            docs_dir = repo / "plans"
+            docs_dir.mkdir()
+            doc_path = docs_dir / "A.md"
+            doc_path.write_text("A v1\n", encoding="utf-8")
+            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
+            subprocess.run(
+                ["git", "commit", "-m", "add docs"],
+                cwd=repo,
+                capture_output=True,
+                check=True,
+            )
+
+            config = PipelineConfig(
+                inputs={"docs_ref": docs_dir},
+                preset_name="plan-review",
+            )
+            original_inputs = {"docs_ref": docs_dir}
+
+            worktree_dir = Path(tmpdir) / "wt"
+            branch = "cross-eval/test-apply-back"
+            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
+            try:
+                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
+                worktree_doc = config.inputs["docs_ref"] / "A.md"
+                worktree_doc.write_text("A v2\n", encoding="utf-8")
+
+                restored = _apply_worktree_inputs_to_base(
+                    config, original_inputs, cwd=repo,
+                )
+                self.assertEqual(restored, [docs_dir])
+                self.assertEqual(doc_path.read_text(encoding="utf-8"), "A v2\n")
+
+                committed = _commit_base_repo_paths(
+                    repo, restored, "cross-eval: plan-review (FAIL)",
+                )
+                self.assertTrue(committed)
+
+                log = subprocess.run(
+                    ["git", "log", "-1", "--pretty=%s"],
+                    cwd=repo,
+                    capture_output=True,
+                    text=True,
+                    check=True,
+                )
+                self.assertEqual(log.stdout.strip(), "cross-eval: plan-review (FAIL)")
+            finally:
+                remove_worktree(base_cwd=repo, work_dir=worktree_path)
+                subprocess.run(
+                    ["git", "branch", "-D", branch],
+                    cwd=repo,
+                    capture_output=True,
+                )
+
    def test_classify_unknown_failure(self) -> None:
        failure_type, suggested_action = _classify_agent_failure("weird crash")
        self.assertEqual(failure_type, "UNKNOWN")
@@ -413,11 +521,13 @@ class TestInvokeAgenticRuntime(unittest.TestCase):


 class TestPipelineHelpers(unittest.TestCase):
+    @patch("cross_eval.worktree.get_current_head", return_value="a" * 40)
    @patch("cross_eval.worktree.commit_worktree", return_value=True)
-    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock) -> None:
+    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock, mock_head: MagicMock) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
-            _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
+            new_head = _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
        mock_commit.assert_called_once()
+        self.assertEqual(new_head, "a" * 40)

    def test_snapshot_repo_state_includes_untracked_digest(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir: