Fix plan-review worktree document tracking

Make plan-review a review-fix-verify loop
fix: capture_diff uses base commit to handle agent self-commits
2026-03-15 00:35:42 +09:00 · 2026-03-15 00:01:26 +09:00 · 2026-03-14 23:59:53 +09:00
14 changed files with 659 additions and 111 deletions
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -53,7 +53,7 @@ agents:
 # Option 1: use a preset (no need to write the pipeline YAML by hand)
 pipeline: preset:simple          # "A generates → B reviews" (default)
 # pipeline: preset:cross-review  # "both generate → review each other"
-# pipeline: preset:plan-review   # "pre-implementation document/plan review"
+# pipeline: preset:plan-review   # "pre-implementation document review → fix → re-verify loop"
 # pipeline: preset:coding-review-fix  # "one initial coding pass → review/fix loop"
 # Option 2: custom steps (for advanced users)
@@ -77,7 +77,7 @@ pipeline: preset:simple          # "A generates → B reviews" (default)
 |--------|------|-------------------|
 | `simple` | A codes → B reviews | coding(agent1) → review(agent2) |
 | `cross-review` | Both code, then review each other | coding_a → coding_b → review_of_b(agent_a) → review_of_a(agent_b) |
-| `plan-review` | Pre-implementation document review | parallel plan_review_* → senior_review(optional) |
+| `plan-review` | Pre-implementation document review/fix/re-verify loop | plan_review_* → aggregate_review → plan_fix → verify |
 | `coding-review-fix` | Initial coding, then review/fix loop | initial_coding(coding) → review_fix(review* → aggregate → coding → verify) |
 A preset automatically builds the appropriate pipeline steps and context_override internally. agent1 and agent2 are assigned in the order defined under agents. If the presets are not enough, you can write the steps yourself.
@@ -185,3 +185,6 @@ Generate final-report.md
    --reviewer-effort high \
    --senior-effort xhigh \
    --max-iter 10
 cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-review-fix --lang ko --max-iter 1
--- a/README.md
+++ b/README.md
@@ -112,7 +112,7 @@ pipeline: preset:simple
 |--------|------|
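 | `simple` | Agent A codes, Agent B reviews (default) |
 | `cross-review` | Both code, then cross-review each other |
-| `plan-review` | Before implementation, review the plan/checklist/reference docs and, if needed, check consistency with the current codebase |
+| `plan-review` | Before implementation, review the plan/checklist/reference docs, revise the documents, and repeat through re-verification |
 | `review-only` | Review existing code only, for audit purposes |
 | `review-fix` | Aggregate review results, then repeat automatic fixes and re-verification |
 | `coding-review-fix` | After one initial coding pass, aggregate review results and repeat automatic fixes and re-verification |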
@@ -120,6 +120,6 @@ pipeline: preset:simple
 ```bash
 # Initialization options
 cross-eval init --preset cross-review   # cross-review preset
-cross-eval init --preset plan-review    # pre-implementation document review preset
+cross-eval init --preset plan-review    # document review/fix/re-verify preset
 cross-eval init --lang en               # English templates
 ```
--- a/UX_IMPROVEMENT_PLAN.md
+++ b/UX_IMPROVEMENT_PLAN.md
@@ -0,0 +1,178 @@
 # cross-eval UX Improvement Plan
 > Raise the overall quality of user guidance messages, error messages, and help text
 > so that even first-time users can run a pipeline without getting stuck.
 ---
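 ## 1. Improve CLI help text
 ### 1.1 `cross-eval` main help
 - [ ] Add a one-line summary of "what problem this tool solves" to the main description
  - Current: "A CLI tool that automatically verifies the output of AI coding agents"
  - Improved: "Automatic cross-verification that an AI coding agent implemented what the plan specifies. Catches over-optimization, omissions, and false passes"
 - [ ] Add a one-line description per subcommand to the main help (init/doctor/demo/run)
 ### 1.2 `cross-eval run` help
 - [ ] The preset table in the epilog is too long; add a three-line "quick selection guide"
  - e.g. "Start with simple if you are new, use review-only for review alone, and coding-review-fix for coding + review + automatic fixes"
 - [ ] List the aliases (extra-high, x-high, etc.) in the `--reasoning-effort` help
 - [ ] Explain how the `--target` option actually affects the prompt
 - [ ] Add a summary of worktree creation/cleanup behavior to the `--agentic` flag description
 - [ ] Add one line to the `--min-iter` description on "why iterate even after a PASS"
  - e.g. "For checking result stability. Re-verifies that a single PASS was not a fluke"
 - [ ] Make the `--dry-run` description explicit: "preview prompts only, without invoking agents"
 - [ ] Make the agent shorthand rule (claude → claude-coder, etc.) clearer, with examples
 ### 1.3 `cross-eval init` help
 - [ ] Make the `--guided` option more prominent, with wording such as "--guided is recommended for first-time users"
 - [ ] Add one line per generated file on "how each file is used"
 ### 1.4 `cross-eval doctor` help
 - [ ] Show the list of checked items up front
 - [ ] Include concrete commands for "what to do when authentication fails"
 ### 1.5 `cross-eval demo` help
 - [ ] Add a comparison so the mock vs live difference is visible at a glance
 - [ ] Emphasize that the `--escalate` option is mock-only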
 ---
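 ## 2. Improve error messages
 ### 2.1 Missing required input
 - [ ] Clear error when `cross-eval run` is executed without `--plan`:
  - "A plan is required. Specify it with --plan plan.md or in inputs.plan of .cross-eval/config.yaml."
 - [ ] Add an INFO message noting that defaults are in use when running without config.yaml
 ### 2.2 Agent failure messages
 - [ ] On `AUTH` failure, suggest a concrete command to resolve it
  - Claude: "Authenticate with claude login"
  - Codex: "Authenticate with codex auth"
 - [ ] On `USAGE_LIMIT`, hint at which limit was hit (tokens? billing?)
 - [ ] On `EMPTY_DIFF`, "The agent did not modify any files" + a list of possible causes
 - [ ] On `WRITE_FAILURE`, print the worktree path and permission state
 - [ ] On empty agent output, suggest causes such as "The agent did not respond. The prompt may be too long or authentication may have expired"
 ### 2.3 Configuration validation errors
 - [ ] Make the duplicate step name error specific about "which step in which phase is duplicated"
 - [ ] When a nonexistent agent is referenced, include an "Available agents: ..." list (already present; verify)
 - [ ] Include the line number in YAML parse errors
 ### 2.4 File/path errors
 - [ ] Replace "File not found: {path}" with a Korean message equivalent to "File not found: {path}\n  Current directory: {cwd}"
 - [ ] When the docs directory is empty → "The reference docs folder is empty: {path}\n  Add document files such as .md or .txt"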
 ---
 ## 3. Improve progress messages
 ### 3.1 During pipeline execution
 - [ ] Print a summary banner when the run starts:
  ```
  ━━━ cross-eval ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Plan:      .cross-eval/plan.md
    Preset:    simple (coding→review→repeat)
    Coder:     claude-coder
    Reviewer:  claude-reviewer
    Max iter:  3
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ```
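 - [ ] At the start of each iteration, print a one-line description of "what this stage is about to do"
  - e.g. "Iteration 1/3 - Coder is doing the initial implementation from the plan..."
  - e.g. "Iteration 2/3 - Applying review feedback and fixing..."
 - [ ] On timeout, print both the elapsed time and the time limit
 ### 3.2 Result summary
 - [ ] Add elapsed time to the final result
 - [ ] On FAIL, a brief summary of "the N main issues raised in the last review"
 - [ ] On ESCALATE, a 1-2 line summary of why a human needs to look
 - [ ] At the end of a dry-run, state "This is a preview. Remove --dry-run to run for real"
 ### 3.3 Auto-escalation guidance
 - [ ] When auto-escalation triggers, explain "N consecutive FAILs → automatic escalation"
 - [ ] Mention in the run help which conditions trigger auto-escalation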
 ---
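 ## 4. Improve the first-use experience (onboarding)
 ### 4.1 Guidance after init
 - [ ] Include a real example in the plan.md template (currently only a minimal skeleton)
  - One concrete example under "## Functional requirements"
 - [ ] Add a checklist-writing guide plus examples to the checklist.md template
 - [ ] Make the "next steps" guidance after init more concrete:
  - Current: "1. Write the plan in plan.md"
  - Improved: "1. Open .cross-eval/plan.md and write the plan (e.g. features to implement, API spec, DB schema)"
 ### 4.2 doctor improvements
 - [ ] When all checks pass, show "Ready! Run cross-eval run --plan .cross-eval/plan.md"
 - [ ] On authentication failure, include per-OS install/authentication guide URLs
 ### 4.3 demo improvements
 - [ ] After the demo finishes, add "To get started in a real project:" guidance
 - [ ] In the mock demo, explain what each step does in a comment-like style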
 ---
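 ## 5. Terminology consistency
 - [ ] Consistently distinguish "agent name" from "agent role"
  - Name: claude-coder, codex-reviewer (the actual execution unit)
  - Role: coder, reviewer, senior (the logical role)
 - [ ] Unify verdict notation: always uppercase `PASS` / `FAIL` / `ESCALATE`
 - [ ] Sort out the terms "preset" vs "pipeline"
  - Refer to `--preset` consistently as the "pipeline type"
 - [ ] Reduce Korean/English mixing; minimize unnecessary English in Korean mode
  - However, keep verdicts such as PASS/FAIL/ESCALATE in English (readability)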
 ---
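 ## 6. Output directory structure guidance
 - [ ] Print a summary of the output folder structure when the run completes:
  ```
  Output: .cross-eval/output/
    ├── iter-1/          (agent output for each iteration)
    ├── iter-2/
    └── final-report.md  (final report)
  ```
 - [ ] Add a short "how to read this report" note at the top of report.md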
 ---
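 ## 7. Improve config.yaml comments
 - [ ] Expand the per-section explanatory comments in the default generated config.yaml
 - [ ] Include examples of commonly changed settings as comments
  - e.g. "# To use two reviewers: reviewer: [claude, codex]"
  - e.g. "# To modify real files in agent mode: agentic: true"
 - [ ] Add a commented example of a custom phase-based pipeline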
 ---
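 ## Priorities
 | Priority | Item | Reason |
 |---------|------|------|
 | P0 | 2.1 Missing required input errors | The most frequently hit problem |
 | P0 | 4.1 Guidance after init + templates | Users who get stuck on first use drop off |
 | P0 | 3.1 Startup summary banner | Users need to know what is running |
 | P1 | 2.2 Agent failure messages | Users do not know what to do on failure |
 | P1 | 1.2 run help cleanup | The many options cause confusion |
 | P1 | 5. Terminology consistency | Reduces confusion |
 | P2 | 3.2-3.3 Result/progress messages | Nice to have but not urgent |
 | P2 | 7. config.yaml comments | Power-user convenience |
 | P2 | 6. Output structure guidance | Understood after seeing it once |
 | P3 | 1.3-1.5 Remaining help text | Incremental improvement |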
 ---
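 ## How to test
 After changing each item:
 1. **Check help**: `cross-eval --help`, `cross-eval run --help`, etc.
 2. **Check error paths**: run with deliberately bad input → check whether the error message is useful
 3. **Simulate first use**: run the full `init → doctor → demo → run` flow in an empty directory
 4. **Verify with cross-eval itself**: use this document as plan.md and run cross-eval run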
--- a/cross_eval/agent.py
+++ b/cross_eval/agent.py
@@ -34,6 +34,12 @@ _NO_CHANGE_ACK_MARKERS = (
    "code is correct as-is",
    "already correct",
    "no action required",
+    "변경 없음",
+    "수정 없음",
+    "수정할 필요 없음",
+    "변경할 필요 없음",
+    "이미 올바름",
+    "조치 불필요",
 )
 _CHANGE_CLAIM_MARKERS = (
    "summary of all changes made",
@@ -73,6 +79,15 @@ _CHANGE_CLAIM_MARKERS = (
    "completed the implementation",
    "all changes have been made",
    "changes are complete",
+    "수정 완료",
+    "모든 수정이 완료",
+    "변경 요약",
+    "변경 파일",
+    "신규 생성",
+    "기획서 수정",
+    "체크리스트 수정",
+    "문서를 수정",
+    "문서 수정",
 )
@@ -414,6 +429,7 @@ def invoke_agent_agentic(
    env: Optional[dict[str, str]] = None,
    timeout: int | None = None,
    quiet: bool = False,
+    base_commit: str | None = None,
 ) -> AgentResult:
    """Invoke an agent in agentic mode using the worktree as the source of truth."""
    from cross_eval.worktree import capture_diff
@@ -506,8 +522,8 @@ def invoke_agent_agentic(
            suggested_action=suggested_action,
        )
-    # Capture git diff as the output (changes since last commit on the branch)
+    # Capture git diff as the output (changes since the base commit)
-    diff_output = capture_diff(worktree_path)
+    diff_output = capture_diff(worktree_path, base_commit=base_commit)
    if not diff_output:
        stdout_excerpt = (result.stdout or "").strip()
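Review note: the two marker tuples in this file now include Korean phrases. How they are consumed is not visible in these hunks, so the following is only a minimal sketch, assuming case-insensitive substring matching over the agent's stdout and assuming that a no-change acknowledgement takes priority over a change claim. `classify_agent_stdout` and the shortened tuples are hypothetical names for illustration, not the project's API.

```python
# Hypothetical sketch: only a subset of the markers from the diff is listed.
NO_CHANGE_ACK_MARKERS = ("already correct", "no action required", "변경 없음", "수정 없음")
CHANGE_CLAIM_MARKERS = ("all changes have been made", "수정 완료", "문서를 수정")


def classify_agent_stdout(stdout: str) -> str:
    """Return 'no_change', 'claims_change', or 'unknown' for an agent's stdout."""
    text = stdout.casefold()
    # Assumption: an explicit "nothing to change" acknowledgement wins.
    if any(marker.casefold() in text for marker in NO_CHANGE_ACK_MARKERS):
        return "no_change"
    if any(marker.casefold() in text for marker in CHANGE_CLAIM_MARKERS):
        return "claims_change"
    return "unknown"
```

A classification like this would let an empty diff be interpreted differently depending on whether the agent said it changed nothing or claimed edits that never reached the worktree.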
--- a/cross_eval/cli.py
+++ b/cross_eval/cli.py
@@ -205,7 +205,7 @@ def main(argv: list[str] | None = None) -> int:
        ],
        help=(
            "파이프라인 종류 (기본: simple). "
-            "simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서기획검토, "
+            "simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서리뷰수정재검증, "
            "review-only=리뷰만, review-fix=리뷰수렴+자동수정, "
            "coding-review-fix=초기코딩후리뷰수렴"
        ),
@@ -291,8 +291,8 @@ def main(argv: list[str] | None = None) -> int:
            "  │ coding-      │ 3단계 파이프라인:                                  │\n"
            "  │ review-fix   │  초기 코딩 1회 → 리뷰 취합 → 수정 → 재검증 반복   │\n"
            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ plan-review  │ 구현 전 기획서/체크리스트/문서를 검토             │\n"
+            "  │ plan-review  │ 구현 전 기획서/체크리스트/문서를 검토하고       │\n"
-            "  │              │ 필요하면 현재 코드베이스와의 정합성도 점검       │\n"
+            "  │              │ 수정한 뒤 시니어가 재검증할 때까지 반복         │\n"
            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
            "  │ review-only  │ 코드 작성 없이 Reviewer N명이 기존 코드만 검토    │\n"
            "  │              │ (이미 작성된 코드의 품질 감사용)                   │\n"
@@ -341,9 +341,9 @@ def main(argv: list[str] | None = None) -> int:
            "    cross-eval run --plan plan.md --preset review-only \\\n"
            "      --reviewer claude --reviewer codex\n"
            "\n"
-            "  구현 전 문서/기획 검토 (plan-review):\n"
+            "  문서 리뷰 + 수정 + 재검증 반복 (plan-review):\n"
            "    cross-eval run --plan plan.md --preset plan-review \\\n"
-            "      --reviewer claude --reviewer codex\n"
+            "      --coder codex --reviewer codex\n"
            "\n"
            "  모델 변경:\n"
            "    cross-eval run --plan plan.md --model sonnet\n"
@@ -563,7 +563,7 @@ _PRESET_DESCRIPTIONS = {
    "simple": "코딩 + 리뷰 (가장 기본)",
    "review-fix": "리뷰 → 취합 → 수정 → 재검증 반복",
    "coding-review-fix": "초기 코딩 + 리뷰 수렴 반복",
-    "plan-review": "구현 전 기획서/문서 검토",
+    "plan-review": "문서 리뷰 → 수정 → 재검증 반복",
    "review-only": "기존 코드만 리뷰 (코딩 없음)",
    "cross-review": "2명이 각각 구현 후 교차 리뷰",
 }
@@ -929,7 +929,7 @@ def cmd_run(args: argparse.Namespace) -> int:
        elif preset in PIPELINE_PRESETS:
            config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
            config.phases = []
-            if preset in {"plan-review", "review-only"} and args.max_iter is None and args.min_iter is None:
+            if preset == "review-only" and args.max_iter is None and args.min_iter is None:
                config.max_iterations = 1
    sync_phased_iterations(config)
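Review note: after this change only `review-only` falls back to a single iteration when neither `--max-iter` nor `--min-iter` is given, so `plan-review` keeps its configured loop count. A minimal sketch of that rule, assuming a hypothetical helper `resolve_max_iterations` and a hypothetical configured default of 3 (neither appears in the diff):

```python
from typing import Optional


def resolve_max_iterations(
    preset: str,
    max_iter: Optional[int],
    min_iter: Optional[int],
    configured_default: int = 3,  # assumed value, for illustration only
) -> int:
    """Pick the iteration cap roughly the way the updated branch in cmd_run does."""
    if max_iter is not None:
        # Assumption: an explicit --max-iter always wins.
        return max_iter
    if preset == "review-only" and min_iter is None:
        # Only review-only is pinned to one pass when no iteration flag is given.
        return 1
    return configured_default
```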
--- a/cross_eval/config.py
+++ b/cross_eval/config.py
@@ -31,7 +31,7 @@ DEFAULT_ROLE_REASONING_EFFORTS = {
    "reviewer": "medium",
    "senior": "high",
 }
-FIX_STYLE_PRESETS = {"review-fix", "coding-review-fix"}
+FIX_STYLE_PRESETS = {"plan-review", "review-fix", "coding-review-fix"}
 # ---------------------------------------------------------------------------
@@ -296,7 +296,11 @@ def _default_seniors_for_preset(
    """Infer a default senior agent for presets that benefit from adjudication."""
    if not (
        isinstance(pipeline_raw, str)
-        and pipeline_raw in {"preset:review-fix", "preset:coding-review-fix"}
+        and pipeline_raw in {
+            "preset:plan-review",
+            "preset:review-fix",
+            "preset:coding-review-fix",
+        }
        and reviewers
    ):
        return []
--- a/cross_eval/pipeline.py
+++ b/cross_eval/pipeline.py
@@ -84,50 +84,72 @@ def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
    )
-def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str]:
+def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str, str]:
    """Create a shared worktree for the entire pipeline run.
    1. Generate branch name (cross-eval/<preset>_<timestamp>)
    2. Create branch from HEAD
    3. Create worktree on that branch
-    Returns (worktree_path, branch_name).
+    Returns (worktree_path, branch_name, base_commit).
    """
    from cross_eval.worktree import create_worktree, make_branch_name, make_worktree_dir
    branch_name = make_branch_name(preset_name)
    worktree_dir = make_worktree_dir(cwd, branch_name)
-    worktree_path = create_worktree(
+    worktree_path, base_commit = create_worktree(
        base_cwd=cwd, work_dir=worktree_dir, branch_name=branch_name,
    )
    (run_dir / "worktree_path.txt").write_text(f"{worktree_path}\n", encoding="utf-8")
    (run_dir / "worktree_branch.txt").write_text(f"{branch_name}\n", encoding="utf-8")
-    return worktree_path, branch_name
+    (run_dir / "worktree_base.txt").write_text(f"{base_commit}\n", encoding="utf-8")
+    return worktree_path, branch_name, base_commit
 def _copy_inputs_to_worktree(
    config: PipelineConfig,
    worktree_path: Path,
    *,
    base_cwd: Path,
 ) -> None:
    """Copy input files (plan, checklist, etc.) into the worktree.
-    This ensures agents running in plan/read-only mode within the worktree
+    Repo-local inputs are remapped to the corresponding path inside the worktree
-    can access these files, even though the originals live in the base repo.
+    so agentic edits produce a real git diff. External inputs are copied into a
-    Updates config.inputs in-place so subsequent reference refreshes use
+    dedicated inputs directory. For ``plan-review`` these external copies remain
    tracked so document edits can survive on the branch; other presets keep them
    ignored to avoid polluting code diffs.
    Updates ``config.inputs`` in-place so subsequent reference refreshes use
    worktree-local paths.
    """
    import shutil
    base_root = base_cwd.resolve()
    track_external_inputs = config.preset_name == "plan-review"
    inputs_dir = worktree_path / ".cross-eval-inputs"
    inputs_dir.mkdir(exist_ok=True)
-    # Exclude from git so these don't pollute agentic diffs
+    if not track_external_inputs:
        # Exclude read-only input copies from git so they don't pollute code diffs.
        (inputs_dir / ".gitignore").write_text("*\n", encoding="utf-8")
    for key, val in list(config.inputs.items()):
        if key.endswith("_ref") or not isinstance(val, Path):
            continue
        if not val.exists():
            continue
        resolved = val.resolve()
        try:
            rel_path = resolved.relative_to(base_root)
        except ValueError:
            dest = inputs_dir / val.name
-        shutil.copy2(val, dest)
+            shutil.copy2(resolved, dest)
            config.inputs[key] = dest
            continue
        worktree_target = worktree_path / rel_path
        if not worktree_target.exists():
            worktree_target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(resolved, worktree_target)
        config.inputs[key] = worktree_target
 def _snapshot_repo_state(cwd: Path) -> dict[str, str]:
@@ -321,13 +343,14 @@ def _run_simple_pipeline(
    # Setup shared worktree for agentic mode
    worktree_path: Path | None = None
    agentic_branch_name: str | None = None
+    agentic_base_commit: str | None = None
    base_repo_state: dict[str, str] | None = None
    base_repo_status: str | None = None
    if not dry_run and _has_agentic_steps(config, config.pipeline):
-        worktree_path, agentic_branch_name = _setup_worktree(
+        worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
            cwd, run_dir, config.preset_name,
        )
-        _copy_inputs_to_worktree(config, worktree_path)
+        _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
        _refresh_input_references(config, input_contents)
        base_repo_state = _snapshot_repo_state(cwd)
        base_repo_status = _snapshot_repo_status(cwd)
@@ -360,6 +383,7 @@ def _run_simple_pipeline(
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=agentic_base_commit,
            )
            # Intermediate commit so next iteration's diff only shows new changes
@@ -498,13 +522,14 @@ def _run_phased_pipeline(
    all_phase_steps = [s for p in config.phases for s in p.steps]
    worktree_path: Path | None = None
    agentic_branch_name: str | None = None
+    agentic_base_commit: str | None = None
    base_repo_state: dict[str, str] | None = None
    base_repo_status: str | None = None
    if not dry_run and _has_agentic_steps(config, all_phase_steps):
-        worktree_path, agentic_branch_name = _setup_worktree(
+        worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
            cwd, run_dir, config.preset_name,
        )
-        _copy_inputs_to_worktree(config, worktree_path)
+        _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
        _refresh_input_references(config, input_contents)
        base_repo_state = _snapshot_repo_state(cwd)
        base_repo_status = _snapshot_repo_status(cwd)
@@ -558,6 +583,7 @@ def _run_phased_pipeline(
                    runtime_env=runtime_env,
                    base_repo_state=base_repo_state,
                    base_repo_status=base_repo_status,
+                    base_commit=agentic_base_commit,
                )
                # Intermediate commit so next iteration's diff only shows new changes
@@ -903,6 +929,7 @@ def _run_steps(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> tuple[dict[str, str], dict[str, AgentResult], str | None]:
    """Execute all steps in one iteration, parallelizing where possible."""
    step_outputs: dict[str, str] = {}
@@ -923,6 +950,7 @@ def _run_steps(
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=base_commit,
            )
        else:
            _execute_parallel_batch(
@@ -934,6 +962,7 @@ def _run_steps(
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=base_commit,
            )
    # Extract verdict from all verdict steps (ALL must PASS; ESCALATE wins over all)
@@ -961,6 +990,7 @@ def _invoke_agentic(
    env: dict[str, str] | None = None,
    timeout: int | None = None,
    quiet: bool = False,
+    base_commit: str | None = None,
 ) -> AgentResult:
    """Run an agent in agentic mode using an existing worktree."""
    return invoke_agent_agentic(
@@ -968,6 +998,7 @@ def _invoke_agentic(
        worktree_path=worktree_path,
        env=env,
        timeout=timeout, quiet=quiet,
+        base_commit=base_commit,
    )
@@ -992,6 +1023,7 @@ def _execute_step(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> None:
    """Execute a single step, updating step_outputs and step_results in place."""
    if not quiet:
@@ -1035,6 +1067,7 @@ def _execute_step(
                worktree_path=worktree_path,
                env=runtime_env,
                timeout=timeout, quiet=quiet,
+                base_commit=base_commit,
            )
        else:
            # When worktree exists, run non-agentic agents (reviewers) in
@@ -1125,6 +1158,7 @@ def _execute_parallel_batch(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> None:
    """Execute multiple steps in parallel using threads."""
    agent_names = ", ".join(s.agent for s in batch)
@@ -1139,6 +1173,7 @@ def _execute_parallel_batch(
                run_dir=run_dir, output_iter=output_iter, phase_name=phase_name,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=base_commit,
            )
        return
@@ -1161,6 +1196,7 @@ def _execute_parallel_batch(
                phase_name=phase_name, worktree_path=worktree_path,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=base_commit,
            )
        return
@@ -1204,6 +1240,7 @@ def _execute_parallel_batch(
                worktree_path=worktree_path,
                env=runtime_env,
                timeout=timeout, quiet=True,
+                base_commit=base_commit,
            )
        else:
            effective_cwd = worktree_path if worktree_path else cwd
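Review note: the updated `_copy_inputs_to_worktree` in this file remaps repo-local inputs to the same relative path inside the worktree and copies external inputs into `.cross-eval-inputs`. The sketch below isolates that per-file decision under simplifying assumptions: `remap_input` is a hypothetical helper, it handles one path at a time, and it leaves out the `.gitignore` handling that depends on the preset.

```python
import shutil
import tempfile
from pathlib import Path


def remap_input(path: Path, base_root: Path, worktree: Path) -> Path:
    """Map one input file to its worktree-local location.

    Repo-local files keep their relative path inside the worktree; files
    outside the repo are copied into a dedicated inputs directory.
    """
    resolved = path.resolve()
    try:
        rel_path = resolved.relative_to(base_root.resolve())
    except ValueError:
        # External input: copy it under .cross-eval-inputs/ in the worktree.
        inputs_dir = worktree / ".cross-eval-inputs"
        inputs_dir.mkdir(parents=True, exist_ok=True)
        dest = inputs_dir / resolved.name
        shutil.copy2(resolved, dest)
        return dest
    target = worktree / rel_path
    if not target.exists():
        # Not present in the worktree yet (for example an untracked file).
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(resolved, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        repo, worktree, outside = root / "repo", root / "wt", root / "outside"
        for directory in (repo / "docs", worktree, outside):
            directory.mkdir(parents=True)
        (repo / "docs" / "plan.md").write_text("# plan\n", encoding="utf-8")
        (outside / "checklist.md").write_text("- [ ] item\n", encoding="utf-8")
        print(remap_input(repo / "docs" / "plan.md", repo, worktree))
        print(remap_input(outside / "checklist.md", repo, worktree))
```

Keeping repo-local files at their original relative path is what lets an agent's document edits show up as an ordinary git diff on the worktree branch.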
--- a/cross_eval/prompts.py
+++ b/cross_eval/prompts.py
@@ -472,12 +472,58 @@ PLAN_REVIEW_TEMPLATE_KO = """\
 그렇지 않으면: VERDICT: FAIL
 """
 PLAN_FIX_TEMPLATE = """\
 You are tasked with revising planning documents based on adjudicated review feedback.
 ## Artifact References
 {artifact_references}
 ## Current Review Feedback
 {feedback}
 ## Instructions
 1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
 2. Update the planning package itself: the plan, checklist, and reference documents as needed.
 3. Do NOT write or modify production code. Only revise planning artifacts.
 4. Address ONLY the confirmed planning issues from the current review feedback.
 5. If feedback marks any item as DISMISSED or false positive, leave it unchanged.
 6. Make the smallest document changes that resolve ambiguity, omissions, scope creep, or repository compatibility issues.
 7. Keep the plan, checklist, and supporting docs internally consistent after your edits.
 8. After editing, briefly summarize what you changed and any blocker that still needs human input.
 """
 PLAN_FIX_TEMPLATE_KO = """\
 당신은 시니어 리뷰 결과를 바탕으로 기획 문서를 수정하는 담당자입니다.
 ## 참조 아티팩트
 {artifact_references}
 ## 현재 리뷰 피드백
 {feedback}
 ## 지침
 1. 참조된 plan/checklist/docs/review markdown를 직접 읽으세요.
 2. 수정 대상은 기획 패키지 자체입니다. 필요에 따라 기획서, 체크리스트, 참고 문서를 수정하세요.
 3. 프로덕션 코드를 작성하거나 수정하지 마세요. 기획 문서만 고치세요.
 4. 현재 리뷰 피드백에서 확정된 기획 이슈만 해결하세요.
 5. DISMISSED 또는 오탐으로 정리된 항목은 건드리지 마세요.
 6. 모호성, 누락, 과도한 범위, 저장소 정합성 문제를 해소하는 최소한의 문서 수정만 하세요.
 7. 수정 후에도 기획서, 체크리스트, 참고 문서가 서로 모순되지 않게 유지하세요.
 8. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
 """
 AGGREGATE_REVIEW_TEMPLATE = """\
 You are adjudicating multiple review results and turning them into an actionable decision.
 ## Artifact References
 {artifact_references}
 ## Candidate Artifact Under Review
 {candidate_outputs}
 ## Reviewer Findings Bundle
 {reviews_bundle}
 ## Previous Issue Tracker
 {previous_senior_tracker}
@@ -486,19 +532,19 @@ You are adjudicating multiple review results and turning them into an actionable
 ## Instructions
 Read the referenced plan/checklist/docs/review artifacts directly from disk. \
-Explore the project directory and the referenced git commit/diff to confirm the \
+Inspect the repository and referenced artifacts only as needed to confirm the \
-current codebase state. Use the execution evidence above to verify claims against \
+current target state. Use the execution evidence above to verify claims against \
 actual command outputs, artifact paths, and exit codes. Then:
 1. Deduplicate overlapping issues across reviewers.
 2. Resolve disagreements explicitly.
-3. Keep only issues supported by the plan, checklist, code, or reviewer evidence.
+3. Keep only issues supported by the plan, checklist, reference docs, repository state, or reviewer evidence.
 4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up.
-5. Produce a prioritized action list for the coder.
+5. Produce a prioritized action list for the implementer/editor.
 6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues).
 7. If no confirmed issue remains, output VERDICT: PASS.
-8. If issues exist that the coder can fix, output VERDICT: FAIL.
+8. If issues exist that the implementer/editor can fix, output VERDICT: FAIL.
 9. If issues require human intervention (ambiguous requirements, architecture decisions, \
-external dependency problems, or the same issue persists after 2+ fix attempts), \
+external dependency problems, or the same issue persists after 2+ attempts), \
 output VERDICT: ESCALATE.
 ## Output Format
@@ -512,8 +558,8 @@ output VERDICT: ESCALATE.
 (Write "None" if nothing was dismissed.)
 ### Action Items
-1. Concrete fix the coder should make
+1. Concrete fix the implementer/editor should make
-2. Concrete fix the coder should make
+2. Concrete fix the implementer/editor should make
 ## Issue Tracker
@@ -536,6 +582,12 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 ## 참조 아티팩트
 {artifact_references}
 ## 현재 검토 대상
 {candidate_outputs}
 ## 리뷰 결과 묶음
 {reviews_bundle}
 ## 이전 이슈 트래커
 {previous_senior_tracker}
@@ -543,17 +595,17 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 {execution_evidence}
 ## 지침
-참조된 plan/checklist/docs/review markdown와 git 상태를 직접 읽어 현재 코드베이스 상태를 확인한 뒤, \
+참조된 plan/checklist/docs/review markdown와 저장소 상태를 직접 읽어 현재 검토 대상의 상태를 확인한 뒤, \
 위 실행 증거를 활용하여 에이전트의 주장을 실제 명령어 출력, 아티팩트 경로, 종료 코드로 검증하세요. \
 그런 다음 아래를 수행하세요.
 1. 리뷰어들 사이에 중복되는 이슈를 합치세요.
 2. 의견 충돌은 명시적으로 정리하세요.
-3. 기획서, 체크리스트, 코드, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
+3. 기획서, 체크리스트, 참고 문서, 저장소 상태, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
 4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요.
-5. coder가 바로 수정할 수 있는 우선순위 액션 아이템을 만드세요.
+5. 수정 담당자가 바로 처리할 수 있는 우선순위 액션 아이템을 만드세요.
 6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월).
 7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요.
-8. coder가 수정 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
+8. 수정 담당자가 해결 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
 9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \
 동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요.
@@ -568,8 +620,8 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 (기각된 항목이 없으면 "없음"이라고 작성하세요.)
 ### 액션 아이템
-1. coder가 수정해야 할 구체적인 작업
+1. 수정 담당자가 처리해야 할 구체적인 작업
-2. coder가 수정해야 할 구체적인 작업
+2. 수정 담당자가 처리해야 할 구체적인 작업
 ## 이슈 트래커
@@ -592,6 +644,7 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
        "coding": CODING_TEMPLATE,
        "review": REVIEW_TEMPLATE,
        "plan-review": PLAN_REVIEW_TEMPLATE,
+        "plan-fix": PLAN_FIX_TEMPLATE,
        "review-only": REVIEW_ONLY_TEMPLATE,
        "aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
    },
@@ -599,6 +652,7 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
        "coding": CODING_TEMPLATE_KO,
        "review": REVIEW_TEMPLATE_KO,
        "plan-review": PLAN_REVIEW_TEMPLATE_KO,
+        "plan-fix": PLAN_FIX_TEMPLATE_KO,
        "review-only": REVIEW_ONLY_TEMPLATE_KO,
        "aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
    },
@@ -843,56 +897,75 @@ def _build_review_only_preset(
 def _build_plan_review_preset(
    coders: list[str], reviewers: list[str], seniors: list[str],
 ) -> list[StepConfig]:
-    """Plan-review: reviewers audit planning docs before implementation."""
+    """Plan-review: review planning docs, revise them, then verify in a loop."""
    if not coders:
        raise ValueError("'plan-review' preset requires at least 1 coder")
    if not reviewers:
        raise ValueError("'plan-review' preset requires at least 1 reviewer")
-    if len(reviewers) == 1 and not seniors:
+    review_steps: list[StepConfig] = []
-        return [
+    if len(reviewers) == 1:
        review_steps.append(
            StepConfig(
                name="plan_review",
                agent=reviewers[0],
                role="review",
                prompt_template="default:plan-review",
                output_key="plan_review_result",
                verdict=True,
            ),
-        ]
+        )
-
+        review_step_names = ["plan_review"]
-    steps: list[StepConfig] = []
+        review_output_keys = ["plan_review_result"]
    else:
        reviewer_keys = _unique_safe_keys(reviewers)
        for reviewer, rk in zip(reviewers, reviewer_keys):
-        steps.append(
+            review_steps.append(
                StepConfig(
                    name=f"plan_review_{rk}",
                    agent=reviewer,
                    role="review",
                    prompt_template="default:plan-review",
                    output_key=f"plan_review_{rk}",
                verdict=not seniors,
                    parallel=True,
                ),
            )
-    if seniors:
+        review_step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
-        step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
+        review_output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
-        output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
+
-        steps.append(
+    fix_coder = coders[0]
    senior_agent = seniors[0] if seniors else reviewers[0]
    return review_steps + [
        StepConfig(
-                name="senior_review",
+            name="aggregate_review",
-                agent=seniors[0],
+            agent=senior_agent,
            role="review",
            prompt_template="default:aggregate-review",
-                output_key="senior_review_result",
+            output_key="aggregate_review",
                verdict=True,
            context_override={
-                    "candidate_outputs": "Planning documents under review (plan/checklist/reference docs).",
+                "candidate_outputs": "Current planning package under review (plan/checklist/reference docs).",
                "reviews_bundle": _build_named_bundle(
-                        reviewers, step_names, output_keys, "Review",
+                    reviewers, review_step_names, review_output_keys, "Review",
                ),
            },
        ),
-        )
+        StepConfig(
-    return steps
+            name="plan_fix",
            agent=fix_coder,
            role="coding",
            prompt_template="default:plan-fix",
            output_key="plan_fix_output",
            context_override={"feedback": "{aggregate_review}"},
        ),
        StepConfig(
            name="verify",
            agent=senior_agent,
            role="review",
            prompt_template="default:plan-review",
            output_key="verify_result",
            verdict=True,
        ),
    ]
 def _build_review_fix_preset(
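Review note: the rewritten `_build_plan_review_preset` is hard to follow in interleaved diff form. The sketch below restates the resulting step order (reviews, then `aggregate_review`, `plan_fix`, `verify`) with a simplified stand-in for `StepConfig`. `Step` and `plan_review_steps` are hypothetical names, and reviewer steps are suffixed by index here rather than by the project's `_unique_safe_keys`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Simplified stand-in for the project's StepConfig (illustration only)."""

    name: str
    agent: str
    verdict: bool = False


def plan_review_steps(
    coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[Step]:
    """Return the review -> aggregate -> fix -> verify sequence."""
    if not coders or not reviewers:
        raise ValueError("'plan-review' needs at least 1 coder and 1 reviewer")
    if len(reviewers) == 1:
        reviews = [Step("plan_review", reviewers[0], verdict=True)]
    else:
        reviews = [
            Step(f"plan_review_{index}", reviewer, verdict=not seniors)
            for index, reviewer in enumerate(reviewers, start=1)
        ]
    # Without a dedicated senior, the first reviewer adjudicates and verifies.
    senior = seniors[0] if seniors else reviewers[0]
    return reviews + [
        Step("aggregate_review", senior, verdict=True),
        Step("plan_fix", coders[0]),
        Step("verify", senior, verdict=True),
    ]
```

For example, `plan_review_steps(["codex"], ["claude", "codex"], [])` yields two review steps followed by the three fixed steps, with `claude` acting as the adjudicator.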
--- a/cross_eval/worktree.py
+++ b/cross_eval/worktree.py
@@ -37,18 +37,31 @@ def make_worktree_dir(base_cwd: Path, branch_name: str) -> Path:
    )
-def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
+def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> tuple[Path, str]:
    """Create a git worktree on a new branch from HEAD.
    1. Create branch from HEAD
    2. Create worktree checked out to that branch
    The branch lives in the original repo, so it survives worktree removal.
+    Returns (worktree_path, base_commit_sha).
    """
    work_dir = work_dir.resolve()
    if work_dir.exists():
        shutil.rmtree(work_dir)
    # Record the base commit SHA before creating the branch.
    # This is the anchor for all diffs — even if the agent makes its own commits,
    # we always diff against this base to capture the full set of changes.
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=base_cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    base_commit = result.stdout.strip()
    # Create the branch at HEAD
    try:
        subprocess.run(
@@ -83,15 +96,24 @@ def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
            f"Failed to create worktree at {work_dir}: {e.stderr.strip()}"
        ) from e
-    logger.debug("Created worktree on branch '%s': %s", branch_name, work_dir)
+    logger.debug("Created worktree on branch '%s': %s (base: %s)", branch_name, work_dir, base_commit[:8])
-    return work_dir
+    return work_dir, base_commit
-def capture_diff(worktree_path: Path) -> str:
+def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
    """Capture all changes made in the worktree as a unified diff.
-    Includes both tracked modifications and new untracked files.
+    Includes tracked modifications, new untracked files, and changes
    that the agent may have committed on its own.
    Args:
        base_commit: The commit SHA from when the worktree was created.
                     If provided, diffs against this fixed base instead of HEAD.
                     This is critical because agents (e.g. Claude in interactive
                     mode) may create their own commits, advancing HEAD and
                     making ``git diff --cached HEAD`` return empty.
    """
    # Stage any uncommitted changes so they're included in the diff
    subprocess.run(
        ["git", "add", "-A"],
        cwd=worktree_path,
@@ -99,6 +121,30 @@ def capture_diff(worktree_path: Path) -> str:
        check=True,
    )
    if base_commit:
        # Diff everything (committed + staged) against the original base.
        # This captures changes regardless of whether the agent committed them.
        result = subprocess.run(
            ["git", "diff", base_commit, "--cached"],
            cwd=worktree_path,
            capture_output=True,
            text=True,
        )
        diff = result.stdout.strip()
        if diff:
            return diff
        # Fallback: the index matches the base tree, so compare the base
        # against HEAD to surface any changes that exist only in commits.
        result = subprocess.run(
            ["git", "diff", base_commit, "HEAD"],
            cwd=worktree_path,
            capture_output=True,
            text=True,
        )
        return result.stdout.strip()
    # Fallback: no base_commit, use original behavior
    result = subprocess.run(
        ["git", "diff", "--cached", "HEAD"],
        cwd=worktree_path,
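
The effect of anchoring the diff to a fixed base commit can be reproduced in a throwaway repository. The sketch below is illustrative only and is not part of this change: `_git` and `demo_base_commit_diff` are hypothetical helpers, and the example assumes a `git` executable is on `PATH`. It simulates an agent that commits its own edit, then compares the diff against `HEAD` with the diff against the recorded base.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Tuple


def _git(cwd: Path, *args: str) -> str:
    """Run a git command in cwd and return its stripped stdout (hypothetical helper)."""
    result = subprocess.run(
        ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com",
         "-c", "commit.gpgsign=false", *args],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo_base_commit_diff() -> Tuple[str, str]:
    """Return (diff vs HEAD, diff vs base) after a simulated agent self-commit."""
    with tempfile.TemporaryDirectory() as td:
        repo = Path(td)
        _git(repo, "init", "-q")
        (repo / "plan.md").write_text("plan v1\n", encoding="utf-8")
        _git(repo, "add", "-A")
        _git(repo, "commit", "-q", "-m", "initial")

        # Record the base before the "agent" touches anything.
        base_commit = _git(repo, "rev-parse", "HEAD")

        # The simulated agent edits the file and commits, advancing HEAD.
        (repo / "plan.md").write_text("plan v2\n", encoding="utf-8")
        _git(repo, "commit", "-q", "-am", "agent self-commit")

        # Stage whatever is left (nothing here), then diff both ways.
        _git(repo, "add", "-A")
        head_diff = _git(repo, "diff", "--no-color", "--cached", "HEAD")
        base_diff = _git(repo, "diff", "--no-color", "--cached", base_commit)
        return head_diff, base_diff
```

Calling `demo_base_commit_diff()` should yield an empty first element, because the index matches the advanced `HEAD`, while the second element still contains the `plan v1` to `plan v2` change.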
--- a/tests/test_agentic.py
+++ b/tests/test_agentic.py
@@ -76,10 +76,12 @@ class TestCreateWorktree(unittest.TestCase):
            wt_dir = Path(td) / "wt"
            branch = "cross-eval/test_branch"
-            result_path = create_worktree(base, wt_dir, branch)
+            result_path, base_commit = create_worktree(base, wt_dir, branch)
            # Worktree directory exists
            self.assertTrue(result_path.exists())
            # Base commit SHA was captured
            self.assertEqual(len(base_commit), 40)
            # Branch was created in the original repo
            branches = subprocess.run(
                ["git", "branch", "--list", branch],
@@ -102,7 +104,7 @@ class TestCaptureDiff(unittest.TestCase):
            wt_dir = Path(td) / "wt"
            branch = "cross-eval/diff_test"
-            create_worktree(base, wt_dir, branch)
+            create_worktree(base, wt_dir, branch)  # ignore return tuple
            # Make changes in the worktree
            (wt_dir / "new_file.txt").write_text("hello\n")
@@ -553,7 +555,7 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -582,7 +584,7 @@ class TestSetupWorktreeLocation(unittest.TestCase):
            run_dir.mkdir(parents=True)
            _init_git_repo(base)
-            worktree_path, branch_name = _setup_worktree(base, run_dir, "review-fix")
+            worktree_path, branch_name, _base_commit = _setup_worktree(base, run_dir, "review-fix")
            try:
                self.assertTrue(worktree_path.exists())
                self.assertNotIn(str(base.resolve()), str(worktree_path.resolve()))
@@ -620,7 +622,7 @@ class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -662,7 +664,7 @@ class TestCommitIterationCalled(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -704,7 +706,7 @@ class TestFinalizeWorktreeCalled(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -822,7 +824,7 @@ class TestParallelAgenticFallsBackToSequential(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            call_order: list[str] = []
--- a/tests/test_config.py
+++ b/tests/test_config.py
@@ -42,6 +42,8 @@ from cross_eval.prompts import (
    REVIEW_TEMPLATE_KO,
    PLAN_REVIEW_TEMPLATE,
    PLAN_REVIEW_TEMPLATE_KO,
    PLAN_FIX_TEMPLATE,
    PLAN_FIX_TEMPLATE_KO,
    REVIEW_ONLY_TEMPLATE,
    REVIEW_ONLY_TEMPLATE_KO,
    AGGREGATE_REVIEW_TEMPLATE,
@@ -310,7 +312,23 @@ class BuiltinAgentConfigTest(unittest.TestCase):
        self.assertIn("Repeated Aggregate Findings", report)
        self.assertIn("same as iteration 3", report)
-    def test_review_fix_defaults_senior_from_reviewer_family(self) -> None:
+    def test_fix_and_plan_presets_default_senior_from_reviewer_family(self) -> None:
        self.assertEqual(
            _default_seniors_for_preset(
                "preset:plan-review",
                ["codex-reviewer"],
                BUILTIN_AGENTS,
            ),
            ["codex-senior"],
        )
        self.assertEqual(
            _default_seniors_for_preset(
                "preset:plan-review",
                ["claude-reviewer"],
                BUILTIN_AGENTS,
            ),
            ["claude-senior"],
        )
        self.assertEqual(
            _default_seniors_for_preset(
                "preset:review-fix",
@@ -421,23 +439,49 @@ class BuiltinAgentConfigTest(unittest.TestCase):
        )
        self.assertEqual(
-            [step.output_key for step in steps],
+            [step.output_key for step in steps[:2]],
            ["plan_review_codex_reviewer", "plan_review_codex_reviewer_2"],
        )
-    def test_plan_review_with_senior_adds_aggregate_step(self) -> None:
+    def test_plan_review_builds_review_fix_verify_loop(self) -> None:
        steps = _build_plan_review_preset(
            ["codex-coder"],
            ["claude-reviewer", "codex-reviewer"],
            ["claude-senior"],
        )
-        self.assertEqual(steps[-1].name, "senior_review")
+        self.assertEqual(
-        self.assertEqual(steps[-1].agent, "claude-senior")
+            [step.name for step in steps],
-        self.assertTrue(steps[-1].verdict)
+            [
                "plan_review_claude_reviewer",
                "plan_review_codex_reviewer",
                "aggregate_review",
                "plan_fix",
                "verify",
            ],
        )
        self.assertEqual(steps[2].agent, "claude-senior")
        self.assertEqual(steps[3].agent, "codex-coder")
        self.assertEqual(steps[4].agent, "claude-senior")
        self.assertTrue(steps[4].verdict)
        self.assertFalse(steps[0].verdict)
        self.assertFalse(steps[1].verdict)
    def test_plan_review_single_reviewer_uses_default_loop_steps(self) -> None:
        steps = _build_plan_review_preset(
            ["codex-coder"],
            ["codex-reviewer"],
            [],
        )
        self.assertEqual(
            [step.name for step in steps],
            ["plan_review", "aggregate_review", "plan_fix", "verify"],
        )
        self.assertEqual(steps[1].agent, "codex-reviewer")
        self.assertEqual(steps[2].prompt_template, "default:plan-fix")
        self.assertTrue(steps[3].verdict)
    def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None:
        steps = _build_cross_review_preset(
            ["codex-coder", "codex-coder"],
@@ -576,6 +620,8 @@ class PromptTemplateTest(unittest.TestCase):
        """Coding templates should tell coder to ignore DISMISSED items."""
        self.assertIn("DISMISSED", CODING_TEMPLATE)
        self.assertIn("DISMISSED", CODING_TEMPLATE_KO)
        self.assertIn("DISMISSED", PLAN_FIX_TEMPLATE)
        self.assertIn("DISMISSED", PLAN_FIX_TEMPLATE_KO)
    def test_aggregate_templates_dismissed_structure(self) -> None:
        """Aggregate templates should use [False positive] / [Already fixed] tags."""
@@ -583,6 +629,10 @@ class PromptTemplateTest(unittest.TestCase):
        self.assertIn("[Already fixed]", AGGREGATE_REVIEW_TEMPLATE)
        self.assertIn("[오탐]", AGGREGATE_REVIEW_TEMPLATE_KO)
        self.assertIn("[수정 완료]", AGGREGATE_REVIEW_TEMPLATE_KO)
        self.assertIn("{candidate_outputs}", AGGREGATE_REVIEW_TEMPLATE)
        self.assertIn("{reviews_bundle}", AGGREGATE_REVIEW_TEMPLATE)
        self.assertIn("{candidate_outputs}", AGGREGATE_REVIEW_TEMPLATE_KO)
        self.assertIn("{reviews_bundle}", AGGREGATE_REVIEW_TEMPLATE_KO)
 class ReviewMetricsParsingTest(unittest.TestCase):
@@ -1033,6 +1083,34 @@ class FixPresetBehaviorTest(unittest.TestCase):
        self.assertTrue(captured["agentic"])
        self.assertEqual(captured["phase_max"], 3)
    def test_run_preset_plan_review_auto_enables_agentic_without_flag(self) -> None:
        captured: dict[str, object] = {}
        def _fake_run_pipeline(config, **kwargs):
            captured["preset"] = config.preset_name
            captured["agentic"] = config.agents[config.coders[0]].agentic
            captured["seniors"] = list(config.seniors)
            captured["steps"] = [step.name for step in config.pipeline]
            captured["max_iter"] = config.max_iterations
            return PipelineResult(
                iterations=[],
                final_verdict="PASS",
                run_dir=Path(".cross-eval/output"),
            )
        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
            exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
        self.assertEqual(exit_code, 0)
        self.assertEqual(captured["preset"], "plan-review")
        self.assertTrue(captured["agentic"])
        self.assertEqual(captured["seniors"], ["claude-senior"])
        self.assertEqual(
            captured["steps"],
            ["plan_review", "aggregate_review", "plan_fix", "verify"],
        )
        self.assertEqual(captured["max_iter"], 3)
    def test_run_senior_model_override_applies_only_to_seniors(self) -> None:
        captured: dict[str, list[str]] = {}
--- a/tests/test_evidence.py
+++ b/tests/test_evidence.py
@@ -465,6 +465,9 @@ class TestExpandedClaimMarkers(unittest.TestCase):
    def test_changes_are_complete(self) -> None:
        self.assertTrue(_claims_file_changes("All changes are complete"))
    def test_korean_change_summary_triggers(self) -> None:
        self.assertTrue(_claims_file_changes("모든 수정이 완료되었습니다. 아래는 변경 요약입니다."))
 class TestExpandedNoChangeMarkers(unittest.TestCase):
    """New no-change markers prevent false positives."""
@@ -484,6 +487,9 @@ class TestExpandedNoChangeMarkers(unittest.TestCase):
    def test_no_action_required(self) -> None:
        self.assertFalse(_claims_file_changes("No action required"))
    def test_korean_no_change_marker(self) -> None:
        self.assertFalse(_claims_file_changes("변경할 필요 없음"))
 # ---------------------------------------------------------------------------
 # 6. Cross-iteration evidence propagation
--- a/tests/test_pipeline_integration.py
+++ b/tests/test_pipeline_integration.py
@@ -13,7 +13,11 @@ from cross_eval.models import (
    StepConfig,
 )
 from cross_eval.pipeline import run_pipeline
-from cross_eval.prompts import _build_review_fix_preset, _build_simple_preset
+from cross_eval.prompts import (
    _build_plan_review_preset,
    _build_review_fix_preset,
    _build_simple_preset,
 )
 def _make_mock_agent(outputs: list[str]):
@@ -262,6 +266,60 @@ class TestPhasedPipelineEscalateBreaksPhase(unittest.TestCase):
            self.assertTrue(len(result.escalated_issues) > 0)
 class TestPlanReviewPipelineLoopsUntilVerifyPass(unittest.TestCase):
    """Document plan-review should revise docs and re-verify across iterations."""
    def test_plan_review_fail_then_pass(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            coders = ["claude-coder"]
            reviewers = ["claude-reviewer"]
            seniors = ["claude-senior"]
            steps = _build_plan_review_preset(coders, reviewers, seniors)
            config = PipelineConfig(
                output_dir=Path(tmpdir),
                max_iterations=4,
                min_iterations=1,
                language="en",
                inputs={
                    "plan": "Test plan",
                    "checklist": "Test checklist",
                    "docs": "Reference docs",
                },
                agents=dict(BUILTIN_AGENTS),
                coders=coders,
                reviewers=reviewers,
                seniors=seniors,
                pipeline=steps,
                preset_name="plan-review",
            )
            mock = _make_step_mock({
                "plan_review": [
                    "Requirements are ambiguous\n\nVERDICT: FAIL",
                    "Looks aligned\n\nVERDICT: PASS",
                ],
                "aggregate_review": [
                    "### Confirmed Issues\n- Clarify acceptance criteria\n\n"
                    "### Action Items\n1. Tighten the checklist\n\nVERDICT: FAIL",
                    "### Confirmed Issues\nNone\n\n"
                    "### Dismissed Findings\nNone\n\n"
                    "### Action Items\n1. No document changes needed\n\nVERDICT: PASS",
                ],
                "plan_fix": ["Updated plan and checklist", "No-op"],
                "verify": [
                    "Still missing edge-case criteria\n\nVERDICT: FAIL",
                    "Planning package is now implementable\n\nVERDICT: PASS",
                ],
            })
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "PASS")
            self.assertEqual(len(result.iterations), 2)
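
The control flow this test exercises can be pictured with a small stand-alone loop. This is a simplified sketch under stated assumptions, not the behaviour of `run_pipeline` itself: `STEP_ORDER`, `parse_verdict`, and `run_loop` are hypothetical names, and the sketch assumes that only the `verify` verdict decides whether another iteration runs.

```python
import re
from typing import Dict, List, Optional, Tuple

# Step order taken from the single-reviewer plan-review preset in the tests.
STEP_ORDER = ["plan_review", "aggregate_review", "plan_fix", "verify"]


def parse_verdict(output: str) -> Optional[str]:
    """Extract PASS or FAIL from a 'VERDICT: X' marker, if one is present."""
    match = re.search(r"VERDICT:\s*(PASS|FAIL)", output)
    return match.group(1) if match else None


def run_loop(scripted: Dict[str, List[str]], max_iterations: int) -> Tuple[str, int]:
    """Run every step once per iteration; stop as soon as 'verify' reports PASS."""
    verdict = "FAIL"
    for iteration in range(max_iterations):
        for step in STEP_ORDER:
            output = scripted[step][iteration]
            if step == "verify":
                # Assumption: a missing verdict marker counts as FAIL.
                verdict = parse_verdict(output) or "FAIL"
        if verdict == "PASS":
            return verdict, iteration + 1
    return verdict, max_iterations


scripted_outputs = {
    "plan_review": ["Requirements are ambiguous\n\nVERDICT: FAIL",
                    "Looks aligned\n\nVERDICT: PASS"],
    "aggregate_review": ["Tighten the checklist\n\nVERDICT: FAIL",
                         "No document changes needed\n\nVERDICT: PASS"],
    "plan_fix": ["Updated plan and checklist", "No-op"],
    "verify": ["Still missing edge-case criteria\n\nVERDICT: FAIL",
               "Planning package is now implementable\n\nVERDICT: PASS"],
}
```

With these scripted outputs, `run_loop(scripted_outputs, 4)` stops after the second iteration with a `PASS` verdict, matching the two iterations asserted in the test above.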
 class TestAutoEscalateFiresWithoutSenior(unittest.TestCase):
    """Test 6: simple pipeline without senior, same FAIL feedback 3 times -> auto-escalate."""
--- a/tests/test_runtime_misc.py
+++ b/tests/test_runtime_misc.py
@@ -16,6 +16,7 @@ from cross_eval.agent import (
 )
 from cross_eval.models import AgentConfig, AgentResult, ExecutionConfig, PipelineConfig, StepConfig
 from cross_eval.pipeline import (
    _copy_inputs_to_worktree,
    _commit_iteration,
    _execute_parallel_batch,
    _execute_step,
@@ -118,6 +119,42 @@ class TestInvokeAgentRuntime(unittest.TestCase):
        self.assertEqual(ctx.exception.failure_type, "API_ERROR")
        self.assertIn("backend down", ctx.exception.raw_error)
 class TestWorktreeInputMapping(unittest.TestCase):
    def test_repo_local_plan_input_maps_to_tracked_worktree_path(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir) / "repo"
            repo.mkdir()
            _init_git_repo(repo)
            (repo / "plan.md").write_text("plan v1\n", encoding="utf-8")
            subprocess.run(["git", "add", "plan.md"], cwd=repo, capture_output=True, check=True)
            subprocess.run(
                ["git", "commit", "-m", "add plan"],
                cwd=repo,
                capture_output=True,
                check=True,
            )
            worktree_dir = Path(tmpdir) / "wt"
            branch = "cross-eval/test-plan-review"
            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
            try:
                config = PipelineConfig(
                    inputs={"plan": repo / "plan.md"},
                    preset_name="plan-review",
                )
                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
                self.assertEqual(config.inputs["plan"], worktree_path / "plan.md")
            finally:
                remove_worktree(base_cwd=repo, work_dir=worktree_path)
                subprocess.run(
                    ["git", "branch", "-D", branch],
                    cwd=repo,
                    capture_output=True,
                )
    def test_classify_unknown_failure(self) -> None:
        failure_type, suggested_action = _classify_agent_failure("weird crash")
        self.assertEqual(failure_type, "UNKNOWN")
@@ -775,11 +812,18 @@ class TestRuntimeEnvironmentHelpers(unittest.TestCase):
 class TestWorktreeFailures(unittest.TestCase):
    @patch("cross_eval.worktree.subprocess.run")
    def test_create_worktree_raises_when_branch_creation_fails(self, mock_run: MagicMock) -> None:
-        mock_run.side_effect = subprocess.CalledProcessError(
+        # First call: git rev-parse HEAD (succeeds)
        # Second call: git branch (fails)
        rev_parse_result = MagicMock(returncode=0)
        rev_parse_result.stdout = "a" * 40
        mock_run.side_effect = [
            rev_parse_result,
            subprocess.CalledProcessError(
                1,
                ["git", "branch"],
                stderr="branch failed",
-        )
+            ),
        ]
        with tempfile.TemporaryDirectory() as tmpdir:
            base = Path(tmpdir)
@@ -791,14 +835,17 @@ class TestWorktreeFailures(unittest.TestCase):
    @patch("cross_eval.worktree.subprocess.run")
    def test_create_worktree_cleans_branch_on_worktree_failure(self, mock_run: MagicMock) -> None:
        rev_parse_result = MagicMock(returncode=0)
        rev_parse_result.stdout = "a" * 40
        mock_run.side_effect = [
-            MagicMock(returncode=0),
+            rev_parse_result,           # git rev-parse HEAD
            MagicMock(returncode=0),    # git branch
            subprocess.CalledProcessError(
                1,
                ["git", "worktree", "add"],
                stderr="worktree failed",
            ),
-            MagicMock(returncode=0),
+            MagicMock(returncode=0),    # git branch -D (cleanup)
        ]
        with tempfile.TemporaryDirectory() as tmpdir: