
fix: use incremental diff per iteration instead of cumulative base diff
After each iteration's _commit_iteration, record the new HEAD SHA and use it as the diff anchor for the next iteration. Previously, capture_diff always diffed against the initial base commit, so every iteration returned the same full cumulative diff. Reviewers could not see what changed between iterations, which led to repeated feedback and stuck FAIL loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
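The difference between the two anchoring strategies can be shown with a hypothetical in-memory model. Each commit is a full snapshot of one file, `diff` stands in for `capture_diff`, and appending to `commits` stands in for `_commit_iteration`. None of these names are cross-eval APIs, and no git calls are made.

```python
from difflib import unified_diff


def diff(old: str, new: str) -> str:
    """Unified diff between two snapshots of one file (stand-in for capture_diff)."""
    return "".join(
        unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="a/app.py",
            tofile="b/app.py",
        )
    )


def run_iterations(edits: list[str], incremental: bool) -> list[str]:
    """Return the diff each iteration would hand to the reviewers.

    edits[i] is the file content after iteration i's agent run.
    """
    commits = [""]  # commit 0: the initial base commit (empty file)
    anchor = 0      # index of the commit used as the diff base
    diffs = []
    for content in edits:
        diffs.append(diff(commits[anchor], content))
        commits.append(content)  # intermediate commit after the iteration
        if incremental:
            anchor = len(commits) - 1  # the new HEAD becomes the next anchor
    return diffs


# Iteration 1 writes "a"; iteration 2 adds "b".
cumulative = run_iterations(["a\n", "a\nb\n"], incremental=False)
incremental = run_iterations(["a\n", "a\nb\n"], incremental=True)
# cumulative[1] still shows "+a" from iteration 1; incremental[1] shows only "+b".
```

With a fixed anchor, iteration 2's diff repeats iteration 1's change, which is the symptom described above. Moving the anchor to the new HEAD leaves only the line that was actually added.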
2026-03-15 17:54:30 +09:00 · 2026-03-15 10:07:11 +09:00 · 2026-03-15 00:35:42 +09:00 · 2026-03-15 00:01:26 +09:00 · 2026-03-14 23:59:53 +09:00
19 changed files with 1527 additions and 290 deletions
--- a/DEVELOPMENT.md
+++ b/DEVELOPMENT.md
@@ -10,6 +10,8 @@ Development workflow using two AI agents (plan → checklis
 - Generator: `--permission-mode auto` (can read and write files)
 - Reviewer: `--permission-mode plan` (read-only exploration)
 - Set the subprocess `cwd` to the current working directory
+- The default execution mode is **direct mode**: the agentic coder also edits the current working tree directly.
+- An isolated git worktree is created only when `--worktree` or `use_worktree: true` is given explicitly.
 ## User experience (UX Flow)
@@ -34,6 +36,7 @@ ls output/v1/ v2/ final-report.md
 ```yaml
 output_dir: output
+use_worktree: false
 max_iterations: 3
 inputs:
@@ -51,10 +54,8 @@ agents:
    system_prompt: "You are a meticulous code reviewer."
 # Option 1: use a preset (no need to write the pipeline YAML yourself)
-pipeline: preset:simple          # "A generates → B reviews" (default)
+pipeline: preset:coding-plan-review   # "implement from docs → review code/docs → fix → re-verify" (default)
-# pipeline: preset:cross-review  # "both generate → review each other"
+# pipeline: preset:plan-review        # "pre-implementation doc review → fix → re-verify, repeated"
-# pipeline: preset:plan-review   # "pre-implementation doc/plan review"
-# pipeline: preset:coding-review-fix  # "initial coding once → repeated review/fix"
 # Option 2: fully custom (for advanced users)
 # pipeline:
@@ -75,10 +76,8 @@ pipeline: preset:simple          # "A generates → B reviews" (default)
 | Preset | Description | Auto-generated steps |
 |--------|------|-------------------|
-| `simple` | A codes → B reviews | coding(agent1) → review(agent2) |
+| `plan-review` | Repeated pre-implementation doc review/fix/re-verify | plan_review_* → aggregate_review → plan_fix → verify |
-| `cross-review` | Both code, then review each other | coding_a → coding_b → review_of_b(agent_a) → review_of_a(agent_b) |
+| `coding-plan-review` | Implement from docs, then repeat code/doc review/fix | initial_coding(coding) → coding_plan_review(review* → aggregate → coding_plan_fix → verify) |
-| `plan-review` | Pre-implementation doc review | parallel plan_review_* → senior_review(optional) |
-| `coding-review-fix` | Initial coding, then repeated review/fix | initial_coding(coding) → review_fix(review* → aggregate → coding → verify) |
 A preset automatically assembles the appropriate pipeline steps and context_override internally. agent1 and agent2 are assigned in the order defined under agents. If no preset fits, you can write the steps yourself.
@@ -101,7 +100,7 @@ cross_eval/
 **models.py** — avoids circular imports; all dataclasses live here:
 - `AgentConfig` (command, args, system_prompt, stdin_mode)
 - `StepConfig` (name, agent, role, prompt_template, output_key, verdict, verdict_pattern, context_override)
- `PipelineConfig` (output_dir, max_iterations, inputs, agents, pipeline)
+- `PipelineConfig` (output_dir, use_worktree, max_iterations, inputs, agents, pipeline)
 - `AgentResult` (output, exit_code, agent_name, step_name, duration_seconds)
 - `IterationResult` (iteration, step_outputs, verdict, feedback)
 - `PipelineResult` (iterations, final_verdict, total_duration)
@@ -117,7 +116,7 @@ cross_eval/
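 - `default:review` — reviews against three criteria (over-optimization, false positives, omissions), prints `VERDICT: PASS|FAIL`, and instructs the agent to **"explore the project directory directly and verify the code"**
 - `{variable}` placeholders; a missing key renders as `(no {key} provided)`
 - Users can override the templates with custom .md files
- `PIPELINE_PRESETS` dict: defines the StepConfig list for each preset (`simple`, `cross-review`, `plan-review`, etc.)
+- `PIPELINE_PRESETS` / `PHASED_PRESETS` dicts: define the StepConfig/PhaseConfig for each preset (`plan-review`, `coding-plan-review`)
 **agent.py** — `invoke_agent(agent_config, prompt, cwd)`:
 - The `cwd` parameter sets the project directory, so the agent can explore files in that directory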
@@ -139,16 +138,21 @@ for iteration 1..max_iterations:
 generate final-report.md
 ```
+The agentic execution path has two modes:
+- Default: direct mode (edits files in `cwd` directly)
+- Opt-in: isolated worktree mode (`--worktree` or `use_worktree: true`)
 **report.py** — final Markdown report:
 - Summary table (iteration count, verdict, elapsed time)
 - Per-iteration details (each step's output, agent name, elapsed time)
 - Final verdict
 **cli.py** — subcommands:
- `cross-eval init [--dir .] [--preset simple|cross-review|plan-review]` — scaffolding (never overwrites existing files)
+- `cross-eval init [--dir .] [--preset coding-plan-review|plan-review]` — scaffolding (never overwrites existing files)
- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...]`
+- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...] [--worktree]`
 - `--input key=path`: overrides or adds entries in the config's inputs
 - `--dry-run`: prints only the rendered prompts, without invoking any agent
+- `--worktree`: runs in an isolated git worktree instead of the default direct mode
 ## Files to modify
@@ -172,10 +176,12 @@ final-report.md 생성
 4. Put some simple content in plan.md/checklist.md and do a real run with `cross-eval run --max-iter 2`
 5. Confirm that v1/ and final-report.md are created in the `output/` directory
+`--dry-run` is preview-only; the process exits with code `0` even if the actual verdict is not PASS.
  cross-eval run \
    --docs /Users/chungyeong/Desktop/Dev/new-alpha-foundry/plans/TO_CLICKHOUSE \
-    --preset coding-review-fix \
+    --preset coding-plan-review \
    --coder claude \
    --reviewer codex \
    --reviewer codex \
@@ -185,3 +191,6 @@ final-report.md 생성
    --reviewer-effort high \
    --senior-effort xhigh \
    --max-iter 10
+cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-plan-review --lang ko --max-iter 1
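The two execution modes described in DEVELOPMENT.md can be summarized in a small sketch. Only the `if args.worktree: config.use_worktree = True` override comes from this diff (the cli.py hunk). `RunConfig`, `apply_worktree_flag`, and `agent_cwd` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunConfig:
    """Hypothetical stand-in for the use_worktree part of PipelineConfig."""
    use_worktree: bool = False  # direct mode unless explicitly opted in


def apply_worktree_flag(config: RunConfig, worktree_flag: bool) -> RunConfig:
    # Like `if args.worktree: config.use_worktree = True` in cmd_run:
    # the CLI flag can only turn isolation on, never off.
    if worktree_flag:
        config.use_worktree = True
    return config


def agent_cwd(config: RunConfig, project_dir: Path, worktree_dir: Path) -> Path:
    """Directory the agentic coder edits in (hypothetical helper)."""
    return worktree_dir if config.use_worktree else project_dir
```

Because the flag only ever sets `use_worktree` to `True`, a config file with `use_worktree: true` stays isolated even when `--worktree` is omitted.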
--- a/README.md
+++ b/README.md
@@ -51,12 +51,15 @@ cp .cross-eval/checklist-sample.md .cross-eval/checklist.md
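 ### 3. Run
 ```bash
-# Default run (coding → review, up to 3 iterations)
+# Default run (current working tree, direct mode, up to 3 iterations)
 cross-eval run
 # Preview the prompts only (no agent calls, saves cost)
 cross-eval run --dry-run
+# Opt in explicitly only when you want to run in an isolated git worktree
+cross-eval run --worktree
 # Change the maximum number of iterations
 cross-eval run --max-iter 5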
@@ -80,6 +83,9 @@ output/
 └── final-report.md    # overall summary report
 ```
+The default is **direct mode**: `cross-eval` reads and edits files directly in the current working tree.
+Add `--worktree` to use an isolated git worktree only when you need a separate, isolated run.
 ## Configuration (`.cross-eval/config.yaml`)
 ```yaml
@@ -101,7 +107,8 @@ agents:
    args: ["-p", "--model", "opus", "--permission-mode", "plan"]
    system_prompt: "You are a meticulous code reviewer."
-pipeline: preset:simple
+pipeline: preset:coding-plan-review
+use_worktree: false        # default; true uses an isolated worktree
 ```
 If you edit `config.yaml` during a run, the change takes effect automatically from the next iteration.
@@ -110,16 +117,16 @@ pipeline: preset:simple
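 | Preset | Description |
 |--------|------|
-| `simple` | Agent A codes, Agent B reviews (default) |
+| `plan-review` | Reviews the plan, checklist, and reference docs before implementation, edits the docs, and repeats through re-verification |
-| `cross-review` | Both agents code, then cross-review each other |
+| `coding-plan-review` | Implements code from the input docs, then repeats review/fix/re-verify over both the code and the docs |
-| `plan-review` | Reviews the plan, checklist, and reference docs before implementation and, if needed, checks consistency with the current codebase |
+
-| `review-only` | Reviews existing code only, for auditing |
+The two presets differ only in purpose; most CLI options behave identically. For example, `--plan`, `--checklist`, `--docs`, `--coder`, `--reviewer`, `--senior`, `--max-iter`, `--dry-run`, and `--worktree` work the same way in both.
-| `review-fix` | Aggregates review results, then repeats automatic fixes and re-verification |
-| `coding-review-fix` | Codes once, then aggregates review results and repeats automatic fixes and re-verification |
 ```bash
 # Init options
-cross-eval init --preset cross-review   # cross-review preset
+cross-eval init --preset coding-plan-review  # implementation + code/doc review preset
-cross-eval init --preset plan-review    # pre-implementation doc review preset
+cross-eval init --preset plan-review         # doc review/fix/re-verify preset
 cross-eval init --lang en               # English templates
 ```
+`cross-eval run --dry-run` only previews the prompts and the pipeline configuration; the exit code is `0` even if the actual verdict is not PASS.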
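The dry-run rule can be sketched as a tiny function that mirrors the `if args.dry_run: return 0` early return added to `cmd_run`. The mapping for real runs (0 for PASS, 1 otherwise) is an assumption, because this diff does not show which codes a real run returns. `exit_code` is a hypothetical name.

```python
def exit_code(verdict: str, dry_run: bool) -> int:
    """Hypothetical exit-code rule for `cross-eval run`."""
    if dry_run:
        # Preview only: no agent ran, so the verdict must not fail the process.
        return 0
    # Assumption: a real run exits non-zero unless the final verdict is PASS.
    return 0 if verdict == "PASS" else 1
```

Returning early keeps CI scripts that call `--dry-run` from failing just because the rendered pipeline has not produced a PASS.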
--- a/UX_IMPROVEMENT_PLAN.md
+++ b/UX_IMPROVEMENT_PLAN.md
@@ -0,0 +1,178 @@
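+# cross-eval UX improvement plan
+> Raise the quality of user guidance, error messages, and help text across the board
+> so first-time users can run a pipeline without getting stuck.
+---
+## 1. Improve CLI help text
+### 1.1 `cross-eval` main help
+- [ ] Add a one-line summary of "what problem this tool solves" to the main description
+  - Current: "A CLI tool that automatically verifies the output of AI coding agents"
+  - Improved: "Automatically cross-verifies that AI coding agents implemented the plan as written. Catches over-optimization, omissions, and false passes"
+- [ ] Add a one-line description of each subcommand (init/doctor/demo/run) to the main help
+### 1.2 `cross-eval run` help
+- [ ] The preset table in the epilog is too long; add a three-line "quick selection guide"
+  - e.g., "First time? Use simple. Review only? review-only. Coding + review + auto-fix? coding-review-fix"
+- [ ] List the aliases (extra-high, x-high, etc.) in the `--reasoning-effort` help
+- [ ] Explain how the `--target` option actually affects the prompt
+- [ ] Summarize the worktree creation/cleanup behavior in the `--agentic` flag description
+- [ ] Add one line to `--min-iter` explaining "why it keeps iterating after a PASS"
+  - e.g., "For checking result stability: re-verifies that a single PASS was not a fluke"
+- [ ] Make the `--dry-run` description say clearly "previews prompts without invoking agents"
+- [ ] Explain the agent shorthand rules (claude → claude-coder, etc.) more clearly, with examples
+### 1.3 `cross-eval init` help
+- [ ] Make the `--guided` option more prominent, e.g., "First time? --guided is recommended"
+- [ ] Add one line per generated file explaining how to use it
+### 1.4 `cross-eval doctor` help
+- [ ] List up front which items are checked
+- [ ] Include concrete commands for what to do when authentication fails
+### 1.5 `cross-eval demo` help
+- [ ] Add a comparison so the mock vs. live difference is visible at a glance
+- [ ] Emphasize that the `--escalate` option is mock-only
+---
+## 2. Improve error messages
+### 2.1 Missing required input
+- [ ] Show a clear error when `cross-eval run` is run without `--plan`:
+  - "A plan is required. Pass --plan plan.md or set inputs.plan in .cross-eval/config.yaml."
+- [ ] When running without config.yaml, add an INFO message saying that defaults are in use
+### 2.2 Agent failure messages
+- [ ] On `AUTH` failure, suggest a concrete fix command
+  - Claude: "Authenticate with claude login"
+  - Codex: "Authenticate with codex auth"
+- [ ] On `USAGE_LIMIT`, hint at which limit was hit (tokens? billing?)
+- [ ] On `EMPTY_DIFF`, say "The agent did not modify any files" and list possible causes
+- [ ] On `WRITE_FAILURE`, print the worktree path and its permission state
+- [ ] On empty agent output, suggest causes, e.g., "The agent did not respond. The prompt may be too long, or authentication may have expired"
+### 2.3 Config validation errors
+- [ ] Make the duplicate step name error say exactly which step in which phase is duplicated
+- [ ] When an undefined agent is referenced, include an "available agents: ..." list (already exists; verify)
+- [ ] Include the line number in YAML parse errors
+### 2.4 File/path errors
+- [ ] Improve "File not found: {path}" to "File not found: {path}\n  Current directory: {cwd}"
+- [ ] When the docs directory is empty: "The reference docs folder is empty: {path}\n  Add document files such as .md or .txt"
+---
+## 3. Improve progress messages
+### 3.1 While the pipeline runs
+- [ ] Print a summary banner at startup:
+  ```
+  ━━━ cross-eval ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+    Plan:      .cross-eval/plan.md
+    Preset:    simple (coding → review → repeat)
+    Coder:     claude-coder
+    Reviewer:  claude-reviewer
+    Max iter:  3
+  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+  ```
+- [ ] At the start of each iteration, print one line describing what the step is doing
+  - e.g., "Iteration 1/3 — Coder is writing the initial implementation from the plan..."
+  - e.g., "Iteration 2/3 — applying review feedback..."
+- [ ] On timeout, print both the elapsed time and the time limit
+### 3.2 Result summary
+- [ ] Add the elapsed time to the final result
+- [ ] On FAIL, briefly summarize "N major issues raised in the last review"
+- [ ] On ESCALATE, summarize in one or two lines why a human needs to look
+- [ ] When a dry run ends, state "This is a preview. Remove --dry-run to run for real"
+### 3.3 Auto-escalation guidance
+- [ ] When auto-escalation triggers, explain "N consecutive FAILs → automatic escalation"
+- [ ] Mention in the run help which conditions trigger auto-escalation
+---
+## 4. Improve the first-use experience (onboarding)
+### 4.1 Guidance after init
+- [ ] Include a real example in the plan.md template (currently it has only a minimal structure)
+  - One concrete example under "## Functional requirements"
+- [ ] Add a checklist-writing guide and examples to the checklist.md template
+- [ ] Make the "next steps" guidance after init more concrete:
+  - Current: "1. Write the plan in plan.md"
+  - Improved: "1. Open .cross-eval/plan.md and write your plan (e.g., features to implement, API spec, DB schema)"
+### 4.2 doctor improvements
+- [ ] When all checks pass, print "Ready! Run cross-eval run --plan .cross-eval/plan.md"
+- [ ] On authentication failure, include per-OS install/auth guide URLs
+### 4.3 demo improvements
+- [ ] After the demo finishes, add "To start on a real project:" guidance
+- [ ] In the mock demo, explain what each step does in a comment-like style
+---
+## 5. Terminology consistency
+- [ ] Standardize the distinction between "agent name" and "agent role"
+  - Name: claude-coder, codex-reviewer (the actual execution unit)
+  - Role: coder, reviewer, senior (the logical role)
+- [ ] Standardize verdict notation: always uppercase `PASS` / `FAIL` / `ESCALATE`
+- [ ] Settle the terms "preset" vs. "pipeline"
+  - Standardize `--preset` as "pipeline type"
+- [ ] Reduce Korean/English mixing: minimize unnecessary English in Korean mode
+  - Exception: keep verdicts such as PASS/FAIL/ESCALATE in English (readability)
+---
+## 6. Output directory structure guidance
+- [ ] Print a summary of the output folder structure when run completes:
+  ```
+  Output: .cross-eval/output/
+    ├── iter-1/          (agent output for each iteration)
+    ├── iter-2/
+    └── final-report.md  (final report)
+  ```
+- [ ] Add a short "how to read this report" note at the top of report.md
+---
+## 7. Improve config.yaml comments
+- [ ] Expand the explanatory comments for each section of the generated config.yaml
+- [ ] Include commonly used setting changes as commented examples
+  - e.g., "# To use two reviewers: reviewer: [claude, codex]"
+  - e.g., "# Agent mode that edits files directly: agentic: true"
+- [ ] Add a commented example of a custom phase-based pipeline
+---
+## Priorities
+| Priority | Item | Reason |
+|---------|------|------|
+| P0 | 2.1 Missing required input error | The most common stumbling block |
+| P0 | 4.1 Guidance after init + templates | Users who get stuck on first use drop off |
+| P0 | 3.1 Startup summary banner | Users need to know what is running |
+| P1 | 2.2 Agent failure messages | On failure, users don't know what to do |
+| P1 | 1.2 Cleaner run help | Too many options cause confusion |
+| P1 | 5. Terminology consistency | Reduce confusion |
+| P2 | 3.2-3.3 Result/progress messages | Nice to have, but not urgent |
+| P2 | 7. config.yaml comments | Convenience for power users |
+| P2 | 6. Output structure guidance | Clear after seeing it once |
+| P3 | 1.3-1.5 Remaining help text | Incremental improvement |
+---
+## How to test
+After each change:
+1. **Check help**: `cross-eval --help`, `cross-eval run --help`, etc.
+2. **Check error paths**: run with deliberately invalid input and confirm the error message is useful
+3. **Simulate first use**: the full `init → doctor → demo → run` flow in an empty directory
+4. **Verify with cross-eval itself**: run cross-eval with this document as plan.md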
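The banner proposed in item 3.1 could be produced by a helper like the one below. `render_banner` is a hypothetical name, not an existing cross-eval function. The sample preset is `coding-plan-review`, the default after this commit, rather than the plan's `simple`.

```python
def render_banner(fields: dict[str, str], width: int = 44) -> str:
    """Render a framed start-of-run summary (hypothetical helper for item 3.1)."""
    title = "━━━ cross-eval "
    top = title + "━" * (width - len(title))  # pad the title rule to `width`
    # Pad every "Label:" to the same width so the values form one column.
    label_width = max((len(label) for label in fields), default=0) + 2
    rows = [
        f"    {(label + ':').ljust(label_width)}{value}"
        for label, value in fields.items()
    ]
    return "\n".join([top, *rows, "━" * width])


if __name__ == "__main__":
    print(render_banner({
        "Plan": ".cross-eval/plan.md",
        "Preset": "coding-plan-review",
        "Coder": "claude-coder",
        "Reviewer": "claude-reviewer",
        "Max iter": "3",
    }))
```

Computing the label column from the longest label keeps the values aligned when optional fields (for example a Senior line) are added or omitted.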
--- a/checklist.md
+++ b/checklist.md
@@ -0,0 +1,31 @@
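+# cross-eval CLI usability refactoring checklist
+## Core user flow
+- [ ] Clearly explain what to do after `cross-eval init`.
+- [ ] Explain when and why to use `cross-eval doctor`.
+- [ ] Make clear what is needed before running `cross-eval run`.
+- [ ] Tell users that results are saved under `.cross-eval/output` after a run.
+## Understanding the `run` command
+- [ ] The differences between `--preset` values can be compared quickly.
+- [ ] The different roles of `--coder`, `--reviewer`, and `--senior` are explained.
+- [ ] The relationship between config-based runs and CLI-option-based runs is clear.
+- [ ] Users can tell, without confusion, which options override the config.
+## Example quality
+- [ ] Representative usage examples are organized around real user goals.
+- [ ] Examples are condensed to the key combinations, not so numerous that they distract.
+- [ ] Basic examples for beginners are kept separate from advanced examples.
+- [ ] Copying an example as-is is enough to actually run it.
+## Scope control for the refactor
+- [ ] Do not rename existing commands or options.
+- [ ] Do not change functional behavior unnecessarily.
+- [ ] Keep the goal as improving guidance text, not adding new features.
+- [ ] Do not extend the UI or features beyond the plan's scope.
+## Code quality
+- [ ] Do not break existing tests.
+- [ ] Check for regressions caused by help/text changes.
+- [ ] String changes do not contradict the actual output flow.
+- [ ] No duplicate or conflicting explanations are introduced.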
--- a/cross_eval/agent.py
+++ b/cross_eval/agent.py
@@ -34,6 +34,12 @@ _NO_CHANGE_ACK_MARKERS = (
    "code is correct as-is",
    "already correct",
    "no action required",
+    "변경 없음",
+    "수정 없음",
+    "수정할 필요 없음",
+    "변경할 필요 없음",
+    "이미 올바름",
+    "조치 불필요",
 )
 _CHANGE_CLAIM_MARKERS = (
    "summary of all changes made",
@@ -73,6 +79,15 @@ _CHANGE_CLAIM_MARKERS = (
    "completed the implementation",
    "all changes have been made",
    "changes are complete",
+    "수정 완료",
+    "모든 수정이 완료",
+    "변경 요약",
+    "변경 파일",
+    "신규 생성",
+    "기획서 수정",
+    "체크리스트 수정",
+    "문서를 수정",
+    "문서 수정",
 )
@@ -414,6 +429,7 @@ def invoke_agent_agentic(
    env: Optional[dict[str, str]] = None,
    timeout: int | None = None,
    quiet: bool = False,
+    base_commit: str | None = None,
 ) -> AgentResult:
    """Invoke an agent in agentic mode using the worktree as the source of truth."""
    from cross_eval.worktree import capture_diff
@@ -506,8 +522,8 @@ def invoke_agent_agentic(
            suggested_action=suggested_action,
        )
-    # Capture git diff as the output (changes since last commit on the branch)
+    # Capture git diff as the output (changes since the base commit)
-    diff_output = capture_diff(worktree_path)
+    diff_output = capture_diff(worktree_path, base_commit=base_commit)
    if not diff_output:
        stdout_excerpt = (result.stdout or "").strip()
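When `capture_diff` returns nothing, `invoke_agent_agentic` enters the `if not diff_output:` branch shown above. The marker tuples extended in this commit appear to be phrase lists for interpreting the agent's stdout in that case. The sketch below shows one way such a check could classify a run. It assumes case-insensitive substring matching and that an explicit "no change" acknowledgement wins over a change claim. The function, the labels, and the marker subsets are hypothetical; the real logic is not shown in this diff.

```python
# Hypothetical subsets of the marker tuples in agent.py, for illustration only.
NO_CHANGE_ACK_MARKERS = ("no action required", "already correct", "변경 없음")
CHANGE_CLAIM_MARKERS = ("changes are complete", "수정 완료", "변경 요약")


def classify_empty_diff(stdout: str) -> str:
    """Label an agent run whose captured diff came back empty."""
    text = stdout.lower()  # markers are lowercase; lower() leaves Hangul unchanged
    if any(marker in text for marker in NO_CHANGE_ACK_MARKERS):
        return "NO_CHANGE_ACK"    # the agent says nothing needed changing
    if any(marker in text for marker in CHANGE_CLAIM_MARKERS):
        return "CLAIMED_NO_DIFF"  # the agent claims edits, but the diff is empty
    return "EMPTY_DIFF"
```

Adding Korean phrases matters when prompts run with `--lang ko`, where agents tend to summarize their work in Korean and English-only markers would miss both cases.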
--- a/cross_eval/cli.py
+++ b/cross_eval/cli.py
@@ -38,7 +38,7 @@ coders: [claude-coder]
 reviewers: [claude-reviewer]
 # seniors: [codex-senior]
-# 파이프라인 종류: simple | cross-review | plan-review | review-only | review-fix | coding-review-fix
+# 파이프라인 종류: plan-review | coding-plan-review
 pipeline: preset:{preset}
 # 반복 설정
@@ -194,20 +194,12 @@ def main(argv: list[str] | None = None) -> int:
    )
    init_parser.add_argument(
        "--preset",
-        default="simple",
+        default="coding-plan-review",
-        choices=[
+        choices=["plan-review", "coding-plan-review"],
-            "simple",
-            "cross-review",
-            "plan-review",
-            "review-only",
-            "review-fix",
-            "coding-review-fix",
-        ],
        help=(
-            "파이프라인 종류 (기본: simple). "
+            "파이프라인 종류 (기본: coding-plan-review). "
-            "simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서기획검토, "
+            "plan-review=문서리뷰수정재검증, "
-            "review-only=리뷰만, review-fix=리뷰수렴+자동수정, "
+            "coding-plan-review=문서기반구현후 코드+문서 리뷰/수정/재검증"
-            "coding-review-fix=초기코딩후리뷰수렴"
        ),
    )
    init_parser.add_argument(
@@ -252,9 +244,9 @@ def main(argv: list[str] | None = None) -> int:
    )
    demo_parser.add_argument(
        "--preset",
-        default="simple",
+        default="coding-plan-review",
-        choices=["simple", "review-fix", "coding-review-fix"],
+        choices=["plan-review", "coding-plan-review"],
-        help="데모할 파이프라인 종류 (기본: simple)",
+        help="데모할 파이프라인 종류 (기본: coding-plan-review)",
    )
    demo_parser.add_argument(
        "--escalate",
@@ -281,25 +273,12 @@ def main(argv: list[str] | None = None) -> int:
        ),
        epilog=(
            "파이프라인 종류 (--preset):\n"
-            "  ┌──────────────┬─────────────────────────────────────────────────────┐\n"
+            "  ┌─────────────────────┬──────────────────────────────────────────────┐\n"
-            "  │ simple       │ Coder가 코드 작성 → Reviewer가 리뷰               │\n"
+            "  │ coding-plan-review  │ 입력 문서 기반 구현 → 코드+문서 리뷰/수정   │\n"
-            "  │ (기본값)     │ FAIL이면 피드백 반영해서 재코딩, PASS까지 반복     │\n"
+            "  │ (기본값)            │ → 재검증 반복                                │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
+            "  ├─────────────────────┼──────────────────────────────────────────────┤\n"
-            "  │ review-fix   │ 2단계 파이프라인:                                  │\n"
+            "  │ plan-review         │ 구현 전 문서 리뷰 → 문서 수정 → 재검증 반복 │\n"
-            "  │              │  Reviewer N명 병렬 리뷰 → 취합 → 수정 → 재검증   │\n"
+            "  └─────────────────────┴──────────────────────────────────────────────┘\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ coding-      │ 3단계 파이프라인:                                  │\n"
-            "  │ review-fix   │  초기 코딩 1회 → 리뷰 취합 → 수정 → 재검증 반복   │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ plan-review  │ 구현 전 기획서/체크리스트/문서를 검토             │\n"
-            "  │              │ 필요하면 현재 코드베이스와의 정합성도 점검       │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ review-only  │ 코드 작성 없이 Reviewer N명이 기존 코드만 검토    │\n"
-            "  │              │ (이미 작성된 코드의 품질 감사용)                   │\n"
-            "  ├──────────────┼─────────────────────────────────────────────────────┤\n"
-            "  │ cross-review │ Coder 2명이 각각 구현 → 상대방 코드를 교차 리뷰   │\n"
-            "  │              │ (서로 다른 에이전트의 구현 비교용)                 │\n"
-            "  └──────────────┴─────────────────────────────────────────────────────┘\n"
            "\n"
            "기본 제공 에이전트:\n"
            "  ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n"
@@ -316,34 +295,13 @@ def main(argv: list[str] | None = None) -> int:
            "\n"
            "사용 예시:\n"
            "\n"
-            "  기본 실행 (Claude가 코딩하고 Claude가 리뷰):\n"
+            "  코드 + 문서 구현/리뷰 루프 (coding-plan-review):\n"
-            "    cross-eval run --plan plan.md\n"
+            "    cross-eval run --plan plan.md --preset coding-plan-review \\\n"
+            "      --coder claude --reviewer codex --reviewer claude --senior codex\n"
            "\n"
-            "  Codex가 코딩, Claude가 리뷰:\n"
+            "  문서 리뷰 + 수정 + 재검증 반복 (plan-review):\n"
-            "    cross-eval run --plan plan.md --coder codex --reviewer claude\n"
-            "\n"
-            "  리뷰어 2명 (Claude + Codex):\n"
-            "    cross-eval run --plan plan.md --reviewer claude --reviewer codex\n"
-            "\n"
-            "  리뷰 취합용 Senior 추가:\n"
-            "    cross-eval run --plan plan.md --preset review-fix \\\n"
-            "      --reviewer claude --reviewer codex --senior codex\n"
-            "\n"
-            "  리뷰 수렴 후 자동 수정 (review-fix):\n"
-            "    cross-eval run --plan plan.md --preset review-fix \\\n"
-            "      --reviewer claude --reviewer codex\n"
-            "\n"
-            "  초기 코딩 후 리뷰 수렴 + 자동 수정 (coding-review-fix):\n"
-            "    cross-eval run --plan plan.md --preset coding-review-fix \\\n"
-            "      --reviewer claude --reviewer codex\n"
-            "\n"
-            "  기존 코드 리뷰만 (review-only):\n"
-            "    cross-eval run --plan plan.md --preset review-only \\\n"
-            "      --reviewer claude --reviewer codex\n"
-            "\n"
-            "  구현 전 문서/기획 검토 (plan-review):\n"
            "    cross-eval run --plan plan.md --preset plan-review \\\n"
-            "      --reviewer claude --reviewer codex\n"
+            "      --coder claude --reviewer codex --reviewer claude --senior codex\n"
            "\n"
            "  모델 변경:\n"
            "    cross-eval run --plan plan.md --model sonnet\n"
@@ -420,7 +378,11 @@ def main(argv: list[str] | None = None) -> int:
    )
    agent_group.add_argument(
        "--agentic", action="store_true", default=False,
-        help="Coder를 agentic 모드로 실행 (worktree에서 파일 직접 수정, git diff로 결과 캡처)",
+        help="Coder를 agentic 모드로 실행 (파일 직접 수정, git diff로 결과 캡처)",
    )
+    agent_group.add_argument(
+        "--worktree", action="store_true", default=False,
+        help="기본 direct mode 대신 isolated git worktree에서 실행",
+    )
    agent_group.add_argument(
        "--model", default=None, metavar="MODEL",
@@ -443,15 +405,8 @@ def main(argv: list[str] | None = None) -> int:
    pipe_group = run_parser.add_argument_group("파이프라인")
    pipe_group.add_argument(
        "--preset", default=None,
-        choices=[
+        choices=["plan-review", "coding-plan-review"],
-            "simple",
+        help="파이프라인 종류 (기본: coding-plan-review). 각 종류 설명은 아래 참조",
-            "cross-review",
-            "plan-review",
-            "review-only",
-            "review-fix",
-            "coding-review-fix",
-        ],
-        help="파이프라인 종류 (기본: simple). 각 종류 설명은 아래 참조",
    )
    pipe_group.add_argument(
        "--max-iter", type=int, default=None,
@@ -560,18 +515,11 @@ def cmd_demo(args: argparse.Namespace) -> int:
 # ---------------------------------------------------------------------------
 _PRESET_DESCRIPTIONS = {
-    "simple": "코딩 + 리뷰 (가장 기본)",
+    "coding-plan-review": "입력 문서 기반 구현 후 코드+문서 리뷰/수정 반복",
-    "review-fix": "리뷰 → 취합 → 수정 → 재검증 반복",
+    "plan-review": "문서 리뷰 → 수정 → 재검증 반복",
-    "coding-review-fix": "초기 코딩 + 리뷰 수렴 반복",
-    "plan-review": "구현 전 기획서/문서 검토",
-    "review-only": "기존 코드만 리뷰 (코딩 없음)",
-    "cross-review": "2명이 각각 구현 후 교차 리뷰",
 }
-_PRESET_ORDER = [
+_PRESET_ORDER = ["coding-plan-review", "plan-review"]
-    "simple", "review-fix", "coding-review-fix",
-    "plan-review", "review-only", "cross-review",
-]
 def _prompt_choice(
@@ -640,7 +588,7 @@ def _run_guided_init(target: Path) -> dict:
    coder = _prompt_text("  Coder 에이전트", default="claude")
    reviewer = _prompt_text("  Reviewer 에이전트", default="claude")
-    needs_senior = preset in ("review-fix", "coding-review-fix")
+    needs_senior = preset in ("coding-plan-review", "plan-review")
    senior = ""
    if needs_senior:
        senior = _prompt_text("  Senior 에이전트", default=reviewer)
@@ -899,10 +847,10 @@ def cmd_run(args: argparse.Namespace) -> int:
    need_rebuild = args.preset is not None or args.coders or args.reviewers or args.seniors
    if need_rebuild:
        from cross_eval.prompts import PHASED_PRESETS
-        preset = args.preset or "simple"
+        preset = args.preset or "coding-plan-review"
        # Determine which preset was configured (from YAML or defaults)
        if args.preset is None and config.phases:
-            preset = config.preset_name if config.preset_name != "custom" else "review-fix"
+            preset = config.preset_name if config.preset_name != "custom" else "coding-plan-review"
        elif args.preset is None and not args.coders and not args.reviewers and not args.seniors:
            pass  # no changes needed
        inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles(
@@ -929,8 +877,6 @@ def cmd_run(args: argparse.Namespace) -> int:
        elif preset in PIPELINE_PRESETS:
            config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
            config.phases = []
-            if preset in {"plan-review", "review-only"} and args.max_iter is None and args.min_iter is None:
-                config.max_iterations = 1
    sync_phased_iterations(config)
    if args.max_iter is not None:
@@ -951,6 +897,9 @@ def cmd_run(args: argparse.Namespace) -> int:
            if coder_name in config.agents:
                _make_agentic(config.agents[coder_name])
+    if args.worktree:
+        config.use_worktree = True
    ensure_fix_preset_agentic(config)
    # --model: apply to ALL agents
@@ -988,7 +937,7 @@ def cmd_run(args: argparse.Namespace) -> int:
            print(f"No files found in: {docs_dir}", file=sys.stderr)
            return 1
        config.inputs["docs"] = docs_content
-        config.inputs["docs_ref"] = str(docs_dir)
+        config.inputs["docs_ref"] = docs_dir
    if args.env_files:
        for env_file in args.env_files:
@@ -1062,6 +1011,9 @@ def cmd_run(args: argparse.Namespace) -> int:
    if not args.dry_run and result.run_dir:
        print(f"Output: {result.run_dir}/")
+    if args.dry_run:
+        return 0
    if result.final_verdict == "ESCALATE":
        from cross_eval.report import print_escalation_report
        print_escalation_report(config, result)
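The preset fallback in `cmd_run` now defaults to `coding-plan-review` in both branches. The function below restates only that branch logic (the `preset = ...` lines in the `@@ -899,10 +847,10 @@` hunk) as a standalone sketch. `resolve_preset` is a hypothetical name, and the sketch ignores the role inference that follows in the real code.

```python
from typing import Optional

DEFAULT_PRESET = "coding-plan-review"


def resolve_preset(cli_preset: Optional[str], configured: str, has_phases: bool) -> str:
    """Preset that cmd_run rebuilds the pipeline from (sketch of the hunk above)."""
    if cli_preset is not None:
        return cli_preset  # an explicit --preset always wins
    if has_phases:
        # A phased config keeps its own preset; "custom" falls back to the default.
        return configured if configured != "custom" else DEFAULT_PRESET
    return DEFAULT_PRESET
```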
--- a/cross_eval/config.py
+++ b/cross_eval/config.py
@@ -31,7 +31,10 @@ DEFAULT_ROLE_REASONING_EFFORTS = {
    "reviewer": "medium",
    "senior": "high",
 }
-FIX_STYLE_PRESETS = {"review-fix", "coding-review-fix"}
+FIX_STYLE_PRESETS = {
+    "plan-review",
+    "coding-plan-review",
+}
 # ---------------------------------------------------------------------------
@@ -296,7 +299,10 @@ def _default_seniors_for_preset(
    """Infer a default senior agent for presets that benefit from adjudication."""
    if not (
        isinstance(pipeline_raw, str)
-        and pipeline_raw in {"preset:review-fix", "preset:coding-review-fix"}
+        and pipeline_raw in {
+            "preset:plan-review",
+            "preset:coding-plan-review",
+        }
        and reviewers
    ):
        return []
@@ -378,9 +384,11 @@ def default_config() -> PipelineConfig:
    coders = ["claude-coder"]
    reviewers = ["claude-reviewer"]
    seniors: list[str] = []
-    pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
+    pipeline: list[StepConfig] = []
+    phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
    return PipelineConfig(
        output_dir=Path(".cross-eval/output"),
+        use_worktree=False,
        max_iterations=3,
        language="ko",
        execution=ExecutionConfig(),
@@ -390,6 +398,8 @@ def default_config() -> PipelineConfig:
        reviewers=reviewers,
        seniors=seniors,
        pipeline=pipeline,
+        phases=phases,
+        preset_name="coding-plan-review",
    )
@@ -433,7 +443,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
        )
    # --- roles: explicit or inferred ---
-    pipeline_raw = raw.get("pipeline", "preset:simple")
+    pipeline_raw = raw.get("pipeline", "preset:coding-plan-review")
    coders_raw = raw.get("coders")
    reviewers_raw = raw.get("reviewers")
    seniors_raw = raw.get("seniors")
@@ -494,6 +504,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
    config = PipelineConfig(
        output_dir=output_dir,
+        use_worktree=bool(raw.get("use_worktree", False)),
        max_iterations=int(raw.get("max_iterations", 3)),
        min_iterations=int(raw.get("min_iterations", 1)),
        verbose=bool(raw.get("verbose", False)),
@@ -551,10 +562,10 @@ def _resolve_pipeline(
    """Resolve pipeline from preset string or explicit step list.
    Returns (steps, phases) tuple.  Only one will be non-empty.
-    - Simple/cross-review/plan-review/review-only → steps populated, phases empty.
+    - plan-review → steps populated, phases empty.
-    - Phased presets (review-fix) → steps empty, phases populated.
+    - coding-plan-review → steps empty, phases populated.
    """
-    # Preset: "preset:simple" or "preset:review-fix"
+    # Preset: "preset:plan-review" or "preset:coding-plan-review"
    if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
        preset_name = pipeline_raw.split(":", 1)[1]
        if preset_name in PIPELINE_PRESETS:
@@ -588,7 +599,7 @@ def _resolve_pipeline(
        return steps, []
    raise ValueError(
-        f"'pipeline' must be a preset string (e.g. 'preset:simple') "
+        f"'pipeline' must be a preset string (e.g. 'preset:plan-review') "
        f"or a list of step definitions, got {type(pipeline_raw).__name__}"
    )
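The `pipeline:` value in config.yaml is either a `preset:` string or a list of step definitions. The sketch below classifies such a value, following the `_resolve_pipeline` docstring in this hunk: `plan-review` yields steps and `coding-plan-review` yields phases. `classify_pipeline` and the unknown-preset message are hypothetical. The type error message is copied from the hunk.

```python
from typing import Any

# Assumption taken from the _resolve_pipeline docstring in this hunk.
STEP_PRESETS = {"plan-review"}           # resolve to a flat step list
PHASED_PRESET_NAMES = {"coding-plan-review"}  # resolve to phases


def classify_pipeline(pipeline_raw: Any) -> tuple[str, str]:
    """Return (kind, preset_name) for a `pipeline:` value from config.yaml."""
    if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
        name = pipeline_raw.split(":", 1)[1]
        if name in STEP_PRESETS:
            return "steps", name
        if name in PHASED_PRESET_NAMES:
            return "phases", name
        raise ValueError(f"unknown preset: {name!r}")  # hypothetical message
    if isinstance(pipeline_raw, list):
        return "steps", "custom"  # hand-written step definitions
    raise ValueError(
        f"'pipeline' must be a preset string (e.g. 'preset:plan-review') "
        f"or a list of step definitions, got {type(pipeline_raw).__name__}"
    )
```

A config that omits `pipeline` entirely falls back to `preset:coding-plan-review` in `_parse_raw`, so it ends up on the phased path.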
--- a/cross_eval/demo.py
+++ b/cross_eval/demo.py
@@ -165,7 +165,7 @@ CYAN = "\033[36m"
 RESET = "\033[0m"
-def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
+def run_mock_demo(preset: str = "coding-plan-review", show_escalate: bool = False) -> None:
    """Run a simulated demo showing the full pipeline lifecycle."""
    steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS
@@ -229,7 +229,7 @@ def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
 def run_live_demo(
-    preset: str = "simple",
+    preset: str = "coding-plan-review",
    timeout: int | None = None,
 ) -> PipelineResult:
    """Run a live demo with real agents using the built-in plan."""
@@ -255,8 +255,9 @@ def run_live_demo(
        pipeline = []
        phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
    else:
-        pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
+        pipeline = []
-        phases = []
+        phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
    with tempfile.TemporaryDirectory() as tmpdir:
        plan_path = Path(tmpdir) / "plan.md"
--- a/cross_eval/models.py
+++ b/cross_eval/models.py
@@ -62,6 +62,7 @@ class PipelineConfig:
    """Full cross-eval configuration."""
    output_dir: Path = field(default_factory=lambda: Path(".cross-eval/output"))
+    use_worktree: bool = False
    max_iterations: int = 3
    min_iterations: int = 1
    verbose: bool = False
--- a/cross_eval/pipeline.py
+++ b/cross_eval/pipeline.py
@@ -4,6 +4,7 @@ from __future__ import annotations
 import logging
 import os
 import re
 import shutil
+import subprocess
 import time
 from hashlib import sha256
@@ -34,6 +35,19 @@ from cross_eval.runtime_env import (
 logger = logging.getLogger(__name__)
+def _get_current_head(cwd: Path) -> str | None:
+    """Return the current HEAD SHA for an existing repository."""
+    result = subprocess.run(
+        ["git", "rev-parse", "HEAD"],
+        cwd=cwd,
+        capture_output=True,
+        text=True,
+    )
+    if result.returncode != 0:
+        return None
+    return result.stdout.strip() or None
 def run_pipeline(
    config: PipelineConfig,
    cwd: Path | None = None,
@@ -62,18 +76,20 @@ def _commit_iteration(
    label: str,
    iteration: int,
    verdict: str | None,
-) -> None:
+) -> str:
    """Intermediate commit after each agentic iteration.
    This resets the diff baseline so the next iteration only captures new changes.
+    Returns the new HEAD SHA to use as the base_commit for the next iteration.
    """
-    from cross_eval.worktree import commit_worktree
+    from cross_eval.worktree import commit_worktree, get_current_head
    committed = commit_worktree(
        worktree_path,
        f"cross-eval: {label} v{iteration} ({verdict or 'no-verdict'})",
    )
    if committed:
        logger.debug("  Intermediate commit: v%d (%s)", iteration, verdict)
+    return get_current_head(worktree_path)
 def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
@@ -84,50 +100,124 @@ def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
    )
-def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str]:
+def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str, str]:
    """Create a shared worktree for the entire pipeline run.
    1. Generate branch name (cross-eval/<preset>_<timestamp>)
    2. Create branch from HEAD
    3. Create worktree on that branch
-    Returns (worktree_path, branch_name).
+    Returns (worktree_path, branch_name, base_commit).
    """
    from cross_eval.worktree import create_worktree, make_branch_name, make_worktree_dir
    branch_name = make_branch_name(preset_name)
    worktree_dir = make_worktree_dir(cwd, branch_name)
-    worktree_path = create_worktree(
+    worktree_path, base_commit = create_worktree(
        base_cwd=cwd, work_dir=worktree_dir, branch_name=branch_name,
    )
    (run_dir / "worktree_path.txt").write_text(f"{worktree_path}\n", encoding="utf-8")
    (run_dir / "worktree_branch.txt").write_text(f"{branch_name}\n", encoding="utf-8")
-    return worktree_path, branch_name
+    (run_dir / "worktree_base.txt").write_text(f"{base_commit}\n", encoding="utf-8")
+    return worktree_path, branch_name, base_commit
 def _copy_inputs_to_worktree(
    config: PipelineConfig,
    worktree_path: Path,
    *,
    base_cwd: Path,
 ) -> None:
    """Copy input files (plan, checklist, etc.) into the worktree.
-    This ensures agents running in plan/read-only mode within the worktree
-    can access these files, even though the originals live in the base repo.
-    Updates config.inputs in-place so subsequent reference refreshes use
+    Repo-local inputs are remapped to the corresponding path inside the worktree
+    so agentic edits produce a real git diff. External inputs are copied into a
+    dedicated inputs directory. For ``plan-review`` these external copies remain
+    tracked so document edits can survive on the branch; other presets keep them
+    ignored to avoid polluting code diffs.
+    Updates ``config.inputs`` in-place so subsequent reference refreshes use
    worktree-local paths.
    """
-    import shutil
+    base_root = base_cwd.resolve()
+    track_external_inputs = config.preset_name == "plan-review"
    inputs_dir = worktree_path / ".cross-eval-inputs"
    inputs_dir.mkdir(exist_ok=True)
-    # Exclude from git so these don't pollute agentic diffs
+    if not track_external_inputs:
-    (inputs_dir / ".gitignore").write_text("*\n", encoding="utf-8")
+        # Exclude read-only input copies from git so they don't pollute code diffs.
        (inputs_dir / ".gitignore").write_text("*\n", encoding="utf-8")
    for key, val in list(config.inputs.items()):
-        if key.endswith("_ref") or not isinstance(val, Path):
+        if not isinstance(val, Path):
            continue
        if not val.exists():
            continue
-        dest = inputs_dir / val.name
-        shutil.copy2(val, dest)
-        config.inputs[key] = dest
+        resolved = val.resolve()
+        try:
+            rel_path = resolved.relative_to(base_root)
+        except ValueError:
+            dest = inputs_dir / val.name
+            _copy_path(resolved, dest)
+            config.inputs[key] = dest
+            continue
+        worktree_target = worktree_path / rel_path
+        if not worktree_target.exists():
+            _copy_path(resolved, worktree_target)
+        config.inputs[key] = worktree_target
 def _snapshot_input_paths(config: PipelineConfig) -> dict[str, Path]:
    """Capture original on-disk input paths before remapping into a worktree."""
    return {
        key: val
        for key, val in config.inputs.items()
        if isinstance(val, Path)
    }
 def _apply_worktree_inputs_to_base(
    config: PipelineConfig,
    original_inputs: dict[str, Path],
    *,
    cwd: Path,
 ) -> list[Path]:
    """Copy the final worktree-edited inputs back onto the user-provided paths."""
    restored: list[Path] = []
    for key, original_path in original_inputs.items():
        current_path = config.inputs.get(key)
        if not isinstance(current_path, Path) or not current_path.exists():
            continue
        if current_path.resolve() == original_path.resolve():
            continue
        _copy_path(current_path, original_path)
        restored.append(original_path)
    return restored
 def _commit_base_repo_paths(cwd: Path, paths: list[Path], message: str) -> bool:
    """Commit changed input paths in the base repository when they live under cwd."""
    rel_paths: list[str] = []
    for path in paths:
        try:
            rel_paths.append(str(path.resolve().relative_to(cwd.resolve())))
        except ValueError:
            continue
    if not rel_paths:
        return False
    subprocess.run(
        ["git", "add", "--", *rel_paths],
        cwd=cwd,
        capture_output=True,
        check=True,
    )
    result = subprocess.run(
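        # The pathspec limits the commit to these inputs; unrelated staged changes stay staged.
        ["git", "commit", "-m", message, "--", *rel_paths],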
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
 def _snapshot_repo_state(cwd: Path) -> dict[str, str]:
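The placement rule in `_copy_inputs_to_worktree` can be restated as a pure function. This is a hedged sketch rather than the project's code: `Placement` and `place_input` are hypothetical names, paths are assumed to be already resolved (the real function calls `.resolve()` first), and the actual copying is left out. A repo-local input maps to the same relative path inside the worktree, so edits to it show up in the worktree's git diff; an external input is copied by file name into `.cross-eval-inputs/`, which receives a `*` `.gitignore` for every preset except `plan-review`.

```python
"""Hypothetical sketch of the input-placement rule used when a worktree is created."""
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath

INPUTS_DIR = ".cross-eval-inputs"  # directory name taken from the diff


@dataclass(frozen=True)
class Placement:
    target: PurePosixPath  # where the pipeline reads the input inside the worktree
    gitignored: bool       # True when the copy is hidden from git diffs


def place_input(
    path: PurePosixPath,
    base_root: PurePosixPath,
    worktree_root: PurePosixPath,
    preset_name: str,
) -> Placement:
    """Map an (already resolved) input path to its location inside the worktree."""
    try:
        rel_path = path.relative_to(base_root)
    except ValueError:
        # External input: copied by file name into the dedicated inputs directory.
        # Only plan-review keeps these copies tracked so document edits survive.
        return Placement(
            worktree_root / INPUTS_DIR / path.name,
            gitignored=preset_name != "plan-review",
        )
    # Repo-local input: same relative path inside the worktree checkout.
    return Placement(worktree_root / rel_path, gitignored=False)


repo, wt = PurePosixPath("/repo"), PurePosixPath("/tmp/wt")
print(place_input(PurePosixPath("/repo/docs/plan.md"), repo, wt, "coding-plan-review"))
print(place_input(PurePosixPath("/home/me/plan.md"), repo, wt, "coding-plan-review"))
```

Keeping the rule separate from the file copying makes the tracked/ignored decision easy to unit-test per preset.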
@@ -320,17 +410,26 @@ def _run_simple_pipeline(
    # Setup shared worktree for agentic mode
    worktree_path: Path | None = None
    agent_execution_path: Path | None = None
    agentic_branch_name: str | None = None
    agentic_base_commit: str | None = None
    original_input_paths: dict[str, Path] = {}
    base_repo_state: dict[str, str] | None = None
    base_repo_status: str | None = None
    if not dry_run and _has_agentic_steps(config, config.pipeline):
-        worktree_path, agentic_branch_name = _setup_worktree(
-            cwd, run_dir, config.preset_name,
-        )
-        _copy_inputs_to_worktree(config, worktree_path)
-        _refresh_input_references(config, input_contents)
-        base_repo_state = _snapshot_repo_state(cwd)
-        base_repo_status = _snapshot_repo_status(cwd)
+        if config.use_worktree:
+            worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
+                cwd, run_dir, config.preset_name,
+            )
+            original_input_paths = _snapshot_input_paths(config)
+            _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
+            _refresh_input_references(config, input_contents)
+            base_repo_state = _snapshot_repo_state(cwd)
+            base_repo_status = _snapshot_repo_status(cwd)
+            agent_execution_path = worktree_path
+        else:
+            agent_execution_path = cwd
+            agentic_base_commit = _get_current_head(cwd)
    feedback = "(no feedback — first iteration)"
    iterations: list[IterationResult] = []
@@ -356,15 +455,16 @@ def _run_simple_pipeline(
                config.pipeline, config, input_contents, feedback,
                i, config.max_iterations, cwd, timeout, dry_run,
                run_dir=run_dir, output_iter=i,
-                worktree_path=worktree_path,
+                worktree_path=agent_execution_path,
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
                base_commit=agentic_base_commit,
            )
            # Intermediate commit so next iteration's diff only shows new changes
-            if worktree_path is not None:
+            if config.use_worktree and worktree_path is not None:
-                _commit_iteration(worktree_path, config.preset_name, i, verdict)
+                agentic_base_commit = _commit_iteration(worktree_path, config.preset_name, i, verdict)
            iter_result = IterationResult(
                iteration=i,
@@ -454,8 +554,25 @@ def _run_simple_pipeline(
                break
    finally:
        if config.use_worktree and worktree_path is not None and original_input_paths:
            restored_paths = _apply_worktree_inputs_to_base(
                config, original_input_paths, cwd=cwd,
            )
            if restored_paths:
                try:
                    committed = _commit_base_repo_paths(
                        cwd,
                        restored_paths,
                        f"cross-eval: {config.preset_name} ({final_verdict})",
                    )
                    if committed:
                        logger.info("  Applied and committed final input changes in base repo.")
                    else:
                        logger.info("  Applied final input changes in base repo (no commit created).")
                except Exception:
                    logger.warning("  Failed to commit final input changes in base repo", exc_info=True)
        agentic_branch: str | None = None
-        if worktree_path is not None and agentic_branch_name is not None:
+        if config.use_worktree and worktree_path is not None and agentic_branch_name is not None:
            agentic_branch = _finalize_worktree(
                cwd, worktree_path, agentic_branch_name,
                config.preset_name, final_verdict,
@@ -497,17 +614,26 @@ def _run_phased_pipeline(
    # Setup shared worktree for agentic mode
    all_phase_steps = [s for p in config.phases for s in p.steps]
    worktree_path: Path | None = None
    agent_execution_path: Path | None = None
    agentic_branch_name: str | None = None
    agentic_base_commit: str | None = None
    original_input_paths: dict[str, Path] = {}
    base_repo_state: dict[str, str] | None = None
    base_repo_status: str | None = None
    if not dry_run and _has_agentic_steps(config, all_phase_steps):
-        worktree_path, agentic_branch_name = _setup_worktree(
-            cwd, run_dir, config.preset_name,
-        )
-        _copy_inputs_to_worktree(config, worktree_path)
-        _refresh_input_references(config, input_contents)
-        base_repo_state = _snapshot_repo_state(cwd)
-        base_repo_status = _snapshot_repo_status(cwd)
+        if config.use_worktree:
+            worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
+                cwd, run_dir, config.preset_name,
+            )
+            original_input_paths = _snapshot_input_paths(config)
+            _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
+            _refresh_input_references(config, input_contents)
+            base_repo_state = _snapshot_repo_state(cwd)
+            base_repo_status = _snapshot_repo_status(cwd)
+            agent_execution_path = worktree_path
+        else:
+            agent_execution_path = cwd
+            agentic_base_commit = _get_current_head(cwd)
    iterations: list[IterationResult] = []
    feedback = "(no feedback — first iteration)"
@@ -554,15 +680,16 @@ def _run_phased_pipeline(
                    phase.steps, config, input_contents, feedback,
                    pi, phase.max_iterations, cwd, timeout, dry_run,
                    run_dir=run_dir, output_iter=global_iter, phase_name=phase.name,
-                    worktree_path=worktree_path,
+                    worktree_path=agent_execution_path,
                    runtime_env=runtime_env,
                    base_repo_state=base_repo_state,
                    base_repo_status=base_repo_status,
                    base_commit=agentic_base_commit,
                )
                # Intermediate commit so next iteration's diff only shows new changes
-                if worktree_path is not None:
+                if config.use_worktree and worktree_path is not None:
-                    _commit_iteration(
+                    agentic_base_commit = _commit_iteration(
                        worktree_path, f"{config.preset_name}/{phase.name}",
                        global_iter, verdict,
                    )
@@ -689,8 +816,25 @@ def _run_phased_pipeline(
                final_verdict = "PASS" if phase_converged else "MAX_ITERATIONS_REACHED"
    finally:
        if config.use_worktree and worktree_path is not None and original_input_paths:
            restored_paths = _apply_worktree_inputs_to_base(
                config, original_input_paths, cwd=cwd,
            )
            if restored_paths:
                try:
                    committed = _commit_base_repo_paths(
                        cwd,
                        restored_paths,
                        f"cross-eval: {config.preset_name} ({final_verdict})",
                    )
                    if committed:
                        logger.info("  Applied and committed final input changes in base repo.")
                    else:
                        logger.info("  Applied final input changes in base repo (no commit created).")
                except Exception:
                    logger.warning("  Failed to commit final input changes in base repo", exc_info=True)
        agentic_branch: str | None = None
-        if worktree_path is not None and agentic_branch_name is not None:
+        if config.use_worktree and worktree_path is not None and agentic_branch_name is not None:
            agentic_branch = _finalize_worktree(
                cwd, worktree_path, agentic_branch_name,
                config.preset_name, final_verdict,
@@ -724,6 +868,8 @@ def _load_inputs(config: PipelineConfig) -> dict[str, str]:
    for key, val in config.inputs.items():
        if key.endswith("_ref"):
            input_contents[key] = str(val)
        elif key == "docs":
            input_contents[key] = _load_docs_input(config, current_value=val)
        elif isinstance(val, str):
            input_contents[key] = val
        else:
@@ -739,6 +885,8 @@ def _refresh_inputs(
    for key, val in config.inputs.items():
        if key.endswith("_ref"):
            input_contents[key] = str(val)
        elif key == "docs":
            input_contents[key] = _load_docs_input(config, current_value=val)
        elif isinstance(val, str):
            input_contents[key] = val
        elif isinstance(val, Path) and val.exists():
@@ -746,6 +894,40 @@ def _refresh_inputs(
    _refresh_input_references(config, input_contents)
 def _load_docs_input(config: PipelineConfig, *, current_value: Path | str) -> str:
    """Load docs content from docs_ref when available so edits are visible next iteration."""
    docs_ref = config.inputs.get("docs_ref")
    docs_path = docs_ref if isinstance(docs_ref, Path) else None
    if docs_path is not None and docs_path.exists():
        if docs_path.is_dir():
            return _read_docs_tree(docs_path)
        try:
            return docs_path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            return ""
    if isinstance(current_value, str):
        return current_value
    if current_value.exists() and current_value.is_file():
        return current_value.read_text(encoding="utf-8")
    return ""
 def _read_docs_tree(docs_dir: Path) -> str:
    """Read all visible text files under a docs tree and concatenate them."""
    parts: list[str] = []
    for f in sorted(
        path for path in docs_dir.rglob("*")
        if path.is_file() and not any(part.startswith(".") for part in path.relative_to(docs_dir).parts)
    ):
        try:
            content = f.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue
        rel_path = f.relative_to(docs_dir).as_posix()
        parts.append(f"### {rel_path}\n{content}")
    return "\n\n".join(parts)
 def _refresh_input_references(
    config: PipelineConfig,
    input_contents: dict[str, str],
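`_read_docs_tree` flattens a docs directory into a single prompt section. The sketch below restates that formatting over an in-memory mapping so it can be tried without touching disk. `format_docs_bundle` is a hypothetical name, and sorting by path components is meant to approximate the `sorted()` over `Path` objects in the real function. Files under any dot-prefixed path component are skipped, and each remaining file becomes a `### <relative path>` header followed by its content, with sections separated by a blank line.

```python
"""Hypothetical in-memory version of the docs-tree formatting in _read_docs_tree."""
from __future__ import annotations

from pathlib import PurePosixPath


def format_docs_bundle(files: dict[str, str]) -> str:
    """Concatenate visible docs as '### <rel_path>' sections separated by blank lines."""
    parts: list[str] = []
    for rel_path in sorted(files, key=lambda p: PurePosixPath(p).parts):
        # Skip anything inside a hidden directory or named like a hidden file.
        if any(part.startswith(".") for part in PurePosixPath(rel_path).parts):
            continue
        parts.append(f"### {rel_path}\n{files[rel_path]}")
    return "\n\n".join(parts)


bundle = format_docs_bundle({
    "guide/setup.md": "Install deps.",
    "README.md": "Overview.",
    ".drafts/todo.md": "skip me",
    "guide/.notes.md": "skip me too",
})
print(bundle)
```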
@@ -903,6 +1085,7 @@ def _run_steps(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
    base_commit: str | None = None,
 ) -> tuple[dict[str, str], dict[str, AgentResult], str | None]:
    """Execute all steps in one iteration, parallelizing where possible."""
    step_outputs: dict[str, str] = {}
@@ -923,6 +1106,7 @@ def _run_steps(
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
                base_commit=base_commit,
            )
        else:
            _execute_parallel_batch(
@@ -934,6 +1118,7 @@ def _run_steps(
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
                base_commit=base_commit,
            )
    # Extract verdict from all verdict steps (ALL must PASS; ESCALATE wins over all)
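The comment above states how several verdict steps combine into one iteration verdict. A minimal sketch of that rule, under stated assumptions: `extract_verdict` and `combine_verdicts` are hypothetical helpers, the last `VERDICT:` line in an agent's output is taken as authoritative, and a step with no parseable verdict is treated as not passing. The real extraction in `_run_steps` is not part of this hunk and may differ.

```python
"""Hypothetical sketch of 'ALL must PASS; ESCALATE wins over all'."""
from __future__ import annotations

import re

# Matches the 'VERDICT: PASS|FAIL|ESCALATE' lines the prompt templates ask for.
VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL|ESCALATE)\b")


def extract_verdict(output: str) -> str | None:
    """Return the last verdict token in an agent's output, or None if absent."""
    matches = VERDICT_RE.findall(output)
    return matches[-1] if matches else None


def combine_verdicts(verdicts: list[str | None]) -> str | None:
    """Combine per-step verdicts: ESCALATE beats everything, PASS needs unanimity."""
    if not verdicts:
        return None                      # no verdict steps ran this iteration
    if "ESCALATE" in verdicts:
        return "ESCALATE"
    if all(v == "PASS" for v in verdicts):
        return "PASS"
    return "FAIL"                        # any FAIL or missing verdict blocks PASS


outputs = ["Looks good.\nVERDICT: PASS", "Issues remain.\nVERDICT: FAIL"]
print(combine_verdicts([extract_verdict(o) for o in outputs]))  # FAIL
```

Treating a missing verdict as a non-PASS keeps a malformed reviewer response from silently converging the loop.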
@@ -961,6 +1146,7 @@ def _invoke_agentic(
    env: dict[str, str] | None = None,
    timeout: int | None = None,
    quiet: bool = False,
    base_commit: str | None = None,
 ) -> AgentResult:
    """Run an agent in agentic mode using an existing worktree."""
    return invoke_agent_agentic(
@@ -968,6 +1154,7 @@ def _invoke_agentic(
        worktree_path=worktree_path,
        env=env,
        timeout=timeout, quiet=quiet,
        base_commit=base_commit,
    )
@@ -992,6 +1179,7 @@ def _execute_step(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
    base_commit: str | None = None,
 ) -> None:
    """Execute a single step, updating step_outputs and step_results in place."""
    if not quiet:
@@ -1035,6 +1223,7 @@ def _execute_step(
                worktree_path=worktree_path,
                env=runtime_env,
                timeout=timeout, quiet=quiet,
                base_commit=base_commit,
            )
        else:
            # When worktree exists, run non-agentic agents (reviewers) in
@@ -1125,6 +1314,7 @@ def _execute_parallel_batch(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
    base_commit: str | None = None,
 ) -> None:
    """Execute multiple steps in parallel using threads."""
    agent_names = ", ".join(s.agent for s in batch)
@@ -1139,6 +1329,7 @@ def _execute_parallel_batch(
                run_dir=run_dir, output_iter=output_iter, phase_name=phase_name,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
                base_commit=base_commit,
            )
        return
@@ -1161,6 +1352,7 @@ def _execute_parallel_batch(
                phase_name=phase_name, worktree_path=worktree_path,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
                base_commit=base_commit,
            )
        return
@@ -1204,6 +1396,7 @@ def _execute_parallel_batch(
                worktree_path=worktree_path,
                env=runtime_env,
                timeout=timeout, quiet=True,
                base_commit=base_commit,
            )
        else:
            effective_cwd = worktree_path if worktree_path else cwd
@@ -1664,3 +1857,12 @@ def _save_report(run_dir: Path, config: PipelineConfig, result: PipelineResult)
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report_path.write_text(report, encoding="utf-8")
    logger.info("Report saved: %s", report_path)
 def _copy_path(src: Path, dest: Path) -> None:
    """Copy a file or directory tree to ``dest`` (into or out of the worktree), creating parent directories as needed."""
    if src.is_dir():
        shutil.copytree(src, dest, dirs_exist_ok=True)
        return
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
--- a/cross_eval/prompts.py
+++ b/cross_eval/prompts.py
@@ -472,12 +472,270 @@ PLAN_REVIEW_TEMPLATE_KO = """\
 그렇지 않으면: VERDICT: FAIL
 """
 PLAN_FIX_TEMPLATE = """\
 You are tasked with revising planning documents based on adjudicated review feedback.
 ## Artifact References
 {artifact_references}
 ## Current Review Feedback
 {feedback}
 ## Instructions
 1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
 2. Update the planning package itself: the plan, checklist, and reference documents as needed.
 3. Do NOT write or modify production code. Only revise planning artifacts.
 4. Address ONLY the confirmed planning issues from the current review feedback.
 5. If feedback marks any item as DISMISSED or false positive, leave it unchanged.
 6. Make the smallest document changes that resolve ambiguity, omissions, scope creep, or repository compatibility issues.
 7. Keep the plan, checklist, and supporting docs internally consistent after your edits.
 8. After editing, briefly summarize what you changed and any blocker that still needs human input.
 """
 PLAN_FIX_TEMPLATE_KO = """\
 당신은 시니어 리뷰 결과를 바탕으로 기획 문서를 수정하는 담당자입니다.
 ## 참조 아티팩트
 {artifact_references}
 ## 현재 리뷰 피드백
 {feedback}
 ## 지침
 1. 참조된 plan/checklist/docs/review markdown을 직접 읽으세요.
 2. 수정 대상은 기획 패키지 자체입니다. 필요에 따라 기획서, 체크리스트, 참고 문서를 수정하세요.
 3. 프로덕션 코드를 작성하거나 수정하지 마세요. 기획 문서만 고치세요.
 4. 현재 리뷰 피드백에서 확정된 기획 이슈만 해결하세요.
 5. DISMISSED 또는 오탐으로 정리된 항목은 건드리지 마세요.
 6. 모호성, 누락, 과도한 범위, 저장소 정합성 문제를 해소하는 최소한의 문서 수정만 하세요.
 7. 수정 후에도 기획서, 체크리스트, 참고 문서가 서로 모순되지 않게 유지하세요.
 8. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
 """
 PLAN_VERIFY_TEMPLATE = """\
 You are verifying the latest planning package after plan-only revisions.
 ## Plan
 {plan}
 ## Checklist
 {checklist}
 ## Reference Documents
 {docs}
 ## Previous Review (iteration {iteration} of {max_iterations})
 {feedback}
 ## Execution Evidence
 {execution_evidence}
 ## Verify Instructions
 Review the latest planning package itself: the plan, checklist, and reference documents.
 You MAY inspect the current repository to confirm that the documents describe the current reality accurately enough.
 Do NOT require production code, scripts, infrastructure, or external environments to already be fixed.
 For `plan-review`, PASS means the plan can now be executed without further document edits.
 A known implementation gap, repo mismatch, legacy script problem, external dependency, or environment blocker is NOT a FAIL by itself if:
 - the issue is described accurately in the planning package,
 - the affected scope or gate is documented clearly,
 - the required follow-up action or non-go condition is documented clearly, and
 - the package does not misrepresent unresolved work as already complete.
 Only mark FAIL when the planning package still needs correction, such as:
 - unresolved ambiguity or contradiction in the documents,
 - missing prerequisite, dependency, gate, ownership, or evidence rule,
 - a known blocker that is still described inaccurately or misleadingly,
 - conflicting source-of-truth rules across the planning documents,
 - checklist or status criteria that would cause an operator to make the wrong decision.
 Report implementation/repository problems that are already documented correctly under "Documented Risks / Out of Scope" instead of treating them as FAIL reasons.
 ## Output Format
 ### Remaining Document Issues
 - [Major][Omission] Description (reference specific plan/checklist/doc item)
 (Write "None" if no document issue remains.)
 ### Documented Risks / Out of Scope
 - Description of a real implementation/repository/environment risk that is already documented correctly
 (Write "None" if nothing notable remains.)
 ### Summary
 - Remaining document issues: N
 - Documented risks / out-of-scope items: N
 - Overall quality: [BRIEF ASSESSMENT]
 ### Verdict
 If the planning package no longer needs document changes, output: VERDICT: PASS
 Otherwise output: VERDICT: FAIL
 """
 PLAN_VERIFY_TEMPLATE_KO = """\
 당신은 plan-only 수정 이후 최신 기획 패키지를 재검증하는 검토자입니다.
 ## 기획서
 {plan}
 ## 체크리스트
 {checklist}
 ## 참고 문서
 {docs}
 ## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
 {feedback}
 ## 실행 증거
 {execution_evidence}
 ## 검증 지침
 최신 기획 패키지 자체를 다시 검토하세요: 기획서, 체크리스트, 참고 문서를 함께 봅니다.
 현재 저장소를 살펴보며 문서가 현실을 정확히 설명하는지 확인할 수는 있지만, 프로덕션 코드, 스크립트, 인프라, 외부 환경이 이미 수정되어 있을 것을 요구하면 안 됩니다.
 `plan-review`에서 PASS의 뜻은 "이제 문서를 더 고칠 필요 없이 이 계획을 실행할 수 있다"입니다.
 즉 구현 공백, 저장소 불일치, legacy 스크립트 문제, 외부 의존성, 환경 blocker가 남아 있어도 아래 조건을 만족하면 FAIL 사유가 아닙니다.
 - 그 문제가 기획 패키지에 정확히 기록되어 있고
 - 어떤 범위/게이트에 영향을 주는지 분명히 적혀 있고
 - 필요한 후속 조치나 non-go 조건이 명확히 적혀 있고
 - 아직 해결되지 않은 일을 이미 해결된 것처럼 오해하게 만들지 않는 경우
 반대로 아래와 같은 경우에만 FAIL로 판정하세요.
 - 문서 안에 아직 모호성이나 모순이 남아 있는 경우
 - 선행조건, 의존성, 게이트, 담당 주체, evidence 규칙이 빠진 경우
 - 알려진 blocker가 여전히 부정확하거나 오해를 부르는 방식으로 서술된 경우
 - 기획 문서들 사이에서 source-of-truth 규칙이 충돌하는 경우
 - 체크리스트나 상태 판정 기준 때문에 실행자가 잘못된 결정을 내릴 수 있는 경우
 이미 문서에 정확히 기록된 구현/저장소 문제는 "문서화된 리스크 / 범위 밖 이슈" 섹션에만 적고, 그 자체를 FAIL 사유로 삼지 마세요.
 ## 출력 형식
 ### 남은 문서 이슈
 - [Major][누락] 이슈 설명 (관련 기획서/체크리스트/참고 문서 항목 참조)
 (남은 문서 이슈가 없으면 "없음"이라고 작성하세요.)
 ### 문서화된 리스크 / 범위 밖 이슈
 - 실제 구현/저장소/환경 리스크이지만 문서에는 이미 정확히 반영된 항목
 (해당 사항이 없으면 "없음"이라고 작성하세요.)
 ### 요약
 - 남은 문서 이슈 수: N
 - 문서화된 리스크 / 범위 밖 항목 수: N
 - 전체 품질: [간략한 평가]
 ### 판정
 기획 패키지를 더 수정할 필요가 없으면: VERDICT: PASS
 그렇지 않으면: VERDICT: FAIL
 """
 CODING_PLAN_REVIEW_TEMPLATE = """\
 You are reviewing both the implementation and the planning package together.
 ## Artifact References
 {artifact_references}
 ## Execution Evidence
 {execution_evidence}
 ## Review Instructions
 Read the referenced plan/checklist/docs/review artifacts directly from disk. \
 Inspect the current repository and evaluate BOTH:
 1. whether the implementation matches the plan/checklist/docs, and
 2. whether the planning package still accurately describes the implementation target and constraints.
 Report only issues that matter to delivering the original plan correctly. \
 Do not invent new scope. Distinguish between code issues, document issues, and consistency gaps between them.
 For each issue found, classify it with BOTH severity AND category:
 - Severity: Critical / Major / Minor
 - Category: Over-engineering / Omission
 If previous review feedback is provided above, mark each prior item as CONFIRMED or DISMISSED.
 If you find issues outside the original plan scope, report them separately under "Out of Scope Issues".
 ### Verdict
 If the implementation satisfies the plan/checklist and the planning package no longer needs correction, output: VERDICT: PASS
 Otherwise output: VERDICT: FAIL
 """
 CODING_PLAN_REVIEW_TEMPLATE_KO = """\
 당신은 구현 결과와 기획 문서 패키지를 함께 검토하는 리뷰어입니다.
 ## 참조 아티팩트
 {artifact_references}
 ## 실행 증거
 {execution_evidence}
 ## 검토 지침
 참조된 plan/checklist/docs/review markdown을 직접 읽고 현재 저장소를 확인한 뒤, 아래 두 가지를 함께 평가하세요.
 1. 현재 구현이 plan/checklist/docs와 일치하는가
 2. 기획 문서 패키지가 현재 구현 목표와 제약을 여전히 정확하게 설명하는가
 원래 계획을 제대로 완수하는 데 필요한 이슈만 보고하세요. 새로운 범위를 만들지 마세요.
 코드 이슈, 문서 이슈, 코드-문서 불일치를 구분해서 적으세요.
 발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요.
 - 심각도: Critical / Major / Minor
 - 카테고리: 과최적화 / 누락
 이전 리뷰 피드백이 있으면 각 항목을 CONFIRMED 또는 DISMISSED로 판정하세요.
 원래 계획 범위 밖 이슈는 "범위 밖 이슈"로 별도 분리하세요.
 ### 판정
 구현이 plan/checklist를 충족하고 기획 문서 패키지도 더 이상 수정할 필요가 없으면: VERDICT: PASS
 그렇지 않으면: VERDICT: FAIL
 """
 CODING_PLAN_FIX_TEMPLATE = """\
 You are fixing confirmed issues in both the implementation and the planning package.
 ## Artifact References
 {artifact_references}
 ## Current Review Feedback
 {feedback}
 ## Instructions
 1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
 2. Fix ONLY the confirmed issues from the current review feedback.
 3. You may update both implementation files and planning artifacts when needed.
 4. Preserve the original plan intent and scope. Do not silently broaden requirements.
 5. Keep code, plan, checklist, and supporting docs consistent after edits.
 6. After editing, briefly summarize what you changed and any blocker that still needs human input.
 """
 CODING_PLAN_FIX_TEMPLATE_KO = """\
 당신은 현재 리뷰에서 확정된 이슈를 코드와 기획 문서 패키지에 함께 반영하는 수정 담당자입니다.
 ## 참조 아티팩트
 {artifact_references}
 ## 현재 리뷰 피드백
 {feedback}
 ## 지침
 1. 참조된 plan/checklist/docs/review markdown을 직접 읽으세요.
 2. 현재 리뷰 피드백에서 확정된 이슈만 수정하세요.
 3. 필요하면 코드와 기획 문서를 모두 수정할 수 있습니다.
 4. 최초 plan의 의도와 범위를 유지하세요. 요구사항을 몰래 넓히지 마세요.
 5. 수정 후 코드, plan, checklist, 참고 문서가 서로 모순되지 않게 유지하세요.
 6. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
 """
 AGGREGATE_REVIEW_TEMPLATE = """\
 You are adjudicating multiple review results and turning them into an actionable decision.
 ## Artifact References
 {artifact_references}
 ## Candidate Artifact Under Review
 {candidate_outputs}
 ## Reviewer Findings Bundle
 {reviews_bundle}
 ## Previous Issue Tracker
 {previous_senior_tracker}
@@ -486,19 +744,19 @@ You are adjudicating multiple review results and turning them into an actionable
 ## Instructions
 Read the referenced plan/checklist/docs/review artifacts directly from disk. \
-Explore the project directory and the referenced git commit/diff to confirm the \
+Inspect the repository and referenced artifacts only as needed to confirm the \
-current codebase state. Use the execution evidence above to verify claims against \
+current target state. Use the execution evidence above to verify claims against \
 actual command outputs, artifact paths, and exit codes. Then:
 1. Deduplicate overlapping issues across reviewers.
 2. Resolve disagreements explicitly.
-3. Keep only issues supported by the plan, checklist, code, or reviewer evidence.
+3. Keep only issues supported by the plan, checklist, reference docs, repository state, or reviewer evidence.
 4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up.
-5. Produce a prioritized action list for the coder.
+5. Produce a prioritized action list for the implementer/editor.
 6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues).
 7. If no confirmed issue remains, output VERDICT: PASS.
-8. If issues exist that the coder can fix, output VERDICT: FAIL.
+8. If issues exist that the implementer/editor can fix, output VERDICT: FAIL.
 9. If issues require human intervention (ambiguous requirements, architecture decisions, \
-external dependency problems, or the same issue persists after 2+ fix attempts), \
+external dependency problems, or the same issue persists after 2+ attempts), \
 output VERDICT: ESCALATE.
 ## Output Format
@@ -512,8 +770,8 @@ output VERDICT: ESCALATE.
 (Write "None" if nothing was dismissed.)
 ### Action Items
-1. Concrete fix the coder should make
+1. Concrete fix the implementer/editor should make
-2. Concrete fix the coder should make
+2. Concrete fix the implementer/editor should make
 ## Issue Tracker
@@ -536,6 +794,12 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 ## 참조 아티팩트
 {artifact_references}
 ## 현재 검토 대상
 {candidate_outputs}
 ## 리뷰 결과 묶음
 {reviews_bundle}
 ## 이전 이슈 트래커
 {previous_senior_tracker}
@@ -543,17 +807,17 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 {execution_evidence}
 ## 지침
-참조된 plan/checklist/docs/review markdown와 git 상태를 직접 읽어 현재 코드베이스 상태를 확인한 뒤, \
+참조된 plan/checklist/docs/review markdown과 저장소 상태를 직접 읽어 현재 검토 대상의 상태를 확인한 뒤, \
 위 실행 증거를 활용하여 에이전트의 주장을 실제 명령어 출력, 아티팩트 경로, 종료 코드로 검증하세요. \
 그런 다음 아래를 수행하세요.
 1. 리뷰어들 사이에 중복되는 이슈를 합치세요.
 2. 의견 충돌은 명시적으로 정리하세요.
-3. 기획서, 체크리스트, 코드, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
+3. 기획서, 체크리스트, 참고 문서, 저장소 상태, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
 4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요.
-5. coder가 바로 수정할 수 있는 우선순위 액션 아이템을 만드세요.
+5. 수정 담당자가 바로 처리할 수 있는 우선순위 액션 아이템을 만드세요.
 6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월).
 7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요.
-8. coder가 수정 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
+8. 수정 담당자가 해결 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
 9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \
 동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요.
@@ -568,8 +832,8 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 (기각된 항목이 없으면 "없음"이라고 작성하세요.)
 ### 액션 아이템
-1. coder가 수정해야 할 구체적인 작업
+1. 수정 담당자가 처리해야 할 구체적인 작업
-2. coder가 수정해야 할 구체적인 작업
+2. 수정 담당자가 처리해야 할 구체적인 작업
 ## 이슈 트래커
@@ -592,6 +856,10 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
        "coding": CODING_TEMPLATE,
        "review": REVIEW_TEMPLATE,
        "plan-review": PLAN_REVIEW_TEMPLATE,
        "plan-fix": PLAN_FIX_TEMPLATE,
        "plan-verify": PLAN_VERIFY_TEMPLATE,
        "coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE,
        "coding-plan-fix": CODING_PLAN_FIX_TEMPLATE,
        "review-only": REVIEW_ONLY_TEMPLATE,
        "aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
    },
@@ -599,6 +867,10 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
        "coding": CODING_TEMPLATE_KO,
        "review": REVIEW_TEMPLATE_KO,
        "plan-review": PLAN_REVIEW_TEMPLATE_KO,
        "plan-fix": PLAN_FIX_TEMPLATE_KO,
        "plan-verify": PLAN_VERIFY_TEMPLATE_KO,
        "coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE_KO,
        "coding-plan-fix": CODING_PLAN_FIX_TEMPLATE_KO,
        "review-only": REVIEW_ONLY_TEMPLATE_KO,
        "aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
    },
@@ -843,56 +1115,75 @@ def _build_review_only_preset(
 def _build_plan_review_preset(
    coders: list[str], reviewers: list[str], seniors: list[str],
 ) -> list[StepConfig]:
-    """Plan-review: reviewers audit planning docs before implementation."""
+    """Plan-review: review planning docs, revise them, then verify in a loop."""
    if not coders:
        raise ValueError("'plan-review' preset requires at least 1 coder")
    if not reviewers:
        raise ValueError("'plan-review' preset requires at least 1 reviewer")
-    if len(reviewers) == 1 and not seniors:
+    review_steps: list[StepConfig] = []
-        return [
+    if len(reviewers) == 1:
        review_steps.append(
            StepConfig(
                name="plan_review",
                agent=reviewers[0],
                role="review",
                prompt_template="default:plan-review",
                output_key="plan_review_result",
                verdict=True,
            ),
-        ]
+        )
        review_step_names = ["plan_review"]
        review_output_keys = ["plan_review_result"]
    else:
        reviewer_keys = _unique_safe_keys(reviewers)
        for reviewer, rk in zip(reviewers, reviewer_keys):
            review_steps.append(
                StepConfig(
                    name=f"plan_review_{rk}",
                    agent=reviewer,
                    role="review",
                    prompt_template="default:plan-review",
                    output_key=f"plan_review_{rk}",
                    parallel=True,
                ),
            )
        review_step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
        review_output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
-    steps: list[StepConfig] = []
+    fix_coder = coders[0]
-    reviewer_keys = _unique_safe_keys(reviewers)
+    senior_agent = seniors[0] if seniors else reviewers[0]
-    for reviewer, rk in zip(reviewers, reviewer_keys):
+
-        steps.append(
+    return review_steps + [
-            StepConfig(
+        StepConfig(
-                name=f"plan_review_{rk}",
+            name="aggregate_review",
-                agent=reviewer,
+            agent=senior_agent,
-                role="review",
+            role="review",
-                prompt_template="default:plan-review",
+            prompt_template="default:aggregate-review",
-                output_key=f"plan_review_{rk}",
+            output_key="aggregate_review",
-                verdict=not seniors,
+            context_override={
-                parallel=True,
+                "candidate_outputs": "Current planning package under review (plan/checklist/reference docs).",
-            ),
+                "reviews_bundle": _build_named_bundle(
-        )
+                    reviewers, review_step_names, review_output_keys, "Review",
-    if seniors:
+                ),
-        step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
+            },
-        output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
+        ),
-        steps.append(
+        StepConfig(
-            StepConfig(
+            name="plan_fix",
-                name="senior_review",
+            agent=fix_coder,
-                agent=seniors[0],
+            role="coding",
-                role="review",
+            prompt_template="default:plan-fix",
-                prompt_template="default:aggregate-review",
+            output_key="plan_fix_output",
-                output_key="senior_review_result",
+            context_override={"feedback": "{aggregate_review}"},
-                verdict=True,
+        ),
-                context_override={
+        StepConfig(
-                    "candidate_outputs": "Planning documents under review (plan/checklist/reference docs).",
+            name="verify",
-                    "reviews_bundle": _build_named_bundle(
+            agent=senior_agent,
-                        reviewers, step_names, output_keys, "Review",
+            role="review",
-                    ),
+            prompt_template="default:plan-verify",
-                },
+            output_key="verify_result",
-            ),
+            verdict=True,
-        )
+        ),
-    return steps
+    ]
 def _build_review_fix_preset(
@@ -992,16 +1283,97 @@ def _build_coding_review_fix_preset(
    ]
 def _build_coding_plan_review_preset(
    coders: list[str], reviewers: list[str], seniors: list[str],
 ) -> list[PhaseConfig]:
    """Implement from plan/docs, then review and fix code+docs together."""
    if not coders:
        raise ValueError("'coding-plan-review' preset requires at least 1 coder")
    if not reviewers:
        raise ValueError("'coding-plan-review' preset requires at least 1 reviewer")
    review_steps: list[StepConfig] = []
    reviewer_keys = _unique_safe_keys(reviewers)
    for reviewer, rk in zip(reviewers, reviewer_keys):
        review_steps.append(
            StepConfig(
                name=f"review_{rk}",
                agent=reviewer,
                role="review",
                prompt_template="default:coding-plan-review",
                output_key=f"review_{rk}",
                verdict=False,
                parallel=True,
            ),
        )
    senior_agent = seniors[0] if seniors else reviewers[0]
    review_step_names = [f"review_{rk}" for rk in reviewer_keys]
    review_output_keys = [f"review_{rk}" for rk in reviewer_keys]
    return [
        PhaseConfig(
            name="initial_coding",
            steps=[
                StepConfig(
                    name="coding",
                    agent=coders[0],
                    role="coding",
                    prompt_template="default:coding",
                    output_key="coding_output",
                ),
            ],
            max_iterations=1,
            consecutive_pass=1,
        ),
        PhaseConfig(
            name="coding_plan_review",
            steps=review_steps + [
                StepConfig(
                    name="aggregate_review",
                    agent=senior_agent,
                    role="review",
                    prompt_template="default:aggregate-review",
                    output_key="aggregate_review",
                    context_override={
                        "candidate_outputs": (
                            "Current implementation and planning package under review "
                            "(code + plan/checklist/reference docs)."
                        ),
                        "reviews_bundle": _build_named_bundle(
                            reviewers, review_step_names, review_output_keys, "Review",
                        ),
                    },
                ),
                StepConfig(
                    name="coding_plan_fix",
                    agent=coders[0],
                    role="coding",
                    prompt_template="default:coding-plan-fix",
                    output_key="coding_plan_fix_output",
                    context_override={"feedback": "{aggregate_review}"},
                ),
                StepConfig(
                    name="verify",
                    agent=senior_agent,
                    role="review",
                    prompt_template="default:coding-plan-review",
                    output_key="verify_result",
                    verdict=True,
                ),
            ],
            max_iterations=5,
            consecutive_pass=1,
        ),
    ]
 PIPELINE_PRESETS: dict[str, Callable] = {
    "simple": _build_simple_preset,
    "cross-review": _build_cross_review_preset,
    "plan-review": _build_plan_review_preset,
    "review-only": _build_review_only_preset,
 }
 PHASED_PRESETS: dict[str, Callable] = {
-    "review-fix": _build_review_fix_preset,
+    "coding-plan-review": _build_coding_plan_review_preset,
    "coding-review-fix": _build_coding_review_fix_preset,
 }
 ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
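The reworked `plan-review` builder now always produces the same review → aggregate → fix → verify loop. The simplified sketch below summarizes it. It is a hypothetical stand-in, not the project's code. `Step` replaces `StepConfig` and keeps only the fields needed here. `safe_keys` is an assumed approximation of `_unique_safe_keys` that reproduces the `_2` suffix the tests expect for duplicate reviewers. The senior runs both the aggregate and verify steps, falling back to the first reviewer when no senior is configured. The first coder runs the fix step.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical stand-in for StepConfig, keeping only the fields shown here."""
    name: str
    agent: str
    verdict: bool = False


def safe_keys(names: list[str]) -> list[str]:
    """Assumed approximation of _unique_safe_keys: sanitize names, suffix repeats with _2, _3, ..."""
    counts: dict[str, int] = {}
    keys: list[str] = []
    for name in names:
        base = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")
        counts[base] = counts.get(base, 0) + 1
        keys.append(base if counts[base] == 1 else f"{base}_{counts[base]}")
    return keys


def plan_review_steps(coders: list[str], reviewers: list[str], seniors: list[str]) -> list[Step]:
    """Simplified mirror of the step order produced by _build_plan_review_preset."""
    if not coders:
        raise ValueError("'plan-review' preset requires at least 1 coder")
    if not reviewers:
        raise ValueError("'plan-review' preset requires at least 1 reviewer")
    if len(reviewers) == 1:
        # Mirrors the diff, which keeps verdict=True on the lone plan_review step
        reviews = [Step("plan_review", reviewers[0], verdict=True)]
    else:
        reviews = [Step(f"plan_review_{k}", r) for r, k in zip(reviewers, safe_keys(reviewers))]
    senior = seniors[0] if seniors else reviewers[0]  # fall back to the first reviewer
    return reviews + [
        Step("aggregate_review", senior),
        Step("plan_fix", coders[0]),
        Step("verify", senior, verdict=True),
    ]
```

In the single-reviewer branch, the diff still sets `verdict=True` on `plan_review` in addition to `verify`. The sketch reproduces that behavior as-is and does not assume whether it is intended.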
--- a/cross_eval/worktree.py
+++ b/cross_eval/worktree.py
@@ -37,18 +37,31 @@ def make_worktree_dir(base_cwd: Path, branch_name: str) -> Path:
    )
-def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
+def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> tuple[Path, str]:
    """Create a git worktree on a new branch from HEAD.
    1. Create branch from HEAD
    2. Create worktree checked out to that branch
    The branch lives in the original repo, so it survives worktree removal.
    Returns (worktree_path, base_commit_sha).
    """
    work_dir = work_dir.resolve()
    if work_dir.exists():
        shutil.rmtree(work_dir)
    # Record the base commit SHA before creating the branch.
    # This is the diff anchor for the first iteration. After each _commit_iteration the
    # pipeline advances the anchor to the new HEAD, so each diff (including agent-made commits) covers one iteration.
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=base_cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    base_commit = result.stdout.strip()
    # Create the branch at HEAD
    try:
        subprocess.run(
@@ -83,15 +96,23 @@ def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
            f"Failed to create worktree at {work_dir}: {e.stderr.strip()}"
        ) from e
-    logger.debug("Created worktree on branch '%s': %s", branch_name, work_dir)
+    logger.debug("Created worktree on branch '%s': %s (base: %s)", branch_name, work_dir, base_commit[:8])
-    return work_dir
+    return work_dir, base_commit
-def capture_diff(worktree_path: Path) -> str:
+def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
-    """Capture all changes made in the worktree as a unified diff.
+    """Capture all changes made in the worktree since ``base_commit``.
-    Includes both tracked modifications and new untracked files.
+    Handles two scenarios:
    1. Agent left changes uncommitted → ``git add -A``, commit a snapshot, then ``git diff base HEAD``
    2. Agent committed its own changes → HEAD advanced, diff base..HEAD captures them
    Args:
        base_commit: The diff anchor, typically the worktree HEAD *before* this
                     iteration started (set by ``get_current_head`` after each
                     ``_commit_iteration``). Falls back to ``HEAD~1`` if not given.
    """
    # Stage any uncommitted changes
    subprocess.run(
        ["git", "add", "-A"],
        cwd=worktree_path,
@@ -99,12 +120,34 @@ def capture_diff(worktree_path: Path) -> str:
        check=True,
    )
-    result = subprocess.run(
+    # Commit staged changes so everything is reachable via HEAD
-        ["git", "diff", "--cached", "HEAD"],
+    # (this is a no-op if nothing is staged)
    subprocess.run(
        ["git", "commit", "-m", "cross-eval: capture-diff snapshot", "--allow-empty-message"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
    )
    ref = base_commit or "HEAD~1"
    result = subprocess.run(
        ["git", "diff", ref, "HEAD"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
 def get_current_head(worktree_path: Path) -> str:
    """Return the current HEAD SHA of the worktree."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=worktree_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
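The commit message describes incremental anchoring, and it can be illustrated without git. The sketch below is a toy model that rests on several assumptions. Commits are in-memory dicts of file contents. `difflib` stands in for `git diff`. `run_iterations` and `snapshot_diff` are hypothetical helpers, not part of `cross_eval`. With `incremental=True`, the anchor moves to the newest commit after every iteration. This mirrors recording `get_current_head()` after `_commit_iteration`, so each diff contains only that iteration's edits.

```python
import difflib


def snapshot_diff(old: dict[str, str], new: dict[str, str]) -> str:
    """Toy stand-in for `git diff OLD NEW` over in-memory file snapshots."""
    out: list[str] = []
    for path in sorted(old.keys() | new.keys()):
        a = old.get(path, "").splitlines(keepends=True)
        b = new.get(path, "").splitlines(keepends=True)
        # unified_diff yields nothing when the file is unchanged
        out.extend(difflib.unified_diff(a, b, f"a/{path}", f"b/{path}"))
    return "".join(out)


def run_iterations(edits: list[dict[str, str]], incremental: bool = True) -> list[str]:
    """Hypothetical model of the review loop's diff capture."""
    commits: list[dict[str, str]] = [{}]  # commit 0 plays the base commit from create_worktree
    anchor = 0  # index of the current diff anchor
    diffs: list[str] = []
    for edit in edits:
        commits.append({**commits[-1], **edit})  # agent edits, then a snapshot commit
        diffs.append(snapshot_diff(commits[anchor], commits[-1]))
        if incremental:
            anchor = len(commits) - 1  # like recording get_current_head() after _commit_iteration
    return diffs


edits = [{"a.py": "x = 1\n"}, {"b.py": "y = 2\n"}]
per_iteration = run_iterations(edits)
cumulative = run_iterations(edits, incremental=False)
```

`incremental=False` keeps the anchor at commit 0, which models the old behavior. In that case, the second diff repeats the `a.py` change alongside the new `b.py` change. This is the cumulative output that reviewers could not distinguish between iterations.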
--- a/plan.md
+++ b/plan.md
@@ -0,0 +1,47 @@
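 # cross-eval CLI usability refactoring
 ## Goal
 Refactor the `cross-eval` CLI experience so that users can quickly understand what each option means and easily choose the option combination that fits their purpose.
 ## Background
 `cross-eval` currently provides major commands such as `init`, `run`, `demo`, and `doctor`, along with many options. However, first-time users find it hard to see at a glance which option to use in which situation. In particular, the relationships among `run`'s presets, agent combinations, config-based runs, and direct option-based runs can feel complicated.
 ## Requirements
 1. Refactor the CLI help or onboarding text so that even beginners can quickly grasp the main workflow.
 2. Users must be able to easily find the right option combination for each typical usage scenario.
 3. The roles of the main `run` options (preset, coder/reviewer/senior, config, output-related) must be made clearer.
 4. After `init`, the guidance must lead users naturally to the next step.
 5. Existing functionality must be preserved. Focus on improving the explanation structure and usage flow rather than changing the behavior itself.
 ## User scenarios
 1. A user who has just installed the tool wants to know what to do after `cross-eval init`.
 2. A user about to call `run` wants to quickly compare the differences between `--preset` values.
 3. A user wants to understand, with examples, when to use which combination of `claude`, `codex`, and `senior`.
 4. A user wants to decide between config-based execution and CLI option-based execution.
 5. A user wants to know where run results are saved and how to inspect them.
 ## Constraints
 - Keep existing CLI command names and core option names.
 - Do not modify the existing pipeline execution logic unnecessarily.
 - Focus on improving the guidance structure, help text, examples, and explanation flow rather than adding features.
 - Keep the documentation easy to understand for Korean-speaking users without disrupting the project's existing tone and structure.
 ## Scope
 ### In scope
 - Improve `argparse` help/description/epilog text
 - Improve the next-step guidance shown after `init`
 - Organize `run` usage examples and add examples of typical combinations
 - Restructure the explanations of the preset/agent/config/output concepts
 - If needed, tidy parts of the README or onboarding text
 ### Out of scope
 - Adding new presets
 - Adding new CLI options
 - Changing the pipeline execution algorithm
 - Changing how agents are invoked
 ## Success criteria
 1. Reading `--help` alone makes the basic usage flow clear.
 2. Users can copy run examples for typical scenarios and use them immediately.
 3. The `init → write → doctor → run → check output` flow connects naturally.
 4. Option descriptions are structured to help users actually make a choice, rather than merely being long.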
--- a/tests/test_agentic.py
+++ b/tests/test_agentic.py
@@ -76,10 +76,12 @@ class TestCreateWorktree(unittest.TestCase):
            wt_dir = Path(td) / "wt"
            branch = "cross-eval/test_branch"
-            result_path = create_worktree(base, wt_dir, branch)
+            result_path, base_commit = create_worktree(base, wt_dir, branch)
            # Worktree directory exists
            self.assertTrue(result_path.exists())
            # Base commit SHA was captured
            self.assertEqual(len(base_commit), 40)
            # Branch was created in the original repo
            branches = subprocess.run(
                ["git", "branch", "--list", branch],
@@ -102,7 +104,7 @@ class TestCaptureDiff(unittest.TestCase):
            wt_dir = Path(td) / "wt"
            branch = "cross-eval/diff_test"
-            create_worktree(base, wt_dir, branch)
+            create_worktree(base, wt_dir, branch)  # ignore return tuple
            # Make changes in the worktree
            (wt_dir / "new_file.txt").write_text("hello\n")
@@ -488,6 +490,8 @@ class TestMakeAgenticCodex(unittest.TestCase):
 def _make_agentic_config(
    run_dir: Path,
    agentic_coder: bool = True,
    *,
    use_worktree: bool = False,
 ) -> PipelineConfig:
    """Build a config with an agentic coder + non-agentic reviewer."""
    coder = AgentConfig(
@@ -519,6 +523,7 @@ def _make_agentic_config(
    ]
    return PipelineConfig(
        output_dir=run_dir,
        use_worktree=use_worktree,
        max_iterations=2,
        min_iterations=1,
        language="en",
@@ -549,11 +554,11 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            run_dir = Path(td)
-            config = _make_agentic_config(run_dir)
+            config = _make_agentic_config(run_dir, use_worktree=True)
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -571,6 +576,44 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
            mock_setup.assert_called_once()
 class TestDirectAgenticMode(unittest.TestCase):
    """Agentic coders run in the current working tree by default."""
    @patch("cross_eval.pipeline._setup_worktree")
    @patch("cross_eval.pipeline.invoke_agent_agentic")
    @patch("cross_eval.pipeline.invoke_agent")
    def test_agentic_uses_current_worktree_by_default(
        self,
        mock_invoke: MagicMock,
        mock_invoke_agentic: MagicMock,
        mock_setup: MagicMock,
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            repo = Path(td)
            _init_git_repo(repo)
            run_dir = repo / ".cross-eval" / "output"
            run_dir.mkdir(parents=True, exist_ok=True)
            config = _make_agentic_config(run_dir)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
                agent_name="claude-coder", step_name="coding",
                duration_seconds=0.1,
            )
            mock_invoke.return_value = AgentResult(
                output="VERDICT: PASS", exit_code=0,
                agent_name="claude-reviewer", step_name="review",
                duration_seconds=0.1,
            )
            run_pipeline(config, cwd=repo)
            mock_setup.assert_not_called()
            self.assertEqual(mock_invoke_agentic.call_args.kwargs["worktree_path"], repo)
            reviewer_call = mock_invoke.call_args
            self.assertEqual(reviewer_call.kwargs["cwd"], repo)
 class TestSetupWorktreeLocation(unittest.TestCase):
    """_setup_worktree places agentic worktrees outside the base repo."""
@@ -582,7 +625,7 @@ class TestSetupWorktreeLocation(unittest.TestCase):
            run_dir.mkdir(parents=True)
            _init_git_repo(base)
-            worktree_path, branch_name = _setup_worktree(base, run_dir, "review-fix")
+            worktree_path, branch_name, _base_commit = _setup_worktree(base, run_dir, "review-fix")
            try:
                self.assertTrue(worktree_path.exists())
                self.assertNotIn(str(base.resolve()), str(worktree_path.resolve()))
@@ -616,11 +659,11 @@ class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            run_dir = Path(td)
-            config = _make_agentic_config(run_dir)
+            config = _make_agentic_config(run_dir, use_worktree=True)
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -658,11 +701,11 @@ class TestCommitIterationCalled(unittest.TestCase):
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            run_dir = Path(td)
-            config = _make_agentic_config(run_dir)
+            config = _make_agentic_config(run_dir, use_worktree=True)
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -700,11 +743,11 @@ class TestFinalizeWorktreeCalled(unittest.TestCase):
    ) -> None:
        with tempfile.TemporaryDirectory() as td:
            run_dir = Path(td)
-            config = _make_agentic_config(run_dir)
+            config = _make_agentic_config(run_dir, use_worktree=True)
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -822,7 +865,7 @@ class TestParallelAgenticFallsBackToSequential(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            call_order: list[str] = []
--- a/tests/test_config.py
+++ b/tests/test_config.py
@@ -42,6 +42,8 @@ from cross_eval.prompts import (
    REVIEW_TEMPLATE_KO,
    PLAN_REVIEW_TEMPLATE,
    PLAN_REVIEW_TEMPLATE_KO,
    PLAN_FIX_TEMPLATE,
    PLAN_FIX_TEMPLATE_KO,
    REVIEW_ONLY_TEMPLATE,
    REVIEW_ONLY_TEMPLATE_KO,
    AGGREGATE_REVIEW_TEMPLATE,
@@ -310,26 +312,10 @@ class BuiltinAgentConfigTest(unittest.TestCase):
        self.assertIn("Repeated Aggregate Findings", report)
        self.assertIn("same as iteration 3", report)
-    def test_review_fix_defaults_senior_from_reviewer_family(self) -> None:
+    def test_fix_and_plan_presets_default_senior_from_reviewer_family(self) -> None:
        self.assertEqual(
            _default_seniors_for_preset(
-                "preset:review-fix",
+                "preset:plan-review",
                ["codex-reviewer", "claude-reviewer"],
                BUILTIN_AGENTS,
            ),
            ["codex-senior"],
        )
        self.assertEqual(
            _default_seniors_for_preset(
                "preset:review-fix",
                ["claude-reviewer"],
                BUILTIN_AGENTS,
            ),
            ["claude-senior"],
        )
        self.assertEqual(
            _default_seniors_for_preset(
                "preset:coding-review-fix",
                ["codex-reviewer"],
                BUILTIN_AGENTS,
            ),
@@ -337,7 +323,31 @@ class BuiltinAgentConfigTest(unittest.TestCase):
        )
        self.assertEqual(
            _default_seniors_for_preset(
-                "preset:simple",
+                "preset:plan-review",
                ["claude-reviewer"],
                BUILTIN_AGENTS,
            ),
            ["claude-senior"],
        )
        self.assertEqual(
            _default_seniors_for_preset(
                "preset:coding-plan-review",
                ["codex-reviewer", "claude-reviewer"],
                BUILTIN_AGENTS,
            ),
            ["codex-senior"],
        )
        self.assertEqual(
            _default_seniors_for_preset(
                "preset:coding-plan-review",
                ["claude-reviewer"],
                BUILTIN_AGENTS,
            ),
            ["claude-senior"],
        )
        self.assertEqual(
            _default_seniors_for_preset(
                "preset:unknown",
                ["codex-reviewer"],
                BUILTIN_AGENTS,
            ),
@@ -421,23 +431,49 @@ class BuiltinAgentConfigTest(unittest.TestCase):
        )
        self.assertEqual(
-            [step.output_key for step in steps],
+            [step.output_key for step in steps[:2]],
            ["plan_review_codex_reviewer", "plan_review_codex_reviewer_2"],
        )
-    def test_plan_review_with_senior_adds_aggregate_step(self) -> None:
+    def test_plan_review_builds_review_fix_verify_loop(self) -> None:
        steps = _build_plan_review_preset(
            ["codex-coder"],
            ["claude-reviewer", "codex-reviewer"],
            ["claude-senior"],
        )
-        self.assertEqual(steps[-1].name, "senior_review")
+        self.assertEqual(
-        self.assertEqual(steps[-1].agent, "claude-senior")
+            [step.name for step in steps],
-        self.assertTrue(steps[-1].verdict)
+            [
                "plan_review_claude_reviewer",
                "plan_review_codex_reviewer",
                "aggregate_review",
                "plan_fix",
                "verify",
            ],
        )
        self.assertEqual(steps[2].agent, "claude-senior")
        self.assertEqual(steps[3].agent, "codex-coder")
        self.assertEqual(steps[4].agent, "claude-senior")
        self.assertTrue(steps[4].verdict)
        self.assertFalse(steps[0].verdict)
        self.assertFalse(steps[1].verdict)
    def test_plan_review_single_reviewer_uses_default_loop_steps(self) -> None:
        steps = _build_plan_review_preset(
            ["codex-coder"],
            ["codex-reviewer"],
            [],
        )
        self.assertEqual(
            [step.name for step in steps],
            ["plan_review", "aggregate_review", "plan_fix", "verify"],
        )
        self.assertEqual(steps[1].agent, "codex-reviewer")
        self.assertEqual(steps[2].prompt_template, "default:plan-fix")
        self.assertTrue(steps[3].verdict)
    def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None:
        steps = _build_cross_review_preset(
            ["codex-coder", "codex-coder"],
@@ -576,6 +612,8 @@ class PromptTemplateTest(unittest.TestCase):
        """Coding templates should tell coder to ignore DISMISSED items."""
        self.assertIn("DISMISSED", CODING_TEMPLATE)
        self.assertIn("DISMISSED", CODING_TEMPLATE_KO)
        self.assertIn("DISMISSED", PLAN_FIX_TEMPLATE)
        self.assertIn("DISMISSED", PLAN_FIX_TEMPLATE_KO)
    def test_aggregate_templates_dismissed_structure(self) -> None:
        """Aggregate templates should use [False positive] / [Already fixed] tags."""
@@ -583,6 +621,10 @@ class PromptTemplateTest(unittest.TestCase):
        self.assertIn("[Already fixed]", AGGREGATE_REVIEW_TEMPLATE)
        self.assertIn("[오탐]", AGGREGATE_REVIEW_TEMPLATE_KO)
        self.assertIn("[수정 완료]", AGGREGATE_REVIEW_TEMPLATE_KO)
        self.assertIn("{candidate_outputs}", AGGREGATE_REVIEW_TEMPLATE)
        self.assertIn("{reviews_bundle}", AGGREGATE_REVIEW_TEMPLATE)
        self.assertIn("{candidate_outputs}", AGGREGATE_REVIEW_TEMPLATE_KO)
        self.assertIn("{reviews_bundle}", AGGREGATE_REVIEW_TEMPLATE_KO)
 class ReviewMetricsParsingTest(unittest.TestCase):
@@ -969,7 +1011,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
                "  checklist: checklist.md\n"
                "coders: [claude-coder]\n"
                "reviewers: [claude-reviewer]\n"
-                "pipeline: preset:review-fix\n"
+                "pipeline: preset:coding-plan-review\n"
                f"max_iterations: {max_iterations}\n"
                "language: en\n"
            ),
@@ -981,8 +1023,9 @@ class FixPresetBehaviorTest(unittest.TestCase):
        with tempfile.TemporaryDirectory() as tmpdir:
            config = load_config(self._write_fix_config(Path(tmpdir), max_iterations=7))
-        self.assertEqual(config.preset_name, "review-fix")
+        self.assertEqual(config.preset_name, "coding-plan-review")
-        self.assertEqual(config.phases[0].max_iterations, 7)
+        self.assertEqual(config.phases[0].max_iterations, 1)
        self.assertEqual(config.phases[1].max_iterations, 7)
        self.assertTrue(config.agents["claude-coder"].agentic)
        self.assertNotIn("-p", config.agents["claude-coder"].args)
@@ -992,7 +1035,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
            captured: dict[str, object] = {}
            def _fake_run_pipeline(config, **kwargs):
-                captured["phase_max"] = config.phases[0].max_iterations
+                captured["phase_max"] = config.phases[1].max_iterations
                captured["agentic"] = config.agents[config.coders[0]].agentic
                return PipelineResult(
                    iterations=[],
@@ -1012,13 +1055,13 @@ class FixPresetBehaviorTest(unittest.TestCase):
        self.assertEqual(captured["phase_max"], 9)
        self.assertTrue(captured["agentic"])
-    def test_run_preset_review_fix_auto_enables_agentic_without_flag(self) -> None:
+    def test_run_preset_coding_plan_review_auto_enables_agentic_without_flag(self) -> None:
        captured: dict[str, object] = {}
        def _fake_run_pipeline(config, **kwargs):
            captured["preset"] = config.preset_name
            captured["agentic"] = config.agents[config.coders[0]].agentic
-            captured["phase_max"] = config.phases[0].max_iterations
+            captured["phase_max"] = config.phases[1].max_iterations
            return PipelineResult(
                iterations=[],
                final_verdict="PASS",
@@ -1026,13 +1069,73 @@ class FixPresetBehaviorTest(unittest.TestCase):
            )
        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
-            exit_code = main(["run", "--preset", "review-fix", "--dry-run"])
+            exit_code = main(["run", "--preset", "coding-plan-review", "--dry-run"])
        self.assertEqual(exit_code, 0)
-        self.assertEqual(captured["preset"], "review-fix")
+        self.assertEqual(captured["preset"], "coding-plan-review")
        self.assertTrue(captured["agentic"])
        self.assertEqual(captured["phase_max"], 3)
    def test_run_preset_plan_review_auto_enables_agentic_without_flag(self) -> None:
        captured: dict[str, object] = {}
        def _fake_run_pipeline(config, **kwargs):
            captured["preset"] = config.preset_name
            captured["agentic"] = config.agents[config.coders[0]].agentic
            captured["use_worktree"] = config.use_worktree
            captured["seniors"] = list(config.seniors)
            captured["steps"] = [step.name for step in config.pipeline]
            captured["max_iter"] = config.max_iterations
            return PipelineResult(
                iterations=[],
                final_verdict="PASS",
                run_dir=Path(".cross-eval/output"),
            )
        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
            exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
        self.assertEqual(exit_code, 0)
        self.assertEqual(captured["preset"], "plan-review")
        self.assertTrue(captured["agentic"])
        self.assertFalse(captured["use_worktree"])
        self.assertEqual(captured["seniors"], ["claude-senior"])
        self.assertEqual(
            captured["steps"],
            ["plan_review", "aggregate_review", "plan_fix", "verify"],
        )
        self.assertEqual(captured["max_iter"], 3)
    def test_run_worktree_flag_enables_isolated_worktree_mode(self) -> None:
        captured: dict[str, object] = {}
        def _fake_run_pipeline(config, **kwargs):
            captured["use_worktree"] = config.use_worktree
            return PipelineResult(
                iterations=[],
                final_verdict="PASS",
                run_dir=Path(".cross-eval/output"),
            )
        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
            exit_code = main(["run", "--preset", "plan-review", "--dry-run", "--worktree"])
        self.assertEqual(exit_code, 0)
        self.assertTrue(captured["use_worktree"])
    def test_run_dry_run_returns_zero_even_when_not_pass(self) -> None:
        def _fake_run_pipeline(config, **kwargs):
            return PipelineResult(
                iterations=[],
                final_verdict="MAX_ITERATIONS_REACHED",
                run_dir=Path(".cross-eval/output"),
            )
        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
            exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
        self.assertEqual(exit_code, 0)
    def test_run_senior_model_override_applies_only_to_seniors(self) -> None:
        captured: dict[str, list[str]] = {}
@@ -1049,7 +1152,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
            exit_code = main([
                "run",
-                "--preset", "review-fix",
+                "--preset", "coding-plan-review",
                "--coder", "claude",
                "--reviewer", "claude",
                "--senior", "claude",
@@ -1077,7 +1180,7 @@ class OutputDirectoryResolutionTest(unittest.TestCase):
                    "  plan: plan.md\n"
                    "coders: [claude-coder]\n"
                    "reviewers: [claude-reviewer]\n"
-                    "pipeline: preset:simple\n"
+                    "pipeline: preset:coding-plan-review\n"
                    "output_dir: .cross-eval/output\n"
                ),
                encoding="utf-8",
--- a/tests/test_evidence.py
+++ b/tests/test_evidence.py
@@ -465,6 +465,9 @@ class TestExpandedClaimMarkers(unittest.TestCase):
    def test_changes_are_complete(self) -> None:
        self.assertTrue(_claims_file_changes("All changes are complete"))
    def test_korean_change_summary_triggers(self) -> None:
        self.assertTrue(_claims_file_changes("모든 수정이 완료되었습니다. 아래는 변경 요약입니다."))
 class TestExpandedNoChangeMarkers(unittest.TestCase):
    """New no-change markers prevent false positives."""
@@ -484,6 +487,9 @@ class TestExpandedNoChangeMarkers(unittest.TestCase):
    def test_no_action_required(self) -> None:
        self.assertFalse(_claims_file_changes("No action required"))
    def test_korean_no_change_marker(self) -> None:
        self.assertFalse(_claims_file_changes("변경할 필요 없음"))
 # ---------------------------------------------------------------------------
 # 6. Cross-iteration evidence propagation
--- a/tests/test_onboarding.py
+++ b/tests/test_onboarding.py
@@ -55,7 +55,7 @@ class DoctorCheckInstalledTest(unittest.TestCase):
            config_path = ce_dir / "config.yaml"
            config_path.write_text(
                "inputs:\n  plan: plan.md\ncoders: [claude-coder]\n"
-                "reviewers: [claude-reviewer]\npipeline: preset:simple\n",
+                "reviewers: [claude-reviewer]\npipeline: preset:coding-plan-review\n",
                encoding="utf-8",
            )
            # Also create plan.md so validation passes
@@ -137,22 +137,22 @@ class DemoTest(unittest.TestCase):
    def test_mock_demo_runs_without_error(self) -> None:
        # Should not raise
        with patch("sys.stdout"):
-            run_mock_demo(preset="simple")
+            run_mock_demo(preset="coding-plan-review")
    def test_mock_demo_escalate_runs_without_error(self) -> None:
        with patch("sys.stdout"):
-            run_mock_demo(preset="simple", show_escalate=True)
+            run_mock_demo(preset="coding-plan-review", show_escalate=True)
    def test_cmd_demo_mock_default(self) -> None:
        with patch("cross_eval.demo.run_mock_demo") as mock:
            exit_code = main(["demo"])
-        mock.assert_called_once_with(preset="simple", show_escalate=False)
+        mock.assert_called_once_with(preset="coding-plan-review", show_escalate=False)
        self.assertEqual(exit_code, 0)
    def test_cmd_demo_escalate_flag(self) -> None:
        with patch("cross_eval.demo.run_mock_demo") as mock:
            exit_code = main(["demo", "--escalate"])
-        mock.assert_called_once_with(preset="simple", show_escalate=True)
+        mock.assert_called_once_with(preset="coding-plan-review", show_escalate=True)
        self.assertEqual(exit_code, 0)
    def test_cmd_demo_live_requires_confirmation(self) -> None:
--- a/tests/test_pipeline_integration.py
+++ b/tests/test_pipeline_integration.py
@@ -13,7 +13,11 @@ from cross_eval.models import (
    StepConfig,
 )
 from cross_eval.pipeline import run_pipeline
-from cross_eval.prompts import _build_review_fix_preset, _build_simple_preset
+from cross_eval.prompts import (
    _build_plan_review_preset,
    _build_review_fix_preset,
    _build_simple_preset,
 )
 def _make_mock_agent(outputs: list[str]):
@@ -262,6 +266,60 @@ class TestPhasedPipelineEscalateBreaksPhase(unittest.TestCase):
            self.assertTrue(len(result.escalated_issues) > 0)
 class TestPlanReviewPipelineLoopsUntilVerifyPass(unittest.TestCase):
    """Document plan-review should revise docs and re-verify across iterations."""
    def test_plan_review_fail_then_pass(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            coders = ["claude-coder"]
            reviewers = ["claude-reviewer"]
            seniors = ["claude-senior"]
            steps = _build_plan_review_preset(coders, reviewers, seniors)
            config = PipelineConfig(
                output_dir=Path(tmpdir),
                max_iterations=4,
                min_iterations=1,
                language="en",
                inputs={
                    "plan": "Test plan",
                    "checklist": "Test checklist",
                    "docs": "Reference docs",
                },
                agents=dict(BUILTIN_AGENTS),
                coders=coders,
                reviewers=reviewers,
                seniors=seniors,
                pipeline=steps,
                preset_name="plan-review",
            )
            mock = _make_step_mock({
                "plan_review": [
                    "Requirements are ambiguous\n\nVERDICT: FAIL",
                    "Looks aligned\n\nVERDICT: PASS",
                ],
                "aggregate_review": [
                    "### Confirmed Issues\n- Clarify acceptance criteria\n\n"
                    "### Action Items\n1. Tighten the checklist\n\nVERDICT: FAIL",
                    "### Confirmed Issues\nNone\n\n"
                    "### Dismissed Findings\nNone\n\n"
                    "### Action Items\n1. No document changes needed\n\nVERDICT: PASS",
                ],
                "plan_fix": ["Updated plan and checklist", "No-op"],
                "verify": [
                    "Still missing edge-case criteria\n\nVERDICT: FAIL",
                    "Planning package is now implementable\n\nVERDICT: PASS",
                ],
            })
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "PASS")
            self.assertEqual(len(result.iterations), 2)
 class TestAutoEscalateFiresWithoutSenior(unittest.TestCase):
    """Test 6: simple pipeline without senior, same FAIL feedback 3 times -> auto-escalate."""
--- a/tests/test_runtime_misc.py
+++ b/tests/test_runtime_misc.py
@@ -16,12 +16,17 @@ from cross_eval.agent import (
 )
 from cross_eval.models import AgentConfig, AgentResult, ExecutionConfig, PipelineConfig, StepConfig
 from cross_eval.pipeline import (
    _apply_worktree_inputs_to_base,
    _commit_base_repo_paths,
    _copy_inputs_to_worktree,
    _commit_iteration,
    _execute_parallel_batch,
    _execute_step,
    _finalize_worktree,
    _format_runtime_error_markdown,
    _load_inputs,
    _maybe_save_step_transcript,
    _refresh_inputs,
    _snapshot_repo_state,
 )
 from cross_eval.runtime_env import (
@@ -118,6 +123,146 @@ class TestInvokeAgentRuntime(unittest.TestCase):
        self.assertEqual(ctx.exception.failure_type, "API_ERROR")
        self.assertIn("backend down", ctx.exception.raw_error)
 class TestWorktreeInputMapping(unittest.TestCase):
    def test_repo_local_plan_input_maps_to_tracked_worktree_path(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir) / "repo"
            repo.mkdir()
            _init_git_repo(repo)
            (repo / "plan.md").write_text("plan v1\n", encoding="utf-8")
            subprocess.run(["git", "add", "plan.md"], cwd=repo, capture_output=True, check=True)
            subprocess.run(
                ["git", "commit", "-m", "add plan"],
                cwd=repo,
                capture_output=True,
                check=True,
            )
            worktree_dir = Path(tmpdir) / "wt"
            branch = "cross-eval/test-plan-review"
            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
            try:
                config = PipelineConfig(
                    inputs={"plan": repo / "plan.md"},
                    preset_name="plan-review",
                )
                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
                self.assertEqual(config.inputs["plan"], worktree_path / "plan.md")
            finally:
                remove_worktree(base_cwd=repo, work_dir=worktree_path)
                subprocess.run(
                    ["git", "branch", "-D", branch],
                    cwd=repo,
                    capture_output=True,
                )
    def test_plan_review_docs_ref_maps_to_worktree_and_refreshes_docs(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir) / "repo"
            repo.mkdir()
            _init_git_repo(repo)
            docs_dir = repo / "plans"
            docs_dir.mkdir()
            (docs_dir / "A.md").write_text("A v1\n", encoding="utf-8")
            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
            subprocess.run(
                ["git", "commit", "-m", "add docs"],
                cwd=repo,
                capture_output=True,
                check=True,
            )
            config = PipelineConfig(
                inputs={
                    "docs": "stale snapshot",
                    "docs_ref": docs_dir,
                },
                preset_name="plan-review",
            )
            input_contents = _load_inputs(config)
            self.assertIn("A.md", input_contents["docs"])
            worktree_dir = Path(tmpdir) / "wt"
            branch = "cross-eval/test-docs-ref"
            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
            try:
                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
                self.assertEqual(config.inputs["docs_ref"], worktree_path / "plans")
                updated = worktree_path / "plans" / "A.md"
                updated.write_text("A v2\n", encoding="utf-8")
                _refresh_inputs(config, input_contents)
                self.assertIn("A.md", input_contents["docs"])
                self.assertIn("A v2", input_contents["docs"])
            finally:
                remove_worktree(base_cwd=repo, work_dir=worktree_path)
                subprocess.run(
                    ["git", "branch", "-D", branch],
                    cwd=repo,
                    capture_output=True,
                )
    def test_worktree_doc_changes_apply_back_and_commit_in_base_repo(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir) / "repo"
            repo.mkdir()
            _init_git_repo(repo)
            docs_dir = repo / "plans"
            docs_dir.mkdir()
            doc_path = docs_dir / "A.md"
            doc_path.write_text("A v1\n", encoding="utf-8")
            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
            subprocess.run(
                ["git", "commit", "-m", "add docs"],
                cwd=repo,
                capture_output=True,
                check=True,
            )
            config = PipelineConfig(
                inputs={"docs_ref": docs_dir},
                preset_name="plan-review",
            )
            original_inputs = {"docs_ref": docs_dir}
            worktree_dir = Path(tmpdir) / "wt"
            branch = "cross-eval/test-apply-back"
            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
            try:
                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
                worktree_doc = config.inputs["docs_ref"] / "A.md"
                worktree_doc.write_text("A v2\n", encoding="utf-8")
                restored = _apply_worktree_inputs_to_base(
                    config, original_inputs, cwd=repo,
                )
                self.assertEqual(restored, [docs_dir])
                self.assertEqual(doc_path.read_text(encoding="utf-8"), "A v2\n")
                committed = _commit_base_repo_paths(
                    repo, restored, "cross-eval: plan-review (FAIL)",
                )
                self.assertTrue(committed)
                log = subprocess.run(
                    ["git", "log", "-1", "--pretty=%s"],
                    cwd=repo,
                    capture_output=True,
                    text=True,
                    check=True,
                )
                self.assertEqual(log.stdout.strip(), "cross-eval: plan-review (FAIL)")
            finally:
                remove_worktree(base_cwd=repo, work_dir=worktree_path)
                subprocess.run(
                    ["git", "branch", "-D", branch],
                    cwd=repo,
                    capture_output=True,
                )
    def test_classify_unknown_failure(self) -> None:
        failure_type, suggested_action = _classify_agent_failure("weird crash")
        self.assertEqual(failure_type, "UNKNOWN")
@@ -376,11 +521,13 @@ class TestInvokeAgenticRuntime(unittest.TestCase):
 class TestPipelineHelpers(unittest.TestCase):
    @patch("cross_eval.worktree.get_current_head", return_value="a" * 40)
    @patch("cross_eval.worktree.commit_worktree", return_value=True)
-    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock) -> None:
+    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock, mock_head: MagicMock) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
-            _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
+            new_head = _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
        mock_commit.assert_called_once()
        self.assertEqual(new_head, "a" * 40)
    def test_snapshot_repo_state_includes_untracked_digest(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
@@ -775,11 +922,18 @@ class TestRuntimeEnvironmentHelpers(unittest.TestCase):
 class TestWorktreeFailures(unittest.TestCase):
    @patch("cross_eval.worktree.subprocess.run")
    def test_create_worktree_raises_when_branch_creation_fails(self, mock_run: MagicMock) -> None:
-        mock_run.side_effect = subprocess.CalledProcessError(
-            1,
-            ["git", "branch"],
-            stderr="branch failed",
-        )
+        # First call: git rev-parse HEAD (succeeds)
+        # Second call: git branch (fails)
+        rev_parse_result = MagicMock(returncode=0)
+        rev_parse_result.stdout = "a" * 40
+        mock_run.side_effect = [
+            rev_parse_result,
+            subprocess.CalledProcessError(
+                1,
+                ["git", "branch"],
+                stderr="branch failed",
+            ),
+        ]
        with tempfile.TemporaryDirectory() as tmpdir:
            base = Path(tmpdir)
@@ -791,14 +945,17 @@ class TestWorktreeFailures(unittest.TestCase):
    @patch("cross_eval.worktree.subprocess.run")
    def test_create_worktree_cleans_branch_on_worktree_failure(self, mock_run: MagicMock) -> None:
        rev_parse_result = MagicMock(returncode=0)
        rev_parse_result.stdout = "a" * 40
        mock_run.side_effect = [
-            MagicMock(returncode=0),
+            rev_parse_result,           # git rev-parse HEAD
            MagicMock(returncode=0),    # git branch
            subprocess.CalledProcessError(
                1,
                ["git", "worktree", "add"],
                stderr="worktree failed",
            ),
-            MagicMock(returncode=0),
+            MagicMock(returncode=0),    # git branch -D (cleanup)
        ]
        with tempfile.TemporaryDirectory() as tmpdir:
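The reworked worktree tests rely on two standard `unittest.mock` behaviours. First, stacked `@patch` decorators are applied bottom-up, so the bottom patch's mock is the first argument; this is why `mock_commit` comes before `mock_head` in `test_commit_iteration_logs_only_when_committed`. Second, an iterable `side_effect` is consumed one element per call, and exception instances in it are raised instead of returned. The sketch below shows both in isolation. `create_branch` is a hypothetical helper, not cross-eval's `create_worktree`, and patching `os.getcwd` only serves to make the argument order visible.

```python
import os
import subprocess
from unittest.mock import MagicMock, patch


def create_branch(base: str, branch: str) -> str:
    """Hypothetical helper: resolve HEAD, then create `branch` at that commit."""
    head = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=base, capture_output=True, text=True, check=True,
    ).stdout.strip()
    subprocess.run(
        ["git", "branch", branch, head],
        cwd=base, capture_output=True, check=True,
    )
    return head


# Decorators apply bottom-up, so the bottom patch's mock is the first argument.
@patch("os.getcwd", return_value="/tmp/repo")
@patch("subprocess.run")
def demo(mock_run: MagicMock, mock_getcwd: MagicMock) -> list[str]:
    rev_parse = MagicMock(returncode=0)
    rev_parse.stdout = "a" * 40
    # One element per call: the first call returns rev_parse, the second raises.
    mock_run.side_effect = [
        rev_parse,
        subprocess.CalledProcessError(1, ["git", "branch"], stderr="branch failed"),
    ]
    events = []
    try:
        create_branch(os.getcwd(), "cross-eval/demo")
    except subprocess.CalledProcessError as exc:
        events.append(f"raised: {exc.stderr}")
    events.append(f"run calls: {mock_run.call_count}")
    events.append(f"getcwd calls: {mock_getcwd.call_count}")
    return events


print(demo())  # ['raised: branch failed', 'run calls: 2', 'getcwd calls: 1']
```

Each call to `demo()` gets fresh mocks, and both patches are undone when it returns.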
Author	SHA1	Message	Date
이충영 에이닷서비스개발	0bbe0f6f7b	continue	2026-03-15 17:54:30 +09:00
chungyeong	28efd5bb8f	fix: use incremental diff per iteration instead of cumulative base diff After each iteration's _commit_iteration, record the new HEAD SHA and use it as the diff anchor for the next iteration. Previously capture_diff always diffed against the initial base commit, causing every iteration to return the same full cumulative diff — reviewers couldn't see what changed between iterations, leading to repeated feedback and stuck FAIL loops. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 10:07:11 +09:00
chungyeong	bf64d19123	Fix plan-review worktree document tracking	2026-03-15 00:35:42 +09:00
chungyeong	a85a490a9b	Make plan-review a review-fix-verify loop	2026-03-15 00:01:26 +09:00
chungyeong	60c7b07939	fix: capture_diff uses base commit to handle agent self-commits Claude in agentic mode (interactive, no -p flag) commits its own changes, advancing HEAD. This made `git diff --cached HEAD` return empty, triggering false EMPTY_DIFF errors every time. Now capture_diff diffs against the base commit SHA recorded at worktree creation, so changes are captured regardless of whether the agent committed them. Also adds UX_IMPROVEMENT_PLAN.md for guided message improvements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 23:59:53 +09:00
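The change in 28efd5bb8f can be modeled without git. If every iteration diffs against the base commit recorded at worktree creation, the reviewer of iteration N sees everything from iterations 1 through N. If the anchor moves to the HEAD recorded after each iteration's commit, the reviewer sees only that iteration's delta. This is a toy sketch that assumes each commit is a full snapshot of one file and reports additions only. `Repo`, `added_since`, and `run_iterations` are hypothetical names, not cross-eval's `capture_diff` or `_commit_iteration`.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy repo (hypothetical): each commit is a full snapshot of one file's lines."""
    commits: list[list[str]] = field(default_factory=lambda: [["line 1"]])

    @property
    def head(self) -> int:
        return len(self.commits) - 1

    def commit(self, lines: list[str]) -> None:
        self.commits.append(lines)

    def added_since(self, anchor: int) -> list[str]:
        """Lines added between commit `anchor` and HEAD (additions only)."""
        patch = difflib.unified_diff(
            self.commits[anchor], self.commits[self.head], lineterm=""
        )
        return [ln for ln in patch if ln.startswith("+") and not ln.startswith("+++")]


def run_iterations(repo: Repo, edits: list[str], incremental: bool) -> list[list[str]]:
    anchor = repo.head  # base commit, recorded once at worktree creation
    diffs = []
    for new_line in edits:
        # The coder's edit for this iteration, modeled as already committed:
        # diffing against an anchor commit captures committed work too.
        repo.commit(repo.commits[repo.head] + [new_line])
        diffs.append(repo.added_since(anchor))  # what the reviewer sees
        if incremental:
            anchor = repo.head  # post-commit HEAD anchors the next iteration
    return diffs


edits = ["line 2", "line 3"]
print(run_iterations(Repo(), edits, incremental=False))  # [['+line 2'], ['+line 2', '+line 3']]
print(run_iterations(Repo(), edits, incremental=True))   # [['+line 2'], ['+line 3']]
```

In a real repository, the equivalent of `repo.head` would be the SHA from `git rev-parse HEAD`, taken right after the iteration commit.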