Compare commits

..

5 Commits

Author SHA1 Message Date
이충영 에이닷서비스개발
0bbe0f6f7b continue 2026-03-15 17:54:30 +09:00
chungyeong
28efd5bb8f fix: use incremental diff per iteration instead of cumulative base diff
After each iteration's _commit_iteration, record the new HEAD SHA and use
it as the diff anchor for the next iteration. Previously capture_diff
always diffed against the initial base commit, causing every iteration to
return the same full cumulative diff — reviewers couldn't see what changed
between iterations, leading to repeated feedback and stuck FAIL loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 10:07:11 +09:00
chungyeong
bf64d19123 Fix plan-review worktree document tracking 2026-03-15 00:35:42 +09:00
chungyeong
a85a490a9b Make plan-review a review-fix-verify loop 2026-03-15 00:01:26 +09:00
chungyeong
60c7b07939 fix: capture_diff uses base commit to handle agent self-commits
Claude in agentic mode (interactive, no -p flag) commits its own changes,
advancing HEAD. This made `git diff --cached HEAD` return empty, triggering
false EMPTY_DIFF errors every time. Now capture_diff diffs against the
base commit SHA recorded at worktree creation, so changes are captured
regardless of whether the agent committed them.

Also adds UX_IMPROVEMENT_PLAN.md for guided message improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 23:59:53 +09:00
19 changed files with 1527 additions and 290 deletions
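
The two `capture_diff` fixes in this range change what each iteration diffs against: first the base commit recorded at worktree creation, then the HEAD recorded after the previous iteration's commit. The anchor bookkeeping could look roughly like the sketch below. `DiffAnchor`, `diff_command`, and the placeholder SHAs are hypothetical names used for illustration only and are not taken from the repository.

```python
from dataclasses import dataclass, field


@dataclass
class DiffAnchor:
    """Tracks which commit the next diff is taken against (hypothetical helper)."""

    base_commit: str  # SHA recorded when the worktree was created
    history: list[str] = field(default_factory=list)

    @property
    def current(self) -> str:
        # Before any iteration has committed, diff against the base commit.
        return self.history[-1] if self.history else self.base_commit

    def advance(self, new_head: str) -> None:
        # Called after an iteration's commit so the next diff is incremental.
        self.history.append(new_head)


def diff_command(anchor: DiffAnchor) -> list[str]:
    # `git diff <commit>` compares the working tree against that commit.
    return ["git", "diff", anchor.current]


anchor = DiffAnchor(base_commit="aaa1111")  # placeholder SHA
first = diff_command(anchor)                # iteration 1: diff against the base
anchor.advance("bbb2222")                   # HEAD after iteration 1 was committed
second = diff_command(anchor)               # iteration 2: only changes since then
```

With a fixed base commit every iteration would produce the same cumulative diff; advancing the anchor after each commit is what lets reviewers see only the latest changes.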


@@ -10,6 +10,8 @@ AI 에이전트 2개를 활용한 개발 워크플로우(기획→체크리스
 - Generator: `--permission-mode auto` (can read/write files)
 - Reviewer: `--permission-mode plan` (read-only exploration)
 - Set the subprocess `cwd` to the current working directory
+- The default execution mode is **direct mode**: the agentic coder also edits files directly in the current working tree.
+- An isolated git worktree is created only when `--worktree` or `use_worktree: true` is set explicitly.
 ## User Experience (UX Flow)
@@ -34,6 +36,7 @@ ls output/v1/ v2/ final-report.md
```yaml ```yaml
output_dir: output output_dir: output
use_worktree: false
max_iterations: 3 max_iterations: 3
inputs: inputs:
@@ -51,10 +54,8 @@ agents:
 system_prompt: "You are a meticulous code reviewer."
 # Option 1: use a preset (no need to write the pipeline YAML by hand)
-pipeline: preset:simple  # "A generates → B reviews" (default)
-# pipeline: preset:cross-review  # "both generate → review each other"
-# pipeline: preset:plan-review  # "review docs/plan before implementation"
-# pipeline: preset:coding-review-fix  # "one initial coding pass → repeated review/fix"
+pipeline: preset:coding-plan-review  # "implement from docs → code/doc review → fix → re-verify" (default)
+# pipeline: preset:plan-review  # "pre-implementation doc review → fix → re-verify, repeated"
 # Option 2: fully custom (for advanced users)
 # pipeline:
@@ -75,10 +76,8 @@ pipeline: preset:simple # "A 생성 → B 리뷰" (기본값)
 | Preset | Description | Auto-generated steps |
 |--------|-------------|----------------------|
-| `simple` | A codes → B reviews | coding(agent1) → review(agent2) |
-| `cross-review` | Both code, then review each other | coding_a → coding_b → review_of_b(agent_a) → review_of_a(agent_b) |
-| `plan-review` | Pre-implementation document review | parallel plan_review_* → senior_review(optional) |
-| `coding-review-fix` | Review/fix loop after initial coding | initial_coding(coding) → review_fix(review* → aggregate → coding → verify) |
+| `plan-review` | Pre-implementation document review/fix/re-verify loop | plan_review_* → aggregate_review → plan_fix → verify |
+| `coding-plan-review` | Implement from documents, then code/document review/fix loop | initial_coding(coding) → coding_plan_review(review* → aggregate → coding_plan_fix → verify) |
 A preset internally assembles the appropriate pipeline steps and context_override. agent1 and agent2 are assigned in the order the agents are defined. If the presets are not enough, you can write the steps yourself.
@@ -101,7 +100,7 @@ cross_eval/
 **models.py** (avoids circular references; all dataclasses live here):
 - `AgentConfig` (command, args, system_prompt, stdin_mode)
 - `StepConfig` (name, agent, role, prompt_template, output_key, verdict, verdict_pattern, context_override)
-- `PipelineConfig` (output_dir, max_iterations, inputs, agents, pipeline)
+- `PipelineConfig` (output_dir, use_worktree, max_iterations, inputs, agents, pipeline)
 - `AgentResult` (output, exit_code, agent_name, step_name, duration_seconds)
 - `IterationResult` (iteration, step_outputs, verdict, feedback)
 - `PipelineResult` (iterations, final_verdict, total_duration)
@@ -117,7 +116,7 @@ cross_eval/
 - `default:review`: reviews against three criteria (over-optimization, false positives, omissions), outputs `VERDICT: PASS|FAIL`, and instructs the reviewer to **"explore the project directory directly and verify the code"**
 - `{variable}` placeholders; a missing key renders as `(no {key} provided)`
 - Users can override with a custom .md file
-- `PIPELINE_PRESETS` dict: StepConfig lists defined per preset (`simple`, `cross-review`, `plan-review`)
+- `PIPELINE_PRESETS` / `PHASED_PRESETS` dicts: StepConfig/PhaseConfig definitions per preset (`plan-review`, `coding-plan-review`)
 **agent.py** `invoke_agent(agent_config, prompt, cwd)`:
 - The `cwd` parameter sets the project directory → the agent can explore files in that directory
@@ -139,16 +138,21 @@ for iteration 1..max_iterations:
 Generate final-report.md
``` ```
+The agentic execution path has two modes.
+- Default: direct mode (edits files directly in `cwd`)
+- Opt-in: isolated worktree mode (`--worktree` or `use_worktree: true`)
 **report.py** (final Markdown report):
 - Summary table (iteration count, verdict, elapsed time)
 - Per-iteration details (each step's output, agent name, elapsed time)
 - Final verdict
 **cli.py** (subcommands):
-- `cross-eval init [--dir .] [--preset simple|cross-review|plan-review]`: scaffolding (never overwrites existing files)
+- `cross-eval init [--dir .] [--preset coding-plan-review|plan-review]`: scaffolding (never overwrites existing files)
-- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...]`
+- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...] [--worktree]`
 - `--input key=path`: overrides or adds to the config's inputs
 - `--dry-run`: prints the rendered prompts without invoking any agent
+- `--worktree`: runs in an isolated git worktree instead of the default direct mode
 ## Files to modify
@@ -172,10 +176,12 @@ final-report.md 생성
 4. Put some simple content in plan.md/checklist.md and do a real run with `cross-eval run --max-iter 2`
 5. Verify that v1/ and final-report.md are created in the `output/` directory
+`--dry-run` is preview-only; the process exit code is `0` even when the actual verdict is not PASS.
 cross-eval run \
   --docs /Users/chungyeong/Desktop/Dev/new-alpha-foundry/plans/TO_CLICKHOUSE \
-  --preset coding-review-fix \
+  --preset coding-plan-review \
   --coder claude \
   --reviewer codex \
   --reviewer codex \
@@ -185,3 +191,6 @@ final-report.md 생성
   --reviewer-effort high \
   --senior-effort xhigh \
   --max-iter 10
+cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-plan-review --lang ko --max-iter 1


@@ -51,12 +51,15 @@ cp .cross-eval/checklist-sample.md .cross-eval/checklist.md
 ### 3. Run
```bash ```bash
-# Default run (coding → review, up to 3 iterations)
+# Default run (direct mode in the current working tree, up to 3 iterations)
 cross-eval run
 # Preview prompts only (no agent calls, saves cost)
 cross-eval run --dry-run
+# Pass this only when you want to run in an isolated git worktree
+cross-eval run --worktree
 # Change the maximum number of iterations
 cross-eval run --max-iter 5
@@ -80,6 +83,9 @@ output/
 └── final-report.md  # Overall summary report
``` ```
+The default is **direct mode**: `cross-eval` reads and modifies files directly in the current working tree.
+Add `--worktree` only when you need an isolated run; it then uses an isolated git worktree.
 ## Configuration (`.cross-eval/config.yaml`)
```yaml ```yaml
@@ -101,7 +107,8 @@ agents:
 args: ["-p", "--model", "opus", "--permission-mode", "plan"]
 system_prompt: "You are a meticulous code reviewer."
-pipeline: preset:simple
+pipeline: preset:coding-plan-review
+use_worktree: false  # Default. Set to true to use an isolated worktree
``` ```
 If you edit `config.yaml` during a run, the change is picked up automatically from the next iteration.
@@ -110,16 +117,16 @@ pipeline: preset:simple
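 | Preset | Description |
 |--------|-------------|
-| `simple` | Agent A codes, Agent B reviews (default) |
-| `cross-review` | Both code, then cross-review each other |
-| `plan-review` | Reviews the plan/checklist/reference docs before implementation and, if needed, checks consistency with the current codebase |
-| `review-only` | Reviews existing code only, for audit purposes |
-| `review-fix` | Aggregates review results, then repeats automatic fixes and re-verification |
-| `coding-review-fix` | One initial coding pass, then aggregates review results and repeats automatic fixes and re-verification |
+| `plan-review` | Reviews the plan/checklist/reference docs before implementation, fixes the documents, and repeats through re-verification |
+| `coding-plan-review` | Implements code from the input documents, then repeats review/fix/re-verification of code and documents together |
+The two presets differ only in role; most CLI options behave the same way. For example, `--plan`, `--checklist`, `--docs`, `--coder`, `--reviewer`, `--senior`, `--max-iter`, `--dry-run`, and `--worktree` can be used the same way with both.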
```bash ```bash
 # Init options
-cross-eval init --preset cross-review  # cross-review preset
+cross-eval init --preset coding-plan-review  # implementation + code/document review preset
-cross-eval init --preset plan-review  # pre-implementation document review preset
+cross-eval init --preset plan-review  # document review/fix/re-verify preset
 cross-eval init --lang en  # English templates
``` ```
+`cross-eval run --dry-run` only previews the prompts and pipeline configuration; the exit code is `0` even when the actual verdict is not PASS.
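
The README changes describe two run-time rules: worktree isolation is opt-in (via `--worktree` or `use_worktree: true`), and a dry run always exits with `0`. A minimal sketch of both rules follows. The function names are hypothetical, and the non-dry-run mapping (only `PASS` yields `0`) is an assumption for illustration rather than something this diff confirms.

```python
def resolve_use_worktree(cli_worktree: bool, raw_config: dict) -> bool:
    """Return True when the run should use an isolated git worktree."""
    # The config value defaults to False (direct mode); the CLI flag can only turn it on.
    from_config = bool(raw_config.get("use_worktree", False))
    return from_config or cli_worktree


def exit_code(final_verdict: str, dry_run: bool) -> int:
    """Map a run outcome to a process exit code."""
    if dry_run:
        return 0  # preview only: the verdict does not affect the exit code
    # Assumption for illustration: only PASS maps to 0 on a real run.
    return 0 if final_verdict == "PASS" else 1
```

Keeping the flag as an "or" over the config value means a config that already enables worktrees cannot be switched back to direct mode from the command line, which matches the `if args.worktree: config.use_worktree = True` logic shown later in this diff.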

178
UX_IMPROVEMENT_PLAN.md Normal file

@@ -0,0 +1,178 @@
# cross-eval UX Improvement Plan
> Raise the quality of user-facing guidance, error messages, and help text across the board
> so that first-time users can run a pipeline without getting stuck.
---
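## 1. Improve CLI help text
### 1.1 `cross-eval` main help
- [ ] Add a one-line summary of "what problem this tool solves" to the main description
  - Current: "A CLI tool that automatically verifies the output of AI coding agents"
  - Improved: "Automatically cross-verifies that an AI coding agent implemented what the plan specifies. Catches over-optimization, omissions, and false passes"
- [ ] Add a one-line description per subcommand to the main help (init/doctor/demo/run)
### 1.2 `cross-eval run` help
- [ ] The preset table in the epilog is too long; add a three-line "quick selection guide"
  - e.g. "Start with simple; use review-only for review alone; use coding-review-fix for coding + review + auto-fix"
- [ ] List the aliases (extra-high, x-high, etc.) in the `--reasoning-effort` help
- [ ] Explain how the `--target` option actually affects the prompt
- [ ] Add a summary of worktree creation/cleanup behavior to the `--agentic` flag description
- [ ] Add one line to the `--min-iter` description on why a run repeats after a PASS
  - e.g. "For checking result stability: re-verifies that a single PASS was not a fluke"
- [ ] Make the `--dry-run` description explicit: "preview prompts only, without invoking agents"
- [ ] Make the agent shorthand rule (claude → claude-coder, etc.) clearer, with examples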
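
The agent shorthand rule mentioned above (claude → claude-coder) could be documented with a small example. The sketch below assumes the expansion is simply `<name>-<role>`. `ROLES` and `expand_agent_name` are hypothetical names for illustration, not functions from cross-eval.

```python
ROLES = ("coder", "reviewer", "senior")


def expand_agent_name(name: str, role: str) -> str:
    """Expand a shorthand such as "claude" to a full agent name for a role."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    # A name that already carries a role suffix is treated as fully qualified.
    if any(name.endswith(f"-{known}") for known in ROLES):
        return name
    return f"{name}-{role}"
```

Under this assumption, `--coder claude` would resolve to `claude-coder` and `--reviewer codex` to `codex-reviewer`, while an explicit `claude-coder` is passed through unchanged.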
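### 1.3 `cross-eval init` help
- [ ] Make the `--guided` option more prominent, e.g. a "first time? --guided is recommended" note
- [ ] Add one line per generated file explaining how each file is used
### 1.4 `cross-eval doctor` help
- [ ] Show the list of checked items up front
- [ ] Include concrete commands for what to do when authentication fails
### 1.5 `cross-eval demo` help
- [ ] Add a comparison that shows the mock vs live difference at a glance
- [ ] Emphasize that the `--escalate` option is mock-only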
---
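## 2. Improve error messages
### 2.1 Missing required input
- [ ] Show a clear error when `cross-eval run` is executed without `--plan`:
  - "A plan is required. Pass --plan plan.md or set inputs.plan in .cross-eval/config.yaml."
- [ ] Add an INFO message stating that defaults are in use when running without config.yaml
### 2.2 Agent failure messages
- [ ] On `AUTH` failure, show the specific command that resolves it
  - Claude: "Authenticate with claude login"
  - Codex: "Authenticate with codex auth"
- [ ] On `USAGE_LIMIT`, hint at which limit was hit (tokens? billing?)
- [ ] On `EMPTY_DIFF`, print "The agent did not modify any files" plus a list of possible causes
- [ ] On `WRITE_FAILURE`, print the worktree path and its permission state
- [ ] On empty agent output, suggest causes, e.g. "The agent did not respond. The prompt may be too long or authentication may have expired"
### 2.3 Config validation errors
- [ ] Make duplicate step name errors specific: which step in which phase is duplicated
- [ ] When a nonexistent agent is referenced, include an "Available agents: ..." list (already present, but verify)
- [ ] Include the line number in YAML parsing errors
### 2.4 File/path errors
- [ ] Improve "File not found: {path}" to "파일을 찾을 수 없습니다: {path}\n 현재 디렉토리: {cwd}" (the same message in Korean, followed by "Current directory: {cwd}")
- [ ] When the docs directory is empty → "The reference docs folder is empty: {path}\n Add document files such as .md or .txt"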
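
The path-error items above come down to adding context (the current directory, the kind of file expected) to otherwise bare messages. A minimal sketch follows, with English wording used for illustration. Both function names are hypothetical, and the exact strings are assumptions rather than the project's final messages.

```python
from pathlib import Path
from typing import Optional


def file_not_found_message(path: str, cwd: Optional[Path] = None) -> str:
    """Build a file-not-found message that also reports the current directory."""
    cwd = cwd if cwd is not None else Path.cwd()
    return f"File not found: {path}\n  Current directory: {cwd}"


def empty_docs_message(path: str) -> str:
    """Build the message shown when the reference docs folder has no files."""
    return (
        f"The reference docs folder is empty: {path}\n"
        "  Add document files such as .md or .txt"
    )
```

Reporting the working directory next to the missing path makes relative-path mistakes visible without the user having to re-run with extra flags.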
---
## 3. Improve progress messages
### 3.1 During pipeline execution
- [ ] Print a summary banner when a run starts:
```
━━━ cross-eval ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Plan: .cross-eval/plan.md
Preset: simple (coding→review→repeat)
Coder: claude-coder
Reviewer: claude-reviewer
Max iter: 3
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
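- [ ] At the start of each iteration, print one line describing what the step is doing
  - e.g. "Iteration 1/3: Coder is producing the initial implementation from the plan..."
  - e.g. "Iteration 2/3: applying review feedback..."
- [ ] On timeout, print both the elapsed time and the time limit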
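
The summary banner proposed in 3.1 could be rendered with a small helper. The sketch below is only illustrative: `render_banner` is a hypothetical function, and the width and label padding are assumptions, since the plan shows the banner content but not how it is aligned.

```python
def render_banner(fields: dict[str, str], width: int = 44) -> str:
    """Render a start-of-run summary banner (layout is illustrative)."""
    title = "━━━ cross-eval "
    top = title + "━" * max(0, width - len(title))
    # Pad labels so the values line up in one column.
    label_width = max((len(key) for key in fields), default=0) + 2
    rows = [f"{(key + ':').ljust(label_width)}{value}" for key, value in fields.items()]
    return "\n".join([top, *rows, "━" * width])


banner = render_banner(
    {
        "Plan": ".cross-eval/plan.md",
        "Coder": "claude-coder",
        "Reviewer": "claude-reviewer",
        "Max iter": "3",
    }
)
```

Passing the fields as a dict keeps the banner independent of which options a given run actually sets.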
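### 3.2 Result summary
- [ ] Add the elapsed time to the final result
- [ ] On FAIL, give a short summary of "the N main issues raised in the last review"
- [ ] On ESCALATE, summarize in 1-2 lines why a human needs to look
- [ ] At the end of a dry run, state "This is a preview. Remove --dry-run to run for real"
### 3.3 Auto-escalation guidance
- [ ] When auto-escalation triggers, explain it as "N consecutive FAILs → automatic escalation"
- [ ] Mention in the run help which conditions trigger auto-escalation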
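
The "N consecutive FAILs → automatic escalation" rule can be stated precisely in a few lines. This is a sketch under assumptions: the plan does not give the value of N, so the `threshold` default of 3 is a placeholder, and `should_auto_escalate` is a hypothetical name.

```python
def should_auto_escalate(verdicts: list[str], threshold: int = 3) -> bool:
    """Return True when the most recent `threshold` verdicts are all FAIL."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if len(verdicts) < threshold:
        return False
    # Only the trailing run counts; an earlier PASS resets the streak.
    return all(verdict == "FAIL" for verdict in verdicts[-threshold:])
```

Spelling the condition out this way also gives the help text something concrete to quote when it explains why a run stopped early.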
---
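## 4. Improve the first-use experience (onboarding)
### 4.1 Guidance after init
- [ ] Include a real example in the plan.md template (currently only a minimal skeleton)
  - One concrete example under "## 기능 요구사항" (Functional requirements)
- [ ] Add a checklist-writing guide and examples to the checklist.md template
- [ ] Make the "next steps" guidance after init more concrete:
  - Current: "1. Write the plan in plan.md"
  - Improved: "1. Open .cross-eval/plan.md and write the plan (e.g. features to implement, API spec, DB schema)"
### 4.2 doctor improvements
- [ ] When all checks pass, print "Ready! Run cross-eval run --plan .cross-eval/plan.md"
- [ ] On authentication failure, include OS-specific install/auth guide URLs
### 4.3 demo improvements
- [ ] After the demo finishes, add a "To get started in a real project:" note
- [ ] In the mock demo, explain what each step does in comment style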
---
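## 5. Terminology consistency
- [ ] Consistently distinguish "agent name" from "agent role"
  - Name: claude-coder, codex-reviewer (the actual execution unit)
  - Role: coder, reviewer, senior (the logical role)
- [ ] Standardize verdict notation: always uppercase `PASS` / `FAIL` / `ESCALATE`
- [ ] Sort out "preset" vs "pipeline"
  - Refer to `--preset` consistently as the "pipeline type"
- [ ] Reduce mixed Korean/English; minimize unnecessary English in Korean mode
  - Exception: keep verdicts such as PASS/FAIL/ESCALATE in English (readability)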
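
One way to enforce the uppercase verdict convention is to normalize verdicts at the point where they are parsed from reviewer output. The sketch below assumes the `VERDICT: PASS|FAIL` line format described in the design document. The regular expression and `parse_verdict` are hypothetical and do not reflect the project's actual `verdict_pattern`.

```python
import re
from typing import Optional

# Assumed format: a line such as "VERDICT: PASS"; matching is case-insensitive.
_VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL|ESCALATE)\b", re.IGNORECASE)


def parse_verdict(text: str) -> Optional[str]:
    """Return the last verdict found in the text, normalized to uppercase."""
    matches = _VERDICT_RE.findall(text)
    return matches[-1].upper() if matches else None
```

Taking the last match is a design choice for this sketch: a reviewer may quote an earlier verdict before stating its own.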
---
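## 6. Output directory structure guidance
- [ ] Print a summary of the output folder structure when a run completes:
```
Output: .cross-eval/output/
├── iter-1/ (agent output for each iteration)
├── iter-2/
└── final-report.md (final report)
```
- [ ] Add a short "how to read this report" note at the top of report.md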
---
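## 7. Improve config.yaml comments
- [ ] Expand the explanatory comments for each section of the generated default config.yaml
- [ ] Include frequently used configuration changes as commented examples
  - e.g. "# To use two reviewers: reviewer: [claude, codex]"
  - e.g. "# Modify real files in agent mode: agentic: true"
- [ ] Add a commented example of a custom phase-based pipeline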
---
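## Priorities
| Priority | Item | Reason |
|----------|------|--------|
| P0 | 2.1 Missing required input errors | The most frequently hit problem |
| P0 | 4.1 Guidance after init + templates | Users who get stuck on first use drop off |
| P0 | 3.1 Run-start summary banner | Users need to know what is running |
| P1 | 2.2 Agent failure messages | Users do not know what to do after a failure |
| P1 | 1.2 run help cleanup | The many options cause confusion |
| P1 | 5. Terminology consistency | Reduces confusion |
| P2 | 3.2-3.3 Result/progress messages | Nice to have, not urgent |
| P2 | 7. config.yaml comments | Convenience for power users |
| P2 | 6. Output structure guidance | Understood after seeing it once |
| P3 | 1.3-1.5 Remaining help text | Incremental improvement |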
---
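## How to test
After changing each item:
1. **Check the help**: `cross-eval --help`, `cross-eval run --help`, etc.
2. **Check error paths**: run with deliberately bad input → is the error message useful?
3. **Simulate first use**: the full `init → doctor → demo → run` flow in an empty directory
4. **Verify with cross-eval itself**: use this document as plan.md and run cross-eval run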

31
checklist.md Normal file

@@ -0,0 +1,31 @@
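# cross-eval CLI usability refactoring checklist
## Core user flow
- [ ] Clearly explains what to do after `cross-eval init`.
- [ ] Explains when and why to use `cross-eval doctor`.
- [ ] The prerequisites for `cross-eval run` are clear.
- [ ] States that results are saved under `.cross-eval/output` after a run.
## Understanding the `run` command
- [ ] Differences between `--preset` values can be compared quickly.
- [ ] The role differences between `--coder`, `--reviewer`, and `--senior` are explained.
- [ ] The relationship between config-based and CLI-option-based execution is clear.
- [ ] It is unambiguous which options override the config.
## Example quality
- [ ] Representative usage examples are organized around real user goals.
- [ ] Examples are condensed to the key combinations rather than being so numerous that they distract.
- [ ] Basic examples for beginners and advanced examples are separated.
- [ ] Examples can be copied and run as-is.
## Scope control
- [ ] Existing command and option names are not changed.
- [ ] Functional behavior is not changed unnecessarily.
- [ ] The goal stays on improving guidance text, not adding new features.
- [ ] No UI/feature expansion beyond the plan's scope.
## Code quality
- [ ] Existing tests do not break.
- [ ] Regressions caused by help/text changes are checked.
- [ ] String changes do not contradict the actual output flow.
- [ ] No duplicated or conflicting explanations are introduced.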


@@ -34,6 +34,12 @@ _NO_CHANGE_ACK_MARKERS = (
"code is correct as-is", "code is correct as-is",
"already correct", "already correct",
"no action required", "no action required",
"변경 없음",
"수정 없음",
"수정할 필요 없음",
"변경할 필요 없음",
"이미 올바름",
"조치 불필요",
) )
_CHANGE_CLAIM_MARKERS = ( _CHANGE_CLAIM_MARKERS = (
"summary of all changes made", "summary of all changes made",
@@ -73,6 +79,15 @@ _CHANGE_CLAIM_MARKERS = (
"completed the implementation", "completed the implementation",
"all changes have been made", "all changes have been made",
"changes are complete", "changes are complete",
"수정 완료",
"모든 수정이 완료",
"변경 요약",
"변경 파일",
"신규 생성",
"기획서 수정",
"체크리스트 수정",
"문서를 수정",
"문서 수정",
) )
@@ -414,6 +429,7 @@ def invoke_agent_agentic(
env: Optional[dict[str, str]] = None, env: Optional[dict[str, str]] = None,
timeout: int | None = None, timeout: int | None = None,
quiet: bool = False, quiet: bool = False,
base_commit: str | None = None,
) -> AgentResult: ) -> AgentResult:
"""Invoke an agent in agentic mode using the worktree as the source of truth.""" """Invoke an agent in agentic mode using the worktree as the source of truth."""
from cross_eval.worktree import capture_diff from cross_eval.worktree import capture_diff
@@ -506,8 +522,8 @@ def invoke_agent_agentic(
suggested_action=suggested_action, suggested_action=suggested_action,
) )
# Capture git diff as the output (changes since last commit on the branch) # Capture git diff as the output (changes since the base commit)
diff_output = capture_diff(worktree_path) diff_output = capture_diff(worktree_path, base_commit=base_commit)
if not diff_output: if not diff_output:
stdout_excerpt = (result.stdout or "").strip() stdout_excerpt = (result.stdout or "").strip()


@@ -38,7 +38,7 @@ coders: [claude-coder]
reviewers: [claude-reviewer] reviewers: [claude-reviewer]
# seniors: [codex-senior] # seniors: [codex-senior]
# 파이프라인 종류: simple | cross-review | plan-review | review-only | review-fix | coding-review-fix # 파이프라인 종류: plan-review | coding-plan-review
pipeline: preset:{preset} pipeline: preset:{preset}
# 반복 설정 # 반복 설정
@@ -194,20 +194,12 @@ def main(argv: list[str] | None = None) -> int:
) )
init_parser.add_argument( init_parser.add_argument(
"--preset", "--preset",
default="simple", default="coding-plan-review",
choices=[ choices=["plan-review", "coding-plan-review"],
"simple",
"cross-review",
"plan-review",
"review-only",
"review-fix",
"coding-review-fix",
],
help=( help=(
"파이프라인 종류 (기본: simple). " "파이프라인 종류 (기본: coding-plan-review). "
"simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서기획검토, " "plan-review=문서리뷰수정재검증, "
"review-only=리뷰만, review-fix=리뷰수렴+자동수정, " "coding-plan-review=문서기반구현후 코드+문서 리뷰/수정/재검증"
"coding-review-fix=초기코딩후리뷰수렴"
), ),
) )
init_parser.add_argument( init_parser.add_argument(
@@ -252,9 +244,9 @@ def main(argv: list[str] | None = None) -> int:
) )
demo_parser.add_argument( demo_parser.add_argument(
"--preset", "--preset",
default="simple", default="coding-plan-review",
choices=["simple", "review-fix", "coding-review-fix"], choices=["plan-review", "coding-plan-review"],
help="데모할 파이프라인 종류 (기본: simple)", help="데모할 파이프라인 종류 (기본: coding-plan-review)",
) )
demo_parser.add_argument( demo_parser.add_argument(
"--escalate", "--escalate",
@@ -281,25 +273,12 @@ def main(argv: list[str] | None = None) -> int:
), ),
epilog=( epilog=(
"파이프라인 종류 (--preset):\n" "파이프라인 종류 (--preset):\n"
" ┌───────────────────────────────────────────────────────────────────┐\n" " ┌───────────────────────────────────────────────────────────────────┐\n"
"simple │ Coder가 코드 작성 → Reviewer가 리뷰 \n" "coding-plan-review │ 입력 문서 기반 구현 → 코드+문서 리뷰/수정\n"
" │ (기본값) │ FAIL이면 피드백 반영해서 재코딩, PASS까지 반복\n" " │ (기본값) │ → 재검증 반복 \n"
" ├───────────────────────────────────────────────────────────────────┤\n" " ├───────────────────────────────────────────────────────────────────┤\n"
" │ review-fix │ 2단계 파이프라인: \n" "plan-review │ 구현 전 문서 리뷰 → 문서 수정 → 재검증 반복\n"
" │ │ Reviewer N명 병렬 리뷰 → 취합 → 수정 → 재검증 │\n" " └─────────────────────┴──────────────────────────────────────────────┘\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ coding- │ 3단계 파이프라인: │\n"
" │ review-fix │ 초기 코딩 1회 → 리뷰 취합 → 수정 → 재검증 반복 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ plan-review │ 구현 전 기획서/체크리스트/문서를 검토 │\n"
" │ │ 필요하면 현재 코드베이스와의 정합성도 점검 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ review-only │ 코드 작성 없이 Reviewer N명이 기존 코드만 검토 │\n"
" │ │ (이미 작성된 코드의 품질 감사용) │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ cross-review │ Coder 2명이 각각 구현 → 상대방 코드를 교차 리뷰 │\n"
" │ │ (서로 다른 에이전트의 구현 비교용) │\n"
" └──────────────┴─────────────────────────────────────────────────────┘\n"
"\n" "\n"
"기본 제공 에이전트:\n" "기본 제공 에이전트:\n"
" ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n" " ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n"
@@ -316,34 +295,13 @@ def main(argv: list[str] | None = None) -> int:
"\n" "\n"
"사용 예시:\n" "사용 예시:\n"
"\n" "\n"
" 기본 실행 (Claude가 코딩하고 Claude가 리뷰):\n" " 코드 + 문서 구현/리뷰 루프 (coding-plan-review):\n"
" cross-eval run --plan plan.md\n" " cross-eval run --plan plan.md --preset coding-plan-review \\\n"
" --coder claude --reviewer codex --reviewer claude --senior codex\n"
"\n" "\n"
" Codex가 코딩, Claude가 리뷰:\n" " 문서 리뷰 + 수정 + 재검증 반복 (plan-review):\n"
" cross-eval run --plan plan.md --coder codex --reviewer claude\n"
"\n"
" 리뷰어 2명 (Claude + Codex):\n"
" cross-eval run --plan plan.md --reviewer claude --reviewer codex\n"
"\n"
" 리뷰 취합용 Senior 추가:\n"
" cross-eval run --plan plan.md --preset review-fix \\\n"
" --reviewer claude --reviewer codex --senior codex\n"
"\n"
" 리뷰 수렴 후 자동 수정 (review-fix):\n"
" cross-eval run --plan plan.md --preset review-fix \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 초기 코딩 후 리뷰 수렴 + 자동 수정 (coding-review-fix):\n"
" cross-eval run --plan plan.md --preset coding-review-fix \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 기존 코드 리뷰만 (review-only):\n"
" cross-eval run --plan plan.md --preset review-only \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 구현 전 문서/기획 검토 (plan-review):\n"
" cross-eval run --plan plan.md --preset plan-review \\\n" " cross-eval run --plan plan.md --preset plan-review \\\n"
" --reviewer claude --reviewer codex\n" " --coder claude --reviewer codex --reviewer claude --senior codex\n"
"\n" "\n"
" 모델 변경:\n" " 모델 변경:\n"
" cross-eval run --plan plan.md --model sonnet\n" " cross-eval run --plan plan.md --model sonnet\n"
@@ -420,7 +378,11 @@ def main(argv: list[str] | None = None) -> int:
) )
agent_group.add_argument( agent_group.add_argument(
"--agentic", action="store_true", default=False, "--agentic", action="store_true", default=False,
help="Coder를 agentic 모드로 실행 (worktree에서 파일 직접 수정, git diff로 결과 캡처)", help="Coder를 agentic 모드로 실행 (파일 직접 수정, git diff로 결과 캡처)",
)
agent_group.add_argument(
"--worktree", action="store_true", default=False,
help="기본 direct mode 대신 isolated git worktree에서 실행",
) )
agent_group.add_argument( agent_group.add_argument(
"--model", default=None, metavar="MODEL", "--model", default=None, metavar="MODEL",
@@ -443,15 +405,8 @@ def main(argv: list[str] | None = None) -> int:
pipe_group = run_parser.add_argument_group("파이프라인") pipe_group = run_parser.add_argument_group("파이프라인")
pipe_group.add_argument( pipe_group.add_argument(
"--preset", default=None, "--preset", default=None,
choices=[ choices=["plan-review", "coding-plan-review"],
"simple", help="파이프라인 종류 (기본: coding-plan-review). 각 종류 설명은 아래 참조",
"cross-review",
"plan-review",
"review-only",
"review-fix",
"coding-review-fix",
],
help="파이프라인 종류 (기본: simple). 각 종류 설명은 아래 참조",
) )
pipe_group.add_argument( pipe_group.add_argument(
"--max-iter", type=int, default=None, "--max-iter", type=int, default=None,
@@ -560,18 +515,11 @@ def cmd_demo(args: argparse.Namespace) -> int:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
_PRESET_DESCRIPTIONS = { _PRESET_DESCRIPTIONS = {
"simple": "코딩 + 리뷰 (가장 기본)", "coding-plan-review": "입력 문서 기반 구현 후 코드+문서 리뷰/수정 반복",
"review-fix": "리뷰 → 취합 → 수정 → 재검증 반복", "plan-review": "문서 리뷰 → 수정 → 재검증 반복",
"coding-review-fix": "초기 코딩 + 리뷰 수렴 반복",
"plan-review": "구현 전 기획서/문서 검토",
"review-only": "기존 코드만 리뷰 (코딩 없음)",
"cross-review": "2명이 각각 구현 후 교차 리뷰",
} }
_PRESET_ORDER = [ _PRESET_ORDER = ["coding-plan-review", "plan-review"]
"simple", "review-fix", "coding-review-fix",
"plan-review", "review-only", "cross-review",
]
def _prompt_choice( def _prompt_choice(
@@ -640,7 +588,7 @@ def _run_guided_init(target: Path) -> dict:
coder = _prompt_text(" Coder 에이전트", default="claude") coder = _prompt_text(" Coder 에이전트", default="claude")
reviewer = _prompt_text(" Reviewer 에이전트", default="claude") reviewer = _prompt_text(" Reviewer 에이전트", default="claude")
needs_senior = preset in ("review-fix", "coding-review-fix") needs_senior = preset in ("coding-plan-review", "plan-review")
senior = "" senior = ""
if needs_senior: if needs_senior:
senior = _prompt_text(" Senior 에이전트", default=reviewer) senior = _prompt_text(" Senior 에이전트", default=reviewer)
@@ -899,10 +847,10 @@ def cmd_run(args: argparse.Namespace) -> int:
need_rebuild = args.preset is not None or args.coders or args.reviewers or args.seniors need_rebuild = args.preset is not None or args.coders or args.reviewers or args.seniors
if need_rebuild: if need_rebuild:
from cross_eval.prompts import PHASED_PRESETS from cross_eval.prompts import PHASED_PRESETS
preset = args.preset or "simple" preset = args.preset or "coding-plan-review"
# Determine which preset was configured (from YAML or defaults) # Determine which preset was configured (from YAML or defaults)
if args.preset is None and config.phases: if args.preset is None and config.phases:
preset = config.preset_name if config.preset_name != "custom" else "review-fix" preset = config.preset_name if config.preset_name != "custom" else "coding-plan-review"
elif args.preset is None and not args.coders and not args.reviewers and not args.seniors: elif args.preset is None and not args.coders and not args.reviewers and not args.seniors:
pass # no changes needed pass # no changes needed
inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles( inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles(
@@ -929,8 +877,6 @@ def cmd_run(args: argparse.Namespace) -> int:
elif preset in PIPELINE_PRESETS: elif preset in PIPELINE_PRESETS:
config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors) config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
config.phases = [] config.phases = []
if preset in {"plan-review", "review-only"} and args.max_iter is None and args.min_iter is None:
config.max_iterations = 1
sync_phased_iterations(config) sync_phased_iterations(config)
if args.max_iter is not None: if args.max_iter is not None:
@@ -951,6 +897,9 @@ def cmd_run(args: argparse.Namespace) -> int:
if coder_name in config.agents: if coder_name in config.agents:
_make_agentic(config.agents[coder_name]) _make_agentic(config.agents[coder_name])
if args.worktree:
config.use_worktree = True
ensure_fix_preset_agentic(config) ensure_fix_preset_agentic(config)
# --model: apply to ALL agents # --model: apply to ALL agents
@@ -988,7 +937,7 @@ def cmd_run(args: argparse.Namespace) -> int:
print(f"No files found in: {docs_dir}", file=sys.stderr) print(f"No files found in: {docs_dir}", file=sys.stderr)
return 1 return 1
config.inputs["docs"] = docs_content config.inputs["docs"] = docs_content
config.inputs["docs_ref"] = str(docs_dir) config.inputs["docs_ref"] = docs_dir
if args.env_files: if args.env_files:
for env_file in args.env_files: for env_file in args.env_files:
@@ -1062,6 +1011,9 @@ def cmd_run(args: argparse.Namespace) -> int:
if not args.dry_run and result.run_dir: if not args.dry_run and result.run_dir:
print(f"Output: {result.run_dir}/") print(f"Output: {result.run_dir}/")
if args.dry_run:
return 0
if result.final_verdict == "ESCALATE": if result.final_verdict == "ESCALATE":
from cross_eval.report import print_escalation_report from cross_eval.report import print_escalation_report
print_escalation_report(config, result) print_escalation_report(config, result)


@@ -31,7 +31,10 @@ DEFAULT_ROLE_REASONING_EFFORTS = {
"reviewer": "medium", "reviewer": "medium",
"senior": "high", "senior": "high",
} }
FIX_STYLE_PRESETS = {"review-fix", "coding-review-fix"} FIX_STYLE_PRESETS = {
"plan-review",
"coding-plan-review",
}
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
@@ -296,7 +299,10 @@ def _default_seniors_for_preset(
"""Infer a default senior agent for presets that benefit from adjudication.""" """Infer a default senior agent for presets that benefit from adjudication."""
if not ( if not (
isinstance(pipeline_raw, str) isinstance(pipeline_raw, str)
and pipeline_raw in {"preset:review-fix", "preset:coding-review-fix"} and pipeline_raw in {
"preset:plan-review",
"preset:coding-plan-review",
}
and reviewers and reviewers
): ):
return [] return []
@@ -378,9 +384,11 @@ def default_config() -> PipelineConfig:
coders = ["claude-coder"] coders = ["claude-coder"]
reviewers = ["claude-reviewer"] reviewers = ["claude-reviewer"]
seniors: list[str] = [] seniors: list[str] = []
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors) pipeline: list[StepConfig] = []
phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
return PipelineConfig( return PipelineConfig(
output_dir=Path(".cross-eval/output"), output_dir=Path(".cross-eval/output"),
use_worktree=False,
max_iterations=3, max_iterations=3,
language="ko", language="ko",
execution=ExecutionConfig(), execution=ExecutionConfig(),
@@ -390,6 +398,8 @@ def default_config() -> PipelineConfig:
reviewers=reviewers, reviewers=reviewers,
seniors=seniors, seniors=seniors,
pipeline=pipeline, pipeline=pipeline,
phases=phases,
preset_name="coding-plan-review",
) )
@@ -433,7 +443,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
) )
# --- roles: explicit or inferred --- # --- roles: explicit or inferred ---
pipeline_raw = raw.get("pipeline", "preset:simple") pipeline_raw = raw.get("pipeline", "preset:coding-plan-review")
coders_raw = raw.get("coders") coders_raw = raw.get("coders")
reviewers_raw = raw.get("reviewers") reviewers_raw = raw.get("reviewers")
seniors_raw = raw.get("seniors") seniors_raw = raw.get("seniors")
@@ -494,6 +504,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
config = PipelineConfig( config = PipelineConfig(
output_dir=output_dir, output_dir=output_dir,
use_worktree=bool(raw.get("use_worktree", False)),
max_iterations=int(raw.get("max_iterations", 3)), max_iterations=int(raw.get("max_iterations", 3)),
min_iterations=int(raw.get("min_iterations", 1)), min_iterations=int(raw.get("min_iterations", 1)),
verbose=bool(raw.get("verbose", False)), verbose=bool(raw.get("verbose", False)),
@@ -551,10 +562,10 @@ def _resolve_pipeline(
     """Resolve pipeline from preset string or explicit step list.
     Returns (steps, phases) tuple. Only one will be non-empty.
-    - Simple/cross-review/plan-review/review-only → steps populated, phases empty.
-    - Phased presets (review-fix) → steps empty, phases populated.
+    - plan-review → steps populated, phases empty.
+    - coding-plan-review → steps empty, phases populated.
     """
-    # Preset: "preset:simple" or "preset:review-fix"
+    # Preset: "preset:plan-review" or "preset:coding-plan-review"
     if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
         preset_name = pipeline_raw.split(":", 1)[1]
         if preset_name in PIPELINE_PRESETS:
@@ -588,7 +599,7 @@ def _resolve_pipeline(
         return steps, []
     raise ValueError(
-        f"'pipeline' must be a preset string (e.g. 'preset:simple') "
+        f"'pipeline' must be a preset string (e.g. 'preset:plan-review') "
         f"or a list of step definitions, got {type(pipeline_raw).__name__}"
     )
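
Reviewer note: the preset handling changed above can be sketched as follows. This is a minimal illustration, not the project's code. The registries `STEP_PRESETS` and `PHASED_PRESETS` and their contents are hypothetical stand-ins (the real registries appear to map preset names to builder callables), and `resolve` is an assumed name.

```python
# Hypothetical registries for illustration only; step and phase names are made up.
STEP_PRESETS = {"plan-review": ["review", "fix", "verify"]}
PHASED_PRESETS = {"coding-plan-review": [["code"], ["review", "fix"]]}


def resolve(pipeline_raw="preset:coding-plan-review"):
    """Return a (steps, phases) tuple; exactly one of the two is non-empty."""
    if not (isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:")):
        raise ValueError(
            "'pipeline' must be a preset string (e.g. 'preset:plan-review'), "
            f"got {type(pipeline_raw).__name__}"
        )
    # Split only on the first colon so the preset name itself may contain colons.
    name = pipeline_raw.split(":", 1)[1]
    if name in STEP_PRESETS:
        return STEP_PRESETS[name], []
    if name in PHASED_PRESETS:
        return [], PHASED_PRESETS[name]
    raise ValueError(f"unknown preset: {name}")
```

With no argument the sketch falls back to the phased `coding-plan-review` preset, which mirrors the new default in `_parse_raw`.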


@@ -165,7 +165,7 @@ CYAN = "\033[36m"
 RESET = "\033[0m"
-def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
+def run_mock_demo(preset: str = "coding-plan-review", show_escalate: bool = False) -> None:
     """Run a simulated demo showing the full pipeline lifecycle."""
     steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS
@@ -229,7 +229,7 @@ def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
 def run_live_demo(
-    preset: str = "simple",
+    preset: str = "coding-plan-review",
     timeout: int | None = None,
 ) -> PipelineResult:
     """Run a live demo with real agents using the built-in plan."""
@@ -255,8 +255,9 @@ def run_live_demo(
         pipeline = []
         phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
     else:
-        pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
-        phases = []
+        pipeline = []
+        phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
     with tempfile.TemporaryDirectory() as tmpdir:
         plan_path = Path(tmpdir) / "plan.md"


@@ -62,6 +62,7 @@ class PipelineConfig:
     """Full cross-eval configuration."""
     output_dir: Path = field(default_factory=lambda: Path(".cross-eval/output"))
+    use_worktree: bool = False
     max_iterations: int = 3
     min_iterations: int = 1
     verbose: bool = False
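
Reviewer note: the new `use_worktree` flag defaults to direct mode, so agents edit the current working tree unless isolation is requested. A minimal sketch of that selection follows. `RunConfig` and `pick_execution_path` are hypothetical names used only for this illustration, and the sketch assumes the worktree path is `None` whenever no worktree was created.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class RunConfig:
    """Hypothetical stand-in for PipelineConfig; only the new flag is modeled."""

    use_worktree: bool = False


def pick_execution_path(
    config: RunConfig, cwd: Path, worktree_path: Optional[Path]
) -> Path:
    """Return the directory the agentic coder should edit."""
    if config.use_worktree and worktree_path is not None:
        return worktree_path  # isolated mode: edits land on a separate branch
    return cwd  # direct mode (default): edits land in the current working tree
```

Keeping the flag off by default means a stray worktree path is ignored unless the user opted in.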


@@ -4,6 +4,7 @@ from __future__ import annotations
 import logging
 import os
 import re
+import shutil
 import subprocess
 import time
 from hashlib import sha256
@@ -34,6 +35,19 @@ from cross_eval.runtime_env import (
 logger = logging.getLogger(__name__)
+def _get_current_head(cwd: Path) -> str | None:
+    """Return the current HEAD SHA for an existing repository."""
+    result = subprocess.run(
+        ["git", "rev-parse", "HEAD"],
+        cwd=cwd,
+        capture_output=True,
+        text=True,
+    )
+    if result.returncode != 0:
+        return None
+    return result.stdout.strip() or None
 def run_pipeline(
     config: PipelineConfig,
     cwd: Path | None = None,
@@ -62,18 +76,20 @@ def _commit_iteration(
     label: str,
     iteration: int,
     verdict: str | None,
-) -> None:
+) -> str:
     """Intermediate commit after each agentic iteration.
     This resets the diff baseline so the next iteration only captures new changes.
+    Returns the new HEAD SHA to use as the base_commit for the next iteration.
     """
-    from cross_eval.worktree import commit_worktree
+    from cross_eval.worktree import commit_worktree, get_current_head
     committed = commit_worktree(
         worktree_path,
         f"cross-eval: {label} v{iteration} ({verdict or 'no-verdict'})",
     )
     if committed:
         logger.debug(" Intermediate commit: v%d (%s)", iteration, verdict)
+    return get_current_head(worktree_path)
 def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
@@ -84,50 +100,124 @@ def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
     )
-def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str]:
+def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str, str]:
     """Create a shared worktree for the entire pipeline run.
     1. Generate branch name (cross-eval/<preset>_<timestamp>)
     2. Create branch from HEAD
     3. Create worktree on that branch
-    Returns (worktree_path, branch_name).
+    Returns (worktree_path, branch_name, base_commit).
     """
     from cross_eval.worktree import create_worktree, make_branch_name, make_worktree_dir
     branch_name = make_branch_name(preset_name)
     worktree_dir = make_worktree_dir(cwd, branch_name)
-    worktree_path = create_worktree(
+    worktree_path, base_commit = create_worktree(
         base_cwd=cwd, work_dir=worktree_dir, branch_name=branch_name,
     )
     (run_dir / "worktree_path.txt").write_text(f"{worktree_path}\n", encoding="utf-8")
     (run_dir / "worktree_branch.txt").write_text(f"{branch_name}\n", encoding="utf-8")
-    return worktree_path, branch_name
+    (run_dir / "worktree_base.txt").write_text(f"{base_commit}\n", encoding="utf-8")
+    return worktree_path, branch_name, base_commit
 def _copy_inputs_to_worktree(
     config: PipelineConfig,
     worktree_path: Path,
+    *,
+    base_cwd: Path,
 ) -> None:
     """Copy input files (plan, checklist, etc.) into the worktree.
-    This ensures agents running in plan/read-only mode within the worktree
-    can access these files, even though the originals live in the base repo.
-    Updates config.inputs in-place so subsequent reference refreshes use
+    Repo-local inputs are remapped to the corresponding path inside the worktree
+    so agentic edits produce a real git diff. External inputs are copied into a
+    dedicated inputs directory. For ``plan-review`` these external copies remain
+    tracked so document edits can survive on the branch; other presets keep them
+    ignored to avoid polluting code diffs.
+    Updates ``config.inputs`` in-place so subsequent reference refreshes use
     worktree-local paths.
     """
-    import shutil
+    base_root = base_cwd.resolve()
+    track_external_inputs = config.preset_name == "plan-review"
     inputs_dir = worktree_path / ".cross-eval-inputs"
     inputs_dir.mkdir(exist_ok=True)
-    # Exclude from git so these don't pollute agentic diffs
-    (inputs_dir / ".gitignore").write_text("*\n", encoding="utf-8")
+    if not track_external_inputs:
+        # Exclude read-only input copies from git so they don't pollute code diffs.
+        (inputs_dir / ".gitignore").write_text("*\n", encoding="utf-8")
     for key, val in list(config.inputs.items()):
-        if key.endswith("_ref") or not isinstance(val, Path):
+        if not isinstance(val, Path):
             continue
         if not val.exists():
             continue
-        dest = inputs_dir / val.name
-        shutil.copy2(val, dest)
-        config.inputs[key] = dest
+        resolved = val.resolve()
+        try:
+            rel_path = resolved.relative_to(base_root)
+        except ValueError:
+            dest = inputs_dir / val.name
+            _copy_path(resolved, dest)
+            config.inputs[key] = dest
+            continue
+        worktree_target = worktree_path / rel_path
+        if not worktree_target.exists():
+            _copy_path(resolved, worktree_target)
+        config.inputs[key] = worktree_target
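
Reviewer note: the remapping rule in `_copy_inputs_to_worktree` can be isolated as a pure path computation. The sketch below is an illustration under assumptions: `remap_input` is a hypothetical helper, it uses `PurePosixPath` so that no filesystem access or `resolve()` call is needed, and it performs no copying.

```python
from pathlib import PurePosixPath


def remap_input(
    path: PurePosixPath,
    base_root: PurePosixPath,
    worktree: PurePosixPath,
    inputs_dir: PurePosixPath,
) -> PurePosixPath:
    """Compute where an input should live inside the worktree."""
    try:
        # Repo-local input: keep the same relative location so edits show up in git diff.
        rel = path.relative_to(base_root)
    except ValueError:
        # External input: relative_to() raises ValueError, so fall back to the inputs dir.
        return inputs_dir / path.name
    return worktree / rel
```

One consequence worth checking in review: two external inputs that share a file name would map to the same destination under this rule.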
+def _snapshot_input_paths(config: PipelineConfig) -> dict[str, Path]:
+    """Capture original on-disk input paths before remapping into a worktree."""
+    return {
+        key: val
+        for key, val in config.inputs.items()
+        if isinstance(val, Path)
+    }
+def _apply_worktree_inputs_to_base(
+    config: PipelineConfig,
+    original_inputs: dict[str, Path],
+    *,
+    cwd: Path,
+) -> list[Path]:
+    """Copy the final worktree-edited inputs back onto the user-provided paths."""
+    restored: list[Path] = []
+    for key, original_path in original_inputs.items():
+        current_path = config.inputs.get(key)
+        if not isinstance(current_path, Path) or not current_path.exists():
+            continue
+        if current_path.resolve() == original_path.resolve():
+            continue
+        _copy_path(current_path, original_path)
+        restored.append(original_path)
+    return restored
+def _commit_base_repo_paths(cwd: Path, paths: list[Path], message: str) -> bool:
+    """Commit changed input paths in the base repository when they live under cwd."""
+    rel_paths: list[str] = []
+    for path in paths:
+        try:
+            rel_paths.append(str(path.resolve().relative_to(cwd.resolve())))
+        except ValueError:
+            continue
+    if not rel_paths:
+        return False
+    subprocess.run(
+        ["git", "add", "--", *rel_paths],
+        cwd=cwd,
+        capture_output=True,
+        check=True,
+    )
+    result = subprocess.run(
+        ["git", "commit", "-m", message],
+        cwd=cwd,
+        capture_output=True,
+        text=True,
+    )
+    return result.returncode == 0
 def _snapshot_repo_state(cwd: Path) -> dict[str, str]:
@@ -320,17 +410,26 @@ def _run_simple_pipeline(
     # Setup shared worktree for agentic mode
     worktree_path: Path | None = None
+    agent_execution_path: Path | None = None
     agentic_branch_name: str | None = None
+    agentic_base_commit: str | None = None
+    original_input_paths: dict[str, Path] = {}
     base_repo_state: dict[str, str] | None = None
     base_repo_status: str | None = None
     if not dry_run and _has_agentic_steps(config, config.pipeline):
-        worktree_path, agentic_branch_name = _setup_worktree(
-            cwd, run_dir, config.preset_name,
-        )
-        _copy_inputs_to_worktree(config, worktree_path)
-        _refresh_input_references(config, input_contents)
-        base_repo_state = _snapshot_repo_state(cwd)
-        base_repo_status = _snapshot_repo_status(cwd)
+        if config.use_worktree:
+            worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
+                cwd, run_dir, config.preset_name,
+            )
+            original_input_paths = _snapshot_input_paths(config)
+            _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
+            _refresh_input_references(config, input_contents)
+            base_repo_state = _snapshot_repo_state(cwd)
+            base_repo_status = _snapshot_repo_status(cwd)
+            agent_execution_path = worktree_path
+        else:
+            agent_execution_path = cwd
+            agentic_base_commit = _get_current_head(cwd)
feedback = "(no feedback — first iteration)" feedback = "(no feedback — first iteration)"
iterations: list[IterationResult] = [] iterations: list[IterationResult] = []
@@ -356,15 +455,16 @@ def _run_simple_pipeline(
                 config.pipeline, config, input_contents, feedback,
                 i, config.max_iterations, cwd, timeout, dry_run,
                 run_dir=run_dir, output_iter=i,
-                worktree_path=worktree_path,
+                worktree_path=agent_execution_path,
                 runtime_env=runtime_env,
                 base_repo_state=base_repo_state,
                 base_repo_status=base_repo_status,
+                base_commit=agentic_base_commit,
             )
             # Intermediate commit so next iteration's diff only shows new changes
-            if worktree_path is not None:
-                _commit_iteration(worktree_path, config.preset_name, i, verdict)
+            if config.use_worktree and worktree_path is not None:
+                agentic_base_commit = _commit_iteration(worktree_path, config.preset_name, i, verdict)
             iter_result = IterationResult(
                 iteration=i,
@@ -454,8 +554,25 @@ def _run_simple_pipeline(
                 break
     finally:
+        if config.use_worktree and worktree_path is not None and original_input_paths:
+            restored_paths = _apply_worktree_inputs_to_base(
+                config, original_input_paths, cwd=cwd,
+            )
+            if restored_paths:
+                try:
+                    committed = _commit_base_repo_paths(
+                        cwd,
+                        restored_paths,
+                        f"cross-eval: {config.preset_name} ({final_verdict})",
+                    )
+                    if committed:
+                        logger.info(" Applied and committed final input changes in base repo.")
+                    else:
+                        logger.info(" Applied final input changes in base repo (no commit created).")
+                except Exception:
+                    logger.warning(" Failed to commit final input changes in base repo", exc_info=True)
         agentic_branch: str | None = None
-        if worktree_path is not None and agentic_branch_name is not None:
+        if config.use_worktree and worktree_path is not None and agentic_branch_name is not None:
             agentic_branch = _finalize_worktree(
                 cwd, worktree_path, agentic_branch_name,
                 config.preset_name, final_verdict,
@@ -497,17 +614,26 @@ def _run_phased_pipeline(
     # Setup shared worktree for agentic mode
     all_phase_steps = [s for p in config.phases for s in p.steps]
     worktree_path: Path | None = None
+    agent_execution_path: Path | None = None
     agentic_branch_name: str | None = None
+    agentic_base_commit: str | None = None
+    original_input_paths: dict[str, Path] = {}
     base_repo_state: dict[str, str] | None = None
     base_repo_status: str | None = None
     if not dry_run and _has_agentic_steps(config, all_phase_steps):
-        worktree_path, agentic_branch_name = _setup_worktree(
-            cwd, run_dir, config.preset_name,
-        )
-        _copy_inputs_to_worktree(config, worktree_path)
-        _refresh_input_references(config, input_contents)
-        base_repo_state = _snapshot_repo_state(cwd)
-        base_repo_status = _snapshot_repo_status(cwd)
+        if config.use_worktree:
+            worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
+                cwd, run_dir, config.preset_name,
+            )
+            original_input_paths = _snapshot_input_paths(config)
+            _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
+            _refresh_input_references(config, input_contents)
+            base_repo_state = _snapshot_repo_state(cwd)
+            base_repo_status = _snapshot_repo_status(cwd)
+            agent_execution_path = worktree_path
+        else:
+            agent_execution_path = cwd
+            agentic_base_commit = _get_current_head(cwd)
     iterations: list[IterationResult] = []
feedback = "(no feedback — first iteration)" feedback = "(no feedback — first iteration)"
@@ -554,15 +680,16 @@ def _run_phased_pipeline(
                     phase.steps, config, input_contents, feedback,
                     pi, phase.max_iterations, cwd, timeout, dry_run,
                     run_dir=run_dir, output_iter=global_iter, phase_name=phase.name,
-                    worktree_path=worktree_path,
+                    worktree_path=agent_execution_path,
                     runtime_env=runtime_env,
                     base_repo_state=base_repo_state,
                     base_repo_status=base_repo_status,
+                    base_commit=agentic_base_commit,
                 )
                 # Intermediate commit so next iteration's diff only shows new changes
-                if worktree_path is not None:
-                    _commit_iteration(
+                if config.use_worktree and worktree_path is not None:
+                    agentic_base_commit = _commit_iteration(
                         worktree_path, f"{config.preset_name}/{phase.name}",
                         global_iter, verdict,
                     )
@@ -689,8 +816,25 @@ def _run_phased_pipeline(
         final_verdict = "PASS" if phase_converged else "MAX_ITERATIONS_REACHED"
     finally:
+        if config.use_worktree and worktree_path is not None and original_input_paths:
+            restored_paths = _apply_worktree_inputs_to_base(
+                config, original_input_paths, cwd=cwd,
+            )
+            if restored_paths:
+                try:
+                    committed = _commit_base_repo_paths(
+                        cwd,
+                        restored_paths,
+                        f"cross-eval: {config.preset_name} ({final_verdict})",
+                    )
+                    if committed:
+                        logger.info(" Applied and committed final input changes in base repo.")
+                    else:
+                        logger.info(" Applied final input changes in base repo (no commit created).")
+                except Exception:
+                    logger.warning(" Failed to commit final input changes in base repo", exc_info=True)
         agentic_branch: str | None = None
-        if worktree_path is not None and agentic_branch_name is not None:
+        if config.use_worktree and worktree_path is not None and agentic_branch_name is not None:
             agentic_branch = _finalize_worktree(
                 cwd, worktree_path, agentic_branch_name,
                 config.preset_name, final_verdict,
@@ -724,6 +868,8 @@ def _load_inputs(config: PipelineConfig) -> dict[str, str]:
     for key, val in config.inputs.items():
         if key.endswith("_ref"):
             input_contents[key] = str(val)
+        elif key == "docs":
+            input_contents[key] = _load_docs_input(config, current_value=val)
         elif isinstance(val, str):
             input_contents[key] = val
         else:
@@ -739,6 +885,8 @@ def _refresh_inputs(
     for key, val in config.inputs.items():
         if key.endswith("_ref"):
             input_contents[key] = str(val)
+        elif key == "docs":
+            input_contents[key] = _load_docs_input(config, current_value=val)
         elif isinstance(val, str):
             input_contents[key] = val
         elif isinstance(val, Path) and val.exists():
@@ -746,6 +894,40 @@ def _refresh_inputs(
     _refresh_input_references(config, input_contents)
+def _load_docs_input(config: PipelineConfig, *, current_value: Path | str) -> str:
+    """Load docs content from docs_ref when available so edits are visible next iteration."""
+    docs_ref = config.inputs.get("docs_ref")
+    docs_path = docs_ref if isinstance(docs_ref, Path) else None
+    if docs_path is not None and docs_path.exists():
+        if docs_path.is_dir():
+            return _read_docs_tree(docs_path)
+        try:
+            return docs_path.read_text(encoding="utf-8")
+        except (UnicodeDecodeError, OSError):
+            return ""
+    if isinstance(current_value, str):
+        return current_value
+    if current_value.exists() and current_value.is_file():
+        return current_value.read_text(encoding="utf-8")
+    return ""
+def _read_docs_tree(docs_dir: Path) -> str:
+    """Read all visible text files under a docs tree and concatenate them."""
+    parts: list[str] = []
+    for f in sorted(
+        path for path in docs_dir.rglob("*")
+        if path.is_file() and not any(part.startswith(".") for part in path.relative_to(docs_dir).parts)
+    ):
+        try:
+            content = f.read_text(encoding="utf-8")
+        except (UnicodeDecodeError, OSError):
+            continue
+        rel_path = f.relative_to(docs_dir).as_posix()
+        parts.append(f"### {rel_path}\n{content}")
+    return "\n\n".join(parts)
 def _refresh_input_references(
     config: PipelineConfig,
     input_contents: dict[str, str],
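
Reviewer note: the behavior of `_read_docs_tree` (skip hidden paths, sort, prefix each file with a `###` heading) can be exercised in isolation. The sketch below re-implements the same idea under hypothetical names (`read_visible_tree`, `demo`) against a throwaway directory. It is an illustration, not the module's code.

```python
import tempfile
from pathlib import Path


def read_visible_tree(root: Path) -> str:
    """Concatenate all readable UTF-8 files under root, skipping dot-prefixed paths."""
    parts = []
    for f in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = f.relative_to(root)
        # A file is hidden if any component of its relative path starts with a dot.
        if any(part.startswith(".") for part in rel.parts):
            continue
        try:
            content = f.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # unreadable or non-UTF-8 files are skipped silently
        parts.append(f"### {rel.as_posix()}\n{content}")
    return "\n\n".join(parts)


def demo() -> str:
    """Build a small tree with one hidden directory and return the concatenation."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.md").write_text("alpha", encoding="utf-8")
        (root / ".hidden").mkdir()
        (root / ".hidden" / "x.md").write_text("skip", encoding="utf-8")
        (root / "sub").mkdir()
        (root / "sub" / "b.md").write_text("beta", encoding="utf-8")
        return read_visible_tree(root)
```

Because unreadable files are dropped without a log line, a binary file placed in the docs tree would simply vanish from the reviewer's input.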
@@ -903,6 +1085,7 @@ def _run_steps(
     runtime_env: dict[str, str] | None = None,
     base_repo_state: dict[str, str] | None = None,
     base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> tuple[dict[str, str], dict[str, AgentResult], str | None]:
     """Execute all steps in one iteration, parallelizing where possible."""
     step_outputs: dict[str, str] = {}
@@ -923,6 +1106,7 @@ def _run_steps(
                 runtime_env=runtime_env,
                 base_repo_state=base_repo_state,
                 base_repo_status=base_repo_status,
+                base_commit=base_commit,
             )
         else:
             _execute_parallel_batch(
@@ -934,6 +1118,7 @@ def _run_steps(
                 runtime_env=runtime_env,
                 base_repo_state=base_repo_state,
                 base_repo_status=base_repo_status,
+                base_commit=base_commit,
             )
     # Extract verdict from all verdict steps (ALL must PASS; ESCALATE wins over all)
@@ -961,6 +1146,7 @@ def _invoke_agentic(
     env: dict[str, str] | None = None,
     timeout: int | None = None,
     quiet: bool = False,
+    base_commit: str | None = None,
 ) -> AgentResult:
     """Run an agent in agentic mode using an existing worktree."""
     return invoke_agent_agentic(
@@ -968,6 +1154,7 @@ def _invoke_agentic(
         worktree_path=worktree_path,
         env=env,
         timeout=timeout, quiet=quiet,
+        base_commit=base_commit,
     )
@@ -992,6 +1179,7 @@ def _execute_step(
     runtime_env: dict[str, str] | None = None,
     base_repo_state: dict[str, str] | None = None,
     base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> None:
     """Execute a single step, updating step_outputs and step_results in place."""
     if not quiet:
@@ -1035,6 +1223,7 @@ def _execute_step(
             worktree_path=worktree_path,
             env=runtime_env,
             timeout=timeout, quiet=quiet,
+            base_commit=base_commit,
         )
     else:
         # When worktree exists, run non-agentic agents (reviewers) in
@@ -1125,6 +1314,7 @@ def _execute_parallel_batch(
     runtime_env: dict[str, str] | None = None,
     base_repo_state: dict[str, str] | None = None,
     base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> None:
     """Execute multiple steps in parallel using threads."""
     agent_names = ", ".join(s.agent for s in batch)
@@ -1139,6 +1329,7 @@ def _execute_parallel_batch(
             run_dir=run_dir, output_iter=output_iter, phase_name=phase_name,
             base_repo_state=base_repo_state,
             base_repo_status=base_repo_status,
+            base_commit=base_commit,
         )
         return
@@ -1161,6 +1352,7 @@ def _execute_parallel_batch(
             phase_name=phase_name, worktree_path=worktree_path,
             base_repo_state=base_repo_state,
             base_repo_status=base_repo_status,
+            base_commit=base_commit,
         )
         return
@@ -1204,6 +1396,7 @@ def _execute_parallel_batch(
                 worktree_path=worktree_path,
                 env=runtime_env,
                 timeout=timeout, quiet=True,
+                base_commit=base_commit,
             )
         else:
             effective_cwd = worktree_path if worktree_path else cwd
@@ -1664,3 +1857,12 @@ def _save_report(run_dir: Path, config: PipelineConfig, result: PipelineResult)
     report_path.parent.mkdir(parents=True, exist_ok=True)
     report_path.write_text(report, encoding="utf-8")
     logger.info("Report saved: %s", report_path)
+def _copy_path(src: Path, dest: Path) -> None:
+    """Copy a file or directory into the worktree, preserving structure."""
+    if src.is_dir():
+        shutil.copytree(src, dest, dirs_exist_ok=True)
+        return
+    dest.parent.mkdir(parents=True, exist_ok=True)
+    shutil.copy2(src, dest)


@@ -472,12 +472,270 @@ PLAN_REVIEW_TEMPLATE_KO = """\
 그렇지 않으면: VERDICT: FAIL
 """
PLAN_FIX_TEMPLATE = """\
You are tasked with revising planning documents based on adjudicated review feedback.
## Artifact References
{artifact_references}
## Current Review Feedback
{feedback}
## Instructions
1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
2. Update the planning package itself: the plan, checklist, and reference documents as needed.
3. Do NOT write or modify production code. Only revise planning artifacts.
4. Address ONLY the confirmed planning issues from the current review feedback.
5. If feedback marks any item as DISMISSED or false positive, leave it unchanged.
6. Make the smallest document changes that resolve ambiguity, omissions, scope creep, or repository compatibility issues.
7. Keep the plan, checklist, and supporting docs internally consistent after your edits.
8. After editing, briefly summarize what you changed and any blocker that still needs human input.
"""
PLAN_FIX_TEMPLATE_KO = """\
당신은 시니어 리뷰 결과를 바탕으로 기획 문서를 수정하는 담당자입니다.
## 참조 아티팩트
{artifact_references}
## 현재 리뷰 피드백
{feedback}
## 지침
1. 참조된 plan/checklist/docs/review markdown를 직접 읽으세요.
2. 수정 대상은 기획 패키지 자체입니다. 필요에 따라 기획서, 체크리스트, 참고 문서를 수정하세요.
3. 프로덕션 코드를 작성하거나 수정하지 마세요. 기획 문서만 고치세요.
4. 현재 리뷰 피드백에서 확정된 기획 이슈만 해결하세요.
5. DISMISSED 또는 오탐으로 정리된 항목은 건드리지 마세요.
6. 모호성, 누락, 과도한 범위, 저장소 정합성 문제를 해소하는 최소한의 문서 수정만 하세요.
7. 수정 후에도 기획서, 체크리스트, 참고 문서가 서로 모순되지 않게 유지하세요.
8. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
"""
PLAN_VERIFY_TEMPLATE = """\
You are verifying the latest planning package after plan-only revisions.
## Plan
{plan}
## Checklist
{checklist}
## Reference Documents
{docs}
## Previous Review (iteration {iteration} of {max_iterations})
{feedback}
## Execution Evidence
{execution_evidence}
## Verify Instructions
Review the latest planning package itself: the plan, checklist, and reference documents.
You MAY inspect the current repository to confirm that the documents describe the current reality accurately enough.
Do NOT require production code, scripts, infrastructure, or external environments to already be fixed.
For `plan-review`, PASS means the documents are now clear enough to execute without further document edits.
A known implementation gap, repo mismatch, legacy script problem, external dependency, or environment blocker is NOT a FAIL by itself if:
- the issue is described accurately in the planning package,
- the affected scope or gate is documented clearly,
- the required follow-up action or non-go condition is documented clearly, and
- the package does not misrepresent unresolved work as already complete.
Only mark FAIL when the planning package still needs correction, such as:
- unresolved ambiguity or contradiction in the documents,
- missing prerequisite, dependency, gate, ownership, or evidence rule,
- a known blocker that is still described inaccurately or misleadingly,
- conflicting source-of-truth rules across the planning documents,
- checklist or status criteria that would cause an operator to make the wrong decision.
Report implementation/repository problems that are already documented correctly under "Out of Scope Issues" or note them as documented risks, not as FAIL reasons.
## Output Format
### Remaining Document Issues
- [Major][Omission] Description (reference specific plan/checklist/doc item)
(Write "None" if no document issue remains.)
### Documented Risks / Out of Scope
- Description of a real implementation/repository/environment risk that is already documented correctly
(Write "None" if nothing notable remains.)
### Summary
- Remaining document issues: N
- Documented risks / out-of-scope items: N
- Overall quality: [BRIEF ASSESSMENT]
### Verdict
If the planning package no longer needs document changes, output: VERDICT: PASS
Otherwise output: VERDICT: FAIL
"""
PLAN_VERIFY_TEMPLATE_KO = """\
당신은 plan-only 수정 이후 최신 기획 패키지를 재검증하는 검토자입니다.
## 기획서
{plan}
## 체크리스트
{checklist}
## 참고 문서
{docs}
## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
{feedback}
## 실행 증거
{execution_evidence}
## 검증 지침
최신 기획 패키지 자체를 다시 검토하세요: 기획서, 체크리스트, 참고 문서를 함께 봅니다.
현재 저장소를 살펴보며 문서가 현실을 정확히 설명하는지 확인할 수는 있지만, 프로덕션 코드, 스크립트, 인프라, 외부 환경이 이미 수정되어 있을 것을 요구하면 안 됩니다.
`plan-review`에서 PASS의 뜻은 "이제 문서를 더 고칠 필요 없이 이 계획을 실행할 수 있다"입니다.
즉 구현 공백, 저장소 불일치, legacy 스크립트 문제, 외부 의존성, 환경 blocker가 남아 있어도 아래 조건을 만족하면 FAIL 사유가 아닙니다.
- 그 문제가 기획 패키지에 정확히 기록되어 있고
- 어떤 범위/게이트에 영향을 주는지 분명히 적혀 있고
- 필요한 후속 조치나 non-go 조건이 명확히 적혀 있고
- 아직 해결되지 않은 일을 이미 해결된 것처럼 오해하게 만들지 않는 경우
반대로 아래와 같은 경우에만 FAIL로 판정하세요.
- 문서 안에 아직 모호성이나 모순이 남아 있는 경우
- 선행조건, 의존성, 게이트, 담당 주체, evidence 규칙이 빠진 경우
- 알려진 blocker가 여전히 부정확하거나 오해를 부르는 방식으로 서술된 경우
- 기획 문서들 사이에서 source-of-truth 규칙이 충돌하는 경우
- 체크리스트나 상태 판정 기준 때문에 실행자가 잘못된 결정을 내릴 수 있는 경우
이미 문서에 정확히 기록된 구현/저장소 문제는 "범위 밖 이슈" 또는 "문서화된 리스크"로만 남기고, 그 자체를 FAIL 사유로 삼지 마세요.
## 출력 형식
### 남은 문서 이슈
- [Major][누락] 이슈 설명 (관련 기획서/체크리스트/참고 문서 항목 참조)
(남은 문서 이슈가 없으면 "없음"이라고 작성하세요.)
### 문서화된 리스크 / 범위 밖 이슈
- 실제 구현/저장소/환경 리스크이지만 문서에는 이미 정확히 반영된 항목
(해당 사항이 없으면 "없음"이라고 작성하세요.)
### 요약
- 남은 문서 이슈 수: N
- 문서화된 리스크 / 범위 밖 항목 수: N
- 전체 품질: [간략한 평가]
### 판정
기획 패키지를 더 수정할 필요가 없으면: VERDICT: PASS
그렇지 않으면: VERDICT: FAIL
"""
CODING_PLAN_REVIEW_TEMPLATE = """\
You are reviewing both the implementation and the planning package together.
## Artifact References
{artifact_references}
## Execution Evidence
{execution_evidence}
## Review Instructions
Read the referenced plan/checklist/docs/review artifacts directly from disk. \
Inspect the current repository and evaluate BOTH:
1. whether the implementation matches the plan/checklist/docs, and
2. whether the planning package still accurately describes the implementation target and constraints.
Report only issues that matter to delivering the original plan correctly. \
Do not invent new scope. Distinguish between code issues, document issues, and consistency gaps between them.
For each issue found, classify it with BOTH severity AND category:
- Severity: Critical / Major / Minor
- Category: Over-engineering / Omission
If previous review feedback is provided above, mark each prior item as CONFIRMED or DISMISSED.
If you find issues outside the original plan scope, report them separately under "Out of Scope Issues".
### Verdict
If the implementation satisfies the plan/checklist and the planning package no longer needs correction, output: VERDICT: PASS
Otherwise output: VERDICT: FAIL
"""
CODING_PLAN_REVIEW_TEMPLATE_KO = """\
당신은 구현 결과와 기획 문서 패키지를 함께 검토하는 리뷰어입니다.
## 참조 아티팩트
{artifact_references}
## 실행 증거
{execution_evidence}
## 검토 지침
참조된 plan/checklist/docs/review markdown을 직접 읽고 현재 저장소를 확인한 뒤, 아래 두 가지를 함께 평가하세요.
1. 현재 구현이 plan/checklist/docs와 일치하는가
2. 기획 문서 패키지가 현재 구현 목표와 제약을 여전히 정확하게 설명하는가
원래 계획을 제대로 완수하는 데 필요한 이슈만 보고하세요. 새로운 범위를 만들지 마세요.
코드 이슈, 문서 이슈, 코드-문서 불일치를 구분해서 적으세요.
발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요.
- 심각도: Critical / Major / Minor
- 카테고리: 과최적화 / 누락
이전 리뷰 피드백이 있으면 각 항목을 CONFIRMED 또는 DISMISSED로 판정하세요.
원래 계획 범위 밖 이슈는 "범위 밖 이슈"로 별도 분리하세요.
### 판정
구현이 plan/checklist를 충족하고 기획 문서 패키지도 더 이상 수정할 필요가 없으면: VERDICT: PASS
그렇지 않으면: VERDICT: FAIL
"""
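The review templates above ask the reviewer to finish with a `VERDICT: PASS` or `VERDICT: FAIL` line, and the aggregate template additionally allows `VERDICT: ESCALATE`. A minimal sketch of how a caller might extract that verdict is shown here. It assumes the last verdict line in the output wins and that a missing verdict is treated as `FAIL`. `parse_verdict` is a hypothetical helper for illustration, not a function from this repository.

```python
import re

# Matches e.g. "VERDICT: PASS" anywhere in the reviewer output.
_VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL|ESCALATE)\b")


def parse_verdict(output: str, default: str = "FAIL") -> str:
    """Return the last verdict found in ``output``, or ``default`` if none.

    Assumption: a reviewer may quote earlier verdicts, so the final
    occurrence is taken as the authoritative one.
    """
    matches = _VERDICT_RE.findall(output)
    return matches[-1] if matches else default
```

Defaulting to `FAIL` is a conservative choice: a reviewer that forgets the verdict line cannot accidentally end the loop.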
CODING_PLAN_FIX_TEMPLATE = """\
You are fixing confirmed issues in both the implementation and the planning package.
## Artifact References
{artifact_references}
## Current Review Feedback
{feedback}
## Instructions
1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
2. Fix ONLY the confirmed issues from the current review feedback.
3. You may update both implementation files and planning artifacts when needed.
4. Preserve the original plan intent and scope. Do not silently broaden requirements.
5. Keep code, plan, checklist, and supporting docs consistent after edits.
6. After editing, briefly summarize what you changed and any blocker that still needs human input.
"""
CODING_PLAN_FIX_TEMPLATE_KO = """\
당신은 현재 리뷰에서 확정된 이슈를 코드와 기획 문서 패키지에 함께 반영하는 수정 담당자입니다.
## 참조 아티팩트
{artifact_references}
## 현재 리뷰 피드백
{feedback}
## 지침
1. 참조된 plan/checklist/docs/review markdown을 직접 읽으세요.
2. 현재 리뷰 피드백에서 확정된 이슈만 수정하세요.
3. 필요하면 코드와 기획 문서를 모두 수정할 수 있습니다.
4. 최초 plan의 의도와 범위를 유지하세요. 요구사항을 몰래 넓히지 마세요.
5. 수정 후 코드, plan, checklist, 참고 문서가 서로 모순되지 않게 유지하세요.
6. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
"""
AGGREGATE_REVIEW_TEMPLATE = """\ AGGREGATE_REVIEW_TEMPLATE = """\
You are adjudicating multiple review results and turning them into an actionable decision. You are adjudicating multiple review results and turning them into an actionable decision.
## Artifact References ## Artifact References
{artifact_references} {artifact_references}
## Candidate Artifact Under Review
{candidate_outputs}
## Reviewer Findings Bundle
{reviews_bundle}
## Previous Issue Tracker ## Previous Issue Tracker
{previous_senior_tracker} {previous_senior_tracker}
@@ -486,19 +744,19 @@ You are adjudicating multiple review results and turning them into an actionable
## Instructions ## Instructions
Read the referenced plan/checklist/docs/review artifacts directly from disk. \ Read the referenced plan/checklist/docs/review artifacts directly from disk. \
Explore the project directory and the referenced git commit/diff to confirm the \ Inspect the repository and referenced artifacts only as needed to confirm the \
current codebase state. Use the execution evidence above to verify claims against \ current target state. Use the execution evidence above to verify claims against \
actual command outputs, artifact paths, and exit codes. Then: actual command outputs, artifact paths, and exit codes. Then:
1. Deduplicate overlapping issues across reviewers. 1. Deduplicate overlapping issues across reviewers.
2. Resolve disagreements explicitly. 2. Resolve disagreements explicitly.
3. Keep only issues supported by the plan, checklist, code, or reviewer evidence. 3. Keep only issues supported by the plan, checklist, reference docs, repository state, or reviewer evidence.
4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up. 4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up.
5. Produce a prioritized action list for the coder. 5. Produce a prioritized action list for the implementer/editor.
6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues). 6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues).
7. If no confirmed issue remains, output VERDICT: PASS. 7. If no confirmed issue remains, output VERDICT: PASS.
8. If issues exist that the coder can fix, output VERDICT: FAIL. 8. If issues exist that the implementer/editor can fix, output VERDICT: FAIL.
9. If issues require human intervention (ambiguous requirements, architecture decisions, \ 9. If issues require human intervention (ambiguous requirements, architecture decisions, \
external dependency problems, or the same issue persists after 2+ fix attempts), \ external dependency problems, or the same issue persists after 2+ attempts), \
output VERDICT: ESCALATE. output VERDICT: ESCALATE.
## Output Format ## Output Format
@@ -512,8 +770,8 @@ output VERDICT: ESCALATE.
(Write "None" if nothing was dismissed.) (Write "None" if nothing was dismissed.)
### Action Items ### Action Items
1. Concrete fix the coder should make 1. Concrete fix the implementer/editor should make
2. Concrete fix the coder should make 2. Concrete fix the implementer/editor should make
## Issue Tracker ## Issue Tracker
@@ -536,6 +794,12 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
## 참조 아티팩트 ## 참조 아티팩트
{artifact_references} {artifact_references}
## 현재 검토 대상
{candidate_outputs}
## 리뷰 결과 묶음
{reviews_bundle}
## 이전 이슈 트래커 ## 이전 이슈 트래커
{previous_senior_tracker} {previous_senior_tracker}
@@ -543,17 +807,17 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
{execution_evidence} {execution_evidence}
## 지침 ## 지침
참조된 plan/checklist/docs/review markdown와 git 상태를 직접 읽어 현재 코드베이스 상태를 확인한 뒤, \ 참조된 plan/checklist/docs/review markdown와 저장소 상태를 직접 읽어 현재 검토 대상의 상태를 확인한 뒤, \
위 실행 증거를 활용하여 에이전트의 주장을 실제 명령어 출력, 아티팩트 경로, 종료 코드로 검증하세요. \ 위 실행 증거를 활용하여 에이전트의 주장을 실제 명령어 출력, 아티팩트 경로, 종료 코드로 검증하세요. \
그런 다음 아래를 수행하세요. 그런 다음 아래를 수행하세요.
1. 리뷰어들 사이에 중복되는 이슈를 합치세요. 1. 리뷰어들 사이에 중복되는 이슈를 합치세요.
2. 의견 충돌은 명시적으로 정리하세요. 2. 의견 충돌은 명시적으로 정리하세요.
3. 기획서, 체크리스트, 코드, 리뷰 근거로 뒷받침되는 이슈만 남기세요. 3. 기획서, 체크리스트, 참고 문서, 저장소 상태, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요. 4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요.
5. coder가 바로 수정할 수 있는 우선순위 액션 아이템을 만드세요. 5. 수정 담당자가 바로 처리할 수 있는 우선순위 액션 아이템을 만드세요.
6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월). 6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월).
7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요. 7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요.
8. coder가 수정 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요. 8. 수정 담당자가 해결 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \ 9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \
동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요. 동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요.
@@ -568,8 +832,8 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
(기각된 항목이 없으면 "없음"이라고 작성하세요.) (기각된 항목이 없으면 "없음"이라고 작성하세요.)
### 액션 아이템 ### 액션 아이템
1. coder가 수정해야 할 구체적인 작업 1. 수정 담당자가 처리해야 할 구체적인 작업
2. coder가 수정해야 할 구체적인 작업 2. 수정 담당자가 처리해야 할 구체적인 작업
## 이슈 트래커 ## 이슈 트래커
@@ -592,6 +856,10 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
"coding": CODING_TEMPLATE, "coding": CODING_TEMPLATE,
"review": REVIEW_TEMPLATE, "review": REVIEW_TEMPLATE,
"plan-review": PLAN_REVIEW_TEMPLATE, "plan-review": PLAN_REVIEW_TEMPLATE,
"plan-fix": PLAN_FIX_TEMPLATE,
"plan-verify": PLAN_VERIFY_TEMPLATE,
"coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE,
"coding-plan-fix": CODING_PLAN_FIX_TEMPLATE,
"review-only": REVIEW_ONLY_TEMPLATE, "review-only": REVIEW_ONLY_TEMPLATE,
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE, "aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
}, },
@@ -599,6 +867,10 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
"coding": CODING_TEMPLATE_KO, "coding": CODING_TEMPLATE_KO,
"review": REVIEW_TEMPLATE_KO, "review": REVIEW_TEMPLATE_KO,
"plan-review": PLAN_REVIEW_TEMPLATE_KO, "plan-review": PLAN_REVIEW_TEMPLATE_KO,
"plan-fix": PLAN_FIX_TEMPLATE_KO,
"plan-verify": PLAN_VERIFY_TEMPLATE_KO,
"coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE_KO,
"coding-plan-fix": CODING_PLAN_FIX_TEMPLATE_KO,
"review-only": REVIEW_ONLY_TEMPLATE_KO, "review-only": REVIEW_ONLY_TEMPLATE_KO,
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO, "aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
}, },
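`DEFAULT_TEMPLATES` is keyed first by language and then by template name, and the templates carry `{placeholder}` fields such as `{feedback}`. The sketch below shows one way such a table could be resolved and filled. It assumes rendering goes through `str.format` and that unknown languages fall back to English; neither is confirmed by this diff. `TEMPLATES` and `render_template` are hypothetical stand-ins.

```python
# Tiny stand-in for the real template table (contents are illustrative only).
TEMPLATES: dict[str, dict[str, str]] = {
    "en": {"plan-fix": "## Current Review Feedback\n{feedback}\n"},
    "ko": {"plan-fix": "## 현재 리뷰 피드백\n{feedback}\n"},
}


def render_template(name: str, language: str, context: dict[str, str]) -> str:
    """Look up ``name`` for ``language`` and fill its placeholders.

    Assumption: an unknown language, or a template missing for that
    language, falls back to the English entry.
    """
    by_name = TEMPLATES.get(language, TEMPLATES["en"])
    template = by_name.get(name) or TEMPLATES["en"][name]
    return template.format(**context)
```

With this shape, a step's `context_override` (for example `{"feedback": "{aggregate_review}"}` in the presets) would only need to supply the values that end up in `context`.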
@@ -843,56 +1115,75 @@ def _build_review_only_preset(
def _build_plan_review_preset( def _build_plan_review_preset(
coders: list[str], reviewers: list[str], seniors: list[str], coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[StepConfig]: ) -> list[StepConfig]:
"""Plan-review: reviewers audit planning docs before implementation.""" """Plan-review: review planning docs, revise them, then verify in a loop."""
if not coders:
raise ValueError("'plan-review' preset requires at least 1 coder")
if not reviewers: if not reviewers:
raise ValueError("'plan-review' preset requires at least 1 reviewer") raise ValueError("'plan-review' preset requires at least 1 reviewer")
if len(reviewers) == 1 and not seniors: review_steps: list[StepConfig] = []
return [ if len(reviewers) == 1:
review_steps.append(
StepConfig( StepConfig(
name="plan_review", name="plan_review",
agent=reviewers[0], agent=reviewers[0],
role="review", role="review",
prompt_template="default:plan-review", prompt_template="default:plan-review",
output_key="plan_review_result", output_key="plan_review_result",
verdict=True,
), ),
] )
review_step_names = ["plan_review"]
review_output_keys = ["plan_review_result"]
else:
reviewer_keys = _unique_safe_keys(reviewers)
for reviewer, rk in zip(reviewers, reviewer_keys):
review_steps.append(
StepConfig(
name=f"plan_review_{rk}",
agent=reviewer,
role="review",
prompt_template="default:plan-review",
output_key=f"plan_review_{rk}",
parallel=True,
),
)
review_step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
review_output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
steps: list[StepConfig] = [] fix_coder = coders[0]
reviewer_keys = _unique_safe_keys(reviewers) senior_agent = seniors[0] if seniors else reviewers[0]
for reviewer, rk in zip(reviewers, reviewer_keys):
steps.append( return review_steps + [
StepConfig( StepConfig(
name=f"plan_review_{rk}", name="aggregate_review",
agent=reviewer, agent=senior_agent,
role="review", role="review",
prompt_template="default:plan-review", prompt_template="default:aggregate-review",
output_key=f"plan_review_{rk}", output_key="aggregate_review",
verdict=not seniors, context_override={
parallel=True, "candidate_outputs": "Current planning package under review (plan/checklist/reference docs).",
), "reviews_bundle": _build_named_bundle(
) reviewers, review_step_names, review_output_keys, "Review",
if seniors: ),
step_names = [f"plan_review_{rk}" for rk in reviewer_keys] },
output_keys = [f"plan_review_{rk}" for rk in reviewer_keys] ),
steps.append( StepConfig(
StepConfig( name="plan_fix",
name="senior_review", agent=fix_coder,
agent=seniors[0], role="coding",
role="review", prompt_template="default:plan-fix",
prompt_template="default:aggregate-review", output_key="plan_fix_output",
output_key="senior_review_result", context_override={"feedback": "{aggregate_review}"},
verdict=True, ),
context_override={ StepConfig(
"candidate_outputs": "Planning documents under review (plan/checklist/reference docs).", name="verify",
"reviews_bundle": _build_named_bundle( agent=senior_agent,
reviewers, step_names, output_keys, "Review", role="review",
), prompt_template="default:plan-verify",
}, output_key="verify_result",
), verdict=True,
) ),
return steps ]
def _build_review_fix_preset( def _build_review_fix_preset(
@@ -992,16 +1283,97 @@ def _build_coding_review_fix_preset(
] ]
def _build_coding_plan_review_preset(
coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[PhaseConfig]:
"""Implement from plan/docs, then review and fix code+docs together."""
if not coders:
raise ValueError("'coding-plan-review' preset requires at least 1 coder")
if not reviewers:
raise ValueError("'coding-plan-review' preset requires at least 1 reviewer")
review_steps: list[StepConfig] = []
reviewer_keys = _unique_safe_keys(reviewers)
for reviewer, rk in zip(reviewers, reviewer_keys):
review_steps.append(
StepConfig(
name=f"review_{rk}",
agent=reviewer,
role="review",
prompt_template="default:coding-plan-review",
output_key=f"review_{rk}",
verdict=False,
parallel=True,
),
)
senior_agent = seniors[0] if seniors else reviewers[0]
review_step_names = [f"review_{rk}" for rk in reviewer_keys]
review_output_keys = [f"review_{rk}" for rk in reviewer_keys]
return [
PhaseConfig(
name="initial_coding",
steps=[
StepConfig(
name="coding",
agent=coders[0],
role="coding",
prompt_template="default:coding",
output_key="coding_output",
),
],
max_iterations=1,
consecutive_pass=1,
),
PhaseConfig(
name="coding_plan_review",
steps=review_steps + [
StepConfig(
name="aggregate_review",
agent=senior_agent,
role="review",
prompt_template="default:aggregate-review",
output_key="aggregate_review",
context_override={
"candidate_outputs": (
"Current implementation and planning package under review "
"(code + plan/checklist/reference docs)."
),
"reviews_bundle": _build_named_bundle(
reviewers, review_step_names, review_output_keys, "Review",
),
},
),
StepConfig(
name="coding_plan_fix",
agent=coders[0],
role="coding",
prompt_template="default:coding-plan-fix",
output_key="coding_plan_fix_output",
context_override={"feedback": "{aggregate_review}"},
),
StepConfig(
name="verify",
agent=senior_agent,
role="review",
prompt_template="default:coding-plan-review",
output_key="verify_result",
verdict=True,
),
],
max_iterations=5,
consecutive_pass=1,
),
]
PIPELINE_PRESETS: dict[str, Callable] = { PIPELINE_PRESETS: dict[str, Callable] = {
"simple": _build_simple_preset,
"cross-review": _build_cross_review_preset,
"plan-review": _build_plan_review_preset, "plan-review": _build_plan_review_preset,
"review-only": _build_review_only_preset,
} }
PHASED_PRESETS: dict[str, Callable] = { PHASED_PRESETS: dict[str, Callable] = {
"review-fix": _build_review_fix_preset, "coding-plan-review": _build_coding_plan_review_preset,
"coding-review-fix": _build_coding_review_fix_preset,
} }
ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys()) ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
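The reworked `plan-review` preset lays out its steps as review, aggregate, fix, then verify, with only the final `verify` step carrying a verdict. The following sketch reproduces just that layout with a simplified `Step` dataclass standing in for `StepConfig`. The key derivation (`-` replaced by `_`) is an assumption that ignores duplicate reviewers, which the real `_unique_safe_keys` helper handles.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Simplified stand-in for StepConfig (only the fields used here)."""
    name: str
    agent: str
    verdict: bool = False


def plan_review_steps(
    coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[Step]:
    if not coders or not reviewers:
        raise ValueError("need at least 1 coder and 1 reviewer")
    if len(reviewers) == 1:
        review = [Step("plan_review", reviewers[0])]
    else:
        # Assumption: reviewer names are unique, so a plain replace is enough.
        review = [Step(f"plan_review_{r.replace('-', '_')}", r) for r in reviewers]
    # Without an explicit senior, the first reviewer aggregates and verifies.
    senior = seniors[0] if seniors else reviewers[0]
    return review + [
        Step("aggregate_review", senior),
        Step("plan_fix", coders[0]),
        Step("verify", senior, verdict=True),
    ]
```

Keeping the verdict on `verify` alone means the loop's pass/fail decision is made after the fix has been applied, not on the initial review.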


@@ -37,18 +37,31 @@ def make_worktree_dir(base_cwd: Path, branch_name: str) -> Path:
) )
def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path: def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> tuple[Path, str]:
"""Create a git worktree on a new branch from HEAD. """Create a git worktree on a new branch from HEAD.
1. Create branch from HEAD 1. Create branch from HEAD
2. Create worktree checked out to that branch 2. Create worktree checked out to that branch
The branch lives in the original repo, so it survives worktree removal. The branch lives in the original repo, so it survives worktree removal.
Returns (worktree_path, base_commit_sha).
""" """
work_dir = work_dir.resolve() work_dir = work_dir.resolve()
if work_dir.exists(): if work_dir.exists():
shutil.rmtree(work_dir) shutil.rmtree(work_dir)
# Record the base commit SHA before creating the branch.
# This is the anchor for all diffs — even if the agent makes its own commits,
# we always diff against this base to capture the full set of changes.
result = subprocess.run(
["git", "rev-parse", "HEAD"],
cwd=base_cwd,
capture_output=True,
text=True,
check=True,
)
base_commit = result.stdout.strip()
# Create the branch at HEAD # Create the branch at HEAD
try: try:
subprocess.run( subprocess.run(
@@ -83,15 +96,23 @@ def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
f"Failed to create worktree at {work_dir}: {e.stderr.strip()}" f"Failed to create worktree at {work_dir}: {e.stderr.strip()}"
) from e ) from e
logger.debug("Created worktree on branch '%s': %s", branch_name, work_dir) logger.debug("Created worktree on branch '%s': %s (base: %s)", branch_name, work_dir, base_commit[:8])
return work_dir return work_dir, base_commit
def capture_diff(worktree_path: Path) -> str: def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
"""Capture all changes made in the worktree as a unified diff. """Capture all changes made in the worktree since ``base_commit``.
Includes both tracked modifications and new untracked files. Handles two scenarios:
1. Agent left changes uncommitted → ``git add -A``, snapshot commit, then ``git diff base HEAD``
2. Agent committed its own changes → HEAD advanced, diff base..HEAD captures them
Args:
base_commit: The diff anchor — typically the worktree HEAD *before* this
iteration started (set by ``get_current_head`` after each
``_commit_iteration``). Falls back to ``HEAD~1`` if not given.
""" """
# Stage any uncommitted changes
subprocess.run( subprocess.run(
["git", "add", "-A"], ["git", "add", "-A"],
cwd=worktree_path, cwd=worktree_path,
@@ -99,12 +120,34 @@ def capture_diff(worktree_path: Path) -> str:
check=True, check=True,
) )
result = subprocess.run( # Commit staged changes so everything is reachable via HEAD
["git", "diff", "--cached", "HEAD"], # (this is a no-op if nothing is staged)
subprocess.run(
["git", "commit", "-m", "cross-eval: capture-diff snapshot", "--allow-empty-message"],
cwd=worktree_path, cwd=worktree_path,
capture_output=True, capture_output=True,
text=True, text=True,
) )
ref = base_commit or "HEAD~1"
result = subprocess.run(
["git", "diff", ref, "HEAD"],
cwd=worktree_path,
capture_output=True,
text=True,
)
return result.stdout.strip()
def get_current_head(worktree_path: Path) -> str:
"""Return the current HEAD SHA of the worktree."""
result = subprocess.run(
["git", "rev-parse", "HEAD"],
cwd=worktree_path,
capture_output=True,
text=True,
check=True,
)
return result.stdout.strip() return result.stdout.strip()
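The reason for advancing the diff anchor after every iteration can be shown without git at all. The sketch below models commits as `{path: content}` snapshots and uses `difflib` as a stand-in for `git diff A B`; `diff_snapshots` and the sample snapshots are hypothetical and exist only for this illustration.

```python
import difflib


def diff_snapshots(old: dict[str, str], new: dict[str, str]) -> str:
    """Unified diff between two {path: content} snapshots."""
    chunks: list[str] = []
    for path in sorted(set(old) | set(new)):
        before = old.get(path, "").splitlines(keepends=True)
        after = new.get(path, "").splitlines(keepends=True)
        chunks.extend(difflib.unified_diff(before, after, f"a/{path}", f"b/{path}"))
    return "".join(chunks)


# The base commit, then HEAD after iteration 1 and after iteration 2.
base = {"app.py": "x = 1\n"}
iter1 = {"app.py": "x = 1\ny = 2\n"}
iter2 = {"app.py": "x = 1\ny = 2\nz = 3\n"}

anchor = base
incremental: list[str] = []
for head in (iter1, iter2):
    incremental.append(diff_snapshots(anchor, head))
    anchor = head  # advance the anchor, as done with the new HEAD SHA

# Always diffing against the base repeats iteration 1's change in iteration 2.
cumulative = diff_snapshots(base, iter2)
```

Here `incremental[1]` contains only the `z = 3` addition, while `cumulative` still shows `y = 2` as new, which is the repeated-feedback situation the incremental anchor avoids.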

plan.md Normal file

@@ -0,0 +1,47 @@
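# cross-eval CLI usability refactoring
## Goal
Refactor the CLI experience of `cross-eval` so that users can quickly understand what each option means and easily choose the option combination that fits their purpose.
## Background
`cross-eval` currently provides several main commands, including `init`, `run`, `demo`, and `doctor`, along with a wide range of options, but first-time users cannot easily tell at a glance which options to use in which situation. In particular, the relationship between `run` presets, agent combinations, config-based execution, and direct option-based execution can feel complicated.
## Requirements
1. Refactor the CLI help or onboarding text so that even novice users can quickly understand the main flows.
2. Users must be able to easily find the right option combination for each representative usage scenario.
3. The roles of the main `run` options (preset, coder/reviewer/senior, config, output-related) must be made clearer.
4. After `init`, the guidance must lead naturally to what the user should do next.
5. Existing functionality must be preserved; the focus is on improving the explanation structure and usage flow rather than changing the behavior itself.
## User scenarios
1. A user who has just installed the tool wants to know what to do after `cross-eval init`.
2. A user about to execute `run` wants to quickly compare the differences between `--preset` values.
3. A user wants to understand, with examples, in which situations to use combinations of `claude`, `codex`, and `senior`.
4. A user wants to decide between config-based execution and CLI option-based execution.
5. A user wants to know where run results are stored and how to inspect them.
## Constraints
- Keep the existing CLI command names and core option names.
- Do not modify the existing pipeline behavior logic unnecessarily.
- Focus on improving the guidance structure, help text, examples, and explanation flow rather than adding features.
- Keep the documentation easy to understand for Korean-speaking users, without breaking the project's existing tone and structure.
## Scope
### Included
- Improve `argparse` help/description/epilog text
- Improve the next-step guidance shown after `init`
- Organize `run` usage examples and add representative combination examples
- Restructure the explanation of preset/agent/config/output concepts
- Tidy parts of the README or onboarding text where needed
### Excluded
- Adding new presets
- Adding new CLI options
- Changing the pipeline execution algorithm
- Changing how agents are invoked
## Success criteria
1. The basic usage flow is clear from reading `--help` alone.
2. Users can copy and use the run examples for each representative scenario right away.
3. The `init → write → doctor → run → check output` flow connects naturally.
4. Option descriptions are not merely long, but are structured so that they help with actual selection decisions.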
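
A possible shape for the `argparse` help/description/epilog improvement is sketched below. The command names (`init`, `doctor`, `run`) and the `--preset` option come from this plan; the description wording, the epilog text, and the one-line comments about what each command does are assumptions made for illustration, not the project's actual help output.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="cross-eval",
        description="Run a coder/reviewer agent pipeline against a plan and checklist.",
        epilog=(
            "Typical flow:\n"
            "  cross-eval init      # set up the project (wording is illustrative)\n"
            "  cross-eval doctor    # check before running (wording is illustrative)\n"
            "  cross-eval run --preset plan-review\n"
        ),
        # Keep the epilog's line breaks instead of letting argparse re-wrap them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    sub = parser.add_subparsers(dest="command")
    run = sub.add_parser("run", help="run a pipeline")
    run.add_argument("--preset", help="pipeline preset name, e.g. plan-review")
    return parser
```

Using `RawDescriptionHelpFormatter` keeps copy-ready example commands on their own lines, which supports the success criterion that users can copy run examples directly from `--help`.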


@@ -76,10 +76,12 @@ class TestCreateWorktree(unittest.TestCase):
wt_dir = Path(td) / "wt" wt_dir = Path(td) / "wt"
branch = "cross-eval/test_branch" branch = "cross-eval/test_branch"
result_path = create_worktree(base, wt_dir, branch) result_path, base_commit = create_worktree(base, wt_dir, branch)
# Worktree directory exists # Worktree directory exists
self.assertTrue(result_path.exists()) self.assertTrue(result_path.exists())
# Base commit SHA was captured
self.assertEqual(len(base_commit), 40)
# Branch was created in the original repo # Branch was created in the original repo
branches = subprocess.run( branches = subprocess.run(
["git", "branch", "--list", branch], ["git", "branch", "--list", branch],
@@ -102,7 +104,7 @@ class TestCaptureDiff(unittest.TestCase):
wt_dir = Path(td) / "wt" wt_dir = Path(td) / "wt"
branch = "cross-eval/diff_test" branch = "cross-eval/diff_test"
create_worktree(base, wt_dir, branch) create_worktree(base, wt_dir, branch) # ignore return tuple
# Make changes in the worktree # Make changes in the worktree
(wt_dir / "new_file.txt").write_text("hello\n") (wt_dir / "new_file.txt").write_text("hello\n")
@@ -488,6 +490,8 @@ class TestMakeAgenticCodex(unittest.TestCase):
def _make_agentic_config( def _make_agentic_config(
run_dir: Path, run_dir: Path,
agentic_coder: bool = True, agentic_coder: bool = True,
*,
use_worktree: bool = False,
) -> PipelineConfig: ) -> PipelineConfig:
"""Build a config with an agentic coder + non-agentic reviewer.""" """Build a config with an agentic coder + non-agentic reviewer."""
coder = AgentConfig( coder = AgentConfig(
@@ -519,6 +523,7 @@ def _make_agentic_config(
] ]
return PipelineConfig( return PipelineConfig(
output_dir=run_dir, output_dir=run_dir,
use_worktree=use_worktree,
max_iterations=2, max_iterations=2,
min_iterations=1, min_iterations=1,
language="en", language="en",
@@ -549,11 +554,11 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
) -> None: ) -> None:
with tempfile.TemporaryDirectory() as td: with tempfile.TemporaryDirectory() as td:
run_dir = Path(td) run_dir = Path(td)
config = _make_agentic_config(run_dir) config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test") mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
mock_invoke_agentic.return_value = AgentResult( mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0, output="diff output", exit_code=0,
@@ -571,6 +576,44 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
mock_setup.assert_called_once() mock_setup.assert_called_once()
class TestDirectAgenticMode(unittest.TestCase):
"""Agentic coders run in the current working tree by default."""
@patch("cross_eval.pipeline._setup_worktree")
@patch("cross_eval.pipeline.invoke_agent_agentic")
@patch("cross_eval.pipeline.invoke_agent")
def test_agentic_uses_current_worktree_by_default(
self,
mock_invoke: MagicMock,
mock_invoke_agentic: MagicMock,
mock_setup: MagicMock,
) -> None:
with tempfile.TemporaryDirectory() as td:
repo = Path(td)
_init_git_repo(repo)
run_dir = repo / ".cross-eval" / "output"
run_dir.mkdir(parents=True, exist_ok=True)
config = _make_agentic_config(run_dir)
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
agent_name="claude-coder", step_name="coding",
duration_seconds=0.1,
)
mock_invoke.return_value = AgentResult(
output="VERDICT: PASS", exit_code=0,
agent_name="claude-reviewer", step_name="review",
duration_seconds=0.1,
)
run_pipeline(config, cwd=repo)
mock_setup.assert_not_called()
self.assertEqual(mock_invoke_agentic.call_args.kwargs["worktree_path"], repo)
reviewer_call = mock_invoke.call_args
self.assertEqual(reviewer_call.kwargs["cwd"], repo)
class TestSetupWorktreeLocation(unittest.TestCase): class TestSetupWorktreeLocation(unittest.TestCase):
"""_setup_worktree places agentic worktrees outside the base repo.""" """_setup_worktree places agentic worktrees outside the base repo."""
@@ -582,7 +625,7 @@ class TestSetupWorktreeLocation(unittest.TestCase):
run_dir.mkdir(parents=True) run_dir.mkdir(parents=True)
_init_git_repo(base) _init_git_repo(base)
worktree_path, branch_name = _setup_worktree(base, run_dir, "review-fix") worktree_path, branch_name, _base_commit = _setup_worktree(base, run_dir, "review-fix")
try: try:
self.assertTrue(worktree_path.exists()) self.assertTrue(worktree_path.exists())
self.assertNotIn(str(base.resolve()), str(worktree_path.resolve())) self.assertNotIn(str(base.resolve()), str(worktree_path.resolve()))
@@ -616,11 +659,11 @@ class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
) -> None: ) -> None:
with tempfile.TemporaryDirectory() as td: with tempfile.TemporaryDirectory() as td:
run_dir = Path(td) run_dir = Path(td)
config = _make_agentic_config(run_dir) config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test") mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
mock_invoke_agentic.return_value = AgentResult( mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0, output="diff output", exit_code=0,
@@ -658,11 +701,11 @@ class TestCommitIterationCalled(unittest.TestCase):
) -> None: ) -> None:
with tempfile.TemporaryDirectory() as td: with tempfile.TemporaryDirectory() as td:
run_dir = Path(td) run_dir = Path(td)
config = _make_agentic_config(run_dir) config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test") mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
mock_invoke_agentic.return_value = AgentResult( mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0, output="diff output", exit_code=0,
@@ -700,11 +743,11 @@ class TestFinalizeWorktreeCalled(unittest.TestCase):
) -> None: ) -> None:
with tempfile.TemporaryDirectory() as td: with tempfile.TemporaryDirectory() as td:
run_dir = Path(td) run_dir = Path(td)
config = _make_agentic_config(run_dir) config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test") mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
mock_invoke_agentic.return_value = AgentResult( mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0, output="diff output", exit_code=0,
@@ -822,7 +865,7 @@ class TestParallelAgenticFallsBackToSequential(unittest.TestCase):
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test") mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
call_order: list[str] = [] call_order: list[str] = []

View File

@@ -42,6 +42,8 @@ from cross_eval.prompts import (
REVIEW_TEMPLATE_KO, REVIEW_TEMPLATE_KO,
PLAN_REVIEW_TEMPLATE, PLAN_REVIEW_TEMPLATE,
PLAN_REVIEW_TEMPLATE_KO, PLAN_REVIEW_TEMPLATE_KO,
PLAN_FIX_TEMPLATE,
PLAN_FIX_TEMPLATE_KO,
REVIEW_ONLY_TEMPLATE, REVIEW_ONLY_TEMPLATE,
REVIEW_ONLY_TEMPLATE_KO, REVIEW_ONLY_TEMPLATE_KO,
AGGREGATE_REVIEW_TEMPLATE, AGGREGATE_REVIEW_TEMPLATE,
@@ -310,26 +312,10 @@ class BuiltinAgentConfigTest(unittest.TestCase):
self.assertIn("Repeated Aggregate Findings", report) self.assertIn("Repeated Aggregate Findings", report)
self.assertIn("same as iteration 3", report) self.assertIn("same as iteration 3", report)
def test_review_fix_defaults_senior_from_reviewer_family(self) -> None: def test_fix_and_plan_presets_default_senior_from_reviewer_family(self) -> None:
self.assertEqual( self.assertEqual(
_default_seniors_for_preset( _default_seniors_for_preset(
"preset:review-fix", "preset:plan-review",
["codex-reviewer", "claude-reviewer"],
BUILTIN_AGENTS,
),
["codex-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:review-fix",
["claude-reviewer"],
BUILTIN_AGENTS,
),
["claude-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:coding-review-fix",
["codex-reviewer"], ["codex-reviewer"],
BUILTIN_AGENTS, BUILTIN_AGENTS,
), ),
@@ -337,7 +323,31 @@ class BuiltinAgentConfigTest(unittest.TestCase):
) )
self.assertEqual( self.assertEqual(
_default_seniors_for_preset( _default_seniors_for_preset(
"preset:simple", "preset:plan-review",
["claude-reviewer"],
BUILTIN_AGENTS,
),
["claude-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:coding-plan-review",
["codex-reviewer", "claude-reviewer"],
BUILTIN_AGENTS,
),
["codex-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:coding-plan-review",
["claude-reviewer"],
BUILTIN_AGENTS,
),
["claude-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:unknown",
["codex-reviewer"], ["codex-reviewer"],
BUILTIN_AGENTS, BUILTIN_AGENTS,
), ),
@@ -421,23 +431,49 @@ class BuiltinAgentConfigTest(unittest.TestCase):
) )
self.assertEqual( self.assertEqual(
[step.output_key for step in steps], [step.output_key for step in steps[:2]],
["plan_review_codex_reviewer", "plan_review_codex_reviewer_2"], ["plan_review_codex_reviewer", "plan_review_codex_reviewer_2"],
) )
def test_plan_review_with_senior_adds_aggregate_step(self) -> None: def test_plan_review_builds_review_fix_verify_loop(self) -> None:
steps = _build_plan_review_preset( steps = _build_plan_review_preset(
["codex-coder"], ["codex-coder"],
["claude-reviewer", "codex-reviewer"], ["claude-reviewer", "codex-reviewer"],
["claude-senior"], ["claude-senior"],
) )
self.assertEqual(steps[-1].name, "senior_review") self.assertEqual(
self.assertEqual(steps[-1].agent, "claude-senior") [step.name for step in steps],
self.assertTrue(steps[-1].verdict) [
"plan_review_claude_reviewer",
"plan_review_codex_reviewer",
"aggregate_review",
"plan_fix",
"verify",
],
)
self.assertEqual(steps[2].agent, "claude-senior")
self.assertEqual(steps[3].agent, "codex-coder")
self.assertEqual(steps[4].agent, "claude-senior")
self.assertTrue(steps[4].verdict)
self.assertFalse(steps[0].verdict) self.assertFalse(steps[0].verdict)
self.assertFalse(steps[1].verdict) self.assertFalse(steps[1].verdict)
def test_plan_review_single_reviewer_uses_default_loop_steps(self) -> None:
steps = _build_plan_review_preset(
["codex-coder"],
["codex-reviewer"],
[],
)
self.assertEqual(
[step.name for step in steps],
["plan_review", "aggregate_review", "plan_fix", "verify"],
)
self.assertEqual(steps[1].agent, "codex-reviewer")
self.assertEqual(steps[2].prompt_template, "default:plan-fix")
self.assertTrue(steps[3].verdict)
def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None: def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None:
steps = _build_cross_review_preset( steps = _build_cross_review_preset(
["codex-coder", "codex-coder"], ["codex-coder", "codex-coder"],
@@ -576,6 +612,8 @@ class PromptTemplateTest(unittest.TestCase):
"""Coding templates should tell coder to ignore DISMISSED items.""" """Coding templates should tell coder to ignore DISMISSED items."""
self.assertIn("DISMISSED", CODING_TEMPLATE) self.assertIn("DISMISSED", CODING_TEMPLATE)
self.assertIn("DISMISSED", CODING_TEMPLATE_KO) self.assertIn("DISMISSED", CODING_TEMPLATE_KO)
self.assertIn("DISMISSED", PLAN_FIX_TEMPLATE)
self.assertIn("DISMISSED", PLAN_FIX_TEMPLATE_KO)
def test_aggregate_templates_dismissed_structure(self) -> None: def test_aggregate_templates_dismissed_structure(self) -> None:
"""Aggregate templates should use [False positive] / [Already fixed] tags.""" """Aggregate templates should use [False positive] / [Already fixed] tags."""
@@ -583,6 +621,10 @@ class PromptTemplateTest(unittest.TestCase):
self.assertIn("[Already fixed]", AGGREGATE_REVIEW_TEMPLATE) self.assertIn("[Already fixed]", AGGREGATE_REVIEW_TEMPLATE)
self.assertIn("[오탐]", AGGREGATE_REVIEW_TEMPLATE_KO) self.assertIn("[오탐]", AGGREGATE_REVIEW_TEMPLATE_KO)
self.assertIn("[수정 완료]", AGGREGATE_REVIEW_TEMPLATE_KO) self.assertIn("[수정 완료]", AGGREGATE_REVIEW_TEMPLATE_KO)
self.assertIn("{candidate_outputs}", AGGREGATE_REVIEW_TEMPLATE)
self.assertIn("{reviews_bundle}", AGGREGATE_REVIEW_TEMPLATE)
self.assertIn("{candidate_outputs}", AGGREGATE_REVIEW_TEMPLATE_KO)
self.assertIn("{reviews_bundle}", AGGREGATE_REVIEW_TEMPLATE_KO)
class ReviewMetricsParsingTest(unittest.TestCase): class ReviewMetricsParsingTest(unittest.TestCase):
@@ -969,7 +1011,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
                 " checklist: checklist.md\n"
                 "coders: [claude-coder]\n"
                 "reviewers: [claude-reviewer]\n"
-                "pipeline: preset:review-fix\n"
+                "pipeline: preset:coding-plan-review\n"
                 f"max_iterations: {max_iterations}\n"
                 "language: en\n"
             ),
@@ -981,8 +1023,9 @@ class FixPresetBehaviorTest(unittest.TestCase):
         with tempfile.TemporaryDirectory() as tmpdir:
             config = load_config(self._write_fix_config(Path(tmpdir), max_iterations=7))
-            self.assertEqual(config.preset_name, "review-fix")
-            self.assertEqual(config.phases[0].max_iterations, 7)
+            self.assertEqual(config.preset_name, "coding-plan-review")
+            self.assertEqual(config.phases[0].max_iterations, 1)
+            self.assertEqual(config.phases[1].max_iterations, 7)
             self.assertTrue(config.agents["claude-coder"].agentic)
             self.assertNotIn("-p", config.agents["claude-coder"].args)
@@ -992,7 +1035,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
         captured: dict[str, object] = {}
         def _fake_run_pipeline(config, **kwargs):
-            captured["phase_max"] = config.phases[0].max_iterations
+            captured["phase_max"] = config.phases[1].max_iterations
             captured["agentic"] = config.agents[config.coders[0]].agentic
             return PipelineResult(
                 iterations=[],
@@ -1012,13 +1055,13 @@ class FixPresetBehaviorTest(unittest.TestCase):
         self.assertEqual(captured["phase_max"], 9)
         self.assertTrue(captured["agentic"])
-    def test_run_preset_review_fix_auto_enables_agentic_without_flag(self) -> None:
+    def test_run_preset_coding_plan_review_auto_enables_agentic_without_flag(self) -> None:
         captured: dict[str, object] = {}
         def _fake_run_pipeline(config, **kwargs):
             captured["preset"] = config.preset_name
             captured["agentic"] = config.agents[config.coders[0]].agentic
-            captured["phase_max"] = config.phases[0].max_iterations
+            captured["phase_max"] = config.phases[1].max_iterations
             return PipelineResult(
                 iterations=[],
                 final_verdict="PASS",
@@ -1026,13 +1069,73 @@ class FixPresetBehaviorTest(unittest.TestCase):
             )
         with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
-            exit_code = main(["run", "--preset", "review-fix", "--dry-run"])
+            exit_code = main(["run", "--preset", "coding-plan-review", "--dry-run"])
         self.assertEqual(exit_code, 0)
-        self.assertEqual(captured["preset"], "review-fix")
+        self.assertEqual(captured["preset"], "coding-plan-review")
         self.assertTrue(captured["agentic"])
         self.assertEqual(captured["phase_max"], 3)
+    def test_run_preset_plan_review_auto_enables_agentic_without_flag(self) -> None:
+        captured: dict[str, object] = {}
+        def _fake_run_pipeline(config, **kwargs):
+            captured["preset"] = config.preset_name
+            captured["agentic"] = config.agents[config.coders[0]].agentic
+            captured["use_worktree"] = config.use_worktree
+            captured["seniors"] = list(config.seniors)
+            captured["steps"] = [step.name for step in config.pipeline]
+            captured["max_iter"] = config.max_iterations
+            return PipelineResult(
+                iterations=[],
+                final_verdict="PASS",
+                run_dir=Path(".cross-eval/output"),
+            )
+        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
+            exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
+        self.assertEqual(exit_code, 0)
+        self.assertEqual(captured["preset"], "plan-review")
+        self.assertTrue(captured["agentic"])
+        self.assertFalse(captured["use_worktree"])
+        self.assertEqual(captured["seniors"], ["claude-senior"])
+        self.assertEqual(
+            captured["steps"],
+            ["plan_review", "aggregate_review", "plan_fix", "verify"],
+        )
+        self.assertEqual(captured["max_iter"], 3)
+    def test_run_worktree_flag_enables_isolated_worktree_mode(self) -> None:
+        captured: dict[str, object] = {}
+        def _fake_run_pipeline(config, **kwargs):
+            captured["use_worktree"] = config.use_worktree
+            return PipelineResult(
+                iterations=[],
+                final_verdict="PASS",
+                run_dir=Path(".cross-eval/output"),
+            )
+        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
+            exit_code = main(["run", "--preset", "plan-review", "--dry-run", "--worktree"])
+        self.assertEqual(exit_code, 0)
+        self.assertTrue(captured["use_worktree"])
+    def test_run_dry_run_returns_zero_even_when_not_pass(self) -> None:
+        def _fake_run_pipeline(config, **kwargs):
+            return PipelineResult(
+                iterations=[],
+                final_verdict="MAX_ITERATIONS_REACHED",
+                run_dir=Path(".cross-eval/output"),
+            )
+        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
+            exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
+        self.assertEqual(exit_code, 0)
     def test_run_senior_model_override_applies_only_to_seniors(self) -> None:
         captured: dict[str, list[str]] = {}
@@ -1049,7 +1152,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
         with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
             exit_code = main([
                 "run",
-                "--preset", "review-fix",
+                "--preset", "coding-plan-review",
                 "--coder", "claude",
                 "--reviewer", "claude",
                 "--senior", "claude",
@@ -1077,7 +1180,7 @@ class OutputDirectoryResolutionTest(unittest.TestCase):
                 " plan: plan.md\n"
                 "coders: [claude-coder]\n"
                 "reviewers: [claude-reviewer]\n"
-                "pipeline: preset:simple\n"
+                "pipeline: preset:coding-plan-review\n"
                 "output_dir: .cross-eval/output\n"
             ),
             encoding="utf-8",
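
The CLI tests in this file share one pattern: `run_pipeline` is patched with a `side_effect` function that records the resolved config instead of running anything. Below is a minimal, self-contained sketch of that pattern. `FakeConfig`, the `pipeline` namespace, and the toy `main` are hypothetical stand-ins, not the real `cross_eval` objects, which this diff does not show.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from unittest.mock import patch


@dataclass
class FakeConfig:
    """Hypothetical stand-in for the real pipeline config."""
    preset_name: str
    use_worktree: bool = False


# Hypothetical module-like holder so the function can be patched by attribute.
pipeline = SimpleNamespace(run_pipeline=lambda config, **kwargs: "REAL")


def main(argv):
    """Toy CLI: resolves a config from argv and hands it to the pipeline."""
    preset = argv[argv.index("--preset") + 1]
    config = FakeConfig(preset_name=preset, use_worktree="--worktree" in argv)
    pipeline.run_pipeline(config)
    return 0


captured = {}


def _fake_run_pipeline(config, **kwargs):
    # Record what the CLI resolved; nothing is actually executed.
    captured["preset"] = config.preset_name
    captured["use_worktree"] = config.use_worktree
    return "PASS"


with patch.object(pipeline, "run_pipeline", side_effect=_fake_run_pipeline):
    exit_code = main(["run", "--preset", "plan-review", "--worktree"])
```

Asserting on `captured` after the `with` block checks only how flags map onto the config, which keeps such tests independent of any agent or git state.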


@@ -465,6 +465,9 @@ class TestExpandedClaimMarkers(unittest.TestCase):
     def test_changes_are_complete(self) -> None:
         self.assertTrue(_claims_file_changes("All changes are complete"))
+    def test_korean_change_summary_triggers(self) -> None:
+        self.assertTrue(_claims_file_changes("모든 수정이 완료되었습니다. 아래는 변경 요약입니다."))
 class TestExpandedNoChangeMarkers(unittest.TestCase):
     """New no-change markers prevent false positives."""
@@ -484,6 +487,9 @@ class TestExpandedNoChangeMarkers(unittest.TestCase):
     def test_no_action_required(self) -> None:
         self.assertFalse(_claims_file_changes("No action required"))
+    def test_korean_no_change_marker(self) -> None:
+        self.assertFalse(_claims_file_changes("변경할 필요 없음"))
 # ---------------------------------------------------------------------------
 # 6. Cross-iteration evidence propagation
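
The tests above exercise claim and no-change markers in both English and Korean. The real `_claims_file_changes` is not part of this diff, so the following is only a guess at the general shape of such a check: no-change markers take precedence, then claim markers decide. The marker tuples and the function name `claims_file_changes_sketch` are assumptions chosen to match the four test inputs shown here.

```python
# Hypothetical marker lists; the real implementation may use different
# phrases, regular expressions, or a different precedence rule.
NO_CHANGE_MARKERS = ("no action required", "변경할 필요 없음")
CLAIM_MARKERS = ("changes are complete", "변경 요약")


def claims_file_changes_sketch(text: str) -> bool:
    """Return True when the text appears to claim that files were modified."""
    lowered = text.casefold()
    # A no-change marker wins, so "nothing to do" replies are not flagged.
    if any(marker in lowered for marker in NO_CHANGE_MARKERS):
        return False
    return any(marker in lowered for marker in CLAIM_MARKERS)
```

Checking the no-change markers first is one way to avoid the false positives that the `TestExpandedNoChangeMarkers` docstring mentions.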


@@ -55,7 +55,7 @@ class DoctorCheckInstalledTest(unittest.TestCase):
             config_path = ce_dir / "config.yaml"
             config_path.write_text(
                 "inputs:\n plan: plan.md\ncoders: [claude-coder]\n"
-                "reviewers: [claude-reviewer]\npipeline: preset:simple\n",
+                "reviewers: [claude-reviewer]\npipeline: preset:coding-plan-review\n",
                 encoding="utf-8",
             )
             # Also create plan.md so validation passes
@@ -137,22 +137,22 @@ class DemoTest(unittest.TestCase):
     def test_mock_demo_runs_without_error(self) -> None:
         # Should not raise
         with patch("sys.stdout"):
-            run_mock_demo(preset="simple")
+            run_mock_demo(preset="coding-plan-review")
     def test_mock_demo_escalate_runs_without_error(self) -> None:
         with patch("sys.stdout"):
-            run_mock_demo(preset="simple", show_escalate=True)
+            run_mock_demo(preset="coding-plan-review", show_escalate=True)
     def test_cmd_demo_mock_default(self) -> None:
         with patch("cross_eval.demo.run_mock_demo") as mock:
             exit_code = main(["demo"])
-            mock.assert_called_once_with(preset="simple", show_escalate=False)
+            mock.assert_called_once_with(preset="coding-plan-review", show_escalate=False)
             self.assertEqual(exit_code, 0)
     def test_cmd_demo_escalate_flag(self) -> None:
         with patch("cross_eval.demo.run_mock_demo") as mock:
             exit_code = main(["demo", "--escalate"])
-            mock.assert_called_once_with(preset="simple", show_escalate=True)
+            mock.assert_called_once_with(preset="coding-plan-review", show_escalate=True)
             self.assertEqual(exit_code, 0)
     def test_cmd_demo_live_requires_confirmation(self) -> None:


@@ -13,7 +13,11 @@ from cross_eval.models import (
     StepConfig,
 )
 from cross_eval.pipeline import run_pipeline
-from cross_eval.prompts import _build_review_fix_preset, _build_simple_preset
+from cross_eval.prompts import (
+    _build_plan_review_preset,
+    _build_review_fix_preset,
+    _build_simple_preset,
+)
 def _make_mock_agent(outputs: list[str]):
@@ -262,6 +266,60 @@ class TestPhasedPipelineEscalateBreaksPhase(unittest.TestCase):
         self.assertTrue(len(result.escalated_issues) > 0)
+class TestPlanReviewPipelineLoopsUntilVerifyPass(unittest.TestCase):
+    """Document plan-review should revise docs and re-verify across iterations."""
+    def test_plan_review_fail_then_pass(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            coders = ["claude-coder"]
+            reviewers = ["claude-reviewer"]
+            seniors = ["claude-senior"]
+            steps = _build_plan_review_preset(coders, reviewers, seniors)
+            config = PipelineConfig(
+                output_dir=Path(tmpdir),
+                max_iterations=4,
+                min_iterations=1,
+                language="en",
+                inputs={
+                    "plan": "Test plan",
+                    "checklist": "Test checklist",
+                    "docs": "Reference docs",
+                },
+                agents=dict(BUILTIN_AGENTS),
+                coders=coders,
+                reviewers=reviewers,
+                seniors=seniors,
+                pipeline=steps,
+                preset_name="plan-review",
+            )
+            mock = _make_step_mock({
+                "plan_review": [
+                    "Requirements are ambiguous\n\nVERDICT: FAIL",
+                    "Looks aligned\n\nVERDICT: PASS",
+                ],
+                "aggregate_review": [
+                    "### Confirmed Issues\n- Clarify acceptance criteria\n\n"
+                    "### Action Items\n1. Tighten the checklist\n\nVERDICT: FAIL",
+                    "### Confirmed Issues\nNone\n\n"
+                    "### Dismissed Findings\nNone\n\n"
+                    "### Action Items\n1. No document changes needed\n\nVERDICT: PASS",
+                ],
+                "plan_fix": ["Updated plan and checklist", "No-op"],
+                "verify": [
+                    "Still missing edge-case criteria\n\nVERDICT: FAIL",
+                    "Planning package is now implementable\n\nVERDICT: PASS",
+                ],
+            })
+            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
+                result = run_pipeline(config)
+            self.assertEqual(result.final_verdict, "PASS")
+            self.assertEqual(len(result.iterations), 2)
 class TestAutoEscalateFiresWithoutSenior(unittest.TestCase):
     """Test 6: simple pipeline without senior, same FAIL feedback 3 times -> auto-escalate."""

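
The plan-review test above scripts one output per step per iteration and expects the loop to stop at the first passing `verify`. The sketch below illustrates that idea in isolation. `make_step_mock` and `run_until_pass` are hypothetical helpers, not the project's `_make_step_mock` or `run_pipeline`, and the rule that `verify` alone decides the verdict is an assumption.

```python
from collections import deque


def make_step_mock(scripted: dict[str, list[str]]):
    """Return a fake agent call that replays scripted outputs per step name."""
    queues = {name: deque(outputs) for name, outputs in scripted.items()}

    def fake_invoke(step_name: str) -> str:
        # Each call consumes the next scripted output for that step.
        return queues[step_name].popleft()

    return fake_invoke


def run_until_pass(invoke, steps, max_iterations):
    """Run every step once per iteration; stop when verify reports PASS."""
    iterations = []
    for _ in range(max_iterations):
        outputs = {step: invoke(step) for step in steps}
        iterations.append(outputs)
        if outputs["verify"].rstrip().endswith("VERDICT: PASS"):
            return "PASS", iterations
    return "MAX_ITERATIONS_REACHED", iterations


STEPS = ["plan_review", "aggregate_review", "plan_fix", "verify"]
invoke = make_step_mock({
    "plan_review": ["Ambiguous\n\nVERDICT: FAIL", "Aligned\n\nVERDICT: PASS"],
    "aggregate_review": ["Fix checklist\n\nVERDICT: FAIL", "None\n\nVERDICT: PASS"],
    "plan_fix": ["Updated plan and checklist", "No-op"],
    "verify": ["Missing criteria\n\nVERDICT: FAIL", "Implementable\n\nVERDICT: PASS"],
})
verdict, iterations = run_until_pass(invoke, STEPS, max_iterations=4)
```

With two scripted outputs per step, the loop ends after the second iteration even though `max_iterations` is 4, which mirrors the `len(result.iterations) == 2` assertion in the test.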

@@ -16,12 +16,17 @@ from cross_eval.agent import (
 )
 from cross_eval.models import AgentConfig, AgentResult, ExecutionConfig, PipelineConfig, StepConfig
 from cross_eval.pipeline import (
+    _apply_worktree_inputs_to_base,
+    _commit_base_repo_paths,
+    _copy_inputs_to_worktree,
     _commit_iteration,
     _execute_parallel_batch,
     _execute_step,
     _finalize_worktree,
     _format_runtime_error_markdown,
+    _load_inputs,
     _maybe_save_step_transcript,
+    _refresh_inputs,
     _snapshot_repo_state,
 )
 from cross_eval.runtime_env import (
@@ -118,6 +123,146 @@ class TestInvokeAgentRuntime(unittest.TestCase):
         self.assertEqual(ctx.exception.failure_type, "API_ERROR")
         self.assertIn("backend down", ctx.exception.raw_error)
+class TestWorktreeInputMapping(unittest.TestCase):
+    def test_repo_local_plan_input_maps_to_tracked_worktree_path(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            repo = Path(tmpdir) / "repo"
+            repo.mkdir()
+            _init_git_repo(repo)
+            (repo / "plan.md").write_text("plan v1\n", encoding="utf-8")
+            subprocess.run(["git", "add", "plan.md"], cwd=repo, capture_output=True, check=True)
+            subprocess.run(
+                ["git", "commit", "-m", "add plan"],
+                cwd=repo,
+                capture_output=True,
+                check=True,
+            )
+            worktree_dir = Path(tmpdir) / "wt"
+            branch = "cross-eval/test-plan-review"
+            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
+            try:
+                config = PipelineConfig(
+                    inputs={"plan": repo / "plan.md"},
+                    preset_name="plan-review",
+                )
+                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
+                self.assertEqual(config.inputs["plan"], worktree_path / "plan.md")
+            finally:
+                remove_worktree(base_cwd=repo, work_dir=worktree_path)
+                subprocess.run(
+                    ["git", "branch", "-D", branch],
+                    cwd=repo,
+                    capture_output=True,
+                )
+    def test_plan_review_docs_ref_maps_to_worktree_and_refreshes_docs(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            repo = Path(tmpdir) / "repo"
+            repo.mkdir()
+            _init_git_repo(repo)
+            docs_dir = repo / "plans"
+            docs_dir.mkdir()
+            (docs_dir / "A.md").write_text("A v1\n", encoding="utf-8")
+            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
+            subprocess.run(
+                ["git", "commit", "-m", "add docs"],
+                cwd=repo,
+                capture_output=True,
+                check=True,
+            )
+            config = PipelineConfig(
+                inputs={
+                    "docs": "stale snapshot",
+                    "docs_ref": docs_dir,
+                },
+                preset_name="plan-review",
+            )
+            input_contents = _load_inputs(config)
+            self.assertIn("A.md", input_contents["docs"])
+            worktree_dir = Path(tmpdir) / "wt"
+            branch = "cross-eval/test-docs-ref"
+            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
+            try:
+                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
+                self.assertEqual(config.inputs["docs_ref"], worktree_path / "plans")
+                updated = worktree_path / "plans" / "A.md"
+                updated.write_text("A v2\n", encoding="utf-8")
+                _refresh_inputs(config, input_contents)
+                self.assertIn("A.md", input_contents["docs"])
+                self.assertIn("A v2", input_contents["docs"])
+            finally:
+                remove_worktree(base_cwd=repo, work_dir=worktree_path)
+                subprocess.run(
+                    ["git", "branch", "-D", branch],
+                    cwd=repo,
+                    capture_output=True,
+                )
+    def test_worktree_doc_changes_apply_back_and_commit_in_base_repo(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            repo = Path(tmpdir) / "repo"
+            repo.mkdir()
+            _init_git_repo(repo)
+            docs_dir = repo / "plans"
+            docs_dir.mkdir()
+            doc_path = docs_dir / "A.md"
+            doc_path.write_text("A v1\n", encoding="utf-8")
+            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
+            subprocess.run(
+                ["git", "commit", "-m", "add docs"],
+                cwd=repo,
+                capture_output=True,
+                check=True,
+            )
+            config = PipelineConfig(
+                inputs={"docs_ref": docs_dir},
+                preset_name="plan-review",
+            )
+            original_inputs = {"docs_ref": docs_dir}
+            worktree_dir = Path(tmpdir) / "wt"
+            branch = "cross-eval/test-apply-back"
+            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
+            try:
+                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
+                worktree_doc = config.inputs["docs_ref"] / "A.md"
+                worktree_doc.write_text("A v2\n", encoding="utf-8")
+                restored = _apply_worktree_inputs_to_base(
+                    config, original_inputs, cwd=repo,
+                )
+                self.assertEqual(restored, [docs_dir])
+                self.assertEqual(doc_path.read_text(encoding="utf-8"), "A v2\n")
+                committed = _commit_base_repo_paths(
+                    repo, restored, "cross-eval: plan-review (FAIL)",
+                )
+                self.assertTrue(committed)
+                log = subprocess.run(
+                    ["git", "log", "-1", "--pretty=%s"],
+                    cwd=repo,
+                    capture_output=True,
+                    text=True,
+                    check=True,
+                )
+                self.assertEqual(log.stdout.strip(), "cross-eval: plan-review (FAIL)")
+            finally:
+                remove_worktree(base_cwd=repo, work_dir=worktree_path)
+                subprocess.run(
+                    ["git", "branch", "-D", branch],
+                    cwd=repo,
+                    capture_output=True,
+                )
     def test_classify_unknown_failure(self) -> None:
         failure_type, suggested_action = _classify_agent_failure("weird crash")
         self.assertEqual(failure_type, "UNKNOWN")
@@ -376,11 +521,13 @@ class TestInvokeAgenticRuntime(unittest.TestCase):
 class TestPipelineHelpers(unittest.TestCase):
+    @patch("cross_eval.worktree.get_current_head", return_value="a" * 40)
     @patch("cross_eval.worktree.commit_worktree", return_value=True)
-    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock) -> None:
+    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock, mock_head: MagicMock) -> None:
         with tempfile.TemporaryDirectory() as tmpdir:
-            _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
+            new_head = _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
         mock_commit.assert_called_once()
+        self.assertEqual(new_head, "a" * 40)
     def test_snapshot_repo_state_includes_untracked_digest(self) -> None:
         with tempfile.TemporaryDirectory() as tmpdir:
@@ -775,11 +922,18 @@ class TestRuntimeEnvironmentHelpers(unittest.TestCase):
 class TestWorktreeFailures(unittest.TestCase):
     @patch("cross_eval.worktree.subprocess.run")
     def test_create_worktree_raises_when_branch_creation_fails(self, mock_run: MagicMock) -> None:
-        mock_run.side_effect = subprocess.CalledProcessError(
-            1,
-            ["git", "branch"],
-            stderr="branch failed",
-        )
+        # First call: git rev-parse HEAD (succeeds)
+        # Second call: git branch (fails)
+        rev_parse_result = MagicMock(returncode=0)
+        rev_parse_result.stdout = "a" * 40
+        mock_run.side_effect = [
+            rev_parse_result,
+            subprocess.CalledProcessError(
+                1,
+                ["git", "branch"],
+                stderr="branch failed",
+            ),
+        ]
         with tempfile.TemporaryDirectory() as tmpdir:
             base = Path(tmpdir)
@@ -791,14 +945,17 @@ class TestWorktreeFailures(unittest.TestCase):
     @patch("cross_eval.worktree.subprocess.run")
     def test_create_worktree_cleans_branch_on_worktree_failure(self, mock_run: MagicMock) -> None:
+        rev_parse_result = MagicMock(returncode=0)
+        rev_parse_result.stdout = "a" * 40
         mock_run.side_effect = [
-            MagicMock(returncode=0),
+            rev_parse_result,  # git rev-parse HEAD
+            MagicMock(returncode=0),  # git branch
             subprocess.CalledProcessError(
                 1,
                 ["git", "worktree", "add"],
                 stderr="worktree failed",
             ),
-            MagicMock(returncode=0),
+            MagicMock(returncode=0),  # git branch -D (cleanup)
         ]
         with tempfile.TemporaryDirectory() as tmpdir: