Compare commits

2 Commits

Author SHA1 Message Date
이충영 에이닷서비스개발
0bbe0f6f7b continue 2026-03-15 17:54:30 +09:00
chungyeong
28efd5bb8f fix: use incremental diff per iteration instead of cumulative base diff
After each iteration's _commit_iteration, record the new HEAD SHA and use
it as the diff anchor for the next iteration. Previously capture_diff
always diffed against the initial base commit, causing every iteration to
return the same full cumulative diff — reviewers couldn't see what changed
between iterations, leading to repeated feedback and stuck FAIL loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 10:07:11 +09:00
15 changed files with 913 additions and 224 deletions
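
The fix described in commit 28efd5bb8f can be illustrated with a small, git-free sketch. It models each iteration's working tree as a dict of file contents and contrasts diffing against a fixed base with diffing against the previous iteration's state. `Snapshot`, `changed_files`, and the sample data are hypothetical names for illustration only; this is not the project's `capture_diff` implementation.

```python
from typing import Dict, List

Snapshot = Dict[str, str]  # hypothetical model: file path -> file content


def changed_files(anchor: Snapshot, current: Snapshot) -> List[str]:
    """Return the sorted paths whose content differs between two snapshots."""
    paths = set(anchor) | set(current)
    return sorted(p for p in paths if anchor.get(p) != current.get(p))


def cumulative_diffs(base: Snapshot, iterations: List[Snapshot]) -> List[List[str]]:
    """Old behaviour: every iteration is diffed against the initial base."""
    return [changed_files(base, snap) for snap in iterations]


def incremental_diffs(base: Snapshot, iterations: List[Snapshot]) -> List[List[str]]:
    """New behaviour: the anchor advances after every iteration."""
    diffs: List[List[str]] = []
    anchor = base
    for snap in iterations:
        diffs.append(changed_files(anchor, snap))
        anchor = snap  # analogous to recording the new HEAD SHA after the commit
    return diffs


base = {"a.py": "v0"}
runs = [
    {"a.py": "v1"},                 # iteration 1 edits a.py
    {"a.py": "v1", "b.py": "new"},  # iteration 2 only adds b.py
]
```

With the cumulative anchor, the second iteration still reports `a.py`, which is the repeated-feedback symptom the commit message describes; with the advancing anchor it reports only `b.py`.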

View File

@@ -10,6 +10,8 @@ AI 에이전트 2개를 활용한 개발 워크플로우(기획→체크리스
 - Generator: `--permission-mode auto` (can read and write files)
 - Reviewer: `--permission-mode plan` (read-only exploration)
 - Set the subprocess `cwd` to the current working directory
+- The default execution mode is **direct mode**: the agentic coder also edits files directly in the current working tree.
+- An isolated git worktree is created only when `--worktree` or `use_worktree: true` is specified explicitly.
 ## User Experience (UX Flow)
@@ -34,6 +36,7 @@ ls output/v1/ v2/ final-report.md
 ```yaml
 output_dir: output
+use_worktree: false
 max_iterations: 3
 inputs:
@@ -51,10 +54,8 @@ agents:
 system_prompt: "You are a meticulous code reviewer."
 # Option 1: use a preset (no need to write the pipeline YAML by hand)
-pipeline: preset:simple # "A generates → B reviews" (default)
-# pipeline: preset:cross-review # "both generate → review each other"
+pipeline: preset:coding-plan-review # "document-based implementation → code/document review → fix → re-verify" (default)
 # pipeline: preset:plan-review # "pre-implementation document review → fix → re-verify, repeated"
-# pipeline: preset:coding-review-fix # "one initial coding pass → review/fix loop"
 # Option 2: custom pipeline (for advanced users)
 # pipeline:
@@ -75,10 +76,8 @@ pipeline: preset:simple # "A 생성 → B 리뷰" (기본값)
 | Preset | Description | Auto-generated steps |
 |--------|------|-------------------|
-| `simple` | A codes → B reviews | coding(agent1) → review(agent2) |
-| `cross-review` | Both code, then review each other | coding_a → coding_b → review_of_b(agent_a) → review_of_a(agent_b) |
 | `plan-review` | Repeated pre-implementation document review/fix/re-verification | plan_review_* → aggregate_review → plan_fix → verify |
-| `coding-review-fix` | Review/fix loop after initial coding | initial_coding(coding) → review_fix(review* → aggregate → coding → verify) |
+| `coding-plan-review` | Document-based implementation followed by a repeated code/document review/fix loop | initial_coding(coding) → coding_plan_review(review* → aggregate → coding_plan_fix → verify) |
 A preset internally assembles the appropriate pipeline steps and context_override. agent1 and agent2 are assigned in the order in which the agents are defined. If the presets are not sufficient, you can write the steps yourself.
@@ -101,7 +100,7 @@ cross_eval/
 **models.py** (prevents circular imports; all dataclasses live here):
 - `AgentConfig` (command, args, system_prompt, stdin_mode)
 - `StepConfig` (name, agent, role, prompt_template, output_key, verdict, verdict_pattern, context_override)
-- `PipelineConfig` (output_dir, max_iterations, inputs, agents, pipeline)
+- `PipelineConfig` (output_dir, use_worktree, max_iterations, inputs, agents, pipeline)
 - `AgentResult` (output, exit_code, agent_name, step_name, duration_seconds)
 - `IterationResult` (iteration, step_outputs, verdict, feedback)
 - `PipelineResult` (iterations, final_verdict, total_duration)
@@ -117,7 +116,7 @@ cross_eval/
 - `default:review`: reviews against three criteria (over-optimization, false positives, omissions), outputs `VERDICT: PASS|FAIL`, and instructs the reviewer to **"explore the project directory directly and verify the code"**
 - `{variable}` placeholders; a missing key renders as `(no {key} provided)`
 - Users can override templates with a custom .md file
-- `PIPELINE_PRESETS` dict: defines the StepConfig list for each of the `simple`, `cross-review`, and `plan-review` presets
+- `PIPELINE_PRESETS` / `PHASED_PRESETS` dicts: define the StepConfig/PhaseConfig for the `plan-review` and `coding-plan-review` presets
 **agent.py** (`invoke_agent(agent_config, prompt, cwd)`):
 - The `cwd` parameter specifies the project directory, so the agent can explore files in that directory
@@ -139,16 +138,21 @@ for iteration 1..max_iterations:
 Generate final-report.md
 ```
+The agentic execution path has two modes.
+- Default: direct mode (edits files directly in `cwd`)
+- Opt-in: isolated worktree mode (`--worktree` or `use_worktree: true`)
 **report.py** (final Markdown report):
 - Summary table (iteration count, verdict, duration)
 - Per-iteration details (each step's output, agent name, duration)
 - Final verdict
 **cli.py** (subcommands):
-- `cross-eval init [--dir .] [--preset simple|cross-review|plan-review]`: scaffolding (does not overwrite existing files)
+- `cross-eval init [--dir .] [--preset coding-plan-review|plan-review]`: scaffolding (does not overwrite existing files)
-- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...]`
+- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...] [--worktree]`
 - `--input key=path`: overrides or adds to the config's inputs
 - `--dry-run`: prints only the rendered prompts without invoking any agent
+- `--worktree`: runs in an isolated git worktree instead of the default direct mode
 ## Files to Modify
@@ -172,10 +176,12 @@ final-report.md 생성
 4. Put some simple content in plan.md/checklist.md and do a real run with `cross-eval run --max-iter 2`
 5. Confirm that v1/ and final-report.md are created under the `output/` directory
+`--dry-run` is preview-only; the process exit code is `0` even when the actual verdict is not PASS.
 cross-eval run \
   --docs /Users/chungyeong/Desktop/Dev/new-alpha-foundry/plans/TO_CLICKHOUSE \
-  --preset coding-review-fix \
+  --preset coding-plan-review \
   --coder claude \
   --reviewer codex \
   --reviewer codex \
@@ -187,4 +193,4 @@ final-report.md 생성
   --max-iter 10
-cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-review-fix --lang ko --max-iter 1
+cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-plan-review --lang ko --max-iter 1
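
The mode selection documented in this file (direct mode by default, worktree only on request) can be sketched as a single resolution step. This is a minimal sketch that mirrors the `bool(raw.get("use_worktree", False))` and `if args.worktree:` lines appearing later in this diff; the function name `resolve_use_worktree` and its parameters are hypothetical, not part of the project.

```python
from typing import Any, Mapping


def resolve_use_worktree(cli_worktree: bool, raw_config: Mapping[str, Any]) -> bool:
    """Return True when the run should happen in an isolated git worktree.

    Hypothetical helper: the config value supplies the default (False when
    the key is absent), and the CLI flag can only switch the mode on.
    """
    use_worktree = bool(raw_config.get("use_worktree", False))
    if cli_worktree:
        use_worktree = True
    return use_worktree
```

One consequence of this shape is that `--worktree` cannot turn off a `use_worktree: true` set in the config; under this sketch the flag is strictly opt-in.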

View File

@@ -51,12 +51,15 @@ cp .cross-eval/checklist-sample.md .cross-eval/checklist.md
 ### 3. Run
 ```bash
-# Default run (coding → review, up to 3 iterations)
+# Default run (direct mode in the current working tree, up to 3 iterations)
 cross-eval run
 # Preview the prompts only (no agent calls, saves cost)
 cross-eval run --dry-run
+# Specify explicitly only when you want to run in an isolated git worktree
+cross-eval run --worktree
 # Change the maximum number of iterations
 cross-eval run --max-iter 5
@@ -80,6 +83,9 @@ output/
 └── final-report.md # overall summary report
 ```
+The default is **direct mode**: `cross-eval` reads and modifies files directly in the current working tree.
+Add `--worktree` to use an isolated git worktree only when a separate, isolated run is needed.
 ## Configuration (`.cross-eval/config.yaml`)
 ```yaml
@@ -101,7 +107,8 @@ agents:
 args: ["-p", "--model", "opus", "--permission-mode", "plan"]
 system_prompt: "You are a meticulous code reviewer."
-pipeline: preset:simple
+pipeline: preset:coding-plan-review
+use_worktree: false # default; true uses an isolated worktree
 ```
 If you edit `config.yaml` during a run, the change is picked up automatically from the next iteration.
@@ -110,16 +117,16 @@ pipeline: preset:simple
 | Preset | Description |
 |--------|------|
-| `simple` | Agent A codes, Agent B reviews (default) |
-| `cross-review` | Both code, then cross-review each other |
 | `plan-review` | Before implementation, reviews the plan, checklist, and reference documents, revises them, and repeats through re-verification |
-| `review-only` | Reviews existing code only, for audit purposes |
-| `review-fix` | Aggregates review results, then repeats automatic fixes and re-verification |
-| `coding-review-fix` | After one initial coding pass, aggregates review results and repeats automatic fixes and re-verification |
+| `coding-plan-review` | Implements code from the input documents, then repeats review/fix/re-verification of the code and documents together |
+The two presets differ only in role; most CLI options behave the same. For example, `--plan`, `--checklist`, `--docs`, `--coder`, `--reviewer`, `--senior`, `--max-iter`, `--dry-run`, and `--worktree` can be used the same way with both.
 ```bash
 # Init options
-cross-eval init --preset cross-review # cross-review preset
+cross-eval init --preset coding-plan-review # implementation + code/document review preset
 cross-eval init --preset plan-review # document review/fix/re-verification preset
 cross-eval init --lang en # English templates
 ```
+`cross-eval run --dry-run` only previews the prompts and pipeline configuration; the exit code is `0` even when the actual verdict is not PASS.
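
The dry-run rule stated above can be sketched as an exit-code mapping. Only the "dry-run always exits with `0`" branch comes from this diff; the mapping of non-PASS verdicts to `1` is an assumption made for illustration, and `exit_code` is a hypothetical helper name.

```python
from typing import Optional


def exit_code(dry_run: bool, final_verdict: Optional[str]) -> int:
    """Map a finished run to a process exit code (illustrative sketch)."""
    if dry_run:
        # Preview only: the verdict is never used to fail the process.
        return 0
    # Assumption: a real run exits non-zero unless the verdict is PASS.
    return 0 if final_verdict == "PASS" else 1
```

Checking `dry_run` before the verdict keeps preview runs usable in scripts that stop on a non-zero status.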

31
checklist.md Normal file
View File

@@ -0,0 +1,31 @@
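+# cross-eval CLI Usability Refactoring Checklist
+## Core user flow
+- [ ] Clearly explains what to do after `cross-eval init`.
+- [ ] Explains when and why to use `cross-eval doctor`.
+- [ ] Makes clear what must be prepared before running `cross-eval run`.
+- [ ] States that results are saved under `.cross-eval/output` after a run.
+## Understanding the `run` command
+- [ ] Differences between `--preset` values can be compared quickly.
+- [ ] The different roles of `--coder`, `--reviewer`, and `--senior` are explained.
+- [ ] The relationship between config-based and CLI-option-based execution is clear.
+- [ ] It is unambiguous which options override the config.
+## Example quality
+- [ ] Representative usage examples are organized around real user goals.
+- [ ] Examples are condensed to the key combinations instead of being so numerous that they distract.
+- [ ] Basic examples for beginners and advanced examples are kept separate.
+- [ ] Examples can be copied and run as-is.
+## Refactoring scope control
+- [ ] Do not rename existing commands or options.
+- [ ] Do not change functional behavior unnecessarily.
+- [ ] Keep the goal on improving guidance text, not on adding new features.
+- [ ] Do not expand the UI or features beyond the scope of the plan.
+## Code quality
+- [ ] Keep existing tests from breaking.
+- [ ] Check for regressions caused by help/text changes.
+- [ ] String changes do not contradict the actual output flow.
+- [ ] No duplicated or conflicting explanations are introduced.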

View File

@@ -38,7 +38,7 @@ coders: [claude-coder]
reviewers: [claude-reviewer] reviewers: [claude-reviewer]
# seniors: [codex-senior] # seniors: [codex-senior]
# 파이프라인 종류: simple | cross-review | plan-review | review-only | review-fix | coding-review-fix # 파이프라인 종류: plan-review | coding-plan-review
pipeline: preset:{preset} pipeline: preset:{preset}
# 반복 설정 # 반복 설정
@@ -194,20 +194,12 @@ def main(argv: list[str] | None = None) -> int:
) )
init_parser.add_argument( init_parser.add_argument(
"--preset", "--preset",
default="simple", default="coding-plan-review",
choices=[ choices=["plan-review", "coding-plan-review"],
"simple",
"cross-review",
"plan-review",
"review-only",
"review-fix",
"coding-review-fix",
],
help=( help=(
"파이프라인 종류 (기본: simple). " "파이프라인 종류 (기본: coding-plan-review). "
"simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서리뷰수정재검증, " "plan-review=문서리뷰수정재검증, "
"review-only=리뷰만, review-fix=리뷰수렴+자동수정, " "coding-plan-review=문서기반구현후 코드+문서 리뷰/수정/재검증"
"coding-review-fix=초기코딩후리뷰수렴"
), ),
) )
init_parser.add_argument( init_parser.add_argument(
@@ -252,9 +244,9 @@ def main(argv: list[str] | None = None) -> int:
) )
demo_parser.add_argument( demo_parser.add_argument(
"--preset", "--preset",
default="simple", default="coding-plan-review",
choices=["simple", "review-fix", "coding-review-fix"], choices=["plan-review", "coding-plan-review"],
help="데모할 파이프라인 종류 (기본: simple)", help="데모할 파이프라인 종류 (기본: coding-plan-review)",
) )
demo_parser.add_argument( demo_parser.add_argument(
"--escalate", "--escalate",
@@ -281,25 +273,12 @@ def main(argv: list[str] | None = None) -> int:
), ),
epilog=( epilog=(
"파이프라인 종류 (--preset):\n" "파이프라인 종류 (--preset):\n"
" ┌───────────────────────────────────────────────────────────────────┐\n" " ┌───────────────────────────────────────────────────────────────────┐\n"
"simple │ Coder가 코드 작성 → Reviewer가 리뷰 \n" "coding-plan-review │ 입력 문서 기반 구현 → 코드+문서 리뷰/수정\n"
" │ (기본값) │ FAIL이면 피드백 반영해서 재코딩, PASS까지 반복\n" " │ (기본값) │ → 재검증 반복 \n"
" ├───────────────────────────────────────────────────────────────────┤\n" " ├───────────────────────────────────────────────────────────────────┤\n"
" │ review-fix │ 2단계 파이프라인: \n" "plan-review │ 구현 전 문서 리뷰 → 문서 수정 → 재검증 반복\n"
" │ │ Reviewer N명 병렬 리뷰 → 취합 → 수정 → 재검증 │\n" " └─────────────────────┴──────────────────────────────────────────────┘\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ coding- │ 3단계 파이프라인: │\n"
" │ review-fix │ 초기 코딩 1회 → 리뷰 취합 → 수정 → 재검증 반복 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ plan-review │ 구현 전 기획서/체크리스트/문서를 검토하고 │\n"
" │ │ 수정한 뒤 시니어가 재검증할 때까지 반복 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ review-only │ 코드 작성 없이 Reviewer N명이 기존 코드만 검토 │\n"
" │ │ (이미 작성된 코드의 품질 감사용) │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ cross-review │ Coder 2명이 각각 구현 → 상대방 코드를 교차 리뷰 │\n"
" │ │ (서로 다른 에이전트의 구현 비교용) │\n"
" └──────────────┴─────────────────────────────────────────────────────┘\n"
"\n" "\n"
"기본 제공 에이전트:\n" "기본 제공 에이전트:\n"
" ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n" " ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n"
@@ -316,34 +295,13 @@ def main(argv: list[str] | None = None) -> int:
"\n" "\n"
"사용 예시:\n" "사용 예시:\n"
"\n" "\n"
" 기본 실행 (Claude가 코딩하고 Claude가 리뷰):\n" " 코드 + 문서 구현/리뷰 루프 (coding-plan-review):\n"
" cross-eval run --plan plan.md\n" " cross-eval run --plan plan.md --preset coding-plan-review \\\n"
"\n" " --coder claude --reviewer codex --reviewer claude --senior codex\n"
" Codex가 코딩, Claude가 리뷰:\n"
" cross-eval run --plan plan.md --coder codex --reviewer claude\n"
"\n"
" 리뷰어 2명 (Claude + Codex):\n"
" cross-eval run --plan plan.md --reviewer claude --reviewer codex\n"
"\n"
" 리뷰 취합용 Senior 추가:\n"
" cross-eval run --plan plan.md --preset review-fix \\\n"
" --reviewer claude --reviewer codex --senior codex\n"
"\n"
" 리뷰 수렴 후 자동 수정 (review-fix):\n"
" cross-eval run --plan plan.md --preset review-fix \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 초기 코딩 후 리뷰 수렴 + 자동 수정 (coding-review-fix):\n"
" cross-eval run --plan plan.md --preset coding-review-fix \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 기존 코드 리뷰만 (review-only):\n"
" cross-eval run --plan plan.md --preset review-only \\\n"
" --reviewer claude --reviewer codex\n"
"\n" "\n"
" 문서 리뷰 + 수정 + 재검증 반복 (plan-review):\n" " 문서 리뷰 + 수정 + 재검증 반복 (plan-review):\n"
" cross-eval run --plan plan.md --preset plan-review \\\n" " cross-eval run --plan plan.md --preset plan-review \\\n"
" --coder codex --reviewer codex\n" " --coder claude --reviewer codex --reviewer claude --senior codex\n"
"\n" "\n"
" 모델 변경:\n" " 모델 변경:\n"
" cross-eval run --plan plan.md --model sonnet\n" " cross-eval run --plan plan.md --model sonnet\n"
@@ -420,7 +378,11 @@ def main(argv: list[str] | None = None) -> int:
) )
agent_group.add_argument( agent_group.add_argument(
"--agentic", action="store_true", default=False, "--agentic", action="store_true", default=False,
help="Coder를 agentic 모드로 실행 (worktree에서 파일 직접 수정, git diff로 결과 캡처)", help="Coder를 agentic 모드로 실행 (파일 직접 수정, git diff로 결과 캡처)",
)
agent_group.add_argument(
"--worktree", action="store_true", default=False,
help="기본 direct mode 대신 isolated git worktree에서 실행",
) )
agent_group.add_argument( agent_group.add_argument(
"--model", default=None, metavar="MODEL", "--model", default=None, metavar="MODEL",
@@ -443,15 +405,8 @@ def main(argv: list[str] | None = None) -> int:
pipe_group = run_parser.add_argument_group("파이프라인") pipe_group = run_parser.add_argument_group("파이프라인")
pipe_group.add_argument( pipe_group.add_argument(
"--preset", default=None, "--preset", default=None,
choices=[ choices=["plan-review", "coding-plan-review"],
"simple", help="파이프라인 종류 (기본: coding-plan-review). 각 종류 설명은 아래 참조",
"cross-review",
"plan-review",
"review-only",
"review-fix",
"coding-review-fix",
],
help="파이프라인 종류 (기본: simple). 각 종류 설명은 아래 참조",
) )
pipe_group.add_argument( pipe_group.add_argument(
"--max-iter", type=int, default=None, "--max-iter", type=int, default=None,
@@ -560,18 +515,11 @@ def cmd_demo(args: argparse.Namespace) -> int:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
_PRESET_DESCRIPTIONS = { _PRESET_DESCRIPTIONS = {
"simple": "코딩 + 리뷰 (가장 기본)", "coding-plan-review": "입력 문서 기반 구현 후 코드+문서 리뷰/수정 반복",
"review-fix": "리뷰 → 취합 → 수정 → 재검증 반복",
"coding-review-fix": "초기 코딩 + 리뷰 수렴 반복",
"plan-review": "문서 리뷰 → 수정 → 재검증 반복", "plan-review": "문서 리뷰 → 수정 → 재검증 반복",
"review-only": "기존 코드만 리뷰 (코딩 없음)",
"cross-review": "2명이 각각 구현 후 교차 리뷰",
} }
_PRESET_ORDER = [ _PRESET_ORDER = ["coding-plan-review", "plan-review"]
"simple", "review-fix", "coding-review-fix",
"plan-review", "review-only", "cross-review",
]
def _prompt_choice( def _prompt_choice(
@@ -640,7 +588,7 @@ def _run_guided_init(target: Path) -> dict:
coder = _prompt_text(" Coder 에이전트", default="claude") coder = _prompt_text(" Coder 에이전트", default="claude")
reviewer = _prompt_text(" Reviewer 에이전트", default="claude") reviewer = _prompt_text(" Reviewer 에이전트", default="claude")
needs_senior = preset in ("review-fix", "coding-review-fix") needs_senior = preset in ("coding-plan-review", "plan-review")
senior = "" senior = ""
if needs_senior: if needs_senior:
senior = _prompt_text(" Senior 에이전트", default=reviewer) senior = _prompt_text(" Senior 에이전트", default=reviewer)
@@ -899,10 +847,10 @@ def cmd_run(args: argparse.Namespace) -> int:
need_rebuild = args.preset is not None or args.coders or args.reviewers or args.seniors need_rebuild = args.preset is not None or args.coders or args.reviewers or args.seniors
if need_rebuild: if need_rebuild:
from cross_eval.prompts import PHASED_PRESETS from cross_eval.prompts import PHASED_PRESETS
preset = args.preset or "simple" preset = args.preset or "coding-plan-review"
# Determine which preset was configured (from YAML or defaults) # Determine which preset was configured (from YAML or defaults)
if args.preset is None and config.phases: if args.preset is None and config.phases:
preset = config.preset_name if config.preset_name != "custom" else "review-fix" preset = config.preset_name if config.preset_name != "custom" else "coding-plan-review"
elif args.preset is None and not args.coders and not args.reviewers and not args.seniors: elif args.preset is None and not args.coders and not args.reviewers and not args.seniors:
pass # no changes needed pass # no changes needed
inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles( inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles(
@@ -929,8 +877,6 @@ def cmd_run(args: argparse.Namespace) -> int:
elif preset in PIPELINE_PRESETS: elif preset in PIPELINE_PRESETS:
config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors) config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
config.phases = [] config.phases = []
if preset == "review-only" and args.max_iter is None and args.min_iter is None:
config.max_iterations = 1
sync_phased_iterations(config) sync_phased_iterations(config)
if args.max_iter is not None: if args.max_iter is not None:
@@ -951,6 +897,9 @@ def cmd_run(args: argparse.Namespace) -> int:
if coder_name in config.agents: if coder_name in config.agents:
_make_agentic(config.agents[coder_name]) _make_agentic(config.agents[coder_name])
if args.worktree:
config.use_worktree = True
ensure_fix_preset_agentic(config) ensure_fix_preset_agentic(config)
# --model: apply to ALL agents # --model: apply to ALL agents
@@ -988,7 +937,7 @@ def cmd_run(args: argparse.Namespace) -> int:
print(f"No files found in: {docs_dir}", file=sys.stderr) print(f"No files found in: {docs_dir}", file=sys.stderr)
return 1 return 1
config.inputs["docs"] = docs_content config.inputs["docs"] = docs_content
config.inputs["docs_ref"] = str(docs_dir) config.inputs["docs_ref"] = docs_dir
if args.env_files: if args.env_files:
for env_file in args.env_files: for env_file in args.env_files:
@@ -1062,6 +1011,9 @@ def cmd_run(args: argparse.Namespace) -> int:
if not args.dry_run and result.run_dir: if not args.dry_run and result.run_dir:
print(f"Output: {result.run_dir}/") print(f"Output: {result.run_dir}/")
if args.dry_run:
return 0
if result.final_verdict == "ESCALATE": if result.final_verdict == "ESCALATE":
from cross_eval.report import print_escalation_report from cross_eval.report import print_escalation_report
print_escalation_report(config, result) print_escalation_report(config, result)
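
The CLI surface changed in this file can be approximated with a small standalone `argparse` parser. This is a sketch under stated assumptions: `build_run_parser` and `effective_preset` are hypothetical names, only three of the real options are reproduced, and the help strings are omitted. The choices and the fallback to `coding-plan-review` follow the hunks above.

```python
import argparse

PRESETS = ["plan-review", "coding-plan-review"]


def build_run_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with only the options discussed in this diff."""
    parser = argparse.ArgumentParser(prog="cross-eval run")
    parser.add_argument("--preset", default=None, choices=PRESETS)
    parser.add_argument("--worktree", action="store_true", default=False)
    parser.add_argument("--dry-run", action="store_true", default=False)
    return parser


def effective_preset(args: argparse.Namespace) -> str:
    """Fall back to the new default when --preset is not given."""
    return args.preset or "coding-plan-review"


args = build_run_parser().parse_args(["--worktree"])
```

Because `choices` is restricted to the two remaining presets, a removed name such as `simple` is rejected by `argparse` itself before any pipeline code runs.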

View File

@@ -31,7 +31,10 @@ DEFAULT_ROLE_REASONING_EFFORTS = {
"reviewer": "medium", "reviewer": "medium",
"senior": "high", "senior": "high",
} }
FIX_STYLE_PRESETS = {"plan-review", "review-fix", "coding-review-fix"} FIX_STYLE_PRESETS = {
"plan-review",
"coding-plan-review",
}
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
@@ -298,8 +301,7 @@ def _default_seniors_for_preset(
isinstance(pipeline_raw, str) isinstance(pipeline_raw, str)
and pipeline_raw in { and pipeline_raw in {
"preset:plan-review", "preset:plan-review",
"preset:review-fix", "preset:coding-plan-review",
"preset:coding-review-fix",
} }
and reviewers and reviewers
): ):
@@ -382,9 +384,11 @@ def default_config() -> PipelineConfig:
coders = ["claude-coder"] coders = ["claude-coder"]
reviewers = ["claude-reviewer"] reviewers = ["claude-reviewer"]
seniors: list[str] = [] seniors: list[str] = []
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors) pipeline: list[StepConfig] = []
phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
return PipelineConfig( return PipelineConfig(
output_dir=Path(".cross-eval/output"), output_dir=Path(".cross-eval/output"),
use_worktree=False,
max_iterations=3, max_iterations=3,
language="ko", language="ko",
execution=ExecutionConfig(), execution=ExecutionConfig(),
@@ -394,6 +398,8 @@ def default_config() -> PipelineConfig:
reviewers=reviewers, reviewers=reviewers,
seniors=seniors, seniors=seniors,
pipeline=pipeline, pipeline=pipeline,
phases=phases,
preset_name="coding-plan-review",
) )
@@ -437,7 +443,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
) )
# --- roles: explicit or inferred --- # --- roles: explicit or inferred ---
pipeline_raw = raw.get("pipeline", "preset:simple") pipeline_raw = raw.get("pipeline", "preset:coding-plan-review")
coders_raw = raw.get("coders") coders_raw = raw.get("coders")
reviewers_raw = raw.get("reviewers") reviewers_raw = raw.get("reviewers")
seniors_raw = raw.get("seniors") seniors_raw = raw.get("seniors")
@@ -498,6 +504,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
config = PipelineConfig( config = PipelineConfig(
output_dir=output_dir, output_dir=output_dir,
use_worktree=bool(raw.get("use_worktree", False)),
max_iterations=int(raw.get("max_iterations", 3)), max_iterations=int(raw.get("max_iterations", 3)),
min_iterations=int(raw.get("min_iterations", 1)), min_iterations=int(raw.get("min_iterations", 1)),
verbose=bool(raw.get("verbose", False)), verbose=bool(raw.get("verbose", False)),
@@ -555,10 +562,10 @@ def _resolve_pipeline(
"""Resolve pipeline from preset string or explicit step list. """Resolve pipeline from preset string or explicit step list.
Returns (steps, phases) tuple. Only one will be non-empty. Returns (steps, phases) tuple. Only one will be non-empty.
- Simple/cross-review/plan-review/review-only → steps populated, phases empty. - plan-review → steps populated, phases empty.
- Phased presets (review-fix) → steps empty, phases populated. - coding-plan-review → steps empty, phases populated.
""" """
# Preset: "preset:simple" or "preset:review-fix" # Preset: "preset:plan-review" or "preset:coding-plan-review"
if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"): if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
preset_name = pipeline_raw.split(":", 1)[1] preset_name = pipeline_raw.split(":", 1)[1]
if preset_name in PIPELINE_PRESETS: if preset_name in PIPELINE_PRESETS:
@@ -592,7 +599,7 @@ def _resolve_pipeline(
return steps, [] return steps, []
raise ValueError( raise ValueError(
f"'pipeline' must be a preset string (e.g. 'preset:simple') " f"'pipeline' must be a preset string (e.g. 'preset:plan-review') "
f"or a list of step definitions, got {type(pipeline_raw).__name__}" f"or a list of step definitions, got {type(pipeline_raw).__name__}"
) )
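
The `_resolve_pipeline` docstring above says exactly one of `(steps, phases)` is non-empty, depending on the preset. A minimal sketch of that dispatch follows. The registries here are hypothetical stand-ins that return plain step names (taken from the preset table in the plan document) rather than the project's real `StepConfig`/`PhaseConfig` objects, and `resolve_preset` is not the project's function.

```python
from typing import Callable, Dict, List, Tuple

Builder = Callable[[List[str], List[str], List[str]], List[str]]

# Hypothetical stand-ins for the real registries in cross_eval.prompts.
PIPELINE_PRESETS: Dict[str, Builder] = {
    "plan-review": lambda coders, reviewers, seniors: [
        "plan_review", "aggregate_review", "plan_fix", "verify",
    ],
}
PHASED_PRESETS: Dict[str, Builder] = {
    "coding-plan-review": lambda coders, reviewers, seniors: [
        "initial_coding", "coding_plan_review",
    ],
}


def resolve_preset(
    pipeline_raw: str,
    coders: List[str],
    reviewers: List[str],
    seniors: List[str],
) -> Tuple[List[str], List[str]]:
    """Return (steps, phases); only one of the two is non-empty."""
    if not pipeline_raw.startswith("preset:"):
        raise ValueError("expected a 'preset:<name>' string")
    name = pipeline_raw.split(":", 1)[1]
    if name in PIPELINE_PRESETS:
        return PIPELINE_PRESETS[name](coders, reviewers, seniors), []
    if name in PHASED_PRESETS:
        return [], PHASED_PRESETS[name](coders, reviewers, seniors)
    raise ValueError(f"unknown preset: {name}")
```

Under this sketch a config that still says `preset:simple` fails fast with a `ValueError` instead of silently falling back to another pipeline.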

View File

@@ -165,7 +165,7 @@ CYAN = "\033[36m"
RESET = "\033[0m" RESET = "\033[0m"
def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None: def run_mock_demo(preset: str = "coding-plan-review", show_escalate: bool = False) -> None:
"""Run a simulated demo showing the full pipeline lifecycle.""" """Run a simulated demo showing the full pipeline lifecycle."""
steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS
@@ -229,7 +229,7 @@ def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
def run_live_demo( def run_live_demo(
preset: str = "simple", preset: str = "coding-plan-review",
timeout: int | None = None, timeout: int | None = None,
) -> PipelineResult: ) -> PipelineResult:
"""Run a live demo with real agents using the built-in plan.""" """Run a live demo with real agents using the built-in plan."""
@@ -255,8 +255,9 @@ def run_live_demo(
pipeline = [] pipeline = []
phases = PHASED_PRESETS[preset](coders, reviewers, seniors) phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
else: else:
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors) pipeline = []
phases = [] phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:
plan_path = Path(tmpdir) / "plan.md" plan_path = Path(tmpdir) / "plan.md"

View File

@@ -62,6 +62,7 @@ class PipelineConfig:
"""Full cross-eval configuration.""" """Full cross-eval configuration."""
output_dir: Path = field(default_factory=lambda: Path(".cross-eval/output")) output_dir: Path = field(default_factory=lambda: Path(".cross-eval/output"))
use_worktree: bool = False
max_iterations: int = 3 max_iterations: int = 3
min_iterations: int = 1 min_iterations: int = 1
verbose: bool = False verbose: bool = False

View File

@@ -4,6 +4,7 @@ from __future__ import annotations
import logging import logging
import os import os
import re import re
import shutil
import subprocess import subprocess
import time import time
from hashlib import sha256 from hashlib import sha256
@@ -34,6 +35,19 @@ from cross_eval.runtime_env import (
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
def _get_current_head(cwd: Path) -> str | None:
"""Return the current HEAD SHA for an existing repository."""
result = subprocess.run(
["git", "rev-parse", "HEAD"],
cwd=cwd,
capture_output=True,
text=True,
)
if result.returncode != 0:
return None
return result.stdout.strip() or None
def run_pipeline( def run_pipeline(
config: PipelineConfig, config: PipelineConfig,
cwd: Path | None = None, cwd: Path | None = None,
@@ -62,18 +76,20 @@ def _commit_iteration(
label: str, label: str,
iteration: int, iteration: int,
verdict: str | None, verdict: str | None,
) -> None: ) -> str:
"""Intermediate commit after each agentic iteration. """Intermediate commit after each agentic iteration.
This resets the diff baseline so the next iteration only captures new changes. This resets the diff baseline so the next iteration only captures new changes.
Returns the new HEAD SHA to use as the base_commit for the next iteration.
""" """
from cross_eval.worktree import commit_worktree from cross_eval.worktree import commit_worktree, get_current_head
committed = commit_worktree( committed = commit_worktree(
worktree_path, worktree_path,
f"cross-eval: {label} v{iteration} ({verdict or 'no-verdict'})", f"cross-eval: {label} v{iteration} ({verdict or 'no-verdict'})",
) )
if committed: if committed:
logger.debug(" Intermediate commit: v%d (%s)", iteration, verdict) logger.debug(" Intermediate commit: v%d (%s)", iteration, verdict)
return get_current_head(worktree_path)
def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool: def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
@@ -122,8 +138,6 @@ def _copy_inputs_to_worktree(
Updates ``config.inputs`` in-place so subsequent reference refreshes use Updates ``config.inputs`` in-place so subsequent reference refreshes use
worktree-local paths. worktree-local paths.
""" """
import shutil
base_root = base_cwd.resolve() base_root = base_cwd.resolve()
track_external_inputs = config.preset_name == "plan-review" track_external_inputs = config.preset_name == "plan-review"
inputs_dir = worktree_path / ".cross-eval-inputs" inputs_dir = worktree_path / ".cross-eval-inputs"
@@ -132,7 +146,7 @@ def _copy_inputs_to_worktree(
# Exclude read-only input copies from git so they don't pollute code diffs. # Exclude read-only input copies from git so they don't pollute code diffs.
(inputs_dir / ".gitignore").write_text("*\n", encoding="utf-8") (inputs_dir / ".gitignore").write_text("*\n", encoding="utf-8")
for key, val in list(config.inputs.items()): for key, val in list(config.inputs.items()):
if key.endswith("_ref") or not isinstance(val, Path): if not isinstance(val, Path):
continue continue
if not val.exists(): if not val.exists():
continue continue
@@ -141,17 +155,71 @@ def _copy_inputs_to_worktree(
rel_path = resolved.relative_to(base_root) rel_path = resolved.relative_to(base_root)
except ValueError: except ValueError:
dest = inputs_dir / val.name dest = inputs_dir / val.name
shutil.copy2(resolved, dest) _copy_path(resolved, dest)
config.inputs[key] = dest config.inputs[key] = dest
continue continue
worktree_target = worktree_path / rel_path worktree_target = worktree_path / rel_path
if not worktree_target.exists(): if not worktree_target.exists():
worktree_target.parent.mkdir(parents=True, exist_ok=True) _copy_path(resolved, worktree_target)
shutil.copy2(resolved, worktree_target)
config.inputs[key] = worktree_target config.inputs[key] = worktree_target
def _snapshot_input_paths(config: PipelineConfig) -> dict[str, Path]:
"""Capture original on-disk input paths before remapping into a worktree."""
return {
key: val
for key, val in config.inputs.items()
if isinstance(val, Path)
}
def _apply_worktree_inputs_to_base(
config: PipelineConfig,
original_inputs: dict[str, Path],
*,
cwd: Path,
) -> list[Path]:
"""Copy the final worktree-edited inputs back onto the user-provided paths."""
restored: list[Path] = []
for key, original_path in original_inputs.items():
current_path = config.inputs.get(key)
if not isinstance(current_path, Path) or not current_path.exists():
continue
if current_path.resolve() == original_path.resolve():
continue
_copy_path(current_path, original_path)
restored.append(original_path)
return restored
def _commit_base_repo_paths(cwd: Path, paths: list[Path], message: str) -> bool:
"""Commit changed input paths in the base repository when they live under cwd."""
rel_paths: list[str] = []
for path in paths:
try:
rel_paths.append(str(path.resolve().relative_to(cwd.resolve())))
except ValueError:
continue
if not rel_paths:
return False
subprocess.run(
["git", "add", "--", *rel_paths],
cwd=cwd,
capture_output=True,
check=True,
)
result = subprocess.run(
["git", "commit", "-m", message],
cwd=cwd,
capture_output=True,
text=True,
)
return result.returncode == 0
def _snapshot_repo_state(cwd: Path) -> dict[str, str]: def _snapshot_repo_state(cwd: Path) -> dict[str, str]:
"""Capture the base repository working-tree state. """Capture the base repository working-tree state.
@@ -342,18 +410,26 @@ def _run_simple_pipeline(
# Setup shared worktree for agentic mode # Setup shared worktree for agentic mode
worktree_path: Path | None = None worktree_path: Path | None = None
agent_execution_path: Path | None = None
agentic_branch_name: str | None = None agentic_branch_name: str | None = None
agentic_base_commit: str | None = None agentic_base_commit: str | None = None
original_input_paths: dict[str, Path] = {}
base_repo_state: dict[str, str] | None = None base_repo_state: dict[str, str] | None = None
base_repo_status: str | None = None base_repo_status: str | None = None
if not dry_run and _has_agentic_steps(config, config.pipeline): if not dry_run and _has_agentic_steps(config, config.pipeline):
if config.use_worktree:
worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree( worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
cwd, run_dir, config.preset_name, cwd, run_dir, config.preset_name,
) )
original_input_paths = _snapshot_input_paths(config)
_copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd) _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
_refresh_input_references(config, input_contents) _refresh_input_references(config, input_contents)
base_repo_state = _snapshot_repo_state(cwd) base_repo_state = _snapshot_repo_state(cwd)
base_repo_status = _snapshot_repo_status(cwd) base_repo_status = _snapshot_repo_status(cwd)
agent_execution_path = worktree_path
else:
agent_execution_path = cwd
agentic_base_commit = _get_current_head(cwd)
feedback = "(no feedback — first iteration)" feedback = "(no feedback — first iteration)"
iterations: list[IterationResult] = [] iterations: list[IterationResult] = []
@@ -379,7 +455,7 @@ def _run_simple_pipeline(
config.pipeline, config, input_contents, feedback, config.pipeline, config, input_contents, feedback,
i, config.max_iterations, cwd, timeout, dry_run, i, config.max_iterations, cwd, timeout, dry_run,
run_dir=run_dir, output_iter=i, run_dir=run_dir, output_iter=i,
worktree_path=worktree_path, worktree_path=agent_execution_path,
runtime_env=runtime_env, runtime_env=runtime_env,
base_repo_state=base_repo_state, base_repo_state=base_repo_state,
base_repo_status=base_repo_status, base_repo_status=base_repo_status,
@@ -387,8 +463,8 @@ def _run_simple_pipeline(
) )
# Intermediate commit so next iteration's diff only shows new changes # Intermediate commit so next iteration's diff only shows new changes
if worktree_path is not None: if config.use_worktree and worktree_path is not None:
_commit_iteration(worktree_path, config.preset_name, i, verdict) agentic_base_commit = _commit_iteration(worktree_path, config.preset_name, i, verdict)
iter_result = IterationResult( iter_result = IterationResult(
iteration=i, iteration=i,
@@ -478,8 +554,25 @@ def _run_simple_pipeline(
break break
finally: finally:
if config.use_worktree and worktree_path is not None and original_input_paths:
restored_paths = _apply_worktree_inputs_to_base(
config, original_input_paths, cwd=cwd,
)
if restored_paths:
try:
committed = _commit_base_repo_paths(
cwd,
restored_paths,
f"cross-eval: {config.preset_name} ({final_verdict})",
)
if committed:
logger.info(" Applied and committed final input changes in base repo.")
else:
logger.info(" Applied final input changes in base repo (no commit created).")
except Exception:
logger.warning(" Failed to commit final input changes in base repo", exc_info=True)
agentic_branch: str | None = None agentic_branch: str | None = None
if worktree_path is not None and agentic_branch_name is not None: if config.use_worktree and worktree_path is not None and agentic_branch_name is not None:
agentic_branch = _finalize_worktree( agentic_branch = _finalize_worktree(
cwd, worktree_path, agentic_branch_name, cwd, worktree_path, agentic_branch_name,
config.preset_name, final_verdict, config.preset_name, final_verdict,
@@ -521,18 +614,26 @@ def _run_phased_pipeline(
# Setup shared worktree for agentic mode # Setup shared worktree for agentic mode
all_phase_steps = [s for p in config.phases for s in p.steps] all_phase_steps = [s for p in config.phases for s in p.steps]
worktree_path: Path | None = None worktree_path: Path | None = None
agent_execution_path: Path | None = None
agentic_branch_name: str | None = None agentic_branch_name: str | None = None
agentic_base_commit: str | None = None agentic_base_commit: str | None = None
original_input_paths: dict[str, Path] = {}
base_repo_state: dict[str, str] | None = None base_repo_state: dict[str, str] | None = None
base_repo_status: str | None = None base_repo_status: str | None = None
if not dry_run and _has_agentic_steps(config, all_phase_steps): if not dry_run and _has_agentic_steps(config, all_phase_steps):
if config.use_worktree:
worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree( worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
cwd, run_dir, config.preset_name, cwd, run_dir, config.preset_name,
) )
original_input_paths = _snapshot_input_paths(config)
_copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd) _copy_inputs_to_worktree(config, worktree_path, base_cwd=cwd)
_refresh_input_references(config, input_contents) _refresh_input_references(config, input_contents)
base_repo_state = _snapshot_repo_state(cwd) base_repo_state = _snapshot_repo_state(cwd)
base_repo_status = _snapshot_repo_status(cwd) base_repo_status = _snapshot_repo_status(cwd)
agent_execution_path = worktree_path
else:
agent_execution_path = cwd
agentic_base_commit = _get_current_head(cwd)
iterations: list[IterationResult] = [] iterations: list[IterationResult] = []
feedback = "(no feedback — first iteration)" feedback = "(no feedback — first iteration)"
@@ -579,7 +680,7 @@ def _run_phased_pipeline(
phase.steps, config, input_contents, feedback, phase.steps, config, input_contents, feedback,
pi, phase.max_iterations, cwd, timeout, dry_run, pi, phase.max_iterations, cwd, timeout, dry_run,
run_dir=run_dir, output_iter=global_iter, phase_name=phase.name, run_dir=run_dir, output_iter=global_iter, phase_name=phase.name,
worktree_path=worktree_path, worktree_path=agent_execution_path,
runtime_env=runtime_env, runtime_env=runtime_env,
base_repo_state=base_repo_state, base_repo_state=base_repo_state,
base_repo_status=base_repo_status, base_repo_status=base_repo_status,
@@ -587,8 +688,8 @@ def _run_phased_pipeline(
) )
# Intermediate commit so next iteration's diff only shows new changes # Intermediate commit so next iteration's diff only shows new changes
if worktree_path is not None: if config.use_worktree and worktree_path is not None:
_commit_iteration( agentic_base_commit = _commit_iteration(
worktree_path, f"{config.preset_name}/{phase.name}", worktree_path, f"{config.preset_name}/{phase.name}",
global_iter, verdict, global_iter, verdict,
) )
@@ -715,8 +816,25 @@ def _run_phased_pipeline(
final_verdict = "PASS" if phase_converged else "MAX_ITERATIONS_REACHED" final_verdict = "PASS" if phase_converged else "MAX_ITERATIONS_REACHED"
finally: finally:
if config.use_worktree and worktree_path is not None and original_input_paths:
restored_paths = _apply_worktree_inputs_to_base(
config, original_input_paths, cwd=cwd,
)
if restored_paths:
try:
committed = _commit_base_repo_paths(
cwd,
restored_paths,
f"cross-eval: {config.preset_name} ({final_verdict})",
)
if committed:
logger.info(" Applied and committed final input changes in base repo.")
else:
logger.info(" Applied final input changes in base repo (no commit created).")
except Exception:
logger.warning(" Failed to commit final input changes in base repo", exc_info=True)
agentic_branch: str | None = None agentic_branch: str | None = None
if worktree_path is not None and agentic_branch_name is not None: if config.use_worktree and worktree_path is not None and agentic_branch_name is not None:
agentic_branch = _finalize_worktree( agentic_branch = _finalize_worktree(
cwd, worktree_path, agentic_branch_name, cwd, worktree_path, agentic_branch_name,
config.preset_name, final_verdict, config.preset_name, final_verdict,
@@ -750,6 +868,8 @@ def _load_inputs(config: PipelineConfig) -> dict[str, str]:
for key, val in config.inputs.items(): for key, val in config.inputs.items():
if key.endswith("_ref"): if key.endswith("_ref"):
input_contents[key] = str(val) input_contents[key] = str(val)
elif key == "docs":
input_contents[key] = _load_docs_input(config, current_value=val)
elif isinstance(val, str): elif isinstance(val, str):
input_contents[key] = val input_contents[key] = val
else: else:
@@ -765,6 +885,8 @@ def _refresh_inputs(
for key, val in config.inputs.items(): for key, val in config.inputs.items():
if key.endswith("_ref"): if key.endswith("_ref"):
input_contents[key] = str(val) input_contents[key] = str(val)
elif key == "docs":
input_contents[key] = _load_docs_input(config, current_value=val)
elif isinstance(val, str): elif isinstance(val, str):
input_contents[key] = val input_contents[key] = val
elif isinstance(val, Path) and val.exists(): elif isinstance(val, Path) and val.exists():
@@ -772,6 +894,40 @@ def _refresh_inputs(
_refresh_input_references(config, input_contents) _refresh_input_references(config, input_contents)
def _load_docs_input(config: PipelineConfig, *, current_value: Path | str) -> str:
"""Load docs content from docs_ref when available so edits are visible next iteration."""
docs_ref = config.inputs.get("docs_ref")
docs_path = docs_ref if isinstance(docs_ref, Path) else None
if docs_path is not None and docs_path.exists():
if docs_path.is_dir():
return _read_docs_tree(docs_path)
try:
return docs_path.read_text(encoding="utf-8")
except (UnicodeDecodeError, OSError):
return ""
if isinstance(current_value, str):
return current_value
if current_value.exists() and current_value.is_file():
return current_value.read_text(encoding="utf-8")
return ""
def _read_docs_tree(docs_dir: Path) -> str:
"""Read all visible text files under a docs tree and concatenate them."""
parts: list[str] = []
for f in sorted(
path for path in docs_dir.rglob("*")
if path.is_file() and not any(part.startswith(".") for part in path.relative_to(docs_dir).parts)
):
try:
content = f.read_text(encoding="utf-8")
except (UnicodeDecodeError, OSError):
continue
rel_path = f.relative_to(docs_dir).as_posix()
parts.append(f"### {rel_path}\n{content}")
return "\n\n".join(parts)
def _refresh_input_references( def _refresh_input_references(
config: PipelineConfig, config: PipelineConfig,
input_contents: dict[str, str], input_contents: dict[str, str],
@@ -1701,3 +1857,12 @@ def _save_report(run_dir: Path, config: PipelineConfig, result: PipelineResult)
report_path.parent.mkdir(parents=True, exist_ok=True) report_path.parent.mkdir(parents=True, exist_ok=True)
report_path.write_text(report, encoding="utf-8") report_path.write_text(report, encoding="utf-8")
logger.info("Report saved: %s", report_path) logger.info("Report saved: %s", report_path)
def _copy_path(src: Path, dest: Path) -> None:
"""Copy a file or directory to ``dest``, creating parent directories as needed."""
if src.is_dir():
shutil.copytree(src, dest, dirs_exist_ok=True)
return
dest.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(src, dest)
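
The worktree input round trip in this diff (snapshot the original paths, copy them into the worktree, copy the edited files back) can be tried in isolation. The following is a minimal sketch under stated assumptions, not the project's code: `copy_path` mirrors the `_copy_path` helper above, while `round_trip_demo`, the `base` and `worktree` directory names, and the `plan` key are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_path(src: Path, dest: Path) -> None:
    """Copy a file or a directory tree, creating parent directories as needed."""
    if src.is_dir():
        shutil.copytree(src, dest, dirs_exist_ok=True)
        return
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)


def round_trip_demo() -> str:
    """Copy an input into a scratch 'worktree', edit it there, and copy it back."""
    with tempfile.TemporaryDirectory() as td:
        base = Path(td) / "base"          # hypothetical base checkout
        worktree = Path(td) / "worktree"  # hypothetical isolated worktree
        original = base / "docs" / "plan.md"
        original.parent.mkdir(parents=True)
        original.write_text("v1\n", encoding="utf-8")

        # Snapshot the original path before remapping it into the worktree.
        snapshot = {"plan": original}
        remapped = worktree / original.relative_to(base)
        copy_path(original, remapped)

        # Simulate an agent editing the worktree copy.
        remapped.write_text("v2\n", encoding="utf-8")

        # Copy the edited file back onto the user-provided path.
        copy_path(remapped, snapshot["plan"])
        return original.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(round_trip_demo())
```

Keeping the snapshot separate from the remapped path is what makes the copy-back step possible after `config.inputs` has been rewritten to point into the worktree.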


@@ -512,6 +512,218 @@ PLAN_FIX_TEMPLATE_KO = """\
8. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요. 8. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
""" """
PLAN_VERIFY_TEMPLATE = """\
You are verifying the latest planning package after plan-only revisions.
## Plan
{plan}
## Checklist
{checklist}
## Reference Documents
{docs}
## Previous Review (iteration {iteration} of {max_iterations})
{feedback}
## Execution Evidence
{execution_evidence}
## Verify Instructions
Review the latest planning package itself: the plan, checklist, and reference documents.
You MAY inspect the current repository to confirm that the documents describe the current reality accurately enough.
Do NOT require production code, scripts, infrastructure, or external environments to already be fixed.
For `plan-review`, PASS means the documents are now clear enough to execute without further document edits.
A known implementation gap, repo mismatch, legacy script problem, external dependency, or environment blocker is NOT a FAIL by itself if:
- the issue is described accurately in the planning package,
- the affected scope or gate is documented clearly,
- the required follow-up action or non-go condition is documented clearly, and
- the package does not misrepresent unresolved work as already complete.
Only mark FAIL when the planning package still needs correction, such as:
- unresolved ambiguity or contradiction in the documents,
- missing prerequisite, dependency, gate, ownership, or evidence rule,
- a known blocker that is still described inaccurately or misleadingly,
- conflicting source-of-truth rules across the planning documents,
- checklist or status criteria that would cause an operator to make the wrong decision.
Report implementation/repository problems that are already documented correctly under "Out of Scope Issues" or note them as documented risks, not as FAIL reasons.
## Output Format
### Remaining Document Issues
- [Major][Omission] Description (reference specific plan/checklist/doc item)
(Write "None" if no document issue remains.)
### Documented Risks / Out of Scope
- Description of a real implementation/repository/environment risk that is already documented correctly
(Write "None" if nothing notable remains.)
### Summary
- Remaining document issues: N
- Documented risks / out-of-scope items: N
- Overall quality: [BRIEF ASSESSMENT]
### Verdict
If the planning package no longer needs document changes, output: VERDICT: PASS
Otherwise output: VERDICT: FAIL
"""
PLAN_VERIFY_TEMPLATE_KO = """\
당신은 plan-only 수정 이후 최신 기획 패키지를 재검증하는 검토자입니다.
## 기획서
{plan}
## 체크리스트
{checklist}
## 참고 문서
{docs}
## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
{feedback}
## 실행 증거
{execution_evidence}
## 검증 지침
최신 기획 패키지 자체를 다시 검토하세요: 기획서, 체크리스트, 참고 문서를 함께 봅니다.
현재 저장소를 살펴보며 문서가 현실을 정확히 설명하는지 확인할 수는 있지만, 프로덕션 코드, 스크립트, 인프라, 외부 환경이 이미 수정되어 있을 것을 요구하면 안 됩니다.
`plan-review`에서 PASS의 뜻은 "이제 문서를 더 고칠 필요 없이 이 계획을 실행할 수 있다"입니다.
즉 구현 공백, 저장소 불일치, legacy 스크립트 문제, 외부 의존성, 환경 blocker가 남아 있어도 아래 조건을 만족하면 FAIL 사유가 아닙니다.
- 그 문제가 기획 패키지에 정확히 기록되어 있고
- 어떤 범위/게이트에 영향을 주는지 분명히 적혀 있고
- 필요한 후속 조치나 non-go 조건이 명확히 적혀 있고
- 아직 해결되지 않은 일을 이미 해결된 것처럼 오해하게 만들지 않는 경우
반대로 아래와 같은 경우에만 FAIL로 판정하세요.
- 문서 안에 아직 모호성이나 모순이 남아 있는 경우
- 선행조건, 의존성, 게이트, 담당 주체, evidence 규칙이 빠진 경우
- 알려진 blocker가 여전히 부정확하거나 오해를 부르는 방식으로 서술된 경우
- 기획 문서들 사이에서 source-of-truth 규칙이 충돌하는 경우
- 체크리스트나 상태 판정 기준 때문에 실행자가 잘못된 결정을 내릴 수 있는 경우
이미 문서에 정확히 기록된 구현/저장소 문제는 "범위 밖 이슈" 또는 "문서화된 리스크"로만 남기고, 그 자체를 FAIL 사유로 삼지 마세요.
## 출력 형식
### 남은 문서 이슈
- [Major][누락] 이슈 설명 (관련 기획서/체크리스트/참고 문서 항목 참조)
(남은 문서 이슈가 없으면 "없음"이라고 작성하세요.)
### 문서화된 리스크 / 범위 밖 이슈
- 실제 구현/저장소/환경 리스크이지만 문서에는 이미 정확히 반영된 항목
(해당 사항이 없으면 "없음"이라고 작성하세요.)
### 요약
- 남은 문서 이슈 수: N
- 문서화된 리스크 / 범위 밖 항목 수: N
- 전체 품질: [간략한 평가]
### 판정
기획 패키지를 더 수정할 필요가 없으면: VERDICT: PASS
그렇지 않으면: VERDICT: FAIL
"""
CODING_PLAN_REVIEW_TEMPLATE = """\
You are reviewing both the implementation and the planning package together.
## Artifact References
{artifact_references}
## Execution Evidence
{execution_evidence}
## Review Instructions
Read the referenced plan/checklist/docs/review artifacts directly from disk. \
Inspect the current repository and evaluate BOTH:
1. whether the implementation matches the plan/checklist/docs, and
2. whether the planning package still accurately describes the implementation target and constraints.
Report only issues that matter to delivering the original plan correctly. \
Do not invent new scope. Distinguish between code issues, document issues, and consistency gaps between them.
For each issue found, classify it with BOTH severity AND category:
- Severity: Critical / Major / Minor
- Category: Over-engineering / Omission
If previous review feedback is provided above, mark each prior item as CONFIRMED or DISMISSED.
If you find issues outside the original plan scope, report them separately under "Out of Scope Issues".
### Verdict
If the implementation satisfies the plan/checklist and the planning package no longer needs correction, output: VERDICT: PASS
Otherwise output: VERDICT: FAIL
"""
CODING_PLAN_REVIEW_TEMPLATE_KO = """\
당신은 구현 결과와 기획 문서 패키지를 함께 검토하는 리뷰어입니다.
## 참조 아티팩트
{artifact_references}
## 실행 증거
{execution_evidence}
## 검토 지침
참조된 plan/checklist/docs/review markdown를 직접 읽고 현재 저장소를 확인한 뒤, 아래 두 가지를 함께 평가하세요.
1. 현재 구현이 plan/checklist/docs와 일치하는가
2. 기획 문서 패키지가 현재 구현 목표와 제약을 여전히 정확하게 설명하는가
원래 계획을 제대로 완수하는 데 필요한 이슈만 보고하세요. 새로운 범위를 만들지 마세요.
코드 이슈, 문서 이슈, 코드-문서 불일치를 구분해서 적으세요.
발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요.
- 심각도: Critical / Major / Minor
- 카테고리: 과최적화 / 누락
이전 리뷰 피드백이 있으면 각 항목을 CONFIRMED 또는 DISMISSED로 판정하세요.
원래 계획 범위 밖 이슈는 "범위 밖 이슈"로 별도 분리하세요.
### 판정
구현이 plan/checklist를 충족하고 기획 문서 패키지도 더 이상 수정할 필요가 없으면: VERDICT: PASS
그렇지 않으면: VERDICT: FAIL
"""
CODING_PLAN_FIX_TEMPLATE = """\
You are fixing confirmed issues in both the implementation and the planning package.
## Artifact References
{artifact_references}
## Current Review Feedback
{feedback}
## Instructions
1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
2. Fix ONLY the confirmed issues from the current review feedback.
3. You may update both implementation files and planning artifacts when needed.
4. Preserve the original plan intent and scope. Do not silently broaden requirements.
5. Keep code, plan, checklist, and supporting docs consistent after edits.
6. After editing, briefly summarize what you changed and any blocker that still needs human input.
"""
CODING_PLAN_FIX_TEMPLATE_KO = """\
당신은 현재 리뷰에서 확정된 이슈를 코드와 기획 문서 패키지에 함께 반영하는 수정 담당자입니다.
## 참조 아티팩트
{artifact_references}
## 현재 리뷰 피드백
{feedback}
## 지침
1. 참조된 plan/checklist/docs/review markdown를 직접 읽으세요.
2. 현재 리뷰 피드백에서 확정된 이슈만 수정하세요.
3. 필요하면 코드와 기획 문서를 모두 수정할 수 있습니다.
4. 최초 plan의 의도와 범위를 유지하세요. 요구사항을 몰래 넓히지 마세요.
5. 수정 후 코드, plan, checklist, 참고 문서가 서로 모순되지 않게 유지하세요.
6. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
"""
AGGREGATE_REVIEW_TEMPLATE = """\ AGGREGATE_REVIEW_TEMPLATE = """\
You are adjudicating multiple review results and turning them into an actionable decision. You are adjudicating multiple review results and turning them into an actionable decision.
@@ -645,6 +857,9 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
"review": REVIEW_TEMPLATE, "review": REVIEW_TEMPLATE,
"plan-review": PLAN_REVIEW_TEMPLATE, "plan-review": PLAN_REVIEW_TEMPLATE,
"plan-fix": PLAN_FIX_TEMPLATE, "plan-fix": PLAN_FIX_TEMPLATE,
"plan-verify": PLAN_VERIFY_TEMPLATE,
"coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE,
"coding-plan-fix": CODING_PLAN_FIX_TEMPLATE,
"review-only": REVIEW_ONLY_TEMPLATE, "review-only": REVIEW_ONLY_TEMPLATE,
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE, "aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
}, },
@@ -653,6 +868,9 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
"review": REVIEW_TEMPLATE_KO, "review": REVIEW_TEMPLATE_KO,
"plan-review": PLAN_REVIEW_TEMPLATE_KO, "plan-review": PLAN_REVIEW_TEMPLATE_KO,
"plan-fix": PLAN_FIX_TEMPLATE_KO, "plan-fix": PLAN_FIX_TEMPLATE_KO,
"plan-verify": PLAN_VERIFY_TEMPLATE_KO,
"coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE_KO,
"coding-plan-fix": CODING_PLAN_FIX_TEMPLATE_KO,
"review-only": REVIEW_ONLY_TEMPLATE_KO, "review-only": REVIEW_ONLY_TEMPLATE_KO,
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO, "aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
}, },
@@ -961,7 +1179,7 @@ def _build_plan_review_preset(
name="verify", name="verify",
agent=senior_agent, agent=senior_agent,
role="review", role="review",
prompt_template="default:plan-review", prompt_template="default:plan-verify",
output_key="verify_result", output_key="verify_result",
verdict=True, verdict=True,
), ),
@@ -1065,16 +1283,97 @@ def _build_coding_review_fix_preset(
] ]
def _build_coding_plan_review_preset(
coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[PhaseConfig]:
"""Implement from plan/docs, then review and fix code+docs together."""
if not coders:
raise ValueError("'coding-plan-review' preset requires at least 1 coder")
if not reviewers:
raise ValueError("'coding-plan-review' preset requires at least 1 reviewer")
review_steps: list[StepConfig] = []
reviewer_keys = _unique_safe_keys(reviewers)
for reviewer, rk in zip(reviewers, reviewer_keys):
review_steps.append(
StepConfig(
name=f"review_{rk}",
agent=reviewer,
role="review",
prompt_template="default:coding-plan-review",
output_key=f"review_{rk}",
verdict=False,
parallel=True,
),
)
senior_agent = seniors[0] if seniors else reviewers[0]
review_step_names = [f"review_{rk}" for rk in reviewer_keys]
review_output_keys = [f"review_{rk}" for rk in reviewer_keys]
return [
PhaseConfig(
name="initial_coding",
steps=[
StepConfig(
name="coding",
agent=coders[0],
role="coding",
prompt_template="default:coding",
output_key="coding_output",
),
],
max_iterations=1,
consecutive_pass=1,
),
PhaseConfig(
name="coding_plan_review",
steps=review_steps + [
StepConfig(
name="aggregate_review",
agent=senior_agent,
role="review",
prompt_template="default:aggregate-review",
output_key="aggregate_review",
context_override={
"candidate_outputs": (
"Current implementation and planning package under review "
"(code + plan/checklist/reference docs)."
),
"reviews_bundle": _build_named_bundle(
reviewers, review_step_names, review_output_keys, "Review",
),
},
),
StepConfig(
name="coding_plan_fix",
agent=coders[0],
role="coding",
prompt_template="default:coding-plan-fix",
output_key="coding_plan_fix_output",
context_override={"feedback": "{aggregate_review}"},
),
StepConfig(
name="verify",
agent=senior_agent,
role="review",
prompt_template="default:coding-plan-review",
output_key="verify_result",
verdict=True,
),
],
max_iterations=5,
consecutive_pass=1,
),
]
PIPELINE_PRESETS: dict[str, Callable] = { PIPELINE_PRESETS: dict[str, Callable] = {
"simple": _build_simple_preset,
"cross-review": _build_cross_review_preset,
"plan-review": _build_plan_review_preset, "plan-review": _build_plan_review_preset,
"review-only": _build_review_only_preset,
} }
PHASED_PRESETS: dict[str, Callable] = { PHASED_PRESETS: dict[str, Callable] = {
"review-fix": _build_review_fix_preset, "coding-plan-review": _build_coding_plan_review_preset,
"coding-review-fix": _build_coding_review_fix_preset,
} }
ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys()) ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
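
The `coding-plan-review` preset above is two phases: a single-pass coding phase, then a review, aggregate, fix, verify loop in which only the `verify` step carries a verdict. A minimal sketch of that control flow follows. It assumes hypothetical `Phase`, `run_phases`, and `fake_step` names, uses a stub callable instead of real agent calls, and does not reproduce the project's `PhaseConfig` or its `consecutive_pass` handling.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Phase:
    """Hypothetical stand-in for a phase: ordered steps plus an iteration cap."""
    name: str
    steps: list[str]
    max_iterations: int
    verdict_step: Optional[str] = None  # step whose output decides PASS/FAIL


def run_phases(phases: list[Phase], run_step: Callable[[str, int], str]) -> str:
    """Run phases in order; repeat a phase until its verdict step passes."""
    for phase in phases:
        # A phase without a verdict step counts as passed once it has run.
        passed = phase.verdict_step is None
        for iteration in range(1, phase.max_iterations + 1):
            outputs = {step: run_step(step, iteration) for step in phase.steps}
            if phase.verdict_step is not None and "VERDICT: PASS" in outputs[phase.verdict_step]:
                passed = True
                break
        if not passed:
            return "MAX_ITERATIONS_REACHED"
    return "PASS"


PHASES = [
    Phase("initial_coding", ["coding"], max_iterations=1),
    Phase(
        "coding_plan_review",
        ["review", "aggregate_review", "coding_plan_fix", "verify"],
        max_iterations=5,
        verdict_step="verify",
    ),
]


def fake_step(step: str, iteration: int) -> str:
    """Stub agent: the verify step fails once, then passes."""
    if step == "verify":
        return "VERDICT: PASS" if iteration >= 2 else "VERDICT: FAIL"
    return f"{step} output"


if __name__ == "__main__":
    print(run_phases(PHASES, fake_step))
```

Capping the first phase at one iteration keeps the initial implementation from being redone on every loop; only the review and fix steps repeat.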


@@ -101,19 +101,18 @@ def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> tuple[P
def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str: def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
"""Capture all changes made in the worktree as a unified diff. """Capture all changes made in the worktree since ``base_commit``.
Includes both tracked modifications, new untracked files, and changes Handles two scenarios:
that the agent may have committed on its own. 1. Agent left changes uncommitted → ``git add -A && git diff base HEAD``
2. Agent committed its own changes → HEAD advanced, diff base..HEAD captures them
Args: Args:
base_commit: The commit SHA from when the worktree was created. base_commit: The diff anchor — typically the worktree HEAD *before* this
If provided, diffs against this fixed base instead of HEAD. iteration started (set by ``get_current_head`` after each
This is critical because agents (e.g. Claude in interactive ``_commit_iteration``). Falls back to ``HEAD~1`` if not given.
mode) may create their own commits, advancing HEAD and
making ``git diff --cached HEAD`` return empty.
""" """
# Stage any uncommitted changes so they're included in the diff # Stage any uncommitted changes
subprocess.run( subprocess.run(
["git", "add", "-A"], ["git", "add", "-A"],
cwd=worktree_path, cwd=worktree_path,
@@ -121,35 +120,33 @@ def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
check=True, check=True,
) )
if base_commit: # Commit staged changes so everything is reachable via HEAD
# Diff everything (committed + staged) against the original base. # (this is a no-op if nothing is staged)
# This captures changes regardless of whether the agent committed them. subprocess.run(
result = subprocess.run( ["git", "commit", "-m", "cross-eval: capture-diff snapshot", "--allow-empty-message"],
["git", "diff", base_commit, "--cached"],
cwd=worktree_path, cwd=worktree_path,
capture_output=True, capture_output=True,
text=True, text=True,
) )
diff = result.stdout.strip()
if diff:
return diff
# Also check committed changes (agent may have committed and left ref = base_commit or "HEAD~1"
# nothing staged)
result = subprocess.run( result = subprocess.run(
["git", "diff", base_commit, "HEAD"], ["git", "diff", ref, "HEAD"],
cwd=worktree_path, cwd=worktree_path,
capture_output=True, capture_output=True,
text=True, text=True,
) )
return result.stdout.strip() return result.stdout.strip()
# Fallback: no base_commit, use original behavior
def get_current_head(worktree_path: Path) -> str:
"""Return the current HEAD SHA of the worktree."""
result = subprocess.run( result = subprocess.run(
["git", "diff", "--cached", "HEAD"], ["git", "rev-parse", "HEAD"],
cwd=worktree_path, cwd=worktree_path,
capture_output=True, capture_output=True,
text=True, text=True,
check=True,
) )
return result.stdout.strip() return result.stdout.strip()
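
The incremental-diff idea behind this change (record the new HEAD after each iteration's commit and diff the next iteration against it) can be reproduced in a scratch repository. This is a minimal sketch rather than the project's implementation: the `git`, `commit_all`, and `incremental_diffs_demo` helpers and the demo identity are hypothetical, and it assumes a `git` executable is on `PATH`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Identity flags so commits work in a scratch repo without global git config.
GIT = ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.invalid"]


def git(repo: Path, *args: str) -> str:
    """Run a git command in ``repo`` and return its stripped stdout."""
    result = subprocess.run(
        [*GIT, *args], cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def commit_all(repo: Path, message: str) -> str:
    """Stage and commit everything, then return the new HEAD SHA (the next anchor)."""
    git(repo, "add", "-A")
    git(repo, "commit", "-m", message)
    return git(repo, "rev-parse", "HEAD")


def incremental_diffs_demo() -> list[str]:
    """Return one diff per iteration, each anchored at the previous iteration's HEAD."""
    diffs: list[str] = []
    with tempfile.TemporaryDirectory() as td:
        repo = Path(td)
        git(repo, "init", "-q")
        (repo / "a.txt").write_text("base\n", encoding="utf-8")
        anchor = commit_all(repo, "base")
        for i in (1, 2):
            (repo / f"iter{i}.txt").write_text(f"change {i}\n", encoding="utf-8")
            new_head = commit_all(repo, f"iteration {i}")
            diffs.append(git(repo, "diff", anchor, new_head))
            anchor = new_head  # move the anchor so the next diff is incremental
    return diffs


if __name__ == "__main__":
    if shutil.which("git"):
        for number, diff in enumerate(incremental_diffs_demo(), start=1):
            print(f"--- iteration {number} ---\n{diff}\n")
```

With a fixed anchor, the second diff would also include `iter1.txt`; moving the anchor is what keeps each reviewer pass focused on the latest changes.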

plan.md (new file, 47 lines added)

@@ -0,0 +1,47 @@
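# cross-eval CLI usability refactoring
## Goal
Refactor the `cross-eval` CLI experience so that users can quickly understand what each option means and easily choose the option combination that fits their purpose.
## Background
`cross-eval` currently provides the main commands `init`, `run`, `demo`, and `doctor` along with a variety of options, but a first-time user cannot see at a glance which option to use in which situation. In particular, the relationship between `run` presets, agent combinations, config-based execution, and direct option-based execution can feel complicated.
## Requirements
1. Refactor the CLI help or onboarding text so that even first-time users can quickly understand the main flows.
2. Users must be able to easily find the right option combination for each representative usage scenario.
3. The roles of the main `run` options (preset, coder/reviewer/senior, config, output-related) must be made clearer.
4. After `init`, the guidance must lead the user naturally to the next step.
5. Existing functionality must be preserved; focus on improving the explanation structure and usage flow rather than changing the behavior itself.
## User scenarios
1. A user who has just installed the tool wants to know what to do after `cross-eval init`.
2. A user about to execute `run` wants to quickly compare the differences between `--preset` values.
3. A user wants to understand, with examples, when to use each combination of `claude`, `codex`, and `senior`.
4. A user wants to decide whether to use config-based execution or CLI option-based execution.
5. A user wants to know where run results are stored and how to inspect them.
## Constraints
- Keep the existing CLI command names and core option names.
- Do not modify the existing pipeline logic unnecessarily.
- Focus on improving the guidance structure, help text, examples, and explanation flow rather than adding features.
- Keep the documentation easy to understand for Korean-speaking users, without breaking the existing project tone and structure.
## Scope
### Included
- Improve the `argparse` help/description/epilog text
- Improve the next-step guidance shown after `init`
- Organize `run` usage examples and add representative combination examples
- Restructure the explanation of the preset/agent/config/output concepts
- Tidy up parts of the README or onboarding text where needed
### Excluded
- Adding new presets
- Adding new CLI options
- Changing the pipeline execution algorithm
- Changing how agents are invoked
## Success criteria
1. Reading `--help` alone makes the basic usage flow clear.
2. Users can copy and use a run example for each representative scenario right away.
3. The `init → write → doctor → run → check output` flow connects naturally.
4. Option descriptions are not merely long; they are structured to help with the actual choice.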
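
One way the `argparse` help/description/epilog item in this plan could look is sketched below. It is an illustration only: the help wording, the `EPILOG` text, and the `build_parser` function are hypothetical, the preset choices are limited to the two names that appear in this diff, and the real CLI's parser layout may differ.

```python
import argparse

# Placeholder wording; the real help text is up to the project.
EPILOG = """\
typical flow:
  cross-eval init      create starter files, then edit them
  cross-eval doctor    check your setup before running
  cross-eval run       run the pipeline and inspect the output directory

examples:
  cross-eval run --preset plan-review
  cross-eval run --preset coding-plan-review --worktree
"""


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --help output shows the flow and copyable examples."""
    parser = argparse.ArgumentParser(
        prog="cross-eval",
        description="Run a generate/review loop with two AI agents.",
        epilog=EPILOG,
        # Keep the epilog's line breaks instead of re-wrapping them.
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("init", help="create starter files; next, edit them and run 'doctor'")
    sub.add_parser("doctor", help="check your setup before 'run'")

    run = sub.add_parser("run", help="execute a pipeline preset")
    run.add_argument(
        "--preset",
        choices=["coding-plan-review", "plan-review"],
        default="coding-plan-review",
        help="which pipeline to run (default: coding-plan-review)",
    )
    run.add_argument(
        "--worktree",
        action="store_true",
        help="run agentic coders in an isolated git worktree instead of the current tree",
    )
    return parser


if __name__ == "__main__":
    build_parser().print_help()
```

Putting the flow and examples in the epilog keeps them visible from `--help` alone, which matches the first success criterion without adding new options.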


@@ -490,6 +490,8 @@ class TestMakeAgenticCodex(unittest.TestCase):
def _make_agentic_config( def _make_agentic_config(
run_dir: Path, run_dir: Path,
agentic_coder: bool = True, agentic_coder: bool = True,
*,
use_worktree: bool = False,
) -> PipelineConfig: ) -> PipelineConfig:
"""Build a config with an agentic coder + non-agentic reviewer.""" """Build a config with an agentic coder + non-agentic reviewer."""
coder = AgentConfig( coder = AgentConfig(
@@ -521,6 +523,7 @@ def _make_agentic_config(
] ]
return PipelineConfig( return PipelineConfig(
output_dir=run_dir, output_dir=run_dir,
use_worktree=use_worktree,
max_iterations=2, max_iterations=2,
min_iterations=1, min_iterations=1,
language="en", language="en",
@@ -551,7 +554,7 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
) -> None: ) -> None:
with tempfile.TemporaryDirectory() as td: with tempfile.TemporaryDirectory() as td:
run_dir = Path(td) run_dir = Path(td)
config = _make_agentic_config(run_dir) config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()
@@ -573,6 +576,44 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
mock_setup.assert_called_once() mock_setup.assert_called_once()
class TestDirectAgenticMode(unittest.TestCase):
"""Agentic coders run in the current working tree by default."""
@patch("cross_eval.pipeline._setup_worktree")
@patch("cross_eval.pipeline.invoke_agent_agentic")
@patch("cross_eval.pipeline.invoke_agent")
def test_agentic_uses_current_worktree_by_default(
self,
mock_invoke: MagicMock,
mock_invoke_agentic: MagicMock,
mock_setup: MagicMock,
) -> None:
with tempfile.TemporaryDirectory() as td:
repo = Path(td)
_init_git_repo(repo)
run_dir = repo / ".cross-eval" / "output"
run_dir.mkdir(parents=True, exist_ok=True)
config = _make_agentic_config(run_dir)
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
agent_name="claude-coder", step_name="coding",
duration_seconds=0.1,
)
mock_invoke.return_value = AgentResult(
output="VERDICT: PASS", exit_code=0,
agent_name="claude-reviewer", step_name="review",
duration_seconds=0.1,
)
run_pipeline(config, cwd=repo)
mock_setup.assert_not_called()
self.assertEqual(mock_invoke_agentic.call_args.kwargs["worktree_path"], repo)
reviewer_call = mock_invoke.call_args
self.assertEqual(reviewer_call.kwargs["cwd"], repo)
class TestSetupWorktreeLocation(unittest.TestCase): class TestSetupWorktreeLocation(unittest.TestCase):
"""_setup_worktree places agentic worktrees outside the base repo.""" """_setup_worktree places agentic worktrees outside the base repo."""
@@ -618,7 +659,7 @@ class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
) -> None: ) -> None:
with tempfile.TemporaryDirectory() as td: with tempfile.TemporaryDirectory() as td:
run_dir = Path(td) run_dir = Path(td)
config = _make_agentic_config(run_dir) config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()
@@ -660,7 +701,7 @@ class TestCommitIterationCalled(unittest.TestCase):
) -> None: ) -> None:
with tempfile.TemporaryDirectory() as td: with tempfile.TemporaryDirectory() as td:
run_dir = Path(td) run_dir = Path(td)
config = _make_agentic_config(run_dir) config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()
@@ -702,7 +743,7 @@ class TestFinalizeWorktreeCalled(unittest.TestCase):
) -> None: ) -> None:
with tempfile.TemporaryDirectory() as td: with tempfile.TemporaryDirectory() as td:
run_dir = Path(td) run_dir = Path(td)
config = _make_agentic_config(run_dir) config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work" wt_path = run_dir / "work"
wt_path.mkdir() wt_path.mkdir()


@@ -331,7 +331,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
) )
self.assertEqual( self.assertEqual(
_default_seniors_for_preset( _default_seniors_for_preset(
"preset:review-fix", "preset:coding-plan-review",
["codex-reviewer", "claude-reviewer"], ["codex-reviewer", "claude-reviewer"],
BUILTIN_AGENTS, BUILTIN_AGENTS,
), ),
@@ -339,7 +339,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
         )
         self.assertEqual(
             _default_seniors_for_preset(
-                "preset:review-fix",
+                "preset:coding-plan-review",
                 ["claude-reviewer"],
                 BUILTIN_AGENTS,
             ),
@@ -347,15 +347,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
         )
         self.assertEqual(
             _default_seniors_for_preset(
-                "preset:coding-review-fix",
-                ["codex-reviewer"],
-                BUILTIN_AGENTS,
-            ),
-            ["codex-senior"],
-        )
-        self.assertEqual(
-            _default_seniors_for_preset(
-                "preset:simple",
+                "preset:unknown",
                 ["codex-reviewer"],
                 BUILTIN_AGENTS,
             ),
@@ -1019,7 +1011,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
                 " checklist: checklist.md\n"
                 "coders: [claude-coder]\n"
                 "reviewers: [claude-reviewer]\n"
-                "pipeline: preset:review-fix\n"
+                "pipeline: preset:coding-plan-review\n"
                 f"max_iterations: {max_iterations}\n"
                 "language: en\n"
             ),
@@ -1031,8 +1023,9 @@ class FixPresetBehaviorTest(unittest.TestCase):
         with tempfile.TemporaryDirectory() as tmpdir:
             config = load_config(self._write_fix_config(Path(tmpdir), max_iterations=7))
-            self.assertEqual(config.preset_name, "review-fix")
-            self.assertEqual(config.phases[0].max_iterations, 7)
+            self.assertEqual(config.preset_name, "coding-plan-review")
+            self.assertEqual(config.phases[0].max_iterations, 1)
+            self.assertEqual(config.phases[1].max_iterations, 7)
             self.assertTrue(config.agents["claude-coder"].agentic)
             self.assertNotIn("-p", config.agents["claude-coder"].args)
@@ -1042,7 +1035,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
         captured: dict[str, object] = {}
         def _fake_run_pipeline(config, **kwargs):
-            captured["phase_max"] = config.phases[0].max_iterations
+            captured["phase_max"] = config.phases[1].max_iterations
             captured["agentic"] = config.agents[config.coders[0]].agentic
             return PipelineResult(
                 iterations=[],
@@ -1062,13 +1055,13 @@ class FixPresetBehaviorTest(unittest.TestCase):
         self.assertEqual(captured["phase_max"], 9)
         self.assertTrue(captured["agentic"])
-    def test_run_preset_review_fix_auto_enables_agentic_without_flag(self) -> None:
+    def test_run_preset_coding_plan_review_auto_enables_agentic_without_flag(self) -> None:
         captured: dict[str, object] = {}
         def _fake_run_pipeline(config, **kwargs):
             captured["preset"] = config.preset_name
             captured["agentic"] = config.agents[config.coders[0]].agentic
-            captured["phase_max"] = config.phases[0].max_iterations
+            captured["phase_max"] = config.phases[1].max_iterations
             return PipelineResult(
                 iterations=[],
                 final_verdict="PASS",
@@ -1076,10 +1069,10 @@ class FixPresetBehaviorTest(unittest.TestCase):
             )
         with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
-            exit_code = main(["run", "--preset", "review-fix", "--dry-run"])
+            exit_code = main(["run", "--preset", "coding-plan-review", "--dry-run"])
         self.assertEqual(exit_code, 0)
-        self.assertEqual(captured["preset"], "review-fix")
+        self.assertEqual(captured["preset"], "coding-plan-review")
         self.assertTrue(captured["agentic"])
         self.assertEqual(captured["phase_max"], 3)
@@ -1089,6 +1082,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
         def _fake_run_pipeline(config, **kwargs):
             captured["preset"] = config.preset_name
             captured["agentic"] = config.agents[config.coders[0]].agentic
+            captured["use_worktree"] = config.use_worktree
             captured["seniors"] = list(config.seniors)
             captured["steps"] = [step.name for step in config.pipeline]
             captured["max_iter"] = config.max_iterations
@@ -1104,6 +1098,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
         self.assertEqual(exit_code, 0)
         self.assertEqual(captured["preset"], "plan-review")
         self.assertTrue(captured["agentic"])
+        self.assertFalse(captured["use_worktree"])
         self.assertEqual(captured["seniors"], ["claude-senior"])
         self.assertEqual(
             captured["steps"],
@@ -1111,6 +1106,36 @@ class FixPresetBehaviorTest(unittest.TestCase):
         )
         self.assertEqual(captured["max_iter"], 3)
+    def test_run_worktree_flag_enables_isolated_worktree_mode(self) -> None:
+        captured: dict[str, object] = {}
+        def _fake_run_pipeline(config, **kwargs):
+            captured["use_worktree"] = config.use_worktree
+            return PipelineResult(
+                iterations=[],
+                final_verdict="PASS",
+                run_dir=Path(".cross-eval/output"),
+            )
+        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
+            exit_code = main(["run", "--preset", "plan-review", "--dry-run", "--worktree"])
+        self.assertEqual(exit_code, 0)
+        self.assertTrue(captured["use_worktree"])
+    def test_run_dry_run_returns_zero_even_when_not_pass(self) -> None:
+        def _fake_run_pipeline(config, **kwargs):
+            return PipelineResult(
+                iterations=[],
+                final_verdict="MAX_ITERATIONS_REACHED",
+                run_dir=Path(".cross-eval/output"),
+            )
+        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
+            exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
+        self.assertEqual(exit_code, 0)
     def test_run_senior_model_override_applies_only_to_seniors(self) -> None:
         captured: dict[str, list[str]] = {}
@@ -1127,7 +1152,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
         with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
             exit_code = main([
                 "run",
-                "--preset", "review-fix",
+                "--preset", "coding-plan-review",
                 "--coder", "claude",
                 "--reviewer", "claude",
                 "--senior", "claude",
@@ -1155,7 +1180,7 @@ class OutputDirectoryResolutionTest(unittest.TestCase):
                 " plan: plan.md\n"
                 "coders: [claude-coder]\n"
                 "reviewers: [claude-reviewer]\n"
-                "pipeline: preset:simple\n"
+                "pipeline: preset:coding-plan-review\n"
                 "output_dir: .cross-eval/output\n"
             ),
             encoding="utf-8",
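
Review note: the tests in this file check that `use_worktree` stays false by default and becomes true when `--worktree` is passed. A minimal sketch of how the CLI flag and the YAML key might be reconciled is shown here. `build_parser` and `resolve_use_worktree` are hypothetical names invented for illustration, not functions from `cross_eval`. The precedence rule (the flag can only switch isolation on) is an assumption as well.

```python
from __future__ import annotations

import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the --worktree flag name comes from the diff."""
    parser = argparse.ArgumentParser(prog="cross-eval-sketch")
    # default=None lets the caller tell "flag absent" apart from "flag given".
    parser.add_argument("--worktree", action="store_true", default=None)
    return parser


def resolve_use_worktree(cli_flag: bool | None, yaml_value: bool | None) -> bool:
    """Return True only when isolation was explicitly requested.

    Assumption: the CLI flag can only turn isolation on, and a missing
    YAML key means direct mode (the equivalent of use_worktree: false).
    """
    if cli_flag:
        return True
    return bool(yaml_value)


if __name__ == "__main__":
    args = build_parser().parse_args(["--worktree"])
    print(resolve_use_worktree(args.worktree, yaml_value=False))
```

Using `None` as the flag default keeps "not specified" distinct from an explicit value, so a YAML `use_worktree: true` still takes effect when the flag is omitted.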

@@ -55,7 +55,7 @@ class DoctorCheckInstalledTest(unittest.TestCase):
             config_path = ce_dir / "config.yaml"
             config_path.write_text(
                 "inputs:\n plan: plan.md\ncoders: [claude-coder]\n"
-                "reviewers: [claude-reviewer]\npipeline: preset:simple\n",
+                "reviewers: [claude-reviewer]\npipeline: preset:coding-plan-review\n",
                 encoding="utf-8",
             )
             # Also create plan.md so validation passes
@@ -137,22 +137,22 @@ class DemoTest(unittest.TestCase):
     def test_mock_demo_runs_without_error(self) -> None:
         # Should not raise
         with patch("sys.stdout"):
-            run_mock_demo(preset="simple")
+            run_mock_demo(preset="coding-plan-review")
     def test_mock_demo_escalate_runs_without_error(self) -> None:
         with patch("sys.stdout"):
-            run_mock_demo(preset="simple", show_escalate=True)
+            run_mock_demo(preset="coding-plan-review", show_escalate=True)
     def test_cmd_demo_mock_default(self) -> None:
         with patch("cross_eval.demo.run_mock_demo") as mock:
             exit_code = main(["demo"])
-            mock.assert_called_once_with(preset="simple", show_escalate=False)
+            mock.assert_called_once_with(preset="coding-plan-review", show_escalate=False)
             self.assertEqual(exit_code, 0)
     def test_cmd_demo_escalate_flag(self) -> None:
         with patch("cross_eval.demo.run_mock_demo") as mock:
             exit_code = main(["demo", "--escalate"])
-            mock.assert_called_once_with(preset="simple", show_escalate=True)
+            mock.assert_called_once_with(preset="coding-plan-review", show_escalate=True)
             self.assertEqual(exit_code, 0)
     def test_cmd_demo_live_requires_confirmation(self) -> None:

@@ -16,13 +16,17 @@ from cross_eval.agent import (
 )
 from cross_eval.models import AgentConfig, AgentResult, ExecutionConfig, PipelineConfig, StepConfig
 from cross_eval.pipeline import (
+    _apply_worktree_inputs_to_base,
+    _commit_base_repo_paths,
     _copy_inputs_to_worktree,
     _commit_iteration,
     _execute_parallel_batch,
     _execute_step,
     _finalize_worktree,
     _format_runtime_error_markdown,
+    _load_inputs,
     _maybe_save_step_transcript,
+    _refresh_inputs,
     _snapshot_repo_state,
 )
 from cross_eval.runtime_env import (
@@ -155,6 +159,110 @@ class TestWorktreeInputMapping(unittest.TestCase):
                     capture_output=True,
                 )
+    def test_plan_review_docs_ref_maps_to_worktree_and_refreshes_docs(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            repo = Path(tmpdir) / "repo"
+            repo.mkdir()
+            _init_git_repo(repo)
+            docs_dir = repo / "plans"
+            docs_dir.mkdir()
+            (docs_dir / "A.md").write_text("A v1\n", encoding="utf-8")
+            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
+            subprocess.run(
+                ["git", "commit", "-m", "add docs"],
+                cwd=repo,
+                capture_output=True,
+                check=True,
+            )
+            config = PipelineConfig(
+                inputs={
+                    "docs": "stale snapshot",
+                    "docs_ref": docs_dir,
+                },
+                preset_name="plan-review",
+            )
+            input_contents = _load_inputs(config)
+            self.assertIn("A.md", input_contents["docs"])
+            worktree_dir = Path(tmpdir) / "wt"
+            branch = "cross-eval/test-docs-ref"
+            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
+            try:
+                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
+                self.assertEqual(config.inputs["docs_ref"], worktree_path / "plans")
+                updated = worktree_path / "plans" / "A.md"
+                updated.write_text("A v2\n", encoding="utf-8")
+                _refresh_inputs(config, input_contents)
+                self.assertIn("A.md", input_contents["docs"])
+                self.assertIn("A v2", input_contents["docs"])
+            finally:
+                remove_worktree(base_cwd=repo, work_dir=worktree_path)
+                subprocess.run(
+                    ["git", "branch", "-D", branch],
+                    cwd=repo,
+                    capture_output=True,
+                )
+    def test_worktree_doc_changes_apply_back_and_commit_in_base_repo(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            repo = Path(tmpdir) / "repo"
+            repo.mkdir()
+            _init_git_repo(repo)
+            docs_dir = repo / "plans"
+            docs_dir.mkdir()
+            doc_path = docs_dir / "A.md"
+            doc_path.write_text("A v1\n", encoding="utf-8")
+            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
+            subprocess.run(
+                ["git", "commit", "-m", "add docs"],
+                cwd=repo,
+                capture_output=True,
+                check=True,
+            )
+            config = PipelineConfig(
+                inputs={"docs_ref": docs_dir},
+                preset_name="plan-review",
+            )
+            original_inputs = {"docs_ref": docs_dir}
+            worktree_dir = Path(tmpdir) / "wt"
+            branch = "cross-eval/test-apply-back"
+            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
+            try:
+                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
+                worktree_doc = config.inputs["docs_ref"] / "A.md"
+                worktree_doc.write_text("A v2\n", encoding="utf-8")
+                restored = _apply_worktree_inputs_to_base(
+                    config, original_inputs, cwd=repo,
+                )
+                self.assertEqual(restored, [docs_dir])
+                self.assertEqual(doc_path.read_text(encoding="utf-8"), "A v2\n")
+                committed = _commit_base_repo_paths(
+                    repo, restored, "cross-eval: plan-review (FAIL)",
+                )
+                self.assertTrue(committed)
+                log = subprocess.run(
+                    ["git", "log", "-1", "--pretty=%s"],
+                    cwd=repo,
+                    capture_output=True,
+                    text=True,
+                    check=True,
+                )
+                self.assertEqual(log.stdout.strip(), "cross-eval: plan-review (FAIL)")
+            finally:
+                remove_worktree(base_cwd=repo, work_dir=worktree_path)
+                subprocess.run(
+                    ["git", "branch", "-D", branch],
+                    cwd=repo,
+                    capture_output=True,
+                )
     def test_classify_unknown_failure(self) -> None:
         failure_type, suggested_action = _classify_agent_failure("weird crash")
         self.assertEqual(failure_type, "UNKNOWN")
@@ -413,11 +521,13 @@ class TestInvokeAgenticRuntime(unittest.TestCase):
 class TestPipelineHelpers(unittest.TestCase):
+    @patch("cross_eval.worktree.get_current_head", return_value="a" * 40)
     @patch("cross_eval.worktree.commit_worktree", return_value=True)
-    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock) -> None:
+    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock, mock_head: MagicMock) -> None:
         with tempfile.TemporaryDirectory() as tmpdir:
-            _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
+            new_head = _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
             mock_commit.assert_called_once()
+            self.assertEqual(new_head, "a" * 40)
     def test_snapshot_repo_state_includes_untracked_digest(self) -> None:
         with tempfile.TemporaryDirectory() as tmpdir: