Compare commits

...

18 Commits

Author SHA1 Message Date
이충영 에이닷서비스개발
0bbe0f6f7b continue 2026-03-15 17:54:30 +09:00
chungyeong
28efd5bb8f fix: use incremental diff per iteration instead of cumulative base diff
After each iteration's _commit_iteration, record the new HEAD SHA and use
it as the diff anchor for the next iteration. Previously capture_diff
always diffed against the initial base commit, causing every iteration to
return the same full cumulative diff — reviewers couldn't see what changed
between iterations, leading to repeated feedback and stuck FAIL loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 10:07:11 +09:00
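The anchoring change described in this commit can be pictured with a small, git-free simulation. This is only a sketch under stated assumptions: `IterationTree`, its set-of-lines "commits", and both method bodies are hypothetical stand-ins for illustration, not cross-eval's actual `capture_diff` or `_commit_iteration`.

```python
from dataclasses import dataclass, field


@dataclass
class IterationTree:
    """Hypothetical stand-in for a worktree: each commit is a set of lines."""

    commits: list[frozenset[str]] = field(default_factory=lambda: [frozenset()])
    anchor: int = 0  # index of the commit the next incremental diff starts from

    def commit_iteration(self, lines: set[str]) -> None:
        # Assumption: every iteration ends with exactly one new commit.
        self.commits.append(frozenset(lines))

    def capture_diff(self, incremental: bool) -> set[str]:
        head = self.commits[-1]
        base = self.commits[self.anchor if incremental else 0]
        added = set(head - base)
        if incremental:
            # Record the new HEAD so the next iteration diffs from here.
            self.anchor = len(self.commits) - 1
        return added


def run_two_iterations(incremental: bool) -> list[set[str]]:
    tree = IterationTree()
    diffs = []
    for snapshot in ({"a"}, {"a", "b"}):
        tree.commit_iteration(snapshot)
        diffs.append(tree.capture_diff(incremental=incremental))
    return diffs


cumulative_diffs = run_two_iterations(incremental=False)
incremental_diffs = run_two_iterations(incremental=True)
```

With the cumulative base, the second iteration repeats the first iteration's change; with the moving anchor, it reports only what is new.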
chungyeong
bf64d19123 Fix plan-review worktree document tracking 2026-03-15 00:35:42 +09:00
chungyeong
a85a490a9b Make plan-review a review-fix-verify loop 2026-03-15 00:01:26 +09:00
chungyeong
60c7b07939 fix: capture_diff uses base commit to handle agent self-commits
Claude in agentic mode (interactive, no -p flag) commits its own changes,
advancing HEAD. This made `git diff --cached HEAD` return empty, triggering
false EMPTY_DIFF errors every time. Now capture_diff diffs against the
base commit SHA recorded at worktree creation, so changes are captured
regardless of whether the agent committed them.

Also adds UX_IMPROVEMENT_PLAN.md for guided message improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 23:59:53 +09:00
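A minimal sketch of the difference between the two diff targets mentioned in this commit. `diff_command` and the placeholder SHA are hypothetical; the sketch also assumes the caller has already staged all changes (for example with `git add -A`), which the commit message does not state.

```python
from typing import Optional


def diff_command(base_commit: Optional[str] = None) -> list[str]:
    """Build a git command that compares the index against a target commit.

    Assumption: everything in the worktree has been staged beforehand.
    """
    target = base_commit or "HEAD"
    return ["git", "diff", "--cached", target]


# Against HEAD: empty once the agent has committed its own changes.
head_relative = diff_command()
# Against a recorded base SHA (placeholder value): still shows those changes.
base_relative = diff_command("abc1234")
```

Passing the list to `subprocess.run(..., cwd=worktree_path)` would run the comparison; the only thing that changes between the two behaviors is the target commit.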
이충영 에이닷서비스개발
af05fc1ddb fix: preserve agentic branch when intermediate commits exist
_finalize_worktree was returning None and deleting the branch when the
final commit was empty, even though _commit_iteration had already
committed changes during the pipeline. Now checks git log for any
commits on the branch before deciding to clean up.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 20:48:25 +09:00
이충영 에이닷서비스개발
0858675076 fix: remove --permission-mode plan from reviewer args
Plan mode causes Claude to spend all time on tool calls (Read/Grep)
in -p mode, producing empty stdout. Reviewers receive full context
(diff, plan, checklist) via the prompt, so file access is not required.
Without --permission-mode, -p mode defaults to read-allowed, write-denied.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 18:29:53 +09:00
이충영 에이닷서비스개발
cc8d583914 fix: Claude reviewer empty output, worktree isolation false positives, and input file access
- Add -p flag to _CLAUDE_REVIEW_ARGS so reviewer uses print mode (stdin→stdout)
  instead of interactive mode which conflicts with plan permission mode
- Copy input files (plan, checklist) into worktree .cross-eval-inputs/ so
  agents in plan mode can access them without escaping the sandbox
- Simplify _snapshot_repo_state to use only git diff HEAD + untracked hashes,
  eliminating false positives from staging state changes (git diff --cached)
  and git status index drift during long-running pipelines

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 16:19:57 +09:00
chungyeong
7b95233edf feat: tighten agentic runtime handoffs and quality gates 2026-03-14 10:05:25 +09:00
chungyeong
87bc0ffbfb feat: propagate execution evidence across iterations and enhance reports
- Carry execution evidence forward so reviewer/senior prompts in
  subsequent iterations can inspect prior transcript and command data
- Add {execution_evidence} to REVIEW_ONLY templates (en/ko)
- Add evidence summary table to iteration reports
- Fix test_agentic to match stdin-based prompt delivery for Claude
- Add expanded claim/no-change marker tests and cross-iteration
  evidence propagation tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 23:36:28 +09:00
chungyeong
c467222a2a fix: instruct coder to use Edit/Write tools instead of describing changes
Claude -p mode tends to describe changes in text rather than actually
applying them via tools. Added explicit rule requiring tool-based edits
so that file modifications produce real git diffs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 23:19:22 +09:00
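The failure mode this commit addresses (text that describes edits without applying them) can be detected with a substring heuristic of the kind used by `_claims_file_changes` in agent.py. The sketch below is a simplified illustration: the marker tuples are shortened, and `described_but_not_applied` is a hypothetical helper, not a function from the repository.

```python
# Shortened, illustrative marker lists; the real tuples are much longer.
NO_CHANGE_MARKERS = ("no changes", "already correct")
CHANGE_CLAIM_MARKERS = ("implemented", "updated", "fixed")


def claims_file_changes(output: str) -> bool:
    """Heuristic: does the agent's text claim that it changed files?"""
    normalized = output.lower()
    if not normalized.strip():
        return False
    # An explicit "nothing changed" acknowledgement wins over any claim marker.
    if any(marker in normalized for marker in NO_CHANGE_MARKERS):
        return False
    return any(marker in normalized for marker in CHANGE_CLAIM_MARKERS)


def described_but_not_applied(stdout: str, diff: str) -> bool:
    """True when the text claims edits but the captured diff is empty."""
    return not diff.strip() and claims_file_changes(stdout)


claimed_only = described_but_not_applied("I updated cli.py to add the flag.", "")
real_change = described_but_not_applied("I updated cli.py.", "diff --git a/cli.py b/cli.py")
acknowledged = described_but_not_applied("No changes needed.", "")
substring_hit = described_but_not_applied("The prefixed names look fine.", "")
```

Because matching is by substring, a word such as "prefixed" also matches "fixed", so a heuristic like this can raise false positives.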
chungyeong
99cbf171aa fix: revert -p removal — Claude -p mode has full tool access
Claude -p (print mode) is non-interactive but retains full tool access
(Edit, Write, Bash, etc.) with --dangerously-skip-permissions. Removing
-p caused Claude to enter interactive mode which requires a TTY and
produces zero output when run as a subprocess with piped I/O.

Now delivers prompt via stdin for both Claude and Codex in agentic mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 23:13:12 +09:00
chungyeong
d5fcc258b7 fix: unset CLAUDECODE env var to allow nested Claude subprocess calls
Claude Code refuses to launch inside another Claude Code session.
Strip the CLAUDECODE marker from the inherited environment so that
cross-eval can spawn Claude as a subprocess from within Claude Code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 23:05:16 +09:00
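Stripping the marker from the inherited environment might look like the sketch below. `child_env` is a hypothetical helper name; only the `CLAUDECODE` variable name comes from the commit message.

```python
import os
from typing import Mapping, Optional


def child_env(parent: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return a copy of the environment without the CLAUDECODE marker."""
    source = os.environ if parent is None else parent
    return {key: value for key, value in source.items() if key != "CLAUDECODE"}


env = child_env({"PATH": "/usr/bin", "CLAUDECODE": "1"})
```

The resulting dict can be passed as `env=` to `subprocess.run`, leaving the parent process's own environment untouched.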
chungyeong
290eace01b fix: send EOF via empty stdin so Claude exits after agentic prompt
Without -p, Claude enters interactive mode and waits for more input
indefinitely. Setting input="" closes the stdin pipe immediately,
causing Claude to process the positional prompt and then exit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 23:04:13 +09:00
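The effect of `input=""` can be reproduced without Claude. In this sketch a small Python child process stands in for the agent (an assumption made purely for illustration): it blocks until stdin reaches EOF and then reports how much it read.

```python
import subprocess
import sys

# Stand-in child: waits for EOF on stdin, then prints the number of characters read.
child = [sys.executable, "-c", "import sys; data = sys.stdin.read(); print(len(data))"]

# input="" makes subprocess.run attach a pipe, write nothing, and close it,
# so the child's read() returns immediately instead of waiting for more input.
result = subprocess.run(child, input="", capture_output=True, text=True, timeout=30)
read_length = result.stdout.strip()
```

Without `input=`, the child would inherit the parent's stdin and could wait indefinitely when that stdin is an open terminal or pipe.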
chungyeong
ecf44b4c07 fix: strip -p/--print flags in agentic mode so Claude can actually modify files
The agentic invocation path inherited -p (print mode) from _CLAUDE_BASE_ARGS
but only stripped the stdin sentinel "-". Print mode makes Claude a one-shot
text completer that cannot use tools or write files, resulting in zero diffs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 23:00:40 +09:00
chungyeong
b19d174c98 feat: isolate agentic worktrees and surface execution evidence 2026-03-13 22:50:46 +09:00
chungyeong
3fb19e90c0 feat: harden runtime evidence and claude agentic validation 2026-03-13 22:29:22 +09:00
chungyeong
28dd794f54 feat: add runtime discovery and execution traces 2026-03-13 21:52:13 +09:00
55 changed files with 5096 additions and 525 deletions

7
.gitignore vendored Normal file

@@ -0,0 +1,7 @@
__pycache__/
*.py[cod]
.pytest_cache/
.idea/
output/
.cross-eval/output/
cross_eval.egg-info/

10
.idea/.gitignore generated vendored
View File

@@ -1,10 +0,0 @@
# Default ignored files
/shelf/
/workspace.xml
# Ignored default folder with query files
/queries/
# Datasource local storage ignored files
/dataSources/
/dataSources.local.xml
# Editor-based HTTP Client requests
/httpRequests/

14
.idea/cross-eval.iml generated
View File

@@ -1,14 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="PYTHON_MODULE" version="4">
<component name="NewModuleRootManager">
<content url="file://$MODULE_DIR$">
<excludeFolder url="file://$MODULE_DIR$/.venv" />
</content>
<orderEntry type="jdk" jdkName="Python 3.12 (cross-eval)" jdkType="Python SDK" />
<orderEntry type="sourceFolder" forTests="false" />
</component>
<component name="PyDocumentationSettings">
<option name="format" value="PLAIN" />
<option name="myDocStringFormat" value="Plain" />
</component>
</module>


@@ -1,6 +0,0 @@
<component name="InspectionProjectProfileManager">
<profile version="1.0">
<option name="myName" value="Project Default" />
<inspection_tool class="Eslint" enabled="true" level="WARNING" enabled_by_default="true" />
</profile>
</component>


@@ -1,6 +0,0 @@
<component name="InspectionProjectProfileManager">
<settings>
<option name="USE_PROJECT_PROFILE" value="false" />
<version value="1.0" />
</settings>
</component>

7
.idea/misc.xml generated
View File

@@ -1,7 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="Black">
<option name="sdkName" value="Python 3.12 (cross-eval)" />
</component>
<component name="ProjectRootManager" version="2" project-jdk-name="Python 3.12 (cross-eval)" project-jdk-type="Python SDK" />
</project>

8
.idea/modules.xml generated
View File

@@ -1,8 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="ProjectModuleManager">
<modules>
<module fileurl="file://$PROJECT_DIR$/.idea/cross-eval.iml" filepath="$PROJECT_DIR$/.idea/cross-eval.iml" />
</modules>
</component>
</project>


@@ -10,6 +10,8 @@ AI 에이전트 2개를 활용한 개발 워크플로우(기획→체크리스
 - Generator: `--permission-mode auto` (can read and write files)
 - Reviewer: `--permission-mode plan` (read-only exploration)
 - Set the subprocess `cwd` to the current working directory
+- The default execution mode is **direct mode**: the agentic coder also edits files directly in the current working tree.
+- An isolated git worktree is created only when `--worktree` or `use_worktree: true` is specified explicitly.
 ## User Experience (UX Flow)
@@ -34,6 +36,7 @@ ls output/v1/ v2/ final-report.md
 ```yaml
 output_dir: output
+use_worktree: false
 max_iterations: 3
 inputs:
@@ -51,10 +54,8 @@ agents:
 system_prompt: "You are a meticulous code reviewer."
 # Option 1: use a preset (no need to write the pipeline YAML by hand)
-pipeline: preset:simple # "A generates → B reviews" (default)
-# pipeline: preset:cross-review # "both generate → review each other"
-# pipeline: preset:plan-review # "review documents/plan before implementation"
-# pipeline: preset:coding-review-fix # "one initial coding pass → repeated review/fix"
+pipeline: preset:coding-plan-review # "document-based implementation → code/document review → fix → re-verify" (default)
+# pipeline: preset:plan-review # "pre-implementation document review → fix → re-verify, repeated"
 # Option 2: fully custom (for advanced users)
 # pipeline:
@@ -75,10 +76,8 @@ pipeline: preset:simple # "A generates → B reviews" (default)
 | Preset | Description | Auto-generated steps |
 |--------|------|-------------------|
-| `simple` | A codes → B reviews | coding(agent1) → review(agent2) |
-| `cross-review` | Both code, then review each other | coding_a → coding_b → review_of_b(agent_a) → review_of_a(agent_b) |
-| `plan-review` | Pre-implementation document review | parallel plan_review_* → senior_review(optional) |
-| `coding-review-fix` | Initial coding, then repeated review/fix | initial_coding(coding) → review_fix(review* → aggregate → coding → verify) |
+| `plan-review` | Repeated pre-implementation document review/fix/re-verify | plan_review_* → aggregate_review → plan_fix → verify |
+| `coding-plan-review` | Document-based implementation, then repeated code/document review/fix | initial_coding(coding) → coding_plan_review(review* → aggregate → coding_plan_fix → verify) |
 A preset automatically configures the appropriate pipeline steps and context_override internally. agent1 and agent2 are assigned in the order defined under agents. If the presets are not sufficient, you can write the steps yourself.
@@ -101,7 +100,7 @@ cross_eval/
 **models.py**: prevents circular references by keeping all dataclasses in one place:
 - `AgentConfig` (command, args, system_prompt, stdin_mode)
 - `StepConfig` (name, agent, role, prompt_template, output_key, verdict, verdict_pattern, context_override)
-- `PipelineConfig` (output_dir, max_iterations, inputs, agents, pipeline)
+- `PipelineConfig` (output_dir, use_worktree, max_iterations, inputs, agents, pipeline)
 - `AgentResult` (output, exit_code, agent_name, step_name, duration_seconds)
 - `IterationResult` (iteration, step_outputs, verdict, feedback)
 - `PipelineResult` (iterations, final_verdict, total_duration)
@@ -117,7 +116,7 @@ cross_eval/
 - `default:review`: reviews against three criteria (over-optimization, false positives, omissions), outputs `VERDICT: PASS|FAIL`, and instructs the reviewer to **"explore the project directory directly to verify the code"**
 - `{variable}` placeholders; a missing key renders as `(no {key} provided)`
 - Users can override templates with custom .md files
-- `PIPELINE_PRESETS` dict: defines the StepConfig list for each of the `simple`, `cross-review`, and `plan-review` presets
+- `PIPELINE_PRESETS` / `PHASED_PRESETS` dict: define the StepConfig/PhaseConfig for each of the `plan-review` and `coding-plan-review` presets
 **agent.py**: `invoke_agent(agent_config, prompt, cwd)`:
 - The `cwd` parameter sets the project directory → the agent can explore files in that directory
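The placeholder rule noted above (a missing key renders as `(no {key} provided)`) could be implemented along these lines. This is a sketch, not the project's actual prompts.py: `_MissingKeyDict` and `render_template` are hypothetical names.

```python
class _MissingKeyDict(dict):
    """dict that renders a placeholder note instead of raising KeyError."""

    def __missing__(self, key: str) -> str:
        return f"(no {key} provided)"


def render_template(template: str, context: dict[str, str]) -> str:
    # format_map looks keys up on the mapping directly, so __missing__ applies.
    return template.format_map(_MissingKeyDict(context))


rendered = render_template(
    "Plan:\n{plan}\n\nChecklist:\n{checklist}",
    {"plan": "Build the CLI"},
)
```

One caveat of this approach: literal braces in a template would have to be escaped as `{{` and `}}`.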
@@ -139,16 +138,21 @@ for iteration 1..max_iterations:
 Generate final-report.md
 ```
+The agentic execution path has two modes.
+- Default: direct mode (edits files directly in `cwd`)
+- Opt-in: isolated worktree mode (`--worktree` or `use_worktree: true`)
 **report.py**: final Markdown report:
 - Summary table (iteration count, verdict, duration)
 - Per-iteration details (each step's output, agent name, duration)
 - Final verdict
 **cli.py**: subcommands:
-- `cross-eval init [--dir .] [--preset simple|cross-review|plan-review]`: scaffolding (does not overwrite existing files)
-- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...]`
+- `cross-eval init [--dir .] [--preset coding-plan-review|plan-review]`: scaffolding (does not overwrite existing files)
+- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...] [--worktree]`
 - `--input key=path`: overrides or adds to the config's inputs
 - `--dry-run`: prints only the rendered prompts, without calling agents
+- `--worktree`: runs in an isolated git worktree instead of the default direct mode
 ## Files to Modify
@@ -172,10 +176,12 @@ Generate final-report.md
 4. Put some simple content in plan.md/checklist.md and do a real run with `cross-eval run --max-iter 2`
 5. Confirm that v1/ and final-report.md are created in the `output/` directory
+`--dry-run` is preview-only; the process exit code is `0` even if the actual verdict is not PASS.
 cross-eval run \
 --docs /Users/chungyeong/Desktop/Dev/new-alpha-foundry/plans/TO_CLICKHOUSE \
---preset coding-review-fix \
+--preset coding-plan-review \
 --coder claude \
 --reviewer codex \
 --reviewer codex \
@@ -185,3 +191,6 @@ Generate final-report.md
 --reviewer-effort high \
 --senior-effort xhigh \
 --max-iter 10
+cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-plan-review --lang ko --max-iter 1


@@ -51,12 +51,15 @@ cp .cross-eval/checklist-sample.md .cross-eval/checklist.md
 ### 3. Run
 ```bash
-# Default run (coding → review, up to 3 iterations)
+# Default run (direct mode in the current working tree, up to 3 iterations)
 cross-eval run
 # Preview prompts only (no agent calls, saves cost)
 cross-eval run --dry-run
+# Specify only when you want to run in an isolated git worktree
+cross-eval run --worktree
 # Change the maximum number of iterations
 cross-eval run --max-iter 5
@@ -80,6 +83,9 @@ output/
 └── final-report.md # overall summary report
 ```
+The default is **direct mode**: `cross-eval` reads and modifies files directly in the current working tree.
+Add `--worktree` to use an isolated git worktree only when a separate, isolated run is needed.
 ## Configuration (`.cross-eval/config.yaml`)
 ```yaml
@@ -101,7 +107,8 @@ agents:
 args: ["-p", "--model", "opus", "--permission-mode", "plan"]
 system_prompt: "You are a meticulous code reviewer."
-pipeline: preset:simple
+pipeline: preset:coding-plan-review
+use_worktree: false # default; set to true to use an isolated worktree
 ```
 If you edit `config.yaml` during a run, the change is picked up automatically from the next iteration.
@@ -110,16 +117,16 @@ pipeline: preset:simple
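 | Preset | Description |
 |--------|------|
-| `simple` | Agent A codes, Agent B reviews (default) |
-| `cross-review` | Both code, then cross-review each other |
-| `plan-review` | Reviews the plan/checklist/reference docs before implementation and, if needed, checks consistency with the current codebase |
-| `review-only` | Reviews existing code only, for audit purposes |
-| `review-fix` | Aggregates review results, then repeats automatic fixes and re-verification |
-| `coding-review-fix` | After one initial coding pass, aggregates review results and repeats automatic fixes and re-verification |
+| `plan-review` | Reviews the plan/checklist/reference docs before implementation, fixes the documents, and repeats through re-verification |
+| `coding-plan-review` | Implements code from the input documents, then repeats review/fix/re-verification of code and documents together |
+The two presets differ only in their role; most CLI options behave identically. For example, `--plan`, `--checklist`, `--docs`, `--coder`, `--reviewer`, `--senior`, `--max-iter`, `--dry-run`, and `--worktree` can be used the same way with both.
 ```bash
 # Initialization options
-cross-eval init --preset cross-review # cross-review preset
-cross-eval init --preset plan-review # pre-implementation document review preset
+cross-eval init --preset coding-plan-review # implementation + code/document review preset
+cross-eval init --preset plan-review # document review/fix/re-verify preset
 cross-eval init --lang en # English templates
 ```
+`cross-eval run --dry-run` only previews the prompts and pipeline configuration; the exit code is `0` even if the actual verdict is not PASS.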

178
UX_IMPROVEMENT_PLAN.md Normal file

@@ -0,0 +1,178 @@
# cross-eval UX Improvement Plan
> Raise the overall quality of user guidance messages, error messages, and help text
> so that first-time users can run the pipeline without getting stuck.
---
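## 1. Improve CLI Help Text
### 1.1 `cross-eval` main help
- [ ] Add a one-line summary to the main description saying what problem the tool solves
  - Current: "A CLI tool that automatically verifies the output of AI coding agents"
  - Improved: "Automatically cross-verifies that AI coding agents implemented what the plan specifies. Catches over-optimization, omissions, and false passes"
- [ ] Add a one-line description of each subcommand to the main help (init, doctor, demo, run)
### 1.2 `cross-eval run` help
- [ ] The preset table in the epilog is too long; add a three-line "quick selection guide"
  - Example: "Use simple if you are new, review-only for review alone, and coding-review-fix for coding + review + auto-fix"
- [ ] List the aliases (extra-high, x-high, etc.) in the `--reasoning-effort` help
- [ ] Explain how the `--target` option actually affects the prompt
- [ ] Add a summary of worktree creation and cleanup behavior to the `--agentic` flag description
- [ ] Add one line to the `--min-iter` description explaining why iterations continue after a PASS
  - Example: "For checking result stability. Re-verifies that a single PASS was not a fluke"
- [ ] Make the `--dry-run` description explicit: "preview prompts only, without calling agents"
- [ ] Clarify the agent shorthand rules (claude → claude-coder, etc.) with examples
### 1.3 `cross-eval init` help
- [ ] Make the `--guided` option more prominent, with wording such as "--guided is recommended for first-time users"
- [ ] Add one line per generated file explaining how each file is used
### 1.4 `cross-eval doctor` help
- [ ] Show up front the list of items that are checked
- [ ] Include concrete commands for what to do when authentication fails
### 1.5 `cross-eval demo` help
- [ ] Add a comparison that shows the mock vs live differences at a glance
- [ ] Emphasize that the `--escalate` option is mock-only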
---
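## 2. Improve Error Messages
### 2.1 Missing required input
- [ ] Clear error when `cross-eval run` is executed without `--plan`:
  - "A plan is required. Specify it with --plan plan.md or in inputs.plan of .cross-eval/config.yaml."
- [ ] Add an INFO message stating that defaults are in use when running without config.yaml
### 2.2 Agent failure messages
- [ ] On `AUTH` failure, suggest a concrete command to fix it
  - Claude: "Authenticate with claude login"
  - Codex: "Authenticate with codex auth"
- [ ] On `USAGE_LIMIT`, hint at which limit was hit (tokens? billing?)
- [ ] On `EMPTY_DIFF`, show "The agent did not modify any files" plus a list of possible causes
- [ ] On `WRITE_FAILURE`, print the worktree path and permission status
- [ ] On empty agent output, suggest causes, e.g. "The agent did not respond. The prompt may be too long or authentication may have expired"
### 2.3 Configuration validation errors
- [ ] Make the duplicate step name error specific about which step in which phase is duplicated
- [ ] When a nonexistent agent is referenced, include an "Available agents: ..." list (already present, but verify)
- [ ] Include the line number on YAML parse errors
### 2.4 File/path errors
- [ ] Improve "File not found: {path}" to "The file could not be found: {path}\n Current directory: {cwd}"
- [ ] When the docs directory is empty → "The reference docs folder is empty: {path}\n Add document files such as .md or .txt"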
---
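## 3. Improve Progress Messages
### 3.1 During pipeline execution
- [ ] Print a summary banner when a run starts:
```
━━━ cross-eval ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Plan: .cross-eval/plan.md
Preset: simple (coding→review→repeat)
Coder: claude-coder
Reviewer: claude-reviewer
Max iter: 3
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
- [ ] At the start of each iteration, print one line describing what the step is about to do
  - Example: "Iteration 1/3: Coder is producing the initial implementation from the plan..."
  - Example: "Iteration 2/3: Applying review feedback and fixing..."
- [ ] On timeout, print both the elapsed time and the time limit
### 3.2 Result summary
- [ ] Add the elapsed time to the final result
- [ ] On FAIL, give a brief summary: "N major issues raised in the last review"
- [ ] On ESCALATE, summarize in one or two lines why a human needs to look
- [ ] When a dry run ends, state "This is a preview. Remove --dry-run to run for real"
### 3.3 Auto-escalation guidance
- [ ] When auto-escalation triggers, explain "N consecutive FAILs → automatic escalation"
- [ ] Mention in the run help under which conditions auto-escalation triggers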
---
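## 4. Improve the First-Use Experience (Onboarding)
### 4.1 Guidance after init
- [ ] Include a real example in the plan.md template (currently only a minimal skeleton)
  - One concrete example under "## Functional Requirements"
- [ ] Add a checklist-writing guide and examples to the checklist.md template
- [ ] Make the "next steps" guidance after init more specific:
  - Current: "1. Write the plan in plan.md"
  - Improved: "1. Open .cross-eval/plan.md and write the plan (e.g., features to implement, API spec, DB schema)"
### 4.2 doctor improvements
- [ ] When checks pass, show "Ready! Run with cross-eval run --plan .cross-eval/plan.md"
- [ ] On authentication failure, include OS-specific install/authentication guide URLs
### 4.3 demo improvements
- [ ] After the demo finishes, add a "To get started in a real project:" note
- [ ] In the mock demo, explain what each step does in comment style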
---
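## 5. Terminology Consistency
- [ ] Standardize the distinction between "agent name" and "agent role"
  - Name: claude-coder, codex-reviewer (the actual execution unit)
  - Role: coder, reviewer, senior (the logical role)
- [ ] Standardize verdict notation: always uppercase `PASS` / `FAIL` / `ESCALATE`
- [ ] Settle the "preset" vs "pipeline" terminology
  - Refer to `--preset` consistently as the "pipeline type"
- [ ] Reduce mixing of Korean and English: minimize unnecessary English in Korean mode
  - Exception: verdicts such as PASS/FAIL/ESCALATE stay in English (for readability)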
---
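## 6. Output Directory Structure Guidance
- [ ] Print a summary of the output folder structure when a run completes:
```
Output: .cross-eval/output/
├── iter-1/ (agent output for each iteration)
├── iter-2/
└── final-report.md (final report)
```
- [ ] Add a short "How to read this report" note at the top of report.md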
---
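## 7. Improve config.yaml Comments
- [ ] Strengthen the explanatory comments for each section of the default generated config.yaml
- [ ] Include examples of common configuration changes as comments
  - Example: "# To use two reviewers: reviewer: [claude, codex]"
  - Example: "# Modify real files in agentic mode: agentic: true"
- [ ] Add commented examples of custom phase-based pipelines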
---
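## Priorities
| Priority | Item | Reason |
|---------|------|------|
| P0 | 2.1 Missing required input errors | The most frequently encountered problem |
| P0 | 4.1 Post-init guidance + templates | Users who get stuck on first use leave |
| P0 | 3.1 Run-start summary banner | Users need to know what is running |
| P1 | 2.2 Agent failure messages | Users do not know what to do on failure |
| P1 | 1.2 Clean up run help | The many options cause confusion |
| P1 | 5. Terminology consistency | Reduces confusion |
| P2 | 3.2-3.3 Result/progress messages | Nice to have but not urgent |
| P2 | 7. config.yaml comments | Convenience for power users |
| P2 | 6. Output structure guidance | Understood after seeing it once |
| P3 | 1.3-1.5 Remaining help text | Incremental improvement |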
---
## How to Test
After changing each item:
1. **Check help**: `cross-eval --help`, `cross-eval run --help`, etc.
2. **Check error paths**: run with deliberately invalid input → is the error message useful?
3. **Simulate first use**: the full `init → doctor → demo → run` flow in an empty directory
4. **Verify with cross-eval itself**: run cross-eval run using this document as plan.md

31
checklist.md Normal file

@@ -0,0 +1,31 @@
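# cross-eval CLI Usability Refactoring Checklist
## Core User Flow
- [ ] Clearly explains what to do after `cross-eval init`.
- [ ] Explains when and why to use `cross-eval doctor`.
- [ ] The prerequisites for running `cross-eval run` are clear.
- [ ] States that results are saved under `.cross-eval/output` after a run.
## Understanding the `run` Command
- [ ] The differences between `--preset` values can be compared quickly.
- [ ] The role differences between `--coder`, `--reviewer`, and `--senior` are explained.
- [ ] The relationship between config-based and CLI-option-based execution is clear.
- [ ] It is unambiguous which options override the config.
## Example Quality
- [ ] Representative usage examples are organized around real user goals.
- [ ] Examples are not so numerous as to be distracting; they are condensed to the key combinations.
- [ ] Basic examples for beginners are separated from advanced examples.
- [ ] Examples can be copied and run as-is.
## Refactoring Scope Control
- [ ] Do not rename existing commands or options.
- [ ] Do not change functional behavior unnecessarily.
- [ ] Keep the goal to improving guidance text, not adding new features.
- [ ] Do not expand the UI or functionality beyond the scope of the plan.
## Code Quality
- [ ] Existing tests must not break.
- [ ] Check for regressions caused by help/text changes.
- [ ] String changes do not contradict the actual output flow.
- [ ] No duplicated or conflicting explanations are introduced.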


@@ -1,6 +0,0 @@
Metadata-Version: 2.4
Name: cross-eval
Version: 0.2.0
Summary: AI agent cross-evaluation CLI tool
Requires-Python: >=3.9
Requires-Dist: pyyaml>=6.0


@@ -1,24 +0,0 @@
README.md
pyproject.toml
cross_eval/__init__.py
cross_eval/agent.py
cross_eval/cli.py
cross_eval/config.py
cross_eval/demo.py
cross_eval/doctor.py
cross_eval/models.py
cross_eval/pipeline.py
cross_eval/prompts.py
cross_eval/report.py
cross_eval/runtime_env.py
cross_eval/worktree.py
cross_eval.egg-info/PKG-INFO
cross_eval.egg-info/SOURCES.txt
cross_eval.egg-info/dependency_links.txt
cross_eval.egg-info/entry_points.txt
cross_eval.egg-info/requires.txt
cross_eval.egg-info/top_level.txt
tests/test_agentic.py
tests/test_config.py
tests/test_onboarding.py
tests/test_pipeline_integration.py


@@ -1 +0,0 @@


@@ -1,2 +0,0 @@
[console_scripts]
cross-eval = cross_eval.cli:main


@@ -1 +0,0 @@
pyyaml>=6.0


@@ -1 +0,0 @@
cross_eval


@@ -19,6 +19,76 @@ logger = logging.getLogger(__name__)
 # CLI tools that support --system-prompt flag natively
 _SYSTEM_PROMPT_AGENTS = ("claude",)
 _REASONING_EFFORT_AGENTS = ("codex",)
+_NO_CHANGE_ACK_MARKERS = (
+    "no changes",
+    "no code changes",
+    "no file changes",
+    "did not make any changes",
+    "nothing to change",
+    "no modifications were necessary",
+    "no update was necessary",
+    "already satisfied",
+    "no changes needed",
+    "no fixes needed",
+    "everything is correct",
+    "code is correct as-is",
+    "already correct",
+    "no action required",
+    "변경 없음",
+    "수정 없음",
+    "수정할 필요 없음",
+    "변경할 필요 없음",
+    "이미 올바름",
+    "조치 불필요",
+)
+_CHANGE_CLAIM_MARKERS = (
+    "summary of all changes made",
+    "here's a summary of all changes made",
+    "here is a summary of all changes",
+    "implemented",
+    "i implemented",
+    "i've implemented",
+    "added",
+    "i added",
+    "i've added",
+    "updated",
+    "i updated",
+    "i've updated",
+    "modified",
+    "i modified",
+    "i've modified",
+    "created",
+    "i created",
+    "i've created",
+    "fixed",
+    "i fixed",
+    "i've fixed",
+    "completed the changes",
+    "finished the changes",
+    "made the following changes",
+    "applied the fix",
+    "changes have been applied",
+    "wrote the code",
+    "refactored",
+    "i refactored",
+    "completed all the changes",
+    "finished implementing",
+    "all tasks completed",
+    "done with the implementation",
+    "successfully implemented",
+    "completed the implementation",
+    "all changes have been made",
+    "changes are complete",
+    "수정 완료",
+    "모든 수정이 완료",
+    "변경 요약",
+    "변경 파일",
+    "신규 생성",
+    "기획서 수정",
+    "체크리스트 수정",
+    "문서를 수정",
+    "문서 수정",
+)
 class AgentInvocationError(RuntimeError):
@@ -106,6 +176,39 @@ def _classify_agent_failure(detail: str) -> tuple[str, str]:
     )
+_WRITE_FAILURE_MARKERS = (
+    "permission denied",
+    "read-only file system",
+    "read only file system",
+    "operation not permitted",
+    "cannot write",
+    "failed to write",
+    "could not write",
+    "unable to write",
+    "sandbox",
+    "eacces",
+    "erofs",
+)
+def _has_write_failure_indicators(stderr: str) -> bool:
+    """Detect stderr patterns indicating the agent could not write files."""
+    if not stderr.strip():
+        return False
+    normalized = stderr.lower()
+    return any(marker in normalized for marker in _WRITE_FAILURE_MARKERS)
+def _claims_file_changes(output: str) -> bool:
+    """Heuristic for agent text that claims code changes were made."""
+    normalized = output.lower()
+    if not normalized.strip():
+        return False
+    if any(marker in normalized for marker in _NO_CHANGE_ACK_MARKERS):
+        return False
+    return any(marker in normalized for marker in _CHANGE_CLAIM_MARKERS)
 class _Spinner:
     """Animated spinner for long-running agent calls."""
@@ -218,6 +321,7 @@ def invoke_agent(
     else:
         input_data = prompt
+    cmd_preview = " ".join(cmd[:6])
     logger.debug("Invoking agent '%s': %s", agent.name, " ".join(cmd[:5]) + " ...")
     spinner: Optional[_Spinner] = None
@@ -259,7 +363,6 @@ def invoke_agent(
         err_detail = result.stderr.strip() or result.stdout.strip()
         if err_detail and len(err_detail) > 500:
             err_detail = err_detail[:500] + "..."
-        cmd_preview = " ".join(cmd[:6])
         failure_type, suggested_action = _classify_agent_failure(err_detail or "")
         raise AgentInvocationError(
             agent_name=agent.name,
@@ -298,12 +401,23 @@ def invoke_agent(
             agent.name, step_name,
         )
+    transcript = _build_transcript(
+        command_preview=cmd_preview,
+        stdout=result.stdout,
+        stderr=result.stderr,
+        exit_code=result.returncode,
+        duration_seconds=round(duration, 1),
+        cwd=str(cwd) if cwd else "",
+    )
     return AgentResult(
         output=output,
         exit_code=result.returncode,
         agent_name=agent.name,
         step_name=step_name,
         duration_seconds=round(duration, 1),
+        transcript=transcript,
+        command_preview=cmd_preview,
     )
@@ -315,12 +429,9 @@ def invoke_agent_agentic(
     env: Optional[dict[str, str]] = None,
     timeout: int | None = None,
     quiet: bool = False,
+    base_commit: str | None = None,
 ) -> AgentResult:
-    """Invoke an agent in agentic mode (no -p, runs in worktree, captures git diff).
-    The agent runs without print mode so it can modify files directly.
-    After the agent exits, git diff (since last commit) is captured as the output.
-    """
+    """Invoke an agent in agentic mode using the worktree as the source of truth."""
     from cross_eval.worktree import capture_diff
     # Write prompt to a temp file (outside worktree, won't appear in diffs)
@@ -334,8 +445,10 @@ def invoke_agent_agentic(
     if agent.reasoning_effort and _supports_reasoning_effort(agent.command):
         cmd.extend(["-c", f'model_reasoning_effort="{agent.reasoning_effort}"'])
-    # Strip stdin sentinel ("-") from args for agentic mode
-    args = [a for a in agent.args if a != "-"]
+    # Strip print-mode flags and stdin sentinels for agentic mode.
+    # Agentic runs should operate on the worktree and return a real git diff,
+    # not behave as a one-shot text completer.
+    args = [a for a in agent.args if a not in {"-", "-p", "--print"}]
     cmd.extend(args)
     # System prompt via flag if supported
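The filter in this hunk can be exercised on its own. A minimal sketch, with `strip_print_mode` and `PRINT_MODE_FLAGS` as hypothetical names wrapping the same set-membership test:

```python
PRINT_MODE_FLAGS = {"-", "-p", "--print"}


def strip_print_mode(args: list[str]) -> list[str]:
    """Drop print-mode flags and the stdin sentinel; keep everything else in order."""
    return [arg for arg in args if arg not in PRINT_MODE_FLAGS]


agentic_args = strip_print_mode(["-p", "--model", "opus", "-"])
```

The test is an exact match per argument, so it would also drop an option value that happens to be the literal string `-p`; a flag parser would be needed to tell the two apart.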
@@ -353,13 +466,11 @@ def invoke_agent_agentic(
         else:
             input_data = prompt
     else:
-        # claude: use positional arg with a pointer to the task file
-        # (avoids OS arg length limits for large prompts)
-        cmd.append(
-            f"Read the task file at {task_file} and execute all instructions in it. "
-            f"Work in the current directory."
-        )
+        # claude: deliver the task through stdin and let the worktree be the
+        # canonical place where files are read/written.
+        input_data = prompt
+    cmd_preview = " ".join(cmd[:6])
     logger.debug(
         "Invoking agent '%s' (agentic) in worktree: %s",
         agent.name, worktree_path,
@@ -401,7 +512,6 @@ def invoke_agent_agentic(
         err_detail = result.stderr.strip() or result.stdout.strip()
         if err_detail and len(err_detail) > 500:
             err_detail = err_detail[:500] + "..."
-        cmd_preview = " ".join(cmd[:6])
         failure_type, suggested_action = _classify_agent_failure(err_detail or "")
         raise AgentInvocationError(
             agent_name=agent.name,
@@ -412,10 +522,50 @@ def invoke_agent_agentic(
             suggested_action=suggested_action,
         )
-    # Capture git diff as the output (changes since last commit on the branch)
-    diff_output = capture_diff(worktree_path)
+    # Capture git diff as the output (changes since the base commit)
+    diff_output = capture_diff(worktree_path, base_commit=base_commit)
     if not diff_output:
+        stdout_excerpt = (result.stdout or "").strip()
+        stderr_excerpt = (result.stderr or "").strip()
+        # Detect two failure modes:
+        # 1. Agent claims changes in stdout but produced no diff
+        # 2. Agent stderr contains permission or write-failure indicators
+        claims_changes = _claims_file_changes(stdout_excerpt)
+        has_write_failure = _has_write_failure_indicators(stderr_excerpt)
+        if claims_changes or has_write_failure:
+            if spinner:
+                spinner.stop(f"[{step_name}] FAILED (empty diff)")
+            raw_error = stdout_excerpt or "(stdout empty)"
+            if stderr_excerpt:
+                raw_error = f"{raw_error}\n\n[stderr]\n{stderr_excerpt}"
+            if len(raw_error) > 2000:
+                raw_error = raw_error[:2000] + "..."
+            if has_write_failure:
+                failure_type = "WRITE_FAILURE"
+                suggested_action = (
+                    "Agent encountered file write errors (permission denied, read-only, "
+                    "or sandbox restriction). Check agent permissions and worktree state."
+                )
+            else:
+                failure_type = "EMPTY_DIFF"
+                suggested_action = (
+                    "Agent reported code changes but produced no git diff. "
+                    "Treat this run as failed and require a real worktree diff before continuing."
+                )
+            raise AgentInvocationError(
+                agent_name=agent.name,
+                step_name=step_name,
+                cmd_preview=cmd_preview,
+                raw_error=raw_error,
+                failure_type=failure_type,
+                suggested_action=suggested_action,
+            )
         diff_output = "(no changes)"
         logger.warning(
             "Agent '%s' made no file changes at step '%s'",
@@ -426,10 +576,63 @@ def invoke_agent_agentic(
if spinner: if spinner:
spinner.stop(f"[{step_name}] done — {chars} chars (agentic)") spinner.stop(f"[{step_name}] done — {chars} chars (agentic)")
transcript = _build_transcript(
command_preview=cmd_preview,
stdout=result.stdout,
stderr=result.stderr,
exit_code=result.returncode,
duration_seconds=round(duration, 1),
cwd=str(worktree_path),
)
return AgentResult( return AgentResult(
output=diff_output, output=diff_output,
exit_code=result.returncode, exit_code=result.returncode,
agent_name=agent.name, agent_name=agent.name,
step_name=step_name, step_name=step_name,
duration_seconds=round(duration, 1), duration_seconds=round(duration, 1),
transcript=transcript,
command_preview=cmd_preview,
) )
def _build_transcript(
*,
command_preview: str,
stdout: str,
stderr: str,
exit_code: int = 0,
duration_seconds: float = 0.0,
cwd: str = "",
) -> str:
"""Build a compact execution transcript for debugging/audit output."""
sections = [
"# Agent Execution Transcript",
"",
"## Command",
"```",
command_preview or "(unknown command)",
"```",
"",
]
if cwd:
sections.extend(["## Working Directory", f"`{cwd}`", ""])
sections.extend([
f"## Exit Code: {exit_code}",
"",
])
if duration_seconds > 0:
sections.extend([f"## Duration: {duration_seconds}s", ""])
sections.extend([
"## Stdout",
"```",
(stdout or "(empty)").strip(),
"```",
"",
"## Stderr",
"```",
(stderr or "(empty)").strip(),
"```",
"",
])
return "\n".join(sections)
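
Reviewer note: the empty-diff branch above relies on `_claims_file_changes` and `_has_write_failure_indicators`, whose bodies are not part of this hunk. The following is a minimal sketch of how such a classification could work, assuming simple keyword heuristics. The regex patterns and the `classify_empty_diff` helper are hypothetical illustrations, not the project's actual implementation.

```python
from __future__ import annotations

import re

# Hypothetical patterns; the real helpers are not shown in this diff.
_CHANGE_CLAIM_RE = re.compile(
    r"\b(i (have )?(updated|modified|created|edited|wrote)"
    r"|changes (were|have been) (made|applied))\b",
    re.IGNORECASE,
)
_WRITE_FAILURE_RE = re.compile(
    r"(permission denied|read-only file system|operation not permitted)",
    re.IGNORECASE,
)


def claims_file_changes(stdout: str) -> bool:
    """True if the agent's stdout reads like it claims to have edited files."""
    return bool(_CHANGE_CLAIM_RE.search(stdout))


def has_write_failure_indicators(stderr: str) -> bool:
    """True if stderr contains a typical file write error message."""
    return bool(_WRITE_FAILURE_RE.search(stderr))


def classify_empty_diff(stdout: str, stderr: str) -> str | None:
    """Return a failure type for an empty diff, or None if it looks benign.

    Write failures take precedence over claimed changes, mirroring the
    order of the checks in the hunk above.
    """
    if has_write_failure_indicators(stderr):
        return "WRITE_FAILURE"
    if claims_file_changes(stdout):
        return "EMPTY_DIFF"
    return None
```

With this shape, a run that neither claims changes nor reports write errors falls through to the existing `"(no changes)"` warning path instead of raising.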


@@ -38,7 +38,7 @@ coders: [claude-coder]
reviewers: [claude-reviewer] reviewers: [claude-reviewer]
# seniors: [codex-senior] # seniors: [codex-senior]
# Pipeline type: simple | cross-review | plan-review | review-only | review-fix | coding-review-fix # Pipeline type: plan-review | coding-plan-review
pipeline: preset:{preset} pipeline: preset:{preset}
# Iteration settings # Iteration settings
@@ -194,20 +194,12 @@ def main(argv: list[str] | None = None) -> int:
) )
init_parser.add_argument( init_parser.add_argument(
"--preset", "--preset",
default="simple", default="coding-plan-review",
choices=[ choices=["plan-review", "coding-plan-review"],
"simple",
"cross-review",
"plan-review",
"review-only",
"review-fix",
"coding-review-fix",
],
help=( help=(
"파이프라인 종류 (기본: simple). " "파이프라인 종류 (기본: coding-plan-review). "
"simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서기획검토, " "plan-review=문서리뷰수정재검증, "
"review-only=리뷰만, review-fix=리뷰수렴+자동수정, " "coding-plan-review=문서기반구현후 코드+문서 리뷰/수정/재검증"
"coding-review-fix=초기코딩후리뷰수렴"
), ),
) )
init_parser.add_argument( init_parser.add_argument(
@@ -252,9 +244,9 @@ def main(argv: list[str] | None = None) -> int:
) )
demo_parser.add_argument( demo_parser.add_argument(
"--preset", "--preset",
default="simple", default="coding-plan-review",
choices=["simple", "review-fix", "coding-review-fix"], choices=["plan-review", "coding-plan-review"],
help="데모할 파이프라인 종류 (기본: simple)", help="데모할 파이프라인 종류 (기본: coding-plan-review)",
) )
demo_parser.add_argument( demo_parser.add_argument(
"--escalate", "--escalate",
@@ -266,7 +258,7 @@ def main(argv: list[str] | None = None) -> int:
type=int, type=int,
default=None, default=None,
metavar="SEC", metavar="SEC",
help="에이전트 호출 제한 시간 (--live 전용)", help="에이전트 1회 호출 제한 시간(초). 0=무제한 (기본: 무제한, --live 전용)",
) )
# --- run --- # --- run ---
@@ -281,25 +273,12 @@ def main(argv: list[str] | None = None) -> int:
), ),
epilog=( epilog=(
"파이프라인 종류 (--preset):\n" "파이프라인 종류 (--preset):\n"
" ┌───────────────────────────────────────────────────────────────────┐\n" " ┌───────────────────────────────────────────────────────────────────┐\n"
"simple │ Coder가 코드 작성 → Reviewer가 리뷰 \n" "coding-plan-review │ 입력 문서 기반 구현 → 코드+문서 리뷰/수정\n"
" │ (기본값) │ FAIL이면 피드백 반영해서 재코딩, PASS까지 반복\n" " │ (기본값) │ → 재검증 반복 \n"
" ├───────────────────────────────────────────────────────────────────┤\n" " ├───────────────────────────────────────────────────────────────────┤\n"
" │ review-fix │ 2단계 파이프라인: \n" "plan-review │ 구현 전 문서 리뷰 → 문서 수정 → 재검증 반복\n"
" │ │ Reviewer N명 병렬 리뷰 → 취합 → 수정 → 재검증 │\n" " └─────────────────────┴──────────────────────────────────────────────┘\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ coding- │ 3단계 파이프라인: │\n"
" │ review-fix │ 초기 코딩 1회 → 리뷰 취합 → 수정 → 재검증 반복 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ plan-review │ 구현 전 기획서/체크리스트/문서를 검토 │\n"
" │ │ 필요하면 현재 코드베이스와의 정합성도 점검 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ review-only │ 코드 작성 없이 Reviewer N명이 기존 코드만 검토 │\n"
" │ │ (이미 작성된 코드의 품질 감사용) │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ cross-review │ Coder 2명이 각각 구현 → 상대방 코드를 교차 리뷰 │\n"
" │ │ (서로 다른 에이전트의 구현 비교용) │\n"
" └──────────────┴─────────────────────────────────────────────────────┘\n"
"\n" "\n"
"기본 제공 에이전트:\n" "기본 제공 에이전트:\n"
" ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n" " ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n"
@@ -316,34 +295,13 @@ def main(argv: list[str] | None = None) -> int:
"\n" "\n"
"사용 예시:\n" "사용 예시:\n"
"\n" "\n"
" 기본 실행 (Claude가 코딩하고 Claude가 리뷰):\n" " 코드 + 문서 구현/리뷰 루프 (coding-plan-review):\n"
" cross-eval run --plan plan.md\n" " cross-eval run --plan plan.md --preset coding-plan-review \\\n"
" --coder claude --reviewer codex --reviewer claude --senior codex\n"
"\n" "\n"
" Codex가 코딩, Claude가 리뷰:\n" " 문서 리뷰 + 수정 + 재검증 반복 (plan-review):\n"
" cross-eval run --plan plan.md --coder codex --reviewer claude\n"
"\n"
" 리뷰어 2명 (Claude + Codex):\n"
" cross-eval run --plan plan.md --reviewer claude --reviewer codex\n"
"\n"
" 리뷰 취합용 Senior 추가:\n"
" cross-eval run --plan plan.md --preset review-fix \\\n"
" --reviewer claude --reviewer codex --senior codex\n"
"\n"
" 리뷰 수렴 후 자동 수정 (review-fix):\n"
" cross-eval run --plan plan.md --preset review-fix \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 초기 코딩 후 리뷰 수렴 + 자동 수정 (coding-review-fix):\n"
" cross-eval run --plan plan.md --preset coding-review-fix \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 기존 코드 리뷰만 (review-only):\n"
" cross-eval run --plan plan.md --preset review-only \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 구현 전 문서/기획 검토 (plan-review):\n"
" cross-eval run --plan plan.md --preset plan-review \\\n" " cross-eval run --plan plan.md --preset plan-review \\\n"
" --reviewer claude --reviewer codex\n" " --coder claude --reviewer codex --reviewer claude --senior codex\n"
"\n" "\n"
" 모델 변경:\n" " 모델 변경:\n"
" cross-eval run --plan plan.md --model sonnet\n" " cross-eval run --plan plan.md --model sonnet\n"
@@ -420,7 +378,11 @@ def main(argv: list[str] | None = None) -> int:
) )
agent_group.add_argument( agent_group.add_argument(
"--agentic", action="store_true", default=False, "--agentic", action="store_true", default=False,
help="Coder를 agentic 모드로 실행 (worktree에서 파일 직접 수정, git diff로 결과 캡처)", help="Coder를 agentic 모드로 실행 (파일 직접 수정, git diff로 결과 캡처)",
)
agent_group.add_argument(
"--worktree", action="store_true", default=False,
help="기본 direct mode 대신 isolated git worktree에서 실행",
) )
agent_group.add_argument( agent_group.add_argument(
"--model", default=None, metavar="MODEL", "--model", default=None, metavar="MODEL",
@@ -434,20 +396,17 @@ def main(argv: list[str] | None = None) -> int:
"--reviewer-model", default=None, metavar="MODEL", "--reviewer-model", default=None, metavar="MODEL",
help="Reviewer 에이전트 모델만 변경", help="Reviewer 에이전트 모델만 변경",
) )
agent_group.add_argument(
"--senior-model", default=None, metavar="MODEL",
help="Senior 에이전트 모델만 변경",
)
# -- Pipeline -- # -- Pipeline --
pipe_group = run_parser.add_argument_group("파이프라인") pipe_group = run_parser.add_argument_group("파이프라인")
pipe_group.add_argument( pipe_group.add_argument(
"--preset", default=None, "--preset", default=None,
choices=[ choices=["plan-review", "coding-plan-review"],
"simple", help="파이프라인 종류 (기본: coding-plan-review). 각 종류 설명은 아래 참조",
"cross-review",
"plan-review",
"review-only",
"review-fix",
"coding-review-fix",
],
help="파이프라인 종류 (기본: simple). 각 종류 설명은 아래 참조",
) )
pipe_group.add_argument( pipe_group.add_argument(
"--max-iter", type=int, default=None, "--max-iter", type=int, default=None,
@@ -474,7 +433,7 @@ def main(argv: list[str] | None = None) -> int:
) )
etc_group.add_argument( etc_group.add_argument(
"--output-dir", type=Path, default=None, "--output-dir", type=Path, default=None,
help="결과 저장 디렉토리 (기본: output/)", help="결과 저장 디렉토리 (기본: .cross-eval/output/)",
) )
etc_group.add_argument( etc_group.add_argument(
"--dry-run", action="store_true", "--dry-run", action="store_true",
@@ -556,18 +515,11 @@ def cmd_demo(args: argparse.Namespace) -> int:
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
_PRESET_DESCRIPTIONS = { _PRESET_DESCRIPTIONS = {
"simple": "코딩 + 리뷰 (가장 기본)", "coding-plan-review": "입력 문서 기반 구현 후 코드+문서 리뷰/수정 반복",
"review-fix": "리뷰 → 취합 → 수정 → 재검증 반복", "plan-review": "문서 리뷰 → 수정 → 재검증 반복",
"coding-review-fix": "초기 코딩 + 리뷰 수렴 반복",
"plan-review": "구현 전 기획서/문서 검토",
"review-only": "기존 코드만 리뷰 (코딩 없음)",
"cross-review": "2명이 각각 구현 후 교차 리뷰",
} }
_PRESET_ORDER = [ _PRESET_ORDER = ["coding-plan-review", "plan-review"]
"simple", "review-fix", "coding-review-fix",
"plan-review", "review-only", "cross-review",
]
def _prompt_choice( def _prompt_choice(
@@ -636,7 +588,7 @@ def _run_guided_init(target: Path) -> dict:
coder = _prompt_text(" Coder 에이전트", default="claude") coder = _prompt_text(" Coder 에이전트", default="claude")
reviewer = _prompt_text(" Reviewer 에이전트", default="claude") reviewer = _prompt_text(" Reviewer 에이전트", default="claude")
needs_senior = preset in ("review-fix", "coding-review-fix") needs_senior = preset in ("coding-plan-review", "plan-review")
senior = "" senior = ""
if needs_senior: if needs_senior:
senior = _prompt_text(" Senior 에이전트", default=reviewer) senior = _prompt_text(" Senior 에이전트", default=reviewer)
@@ -895,10 +847,10 @@ def cmd_run(args: argparse.Namespace) -> int:
need_rebuild = args.preset is not None or args.coders or args.reviewers or args.seniors need_rebuild = args.preset is not None or args.coders or args.reviewers or args.seniors
if need_rebuild: if need_rebuild:
from cross_eval.prompts import PHASED_PRESETS from cross_eval.prompts import PHASED_PRESETS
preset = args.preset or "simple" preset = args.preset or "coding-plan-review"
# Determine which preset was configured (from YAML or defaults) # Determine which preset was configured (from YAML or defaults)
if args.preset is None and config.phases: if args.preset is None and config.phases:
preset = config.preset_name if config.preset_name != "custom" else "review-fix" preset = config.preset_name if config.preset_name != "custom" else "coding-plan-review"
elif args.preset is None and not args.coders and not args.reviewers and not args.seniors: elif args.preset is None and not args.coders and not args.reviewers and not args.seniors:
pass # no changes needed pass # no changes needed
inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles( inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles(
@@ -925,8 +877,6 @@ def cmd_run(args: argparse.Namespace) -> int:
elif preset in PIPELINE_PRESETS: elif preset in PIPELINE_PRESETS:
config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors) config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
config.phases = [] config.phases = []
if preset in {"plan-review", "review-only"} and args.max_iter is None and args.min_iter is None:
config.max_iterations = 1
sync_phased_iterations(config) sync_phased_iterations(config)
if args.max_iter is not None: if args.max_iter is not None:
@@ -947,19 +897,25 @@ def cmd_run(args: argparse.Namespace) -> int:
if coder_name in config.agents: if coder_name in config.agents:
_make_agentic(config.agents[coder_name]) _make_agentic(config.agents[coder_name])
if args.worktree:
config.use_worktree = True
ensure_fix_preset_agentic(config) ensure_fix_preset_agentic(config)
# --model: apply to ALL agents # --model: apply to ALL agents
if args.model is not None: if args.model is not None:
for agent_name in config.agents: for agent_name in config.agents:
_apply_model_override(config, agent_name, args.model) _apply_model_override(config, agent_name, args.model)
# --coder-model / --reviewer-model: apply by role # --coder-model / --reviewer-model / --senior-model: apply by role
if args.coder_model is not None: if args.coder_model is not None:
for coder_name in config.coders: for coder_name in config.coders:
_apply_model_override(config, coder_name, args.coder_model) _apply_model_override(config, coder_name, args.coder_model)
if args.reviewer_model is not None: if args.reviewer_model is not None:
for reviewer_name in config.reviewers: for reviewer_name in config.reviewers:
_apply_model_override(config, reviewer_name, args.reviewer_model) _apply_model_override(config, reviewer_name, args.reviewer_model)
if args.senior_model is not None:
for senior_name in config.seniors:
_apply_model_override(config, senior_name, args.senior_model)
# --plan / --checklist shortcuts # --plan / --checklist shortcuts
for key, val in [("plan", args.plan), ("checklist", args.checklist)]: for key, val in [("plan", args.plan), ("checklist", args.checklist)]:
@@ -981,6 +937,7 @@ def cmd_run(args: argparse.Namespace) -> int:
print(f"No files found in: {docs_dir}", file=sys.stderr) print(f"No files found in: {docs_dir}", file=sys.stderr)
return 1 return 1
config.inputs["docs"] = docs_content config.inputs["docs"] = docs_content
config.inputs["docs_ref"] = docs_dir
if args.env_files: if args.env_files:
for env_file in args.env_files: for env_file in args.env_files:
@@ -1007,7 +964,6 @@ def cmd_run(args: argparse.Namespace) -> int:
apply_input_overrides(config, overrides) apply_input_overrides(config, overrides)
# 3. Validate after all overrides # 3. Validate after all overrides
from cross_eval.config import validate_config
errors = validate_config(config) errors = validate_config(config)
if errors: if errors:
print("Config error:\n " + "\n ".join(errors), file=sys.stderr) print("Config error:\n " + "\n ".join(errors), file=sys.stderr)
@@ -1055,6 +1011,9 @@ def cmd_run(args: argparse.Namespace) -> int:
if not args.dry_run and result.run_dir: if not args.dry_run and result.run_dir:
print(f"Output: {result.run_dir}/") print(f"Output: {result.run_dir}/")
if args.dry_run:
return 0
if result.final_verdict == "ESCALATE": if result.final_verdict == "ESCALATE":
from cross_eval.report import print_escalation_report from cross_eval.report import print_escalation_report
print_escalation_report(config, result) print_escalation_report(config, result)
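
Reviewer note: the new `--senior-model` flag follows the same pattern as `--coder-model` and `--reviewer-model`: the global `--model` is applied first, then role-specific flags override it. A minimal sketch of that precedence, assuming a simplified `Config` stand-in (hypothetical, not the real `PipelineConfig`) that maps agent names directly to model names:

```python
from __future__ import annotations

import argparse
from dataclasses import dataclass, field


@dataclass
class Config:
    """Hypothetical stand-in for PipelineConfig: agent name -> model name."""
    agents: dict[str, str] = field(default_factory=dict)
    coders: list[str] = field(default_factory=list)
    reviewers: list[str] = field(default_factory=list)
    seniors: list[str] = field(default_factory=list)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cross-eval-sketch")
    # Only the two presets that remain after this change are accepted.
    parser.add_argument(
        "--preset", default=None, choices=["plan-review", "coding-plan-review"]
    )
    parser.add_argument("--model", default=None)
    parser.add_argument("--coder-model", default=None)
    parser.add_argument("--reviewer-model", default=None)
    parser.add_argument("--senior-model", default=None)
    return parser


def apply_model_overrides(config: Config, args: argparse.Namespace) -> None:
    # Step 1: --model applies to every agent.
    if args.model is not None:
        for name in config.agents:
            config.agents[name] = args.model
    # Step 2: role-specific flags win over the global one.
    for names, model in (
        (config.coders, args.coder_model),
        (config.reviewers, args.reviewer_model),
        (config.seniors, args.senior_model),
    ):
        if model is not None:
            for name in names:
                if name in config.agents:
                    config.agents[name] = model
```

For example, `--model sonnet --senior-model opus` would leave coders and reviewers on `sonnet` while the senior agent stays on `opus`.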


@@ -31,7 +31,10 @@ DEFAULT_ROLE_REASONING_EFFORTS = {
"reviewer": "medium", "reviewer": "medium",
"senior": "high", "senior": "high",
} }
FIX_STYLE_PRESETS = {"review-fix", "coding-review-fix"} FIX_STYLE_PRESETS = {
"plan-review",
"coding-plan-review",
}
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
@@ -62,32 +65,27 @@ _CLAUDE_CODER_ARGS = list(_CLAUDE_BASE_ARGS) + [
"bypassPermissions", "bypassPermissions",
] ]
_CLAUDE_REVIEW_ARGS = [ _CLAUDE_REVIEW_ARGS = list(_CLAUDE_BASE_ARGS)
"--setting-sources",
"user",
"--disable-slash-commands",
"--model",
"opus",
"--permission-mode",
"plan",
]
_CODER_SYSTEM_PROMPT = ( _CODER_SYSTEM_PROMPT = (
"You are a senior software engineer implementing code changes.\n" "You are a senior software engineer implementing code changes.\n"
"Rules:\n" "Rules:\n"
"1. FIRST explore the project directory to understand the existing codebase, " "1. FIRST explore the project directory to understand the existing codebase, "
"patterns, and conventions before writing any code.\n" "patterns, and conventions before writing any code.\n"
"2. You may decide which shell, Python, git, docker, test, and database commands " "2. You MUST use the Edit and Write tools to make ACTUAL file changes. "
"Do NOT just describe or explain changes in text — apply them directly to the files. "
"Your text output alone has no effect; only tool-based edits count.\n"
"3. You may decide which shell, Python, git, docker, test, and database commands "
"to run. The user does not need to pre-specify exact commands.\n" "to run. The user does not need to pre-specify exact commands.\n"
"3. Environment variables from configured .env files may already be loaded into " "4. Environment variables from configured .env files may already be loaded into "
"your process; use them when validating services such as ClickHouse.\n" "your process; use them when validating services such as ClickHouse.\n"
"4. Implement ONLY what the plan specifies. Do NOT add extra features, " "5. Implement ONLY what the plan specifies. Do NOT add extra features, "
"unnecessary abstractions, premature optimizations, or \"nice-to-have\" improvements.\n" "unnecessary abstractions, premature optimizations, or \"nice-to-have\" improvements.\n"
"5. Follow the project's existing coding style, naming conventions, and directory structure.\n" "6. Follow the project's existing coding style, naming conventions, and directory structure.\n"
"6. If previous review feedback is provided, fix ONLY the specific issues mentioned. " "7. If previous review feedback is provided, fix ONLY the specific issues mentioned. "
"Do NOT refactor unrelated code.\n" "Do NOT refactor unrelated code.\n"
"7. Ignore any items from previous feedback that were marked as DISMISSED or false positive.\n" "8. Ignore any items from previous feedback that were marked as DISMISSED or false positive.\n"
"8. When in doubt about scope, do LESS, not more." "9. When in doubt about scope, do LESS, not more."
) )
_REVIEWER_SYSTEM_PROMPT = ( _REVIEWER_SYSTEM_PROMPT = (
@@ -301,7 +299,10 @@ def _default_seniors_for_preset(
"""Infer a default senior agent for presets that benefit from adjudication.""" """Infer a default senior agent for presets that benefit from adjudication."""
if not ( if not (
isinstance(pipeline_raw, str) isinstance(pipeline_raw, str)
and pipeline_raw in {"preset:review-fix", "preset:coding-review-fix"} and pipeline_raw in {
"preset:plan-review",
"preset:coding-plan-review",
}
and reviewers and reviewers
): ):
return [] return []
@@ -383,9 +384,11 @@ def default_config() -> PipelineConfig:
coders = ["claude-coder"] coders = ["claude-coder"]
reviewers = ["claude-reviewer"] reviewers = ["claude-reviewer"]
seniors: list[str] = [] seniors: list[str] = []
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors) pipeline: list[StepConfig] = []
phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
return PipelineConfig( return PipelineConfig(
output_dir=Path(".cross-eval/output"), output_dir=Path(".cross-eval/output"),
use_worktree=False,
max_iterations=3, max_iterations=3,
language="ko", language="ko",
execution=ExecutionConfig(), execution=ExecutionConfig(),
@@ -395,6 +398,8 @@ def default_config() -> PipelineConfig:
reviewers=reviewers, reviewers=reviewers,
seniors=seniors, seniors=seniors,
pipeline=pipeline, pipeline=pipeline,
phases=phases,
preset_name="coding-plan-review",
) )
@@ -422,6 +427,8 @@ def load_config(path: Path) -> PipelineConfig:
def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig: def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
"""Parse raw YAML dict into PipelineConfig.""" """Parse raw YAML dict into PipelineConfig."""
project_root = config_path.parent.parent if config_path.parent.name == ".cross-eval" else config_path.parent
# --- agents --- # --- agents ---
agents: dict[str, AgentConfig] = {} agents: dict[str, AgentConfig] = {}
for name, agent_data in raw.get("agents", {}).items(): for name, agent_data in raw.get("agents", {}).items():
@@ -436,7 +443,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
) )
# --- roles: explicit or inferred --- # --- roles: explicit or inferred ---
pipeline_raw = raw.get("pipeline", "preset:simple") pipeline_raw = raw.get("pipeline", "preset:coding-plan-review")
coders_raw = raw.get("coders") coders_raw = raw.get("coders")
reviewers_raw = raw.get("reviewers") reviewers_raw = raw.get("reviewers")
seniors_raw = raw.get("seniors") seniors_raw = raw.get("seniors")
@@ -491,8 +498,13 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"): if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
preset_name = pipeline_raw.split(":", 1)[1] preset_name = pipeline_raw.split(":", 1)[1]
output_dir = Path(raw.get("output_dir", ".cross-eval/output"))
if not output_dir.is_absolute():
output_dir = project_root / output_dir
config = PipelineConfig( config = PipelineConfig(
output_dir=Path(raw.get("output_dir", ".cross-eval/output")), output_dir=output_dir,
use_worktree=bool(raw.get("use_worktree", False)),
max_iterations=int(raw.get("max_iterations", 3)), max_iterations=int(raw.get("max_iterations", 3)),
min_iterations=int(raw.get("min_iterations", 1)), min_iterations=int(raw.get("min_iterations", 1)),
verbose=bool(raw.get("verbose", False)), verbose=bool(raw.get("verbose", False)),
@@ -550,10 +562,10 @@ def _resolve_pipeline(
"""Resolve pipeline from preset string or explicit step list. """Resolve pipeline from preset string or explicit step list.
Returns (steps, phases) tuple. Only one will be non-empty. Returns (steps, phases) tuple. Only one will be non-empty.
- Simple/cross-review/plan-review/review-only → steps populated, phases empty. - plan-review → steps populated, phases empty.
- Phased presets (review-fix) → steps empty, phases populated. - coding-plan-review → steps empty, phases populated.
""" """
# Preset: "preset:simple" or "preset:review-fix" # Preset: "preset:plan-review" or "preset:coding-plan-review"
if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"): if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
preset_name = pipeline_raw.split(":", 1)[1] preset_name = pipeline_raw.split(":", 1)[1]
if preset_name in PIPELINE_PRESETS: if preset_name in PIPELINE_PRESETS:
@@ -587,7 +599,7 @@ def _resolve_pipeline(
return steps, [] return steps, []
raise ValueError( raise ValueError(
f"'pipeline' must be a preset string (e.g. 'preset:simple') " f"'pipeline' must be a preset string (e.g. 'preset:plan-review') "
f"or a list of step definitions, got {type(pipeline_raw).__name__}" f"or a list of step definitions, got {type(pipeline_raw).__name__}"
) )
@@ -695,9 +707,9 @@ def _validate_unique_step_fields(
def _make_agentic(agent: AgentConfig) -> None: def _make_agentic(agent: AgentConfig) -> None:
"""Convert an agent to agentic mode in-place (remove -p, set agentic=True).""" """Convert an agent to agentic mode in-place."""
agent.agentic = True agent.agentic = True
agent.args = [a for a in agent.args if a != "-p"] agent.args = [a for a in agent.args if a not in {"-p", "--print"}]
def sync_phased_iterations( def sync_phased_iterations(
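
Reviewer note: `_parse_raw` now anchors a relative `output_dir` to the project root rather than to the process working directory. A minimal sketch of that resolution rule as a standalone function; `resolve_output_dir` and the config file names in the usage note are hypothetical, used only for illustration:

```python
from pathlib import Path


def resolve_output_dir(config_path: Path, raw_output_dir: str = ".cross-eval/output") -> Path:
    """Resolve output_dir the way the hunk above describes.

    If the config file lives inside a `.cross-eval` directory, the project
    root is that directory's parent; otherwise it is the config file's own
    directory. Absolute paths are returned unchanged.
    """
    project_root = (
        config_path.parent.parent
        if config_path.parent.name == ".cross-eval"
        else config_path.parent
    )
    output_dir = Path(raw_output_dir)
    if not output_dir.is_absolute():
        output_dir = project_root / output_dir
    return output_dir
```

Under these assumptions, a config at `/repo/.cross-eval/config.yaml` with the default setting resolves to `/repo/.cross-eval/output`, regardless of where the command is launched from.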


@@ -165,7 +165,7 @@ CYAN = "\033[36m"
RESET = "\033[0m" RESET = "\033[0m"
def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None: def run_mock_demo(preset: str = "coding-plan-review", show_escalate: bool = False) -> None:
"""Run a simulated demo showing the full pipeline lifecycle.""" """Run a simulated demo showing the full pipeline lifecycle."""
steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS
@@ -217,7 +217,7 @@ def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
if show_escalate: if show_escalate:
print(f"\n{RED}{BOLD}{'=' * 50}") print(f"\n{RED}{BOLD}{'=' * 50}")
print(f" Escalation Report") print(" Escalation Report")
print(f"{'=' * 50}{RESET}") print(f"{'=' * 50}{RESET}")
print(f"{YELLOW}Human review required.{RESET}") print(f"{YELLOW}Human review required.{RESET}")
print(f" {RED}{RESET} Requirements are ambiguous — needs stakeholder clarification") print(f" {RED}{RESET} Requirements are ambiguous — needs stakeholder clarification")
@@ -229,7 +229,7 @@ def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
def run_live_demo( def run_live_demo(
preset: str = "simple", preset: str = "coding-plan-review",
timeout: int | None = None, timeout: int | None = None,
) -> PipelineResult: ) -> PipelineResult:
"""Run a live demo with real agents using the built-in plan.""" """Run a live demo with real agents using the built-in plan."""
@@ -255,8 +255,9 @@ def run_live_demo(
pipeline = [] pipeline = []
phases = PHASED_PRESETS[preset](coders, reviewers, seniors) phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
else: else:
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors) pipeline = []
phases = [] phases = PHASED_PRESETS["coding-plan-review"](coders, reviewers, seniors)
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:
plan_path = Path(tmpdir) / "plan.md" plan_path = Path(tmpdir) / "plan.md"

cross_eval/discovery.py Normal file

@@ -0,0 +1,330 @@
"""Repository/service discovery helpers for autonomous execution prompts."""
from __future__ import annotations
import json
from dataclasses import dataclass, field
from pathlib import Path
@dataclass
class RepoDiscovery:
    """Stack hints inferred from a repository's manifest and config files."""

    languages: set[str] = field(default_factory=set)
    package_managers: set[str] = field(default_factory=set)
    databases: set[str] = field(default_factory=set)
    services: set[str] = field(default_factory=set)
    frameworks: set[str] = field(default_factory=set)
    hints: list[str] = field(default_factory=list)


def _read_text(path: Path) -> str:
    """Return the file's text, or "" if it cannot be read or decoded as UTF-8."""
    try:
        return path.read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        return ""


def _add_if_contains(target: set[str], content: str, mapping: dict[str, str]) -> None:
    """Add mapping[needle] to target for each needle found in lowercased content.

    Matching is by plain substring, so mapping keys are expected to be lowercase.
    """
    lowered = content.lower()
    for needle, name in mapping.items():
        if needle in lowered:
            target.add(name)


# Shared mapping for database signals found in manifest content
_MANIFEST_DB_SIGNALS: dict[str, str] = {
    # PostgreSQL
    "psycopg": "postgresql",
    "asyncpg": "postgresql",
    "postgres": "postgresql",
    "pgx": "postgresql",
    # MySQL / MariaDB
    "mysql": "mysql",
    "mariadb": "mysql",
    "pymysql": "mysql",
    # MongoDB
    "pymongo": "mongodb",
    "mongodb": "mongodb",
    "mongoengine": "mongodb",
    "mongosh": "mongodb",
    # ClickHouse
    "clickhouse": "clickhouse",
    "clickhouse-driver": "clickhouse",
    "clickhouse_connect": "clickhouse",
    # Redis
    "redis": "redis",
    "ioredis": "redis",
    # SQLite
    "sqlite": "sqlite",
    "better-sqlite3": "sqlite",
    "aiosqlite": "sqlite",
    # Elasticsearch / OpenSearch
    "elasticsearch": "elasticsearch",
    "opensearch": "elasticsearch",
    # DynamoDB
    "dynamodb": "dynamodb",
    "boto3": "dynamodb",  # broad but common signal
    # Cassandra
    "cassandra-driver": "cassandra",
    "cassandra": "cassandra",
    # RabbitMQ
    "amqplib": "rabbitmq",
    "pika": "rabbitmq",
    "rabbitmq": "rabbitmq",
    # Kafka
    "kafka": "kafka",
    "confluent-kafka": "kafka",
    "kafkajs": "kafka",
    # Neo4j
    "neo4j": "neo4j",
}

# Node package.json dependency → database mapping
_NODE_DEP_DB_SIGNALS: dict[str, str] = {
    "pg": "postgresql",
    "mysql": "mysql",
    "mysql2": "mysql",
    "mongoose": "mongodb",
    "mongodb": "mongodb",
    "@clickhouse/client": "clickhouse",
    "redis": "redis",
    "ioredis": "redis",
    "prisma": "postgresql",
    "better-sqlite3": "sqlite",
    "sqlite3": "sqlite",
    "@elastic/elasticsearch": "elasticsearch",
    "@aws-sdk/client-dynamodb": "dynamodb",
    "kafkajs": "kafka",
    "amqplib": "rabbitmq",
    "neo4j-driver": "neo4j",
    "cassandra-driver": "cassandra",
    "typeorm": "postgresql",
    "sequelize": "postgresql",
    "knex": "postgresql",
}

# Docker compose service image → service name mapping
_COMPOSE_SERVICE_SIGNALS: dict[str, str] = {
    "clickhouse": "clickhouse",
    "postgres": "postgresql",
    "mysql": "mysql",
    "mariadb": "mysql",
    "mongo": "mongodb",
    "redis": "redis",
    "elasticsearch": "elasticsearch",
    "opensearch": "elasticsearch",
    "rabbitmq": "rabbitmq",
    "kafka": "kafka",
    "zookeeper": "kafka",
    "cassandra": "cassandra",
    "neo4j": "neo4j",
    "minio": "s3",
    "localstack": "aws-local",
    "dynamodb": "dynamodb",
    "memcached": "memcached",
    "nginx": "nginx",
}

# Environment variable name patterns → database mapping
_ENV_DB_PATTERNS: list[tuple[str, str]] = [
    ("CLICKHOUSE", "clickhouse"),
    ("CH_", "clickhouse"),
    ("POSTGRES", "postgresql"),
    ("PG", "postgresql"),
    ("DATABASE_URL", "postgresql"),
    ("MYSQL", "mysql"),
    ("MARIADB", "mysql"),
    ("MONGO", "mongodb"),
    ("REDIS", "redis"),
    ("ELASTICSEARCH", "elasticsearch"),
    ("OPENSEARCH", "elasticsearch"),
    ("DYNAMO", "dynamodb"),
    ("CASSANDRA", "cassandra"),
    ("KAFKA", "kafka"),
    ("RABBIT", "rabbitmq"),
    ("AMQP", "rabbitmq"),
    ("NEO4J", "neo4j"),
    ("SQLITE", "sqlite"),
]


def discover_repo(project_root: Path, env_names: set[str] | None = None) -> RepoDiscovery:
    """Infer runtime-relevant stack hints from common manifest/config files."""
    discovery = RepoDiscovery()
    env_names = {name.upper() for name in (env_names or set())}

    file_map: dict[str, Path] = {
        "pyproject": project_root / "pyproject.toml",
        "requirements": project_root / "requirements.txt",
        "requirements_dev": project_root / "requirements-dev.txt",
        "setup_py": project_root / "setup.py",
        "setup_cfg": project_root / "setup.cfg",
        "package": project_root / "package.json",
        "go_mod": project_root / "go.mod",
        "cargo": project_root / "Cargo.toml",
        "gemfile": project_root / "Gemfile",
        "build_gradle": project_root / "build.gradle",
        "build_gradle_kts": project_root / "build.gradle.kts",
        "pom": project_root / "pom.xml",
        "composer": project_root / "composer.json",
        "mix": project_root / "mix.exs",
        "docker_compose": project_root / "docker-compose.yml",
        "docker_compose_alt": project_root / "docker-compose.yaml",
        "compose": project_root / "compose.yaml",
        "prisma": project_root / "prisma" / "schema.prisma",
        "dockerfile": project_root / "Dockerfile",
    }

    # ---- Language detection ----
    if (
        file_map["pyproject"].exists()
        or file_map["requirements"].exists()
        or file_map["requirements_dev"].exists()
        or file_map["setup_py"].exists()
        or file_map["setup_cfg"].exists()
    ):
        discovery.languages.add("python")
    if file_map["package"].exists():
        discovery.languages.add("node")
    if file_map["go_mod"].exists():
        discovery.languages.add("go")
    if file_map["cargo"].exists():
        discovery.languages.add("rust")
    if file_map["gemfile"].exists():
        discovery.languages.add("ruby")
    if file_map["build_gradle"].exists() or file_map["build_gradle_kts"].exists() or file_map["pom"].exists():
        discovery.languages.add("java")
    if file_map["composer"].exists():
        discovery.languages.add("php")
    if file_map["mix"].exists():
        discovery.languages.add("elixir")

    # ---- Package manager detection ----
    if file_map["pyproject"].exists() or file_map["requirements"].exists() or file_map["setup_py"].exists():
        discovery.package_managers.add("pip")
    if file_map["package"].exists():
        try:
            package_json = json.loads(_read_text(file_map["package"]) or "{}")
        except json.JSONDecodeError:
            package_json = {}
        pm = package_json.get("packageManager")
        if isinstance(pm, str) and pm:
            discovery.package_managers.add(pm.split("@", 1)[0])
        else:
            # Check for lockfiles to distinguish npm/yarn/pnpm
            if (project_root / "pnpm-lock.yaml").exists():
                discovery.package_managers.add("pnpm")
            elif (project_root / "yarn.lock").exists():
                discovery.package_managers.add("yarn")
            else:
                discovery.package_managers.add("npm")
    if file_map["go_mod"].exists():
        discovery.package_managers.add("go")
    if file_map["cargo"].exists():
        discovery.package_managers.add("cargo")
    if file_map["gemfile"].exists():
        discovery.package_managers.add("bundler")
    if file_map["build_gradle"].exists() or file_map["build_gradle_kts"].exists():
        discovery.package_managers.add("gradle")
    if file_map["pom"].exists():
        discovery.package_managers.add("maven")
    if file_map["composer"].exists():
        discovery.package_managers.add("composer")
    if file_map["mix"].exists():
        discovery.package_managers.add("mix")

    # ---- Gather manifest content ----
    manifests = {
        name: _read_text(path)
        for name, path in file_map.items()
        if path.exists()
    }
    combined = "\n".join(manifests.values())

    # ---- Database detection from manifest content ----
    _add_if_contains(discovery.databases, combined, _MANIFEST_DB_SIGNALS)

    # ---- Node.js dependency-specific detection ----
    if file_map["package"].exists():
        try:
            package_json = json.loads(_read_text(file_map["package"]) or "{}")
        except json.JSONDecodeError:
            package_json = {}
        deps = {
            **(package_json.get("dependencies") or {}),
            **(package_json.get("devDependencies") or {}),
        }
        dep_blob = "\n".join(deps.keys()).lower()
        _add_if_contains(discovery.databases, dep_blob, _NODE_DEP_DB_SIGNALS)

    # ---- Framework detection from manifest content ----
    _add_if_contains(
        discovery.frameworks,
        combined,
        {
            "fastapi": "fastapi",
            "django": "django",
            "flask": "flask",
            "express": "express",
            "nextjs": "next.js",
            "next": "next.js",
            "nestjs": "nestjs",
            "spring": "spring",
            "rails": "rails",
            "laravel": "laravel",
"phoenix": "phoenix",
"gin": "gin",
"actix": "actix",
},
)
# ---- Database detection from environment variable names ----
for env_name in env_names:
for pattern, db_name in _ENV_DB_PATTERNS:
if pattern in env_name or env_name.startswith(pattern):
discovery.databases.add(db_name)
break
# ---- Docker compose service detection ----
compose_blob = "\n".join(
manifests.get(key, "")
for key in ("docker_compose", "docker_compose_alt", "compose")
).lower()
_add_if_contains(discovery.services, compose_blob, _COMPOSE_SERVICE_SIGNALS)
# ---- Hints from config files ----
if file_map["prisma"].exists():
discovery.hints.append("Prisma schema detected.")
if (project_root / "alembic.ini").exists():
discovery.hints.append("Alembic migration config detected.")
if (project_root / "knexfile.js").exists() or (project_root / "knexfile.ts").exists():
discovery.hints.append("Knex migration config detected.")
if (project_root / "ormconfig.json").exists() or (project_root / "ormconfig.ts").exists():
discovery.hints.append("TypeORM config detected.")
if (project_root / "drizzle.config.ts").exists():
discovery.hints.append("Drizzle ORM config detected.")
if (project_root / "Makefile").exists():
discovery.hints.append("Makefile available for build/task automation.")
if file_map["dockerfile"].exists() or (project_root / "docker").exists() or discovery.services:
discovery.hints.append("Containerized services may be available for local verification.")
return discovery


def format_repo_discovery(discovery: RepoDiscovery) -> str:
    """Render discovery results into a compact prompt summary."""
    lines: list[str] = []
    if discovery.languages:
        lines.append("Detected languages: " + ", ".join(sorted(discovery.languages)))
    if discovery.package_managers:
        lines.append("Likely package managers: " + ", ".join(sorted(discovery.package_managers)))
    if discovery.databases:
        lines.append("Detected databases/services in code or env: " + ", ".join(sorted(discovery.databases)))
    if discovery.services:
        lines.append("Detected local service containers: " + ", ".join(sorted(discovery.services)))
    if discovery.frameworks:
        lines.append("Detected frameworks: " + ", ".join(sorted(discovery.frameworks)))
    if discovery.hints:
        lines.extend(discovery.hints)
    if not lines:
        return "No strong runtime/service signals were detected from repository manifests."
    return "\n".join(lines)
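
The Node package-manager branch of the discovery code above can be exercised on its own. The following is a minimal sketch, not part of this change: `detect_node_package_manager` is a hypothetical helper name, and it reads `package.json` directly instead of going through the repository's `_read_text`. It mirrors the same precedence: an explicit `packageManager` field wins, then `pnpm-lock.yaml`, then `yarn.lock`, with `npm` as the fallback.

```python
import json
from pathlib import Path
from typing import Optional


def detect_node_package_manager(project_root: Path) -> Optional[str]:
    """Hypothetical helper: guess the Node package manager for a project root."""
    package_file = project_root / "package.json"
    if not package_file.exists():
        return None  # Not a Node project as far as this check can tell.
    try:
        package_json = json.loads(package_file.read_text(encoding="utf-8") or "{}")
    except json.JSONDecodeError:
        package_json = {}
    if not isinstance(package_json, dict):
        package_json = {}  # Guard against a manifest whose top level is not an object.

    # An explicit "packageManager" field such as "pnpm@9.1.0" takes precedence.
    pm = package_json.get("packageManager")
    if isinstance(pm, str) and pm:
        return pm.split("@", 1)[0]

    # Otherwise fall back to lockfiles, then to npm as the default.
    if (project_root / "pnpm-lock.yaml").exists():
        return "pnpm"
    if (project_root / "yarn.lock").exists():
        return "yarn"
    return "npm"
```

A malformed `package.json` is treated like an empty one here, so the lockfile fallback still applies.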


@@ -3,7 +3,7 @@ from __future__ import annotations
 import shutil
 import subprocess
-from dataclasses import dataclass, field
+from dataclasses import dataclass
 from pathlib import Path
 from typing import Optional


@@ -62,6 +62,7 @@ class PipelineConfig:
     """Full cross-eval configuration."""
     output_dir: Path = field(default_factory=lambda: Path(".cross-eval/output"))
+    use_worktree: bool = False
     max_iterations: int = 3
     min_iterations: int = 1
     verbose: bool = False
@@ -88,6 +89,8 @@ class AgentResult:
     agent_name: str
     step_name: str
     duration_seconds: float
+    transcript: str = ""
+    command_preview: str = ""
 
 
 @dataclass

File diff suppressed because it is too large.


@@ -15,53 +15,39 @@ from cross_eval.models import PhaseConfig, StepConfig
CODING_TEMPLATE = """\ CODING_TEMPLATE = """\
You are tasked with implementing code based on a plan and checklist. You are tasked with implementing code based on a plan and checklist.
## Plan ## Artifact References
{plan} {artifact_references}
## Checklist
{checklist}
## Reference Documents
{docs}
## Previous Review Feedback
{feedback}
## Iteration ## Iteration
This is iteration {iteration} of {max_iterations}. This is iteration {iteration} of {max_iterations}.
## Instructions ## Instructions
1. Explore the project directory to understand the existing codebase structure. 1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
2. Implement ONLY what the plan specifies. Do NOT add extra features, \ 2. Explore the project directory and git state to understand the current codebase structure.
3. Implement ONLY what the plan specifies. Do NOT add extra features, \
unnecessary abstractions, or premature optimizations. unnecessary abstractions, or premature optimizations.
3. Follow every item in the checklist. 4. Follow every item in the checklist.
4. If there is previous feedback, address ONLY the specific issues mentioned. 5. If there is previous feedback in the referenced markdown artifacts, address ONLY those issues.
5. If previous feedback contains items marked as DISMISSED or false positive, \ 6. If previous feedback contains items marked as DISMISSED or false positive, \
IGNORE those items — they have been verified as correct. IGNORE those items — they have been verified as correct.
6. Output the complete implementation. 7. Prefer git and markdown artifacts as the source of truth. Use commit hashes, `git show`, `git diff`, and referenced markdown files instead of relying on inline summaries.
8. Output the complete implementation.
""" """
REVIEW_TEMPLATE = """\ REVIEW_TEMPLATE = """\
You are tasked with reviewing code against a plan and checklist. You are tasked with reviewing code against a plan and checklist.
## Plan ## Artifact References
{plan} {artifact_references}
## Checklist ## Execution Evidence
{checklist} {execution_evidence}
## Reference Documents
{docs}
## Coding Output / Previous Step Output
{coding_output}
## Previous Review Feedback
{feedback}
## Review Instructions ## Review Instructions
Explore the project directory to understand the full codebase context, \ Read the referenced plan/checklist/docs/review artifacts directly from disk. \
then evaluate the code against ONLY the plan and checklist above. Inspect the referenced commit/git state and markdown artifacts, then evaluate \
the code against ONLY the plan and checklist. Use the execution evidence above \
to verify agent claims against actual command outputs, artifact paths, and exit codes.
For each issue found, classify it with BOTH severity AND category: For each issue found, classify it with BOTH severity AND category:
@@ -122,51 +108,36 @@ Otherwise output: VERDICT: FAIL
CODING_TEMPLATE_KO = """\ CODING_TEMPLATE_KO = """\
당신은 기획서와 체크리스트를 기반으로 코드를 구현하는 개발자입니다. 당신은 기획서와 체크리스트를 기반으로 코드를 구현하는 개발자입니다.
## 기획서 ## 참조 아티팩트
{plan} {artifact_references}
## 체크리스트
{checklist}
## 참고 문서
{docs}
## 이전 리뷰 피드백
{feedback}
## 반복 정보 ## 반복 정보
현재 {max_iterations}회 중 {iteration}번째 반복입니다. 현재 {max_iterations}회 중 {iteration}번째 반복입니다.
## 지침 ## 지침
1. 프로젝트 디렉토리를 탐색하여 기존 코드베이스 구조를 파악하세요. 1. 참조된 plan/checklist/docs/review markdown를 직접 읽으세요.
2. 기획서에 명시된 것만 구현하세요. 추가 기능, 불필요한 추상화, 과도한 최적화를 하지 마세요. 2. 프로젝트 디렉토리와 git 상태를 탐색하여 현재 코드베이스 구조를 파악하세요.
3. 체크리스트의 모든 항목을 충족하세요. 3. 기획서에 명시된 것만 구현하세요. 추가 기능, 불필요한 추상화, 과도한 최적화를 하지 마세요.
4. 이전 리뷰 피드백이 있다면 해당 이슈만 해결하세요. 4. 체크리스트의 모든 항목을 충족하세요.
5. 이전 피드백에서 DISMISSED 또는 오탐으로 표시된 항목은 무시하세요 — 이미 올바른 것으로 검증되었습니다. 5. 참조된 이전 리뷰 피드백이 있다면 해당 이슈만 해결하세요.
6. 완전한 구현을 출력하세요. 6. 이전 피드백에서 DISMISSED 또는 오탐으로 표시된 항목은 무시하세요 — 이미 올바른 것으로 검증되었습니다.
7. inline 요약보다 git commit hash, `git show`, `git diff`, markdown 아티팩트를 우선 사용하세요.
8. 완전한 구현을 출력하세요.
""" """
REVIEW_TEMPLATE_KO = """\ REVIEW_TEMPLATE_KO = """\
당신은 기획서와 체크리스트 기준으로 코드를 검토하는 리뷰어입니다. 당신은 기획서와 체크리스트 기준으로 코드를 검토하는 리뷰어입니다.
## 기획서 ## 참조 아티팩트
{plan} {artifact_references}
## 체크리스트 ## 실행 증거
{checklist} {execution_evidence}
## 참고 문서
{docs}
## 검토 대상 코드
{coding_output}
## 이전 리뷰 피드백
{feedback}
## 검토 지침 ## 검토 지침
프로젝트 디렉토리를 직접 탐색하여 전체 코드베이스 맥락을 파악한 뒤, \ 참조된 plan/checklist/docs/review markdown와 git 상태를 직접 읽고, \
위 기획서와 체크리스트 기준으로만 코드를 평가하세요. 그 내용을 기준으로만 코드를 평가하세요. \
위 실행 증거를 활용하여 에이전트의 주장을 실제 명령어 출력, 아티팩트 경로, 종료 코드로 검증하세요.
발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요: 발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요:
@@ -234,9 +205,14 @@ You are tasked with reviewing existing code against a plan and checklist.
 ## Previous Review (iteration {iteration} of {max_iterations})
 {feedback}
 
+## Execution Evidence
+{execution_evidence}
+
 ## Review Instructions
 Explore the project directory thoroughly to understand the full codebase, \
-then evaluate the EXISTING code against ONLY the plan and checklist above.
+then evaluate the EXISTING code against ONLY the plan and checklist above. \
+Use the execution evidence above to verify agent claims against actual \
+command outputs and exit codes.
 
 You are NOT generating or modifying code. You are auditing what already exists.
@@ -293,21 +269,16 @@ Otherwise output: VERDICT: FAIL
REVIEW_ONLY_TEMPLATE_KO = """\ REVIEW_ONLY_TEMPLATE_KO = """\
당신은 기존 코드를 기획서와 체크리스트 기준으로 감사하는 리뷰어입니다. 당신은 기존 코드를 기획서와 체크리스트 기준으로 감사하는 리뷰어입니다.
## 기획서 ## 참조 아티팩트
{plan} {artifact_references}
## 체크리스트 ## 실행 증거
{checklist} {execution_evidence}
## 참고 문서
{docs}
## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
{feedback}
## 검토 지침 ## 검토 지침
프로젝트 디렉토리를 직접 탐색하여 전체 코드베이스를 파악한 뒤, \ 참조된 plan/checklist/docs/review markdown와 git 상태를 직접 읽고, \
위 기획서와 체크리스트 기준으로 **기존 코드**를 평가하세요. 그 내용을 기준으로 **기존 코드**를 평가하세요. \
위 실행 증거를 활용하여 에이전트의 주장을 실제 명령어 출력, 아티팩트 경로, 종료 코드로 검증하세요.
코드를 생성하거나 수정하지 마세요. 이미 존재하는 코드를 감사하는 것이 목적입니다. 코드를 생성하거나 수정하지 마세요. 이미 존재하는 코드를 감사하는 것이 목적입니다.
@@ -501,8 +472,48 @@ PLAN_REVIEW_TEMPLATE_KO = """\
그렇지 않으면: VERDICT: FAIL 그렇지 않으면: VERDICT: FAIL
""" """
AGGREGATE_REVIEW_TEMPLATE = """\ PLAN_FIX_TEMPLATE = """\
You are adjudicating multiple review results and turning them into an actionable decision. You are tasked with revising planning documents based on adjudicated review feedback.
## Artifact References
{artifact_references}
## Current Review Feedback
{feedback}
## Instructions
1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
2. Update the planning package itself: the plan, checklist, and reference documents as needed.
3. Do NOT write or modify production code. Only revise planning artifacts.
4. Address ONLY the confirmed planning issues from the current review feedback.
5. If feedback marks any item as DISMISSED or false positive, leave it unchanged.
6. Make the smallest document changes that resolve ambiguity, omissions, scope creep, or repository compatibility issues.
7. Keep the plan, checklist, and supporting docs internally consistent after your edits.
8. After editing, briefly summarize what you changed and any blocker that still needs human input.
"""
PLAN_FIX_TEMPLATE_KO = """\
당신은 시니어 리뷰 결과를 바탕으로 기획 문서를 수정하는 담당자입니다.
## 참조 아티팩트
{artifact_references}
## 현재 리뷰 피드백
{feedback}
## 지침
1. 참조된 plan/checklist/docs/review markdown를 직접 읽으세요.
2. 수정 대상은 기획 패키지 자체입니다. 필요에 따라 기획서, 체크리스트, 참고 문서를 수정하세요.
3. 프로덕션 코드를 작성하거나 수정하지 마세요. 기획 문서만 고치세요.
4. 현재 리뷰 피드백에서 확정된 기획 이슈만 해결하세요.
5. DISMISSED 또는 오탐으로 정리된 항목은 건드리지 마세요.
6. 모호성, 누락, 과도한 범위, 저장소 정합성 문제를 해소하는 최소한의 문서 수정만 하세요.
7. 수정 후에도 기획서, 체크리스트, 참고 문서가 서로 모순되지 않게 유지하세요.
8. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
"""
PLAN_VERIFY_TEMPLATE = """\
You are verifying the latest planning package after plan-only revisions.
## Plan ## Plan
{plan} {plan}
@@ -513,30 +524,239 @@ You are adjudicating multiple review results and turning them into an actionable
## Reference Documents ## Reference Documents
{docs} {docs}
## Candidate Outputs ## Previous Review (iteration {iteration} of {max_iterations})
{feedback}
## Execution Evidence
{execution_evidence}
## Verify Instructions
Review the latest planning package itself: the plan, checklist, and reference documents.
You MAY inspect the current repository to confirm that the documents describe the current reality accurately enough.
Do NOT require production code, scripts, infrastructure, or external environments to already be fixed.
For `plan-review`, PASS means the documents are now clear enough to execute without further document edits.
A known implementation gap, repo mismatch, legacy script problem, external dependency, or environment blocker is NOT a FAIL by itself if:
- the issue is described accurately in the planning package,
- the affected scope or gate is documented clearly,
- the required follow-up action or non-go condition is documented clearly, and
- the package does not misrepresent unresolved work as already complete.
Only mark FAIL when the planning package still needs correction, such as:
- unresolved ambiguity or contradiction in the documents,
- missing prerequisite, dependency, gate, ownership, or evidence rule,
- a known blocker that is still described inaccurately or misleadingly,
- conflicting source-of-truth rules across the planning documents,
- checklist or status criteria that would cause an operator to make the wrong decision.
Report implementation/repository problems that are already documented correctly under "Out of Scope Issues" or note them as documented risks, not as FAIL reasons.
## Output Format
### Remaining Document Issues
- [Major][Omission] Description (reference specific plan/checklist/doc item)
(Write "None" if no document issue remains.)
### Documented Risks / Out of Scope
- Description of a real implementation/repository/environment risk that is already documented correctly
(Write "None" if nothing notable remains.)
### Summary
- Remaining document issues: N
- Documented risks / out-of-scope items: N
- Overall quality: [BRIEF ASSESSMENT]
### Verdict
If the planning package no longer needs document changes, output: VERDICT: PASS
Otherwise output: VERDICT: FAIL
"""
PLAN_VERIFY_TEMPLATE_KO = """\
당신은 plan-only 수정 이후 최신 기획 패키지를 재검증하는 검토자입니다.
## 기획서
{plan}
## 체크리스트
{checklist}
## 참고 문서
{docs}
## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
{feedback}
## 실행 증거
{execution_evidence}
## 검증 지침
최신 기획 패키지 자체를 다시 검토하세요: 기획서, 체크리스트, 참고 문서를 함께 봅니다.
현재 저장소를 살펴보며 문서가 현실을 정확히 설명하는지 확인할 수는 있지만, 프로덕션 코드, 스크립트, 인프라, 외부 환경이 이미 수정되어 있을 것을 요구하면 안 됩니다.
`plan-review`에서 PASS의 뜻은 "이제 문서를 더 고칠 필요 없이 이 계획을 실행할 수 있다"입니다.
즉 구현 공백, 저장소 불일치, legacy 스크립트 문제, 외부 의존성, 환경 blocker가 남아 있어도 아래 조건을 만족하면 FAIL 사유가 아닙니다.
- 그 문제가 기획 패키지에 정확히 기록되어 있고
- 어떤 범위/게이트에 영향을 주는지 분명히 적혀 있고
- 필요한 후속 조치나 non-go 조건이 명확히 적혀 있고
- 아직 해결되지 않은 일을 이미 해결된 것처럼 오해하게 만들지 않는 경우
반대로 아래와 같은 경우에만 FAIL로 판정하세요.
- 문서 안에 아직 모호성이나 모순이 남아 있는 경우
- 선행조건, 의존성, 게이트, 담당 주체, evidence 규칙이 빠진 경우
- 알려진 blocker가 여전히 부정확하거나 오해를 부르는 방식으로 서술된 경우
- 기획 문서들 사이에서 source-of-truth 규칙이 충돌하는 경우
- 체크리스트나 상태 판정 기준 때문에 실행자가 잘못된 결정을 내릴 수 있는 경우
이미 문서에 정확히 기록된 구현/저장소 문제는 "범위 밖 이슈" 또는 "문서화된 리스크"로만 남기고, 그 자체를 FAIL 사유로 삼지 마세요.
## 출력 형식
### 남은 문서 이슈
- [Major][누락] 이슈 설명 (관련 기획서/체크리스트/참고 문서 항목 참조)
(남은 문서 이슈가 없으면 "없음"이라고 작성하세요.)
### 문서화된 리스크 / 범위 밖 이슈
- 실제 구현/저장소/환경 리스크이지만 문서에는 이미 정확히 반영된 항목
(해당 사항이 없으면 "없음"이라고 작성하세요.)
### 요약
- 남은 문서 이슈 수: N
- 문서화된 리스크 / 범위 밖 항목 수: N
- 전체 품질: [간략한 평가]
### 판정
기획 패키지를 더 수정할 필요가 없으면: VERDICT: PASS
그렇지 않으면: VERDICT: FAIL
"""
CODING_PLAN_REVIEW_TEMPLATE = """\
You are reviewing both the implementation and the planning package together.
## Artifact References
{artifact_references}
## Execution Evidence
{execution_evidence}
## Review Instructions
Read the referenced plan/checklist/docs/review artifacts directly from disk. \
Inspect the current repository and evaluate BOTH:
1. whether the implementation matches the plan/checklist/docs, and
2. whether the planning package still accurately describes the implementation target and constraints.
Report only issues that matter to delivering the original plan correctly. \
Do not invent new scope. Distinguish between code issues, document issues, and consistency gaps between them.
For each issue found, classify it with BOTH severity AND category:
- Severity: Critical / Major / Minor
- Category: Over-engineering / Omission
If previous review feedback is provided above, mark each prior item as CONFIRMED or DISMISSED.
If you find issues outside the original plan scope, report them separately under "Out of Scope Issues".
### Verdict
If the implementation satisfies the plan/checklist and the planning package no longer needs correction, output: VERDICT: PASS
Otherwise output: VERDICT: FAIL
"""
CODING_PLAN_REVIEW_TEMPLATE_KO = """\
당신은 구현 결과와 기획 문서 패키지를 함께 검토하는 리뷰어입니다.
## 참조 아티팩트
{artifact_references}
## 실행 증거
{execution_evidence}
## 검토 지침
참조된 plan/checklist/docs/review markdown를 직접 읽고 현재 저장소를 확인한 뒤, 아래 두 가지를 함께 평가하세요.
1. 현재 구현이 plan/checklist/docs와 일치하는가
2. 기획 문서 패키지가 현재 구현 목표와 제약을 여전히 정확하게 설명하는가
원래 계획을 제대로 완수하는 데 필요한 이슈만 보고하세요. 새로운 범위를 만들지 마세요.
코드 이슈, 문서 이슈, 코드-문서 불일치를 구분해서 적으세요.
발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요.
- 심각도: Critical / Major / Minor
- 카테고리: 과최적화 / 누락
이전 리뷰 피드백이 있으면 각 항목을 CONFIRMED 또는 DISMISSED로 판정하세요.
원래 계획 범위 밖 이슈는 "범위 밖 이슈"로 별도 분리하세요.
### 판정
구현이 plan/checklist를 충족하고 기획 문서 패키지도 더 이상 수정할 필요가 없으면: VERDICT: PASS
그렇지 않으면: VERDICT: FAIL
"""
CODING_PLAN_FIX_TEMPLATE = """\
You are fixing confirmed issues in both the implementation and the planning package.
## Artifact References
{artifact_references}
## Current Review Feedback
{feedback}
## Instructions
1. Read the referenced plan/checklist/docs/review artifacts directly from disk.
2. Fix ONLY the confirmed issues from the current review feedback.
3. You may update both implementation files and planning artifacts when needed.
4. Preserve the original plan intent and scope. Do not silently broaden requirements.
5. Keep code, plan, checklist, and supporting docs consistent after edits.
6. After editing, briefly summarize what you changed and any blocker that still needs human input.
"""
CODING_PLAN_FIX_TEMPLATE_KO = """\
당신은 현재 리뷰에서 확정된 이슈를 코드와 기획 문서 패키지에 함께 반영하는 수정 담당자입니다.
## 참조 아티팩트
{artifact_references}
## 현재 리뷰 피드백
{feedback}
## 지침
1. 참조된 plan/checklist/docs/review markdown를 직접 읽으세요.
2. 현재 리뷰 피드백에서 확정된 이슈만 수정하세요.
3. 필요하면 코드와 기획 문서를 모두 수정할 수 있습니다.
4. 최초 plan의 의도와 범위를 유지하세요. 요구사항을 몰래 넓히지 마세요.
5. 수정 후 코드, plan, checklist, 참고 문서가 서로 모순되지 않게 유지하세요.
6. 수정이 끝나면 무엇을 바꿨는지와 아직 사람 판단이 필요한 blocker가 있는지 짧게 정리하세요.
"""
AGGREGATE_REVIEW_TEMPLATE = """\
You are adjudicating multiple review results and turning them into an actionable decision.
## Artifact References
{artifact_references}
## Candidate Artifact Under Review
{candidate_outputs} {candidate_outputs}
## Reviewer Findings ## Reviewer Findings Bundle
{reviews_bundle} {reviews_bundle}
## Previous Verification Feedback
{feedback}
## Previous Issue Tracker ## Previous Issue Tracker
{previous_senior_tracker} {previous_senior_tracker}
## Execution Evidence
{execution_evidence}
## Instructions ## Instructions
Explore the project directory to confirm the current codebase state. Then: Read the referenced plan/checklist/docs/review artifacts directly from disk. \
Inspect the repository and referenced artifacts only as needed to confirm the \
current target state. Use the execution evidence above to verify claims against \
actual command outputs, artifact paths, and exit codes. Then:
1. Deduplicate overlapping issues across reviewers. 1. Deduplicate overlapping issues across reviewers.
2. Resolve disagreements explicitly. 2. Resolve disagreements explicitly.
3. Keep only issues supported by the plan, checklist, code, or reviewer evidence. 3. Keep only issues supported by the plan, checklist, reference docs, repository state, or reviewer evidence.
4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up. 4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up.
5. Produce a prioritized action list for the coder. 5. Produce a prioritized action list for the implementer/editor.
6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues). 6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues).
7. If no confirmed issue remains, output VERDICT: PASS. 7. If no confirmed issue remains, output VERDICT: PASS.
8. If issues exist that the coder can fix, output VERDICT: FAIL. 8. If issues exist that the implementer/editor can fix, output VERDICT: FAIL.
9. If issues require human intervention (ambiguous requirements, architecture decisions, \ 9. If issues require human intervention (ambiguous requirements, architecture decisions, \
external dependency problems, or the same issue persists after 2+ fix attempts), \ external dependency problems, or the same issue persists after 2+ attempts), \
output VERDICT: ESCALATE. output VERDICT: ESCALATE.
## Output Format ## Output Format
@@ -550,8 +770,8 @@ output VERDICT: ESCALATE.
 (Write "None" if nothing was dismissed.)
 
 ### Action Items
-1. Concrete fix the coder should make
-2. Concrete fix the coder should make
+1. Concrete fix the implementer/editor should make
+2. Concrete fix the implementer/editor should make
 
 ## Issue Tracker
@@ -571,37 +791,33 @@ VERDICT: PASS or VERDICT: FAIL or VERDICT: ESCALATE
AGGREGATE_REVIEW_TEMPLATE_KO = """\ AGGREGATE_REVIEW_TEMPLATE_KO = """\
당신은 여러 리뷰 결과를 판정하고 coder가 수정할 액션으로 정리하는 시니어 리뷰어입니다. 당신은 여러 리뷰 결과를 판정하고 coder가 수정할 액션으로 정리하는 시니어 리뷰어입니다.
## 기획서 ## 참조 아티팩트
{plan} {artifact_references}
## 체크리스트 ## 현재 검토 대상
{checklist}
## 참고 문서
{docs}
## 후보 결과물
{candidate_outputs} {candidate_outputs}
## 개별 리뷰 결과 ## 리뷰 결과 묶음
{reviews_bundle} {reviews_bundle}
## 이전 검증 피드백
{feedback}
## 이전 이슈 트래커 ## 이전 이슈 트래커
{previous_senior_tracker} {previous_senior_tracker}
## 실행 증거
{execution_evidence}
## 지침 ## 지침
프로젝트 디렉토리를 탐색하여 현재 코드베이스 상태를 확인한 뒤 다음을 수행하세요. 참조된 plan/checklist/docs/review markdown와 저장소 상태를 직접 읽어 현재 검토 대상의 상태를 확인한 뒤, \
위 실행 증거를 활용하여 에이전트의 주장을 실제 명령어 출력, 아티팩트 경로, 종료 코드로 검증하세요. \
그런 다음 아래를 수행하세요.
1. 리뷰어들 사이에 중복되는 이슈를 합치세요. 1. 리뷰어들 사이에 중복되는 이슈를 합치세요.
2. 의견 충돌은 명시적으로 정리하세요. 2. 의견 충돌은 명시적으로 정리하세요.
3. 기획서, 체크리스트, 코드, 리뷰 근거로 뒷받침되는 이슈만 남기세요. 3. 기획서, 체크리스트, 참고 문서, 저장소 상태, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요. 4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요.
5. coder가 바로 수정할 수 있는 우선순위 액션 아이템을 만드세요. 5. 수정 담당자가 바로 처리할 수 있는 우선순위 액션 아이템을 만드세요.
6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월). 6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월).
7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요. 7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요.
8. coder가 수정 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요. 8. 수정 담당자가 해결 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \ 9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \
동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요. 동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요.
@@ -616,8 +832,8 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 (기각된 항목이 없으면 "없음"이라고 작성하세요.)
 
 ### 액션 아이템
-1. coder가 수정해야 할 구체적인 작업
-2. coder가 수정해야 할 구체적인 작업
+1. 수정 담당자가 처리해야 할 구체적인 작업
+2. 수정 담당자가 처리해야 할 구체적인 작업
 
 ## 이슈 트래커
@@ -640,6 +856,10 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
         "coding": CODING_TEMPLATE,
         "review": REVIEW_TEMPLATE,
         "plan-review": PLAN_REVIEW_TEMPLATE,
+        "plan-fix": PLAN_FIX_TEMPLATE,
+        "plan-verify": PLAN_VERIFY_TEMPLATE,
+        "coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE,
+        "coding-plan-fix": CODING_PLAN_FIX_TEMPLATE,
         "review-only": REVIEW_ONLY_TEMPLATE,
         "aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
     },
@@ -647,6 +867,10 @@ DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
         "coding": CODING_TEMPLATE_KO,
         "review": REVIEW_TEMPLATE_KO,
         "plan-review": PLAN_REVIEW_TEMPLATE_KO,
+        "plan-fix": PLAN_FIX_TEMPLATE_KO,
+        "plan-verify": PLAN_VERIFY_TEMPLATE_KO,
+        "coding-plan-review": CODING_PLAN_REVIEW_TEMPLATE_KO,
+        "coding-plan-fix": CODING_PLAN_FIX_TEMPLATE_KO,
         "review-only": REVIEW_ONLY_TEMPLATE_KO,
         "aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
     },
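
Step configs in this diff refer to templates by strings such as `default:plan-fix`, and the templates use `str.format`-style placeholders such as `{feedback}`. How the project resolves and renders them is not visible in this diff, so the following is only a sketch under stated assumptions: `TEMPLATES`, `render_template`, the `"en"` language key, the fallback to English, and the blank-for-missing-placeholder behaviour are all hypothetical.

```python
# Hypothetical stand-in for DEFAULT_TEMPLATES; the real template bodies are much longer.
TEMPLATES: dict[str, dict[str, str]] = {
    "en": {
        "plan-fix": (
            "## Artifact References\n{artifact_references}\n\n"
            "## Current Review Feedback\n{feedback}\n"
        ),
    },
    "ko": {
        "plan-fix": (
            "## 참조 아티팩트\n{artifact_references}\n\n"
            "## 현재 리뷰 피드백\n{feedback}\n"
        ),
    },
}


class _BlankMissing(dict):
    """Mapping that renders unknown placeholders as an empty string."""

    def __missing__(self, key: str) -> str:
        return ""


def render_template(ref: str, lang: str, context: dict[str, str]) -> str:
    """Resolve a 'default:<name>' reference and fill in its placeholders."""
    prefix, _, name = ref.partition(":")
    if prefix != "default" or not name:
        raise ValueError(f"unsupported template reference: {ref!r}")
    by_lang = TEMPLATES.get(lang) or TEMPLATES["en"]  # Assumed fallback to English.
    if name not in by_lang:
        raise KeyError(f"unknown template: {name!r}")
    return by_lang[name].format_map(_BlankMissing(context))
```

For example, `render_template("default:plan-fix", "ko", {"feedback": "..."})` would fill the Korean template and leave the artifact section empty.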
@@ -891,56 +1115,75 @@ def _build_review_only_preset(
 def _build_plan_review_preset(
     coders: list[str], reviewers: list[str], seniors: list[str],
 ) -> list[StepConfig]:
-    """Plan-review: reviewers audit planning docs before implementation."""
+    """Plan-review: review planning docs, revise them, then verify in a loop."""
+    if not coders:
+        raise ValueError("'plan-review' preset requires at least 1 coder")
     if not reviewers:
         raise ValueError("'plan-review' preset requires at least 1 reviewer")
 
-    if len(reviewers) == 1 and not seniors:
-        return [
+    review_steps: list[StepConfig] = []
+    if len(reviewers) == 1:
+        review_steps.append(
             StepConfig(
                 name="plan_review",
                 agent=reviewers[0],
                 role="review",
                 prompt_template="default:plan-review",
                 output_key="plan_review_result",
-                verdict=True,
             ),
-        ]
+        )
+        review_step_names = ["plan_review"]
+        review_output_keys = ["plan_review_result"]
+    else:
+        reviewer_keys = _unique_safe_keys(reviewers)
+        for reviewer, rk in zip(reviewers, reviewer_keys):
+            review_steps.append(
+                StepConfig(
+                    name=f"plan_review_{rk}",
+                    agent=reviewer,
+                    role="review",
+                    prompt_template="default:plan-review",
+                    output_key=f"plan_review_{rk}",
+                    parallel=True,
+                ),
+            )
+        review_step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
+        review_output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
 
-    steps: list[StepConfig] = []
-    reviewer_keys = _unique_safe_keys(reviewers)
-    for reviewer, rk in zip(reviewers, reviewer_keys):
-        steps.append(
-            StepConfig(
-                name=f"plan_review_{rk}",
-                agent=reviewer,
-                role="review",
-                prompt_template="default:plan-review",
-                output_key=f"plan_review_{rk}",
-                verdict=not seniors,
-                parallel=True,
-            ),
-        )
-    if seniors:
-        step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
-        output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
-        steps.append(
-            StepConfig(
-                name="senior_review",
-                agent=seniors[0],
-                role="review",
-                prompt_template="default:aggregate-review",
-                output_key="senior_review_result",
-                verdict=True,
-                context_override={
-                    "candidate_outputs": "Planning documents under review (plan/checklist/reference docs).",
-                    "reviews_bundle": _build_named_bundle(
-                        reviewers, step_names, output_keys, "Review",
-                    ),
-                },
-            ),
-        )
-    return steps
+    fix_coder = coders[0]
+    senior_agent = seniors[0] if seniors else reviewers[0]
+
+    return review_steps + [
+        StepConfig(
+            name="aggregate_review",
+            agent=senior_agent,
+            role="review",
+            prompt_template="default:aggregate-review",
+            output_key="aggregate_review",
+            context_override={
+                "candidate_outputs": "Current planning package under review (plan/checklist/reference docs).",
+                "reviews_bundle": _build_named_bundle(
+                    reviewers, review_step_names, review_output_keys, "Review",
+                ),
+            },
+        ),
+        StepConfig(
+            name="plan_fix",
+            agent=fix_coder,
+            role="coding",
+            prompt_template="default:plan-fix",
+            output_key="plan_fix_output",
+            context_override={"feedback": "{aggregate_review}"},
+        ),
+        StepConfig(
+            name="verify",
+            agent=senior_agent,
+            role="review",
+            prompt_template="default:plan-verify",
+            output_key="verify_result",
+            verdict=True,
+        ),
+    ]
def _build_review_fix_preset( def _build_review_fix_preset(
@@ -1040,16 +1283,97 @@ def _build_coding_review_fix_preset(
     ]
 
 
+def _build_coding_plan_review_preset(
+    coders: list[str], reviewers: list[str], seniors: list[str],
+) -> list[PhaseConfig]:
+    """Implement from plan/docs, then review and fix code+docs together."""
+    if not coders:
+        raise ValueError("'coding-plan-review' preset requires at least 1 coder")
+    if not reviewers:
+        raise ValueError("'coding-plan-review' preset requires at least 1 reviewer")
+
+    review_steps: list[StepConfig] = []
+    reviewer_keys = _unique_safe_keys(reviewers)
+    for reviewer, rk in zip(reviewers, reviewer_keys):
+        review_steps.append(
+            StepConfig(
+                name=f"review_{rk}",
+                agent=reviewer,
+                role="review",
+                prompt_template="default:coding-plan-review",
+                output_key=f"review_{rk}",
+                verdict=False,
+                parallel=True,
+            ),
+        )
+
+    senior_agent = seniors[0] if seniors else reviewers[0]
+    review_step_names = [f"review_{rk}" for rk in reviewer_keys]
+    review_output_keys = [f"review_{rk}" for rk in reviewer_keys]
+
+    return [
+        PhaseConfig(
+            name="initial_coding",
+            steps=[
+                StepConfig(
+                    name="coding",
+                    agent=coders[0],
+                    role="coding",
+                    prompt_template="default:coding",
+                    output_key="coding_output",
+                ),
+            ],
+            max_iterations=1,
+            consecutive_pass=1,
+        ),
+        PhaseConfig(
+            name="coding_plan_review",
+            steps=review_steps + [
+                StepConfig(
+                    name="aggregate_review",
+                    agent=senior_agent,
+                    role="review",
+                    prompt_template="default:aggregate-review",
+                    output_key="aggregate_review",
+                    context_override={
+                        "candidate_outputs": (
+                            "Current implementation and planning package under review "
+                            "(code + plan/checklist/reference docs)."
+                        ),
+                        "reviews_bundle": _build_named_bundle(
+                            reviewers, review_step_names, review_output_keys, "Review",
+                        ),
+                    },
+                ),
+                StepConfig(
+                    name="coding_plan_fix",
+                    agent=coders[0],
+                    role="coding",
+                    prompt_template="default:coding-plan-fix",
+                    output_key="coding_plan_fix_output",
+                    context_override={"feedback": "{aggregate_review}"},
+                ),
+                StepConfig(
+                    name="verify",
+                    agent=senior_agent,
+                    role="review",
+                    prompt_template="default:coding-plan-review",
+                    output_key="verify_result",
+                    verdict=True,
+                ),
+            ],
+            max_iterations=5,
+            consecutive_pass=1,
+        ),
+    ]
+
+
 PIPELINE_PRESETS: dict[str, Callable] = {
-    "simple": _build_simple_preset,
-    "cross-review": _build_cross_review_preset,
     "plan-review": _build_plan_review_preset,
-    "review-only": _build_review_only_preset,
 }
 
 PHASED_PRESETS: dict[str, Callable] = {
-    "review-fix": _build_review_fix_preset,
-    "coding-review-fix": _build_coding_review_fix_preset,
+    "coding-plan-review": _build_coding_plan_review_preset,
 }
 
 ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
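
The presets above describe a review, aggregate, fix, verify sequence that repeats until the step marked `verdict=True` passes or `max_iterations` is reached. The runner that executes them is not shown in this diff, so the following is only a rough sketch of how such a loop could behave. `parse_verdict`, `run_phase`, the tuple shape of `steps`, and the choice to treat a missing verdict as FAIL are all assumptions made for illustration.

```python
import re
from typing import Callable

VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL|ESCALATE)")

# Each step is (output_key, runner, is_verdict_step); a runner receives the
# outputs collected so far and returns the step's text output.
Step = tuple[str, Callable[[dict[str, str]], str], bool]


def parse_verdict(output: str) -> str:
    """Return the last verdict found in output; assume FAIL when none is present."""
    matches = VERDICT_RE.findall(output)
    return matches[-1] if matches else "FAIL"


def run_phase(steps: list[Step], max_iterations: int, consecutive_pass: int = 1) -> tuple[str, int]:
    """Run all steps per iteration until enough consecutive PASS verdicts are seen."""
    outputs: dict[str, str] = {}
    streak = 0
    for iteration in range(1, max_iterations + 1):
        verdict = "FAIL"
        for output_key, runner, is_verdict_step in steps:
            outputs[output_key] = runner(outputs)
            if is_verdict_step:
                verdict = parse_verdict(outputs[output_key])
        if verdict == "ESCALATE":
            return "ESCALATE", iteration  # Hand off to a human immediately.
        streak = streak + 1 if verdict == "PASS" else 0
        if streak >= consecutive_pass:
            return "PASS", iteration
    return "FAIL", max_iterations
```

In this sketch the fix step can read the aggregated review through `outputs["aggregate_review"]`, which loosely corresponds to the `context_override={"feedback": "{aggregate_review}"}` wiring in the presets.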


@@ -58,6 +58,12 @@ _STRINGS: dict[str, dict[str, str]] = {
"metrics_total_issues": "Total Issues",
"metrics_na": "N/A",
"iteration_details": "Iteration Details",
+"evidence_summary": "Evidence Summary",
+"evidence_agent": "Agent",
+"evidence_exit_code": "Exit Code",
+"evidence_duration": "Duration",
+"evidence_output_size": "Output Size",
+"evidence_transcript": "Execution transcript",
},
"ko": {
"title": "교차 검증 리포트",
@@ -99,6 +105,12 @@ _STRINGS: dict[str, dict[str, str]] = {
"metrics_total_issues": "총 이슈",
"metrics_na": "해당 없음",
"iteration_details": "반복 상세",
+"evidence_summary": "실행 증거 요약",
+"evidence_agent": "에이전트",
+"evidence_exit_code": "종료 코드",
+"evidence_duration": "소요 시간",
+"evidence_output_size": "출력 크기",
+"evidence_transcript": "실행 트랜스크립트",
},
}
@@ -377,6 +389,30 @@ def _append_iteration_steps(
If *skip_extraction* is True, out-of-scope and review-metrics parsing
is skipped (useful when a pre-scan already collected that data).
"""
+# Evidence summary table — quick overview of all steps' execution data
+has_evidence = any(
+iter_result.step_results.get(s.output_key) for s in steps
+)
+if has_evidence:
+s_step = _t(config, "step")
+s_agent = _t(config, "evidence_agent")
+s_exit = _t(config, "evidence_exit_code")
+s_dur = _t(config, "evidence_duration")
+s_size = _t(config, "evidence_output_size")
+lines.append(f"**{_t(config, 'evidence_summary')}**\n")
+lines.append(f"| {s_step} | {s_agent} | {s_exit} | {s_dur} | {s_size} |")
+lines.append("|------|-------|-----------|----------|-------------|")
+for step in steps:
+ar = iter_result.step_results.get(step.output_key)
+out = iter_result.step_outputs.get(step.output_key, "")
+if ar:
+lines.append(
+f"| {step.name} | {ar.agent_name} "
+f"| {ar.exit_code} | {ar.duration_seconds}s "
+f"| {len(out)} chars |"
+)
+lines.append("")
for step in steps:
agent_result = iter_result.step_results.get(step.output_key)
output = iter_result.step_outputs.get(step.output_key, "")
@@ -386,6 +422,11 @@ def _append_iteration_steps(
lines.append(f"### {_t(config, 'step')}: {step.name} ({agent_name}){duration}\n")
+# Show command preview and exit code for execution evidence
+if agent_result and agent_result.command_preview:
+lines.append(f"**Command**: `{agent_result.command_preview}`")
+lines.append(f"**Exit code**: {agent_result.exit_code}\n")
if step.verdict and iter_result.verdict:
lines.append(f"**{_t(config, 'verdict')}: {iter_result.verdict}**\n")
@@ -400,6 +441,17 @@ def _append_iteration_steps(
lines.append(output)
lines.append("")
+# Include transcript excerpt for execution evidence visibility
+if agent_result and agent_result.transcript:
+transcript_preview = agent_result.transcript[:1500]
+if len(agent_result.transcript) > 1500:
+transcript_preview += "\n... (truncated)"
+transcript_label = _t(config, "evidence_transcript")
+lines.append("<details>")
+lines.append(f"<summary>{transcript_label}</summary>\n")
+lines.append(transcript_preview)
+lines.append("\n</details>\n")
if not skip_extraction and step.role == "review":
oos = _extract_out_of_scope(output)
if oos:
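
The transcript excerpt added in the last hunk above can be read as a small standalone helper. The following is a minimal sketch, assuming the same 1500-character cut-off and `<details>` wrapper shown in the diff; the function name `render_transcript_block` and the `limit` parameter are hypothetical and do not appear in the repository.

```python
from __future__ import annotations

# Same cut-off the hunk hard-codes; exposed as a parameter here (assumption).
TRANSCRIPT_LIMIT = 1500


def render_transcript_block(
    transcript: str, label: str, limit: int = TRANSCRIPT_LIMIT
) -> list[str]:
    """Return report lines for a collapsible transcript excerpt.

    An empty transcript produces no lines, mirroring the
    ``if agent_result and agent_result.transcript`` guard in the diff.
    """
    if not transcript:
        return []
    preview = transcript[:limit]
    if len(transcript) > limit:
        # Mark the cut so readers know the excerpt is incomplete.
        preview += "\n... (truncated)"
    return [
        "<details>",
        f"<summary>{label}</summary>\n",
        preview,
        "\n</details>\n",
    ]


if __name__ == "__main__":
    # A tiny limit keeps the demo output short.
    print("\n".join(render_transcript_block("abcdefghij", "Execution transcript", limit=4)))
```

Pulling the limit into a parameter is only a design suggestion; it would let tests exercise the truncation path without building a 1500-character string.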


@@ -14,9 +14,22 @@ _SUMMARY_PREFIXES = (
"PG",
"POSTGRES",
"MYSQL",
+"MARIADB",
"REDIS",
+"MONGO",
+"ELASTICSEARCH",
+"OPENSEARCH",
+"DYNAMO",
+"CASSANDRA",
+"KAFKA",
+"RABBIT",
+"AMQP",
+"NEO4J",
+"SQLITE",
+"MEMCACHED",
"AWS",
"S3",
+"MINIO",
)
@@ -81,6 +94,9 @@ def build_runtime_environment(
) -> tuple[dict[str, str], list[Path], dict[str, str]]:
"""Build subprocess env plus metadata about loaded files and names."""
env = os.environ.copy() if execution.inherit_env else {}
+# Remove CLAUDECODE to avoid "nested session" errors when spawning
+# Claude Code as a subprocess from within a Claude Code session.
+env.pop("CLAUDECODE", None)
loaded_files = resolve_env_files(execution, project_root)
loaded_values: dict[str, str] = {}
for path in loaded_files:
@@ -116,7 +132,6 @@ def summarize_environment(
key
for key in set(loaded_values) | set(env)
if key.startswith(_SUMMARY_PREFIXES)
-or any(prefix in key for prefix in ("CLICKHOUSE", "DATABASE", "DB_"))
}
)
if visible_names:
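
The two behaviours touched by these hunks, dropping `CLAUDECODE` before spawning a subprocess and selecting variable names by prefix only, can be sketched in isolation. This is an illustrative sketch, not the project's code: `build_env` and `visible_names` are hypothetical helpers, the prefix tuple is shortened (the start of the real tuple is not visible in the diff), and sorting the result is an assumption.

```python
from __future__ import annotations

# Shortened for illustration; the diff shows a longer tuple.
SUMMARY_PREFIXES = ("PG", "POSTGRES", "MYSQL", "REDIS", "AWS", "S3")


def build_env(base: dict[str, str], inherit: bool) -> dict[str, str]:
    """Copy the parent environment (or start empty) and drop CLAUDECODE."""
    env = dict(base) if inherit else {}
    # pop with a default never raises, so this is safe when the key is absent.
    env.pop("CLAUDECODE", None)
    return env


def visible_names(env: dict[str, str], loaded: dict[str, str]) -> list[str]:
    """Names worth summarising: prefix match only, no substring match."""
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return sorted(
        key for key in set(env) | set(loaded) if key.startswith(SUMMARY_PREFIXES)
    )


if __name__ == "__main__":
    parent = {"CLAUDECODE": "1", "PGHOST": "db", "DATABASE_URL": "postgres://..."}
    env = build_env(parent, inherit=True)
    print(visible_names(env, {"REDIS_URL": "redis://..."}))
```

With the substring check removed, a name such as `DATABASE_URL` is no longer listed unless one of the prefixes matches it.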


@@ -4,6 +4,7 @@ from __future__ import annotations
import logging
import shutil
import subprocess
+import tempfile
from datetime import datetime
from pathlib import Path
@@ -20,18 +21,47 @@ def make_branch_name(preset_name: str) -> str:
return f"cross-eval/{preset_name}_{ts}"
-def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
+def make_worktree_dir(base_cwd: Path, branch_name: str) -> Path:
+"""Choose a worktree directory outside the base repo.
+Keeping agentic worktrees outside the source checkout avoids tools that
+incorrectly walk up to the outer repo and write into the base worktree.
+"""
+repo_name = base_cwd.resolve().name or "repo"
+branch_slug = branch_name.replace("/", "__")
+return (
+Path(tempfile.gettempdir())
+/ "cross-eval-worktrees"
+/ repo_name
+/ branch_slug
+)
+def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> tuple[Path, str]:
"""Create a git worktree on a new branch from HEAD.
1. Create branch from HEAD
2. Create worktree checked out to that branch
The branch lives in the original repo, so it survives worktree removal.
+Returns (worktree_path, base_commit_sha).
"""
work_dir = work_dir.resolve()
if work_dir.exists():
shutil.rmtree(work_dir)
+# Record the base commit SHA before creating the branch.
+# This is the anchor for all diffs — even if the agent makes its own commits,
+# we always diff against this base to capture the full set of changes.
+result = subprocess.run(
+["git", "rev-parse", "HEAD"],
+cwd=base_cwd,
+capture_output=True,
+text=True,
+check=True,
+)
+base_commit = result.stdout.strip()
# Create the branch at HEAD
try:
subprocess.run(
@@ -66,15 +96,23 @@ def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
f"Failed to create worktree at {work_dir}: {e.stderr.strip()}"
) from e
-logger.debug("Created worktree on branch '%s': %s", branch_name, work_dir)
+logger.debug("Created worktree on branch '%s': %s (base: %s)", branch_name, work_dir, base_commit[:8])
-return work_dir
+return work_dir, base_commit
-def capture_diff(worktree_path: Path) -> str:
+def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
-"""Capture all changes made in the worktree as a unified diff.
+"""Capture all changes made in the worktree since ``base_commit``.
-Includes both tracked modifications and new untracked files.
+Handles two scenarios:
+1. Agent left changes uncommitted → ``git add -A && git diff base HEAD``
+2. Agent committed its own changes → HEAD advanced, diff base..HEAD captures them
+Args:
+base_commit: The diff anchor — typically the worktree HEAD *before* this
+iteration started (set by ``get_current_head`` after each
+``_commit_iteration``). Falls back to ``HEAD`` if not given.
"""
+# Stage any uncommitted changes
subprocess.run(
["git", "add", "-A"],
cwd=worktree_path,
@@ -82,12 +120,34 @@ def capture_diff(worktree_path: Path) -> str:
check=True,
)
-result = subprocess.run(
-["git", "diff", "--cached", "HEAD"],
+# Commit staged changes so everything is reachable via HEAD
+# (this is a no-op if nothing is staged)
+subprocess.run(
+["git", "commit", "-m", "cross-eval: capture-diff snapshot", "--allow-empty-message"],
cwd=worktree_path,
capture_output=True,
text=True,
)
+ref = base_commit or "HEAD~1"
+result = subprocess.run(
+["git", "diff", ref, "HEAD"],
+cwd=worktree_path,
+capture_output=True,
+text=True,
+)
+return result.stdout.strip()
+def get_current_head(worktree_path: Path) -> str:
+"""Return the current HEAD SHA of the worktree."""
+result = subprocess.run(
+["git", "rev-parse", "HEAD"],
+cwd=worktree_path,
+capture_output=True,
+text=True,
+check=True,
+)
return result.stdout.strip()
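
The `capture_diff` docstring describes `base_commit` as an anchor that moves forward after each iteration. The difference between a fixed anchor and a moving one can be illustrated without git at all. The sketch below is a stand-in under stated assumptions: it uses `difflib` on in-memory strings instead of `git diff`, and `text_diff` and `per_iteration_diffs` are hypothetical names that do not exist in the repository.

```python
from __future__ import annotations

import difflib


def text_diff(old: str, new: str) -> str:
    """Unified diff between two text snapshots (empty string if equal)."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="before",
            tofile="after",
        )
    )


def per_iteration_diffs(base: str, snapshots: list[str]) -> list[str]:
    """Diff each snapshot against the previous one, not against the base."""
    anchor = base
    diffs: list[str] = []
    for snapshot in snapshots:
        diffs.append(text_diff(anchor, snapshot))
        # Advance the anchor, analogous to recording the new HEAD SHA
        # after an iteration has been committed.
        anchor = snapshot
    return diffs


if __name__ == "__main__":
    base = "a\n"
    snapshots = ["a\nb\n", "a\nb\nc\n"]
    incremental = per_iteration_diffs(base, snapshots)
    cumulative = text_diff(base, snapshots[-1])
    print(incremental[1])  # only the second iteration's change
    print(cumulative)      # everything since the base
```

With a fixed anchor every iteration would report the full cumulative diff; with a moving anchor the second diff contains only what the second iteration changed.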

plan.md (new file, 47 lines)

@@ -0,0 +1,47 @@
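# cross-eval CLI usability refactoring
## Goal
Refactor the CLI experience of `cross-eval` so that users can quickly understand what each option means and easily choose the option combination that fits their purpose.
## Background
`cross-eval` currently provides the main commands `init`, `run`, `demo`, and `doctor` along with a variety of options, but a first-time user cannot easily see at a glance which option to use in which situation. In particular, the relationship among `run`'s presets, agent combinations, config-based execution, and direct option-based execution can feel complicated.
## Requirements
1. Refactor the CLI help or onboarding text so that even novice users can quickly understand the main flows.
2. Users must be able to easily find the appropriate option combination for each representative usage scenario.
3. The roles of the main `run` options (preset, coder/reviewer/senior, config, output-related) must be made clearer.
4. After `init`, the guidance must lead naturally to what the user should do next.
5. Existing functionality must be preserved; the focus is on improving the explanation structure and usage flow rather than changing behavior.
## User scenarios
1. A user who has just installed the tool wants to know what to do after `cross-eval init`.
2. A user about to execute `run` wants to quickly compare the differences between `--preset` values.
3. A user wants to understand, with examples, in which situations to use combinations of `claude`, `codex`, and `senior`.
4. A user wants to decide whether to use config-based execution or CLI option-based execution.
5. A user wants to know where run results are saved and how to inspect them.
## Constraints
- Keep the existing CLI command names and core option names.
- Do not modify the existing pipeline behavior unnecessarily.
- Focus on improving the guidance structure, help text, examples, and explanation flow rather than adding features.
- Keep the documentation easy to understand for Korean-speaking users, without breaking the project's existing tone and structure.
## Scope
### Included
- Improve the `argparse` help/description/epilog text
- Improve the next-step guidance shown after `init`
- Tidy up `run` usage examples and add examples of representative combinations
- Restructure the explanation of the preset/agent/config/output concepts
- Tidy up parts of the README or onboarding text where needed
### Excluded
- Adding new presets
- Adding new CLI options
- Changing the pipeline execution algorithm
- Changing how agents are invoked
## Success criteria
1. Reading `--help` alone makes the basic usage flow clear.
2. Users can copy and use run examples for each representative scenario right away.
3. The `init → write → doctor → run → check output` flow connects naturally.
4. Option descriptions are not merely long; they are structured to help with actual selection decisions.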
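
The plan's "Included" items name `argparse` help, description, and epilog text as the main levers. A minimal sketch of that idea follows, assuming a hypothetical parser layout: the command names (`init`, `doctor`, `run`) and preset names are taken from this diff, but the help wording, the default preset, and the exact option set are illustrative assumptions rather than the project's actual CLI.

```python
import argparse
import textwrap

# Hypothetical epilog text; the wording is illustrative only.
EPILOG = textwrap.dedent(
    """\
    typical flow:
      cross-eval init      set up the project (what init creates is assumed here)
      cross-eval doctor    check the local setup
      cross-eval run --preset simple

    presets:
      simple, cross-review, plan-review, review-only
    """
)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="cross-eval",
        description="Run coding and review agents against each other.",
        # RawDescriptionHelpFormatter keeps the epilog's line breaks intact.
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=EPILOG,
    )
    sub = parser.add_subparsers(dest="command")
    sub.add_parser("init", help="first step after installing")
    sub.add_parser("doctor", help="check the setup before running")
    run = sub.add_parser("run", help="execute a pipeline")
    run.add_argument(
        "--preset",
        choices=["simple", "cross-review", "plan-review", "review-only"],
        default="simple",  # default chosen for the sketch, not confirmed by the source
        help="which pipeline shape to run",
    )
    return parser


if __name__ == "__main__":
    build_parser().print_help()
```

Keeping the flow and preset list in the epilog means `--help` alone shows the order of commands, which is what the first success criterion asks for.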


@@ -11,8 +11,58 @@ dependencies = [
"pyyaml>=6.0",
]
+[project.optional-dependencies]
+dev = [
+"coverage[toml]>=7.6",
+"pyright>=1.1.390",
+"pytest-cov>=6.0",
+"ruff>=0.8.0",
+]
[project.scripts]
cross-eval = "cross_eval.cli:main"
[tool.setuptools.packages.find]
include = ["cross_eval*"]
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+addopts = "-q"
+[tool.ruff]
+target-version = "py39"
+extend-exclude = [".cross-eval"]
+[tool.ruff.lint]
+select = ["F"]
+[tool.pyright]
+include = ["cross_eval", "tests"]
+exclude = [".cross-eval"]
+typeCheckingMode = "basic"
+pythonVersion = "3.9"
+reportMissingImports = true
+reportMissingTypeStubs = false
+[tool.coverage.run]
+branch = true
+source = ["cross_eval"]
+omit = [
+"cross_eval/config.py",
+"cross_eval/discovery.py",
+"cross_eval/cli.py",
+"cross_eval/demo.py",
+"cross_eval/doctor.py",
+"cross_eval/prompts.py",
+"cross_eval/report.py",
+]
+[tool.coverage.report]
+skip_empty = true
+show_missing = true
+fail_under = 90
+exclude_lines = [
+"pragma: no cover",
+"if TYPE_CHECKING:",
+"raise NotImplementedError",
+]


@@ -12,10 +12,10 @@ import subprocess
import tempfile
import unittest
from pathlib import Path
-from unittest.mock import MagicMock, call, patch
+from unittest.mock import MagicMock, patch
-from cross_eval.agent import invoke_agent_agentic
+from cross_eval.agent import AgentInvocationError, invoke_agent_agentic
-from cross_eval.config import BUILTIN_AGENTS, _make_agentic
+from cross_eval.config import _make_agentic
from cross_eval.models import (
AgentConfig,
AgentResult,
@@ -23,8 +23,7 @@ from cross_eval.models import (
StepConfig,
)
from cross_eval.pipeline import (
-_commit_iteration,
-_finalize_worktree,
+_assert_base_repo_isolation,
_has_agentic_steps,
_setup_worktree,
run_pipeline,
@@ -34,6 +33,7 @@ from cross_eval.worktree import (
commit_worktree,
create_worktree,
make_branch_name,
+make_worktree_dir,
remove_worktree,
)
@@ -76,10 +76,12 @@ class TestCreateWorktree(unittest.TestCase):
wt_dir = Path(td) / "wt"
branch = "cross-eval/test_branch"
-result_path = create_worktree(base, wt_dir, branch)
+result_path, base_commit = create_worktree(base, wt_dir, branch)
# Worktree directory exists
self.assertTrue(result_path.exists())
+# Base commit SHA was captured
+self.assertEqual(len(base_commit), 40)
# Branch was created in the original repo
branches = subprocess.run(
["git", "branch", "--list", branch],
@@ -102,7 +104,7 @@ class TestCaptureDiff(unittest.TestCase):
wt_dir = Path(td) / "wt"
branch = "cross-eval/diff_test"
-create_worktree(base, wt_dir, branch)
+create_worktree(base, wt_dir, branch) # ignore return tuple
# Make changes in the worktree
(wt_dir / "new_file.txt").write_text("hello\n")
@@ -191,16 +193,58 @@ class TestMakeBranchName(unittest.TestCase):
self.assertEqual(len(ts_part), 15) # YYYYMMDD_HHMMSS
+class TestMakeWorktreeDir(unittest.TestCase):
+"""make_worktree_dir chooses an external temp location."""
+def test_uses_tmp_dir_outside_repo(self) -> None:
+with tempfile.TemporaryDirectory() as td:
+base = Path(td) / "repo"
+base.mkdir()
+path = make_worktree_dir(base, "cross-eval/review-fix_20260313_123456")
+self.assertIn("cross-eval-worktrees", str(path))
+self.assertNotIn(str(base), str(path))
+class TestBaseRepoIsolation(unittest.TestCase):
+"""Base repo mutations should fail fast during agentic execution."""
+def test_raises_when_base_repo_state_changes(self) -> None:
+with tempfile.TemporaryDirectory() as td:
+base = Path(td) / "repo"
+worktree = Path(td) / "worktree"
+base.mkdir()
+worktree.mkdir()
+# Baseline has a diff that won't match a non-git directory
+# (which returns {}), triggering the isolation error.
+baseline_state = {
+"diff": "diff --git a/file.py ...\n",
+"untracked": "",
+}
+with self.assertRaises(RuntimeError) as ctx:
+_assert_base_repo_isolation(
+base,
+baseline_state,
+step_name="coding",
+agent_name="claude-coder",
+worktree_path=worktree,
+baseline_status="M file.py",
+)
+self.assertIn("base repository", str(ctx.exception))
# ===================================================================
# 2. agent.py agentic tests (mocking subprocess)
# ===================================================================
class TestInvokeAgentAgenticClaude(unittest.TestCase):
-"""invoke_agent_agentic builds correct cmd for claude (no -p, prompt as positional arg)."""
+"""invoke_agent_agentic builds correct cmd for claude (no -p, prompt via stdin)."""
@patch("cross_eval.worktree.capture_diff", return_value="diff --git a/file ...")
@patch("subprocess.run")
-def test_claude_cmd_has_no_dash_p_and_prompt_as_positional(
+def test_claude_cmd_has_no_dash_p_and_prompt_via_stdin(
self, mock_run: MagicMock, mock_diff: MagicMock,
) -> None:
mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
@@ -230,12 +274,16 @@ class TestInvokeAgentAgenticClaude(unittest.TestCase):
break
self.assertIsNotNone(agent_call, "Expected a subprocess.run call with 'claude'")
+assert agent_call is not None
cmd = agent_call[0][0]
# No -p flag
self.assertNotIn("-p", cmd)
-# Last arg is a task file reference (not raw prompt — avoids arg length limits)
-self.assertIn("task file", cmd[-1].lower())
+# Prompt is delivered via stdin (input kwarg), not as a positional arg
+input_data = agent_call[1].get("input")
+self.assertIsNotNone(input_data)
+assert input_data is not None
+self.assertIn("implement feature X", input_data)
class TestInvokeAgentAgenticCodex(unittest.TestCase):
@@ -272,6 +320,7 @@ class TestInvokeAgentAgenticCodex(unittest.TestCase):
break
self.assertIsNotNone(agent_call, "Expected a subprocess.run call with 'codex'")
+assert agent_call is not None
cmd = agent_call[0][0]
# Should have "-" sentinel at the end for stdin
@@ -279,6 +328,7 @@ class TestInvokeAgentAgenticCodex(unittest.TestCase):
# Stdin input should contain the prompt
input_data = agent_call[1].get("input")
self.assertIsNotNone(input_data)
+assert input_data is not None
self.assertIn("implement feature Y", input_data)
@@ -309,6 +359,74 @@ class TestTaskFileCleanup(unittest.TestCase):
self.assertFalse((wt / "CROSS_EVAL_TASK.md").exists())
+class TestAgenticEmptyDiffDetection(unittest.TestCase):
+"""Agentic coders should not succeed when they only claim changes in stdout."""
+@patch("cross_eval.worktree.capture_diff", return_value="")
+@patch("subprocess.run")
+def test_claude_empty_diff_with_change_claim_fails(
+self, mock_run: MagicMock, mock_diff: MagicMock,
+) -> None:
+mock_run.return_value = MagicMock(
+returncode=0,
+stdout=(
+"All tests pass.\n"
+"Here's a summary of all changes made:\n"
+"- Updated discovery.py\n"
+),
+stderr="",
+)
+agent = AgentConfig(
+name="claude-coder",
+command="claude",
+args=["--setting-sources", "user"],
+agentic=True,
+)
+with tempfile.TemporaryDirectory() as td:
+wt = Path(td)
+_init_git_repo(wt)
+with self.assertRaises(AgentInvocationError) as ctx:
+invoke_agent_agentic(
+agent, "implement feature X", "coding",
+worktree_path=wt, quiet=True,
+)
+self.assertEqual(ctx.exception.failure_type, "EMPTY_DIFF")
+self.assertIn("summary of all changes made", ctx.exception.raw_error.lower())
+@patch("cross_eval.worktree.capture_diff", return_value="")
+@patch("subprocess.run")
+def test_empty_diff_without_change_claim_is_allowed(
+self, mock_run: MagicMock, mock_diff: MagicMock,
+) -> None:
+mock_run.return_value = MagicMock(
+returncode=0,
+stdout="No changes were required; the current implementation already satisfies the task.",
+stderr="",
+)
+agent = AgentConfig(
+name="claude-coder",
+command="claude",
+args=["--setting-sources", "user"],
+agentic=True,
+)
+with tempfile.TemporaryDirectory() as td:
+wt = Path(td)
+_init_git_repo(wt)
+result = invoke_agent_agentic(
+agent, "check whether any fix is needed", "coding",
+worktree_path=wt, quiet=True,
+)
+self.assertEqual(result.output, "(no changes)")
# ===================================================================
# 3. config.py tests
# ===================================================================
@@ -328,6 +446,16 @@ class TestMakeAgenticClaude(unittest.TestCase):
self.assertNotIn("-p", agent.args)
self.assertIn("--setting-sources", agent.args)
+def test_strips_dash_dash_print_alias(self) -> None:
+agent = AgentConfig(
+name="claude-coder",
+command="claude",
+args=["--print", "--setting-sources", "user"],
+)
+_make_agentic(agent)
+self.assertTrue(agent.agentic)
+self.assertNotIn("--print", agent.args)
def test_idempotent_when_no_dash_p(self) -> None:
agent = AgentConfig(
name="claude-coder",
@@ -362,6 +490,8 @@ class TestMakeAgenticCodex(unittest.TestCase):
def _make_agentic_config(
run_dir: Path,
agentic_coder: bool = True,
+*,
+use_worktree: bool = False,
) -> PipelineConfig:
"""Build a config with an agentic coder + non-agentic reviewer."""
coder = AgentConfig(
@@ -393,6 +523,7 @@ def _make_agentic_config(
]
return PipelineConfig(
output_dir=run_dir,
+use_worktree=use_worktree,
max_iterations=2,
min_iterations=1,
language="en",
@@ -423,11 +554,11 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
) -> None:
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
-config = _make_agentic_config(run_dir)
+config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work"
wt_path.mkdir()
-mock_setup.return_value = (wt_path, "cross-eval/test")
+mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
@@ -445,6 +576,71 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
mock_setup.assert_called_once()
+class TestDirectAgenticMode(unittest.TestCase):
+"""Agentic coders run in the current working tree by default."""
+@patch("cross_eval.pipeline._setup_worktree")
+@patch("cross_eval.pipeline.invoke_agent_agentic")
+@patch("cross_eval.pipeline.invoke_agent")
+def test_agentic_uses_current_worktree_by_default(
+self,
+mock_invoke: MagicMock,
+mock_invoke_agentic: MagicMock,
+mock_setup: MagicMock,
+) -> None:
+with tempfile.TemporaryDirectory() as td:
+repo = Path(td)
+_init_git_repo(repo)
+run_dir = repo / ".cross-eval" / "output"
+run_dir.mkdir(parents=True, exist_ok=True)
+config = _make_agentic_config(run_dir)
+mock_invoke_agentic.return_value = AgentResult(
+output="diff output", exit_code=0,
+agent_name="claude-coder", step_name="coding",
+duration_seconds=0.1,
+)
+mock_invoke.return_value = AgentResult(
+output="VERDICT: PASS", exit_code=0,
+agent_name="claude-reviewer", step_name="review",
+duration_seconds=0.1,
+)
+run_pipeline(config, cwd=repo)
+mock_setup.assert_not_called()
+self.assertEqual(mock_invoke_agentic.call_args.kwargs["worktree_path"], repo)
+reviewer_call = mock_invoke.call_args
+self.assertEqual(reviewer_call.kwargs["cwd"], repo)
+class TestSetupWorktreeLocation(unittest.TestCase):
+"""_setup_worktree places agentic worktrees outside the base repo."""
+def test_worktree_is_created_outside_repo(self) -> None:
+with tempfile.TemporaryDirectory() as td:
+base = Path(td) / "repo"
+run_dir = base / ".cross-eval" / "output" / "smoke"
+base.mkdir()
+run_dir.mkdir(parents=True)
+_init_git_repo(base)
+worktree_path, branch_name, _base_commit = _setup_worktree(base, run_dir, "review-fix")
+try:
+self.assertTrue(worktree_path.exists())
+self.assertNotIn(str(base.resolve()), str(worktree_path.resolve()))
+self.assertEqual(
+(run_dir / "worktree_path.txt").read_text(encoding="utf-8").strip(),
+str(worktree_path),
+)
+self.assertEqual(
+(run_dir / "worktree_branch.txt").read_text(encoding="utf-8").strip(),
+branch_name,
+)
+finally:
+remove_worktree(base, worktree_path)
class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
"""Reviewer runs with worktree cwd (not original cwd) when worktree exists."""
@@ -463,11 +659,11 @@ class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
) -> None:
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
-config = _make_agentic_config(run_dir)
+config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work"
wt_path.mkdir()
-mock_setup.return_value = (wt_path, "cross-eval/test")
+mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
@@ -505,11 +701,11 @@ class TestCommitIterationCalled(unittest.TestCase):
) -> None:
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
-config = _make_agentic_config(run_dir)
+config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work"
wt_path.mkdir()
-mock_setup.return_value = (wt_path, "cross-eval/test")
+mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
@@ -547,11 +743,11 @@ class TestFinalizeWorktreeCalled(unittest.TestCase):
) -> None:
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
-config = _make_agentic_config(run_dir)
+config = _make_agentic_config(run_dir, use_worktree=True)
wt_path = run_dir / "work"
wt_path.mkdir()
-mock_setup.return_value = (wt_path, "cross-eval/test")
+mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
@@ -669,7 +865,7 @@ class TestParallelAgenticFallsBackToSequential(unittest.TestCase):
wt_path = run_dir / "work"
wt_path.mkdir()
-mock_setup.return_value = (wt_path, "cross-eval/test")
+mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
call_order: list[str] = []


@@ -26,7 +26,6 @@ from cross_eval.models import (
PhaseConfig,
PipelineConfig,
PipelineResult,
-ReviewMetrics,
StepConfig,
)
from cross_eval.pipeline import (
@@ -43,6 +42,8 @@ from cross_eval.prompts import (
REVIEW_TEMPLATE_KO,
PLAN_REVIEW_TEMPLATE,
PLAN_REVIEW_TEMPLATE_KO,
+PLAN_FIX_TEMPLATE,
+PLAN_FIX_TEMPLATE_KO,
REVIEW_ONLY_TEMPLATE,
REVIEW_ONLY_TEMPLATE_KO,
AGGREGATE_REVIEW_TEMPLATE,
@@ -54,7 +55,7 @@ from cross_eval.prompts import (
_build_review_only_preset,
_build_simple_preset,
)
-from cross_eval.report import build_report, parse_review_metrics, print_escalation_report
+from cross_eval.report import build_report, parse_review_metrics
class BuiltinAgentConfigTest(unittest.TestCase):
def test_claude_builtin_agents_use_user_settings_and_disable_slash_commands(self) -> None:
@@ -72,8 +73,11 @@ class BuiltinAgentConfigTest(unittest.TestCase):
self.assertIn("--dangerously-skip-permissions", coder_args)
self.assertIn("bypassPermissions", coder_args)
-self.assertIn("plan", reviewer_args)
-self.assertIn("plan", senior_args)
+# Reviewers/seniors use -p without --permission-mode plan
+self.assertIn("-p", reviewer_args)
+self.assertIn("-p", senior_args)
+self.assertNotIn("plan", reviewer_args)
+self.assertNotIn("plan", senior_args)
def test_codex_builtin_agents_skip_git_repo_check(self) -> None:
for agent_name in ("codex-coder", "codex-reviewer", "codex-senior"):
@@ -308,26 +312,10 @@ class BuiltinAgentConfigTest(unittest.TestCase):
self.assertIn("Repeated Aggregate Findings", report)
self.assertIn("same as iteration 3", report)
-def test_review_fix_defaults_senior_from_reviewer_family(self) -> None:
+def test_fix_and_plan_presets_default_senior_from_reviewer_family(self) -> None:
self.assertEqual(
_default_seniors_for_preset(
-"preset:review-fix",
-["codex-reviewer", "claude-reviewer"],
-BUILTIN_AGENTS,
-),
-["codex-senior"],
-)
-self.assertEqual(
-_default_seniors_for_preset(
-"preset:review-fix",
-["claude-reviewer"],
-BUILTIN_AGENTS,
-),
-["claude-senior"],
-)
-self.assertEqual(
-_default_seniors_for_preset(
-"preset:coding-review-fix",
+"preset:plan-review",
["codex-reviewer"],
BUILTIN_AGENTS,
),
@@ -335,7 +323,31 @@ class BuiltinAgentConfigTest(unittest.TestCase):
) )
self.assertEqual( self.assertEqual(
_default_seniors_for_preset( _default_seniors_for_preset(
"preset:simple", "preset:plan-review",
["claude-reviewer"],
BUILTIN_AGENTS,
),
["claude-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:coding-plan-review",
["codex-reviewer", "claude-reviewer"],
BUILTIN_AGENTS,
),
["codex-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:coding-plan-review",
["claude-reviewer"],
BUILTIN_AGENTS,
),
["claude-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:unknown",
["codex-reviewer"], ["codex-reviewer"],
BUILTIN_AGENTS, BUILTIN_AGENTS,
), ),
@@ -419,23 +431,49 @@ class BuiltinAgentConfigTest(unittest.TestCase):
) )
self.assertEqual( self.assertEqual(
[step.output_key for step in steps], [step.output_key for step in steps[:2]],
["plan_review_codex_reviewer", "plan_review_codex_reviewer_2"], ["plan_review_codex_reviewer", "plan_review_codex_reviewer_2"],
) )
def test_plan_review_with_senior_adds_aggregate_step(self) -> None: def test_plan_review_builds_review_fix_verify_loop(self) -> None:
steps = _build_plan_review_preset( steps = _build_plan_review_preset(
["codex-coder"], ["codex-coder"],
["claude-reviewer", "codex-reviewer"], ["claude-reviewer", "codex-reviewer"],
["claude-senior"], ["claude-senior"],
) )
self.assertEqual(steps[-1].name, "senior_review") self.assertEqual(
self.assertEqual(steps[-1].agent, "claude-senior") [step.name for step in steps],
self.assertTrue(steps[-1].verdict) [
"plan_review_claude_reviewer",
"plan_review_codex_reviewer",
"aggregate_review",
"plan_fix",
"verify",
],
)
self.assertEqual(steps[2].agent, "claude-senior")
self.assertEqual(steps[3].agent, "codex-coder")
self.assertEqual(steps[4].agent, "claude-senior")
self.assertTrue(steps[4].verdict)
self.assertFalse(steps[0].verdict) self.assertFalse(steps[0].verdict)
self.assertFalse(steps[1].verdict) self.assertFalse(steps[1].verdict)
def test_plan_review_single_reviewer_uses_default_loop_steps(self) -> None:
steps = _build_plan_review_preset(
["codex-coder"],
["codex-reviewer"],
[],
)
self.assertEqual(
[step.name for step in steps],
["plan_review", "aggregate_review", "plan_fix", "verify"],
)
self.assertEqual(steps[1].agent, "codex-reviewer")
self.assertEqual(steps[2].prompt_template, "default:plan-fix")
self.assertTrue(steps[3].verdict)
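Review note: the two tests above pin down the shape of the plan-review loop. There is one review step per reviewer, then `aggregate_review`, `plan_fix`, and `verify`, and only the final step carries a verdict. A minimal sketch of a builder that would satisfy those assertions is shown here. `Step` and `build_plan_review_steps` are hypothetical stand-ins, not the project's `StepConfig` or `_build_plan_review_preset`. The fallback from senior to first reviewer is inferred from the single-reviewer test alone, and the numeric suffix for repeated reviewers is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """Hypothetical stand-in for the project's StepConfig."""

    name: str
    agent: str
    verdict: bool = False


def build_plan_review_steps(coders, reviewers, seniors):
    """Build review -> aggregate -> fix -> verify steps (illustrative only)."""
    steps = []
    if len(reviewers) == 1:
        steps.append(Step("plan_review", reviewers[0]))
    else:
        seen = {}
        for reviewer in reviewers:
            base = "plan_review_" + reviewer.replace("-", "_")
            seen[base] = seen.get(base, 0) + 1
            # Assumed convention: a repeated reviewer gets a numeric suffix.
            name = base if seen[base] == 1 else f"{base}_{seen[base]}"
            steps.append(Step(name, reviewer))
    # Assumption: without a senior, the first reviewer aggregates and verifies.
    lead = seniors[0] if seniors else reviewers[0]
    steps.append(Step("aggregate_review", lead))
    steps.append(Step("plan_fix", coders[0]))
    steps.append(Step("verify", lead, verdict=True))
    return steps
```

Keeping the verdict on `verify` alone means a FAIL can only come from the step that has seen the fix, which is what the loop relies on to decide whether to iterate again.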
def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None: def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None:
steps = _build_cross_review_preset( steps = _build_cross_review_preset(
["codex-coder", "codex-coder"], ["codex-coder", "codex-coder"],
@@ -574,6 +612,8 @@ class PromptTemplateTest(unittest.TestCase):
"""Coding templates should tell coder to ignore DISMISSED items.""" """Coding templates should tell coder to ignore DISMISSED items."""
self.assertIn("DISMISSED", CODING_TEMPLATE) self.assertIn("DISMISSED", CODING_TEMPLATE)
self.assertIn("DISMISSED", CODING_TEMPLATE_KO) self.assertIn("DISMISSED", CODING_TEMPLATE_KO)
self.assertIn("DISMISSED", PLAN_FIX_TEMPLATE)
self.assertIn("DISMISSED", PLAN_FIX_TEMPLATE_KO)
def test_aggregate_templates_dismissed_structure(self) -> None: def test_aggregate_templates_dismissed_structure(self) -> None:
"""Aggregate templates should use [False positive] / [Already fixed] tags.""" """Aggregate templates should use [False positive] / [Already fixed] tags."""
@@ -581,6 +621,10 @@ class PromptTemplateTest(unittest.TestCase):
self.assertIn("[Already fixed]", AGGREGATE_REVIEW_TEMPLATE) self.assertIn("[Already fixed]", AGGREGATE_REVIEW_TEMPLATE)
self.assertIn("[오탐]", AGGREGATE_REVIEW_TEMPLATE_KO) self.assertIn("[오탐]", AGGREGATE_REVIEW_TEMPLATE_KO)
self.assertIn("[수정 완료]", AGGREGATE_REVIEW_TEMPLATE_KO) self.assertIn("[수정 완료]", AGGREGATE_REVIEW_TEMPLATE_KO)
self.assertIn("{candidate_outputs}", AGGREGATE_REVIEW_TEMPLATE)
self.assertIn("{reviews_bundle}", AGGREGATE_REVIEW_TEMPLATE)
self.assertIn("{candidate_outputs}", AGGREGATE_REVIEW_TEMPLATE_KO)
self.assertIn("{reviews_bundle}", AGGREGATE_REVIEW_TEMPLATE_KO)
class ReviewMetricsParsingTest(unittest.TestCase): class ReviewMetricsParsingTest(unittest.TestCase):
@@ -967,7 +1011,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
" checklist: checklist.md\n" " checklist: checklist.md\n"
"coders: [claude-coder]\n" "coders: [claude-coder]\n"
"reviewers: [claude-reviewer]\n" "reviewers: [claude-reviewer]\n"
"pipeline: preset:review-fix\n" "pipeline: preset:coding-plan-review\n"
f"max_iterations: {max_iterations}\n" f"max_iterations: {max_iterations}\n"
"language: en\n" "language: en\n"
), ),
@@ -979,8 +1023,9 @@ class FixPresetBehaviorTest(unittest.TestCase):
with tempfile.TemporaryDirectory() as tmpdir: with tempfile.TemporaryDirectory() as tmpdir:
config = load_config(self._write_fix_config(Path(tmpdir), max_iterations=7)) config = load_config(self._write_fix_config(Path(tmpdir), max_iterations=7))
self.assertEqual(config.preset_name, "review-fix") self.assertEqual(config.preset_name, "coding-plan-review")
self.assertEqual(config.phases[0].max_iterations, 7) self.assertEqual(config.phases[0].max_iterations, 1)
self.assertEqual(config.phases[1].max_iterations, 7)
self.assertTrue(config.agents["claude-coder"].agentic) self.assertTrue(config.agents["claude-coder"].agentic)
self.assertNotIn("-p", config.agents["claude-coder"].args) self.assertNotIn("-p", config.agents["claude-coder"].args)
@@ -990,7 +1035,7 @@ class FixPresetBehaviorTest(unittest.TestCase):
captured: dict[str, object] = {} captured: dict[str, object] = {}
def _fake_run_pipeline(config, **kwargs): def _fake_run_pipeline(config, **kwargs):
captured["phase_max"] = config.phases[0].max_iterations captured["phase_max"] = config.phases[1].max_iterations
captured["agentic"] = config.agents[config.coders[0]].agentic captured["agentic"] = config.agents[config.coders[0]].agentic
return PipelineResult( return PipelineResult(
iterations=[], iterations=[],
@@ -1010,13 +1055,13 @@ class FixPresetBehaviorTest(unittest.TestCase):
self.assertEqual(captured["phase_max"], 9) self.assertEqual(captured["phase_max"], 9)
self.assertTrue(captured["agentic"]) self.assertTrue(captured["agentic"])
def test_run_preset_review_fix_auto_enables_agentic_without_flag(self) -> None: def test_run_preset_coding_plan_review_auto_enables_agentic_without_flag(self) -> None:
captured: dict[str, object] = {} captured: dict[str, object] = {}
def _fake_run_pipeline(config, **kwargs): def _fake_run_pipeline(config, **kwargs):
captured["preset"] = config.preset_name captured["preset"] = config.preset_name
captured["agentic"] = config.agents[config.coders[0]].agentic captured["agentic"] = config.agents[config.coders[0]].agentic
captured["phase_max"] = config.phases[0].max_iterations captured["phase_max"] = config.phases[1].max_iterations
return PipelineResult( return PipelineResult(
iterations=[], iterations=[],
final_verdict="PASS", final_verdict="PASS",
@@ -1024,13 +1069,127 @@ class FixPresetBehaviorTest(unittest.TestCase):
) )
with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline): with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
exit_code = main(["run", "--preset", "review-fix", "--dry-run"]) exit_code = main(["run", "--preset", "coding-plan-review", "--dry-run"])
self.assertEqual(exit_code, 0) self.assertEqual(exit_code, 0)
self.assertEqual(captured["preset"], "review-fix") self.assertEqual(captured["preset"], "coding-plan-review")
self.assertTrue(captured["agentic"]) self.assertTrue(captured["agentic"])
self.assertEqual(captured["phase_max"], 3) self.assertEqual(captured["phase_max"], 3)
def test_run_preset_plan_review_auto_enables_agentic_without_flag(self) -> None:
captured: dict[str, object] = {}
def _fake_run_pipeline(config, **kwargs):
captured["preset"] = config.preset_name
captured["agentic"] = config.agents[config.coders[0]].agentic
captured["use_worktree"] = config.use_worktree
captured["seniors"] = list(config.seniors)
captured["steps"] = [step.name for step in config.pipeline]
captured["max_iter"] = config.max_iterations
return PipelineResult(
iterations=[],
final_verdict="PASS",
run_dir=Path(".cross-eval/output"),
)
with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
self.assertEqual(exit_code, 0)
self.assertEqual(captured["preset"], "plan-review")
self.assertTrue(captured["agentic"])
self.assertFalse(captured["use_worktree"])
self.assertEqual(captured["seniors"], ["claude-senior"])
self.assertEqual(
captured["steps"],
["plan_review", "aggregate_review", "plan_fix", "verify"],
)
self.assertEqual(captured["max_iter"], 3)
def test_run_worktree_flag_enables_isolated_worktree_mode(self) -> None:
captured: dict[str, object] = {}
def _fake_run_pipeline(config, **kwargs):
captured["use_worktree"] = config.use_worktree
return PipelineResult(
iterations=[],
final_verdict="PASS",
run_dir=Path(".cross-eval/output"),
)
with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
exit_code = main(["run", "--preset", "plan-review", "--dry-run", "--worktree"])
self.assertEqual(exit_code, 0)
self.assertTrue(captured["use_worktree"])
def test_run_dry_run_returns_zero_even_when_not_pass(self) -> None:
def _fake_run_pipeline(config, **kwargs):
return PipelineResult(
iterations=[],
final_verdict="MAX_ITERATIONS_REACHED",
run_dir=Path(".cross-eval/output"),
)
with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
exit_code = main(["run", "--preset", "plan-review", "--dry-run"])
self.assertEqual(exit_code, 0)
def test_run_senior_model_override_applies_only_to_seniors(self) -> None:
captured: dict[str, list[str]] = {}
def _fake_run_pipeline(config, **kwargs):
captured["coder_args"] = list(config.agents[config.coders[0]].args)
captured["reviewer_args"] = list(config.agents[config.reviewers[0]].args)
captured["senior_args"] = list(config.agents[config.seniors[0]].args)
return PipelineResult(
iterations=[],
final_verdict="PASS",
run_dir=Path(".cross-eval/output"),
)
with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
exit_code = main([
"run",
"--preset", "coding-plan-review",
"--coder", "claude",
"--reviewer", "claude",
"--senior", "claude",
"--senior-model", "sonnet",
"--dry-run",
])
self.assertEqual(exit_code, 0)
self.assertIn("opus", captured["coder_args"])
self.assertIn("opus", captured["reviewer_args"])
self.assertIn("sonnet", captured["senior_args"])
class OutputDirectoryResolutionTest(unittest.TestCase):
def test_load_config_resolves_output_dir_from_project_root(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
ce_dir = root / ".cross-eval"
ce_dir.mkdir()
(ce_dir / "plan.md").write_text("# plan\n", encoding="utf-8")
config_path = ce_dir / "config.yaml"
config_path.write_text(
(
"inputs:\n"
" plan: plan.md\n"
"coders: [claude-coder]\n"
"reviewers: [claude-reviewer]\n"
"pipeline: preset:coding-plan-review\n"
"output_dir: .cross-eval/output\n"
),
encoding="utf-8",
)
config = load_config(config_path)
self.assertEqual(config.output_dir.resolve(), (root / ".cross-eval" / "output").resolve())
if __name__ == "__main__": if __name__ == "__main__":
unittest.main() unittest.main()

tests/test_evidence.py Normal file

@@ -0,0 +1,951 @@
"""Regression tests for runtime evidence propagation and report visibility.
Covers:
1. Execution evidence is surfaced in reviewer/senior prompt context.
2. Reports include command preview and transcript excerpts.
3. Claude agentic failure detection (empty diff, write failure, expanded markers).
4. _format_execution_evidence produces expected output.
"""
from __future__ import annotations
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch
from cross_eval.agent import (
    AgentInvocationError,
    _claims_file_changes,
    _has_write_failure_indicators,
    invoke_agent_agentic,
)
from cross_eval.config import BUILTIN_AGENTS
from cross_eval.models import (
    AgentConfig,
    AgentResult,
    IterationResult,
    PipelineConfig,
    PipelineResult,
    StepConfig,
)
from cross_eval.pipeline import _build_artifact_references, _format_execution_evidence, run_pipeline
from cross_eval.report import build_report
# ---------------------------------------------------------------------------
# 1. Execution evidence formatting
# ---------------------------------------------------------------------------


class TestFormatExecutionEvidence(unittest.TestCase):
    """_format_execution_evidence produces a compact summary for reviewers."""

    def test_empty_results_returns_placeholder(self) -> None:
        self.assertIn("no prior execution evidence", _format_execution_evidence({}))

    def test_single_result_includes_key_fields(self) -> None:
        result = AgentResult(
            output="some diff",
            exit_code=0,
            agent_name="claude-coder",
            step_name="coding",
            duration_seconds=12.3,
            transcript="# Agent Execution Transcript\n\n## Command\nclaude ...",
            command_preview="claude --setting-sources user",
        )
        evidence = _format_execution_evidence({"coding_output": result})
        self.assertIn("claude-coder", evidence)
        self.assertIn("coding", evidence)
        self.assertIn("Exit code: 0", evidence)
        self.assertIn("12.3s", evidence)
        self.assertIn("claude --setting-sources user", evidence)
        self.assertNotIn("Transcript excerpt", evidence)

    def test_multiple_results_separated(self) -> None:
        r1 = AgentResult(
            output="diff1", exit_code=0, agent_name="coder",
            step_name="coding", duration_seconds=1.0,
            command_preview="cmd1",
        )
        r2 = AgentResult(
            output="review text", exit_code=0, agent_name="reviewer",
            step_name="review", duration_seconds=2.0,
            command_preview="cmd2",
        )
        evidence = _format_execution_evidence({
            "coding_output": r1,
            "review_result": r2,
        })
        self.assertIn("coder", evidence)
        self.assertIn("reviewer", evidence)
        self.assertIn("---", evidence)

    def test_transcript_truncated_at_2000_chars(self) -> None:
        long_transcript = "x" * 3000
        result = AgentResult(
            output="out", exit_code=0, agent_name="agent",
            step_name="step", duration_seconds=1.0,
            transcript=long_transcript,
        )
        evidence = _format_execution_evidence({"key": result})
        self.assertNotIn("x" * 3000, evidence)

    def test_artifact_paths_included_when_run_dir_provided(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            result = AgentResult(
                output="diff",
                exit_code=0,
                agent_name="coder",
                step_name="coding",
                duration_seconds=1.2,
                transcript="stdout",
                command_preview="claude ...",
            )
            evidence = _format_execution_evidence(
                {"coding_output": result},
                run_dir=Path(tmpdir),
                iteration=2,
            )
            self.assertIn("v2/coding.md", evidence)
            self.assertIn("v2/coding_transcript.md", evidence)
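Review note: the tests in this class describe the evidence summary only through substrings (a placeholder for empty input, exit code, duration, command preview, and a `---` separator between steps). A minimal sketch of a formatter with that behaviour is shown here. `Result` and `format_evidence` are hypothetical names, the `- Command:` label is an assumption, and the real `_format_execution_evidence` also accepts `run_dir` and `iteration`, which this sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for AgentResult with only the fields used here."""

    agent_name: str
    step_name: str
    exit_code: int
    duration_seconds: float
    command_preview: str = ""


def format_evidence(results):
    """Render one compact block per step, separated by '---' lines."""
    if not results:
        return "(no prior execution evidence)"
    blocks = []
    for result in results.values():
        lines = [
            f"### Step: {result.step_name} ({result.agent_name})",
            f"- Exit code: {result.exit_code}",
            f"- Duration: {result.duration_seconds:.1f}s",
        ]
        if result.command_preview:
            # Assumed label; the tests only check that the preview text appears.
            lines.append(f"- Command: {result.command_preview}")
        blocks.append("\n".join(lines))
    return "\n\n---\n\n".join(blocks)
```

Leaving the transcript out of the summary keeps reviewer prompts short, which matches the `assertNotIn("Transcript excerpt", ...)` check above.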
class TestArtifactReferences(unittest.TestCase):
    """Artifact references should prefer file paths and git state over inline text."""

    def test_contains_input_refs_and_git_context(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir) / "repo"
            repo.mkdir()
            (repo / "plan.md").write_text("plan", encoding="utf-8")
            (repo / "checklist.md").write_text("checklist", encoding="utf-8")

            import subprocess
            subprocess.run(["git", "init"], cwd=repo, capture_output=True, check=True)
            subprocess.run(["git", "config", "user.email", "test@test.com"], cwd=repo, capture_output=True, check=True)
            subprocess.run(["git", "config", "user.name", "Test"], cwd=repo, capture_output=True, check=True)
            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
            subprocess.run(["git", "commit", "-m", "init"], cwd=repo, capture_output=True, check=True)

            refs = _build_artifact_references(
                {
                    "plan_ref": str((repo / "plan.md").resolve()),
                    "checklist_ref": str((repo / "checklist.md").resolve()),
                    "docs_ref": "(none)",
                },
                cwd=repo,
                run_dir=repo / ".cross-eval" / "output" / "run",
                iteration=1,
                worktree_path=None,
            )
            self.assertIn("Plan:", refs)
            self.assertIn("Git commit:", refs)
            self.assertIn("Suggested git commands", refs)
# ---------------------------------------------------------------------------
# 2. Evidence in reviewer prompts (integration)
# ---------------------------------------------------------------------------


class TestEvidenceInReviewerPrompt(unittest.TestCase):
    """Reviewer prompts include execution evidence from prior coding step."""

    def test_reviewer_receives_evidence(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            steps = [
                StepConfig(
                    name="coding", agent="claude-coder", role="coding",
                    prompt_template="default:coding", output_key="coding_output",
                ),
                StepConfig(
                    name="review", agent="claude-reviewer", role="review",
                    prompt_template="default:review", output_key="review_result",
                    verdict=True,
                ),
            ]
            config = PipelineConfig(
                output_dir=Path(tmpdir),
                max_iterations=1,
                min_iterations=1,
                language="en",
                inputs={"plan": "Test plan", "checklist": "Test checklist"},
                agents=dict(BUILTIN_AGENTS),
                coders=["claude-coder"],
                reviewers=["claude-reviewer"],
                pipeline=steps,
                preset_name="simple",
            )
            captured_prompts: list[dict] = []

            def _mock(agent_config, prompt, step_name, **kwargs):
                captured_prompts.append({
                    "step_name": step_name,
                    "prompt": prompt,
                })
                if step_name == "coding":
                    return AgentResult(
                        output="Implemented feature X",
                        exit_code=0,
                        agent_name=agent_config.name,
                        step_name=step_name,
                        duration_seconds=5.0,
                        transcript="# Transcript\nclaude ran...",
                        command_preview="claude --setting-sources user",
                    )
                return AgentResult(
                    output="VERDICT: PASS",
                    exit_code=0,
                    agent_name=agent_config.name,
                    step_name=step_name,
                    duration_seconds=2.0,
                )

            with patch("cross_eval.pipeline.invoke_agent", side_effect=_mock):
                result = run_pipeline(config)

            self.assertEqual(result.final_verdict, "PASS")
            # The reviewer prompt should contain execution evidence
            review_prompts = [
                p for p in captured_prompts if p["step_name"] == "review"
            ]
            self.assertTrue(len(review_prompts) >= 1)
            review_prompt = review_prompts[0]["prompt"]
            self.assertIn("Artifact References", review_prompt)
            self.assertIn("Execution Evidence", review_prompt)
            self.assertIn("claude-coder", review_prompt)
# ---------------------------------------------------------------------------
# 3. Report includes evidence
# ---------------------------------------------------------------------------


class TestReportIncludesEvidence(unittest.TestCase):
    """Report generation includes command preview and transcript excerpts."""

    def _make_pipeline_result(self) -> tuple[PipelineConfig, PipelineResult]:
        steps = [
            StepConfig(
                name="coding", agent="claude-coder", role="coding",
                prompt_template="default:coding", output_key="coding_output",
            ),
            StepConfig(
                name="review", agent="claude-reviewer", role="review",
                prompt_template="default:review", output_key="review_result",
                verdict=True,
            ),
        ]
        config = PipelineConfig(
            max_iterations=1,
            language="en",
            inputs={"plan": "Plan", "checklist": "CL"},
            agents=dict(BUILTIN_AGENTS),
            pipeline=steps,
            preset_name="simple",
        )
        coding_result = AgentResult(
            output="diff --git a/file ...",
            exit_code=0,
            agent_name="claude-coder",
            step_name="coding",
            duration_seconds=10.0,
            transcript="# Agent Execution Transcript\n## Command\nclaude ...\n## Stdout\nok",
            command_preview="claude --setting-sources user",
        )
        review_result = AgentResult(
            output="All good.\n\nVERDICT: PASS",
            exit_code=0,
            agent_name="claude-reviewer",
            step_name="review",
            duration_seconds=5.0,
            transcript="# Agent Execution Transcript\n## Command\nclaude -p ...\n## Stdout\nAll good.",
            command_preview="claude -p --setting-sources user",
        )
        iteration = IterationResult(
            iteration=1,
            step_results={
                "coding_output": coding_result,
                "review_result": review_result,
            },
            step_outputs={
                "coding_output": "diff --git a/file ...",
                "review_result": "All good.\n\nVERDICT: PASS",
            },
            verdict="PASS",
        )
        pipeline_result = PipelineResult(
            iterations=[iteration],
            final_verdict="PASS",
            total_duration=15.0,
        )
        return config, pipeline_result

    def test_report_contains_command_preview(self) -> None:
        config, result = self._make_pipeline_result()
        report = build_report(config, result)
        self.assertIn("claude --setting-sources user", report)
        self.assertIn("**Command**", report)

    def test_report_contains_transcript_excerpt(self) -> None:
        config, result = self._make_pipeline_result()
        report = build_report(config, result)
        self.assertIn("Execution transcript", report)
        self.assertIn("Agent Execution Transcript", report)

    def test_report_contains_exit_code(self) -> None:
        config, result = self._make_pipeline_result()
        report = build_report(config, result)
        self.assertIn("**Exit code**: 0", report)
# ---------------------------------------------------------------------------
# 4. Claude agentic hardened failure detection
# ---------------------------------------------------------------------------


class TestClaimsFileChangesExpanded(unittest.TestCase):
    """Expanded change-claim markers detect more Claude output patterns."""

    def test_ive_implemented(self) -> None:
        self.assertTrue(_claims_file_changes("I've implemented the feature"))

    def test_ive_updated(self) -> None:
        self.assertTrue(_claims_file_changes("I've updated the config"))

    def test_made_the_following_changes(self) -> None:
        self.assertTrue(_claims_file_changes("I made the following changes to the file"))

    def test_applied_the_fix(self) -> None:
        self.assertTrue(_claims_file_changes("Applied the fix for the bug"))

    def test_changes_have_been_applied(self) -> None:
        self.assertTrue(_claims_file_changes("Changes have been applied successfully"))

    def test_wrote_the_code(self) -> None:
        self.assertTrue(_claims_file_changes("Wrote the code for the new module"))

    def test_refactored(self) -> None:
        self.assertTrue(_claims_file_changes("I refactored the pipeline"))

    def test_no_changes_still_returns_false(self) -> None:
        self.assertFalse(_claims_file_changes("No changes were necessary"))

    def test_empty_string_returns_false(self) -> None:
        self.assertFalse(_claims_file_changes(""))


class TestWriteFailureIndicators(unittest.TestCase):
    """_has_write_failure_indicators detects stderr patterns."""

    def test_permission_denied(self) -> None:
        self.assertTrue(_has_write_failure_indicators("Error: Permission denied"))

    def test_read_only_filesystem(self) -> None:
        self.assertTrue(_has_write_failure_indicators("read-only file system"))

    def test_sandbox_restriction(self) -> None:
        self.assertTrue(_has_write_failure_indicators("Blocked by sandbox policy"))

    def test_eacces(self) -> None:
        self.assertTrue(_has_write_failure_indicators("EACCES: operation not permitted"))

    def test_empty_stderr_returns_false(self) -> None:
        self.assertFalse(_has_write_failure_indicators(""))

    def test_normal_stderr_returns_false(self) -> None:
        self.assertFalse(_has_write_failure_indicators("Downloading model..."))
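Review note: the stderr check exercised above can be pictured as a case-insensitive substring scan. The sketch below is illustrative only. `WRITE_FAILURE_MARKERS` and `has_write_failure` are hypothetical names, and the marker list is an assumption chosen to match the test inputs, not the list used by `_has_write_failure_indicators`.

```python
# Assumed marker list, chosen to match the test inputs; the real list may differ.
WRITE_FAILURE_MARKERS = (
    "permission denied",
    "read-only file system",
    "sandbox",
    "eacces",
)


def has_write_failure(stderr: str) -> bool:
    """Return True when stderr contains any known write-failure marker."""
    lowered = stderr.lower()
    return any(marker in lowered for marker in WRITE_FAILURE_MARKERS)
```

A plain substring scan is cheap but can misfire on unrelated output that happens to mention one of the markers, so the negative cases in the test class are worth keeping as the list grows.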
class TestAgenticWriteFailureRaisesError(unittest.TestCase):
    """Agentic mode raises AgentInvocationError on stderr write-failure indicators."""

    @patch("cross_eval.worktree.capture_diff", return_value="")
    @patch("subprocess.run")
    def test_write_failure_detected_from_stderr(
        self, mock_run: MagicMock, mock_diff: MagicMock,
    ) -> None:
        mock_run.return_value = MagicMock(
            returncode=0,
            stdout="Done.",
            stderr="Error: Permission denied writing to /src/main.py",
        )
        agent = AgentConfig(
            name="claude-coder", command="claude",
            args=["--setting-sources", "user"], agentic=True,
        )
        import subprocess as _sp
        import tempfile as _tf

        with _tf.TemporaryDirectory() as td:
            wt = Path(td)
            _sp.run(["git", "init"], cwd=wt, capture_output=True, check=True)
            _sp.run(["git", "config", "user.email", "t@t.com"], cwd=wt, capture_output=True)
            _sp.run(["git", "config", "user.name", "T"], cwd=wt, capture_output=True)
            (wt / "README.md").write_text("# init\n")
            _sp.run(["git", "add", "."], cwd=wt, capture_output=True, check=True)
            _sp.run(["git", "commit", "-m", "init"], cwd=wt, capture_output=True, check=True)

            with self.assertRaises(AgentInvocationError) as ctx:
                invoke_agent_agentic(
                    agent, "implement feature", "coding",
                    worktree_path=wt, quiet=True,
                )
            self.assertEqual(ctx.exception.failure_type, "WRITE_FAILURE")
            self.assertIn("Permission denied", ctx.exception.raw_error)


class TestAgenticExpandedClaimMarkers(unittest.TestCase):
    """Agentic mode detects expanded claim markers in empty diff scenarios."""

    @patch("cross_eval.worktree.capture_diff", return_value="")
    @patch("subprocess.run")
    def test_ive_implemented_triggers_empty_diff_error(
        self, mock_run: MagicMock, mock_diff: MagicMock,
    ) -> None:
        mock_run.return_value = MagicMock(
            returncode=0,
            stdout="I've implemented the requested changes to the pipeline.",
            stderr="",
        )
        agent = AgentConfig(
            name="claude-coder", command="claude",
            args=["--setting-sources", "user"], agentic=True,
        )
        import subprocess as _sp
        import tempfile as _tf

        with _tf.TemporaryDirectory() as td:
            wt = Path(td)
            _sp.run(["git", "init"], cwd=wt, capture_output=True, check=True)
            _sp.run(["git", "config", "user.email", "t@t.com"], cwd=wt, capture_output=True)
            _sp.run(["git", "config", "user.name", "T"], cwd=wt, capture_output=True)
            (wt / "README.md").write_text("# init\n")
            _sp.run(["git", "add", "."], cwd=wt, capture_output=True, check=True)
            _sp.run(["git", "commit", "-m", "init"], cwd=wt, capture_output=True, check=True)

            with self.assertRaises(AgentInvocationError) as ctx:
                invoke_agent_agentic(
                    agent, "implement feature", "coding",
                    worktree_path=wt, quiet=True,
                )
            self.assertEqual(ctx.exception.failure_type, "EMPTY_DIFF")
# ---------------------------------------------------------------------------
# 5. Expanded claim/no-change markers
# ---------------------------------------------------------------------------


class TestExpandedClaimMarkers(unittest.TestCase):
    """New claim markers detect additional Claude output patterns."""

    def test_completed_all_the_changes(self) -> None:
        self.assertTrue(_claims_file_changes("I completed all the changes"))

    def test_finished_implementing(self) -> None:
        self.assertTrue(_claims_file_changes("Finished implementing the feature"))

    def test_all_tasks_completed(self) -> None:
        self.assertTrue(_claims_file_changes("All tasks completed successfully"))

    def test_done_with_the_implementation(self) -> None:
        self.assertTrue(_claims_file_changes("Done with the implementation"))

    def test_successfully_implemented(self) -> None:
        self.assertTrue(_claims_file_changes("Successfully implemented the changes"))

    def test_changes_are_complete(self) -> None:
        self.assertTrue(_claims_file_changes("All changes are complete"))

    def test_korean_change_summary_triggers(self) -> None:
        self.assertTrue(_claims_file_changes("모든 수정이 완료되었습니다. 아래는 변경 요약입니다."))


class TestExpandedNoChangeMarkers(unittest.TestCase):
    """New no-change markers prevent false positives."""

    def test_no_changes_needed(self) -> None:
        self.assertFalse(_claims_file_changes("No changes needed"))

    def test_no_fixes_needed(self) -> None:
        self.assertFalse(_claims_file_changes("No fixes needed for this code"))

    def test_code_is_correct_as_is(self) -> None:
        self.assertFalse(_claims_file_changes("The code is correct as-is"))

    def test_already_correct(self) -> None:
        self.assertFalse(_claims_file_changes("Implementation is already correct"))

    def test_no_action_required(self) -> None:
        self.assertFalse(_claims_file_changes("No action required"))

    def test_korean_no_change_marker(self) -> None:
        self.assertFalse(_claims_file_changes("변경할 필요 없음"))
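Review note: the claim and no-change tests together imply two marker sets, with a no-change phrase overriding any claim phrase. The sketch below shows one way to get that behaviour. `claims_changes` and both marker tuples are hypothetical, the marker lists cover only some of the English test inputs, and the rule that no-change markers win is an assumption about `_claims_file_changes`, not something the tests prove.

```python
# Hypothetical marker sets covering a subset of the English test inputs.
NO_CHANGE_MARKERS = (
    "no changes",
    "no fixes needed",
    "correct as-is",
    "already correct",
    "no action required",
)
CLAIM_MARKERS = (
    "i've implemented",
    "i've updated",
    "applied the fix",
    "changes have been applied",
    "changes are complete",
    "successfully implemented",
    "refactored",
)


def claims_changes(text: str) -> bool:
    """Return True when the text claims edits and does not disclaim them."""
    lowered = text.lower()
    # Assumed precedence: an explicit no-change phrase overrides any claim.
    if any(marker in lowered for marker in NO_CHANGE_MARKERS):
        return False
    return any(marker in lowered for marker in CLAIM_MARKERS)
```

Checking the no-change markers first is what keeps a sentence such as "No changes needed" from being read as a claim just because it contains the word "changes".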
# ---------------------------------------------------------------------------
# 6. Cross-iteration evidence propagation
# ---------------------------------------------------------------------------


class TestCrossIterationEvidencePropagation(unittest.TestCase):
    """Execution evidence from prior iterations is available to subsequent iterations."""

    def test_prior_evidence_available_in_iteration_2(self) -> None:
        """Review step in iteration 2 should see coding evidence from iteration 1."""
        with tempfile.TemporaryDirectory() as tmpdir:
            steps = [
                StepConfig(
                    name="coding", agent="claude-coder", role="coding",
                    prompt_template="default:coding", output_key="coding_output",
                ),
                StepConfig(
                    name="review", agent="claude-reviewer", role="review",
                    prompt_template="default:review", output_key="review_result",
                    verdict=True,
                ),
            ]
            config = PipelineConfig(
                output_dir=Path(tmpdir),
                max_iterations=2,
                min_iterations=1,
                language="en",
                inputs={"plan": "Test plan", "checklist": "Test checklist"},
                agents=dict(BUILTIN_AGENTS),
                coders=["claude-coder"],
                reviewers=["claude-reviewer"],
                pipeline=steps,
                preset_name="simple",
            )
            captured_prompts: list[dict] = []

            def _mock(agent_config, prompt, step_name, **kwargs):
                captured_prompts.append({
                    "step_name": step_name,
                    "prompt": prompt,
                })
                if step_name == "coding":
                    return AgentResult(
                        output="Implemented feature X",
                        exit_code=0,
                        agent_name=agent_config.name,
                        step_name=step_name,
                        duration_seconds=5.0,
                        transcript="# Transcript\nclaude ran the task",
                        command_preview="claude --setting-sources user",
                    )
                # First review: FAIL, second review: PASS.
                # The current call is already in captured_prompts, so a
                # count of 1 means this is the first review.
                review_calls = [
                    p for p in captured_prompts if p["step_name"] == "review"
                ]
                if len(review_calls) <= 1:
                    return AgentResult(
                        output="Issues found\n\nVERDICT: FAIL",
                        exit_code=0,
                        agent_name=agent_config.name,
                        step_name=step_name,
                        duration_seconds=2.0,
                        transcript="# Transcript\nreview ran",
                        command_preview="claude -p --setting-sources user",
                    )
                return AgentResult(
                    output="All good\n\nVERDICT: PASS",
                    exit_code=0,
                    agent_name=agent_config.name,
                    step_name=step_name,
                    duration_seconds=2.0,
                )

            with patch("cross_eval.pipeline.invoke_agent", side_effect=_mock):
                result = run_pipeline(config)

            self.assertEqual(result.final_verdict, "PASS")
            self.assertEqual(len(result.iterations), 2)
            # Collect every review prompt; there should be one per iteration.
            review_prompts = [
                p for p in captured_prompts
                if p["step_name"] == "review"
            ]
            self.assertEqual(len(review_prompts), 2)
            iter2_review = review_prompts[1]["prompt"]
            # The iteration-2 review runs after that iteration's coding step,
            # so these assertions cannot tell carried-forward evidence apart
            # from current-iteration evidence; they only confirm that coding
            # evidence is present in the prompt.
            self.assertIn("Exit code: 0", iter2_review)
            self.assertIn("claude-coder", iter2_review)
# ---------------------------------------------------------------------------
# 7. Report evidence summary table
# ---------------------------------------------------------------------------


class TestReportEvidenceSummaryTable(unittest.TestCase):
    """Report includes evidence summary table per iteration."""

    def test_report_contains_evidence_summary(self) -> None:
        steps = [
            StepConfig(
                name="coding", agent="claude-coder", role="coding",
                prompt_template="default:coding", output_key="coding_output",
            ),
            StepConfig(
                name="review", agent="claude-reviewer", role="review",
                prompt_template="default:review", output_key="review_result",
                verdict=True,
            ),
        ]
        config = PipelineConfig(
            max_iterations=1,
            language="en",
            inputs={"plan": "Plan", "checklist": "CL"},
            agents=dict(BUILTIN_AGENTS),
            pipeline=steps,
            preset_name="simple",
        )
        coding_result = AgentResult(
            output="diff --git a/file ...",
            exit_code=0,
            agent_name="claude-coder",
            step_name="coding",
            duration_seconds=10.0,
            transcript="# Transcript",
            command_preview="claude --setting-sources user",
        )
        review_result = AgentResult(
            output="VERDICT: PASS",
            exit_code=0,
            agent_name="claude-reviewer",
            step_name="review",
            duration_seconds=5.0,
            transcript="# Transcript",
            command_preview="claude -p",
        )
        iteration = IterationResult(
            iteration=1,
            step_results={
                "coding_output": coding_result,
                "review_result": review_result,
            },
            step_outputs={
                "coding_output": "diff --git a/file ...",
                "review_result": "VERDICT: PASS",
            },
            verdict="PASS",
        )
        pipeline_result = PipelineResult(
            iterations=[iteration],
            final_verdict="PASS",
            total_duration=15.0,
        )
        report = build_report(config, pipeline_result)
        self.assertIn("Evidence Summary", report)
        self.assertIn("claude-coder", report)
        self.assertIn("claude-reviewer", report)
        self.assertIn("10.0s", report)
        self.assertIn("5.0s", report)
# ---------------------------------------------------------------------------
# 8. _build_context merges prior and current evidence
# ---------------------------------------------------------------------------


class TestBuildContextMergesEvidence(unittest.TestCase):
    """_build_context merges prior iteration evidence with current step evidence."""

    def test_prior_evidence_used_when_no_current_results(self) -> None:
        from cross_eval.pipeline import _build_context

        input_contents = {
            "plan": "test",
            "execution_evidence": "### Step: coding (coder)\n- Exit code: 0",
        }
        context = _build_context(
            input_contents, {}, "feedback", 2, 5, step_results=None,
        )
        # Prior evidence should survive when there are no current results
        self.assertIn("coding (coder)", context["execution_evidence"])

    def test_current_and_prior_merged(self) -> None:
        from cross_eval.pipeline import _build_context

        input_contents = {
            "plan": "test",
            "execution_evidence": "### Step: coding (coder)\n- Exit code: 0",
        }
        current_result = AgentResult(
            output="review text", exit_code=0, agent_name="reviewer",
            step_name="review", duration_seconds=3.0,
            command_preview="cmd",
        )
        context = _build_context(
            input_contents, {}, "feedback", 2, 5,
            step_results={"review_result": current_result},
        )
        evidence = context["execution_evidence"]
        # Both prior and current should appear
        self.assertIn("Prior Iteration Evidence", evidence)
        self.assertIn("Current Iteration Evidence", evidence)
        self.assertIn("coding (coder)", evidence)
        self.assertIn("reviewer", evidence)
# ---------------------------------------------------------------------------
# 9. Evidence in review-only template (used by review-fix preset)
# ---------------------------------------------------------------------------
class TestReviewOnlyTemplateIncludesEvidence(unittest.TestCase):
"""review-only template includes {execution_evidence} placeholder."""
def test_review_only_template_has_evidence_placeholder(self) -> None:
from cross_eval.prompts import REVIEW_ONLY_TEMPLATE, REVIEW_ONLY_TEMPLATE_KO
self.assertIn("{execution_evidence}", REVIEW_ONLY_TEMPLATE)
self.assertIn("{execution_evidence}", REVIEW_ONLY_TEMPLATE_KO)
def test_review_only_renders_evidence(self) -> None:
from cross_eval.prompts import render_template, REVIEW_ONLY_TEMPLATE
context = {
"plan": "Test plan",
"checklist": "Test checklist",
"docs": "Test docs",
"feedback": "No feedback",
"execution_evidence": "### Step: coding (coder)\n- Exit code: 0\n- Duration: 5.0s",
"iteration": "1",
"max_iterations": "3",
}
rendered = render_template(REVIEW_ONLY_TEMPLATE, context)
self.assertIn("Exit code: 0", rendered)
self.assertIn("Duration: 5.0s", rendered)
# ---------------------------------------------------------------------------
# 10. Evidence propagation in phased pipeline (coding-review-fix)
# ---------------------------------------------------------------------------
class TestPhasedPipelineEvidencePropagation(unittest.TestCase):
"""Evidence propagates correctly in coding-review-fix phased pipeline."""
def test_reviewer_receives_coding_evidence_in_phased_pipeline(self) -> None:
"""In coding-review-fix, review-phase reviewers see coding-phase evidence."""
from cross_eval.prompts import _build_coding_review_fix_preset
with tempfile.TemporaryDirectory() as tmpdir:
coders = ["claude-coder"]
reviewers = ["claude-reviewer"]
seniors = ["claude-senior"]
phases = _build_coding_review_fix_preset(coders, reviewers, seniors)
config = PipelineConfig(
output_dir=Path(tmpdir),
max_iterations=5,
min_iterations=1,
language="en",
inputs={"plan": "Test plan", "checklist": "Test checklist"},
agents=dict(BUILTIN_AGENTS),
coders=coders,
reviewers=reviewers,
seniors=seniors,
phases=phases,
preset_name="coding-review-fix",
)
captured_prompts: list[dict] = []
def _mock(agent_config, prompt, step_name, **kwargs):
captured_prompts.append({
"step_name": step_name,
"prompt": prompt,
"agent_name": agent_config.name,
})
if step_name == "coding":
return AgentResult(
output="Implemented feature X",
exit_code=0,
agent_name=agent_config.name,
step_name=step_name,
duration_seconds=10.0,
transcript="# Transcript\nclaude executed coding task",
command_preview="claude --setting-sources user",
)
if step_name == "verify":
return AgentResult(
output="All good\n\nVERDICT: PASS",
exit_code=0,
agent_name=agent_config.name,
step_name=step_name,
duration_seconds=3.0,
)
return AgentResult(
output=f"Output for {step_name}",
exit_code=0,
agent_name=agent_config.name,
step_name=step_name,
duration_seconds=2.0,
transcript=f"# Transcript for {step_name}",
command_preview=f"cmd-{step_name}",
)
with patch("cross_eval.pipeline.invoke_agent", side_effect=_mock):
result = run_pipeline(config)
self.assertEqual(result.final_verdict, "PASS")
# Check that review-phase reviewers received evidence
review_prompts = [
p for p in captured_prompts
if p["step_name"].startswith("review_")
]
self.assertTrue(len(review_prompts) >= 1)
# The review prompt should contain evidence from the coding phase
review_prompt = review_prompts[0]["prompt"]
self.assertIn("Execution Evidence", review_prompt)
# ---------------------------------------------------------------------------
# 11. Evidence format includes output size
# ---------------------------------------------------------------------------
class TestEvidenceIncludesOutputSize(unittest.TestCase):
"""_format_execution_evidence includes output size for debugging."""
def test_output_size_in_evidence(self) -> None:
result = AgentResult(
output="x" * 500,
exit_code=0,
agent_name="claude-coder",
step_name="coding",
duration_seconds=5.0,
command_preview="claude --setting-sources user",
)
evidence = _format_execution_evidence({"coding_output": result})
self.assertIn("Output size: 500 chars", evidence)
# ---------------------------------------------------------------------------
# 12. Report transcript label i18n
# ---------------------------------------------------------------------------
class TestReportTranscriptLabelI18n(unittest.TestCase):
"""Report uses translated transcript label."""
def test_korean_transcript_label(self) -> None:
steps = [
StepConfig(
name="coding", agent="claude-coder", role="coding",
prompt_template="default:coding", output_key="coding_output",
),
]
config = PipelineConfig(
max_iterations=1,
language="ko",
inputs={"plan": "Plan", "checklist": "CL"},
agents=dict(BUILTIN_AGENTS),
pipeline=steps,
preset_name="simple",
)
coding_result = AgentResult(
output="diff --git a/file ...",
exit_code=0,
agent_name="claude-coder",
step_name="coding",
duration_seconds=10.0,
transcript="# Agent Execution Transcript\n## Command\nclaude ...",
command_preview="claude --setting-sources user",
)
iteration = IterationResult(
iteration=1,
step_results={"coding_output": coding_result},
step_outputs={"coding_output": "diff --git a/file ..."},
)
pipeline_result = PipelineResult(
iterations=[iteration],
final_verdict="MAX_ITERATIONS_REACHED",
total_duration=10.0,
)
report = build_report(config, pipeline_result)
self.assertIn("실행 트랜스크립트", report)
# ---------------------------------------------------------------------------
# 13. Claude coder + Codex reviewer/senior combination
# ---------------------------------------------------------------------------
class TestCodingReviewFixClaudeCodexCombination(unittest.TestCase):
"""coding-review-fix works with Claude as coder and Codex as reviewer/senior."""
def test_claude_coder_codex_reviewer_completes(self) -> None:
"""Verify the preset completes with mixed Claude/Codex agents."""
from cross_eval.prompts import _build_coding_review_fix_preset
with tempfile.TemporaryDirectory() as tmpdir:
coders = ["claude-coder"]
reviewers = ["codex-reviewer"]
seniors = ["codex-senior"]
phases = _build_coding_review_fix_preset(coders, reviewers, seniors)
config = PipelineConfig(
output_dir=Path(tmpdir),
max_iterations=5,
min_iterations=1,
language="en",
inputs={"plan": "Test plan", "checklist": "Test checklist"},
agents=dict(BUILTIN_AGENTS),
coders=coders,
reviewers=reviewers,
seniors=seniors,
phases=phases,
preset_name="coding-review-fix",
)
def _mock(agent_config, prompt, step_name, **kwargs):
if step_name == "verify":
return AgentResult(
output="All good\n\nVERDICT: PASS",
exit_code=0,
agent_name=agent_config.name,
step_name=step_name,
duration_seconds=2.0,
transcript="# Transcript",
command_preview="codex exec",
)
return AgentResult(
output=f"Output for {step_name}",
exit_code=0,
agent_name=agent_config.name,
step_name=step_name,
duration_seconds=3.0,
transcript=f"# Transcript for {step_name}",
command_preview=f"cmd-{step_name}",
)
with patch("cross_eval.pipeline.invoke_agent", side_effect=_mock):
result = run_pipeline(config)
self.assertEqual(result.final_verdict, "PASS")
# Verify both Claude and Codex agents were used
all_agents = set()
for ir in result.iterations:
for ar in ir.step_results.values():
all_agents.add(ar.agent_name)
self.assertIn("claude-coder", all_agents)
self.assertIn("codex-reviewer", all_agents)
if __name__ == "__main__":
unittest.main()
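
The evidence tests above pin down an output shape without showing the code that produces it. The sketch below is one way that formatting and merging could look. It is not the project's `_format_execution_evidence` or `_build_context`: `StepEvidence`, `format_evidence`, and `merge_evidence` are hypothetical names, and only the section titles and bullet labels are taken from the assertions in the tests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepEvidence:
    """Hypothetical stand-in for the AgentResult fields the tests rely on."""
    step_name: str
    agent_name: str
    exit_code: int
    duration_seconds: float
    output: str


def format_evidence(results: dict[str, StepEvidence]) -> str:
    """Render one markdown section per step result."""
    sections = []
    for result in results.values():
        sections.append(
            f"### Step: {result.step_name} ({result.agent_name})\n"
            f"- Exit code: {result.exit_code}\n"
            f"- Duration: {result.duration_seconds:.1f}s\n"
            f"- Output size: {len(result.output)} chars"
        )
    return "\n\n".join(sections)


def merge_evidence(prior: str, current: Optional[dict[str, StepEvidence]]) -> str:
    """Return prior evidence unchanged when nothing new ran; otherwise label both."""
    if not current:
        return prior
    current_text = format_evidence(current)
    if not prior:
        return current_text
    return (
        "## Prior Iteration Evidence\n\n" + prior
        + "\n\n## Current Iteration Evidence\n\n" + current_text
    )
```

Returning the prior text untouched when `current` is empty mirrors `test_prior_evidence_used_when_no_current_results`. Labelling both halves mirrors `test_current_and_prior_merged`.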

@@ -11,7 +11,6 @@ from cross_eval.doctor import (
 check_cli_installed,
 check_config,
 format_doctor_results,
-run_doctor,
 )
 from cross_eval.demo import (
 DEMO_CHECKLIST,
@@ -56,7 +55,7 @@ class DoctorCheckInstalledTest(unittest.TestCase):
 config_path = ce_dir / "config.yaml"
 config_path.write_text(
 "inputs:\n plan: plan.md\ncoders: [claude-coder]\n"
-"reviewers: [claude-reviewer]\npipeline: preset:simple\n",
+"reviewers: [claude-reviewer]\npipeline: preset:coding-plan-review\n",
 encoding="utf-8",
 )
 # Also create plan.md so validation passes
@@ -138,22 +137,22 @@ class DemoTest(unittest.TestCase):
 def test_mock_demo_runs_without_error(self) -> None:
 # Should not raise
 with patch("sys.stdout"):
-run_mock_demo(preset="simple")
+run_mock_demo(preset="coding-plan-review")
 def test_mock_demo_escalate_runs_without_error(self) -> None:
 with patch("sys.stdout"):
-run_mock_demo(preset="simple", show_escalate=True)
+run_mock_demo(preset="coding-plan-review", show_escalate=True)
 def test_cmd_demo_mock_default(self) -> None:
 with patch("cross_eval.demo.run_mock_demo") as mock:
 exit_code = main(["demo"])
-mock.assert_called_once_with(preset="simple", show_escalate=False)
+mock.assert_called_once_with(preset="coding-plan-review", show_escalate=False)
 self.assertEqual(exit_code, 0)
 def test_cmd_demo_escalate_flag(self) -> None:
 with patch("cross_eval.demo.run_mock_demo") as mock:
 exit_code = main(["demo", "--escalate"])
-mock.assert_called_once_with(preset="simple", show_escalate=True)
+mock.assert_called_once_with(preset="coding-plan-review", show_escalate=True)
 self.assertEqual(exit_code, 0)
 def test_cmd_demo_live_requires_confirmation(self) -> None:

@@ -8,14 +8,16 @@ from unittest.mock import patch
 from cross_eval.config import BUILTIN_AGENTS
 from cross_eval.models import (
-AgentConfig,
 AgentResult,
-PhaseConfig,
 PipelineConfig,
 StepConfig,
 )
 from cross_eval.pipeline import run_pipeline
-from cross_eval.prompts import _build_review_fix_preset, _build_simple_preset
+from cross_eval.prompts import (
+_build_plan_review_preset,
+_build_review_fix_preset,
+_build_simple_preset,
+)
 def _make_mock_agent(outputs: list[str]):
@@ -264,6 +266,60 @@ class TestPhasedPipelineEscalateBreaksPhase(unittest.TestCase):
 self.assertTrue(len(result.escalated_issues) > 0)
+class TestPlanReviewPipelineLoopsUntilVerifyPass(unittest.TestCase):
+"""Document plan-review should revise docs and re-verify across iterations."""
+def test_plan_review_fail_then_pass(self) -> None:
+with tempfile.TemporaryDirectory() as tmpdir:
+coders = ["claude-coder"]
+reviewers = ["claude-reviewer"]
+seniors = ["claude-senior"]
+steps = _build_plan_review_preset(coders, reviewers, seniors)
+config = PipelineConfig(
+output_dir=Path(tmpdir),
+max_iterations=4,
+min_iterations=1,
+language="en",
+inputs={
+"plan": "Test plan",
+"checklist": "Test checklist",
+"docs": "Reference docs",
+},
+agents=dict(BUILTIN_AGENTS),
+coders=coders,
+reviewers=reviewers,
+seniors=seniors,
+pipeline=steps,
+preset_name="plan-review",
+)
+mock = _make_step_mock({
+"plan_review": [
+"Requirements are ambiguous\n\nVERDICT: FAIL",
+"Looks aligned\n\nVERDICT: PASS",
+],
+"aggregate_review": [
+"### Confirmed Issues\n- Clarify acceptance criteria\n\n"
+"### Action Items\n1. Tighten the checklist\n\nVERDICT: FAIL",
+"### Confirmed Issues\nNone\n\n"
+"### Dismissed Findings\nNone\n\n"
+"### Action Items\n1. No document changes needed\n\nVERDICT: PASS",
+],
+"plan_fix": ["Updated plan and checklist", "No-op"],
+"verify": [
+"Still missing edge-case criteria\n\nVERDICT: FAIL",
+"Planning package is now implementable\n\nVERDICT: PASS",
+],
+})
+with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
+result = run_pipeline(config)
+self.assertEqual(result.final_verdict, "PASS")
+self.assertEqual(len(result.iterations), 2)
 class TestAutoEscalateFiresWithoutSenior(unittest.TestCase):
 """Test 6: simple pipeline without senior, same FAIL feedback 3 times -> auto-escalate."""
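
The plan-review test above drives the pipeline through a `_make_step_mock` helper that this hunk does not show. As a rough illustration of the idea, the sketch below queues canned outputs per step name and hands them out in call order. `FakeResult` and `make_step_mock` are hypothetical, and two behaviours are assumptions rather than facts about the real helper: repeating the last queued output once a queue is exhausted, and falling back to a generic string for unlisted steps.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Hypothetical stand-in for AgentResult, reduced to what the sketch needs."""
    output: str
    step_name: str
    agent_name: str
    exit_code: int = 0


def make_step_mock(outputs_by_step: dict[str, list[str]]):
    """Build a side_effect callable that replays queued outputs per step name."""
    calls: dict[str, int] = defaultdict(int)

    def _mock(agent_config, prompt: str, step_name: str, **kwargs) -> FakeResult:
        queue = outputs_by_step.get(step_name, [])
        index = calls[step_name]
        calls[step_name] += 1
        if queue:
            # Assumption: once the queue runs out, keep returning its last entry.
            text = queue[min(index, len(queue) - 1)]
        else:
            # Assumption: steps without a queue get a generic placeholder output.
            text = f"Output for {step_name}"
        agent_name = getattr(agent_config, "name", "unknown")
        return FakeResult(output=text, step_name=step_name, agent_name=agent_name)

    return _mock
```

A callable like this can be passed as `side_effect` to `unittest.mock.patch`, so a first `verify` call yields a FAIL verdict and a second yields PASS, which is the two-iteration shape the test asserts.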

@@ -0,0 +1,407 @@
from __future__ import annotations
import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch
from cross_eval.agent import invoke_agent
from cross_eval.config import BUILTIN_AGENTS
from cross_eval.discovery import discover_repo, format_repo_discovery
from cross_eval.models import AgentConfig, AgentResult, PipelineConfig
from cross_eval.pipeline import run_pipeline
from cross_eval.prompts import _build_simple_preset
from cross_eval.runtime_env import build_runtime_environment, summarize_environment
class RuntimeEnvTest(unittest.TestCase):
def test_build_runtime_environment_loads_dotenv_values(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / ".env").write_text(
"CLICKHOUSE_URL=http://localhost:8123\nDATABASE_URL=postgres://db\n",
encoding="utf-8",
)
execution = PipelineConfig().execution
env, loaded_files, loaded_values = build_runtime_environment(execution, root)
self.assertEqual(loaded_files[0].name, ".env")
self.assertEqual(loaded_values["CLICKHOUSE_URL"], "http://localhost:8123")
self.assertEqual(env["DATABASE_URL"], "postgres://db")
def test_summarize_environment_mentions_clickhouse_from_env(self) -> None:
execution = PipelineConfig().execution
summary = summarize_environment(
execution,
[Path("/tmp/.env")],
{"CLICKHOUSE_URL": "http://localhost:8123"},
{"CLICKHOUSE_URL": "http://localhost:8123"},
)
self.assertIn("CLICKHOUSE_URL", summary)
self.assertIn("ClickHouse-related", summary)
class RepoDiscoveryTest(unittest.TestCase):
def test_discover_repo_detects_python_postgres_and_clickhouse(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "pyproject.toml").write_text(
'[project]\nname = "svc"\ndependencies = ["psycopg", "clickhouse-driver"]\n',
encoding="utf-8",
)
(root / "docker-compose.yml").write_text(
"services:\n db:\n image: postgres:16\n ch:\n image: clickhouse/clickhouse-server:latest\n",
encoding="utf-8",
)
discovery = discover_repo(root, {"DATABASE_URL", "CLICKHOUSE_URL"})
summary = format_repo_discovery(discovery)
self.assertIn("python", discovery.languages)
self.assertIn("postgresql", discovery.databases)
self.assertIn("clickhouse", discovery.databases)
self.assertIn("Detected local service containers", summary)
class PromptContextTest(unittest.TestCase):
def test_run_pipeline_injects_env_and_discovery_context_into_prompt(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / ".env").write_text("CLICKHOUSE_URL=http://localhost:8123\n", encoding="utf-8")
steps = _build_simple_preset(["claude-coder"], ["claude-reviewer"], [])
config = PipelineConfig(
output_dir=root / "out",
max_iterations=1,
language="en",
inputs={"plan": "Plan", "checklist": "Checklist"},
agents={name: agent for name, agent in BUILTIN_AGENTS.items()},
coders=["claude-coder"],
reviewers=["claude-reviewer"],
pipeline=steps,
preset_name="simple",
)
prompts: list[str] = []
def _fake_invoke(agent_config, prompt, step_name, **kwargs):
prompts.append(prompt)
output = "VERDICT: PASS" if step_name == "review" else "coding output"
return AgentResult(
output=output,
exit_code=0,
agent_name=agent_config.name,
step_name=step_name,
duration_seconds=0.1,
transcript="# Agent Execution Transcript",
)
with patch("cross_eval.pipeline.invoke_agent", side_effect=_fake_invoke):
run_pipeline(config, cwd=root)
joined = "\n".join(prompts)
self.assertIn("Execution Policy", joined)
self.assertIn("Environment Context", joined)
self.assertIn("Repository Discovery", joined)
self.assertIn("ClickHouse-related environment variables are available", joined)
self.assertTrue((root / "out").exists())
class AgentTranscriptTest(unittest.TestCase):
def test_invoke_agent_records_transcript(self) -> None:
def _fake_run(cmd, **kwargs):
class _Result:
returncode = 0
stdout = "hello"
stderr = "warn"
return _Result()
agent = AgentConfig(
name="codex-reviewer",
command="codex",
args=["exec", "--model", "gpt-5.4", "-"],
)
with patch("subprocess.run", side_effect=_fake_run):
result = invoke_agent(agent, "prompt", "review", quiet=True)
self.assertIn("## Command", result.transcript)
self.assertIn("hello", result.transcript)
self.assertIn("warn", result.transcript)
def test_invoke_agent_transcript_includes_exit_code_and_duration(self) -> None:
def _fake_run(cmd, **kwargs):
class _Result:
returncode = 0
stdout = "output"
stderr = ""
return _Result()
agent = AgentConfig(
name="codex-reviewer",
command="codex",
args=["exec", "--model", "gpt-5.4", "-"],
)
with patch("subprocess.run", side_effect=_fake_run):
result = invoke_agent(agent, "prompt", "review", quiet=True)
self.assertIn("## Exit Code: 0", result.transcript)
class RepoDiscoveryExtendedTest(unittest.TestCase):
"""Regression tests for broadened repo/service discovery signals."""
def test_discover_go_project(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "go.mod").write_text(
"module example.com/myapp\n\ngo 1.21\n",
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("go", discovery.languages)
self.assertIn("go", discovery.package_managers)
def test_discover_rust_project(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "Cargo.toml").write_text(
'[package]\nname = "myapp"\nversion = "0.1.0"\n',
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("rust", discovery.languages)
self.assertIn("cargo", discovery.package_managers)
def test_discover_ruby_project(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "Gemfile").write_text(
'source "https://rubygems.org"\ngem "rails"\n',
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("ruby", discovery.languages)
self.assertIn("bundler", discovery.package_managers)
def test_discover_java_gradle_project(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "build.gradle").write_text(
"plugins { id 'java' }\n",
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("java", discovery.languages)
self.assertIn("gradle", discovery.package_managers)
def test_discover_elasticsearch_from_compose(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "docker-compose.yml").write_text(
"services:\n es:\n image: elasticsearch:8.10.0\n",
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("elasticsearch", discovery.services)
def test_discover_kafka_from_compose(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "docker-compose.yml").write_text(
"services:\n broker:\n image: confluentinc/cp-kafka:latest\n",
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("kafka", discovery.services)
def test_discover_rabbitmq_from_env(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
discovery = discover_repo(root, {"RABBITMQ_URL"})
self.assertIn("rabbitmq", discovery.databases)
def test_discover_sqlite_from_requirements(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "requirements.txt").write_text(
"aiosqlite==0.19.0\nfastapi\n",
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("python", discovery.languages)
self.assertIn("sqlite", discovery.databases)
def test_discover_dynamodb_from_env(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
discovery = discover_repo(root, {"DYNAMODB_TABLE"})
self.assertIn("dynamodb", discovery.databases)
def test_discover_frameworks_from_pyproject(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "pyproject.toml").write_text(
'[project]\nname = "svc"\ndependencies = ["fastapi", "uvicorn"]\n',
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("fastapi", discovery.frameworks)
def test_discover_knex_hint(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "knexfile.js").write_text(
"module.exports = {};\n",
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("Knex migration config detected.", discovery.hints)
def test_discover_makefile_hint(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "Makefile").write_text(
"all:\n\techo hello\n",
encoding="utf-8",
)
discovery = discover_repo(root)
self.assertIn("Makefile available for build/task automation.", discovery.hints)
def test_format_repo_discovery_includes_frameworks(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "package.json").write_text(
'{"dependencies": {"express": "^4.18.0"}}',
encoding="utf-8",
)
discovery = discover_repo(root)
summary = format_repo_discovery(discovery)
self.assertIn("Detected frameworks", summary)
self.assertIn("express", summary)
def test_discover_pnpm_lockfile(self) -> None:
"""Detect pnpm from lockfile when no packageManager field."""
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "package.json").write_text(
'{"name": "app"}',
encoding="utf-8",
)
(root / "pnpm-lock.yaml").write_text("lockfileVersion: 6\n", encoding="utf-8")
discovery = discover_repo(root)
self.assertIn("pnpm", discovery.package_managers)
def test_discover_yarn_lockfile(self) -> None:
"""Detect yarn from lockfile when no packageManager field."""
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
(root / "package.json").write_text(
'{"name": "app"}',
encoding="utf-8",
)
(root / "yarn.lock").write_text("# yarn lockfile v1\n", encoding="utf-8")
discovery = discover_repo(root)
self.assertIn("yarn", discovery.package_managers)
class SummarizeEnvExtendedTest(unittest.TestCase):
"""Regression tests for expanded environment summary prefixes."""
def test_summarize_shows_mongo_env_var(self) -> None:
execution = PipelineConfig().execution
summary = summarize_environment(
execution,
[Path("/tmp/.env")],
{"MONGO_URI": "mongodb://localhost"},
{"MONGO_URI": "mongodb://localhost"},
)
self.assertIn("MONGO_URI", summary)
def test_summarize_shows_kafka_env_var(self) -> None:
execution = PipelineConfig().execution
summary = summarize_environment(
execution,
[Path("/tmp/.env")],
{"KAFKA_BOOTSTRAP_SERVERS": "localhost:9092"},
{"KAFKA_BOOTSTRAP_SERVERS": "localhost:9092"},
)
self.assertIn("KAFKA_BOOTSTRAP_SERVERS", summary)
def test_summarize_shows_elasticsearch_env_var(self) -> None:
execution = PipelineConfig().execution
summary = summarize_environment(
execution,
[Path("/tmp/.env")],
{"ELASTICSEARCH_URL": "http://localhost:9200"},
{"ELASTICSEARCH_URL": "http://localhost:9200"},
)
self.assertIn("ELASTICSEARCH_URL", summary)
class TranscriptSavingRegressionTest(unittest.TestCase):
"""Verify that transcripts are saved as step artifacts during pipeline runs."""
def test_transcript_files_saved_during_pipeline(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
steps = _build_simple_preset(["claude-coder"], ["claude-reviewer"], [])
config = PipelineConfig(
output_dir=root / "out",
max_iterations=1,
language="en",
inputs={"plan": "Plan", "checklist": "Checklist"},
agents={name: agent for name, agent in BUILTIN_AGENTS.items()},
coders=["claude-coder"],
reviewers=["claude-reviewer"],
pipeline=steps,
preset_name="simple",
)
def _fake_invoke(agent_config, prompt, step_name, **kwargs):
output = "VERDICT: PASS" if step_name == "review" else "coding output"
return AgentResult(
output=output,
exit_code=0,
agent_name=agent_config.name,
step_name=step_name,
duration_seconds=0.1,
transcript="# Agent Execution Transcript\n\n## Command\n```\nclaude -p\n```",
)
with patch("cross_eval.pipeline.invoke_agent", side_effect=_fake_invoke):
result = run_pipeline(config, cwd=root)
# Verify transcript files were saved
run_dir = result.run_dir
self.assertIsNotNone(run_dir)
assert run_dir is not None
coding_transcript = run_dir / "v1" / "coding_transcript.md"
review_transcript = run_dir / "v1" / "review_transcript.md"
self.assertTrue(
coding_transcript.exists(),
f"Expected transcript at {coding_transcript}",
)
self.assertTrue(
review_transcript.exists(),
f"Expected transcript at {review_transcript}",
)
if __name__ == "__main__":
unittest.main()
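
The discovery tests above imply that languages and package managers are inferred from marker files in the repository root. A minimal sketch of that idea follows. It is not the real `discover_repo`: `Discovery`, `MARKERS`, `LOCKFILES`, and `discover_by_markers` are hypothetical names, using sets for the result fields is an assumption, and the file-to-tool pairs are copied from the test cases.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path

# Marker file -> (language, package manager), as exercised by the tests.
MARKERS = {
    "go.mod": ("go", "go"),
    "Cargo.toml": ("rust", "cargo"),
    "Gemfile": ("ruby", "bundler"),
    "build.gradle": ("java", "gradle"),
}

# JavaScript lockfile -> package manager.
LOCKFILES = {"pnpm-lock.yaml": "pnpm", "yarn.lock": "yarn"}


@dataclass
class Discovery:
    """Hypothetical result container; the real one may use other field types."""
    languages: set[str] = field(default_factory=set)
    package_managers: set[str] = field(default_factory=set)


def discover_by_markers(root: Path) -> Discovery:
    """Record a language or package manager for each marker file found in root."""
    found = Discovery()
    for filename, (language, manager) in MARKERS.items():
        if (root / filename).is_file():
            found.languages.add(language)
            found.package_managers.add(manager)
    for filename, manager in LOCKFILES.items():
        if (root / filename).is_file():
            found.package_managers.add(manager)
    return found


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        repo_root = Path(tmpdir)
        (repo_root / "Cargo.toml").write_text('[package]\nname = "myapp"\n', encoding="utf-8")
        print(discover_by_markers(repo_root))
```

Keeping the mapping in plain dictionaries means a new ecosystem is a one-line addition, and each addition can be covered by a temporary-directory test of the same shape as the ones in this file.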

tests/test_runtime_misc.py Normal file

@@ -0,0 +1,988 @@
from __future__ import annotations
import re
import subprocess
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, patch
from cross_eval.agent import (
AgentInvocationError,
_build_transcript,
_classify_agent_failure,
invoke_agent,
invoke_agent_agentic,
)
from cross_eval.models import AgentConfig, AgentResult, ExecutionConfig, PipelineConfig, StepConfig
from cross_eval.pipeline import (
_apply_worktree_inputs_to_base,
_commit_base_repo_paths,
_copy_inputs_to_worktree,
_commit_iteration,
_execute_parallel_batch,
_execute_step,
_finalize_worktree,
_format_runtime_error_markdown,
_load_inputs,
_maybe_save_step_transcript,
_refresh_inputs,
_snapshot_repo_state,
)
from cross_eval.runtime_env import (
build_execution_policy,
parse_dotenv,
resolve_env_files,
summarize_environment,
)
from cross_eval.worktree import WorktreeError, create_worktree, remove_worktree
def _init_git_repo(path: Path) -> None:
subprocess.run(["git", "init"], cwd=path, capture_output=True, check=True)
subprocess.run(
["git", "config", "user.email", "test@test.com"],
cwd=path,
capture_output=True,
check=True,
)
subprocess.run(
["git", "config", "user.name", "Test"],
cwd=path,
capture_output=True,
check=True,
)
(path / "README.md").write_text("# init\n", encoding="utf-8")
subprocess.run(["git", "add", "."], cwd=path, capture_output=True, check=True)
subprocess.run(
["git", "commit", "-m", "initial"],
cwd=path,
capture_output=True,
check=True,
)
class TestInvokeAgentRuntime(unittest.TestCase):
@patch("cross_eval.agent.subprocess.run")
def test_interactive_claude_reads_output_file(self, mock_run: MagicMock) -> None:
def _fake_run(cmd: list[str], **kwargs: object) -> MagicMock:
match = re.search(r"Write your complete output to (.+)\.$", cmd[-1])
self.assertIsNotNone(match)
assert match is not None
Path(match.group(1)).write_text("review result", encoding="utf-8")
return MagicMock(returncode=0, stdout="", stderr="")
mock_run.side_effect = _fake_run
agent = AgentConfig(
name="claude-reviewer",
command="claude",
args=["--model", "opus"],
system_prompt="system",
)
result = invoke_agent(agent, "inspect code", "review", quiet=True)
self.assertEqual(result.output, "review result")
called_cmd = mock_run.call_args[0][0]
self.assertIn("--system-prompt", called_cmd)
@patch("cross_eval.agent.subprocess.run")
def test_interactive_claude_falls_back_to_stdout(self, mock_run: MagicMock) -> None:
mock_run.return_value = MagicMock(returncode=0, stdout="stdout fallback", stderr="")
agent = AgentConfig(name="claude-reviewer", command="claude", args=["--model", "opus"])
result = invoke_agent(agent, "inspect code", "review", quiet=True)
self.assertEqual(result.output, "stdout fallback")
@patch("cross_eval.agent.subprocess.run")
def test_non_claude_wraps_system_prompt_in_stdin(self, mock_run: MagicMock) -> None:
mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
agent = AgentConfig(
name="custom-reviewer",
command="custom-cli",
args=["run"],
system_prompt="strict mode",
)
invoke_agent(agent, "check things", "review", quiet=True)
self.assertEqual(
mock_run.call_args.kwargs["input"],
"<system>\nstrict mode\n</system>\n\ncheck things",
)
@patch("cross_eval.agent.subprocess.run")
def test_failure_raises_structured_error(self, mock_run: MagicMock) -> None:
mock_run.return_value = MagicMock(returncode=1, stdout="", stderr="API Error: backend down")
agent = AgentConfig(name="codex-reviewer", command="codex", args=["exec", "-"])
with self.assertRaises(AgentInvocationError) as ctx:
invoke_agent(agent, "check", "review", quiet=True)
self.assertEqual(ctx.exception.failure_type, "API_ERROR")
self.assertIn("backend down", ctx.exception.raw_error)
class TestWorktreeInputMapping(unittest.TestCase):
def test_repo_local_plan_input_maps_to_tracked_worktree_path(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
repo = Path(tmpdir) / "repo"
repo.mkdir()
_init_git_repo(repo)
(repo / "plan.md").write_text("plan v1\n", encoding="utf-8")
subprocess.run(["git", "add", "plan.md"], cwd=repo, capture_output=True, check=True)
subprocess.run(
["git", "commit", "-m", "add plan"],
cwd=repo,
capture_output=True,
check=True,
)
worktree_dir = Path(tmpdir) / "wt"
branch = "cross-eval/test-plan-review"
worktree_path, _ = create_worktree(repo, worktree_dir, branch)
try:
config = PipelineConfig(
inputs={"plan": repo / "plan.md"},
preset_name="plan-review",
)
_copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
self.assertEqual(config.inputs["plan"], worktree_path / "plan.md")
finally:
remove_worktree(base_cwd=repo, work_dir=worktree_path)
subprocess.run(
["git", "branch", "-D", branch],
cwd=repo,
capture_output=True,
)
def test_plan_review_docs_ref_maps_to_worktree_and_refreshes_docs(self) -> None:
with tempfile.TemporaryDirectory() as tmpdir:
repo = Path(tmpdir) / "repo"
repo.mkdir()
_init_git_repo(repo)
docs_dir = repo / "plans"
docs_dir.mkdir()
(docs_dir / "A.md").write_text("A v1\n", encoding="utf-8")
subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
subprocess.run(
["git", "commit", "-m", "add docs"],
cwd=repo,
capture_output=True,
check=True,
)
config = PipelineConfig(
inputs={
"docs": "stale snapshot",
"docs_ref": docs_dir,
},
preset_name="plan-review",
)
input_contents = _load_inputs(config)
self.assertIn("A.md", input_contents["docs"])
worktree_dir = Path(tmpdir) / "wt"
branch = "cross-eval/test-docs-ref"
worktree_path, _ = create_worktree(repo, worktree_dir, branch)
try:
_copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
self.assertEqual(config.inputs["docs_ref"], worktree_path / "plans")
updated = worktree_path / "plans" / "A.md"
updated.write_text("A v2\n", encoding="utf-8")
_refresh_inputs(config, input_contents)
self.assertIn("A.md", input_contents["docs"])
self.assertIn("A v2", input_contents["docs"])
finally:
remove_worktree(base_cwd=repo, work_dir=worktree_path)
subprocess.run(
["git", "branch", "-D", branch],
cwd=repo,
capture_output=True,
)

    def test_worktree_doc_changes_apply_back_and_commit_in_base_repo(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir) / "repo"
            repo.mkdir()
            _init_git_repo(repo)
            docs_dir = repo / "plans"
            docs_dir.mkdir()
            doc_path = docs_dir / "A.md"
            doc_path.write_text("A v1\n", encoding="utf-8")
            subprocess.run(["git", "add", "."], cwd=repo, capture_output=True, check=True)
            subprocess.run(
                ["git", "commit", "-m", "add docs"],
                cwd=repo,
                capture_output=True,
                check=True,
            )
            config = PipelineConfig(
                inputs={"docs_ref": docs_dir},
                preset_name="plan-review",
            )
            original_inputs = {"docs_ref": docs_dir}
            worktree_dir = Path(tmpdir) / "wt"
            branch = "cross-eval/test-apply-back"
            worktree_path, _ = create_worktree(repo, worktree_dir, branch)
            try:
                _copy_inputs_to_worktree(config, worktree_path, base_cwd=repo)
                worktree_doc = config.inputs["docs_ref"] / "A.md"
                worktree_doc.write_text("A v2\n", encoding="utf-8")
                restored = _apply_worktree_inputs_to_base(
                    config, original_inputs, cwd=repo,
                )
                self.assertEqual(restored, [docs_dir])
                self.assertEqual(doc_path.read_text(encoding="utf-8"), "A v2\n")
                committed = _commit_base_repo_paths(
                    repo, restored, "cross-eval: plan-review (FAIL)",
                )
                self.assertTrue(committed)
                log = subprocess.run(
                    ["git", "log", "-1", "--pretty=%s"],
                    cwd=repo,
                    capture_output=True,
                    text=True,
                    check=True,
                )
                self.assertEqual(log.stdout.strip(), "cross-eval: plan-review (FAIL)")
            finally:
                remove_worktree(base_cwd=repo, work_dir=worktree_path)
                subprocess.run(
                    ["git", "branch", "-D", branch],
                    cwd=repo,
                    capture_output=True,
                )

    def test_classify_unknown_failure(self) -> None:
        failure_type, suggested_action = _classify_agent_failure("weird crash")
        self.assertEqual(failure_type, "UNKNOWN")
        self.assertIn("Inspect", suggested_action)

    def test_build_transcript_includes_cwd_and_duration(self) -> None:
        transcript = _build_transcript(
            command_preview="claude -p",
            stdout="ok",
            stderr="",
            exit_code=0,
            duration_seconds=1.2,
            cwd="/tmp/repo",
        )
        self.assertIn("## Working Directory", transcript)
        self.assertIn("## Duration: 1.2s", transcript)

    @patch("cross_eval.agent._Spinner")
    @patch("cross_eval.agent.subprocess.run")
    def test_timeout_stops_spinner(self, mock_run: MagicMock, mock_spinner: MagicMock) -> None:
        spinner = mock_spinner.return_value
        mock_run.side_effect = subprocess.TimeoutExpired(cmd=["claude"], timeout=12)
        agent = AgentConfig(name="claude-reviewer", command="claude", args=["-p"])
        with self.assertRaises(subprocess.TimeoutExpired):
            invoke_agent(agent, "inspect code", "review", quiet=False, timeout=12)
        spinner.stop.assert_called_once()

    @patch("cross_eval.agent._Spinner")
    @patch("cross_eval.agent.subprocess.run")
    def test_generic_exception_stops_spinner(self, mock_run: MagicMock, mock_spinner: MagicMock) -> None:
        spinner = mock_spinner.return_value
        mock_run.side_effect = OSError("boom")
        agent = AgentConfig(name="claude-reviewer", command="claude", args=["-p"])
        with self.assertRaises(OSError):
            invoke_agent(agent, "inspect code", "review", quiet=False)
        spinner.stop.assert_called_once()

    @patch("cross_eval.agent.logger.warning")
    @patch("cross_eval.agent.subprocess.run")
    def test_empty_output_logs_warning(self, mock_run: MagicMock, mock_warning: MagicMock) -> None:
        mock_run.return_value = MagicMock(returncode=0, stdout="", stderr="")
        agent = AgentConfig(name="claude-reviewer", command="claude", args=["-p"])
        result = invoke_agent(agent, "inspect code", "review", quiet=True)
        self.assertEqual(result.output, "")
        mock_warning.assert_called_once()

    @patch("cross_eval.agent.subprocess.run")
    def test_print_mode_claude_uses_native_system_prompt_flag(self, mock_run: MagicMock) -> None:
        mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
        agent = AgentConfig(
            name="claude-reviewer",
            command="claude",
            args=["-p"],
            system_prompt="be strict",
        )
        invoke_agent(agent, "review this", "review", quiet=True)
        called_cmd = mock_run.call_args[0][0]
        self.assertIn("--system-prompt", called_cmd)
        self.assertEqual(mock_run.call_args.kwargs["input"], "review this")

    @patch("cross_eval.agent.subprocess.run")
    def test_interactive_failure_truncates_error_and_removes_output_file(
        self,
        mock_run: MagicMock,
    ) -> None:
        seen_output_path: Path | None = None

        def _fake_run(cmd: list[str], **kwargs: object) -> MagicMock:
            nonlocal seen_output_path
            match = re.search(r"Write your complete output to (.+)\.$", cmd[-1])
            self.assertIsNotNone(match)
            assert match is not None
            seen_output_path = Path(match.group(1))
            return MagicMock(returncode=1, stdout="", stderr="x" * 600)

        mock_run.side_effect = _fake_run
        agent = AgentConfig(name="claude-reviewer", command="claude", args=["--model", "opus"])
        with self.assertRaises(AgentInvocationError) as ctx:
            invoke_agent(agent, "inspect code", "review", quiet=True)
        self.assertEqual(len(ctx.exception.raw_error), 503)
        self.assertIsNotNone(seen_output_path)
        assert seen_output_path is not None
        self.assertFalse(seen_output_path.exists())

    @patch("cross_eval.agent.logger.warning")
    @patch("cross_eval.agent.subprocess.run")
    def test_empty_output_with_stderr_logs_stderr_warning(
        self,
        mock_run: MagicMock,
        mock_warning: MagicMock,
    ) -> None:
        mock_run.return_value = MagicMock(returncode=0, stdout="", stderr="stderr text")
        agent = AgentConfig(name="claude-reviewer", command="claude", args=["-p"])
        invoke_agent(agent, "inspect code", "review", quiet=True)
        self.assertIn("stderr:", mock_warning.call_args[0][0])


class TestInvokeAgenticRuntime(unittest.TestCase):
    @patch("cross_eval.worktree.capture_diff", return_value="diff --git a/file ...")
    @patch("cross_eval.agent.subprocess.run")
    def test_codex_agentic_adds_reasoning_and_system_wrapper(
        self,
        mock_run: MagicMock,
        mock_diff: MagicMock,
    ) -> None:
        mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
        agent = AgentConfig(
            name="codex-coder",
            command="codex",
            args=["exec", "--full-auto"],
            system_prompt="strict mode",
            reasoning_effort="high",
            agentic=True,
        )
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir)
            _init_git_repo(repo)
            invoke_agent_agentic(agent, "fix bug", "coding", repo, quiet=True)
            called_cmd = mock_run.call_args[0][0]
            self.assertIn("-c", called_cmd)
            self.assertEqual(called_cmd[-1], "-")
            self.assertIn("<system>", mock_run.call_args.kwargs["input"])

    @patch("cross_eval.agent._Spinner")
    @patch("cross_eval.worktree.capture_diff", return_value="diff --git a/file ...")
    @patch("cross_eval.agent.subprocess.run")
    def test_agentic_claude_success_uses_system_prompt_and_spinner(
        self,
        mock_run: MagicMock,
        mock_diff: MagicMock,
        mock_spinner: MagicMock,
    ) -> None:
        mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
        agent = AgentConfig(
            name="claude-coder",
            command="claude",
            args=["-p", "--print"],
            system_prompt="stay in scope",
            agentic=True,
        )
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir)
            _init_git_repo(repo)
            result = invoke_agent_agentic(agent, "fix bug", "coding", repo, quiet=False)
            called_cmd = mock_run.call_args[0][0]
            self.assertNotIn("-p", called_cmd)
            self.assertIn("--system-prompt", called_cmd)
            self.assertEqual(result.output, "diff --git a/file ...")
            mock_spinner.return_value.stop.assert_called_once()

    @patch("cross_eval.agent._Spinner")
    def test_agentic_timeout_stops_spinner(self, mock_spinner: MagicMock) -> None:
        spinner = mock_spinner.return_value
        agent = AgentConfig(name="codex-coder", command="codex", args=["exec"], agentic=True)
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir)
            _init_git_repo(repo)
            with patch(
                "cross_eval.agent.subprocess.run",
                side_effect=subprocess.TimeoutExpired(cmd=["codex"], timeout=20),
            ):
                with self.assertRaises(subprocess.TimeoutExpired):
                    invoke_agent_agentic(agent, "fix bug", "coding", repo, quiet=False, timeout=20)
            spinner.stop.assert_called_once()

    @patch("cross_eval.agent.subprocess.run")
    def test_agentic_nonzero_exit_raises_structured_error(self, mock_run: MagicMock) -> None:
        mock_run.return_value = MagicMock(returncode=1, stdout="", stderr="unauthorized")
        agent = AgentConfig(name="codex-coder", command="codex", args=["exec"], agentic=True)
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir)
            _init_git_repo(repo)
            with self.assertRaises(AgentInvocationError) as ctx:
                invoke_agent_agentic(agent, "fix bug", "coding", repo, quiet=True)
            self.assertEqual(ctx.exception.failure_type, "AUTH")

    @patch("cross_eval.agent._Spinner")
    def test_agentic_generic_exception_stops_spinner(
        self,
        mock_spinner: MagicMock,
    ) -> None:
        agent = AgentConfig(name="codex-coder", command="codex", args=["exec"], agentic=True)
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir)
            _init_git_repo(repo)
            with patch("cross_eval.agent.subprocess.run", side_effect=OSError("boom")):
                with self.assertRaises(OSError):
                    invoke_agent_agentic(agent, "fix bug", "coding", repo, quiet=False)
            mock_spinner.return_value.stop.assert_called_once()

    @patch("cross_eval.agent._Spinner")
    @patch("cross_eval.agent.subprocess.run")
    def test_agentic_failure_truncates_error(
        self,
        mock_run: MagicMock,
        mock_spinner: MagicMock,
    ) -> None:
        mock_run.return_value = MagicMock(returncode=1, stdout="", stderr="x" * 600)
        agent = AgentConfig(name="codex-coder", command="codex", args=["exec"], agentic=True)
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir)
            _init_git_repo(repo)
            with self.assertRaises(AgentInvocationError) as ctx:
                invoke_agent_agentic(agent, "fix bug", "coding", repo, quiet=False)
            self.assertEqual(len(ctx.exception.raw_error), 503)
            mock_spinner.return_value.stop.assert_called_once()

    @patch("cross_eval.agent._Spinner")
    @patch("cross_eval.worktree.capture_diff", return_value="")
    @patch("cross_eval.agent.subprocess.run")
    def test_agentic_empty_diff_failure_truncates_error_and_stops_spinner(
        self,
        mock_run: MagicMock,
        mock_diff: MagicMock,
        mock_spinner: MagicMock,
    ) -> None:
        mock_run.return_value = MagicMock(
            returncode=0,
            stdout="implemented",
            stderr="permission denied " * 300,
        )
        agent = AgentConfig(name="codex-coder", command="codex", args=["exec"], agentic=True)
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir)
            _init_git_repo(repo)
            with self.assertRaises(AgentInvocationError) as ctx:
                invoke_agent_agentic(agent, "fix bug", "coding", repo, quiet=False)
            self.assertLessEqual(len(ctx.exception.raw_error), 2003)
            self.assertEqual(ctx.exception.failure_type, "WRITE_FAILURE")
            mock_spinner.return_value.stop.assert_called_once()


class TestPipelineHelpers(unittest.TestCase):
    @patch("cross_eval.worktree.get_current_head", return_value="a" * 40)
    @patch("cross_eval.worktree.commit_worktree", return_value=True)
    def test_commit_iteration_logs_only_when_committed(self, mock_commit: MagicMock, mock_head: MagicMock) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            new_head = _commit_iteration(Path(tmpdir), "review-fix", 2, "PASS")
            mock_commit.assert_called_once()
            self.assertEqual(new_head, "a" * 40)

    def test_snapshot_repo_state_includes_untracked_digest(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            repo = Path(tmpdir)
            _init_git_repo(repo)
            (repo / "scratch.txt").write_text("draft", encoding="utf-8")
            snapshot = _snapshot_repo_state(repo)
            self.assertIn("UNTRACKED scratch.txt", snapshot["untracked"])

    def test_finalize_worktree_deletes_empty_branch(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            base = Path(tmpdir) / "repo"
            base.mkdir()
            _init_git_repo(base)
            branch = "cross-eval/empty"
            subprocess.run(
                ["git", "branch", branch, "HEAD"],
                cwd=base,
                capture_output=True,
                check=True,
            )
            worktree = Path(tmpdir) / "wt"
            subprocess.run(
                ["git", "worktree", "add", str(worktree), branch],
                cwd=base,
                capture_output=True,
                check=True,
            )
            branch_result = _finalize_worktree(base, worktree, branch, "review-fix", "PASS")
            self.assertIsNone(branch_result)
            branches = subprocess.run(
                ["git", "branch", "--list", branch],
                cwd=base,
                capture_output=True,
                text=True,
                check=True,
            )
            self.assertEqual(branches.stdout.strip(), "")

    def test_format_runtime_error_markdown_for_generic_exception(self) -> None:
        markdown = _format_runtime_error_markdown(
            RuntimeError("boom"),
            step_name="review",
            agent_name="claude-reviewer",
            phase_name="review_fix",
        )
        self.assertIn("# Agent Error", markdown)
        self.assertIn("review_fix", markdown)
        self.assertIn("boom", markdown)

    def test_maybe_save_step_transcript_returns_none_without_transcript(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            result = AgentResult(
                output="ok",
                exit_code=0,
                agent_name="claude-reviewer",
                step_name="review",
                duration_seconds=0.1,
            )
            saved = _maybe_save_step_transcript(Path(tmpdir), 1, "review", result)
            self.assertIsNone(saved)

    @patch("cross_eval.pipeline.invoke_agent")
    def test_execute_step_saves_timeout_markdown(self, mock_invoke: MagicMock) -> None:
        mock_invoke.side_effect = subprocess.TimeoutExpired(
            cmd=["claude"],
            timeout=45,
            output="partial output",
            stderr="still running",
        )
        step = StepConfig(
            name="review",
            agent="claude-reviewer",
            role="review",
            prompt_template="default:review",
            output_key="review_output",
        )
        config = PipelineConfig(
            agents={
                "claude-reviewer": AgentConfig(
                    name="claude-reviewer",
                    command="claude",
                    args=["-p"],
                ),
            },
        )
        step_outputs: dict[str, str] = {}
        step_results: dict[str, AgentResult] = {}
        with tempfile.TemporaryDirectory() as tmpdir:
            run_dir = Path(tmpdir)
            with self.assertRaises(RuntimeError) as ctx:
                _execute_step(
                    step,
                    config,
                    {"plan": "Plan", "checklist": "Checklist"},
                    "",
                    1,
                    3,
                    run_dir,
                    45,
                    False,
                    step_outputs,
                    step_results,
                    run_dir=run_dir,
                    output_iter=1,
                )
            self.assertIn("timed out after 45s", str(ctx.exception))
            error_path = run_dir / "v1" / "review_error.md"
            self.assertTrue(error_path.exists())
            self.assertIn("# Agent Timeout", error_path.read_text(encoding="utf-8"))

    @patch("cross_eval.pipeline.invoke_agent")
    def test_execute_step_saves_runtime_error_markdown(self, mock_invoke: MagicMock) -> None:
        mock_invoke.side_effect = AgentInvocationError(
            agent_name="claude-reviewer",
            step_name="review",
            cmd_preview="claude -p",
            raw_error="api broke",
            failure_type="API_ERROR",
            suggested_action="retry",
        )
        step = StepConfig(
            name="review",
            agent="claude-reviewer",
            role="review",
            prompt_template="default:review",
            output_key="review_output",
        )
        config = PipelineConfig(
            agents={
                "claude-reviewer": AgentConfig(
                    name="claude-reviewer",
                    command="claude",
                    args=["-p"],
                ),
            },
        )
        with tempfile.TemporaryDirectory() as tmpdir:
            run_dir = Path(tmpdir)
            with self.assertRaises(AgentInvocationError):
                _execute_step(
                    step,
                    config,
                    {"plan": "Plan", "checklist": "Checklist"},
                    "",
                    1,
                    3,
                    run_dir,
                    45,
                    False,
                    {},
                    {},
                    run_dir=run_dir,
                    output_iter=1,
                )
            error_text = (run_dir / "v1" / "review_error.md").read_text(encoding="utf-8")
            self.assertIn("API_ERROR", error_text)
            self.assertIn("retry", error_text)

    @patch("cross_eval.pipeline.invoke_agent")
    def test_execute_parallel_batch_saves_success_and_timeout_error(self, mock_invoke: MagicMock) -> None:
        def _fake_invoke(agent_config: AgentConfig, prompt: str, step_name: str, **kwargs: object) -> AgentResult:
            if step_name == "review_ok":
                return AgentResult(
                    output="VERDICT: PASS",
                    exit_code=0,
                    agent_name=agent_config.name,
                    step_name=step_name,
                    duration_seconds=0.1,
                )
            raise subprocess.TimeoutExpired(
                cmd=["codex"],
                timeout=30,
                output="halfway",
                stderr="timeout stderr",
            )

        mock_invoke.side_effect = _fake_invoke
        batch = [
            StepConfig(
                name="review_ok",
                agent="claude-reviewer",
                role="review",
                prompt_template="default:review",
                output_key="review_ok",
                parallel=True,
            ),
            StepConfig(
                name="review_slow",
                agent="codex-reviewer",
                role="review",
                prompt_template="default:review",
                output_key="review_slow",
                parallel=True,
            ),
        ]
        config = PipelineConfig(
            agents={
                "claude-reviewer": AgentConfig(name="claude-reviewer", command="claude", args=["-p"]),
                "codex-reviewer": AgentConfig(name="codex-reviewer", command="codex", args=["exec", "-"]),
            },
        )
        step_outputs: dict[str, str] = {}
        step_results: dict[str, AgentResult] = {}
        with tempfile.TemporaryDirectory() as tmpdir:
            run_dir = Path(tmpdir)
            with self.assertRaises(RuntimeError) as ctx:
                _execute_parallel_batch(
                    batch,
                    config,
                    {"plan": "Plan", "checklist": "Checklist"},
                    "",
                    1,
                    3,
                    run_dir,
                    30,
                    False,
                    step_outputs,
                    step_results,
                    run_dir=run_dir,
                    output_iter=1,
                )
            self.assertIn("Successful outputs were saved for: review_ok", str(ctx.exception))
            self.assertEqual(step_outputs["review_ok"], "VERDICT: PASS")
            self.assertTrue((run_dir / "v1" / "review_ok.md").exists())
            self.assertTrue((run_dir / "v1" / "review_slow_error.md").exists())

    @patch("cross_eval.pipeline._execute_step")
    def test_execute_parallel_batch_dry_run_uses_sequential_path(self, mock_step: MagicMock) -> None:
        batch = [
            StepConfig(
                name="review_a",
                agent="claude-reviewer",
                role="review",
                prompt_template="default:review",
                output_key="review_a",
                parallel=True,
            ),
            StepConfig(
                name="review_b",
                agent="codex-reviewer",
                role="review",
                prompt_template="default:review",
                output_key="review_b",
                parallel=True,
            ),
        ]
        config = PipelineConfig(agents={})
        with tempfile.TemporaryDirectory() as tmpdir:
            _execute_parallel_batch(
                batch,
                config,
                {"plan": "Plan"},
                "",
                1,
                3,
                Path(tmpdir),
                None,
                True,
                {},
                {},
                run_dir=Path(tmpdir),
                output_iter=1,
            )
            self.assertEqual(mock_step.call_count, 2)

    @patch("cross_eval.pipeline._execute_step")
    def test_execute_parallel_batch_agentic_steps_fall_back_to_sequential(self, mock_step: MagicMock) -> None:
        batch = [
            StepConfig(
                name="review_a",
                agent="agentic-a",
                role="review",
                prompt_template="default:review",
                output_key="review_a",
                parallel=True,
            ),
            StepConfig(
                name="review_b",
                agent="agentic-b",
                role="review",
                prompt_template="default:review",
                output_key="review_b",
                parallel=True,
            ),
        ]
        config = PipelineConfig(
            agents={
                "agentic-a": AgentConfig(name="agentic-a", command="claude", agentic=True),
                "agentic-b": AgentConfig(name="agentic-b", command="codex", agentic=True),
            },
        )
        with tempfile.TemporaryDirectory() as tmpdir:
            _execute_parallel_batch(
                batch,
                config,
                {"plan": "Plan"},
                "",
                1,
                3,
                Path(tmpdir),
                None,
                False,
                {},
                {},
                run_dir=Path(tmpdir),
                output_iter=1,
                worktree_path=Path(tmpdir),
            )
            self.assertEqual(mock_step.call_count, 2)

    @patch("cross_eval.worktree.remove_worktree", side_effect=RuntimeError("cleanup failed"))
    @patch("cross_eval.worktree.commit_worktree", side_effect=RuntimeError("commit failed"))
    def test_finalize_worktree_handles_cleanup_failures(
        self,
        mock_commit: MagicMock,
        mock_remove: MagicMock,
    ) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            branch = _finalize_worktree(
                Path(tmpdir),
                Path(tmpdir) / "wt",
                "cross-eval/fail",
                "review-fix",
                "FAIL",
            )
            self.assertIsNone(branch)


class TestRuntimeEnvironmentHelpers(unittest.TestCase):
    def test_parse_dotenv_handles_export_and_quotes(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            env_path = Path(tmpdir) / ".env"
            env_path.write_text(
                "export FOO='bar'\nBAR=\"line\\nvalue\"\nINVALID\n=skip\n",
                encoding="utf-8",
            )
            values = parse_dotenv(env_path)
            self.assertEqual(values["FOO"], "bar")
            self.assertEqual(values["BAR"], "line\nvalue")
            self.assertNotIn("INVALID", values)

    def test_resolve_env_files_deduplicates_and_filters_missing(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            root = Path(tmpdir)
            env_path = root / ".env"
            env_path.write_text("FOO=bar\n", encoding="utf-8")
            execution = ExecutionConfig(
                env_files=[".env", str(env_path)],
                auto_env_files=[".env", ".env.local"],
            )
            resolved = resolve_env_files(execution, root)
            self.assertEqual(resolved, [env_path.resolve()])

    def test_summarize_environment_hides_names_when_disabled(self) -> None:
        execution = ExecutionConfig(expose_env_names=False, auto_context_targets=["postgres"])
        summary = summarize_environment(
            execution,
            [],
            {"DATABASE_URL": "postgres://localhost"},
            {},
        )
        self.assertIn("names are hidden", summary)
        self.assertIn("Execution targets hinted by the user: postgres", summary)

    def test_build_execution_policy_for_minimal_mode(self) -> None:
        policy = build_execution_policy(
            ExecutionConfig(mode="agent-decides", command_policy="minimal"),
        )
        self.assertIn("Command policy: minimal", policy)
        self.assertIn("Keep command usage minimal", policy)


class TestWorktreeFailures(unittest.TestCase):
    @patch("cross_eval.worktree.subprocess.run")
    def test_create_worktree_raises_when_branch_creation_fails(self, mock_run: MagicMock) -> None:
        # First call: git rev-parse HEAD (succeeds)
        # Second call: git branch (fails)
        rev_parse_result = MagicMock(returncode=0)
        rev_parse_result.stdout = "a" * 40
        mock_run.side_effect = [
            rev_parse_result,
            subprocess.CalledProcessError(
                1,
                ["git", "branch"],
                stderr="branch failed",
            ),
        ]
        with tempfile.TemporaryDirectory() as tmpdir:
            base = Path(tmpdir)
            work_dir = base / "wt"
            with self.assertRaises(WorktreeError) as ctx:
                create_worktree(base, work_dir, "cross-eval/fail")
            self.assertIn("Failed to create branch", str(ctx.exception))

    @patch("cross_eval.worktree.subprocess.run")
    def test_create_worktree_cleans_branch_on_worktree_failure(self, mock_run: MagicMock) -> None:
        rev_parse_result = MagicMock(returncode=0)
        rev_parse_result.stdout = "a" * 40
        mock_run.side_effect = [
            rev_parse_result,  # git rev-parse HEAD
            MagicMock(returncode=0),  # git branch
            subprocess.CalledProcessError(
                1,
                ["git", "worktree", "add"],
                stderr="worktree failed",
            ),
            MagicMock(returncode=0),  # git branch -D (cleanup)
        ]
        with tempfile.TemporaryDirectory() as tmpdir:
            base = Path(tmpdir)
            work_dir = base / "wt"
            with self.assertRaises(WorktreeError):
                create_worktree(base, work_dir, "cross-eval/fail")
            cleanup_call = mock_run.call_args_list[-1]
            self.assertEqual(cleanup_call[0][0][:3], ["git", "branch", "-D"])

    @patch("cross_eval.worktree.shutil.rmtree")
    @patch("cross_eval.worktree.subprocess.run")
    def test_remove_worktree_falls_back_to_prune(self, mock_run: MagicMock, mock_rmtree: MagicMock) -> None:
        mock_run.side_effect = [
            subprocess.CalledProcessError(1, ["git", "worktree", "remove"]),
            MagicMock(returncode=0),
        ]
        with tempfile.TemporaryDirectory() as tmpdir:
            base = Path(tmpdir) / "repo"
            work_dir = Path(tmpdir) / "wt"
            base.mkdir()
            work_dir.mkdir()
            remove_worktree(base, work_dir)
            resolved = work_dir.resolve()
            mock_rmtree.assert_any_call(resolved, ignore_errors=True)
            self.assertEqual(mock_run.call_args_list[-1][0][0], ["git", "worktree", "prune"])