Cross-Eval CLI Implementation Plan

Context

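A development workflow built on two AI agents (planning → checklist → development → review → repeat) is prone to over-optimization, false positives, and omissions. To catch these, we build a CLI tool that automates a cross-validation loop between the agents. It replaces the current manual process of copy-pasting between two agents with a single command: cross-eval run.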

Key Design Decisions

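Agents explore the codebase directly. claude -p is non-interactive, but its built-in tools (Read, Glob, Grep) remain available. Instead of putting every file's contents into the prompt, we let the agent explore files in the project directory itself.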

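Generator: --permission-mode auto (can read and write files)
Reviewer: --permission-mode plan (read-only exploration)
The subprocess cwd is set to the current working directory.
The default execution mode is direct mode: the agentic coder also edits the current working tree directly.
An isolated git worktree is created only when --worktree or use_worktree: true is specified explicitly.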

User Experience (UX Flow)

# 1. Initialize the project
cd my-project
cross-eval init
# → creates cross-eval.yaml, plan.md, checklist.md

# 2. Write plan.md and checklist.md, then run
cross-eval run

# 3. Options
cross-eval run --config custom.yaml --max-iter 5 --dry-run
cross-eval run --input plan=./docs/spec.md --input checklist=./docs/checks.md

# 4. Check the results
ls output/
# → v1/ v2/ final-report.md

Configuration File Format (`cross-eval.yaml`)

output_dir: output
use_worktree: false
max_iterations: 3

inputs:
  plan: plan.md
  checklist: checklist.md

agents:
  coder:
    command: claude
    args: ["-p", "--model", "sonnet", "--permission-mode", "auto"]
    system_prompt: "You are a senior software engineer. Follow the plan precisely."
  reviewer:
    command: claude
    args: ["-p", "--model", "opus", "--permission-mode", "plan"]
    system_prompt: "You are a meticulous code reviewer."

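# Option 1: use a preset (no need to write the pipeline YAML by hand)
pipeline: preset:coding-plan-review   # "implement from docs → code/doc review → fix → re-verify" (default)
# pipeline: preset:plan-review        # "review docs before implementation → fix → re-verify, repeated"

# Option 2: fully custom (for advanced users)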
# pipeline:
#   - name: coding
#     agent: coder
#     role: coding
#     prompt_template: "default:coding"
#     output_key: coding_output
#   - name: review
#     agent: reviewer
#     role: review
#     prompt_template: "default:review"
#     output_key: review_result
#     verdict: true

Pipeline Presets

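Preset	Description	Auto-generated steps
`plan-review`	Repeated review, fix, and re-verification of documents before implementation	plan_review_* → aggregate_review → plan_fix → verify
`coding-plan-review`	Implementation from documents, followed by repeated code/document review and fixes	initial_coding(coding) → coding_plan_review(review* → aggregate → coding_plan_fix → verify)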

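Internally, a preset assembles the appropriate pipeline steps and context_override mappings automatically. Agents are assigned as agent1, agent2 in the order they are defined under agents. If a preset is not sufficient, you can write the steps yourself.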

Module Structure and Implementation Order

cross_eval/
├── __init__.py       (exists)
├── models.py         # 1. All dataclasses
├── config.py         # 2. YAML loading + validation
├── prompts.py        # 3. Prompt templates
├── agent.py          # 4. Agent invocation via subprocess
├── pipeline.py       # 5. Core iteration loop
├── report.py         # 6. Markdown report
└── cli.py            # 7. argparse (init, run)

Key Points per Module

models.py (all dataclasses in one module, which avoids circular imports):

AgentConfig (command, args, system_prompt, stdin_mode)
StepConfig (name, agent, role, prompt_template, output_key, verdict, verdict_pattern, context_override)
PipelineConfig (output_dir, use_worktree, max_iterations, inputs, agents, pipeline)
AgentResult (output, exit_code, agent_name, step_name, duration_seconds)
IterationResult (iteration, step_outputs, verdict, feedback)
PipelineResult (iterations, final_verdict, total_duration)
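
As a rough illustration, three of these classes might be declared as below. The field names come from the list above; the types and default values are assumptions made for this sketch, not decisions recorded in the plan.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentConfig:
    command: str
    args: list[str] = field(default_factory=list)  # default_factory avoids a shared list
    system_prompt: Optional[str] = None            # assumed optional
    stdin_mode: bool = False                       # assumed default: prompt as last argument


@dataclass
class StepConfig:
    name: str
    agent: str
    role: str
    prompt_template: str
    output_key: str
    verdict: bool = False                          # only verdict steps decide PASS/FAIL
    verdict_pattern: Optional[str] = None
    context_override: dict[str, str] = field(default_factory=dict)


@dataclass
class IterationResult:
    iteration: int
    step_outputs: dict[str, str] = field(default_factory=dict)
    verdict: Optional[str] = None                  # assumed None until a verdict step runs
    feedback: str = ""
```

Keeping mutable defaults behind default_factory matters here because configs for several agents and steps are created side by side.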

config.py (YAML → PipelineConfig, plus validation):

Every step.agent is defined in agents
No duplicate output_key values
Input files exist
verdict_pattern is a valid regular expression
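
The four checks above could look roughly like the following sketch. It works on the already-parsed YAML as a plain dict so that it runs without a YAML library; the function name validate_config and the choice to collect error strings rather than raise on the first problem are assumptions.

```python
import re
from pathlib import Path


def validate_config(raw: dict, base_dir: Path) -> list[str]:
    """Return human-readable problems; an empty list means the config is valid."""
    errors: list[str] = []
    agents = raw.get("agents", {})
    steps = raw.get("pipeline", [])
    if isinstance(steps, str):
        # A preset reference such as "preset:plan-review" has no explicit steps to check.
        steps = []

    seen_keys: set[str] = set()
    for step in steps:
        name = step.get("name", "<unnamed>")
        if step.get("agent") not in agents:
            errors.append(f"step {name!r}: unknown agent {step.get('agent')!r}")
        key = step.get("output_key")
        if key is not None:
            if key in seen_keys:
                errors.append(f"step {name!r}: duplicate output_key {key!r}")
            seen_keys.add(key)
        pattern = step.get("verdict_pattern")
        if pattern is not None:
            try:
                re.compile(pattern)
            except re.error as exc:
                errors.append(f"step {name!r}: invalid verdict_pattern ({exc})")

    for key, rel_path in raw.get("inputs", {}).items():
        if not (base_dir / rel_path).is_file():
            errors.append(f"input {key!r}: file not found: {rel_path}")
    return errors
```

Returning every problem at once lets the CLI print a complete list before exiting, instead of making the user fix one error per run.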

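prompts.py (two default prompts plus the pipeline preset definitions):

default:coding tells the agent to "implement only what the plan specifies; no over-optimization", supplies plan/checklist/feedback, and instructs it to "explore the existing code in the project directory to understand the context"
default:review checks three criteria (over-optimization, false positives, omissions), emits VERDICT: PASS|FAIL, and instructs the agent to "explore the project directory directly to verify the code"
{variable} placeholders; a missing value renders as (no {key} provided)
Users can override the defaults with custom .md files
PIPELINE_PRESETS / PHASED_PRESETS dicts: StepConfig/PhaseConfig definitions for the plan-review and coding-plan-review presets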
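
The placeholder rule can be sketched as follows. The function name render_prompt is hypothetical, and using a regex instead of str.format is a choice made for this sketch: it leaves stray braces (for example JSON snippets inside a template) untouched.

```python
import re

# Matches {name} where name is a simple identifier-like key.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_prompt(template: str, context: dict[str, str]) -> str:
    """Substitute {key} placeholders; a missing key renders as '(no key provided)'."""
    def _substitute(match: re.Match) -> str:
        key = match.group(1)
        return context.get(key, f"(no {key} provided)")

    return _PLACEHOLDER.sub(_substitute, template)
```

For example, render_prompt("Plan:\n{plan}\n\nFeedback:\n{feedback}", {"plan": "Build X"}) yields a prompt whose feedback section reads (no feedback provided), which is what the first iteration would see before any review exists.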

agent.py (invoke_agent(agent_config, prompt, cwd)):

The cwd parameter sets the project directory, so the agent can explore files there
stdin_mode=false: pass the prompt as the last argument
stdin_mode=true: pipe the prompt through stdin (for long prompts)
If command is "claude" and a system_prompt is set, --system-prompt is injected automatically
Timeout of 600 seconds; RuntimeError on abnormal exit
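
A minimal sketch of these rules, assuming a small AgentConfig dataclass like the one listed under models.py. The helper build_command, the (stdout, duration) return value, and mapping a timeout to RuntimeError are assumptions; the plan only states the 600-second timeout and RuntimeError on abnormal exit.

```python
import subprocess
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class AgentConfig:
    command: str
    args: list[str] = field(default_factory=list)
    system_prompt: Optional[str] = None
    stdin_mode: bool = False


def build_command(cfg: AgentConfig, prompt: str) -> list[str]:
    """Assemble argv: command, args, optional --system-prompt, then the prompt."""
    cmd = [cfg.command, *cfg.args]
    if cfg.command == "claude" and cfg.system_prompt:
        cmd += ["--system-prompt", cfg.system_prompt]
    if not cfg.stdin_mode:
        cmd.append(prompt)  # prompt travels as the last argument
    return cmd


def invoke_agent(cfg: AgentConfig, prompt: str, cwd: Path,
                 timeout: float = 600.0) -> tuple[str, float]:
    """Run the agent inside cwd and return (stdout, duration_seconds)."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            build_command(cfg, prompt),
            input=prompt if cfg.stdin_mode else None,  # piped only in stdin mode
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # Assumption: a timeout is surfaced the same way as a failed exit.
        raise RuntimeError(f"{cfg.command} timed out after {timeout}s") from exc
    if proc.returncode != 0:
        raise RuntimeError(
            f"{cfg.command} exited with {proc.returncode}: {proc.stderr.strip()}"
        )
    return proc.stdout, time.monotonic() - start
```

Passing the argument list directly (no shell) means the prompt needs no quoting, but very long prompts can still exceed the OS argument-length limit, which is the case stdin_mode covers.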

pipeline.py (core loop):

for iteration 1..max_iterations:
  for step in pipeline:
    1. Resolve the template → build the context (inputs + previous step outputs + feedback)
    2. Apply context_override (variable mapping for cross-review)
    3. Invoke the agent (cwd = current working directory)
    4. Save to output_dir/v{i}/{step.name}.md
    5. If this is a verdict step, decide PASS/FAIL
  On PASS, stop; on FAIL, pass the review result as feedback to the next iteration
Generate final-report.md
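
The loop above might be sketched like this. To keep the example runnable without a real agent, the agent call is injected as a function; steps are plain dicts, context_override and report generation are left out, and treating a missing VERDICT line as FAIL is an assumption rather than something the plan specifies.

```python
import re
from pathlib import Path
from typing import Callable

VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL)")


def parse_verdict(text: str) -> str:
    """Return the last verdict in the text; a missing verdict counts as FAIL (assumption)."""
    found = VERDICT_RE.findall(text)
    return found[-1] if found else "FAIL"


def run_loop(
    steps: list[dict],
    inputs: dict[str, str],
    call_agent: Callable[[dict, dict], str],  # hypothetical hook: (step, context) -> output
    output_dir: Path,
    max_iterations: int = 3,
) -> tuple[str, int]:
    """Run the pipeline until a verdict step reports PASS or iterations run out."""
    feedback = ""
    verdict = "FAIL"
    for i in range(1, max_iterations + 1):
        context = {**inputs, "feedback": feedback}
        iter_dir = output_dir / f"v{i}"
        iter_dir.mkdir(parents=True, exist_ok=True)
        for step in steps:
            output = call_agent(step, context)
            context[step["output_key"]] = output  # visible to later steps
            (iter_dir / f"{step['name']}.md").write_text(output, encoding="utf-8")
            if step.get("verdict"):
                verdict = parse_verdict(output)
                feedback = output  # the review text feeds the next iteration
        if verdict == "PASS":
            return verdict, i
    return verdict, max_iterations
```

Injecting call_agent also gives --dry-run a natural seam: a stub that only prints the rendered prompt can be passed in place of the real subprocess call.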

The agentic execution path has two modes.

Default: direct mode (edits directly in cwd)
Opt-in: isolated worktree mode (--worktree or use_worktree: true)

report.py (final Markdown report):

Summary table (iteration count, verdict, elapsed time)
Per-iteration details (each step's output, agent name, elapsed time)
Final verdict
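
One possible shape for the report, as a sketch only: the heading text, the table columns, and the dict-based input are assumptions chosen to cover the three parts listed above.

```python
def render_report(iterations: list[dict], final_verdict: str, total_seconds: float) -> str:
    """Build the final Markdown report from per-iteration results."""
    lines = [
        "# Cross-Eval Final Report",
        "",
        "| Iterations | Verdict | Duration (s) |",
        "|---|---|---|",
        f"| {len(iterations)} | {final_verdict} | {total_seconds:.1f} |",
    ]
    for it in iterations:
        lines += ["", f"## Iteration {it['iteration']}: {it['verdict']}"]
        for step in it["steps"]:
            lines += [
                "",
                f"### {step['step_name']} ({step['agent_name']}, {step['duration_seconds']:.1f}s)",
                "",
                step["output"].rstrip(),
            ]
    lines += ["", f"Final verdict: {final_verdict}", ""]
    return "\n".join(lines)
```

Returning a string rather than writing the file keeps the function easy to test; the caller can write it to output_dir/final-report.md.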

cli.py (subcommands):

cross-eval init [--dir .] [--preset coding-plan-review|plan-review]: scaffolding (never overwrites existing files)
cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...] [--worktree]
--input key=path: overrides or adds entries to the config's inputs
--dry-run: prints the rendered prompts without invoking any agent
--worktree: runs in an isolated git worktree instead of the default direct mode
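
The subcommands above map onto argparse roughly as in this sketch. The flag names are taken from the list; the defaults (for example cross-eval.yaml as the config path), the dest name inputs, and the helper parse_input_pair are assumptions.

```python
import argparse


def parse_input_pair(value: str) -> tuple[str, str]:
    """Parse one --input value of the form key=path."""
    key, sep, path = value.partition("=")
    if not sep or not key or not path:
        raise argparse.ArgumentTypeError(f"expected key=path, got {value!r}")
    return key, path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cross-eval")
    sub = parser.add_subparsers(dest="command", required=True)

    init = sub.add_parser("init", help="scaffold config and input files")
    init.add_argument("--dir", default=".")
    init.add_argument("--preset", choices=["coding-plan-review", "plan-review"],
                      default="coding-plan-review")

    run = sub.add_parser("run", help="run the cross-evaluation loop")
    run.add_argument("-c", "--config", default="cross-eval.yaml")  # assumed default path
    run.add_argument("--max-iter", type=int, default=None)  # None: keep the config value
    run.add_argument("--dry-run", action="store_true")
    run.add_argument("--output-dir", default=None)
    run.add_argument("--input", dest="inputs", action="append", default=[],
                     type=parse_input_pair, metavar="KEY=PATH")
    run.add_argument("--worktree", action="store_true")
    return parser
```

With this parser, dict(args.inputs) gives the overrides to merge on top of the config's inputs, and leaving --max-iter and --output-dir as None makes it easy to tell "not given" apart from an explicit value.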

Files to Modify

File	Action
`cross_eval/__init__.py`	Already exists, no changes
`cross_eval/models.py`	New file
`cross_eval/config.py`	New file
`cross_eval/prompts.py`	New file
`cross_eval/agent.py`	New file
`cross_eval/pipeline.py`	New file
`cross_eval/report.py`	New file
`cross_eval/cli.py`	New file
`pyproject.toml`	Already exists, no changes

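Verification

Install locally with pip install -e .
Run cross-eval init and confirm the scaffolding (3 files created)
Run cross-eval run --dry-run and confirm prompt rendering (no agent calls)
Put some simple content in plan.md/checklist.md and do a real run with cross-eval run --max-iter 2
Confirm that v1/ and final-report.md are created in the output/ directory

--dry-run is preview-only: the process exit code is 0 even if the verdict is not PASS.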

cross-eval run \
  --docs /Users/chungyeong/Desktop/Dev/new-alpha-foundry/plans/TO_CLICKHOUSE \
  --preset coding-plan-review \
  --coder claude \
  --reviewer codex \
  --reviewer codex \
  --reviewer codex \
  --senior codex \
  --coder-effort high \
  --reviewer-effort high \
  --senior-effort xhigh \
  --max-iter 10

cross-eval run --plan /Users/chungyeong/Desktop/Dev/cross-eval/UX_IMPROVEMENT_PLAN.md --coder claude --reviewer claude --senior claude --model sonnet --preset coding-plan-review --lang ko --max-iter 1
