Compare commits

2 Commits

Author SHA1 Message Date
chungyeong
941304398d release: cut 0.2.0 baseline 2026-03-13 21:47:54 +09:00
chungyeong
204e071b74 feat: ESCALATE verdict, issue tracker, onboarding commands
Add 3-verdict system (PASS/FAIL/ESCALATE) with priority handling across
simple and phased pipelines. Senior reviewers can now escalate issues
requiring human intervention, immediately breaking the review loop.

- ESCALATE verdict extraction with highest priority over PASS/FAIL
- Issue Tracker tables (ISS-NNN) carried across iterations
- Auto-escalate heuristic using (file, keyword) composite fingerprints
- Report restructuring: executive view first (verdict → tracker → metrics)
- Onboarding: `doctor`, `demo`, `init --guided` commands
- Exit codes: PASS=0, FAIL=1, ESCALATE=2
- 87 tests passing (54 config + 25 onboarding + 8 integration)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 18:19:05 +09:00
21 changed files with 4854 additions and 318 deletions
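
The verdict priority and exit codes described in the commit message can be sketched roughly as follows. This is only an illustration: the regular expression, the `PRIORITY` tuple, and the function name `extract_verdict` are assumptions (the project's real `verdict_pattern` is configurable), and ranking FAIL above PASS is a guess; the commit only states that ESCALATE outranks both.

```python
import re
from typing import Optional

# Exit codes as listed in the commit message.
EXIT_CODES = {"PASS": 0, "FAIL": 1, "ESCALATE": 2}

# Assumed ordering: ESCALATE first (from the commit message), then FAIL, then PASS.
PRIORITY = ("ESCALATE", "FAIL", "PASS")

# Hypothetical pattern; the real tool reads its pattern from config.
VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL|ESCALATE)\b")


def extract_verdict(review_text: str) -> Optional[str]:
    """Return the highest-priority verdict found in the text, or None."""
    found = set(VERDICT_RE.findall(review_text))
    for verdict in PRIORITY:
        if verdict in found:
            return verdict
    return None
```

A caller could then map the result to a process exit status with `EXIT_CODES[verdict]`.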

@@ -41,7 +41,7 @@ inputs:
checklist: checklist.md
agents:
generator:
coder:
command: claude
args: ["-p", "--model", "sonnet", "--permission-mode", "auto"]
system_prompt: "You are a senior software engineer. Follow the plan precisely."
@@ -53,14 +53,16 @@ agents:
# Option 1: use a preset (no need to write the pipeline YAML by hand)
pipeline: preset:simple # "A generates → B reviews" (default)
# pipeline: preset:cross-review # "both generate → review each other"
# pipeline: preset:plan-review # "review documents/plan before implementation"
# pipeline: preset:coding-review-fix # "one initial coding pass → repeated review/fix"
# Option 2: custom pipeline (advanced users)
# pipeline:
# - name: generate
# agent: generator
# role: generate
# prompt_template: "default:generate"
# output_key: generated_code
# - name: coding
# agent: coder
# role: coding
# prompt_template: "default:coding"
# output_key: coding_output
# - name: review
# agent: reviewer
# role: review
@@ -73,8 +75,10 @@ pipeline: preset:simple # "A generates → B reviews" (default)
| Preset | Description | Auto-generated steps |
|--------|------|-------------------|
| `simple` | A generates → B reviews | generate(agent1) → review(agent2) |
| `cross-review` | Both generate, then review each other | gen_a → gen_b → review_of_b(agent_a) → review_of_a(agent_b) |
| `simple` | A codes → B reviews | coding(agent1) → review(agent2) |
| `cross-review` | Both code, then review each other | coding_a → coding_b → review_of_b(agent_a) → review_of_a(agent_b) |
| `plan-review` | Review documents before implementation | parallel plan_review_* → senior_review(optional) |
| `coding-review-fix` | Initial coding, then repeated review/fix | initial_coding(coding) → review_fix(review* → aggregate → coding → verify) |
A preset automatically builds the appropriate pipeline steps and context_override internally. agent1 and agent2 are assigned in the order the agents are defined under agents. If a preset is not sufficient, you can write the steps yourself.
@@ -109,11 +113,11 @@ cross_eval/
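- whether verdict_pattern is a valid regular expression
**prompts.py** (two default prompts plus pipeline preset definitions):
- `default:generate`: "Implement only what the plan specifies; no over-optimization" + plan/checklist/feedback + the instruction **"Explore the existing code in the project directory to understand the context"**
- `default:coding`: "Implement only what the plan specifies; no over-optimization" + plan/checklist/feedback + the instruction **"Explore the existing code in the project directory to understand the context"**
- `default:review`: review against three criteria (over-optimization / false positives / omissions) + `VERDICT: PASS|FAIL` output + the instruction **"Explore the project directory directly to verify the code"**
- `{variable}` placeholders; a missing value is rendered as `(no {key} provided)`
- Users can override the prompts with custom .md files
- `PIPELINE_PRESETS` dict: defines a StepConfig list per preset (`simple`, `cross-review`, and so on)
- `PIPELINE_PRESETS` dict: defines a StepConfig list per preset (`simple`, `cross-review`, `plan-review`, and so on)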
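
The placeholder behavior described for prompts.py (a missing `{variable}` rendering as `(no {key} provided)`) could be sketched like this. The names `_MissingAsNote` and `render_prompt` are hypothetical and the real implementation may differ; this version also assumes templates contain no literal braces.

```python
class _MissingAsNote(dict):
    """Dict that substitutes a note for any key that was not supplied."""

    def __missing__(self, key: str) -> str:
        return f"(no {key} provided)"


def render_prompt(template: str, values: dict) -> str:
    # str.format_map looks keys up through __missing__, so absent
    # placeholders produce the note instead of raising KeyError.
    return template.format_map(_MissingAsNote(values))
```

For example, rendering a template containing `{plan}` and `{feedback}` with only `plan` supplied leaves `(no feedback provided)` where the feedback would go.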
**agent.py** (`invoke_agent(agent_config, prompt, cwd)`):
- The `cwd` parameter sets the project directory → the agent can explore files in that directory
@@ -141,7 +145,7 @@ final-report.md generation
- Final verdict
**cli.py** (subcommands):
- `cross-eval init [--dir .] [--preset simple|cross-review]`: scaffolding (does not overwrite existing files)
- `cross-eval init [--dir .] [--preset simple|cross-review|plan-review]`: scaffolding (does not overwrite existing files)
- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...]`
- `--input key=path`: overrides or adds to the inputs in the config
- `--dry-run`: prints only the rendered prompts without calling any agent
@@ -167,3 +171,17 @@ final-report.md generation
3. Check prompt rendering with `cross-eval run --dry-run` (no agent calls)
4. Put some simple content in plan.md/checklist.md and do a real run with `cross-eval run --max-iter 2`
5. Confirm that v1/ and final-report.md are created in the `output/` directory
cross-eval run \
--docs /Users/chungyeong/Desktop/Dev/new-alpha-foundry/plans/TO_CLICKHOUSE \
--preset coding-review-fix \
--coder claude \
--reviewer codex \
--reviewer codex \
--reviewer codex \
--senior codex \
--coder-effort high \
--reviewer-effort high \
--senior-effort xhigh \
--max-iter 10

@@ -2,7 +2,7 @@
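A CLI tool that automates cross-verification between AI agents.
Based on a plan and a checklist, it automatically runs a "generate → review → feedback → regenerate" loop
Based on a plan and a checklist, it automatically runs a "code → review → feedback → recode" loop
and catches **over-optimization / false-positive / omission** problems.
## Installation
@@ -51,7 +51,7 @@ cp .cross-eval/checklist-sample.md .cross-eval/checklist.md
### 3. Run
```bash
# Default run (generate → review, up to 3 iterations)
# Default run (code → review, up to 3 iterations)
cross-eval run
# Check prompts only (no agent calls, saves cost)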
@@ -72,10 +72,10 @@ cross-eval run --config .cross-eval/config.yaml
```
output/
├── v1/
│ ├── generate.md # agent generation output
│ ├── coding.md # agent coding output
│ └── review.md # agent review output
├── v2/
│ ├── generate.md
│ ├── coding.md
│ └── review.md
└── final-report.md # overall summary report
```
@@ -92,7 +92,7 @@ inputs:
checklist: checklist.md
agents:
generator:
coder:
command: claude
args: ["-p", "--model", "sonnet", "--permission-mode", "auto"]
system_prompt: "You are a senior software engineer."
@@ -110,11 +110,16 @@ pipeline: preset:simple
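| Preset | Description |
|--------|------|
| `simple` | Agent A generates, Agent B reviews (default) |
| `cross-review` | Both generate, then cross-review each other |
| `simple` | Agent A codes, Agent B reviews (default) |
| `cross-review` | Both code, then cross-review each other |
| `plan-review` | Reviews the plan, checklist, and reference documents before implementation and, if needed, checks consistency with the current codebase |
| `review-only` | Reviews existing code only, for auditing |
| `review-fix` | Aggregates review results, then repeats automatic fixes and re-verification |
| `coding-review-fix` | After one initial coding pass, aggregates review results and repeats automatic fixes and re-verification |
```bash
# Init options
cross-eval init --preset cross-review # cross-review preset
cross-eval init --preset plan-review # pre-implementation document review preset
cross-eval init --lang en # English templates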
```

@@ -1,6 +1,6 @@
Metadata-Version: 2.4
Name: cross-eval
Version: 0.1.0
Version: 0.2.0
Summary: AI agent cross-evaluation CLI tool
Requires-Python: >=3.9
Requires-Dist: pyyaml>=6.0

@@ -4,14 +4,21 @@ cross_eval/__init__.py
cross_eval/agent.py
cross_eval/cli.py
cross_eval/config.py
cross_eval/demo.py
cross_eval/doctor.py
cross_eval/models.py
cross_eval/pipeline.py
cross_eval/prompts.py
cross_eval/report.py
cross_eval/runtime_env.py
cross_eval/worktree.py
cross_eval.egg-info/PKG-INFO
cross_eval.egg-info/SOURCES.txt
cross_eval.egg-info/dependency_links.txt
cross_eval.egg-info/entry_points.txt
cross_eval.egg-info/requires.txt
cross_eval.egg-info/top_level.txt
tests/test_agentic.py
tests/test_config.py
tests/test_onboarding.py
tests/test_pipeline_integration.py

@@ -1 +1 @@
__version__ = "0.1.0"
__version__ = "0.2.0"

@@ -3,8 +3,10 @@ from __future__ import annotations
import itertools
import logging
import os
import subprocess
import sys
import tempfile
import threading
import time
from pathlib import Path
@@ -19,6 +21,34 @@ _SYSTEM_PROMPT_AGENTS = ("claude",)
_REASONING_EFFORT_AGENTS = ("codex",)
class AgentInvocationError(RuntimeError):
    """Structured error for agent CLI failures."""

    def __init__(
        self,
        *,
        agent_name: str,
        step_name: str,
        cmd_preview: str,
        raw_error: str,
        failure_type: str,
        suggested_action: str,
    ) -> None:
        self.agent_name = agent_name
        self.step_name = step_name
        self.cmd_preview = cmd_preview
        self.raw_error = raw_error
        self.failure_type = failure_type
        self.suggested_action = suggested_action
        super().__init__(
            f"Agent '{agent_name}' failed (exit code != 0) at step '{step_name}':\n"
            f" type: {failure_type}\n"
            f" cmd: {cmd_preview}\n"
            f" error: {raw_error or '(no output)'}\n"
            f" action: {suggested_action}"
        )
def _supports_system_prompt_flag(command: str) -> bool:
"""Check if the agent CLI supports --system-prompt flag."""
return any(name in command for name in _SYSTEM_PROMPT_AGENTS)
@@ -29,6 +59,53 @@ def _supports_reasoning_effort(command: str) -> bool:
return any(name in command for name in _REASONING_EFFORT_AGENTS)
def _classify_agent_failure(detail: str) -> tuple[str, str]:
    """Classify a failed agent invocation into a user-actionable bucket."""
    normalized = detail.lower()
    auth_markers = (
        "not logged in",
        "please run /login",
        "auth",
        "authentication",
        "invalid api key",
        "api key",
        "unauthorized",
        "forbidden",
    )
    usage_limit_markers = (
        "quota",
        "rate limit",
        "credits",
        "credit balance",
        "budget",
        "insufficient funds",
        "usage limit",
        "token limit",
        "billing",
    )
    if any(marker in normalized for marker in auth_markers):
        return (
            "AUTH",
            "Agent CLI authentication is missing or expired. Re-authenticate the CLI, then rerun.",
        )
    if any(marker in normalized for marker in usage_limit_markers):
        return (
            "USAGE_LIMIT",
            "Agent CLI hit a quota, billing, or token budget limit. Refill or raise the limit, then rerun.",
        )
    if "api error" in normalized:
        return (
            "API_ERROR",
            "Agent CLI returned an API error. Inspect the saved error file for the raw response.",
        )
    return (
        "UNKNOWN",
        "Agent CLI failed for an unknown reason. Inspect the saved error file for details.",
    )
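
One thing to keep in mind with plain substring markers such as `"auth"` is that they also match unrelated words like "author". If that ever produces misclassified failures, a word-boundary variant along these lines might be worth considering. This is an illustrative sketch rather than part of the commit, and `contains_marker` is a hypothetical helper name.

```python
import re


def contains_marker(detail: str, markers: tuple[str, ...]) -> bool:
    """Return True if any marker occurs as a whole word or phrase in detail."""
    normalized = detail.lower()
    return any(
        re.search(rf"\b{re.escape(marker)}\b", normalized) is not None
        for marker in markers
    )
```

With boundaries in place, `"auth"` no longer fires on "author", which is why longer forms such as `"authentication"` would need to stay in the marker list as separate entries.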
class _Spinner:
"""Animated spinner for long-running agent calls."""
@@ -67,11 +144,17 @@ class _Spinner:
sys.stderr.flush()
def _is_print_mode(args: list[str]) -> bool:
    """Check if the agent args include -p / --print flag."""
    return "-p" in args or "--print" in args
def invoke_agent(
agent: AgentConfig,
prompt: str,
step_name: str,
cwd: Optional[Path] = None,
env: Optional[dict[str, str]] = None,
timeout: int | None = None,
quiet: bool = False,
) -> AgentResult:
@@ -80,18 +163,54 @@ def invoke_agent(
Args:
quiet: If True, suppress spinner (for parallel execution).
"""
is_claude = "claude" in agent.command
is_interactive = is_claude and not _is_print_mode(agent.args)
cmd = [agent.command]
if agent.reasoning_effort and _supports_reasoning_effort(agent.command):
cmd.extend(["-c", f'model_reasoning_effort="{agent.reasoning_effort}"'])
cmd.extend(agent.args)
# Build the full prompt (system prompt + user prompt)
# --- Temp files for interactive (non -p) claude ---
task_file: Optional[Path] = None
output_file: Optional[Path] = None
if is_interactive:
# Write prompt + output instruction to temp task file
task_fd, task_path = tempfile.mkstemp(suffix=".md", prefix="cross_eval_task_")
task_file = Path(task_path)
os.close(task_fd)
out_fd, out_path = tempfile.mkstemp(suffix=".md", prefix="cross_eval_out_")
output_file = Path(out_path)
os.close(out_fd)
# Clear the output file so we can detect if agent wrote to it
output_file.write_text("", encoding="utf-8")
wrapped_prompt = (
f"{prompt}\n\n"
f"---\n"
f"IMPORTANT: Write your COMPLETE response to this file: {output_file}\n"
f"Do NOT modify any other files in the project."
)
task_file.write_text(wrapped_prompt, encoding="utf-8")
# System prompt via flag
if agent.system_prompt and _supports_system_prompt_flag(agent.command):
cmd.extend(["--system-prompt", agent.system_prompt])
# Positional arg: point claude to the task file
cmd.append(
f"Read the task file at {task_file} and follow all instructions in it. "
f"Write your complete output to {output_file}."
)
input_data: str | None = None
else:
# Print mode (-p) or non-claude: deliver prompt via stdin
if agent.system_prompt and _supports_system_prompt_flag(agent.command):
# claude: --system-prompt flag supported natively
cmd.extend(["--system-prompt", agent.system_prompt])
input_data = prompt
elif agent.system_prompt:
# codex, others: no --system-prompt flag, prepend to prompt
input_data = (
f"<system>\n{agent.system_prompt}\n</system>\n\n"
f"{prompt}"
@@ -103,7 +222,8 @@ def invoke_agent(
spinner: Optional[_Spinner] = None
if not quiet:
logger.info(" cmd: %s", " ".join(cmd[:6]))
mode_label = "interactive" if is_interactive else ""
logger.info(" cmd: %s %s", " ".join(cmd[:6]), f"({mode_label})" if mode_label else "")
spinner = _Spinner(f"[{step_name}] {agent.name} running...")
spinner.start()
@@ -116,6 +236,7 @@ def invoke_agent(
text=True,
timeout=timeout,
cwd=cwd,
env=env,
)
duration = time.monotonic() - start
except subprocess.TimeoutExpired:
@@ -126,30 +247,54 @@ def invoke_agent(
if spinner:
spinner.stop(f"[{step_name}] ERROR")
raise
output = result.stdout.strip()
chars = len(output)
finally:
if task_file:
task_file.unlink(missing_ok=True)
if result.returncode != 0:
if spinner:
spinner.stop(f"[{step_name}] FAILED (exit {result.returncode})")
if output_file:
output_file.unlink(missing_ok=True)
err_detail = result.stderr.strip() or result.stdout.strip()
if err_detail and len(err_detail) > 500:
err_detail = err_detail[:500] + "..."
cmd_preview = " ".join(cmd[:6])
raise RuntimeError(
f"Agent '{agent.name}' failed (exit code {result.returncode}) "
f"at step '{step_name}':\n"
f" cmd: {cmd_preview}\n"
f" error: {err_detail or '(no output)'}"
failure_type, suggested_action = _classify_agent_failure(err_detail or "")
raise AgentInvocationError(
agent_name=agent.name,
step_name=step_name,
cmd_preview=cmd_preview,
raw_error=err_detail or "(no output)",
failure_type=failure_type,
suggested_action=suggested_action,
)
# --- Capture output ---
if output_file:
output = output_file.read_text(encoding="utf-8").strip()
output_file.unlink(missing_ok=True)
if not output:
# Fallback to stdout if agent didn't write to the file
output = result.stdout.strip()
else:
output = result.stdout.strip()
chars = len(output)
if spinner:
spinner.stop(f"[{step_name}] done — {chars} chars")
if not output:
stderr_info = result.stderr.strip()
if stderr_info:
logger.warning(
"Agent '%s' produced empty output at step '%s'",
"Agent '%s' produced empty output at step '%s'. stderr: %s",
agent.name, step_name, stderr_info[:500],
)
else:
logger.warning(
"Agent '%s' produced empty output at step '%s' (no stderr either)",
agent.name, step_name,
)
@@ -160,3 +305,131 @@ def invoke_agent(
step_name=step_name,
duration_seconds=round(duration, 1),
)
def invoke_agent_agentic(
agent: AgentConfig,
prompt: str,
step_name: str,
worktree_path: Path,
env: Optional[dict[str, str]] = None,
timeout: int | None = None,
quiet: bool = False,
) -> AgentResult:
"""Invoke an agent in agentic mode (no -p, runs in worktree, captures git diff).
The agent runs without print mode so it can modify files directly.
After the agent exits, git diff (since last commit) is captured as the output.
"""
from cross_eval.worktree import capture_diff
# Write prompt to a temp file (outside worktree, won't appear in diffs)
import tempfile
task_fd, task_path = tempfile.mkstemp(suffix=".md", prefix="cross_eval_task_")
task_file = Path(task_path)
task_file.write_text(prompt, encoding="utf-8")
os.close(task_fd)
cmd = [agent.command]
if agent.reasoning_effort and _supports_reasoning_effort(agent.command):
cmd.extend(["-c", f'model_reasoning_effort="{agent.reasoning_effort}"'])
# Strip stdin sentinel ("-") from args for agentic mode
args = [a for a in agent.args if a != "-"]
cmd.extend(args)
# System prompt via flag if supported
if agent.system_prompt and _supports_system_prompt_flag(agent.command):
cmd.extend(["--system-prompt", agent.system_prompt])
# Deliver the prompt differently per agent type
is_codex = "codex" in agent.command
input_data: str | None = None
if is_codex:
# codex: stdin mode
cmd.append("-")
if agent.system_prompt and not _supports_system_prompt_flag(agent.command):
input_data = f"<system>\n{agent.system_prompt}\n</system>\n\n{prompt}"
else:
input_data = prompt
else:
# claude: use positional arg with a pointer to the task file
# (avoids OS arg length limits for large prompts)
cmd.append(
f"Read the task file at {task_file} and execute all instructions in it. "
f"Work in the current directory."
)
logger.debug(
"Invoking agent '%s' (agentic) in worktree: %s",
agent.name, worktree_path,
)
spinner: Optional[_Spinner] = None
if not quiet:
logger.info(" cmd: %s (agentic)", " ".join(cmd[:6]))
spinner = _Spinner(f"[{step_name}] {agent.name} (agentic) running...")
spinner.start()
try:
start = time.monotonic()
result = subprocess.run(
cmd,
input=input_data,
capture_output=True,
text=True,
timeout=timeout,
cwd=worktree_path,
env=env,
)
duration = time.monotonic() - start
except subprocess.TimeoutExpired:
if spinner:
spinner.stop(f"[{step_name}] TIMEOUT after {timeout}s")
raise
except Exception:
if spinner:
spinner.stop(f"[{step_name}] ERROR")
raise
finally:
# Clean up temp task file (it's in /tmp, not in worktree)
task_file.unlink(missing_ok=True)
if result.returncode != 0:
if spinner:
spinner.stop(f"[{step_name}] FAILED (exit {result.returncode})")
err_detail = result.stderr.strip() or result.stdout.strip()
if err_detail and len(err_detail) > 500:
err_detail = err_detail[:500] + "..."
cmd_preview = " ".join(cmd[:6])
failure_type, suggested_action = _classify_agent_failure(err_detail or "")
raise AgentInvocationError(
agent_name=agent.name,
step_name=step_name,
cmd_preview=cmd_preview,
raw_error=err_detail or "(no output)",
failure_type=failure_type,
suggested_action=suggested_action,
)
# Capture git diff as the output (changes since last commit on the branch)
diff_output = capture_diff(worktree_path)
if not diff_output:
diff_output = "(no changes)"
logger.warning(
"Agent '%s' made no file changes at step '%s'",
agent.name, step_name,
)
chars = len(diff_output)
if spinner:
spinner.stop(f"[{step_name}] done — {chars} chars (agentic)")
return AgentResult(
output=diff_output,
exit_code=result.returncode,
agent_name=agent.name,
step_name=step_name,
duration_seconds=round(duration, 1),
)
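
The temp-file handling above calls `mkstemp`, writes through the path, and then closes the descriptor separately. A slightly tighter variant, sketched here under the assumption that only the prompt text needs to be written, wraps the descriptor with `os.fdopen` so it is closed even if the write fails. The helper name `write_task_file` is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_task_file(prompt: str) -> Path:
    """Write the prompt to a fresh temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".md", prefix="cross_eval_task_")
    # os.fdopen takes ownership of fd; the with-block closes it on exit,
    # including when write() raises.
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(prompt)
    return Path(path)
```

The caller remains responsible for removing the file afterwards, for example with `unlink(missing_ok=True)` in a `finally` block as the diff already does.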

@@ -7,7 +7,7 @@ import sys
from pathlib import Path
from cross_eval import __version__
from cross_eval.config import REASONING_EFFORT_CHOICES
from cross_eval.config import REASONING_EFFORT_CHOICES, resolve_agent_shorthand
logger = logging.getLogger(__name__)
@@ -38,7 +38,7 @@ coders: [claude-coder]
reviewers: [claude-reviewer]
# seniors: [codex-senior]
# 파이프라인 종류: simple | cross-review | review-only | review-fix
# 파이프라인 종류: simple | cross-review | plan-review | review-only | review-fix | coding-review-fix
pipeline: preset:{preset}
# 반복 설정
@@ -49,7 +49,7 @@ max_iterations: 3
language: {language}
# 결과 저장 경로
output_dir: output
output_dir: .cross-eval/output
# ─── 커스텀 에이전트 (선택) ────────────────────────────────────
# 기본 제공 에이전트를 덮어쓰거나 새 에이전트를 정의할 수 있습니다.
@@ -145,7 +145,7 @@ def main(argv: list[str] | None = None) -> int:
"AI 코딩 에이전트의 결과물을 자동으로 검증하는 CLI 도구.\n"
"\n"
"동작 방식:\n"
" 1. 기획서(plan)를 바탕으로 Coder 에이전트가 코드를 \n"
" 1. 기획서(plan)를 바탕으로 Coder 에이전트가 코드를 \n"
" 2. Reviewer 에이전트가 기획서 대비 코드를 검토하고 PASS/FAIL 판정\n"
" 3. FAIL이면 피드백을 반영해서 1~2를 반복 (최대 N회)\n"
"\n"
@@ -195,11 +195,19 @@ def main(argv: list[str] | None = None) -> int:
init_parser.add_argument(
"--preset",
default="simple",
choices=["simple", "cross-review", "review-only", "review-fix"],
choices=[
"simple",
"cross-review",
"plan-review",
"review-only",
"review-fix",
"coding-review-fix",
],
help=(
"파이프라인 종류 (기본: simple). "
"simple=코딩+리뷰, cross-review=교차리뷰, "
"review-only=리뷰만, review-fix=리뷰수렴+자동수정"
"simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서기획검토, "
"review-only=리뷰만, review-fix=리뷰수렴+자동수정, "
"coding-review-fix=초기코딩후리뷰수렴"
),
)
init_parser.add_argument(
@@ -208,13 +216,65 @@ def main(argv: list[str] | None = None) -> int:
choices=["en", "ko"],
help="프롬프트 언어 (기본: ko)",
)
init_parser.add_argument(
"--guided",
action="store_true",
help="대화형 설정 마법사 실행",
)
# --- doctor ---
doctor_parser = subparsers.add_parser(
"doctor",
help="실행 환경 점검 (CLI 설치, 인증, 설정 파일 검증)",
description="cross-eval 실행에 필요한 환경을 점검합니다.",
)
doctor_parser.add_argument(
"--dir",
type=Path,
default=Path("."),
help="점검할 디렉토리 (기본: 현재 디렉토리)",
)
# --- demo ---
demo_parser = subparsers.add_parser(
"demo",
help="내장 데모 실행 (파이프라인 동작 체험)",
description=(
"내장된 간단한 기획서로 cross-eval 파이프라인의 전체 동작을 체험합니다.\n"
"기본값은 mock 모드(시뮬레이션)이며, --live로 실제 에이전트를 호출할 수 있습니다."
),
formatter_class=argparse.RawDescriptionHelpFormatter,
)
demo_parser.add_argument(
"--live",
action="store_true",
help="실제 에이전트를 호출하여 데모 실행 (API 비용 발생)",
)
demo_parser.add_argument(
"--preset",
default="simple",
choices=["simple", "review-fix", "coding-review-fix"],
help="데모할 파이프라인 종류 (기본: simple)",
)
demo_parser.add_argument(
"--escalate",
action="store_true",
help="ESCALATE 시나리오 데모 (mock 모드 전용)",
)
demo_parser.add_argument(
"--timeout",
type=int,
default=None,
metavar="SEC",
help="에이전트 호출 제한 시간 (--live 전용)",
)
# --- run ---
run_parser = subparsers.add_parser(
"run",
help="검증 파이프라인 실행",
description=(
"기획서(plan)를 기반으로 AI 에이전트가 코드 생성과 리뷰를 반복합니다.\n"
"기획서(plan)를 기반으로 AI 에이전트가 코딩과 리뷰를 반복합니다.\n"
"\n"
"설정 파일 없이 바로 실행할 수 있고, config.yaml로도 실행할 수 있습니다.\n"
"CLI 옵션이 config.yaml보다 우선합니다."
@@ -222,13 +282,19 @@ def main(argv: list[str] | None = None) -> int:
epilog=(
"파이프라인 종류 (--preset):\n"
" ┌──────────────┬─────────────────────────────────────────────────────┐\n"
" │ simple │ Coder가 코드 생성 → Reviewer가 리뷰 │\n"
" │ (기본값) │ FAIL이면 피드백 반영해서 재생성, PASS까지 반복 │\n"
" │ simple │ Coder가 코드 작성 → Reviewer가 리뷰 │\n"
" │ (기본값) │ FAIL이면 피드백 반영해서 재코딩, PASS까지 반복 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ review-fix │ 2단계 파이프라인: │\n"
" │ │ Reviewer N명 병렬 리뷰 → 취합 → 수정 → 재검증 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ review-only │ 코드 생성 없이 Reviewer N명이 기존 코드만 검토 │\n"
" │ coding- │ 3단계 파이프라인: │\n"
" │ review-fix │ 초기 코딩 1회 → 리뷰 취합 → 수정 → 재검증 반복 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ plan-review │ 구현 전 기획서/체크리스트/문서를 검토 │\n"
" │ │ 필요하면 현재 코드베이스와의 정합성도 점검 │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ review-only │ 코드 작성 없이 Reviewer N명이 기존 코드만 검토 │\n"
" │ │ (이미 작성된 코드의 품질 감사용) │\n"
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
" │ cross-review │ Coder 2명이 각각 구현 → 상대방 코드를 교차 리뷰 │\n"
@@ -239,10 +305,10 @@ def main(argv: list[str] | None = None) -> int:
" ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n"
" │ 이름 │ CLI │ 기본 모델 │ 역할 │\n"
" ├──────────────────┼─────────┼───────────┼──────────────────────────┤\n"
" │ claude-coder │ claude │ opus │ 코드 생성 │\n"
" │ claude-coder │ claude │ opus │ 코드 작성 │\n"
" │ claude-reviewer │ claude │ opus │ 코드 리뷰 │\n"
" │ claude-senior │ claude │ opus │ 리뷰 취합/판정 │\n"
" │ codex-coder │ codex │ gpt-5.4 │ 코드 생성 │\n"
" │ codex-coder │ codex │ gpt-5.4 │ 코드 작성 │\n"
" │ codex-reviewer │ codex │ gpt-5.4 │ 코드 리뷰 │\n"
" │ codex-senior │ codex │ gpt-5.4 │ 리뷰 취합/판정 │\n"
" └──────────────────┴─────────┴───────────┴──────────────────────────┘\n"
@@ -267,10 +333,18 @@ def main(argv: list[str] | None = None) -> int:
" cross-eval run --plan plan.md --preset review-fix \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 초기 코딩 후 리뷰 수렴 + 자동 수정 (coding-review-fix):\n"
" cross-eval run --plan plan.md --preset coding-review-fix \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 기존 코드 리뷰만 (review-only):\n"
" cross-eval run --plan plan.md --preset review-only \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 구현 전 문서/기획 검토 (plan-review):\n"
" cross-eval run --plan plan.md --preset plan-review \\\n"
" --reviewer claude --reviewer codex\n"
"\n"
" 모델 변경:\n"
" cross-eval run --plan plan.md --model sonnet\n"
"\n"
@@ -298,6 +372,14 @@ def main(argv: list[str] | None = None) -> int:
"--input", action="append", dest="inputs", metavar="KEY=PATH",
help="추가 입력 파일 (예: --input spec=./api-spec.md)",
)
input_group.add_argument(
"--env-file", action="append", dest="env_files", type=Path, default=None,
help="에이전트 subprocess에 주입할 추가 .env 파일 (여러 개 가능)",
)
input_group.add_argument(
"--target", action="append", dest="execution_targets", default=None,
help="에이전트에게 강조할 실행 대상 힌트 (예: clickhouse, postgres)",
)
# -- 에이전트 설정 --
agent_group = run_parser.add_argument_group(
@@ -336,12 +418,16 @@ def main(argv: list[str] | None = None) -> int:
choices=REASONING_EFFORT_CHOICES + ("extra-high", "extra_high", "x-high"),
help="Senior용 reasoning effort",
)
agent_group.add_argument(
"--agentic", action="store_true", default=False,
help="Coder를 agentic 모드로 실행 (worktree에서 파일 직접 수정, git diff로 결과 캡처)",
)
agent_group.add_argument(
"--model", default=None, metavar="MODEL",
help="모든 에이전트의 모델을 한번에 변경 (예: sonnet, opus)",
)
agent_group.add_argument(
"--generator-model", default=None, metavar="MODEL",
"--coder-model", default=None, metavar="MODEL",
help="Coder 에이전트 모델만 변경",
)
agent_group.add_argument(
@@ -353,7 +439,14 @@ def main(argv: list[str] | None = None) -> int:
pipe_group = run_parser.add_argument_group("파이프라인")
pipe_group.add_argument(
"--preset", default=None,
choices=["simple", "cross-review", "review-only", "review-fix"],
choices=[
"simple",
"cross-review",
"plan-review",
"review-only",
"review-fix",
"coding-review-fix",
],
help="파이프라인 종류 (기본: simple). 각 종류 설명은 아래 참조",
)
pipe_group.add_argument(
@@ -400,6 +493,10 @@ def main(argv: list[str] | None = None) -> int:
if args.command == "init":
return cmd_init(args)
elif args.command == "doctor":
return cmd_doctor(args)
elif args.command == "demo":
return cmd_demo(args)
elif args.command == "run":
return cmd_run(args)
else:
@@ -407,9 +504,186 @@ def main(argv: list[str] | None = None) -> int:
return 0
def cmd_doctor(args: argparse.Namespace) -> int:
    """Run environment health checks."""
    from cross_eval.doctor import format_doctor_results, run_doctor

    checks = run_doctor(args.dir.resolve())
    print(format_doctor_results(checks))
    has_critical = any(not c.passed and c.critical for c in checks)
    return 1 if has_critical else 0
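
The exit-code rule in `cmd_doctor` (fail only when a critical check did not pass) can be illustrated in isolation. The `Check` dataclass below is a stand-in: the `passed` and `critical` attributes come from the diff, while the class name, the `name` field, and the sample check names are assumptions for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str               # hypothetical label, for illustration only
    passed: bool
    critical: bool = False


def doctor_exit_code(checks: list) -> int:
    """Return 1 if any critical check failed, otherwise 0."""
    has_critical = any(not c.passed and c.critical for c in checks)
    return 1 if has_critical else 0
```

Under this rule a failing non-critical check acts as a warning and still yields exit code 0.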
def cmd_demo(args: argparse.Namespace) -> int:
"""Run a built-in demo to show the pipeline lifecycle."""
from cross_eval.demo import run_live_demo, run_mock_demo
if args.live:
print("\n⚠ --live 모드: 실제 AI 에이전트를 호출합니다 (API 비용 발생).")
print(" 내장 피보나치 함수 기획서를 사용합니다.\n")
try:
answer = input("계속하시겠습니까? [y/N] ").strip().lower()
except (EOFError, KeyboardInterrupt):
print("\n취소됨.")
return 0
if answer not in ("y", "yes"):
print("취소됨.")
return 0
try:
raw_timeout = args.timeout if args.timeout is not None else 0
agent_timeout = None if raw_timeout == 0 else raw_timeout
result = run_live_demo(preset=args.preset, timeout=agent_timeout)
print(f"\nResult: {result.final_verdict}")
print(f"Iterations: {len(result.iterations)}")
if result.run_dir:
print(f"Output: {result.run_dir}/")
return 0
except (RuntimeError, KeyboardInterrupt) as e:
if isinstance(e, KeyboardInterrupt):
print("\nInterrupted.")
return 130
print(f"Demo error: {e}", file=sys.stderr)
return 1
else:
run_mock_demo(preset=args.preset, show_escalate=args.escalate)
return 0
# ---------------------------------------------------------------------------
# Guided init wizard
# ---------------------------------------------------------------------------
_PRESET_DESCRIPTIONS = {
"simple": "코딩 + 리뷰 (가장 기본)",
"review-fix": "리뷰 → 취합 → 수정 → 재검증 반복",
"coding-review-fix": "초기 코딩 + 리뷰 수렴 반복",
"plan-review": "구현 전 기획서/문서 검토",
"review-only": "기존 코드만 리뷰 (코딩 없음)",
"cross-review": "2명이 각각 구현 후 교차 리뷰",
}
_PRESET_ORDER = [
"simple", "review-fix", "coding-review-fix",
"plan-review", "review-only", "cross-review",
]
def _prompt_choice(
    message: str,
    choices: list[str],
    descriptions: dict[str, str] | None = None,
    default: int = 1,
) -> str:
    """Prompt user to pick from a numbered list."""
    print(f"\n{message}")
    for i, choice in enumerate(choices, 1):
        desc = f"{descriptions[choice]}" if descriptions and choice in descriptions else ""
        marker = " (기본)" if i == default else ""
        print(f" {i}. {choice}{desc}{marker}")
    while True:
        try:
            raw = input(f"선택 [{default}]: ").strip()
        except (EOFError, KeyboardInterrupt):
            print()
            return choices[default - 1]
        if not raw:
            return choices[default - 1]
        try:
            idx = int(raw)
            if 1 <= idx <= len(choices):
                return choices[idx - 1]
        except ValueError:
            if raw in choices:
                return raw
        print(f" 1-{len(choices)} 사이 숫자를 입력하세요.")
def _prompt_text(message: str, default: str = "") -> str:
    """Prompt for text input with default."""
    suffix = f" [{default}]" if default else ""
    try:
        raw = input(f"{message}{suffix}: ").strip()
    except (EOFError, KeyboardInterrupt):
        print()
        return default
    return raw or default
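
Prompt helpers like `_prompt_text` are awkward to unit-test because they call `input()` directly. One possible approach, shown here as a sketch and not as what the commit does, is to accept the reader as a parameter. The `read` parameter and the public name `prompt_text` are assumptions.

```python
from typing import Callable


def prompt_text(
    message: str,
    default: str = "",
    read: Callable[[str], str] = input,
) -> str:
    """Ask for text, falling back to default on empty input or EOF."""
    suffix = f" [{default}]" if default else ""
    try:
        raw = read(f"{message}{suffix}: ").strip()
    except (EOFError, KeyboardInterrupt):
        return default
    return raw or default
```

Tests can then pass a lambda or a small function instead of patching `builtins.input`.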
def _run_guided_init(target: Path) -> dict:
"""Interactive setup wizard. Returns settings dict."""
print("\n━━━ cross-eval 설정 마법사 ━━━\n")
lang = _prompt_choice(
"언어 / Language:",
["ko", "en"],
{"ko": "한국어", "en": "English"},
default=1,
)
preset = _prompt_choice(
"파이프라인 종류:",
_PRESET_ORDER,
_PRESET_DESCRIPTIONS,
default=1,
)
print("\n--- 에이전트 설정 ---")
print(" 사용 가능: claude, codex (또는 claude-coder, codex-reviewer 등)")
coder = _prompt_text(" Coder 에이전트", default="claude")
reviewer = _prompt_text(" Reviewer 에이전트", default="claude")
needs_senior = preset in ("review-fix", "coding-review-fix")
senior = ""
if needs_senior:
senior = _prompt_text(" Senior 에이전트", default=reviewer)
else:
senior = _prompt_text(" Senior 에이전트 (선택, Enter로 건너뛰기)", default="")
max_iter = _prompt_text("최대 반복 횟수", default="3")
try:
max_iter_int = int(max_iter)
except ValueError:
max_iter_int = 3
create_templates = _prompt_text(
"\n템플릿 파일(plan.md, checklist.md) 생성?", default="Y",
).lower() in ("y", "yes", "")
return {
"lang": lang,
"preset": preset,
"coder": coder,
"reviewer": reviewer,
"senior": senior,
"max_iter": max_iter_int,
"create_templates": create_templates,
}
def cmd_init(args: argparse.Namespace) -> int:
"""Scaffold a new cross-eval project."""
target = args.dir.resolve()
if args.guided:
settings = _run_guided_init(target)
args.lang = settings["lang"]
args.preset = settings["preset"]
# We'll use guided settings for enhanced config generation
return _write_init_files(target, args, guided_settings=settings)
return _write_init_files(target, args)
def _write_init_files(
target: Path,
args: argparse.Namespace,
guided_settings: dict | None = None,
) -> int:
"""Write config and template files to target directory."""
ce_dir = target / ".cross-eval"
ce_dir.mkdir(parents=True, exist_ok=True)
@@ -417,14 +691,23 @@ def cmd_init(args: argparse.Namespace) -> int:
plan_sample = PLAN_SAMPLE_KO if lang == "ko" else PLAN_SAMPLE_EN
checklist_sample = CHECKLIST_SAMPLE_KO if lang == "ko" else CHECKLIST_SAMPLE_EN
files = {
".cross-eval/config.yaml": DEFAULT_CONFIG_YAML.format(
# Generate config content
if guided_settings:
config_content = _generate_guided_config(args.preset, lang, guided_settings)
else:
config_content = DEFAULT_CONFIG_YAML.format(
preset=args.preset, language=lang,
),
".cross-eval/plan.md": plan_sample,
".cross-eval/checklist.md": checklist_sample,
)
files: dict[str, str] = {
".cross-eval/config.yaml": config_content,
}
# Add templates unless guided mode opted out
if not guided_settings or guided_settings.get("create_templates", True):
files[".cross-eval/plan.md"] = plan_sample
files[".cross-eval/checklist.md"] = checklist_sample
created = []
skipped = []
for name, content in files.items():
@@ -436,23 +719,67 @@ def cmd_init(args: argparse.Namespace) -> int:
created.append(name)
if created:
print(f" 생성: {', '.join(created)}")
print(f"\n 생성: {', '.join(created)}")
if skipped:
print(f" 이미 존재 (건너뜀): {', '.join(skipped)}")
print(f"\n 파이프라인: {args.preset}")
print(f" 언어: {lang}")
if guided_settings:
print(f" Coder: {guided_settings['coder']}")
print(f" Reviewer: {guided_settings['reviewer']}")
if guided_settings.get("senior"):
print(f" Senior: {guided_settings['senior']}")
print(f" 최대 반복: {guided_settings['max_iter']}")
print("")
print("다음 단계:")
print(" 1. .cross-eval/plan.md 에 기획서 작성")
print(" 2. .cross-eval/checklist.md 에 체크리스트 작성 (선택)")
print(" 3. cross-eval run 으로 실행")
print("")
print("주의: 에이전트는 기본적으로 파일 읽기/쓰기/실행 권한을 가집니다.")
print(" 실행 전에 .cross-eval/config.yaml 을 확인하세요.")
print("팁: cross-eval doctor 로 환경 점검을 먼저 하세요.")
print(" cross-eval demo 로 동작 방식을 미리 볼 수 있습니다.")
return 0
def _generate_guided_config(
    preset: str,
    lang: str,
    settings: dict,
) -> str:
    """Generate config.yaml content from guided init settings."""
    coder_name = resolve_agent_shorthand(settings["coder"], "coder")
    reviewer_name = resolve_agent_shorthand(settings["reviewer"], "reviewer")
    lines = [
        "# cross-eval 설정 (guided init으로 생성됨)",
        "",
        "inputs:",
        " plan: plan.md",
        " checklist: checklist.md",
        "",
        f"coders: [{coder_name}]",
        f"reviewers: [{reviewer_name}]",
    ]
    senior = settings.get("senior", "")
    if senior:
        senior_name = resolve_agent_shorthand(senior, "senior")
        lines.append(f"seniors: [{senior_name}]")
    lines.extend([
        "",
        f"pipeline: preset:{preset}",
        "",
        f"max_iterations: {settings['max_iter']}",
        f"language: {lang}",
        "output_dir: .cross-eval/output",
        "",
    ])
    return "\n".join(lines) + "\n"
def _read_docs_dir(docs_dir: Path) -> str:
"""Read all files in a directory and concatenate with filename headers."""
parts: list[str] = []
@@ -482,12 +809,21 @@ def _apply_model_override(config, agent_name: str, model: str) -> None:
agent.args = new_args
def _apply_phased_iteration_override(config, max_iter: int | None) -> None:
"""Apply CLI max-iter to converging phases while preserving setup phases."""
from cross_eval.config import sync_phased_iterations
sync_phased_iterations(config, max_iter)
def cmd_run(args: argparse.Namespace) -> int:
"""Load config, validate, and execute the pipeline."""
from cross_eval.config import (
ensure_fix_preset_agentic,
apply_input_overrides,
default_config,
load_config,
sync_phased_iterations,
validate_config,
)
from cross_eval.prompts import PIPELINE_PRESETS
@@ -562,7 +898,7 @@ def cmd_run(args: argparse.Namespace) -> int:
preset = args.preset or "simple"
# Determine which preset was configured (from YAML or defaults)
if args.preset is None and config.phases:
preset = "review-fix" # only phased preset currently
preset = config.preset_name if config.preset_name != "custom" else "review-fix"
elif args.preset is None and not args.coders and not args.reviewers and not args.seniors:
pass # no changes needed
inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles(
@@ -584,13 +920,18 @@ def cmd_run(args: argparse.Namespace) -> int:
config.preset_name = preset
if preset in PHASED_PRESETS:
config.phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
_apply_phased_iteration_override(config, args.max_iter)
config.pipeline = []
elif preset in PIPELINE_PRESETS:
config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
config.phases = []
if preset == "review-only" and args.max_iter is None and args.min_iter is None:
if preset in {"plan-review", "review-only"} and args.max_iter is None and args.min_iter is None:
config.max_iterations = 1
sync_phased_iterations(config)
if args.max_iter is not None:
sync_phased_iterations(config, args.max_iter)
apply_reasoning_effort_settings(
config,
reasoning_effort=args.reasoning_effort,
@@ -599,14 +940,23 @@ def cmd_run(args: argparse.Namespace) -> int:
senior_effort=args.senior_effort,
)
# --agentic: convert coder agents to agentic mode
if args.agentic:
from cross_eval.config import _make_agentic
for coder_name in config.coders:
if coder_name in config.agents:
_make_agentic(config.agents[coder_name])
ensure_fix_preset_agentic(config)
# --model: apply to ALL agents
if args.model is not None:
for agent_name in config.agents:
_apply_model_override(config, agent_name, args.model)
# --generator-model / --reviewer-model: apply by role
if args.generator_model is not None:
# --coder-model / --reviewer-model: apply by role
if args.coder_model is not None:
for coder_name in config.coders:
_apply_model_override(config, coder_name, args.generator_model)
_apply_model_override(config, coder_name, args.coder_model)
if args.reviewer_model is not None:
for reviewer_name in config.reviewers:
_apply_model_override(config, reviewer_name, args.reviewer_model)
@@ -632,6 +982,17 @@ def cmd_run(args: argparse.Namespace) -> int:
return 1
config.inputs["docs"] = docs_content
if args.env_files:
for env_file in args.env_files:
resolved = env_file.resolve()
if not resolved.exists():
print(f"Env file not found: {resolved}", file=sys.stderr)
return 1
config.execution.env_files.append(str(resolved))
if args.execution_targets:
config.execution.auto_context_targets = list(args.execution_targets)
if args.inputs:
overrides = {}
for item in args.inputs:
@@ -694,6 +1055,11 @@ def cmd_run(args: argparse.Namespace) -> int:
if not args.dry_run and result.run_dir:
print(f"Output: {result.run_dir}/")
if result.final_verdict == "ESCALATE":
from cross_eval.report import print_escalation_report
print_escalation_report(config, result)
return 2
return 0 if result.final_verdict == "PASS" else 1
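The return statements at the end of `cmd_run` amount to a three-way exit-code contract (PASS=0, FAIL=1, ESCALATE=2). A minimal sketch of that mapping follows, assuming any verdict other than PASS or ESCALATE counts as a failure, as the final `else 1` implies. The names `EXIT_CODES`, `verdict_to_exit_code`, and `describe_exit_code` are hypothetical and not part of this change set.

```python
# Hypothetical helpers mirroring the return statements at the end of cmd_run.
EXIT_CODES = {"PASS": 0, "FAIL": 1, "ESCALATE": 2}


def verdict_to_exit_code(final_verdict: str) -> int:
    """Map a final verdict to a process exit code; anything unknown is a failure."""
    return EXIT_CODES.get(final_verdict, 1)


def describe_exit_code(code: int) -> str:
    """Translate an exit code back into a message a CI wrapper could print."""
    if code == 0:
        return "review passed"
    if code == 2:
        return "human review required"
    return "review failed"
```

A wrapper script could call `describe_exit_code` on the return code of the `cross-eval run` subprocess to decide whether to block a merge or page a person.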


@@ -1,6 +1,7 @@
"""Configuration loading, validation, and preset resolution."""
from __future__ import annotations
import copy
import logging
import re
from pathlib import Path
@@ -8,7 +9,13 @@ from typing import Any
import yaml
from cross_eval.models import AgentConfig, PhaseConfig, PipelineConfig, StepConfig
from cross_eval.models import (
AgentConfig,
ExecutionConfig,
PhaseConfig,
PipelineConfig,
StepConfig,
)
from cross_eval.prompts import PHASED_PRESETS, PIPELINE_PRESETS
logger = logging.getLogger(__name__)
@@ -24,6 +31,7 @@ DEFAULT_ROLE_REASONING_EFFORTS = {
"reviewer": "medium",
"senior": "high",
}
FIX_STYLE_PRESETS = {"review-fix", "coding-review-fix"}
# ---------------------------------------------------------------------------
@@ -39,34 +47,67 @@ _CODEX_ARGS = [
"-",
]
_CLAUDE_BASE_ARGS = [
"-p",
"--setting-sources",
"user",
"--disable-slash-commands",
"--model",
"opus",
]
_CLAUDE_CODER_ARGS = list(_CLAUDE_BASE_ARGS) + [
"--dangerously-skip-permissions",
"--permission-mode",
"bypassPermissions",
]
_CLAUDE_REVIEW_ARGS = [
"--setting-sources",
"user",
"--disable-slash-commands",
"--model",
"opus",
"--permission-mode",
"plan",
]
_CODER_SYSTEM_PROMPT = (
"You are a senior software engineer implementing code changes.\n"
"Rules:\n"
"1. FIRST explore the project directory to understand the existing codebase, "
"patterns, and conventions before writing any code.\n"
"2. Implement ONLY what the plan specifies. Do NOT add extra features, "
"2. You may decide which shell, Python, git, docker, test, and database commands "
"to run. The user does not need to pre-specify exact commands.\n"
"3. Environment variables from configured .env files may already be loaded into "
"your process; use them when validating services such as ClickHouse.\n"
"4. Implement ONLY what the plan specifies. Do NOT add extra features, "
"unnecessary abstractions, premature optimizations, or \"nice-to-have\" improvements.\n"
"3. Follow the project's existing coding style, naming conventions, and directory structure.\n"
"4. If previous review feedback is provided, fix ONLY the specific issues mentioned. "
"5. Follow the project's existing coding style, naming conventions, and directory structure.\n"
"6. If previous review feedback is provided, fix ONLY the specific issues mentioned. "
"Do NOT refactor unrelated code.\n"
"5. Ignore any items from previous feedback that were marked as DISMISSED or false positive.\n"
"6. When in doubt about scope, do LESS, not more."
"7. Ignore any items from previous feedback that were marked as DISMISSED or false positive.\n"
"8. When in doubt about scope, do LESS, not more."
)
_REVIEWER_SYSTEM_PROMPT = (
"You are a code reviewer. You MUST NOT create, modify, or delete any files.\n"
"Rules:\n"
"1. Explore the project directory to understand the full codebase context.\n"
"2. Compare the implementation against the plan and checklist ONLY.\n"
"3. Classify every issue with BOTH severity AND category:\n"
"2. You may decide which shell, Python, test, git, docker, and database read commands "
"to run in order to verify behavior. The user does not need to pre-specify exact commands.\n"
"3. Environment variables from configured .env files may already be loaded into "
"your process; use them for verification when relevant.\n"
"4. Compare the implementation against the plan and checklist ONLY.\n"
"5. Classify every issue with BOTH severity AND category:\n"
" - Severity: Critical (breaks functionality/security) > Major (requirement mismatch) > Minor (convention/style)\n"
" - Category: Over-engineering / Omission\n"
"4. When reviewing with previous feedback, mark items as CONFIRMED (still an issue) "
"6. When reviewing with previous feedback, mark items as CONFIRMED (still an issue) "
"or DISMISSED (false positive) with rationale.\n"
"5. Report out-of-scope issues separately — problems found outside plan/checklist scope.\n"
"6. Order issues by severity (Critical first).\n"
"7. Do NOT suggest improvements beyond the plan scope.\n"
"8. End with VERDICT: PASS (all requirements met, no over-engineering) "
"7. Report out-of-scope issues separately — problems found outside plan/checklist scope.\n"
"8. Order issues by severity (Critical first).\n"
"9. Do NOT suggest improvements beyond the plan scope.\n"
"10. End with VERDICT: PASS (all requirements met, no over-engineering) "
"or VERDICT: FAIL (issues found)."
)
@@ -74,36 +115,48 @@ _SENIOR_SYSTEM_PROMPT = (
"You are a senior technical reviewer coordinating a review-fix-verification loop.\n"
"Rules:\n"
"1. Explore the project directory to understand the full codebase context.\n"
"2. In aggregation mode, deduplicate overlaps, resolve disagreements, and keep only "
"2. You may decide which shell, Python, test, git, docker, and database read commands "
"to run to verify disputed issues. The user does not need to pre-specify exact commands.\n"
"3. Environment variables from configured .env files may already be loaded into "
"your process; use them when validating service integrations.\n"
"4. In aggregation mode, deduplicate overlaps, resolve disagreements, and keep only "
"evidence-backed issues. Categorize dismissed findings as [False positive] or [Already fixed].\n"
"3. In verification mode, judge the current implementation directly against ONLY the "
"5. In verification mode, judge the current implementation directly against ONLY the "
"plan and checklist.\n"
"4. Be skeptical of false positives, but do not lower the bar on real requirement "
"6. Be skeptical of false positives, but do not lower the bar on real requirement "
"gaps.\n"
"5. When issues remain, produce a concise prioritized action list the coder can act on.\n"
"6. Do NOT invent new requirements beyond the plan and checklist.\n"
"7. End with VERDICT: PASS or VERDICT: FAIL."
"7. When issues remain, produce a concise prioritized action list the coder can act on.\n"
"8. Maintain an Issue Tracker table across iterations to track issue status.\n"
"9. Do NOT invent new requirements beyond the plan and checklist.\n"
"10. End with one of three verdicts:\n"
" - VERDICT: PASS — all requirements met, no issues remain.\n"
" - VERDICT: FAIL — issues found that the coder can fix.\n"
" - VERDICT: ESCALATE — issues that require human intervention. Use ESCALATE when:\n"
" * Requirements are ambiguous and need clarification from stakeholders\n"
" * Architecture decisions are needed that go beyond the plan scope\n"
" * External dependency issues block progress\n"
" * The coder has failed to resolve the same issue 2+ times"
)
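The three verdicts listed in the senior prompt need a tie-break when an output contains more than one `VERDICT:` line. The commit message says ESCALATE takes priority over PASS and FAIL; the sketch below shows one way that could work. It is not the project's actual extraction code: the function name `extract_verdict`, the regex, and the choice to rank FAIL above PASS are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed line format, taken from the prompts above: "VERDICT: <WORD>".
_VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL|ESCALATE)\b")

# Higher number wins. ESCALATE first is stated in the commit message;
# ranking FAIL above PASS is an assumption of this sketch.
_PRIORITY = {"PASS": 0, "FAIL": 1, "ESCALATE": 2}


def extract_verdict(output: str) -> Optional[str]:
    """Return the highest-priority verdict found in the output, or None."""
    found = _VERDICT_RE.findall(output)
    if not found:
        return None
    return max(found, key=_PRIORITY.__getitem__)
```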
BUILTIN_AGENTS: dict[str, AgentConfig] = {
"claude-coder": AgentConfig(
name="claude-coder",
command="claude",
args=["-p", "--model", "opus", "--permission-mode", "auto"],
args=list(_CLAUDE_CODER_ARGS),
system_prompt=_CODER_SYSTEM_PROMPT,
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["coder"],
),
"claude-reviewer": AgentConfig(
name="claude-reviewer",
command="claude",
args=["-p", "--model", "opus", "--permission-mode", "auto"],
args=list(_CLAUDE_REVIEW_ARGS),
system_prompt=_REVIEWER_SYSTEM_PROMPT,
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["reviewer"],
),
"claude-senior": AgentConfig(
name="claude-senior",
command="claude",
args=["-p", "--model", "opus", "--permission-mode", "auto"],
args=list(_CLAUDE_REVIEW_ARGS),
system_prompt=_SENIOR_SYSTEM_PROMPT,
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["senior"],
),
@@ -136,6 +189,11 @@ _AGENT_ALIASES: dict[str, str] = {
"codex": "codex",
}
_ROLE_ALIASES: dict[str, str] = {
"coding": "coding",
"review": "review",
}
def resolve_agent_shorthand(name: str, role: str) -> str:
"""Resolve shorthand agent name to full builtin name.
@@ -150,6 +208,16 @@ def resolve_agent_shorthand(name: str, role: str) -> str:
return name
def normalize_step_role(role: str) -> str:
    """Normalize step role aliases to the canonical role name."""
    return _ROLE_ALIASES.get(role, role)


def normalize_prompt_template(template_ref: str) -> str:
    """Normalize prompt template aliases to canonical template refs."""
    return template_ref
# ---------------------------------------------------------------------------
# Role inference (backward compatibility)
# ---------------------------------------------------------------------------
@@ -220,7 +288,7 @@ def _resolve_agents(
for name in all_referenced:
if name not in result and name in BUILTIN_AGENTS:
result[name] = BUILTIN_AGENTS[name]
result[name] = copy.deepcopy(BUILTIN_AGENTS[name])
return result
@@ -233,7 +301,7 @@ def _default_seniors_for_preset(
"""Infer a default senior agent for presets that benefit from adjudication."""
if not (
isinstance(pipeline_raw, str)
and pipeline_raw == "preset:review-fix"
and pipeline_raw in {"preset:review-fix", "preset:coding-review-fix"}
and reviewers
):
return []
@@ -311,15 +379,16 @@ def _apply_role_effort(
def default_config() -> PipelineConfig:
"""Return a PipelineConfig with sensible defaults (no YAML needed)."""
agents = dict(BUILTIN_AGENTS)
agents = copy.deepcopy(BUILTIN_AGENTS)
coders = ["claude-coder"]
reviewers = ["claude-reviewer"]
seniors: list[str] = []
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
return PipelineConfig(
output_dir=Path("output"),
output_dir=Path(".cross-eval/output"),
max_iterations=3,
language="ko",
execution=ExecutionConfig(),
inputs={},
agents=agents,
coders=coders,
@@ -363,6 +432,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
system_prompt=agent_data.get("system_prompt"),
reasoning_effort=agent_data.get("reasoning_effort"),
stdin_mode=agent_data.get("stdin_mode", False),
agentic=agent_data.get("agentic", False),
)
# --- roles: explicit or inferred ---
@@ -402,6 +472,17 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
p = config_dir / p
inputs[key] = p
execution_raw = raw.get("execution", {}) or {}
execution = ExecutionConfig(
mode=execution_raw.get("mode", "agent-decides"),
command_policy=execution_raw.get("command_policy", "broad"),
inherit_env=bool(execution_raw.get("inherit_env", True)),
auto_env_files=list(execution_raw.get("auto_env_files", [".env", ".env.local"])),
env_files=list(execution_raw.get("env_files", [])),
expose_env_names=bool(execution_raw.get("expose_env_names", True)),
auto_context_targets=list(execution_raw.get("auto_context_targets", [])),
)
# --- pipeline (preset or custom) ---
steps, phases = _resolve_pipeline(pipeline_raw, coders, reviewers, seniors)
@@ -410,12 +491,13 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
preset_name = pipeline_raw.split(":", 1)[1]
return PipelineConfig(
output_dir=Path(raw.get("output_dir", "output")),
config = PipelineConfig(
output_dir=Path(raw.get("output_dir", ".cross-eval/output")),
max_iterations=int(raw.get("max_iterations", 3)),
min_iterations=int(raw.get("min_iterations", 1)),
verbose=bool(raw.get("verbose", False)),
language=raw.get("language", "en"),
execution=execution,
inputs=inputs,
agents=agents,
coders=coders,
@@ -427,6 +509,9 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
_config_path=config_path,
_config_mtime=config_path.stat().st_mtime,
)
sync_phased_iterations(config)
ensure_fix_preset_agentic(config)
return config
def try_reload_config(config: PipelineConfig) -> PipelineConfig:
@@ -465,7 +550,7 @@ def _resolve_pipeline(
"""Resolve pipeline from preset string or explicit step list.
Returns (steps, phases) tuple. Only one will be non-empty.
- Simple/cross-review/review-only → steps populated, phases empty.
- Simple/cross-review/plan-review/review-only → steps populated, phases empty.
- Phased presets (review-fix) → steps empty, phases populated.
"""
# Preset: "preset:simple" or "preset:review-fix"
@@ -485,11 +570,15 @@ def _resolve_pipeline(
if isinstance(pipeline_raw, list):
steps = []
for step_data in pipeline_raw:
raw_role = step_data.get("role", "coding")
normalized_role = normalize_step_role(raw_role)
steps.append(StepConfig(
name=step_data["name"],
agent=step_data["agent"],
role=step_data.get("role", "generate"),
prompt_template=step_data.get("prompt_template", f"default:{step_data.get('role', 'generate')}"),
role=normalized_role,
prompt_template=normalize_prompt_template(
step_data.get("prompt_template", f"default:{normalized_role}")
),
output_key=step_data["output_key"],
verdict=step_data.get("verdict", False),
verdict_pattern=step_data.get("verdict_pattern", r"VERDICT:\s*PASS"),
@@ -524,10 +613,6 @@ def validate_config(config: PipelineConfig) -> list[str]:
errors,
scope=f"Phase '{phase.name}'",
)
if not any(s.verdict for s in phase.steps):
errors.append(
f"Phase '{phase.name}' must have at least one step with verdict: true"
)
# Validate verdict patterns
for step in phase.steps:
if step.verdict:
@@ -576,6 +661,16 @@ def validate_config(config: PipelineConfig) -> list[str]:
if config.language not in ("en", "ko"):
errors.append(f"Unsupported language '{config.language}'. Use 'en' or 'ko'.")
if config.execution.mode not in {"agent-decides"}:
errors.append(
f"Unsupported execution.mode '{config.execution.mode}'. Use 'agent-decides'."
)
if config.execution.command_policy not in {"broad", "restricted"}:
errors.append(
"Unsupported execution.command_policy "
f"'{config.execution.command_policy}'. Use 'broad' or 'restricted'."
)
return errors
@@ -599,6 +694,37 @@ def _validate_unique_step_fields(
seen_output_keys.add(step.output_key)
def _make_agentic(agent: AgentConfig) -> None:
    """Convert an agent to agentic mode in-place (remove -p, set agentic=True)."""
    agent.agentic = True
    agent.args = [a for a in agent.args if a != "-p"]


def sync_phased_iterations(
    config: PipelineConfig,
    max_iter: int | None = None,
) -> None:
    """Apply effective max iterations to converging phases while preserving setup phases."""
    if not config.phases:
        return
    effective_max_iter = config.max_iterations if max_iter is None else max_iter
    for phase in config.phases:
        # Only phases with a verdict step converge; setup phases keep their own limit.
        if any(step.verdict for step in phase.steps):
            phase.max_iterations = effective_max_iter


def ensure_fix_preset_agentic(config: PipelineConfig) -> None:
    """Fix-style presets should modify code, so coders run agentically by default."""
    if config.preset_name not in FIX_STYLE_PRESETS:
        return
    for coder_name in config.coders:
        agent = config.agents.get(coder_name)
        if agent is not None and not agent.agentic:
            _make_agentic(agent)
def apply_input_overrides(
config: PipelineConfig, overrides: dict[str, str]
) -> None:
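To see what `sync_phased_iterations` does to a phased preset, here is a small self-contained sketch. `Step` and `Phase` are simplified stand-ins for `StepConfig` and `PhaseConfig`, and the phase names are made up for the example; only the rule itself (phases containing a verdict step get the effective limit, others are left alone) comes from the diff.

```python
from dataclasses import dataclass


@dataclass
class Step:  # simplified stand-in for StepConfig
    name: str
    verdict: bool = False


@dataclass
class Phase:  # simplified stand-in for PhaseConfig
    name: str
    steps: list
    max_iterations: int = 1


def sync(phases, config_max_iter, cli_max_iter=None):
    """Same rule as sync_phased_iterations, applied to the stand-in types."""
    effective = config_max_iter if cli_max_iter is None else cli_max_iter
    for phase in phases:
        if any(step.verdict for step in phase.steps):
            phase.max_iterations = effective


# Hypothetical phases: a one-shot setup phase and a converging review loop.
setup = Phase("initial-coding", [Step("coding")])
loop = Phase("review-fix", [Step("review", verdict=True), Step("fix")])

sync([setup, loop], config_max_iter=3, cli_max_iter=5)
print(setup.max_iterations, loop.max_iterations)  # prints: 1 5
```

The setup phase keeps its single iteration because none of its steps produces a verdict, while the CLI value overrides the configured limit for the loop.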

cross_eval/demo.py (new file, 282 lines)

@@ -0,0 +1,282 @@
"""Built-in demo for cross-eval — lets new users see the full lifecycle."""
from __future__ import annotations
import sys
import time
from pathlib import Path
from cross_eval.models import PipelineConfig, PipelineResult
# ---------------------------------------------------------------------------
# Built-in demo plan & checklist
# ---------------------------------------------------------------------------
DEMO_PLAN = """\
# Demo: Fibonacci Function
## Objective
Implement a `fibonacci(n)` function in Python.
## Requirements
1. `fibonacci(0)` returns `0`, `fibonacci(1)` returns `1`.
2. For `n >= 2`, return the sum of the two preceding values.
3. Raise `ValueError` for negative `n`.
4. Use an iterative approach (not recursive).
## Constraints
- Single file: `fib.py`
- No external dependencies.
"""
DEMO_CHECKLIST = """\
# Demo Checklist
- [ ] fibonacci(0) → 0
- [ ] fibonacci(1) → 1
- [ ] fibonacci(10) → 55
- [ ] fibonacci(-1) raises ValueError
- [ ] Iterative implementation (no recursion)
- [ ] No unnecessary abstractions
"""
# ---------------------------------------------------------------------------
# Mock outputs (realistic-looking)
# ---------------------------------------------------------------------------
_MOCK_CODING_V1 = """\
I'll implement the fibonacci function in `fib.py`.
```python
# fib.py
def fibonacci(n: int) -> int:
    \"\"\"Return the nth Fibonacci number using iteration.\"\"\"
    if n < 0:
        return -1 # invalid input
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
```
Created `fib.py` with the iterative fibonacci function.
"""
_MOCK_REVIEW_V1 = """\
### Previous Feedback Assessment
N/A — first iteration.
### Issues Found
- ISS-001 [Major][Omission] Negative input should raise `ValueError`, \
but implementation returns `-1` instead. (Requirement 3: "Raise ValueError for negative n")
### Out of Scope Issues
None
### Summary
- Critical: 0, Major: 1, Minor: 0
- Over-engineering count: 0
- Omission count: 1
- CONFIRMED: 0, DISMISSED: 0
- Overall quality: Good structure, one requirement gap.
### Verdict
VERDICT: FAIL
"""
_MOCK_CODING_V2 = """\
Fixing the negative input handling per review feedback (ISS-001).
```python
# fib.py
def fibonacci(n: int) -> int:
    \"\"\"Return the nth Fibonacci number using iteration.\"\"\"
    if n < 0:
        raise ValueError(f"n must be non-negative, got {n}")
    if n <= 1:
        return n
    a, b = 0, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
```
Updated `fib.py`: negative input now raises `ValueError`.
"""
_MOCK_REVIEW_V2 = """\
### Previous Feedback Assessment
- DISMISSED (false positive): None
- CONFIRMED: None — ISS-001 has been fixed.
### Issues Found
None — all checklist items satisfied.
### Out of Scope Issues
None
### Summary
- Critical: 0, Major: 0, Minor: 0
- Over-engineering count: 0
- Omission count: 0
- CONFIRMED: 0, DISMISSED: 0
- Overall quality: All requirements met, clean implementation.
### Verdict
VERDICT: PASS
"""
_MOCK_STEPS = [
# (iteration, step_name, agent, duration, output_chars, verdict, output)
(1, "coding", "claude-coder", 2.1, 347, None, _MOCK_CODING_V1),
(1, "review", "claude-reviewer", 1.8, 423, "FAIL", _MOCK_REVIEW_V1),
(2, "coding", "claude-coder", 2.3, 382, None, _MOCK_CODING_V2),
(2, "review", "claude-reviewer", 1.5, 312, "PASS", _MOCK_REVIEW_V2),
]
_MOCK_ESCALATE_REVIEW = """\
### Issues Found
- ISS-001 [Critical][Omission] Requirements are ambiguous: "iterative approach" is unclear — \
does this exclude memoization? The plan needs clarification from stakeholders.
### Verdict
VERDICT: ESCALATE
"""
_MOCK_ESCALATE_STEPS = [
(1, "coding", "claude-coder", 2.1, 347, None, _MOCK_CODING_V1),
(1, "review", "claude-reviewer", 1.8, 520, "ESCALATE", _MOCK_ESCALATE_REVIEW),
]
# ---------------------------------------------------------------------------
# Mock demo runner
# ---------------------------------------------------------------------------
DIM = "\033[2m"
BOLD = "\033[1m"
GREEN = "\033[32m"
RED = "\033[31m"
YELLOW = "\033[33m"
CYAN = "\033[36m"
RESET = "\033[0m"
def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
"""Run a simulated demo showing the full pipeline lifecycle."""
steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS
print(f"\n{BOLD}=== cross-eval demo (mock) ==={RESET}")
print(f"{DIM}Preset: {preset} | Coder: claude-coder | Reviewer: claude-reviewer{RESET}")
print(f"{DIM}Plan: fibonacci function | Max iterations: 3{RESET}\n")
current_iter = 0
for iteration, step_name, agent, duration, chars, verdict, output in steps:
if iteration != current_iter:
current_iter = iteration
print(f"{BOLD}{'' * 50}")
print(f" Iteration {iteration}/3")
print(f"{'' * 50}{RESET}")
# Simulate running
sys.stdout.write(f" ⠋ [{step_name}] {agent} running...")
sys.stdout.flush()
time.sleep(0.5)
sys.stdout.write(f"\r {GREEN}{RESET} [{step_name}] {agent}{chars} chars ({duration}s)\n")
if verdict:
if verdict == "PASS":
color = GREEN
elif verdict == "ESCALATE":
color = YELLOW
else:
color = RED
print(f" {color}{BOLD}Verdict: {verdict}{RESET}")
if verdict == "FAIL":
# Show key feedback
print(f" {DIM}Feedback: ISS-001 [Major] Negative input returns -1 instead of ValueError{RESET}")
elif verdict == "ESCALATE":
print(f" {YELLOW}Reason: Requirements need clarification from stakeholders{RESET}")
print()
# Final result
if show_escalate:
final = "ESCALATE"
color = YELLOW
else:
final = "PASS"
color = GREEN
print(f"{BOLD}Result: {color}{final}{RESET}")
print(f"Iterations: {current_iter}")
if show_escalate:
print(f"\n{RED}{BOLD}{'=' * 50}")
print(f" Escalation Report")
print(f"{'=' * 50}{RESET}")
print(f"{YELLOW}Human review required.{RESET}")
print(f" {RED}{RESET} Requirements are ambiguous — needs stakeholder clarification")
print(f"{RED}{BOLD}{'=' * 50}{RESET}")
print(f"\n{DIM}This was a mock demo. To run with real agents:{RESET}")
print(f"{DIM} cross-eval demo --live{RESET}")
print(f"{DIM} cross-eval run --plan plan.md{RESET}\n")
def run_live_demo(
preset: str = "simple",
timeout: int | None = None,
) -> PipelineResult:
"""Run a live demo with real agents using the built-in plan."""
import tempfile
from cross_eval.config import (
BUILTIN_AGENTS,
_resolve_agents,
apply_reasoning_effort_settings,
)
from cross_eval.pipeline import run_pipeline
from cross_eval.prompts import PHASED_PRESETS, PIPELINE_PRESETS
coders = ["claude-coder"]
reviewers = ["claude-reviewer"]
seniors: list[str] = []
agents = _resolve_agents(dict(BUILTIN_AGENTS), coders, reviewers, seniors)
if preset in PIPELINE_PRESETS:
pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
phases = []
elif preset in PHASED_PRESETS:
pipeline = []
phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
else:
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
phases = []
with tempfile.TemporaryDirectory() as tmpdir:
plan_path = Path(tmpdir) / "plan.md"
checklist_path = Path(tmpdir) / "checklist.md"
plan_path.write_text(DEMO_PLAN, encoding="utf-8")
checklist_path.write_text(DEMO_CHECKLIST, encoding="utf-8")
config = PipelineConfig(
output_dir=Path(".cross-eval/output"),
max_iterations=3,
language="en",
inputs={"plan": plan_path, "checklist": checklist_path},
agents=agents,
coders=coders,
reviewers=reviewers,
seniors=seniors,
pipeline=pipeline,
phases=phases,
preset_name=f"demo-{preset}",
)
apply_reasoning_effort_settings(config)
return run_pipeline(config, timeout=timeout)

cross_eval/doctor.py (new file, 200 lines)

@@ -0,0 +1,200 @@
"""Environment health checks for cross-eval."""
from __future__ import annotations
import shutil
import subprocess
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional
@dataclass
class DoctorCheck:
    """Result of a single health check."""

    name: str
    passed: bool
    critical: bool
    message: str
    detail: Optional[str] = None


def check_cli_installed(command: str) -> tuple[bool, str]:
    """Check if a CLI tool is on PATH and get its version."""
    path = shutil.which(command)
    if not path:
        return False, f"'{command}' not found on PATH"
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True,
            text=True,
            timeout=10,
        )
        # Some CLIs print the version to stderr; take the first non-empty stream.
        version = (result.stdout.strip() or result.stderr.strip()).split("\n")[0]
        return True, version or "(version unknown)"
    except (subprocess.TimeoutExpired, OSError):
        return True, "(installed but version check failed)"


def check_cli_authenticated(command: str) -> tuple[bool, str]:
    """Check if a CLI tool is authenticated by running a minimal probe."""
    path = shutil.which(command)
    if not path:
        return False, "not installed"

    if command == "claude":
        try:
            result = subprocess.run(
                [command, "-p", "--model", "haiku", "--max-turns", "1"],
                input="respond with just 'ok'",
                capture_output=True,
                text=True,
                timeout=30,
            )
            combined = result.stdout + result.stderr
            if any(kw in combined.lower() for kw in (
                "not logged in", "login", "unauthorized", "unauthenticated",
                "api key", "invalid key",
            )):
                return False, "not authenticated — run: claude login"
            if result.returncode == 0:
                return True, "authenticated"
            return False, f"exit code {result.returncode}: {combined[:100]}"
        except subprocess.TimeoutExpired:
            return False, "timed out (30s) — possible network issue"
        except OSError as e:
            return False, str(e)
    elif command == "codex":
        try:
            result = subprocess.run(
                [command, "--version"],
                capture_output=True,
                text=True,
                timeout=10,
            )
            combined = result.stdout + result.stderr
            if any(kw in combined.lower() for kw in (
                "not logged in", "login", "unauthorized", "api key",
            )):
                return False, "not authenticated — run: codex login"
            return True, "installed (auth check: codex login if needed)"
        except (subprocess.TimeoutExpired, OSError) as e:
            return False, str(e)

    return False, f"unknown command: {command}"


def check_config(directory: Path) -> tuple[bool, Optional[Path], list[str]]:
    """Check if config.yaml exists and is valid."""
    config_path = directory / ".cross-eval" / "config.yaml"
    if not config_path.exists():
        return False, None, []
    try:
        from cross_eval.config import load_config

        load_config(config_path)
        return True, config_path, []
    except (ValueError, FileNotFoundError) as e:
        return False, config_path, [str(e)]
def run_doctor(directory: Path) -> list[DoctorCheck]:
"""Run all health checks and return results."""
checks: list[DoctorCheck] = []
# 1. claude CLI
installed, version = check_cli_installed("claude")
checks.append(DoctorCheck(
name="claude CLI",
passed=installed,
critical=True,
message=version if installed else "not found",
detail="Install: https://docs.anthropic.com/en/docs/claude-code" if not installed else None,
))
if installed:
auth_ok, auth_msg = check_cli_authenticated("claude")
checks.append(DoctorCheck(
name="claude auth",
passed=auth_ok,
critical=True,
message=auth_msg,
))
# 2. codex CLI
installed, version = check_cli_installed("codex")
checks.append(DoctorCheck(
name="codex CLI",
passed=installed,
critical=False,
message=version if installed else "not found (optional)",
detail="Install: https://github.com/openai/codex" if not installed else None,
))
if installed:
auth_ok, auth_msg = check_cli_authenticated("codex")
checks.append(DoctorCheck(
name="codex auth",
passed=auth_ok,
critical=False,
message=auth_msg,
))
# 3. Config
config_ok, config_path, config_errors = check_config(directory)
if config_path is None:
checks.append(DoctorCheck(
name="config",
passed=True, # not having config is fine
critical=False,
message="no .cross-eval/config.yaml (will use defaults)",
detail="Run: cross-eval init",
))
elif config_ok:
checks.append(DoctorCheck(
name="config",
passed=True,
critical=False,
message=f"valid ({config_path.name})",
))
else:
checks.append(DoctorCheck(
name="config",
passed=False,
critical=True,
message="invalid config",
detail="\n".join(config_errors),
))
return checks
def format_doctor_results(checks: list[DoctorCheck]) -> str:
"""Format doctor check results for terminal output."""
lines: list[str] = []
lines.append("\n cross-eval doctor\n")
for check in checks:
icon = "" if check.passed else ""
lines.append(f"{icon} {check.name}: {check.message}")
if check.detail and not check.passed:
for detail_line in check.detail.split("\n"):
lines.append(f" {detail_line}")
# Summary
failed_critical = [c for c in checks if not c.passed and c.critical]
failed_warn = [c for c in checks if not c.passed and not c.critical]
lines.append("")
if not failed_critical and not failed_warn:
lines.append(" All checks passed!")
elif failed_critical:
lines.append(f" {len(failed_critical)} critical issue(s) found.")
else:
lines.append(f" {len(failed_warn)} warning(s), no critical issues.")
lines.append("")
return "\n".join(lines)
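The summary logic in `format_doctor_results` separates failed critical checks from failed non-critical ones. A caller could use the same split to pick an exit status, as in the sketch below. This is an assumption about how the `doctor` command might behave, not something this diff shows: `doctor_exit_code` is a hypothetical helper, and the sample check messages are illustrative values borrowed from strings in the file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DoctorCheck:  # same fields as the dataclass in doctor.py
    name: str
    passed: bool
    critical: bool
    message: str
    detail: Optional[str] = None


def doctor_exit_code(checks: list) -> int:
    """Hypothetical policy: only failed critical checks make the command fail."""
    return 1 if any(not c.passed and c.critical for c in checks) else 0


checks = [
    DoctorCheck("claude CLI", True, True, "(version unknown)"),
    DoctorCheck("codex CLI", False, False, "not found (optional)"),
]
print(doctor_exit_code(checks))  # prints: 0
```

Under this policy a missing optional CLI such as codex produces a warning in the report but does not fail the command.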


@@ -16,6 +16,7 @@ class AgentConfig:
system_prompt: Optional[str] = None
reasoning_effort: Optional[str] = None
stdin_mode: bool = False
agentic: bool = False # run in worktree, capture git diff instead of stdout
@dataclass
@@ -24,7 +25,7 @@ class StepConfig:
name: str
agent: str # reference to agents key
role: str # "generate" or "review"
role: str # "coding" or "review"
prompt_template: str # "default:<role>" or file path
output_key: str
verdict: bool = False
@@ -43,15 +44,29 @@ class PhaseConfig:
consecutive_pass: int = 1 # stop after N consecutive PASSes
@dataclass
class ExecutionConfig:
"""Runtime execution policy for agent subprocesses."""
mode: str = "agent-decides"
command_policy: str = "broad"
inherit_env: bool = True
auto_env_files: list[str] = field(default_factory=lambda: [".env", ".env.local"])
env_files: list[str] = field(default_factory=list)
expose_env_names: bool = True
auto_context_targets: list[str] = field(default_factory=list)
@dataclass
class PipelineConfig:
"""Full cross-eval configuration."""
output_dir: Path = field(default_factory=lambda: Path("output"))
output_dir: Path = field(default_factory=lambda: Path(".cross-eval/output"))
max_iterations: int = 3
min_iterations: int = 1
verbose: bool = False
language: str = "en" # "en" or "ko"
execution: ExecutionConfig = field(default_factory=ExecutionConfig)
inputs: dict[str, Path | str] = field(default_factory=dict)
agents: dict[str, AgentConfig] = field(default_factory=dict)
coders: list[str] = field(default_factory=list)
@@ -105,6 +120,7 @@ class IterationResult:
phase_name: Optional[str] = None
repeated_aggregate_warning: Optional[str] = None
review_metrics: Optional[ReviewMetrics] = None
escalated_issues: Optional[str] = None
@dataclass
@@ -116,3 +132,5 @@ class PipelineResult:
total_duration: float = 0.0
run_dir: Optional[Path] = None
repeated_aggregate_warnings: list[str] = field(default_factory=list)
escalated_issues: list[str] = field(default_factory=list)
agentic_branch: Optional[str] = None
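
Review note: the new `ExecutionConfig` declares its list fields with `field(default_factory=...)`. The sketch below uses a trimmed copy of that dataclass (only two of its fields) to show what this buys: every instance gets its own list, so mutating one config does not leak into another. The file name `secrets.env` is a made-up value for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionConfig:
    # Trimmed copy of the dataclass in the diff; only list-valued fields kept.
    auto_env_files: list[str] = field(default_factory=lambda: [".env", ".env.local"])
    env_files: list[str] = field(default_factory=list)


a = ExecutionConfig()
b = ExecutionConfig()
a.env_files.append("secrets.env")  # hypothetical file name

# Each instance owns its own lists, so b is unaffected by the mutation of a.
print(a.env_files, b.env_files)
print(a.auto_env_files is b.auto_env_files)
```

The same reasoning applies to `PipelineConfig.execution`, which is built with `default_factory=ExecutionConfig` rather than a shared default instance.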


@@ -10,9 +10,11 @@ from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from pathlib import Path
from cross_eval.agent import invoke_agent
from cross_eval.agent import AgentInvocationError, invoke_agent, invoke_agent_agentic
from cross_eval.worktree import WorktreeError
from cross_eval.config import try_reload_config
from cross_eval.models import (
AgentConfig,
AgentResult,
IterationResult,
PipelineConfig,
@@ -21,6 +23,11 @@ from cross_eval.models import (
)
from cross_eval.prompts import render_template, resolve_template, set_language
from cross_eval.report import build_report
from cross_eval.runtime_env import (
build_execution_policy,
build_runtime_environment,
summarize_environment,
)
logger = logging.getLogger(__name__)
@@ -48,6 +55,104 @@ def _make_run_dir(config: PipelineConfig) -> Path:
return run_dir
def _commit_iteration(
worktree_path: Path,
label: str,
iteration: int,
verdict: str | None,
) -> None:
"""Intermediate commit after each agentic iteration.
This resets the diff baseline so the next iteration only captures new changes.
"""
from cross_eval.worktree import commit_worktree
committed = commit_worktree(
worktree_path,
f"cross-eval: {label} v{iteration} ({verdict or 'no-verdict'})",
)
if committed:
logger.debug(" Intermediate commit: v%d (%s)", iteration, verdict)
def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
"""Check if any step uses an agentic agent."""
return any(
config.agents.get(s.agent, AgentConfig(name="", command="")).agentic
for s in steps
)
def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str]:
"""Create a shared worktree for the entire pipeline run.
1. Generate branch name (cross-eval/<preset>_<timestamp>)
2. Create branch from HEAD
3. Create worktree on that branch
Returns (worktree_path, branch_name).
"""
from cross_eval.worktree import create_worktree, make_branch_name
branch_name = make_branch_name(preset_name)
worktree_dir = run_dir / "work"
worktree_path = create_worktree(
base_cwd=cwd, work_dir=worktree_dir, branch_name=branch_name,
)
return worktree_path, branch_name
def _finalize_worktree(
cwd: Path,
worktree_path: Path,
branch_name: str,
preset_name: str,
final_verdict: str,
) -> str | None:
"""Commit changes on the branch, then remove the worktree.
The branch survives worktree removal and stays in the original repo.
Returns the branch name if changes were committed, None otherwise.
"""
from cross_eval.worktree import commit_worktree, remove_worktree
committed = False
try:
committed = commit_worktree(
worktree_path,
f"cross-eval: {preset_name} ({final_verdict})",
)
if committed:
logger.info(" Agentic changes committed on branch: %s", branch_name)
else:
logger.warning(" No agentic changes to commit (empty diff)")
except Exception:
logger.warning(" Failed to commit agentic changes", exc_info=True)
try:
remove_worktree(base_cwd=cwd, work_dir=worktree_path)
except Exception:
logger.warning("Failed to clean up worktree: %s", worktree_path)
# Check if branch has any commits beyond the base — if not, delete it
if not committed:
try:
# Check if branch has diverged from its base
result = subprocess.run(
["git", "log", "--oneline", f"HEAD..{branch_name}"],
cwd=cwd, capture_output=True, text=True,
)
if not result.stdout.strip():
# No commits on branch beyond base — clean up
subprocess.run(
["git", "branch", "-D", branch_name],
cwd=cwd, capture_output=True,
)
logger.info(" Deleted empty branch: %s", branch_name)
except Exception:
pass # best-effort cleanup
return branch_name if committed else None
def _run_simple_pipeline(
config: PipelineConfig,
run_dir: Path,
@@ -61,6 +166,15 @@ def _run_simple_pipeline(
set_language(config.language)
input_contents = _load_inputs(config)
runtime_env = _build_runtime_inputs(config, input_contents, cwd or Path(os.getcwd()))
# Setup shared worktree for agentic mode
worktree_path: Path | None = None
agentic_branch_name: str | None = None
if not dry_run and _has_agentic_steps(config, config.pipeline):
worktree_path, agentic_branch_name = _setup_worktree(
cwd, run_dir, config.preset_name,
)
feedback = "(no feedback — first iteration)"
iterations: list[IterationResult] = []
@@ -68,11 +182,15 @@ def _run_simple_pipeline(
final_verdict = "MAX_ITERATIONS_REACHED"
aggregate_history: dict[str, int] = {}
aggregate_warnings: list[str] = []
escalated_issues: list[str] = []
all_feedbacks: list[str] = []
try:
for i in range(1, config.max_iterations + 1):
config = try_reload_config(config)
set_language(config.language)
_refresh_inputs(config, input_contents)
runtime_env = _build_runtime_inputs(config, input_contents, cwd or Path(os.getcwd()))
logger.info("=" * 50)
logger.info(" Iteration %d/%d", i, config.max_iterations)
@@ -82,8 +200,14 @@ def _run_simple_pipeline(
config.pipeline, config, input_contents, feedback,
i, config.max_iterations, cwd, timeout, dry_run,
run_dir=run_dir, output_iter=i,
worktree_path=worktree_path,
runtime_env=runtime_env,
)
# Intermediate commit so next iteration's diff only shows new changes
if worktree_path is not None:
_commit_iteration(worktree_path, config.preset_name, i, verdict)
iter_result = IterationResult(
iteration=i,
step_results=step_results,
@@ -100,8 +224,33 @@ def _run_simple_pipeline(
iter_result.feedback = _collect_feedback(config.pipeline, step_outputs)
feedback = iter_result.feedback or feedback
all_feedbacks.append(feedback)
# Extract tracker from verdict/review steps for next iteration
for step in config.pipeline:
if step.verdict or step.role == "review":
tracker = _extract_senior_tracker(
step_outputs.get(step.output_key, ""),
)
if tracker:
input_contents["previous_senior_tracker"] = tracker
iterations.append(iter_result)
# ESCALATE check (highest priority)
if verdict == "ESCALATE":
final_verdict = "ESCALATE"
for step in config.pipeline:
if step.verdict:
esc = _extract_escalated_issues(
step_outputs.get(step.output_key, ""),
)
if esc:
escalated_issues.append(esc)
iter_result.escalated_issues = esc
logger.info(" ESCALATE at iteration %d — stopping loop.", i)
break
if verdict == "PASS":
final_verdict = "PASS"
if i >= config.min_iterations:
@@ -113,10 +262,38 @@ def _run_simple_pipeline(
i, config.min_iterations,
)
# Auto-escalate: no senior/aggregator + repeated FAIL
has_aggregator = config.seniors or any(
s.prompt_template == "default:aggregate-review" for s in config.pipeline
)
if (
verdict == "FAIL"
and not has_aggregator
and i >= 2
and _detect_auto_escalate(all_feedbacks[:-1], feedback)
):
final_verdict = "ESCALATE"
auto_msg = (
f"Auto-escalated: same issues detected across {i} iterations "
f"without resolution (no senior reviewer configured)."
)
escalated_issues.append(auto_msg)
iter_result.escalated_issues = auto_msg
logger.info(" AUTO-ESCALATE at iteration %d", i)
break
if dry_run:
logger.info(" (dry-run: stopping after iteration 1)")
break
finally:
agentic_branch: str | None = None
if worktree_path is not None and agentic_branch_name is not None:
agentic_branch = _finalize_worktree(
cwd, worktree_path, agentic_branch_name,
config.preset_name, final_verdict,
)
total_duration = time.monotonic() - start_time
pipeline_result = PipelineResult(
@@ -125,6 +302,8 @@ def _run_simple_pipeline(
total_duration=round(total_duration, 1),
run_dir=run_dir,
repeated_aggregate_warnings=aggregate_warnings,
escalated_issues=escalated_issues,
agentic_branch=agentic_branch,
)
if not dry_run:
@@ -146,6 +325,16 @@ def _run_phased_pipeline(
set_language(config.language)
input_contents = _load_inputs(config)
runtime_env = _build_runtime_inputs(config, input_contents, cwd)
# Setup shared worktree for agentic mode
all_phase_steps = [s for p in config.phases for s in p.steps]
worktree_path: Path | None = None
agentic_branch_name: str | None = None
if not dry_run and _has_agentic_steps(config, all_phase_steps):
worktree_path, agentic_branch_name = _setup_worktree(
cwd, run_dir, config.preset_name,
)
iterations: list[IterationResult] = []
feedback = "(no feedback — first iteration)"
@@ -154,8 +343,15 @@ def _run_phased_pipeline(
global_iter = 0
aggregate_history_by_phase: dict[str, dict[str, int]] = {}
aggregate_warnings: list[str] = []
escalated_issues: list[str] = []
all_feedbacks: list[str] = []
escalated = False
try:
for phase_idx, phase in enumerate(config.phases):
if escalated:
break
logger.info("=" * 60)
logger.info(
" Phase: %s (max_iter=%d, consecutive_pass=%d)",
@@ -172,6 +368,7 @@ def _run_phased_pipeline(
config = try_reload_config(config)
set_language(config.language)
_refresh_inputs(config, input_contents)
runtime_env = _build_runtime_inputs(config, input_contents, cwd)
logger.info("-" * 50)
logger.info(
@@ -184,6 +381,15 @@ def _run_phased_pipeline(
phase.steps, config, input_contents, feedback,
pi, phase.max_iterations, cwd, timeout, dry_run,
run_dir=run_dir, output_iter=global_iter, phase_name=phase.name,
worktree_path=worktree_path,
runtime_env=runtime_env,
)
# Intermediate commit so next iteration's diff only shows new changes
if worktree_path is not None:
_commit_iteration(
worktree_path, f"{config.preset_name}/{phase.name}",
global_iter, verdict,
)
iter_result = IterationResult(
@@ -205,8 +411,45 @@ def _run_phased_pipeline(
iter_result.feedback = _collect_feedback(phase.steps, step_outputs)
feedback = iter_result.feedback or feedback
all_feedbacks.append(feedback)
# Extract tracker from verdict/review steps
for step in phase.steps:
if step.verdict or step.role == "review":
tracker = _extract_senior_tracker(
step_outputs.get(step.output_key, ""),
)
if tracker:
input_contents["previous_senior_tracker"] = tracker
iterations.append(iter_result)
# ESCALATE check
if verdict == "ESCALATE":
final_verdict = "ESCALATE"
for step in phase.steps:
if step.verdict:
esc = _extract_escalated_issues(
step_outputs.get(step.output_key, ""),
)
if esc:
escalated_issues.append(esc)
iter_result.escalated_issues = esc
logger.info(
" [%s] ESCALATE at iteration %d — stopping.",
phase.name, pi,
)
escalated = True
break
if verdict is None:
logger.info(
" [%s] completed (no verdict step; single-pass phase)",
phase.name,
)
phase_converged = True
break
if verdict == "PASS":
consecutive_passes += 1
logger.info(
@@ -223,9 +466,33 @@ def _run_phased_pipeline(
else:
consecutive_passes = 0
# Auto-escalate in phased pipeline
has_aggregator = config.seniors or any(
s.prompt_template == "default:aggregate-review" for s in phase.steps
)
if (
verdict == "FAIL"
and not has_aggregator
and pi >= 2
and _detect_auto_escalate(all_feedbacks[:-1], feedback)
):
final_verdict = "ESCALATE"
auto_msg = (
f"Auto-escalated: same issues detected across {pi} iterations "
f"in phase '{phase.name}' without resolution."
)
escalated_issues.append(auto_msg)
iter_result.escalated_issues = auto_msg
logger.info(" [%s] AUTO-ESCALATE at iteration %d", phase.name, pi)
escalated = True
break
if dry_run:
break
if escalated:
break
if phase_converged:
logger.info(" Phase '%s' completed: CONVERGED", phase.name)
else:
@@ -237,6 +504,14 @@ def _run_phased_pipeline(
if phase_idx == len(config.phases) - 1:
final_verdict = "PASS" if phase_converged else "MAX_ITERATIONS_REACHED"
finally:
agentic_branch: str | None = None
if worktree_path is not None and agentic_branch_name is not None:
agentic_branch = _finalize_worktree(
cwd, worktree_path, agentic_branch_name,
config.preset_name, final_verdict,
)
total_duration = time.monotonic() - start_time
pipeline_result = PipelineResult(
@@ -245,6 +520,8 @@ def _run_phased_pipeline(
total_duration=round(total_duration, 1),
run_dir=run_dir,
repeated_aggregate_warnings=aggregate_warnings,
escalated_issues=escalated_issues,
agentic_branch=agentic_branch,
)
if not dry_run:
@@ -346,6 +623,8 @@ def _run_steps(
run_dir: Path,
output_iter: int,
phase_name: str | None = None,
worktree_path: Path | None = None,
runtime_env: dict[str, str] | None = None,
) -> tuple[dict[str, str], dict[str, AgentResult], str | None]:
"""Execute all steps in one iteration, parallelizing where possible."""
step_outputs: dict[str, str] = {}
@@ -356,37 +635,60 @@ def _run_steps(
for batch in batches:
if len(batch) == 1:
# Single step — run directly
step = batch[0]
_execute_step(
step, config, input_contents, feedback,
iteration, max_iterations, cwd, timeout, dry_run,
step_outputs, step_results,
run_dir=run_dir, output_iter=output_iter, phase_name=phase_name,
run_dir=run_dir, output_iter=output_iter,
phase_name=phase_name, worktree_path=worktree_path,
runtime_env=runtime_env,
)
else:
# Parallel batch — run with ThreadPoolExecutor
_execute_parallel_batch(
batch, config, input_contents, feedback,
iteration, max_iterations, cwd, timeout, dry_run,
step_outputs, step_results,
run_dir=run_dir, output_iter=output_iter, phase_name=phase_name,
run_dir=run_dir, output_iter=output_iter,
phase_name=phase_name, worktree_path=worktree_path,
runtime_env=runtime_env,
)
# Extract verdict from all verdict steps (ALL must PASS)
# Extract verdict from all verdict steps (ALL must PASS; ESCALATE wins over all)
for step in steps:
if step.verdict:
output = step_outputs.get(step.output_key, "")
step_verdict = _extract_verdict(output, step.verdict_pattern)
logger.info(" [%s] verdict: %s", step.name, step_verdict)
if verdict is None:
if step_verdict == "ESCALATE":
verdict = "ESCALATE"
elif verdict is None:
verdict = step_verdict
elif step_verdict == "FAIL":
elif verdict != "ESCALATE" and step_verdict == "FAIL":
verdict = "FAIL"
return step_outputs, step_results, verdict
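
Review note: the verdict aggregation at the end of `_run_steps` gives ESCALATE priority over FAIL, and FAIL priority over PASS. A minimal standalone sketch of that loop, assuming each verdict step has already been reduced to one of the three strings (`combine_verdicts` is a hypothetical name, not part of the codebase):

```python
from __future__ import annotations


def combine_verdicts(step_verdicts: list[str]) -> str | None:
    """Hypothetical standalone version of the aggregation loop in _run_steps."""
    verdict: str | None = None
    for step_verdict in step_verdicts:
        if step_verdict == "ESCALATE":
            # ESCALATE always wins, regardless of what came before.
            verdict = "ESCALATE"
        elif verdict is None:
            # First verdict seen becomes the provisional result.
            verdict = step_verdict
        elif verdict != "ESCALATE" and step_verdict == "FAIL":
            # A later FAIL downgrades a PASS but never overrides ESCALATE.
            verdict = "FAIL"
    return verdict


print(combine_verdicts(["PASS", "FAIL", "PASS"]))
print(combine_verdicts(["FAIL", "ESCALATE", "PASS"]))
print(combine_verdicts([]))
```

An empty list yields `None`, which corresponds to the "no verdict step; single-pass phase" branch in the phased pipeline.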
def _invoke_agentic(
agent_config: AgentConfig,
prompt: str,
step_name: str,
*,
worktree_path: Path,
env: dict[str, str] | None = None,
timeout: int | None = None,
quiet: bool = False,
) -> AgentResult:
"""Run an agent in agentic mode using an existing worktree."""
return invoke_agent_agentic(
agent_config, prompt, step_name,
worktree_path=worktree_path,
env=env,
timeout=timeout, quiet=quiet,
)
def _execute_step(
step: StepConfig,
config: PipelineConfig,
@@ -404,6 +706,8 @@ def _execute_step(
output_iter: int,
phase_name: str | None = None,
quiet: bool = False,
worktree_path: Path | None = None,
runtime_env: dict[str, str] | None = None,
) -> None:
"""Execute a single step, updating step_outputs and step_results in place."""
if not quiet:
@@ -423,6 +727,7 @@ def _execute_step(
# 4. Render prompt
prompt = render_template(template, context)
prompt = _augment_prompt_with_runtime_context(prompt, context)
# 5. Dry run: print and skip
if dry_run:
@@ -436,9 +741,20 @@ def _execute_step(
# 6. Invoke agent
agent_config = config.agents[step.agent]
try:
if agent_config.agentic and worktree_path:
result = _invoke_agentic(
agent_config, prompt, step.name,
worktree_path=worktree_path,
env=runtime_env,
timeout=timeout, quiet=quiet,
)
else:
# When worktree exists, run non-agentic agents (reviewers) in
# the worktree too so they can inspect the modified files.
effective_cwd = worktree_path if worktree_path else cwd
result = invoke_agent(
agent_config, prompt, step.name,
cwd=cwd, timeout=timeout, quiet=quiet,
cwd=effective_cwd, env=runtime_env, timeout=timeout, quiet=quiet,
)
except subprocess.TimeoutExpired as e:
stdout = (e.stdout or b"") if isinstance(e.stdout, bytes) else (e.stdout or "")
@@ -466,10 +782,11 @@ def _execute_step(
f"Try --timeout 0 (unlimited)"
)
except RuntimeError as e:
phase_info = f"- **Phase**: {phase_name}\n" if phase_name else ""
error_msg = (
f"# Agent Error\n\n{phase_info}"
f"- **Step**: {step.name}\n- **Agent**: {step.agent}\n\n```\n{e}\n```\n"
error_msg = _format_runtime_error_markdown(
e,
step_name=step.name,
agent_name=step.agent,
phase_name=phase_name,
)
_save_step_output(run_dir, output_iter, f"{step.name}_error", error_msg)
logger.error(" [%s] FAILED — saved to output", step.name)
@@ -505,6 +822,8 @@ def _execute_parallel_batch(
run_dir: Path,
output_iter: int,
phase_name: str | None = None,
worktree_path: Path | None = None,
runtime_env: dict[str, str] | None = None,
) -> None:
"""Execute multiple steps in parallel using threads."""
agent_names = ", ".join(s.agent for s in batch)
@@ -520,6 +839,26 @@ def _execute_parallel_batch(
)
return
# Agentic steps cannot run in parallel (they share a worktree)
agentic_in_batch = [
s for s in batch
if config.agents.get(s.agent, AgentConfig(name="", command="")).agentic
]
if len(agentic_in_batch) > 1:
logger.warning(
" [parallel] %d agentic steps cannot run concurrently — running sequentially",
len(agentic_in_batch),
)
for step in batch:
_execute_step(
step, config, input_contents, feedback,
iteration, max_iterations, cwd, timeout, dry_run,
step_outputs, step_results,
run_dir=run_dir, output_iter=output_iter,
phase_name=phase_name, worktree_path=worktree_path,
runtime_env=runtime_env,
)
return
# Snapshot context before parallel execution (all steps see same state)
context_snapshot = dict(input_contents)
context_snapshot.update(step_outputs)
@@ -527,7 +866,7 @@ def _execute_parallel_batch(
# Collect results from parallel threads
local_outputs: dict[str, str] = {}
local_results: dict[str, AgentResult] = {}
errors: list[Exception] = []
errors: list[tuple[StepConfig, Exception]] = []
# Show a single spinner for the batch
from cross_eval.agent import _Spinner
@@ -546,11 +885,21 @@ def _execute_parallel_batch(
if step.context_override:
context = _apply_context_override(context, step.context_override)
prompt = render_template(template, context)
prompt = _augment_prompt_with_runtime_context(prompt, context)
agent_config = config.agents[step.agent]
if agent_config.agentic and worktree_path:
result = _invoke_agentic(
agent_config, prompt, step.name,
worktree_path=worktree_path,
env=runtime_env,
timeout=timeout, quiet=True,
)
else:
effective_cwd = worktree_path if worktree_path else cwd
result = invoke_agent(
agent_config, prompt, step.name,
cwd=cwd, timeout=timeout, quiet=True,
cwd=effective_cwd, env=runtime_env, timeout=timeout, quiet=True,
)
return step.output_key, result.output, result
@@ -563,19 +912,15 @@ def _execute_parallel_batch(
local_results[output_key] = result
local_outputs[output_key] = output
except Exception as e:
errors.append(e)
errors.append((step, e))
batch_elapsed = round(time.monotonic() - batch_start, 1)
if errors:
spinner.stop(f"[parallel] FAILED ({batch_elapsed}s)")
raise errors[0]
spinner.stop(f"[parallel] {len(batch)} agents done ({batch_elapsed}s)")
# Merge results
# Persist successful outputs even if a sibling step failed.
for step in batch:
key = step.output_key
if key not in local_outputs:
continue
step_outputs[key] = local_outputs[key]
step_results[key] = local_results[key]
r = local_results[key]
@@ -585,6 +930,48 @@ def _execute_parallel_batch(
)
_save_step_output(run_dir, output_iter, step.name, r.output)
if errors:
spinner.stop(f"[parallel] FAILED ({batch_elapsed}s)")
for failed_step, exc in errors:
if isinstance(exc, subprocess.TimeoutExpired):
stdout = (exc.stdout or b"") if isinstance(exc.stdout, bytes) else (exc.stdout or "")
stderr = (exc.stderr or b"") if isinstance(exc.stderr, bytes) else (exc.stderr or "")
if isinstance(stdout, bytes):
stdout = stdout.decode("utf-8", errors="replace")
if isinstance(stderr, bytes):
stderr = stderr.decode("utf-8", errors="replace")
phase_info = f"- **Phase**: {phase_name}\n" if phase_name else ""
error_msg = (
f"# Agent Timeout\n\n"
f"{phase_info}"
f"- **Step**: {failed_step.name}\n"
f"- **Agent**: {failed_step.agent}\n"
f"- **Timeout**: {timeout}s\n\n"
f"Partial stdout ({len(stdout)} chars):\n"
f"```\n{stdout[:2000] or '(none)'}\n```\n\n"
f"Stderr:\n```\n{stderr[:2000] or '(none)'}\n```\n"
)
else:
error_msg = _format_runtime_error_markdown(
exc,
step_name=failed_step.name,
agent_name=failed_step.agent,
phase_name=phase_name,
)
_save_step_output(run_dir, output_iter, f"{failed_step.name}_error", error_msg)
logger.error(" [%s] FAILED — saved to output", failed_step.name)
failed_steps = ", ".join(step.name for step, _ in errors)
saved_steps = ", ".join(step.name for step in batch if step.output_key in local_outputs)
first_error = errors[0][1]
saved_note = f" Successful outputs were saved for: {saved_steps}." if saved_steps else ""
raise RuntimeError(
f"Parallel batch failed: {len(errors)}/{len(batch)} steps failed ({failed_steps})."
f"{saved_note} First error:\n{first_error}"
)
spinner.stop(f"[parallel] {len(batch)} agents done ({batch_elapsed}s)")
# ---------------------------------------------------------------------------
# Context and template helpers
@@ -607,6 +994,35 @@ def _build_context(
return context
def _build_runtime_inputs(
config: PipelineConfig,
input_contents: dict[str, str],
cwd: Path,
) -> dict[str, str]:
"""Load runtime env and expose safe execution hints to prompts."""
env, loaded_files, loaded_values = build_runtime_environment(config.execution, cwd)
input_contents["execution_policy"] = build_execution_policy(config.execution)
input_contents["environment_context"] = summarize_environment(
config.execution, loaded_files, env, loaded_values,
)
return env
def _augment_prompt_with_runtime_context(
prompt: str,
context: dict[str, str],
) -> str:
"""Append execution/env guidance without requiring every template to include placeholders."""
extras: list[str] = []
if context.get("execution_policy"):
extras.append("## Execution Policy\n" + context["execution_policy"])
if context.get("environment_context"):
extras.append("## Environment Context\n" + context["environment_context"])
if not extras:
return prompt
return prompt.rstrip() + "\n\n" + "\n\n".join(extras) + "\n"
def _apply_context_override(
context: dict[str, str],
overrides: dict[str, str],
@@ -671,13 +1087,104 @@ def _normalize_aggregate_output(output: str) -> str:
return " ".join(output.lower().split())
_ESCALATE_PATTERN = re.compile(r"VERDICT:\s*ESCALATE", re.IGNORECASE)
_TRACKER_TABLE_PATTERN = re.compile(
r"(##+ Issue Tracker[^\n]*\n(?:\|[^\n]+\|\n?)+)", re.DOTALL,
)
def _extract_verdict(output: str, pattern: str) -> str:
"""Extract PASS or FAIL from output using regex pattern."""
"""Extract PASS, FAIL, or ESCALATE from output using regex pattern."""
if re.search(_ESCALATE_PATTERN, output):
return "ESCALATE" # highest priority
if re.search(pattern, output):
return "PASS"
return "FAIL"
def _extract_senior_tracker(output: str) -> str:
"""Extract Issue Tracker table from senior review output."""
match = _TRACKER_TABLE_PATTERN.search(output)
return match.group(0) if match else ""
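
Review note: to make the behaviour of `_TRACKER_TABLE_PATTERN` concrete, the sketch below runs the same regular expression against a made-up review output. The sample text and the name `extract_tracker` are assumptions for illustration; only the pattern itself is taken from the diff.

```python
import re

# Same pattern as in the diff. re.DOTALL is left out here because the pattern
# contains no "." and the flag therefore has no effect on it.
TRACKER = re.compile(r"(##+ Issue Tracker[^\n]*\n(?:\|[^\n]+\|\n?)+)")


def extract_tracker(output: str) -> str:
    """Return the Issue Tracker heading plus its table rows, or ""."""
    match = TRACKER.search(output)
    return match.group(0) if match else ""


# Made-up reviewer output, for illustration only.
review = (
    "### Summary\n"
    "Two issues remain.\n\n"
    "## Issue Tracker (iteration 2)\n"
    "| ID | Status | Summary |\n"
    "|----|--------|---------|\n"
    "| ISS-001 | OPEN | Missing input validation |\n"
    "\n"
    "### Verdict\n"
    "VERDICT: FAIL\n"
)
tracker = extract_tracker(review)
print(tracker)

# A blank line between the heading and the table prevents a match.
print(repr(extract_tracker("## Issue Tracker\n\n| ID |\n")))
```

The pattern expects the table rows to follow the heading directly, so a reviewer that inserts a blank line after the heading (common in Markdown) would produce an empty tracker. Whether that matters depends on how strictly the prompt template constrains the output.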
def _extract_escalated_issues(output: str) -> str:
"""Extract escalation details from senior review output."""
# Look for an "Escalated"/"Escalation" section heading and capture its body
pattern = r"(?:###?\s*Escalat(?:ed|ion).*?\n)(.*?)(?=\n###|\Z)"
match = re.search(pattern, output, re.DOTALL | re.IGNORECASE)
if match:
return match.group(1).strip()
# Fallback: grab the Action Items section
pattern2 = r"(?:###?\s*Action Items.*?\n)(.*?)(?=\n###|\Z)"
match2 = re.search(pattern2, output, re.DOTALL | re.IGNORECASE)
if match2:
return match2.group(1).strip()
return ""
_FP_PATTERN = re.compile(r"[\w/\\]+\.\w{1,5}")
_ISSUE_KEYWORDS = re.compile(
r"\b(missing|validation|error[\s_-]?handling|unused|import|"
r"injection|auth(?:entication|orization)?|deprecated|"
r"leak|overflow|null|undefined|timeout|deadlock|race[\s_-]?condition|"
r"security|permission|encoding|format|parsing|connection|"
r"boundary|initialization|cleanup|resource|concurrency|"
r"exception|crash|hang|corrupt|truncat|duplicat|inconsisten|"
r"omission|over[\s_-]?engineer|refactor|naming|docstring|"
r"type[\s_-]?hint|test|coverage|logging|config|performance)\w*",
re.IGNORECASE,
)
def _issue_fingerprints(text: str) -> set[tuple[str, str]]:
"""Extract (file_path, issue_keyword) pairs from feedback text.
For each file path found, look for issue keywords within 60 characters on
either side of the file path mention and create composite keys.
"""
lower = text.lower()
paths = list(_FP_PATTERN.finditer(lower))
if not paths:
return set()
pairs: set[tuple[str, str]] = set()
for m in paths:
fp = m.group()
# Search a window around the file path for issue keywords
window_start = max(0, m.start() - 60)
window_end = min(len(lower), m.end() + 60)
window = lower[window_start:window_end]
for kw_match in _ISSUE_KEYWORDS.finditer(window):
pairs.add((fp, kw_match.group().lower()))
return pairs
def _detect_auto_escalate(
feedbacks: list[str],
current_feedback: str,
threshold: int = 2,
) -> bool:
"""Detect repeated identical issues across iterations (for auto-escalation).
Extracts (file_path, issue_keyword) fingerprints from feedback and checks
whether at least *threshold* previous iterations share one or more pairs
with the current feedback (not necessarily the same pair each time).
Matching on pairs rather than bare file paths avoids false positives when
the same file is mentioned for completely different issues across iterations.
"""
current_fps = _issue_fingerprints(current_feedback)
if not current_fps:
return False
repeat_count = 0
for prev in feedbacks:
prev_fps = _issue_fingerprints(prev)
if current_fps & prev_fps:
repeat_count += 1
return repeat_count >= threshold
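
Review note: a small worked example of the auto-escalate heuristic. The sketch reimplements `_issue_fingerprints` and `_detect_auto_escalate` with a trimmed keyword list (the diff uses a much longer one), and the feedback strings are invented for illustration.

```python
import re

FP_PATTERN = re.compile(r"[\w/\\]+\.\w{1,5}")
# Trimmed keyword list for illustration; the diff uses a much longer one.
ISSUE_KEYWORDS = re.compile(r"\b(missing|validation|timeout|unused)\w*", re.IGNORECASE)


def issue_fingerprints(text: str) -> set[tuple[str, str]]:
    """Collect (file_path, keyword) pairs found near each file path mention."""
    lower = text.lower()
    pairs: set[tuple[str, str]] = set()
    for m in FP_PATTERN.finditer(lower):
        # 60 characters of context on either side of the path.
        window = lower[max(0, m.start() - 60): min(len(lower), m.end() + 60)]
        for kw in ISSUE_KEYWORDS.finditer(window):
            pairs.add((m.group(), kw.group()))
    return pairs


def detect_auto_escalate(feedbacks: list[str], current: str, threshold: int = 2) -> bool:
    """True if at least `threshold` earlier feedbacks share a pair with `current`."""
    current_fps = issue_fingerprints(current)
    if not current_fps:
        return False
    repeats = sum(1 for prev in feedbacks if current_fps & issue_fingerprints(prev))
    return repeats >= threshold


# Invented feedback strings.
fb1 = "src/api.py: missing validation on user input"
fb2 = "src/api.py still has missing validation"
fb3 = "src/api.py: validation is missing for user input"
other = "src/api.py: request timeout under load"

print(detect_auto_escalate([fb1], fb2))          # one earlier match only
print(detect_auto_escalate([fb1, fb2], fb3))     # two earlier matches
print(detect_auto_escalate([fb1, fb2], other))   # same file, different issue
```

With the default `threshold=2`, the check needs two earlier iterations that overlap with the current feedback. As far as this diff shows, that means it can first fire on the third iteration even though the callers gate on `i >= 2` / `pi >= 2`.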
def _save_step_output(
run_dir: Path,
iteration: int,
@@ -691,8 +1198,56 @@ def _save_step_output(
return path
def _format_runtime_error_markdown(
exc: Exception,
*,
step_name: str,
agent_name: str,
phase_name: str | None = None,
) -> str:
"""Render a structured markdown error report for a failed step."""
phase_info = f"- **Phase**: {phase_name}\n" if phase_name else ""
lines = [
"# Agent Error",
"",
phase_info.rstrip(),
f"- **Step**: {step_name}",
f"- **Agent**: {agent_name}",
]
lines = [line for line in lines if line]
if isinstance(exc, AgentInvocationError):
lines.extend(
[
f"- **Failure Type**: {exc.failure_type}",
f"- **Suggested Action**: {exc.suggested_action}",
"",
"## Command",
"```",
exc.cmd_preview,
"```",
"",
"## Raw Error",
"```",
exc.raw_error,
"```",
],
)
else:
lines.extend(
[
"",
"```",
str(exc),
"```",
],
)
return "\n".join(lines) + "\n"
def _save_report(run_dir: Path, config: PipelineConfig, result: PipelineResult) -> None:
"""Generate and save the final markdown report."""
"""Build and save the final markdown report."""
report = build_report(config, result)
report_path = run_dir / "final-report.md"
report_path.parent.mkdir(parents=True, exist_ok=True)


@@ -12,7 +12,7 @@ from cross_eval.models import PhaseConfig, StepConfig
# Default prompt templates
# ---------------------------------------------------------------------------
GENERATE_TEMPLATE = """\
CODING_TEMPLATE = """\
You are tasked with implementing code based on a plan and checklist.
## Plan
@@ -53,8 +53,8 @@ You are tasked with reviewing code against a plan and checklist.
## Reference Documents
{docs}
## Generated Code / Previous Step Output
{generated_code}
## Coding Output / Previous Step Output
{coding_output}
## Previous Review Feedback
{feedback}
@@ -94,10 +94,10 @@ security concerns, performance problems), report them separately under \
(Write "N/A" if no previous feedback was provided.)
### Issues Found
List issues ordered by severity (Critical first):
- [Critical][Over-engineering] Description (reference specific plan/checklist item)
- [Major][Omission] Description (reference specific plan/checklist item)
- [Minor][Omission] Description (reference specific plan/checklist item)
List issues ordered by severity (Critical first). Assign each issue a unique ID (ISS-NNN):
- ISS-001 [Critical][Over-engineering] Description (reference specific plan/checklist item)
- ISS-002 [Major][Omission] Description (reference specific plan/checklist item)
- ISS-003 [Minor][Omission] Description (reference specific plan/checklist item)
### Out of Scope Issues
Issues found outside plan/checklist scope but worth noting:
@@ -119,7 +119,7 @@ Otherwise output: VERDICT: FAIL
"""
GENERATE_TEMPLATE_KO = """\
CODING_TEMPLATE_KO = """\
당신은 기획서와 체크리스트를 기반으로 코드를 구현하는 개발자입니다.
## 기획서
@@ -159,7 +159,7 @@ REVIEW_TEMPLATE_KO = """\
{docs}
## 검토 대상 코드
{generated_code}
{coding_output}
## 이전 리뷰 피드백
{feedback}
@@ -195,10 +195,10 @@ REVIEW_TEMPLATE_KO = """\
(이전 피드백이 없으면 "해당 없음"이라고 작성하세요.)
### 발견된 이슈
심각도 순서(Critical 먼저)로 나열:
- [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
- [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
- [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
심각도 순서(Critical 먼저)로 나열. 각 이슈에 고유 ID(ISS-NNN)를 부여하세요:
- ISS-001 [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
- ISS-002 [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
- ISS-003 [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
### 범위 밖 이슈
기획서/체크리스트 범위 밖이지만 주목할 만한 이슈:
@@ -357,6 +357,150 @@ REVIEW_ONLY_TEMPLATE_KO = """\
그렇지 않으면: VERDICT: FAIL
"""
PLAN_REVIEW_TEMPLATE = """\
You are tasked with reviewing planning documents before implementation begins.
## Plan
{plan}
## Checklist
{checklist}
## Reference Documents
{docs}
## Previous Review (iteration {iteration} of {max_iterations})
{feedback}
## Review Instructions
Review the planning package itself: the plan, checklist, and reference documents.
You MAY inspect the current repository to validate feasibility, constraints, and integration assumptions.
Do NOT write or modify code. Assume implementation has NOT started yet.
Your job is to find planning issues that would likely cause bad implementation outcomes:
- Ambiguous or contradictory requirements
- Missing acceptance criteria, constraints, edge cases, or dependencies
- Scope that is broader or more complex than the stated objective
- Checklist items that do not verify the actual requirements
- Plan details that conflict with the current codebase or architecture
If previous review results are provided above, you MUST:
1. Verify each previously reported issue — is it a real issue or a false positive?
2. Look for issues the previous review MISSED.
3. Do NOT simply repeat the previous review. Provide your own independent assessment.
4. Explicitly mark items as CONFIRMED (still an issue) or DISMISSED (false positive).
For each issue found, classify it with BOTH severity AND category:
Severity levels:
- **Critical**: The plan is likely to cause fundamentally wrong implementation or unsafe behavior.
- **Major**: Important requirements, constraints, or acceptance criteria are unclear, conflicting, missing, or incompatible with the existing system.
- **Minor**: Wording, structure, or checklist quality problems that reduce implementation clarity.
Categories:
- **Over-engineering**: The plan introduces scope, abstractions, or complexity not justified by the stated objective.
- **Omission**: A necessary requirement, constraint, acceptance criterion, edge case, dependency, or compatibility consideration is missing or incomplete.
If you find issues outside the planning scope (e.g. repository health, pre-existing code problems), report them separately under "Out of Scope Issues".
## Output Format
### Issues Found
List issues ordered by severity (Critical first):
- [Critical][Over-engineering] Description (reference specific plan/checklist item)
- [Major][Omission] Description (reference specific plan/checklist item)
- [Minor][Omission] Description (reference specific plan/checklist item)
### Out of Scope Issues
Issues found outside planning scope but worth noting:
- [Critical] Description of issue
- [Minor] Description of issue
(Write "None" if no out-of-scope issues found.)
### Summary
- Critical: N, Major: N, Minor: N
- Over-engineering count: N
- Omission count: N
- CONFIRMED: N, DISMISSED: N
- Overall quality: [BRIEF ASSESSMENT]
### Verdict
If the planning documents are clear, complete enough to implement, compatible with the current repository, and free of unjustified scope, output: VERDICT: PASS
Otherwise output: VERDICT: FAIL
"""
PLAN_REVIEW_TEMPLATE_KO = """\
당신은 구현 시작 전에 기획 문서를 검토하는 리뷰어입니다.
## 기획서
{plan}
## 체크리스트
{checklist}
## 참고 문서
{docs}
## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
{feedback}
## 검토 지침
검토 대상은 코드가 아니라 기획 패키지 자체입니다: 기획서, 체크리스트, 참고 문서를 함께 검토하세요.
현재 저장소를 살펴보며 구현 가능성, 제약조건, 통합 가정이 맞는지도 확인할 수 있습니다.
코드를 생성하거나 수정하지 마세요. 아직 구현이 시작되지 않았다고 가정하세요.
목표는 구현 단계에서 문제를 일으킬 기획 결함을 찾는 것입니다:
- 요구사항이 모호하거나 서로 충돌하는 경우
- 수용 기준, 제약조건, 엣지 케이스, 의존성이 빠진 경우
- 목표 대비 범위가 지나치게 넓거나 복잡한 경우
- 체크리스트가 실제 요구사항 검증에 충분하지 않은 경우
- 기획 내용이 현재 코드베이스나 아키텍처와 충돌하는 경우
이전 리뷰 결과가 제공된 경우 반드시:
1. 이전에 보고된 각 이슈를 검증하세요 — 진짜 이슈인지 오탐인지?
2. 이전 리뷰가 놓친 새로운 이슈를 찾으세요.
3. 이전 리뷰를 그대로 반복하지 마세요. 독립적인 평가를 제공하세요.
4. 각 항목에 CONFIRMED (여전히 이슈) 또는 DISMISSED (오탐) 태그를 명시하세요.
발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요:
심각도:
- **Critical**: 잘못된 구현이나 위험한 동작으로 직결될 가능성이 큰 기획 결함.
- **Major**: 중요한 요구사항, 제약조건, 수용 기준이 모호하거나 충돌하거나 누락되었거나 기존 시스템과 맞지 않는 경우.
- **Minor**: 문서 표현, 구조, 체크리스트 품질 문제로 구현 명확성이 떨어지는 경우.
카테고리:
- **과최적화**: 목표 대비 불필요한 범위, 추상화, 복잡성을 기획에 추가한 경우.
- **누락**: 필요한 요구사항, 제약조건, 수용 기준, 엣지 케이스, 의존성, 호환성 고려가 빠졌거나 불완전한 경우.
기획 범위 밖에서 발견된 문제(저장소 상태, 기존 코드 문제 등)는 "범위 밖 이슈" 섹션에 별도로 보고하세요.
## 출력 형식
### 발견된 이슈
심각도 순서(Critical 먼저)로 나열:
- [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
- [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
- [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
### 범위 밖 이슈
기획 범위 밖이지만 주목할 만한 이슈:
- [Critical] 이슈 설명
- [Minor] 이슈 설명
(범위 밖 이슈가 없으면 "없음"이라고 작성하세요.)
### 요약
- Critical: N, Major: N, Minor: N
- 과최적화 수: N
- 누락 수: N
- CONFIRMED: N, DISMISSED: N
- 전체 품질: [간략한 평가]
### 판정
기획 문서가 구현 가능한 수준으로 명확하고 충분하며 현재 저장소와도 정합적이고, 불필요한 범위 확장이 없으면: VERDICT: PASS
그렇지 않으면: VERDICT: FAIL
"""
AGGREGATE_REVIEW_TEMPLATE = """\
You are adjudicating multiple review results and turning them into an actionable decision.
@@ -378,6 +522,9 @@ You are adjudicating multiple review results and turning them into an actionable
## Previous Verification Feedback
{feedback}
## Previous Issue Tracker
{previous_senior_tracker}
## Instructions
Explore the project directory to confirm the current codebase state. Then:
1. Deduplicate overlapping issues across reviewers.
@@ -385,7 +532,12 @@ Explore the project directory to confirm the current codebase state. Then:
3. Keep only issues supported by the plan, checklist, code, or reviewer evidence.
4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up.
5. Produce a prioritized action list for the coder.
6. If no confirmed issue remains, output VERDICT: PASS. Otherwise VERDICT: FAIL.
6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues).
7. If no confirmed issue remains, output VERDICT: PASS.
8. If issues exist that the coder can fix, output VERDICT: FAIL.
9. If issues require human intervention (ambiguous requirements, architecture decisions, \
external dependency problems, or the same issue persists after 2+ fix attempts), \
output VERDICT: ESCALATE.
## Output Format
@@ -401,13 +553,19 @@ Explore the project directory to confirm the current codebase state. Then:
1. Concrete fix the coder should make
2. Concrete fix the coder should make
## Issue Tracker
| ISS-ID | Severity | Description | Status | Since |
|--------|----------|-------------|--------|-------|
| ISS-001 | Critical | ... | Open/Fixed/Dismissed | v1 |
### Summary
- Confirmed issues: N
- Dismissed findings: N (false positive: N, already fixed: N)
- Overall quality: [BRIEF ASSESSMENT]
### Verdict
VERDICT: PASS or VERDICT: FAIL
VERDICT: PASS or VERDICT: FAIL or VERDICT: ESCALATE
"""
AGGREGATE_REVIEW_TEMPLATE_KO = """\
@@ -431,6 +589,9 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
## 이전 검증 피드백
{feedback}
## 이전 이슈 트래커
{previous_senior_tracker}
## 지침
프로젝트 디렉토리를 탐색하여 현재 코드베이스 상태를 확인한 뒤 다음을 수행하세요.
1. 리뷰어들 사이에 중복되는 이슈를 합치세요.
@@ -438,7 +599,11 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
3. 기획서, 체크리스트, 코드, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요.
5. coder가 바로 수정할 수 있는 우선순위 액션 아이템을 만드세요.
6. 확정된 이슈가 없으면 VERDICT: PASS, 있으면 VERDICT: FAIL 을 출력하세요.
6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월).
7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요.
8. coder가 수정 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \
동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요.
## 출력 형식
@@ -454,26 +619,34 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
1. coder가 수정해야 할 구체적인 작업
2. coder가 수정해야 할 구체적인 작업
## 이슈 트래커
| ISS-ID | 심각도 | 설명 | 상태 | 최초 발견 |
|--------|--------|------|------|-----------|
| ISS-001 | Critical | ... | Open/Fixed/Dismissed | v1 |
### 요약
- 확정 이슈 수: N
- 기각된 주장 수: N (오탐: N, 수정 완료: N)
- 전체 품질: [간략한 평가]
### 판정
VERDICT: PASS 또는 VERDICT: FAIL
VERDICT: PASS 또는 VERDICT: FAIL 또는 VERDICT: ESCALATE
"""
DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
"en": {
"generate": GENERATE_TEMPLATE,
"coding": CODING_TEMPLATE,
"review": REVIEW_TEMPLATE,
"plan-review": PLAN_REVIEW_TEMPLATE,
"review-only": REVIEW_ONLY_TEMPLATE,
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
},
"ko": {
"generate": GENERATE_TEMPLATE_KO,
"coding": CODING_TEMPLATE_KO,
"review": REVIEW_TEMPLATE_KO,
"plan-review": PLAN_REVIEW_TEMPLATE_KO,
"review-only": REVIEW_ONLY_TEMPLATE_KO,
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
},
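The verdict rules in the aggregate templates above imply a parser that treats ESCALATE as the strongest signal. Below is a minimal sketch, assuming the reviewer output contains one or more `VERDICT: <WORD>` tokens. `extract_verdict` and `EXIT_CODES` are hypothetical names, not the project's actual code. The exit-code values come from the commit message (PASS=0, FAIL=1, ESCALATE=2); ranking FAIL above PASS and treating a missing verdict as FAIL are assumptions.

```python
import re

# Hypothetical names; the exit-code values follow the commit message.
EXIT_CODES = {"PASS": 0, "FAIL": 1, "ESCALATE": 2}
_VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL|ESCALATE)\b")


def extract_verdict(output: str) -> str:
    """Return the strongest verdict found in a reviewer's output."""
    found = set(_VERDICT_RE.findall(output))
    # ESCALATE outranks everything; FAIL outranking PASS is an assumption.
    for verdict in ("ESCALATE", "FAIL", "PASS"):
        if verdict in found:
            return verdict
    # Assumption: output without any verdict token counts as FAIL.
    return "FAIL"
```

A CLI wrapper could then finish with `sys.exit(EXIT_CODES[extract_verdict(output)])`.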
@@ -544,18 +717,18 @@ def _build_named_bundle(
def _build_simple_preset(
coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[StepConfig]:
"""First coder generates, first reviewer reviews."""
"""First coder writes code, first reviewer reviews."""
if not coders:
raise ValueError("'simple' preset requires at least 1 coder")
if not reviewers:
raise ValueError("'simple' preset requires at least 1 reviewer")
steps = [
StepConfig(
name="generate",
name="coding",
agent=coders[0],
role="generate",
prompt_template="default:generate",
output_key="generated_code",
role="coding",
prompt_template="default:coding",
output_key="coding_output",
),
StepConfig(
name="review",
@@ -576,7 +749,7 @@ def _build_simple_preset(
output_key="senior_review_result",
verdict=True,
context_override={
"candidate_outputs": "## Generated code\n{generated_code}",
"candidate_outputs": "## Coding output\n{coding_output}",
"reviews_bundle": f"## Review: {reviewers[0]} (review)\n{{review_result}}",
},
),
@@ -587,25 +760,25 @@ def _build_simple_preset(
def _build_cross_review_preset(
coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[StepConfig]:
"""Both coders generate, then cross-review each other's output."""
"""Both coders write code, then cross-review each other's output."""
if len(coders) < 2:
raise ValueError("'cross-review' preset requires at least 2 coders")
a, b = coders[0], coders[1]
ak, bk = _unique_safe_keys([a, b])
steps = [
StepConfig(
name=f"generate_{ak}",
name=f"coding_{ak}",
agent=a,
role="generate",
prompt_template="default:generate",
role="coding",
prompt_template="default:coding",
output_key=f"code_{ak}",
parallel=True,
),
StepConfig(
name=f"generate_{bk}",
name=f"coding_{bk}",
agent=b,
role="generate",
prompt_template="default:generate",
role="coding",
prompt_template="default:coding",
output_key=f"code_{bk}",
parallel=True,
),
@@ -615,7 +788,7 @@ def _build_cross_review_preset(
role="review",
prompt_template="default:review",
output_key=f"review_by_{ak}",
context_override={"generated_code": f"{{code_{bk}}}"},
context_override={"coding_output": f"{{code_{bk}}}"},
parallel=True,
verdict=not seniors,
),
@@ -626,7 +799,7 @@ def _build_cross_review_preset(
prompt_template="default:review",
output_key=f"review_by_{bk}",
verdict=not seniors,
context_override={"generated_code": f"{{code_{ak}}}"},
context_override={"coding_output": f"{{code_{ak}}}"},
parallel=True,
),
]
@@ -642,9 +815,9 @@ def _build_cross_review_preset(
context_override={
"candidate_outputs": _build_named_bundle(
[a, b],
[f"generate_{ak}", f"generate_{bk}"],
[f"coding_{ak}", f"coding_{bk}"],
[f"code_{ak}", f"code_{bk}"],
"Candidate",
"Coding Output",
),
"reviews_bundle": _build_named_bundle(
[a, b],
@@ -715,6 +888,61 @@ def _build_review_only_preset(
return steps
def _build_plan_review_preset(
coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[StepConfig]:
"""Plan-review: reviewers audit planning docs before implementation."""
if not reviewers:
raise ValueError("'plan-review' preset requires at least 1 reviewer")
if len(reviewers) == 1 and not seniors:
return [
StepConfig(
name="plan_review",
agent=reviewers[0],
role="review",
prompt_template="default:plan-review",
output_key="plan_review_result",
verdict=True,
),
]
steps: list[StepConfig] = []
reviewer_keys = _unique_safe_keys(reviewers)
for reviewer, rk in zip(reviewers, reviewer_keys):
steps.append(
StepConfig(
name=f"plan_review_{rk}",
agent=reviewer,
role="review",
prompt_template="default:plan-review",
output_key=f"plan_review_{rk}",
verdict=not seniors,
parallel=True,
),
)
if seniors:
step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
steps.append(
StepConfig(
name="senior_review",
agent=seniors[0],
role="review",
prompt_template="default:aggregate-review",
output_key="senior_review_result",
verdict=True,
context_override={
"candidate_outputs": "Planning documents under review (plan/checklist/reference docs).",
"reviews_bundle": _build_named_bundle(
reviewers, step_names, output_keys, "Review",
),
},
),
)
return steps
def _build_review_fix_preset(
coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[PhaseConfig]:
@@ -762,11 +990,11 @@ def _build_review_fix_preset(
},
),
StepConfig(
name="generate",
name="coding",
agent=fix_coder,
role="generate",
prompt_template="default:generate",
output_key="generated_code",
role="coding",
prompt_template="default:coding",
output_key="coding_output",
context_override={"feedback": "{aggregate_review}"},
),
StepConfig(
@@ -784,14 +1012,44 @@ def _build_review_fix_preset(
]
def _build_coding_review_fix_preset(
coders: list[str], reviewers: list[str], seniors: list[str],
) -> list[PhaseConfig]:
"""Write code once, then run the review-fix convergence loop."""
if not coders:
raise ValueError("'coding-review-fix' preset requires at least 1 coder")
if not reviewers:
raise ValueError("'coding-review-fix' preset requires at least 1 reviewer")
return [
PhaseConfig(
name="initial_coding",
steps=[
StepConfig(
name="coding",
agent=coders[0],
role="coding",
prompt_template="default:coding",
output_key="coding_output",
),
],
max_iterations=1,
consecutive_pass=1,
),
*_build_review_fix_preset(coders, reviewers, seniors),
]
PIPELINE_PRESETS: dict[str, Callable] = {
"simple": _build_simple_preset,
"cross-review": _build_cross_review_preset,
"plan-review": _build_plan_review_preset,
"review-only": _build_review_only_preset,
}
PHASED_PRESETS: dict[str, Callable] = {
"review-fix": _build_review_fix_preset,
"coding-review-fix": _build_coding_review_fix_preset,
}
ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
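The two registries above can back the `preset:<name>` syntax used in the pipeline config. Below is a rough sketch of that lookup. `resolve_preset` is a hypothetical helper, and the lambdas are stand-ins for the real builder functions, so only the registry shape is taken from the diff.

```python
from typing import Callable

# Stand-ins for the real builders; only the registry shape matters here.
PIPELINE_PRESETS: dict[str, Callable[..., list]] = {
    "simple": lambda *args: [],
    "cross-review": lambda *args: [],
    "plan-review": lambda *args: [],
    "review-only": lambda *args: [],
}
PHASED_PRESETS: dict[str, Callable[..., list]] = {
    "review-fix": lambda *args: [],
    "coding-review-fix": lambda *args: [],
}


def resolve_preset(ref: str) -> tuple[str, Callable[..., list]]:
    """Map 'preset:<name>' to ('pipeline' or 'phased', builder)."""
    prefix, _, name = ref.partition(":")
    if prefix != "preset" or not name:
        raise ValueError(f"not a preset reference: {ref!r}")
    if name in PIPELINE_PRESETS:
        return "pipeline", PIPELINE_PRESETS[name]
    if name in PHASED_PRESETS:
        return "phased", PHASED_PRESETS[name]
    known = ", ".join(list(PIPELINE_PRESETS) + list(PHASED_PRESETS))
    raise ValueError(f"unknown preset {name!r}; known presets: {known}")
```

Returning the kind alongside the builder lets the caller decide whether the result is a flat step list or a list of phases.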
@@ -805,7 +1063,7 @@ def resolve_template(template_ref: str, templates_dir: Optional[Path] = None) ->
"""Resolve a template reference to its content string.
Formats:
- "default:generate" -> built-in GENERATE_TEMPLATE
- "default:coding" -> built-in CODING_TEMPLATE
- "default:review" -> built-in REVIEW_TEMPLATE
- "path/to/file.md" -> read file contents
"""


@@ -48,11 +48,16 @@ _STRINGS: dict[str, dict[str, str]] = {
"pass_msg": "All checklist items satisfied. No over-engineering or omissions detected.",
"fail_phased": "Pipeline phases ({phases}) completed without full convergence.",
"fail_simple": "Maximum iterations ({max_iter}) reached without passing all checks.",
"escalate_msg": "Human review required. The following issues could not be resolved automatically:",
"escalate_title": "Escalation Report",
"issue_tracker_title": "Issue Tracker Summary",
"issue_tracker_desc": "Issues discovered across iterations and their final resolution status.",
"metrics_title": "Review Metrics",
"metrics_trend_title": "Metrics Trend",
"metrics_iter": "Iter",
"metrics_total_issues": "Total Issues",
"metrics_na": "N/A",
"iteration_details": "Iteration Details",
},
"ko": {
"title": "교차 검증 리포트",
@@ -84,11 +89,16 @@ _STRINGS: dict[str, dict[str, str]] = {
"pass_msg": "모든 체크리스트 항목 충족. 과최적화/누락 없음.",
"fail_phased": "파이프라인 페이즈 ({phases}) 완료, 완전한 수렴에 도달하지 못함.",
"fail_simple": "최대 반복 횟수 ({max_iter})에 도달, 모든 검증을 통과하지 못함.",
"escalate_msg": "사람의 확인이 필요합니다. 아래 이슈는 자동으로 해결할 수 없었습니다:",
"escalate_title": "에스컬레이션 리포트",
"issue_tracker_title": "이슈 트래커 요약",
"issue_tracker_desc": "반복 과정에서 발견된 이슈와 최종 처리 상태입니다.",
"metrics_title": "리뷰 메트릭",
"metrics_trend_title": "메트릭 추이",
"metrics_iter": "반복",
"metrics_total_issues": "총 이슈",
"metrics_na": "해당 없음",
"iteration_details": "반복 상세",
},
}
@@ -181,20 +191,41 @@ def _build_simple_report(
out_of_scope_items: list[tuple[int, str]] = []
# Pre-scan iterations to collect out-of-scope items and review metrics
# (needed before rendering final verdict / metrics sections)
for iter_result in result.iterations:
lines.append("---\n")
lines.append(f"## {_t(config, 'iteration')} {iter_result.iteration}\n")
for step in config.pipeline:
output = iter_result.step_outputs.get(step.output_key, "")
if step.role == "review":
oos = _extract_out_of_scope(output)
if oos:
out_of_scope_items.append((iter_result.iteration, oos))
step_metrics = parse_review_metrics(output)
if iter_result.review_metrics is None:
iter_result.review_metrics = step_metrics
else:
iter_result.review_metrics = _aggregate_metrics(
iter_result.review_metrics, step_metrics,
)
_append_iteration_steps(lines, config, iter_result, config.pipeline, out_of_scope_items)
_append_final_verdict(lines, config, result)
_append_issue_tracker_summary(lines, config, result)
_append_review_metrics_table(lines, config, result)
lines.append("---\n")
lines.append(f"## {_t(config, 'iteration_details')}\n")
for iter_result in result.iterations:
lines.append(f"### {_t(config, 'iteration')} {iter_result.iteration}\n")
_append_iteration_steps(lines, config, iter_result, config.pipeline, out_of_scope_items, skip_extraction=True)
if iter_result.feedback:
lines.append(f"**{_t(config, 'feedback_next')}** {iter_result.feedback[:200]}...")
lines.append("")
_append_out_of_scope(lines, config, out_of_scope_items)
_append_review_metrics_table(lines, config, result)
_append_repeated_aggregate(lines, config, result)
_append_final_verdict(lines, config, result)
return "\n".join(lines)
@@ -211,14 +242,42 @@ def _build_phased_report(
phase_map = {p.name: p for p in config.phases}
out_of_scope_items: list[tuple[int, str]] = []
# Pre-scan iterations to collect out-of-scope items and review metrics
for phase_name, phase_iters_iter in groupby(
result.iterations, key=lambda ir: ir.phase_name,
):
phase_iters = list(phase_iters_iter)
phase_config = phase_map.get(phase_name or "")
steps = phase_config.steps if phase_config else config.pipeline
for iter_result in phase_iters:
for step in steps:
output = iter_result.step_outputs.get(step.output_key, "")
if step.role == "review":
oos = _extract_out_of_scope(output)
if oos:
out_of_scope_items.append((iter_result.iteration, oos))
step_metrics = parse_review_metrics(output)
if iter_result.review_metrics is None:
iter_result.review_metrics = step_metrics
else:
iter_result.review_metrics = _aggregate_metrics(
iter_result.review_metrics, step_metrics,
)
_append_final_verdict(lines, config, result)
_append_issue_tracker_summary(lines, config, result)
_append_review_metrics_table(lines, config, result)
lines.append("---\n")
lines.append(f"## {_t(config, 'iteration_details')}\n")
for phase_name, phase_iters_iter in groupby(
result.iterations, key=lambda ir: ir.phase_name,
):
phase_iters = list(phase_iters_iter)
phase_config = phase_map.get(phase_name or "")
lines.append("---\n")
lines.append(f"## {_t(config, 'phase')}: {phase_name}\n")
lines.append(f"### {_t(config, 'phase')}: {phase_name}\n")
if phase_config:
step_desc = "".join(s.name for s in phase_config.steps)
@@ -242,14 +301,17 @@ def _build_phased_report(
verdict_label += ""
else:
verdict_label = " — PASS ✓"
elif iter_result.verdict == "ESCALATE":
consecutive = 0
verdict_label = " — ESCALATE"
else:
consecutive = 0
verdict_label = " — FAIL"
lines.append(
f"### {_t(config, 'iteration')} {iter_result.iteration}{verdict_label}\n"
f"#### {_t(config, 'iteration')} {iter_result.iteration}{verdict_label}\n"
)
_append_iteration_steps(lines, config, iter_result, steps, out_of_scope_items)
_append_iteration_steps(lines, config, iter_result, steps, out_of_scope_items, skip_extraction=True)
if iter_result.feedback:
lines.append(
@@ -258,9 +320,7 @@ def _build_phased_report(
lines.append("")
_append_out_of_scope(lines, config, out_of_scope_items)
_append_review_metrics_table(lines, config, result)
_append_repeated_aggregate(lines, config, result)
_append_final_verdict(lines, config, result)
return "\n".join(lines)
@@ -309,8 +369,14 @@ def _append_iteration_steps(
iter_result: IterationResult,
steps: list[StepConfig],
out_of_scope_items: list[tuple[int, str]],
*,
skip_extraction: bool = False,
) -> None:
"""Append step details for one iteration."""
"""Append step details for one iteration.
If *skip_extraction* is True, out-of-scope and review-metrics parsing
is skipped (useful when a pre-scan already collected that data).
"""
for step in steps:
agent_result = iter_result.step_results.get(step.output_key)
output = iter_result.step_outputs.get(step.output_key, "")
@@ -334,7 +400,7 @@ def _append_iteration_steps(
lines.append(output)
lines.append("")
if step.role == "review":
if not skip_extraction and step.role == "review":
oos = _extract_out_of_scope(output)
if oos:
out_of_scope_items.append((iter_result.iteration, oos))
@@ -469,8 +535,18 @@ def _append_final_verdict(
lines.append("---\n")
lines.append(f"## {_t(config, 'final_verdict_title')}: {result.final_verdict}\n")
if result.agentic_branch:
lines.append(f"**Agentic branch**: `{result.agentic_branch}`")
lines.append(f"```bash\ngit checkout {result.agentic_branch}\n```\n")
if result.final_verdict == "PASS":
lines.append(_t(config, "pass_msg"))
elif result.final_verdict == "ESCALATE":
lines.append(_t(config, "escalate_msg"))
lines.append("")
for issue in result.escalated_issues:
lines.append(f"- {issue}")
lines.append("")
else:
if config.phases:
phase_names = "".join(p.name for p in config.phases)
@@ -481,6 +557,121 @@ def _append_final_verdict(
)
# ---------------------------------------------------------------------------
# Issue Tracker extraction from senior/aggregate outputs
# ---------------------------------------------------------------------------
_ISSUE_TRACKER_PATTERN = re.compile(
r"##+ (?:Issue Tracker|이슈 트래커)[^\n]*\n((?:\|[^\n]+\|\n?)+)",
re.DOTALL,
)
_TRACKER_ROW_PATTERN = re.compile(
r"^\|\s*(ISS-\d+)\s*\|\s*(\S+)\s*\|\s*(.*?)\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|",
re.MULTILINE,
)
def _extract_issue_tracker_rows(
result: PipelineResult,
) -> list[dict[str, str]]:
"""Extract the latest Issue Tracker table from pipeline results.
Scans iteration outputs in reverse to find the most recent tracker table
from aggregate/senior review steps. Falls back to parsing individual
review outputs for ISS-NNN tagged issues.
"""
# Use the tracker table from the most recent iteration that has one
for ir in reversed(result.iterations):
for key, output in ir.step_outputs.items():
match = _ISSUE_TRACKER_PATTERN.search(output)
if not match:
continue
table_text = match.group(1)
rows = []
for row_match in _TRACKER_ROW_PATTERN.finditer(table_text):
rows.append({
"id": row_match.group(1),
"severity": row_match.group(2),
"description": row_match.group(3).strip(),
"status": row_match.group(4),
"since": row_match.group(5),
})
if rows:
return rows
# Fallback: parse ISS-NNN from review outputs across iterations
seen: dict[str, dict[str, str]] = {}
for ir in result.iterations:
for key, output in ir.step_outputs.items():
for m in re.finditer(
r"(ISS-\d+)\s*\[(\w+)\]\[.*?\]\s*(.*?)(?:\n|$)", output,
):
iss_id = m.group(1)
if iss_id not in seen:
seen[iss_id] = {
"id": iss_id,
"severity": m.group(2),
"description": m.group(3).strip()[:80],
"status": "Open",
"since": f"v{ir.iteration}",
}
return list(seen.values())
def _append_issue_tracker_summary(
lines: list[str],
config: PipelineConfig,
result: PipelineResult,
) -> None:
"""Append a consolidated issue tracker table to the report."""
rows = _extract_issue_tracker_rows(result)
if not rows:
return
lines.append("---\n")
lines.append(f"## {_t(config, 'issue_tracker_title')}\n")
lines.append(f"{_t(config, 'issue_tracker_desc')}\n")
lang = getattr(config, "language", "en")
if lang == "ko":
lines.append("| ISS-ID | 심각도 | 설명 | 상태 | 최초 발견 |")
else:
lines.append("| ISS-ID | Severity | Description | Status | Since |")
lines.append("|--------|----------|-------------|--------|-------|")
for row in rows:
lines.append(
f"| {row['id']} | {row['severity']} "
f"| {row['description']} | {row['status']} | {row['since']} |"
)
lines.append("")
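To see what the two tracker patterns accept, here is a small self-contained sketch that applies them to a sample senior-review output. The regular expressions are copied from the diff above; `parse_tracker` and `SAMPLE` are illustrative names only. Because the severity, status, and since columns use `\S+`, the sketch assumes those cells hold a single token each.

```python
import re

# Patterns copied from the diff; names are shortened for the example.
ISSUE_TRACKER_PATTERN = re.compile(
    r"##+ (?:Issue Tracker|이슈 트래커)[^\n]*\n((?:\|[^\n]+\|\n?)+)",
    re.DOTALL,
)
TRACKER_ROW_PATTERN = re.compile(
    r"^\|\s*(ISS-\d+)\s*\|\s*(\S+)\s*\|\s*(.*?)\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|",
    re.MULTILINE,
)

# Hypothetical reviewer output, with the table directly under the heading.
SAMPLE = """\
## Issue Tracker
| ISS-ID | Severity | Description | Status | Since |
|--------|----------|-------------|--------|-------|
| ISS-001 | Critical | Missing null check in parser | Open | v1 |
| ISS-002 | Minor | Typo in log message | Fixed | v2 |

### Summary
"""


def parse_tracker(text: str) -> list[dict[str, str]]:
    """Return one dict per ISS-NNN row of the first tracker table in text."""
    match = ISSUE_TRACKER_PATTERN.search(text)
    if not match:
        return []
    return [
        {
            "id": m.group(1),
            "severity": m.group(2),
            "description": m.group(3).strip(),
            "status": m.group(4),
            "since": m.group(5),
        }
        for m in TRACKER_ROW_PATTERN.finditer(match.group(1))
    ]
```

The header and separator rows are skipped because neither contains an `ISS-<digits>` id in the first column.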
def print_escalation_report(
config: PipelineConfig,
result: PipelineResult,
) -> None:
"""Print a prominent ANSI-colored escalation report to the terminal."""
RED = "\033[31m"
YELLOW = "\033[33m"
BOLD = "\033[1m"
RESET = "\033[0m"
title = _t(config, "escalate_title")
msg = _t(config, "escalate_msg")
print(f"\n{RED}{BOLD}{'=' * 60}")
print(f" {title}")
print(f"{'=' * 60}{RESET}\n")
print(f"{YELLOW}{msg}{RESET}\n")
for issue in result.escalated_issues:
print(f" {RED}{RESET} {issue}")
print(f"\n{RED}{BOLD}{'=' * 60}{RESET}\n")
def _append_repeated_aggregate(
lines: list[str],
config: PipelineConfig,

cross_eval/runtime_env.py (new file, 152 lines)

@@ -0,0 +1,152 @@
"""Helpers for building agent runtime environments from .env files."""
from __future__ import annotations
import os
from pathlib import Path
from cross_eval.models import ExecutionConfig
_SUMMARY_PREFIXES = (
"CLICKHOUSE",
"CH_",
"DB_",
"DATABASE",
"PG",
"POSTGRES",
"MYSQL",
"REDIS",
"AWS",
"S3",
)
def _strip_quotes(value: str) -> str:
if len(value) >= 2 and value[0] == value[-1] and value[0] in {"'", '"'}:
unwrapped = value[1:-1]
if value[0] == '"':
return bytes(unwrapped, "utf-8").decode("unicode_escape")
return unwrapped
return value
def parse_dotenv(path: Path) -> dict[str, str]:
"""Parse a simple dotenv file into key/value pairs."""
values: dict[str, str] = {}
for raw_line in path.read_text(encoding="utf-8").splitlines():
line = raw_line.strip()
if not line or line.startswith("#"):
continue
if line.startswith("export "):
line = line[len("export ") :].strip()
if "=" not in line:
continue
key, value = line.split("=", 1)
key = key.strip()
if not key:
continue
values[key] = _strip_quotes(value.strip())
return values
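One detail worth noting in `_strip_quotes`: `bytes(unwrapped, "utf-8").decode("unicode_escape")` interprets each UTF-8 byte as Latin-1, so a double-quoted value containing non-ASCII text comes back garbled. The sketch below is a variant, not the committed code. It keeps the same parsing rules but encodes with `backslashreplace` first so such values survive. `strip_quotes` and `parse_dotenv_text` are hypothetical names, and the parser takes a string instead of a `Path` to stay self-contained.

```python
def strip_quotes(value: str) -> str:
    """Drop one pair of matching quotes; expand escapes only in double quotes."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in {"'", '"'}:
        unwrapped = value[1:-1]
        if value[0] == '"':
            # Non-Latin-1 characters become \uXXXX escapes first, so the
            # unicode_escape decode restores them instead of mangling them.
            encoded = unwrapped.encode("latin-1", "backslashreplace")
            return encoded.decode("unicode_escape")
        return unwrapped
    return value


def parse_dotenv_text(text: str) -> dict[str, str]:
    """Parse dotenv-style text with the same rules as parse_dotenv above."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("export "):
            line = line[len("export "):].strip()
        if "=" not in line:
            continue  # not an assignment
        key, value = line.split("=", 1)
        key = key.strip()
        if key:
            values[key] = strip_quotes(value.strip())
    return values
```

As in the original, inline comments after a value are not stripped, and a malformed escape inside double quotes raises `UnicodeDecodeError`.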
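def resolve_env_files(execution: ExecutionConfig, project_root: Path) -> list[Path]:
"""Resolve and deduplicate the configured env files.
Relative paths are resolved against the project root; entries that do not exist or are not regular files are skipped.
"""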
candidates: list[Path] = []
for raw in execution.env_files:
path = Path(raw)
if not path.is_absolute():
path = project_root / path
candidates.append(path)
for raw in execution.auto_env_files:
path = project_root / raw
candidates.append(path)
resolved: list[Path] = []
seen: set[Path] = set()
for path in candidates:
try:
normalized = path.resolve()
except OSError:
normalized = path
if normalized in seen or not normalized.exists() or not normalized.is_file():
continue
seen.add(normalized)
resolved.append(normalized)
return resolved
def build_runtime_environment(
execution: ExecutionConfig,
project_root: Path,
) -> tuple[dict[str, str], list[Path], dict[str, str]]:
"""Build the subprocess env and return (env, loaded_files, loaded_values).
Values from later env files override earlier ones and any inherited variables.
"""
env = os.environ.copy() if execution.inherit_env else {}
loaded_files = resolve_env_files(execution, project_root)
loaded_values: dict[str, str] = {}
for path in loaded_files:
file_values = parse_dotenv(path)
loaded_values.update(file_values)
env.update(file_values)
return env, loaded_files, loaded_values
def summarize_environment(
execution: ExecutionConfig,
loaded_files: list[Path],
env: dict[str, str],
loaded_values: dict[str, str],
) -> str:
"""Generate a safe environment summary for prompts without leaking secrets."""
lines: list[str] = []
if loaded_files:
joined = ", ".join(str(path) for path in loaded_files)
lines.append(f"Loaded env files into the agent process: {joined}")
else:
lines.append("No .env file was auto-loaded into the agent process.")
if execution.auto_context_targets:
lines.append(
"Execution targets hinted by the user: "
+ ", ".join(execution.auto_context_targets)
)
if execution.expose_env_names:
visible_names = sorted(
{
key
for key in set(loaded_values) | set(env)
if key.startswith(_SUMMARY_PREFIXES)
or any(prefix in key for prefix in ("CLICKHOUSE", "DATABASE", "DB_"))
}
)
if visible_names:
lines.append("Relevant env var names available to commands: " + ", ".join(visible_names))
else:
lines.append("No DB/service env var names matched the default summary filters.")
else:
lines.append("Environment variable values are loaded but names are hidden from the prompt.")
wants_clickhouse = "clickhouse" in {target.lower() for target in execution.auto_context_targets}
clickhouse_keys = [key for key in env if "CLICKHOUSE" in key or key.startswith("CH_")]
if wants_clickhouse or clickhouse_keys:
if clickhouse_keys:
lines.append("ClickHouse-related environment variables are available to the agent.")
else:
lines.append("No ClickHouse-specific env vars were detected in the loaded environment.")
return "\n".join(lines)
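The name filter inside `summarize_environment` can be isolated to check which variables would be mentioned in a prompt. This is a minimal sketch using the prefix list from the diff. `visible_env_names` is a hypothetical helper, and the sample environments in the usage note are made up.

```python
SUMMARY_PREFIXES = (
    "CLICKHOUSE", "CH_", "DB_", "DATABASE", "PG",
    "POSTGRES", "MYSQL", "REDIS", "AWS", "S3",
)
# Substrings that also match when they appear in the middle of a name.
SUMMARY_SUBSTRINGS = ("CLICKHOUSE", "DATABASE", "DB_")


def visible_env_names(env: dict[str, str]) -> list[str]:
    """Return the sorted variable names (never values) that pass the filter."""
    return sorted(
        key
        for key in env
        # str.startswith accepts a tuple of candidate prefixes.
        if key.startswith(SUMMARY_PREFIXES)
        or any(part in key for part in SUMMARY_SUBSTRINGS)
    )
```

For example, `visible_env_names({"PGHOST": "h", "APP_DB_URL": "u", "HOME": "/"})` keeps `APP_DB_URL` and `PGHOST`. Short prefixes such as `PG` are broad, so an unrelated name like `PGP_KEY` would also be listed.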
def build_execution_policy(execution: ExecutionConfig) -> str:
"""Describe the execution latitude granted to agentic coders/reviewers."""
lines = [
f"Execution mode: {execution.mode}",
f"Command policy: {execution.command_policy}",
"The agent may choose shell, Python, git, docker, test, and database commands on its own when needed.",
"The user does not need to pre-specify exact commands.",
]
if execution.command_policy == "broad":
lines.append("Prefer direct validation by running the minimum set of commands needed to prove a fix.")
else:
lines.append("Keep command usage minimal and focused on validation.")
return "\n".join(lines)

cross_eval/worktree.py (new file, 135 lines)

@@ -0,0 +1,135 @@
"""Git worktree lifecycle management for agentic mode."""
from __future__ import annotations
import logging
import shutil
import subprocess
from datetime import datetime
from pathlib import Path
logger = logging.getLogger(__name__)
class WorktreeError(RuntimeError):
"""Error during worktree operations."""
def make_branch_name(preset_name: str) -> str:
"""Generate a branch name for agentic results."""
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
return f"cross-eval/{preset_name}_{ts}"
def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
"""Create a git worktree on a new branch from HEAD.
1. Create branch from HEAD
2. Create worktree checked out to that branch
Any existing directory at work_dir is deleted first.
The branch lives in the original repo, so it survives worktree removal.
"""
work_dir = work_dir.resolve()
if work_dir.exists():
shutil.rmtree(work_dir)
# Create the branch at HEAD
try:
subprocess.run(
["git", "branch", branch_name, "HEAD"],
cwd=base_cwd,
capture_output=True,
text=True,
check=True,
)
except subprocess.CalledProcessError as e:
raise WorktreeError(
f"Failed to create branch '{branch_name}': {e.stderr.strip()}"
) from e
# Create worktree on that branch
try:
subprocess.run(
["git", "worktree", "add", str(work_dir), branch_name],
cwd=base_cwd,
capture_output=True,
text=True,
check=True,
)
except subprocess.CalledProcessError as e:
# Clean up the branch if worktree creation fails
subprocess.run(
["git", "branch", "-D", branch_name],
cwd=base_cwd,
capture_output=True,
)
raise WorktreeError(
f"Failed to create worktree at {work_dir}: {e.stderr.strip()}"
) from e
logger.debug("Created worktree on branch '%s': %s", branch_name, work_dir)
return work_dir
def capture_diff(worktree_path: Path) -> str:
"""Capture all changes made in the worktree as a unified diff.
Stages everything with `git add -A` first, so the diff includes both tracked modifications and new untracked files. The changes stay staged afterwards.
"""
subprocess.run(
["git", "add", "-A"],
cwd=worktree_path,
capture_output=True,
check=True,
)
result = subprocess.run(
["git", "diff", "--cached", "HEAD"],
cwd=worktree_path,
capture_output=True,
text=True,
)
return result.stdout.strip()
def commit_worktree(worktree_path: Path, message: str) -> bool:
"""Stage and commit all changes in the worktree.
Returns True if a commit was made, False otherwise (typically because there was nothing to commit).
"""
subprocess.run(
["git", "add", "-A"],
cwd=worktree_path,
capture_output=True,
check=True,
)
result = subprocess.run(
["git", "commit", "-m", message],
cwd=worktree_path,
capture_output=True,
text=True,
)
# git commit exits non-zero when there is nothing to commit; any other failure also yields False
return result.returncode == 0
def remove_worktree(base_cwd: Path, work_dir: Path) -> None:
"""Remove a git worktree (branch is preserved in the original repo)."""
work_dir = work_dir.resolve()
try:
subprocess.run(
["git", "worktree", "remove", "--force", str(work_dir)],
cwd=base_cwd,
capture_output=True,
text=True,
check=True,
)
except subprocess.CalledProcessError:
if work_dir.exists():
shutil.rmtree(work_dir, ignore_errors=True)
subprocess.run(
["git", "worktree", "prune"],
cwd=base_cwd,
capture_output=True,
)
logger.debug("Removed worktree: %s (branch preserved)", work_dir)
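Callers of `create_worktree` and `remove_worktree` need to guarantee cleanup even when an agent step raises. One way to express that is a context manager. This is a sketch only: `managed_worktree` is a hypothetical helper that is not part of the diff, and it takes the create/remove functions as parameters so it can be exercised without a git repository.

```python
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Iterator


@contextmanager
def managed_worktree(
    create: Callable[[Path, Path, str], Path],
    remove: Callable[[Path, Path], None],
    base_cwd: Path,
    work_dir: Path,
    branch_name: str,
) -> Iterator[Path]:
    """Create a worktree, yield its path, and always remove it afterwards."""
    path = create(base_cwd, work_dir, branch_name)
    try:
        yield path
    finally:
        # Runs on success and on exceptions alike; with the functions above,
        # only the worktree directory goes away and the branch is kept.
        remove(base_cwd, work_dir)
```

In real use the first two arguments would be `create_worktree` and `remove_worktree`, with `commit_worktree` called inside the `with` block before the worktree is removed.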


@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "cross-eval"
version = "0.1.0"
version = "0.2.0"
description = "AI agent cross-evaluation CLI tool"
requires-python = ">=3.9"
dependencies = [

tests/test_agentic.py (new file, 701 lines)

@@ -0,0 +1,701 @@
"""Comprehensive tests for the agentic worktree flow.
Covers:
1. worktree.py unit tests (real temp git repo)
2. agent.py agentic tests (mocking subprocess)
3. config.py _make_agentic tests
4. pipeline integration tests (mock invoke_agent / invoke_agent_agentic)
"""
from __future__ import annotations
import subprocess
import tempfile
import unittest
from pathlib import Path
from unittest.mock import MagicMock, call, patch
from cross_eval.agent import invoke_agent_agentic
from cross_eval.config import BUILTIN_AGENTS, _make_agentic
from cross_eval.models import (
AgentConfig,
AgentResult,
PipelineConfig,
StepConfig,
)
from cross_eval.pipeline import (
_commit_iteration,
_finalize_worktree,
_has_agentic_steps,
_setup_worktree,
run_pipeline,
)
from cross_eval.worktree import (
capture_diff,
commit_worktree,
create_worktree,
make_branch_name,
remove_worktree,
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _init_git_repo(path: Path) -> None:
"""Initialise a minimal git repo with one commit."""
subprocess.run(["git", "init"], cwd=path, capture_output=True, check=True)
subprocess.run(
["git", "config", "user.email", "test@test.com"],
cwd=path, capture_output=True, check=True,
)
subprocess.run(
["git", "config", "user.name", "Test"],
cwd=path, capture_output=True, check=True,
)
(path / "README.md").write_text("# init\n")
subprocess.run(["git", "add", "."], cwd=path, capture_output=True, check=True)
subprocess.run(
["git", "commit", "-m", "initial"],
cwd=path, capture_output=True, check=True,
)
# ===================================================================
# 1. worktree.py unit tests (real temp git repo)
# ===================================================================
class TestCreateWorktree(unittest.TestCase):
"""create_worktree creates a worktree on a named branch."""
def test_creates_worktree_and_branch(self) -> None:
with tempfile.TemporaryDirectory() as td:
base = Path(td) / "repo"
base.mkdir()
_init_git_repo(base)
wt_dir = Path(td) / "wt"
branch = "cross-eval/test_branch"
result_path = create_worktree(base, wt_dir, branch)
# Worktree directory exists
self.assertTrue(result_path.exists())
# Branch was created in the original repo
branches = subprocess.run(
["git", "branch", "--list", branch],
cwd=base, capture_output=True, text=True,
)
self.assertIn(branch, branches.stdout)
# Clean up
remove_worktree(base, wt_dir)
class TestCaptureDiff(unittest.TestCase):
"""capture_diff captures changes correctly."""
def test_captures_new_and_modified_files(self) -> None:
with tempfile.TemporaryDirectory() as td:
base = Path(td) / "repo"
base.mkdir()
_init_git_repo(base)
wt_dir = Path(td) / "wt"
branch = "cross-eval/diff_test"
create_worktree(base, wt_dir, branch)
# Make changes in the worktree
(wt_dir / "new_file.txt").write_text("hello\n")
(wt_dir / "README.md").write_text("# modified\n")
diff = capture_diff(wt_dir)
self.assertIn("new_file.txt", diff)
self.assertIn("hello", diff)
self.assertIn("modified", diff)
remove_worktree(base, wt_dir)
class TestCommitWorktree(unittest.TestCase):
"""commit_worktree commits changes and returns True; False when nothing to commit."""
def test_commit_returns_true_on_changes(self) -> None:
with tempfile.TemporaryDirectory() as td:
base = Path(td) / "repo"
base.mkdir()
_init_git_repo(base)
wt_dir = Path(td) / "wt"
branch = "cross-eval/commit_test"
create_worktree(base, wt_dir, branch)
(wt_dir / "file.txt").write_text("data\n")
result = commit_worktree(wt_dir, "test commit")
self.assertTrue(result)
remove_worktree(base, wt_dir)
def test_commit_returns_false_when_nothing_to_commit(self) -> None:
with tempfile.TemporaryDirectory() as td:
base = Path(td) / "repo"
base.mkdir()
_init_git_repo(base)
wt_dir = Path(td) / "wt"
branch = "cross-eval/empty_commit"
create_worktree(base, wt_dir, branch)
result = commit_worktree(wt_dir, "empty")
self.assertFalse(result)
remove_worktree(base, wt_dir)
class TestRemoveWorktree(unittest.TestCase):
"""remove_worktree removes worktree but branch survives."""
def test_branch_survives_worktree_removal(self) -> None:
with tempfile.TemporaryDirectory() as td:
base = Path(td) / "repo"
base.mkdir()
_init_git_repo(base)
wt_dir = Path(td) / "wt"
branch = "cross-eval/remove_test"
create_worktree(base, wt_dir, branch)
remove_worktree(base, wt_dir)
# Worktree directory should be gone
self.assertFalse(wt_dir.exists())
# Branch should still exist in the original repo
branches = subprocess.run(
["git", "branch", "--list", branch],
cwd=base, capture_output=True, text=True,
)
self.assertIn(branch, branches.stdout)
class TestMakeBranchName(unittest.TestCase):
"""make_branch_name generates expected format."""
def test_format(self) -> None:
name = make_branch_name("review-fix")
self.assertTrue(name.startswith("cross-eval/review-fix_"))
# Should contain a timestamp-like suffix
parts = name.split("_", 1)
self.assertEqual(len(parts), 2)
# Timestamp portion should be like 20260313_123456
ts_part = parts[1] # after "cross-eval/review-fix_"
self.assertEqual(len(ts_part), 15) # YYYYMMDD_HHMMSS
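The format asserted above can be reproduced with a few lines of standard-library code. The following is a minimal sketch, not the project's `make_branch_name`: the function name `make_branch_name_sketch` and the optional `now` parameter are assumptions added for illustration, and only the `cross-eval/<preset>_YYYYMMDD_HHMMSS` shape is taken from the test.

```python
from __future__ import annotations

from datetime import datetime


def make_branch_name_sketch(preset_name: str, now: datetime | None = None) -> str:
    """Hypothetical stand-in: build 'cross-eval/<preset>_YYYYMMDD_HHMMSS'."""
    # Inject `now` in tests to get a deterministic name; default to wall-clock time.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"cross-eval/{preset_name}_{stamp}"


name = make_branch_name_sketch("review-fix", datetime(2026, 3, 13, 12, 34, 56))
print(name)  # cross-eval/review-fix_20260313_123456
```

Note that the test's `split("_", 1)` check relies on the preset name containing no underscore; a preset such as `review_fix` would shift the split point.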
# ===================================================================
# 2. agent.py agentic tests (mocking subprocess)
# ===================================================================
class TestInvokeAgentAgenticClaude(unittest.TestCase):
"""invoke_agent_agentic builds correct cmd for claude (no -p; the positional arg is a task-file reference, not the raw prompt)."""
@patch("cross_eval.worktree.capture_diff", return_value="diff --git a/file ...")
@patch("subprocess.run")
def test_claude_cmd_has_no_dash_p_and_prompt_as_positional(
self, mock_run: MagicMock, mock_diff: MagicMock,
) -> None:
mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
agent = AgentConfig(
name="claude-coder",
command="claude",
args=["--setting-sources", "user", "--dangerously-skip-permissions"],
agentic=True,
)
with tempfile.TemporaryDirectory() as td:
wt = Path(td)
_init_git_repo(wt)
invoke_agent_agentic(
agent, "implement feature X", "coding",
worktree_path=wt, quiet=True,
)
# Find the subprocess.run call that actually runs the agent
agent_call = None
for c in mock_run.call_args_list:
cmd = c[0][0] if c[0] else c[1].get("args", [])
if cmd and cmd[0] == "claude":
agent_call = c
break
self.assertIsNotNone(agent_call, "Expected a subprocess.run call with 'claude'")
cmd = agent_call[0][0]
# No -p flag
self.assertNotIn("-p", cmd)
# Last arg is a task file reference (not raw prompt — avoids arg length limits)
self.assertIn("task file", cmd[-1].lower())
class TestInvokeAgentAgenticCodex(unittest.TestCase):
"""invoke_agent_agentic builds correct cmd for codex (stdin mode, - sentinel)."""
@patch("cross_eval.worktree.capture_diff", return_value="diff --git a/file ...")
@patch("subprocess.run")
def test_codex_cmd_uses_stdin_with_dash_sentinel(
self, mock_run: MagicMock, mock_diff: MagicMock,
) -> None:
mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
agent = AgentConfig(
name="codex-coder",
command="codex",
args=["exec", "--full-auto", "--skip-git-repo-check"],
agentic=True,
)
with tempfile.TemporaryDirectory() as td:
wt = Path(td)
_init_git_repo(wt)
invoke_agent_agentic(
agent, "implement feature Y", "coding",
worktree_path=wt, quiet=True,
)
agent_call = None
for c in mock_run.call_args_list:
cmd = c[0][0] if c[0] else c[1].get("args", [])
if cmd and cmd[0] == "codex":
agent_call = c
break
self.assertIsNotNone(agent_call, "Expected a subprocess.run call with 'codex'")
cmd = agent_call[0][0]
# Should have "-" sentinel at the end for stdin
self.assertEqual(cmd[-1], "-")
# Stdin input should contain the prompt
input_data = agent_call[1].get("input")
self.assertIsNotNone(input_data)
self.assertIn("implement feature Y", input_data)
class TestTaskFileCleanup(unittest.TestCase):
"""Task file lives outside the worktree (in /tmp), so it never appears in the worktree or its diff."""
@patch("cross_eval.worktree.capture_diff", return_value="(no changes)")
@patch("subprocess.run")
def test_task_file_in_tmp_not_worktree(
self, mock_run: MagicMock, mock_diff: MagicMock,
) -> None:
mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
agent = AgentConfig(
name="claude-coder", command="claude", args=[], agentic=True,
)
with tempfile.TemporaryDirectory() as td:
wt = Path(td)
_init_git_repo(wt)
invoke_agent_agentic(
agent, "do stuff", "coding",
worktree_path=wt, quiet=True,
)
# Task file should NOT be in the worktree (it's in /tmp)
self.assertFalse((wt / "CROSS_EVAL_TASK.md").exists())
# ===================================================================
# 3. config.py tests
# ===================================================================
class TestMakeAgenticClaude(unittest.TestCase):
"""_make_agentic strips -p from claude args and sets agentic=True."""
def test_strips_dash_p_and_sets_agentic(self) -> None:
agent = AgentConfig(
name="claude-coder",
command="claude",
args=["-p", "--setting-sources", "user", "--model", "opus"],
)
self.assertFalse(agent.agentic)
_make_agentic(agent)
self.assertTrue(agent.agentic)
self.assertNotIn("-p", agent.args)
self.assertIn("--setting-sources", agent.args)
def test_idempotent_when_no_dash_p(self) -> None:
agent = AgentConfig(
name="claude-coder",
command="claude",
args=["--setting-sources", "user"],
)
_make_agentic(agent)
self.assertTrue(agent.agentic)
self.assertEqual(agent.args, ["--setting-sources", "user"])
class TestMakeAgenticCodex(unittest.TestCase):
"""_make_agentic on codex agent still works (no -p to strip)."""
def test_codex_agentic_works(self) -> None:
agent = AgentConfig(
name="codex-coder",
command="codex",
args=["exec", "--full-auto", "-"],
)
_make_agentic(agent)
self.assertTrue(agent.agentic)
# -p was never there so args are unchanged
self.assertIn("exec", agent.args)
self.assertIn("--full-auto", agent.args)
# ===================================================================
# 4. pipeline integration tests
# ===================================================================
def _make_agentic_config(
run_dir: Path,
agentic_coder: bool = True,
) -> PipelineConfig:
"""Build a config with an agentic coder + non-agentic reviewer."""
coder = AgentConfig(
name="claude-coder", command="claude",
args=["--setting-sources", "user"],
agentic=agentic_coder,
)
reviewer = AgentConfig(
name="claude-reviewer", command="claude",
args=["-p", "--setting-sources", "user"],
agentic=False,
)
steps = [
StepConfig(
name="coding",
agent="claude-coder",
role="coding",
prompt_template="default:coding",
output_key="coding_output",
),
StepConfig(
name="review",
agent="claude-reviewer",
role="review",
prompt_template="default:review",
output_key="review_result",
verdict=True,
),
]
return PipelineConfig(
output_dir=run_dir,
max_iterations=2,
min_iterations=1,
language="en",
inputs={"plan": "Test plan", "checklist": "Test checklist"},
agents={"claude-coder": coder, "claude-reviewer": reviewer},
coders=["claude-coder"],
reviewers=["claude-reviewer"],
pipeline=steps,
preset_name="simple",
)
class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
"""When agentic agent is configured, _setup_worktree is called."""
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
@patch("cross_eval.pipeline._commit_iteration")
@patch("cross_eval.pipeline._setup_worktree")
@patch("cross_eval.pipeline.invoke_agent_agentic")
@patch("cross_eval.pipeline.invoke_agent")
def test_setup_worktree_called(
self,
mock_invoke: MagicMock,
mock_invoke_agentic: MagicMock,
mock_setup: MagicMock,
mock_commit_iter: MagicMock,
mock_finalize: MagicMock,
) -> None:
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
config = _make_agentic_config(run_dir)
wt_path = run_dir / "work"
wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test")
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
agent_name="claude-coder", step_name="coding",
duration_seconds=0.1,
)
mock_invoke.return_value = AgentResult(
output="VERDICT: PASS", exit_code=0,
agent_name="claude-reviewer", step_name="review",
duration_seconds=0.1,
)
run_pipeline(config, cwd=Path(td))
mock_setup.assert_called_once()
class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
"""Reviewer runs with worktree cwd (not original cwd) when worktree exists."""
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
@patch("cross_eval.pipeline._commit_iteration")
@patch("cross_eval.pipeline._setup_worktree")
@patch("cross_eval.pipeline.invoke_agent_agentic")
@patch("cross_eval.pipeline.invoke_agent")
def test_reviewer_uses_worktree_cwd(
self,
mock_invoke: MagicMock,
mock_invoke_agentic: MagicMock,
mock_setup: MagicMock,
mock_commit_iter: MagicMock,
mock_finalize: MagicMock,
) -> None:
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
config = _make_agentic_config(run_dir)
wt_path = run_dir / "work"
wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test")
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
agent_name="claude-coder", step_name="coding",
duration_seconds=0.1,
)
mock_invoke.return_value = AgentResult(
output="VERDICT: PASS", exit_code=0,
agent_name="claude-reviewer", step_name="review",
duration_seconds=0.1,
)
run_pipeline(config, cwd=Path(td))
# The reviewer (non-agentic) should have been called with cwd=worktree_path
reviewer_call = mock_invoke.call_args
self.assertEqual(reviewer_call[1].get("cwd") or reviewer_call[0][3], wt_path)
class TestCommitIterationCalled(unittest.TestCase):
"""_commit_iteration is called after each iteration when worktree exists."""
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
@patch("cross_eval.pipeline._commit_iteration")
@patch("cross_eval.pipeline._setup_worktree")
@patch("cross_eval.pipeline.invoke_agent_agentic")
@patch("cross_eval.pipeline.invoke_agent")
def test_commit_iteration_called(
self,
mock_invoke: MagicMock,
mock_invoke_agentic: MagicMock,
mock_setup: MagicMock,
mock_commit_iter: MagicMock,
mock_finalize: MagicMock,
) -> None:
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
config = _make_agentic_config(run_dir)
wt_path = run_dir / "work"
wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test")
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
agent_name="claude-coder", step_name="coding",
duration_seconds=0.1,
)
mock_invoke.return_value = AgentResult(
output="VERDICT: PASS", exit_code=0,
agent_name="claude-reviewer", step_name="review",
duration_seconds=0.1,
)
run_pipeline(config, cwd=Path(td))
mock_commit_iter.assert_called_once()
call_args = mock_commit_iter.call_args
self.assertEqual(call_args[0][0], wt_path)
class TestFinalizeWorktreeCalled(unittest.TestCase):
"""_finalize_worktree is called once at the end with the worktree path and branch name."""
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
@patch("cross_eval.pipeline._commit_iteration")
@patch("cross_eval.pipeline._setup_worktree")
@patch("cross_eval.pipeline.invoke_agent_agentic")
@patch("cross_eval.pipeline.invoke_agent")
def test_finalize_called(
self,
mock_invoke: MagicMock,
mock_invoke_agentic: MagicMock,
mock_setup: MagicMock,
mock_commit_iter: MagicMock,
mock_finalize: MagicMock,
) -> None:
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
config = _make_agentic_config(run_dir)
wt_path = run_dir / "work"
wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test")
mock_invoke_agentic.return_value = AgentResult(
output="diff output", exit_code=0,
agent_name="claude-coder", step_name="coding",
duration_seconds=0.1,
)
mock_invoke.return_value = AgentResult(
output="VERDICT: PASS", exit_code=0,
agent_name="claude-reviewer", step_name="review",
duration_seconds=0.1,
)
run_pipeline(config, cwd=Path(td))
mock_finalize.assert_called_once()
call_args = mock_finalize.call_args
# Should pass cwd, worktree_path, branch_name, preset_name, verdict
self.assertEqual(call_args[0][1], wt_path)
self.assertEqual(call_args[0][2], "cross-eval/test")
class TestParallelAgenticFallsBackToSequential(unittest.TestCase):
"""Multiple agentic steps in parallel batch fall back to sequential."""
def test_has_agentic_steps_detects_agentic(self) -> None:
coder = AgentConfig(
name="claude-coder", command="claude", args=[], agentic=True,
)
reviewer = AgentConfig(
name="claude-reviewer", command="claude", args=[], agentic=False,
)
config = PipelineConfig(
agents={"claude-coder": coder, "claude-reviewer": reviewer},
)
steps = [
StepConfig(name="a", agent="claude-coder", role="coding",
prompt_template="default:coding", output_key="a"),
]
self.assertTrue(_has_agentic_steps(config, steps))
def test_has_agentic_steps_returns_false_without_agentic(self) -> None:
reviewer = AgentConfig(
name="claude-reviewer", command="claude", args=[], agentic=False,
)
config = PipelineConfig(
agents={"claude-reviewer": reviewer},
)
steps = [
StepConfig(name="r", agent="claude-reviewer", role="review",
prompt_template="default:review", output_key="r", verdict=True),
]
self.assertFalse(_has_agentic_steps(config, steps))
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
@patch("cross_eval.pipeline._commit_iteration")
@patch("cross_eval.pipeline._setup_worktree")
@patch("cross_eval.pipeline.invoke_agent_agentic")
@patch("cross_eval.pipeline.invoke_agent")
def test_parallel_agentic_runs_sequentially(
self,
mock_invoke: MagicMock,
mock_invoke_agentic: MagicMock,
mock_setup: MagicMock,
mock_commit_iter: MagicMock,
mock_finalize: MagicMock,
) -> None:
"""When multiple agentic steps are parallel, they should run sequentially."""
with tempfile.TemporaryDirectory() as td:
run_dir = Path(td)
coder_a = AgentConfig(
name="coder-a", command="claude", args=[], agentic=True,
)
coder_b = AgentConfig(
name="coder-b", command="claude", args=[], agentic=True,
)
reviewer = AgentConfig(
name="reviewer", command="claude", args=["-p"], agentic=False,
)
steps = [
StepConfig(
name="code_a", agent="coder-a", role="coding",
prompt_template="default:coding", output_key="code_a",
parallel=True,
),
StepConfig(
name="code_b", agent="coder-b", role="coding",
prompt_template="default:coding", output_key="code_b",
parallel=True,
),
StepConfig(
name="review", agent="reviewer", role="review",
prompt_template="default:review", output_key="review_result",
verdict=True,
),
]
config = PipelineConfig(
output_dir=run_dir,
max_iterations=1,
min_iterations=1,
language="en",
inputs={"plan": "Test plan", "checklist": "Test checklist"},
agents={
"coder-a": coder_a,
"coder-b": coder_b,
"reviewer": reviewer,
},
coders=["coder-a", "coder-b"],
reviewers=["reviewer"],
pipeline=steps,
preset_name="custom",
)
wt_path = run_dir / "work"
wt_path.mkdir()
mock_setup.return_value = (wt_path, "cross-eval/test")
call_order: list[str] = []
def _track_agentic(agent_config, prompt, step_name, **kwargs):
call_order.append(step_name)
return AgentResult(
output="diff", exit_code=0,
agent_name=agent_config.name, step_name=step_name,
duration_seconds=0.1,
)
mock_invoke_agentic.side_effect = _track_agentic
mock_invoke.return_value = AgentResult(
output="VERDICT: PASS", exit_code=0,
agent_name="reviewer", step_name="review",
duration_seconds=0.1,
)
run_pipeline(config, cwd=Path(td))
# Both agentic steps should have been called (sequentially)
agentic_calls = [c for c in call_order if c.startswith("code_")]
self.assertEqual(len(agentic_calls), 2)
# They should appear in order (sequential, not concurrent)
self.assertEqual(agentic_calls, ["code_a", "code_b"])
if __name__ == "__main__":
unittest.main()
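
The last test above records invocation order through a mock `side_effect`. Below is a minimal, self-contained sketch of that pattern, using only `unittest.mock`. The `_track` helper and the plain loop standing in for the pipeline are assumptions for illustration and are not part of `cross_eval`.

```python
from unittest.mock import MagicMock

call_order: list[str] = []


def _track(agent_name: str, prompt: str, step_name: str, **kwargs) -> str:
    """Hypothetical side effect: remember which step was invoked, then return a value."""
    call_order.append(step_name)
    return f"{agent_name}:{step_name}"


# The mock forwards every call to _track and returns whatever _track returns.
mock_invoke = MagicMock(side_effect=_track)

# Stand-in for the pipeline loop: invoke two steps one after the other.
results = [mock_invoke("coder", "prompt", step) for step in ("code_a", "code_b")]

print(call_order)              # ['code_a', 'code_b']
print(mock_invoke.call_count)  # 2
```

A recorded order shows the sequence in which calls started; on its own it does not prove that the calls never overlapped, so a stricter test might also track an "in flight" counter inside the side effect.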


@@ -1,19 +1,27 @@
from __future__ import annotations
import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch
from cross_eval.agent import _supports_reasoning_effort
from cross_eval.agent import AgentInvocationError, _supports_reasoning_effort
from cross_eval.cli import _apply_phased_iteration_override, main
from cross_eval.agent import invoke_agent
from cross_eval.config import (
BUILTIN_AGENTS,
_SENIOR_SYSTEM_PROMPT,
_default_seniors_for_preset,
apply_reasoning_effort_settings,
load_config,
normalize_reasoning_effort,
normalize_prompt_template,
normalize_step_role,
validate_config,
)
from cross_eval.models import (
AgentConfig,
AgentResult,
IterationResult,
PhaseConfig,
PipelineConfig,
@@ -21,25 +29,52 @@ from cross_eval.models import (
ReviewMetrics,
StepConfig,
)
from cross_eval.pipeline import _detect_repeated_aggregate
from cross_eval.pipeline import (
_detect_auto_escalate,
_detect_repeated_aggregate,
_execute_parallel_batch,
_extract_senior_tracker,
_extract_verdict,
)
from cross_eval.prompts import (
GENERATE_TEMPLATE,
GENERATE_TEMPLATE_KO,
CODING_TEMPLATE,
CODING_TEMPLATE_KO,
REVIEW_TEMPLATE,
REVIEW_TEMPLATE_KO,
PLAN_REVIEW_TEMPLATE,
PLAN_REVIEW_TEMPLATE_KO,
REVIEW_ONLY_TEMPLATE,
REVIEW_ONLY_TEMPLATE_KO,
AGGREGATE_REVIEW_TEMPLATE,
AGGREGATE_REVIEW_TEMPLATE_KO,
_build_cross_review_preset,
_build_coding_review_fix_preset,
_build_plan_review_preset,
_build_review_fix_preset,
_build_review_only_preset,
_build_simple_preset,
)
from cross_eval.report import build_report, parse_review_metrics
from cross_eval.report import build_report, parse_review_metrics, print_escalation_report
class BuiltinAgentConfigTest(unittest.TestCase):
def test_claude_builtin_agents_use_user_settings_and_disable_slash_commands(self) -> None:
for agent_name in ("claude-coder", "claude-reviewer", "claude-senior"):
with self.subTest(agent=agent_name):
args = BUILTIN_AGENTS[agent_name].args
self.assertIn("--setting-sources", args)
self.assertIn("user", args)
self.assertIn("--disable-slash-commands", args)
def test_claude_builtin_agents_use_role_specific_permission_modes(self) -> None:
coder_args = BUILTIN_AGENTS["claude-coder"].args
reviewer_args = BUILTIN_AGENTS["claude-reviewer"].args
senior_args = BUILTIN_AGENTS["claude-senior"].args
self.assertIn("--dangerously-skip-permissions", coder_args)
self.assertIn("bypassPermissions", coder_args)
self.assertIn("plan", reviewer_args)
self.assertIn("plan", senior_args)
def test_codex_builtin_agents_skip_git_repo_check(self) -> None:
for agent_name in ("codex-coder", "codex-reviewer", "codex-senior"):
with self.subTest(agent=agent_name):
@@ -62,6 +97,10 @@ class BuiltinAgentConfigTest(unittest.TestCase):
self.assertEqual(normalize_reasoning_effort("extra_high"), "xhigh")
self.assertEqual(normalize_reasoning_effort("x-high"), "xhigh")
def test_normalize_step_role_and_template_aliases(self) -> None:
self.assertEqual(normalize_step_role("coding"), "coding")
self.assertEqual(normalize_prompt_template("default:coding"), "default:coding")
def test_apply_reasoning_effort_settings_uses_defaults_and_role_overrides(self) -> None:
config = PipelineConfig(
agents={
@@ -116,6 +155,123 @@ class BuiltinAgentConfigTest(unittest.TestCase):
["codex", "-c", 'model_reasoning_effort="high"'],
)
def test_invoke_agent_classifies_auth_failures(self) -> None:
def _fake_run(cmd, **kwargs):
class _Result:
returncode = 1
stdout = ""
stderr = "Not logged in · Please run /login"
return _Result()
agent = AgentConfig(
name="claude-reviewer",
command="claude",
args=["-p", "--model", "opus"],
)
with patch("subprocess.run", side_effect=_fake_run):
with self.assertRaises(AgentInvocationError) as ctx:
invoke_agent(agent, "prompt", "review", quiet=True)
self.assertEqual(ctx.exception.failure_type, "AUTH")
self.assertIn("Re-authenticate", ctx.exception.suggested_action)
def test_invoke_agent_classifies_usage_limit_failures(self) -> None:
def _fake_run(cmd, **kwargs):
class _Result:
returncode = 1
stdout = ""
stderr = "API Error: 429 rate limit exceeded for current quota"
return _Result()
agent = AgentConfig(
name="codex-reviewer",
command="codex",
args=["exec", "--model", "gpt-5.4", "-"],
)
with patch("subprocess.run", side_effect=_fake_run):
with self.assertRaises(AgentInvocationError) as ctx:
invoke_agent(agent, "prompt", "review", quiet=True)
self.assertEqual(ctx.exception.failure_type, "USAGE_LIMIT")
self.assertIn("quota", ctx.exception.suggested_action)
def test_parallel_batch_saves_successes_before_failure(self) -> None:
config = PipelineConfig(
agents={
"ok-reviewer": AgentConfig(name="ok-reviewer", command="codex"),
"bad-reviewer": AgentConfig(name="bad-reviewer", command="claude"),
},
)
steps = [
StepConfig(
name="review_ok",
agent="ok-reviewer",
role="review",
prompt_template="default:review-only",
output_key="review_ok",
parallel=True,
),
StepConfig(
name="review_bad",
agent="bad-reviewer",
role="review",
prompt_template="default:review-only",
output_key="review_bad",
parallel=True,
),
]
step_outputs: dict[str, str] = {}
step_results: dict[str, AgentResult] = {}
def _fake_invoke(agent, prompt, step_name, **kwargs):
if step_name == "review_ok":
return AgentResult(
output="VERDICT: PASS",
exit_code=0,
agent_name=agent.name,
step_name=step_name,
duration_seconds=1.0,
)
raise AgentInvocationError(
agent_name=agent.name,
step_name=step_name,
cmd_preview="claude -p ...",
raw_error="API Error: 429 rate limit exceeded for current quota",
failure_type="USAGE_LIMIT",
suggested_action="Agent CLI hit a quota, billing, or token budget limit. Refill or raise the limit, then rerun.",
)
with tempfile.TemporaryDirectory() as tmpdir:
with patch("cross_eval.pipeline.invoke_agent", side_effect=_fake_invoke):
with self.assertRaises(RuntimeError) as ctx:
_execute_parallel_batch(
steps,
config,
input_contents={},
feedback="",
iteration=1,
max_iterations=3,
cwd=Path(tmpdir),
timeout=None,
dry_run=False,
step_outputs=step_outputs,
step_results=step_results,
run_dir=Path(tmpdir),
output_iter=1,
)
self.assertIn("Successful outputs were saved for: review_ok", str(ctx.exception))
self.assertEqual(step_outputs["review_ok"], "VERDICT: PASS")
self.assertTrue((Path(tmpdir) / "v1" / "review_ok.md").exists())
error_path = Path(tmpdir) / "v1" / "review_bad_error.md"
self.assertTrue(error_path.exists())
self.assertIn("Failure Type", error_path.read_text(encoding="utf-8"))
self.assertIn("USAGE_LIMIT", error_path.read_text(encoding="utf-8"))
def test_detect_repeated_aggregate_warns_on_same_output(self) -> None:
steps = [
StepConfig(
@@ -169,6 +325,14 @@ class BuiltinAgentConfigTest(unittest.TestCase):
),
["claude-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:coding-review-fix",
["codex-reviewer"],
BUILTIN_AGENTS,
),
["codex-senior"],
)
self.assertEqual(
_default_seniors_for_preset(
"preset:simple",
@@ -204,9 +368,37 @@ class BuiltinAgentConfigTest(unittest.TestCase):
)
self.assertEqual(
[step.name for step in converge.steps[3:]],
["aggregate_review", "generate", "verify"],
["aggregate_review", "coding", "verify"],
)
def test_coding_review_fix_starts_with_single_coding_phase(self) -> None:
phases = _build_coding_review_fix_preset(
["codex-coder"],
["claude-reviewer", "codex-reviewer"],
["codex-senior"],
)
self.assertEqual([phase.name for phase in phases], ["initial_coding", "review_fix"])
self.assertEqual(phases[0].max_iterations, 1)
self.assertEqual([step.name for step in phases[0].steps], ["coding"])
self.assertEqual([step.name for step in phases[1].steps[2:]], ["aggregate_review", "coding", "verify"])
def test_apply_phased_iteration_override_updates_only_verdict_phases(self) -> None:
config = PipelineConfig(
phases=_build_coding_review_fix_preset(
["codex-coder"],
["codex-reviewer"],
["codex-senior"],
),
)
_apply_phased_iteration_override(config, 10)
self.assertEqual(config.phases[0].name, "initial_coding")
self.assertEqual(config.phases[0].max_iterations, 1)
self.assertEqual(config.phases[1].name, "review_fix")
self.assertEqual(config.phases[1].max_iterations, 10)
def test_review_only_duplicate_reviewers_get_unique_step_keys(self) -> None:
steps = _build_review_only_preset(
["codex-coder"],
@@ -219,6 +411,31 @@ class BuiltinAgentConfigTest(unittest.TestCase):
["review_codex_reviewer", "review_codex_reviewer_2"],
)
def test_plan_review_duplicate_reviewers_get_unique_step_keys(self) -> None:
steps = _build_plan_review_preset(
["codex-coder"],
["codex-reviewer", "codex-reviewer"],
[],
)
self.assertEqual(
[step.output_key for step in steps],
["plan_review_codex_reviewer", "plan_review_codex_reviewer_2"],
)
def test_plan_review_with_senior_adds_aggregate_step(self) -> None:
steps = _build_plan_review_preset(
["codex-coder"],
["claude-reviewer", "codex-reviewer"],
["claude-senior"],
)
self.assertEqual(steps[-1].name, "senior_review")
self.assertEqual(steps[-1].agent, "claude-senior")
self.assertTrue(steps[-1].verdict)
self.assertFalse(steps[0].verdict)
self.assertFalse(steps[1].verdict)
def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None:
steps = _build_cross_review_preset(
["codex-coder", "codex-coder"],
@@ -246,7 +463,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
steps = phases[0].steps
self.assertEqual(steps[2].name, "aggregate_review")
self.assertEqual(steps[2].agent, "codex-senior")
self.assertEqual(steps[3].name, "generate")
self.assertEqual(steps[3].name, "coding")
self.assertEqual(steps[4].name, "verify")
self.assertEqual(steps[4].agent, "codex-senior")
self.assertTrue(steps[4].verdict)
@@ -273,7 +490,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
self.assertEqual(
[step.name for step in steps],
["generate", "review", "senior_review"],
["coding", "review", "senior_review"],
)
self.assertFalse(steps[1].verdict)
self.assertTrue(steps[2].verdict)
@@ -325,6 +542,8 @@ class PromptTemplateTest(unittest.TestCase):
for tmpl, label in [
(REVIEW_TEMPLATE, "REVIEW_TEMPLATE"),
(REVIEW_TEMPLATE_KO, "REVIEW_TEMPLATE_KO"),
(PLAN_REVIEW_TEMPLATE, "PLAN_REVIEW_TEMPLATE"),
(PLAN_REVIEW_TEMPLATE_KO, "PLAN_REVIEW_TEMPLATE_KO"),
(REVIEW_ONLY_TEMPLATE, "REVIEW_ONLY_TEMPLATE"),
(REVIEW_ONLY_TEMPLATE_KO, "REVIEW_ONLY_TEMPLATE_KO"),
]:
@@ -351,10 +570,10 @@ class PromptTemplateTest(unittest.TestCase):
self.assertIn("CONFIRMED", tmpl)
self.assertIn("DISMISSED", tmpl)
def test_generate_templates_ignore_dismissed(self) -> None:
"""Generate templates should tell coder to ignore DISMISSED items."""
self.assertIn("DISMISSED", GENERATE_TEMPLATE)
self.assertIn("DISMISSED", GENERATE_TEMPLATE_KO)
def test_coding_templates_ignore_dismissed(self) -> None:
"""Coding templates should tell coder to ignore DISMISSED items."""
self.assertIn("DISMISSED", CODING_TEMPLATE)
self.assertIn("DISMISSED", CODING_TEMPLATE_KO)
def test_aggregate_templates_dismissed_structure(self) -> None:
"""Aggregate templates should use [False positive] / [Already fixed] tags."""
@@ -487,11 +706,11 @@ class ReviewMetricsParsingTest(unittest.TestCase):
language="en",
pipeline=[
StepConfig(
name="generate",
name="coding",
agent="claude-coder",
role="generate",
prompt_template="default:generate",
output_key="generated_code",
role="coding",
prompt_template="default:coding",
output_key="coding_output",
verdict=True,
),
],
@@ -500,7 +719,7 @@ class ReviewMetricsParsingTest(unittest.TestCase):
iterations=[
IterationResult(
iteration=1,
step_outputs={"generated_code": "some code"},
step_outputs={"coding_output": "some code"},
verdict="PASS",
),
],
@@ -511,5 +730,307 @@ class ReviewMetricsParsingTest(unittest.TestCase):
self.assertNotIn("Review Metrics", report)
class EscalateVerdictTest(unittest.TestCase):
"""Test ESCALATE verdict functionality."""
def test_extract_verdict_escalate(self) -> None:
output = "Some review content\n\nVERDICT: ESCALATE\n"
result = _extract_verdict(output, r"VERDICT:\s*PASS")
self.assertEqual(result, "ESCALATE")
def test_extract_verdict_escalate_priority(self) -> None:
"""ESCALATE should take priority even if PASS pattern also matches."""
output = "VERDICT: PASS\n\nVERDICT: ESCALATE\n"
result = _extract_verdict(output, r"VERDICT:\s*PASS")
self.assertEqual(result, "ESCALATE")
def test_extract_verdict_pass_still_works(self) -> None:
output = "All good\n\nVERDICT: PASS\n"
result = _extract_verdict(output, r"VERDICT:\s*PASS")
self.assertEqual(result, "PASS")
def test_extract_verdict_fail_still_works(self) -> None:
output = "Issues found\n\nVERDICT: FAIL\n"
result = _extract_verdict(output, r"VERDICT:\s*PASS")
self.assertEqual(result, "FAIL")
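The four verdict tests above pin down a priority order: ESCALATE wins over PASS, and anything that matches neither is FAIL. The following is a minimal sketch of logic that satisfies those tests, not the project's actual `_extract_verdict`. The names `extract_verdict_sketch`, `ESCALATE_RE`, and `EXIT_CODES` are assumptions; the exit-code values follow the commit message (PASS=0, FAIL=1, ESCALATE=2).

```python
import re

# Assumed pattern for the escalation marker; the tests only show "VERDICT: ESCALATE".
ESCALATE_RE = re.compile(r"VERDICT:\s*ESCALATE")

# Exit codes as listed in the commit message.
EXIT_CODES = {"PASS": 0, "FAIL": 1, "ESCALATE": 2}


def extract_verdict_sketch(output: str, pass_pattern: str) -> str:
    """Hypothetical stand-in: return 'ESCALATE', 'PASS', or 'FAIL' for a review output."""
    # ESCALATE is checked first so it wins even when the PASS pattern also matches.
    if ESCALATE_RE.search(output):
        return "ESCALATE"
    if re.search(pass_pattern, output):
        return "PASS"
    return "FAIL"


verdict = extract_verdict_sketch("VERDICT: PASS\n\nVERDICT: ESCALATE\n", r"VERDICT:\s*PASS")
print(verdict, EXIT_CODES[verdict])  # ESCALATE 2
```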
def test_extract_senior_tracker(self) -> None:
output = (
"Some text\n\n"
"## Issue Tracker\n"
"| ISS-ID | Severity | Description | Status | Since |\n"
"|--------|----------|-------------|--------|-------|\n"
"| ISS-001 | Critical | Missing auth | Open | v1 |\n"
"| ISS-002 | Major | Bad naming | Fixed | v1 |\n"
"\nMore text"
)
tracker = _extract_senior_tracker(output)
self.assertIn("Issue Tracker", tracker)
self.assertIn("ISS-001", tracker)
self.assertIn("ISS-002", tracker)
def test_extract_senior_tracker_empty(self) -> None:
output = "No tracker table here"
tracker = _extract_senior_tracker(output)
self.assertEqual(tracker, "")
def test_auto_escalate_heuristic(self) -> None:
prev1 = "Issue in src/auth.py: missing validation"
prev2 = "Issue in src/auth.py: validation still missing"
current = "Issue in src/auth.py: validation not implemented"
# Should detect repeated issue
self.assertTrue(_detect_auto_escalate([prev1, prev2], current, threshold=2))
def test_auto_escalate_no_repeat(self) -> None:
prev1 = "Issue in src/auth.py: missing validation"
current = "Issue in src/database.py: connection pool"
self.assertFalse(_detect_auto_escalate([prev1], current, threshold=2))
def test_auto_escalate_different_issues_same_file(self) -> None:
"""Same file path but different issues should NOT trigger escalation."""
prev1 = "Issue in src/utils.py: missing validation on input"
prev2 = "Issue in src/utils.py: unused import at top of file"
current = "Issue in src/utils.py: error handling not implemented"
# All mention src/utils.py, but the issue keywords differ across
# iterations, so this should NOT escalate.
self.assertFalse(_detect_auto_escalate([prev1, prev2], current, threshold=2))

    def test_report_escalate_verdict(self) -> None:
        config = PipelineConfig(language="en")
        result = PipelineResult(
            final_verdict="ESCALATE",
            escalated_issues=["Requirements are ambiguous — need stakeholder input"],
        )
        report = build_report(config, result)
        self.assertIn("ESCALATE", report)
        self.assertIn("Human review required", report)
        self.assertIn("ambiguous", report)

    def test_report_escalate_verdict_ko(self) -> None:
        config = PipelineConfig(language="ko")
        result = PipelineResult(
            final_verdict="ESCALATE",
            escalated_issues=["요구사항이 모호함"],
        )
        report = build_report(config, result)
        self.assertIn("ESCALATE", report)
        self.assertIn("사람의 확인이 필요합니다", report)

    def test_exit_code_escalate(self) -> None:
        from cross_eval.cli import main

        mock_result = PipelineResult(
            final_verdict="ESCALATE",
            escalated_issues=["Needs human review"],
        )
        with patch("cross_eval.config.load_config") as mock_load, \
             patch("cross_eval.config.validate_config", return_value=[]), \
             patch("cross_eval.pipeline.run_pipeline", return_value=mock_result), \
             patch("cross_eval.report.print_escalation_report"):
            mock_config = PipelineConfig(
                pipeline=[
                    StepConfig(
                        name="review",
                        agent="claude-reviewer",
                        role="review",
                        prompt_template="default:review",
                        output_key="review_result",
                        verdict=True,
                    ),
                ],
                agents=dict(BUILTIN_AGENTS),
                coders=["claude-coder"],
                reviewers=["claude-reviewer"],
                inputs={"plan": Path("/tmp/plan.md")},
                language="en",
                max_iterations=3,
                preset_name="simple",
            )
            mock_load.return_value = mock_config
            with tempfile.NamedTemporaryFile(suffix=".yaml", mode="w") as f:
                f.write("inputs:\n plan: /tmp/plan.md\n")
                f.flush()
                exit_code = main(["run", "-c", f.name])
        self.assertEqual(exit_code, 2)

    def test_senior_prompt_includes_escalate(self) -> None:
        self.assertIn("ESCALATE", _SENIOR_SYSTEM_PROMPT)
        self.assertIn("ambiguous", _SENIOR_SYSTEM_PROMPT.lower())

    def test_aggregate_template_has_tracker(self) -> None:
        self.assertIn("{previous_senior_tracker}", AGGREGATE_REVIEW_TEMPLATE)
        self.assertIn("Issue Tracker", AGGREGATE_REVIEW_TEMPLATE)
        self.assertIn("VERDICT: ESCALATE", AGGREGATE_REVIEW_TEMPLATE)

    def test_report_includes_issue_tracker_summary(self) -> None:
        config = PipelineConfig(
            language="en",
            pipeline=[
                StepConfig(
                    name="review",
                    agent="claude-reviewer",
                    role="review",
                    prompt_template="default:review",
                    output_key="review_result",
                    verdict=True,
                ),
            ],
        )
        result = PipelineResult(
            iterations=[
                IterationResult(
                    iteration=1,
                    step_outputs={
                        "review_result": (
                            "### Issues Found\n"
                            "- ISS-001 [Critical][Omission] Missing auth check\n"
                            "- ISS-002 [Major][Omission] No input validation\n"
                            "### Verdict\nVERDICT: FAIL"
                        ),
                    },
                    verdict="FAIL",
                ),
            ],
            final_verdict="FAIL",
        )
        report = build_report(config, result)
        self.assertIn("Issue Tracker Summary", report)
        self.assertIn("ISS-001", report)
        self.assertIn("ISS-002", report)

    def test_report_includes_senior_tracker_table(self) -> None:
        config = PipelineConfig(
            language="en",
            pipeline=[
                StepConfig(
                    name="senior_review",
                    agent="claude-senior",
                    role="review",
                    prompt_template="default:aggregate-review",
                    output_key="senior_review_result",
                    verdict=True,
                ),
            ],
        )
        result = PipelineResult(
            iterations=[
                IterationResult(
                    iteration=1,
                    step_outputs={
                        "senior_review_result": (
                            "### Confirmed Issues\n- Missing auth\n\n"
                            "## Issue Tracker\n"
                            "| ISS-ID | Severity | Description | Status | Since |\n"
                            "|--------|----------|-------------|--------|-------|\n"
                            "| ISS-001 | Critical | Missing auth check | Open | v1 |\n"
                            "| ISS-002 | Major | No validation | Fixed | v1 |\n"
                            "\n### Verdict\nVERDICT: FAIL"
                        ),
                    },
                    verdict="FAIL",
                ),
            ],
            final_verdict="FAIL",
        )
        report = build_report(config, result)
        self.assertIn("Issue Tracker Summary", report)
        self.assertIn("ISS-001", report)
        self.assertIn("Fixed", report)

    def test_aggregate_template_ko_has_tracker(self) -> None:
        self.assertIn("{previous_senior_tracker}", AGGREGATE_REVIEW_TEMPLATE_KO)
        self.assertIn("이슈 트래커", AGGREGATE_REVIEW_TEMPLATE_KO)
        self.assertIn("VERDICT: ESCALATE", AGGREGATE_REVIEW_TEMPLATE_KO)


class FixPresetBehaviorTest(unittest.TestCase):
    def _write_fix_config(self, root: Path, *, max_iterations: int = 7) -> Path:
        (root / "plan.md").write_text("# plan\n", encoding="utf-8")
        (root / "checklist.md").write_text("# checklist\n", encoding="utf-8")
        config_path = root / "config.yaml"
        config_path.write_text(
            (
                "inputs:\n"
                " plan: plan.md\n"
                " checklist: checklist.md\n"
                "coders: [claude-coder]\n"
                "reviewers: [claude-reviewer]\n"
                "pipeline: preset:review-fix\n"
                f"max_iterations: {max_iterations}\n"
                "language: en\n"
            ),
            encoding="utf-8",
        )
        return config_path

    def test_load_config_syncs_phased_iterations_and_enables_agentic(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            config = load_config(self._write_fix_config(Path(tmpdir), max_iterations=7))
        self.assertEqual(config.preset_name, "review-fix")
        self.assertEqual(config.phases[0].max_iterations, 7)
        self.assertTrue(config.agents["claude-coder"].agentic)
        self.assertNotIn("-p", config.agents["claude-coder"].args)

    def test_run_config_max_iter_updates_existing_phased_pipeline(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            config_path = self._write_fix_config(Path(tmpdir), max_iterations=7)
            captured: dict[str, object] = {}

            def _fake_run_pipeline(config, **kwargs):
                captured["phase_max"] = config.phases[0].max_iterations
                captured["agentic"] = config.agents[config.coders[0]].agentic
                return PipelineResult(
                    iterations=[],
                    final_verdict="PASS",
                    run_dir=Path(tmpdir) / "output",
                )

            with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
                exit_code = main([
                    "run",
                    "--config", str(config_path),
                    "--max-iter", "9",
                    "--dry-run",
                ])
        self.assertEqual(exit_code, 0)
        self.assertEqual(captured["phase_max"], 9)
        self.assertTrue(captured["agentic"])

    def test_run_preset_review_fix_auto_enables_agentic_without_flag(self) -> None:
        captured: dict[str, object] = {}

        def _fake_run_pipeline(config, **kwargs):
            captured["preset"] = config.preset_name
            captured["agentic"] = config.agents[config.coders[0]].agentic
            captured["phase_max"] = config.phases[0].max_iterations
            return PipelineResult(
                iterations=[],
                final_verdict="PASS",
                run_dir=Path(".cross-eval/output"),
            )

        with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
            exit_code = main(["run", "--preset", "review-fix", "--dry-run"])
        self.assertEqual(exit_code, 0)
        self.assertEqual(captured["preset"], "review-fix")
        self.assertTrue(captured["agentic"])
        self.assertEqual(captured["phase_max"], 3)


if __name__ == "__main__":
    unittest.main()
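
The auto-escalate tests in this file expect escalation only when the same file and the same issue wording recur, which fits the "(file, keyword) composite fingerprints" described in the commit message. The following is a minimal sketch of that idea, not the project's `_detect_auto_escalate`: the names `fingerprints` and `detect_repeated_issue`, both regular expressions, the four-letter keyword cutoff, and the stopword list are all assumptions made for illustration.

```python
from __future__ import annotations

import re

# Hypothetical patterns: a path-like token ending in a short extension,
# and lowercase words of four or more letters used as keywords.
_PATH_RE = re.compile(r"[\w./-]+\.[A-Za-z]{1,5}\b")
_WORD_RE = re.compile(r"[a-z]{4,}")
# Assumed stopwords: generic review vocabulary that would match everything.
_STOPWORDS = frozenset({"issue", "issues", "found", "file"})


def fingerprints(feedback: str) -> set[tuple[str, str]]:
    """Return (file, keyword) pairs for every line that names a file."""
    pairs: set[tuple[str, str]] = set()
    for line in feedback.splitlines():
        paths = _PATH_RE.findall(line)
        if not paths:
            continue
        # Remove the paths first so "auth" in src/auth.py is not a keyword.
        rest = _PATH_RE.sub(" ", line).lower()
        words = set(_WORD_RE.findall(rest)) - _STOPWORDS
        pairs.update((path, word) for path in paths for word in words)
    return pairs


def detect_repeated_issue(
    previous: list[str], current: str, threshold: int = 2
) -> bool:
    """True if some pair in `current` occurs in at least `threshold`
    of the earlier feedback texts."""
    earlier = [fingerprints(text) for text in previous]
    return any(
        sum(pair in seen for seen in earlier) >= threshold
        for pair in fingerprints(current)
    )
```

Keying on the pair rather than on the file alone is what keeps three unrelated findings in `src/utils.py` from counting as one stuck issue.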

267
tests/test_onboarding.py Normal file

@@ -0,0 +1,267 @@
"""Tests for doctor, demo, and guided init features."""
from __future__ import annotations

import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch, MagicMock

from cross_eval.doctor import (
    DoctorCheck,
    check_cli_installed,
    check_config,
    format_doctor_results,
    run_doctor,
)
from cross_eval.demo import (
    DEMO_CHECKLIST,
    DEMO_PLAN,
    run_mock_demo,
)
from cross_eval.cli import (
    _generate_guided_config,
    _prompt_choice,
    _prompt_text,
    main,
)


# ---------------------------------------------------------------------------
# Doctor tests
# ---------------------------------------------------------------------------


class DoctorCheckInstalledTest(unittest.TestCase):
    def test_check_cli_installed_found(self) -> None:
        with patch("cross_eval.doctor.shutil.which", return_value="/usr/bin/python3"):
            with patch("cross_eval.doctor.subprocess.run") as mock_run:
                mock_run.return_value = MagicMock(
                    stdout="Python 3.12.0", stderr=""
                )
                found, version = check_cli_installed("python3")
        self.assertTrue(found)
        self.assertIn("Python", version)

    def test_check_cli_installed_not_found(self) -> None:
        with patch("cross_eval.doctor.shutil.which", return_value=None):
            found, msg = check_cli_installed("nonexistent-tool")
        self.assertFalse(found)
        self.assertIn("not found", msg)

    def test_check_config_exists_valid(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            ce_dir = Path(tmpdir) / ".cross-eval"
            ce_dir.mkdir()
            config_path = ce_dir / "config.yaml"
            config_path.write_text(
                "inputs:\n plan: plan.md\ncoders: [claude-coder]\n"
                "reviewers: [claude-reviewer]\npipeline: preset:simple\n",
                encoding="utf-8",
            )
            # Also create plan.md so validation passes
            (ce_dir / "plan.md").write_text("# Plan", encoding="utf-8")
            ok, path, errors = check_config(Path(tmpdir))
        self.assertTrue(ok)
        self.assertIsNotNone(path)
        self.assertEqual(errors, [])

    def test_check_config_not_exists(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            ok, path, errors = check_config(Path(tmpdir))
        self.assertFalse(ok)
        self.assertIsNone(path)

    def test_check_config_invalid(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            ce_dir = Path(tmpdir) / ".cross-eval"
            ce_dir.mkdir()
            # Valid YAML but missing required fields → validation fails
            (ce_dir / "config.yaml").write_text(
                "inputs:\n plan: /nonexistent/plan.md\n",
                encoding="utf-8",
            )
            ok, path, errors = check_config(Path(tmpdir))
        self.assertFalse(ok)
        self.assertIsNotNone(path)

    def test_format_doctor_results_all_pass(self) -> None:
        checks = [
            DoctorCheck("test", True, True, "ok"),
            DoctorCheck("test2", True, False, "ok"),
        ]
        output = format_doctor_results(checks)
        self.assertIn("", output)
        self.assertIn("All checks passed", output)

    def test_format_doctor_results_critical_fail(self) -> None:
        checks = [
            DoctorCheck("claude CLI", False, True, "not found"),
        ]
        output = format_doctor_results(checks)
        self.assertIn("", output)
        self.assertIn("critical", output.lower())

    def test_cmd_doctor_returns_0_all_pass(self) -> None:
        with patch("cross_eval.doctor.run_doctor") as mock:
            mock.return_value = [
                DoctorCheck("test", True, True, "ok"),
            ]
            exit_code = main(["doctor"])
        self.assertEqual(exit_code, 0)

    def test_cmd_doctor_returns_1_critical_fail(self) -> None:
        with patch("cross_eval.doctor.run_doctor") as mock:
            mock.return_value = [
                DoctorCheck("claude CLI", False, True, "not found"),
            ]
            exit_code = main(["doctor"])
        self.assertEqual(exit_code, 1)


# ---------------------------------------------------------------------------
# Demo tests
# ---------------------------------------------------------------------------


class DemoTest(unittest.TestCase):
    def test_demo_plan_is_nonempty(self) -> None:
        self.assertIn("fibonacci", DEMO_PLAN.lower())

    def test_demo_checklist_is_nonempty(self) -> None:
        self.assertIn("fibonacci", DEMO_CHECKLIST.lower())

    def test_mock_demo_runs_without_error(self) -> None:
        # Should not raise
        with patch("sys.stdout"):
            run_mock_demo(preset="simple")

    def test_mock_demo_escalate_runs_without_error(self) -> None:
        with patch("sys.stdout"):
            run_mock_demo(preset="simple", show_escalate=True)

    def test_cmd_demo_mock_default(self) -> None:
        with patch("cross_eval.demo.run_mock_demo") as mock:
            exit_code = main(["demo"])
        mock.assert_called_once_with(preset="simple", show_escalate=False)
        self.assertEqual(exit_code, 0)

    def test_cmd_demo_escalate_flag(self) -> None:
        with patch("cross_eval.demo.run_mock_demo") as mock:
            exit_code = main(["demo", "--escalate"])
        mock.assert_called_once_with(preset="simple", show_escalate=True)
        self.assertEqual(exit_code, 0)

    def test_cmd_demo_live_requires_confirmation(self) -> None:
        with patch("builtins.input", return_value="n"):
            exit_code = main(["demo", "--live"])
        self.assertEqual(exit_code, 0)


# ---------------------------------------------------------------------------
# Guided init tests
# ---------------------------------------------------------------------------


class GuidedInitTest(unittest.TestCase):
    def test_prompt_choice_default(self) -> None:
        with patch("builtins.input", return_value=""):
            result = _prompt_choice("Pick:", ["a", "b", "c"], default=2)
        self.assertEqual(result, "b")

    def test_prompt_choice_by_number(self) -> None:
        with patch("builtins.input", return_value="3"):
            result = _prompt_choice("Pick:", ["a", "b", "c"], default=1)
        self.assertEqual(result, "c")

    def test_prompt_choice_by_name(self) -> None:
        with patch("builtins.input", return_value="simple"):
            result = _prompt_choice("Pick:", ["simple", "review-fix"], default=1)
        self.assertEqual(result, "simple")

    def test_prompt_text_default(self) -> None:
        with patch("builtins.input", return_value=""):
            result = _prompt_text("Name", default="claude")
        self.assertEqual(result, "claude")

    def test_prompt_text_custom(self) -> None:
        with patch("builtins.input", return_value="codex"):
            result = _prompt_text("Name", default="claude")
        self.assertEqual(result, "codex")

    def test_generate_guided_config(self) -> None:
        config = _generate_guided_config(
            "review-fix", "ko",
            {
                "coder": "claude",
                "reviewer": "codex",
                "senior": "codex",
                "max_iter": 5,
            },
        )
        self.assertIn("preset:review-fix", config)
        self.assertIn("language: ko", config)
        self.assertIn("claude-coder", config)
        self.assertIn("codex-reviewer", config)
        self.assertIn("codex-senior", config)
        self.assertIn("max_iterations: 5", config)

    def test_generate_guided_config_full_name(self) -> None:
        config = _generate_guided_config(
            "simple", "ko",
            {
                "coder": "claude-coder",
                "reviewer": "codex-reviewer",
                "senior": "",
                "max_iter": 3,
            },
        )
        # Full names should not be double-suffixed
        self.assertIn("claude-coder", config)
        self.assertNotIn("claude-coder-coder", config)
        self.assertIn("codex-reviewer", config)
        self.assertNotIn("codex-reviewer-reviewer", config)

    def test_generate_guided_config_no_senior(self) -> None:
        config = _generate_guided_config(
            "simple", "en",
            {
                "coder": "claude",
                "reviewer": "claude",
                "senior": "",
                "max_iter": 3,
            },
        )
        self.assertNotIn("senior", config.lower())

    def test_guided_init_creates_files(self) -> None:
        # Simulate guided init with all defaults
        inputs = iter(["", "", "", "", "", "", ""])
        with tempfile.TemporaryDirectory() as tmpdir:
            with patch("builtins.input", side_effect=lambda _="": next(inputs, "")):
                exit_code = main(["init", "--guided", "--dir", tmpdir])
            config_path = Path(tmpdir) / ".cross-eval" / "config.yaml"
            self.assertTrue(config_path.exists())
            self.assertEqual(exit_code, 0)

    def test_guided_init_preserves_existing_files(self) -> None:
        inputs = iter(["", "", "", "", "", "", ""])
        with tempfile.TemporaryDirectory() as tmpdir:
            ce_dir = Path(tmpdir) / ".cross-eval"
            ce_dir.mkdir()
            existing = ce_dir / "config.yaml"
            existing.write_text("# existing", encoding="utf-8")
            with patch("builtins.input", side_effect=lambda _="": next(inputs, "")):
                main(["init", "--guided", "--dir", tmpdir])
            # Should not overwrite
            self.assertEqual(existing.read_text(), "# existing")


if __name__ == "__main__":
    unittest.main()
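
The guided-init tests in this file pin down two small rules: a bare agent name such as `claude` gains a role suffix while a full name such as `claude-coder` is left alone, and a prompt accepts an empty answer, a 1-based number, or an option name. A rough sketch of both rules is shown here. `with_role_suffix` and `resolve_choice` are hypothetical helpers, not the project's `_generate_guided_config` or `_prompt_choice`, and the fallback for unrecognised input is an assumption.

```python
from __future__ import annotations


def with_role_suffix(name: str, role: str) -> str:
    """Append "-<role>" unless the name already ends with that suffix."""
    suffix = f"-{role}"
    return name if name.endswith(suffix) else name + suffix


def resolve_choice(raw: str, options: list[str], default: int = 1) -> str:
    """Map raw user input to one of `options`; `default` is 1-based."""
    answer = raw.strip()
    # A number in range selects by position.
    if answer.isdecimal() and 1 <= int(answer) <= len(options):
        return options[int(answer) - 1]
    # An exact option name selects itself.
    if answer in options:
        return answer
    # Empty input uses the default. Falling back to the default for
    # unrecognised input as well is an assumption of this sketch.
    return options[default - 1]
```

Checking for the suffix before appending is what prevents names like `claude-coder-coder` from appearing in the generated config.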


@@ -0,0 +1,461 @@
"""Integration tests for cross-eval pipeline with mocked agents."""
from __future__ import annotations

import tempfile
import unittest
from pathlib import Path
from unittest.mock import patch

from cross_eval.config import BUILTIN_AGENTS
from cross_eval.models import (
    AgentConfig,
    AgentResult,
    PhaseConfig,
    PipelineConfig,
    StepConfig,
)
from cross_eval.pipeline import run_pipeline
from cross_eval.prompts import _build_review_fix_preset, _build_simple_preset


def _make_mock_agent(outputs: list[str]):
    """Returns a side_effect function that returns outputs in sequence,
    repeating the last output once the list is exhausted."""
    call_count = [0]

    def _mock(agent_config, prompt, step_name, **kwargs):
        idx = min(call_count[0], len(outputs) - 1)
        call_count[0] += 1
        return AgentResult(
            output=outputs[idx],
            exit_code=0,
            agent_name=agent_config.name,
            step_name=step_name,
            duration_seconds=0.1,
        )

    return _mock


def _make_step_mock(step_outputs: dict[str, list[str]]):
    """Returns a side_effect that dispatches by step_name. Each step walks its
    own output list in order and repeats the last entry once it is exhausted."""
    counters: dict[str, int] = {}

    def _mock(agent_config, prompt, step_name, **kwargs):
        if step_name not in counters:
            counters[step_name] = 0
        outputs = step_outputs.get(step_name, [""])
        idx = min(counters[step_name], len(outputs) - 1)
        counters[step_name] += 1
        return AgentResult(
            output=outputs[idx],
            exit_code=0,
            agent_name=agent_config.name,
            step_name=step_name,
            duration_seconds=0.1,
        )

    return _mock


def _minimal_simple_config(
    run_dir: Path,
    max_iterations: int = 3,
    seniors: list[str] | None = None,
) -> PipelineConfig:
    """Build a minimal simple pipeline config for testing."""
    coders = ["claude-coder"]
    reviewers = ["claude-reviewer"]
    senior_list = seniors if seniors is not None else []
    steps = _build_simple_preset(coders, reviewers, senior_list)
    agents = dict(BUILTIN_AGENTS)
    return PipelineConfig(
        output_dir=run_dir,
        max_iterations=max_iterations,
        min_iterations=1,
        language="en",
        inputs={"plan": "Test plan", "checklist": "Test checklist"},
        agents=agents,
        coders=coders,
        reviewers=reviewers,
        seniors=senior_list,
        pipeline=steps,
        preset_name="simple",
    )


class TestSimplePipelinePassStopsLoop(unittest.TestCase):
    """Test 1: mock agent returns VERDICT: PASS on first review -> stops at iteration 1."""

    def test_simple_pipeline_pass_stops_loop(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            config = _minimal_simple_config(Path(tmpdir))
            mock = _make_mock_agent([
                "Coding output here",  # coding step
                "All good\n\nVERDICT: PASS",  # review step
            ])
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "PASS")
            self.assertEqual(len(result.iterations), 1)


class TestSimplePipelineFailThenPass(unittest.TestCase):
    """Test 2: FAIL on first review, PASS on second -> 2 iterations."""

    def test_simple_pipeline_fail_then_pass(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            config = _minimal_simple_config(Path(tmpdir), max_iterations=5)
            mock = _make_step_mock({
                "coding": ["Coding output v1", "Coding output v2"],
                "review": [
                    "Issues found\n\nVERDICT: FAIL",
                    "All good\n\nVERDICT: PASS",
                ],
            })
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "PASS")
            self.assertEqual(len(result.iterations), 2)


class TestSimplePipelineEscalateBreaksLoop(unittest.TestCase):
    """Test 3: ESCALATE on review -> stops immediately, final_verdict=ESCALATE."""

    def test_simple_pipeline_escalate_breaks_loop(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            config = _minimal_simple_config(
                Path(tmpdir), max_iterations=5, seniors=["claude-senior"],
            )
            escalate_output = (
                "### Confirmed Issues\n"
                "- [Critical] Requirements are ambiguous\n\n"
                "### Escalated Issues\n"
                "Requirements need stakeholder clarification\n\n"
                "### Verdict\n"
                "VERDICT: ESCALATE\n"
            )
            mock = _make_step_mock({
                "coding": ["Coding output"],
                "review": ["Issues found\n\nVERDICT: FAIL"],
                "senior_review": [escalate_output],
            })
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "ESCALATE")
            self.assertEqual(len(result.iterations), 1)
            self.assertTrue(len(result.escalated_issues) > 0)


class TestSimplePipelineEscalatePriorityOverPass(unittest.TestCase):
    """Test 4: one verdict step returns PASS, another returns ESCALATE -> ESCALATE wins."""

    def test_simple_pipeline_escalate_priority_over_pass(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            # Build a custom pipeline with 2 verdict steps (no senior)
            steps = [
                StepConfig(
                    name="coding",
                    agent="claude-coder",
                    role="coding",
                    prompt_template="default:coding",
                    output_key="coding_output",
                ),
                StepConfig(
                    name="review_a",
                    agent="claude-reviewer",
                    role="review",
                    prompt_template="default:review",
                    output_key="review_a_result",
                    verdict=True,
                ),
                StepConfig(
                    name="review_b",
                    agent="claude-reviewer",
                    role="review",
                    prompt_template="default:review",
                    output_key="review_b_result",
                    verdict=True,
                ),
            ]
            config = PipelineConfig(
                output_dir=Path(tmpdir),
                max_iterations=3,
                min_iterations=1,
                language="en",
                inputs={"plan": "Test plan", "checklist": "Test checklist"},
                agents=dict(BUILTIN_AGENTS),
                coders=["claude-coder"],
                reviewers=["claude-reviewer"],
                pipeline=steps,
                preset_name="custom",
            )
            escalate_output = (
                "### Escalated Issues\n"
                "Ambiguous requirements need clarification\n\n"
                "VERDICT: ESCALATE\n"
            )
            mock = _make_step_mock({
                "coding": ["Coding output"],
                "review_a": ["All good\n\nVERDICT: PASS"],
                "review_b": [escalate_output],
            })
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "ESCALATE")
            self.assertTrue(len(result.escalated_issues) > 0)


class TestPhasedPipelineEscalateBreaksPhase(unittest.TestCase):
    """Test 5: phased pipeline (review-fix), verify step returns ESCALATE -> phase stops."""

    def test_phased_pipeline_escalate_breaks_phase(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            coders = ["claude-coder"]
            reviewers = ["claude-reviewer"]
            seniors = ["claude-senior"]
            phases = _build_review_fix_preset(coders, reviewers, seniors)
            config = PipelineConfig(
                output_dir=Path(tmpdir),
                max_iterations=5,
                min_iterations=1,
                language="en",
                inputs={"plan": "Test plan", "checklist": "Test checklist"},
                agents=dict(BUILTIN_AGENTS),
                coders=coders,
                reviewers=reviewers,
                seniors=seniors,
                phases=phases,
                preset_name="review-fix",
            )
            escalate_output = (
                "### Escalated Issues\n"
                "Architecture decisions needed beyond plan scope\n\n"
                "### Verdict\n"
                "VERDICT: ESCALATE\n"
            )
            mock = _make_step_mock({
                "review_claude_reviewer": ["Review findings here"],
                "aggregate_review": ["Aggregated review\n\nAction items: fix X"],
                "coding": ["Fixed code"],
                "verify": [escalate_output],
            })
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "ESCALATE")
            self.assertTrue(len(result.escalated_issues) > 0)


class TestAutoEscalateFiresWithoutSenior(unittest.TestCase):
    """Test 6: simple pipeline without senior, same FAIL feedback 3 times -> auto-escalate."""

    def test_auto_escalate_fires_without_senior(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            # No seniors -> review step has verdict=True
            config = _minimal_simple_config(
                Path(tmpdir), max_iterations=5, seniors=None,
            )
            # Same feedback mentioning the same file paths across all iterations
            repeated_fail = (
                "Issues found in src/auth.py: missing validation check.\n"
                "The file src/auth.py still has the same problem.\n\n"
                "VERDICT: FAIL"
            )
            mock = _make_step_mock({
                "coding": ["Coding output v1", "Coding output v2", "Coding output v3"],
                "review": [repeated_fail, repeated_fail, repeated_fail],
            })
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "ESCALATE")
            self.assertTrue(
                any("Auto-escalated" in iss for iss in result.escalated_issues),
            )


class TestAutoEscalateDoesNotFireWithSenior(unittest.TestCase):
    """Test 7: same repeated FAIL but WITH senior/aggregate step -> no auto-escalate."""

    def test_auto_escalate_does_not_fire_with_senior(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            # With seniors -> senior_review step has verdict=True, review does not
            config = _minimal_simple_config(
                Path(tmpdir), max_iterations=5, seniors=["claude-senior"],
            )
            repeated_fail_review = (
                "Issues found in src/auth.py: missing validation check.\n"
                "VERDICT: FAIL"
            )
            # Senior also returns FAIL but the auto-escalate should NOT fire
            # because has_aggregator is True (seniors list is populated)
            senior_fail = (
                "### Confirmed Issues\n"
                "- Missing validation in src/auth.py\n\n"
                "### Action Items\n"
                "1. Add validation in src/auth.py\n\n"
                "VERDICT: FAIL"
            )
            mock = _make_step_mock({
                "coding": [
                    "Coding output v1",
                    "Coding output v2",
                    "Coding output v3",
                    "Coding output v4",
                    "Coding output v5",
                ],
                "review": [
                    repeated_fail_review,
                    repeated_fail_review,
                    repeated_fail_review,
                    repeated_fail_review,
                    repeated_fail_review,
                ],
                "senior_review": [
                    senior_fail,
                    senior_fail,
                    senior_fail,
                    senior_fail,
                    senior_fail,
                ],
            })
            with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
                result = run_pipeline(config)
            # Should NOT auto-escalate; should reach max iterations
            self.assertNotEqual(result.final_verdict, "ESCALATE")
            self.assertEqual(result.final_verdict, "MAX_ITERATIONS_REACHED")
            self.assertEqual(len(result.iterations), 5)


class TestTrackerExtractionAcrossIterations(unittest.TestCase):
    """Test 8: senior review output with Issue Tracker table -> passed to next iteration."""

    def test_tracker_extraction_across_iterations(self) -> None:
        with tempfile.TemporaryDirectory() as tmpdir:
            config = _minimal_simple_config(
                Path(tmpdir), max_iterations=3, seniors=["claude-senior"],
            )
            tracker_table = (
                "## Issue Tracker\n"
                "| ISS-ID | Severity | Description | Status | Since |\n"
                "|--------|----------|-------------|--------|-------|\n"
                "| ISS-001 | Critical | Missing auth check | Open | v1 |\n"
                "| ISS-002 | Major | No validation | Open | v1 |\n"
            )
            senior_output_v1 = (
                "### Confirmed Issues\n"
                "- Missing auth\n\n"
                f"{tracker_table}\n"
                "### Verdict\n"
                "VERDICT: FAIL"
            )
            senior_output_v2 = (
                "### Confirmed Issues\n"
                "- None remaining\n\n"
                "## Issue Tracker\n"
                "| ISS-ID | Severity | Description | Status | Since |\n"
                "|--------|----------|-------------|--------|-------|\n"
                "| ISS-001 | Critical | Missing auth check | Fixed | v1 |\n"
                "| ISS-002 | Major | No validation | Fixed | v1 |\n"
                "\n### Verdict\n"
                "VERDICT: PASS"
            )
            captured_prompts: list[dict[str, str]] = []

            def _tracking_mock(agent_config, prompt, step_name, **kwargs):
                captured_prompts.append({
                    "step_name": step_name,
                    "prompt": prompt,
                    "agent_name": agent_config.name,
                })
                if step_name == "coding":
                    return AgentResult(
                        output="Coding output",
                        exit_code=0,
                        agent_name=agent_config.name,
                        step_name=step_name,
                        duration_seconds=0.1,
                    )
                elif step_name == "review":
                    return AgentResult(
                        output="Review findings\n\nVERDICT: FAIL",
                        exit_code=0,
                        agent_name=agent_config.name,
                        step_name=step_name,
                        duration_seconds=0.1,
                    )
                elif step_name == "senior_review":
                    # First call: FAIL with tracker, second call: PASS
                    senior_calls = [
                        p for p in captured_prompts if p["step_name"] == "senior_review"
                    ]
                    if len(senior_calls) <= 1:
                        output = senior_output_v1
                    else:
                        output = senior_output_v2
                    return AgentResult(
                        output=output,
                        exit_code=0,
                        agent_name=agent_config.name,
                        step_name=step_name,
                        duration_seconds=0.1,
                    )
                return AgentResult(
                    output="",
                    exit_code=0,
                    agent_name=agent_config.name,
                    step_name=step_name,
                    duration_seconds=0.1,
                )

            with patch("cross_eval.pipeline.invoke_agent", side_effect=_tracking_mock):
                result = run_pipeline(config)
            self.assertEqual(result.final_verdict, "PASS")
            self.assertEqual(len(result.iterations), 2)
            # Verify that the second iteration's senior_review prompt contains
            # the tracker table from iteration 1
            iter2_senior_prompts = [
                p for p in captured_prompts
                if p["step_name"] == "senior_review"
                and "ISS-001" in p["prompt"]
                and "Missing auth check" in p["prompt"]
            ]
            # The second senior_review call should have the tracker in its prompt
            self.assertTrue(
                len(iter2_senior_prompts) >= 1,
                "Expected previous_senior_tracker content (ISS-001) to appear "
                "in at least one senior_review prompt",
            )


if __name__ == "__main__":
    unittest.main()
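
The integration tests above rely on two behaviours from the commit message: ESCALATE takes priority over PASS and FAIL when several verdict steps report, and the CLI exits with PASS=0, FAIL=1, ESCALATE=2. One possible shape for that logic is sketched below. `extract_verdict`, `combine_verdicts`, and the regular expression are hypothetical and not taken from `cross_eval`, and ranking FAIL above PASS is an assumption, since the source only states where ESCALATE ranks.

```python
from __future__ import annotations

import re

# Hypothetical pattern: a line that starts with "VERDICT: <value>".
_VERDICT_RE = re.compile(r"^\s*VERDICT:\s*(PASS|FAIL|ESCALATE)\b", re.MULTILINE)

# ESCALATE outranks everything (from the commit message).
# FAIL outranking PASS is an assumption of this sketch.
_PRIORITY = {"PASS": 0, "FAIL": 1, "ESCALATE": 2}

# Exit codes as listed in the commit message.
EXIT_CODES = {"PASS": 0, "FAIL": 1, "ESCALATE": 2}


def extract_verdict(output: str) -> str | None:
    """Return the highest-priority verdict found in one step's output."""
    found = _VERDICT_RE.findall(output)
    if not found:
        return None
    return max(found, key=_PRIORITY.__getitem__)


def combine_verdicts(outputs: list[str]) -> str | None:
    """Return the highest-priority verdict across several step outputs."""
    verdicts = [v for v in map(extract_verdict, outputs) if v is not None]
    return max(verdicts, key=_PRIORITY.__getitem__, default=None)
```

For the two-reviewer scenario in Test 4, `combine_verdicts` would be given one PASS output and one ESCALATE output and would return `"ESCALATE"`, which maps to exit code 2.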