Compare commits
2 Commits
main
...
941304398d
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
941304398d | ||
|
|
204e071b74 |
@@ -41,7 +41,7 @@ inputs:
|
|||||||
checklist: checklist.md
|
checklist: checklist.md
|
||||||
|
|
||||||
agents:
|
agents:
|
||||||
generator:
|
coder:
|
||||||
command: claude
|
command: claude
|
||||||
args: ["-p", "--model", "sonnet", "--permission-mode", "auto"]
|
args: ["-p", "--model", "sonnet", "--permission-mode", "auto"]
|
||||||
system_prompt: "You are a senior software engineer. Follow the plan precisely."
|
system_prompt: "You are a senior software engineer. Follow the plan precisely."
|
||||||
@@ -53,14 +53,16 @@ agents:
|
|||||||
# 방법 1: 프리셋 사용 (사용자가 pipeline YAML 직접 작성할 필요 없음)
|
# 방법 1: 프리셋 사용 (사용자가 pipeline YAML 직접 작성할 필요 없음)
|
||||||
pipeline: preset:simple # "A 생성 → B 리뷰" (기본값)
|
pipeline: preset:simple # "A 생성 → B 리뷰" (기본값)
|
||||||
# pipeline: preset:cross-review # "둘 다 생성 → 서로 리뷰"
|
# pipeline: preset:cross-review # "둘 다 생성 → 서로 리뷰"
|
||||||
|
# pipeline: preset:plan-review # "구현 전 문서/기획 검토"
|
||||||
|
# pipeline: preset:coding-review-fix # "초기 코딩 1회 → 리뷰/수정 반복"
|
||||||
|
|
||||||
# 방법 2: 직접 커스텀 (고급 사용자용)
|
# 방법 2: 직접 커스텀 (고급 사용자용)
|
||||||
# pipeline:
|
# pipeline:
|
||||||
# - name: generate
|
# - name: coding
|
||||||
# agent: generator
|
# agent: coder
|
||||||
# role: generate
|
# role: coding
|
||||||
# prompt_template: "default:generate"
|
# prompt_template: "default:coding"
|
||||||
# output_key: generated_code
|
# output_key: coding_output
|
||||||
# - name: review
|
# - name: review
|
||||||
# agent: reviewer
|
# agent: reviewer
|
||||||
# role: review
|
# role: review
|
||||||
@@ -73,8 +75,10 @@ pipeline: preset:simple # "A 생성 → B 리뷰" (기본값)
|
|||||||
|
|
||||||
| 프리셋 | 설명 | 자동 생성되는 steps |
|
| 프리셋 | 설명 | 자동 생성되는 steps |
|
||||||
|--------|------|-------------------|
|
|--------|------|-------------------|
|
||||||
| `simple` | A 생성 → B 리뷰 | generate(agent1) → review(agent2) |
|
| `simple` | A 코딩 → B 리뷰 | coding(agent1) → review(agent2) |
|
||||||
| `cross-review` | 둘 다 생성, 서로 리뷰 | gen_a → gen_b → review_of_b(agent_a) → review_of_a(agent_b) |
|
| `cross-review` | 둘 다 코딩, 서로 리뷰 | coding_a → coding_b → review_of_b(agent_a) → review_of_a(agent_b) |
|
||||||
|
| `plan-review` | 구현 전 문서 검토 | parallel plan_review_* → senior_review(optional) |
|
||||||
|
| `coding-review-fix` | 초기 코딩 후 리뷰/수정 반복 | initial_coding(coding) → review_fix(review* → aggregate → coding → verify) |
|
||||||
|
|
||||||
프리셋은 내부적으로 적절한 pipeline steps + context_override를 자동 구성한다. agents에 정의된 순서대로 agent1, agent2가 배정된다. 프리셋이 불충분하면 직접 steps를 작성할 수 있다.
|
프리셋은 내부적으로 적절한 pipeline steps + context_override를 자동 구성한다. agents에 정의된 순서대로 agent1, agent2가 배정된다. 프리셋이 불충분하면 직접 steps를 작성할 수 있다.
|
||||||
|
|
||||||
@@ -109,11 +113,11 @@ cross_eval/
|
|||||||
- verdict_pattern 유효한 정규식인지
|
- verdict_pattern 유효한 정규식인지
|
||||||
|
|
||||||
**prompts.py** — 기본 프롬프트 2종 + 파이프라인 프리셋 정의:
|
**prompts.py** — 기본 프롬프트 2종 + 파이프라인 프리셋 정의:
|
||||||
- `default:generate` — "기획서에 명시된 것만 구현하라, 과최적화 금지" + plan/checklist/feedback + **"프로젝트 디렉토리의 기존 코드를 탐색하여 컨텍스트를 파악하라"** 지시
|
- `default:coding` — "기획서에 명시된 것만 구현하라, 과최적화 금지" + plan/checklist/feedback + **"프로젝트 디렉토리의 기존 코드를 탐색하여 컨텍스트를 파악하라"** 지시
|
||||||
- `default:review` — 과최적화/오탐/누락 3기준 검토 + `VERDICT: PASS|FAIL` 출력 + **"프로젝트 디렉토리를 직접 탐색하여 코드를 검증하라"** 지시
|
- `default:review` — 과최적화/오탐/누락 3기준 검토 + `VERDICT: PASS|FAIL` 출력 + **"프로젝트 디렉토리를 직접 탐색하여 코드를 검증하라"** 지시
|
||||||
- `{variable}` 플레이스홀더, 누락 시 `(no {key} provided)` 출력
|
- `{variable}` 플레이스홀더, 누락 시 `(no {key} provided)` 출력
|
||||||
- 사용자가 커스텀 .md 파일로 오버라이드 가능
|
- 사용자가 커스텀 .md 파일로 오버라이드 가능
|
||||||
- `PIPELINE_PRESETS` dict: `simple`, `cross-review` 등 프리셋별 StepConfig 리스트 정의
|
- `PIPELINE_PRESETS` dict: `simple`, `cross-review`, `plan-review` 등 프리셋별 StepConfig 리스트 정의
|
||||||
|
|
||||||
**agent.py** — `invoke_agent(agent_config, prompt, cwd)`:
|
**agent.py** — `invoke_agent(agent_config, prompt, cwd)`:
|
||||||
- `cwd` 파라미터로 프로젝트 디렉토리 지정 → 에이전트가 해당 디렉토리에서 파일 탐색 가능
|
- `cwd` 파라미터로 프로젝트 디렉토리 지정 → 에이전트가 해당 디렉토리에서 파일 탐색 가능
|
||||||
@@ -141,7 +145,7 @@ final-report.md 생성
|
|||||||
- 최종 판정
|
- 최종 판정
|
||||||
|
|
||||||
**cli.py** — 서브커맨드:
|
**cli.py** — 서브커맨드:
|
||||||
- `cross-eval init [--dir .] [--preset simple|cross-review]` — 스캐폴딩 (기존 파일 안 덮어씀)
|
- `cross-eval init [--dir .] [--preset simple|cross-review|plan-review]` — 스캐폴딩 (기존 파일 안 덮어씀)
|
||||||
- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...]`
|
- `cross-eval run [-c config] [--max-iter N] [--dry-run] [--output-dir path] [--input key=path ...]`
|
||||||
- `--input key=path`: config의 inputs 오버라이드/추가
|
- `--input key=path`: config의 inputs 오버라이드/추가
|
||||||
- `--dry-run`: 에이전트 호출 없이 렌더링된 프롬프트만 출력
|
- `--dry-run`: 에이전트 호출 없이 렌더링된 프롬프트만 출력
|
||||||
@@ -167,3 +171,17 @@ final-report.md 생성
|
|||||||
3. `cross-eval run --dry-run` 로 프롬프트 렌더링 확인 (에이전트 호출 없이)
|
3. `cross-eval run --dry-run` 로 프롬프트 렌더링 확인 (에이전트 호출 없이)
|
||||||
4. plan.md/checklist.md에 간단한 내용 넣고 `cross-eval run --max-iter 2` 로 실제 실행
|
4. plan.md/checklist.md에 간단한 내용 넣고 `cross-eval run --max-iter 2` 로 실제 실행
|
||||||
5. `output/` 디렉토리에 v1/, final-report.md 생성 확인
|
5. `output/` 디렉토리에 v1/, final-report.md 생성 확인
|
||||||
|
|
||||||
|
|
||||||
|
cross-eval run \
|
||||||
|
--docs /Users/chungyeong/Desktop/Dev/new-alpha-foundry/plans/TO_CLICKHOUSE \
|
||||||
|
--preset coding-review-fix \
|
||||||
|
--coder claude \
|
||||||
|
--reviewer codex \
|
||||||
|
--reviewer codex \
|
||||||
|
--reviewer codex \
|
||||||
|
--senior codex \
|
||||||
|
--coder-effort high \
|
||||||
|
--reviewer-effort high \
|
||||||
|
--senior-effort xhigh \
|
||||||
|
--max-iter 10
|
||||||
|
|||||||
19
README.md
19
README.md
@@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
AI 에이전트 간 교차 검증을 자동화하는 CLI 도구.
|
AI 에이전트 간 교차 검증을 자동화하는 CLI 도구.
|
||||||
|
|
||||||
기획서와 체크리스트를 기반으로 "생성 → 리뷰 → 피드백 → 재생성" 루프를 자동으로 돌려서,
|
기획서와 체크리스트를 기반으로 "코딩 → 리뷰 → 피드백 → 재코딩" 루프를 자동으로 돌려서,
|
||||||
**과최적화 / 오탐 / 누락** 문제를 잡아냅니다.
|
**과최적화 / 오탐 / 누락** 문제를 잡아냅니다.
|
||||||
|
|
||||||
## 설치
|
## 설치
|
||||||
@@ -51,7 +51,7 @@ cp .cross-eval/checklist-sample.md .cross-eval/checklist.md
|
|||||||
### 3. 실행
|
### 3. 실행
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# 기본 실행 (생성 → 리뷰, 최대 3회 반복)
|
# 기본 실행 (코딩 → 리뷰, 최대 3회 반복)
|
||||||
cross-eval run
|
cross-eval run
|
||||||
|
|
||||||
# 프롬프트만 확인 (에이전트 호출 없이, 비용 절약)
|
# 프롬프트만 확인 (에이전트 호출 없이, 비용 절약)
|
||||||
@@ -72,10 +72,10 @@ cross-eval run --config .cross-eval/config.yaml
|
|||||||
```
|
```
|
||||||
output/
|
output/
|
||||||
├── v1/
|
├── v1/
|
||||||
│ ├── generate.md # 에이전트 생성 결과
|
│ ├── coding.md # 에이전트 코딩 결과
|
||||||
│ └── review.md # 에이전트 리뷰 결과
|
│ └── review.md # 에이전트 리뷰 결과
|
||||||
├── v2/
|
├── v2/
|
||||||
│ ├── generate.md
|
│ ├── coding.md
|
||||||
│ └── review.md
|
│ └── review.md
|
||||||
└── final-report.md # 전체 요약 리포트
|
└── final-report.md # 전체 요약 리포트
|
||||||
```
|
```
|
||||||
@@ -92,7 +92,7 @@ inputs:
|
|||||||
checklist: checklist.md
|
checklist: checklist.md
|
||||||
|
|
||||||
agents:
|
agents:
|
||||||
generator:
|
coder:
|
||||||
command: claude
|
command: claude
|
||||||
args: ["-p", "--model", "sonnet", "--permission-mode", "auto"]
|
args: ["-p", "--model", "sonnet", "--permission-mode", "auto"]
|
||||||
system_prompt: "You are a senior software engineer."
|
system_prompt: "You are a senior software engineer."
|
||||||
@@ -110,11 +110,16 @@ pipeline: preset:simple
|
|||||||
|
|
||||||
| 프리셋 | 설명 |
|
| 프리셋 | 설명 |
|
||||||
|--------|------|
|
|--------|------|
|
||||||
| `simple` | Agent A가 생성, Agent B가 리뷰 (기본값) |
|
| `simple` | Agent A가 코딩, Agent B가 리뷰 (기본값) |
|
||||||
| `cross-review` | 둘 다 생성, 서로 교차 리뷰 |
|
| `cross-review` | 둘 다 코딩, 서로 교차 리뷰 |
|
||||||
|
| `plan-review` | 구현 전 기획서/체크리스트/참고문서를 검토하고 필요시 현재 코드베이스와의 정합성도 확인 |
|
||||||
|
| `review-only` | 기존 코드만 감사 용도로 검토 |
|
||||||
|
| `review-fix` | 리뷰 결과를 취합한 뒤 자동 수정과 재검증까지 반복 |
|
||||||
|
| `coding-review-fix` | 초기 코딩 1회 후 리뷰 결과를 취합해 자동 수정과 재검증을 반복 |
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# 초기화 옵션
|
# 초기화 옵션
|
||||||
cross-eval init --preset cross-review # 교차 리뷰 프리셋
|
cross-eval init --preset cross-review # 교차 리뷰 프리셋
|
||||||
|
cross-eval init --preset plan-review # 구현 전 문서 검토 프리셋
|
||||||
cross-eval init --lang en # 영어 템플릿
|
cross-eval init --lang en # 영어 템플릿
|
||||||
```
|
```
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
Metadata-Version: 2.4
|
Metadata-Version: 2.4
|
||||||
Name: cross-eval
|
Name: cross-eval
|
||||||
Version: 0.1.0
|
Version: 0.2.0
|
||||||
Summary: AI agent cross-evaluation CLI tool
|
Summary: AI agent cross-evaluation CLI tool
|
||||||
Requires-Python: >=3.9
|
Requires-Python: >=3.9
|
||||||
Requires-Dist: pyyaml>=6.0
|
Requires-Dist: pyyaml>=6.0
|
||||||
|
|||||||
@@ -4,14 +4,21 @@ cross_eval/__init__.py
|
|||||||
cross_eval/agent.py
|
cross_eval/agent.py
|
||||||
cross_eval/cli.py
|
cross_eval/cli.py
|
||||||
cross_eval/config.py
|
cross_eval/config.py
|
||||||
|
cross_eval/demo.py
|
||||||
|
cross_eval/doctor.py
|
||||||
cross_eval/models.py
|
cross_eval/models.py
|
||||||
cross_eval/pipeline.py
|
cross_eval/pipeline.py
|
||||||
cross_eval/prompts.py
|
cross_eval/prompts.py
|
||||||
cross_eval/report.py
|
cross_eval/report.py
|
||||||
|
cross_eval/runtime_env.py
|
||||||
|
cross_eval/worktree.py
|
||||||
cross_eval.egg-info/PKG-INFO
|
cross_eval.egg-info/PKG-INFO
|
||||||
cross_eval.egg-info/SOURCES.txt
|
cross_eval.egg-info/SOURCES.txt
|
||||||
cross_eval.egg-info/dependency_links.txt
|
cross_eval.egg-info/dependency_links.txt
|
||||||
cross_eval.egg-info/entry_points.txt
|
cross_eval.egg-info/entry_points.txt
|
||||||
cross_eval.egg-info/requires.txt
|
cross_eval.egg-info/requires.txt
|
||||||
cross_eval.egg-info/top_level.txt
|
cross_eval.egg-info/top_level.txt
|
||||||
|
tests/test_agentic.py
|
||||||
tests/test_config.py
|
tests/test_config.py
|
||||||
|
tests/test_onboarding.py
|
||||||
|
tests/test_pipeline_integration.py
|
||||||
|
|||||||
@@ -1 +1 @@
|
|||||||
__version__ = "0.1.0"
|
__version__ = "0.2.0"
|
||||||
|
|||||||
@@ -3,8 +3,10 @@ from __future__ import annotations
|
|||||||
|
|
||||||
import itertools
|
import itertools
|
||||||
import logging
|
import logging
|
||||||
|
import os
|
||||||
import subprocess
|
import subprocess
|
||||||
import sys
|
import sys
|
||||||
|
import tempfile
|
||||||
import threading
|
import threading
|
||||||
import time
|
import time
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
@@ -19,6 +21,34 @@ _SYSTEM_PROMPT_AGENTS = ("claude",)
|
|||||||
_REASONING_EFFORT_AGENTS = ("codex",)
|
_REASONING_EFFORT_AGENTS = ("codex",)
|
||||||
|
|
||||||
|
|
||||||
|
class AgentInvocationError(RuntimeError):
|
||||||
|
"""Structured error for agent CLI failures."""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
*,
|
||||||
|
agent_name: str,
|
||||||
|
step_name: str,
|
||||||
|
cmd_preview: str,
|
||||||
|
raw_error: str,
|
||||||
|
failure_type: str,
|
||||||
|
suggested_action: str,
|
||||||
|
) -> None:
|
||||||
|
self.agent_name = agent_name
|
||||||
|
self.step_name = step_name
|
||||||
|
self.cmd_preview = cmd_preview
|
||||||
|
self.raw_error = raw_error
|
||||||
|
self.failure_type = failure_type
|
||||||
|
self.suggested_action = suggested_action
|
||||||
|
super().__init__(
|
||||||
|
f"Agent '{agent_name}' failed (exit code != 0) at step '{step_name}':\n"
|
||||||
|
f" type: {failure_type}\n"
|
||||||
|
f" cmd: {cmd_preview}\n"
|
||||||
|
f" error: {raw_error or '(no output)'}\n"
|
||||||
|
f" action: {suggested_action}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
def _supports_system_prompt_flag(command: str) -> bool:
|
def _supports_system_prompt_flag(command: str) -> bool:
|
||||||
"""Check if the agent CLI supports --system-prompt flag."""
|
"""Check if the agent CLI supports --system-prompt flag."""
|
||||||
return any(name in command for name in _SYSTEM_PROMPT_AGENTS)
|
return any(name in command for name in _SYSTEM_PROMPT_AGENTS)
|
||||||
@@ -29,6 +59,53 @@ def _supports_reasoning_effort(command: str) -> bool:
|
|||||||
return any(name in command for name in _REASONING_EFFORT_AGENTS)
|
return any(name in command for name in _REASONING_EFFORT_AGENTS)
|
||||||
|
|
||||||
|
|
||||||
|
def _classify_agent_failure(detail: str) -> tuple[str, str]:
|
||||||
|
"""Classify a failed agent invocation into a user-actionable bucket."""
|
||||||
|
normalized = detail.lower()
|
||||||
|
|
||||||
|
auth_markers = (
|
||||||
|
"not logged in",
|
||||||
|
"please run /login",
|
||||||
|
"auth",
|
||||||
|
"authentication",
|
||||||
|
"invalid api key",
|
||||||
|
"api key",
|
||||||
|
"unauthorized",
|
||||||
|
"forbidden",
|
||||||
|
)
|
||||||
|
usage_limit_markers = (
|
||||||
|
"quota",
|
||||||
|
"rate limit",
|
||||||
|
"credits",
|
||||||
|
"credit balance",
|
||||||
|
"budget",
|
||||||
|
"insufficient funds",
|
||||||
|
"usage limit",
|
||||||
|
"token limit",
|
||||||
|
"billing",
|
||||||
|
)
|
||||||
|
|
||||||
|
if any(marker in normalized for marker in auth_markers):
|
||||||
|
return (
|
||||||
|
"AUTH",
|
||||||
|
"Agent CLI authentication is missing or expired. Re-authenticate the CLI, then rerun.",
|
||||||
|
)
|
||||||
|
if any(marker in normalized for marker in usage_limit_markers):
|
||||||
|
return (
|
||||||
|
"USAGE_LIMIT",
|
||||||
|
"Agent CLI hit a quota, billing, or token budget limit. Refill or raise the limit, then rerun.",
|
||||||
|
)
|
||||||
|
if "api error" in normalized:
|
||||||
|
return (
|
||||||
|
"API_ERROR",
|
||||||
|
"Agent CLI returned an API error. Inspect the saved error file for the raw response.",
|
||||||
|
)
|
||||||
|
return (
|
||||||
|
"UNKNOWN",
|
||||||
|
"Agent CLI failed for an unknown reason. Inspect the saved error file for details.",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
class _Spinner:
|
class _Spinner:
|
||||||
"""Animated spinner for long-running agent calls."""
|
"""Animated spinner for long-running agent calls."""
|
||||||
|
|
||||||
@@ -67,11 +144,17 @@ class _Spinner:
|
|||||||
sys.stderr.flush()
|
sys.stderr.flush()
|
||||||
|
|
||||||
|
|
||||||
|
def _is_print_mode(args: list[str]) -> bool:
|
||||||
|
"""Check if the agent args include -p / --print flag."""
|
||||||
|
return "-p" in args or "--print" in args
|
||||||
|
|
||||||
|
|
||||||
def invoke_agent(
|
def invoke_agent(
|
||||||
agent: AgentConfig,
|
agent: AgentConfig,
|
||||||
prompt: str,
|
prompt: str,
|
||||||
step_name: str,
|
step_name: str,
|
||||||
cwd: Optional[Path] = None,
|
cwd: Optional[Path] = None,
|
||||||
|
env: Optional[dict[str, str]] = None,
|
||||||
timeout: int | None = None,
|
timeout: int | None = None,
|
||||||
quiet: bool = False,
|
quiet: bool = False,
|
||||||
) -> AgentResult:
|
) -> AgentResult:
|
||||||
@@ -80,30 +163,67 @@ def invoke_agent(
|
|||||||
Args:
|
Args:
|
||||||
quiet: If True, suppress spinner (for parallel execution).
|
quiet: If True, suppress spinner (for parallel execution).
|
||||||
"""
|
"""
|
||||||
|
is_claude = "claude" in agent.command
|
||||||
|
is_interactive = is_claude and not _is_print_mode(agent.args)
|
||||||
|
|
||||||
cmd = [agent.command]
|
cmd = [agent.command]
|
||||||
if agent.reasoning_effort and _supports_reasoning_effort(agent.command):
|
if agent.reasoning_effort and _supports_reasoning_effort(agent.command):
|
||||||
cmd.extend(["-c", f'model_reasoning_effort="{agent.reasoning_effort}"'])
|
cmd.extend(["-c", f'model_reasoning_effort="{agent.reasoning_effort}"'])
|
||||||
cmd.extend(agent.args)
|
cmd.extend(agent.args)
|
||||||
|
|
||||||
# Build the full prompt (system prompt + user prompt)
|
# --- Temp files for interactive (non -p) claude ---
|
||||||
if agent.system_prompt and _supports_system_prompt_flag(agent.command):
|
task_file: Optional[Path] = None
|
||||||
# claude: --system-prompt flag supported natively
|
output_file: Optional[Path] = None
|
||||||
cmd.extend(["--system-prompt", agent.system_prompt])
|
|
||||||
input_data = prompt
|
if is_interactive:
|
||||||
elif agent.system_prompt:
|
# Write prompt + output instruction to temp task file
|
||||||
# codex, others: no --system-prompt flag, prepend to prompt
|
task_fd, task_path = tempfile.mkstemp(suffix=".md", prefix="cross_eval_task_")
|
||||||
input_data = (
|
task_file = Path(task_path)
|
||||||
f"<system>\n{agent.system_prompt}\n</system>\n\n"
|
os.close(task_fd)
|
||||||
f"{prompt}"
|
|
||||||
|
out_fd, out_path = tempfile.mkstemp(suffix=".md", prefix="cross_eval_out_")
|
||||||
|
output_file = Path(out_path)
|
||||||
|
os.close(out_fd)
|
||||||
|
# Clear the output file so we can detect if agent wrote to it
|
||||||
|
output_file.write_text("", encoding="utf-8")
|
||||||
|
|
||||||
|
wrapped_prompt = (
|
||||||
|
f"{prompt}\n\n"
|
||||||
|
f"---\n"
|
||||||
|
f"IMPORTANT: Write your COMPLETE response to this file: {output_file}\n"
|
||||||
|
f"Do NOT modify any other files in the project."
|
||||||
)
|
)
|
||||||
|
task_file.write_text(wrapped_prompt, encoding="utf-8")
|
||||||
|
|
||||||
|
# System prompt via flag
|
||||||
|
if agent.system_prompt and _supports_system_prompt_flag(agent.command):
|
||||||
|
cmd.extend(["--system-prompt", agent.system_prompt])
|
||||||
|
|
||||||
|
# Positional arg: point claude to the task file
|
||||||
|
cmd.append(
|
||||||
|
f"Read the task file at {task_file} and follow all instructions in it. "
|
||||||
|
f"Write your complete output to {output_file}."
|
||||||
|
)
|
||||||
|
input_data: str | None = None
|
||||||
else:
|
else:
|
||||||
input_data = prompt
|
# Print mode (-p) or non-claude: deliver prompt via stdin
|
||||||
|
if agent.system_prompt and _supports_system_prompt_flag(agent.command):
|
||||||
|
cmd.extend(["--system-prompt", agent.system_prompt])
|
||||||
|
input_data = prompt
|
||||||
|
elif agent.system_prompt:
|
||||||
|
input_data = (
|
||||||
|
f"<system>\n{agent.system_prompt}\n</system>\n\n"
|
||||||
|
f"{prompt}"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
input_data = prompt
|
||||||
|
|
||||||
logger.debug("Invoking agent '%s': %s", agent.name, " ".join(cmd[:5]) + " ...")
|
logger.debug("Invoking agent '%s': %s", agent.name, " ".join(cmd[:5]) + " ...")
|
||||||
|
|
||||||
spinner: Optional[_Spinner] = None
|
spinner: Optional[_Spinner] = None
|
||||||
if not quiet:
|
if not quiet:
|
||||||
logger.info(" cmd: %s", " ".join(cmd[:6]))
|
mode_label = "interactive" if is_interactive else ""
|
||||||
|
logger.info(" cmd: %s %s", " ".join(cmd[:6]), f"({mode_label})" if mode_label else "")
|
||||||
spinner = _Spinner(f"[{step_name}] {agent.name} running...")
|
spinner = _Spinner(f"[{step_name}] {agent.name} running...")
|
||||||
spinner.start()
|
spinner.start()
|
||||||
|
|
||||||
@@ -116,6 +236,7 @@ def invoke_agent(
|
|||||||
text=True,
|
text=True,
|
||||||
timeout=timeout,
|
timeout=timeout,
|
||||||
cwd=cwd,
|
cwd=cwd,
|
||||||
|
env=env,
|
||||||
)
|
)
|
||||||
duration = time.monotonic() - start
|
duration = time.monotonic() - start
|
||||||
except subprocess.TimeoutExpired:
|
except subprocess.TimeoutExpired:
|
||||||
@@ -126,32 +247,56 @@ def invoke_agent(
|
|||||||
if spinner:
|
if spinner:
|
||||||
spinner.stop(f"[{step_name}] ERROR")
|
spinner.stop(f"[{step_name}] ERROR")
|
||||||
raise
|
raise
|
||||||
|
finally:
|
||||||
output = result.stdout.strip()
|
if task_file:
|
||||||
chars = len(output)
|
task_file.unlink(missing_ok=True)
|
||||||
|
|
||||||
if result.returncode != 0:
|
if result.returncode != 0:
|
||||||
if spinner:
|
if spinner:
|
||||||
spinner.stop(f"[{step_name}] FAILED (exit {result.returncode})")
|
spinner.stop(f"[{step_name}] FAILED (exit {result.returncode})")
|
||||||
|
if output_file:
|
||||||
|
output_file.unlink(missing_ok=True)
|
||||||
err_detail = result.stderr.strip() or result.stdout.strip()
|
err_detail = result.stderr.strip() or result.stdout.strip()
|
||||||
if err_detail and len(err_detail) > 500:
|
if err_detail and len(err_detail) > 500:
|
||||||
err_detail = err_detail[:500] + "..."
|
err_detail = err_detail[:500] + "..."
|
||||||
cmd_preview = " ".join(cmd[:6])
|
cmd_preview = " ".join(cmd[:6])
|
||||||
raise RuntimeError(
|
failure_type, suggested_action = _classify_agent_failure(err_detail or "")
|
||||||
f"Agent '{agent.name}' failed (exit code {result.returncode}) "
|
raise AgentInvocationError(
|
||||||
f"at step '{step_name}':\n"
|
agent_name=agent.name,
|
||||||
f" cmd: {cmd_preview}\n"
|
step_name=step_name,
|
||||||
f" error: {err_detail or '(no output)'}"
|
cmd_preview=cmd_preview,
|
||||||
|
raw_error=err_detail or "(no output)",
|
||||||
|
failure_type=failure_type,
|
||||||
|
suggested_action=suggested_action,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
# --- Capture output ---
|
||||||
|
if output_file:
|
||||||
|
output = output_file.read_text(encoding="utf-8").strip()
|
||||||
|
output_file.unlink(missing_ok=True)
|
||||||
|
if not output:
|
||||||
|
# Fallback to stdout if agent didn't write to the file
|
||||||
|
output = result.stdout.strip()
|
||||||
|
else:
|
||||||
|
output = result.stdout.strip()
|
||||||
|
|
||||||
|
chars = len(output)
|
||||||
|
|
||||||
if spinner:
|
if spinner:
|
||||||
spinner.stop(f"[{step_name}] done — {chars} chars")
|
spinner.stop(f"[{step_name}] done — {chars} chars")
|
||||||
|
|
||||||
if not output:
|
if not output:
|
||||||
logger.warning(
|
stderr_info = result.stderr.strip()
|
||||||
"Agent '%s' produced empty output at step '%s'",
|
if stderr_info:
|
||||||
agent.name, step_name,
|
logger.warning(
|
||||||
)
|
"Agent '%s' produced empty output at step '%s'. stderr: %s",
|
||||||
|
agent.name, step_name, stderr_info[:500],
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
logger.warning(
|
||||||
|
"Agent '%s' produced empty output at step '%s' (no stderr either)",
|
||||||
|
agent.name, step_name,
|
||||||
|
)
|
||||||
|
|
||||||
return AgentResult(
|
return AgentResult(
|
||||||
output=output,
|
output=output,
|
||||||
@@ -160,3 +305,131 @@ def invoke_agent(
|
|||||||
step_name=step_name,
|
step_name=step_name,
|
||||||
duration_seconds=round(duration, 1),
|
duration_seconds=round(duration, 1),
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def invoke_agent_agentic(
|
||||||
|
agent: AgentConfig,
|
||||||
|
prompt: str,
|
||||||
|
step_name: str,
|
||||||
|
worktree_path: Path,
|
||||||
|
env: Optional[dict[str, str]] = None,
|
||||||
|
timeout: int | None = None,
|
||||||
|
quiet: bool = False,
|
||||||
|
) -> AgentResult:
|
||||||
|
"""Invoke an agent in agentic mode (no -p, runs in worktree, captures git diff).
|
||||||
|
|
||||||
|
The agent runs without print mode so it can modify files directly.
|
||||||
|
After the agent exits, git diff (since last commit) is captured as the output.
|
||||||
|
"""
|
||||||
|
from cross_eval.worktree import capture_diff
|
||||||
|
|
||||||
|
# Write prompt to a temp file (outside worktree, won't appear in diffs)
|
||||||
|
import tempfile
|
||||||
|
task_fd, task_path = tempfile.mkstemp(suffix=".md", prefix="cross_eval_task_")
|
||||||
|
task_file = Path(task_path)
|
||||||
|
task_file.write_text(prompt, encoding="utf-8")
|
||||||
|
os.close(task_fd)
|
||||||
|
|
||||||
|
cmd = [agent.command]
|
||||||
|
if agent.reasoning_effort and _supports_reasoning_effort(agent.command):
|
||||||
|
cmd.extend(["-c", f'model_reasoning_effort="{agent.reasoning_effort}"'])
|
||||||
|
|
||||||
|
# Strip stdin sentinel ("-") from args for agentic mode
|
||||||
|
args = [a for a in agent.args if a != "-"]
|
||||||
|
cmd.extend(args)
|
||||||
|
|
||||||
|
# System prompt via flag if supported
|
||||||
|
if agent.system_prompt and _supports_system_prompt_flag(agent.command):
|
||||||
|
cmd.extend(["--system-prompt", agent.system_prompt])
|
||||||
|
|
||||||
|
# Deliver the prompt differently per agent type
|
||||||
|
is_codex = "codex" in agent.command
|
||||||
|
input_data: str | None = None
|
||||||
|
if is_codex:
|
||||||
|
# codex: stdin mode
|
||||||
|
cmd.append("-")
|
||||||
|
if agent.system_prompt and not _supports_system_prompt_flag(agent.command):
|
||||||
|
input_data = f"<system>\n{agent.system_prompt}\n</system>\n\n{prompt}"
|
||||||
|
else:
|
||||||
|
input_data = prompt
|
||||||
|
else:
|
||||||
|
# claude: use positional arg with a pointer to the task file
|
||||||
|
# (avoids OS arg length limits for large prompts)
|
||||||
|
cmd.append(
|
||||||
|
f"Read the task file at {task_file} and execute all instructions in it. "
|
||||||
|
f"Work in the current directory."
|
||||||
|
)
|
||||||
|
|
||||||
|
logger.debug(
|
||||||
|
"Invoking agent '%s' (agentic) in worktree: %s",
|
||||||
|
agent.name, worktree_path,
|
||||||
|
)
|
||||||
|
|
||||||
|
spinner: Optional[_Spinner] = None
|
||||||
|
if not quiet:
|
||||||
|
logger.info(" cmd: %s (agentic)", " ".join(cmd[:6]))
|
||||||
|
spinner = _Spinner(f"[{step_name}] {agent.name} (agentic) running...")
|
||||||
|
spinner.start()
|
||||||
|
|
||||||
|
try:
|
||||||
|
start = time.monotonic()
|
||||||
|
result = subprocess.run(
|
||||||
|
cmd,
|
||||||
|
input=input_data,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=timeout,
|
||||||
|
cwd=worktree_path,
|
||||||
|
env=env,
|
||||||
|
)
|
||||||
|
duration = time.monotonic() - start
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
if spinner:
|
||||||
|
spinner.stop(f"[{step_name}] TIMEOUT after {timeout}s")
|
||||||
|
raise
|
||||||
|
except Exception:
|
||||||
|
if spinner:
|
||||||
|
spinner.stop(f"[{step_name}] ERROR")
|
||||||
|
raise
|
||||||
|
finally:
|
||||||
|
# Clean up temp task file (it's in /tmp, not in worktree)
|
||||||
|
task_file.unlink(missing_ok=True)
|
||||||
|
|
||||||
|
if result.returncode != 0:
|
||||||
|
if spinner:
|
||||||
|
spinner.stop(f"[{step_name}] FAILED (exit {result.returncode})")
|
||||||
|
err_detail = result.stderr.strip() or result.stdout.strip()
|
||||||
|
if err_detail and len(err_detail) > 500:
|
||||||
|
err_detail = err_detail[:500] + "..."
|
||||||
|
cmd_preview = " ".join(cmd[:6])
|
||||||
|
failure_type, suggested_action = _classify_agent_failure(err_detail or "")
|
||||||
|
raise AgentInvocationError(
|
||||||
|
agent_name=agent.name,
|
||||||
|
step_name=step_name,
|
||||||
|
cmd_preview=cmd_preview,
|
||||||
|
raw_error=err_detail or "(no output)",
|
||||||
|
failure_type=failure_type,
|
||||||
|
suggested_action=suggested_action,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Capture git diff as the output (changes since last commit on the branch)
|
||||||
|
diff_output = capture_diff(worktree_path)
|
||||||
|
|
||||||
|
if not diff_output:
|
||||||
|
diff_output = "(no changes)"
|
||||||
|
logger.warning(
|
||||||
|
"Agent '%s' made no file changes at step '%s'",
|
||||||
|
agent.name, step_name,
|
||||||
|
)
|
||||||
|
|
||||||
|
chars = len(diff_output)
|
||||||
|
if spinner:
|
||||||
|
spinner.stop(f"[{step_name}] done — {chars} chars (agentic)")
|
||||||
|
|
||||||
|
return AgentResult(
|
||||||
|
output=diff_output,
|
||||||
|
exit_code=result.returncode,
|
||||||
|
agent_name=agent.name,
|
||||||
|
step_name=step_name,
|
||||||
|
duration_seconds=round(duration, 1),
|
||||||
|
)
|
||||||
|
|||||||
@@ -7,7 +7,7 @@ import sys
|
|||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
|
||||||
from cross_eval import __version__
|
from cross_eval import __version__
|
||||||
from cross_eval.config import REASONING_EFFORT_CHOICES
|
from cross_eval.config import REASONING_EFFORT_CHOICES, resolve_agent_shorthand
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -38,7 +38,7 @@ coders: [claude-coder]
|
|||||||
reviewers: [claude-reviewer]
|
reviewers: [claude-reviewer]
|
||||||
# seniors: [codex-senior]
|
# seniors: [codex-senior]
|
||||||
|
|
||||||
# 파이프라인 종류: simple | cross-review | review-only | review-fix
|
# 파이프라인 종류: simple | cross-review | plan-review | review-only | review-fix | coding-review-fix
|
||||||
pipeline: preset:{preset}
|
pipeline: preset:{preset}
|
||||||
|
|
||||||
# 반복 설정
|
# 반복 설정
|
||||||
@@ -49,7 +49,7 @@ max_iterations: 3
|
|||||||
language: {language}
|
language: {language}
|
||||||
|
|
||||||
# 결과 저장 경로
|
# 결과 저장 경로
|
||||||
output_dir: output
|
output_dir: .cross-eval/output
|
||||||
|
|
||||||
# ─── 커스텀 에이전트 (선택) ────────────────────────────────────
|
# ─── 커스텀 에이전트 (선택) ────────────────────────────────────
|
||||||
# 기본 제공 에이전트를 덮어쓰거나 새 에이전트를 정의할 수 있습니다.
|
# 기본 제공 에이전트를 덮어쓰거나 새 에이전트를 정의할 수 있습니다.
|
||||||
@@ -145,7 +145,7 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
"AI 코딩 에이전트의 결과물을 자동으로 검증하는 CLI 도구.\n"
|
"AI 코딩 에이전트의 결과물을 자동으로 검증하는 CLI 도구.\n"
|
||||||
"\n"
|
"\n"
|
||||||
"동작 방식:\n"
|
"동작 방식:\n"
|
||||||
" 1. 기획서(plan)를 바탕으로 Coder 에이전트가 코드를 생성\n"
|
" 1. 기획서(plan)를 바탕으로 Coder 에이전트가 코드를 작성\n"
|
||||||
" 2. Reviewer 에이전트가 기획서 대비 코드를 검토하고 PASS/FAIL 판정\n"
|
" 2. Reviewer 에이전트가 기획서 대비 코드를 검토하고 PASS/FAIL 판정\n"
|
||||||
" 3. FAIL이면 피드백을 반영해서 1~2를 반복 (최대 N회)\n"
|
" 3. FAIL이면 피드백을 반영해서 1~2를 반복 (최대 N회)\n"
|
||||||
"\n"
|
"\n"
|
||||||
@@ -195,11 +195,19 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
init_parser.add_argument(
|
init_parser.add_argument(
|
||||||
"--preset",
|
"--preset",
|
||||||
default="simple",
|
default="simple",
|
||||||
choices=["simple", "cross-review", "review-only", "review-fix"],
|
choices=[
|
||||||
|
"simple",
|
||||||
|
"cross-review",
|
||||||
|
"plan-review",
|
||||||
|
"review-only",
|
||||||
|
"review-fix",
|
||||||
|
"coding-review-fix",
|
||||||
|
],
|
||||||
help=(
|
help=(
|
||||||
"파이프라인 종류 (기본: simple). "
|
"파이프라인 종류 (기본: simple). "
|
||||||
"simple=코딩+리뷰, cross-review=교차리뷰, "
|
"simple=코딩+리뷰, cross-review=교차리뷰, plan-review=문서기획검토, "
|
||||||
"review-only=리뷰만, review-fix=리뷰수렴+자동수정"
|
"review-only=리뷰만, review-fix=리뷰수렴+자동수정, "
|
||||||
|
"coding-review-fix=초기코딩후리뷰수렴"
|
||||||
),
|
),
|
||||||
)
|
)
|
||||||
init_parser.add_argument(
|
init_parser.add_argument(
|
||||||
@@ -208,13 +216,65 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
choices=["en", "ko"],
|
choices=["en", "ko"],
|
||||||
help="프롬프트 언어 (기본: ko)",
|
help="프롬프트 언어 (기본: ko)",
|
||||||
)
|
)
|
||||||
|
init_parser.add_argument(
|
||||||
|
"--guided",
|
||||||
|
action="store_true",
|
||||||
|
help="대화형 설정 마법사 실행",
|
||||||
|
)
|
||||||
|
|
||||||
|
# --- doctor ---
|
||||||
|
doctor_parser = subparsers.add_parser(
|
||||||
|
"doctor",
|
||||||
|
help="실행 환경 점검 (CLI 설치, 인증, 설정 파일 검증)",
|
||||||
|
description="cross-eval 실행에 필요한 환경을 점검합니다.",
|
||||||
|
)
|
||||||
|
doctor_parser.add_argument(
|
||||||
|
"--dir",
|
||||||
|
type=Path,
|
||||||
|
default=Path("."),
|
||||||
|
help="점검할 디렉토리 (기본: 현재 디렉토리)",
|
||||||
|
)
|
||||||
|
|
||||||
|
# --- demo ---
|
||||||
|
demo_parser = subparsers.add_parser(
|
||||||
|
"demo",
|
||||||
|
help="내장 데모 실행 (파이프라인 동작 체험)",
|
||||||
|
description=(
|
||||||
|
"내장된 간단한 기획서로 cross-eval 파이프라인의 전체 동작을 체험합니다.\n"
|
||||||
|
"기본값은 mock 모드(시뮬레이션)이며, --live로 실제 에이전트를 호출할 수 있습니다."
|
||||||
|
),
|
||||||
|
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||||
|
)
|
||||||
|
demo_parser.add_argument(
|
||||||
|
"--live",
|
||||||
|
action="store_true",
|
||||||
|
help="실제 에이전트를 호출하여 데모 실행 (API 비용 발생)",
|
||||||
|
)
|
||||||
|
demo_parser.add_argument(
|
||||||
|
"--preset",
|
||||||
|
default="simple",
|
||||||
|
choices=["simple", "review-fix", "coding-review-fix"],
|
||||||
|
help="데모할 파이프라인 종류 (기본: simple)",
|
||||||
|
)
|
||||||
|
demo_parser.add_argument(
|
||||||
|
"--escalate",
|
||||||
|
action="store_true",
|
||||||
|
help="ESCALATE 시나리오 데모 (mock 모드 전용)",
|
||||||
|
)
|
||||||
|
demo_parser.add_argument(
|
||||||
|
"--timeout",
|
||||||
|
type=int,
|
||||||
|
default=None,
|
||||||
|
metavar="SEC",
|
||||||
|
help="에이전트 호출 제한 시간 (--live 전용)",
|
||||||
|
)
|
||||||
|
|
||||||
# --- run ---
|
# --- run ---
|
||||||
run_parser = subparsers.add_parser(
|
run_parser = subparsers.add_parser(
|
||||||
"run",
|
"run",
|
||||||
help="검증 파이프라인 실행",
|
help="검증 파이프라인 실행",
|
||||||
description=(
|
description=(
|
||||||
"기획서(plan)를 기반으로 AI 에이전트가 코드 생성과 리뷰를 반복합니다.\n"
|
"기획서(plan)를 기반으로 AI 에이전트가 코딩과 리뷰를 반복합니다.\n"
|
||||||
"\n"
|
"\n"
|
||||||
"설정 파일 없이 바로 실행할 수 있고, config.yaml로도 실행할 수 있습니다.\n"
|
"설정 파일 없이 바로 실행할 수 있고, config.yaml로도 실행할 수 있습니다.\n"
|
||||||
"CLI 옵션이 config.yaml보다 우선합니다."
|
"CLI 옵션이 config.yaml보다 우선합니다."
|
||||||
@@ -222,13 +282,19 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
epilog=(
|
epilog=(
|
||||||
"파이프라인 종류 (--preset):\n"
|
"파이프라인 종류 (--preset):\n"
|
||||||
" ┌──────────────┬─────────────────────────────────────────────────────┐\n"
|
" ┌──────────────┬─────────────────────────────────────────────────────┐\n"
|
||||||
" │ simple │ Coder가 코드 생성 → Reviewer가 리뷰 │\n"
|
" │ simple │ Coder가 코드 작성 → Reviewer가 리뷰 │\n"
|
||||||
" │ (기본값) │ FAIL이면 피드백 반영해서 재생성, PASS까지 반복 │\n"
|
" │ (기본값) │ FAIL이면 피드백 반영해서 재코딩, PASS까지 반복 │\n"
|
||||||
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
|
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
|
||||||
" │ review-fix │ 2단계 파이프라인: │\n"
|
" │ review-fix │ 2단계 파이프라인: │\n"
|
||||||
" │ │ Reviewer N명 병렬 리뷰 → 취합 → 수정 → 재검증 │\n"
|
" │ │ Reviewer N명 병렬 리뷰 → 취합 → 수정 → 재검증 │\n"
|
||||||
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
|
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
|
||||||
" │ review-only │ 코드 생성 없이 Reviewer N명이 기존 코드만 검토 │\n"
|
" │ coding- │ 3단계 파이프라인: │\n"
|
||||||
|
" │ review-fix │ 초기 코딩 1회 → 리뷰 취합 → 수정 → 재검증 반복 │\n"
|
||||||
|
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
|
||||||
|
" │ plan-review │ 구현 전 기획서/체크리스트/문서를 검토 │\n"
|
||||||
|
" │ │ 필요하면 현재 코드베이스와의 정합성도 점검 │\n"
|
||||||
|
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
|
||||||
|
" │ review-only │ 코드 작성 없이 Reviewer N명이 기존 코드만 검토 │\n"
|
||||||
" │ │ (이미 작성된 코드의 품질 감사용) │\n"
|
" │ │ (이미 작성된 코드의 품질 감사용) │\n"
|
||||||
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
|
" ├──────────────┼─────────────────────────────────────────────────────┤\n"
|
||||||
" │ cross-review │ Coder 2명이 각각 구현 → 상대방 코드를 교차 리뷰 │\n"
|
" │ cross-review │ Coder 2명이 각각 구현 → 상대방 코드를 교차 리뷰 │\n"
|
||||||
@@ -239,10 +305,10 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
" ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n"
|
" ┌──────────────────┬─────────┬───────────┬──────────────────────────┐\n"
|
||||||
" │ 이름 │ CLI │ 기본 모델 │ 역할 │\n"
|
" │ 이름 │ CLI │ 기본 모델 │ 역할 │\n"
|
||||||
" ├──────────────────┼─────────┼───────────┼──────────────────────────┤\n"
|
" ├──────────────────┼─────────┼───────────┼──────────────────────────┤\n"
|
||||||
" │ claude-coder │ claude │ opus │ 코드 생성 │\n"
|
" │ claude-coder │ claude │ opus │ 코드 작성 │\n"
|
||||||
" │ claude-reviewer │ claude │ opus │ 코드 리뷰 │\n"
|
" │ claude-reviewer │ claude │ opus │ 코드 리뷰 │\n"
|
||||||
" │ claude-senior │ claude │ opus │ 리뷰 취합/판정 │\n"
|
" │ claude-senior │ claude │ opus │ 리뷰 취합/판정 │\n"
|
||||||
" │ codex-coder │ codex │ gpt-5.4 │ 코드 생성 │\n"
|
" │ codex-coder │ codex │ gpt-5.4 │ 코드 작성 │\n"
|
||||||
" │ codex-reviewer │ codex │ gpt-5.4 │ 코드 리뷰 │\n"
|
" │ codex-reviewer │ codex │ gpt-5.4 │ 코드 리뷰 │\n"
|
||||||
" │ codex-senior │ codex │ gpt-5.4 │ 리뷰 취합/판정 │\n"
|
" │ codex-senior │ codex │ gpt-5.4 │ 리뷰 취합/판정 │\n"
|
||||||
" └──────────────────┴─────────┴───────────┴──────────────────────────┘\n"
|
" └──────────────────┴─────────┴───────────┴──────────────────────────┘\n"
|
||||||
@@ -267,10 +333,18 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
" cross-eval run --plan plan.md --preset review-fix \\\n"
|
" cross-eval run --plan plan.md --preset review-fix \\\n"
|
||||||
" --reviewer claude --reviewer codex\n"
|
" --reviewer claude --reviewer codex\n"
|
||||||
"\n"
|
"\n"
|
||||||
|
" 초기 코딩 후 리뷰 수렴 + 자동 수정 (coding-review-fix):\n"
|
||||||
|
" cross-eval run --plan plan.md --preset coding-review-fix \\\n"
|
||||||
|
" --reviewer claude --reviewer codex\n"
|
||||||
|
"\n"
|
||||||
" 기존 코드 리뷰만 (review-only):\n"
|
" 기존 코드 리뷰만 (review-only):\n"
|
||||||
" cross-eval run --plan plan.md --preset review-only \\\n"
|
" cross-eval run --plan plan.md --preset review-only \\\n"
|
||||||
" --reviewer claude --reviewer codex\n"
|
" --reviewer claude --reviewer codex\n"
|
||||||
"\n"
|
"\n"
|
||||||
|
" 구현 전 문서/기획 검토 (plan-review):\n"
|
||||||
|
" cross-eval run --plan plan.md --preset plan-review \\\n"
|
||||||
|
" --reviewer claude --reviewer codex\n"
|
||||||
|
"\n"
|
||||||
" 모델 변경:\n"
|
" 모델 변경:\n"
|
||||||
" cross-eval run --plan plan.md --model sonnet\n"
|
" cross-eval run --plan plan.md --model sonnet\n"
|
||||||
"\n"
|
"\n"
|
||||||
@@ -298,6 +372,14 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
"--input", action="append", dest="inputs", metavar="KEY=PATH",
|
"--input", action="append", dest="inputs", metavar="KEY=PATH",
|
||||||
help="추가 입력 파일 (예: --input spec=./api-spec.md)",
|
help="추가 입력 파일 (예: --input spec=./api-spec.md)",
|
||||||
)
|
)
|
||||||
|
input_group.add_argument(
|
||||||
|
"--env-file", action="append", dest="env_files", type=Path, default=None,
|
||||||
|
help="에이전트 subprocess에 주입할 추가 .env 파일 (여러 개 가능)",
|
||||||
|
)
|
||||||
|
input_group.add_argument(
|
||||||
|
"--target", action="append", dest="execution_targets", default=None,
|
||||||
|
help="에이전트에게 강조할 실행 대상 힌트 (예: clickhouse, postgres)",
|
||||||
|
)
|
||||||
|
|
||||||
# -- 에이전트 설정 --
|
# -- 에이전트 설정 --
|
||||||
agent_group = run_parser.add_argument_group(
|
agent_group = run_parser.add_argument_group(
|
||||||
@@ -336,12 +418,16 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
choices=REASONING_EFFORT_CHOICES + ("extra-high", "extra_high", "x-high"),
|
choices=REASONING_EFFORT_CHOICES + ("extra-high", "extra_high", "x-high"),
|
||||||
help="Senior용 reasoning effort",
|
help="Senior용 reasoning effort",
|
||||||
)
|
)
|
||||||
|
agent_group.add_argument(
|
||||||
|
"--agentic", action="store_true", default=False,
|
||||||
|
help="Coder를 agentic 모드로 실행 (worktree에서 파일 직접 수정, git diff로 결과 캡처)",
|
||||||
|
)
|
||||||
agent_group.add_argument(
|
agent_group.add_argument(
|
||||||
"--model", default=None, metavar="MODEL",
|
"--model", default=None, metavar="MODEL",
|
||||||
help="모든 에이전트의 모델을 한번에 변경 (예: sonnet, opus)",
|
help="모든 에이전트의 모델을 한번에 변경 (예: sonnet, opus)",
|
||||||
)
|
)
|
||||||
agent_group.add_argument(
|
agent_group.add_argument(
|
||||||
"--generator-model", default=None, metavar="MODEL",
|
"--coder-model", default=None, metavar="MODEL",
|
||||||
help="Coder 에이전트 모델만 변경",
|
help="Coder 에이전트 모델만 변경",
|
||||||
)
|
)
|
||||||
agent_group.add_argument(
|
agent_group.add_argument(
|
||||||
@@ -353,7 +439,14 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
pipe_group = run_parser.add_argument_group("파이프라인")
|
pipe_group = run_parser.add_argument_group("파이프라인")
|
||||||
pipe_group.add_argument(
|
pipe_group.add_argument(
|
||||||
"--preset", default=None,
|
"--preset", default=None,
|
||||||
choices=["simple", "cross-review", "review-only", "review-fix"],
|
choices=[
|
||||||
|
"simple",
|
||||||
|
"cross-review",
|
||||||
|
"plan-review",
|
||||||
|
"review-only",
|
||||||
|
"review-fix",
|
||||||
|
"coding-review-fix",
|
||||||
|
],
|
||||||
help="파이프라인 종류 (기본: simple). 각 종류 설명은 아래 참조",
|
help="파이프라인 종류 (기본: simple). 각 종류 설명은 아래 참조",
|
||||||
)
|
)
|
||||||
pipe_group.add_argument(
|
pipe_group.add_argument(
|
||||||
@@ -400,6 +493,10 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
|
|
||||||
if args.command == "init":
|
if args.command == "init":
|
||||||
return cmd_init(args)
|
return cmd_init(args)
|
||||||
|
elif args.command == "doctor":
|
||||||
|
return cmd_doctor(args)
|
||||||
|
elif args.command == "demo":
|
||||||
|
return cmd_demo(args)
|
||||||
elif args.command == "run":
|
elif args.command == "run":
|
||||||
return cmd_run(args)
|
return cmd_run(args)
|
||||||
else:
|
else:
|
||||||
@@ -407,9 +504,186 @@ def main(argv: list[str] | None = None) -> int:
|
|||||||
return 0
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
def cmd_doctor(args: argparse.Namespace) -> int:
|
||||||
|
"""Run environment health checks."""
|
||||||
|
from cross_eval.doctor import format_doctor_results, run_doctor
|
||||||
|
|
||||||
|
checks = run_doctor(args.dir.resolve())
|
||||||
|
print(format_doctor_results(checks))
|
||||||
|
|
||||||
|
has_critical = any(not c.passed and c.critical for c in checks)
|
||||||
|
return 1 if has_critical else 0
|
||||||
|
|
||||||
|
|
||||||
|
def cmd_demo(args: argparse.Namespace) -> int:
|
||||||
|
"""Run a built-in demo to show the pipeline lifecycle."""
|
||||||
|
from cross_eval.demo import run_live_demo, run_mock_demo
|
||||||
|
|
||||||
|
if args.live:
|
||||||
|
print("\n⚠ --live 모드: 실제 AI 에이전트를 호출합니다 (API 비용 발생).")
|
||||||
|
print(" 내장 피보나치 함수 기획서를 사용합니다.\n")
|
||||||
|
try:
|
||||||
|
answer = input("계속하시겠습니까? [y/N] ").strip().lower()
|
||||||
|
except (EOFError, KeyboardInterrupt):
|
||||||
|
print("\n취소됨.")
|
||||||
|
return 0
|
||||||
|
if answer not in ("y", "yes"):
|
||||||
|
print("취소됨.")
|
||||||
|
return 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
raw_timeout = args.timeout if args.timeout is not None else 0
|
||||||
|
agent_timeout = None if raw_timeout == 0 else raw_timeout
|
||||||
|
result = run_live_demo(preset=args.preset, timeout=agent_timeout)
|
||||||
|
print(f"\nResult: {result.final_verdict}")
|
||||||
|
print(f"Iterations: {len(result.iterations)}")
|
||||||
|
if result.run_dir:
|
||||||
|
print(f"Output: {result.run_dir}/")
|
||||||
|
return 0
|
||||||
|
except (RuntimeError, KeyboardInterrupt) as e:
|
||||||
|
if isinstance(e, KeyboardInterrupt):
|
||||||
|
print("\nInterrupted.")
|
||||||
|
return 130
|
||||||
|
print(f"Demo error: {e}", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
else:
|
||||||
|
run_mock_demo(preset=args.preset, show_escalate=args.escalate)
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Guided init wizard
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_PRESET_DESCRIPTIONS = {
|
||||||
|
"simple": "코딩 + 리뷰 (가장 기본)",
|
||||||
|
"review-fix": "리뷰 → 취합 → 수정 → 재검증 반복",
|
||||||
|
"coding-review-fix": "초기 코딩 + 리뷰 수렴 반복",
|
||||||
|
"plan-review": "구현 전 기획서/문서 검토",
|
||||||
|
"review-only": "기존 코드만 리뷰 (코딩 없음)",
|
||||||
|
"cross-review": "2명이 각각 구현 후 교차 리뷰",
|
||||||
|
}
|
||||||
|
|
||||||
|
_PRESET_ORDER = [
|
||||||
|
"simple", "review-fix", "coding-review-fix",
|
||||||
|
"plan-review", "review-only", "cross-review",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _prompt_choice(
|
||||||
|
message: str,
|
||||||
|
choices: list[str],
|
||||||
|
descriptions: dict[str, str] | None = None,
|
||||||
|
default: int = 1,
|
||||||
|
) -> str:
|
||||||
|
"""Prompt user to pick from a numbered list."""
|
||||||
|
print(f"\n{message}")
|
||||||
|
for i, choice in enumerate(choices, 1):
|
||||||
|
desc = f" — {descriptions[choice]}" if descriptions and choice in descriptions else ""
|
||||||
|
marker = " (기본)" if i == default else ""
|
||||||
|
print(f" {i}. {choice}{desc}{marker}")
|
||||||
|
|
||||||
|
while True:
|
||||||
|
try:
|
||||||
|
raw = input(f"선택 [{default}]: ").strip()
|
||||||
|
except (EOFError, KeyboardInterrupt):
|
||||||
|
print()
|
||||||
|
return choices[default - 1]
|
||||||
|
if not raw:
|
||||||
|
return choices[default - 1]
|
||||||
|
try:
|
||||||
|
idx = int(raw)
|
||||||
|
if 1 <= idx <= len(choices):
|
||||||
|
return choices[idx - 1]
|
||||||
|
except ValueError:
|
||||||
|
if raw in choices:
|
||||||
|
return raw
|
||||||
|
print(f" 1-{len(choices)} 사이 숫자를 입력하세요.")
|
||||||
|
|
||||||
|
|
||||||
|
def _prompt_text(message: str, default: str = "") -> str:
|
||||||
|
"""Prompt for text input with default."""
|
||||||
|
suffix = f" [{default}]" if default else ""
|
||||||
|
try:
|
||||||
|
raw = input(f"{message}{suffix}: ").strip()
|
||||||
|
except (EOFError, KeyboardInterrupt):
|
||||||
|
print()
|
||||||
|
return default
|
||||||
|
return raw or default
|
||||||
|
|
||||||
|
|
||||||
|
def _run_guided_init(target: Path) -> dict:
|
||||||
|
"""Interactive setup wizard. Returns settings dict."""
|
||||||
|
print("\n━━━ cross-eval 설정 마법사 ━━━\n")
|
||||||
|
|
||||||
|
lang = _prompt_choice(
|
||||||
|
"언어 / Language:",
|
||||||
|
["ko", "en"],
|
||||||
|
{"ko": "한국어", "en": "English"},
|
||||||
|
default=1,
|
||||||
|
)
|
||||||
|
|
||||||
|
preset = _prompt_choice(
|
||||||
|
"파이프라인 종류:",
|
||||||
|
_PRESET_ORDER,
|
||||||
|
_PRESET_DESCRIPTIONS,
|
||||||
|
default=1,
|
||||||
|
)
|
||||||
|
|
||||||
|
print("\n--- 에이전트 설정 ---")
|
||||||
|
print(" 사용 가능: claude, codex (또는 claude-coder, codex-reviewer 등)")
|
||||||
|
|
||||||
|
coder = _prompt_text(" Coder 에이전트", default="claude")
|
||||||
|
reviewer = _prompt_text(" Reviewer 에이전트", default="claude")
|
||||||
|
|
||||||
|
needs_senior = preset in ("review-fix", "coding-review-fix")
|
||||||
|
senior = ""
|
||||||
|
if needs_senior:
|
||||||
|
senior = _prompt_text(" Senior 에이전트", default=reviewer)
|
||||||
|
else:
|
||||||
|
senior = _prompt_text(" Senior 에이전트 (선택, Enter로 건너뛰기)", default="")
|
||||||
|
|
||||||
|
max_iter = _prompt_text("최대 반복 횟수", default="3")
|
||||||
|
try:
|
||||||
|
max_iter_int = int(max_iter)
|
||||||
|
except ValueError:
|
||||||
|
max_iter_int = 3
|
||||||
|
|
||||||
|
create_templates = _prompt_text(
|
||||||
|
"\n템플릿 파일(plan.md, checklist.md) 생성?", default="Y",
|
||||||
|
).lower() in ("y", "yes", "")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"lang": lang,
|
||||||
|
"preset": preset,
|
||||||
|
"coder": coder,
|
||||||
|
"reviewer": reviewer,
|
||||||
|
"senior": senior,
|
||||||
|
"max_iter": max_iter_int,
|
||||||
|
"create_templates": create_templates,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
def cmd_init(args: argparse.Namespace) -> int:
|
def cmd_init(args: argparse.Namespace) -> int:
|
||||||
"""Scaffold a new cross-eval project."""
|
"""Scaffold a new cross-eval project."""
|
||||||
target = args.dir.resolve()
|
target = args.dir.resolve()
|
||||||
|
|
||||||
|
if args.guided:
|
||||||
|
settings = _run_guided_init(target)
|
||||||
|
args.lang = settings["lang"]
|
||||||
|
args.preset = settings["preset"]
|
||||||
|
# We'll use guided settings for enhanced config generation
|
||||||
|
return _write_init_files(target, args, guided_settings=settings)
|
||||||
|
|
||||||
|
return _write_init_files(target, args)
|
||||||
|
|
||||||
|
|
||||||
|
def _write_init_files(
|
||||||
|
target: Path,
|
||||||
|
args: argparse.Namespace,
|
||||||
|
guided_settings: dict | None = None,
|
||||||
|
) -> int:
|
||||||
|
"""Write config and template files to target directory."""
|
||||||
ce_dir = target / ".cross-eval"
|
ce_dir = target / ".cross-eval"
|
||||||
ce_dir.mkdir(parents=True, exist_ok=True)
|
ce_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
@@ -417,14 +691,23 @@ def cmd_init(args: argparse.Namespace) -> int:
|
|||||||
plan_sample = PLAN_SAMPLE_KO if lang == "ko" else PLAN_SAMPLE_EN
|
plan_sample = PLAN_SAMPLE_KO if lang == "ko" else PLAN_SAMPLE_EN
|
||||||
checklist_sample = CHECKLIST_SAMPLE_KO if lang == "ko" else CHECKLIST_SAMPLE_EN
|
checklist_sample = CHECKLIST_SAMPLE_KO if lang == "ko" else CHECKLIST_SAMPLE_EN
|
||||||
|
|
||||||
files = {
|
# Generate config content
|
||||||
".cross-eval/config.yaml": DEFAULT_CONFIG_YAML.format(
|
if guided_settings:
|
||||||
|
config_content = _generate_guided_config(args.preset, lang, guided_settings)
|
||||||
|
else:
|
||||||
|
config_content = DEFAULT_CONFIG_YAML.format(
|
||||||
preset=args.preset, language=lang,
|
preset=args.preset, language=lang,
|
||||||
),
|
)
|
||||||
".cross-eval/plan.md": plan_sample,
|
|
||||||
".cross-eval/checklist.md": checklist_sample,
|
files: dict[str, str] = {
|
||||||
|
".cross-eval/config.yaml": config_content,
|
||||||
}
|
}
|
||||||
|
|
||||||
|
# Add templates unless guided mode opted out
|
||||||
|
if not guided_settings or guided_settings.get("create_templates", True):
|
||||||
|
files[".cross-eval/plan.md"] = plan_sample
|
||||||
|
files[".cross-eval/checklist.md"] = checklist_sample
|
||||||
|
|
||||||
created = []
|
created = []
|
||||||
skipped = []
|
skipped = []
|
||||||
for name, content in files.items():
|
for name, content in files.items():
|
||||||
@@ -436,23 +719,67 @@ def cmd_init(args: argparse.Namespace) -> int:
|
|||||||
created.append(name)
|
created.append(name)
|
||||||
|
|
||||||
if created:
|
if created:
|
||||||
print(f" 생성: {', '.join(created)}")
|
print(f"\n 생성: {', '.join(created)}")
|
||||||
if skipped:
|
if skipped:
|
||||||
print(f" 이미 존재 (건너뜀): {', '.join(skipped)}")
|
print(f" 이미 존재 (건너뜀): {', '.join(skipped)}")
|
||||||
|
|
||||||
print(f"\n 파이프라인: {args.preset}")
|
print(f"\n 파이프라인: {args.preset}")
|
||||||
print(f" 언어: {lang}")
|
print(f" 언어: {lang}")
|
||||||
|
if guided_settings:
|
||||||
|
print(f" Coder: {guided_settings['coder']}")
|
||||||
|
print(f" Reviewer: {guided_settings['reviewer']}")
|
||||||
|
if guided_settings.get("senior"):
|
||||||
|
print(f" Senior: {guided_settings['senior']}")
|
||||||
|
print(f" 최대 반복: {guided_settings['max_iter']}")
|
||||||
print("")
|
print("")
|
||||||
print("다음 단계:")
|
print("다음 단계:")
|
||||||
print(" 1. .cross-eval/plan.md 에 기획서 작성")
|
print(" 1. .cross-eval/plan.md 에 기획서 작성")
|
||||||
print(" 2. .cross-eval/checklist.md 에 체크리스트 작성 (선택)")
|
print(" 2. .cross-eval/checklist.md 에 체크리스트 작성 (선택)")
|
||||||
print(" 3. cross-eval run 으로 실행")
|
print(" 3. cross-eval run 으로 실행")
|
||||||
print("")
|
print("")
|
||||||
print("주의: 에이전트는 기본적으로 파일 읽기/쓰기/실행 권한을 가집니다.")
|
print("팁: cross-eval doctor 로 환경 점검을 먼저 하세요.")
|
||||||
print(" 실행 전에 .cross-eval/config.yaml 을 확인하세요.")
|
print(" cross-eval demo 로 동작 방식을 미리 볼 수 있습니다.")
|
||||||
return 0
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
def _generate_guided_config(
|
||||||
|
preset: str,
|
||||||
|
lang: str,
|
||||||
|
settings: dict,
|
||||||
|
) -> str:
|
||||||
|
"""Generate config.yaml content from guided init settings."""
|
||||||
|
coder_name = resolve_agent_shorthand(settings["coder"], "coder")
|
||||||
|
reviewer_name = resolve_agent_shorthand(settings["reviewer"], "reviewer")
|
||||||
|
|
||||||
|
lines = [
|
||||||
|
"# cross-eval 설정 (guided init으로 생성됨)",
|
||||||
|
"",
|
||||||
|
"inputs:",
|
||||||
|
" plan: plan.md",
|
||||||
|
" checklist: checklist.md",
|
||||||
|
"",
|
||||||
|
f"coders: [{coder_name}]",
|
||||||
|
f"reviewers: [{reviewer_name}]",
|
||||||
|
]
|
||||||
|
|
||||||
|
senior = settings.get("senior", "")
|
||||||
|
if senior:
|
||||||
|
senior_name = resolve_agent_shorthand(senior, "senior")
|
||||||
|
lines.append(f"seniors: [{senior_name}]")
|
||||||
|
|
||||||
|
lines.extend([
|
||||||
|
"",
|
||||||
|
f"pipeline: preset:{preset}",
|
||||||
|
"",
|
||||||
|
f"max_iterations: {settings['max_iter']}",
|
||||||
|
f"language: {lang}",
|
||||||
|
"output_dir: .cross-eval/output",
|
||||||
|
"",
|
||||||
|
])
|
||||||
|
|
||||||
|
return "\n".join(lines) + "\n"
|
||||||
|
|
||||||
|
|
||||||
def _read_docs_dir(docs_dir: Path) -> str:
|
def _read_docs_dir(docs_dir: Path) -> str:
|
||||||
"""Read all files in a directory and concatenate with filename headers."""
|
"""Read all files in a directory and concatenate with filename headers."""
|
||||||
parts: list[str] = []
|
parts: list[str] = []
|
||||||
@@ -482,12 +809,21 @@ def _apply_model_override(config, agent_name: str, model: str) -> None:
|
|||||||
agent.args = new_args
|
agent.args = new_args
|
||||||
|
|
||||||
|
|
||||||
|
def _apply_phased_iteration_override(config, max_iter: int | None) -> None:
|
||||||
|
"""Apply CLI max-iter to converging phases while preserving setup phases."""
|
||||||
|
from cross_eval.config import sync_phased_iterations
|
||||||
|
|
||||||
|
sync_phased_iterations(config, max_iter)
|
||||||
|
|
||||||
|
|
||||||
def cmd_run(args: argparse.Namespace) -> int:
|
def cmd_run(args: argparse.Namespace) -> int:
|
||||||
"""Load config, validate, and execute the pipeline."""
|
"""Load config, validate, and execute the pipeline."""
|
||||||
from cross_eval.config import (
|
from cross_eval.config import (
|
||||||
|
ensure_fix_preset_agentic,
|
||||||
apply_input_overrides,
|
apply_input_overrides,
|
||||||
default_config,
|
default_config,
|
||||||
load_config,
|
load_config,
|
||||||
|
sync_phased_iterations,
|
||||||
validate_config,
|
validate_config,
|
||||||
)
|
)
|
||||||
from cross_eval.prompts import PIPELINE_PRESETS
|
from cross_eval.prompts import PIPELINE_PRESETS
|
||||||
@@ -562,7 +898,7 @@ def cmd_run(args: argparse.Namespace) -> int:
|
|||||||
preset = args.preset or "simple"
|
preset = args.preset or "simple"
|
||||||
# Determine which preset was configured (from YAML or defaults)
|
# Determine which preset was configured (from YAML or defaults)
|
||||||
if args.preset is None and config.phases:
|
if args.preset is None and config.phases:
|
||||||
preset = "review-fix" # only phased preset currently
|
preset = config.preset_name if config.preset_name != "custom" else "review-fix"
|
||||||
elif args.preset is None and not args.coders and not args.reviewers and not args.seniors:
|
elif args.preset is None and not args.coders and not args.reviewers and not args.seniors:
|
||||||
pass # no changes needed
|
pass # no changes needed
|
||||||
inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles(
|
inferred_coders, inferred_reviewers, inferred_seniors = _infer_roles(
|
||||||
@@ -584,13 +920,18 @@ def cmd_run(args: argparse.Namespace) -> int:
|
|||||||
config.preset_name = preset
|
config.preset_name = preset
|
||||||
if preset in PHASED_PRESETS:
|
if preset in PHASED_PRESETS:
|
||||||
config.phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
|
config.phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
|
||||||
|
_apply_phased_iteration_override(config, args.max_iter)
|
||||||
config.pipeline = []
|
config.pipeline = []
|
||||||
elif preset in PIPELINE_PRESETS:
|
elif preset in PIPELINE_PRESETS:
|
||||||
config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
|
config.pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
|
||||||
config.phases = []
|
config.phases = []
|
||||||
if preset == "review-only" and args.max_iter is None and args.min_iter is None:
|
if preset in {"plan-review", "review-only"} and args.max_iter is None and args.min_iter is None:
|
||||||
config.max_iterations = 1
|
config.max_iterations = 1
|
||||||
|
|
||||||
|
sync_phased_iterations(config)
|
||||||
|
if args.max_iter is not None:
|
||||||
|
sync_phased_iterations(config, args.max_iter)
|
||||||
|
|
||||||
apply_reasoning_effort_settings(
|
apply_reasoning_effort_settings(
|
||||||
config,
|
config,
|
||||||
reasoning_effort=args.reasoning_effort,
|
reasoning_effort=args.reasoning_effort,
|
||||||
@@ -599,14 +940,23 @@ def cmd_run(args: argparse.Namespace) -> int:
|
|||||||
senior_effort=args.senior_effort,
|
senior_effort=args.senior_effort,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
# --agentic: convert coder agents to agentic mode
|
||||||
|
if args.agentic:
|
||||||
|
from cross_eval.config import _make_agentic
|
||||||
|
for coder_name in config.coders:
|
||||||
|
if coder_name in config.agents:
|
||||||
|
_make_agentic(config.agents[coder_name])
|
||||||
|
|
||||||
|
ensure_fix_preset_agentic(config)
|
||||||
|
|
||||||
# --model: apply to ALL agents
|
# --model: apply to ALL agents
|
||||||
if args.model is not None:
|
if args.model is not None:
|
||||||
for agent_name in config.agents:
|
for agent_name in config.agents:
|
||||||
_apply_model_override(config, agent_name, args.model)
|
_apply_model_override(config, agent_name, args.model)
|
||||||
# --generator-model / --reviewer-model: apply by role
|
# --coder-model / --reviewer-model: apply by role
|
||||||
if args.generator_model is not None:
|
if args.coder_model is not None:
|
||||||
for coder_name in config.coders:
|
for coder_name in config.coders:
|
||||||
_apply_model_override(config, coder_name, args.generator_model)
|
_apply_model_override(config, coder_name, args.coder_model)
|
||||||
if args.reviewer_model is not None:
|
if args.reviewer_model is not None:
|
||||||
for reviewer_name in config.reviewers:
|
for reviewer_name in config.reviewers:
|
||||||
_apply_model_override(config, reviewer_name, args.reviewer_model)
|
_apply_model_override(config, reviewer_name, args.reviewer_model)
|
||||||
@@ -632,6 +982,17 @@ def cmd_run(args: argparse.Namespace) -> int:
|
|||||||
return 1
|
return 1
|
||||||
config.inputs["docs"] = docs_content
|
config.inputs["docs"] = docs_content
|
||||||
|
|
||||||
|
if args.env_files:
|
||||||
|
for env_file in args.env_files:
|
||||||
|
resolved = env_file.resolve()
|
||||||
|
if not resolved.exists():
|
||||||
|
print(f"Env file not found: {resolved}", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
config.execution.env_files.append(str(resolved))
|
||||||
|
|
||||||
|
if args.execution_targets:
|
||||||
|
config.execution.auto_context_targets = list(args.execution_targets)
|
||||||
|
|
||||||
if args.inputs:
|
if args.inputs:
|
||||||
overrides = {}
|
overrides = {}
|
||||||
for item in args.inputs:
|
for item in args.inputs:
|
||||||
@@ -694,6 +1055,11 @@ def cmd_run(args: argparse.Namespace) -> int:
|
|||||||
if not args.dry_run and result.run_dir:
|
if not args.dry_run and result.run_dir:
|
||||||
print(f"Output: {result.run_dir}/")
|
print(f"Output: {result.run_dir}/")
|
||||||
|
|
||||||
|
if result.final_verdict == "ESCALATE":
|
||||||
|
from cross_eval.report import print_escalation_report
|
||||||
|
print_escalation_report(config, result)
|
||||||
|
return 2
|
||||||
|
|
||||||
return 0 if result.final_verdict == "PASS" else 1
|
return 0 if result.final_verdict == "PASS" else 1
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,6 +1,7 @@
|
|||||||
"""Configuration loading, validation, and preset resolution."""
|
"""Configuration loading, validation, and preset resolution."""
|
||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import copy
|
||||||
import logging
|
import logging
|
||||||
import re
|
import re
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
@@ -8,7 +9,13 @@ from typing import Any
|
|||||||
|
|
||||||
import yaml
|
import yaml
|
||||||
|
|
||||||
from cross_eval.models import AgentConfig, PhaseConfig, PipelineConfig, StepConfig
|
from cross_eval.models import (
|
||||||
|
AgentConfig,
|
||||||
|
ExecutionConfig,
|
||||||
|
PhaseConfig,
|
||||||
|
PipelineConfig,
|
||||||
|
StepConfig,
|
||||||
|
)
|
||||||
from cross_eval.prompts import PHASED_PRESETS, PIPELINE_PRESETS
|
from cross_eval.prompts import PHASED_PRESETS, PIPELINE_PRESETS
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
@@ -24,6 +31,7 @@ DEFAULT_ROLE_REASONING_EFFORTS = {
|
|||||||
"reviewer": "medium",
|
"reviewer": "medium",
|
||||||
"senior": "high",
|
"senior": "high",
|
||||||
}
|
}
|
||||||
|
FIX_STYLE_PRESETS = {"review-fix", "coding-review-fix"}
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
@@ -39,34 +47,67 @@ _CODEX_ARGS = [
|
|||||||
"-",
|
"-",
|
||||||
]
|
]
|
||||||
|
|
||||||
|
_CLAUDE_BASE_ARGS = [
|
||||||
|
"-p",
|
||||||
|
"--setting-sources",
|
||||||
|
"user",
|
||||||
|
"--disable-slash-commands",
|
||||||
|
"--model",
|
||||||
|
"opus",
|
||||||
|
]
|
||||||
|
|
||||||
|
_CLAUDE_CODER_ARGS = list(_CLAUDE_BASE_ARGS) + [
|
||||||
|
"--dangerously-skip-permissions",
|
||||||
|
"--permission-mode",
|
||||||
|
"bypassPermissions",
|
||||||
|
]
|
||||||
|
|
||||||
|
_CLAUDE_REVIEW_ARGS = [
|
||||||
|
"--setting-sources",
|
||||||
|
"user",
|
||||||
|
"--disable-slash-commands",
|
||||||
|
"--model",
|
||||||
|
"opus",
|
||||||
|
"--permission-mode",
|
||||||
|
"plan",
|
||||||
|
]
|
||||||
|
|
||||||
_CODER_SYSTEM_PROMPT = (
|
_CODER_SYSTEM_PROMPT = (
|
||||||
"You are a senior software engineer implementing code changes.\n"
|
"You are a senior software engineer implementing code changes.\n"
|
||||||
"Rules:\n"
|
"Rules:\n"
|
||||||
"1. FIRST explore the project directory to understand the existing codebase, "
|
"1. FIRST explore the project directory to understand the existing codebase, "
|
||||||
"patterns, and conventions before writing any code.\n"
|
"patterns, and conventions before writing any code.\n"
|
||||||
"2. Implement ONLY what the plan specifies. Do NOT add extra features, "
|
"2. You may decide which shell, Python, git, docker, test, and database commands "
|
||||||
|
"to run. The user does not need to pre-specify exact commands.\n"
|
||||||
|
"3. Environment variables from configured .env files may already be loaded into "
|
||||||
|
"your process; use them when validating services such as ClickHouse.\n"
|
||||||
|
"4. Implement ONLY what the plan specifies. Do NOT add extra features, "
|
||||||
"unnecessary abstractions, premature optimizations, or \"nice-to-have\" improvements.\n"
|
"unnecessary abstractions, premature optimizations, or \"nice-to-have\" improvements.\n"
|
||||||
"3. Follow the project's existing coding style, naming conventions, and directory structure.\n"
|
"5. Follow the project's existing coding style, naming conventions, and directory structure.\n"
|
||||||
"4. If previous review feedback is provided, fix ONLY the specific issues mentioned. "
|
"6. If previous review feedback is provided, fix ONLY the specific issues mentioned. "
|
||||||
"Do NOT refactor unrelated code.\n"
|
"Do NOT refactor unrelated code.\n"
|
||||||
"5. Ignore any items from previous feedback that were marked as DISMISSED or false positive.\n"
|
"7. Ignore any items from previous feedback that were marked as DISMISSED or false positive.\n"
|
||||||
"6. When in doubt about scope, do LESS, not more."
|
"8. When in doubt about scope, do LESS, not more."
|
||||||
)
|
)
|
||||||
|
|
||||||
_REVIEWER_SYSTEM_PROMPT = (
|
_REVIEWER_SYSTEM_PROMPT = (
|
||||||
"You are a code reviewer. You MUST NOT create, modify, or delete any files.\n"
|
"You are a code reviewer. You MUST NOT create, modify, or delete any files.\n"
|
||||||
"Rules:\n"
|
"Rules:\n"
|
||||||
"1. Explore the project directory to understand the full codebase context.\n"
|
"1. Explore the project directory to understand the full codebase context.\n"
|
||||||
"2. Compare the implementation against the plan and checklist ONLY.\n"
|
"2. You may decide which shell, Python, test, git, docker, and database read commands "
|
||||||
"3. Classify every issue with BOTH severity AND category:\n"
|
"to run in order to verify behavior. The user does not need to pre-specify exact commands.\n"
|
||||||
|
"3. Environment variables from configured .env files may already be loaded into "
|
||||||
|
"your process; use them for verification when relevant.\n"
|
||||||
|
"4. Compare the implementation against the plan and checklist ONLY.\n"
|
||||||
|
"5. Classify every issue with BOTH severity AND category:\n"
|
||||||
" - Severity: Critical (breaks functionality/security) > Major (requirement mismatch) > Minor (convention/style)\n"
|
" - Severity: Critical (breaks functionality/security) > Major (requirement mismatch) > Minor (convention/style)\n"
|
||||||
" - Category: Over-engineering / Omission\n"
|
" - Category: Over-engineering / Omission\n"
|
||||||
"4. When reviewing with previous feedback, mark items as CONFIRMED (still an issue) "
|
"6. When reviewing with previous feedback, mark items as CONFIRMED (still an issue) "
|
||||||
"or DISMISSED (false positive) with rationale.\n"
|
"or DISMISSED (false positive) with rationale.\n"
|
||||||
"5. Report out-of-scope issues separately — problems found outside plan/checklist scope.\n"
|
"7. Report out-of-scope issues separately — problems found outside plan/checklist scope.\n"
|
||||||
"6. Order issues by severity (Critical first).\n"
|
"8. Order issues by severity (Critical first).\n"
|
||||||
"7. Do NOT suggest improvements beyond the plan scope.\n"
|
"9. Do NOT suggest improvements beyond the plan scope.\n"
|
||||||
"8. End with VERDICT: PASS (all requirements met, no over-engineering) "
|
"10. End with VERDICT: PASS (all requirements met, no over-engineering) "
|
||||||
"or VERDICT: FAIL (issues found)."
|
"or VERDICT: FAIL (issues found)."
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -74,36 +115,48 @@ _SENIOR_SYSTEM_PROMPT = (
|
|||||||
"You are a senior technical reviewer coordinating a review-fix-verification loop.\n"
|
"You are a senior technical reviewer coordinating a review-fix-verification loop.\n"
|
||||||
"Rules:\n"
|
"Rules:\n"
|
||||||
"1. Explore the project directory to understand the full codebase context.\n"
|
"1. Explore the project directory to understand the full codebase context.\n"
|
||||||
"2. In aggregation mode, deduplicate overlaps, resolve disagreements, and keep only "
|
"2. You may decide which shell, Python, test, git, docker, and database read commands "
|
||||||
|
"to run to verify disputed issues. The user does not need to pre-specify exact commands.\n"
|
||||||
|
"3. Environment variables from configured .env files may already be loaded into "
|
||||||
|
"your process; use them when validating service integrations.\n"
|
||||||
|
"4. In aggregation mode, deduplicate overlaps, resolve disagreements, and keep only "
|
||||||
"evidence-backed issues. Categorize dismissed findings as [False positive] or [Already fixed].\n"
|
"evidence-backed issues. Categorize dismissed findings as [False positive] or [Already fixed].\n"
|
||||||
"3. In verification mode, judge the current implementation directly against ONLY the "
|
"5. In verification mode, judge the current implementation directly against ONLY the "
|
||||||
"plan and checklist.\n"
|
"plan and checklist.\n"
|
||||||
"4. Be skeptical of false positives, but do not lower the bar on real requirement "
|
"6. Be skeptical of false positives, but do not lower the bar on real requirement "
|
||||||
"gaps.\n"
|
"gaps.\n"
|
||||||
"5. When issues remain, produce a concise prioritized action list the coder can act on.\n"
|
"7. When issues remain, produce a concise prioritized action list the coder can act on.\n"
|
||||||
"6. Do NOT invent new requirements beyond the plan and checklist.\n"
|
"8. Maintain an Issue Tracker table across iterations to track issue status.\n"
|
||||||
"7. End with VERDICT: PASS or VERDICT: FAIL."
|
"9. Do NOT invent new requirements beyond the plan and checklist.\n"
|
||||||
|
"10. End with one of three verdicts:\n"
|
||||||
|
" - VERDICT: PASS — all requirements met, no issues remain.\n"
|
||||||
|
" - VERDICT: FAIL — issues found that the coder can fix.\n"
|
||||||
|
" - VERDICT: ESCALATE — issues that require human intervention. Use ESCALATE when:\n"
|
||||||
|
" * Requirements are ambiguous and need clarification from stakeholders\n"
|
||||||
|
" * Architecture decisions are needed that go beyond the plan scope\n"
|
||||||
|
" * External dependency issues block progress\n"
|
||||||
|
" * The coder has failed to resolve the same issue 2+ times"
|
||||||
)
|
)
|
||||||
|
|
||||||
BUILTIN_AGENTS: dict[str, AgentConfig] = {
|
BUILTIN_AGENTS: dict[str, AgentConfig] = {
|
||||||
"claude-coder": AgentConfig(
|
"claude-coder": AgentConfig(
|
||||||
name="claude-coder",
|
name="claude-coder",
|
||||||
command="claude",
|
command="claude",
|
||||||
args=["-p", "--model", "opus", "--permission-mode", "auto"],
|
args=list(_CLAUDE_CODER_ARGS),
|
||||||
system_prompt=_CODER_SYSTEM_PROMPT,
|
system_prompt=_CODER_SYSTEM_PROMPT,
|
||||||
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["coder"],
|
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["coder"],
|
||||||
),
|
),
|
||||||
"claude-reviewer": AgentConfig(
|
"claude-reviewer": AgentConfig(
|
||||||
name="claude-reviewer",
|
name="claude-reviewer",
|
||||||
command="claude",
|
command="claude",
|
||||||
args=["-p", "--model", "opus", "--permission-mode", "auto"],
|
args=list(_CLAUDE_REVIEW_ARGS),
|
||||||
system_prompt=_REVIEWER_SYSTEM_PROMPT,
|
system_prompt=_REVIEWER_SYSTEM_PROMPT,
|
||||||
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["reviewer"],
|
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["reviewer"],
|
||||||
),
|
),
|
||||||
"claude-senior": AgentConfig(
|
"claude-senior": AgentConfig(
|
||||||
name="claude-senior",
|
name="claude-senior",
|
||||||
command="claude",
|
command="claude",
|
||||||
args=["-p", "--model", "opus", "--permission-mode", "auto"],
|
args=list(_CLAUDE_REVIEW_ARGS),
|
||||||
system_prompt=_SENIOR_SYSTEM_PROMPT,
|
system_prompt=_SENIOR_SYSTEM_PROMPT,
|
||||||
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["senior"],
|
reasoning_effort=DEFAULT_ROLE_REASONING_EFFORTS["senior"],
|
||||||
),
|
),
|
||||||
@@ -136,6 +189,11 @@ _AGENT_ALIASES: dict[str, str] = {
|
|||||||
"codex": "codex",
|
"codex": "codex",
|
||||||
}
|
}
|
||||||
|
|
||||||
|
_ROLE_ALIASES: dict[str, str] = {
|
||||||
|
"coding": "coding",
|
||||||
|
"review": "review",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
def resolve_agent_shorthand(name: str, role: str) -> str:
|
def resolve_agent_shorthand(name: str, role: str) -> str:
|
||||||
"""Resolve shorthand agent name to full builtin name.
|
"""Resolve shorthand agent name to full builtin name.
|
||||||
@@ -150,6 +208,16 @@ def resolve_agent_shorthand(name: str, role: str) -> str:
|
|||||||
return name
|
return name
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_step_role(role: str) -> str:
|
||||||
|
"""Normalize step role aliases to the canonical role name."""
|
||||||
|
return _ROLE_ALIASES.get(role, role)
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_prompt_template(template_ref: str) -> str:
|
||||||
|
"""Normalize prompt template aliases to canonical template refs."""
|
||||||
|
return template_ref
|
||||||
|
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Role inference (backward compatibility)
|
# Role inference (backward compatibility)
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
@@ -220,7 +288,7 @@ def _resolve_agents(
|
|||||||
|
|
||||||
for name in all_referenced:
|
for name in all_referenced:
|
||||||
if name not in result and name in BUILTIN_AGENTS:
|
if name not in result and name in BUILTIN_AGENTS:
|
||||||
result[name] = BUILTIN_AGENTS[name]
|
result[name] = copy.deepcopy(BUILTIN_AGENTS[name])
|
||||||
|
|
||||||
return result
|
return result
|
||||||
|
|
||||||
@@ -233,7 +301,7 @@ def _default_seniors_for_preset(
|
|||||||
"""Infer a default senior agent for presets that benefit from adjudication."""
|
"""Infer a default senior agent for presets that benefit from adjudication."""
|
||||||
if not (
|
if not (
|
||||||
isinstance(pipeline_raw, str)
|
isinstance(pipeline_raw, str)
|
||||||
and pipeline_raw == "preset:review-fix"
|
and pipeline_raw in {"preset:review-fix", "preset:coding-review-fix"}
|
||||||
and reviewers
|
and reviewers
|
||||||
):
|
):
|
||||||
return []
|
return []
|
||||||
@@ -311,15 +379,16 @@ def _apply_role_effort(
|
|||||||
|
|
||||||
def default_config() -> PipelineConfig:
|
def default_config() -> PipelineConfig:
|
||||||
"""Return a PipelineConfig with sensible defaults (no YAML needed)."""
|
"""Return a PipelineConfig with sensible defaults (no YAML needed)."""
|
||||||
agents = dict(BUILTIN_AGENTS)
|
agents = copy.deepcopy(BUILTIN_AGENTS)
|
||||||
coders = ["claude-coder"]
|
coders = ["claude-coder"]
|
||||||
reviewers = ["claude-reviewer"]
|
reviewers = ["claude-reviewer"]
|
||||||
seniors: list[str] = []
|
seniors: list[str] = []
|
||||||
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
|
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
|
||||||
return PipelineConfig(
|
return PipelineConfig(
|
||||||
output_dir=Path("output"),
|
output_dir=Path(".cross-eval/output"),
|
||||||
max_iterations=3,
|
max_iterations=3,
|
||||||
language="ko",
|
language="ko",
|
||||||
|
execution=ExecutionConfig(),
|
||||||
inputs={},
|
inputs={},
|
||||||
agents=agents,
|
agents=agents,
|
||||||
coders=coders,
|
coders=coders,
|
||||||
@@ -363,6 +432,7 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
|
|||||||
system_prompt=agent_data.get("system_prompt"),
|
system_prompt=agent_data.get("system_prompt"),
|
||||||
reasoning_effort=agent_data.get("reasoning_effort"),
|
reasoning_effort=agent_data.get("reasoning_effort"),
|
||||||
stdin_mode=agent_data.get("stdin_mode", False),
|
stdin_mode=agent_data.get("stdin_mode", False),
|
||||||
|
agentic=agent_data.get("agentic", False),
|
||||||
)
|
)
|
||||||
|
|
||||||
# --- roles: explicit or inferred ---
|
# --- roles: explicit or inferred ---
|
||||||
@@ -402,6 +472,17 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
|
|||||||
p = config_dir / p
|
p = config_dir / p
|
||||||
inputs[key] = p
|
inputs[key] = p
|
||||||
|
|
||||||
|
execution_raw = raw.get("execution", {}) or {}
|
||||||
|
execution = ExecutionConfig(
|
||||||
|
mode=execution_raw.get("mode", "agent-decides"),
|
||||||
|
command_policy=execution_raw.get("command_policy", "broad"),
|
||||||
|
inherit_env=bool(execution_raw.get("inherit_env", True)),
|
||||||
|
auto_env_files=list(execution_raw.get("auto_env_files", [".env", ".env.local"])),
|
||||||
|
env_files=list(execution_raw.get("env_files", [])),
|
||||||
|
expose_env_names=bool(execution_raw.get("expose_env_names", True)),
|
||||||
|
auto_context_targets=list(execution_raw.get("auto_context_targets", [])),
|
||||||
|
)
|
||||||
|
|
||||||
# --- pipeline (preset or custom) ---
|
# --- pipeline (preset or custom) ---
|
||||||
steps, phases = _resolve_pipeline(pipeline_raw, coders, reviewers, seniors)
|
steps, phases = _resolve_pipeline(pipeline_raw, coders, reviewers, seniors)
|
||||||
|
|
||||||
@@ -410,12 +491,13 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
|
|||||||
if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
|
if isinstance(pipeline_raw, str) and pipeline_raw.startswith("preset:"):
|
||||||
preset_name = pipeline_raw.split(":", 1)[1]
|
preset_name = pipeline_raw.split(":", 1)[1]
|
||||||
|
|
||||||
return PipelineConfig(
|
config = PipelineConfig(
|
||||||
output_dir=Path(raw.get("output_dir", "output")),
|
output_dir=Path(raw.get("output_dir", ".cross-eval/output")),
|
||||||
max_iterations=int(raw.get("max_iterations", 3)),
|
max_iterations=int(raw.get("max_iterations", 3)),
|
||||||
min_iterations=int(raw.get("min_iterations", 1)),
|
min_iterations=int(raw.get("min_iterations", 1)),
|
||||||
verbose=bool(raw.get("verbose", False)),
|
verbose=bool(raw.get("verbose", False)),
|
||||||
language=raw.get("language", "en"),
|
language=raw.get("language", "en"),
|
||||||
|
execution=execution,
|
||||||
inputs=inputs,
|
inputs=inputs,
|
||||||
agents=agents,
|
agents=agents,
|
||||||
coders=coders,
|
coders=coders,
|
||||||
@@ -427,6 +509,9 @@ def _parse_raw(raw: dict[str, Any], config_path: Path) -> PipelineConfig:
|
|||||||
_config_path=config_path,
|
_config_path=config_path,
|
||||||
_config_mtime=config_path.stat().st_mtime,
|
_config_mtime=config_path.stat().st_mtime,
|
||||||
)
|
)
|
||||||
|
sync_phased_iterations(config)
|
||||||
|
ensure_fix_preset_agentic(config)
|
||||||
|
return config
|
||||||
|
|
||||||
|
|
||||||
def try_reload_config(config: PipelineConfig) -> PipelineConfig:
|
def try_reload_config(config: PipelineConfig) -> PipelineConfig:
|
||||||
@@ -465,7 +550,7 @@ def _resolve_pipeline(
|
|||||||
"""Resolve pipeline from preset string or explicit step list.
|
"""Resolve pipeline from preset string or explicit step list.
|
||||||
|
|
||||||
Returns (steps, phases) tuple. Only one will be non-empty.
|
Returns (steps, phases) tuple. Only one will be non-empty.
|
||||||
- Simple/cross-review/review-only → steps populated, phases empty.
|
- Simple/cross-review/plan-review/review-only → steps populated, phases empty.
|
||||||
- Phased presets (review-fix) → steps empty, phases populated.
|
- Phased presets (review-fix) → steps empty, phases populated.
|
||||||
"""
|
"""
|
||||||
# Preset: "preset:simple" or "preset:review-fix"
|
# Preset: "preset:simple" or "preset:review-fix"
|
||||||
@@ -485,11 +570,15 @@ def _resolve_pipeline(
|
|||||||
if isinstance(pipeline_raw, list):
|
if isinstance(pipeline_raw, list):
|
||||||
steps = []
|
steps = []
|
||||||
for step_data in pipeline_raw:
|
for step_data in pipeline_raw:
|
||||||
|
raw_role = step_data.get("role", "coding")
|
||||||
|
normalized_role = normalize_step_role(raw_role)
|
||||||
steps.append(StepConfig(
|
steps.append(StepConfig(
|
||||||
name=step_data["name"],
|
name=step_data["name"],
|
||||||
agent=step_data["agent"],
|
agent=step_data["agent"],
|
||||||
role=step_data.get("role", "generate"),
|
role=normalized_role,
|
||||||
prompt_template=step_data.get("prompt_template", f"default:{step_data.get('role', 'generate')}"),
|
prompt_template=normalize_prompt_template(
|
||||||
|
step_data.get("prompt_template", f"default:{normalized_role}")
|
||||||
|
),
|
||||||
output_key=step_data["output_key"],
|
output_key=step_data["output_key"],
|
||||||
verdict=step_data.get("verdict", False),
|
verdict=step_data.get("verdict", False),
|
||||||
verdict_pattern=step_data.get("verdict_pattern", r"VERDICT:\s*PASS"),
|
verdict_pattern=step_data.get("verdict_pattern", r"VERDICT:\s*PASS"),
|
||||||
@@ -524,10 +613,6 @@ def validate_config(config: PipelineConfig) -> list[str]:
|
|||||||
errors,
|
errors,
|
||||||
scope=f"Phase '{phase.name}'",
|
scope=f"Phase '{phase.name}'",
|
||||||
)
|
)
|
||||||
if not any(s.verdict for s in phase.steps):
|
|
||||||
errors.append(
|
|
||||||
f"Phase '{phase.name}' must have at least one step with verdict: true"
|
|
||||||
)
|
|
||||||
# Validate verdict patterns
|
# Validate verdict patterns
|
||||||
for step in phase.steps:
|
for step in phase.steps:
|
||||||
if step.verdict:
|
if step.verdict:
|
||||||
@@ -576,6 +661,16 @@ def validate_config(config: PipelineConfig) -> list[str]:
|
|||||||
if config.language not in ("en", "ko"):
|
if config.language not in ("en", "ko"):
|
||||||
errors.append(f"Unsupported language '{config.language}'. Use 'en' or 'ko'.")
|
errors.append(f"Unsupported language '{config.language}'. Use 'en' or 'ko'.")
|
||||||
|
|
||||||
|
if config.execution.mode not in {"agent-decides"}:
|
||||||
|
errors.append(
|
||||||
|
f"Unsupported execution.mode '{config.execution.mode}'. Use 'agent-decides'."
|
||||||
|
)
|
||||||
|
if config.execution.command_policy not in {"broad", "restricted"}:
|
||||||
|
errors.append(
|
||||||
|
"Unsupported execution.command_policy "
|
||||||
|
f"'{config.execution.command_policy}'. Use 'broad' or 'restricted'."
|
||||||
|
)
|
||||||
|
|
||||||
return errors
|
return errors
|
||||||
|
|
||||||
|
|
||||||
@@ -599,6 +694,37 @@ def _validate_unique_step_fields(
|
|||||||
seen_output_keys.add(step.output_key)
|
seen_output_keys.add(step.output_key)
|
||||||
|
|
||||||
|
|
||||||
|
def _make_agentic(agent: AgentConfig) -> None:
|
||||||
|
"""Convert an agent to agentic mode in-place (remove -p, set agentic=True)."""
|
||||||
|
agent.agentic = True
|
||||||
|
agent.args = [a for a in agent.args if a != "-p"]
|
||||||
|
|
||||||
|
|
||||||
|
def sync_phased_iterations(
|
||||||
|
config: PipelineConfig,
|
||||||
|
max_iter: int | None = None,
|
||||||
|
) -> None:
|
||||||
|
"""Apply effective max iterations to converging phases while preserving setup phases."""
|
||||||
|
if not config.phases:
|
||||||
|
return
|
||||||
|
|
||||||
|
effective_max_iter = config.max_iterations if max_iter is None else max_iter
|
||||||
|
for phase in config.phases:
|
||||||
|
if any(step.verdict for step in phase.steps):
|
||||||
|
phase.max_iterations = effective_max_iter
|
||||||
|
|
||||||
|
|
||||||
|
def ensure_fix_preset_agentic(config: PipelineConfig) -> None:
|
||||||
|
"""Fix-style presets should modify code, so coders run agentically by default."""
|
||||||
|
if config.preset_name not in FIX_STYLE_PRESETS:
|
||||||
|
return
|
||||||
|
|
||||||
|
for coder_name in config.coders:
|
||||||
|
agent = config.agents.get(coder_name)
|
||||||
|
if agent is not None and not agent.agentic:
|
||||||
|
_make_agentic(agent)
|
||||||
|
|
||||||
|
|
||||||
def apply_input_overrides(
|
def apply_input_overrides(
|
||||||
config: PipelineConfig, overrides: dict[str, str]
|
config: PipelineConfig, overrides: dict[str, str]
|
||||||
) -> None:
|
) -> None:
|
||||||
|
|||||||
282
cross_eval/demo.py
Normal file
282
cross_eval/demo.py
Normal file
@@ -0,0 +1,282 @@
|
|||||||
|
"""Built-in demo for cross-eval — lets new users see the full lifecycle."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from cross_eval.models import PipelineConfig, PipelineResult
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Built-in demo plan & checklist
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
DEMO_PLAN = """\
|
||||||
|
# Demo: Fibonacci Function
|
||||||
|
|
||||||
|
## Objective
|
||||||
|
Implement a `fibonacci(n)` function in Python.
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
1. `fibonacci(0)` returns `0`, `fibonacci(1)` returns `1`.
|
||||||
|
2. For `n >= 2`, return the sum of the two preceding values.
|
||||||
|
3. Raise `ValueError` for negative `n`.
|
||||||
|
4. Use an iterative approach (not recursive).
|
||||||
|
|
||||||
|
## Constraints
|
||||||
|
- Single file: `fib.py`
|
||||||
|
- No external dependencies.
|
||||||
|
"""
|
||||||
|
|
||||||
|
DEMO_CHECKLIST = """\
|
||||||
|
# Demo Checklist
|
||||||
|
- [ ] fibonacci(0) → 0
|
||||||
|
- [ ] fibonacci(1) → 1
|
||||||
|
- [ ] fibonacci(10) → 55
|
||||||
|
- [ ] fibonacci(-1) raises ValueError
|
||||||
|
- [ ] Iterative implementation (no recursion)
|
||||||
|
- [ ] No unnecessary abstractions
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Mock outputs (realistic-looking)
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_MOCK_CODING_V1 = """\
|
||||||
|
I'll implement the fibonacci function in `fib.py`.
|
||||||
|
|
||||||
|
```python
|
||||||
|
# fib.py
|
||||||
|
|
||||||
|
def fibonacci(n: int) -> int:
|
||||||
|
\"\"\"Return the nth Fibonacci number using iteration.\"\"\"
|
||||||
|
if n < 0:
|
||||||
|
return -1 # invalid input
|
||||||
|
if n <= 1:
|
||||||
|
return n
|
||||||
|
a, b = 0, 1
|
||||||
|
for _ in range(2, n + 1):
|
||||||
|
a, b = b, a + b
|
||||||
|
return b
|
||||||
|
```
|
||||||
|
|
||||||
|
Created `fib.py` with the iterative fibonacci function.
|
||||||
|
"""
|
||||||
|
|
||||||
|
_MOCK_REVIEW_V1 = """\
|
||||||
|
### Previous Feedback Assessment
|
||||||
|
N/A — first iteration.
|
||||||
|
|
||||||
|
### Issues Found
|
||||||
|
- ISS-001 [Major][Omission] Negative input should raise `ValueError`, \
|
||||||
|
but implementation returns `-1` instead. (Requirement 3: "Raise ValueError for negative n")
|
||||||
|
|
||||||
|
### Out of Scope Issues
|
||||||
|
None
|
||||||
|
|
||||||
|
### Summary
|
||||||
|
- Critical: 0, Major: 1, Minor: 0
|
||||||
|
- Over-engineering count: 0
|
||||||
|
- Omission count: 1
|
||||||
|
- CONFIRMED: 0, DISMISSED: 0
|
||||||
|
- Overall quality: Good structure, one requirement gap.
|
||||||
|
|
||||||
|
### Verdict
|
||||||
|
VERDICT: FAIL
|
||||||
|
"""
|
||||||
|
|
||||||
|
_MOCK_CODING_V2 = """\
|
||||||
|
Fixing the negative input handling per review feedback (ISS-001).
|
||||||
|
|
||||||
|
```python
|
||||||
|
# fib.py
|
||||||
|
|
||||||
|
def fibonacci(n: int) -> int:
|
||||||
|
\"\"\"Return the nth Fibonacci number using iteration.\"\"\"
|
||||||
|
if n < 0:
|
||||||
|
raise ValueError(f"n must be non-negative, got {n}")
|
||||||
|
if n <= 1:
|
||||||
|
return n
|
||||||
|
a, b = 0, 1
|
||||||
|
for _ in range(2, n + 1):
|
||||||
|
a, b = b, a + b
|
||||||
|
return b
|
||||||
|
```
|
||||||
|
|
||||||
|
Updated `fib.py`: negative input now raises `ValueError`.
|
||||||
|
"""
|
||||||
|
|
||||||
|
_MOCK_REVIEW_V2 = """\
|
||||||
|
### Previous Feedback Assessment
|
||||||
|
- DISMISSED (false positive): None
|
||||||
|
- CONFIRMED: None — ISS-001 has been fixed.
|
||||||
|
|
||||||
|
### Issues Found
|
||||||
|
None — all checklist items satisfied.
|
||||||
|
|
||||||
|
### Out of Scope Issues
|
||||||
|
None
|
||||||
|
|
||||||
|
### Summary
|
||||||
|
- Critical: 0, Major: 0, Minor: 0
|
||||||
|
- Over-engineering count: 0
|
||||||
|
- Omission count: 0
|
||||||
|
- CONFIRMED: 0, DISMISSED: 0
|
||||||
|
- Overall quality: All requirements met, clean implementation.
|
||||||
|
|
||||||
|
### Verdict
|
||||||
|
VERDICT: PASS
|
||||||
|
"""
|
||||||
|
|
||||||
|
_MOCK_STEPS = [
|
||||||
|
# (iteration, step_name, agent, duration, output_chars, verdict, output)
|
||||||
|
(1, "coding", "claude-coder", 2.1, 347, None, _MOCK_CODING_V1),
|
||||||
|
(1, "review", "claude-reviewer", 1.8, 423, "FAIL", _MOCK_REVIEW_V1),
|
||||||
|
(2, "coding", "claude-coder", 2.3, 382, None, _MOCK_CODING_V2),
|
||||||
|
(2, "review", "claude-reviewer", 1.5, 312, "PASS", _MOCK_REVIEW_V2),
|
||||||
|
]
|
||||||
|
|
||||||
|
_MOCK_ESCALATE_REVIEW = """\
|
||||||
|
### Issues Found
|
||||||
|
- ISS-001 [Critical][Omission] Requirements are ambiguous: "iterative approach" is unclear — \
|
||||||
|
does this exclude memoization? The plan needs clarification from stakeholders.
|
||||||
|
|
||||||
|
### Verdict
|
||||||
|
VERDICT: ESCALATE
|
||||||
|
"""
|
||||||
|
|
||||||
|
_MOCK_ESCALATE_STEPS = [
|
||||||
|
(1, "coding", "claude-coder", 2.1, 347, None, _MOCK_CODING_V1),
|
||||||
|
(1, "review", "claude-reviewer", 1.8, 520, "ESCALATE", _MOCK_ESCALATE_REVIEW),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Mock demo runner
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
DIM = "\033[2m"
|
||||||
|
BOLD = "\033[1m"
|
||||||
|
GREEN = "\033[32m"
|
||||||
|
RED = "\033[31m"
|
||||||
|
YELLOW = "\033[33m"
|
||||||
|
CYAN = "\033[36m"
|
||||||
|
RESET = "\033[0m"
|
||||||
|
|
||||||
|
|
||||||
|
def run_mock_demo(preset: str = "simple", show_escalate: bool = False) -> None:
|
||||||
|
"""Run a simulated demo showing the full pipeline lifecycle."""
|
||||||
|
steps = _MOCK_ESCALATE_STEPS if show_escalate else _MOCK_STEPS
|
||||||
|
|
||||||
|
print(f"\n{BOLD}=== cross-eval demo (mock) ==={RESET}")
|
||||||
|
print(f"{DIM}Preset: {preset} | Coder: claude-coder | Reviewer: claude-reviewer{RESET}")
|
||||||
|
print(f"{DIM}Plan: fibonacci function | Max iterations: 3{RESET}\n")
|
||||||
|
|
||||||
|
current_iter = 0
|
||||||
|
for iteration, step_name, agent, duration, chars, verdict, output in steps:
|
||||||
|
if iteration != current_iter:
|
||||||
|
current_iter = iteration
|
||||||
|
print(f"{BOLD}{'━' * 50}")
|
||||||
|
print(f" Iteration {iteration}/3")
|
||||||
|
print(f"{'━' * 50}{RESET}")
|
||||||
|
|
||||||
|
# Simulate running
|
||||||
|
sys.stdout.write(f" ⠋ [{step_name}] {agent} running...")
|
||||||
|
sys.stdout.flush()
|
||||||
|
time.sleep(0.5)
|
||||||
|
sys.stdout.write(f"\r {GREEN}✓{RESET} [{step_name}] {agent} — {chars} chars ({duration}s)\n")
|
||||||
|
|
||||||
|
if verdict:
|
||||||
|
if verdict == "PASS":
|
||||||
|
color = GREEN
|
||||||
|
elif verdict == "ESCALATE":
|
||||||
|
color = YELLOW
|
||||||
|
else:
|
||||||
|
color = RED
|
||||||
|
print(f" {color}{BOLD}Verdict: {verdict}{RESET}")
|
||||||
|
|
||||||
|
if verdict == "FAIL":
|
||||||
|
# Show key feedback
|
||||||
|
print(f" {DIM}Feedback: ISS-001 [Major] Negative input returns -1 instead of ValueError{RESET}")
|
||||||
|
elif verdict == "ESCALATE":
|
||||||
|
print(f" {YELLOW}Reason: Requirements need clarification from stakeholders{RESET}")
|
||||||
|
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Final result
|
||||||
|
if show_escalate:
|
||||||
|
final = "ESCALATE"
|
||||||
|
color = YELLOW
|
||||||
|
else:
|
||||||
|
final = "PASS"
|
||||||
|
color = GREEN
|
||||||
|
|
||||||
|
print(f"{BOLD}Result: {color}{final}{RESET}")
|
||||||
|
print(f"Iterations: {current_iter}")
|
||||||
|
|
||||||
|
if show_escalate:
|
||||||
|
print(f"\n{RED}{BOLD}{'=' * 50}")
|
||||||
|
print(f" Escalation Report")
|
||||||
|
print(f"{'=' * 50}{RESET}")
|
||||||
|
print(f"{YELLOW}Human review required.{RESET}")
|
||||||
|
print(f" {RED}•{RESET} Requirements are ambiguous — needs stakeholder clarification")
|
||||||
|
print(f"{RED}{BOLD}{'=' * 50}{RESET}")
|
||||||
|
|
||||||
|
print(f"\n{DIM}This was a mock demo. To run with real agents:{RESET}")
|
||||||
|
print(f"{DIM} cross-eval demo --live{RESET}")
|
||||||
|
print(f"{DIM} cross-eval run --plan plan.md{RESET}\n")
|
||||||
|
|
||||||
|
|
||||||
|
def run_live_demo(
|
||||||
|
preset: str = "simple",
|
||||||
|
timeout: int | None = None,
|
||||||
|
) -> PipelineResult:
|
||||||
|
"""Run a live demo with real agents using the built-in plan."""
|
||||||
|
import tempfile
|
||||||
|
|
||||||
|
from cross_eval.config import (
|
||||||
|
BUILTIN_AGENTS,
|
||||||
|
_resolve_agents,
|
||||||
|
apply_reasoning_effort_settings,
|
||||||
|
)
|
||||||
|
from cross_eval.pipeline import run_pipeline
|
||||||
|
from cross_eval.prompts import PHASED_PRESETS, PIPELINE_PRESETS
|
||||||
|
|
||||||
|
coders = ["claude-coder"]
|
||||||
|
reviewers = ["claude-reviewer"]
|
||||||
|
seniors: list[str] = []
|
||||||
|
agents = _resolve_agents(dict(BUILTIN_AGENTS), coders, reviewers, seniors)
|
||||||
|
|
||||||
|
if preset in PIPELINE_PRESETS:
|
||||||
|
pipeline = PIPELINE_PRESETS[preset](coders, reviewers, seniors)
|
||||||
|
phases = []
|
||||||
|
elif preset in PHASED_PRESETS:
|
||||||
|
pipeline = []
|
||||||
|
phases = PHASED_PRESETS[preset](coders, reviewers, seniors)
|
||||||
|
else:
|
||||||
|
pipeline = PIPELINE_PRESETS["simple"](coders, reviewers, seniors)
|
||||||
|
phases = []
|
||||||
|
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
plan_path = Path(tmpdir) / "plan.md"
|
||||||
|
checklist_path = Path(tmpdir) / "checklist.md"
|
||||||
|
plan_path.write_text(DEMO_PLAN, encoding="utf-8")
|
||||||
|
checklist_path.write_text(DEMO_CHECKLIST, encoding="utf-8")
|
||||||
|
|
||||||
|
config = PipelineConfig(
|
||||||
|
output_dir=Path(".cross-eval/output"),
|
||||||
|
max_iterations=3,
|
||||||
|
language="en",
|
||||||
|
inputs={"plan": plan_path, "checklist": checklist_path},
|
||||||
|
agents=agents,
|
||||||
|
coders=coders,
|
||||||
|
reviewers=reviewers,
|
||||||
|
seniors=seniors,
|
||||||
|
pipeline=pipeline,
|
||||||
|
phases=phases,
|
||||||
|
preset_name=f"demo-{preset}",
|
||||||
|
)
|
||||||
|
apply_reasoning_effort_settings(config)
|
||||||
|
|
||||||
|
return run_pipeline(config, timeout=timeout)
|
||||||
200
cross_eval/doctor.py
Normal file
200
cross_eval/doctor.py
Normal file
@@ -0,0 +1,200 @@
|
|||||||
|
"""Environment health checks for cross-eval."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import shutil
|
||||||
|
import subprocess
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class DoctorCheck:
|
||||||
|
"""Result of a single health check."""
|
||||||
|
|
||||||
|
name: str
|
||||||
|
passed: bool
|
||||||
|
critical: bool
|
||||||
|
message: str
|
||||||
|
detail: Optional[str] = None
|
||||||
|
|
||||||
|
|
||||||
|
def check_cli_installed(command: str) -> tuple[bool, str]:
|
||||||
|
"""Check if a CLI tool is on PATH and get its version."""
|
||||||
|
path = shutil.which(command)
|
||||||
|
if not path:
|
||||||
|
return False, f"'{command}' not found on PATH"
|
||||||
|
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
[command, "--version"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=10,
|
||||||
|
)
|
||||||
|
version = (result.stdout.strip() or result.stderr.strip()).split("\n")[0]
|
||||||
|
return True, version or "(version unknown)"
|
||||||
|
except (subprocess.TimeoutExpired, OSError):
|
||||||
|
return True, "(installed but version check failed)"
|
||||||
|
|
||||||
|
|
||||||
|
def check_cli_authenticated(command: str) -> tuple[bool, str]:
|
||||||
|
"""Check if a CLI tool is authenticated by running a minimal probe."""
|
||||||
|
path = shutil.which(command)
|
||||||
|
if not path:
|
||||||
|
return False, "not installed"
|
||||||
|
|
||||||
|
if command == "claude":
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
[command, "-p", "--model", "haiku", "--max-turns", "1"],
|
||||||
|
input="respond with just 'ok'",
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=30,
|
||||||
|
)
|
||||||
|
combined = result.stdout + result.stderr
|
||||||
|
if any(kw in combined.lower() for kw in (
|
||||||
|
"not logged in", "login", "unauthorized", "unauthenticated",
|
||||||
|
"api key", "invalid key",
|
||||||
|
)):
|
||||||
|
return False, "not authenticated — run: claude login"
|
||||||
|
if result.returncode == 0:
|
||||||
|
return True, "authenticated"
|
||||||
|
return False, f"exit code {result.returncode}: {combined[:100]}"
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
return False, "timed out (30s) — possible network issue"
|
||||||
|
except OSError as e:
|
||||||
|
return False, str(e)
|
||||||
|
|
||||||
|
elif command == "codex":
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
[command, "--version"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=10,
|
||||||
|
)
|
||||||
|
combined = result.stdout + result.stderr
|
||||||
|
if any(kw in combined.lower() for kw in (
|
||||||
|
"not logged in", "login", "unauthorized", "api key",
|
||||||
|
)):
|
||||||
|
return False, "not authenticated — run: codex login"
|
||||||
|
return True, "installed (auth check: codex login if needed)"
|
||||||
|
except (subprocess.TimeoutExpired, OSError) as e:
|
||||||
|
return False, str(e)
|
||||||
|
|
||||||
|
return False, f"unknown command: {command}"
|
||||||
|
|
||||||
|
|
||||||
|
def check_config(directory: Path) -> tuple[bool, Optional[Path], list[str]]:
|
||||||
|
"""Check if config.yaml exists and is valid."""
|
||||||
|
config_path = directory / ".cross-eval" / "config.yaml"
|
||||||
|
if not config_path.exists():
|
||||||
|
return False, None, []
|
||||||
|
|
||||||
|
try:
|
||||||
|
from cross_eval.config import load_config
|
||||||
|
load_config(config_path)
|
||||||
|
return True, config_path, []
|
||||||
|
except (ValueError, FileNotFoundError) as e:
|
||||||
|
return False, config_path, [str(e)]
|
||||||
|
|
||||||
|
|
||||||
|
def run_doctor(directory: Path) -> list[DoctorCheck]:
|
||||||
|
"""Run all health checks and return results."""
|
||||||
|
checks: list[DoctorCheck] = []
|
||||||
|
|
||||||
|
# 1. claude CLI
|
||||||
|
installed, version = check_cli_installed("claude")
|
||||||
|
checks.append(DoctorCheck(
|
||||||
|
name="claude CLI",
|
||||||
|
passed=installed,
|
||||||
|
critical=True,
|
||||||
|
message=version if installed else "not found",
|
||||||
|
detail="Install: https://docs.anthropic.com/en/docs/claude-code" if not installed else None,
|
||||||
|
))
|
||||||
|
|
||||||
|
if installed:
|
||||||
|
auth_ok, auth_msg = check_cli_authenticated("claude")
|
||||||
|
checks.append(DoctorCheck(
|
||||||
|
name="claude auth",
|
||||||
|
passed=auth_ok,
|
||||||
|
critical=True,
|
||||||
|
message=auth_msg,
|
||||||
|
))
|
||||||
|
|
||||||
|
# 2. codex CLI
|
||||||
|
installed, version = check_cli_installed("codex")
|
||||||
|
checks.append(DoctorCheck(
|
||||||
|
name="codex CLI",
|
||||||
|
passed=installed,
|
||||||
|
critical=False,
|
||||||
|
message=version if installed else "not found (optional)",
|
||||||
|
detail="Install: https://github.com/openai/codex" if not installed else None,
|
||||||
|
))
|
||||||
|
|
||||||
|
if installed:
|
||||||
|
auth_ok, auth_msg = check_cli_authenticated("codex")
|
||||||
|
checks.append(DoctorCheck(
|
||||||
|
name="codex auth",
|
||||||
|
passed=auth_ok,
|
||||||
|
critical=False,
|
||||||
|
message=auth_msg,
|
||||||
|
))
|
||||||
|
|
||||||
|
# 3. Config
|
||||||
|
config_ok, config_path, config_errors = check_config(directory)
|
||||||
|
if config_path is None:
|
||||||
|
checks.append(DoctorCheck(
|
||||||
|
name="config",
|
||||||
|
passed=True, # not having config is fine
|
||||||
|
critical=False,
|
||||||
|
message="no .cross-eval/config.yaml (will use defaults)",
|
||||||
|
detail="Run: cross-eval init",
|
||||||
|
))
|
||||||
|
elif config_ok:
|
||||||
|
checks.append(DoctorCheck(
|
||||||
|
name="config",
|
||||||
|
passed=True,
|
||||||
|
critical=False,
|
||||||
|
message=f"valid ({config_path.name})",
|
||||||
|
))
|
||||||
|
else:
|
||||||
|
checks.append(DoctorCheck(
|
||||||
|
name="config",
|
||||||
|
passed=False,
|
||||||
|
critical=True,
|
||||||
|
message="invalid config",
|
||||||
|
detail="\n".join(config_errors),
|
||||||
|
))
|
||||||
|
|
||||||
|
return checks
|
||||||
|
|
||||||
|
|
||||||
|
def format_doctor_results(checks: list[DoctorCheck]) -> str:
|
||||||
|
"""Format doctor check results for terminal output."""
|
||||||
|
lines: list[str] = []
|
||||||
|
lines.append("\n cross-eval doctor\n")
|
||||||
|
|
||||||
|
for check in checks:
|
||||||
|
icon = " ✓" if check.passed else " ✗"
|
||||||
|
lines.append(f"{icon} {check.name}: {check.message}")
|
||||||
|
if check.detail and not check.passed:
|
||||||
|
for detail_line in check.detail.split("\n"):
|
||||||
|
lines.append(f" {detail_line}")
|
||||||
|
|
||||||
|
# Summary
|
||||||
|
failed_critical = [c for c in checks if not c.passed and c.critical]
|
||||||
|
failed_warn = [c for c in checks if not c.passed and not c.critical]
|
||||||
|
|
||||||
|
lines.append("")
|
||||||
|
if not failed_critical and not failed_warn:
|
||||||
|
lines.append(" All checks passed!")
|
||||||
|
elif failed_critical:
|
||||||
|
lines.append(f" {len(failed_critical)} critical issue(s) found.")
|
||||||
|
else:
|
||||||
|
lines.append(f" {len(failed_warn)} warning(s), no critical issues.")
|
||||||
|
|
||||||
|
lines.append("")
|
||||||
|
return "\n".join(lines)
|
||||||
@@ -16,6 +16,7 @@ class AgentConfig:
|
|||||||
system_prompt: Optional[str] = None
|
system_prompt: Optional[str] = None
|
||||||
reasoning_effort: Optional[str] = None
|
reasoning_effort: Optional[str] = None
|
||||||
stdin_mode: bool = False
|
stdin_mode: bool = False
|
||||||
|
agentic: bool = False # run in worktree, capture git diff instead of stdout
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
@@ -24,7 +25,7 @@ class StepConfig:
|
|||||||
|
|
||||||
name: str
|
name: str
|
||||||
agent: str # reference to agents key
|
agent: str # reference to agents key
|
||||||
role: str # "generate" or "review"
|
role: str # "coding" or "review"
|
||||||
prompt_template: str # "default:<role>" or file path
|
prompt_template: str # "default:<role>" or file path
|
||||||
output_key: str
|
output_key: str
|
||||||
verdict: bool = False
|
verdict: bool = False
|
||||||
@@ -43,15 +44,29 @@ class PhaseConfig:
|
|||||||
consecutive_pass: int = 1 # stop after N consecutive PASSes
|
consecutive_pass: int = 1 # stop after N consecutive PASSes
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ExecutionConfig:
|
||||||
|
"""Runtime execution policy for agent subprocesses."""
|
||||||
|
|
||||||
|
mode: str = "agent-decides"
|
||||||
|
command_policy: str = "broad"
|
||||||
|
inherit_env: bool = True
|
||||||
|
auto_env_files: list[str] = field(default_factory=lambda: [".env", ".env.local"])
|
||||||
|
env_files: list[str] = field(default_factory=list)
|
||||||
|
expose_env_names: bool = True
|
||||||
|
auto_context_targets: list[str] = field(default_factory=list)
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class PipelineConfig:
|
class PipelineConfig:
|
||||||
"""Full cross-eval configuration."""
|
"""Full cross-eval configuration."""
|
||||||
|
|
||||||
output_dir: Path = field(default_factory=lambda: Path("output"))
|
output_dir: Path = field(default_factory=lambda: Path(".cross-eval/output"))
|
||||||
max_iterations: int = 3
|
max_iterations: int = 3
|
||||||
min_iterations: int = 1
|
min_iterations: int = 1
|
||||||
verbose: bool = False
|
verbose: bool = False
|
||||||
language: str = "en" # "en" or "ko"
|
language: str = "en" # "en" or "ko"
|
||||||
|
execution: ExecutionConfig = field(default_factory=ExecutionConfig)
|
||||||
inputs: dict[str, Path | str] = field(default_factory=dict)
|
inputs: dict[str, Path | str] = field(default_factory=dict)
|
||||||
agents: dict[str, AgentConfig] = field(default_factory=dict)
|
agents: dict[str, AgentConfig] = field(default_factory=dict)
|
||||||
coders: list[str] = field(default_factory=list)
|
coders: list[str] = field(default_factory=list)
|
||||||
@@ -105,6 +120,7 @@ class IterationResult:
|
|||||||
phase_name: Optional[str] = None
|
phase_name: Optional[str] = None
|
||||||
repeated_aggregate_warning: Optional[str] = None
|
repeated_aggregate_warning: Optional[str] = None
|
||||||
review_metrics: Optional[ReviewMetrics] = None
|
review_metrics: Optional[ReviewMetrics] = None
|
||||||
|
escalated_issues: Optional[str] = None
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
@@ -116,3 +132,5 @@ class PipelineResult:
|
|||||||
total_duration: float = 0.0
|
total_duration: float = 0.0
|
||||||
run_dir: Optional[Path] = None
|
run_dir: Optional[Path] = None
|
||||||
repeated_aggregate_warnings: list[str] = field(default_factory=list)
|
repeated_aggregate_warnings: list[str] = field(default_factory=list)
|
||||||
|
escalated_issues: list[str] = field(default_factory=list)
|
||||||
|
agentic_branch: Optional[str] = None
|
||||||
|
|||||||
File diff suppressed because it is too large
Load Diff
@@ -12,7 +12,7 @@ from cross_eval.models import PhaseConfig, StepConfig
|
|||||||
# Default prompt templates
|
# Default prompt templates
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
GENERATE_TEMPLATE = """\
|
CODING_TEMPLATE = """\
|
||||||
You are tasked with implementing code based on a plan and checklist.
|
You are tasked with implementing code based on a plan and checklist.
|
||||||
|
|
||||||
## Plan
|
## Plan
|
||||||
@@ -53,8 +53,8 @@ You are tasked with reviewing code against a plan and checklist.
|
|||||||
## Reference Documents
|
## Reference Documents
|
||||||
{docs}
|
{docs}
|
||||||
|
|
||||||
## Generated Code / Previous Step Output
|
## Coding Output / Previous Step Output
|
||||||
{generated_code}
|
{coding_output}
|
||||||
|
|
||||||
## Previous Review Feedback
|
## Previous Review Feedback
|
||||||
{feedback}
|
{feedback}
|
||||||
@@ -94,10 +94,10 @@ security concerns, performance problems), report them separately under \
|
|||||||
(Write "N/A" if no previous feedback was provided.)
|
(Write "N/A" if no previous feedback was provided.)
|
||||||
|
|
||||||
### Issues Found
|
### Issues Found
|
||||||
List issues ordered by severity (Critical first):
|
List issues ordered by severity (Critical first). Assign each issue a unique ID (ISS-NNN):
|
||||||
- [Critical][Over-engineering] Description (reference specific plan/checklist item)
|
- ISS-001 [Critical][Over-engineering] Description (reference specific plan/checklist item)
|
||||||
- [Major][Omission] Description (reference specific plan/checklist item)
|
- ISS-002 [Major][Omission] Description (reference specific plan/checklist item)
|
||||||
- [Minor][Omission] Description (reference specific plan/checklist item)
|
- ISS-003 [Minor][Omission] Description (reference specific plan/checklist item)
|
||||||
|
|
||||||
### Out of Scope Issues
|
### Out of Scope Issues
|
||||||
Issues found outside plan/checklist scope but worth noting:
|
Issues found outside plan/checklist scope but worth noting:
|
||||||
@@ -119,7 +119,7 @@ Otherwise output: VERDICT: FAIL
|
|||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
GENERATE_TEMPLATE_KO = """\
|
CODING_TEMPLATE_KO = """\
|
||||||
당신은 기획서와 체크리스트를 기반으로 코드를 구현하는 개발자입니다.
|
당신은 기획서와 체크리스트를 기반으로 코드를 구현하는 개발자입니다.
|
||||||
|
|
||||||
## 기획서
|
## 기획서
|
||||||
@@ -159,7 +159,7 @@ REVIEW_TEMPLATE_KO = """\
|
|||||||
{docs}
|
{docs}
|
||||||
|
|
||||||
## 검토 대상 코드
|
## 검토 대상 코드
|
||||||
{generated_code}
|
{coding_output}
|
||||||
|
|
||||||
## 이전 리뷰 피드백
|
## 이전 리뷰 피드백
|
||||||
{feedback}
|
{feedback}
|
||||||
@@ -195,10 +195,10 @@ REVIEW_TEMPLATE_KO = """\
|
|||||||
(이전 피드백이 없으면 "해당 없음"이라고 작성하세요.)
|
(이전 피드백이 없으면 "해당 없음"이라고 작성하세요.)
|
||||||
|
|
||||||
### 발견된 이슈
|
### 발견된 이슈
|
||||||
심각도 순서(Critical 먼저)로 나열:
|
심각도 순서(Critical 먼저)로 나열. 각 이슈에 고유 ID(ISS-NNN)를 부여하세요:
|
||||||
- [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
- ISS-001 [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
||||||
- [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
- ISS-002 [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
||||||
- [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
- ISS-003 [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
||||||
|
|
||||||
### 범위 밖 이슈
|
### 범위 밖 이슈
|
||||||
기획서/체크리스트 범위 밖이지만 주목할 만한 이슈:
|
기획서/체크리스트 범위 밖이지만 주목할 만한 이슈:
|
||||||
@@ -357,6 +357,150 @@ REVIEW_ONLY_TEMPLATE_KO = """\
|
|||||||
그렇지 않으면: VERDICT: FAIL
|
그렇지 않으면: VERDICT: FAIL
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
PLAN_REVIEW_TEMPLATE = """\
|
||||||
|
You are tasked with reviewing planning documents before implementation begins.
|
||||||
|
|
||||||
|
## Plan
|
||||||
|
{plan}
|
||||||
|
|
||||||
|
## Checklist
|
||||||
|
{checklist}
|
||||||
|
|
||||||
|
## Reference Documents
|
||||||
|
{docs}
|
||||||
|
|
||||||
|
## Previous Review (iteration {iteration} of {max_iterations})
|
||||||
|
{feedback}
|
||||||
|
|
||||||
|
## Review Instructions
|
||||||
|
Review the planning package itself: the plan, checklist, and reference documents.
|
||||||
|
You MAY inspect the current repository to validate feasibility, constraints, and integration assumptions.
|
||||||
|
Do NOT write or modify code. Assume implementation has NOT started yet.
|
||||||
|
|
||||||
|
Your job is to find planning issues that would likely cause bad implementation outcomes:
|
||||||
|
- Ambiguous or contradictory requirements
|
||||||
|
- Missing acceptance criteria, constraints, edge cases, or dependencies
|
||||||
|
- Scope that is broader or more complex than the stated objective
|
||||||
|
- Checklist items that do not verify the actual requirements
|
||||||
|
- Plan details that conflict with the current codebase or architecture
|
||||||
|
|
||||||
|
If previous review results are provided above, you MUST:
|
||||||
|
1. Verify each previously reported issue — is it a real issue or a false positive?
|
||||||
|
2. Look for issues the previous review MISSED.
|
||||||
|
3. Do NOT simply repeat the previous review. Provide your own independent assessment.
|
||||||
|
4. Explicitly mark items as CONFIRMED (still an issue) or DISMISSED (false positive).
|
||||||
|
|
||||||
|
For each issue found, classify it with BOTH severity AND category:
|
||||||
|
|
||||||
|
Severity levels:
|
||||||
|
- **Critical**: The plan is likely to cause fundamentally wrong implementation or unsafe behavior.
|
||||||
|
- **Major**: Important requirements, constraints, or acceptance criteria are unclear, conflicting, missing, or incompatible with the existing system.
|
||||||
|
- **Minor**: Wording, structure, or checklist quality problems that reduce implementation clarity.
|
||||||
|
|
||||||
|
Categories:
|
||||||
|
- **Over-engineering**: The plan introduces scope, abstractions, or complexity not justified by the stated objective.
|
||||||
|
- **Omission**: A necessary requirement, constraint, acceptance criterion, edge case, dependency, or compatibility consideration is missing or incomplete.
|
||||||
|
|
||||||
|
If you find issues outside the planning scope (e.g. repository health, pre-existing code problems), report them separately under "Out of Scope Issues".
|
||||||
|
|
||||||
|
## Output Format
|
||||||
|
|
||||||
|
### Issues Found
|
||||||
|
List issues ordered by severity (Critical first):
|
||||||
|
- [Critical][Over-engineering] Description (reference specific plan/checklist item)
|
||||||
|
- [Major][Omission] Description (reference specific plan/checklist item)
|
||||||
|
- [Minor][Omission] Description (reference specific plan/checklist item)
|
||||||
|
|
||||||
|
### Out of Scope Issues
|
||||||
|
Issues found outside planning scope but worth noting:
|
||||||
|
- [Critical] Description of issue
|
||||||
|
- [Minor] Description of issue
|
||||||
|
(Write "None" if no out-of-scope issues found.)
|
||||||
|
|
||||||
|
### Summary
|
||||||
|
- Critical: N, Major: N, Minor: N
|
||||||
|
- Over-engineering count: N
|
||||||
|
- Omission count: N
|
||||||
|
- CONFIRMED: N, DISMISSED: N
|
||||||
|
- Overall quality: [BRIEF ASSESSMENT]
|
||||||
|
|
||||||
|
### Verdict
|
||||||
|
If the planning documents are clear, complete enough to implement, compatible with the current repository, and free of unjustified scope, output: VERDICT: PASS
|
||||||
|
Otherwise output: VERDICT: FAIL
|
||||||
|
"""
|
||||||
|
|
||||||
|
PLAN_REVIEW_TEMPLATE_KO = """\
|
||||||
|
당신은 구현 시작 전에 기획 문서를 검토하는 리뷰어입니다.
|
||||||
|
|
||||||
|
## 기획서
|
||||||
|
{plan}
|
||||||
|
|
||||||
|
## 체크리스트
|
||||||
|
{checklist}
|
||||||
|
|
||||||
|
## 참고 문서
|
||||||
|
{docs}
|
||||||
|
|
||||||
|
## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
|
||||||
|
{feedback}
|
||||||
|
|
||||||
|
## 검토 지침
|
||||||
|
검토 대상은 코드가 아니라 기획 패키지 자체입니다: 기획서, 체크리스트, 참고 문서를 함께 검토하세요.
|
||||||
|
현재 저장소를 살펴보며 구현 가능성, 제약조건, 통합 가정이 맞는지도 확인할 수 있습니다.
|
||||||
|
코드를 생성하거나 수정하지 마세요. 아직 구현이 시작되지 않았다고 가정하세요.
|
||||||
|
|
||||||
|
목표는 구현 단계에서 문제를 일으킬 기획 결함을 찾는 것입니다:
|
||||||
|
- 요구사항이 모호하거나 서로 충돌하는 경우
|
||||||
|
- 수용 기준, 제약조건, 엣지 케이스, 의존성이 빠진 경우
|
||||||
|
- 목표 대비 범위가 지나치게 넓거나 복잡한 경우
|
||||||
|
- 체크리스트가 실제 요구사항 검증에 충분하지 않은 경우
|
||||||
|
- 기획 내용이 현재 코드베이스나 아키텍처와 충돌하는 경우
|
||||||
|
|
||||||
|
이전 리뷰 결과가 제공된 경우 반드시:
|
||||||
|
1. 이전에 보고된 각 이슈를 검증하세요 — 진짜 이슈인지 오탐인지?
|
||||||
|
2. 이전 리뷰가 놓친 새로운 이슈를 찾으세요.
|
||||||
|
3. 이전 리뷰를 그대로 반복하지 마세요. 독립적인 평가를 제공하세요.
|
||||||
|
4. 각 항목에 CONFIRMED (여전히 이슈) 또는 DISMISSED (오탐) 태그를 명시하세요.
|
||||||
|
|
||||||
|
발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요:
|
||||||
|
|
||||||
|
심각도:
|
||||||
|
- **Critical**: 잘못된 구현이나 위험한 동작으로 직결될 가능성이 큰 기획 결함.
|
||||||
|
- **Major**: 중요한 요구사항, 제약조건, 수용 기준이 모호하거나 충돌하거나 누락되었거나 기존 시스템과 맞지 않는 경우.
|
||||||
|
- **Minor**: 문서 표현, 구조, 체크리스트 품질 문제로 구현 명확성이 떨어지는 경우.
|
||||||
|
|
||||||
|
카테고리:
|
||||||
|
- **과최적화**: 목표 대비 불필요한 범위, 추상화, 복잡성을 기획에 추가한 경우.
|
||||||
|
- **누락**: 필요한 요구사항, 제약조건, 수용 기준, 엣지 케이스, 의존성, 호환성 고려가 빠졌거나 불완전한 경우.
|
||||||
|
|
||||||
|
기획 범위 밖에서 발견된 문제(저장소 상태, 기존 코드 문제 등)는 "범위 밖 이슈" 섹션에 별도로 보고하세요.
|
||||||
|
|
||||||
|
## 출력 형식
|
||||||
|
|
||||||
|
### 발견된 이슈
|
||||||
|
심각도 순서(Critical 먼저)로 나열:
|
||||||
|
- [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
||||||
|
- [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
||||||
|
- [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
|
||||||
|
|
||||||
|
### 범위 밖 이슈
|
||||||
|
기획 범위 밖이지만 주목할 만한 이슈:
|
||||||
|
- [Critical] 이슈 설명
|
||||||
|
- [Minor] 이슈 설명
|
||||||
|
(범위 밖 이슈가 없으면 "없음"이라고 작성하세요.)
|
||||||
|
|
||||||
|
### 요약
|
||||||
|
- Critical: N, Major: N, Minor: N
|
||||||
|
- 과최적화 수: N
|
||||||
|
- 누락 수: N
|
||||||
|
- CONFIRMED: N, DISMISSED: N
|
||||||
|
- 전체 품질: [간략한 평가]
|
||||||
|
|
||||||
|
### 판정
|
||||||
|
기획 문서가 구현 가능한 수준으로 명확하고 충분하며 현재 저장소와도 정합적이고, 불필요한 범위 확장이 없으면: VERDICT: PASS
|
||||||
|
그렇지 않으면: VERDICT: FAIL
|
||||||
|
"""
|
||||||
|
|
||||||
AGGREGATE_REVIEW_TEMPLATE = """\
|
AGGREGATE_REVIEW_TEMPLATE = """\
|
||||||
You are adjudicating multiple review results and turning them into an actionable decision.
|
You are adjudicating multiple review results and turning them into an actionable decision.
|
||||||
|
|
||||||
@@ -378,6 +522,9 @@ You are adjudicating multiple review results and turning them into an actionable
|
|||||||
## Previous Verification Feedback
|
## Previous Verification Feedback
|
||||||
{feedback}
|
{feedback}
|
||||||
|
|
||||||
|
## Previous Issue Tracker
|
||||||
|
{previous_senior_tracker}
|
||||||
|
|
||||||
## Instructions
|
## Instructions
|
||||||
Explore the project directory to confirm the current codebase state. Then:
|
Explore the project directory to confirm the current codebase state. Then:
|
||||||
1. Deduplicate overlapping issues across reviewers.
|
1. Deduplicate overlapping issues across reviewers.
|
||||||
@@ -385,7 +532,12 @@ Explore the project directory to confirm the current codebase state. Then:
|
|||||||
3. Keep only issues supported by the plan, checklist, code, or reviewer evidence.
|
3. Keep only issues supported by the plan, checklist, code, or reviewer evidence.
|
||||||
4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up.
|
4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up.
|
||||||
5. Produce a prioritized action list for the coder.
|
5. Produce a prioritized action list for the coder.
|
||||||
6. If no confirmed issue remains, output VERDICT: PASS. Otherwise VERDICT: FAIL.
|
6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues).
|
||||||
|
7. If no confirmed issue remains, output VERDICT: PASS.
|
||||||
|
8. If issues exist that the coder can fix, output VERDICT: FAIL.
|
||||||
|
9. If issues require human intervention (ambiguous requirements, architecture decisions, \
|
||||||
|
external dependency problems, or the same issue persists after 2+ fix attempts), \
|
||||||
|
output VERDICT: ESCALATE.
|
||||||
|
|
||||||
## Output Format
|
## Output Format
|
||||||
|
|
||||||
@@ -401,13 +553,19 @@ Explore the project directory to confirm the current codebase state. Then:
|
|||||||
1. Concrete fix the coder should make
|
1. Concrete fix the coder should make
|
||||||
2. Concrete fix the coder should make
|
2. Concrete fix the coder should make
|
||||||
|
|
||||||
|
## Issue Tracker
|
||||||
|
|
||||||
|
| ISS-ID | Severity | Description | Status | Since |
|
||||||
|
|--------|----------|-------------|--------|-------|
|
||||||
|
| ISS-001 | Critical | ... | Open/Fixed/Dismissed | v1 |
|
||||||
|
|
||||||
### Summary
|
### Summary
|
||||||
- Confirmed issues: N
|
- Confirmed issues: N
|
||||||
- Dismissed findings: N (false positive: N, already fixed: N)
|
- Dismissed findings: N (false positive: N, already fixed: N)
|
||||||
- Overall quality: [BRIEF ASSESSMENT]
|
- Overall quality: [BRIEF ASSESSMENT]
|
||||||
|
|
||||||
### Verdict
|
### Verdict
|
||||||
VERDICT: PASS or VERDICT: FAIL
|
VERDICT: PASS or VERDICT: FAIL or VERDICT: ESCALATE
|
||||||
"""
|
"""
|
||||||
|
|
||||||
AGGREGATE_REVIEW_TEMPLATE_KO = """\
|
AGGREGATE_REVIEW_TEMPLATE_KO = """\
|
||||||
@@ -431,6 +589,9 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
|
|||||||
## 이전 검증 피드백
|
## 이전 검증 피드백
|
||||||
{feedback}
|
{feedback}
|
||||||
|
|
||||||
|
## 이전 이슈 트래커
|
||||||
|
{previous_senior_tracker}
|
||||||
|
|
||||||
## 지침
|
## 지침
|
||||||
프로젝트 디렉토리를 탐색하여 현재 코드베이스 상태를 확인한 뒤 다음을 수행하세요.
|
프로젝트 디렉토리를 탐색하여 현재 코드베이스 상태를 확인한 뒤 다음을 수행하세요.
|
||||||
1. 리뷰어들 사이에 중복되는 이슈를 합치세요.
|
1. 리뷰어들 사이에 중복되는 이슈를 합치세요.
|
||||||
@@ -438,7 +599,11 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
|
|||||||
3. 기획서, 체크리스트, 코드, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
|
3. 기획서, 체크리스트, 코드, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
|
||||||
4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요.
|
4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요.
|
||||||
5. coder가 바로 수정할 수 있는 우선순위 액션 아이템을 만드세요.
|
5. coder가 바로 수정할 수 있는 우선순위 액션 아이템을 만드세요.
|
||||||
6. 확정된 이슈가 없으면 VERDICT: PASS, 있으면 VERDICT: FAIL 을 출력하세요.
|
6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월).
|
||||||
|
7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요.
|
||||||
|
8. coder가 수정 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
|
||||||
|
9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \
|
||||||
|
동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요.
|
||||||
|
|
||||||
## 출력 형식
|
## 출력 형식
|
||||||
|
|
||||||
@@ -454,26 +619,34 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
|
|||||||
1. coder가 수정해야 할 구체적인 작업
|
1. coder가 수정해야 할 구체적인 작업
|
||||||
2. coder가 수정해야 할 구체적인 작업
|
2. coder가 수정해야 할 구체적인 작업
|
||||||
|
|
||||||
|
## 이슈 트래커
|
||||||
|
|
||||||
|
| ISS-ID | 심각도 | 설명 | 상태 | 최초 발견 |
|
||||||
|
|--------|--------|------|------|-----------|
|
||||||
|
| ISS-001 | Critical | ... | Open/Fixed/Dismissed | v1 |
|
||||||
|
|
||||||
### 요약
|
### 요약
|
||||||
- 확정 이슈 수: N
|
- 확정 이슈 수: N
|
||||||
- 기각된 주장 수: N (오탐: N, 수정 완료: N)
|
- 기각된 주장 수: N (오탐: N, 수정 완료: N)
|
||||||
- 전체 품질: [간략한 평가]
|
- 전체 품질: [간략한 평가]
|
||||||
|
|
||||||
### 판정
|
### 판정
|
||||||
VERDICT: PASS 또는 VERDICT: FAIL
|
VERDICT: PASS 또는 VERDICT: FAIL 또는 VERDICT: ESCALATE
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
||||||
DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
|
DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
|
||||||
"en": {
|
"en": {
|
||||||
"generate": GENERATE_TEMPLATE,
|
"coding": CODING_TEMPLATE,
|
||||||
"review": REVIEW_TEMPLATE,
|
"review": REVIEW_TEMPLATE,
|
||||||
|
"plan-review": PLAN_REVIEW_TEMPLATE,
|
||||||
"review-only": REVIEW_ONLY_TEMPLATE,
|
"review-only": REVIEW_ONLY_TEMPLATE,
|
||||||
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
|
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
|
||||||
},
|
},
|
||||||
"ko": {
|
"ko": {
|
||||||
"generate": GENERATE_TEMPLATE_KO,
|
"coding": CODING_TEMPLATE_KO,
|
||||||
"review": REVIEW_TEMPLATE_KO,
|
"review": REVIEW_TEMPLATE_KO,
|
||||||
|
"plan-review": PLAN_REVIEW_TEMPLATE_KO,
|
||||||
"review-only": REVIEW_ONLY_TEMPLATE_KO,
|
"review-only": REVIEW_ONLY_TEMPLATE_KO,
|
||||||
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
|
"aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
|
||||||
},
|
},
|
||||||
@@ -544,18 +717,18 @@ def _build_named_bundle(
|
|||||||
def _build_simple_preset(
|
def _build_simple_preset(
|
||||||
coders: list[str], reviewers: list[str], seniors: list[str],
|
coders: list[str], reviewers: list[str], seniors: list[str],
|
||||||
) -> list[StepConfig]:
|
) -> list[StepConfig]:
|
||||||
"""First coder generates, first reviewer reviews."""
|
"""First coder writes code, first reviewer reviews."""
|
||||||
if not coders:
|
if not coders:
|
||||||
raise ValueError("'simple' preset requires at least 1 coder")
|
raise ValueError("'simple' preset requires at least 1 coder")
|
||||||
if not reviewers:
|
if not reviewers:
|
||||||
raise ValueError("'simple' preset requires at least 1 reviewer")
|
raise ValueError("'simple' preset requires at least 1 reviewer")
|
||||||
steps = [
|
steps = [
|
||||||
StepConfig(
|
StepConfig(
|
||||||
name="generate",
|
name="coding",
|
||||||
agent=coders[0],
|
agent=coders[0],
|
||||||
role="generate",
|
role="coding",
|
||||||
prompt_template="default:generate",
|
prompt_template="default:coding",
|
||||||
output_key="generated_code",
|
output_key="coding_output",
|
||||||
),
|
),
|
||||||
StepConfig(
|
StepConfig(
|
||||||
name="review",
|
name="review",
|
||||||
@@ -576,7 +749,7 @@ def _build_simple_preset(
|
|||||||
output_key="senior_review_result",
|
output_key="senior_review_result",
|
||||||
verdict=True,
|
verdict=True,
|
||||||
context_override={
|
context_override={
|
||||||
"candidate_outputs": "## Generated code\n{generated_code}",
|
"candidate_outputs": "## Coding output\n{coding_output}",
|
||||||
"reviews_bundle": f"## Review: {reviewers[0]} (review)\n{{review_result}}",
|
"reviews_bundle": f"## Review: {reviewers[0]} (review)\n{{review_result}}",
|
||||||
},
|
},
|
||||||
),
|
),
|
||||||
@@ -587,25 +760,25 @@ def _build_simple_preset(
|
|||||||
def _build_cross_review_preset(
|
def _build_cross_review_preset(
|
||||||
coders: list[str], reviewers: list[str], seniors: list[str],
|
coders: list[str], reviewers: list[str], seniors: list[str],
|
||||||
) -> list[StepConfig]:
|
) -> list[StepConfig]:
|
||||||
"""Both coders generate, then cross-review each other's output."""
|
"""Both coders write code, then cross-review each other's output."""
|
||||||
if len(coders) < 2:
|
if len(coders) < 2:
|
||||||
raise ValueError("'cross-review' preset requires at least 2 coders")
|
raise ValueError("'cross-review' preset requires at least 2 coders")
|
||||||
a, b = coders[0], coders[1]
|
a, b = coders[0], coders[1]
|
||||||
ak, bk = _unique_safe_keys([a, b])
|
ak, bk = _unique_safe_keys([a, b])
|
||||||
steps = [
|
steps = [
|
||||||
StepConfig(
|
StepConfig(
|
||||||
name=f"generate_{ak}",
|
name=f"coding_{ak}",
|
||||||
agent=a,
|
agent=a,
|
||||||
role="generate",
|
role="coding",
|
||||||
prompt_template="default:generate",
|
prompt_template="default:coding",
|
||||||
output_key=f"code_{ak}",
|
output_key=f"code_{ak}",
|
||||||
parallel=True,
|
parallel=True,
|
||||||
),
|
),
|
||||||
StepConfig(
|
StepConfig(
|
||||||
name=f"generate_{bk}",
|
name=f"coding_{bk}",
|
||||||
agent=b,
|
agent=b,
|
||||||
role="generate",
|
role="coding",
|
||||||
prompt_template="default:generate",
|
prompt_template="default:coding",
|
||||||
output_key=f"code_{bk}",
|
output_key=f"code_{bk}",
|
||||||
parallel=True,
|
parallel=True,
|
||||||
),
|
),
|
||||||
@@ -615,7 +788,7 @@ def _build_cross_review_preset(
|
|||||||
role="review",
|
role="review",
|
||||||
prompt_template="default:review",
|
prompt_template="default:review",
|
||||||
output_key=f"review_by_{ak}",
|
output_key=f"review_by_{ak}",
|
||||||
context_override={"generated_code": f"{{code_{bk}}}"},
|
context_override={"coding_output": f"{{code_{bk}}}"},
|
||||||
parallel=True,
|
parallel=True,
|
||||||
verdict=not seniors,
|
verdict=not seniors,
|
||||||
),
|
),
|
||||||
@@ -626,7 +799,7 @@ def _build_cross_review_preset(
|
|||||||
prompt_template="default:review",
|
prompt_template="default:review",
|
||||||
output_key=f"review_by_{bk}",
|
output_key=f"review_by_{bk}",
|
||||||
verdict=not seniors,
|
verdict=not seniors,
|
||||||
context_override={"generated_code": f"{{code_{ak}}}"},
|
context_override={"coding_output": f"{{code_{ak}}}"},
|
||||||
parallel=True,
|
parallel=True,
|
||||||
),
|
),
|
||||||
]
|
]
|
||||||
@@ -642,9 +815,9 @@ def _build_cross_review_preset(
|
|||||||
context_override={
|
context_override={
|
||||||
"candidate_outputs": _build_named_bundle(
|
"candidate_outputs": _build_named_bundle(
|
||||||
[a, b],
|
[a, b],
|
||||||
[f"generate_{ak}", f"generate_{bk}"],
|
[f"coding_{ak}", f"coding_{bk}"],
|
||||||
[f"code_{ak}", f"code_{bk}"],
|
[f"code_{ak}", f"code_{bk}"],
|
||||||
"Candidate",
|
"Coding Output",
|
||||||
),
|
),
|
||||||
"reviews_bundle": _build_named_bundle(
|
"reviews_bundle": _build_named_bundle(
|
||||||
[a, b],
|
[a, b],
|
||||||
@@ -715,6 +888,61 @@ def _build_review_only_preset(
|
|||||||
return steps
|
return steps
|
||||||
|
|
||||||
|
|
||||||
|
def _build_plan_review_preset(
|
||||||
|
coders: list[str], reviewers: list[str], seniors: list[str],
|
||||||
|
) -> list[StepConfig]:
|
||||||
|
"""Plan-review: reviewers audit planning docs before implementation."""
|
||||||
|
if not reviewers:
|
||||||
|
raise ValueError("'plan-review' preset requires at least 1 reviewer")
|
||||||
|
|
||||||
|
if len(reviewers) == 1 and not seniors:
|
||||||
|
return [
|
||||||
|
StepConfig(
|
||||||
|
name="plan_review",
|
||||||
|
agent=reviewers[0],
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:plan-review",
|
||||||
|
output_key="plan_review_result",
|
||||||
|
verdict=True,
|
||||||
|
),
|
||||||
|
]
|
||||||
|
|
||||||
|
steps: list[StepConfig] = []
|
||||||
|
reviewer_keys = _unique_safe_keys(reviewers)
|
||||||
|
for reviewer, rk in zip(reviewers, reviewer_keys):
|
||||||
|
steps.append(
|
||||||
|
StepConfig(
|
||||||
|
name=f"plan_review_{rk}",
|
||||||
|
agent=reviewer,
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:plan-review",
|
||||||
|
output_key=f"plan_review_{rk}",
|
||||||
|
verdict=not seniors,
|
||||||
|
parallel=True,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
if seniors:
|
||||||
|
step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
|
||||||
|
output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
|
||||||
|
steps.append(
|
||||||
|
StepConfig(
|
||||||
|
name="senior_review",
|
||||||
|
agent=seniors[0],
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:aggregate-review",
|
||||||
|
output_key="senior_review_result",
|
||||||
|
verdict=True,
|
||||||
|
context_override={
|
||||||
|
"candidate_outputs": "Planning documents under review (plan/checklist/reference docs).",
|
||||||
|
"reviews_bundle": _build_named_bundle(
|
||||||
|
reviewers, step_names, output_keys, "Review",
|
||||||
|
),
|
||||||
|
},
|
||||||
|
),
|
||||||
|
)
|
||||||
|
return steps
|
||||||
|
|
||||||
|
|
||||||
def _build_review_fix_preset(
|
def _build_review_fix_preset(
|
||||||
coders: list[str], reviewers: list[str], seniors: list[str],
|
coders: list[str], reviewers: list[str], seniors: list[str],
|
||||||
) -> list[PhaseConfig]:
|
) -> list[PhaseConfig]:
|
||||||
@@ -762,11 +990,11 @@ def _build_review_fix_preset(
|
|||||||
},
|
},
|
||||||
),
|
),
|
||||||
StepConfig(
|
StepConfig(
|
||||||
name="generate",
|
name="coding",
|
||||||
agent=fix_coder,
|
agent=fix_coder,
|
||||||
role="generate",
|
role="coding",
|
||||||
prompt_template="default:generate",
|
prompt_template="default:coding",
|
||||||
output_key="generated_code",
|
output_key="coding_output",
|
||||||
context_override={"feedback": "{aggregate_review}"},
|
context_override={"feedback": "{aggregate_review}"},
|
||||||
),
|
),
|
||||||
StepConfig(
|
StepConfig(
|
||||||
@@ -784,14 +1012,44 @@ def _build_review_fix_preset(
|
|||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _build_coding_review_fix_preset(
|
||||||
|
coders: list[str], reviewers: list[str], seniors: list[str],
|
||||||
|
) -> list[PhaseConfig]:
|
||||||
|
"""Write code once, then run the review-fix convergence loop."""
|
||||||
|
if not coders:
|
||||||
|
raise ValueError("'coding-review-fix' preset requires at least 1 coder")
|
||||||
|
if not reviewers:
|
||||||
|
raise ValueError("'coding-review-fix' preset requires at least 1 reviewer")
|
||||||
|
|
||||||
|
return [
|
||||||
|
PhaseConfig(
|
||||||
|
name="initial_coding",
|
||||||
|
steps=[
|
||||||
|
StepConfig(
|
||||||
|
name="coding",
|
||||||
|
agent=coders[0],
|
||||||
|
role="coding",
|
||||||
|
prompt_template="default:coding",
|
||||||
|
output_key="coding_output",
|
||||||
|
),
|
||||||
|
],
|
||||||
|
max_iterations=1,
|
||||||
|
consecutive_pass=1,
|
||||||
|
),
|
||||||
|
*_build_review_fix_preset(coders, reviewers, seniors),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
PIPELINE_PRESETS: dict[str, Callable] = {
|
PIPELINE_PRESETS: dict[str, Callable] = {
|
||||||
"simple": _build_simple_preset,
|
"simple": _build_simple_preset,
|
||||||
"cross-review": _build_cross_review_preset,
|
"cross-review": _build_cross_review_preset,
|
||||||
|
"plan-review": _build_plan_review_preset,
|
||||||
"review-only": _build_review_only_preset,
|
"review-only": _build_review_only_preset,
|
||||||
}
|
}
|
||||||
|
|
||||||
PHASED_PRESETS: dict[str, Callable] = {
|
PHASED_PRESETS: dict[str, Callable] = {
|
||||||
"review-fix": _build_review_fix_preset,
|
"review-fix": _build_review_fix_preset,
|
||||||
|
"coding-review-fix": _build_coding_review_fix_preset,
|
||||||
}
|
}
|
||||||
|
|
||||||
ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
|
ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
|
||||||
@@ -805,7 +1063,7 @@ def resolve_template(template_ref: str, templates_dir: Optional[Path] = None) ->
|
|||||||
"""Resolve a template reference to its content string.
|
"""Resolve a template reference to its content string.
|
||||||
|
|
||||||
Formats:
|
Formats:
|
||||||
- "default:generate" -> built-in GENERATE_TEMPLATE
|
- "default:coding" -> built-in CODING_TEMPLATE
|
||||||
- "default:review" -> built-in REVIEW_TEMPLATE
|
- "default:review" -> built-in REVIEW_TEMPLATE
|
||||||
- "path/to/file.md" -> read file contents
|
- "path/to/file.md" -> read file contents
|
||||||
"""
|
"""
|
||||||
|
|||||||
@@ -48,11 +48,16 @@ _STRINGS: dict[str, dict[str, str]] = {
|
|||||||
"pass_msg": "All checklist items satisfied. No over-engineering or omissions detected.",
|
"pass_msg": "All checklist items satisfied. No over-engineering or omissions detected.",
|
||||||
"fail_phased": "Pipeline phases ({phases}) completed without full convergence.",
|
"fail_phased": "Pipeline phases ({phases}) completed without full convergence.",
|
||||||
"fail_simple": "Maximum iterations ({max_iter}) reached without passing all checks.",
|
"fail_simple": "Maximum iterations ({max_iter}) reached without passing all checks.",
|
||||||
|
"escalate_msg": "Human review required. The following issues could not be resolved automatically:",
|
||||||
|
"escalate_title": "Escalation Report",
|
||||||
|
"issue_tracker_title": "Issue Tracker Summary",
|
||||||
|
"issue_tracker_desc": "Issues discovered across iterations and their final resolution status.",
|
||||||
"metrics_title": "Review Metrics",
|
"metrics_title": "Review Metrics",
|
||||||
"metrics_trend_title": "Metrics Trend",
|
"metrics_trend_title": "Metrics Trend",
|
||||||
"metrics_iter": "Iter",
|
"metrics_iter": "Iter",
|
||||||
"metrics_total_issues": "Total Issues",
|
"metrics_total_issues": "Total Issues",
|
||||||
"metrics_na": "N/A",
|
"metrics_na": "N/A",
|
||||||
|
"iteration_details": "Iteration Details",
|
||||||
},
|
},
|
||||||
"ko": {
|
"ko": {
|
||||||
"title": "교차 검증 리포트",
|
"title": "교차 검증 리포트",
|
||||||
@@ -84,11 +89,16 @@ _STRINGS: dict[str, dict[str, str]] = {
|
|||||||
"pass_msg": "모든 체크리스트 항목 충족. 과최적화/누락 없음.",
|
"pass_msg": "모든 체크리스트 항목 충족. 과최적화/누락 없음.",
|
||||||
"fail_phased": "파이프라인 페이즈 ({phases}) 완료, 완전한 수렴에 도달하지 못함.",
|
"fail_phased": "파이프라인 페이즈 ({phases}) 완료, 완전한 수렴에 도달하지 못함.",
|
||||||
"fail_simple": "최대 반복 횟수 ({max_iter})에 도달, 모든 검증을 통과하지 못함.",
|
"fail_simple": "최대 반복 횟수 ({max_iter})에 도달, 모든 검증을 통과하지 못함.",
|
||||||
|
"escalate_msg": "사람의 확인이 필요합니다. 아래 이슈는 자동으로 해결할 수 없었습니다:",
|
||||||
|
"escalate_title": "에스컬레이션 리포트",
|
||||||
|
"issue_tracker_title": "이슈 트래커 요약",
|
||||||
|
"issue_tracker_desc": "반복 과정에서 발견된 이슈와 최종 처리 상태입니다.",
|
||||||
"metrics_title": "리뷰 메트릭",
|
"metrics_title": "리뷰 메트릭",
|
||||||
"metrics_trend_title": "메트릭 추이",
|
"metrics_trend_title": "메트릭 추이",
|
||||||
"metrics_iter": "반복",
|
"metrics_iter": "반복",
|
||||||
"metrics_total_issues": "총 이슈",
|
"metrics_total_issues": "총 이슈",
|
||||||
"metrics_na": "해당 없음",
|
"metrics_na": "해당 없음",
|
||||||
|
"iteration_details": "반복 상세",
|
||||||
},
|
},
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -181,20 +191,41 @@ def _build_simple_report(
|
|||||||
|
|
||||||
out_of_scope_items: list[tuple[int, str]] = []
|
out_of_scope_items: list[tuple[int, str]] = []
|
||||||
|
|
||||||
|
# Pre-scan iterations to collect out-of-scope items and review metrics
|
||||||
|
# (needed before rendering final verdict / metrics sections)
|
||||||
for iter_result in result.iterations:
|
for iter_result in result.iterations:
|
||||||
lines.append("---\n")
|
for step in config.pipeline:
|
||||||
lines.append(f"## {_t(config, 'iteration')} {iter_result.iteration}\n")
|
output = iter_result.step_outputs.get(step.output_key, "")
|
||||||
|
if step.role == "review":
|
||||||
|
oos = _extract_out_of_scope(output)
|
||||||
|
if oos:
|
||||||
|
out_of_scope_items.append((iter_result.iteration, oos))
|
||||||
|
step_metrics = parse_review_metrics(output)
|
||||||
|
if iter_result.review_metrics is None:
|
||||||
|
iter_result.review_metrics = step_metrics
|
||||||
|
else:
|
||||||
|
iter_result.review_metrics = _aggregate_metrics(
|
||||||
|
iter_result.review_metrics, step_metrics,
|
||||||
|
)
|
||||||
|
|
||||||
_append_iteration_steps(lines, config, iter_result, config.pipeline, out_of_scope_items)
|
_append_final_verdict(lines, config, result)
|
||||||
|
_append_issue_tracker_summary(lines, config, result)
|
||||||
|
_append_review_metrics_table(lines, config, result)
|
||||||
|
|
||||||
|
lines.append("---\n")
|
||||||
|
lines.append(f"## {_t(config, 'iteration_details')}\n")
|
||||||
|
|
||||||
|
for iter_result in result.iterations:
|
||||||
|
lines.append(f"### {_t(config, 'iteration')} {iter_result.iteration}\n")
|
||||||
|
|
||||||
|
_append_iteration_steps(lines, config, iter_result, config.pipeline, out_of_scope_items, skip_extraction=True)
|
||||||
|
|
||||||
if iter_result.feedback:
|
if iter_result.feedback:
|
||||||
lines.append(f"**{_t(config, 'feedback_next')}** {iter_result.feedback[:200]}...")
|
lines.append(f"**{_t(config, 'feedback_next')}** {iter_result.feedback[:200]}...")
|
||||||
lines.append("")
|
lines.append("")
|
||||||
|
|
||||||
_append_out_of_scope(lines, config, out_of_scope_items)
|
_append_out_of_scope(lines, config, out_of_scope_items)
|
||||||
_append_review_metrics_table(lines, config, result)
|
|
||||||
_append_repeated_aggregate(lines, config, result)
|
_append_repeated_aggregate(lines, config, result)
|
||||||
_append_final_verdict(lines, config, result)
|
|
||||||
|
|
||||||
return "\n".join(lines)
|
return "\n".join(lines)
|
||||||
|
|
||||||
@@ -211,14 +242,42 @@ def _build_phased_report(
|
|||||||
phase_map = {p.name: p for p in config.phases}
|
phase_map = {p.name: p for p in config.phases}
|
||||||
out_of_scope_items: list[tuple[int, str]] = []
|
out_of_scope_items: list[tuple[int, str]] = []
|
||||||
|
|
||||||
|
# Pre-scan iterations to collect out-of-scope items and review metrics
|
||||||
|
for phase_name, phase_iters_iter in groupby(
|
||||||
|
result.iterations, key=lambda ir: ir.phase_name,
|
||||||
|
):
|
||||||
|
phase_iters = list(phase_iters_iter)
|
||||||
|
phase_config = phase_map.get(phase_name or "")
|
||||||
|
steps = phase_config.steps if phase_config else config.pipeline
|
||||||
|
for iter_result in phase_iters:
|
||||||
|
for step in steps:
|
||||||
|
output = iter_result.step_outputs.get(step.output_key, "")
|
||||||
|
if step.role == "review":
|
||||||
|
oos = _extract_out_of_scope(output)
|
||||||
|
if oos:
|
||||||
|
out_of_scope_items.append((iter_result.iteration, oos))
|
||||||
|
step_metrics = parse_review_metrics(output)
|
||||||
|
if iter_result.review_metrics is None:
|
||||||
|
iter_result.review_metrics = step_metrics
|
||||||
|
else:
|
||||||
|
iter_result.review_metrics = _aggregate_metrics(
|
||||||
|
iter_result.review_metrics, step_metrics,
|
||||||
|
)
|
||||||
|
|
||||||
|
_append_final_verdict(lines, config, result)
|
||||||
|
_append_issue_tracker_summary(lines, config, result)
|
||||||
|
_append_review_metrics_table(lines, config, result)
|
||||||
|
|
||||||
|
lines.append("---\n")
|
||||||
|
lines.append(f"## {_t(config, 'iteration_details')}\n")
|
||||||
|
|
||||||
for phase_name, phase_iters_iter in groupby(
|
for phase_name, phase_iters_iter in groupby(
|
||||||
result.iterations, key=lambda ir: ir.phase_name,
|
result.iterations, key=lambda ir: ir.phase_name,
|
||||||
):
|
):
|
||||||
phase_iters = list(phase_iters_iter)
|
phase_iters = list(phase_iters_iter)
|
||||||
phase_config = phase_map.get(phase_name or "")
|
phase_config = phase_map.get(phase_name or "")
|
||||||
|
|
||||||
lines.append("---\n")
|
lines.append(f"### {_t(config, 'phase')}: {phase_name}\n")
|
||||||
lines.append(f"## {_t(config, 'phase')}: {phase_name}\n")
|
|
||||||
|
|
||||||
if phase_config:
|
if phase_config:
|
||||||
step_desc = " → ".join(s.name for s in phase_config.steps)
|
step_desc = " → ".join(s.name for s in phase_config.steps)
|
||||||
@@ -242,14 +301,17 @@ def _build_phased_report(
|
|||||||
verdict_label += " ✓"
|
verdict_label += " ✓"
|
||||||
else:
|
else:
|
||||||
verdict_label = " — PASS ✓"
|
verdict_label = " — PASS ✓"
|
||||||
|
elif iter_result.verdict == "ESCALATE":
|
||||||
|
consecutive = 0
|
||||||
|
verdict_label = " — ESCALATE"
|
||||||
else:
|
else:
|
||||||
consecutive = 0
|
consecutive = 0
|
||||||
verdict_label = " — FAIL"
|
verdict_label = " — FAIL"
|
||||||
|
|
||||||
lines.append(
|
lines.append(
|
||||||
f"### {_t(config, 'iteration')} {iter_result.iteration}{verdict_label}\n"
|
f"#### {_t(config, 'iteration')} {iter_result.iteration}{verdict_label}\n"
|
||||||
)
|
)
|
||||||
_append_iteration_steps(lines, config, iter_result, steps, out_of_scope_items)
|
_append_iteration_steps(lines, config, iter_result, steps, out_of_scope_items, skip_extraction=True)
|
||||||
|
|
||||||
if iter_result.feedback:
|
if iter_result.feedback:
|
||||||
lines.append(
|
lines.append(
|
||||||
@@ -258,9 +320,7 @@ def _build_phased_report(
|
|||||||
lines.append("")
|
lines.append("")
|
||||||
|
|
||||||
_append_out_of_scope(lines, config, out_of_scope_items)
|
_append_out_of_scope(lines, config, out_of_scope_items)
|
||||||
_append_review_metrics_table(lines, config, result)
|
|
||||||
_append_repeated_aggregate(lines, config, result)
|
_append_repeated_aggregate(lines, config, result)
|
||||||
_append_final_verdict(lines, config, result)
|
|
||||||
|
|
||||||
return "\n".join(lines)
|
return "\n".join(lines)
|
||||||
|
|
||||||
@@ -309,8 +369,14 @@ def _append_iteration_steps(
|
|||||||
iter_result: IterationResult,
|
iter_result: IterationResult,
|
||||||
steps: list[StepConfig],
|
steps: list[StepConfig],
|
||||||
out_of_scope_items: list[tuple[int, str]],
|
out_of_scope_items: list[tuple[int, str]],
|
||||||
|
*,
|
||||||
|
skip_extraction: bool = False,
|
||||||
) -> None:
|
) -> None:
|
||||||
"""Append step details for one iteration."""
|
"""Append step details for one iteration.
|
||||||
|
|
||||||
|
If *skip_extraction* is True, out-of-scope and review-metrics parsing
|
||||||
|
is skipped (useful when a pre-scan already collected that data).
|
||||||
|
"""
|
||||||
for step in steps:
|
for step in steps:
|
||||||
agent_result = iter_result.step_results.get(step.output_key)
|
agent_result = iter_result.step_results.get(step.output_key)
|
||||||
output = iter_result.step_outputs.get(step.output_key, "")
|
output = iter_result.step_outputs.get(step.output_key, "")
|
||||||
@@ -334,7 +400,7 @@ def _append_iteration_steps(
|
|||||||
lines.append(output)
|
lines.append(output)
|
||||||
lines.append("")
|
lines.append("")
|
||||||
|
|
||||||
if step.role == "review":
|
if not skip_extraction and step.role == "review":
|
||||||
oos = _extract_out_of_scope(output)
|
oos = _extract_out_of_scope(output)
|
||||||
if oos:
|
if oos:
|
||||||
out_of_scope_items.append((iter_result.iteration, oos))
|
out_of_scope_items.append((iter_result.iteration, oos))
|
||||||
@@ -469,8 +535,18 @@ def _append_final_verdict(
|
|||||||
lines.append("---\n")
|
lines.append("---\n")
|
||||||
lines.append(f"## {_t(config, 'final_verdict_title')}: {result.final_verdict}\n")
|
lines.append(f"## {_t(config, 'final_verdict_title')}: {result.final_verdict}\n")
|
||||||
|
|
||||||
|
if result.agentic_branch:
|
||||||
|
lines.append(f"**Agentic branch**: `{result.agentic_branch}`")
|
||||||
|
lines.append(f"```bash\ngit checkout {result.agentic_branch}\n```\n")
|
||||||
|
|
||||||
if result.final_verdict == "PASS":
|
if result.final_verdict == "PASS":
|
||||||
lines.append(_t(config, "pass_msg"))
|
lines.append(_t(config, "pass_msg"))
|
||||||
|
elif result.final_verdict == "ESCALATE":
|
||||||
|
lines.append(_t(config, "escalate_msg"))
|
||||||
|
lines.append("")
|
||||||
|
for issue in result.escalated_issues:
|
||||||
|
lines.append(f"- {issue}")
|
||||||
|
lines.append("")
|
||||||
else:
|
else:
|
||||||
if config.phases:
|
if config.phases:
|
||||||
phase_names = " → ".join(p.name for p in config.phases)
|
phase_names = " → ".join(p.name for p in config.phases)
|
||||||
@@ -481,6 +557,121 @@ def _append_final_verdict(
|
|||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Issue Tracker extraction from senior/aggregate outputs
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
_ISSUE_TRACKER_PATTERN = re.compile(
|
||||||
|
r"##+ (?:Issue Tracker|이슈 트래커)[^\n]*\n((?:\|[^\n]+\|\n?)+)",
|
||||||
|
re.DOTALL,
|
||||||
|
)
|
||||||
|
|
||||||
|
_TRACKER_ROW_PATTERN = re.compile(
|
||||||
|
r"^\|\s*(ISS-\d+)\s*\|\s*(\S+)\s*\|\s*(.*?)\s*\|\s*(\S+)\s*\|\s*(\S+)\s*\|",
|
||||||
|
re.MULTILINE,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _extract_issue_tracker_rows(
|
||||||
|
result: PipelineResult,
|
||||||
|
) -> list[dict[str, str]]:
|
||||||
|
"""Extract the latest Issue Tracker table from pipeline results.
|
||||||
|
|
||||||
|
Scans iteration outputs in reverse to find the most recent tracker table
|
||||||
|
from aggregate/senior review steps. Falls back to parsing individual
|
||||||
|
review outputs for ISS-NNN tagged issues.
|
||||||
|
"""
|
||||||
|
# Try to find a tracker table from the last iteration with one
|
||||||
|
for ir in reversed(result.iterations):
|
||||||
|
for key, output in ir.step_outputs.items():
|
||||||
|
match = _ISSUE_TRACKER_PATTERN.search(output)
|
||||||
|
if not match:
|
||||||
|
continue
|
||||||
|
table_text = match.group(1)
|
||||||
|
rows = []
|
||||||
|
for row_match in _TRACKER_ROW_PATTERN.finditer(table_text):
|
||||||
|
rows.append({
|
||||||
|
"id": row_match.group(1),
|
||||||
|
"severity": row_match.group(2),
|
||||||
|
"description": row_match.group(3).strip(),
|
||||||
|
"status": row_match.group(4),
|
||||||
|
"since": row_match.group(5),
|
||||||
|
})
|
||||||
|
if rows:
|
||||||
|
return rows
|
||||||
|
|
||||||
|
# Fallback: parse ISS-NNN from review outputs across iterations
|
||||||
|
seen: dict[str, dict[str, str]] = {}
|
||||||
|
for ir in result.iterations:
|
||||||
|
for key, output in ir.step_outputs.items():
|
||||||
|
for m in re.finditer(
|
||||||
|
r"(ISS-\d+)\s*\[(\w+)\]\[.*?\]\s*(.*?)(?:\n|$)", output,
|
||||||
|
):
|
||||||
|
iss_id = m.group(1)
|
||||||
|
if iss_id not in seen:
|
||||||
|
seen[iss_id] = {
|
||||||
|
"id": iss_id,
|
||||||
|
"severity": m.group(2),
|
||||||
|
"description": m.group(3).strip()[:80],
|
||||||
|
"status": "Open",
|
||||||
|
"since": f"v{ir.iteration}",
|
||||||
|
}
|
||||||
|
return list(seen.values())
|
||||||
|
|
||||||
|
|
||||||
|
def _append_issue_tracker_summary(
|
||||||
|
lines: list[str],
|
||||||
|
config: PipelineConfig,
|
||||||
|
result: PipelineResult,
|
||||||
|
) -> None:
|
||||||
|
"""Append a consolidated issue tracker table to the report."""
|
||||||
|
rows = _extract_issue_tracker_rows(result)
|
||||||
|
if not rows:
|
||||||
|
return
|
||||||
|
|
||||||
|
lines.append("---\n")
|
||||||
|
lines.append(f"## {_t(config, 'issue_tracker_title')}\n")
|
||||||
|
lines.append(f"{_t(config, 'issue_tracker_desc')}\n")
|
||||||
|
|
||||||
|
lang = getattr(config, "language", "en")
|
||||||
|
if lang == "ko":
|
||||||
|
lines.append("| ISS-ID | 심각도 | 설명 | 상태 | 최초 발견 |")
|
||||||
|
else:
|
||||||
|
lines.append("| ISS-ID | Severity | Description | Status | Since |")
|
||||||
|
lines.append("|--------|----------|-------------|--------|-------|")
|
||||||
|
|
||||||
|
for row in rows:
|
||||||
|
lines.append(
|
||||||
|
f"| {row['id']} | {row['severity']} "
|
||||||
|
f"| {row['description']} | {row['status']} | {row['since']} |"
|
||||||
|
)
|
||||||
|
lines.append("")
|
||||||
|
|
||||||
|
|
||||||
|
def print_escalation_report(
|
||||||
|
config: PipelineConfig,
|
||||||
|
result: PipelineResult,
|
||||||
|
) -> None:
|
||||||
|
"""Print a prominent ANSI-colored escalation report to the terminal."""
|
||||||
|
RED = "\033[31m"
|
||||||
|
YELLOW = "\033[33m"
|
||||||
|
BOLD = "\033[1m"
|
||||||
|
RESET = "\033[0m"
|
||||||
|
|
||||||
|
title = _t(config, "escalate_title")
|
||||||
|
msg = _t(config, "escalate_msg")
|
||||||
|
|
||||||
|
print(f"\n{RED}{BOLD}{'=' * 60}")
|
||||||
|
print(f" {title}")
|
||||||
|
print(f"{'=' * 60}{RESET}\n")
|
||||||
|
print(f"{YELLOW}{msg}{RESET}\n")
|
||||||
|
|
||||||
|
for issue in result.escalated_issues:
|
||||||
|
print(f" {RED}•{RESET} {issue}")
|
||||||
|
|
||||||
|
print(f"\n{RED}{BOLD}{'=' * 60}{RESET}\n")
|
||||||
|
|
||||||
|
|
||||||
def _append_repeated_aggregate(
|
def _append_repeated_aggregate(
|
||||||
lines: list[str],
|
lines: list[str],
|
||||||
config: PipelineConfig,
|
config: PipelineConfig,
|
||||||
|
|||||||
152
cross_eval/runtime_env.py
Normal file
152
cross_eval/runtime_env.py
Normal file
@@ -0,0 +1,152 @@
|
|||||||
|
"""Helpers for building agent runtime environments from .env files."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from cross_eval.models import ExecutionConfig
|
||||||
|
|
||||||
|
_SUMMARY_PREFIXES = (
|
||||||
|
"CLICKHOUSE",
|
||||||
|
"CH_",
|
||||||
|
"DB_",
|
||||||
|
"DATABASE",
|
||||||
|
"PG",
|
||||||
|
"POSTGRES",
|
||||||
|
"MYSQL",
|
||||||
|
"REDIS",
|
||||||
|
"AWS",
|
||||||
|
"S3",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _strip_quotes(value: str) -> str:
|
||||||
|
if len(value) >= 2 and value[0] == value[-1] and value[0] in {"'", '"'}:
|
||||||
|
unwrapped = value[1:-1]
|
||||||
|
if value[0] == '"':
|
||||||
|
return bytes(unwrapped, "utf-8").decode("unicode_escape")
|
||||||
|
return unwrapped
|
||||||
|
return value
|
||||||
|
|
||||||
|
|
||||||
|
def parse_dotenv(path: Path) -> dict[str, str]:
|
||||||
|
"""Parse a simple dotenv file into key/value pairs."""
|
||||||
|
values: dict[str, str] = {}
|
||||||
|
for raw_line in path.read_text(encoding="utf-8").splitlines():
|
||||||
|
line = raw_line.strip()
|
||||||
|
if not line or line.startswith("#"):
|
||||||
|
continue
|
||||||
|
if line.startswith("export "):
|
||||||
|
line = line[len("export ") :].strip()
|
||||||
|
if "=" not in line:
|
||||||
|
continue
|
||||||
|
key, value = line.split("=", 1)
|
||||||
|
key = key.strip()
|
||||||
|
if not key:
|
||||||
|
continue
|
||||||
|
values[key] = _strip_quotes(value.strip())
|
||||||
|
return values
|
||||||
|
|
||||||
|
|
||||||
|
def resolve_env_files(execution: ExecutionConfig, project_root: Path) -> list[Path]:
|
||||||
|
"""Resolve and deduplicate configured env files under the project root."""
|
||||||
|
candidates: list[Path] = []
|
||||||
|
for raw in execution.env_files:
|
||||||
|
path = Path(raw)
|
||||||
|
if not path.is_absolute():
|
||||||
|
path = project_root / path
|
||||||
|
candidates.append(path)
|
||||||
|
|
||||||
|
for raw in execution.auto_env_files:
|
||||||
|
path = project_root / raw
|
||||||
|
candidates.append(path)
|
||||||
|
|
||||||
|
resolved: list[Path] = []
|
||||||
|
seen: set[Path] = set()
|
||||||
|
for path in candidates:
|
||||||
|
try:
|
||||||
|
normalized = path.resolve()
|
||||||
|
except OSError:
|
||||||
|
normalized = path
|
||||||
|
if normalized in seen or not normalized.exists() or not normalized.is_file():
|
||||||
|
continue
|
||||||
|
seen.add(normalized)
|
||||||
|
resolved.append(normalized)
|
||||||
|
return resolved
|
||||||
|
|
||||||
|
|
||||||
|
def build_runtime_environment(
|
||||||
|
execution: ExecutionConfig,
|
||||||
|
project_root: Path,
|
||||||
|
) -> tuple[dict[str, str], list[Path], dict[str, str]]:
|
||||||
|
"""Build subprocess env plus metadata about loaded files and names."""
|
||||||
|
env = os.environ.copy() if execution.inherit_env else {}
|
||||||
|
loaded_files = resolve_env_files(execution, project_root)
|
||||||
|
loaded_values: dict[str, str] = {}
|
||||||
|
for path in loaded_files:
|
||||||
|
file_values = parse_dotenv(path)
|
||||||
|
loaded_values.update(file_values)
|
||||||
|
env.update(file_values)
|
||||||
|
return env, loaded_files, loaded_values
|
||||||
|
|
||||||
|
|
||||||
|
def summarize_environment(
|
||||||
|
execution: ExecutionConfig,
|
||||||
|
loaded_files: list[Path],
|
||||||
|
env: dict[str, str],
|
||||||
|
loaded_values: dict[str, str],
|
||||||
|
) -> str:
|
||||||
|
"""Generate a safe environment summary for prompts without leaking secrets."""
|
||||||
|
lines: list[str] = []
|
||||||
|
if loaded_files:
|
||||||
|
joined = ", ".join(str(path) for path in loaded_files)
|
||||||
|
lines.append(f"Loaded env files into the agent process: {joined}")
|
||||||
|
else:
|
||||||
|
lines.append("No .env file was auto-loaded into the agent process.")
|
||||||
|
|
||||||
|
if execution.auto_context_targets:
|
||||||
|
lines.append(
|
||||||
|
"Execution targets hinted by the user: "
|
||||||
|
+ ", ".join(execution.auto_context_targets)
|
||||||
|
)
|
||||||
|
|
||||||
|
if execution.expose_env_names:
|
||||||
|
visible_names = sorted(
|
||||||
|
{
|
||||||
|
key
|
||||||
|
for key in set(loaded_values) | set(env)
|
||||||
|
if key.startswith(_SUMMARY_PREFIXES)
|
||||||
|
or any(prefix in key for prefix in ("CLICKHOUSE", "DATABASE", "DB_"))
|
||||||
|
}
|
||||||
|
)
|
||||||
|
if visible_names:
|
||||||
|
lines.append("Relevant env var names available to commands: " + ", ".join(visible_names))
|
||||||
|
else:
|
||||||
|
lines.append("No DB/service env var names matched the default summary filters.")
|
||||||
|
else:
|
||||||
|
lines.append("Environment variable values are loaded but names are hidden from the prompt.")
|
||||||
|
|
||||||
|
wants_clickhouse = "clickhouse" in {target.lower() for target in execution.auto_context_targets}
|
||||||
|
clickhouse_keys = [key for key in env if "CLICKHOUSE" in key or key.startswith("CH_")]
|
||||||
|
if wants_clickhouse or clickhouse_keys:
|
||||||
|
if clickhouse_keys:
|
||||||
|
lines.append("ClickHouse-related environment variables are available to the agent.")
|
||||||
|
else:
|
||||||
|
lines.append("No ClickHouse-specific env vars were detected in the loaded environment.")
|
||||||
|
|
||||||
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
def build_execution_policy(execution: ExecutionConfig) -> str:
|
||||||
|
"""Describe the execution latitude granted to agentic coders/reviewers."""
|
||||||
|
lines = [
|
||||||
|
f"Execution mode: {execution.mode}",
|
||||||
|
f"Command policy: {execution.command_policy}",
|
||||||
|
"The agent may choose shell, Python, git, docker, test, and database commands on its own when needed.",
|
||||||
|
"The user does not need to pre-specify exact commands.",
|
||||||
|
]
|
||||||
|
if execution.command_policy == "broad":
|
||||||
|
lines.append("Prefer direct validation by running the minimum set of commands needed to prove a fix.")
|
||||||
|
else:
|
||||||
|
lines.append("Keep command usage minimal and focused on validation.")
|
||||||
|
return "\n".join(lines)
|
||||||
135
cross_eval/worktree.py
Normal file
135
cross_eval/worktree.py
Normal file
@@ -0,0 +1,135 @@
|
|||||||
|
"""Git worktree lifecycle management for agentic mode."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import shutil
|
||||||
|
import subprocess
|
||||||
|
from datetime import datetime
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
class WorktreeError(RuntimeError):
|
||||||
|
"""Error during worktree operations."""
|
||||||
|
|
||||||
|
|
||||||
|
def make_branch_name(preset_name: str) -> str:
|
||||||
|
"""Generate a branch name for agentic results."""
|
||||||
|
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
|
||||||
|
return f"cross-eval/{preset_name}_{ts}"
|
||||||
|
|
||||||
|
|
||||||
|
def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
|
||||||
|
"""Create a git worktree on a new branch from HEAD.
|
||||||
|
|
||||||
|
1. Create branch from HEAD
|
||||||
|
2. Create worktree checked out to that branch
|
||||||
|
|
||||||
|
The branch lives in the original repo, so it survives worktree removal.
|
||||||
|
"""
|
||||||
|
work_dir = work_dir.resolve()
|
||||||
|
if work_dir.exists():
|
||||||
|
shutil.rmtree(work_dir)
|
||||||
|
|
||||||
|
# Create the branch at HEAD
|
||||||
|
try:
|
||||||
|
subprocess.run(
|
||||||
|
["git", "branch", branch_name, "HEAD"],
|
||||||
|
cwd=base_cwd,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
check=True,
|
||||||
|
)
|
||||||
|
except subprocess.CalledProcessError as e:
|
||||||
|
raise WorktreeError(
|
||||||
|
f"Failed to create branch '{branch_name}': {e.stderr.strip()}"
|
||||||
|
) from e
|
||||||
|
|
||||||
|
# Create worktree on that branch
|
||||||
|
try:
|
||||||
|
subprocess.run(
|
||||||
|
["git", "worktree", "add", str(work_dir), branch_name],
|
||||||
|
cwd=base_cwd,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
check=True,
|
||||||
|
)
|
||||||
|
except subprocess.CalledProcessError as e:
|
||||||
|
# Clean up the branch if worktree creation fails
|
||||||
|
subprocess.run(
|
||||||
|
["git", "branch", "-D", branch_name],
|
||||||
|
cwd=base_cwd,
|
||||||
|
capture_output=True,
|
||||||
|
)
|
||||||
|
raise WorktreeError(
|
||||||
|
f"Failed to create worktree at {work_dir}: {e.stderr.strip()}"
|
||||||
|
) from e
|
||||||
|
|
||||||
|
logger.debug("Created worktree on branch '%s': %s", branch_name, work_dir)
|
||||||
|
return work_dir
|
||||||
|
|
||||||
|
|
||||||
|
def capture_diff(worktree_path: Path) -> str:
|
||||||
|
"""Capture all changes made in the worktree as a unified diff.
|
||||||
|
|
||||||
|
Includes both tracked modifications and new untracked files.
|
||||||
|
"""
|
||||||
|
subprocess.run(
|
||||||
|
["git", "add", "-A"],
|
||||||
|
cwd=worktree_path,
|
||||||
|
capture_output=True,
|
||||||
|
check=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
result = subprocess.run(
|
||||||
|
["git", "diff", "--cached", "HEAD"],
|
||||||
|
cwd=worktree_path,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
)
|
||||||
|
return result.stdout.strip()
|
||||||
|
|
||||||
|
|
||||||
|
def commit_worktree(worktree_path: Path, message: str) -> bool:
|
||||||
|
"""Stage and commit all changes in the worktree.
|
||||||
|
|
||||||
|
Returns True if a commit was made, False if nothing to commit.
|
||||||
|
"""
|
||||||
|
subprocess.run(
|
||||||
|
["git", "add", "-A"],
|
||||||
|
cwd=worktree_path,
|
||||||
|
capture_output=True,
|
||||||
|
check=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
result = subprocess.run(
|
||||||
|
["git", "commit", "-m", message],
|
||||||
|
cwd=worktree_path,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
)
|
||||||
|
# exit code 1 = nothing to commit
|
||||||
|
return result.returncode == 0
|
||||||
|
|
||||||
|
|
||||||
|
def remove_worktree(base_cwd: Path, work_dir: Path) -> None:
|
||||||
|
"""Remove a git worktree (branch is preserved in the original repo)."""
|
||||||
|
work_dir = work_dir.resolve()
|
||||||
|
try:
|
||||||
|
subprocess.run(
|
||||||
|
["git", "worktree", "remove", "--force", str(work_dir)],
|
||||||
|
cwd=base_cwd,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
check=True,
|
||||||
|
)
|
||||||
|
except subprocess.CalledProcessError:
|
||||||
|
if work_dir.exists():
|
||||||
|
shutil.rmtree(work_dir, ignore_errors=True)
|
||||||
|
subprocess.run(
|
||||||
|
["git", "worktree", "prune"],
|
||||||
|
cwd=base_cwd,
|
||||||
|
capture_output=True,
|
||||||
|
)
|
||||||
|
logger.debug("Removed worktree: %s (branch preserved)", work_dir)
|
||||||
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|||||||
|
|
||||||
[project]
|
[project]
|
||||||
name = "cross-eval"
|
name = "cross-eval"
|
||||||
version = "0.1.0"
|
version = "0.2.0"
|
||||||
description = "AI agent cross-evaluation CLI tool"
|
description = "AI agent cross-evaluation CLI tool"
|
||||||
requires-python = ">=3.9"
|
requires-python = ">=3.9"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
|
|||||||
701
tests/test_agentic.py
Normal file
701
tests/test_agentic.py
Normal file
@@ -0,0 +1,701 @@
|
|||||||
|
"""Comprehensive tests for the agentic worktree flow.
|
||||||
|
|
||||||
|
Covers:
|
||||||
|
1. worktree.py unit tests (real temp git repo)
|
||||||
|
2. agent.py agentic tests (mocking subprocess)
|
||||||
|
3. config.py _make_agentic tests
|
||||||
|
4. pipeline integration tests (mock invoke_agent / invoke_agent_agentic)
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import subprocess
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
from pathlib import Path
|
||||||
|
from unittest.mock import MagicMock, call, patch
|
||||||
|
|
||||||
|
from cross_eval.agent import invoke_agent_agentic
|
||||||
|
from cross_eval.config import BUILTIN_AGENTS, _make_agentic
|
||||||
|
from cross_eval.models import (
|
||||||
|
AgentConfig,
|
||||||
|
AgentResult,
|
||||||
|
PipelineConfig,
|
||||||
|
StepConfig,
|
||||||
|
)
|
||||||
|
from cross_eval.pipeline import (
|
||||||
|
_commit_iteration,
|
||||||
|
_finalize_worktree,
|
||||||
|
_has_agentic_steps,
|
||||||
|
_setup_worktree,
|
||||||
|
run_pipeline,
|
||||||
|
)
|
||||||
|
from cross_eval.worktree import (
|
||||||
|
capture_diff,
|
||||||
|
commit_worktree,
|
||||||
|
create_worktree,
|
||||||
|
make_branch_name,
|
||||||
|
remove_worktree,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Helpers
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def _init_git_repo(path: Path) -> None:
|
||||||
|
"""Initialise a minimal git repo with one commit."""
|
||||||
|
subprocess.run(["git", "init"], cwd=path, capture_output=True, check=True)
|
||||||
|
subprocess.run(
|
||||||
|
["git", "config", "user.email", "test@test.com"],
|
||||||
|
cwd=path, capture_output=True, check=True,
|
||||||
|
)
|
||||||
|
subprocess.run(
|
||||||
|
["git", "config", "user.name", "Test"],
|
||||||
|
cwd=path, capture_output=True, check=True,
|
||||||
|
)
|
||||||
|
(path / "README.md").write_text("# init\n")
|
||||||
|
subprocess.run(["git", "add", "."], cwd=path, capture_output=True, check=True)
|
||||||
|
subprocess.run(
|
||||||
|
["git", "commit", "-m", "initial"],
|
||||||
|
cwd=path, capture_output=True, check=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ===================================================================
|
||||||
|
# 1. worktree.py unit tests (real temp git repo)
|
||||||
|
# ===================================================================
|
||||||
|
|
||||||
|
class TestCreateWorktree(unittest.TestCase):
|
||||||
|
"""create_worktree creates a worktree on a named branch."""
|
||||||
|
|
||||||
|
def test_creates_worktree_and_branch(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
base = Path(td) / "repo"
|
||||||
|
base.mkdir()
|
||||||
|
_init_git_repo(base)
|
||||||
|
|
||||||
|
wt_dir = Path(td) / "wt"
|
||||||
|
branch = "cross-eval/test_branch"
|
||||||
|
result_path = create_worktree(base, wt_dir, branch)
|
||||||
|
|
||||||
|
# Worktree directory exists
|
||||||
|
self.assertTrue(result_path.exists())
|
||||||
|
# Branch was created in the original repo
|
||||||
|
branches = subprocess.run(
|
||||||
|
["git", "branch", "--list", branch],
|
||||||
|
cwd=base, capture_output=True, text=True,
|
||||||
|
)
|
||||||
|
self.assertIn(branch, branches.stdout)
|
||||||
|
|
||||||
|
# Clean up
|
||||||
|
remove_worktree(base, wt_dir)
|
||||||
|
|
||||||
|
|
||||||
|
class TestCaptureDiff(unittest.TestCase):
|
||||||
|
"""capture_diff captures changes correctly."""
|
||||||
|
|
||||||
|
def test_captures_new_and_modified_files(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
base = Path(td) / "repo"
|
||||||
|
base.mkdir()
|
||||||
|
_init_git_repo(base)
|
||||||
|
|
||||||
|
wt_dir = Path(td) / "wt"
|
||||||
|
branch = "cross-eval/diff_test"
|
||||||
|
create_worktree(base, wt_dir, branch)
|
||||||
|
|
||||||
|
# Make changes in the worktree
|
||||||
|
(wt_dir / "new_file.txt").write_text("hello\n")
|
||||||
|
(wt_dir / "README.md").write_text("# modified\n")
|
||||||
|
|
||||||
|
diff = capture_diff(wt_dir)
|
||||||
|
self.assertIn("new_file.txt", diff)
|
||||||
|
self.assertIn("hello", diff)
|
||||||
|
self.assertIn("modified", diff)
|
||||||
|
|
||||||
|
remove_worktree(base, wt_dir)
|
||||||
|
|
||||||
|
|
||||||
|
class TestCommitWorktree(unittest.TestCase):
|
||||||
|
"""commit_worktree commits changes and returns True; False when nothing to commit."""
|
||||||
|
|
||||||
|
def test_commit_returns_true_on_changes(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
base = Path(td) / "repo"
|
||||||
|
base.mkdir()
|
||||||
|
_init_git_repo(base)
|
||||||
|
|
||||||
|
wt_dir = Path(td) / "wt"
|
||||||
|
branch = "cross-eval/commit_test"
|
||||||
|
create_worktree(base, wt_dir, branch)
|
||||||
|
|
||||||
|
(wt_dir / "file.txt").write_text("data\n")
|
||||||
|
result = commit_worktree(wt_dir, "test commit")
|
||||||
|
self.assertTrue(result)
|
||||||
|
|
||||||
|
remove_worktree(base, wt_dir)
|
||||||
|
|
||||||
|
def test_commit_returns_false_when_nothing_to_commit(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
base = Path(td) / "repo"
|
||||||
|
base.mkdir()
|
||||||
|
_init_git_repo(base)
|
||||||
|
|
||||||
|
wt_dir = Path(td) / "wt"
|
||||||
|
branch = "cross-eval/empty_commit"
|
||||||
|
create_worktree(base, wt_dir, branch)
|
||||||
|
|
||||||
|
result = commit_worktree(wt_dir, "empty")
|
||||||
|
self.assertFalse(result)
|
||||||
|
|
||||||
|
remove_worktree(base, wt_dir)
|
||||||
|
|
||||||
|
|
||||||
|
class TestRemoveWorktree(unittest.TestCase):
|
||||||
|
"""remove_worktree removes worktree but branch survives."""
|
||||||
|
|
||||||
|
def test_branch_survives_worktree_removal(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
base = Path(td) / "repo"
|
||||||
|
base.mkdir()
|
||||||
|
_init_git_repo(base)
|
||||||
|
|
||||||
|
wt_dir = Path(td) / "wt"
|
||||||
|
branch = "cross-eval/remove_test"
|
||||||
|
create_worktree(base, wt_dir, branch)
|
||||||
|
|
||||||
|
remove_worktree(base, wt_dir)
|
||||||
|
|
||||||
|
# Worktree directory should be gone
|
||||||
|
self.assertFalse(wt_dir.exists())
|
||||||
|
|
||||||
|
# Branch should still exist in the original repo
|
||||||
|
branches = subprocess.run(
|
||||||
|
["git", "branch", "--list", branch],
|
||||||
|
cwd=base, capture_output=True, text=True,
|
||||||
|
)
|
||||||
|
self.assertIn(branch, branches.stdout)
|
||||||
|
|
||||||
|
|
||||||
|
class TestMakeBranchName(unittest.TestCase):
|
||||||
|
"""make_branch_name generates expected format."""
|
||||||
|
|
||||||
|
def test_format(self) -> None:
|
||||||
|
name = make_branch_name("review-fix")
|
||||||
|
self.assertTrue(name.startswith("cross-eval/review-fix_"))
|
||||||
|
# Should contain a timestamp-like suffix
|
||||||
|
parts = name.split("_", 1)
|
||||||
|
self.assertEqual(len(parts), 2)
|
||||||
|
# Timestamp portion should be like 20260313_123456
|
||||||
|
ts_part = parts[1] # after "cross-eval/review-fix_"
|
||||||
|
self.assertEqual(len(ts_part), 15) # YYYYMMDD_HHMMSS
|
||||||
|
|
||||||
|
|
||||||
|
# ===================================================================
|
||||||
|
# 2. agent.py agentic tests (mocking subprocess)
|
||||||
|
# ===================================================================
|
||||||
|
|
||||||
|
class TestInvokeAgentAgenticClaude(unittest.TestCase):
|
||||||
|
"""invoke_agent_agentic builds correct cmd for claude (no -p, prompt as positional arg)."""
|
||||||
|
|
||||||
|
@patch("cross_eval.worktree.capture_diff", return_value="diff --git a/file ...")
|
||||||
|
@patch("subprocess.run")
|
||||||
|
def test_claude_cmd_has_no_dash_p_and_prompt_as_positional(
|
||||||
|
self, mock_run: MagicMock, mock_diff: MagicMock,
|
||||||
|
) -> None:
|
||||||
|
mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
|
||||||
|
|
||||||
|
agent = AgentConfig(
|
||||||
|
name="claude-coder",
|
||||||
|
command="claude",
|
||||||
|
args=["--setting-sources", "user", "--dangerously-skip-permissions"],
|
||||||
|
agentic=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
wt = Path(td)
|
||||||
|
_init_git_repo(wt)
|
||||||
|
|
||||||
|
invoke_agent_agentic(
|
||||||
|
agent, "implement feature X", "coding",
|
||||||
|
worktree_path=wt, quiet=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Find the subprocess.run call that actually runs the agent
|
||||||
|
agent_call = None
|
||||||
|
for c in mock_run.call_args_list:
|
||||||
|
cmd = c[0][0] if c[0] else c[1].get("args", [])
|
||||||
|
if cmd and cmd[0] == "claude":
|
||||||
|
agent_call = c
|
||||||
|
break
|
||||||
|
|
||||||
|
self.assertIsNotNone(agent_call, "Expected a subprocess.run call with 'claude'")
|
||||||
|
cmd = agent_call[0][0]
|
||||||
|
|
||||||
|
# No -p flag
|
||||||
|
self.assertNotIn("-p", cmd)
|
||||||
|
# Last arg is a task file reference (not raw prompt — avoids arg length limits)
|
||||||
|
self.assertIn("task file", cmd[-1].lower())
|
||||||
|
|
||||||
|
|
||||||
|
class TestInvokeAgentAgenticCodex(unittest.TestCase):
|
||||||
|
"""invoke_agent_agentic builds correct cmd for codex (stdin mode, - sentinel)."""
|
||||||
|
|
||||||
|
@patch("cross_eval.worktree.capture_diff", return_value="diff --git a/file ...")
|
||||||
|
@patch("subprocess.run")
|
||||||
|
def test_codex_cmd_uses_stdin_with_dash_sentinel(
|
||||||
|
self, mock_run: MagicMock, mock_diff: MagicMock,
|
||||||
|
) -> None:
|
||||||
|
mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
|
||||||
|
|
||||||
|
agent = AgentConfig(
|
||||||
|
name="codex-coder",
|
||||||
|
command="codex",
|
||||||
|
args=["exec", "--full-auto", "--skip-git-repo-check"],
|
||||||
|
agentic=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
wt = Path(td)
|
||||||
|
_init_git_repo(wt)
|
||||||
|
|
||||||
|
invoke_agent_agentic(
|
||||||
|
agent, "implement feature Y", "coding",
|
||||||
|
worktree_path=wt, quiet=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
agent_call = None
|
||||||
|
for c in mock_run.call_args_list:
|
||||||
|
cmd = c[0][0] if c[0] else c[1].get("args", [])
|
||||||
|
if cmd and cmd[0] == "codex":
|
||||||
|
agent_call = c
|
||||||
|
break
|
||||||
|
|
||||||
|
self.assertIsNotNone(agent_call, "Expected a subprocess.run call with 'codex'")
|
||||||
|
cmd = agent_call[0][0]
|
||||||
|
|
||||||
|
# Should have "-" sentinel at the end for stdin
|
||||||
|
self.assertEqual(cmd[-1], "-")
|
||||||
|
# Stdin input should contain the prompt
|
||||||
|
input_data = agent_call[1].get("input")
|
||||||
|
self.assertIsNotNone(input_data)
|
||||||
|
self.assertIn("implement feature Y", input_data)
|
||||||
|
|
||||||
|
|
||||||
|
class TestTaskFileCleanup(unittest.TestCase):
|
||||||
|
"""Task file is cleaned up before capture_diff."""
|
||||||
|
|
||||||
|
@patch("cross_eval.worktree.capture_diff", return_value="(no changes)")
|
||||||
|
@patch("subprocess.run")
|
||||||
|
def test_task_file_in_tmp_not_worktree(
|
||||||
|
self, mock_run: MagicMock, mock_diff: MagicMock,
|
||||||
|
) -> None:
|
||||||
|
mock_run.return_value = MagicMock(returncode=0, stdout="ok", stderr="")
|
||||||
|
|
||||||
|
agent = AgentConfig(
|
||||||
|
name="claude-coder", command="claude", args=[], agentic=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
wt = Path(td)
|
||||||
|
_init_git_repo(wt)
|
||||||
|
|
||||||
|
invoke_agent_agentic(
|
||||||
|
agent, "do stuff", "coding",
|
||||||
|
worktree_path=wt, quiet=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Task file should NOT be in the worktree (it's in /tmp)
|
||||||
|
self.assertFalse((wt / "CROSS_EVAL_TASK.md").exists())
|
||||||
|
|
||||||
|
|
||||||
|
# ===================================================================
|
||||||
|
# 3. config.py tests
|
||||||
|
# ===================================================================
|
||||||
|
|
||||||
|
class TestMakeAgenticClaude(unittest.TestCase):
|
||||||
|
"""_make_agentic strips -p from claude args and sets agentic=True."""
|
||||||
|
|
||||||
|
def test_strips_dash_p_and_sets_agentic(self) -> None:
|
||||||
|
agent = AgentConfig(
|
||||||
|
name="claude-coder",
|
||||||
|
command="claude",
|
||||||
|
args=["-p", "--setting-sources", "user", "--model", "opus"],
|
||||||
|
)
|
||||||
|
self.assertFalse(agent.agentic)
|
||||||
|
_make_agentic(agent)
|
||||||
|
self.assertTrue(agent.agentic)
|
||||||
|
self.assertNotIn("-p", agent.args)
|
||||||
|
self.assertIn("--setting-sources", agent.args)
|
||||||
|
|
||||||
|
def test_idempotent_when_no_dash_p(self) -> None:
|
||||||
|
agent = AgentConfig(
|
||||||
|
name="claude-coder",
|
||||||
|
command="claude",
|
||||||
|
args=["--setting-sources", "user"],
|
||||||
|
)
|
||||||
|
_make_agentic(agent)
|
||||||
|
self.assertTrue(agent.agentic)
|
||||||
|
self.assertEqual(agent.args, ["--setting-sources", "user"])
|
||||||
|
|
||||||
|
|
||||||
|
class TestMakeAgenticCodex(unittest.TestCase):
|
||||||
|
"""_make_agentic on codex agent still works (no -p to strip)."""
|
||||||
|
|
||||||
|
def test_codex_agentic_works(self) -> None:
|
||||||
|
agent = AgentConfig(
|
||||||
|
name="codex-coder",
|
||||||
|
command="codex",
|
||||||
|
args=["exec", "--full-auto", "-"],
|
||||||
|
)
|
||||||
|
_make_agentic(agent)
|
||||||
|
self.assertTrue(agent.agentic)
|
||||||
|
# -p was never there so args are unchanged
|
||||||
|
self.assertIn("exec", agent.args)
|
||||||
|
self.assertIn("--full-auto", agent.args)
|
||||||
|
|
||||||
|
|
||||||
|
# ===================================================================
|
||||||
|
# 4. pipeline integration tests
|
||||||
|
# ===================================================================
|
||||||
|
|
||||||
|
def _make_agentic_config(
|
||||||
|
run_dir: Path,
|
||||||
|
agentic_coder: bool = True,
|
||||||
|
) -> PipelineConfig:
|
||||||
|
"""Build a config with an agentic coder + non-agentic reviewer."""
|
||||||
|
coder = AgentConfig(
|
||||||
|
name="claude-coder", command="claude",
|
||||||
|
args=["--setting-sources", "user"],
|
||||||
|
agentic=agentic_coder,
|
||||||
|
)
|
||||||
|
reviewer = AgentConfig(
|
||||||
|
name="claude-reviewer", command="claude",
|
||||||
|
args=["-p", "--setting-sources", "user"],
|
||||||
|
agentic=False,
|
||||||
|
)
|
||||||
|
steps = [
|
||||||
|
StepConfig(
|
||||||
|
name="coding",
|
||||||
|
agent="claude-coder",
|
||||||
|
role="coding",
|
||||||
|
prompt_template="default:coding",
|
||||||
|
output_key="coding_output",
|
||||||
|
),
|
||||||
|
StepConfig(
|
||||||
|
name="review",
|
||||||
|
agent="claude-reviewer",
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:review",
|
||||||
|
output_key="review_result",
|
||||||
|
verdict=True,
|
||||||
|
),
|
||||||
|
]
|
||||||
|
return PipelineConfig(
|
||||||
|
output_dir=run_dir,
|
||||||
|
max_iterations=2,
|
||||||
|
min_iterations=1,
|
||||||
|
language="en",
|
||||||
|
inputs={"plan": "Test plan", "checklist": "Test checklist"},
|
||||||
|
agents={"claude-coder": coder, "claude-reviewer": reviewer},
|
||||||
|
coders=["claude-coder"],
|
||||||
|
reviewers=["claude-reviewer"],
|
||||||
|
pipeline=steps,
|
||||||
|
preset_name="simple",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
|
||||||
|
"""When agentic agent is configured, _setup_worktree is called."""
|
||||||
|
|
||||||
|
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
|
||||||
|
@patch("cross_eval.pipeline._commit_iteration")
|
||||||
|
@patch("cross_eval.pipeline._setup_worktree")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent_agentic")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent")
|
||||||
|
def test_setup_worktree_called(
|
||||||
|
self,
|
||||||
|
mock_invoke: MagicMock,
|
||||||
|
mock_invoke_agentic: MagicMock,
|
||||||
|
mock_setup: MagicMock,
|
||||||
|
mock_commit_iter: MagicMock,
|
||||||
|
mock_finalize: MagicMock,
|
||||||
|
) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
run_dir = Path(td)
|
||||||
|
config = _make_agentic_config(run_dir)
|
||||||
|
|
||||||
|
wt_path = run_dir / "work"
|
||||||
|
wt_path.mkdir()
|
||||||
|
mock_setup.return_value = (wt_path, "cross-eval/test")
|
||||||
|
|
||||||
|
mock_invoke_agentic.return_value = AgentResult(
|
||||||
|
output="diff output", exit_code=0,
|
||||||
|
agent_name="claude-coder", step_name="coding",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
mock_invoke.return_value = AgentResult(
|
||||||
|
output="VERDICT: PASS", exit_code=0,
|
||||||
|
agent_name="claude-reviewer", step_name="review",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
run_pipeline(config, cwd=Path(td))
|
||||||
|
|
||||||
|
mock_setup.assert_called_once()
|
||||||
|
|
||||||
|
|
||||||
|
class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
|
||||||
|
"""Reviewer runs with worktree cwd (not original cwd) when worktree exists."""
|
||||||
|
|
||||||
|
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
|
||||||
|
@patch("cross_eval.pipeline._commit_iteration")
|
||||||
|
@patch("cross_eval.pipeline._setup_worktree")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent_agentic")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent")
|
||||||
|
def test_reviewer_uses_worktree_cwd(
|
||||||
|
self,
|
||||||
|
mock_invoke: MagicMock,
|
||||||
|
mock_invoke_agentic: MagicMock,
|
||||||
|
mock_setup: MagicMock,
|
||||||
|
mock_commit_iter: MagicMock,
|
||||||
|
mock_finalize: MagicMock,
|
||||||
|
) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
run_dir = Path(td)
|
||||||
|
config = _make_agentic_config(run_dir)
|
||||||
|
|
||||||
|
wt_path = run_dir / "work"
|
||||||
|
wt_path.mkdir()
|
||||||
|
mock_setup.return_value = (wt_path, "cross-eval/test")
|
||||||
|
|
||||||
|
mock_invoke_agentic.return_value = AgentResult(
|
||||||
|
output="diff output", exit_code=0,
|
||||||
|
agent_name="claude-coder", step_name="coding",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
mock_invoke.return_value = AgentResult(
|
||||||
|
output="VERDICT: PASS", exit_code=0,
|
||||||
|
agent_name="claude-reviewer", step_name="review",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
run_pipeline(config, cwd=Path(td))
|
||||||
|
|
||||||
|
# The reviewer (non-agentic) should have been called with cwd=worktree_path
|
||||||
|
reviewer_call = mock_invoke.call_args
|
||||||
|
self.assertEqual(reviewer_call[1].get("cwd") or reviewer_call[0][3], wt_path)
|
||||||
|
|
||||||
|
|
||||||
|
class TestCommitIterationCalled(unittest.TestCase):
|
||||||
|
"""_commit_iteration is called after each iteration when worktree exists."""
|
||||||
|
|
||||||
|
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
|
||||||
|
@patch("cross_eval.pipeline._commit_iteration")
|
||||||
|
@patch("cross_eval.pipeline._setup_worktree")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent_agentic")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent")
|
||||||
|
def test_commit_iteration_called(
|
||||||
|
self,
|
||||||
|
mock_invoke: MagicMock,
|
||||||
|
mock_invoke_agentic: MagicMock,
|
||||||
|
mock_setup: MagicMock,
|
||||||
|
mock_commit_iter: MagicMock,
|
||||||
|
mock_finalize: MagicMock,
|
||||||
|
) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
run_dir = Path(td)
|
||||||
|
config = _make_agentic_config(run_dir)
|
||||||
|
|
||||||
|
wt_path = run_dir / "work"
|
||||||
|
wt_path.mkdir()
|
||||||
|
mock_setup.return_value = (wt_path, "cross-eval/test")
|
||||||
|
|
||||||
|
mock_invoke_agentic.return_value = AgentResult(
|
||||||
|
output="diff output", exit_code=0,
|
||||||
|
agent_name="claude-coder", step_name="coding",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
mock_invoke.return_value = AgentResult(
|
||||||
|
output="VERDICT: PASS", exit_code=0,
|
||||||
|
agent_name="claude-reviewer", step_name="review",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
run_pipeline(config, cwd=Path(td))
|
||||||
|
|
||||||
|
mock_commit_iter.assert_called_once()
|
||||||
|
call_args = mock_commit_iter.call_args
|
||||||
|
self.assertEqual(call_args[0][0], wt_path)
|
||||||
|
|
||||||
|
|
||||||
|
class TestFinalizeWorktreeCalled(unittest.TestCase):
|
||||||
|
"""_finalize_worktree commits and cleans up at end."""
|
||||||
|
|
||||||
|
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
|
||||||
|
@patch("cross_eval.pipeline._commit_iteration")
|
||||||
|
@patch("cross_eval.pipeline._setup_worktree")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent_agentic")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent")
|
||||||
|
def test_finalize_called(
|
||||||
|
self,
|
||||||
|
mock_invoke: MagicMock,
|
||||||
|
mock_invoke_agentic: MagicMock,
|
||||||
|
mock_setup: MagicMock,
|
||||||
|
mock_commit_iter: MagicMock,
|
||||||
|
mock_finalize: MagicMock,
|
||||||
|
) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
run_dir = Path(td)
|
||||||
|
config = _make_agentic_config(run_dir)
|
||||||
|
|
||||||
|
wt_path = run_dir / "work"
|
||||||
|
wt_path.mkdir()
|
||||||
|
mock_setup.return_value = (wt_path, "cross-eval/test")
|
||||||
|
|
||||||
|
mock_invoke_agentic.return_value = AgentResult(
|
||||||
|
output="diff output", exit_code=0,
|
||||||
|
agent_name="claude-coder", step_name="coding",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
mock_invoke.return_value = AgentResult(
|
||||||
|
output="VERDICT: PASS", exit_code=0,
|
||||||
|
agent_name="claude-reviewer", step_name="review",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
run_pipeline(config, cwd=Path(td))
|
||||||
|
|
||||||
|
mock_finalize.assert_called_once()
|
||||||
|
call_args = mock_finalize.call_args
|
||||||
|
# Should pass cwd, worktree_path, branch_name, preset_name, verdict
|
||||||
|
self.assertEqual(call_args[0][1], wt_path)
|
||||||
|
self.assertEqual(call_args[0][2], "cross-eval/test")
|
||||||
|
|
||||||
|
|
||||||
|
class TestParallelAgenticFallsBackToSequential(unittest.TestCase):
|
||||||
|
"""Multiple agentic steps in parallel batch fall back to sequential."""
|
||||||
|
|
||||||
|
def test_has_agentic_steps_detects_agentic(self) -> None:
|
||||||
|
coder = AgentConfig(
|
||||||
|
name="claude-coder", command="claude", args=[], agentic=True,
|
||||||
|
)
|
||||||
|
reviewer = AgentConfig(
|
||||||
|
name="claude-reviewer", command="claude", args=[], agentic=False,
|
||||||
|
)
|
||||||
|
config = PipelineConfig(
|
||||||
|
agents={"claude-coder": coder, "claude-reviewer": reviewer},
|
||||||
|
)
|
||||||
|
steps = [
|
||||||
|
StepConfig(name="a", agent="claude-coder", role="coding",
|
||||||
|
prompt_template="default:coding", output_key="a"),
|
||||||
|
]
|
||||||
|
self.assertTrue(_has_agentic_steps(config, steps))
|
||||||
|
|
||||||
|
def test_has_agentic_steps_returns_false_without_agentic(self) -> None:
|
||||||
|
reviewer = AgentConfig(
|
||||||
|
name="claude-reviewer", command="claude", args=[], agentic=False,
|
||||||
|
)
|
||||||
|
config = PipelineConfig(
|
||||||
|
agents={"claude-reviewer": reviewer},
|
||||||
|
)
|
||||||
|
steps = [
|
||||||
|
StepConfig(name="r", agent="claude-reviewer", role="review",
|
||||||
|
prompt_template="default:review", output_key="r", verdict=True),
|
||||||
|
]
|
||||||
|
self.assertFalse(_has_agentic_steps(config, steps))
|
||||||
|
|
||||||
|
@patch("cross_eval.pipeline._finalize_worktree", return_value="cross-eval/test")
|
||||||
|
@patch("cross_eval.pipeline._commit_iteration")
|
||||||
|
@patch("cross_eval.pipeline._setup_worktree")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent_agentic")
|
||||||
|
@patch("cross_eval.pipeline.invoke_agent")
|
||||||
|
def test_parallel_agentic_runs_sequentially(
|
||||||
|
self,
|
||||||
|
mock_invoke: MagicMock,
|
||||||
|
mock_invoke_agentic: MagicMock,
|
||||||
|
mock_setup: MagicMock,
|
||||||
|
mock_commit_iter: MagicMock,
|
||||||
|
mock_finalize: MagicMock,
|
||||||
|
) -> None:
|
||||||
|
"""When multiple agentic steps are parallel, they should run sequentially."""
|
||||||
|
with tempfile.TemporaryDirectory() as td:
|
||||||
|
run_dir = Path(td)
|
||||||
|
|
||||||
|
coder_a = AgentConfig(
|
||||||
|
name="coder-a", command="claude", args=[], agentic=True,
|
||||||
|
)
|
||||||
|
coder_b = AgentConfig(
|
||||||
|
name="coder-b", command="claude", args=[], agentic=True,
|
||||||
|
)
|
||||||
|
reviewer = AgentConfig(
|
||||||
|
name="reviewer", command="claude", args=["-p"], agentic=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
steps = [
|
||||||
|
StepConfig(
|
||||||
|
name="code_a", agent="coder-a", role="coding",
|
||||||
|
prompt_template="default:coding", output_key="code_a",
|
||||||
|
parallel=True,
|
||||||
|
),
|
||||||
|
StepConfig(
|
||||||
|
name="code_b", agent="coder-b", role="coding",
|
||||||
|
prompt_template="default:coding", output_key="code_b",
|
||||||
|
parallel=True,
|
||||||
|
),
|
||||||
|
StepConfig(
|
||||||
|
name="review", agent="reviewer", role="review",
|
||||||
|
prompt_template="default:review", output_key="review_result",
|
||||||
|
verdict=True,
|
||||||
|
),
|
||||||
|
]
|
||||||
|
|
||||||
|
config = PipelineConfig(
|
||||||
|
output_dir=run_dir,
|
||||||
|
max_iterations=1,
|
||||||
|
min_iterations=1,
|
||||||
|
language="en",
|
||||||
|
inputs={"plan": "Test plan", "checklist": "Test checklist"},
|
||||||
|
agents={
|
||||||
|
"coder-a": coder_a,
|
||||||
|
"coder-b": coder_b,
|
||||||
|
"reviewer": reviewer,
|
||||||
|
},
|
||||||
|
coders=["coder-a", "coder-b"],
|
||||||
|
reviewers=["reviewer"],
|
||||||
|
pipeline=steps,
|
||||||
|
preset_name="custom",
|
||||||
|
)
|
||||||
|
|
||||||
|
wt_path = run_dir / "work"
|
||||||
|
wt_path.mkdir()
|
||||||
|
mock_setup.return_value = (wt_path, "cross-eval/test")
|
||||||
|
|
||||||
|
call_order: list[str] = []
|
||||||
|
|
||||||
|
def _track_agentic(agent_config, prompt, step_name, **kwargs):
|
||||||
|
call_order.append(step_name)
|
||||||
|
return AgentResult(
|
||||||
|
output="diff", exit_code=0,
|
||||||
|
agent_name=agent_config.name, step_name=step_name,
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
mock_invoke_agentic.side_effect = _track_agentic
|
||||||
|
mock_invoke.return_value = AgentResult(
|
||||||
|
output="VERDICT: PASS", exit_code=0,
|
||||||
|
agent_name="reviewer", step_name="review",
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
run_pipeline(config, cwd=Path(td))
|
||||||
|
|
||||||
|
# Both agentic steps should have been called (sequentially)
|
||||||
|
agentic_calls = [c for c in call_order if c.startswith("code_")]
|
||||||
|
self.assertEqual(len(agentic_calls), 2)
|
||||||
|
# They should appear in order (sequential, not concurrent)
|
||||||
|
self.assertEqual(agentic_calls, ["code_a", "code_b"])
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
@@ -1,19 +1,27 @@
|
|||||||
from __future__ import annotations
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import tempfile
|
||||||
import unittest
|
import unittest
|
||||||
|
from pathlib import Path
|
||||||
from unittest.mock import patch
|
from unittest.mock import patch
|
||||||
|
|
||||||
from cross_eval.agent import _supports_reasoning_effort
|
from cross_eval.agent import AgentInvocationError, _supports_reasoning_effort
|
||||||
|
from cross_eval.cli import _apply_phased_iteration_override, main
|
||||||
from cross_eval.agent import invoke_agent
|
from cross_eval.agent import invoke_agent
|
||||||
from cross_eval.config import (
|
from cross_eval.config import (
|
||||||
BUILTIN_AGENTS,
|
BUILTIN_AGENTS,
|
||||||
|
_SENIOR_SYSTEM_PROMPT,
|
||||||
_default_seniors_for_preset,
|
_default_seniors_for_preset,
|
||||||
apply_reasoning_effort_settings,
|
apply_reasoning_effort_settings,
|
||||||
|
load_config,
|
||||||
normalize_reasoning_effort,
|
normalize_reasoning_effort,
|
||||||
|
normalize_prompt_template,
|
||||||
|
normalize_step_role,
|
||||||
validate_config,
|
validate_config,
|
||||||
)
|
)
|
||||||
from cross_eval.models import (
|
from cross_eval.models import (
|
||||||
AgentConfig,
|
AgentConfig,
|
||||||
|
AgentResult,
|
||||||
IterationResult,
|
IterationResult,
|
||||||
PhaseConfig,
|
PhaseConfig,
|
||||||
PipelineConfig,
|
PipelineConfig,
|
||||||
@@ -21,25 +29,52 @@ from cross_eval.models import (
|
|||||||
ReviewMetrics,
|
ReviewMetrics,
|
||||||
StepConfig,
|
StepConfig,
|
||||||
)
|
)
|
||||||
from cross_eval.pipeline import _detect_repeated_aggregate
|
from cross_eval.pipeline import (
|
||||||
|
_detect_auto_escalate,
|
||||||
|
_detect_repeated_aggregate,
|
||||||
|
_execute_parallel_batch,
|
||||||
|
_extract_senior_tracker,
|
||||||
|
_extract_verdict,
|
||||||
|
)
|
||||||
from cross_eval.prompts import (
|
from cross_eval.prompts import (
|
||||||
GENERATE_TEMPLATE,
|
CODING_TEMPLATE,
|
||||||
GENERATE_TEMPLATE_KO,
|
CODING_TEMPLATE_KO,
|
||||||
REVIEW_TEMPLATE,
|
REVIEW_TEMPLATE,
|
||||||
REVIEW_TEMPLATE_KO,
|
REVIEW_TEMPLATE_KO,
|
||||||
|
PLAN_REVIEW_TEMPLATE,
|
||||||
|
PLAN_REVIEW_TEMPLATE_KO,
|
||||||
REVIEW_ONLY_TEMPLATE,
|
REVIEW_ONLY_TEMPLATE,
|
||||||
REVIEW_ONLY_TEMPLATE_KO,
|
REVIEW_ONLY_TEMPLATE_KO,
|
||||||
AGGREGATE_REVIEW_TEMPLATE,
|
AGGREGATE_REVIEW_TEMPLATE,
|
||||||
AGGREGATE_REVIEW_TEMPLATE_KO,
|
AGGREGATE_REVIEW_TEMPLATE_KO,
|
||||||
_build_cross_review_preset,
|
_build_cross_review_preset,
|
||||||
|
_build_coding_review_fix_preset,
|
||||||
|
_build_plan_review_preset,
|
||||||
_build_review_fix_preset,
|
_build_review_fix_preset,
|
||||||
_build_review_only_preset,
|
_build_review_only_preset,
|
||||||
_build_simple_preset,
|
_build_simple_preset,
|
||||||
)
|
)
|
||||||
from cross_eval.report import build_report, parse_review_metrics
|
from cross_eval.report import build_report, parse_review_metrics, print_escalation_report
|
||||||
|
|
||||||
|
|
||||||
class BuiltinAgentConfigTest(unittest.TestCase):
|
class BuiltinAgentConfigTest(unittest.TestCase):
|
||||||
|
def test_claude_builtin_agents_use_user_settings_and_disable_slash_commands(self) -> None:
|
||||||
|
for agent_name in ("claude-coder", "claude-reviewer", "claude-senior"):
|
||||||
|
with self.subTest(agent=agent_name):
|
||||||
|
args = BUILTIN_AGENTS[agent_name].args
|
||||||
|
self.assertIn("--setting-sources", args)
|
||||||
|
self.assertIn("user", args)
|
||||||
|
self.assertIn("--disable-slash-commands", args)
|
||||||
|
|
||||||
|
def test_claude_builtin_agents_use_role_specific_permission_modes(self) -> None:
|
||||||
|
coder_args = BUILTIN_AGENTS["claude-coder"].args
|
||||||
|
reviewer_args = BUILTIN_AGENTS["claude-reviewer"].args
|
||||||
|
senior_args = BUILTIN_AGENTS["claude-senior"].args
|
||||||
|
|
||||||
|
self.assertIn("--dangerously-skip-permissions", coder_args)
|
||||||
|
self.assertIn("bypassPermissions", coder_args)
|
||||||
|
self.assertIn("plan", reviewer_args)
|
||||||
|
self.assertIn("plan", senior_args)
|
||||||
|
|
||||||
def test_codex_builtin_agents_skip_git_repo_check(self) -> None:
|
def test_codex_builtin_agents_skip_git_repo_check(self) -> None:
|
||||||
for agent_name in ("codex-coder", "codex-reviewer", "codex-senior"):
|
for agent_name in ("codex-coder", "codex-reviewer", "codex-senior"):
|
||||||
with self.subTest(agent=agent_name):
|
with self.subTest(agent=agent_name):
|
||||||
@@ -62,6 +97,10 @@ class BuiltinAgentConfigTest(unittest.TestCase):
|
|||||||
self.assertEqual(normalize_reasoning_effort("extra_high"), "xhigh")
|
self.assertEqual(normalize_reasoning_effort("extra_high"), "xhigh")
|
||||||
self.assertEqual(normalize_reasoning_effort("x-high"), "xhigh")
|
self.assertEqual(normalize_reasoning_effort("x-high"), "xhigh")
|
||||||
|
|
||||||
|
def test_normalize_step_role_and_template_aliases(self) -> None:
|
||||||
|
self.assertEqual(normalize_step_role("coding"), "coding")
|
||||||
|
self.assertEqual(normalize_prompt_template("default:coding"), "default:coding")
|
||||||
|
|
||||||
def test_apply_reasoning_effort_settings_uses_defaults_and_role_overrides(self) -> None:
|
def test_apply_reasoning_effort_settings_uses_defaults_and_role_overrides(self) -> None:
|
||||||
config = PipelineConfig(
|
config = PipelineConfig(
|
||||||
agents={
|
agents={
|
||||||
@@ -116,6 +155,123 @@ class BuiltinAgentConfigTest(unittest.TestCase):
|
|||||||
["codex", "-c", 'model_reasoning_effort="high"'],
|
["codex", "-c", 'model_reasoning_effort="high"'],
|
||||||
)
|
)
|
||||||
|
|
||||||
|
def test_invoke_agent_classifies_auth_failures(self) -> None:
|
||||||
|
def _fake_run(cmd, **kwargs):
|
||||||
|
class _Result:
|
||||||
|
returncode = 1
|
||||||
|
stdout = ""
|
||||||
|
stderr = "Not logged in · Please run /login"
|
||||||
|
|
||||||
|
return _Result()
|
||||||
|
|
||||||
|
agent = AgentConfig(
|
||||||
|
name="claude-reviewer",
|
||||||
|
command="claude",
|
||||||
|
args=["-p", "--model", "opus"],
|
||||||
|
)
|
||||||
|
|
||||||
|
with patch("subprocess.run", side_effect=_fake_run):
|
||||||
|
with self.assertRaises(AgentInvocationError) as ctx:
|
||||||
|
invoke_agent(agent, "prompt", "review", quiet=True)
|
||||||
|
|
||||||
|
self.assertEqual(ctx.exception.failure_type, "AUTH")
|
||||||
|
self.assertIn("Re-authenticate", ctx.exception.suggested_action)
|
||||||
|
|
||||||
|
def test_invoke_agent_classifies_usage_limit_failures(self) -> None:
|
||||||
|
def _fake_run(cmd, **kwargs):
|
||||||
|
class _Result:
|
||||||
|
returncode = 1
|
||||||
|
stdout = ""
|
||||||
|
stderr = "API Error: 429 rate limit exceeded for current quota"
|
||||||
|
|
||||||
|
return _Result()
|
||||||
|
|
||||||
|
agent = AgentConfig(
|
||||||
|
name="codex-reviewer",
|
||||||
|
command="codex",
|
||||||
|
args=["exec", "--model", "gpt-5.4", "-"],
|
||||||
|
)
|
||||||
|
|
||||||
|
with patch("subprocess.run", side_effect=_fake_run):
|
||||||
|
with self.assertRaises(AgentInvocationError) as ctx:
|
||||||
|
invoke_agent(agent, "prompt", "review", quiet=True)
|
||||||
|
|
||||||
|
self.assertEqual(ctx.exception.failure_type, "USAGE_LIMIT")
|
||||||
|
self.assertIn("quota", ctx.exception.suggested_action)
|
||||||
|
|
||||||
|
def test_parallel_batch_saves_successes_before_failure(self) -> None:
|
||||||
|
config = PipelineConfig(
|
||||||
|
agents={
|
||||||
|
"ok-reviewer": AgentConfig(name="ok-reviewer", command="codex"),
|
||||||
|
"bad-reviewer": AgentConfig(name="bad-reviewer", command="claude"),
|
||||||
|
},
|
||||||
|
)
|
||||||
|
steps = [
|
||||||
|
StepConfig(
|
||||||
|
name="review_ok",
|
||||||
|
agent="ok-reviewer",
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:review-only",
|
||||||
|
output_key="review_ok",
|
||||||
|
parallel=True,
|
||||||
|
),
|
||||||
|
StepConfig(
|
||||||
|
name="review_bad",
|
||||||
|
agent="bad-reviewer",
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:review-only",
|
||||||
|
output_key="review_bad",
|
||||||
|
parallel=True,
|
||||||
|
),
|
||||||
|
]
|
||||||
|
step_outputs: dict[str, str] = {}
|
||||||
|
step_results: dict[str, AgentResult] = {}
|
||||||
|
|
||||||
|
def _fake_invoke(agent, prompt, step_name, **kwargs):
|
||||||
|
if step_name == "review_ok":
|
||||||
|
return AgentResult(
|
||||||
|
output="VERDICT: PASS",
|
||||||
|
exit_code=0,
|
||||||
|
agent_name=agent.name,
|
||||||
|
step_name=step_name,
|
||||||
|
duration_seconds=1.0,
|
||||||
|
)
|
||||||
|
raise AgentInvocationError(
|
||||||
|
agent_name=agent.name,
|
||||||
|
step_name=step_name,
|
||||||
|
cmd_preview="claude -p ...",
|
||||||
|
raw_error="API Error: 429 rate limit exceeded for current quota",
|
||||||
|
failure_type="USAGE_LIMIT",
|
||||||
|
suggested_action="Agent CLI hit a quota, billing, or token budget limit. Refill or raise the limit, then rerun.",
|
||||||
|
)
|
||||||
|
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=_fake_invoke):
|
||||||
|
with self.assertRaises(RuntimeError) as ctx:
|
||||||
|
_execute_parallel_batch(
|
||||||
|
steps,
|
||||||
|
config,
|
||||||
|
input_contents={},
|
||||||
|
feedback="",
|
||||||
|
iteration=1,
|
||||||
|
max_iterations=3,
|
||||||
|
cwd=Path(tmpdir),
|
||||||
|
timeout=None,
|
||||||
|
dry_run=False,
|
||||||
|
step_outputs=step_outputs,
|
||||||
|
step_results=step_results,
|
||||||
|
run_dir=Path(tmpdir),
|
||||||
|
output_iter=1,
|
||||||
|
)
|
||||||
|
|
||||||
|
self.assertIn("Successful outputs were saved for: review_ok", str(ctx.exception))
|
||||||
|
self.assertEqual(step_outputs["review_ok"], "VERDICT: PASS")
|
||||||
|
self.assertTrue((Path(tmpdir) / "v1" / "review_ok.md").exists())
|
||||||
|
error_path = Path(tmpdir) / "v1" / "review_bad_error.md"
|
||||||
|
self.assertTrue(error_path.exists())
|
||||||
|
self.assertIn("Failure Type", error_path.read_text(encoding="utf-8"))
|
||||||
|
self.assertIn("USAGE_LIMIT", error_path.read_text(encoding="utf-8"))
|
||||||
|
|
||||||
def test_detect_repeated_aggregate_warns_on_same_output(self) -> None:
|
def test_detect_repeated_aggregate_warns_on_same_output(self) -> None:
|
||||||
steps = [
|
steps = [
|
||||||
StepConfig(
|
StepConfig(
|
||||||
@@ -169,6 +325,14 @@ class BuiltinAgentConfigTest(unittest.TestCase):
|
|||||||
),
|
),
|
||||||
["claude-senior"],
|
["claude-senior"],
|
||||||
)
|
)
|
||||||
|
self.assertEqual(
|
||||||
|
_default_seniors_for_preset(
|
||||||
|
"preset:coding-review-fix",
|
||||||
|
["codex-reviewer"],
|
||||||
|
BUILTIN_AGENTS,
|
||||||
|
),
|
||||||
|
["codex-senior"],
|
||||||
|
)
|
||||||
self.assertEqual(
|
self.assertEqual(
|
||||||
_default_seniors_for_preset(
|
_default_seniors_for_preset(
|
||||||
"preset:simple",
|
"preset:simple",
|
||||||
@@ -204,9 +368,37 @@ class BuiltinAgentConfigTest(unittest.TestCase):
|
|||||||
)
|
)
|
||||||
self.assertEqual(
|
self.assertEqual(
|
||||||
[step.name for step in converge.steps[3:]],
|
[step.name for step in converge.steps[3:]],
|
||||||
["aggregate_review", "generate", "verify"],
|
["aggregate_review", "coding", "verify"],
|
||||||
)
|
)
|
||||||
|
|
||||||
|
def test_coding_review_fix_starts_with_single_coding_phase(self) -> None:
|
||||||
|
phases = _build_coding_review_fix_preset(
|
||||||
|
["codex-coder"],
|
||||||
|
["claude-reviewer", "codex-reviewer"],
|
||||||
|
["codex-senior"],
|
||||||
|
)
|
||||||
|
|
||||||
|
self.assertEqual([phase.name for phase in phases], ["initial_coding", "review_fix"])
|
||||||
|
self.assertEqual(phases[0].max_iterations, 1)
|
||||||
|
self.assertEqual([step.name for step in phases[0].steps], ["coding"])
|
||||||
|
self.assertEqual([step.name for step in phases[1].steps[2:]], ["aggregate_review", "coding", "verify"])
|
||||||
|
|
||||||
|
def test_apply_phased_iteration_override_updates_only_verdict_phases(self) -> None:
|
||||||
|
config = PipelineConfig(
|
||||||
|
phases=_build_coding_review_fix_preset(
|
||||||
|
["codex-coder"],
|
||||||
|
["codex-reviewer"],
|
||||||
|
["codex-senior"],
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
_apply_phased_iteration_override(config, 10)
|
||||||
|
|
||||||
|
self.assertEqual(config.phases[0].name, "initial_coding")
|
||||||
|
self.assertEqual(config.phases[0].max_iterations, 1)
|
||||||
|
self.assertEqual(config.phases[1].name, "review_fix")
|
||||||
|
self.assertEqual(config.phases[1].max_iterations, 10)
|
||||||
|
|
||||||
def test_review_only_duplicate_reviewers_get_unique_step_keys(self) -> None:
|
def test_review_only_duplicate_reviewers_get_unique_step_keys(self) -> None:
|
||||||
steps = _build_review_only_preset(
|
steps = _build_review_only_preset(
|
||||||
["codex-coder"],
|
["codex-coder"],
|
||||||
@@ -219,6 +411,31 @@ class BuiltinAgentConfigTest(unittest.TestCase):
|
|||||||
["review_codex_reviewer", "review_codex_reviewer_2"],
|
["review_codex_reviewer", "review_codex_reviewer_2"],
|
||||||
)
|
)
|
||||||
|
|
||||||
|
def test_plan_review_duplicate_reviewers_get_unique_step_keys(self) -> None:
|
||||||
|
steps = _build_plan_review_preset(
|
||||||
|
["codex-coder"],
|
||||||
|
["codex-reviewer", "codex-reviewer"],
|
||||||
|
[],
|
||||||
|
)
|
||||||
|
|
||||||
|
self.assertEqual(
|
||||||
|
[step.output_key for step in steps],
|
||||||
|
["plan_review_codex_reviewer", "plan_review_codex_reviewer_2"],
|
||||||
|
)
|
||||||
|
|
||||||
|
def test_plan_review_with_senior_adds_aggregate_step(self) -> None:
|
||||||
|
steps = _build_plan_review_preset(
|
||||||
|
["codex-coder"],
|
||||||
|
["claude-reviewer", "codex-reviewer"],
|
||||||
|
["claude-senior"],
|
||||||
|
)
|
||||||
|
|
||||||
|
self.assertEqual(steps[-1].name, "senior_review")
|
||||||
|
self.assertEqual(steps[-1].agent, "claude-senior")
|
||||||
|
self.assertTrue(steps[-1].verdict)
|
||||||
|
self.assertFalse(steps[0].verdict)
|
||||||
|
self.assertFalse(steps[1].verdict)
|
||||||
|
|
||||||
def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None:
|
def test_cross_review_duplicate_coders_get_unique_step_keys(self) -> None:
|
||||||
steps = _build_cross_review_preset(
|
steps = _build_cross_review_preset(
|
||||||
["codex-coder", "codex-coder"],
|
["codex-coder", "codex-coder"],
|
||||||
@@ -246,7 +463,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
|
|||||||
steps = phases[0].steps
|
steps = phases[0].steps
|
||||||
self.assertEqual(steps[2].name, "aggregate_review")
|
self.assertEqual(steps[2].name, "aggregate_review")
|
||||||
self.assertEqual(steps[2].agent, "codex-senior")
|
self.assertEqual(steps[2].agent, "codex-senior")
|
||||||
self.assertEqual(steps[3].name, "generate")
|
self.assertEqual(steps[3].name, "coding")
|
||||||
self.assertEqual(steps[4].name, "verify")
|
self.assertEqual(steps[4].name, "verify")
|
||||||
self.assertEqual(steps[4].agent, "codex-senior")
|
self.assertEqual(steps[4].agent, "codex-senior")
|
||||||
self.assertTrue(steps[4].verdict)
|
self.assertTrue(steps[4].verdict)
|
||||||
@@ -273,7 +490,7 @@ class BuiltinAgentConfigTest(unittest.TestCase):
|
|||||||
|
|
||||||
self.assertEqual(
|
self.assertEqual(
|
||||||
[step.name for step in steps],
|
[step.name for step in steps],
|
||||||
["generate", "review", "senior_review"],
|
["coding", "review", "senior_review"],
|
||||||
)
|
)
|
||||||
self.assertFalse(steps[1].verdict)
|
self.assertFalse(steps[1].verdict)
|
||||||
self.assertTrue(steps[2].verdict)
|
self.assertTrue(steps[2].verdict)
|
||||||
@@ -325,6 +542,8 @@ class PromptTemplateTest(unittest.TestCase):
|
|||||||
for tmpl, label in [
|
for tmpl, label in [
|
||||||
(REVIEW_TEMPLATE, "REVIEW_TEMPLATE"),
|
(REVIEW_TEMPLATE, "REVIEW_TEMPLATE"),
|
||||||
(REVIEW_TEMPLATE_KO, "REVIEW_TEMPLATE_KO"),
|
(REVIEW_TEMPLATE_KO, "REVIEW_TEMPLATE_KO"),
|
||||||
|
(PLAN_REVIEW_TEMPLATE, "PLAN_REVIEW_TEMPLATE"),
|
||||||
|
(PLAN_REVIEW_TEMPLATE_KO, "PLAN_REVIEW_TEMPLATE_KO"),
|
||||||
(REVIEW_ONLY_TEMPLATE, "REVIEW_ONLY_TEMPLATE"),
|
(REVIEW_ONLY_TEMPLATE, "REVIEW_ONLY_TEMPLATE"),
|
||||||
(REVIEW_ONLY_TEMPLATE_KO, "REVIEW_ONLY_TEMPLATE_KO"),
|
(REVIEW_ONLY_TEMPLATE_KO, "REVIEW_ONLY_TEMPLATE_KO"),
|
||||||
]:
|
]:
|
||||||
@@ -351,10 +570,10 @@ class PromptTemplateTest(unittest.TestCase):
|
|||||||
self.assertIn("CONFIRMED", tmpl)
|
self.assertIn("CONFIRMED", tmpl)
|
||||||
self.assertIn("DISMISSED", tmpl)
|
self.assertIn("DISMISSED", tmpl)
|
||||||
|
|
||||||
def test_generate_templates_ignore_dismissed(self) -> None:
|
def test_coding_templates_ignore_dismissed(self) -> None:
|
||||||
"""Generate templates should tell coder to ignore DISMISSED items."""
|
"""Coding templates should tell coder to ignore DISMISSED items."""
|
||||||
self.assertIn("DISMISSED", GENERATE_TEMPLATE)
|
self.assertIn("DISMISSED", CODING_TEMPLATE)
|
||||||
self.assertIn("DISMISSED", GENERATE_TEMPLATE_KO)
|
self.assertIn("DISMISSED", CODING_TEMPLATE_KO)
|
||||||
|
|
||||||
def test_aggregate_templates_dismissed_structure(self) -> None:
|
def test_aggregate_templates_dismissed_structure(self) -> None:
|
||||||
"""Aggregate templates should use [False positive] / [Already fixed] tags."""
|
"""Aggregate templates should use [False positive] / [Already fixed] tags."""
|
||||||
@@ -487,11 +706,11 @@ class ReviewMetricsParsingTest(unittest.TestCase):
|
|||||||
language="en",
|
language="en",
|
||||||
pipeline=[
|
pipeline=[
|
||||||
StepConfig(
|
StepConfig(
|
||||||
name="generate",
|
name="coding",
|
||||||
agent="claude-coder",
|
agent="claude-coder",
|
||||||
role="generate",
|
role="coding",
|
||||||
prompt_template="default:generate",
|
prompt_template="default:coding",
|
||||||
output_key="generated_code",
|
output_key="coding_output",
|
||||||
verdict=True,
|
verdict=True,
|
||||||
),
|
),
|
||||||
],
|
],
|
||||||
@@ -500,7 +719,7 @@ class ReviewMetricsParsingTest(unittest.TestCase):
|
|||||||
iterations=[
|
iterations=[
|
||||||
IterationResult(
|
IterationResult(
|
||||||
iteration=1,
|
iteration=1,
|
||||||
step_outputs={"generated_code": "some code"},
|
step_outputs={"coding_output": "some code"},
|
||||||
verdict="PASS",
|
verdict="PASS",
|
||||||
),
|
),
|
||||||
],
|
],
|
||||||
@@ -511,5 +730,307 @@ class ReviewMetricsParsingTest(unittest.TestCase):
|
|||||||
self.assertNotIn("Review Metrics", report)
|
self.assertNotIn("Review Metrics", report)
|
||||||
|
|
||||||
|
|
||||||
|
class EscalateVerdictTest(unittest.TestCase):
|
||||||
|
"""Test ESCALATE verdict functionality."""
|
||||||
|
|
||||||
|
def test_extract_verdict_escalate(self) -> None:
|
||||||
|
output = "Some review content\n\nVERDICT: ESCALATE\n"
|
||||||
|
result = _extract_verdict(output, r"VERDICT:\s*PASS")
|
||||||
|
self.assertEqual(result, "ESCALATE")
|
||||||
|
|
||||||
|
def test_extract_verdict_escalate_priority(self) -> None:
|
||||||
|
"""ESCALATE should take priority even if PASS pattern also matches."""
|
||||||
|
output = "VERDICT: PASS\n\nVERDICT: ESCALATE\n"
|
||||||
|
result = _extract_verdict(output, r"VERDICT:\s*PASS")
|
||||||
|
self.assertEqual(result, "ESCALATE")
|
||||||
|
|
||||||
|
def test_extract_verdict_pass_still_works(self) -> None:
|
||||||
|
output = "All good\n\nVERDICT: PASS\n"
|
||||||
|
result = _extract_verdict(output, r"VERDICT:\s*PASS")
|
||||||
|
self.assertEqual(result, "PASS")
|
||||||
|
|
||||||
|
def test_extract_verdict_fail_still_works(self) -> None:
|
||||||
|
output = "Issues found\n\nVERDICT: FAIL\n"
|
||||||
|
result = _extract_verdict(output, r"VERDICT:\s*PASS")
|
||||||
|
self.assertEqual(result, "FAIL")
|
||||||
|
|
||||||
|
def test_extract_senior_tracker(self) -> None:
|
||||||
|
output = (
|
||||||
|
"Some text\n\n"
|
||||||
|
"## Issue Tracker\n"
|
||||||
|
"| ISS-ID | Severity | Description | Status | Since |\n"
|
||||||
|
"|--------|----------|-------------|--------|-------|\n"
|
||||||
|
"| ISS-001 | Critical | Missing auth | Open | v1 |\n"
|
||||||
|
"| ISS-002 | Major | Bad naming | Fixed | v1 |\n"
|
||||||
|
"\nMore text"
|
||||||
|
)
|
||||||
|
tracker = _extract_senior_tracker(output)
|
||||||
|
self.assertIn("Issue Tracker", tracker)
|
||||||
|
self.assertIn("ISS-001", tracker)
|
||||||
|
self.assertIn("ISS-002", tracker)
|
||||||
|
|
||||||
|
def test_extract_senior_tracker_empty(self) -> None:
|
||||||
|
output = "No tracker table here"
|
||||||
|
tracker = _extract_senior_tracker(output)
|
||||||
|
self.assertEqual(tracker, "")
|
||||||
|
|
||||||
|
def test_auto_escalate_heuristic(self) -> None:
|
||||||
|
prev1 = "Issue in src/auth.py: missing validation"
|
||||||
|
prev2 = "Issue in src/auth.py: validation still missing"
|
||||||
|
current = "Issue in src/auth.py: validation not implemented"
|
||||||
|
|
||||||
|
# Should detect repeated issue
|
||||||
|
self.assertTrue(_detect_auto_escalate([prev1, prev2], current, threshold=2))
|
||||||
|
|
||||||
|
def test_auto_escalate_no_repeat(self) -> None:
|
||||||
|
prev1 = "Issue in src/auth.py: missing validation"
|
||||||
|
current = "Issue in src/database.py: connection pool"
|
||||||
|
|
||||||
|
self.assertFalse(_detect_auto_escalate([prev1], current, threshold=2))
|
||||||
|
|
||||||
|
def test_auto_escalate_different_issues_same_file(self) -> None:
|
||||||
|
"""Same file path but different issues should NOT trigger escalation."""
|
||||||
|
prev1 = "Issue in src/utils.py: missing validation on input"
|
||||||
|
prev2 = "Issue in src/utils.py: unused import at top of file"
|
||||||
|
current = "Issue in src/utils.py: error handling not implemented"
|
||||||
|
|
||||||
|
# All mention src/utils.py, but the issue keywords differ across
|
||||||
|
# iterations, so this should NOT escalate.
|
||||||
|
self.assertFalse(_detect_auto_escalate([prev1, prev2], current, threshold=2))
|
||||||
|
|
||||||
|
def test_report_escalate_verdict(self) -> None:
|
||||||
|
config = PipelineConfig(language="en")
|
||||||
|
result = PipelineResult(
|
||||||
|
final_verdict="ESCALATE",
|
||||||
|
escalated_issues=["Requirements are ambiguous — need stakeholder input"],
|
||||||
|
)
|
||||||
|
|
||||||
|
report = build_report(config, result)
|
||||||
|
|
||||||
|
self.assertIn("ESCALATE", report)
|
||||||
|
self.assertIn("Human review required", report)
|
||||||
|
self.assertIn("ambiguous", report)
|
||||||
|
|
||||||
|
def test_report_escalate_verdict_ko(self) -> None:
|
||||||
|
config = PipelineConfig(language="ko")
|
||||||
|
result = PipelineResult(
|
||||||
|
final_verdict="ESCALATE",
|
||||||
|
escalated_issues=["요구사항이 모호함"],
|
||||||
|
)
|
||||||
|
|
||||||
|
report = build_report(config, result)
|
||||||
|
|
||||||
|
self.assertIn("ESCALATE", report)
|
||||||
|
self.assertIn("사람의 확인이 필요합니다", report)
|
||||||
|
|
||||||
|
def test_exit_code_escalate(self) -> None:
|
||||||
|
from cross_eval.cli import main
|
||||||
|
|
||||||
|
mock_result = PipelineResult(
|
||||||
|
final_verdict="ESCALATE",
|
||||||
|
escalated_issues=["Needs human review"],
|
||||||
|
)
|
||||||
|
|
||||||
|
with patch("cross_eval.config.load_config") as mock_load, \
|
||||||
|
patch("cross_eval.config.validate_config", return_value=[]), \
|
||||||
|
patch("cross_eval.pipeline.run_pipeline", return_value=mock_result), \
|
||||||
|
patch("cross_eval.report.print_escalation_report"):
|
||||||
|
mock_config = PipelineConfig(
|
||||||
|
pipeline=[
|
||||||
|
StepConfig(
|
||||||
|
name="review",
|
||||||
|
agent="claude-reviewer",
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:review",
|
||||||
|
output_key="review_result",
|
||||||
|
verdict=True,
|
||||||
|
),
|
||||||
|
],
|
||||||
|
agents=dict(BUILTIN_AGENTS),
|
||||||
|
coders=["claude-coder"],
|
||||||
|
reviewers=["claude-reviewer"],
|
||||||
|
inputs={"plan": Path("/tmp/plan.md")},
|
||||||
|
language="en",
|
||||||
|
max_iterations=3,
|
||||||
|
preset_name="simple",
|
||||||
|
)
|
||||||
|
mock_load.return_value = mock_config
|
||||||
|
|
||||||
|
with tempfile.NamedTemporaryFile(suffix=".yaml", mode="w") as f:
|
||||||
|
f.write("inputs:\n plan: /tmp/plan.md\n")
|
||||||
|
f.flush()
|
||||||
|
exit_code = main(["run", "-c", f.name])
|
||||||
|
|
||||||
|
self.assertEqual(exit_code, 2)
|
||||||
|
|
||||||
|
def test_senior_prompt_includes_escalate(self) -> None:
|
||||||
|
self.assertIn("ESCALATE", _SENIOR_SYSTEM_PROMPT)
|
||||||
|
self.assertIn("ambiguous", _SENIOR_SYSTEM_PROMPT.lower())
|
||||||
|
|
||||||
|
def test_aggregate_template_has_tracker(self) -> None:
|
||||||
|
self.assertIn("{previous_senior_tracker}", AGGREGATE_REVIEW_TEMPLATE)
|
||||||
|
self.assertIn("Issue Tracker", AGGREGATE_REVIEW_TEMPLATE)
|
||||||
|
self.assertIn("VERDICT: ESCALATE", AGGREGATE_REVIEW_TEMPLATE)
|
||||||
|
|
||||||
|
def test_report_includes_issue_tracker_summary(self) -> None:
|
||||||
|
config = PipelineConfig(
|
||||||
|
language="en",
|
||||||
|
pipeline=[
|
||||||
|
StepConfig(
|
||||||
|
name="review",
|
||||||
|
agent="claude-reviewer",
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:review",
|
||||||
|
output_key="review_result",
|
||||||
|
verdict=True,
|
||||||
|
),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
result = PipelineResult(
|
||||||
|
iterations=[
|
||||||
|
IterationResult(
|
||||||
|
iteration=1,
|
||||||
|
step_outputs={
|
||||||
|
"review_result": (
|
||||||
|
"### Issues Found\n"
|
||||||
|
"- ISS-001 [Critical][Omission] Missing auth check\n"
|
||||||
|
"- ISS-002 [Major][Omission] No input validation\n"
|
||||||
|
"### Verdict\nVERDICT: FAIL"
|
||||||
|
),
|
||||||
|
},
|
||||||
|
verdict="FAIL",
|
||||||
|
),
|
||||||
|
],
|
||||||
|
final_verdict="FAIL",
|
||||||
|
)
|
||||||
|
|
||||||
|
report = build_report(config, result)
|
||||||
|
self.assertIn("Issue Tracker Summary", report)
|
||||||
|
self.assertIn("ISS-001", report)
|
||||||
|
self.assertIn("ISS-002", report)
|
||||||
|
|
||||||
|
def test_report_includes_senior_tracker_table(self) -> None:
|
||||||
|
config = PipelineConfig(
|
||||||
|
language="en",
|
||||||
|
pipeline=[
|
||||||
|
StepConfig(
|
||||||
|
name="senior_review",
|
||||||
|
agent="claude-senior",
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:aggregate-review",
|
||||||
|
output_key="senior_review_result",
|
||||||
|
verdict=True,
|
||||||
|
),
|
||||||
|
],
|
||||||
|
)
|
||||||
|
result = PipelineResult(
|
||||||
|
iterations=[
|
||||||
|
IterationResult(
|
||||||
|
iteration=1,
|
||||||
|
step_outputs={
|
||||||
|
"senior_review_result": (
|
||||||
|
"### Confirmed Issues\n- Missing auth\n\n"
|
||||||
|
"## Issue Tracker\n"
|
||||||
|
"| ISS-ID | Severity | Description | Status | Since |\n"
|
||||||
|
"|--------|----------|-------------|--------|-------|\n"
|
||||||
|
"| ISS-001 | Critical | Missing auth check | Open | v1 |\n"
|
||||||
|
"| ISS-002 | Major | No validation | Fixed | v1 |\n"
|
||||||
|
"\n### Verdict\nVERDICT: FAIL"
|
||||||
|
),
|
||||||
|
},
|
||||||
|
verdict="FAIL",
|
||||||
|
),
|
||||||
|
],
|
||||||
|
final_verdict="FAIL",
|
||||||
|
)
|
||||||
|
|
||||||
|
report = build_report(config, result)
|
||||||
|
self.assertIn("Issue Tracker Summary", report)
|
||||||
|
self.assertIn("ISS-001", report)
|
||||||
|
self.assertIn("Fixed", report)
|
||||||
|
|
||||||
|
def test_aggregate_template_ko_has_tracker(self) -> None:
|
||||||
|
self.assertIn("{previous_senior_tracker}", AGGREGATE_REVIEW_TEMPLATE_KO)
|
||||||
|
self.assertIn("이슈 트래커", AGGREGATE_REVIEW_TEMPLATE_KO)
|
||||||
|
self.assertIn("VERDICT: ESCALATE", AGGREGATE_REVIEW_TEMPLATE_KO)
|
||||||
|
|
||||||
|
|
||||||
|
class FixPresetBehaviorTest(unittest.TestCase):
|
||||||
|
def _write_fix_config(self, root: Path, *, max_iterations: int = 7) -> Path:
|
||||||
|
(root / "plan.md").write_text("# plan\n", encoding="utf-8")
|
||||||
|
(root / "checklist.md").write_text("# checklist\n", encoding="utf-8")
|
||||||
|
config_path = root / "config.yaml"
|
||||||
|
config_path.write_text(
|
||||||
|
(
|
||||||
|
"inputs:\n"
|
||||||
|
" plan: plan.md\n"
|
||||||
|
" checklist: checklist.md\n"
|
||||||
|
"coders: [claude-coder]\n"
|
||||||
|
"reviewers: [claude-reviewer]\n"
|
||||||
|
"pipeline: preset:review-fix\n"
|
||||||
|
f"max_iterations: {max_iterations}\n"
|
||||||
|
"language: en\n"
|
||||||
|
),
|
||||||
|
encoding="utf-8",
|
||||||
|
)
|
||||||
|
return config_path
|
||||||
|
|
||||||
|
def test_load_config_syncs_phased_iterations_and_enables_agentic(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
config = load_config(self._write_fix_config(Path(tmpdir), max_iterations=7))
|
||||||
|
|
||||||
|
self.assertEqual(config.preset_name, "review-fix")
|
||||||
|
self.assertEqual(config.phases[0].max_iterations, 7)
|
||||||
|
self.assertTrue(config.agents["claude-coder"].agentic)
|
||||||
|
self.assertNotIn("-p", config.agents["claude-coder"].args)
|
||||||
|
|
||||||
|
def test_run_config_max_iter_updates_existing_phased_pipeline(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
config_path = self._write_fix_config(Path(tmpdir), max_iterations=7)
|
||||||
|
captured: dict[str, object] = {}
|
||||||
|
|
||||||
|
def _fake_run_pipeline(config, **kwargs):
|
||||||
|
captured["phase_max"] = config.phases[0].max_iterations
|
||||||
|
captured["agentic"] = config.agents[config.coders[0]].agentic
|
||||||
|
return PipelineResult(
|
||||||
|
iterations=[],
|
||||||
|
final_verdict="PASS",
|
||||||
|
run_dir=Path(tmpdir) / "output",
|
||||||
|
)
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
|
||||||
|
exit_code = main([
|
||||||
|
"run",
|
||||||
|
"--config", str(config_path),
|
||||||
|
"--max-iter", "9",
|
||||||
|
"--dry-run",
|
||||||
|
])
|
||||||
|
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
self.assertEqual(captured["phase_max"], 9)
|
||||||
|
self.assertTrue(captured["agentic"])
|
||||||
|
|
||||||
|
def test_run_preset_review_fix_auto_enables_agentic_without_flag(self) -> None:
|
||||||
|
captured: dict[str, object] = {}
|
||||||
|
|
||||||
|
def _fake_run_pipeline(config, **kwargs):
|
||||||
|
captured["preset"] = config.preset_name
|
||||||
|
captured["agentic"] = config.agents[config.coders[0]].agentic
|
||||||
|
captured["phase_max"] = config.phases[0].max_iterations
|
||||||
|
return PipelineResult(
|
||||||
|
iterations=[],
|
||||||
|
final_verdict="PASS",
|
||||||
|
run_dir=Path(".cross-eval/output"),
|
||||||
|
)
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.run_pipeline", side_effect=_fake_run_pipeline):
|
||||||
|
exit_code = main(["run", "--preset", "review-fix", "--dry-run"])
|
||||||
|
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
self.assertEqual(captured["preset"], "review-fix")
|
||||||
|
self.assertTrue(captured["agentic"])
|
||||||
|
self.assertEqual(captured["phase_max"], 3)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
unittest.main()
|
unittest.main()
|
||||||
|
|||||||
267
tests/test_onboarding.py
Normal file
267
tests/test_onboarding.py
Normal file
@@ -0,0 +1,267 @@
|
|||||||
|
"""Tests for doctor, demo, and guided init features."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
from pathlib import Path
|
||||||
|
from unittest.mock import patch, MagicMock
|
||||||
|
|
||||||
|
from cross_eval.doctor import (
|
||||||
|
DoctorCheck,
|
||||||
|
check_cli_installed,
|
||||||
|
check_config,
|
||||||
|
format_doctor_results,
|
||||||
|
run_doctor,
|
||||||
|
)
|
||||||
|
from cross_eval.demo import (
|
||||||
|
DEMO_CHECKLIST,
|
||||||
|
DEMO_PLAN,
|
||||||
|
run_mock_demo,
|
||||||
|
)
|
||||||
|
from cross_eval.cli import (
|
||||||
|
_generate_guided_config,
|
||||||
|
_prompt_choice,
|
||||||
|
_prompt_text,
|
||||||
|
main,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Doctor tests
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
class DoctorCheckInstalledTest(unittest.TestCase):
|
||||||
|
def test_check_cli_installed_found(self) -> None:
|
||||||
|
with patch("cross_eval.doctor.shutil.which", return_value="/usr/bin/python3"):
|
||||||
|
with patch("cross_eval.doctor.subprocess.run") as mock_run:
|
||||||
|
mock_run.return_value = MagicMock(
|
||||||
|
stdout="Python 3.12.0", stderr=""
|
||||||
|
)
|
||||||
|
found, version = check_cli_installed("python3")
|
||||||
|
|
||||||
|
self.assertTrue(found)
|
||||||
|
self.assertIn("Python", version)
|
||||||
|
|
||||||
|
def test_check_cli_installed_not_found(self) -> None:
|
||||||
|
with patch("cross_eval.doctor.shutil.which", return_value=None):
|
||||||
|
found, msg = check_cli_installed("nonexistent-tool")
|
||||||
|
|
||||||
|
self.assertFalse(found)
|
||||||
|
self.assertIn("not found", msg)
|
||||||
|
|
||||||
|
def test_check_config_exists_valid(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
ce_dir = Path(tmpdir) / ".cross-eval"
|
||||||
|
ce_dir.mkdir()
|
||||||
|
config_path = ce_dir / "config.yaml"
|
||||||
|
config_path.write_text(
|
||||||
|
"inputs:\n plan: plan.md\ncoders: [claude-coder]\n"
|
||||||
|
"reviewers: [claude-reviewer]\npipeline: preset:simple\n",
|
||||||
|
encoding="utf-8",
|
||||||
|
)
|
||||||
|
# Also create plan.md so validation passes
|
||||||
|
(ce_dir / "plan.md").write_text("# Plan", encoding="utf-8")
|
||||||
|
|
||||||
|
ok, path, errors = check_config(Path(tmpdir))
|
||||||
|
|
||||||
|
self.assertTrue(ok)
|
||||||
|
self.assertIsNotNone(path)
|
||||||
|
self.assertEqual(errors, [])
|
||||||
|
|
||||||
|
def test_check_config_not_exists(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
ok, path, errors = check_config(Path(tmpdir))
|
||||||
|
|
||||||
|
self.assertFalse(ok)
|
||||||
|
self.assertIsNone(path)
|
||||||
|
|
||||||
|
def test_check_config_invalid(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
ce_dir = Path(tmpdir) / ".cross-eval"
|
||||||
|
ce_dir.mkdir()
|
||||||
|
# Valid YAML but missing required fields → validation fails
|
||||||
|
(ce_dir / "config.yaml").write_text(
|
||||||
|
"inputs:\n plan: /nonexistent/plan.md\n",
|
||||||
|
encoding="utf-8",
|
||||||
|
)
|
||||||
|
|
||||||
|
ok, path, errors = check_config(Path(tmpdir))
|
||||||
|
|
||||||
|
self.assertFalse(ok)
|
||||||
|
self.assertIsNotNone(path)
|
||||||
|
|
||||||
|
def test_format_doctor_results_all_pass(self) -> None:
|
||||||
|
checks = [
|
||||||
|
DoctorCheck("test", True, True, "ok"),
|
||||||
|
DoctorCheck("test2", True, False, "ok"),
|
||||||
|
]
|
||||||
|
output = format_doctor_results(checks)
|
||||||
|
self.assertIn("✓", output)
|
||||||
|
self.assertIn("All checks passed", output)
|
||||||
|
|
||||||
|
def test_format_doctor_results_critical_fail(self) -> None:
|
||||||
|
checks = [
|
||||||
|
DoctorCheck("claude CLI", False, True, "not found"),
|
||||||
|
]
|
||||||
|
output = format_doctor_results(checks)
|
||||||
|
self.assertIn("✗", output)
|
||||||
|
self.assertIn("critical", output.lower())
|
||||||
|
|
||||||
|
def test_cmd_doctor_returns_0_all_pass(self) -> None:
|
||||||
|
with patch("cross_eval.doctor.run_doctor") as mock:
|
||||||
|
mock.return_value = [
|
||||||
|
DoctorCheck("test", True, True, "ok"),
|
||||||
|
]
|
||||||
|
exit_code = main(["doctor"])
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
|
||||||
|
def test_cmd_doctor_returns_1_critical_fail(self) -> None:
|
||||||
|
with patch("cross_eval.doctor.run_doctor") as mock:
|
||||||
|
mock.return_value = [
|
||||||
|
DoctorCheck("claude CLI", False, True, "not found"),
|
||||||
|
]
|
||||||
|
exit_code = main(["doctor"])
|
||||||
|
self.assertEqual(exit_code, 1)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Demo tests
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
class DemoTest(unittest.TestCase):
|
||||||
|
def test_demo_plan_is_nonempty(self) -> None:
|
||||||
|
self.assertIn("fibonacci", DEMO_PLAN.lower())
|
||||||
|
|
||||||
|
def test_demo_checklist_is_nonempty(self) -> None:
|
||||||
|
self.assertIn("fibonacci", DEMO_CHECKLIST.lower())
|
||||||
|
|
||||||
|
def test_mock_demo_runs_without_error(self) -> None:
|
||||||
|
# Should not raise
|
||||||
|
with patch("sys.stdout"):
|
||||||
|
run_mock_demo(preset="simple")
|
||||||
|
|
||||||
|
def test_mock_demo_escalate_runs_without_error(self) -> None:
|
||||||
|
with patch("sys.stdout"):
|
||||||
|
run_mock_demo(preset="simple", show_escalate=True)
|
||||||
|
|
||||||
|
def test_cmd_demo_mock_default(self) -> None:
|
||||||
|
with patch("cross_eval.demo.run_mock_demo") as mock:
|
||||||
|
exit_code = main(["demo"])
|
||||||
|
mock.assert_called_once_with(preset="simple", show_escalate=False)
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
|
||||||
|
def test_cmd_demo_escalate_flag(self) -> None:
|
||||||
|
with patch("cross_eval.demo.run_mock_demo") as mock:
|
||||||
|
exit_code = main(["demo", "--escalate"])
|
||||||
|
mock.assert_called_once_with(preset="simple", show_escalate=True)
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
|
||||||
|
def test_cmd_demo_live_requires_confirmation(self) -> None:
|
||||||
|
with patch("builtins.input", return_value="n"):
|
||||||
|
exit_code = main(["demo", "--live"])
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Guided init tests
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
class GuidedInitTest(unittest.TestCase):
|
||||||
|
def test_prompt_choice_default(self) -> None:
|
||||||
|
with patch("builtins.input", return_value=""):
|
||||||
|
result = _prompt_choice("Pick:", ["a", "b", "c"], default=2)
|
||||||
|
self.assertEqual(result, "b")
|
||||||
|
|
||||||
|
def test_prompt_choice_by_number(self) -> None:
|
||||||
|
with patch("builtins.input", return_value="3"):
|
||||||
|
result = _prompt_choice("Pick:", ["a", "b", "c"], default=1)
|
||||||
|
self.assertEqual(result, "c")
|
||||||
|
|
||||||
|
def test_prompt_choice_by_name(self) -> None:
|
||||||
|
with patch("builtins.input", return_value="simple"):
|
||||||
|
result = _prompt_choice("Pick:", ["simple", "review-fix"], default=1)
|
||||||
|
self.assertEqual(result, "simple")
|
||||||
|
|
||||||
|
def test_prompt_text_default(self) -> None:
|
||||||
|
with patch("builtins.input", return_value=""):
|
||||||
|
result = _prompt_text("Name", default="claude")
|
||||||
|
self.assertEqual(result, "claude")
|
||||||
|
|
||||||
|
def test_prompt_text_custom(self) -> None:
|
||||||
|
with patch("builtins.input", return_value="codex"):
|
||||||
|
result = _prompt_text("Name", default="claude")
|
||||||
|
self.assertEqual(result, "codex")
|
||||||
|
|
||||||
|
def test_generate_guided_config(self) -> None:
|
||||||
|
config = _generate_guided_config(
|
||||||
|
"review-fix", "ko",
|
||||||
|
{
|
||||||
|
"coder": "claude",
|
||||||
|
"reviewer": "codex",
|
||||||
|
"senior": "codex",
|
||||||
|
"max_iter": 5,
|
||||||
|
},
|
||||||
|
)
|
||||||
|
self.assertIn("preset:review-fix", config)
|
||||||
|
self.assertIn("language: ko", config)
|
||||||
|
self.assertIn("claude-coder", config)
|
||||||
|
self.assertIn("codex-reviewer", config)
|
||||||
|
self.assertIn("codex-senior", config)
|
||||||
|
self.assertIn("max_iterations: 5", config)
|
||||||
|
|
||||||
|
def test_generate_guided_config_full_name(self) -> None:
|
||||||
|
config = _generate_guided_config(
|
||||||
|
"simple", "ko",
|
||||||
|
{
|
||||||
|
"coder": "claude-coder",
|
||||||
|
"reviewer": "codex-reviewer",
|
||||||
|
"senior": "",
|
||||||
|
"max_iter": 3,
|
||||||
|
},
|
||||||
|
)
|
||||||
|
# Full names should not be double-suffixed
|
||||||
|
self.assertIn("claude-coder", config)
|
||||||
|
self.assertNotIn("claude-coder-coder", config)
|
||||||
|
self.assertIn("codex-reviewer", config)
|
||||||
|
self.assertNotIn("codex-reviewer-reviewer", config)
|
||||||
|
|
||||||
|
def test_generate_guided_config_no_senior(self) -> None:
|
||||||
|
config = _generate_guided_config(
|
||||||
|
"simple", "en",
|
||||||
|
{
|
||||||
|
"coder": "claude",
|
||||||
|
"reviewer": "claude",
|
||||||
|
"senior": "",
|
||||||
|
"max_iter": 3,
|
||||||
|
},
|
||||||
|
)
|
||||||
|
self.assertNotIn("senior", config.lower())
|
||||||
|
|
||||||
|
def test_guided_init_creates_files(self) -> None:
|
||||||
|
# Simulate guided init with all defaults
|
||||||
|
inputs = iter(["", "", "", "", "", "", ""])
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
with patch("builtins.input", side_effect=lambda _="": next(inputs, "")):
|
||||||
|
exit_code = main(["init", "--guided", "--dir", tmpdir])
|
||||||
|
|
||||||
|
config_path = Path(tmpdir) / ".cross-eval" / "config.yaml"
|
||||||
|
self.assertTrue(config_path.exists())
|
||||||
|
self.assertEqual(exit_code, 0)
|
||||||
|
|
||||||
|
def test_guided_init_preserves_existing_files(self) -> None:
|
||||||
|
inputs = iter(["", "", "", "", "", "", ""])
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
ce_dir = Path(tmpdir) / ".cross-eval"
|
||||||
|
ce_dir.mkdir()
|
||||||
|
existing = ce_dir / "config.yaml"
|
||||||
|
existing.write_text("# existing", encoding="utf-8")
|
||||||
|
|
||||||
|
with patch("builtins.input", side_effect=lambda _="": next(inputs, "")):
|
||||||
|
main(["init", "--guided", "--dir", tmpdir])
|
||||||
|
|
||||||
|
# Should not overwrite
|
||||||
|
self.assertEqual(existing.read_text(), "# existing")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
461
tests/test_pipeline_integration.py
Normal file
461
tests/test_pipeline_integration.py
Normal file
@@ -0,0 +1,461 @@
|
|||||||
|
"""Integration tests for cross-eval pipeline with mocked agents."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import tempfile
|
||||||
|
import unittest
|
||||||
|
from pathlib import Path
|
||||||
|
from unittest.mock import patch
|
||||||
|
|
||||||
|
from cross_eval.config import BUILTIN_AGENTS
|
||||||
|
from cross_eval.models import (
|
||||||
|
AgentConfig,
|
||||||
|
AgentResult,
|
||||||
|
PhaseConfig,
|
||||||
|
PipelineConfig,
|
||||||
|
StepConfig,
|
||||||
|
)
|
||||||
|
from cross_eval.pipeline import run_pipeline
|
||||||
|
from cross_eval.prompts import _build_review_fix_preset, _build_simple_preset
|
||||||
|
|
||||||
|
|
||||||
|
def _make_mock_agent(outputs: list[str]):
|
||||||
|
"""Returns a side_effect function that returns outputs in sequence."""
|
||||||
|
call_count = [0]
|
||||||
|
|
||||||
|
def _mock(agent_config, prompt, step_name, **kwargs):
|
||||||
|
idx = min(call_count[0], len(outputs) - 1)
|
||||||
|
call_count[0] += 1
|
||||||
|
return AgentResult(
|
||||||
|
output=outputs[idx],
|
||||||
|
exit_code=0,
|
||||||
|
agent_name=agent_config.name,
|
||||||
|
step_name=step_name,
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
return _mock
|
||||||
|
|
||||||
|
|
||||||
|
def _make_step_mock(step_outputs: dict[str, list[str]]):
|
||||||
|
"""Returns a side_effect that dispatches by step_name, cycling through outputs."""
|
||||||
|
counters: dict[str, int] = {}
|
||||||
|
|
||||||
|
def _mock(agent_config, prompt, step_name, **kwargs):
|
||||||
|
if step_name not in counters:
|
||||||
|
counters[step_name] = 0
|
||||||
|
outputs = step_outputs.get(step_name, [""])
|
||||||
|
idx = min(counters[step_name], len(outputs) - 1)
|
||||||
|
counters[step_name] += 1
|
||||||
|
return AgentResult(
|
||||||
|
output=outputs[idx],
|
||||||
|
exit_code=0,
|
||||||
|
agent_name=agent_config.name,
|
||||||
|
step_name=step_name,
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
return _mock
|
||||||
|
|
||||||
|
|
||||||
|
def _minimal_simple_config(
|
||||||
|
run_dir: Path,
|
||||||
|
max_iterations: int = 3,
|
||||||
|
seniors: list[str] | None = None,
|
||||||
|
) -> PipelineConfig:
|
||||||
|
"""Build a minimal simple pipeline config for testing."""
|
||||||
|
coders = ["claude-coder"]
|
||||||
|
reviewers = ["claude-reviewer"]
|
||||||
|
senior_list = seniors if seniors is not None else []
|
||||||
|
steps = _build_simple_preset(coders, reviewers, senior_list)
|
||||||
|
agents = dict(BUILTIN_AGENTS)
|
||||||
|
return PipelineConfig(
|
||||||
|
output_dir=run_dir,
|
||||||
|
max_iterations=max_iterations,
|
||||||
|
min_iterations=1,
|
||||||
|
language="en",
|
||||||
|
inputs={"plan": "Test plan", "checklist": "Test checklist"},
|
||||||
|
agents=agents,
|
||||||
|
coders=coders,
|
||||||
|
reviewers=reviewers,
|
||||||
|
seniors=senior_list,
|
||||||
|
pipeline=steps,
|
||||||
|
preset_name="simple",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestSimplePipelinePassStopsLoop(unittest.TestCase):
|
||||||
|
"""Test 1: mock agent returns VERDICT: PASS on first review -> stops at iteration 1."""
|
||||||
|
|
||||||
|
def test_simple_pipeline_pass_stops_loop(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
config = _minimal_simple_config(Path(tmpdir))
|
||||||
|
|
||||||
|
mock = _make_mock_agent([
|
||||||
|
"Coding output here", # coding step
|
||||||
|
"All good\n\nVERDICT: PASS", # review step
|
||||||
|
])
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
|
||||||
|
result = run_pipeline(config)
|
||||||
|
|
||||||
|
self.assertEqual(result.final_verdict, "PASS")
|
||||||
|
self.assertEqual(len(result.iterations), 1)
|
||||||
|
|
||||||
|
|
||||||
|
class TestSimplePipelineFailThenPass(unittest.TestCase):
|
||||||
|
"""Test 2: FAIL on first review, PASS on second -> 2 iterations."""
|
||||||
|
|
||||||
|
def test_simple_pipeline_fail_then_pass(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
config = _minimal_simple_config(Path(tmpdir), max_iterations=5)
|
||||||
|
|
||||||
|
mock = _make_step_mock({
|
||||||
|
"coding": ["Coding output v1", "Coding output v2"],
|
||||||
|
"review": [
|
||||||
|
"Issues found\n\nVERDICT: FAIL",
|
||||||
|
"All good\n\nVERDICT: PASS",
|
||||||
|
],
|
||||||
|
})
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
|
||||||
|
result = run_pipeline(config)
|
||||||
|
|
||||||
|
self.assertEqual(result.final_verdict, "PASS")
|
||||||
|
self.assertEqual(len(result.iterations), 2)
|
||||||
|
|
||||||
|
|
||||||
|
class TestSimplePipelineEscalateBreaksLoop(unittest.TestCase):
|
||||||
|
"""Test 3: ESCALATE on review -> stops immediately, final_verdict=ESCALATE."""
|
||||||
|
|
||||||
|
def test_simple_pipeline_escalate_breaks_loop(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
config = _minimal_simple_config(
|
||||||
|
Path(tmpdir), max_iterations=5, seniors=["claude-senior"],
|
||||||
|
)
|
||||||
|
|
||||||
|
escalate_output = (
|
||||||
|
"### Confirmed Issues\n"
|
||||||
|
"- [Critical] Requirements are ambiguous\n\n"
|
||||||
|
"### Escalated Issues\n"
|
||||||
|
"Requirements need stakeholder clarification\n\n"
|
||||||
|
"### Verdict\n"
|
||||||
|
"VERDICT: ESCALATE\n"
|
||||||
|
)
|
||||||
|
|
||||||
|
mock = _make_step_mock({
|
||||||
|
"coding": ["Coding output"],
|
||||||
|
"review": ["Issues found\n\nVERDICT: FAIL"],
|
||||||
|
"senior_review": [escalate_output],
|
||||||
|
})
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
|
||||||
|
result = run_pipeline(config)
|
||||||
|
|
||||||
|
self.assertEqual(result.final_verdict, "ESCALATE")
|
||||||
|
self.assertEqual(len(result.iterations), 1)
|
||||||
|
self.assertTrue(len(result.escalated_issues) > 0)
|
||||||
|
|
||||||
|
|
||||||
|
class TestSimplePipelineEscalatePriorityOverPass(unittest.TestCase):
|
||||||
|
"""Test 4: one verdict step returns PASS, another returns ESCALATE -> ESCALATE wins."""
|
||||||
|
|
||||||
|
def test_simple_pipeline_escalate_priority_over_pass(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
# Build a custom pipeline with 2 verdict steps (no senior)
|
||||||
|
steps = [
|
||||||
|
StepConfig(
|
||||||
|
name="coding",
|
||||||
|
agent="claude-coder",
|
||||||
|
role="coding",
|
||||||
|
prompt_template="default:coding",
|
||||||
|
output_key="coding_output",
|
||||||
|
),
|
||||||
|
StepConfig(
|
||||||
|
name="review_a",
|
||||||
|
agent="claude-reviewer",
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:review",
|
||||||
|
output_key="review_a_result",
|
||||||
|
verdict=True,
|
||||||
|
),
|
||||||
|
StepConfig(
|
||||||
|
name="review_b",
|
||||||
|
agent="claude-reviewer",
|
||||||
|
role="review",
|
||||||
|
prompt_template="default:review",
|
||||||
|
output_key="review_b_result",
|
||||||
|
verdict=True,
|
||||||
|
),
|
||||||
|
]
|
||||||
|
config = PipelineConfig(
|
||||||
|
output_dir=Path(tmpdir),
|
||||||
|
max_iterations=3,
|
||||||
|
min_iterations=1,
|
||||||
|
language="en",
|
||||||
|
inputs={"plan": "Test plan", "checklist": "Test checklist"},
|
||||||
|
agents=dict(BUILTIN_AGENTS),
|
||||||
|
coders=["claude-coder"],
|
||||||
|
reviewers=["claude-reviewer"],
|
||||||
|
pipeline=steps,
|
||||||
|
preset_name="custom",
|
||||||
|
)
|
||||||
|
|
||||||
|
escalate_output = (
|
||||||
|
"### Escalated Issues\n"
|
||||||
|
"Ambiguous requirements need clarification\n\n"
|
||||||
|
"VERDICT: ESCALATE\n"
|
||||||
|
)
|
||||||
|
|
||||||
|
mock = _make_step_mock({
|
||||||
|
"coding": ["Coding output"],
|
||||||
|
"review_a": ["All good\n\nVERDICT: PASS"],
|
||||||
|
"review_b": [escalate_output],
|
||||||
|
})
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
|
||||||
|
result = run_pipeline(config)
|
||||||
|
|
||||||
|
self.assertEqual(result.final_verdict, "ESCALATE")
|
||||||
|
self.assertTrue(len(result.escalated_issues) > 0)
|
||||||
|
|
||||||
|
|
||||||
|
class TestPhasedPipelineEscalateBreaksPhase(unittest.TestCase):
|
||||||
|
"""Test 5: phased pipeline (review-fix), verify step returns ESCALATE -> phase stops."""
|
||||||
|
|
||||||
|
def test_phased_pipeline_escalate_breaks_phase(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
coders = ["claude-coder"]
|
||||||
|
reviewers = ["claude-reviewer"]
|
||||||
|
seniors = ["claude-senior"]
|
||||||
|
phases = _build_review_fix_preset(coders, reviewers, seniors)
|
||||||
|
|
||||||
|
config = PipelineConfig(
|
||||||
|
output_dir=Path(tmpdir),
|
||||||
|
max_iterations=5,
|
||||||
|
min_iterations=1,
|
||||||
|
language="en",
|
||||||
|
inputs={"plan": "Test plan", "checklist": "Test checklist"},
|
||||||
|
agents=dict(BUILTIN_AGENTS),
|
||||||
|
coders=coders,
|
||||||
|
reviewers=reviewers,
|
||||||
|
seniors=seniors,
|
||||||
|
phases=phases,
|
||||||
|
preset_name="review-fix",
|
||||||
|
)
|
||||||
|
|
||||||
|
escalate_output = (
|
||||||
|
"### Escalated Issues\n"
|
||||||
|
"Architecture decisions needed beyond plan scope\n\n"
|
||||||
|
"### Verdict\n"
|
||||||
|
"VERDICT: ESCALATE\n"
|
||||||
|
)
|
||||||
|
|
||||||
|
mock = _make_step_mock({
|
||||||
|
"review_claude_reviewer": ["Review findings here"],
|
||||||
|
"aggregate_review": ["Aggregated review\n\nAction items: fix X"],
|
||||||
|
"coding": ["Fixed code"],
|
||||||
|
"verify": [escalate_output],
|
||||||
|
})
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
|
||||||
|
result = run_pipeline(config)
|
||||||
|
|
||||||
|
self.assertEqual(result.final_verdict, "ESCALATE")
|
||||||
|
self.assertTrue(len(result.escalated_issues) > 0)
|
||||||
|
|
||||||
|
|
||||||
|
class TestAutoEscalateFiresWithoutSenior(unittest.TestCase):
|
||||||
|
"""Test 6: simple pipeline without senior, same FAIL feedback 3 times -> auto-escalate."""
|
||||||
|
|
||||||
|
def test_auto_escalate_fires_without_senior(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
# No seniors -> review step has verdict=True
|
||||||
|
config = _minimal_simple_config(
|
||||||
|
Path(tmpdir), max_iterations=5, seniors=None,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Same feedback mentioning the same file paths across all iterations
|
||||||
|
repeated_fail = (
|
||||||
|
"Issues found in src/auth.py: missing validation check.\n"
|
||||||
|
"The file src/auth.py still has the same problem.\n\n"
|
||||||
|
"VERDICT: FAIL"
|
||||||
|
)
|
||||||
|
|
||||||
|
mock = _make_step_mock({
|
||||||
|
"coding": ["Coding output v1", "Coding output v2", "Coding output v3"],
|
||||||
|
"review": [repeated_fail, repeated_fail, repeated_fail],
|
||||||
|
})
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
|
||||||
|
result = run_pipeline(config)
|
||||||
|
|
||||||
|
self.assertEqual(result.final_verdict, "ESCALATE")
|
||||||
|
self.assertTrue(
|
||||||
|
any("Auto-escalated" in iss for iss in result.escalated_issues),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class TestAutoEscalateDoesNotFireWithSenior(unittest.TestCase):
|
||||||
|
"""Test 7: same repeated FAIL but WITH senior/aggregate step -> no auto-escalate."""
|
||||||
|
|
||||||
|
def test_auto_escalate_does_not_fire_with_senior(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
# With seniors -> senior_review step has verdict=True, review does not
|
||||||
|
config = _minimal_simple_config(
|
||||||
|
Path(tmpdir), max_iterations=5, seniors=["claude-senior"],
|
||||||
|
)
|
||||||
|
|
||||||
|
repeated_fail_review = (
|
||||||
|
"Issues found in src/auth.py: missing validation check.\n"
|
||||||
|
"VERDICT: FAIL"
|
||||||
|
)
|
||||||
|
# Senior also returns FAIL but the auto-escalate should NOT fire
|
||||||
|
# because has_aggregator is True (seniors list is populated)
|
||||||
|
senior_fail = (
|
||||||
|
"### Confirmed Issues\n"
|
||||||
|
"- Missing validation in src/auth.py\n\n"
|
||||||
|
"### Action Items\n"
|
||||||
|
"1. Add validation in src/auth.py\n\n"
|
||||||
|
"VERDICT: FAIL"
|
||||||
|
)
|
||||||
|
|
||||||
|
mock = _make_step_mock({
|
||||||
|
"coding": [
|
||||||
|
"Coding output v1",
|
||||||
|
"Coding output v2",
|
||||||
|
"Coding output v3",
|
||||||
|
"Coding output v4",
|
||||||
|
"Coding output v5",
|
||||||
|
],
|
||||||
|
"review": [
|
||||||
|
repeated_fail_review,
|
||||||
|
repeated_fail_review,
|
||||||
|
repeated_fail_review,
|
||||||
|
repeated_fail_review,
|
||||||
|
repeated_fail_review,
|
||||||
|
],
|
||||||
|
"senior_review": [
|
||||||
|
senior_fail,
|
||||||
|
senior_fail,
|
||||||
|
senior_fail,
|
||||||
|
senior_fail,
|
||||||
|
senior_fail,
|
||||||
|
],
|
||||||
|
})
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=mock):
|
||||||
|
result = run_pipeline(config)
|
||||||
|
|
||||||
|
# Should NOT auto-escalate; should reach max iterations
|
||||||
|
self.assertNotEqual(result.final_verdict, "ESCALATE")
|
||||||
|
self.assertEqual(result.final_verdict, "MAX_ITERATIONS_REACHED")
|
||||||
|
self.assertEqual(len(result.iterations), 5)
|
||||||
|
|
||||||
|
|
||||||
|
class TestTrackerExtractionAcrossIterations(unittest.TestCase):
|
||||||
|
"""Test 8: senior review output with Issue Tracker table -> passed to next iteration."""
|
||||||
|
|
||||||
|
def test_tracker_extraction_across_iterations(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmpdir:
|
||||||
|
config = _minimal_simple_config(
|
||||||
|
Path(tmpdir), max_iterations=3, seniors=["claude-senior"],
|
||||||
|
)
|
||||||
|
|
||||||
|
tracker_table = (
|
||||||
|
"## Issue Tracker\n"
|
||||||
|
"| ISS-ID | Severity | Description | Status | Since |\n"
|
||||||
|
"|--------|----------|-------------|--------|-------|\n"
|
||||||
|
"| ISS-001 | Critical | Missing auth check | Open | v1 |\n"
|
||||||
|
"| ISS-002 | Major | No validation | Open | v1 |\n"
|
||||||
|
)
|
||||||
|
senior_output_v1 = (
|
||||||
|
"### Confirmed Issues\n"
|
||||||
|
"- Missing auth\n\n"
|
||||||
|
f"{tracker_table}\n"
|
||||||
|
"### Verdict\n"
|
||||||
|
"VERDICT: FAIL"
|
||||||
|
)
|
||||||
|
senior_output_v2 = (
|
||||||
|
"### Confirmed Issues\n"
|
||||||
|
"- None remaining\n\n"
|
||||||
|
"## Issue Tracker\n"
|
||||||
|
"| ISS-ID | Severity | Description | Status | Since |\n"
|
||||||
|
"|--------|----------|-------------|--------|-------|\n"
|
||||||
|
"| ISS-001 | Critical | Missing auth check | Fixed | v1 |\n"
|
||||||
|
"| ISS-002 | Major | No validation | Fixed | v1 |\n"
|
||||||
|
"\n### Verdict\n"
|
||||||
|
"VERDICT: PASS"
|
||||||
|
)
|
||||||
|
|
||||||
|
captured_prompts: list[dict[str, str]] = []
|
||||||
|
|
||||||
|
def _tracking_mock(agent_config, prompt, step_name, **kwargs):
|
||||||
|
captured_prompts.append({
|
||||||
|
"step_name": step_name,
|
||||||
|
"prompt": prompt,
|
||||||
|
"agent_name": agent_config.name,
|
||||||
|
})
|
||||||
|
if step_name == "coding":
|
||||||
|
return AgentResult(
|
||||||
|
output="Coding output",
|
||||||
|
exit_code=0,
|
||||||
|
agent_name=agent_config.name,
|
||||||
|
step_name=step_name,
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
elif step_name == "review":
|
||||||
|
return AgentResult(
|
||||||
|
output="Review findings\n\nVERDICT: FAIL",
|
||||||
|
exit_code=0,
|
||||||
|
agent_name=agent_config.name,
|
||||||
|
step_name=step_name,
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
elif step_name == "senior_review":
|
||||||
|
# First call: FAIL with tracker, second call: PASS
|
||||||
|
senior_calls = [
|
||||||
|
p for p in captured_prompts if p["step_name"] == "senior_review"
|
||||||
|
]
|
||||||
|
if len(senior_calls) <= 1:
|
||||||
|
output = senior_output_v1
|
||||||
|
else:
|
||||||
|
output = senior_output_v2
|
||||||
|
return AgentResult(
|
||||||
|
output=output,
|
||||||
|
exit_code=0,
|
||||||
|
agent_name=agent_config.name,
|
||||||
|
step_name=step_name,
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
return AgentResult(
|
||||||
|
output="",
|
||||||
|
exit_code=0,
|
||||||
|
agent_name=agent_config.name,
|
||||||
|
step_name=step_name,
|
||||||
|
duration_seconds=0.1,
|
||||||
|
)
|
||||||
|
|
||||||
|
with patch("cross_eval.pipeline.invoke_agent", side_effect=_tracking_mock):
|
||||||
|
result = run_pipeline(config)
|
||||||
|
|
||||||
|
self.assertEqual(result.final_verdict, "PASS")
|
||||||
|
self.assertEqual(len(result.iterations), 2)
|
||||||
|
|
||||||
|
# Verify that the second iteration's senior_review prompt contains
|
||||||
|
# the tracker table from iteration 1
|
||||||
|
iter2_senior_prompts = [
|
||||||
|
p for p in captured_prompts
|
||||||
|
if p["step_name"] == "senior_review"
|
||||||
|
and "ISS-001" in p["prompt"]
|
||||||
|
and "Missing auth check" in p["prompt"]
|
||||||
|
]
|
||||||
|
# The second senior_review call should have the tracker in its prompt
|
||||||
|
self.assertTrue(
|
||||||
|
len(iter2_senior_prompts) >= 1,
|
||||||
|
"Expected previous_senior_tracker content (ISS-001) to appear "
|
||||||
|
"in at least one senior_review prompt",
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||
Reference in New Issue
Block a user