feat: ESCALATE verdict, issue tracker, onboarding commands

Add 3-verdict system (PASS/FAIL/ESCALATE) with priority handling across simple and phased pipelines. Senior reviewers can now escalate issues requiring human intervention, immediately breaking the review loop.

- ESCALATE verdict extraction with highest priority over PASS/FAIL
- Issue Tracker tables (ISS-NNN) carried across iterations
- Auto-escalate heuristic using (file, keyword) composite fingerprints
- Report restructuring: executive view first (verdict → tracker → metrics)
- Onboarding: `doctor`, `demo`, `init --guided` commands
- Exit codes: PASS=0, FAIL=1, ESCALATE=2
- 87 tests passing (54 config + 25 onboarding + 8 integration)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 18:19:05 +09:00
parent ee4f1a07ef
commit 204e071b74
15 changed files with 3032 additions and 156 deletions
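
The commit message states that ESCALATE takes priority over PASS/FAIL and that the exit codes are PASS=0, FAIL=1, ESCALATE=2. The following is a minimal sketch of how that could be expressed, not the code from this commit: `extract_verdict` and `exit_code_for` are hypothetical names, ranking FAIL above PASS is an assumption, and so is falling back to FAIL when no `VERDICT:` line is present.

```python
import re

# Exit codes as listed in the commit message.
EXIT_CODES = {"PASS": 0, "FAIL": 1, "ESCALATE": 2}

# ESCALATE first (per the commit message); FAIL before PASS is an assumption.
_PRIORITY = ("ESCALATE", "FAIL", "PASS")

_VERDICT_RE = re.compile(r"VERDICT:\s*(PASS|FAIL|ESCALATE)\b")


def extract_verdict(text: str) -> str:
    """Return the highest-priority verdict found in reviewer output.

    Assumption: output without any VERDICT line is treated as FAIL.
    """
    found = set(_VERDICT_RE.findall(text))
    for verdict in _PRIORITY:
        if verdict in found:
            return verdict
    return "FAIL"


def exit_code_for(text: str) -> int:
    """Map reviewer output to a process exit code."""
    return EXIT_CODES[extract_verdict(text)]
```

A caller could pass the senior reviewer's output to `exit_code_for` and hand the result to `sys.exit`, so that shell scripts can tell a fixable failure (1) from one that needs a human (2).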
--- a/cross_eval/prompts.py
+++ b/cross_eval/prompts.py
@@ -12,7 +12,7 @@ from cross_eval.models import PhaseConfig, StepConfig
 # Default prompt templates
 # ---------------------------------------------------------------------------

-GENERATE_TEMPLATE = """\
+CODING_TEMPLATE = """\
 You are tasked with implementing code based on a plan and checklist.

 ## Plan
@@ -53,8 +53,8 @@ You are tasked with reviewing code against a plan and checklist.
 ## Reference Documents
 {docs}

-## Generated Code / Previous Step Output
-{generated_code}
+## Coding Output / Previous Step Output
+{coding_output}

 ## Previous Review Feedback
 {feedback}
@@ -94,10 +94,10 @@ security concerns, performance problems), report them separately under \
 (Write "N/A" if no previous feedback was provided.)

 ### Issues Found
-List issues ordered by severity (Critical first):
-- [Critical][Over-engineering] Description (reference specific plan/checklist item)
-- [Major][Omission] Description (reference specific plan/checklist item)
-- [Minor][Omission] Description (reference specific plan/checklist item)
+List issues ordered by severity (Critical first). Assign each issue a unique ID (ISS-NNN):
+- ISS-001 [Critical][Over-engineering] Description (reference specific plan/checklist item)
+- ISS-002 [Major][Omission] Description (reference specific plan/checklist item)
+- ISS-003 [Minor][Omission] Description (reference specific plan/checklist item)

 ### Out of Scope Issues
 Issues found outside plan/checklist scope but worth noting:
@@ -119,7 +119,7 @@ Otherwise output: VERDICT: FAIL
 """


-GENERATE_TEMPLATE_KO = """\
+CODING_TEMPLATE_KO = """\
 당신은 기획서와 체크리스트를 기반으로 코드를 구현하는 개발자입니다.

 ## 기획서
@@ -159,7 +159,7 @@ REVIEW_TEMPLATE_KO = """\
 {docs}

 ## 검토 대상 코드
-{generated_code}
+{coding_output}

 ## 이전 리뷰 피드백
 {feedback}
@@ -195,10 +195,10 @@ REVIEW_TEMPLATE_KO = """\
 (이전 피드백이 없으면 "해당 없음"이라고 작성하세요.)

 ### 발견된 이슈
-심각도 순서(Critical 먼저)로 나열:
-- [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
-- [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
-- [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
+심각도 순서(Critical 먼저)로 나열. 각 이슈에 고유 ID(ISS-NNN)를 부여하세요:
+- ISS-001 [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
+- ISS-002 [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
+- ISS-003 [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)

 ### 범위 밖 이슈
 기획서/체크리스트 범위 밖이지만 주목할 만한 이슈:
@@ -357,6 +357,150 @@ REVIEW_ONLY_TEMPLATE_KO = """\
 그렇지 않으면: VERDICT: FAIL
 """

+PLAN_REVIEW_TEMPLATE = """\
+You are tasked with reviewing planning documents before implementation begins.
+
+## Plan
+{plan}
+
+## Checklist
+{checklist}
+
+## Reference Documents
+{docs}
+
+## Previous Review (iteration {iteration} of {max_iterations})
+{feedback}
+
+## Review Instructions
+Review the planning package itself: the plan, checklist, and reference documents.
+You MAY inspect the current repository to validate feasibility, constraints, and integration assumptions.
+Do NOT write or modify code. Assume implementation has NOT started yet.
+
+Your job is to find planning issues that would likely cause bad implementation outcomes:
+- Ambiguous or contradictory requirements
+- Missing acceptance criteria, constraints, edge cases, or dependencies
+- Scope that is broader or more complex than the stated objective
+- Checklist items that do not verify the actual requirements
+- Plan details that conflict with the current codebase or architecture
+
+If previous review results are provided above, you MUST:
+1. Verify each previously reported issue — is it a real issue or a false positive?
+2. Look for issues the previous review MISSED.
+3. Do NOT simply repeat the previous review. Provide your own independent assessment.
+4. Explicitly mark items as CONFIRMED (still an issue) or DISMISSED (false positive).
+
+For each issue found, classify it with BOTH severity AND category:
+
+Severity levels:
+- **Critical**: The plan is likely to cause fundamentally wrong implementation or unsafe behavior.
+- **Major**: Important requirements, constraints, or acceptance criteria are unclear, conflicting, missing, or incompatible with the existing system.
+- **Minor**: Wording, structure, or checklist quality problems that reduce implementation clarity.
+
+Categories:
+- **Over-engineering**: The plan introduces scope, abstractions, or complexity not justified by the stated objective.
+- **Omission**: A necessary requirement, constraint, acceptance criterion, edge case, dependency, or compatibility consideration is missing or incomplete.
+
+If you find issues outside the planning scope (e.g. repository health, pre-existing code problems), report them separately under "Out of Scope Issues".
+
+## Output Format
+
+### Issues Found
+List issues ordered by severity (Critical first):
+- [Critical][Over-engineering] Description (reference specific plan/checklist item)
+- [Major][Omission] Description (reference specific plan/checklist item)
+- [Minor][Omission] Description (reference specific plan/checklist item)
+
+### Out of Scope Issues
+Issues found outside planning scope but worth noting:
+- [Critical] Description of issue
+- [Minor] Description of issue
+(Write "None" if no out-of-scope issues found.)
+
+### Summary
+- Critical: N, Major: N, Minor: N
+- Over-engineering count: N
+- Omission count: N
+- CONFIRMED: N, DISMISSED: N
+- Overall quality: [BRIEF ASSESSMENT]
+
+### Verdict
+If the planning documents are clear, complete enough to implement, compatible with the current repository, and free of unjustified scope, output: VERDICT: PASS
+Otherwise output: VERDICT: FAIL
+"""
+
+PLAN_REVIEW_TEMPLATE_KO = """\
+당신은 구현 시작 전에 기획 문서를 검토하는 리뷰어입니다.
+
+## 기획서
+{plan}
+
+## 체크리스트
+{checklist}
+
+## 참고 문서
+{docs}
+
+## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
+{feedback}
+
+## 검토 지침
+검토 대상은 코드가 아니라 기획 패키지 자체입니다: 기획서, 체크리스트, 참고 문서를 함께 검토하세요.
+현재 저장소를 살펴보며 구현 가능성, 제약조건, 통합 가정이 맞는지도 확인할 수 있습니다.
+코드를 생성하거나 수정하지 마세요. 아직 구현이 시작되지 않았다고 가정하세요.
+
+목표는 구현 단계에서 문제를 일으킬 기획 결함을 찾는 것입니다:
+- 요구사항이 모호하거나 서로 충돌하는 경우
+- 수용 기준, 제약조건, 엣지 케이스, 의존성이 빠진 경우
+- 목표 대비 범위가 지나치게 넓거나 복잡한 경우
+- 체크리스트가 실제 요구사항 검증에 충분하지 않은 경우
+- 기획 내용이 현재 코드베이스나 아키텍처와 충돌하는 경우
+
+이전 리뷰 결과가 제공된 경우 반드시:
+1. 이전에 보고된 각 이슈를 검증하세요 — 진짜 이슈인지 오탐인지?
+2. 이전 리뷰가 놓친 새로운 이슈를 찾으세요.
+3. 이전 리뷰를 그대로 반복하지 마세요. 독립적인 평가를 제공하세요.
+4. 각 항목에 CONFIRMED (여전히 이슈) 또는 DISMISSED (오탐) 태그를 명시하세요.
+
+발견된 각 이슈에 심각도와 카테고리를 모두 부여하세요:
+
+심각도:
+- **Critical**: 잘못된 구현이나 위험한 동작으로 직결될 가능성이 큰 기획 결함.
+- **Major**: 중요한 요구사항, 제약조건, 수용 기준이 모호하거나 충돌하거나 누락되었거나 기존 시스템과 맞지 않는 경우.
+- **Minor**: 문서 표현, 구조, 체크리스트 품질 문제로 구현 명확성이 떨어지는 경우.
+
+카테고리:
+- **과최적화**: 목표 대비 불필요한 범위, 추상화, 복잡성을 기획에 추가한 경우.
+- **누락**: 필요한 요구사항, 제약조건, 수용 기준, 엣지 케이스, 의존성, 호환성 고려가 빠졌거나 불완전한 경우.
+
+기획 범위 밖에서 발견된 문제(저장소 상태, 기존 코드 문제 등)는 "범위 밖 이슈" 섹션에 별도로 보고하세요.
+
+## 출력 형식
+
+### 발견된 이슈
+심각도 순서(Critical 먼저)로 나열:
+- [Critical][과최적화] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
+- [Major][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
+- [Minor][누락] 이슈 설명 (관련 기획서/체크리스트 항목 참조)
+
+### 범위 밖 이슈
+기획 범위 밖이지만 주목할 만한 이슈:
+- [Critical] 이슈 설명
+- [Minor] 이슈 설명
+(범위 밖 이슈가 없으면 "없음"이라고 작성하세요.)
+
+### 요약
+- Critical: N, Major: N, Minor: N
+- 과최적화 수: N
+- 누락 수: N
+- CONFIRMED: N, DISMISSED: N
+- 전체 품질: [간략한 평가]
+
+### 판정
+기획 문서가 구현 가능한 수준으로 명확하고 충분하며 현재 저장소와도 정합적이고, 불필요한 범위 확장이 없으면: VERDICT: PASS
+그렇지 않으면: VERDICT: FAIL
+"""
+
 AGGREGATE_REVIEW_TEMPLATE = """\
 You are adjudicating multiple review results and turning them into an actionable decision.

@@ -378,6 +522,9 @@ You are adjudicating multiple review results and turning them into an actionable
 ## Previous Verification Feedback
 {feedback}

+## Previous Issue Tracker
+{previous_senior_tracker}
+
 ## Instructions
 Explore the project directory to confirm the current codebase state. Then:
 1. Deduplicate overlapping issues across reviewers.
@@ -385,7 +532,12 @@ Explore the project directory to confirm the current codebase state. Then:
 3. Keep only issues supported by the plan, checklist, code, or reviewer evidence.
 4. When evidence is mixed, explain what was confirmed, what was dismissed, and what still needs follow-up.
 5. Produce a prioritized action list for the coder.
-6. If no confirmed issue remains, output VERDICT: PASS. Otherwise VERDICT: FAIL.
+6. Maintain the Issue Tracker table across iterations (carry forward unresolved issues).
+7. If no confirmed issue remains, output VERDICT: PASS.
+8. If issues exist that the coder can fix, output VERDICT: FAIL.
+9. If issues require human intervention (ambiguous requirements, architecture decisions, \
+external dependency problems, or the same issue persists after 2+ fix attempts), \
+output VERDICT: ESCALATE.

 ## Output Format

@@ -401,13 +553,19 @@ Explore the project directory to confirm the current codebase state. Then:
 1. Concrete fix the coder should make
 2. Concrete fix the coder should make

+## Issue Tracker
+
+| ISS-ID | Severity | Description | Status | Since |
+|--------|----------|-------------|--------|-------|
+| ISS-001 | Critical | ... | Open/Fixed/Dismissed | v1 |
+
 ### Summary
 - Confirmed issues: N
 - Dismissed findings: N (false positive: N, already fixed: N)
 - Overall quality: [BRIEF ASSESSMENT]

 ### Verdict
-VERDICT: PASS or VERDICT: FAIL
+VERDICT: PASS or VERDICT: FAIL or VERDICT: ESCALATE
 """

 AGGREGATE_REVIEW_TEMPLATE_KO = """\
@@ -431,6 +589,9 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 ## 이전 검증 피드백
 {feedback}

+## 이전 이슈 트래커
+{previous_senior_tracker}
+
 ## 지침
 프로젝트 디렉토리를 탐색하여 현재 코드베이스 상태를 확인한 뒤 다음을 수행하세요.
 1. 리뷰어들 사이에 중복되는 이슈를 합치세요.
@@ -438,7 +599,11 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 3. 기획서, 체크리스트, 코드, 리뷰 근거로 뒷받침되는 이슈만 남기세요.
 4. 근거가 엇갈리면 무엇이 확정이고 무엇이 기각 또는 추가확인 대상인지 분명히 적으세요.
 5. coder가 바로 수정할 수 있는 우선순위 액션 아이템을 만드세요.
-6. 확정된 이슈가 없으면 VERDICT: PASS, 있으면 VERDICT: FAIL 을 출력하세요.
+6. 이슈 트래커 테이블을 반복 간에 유지하세요 (미해결 이슈를 이월).
+7. 확정된 이슈가 없으면 VERDICT: PASS 를 출력하세요.
+8. coder가 수정 가능한 이슈가 있으면 VERDICT: FAIL 을 출력하세요.
+9. 사람의 개입이 필요한 이슈(모호한 요구사항, 아키텍처 결정, 외부 의존성 문제, \
+동일 이슈가 2회 이상 해결 실패)가 있으면 VERDICT: ESCALATE 를 출력하세요.

 ## 출력 형식

@@ -454,26 +619,34 @@ AGGREGATE_REVIEW_TEMPLATE_KO = """\
 1. coder가 수정해야 할 구체적인 작업
 2. coder가 수정해야 할 구체적인 작업

+## 이슈 트래커
+
+| ISS-ID | 심각도 | 설명 | 상태 | 최초 발견 |
+|--------|--------|------|------|-----------|
+| ISS-001 | Critical | ... | Open/Fixed/Dismissed | v1 |
+
 ### 요약
 - 확정 이슈 수: N
 - 기각된 주장 수: N (오탐: N, 수정 완료: N)
 - 전체 품질: [간략한 평가]

 ### 판정
-VERDICT: PASS 또는 VERDICT: FAIL
+VERDICT: PASS 또는 VERDICT: FAIL 또는 VERDICT: ESCALATE
 """


 DEFAULT_TEMPLATES: dict[str, dict[str, str]] = {
    "en": {
-        "generate": GENERATE_TEMPLATE,
+        "coding": CODING_TEMPLATE,
        "review": REVIEW_TEMPLATE,
+        "plan-review": PLAN_REVIEW_TEMPLATE,
        "review-only": REVIEW_ONLY_TEMPLATE,
        "aggregate-review": AGGREGATE_REVIEW_TEMPLATE,
    },
    "ko": {
-        "generate": GENERATE_TEMPLATE_KO,
+        "coding": CODING_TEMPLATE_KO,
        "review": REVIEW_TEMPLATE_KO,
+        "plan-review": PLAN_REVIEW_TEMPLATE_KO,
        "review-only": REVIEW_ONLY_TEMPLATE_KO,
        "aggregate-review": AGGREGATE_REVIEW_TEMPLATE_KO,
    },
@@ -544,18 +717,18 @@ def _build_named_bundle(
 def _build_simple_preset(
    coders: list[str], reviewers: list[str], seniors: list[str],
 ) -> list[StepConfig]:
-    """First coder generates, first reviewer reviews."""
+    """First coder writes code, first reviewer reviews."""
    if not coders:
        raise ValueError("'simple' preset requires at least 1 coder")
    if not reviewers:
        raise ValueError("'simple' preset requires at least 1 reviewer")
    steps = [
        StepConfig(
-            name="generate",
+            name="coding",
            agent=coders[0],
-            role="generate",
-            prompt_template="default:generate",
-            output_key="generated_code",
+            role="coding",
+            prompt_template="default:coding",
+            output_key="coding_output",
        ),
        StepConfig(
            name="review",
@@ -576,7 +749,7 @@ def _build_simple_preset(
                output_key="senior_review_result",
                verdict=True,
                context_override={
-                    "candidate_outputs": "## Generated code\n{generated_code}",
+                    "candidate_outputs": "## Coding output\n{coding_output}",
                    "reviews_bundle": f"## Review: {reviewers[0]} (review)\n{{review_result}}",
                },
            ),
@@ -587,25 +760,25 @@ def _build_simple_preset(
 def _build_cross_review_preset(
    coders: list[str], reviewers: list[str], seniors: list[str],
 ) -> list[StepConfig]:
-    """Both coders generate, then cross-review each other's output."""
+    """Both coders write code, then cross-review each other's output."""
    if len(coders) < 2:
        raise ValueError("'cross-review' preset requires at least 2 coders")
    a, b = coders[0], coders[1]
    ak, bk = _unique_safe_keys([a, b])
    steps = [
        StepConfig(
-            name=f"generate_{ak}",
+            name=f"coding_{ak}",
            agent=a,
-            role="generate",
-            prompt_template="default:generate",
+            role="coding",
+            prompt_template="default:coding",
            output_key=f"code_{ak}",
            parallel=True,
        ),
        StepConfig(
-            name=f"generate_{bk}",
+            name=f"coding_{bk}",
            agent=b,
-            role="generate",
-            prompt_template="default:generate",
+            role="coding",
+            prompt_template="default:coding",
            output_key=f"code_{bk}",
            parallel=True,
        ),
@@ -615,7 +788,7 @@ def _build_cross_review_preset(
            role="review",
            prompt_template="default:review",
            output_key=f"review_by_{ak}",
-            context_override={"generated_code": f"{{code_{bk}}}"},
+            context_override={"coding_output": f"{{code_{bk}}}"},
            parallel=True,
            verdict=not seniors,
        ),
@@ -626,7 +799,7 @@ def _build_cross_review_preset(
            prompt_template="default:review",
            output_key=f"review_by_{bk}",
            verdict=not seniors,
-            context_override={"generated_code": f"{{code_{ak}}}"},
+            context_override={"coding_output": f"{{code_{ak}}}"},
            parallel=True,
        ),
    ]
@@ -642,9 +815,9 @@ def _build_cross_review_preset(
                context_override={
                    "candidate_outputs": _build_named_bundle(
                        [a, b],
-                        [f"generate_{ak}", f"generate_{bk}"],
+                        [f"coding_{ak}", f"coding_{bk}"],
                        [f"code_{ak}", f"code_{bk}"],
-                        "Candidate",
+                        "Coding Output",
                    ),
                    "reviews_bundle": _build_named_bundle(
                        [a, b],
@@ -715,6 +888,61 @@ def _build_review_only_preset(
    return steps


+def _build_plan_review_preset(
+    coders: list[str], reviewers: list[str], seniors: list[str],
+) -> list[StepConfig]:
+    """Plan-review: reviewers audit planning docs before implementation."""
+    if not reviewers:
+        raise ValueError("'plan-review' preset requires at least 1 reviewer")
+
+    if len(reviewers) == 1 and not seniors:
+        return [
+            StepConfig(
+                name="plan_review",
+                agent=reviewers[0],
+                role="review",
+                prompt_template="default:plan-review",
+                output_key="plan_review_result",
+                verdict=True,
+            ),
+        ]
+
+    steps: list[StepConfig] = []
+    reviewer_keys = _unique_safe_keys(reviewers)
+    for reviewer, rk in zip(reviewers, reviewer_keys):
+        steps.append(
+            StepConfig(
+                name=f"plan_review_{rk}",
+                agent=reviewer,
+                role="review",
+                prompt_template="default:plan-review",
+                output_key=f"plan_review_{rk}",
+                verdict=not seniors,
+                parallel=True,
+            ),
+        )
+    if seniors:
+        step_names = [f"plan_review_{rk}" for rk in reviewer_keys]
+        output_keys = [f"plan_review_{rk}" for rk in reviewer_keys]
+        steps.append(
+            StepConfig(
+                name="senior_review",
+                agent=seniors[0],
+                role="review",
+                prompt_template="default:aggregate-review",
+                output_key="senior_review_result",
+                verdict=True,
+                context_override={
+                    "candidate_outputs": "Planning documents under review (plan/checklist/reference docs).",
+                    "reviews_bundle": _build_named_bundle(
+                        reviewers, step_names, output_keys, "Review",
+                    ),
+                },
+            ),
+        )
+    return steps
+
+
 def _build_review_fix_preset(
    coders: list[str], reviewers: list[str], seniors: list[str],
 ) -> list[PhaseConfig]:
@@ -762,11 +990,11 @@ def _build_review_fix_preset(
                    },
                ),
                StepConfig(
-                    name="generate",
+                    name="coding",
                    agent=fix_coder,
-                    role="generate",
-                    prompt_template="default:generate",
-                    output_key="generated_code",
+                    role="coding",
+                    prompt_template="default:coding",
+                    output_key="coding_output",
                    context_override={"feedback": "{aggregate_review}"},
                ),
                StepConfig(
@@ -784,14 +1012,44 @@ def _build_review_fix_preset(
    ]


+def _build_coding_review_fix_preset(
+    coders: list[str], reviewers: list[str], seniors: list[str],
+) -> list[PhaseConfig]:
+    """Write code once, then run the review-fix convergence loop."""
+    if not coders:
+        raise ValueError("'coding-review-fix' preset requires at least 1 coder")
+    if not reviewers:
+        raise ValueError("'coding-review-fix' preset requires at least 1 reviewer")
+
+    return [
+        PhaseConfig(
+            name="initial_coding",
+            steps=[
+                StepConfig(
+                    name="coding",
+                    agent=coders[0],
+                    role="coding",
+                    prompt_template="default:coding",
+                    output_key="coding_output",
+                ),
+            ],
+            max_iterations=1,
+            consecutive_pass=1,
+        ),
+        *_build_review_fix_preset(coders, reviewers, seniors),
+    ]
+
+
 PIPELINE_PRESETS: dict[str, Callable] = {
    "simple": _build_simple_preset,
    "cross-review": _build_cross_review_preset,
+    "plan-review": _build_plan_review_preset,
    "review-only": _build_review_only_preset,
 }

 PHASED_PRESETS: dict[str, Callable] = {
    "review-fix": _build_review_fix_preset,
+    "coding-review-fix": _build_coding_review_fix_preset,
 }

 ALL_PRESET_NAMES: list[str] = list(PIPELINE_PRESETS.keys()) + list(PHASED_PRESETS.keys())
@@ -805,7 +1063,7 @@ def resolve_template(template_ref: str, templates_dir: Optional[Path] = None) ->
    """Resolve a template reference to its content string.

    Formats:
-    - "default:generate" -> built-in GENERATE_TEMPLATE
+    - "default:coding"   -> built-in CODING_TEMPLATE
    - "default:review"   -> built-in REVIEW_TEMPLATE
    - "path/to/file.md"  -> read file contents
    """
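
The aggregate-review template in this diff asks the senior reviewer to emit an Issue Tracker table (ISS-ID, Severity, Description, Status, Since) and to carry unresolved issues into the next iteration through `{previous_senior_tracker}`. One way that value might be derived from the previous output is sketched below. `Issue`, `parse_tracker`, and `carry_forward` are hypothetical names, not functions from this commit, and the sketch assumes a Status cell holds exactly one of Open, Fixed, or Dismissed and that descriptions contain no `|` characters.

```python
import re
from dataclasses import dataclass

# Matches table rows whose first cell is an issue ID such as ISS-001.
_ROW_RE = re.compile(r"^\|\s*ISS-\d+\s*\|")


@dataclass
class Issue:
    iss_id: str
    severity: str
    description: str
    status: str
    since: str


def parse_tracker(markdown: str) -> list[Issue]:
    """Parse Issue Tracker rows; header, separator, and malformed rows are skipped."""
    issues = []
    for line in markdown.splitlines():
        line = line.strip()
        if not _ROW_RE.match(line):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 5:
            issues.append(Issue(*cells))
    return issues


def carry_forward(issues: list[Issue]) -> str:
    """Render only Open issues as a table for the next iteration's prompt."""
    rows = [
        "| ISS-ID | Severity | Description | Status | Since |",
        "|--------|----------|-------------|--------|-------|",
    ]
    for issue in issues:
        if issue.status == "Open":
            rows.append(
                f"| {issue.iss_id} | {issue.severity} | {issue.description} "
                f"| {issue.status} | {issue.since} |"
            )
    return "\n".join(rows)
```

Dropping Fixed and Dismissed rows keeps the prompt short. Keeping them instead would preserve history for the report, at the cost of a longer context on each iteration.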