feat: propagate execution evidence across iterations and enhance reports

- Carry execution evidence forward so reviewer/senior prompts in subsequent
  iterations can inspect prior transcript and command data
- Add {execution_evidence} to REVIEW_ONLY templates (en/ko)
- Add evidence summary table to iteration reports
- Fix test_agentic to match stdin-based prompt delivery for Claude
- Add expanded claim/no-change marker tests and cross-iteration evidence
  propagation tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 23:36:28 +09:00
parent c467222a2a
commit 87bc0ffbfb
5 changed files with 591 additions and 10 deletions
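
A minimal sketch of how the new `{execution_evidence}` placeholder might be filled before a review prompt is sent. Only the placeholder names (`{iteration}`, `{max_iterations}`, `{feedback}`, `{execution_evidence}`) come from the diff; `CommandRecord`, `format_evidence`, `build_review_prompt`, the trimmed `TEMPLATE`, and the evidence layout are hypothetical and may differ from what `cross_eval` actually does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CommandRecord:
    """Hypothetical record of one command run in an earlier iteration."""
    command: str
    exit_code: int
    output: str


# Trimmed stand-in for the REVIEW_ONLY template; the real template in
# cross_eval/prompts.py contains more sections than shown here.
TEMPLATE = """\
## Previous Review (iteration {iteration} of {max_iterations})
{feedback}

## Execution Evidence
{execution_evidence}
"""


def format_evidence(records: List[CommandRecord], max_output_chars: int = 200) -> str:
    """Render command records as plain text for the prompt (assumed layout)."""
    if not records:
        return "(no execution evidence recorded)"
    blocks = []
    for record in records:
        output = record.output.strip()
        # Truncate long outputs so the prompt stays a manageable size.
        if len(output) > max_output_chars:
            output = output[:max_output_chars] + "..."
        blocks.append(
            f"$ {record.command}\nexit code: {record.exit_code}\n{output}"
        )
    return "\n\n".join(blocks)


def build_review_prompt(
    iteration: int,
    max_iterations: int,
    feedback: str,
    records: List[CommandRecord],
) -> str:
    """Fill the template, carrying evidence from prior iterations forward."""
    return TEMPLATE.format(
        iteration=iteration,
        max_iterations=max_iterations,
        feedback=feedback,
        execution_evidence=format_evidence(records),
    )


prompt = build_review_prompt(
    iteration=2,
    max_iterations=3,
    feedback="Tests were reported as passing.",
    records=[CommandRecord("pytest -q", 1, "1 failed")],
)
print(prompt)
```

With evidence rendered this way, a reviewer reading the prompt can set the claim in the feedback ("tests were reported as passing") against the recorded exit code of the command that was actually run.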
--- a/cross_eval/prompts.py
+++ b/cross_eval/prompts.py
@@ -243,9 +243,14 @@ You are tasked with reviewing existing code against a plan and checklist.
 ## Previous Review (iteration {iteration} of {max_iterations})
 {feedback}

+## Execution Evidence
+{execution_evidence}
+
 ## Review Instructions
 Explore the project directory thoroughly to understand the full codebase, \
-then evaluate the EXISTING code against ONLY the plan and checklist above.
+then evaluate the EXISTING code against ONLY the plan and checklist above. \
+Use the execution evidence above to verify agent claims against actual \
+command outputs and exit codes.

 You are NOT generating or modifying code. You are auditing what already exists.

@@ -314,9 +319,13 @@ REVIEW_ONLY_TEMPLATE_KO = """\
 ## 이전 리뷰 결과 ({max_iterations}회 중 {iteration}번째)
 {feedback}

+## 실행 증거
+{execution_evidence}
+
 ## 검토 지침
 프로젝트 디렉토리를 직접 탐색하여 전체 코드베이스를 파악한 뒤, \
-위 기획서와 체크리스트 기준으로 **기존 코드**를 평가하세요.
+위 기획서와 체크리스트 기준으로 **기존 코드**를 평가하세요. \
+위 실행 증거를 활용하여 에이전트의 주장을 실제 명령어 출력과 종료 코드로 검증하세요.

 코드를 생성하거나 수정하지 마세요. 이미 존재하는 코드를 감사하는 것이 목적입니다.