cross-eval

Author	SHA1	Message	Date
chungyeong	28efd5bb8f	fix: use incremental diff per iteration instead of cumulative base diff After each iteration's _commit_iteration, record the new HEAD SHA and use it as the diff anchor for the next iteration. Previously capture_diff always diffed against the initial base commit, causing every iteration to return the same full cumulative diff — reviewers couldn't see what changed between iterations, leading to repeated feedback and stuck FAIL loops. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 10:07:11 +09:00
chungyeong	bf64d19123	Fix plan-review worktree document tracking	2026-03-15 00:35:42 +09:00
chungyeong	a85a490a9b	Make plan-review a review-fix-verify loop	2026-03-15 00:01:26 +09:00
chungyeong	60c7b07939	fix: capture_diff uses base commit to handle agent self-commits Claude in agentic mode (interactive, no -p flag) commits its own changes, advancing HEAD. This made `git diff --cached HEAD` return empty, triggering false EMPTY_DIFF errors every time. Now capture_diff diffs against the base commit SHA recorded at worktree creation, so changes are captured regardless of whether the agent committed them. Also adds UX_IMPROVEMENT_PLAN.md for guided message improvements. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 23:59:53 +09:00
이충영 에이닷서비스개발	af05fc1ddb	fix: preserve agentic branch when intermediate commits exist _finalize_worktree was returning None and deleting the branch when the final commit was empty, even though _commit_iteration had already committed changes during the pipeline. Now checks git log for any commits on the branch before deciding to clean up. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 20:48:25 +09:00
이충영 에이닷서비스개발	0858675076	fix: remove --permission-mode plan from reviewer args Plan mode causes Claude to spend all time on tool calls (Read/Grep) in -p mode, producing empty stdout. Reviewers receive full context (diff, plan, checklist) via the prompt, so file access is not required. Without --permission-mode, -p mode defaults to read-allowed, write-denied. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 18:29:53 +09:00
이충영 에이닷서비스개발	cc8d583914	fix: Claude reviewer empty output, worktree isolation false positives, and input file access. (1) Add -p flag to _CLAUDE_REVIEW_ARGS so reviewer uses print mode (stdin→stdout) instead of interactive mode, which conflicts with plan permission mode. (2) Copy input files (plan, checklist) into worktree .cross-eval-inputs/ so agents in plan mode can access them without escaping the sandbox. (3) Simplify _snapshot_repo_state to use only git diff HEAD + untracked hashes, eliminating false positives from staging state changes (git diff --cached) and git status index drift during long-running pipelines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 16:19:57 +09:00
chungyeong	7b95233edf	feat: tighten agentic runtime handoffs and quality gates	2026-03-14 10:05:25 +09:00
chungyeong	87bc0ffbfb	feat: propagate execution evidence across iterations and enhance reports. (1) Carry execution evidence forward so reviewer/senior prompts in subsequent iterations can inspect prior transcript and command data. (2) Add {execution_evidence} to REVIEW_ONLY templates (en/ko). (3) Add evidence summary table to iteration reports. (4) Fix test_agentic to match stdin-based prompt delivery for Claude. (5) Add expanded claim/no-change marker tests and cross-iteration evidence propagation tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 23:36:28 +09:00
chungyeong	c467222a2a	fix: instruct coder to use Edit/Write tools instead of describing changes Claude -p mode tends to describe changes in text rather than actually applying them via tools. Added explicit rule requiring tool-based edits so that file modifications produce real git diffs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 23:19:22 +09:00
chungyeong	99cbf171aa	fix: revert -p removal — Claude -p mode has full tool access Claude -p (print mode) is non-interactive but retains full tool access (Edit, Write, Bash, etc.) with --dangerously-skip-permissions. Removing -p caused Claude to enter interactive mode which requires a TTY and produces zero output when run as a subprocess with piped I/O. Now delivers prompt via stdin for both Claude and Codex in agentic mode. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 23:13:12 +09:00
chungyeong	d5fcc258b7	fix: unset CLAUDECODE env var to allow nested Claude subprocess calls Claude Code refuses to launch inside another Claude Code session. Strip the CLAUDECODE marker from the inherited environment so that cross-eval can spawn Claude as a subprocess from within Claude Code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 23:05:16 +09:00
chungyeong	290eace01b	fix: send EOF via empty stdin so Claude exits after agentic prompt Without -p, Claude enters interactive mode and waits for more input indefinitely. Setting input="" closes the stdin pipe immediately, causing Claude to process the positional prompt and then exit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 23:04:13 +09:00
chungyeong	ecf44b4c07	fix: strip -p/--print flags in agentic mode so Claude can actually modify files The agentic invocation path inherited -p (print mode) from _CLAUDE_BASE_ARGS but only stripped the stdin sentinel "-". Print mode makes Claude a one-shot text completer that cannot use tools or write files, resulting in zero diffs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 23:00:40 +09:00
chungyeong	b19d174c98	feat: isolate agentic worktrees and surface execution evidence	2026-03-13 22:50:46 +09:00
chungyeong	3fb19e90c0	feat: harden runtime evidence and claude agentic validation	2026-03-13 22:29:22 +09:00
chungyeong	28dd794f54	feat: add runtime discovery and execution traces	2026-03-13 21:52:13 +09:00
chungyeong	941304398d	release: cut 0.2.0 baseline	2026-03-13 21:47:54 +09:00
chungyeong	204e071b74	feat: ESCALATE verdict, issue tracker, onboarding commands. Add 3-verdict system (PASS/FAIL/ESCALATE) with priority handling across simple and phased pipelines. Senior reviewers can now escalate issues requiring human intervention, immediately breaking the review loop. (1) ESCALATE verdict extraction with highest priority over PASS/FAIL. (2) Issue Tracker tables (ISS-NNN) carried across iterations. (3) Auto-escalate heuristic using (file, keyword) composite fingerprints. (4) Report restructuring: executive view first (verdict → tracker → metrics). (5) Onboarding: `doctor`, `demo`, `init --guided` commands. (6) Exit codes: PASS=0, FAIL=1, ESCALATE=2. (7) 87 tests passing (54 config + 25 onboarding + 8 integration). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-13 18:19:05 +09:00
이충영 에이닷서비스개발	ee4f1a07ef	initial commit	2026-03-11 21:53:14 +09:00
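
Commit 204e071b74 describes a three-verdict system in which ESCALATE takes priority over PASS/FAIL and the verdicts map to exit codes 0, 1 and 2. The following is a minimal sketch of that idea, not the project's actual code: the names `Verdict` and `extract_verdict`, the regular expression, the choice to rank FAIL above PASS when both appear, and the fallback to FAIL when no verdict is found are all assumptions made for illustration.

```python
import re
from enum import IntEnum


class Verdict(IntEnum):
    """Verdicts with the exit codes listed in commit 204e071b74."""
    PASS = 0
    FAIL = 1
    ESCALATE = 2


# Hypothetical pattern: match the verdict keywords as whole, upper-case words.
_VERDICT_RE = re.compile(r"\b(PASS|FAIL|ESCALATE)\b")


def extract_verdict(review_text: str) -> Verdict:
    """Return the highest-priority verdict mentioned in the review text.

    ESCALATE outranks the others, as the commit states. Ranking FAIL above
    PASS, and treating a missing verdict as FAIL, are assumptions.
    """
    found = {Verdict[name] for name in _VERDICT_RE.findall(review_text)}
    if not found:
        return Verdict.FAIL  # assumption: no explicit verdict counts as FAIL
    return max(found)  # numeric order doubles as priority order here
```

Because the enum values equal the exit codes, a caller could finish with `sys.exit(int(extract_verdict(text)))`.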

20 Commits
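
Commits 60c7b07939 and 28efd5bb8f describe how the diff anchor evolved: first the base commit SHA recorded at worktree creation, then the HEAD SHA recorded after each iteration's commit, so that each iteration yields only its own changes. A minimal sketch of that anchoring scheme follows. It is not the repository's implementation: the class `IterationDiffer`, the helper `run_git`, and the method `mark_committed` are hypothetical names, and the injectable `git` callable exists only so the logic can be exercised without a real repository.

```python
import subprocess
from typing import Callable, List


def run_git(args: List[str], cwd: str) -> str:
    """Run a git command in cwd and return its stdout (raises on failure)."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


class IterationDiffer:
    """Tracks the commit that the next diff should be taken against."""

    def __init__(self, worktree: str, base_sha: str,
                 git: Callable[[List[str], str], str] = run_git) -> None:
        self.worktree = worktree
        self.anchor = base_sha  # starts at the SHA recorded at worktree creation
        self._git = git

    def capture_diff(self) -> str:
        # Diff the working tree against the anchor rather than HEAD, so
        # changes the agent committed itself are still captured.
        return self._git(["diff", self.anchor], self.worktree)

    def mark_committed(self) -> None:
        # Call after the iteration's commit: the new HEAD becomes the anchor,
        # so the next diff contains only the next iteration's changes.
        self.anchor = self._git(["rev-parse", "HEAD"], self.worktree).strip()
```

In use, a pipeline would call `capture_diff()` once per iteration, commit, then call `mark_committed()` before the next iteration begins.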