fix: capture_diff uses base commit to handle agent self-commits

Claude in agentic mode (interactive, no -p flag) commits its own changes, advancing HEAD. This made `git diff --cached HEAD` return empty, triggering false EMPTY_DIFF errors every time.

Now capture_diff diffs against the base commit SHA recorded at worktree creation, so changes are captured regardless of whether the agent committed them.

Also adds UX_IMPROVEMENT_PLAN.md for guided message improvements.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 23:59:53 +09:00
parent af05fc1ddb
commit 60c7b07939
6 changed files with 281 additions and 28 deletions
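The failure mode described in the commit message can be reproduced outside cross-eval. The sketch below is not the project's code: `diff_args`, `git`, and `demo` are hypothetical helpers written for illustration, and it assumes a `git` binary is on PATH. It uses `git diff --cached <commit>`, which should be equivalent to the `git diff <commit> --cached` ordering in the patch.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def diff_args(base_commit: Optional[str]) -> list[str]:
    """Arguments for diffing the index against a base (hypothetical helper).

    With no base commit this is the old behavior: index vs. HEAD.
    """
    return ["diff", "--cached", base_commit or "HEAD"]


def git(args: list[str], cwd: Path) -> str:
    """Run `git <args>` in cwd and return stripped stdout (hypothetical helper)."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def demo() -> tuple[str, str]:
    """Return (diff vs. HEAD, diff vs. base) after a simulated agent self-commit."""
    # Throwaway identity so commits work without any global git config.
    ident = ["-c", "user.name=demo", "-c", "user.email=demo@example.com",
             "-c", "commit.gpgsign=false"]
    with tempfile.TemporaryDirectory() as td:
        repo = Path(td)
        git(["init", "-q"], repo)
        (repo / "README.md").write_text("base\n", encoding="utf-8")
        git(["add", "-A"], repo)
        git([*ident, "commit", "-q", "-m", "base"], repo)
        base_commit = git(["rev-parse", "HEAD"], repo)  # anchor for later diffs

        # Simulate an agent that edits a file and commits it on its own.
        (repo / "new_file.txt").write_text("hello\n", encoding="utf-8")
        git(["add", "-A"], repo)
        git([*ident, "commit", "-q", "-m", "agent self-commit"], repo)

        git(["add", "-A"], repo)  # stage leftovers, as capture_diff does
        vs_head = git(diff_args(None), repo)
        vs_base = git(diff_args(base_commit), repo)
        return vs_head, vs_base
```

With git available, `demo()` is expected to return an empty string for the HEAD comparison (the agent's commit already moved HEAD) and a diff mentioning `new_file.txt` for the base comparison, which is the behavior the fix relies on.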
--- a/UX_IMPROVEMENT_PLAN.md
+++ b/UX_IMPROVEMENT_PLAN.md
@@ -0,0 +1,178 @@
 # cross-eval UX Improvement Plan
 > Raise the overall quality of user guidance messages, error messages, and help text
 > so that even first-time users can run the pipeline without getting stuck.
 ---
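 ## 1. CLI Help Text Improvements
 ### 1.1 `cross-eval` main help
 - [ ] Add a one-line summary of "what problem this tool solves" to the main description
  - Current: "A CLI tool that automatically verifies the output of AI coding agents"
  - Improved: "Automatically cross-checks whether an AI coding agent implemented what the plan specifies. Catches over-optimization, omissions, and false passes"
 - [ ] Add a one-line description per subcommand to the main help (init/doctor/demo/run)
 ### 1.2 `cross-eval run` help
 - [ ] The preset table in the epilog is too long; add a three-line "quick selection guide"
  - Example: "Use simple if you are new, review-only for review only, and coding-review-fix for coding + review + auto-fix"
 - [ ] List the aliases (extra-high, x-high, etc.) in the `--reasoning-effort` help
 - [ ] Explain how the `--target` option actually affects the prompt
 - [ ] Add a summary of worktree creation/cleanup behavior to the `--agentic` flag description
 - [ ] Add one line to the `--min-iter` description on "why iterate after a PASS"
  - Example: "For checking result stability. Re-verifies that a single PASS was not a fluke"
 - [ ] Make the `--dry-run` description explicit: "preview the prompts only, without invoking agents"
 - [ ] Make the agent shorthand rules (claude → claude-coder, etc.) clearer, with examples
 ### 1.3 `cross-eval init` help
 - [ ] Make the `--guided` option more prominent with a note such as "--guided is recommended for first-time users"
 - [ ] Add one line per generated file explaining "how each file is used"
 ### 1.4 `cross-eval doctor` help
 - [ ] Show the list of checked items up front
 - [ ] Include concrete commands for "what to do when authentication fails"
 ### 1.5 `cross-eval demo` help
 - [ ] Add a comparison that shows the mock vs live difference at a glance
 - [ ] Emphasize that the `--escalate` option is mock-only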
 ---
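 ## 2. Error Message Improvements
 ### 2.1 Missing required input
 - [ ] Clear error when `cross-eval run` is executed without `--plan`:
  - "A plan is required. Specify it with --plan plan.md or in inputs.plan in .cross-eval/config.yaml."
 - [ ] Add an INFO message saying defaults are in use when running without config.yaml
 ### 2.2 Agent failure messages
 - [ ] Suggest concrete fix commands on `AUTH` failure
  - Claude: "Authenticate with claude login"
  - Codex: "Authenticate with codex auth"
 - [ ] On `USAGE_LIMIT`, hint at which limit was hit (tokens? billing?)
 - [ ] On `EMPTY_DIFF`, show "The agent did not modify any files" plus a list of possible causes
 - [ ] On `WRITE_FAILURE`, print the worktree path and permission state
 - [ ] On empty agent output, suggest causes such as "The agent did not respond. The prompt may be too long or authentication may have expired"
 ### 2.3 Configuration validation errors
 - [ ] Make duplicate step name errors specific about "which step in which phase is duplicated"
 - [ ] Include an "Available agents: ..." list when a nonexistent agent is referenced (already present, but verify)
 - [ ] Include line numbers in YAML parsing errors
 ### 2.4 File/path errors
 - [ ] Improve "File not found: {path}" to a localized message that also prints the working directory: "File not found: {path}\n  Current directory: {cwd}"
 - [ ] When the docs directory is empty → "The reference docs folder is empty: {path}\n  Add document files such as .md or .txt"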
 ---
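 ## 3. Progress Message Improvements
 ### 3.1 During pipeline execution
 - [ ] Print a summary banner at startup:
  ```
  ━━━ cross-eval ━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Plan:      .cross-eval/plan.md
    Preset:    simple (code→review→repeat)
    Coder:     claude-coder
    Reviewer:  claude-reviewer
    Max iter:  3
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  ```
 - [ ] At the start of each iteration, print a one-line description of "what this step is trying to do"
  - Example: "Iteration 1/3 - Coder is producing the initial implementation from the plan..."
  - Example: "Iteration 2/3 - Applying review feedback and revising..."
 - [ ] On timeout, print both the elapsed time and the time limit
 ### 3.2 Result summary
 - [ ] Add the elapsed time to the final result
 - [ ] On FAIL, give a brief summary of "the N main issues raised in the last review"
 - [ ] On ESCALATE, give a one- or two-line summary of why a human needs to look
 - [ ] At the end of a dry run, state "This is a preview. Remove --dry-run to run for real"
 ### 3.3 Auto-escalation guidance
 - [ ] When auto-escalation triggers, explain "N consecutive FAILs → automatic escalation"
 - [ ] Mention in the run help which conditions trigger auto-escalation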
 ---
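 ## 4. First-Use Experience (Onboarding) Improvements
 ### 4.1 Guidance after init
 - [ ] Include a real example in the plan.md template (currently only a minimal structure)
  - One concrete example under "## Functional Requirements"
 - [ ] Add a checklist-writing guide and an example to the checklist.md template
 - [ ] Make the "next steps" guidance after init more specific:
  - Current: "1. Write the plan in plan.md"
  - Improved: "1. Open .cross-eval/plan.md and write the plan (e.g. features to implement, API spec, DB schema)"
 ### 4.2 doctor improvements
 - [ ] When checks pass, show "Ready! Run cross-eval run --plan .cross-eval/plan.md"
 - [ ] On authentication failure, include OS-specific install/auth guide URLs
 ### 4.3 demo improvements
 - [ ] After the demo finishes, add "To get started in a real project:" guidance
 - [ ] In the mock demo, explain what each step does in a comment-like style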
 ---
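 ## 5. Terminology Consistency
 - [ ] Unify the distinction between "agent name" and "agent role"
  - Name: claude-coder, codex-reviewer (actual execution unit)
  - Role: coder, reviewer, senior (logical role)
 - [ ] Unify verdict notation: always uppercase `PASS` / `FAIL` / `ESCALATE`
 - [ ] Sort out "preset" vs "pipeline" terminology
  - Standardize on "pipeline type" for `--preset`
 - [ ] Reduce Korean/English mixing; minimize unnecessary English in Korean mode
  - However, keep verdicts such as PASS/FAIL/ESCALATE in English (readability)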
 ---
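 ## 6. Output Directory Structure Guidance
 - [ ] Print a summary of the output folder structure when a run completes:
  ```
  Output: .cross-eval/output/
    ├── iter-1/          (agent output for each iteration)
    ├── iter-2/
    └── final-report.md  (final report)
  ```
 - [ ] Add a short "how to read this report" note at the top of report.md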
 ---
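 ## 7. config.yaml Comment Improvements
 - [ ] Expand the explanatory comments for each section of the default generated config.yaml
 - [ ] Include examples of common configuration changes as comments
  - Example: "# To use two reviewers: reviewer: [claude, codex]"
  - Example: "# Modify real files in agent mode: agentic: true"
 - [ ] Add commented examples of custom phase-based pipelines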
 ---
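 ## Priorities
 | Priority | Item | Reason |
 |---------|------|------|
 | P0 | 2.1 Missing required input errors | The most frequently hit problem |
 | P0 | 4.1 Guidance after init + templates | Users leave if they get stuck on first use |
 | P0 | 3.1 Startup summary banner | Users need to know what is running |
 | P1 | 2.2 Agent failure messages | Users do not know what to do on failure |
 | P1 | 1.2 run help cleanup | Too many options cause confusion |
 | P1 | 5. Terminology consistency | Reduces confusion |
 | P2 | 3.2~3.3 Result/progress messages | Nice to have but not urgent |
 | P2 | 7. config.yaml comments | Convenience for power users |
 | P2 | 6. Output structure guidance | Understood after seeing it once |
 | P3 | 1.3~1.5 Remaining help text | Incremental improvement |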
 ---
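 ## How to Test
 After modifying each item:
 1. **Check the help**: `cross-eval --help`, `cross-eval run --help`, etc.
 2. **Check the error paths**: run with deliberately bad input → is the error message useful?
 3. **Simulate first use**: the full `init → doctor → demo → run` flow in an empty directory
 4. **Verify with cross-eval itself**: run cross-eval run using this document as plan.md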
--- a/cross_eval/agent.py
+++ b/cross_eval/agent.py
@@ -414,6 +414,7 @@ def invoke_agent_agentic(
    env: Optional[dict[str, str]] = None,
    timeout: int | None = None,
    quiet: bool = False,
+    base_commit: str | None = None,
 ) -> AgentResult:
    """Invoke an agent in agentic mode using the worktree as the source of truth."""
    from cross_eval.worktree import capture_diff
@@ -506,8 +507,8 @@ def invoke_agent_agentic(
            suggested_action=suggested_action,
        )
-    # Capture git diff as the output (changes since last commit on the branch)
+    # Capture git diff as the output (changes since the base commit)
-    diff_output = capture_diff(worktree_path)
+    diff_output = capture_diff(worktree_path, base_commit=base_commit)
    if not diff_output:
        stdout_excerpt = (result.stdout or "").strip()
--- a/cross_eval/pipeline.py
+++ b/cross_eval/pipeline.py
@@ -84,24 +84,25 @@ def _has_agentic_steps(config: PipelineConfig, steps: list[StepConfig]) -> bool:
    )
-def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str]:
+def _setup_worktree(cwd: Path, run_dir: Path, preset_name: str) -> tuple[Path, str, str]:
    """Create a shared worktree for the entire pipeline run.
    1. Generate branch name (cross-eval/<preset>_<timestamp>)
    2. Create branch from HEAD
    3. Create worktree on that branch
-    Returns (worktree_path, branch_name).
+    Returns (worktree_path, branch_name, base_commit).
    """
    from cross_eval.worktree import create_worktree, make_branch_name, make_worktree_dir
    branch_name = make_branch_name(preset_name)
    worktree_dir = make_worktree_dir(cwd, branch_name)
-    worktree_path = create_worktree(
+    worktree_path, base_commit = create_worktree(
        base_cwd=cwd, work_dir=worktree_dir, branch_name=branch_name,
    )
    (run_dir / "worktree_path.txt").write_text(f"{worktree_path}\n", encoding="utf-8")
    (run_dir / "worktree_branch.txt").write_text(f"{branch_name}\n", encoding="utf-8")
-    return worktree_path, branch_name
+    (run_dir / "worktree_base.txt").write_text(f"{base_commit}\n", encoding="utf-8")
+    return worktree_path, branch_name, base_commit
 def _copy_inputs_to_worktree(
@@ -321,10 +322,11 @@ def _run_simple_pipeline(
    # Setup shared worktree for agentic mode
    worktree_path: Path | None = None
    agentic_branch_name: str | None = None
+    agentic_base_commit: str | None = None
    base_repo_state: dict[str, str] | None = None
    base_repo_status: str | None = None
    if not dry_run and _has_agentic_steps(config, config.pipeline):
-        worktree_path, agentic_branch_name = _setup_worktree(
+        worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
            cwd, run_dir, config.preset_name,
        )
        _copy_inputs_to_worktree(config, worktree_path)
@@ -360,6 +362,7 @@ def _run_simple_pipeline(
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=agentic_base_commit,
            )
            # Intermediate commit so next iteration's diff only shows new changes
@@ -498,10 +501,11 @@ def _run_phased_pipeline(
    all_phase_steps = [s for p in config.phases for s in p.steps]
    worktree_path: Path | None = None
    agentic_branch_name: str | None = None
+    agentic_base_commit: str | None = None
    base_repo_state: dict[str, str] | None = None
    base_repo_status: str | None = None
    if not dry_run and _has_agentic_steps(config, all_phase_steps):
-        worktree_path, agentic_branch_name = _setup_worktree(
+        worktree_path, agentic_branch_name, agentic_base_commit = _setup_worktree(
            cwd, run_dir, config.preset_name,
        )
        _copy_inputs_to_worktree(config, worktree_path)
@@ -558,6 +562,7 @@ def _run_phased_pipeline(
                    runtime_env=runtime_env,
                    base_repo_state=base_repo_state,
                    base_repo_status=base_repo_status,
+                    base_commit=agentic_base_commit,
                )
                # Intermediate commit so next iteration's diff only shows new changes
@@ -903,6 +908,7 @@ def _run_steps(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> tuple[dict[str, str], dict[str, AgentResult], str | None]:
    """Execute all steps in one iteration, parallelizing where possible."""
    step_outputs: dict[str, str] = {}
@@ -923,6 +929,7 @@ def _run_steps(
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=base_commit,
            )
        else:
            _execute_parallel_batch(
@@ -934,6 +941,7 @@ def _run_steps(
                runtime_env=runtime_env,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=base_commit,
            )
    # Extract verdict from all verdict steps (ALL must PASS; ESCALATE wins over all)
@@ -961,6 +969,7 @@ def _invoke_agentic(
    env: dict[str, str] | None = None,
    timeout: int | None = None,
    quiet: bool = False,
+    base_commit: str | None = None,
 ) -> AgentResult:
    """Run an agent in agentic mode using an existing worktree."""
    return invoke_agent_agentic(
@@ -968,6 +977,7 @@ def _invoke_agentic(
        worktree_path=worktree_path,
        env=env,
        timeout=timeout, quiet=quiet,
+        base_commit=base_commit,
    )
@@ -992,6 +1002,7 @@ def _execute_step(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> None:
    """Execute a single step, updating step_outputs and step_results in place."""
    if not quiet:
@@ -1035,6 +1046,7 @@ def _execute_step(
                worktree_path=worktree_path,
                env=runtime_env,
                timeout=timeout, quiet=quiet,
+                base_commit=base_commit,
            )
        else:
            # When worktree exists, run non-agentic agents (reviewers) in
@@ -1125,6 +1137,7 @@ def _execute_parallel_batch(
    runtime_env: dict[str, str] | None = None,
    base_repo_state: dict[str, str] | None = None,
    base_repo_status: str | None = None,
+    base_commit: str | None = None,
 ) -> None:
    """Execute multiple steps in parallel using threads."""
    agent_names = ", ".join(s.agent for s in batch)
@@ -1139,6 +1152,7 @@ def _execute_parallel_batch(
                run_dir=run_dir, output_iter=output_iter, phase_name=phase_name,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=base_commit,
            )
        return
@@ -1161,6 +1175,7 @@ def _execute_parallel_batch(
                phase_name=phase_name, worktree_path=worktree_path,
                base_repo_state=base_repo_state,
                base_repo_status=base_repo_status,
+                base_commit=base_commit,
            )
        return
@@ -1204,6 +1219,7 @@ def _execute_parallel_batch(
                worktree_path=worktree_path,
                env=runtime_env,
                timeout=timeout, quiet=True,
+                base_commit=base_commit,
            )
        else:
            effective_cwd = worktree_path if worktree_path else cwd
--- a/cross_eval/worktree.py
+++ b/cross_eval/worktree.py
@@ -37,18 +37,31 @@ def make_worktree_dir(base_cwd: Path, branch_name: str) -> Path:
    )
-def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
+def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> tuple[Path, str]:
    """Create a git worktree on a new branch from HEAD.
    1. Create branch from HEAD
    2. Create worktree checked out to that branch
    The branch lives in the original repo, so it survives worktree removal.
+    Returns (worktree_path, base_commit_sha).
    """
    work_dir = work_dir.resolve()
    if work_dir.exists():
        shutil.rmtree(work_dir)
+    # Record the base commit SHA before creating the branch.
+    # This is the anchor for all diffs — even if the agent makes its own commits,
+    # we always diff against this base to capture the full set of changes.
+    result = subprocess.run(
+        ["git", "rev-parse", "HEAD"],
+        cwd=base_cwd,
+        capture_output=True,
+        text=True,
+        check=True,
+    )
+    base_commit = result.stdout.strip()
    # Create the branch at HEAD
    try:
        subprocess.run(
@@ -83,15 +96,24 @@ def create_worktree(base_cwd: Path, work_dir: Path, branch_name: str) -> Path:
            f"Failed to create worktree at {work_dir}: {e.stderr.strip()}"
        ) from e
-    logger.debug("Created worktree on branch '%s': %s", branch_name, work_dir)
+    logger.debug("Created worktree on branch '%s': %s (base: %s)", branch_name, work_dir, base_commit[:8])
-    return work_dir
+    return work_dir, base_commit
-def capture_diff(worktree_path: Path) -> str:
+def capture_diff(worktree_path: Path, base_commit: str | None = None) -> str:
    """Capture all changes made in the worktree as a unified diff.
-    Includes both tracked modifications and new untracked files.
+    Includes both tracked modifications, new untracked files, and changes
+    that the agent may have committed on its own.
+    Args:
+        base_commit: The commit SHA from when the worktree was created.
+                     If provided, diffs against this fixed base instead of HEAD.
+                     This is critical because agents (e.g. Claude in interactive
+                     mode) may create their own commits, advancing HEAD and
+                     making ``git diff --cached HEAD`` return empty.
    """
    # Stage any uncommitted changes so they're included in the diff
    subprocess.run(
        ["git", "add", "-A"],
        cwd=worktree_path,
@@ -99,6 +121,30 @@ def capture_diff(worktree_path: Path) -> str:
        check=True,
    )
+    if base_commit:
+        # Diff everything (committed + staged) against the original base.
+        # This captures changes regardless of whether the agent committed them.
+        result = subprocess.run(
+            ["git", "diff", base_commit, "--cached"],
+            cwd=worktree_path,
+            capture_output=True,
+            text=True,
+        )
+        diff = result.stdout.strip()
+        if diff:
+            return diff
+        # Also check committed changes (agent may have committed and left
+        # nothing staged)
+        result = subprocess.run(
+            ["git", "diff", base_commit, "HEAD"],
+            cwd=worktree_path,
+            capture_output=True,
+            text=True,
+        )
+        return result.stdout.strip()
+    # Fallback: no base_commit, use original behavior
    result = subprocess.run(
        ["git", "diff", "--cached", "HEAD"],
        cwd=worktree_path,
--- a/tests/test_agentic.py
+++ b/tests/test_agentic.py
@@ -76,10 +76,12 @@ class TestCreateWorktree(unittest.TestCase):
            wt_dir = Path(td) / "wt"
            branch = "cross-eval/test_branch"
-            result_path = create_worktree(base, wt_dir, branch)
+            result_path, base_commit = create_worktree(base, wt_dir, branch)
            # Worktree directory exists
            self.assertTrue(result_path.exists())
+            # Base commit SHA was captured
+            self.assertEqual(len(base_commit), 40)
            # Branch was created in the original repo
            branches = subprocess.run(
                ["git", "branch", "--list", branch],
@@ -102,7 +104,7 @@ class TestCaptureDiff(unittest.TestCase):
            wt_dir = Path(td) / "wt"
            branch = "cross-eval/diff_test"
-            create_worktree(base, wt_dir, branch)
+            create_worktree(base, wt_dir, branch)  # ignore return tuple
            # Make changes in the worktree
            (wt_dir / "new_file.txt").write_text("hello\n")
@@ -553,7 +555,7 @@ class TestSetupWorktreeCalledForAgentic(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -582,7 +584,7 @@ class TestSetupWorktreeLocation(unittest.TestCase):
            run_dir.mkdir(parents=True)
            _init_git_repo(base)
-            worktree_path, branch_name = _setup_worktree(base, run_dir, "review-fix")
+            worktree_path, branch_name, _base_commit = _setup_worktree(base, run_dir, "review-fix")
            try:
                self.assertTrue(worktree_path.exists())
                self.assertNotIn(str(base.resolve()), str(worktree_path.resolve()))
@@ -620,7 +622,7 @@ class TestReviewerRunsInWorktreeCwd(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -662,7 +664,7 @@ class TestCommitIterationCalled(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -704,7 +706,7 @@ class TestFinalizeWorktreeCalled(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            mock_invoke_agentic.return_value = AgentResult(
                output="diff output", exit_code=0,
@@ -822,7 +824,7 @@ class TestParallelAgenticFallsBackToSequential(unittest.TestCase):
            wt_path = run_dir / "work"
            wt_path.mkdir()
-            mock_setup.return_value = (wt_path, "cross-eval/test")
+            mock_setup.return_value = (wt_path, "cross-eval/test", "a" * 40)
            call_order: list[str] = []
--- a/tests/test_runtime_misc.py
+++ b/tests/test_runtime_misc.py
@@ -775,11 +775,18 @@ class TestRuntimeEnvironmentHelpers(unittest.TestCase):
 class TestWorktreeFailures(unittest.TestCase):
    @patch("cross_eval.worktree.subprocess.run")
    def test_create_worktree_raises_when_branch_creation_fails(self, mock_run: MagicMock) -> None:
-        mock_run.side_effect = subprocess.CalledProcessError(
-            1,
-            ["git", "branch"],
-            stderr="branch failed",
-        )
+        # First call: git rev-parse HEAD (succeeds)
+        # Second call: git branch (fails)
+        rev_parse_result = MagicMock(returncode=0)
+        rev_parse_result.stdout = "a" * 40
+        mock_run.side_effect = [
+            rev_parse_result,
+            subprocess.CalledProcessError(
+                1,
+                ["git", "branch"],
+                stderr="branch failed",
+            ),
+        ]
        with tempfile.TemporaryDirectory() as tmpdir:
            base = Path(tmpdir)
@@ -791,14 +798,17 @@ class TestWorktreeFailures(unittest.TestCase):
    @patch("cross_eval.worktree.subprocess.run")
    def test_create_worktree_cleans_branch_on_worktree_failure(self, mock_run: MagicMock) -> None:
+        rev_parse_result = MagicMock(returncode=0)
+        rev_parse_result.stdout = "a" * 40
        mock_run.side_effect = [
-            MagicMock(returncode=0),
+            rev_parse_result,           # git rev-parse HEAD
+            MagicMock(returncode=0),    # git branch
            subprocess.CalledProcessError(
                1,
                ["git", "worktree", "add"],
                stderr="worktree failed",
            ),
-            MagicMock(returncode=0),
+            MagicMock(returncode=0),    # git branch -D (cleanup)
        ]
        with tempfile.TemporaryDirectory() as tmpdir: