feat(my-deepagent): v0.3 PR #2 — context compaction (auto + manual /compact)

Claude Code의 auto-compact + `/compact` 슬래시 등가.

핵심 동작:
- 세션 누적 토큰 (`total_input_tokens + total_output_tokens`)이 활성 모델
  컨텍스트 윈도우의 70%를 넘으면 자동으로 가장 오래된 비-system / 비-archived
  메시지를 cheap 모델 (`openrouter:deepseek/deepseek-chat` 기본)로 1회 요약 →
  `MessageRow(is_summary=True, role=system)` 1줄 삽입 + 원본은 `archived=True`
  + negative seq band (-(original.seq + 1))으로 옮김.
- LangGraph thread는 `thread_suffix` bump로 새 컨텍스트 시작 (재인입 비용 회피).
  세션 자체는 살아있음 — `sessions show <id> --all`로 archived 메시지 조회 가능.
- 수동 `/compact` 슬래시도 동일 함수 호출.  메시지가 부족하면 (`< MIN_COMPACTABLE`)
  사유 출력하고 no-op.

데이터·라이브러리:
- `monitoring/token_budget.py` (신규): `tiktoken cl100k_base`로 추정 (DeepSeek/
  Anthropic 모델 정확한 토크나이저가 없으므로 보수적 over-count).
  `MODEL_CONTEXT_LIMITS` (DeepSeek 64k, Claude Sonnet/Haiku/Opus 200k, GPT-4o
  128k), 미등록 모델은 32k 기본값.  `COMPACTION_THRESHOLD = 0.7`.
- `compaction.py` (신규): `should_compact()` / `compact_session()` /
  `CompactionResult`.  `_SESSION_LOCKS: dict[str, asyncio.Lock]` 세션별 직렬화 —
  동시 compaction은 두 번째가 첫 번째를 기다림.  `KEEP_RECENT_K = 10`,
  `MIN_COMPACTABLE = 4`.  LLM 호출은 DB session 바깥 (asyncpg connection
  점유 회피).
- `pyproject.toml`: `tiktoken>=0.7` 명시 (이전엔 langchain-openai 경유 transitive).

REPL 통합 (`cli/interactive.py`):
- `_approx_token_count`를 tiktoken-based로 교체.
- 매 ainvoke 후 `should_compact(session_row)` → 임계 초과 시 자동
  `compact_session()` → 성공 시 `clear_agent_cache()`로 thread bump + 한 줄 알림.
- `/compact` 슬래시 등록 (`_register_compaction_slash`).

테스트 (`tests/integration/test_compaction.py`, 7 케이스):
1. `should_compact` 70% 임계 아래/위/미등록 모델 (3개)
2. `MIN_COMPACTABLE` 미만 → LLM 호출 없이 거부
3. Happy path: 14 메시지 → 4 archive(negative seq) + summary at seq=1 + 10 live
   유지 + 토큰 카운터 산술 검증
4. 동일 session_id 동시 호출 2개 → Lock 직렬화 검증
5. 없는 session_id → `session_not_found`

게이트:
- ruff check / format --check / mypy: PASS
- pytest -q --ignore=tests/integration/test_e2e_workflow.py
  --ignore=tests/integration/test_openrouter_smoke.py: 611 passed (7 신규 포함)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
chungyeong
2026-05-17 20:28:11 +09:00
parent f8335e4515
commit f78b26dc69
7 changed files with 812 additions and 8 deletions

View File

@@ -33,11 +33,13 @@ from sqlalchemy import desc, select
from ..audit import make_audit_recorder
from ..budget import make_budget_tracker_from_config
from ..compaction import compact_session, should_compact
from ..config import Config, load_config
from ..governance import require_consent
from ..middleware.audit import AuditToolMiddleware
from ..middleware.cost import CostMiddleware
from ..monitoring.pricing import ModelPrice, PricingCache
from ..monitoring.token_budget import count_tokens
from ..persistence.checkpointer import get_checkpointer_ctx
from ..persistence.db import Database
from ..persistence.models import InteractiveSessionRow, MessageRow
@@ -462,10 +464,29 @@ def _register_telemetry_slash(reg: SlashRegistry) -> None:
reg.register("sessions", _sessions, help="list recent interactive sessions")
def _register_compaction_slash(reg: SlashRegistry, sess: InteractiveSession) -> None:
"""Register /compact slash handler (v0.3 PR #2)."""
async def _compact(_: SlashParsed) -> bool:
result = await compact_session(sess.db, sess.config, str(sess.session_id))
if result.compacted:
sess.clear_agent_cache()
_CONSOLE.print(
f"[green]compacted[/] — {result.archived} messages archived, "
f"summary {result.summary_tokens} tokens (new thread started)"
)
else:
_CONSOLE.print(f"[yellow]compaction skipped:[/] {result.reason}")
return False
reg.register("compact", _compact, help="manually compact the conversation history")
def _register_slash(reg: SlashRegistry, sess: InteractiveSession) -> None:
_register_navigation_slash(reg, sess)
_register_persona_slash(reg, sess)
_register_telemetry_slash(reg)
_register_compaction_slash(reg, sess)
def _completer(personas: list[Persona], slash_names: list[str]) -> WordCompleter:
@@ -474,14 +495,14 @@ def _completer(personas: list[Persona], slash_names: list[str]) -> WordCompleter
return WordCompleter(words, ignore_case=True, sentence=True)
def _approx_token_count(text: str) -> int:
"""Conservative char-based token estimate (PR #1 placeholder).
def _approx_token_count(text: str, model: str = "") -> int:
"""Token count via tiktoken (PR #2).
PR #2 swaps this for tiktoken with model-aware tokenizer selection.
1 token ≈ 4 chars is the cl100k_base rule of thumb for English; mixed
Korean text trends higher tokens/char, so we round up.
Falls back to a char-based heuristic inside `count_tokens` on tiktoken
failure. Caller passes the active model so future model-specific
tokenizers slot in without changing the call site.
"""
return max(0, (len(text) + 3) // 4)
return count_tokens(text, model)
async def _invoke_and_stream(
@@ -496,7 +517,7 @@ async def _invoke_and_stream(
sess.session_id,
"user",
user_text,
token_count=_approx_token_count(user_text),
token_count=_approx_token_count(user_text, sess.active_model),
)
# 2. Invoke the agent. LangGraph thread_id includes the suffix so /model
@@ -528,9 +549,23 @@ async def _invoke_and_stream(
sess.session_id,
"assistant",
content_str,
token_count=_approx_token_count(content_str),
token_count=_approx_token_count(content_str, sess.active_model),
)
# 4. Auto-compaction check. Triggered when total used tokens cross 70%
# of the active model's context window. Holds a per-session lock so
# concurrent turns serialise; failure is non-fatal (next turn retries).
async with sess.db.session() as s:
session_row = await s.get(InteractiveSessionRow, str(sess.session_id))
if session_row is not None and should_compact(session_row):
result = await compact_session(sess.db, sess.config, str(sess.session_id))
if result.compacted:
sess.clear_agent_cache() # bumps thread_suffix → fresh deepagents thread
_CONSOLE.print(
f"[dim]context compacted — {result.archived} messages archived, "
f"summary {result.summary_tokens} tokens, new thread[/]"
)
async def _repl_loop(
sess: InteractiveSession,

View File

@@ -0,0 +1,293 @@
"""Context compaction (v0.3 PR #2).
When `total_input_tokens + total_output_tokens` reaches ~70% of the active
model's context window, we summarise the oldest non-system, non-archived
messages with a cheap model and insert a single `MessageRow(is_summary=True)`
in their place. Original messages are marked `archived=True` (they stay in
the DB and can be inspected via `sessions show <id> --all`).
The session's LangGraph thread is also rolled forward (caller bumps
`InteractiveSession._thread_suffix`) so the next `ainvoke` starts a clean
state with only the summary + recent K messages.
"""
from __future__ import annotations
import asyncio
import logging
from datetime import UTC, datetime
from langchain_openai import ChatOpenAI
from sqlalchemy import select, update
from sqlalchemy.ext.asyncio import AsyncSession
from .config import Config
from .monitoring.token_budget import (
count_tokens,
is_over_threshold,
model_context_limit,
)
from .persistence.db import Database
from .persistence.models import InteractiveSessionRow, MessageRow
from .secrets import resolve_openrouter_api_key
_LOG = logging.getLogger(__name__)
#: Number of recent non-archived messages kept verbatim during compaction.
KEEP_RECENT_K = 10
#: Minimum number of compactable messages required for `/compact`.
MIN_COMPACTABLE = 4
#: One concurrent compaction per session_id.
_SESSION_LOCKS: dict[str, asyncio.Lock] = {}
def _now_iso() -> str:
return datetime.now(UTC).isoformat(timespec="seconds")
def _session_lock(session_id: str) -> asyncio.Lock:
lock = _SESSION_LOCKS.get(session_id)
if lock is None:
lock = asyncio.Lock()
_SESSION_LOCKS[session_id] = lock
return lock
class CompactionResult:
"""Outcome of a compaction call. Read-only by convention."""
def __init__(
self,
*,
compacted: bool,
archived: int = 0,
summary_tokens: int = 0,
reason: str = "",
) -> None:
self.compacted = compacted
self.archived = archived
self.summary_tokens = summary_tokens
self.reason = reason
def __repr__(self) -> str:
return (
f"<CompactionResult compacted={self.compacted} archived={self.archived} "
f"summary_tokens={self.summary_tokens} reason={self.reason!r}>"
)
def should_compact(session_row: InteractiveSessionRow) -> bool:
"""True when total used tokens >= 70% of the active model's window."""
used = (session_row.total_input_tokens or 0) + (session_row.total_output_tokens or 0)
model = session_row.model or ""
return is_over_threshold(used, model)
async def _collect_messages_for_compaction(
s: AsyncSession, session_id: str
) -> tuple[list[MessageRow], list[MessageRow]]:
"""Return (to_compact, to_keep) for the session.
Strategy:
- Skip all `is_summary` and `archived` rows.
- Skip `role=system` rows (they are MYDEEPAGENT.md / memory / skills
injections — owned by the next ainvoke, not by compaction).
- Keep the last KEEP_RECENT_K non-system, non-archived, non-summary
messages verbatim.
- Everything older than that is to_compact.
"""
rows = (
(
await s.execute(
select(MessageRow)
.where(MessageRow.session_id == session_id)
.where(MessageRow.archived.is_(False))
.where(MessageRow.is_summary.is_(False))
.where(MessageRow.role != "system")
.order_by(MessageRow.seq)
)
)
.scalars()
.all()
)
if len(rows) <= KEEP_RECENT_K:
return ([], list(rows))
cutoff = len(rows) - KEEP_RECENT_K
return (list(rows[:cutoff]), list(rows[cutoff:]))
def _format_for_summary(messages: list[MessageRow]) -> str:
"""Render messages as a compact transcript the summariser can consume."""
lines: list[str] = []
for m in messages:
role = m.role.upper()
content = (m.content or "").strip()
lines.append(f"[{m.seq}] {role}: {content}")
return "\n\n".join(lines)
async def _run_summary_llm(config: Config, model: str, transcript: str) -> str:
"""Single LLM call producing a compact paragraph-level summary."""
api_key = resolve_openrouter_api_key(config)
llm = ChatOpenAI(
model=model.removeprefix("openrouter:"),
base_url=config.openrouter_base_url,
api_key=api_key,
timeout=60.0,
max_completion_tokens=600,
)
system_prompt = (
"You are compressing an interactive conversation history for a developer "
"agent. Produce a single concise summary (Korean if the conversation is "
"Korean, otherwise English) that captures: (1) the user's intent and "
"any key decisions, (2) artefacts mentioned by name/path, (3) any "
"open questions or pending follow-ups. Aim for ≤ 300 tokens. Do NOT "
"invent details that are not in the transcript."
)
response = await llm.ainvoke(
[
{"role": "system", "content": system_prompt},
{"role": "user", "content": f"Transcript:\n\n{transcript}"},
]
)
content = response.content
if isinstance(content, list):
content = "\n".join(
(c.get("text", str(c)) if isinstance(c, dict) else str(c)) for c in content
)
return str(content).strip()
async def compact_session(
db: Database,
config: Config,
session_id: str,
*,
summary_model: str | None = None,
) -> CompactionResult:
"""Compact the oldest portion of a session's history. Idempotent under lock.
The caller is responsible for bumping the LangGraph thread_id (e.g.
`InteractiveSession.clear_agent_cache()` style) AFTER this returns so the
next ainvoke starts fresh.
"""
lock = _session_lock(session_id)
async with lock:
async with db.session() as s:
row = await s.get(InteractiveSessionRow, session_id)
if row is None:
return CompactionResult(compacted=False, reason="session_not_found")
to_compact, _to_keep = await _collect_messages_for_compaction(s, session_id)
if len(to_compact) < MIN_COMPACTABLE:
return CompactionResult(
compacted=False,
reason=f"insufficient_messages ({len(to_compact)} < {MIN_COMPACTABLE})",
)
transcript = _format_for_summary(to_compact)
active_model = row.model or ""
model_for_summary = summary_model or "openrouter:deepseek/deepseek-chat"
# The LLM call lives OUTSIDE the DB session — it can take seconds and
# we don't want to hold an asyncpg connection during the wait.
try:
summary_text = await _run_summary_llm(config, model_for_summary, transcript)
except Exception as e:
_LOG.exception("compaction summariser failed for session %s", session_id)
return CompactionResult(compacted=False, reason=f"summariser_failed:{e}")
if not summary_text:
return CompactionResult(compacted=False, reason="empty_summary")
summary_tokens = count_tokens(summary_text, active_model)
async with db.session() as s:
# Re-fetch under fresh session in case state changed during LLM call.
row = await s.get(InteractiveSessionRow, session_id)
if row is None:
return CompactionResult(compacted=False, reason="session_disappeared")
# Insert the summary as a new MessageRow before the kept tail.
# - seq: 1 + max(seq of to_compact) → places summary just before
# the first kept message in the ordering, which is what readers
# want for "render history in seq order".
# - role: "system" + is_summary=True so the next ainvoke includes
# it and the GUI can render it distinctly.
#
# Then archive the originals in one UPDATE.
archive_ids = [m.id for m in to_compact]
archived_token_total = sum(int(m.token_count or 0) for m in to_compact)
now = _now_iso()
# Summary seq = smallest archived seq. We can't pick a fractional
# position (INTEGER seq) and shifting every kept row by +1 is too
# expensive. Instead we archive originals into the negative-seq
# band (see below) which frees their positive seqs; then the
# summary at to_compact[0].seq slots naturally in front of the
# kept tail when readers `ORDER BY seq ASC WHERE archived=False`.
summary_seq = to_compact[0].seq
s.add(
MessageRow(
session_id=session_id,
seq=summary_seq,
role="system",
content=summary_text,
tool_calls=None,
token_count=summary_tokens,
is_summary=True,
archived=False,
ts=now,
)
)
# Mark originals archived — bump their seq into a high band so the
# UNIQUE (session_id, seq) constraint doesn't trip with the new
# summary at summary_seq. We shift archived rows to negative seq
# space (id-based) to keep them addressable but out of the way.
#
# Postgres lets us do this in one UPDATE with FROM clause; SQLite
# doesn't, so do it per-row for portability.
for original in to_compact:
# Use a negative offset based on original.seq so they remain
# ordered relative to each other.
new_seq = -(original.seq + 1)
await s.execute(
update(MessageRow)
.where(MessageRow.id == original.id)
.values(archived=True, seq=new_seq)
)
# Update aggregate token counters on the session row. Subtract
# archived contribution and add the summary tokens (assigned to
# system → input bucket).
row.total_input_tokens = max(
0,
int(row.total_input_tokens or 0) - archived_token_total + summary_tokens,
)
row.last_message_at = now
await s.commit()
_LOG.info(
"compacted session=%s archived=%d summary_tokens=%d",
session_id,
len(archive_ids),
summary_tokens,
)
return CompactionResult(
compacted=True,
archived=len(archive_ids),
summary_tokens=summary_tokens,
reason="ok",
)
def context_usage_fraction(session_row: InteractiveSessionRow) -> float:
"""Return current used_tokens / context_limit (0.0 if no model set)."""
used = (session_row.total_input_tokens or 0) + (session_row.total_output_tokens or 0)
limit = model_context_limit(session_row.model or "")
if limit <= 0:
return 0.0
return used / limit

View File

@@ -0,0 +1,86 @@
"""Token counting + context-window lookup (v0.3 PR #2).
We can't rely on `LlmCallRow.input_tokens` / `output_tokens` because OpenRouter
sometimes forwards responses with empty `usage_metadata` (v0.1 known limit).
Instead we use `tiktoken` with `cl100k_base` as a conservative estimator —
it overcounts slightly for Korean text (which is what we want for a
compaction threshold check).
"""
from __future__ import annotations
import functools
from typing import Final
import tiktoken
# ---------------------------------------------------------------------------
# Model context windows. Keys match the form persisted in
# `InteractiveSessionRow.model` (e.g. "openrouter:deepseek/deepseek-chat").
# When a model is not in the table, we fall back to `_DEFAULT_LIMIT` and log
# nothing — compaction triggers conservatively at 70%.
# ---------------------------------------------------------------------------
_DEFAULT_LIMIT: Final[int] = 32_000
MODEL_CONTEXT_LIMITS: Final[dict[str, int]] = {
# OpenRouter — DeepSeek
"openrouter:deepseek/deepseek-chat": 64_000,
"deepseek/deepseek-chat": 64_000,
# OpenRouter — Anthropic
"openrouter:anthropic/claude-sonnet-4-6": 200_000,
"openrouter:anthropic/claude-haiku-4-5": 200_000,
"openrouter:anthropic/claude-opus-4-1": 200_000,
"anthropic/claude-sonnet-4-6": 200_000,
"anthropic/claude-haiku-4-5": 200_000,
"anthropic/claude-opus-4-1": 200_000,
# OpenRouter — OpenAI (for parity testing)
"openrouter:openai/gpt-4o": 128_000,
"openrouter:openai/gpt-4o-mini": 128_000,
}
# Compaction triggers when used tokens cross this fraction of the window.
COMPACTION_THRESHOLD: Final[float] = 0.7
def model_context_limit(model: str) -> int:
"""Return the context window size (tokens) for the given model id.
Unknown models return `_DEFAULT_LIMIT` (32 000) so compaction still kicks
in eventually — better than letting context grow unbounded.
"""
return MODEL_CONTEXT_LIMITS.get(model, _DEFAULT_LIMIT)
@functools.lru_cache(maxsize=8)
def _encoding_for_model(model: str) -> tiktoken.Encoding:
"""Best-effort tokenizer per model id.
tiktoken doesn't ship encodings for DeepSeek or recent Anthropic models.
`cl100k_base` (GPT-4 / GPT-3.5 turbo) is a reasonable upper-bound
estimator across modern models: it overcounts Korean by ~1.2x but that
only makes the 70% threshold trigger slightly earlier, which is the
conservative direction we want.
"""
return tiktoken.get_encoding("cl100k_base")
def count_tokens(text: str, model: str = "") -> int:
"""Return tiktoken-estimated token count for `text` under `model`.
Always returns >= 0. Empty string → 0. An exception inside tiktoken
falls back to a char-based heuristic (1 token ≈ 4 chars) so callers
don't have to guard.
"""
if not text:
return 0
try:
enc = _encoding_for_model(model)
return len(enc.encode(text, disallowed_special=()))
except Exception:
return max(1, (len(text) + 3) // 4)
def is_over_threshold(used_tokens: int, model: str) -> bool:
"""True if `used_tokens` >= context_limit(model) * COMPACTION_THRESHOLD."""
return used_tokens >= int(model_context_limit(model) * COMPACTION_THRESHOLD)