feat(my-deepagent): v0.3 PR #2 — context compaction (auto + manual /compact)
Claude Code의 auto-compact + `/compact` 슬래시 등가. 핵심 동작: - 세션 누적 토큰 (`total_input_tokens + total_output_tokens`)이 활성 모델 컨텍스트 윈도우의 70%를 넘으면 자동으로 가장 오래된 비-system / 비-archived 메시지를 cheap 모델 (`openrouter:deepseek/deepseek-chat` 기본)로 1회 요약 → `MessageRow(is_summary=True, role=system)` 1줄 삽입 + 원본은 `archived=True` + negative seq band (-(original.seq + 1))으로 옮김. - LangGraph thread는 `thread_suffix` bump로 새 컨텍스트 시작 (재인입 비용 회피). 세션 자체는 살아있음 — `sessions show <id> --all`로 archived 메시지 조회 가능. - 수동 `/compact` 슬래시도 동일 함수 호출. 메시지가 부족하면 (`< MIN_COMPACTABLE`) 사유 출력하고 no-op. 데이터·라이브러리: - `monitoring/token_budget.py` (신규): `tiktoken cl100k_base`로 추정 (DeepSeek/ Anthropic 모델 정확한 토크나이저가 없으므로 보수적 over-count). `MODEL_CONTEXT_LIMITS` (DeepSeek 64k, Claude Sonnet/Haiku/Opus 200k, GPT-4o 128k), 미등록 모델은 32k 기본값. `COMPACTION_THRESHOLD = 0.7`. - `compaction.py` (신규): `should_compact()` / `compact_session()` / `CompactionResult`. `_SESSION_LOCKS: dict[str, asyncio.Lock]` 세션별 직렬화 — 동시 compaction은 두 번째가 첫 번째를 기다림. `KEEP_RECENT_K = 10`, `MIN_COMPACTABLE = 4`. LLM 호출은 DB session 바깥 (asyncpg connection 점유 회피). - `pyproject.toml`: `tiktoken>=0.7` 명시 (이전엔 langchain-openai 경유 transitive). REPL 통합 (`cli/interactive.py`): - `_approx_token_count`를 tiktoken-based로 교체. - 매 ainvoke 후 `should_compact(session_row)` → 임계 초과 시 자동 `compact_session()` → 성공 시 `clear_agent_cache()`로 thread bump + 한 줄 알림. - `/compact` 슬래시 등록 (`_register_compaction_slash`). 테스트 (`tests/integration/test_compaction.py`, 7 케이스): 1. `should_compact` 70% 임계 아래/위/미등록 모델 (3개) 2. `MIN_COMPACTABLE` 미만 → LLM 호출 없이 거부 3. Happy path: 14 메시지 → 4 archive(negative seq) + summary at seq=1 + 10 live 유지 + 토큰 카운터 산술 검증 4. 동일 session_id 동시 호출 2개 → Lock 직렬화 검증 5. 없는 session_id → `session_not_found` 게이트: - ruff check / format --check / mypy: PASS - pytest -q --ignore=tests/integration/test_e2e_workflow.py --ignore=tests/integration/test_openrouter_smoke.py: 611 passed (7 신규 포함) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
293
my-deepagent/src/my_deepagent/compaction.py
Normal file
293
my-deepagent/src/my_deepagent/compaction.py
Normal file
@@ -0,0 +1,293 @@
|
||||
"""Context compaction (v0.3 PR #2).
|
||||
|
||||
When `total_input_tokens + total_output_tokens` reaches ~70% of the active
|
||||
model's context window, we summarise the oldest non-system, non-archived
|
||||
messages with a cheap model and insert a single `MessageRow(is_summary=True)`
|
||||
in their place. Original messages are marked `archived=True` (they stay in
|
||||
the DB and can be inspected via `sessions show <id> --all`).
|
||||
|
||||
The session's LangGraph thread is also rolled forward (caller bumps
|
||||
`InteractiveSession._thread_suffix`) so the next `ainvoke` starts a clean
|
||||
state with only the summary + recent K messages.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
from datetime import UTC, datetime
|
||||
|
||||
from langchain_openai import ChatOpenAI
|
||||
from sqlalchemy import select, update
|
||||
from sqlalchemy.ext.asyncio import AsyncSession
|
||||
|
||||
from .config import Config
|
||||
from .monitoring.token_budget import (
|
||||
count_tokens,
|
||||
is_over_threshold,
|
||||
model_context_limit,
|
||||
)
|
||||
from .persistence.db import Database
|
||||
from .persistence.models import InteractiveSessionRow, MessageRow
|
||||
from .secrets import resolve_openrouter_api_key
|
||||
|
||||
_LOG = logging.getLogger(__name__)
|
||||
|
||||
#: Number of recent non-archived messages kept verbatim during compaction.
|
||||
KEEP_RECENT_K = 10
|
||||
|
||||
#: Minimum number of compactable messages required for `/compact`.
|
||||
MIN_COMPACTABLE = 4
|
||||
|
||||
#: One concurrent compaction per session_id.
|
||||
_SESSION_LOCKS: dict[str, asyncio.Lock] = {}
|
||||
|
||||
|
||||
def _now_iso() -> str:
|
||||
return datetime.now(UTC).isoformat(timespec="seconds")
|
||||
|
||||
|
||||
def _session_lock(session_id: str) -> asyncio.Lock:
|
||||
lock = _SESSION_LOCKS.get(session_id)
|
||||
if lock is None:
|
||||
lock = asyncio.Lock()
|
||||
_SESSION_LOCKS[session_id] = lock
|
||||
return lock
|
||||
|
||||
|
||||
class CompactionResult:
|
||||
"""Outcome of a compaction call. Read-only by convention."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
compacted: bool,
|
||||
archived: int = 0,
|
||||
summary_tokens: int = 0,
|
||||
reason: str = "",
|
||||
) -> None:
|
||||
self.compacted = compacted
|
||||
self.archived = archived
|
||||
self.summary_tokens = summary_tokens
|
||||
self.reason = reason
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return (
|
||||
f"<CompactionResult compacted={self.compacted} archived={self.archived} "
|
||||
f"summary_tokens={self.summary_tokens} reason={self.reason!r}>"
|
||||
)
|
||||
|
||||
|
||||
def should_compact(session_row: InteractiveSessionRow) -> bool:
|
||||
"""True when total used tokens >= 70% of the active model's window."""
|
||||
used = (session_row.total_input_tokens or 0) + (session_row.total_output_tokens or 0)
|
||||
model = session_row.model or ""
|
||||
return is_over_threshold(used, model)
|
||||
|
||||
|
||||
async def _collect_messages_for_compaction(
|
||||
s: AsyncSession, session_id: str
|
||||
) -> tuple[list[MessageRow], list[MessageRow]]:
|
||||
"""Return (to_compact, to_keep) for the session.
|
||||
|
||||
Strategy:
|
||||
- Skip all `is_summary` and `archived` rows.
|
||||
- Skip `role=system` rows (they are MYDEEPAGENT.md / memory / skills
|
||||
injections — owned by the next ainvoke, not by compaction).
|
||||
- Keep the last KEEP_RECENT_K non-system, non-archived, non-summary
|
||||
messages verbatim.
|
||||
- Everything older than that is to_compact.
|
||||
"""
|
||||
rows = (
|
||||
(
|
||||
await s.execute(
|
||||
select(MessageRow)
|
||||
.where(MessageRow.session_id == session_id)
|
||||
.where(MessageRow.archived.is_(False))
|
||||
.where(MessageRow.is_summary.is_(False))
|
||||
.where(MessageRow.role != "system")
|
||||
.order_by(MessageRow.seq)
|
||||
)
|
||||
)
|
||||
.scalars()
|
||||
.all()
|
||||
)
|
||||
if len(rows) <= KEEP_RECENT_K:
|
||||
return ([], list(rows))
|
||||
cutoff = len(rows) - KEEP_RECENT_K
|
||||
return (list(rows[:cutoff]), list(rows[cutoff:]))
|
||||
|
||||
|
||||
def _format_for_summary(messages: list[MessageRow]) -> str:
|
||||
"""Render messages as a compact transcript the summariser can consume."""
|
||||
lines: list[str] = []
|
||||
for m in messages:
|
||||
role = m.role.upper()
|
||||
content = (m.content or "").strip()
|
||||
lines.append(f"[{m.seq}] {role}: {content}")
|
||||
return "\n\n".join(lines)
|
||||
|
||||
|
||||
async def _run_summary_llm(config: Config, model: str, transcript: str) -> str:
|
||||
"""Single LLM call producing a compact paragraph-level summary."""
|
||||
api_key = resolve_openrouter_api_key(config)
|
||||
llm = ChatOpenAI(
|
||||
model=model.removeprefix("openrouter:"),
|
||||
base_url=config.openrouter_base_url,
|
||||
api_key=api_key,
|
||||
timeout=60.0,
|
||||
max_completion_tokens=600,
|
||||
)
|
||||
system_prompt = (
|
||||
"You are compressing an interactive conversation history for a developer "
|
||||
"agent. Produce a single concise summary (Korean if the conversation is "
|
||||
"Korean, otherwise English) that captures: (1) the user's intent and "
|
||||
"any key decisions, (2) artefacts mentioned by name/path, (3) any "
|
||||
"open questions or pending follow-ups. Aim for ≤ 300 tokens. Do NOT "
|
||||
"invent details that are not in the transcript."
|
||||
)
|
||||
response = await llm.ainvoke(
|
||||
[
|
||||
{"role": "system", "content": system_prompt},
|
||||
{"role": "user", "content": f"Transcript:\n\n{transcript}"},
|
||||
]
|
||||
)
|
||||
content = response.content
|
||||
if isinstance(content, list):
|
||||
content = "\n".join(
|
||||
(c.get("text", str(c)) if isinstance(c, dict) else str(c)) for c in content
|
||||
)
|
||||
return str(content).strip()
|
||||
|
||||
|
||||
async def compact_session(
|
||||
db: Database,
|
||||
config: Config,
|
||||
session_id: str,
|
||||
*,
|
||||
summary_model: str | None = None,
|
||||
) -> CompactionResult:
|
||||
"""Compact the oldest portion of a session's history. Idempotent under lock.
|
||||
|
||||
The caller is responsible for bumping the LangGraph thread_id (e.g.
|
||||
`InteractiveSession.clear_agent_cache()` style) AFTER this returns so the
|
||||
next ainvoke starts fresh.
|
||||
"""
|
||||
lock = _session_lock(session_id)
|
||||
async with lock:
|
||||
async with db.session() as s:
|
||||
row = await s.get(InteractiveSessionRow, session_id)
|
||||
if row is None:
|
||||
return CompactionResult(compacted=False, reason="session_not_found")
|
||||
|
||||
to_compact, _to_keep = await _collect_messages_for_compaction(s, session_id)
|
||||
if len(to_compact) < MIN_COMPACTABLE:
|
||||
return CompactionResult(
|
||||
compacted=False,
|
||||
reason=f"insufficient_messages ({len(to_compact)} < {MIN_COMPACTABLE})",
|
||||
)
|
||||
transcript = _format_for_summary(to_compact)
|
||||
active_model = row.model or ""
|
||||
model_for_summary = summary_model or "openrouter:deepseek/deepseek-chat"
|
||||
|
||||
# The LLM call lives OUTSIDE the DB session — it can take seconds and
|
||||
# we don't want to hold an asyncpg connection during the wait.
|
||||
try:
|
||||
summary_text = await _run_summary_llm(config, model_for_summary, transcript)
|
||||
except Exception as e:
|
||||
_LOG.exception("compaction summariser failed for session %s", session_id)
|
||||
return CompactionResult(compacted=False, reason=f"summariser_failed:{e}")
|
||||
|
||||
if not summary_text:
|
||||
return CompactionResult(compacted=False, reason="empty_summary")
|
||||
|
||||
summary_tokens = count_tokens(summary_text, active_model)
|
||||
|
||||
async with db.session() as s:
|
||||
# Re-fetch under fresh session in case state changed during LLM call.
|
||||
row = await s.get(InteractiveSessionRow, session_id)
|
||||
if row is None:
|
||||
return CompactionResult(compacted=False, reason="session_disappeared")
|
||||
|
||||
# Insert the summary as a new MessageRow before the kept tail.
|
||||
# - seq: 1 + max(seq of to_compact) → places summary just before
|
||||
# the first kept message in the ordering, which is what readers
|
||||
# want for "render history in seq order".
|
||||
# - role: "system" + is_summary=True so the next ainvoke includes
|
||||
# it and the GUI can render it distinctly.
|
||||
#
|
||||
# Then archive the originals in one UPDATE.
|
||||
archive_ids = [m.id for m in to_compact]
|
||||
archived_token_total = sum(int(m.token_count or 0) for m in to_compact)
|
||||
now = _now_iso()
|
||||
|
||||
# Summary seq = smallest archived seq. We can't pick a fractional
|
||||
# position (INTEGER seq) and shifting every kept row by +1 is too
|
||||
# expensive. Instead we archive originals into the negative-seq
|
||||
# band (see below) which frees their positive seqs; then the
|
||||
# summary at to_compact[0].seq slots naturally in front of the
|
||||
# kept tail when readers `ORDER BY seq ASC WHERE archived=False`.
|
||||
summary_seq = to_compact[0].seq
|
||||
s.add(
|
||||
MessageRow(
|
||||
session_id=session_id,
|
||||
seq=summary_seq,
|
||||
role="system",
|
||||
content=summary_text,
|
||||
tool_calls=None,
|
||||
token_count=summary_tokens,
|
||||
is_summary=True,
|
||||
archived=False,
|
||||
ts=now,
|
||||
)
|
||||
)
|
||||
|
||||
# Mark originals archived — bump their seq into a high band so the
|
||||
# UNIQUE (session_id, seq) constraint doesn't trip with the new
|
||||
# summary at summary_seq. We shift archived rows to negative seq
|
||||
# space (id-based) to keep them addressable but out of the way.
|
||||
#
|
||||
# Postgres lets us do this in one UPDATE with FROM clause; SQLite
|
||||
# doesn't, so do it per-row for portability.
|
||||
for original in to_compact:
|
||||
# Use a negative offset based on original.seq so they remain
|
||||
# ordered relative to each other.
|
||||
new_seq = -(original.seq + 1)
|
||||
await s.execute(
|
||||
update(MessageRow)
|
||||
.where(MessageRow.id == original.id)
|
||||
.values(archived=True, seq=new_seq)
|
||||
)
|
||||
|
||||
# Update aggregate token counters on the session row. Subtract
|
||||
# archived contribution and add the summary tokens (assigned to
|
||||
# system → input bucket).
|
||||
row.total_input_tokens = max(
|
||||
0,
|
||||
int(row.total_input_tokens or 0) - archived_token_total + summary_tokens,
|
||||
)
|
||||
row.last_message_at = now
|
||||
|
||||
await s.commit()
|
||||
|
||||
_LOG.info(
|
||||
"compacted session=%s archived=%d summary_tokens=%d",
|
||||
session_id,
|
||||
len(archive_ids),
|
||||
summary_tokens,
|
||||
)
|
||||
return CompactionResult(
|
||||
compacted=True,
|
||||
archived=len(archive_ids),
|
||||
summary_tokens=summary_tokens,
|
||||
reason="ok",
|
||||
)
|
||||
|
||||
|
||||
def context_usage_fraction(session_row: InteractiveSessionRow) -> float:
|
||||
"""Return current used_tokens / context_limit (0.0 if no model set)."""
|
||||
used = (session_row.total_input_tokens or 0) + (session_row.total_output_tokens or 0)
|
||||
limit = model_context_limit(session_row.model or "")
|
||||
if limit <= 0:
|
||||
return 0.0
|
||||
return used / limit
|
||||
Reference in New Issue
Block a user