IA al Día
the efficient way to stay informed
Back to Deep Dives
Deep Dive June 12, 2026 14 min read

Context Compaction in AI Agents: How Hermes, Codex, Claude Code, and OpenCode Manage the Memory Window

In-depth technical analysis of how four AI agent systems — Hermes Agent, Codex CLI, Claude Code, and OpenCode — implement context compaction to maintain long sessions without losing critical information or exceeding token limits.

Context Compaction in AI Agents: How Hermes, Codex, Claude Code, and OpenCode Manage the Memory Window
By IA al Día

Key Findings

  • Context compaction is an investment: spend ~$0.01 in tokens now to save 10-100× on future turns
  • Hermes Agent has the most complete system with a dual layer (gateway at 85% + agent at 50%) and a 4-phase algorithm that prunes tool results before calling the LLM
  • Claude Code integrates Anthropic Prompt Cache as an intermediate layer, reducing costs ~75% without compromising information
  • OpenCode uses timestamp-based hiding (not physical deletion) — data persists in the DB for future audits
  • Codex CLI preserves user messages verbatim while compressing everything else into a "handoff" summary
  • All converge on a common pattern: cheap pruning first, LLM summary second, head and tail always protected
Methodology

Research based on official documentation from Hermes Agent (Nous Research), source code from Codex CLI (OpenAI/openai/codex) and OpenCode (sst/opencode), plus community analysis of Claude Code (Anthropic). Each claim about thresholds, algorithm phases, and data structures was verified against source code or official documentation. Cost comparisons use public API prices as of June 2026.

When you use an AI agent for a long coding session, something inevitable happens: the conversation grows, tool results accumulate, and the context approaches the model’s limit. The agent starts “forgetting” things — the system prompt, early instructions, key decisions. And in the worst case, the API returns a context_length_exceeded error and the session dies.

The solution is called context compaction. It’s not about expanding the model’s memory, but about learning to forget with precision: keep what matters, discard the noise, and do it spending as few tokens as possible in the process.

This article analyzes how four AI agent systems implement this critical functionality. Hermes Agent, Codex CLI, Claude Code, and OpenCode have radically different approaches — from Hermes’ dual safety layer to Claude Code’s three progressive phases — but they all converge on a common pattern.

What is context compaction and why does it matter?

Each turn in a conversation with an AI agent adds tokens to the history. The system prompt (with tool definitions, instructions, persistent memory), user messages, assistant responses, tool calls, and tool results — everything accumulates.

In a typical debugging session, tool results account for ~80% of total tokens. A grep search can return 2,000 tokens; reading a 300-line file adds 3,500; a test run with stack traces adds another 3,000. This data was critical during debugging, but once the bug is fixed, it’s dead weight.

Without compaction, the agent has three options, all bad:

  1. Fail with a context exceeded error, losing all progress
  2. Ignore early instructions because the context was flooded by tool results
  3. Cost a fortune because each turn costs input tokens proportional to the accumulated history

Compaction solves this by replacing the middle history (old tool results, resolved conversation, completed debugging) with a structured summary, preserving the head (system prompt + first exchange) and the tail (recent turns). It’s an investment: you spend a few cents on a cheap LLM to generate the summary, and you save that cost on every future turn.

Hermes Agent: the dual-layer system

Hermes Agent, developed by Nous Research, has the best-documented and most configurable compaction system of the four. Its architecture uses two independent layers that operate at different points in the flow.

Layer 1: Gateway Session Hygiene (85% of context)

This layer lives in gateway/run.py and runs before the agent processes an incoming message. It’s a safety net for sessions that grow uncontrollably between turns — for example, a Telegram session that accumulates messages while the agent is inactive.

  • Threshold: fixed at 85% of the model’s context
  • Token source: uses real tokens reported by the API if available; falls back to character-based estimation
  • Trigger: only when len(history) >= 4 and compression is enabled
  • Purpose: catch sessions that escaped the agent’s compressor

The gateway threshold is intentionally higher than the agent’s. Setting it to 50% (same as the agent) caused premature compressions on every turn of long gateway sessions.

Layer 2: Agent ContextCompressor (50% of context, configurable)

This is the main layer, in agent/context_compressor.py. It runs inside the agent’s tool loop with access to precise API token counts.

Typical configuration:

compression:
  threshold: 0.50          # Fraction of context that triggers compression
  target_ratio: 0.20       # What fraction of the threshold to preserve as tail
  protect_last_n: 20       # Minimum number of recent messages protected

For a model with 200K context, compression triggers at ~100K tokens. The protected tail gets ~20K tokens, and the summary has a budget of up to 10K tokens.

4-phase algorithm

The magic is in how ContextCompressor.compress() transforms 45 messages and ~95K tokens into 25 messages and ~45K tokens.

Phase 1: Prune old tool results (no LLM, cost ~zero)

Tool results >200 characters that are outside the protected tail are replaced with a placeholder: [Old tool output cleared to save context space]. This frees a massive amount of tokens without needing to call any model.

The agent remembers it ran a command, but not the exact output. If it needs the detail, it can re-execute.

Phase 2: Determine boundaries

[0..2]   ← protect first 3 messages (system + first exchange)
[3..N]   ← middle turns → will be summarized
[N..end] ← protected tail (by token budget OR protect_last_n)

Tail protection walks backward from the end accumulating tokens until the budget is exhausted. If the budget would protect fewer messages than protect_last_n, the fixed minimum is used.

Boundaries are aligned so as not to split tool_call/tool_result groups. The _align_boundary_backward() method walks backward through consecutive tool results to find the parent assistant message.

Phase 3: Structured summary via auxiliary LLM

The middle section is sent to an auxiliary model (configurable, typically cheaper than the main one) with an 8-section template:

## Goal
[What the user is trying to achieve]

## Constraints & Preferences
[Preferences, code style, important decisions]

## Progress
### Done
[Completed work — specific paths, commands, results]
### In Progress
[Work in progress]
### Blocked
[Blockers or issues found]

## Key Decisions
[Important technical decisions and why]

## Relevant Files
[Files read, modified, or created]

## Next Steps
[What should happen next]

## Critical Context
[Specific values, error messages, configuration details]

The summary budget scales with the content to compress: content_tokens × 0.20, with a minimum of 2,000 and a maximum of min(context_length × 0.05, 12,000).

Phase 4: Assemble compressed messages

A new message list is built:

  • Head messages (with a note added to the system prompt on first compression: “Some previous turns have been compacted…”)
  • Summary message (with a role chosen to avoid consecutive role violations)
  • Tail (recent messages, unmodified)

_sanitize_tool_pairs() cleans up orphaned pairs: tool results referencing removed calls are deleted, tool calls whose results were deleted receive a stub.

Iterative re-compression

On subsequent compressions, the previous summary is passed to the LLM with instructions to update it, not summarize from scratch. Items move from “In Progress” to “Done”, new progress is added, and obsolete information is removed.

Prompt Caching (Anthropic)

Hermes also integrates Prompt Caching for Anthropic models, in agent/prompt_caching.py. The system_and_3 strategy places 4 cache breakpoints:

  • Breakpoint 1: System prompt (always stable)
  • Breakpoints 2-4: Last 3 non-system messages (rolling window)

This reduces input costs ~75% in multi-turn conversations without sacrificing information.

Codex CLI: the handoff summary

OpenAI Codex CLI takes a simpler but equally effective approach: a single compression layer that replaces everything with a “handoff” summary.

Dual-path design

Codex offers two compression paths:

  • Local path (compact.rs): the client calls an LLM to generate the summary. Works with any model provider.
  • Remote path (compact_remote.rs): calls OpenAI’s internal endpoint responses/compact. The OpenAI server handles compression, likely with specialized models and internal cache.

In both cases compression requires an LLM call. The difference is where it runs: the local path orchestrates everything from the client; the remote one subcontracts the “generate summary” step to OpenAI.

The compression prompt

You are performing a CONTEXT CHECKPOINT COMPRESSION. Create a handoff
summary for another LLM that will resume the task.

Include:
- Current progress and key decisions made
- Important context, constraints, or user preferences
- What remains to be done (clear next steps)
- Any critical data, examples, or references needed to continue

Be concise, structured, and focused on helping the next LLM seamlessly
continue the work.

The keyword is handoff — these are not meeting minutes, but a briefing so the next model can pick up the work without missing a beat.

Preservation of user messages

A distinctive feature of Codex: it preserves user messages verbatim. Only assistant responses and tool results are compressed. This means the agent can always see what the user originally said, though it reduces compression efficiency when user messages are long.

Fallback: head trimming

If there’s still not enough space after compression, Codex resorts to head trimming — cutting from the oldest messages. This is destructive and considered a last resort.

Claude Code: three layers of precision

Anthropic’s Claude Code has the most sophisticated system of the four, with three progressive layers that go from cheapest to most expensive.

Claude Code is not open source. This analysis is based on community reverse engineering and public materials.

Layer 1: Tool Result Trimming (LLM cost = zero)

This layer runs automatically before each request. No LLM calls — it’s purely a local rule engine.

The logic:

  • Protects the most recent tool call results
  • Old tool results → replaced with [Old tool result content cleared]

The agent remembers it ran a search, but not the result. If it needs the detail, it can re-execute the command. It’s selective amnesia, not total forgetfulness.

Layer 2: Cache-Friendly Strategy

This is Claude Code’s unique advantage. Anthropic supports Prompt Cache: if the prefix of your API message matches the previous request, the server reuses prior computations, drastically reducing cost and latency.

When cleaning messages, Claude Code deliberately avoids modifying the first half of the sequence. It prefers to cut from the end, keeping the beginning identical to maximize cache hits.

The trade-off: lower cleaning efficiency, but maximum cache rate. For long tasks (refactoring an entire module), this translates into significant savings — you pay only for the new content at the end.

Layer 3: Structured LLM Summary (last resort, 9 sections)

When the first two layers aren’t enough, the full summary is triggered. The auto-compaction threshold is: effective window - 13,000 tokens.

Before calling the LLM, the system attempts Session Memory Compact — using structured information already in session memory. Only when this is not possible does it fall back to a traditional LLM summary, which generates 9 fixed sections:

  1. Original user intent
  2. Core technical concepts
  3. Relevant files and code
  4. Errors encountered and how they were fixed
  5. Logical troubleshooting chain
  6. Summary of all user messages
  7. Pending tasks
  8. What is currently being worked on
  9. Suggested next steps

The prompt demands verbatim quotes of key phrases from the original, not paraphrasing. This prevents “context drift” — the model subtly deviating from the original meaning when retelling.

Post-compression

After compressing, Claude Code runs a series of state reconstruction steps:

  • Injects a lead-in: “This session continues from a previous conversation…”
  • Automatic re-read of up to 5 recently edited files (budget of 50K tokens, 5K per file)
  • Redeclares tool definitions and skill definitions
  • Specifications in CLAUDE.md (system prompt) remain intact

There is also a passive fallback: if the API returns prompt_too_long, it initiates reactive compression and retries. It pauses after 3 consecutive failures to avoid infinite loops.

OpenCode: stepped governance with non-destructive hiding

OpenCode (archived, formerly sst/opencode) offers the most balanced strategy, implemented in session/compaction.ts with Effect-TS.

Step 1: Prune (hide, don’t delete)

The first action is not deletion — it’s marking. OpenCode adds a timestamp compacted = Date.now() to old messages, making them invisible in subsequent requests. The data remains in the database.

Rules:

  • Only runs if it can free >20K tokens
  • Always preserves the last 40K tokens as a safety buffer
  • Tool outputs of type skill are never pruned
  • Protects the full content of the last 2 user turns

This is a visionary design decision: the data isn’t really lost. It leaves room for future audits, rollbacks, or history features.

Step 2: LLM Summary with 5 sections

If there are still issues after pruning, OpenCode uses a dedicated hidden agent (without interrupting the user) that generates a summary with 5 fixed sections:

  1. What was done
  2. What is currently being worked on
  3. Modified files
  4. Next steps
  5. Important technical decisions

Auto-replay of the last message

The smartest feature of OpenCode: after compaction, the system automatically resubmits the last user message. The user doesn’t even notice the compression happened — their last message is reprocessed, the agent responds, as if nothing happened.

OpenCode also follows the user’s language: if the conversation is in Spanish, the summary is generated in Spanish.

Comparison table

DimensionHermes AgentCodex CLIClaude CodeOpenCode
Layers2 (gateway + agent)1 (summary)3 (trim + cache + summary)2 (hide + summary)
LLM callsPhase 3 onlyAlways requiredLayer 3 onlyStep 2 only
Threshold50% (configurable)~180-244K tokens~95%overflow + margin
User messagesSummarizedPreserved verbatimSummarizedSummarized + replay
Tool resultsPlaceholderPhysical deletionPlaceholderTimestamp hiding
CacheAnthropic Prompt CacheNo specialDeep integrationReduces reads
Post-compressionIterative re-compressionPassive waitFile re-readAuto-replay last msg
IrreversibleYesYesYesNo (timestamp)
Open sourceYes (MIT)Yes (Apache 2.0)NoYes (archived)

The common pattern

Despite the differences, all four systems converge on a shared pattern:

  1. Cheap pruning first: before calling the LLM, everyone has a mechanical cleaning phase (tool results → placeholder/hiding). This frees 50-80% of space without spending a single token.
  2. Head and tail always protected: the system prompt and recent messages are never touched. What gets summarized is the middle section, where the completed work lives.
  3. Structured summary: free-form doesn’t work. Everyone uses templates with fixed sections (between 4 and 9) that guide the summarizer model.
  4. Cheaper auxiliary model: nobody uses the main model for summarizing. Compaction is delegated to a secondary model (often 10-100× cheaper).
  5. Post-processing: after compressing, everyone does something to prevent the agent from getting “disoriented” — resending the last message, re-reading files, injecting prefixes.

Is it worth it? The investment math

Context compaction spends tokens now to save later. Does it make sense?

For a model like DeepSeek V4 Flash at $0.28/1M output tokens:

  • Cost of one compaction: generating a ~2,000 token summary → ~$0.0006
  • Savings per subsequent turn: a compacted history of 45K tokens vs 95K tokens → savings of $0.014 per turn on input alone
  • Break-even point: after ~1 turn, the compaction has already paid for itself
  • Return over 20 turns: 20× cheaper than without compaction

For a premium model like Claude Opus 4.8 at $25/1M output tokens:

  • Compaction cost: ~$0.05 (cheaper auxiliary model)
  • Savings per turn: ~$1.25
  • Return over 20 turns: 25× cheaper

Compaction is not a luxury — it’s an economic requirement for AI agents to work in long sessions without breaking the bank on API costs.

Implications for agent development

If you are building your own agent or choosing an existing one, these are the right questions:

  • How many compression layers do you need? One layer (like Codex) is enough for short sessions. Two layers (like Hermes) provide safety against escapes. Three layers (like Claude Code) optimize costs in very long sessions.
  • Does irreversibility matter? OpenCode is the only one that doesn’t destroy data. If you do audits or need rollbacks, its timestamp approach is superior.
  • Do you use Anthropic? The Prompt Cache integration in Claude Code or Hermes can reduce your costs ~75% in multi-turn sessions.
  • Does the auxiliary model matter? A lot. If your auxiliary model has a smaller context window than the main one, compactions will silently fail (Hermes documents this as “the most common cause of degradation”).

In the end, all four systems demonstrate that the best context management isn’t about expanding the model’s memory, but about learning to forget with precision.


Main source: Hermes Agent Documentation — Context Compression and Caching | Source code: Hermes Agent, Codex CLI, OpenCode (archived)

Deep Dives

No deep dives published yet.