Nexus API Reference v2.4.1

Context, Caching, Compaction

Three mechanisms work together to keep an agent's working memory bounded: a layered system prompt that's mostly cacheable, prompt-cache analytics that watch for unexpected breaks, and compaction that summarizes older turns when the context grows past a threshold. Auto-memory is a fourth, orthogonal mechanism: it's not about saving tokens, it's about saving intent across sessions.

The system-prompt boundary

SystemPromptBuilder (crates/runtime/src/prompt.rs:113–221) composes the system prompt out of named segments. The build() method (lines 169–191) emits a Vec<String> in this order:

  1. Intro section
  2. Output style (if set)
  3. System guidelines (get_simple_system_section)
  4. Task guidelines (get_simple_doing_tasks_section)
  5. Action cautions (get_actions_section)
  6. SYSTEM_PROMPT_DYNAMIC_BOUNDARY — a literal sentinel string at line 178
  7. Environment section (model, cwd, date, platform, shell, OS version)
  8. Project context (git status, diff, recent commits)
  9. Instruction files (CLAUDE.md walk: home → repo root → subdirs)
  10. Runtime config (loaded settings.json entries)
  11. Appended sections (user-provided)

The sentinel is a string constant defined at crates/runtime/src/prompt.rs:40:

SYSTEM_PROMPT_DYNAMIC_BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"

Items 1–5 are the "static" half: they are identical across model invocations, change rarely, and are ideal for prompt caching. Items 7–11 are the "dynamic" half: they can change every turn (timestamp, git status, diff, project context). The sentinel marks where a downstream cache layer should split the prompt: everything before the sentinel goes in one cache-controlled segment, and everything after it is uncached.
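A minimal sketch of how a downstream layer might perform that split, assuming the segments arrive as the Vec<String> that build() emits. The helper name split_at_boundary and its fallback behavior (treat everything as dynamic when no sentinel is found) are illustrative assumptions, not code from the repository.

```rust
/// Sentinel value quoted in the text above.
const SYSTEM_PROMPT_DYNAMIC_BOUNDARY: &str = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__";

/// Hypothetical helper: splits built segments into (static, dynamic) halves.
/// The sentinel itself is dropped from both halves. If no sentinel is
/// present, everything is treated as dynamic so nothing is cached by mistake.
fn split_at_boundary(segments: &[String]) -> (Vec<String>, Vec<String>) {
    let boundary = segments
        .iter()
        .position(|s| s.as_str() == SYSTEM_PROMPT_DYNAMIC_BOUNDARY);
    match boundary {
        Some(idx) => (segments[..idx].to_vec(), segments[idx + 1..].to_vec()),
        None => (Vec::new(), segments.to_vec()),
    }
}
```

Defaulting to "all dynamic" when the sentinel is missing is the conservative choice: a missed caching opportunity costs tokens, while caching a segment that changes every turn would only churn cache writes.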

The prompt-cache analytics gap

Here's where claw-code falls short of upstream Claude Code, and the gap is informative precisely because of what's missing.

crates/api/src/prompt_cache.rs (735 LOC) defines:

  • PromptCacheConfig (line 20): session_id, completion_ttl: Duration (default 30s — line 22), prompt_ttl: Duration (default 5min — line 23), cache_break_min_drop: u32 (default 2000 tokens — line 24)
  • PromptCacheStats (lines 76–91): tracks tracked_requests, completion_cache_hits/misses/writes, expected_invalidations, unexpected_cache_breaks, total_cache_creation/read_input_tokens, plus last_break_reason
  • CacheBreakEvent (lines 94–99): per-event detection of when cache reads dropped suddenly
  • PromptCache::lookup_completion() (line 145+) and record_response() (line 173+): the read/write API used by the Anthropic provider at crates/api/src/providers/anthropic.rs:292,310
  • detect_cache_break() (line 314): the heuristic — if token_drop < cache_break_min_drop, it's not a real break; if the previous TTL has expired (line 360), it's an expected invalidation; otherwise it's logged as unexpected_cache_breaks

What this is, in plain English: an analytics layer. It watches the usage field on each Anthropic response (which reports cache_read_input_tokens and cache_creation_input_tokens), and if those numbers diverge from what's expected based on the previous turn's cache state, it flags it.
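The classification described for detect_cache_break() can be sketched as follows. This is an illustration of the three-way decision as the text states it, not the function's actual signature: the name classify_cache_read, the CacheVerdict enum, and the parameter list are assumptions.

```rust
use std::time::Duration;

/// Hypothetical verdict type for one turn's cache-read observation.
#[derive(Debug, PartialEq)]
enum CacheVerdict {
    /// The drop was smaller than the configured minimum.
    NoBreak,
    /// The drop was large, but the previous cache entry had outlived its TTL.
    ExpectedInvalidation,
    /// The drop was large and the cache should still have been warm.
    UnexpectedBreak { token_drop: u32 },
}

/// Compares this turn's cache_read_input_tokens against the previous turn's.
fn classify_cache_read(
    prev_cache_read: u32,
    curr_cache_read: u32,
    since_prev: Duration,
    prompt_ttl: Duration,
    cache_break_min_drop: u32,
) -> CacheVerdict {
    // A rise in cache reads is never a break, so saturate at zero.
    let token_drop = prev_cache_read.saturating_sub(curr_cache_read);
    if token_drop < cache_break_min_drop {
        return CacheVerdict::NoBreak;
    }
    if since_prev >= prompt_ttl {
        return CacheVerdict::ExpectedInvalidation;
    }
    CacheVerdict::UnexpectedBreak { token_drop }
}
```

With the defaults quoted above (a 5-minute prompt_ttl and a 2000-token minimum drop), a fall from 10,000 to 1,000 cached tokens ten seconds after the previous turn would be classified as an unexpected break.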

What's missing: a grep across all crates for cache_control, "ephemeral", or prompt-caching returns zero hits in non-test code. The only related references are:

  • betas: ["claude-code-20250219", "prompt-caching-scope-2026-01-05"] at anthropic.rs:1443, which sits in a test asserting that betas get stripped from the body
  • A comment at anthropic.rs:983–984 noting that the opt-in is communicated via the anthropic-beta HTTP header, never as a body field

So claw-code knows the prompt-caching beta is supposed to be on, watches for cache hits/misses, but does not actually mark any system/tool/message blocks as cache_control: {"type": "ephemeral"}. It opts in to caching headers but doesn't place breakpoints.

What real Claude Code almost certainly does: insert cache_control: {"type": "ephemeral"} markers at strategic positions — typically at the end of the static system prompt section, at the end of the tools array, and possibly at the end of the most recent N messages. Each marker creates a cache breakpoint, so the API can serve the prefix from cache up to that point.
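To make the breakpoint idea concrete, here is a hedged sketch that renders a system array with a cache_control marker on the static block only. The SystemBlock type, the render_system function, and the hand-rolled JSON escaping are all assumptions made to keep the example dependency-free; this is neither claw-code's nor Claude Code's implementation, and a real client would use a JSON library.

```rust
/// Hypothetical representation of one text block in the `system` array.
struct SystemBlock {
    text: String,
    /// When true, the block carries a cache breakpoint.
    cache_breakpoint: bool,
}

/// Minimal JSON string escaping, enough for this illustration.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders the blocks as a JSON array, marking breakpoint blocks as ephemeral.
fn render_system(blocks: &[SystemBlock]) -> String {
    let items: Vec<String> = blocks
        .iter()
        .map(|b| {
            let cache = if b.cache_breakpoint {
                r#","cache_control":{"type":"ephemeral"}"#
            } else {
                ""
            };
            format!(r#"{{"type":"text","text":"{}"{}}}"#, escape_json(&b.text), cache)
        })
        .collect();
    format!("[{}]", items.join(","))
}
```

Under this sketch, the static half of the system prompt would go in one block with cache_breakpoint set, and the dynamic half in a second block without it, so the prefix up to the marker stays stable across turns.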

The prompt_ttl constant (300s) matches the shorter of the 5-minute and 1-hour TTLs that Anthropic exposes. Cache-aware self-pacing (staying under 5 minutes when actively iterating, committing to 20+ minutes when a cache miss is unavoidable) is the kind of behavior that pays off if the breakpoints are correctly placed.

Borrow-able pattern. Even if you don't implement breakpoints (which require knowing your provider's caching semantics), the analytics layer is genuinely useful. Track cache_read_input_tokens per turn, compare against the previous turn's prefix, and log when it drops unexpectedly. This catches non-determinism in your prompt assembly (e.g., a HashMap that should be a BTreeMap, a timestamp that snuck into the static section) before it costs you money.
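One way to guard against the HashMap-style non-determinism mentioned above is to render keyed data from a BTreeMap and fingerprint the static prefix each turn. The sketch below assumes hypothetical helpers render_settings and fingerprint; a changed fingerprint between turns with unchanged inputs would point at unstable prompt assembly.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Renders settings in sorted key order, so identical settings always
/// produce byte-identical prompt text regardless of insertion order.
fn render_settings(settings: &BTreeMap<String, String>) -> String {
    settings
        .iter()
        .map(|(k, v)| format!("{}={}", k, v))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Fingerprints a prompt prefix so consecutive turns in the same process
/// can be compared cheaply. Not intended as a stable on-disk hash.
fn fingerprint(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}
```

Logging the fingerprint next to cache_read_input_tokens makes it easier to tell a prompt-assembly change apart from a provider-side invalidation when a drop does occur.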

Compaction

crates/runtime/src/compact.rs (835 LOC). Public entry: compact_session(session, config) -> CompactionResult at line 96.

The trigger is at conversation.rs:506, every turn, via maybe_auto_compact() (lines 559–582). The condition is cumulative_usage().input_tokens >= auto_compaction_input_tokens_threshold (default 100_000 at conversation.rs:18). Manual compaction can also be triggered via runtime.compact(config) at conversation.rs:523.
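The trigger condition reduces to a single comparison against cumulative input tokens. The sketch below assumes a hypothetical ConversationSketch type and a u64 counter; only the threshold default and the >= comparison come from the text above.

```rust
/// Default quoted in the text above; the u64 type is an assumption.
const DEFAULT_AUTO_COMPACTION_THRESHOLD: u64 = 100_000;

/// Hypothetical per-turn usage record.
struct TurnUsage {
    input_tokens: u64,
}

/// Hypothetical stand-in for the conversation runtime.
struct ConversationSketch {
    turns: Vec<TurnUsage>,
    auto_compaction_input_tokens_threshold: u64,
}

impl ConversationSketch {
    fn new() -> Self {
        Self {
            turns: Vec::new(),
            auto_compaction_input_tokens_threshold: DEFAULT_AUTO_COMPACTION_THRESHOLD,
        }
    }

    /// Records the input tokens reported for one completed turn.
    fn record_turn(&mut self, input_tokens: u64) {
        self.turns.push(TurnUsage { input_tokens });
    }

    /// Sums input tokens over every recorded turn.
    fn cumulative_input_tokens(&self) -> u64 {
        self.turns.iter().map(|t| t.input_tokens).sum()
    }

    /// Checked once per turn: compact when the running total reaches the threshold.
    fn should_auto_compact(&self) -> bool {
        self.cumulative_input_tokens() >= self.auto_compaction_input_tokens_threshold
    }
}
```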

The algorithm (compact.rs:96–183):

  1. Should we compact? (lines 97–104) — if message count is below preserve_recent_messages (default 4) or total compactable tokens are below max_estimated_tokens, return unchanged.
  2. Pull existing summary (lines 106–109) — if this is the second+ compaction, the previous summary is in the session's first message; extract it.
  3. Find the boundary (lines 111–158) — preserve the last N messages, but walk backward to avoid splitting a tool_use/tool_result pair across the boundary.
  4. Summarize removed messages (lines 159–281, summarize_messages()) — the LLM-friendly summary format includes:
    • User/assistant/tool message counts
    • Set of distinct tool names called
    • "Recent user requests" excerpt
    • "Pending work" inferred from incomplete tool sequences
    • "Key files" extracted from filesystem-touching tools
    • "Current work" from the latest assistant turn
    • Timeline (concise chronology)
  5. Merge with prior summary if re-compacting (lines 283–316) — concatenates rather than re-summarizes, to avoid information loss across multiple compactions.
  6. Build the continuation message (line 164) — a system-role message containing the summary plus a "Recent messages preserved" note plus a "direct-resume" instruction.
  7. Insert as first message, append preserved messages (lines 166–171). Record compaction metadata in the session's SessionCompaction field.
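Step 3 is the subtle part of the algorithm, so here is a minimal sketch of the backward walk. It assumes a simplified message model (the Kind enum and compaction_boundary function are hypothetical) in which a tool result always follows the assistant message that issued the tool call.

```rust
/// Hypothetical, simplified message kinds.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    User,
    AssistantText,
    AssistantToolUse,
    ToolResult,
}

/// Returns the index of the first preserved message. Messages before the
/// index are summarized; messages from the index onward are kept verbatim.
fn compaction_boundary(kinds: &[Kind], preserve_recent: usize) -> usize {
    let mut boundary = kinds.len().saturating_sub(preserve_recent);
    // If the preserved tail would begin with a tool result, move the boundary
    // back until the tool_use that produced it is preserved as well.
    while boundary > 0 && boundary < kinds.len() && kinds[boundary] == Kind::ToolResult {
        boundary -= 1;
    }
    boundary
}
```

The walk can preserve slightly more than preserve_recent messages. That is the intended trade: a few extra tokens in exchange for never sending a tool result whose originating call has been summarized away.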

The key trade-off: compaction invalidates the prompt cache (because the message list is now different), but it dramatically reduces context size. So you only do it when context pressure is high enough that the next few turns will run cheaper despite the cache miss.

Summary compression

A separate, smaller utility lives at crates/runtime/src/summary_compression.rs (300 LOC). It differs from compact.rs in that it operates on the output text of compaction, not on messages:

  • compress_summary(summary, budget) -> SummaryCompressionResult (line 37)
  • compress_summary_text(summary) -> String (line 91)

The budget (SummaryCompressionBudget, lines 8–22) constrains max_chars: usize (default 1200), max_lines: usize (24), max_line_chars: usize (160). The result tracks how much was trimmed, deduplicated, or truncated.

Use case: when compaction summaries would otherwise blow past a display budget (e.g., in the REPL's "what just got compacted" view), this trims them deterministically.
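A rough sketch of what deterministic trimming under such a budget could look like. The budget fields and defaults come from the text above; the compress_sketch function, the CompressionSketch result type, and the specific order of operations (deduplicate, truncate long lines, then drop lines that exceed the budget) are assumptions rather than the module's actual behavior.

```rust
use std::collections::HashSet;

struct SummaryCompressionBudget {
    max_chars: usize,
    max_lines: usize,
    max_line_chars: usize,
}

impl Default for SummaryCompressionBudget {
    fn default() -> Self {
        Self { max_chars: 1200, max_lines: 24, max_line_chars: 160 }
    }
}

/// Hypothetical result type recording what was changed.
struct CompressionSketch {
    text: String,
    deduplicated_lines: usize,
    truncated_lines: usize,
    dropped_lines: usize,
}

fn compress_sketch(summary: &str, budget: &SummaryCompressionBudget) -> CompressionSketch {
    let mut seen = HashSet::new();
    let mut kept: Vec<String> = Vec::new();
    let (mut deduplicated_lines, mut truncated_lines, mut dropped_lines) =
        (0usize, 0usize, 0usize);
    let mut used_chars = 0usize;

    for line in summary.lines().map(str::trim).filter(|l| !l.is_empty()) {
        // Skip exact repeats of a line that was already seen.
        if !seen.insert(line) {
            deduplicated_lines += 1;
            continue;
        }
        let shortened: String = line.chars().take(budget.max_line_chars).collect();
        let len = shortened.chars().count();
        // Every kept line after the first also costs one newline separator.
        let cost = len + usize::from(!kept.is_empty());
        if kept.len() >= budget.max_lines || used_chars + cost > budget.max_chars {
            dropped_lines += 1;
            continue;
        }
        if len < line.chars().count() {
            truncated_lines += 1;
        }
        used_chars += cost;
        kept.push(shortened);
    }

    CompressionSketch { text: kept.join("\n"), deduplicated_lines, truncated_lines, dropped_lines }
}
```

Because lines are processed in input order and the HashSet is used only for membership checks, the same summary and budget always produce the same output.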

Recovery

crates/runtime/src/recovery_recipes.rs (635 LOC) is a separate concern but lives in the same context-management neighborhood. It maps known failure scenarios to recovery actions:

  • FailureScenario (lines 46–57): TrustPromptUnresolved, PromptMisdelivery, StaleBranch, CompileRedCrossCrate, McpHandshakeFailure, PartialPluginStartup, ProviderFailure
  • RecoveryRecipe (lines 77–86): AcceptTrustPrompt, RedirectPromptToAgent, RebaseBranch, CleanBuild, RetryMcpHandshake, RestartPlugin, RestartWorker, EscalateToHuman
  • EscalationPolicy (lines 91–95): AlertHuman, LogAndContinue, Abort

This is entirely worker-and-lane-oriented — recovery from coordination failures rather than per-turn compaction. The mechanism: when a worker emits a known-failure event, the recipe map is consulted, and the recipe is enqueued as a follow-up task. The harness can then either auto-execute it (LogAndContinue), surface it to the user (AlertHuman), or terminate the work (Abort).
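The lookup-then-policy flow can be sketched as two small functions. The enum variants are taken from the lists above, but the scenario-to-recipe pairing, the rule that a repeated failure escalates to a human, and the Outcome type are guesses made for illustration; the source does not spell out the actual map.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum FailureScenario {
    TrustPromptUnresolved,
    PromptMisdelivery,
    StaleBranch,
    CompileRedCrossCrate,
    McpHandshakeFailure,
    PartialPluginStartup,
    ProviderFailure,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum RecoveryRecipe {
    AcceptTrustPrompt,
    RedirectPromptToAgent,
    RebaseBranch,
    CleanBuild,
    RetryMcpHandshake,
    RestartPlugin,
    RestartWorker,
    EscalateToHuman,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum EscalationPolicy {
    AlertHuman,
    LogAndContinue,
    Abort,
}

/// Hypothetical description of what the harness does with a recipe.
#[derive(Debug, PartialEq)]
enum Outcome {
    AutoExecute(RecoveryRecipe),
    SurfaceToUser(RecoveryRecipe),
    Terminate,
}

/// Assumed mapping, inferred from the variant names only.
fn recipe_for(scenario: FailureScenario, prior_attempts: u32) -> RecoveryRecipe {
    // Assumption: a scenario that already failed recovery once goes to a human.
    if prior_attempts > 0 {
        return RecoveryRecipe::EscalateToHuman;
    }
    match scenario {
        FailureScenario::TrustPromptUnresolved => RecoveryRecipe::AcceptTrustPrompt,
        FailureScenario::PromptMisdelivery => RecoveryRecipe::RedirectPromptToAgent,
        FailureScenario::StaleBranch => RecoveryRecipe::RebaseBranch,
        FailureScenario::CompileRedCrossCrate => RecoveryRecipe::CleanBuild,
        FailureScenario::McpHandshakeFailure => RecoveryRecipe::RetryMcpHandshake,
        FailureScenario::PartialPluginStartup => RecoveryRecipe::RestartPlugin,
        FailureScenario::ProviderFailure => RecoveryRecipe::RestartWorker,
    }
}

/// Applies the policy as described in the text above.
fn apply_policy(recipe: RecoveryRecipe, policy: EscalationPolicy) -> Outcome {
    match policy {
        EscalationPolicy::LogAndContinue => Outcome::AutoExecute(recipe),
        EscalationPolicy::AlertHuman => Outcome::SurfaceToUser(recipe),
        EscalationPolicy::Abort => Outcome::Terminate,
    }
}
```

Keeping the mapping in an exhaustive match means the compiler flags any newly added FailureScenario variant that lacks a recipe.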



Last updated: May 14, 2026