The Mock Parity Harness

The most copy-paste-worthy artifact in claw-code. Three pieces:

A deterministic mock of /v1/messages — crates/mock-anthropic-service/src/lib.rs (1124 LOC). Implements the Anthropic Messages API endpoints, including SSE streaming, with fully scripted responses. No randomness, no real model calls.
A clean-environment CLI harness — crates/rusty-claude-cli/tests/mock_parity_harness.rs (883 LOC). Spawns the mock service on a random port, runs claw against it with a controlled environment, and asserts on the full request/response trace.
A scenarios manifest — mock_parity_scenarios.json. Twelve scripted scenarios:

streaming_text                       — baseline streaming, no tools
read_file_roundtrip                  — read_file tool execution + synthesis
grep_chunk_assembly                  — grep_search partial-JSON chunk assembly
write_file_allowed                   — workspace-write success path
write_file_denied                    — read-only mode blocks write
multi_tool_turn_roundtrip            — multiple tool calls in one assistant turn
bash_stdout_roundtrip                — bash exec + stdout roundtrip
bash_permission_prompt_approved      — workspace-write→bash escalation, approved
bash_permission_prompt_denied        — escalation denied
plugin_tool_roundtrip                — external plugin tool execution
auto_compact_triggered               — token threshold triggers compaction
token_cost_reporting                 — usage / cost in JSON output

Each scenario is a full end-to-end test: user input → CLI → mock API → tool execution → response assertion. Together they exercise streaming, tool calls, permission paths, plugins, compaction, and accounting.

What makes this work

Three things that aren't obvious from the surface:

The mock implements the real wire format precisely. Not "approximately Anthropic-compatible" — properly compatible, including SSE event sequencing (message_start, content_block_start, content_block_delta { delta: { type: "text_delta", text: "..." }}, etc.). When real Anthropic changes the wire format, the mock needs to track. The 1124 LOC are mostly wire-format faithfulness.
The CLI runs in a clean environment. No ~/.claw/, no ~/.config, no environment variables that might leak — the harness wraps claw invocation in a fresh tempdir and a controlled env. This means tests are deterministic across machines.
The scenarios are recordings, not generators. Each scenario file specifies the exact tool calls, the exact text deltas, the exact usage tokens. No prompts are sent to a real model. This makes the tests fast (sub-second) and stable.

Why this beats real-model integration tests

If you've ever tried to write end-to-end tests against a real LLM, you know the failure modes: rate limits, model updates breaking your snapshot, non-determinism in token counts, intermittent latency spikes. Each of these makes CI flaky.

The mock pattern eliminates all of them:

No rate limits — local subprocess
No model drift — you control what the mock returns
Deterministic token counts — scripted in the scenario file
No latency — runs in process, no network

The trade-off: you're testing your harness, not your model behavior. Real-model behavior verification still needs real-model tests, but those can be smaller (you trust the harness to behave; you only need to verify the model's outputs make sense for a few exemplar prompts).

How to lift it

For your own agent, an equivalent harness needs:

A faithful mock of your model API. If you're talking to OpenAI, mock the OpenAI Chat Completions endpoint (with streaming if you stream). If Anthropic, copy claw-code's mock.
A way to script responses by request shape. Match on the prompt or the tool definitions sent in the request, return a pre-recorded SSE stream.
A way to record real responses for replay. During development, hit the real API and save the response stream. This is your scenario file.
A scenario manifest mapping name → expected behavior. This is what gets diffed across versions of your agent — when you change something, the mock parity diff tells you which scenarios drifted.

run_mock_parity_diff.py (referenced in claw-code's README) is the diff runner. The pattern: each scenario produces a canonical artifact (e.g., the captured set of /v1/messages requests). Comparing artifact hashes across runs tells you whether your harness's outbound shape is stable.

Continue: Patterns to Borrow