The Mock Parity Harness
The most copy-paste-worthy artifact in claw-code. Three pieces:
-
A deterministic mock of
/v1/messages—crates/mock-anthropic-service/src/lib.rs(1124 LOC). Implements the Anthropic Messages API endpoints, including SSE streaming, with fully scripted responses. No randomness, no real model calls. -
A clean-environment CLI harness —
crates/rusty-claude-cli/tests/mock_parity_harness.rs(883 LOC). Spawns the mock service on a random port, runsclawagainst it with a controlled environment, and asserts on the full request/response trace. -
A scenarios manifest —
mock_parity_scenarios.json. Twelve scripted scenarios:
streaming_text — baseline streaming, no tools
read_file_roundtrip — read_file tool execution + synthesis
grep_chunk_assembly — grep_search partial-JSON chunk assembly
write_file_allowed — workspace-write success path
write_file_denied — read-only mode blocks write
multi_tool_turn_roundtrip — multiple tool calls in one assistant turn
bash_stdout_roundtrip — bash exec + stdout roundtrip
bash_permission_prompt_approved — workspace-write→bash escalation, approved
bash_permission_prompt_denied — escalation denied
plugin_tool_roundtrip — external plugin tool execution
auto_compact_triggered — token threshold triggers compaction
token_cost_reporting — usage / cost in JSON output
Each scenario is a full end-to-end test: user input → CLI → mock API → tool execution → response assertion. Together they exercise streaming, tool calls, permission paths, plugins, compaction, and accounting.
What makes this work
Three things that aren't obvious from the surface:
-
The mock implements the real wire format precisely. Not "approximately Anthropic-compatible" — properly compatible, including SSE event sequencing (
message_start,content_block_start,content_block_delta { delta: { type: "text_delta", text: "..." }}, etc.). When real Anthropic changes the wire format, the mock needs to track. The 1124 LOC are mostly wire-format faithfulness. -
The CLI runs in a clean environment. No
~/.claw/, no~/.config, no environment variables that might leak — the harness wrapsclawinvocation in a fresh tempdir and a controlled env. This means tests are deterministic across machines. -
The scenarios are recordings, not generators. Each scenario file specifies the exact tool calls, the exact text deltas, the exact usage tokens. No prompts are sent to a real model. This makes the tests fast (sub-second) and stable.
Why this beats real-model integration tests
If you've ever tried to write end-to-end tests against a real LLM, you know the failure modes: rate limits, model updates breaking your snapshot, non-determinism in token counts, intermittent latency spikes. Each of these makes CI flaky.
The mock pattern eliminates all of them:
- No rate limits — local subprocess
- No model drift — you control what the mock returns
- Deterministic token counts — scripted in the scenario file
- No latency — runs in process, no network
The trade-off: you're testing your harness, not your model behavior. Real-model behavior verification still needs real-model tests, but those can be smaller (you trust the harness to behave; you only need to verify the model's outputs make sense for a few exemplar prompts).
How to lift it
For your own agent, an equivalent harness needs:
- A faithful mock of your model API. If you're talking to OpenAI, mock the OpenAI Chat Completions endpoint (with streaming if you stream). If Anthropic, copy claw-code's mock.
- A way to script responses by request shape. Match on the prompt or the tool definitions sent in the request, return a pre-recorded SSE stream.
- A way to record real responses for replay. During development, hit the real API and save the response stream. This is your scenario file.
- A scenario manifest mapping name → expected behavior. This is what gets diffed across versions of your agent — when you change something, the mock parity diff tells you which scenarios drifted.
run_mock_parity_diff.py (referenced in claw-code's README) is the diff runner. The pattern: each scenario produces a canonical artifact (e.g., the captured set of /v1/messages requests). Comparing artifact hashes across runs tells you whether your harness's outbound shape is stable.
Continue: Patterns to Borrow