Backend Mocks — letting backend-gated flows reach their payoff
Some flows can only succeed if a real backend tool writes a slot. In eval those backends are degraded, so the flow loops until it runs out of turns and looks broken. The mock fixes that, without ever faking success.
What & why
Several flows have a success guard that reads a slot only a real backend tool writes. For example,
schedule_event writes _booked_event_id via the calendar provider, and the
success transition fires on that slot. In eval, the provider runs in degraded mode (per the
external-services “degraded mode default” policy), so the slot never appears, the guard
never fires, and the flow loops until it hits max_turns, which makes it un-evaluatable. You can't
measure a flow that can never finish.
The backend mock makes a called backend tool succeed deterministically, so the flow can reach its designed payoff and be scored — while preserving the critical distinction between “the backend was absent” and “the flow is broken.”
How it works
A mock is a same-name ToolSpec injected via extra_tools.
The provider registry keeps the last spec registered under a name, so the mock overrides the
real/degraded tool without touching production code. The handler simply writes the success
slots and returns {"status": "success"}.
# riff/flow_eval/eval_mocks.py — tool_name -> slots the success path writes
register_eval_backend_mock(
    "schedule_event",
    {"_booked_event_id": "evt-eval-mock", "_scheduling_confirmation": "confirmed"},
)
register_eval_backend_mock("cc_dispatch_inject", {"_cc_dispatch_completed": True})
register_eval_backend_mock("cc_get_fleet_status", {"_cc_status_summary": "All units nominal (eval mock)."})
register_eval_backend_mock("tutorial_verify_caller_code", {"_caller_verified": True})
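The override mechanism described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: `ToolSpec` here is a two-field stand-in for the real class, and `ToolRegistry`, `make_mock`, and `degraded_schedule_event` are hypothetical names invented for the example. The only behaviours taken from the text are "last spec registered under a name wins" and "the handler writes the success slots and returns `{"status": "success"}`".

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Slots = Dict[str, Any]


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical stand-in; the real ToolSpec likely carries more fields."""
    name: str
    handler: Callable[[Slots], Dict[str, Any]]


class ToolRegistry:
    """Hypothetical registry that keeps the last spec registered per name."""

    def __init__(self) -> None:
        self._specs: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        # A later registration under the same name replaces the earlier one.
        self._specs[spec.name] = spec

    def call(self, name: str, slots: Slots) -> Dict[str, Any]:
        return self._specs[name].handler(slots)


def make_mock(name: str, success_slots: Slots) -> ToolSpec:
    """Build a same-name mock whose handler writes the success slots."""
    def handler(slots: Slots) -> Dict[str, Any]:
        slots.update(success_slots)
        return {"status": "success"}
    return ToolSpec(name=name, handler=handler)


def degraded_schedule_event(slots: Slots) -> Dict[str, Any]:
    # Assumed shape of a degraded-mode result: no slots are written.
    return {"status": "degraded"}


registry = ToolRegistry()
registry.register(ToolSpec("schedule_event", degraded_schedule_event))
# Registering the mock second is what makes it override the degraded tool.
registry.register(make_mock("schedule_event", {"_booked_event_id": "evt-eval-mock"}))

slots: Slots = {}
result = registry.call("schedule_event", slots)
```

After the call, `slots` holds `_booked_event_id`, so a success guard reading that slot could fire. The degraded handler is still defined but is no longer reachable through the registry.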
Mocks are gated entirely by an environment variable — build_eval_backend_tools(flow)
returns an empty tuple unless RIFF_EVAL_MOCK_BACKENDS=1, and even then only injects
mocks for tools this flow's states actually allow.
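The two gates described above (the environment variable, then the per-flow allow-list) might look something like the sketch below. It is an assumption-laden illustration: `select_mock_names` and `EVAL_BACKEND_MOCKS` are hypothetical names, and the function takes a plain list of allowed tool names rather than a flow object, since the real signature of `build_eval_backend_tools(flow)` is not shown here.

```python
import os
from typing import Any, Dict, Iterable, Mapping, Tuple

# Hypothetical stand-in for the table filled by register_eval_backend_mock.
EVAL_BACKEND_MOCKS: Dict[str, Dict[str, Any]] = {
    "schedule_event": {"_booked_event_id": "evt-eval-mock"},
    "cc_dispatch_inject": {"_cc_dispatch_completed": True},
}


def select_mock_names(
    allowed_tools: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, ...]:
    """Return the names of mocks to inject for one flow.

    Gate 1: nothing is injected unless RIFF_EVAL_MOCK_BACKENDS is exactly "1".
    Gate 2: only tools the flow's states allow are eligible.
    """
    if env.get("RIFF_EVAL_MOCK_BACKENDS") != "1":
        return ()
    allowed = set(allowed_tools)
    return tuple(name for name in EVAL_BACKEND_MOCKS if name in allowed)


# Flag unset: no mocks, regardless of what the flow allows.
off = select_mock_names(["schedule_event"], env={})
# Flag set: only the registered tool that the flow actually allows.
on = select_mock_names(
    ["schedule_event", "send_sms"], env={"RIFF_EVAL_MOCK_BACKENDS": "1"}
)
```

Passing `env` explicitly keeps the sketch testable without mutating the process environment. `send_sms` is an invented tool name used only to show that unregistered tools are ignored.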
Use case — the mock as a diagnostic
The mock is not just a crutch; it is a classifier. Turn it on and re-run, and the flows that
recover were “backend absent,” while the flows that still fail are “flow
broken.” Measured 2026-06-21 (commit def3e89, scoped in a142092):
| Flow | With mock | Verdict |
|---|---|---|
| apartment_scheduler | correctness 0 → 10, reaches success | ✅ was genuinely backend-absent |
| ceo_command_center | still fails: stalls at dispatch_intent (a confirm/resolve oscillation, cohesion 0.16) | ❌ real flow bug, before the backend |
| property_trouble_ticket | still fails: escalates at verification | ❌ real classifier bug, before the backend |
So the mock proved that only apartment_scheduler was a backend-absence
victim. The scope is deliberately narrow and not over-claimed: the other two flows fail before they
ever reach the backend, so those are real bugs that the mock cannot affect.
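The classification rule implied by the table can be written down as a small decision function. This is a sketch of the reasoning only; `Verdict` and `classify` are hypothetical names, and reducing each run to a single "reached success" boolean is an assumption made for the example.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy without mock"
    BACKEND_ABSENT = "backend absent"
    FLOW_BROKEN = "flow broken"


def classify(success_without_mock: bool, success_with_mock: bool) -> Verdict:
    """Compare a no-mock run with a mocked run of the same flow."""
    if success_without_mock:
        # The flow never needed the mock in the first place.
        return Verdict.HEALTHY
    if success_with_mock:
        # Only the missing backend was holding the flow back.
        return Verdict.BACKEND_ABSENT
    # Still failing with the backend mocked: the problem is in the flow.
    return Verdict.FLOW_BROKEN


# Outcomes taken from the table above (all three fail without the mock).
verdicts = {
    "apartment_scheduler": classify(False, True),
    "ceo_command_center": classify(False, False),
    "property_trouble_ticket": classify(False, False),
}
```

With the table's outcomes, only `apartment_scheduler` maps to `Verdict.BACKEND_ABSENT`; the other two map to `Verdict.FLOW_BROKEN`.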
Example
# run an eval with backend mocks on — only apartment_scheduler's schedule_event is mocked
RIFF_EVAL_MOCK_BACKENDS=1 python -m riff.flow_eval --flows apartment_scheduler --no-judge --out OUT
python scripts/eval_db.py --ingest-aspects --dirs OUT
python scripts/eval_db.py --matrix | grep apartment_scheduler
Without the flag, the same run leaves apartment_scheduler at correctness 0 (the live
matrix shows the gemini-agent, no-mock row at operability 0.3); with it, the booking path completes
and the flow becomes scorable.
Where it fits
Backend mocks make backend-dependent flows measurable in the matrix
and the per-state layer. They pair with
scripts/stall_detector.py, which classifies a stalled conversation as backend-stall vs
flow-bug vs dead-end. The design lives in docs/flow-eval/BACKEND-STALL-DESIGN.md.