Backend Mocks — letting backend-gated flows reach their payoff

Some flows can only succeed if a real backend tool writes a slot. In eval those backends run in degraded mode, so the flow loops until it hits max_turns and looks broken. The mock fixes that without ever faking success.

What & why

Several flows have a success guard that reads a slot only a real backend tool writes. For example, schedule_event writes _booked_event_id via the calendar provider, and the success transition fires on that slot. In eval, the provider runs in degraded mode (per the external-services “degraded mode default” policy), so the slot never appears, the guard never fires, and the flow loops until max_turns, which leaves it un-evaluatable. You can't measure a flow that can never finish.

The backend mock makes a called backend tool succeed deterministically, so the flow can reach its designed payoff and be scored — while preserving the critical distinction between “the backend was absent” and “the flow is broken.”

How it works

A mock is a same-name ToolSpec injected via extra_tools. The provider registry keeps the last spec registered under a name, so the mock overrides the real/degraded tool without touching production code. The handler simply writes the success slots and returns {"status": "success"}.
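The override-by-name behaviour described above can be illustrated with a minimal sketch. The ToolSpec and ToolRegistry classes below are hypothetical stand-ins written for this example, not the real riff types; the only property assumed from the text is that the registry keeps the last spec registered under a name.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Slots = Dict[str, Any]


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical stand-in for the real ToolSpec: a name plus a handler."""
    name: str
    handler: Callable[[Slots], Dict[str, str]]


class ToolRegistry:
    """Hypothetical registry; the last spec registered under a name wins."""

    def __init__(self) -> None:
        self._specs: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        # Re-registering a name silently replaces the earlier spec.
        self._specs[spec.name] = spec

    def call(self, name: str, slots: Slots) -> Dict[str, str]:
        return self._specs[name].handler(slots)


def degraded_schedule_event(slots: Slots) -> Dict[str, str]:
    # Assumed shape of a degraded tool: it writes no success slot.
    return {"status": "degraded"}


def make_mock(name: str, success_slots: Slots) -> ToolSpec:
    # The mock handler writes the success slots and reports success.
    def handler(slots: Slots) -> Dict[str, str]:
        slots.update(success_slots)
        return {"status": "success"}

    return ToolSpec(name, handler)


registry = ToolRegistry()
registry.register(ToolSpec("schedule_event", degraded_schedule_event))
# Registered second under the same name, so it overrides the degraded tool.
registry.register(make_mock("schedule_event", {"_booked_event_id": "evt-eval-mock"}))

slots: Slots = {}
result = registry.call("schedule_event", slots)
```

In this sketch the degraded handler is never edited or removed; it is simply shadowed, which mirrors the claim that production code stays untouched.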

# riff/flow_eval/eval_mocks.py — tool_name -> slots the success path writes
register_eval_backend_mock("schedule_event",
    {"_booked_event_id": "evt-eval-mock", "_scheduling_confirmation": "confirmed"})
register_eval_backend_mock("cc_dispatch_inject",      {"_cc_dispatch_completed": True})
register_eval_backend_mock("cc_get_fleet_status",     {"_cc_status_summary": "All units nominal (eval mock)."})
register_eval_backend_mock("tutorial_verify_caller_code", {"_caller_verified": True})

The mock NEVER auto-fires a guard. Codex's key constraint: if a flow never calls the tool, the mock never runs and the guard stays false, so a flow that forgets to call its booking tool still fails. We do not auto-satisfy success; we only make a tool that the agent actually invoked return a deterministic success. That preserves the test's power to catch a flow that doesn't even try to book.
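To make the constraint concrete, here is a small sketch that compares a flow that calls its booking tool with one that never does. The run_flow and success_guard helpers are hypothetical simplifications invented for this example; the real flow engine is assumed to be far richer.

```python
from typing import Any, Callable, Dict, List

Slots = Dict[str, Any]


def mock_schedule_event(slots: Slots) -> Dict[str, str]:
    # Runs only when the agent actually invokes the tool.
    slots["_booked_event_id"] = "evt-eval-mock"
    return {"status": "success"}


def success_guard(slots: Slots) -> bool:
    # The guard reads the slot; nothing sets it on the guard's behalf.
    return bool(slots.get("_booked_event_id"))


def run_flow(tool_calls: List[Callable[[Slots], Any]]) -> bool:
    """Hypothetical one-shot flow: apply each tool call, then check the guard."""
    slots: Slots = {}
    for call in tool_calls:
        call(slots)
    return success_guard(slots)


flow_that_books = run_flow([mock_schedule_event])  # agent called the tool
flow_that_forgets = run_flow([])                   # agent never called it
```

The second flow still fails with mocks enabled, which is the behaviour the constraint is meant to protect.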

Mocks are gated entirely by an environment variable — build_eval_backend_tools(flow) returns an empty tuple unless RIFF_EVAL_MOCK_BACKENDS=1, and even then only injects mocks for tools this flow's states actually allow.
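The gating logic might look roughly like the sketch below. It is not the real build_eval_backend_tools: the function name, the EVAL_MOCKS table, and the representation of a flow as a mapping from state name to allowed tool names are all assumptions made for illustration. Only the environment variable name and the two conditions come from the text above.

```python
import os
from typing import Any, Dict, Iterable, Mapping, Tuple

# Hypothetical mock table: tool name -> slots its success path writes.
EVAL_MOCKS: Dict[str, Dict[str, Any]] = {
    "schedule_event": {"_booked_event_id": "evt-eval-mock"},
    "cc_dispatch_inject": {"_cc_dispatch_completed": True},
}


def select_mock_names_sketch(
    state_tools: Mapping[str, Iterable[str]],
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, ...]:
    """Return the mocked tool names for a flow, or () when the gate is off."""
    # Gate 1: nothing is injected unless the flag is exactly "1".
    if env.get("RIFF_EVAL_MOCK_BACKENDS") != "1":
        return ()
    # Gate 2: only tools that some state of this flow allows are mocked.
    allowed = {tool for tools in state_tools.values() for tool in tools}
    return tuple(sorted(name for name in EVAL_MOCKS if name in allowed))


# Assumed shape of a flow: state name -> tools that state allows.
apartment_states = {"collect_time": ["lookup_slots"], "book": ["schedule_event"]}

mocks_off = select_mock_names_sketch(apartment_states, env={})
mocks_on = select_mock_names_sketch(
    apartment_states, env={"RIFF_EVAL_MOCK_BACKENDS": "1"}
)
```

Passing the environment in as a parameter keeps the sketch testable without mutating the process environment.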

Use case — the mock as a diagnostic

The mock is not just a crutch; it is a classifier. Turn it on and re-run, and the flows that recover were “backend absent,” while the flows that still fail are “flow broken.” Measured 2026-06-21 (commit def3e89, scoped in a142092):

| Flow | With mock | Verdict |
| --- | --- | --- |
| apartment_scheduler | correctness 0 → 10, reaches success | ✅ was genuinely backend-absent |
| ceo_command_center | still fails: stalls at dispatch_intent (a confirm/resolve oscillation, cohesion 0.16) | ❌ real flow bug, before the backend |
| property_trouble_ticket | still fails: escalates at verification | ❌ real classifier bug, before the backend |

So the mock proved that only apartment_scheduler was a backend-absence victim. The scope is deliberately narrow and not over-claimed: the other two flows fail on real bugs that occur before the backend is reached, so the mock has no effect on them.
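The classification rule implied by the comparison above can be written down as a tiny decision function. This is a sketch only: the function name and the "healthy" label are hypothetical, and the pass/fail inputs below simply restate the three measured outcomes as booleans.

```python
from typing import Dict, Tuple


def classify_flow(passed_without_mock: bool, passed_with_mock: bool) -> str:
    """Hypothetical verdict from a pair of runs (mock off, mock on)."""
    if passed_without_mock:
        return "healthy"          # label assumed for completeness
    if passed_with_mock:
        return "backend-absent"   # recovered once the tool could succeed
    return "flow-broken"          # still fails even with the mock on


# (passed without mock, passed with mock), restating the measured outcomes.
runs: Dict[str, Tuple[bool, bool]] = {
    "apartment_scheduler": (False, True),
    "ceo_command_center": (False, False),
    "property_trouble_ticket": (False, False),
}

verdicts = {flow: classify_flow(*outcome) for flow, outcome in runs.items()}
```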

Example

# run an eval with backend mocks on — only apartment_scheduler's schedule_event is mocked
RIFF_EVAL_MOCK_BACKENDS=1 python -m riff.flow_eval --flows apartment_scheduler --no-judge --out OUT
python scripts/eval_db.py --ingest-aspects --dirs OUT
python scripts/eval_db.py --matrix | grep apartment_scheduler

Without the flag, the same run leaves apartment_scheduler at correctness 0 (the live matrix shows the gemini-agent, no-mock row at operability 0.3); with it, the booking path completes and the flow becomes scorable.

Where it fits

Backend mocks make backend-dependent flows measurable in the matrix and the per-state layer. They pair with scripts/stall_detector.py, which classifies a stalled conversation as backend-stall vs flow-bug vs dead-end. The design lives in docs/flow-eval/BACKEND-STALL-DESIGN.md.