Proving a Flow Before It Ships

Catch the bad call in CI, not in production. A persona panel plus deterministic scanners turn “does this flow hold up?” into a pass/fail gate, and per-state certification proves which model can drive which state.

The problem it solves

A flow can lint clean, pass its unit tests, and still fall apart on a real call: it re-asks for something the caller already gave, loops a confirmation, reads an internal slot name aloud, or quietly escalates instead of finishing. None of that shows up until a person dials in. Shift-left acceptance runs a panel of caller personas against the flow and scans the resulting transcripts for those exact failure shapes — before deploy, deterministically, with no human in the loop.

The acceptance gate

scripts/flow_acceptance.py drives the persona panel (cooperative, oversharer, corrector, privacy-guarded, skeptical) against a flow, then composes deterministic turn-invariant scanners over every conversation. Exit 0 = pass, 1 = fail, 2 = error — drop it straight into CI.

# prove the flagship maintenance-intake flow across the persona panel
python scripts/flow_acceptance.py --flow apartment_intake_SFO
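The exit-code contract above is what makes the gate easy to wire into CI. A minimal sketch of a wrapper follows, assuming only the `--flow` flag and the 0/1/2 exit codes described here; the helper names `interpret_exit_code` and `run_gate` are hypothetical and not part of the script.

```python
import subprocess
import sys

# Exit-code contract stated above: 0 = pass, 1 = fail, 2 = error.
VERDICTS = {0: "pass", 1: "fail", 2: "error"}


def interpret_exit_code(code: int) -> str:
    """Map the gate's exit code to a verdict.

    Anything outside the documented contract (for example a negative code
    from a killed process) is treated as an error, never as a pass.
    """
    return VERDICTS.get(code, "error")


def run_gate(flow: str) -> str:
    """Run the acceptance gate for one flow and return its verdict (hypothetical helper)."""
    result = subprocess.run(
        [sys.executable, "scripts/flow_acceptance.py", "--flow", flow],
        check=False,  # the return code is inspected here instead of raising
    )
    return interpret_exit_code(result.returncode)
```

A CI step could call `run_gate("apartment_intake_SFO")` and exit non-zero unless the verdict is `"pass"`, so that a failed gate and a crashed gate both block the deploy.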

Each scanner targets one observable failure mode — no LLM judge, no flake:

Scanner	Catches
`scan_caller_repeats`	caller has to repeat something already given (“I already told you…”)
`scan_stale_asks`	the agent asks for a slot that is already filled
`scan_collect_loops`	a collection state revisited without progress
`scan_repr_leaks`	a raw internal slot name spoken aloud (“entry permission”)
`find_double_beat`	two questions crammed into one agent turn
correctness gate	a cooperative caller must actually complete the task (correctness score of 10)
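To make "one observable failure mode" concrete, here is a rough sketch of how a stale-ask check could work. It is not the real `scan_stale_asks`: the `Turn` and `Finding` shapes, the `asks_for` and `fills` fields, and the slot names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Turn:
    """One transcript turn (hypothetical shape, not the real RIFF type)."""
    speaker: str                      # "agent" or "caller"
    text: str
    state: str                        # flow state that produced the turn
    asks_for: Optional[str] = None    # slot the agent requests on this turn, if any
    fills: Tuple[str, ...] = ()       # slots the caller supplies on this turn


@dataclass(frozen=True)
class Finding:
    scanner: str
    state: str
    turn_index: int
    detail: str


def scan_stale_asks(turns: List[Turn]) -> List[Finding]:
    """Flag agent turns that ask for a slot the caller has already filled."""
    filled: Set[str] = set()
    findings: List[Finding] = []
    for i, turn in enumerate(turns):
        if turn.speaker == "agent" and turn.asks_for in filled:
            findings.append(Finding("scan_stale_asks", turn.state, i,
                                    f"asked for filled slot {turn.asks_for!r}"))
        elif turn.speaker == "caller":
            # Record what the caller has volunteered so far.
            filled.update(turn.fills)
    return findings
```

Because the check only compares what was asked against what was already filled, it needs no model call, and each finding carries the state that produced it.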

Why deterministic scanners, not a judge? The LLM judge is for nuance (humanness) and is noise-dominated (±1.5 pts single-run). A regression gate must be repeatable: the same transcript always yields the same verdict. These scanners check structural facts about the conversation, so the gate never flakes — exactly what CI needs.
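The repeatability argument can be illustrated with a small sketch of scanner composition. Everything here is an assumption for illustration: the `(speaker, text)` transcript shape, the crude question-mark and phrase heuristics standing in for the real `find_double_beat` and `scan_caller_repeats`, and the `run_gate` helper itself.

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]           # (speaker, text) pairs; hypothetical shape
Scanner = Callable[[Transcript], List[str]]  # returns human-readable findings


def find_double_beat(transcript: Transcript) -> List[str]:
    """Crude stand-in: an agent turn with more than one '?' counts as a double beat."""
    return [f"turn {i}: {text.count('?')} questions in one agent turn"
            for i, (speaker, text) in enumerate(transcript)
            if speaker == "agent" and text.count("?") > 1]


def scan_caller_repeats(transcript: Transcript) -> List[str]:
    """Crude stand-in: the caller signals that they are repeating themselves."""
    markers = ("i already told you", "as i said")
    return [f"turn {i}: caller repeats themselves"
            for i, (speaker, text) in enumerate(transcript)
            if speaker == "caller" and any(m in text.lower() for m in markers)]


def run_gate(transcripts: List[Transcript],
             scanners: List[Scanner]) -> Dict[str, List[str]]:
    """Apply every scanner to every conversation; an empty result means the gate passes."""
    findings: Dict[str, List[str]] = {}
    for n, transcript in enumerate(transcripts):
        for scanner in scanners:
            hits = scanner(transcript)
            if hits:
                key = f"conversation {n} / {scanner.__name__}"
                findings.setdefault(key, []).extend(hits)
    return findings
```

Each scanner is a pure function of the transcript, so running the gate twice on the same conversations returns identical findings, which is the property a regression gate depends on.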

Per-state model certification

A flow’s acceptance is model-specific. The same flow that sails through on one model can trip stale-asks or confirmation loops on another — and not everywhere, just at particular states. Rather than picking one model for the whole flow (and paying for the strongest everywhere), certify each state to the cheapest model proven to drive it cleanly.

scripts/flow_certify.py runs the acceptance panel under each candidate model, attributes every scanner finding to the state that produced it, and builds a state × model matrix — assigning each state its cheapest passing model:

# certify apartment_intake_SFO across two models, cheapest-first
python scripts/flow_certify.py --flow apartment_intake_SFO \
    --models dashscope:qwen-flash,gemini:gemini-2.5-flash \
    --personas cooperative,oversharer --write

Real output (abridged):

# certification matrix — apartment_intake_SFO  (qwen-flash  gemini-2.5-flash)
  ask_issue            qwen-flash=pass      gemini-2.5-flash=untested  -> qwen-flash
  ask_entry            qwen-flash=pass      gemini-2.5-flash=pass      -> qwen-flash
  await_confirmation   qwen-flash=pass      gemini-2.5-flash=fail      -> qwen-flash

The matrix localizes a model’s weakness. In the rows shown, gemini-2.5-flash passes at ask_entry, was never exercised at ask_issue, and fails at await_confirmation. So the finding isn’t “don’t use gemini for this flow”; it’s “gemini fails at this one state.” The runtime then routes each state to its certified model.
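The assignment logic can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `scripts/flow_certify.py`: the `build_matrix` function and its `visited` and `failures` inputs are hypothetical, and models are assumed to be listed cheapest-first as in the command above.

```python
from typing import Dict, List, Optional, Set, Tuple

Matrix = Dict[str, Dict[str, str]]  # state -> model -> "pass" | "fail" | "untested"


def build_matrix(
    states: List[str],
    models: List[str],                # assumed ordered cheapest-first
    visited: Dict[str, Set[str]],     # model -> states the panel reached
    failures: Dict[str, Set[str]],    # model -> states with at least one finding
) -> Tuple[Matrix, Dict[str, Optional[str]]]:
    """Build the state x model matrix and pick the cheapest passing model per state."""
    matrix: Matrix = {}
    assignment: Dict[str, Optional[str]] = {}
    for state in states:
        row: Dict[str, str] = {}
        for model in models:
            if state not in visited.get(model, set()):
                row[model] = "untested"   # never reached: no verdict either way
            elif state in failures.get(model, set()):
                row[model] = "fail"
            else:
                row[model] = "pass"
        matrix[state] = row
        # First passing model in cheapest-first order, or None if nothing passed.
        assignment[state] = next((m for m in models if row[m] == "pass"), None)
    return matrix, assignment
```

Fed the three rows of the abridged output, this sketch assigns every state to qwen-flash. A state with no passing model is left as `None` rather than falling back to a model that was never proven on it.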

Three honest design rules the builder enforces:

An unvisited state is reported as untested, never silently marked pass. If the panel never reached a state, the matrix says so; it does not claim coverage it doesn’t have.
Greetings and other deterministic announces aren’t certified. They never call the model, so there is nothing to certify.
Coverage is the real cost. A small panel under the GOAP-lite selector (one slot at a time, any order) visits only a subset of states per call, so certifying a whole flow needs many persona×repeat runs or targeted per-state scenarios. The matrix makes that gap visible instead of hiding it.
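The three rules above amount to a simple coverage calculation, sketched below. The `coverage_gap` helper, its inputs, and the `greeting` state name are assumptions for illustration and do not come from the certification module.

```python
from typing import Iterable, List, Set, Tuple


def coverage_gap(
    all_states: List[str],
    announce_states: Set[str],      # deterministic announces: never call the model
    runs: Iterable[Set[str]],       # states visited by each persona x repeat run
) -> Tuple[List[str], float]:
    """Return the still-untested certifiable states and the fraction covered."""
    # Announces are excluded up front: there is nothing to certify for them.
    certifiable = [s for s in all_states if s not in announce_states]
    visited: Set[str] = set().union(*runs)
    untested = [s for s in certifiable if s not in visited]
    covered = len(certifiable) - len(untested)
    fraction = covered / len(certifiable) if certifiable else 1.0
    return untested, fraction
```

Adding more runs can only shrink the untested list, so tracking it across runs shows when extra persona×repeat runs stop paying off and a targeted per-state scenario is the cheaper way to close the gap.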

Where it fits

Acceptance is the deterministic floor; the eval matrix and judges measure quality above it; the transcript audit confirms any number against its conversation. Together: a flow can’t ship until it provably handles its persona panel, on a model proven to drive each of its states.

RIFF Eval · Shift-left acceptance & per-state certification · riff/flow_eval/acceptance.py, riff/flow_eval/certification.py