Proving a Flow Before It Ships
Catch the bad call in CI, not in production. A persona panel + deterministic scanners turn “does this flow hold up?” into a pass/fail gate — and per-state certification proves which model can drive which state.
The problem it solves
A flow can lint clean, pass its unit tests, and still fall apart on a real call: it re-asks for something the caller already gave, loops a confirmation, reads an internal slot name aloud, or quietly escalates instead of finishing. None of that shows up until a person dials in. Shift-left acceptance runs a panel of caller personas against the flow and scans the resulting transcripts for those exact failure shapes — before deploy, deterministically, with no human in the loop.
The acceptance gate
scripts/flow_acceptance.py drives the persona panel (cooperative, oversharer, corrector,
privacy-guarded, skeptical) against a flow, then composes deterministic turn-invariant scanners
over every conversation. Exit 0 = pass, 1 = fail, 2 = error — drop it straight into CI.
# prove the flagship maintenance-intake flow across the persona panel
python scripts/flow_acceptance.py --flow apartment_intake_SFO
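The exit-code contract above (0 = pass, 1 = fail, 2 = error) can be consumed from a small CI wrapper. The following is a minimal sketch, not part of the project: `verdict_for` and `run_gate` are hypothetical helper names, and treating any unexpected exit code as an error is an assumption rather than documented behavior.

```python
import subprocess
import sys

# Exit-code contract as stated for scripts/flow_acceptance.py.
VERDICTS = {0: "pass", 1: "fail", 2: "error"}


def verdict_for(returncode: int) -> str:
    """Map an exit code to a verdict.

    Assumption: any code outside the documented 0/1/2 is treated as an error.
    """
    return VERDICTS.get(returncode, "error")


def run_gate(flow: str) -> str:
    """Run the acceptance gate for one flow and return its verdict."""
    result = subprocess.run(
        [sys.executable, "scripts/flow_acceptance.py", "--flow", flow],
        check=False,  # the exit code is interpreted here, not raised
    )
    return verdict_for(result.returncode)
```

Keeping "fail" and "error" distinct lets a pipeline block a merge on a real finding while retrying or alerting on an infrastructure problem.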
Each scanner targets one observable failure mode — no LLM judge, no flake:
| Scanner | Catches |
|---|---|
| scan_caller_repeats | caller has to repeat something already given (“I already told you…”) |
| scan_stale_asks | the agent asks for a slot that is already filled |
| scan_collect_loops | a collection state revisited without progress |
| scan_repr_leaks | a raw internal slot name spoken aloud (“entry permission”) |
| find_double_beat | two questions crammed into one agent turn |
| correctness gate | a cooperative caller must actually complete the task (correctness 10) |
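To make "deterministic turn-invariant scanner" concrete, here is a hedged sketch of two rows from the table. It is an illustrative re-implementation, not the project's code: the `Turn` record and its `asks` and `fills` fields are hypothetical, and counting question marks is a stand-in heuristic for detecting two questions in one turn.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    """One transcript turn (hypothetical shape, for illustration only)."""
    speaker: str                      # "agent" or "caller"
    text: str
    state: str = ""                   # flow state that produced an agent turn
    asks: Optional[str] = None        # slot the agent asked for, if any
    fills: dict = field(default_factory=dict)  # slots this turn filled


def scan_stale_asks(turns):
    """Flag agent turns that ask for a slot already filled earlier."""
    filled = set()
    findings = []
    for index, turn in enumerate(turns):
        if turn.speaker == "agent" and turn.asks in filled:
            findings.append((index, turn.state, turn.asks))
        filled.update(turn.fills)  # slots become "filled" after this turn
    return findings


def find_double_beat(turns):
    """Flag agent turns containing two or more questions (heuristic: '?' count)."""
    return [
        (index, turn.state)
        for index, turn in enumerate(turns)
        if turn.speaker == "agent" and turn.text.count("?") >= 2
    ]


# Oversharer-style exchange: the caller volunteers entry permission early.
demo = [
    Turn("agent", "What's the issue?", state="ask_issue", asks="issue"),
    Turn("caller", "The sink leaks, and you can come in any time.",
         fills={"issue": "leaking sink", "entry_permission": "yes"}),
    Turn("agent", "Can we enter the unit? And when is a good time?",
         state="ask_entry", asks="entry_permission"),
]
```

Both functions are pure functions of the transcript, so the same conversation always yields the same findings, and each finding carries the state that produced it.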
Per-state model certification
A flow’s acceptance is model-specific. The same flow that sails through on one model can trip stale-asks or confirmation loops on another — and not everywhere, just at particular states. Rather than picking one model for the whole flow (and paying for the strongest everywhere), certify each state to the cheapest model proven to drive it cleanly.
scripts/flow_certify.py runs the acceptance panel under each candidate model, attributes
every scanner finding to the state that produced it, and builds a state × model
matrix — assigning each state its cheapest passing model:
# certify apartment_intake_SFO across two models, cheapest-first
python scripts/flow_certify.py --flow apartment_intake_SFO \
--models dashscope:qwen-flash,gemini:gemini-2.5-flash \
--personas cooperative,oversharer --write
Real output (abridged):
# certification matrix — apartment_intake_SFO (qwen-flash gemini-2.5-flash)
ask_issue qwen-flash=pass gemini-2.5-flash=untested -> qwen-flash
ask_entry qwen-flash=pass gemini-2.5-flash=pass -> qwen-flash
await_confirmation qwen-flash=pass gemini-2.5-flash=fail -> qwen-flash
In the matrix above, gemini-2.5-flash fails only at await_confirmation, so the
finding isn’t “don’t use gemini for this flow,” it’s “gemini is fine
except at this one state.” The runtime then routes each state to its certified model.
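The matrix-and-assignment step can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic of scripts/flow_certify.py: `build_matrix` and `certify` are hypothetical names, the inputs are assumed to be per-model sets of visited states and of states with attributed findings, and the demo data is chosen only to reproduce the abridged matrix shown above.

```python
# Candidate models, listed cheapest-first as in the command above.
MODELS = ["qwen-flash", "gemini-2.5-flash"]
STATES = ["ask_issue", "ask_entry", "await_confirmation"]


def build_matrix(states, models, visited, findings):
    """Return {state: {model: "pass" | "fail" | "untested"}}.

    visited:  {model: set of states the panel reached under that model}
    findings: {model: set of states with at least one scanner finding}
    """
    matrix = {}
    for state in states:
        row = {}
        for model in models:
            if state not in visited.get(model, set()):
                row[model] = "untested"   # never reached: no claim either way
            elif state in findings.get(model, set()):
                row[model] = "fail"
            else:
                row[model] = "pass"
        matrix[state] = row
    return matrix


def certify(matrix, models):
    """Assign each state the first (cheapest) passing model, or None."""
    return {
        state: next((m for m in models if row[m] == "pass"), None)
        for state, row in matrix.items()
    }


# Inputs chosen to mirror the abridged output above.
visited = {
    "qwen-flash": {"ask_issue", "ask_entry", "await_confirmation"},
    "gemini-2.5-flash": {"ask_entry", "await_confirmation"},
}
findings = {"gemini-2.5-flash": {"await_confirmation"}}

matrix = build_matrix(STATES, MODELS, visited, findings)
certified = certify(matrix, MODELS)
```

Because the model list is ordered cheapest-first, "first passing model" and "cheapest passing model" coincide; a state with no passing model maps to None rather than to a guess.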
Three honest design rules the builder enforces:
- An unvisited state is untested, never silently pass. If the panel never reached a state, the matrix says so; it does not pretend coverage it doesn’t have.
- Greetings and other deterministic announces aren’t certified. They never call the model, so there is nothing to certify.
- Coverage is the real cost. A small panel under the GOAP-lite selector (one slot at a time, any order) visits only a subset of states per call, so certifying a whole flow needs many persona×repeat runs or targeted per-state scenarios. The matrix makes that gap visible instead of hiding it.
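The coverage rule can be made visible with a small calculation: merge the states each run visited, drop deterministic states, and list what is still untested per model. This is a sketch under assumptions; `coverage_gaps`, the `greeting` state name, and the run data are all hypothetical and not taken from the project.

```python
def coverage_gaps(flow_states, deterministic, runs):
    """Return {model: [certifiable states no run has reached yet]}.

    flow_states:   all states in the flow, in order
    deterministic: states that never call the model (not certified at all)
    runs:          iterable of (model, visited_states), one per persona run
    """
    certifiable = [s for s in flow_states if s not in deterministic]
    visited = {}
    for model, states in runs:
        visited.setdefault(model, set()).update(states)
    return {
        model: [s for s in certifiable if s not in seen]
        for model, seen in visited.items()
    }


# Hypothetical runs: each call visits only a subset of the states.
flow_states = ["greeting", "ask_issue", "ask_entry", "await_confirmation"]
runs = [
    ("qwen-flash", ["greeting", "ask_issue", "await_confirmation"]),
    ("qwen-flash", ["greeting", "ask_entry", "await_confirmation"]),
    ("gemini-2.5-flash", ["greeting", "ask_entry", "await_confirmation"]),
]
gaps = coverage_gaps(flow_states, {"greeting"}, runs)
```

A non-empty list for a model is a cue to add persona×repeat runs or a targeted per-state scenario before treating that model as certified for the flow.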
Where it fits
Acceptance is the deterministic floor; the eval matrix and judges measure quality above it; the transcript audit checks any reported number against the conversation it came from. Together: a flow can’t ship until it provably handles its persona panel, on a model proven to drive each of its states.