New · 2026-06-24 · Design proposed; measurement keystone (Phase 0 / PR3) now shipped & validated
Effective LLM Policy
One conversation does not need one model. RIFF is a finite-state machine, so the natural unit of model choice is the state — route a greeting, a slot read-back, and a booking-confirmation to the model each one actually needs.
await_confirmation → qwen, because gemini "FAILED" that state — did not survive
scrutiny. The cert FAIL came from flow-rule and scanner artifacts that vary from run to run
(a slot-less reprompt plus a stale ask that the scanner over-flagged), not from a model-quality gap.
The transcript attributed the stumble to the wrong state, and qwen only "passed" because it took a
cleaner collect path. Routing was the wrong fix for this state. How we caught it (a panel of
independent judges, a deterministic guard, and repeat-N) is its own page:
Judge-the-Judge & discipline.
The framework stands; the assignment needs evidence, not intuition.
0: token counts were dropped
when TurnResult was built, and the bench read the integer turn count as if it
were per-turn rows. Now TurnResult carries input_tokens/output_tokens,
and each conversation records turn_rows — one row per agent turn with
{state_id (the state the model ran in, not its destination), effective_model, in_tokens,
out_tokens, latency_ms}. Validated live on gemini-2.5-flash
(coffee flow): per-turn tokens non-zero, correct from-state attribution, model recorded — evidence
persisted to quality.db (phase0_turn_evidence). Additive only; codex-approved;
5 new tests, 203 existing pass. Next: Phase 1 — repeat-N measurement + a withholding
persona so ask states are actually exercised, then a flow-level cost target.
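As a rough illustration of the per-turn record described above, the sketch below models one row per agent turn and rolls tokens up by from-state. The field names come from the text; the `TurnRow` class, the `tokens_by_state` helper, the `collect_order` state name, and all sample values are hypothetical and are not RIFF's actual code or measured data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnRow:
    """One row per agent turn (hypothetical shape; field names from the text)."""
    state_id: str         # the state the model ran in, not its destination
    effective_model: str
    in_tokens: int
    out_tokens: int
    latency_ms: float


def tokens_by_state(rows):
    """Sum (input, output) tokens per from-state."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        totals[row.state_id][0] += row.in_tokens
        totals[row.state_id][1] += row.out_tokens
    return {state: tuple(pair) for state, pair in totals.items()}


# Made-up sample values, for illustration only.
turn_rows = [
    TurnRow("greet", "gemini-2.5-flash", 120, 18, 410.0),
    TurnRow("collect_order", "gemini-2.5-flash", 340, 25, 520.0),
    TurnRow("collect_order", "gemini-2.5-flash", 395, 22, 498.0),
]
per_state = tokens_by_state(turn_rows)
```

Keying the rollup on the from-state is what makes a per-state cost comparison possible later; attributing tokens to the destination state would charge each turn to the wrong cell.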
Today every turn in a call runs on one model picked per session. But a flow's states are not equally hard. A greeting is deterministic. A simple slot grab is easy. A booking confirmation, where a wrong “yes” books the wrong appointment, is where you want your strongest reasoning. Effective LLM Policy lets a flow declare the right model per state, falling back to sane defaults so you only override what matters.
How a model gets chosen: the cascade
Four tiers, broad to narrow. Each tier may set a value or stay silent (inherit).
The resolver walks from the most specific tier it can and takes the first one that speaks.
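A minimal sketch of that walk, assuming the four tiers are state, group, flow, and global default (the tier names used in the status table on this page). The `resolve_model` function and the use of `None` to represent a silent tier are illustrative assumptions, not the signature of RIFF's `resolve_adapter`.

```python
from typing import Optional

INHERIT = None  # assumption: a silent tier is represented as None


def resolve_model(
    state: Optional[str],
    group: Optional[str],
    flow: Optional[str],
    global_default: str,
) -> str:
    """Return the value of the first tier that speaks, most specific first."""
    for value in (state, group, flow):
        if value is not INHERIT:
            return value
    # Every narrower tier stayed silent, so the global default decides.
    return global_default


# Only the flow tier speaks, so it wins over the global default.
flow_only = resolve_model(INHERIT, INHERIT, "fast", "smart")
# A state-level override beats everything above it.
state_override = resolve_model("smart", INHERIT, "fast", "fast")
```

Because the global default is a required value rather than an optional one, the walk always terminates with an answer.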
inherit means “keep walking up.”
The four cases we want to prove
Each is a separate hypothesis with its own measurement. We do not assume any of them. We test each one per state against the current single-model baseline.
A smaller model gives equivalent results
For many states (greet, simple collect, routing), a cheap model scores the same on correctness and humanness as the expensive one. If so, route those states to the cheap model and bank the cost and latency. RIFF's per-state cert matrix already measures exactly which model passes which state — this case is “act on a green cell.”
Win: cost ↓ and latency ↓ with correctness/humanness flat (inside the noise floor).
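"Act on a green cell" can be sketched as a lookup: among the models whose cert cell is green for a state, take the cheapest. The matrix layout, the `price_rank` ordering, and the `cheapest_passing` helper are assumptions for illustration; the cell values are placeholders, not measured results.

```python
def cheapest_passing(cert_matrix, price_rank, state_id):
    """Pick the cheapest model with a passing cert cell for this state.

    Returns None when no model passes, so the caller keeps the baseline.
    """
    passing = [
        model
        for model, cells in cert_matrix.items()
        if cells.get(state_id, False)
    ]
    return min(passing, key=lambda model: price_rank[model], default=None)


# Placeholder cells, not measured results.
cert_matrix = {
    "fast": {"greet": True, "await_confirmation": False},
    "smart": {"greet": True, "await_confirmation": True},
}
price_rank = {"fast": 0, "smart": 1}  # lower rank means cheaper

greet_choice = cheapest_passing(cert_matrix, price_rank, "greet")
confirm_choice = cheapest_passing(cert_matrix, price_rank, "await_confirmation")
```

Failing models are filtered out before price is compared, so a cheap model can never win a state it does not pass.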
When a small model fails, escalate to a smarter one
Run cheap-first; on a detectable failure — low confidence, a validation fail, a repair loop, a
refusal — re-run the turn on a stronger model. RIFF already emits these signals (slot validators, the
stalled guard, the no-progress backstop). This is the classic cascade: cheap
model first, escalate on a quality threshold.
Win: same end correctness as always-smart, lower average cost, with the retry's extra latency paid only on the minority of turns that trip the signal. Watch the escalation rate.
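The cheap-first pattern can be sketched as follows. The `TurnOutcome` fields stand in for the signals named above (validation fail, stall, refusal); the class, the function names, and the stub models are hypothetical and only show the control flow, not RIFF's implementation.

```python
from dataclasses import dataclass


@dataclass
class TurnOutcome:
    text: str
    validation_ok: bool = True
    stalled: bool = False
    refused: bool = False


def needs_escalation(outcome):
    """True when any detectable failure signal fired."""
    return (not outcome.validation_ok) or outcome.stalled or outcome.refused


def run_turn(prompt, cheap, strong):
    """Run cheap first; re-run on the strong model only on a failure signal.

    Returns (outcome, escalated).
    """
    first = cheap(prompt)
    if not needs_escalation(first):
        return first, False
    return strong(prompt), True


def escalation_rate(flags):
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def cheap_stub(prompt):
    # Stand-in: pretend the cheap model fails slot validation on hedged input.
    return TurnOutcome("ok", validation_ok="maybe" not in prompt)


def strong_stub(prompt):
    return TurnOutcome("resolved")


prompts = ["yes", "maybe tuesday", "no", "2pm"]
results = [run_turn(p, cheap_stub, strong_stub) for p in prompts]
rate = escalation_rate(escalated for _, escalated in results)
```

Tracking `rate` alongside cost matters because a high escalation rate means most turns pay for both models, which can cost more than always running the strong one.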
Override one risky state without touching the rest
A layered config is easier to operate: set a sane global default once, override a single state in one line, and roll a model change out (or back) at one tier. A load-time gate catches a typo’d model name or an unreachable class at boot — never on a live call.
Win: an operator changes the model for await_confirmation only, in
one line, with a build error on a bad reference.
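One way to picture the load-time gate, assuming a flow exposes a class table and a per-state class reference. `FlowConfigError`, `validate_llm_refs`, and the dictionary shapes are illustrative, not RIFF's API; the class names and model names are the ones this page lists. This sketch only covers the typo case and does not check whether a model is reachable.

```python
class FlowConfigError(ValueError):
    """Raised at load time for a bad model-class reference (hypothetical)."""


def validate_llm_refs(class_table, state_llm):
    """Fail fast if any state names a class missing from the class table.

    A value of None means the state inherits and needs no check.
    """
    unknown = {
        state: cls
        for state, cls in state_llm.items()
        if cls is not None and cls not in class_table
    }
    if unknown:
        raise FlowConfigError(f"unknown LLM class references: {unknown}")


class_table = {"fast": "gemini-2.5-flash", "smart": "gemini-2.5-pro"}

# The one-line override: only await_confirmation names a class.
validate_llm_refs(class_table, {"greet": None, "await_confirmation": "smart"})

try:
    validate_llm_refs(class_table, {"await_confirmation": "smrt"})  # typo
    typo_rejected = False
except FlowConfigError:
    typo_rejected = True
```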
Deterministic first, then escalate to smarter models
The cheapest model is no model. Some states are deterministic — a fixed greeting, a yes/no gate, a menu choice. Serve those with static audio or a rule at zero LLM latency, and only climb to small → large LLMs when the deterministic path can’t resolve the input. RIFF already has the bottom rung (the Collect Pattern’s fast-path and deterministic stall recovery); this generalizes it into an explicit ladder.
Win: median turn latency ↓ with correctness held, because a measurable fraction of turns never reach an LLM.
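The ladder can be sketched as an ordered list of rungs, where each rung either resolves the input or returns `None` to pass it up. The rule, the rung names, and both LLM stubs are hypothetical stand-ins; real rungs would call static audio, a rule engine, or a model.

```python
def yes_no_rule(utterance):
    """Deterministic rung: resolve a plain yes/no, else return None."""
    word = utterance.strip().lower().rstrip(".!")
    if word in {"yes", "yeah", "yep"}:
        return "yes"
    if word in {"no", "nope"}:
        return "no"
    return None


def small_llm_stub(utterance):
    # Stand-in for a small model that handles one easy paraphrase.
    return "yes" if "sounds good" in utterance.lower() else None


def large_llm_stub(utterance):
    # Stand-in for the strongest rung, which always produces an answer here.
    return "no"


def resolve(utterance, rungs):
    """Try each rung in order; return (rung_name, answer) from the first hit."""
    for name, rung in rungs:
        answer = rung(utterance)
        if answer is not None:
            return name, answer
    return "unresolved", None


ladder = [
    ("rule", yes_no_rule),
    ("small", small_llm_stub),
    ("large", large_llm_stub),
]
```

Returning the rung name with the answer is what makes the deterministic hit rate measurable: count how often `"rule"` comes back.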
How we prove it — and keep proving it
Nothing ships on intuition. Every case is an A/B against the current per-flow baseline, scored on four axes and run on a repeatable harness so we can re-run it on every change, not just once. Correctness is a hard gate that is applied before cost or latency is even considered: a model that fails a state's acceptance check is never routed to that state, at any price.
MODEL_PRICES); final numbers come from our actual contract. The mechanism and the
deltas are what this review asks you to approve, not the absolute dollar values.
| Axis | What we measure | Existing tool |
|---|---|---|
| Cost | tokens × per-model price; escalation rate; deterministic hit rate | eval_db.py + per-turn model log |
| Latency | time-to-first-audio, full-turn wall-clock, per-rung breakdown | latency_bench.py, timing_benchmark.py |
| Intelligence | per-state correctness / cert pass-fail | flow_certify.py, evaluate_flows.py |
| Human-ness | judged humanness (curt/robotic vs natural) | rubric + multi-judge panel |
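The Cost row reduces to simple arithmetic, sketched below. The name `MODEL_PRICES` appears on this page, but its shape here (per-class input and output prices per million tokens) is an assumption, and the numbers are placeholders rather than real list or contract prices.

```python
# Placeholder USD prices per million tokens. NOT real prices.
MODEL_PRICES = {
    "fast": {"in": 1.0, "out": 4.0},
    "smart": {"in": 10.0, "out": 40.0},
}


def turn_cost(model, in_tokens, out_tokens, prices=MODEL_PRICES):
    """Cost of one turn: tokens times the per-model price."""
    price = prices[model]
    return (in_tokens * price["in"] + out_tokens * price["out"]) / 1_000_000


def conversation_cost(rows, prices=MODEL_PRICES):
    """Sum turn costs over (model, in_tokens, out_tokens) rows."""
    return sum(turn_cost(m, i, o, prices) for m, i, o in rows)


# Made-up token counts, for illustration only.
rows = [("fast", 1000, 100), ("smart", 2000, 50)]
total = conversation_cost(rows)
```

Because the prices are injected rather than hard-coded into the arithmetic, swapping in contract prices changes the absolute totals without touching the comparison logic.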
The data is the asset — tiered, per-LLM, kept across switches
The four axes are not measured once and thrown away. They accumulate into a retained, append-only profile keyed per LLM × architectural scope. That is what turns “which model for this state?” into a standing capability instead of a one-off study.
cert_pass ANDs. Every row is pinned to its model, and the store is append-only: switching the live model adds rows without deleting the old model's history. So a cost/human-ness/latency tradeoff can be re-decided at any level, any time, on accumulated evidence.
- Stay in-family. RIFF’s classes are fast: gemini-2.5-flash and smart: gemini-2.5-pro, which are the same family and therefore the low-drift case. Cross-family hops spike drift, tool-format mismatch, and prompt-cache loss at once.
- Switch at boundaries; never downgrade into late states. Route at state/segment edges, not mid-context. RIFF carries structured slots in the session, not just raw chat, which is a built-in grounding mitigation.
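To make "append-only, pinned to its model" concrete, here is a small sketch using SQLite triggers to reject updates and deletes. The `model_profile` table, its columns, the `scope` string format, and the sample values are all assumptions for illustration; this is not the schema of quality.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE model_profile (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        model TEXT NOT NULL,        -- every row is pinned to its model
        scope TEXT NOT NULL,        -- hypothetical key, e.g. 'state:greet'
        cert_pass INTEGER NOT NULL,
        cost_usd REAL NOT NULL,
        latency_ms REAL NOT NULL
    )
    """
)
# Enforce append-only in the database itself: no UPDATE, no DELETE.
for action in ("UPDATE", "DELETE"):
    conn.execute(
        f"""
        CREATE TRIGGER block_{action.lower()} BEFORE {action} ON model_profile
        BEGIN
            SELECT RAISE(ABORT, 'model_profile is append-only');
        END
        """
    )


def record(model, scope, cert_pass, cost_usd, latency_ms):
    conn.execute(
        "INSERT INTO model_profile (model, scope, cert_pass, cost_usd, latency_ms)"
        " VALUES (?, ?, ?, ?, ?)",
        (model, scope, int(cert_pass), cost_usd, latency_ms),
    )


# Made-up values: switching the live model adds a row and keeps the old one.
record("gemini-2.5-flash", "state:greet", True, 0.0004, 410.0)
record("gemini-2.5-pro", "state:greet", True, 0.0031, 890.0)

try:
    conn.execute("DELETE FROM model_profile WHERE model = 'gemini-2.5-flash'")
    delete_blocked = False
except sqlite3.DatabaseError:
    delete_blocked = True

row_count = conn.execute("SELECT COUNT(*) FROM model_profile").fetchone()[0]
```

Putting the guard in the database rather than in application code means a stray script cannot erase an old model's history either.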
What exists, what’s proposed
| Tier | Status | Where |
|---|---|---|
| global default | exists (implicit) | server-default adapter |
| flow | exists | FlowLlmConfig.default_class + class table |
| group | proposed | add llm field to GoalSegment |
| state | exists | StateDef.llm |
| primary-turn routing | not wired | today only the continuation cue calls resolve_adapter |
resolve_adapter into the primary turn behind
RIFF_PRIMARY_LLM_ROUTING (off by default), with an adapter cache and per-turn model
logging — this is the unblock that makes any case measurable. PR2 adds the
group tier and the cascade. Per-flow stays the production default until a state has A/B evidence that
switching improves the outcome.
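A minimal sketch of a default-off flag combined with an adapter cache. The environment variable name comes from the text; the accepted truthy values, the function names, and the dictionary used as a stand-in adapter are assumptions and do not reflect RIFF's actual `resolve_adapter`.

```python
import os
from functools import lru_cache


def primary_routing_enabled(env=None):
    """Off by default: only an explicit truthy value turns routing on."""
    env = os.environ if env is None else env
    value = env.get("RIFF_PRIMARY_LLM_ROUTING", "")
    return value.strip().lower() in {"1", "true", "on"}


@lru_cache(maxsize=None)
def get_adapter(model_name):
    """Build one adapter per model and reuse it (stand-in object)."""
    return {"model": model_name}


def adapter_for_turn(state_model, session_model, env=None):
    """Use the per-state model only when the flag is on and a model is set."""
    use_state = primary_routing_enabled(env) and state_model is not None
    return get_adapter(state_model if use_state else session_model)


flag_off = adapter_for_turn("gemini-2.5-pro", "gemini-2.5-flash", env={})
flag_on = adapter_for_turn(
    "gemini-2.5-pro", "gemini-2.5-flash", env={"RIFF_PRIMARY_LLM_ROUTING": "1"}
)
```

With the flag unset, every turn resolves to the session model, so the per-flow production default is unchanged until someone opts in.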