New · 2026-06-24 · Design proposed; measurement keystone (Phase 0 / PR3) now shipped and validated

Effective LLM Policy

One conversation does not need one model. RIFF is a finite-state machine, so the natural unit of model choice is the state — route a greeting, a slot read-back, and a booking-confirmation to the model each one actually needs.

Validation update (2026-06-24): the first routing win was a MISDIAGNOSIS. The mechanism (PR1 + PR2) is built and sound. But the first per-state assignment we shipped (await_confirmation → qwen, because gemini "FAILED" that state) did not survive scrutiny. The cert FAIL comes from flow-rule and scanner artifacts that vary from run to run (a slot-less reprompt plus a stale ask that the scanner over-flagged), not from a model-quality gap. The transcript attributed the stumble to the wrong state, and qwen only "passed" because it took a cleaner collect path. Routing was the wrong fix for this state. How we caught it (a panel of independent judges, a deterministic guard, and repeat-N) is its own page: Judge-the-Judge & discipline. The framework stands; the assignment needs evidence, not intuition.

Phase 0 shipped (2026-06-24, commit db7d3b2): the bench can finally measure cost. The roadmap's keystone, a real per-turn record, is built and validated. Until now the policy bench's cost and latency columns were always 0: token counts were dropped when TurnResult was built, and the bench read the integer turn count as if it were a list of per-turn rows. Now TurnResult carries input_tokens/output_tokens, and each conversation records turn_rows: one row per agent turn with {state_id, effective_model, in_tokens, out_tokens, latency_ms}, where state_id is the state the model ran in, not its destination.

Validated live on gemini-2.5-flash (coffee flow): per-turn tokens are non-zero, from-state attribution is correct, and the model is recorded. Evidence is persisted to quality.db (phase0_turn_evidence). The change is additive only and codex-approved; 5 new tests, 203 existing pass. Next: Phase 1 adds repeat-N measurement and a withholding persona so ask states are actually exercised, then a flow-level cost target.
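As a minimal sketch of the per-turn record described above: the field names come from the turn_rows description, while the `TurnRow` class, the `tokens_by_state` helper, the state names, and all numbers are hypothetical illustrations, not the bench's actual code or measured data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnRow:
    """One agent turn. Hypothetical container; field names follow the record above."""
    state_id: str          # state the model ran in (the from-state), not its destination
    effective_model: str
    in_tokens: int
    out_tokens: int
    latency_ms: float


def tokens_by_state(rows):
    """Sum input tokens, output tokens, and turn counts per from-state."""
    totals = defaultdict(lambda: {"in_tokens": 0, "out_tokens": 0, "turns": 0})
    for row in rows:
        bucket = totals[row.state_id]
        bucket["in_tokens"] += row.in_tokens
        bucket["out_tokens"] += row.out_tokens
        bucket["turns"] += 1
    return dict(totals)


# Illustrative values only, not measured data.
rows = [
    TurnRow("greet", "gemini-2.5-flash", 120, 18, 410.0),
    TurnRow("collect_order", "gemini-2.5-flash", 340, 25, 520.0),
    TurnRow("collect_order", "gemini-2.5-flash", 390, 22, 505.0),
]
per_state = tokens_by_state(rows)
print(per_state["collect_order"])
```

The bug described above corresponds to treating `len(rows)` (an integer) as if it were `rows` itself; keeping the rows as a list is what makes per-state cost and latency derivable.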

Today every turn in a call runs on one model picked per session. But a flow's states are not equally hard. A greeting is deterministic. A simple slot grab is easy. A booking confirmation, where a wrong “yes” books the wrong appointment, is where you want your strongest reasoning. Effective LLM Policy lets a flow declare the right model per state, falling back to sane defaults so you only override what matters.

The one idea: model choice is inheritance with override, exactly like CSS or a subclass. A broad tier sets the default; a narrower tier overrides it. Resolution walks most-specific-first and takes the first tier that sets a value.

How a model gets chosen: the cascade

Four tiers, broad to narrow. Each tier may set a value or stay silent (inherit). The resolver walks from the most specific tier it can and takes the first one that speaks.

Two ordered lookups. First the tier cascade picks which tier's value to use (state → group → flow → global, most-specific-wins). Then a second lookup turns that value into a concrete model: a class name resolves through the flow's class table; anything else is a literal model. inherit means “keep walking up.”
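The two lookups can be sketched as follows. This is an illustration under assumptions, not the project's `resolve_adapter`: the function names `pick_tier_value` and `resolve_model` are hypothetical, and a silent tier is modeled as either `None` or the string `"inherit"`. The class table uses the fast/smart classes named elsewhere in this document.

```python
INHERIT = "inherit"


def pick_tier_value(state=None, group=None, flow=None, global_default=None):
    """First lookup: walk state -> group -> flow -> global, most specific first."""
    tiers = (("state", state), ("group", group),
             ("flow", flow), ("global", global_default))
    for tier_name, value in tiers:
        if value is not None and value != INHERIT:
            return tier_name, value  # first tier that speaks wins
    raise LookupError("no tier sets a model")


def resolve_model(value, class_table):
    """Second lookup: a class name maps through the table; anything else is a literal model."""
    return class_table.get(value, value)


class_table = {"fast": "gemini-2.5-flash", "smart": "gemini-2.5-pro"}

# The state inherits, the group is silent, so the flow's class is used.
tier, value = pick_tier_value(state="inherit", group=None, flow="fast",
                              global_default="gemini-2.5-flash")
print(tier, resolve_model(value, class_table))
```

Keeping the two lookups separate means a class can be repointed in one place (the class table) without touching any tier that references it.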

The four cases we want to prove

Each is a separate hypothesis with its own measurement. We do not assume any of them. We test each one per state against the current single-model baseline.

Case 1 · Right-sizing

A smaller model gives equivalent results

For many states (greet, simple collect, routing), a cheap model scores the same on correctness and humanness as the expensive one. If so, route those states to the cheap model and bank the cost and latency. RIFF's per-state cert matrix already measures exactly which model passes which state — this case is “act on a green cell.”

Win: cost ↓ and latency ↓ with correctness/humanness flat (inside the noise floor).

Case 2 · Escalation

When a small model fails, escalate to a smarter one

Run cheap-first; on a detectable failure — low confidence, a validation fail, a repair loop, a refusal — re-run the turn on a stronger model. RIFF already emits these signals (slot validators, the stalled guard, the no-progress backstop). This is the classic cascade: cheap model first, escalate on a quality threshold.

Win: same end correctness as always-smart, lower average cost, with the retry's extra latency paid only on the minority of turns that trip the signal. Watch the escalation rate.
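A minimal sketch of cheap-first escalation, assuming the failure signal can be expressed as a single predicate over the cheap model's reply. The names `run_with_escalation`, `TurnOutcome`, and `escalation_rate` are hypothetical, and the two "models" are stubs standing in for real adapters.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnOutcome:
    text: str
    model: str
    escalated: bool


def run_with_escalation(
    user_input: str,
    cheap: Callable[[str], str],
    strong: Callable[[str], str],
    failed: Callable[[str], bool],
    cheap_name: str = "fast",
    strong_name: str = "smart",
) -> TurnOutcome:
    """Run the cheap model first; re-run on the strong model only if `failed` trips."""
    reply = cheap(user_input)
    if not failed(reply):
        return TurnOutcome(reply, cheap_name, escalated=False)
    return TurnOutcome(strong(user_input), strong_name, escalated=True)


def escalation_rate(outcomes) -> float:
    """Fraction of turns that paid for the retry."""
    outcomes = list(outcomes)
    return sum(o.escalated for o in outcomes) / len(outcomes) if outcomes else 0.0


# Stubs for illustration only.
def cheap_stub(text: str) -> str:
    return "" if "reschedule" in text else "confirmed"


def strong_stub(text: str) -> str:
    return "rescheduled"


def empty_reply(reply: str) -> bool:
    return reply == ""  # stands in for a validation fail or refusal


outcomes = [run_with_escalation(t, cheap_stub, strong_stub, empty_reply)
            for t in ["yes please", "reschedule to Friday", "yes"]]
print(escalation_rate(outcomes))
```

Tracking the rate alongside correctness matters because a cheap model that escalates on most turns costs more than always running the strong one.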

Case 3 · Ease of deployment

Override one risky state without touching the rest

A layered config is easier to operate: set a sane global default once, override a single state in one line, and roll a model change out (or back) at one tier. A load-time gate catches a typo’d model name or an unreachable class at boot — never on a live call.

Win: an operator changes the model for await_confirmation only, in one line, with a build error on a bad reference.
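The load-time gate might look like the sketch below. It assumes the set of reachable models is known at boot; `validate_policy`, `PolicyConfigError`, and the argument shapes are hypothetical and do not describe RIFF's actual loader.

```python
class PolicyConfigError(ValueError):
    """Raised at load time so a bad reference never reaches a live call."""


def validate_policy(flow_default, class_table, state_overrides, known_models):
    """Collect every unresolved reference, then fail once with the full list."""
    errors = []

    def check(where, value):
        if value == "inherit":
            return  # silent tier, nothing to resolve
        model = class_table.get(value, value)  # class name or literal model
        if model not in known_models:
            errors.append(f"{where}: {value!r} resolves to unknown model {model!r}")

    for class_name, model in class_table.items():
        check(f"class {class_name}", model)
    check("flow default", flow_default)
    for state_id, value in state_overrides.items():
        check(f"state {state_id}", value)

    if errors:
        raise PolicyConfigError("; ".join(errors))


known = {"gemini-2.5-flash", "gemini-2.5-pro"}
classes = {"fast": "gemini-2.5-flash", "smart": "gemini-2.5-pro"}

# A one-line override for a single state passes the gate.
validate_policy("fast", classes, {"await_confirmation": "smart"}, known)

# A typo in the class name is rejected before any call is served.
try:
    validate_policy("fast", classes, {"await_confirmation": "smrt"}, known)
except PolicyConfigError as exc:
    print(exc)
```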

Case 4 · The latency ladder

Deterministic first, then escalate to smarter models

The cheapest model is no model. Some states are deterministic — a fixed greeting, a yes/no gate, a menu choice. Serve those with static audio or a rule at zero LLM latency, and only climb to small → large LLMs when the deterministic path can’t resolve the input. RIFF already has the bottom rung (the Collect Pattern’s fast-path and deterministic stall recovery); this generalizes it into an explicit ladder.

Win: median turn latency ↓ with correctness held, because a measurable fraction of turns never reach an LLM.

The ladder (Case 4) extends the cascade with a non-LLM tier 0. A turn climbs only when the rung below cannot resolve it, so the expensive model is reached only for the inputs that truly need it.
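One way to picture the ladder is an ordered list of rungs, each returning `None` when it cannot resolve the input. This is a sketch under that assumption; `climb`, `yes_no_rule`, and the two stub LLM rungs are hypothetical names, not RIFF's Collect Pattern implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

Rung = Tuple[str, Callable[[str], Optional[str]]]


def climb(user_input: str, rungs: Sequence[Rung]) -> Tuple[str, str]:
    """Try each rung in order; a rung returns None when it cannot resolve the input."""
    for name, handler in rungs:
        result = handler(user_input)
        if result is not None:
            return name, result
    raise RuntimeError("no rung resolved the input")


def yes_no_rule(text: str) -> Optional[str]:
    """Tier 0: a deterministic rule, no LLM call."""
    normalized = text.strip().lower().rstrip(".!")
    if normalized in {"yes", "yeah", "yep"}:
        return "yes"
    if normalized in {"no", "nope"}:
        return "no"
    return None


def small_llm_stub(text: str) -> Optional[str]:
    """Stand-in for a small model that handles simple paraphrases only."""
    return "yes" if "go ahead" in text.lower() else None


def large_llm_stub(text: str) -> Optional[str]:
    """Stand-in for the strongest model; always produces an answer."""
    return "needs_clarification"


RUNGS = [("rule", yes_no_rule), ("small", small_llm_stub), ("large", large_llm_stub)]

for utterance in ["Yes!", "Sure, go ahead", "What about Tuesday instead?"]:
    print(utterance, "->", climb(utterance, RUNGS))
```

Recording which rung answered each turn gives the deterministic hit rate and the per-rung latency breakdown directly.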

How we prove it — and keep proving it

Nothing ships on intuition. Every case is an A/B against the current per-flow baseline, scored on four axes, and run on a repeatable harness so we can re-run it on every change rather than once. Correctness is a hard gate, checked before cost or latency is even considered: a model that fails a state's acceptance check is never routed, at any price.

Two honest caveats for this review. (1) Intelligence is measurable today; per-state cost and latency are not yet. The cert matrix already scores correctness/pass per state, but per-state tokens and latency come from PR1's per-turn logging. Until that lands, the harness records those axes as empty rather than guessing. The methodology is real; two of its four axes go live with PR1. (2) Cost figures are illustrative public list prices (editable in one table, MODEL_PRICES); final numbers come from our actual contract. The mechanism and the deltas are what this review asks you to approve, not the absolute dollar values.

Axis	What we measure	Existing tool
Cost	tokens × per-model price; escalation rate; deterministic hit rate	`eval_db.py` + per-turn model log
Latency	time-to-first-audio, full-turn wall-clock, per-rung breakdown	`latency_bench.py`, `timing_benchmark.py`
Intelligence	per-state correctness / cert pass-fail	`flow_certify.py`, `evaluate_flows.py`
Human-ness	judged humanness (curt/robotic vs natural)	rubric + multi-judge panel
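The cost axis (tokens × per-model price) reduces to a small calculation. In this sketch the `MODEL_PRICES` name comes from the caveat above, but its shape (a per-million-token input/output pair) is an assumption, and the prices are placeholders chosen for easy arithmetic, not real list or contract prices.

```python
# Placeholder prices per 1M tokens as (input, output). NOT real list prices.
MODEL_PRICES = {
    "gemini-2.5-flash": (1.0, 2.0),
    "gemini-2.5-pro": (4.0, 8.0),
}


def turn_cost(model, in_tokens, out_tokens, prices=MODEL_PRICES):
    """Cost of one turn: tokens times the per-model price, input and output priced separately."""
    in_price, out_price = prices[model]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


def conversation_cost(turns, prices=MODEL_PRICES):
    """Sum turn costs; each turn is (model, in_tokens, out_tokens)."""
    return sum(turn_cost(model, i, o, prices) for model, i, o in turns)


turns = [
    ("gemini-2.5-flash", 1_000_000, 500_000),
    ("gemini-2.5-pro", 250_000, 125_000),
]
print(conversation_cost(turns))
```

Because prices live in one table, swapping in contract prices changes the absolute figures without touching the per-turn records.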

The continuous proof loop. A decision only counts if it clears the noise-floor gate: RIFF eval is judge-dominated, with a single-run noise floor of about 1.5 pt, so a sub-1-pt “win” is noise. The harness re-runs on every change, so a routing choice that was right last month is re-checked when a model or prompt moves under it.
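The gate ordering (correctness first, then the noise floor) can be sketched as below. The `verdict` function and its four labels are hypothetical. Applying the single-run 1.5 pt figure to repeat-N means is a deliberately conservative simplification assumed here, not the harness's documented rule.

```python
from statistics import mean

NOISE_FLOOR_PT = 1.5  # single-run figure quoted above


def verdict(baseline_scores, candidate_scores, candidate_cert_pass,
            noise_floor=NOISE_FLOOR_PT):
    """Classify a candidate routing against the baseline.

    Correctness is checked first; only then is the score gap compared
    with the noise floor.
    """
    if not candidate_cert_pass:
        return "reject"  # hard gate: never routed, at any price
    delta = mean(candidate_scores) - mean(baseline_scores)
    if abs(delta) <= noise_floor:
        return "flat"  # inside the noise floor: no quality claim either way
    return "better" if delta > 0 else "worse"


print(verdict([80.0], [80.8], candidate_cert_pass=True))
print(verdict([80, 81, 79], [84, 85, 83], candidate_cert_pass=True))
print(verdict([80], [90], candidate_cert_pass=False))
```

A "flat" verdict is exactly what Case 1 needs: quality inside the noise floor, so the decision falls to cost and latency.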

The data is the asset — tiered, per-LLM, kept across switches

The four axes are not measured once and thrown away. They accumulate into a retained, append-only profile keyed per LLM × architectural scope. That is what turns “which model for this state?” into a standing capability instead of a one-off study.

When rows roll up to a broader scope, cost sums across the tiers, quality axes are averaged with turn-count weights, and cert_pass is the logical AND of the children. Every row is pinned to its model, and the store is append-only: switching the live model adds rows without deleting the old model's history. So a cost/human-ness/latency tradeoff can be re-decided at any level, any time, on accumulated evidence.
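The three roll-up rules can be written out directly. This is a sketch assuming each child row carries a turn count; `roll_up` and the dictionary keys are hypothetical names, and the values are illustrative rather than measured.

```python
def roll_up(children):
    """Combine child-scope rows (e.g. states) into one parent row (e.g. a group or flow)."""
    total_turns = sum(c["turns"] for c in children)
    if total_turns == 0:
        raise ValueError("cannot roll up rows with no turns")
    return {
        "turns": total_turns,
        # Cost is additive across scopes.
        "cost": sum(c["cost"] for c in children),
        # Quality is a mean weighted by how many turns each child contributed.
        "humanness": sum(c["humanness"] * c["turns"] for c in children) / total_turns,
        # The parent passes only if every child passes.
        "cert_pass": all(c["cert_pass"] for c in children),
    }


# Illustrative rows for one model, not measured data.
state_rows = [
    {"turns": 1, "cost": 0.5, "humanness": 4.0, "cert_pass": True},
    {"turns": 3, "cost": 1.5, "humanness": 2.0, "cert_pass": True},
]
print(roll_up(state_rows))
```

Weighting by turns keeps a rarely visited state from dominating the parent's quality score, while the AND on `cert_pass` keeps a single failing state visible at every level above it.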

The risk this measurement guards against. Switching models mid-conversation causes real drift: downgrades hurt most, drift compounds with conversation depth, and dissimilar models disrupt more than similar ones (Model-Switching Drift, 2026; “LLMs Get Lost in Multi-Turn” reports a 39% multi-turn drop before any switch). Two RIFF facts make this manageable, and the harness must confirm them:

Stay in-family. RIFF’s classes are fast: gemini-2.5-flash and smart: gemini-2.5-pro — same family, the low-drift case. Cross-family hops spike drift, tool-format mismatch, and prompt-cache loss at once.
Switch at boundaries; never downgrade into late states. Route at state/segment edges, not mid-context. RIFF carries structured slots in the session, not just raw chat — a built-in grounding mitigation.

What exists, what’s proposed

Tier	Status	Where
global default	exists (implicit)	server-default adapter
flow	exists	`FlowLlmConfig.default_class` + class table
group	proposed	add `llm` field to `GoalSegment`
state	exists	`StateDef.llm`
primary-turn routing	not wired	today only the continuation cue calls `resolve_adapter`

Sequencing. PR1 wires resolve_adapter into the primary turn behind RIFF_PRIMARY_LLM_ROUTING (off by default), with an adapter cache and per-turn model logging; this is the unblock that makes any case measurable. PR2 adds the group tier and the cascade. Per-flow stays the production default until a state has A/B evidence that switching improves the outcome.
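A sketch of how an off-by-default flag and an adapter cache could fit together. Only the flag name `RIFF_PRIMARY_LLM_ROUTING` comes from this document; the accepted truthy values, the `get_adapter` and `adapter_for_turn` helpers, and the dictionary standing in for an adapter are all assumptions for illustration.

```python
import os
from functools import lru_cache

FLAG = "RIFF_PRIMARY_LLM_ROUTING"


def routing_enabled(env=os.environ) -> bool:
    """Off unless the flag is explicitly set to a truthy value (assumed spellings)."""
    return env.get(FLAG, "").strip().lower() in {"1", "true", "on", "yes"}


@lru_cache(maxsize=None)
def get_adapter(model: str):
    """Build one adapter per model and reuse it. Stub: a real adapter would hold a client."""
    return {"model": model}


def adapter_for_turn(session_model: str, resolved_model: str, env=os.environ):
    """With the flag off, every turn keeps the per-session model (today's behavior)."""
    model = resolved_model if routing_enabled(env) else session_model
    return get_adapter(model)


print(adapter_for_turn("gemini-2.5-flash", "gemini-2.5-pro", env={}))
print(adapter_for_turn("gemini-2.5-flash", "gemini-2.5-pro", env={FLAG: "1"}))
```

Caching by model name means a per-state switch reuses an existing adapter instead of constructing a new one on every turn.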

Effective LLM Policy — proposed design, published for stakeholder review before build. Full spec: docs/flow-eval/LLM-POLICY.md. Continuous proof harness: scripts/llm_policy_bench.py. Prior art: OrchestraLLM (NAACL 2024), RouteLLM (ICLR 2025), FrugalGPT. Builds on the RIFF evaluation system and the per-state cert matrix.