The RIFF Evaluation & Goal-Hierarchy System

How a voice-AI flow engine measures itself — at four nested levels, with cheap deterministic gates per commit and an expensive judged matrix at night.

RIFF is a voice-AI flow engine. A flow is a conversation script for a phone agent — “book a plumbing visit,” “take a coffee order,” “reserve a table.” Under the hood a flow is a finite-state machine (FSM): a graph of states, with edges (transitions) guarded by deterministic string expressions evaluated against the call's slot bag. The large language model (LLM) only produces language; it never decides which state comes next. That separation — declared graph in the FSM, words from the LLM — is the spine the whole evaluation system is built on.
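
To make the separation concrete, here is a minimal sketch of a guard-driven transition step. It is not RIFF's implementation: the guard mini-syntax (slot names joined by `&&`, with `!` for negation), the `Edge` class, and the `EDGES` table are all hypothetical, chosen only to show that the next state depends on the slot bag alone and never on LLM output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    target: str
    guard: str  # hypothetical mini-syntax: "date && time", "!confirmed"


def guard_holds(guard: str, slots: dict) -> bool:
    """Evaluate a guard string deterministically against the slot bag."""
    for term in guard.split("&&"):
        term = term.strip()
        if not term:
            continue  # an empty guard is always true
        negated = term.startswith("!")
        name = term.lstrip("!")
        # A slot counts as filled when it is present and non-empty.
        filled = slots.get(name) not in (None, "")
        if filled == negated:
            return False
    return True


def next_state(edges: dict, state: str, slots: dict) -> str:
    """Take the first edge whose guard holds; otherwise stay in place."""
    for edge in edges.get(state, []):
        if guard_holds(edge.guard, slots):
            return edge.target
    return state


# Hypothetical three-state fragment of a "reserve a table" flow.
EDGES = {
    "collect": [Edge("confirm", "date && time")],
    "confirm": [Edge("book", "confirmed")],
}
```

Because `next_state` is a pure function of the current state and the slots, the same recorded call always follows the same path, which is what lets the rest of the scorecard be computed rather than judged.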

This documentation describes everything built to answer one question reliably: “is a flow good, and if not, exactly which part is broken?” The answer is not a single number. It is a four-tier goal hierarchy, where every level declares a goal and is instrumented with a metric for that goal.

[Figure: Four-tier RIFF evaluation hierarchy showing state, group segment, flow, and cross-flow archetype metrics]
The evaluation stack reads both directions: state metrics roll up into group cohesion and flow quality, while archetypes compare the same phase across flows to identify systemic weaknesses.

The four-tier hierarchy

Read top-down. Each tier rolls up into the one above it, and each tier has its own “what good means” and its own measurement.

Tier 4 · Archetype (cross-flow)
Scope: the same FSM subpart type across all flows (every flow's collect, every confirm, every action), grouped into one bucket.
Metric: mean cohesion across flows → finds systemic weakness (“our action phases are weak everywhere”) vs. a one-flow bug. See Archetypes.

Tier 3 · Flow
Goal: complete the caller's task.
Metric: correctness · completion · errors · latency · humanness, shown as the evaluation matrix (computed aspects + judged humanness).

Tier 2 · Group segment
Goal: a typed objective: collect the slots, confirm the read-back, act (a tool succeeds), reach a terminal, or handoff safely.
Metric: group_cohesion = goal_yield · (0.35·efficiency + 0.65·transition_coherence). Do the group's states work together? See Group cohesion.

Tier 1 · State
Goal: one bounded capability / speech act (may fill or repair several slots).
Metric: progress · stall · revisit · dwell_turns · slot_fill_rate · latency_p95. See Per-state metrics.

The one mental model to keep

A bad flow score (Tier 3) is a symptom. You drill down (flow → group → state) to find the exact node that is broken, and you roll up across flows (Tier 4) to find weaknesses that are systemic to a kind of subpart, not to one flow. The command path is: --analyze → --matrix → --segment-matrix → --state-matrix --flow F.
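
The Tier-2 formula and the Tier-4 roll-up can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: it assumes all three inputs are already normalized to [0, 1], and the function names, the tuple layout, and the sample numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def group_cohesion(goal_yield: float, efficiency: float,
                   transition_coherence: float) -> float:
    """group_cohesion = goal_yield * (0.35*efficiency + 0.65*transition_coherence).

    Assumption: every input lies in [0, 1], so the result does too.
    """
    inputs = (("goal_yield", goal_yield),
              ("efficiency", efficiency),
              ("transition_coherence", transition_coherence))
    for name, value in inputs:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return goal_yield * (0.35 * efficiency + 0.65 * transition_coherence)


def archetype_means(segments):
    """Tier-4 roll-up: mean cohesion per archetype across flows.

    `segments` is an iterable of (flow, archetype, cohesion) tuples.
    """
    buckets = defaultdict(list)
    for _flow, archetype, cohesion in segments:
        buckets[archetype].append(cohesion)
    return {archetype: mean(values) for archetype, values in buckets.items()}


# Hypothetical sample: two flows, two archetypes each.
SAMPLE = [
    ("plumbing", "collect", 0.9),
    ("coffee", "collect", 0.8),
    ("plumbing", "action", 0.4),
    ("coffee", "action", 0.5),
]
```

Note the multiplicative role of goal_yield: a group that never reaches its goal scores zero however smooth its transitions were. In the sample, the action bucket averages lower than collect in both flows, which is the pattern that points to a systemic weakness rather than a one-flow bug.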

Computed vs. judged — the cost split

The single most important design decision: separate what we can compute deterministically from what we must judge with an LLM.

| | Computed aspects | Judged aspect |
|---|---|---|
| which | correctness, completion, errors, latency, and all per-state / group metrics | humanness only |
| source | the live event stream (state transitions, guards, tool calls, timing) | an LLM grading a transcript against a rubric |
| noise | none (deterministic) | ~1.5 points single-run (judge-dominated) |
| cost | free, fast | API calls, slow |
| cadence | every commit (the CI gate) | nightly (the judge panel) |

Because only humanness pays the judge-noise tax, most of the scorecard is trustworthy and cheap, and we spend LLM calls only on the genuinely subjective axis. See Methodology & lessons for why the noise floor forces repeat-N and why “harden the instrument” is the prime directive.
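
A rough way to see why the noise floor forces repeat-N: if the ~1.5-point single-run noise is treated as the standard deviation of independent runs (an assumption, not something the measurements above establish), the noise on the mean of N runs shrinks as 1.5/√N. The helper names below are hypothetical.

```python
import math

SINGLE_RUN_NOISE = 1.5  # points; treated here as a standard deviation (assumption)


def standard_error(n_runs: int, sigma: float = SINGLE_RUN_NOISE) -> float:
    """Noise on the mean humanness score after n independent runs."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    return sigma / math.sqrt(n_runs)


def runs_needed(target_error: float, sigma: float = SINGLE_RUN_NOISE) -> int:
    """Smallest N whose standard error is at or below target_error."""
    if target_error <= 0:
        raise ValueError("target_error must be positive")
    return math.ceil((sigma / target_error) ** 2)
```

Under that assumption, bringing the noise down to 0.5 points takes nine runs per cell, which is why the judged axis runs nightly rather than per commit.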

The quality database

Everything lands in one SQLite file, docs/flow-eval/quality.db, driven by scripts/eval_db.py. Five tables hold the truth; a dozen read-time views slice it.

| Table | Grain | What it holds |
|---|---|---|
| runs | one per eval run | commit, branch, dirty flag, agent model, judge flags; the feature captured at run start |
| scores | flow × persona | legacy quality/robustness/success + per-cell variance (quality_runs) |
| aspect_scores | conversation × aspect × judge | the tidy long-format matrix; computed rows use judge='(computed)' |
| state_scores | conversation × state | per-state root-cause metrics, tagged with the state's segment |
| flow_mods | flow × commit | every git commit that touched a flow's YAML, so a score change can be tied to an edit vs. noise |
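
To illustrate the long-format idea, the sketch below builds a toy aspect_scores table in memory and reads it two ways. Only the table name, its conversation × aspect × judge grain, and the judge='(computed)' convention come from the description above; the column names, score scales, judge names, and helper functions are hypothetical and may not match the real schema in quality.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE aspect_scores (
        conversation TEXT NOT NULL,
        aspect       TEXT NOT NULL,
        judge        TEXT NOT NULL,   -- '(computed)' marks deterministic rows
        score        REAL NOT NULL,
        PRIMARY KEY (conversation, aspect, judge)
    )
    """
)
conn.executemany(
    "INSERT INTO aspect_scores VALUES (?, ?, ?, ?)",
    [
        ("c1", "completion", "(computed)", 1.0),
        ("c1", "latency", "(computed)", 0.8),
        ("c1", "humanness", "judge_a", 7.0),
        ("c1", "humanness", "judge_b", 8.0),
    ],
)


def computed_scores(db: sqlite3.Connection, conversation: str) -> dict:
    """Deterministic aspects for one conversation, keyed by aspect."""
    cur = db.execute(
        "SELECT aspect, score FROM aspect_scores "
        "WHERE conversation = ? AND judge = '(computed)' ORDER BY aspect",
        (conversation,),
    )
    return dict(cur.fetchall())


def judge_spread(db: sqlite3.Connection, conversation: str,
                 aspect: str = "humanness") -> float:
    """Max minus min across judges: a simple uncertainty proxy."""
    cur = db.execute(
        "SELECT MAX(score) - MIN(score) FROM aspect_scores "
        "WHERE conversation = ? AND aspect = ? AND judge != '(computed)'",
        (conversation, aspect),
    )
    return cur.fetchone()[0]
```

Keeping one row per (conversation, aspect, judge) means adding a judge or an aspect adds rows, not columns, so read-time views can pivot the same data into whichever matrix is needed.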

Read this in order

Judge-the-Judge & discipline (New · 2026-06-24)

Never trust a verdict you can't cross-check. The judge panel (Gemini/Qwen/Codex), the deterministic guard, the scanner that over-flagged 60%, and why repeat-N beats one run.

Effective LLM Policy (2026-06-23)

Route the right model per state (inheritance + override). The four cases, the proof harness, the drift risk. Note: the first routing assignment was a misdiagnosis — see the validation update.

Goal-Hierarchy design

The flagship spec: flow → group → state, each with a goal and a metric. Start here.

The evaluation matrix

Tier-3 flow scorecard. Computed aspects + judged humanness, crossed by agent.

Per-state metrics

Tier-1 root cause. How the event stream becomes progress / stall / dwell.

Group cohesion

Tier-2. Do a group's states collaborate, or thrash? The bounded metric.

Archetypes

Tier-4. The same subpart across all flows — systemic-weakness finder.

GoalSegment schema

The typed group contract (new this session) — the data future phases consume.

Judges

The roster: remote, local, and CLI judges. Spread = humanness uncertainty.

Backend mocks

Letting backend-gated flows reach their payoff in eval — without faking guards.

Holdout validation

Dev vs. held-out personas — the overfitting guard. Never debug the held-out set.

CI quality gate

The per-commit deterministic gate + the pre-push hook.

Session replay

Re-drive recorded sessions after a change — the regression check.

Stall recovery (FSM)

The stalled guard + sets: edges — deterministic recovery from a withholding caller.

Methodology & lessons

The noise floor, harden-the-instrument, per-persona decomposition, holdout discipline.