The Evaluation Matrix
Tier 3. The flow scorecard: deterministic computed aspects from the live event stream, plus judged humanness — crossed by agent so a multi-model run never collapses into one misleading column.
What & why
For a long time a flow had one number: “quality.” That number was noisy, opaque, and conflated objective failure with subjective taste. The matrix replaces it with a per-aspect table that separates what we can compute (objective, deterministic, no judge noise) from what we must judge (subjective, LLM, ~1.5pt noise). Only one aspect — humanness — pays the judge tax, so most of the scorecard is trustworthy and cheap.
The aspects
| aspect | source | how | noise |
|---|---|---|---|
| correctness | computed | reached a success terminal + required/completion slots filled + on-contract final_status | none |
| completion | computed | completed flag + stop_reason == terminal | none |
| errors | computed | count of errors / unsafe stops / max-turns / loops (10 = no errors) | none |
| latency_p95 / mean | measured | wall-clock seconds per agent turn | none |
| humanness | judged | LLM rubric: natural, warm, not robotic/repetitive | ~1pt / judge |
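As a rough illustration of the computed rows above, the sketch below derives completion and correctness for a single call. The `CallSummary` type and all of its field names are hypothetical stand-ins for whatever the event stream actually records, and the all-or-nothing 0/10 scoring per call is an assumption (under it, a flow-level value would be a mean over calls).

```python
from dataclasses import dataclass, field


@dataclass
class CallSummary:
    """Hypothetical per-call summary distilled from the event stream."""
    completed: bool
    stop_reason: str                  # e.g. "terminal", "max_turns", "error"
    reached_success_terminal: bool
    final_status: str
    contract_statuses: frozenset      # statuses the flow contract allows
    required_slots: tuple = ()
    slots: dict = field(default_factory=dict)


def completion(call: CallSummary) -> float:
    # completed flag + stop_reason == terminal
    return 10.0 if call.completed and call.stop_reason == "terminal" else 0.0


def correctness(call: CallSummary) -> float:
    # success terminal + every required slot filled + on-contract final_status
    slots_filled = all(
        call.slots.get(name) not in (None, "") for name in call.required_slots
    )
    on_contract = call.final_status in call.contract_statuses
    ok = call.reached_success_terminal and slots_filled and on_contract
    return 10.0 if ok else 0.0


# A call that ends cleanly but never captured the appointment time.
call = CallSummary(
    completed=True,
    stop_reason="terminal",
    reached_success_terminal=True,
    final_status="booked",
    contract_statuses=frozenset({"booked", "declined"}),
    required_slots=("name", "appointment_time"),
    slots={"name": "Ada"},
)
print(completion(call), correctness(call))  # 10.0 0.0
```

Keeping the two checks separate is what lets a call score full marks on completion while still failing correctness.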
How it works — the pipeline
Computed aspects come straight from the per-call event stream (no LLM). Humanness is scored
separately by a swappable judge. Everything lands in aspect_scores in tidy form
(one row per conversation × aspect × judge), where computed rows carry
judge='(computed)'.
    # 1. run the flows (computed aspects ride along in the transcripts; --no-judge keeps it free)
    python -m riff.flow_eval --flows austin_plumbing,weather --no-judge --out OUT

    # 2. fold computed aspects into aspect_scores
    python scripts/eval_db.py --ingest-aspects --dirs OUT

    # 3. (nightly) score humanness with a judge, then ingest the judged rows
    python scripts/aspect_judge.py OUT/flow-transcripts-*.jsonl --judge gemini --out H --judged-at 2026-06-22T14:00:00Z
    python scripts/eval_db.py --ingest-humanness --dirs H

    # 4. read the matrix
    python scripts/eval_db.py --matrix
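To make the tidy layout concrete, here is a minimal sketch of what an aspect_scores table could look like in SQLite. Only the table name and the judge='(computed)' convention come from the text; the column names, key, and sample rows are assumptions for illustration, not the real schema or real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE aspect_scores (
        conversation_id TEXT NOT NULL,   -- hypothetical column names
        aspect          TEXT NOT NULL,
        judge           TEXT NOT NULL,   -- '(computed)' for non-LLM rows
        score           REAL NOT NULL,
        PRIMARY KEY (conversation_id, aspect, judge)
    )
    """
)

# Illustrative rows: one conversation, two computed aspects, two judges.
rows = [
    ("c1", "correctness", "(computed)", 10.0),
    ("c1", "completion", "(computed)", 10.0),
    ("c1", "humanness", "gemini", 2.0),
    ("c1", "humanness", "codex", 1.5),
]
conn.executemany("INSERT INTO aspect_scores VALUES (?, ?, ?, ?)", rows)

# Computed and judged rows share one table and differ only in the judge column.
computed = conn.execute(
    "SELECT aspect, score FROM aspect_scores "
    "WHERE judge = '(computed)' ORDER BY aspect"
).fetchall()
judged = conn.execute(
    "SELECT judge, score FROM aspect_scores "
    "WHERE aspect = 'humanness' ORDER BY judge"
).fetchall()
print(computed)
print(judged)
```

With one row per conversation × aspect × judge, adding a new judge only adds rows, so existing queries keep working.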
operability — the gated aggregate
The matrix sorts rows by a single operability number, but it is a gated
aggregate, not a vibe. It blends correctness/completion/errors/latency/humanness, then caps:
if correctness < 10 it is capped at 4.0; if completion < 10, capped at 6.0. So a flow that
“completes” but books nothing can never look healthy. It uses a true median across
judges and, crucially, never fakes humanness — with no judge it reweights onto the
computed aspects instead of inventing a value. The per-aspect cells are the truth; operability is for
sorting.
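The gating logic described above can be sketched as follows. The caps (4.0 and 6.0), the median across judges, and the reweighting when no judge ran are taken from the text; the weights themselves are invented for illustration, and the sketch assumes latency has already been converted to a 0-10 score where higher is better.

```python
from statistics import median

# Hypothetical weights; the real blend is not specified in this document.
WEIGHTS = {
    "correctness": 0.3,
    "completion": 0.2,
    "errors": 0.2,
    "latency": 0.1,
    "humanness": 0.2,
}


def operability(computed: dict, humanness_by_judge: dict) -> float:
    """Blend aspect scores (0-10), then apply the correctness/completion caps."""
    scores = dict(computed)
    if humanness_by_judge:
        # True median across judges, never a single judge's opinion.
        scores["humanness"] = median(humanness_by_judge.values())

    # With no judge, humanness is simply absent and its weight is
    # redistributed over the computed aspects instead of being invented.
    active = {name: w for name, w in WEIGHTS.items() if name in scores}
    blended = sum(scores[n] * w for n, w in active.items()) / sum(active.values())

    if scores["correctness"] < 10:
        blended = min(blended, 4.0)
    if scores["completion"] < 10:
        blended = min(blended, 6.0)
    return round(blended, 1)


# Illustrative inputs: completes, but correctness is below 10.
capped = operability(
    {"correctness": 6.8, "completion": 10.0, "errors": 2.9, "latency": 8.0},
    {"codex": 1.8, "dashsc": 0.9, "gemini": 2.1},
)
# Perfect computed aspects, no judge available.
unjudged = operability(
    {"correctness": 10.0, "completion": 10.0, "errors": 10.0, "latency": 10.0},
    {},
)
print(capped, unjudged)
```

Under these assumed weights the first call blends to about 5.8 but is reported as 4.0 because correctness is below 10, which is the behavior the caps are meant to guarantee.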
Rows are keyed by (flow, agent_model), not just flow. A qwen-flash run and a gemini run of the
same flow are different situations; collapsing them would misread a difference between models as a difference in flow quality.
This is the performance-model view: quality ~ judge × agent_model × flow × persona.
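A small sketch of why the agent model belongs in the row key: the same scores are averaged once per (flow, agent_model) and once per flow alone. The rows are made-up illustrative values, and `matrix_cells` is a hypothetical helper, not part of eval_db.py.

```python
from collections import defaultdict
from statistics import mean

# (flow, agent_model, aspect, score): illustrative rows, not real data.
rows = [
    ("austin_plumbing", "qwen-flash", "correctness", 10.0),
    ("austin_plumbing", "qwen-flash", "correctness", 0.0),
    ("austin_plumbing", "gemini", "correctness", 10.0),
    ("austin_plumbing", "gemini", "correctness", 10.0),
]


def matrix_cells(rows, by_agent=True):
    """Mean score per cell, keyed with or without the agent model."""
    buckets = defaultdict(list)
    for flow, agent, aspect, score in rows:
        key = (flow, agent, aspect) if by_agent else (flow, aspect)
        buckets[key].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


per_agent = matrix_cells(rows)
collapsed = matrix_cells(rows, by_agent=False)
print(per_agent)   # qwen-flash 5.0 and gemini 10.0 stay separate
print(collapsed)   # a single 7.5 that describes neither run
```

The collapsed 7.5 looks like a property of the flow, when in this toy data it is entirely a property of which model ran it.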
Example
python scripts/eval_db.py --matrix
Real output (truncated) from the live quality.db:
    # Evaluation matrix — (flow × agent) × aspects (computed) + humanness × judge

    | Flow                    | Agent        | **operability** | correctness | completion | errors | lat_p95 | turn_count | hum:codex | hum:dashsc | hum:gemini |
    |-------------------------|--------------|-----------------|-------------|------------|--------|---------|------------|-----------|------------|------------|
    | apartment_scheduler     | gemini-2.5-f | **0.3**         | 0.0         | 0.0        | 0.0    | 11.1    | 7.9        | ·         | ·          | ·          |
    | property_trouble_ticket | qwen-flash   | **1.8**         | 0.0         | 4.2        | 0.3    | 4.2     | 5.0        | 1.1       | 0.8        | 2.2        |
    | austin_plumbing         | qwen-flash   | **4.0**         | 6.8         | 10.0       | 2.9    | 3.9     | 7.3        | 1.8       | 0.9        | 2.1        |
    | dental_clinic           | qwen-flash   | **4.0**         | 9.2         | 10.0       | 3.0    | 3.9     | 4.4        | 4.4       | 6.2        | 6.6        |
    | generic_inquiry         | qwen-flash   | **4.0**         | 9.3         | 10.0       | 5.4    | 4.1     | 2.7        | 3.3       | 3.4        | 5.2        |

    Units: correctness/completion/errors 0-10 · latency seconds · turn_count raw.
    errors=10 means no errors; latency lower=better. humanness 0-10 (higher better).
Read it across, not down: austin_plumbing completes (10) but its correctness is only
6.8 and humanness ~2 — it reaches a terminal but the booking and the manner both need work. A
· means no judge ran that cell (gemini-agent rows were computed-only in this run).
Auto-analysis — the worklist
--analyze synthesizes the flow, segment, and state layers into a prioritized
“what to work on” list, so you start at the highest-impact node instead of reading three
tables by hand:
python scripts/eval_db.py --analyze
Real output (truncated):
    # What to work on — auto-analysis (flow → segment → state)

    ## Weakest flows (computed correctness, by agent)
    ❌ apartment_scheduler correctness 0.0 (agent gemini-2.5-fla)
    ❌ ceo_command_center correctness 0.0 (agent qwen-flash)

    ## Weakest segments (phases that stall/escalate)
    ⚠ apartment_scheduler/action stall=0.36 escal=0.64 progress=0.0
    ⚠ ceo_command_center/action stall=0.43 escal=0.0 progress=0.86

    ## Root-cause states (fix these exact nodes)
    🎯 apartment_scheduler/book_viewing stall 0.36 · escalates 0.64 · dwell 3.6
    🎯 dog_grooming/listen_owner stall 0.33 · escalates 0.44 · slot_fill 0.0 (stuck, prog 0.22)

    Drill: --matrix (flow) → --segment-matrix → --state-matrix --flow F.
Other matrix views
| Command | What it shows |
|---|---|
| --segment-matrix | per-segment (phase) quality, the middle tier; shows the worst-member state so a phase average can't mask one bad node |
| --state-matrix [--flow F] | per-state quality, worst first (see Per-state metrics) |
| --timeline | every run by time, change (commit), and judge LLM; quality is comparable only within a judge column |
| --flow-table | all flows × (agent model, judge LLM), mean quality per cell (the calibration panel) |
| --export-csv | tidy CSV of every aspect score, joined to the flow's last edit before that run, for graphing quality against fixes |
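The "last edit before that run" join in --export-csv is an as-of lookup. A minimal sketch of that idea, assuming edit timestamps are available as a sorted list and that "before" means strictly earlier (both are assumptions; `last_edit_before` is a hypothetical helper, not the exporter's actual code):

```python
from bisect import bisect_left
from datetime import datetime


def last_edit_before(edit_times, run_time):
    """Return the latest edit strictly before run_time, or None if there is none.

    edit_times must be sorted in ascending order.
    """
    i = bisect_left(edit_times, run_time)  # index of first edit >= run_time
    return edit_times[i - 1] if i else None


# Illustrative timestamps, not real history.
edits = [
    datetime(2026, 6, 1, 9, 0),
    datetime(2026, 6, 10, 15, 30),
    datetime(2026, 6, 20, 8, 0),
]
run = datetime(2026, 6, 15, 12, 0)
print(last_edit_before(edits, run))                      # the 10 June edit
print(last_edit_before(edits, datetime(2026, 5, 1)))     # None: run predates all edits
```

Attaching each score to the edit that preceded it is what makes a quality-over-time graph attributable to specific fixes.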
Where it fits
The matrix is the top of the drill-down and the public face of flow quality. The cheap computed aspects are refreshed on every commit (the CI gate reuses the same ingest plus a regression check), and the full judged matrix runs nightly. From any weak row you descend into segments and states; across rows you ascend into archetypes.