
The Evaluation Matrix

Tier 3. The flow scorecard: deterministic computed aspects from the live event stream, plus judged humanness — crossed by agent so a multi-model run never collapses into one misleading column.

What & why

For a long time a flow had one number: “quality.” That number was noisy, opaque, and conflated objective failure with subjective taste. The matrix replaces it with a per-aspect table that separates what we can compute (objective, deterministic, no judge noise) from what we must judge (subjective, LLM, ~1.5pt noise). Only one aspect — humanness — pays the judge tax, so most of the scorecard is trustworthy and cheap.

The aspects

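| aspect             | source   | how                                                                                       | noise        |
|--------------------|----------|-------------------------------------------------------------------------------------------|--------------|
| correctness        | computed | reached a success terminal + required/completion slots filled + on-contract final_status  | none         |
| completion         | computed | completed flag + stop_reason == terminal                                                  | none         |
| errors             | computed | count of errors / unsafe stops / max-turns / loops (10 = no errors)                       | none         |
| latency_p95 / mean | measured | wall-clock seconds per agent turn                                                         | none         |
| humanness          | judged   | LLM rubric: natural, warm, not robotic/repetitive                                         | ~1pt / judge |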

How it works — the pipeline

Computed aspects come straight from the per-call event stream (no LLM). Humanness is scored separately by a swappable judge. Every score lands in tidy form (one row per conversation × aspect × judge) in aspect_scores, where computed rows carry judge='(computed)'.

# 1. run the flows (computed aspects ride along in the transcripts; --no-judge keeps it free)
python -m riff.flow_eval --flows austin_plumbing,weather --no-judge --out OUT

# 2. fold computed aspects into aspect_scores
python scripts/eval_db.py --ingest-aspects --dirs OUT

# 3. (nightly) score humanness with a judge, then ingest the judged rows
python scripts/aspect_judge.py OUT/flow-transcripts-*.jsonl --judge gemini --out H --judged-at 2026-06-22T14:00:00Z
python scripts/eval_db.py --ingest-humanness --dirs H

# 4. read the matrix
python scripts/eval_db.py --matrix
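
The tidy layout that steps 2 and 3 write into can be pictured with an in-memory SQLite table. This is a sketch under assumptions: only the table name `aspect_scores` and the `judge='(computed)'` marker come from the text above; the column names, the primary key, and the scores are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical columns; one row per conversation x aspect x judge.
conn.execute(
    """
    CREATE TABLE aspect_scores (
        conversation_id TEXT,
        flow            TEXT,
        agent_model     TEXT,
        aspect          TEXT,
        judge           TEXT,
        score           REAL,
        PRIMARY KEY (conversation_id, aspect, judge)
    )
    """
)

# Illustrative rows: computed aspects carry judge='(computed)',
# humanness gets one row per judge that scored the conversation.
rows = [
    ("c1", "austin_plumbing", "qwen-flash", "correctness", "(computed)", 10.0),
    ("c1", "austin_plumbing", "qwen-flash", "completion", "(computed)", 10.0),
    ("c1", "austin_plumbing", "qwen-flash", "humanness", "gemini", 2.0),
    ("c1", "austin_plumbing", "qwen-flash", "humanness", "codex", 1.5),
]
conn.executemany("INSERT INTO aspect_scores VALUES (?, ?, ?, ?, ?, ?)", rows)

computed = conn.execute(
    "SELECT aspect, score FROM aspect_scores "
    "WHERE judge = '(computed)' ORDER BY aspect"
).fetchall()
judged = conn.execute(
    "SELECT judge, score FROM aspect_scores "
    "WHERE aspect = 'humanness' ORDER BY judge"
).fetchall()
print(computed)
print(judged)
```

Keeping the judge in the key is what lets a second judge be added later without touching the computed rows.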

operability — the gated aggregate

The matrix sorts rows by a single operability number, but it is a gated aggregate, not a vibe. It blends correctness/completion/errors/latency/humanness, then caps: if correctness < 10 it is capped at 4.0; if completion < 10, capped at 6.0. So a flow that “completes” but books nothing can never look healthy. It uses a true median across judges and, crucially, never fakes humanness — with no judge it reweights onto the computed aspects instead of inventing a value. The per-aspect cells are the truth; operability is for sorting.
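
The gating logic can be sketched as follows. The caps (4.0 and 6.0), the median across judges, and the reweighting when no judge ran come from the paragraph above. The weights, the `operability` function name, and the assumption that latency has already been mapped onto a 0-10 score are hypothetical, so the uncapped numbers will not match eval_db.py.

```python
from statistics import median

# Hypothetical blend weights; the real ones are not documented here.
WEIGHTS = {
    "correctness": 0.40,
    "completion": 0.20,
    "errors": 0.15,
    "latency": 0.10,   # assumed to be pre-normalized to 0-10, higher = better
    "humanness": 0.15,
}


def operability(computed: dict[str, float],
                humanness_by_judge: dict[str, float]) -> float:
    scores = dict(computed)
    if humanness_by_judge:
        # True median across judges, never a single judge's opinion.
        scores["humanness"] = median(humanness_by_judge.values())
    # With no judge, humanness is simply absent and the remaining
    # weights are renormalized instead of inventing a value.
    total_weight = sum(WEIGHTS[a] for a in scores)
    blended = sum(WEIGHTS[a] * scores[a] for a in scores) / total_weight
    # Gates: a flow that is not fully correct or complete cannot look healthy.
    if scores["correctness"] < 10:
        blended = min(blended, 4.0)
    if scores["completion"] < 10:
        blended = min(blended, 6.0)
    return round(blended, 1)


# Illustrative inputs only.
healthy = {"correctness": 10, "completion": 10, "errors": 10, "latency": 8}
print(operability(healthy, {"codex": 5, "dashsc": 6, "gemini": 7}))
print(operability(healthy, {}))  # no judge: reweighted, not faked
print(operability({**healthy, "correctness": 6.8}, {"gemini": 2.1}))  # capped
```

Whatever the real weights are, the caps act after the blend, so a row with correctness below 10 can never sort above 4.0.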

Crossed by agent. Rows are (flow, agent_model), not just flow. A qwen-flash run and a gemini run of the same flow are different situations; collapsing them would read the model difference as flow quality. This is the performance-model view: quality ~ judge × agent_model × flow × persona.
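
A small sketch of why the key is (flow, agent_model) rather than flow alone. The row layout and the scores are made up for illustration; only the idea of grouping by the pair comes from the text above.

```python
from collections import defaultdict
from statistics import mean

# (flow, agent_model, correctness) per conversation; illustrative values.
rows = [
    ("austin_plumbing", "qwen-flash", 6.0),
    ("austin_plumbing", "qwen-flash", 8.0),
    ("austin_plumbing", "gemini", 10.0),
]


def mean_by(rows, key):
    # Group scores by an arbitrary key function and average each group.
    groups = defaultdict(list)
    for flow, agent, score in rows:
        groups[key(flow, agent)].append(score)
    return {k: mean(v) for k, v in groups.items()}


by_flow = mean_by(rows, lambda flow, agent: flow)
by_flow_agent = mean_by(rows, lambda flow, agent: (flow, agent))

print(by_flow)        # one blended number that describes neither run
print(by_flow_agent)  # the two situations kept apart
```

In this toy data the collapsed view reports 8.0 for the flow, while the crossed view shows 7.0 for one agent and 10.0 for the other.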

Example

python scripts/eval_db.py --matrix

Real output (truncated) from the live quality.db:

# Evaluation matrix — (flow × agent) × aspects (computed) + humanness × judge

| Flow                    | Agent        | **operability** | correctness | completion | errors | lat_p95 | turn_count | hum:codex | hum:dashsc | hum:gemini |
|-------------------------|--------------|-----------------|-------------|------------|--------|---------|------------|-----------|------------|------------|
| apartment_scheduler     | gemini-2.5-f | **0.3**         | 0.0         | 0.0        | 0.0    | 11.1    | 7.9        | ·         | ·          | ·          |
| property_trouble_ticket | qwen-flash   | **1.8**         | 0.0         | 4.2        | 0.3    | 4.2     | 5.0        | 1.1       | 0.8        | 2.2        |
| austin_plumbing         | qwen-flash   | **4.0**         | 6.8         | 10.0       | 2.9    | 3.9     | 7.3        | 1.8       | 0.9        | 2.1        |
| dental_clinic           | qwen-flash   | **4.0**         | 9.2         | 10.0       | 3.0    | 3.9     | 4.4        | 4.4       | 6.2        | 6.6        |
| generic_inquiry         | qwen-flash   | **4.0**         | 9.3         | 10.0       | 5.4    | 4.1     | 2.7        | 3.3       | 3.4        | 5.2        |

Units: correctness/completion/errors 0-10 · latency seconds · turn_count raw.
errors=10 means no errors; latency lower=better. humanness 0-10 (higher better).

Read it across, not down: austin_plumbing completes (10) but its correctness is only 6.8 and humanness ~2 — it reaches a terminal but the booking and the manner both need work. A · means no judge ran that cell (gemini-agent rows were computed-only in this run).

Auto-analysis — the worklist

--analyze synthesizes the flow, segment, and state layers into a prioritized “what to work on” list, so you start at the highest-impact node instead of reading three tables by hand:

python scripts/eval_db.py --analyze

Real output (truncated):

# What to work on — auto-analysis (flow → segment → state)

## Weakest flows (computed correctness, by agent)
  ❌ apartment_scheduler      correctness 0.0  (agent gemini-2.5-fla)
  ❌ ceo_command_center       correctness 0.0  (agent qwen-flash)

## Weakest segments (phases that stall/escalate)
  ⚠ apartment_scheduler/action   stall=0.36 escal=0.64 progress=0.0
  ⚠ ceo_command_center/action    stall=0.43 escal=0.0  progress=0.86

## Root-cause states (fix these exact nodes)
  🎯 apartment_scheduler/book_viewing   stall 0.36 · escalates 0.64 · dwell 3.6
  🎯 dog_grooming/listen_owner          stall 0.33 · escalates 0.44 · slot_fill 0.0 (stuck, prog 0.22)

Drill: --matrix (flow) → --segment-matrix → --state-matrix --flow F.

Other matrix views

| Command                   | What it shows                                                                                                         |
|---------------------------|-----------------------------------------------------------------------------------------------------------------------|
| --segment-matrix          | per-segment (phase) quality; the middle tier. Shows the worst-member state so a phase average can't mask one bad node |
| --state-matrix [--flow F] | per-state quality, worst first (see Per-state metrics)                                                                |
| --timeline                | every run by time, change (commit), and judge LLM; quality is comparable only within a judge column                   |
| --flow-table              | all flows × (agent model, judge LLM), mean quality per cell; the calibration panel                                    |
| --export-csv              | tidy CSV of every aspect score, joined to the flow's last edit before that run; graph quality vs fixes                |

Where it fits

The matrix is the top of the drill-down and the public face of flow quality. The cheap computed aspects are recomputed on every commit (the CI gate reuses the same ingest plus a regression check), and the full matrix, humanness included, is judged nightly. From any weak row you descend into segments and states; across rows you ascend into archetypes.
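
The regression check in the CI gate is not specified on this page. One plausible shape, shown here purely as a sketch, compares the current computed scores against a stored baseline per (flow, agent_model, aspect) and reports any drop beyond a tolerance. The function name, the `tolerance` parameter, and the dictionary layout are assumptions; the check covers only the 0-10 aspects where higher is better, so latency would need the opposite comparison.

```python
Key = tuple[str, str, str]  # (flow, agent_model, aspect)


def find_regressions(baseline: dict[Key, float],
                     current: dict[Key, float],
                     tolerance: float = 0.0) -> list[tuple[Key, float, float]]:
    regressions = []
    for key, old in sorted(baseline.items()):
        new = current.get(key)
        if new is None:
            continue  # not measured in this run; nothing to compare
        if new < old - tolerance:
            regressions.append((key, old, new))
    return regressions


# Illustrative scores; the "current" values are invented for the example.
baseline = {
    ("austin_plumbing", "qwen-flash", "correctness"): 6.8,
    ("austin_plumbing", "qwen-flash", "completion"): 10.0,
}
current = {
    ("austin_plumbing", "qwen-flash", "correctness"): 6.8,
    ("austin_plumbing", "qwen-flash", "completion"): 9.0,
}

regressions = find_regressions(baseline, current)
for key, old, new in regressions:
    print(f"REGRESSION {key}: {old} -> {new}")
```

A gate built this way would fail the build when `regressions` is non-empty, which needs no judge because every compared aspect is computed.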