
Per-State Metrics — the root-cause layer

Tier 1. When a flow scores badly, this layer says exactly which state pulled it down — computed deterministically from the live event stream, no LLM.

What & why

The flow matrix tells you a flow is weak. It does not tell you where. The per-state layer is the answer: one row per (conversation, state), with a handful of metrics that localize the failure to a single node. It complements the conversation/flow truth in aspect_scores — it does not replace it. Think of aspect_scores as the verdict and state_scores as the autopsy.

How it works — the event stream

Every conversation runs on a per-call event bus (riff/events.py). As the FSM drives the call it emits typed events: state_entered, state_exited{to_state, reason}, turn_complete{latency_ms}, guard_evaluated{error, result}, and slot_filled{slot, value}. The driver (riff/flow_eval/driver.py) subscribes to the bus, collects the events for that one session, and hands them to compute_state_aspects (riff/flow_eval/state_aspects.py), which walks the visit timeline and produces the per-state rows. Because it reads only events, it is fully deterministic.
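The walk over the visit timeline can be sketched roughly as follows. This is an illustration, not the code in riff/flow_eval/state_aspects.py: the event shape (a dict with a "type" key, and a "state" field on state_entered) and the names StateRow and sketch_state_rows are assumptions made for the example. Only the event types and the to_state, latency_ms, and error fields come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class StateRow:
    """Hypothetical per-(conversation, state) accumulator."""
    state: str
    entries: int = 0
    agent_turns: int = 0
    latencies_ms: list = field(default_factory=list)
    guard_errors: int = 0
    to_states: set = field(default_factory=set)

    @property
    def revisit(self):
        # re-entries = entries - 1
        return max(self.entries - 1, 0)

    @property
    def dwell_turns(self):
        # average turns spent in the state per visit
        return self.agent_turns / self.entries if self.entries else 0.0


def sketch_state_rows(events):
    """Fold one session's ordered events into one row per state."""
    rows = {}
    current = None  # state the call is currently in, if any
    for ev in events:
        kind = ev["type"]
        if kind == "state_entered":
            current = ev["state"]  # assumed field name
            row = rows.setdefault(current, StateRow(current))
            row.entries += 1
        elif current is not None:
            row = rows[current]
            if kind == "turn_complete":
                row.agent_turns += 1
                row.latencies_ms.append(ev["latency_ms"])
            elif kind == "guard_evaluated" and ev.get("error"):
                row.guard_errors += 1
            elif kind == "state_exited":
                row.to_states.add(ev["to_state"])
                current = None
    return rows
```

Because the function depends only on the ordered event list, replaying the same events always yields the same rows, which is the determinism property the layer relies on.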

The metrics

| Metric | Definition (as computed) |
|--------|--------------------------|
| progress | 1 if the state exited to a forward state (advanced), else 0 |
| stall | 1 if it is the last state of a non-completed conversation (stuck here) |
| escalation | 1 if it exited to an escalation terminal (escalated/escalate) |
| revisit | re-entries = entries − 1 (oscillation / re-ask) |
| dwell_turns | average agent turns spent in the state per visit |
| latency_p95_ms | 95th-percentile turn latency while in the state |
| guard_error | count of guard_evaluated errors |
| slot_fill_rate | (collect states) required slots filled before exit ÷ required |
| to_states | the set of states this one led to, used for the surrounding-states / blast-radius view |

slot_fill_rate alone is noisy — gate it on progress. A listen state legitimately reads slot_fill=0 when the caller front-loaded the slot earlier (it was filled before this state ran). That is healthy behavior, not a defect. The real signal is low slot_fill and low progress — genuinely stuck collecting. The --analyze root-cause view gates the slot_fill clause on progress < 0.7 for exactly this reason (commit dacb181).
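As a minimal sketch of that gating rule: the 0.7 progress gate comes from the text above, while the function name and the 0.5 slot-fill floor are assumptions picked for illustration and are not taken from the --analyze implementation.

```python
def is_stuck_collecting(progress, slot_fill_rate,
                        progress_gate=0.7, slot_fill_floor=0.5):
    """Flag a state only when low slot fill coincides with low progress.

    progress_gate mirrors the documented `progress < 0.7` gate;
    slot_fill_floor is a hypothetical threshold for "low" slot fill.
    """
    if slot_fill_rate is None:
        # Not a collect state: there is no slot-fill signal to gate.
        return False
    return slot_fill_rate < slot_fill_floor and progress < progress_gate


# Front-loaded slot: fill reads 0 but the state advanced, so it is healthy.
print(is_stuck_collecting(progress=1.0, slot_fill_rate=0.0))  # False
# Low fill and low progress together: genuinely stuck collecting.
print(is_stuck_collecting(progress=0.6, slot_fill_rate=0.1))  # True
```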

Per-state slot evidence (shipped this session)

For slot_fill_rate to attribute a fill to the right state, a slot_filled event must fire for each public slot at the moment it is committed, tagged with the state that collected it. This session moved that emit to the single commit choke point in StateManager.commit_transition (commit 43b232b), so each slot is captured exactly once (tracked in ctx._emitted_slots) and falsy-but-valid values like 0 / False still count as filled. This is phase 2 of the goal hierarchy — the gating prerequisite for any group metric.
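The emit-once behavior can be illustrated with a small stand-in. This is not the code in StateManager.commit_transition: the SlotEmitter class, its commit method, and the event dict shape are hypothetical, and the sketch ignores the public/private slot distinction. It only shows the two properties described above: each slot is emitted once, and values are tested against None rather than for truthiness, so 0 and False count as filled.

```python
class SlotEmitter:
    """Hypothetical stand-in for the single commit choke point."""

    def __init__(self, emit):
        self._emit = emit            # callable that receives each event
        self._emitted_slots = set()  # slots already reported this call

    def commit(self, state, slots):
        for name, value in slots.items():
            # `is None`, not `not value`: 0 and False are valid fills.
            if value is None or name in self._emitted_slots:
                continue
            self._emitted_slots.add(name)
            self._emit({"type": "slot_filled", "slot": name,
                        "value": value, "state": state})


captured = []
emitter = SlotEmitter(captured.append)
emitter.commit("collect_details", {"pets": 0, "consent": False, "name": None})
emitter.commit("confirm", {"pets": 0, "name": "Ada"})
for ev in captured:
    print(ev["state"], ev["slot"], repr(ev["value"]))
```

In this run the second commit reports only name, because pets was already attributed to collect_details on the first commit.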

Use case

You ran the nightly eval. The matrix shows apartment_viewing at correctness 3.3 on gemini. Why? Open the state matrix and the answer is one row: collect_details stalls 40% of the time with low slot-fill and an 8.8s p95 latency. Now you have a node to fix, not a flow to stare at.

Example

Ingest the per-state metrics from the transcripts, then read the worst-states-first matrix:

# 1. fold per-state metrics out of the transcripts into state_scores (tags each row's segment)
python scripts/eval_db.py --ingest-state-aspects --dirs /tmp/ab_eval/fullcalib

# 2. worst states first (high stall / low progress)
python scripts/eval_db.py --state-matrix

Real output (truncated) from the live quality.db:

# Per-state quality — all flows (worst first)

| Flow                     | State                      | n  | progress | stall   | escal | revisit | dwell | lat_p95(s) | slot_fill |
|--------------------------|----------------------------|----|----------|---------|-------|---------|-------|------------|-----------|
| ceo_command_center       | status_query               | 2  | 0.0      | **1.0** | 0.0   | 0.0     | 5.0   | 3.1        | ·         |
| property_trouble_ticket  | invalid_verification_code  | 13 | 0.0      | **1.0** | 0.0   | 0.0     | 1.0   | 2.6        | ·         |
| ceo_command_center       | dispatch_send              | 8  | 1.0      | **0.5** | 0.0   | 1.38    | 1.2   | 0.0        | ·         |
| apartment_viewing        | collect_details            | 5  | 0.6      | **0.4** | 0.0   | 0.0     | 2.4   | 8.8        | 0.1       |
| ceo_command_center       | dispatch_intent            | 8  | 1.0      | **0.38**| 0.0   | 1.75    | 1.3   | 1.7        | ·         |
| dog_grooming             | listen_owner               | 9  | 0.22     | **0.33**| 0.44  | 0.0     | 2.6   | 1.7        | 0.0       |
| ceo_command_center       | triage                     | 13 | 1.0      | **0.31**| 0.0   | 3.31    | 0.6   | 1.6        | ·         |

stall=fraction stuck here · progress=fraction that advanced · revisit=oscillation.
High stall + low progress = the state pulling the flow down.
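To make the columns concrete, here is a rough sketch of how per-conversation rows could be rolled up into a worst-first table. The row shape (dicts with flow, state, progress, and stall keys) and the function name are assumptions for the example. Sorting by stall descending matches the output above, but the tie-break on progress is a guess, and the real --state-matrix query may differ.

```python
from collections import defaultdict
from statistics import mean


def sketch_state_matrix(rows):
    """Average per-conversation 0/1 metrics per (flow, state), worst first."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["flow"], r["state"])].append(r)

    table = []
    for (flow, state), rs in groups.items():
        table.append({
            "flow": flow,
            "state": state,
            "n": len(rs),  # number of (conversation, state) rows
            "progress": mean(r["progress"] for r in rs),
            "stall": mean(r["stall"] for r in rs),
        })
    # Highest stall first; among ties, lowest progress first (assumed).
    table.sort(key=lambda t: (-t["stall"], t["progress"]))
    return table
```

Since progress and stall are 0/1 per row, their means are the "fraction that advanced" and "fraction stuck here" from the legend.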

Add --flow apartment_viewing to also print each state's neighbors (the to_states graph), so you can see the blast radius — which downstream states a broken node starves.
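One way to read "blast radius" is as everything reachable from a state through the to_states graph. The sketch below assumes that reading and computes the reachable set with a breadth-first walk. The function name and the example graph are hypothetical (only collect_details is a real state name from the output above), and the CLI itself may print only direct neighbors.

```python
from collections import deque


def downstream_states(to_states, start):
    """Return every state reachable from `start` via to_states edges."""
    seen = set()
    queue = deque(to_states.get(start, ()))
    while queue:
        state = queue.popleft()
        if state == start or state in seen:
            continue  # skip self-loops and already-visited states
        seen.add(state)
        queue.extend(to_states.get(state, ()))
    return seen


# Hypothetical to_states graph for one flow.
graph = {
    "greet": {"collect_details"},
    "collect_details": {"confirm", "collect_details"},
    "confirm": {"book", "collect_details"},
    "book": set(),
}
print(sorted(downstream_states(graph, "collect_details")))  # ['book', 'confirm']
```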

Where it fits

The per-state layer is the bottom of the drill-down. The auto-analysis (--analyze) ranks the worst states across all flows for you; --state-matrix is the full table; and the segment rollup (group cohesion) aggregates these same rows into the middle tier. The robustness rule from Codex applies throughout: errored conversations (a broken agent integration, an absent backend) are excluded from ingest — they are not a flow signal and must not be scored as a zero.