Per-State Metrics — the root-cause layer
Tier 1. When a flow scores badly, this layer says exactly which state pulled it down — computed deterministically from the live event stream, no LLM.
What & why
The flow matrix tells you a flow is weak. It does not tell you where. The per-state layer
is the answer: one row per (conversation, state), with a handful of metrics that localize
the failure to a single node. It complements the conversation/flow truth in
aspect_scores — it does not replace it. Think of aspect_scores as the
verdict and state_scores as the autopsy.
How it works — the event stream
Every conversation runs on a per-call event bus (riff/events.py).
As the FSM drives the call it emits typed events: state_entered,
state_exited{to_state, reason}, turn_complete{latency_ms},
guard_evaluated{error, result}, and slot_filled{slot, value}. The driver
(riff/flow_eval/driver.py) subscribes to the bus, collects the events for that one
session, and hands them to compute_state_aspects
(riff/flow_eval/state_aspects.py), which walks the visit timeline and produces the
per-state rows. Because it reads only events, it is fully deterministic.
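As a rough illustration of the "walk the visit timeline" step, the sketch below groups an ordered event list into one record per state entry. It is not the code in riff/flow_eval/state_aspects.py: the dict-shaped events, the `type` and `state` keys, and the `Visit` and `build_visits` names are assumptions made for this example. Only the event names and their payload fields come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Visit:
    """One entry into a state, plus everything observed until the next entry."""
    state: str
    to_state: Optional[str] = None
    reason: Optional[str] = None
    latencies_ms: list = field(default_factory=list)
    guard_errors: int = 0
    slots: list = field(default_factory=list)


def build_visits(events: list) -> list:
    """Walk an ordered event list and return one Visit per state_entered event."""
    visits = []
    current: Optional[Visit] = None
    for ev in events:
        kind = ev["type"]  # assumed key; the real event objects may differ
        if kind == "state_entered":
            current = Visit(state=ev["state"])
            visits.append(current)
        elif current is None:
            continue  # ignore events that arrive before any state is entered
        elif kind == "turn_complete":
            current.latencies_ms.append(ev["latency_ms"])
        elif kind == "guard_evaluated":
            if ev.get("error"):
                current.guard_errors += 1
        elif kind == "slot_filled":
            current.slots.append(ev["slot"])
        elif kind == "state_exited":
            current.to_state = ev["to_state"]
            current.reason = ev.get("reason")
    return visits


# Hypothetical two-state session: the second state never exits.
events = [
    {"type": "state_entered", "state": "greet"},
    {"type": "turn_complete", "latency_ms": 900},
    {"type": "state_exited", "to_state": "collect_details", "reason": "intent"},
    {"type": "state_entered", "state": "collect_details"},
    {"type": "turn_complete", "latency_ms": 8800},
]
visits = build_visits(events)
```

Because the function is a pure fold over the event list, the same events always produce the same visits, which is the property the deterministic claim rests on.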
The metrics
| Metric | Definition (as computed) |
|---|---|
| progress | 1 if the state exited to a forward state (advanced), else 0 |
| stall | 1 if it is the last state of a non-completed conversation (stuck here) |
| escalation | 1 if it exited to an escalation terminal (escalated/escalate) |
| revisit | re-entries = entries − 1 (oscillation / re-ask) |
| dwell_turns | average agent turns spent in the state per visit |
| latency_p95_ms | 95th-percentile turn latency while in the state |
| guard_error | count of guard_evaluated errors |
| slot_fill_rate | (collect states) required slots filled before exit ÷ required |
| to_states | the set of states this one led to, used for the surrounding-states / blast-radius view |
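
To make a few of these definitions concrete, here is a minimal sketch that derives revisit, dwell_turns, latency_p95_ms, stall, escalation, and to_states for one state of one conversation. Several choices are assumptions rather than facts about the real implementation: the visit dicts, the `state_row` and `p95` names, the nearest-rank percentile method, and counting one turn per recorded latency.

```python
import math

ESCALATION_TERMINALS = {"escalated", "escalate"}  # terminal names from the table


def p95(values):
    """Nearest-rank 95th percentile (assumed method); None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def state_row(visits, state, completed):
    """Summarize one state from the ordered visits of a single conversation.

    Each visit is a dict with keys state, to_state, latencies_ms (assumed shape).
    """
    mine = [v for v in visits if v["state"] == state]
    if not mine:
        return None
    latencies = [ms for v in mine for ms in v["latencies_ms"]]
    is_last = visits[-1]["state"] == state
    return {
        "revisit": len(mine) - 1,                   # entries minus one
        "dwell_turns": len(latencies) / len(mine),  # one turn per latency sample
        "latency_p95_ms": p95(latencies),
        "stall": int(is_last and not completed),
        "escalation": int(any(v["to_state"] in ESCALATION_TERMINALS for v in mine)),
        "to_states": sorted({v["to_state"] for v in mine if v["to_state"]}),
    }


# Hypothetical conversation that bounces between two states and never completes.
visits = [
    {"state": "triage", "to_state": "collect_details", "latencies_ms": [1200]},
    {"state": "collect_details", "to_state": "triage", "latencies_ms": [2000, 8800]},
    {"state": "triage", "to_state": "collect_details", "latencies_ms": [1500]},
    {"state": "collect_details", "to_state": None, "latencies_ms": [3000]},
]
row = state_row(visits, "collect_details", completed=False)
```

The progress metric is left out on purpose: it depends on what counts as a "forward" state, which the table does not spell out.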
A listen state legitimately reads slot_fill=0 when the caller
front-loaded the slot earlier (it was filled before this state ran). That is healthy
behavior, not a defect. The real signal is low slot_fill combined with low progress: the state is
genuinely stuck collecting. The --analyze root-cause view gates the slot_fill clause on
progress < 0.7 for exactly this reason (commit dacb181).
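The gate can be pictured as a two-part condition. The sketch below is an illustration only: the 0.7 progress threshold is the one quoted above, but the function name and the 0.5 cutoff for "low" slot fill are hypothetical, and the real --analyze logic may combine the clauses differently.

```python
PROGRESS_GATE = 0.7  # threshold quoted in the text
LOW_SLOT_FILL = 0.5  # hypothetical cutoff for "low" slot fill, not from the source


def slot_fill_is_suspect(progress, slot_fill):
    """Flag low slot fill only when the state is also failing to advance."""
    if slot_fill is None:  # not a collect state, so there is nothing to judge
        return False
    return progress < PROGRESS_GATE and slot_fill < LOW_SLOT_FILL


# A state that advances despite slot_fill=0 (the caller front-loaded the slot).
front_loaded = slot_fill_is_suspect(progress=1.0, slot_fill=0.0)
# A state that neither fills its slots nor advances.
stuck = slot_fill_is_suspect(progress=0.6, slot_fill=0.1)
```

With the gate in place, the front-loaded case is not reported while the stuck case is.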
Per-state slot evidence (shipped this session)
For slot_fill_rate to attribute a fill to the right state, a
slot_filled event must fire for each public slot at the moment it is committed, tagged
with the state that collected it. This session moved that emit to the single commit choke point in
StateManager.commit_transition (commit 43b232b), so each slot is captured
exactly once (tracked in ctx._emitted_slots) and falsy-but-valid values like
0 / False still count as filled. This is phase 2 of the goal hierarchy —
the gating prerequisite for any group metric.
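The two properties described here, exactly-once emission and counting falsy-but-valid values, can be sketched as follows. This is not the body of StateManager.commit_transition. The function name, the dict-shaped event, the `emit` callback, and the leading-underscore convention for non-public slots are all assumptions for illustration; the `emitted` set stands in for ctx._emitted_slots.

```python
def emit_new_slots(slots, emitted, state, emit):
    """Emit slot_filled once per public slot, tagged with the collecting state."""
    for name, value in slots.items():
        if name.startswith("_"):  # assumed marker for non-public slots
            continue
        if value is None:         # only None means "unfilled"; 0 and False are valid
            continue
        if name in emitted:       # already reported at an earlier commit
            continue
        emitted.add(name)
        emit({"type": "slot_filled", "slot": name, "value": value, "state": state})


# Hypothetical slot values, committed twice from two different states.
events = []
emitted = set()
slots = {"guests": 0, "pets": False, "name": None, "_internal": "x"}
emit_new_slots(slots, emitted, "collect_details", events.append)
emit_new_slots(slots, emitted, "confirm", events.append)
```

The check is `value is None` rather than a truthiness test, which is what lets `0` and `False` count as filled. The second call emits nothing, so the fill stays attributed to the state that first committed it.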
Use case
You ran the nightly eval. The matrix shows apartment_viewing at correctness 3.3 on
gemini. Why? Open the state matrix and the answer is one row: collect_details
stalls 40% of the time with a low slot-fill and an 8.8s p95 latency. Now you have a node to fix, not a
flow to stare at.
Example
Ingest the per-state metrics from the transcripts, then read the worst-states-first matrix:
    # 1. fold per-state metrics out of the transcripts into state_scores (tags each row's segment)
    python scripts/eval_db.py --ingest-state-aspects --dirs /tmp/ab_eval/fullcalib

    # 2. worst states first (high stall / low progress)
    python scripts/eval_db.py --state-matrix
Real output (truncated) from the live quality.db:
    # Per-state quality — all flows (worst first)

    | Flow                     | State                      | n  | progress | stall   | escal | revisit | dwell | lat_p95(s) | slot_fill |
    |--------------------------|----------------------------|----|----------|---------|-------|---------|-------|------------|-----------|
    | ceo_command_center       | status_query               | 2  | 0.0      | **1.0** | 0.0   | 0.0     | 5.0   | 3.1        | ·         |
    | property_trouble_ticket  | invalid_verification_code  | 13 | 0.0      | **1.0** | 0.0   | 0.0     | 1.0   | 2.6        | ·         |
    | ceo_command_center       | dispatch_send              | 8  | 1.0      | **0.5** | 0.0   | 1.38    | 1.2   | 0.0        | ·         |
    | apartment_viewing        | collect_details            | 5  | 0.6      | **0.4** | 0.0   | 0.0     | 2.4   | 8.8        | 0.1       |
    | ceo_command_center       | dispatch_intent            | 8  | 1.0      | **0.38**| 0.0   | 1.75    | 1.3   | 1.7        | ·         |
    | dog_grooming             | listen_owner               | 9  | 0.22     | **0.33**| 0.44  | 0.0     | 2.6   | 1.7        | 0.0       |
    | ceo_command_center       | triage                     | 13 | 1.0      | **0.31**| 0.0   | 3.31    | 0.6   | 1.6        | ·         |

    stall=fraction stuck here · progress=fraction that advanced · revisit=oscillation.
    High stall + low progress = the state pulling the flow down.
Add --flow apartment_viewing to also print each state's neighbors (the
to_states graph), so you can see the blast radius — which downstream states a broken
node starves.
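One way to read the blast radius off a to_states graph is a breadth-first walk from the broken node. The sketch below assumes the graph is available as a plain dict of state name to successor list; that shape, the `downstream` function, and the state names other than collect_details are hypothetical. The text only says that --flow prints each state's immediate neighbors, so the transitive walk is an extension made here for illustration.

```python
from collections import deque


def downstream(to_states, start):
    """Return every state reachable from start, excluding start itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in to_states.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical flow graph with a loop back into the broken node.
graph = {
    "collect_details": ["confirm_slot"],
    "confirm_slot": ["book", "collect_details"],
    "book": [],
}
starved = downstream(graph, "collect_details")
```

The `seen` set keeps the walk finite even when states loop back on each other, which the revisit column suggests is common.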
Where it fits
The per-state layer is the bottom of the drill-down. The auto-analysis
(--analyze) ranks the worst states
across all flows for you; --state-matrix is the full table; and the segment rollup
(group cohesion) aggregates these same rows into the
middle tier. The robustness rule from Codex applies throughout: errored conversations (a broken agent
integration, an absent backend) are excluded from ingest — they are not a flow
signal and must not be scored as a zero.
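The effect of the exclusion rule is easy to see on a toy aggregate. In the sketch below, the record shape, the `error` field, and the `scoreable` helper are assumptions for illustration, not the ingest code in scripts/eval_db.py.

```python
from statistics import mean


def scoreable(conversations):
    """Keep only conversations that did not fail for infrastructure reasons."""
    return [c for c in conversations if c.get("error") is None]


# Hypothetical runs: two real conversations and one that never reached the backend.
runs = [
    {"id": "a", "error": None, "progress": 1},
    {"id": "b", "error": None, "progress": 0},
    {"id": "c", "error": "backend unavailable", "progress": None},
]
kept = scoreable(runs)
avg_progress = mean(c["progress"] for c in kept)
```

Scoring the errored run as a zero would drag the average down to about 0.33 and blame the flow for an outage; dropping it leaves the average at 0.5, which reflects only conversations that actually ran.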