Auditing a Score Against Its Transcript
A score tells you that a flow scored 10. The transcript tells you why. This is how you confirm that a number reflects what actually happened on the call. If you can audit a score, audit it rather than trusting it.
The problem
The quality DB and the eval matrix report aggregate scores:
weather: correctness 10. But a number can be faithfully computed and still not
mean what you assume. Did the caller actually get what they needed? Was the slot caller-given or
defaulted? Did a “10” quietly include an escalation? You cannot tell from the number — only
from the conversation that produced it. Every eval run records those conversations as
flow-transcripts-*.jsonl (full caller+agent turns, captured slots, computed aspects);
the audit tool reads them back.
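For orientation, here is a minimal sketch of reading those files back. It is not the code in scripts/review_transcripts.py. The helper names and the `flow` record key are assumptions for illustration; the real schema is whatever the eval run writes.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_transcripts(run_dir: str, flow: str) -> list:
    """Collect one flow's records from every flow-transcripts-*.jsonl in run_dir.

    Assumes each record carries a "flow" key (hypothetical field name).
    """
    records = []
    for path in sorted(Path(run_dir).glob("flow-transcripts-*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            records.extend(r for r in parse_jsonl(fh) if r.get("flow") == flow)
    return records
```

Because each line is an independent JSON object, a single damaged record surfaces as a `json.JSONDecodeError` on that line rather than corrupting the whole run.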
The tool
scripts/review_transcripts.py does three things (run under the project venv):
    # 1. LIST every recorded conversation for a flow, with score + outcome
    .venv/bin/python scripts/review_transcripts.py --flow weather

    # 2. SHOW one conversation's full turn-by-turn transcript
    .venv/bin/python scripts/review_transcripts.py --flow weather --show 0

    # 3. VALIDATE: does the recorded correctness match the transcript?
    .venv/bin/python scripts/review_transcripts.py --flow weather --validate
Flags: --persona <name> filters by persona; --dir <run> targets a specific
run (by default the tool searches docs/flow-eval and /tmp/ab_eval/*). The same capability is
packaged as the review-transcripts skill.
The audit: differential validation
--validate independently recomputes correctness from the saved
outcome + slots, using the same deterministic rule as the live eval
(aspects.py _correctness: reached a success
terminal AND every target slot filled), and diffs it against the recorded score:
    # Audit: weather — recorded vs recomputed correctness
    # targets (must be filled for 10): ['city']
    #  persona    rec   recomp  match  final_state  basis
    0  skeptical  10.0  10.0    ✅     report       city=caller-given
    ...
    6 conversations · 0 score/transcript mismatch(es).
- rec vs recomp: recorded vs recomputed-from-transcript. They must match.
- match: ✅ if equal; ❌ MISMATCH means the recorded number disagrees with the transcript (a real bug: scoring error or corrupt record). Exit code 1.
- basis: why it scored, that is, which target slots are filled and whether each was caller-given, extracted/other, or DEFAULTED.
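The differential check can be sketched roughly as follows. This is an illustration of the rule as described here, not the code in aspects.py. The record keys (`final_state`, `slots`, `correctness`), the flat slot layout, the failing score of 0, and the sample records (including the city and the second persona) are all assumptions.

```python
FULL_SCORE = 10.0  # from the text: success terminal plus all targets filled scores 10
FAIL_SCORE = 0.0   # assumption: the text does not state the failing value


def recompute_correctness(record: dict, targets: list, success_terminals: list) -> float:
    """Re-apply the rule: reached a success terminal AND every target slot filled."""
    slots = record.get("slots", {})
    reached = record.get("final_state") in set(success_terminals)
    filled = all(slots.get(name) not in (None, "") for name in targets)
    return FULL_SCORE if reached and filled else FAIL_SCORE


def find_mismatches(records: list, targets: list, success_terminals: list) -> list:
    """Return indices of records whose recorded score disagrees with the recomputed one."""
    return [
        i for i, rec in enumerate(records)
        if rec.get("correctness") != recompute_correctness(rec, targets, success_terminals)
    ]


# Made-up records for illustration only.
records = [
    {"persona": "skeptical", "final_state": "report",
     "slots": {"city": "Oslo"}, "correctness": 10.0},
    # Corrupt record: scored 10 although the target slot is empty.
    {"persona": "rushed", "final_state": "report",
     "slots": {"city": ""}, "correctness": 10.0},
]
mismatches = find_mismatches(records, ["city"], ["report"])
print(f"{len(records)} conversations, {len(mismatches)} mismatch(es)")
```

The point of recomputing from the saved outcome and slots, rather than re-reading the stored score, is that the two values come from independent paths: a disagreement can only mean the scorer or the record is wrong.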
Two things the audit makes visible
Both were found by running this audit on flows that looked perfect:
| Finding | What the number hid |
|---|---|
| Shallow correctness. weather scores a clean 10 across personas. | Correctness only checks that a city was captured and a report terminal was reached, not that the weather content was right (it is a fixed stub). A “10” means the flow worked, not that the caller got accurate weather. (Backlog F-360.) |
| Escalation scored as success. A restaurant_reservation conversation scored correctness 10 with final_state=escalated. | The metric counts “non-failure terminal + slots filled” as success, and escalated isn't a failure status, so a booking that bailed to a human still scored 10. Fine for a message-taking flow; debatable for a booking. The audit surfaced it by printing final_state next to correctness. (Tracked as an eval-semantics item.) |
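The escalation finding comes down to which terminal states count as success. Below is a minimal sketch of the difference, not a proposed change to the metric. Only `escalated` comes from the audit; the state names `failed` and `booked` and both function names are hypothetical.

```python
FAILURE_STATES = {"failed"}          # hypothetical failure status
BOOKING_SUCCESS_STATES = {"booked"}  # hypothetical explicit success terminal


def lenient_success(final_state: str) -> bool:
    """Rule as described in the table: any terminal that is not a failure counts."""
    return final_state not in FAILURE_STATES


def strict_success(final_state: str) -> bool:
    """Alternative reading: only an explicit success terminal counts."""
    return final_state in BOOKING_SUCCESS_STATES


for state in ("booked", "escalated", "failed"):
    print(f"{state:10s} lenient={lenient_success(state)} strict={strict_success(state)}")
```

Under the lenient rule an escalated booking passes; under the strict one it does not. Which reading is right depends on the flow, which is why the audit prints final_state beside the score instead of deciding for you.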
When to reach for it
- “Is this flow really a 10?” → --validate, then read the basis and the final states.
- “Show me what the caller and agent actually said.” → --show N.
- Pair it with every held-out draw. The held-out monitoring loop runs --validate on each fresh batch, so every cycle checks both “did anything regress?” and “are the scores honest?” for almost no extra cost, since the run already produced the transcripts.
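A monitoring loop might gate on the audit along these lines, using only the command line and the exit-code behavior described above. The wrapper functions and the `runner` parameter are hypothetical, and the sketch assumes a clean audit exits with 0.

```python
import subprocess
from typing import Callable, List, Optional


def audit_command(flow: str, run_dir: Optional[str] = None) -> List[str]:
    """Build the --validate invocation for one flow, optionally pinned to a run."""
    cmd = [".venv/bin/python", "scripts/review_transcripts.py",
           "--flow", flow, "--validate"]
    if run_dir is not None:
        cmd += ["--dir", run_dir]
    return cmd


def scores_are_honest(flow: str, run_dir: Optional[str] = None,
                      runner: Callable = subprocess.run) -> bool:
    """Return True only when the audit exits with 0 (assumed to mean no mismatch).

    Any non-zero exit, including the documented exit code 1 on a mismatch,
    is treated as "not verified".
    """
    result = runner(audit_command(flow, run_dir))
    return result.returncode == 0
```

Injecting `runner` keeps the gate testable without a recorded run on disk. In the loop itself the default `subprocess.run` executes the real tool.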