
Auditing a Score Against Its Transcript

A score tells you that a flow scored 10. The transcript tells you why. This page shows how to confirm that a number reflects what actually happened on the call. The working rule: never trust a score you cannot audit.

The problem

The quality DB and the eval matrix report aggregate scores such as weather: correctness 10. But a number can be faithfully computed and still not mean what you assume. Did the caller actually get what they needed? Was the slot caller-given or defaulted? Did a “10” quietly include an escalation? The number cannot answer these questions; only the conversation that produced it can. Every eval run records those conversations as flow-transcripts-*.jsonl (full caller+agent turns, captured slots, computed aspects), and the audit tool reads them back.
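For ad hoc inspection outside the tool, the recorded files can also be read directly. The following is a minimal sketch, assuming the files follow the usual JSONL convention of one JSON object per line; load_transcripts is a hypothetical helper, not part of the project, and the keys inside each record are not specified here.

```python
import glob
import json
import os


def load_transcripts(run_dir):
    """Load every record from flow-transcripts-*.jsonl files in run_dir.

    Assumption: each non-empty line is one JSON object (standard JSONL).
    """
    records = []
    pattern = os.path.join(run_dir, "flow-transcripts-*.jsonl")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    records.append(json.loads(line))
    return records
```

For anything beyond a quick look, prefer scripts/review_transcripts.py, which knows the actual record layout.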

The tool

scripts/review_transcripts.py does three things (run under the project venv):

# 1. LIST every recorded conversation for a flow, with score + outcome
.venv/bin/python scripts/review_transcripts.py --flow weather

# 2. SHOW one conversation's full turn-by-turn transcript
.venv/bin/python scripts/review_transcripts.py --flow weather --show 0

# 3. VALIDATE — does the recorded correctness match the transcript?
.venv/bin/python scripts/review_transcripts.py --flow weather --validate

Flags: --persona <name> filters; --dir <run> targets a specific run (default searches docs/flow-eval + /tmp/ab_eval/*). The same capability is packaged as the review-transcripts skill.

The audit: differential validation

--validate independently recomputes correctness from the saved outcome and slots, using the same deterministic rule as the live eval (_correctness in aspects.py: the flow reached a success terminal, meaning any non-failure terminal, AND every target slot is filled), and diffs the result against the recorded score:

# Audit: weather — recorded vs recomputed correctness
# targets (must be filled for 10): ['city']
   # persona          rec  recomp  match  final_state   basis
   0 skeptical        10.0  10.0   ✅     report        city=caller-given
   ...
6 conversations · 0 score/transcript mismatch(es).

rec vs recomp: the recorded score vs the score recomputed from the transcript. They must match.
match: ✅ if equal; ❌ MISMATCH means the recorded number disagrees with the transcript (a real bug: scoring error or corrupt record). On any mismatch the tool exits with code 1.
basis: why it scored, that is, which target slots are filled and whether each was caller-given, extracted/other, or DEFAULTED.
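The recompute-and-diff idea can be sketched in a few lines. This is an illustration under stated assumptions, not the code in aspects.py: the record keys (final_state, slots, correctness), the FAILURE_STATES set, and the 0.0 score for a failed check are all hypothetical.

```python
from typing import Iterable, Mapping

# Assumption: terminal states treated as failures. The real set is defined
# by the project's scoring rule and may differ.
FAILURE_STATES = frozenset({"failed", "abandoned", "error"})


def recompute_correctness(final_state: str,
                          slots: Mapping[str, object],
                          targets: Iterable[str]) -> float:
    """10.0 if a non-failure terminal was reached and every target slot is filled."""
    reached_ok = final_state not in FAILURE_STATES
    all_filled = all(slots.get(name) not in (None, "") for name in targets)
    # Assumption: anything short of full success scores 0.0.
    return 10.0 if (reached_ok and all_filled) else 0.0


def audit(records: Iterable[Mapping], targets: list) -> list:
    """Diff each record's stored correctness against the recomputed value."""
    rows = []
    for i, rec in enumerate(records):
        recomputed = recompute_correctness(rec["final_state"], rec["slots"], targets)
        rows.append({
            "index": i,
            "recorded": rec["correctness"],
            "recomputed": recomputed,
            "match": rec["correctness"] == recomputed,
        })
    return rows
```

Because the recomputation shares the scoring rule, a match shows only that the stored number is consistent with the stored transcript. It says nothing about whether the rule itself measures the right thing.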

0 mismatches ≠ “the score is meaningful.” The audit proves a score is honestly computed, not that it means what you think. That is why it always prints the basis — so you can judge the second question yourself.

Two things the audit makes visible

Both were found by running this audit on flows that looked perfect:

Finding	What the number hid
Shallow correctness. `weather` scores a clean 10 across personas.	Correctness only checks a city was captured and a report terminal was reached — not that the weather content was right (it is a fixed stub). A “10” means the flow worked, not that the caller got accurate weather. (Backlog F-360.)
Escalation scored as success. A `restaurant_reservation` conversation scored correctness 10 with `final_state=escalated`.	The metric counts “non-failure terminal + slots filled” as success, and `escalated` isn't a failure status — so a booking that bailed to a human still scored 10. Fine for a message-taking flow; debatable for a booking. The audit surfaced it by printing `final_state` next to `correctness`. (Tracked as an eval-semantics item.)
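Both findings came from reading the basis and final state next to a perfect score, and that reading can be partly automated. The sketch below flags conversations that scored 10 but ended escalated or relied on a defaulted slot. The record keys (correctness, final_state, slot_origins) are hypothetical stand-ins for whatever the transcript records actually contain.

```python
def suspicious_tens(records):
    """Yield (index, reason) for records scored 10 that deserve a second look."""
    for i, rec in enumerate(records):
        if rec.get("correctness") != 10.0:
            continue  # only perfect scores are of interest here
        if rec.get("final_state") == "escalated":
            yield i, "scored 10 but ended in final_state=escalated"
        # Assumption: slot_origins maps slot name -> origin label.
        for slot, origin in rec.get("slot_origins", {}).items():
            if origin == "DEFAULTED":
                yield i, f"scored 10 but slot {slot!r} was DEFAULTED"
```

A flagged record is not necessarily wrong. As noted above, an escalation may be an acceptable outcome for a message-taking flow; the flag only marks where a human should read the transcript.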

When to reach for it

“Is this flow really a 10?” → --validate, then read the basis and the final states.
“Show me what the caller and agent actually said.” → --show N.
Pair it with every held-out draw. The held-out monitoring loop runs --validate on each fresh batch, so every cycle checks both “did anything regress?” and “are the scores honest?” at almost no extra cost, since the run already produced the transcripts.
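Since --validate exits with code 1 on a mismatch, a monitoring loop can gate on the exit status. The following is a minimal sketch, not the project's actual loop: build_validate_cmd and audit_batch are hypothetical helpers, and it assumes the tool exits 0 when there are no mismatches.

```python
import subprocess


def build_validate_cmd(flow, run_dir=None, python=".venv/bin/python"):
    """Build the --validate command line for one flow."""
    cmd = [python, "scripts/review_transcripts.py", "--flow", flow, "--validate"]
    if run_dir is not None:
        cmd += ["--dir", run_dir]
    return cmd


def audit_batch(flows, run_dir=None, runner=subprocess.run):
    """Run --validate per flow; return the flows whose audit exited non-zero.

    Assumption: exit code 0 means no mismatches. A non-zero code may also
    indicate a crash, so inspect the tool's output before drawing conclusions.
    """
    failed = []
    for flow in flows:
        result = runner(build_validate_cmd(flow, run_dir))
        if result.returncode != 0:
            failed.append(flow)
    return failed
```

The runner parameter exists only so the loop can be exercised without invoking the real script; in normal use the default subprocess.run is enough.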

Tool: scripts/review_transcripts.py · skill: review-transcripts · score rule: riff/flow_eval/aspects.py (_correctness). See also the evaluation matrix and methodology.