Auditing a Score Against Its Transcript

A score tells you that a flow scored 10. The transcript tells you why. This page shows how to confirm that a number reflects what actually happened on the call. Never take a score on trust when you can audit it.

The problem

The quality DB and the eval matrix report aggregate scores, for example weather: correctness 10. But a number can be faithfully computed and still not mean what you assume. Did the caller actually get what they needed? Was the slot caller-given or defaulted? Did a “10” quietly include an escalation? You cannot tell from the number, only from the conversation that produced it. Every eval run records those conversations as flow-transcripts-*.jsonl (full caller+agent turns, captured slots, computed aspects); the audit tool reads them back.
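
To give a sense of what reading those files back involves, here is a minimal sketch of a loader. It assumes one JSON object per line (the usual JSONL convention) and makes no assumption about the keys inside each object. The name load_transcripts is hypothetical and is not taken from scripts/review_transcripts.py.

```python
import json
from pathlib import Path


def load_transcripts(run_dir):
    """Yield one dict per recorded conversation found under run_dir.

    Assumption: each flow-transcripts-*.jsonl file holds one JSON object
    per line. The keys inside each object are not assumed here.
    """
    for path in sorted(Path(run_dir).glob("flow-transcripts-*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)
```

Yielding records lazily keeps memory use flat even when a run contains many long conversations.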

The tool

scripts/review_transcripts.py does three things (run under the project venv):

# 1. LIST every recorded conversation for a flow, with score + outcome
.venv/bin/python scripts/review_transcripts.py --flow weather

# 2. SHOW one conversation's full turn-by-turn transcript
.venv/bin/python scripts/review_transcripts.py --flow weather --show 0

# 3. VALIDATE — does the recorded correctness match the transcript?
.venv/bin/python scripts/review_transcripts.py --flow weather --validate

Flags: --persona <name> filters by persona; --dir <run> targets a specific run (by default the tool searches docs/flow-eval and /tmp/ab_eval/*). The same capability is packaged as the review-transcripts skill.

The audit: differential validation

--validate independently recomputes correctness from the saved outcome + slots, using the same deterministic rule as the live eval (aspects.py _correctness: reached a success terminal AND every target slot filled), and diffs it against the recorded score:

# Audit: weather — recorded vs recomputed correctness
# targets (must be filled for 10): ['city']
   # persona          rec  recomp  match  final_state   basis
   0 skeptical        10.0  10.0   ✅     report        city=caller-given
   ...
6 conversations · 0 score/transcript mismatch(es).

0 mismatches ≠ “the score is meaningful.” The audit proves a score is honestly computed, not that it means what you think. That is why it always prints the basis, so you can judge the second question yourself.
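
The recompute-and-diff idea can be sketched in a few lines. This is an illustration under stated assumptions, not the code in aspects.py: the record keys (final_state, slots, correctness), the status names in FAILURE_STATES, the score of 0.0 when the rule is not met, and both function names are hypothetical.

```python
# Hypothetical failure status names; the real set lives in the eval code.
FAILURE_STATES = {"failed", "abandoned"}


def recompute_correctness(record, targets):
    """Apply the rule: non-failure terminal AND every target slot filled.

    Assumption: the score is 10.0 when the rule holds and 0.0 otherwise.
    """
    reached_success = record["final_state"] not in FAILURE_STATES
    slots_filled = all(record["slots"].get(name) for name in targets)
    return 10.0 if reached_success and slots_filled else 0.0


def audit(records, targets):
    """Return indices of records whose recorded score differs from the recomputed one."""
    return [
        index
        for index, record in enumerate(records)
        if record["correctness"] != recompute_correctness(record, targets)
    ]


records = [
    {"final_state": "report", "slots": {"city": "Oslo"}, "correctness": 10.0},
    {"final_state": "report", "slots": {"city": ""}, "correctness": 10.0},
]
print(audit(records, targets=["city"]))  # [1]
```

The second record is flagged because its recorded 10 cannot be reproduced from an empty city slot. An empty result only says that recorded and recomputed scores agree, which is the same limit the note above describes.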

Two things the audit makes visible

Both were found by running this audit on flows that looked perfect:

Finding | What the number hid
Shallow correctness | weather scores a clean 10 across personas. Correctness only checks that a city was captured and a report terminal was reached, not that the weather content was right (it is a fixed stub). A “10” means the flow worked, not that the caller got accurate weather. (Backlog F-360.)
Escalation scored as success | A restaurant_reservation conversation scored correctness 10 with final_state=escalated. The metric counts “non-failure terminal + slots filled” as success, and escalated isn't a failure status, so a booking that bailed to a human still scored 10. Fine for a message-taking flow; debatable for a booking. The audit surfaced it by printing final_state next to correctness. (Tracked as an eval-semantics item.)
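
A check in the same spirit as the second finding: list conversations that carry a perfect score yet ended in an escalation. This is a sketch only. The record keys (persona, correctness, final_state), the function name, and the "hurried" persona in the sample data are hypothetical.

```python
def perfect_but_escalated(records, perfect=10.0, state="escalated"):
    """Return (index, persona) for records scored `perfect` that ended in `state`."""
    return [
        (index, record.get("persona"))
        for index, record in enumerate(records)
        if record["correctness"] == perfect and record["final_state"] == state
    ]


# Illustrative data, not taken from a real run.
records = [
    {"persona": "skeptical", "correctness": 10.0, "final_state": "report"},
    {"persona": "hurried", "correctness": 10.0, "final_state": "escalated"},
]
print(perfect_but_escalated(records))  # [(1, 'hurried')]
```

Whether a hit is a problem depends on the flow: an escalation may be a fine outcome for message-taking and a questionable one for a booking.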

When to reach for it