Session Replay — the regression check after a change
Record a known-good conversation once. After any change, re-drive it with the exact same caller utterances and see whether the flow still behaves. Two tools: one for eval recordings, one for live production sessions.
What & why
An eval run is non-deterministic on the caller side: the persona LLM phrases things differently each time, so a score wobble can come from the caller rather than from your change. Replay removes that variable by fixing the caller's words to a recording and re-running only the side you changed (the flow + agent). A different final state or a lower correctness score is then unambiguously caused by your change.
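As a rough illustration of what "fixing the caller's words" means, here is a minimal sketch of a stand-in caller that replays recorded utterances in order and ignores whatever prompt it receives. The class name `RecordedCaller` and the method `next_utterance` are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RecordedCaller:
    """Hypothetical stand-in for the persona LLM: replays fixed utterances."""

    utterances: List[str]
    _cursor: int = field(default=0, init=False)

    def next_utterance(self, prompt: str) -> Optional[str]:
        # The prompt is deliberately ignored: the caller's words come from
        # the recording, never from a model.
        if self._cursor >= len(self.utterances):
            return None  # recording exhausted, nothing left to say
        line = self.utterances[self._cursor]
        self._cursor += 1
        return line


caller = RecordedCaller(["my sink is leaking", "anytime Wednesday"])
first = caller.next_utterance("You are an impatient homeowner...")
second = caller.next_utterance("a completely different prompt")
third = caller.next_utterance("anything")
```

Because the prompt never influences the output, two replays of the same recording put identical caller text in front of the flow, whatever the persona prompt says.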
Two tools, two recordings
| Tool | Replays | Use |
|---|---|---|
| scripts/replay_eval.py | EVAL recordings: flow-transcripts-*.jsonl (every eval run writes these) | regression-check a flow/agent change against golden eval sessions |
| scripts/replay_session.py | LIVE sessions: logs/turns.jsonl (the production turn-logger) | triage a real call: “agent claimed success, calendar shows nothing” |
replay_eval.py — the change regression check
Every eval already writes one JSON line per conversation with the full caller+agent turns. Those
are the recordings. Replay re-drives each flow against the current flow + agent using the exact
caller utterances from the recording, via a ScriptedClient that emits the recorded lines in order
and ignores the LLM prompt. A change that breaks a previously-good session shows up as a different
final state or a lower correctness score. The script exits 1 if any session regressed (correctness
dropped by more than 1pt, or a success terminal became a failure), so it can gate a loop after a
flow edit.
RIFF_EVAL_AGENT=dashscope RIFF_EVAL_AGENT_MODEL=qwen-flash \
python scripts/replay_eval.py docs/flow-eval/flow-transcripts-20260620T042340105571Z.jsonl \
--flow austin_plumbing --regressions-only
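The regression rule described above (correctness dropped by more than 1pt, or a success terminal became a failure) can be sketched as follows. This is an illustration only: `SessionResult`, `regressed`, `exit_code`, and the contents of `SUCCESS_TERMINALS` are hypothetical names and values, and the script's real data model may differ.

```python
from dataclasses import dataclass

# Assumption: which terminal states count as success is flow-specific;
# this set is made up for the example.
SUCCESS_TERMINALS = {"booked"}


@dataclass
class SessionResult:
    final_state: str
    correctness: float  # points, same scale for the recording and the replay


def regressed(recorded: SessionResult, replayed: SessionResult,
              tolerance_pts: float = 1.0) -> bool:
    """True if correctness dropped by more than the tolerance, or a
    success terminal turned into a non-success terminal."""
    dropped = recorded.correctness - replayed.correctness > tolerance_pts
    lost_success = (recorded.final_state in SUCCESS_TERMINALS
                    and replayed.final_state not in SUCCESS_TERMINALS)
    return dropped or lost_success


def exit_code(pairs) -> int:
    """1 if any (recorded, replayed) pair regressed, else 0."""
    return 1 if any(regressed(rec, rep) for rec, rep in pairs) else 0
```

A caller would hand the result to `sys.exit`, so that a loop or CI step stops on the first run that contains a regressed session.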
replay_session.py — the production fabrication triage
This one replays a recorded live session step by step, reconstructing it with a
FakeAdapter scripted from the recorded LLM side, and prints a per-turn diff of
RECORDED vs REPLAY state and slots:
python scripts/replay_session.py <session_id>
python scripts/replay_session.py <session_id> --to-turn=5 --verbose
── Turn 3 · 2026-04-22T15:12:18 ──
user_utterance: "anytime Wednesday"
RECORDED to_state=offer_scheduling slots.preferred_date=2026-04-23
REPLAY to_state=offer_scheduling slots.preferred_date=2026-04-23
tool_calls: ✓ 1 call, envelope event_id present
✓ no divergence
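A per-turn comparison of this kind could look roughly like the sketch below. The field names `to_state` and `slots` come from the output above, but the dict shape and the function `diff_turn` are assumptions made for illustration, not the script's actual code.

```python
def diff_turn(recorded: dict, replayed: dict) -> list:
    """Return human-readable divergences between a recorded and a replayed turn."""
    problems = []
    if recorded.get("to_state") != replayed.get("to_state"):
        problems.append(
            f"to_state: {recorded.get('to_state')!r} -> {replayed.get('to_state')!r}"
        )
    rec_slots = recorded.get("slots", {})
    rep_slots = replayed.get("slots", {})
    # Walk the union of slot names so added and dropped slots are both reported.
    for name in sorted(set(rec_slots) | set(rep_slots)):
        if rec_slots.get(name) != rep_slots.get(name):
            problems.append(
                f"slots.{name}: {rec_slots.get(name)!r} -> {rep_slots.get(name)!r}"
            )
    return problems


recorded = {"to_state": "offer_scheduling", "slots": {"preferred_date": "2026-04-23"}}
same = diff_turn(recorded, dict(recorded))
changed = diff_turn(
    recorded,
    {"to_state": "offer_scheduling", "slots": {"preferred_date": "2026-04-22"}},
)
```

An empty list corresponds to the "no divergence" line; any entry names the state or slot that moved between the recording and the replay.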
In a bad session, the tool-call envelope has event_id="" but the assistant text says “booked.” That is the fabrication.
The check for it runs automatically, and the script exits non-zero if it finds the pattern.
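The fabrication pattern can be expressed as a small predicate. The sketch below assumes each tool call is a dict with an `event_id` key and that a success claim can be spotted with a simple phrase match. Both are assumptions for illustration (only "booked" appears in the source), and `is_fabrication` is a hypothetical name.

```python
import re

# Assumption: the phrases that count as a success claim; the real check may differ.
SUCCESS_CLAIM = re.compile(r"\b(booked|scheduled|confirmed)\b", re.IGNORECASE)


def is_fabrication(assistant_text: str, tool_calls: list) -> bool:
    """True when the assistant claims success but no tool call returned an event_id."""
    claims_success = bool(SUCCESS_CLAIM.search(assistant_text))
    # An empty or missing event_id means nothing was actually created.
    has_event = any(call.get("event_id") for call in tool_calls)
    return claims_success and not has_event
```

A turn with no tool calls at all is flagged as well if the assistant claims success, which matches the “agent claimed success, calendar shows nothing” triage case.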
Use case
You change the slot extractor and want to be sure you didn't break a flow that was passing. Instead
of re-running a noisy eval and squinting at a ~1pt delta, you replay the golden transcripts: the caller
is identical, so any divergence in final state or correctness is your extractor change, full stop.
Replay was added this session for exactly this purpose, as the deterministic regression check after
a runtime change (commit 2979f94).
Where it fits
Replay is the deterministic complement to the eval. The matrix tells you the absolute quality (with judge noise on humanness); replay tells you whether a specific change moved a specific session, with no caller-side noise at all. It pairs with the CI gate (which catches computed-aspect regressions on changed flows) to cover both “did the flow change?” and “did the runtime change?”