
Session Replay — the regression check after a change

Record a known-good conversation once. After any change, re-drive it with the exact same caller utterances and see whether the flow still behaves. Two tools: one for eval recordings, one for live production sessions.

What & why

An eval run is non-deterministic on the caller side: the persona LLM phrases things differently each time, so a score wobble can come from the caller rather than from your change. Replay removes that variable. It fixes the caller's words to a recording and re-runs only the side you changed (the flow + agent), so a different final state or a lower correctness score can no longer be blamed on caller phrasing and points at your change.

Two tools, two recordings

Tool	Replays	Use
`scripts/replay_eval.py`	EVAL recordings — `flow-transcripts-*.jsonl` (every eval run writes these)	regression-check a flow/agent change against golden eval sessions
`scripts/replay_session.py`	LIVE sessions — `logs/turns.jsonl` (the production turn-logger)	triage a real call: “agent claimed success, calendar shows nothing”

replay_eval.py — the change regression check

Every eval already writes one JSON line per conversation with the full caller+agent turns. Those are the recordings. Replay re-drives each flow with the exact caller utterances from the recording — via a ScriptedClient that emits the recorded lines in order and ignores the LLM prompt — against the current flow + agent. A change that breaks a previously-good session shows up as a different final state or a lower correctness. It exits 1 if any session regressed (correctness dropped >1pt, or a success terminal became a failure), so it can gate a loop after a flow edit.
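To make the scripted-caller idea concrete, here is a minimal sketch of what such a client could look like. It is illustrative only: the method name `complete`, the `exhausted` property, and the error raised on an exhausted recording are assumptions, not the actual interface in `scripts/replay_eval.py`.

```python
from typing import Iterable


class ScriptedClient:
    """Stand-in for the caller-persona LLM that replays recorded utterances.

    Hypothetical sketch: the real class may expose a different interface.
    """

    def __init__(self, utterances: Iterable[str]) -> None:
        self._utterances = list(utterances)
        self._cursor = 0

    @property
    def exhausted(self) -> bool:
        """True once every recorded caller line has been emitted."""
        return self._cursor >= len(self._utterances)

    def complete(self, prompt: str) -> str:
        # The prompt is deliberately ignored: the caller's words come from
        # the recording, so only the flow + agent side can vary.
        if self.exhausted:
            raise IndexError(
                "recording exhausted: the flow asked for more caller turns "
                "than were recorded"
            )
        line = self._utterances[self._cursor]
        self._cursor += 1
        return line


client = ScriptedClient(["my sink is leaking", "anytime Wednesday"])
first = client.complete("You are an impatient caller...")
second = client.complete("a completely different prompt")
print(first, "|", second)
```

Raising on an exhausted recording is one possible design choice: if the changed flow asks for more caller turns than the golden session contains, that is itself a divergence worth surfacing rather than papering over.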

RIFF_EVAL_AGENT=dashscope RIFF_EVAL_AGENT_MODEL=qwen-flash \
  python scripts/replay_eval.py docs/flow-eval/flow-transcripts-20260620T042340105571Z.jsonl \
    --flow austin_plumbing --regressions-only
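The exit rule described above (correctness dropped by more than 1 point, or a success terminal became a failure) can be sketched as follows. The `SessionResult` type, its field names, and the assumption that correctness is stored on a points scale are hypothetical; only the two conditions come from the description.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SessionResult:
    """Hypothetical summary of one session (recorded or replayed)."""
    final_state: str
    correctness: float  # assumed to be expressed in points
    success: bool       # assumed: whether final_state is a success terminal


CORRECTNESS_TOLERANCE_PTS = 1.0  # "dropped >1pt" from the description


def is_regression(recorded: SessionResult, replayed: SessionResult) -> bool:
    """Apply the two regression conditions to one recorded/replayed pair."""
    dropped = (recorded.correctness - replayed.correctness) > CORRECTNESS_TOLERANCE_PTS
    lost_success = recorded.success and not replayed.success
    return dropped or lost_success


def exit_code(pairs: Iterable[Tuple[SessionResult, SessionResult]]) -> int:
    """Return 1 if any session regressed, else 0, so a loop can gate on it."""
    return 1 if any(is_regression(rec, rep) for rec, rep in pairs) else 0


golden = SessionResult("booked", 90.0, True)
small_wobble = SessionResult("booked", 89.5, True)
big_drop = SessionResult("booked", 88.0, True)
now_failing = SessionResult("handoff_failed", 90.0, False)

print(exit_code([(golden, small_wobble)]))                     # within tolerance
print(exit_code([(golden, small_wobble), (golden, big_drop)])) # one regressed
```

The state names `booked` and `handoff_failed` are placeholders for illustration, not states taken from any real flow.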

replay_session.py — the production fabrication triage

This one replays a recorded live session step by step, reconstructing it with a FakeAdapter scripted from the recorded LLM side, and prints a per-turn diff of RECORDED vs REPLAY state and slots:

python scripts/replay_session.py <session_id>
python scripts/replay_session.py <session_id> --to-turn=5 --verbose

── Turn 3 · 2026-04-22T15:12:18 ──
  user_utterance: "anytime Wednesday"
  RECORDED  to_state=offer_scheduling  slots.preferred_date=2026-04-23
  REPLAY    to_state=offer_scheduling  slots.preferred_date=2026-04-23
  tool_calls: ✓ 1 call, envelope event_id present
  ✓ no divergence
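A per-turn comparison like the one in the sample output could be computed along these lines. This is a sketch under assumptions: the dictionary keys `to_state` and `slots` mirror the labels in the printed diff, but the actual turn-log schema in `logs/turns.jsonl` may differ.

```python
from __future__ import annotations

from typing import Any


def diff_turn(recorded: dict[str, Any], replayed: dict[str, Any]) -> list[str]:
    """Return human-readable divergences between a recorded and replayed turn.

    An empty list means the turn matched ("no divergence").
    """
    divergences: list[str] = []

    rec_state, rep_state = recorded.get("to_state"), replayed.get("to_state")
    if rec_state != rep_state:
        divergences.append(f"to_state: recorded={rec_state!r} replay={rep_state!r}")

    rec_slots = recorded.get("slots", {})
    rep_slots = replayed.get("slots", {})
    # Compare the union of slot names so a slot missing on one side is reported.
    for name in sorted(set(rec_slots) | set(rep_slots)):
        rec_val, rep_val = rec_slots.get(name), rep_slots.get(name)
        if rec_val != rep_val:
            divergences.append(f"slots.{name}: recorded={rec_val!r} replay={rep_val!r}")

    return divergences


recorded_turn = {"to_state": "offer_scheduling", "slots": {"preferred_date": "2026-04-23"}}
matching_replay = {"to_state": "offer_scheduling", "slots": {"preferred_date": "2026-04-23"}}
drifted_replay = {"to_state": "offer_scheduling", "slots": {}}

same = diff_turn(recorded_turn, matching_replay)
drift = diff_turn(recorded_turn, drifted_replay)
print(same or "no divergence")
print(drift)
```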

The smoking-gun check. Its primary job is B-181-class triage — “agent said booked, nothing got booked.” Replay the session, watch the tool-call envelopes flow turn by turn, and find the turn where event_id="" but the assistant text says “booked.” That is the fabrication. The check runs automatically and exits non-zero if it finds the pattern.
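The fabrication pattern can be expressed as a small predicate over the replayed turns. The sketch below is an assumption-laden illustration: the keys `turn`, `assistant_text`, `tool_calls`, and `event_id`, and the simple "booked" keyword match, are hypothetical stand-ins for whatever the real check in `scripts/replay_session.py` inspects.

```python
import re
from typing import Any, Dict, List

# Assumed heuristic: the assistant "claims a booking" if it says "booked".
BOOKED_CLAIM = re.compile(r"\bbooked\b", re.IGNORECASE)


def find_fabrications(turns: List[Dict[str, Any]]) -> List[int]:
    """Return turn numbers where the assistant claims a booking but no
    tool-call envelope on that turn carries a non-empty event_id."""
    flagged: List[int] = []
    for turn in turns:
        claims_booking = bool(BOOKED_CLAIM.search(turn.get("assistant_text", "")))
        confirmed = any(call.get("event_id") for call in turn.get("tool_calls", []))
        if claims_booking and not confirmed:
            flagged.append(turn["turn"])
    return flagged


session = [
    {"turn": 3, "assistant_text": "What day works for you?", "tool_calls": []},
    {"turn": 4, "assistant_text": "You're booked for Wednesday.",
     "tool_calls": [{"event_id": ""}]},
    {"turn": 5, "assistant_text": "You're booked for Thursday.",
     "tool_calls": [{"event_id": "evt_123"}]},
]

flagged = find_fabrications(session)
exit_status = 1 if flagged else 0  # non-zero when the pattern is found
print(flagged, exit_status)
```

Note that this sketch also flags a turn that claims a booking with no tool call at all, which is slightly broader than the `event_id=""` case described above.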

Use case

You change the slot extractor and want to be sure you didn't break a flow that was passing. Instead of re-running a noisy eval and squinting at a ~1pt delta, you replay the golden transcripts: the caller is identical, so any divergence in final state or correctness is your extractor change, full stop. Replay was added precisely as the deterministic regression check after a runtime change (commit 2979f94).

Where it fits

Replay is the deterministic complement to the eval. The eval matrix tells you the absolute quality (with judge noise on humanness); replay tells you whether a specific change moved a specific session, with no caller-side noise at all. It pairs with the CI gate (which catches computed-aspect regressions on changed flows) to cover both “did the flow change?” and “did the runtime change?”

Source: scripts/replay_eval.py, scripts/replay_session.py.