Voice-Quality Playbook — detect → reproduce → prevent → fix → document

Why this exists. Voice bugs hide. They feel like "the call was weird," they're invisible in the text transcript (it reads "yes → complete"), and they're often races that only fire on some calls. B-319 (the read_back confirmation stall) sat in 48% of flows and hit 6% of real calls — 15 callers literally asked "Are you still there?" — yet looked fine in the logs. This playbook is the standing process so that when one is found, it gets caught at the source, fixed everywhere, and never relearned.

The tools below already exist; this is how they fit together.

The loop

   DETECT ──► REPRODUCE ──► (a) SHIFT-LEFT ──► (b) FIX EXISTING ──► (c) DOCUMENT ──► (d) CLARIFY
  (review)   (voice case)   (prevent)          (root cause)         (lessons)        (the fix)

1. DETECT — the nightly review surfaces it (no human listening)

The two-layer review runs at 01:30 (scripts/nightly_review_cycle.sh) and lands in quality.db.live_call_reviews. Pull the worst calls (/review-calls):

SELECT session_id, flow_id, overall, efficiency, loop_score, round(wer,2) wer, completed
FROM live_call_reviews ORDER BY overall ASC, loop_score DESC;

Signatures of a systemic voice bug (vs a one-off):

loop_score ≥ 4 / high turn_count + completed=0 — runaway / stall
the caller said "are you still there?" — dead air (grep turns.jsonl)
efficiency/naturalness low while task_completion high — bad UX, right answer
high wer — the model mis-heard (Whisper vs Gemini)
a slot value absent from the audio — hallucination

Then count it across calls/flows before calling it systemic (B-319 was 6% of 265 calls, 9 flows).
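A minimal sketch of that count, assuming quality.db is a SQLite file and using only columns that appear in the query above. The stall predicate (loop_score >= 4 with completed = 0) is one illustrative reading of the first signature, and stall_spread is a hypothetical helper, not an existing tool.

```python
import sqlite3

# Assumption: live_call_reviews has flow_id, loop_score and completed columns,
# as in the /review-calls query. The predicate below is illustrative only.
STALL_SQL = """
SELECT flow_id,
       COUNT(*) AS calls,
       SUM(CASE WHEN loop_score >= 4 AND completed = 0 THEN 1 ELSE 0 END) AS stalled
FROM live_call_reviews
GROUP BY flow_id
"""


def stall_spread(conn):
    """Return (share of calls stalled, flows affected, stalled count per flow)."""
    rows = conn.execute(STALL_SQL).fetchall()
    total = sum(calls for _, calls, _ in rows)
    stalled = sum(count for _, _, count in rows)
    flows_hit = sorted(flow for flow, _, count in rows if count > 0)
    share = stalled / total if total else 0.0
    return share, flows_hit, {flow: count for flow, _, count in rows}
```

Called as stall_spread(sqlite3.connect("quality.db")), it gives the two numbers the B-319 note quotes: the share of calls and the list of flows affected.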

2. REPRODUCE — turn the real call into a deterministic voice case

A human can't reliably hit a race; scripts/voice_case.py can. It's event-driven: it drives the real live audio path and fires the caller's reply off the agent's own turn boundary (playback_end + offset), so the trigger is exact and repeatable.

python scripts/voice_case.py --flow coffee --confirm-offset-ms 0      # tight race → expect stall
python scripts/voice_case.py --flow coffee --confirm-offset-ms 1500   # polite pause → expect clean

An offset sweep isolates the trigger (race window, barge-in, interrupt). The case records to data/sessions/<sid>/ so the review tools evaluate it, and it measures the stall directly.
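The sweep can be scripted rather than typed by hand. The sketch below only builds the command lines, using the two flags shown above; sweep_commands and race_window are hypothetical helpers. How voice_case.py reports a stall is not specified here, so the per-offset results are passed in by the caller.

```python
import sys


def sweep_commands(flow, offsets_ms, script="scripts/voice_case.py"):
    """Build one voice_case.py invocation per offset (argv lists, not executed)."""
    return [
        [sys.executable, script, "--flow", flow, "--confirm-offset-ms", str(offset)]
        for offset in offsets_ms
    ]


def race_window(results):
    """Given {offset_ms: stalled?}, return the largest offset that still stalls.

    Returns None when no offset stalled, i.e. the race did not reproduce.
    """
    stalled = [offset for offset, did_stall in results.items() if did_stall]
    return max(stalled) if stalled else None
```

Each argv list can be handed to subprocess.run. If offsets 0 and 250 stall while 500 and 1500 run clean, race_window returns 250, which bounds the trigger window from below.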

3. (a) SHIFT LEFT — prevent it recurring, as early as possible

Two layers, by bug class:

Live / timing / transport bugs (like B-319): add the reproducing voice case to a CI voice-case suite (tests/voice_cases/). It fails until fixed, then guards forever. This is the primary shift-left for anything that only manifests in the real audio path. (To build: a runner that boots the server, runs the suite at offset=0, asserts no stall. See "Build next".)
Flow-authoring bugs (a flow YAML pattern that's intrinsically fragile): add a build-time linter check in riff/flow_eval/linter.py so the risky pattern is flagged when the flow is written, not in production — the cheapest possible shift-left. (Mirror the existing pickup-delivery-close-misleading / collect-state-no-reprompt checks.)
Rule of thumb: fix the bug at the lowest layer that owns it, and add the guard at the earliest layer that can see it. B-319 is owned by the transport (fix there) but is guarded by a voice case in CI (the earliest place a timing race is visible).

4. (b) FIX EXISTING — systemically, once

Find the single place the bug lives and fix it there so every affected flow is fixed at once.

B-319: the live transport doesn't evaluate response_emitted on the agent's turn_complete → one transport fix unsticks all 25 flows. Don't patch 25 YAMLs.
File a backlog item (B-NNN), codex-review the fix, and verify with the voice case from step 2 (offset=0 must now advance, no stall).

5. (c) DOCUMENT — so the pattern is recognized next time

Every systemic find leaves a trail:

a backlog item (docs/project-memory/backlog/B-NNN-*.md) with the audio evidence + reproducer,
a roadmap line (docs/html/roadmap.html) if it's reliability-class,
a memory entry for the non-obvious lesson (the signature, not the fix),
and, for a class of bug, an entry in this playbook's Lessons section below.

6. (d) CLARIFY THE FIX — precise, not vibes

The fix writeup must state, in order: root cause (one sentence) · fix location (file:line) · the regression test that proves it · blast radius (how many flows/calls). If you can't fill all four, you haven't found the root cause yet.

Worked examples

B-319 — Read-back confirmation stall

Symptom: A coffee call "felt broken." Review showed ~50s dead air; 15 callers said "are you still there?"
Root Cause: riff/live/session.py:1353-1365 classified the agent's read-back as a model-only turn (no user text, no tool calls) and skipped the FSM transition walk.
Fix Location: riff/live/session.py:1359-1365. A model-only turn now walks the FSM when the agent actually emitted text and the current state has a response_emitted transition.
Regression Tests:
- Fast simulator parametric sweep: test_confirm_gate_advances_on_agent_emission in tests/test_live_confirm_gates_all_flows.py:44
- Unit regression test: test_b319_readback_advances_on_model_only_turn in tests/test_live_transport.py:274
Blast Radius: 25 Collect-Pattern flows / 6% of real calls.

B-320 — Terminal close state re-narration

Symptom: After a ticket was submitted (voice-1782604290879), the caller said "Thanks" and the agent re-narrated the entire closing summary twice.
Root Cause: Post-terminal caller input arrived before carrier hangup, triggering Gemini to re-narrate from the terminal state policy.
Fix Location: riff/live/session.py:1369-1377 and riff/live/session.py:741. Sets _terminal_close_delivered = True after the terminal close is spoken, causing should_suppress_model_output() to suppress further agent output.
Regression Test: test_b320_terminal_close_suppresses_post_close_renarration in tests/test_live_confirm_gates_all_flows.py:104
Blast Radius: All completed calls ending in terminal close states.

B-318 — Verification "yes" submitted as passcode (Open)

Symptom: Verification failed (voice-1782603683638) because the model submitted the caller's confirmation "yes" as the numeric verification code.
Root Cause: Three compounding faults: same-turn tool firing (send then verify), acknowledgment word reuse as code argument, and verification_max_attempts: 1.
Proposed Fix: Filter code formatting on the tool arguments, separate turns with prompt directives, and increase max attempts to 3.
Blast Radius: All verify-caller flows.

Lessons (the recurring signatures)

"Are you still there?" is a smoke alarm. It means dead air — almost always a live FSM transition not firing promptly (a confirm/collect gate that needs the agent's emission to advance).
The text transcript lies about timing. "yes → complete" can hide a 50s hole. Always check the audio (Whisper timestamps) for latency/mis-hear/hallucination — the judge reads Gemini's own (possibly wrong) transcript.
Clean browser audio plus bad phone audio is usually a transport question first. The 2026-06-29 Space Channel case sounded "worbly" on the handset while local/browser audio was clean. Piper was not the root cause; Telnyx outbound chunking/pacing was. The stable path is 20 ms chunks, absolute-deadline pacing, and pre-encoded μ-law/base64 payloads before the paced send loop. See docs/codex/2026-06-29-space-channel-phone-audio-learning.md and docs/html/space-channel-phone-audio-learning.html.
Pre-generate known lines, but do not confuse pregeneration with phone safety. Static audio reduces TTS latency and makes branded prompts deterministic, but it still must pass through a phone-safe paced sender. Use a static-first, live-second policy: pre-generate greetings, capabilities, launch summaries, and hold/computing sounds; reserve live/Gemini speech for caller-specific or genuinely dynamic answers.
Races need event-driven repro, not recordings. Control the caller's timing off the agent's turn boundary; a fixed audio file or a human can't hit the window reliably.
Model the listener's playback clock, not the network's arrival clock. audio_out arrives faster than real time; "agent finished" = first_arrival + bytes/(24kHz·2), not last byte.
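Two of these lessons are mechanical enough to sketch. playback_end is the playback-clock formula from the last lesson (24 kHz, 2 bytes per sample). paced_send illustrates absolute-deadline pacing with payloads encoded before the loop. Both names are hypothetical, and the 160-byte chunk assumes 8 kHz μ-law (one byte per sample, so 20 ms); the real sender and the Telnyx message envelope are not shown here.

```python
import base64
import time


def playback_end(first_arrival_s, total_bytes, sample_rate_hz=24_000, bytes_per_sample=2):
    """When the listener finishes hearing the audio, not when the last byte arrived."""
    return first_arrival_s + total_bytes / (sample_rate_hz * bytes_per_sample)


def paced_send(mulaw_bytes, send, chunk_bytes=160, interval_s=0.020,
               clock=time.monotonic, sleep=time.sleep):
    """Send pre-encoded chunks on absolute deadlines; return the chunk count.

    Each deadline is computed from the start time, so a late frame does not
    push every later frame back (no cumulative drift from per-frame sleeps).
    """
    # Encode everything before the loop so the loop does no per-frame work.
    frames = [
        base64.b64encode(mulaw_bytes[i:i + chunk_bytes]).decode("ascii")
        for i in range(0, len(mulaw_bytes), chunk_bytes)
    ]
    start = clock()
    for n, frame in enumerate(frames):
        delay = (start + n * interval_s) - clock()
        if delay > 0:
            sleep(delay)
        send(frame)
    return len(frames)
```

For example, 96,000 bytes of agent audio first heard at t = 10.0 s ends at 12.0 s on the listener's clock, however quickly the bytes arrived.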

Audio Simulation (Tier 2) — BUILT (2026-06-28)

The faithful audio tier now drives every flow with injected audio, end to end:

scripts/voice_case.py --mode converse — a generic LLM caller (gemini-2.5-flash) answers whatever the agent asks, so one harness drives any flow to a terminal state with real audio.
Caller PERSONAS (--persona) — each stresses a failure mode: cooperative (baseline), confused (B-303 confirm loop), corrector (re-read path), terse (mis-hear/WER).
Timing variance (--reply-delay-ms) — eager (0ms, trips races) … polite (2500ms).
scripts/voice_scenarios.py — the named scenario registry (part-1 'scenarios as DATA'): confirm-fast/-slow, barge-in-mild/-deep (B-300), confused-answer (B-303), corrector, terse (B-321), thanks-at-close (B-320, deterministic), real-booking (serene). --list for the coverage matrix (scenario -> bug -> expected outcome); --run <name> runs + checks one.
scripts/voice_dream.py — the "dreamer": randomly samples (flow × persona × timing), drives with audio, transcribes (Whisper), and upserts every metric to quality.db.live_call_reviews (incl. persona, reply_delay_ms, wer, completed). Property-based testing for voice — the breaking combinations surface in the data, not in inspection.
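The sampling idea behind the dreamer can be sketched as below. This is not voice_dream.py itself: sample_cases is a hypothetical helper, the 250 ms delay grid is an assumption, and only the four persona names and the 0 to 2500 ms range come from the list above.

```python
import random

PERSONAS = ("cooperative", "confused", "corrector", "terse")


def sample_cases(flows, n, seed=None, max_delay_ms=2500, step_ms=250):
    """Draw n random (flow, persona, reply delay) combinations.

    A fixed seed makes a run repeatable, so a combination that broke a call
    can be regenerated and replayed.
    """
    rng = random.Random(seed)
    return [
        {
            "flow": rng.choice(flows),
            "persona": rng.choice(PERSONAS),
            "reply_delay_ms": rng.randrange(0, max_delay_ms + 1, step_ms),
        }
        for _ in range(n)
    ]
```

Each dict maps onto one converse-mode run (--flow, --persona, --reply-delay-ms); storing the seed next to the results keeps the sample reproducible.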

Two protocol fixes the dreamer forced (you only find these by placing the real call):

Trailing silence — Gemini's VAD ends a turn on speech→silence; the harness stopped sending frames (absence ≠ silence) so turns never ended. Send explicit zero-PCM silence after each utterance.
static_done — 28 flows (62%) have a static greeting that advances on audio_done, fired only when the browser sends static_done. The synthetic client must send it too, or every one stalls at greeting. Lesson: a static check passes the YAML; only a real call finds the message the test client forgot to send.
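The trailing-silence fix amounts to appending zero-filled PCM frames after each utterance. A minimal sketch, assuming 16-bit mono PCM sent in 20 ms frames; the 16 kHz rate and 800 ms duration are illustrative defaults rather than values taken from the harness, and both function names are hypothetical.

```python
def silence_frames(duration_ms, sample_rate_hz=16_000, frame_ms=20, bytes_per_sample=2):
    """Return zero-filled PCM frames covering at least duration_ms."""
    frame_bytes = sample_rate_hz * frame_ms // 1000 * bytes_per_sample
    n_frames = -(-duration_ms // frame_ms)  # ceiling division
    return [bytes(frame_bytes) for _ in range(n_frames)]


def with_trailing_silence(utterance_frames, silence_ms=800, **pcm_kwargs):
    """Append explicit silence so the VAD sees speech followed by silence."""
    return list(utterance_frames) + silence_frames(silence_ms, **pcm_kwargs)
```

The point is that the frames are actually sent: stopping transmission gives the VAD no silence to detect, so the turn never ends.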

Part-3 word-level finding (2026-06-28, corrected) — the corpus WER is REAL mis-hears, not formatting. scripts/voice_mishears.py transcribes the persisted caller audio (audio_caller.wav) with Whisper (ground truth) and diffs it against what Gemini heard (turns.jsonl), normalizing formatting first (case, punctuation, number-words→digits, digit-grouping). It computes semantic_wer — the WER after formatting is removed — now stored alongside the raw wer column (--backfill N).

For the cleanest cooperative calls, formatting CAN be the only diff (a coffee call: 0 semantic mis-hears, raw WER all "two pm" vs "2 pm" / "512-555-0147" vs "5125550147"). That single example first suggested "WER is all formatting."
At corpus scale that's WRONG (self-corrected by the backfill, 18 sessions): mean raw WER 0.115 vs mean semantic WER 0.121 — essentially EQUAL. Per persona: terse 0.25/0.25, confused 0.11/0.11, corrector 0.08/0.10, cooperative 0.06/0.07. So the corpus WER measures GENUINE mis-hears, not formatting; it is a valid signal. (semantic_wer runs slightly HIGHER on phone-heavy turns because digit-grouping collapses "512 555 0147" into one token, shrinking the denominator.)
Real mis-hears concentrate in the terse/confused personas (semantic WER up to 0.5 on deli_counter/taco_truck) — that's B-321 (a VAD/endpointing limit on bare mumbled values), not a systemic ASR problem; cooperative ASR is sound.

ERROR-TYPE BREAKDOWN (2026-06-28) — separate mis-hears / hallucinations / drops, with a caveat. voice_mishears.mishear_breakdown() splits the formatting-normalized word errors into the three part-3 categories: substitutions (wrong word), drops (truth word heard as nothing = VAD clip), hallucinations (heard a word never said). On the high-WER sessions the apparent split is ~68% hallucinations / 17% substitutions / 14% drops. CAVEAT (do NOT overclaim): the 'hallucination' count is CONFOUNDED by Whisper's own unreliability on mumbled/terse audio. A pair like truth='' heard='name is sarah miller' is scored a hallucination, but it's at least as likely the caller DID say it and WHISPER (our ground truth) under-transcribed the mumbled audio. So 'Gemini hallucinated' vs 'Whisper missed it' can't be separated by the transcripts alone — it needs a human spot-check of the WAV. The tool surfaces the candidates; it does not adjudicate them. Net: on hard audio BOTH ASRs struggle; the breakdown ranks where to listen, it isn't a verdict. (This is why cooperative ASR — clean audio — shows ~0 of all three.)
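To make the three categories concrete, here is one way such a split could be computed. This is a sketch, not the code in voice_mishears: breakdown_sketch is a hypothetical name, and it aligns the word lists with difflib.SequenceMatcher, which is a heuristic alignment rather than a minimal edit distance. Inputs are assumed to be already formatting-normalized.

```python
from difflib import SequenceMatcher


def breakdown_sketch(truth_words, heard_words):
    """Count substitutions, drops and hallucinations between two word lists."""
    counts = {"substitutions": 0, "drops": 0, "hallucinations": 0}
    matcher = SequenceMatcher(a=truth_words, b=heard_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_truth, n_heard = i2 - i1, j2 - j1
        if tag == "replace":
            # Pair words one to one; the leftover on either side is a drop
            # (truth word with no counterpart) or a hallucination.
            paired = min(n_truth, n_heard)
            counts["substitutions"] += paired
            counts["drops"] += n_truth - paired
            counts["hallucinations"] += n_heard - paired
        elif tag == "delete":
            counts["drops"] += n_truth
        elif tag == "insert":
            counts["hallucinations"] += n_heard
    return counts
```

With an empty truth and heard = "name is sarah miller", all four words count as hallucinations. That is exactly the case the caveat warns about: the count is only as good as the ground truth it is measured against.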

RESOLVED via caller_truth (2026-06-28): in converse mode the caller is an LLM whose lines are KNOWN text BEFORE TTS — the real ground truth, no ASR confound. voice_case now persists them to data/sessions/<sid>/caller_truth.txt, and voice_mishears._ground_truth() prefers them over Whisper (falling back to Whisper for old sessions). A three-way check on a terse call settled it: truth "Order something. A turkey sandwich. Two." → GEMINI dropped "Two" (a real DROP = B-321), WHISPER caught it, and ZERO hallucinations from either. So the "68% hallucinations" above WAS the Whisper-as-truth confound; against caller_truth the breakdown adjudicates instead of merely ranking. Use caller_truth for any converse session; reserve the Whisper-truth caveat for legacy sessions that lack it.

FULL-CORPUS CONFIRMATION (2026-06-28): semantic_wer is now backfilled across the whole audio corpus — 117 sessions (99 dreamer vc-* + legacy). The full-scale aggregate holds: by persona terse 0.19/0.20, corrector 0.10/0.12, confused 0.08/0.08, cooperative 0.06/0.07 (raw/semantic) — sem ~= raw everywhere, so the corpus WER is real mis-hears. terse is the worst (n=22), = B-321. Robustness note: semantic_wer has NO Inf/NaN (it returns 0.0 for an empty truth); the raw wer column has 1 legacy Inf (empty-reference division) — the dreamer rows have zero. So semantic_wer is the safer column to average. 'Capture ALL metrics' is met for the mis-hear axis. Lesson (twice over): don't generalize a metric claim from ONE clean session — the corpus disagreed. Now the DB has both wer and semantic_wer so formatting-vs-real is visible per call. Tested core: tests/test_voice_mishears.py.
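The semantic_wer idea, including the empty-truth guard from the robustness note, can be sketched as follows. This is an illustration rather than the code in voice_mishears.py: the number-word table is deliberately tiny, the function names are hypothetical, and the normalization covers only the four steps named earlier (case, punctuation, number words, digit grouping).

```python
import re

_NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def normalize(text):
    """Lowercase, strip punctuation, map number words, merge adjacent digit runs."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    words = []
    for token in text.split():
        token = _NUMBER_WORDS.get(token, token)
        if token.isdigit() and words and words[-1].isdigit():
            words[-1] += token  # "512 555 0147" becomes one token
        else:
            words.append(token)
    return words


def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def semantic_wer(truth, heard):
    """WER after formatting is removed; 0.0 for an empty truth (no Inf/NaN)."""
    ref, hyp = normalize(truth), normalize(heard)
    if not ref:
        return 0.0
    return word_edit_distance(ref, hyp) / len(ref)
```

The digit merge is what shrinks the denominator on phone-heavy turns: a ten-digit number becomes a single reference word, so one real error elsewhere in the turn weighs more.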

Corpus findings (fixed + open):

static-greeting stall (28 flows) — harness didn't send static_done; fixed.
cal-provider tool-scoping leak — provider tools (schedule_event, propose_booking) reached EVERY flow's model when the provider was booted, so pizza mis-called schedule_event ("booked a haircut"). Fixed: get_tools_for_turn now scopes provider tools to the flow's declared allowed_tools (real booking flows declare them and keep them; order flows lose the footgun). Evolved the B-050 surface-parity contract to "declared provider tools reach the LLM"; guard: tests/test_live_tool_scoping.py.
B-303 / B-300 — flows reach await_confirmation then stall: the half-duplex mic gate ate the caller's confirm because the harness replied while the agent's audio was still playing. Harness fixed (wait for playback_end). The product mirror (a real fast confirmer gets eaten) is the open B-300 gate-tail decision.
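The tool-scoping fix reduces to a filter. The sketch below is not get_tools_for_turn (its real signature is not shown here); scope_tools and the example tool submit_order are hypothetical, while schedule_event and propose_booking are the provider tools named above.

```python
def scope_tools(core_tools, provider_tools, allowed_tools):
    """Core tools always pass; provider tools pass only if the flow declares them."""
    allowed = set(allowed_tools or ())
    return list(core_tools) + [tool for tool in provider_tools if tool in allowed]


PROVIDER_TOOLS = ["schedule_event", "propose_booking"]

# An order flow declares no provider tools, so none reach its model.
pizza_tools = scope_tools(["submit_order"], PROVIDER_TOOLS, allowed_tools=[])

# A booking flow declares them and keeps them.
booking_tools = scope_tools([], PROVIDER_TOOLS, allowed_tools=["schedule_event", "propose_booking"])
```

The default is deny: a provider being booted is not enough for its tools to appear in a turn.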

Validation flow: flows/sweet_things_bakery.yaml exercises all three fixes by construction — the deterministic suites (test_live_confirm_gates_all_flows.py, test_live_tool_scoping.py) auto-discover it, so a regression of any fix fails CI on this flow too.

Build next

Decide B-300 gate-tail policy (shrink the speaking-tail so real fast confirmers aren't eaten?).
A flow-linter check that any flow with a response_emitted confirm gate has a registered voice case.
Promote a few bug-catching recordings to a "golden" replay set (deterministic, no Gemini).
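For the linter item above, a possible shape of the check, under a loudly stated assumption: the flow schema used here (a states mapping whose entries carry a transitions list with an "on" key) is hypothetical and may not match the real flow YAML or riff/flow_eval/linter.py. Only the response_emitted trigger name comes from this playbook.

```python
def unguarded_confirm_gates(flows, registered_cases):
    """Return (flow_id, gated state names) for flows lacking a voice case.

    flows: {flow_id: parsed flow dict}, in the hypothetical schema above.
    registered_cases: set of flow_ids that have a voice case in the suite.
    """
    findings = []
    for flow_id, flow in sorted(flows.items()):
        gated = [
            name
            for name, state in flow.get("states", {}).items()
            if any(t.get("on") == "response_emitted" for t in state.get("transitions", []))
        ]
        if gated and flow_id not in registered_cases:
            findings.append((flow_id, gated))
    return findings
```

An empty result means every flow with such a gate is covered; anything else names the flow and the states to write a case for.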