
Voice-Quality Playbook — detect → reproduce → prevent → fix → document

Why this exists. Voice bugs hide. They feel like "the call was weird," they're invisible in the text transcript (it reads "yes → complete"), and they're often races that only fire on some calls. B-319 (the read_back confirmation stall) sat in 48% of flows and hit 6% of real calls — 15 callers literally asked "Are you still there?" — yet looked fine in the logs. This playbook is the standing process so that when one is found, it gets caught at the source, fixed everywhere, and never relearned.

The tools below already exist; this is how they fit together.

The loop

   DETECT ──► REPRODUCE ──► (a) SHIFT-LEFT ──► (b) FIX EXISTING ──► (c) DOCUMENT ──► (d) CLARIFY
  (review)   (voice case)   (prevent)          (root cause)         (lessons)        (the fix)

1. DETECT — the nightly review surfaces it (no human listening)

The two-layer review runs at 01:30 (scripts/nightly_review_cycle.sh) and lands in quality.db.live_call_reviews. Pull the worst calls (/review-calls):

SELECT session_id, flow_id, overall, efficiency, loop_score, round(wer,2) wer, completed
FROM live_call_reviews ORDER BY overall ASC, loop_score DESC;
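
The same pull can be scripted when the results feed another tool. This is a minimal sketch, assuming quality.db is a SQLite file (the playbook does not name the engine) and that the table has exactly the columns used in the query above. The function name `worst_calls` and its `limit` parameter are hypothetical.

```python
import sqlite3

# Same ordering as the query above: lowest overall first, ties broken by loop_score.
QUERY = """
SELECT session_id, flow_id, overall, efficiency, loop_score,
       round(wer, 2) AS wer, completed
FROM live_call_reviews
ORDER BY overall ASC, loop_score DESC
LIMIT ?
"""

def worst_calls(conn, limit=20):
    """Return the lowest-scoring reviewed calls as dicts, worst first."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    return [dict(row) for row in conn.execute(QUERY, (limit,))]
```

Usage would look like `worst_calls(sqlite3.connect("quality.db"))`, with the path adjusted to wherever the nightly review writes its database.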

Signatures of a systemic voice bug (vs a one-off):

2. REPRODUCE — turn the real call into a deterministic voice case

A human can't reliably hit a race; scripts/voice_case.py can. It's event-driven: it drives the real live audio path and fires the caller's reply off the agent's own turn boundary (playback_end + offset), so the trigger is exact and repeatable.

python scripts/voice_case.py --flow coffee --confirm-offset-ms 0      # tight race → expect stall
python scripts/voice_case.py --flow coffee --confirm-offset-ms 1500   # polite pause → expect clean

An offset sweep isolates the trigger (race window, barge-in, interrupt). The case records to data/sessions/<sid>/ so the review tools evaluate it, and it measures the stall directly.
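
An offset sweep can be planned and read back with two small helpers. This is a sketch only: it uses just the two flags shown above (`--flow`, `--confirm-offset-ms`), and because the playbook does not say how voice_case.py reports a stall, the stall results are passed in as a plain mapping. The helper names `sweep_commands` and `race_window` are hypothetical.

```python
def sweep_commands(flow, offsets_ms):
    """Build one voice_case.py command line per offset, using the flags shown above."""
    return [
        ["python", "scripts/voice_case.py",
         "--flow", flow, "--confirm-offset-ms", str(ms)]
        for ms in offsets_ms
    ]

def race_window(stalled_by_offset):
    """Locate the edge of the race window from sweep results.

    stalled_by_offset maps offset in ms to True (stalled) or False (clean).
    Returns (largest stalling offset, smallest clean offset above it),
    or None if no offset stalled.
    """
    stalls = [ms for ms, stalled in stalled_by_offset.items() if stalled]
    if not stalls:
        return None
    last_stall = max(stalls)
    cleans = [ms for ms, stalled in stalled_by_offset.items()
              if not stalled and ms > last_stall]
    return last_stall, (min(cleans) if cleans else None)
```

Each command list can be handed to `subprocess.run`. Narrowing the sweep between the two returned offsets tightens the estimate of where the trigger stops firing.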

3. (a) SHIFT LEFT — prevent it recurring, as early as possible

Two layers, by bug class:

4. (b) FIX EXISTING — systemically, once

Find the single place the bug lives and fix it there so every affected flow is fixed at once.

5. (c) DOCUMENT — so the pattern is recognized next time

Every systemic find leaves a trail:

6. (d) CLARIFY THE FIX — precise, not vibes

The fix writeup must state, in order: root cause (one sentence) · fix location (file:line) · the regression test that proves it · blast radius (how many flows/calls). If you can't fill all four, you haven't found the root cause yet.
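
The four-field rule is easy to check mechanically. This is a minimal sketch: the class and field names are hypothetical and only mirror the four items listed above, in the same order.

```python
from dataclasses import dataclass, fields

@dataclass
class FixWriteup:
    """The four required parts of a fix writeup, in the required order."""
    root_cause: str = ""       # one sentence
    fix_location: str = ""     # file:line
    regression_test: str = ""  # the test that proves the fix
    blast_radius: str = ""     # how many flows/calls

    def missing(self):
        """Return the names of any fields left empty, in writeup order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A non-empty `missing()` result signals that the root cause has not been pinned down yet.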

Worked examples

B-319 — Read-back confirmation stall

B-320 — Terminal close state re-narration

B-318 — Verification "yes" submitted as passcode (Open)

Lessons (the recurring signatures)

Audio Simulation (Tier 2) — BUILT (2026-06-28)

The faithful audio tier now drives every flow with injected audio, end to end:

Two protocol fixes the dreamer forced (you only find these by placing the real call):

  1. Trailing silence — Gemini's VAD ends a turn on speech→silence; the harness stopped sending frames (absence ≠ silence) so turns never ended. Send explicit zero-PCM silence after each utterance.
  2. static_done — 28 flows (62%) have a static greeting that advances on audio_done, fired only when the browser sends static_done. The synthetic client must send it too, or every one stalls at greeting. Lesson: a static check passes the YAML; only a real call finds the message the test client forgot to send.
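
The trailing-silence fix from item 1 can be sketched as a generator of zero-filled PCM frames. The audio format here (16 kHz, 16-bit mono, 20 ms frames) and the function name `silence_frames` are assumptions for illustration; the playbook does not state the harness's actual format or how much silence the VAD needs.

```python
def silence_frames(duration_ms, sample_rate=16000, frame_ms=20, sample_width=2):
    """Yield zero-filled mono PCM frames covering at least duration_ms of silence.

    Assumed defaults: 16 kHz, 16-bit samples, 20 ms frames.
    """
    bytes_per_frame = sample_rate * frame_ms // 1000 * sample_width
    n_frames = -(-duration_ms // frame_ms)  # ceiling division
    for _ in range(n_frames):
        yield bytes(bytes_per_frame)  # bytes(n) is n zero bytes, i.e. digital silence
```

A harness would send these frames straight after the last frame of each utterance, so the stream carries real silence rather than simply stopping.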

Part-3 word-level finding (2026-06-28, corrected) — the corpus WER is REAL mis-hears, not formatting. scripts/voice_mishears.py transcribes the persisted caller audio (audio_caller.wav) with Whisper (ground truth) and diffs it against what Gemini heard (turns.jsonl), normalizing formatting first (case, punctuation, number-words→digits, digit-grouping). It computes semantic_wer — the WER after formatting is removed — now stored alongside the raw wer column (--backfill N).
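
The idea behind semantic_wer can be sketched in a few lines: normalize both transcripts, then take the word-level edit distance over the reference length. This is an illustrative sketch, not the code in scripts/voice_mishears.py. The number-word table is deliberately tiny, and the rule that merges adjacent digit tokens is an assumption about what "digit-grouping" covers.

```python
import re

# Assumption: a small number-word table is enough for illustration.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}

def normalize(text):
    """Lowercase, strip punctuation, map number-words to digits, merge digit runs."""
    text = text.lower()
    text = re.sub(r"(?<=\d)[,\-](?=\d)", "", text)  # "1,000" -> "1000"
    text = re.sub(r"[^\w\s]", " ", text)            # remaining punctuation -> space
    merged = []
    for word in (NUMBER_WORDS.get(w, w) for w in text.split()):
        if word.isdigit() and merged and merged[-1].isdigit():
            merged[-1] += word                      # "5 5 5" -> "555"
        else:
            merged.append(word)
    return merged

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def semantic_wer(truth, heard):
    """WER after formatting is normalized away; 0.0 for an empty truth."""
    ref, hyp = normalize(truth), normalize(heard)
    if not ref:
        return 0.0
    return edit_distance(ref, hyp) / len(ref)
```

Under this sketch a pure formatting difference such as "Two, please." against "2 please" scores 0.0, while a dropped word still counts as an error.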

RESOLVED via caller_truth (2026-06-28): in converse mode the caller is an LLM whose lines are KNOWN text BEFORE TTS — the real ground truth, no ASR confound. voice_case now persists them to data/sessions/<sid>/caller_truth.txt, and voice_mishears._ground_truth() prefers them over Whisper (falling back to Whisper for old sessions). A three-way check on a terse call settled it: truth "Order something. A turkey sandwich. Two." → GEMINI dropped "Two" (a real DROP = B-321), WHISPER caught it, and ZERO hallucinations from either. So the "68% hallucinations" above WAS the Whisper-as-truth confound; against caller_truth the breakdown adjudicates instead of merely ranking. Use caller_truth for any converse session; reserve the Whisper-truth caveat for legacy sessions that lack it.
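
The preference order described above (caller_truth first, Whisper as fallback) might look like the sketch below. It is not the body of `voice_mishears._ground_truth()`. The function signature, the injected `whisper_transcribe` callable, and the assumption that audio_caller.wav sits in the same session directory are all hypothetical.

```python
from pathlib import Path

def ground_truth(session_dir, whisper_transcribe):
    """Return (text, source) for a session, preferring known caller text.

    whisper_transcribe is a hypothetical callable taking a wav path and
    returning a transcript; it runs only when caller_truth.txt is
    missing or empty (legacy sessions).
    """
    session_dir = Path(session_dir)
    truth_file = session_dir / "caller_truth.txt"
    if truth_file.is_file():
        text = truth_file.read_text(encoding="utf-8").strip()
        if text:
            return text, "caller_truth"
    # Legacy path: no known text, so fall back to ASR on the caller audio.
    return whisper_transcribe(session_dir / "audio_caller.wav"), "whisper"
```

Returning the source alongside the text lets downstream reports apply the Whisper-truth caveat only to sessions that actually used the fallback.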

FULL-CORPUS CONFIRMATION (2026-06-28): semantic_wer is now backfilled across the whole audio corpus: 117 sessions (99 dreamer vc-* + legacy). The full-scale aggregate holds. By persona (raw/semantic): terse 0.19/0.20, corrector 0.10/0.12, confused 0.08/0.08, cooperative 0.06/0.07. Semantic ≈ raw everywhere, so the corpus WER is real mis-hears. terse is the worst persona (n=22), which is B-321.

Robustness note: semantic_wer has no Inf/NaN (it returns 0.0 for an empty truth); the raw wer column has 1 legacy Inf (empty-reference division), and the dreamer rows have none. So semantic_wer is the safer column to average. 'Capture ALL metrics' is met for the mis-hear axis.

Lesson (twice over): don't generalize a metric claim from ONE clean session; the corpus disagreed. The DB now has both wer and semantic_wer, so formatting-vs-real is visible per call. Tested core: tests/test_voice_mishears.py.
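
Anyone who still needs to average the raw wer column has to guard against the non-finite legacy value. This is a minimal sketch; the helper name `finite_mean` is hypothetical, and skipping bad rows is one possible policy, not necessarily what the review tooling does.

```python
import math

def finite_mean(values):
    """Average only the finite values; return None if none remain."""
    finite = [v for v in values if v is not None and math.isfinite(v)]
    return sum(finite) / len(finite) if finite else None
```

A single Inf would otherwise turn the whole average into Inf, which is why a column that cannot produce one is easier to aggregate.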

Corpus findings (fixed + open):

Validation flow: flows/sweet_things_bakery.yaml exercises all three fixes by construction — the deterministic suites (test_live_confirm_gates_all_flows.py, test_live_tool_scoping.py) auto-discover it, so a regression of any fix fails CI on this flow too.

Build next