
Space Channel Phone Audio Learning — 2026-06-29

Audience: RIFF operators and engineers debugging real Telnyx calls. Short version: if the browser/local audio sounds clean but the phone sounds choppy, do not start by swapping TTS engines. First prove or rule out the outbound phone media clock.

What Happened

The Space Channel flow sounded clean in the browser, but over the phone the voice became "worbly" and choppy after a while. The call could also appear to hang up before a logical completion point. At first this looked like a Piper/TTS or robot-voice problem.

It was not primarily a voice-generation problem.

The source audio was clean. The phone artifact was caused by the Telnyx outbound media path: chunk size, pacing drift, and doing codec/base64 work inside the paced send loop.

Symptom Pattern

This is the signature to remember:

That pattern points at the carrier media stream, not the TTS model.

Diagnosis Path

The useful sequence was:

  1. Build a diagnostic flow with three versions of the same line:
    • Version 1: pre-recorded static audio.
    • Version 2: call-time generated speech played in sentence chunks.
    • Version 3: call-time generated speech fully buffered, then played.
  2. Route the Telnyx number to that diagnostic flow.
  3. Compare human handset reports against local artifacts:
    • audio_agent.wav: RIFF's generated/source agent audio.
    • audio_agent_played.wav: final outbound audio reconstructed from the Telnyx media payload path.
  4. Run:
./scripts/analyze_phone_audio.py --flow space_channel_audio_test --transcribe
  5. Check for:
    • clipping,
    • long dropout/silence runs,
    • level drift,
    • transcript failure,
    • source-vs-played artifact mismatch.
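Two of the checks above, clipping and long silence runs, can be approximated by hand when the analysis script is not available. The following is a minimal sketch, not the logic of analyze_phone_audio.py: the function names and every threshold (clip_level, silence_level, min_silence_s) are assumptions chosen for illustration, and it assumes mono 16-bit PCM WAV input.

```python
import array
import sys
import wave


def load_mono_pcm16(path):
    """Read a mono 16-bit PCM WAV file into (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected mono 16-bit PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV data is little-endian
        return samples, wav.getframerate()


def analyze_pcm16(samples, sample_rate, clip_level=32700,
                  silence_level=100, min_silence_s=0.5):
    """Count near-full-scale samples and list long quiet runs.

    Thresholds are illustrative assumptions, not tuned values.
    Returns silence runs as (start_seconds, duration_seconds).
    """
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    runs = []
    run_start = None
    for i, s in enumerate(samples):
        quiet = abs(s) <= silence_level
        if quiet and run_start is None:
            run_start = i
        elif not quiet and run_start is not None:
            duration = (i - run_start) / sample_rate
            if duration >= min_silence_s:
                runs.append((run_start / sample_rate, duration))
            run_start = None
    if run_start is not None:
        # Close a quiet run that lasts until the end of the file.
        duration = (len(samples) - run_start) / sample_rate
        if duration >= min_silence_s:
            runs.append((run_start / sample_rate, duration))
    return {"clipped_samples": clipped, "silence_runs": runs}
```

Running the same function over audio_agent.wav and audio_agent_played.wav and comparing the two results gives a rough source-vs-played check.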

The successful verification call was:

data/sessions/v3:wpd6lpl01hKmhVjgi0Ki0l1VV2bBiPhPfeyqQLuhZklZ8XR7Rw4PdA/

It completed in 53.3s, reached state done, and produced:

Both audio_agent.wav and audio_agent_played.wav were stable, unclipped, and transcribed end to end. The operator also confirmed Version 1, Version 2, and Version 3 all sounded clear over the handset.

Root Cause

The root cause was outbound phone media timing, not Piper.

The old path could accumulate timing error because each chunk was encoded and sent inside the same paced loop. Send overhead and codec/base64 work could move the next send later than intended. Over a long enough response, that created phone-side jitter/choppiness even when the source WAV was clean.
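The accumulation can be illustrated with a small simulation that uses no real sleeping. It compares a loop that sleeps a fixed 20 ms after every send with a loop that waits until an absolute deadline. The 1 ms per-frame overhead is an illustrative assumption, not a figure measured in this incident, and the function names are hypothetical.

```python
FRAME_S = 0.020  # 20 ms per frame


def lateness_relative(n_frames, overhead_s):
    """Lateness of the last frame when each iteration works, sends,
    then sleeps a full frame regardless of time already spent."""
    clock = 0.0
    sent_at = 0.0
    for _ in range(n_frames):
        clock += overhead_s   # encode/base64/send work for this frame
        sent_at = clock
        clock += FRAME_S      # fixed-length sleep
    return sent_at - (n_frames - 1) * FRAME_S


def lateness_absolute(n_frames, overhead_s):
    """Lateness of the last frame when each iteration waits only
    until that frame's absolute deadline."""
    clock = 0.0
    sent_at = 0.0
    for i in range(n_frames):
        clock += overhead_s
        deadline = i * FRAME_S
        if clock < deadline:
            clock = deadline  # sleep only for the remaining time
        sent_at = clock
    return sent_at - (n_frames - 1) * FRAME_S
```

With 1500 frames (30 seconds of audio) and 1 ms of overhead per frame, the fixed-sleep loop ends about 1.5 seconds behind the playback clock, while the deadline loop ends on time as long as the overhead stays below one frame.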

Fixes That Mattered

The working combination is:

  1. 20 ms Telnyx chunks

    • Conventional telephony frame cadence.
    • Avoids relying on larger buffered frames behaving well through every leg.
  2. Absolute-deadline pacing

    • Each chunk is scheduled against an absolute playback clock.
    • Per-send overhead no longer accumulates as drift.
  3. Pre-encoded Telnyx media before the send loop

    • μ-law/base64 payloads are built before the paced send starts.
    • The send loop mostly timestamps, records, sends, and sleeps.
  4. Played-audio artifact capture

    • audio_agent_played.wav records the outbound media representation after the phone codec path.
    • This gives future investigators a source-vs-played comparison instead of guessing.
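The first three fixes can be sketched together as follows. This is a simplified, synchronous illustration and not RIFF's actual sender: it assumes the audio is already 8 kHz μ-law, the function names are hypothetical, and the JSON message shape (an event of "media" with a base64 payload) is an assumption that should be checked against the Telnyx media streaming documentation.

```python
import base64
import json
import time

SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
# One byte per sample in mu-law, so 20 ms at 8 kHz is 160 bytes.
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000


def pre_encode(mulaw_audio: bytes) -> list[str]:
    """Build every media message before the paced loop starts."""
    messages = []
    for start in range(0, len(mulaw_audio), FRAME_BYTES):
        frame = mulaw_audio[start:start + FRAME_BYTES]
        # Pad a short final frame with mu-law silence (0xFF).
        frame = frame.ljust(FRAME_BYTES, b"\xff")
        payload = base64.b64encode(frame).decode("ascii")
        # Assumed message shape; verify against the carrier docs.
        messages.append(
            json.dumps({"event": "media", "media": {"payload": payload}})
        )
    return messages


def send_paced(messages, send, clock=time.monotonic, sleep=time.sleep):
    """Send one message per frame against an absolute playback clock."""
    start = clock()
    for index, message in enumerate(messages):
        deadline = start + index * FRAME_MS / 1000
        remaining = deadline - clock()
        if remaining > 0:
            sleep(remaining)  # wait only for what is left of this frame
        send(message)
```

Because each deadline is computed from the start time rather than from the previous send, a slow send shortens the next sleep instead of pushing every later frame back. The clock and sleep parameters are injectable so the pacing can be tested without real delays.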

Why Pregeneration Helped, But Was Not the Whole Fix

Pre-generating audio removes TTS latency and reduces runtime variability. It is still useful. But it does not by itself fix a bad phone media clock.

In this incident, pre-recorded Version 1 also degraded before the pacing fixes. That proved the problem was downstream from TTS. Once the Telnyx send path was fixed, all three versions were clear:

The lesson is:

Pre-generate known content to reduce latency and make responses deterministic, but still send it through a phone-safe paced media loop.

When To Pre-Generate Instead Of Using Live/Gemini Speech

Use pre-generated or pre-buffered audio when the content is predictable, branded, or operationally important.

Good candidates:

Keep live/Gemini-generated speech for:

The best Space Channel production pattern is likely static-first, live-second:

  1. RIFF owns deterministic play_audio states for known lines.
  2. Common Space Channel artifacts are pre-generated into phone-safe assets.
  3. Dynamic answers are generated as text, then either:
    • fully buffered before playback, or
    • sentence-buffered with a short computing sound while waiting.
  4. All audio, static or dynamic, uses the same Telnyx-safe pacing path.

1. Build a Static Audio Library

Generate and version reusable phone prompts:

Store these as manifest-backed assets, not one-off files.
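One possible shape for such a manifest is sketched below. The field names, the version string, and the one-directory-of-WAV-files layout are all assumptions for illustration; they are not an existing RIFF format.

```python
import hashlib
from pathlib import Path


def build_manifest(asset_dir, version):
    """Describe every WAV in asset_dir with a content hash and size.

    Hypothetical manifest layout: a version label plus one entry
    per asset, sorted by file name for stable diffs.
    """
    entries = []
    for path in sorted(Path(asset_dir).glob("*.wav")):
        data = path.read_bytes()
        entries.append({
            "name": path.stem,
            "file": path.name,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        })
    return {"version": version, "assets": entries}
```

Recording a content hash per asset makes it possible to tell whether a prompt heard on a call matches the version that was reviewed.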

2. Use Full-Buffer For Longer Dynamic Turns

For longer Gemini-generated responses, prefer:

  1. Generate the complete text.
  2. Synthesize the full sentence/turn.
  3. Start playback only when enough audio is ready.
  4. Use a short pre-canned "computing" tone if generation takes long enough to feel silent.

This avoids mid-sentence starvation. It also makes the call easier to review because every spoken turn has a complete artifact.
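The steps above can be sketched with asyncio. The names synthesize, play, and computing_tone are hypothetical stand-ins for whatever RIFF uses to generate and play audio, and the one-second silence budget is an assumed default rather than a tested value.

```python
import asyncio


async def speak_turn(synthesize, play, computing_tone, silence_budget_s=1.0):
    """Fully buffer a turn, covering a long wait with a canned tone.

    synthesize: async callable returning the complete turn audio.
    play: async callable that plays audio through the paced send path.
    """
    task = asyncio.ensure_future(synthesize())
    # Wait up to the silence budget; the task keeps running on timeout.
    done, _pending = await asyncio.wait({task}, timeout=silence_budget_s)
    if not done:
        await play(computing_tone)
    audio = await task  # complete buffer before any speech is played
    await play(audio)
    return audio
```

The tone is played only when generation exceeds the budget, so fast responses start without an extra sound, and the returned audio is the complete artifact for that turn.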

3. Keep Sentence Streaming Only Where It Is Actually Needed

Sentence streaming is useful for fast first audio, but it is more sensitive to runtime jitter. For Space Channel, a slightly longer buffer is acceptable if it keeps the voice clean.

Use sentence streaming for short back-and-forth. Use full-buffer for:

4. Phone-Safe Asset Build

For future hardening, generate phone assets in the exact form the carrier path wants:

Evidence Commands

Use these when investigating a new phone-audio report:

find data/sessions -maxdepth 2 -type f -name session_manifest.json -print0 |
  xargs -0 ls -t | head

./scripts/analyze_phone_audio.py --flow space_channel_audio_test --transcribe

jq -r '{session_id, flow_id, finalized, duration_sec, artifacts}' \
  data/sessions/<session-id>/session_manifest.json

If audio_agent.wav is clean but audio_agent_played.wav is bad, inspect codec/chunking. If both are clean but the handset is bad, suspect carrier/network/handset path beyond RIFF. If audio_agent.wav itself is bad, inspect TTS/rendering.
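The same decision order can be written down as a small helper, for example to label sessions in a review script. This is only a restatement of the rule above; the function name and the final "no fault" label are assumptions.

```python
def triage(source_clean: bool, played_clean: bool, handset_clean: bool) -> str:
    """Map the three observations to the next thing to inspect."""
    if not source_clean:
        # audio_agent.wav itself is bad.
        return "inspect TTS/rendering"
    if not played_clean:
        # Source is clean but audio_agent_played.wav is bad.
        return "inspect codec/chunking"
    if not handset_clean:
        # Both artifacts are clean but the handset report is bad.
        return "suspect carrier/network/handset path beyond RIFF"
    return "no audio fault found"
```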

Do Not Overfit The Wrong Variable

Avoid this trap:

"The phone voice sounds bad, so the TTS engine is bad."

That is only true if the local/generated source artifact is also bad.

The better question order is:

  1. Is the authored/generated source WAV clean?
  2. Is the played/encoded debug WAV clean?
  3. Did the handset sound clean?
  4. Did the call complete at a logical terminal state?

Only then decide whether to change TTS, prompt content, buffering, carrier chunking, or hangup logic.

Current Status

As of the successful 2026-06-29 Space Channel test:

Follow-Ups