Space Channel Phone Audio Learning — 2026-06-29
Audience: RIFF operators and engineers debugging real Telnyx calls.
Short version: if the browser/local audio sounds clean but the phone sounds choppy, do not start by swapping TTS engines. First prove or rule out the outbound phone media clock.
What Happened
The Space Channel flow sounded clean in the browser, but over the phone the voice became "worbly" and choppy after a while. The call could also appear to hang up before a logical completion point. At first this looked like a Piper/TTS or robot-voice problem.
It was not primarily a voice-generation problem.
The source audio was clean. The phone artifact was caused by the Telnyx outbound media path: chunk size, pacing drift, and doing codec/base64 work inside the paced send loop.
Symptom Pattern
This is the signature to remember:
- Browser/local playback sounds clear.
- Pre-rendered WAVs sound clear when played outside the phone path.
- Phone audio starts clear, then degrades after a longer phrase or later in a static chain.
- The issue affects static and generated speech similarly once enough audio is sent.
- Changing voices or TTS engines does not reliably fix it.
That pattern points at the carrier media stream, not the TTS model.
Diagnosis Path
The useful sequence was:
- Build a diagnostic flow with three versions of the same line:
- Version 1: pre-recorded static audio.
- Version 2: call-time generated speech played in sentence chunks.
- Version 3: call-time generated speech fully buffered, then played.
- Route the Telnyx number to that diagnostic flow.
- Compare human handset reports against local artifacts:
- audio_agent.wav: RIFF's generated/source agent audio.
- audio_agent_played.wav: final outbound audio reconstructed from the Telnyx media payload path.
- Run:
./scripts/analyze_phone_audio.py --flow space_channel_audio_test --transcribe
- Check for:
- clipping,
- long dropout/silence runs,
- level drift,
- transcript failure,
- source-vs-played artifact mismatch.
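The first three checks in the list above can be approximated with the standard library alone. The following is a minimal sketch, not the logic of scripts/analyze_phone_audio.py: the function names, the silence threshold, and the half-versus-half drift measure are all illustrative assumptions, and it does not cover transcription or source-vs-played comparison.

```python
import array
import math
import sys
import wave


def load_mono_pcm16(path):
    """Read a 16-bit mono WAV file and return (samples, sample_rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        samples = array.array("h")
        samples.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    return samples, rate


def clipped_fraction(samples, limit=32767):
    """Fraction of samples at or beyond full scale."""
    if len(samples) == 0:
        return 0.0
    return sum(1 for s in samples if abs(s) >= limit) / len(samples)


def longest_silence_ms(samples, rate, threshold=200):
    """Longest run of near-silent samples, in milliseconds.

    The threshold is an assumed placeholder, not a tuned value.
    """
    longest = run = 0
    for s in samples:
        run = run + 1 if abs(s) < threshold else 0
        longest = max(longest, run)
    return 1000.0 * longest / rate


def rms(samples):
    """Root-mean-square level of a sample sequence."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def level_drift_db(samples):
    """Level of the second half relative to the first half, in dB.

    Returns None when either half is digitally silent.
    """
    half = len(samples) // 2
    first, second = rms(samples[:half]), rms(samples[half:])
    if first == 0.0 or second == 0.0:
        return None
    return 20.0 * math.log10(second / first)
```

Running the same three functions on audio_agent.wav and audio_agent_played.wav and comparing the numbers side by side is one way to spot a mismatch before listening.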
The successful verification call was:
data/sessions/v3:wpd6lpl01hKmhVjgi0Ki0l1VV2bBiPhPfeyqQLuhZklZ8XR7Rw4PdA/
It completed in 53.3s, reached state done, and produced:
- audio.wav
- audio_caller.wav
- audio_agent.wav
- audio_agent_played.wav
- session_manifest.json
Both audio_agent.wav and audio_agent_played.wav were stable, unclipped, and transcribed
end to end. The operator also confirmed Version 1, Version 2, and Version 3 all sounded clear
over the handset.
Root Cause
The root cause was outbound phone media timing, not Piper.
The old path could accumulate timing error because each chunk was encoded and sent inside the same paced loop. Send overhead and codec/base64 work could move the next send later than intended. Over a long enough response, that created phone-side jitter/choppiness even when the source WAV was clean.
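The accumulation is easy to see with plain arithmetic. The sketch below models two schedules in simulated milliseconds: one where each iteration does its work and then sleeps a full chunk, and one where each chunk targets an absolute deadline. The 1.5 ms per-chunk overhead is an assumed figure for illustration, not a measurement from this incident.

```python
CHUNK_MS = 20.0


def relative_sleep_schedule(n_chunks, overhead_ms):
    """Send times (ms) when every iteration adds overhead on top of a full-chunk sleep."""
    now = 0.0
    times = []
    for _ in range(n_chunks):
        times.append(now)
        now += overhead_ms + CHUNK_MS  # per-chunk work, then sleep 20 ms
    return times


def absolute_deadline_schedule(n_chunks, overhead_ms):
    """Send times (ms) when chunk i waits for the deadline start + i * CHUNK_MS."""
    now = 0.0
    times = []
    for i in range(n_chunks):
        deadline = i * CHUNK_MS
        now = max(now, deadline)  # sleep only for the time that remains
        times.append(now)
        now += overhead_ms  # per-chunk work is absorbed by the next wait
    return times


def drift_ms(times):
    """How late the final chunk is relative to an ideal 20 ms cadence."""
    return times[-1] - (len(times) - 1) * CHUNK_MS


if __name__ == "__main__":
    n = 2500  # 50 seconds of audio at 20 ms per chunk
    print("relative:", drift_ms(relative_sleep_schedule(n, 1.5)), "ms late")
    print("absolute:", drift_ms(absolute_deadline_schedule(n, 1.5)), "ms late")
```

Under these assumptions the relative loop finishes roughly 3.7 seconds behind the ideal cadence after 50 seconds of audio, while the absolute-deadline loop stays on schedule as long as the per-chunk overhead is shorter than one chunk.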
Fixes That Mattered
The working combination is:
20 ms Telnyx chunks
- Conventional telephony frame cadence.
- Avoids relying on larger buffered frames behaving well through every leg.
Absolute-deadline pacing
- Each chunk is scheduled against an absolute playback clock.
- Per-send overhead no longer accumulates as drift.
Pre-encoded Telnyx media before the send loop
- μ-law/base64 payloads are built before the paced send starts.
- The send loop mostly timestamps, records, sends, and sleeps.
Played-audio artifact capture
- audio_agent_played.wav records the outbound media representation after the phone codec path.
- This gives future investigators a source-vs-played comparison instead of guessing.
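A paced loop built on these ideas might look like the sketch below. It is a simplified, synchronous illustration rather than RIFF's actual send path: `send_paced`, its parameters, and the lateness log are hypothetical names, and the payloads are assumed to be already encoded before the loop starts.

```python
import time

FRAME_SEC = 0.020  # one 20 ms chunk


def send_paced(payloads, send, clock=time.monotonic, sleep=time.sleep):
    """Send pre-encoded payloads against an absolute playback clock.

    `send` is any callable that transmits one payload (hypothetical here).
    Returns a list of (chunk_index, lateness_seconds) for later review.
    """
    start = clock()
    log = []
    for i, payload in enumerate(payloads):
        # The deadline depends only on the start time and the chunk index,
        # so overhead from earlier sends cannot push later chunks back.
        deadline = start + i * FRAME_SEC
        delay = deadline - clock()
        if delay > 0:
            sleep(delay)
        log.append((i, max(0.0, clock() - deadline)))
        send(payload)
    return log
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting, and the returned lateness values give a quick signal when the sender cannot keep up with the 20 ms cadence.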
Why Pregeneration Helped, But Was Not the Whole Fix
Pre-generating audio removes TTS latency and reduces runtime variability. It is still useful. But it does not by itself fix a bad phone media clock.
In this incident, pre-recorded Version 1 also degraded before the pacing fixes. That proved the problem was downstream from TTS. Once the Telnyx send path was fixed, all three versions were clear:
- static pre-recorded audio,
- sentence-streamed generated audio,
- fully buffered generated audio.
The lesson is:
Pre-generate known content to reduce latency and make responses deterministic, but still send it through a phone-safe paced media loop.
When To Pre-Generate Instead Of Using Live/Gemini Speech
Use pre-generated or pre-buffered audio when the content is predictable, branded, or operationally important.
Good candidates:
- greetings,
- capability lists,
- launch summaries,
- "how may I help you" lines,
- repeated Space Channel summaries,
- error/hold/computing messages,
- menus and confirmations,
- known artifacts from the Space Channel site.
Keep live/Gemini-generated speech for:
- caller-specific answers,
- open-ended Q&A,
- dynamic tool results,
- unexpected caller intents,
- personalized synthesis of new information.
The best Space Channel production pattern is likely static-first, live-second:
- RIFF owns deterministic play_audio states for known lines.
- Common Space Channel artifacts are pre-generated into phone-safe assets.
- Dynamic answers are generated as text, then either:
- fully buffered before playback, or
- sentence-buffered with a short computing sound while waiting.
- All audio, static or dynamic, uses the same Telnyx-safe pacing path.
Recommended Architecture For Space Channel
1. Build a Static Audio Library
Generate and version reusable phone prompts:
- greeting,
- "Space Channel operations" identity line,
- capabilities,
- launch command summary,
- bug intake summary,
- solar weather summary,
- UFO files summary,
- space news/radio handoff,
- "computing" / "stand by" sound.
Store these as manifest-backed assets, not one-off files.
2. Use Full-Buffer For Longer Dynamic Turns
For longer Gemini-generated responses, prefer:
- Generate the complete text.
- Synthesize the full sentence/turn.
- Start playback only when enough audio is ready.
- Use a short pre-canned "computing" tone if generation takes long enough to feel silent.
This avoids mid-sentence starvation. It also makes the call easier to review because every spoken turn has a complete artifact.
3. Keep Sentence Streaming Only Where It Is Actually Needed
Sentence streaming is useful for fast first audio, but it is more sensitive to runtime jitter. For Space Channel, a slightly longer buffer is acceptable if it keeps the voice clean.
Use sentence streaming for short back-and-forth. Use full-buffer for:
- capability explanations,
- summaries,
- long tool results,
- anything with multiple clauses.
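The split between static audio, full-buffer, and sentence streaming could be expressed as a small policy function. This is a hypothetical sketch: the mode names static_audio and droid_full_buffer come from this note, while `sentence_stream`, `choose_mode`, and the 12-word cutoff are invented for illustration and would need tuning against real calls.

```python
from enum import Enum


class PlaybackMode(Enum):
    STATIC_AUDIO = "static_audio"
    FULL_BUFFER = "droid_full_buffer"
    SENTENCE_STREAM = "sentence_stream"  # assumed name, not from the source


def choose_mode(text, static_keys, key=None, max_stream_words=12):
    """Pick a playback mode for one spoken turn.

    key: optional identifier of a known line in the static audio library.
    max_stream_words: assumed cutoff between a short reply and a long turn.
    """
    if key is not None and key in static_keys:
        return PlaybackMode.STATIC_AUDIO
    if len(text.split()) <= max_stream_words:
        return PlaybackMode.SENTENCE_STREAM
    return PlaybackMode.FULL_BUFFER
```

A word count is a crude proxy for "multiple clauses"; the point of the sketch is only that the decision is made once per turn, before any audio is sent.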
4. Phone-Safe Asset Build
For future hardening, generate phone assets in the exact form the carrier path wants:
- normalize loudness before μ-law,
- phone-band-limit before encoding,
- optionally cache 8 kHz μ-law payload chunks,
- keep a 16 kHz WAV review artifact,
- preserve the audio_agent_played.wav debug path.
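As a rough illustration of the build steps above, the sketch below takes 16-bit samples that are assumed to be already band-limited and resampled to 8 kHz mono, applies simple peak normalization (a stand-in for proper loudness normalization), encodes them with the standard G.711 μ-law companding algorithm, and caches base64 payloads of 160 bytes each (20 ms at 8 kHz). The function names and the target peak are illustrative assumptions.

```python
import base64

BIAS = 0x84         # G.711 mu-law bias
CLIP = 32635        # largest magnitude before the bias is added
FRAME_BYTES = 160   # 20 ms at 8 kHz, one byte per mu-law sample


def linear_to_ulaw(sample):
    """Encode one signed 16-bit PCM sample as a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def peak_normalize(samples, target_peak=22000):
    """Scale samples so the largest magnitude equals target_peak (assumed value)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)
    gain = target_peak / peak
    return [int(round(s * gain)) for s in samples]


def build_payloads(samples):
    """Return base64 strings, each holding one 20 ms mu-law frame."""
    ulaw = bytes(linear_to_ulaw(s) for s in samples)
    remainder = len(ulaw) % FRAME_BYTES
    if remainder:
        # Pad the final frame with mu-law silence (0xFF encodes zero).
        ulaw += b"\xff" * (FRAME_BYTES - remainder)
    return [
        base64.b64encode(ulaw[i:i + FRAME_BYTES]).decode("ascii")
        for i in range(0, len(ulaw), FRAME_BYTES)
    ]
```

Payloads built this way can be stored next to the 16 kHz review WAV, so the paced send loop never has to encode during a call.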
Evidence Commands
Use these when investigating a new phone-audio report:
find data/sessions -maxdepth 2 -type f -name session_manifest.json -print0 |
xargs -0 ls -t | head
./scripts/analyze_phone_audio.py --flow space_channel_audio_test --transcribe
jq -r '{session_id, flow_id, finalized, duration_sec, artifacts}' \
data/sessions/<session-id>/session_manifest.json
If audio_agent.wav is clean but audio_agent_played.wav is bad, inspect codec/chunking.
If both are clean but the handset is bad, suspect carrier/network/handset path beyond RIFF.
If audio_agent.wav itself is bad, inspect TTS/rendering.
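The three rules above form a small decision table. A minimal sketch, with `next_step` as a hypothetical helper name, makes the ordering explicit: the source artifact is checked first, then the played artifact, then the handset report.

```python
def next_step(source_clean, played_clean, handset_clean):
    """Map the three observations to the next place to look."""
    if not source_clean:
        return "inspect TTS/rendering"
    if not played_clean:
        return "inspect codec/chunking"
    if not handset_clean:
        return "suspect carrier/network/handset path beyond RIFF"
    return "no audio fault found"
```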
Do Not Overfit The Wrong Variable
Avoid this trap:
"The phone voice sounds bad, so the TTS engine is bad."
That is only true if the local/generated source artifact is also bad.
The better question order is:
- Is the authored/generated source WAV clean?
- Is the played/encoded debug WAV clean?
- Did the handset sound clean?
- Did the call complete at a logical terminal state?
Only then decide whether to change TTS, prompt content, buffering, carrier chunking, or hangup logic.
Current Status
As of the successful 2026-06-29 Space Channel test:
- all three diagnostic versions sounded clear to the handset operator,
- the diagnostic flow completed normally,
- audio_agent.wav and audio_agent_played.wav were stable and unclipped,
- Whisper transcribed both artifacts end to end,
- the likely fix is the combination of changes to the Telnyx outbound path:
- 20 ms chunks,
- absolute-deadline pacing,
- pre-encoded media before paced send.
Follow-Ups
- Build a Space Channel static audio library from site artifacts.
- Add a production policy knob for Space Channel dynamic speech:
- static_audio for known lines,
- droid_full_buffer for long generated answers,
- sentence streaming only for short replies.
- Improve the scripts/analyze_phone_audio.py source-vs-played comparison so it aligns streams with send-time padding before reporting correlation.
- Consider caching phone-ready μ-law chunks for known static audio.
- Keep audio_agent_played.wav in the session manifest for real-call review.