Call Quality & Reviews
How we measure and debug live call performance: VAD endpoint tuning, cascading privacy rules, side-by-side Whisper transcription timelines, and nightly LLM-based grading.
1. Why a phone agent needs a review system
In text chat, latency is a minor annoyance. In voice-AI systems, latency is fatal. A single timing mismatch breaks conversational trust, causing immediate hang-ups.
Consider this true story from our early live testing. A caller was asked to verify their phone number:
- The caller said "Yes" to confirm.
- A 9-second silence followed.
- Confused and wondering if the line dropped, the caller said "Yes" again.
- The voice agent, which had finally transitioned to the verify_code state, treated this second "Yes" as the verification code.
- The verification check failed, and the agent hung up.
In another test, a caller became trapped in a state loop for 10 minutes, asking "Are you there?" twice in frustration before abandoning the call.
To fix these issues, we couldn't rely on synthetic text benchmarks. We needed a system to record live call audio, align what the user actually said against what the model heard, measure latency at every turn, and automatically grade every session.
2. Recording a call, safely
Our call quality cycle starts with recording. Every voice call is captured by CallSessionRecorder, producing a mono caller audio file (audio.wav) in data/sessions/<session_id>/, which is then pushed to S3 (s3://riff-s3-audio/calls/…) when the session ends.
However, recording voice calls raises serious privacy concerns. Phone calls often contain sensitive Personally Identifiable Information (PII) such as credit card numbers, addresses, and verification codes. We cannot persist audio from those parts of a call.
We solved this by building a cascading record-override policy. It mirrors the LLM model resolution policy and lets authors turn recording on or off at four scopes: overall → flow → group → state. The most specific scope with an explicit setting wins.
# flows/apartment_intake.yaml
record: "on"  # Flow-level default
segments:
  - name: "triage"
    record: "inherit"
  - name: "security_auth"
    record: "off"  # Group-level override: black out this entire phase
states:
  collect_verification_code:
    speech_act: collect
    record: "off"  # State-level override: security guard
At runtime, the audio loop checks the policy for the active state using resolve_record_policy(). If a state is marked off, its audio frames are simply skipped and never written to disk, creating a silent blackout window in the final recording.
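The cascade can be sketched roughly as follows. This is a minimal illustration, not the actual resolve_record_policy() in riff/record_policy.py: the function name resolve_record, its argument order, and the fallback of "on" when no scope sets a value are all assumptions made for the example.

```python
from typing import Optional

VALID = {"on", "off", "inherit"}


def resolve_record(overall: Optional[str], flow: Optional[str],
                   group: Optional[str], state: Optional[str],
                   default: str = "on") -> bool:
    """Return True if audio for the active state should be recorded.

    Hypothetical sketch: scopes are checked from most specific (state)
    to least specific (overall). The first explicit "on"/"off" wins;
    None or "inherit" defers to the next wider scope.
    """
    for value in (state, group, flow, overall):
        if value is None:
            continue
        norm = value.strip().lower()
        if norm not in VALID:
            raise ValueError(f"invalid record value: {value!r}")
        if norm != "inherit":
            return norm == "on"
    return default == "on"  # assumed fallback when nothing is set


# Flow says "on", but the state-level "off" wins.
print(resolve_record(None, "on", "off", "off"))     # False
# A group marked "inherit" falls back to the flow-level "on".
print(resolve_record(None, "on", "inherit", None))  # True
```

Walking from the narrowest scope outward keeps the rule easy to audit: a state marked off can never be re-enabled by a wider scope.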
Because a typo like recrod: of could silently leak sensitive data, we enforce strict load-time validation in riff/record_policy.py:
def validate_record_values(flow: "Flow") -> None:
    bad = []
    for tier, name, value in _declared_records(flow):
        if _norm(value) not in {"on", "off", "inherit"}:
            bad.append(f"{tier} {name!r} record={value!r}")
    if bad:
        raise ValueError(f"Invalid record value(s): {'; '.join(bad)}")
3. Ground truth vs. what the model heard
When debugging the 9-second silence, our first question was: where did the time go? Was it a browser microphone issue, network packet loss, a slow speech-to-text (STT) transcriber, or Gemini itself?
To find out, we built scripts/compare_call_transcription.py. This tool runs a local Whisper model on the recorded audio to get ground-truth caller segments with timestamps, then aligns them side-by-side with Gemini's transcriptions (from turns.jsonl) and tool calls (from live.log) on a single timeline.
Here is the actual output that exposed the bug:
=== voice-1782603683638 — caller audio 42.1s, Whisper=tiny ===
--- WHISPER (caller ground truth, timestamped) ---
[ 24.9–25.2s] "Yes."
--- GEMINI (what it heard + did, timestamped) ---
[ 34.1s] GEMINI heard CALLER: 'yes' [collect_phone→verify_code]
[ 34.2s] GEMINI tool_call: verify_phone_number {"phone": "5125551234"}
The timeline was the smoking gun. The caller finished speaking "Yes" at 25.2s. Gemini did not process the turn until 34.1s. The microphone logs confirmed audio bytes were streaming continuously during the gap: suppressed=0 forwarded=25.
This showed that the audio was reaching the server without delay. The latency came from Gemini's default turn-detection endpointing, which was waiting for additional audio before deciding the caller had finished their short, 300 ms confirmation.
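The gap in that timeline can also be computed mechanically. The sketch below is not taken from scripts/compare_call_transcription.py; the WhisperSegment and GeminiTurn classes and the endpoint_gaps function are hypothetical names used only to show the arithmetic on the two timestamps from the output above.

```python
from dataclasses import dataclass


@dataclass
class WhisperSegment:
    start_s: float
    end_s: float
    text: str


@dataclass
class GeminiTurn:
    heard_at_s: float
    text: str


def endpoint_gaps(segments, turns):
    """Pair each Gemini turn with the latest Whisper segment that ended
    at or before it, and return (segment_text, gap_seconds) tuples."""
    gaps = []
    for turn in turns:
        earlier = [s for s in segments if s.end_s <= turn.heard_at_s]
        if not earlier:
            continue  # nothing was spoken before this turn
        seg = max(earlier, key=lambda s: s.end_s)
        gaps.append((seg.text, round(turn.heard_at_s - seg.end_s, 1)))
    return gaps


# Values copied from the timeline for voice-1782603683638.
segments = [WhisperSegment(24.9, 25.2, "Yes.")]
turns = [GeminiTurn(34.1, "yes")]
print(endpoint_gaps(segments, turns))  # [('Yes.', 8.9)]
```

An 8.9-second gap between the end of speech and the model registering the turn is the roughly 9-second silence the caller experienced.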
4. Fixing the pause: VAD tuning
To eliminate this pause, we tuned the Voice Activity Detection (VAD) parameters inside the Gemini Live session. By default, API model turn-taking is conservative, optimized to avoid cutting users off if they pause mid-thought. This behaves poorly for short confirmations ("yes", "no", "that works").
We override Gemini's default endpointing in riff/live/live_client.py, enabling END_SENSITIVITY_HIGH and dropping the silence window to 700ms via RIFF_LIVE_SILENCE_MS:
def _vad_config(types_mod: Any):
    if os.environ.get("RIFF_LIVE_VAD", "1").strip() == "0":
        return None
    return types_mod.RealtimeInputConfig(
        automatic_activity_detection=types_mod.AutomaticActivityDetection(
            end_of_speech_sensitivity=types_mod.EndSensitivity.END_SENSITIVITY_HIGH,
            silence_duration_ms=int(os.environ.get("RIFF_LIVE_SILENCE_MS", "700")),
            prefix_padding_ms=int(os.environ.get("RIFF_LIVE_PREFIX_MS", "300")),
        ),
    )
By forcing the model to decide the turn is over after 700ms of silence (and keeping 300ms of pre-speech padding to avoid clipping the start), response latency dropped from 9 seconds to under 1.2 seconds, matching natural human cadences.
5. The two-layer review
To ensure we catch latency regressions and loop issues automatically, we built a two-layer review architecture that integrates with our SQLite quality database, quality.db.
| Dimension | Layer 1: Objective (Instant) | Layer 2: Subjective (Nightly) |
|---|---|---|
| Execution | At call-end via FSM hook | Nightly batch cycle (01:30) |
| Cost & Deps | Free, no API, no Whisper | LLM API cost, Whisper local GPU/CPU |
| Metrics | Turn count, completion, final state, and loop_score | Subjective rubrics (humanness, efficiency), Word Error Rate (WER) |
| Goal | Catch catastrophic loop crashes immediately | Grade conversational quality and catch mis-hears |
Layer 1: The Loop Score
Layer 1 computes the loop_score. By counting the maximum number of times a session visits any single state, we can flag runaway loops deterministically. For example, if a call visits verify_code 6 times, its loop_score is 6, which signals that the caller is stuck:
def _objective_metrics() -> dict[str, dict]:
    # Excerpt: `rows` holds the parsed turns.jsonl records and `last` is the
    # final record; the code that loads them is elided here.
    # Counts state frequencies in turns.jsonl
    states = Counter(r.get("to_state") for r in rows if r.get("to_state"))
    return {
        "turn_count": len(rows),
        "loop_score": max(states.values()) if states else 0,
        "completed": 1 if last.get("verdict") == "complete" else 0,
        "final_state": last.get("to_state"),
    }
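To make the counting concrete, here is a small worked example of the same Counter logic on made-up rows. The row contents and the LOOP_THRESHOLD cutoff are assumptions for illustration; the source does not state which threshold, if any, is used to flag a call.

```python
from collections import Counter

# Hypothetical turns.jsonl rows; only the to_state field matters here.
rows = [
    {"to_state": "collect_phone"},
    {"to_state": "verify_code"},
    {"to_state": "verify_code"},
    {"to_state": "verify_code"},
    {"to_state": None},  # rows without a to_state are ignored
    {"to_state": "done"},
]

# Same counting rule as the excerpt above: most visits to any one state.
states = Counter(r.get("to_state") for r in rows if r.get("to_state"))
loop_score = max(states.values()) if states else 0

LOOP_THRESHOLD = 3  # assumed cutoff, not from the source
print(loop_score, loop_score >= LOOP_THRESHOLD)  # 3 True
```

Because the score depends only on state names already logged per turn, it needs no audio, no transcription, and no API call.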
Layer 2: The LLM Judge
Layer 2 runs our standard evaluation rubric on the live transcript. An LLM parses the conversation and assigns scores from 1 to 5 across task_completion, naturalness, efficiency, and error_recovery, saving them to the live_call_reviews table.
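Before judge scores are stored, it is worth checking that they fit the rubric. The sketch below assumes the judge replies with a JSON object keyed by the four dimension names; that reply format and the parse_judge_scores helper are assumptions, not the behavior of scripts/judge_all_sessions.py.

```python
import json

RUBRIC = ("task_completion", "naturalness", "efficiency", "error_recovery")


def parse_judge_scores(raw: str) -> dict:
    """Parse a judge reply and require an integer 1-5 for every dimension."""
    data = json.loads(raw)
    scores = {}
    for dim in RUBRIC:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{dim}: expected an integer, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{dim}: score {value} is outside 1-5")
        scores[dim] = value
    return scores


reply = ('{"task_completion": 2, "naturalness": 4, '
         '"efficiency": 1, "error_recovery": 3}')
print(parse_judge_scores(reply))
```

Rejecting malformed replies at this point keeps out-of-range or missing scores from reaching the review table.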
6. The nightly cycle
Every night at 01:30 local time, a macOS launchd job (configured via com.riff.nightly-review) runs scripts/nightly_review_cycle.sh. This script coordinates three actions:
- Runs scripts/judge_all_sessions.py to run the Layer 2 LLM judge over all un-reviewed live call transcripts from the past 24 hours.
- Runs scripts/ingest_call_reviews.py to extract Layer 1 metrics and upsert everything into quality.db.
- Runs riff.nightly_review --once --window-hours 24 to output a failure-mode summary, highlighting latency spikes and loop detections.
The entire output is logged to logs/nightly-review/cycle.log, keeping the development team updated every morning on live runtime quality.
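The upsert step in the ingest script can be pictured with a small sqlite3 sketch. The column names come from the query shown in this article, but the column types, the primary key, and the upsert_review helper are assumptions; the real schema of live_call_reviews is not shown here. The ON CONFLICT clause requires SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for quality.db
conn.execute(
    """CREATE TABLE live_call_reviews (
           session_id TEXT PRIMARY KEY,  -- assumed key
           loop_score INTEGER,
           turn_count INTEGER,
           completed  INTEGER,
           overall    REAL
       )"""
)


def upsert_review(conn, review):
    """Insert a review row, or overwrite the existing row for that session."""
    conn.execute(
        """INSERT INTO live_call_reviews
               (session_id, loop_score, turn_count, completed, overall)
           VALUES (:session_id, :loop_score, :turn_count, :completed, :overall)
           ON CONFLICT(session_id) DO UPDATE SET
               loop_score = excluded.loop_score,
               turn_count = excluded.turn_count,
               completed  = excluded.completed,
               overall    = excluded.overall""",
        review,
    )
    conn.commit()


review = {"session_id": "voice-1782603683638", "loop_score": 2,
          "turn_count": 7, "completed": 0, "overall": 2.0}
upsert_review(conn, review)
upsert_review(conn, {**review, "loop_score": 3})  # re-ingest, same session

row_count = conn.execute("SELECT COUNT(*) FROM live_call_reviews").fetchone()[0]
stored = conn.execute("SELECT loop_score FROM live_call_reviews").fetchone()[0]
print(row_count, stored)  # 1 3
```

Keying on session_id makes the ingest idempotent, so re-running the nightly cycle by hand updates existing rows instead of duplicating them.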
7. Try it yourself
Operators and developers can query the review database or audit a specific call directly from the terminal:
Query the worst-performing recent calls
sqlite3 -header -column docs/flow-eval/quality.db \
"SELECT substr(session_id,7) sid, flow_id, overall, efficiency, loop_score, turn_count, completed
FROM live_call_reviews
ORDER BY overall ASC, loop_score DESC
LIMIT 10;"
Drill into a call's Whisper-vs-Gemini timeline
# Transcribe the caller audio and line it up with Gemini events
.venv/bin/python scripts/compare_call_transcription.py voice-1782603683638
Review calls in the developer interface
You can also use the /review-calls command in the chat interface to trigger a run-through, fetch recent failure summaries, or re-run the nightly ingest cycle manually.