Call Quality & Reviews

How we measure and debug live call performance: VAD endpoint tuning, cascading privacy rules, side-by-side Whisper transcription timelines, and nightly LLM-based grading.

1. Why a phone agent needs a review system

In text chat, latency is a minor annoyance. In voice-AI systems, latency is fatal. A single timing mismatch breaks conversational trust, causing immediate hang-ups.

Consider this true story from our early live testing. A caller was asked to verify their phone number, answered "Yes," and then sat through roughly nine seconds of silence before the agent responded.

In another test, a caller became trapped in a state loop for 10 minutes, asking "Are you there?" twice in frustration before abandoning the call.

To fix these issues, we couldn't rely on synthetic text benchmarks. We needed a system to record live call audio, align what the user actually said against what the model heard, measure latency at every turn, and automatically grade every session.

Takeaway: In voice systems, a 9-second silence is not just latency; it is a structural FSM failure that leads directly to an abandoned call.

2. Recording a call, safely

Our call quality cycle starts with recording. Every voice call is captured by CallSessionRecorder, producing a mono caller audio file (audio.wav) in data/sessions/<session_id>/, which is then pushed to S3 (s3://riff-s3-audio/calls/…) when the session ends.

However, recording voice calls raises serious privacy concerns. Phone calls often contain sensitive Personally Identifiable Information (PII) such as credit card numbers, addresses, and verification codes. We cannot persist those portions of the audio.

We solved this by building a cascading record-override policy. It mirrors our LLM model resolution policy, letting authors turn recording on or off at four scopes: overall → flow → group → state (groups are declared under segments in the flow file). The narrowest explicit scope wins.

# flows/apartment_intake.yaml
record: "on" # Flow-level default

segments:
  - name: "triage"
    record: "inherit"
  - name: "security_auth"
    record: "off" # Group-level override: black out this entire phase

states:
  collect_verification_code:
    speech_act: collect
    record: "off" # State-level override: security guard

At runtime, the audio loop checks the policy for the active state using resolve_record_policy(). If a state is marked off, its audio frames are simply skipped and never written to disk, creating a silent blackout window in the final recording.
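The cascade described above can be sketched in a few lines. This is a minimal illustration, not the actual resolve_record_policy() implementation: the function name resolve_record_policy_sketch, the dict-based input, and the fallback default of "on" are all assumptions made for the example.

```python
from typing import Optional

# Scopes ordered from broadest to narrowest; the names follow the text above.
SCOPES = ("overall", "flow", "group", "state")


def resolve_record_policy_sketch(
    declared: dict[str, Optional[str]], default: str = "on"
) -> bool:
    """Return True if audio for the active state should be recorded.

    `declared` maps a scope name to "on", "off", "inherit", or None (not set).
    The `default` fallback is an assumption of this sketch.
    """
    # Walk from the narrowest scope outward; the first explicit value wins.
    for scope in reversed(SCOPES):
        value = (declared.get(scope) or "inherit").strip().lower()
        if value in ("on", "off"):
            return value == "on"
    # Nothing explicit at any scope: fall back to the assumed default.
    return default == "on"


# The security_auth group turns recording off even though the flow says "on".
print(resolve_record_policy_sketch({"flow": "on", "group": "off"}))
```

Walking from state outward keeps the rule easy to audit: an explicit "off" on a state can only be overridden by editing that state.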

Because a typo like recrod: of could silently leak sensitive data, we enforce strict load-time validation in riff/record_policy.py:

def validate_record_values(flow: "Flow") -> None:
    bad = []
    for tier, name, value in _declared_records(flow):
        if _norm(value) not in {"on", "off", "inherit"}:
            bad.append(f"{tier} {name!r} record={value!r}")
    if bad:
        raise ValueError(f"Invalid record value(s): {'; '.join(bad)}")

Takeaway: Privacy overrides must cascade down to individual FSM states, blacking out sensitive frames in memory before they ever touch the disk.
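For readers who want to try the load-time check in isolation, here is a self-contained sketch. The helpers are guesses at what _norm and _declared_records might do, and the flow is represented as a plain dict shaped like the YAML example; neither is taken from riff/record_policy.py.

```python
ALLOWED = {"on", "off", "inherit"}


def norm(value) -> str:
    # YAML 1.1 loaders such as PyYAML parse bare on/off as booleans,
    # so map those back to strings before comparing (assumed behavior).
    if value is True:
        return "on"
    if value is False:
        return "off"
    return str(value).strip().lower()


def declared_records(flow: dict):
    """Yield (tier, name, value) for every record key in a flow dict."""
    if "record" in flow:
        yield "flow", flow.get("name", "<flow>"), flow["record"]
    for seg in flow.get("segments", []):
        if "record" in seg:
            yield "group", seg["name"], seg["record"]
    for name, state in flow.get("states", {}).items():
        if "record" in state:
            yield "state", name, state["record"]


def validate(flow: dict) -> None:
    bad = [
        f"{tier} {name!r} record={value!r}"
        for tier, name, value in declared_records(flow)
        if norm(value) not in ALLOWED
    ]
    if bad:
        raise ValueError(f"Invalid record value(s): {'; '.join(bad)}")


good_flow = {
    "record": "on",
    "segments": [{"name": "triage", "record": "inherit"}],
    "states": {"collect_verification_code": {"record": "off"}},
}
validate(good_flow)  # passes silently
```

Note that this sketch only checks values. A misspelled key would simply not be seen by declared_records, so catching that case needs a separate unknown-key check.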

3. Ground truth vs. what the model heard

When debugging the 9-second silence, our first question was: where did the time go? Was it a browser microphone issue, network packet loss, a slow speech-to-text (STT) transcriber, or Gemini itself?

To find out, we built scripts/compare_call_transcription.py. This tool runs a local Whisper model on the recorded audio to get ground-truth caller segments with timestamps, then aligns them side-by-side with Gemini's transcriptions (from turns.jsonl) and tool calls (from live.log) on a single timeline.

Here is the actual output that exposed the bug:

=== voice-1782603683638 — caller audio 42.1s, Whisper=tiny ===

--- WHISPER (caller ground truth, timestamped) ---
  [ 24.9–25.2s]  "Yes."

--- GEMINI (what it heard + did, timestamped) ---
  [ 34.1s]  GEMINI heard CALLER: 'yes'  [collect_phone→verify_code]
  [ 34.2s]  GEMINI tool_call: verify_phone_number {"phone": "5125551234"}

The timeline was the smoking gun. The caller finished speaking "Yes" at 25.2s. Gemini did not process the turn until 34.1s. The microphone logs confirmed audio bytes were streaming continuously during the gap: suppressed=0 forwarded=25.

This proved the audio reached the server without delay. The latency came from Gemini's default turn-detection endpointing, which kept waiting for additional audio before deciding the caller had finished their short, 300ms confirmation.
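The gap itself is simple arithmetic once both timelines share a clock. The sketch below is not taken from scripts/compare_call_transcription.py; the Segment and HeardEvent types and the endpoint_gaps function are hypothetical names, and the pairing rule (latest caller segment that ended before the event) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds, from the Whisper transcription
    end: float
    text: str


@dataclass
class HeardEvent:
    at: float  # seconds, when the model reported hearing the caller
    text: str


def endpoint_gaps(segments, events):
    """Pair each heard event with the latest caller segment that ended before it."""
    gaps = []
    for ev in events:
        earlier = [s for s in segments if s.end <= ev.at]
        if not earlier:
            continue  # no caller speech precedes this event
        seg = max(earlier, key=lambda s: s.end)
        gaps.append((seg.text, ev.text, round(ev.at - seg.end, 1)))
    return gaps


# Values from the timeline above: "Yes." ends at 25.2s, heard at 34.1s.
gaps = endpoint_gaps([Segment(24.9, 25.2, "Yes.")], [HeardEvent(34.1, "yes")])
print(gaps)  # [('Yes.', 'yes', 8.9)]
```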

Takeaway: Comparing Whisper ground-truth timestamps with Gemini FSM events isolates whether latency lives in audio collection, network transit, or turn-detection logic.

4. Fixing the pause: VAD tuning

To eliminate this pause, we tuned the Voice Activity Detection (VAD) parameters inside the Gemini Live session. By default, the model's turn-taking is conservative, optimized to avoid cutting users off if they pause mid-thought. This behaves poorly for short confirmations ("yes", "no", "that works").

We override Gemini's default endpointing in riff/live/live_client.py, enabling END_SENSITIVITY_HIGH and dropping the silence window to 700ms via RIFF_LIVE_SILENCE_MS:

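import os
from typing import Any


def _vad_config(types_mod: Any):
    # RIFF_LIVE_VAD=0 skips the override and keeps Gemini's default endpointing.
    if os.environ.get("RIFF_LIVE_VAD", "1").strip() == "0":
        return None
    return types_mod.RealtimeInputConfig(
        automatic_activity_detection=types_mod.AutomaticActivityDetection(
            end_of_speech_sensitivity=types_mod.EndSensitivity.END_SENSITIVITY_HIGH,
            # Silence required before the turn is considered finished.
            silence_duration_ms=int(os.environ.get("RIFF_LIVE_SILENCE_MS", "700")),
            # Audio kept from before detected speech, to avoid clipping the start.
            prefix_padding_ms=int(os.environ.get("RIFF_LIVE_PREFIX_MS", "300")),
        ),
    )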

By forcing the model to decide the turn is over after 700ms of silence (and keeping 300ms of pre-speech padding to avoid clipping the start), response latency dropped from 9 seconds to under 1.2 seconds, matching natural human cadences.

The VAD Sweet Spot: Setting the silence threshold too low (e.g. <500ms) will cut off callers who pause mid-sentence. In our testing, 700ms was the best balance for intake forms.
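To see why the threshold matters, consider a toy model of a fixed-silence endpointer. This is only an illustration of the trade-off: it is not how Gemini's VAD works internally, and the function name and the example timings are made up.

```python
def end_of_turn_ms(speech_intervals, silence_ms):
    """Return when (in ms) a fixed-silence endpointer would close the turn.

    speech_intervals: sorted list of (start_ms, end_ms) spans of caller speech.
    Returns None if there is no speech at all.
    """
    for i, (_, end) in enumerate(speech_intervals):
        is_last = i == len(speech_intervals) - 1
        # The turn closes at the first pause that reaches the threshold,
        # or after the final span if no earlier pause is long enough.
        if is_last or speech_intervals[i + 1][0] - end >= silence_ms:
            return end + silence_ms
    return None


# Hypothetical caller: speaks for 1.2s, pauses 600ms, then speaks until 3.5s.
utterance = [(0, 1200), (1800, 3500)]
print(end_of_turn_ms(utterance, 500))  # 1700: fires during the pause, cutting the caller off
print(end_of_turn_ms(utterance, 700))  # 4200: waits through the pause
```

In this toy case a 500ms threshold ends the turn 100ms before the caller resumes, while 700ms rides out the pause at the cost of a slightly later response.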

5. The two-layer review

To ensure we catch latency regressions and loop issues automatically, we built a two-layer review architecture that integrates with our SQLite quality database, quality.db.

| Dimension   | Layer 1: Objective (Instant)                        | Layer 2: Subjective (Nightly)                                      |
|-------------|-----------------------------------------------------|--------------------------------------------------------------------|
| Execution   | At call-end via FSM hook                            | Nightly batch cycle (01:30)                                        |
| Cost & Deps | Free, no API, no Whisper                            | LLM API cost, Whisper local GPU/CPU                                |
| Metrics     | Turn count, completion, final state, and loop_score | Subjective rubrics (humanness, efficiency), Word Error Rate (WER)  |
| Goal        | Catch catastrophic loop crashes immediately         | Grade conversational quality and catch mis-hears                   |

Layer 1: The Loop Score

Layer 1 computes the loop_score. By counting the maximum number of times a session visits any single state, we can flag runaway loops deterministically. For example, if a call visits verify_code 6 times, it gets a loop_score = 6, signaling that the caller is stuck:

from collections import Counter


def _objective_metrics(rows: list[dict]) -> dict:
    # rows: the parsed turns.jsonl records for one session, in call order
    last = rows[-1] if rows else {}
    # Count how many times each state was entered
    states = Counter(r.get("to_state") for r in rows if r.get("to_state"))
    return {
        "turn_count": len(rows),
        "loop_score": max(states.values()) if states else 0,
        "completed": 1 if last.get("verdict") == "complete" else 0,
        "final_state": last.get("to_state"),
    }
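As a worked example, here is the verify_code case from the text run through a standalone version of the same counting logic. The loop_score helper, the sample rows, and the alert threshold of 4 are hypothetical; the source does not state what threshold triggers a flag.

```python
from collections import Counter

LOOP_ALERT_THRESHOLD = 4  # hypothetical cutoff, chosen only for this example


def loop_score(rows):
    """Maximum number of times any single state was entered."""
    visits = Counter(r["to_state"] for r in rows if r.get("to_state"))
    return max(visits.values()) if visits else 0


# One pass through collect_phone, then six visits to verify_code.
sample_rows = [{"to_state": "collect_phone"}] + [{"to_state": "verify_code"}] * 6
score = loop_score(sample_rows)
print(score, score >= LOOP_ALERT_THRESHOLD)  # 6 True
```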

Layer 2: The LLM Judge

Layer 2 runs our standard evaluation rubric on the live transcript. An LLM parses the conversation and assigns scores from 1 to 5 across task_completion, naturalness, efficiency, and error_recovery, saving them to the live_call_reviews table.
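Because the judge is an LLM, its reply is worth validating before anything is written to the database. The sketch below assumes the judge returns a JSON object keyed by rubric name; that reply format and the parse_judge_scores function are assumptions, not a description of the actual judge script.

```python
import json

# Rubric names as listed in the text above.
RUBRICS = ("task_completion", "naturalness", "efficiency", "error_recovery")


def parse_judge_scores(raw: str) -> dict[str, int]:
    """Parse a judge reply and reject anything outside the 1 to 5 rubric."""
    data = json.loads(raw)
    scores = {}
    for name in RUBRICS:
        value = data.get(name)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name}: expected an integer from 1 to 5, got {value!r}")
        scores[name] = value
    return scores


reply = '{"task_completion": 4, "naturalness": 3, "efficiency": 2, "error_recovery": 5}'
print(parse_judge_scores(reply))
```

Rejecting malformed replies at this point keeps a single bad judge response from skewing the nightly summary.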

Takeaway: Separating instant objective metrics from nightly subjective evaluations ensures the system catches loop crashes for free while reserving LLM API budget for quality audits.

6. The nightly cycle

Every night at 01:30 local time, a macOS launchd job (configured via com.riff.nightly-review) kicks off scripts/nightly_review_cycle.sh. This script coordinates three actions:

  1. Runs scripts/judge_all_sessions.py, which applies the Layer 2 LLM judge to all un-reviewed live call transcripts from the past 24 hours.
  2. Runs scripts/ingest_call_reviews.py to extract Layer 1 metrics and upsert everything into quality.db.
  3. Runs riff.nightly_review --once --window-hours 24 to output a failure-mode summary, highlighting latency spikes and loop detections.

The entire output is logged to logs/nightly-review/cycle.log, keeping the development team updated every morning on live runtime quality.
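The upsert in step 2 can be sketched with the standard sqlite3 module. The column names below come from the query in "Try it yourself"; the column types, the primary key on session_id, and the upsert_review helper are assumptions, since the real schema of quality.db is not shown here.

```python
import sqlite3

# Assumed schema: only the column names are taken from this document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS live_call_reviews (
    session_id TEXT PRIMARY KEY,
    flow_id    TEXT,
    overall    REAL,
    efficiency INTEGER,
    loop_score INTEGER,
    turn_count INTEGER,
    completed  INTEGER
)
"""

UPSERT = """
INSERT INTO live_call_reviews
    (session_id, flow_id, overall, efficiency, loop_score, turn_count, completed)
VALUES
    (:session_id, :flow_id, :overall, :efficiency, :loop_score, :turn_count, :completed)
ON CONFLICT(session_id) DO UPDATE SET
    flow_id    = excluded.flow_id,
    overall    = excluded.overall,
    efficiency = excluded.efficiency,
    loop_score = excluded.loop_score,
    turn_count = excluded.turn_count,
    completed  = excluded.completed
"""


def upsert_review(conn: sqlite3.Connection, review: dict) -> None:
    """Insert a review row, or overwrite the existing row for that session."""
    conn.execute(UPSERT, review)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute(SCHEMA)
row = {"session_id": "voice-1782603683638", "flow_id": "apartment_intake",
       "overall": 2.5, "efficiency": 2, "loop_score": 1, "turn_count": 8, "completed": 1}
upsert_review(conn, row)
upsert_review(conn, {**row, "overall": 3.0})  # re-running the cycle updates in place
print(conn.execute("SELECT COUNT(*), MAX(overall) FROM live_call_reviews").fetchone())
```

Keying on session_id makes the nightly cycle safe to re-run: a second pass over the same session replaces the row instead of duplicating it.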

Takeaway: Automating evaluations at 01:30 ensures yesterday's caller issues are compiled, graded, and indexed in the database before the workday begins.

7. Try it yourself

Operators and developers can query the review database or audit a specific call directly from the terminal:

Query the worst-performing recent calls

sqlite3 -header -column docs/flow-eval/quality.db \
  "SELECT substr(session_id,7) sid, flow_id, overall, efficiency, loop_score, turn_count, completed
   FROM live_call_reviews 
   ORDER BY overall ASC, loop_score DESC 
   LIMIT 10;"

Drill into a call's Whisper-vs-Gemini timeline

# Transcribe the caller audio and line it up with Gemini events
.venv/bin/python scripts/compare_call_transcription.py voice-1782603683638

Review calls in the developer interface

You can also run the /review-calls command in the chat interface to trigger a run-through, fetch recent failure summaries, or re-run the nightly ingest cycle manually.

Takeaway: One SQL query locates the outlier, one timeline script reveals the timestamp gap, and one FSM edit or config change deploys the fix.