Judges — scoring the one subjective aspect
Humanness is the only aspect that needs an LLM. The same transcripts, scored by a swappable roster of judges — the spread between them is the uncertainty.
What & why
Correctness, completion, errors, and latency are computed deterministically. Humanness (“does this sound like a warm, natural person on the phone, or a robot reading slot names aloud?”) is irreducibly subjective, so it is scored by an LLM judge against a fixed rubric. Because any single judge is noisy (~1pt) and biased, RIFF runs a panel: several judges score the same transcripts, and their disagreement is the honest error bar on humanness. Taking the median of two or more judges shrinks that error bar.
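As a rough illustration of the panel idea, the sketch below combines one transcript's scores from several judges into a median and a spread. The function name `panel_score` and the choice of max minus min as the spread are assumptions made for this example, not RIFF's actual implementation.

```python
from statistics import median


def panel_score(scores_by_judge):
    """Combine one transcript's humanness scores from several judges.

    Returns (median, spread). Failed judgments (None) are ignored here.
    Spread is max - min, an assumed stand-in for "disagreement".
    """
    values = [s for s in scores_by_judge.values() if s is not None]
    if not values:
        return None, None
    return median(values), max(values) - min(values)


# Hypothetical scores for a single transcript.
consensus, spread = panel_score({"gemini": 7, "qwen": 5, "claude": 6})
print(consensus, spread)
```

The median is less sensitive than the mean to one judge that scores far from the others, which matters when each judge carries its own bias.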
The rubric
One prompt, schema-locked to a JSON object, scores 0–10:
humanness = natural phone manner; warm and concise; NOT robotic or repetitive;
never reads raw slot names / lists / JSON aloud; recovers gracefully with an
appropriate tone. IGNORE whether the task objectively completed except where a
failure makes the conversation feel unnatural.
Output ONLY: {"humanness": <int 0-10>, "notes": "<one specific sentence>"}
Task success is deliberately excluded — that is what the computed aspects are for. The judge grades manner, not outcome.
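A judge reply that follows the rubric's output line can be checked before it is stored. The following is a minimal sketch of such a check; `parse_judgment` is a hypothetical helper, and the real adapters may rely on provider-side schema enforcement instead.

```python
import json


def parse_judgment(raw):
    """Parse a judge reply; return (humanness, notes) or raise ValueError."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")

    score = obj.get("humanness")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError(f"humanness must be an int in 0..10, got {score!r}")

    notes = obj.get("notes")
    if not isinstance(notes, str):
        raise ValueError("notes must be a string")
    return score, notes


print(parse_judgment('{"humanness": 7, "notes": "Warm greeting, no slot names."}'))
```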
The roster
scripts/aspect_judge.py holds a named roster spanning remote APIs, local models, and
CLI-shelled judges. A generic passthrough (--judge ollama:<model> /
mlx:<model>) covers anything else installed.
| Class | Judges | Notes |
|---|---|---|
| remote API | gemini (gemini-2.5-flash), qwen (qwen3.7-plus), claude (claude-haiku-4-5) | fast; the everyday panel |
| CLI-shelled | codex (GPT via the Codex CLI), claude-cli (claude -p) | fallbacks when an API is down — see below |
| local (ollama) | gemma, gemma-big, mistral, phi4 | free/unlimited for big sweeps |
| local (mlx) | mlx, gemma-mlx, phi4-mlx | Apple-silicon local inference |
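One way to resolve a --judge value into either a named roster entry or a generic passthrough is sketched below. The names `NAMED_JUDGES`, `PASSTHROUGH_BACKENDS`, and `resolve_judge` are hypothetical, and the actual parsing in scripts/aspect_judge.py may differ.

```python
# Roster names taken from the table above; the sets themselves are illustrative.
NAMED_JUDGES = {
    "gemini", "qwen", "claude", "codex", "claude-cli",
    "gemma", "gemma-big", "mistral", "phi4",
    "mlx", "gemma-mlx", "phi4-mlx",
}
PASSTHROUGH_BACKENDS = {"ollama", "mlx"}


def resolve_judge(spec):
    """Map a --judge value to (kind, name).

    A roster name resolves to ("named", name); "<backend>:<model>" resolves
    to (backend, model). Only the first colon splits, so model tags that
    contain a colon stay intact.
    """
    if spec in NAMED_JUDGES:
        return ("named", spec)
    backend, sep, model = spec.partition(":")
    if sep and backend in PASSTHROUGH_BACKENDS and model:
        return (backend, model)
    raise ValueError(f"unknown judge spec: {spec!r}")


print(resolve_judge("gemini"))
print(resolve_judge("ollama:some-model:latest"))
```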
The CLI judges — why they exist
Two judges shell out to a CLI instead of an API, because the corresponding APIs are unreliable in this environment:
- CodexJudgeAdapter (riff/evaluation/codex_judge.py) runs codex exec in a read-only sandbox with a forced output schema. It is the fallback when the DashScope qwen judge is exhausted (403). Slower (~8s/call) and spawns a subprocess, so it is the fallback, not the default.
- ClaudeCodeAdapter (riff/evaluation/claude_code_judge.py) runs claude -p (print mode). The Anthropic API adapter 400s here, but the Claude Code CLI is installed and authenticated separately, so it is a usable Claude judge.
72e36f3). This is the “harden the
instrument” lesson in Methodology: measurement flaws
masquerade as subject flaws.
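The fallback order described in this section (try the API judge first, shell out to a CLI judge only when the API is unavailable) can be sketched as a small wrapper. `JudgeUnavailable` and `judge_with_fallback` are hypothetical names, and the judges are plain callables here rather than the real adapter classes.

```python
class JudgeUnavailable(Exception):
    """Raised by a judge callable when its backend cannot serve the request."""


def judge_with_fallback(transcript, judges):
    """Try each (name, judge) pair in order; return the first success.

    Returns (judge_name, result, attempts), where attempts records every
    judge that was tried so the fallback stays visible in the output.
    """
    attempts = []
    for name, judge in judges:
        try:
            result = judge(transcript)
        except JudgeUnavailable as exc:
            attempts.append((name, f"unavailable: {exc}"))
            continue
        attempts.append((name, "ok"))
        return name, result, attempts
    return None, None, attempts


# Stub judges standing in for an exhausted API and a working CLI judge.
def api_judge(transcript):
    raise JudgeUnavailable("403")


def cli_judge(transcript):
    return {"humanness": 6, "notes": "Polite but slightly stiff."}


print(judge_with_fallback("caller: hi ...", [("qwen", api_judge), ("codex", cli_judge)]))
```

Recording which judge actually produced the score matters because scores are only comparable within one judge.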
Robustness — emit every attempt
The judge writes a row for every attempt, including failures (humanness = null +
a status). A degraded judge that silently dropped its failures would shrink the sample and bias the
median upward — so failures are recorded, not omitted (Codex robustness #1). Judges that are
schema-locked to {quality, robustness} (the Codex CLI judge) fall back to their
holistic quality as the closest naturalness signal.
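A minimal sketch of the "row for every attempt" rule follows. The row fields beyond humanness and status, the status strings, and the helper names `make_row` and `summarize` are assumptions made for illustration.

```python
from statistics import median


def make_row(conversation_id, judge, humanness=None, status="ok"):
    """Build one output row; a failed attempt keeps humanness as None."""
    return {
        "conversation_id": conversation_id,
        "judge": judge,
        "humanness": humanness,
        "status": status,
    }


def summarize(rows):
    """Report the median over scored rows alongside the failure count."""
    scored = [r["humanness"] for r in rows if r["humanness"] is not None]
    failed = len(rows) - len(scored)
    return {
        "attempts": len(rows),
        "failed": failed,
        "failure_rate": failed / len(rows) if rows else 0.0,
        "median": median(scored) if scored else None,
    }


rows = [
    make_row("c1", "gemini", 8),
    make_row("c2", "gemini", None, status="timeout"),  # failure is kept
    make_row("c3", "gemini", 6),
]
print(summarize(rows))
```

Because the failed row stays in the data, a reader can see the failure rate next to the median instead of a median over a silently smaller sample.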
Example
python scripts/aspect_judge.py OUT/flow-transcripts-*.jsonl \
--judge gemini --out H --judged-at 2026-06-22T14:00:00Z
python scripts/eval_db.py --ingest-humanness --dirs H
# who judged, how often, timeout rate, average quality
python scripts/eval_db.py --judges
Real output from the live quality.db:
# Judges — who scored, how often, timeout rate, avg quality
(none/heuristic) n=800 timed_out=0 avg_q=4.98
alibaba/qwen3.7-plus n=162 timed_out=0 avg_q=3.85
gemini/gemini-2.5-flash n=77 timed_out=0 avg_q=4.32
codexjudge/codex n=77 timed_out=0 avg_q=4.29
_fallbackjudge/qwen3.7-plus->codex(fallback) n=40 timed_out=0 avg_q=3.63
panel:gemma/ollama+codex/cli n=2 timed_out=1 avg_q=8.0
Note the spread of average quality by judge on overlapping work: qwen 3.85 vs gemini 4.32
vs codex 4.29 on the same kinds of transcripts. That ~0.5pt gap between judges is real and is why
quality is only comparable within a judge column (the --timeline rescore rows
make this explicit).
Canonicalization — joining judged to computed
When humanness rows are ingested, each is canonicalized against the matching computed row by
conversation_id: it copies run_id/run_ts/agent_model/
persona from the computed row, so a judgment attributes to the right flow version
(run time, not judge time) and joins cleanly. The judge-time is kept separately in
details_json. A humanness row with no computed twin is skipped — it can't be placed
(Codex robustness #2).
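The join described above can be sketched with plain dictionaries. The copied field names come from the text; `canonicalize`, the `judged_at` key, and treating details_json as an already-parsed dict are assumptions for this example, not the behaviour of scripts/eval_db.py.

```python
COPIED_FIELDS = ("run_id", "run_ts", "agent_model", "persona")


def canonicalize(humanness_rows, computed_rows):
    """Attach run metadata from the computed row with the same conversation_id.

    Returns (kept_rows, skipped_ids). Rows without a computed twin are
    skipped because they cannot be attributed to a flow version.
    """
    computed_by_id = {r["conversation_id"]: r for r in computed_rows}
    kept, skipped = [], []
    for row in humanness_rows:
        twin = computed_by_id.get(row["conversation_id"])
        if twin is None:
            skipped.append(row["conversation_id"])
            continue
        out = dict(row)
        # Keep the judge's own timestamp separately (assumed key name).
        details = dict(out.get("details_json") or {})
        details["judged_at"] = row.get("judged_at")
        out["details_json"] = details
        # Run metadata comes from the computed row, not from judge time.
        for field in COPIED_FIELDS:
            out[field] = twin[field]
        kept.append(out)
    return kept, skipped


computed = [{"conversation_id": "c1", "run_id": "r1", "run_ts": "2026-06-20T09:00:00Z",
             "agent_model": "agent-a", "persona": "hurried"}]
judged = [
    {"conversation_id": "c1", "humanness": 7, "judged_at": "2026-06-22T14:00:00Z"},
    {"conversation_id": "c9", "humanness": 4, "judged_at": "2026-06-22T14:00:00Z"},
]
kept, skipped = canonicalize(judged, computed)
print(kept[0]["run_ts"], skipped)
```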
Where it fits
Judges feed the humanness column of the evaluation matrix.
They run nightly (the per-commit gate is judge-free by design),
and the full panel is driven by scripts/full_calibration.sh — 20 flows × 2 agents,
computed + per-state, then humanness × 3 judges on the same transcripts.