
Judges — scoring the one subjective aspect

Humanness is the only aspect that needs an LLM. The same transcripts are scored by a swappable roster of judges, and the spread between their scores is the uncertainty.

What & why

Correctness, completion, errors, and latency are computed deterministically. Humanness (“does this sound like a warm, natural person on the phone, or a robot reading slot names aloud?”) is irreducibly subjective, so it is scored by an LLM judge against a fixed rubric. Because any single judge is noisy (~1pt) and biased, RIFF runs a panel: several judges score the same transcripts, and their disagreement is the honest error bar on humanness. Taking the median of two or more judges shrinks that error bar.
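As a rough illustration of the panel idea, the sketch below combines per-judge scores for one transcript into a median and a spread. The function name `panel_summary`, the dict layout, and the use of max minus min as the spread are assumptions made for this example, not the logic in scripts/aspect_judge.py.

```python
from statistics import median


def panel_summary(scores):
    """Combine per-judge humanness scores for one transcript.

    `scores` maps judge name -> int score, or None when that judge failed.
    (Hypothetical layout, chosen for illustration.)
    """
    valid = [s for s in scores.values() if s is not None]
    if not valid:
        return {"humanness": None, "spread": None, "n_judges": 0}
    return {
        "humanness": median(valid),         # panel score
        "spread": max(valid) - min(valid),  # disagreement as a crude error bar
        "n_judges": len(valid),
    }


summary = panel_summary({"gemini": 6, "qwen": 5, "claude": 7})
print(summary)
```

A wide spread on a transcript is a signal to read the judges' notes before trusting the median.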

The rubric

A single prompt, with its output schema-locked to a JSON object, scores humanness from 0 to 10:

humanness = natural phone manner; warm and concise; NOT robotic or repetitive;
never reads raw slot names / lists / JSON aloud; recovers gracefully with an
appropriate tone. IGNORE whether the task objectively completed except where a
failure makes the conversation feel unnatural.

Output ONLY: {"humanness": <int 0-10>, "notes": "<one specific sentence>"}

Task success is deliberately excluded — that is what the computed aspects are for. The judge grades manner, not outcome.
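A schema-locked reply can still be checked on the receiving side. The sketch below validates one raw judge reply against the shape given in the rubric; `parse_judgment` is a hypothetical helper written for this example, and the real adapters may validate differently.

```python
import json


def parse_judgment(raw):
    """Parse one judge reply; raise ValueError if it breaks the rubric schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(obj, dict) or set(obj) != {"humanness", "notes"}:
        raise ValueError("expected exactly the keys 'humanness' and 'notes'")
    score, notes = obj["humanness"], obj["notes"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("humanness must be an integer from 0 to 10")
    if not isinstance(notes, str):
        raise ValueError("notes must be a string")
    return {"humanness": score, "notes": notes}


judgment = parse_judgment('{"humanness": 7, "notes": "Warm, but repeats the greeting."}')
print(judgment["humanness"])
```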

The roster

scripts/aspect_judge.py holds a named roster spanning remote APIs, local models, and CLI-shelled judges. A generic passthrough (--judge ollama:<model> / mlx:<model>) covers anything else installed.

Class	Judges	Notes
remote API	`gemini` (gemini-2.5-flash), `qwen` (qwen3.7-plus), `claude` (claude-haiku-4-5)	fast; the everyday panel
CLI-shelled	`codex` (GPT via the Codex CLI), `claude-cli` (`claude -p`)	fallbacks when an API is down — see below
local (ollama)	`gemma`, `gemma-big`, `mistral`, `phi4`	free/unlimited for big sweeps
local (mlx)	`mlx`, `gemma-mlx`, `phi4-mlx`	Apple-silicon local inference
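One way a `--judge` value could be resolved into either a named roster entry or a generic passthrough is sketched below. The roster names come from the table above; the function `resolve_judge`, its return shape, and the error handling are assumptions for illustration, not the parser in scripts/aspect_judge.py.

```python
# Named judges from the roster table.
NAMED_JUDGES = {
    "gemini", "qwen", "claude", "codex", "claude-cli",
    "gemma", "gemma-big", "mistral", "phi4",
    "mlx", "gemma-mlx", "phi4-mlx",
}
PASSTHROUGH_BACKENDS = {"ollama", "mlx"}


def resolve_judge(spec):
    """Return ("named", name) or (backend, model) for a --judge value."""
    if spec in NAMED_JUDGES:
        return ("named", spec)
    # Split on the first colon only: model tags may contain colons themselves.
    backend, sep, model = spec.partition(":")
    if sep and backend in PASSTHROUGH_BACKENDS and model:
        return (backend, model)
    raise ValueError(f"unknown judge: {spec!r}")


print(resolve_judge("gemini"))
print(resolve_judge("ollama:llama3:8b"))
```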

The CLI judges — why they exist

Two judges shell out to a CLI instead of an API, because the corresponding APIs are unreliable in this environment:

CodexJudgeAdapter (riff/evaluation/codex_judge.py) runs codex exec in a read-only sandbox with a forced output schema. It takes over when the DashScope qwen judge is exhausted (403). Because it is slower (~8s/call) and spawns a subprocess, it is a fallback, not the default.
ClaudeCodeAdapter (riff/evaluation/claude_code_judge.py) runs claude -p (print mode). The Anthropic API adapter returns 400 errors here, but the Claude Code CLI is installed and authenticated separately, so it remains a usable Claude judge.
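The general shape of a CLI-shelled judge can be sketched as a subprocess call that returns a status alongside the parsed reply. This is not the code of either adapter: `run_cli_judge`, its status strings, and passing the prompt on stdin are assumptions, and the real command lines (sandbox and schema flags for codex exec, arguments for claude -p) are not reproduced here. A stub command stands in for the real CLI so the example runs anywhere.

```python
import json
import subprocess
import sys


def run_cli_judge(command, prompt, timeout_s=60):
    """Run a judge CLI with the prompt on stdin; return (status, parsed_or_None)."""
    try:
        proc = subprocess.run(
            command, input=prompt, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "timed_out", None
    except FileNotFoundError:
        return "cli_missing", None
    if proc.returncode != 0:
        return "cli_error", None
    try:
        return "ok", json.loads(proc.stdout)
    except json.JSONDecodeError:
        return "bad_output", None


# Stand-in for a real judge CLI: reads stdin, prints a fixed JSON reply.
stub_reply = json.dumps({"humanness": 7, "notes": "stub"})
fake_cli = [sys.executable, "-c", f"import sys; sys.stdin.read(); print({stub_reply!r})"]

status, judgment = run_cli_judge(fake_cli, "transcript text")
print(status, judgment)
```

Returning a status rather than raising keeps every failure mode visible to the caller, which matters once failures have to be recorded as rows.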

A broken judge is an instrument fault, not a flow fault. During this session the broken Claude API agent and a fallback-exhausted qwen judge both produced score craters that looked like flow regressions. The broken Claude agent was dropped from the calibration panel (commit 72e36f3). This is the “harden the instrument” lesson in Methodology: measurement flaws masquerade as subject flaws.

Robustness — emit every attempt

The judge writes a row for every attempt, including failures (humanness = null + a status). A degraded judge that silently dropped its failures would shrink the sample and bias the median upward — so failures are recorded, not omitted (Codex robustness #1). Judges that are schema-locked to {quality, robustness} (the Codex CLI judge) fall back to their holistic quality as the closest naturalness signal.
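A minimal sketch of the emit-every-attempt rule follows. The row fields mirror the description above (humanness is null plus a status); the helper `judge_all`, the status strings, and the flaky demo judge are hypothetical.

```python
def judge_all(conversation_ids, judge_fn, judge_name):
    """Score each conversation once; emit a row whether or not the judge succeeded."""
    rows = []
    for cid in conversation_ids:
        row = {"conversation_id": cid, "judge": judge_name}
        try:
            row["humanness"] = judge_fn(cid)
            row["status"] = "ok"
        except Exception as exc:  # record the failure instead of dropping it
            row["humanness"] = None
            row["status"] = f"failed:{type(exc).__name__}"
        rows.append(row)
    return rows


def flaky_judge(cid):
    # Hypothetical judge that times out on one conversation.
    if cid == "c2":
        raise TimeoutError("judge did not answer")
    return 6


rows = judge_all(["c1", "c2", "c3"], flaky_judge, "demo-judge")
failure_rate = sum(r["humanness"] is None for r in rows) / len(rows)
print(len(rows), failure_rate)
```

Because the failed attempt is still a row, the failure rate can be computed from the output alone, and a degraded judge shows up as a high failure rate rather than as a quietly smaller sample.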

Example

python scripts/aspect_judge.py OUT/flow-transcripts-*.jsonl \
    --judge gemini --out H --judged-at 2026-06-22T14:00:00Z
python scripts/eval_db.py --ingest-humanness --dirs H

# who judged, how often, timeout rate, average quality
python scripts/eval_db.py --judges

Real output from the live quality.db:

# Judges — who scored, how often, timeout rate, avg quality

  (none/heuristic)             n=800   timed_out=0    avg_q=4.98
  alibaba/qwen3.7-plus         n=162   timed_out=0    avg_q=3.85
  gemini/gemini-2.5-flash      n=77    timed_out=0    avg_q=4.32
  codexjudge/codex             n=77    timed_out=0    avg_q=4.29
  _fallbackjudge/qwen3.7-plus->codex(fallback) n=40  timed_out=0  avg_q=3.63
  panel:gemma/ollama+codex/cli n=2     timed_out=1    avg_q=8.0

Note the spread of average quality by judge on overlapping work: qwen 3.85 vs gemini 4.32 vs codex 4.29 on the same kinds of transcripts. That ~0.5pt gap between judges is real and is why quality is only comparable within a judge column (the --timeline rescore rows make this explicit).

Canonicalization — joining judged to computed

When humanness rows are ingested, each is canonicalized against the matching computed row by conversation_id: run_id, run_ts, agent_model, and persona are copied from the computed row, so a judgment is attributed to the right flow version (run time, not judge time) and joins cleanly. The judge time is kept separately in details_json. A humanness row with no computed twin is skipped because it cannot be placed (Codex robustness #2).
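The join can be sketched with plain dicts. The copied field names and the skip rule follow the paragraph above; the function `canonicalize`, the `judged_at` key, and the exact layout of details_json are assumptions for illustration, not the ingest code in scripts/eval_db.py.

```python
import json

COPIED_FIELDS = ("run_id", "run_ts", "agent_model", "persona")


def canonicalize(humanness_rows, computed_rows):
    """Attach run metadata from computed rows; skip rows with no computed twin."""
    computed_by_id = {r["conversation_id"]: r for r in computed_rows}
    kept, skipped = [], []
    for row in humanness_rows:
        twin = computed_by_id.get(row["conversation_id"])
        if twin is None:
            skipped.append(row)  # cannot be placed on any run
            continue
        out = {k: v for k, v in row.items() if k != "judged_at"}
        for field in COPIED_FIELDS:
            out[field] = twin[field]  # run time, not judge time
        # Keep the judge time, but out of the way of the join columns.
        out["details_json"] = json.dumps({"judged_at": row.get("judged_at")})
        kept.append(out)
    return kept, skipped


computed = [{"conversation_id": "c1", "run_id": "r9", "run_ts": "2026-06-21T09:00:00Z",
             "agent_model": "agent-a", "persona": "caller-1"}]
judged = [
    {"conversation_id": "c1", "humanness": 6, "judged_at": "2026-06-22T14:00:00Z"},
    {"conversation_id": "orphan", "humanness": 3, "judged_at": "2026-06-22T14:00:00Z"},
]
kept, skipped = canonicalize(judged, computed)
print(len(kept), len(skipped))
```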

Where it fits

Judges feed the humanness column of the evaluation matrix. They run nightly (the per-commit gate is judge-free by design), and the full panel is driven by scripts/full_calibration.sh — 20 flows × 2 agents, computed + per-state, then humanness × 3 judges on the same transcripts.

Source: scripts/aspect_judge.py, riff/evaluation/codex_judge.py, riff/evaluation/claude_code_judge.py, scripts/full_calibration.sh.