2026-06-24
Judge-the-Judge & Measurement Discipline
Never trust a verdict you can't cross-check. A panel of independent judges, a deterministic guard underneath, and repeat-N before any claim — because a single run lies, and the eval tooling keeps attributing failures to the wrong cause.
Every layer of the RIFF evaluator produces a verdict: the per-state cert says PASS/FAIL, the LLM judge scores humanness, a transcript scanner flags a defect. This page is about the uncomfortable thing we learned: those verdicts are wrong often enough that you must audit them — and the failure is almost always the same shape, a verdict attributed to the wrong cause.
Layer 1 — the mechanical guard (free, deterministic)
Before any LLM, cross-check every recorded verdict against the actual transcript. The bug that
started this: the bench reported gemini FAIL on 5 states that were actually
untested — the caller front-loaded every slot, so those states never ran, and the
code collapsed untested into FAIL.
The guard recomputes state-visit counts straight from the transcript turns and flags any recorded
status that disagrees — untested-but-visited, or pass/fail-but-never-visited.
It runs automatically at the end of every bench run. Tool: scripts/audit_bench.py.
agent:model, transcripts by the short model). The guard
has to be auditable too; that recursion is the point.
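A minimal sketch of the cross-check described above, not the actual contents of scripts/audit_bench.py. It assumes each transcript turn is a dict with a `state` key and that recorded statuses are the strings `pass`, `fail`, or `untested`; the function name, the data shapes, and the example state names are all hypothetical.

```python
from collections import Counter

def audit_verdicts(recorded, transcript_turns):
    """Return one (state, status, visits, problem) tuple per disagreement."""
    # Recompute visit counts from the transcript itself, ignoring recorded data.
    visits = Counter(t["state"] for t in transcript_turns if t.get("state"))
    findings = []
    for state, status in recorded.items():
        n = visits.get(state, 0)
        if status == "untested" and n > 0:
            findings.append((state, status, n, "untested-but-visited"))
        elif status in ("pass", "fail") and n == 0:
            findings.append((state, status, n, "pass/fail-but-never-visited"))
    return findings

# Hypothetical run: the caller never reached confirm_address,
# yet the bench recorded it as a failure.
turns = [{"state": "greet"}, {"state": "collect_order"}, {"state": "collect_order"}]
recorded = {"greet": "pass", "collect_order": "untested", "confirm_address": "fail"}

findings = audit_verdicts(recorded, turns)
for finding in findings:
    print(finding)
```

Because the guard reads only the transcript and the recorded statuses, it needs no model call, which is what keeps it free and deterministic.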
Layer 2 — judge-the-judge, a fleet of independent judges
After an LLM-as-judge run, re-judge a random ~10% sample of conversations with a different model than the one that produced the score. A judge from another lineage breaks correlated bias; the disagreement rate is the trust signal.
gemini-2.5-pro) with the flow text
disagreed with correctness=10 — but it was a false positive (the slot it said
was missing had been correctly inferred). Codex/GPT agreed, the deterministic metric agreed,
and the human ruled it fine. A single powerful judge can be wrong; that's why the
audit is a panel, not a new authority. Run it as a separate process:
RIFF_JUDGE_AUDIT_RATE=0.10.
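The sampling and the disagreement rate can be sketched as follows. This is an illustration under assumptions, not the RIFF implementation: the function names, the score dictionaries keyed by conversation id, and the 1-point tolerance for calling two scores a disagreement are all hypothetical. The `rate` argument plays the role of RIFF_JUDGE_AUDIT_RATE.

```python
import random

def sample_for_audit(conversation_ids, rate=0.10, seed=0):
    """Pick a reproducible random sample of conversations to re-judge."""
    ids = list(conversation_ids)
    k = max(1, round(len(ids) * rate))   # always audit at least one
    return random.Random(seed).sample(ids, k)

def disagreement_rate(primary, audit, tolerance=1.0):
    """Fraction of audited conversations where the two judges differ
    by more than `tolerance` points (the tolerance is an assumption)."""
    shared = [cid for cid in audit if cid in primary]
    if not shared:
        return 0.0
    differing = sum(abs(primary[c] - audit[c]) > tolerance for c in shared)
    return differing / len(shared)

# Hypothetical scores: the primary judge gave every conversation 8.0,
# and the audit judge dissents on exactly one sampled conversation.
ids = [f"conv-{i}" for i in range(50)]
sampled = sample_for_audit(ids, rate=0.10)
primary = {cid: 8.0 for cid in ids}
audit = {cid: 8.0 for cid in sampled}
audit[sampled[0]] = 4.0

rate = disagreement_rate(primary, audit)
print(len(sampled), rate)
```

A disagreement here is a prompt to look closer, not a verdict: as the example above shows, the dissenting judge can itself be the one that is wrong.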
The scanner that over-flagged 57%, and the gate that fixed it
The deterministic p4-stale-ask scanner flag appeared 324× historically.
Live tracing showed it conflated two different situations under one code:
| Situation | Example | Verdict |
|---|---|---|
| Re-asks an already-filled slot | pizza: "How many?" when quantity=2 | real defect |
| Asks for an unfilled slot, doesn't echo others | dog_grooming: re-asks an invalid phone | false positive |
The fix is a collect-complete gate: flag only when no required slot is still
missing. Data source, best-first: per-turn missing_required_slots telemetry (stamped by
the eval driver) → load the flow and compute it → legacy fallback (never go blind). Live A/B across
flows: p4-stale-ask 7 → 3, so 4 of 7 flags (57%) were false positives and every real defect was kept.
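The gate and its best-first data lookup might look like the sketch below. Only the `missing_required_slots` telemetry field comes from the text above; the `filled_slots` and `required_slots` keys, the function names, and the turn and flow shapes are assumptions made for illustration.

```python
def missing_required_slots(turn, flow=None):
    """Best-first lookup of the slots still missing; None if unknown."""
    if "missing_required_slots" in turn:       # 1. per-turn telemetry
        return list(turn["missing_required_slots"])
    if flow is not None:                       # 2. compute it from the flow
        filled = turn.get("filled_slots", {})
        return [s for s in flow["required_slots"] if filled.get(s) is None]
    return None                                # 3. nothing to go on

def gated_stale_ask(turn, legacy_flag, flow=None):
    """Keep a legacy p4-stale-ask flag only once collection is complete."""
    if not legacy_flag:
        return False
    missing = missing_required_slots(turn, flow)
    if missing is None:
        return True           # legacy fallback: never go blind
    return len(missing) == 0  # flag only when no required slot is missing

# Re-asking "How many?" with every slot already filled: real defect.
pizza_turn = {"missing_required_slots": []}
# Re-asking an invalid phone: the slot is still missing, so not stale.
grooming_turn = {"missing_required_slots": ["phone"]}

print(gated_stale_ask(pizza_turn, legacy_flag=True))      # True
print(gated_stale_ask(grooming_turn, legacy_flag=True))   # False
```

When neither telemetry nor the flow is available, the sketch keeps the legacy flag, so a missing data source can add false positives but cannot hide a real defect.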
The discipline: repeat-N, because one run lies
RIFF eval is judge-noise-dominated (σ ≈ 1.5 pts single-run). A candidate fix for the remaining
real defects looked like a win on one A/B (6 → 5). Repeat-N told the truth:
OFF: [2, 4, 3, 2] mean 2.75
ON: [4, 4, 4, 2] mean 3.50 # the "fix" is WORSE
per-run: ON beat OFF in 0/4 pairs # single-run 6->5 was noise pointing the wrong way
So the candidate was refuted and reverted. The rule: never flip a default on a
single-run signal; run --repeat N (N ≥ 5) before any claim. See
Methodology & lessons for the noise floor.
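The arithmetic behind that comparison can be reproduced in a few lines. The sketch uses the four OFF/ON defect counts listed above and reads lower as better, which is why a higher ON mean counts as worse; the function name and the returned dictionary are hypothetical.

```python
from statistics import mean

def compare_repeats(off, on):
    """Summarise paired repeat runs where a lower defect count is better."""
    if len(off) != len(on):
        raise ValueError("runs must be paired one-to-one")
    # ON "beats" OFF in a pair only if it produced strictly fewer defects.
    on_wins = sum(b < a for a, b in zip(off, on))
    return {
        "mean_off": mean(off),
        "mean_on": mean(on),
        "on_wins": on_wins,
        "pairs": len(off),
    }

off = [2, 4, 3, 2]
on = [4, 4, 4, 2]
summary = compare_repeats(off, on)
print(summary)
# {'mean_off': 2.75, 'mean_on': 3.5, 'on_wins': 0, 'pairs': 4}
```

Reporting the per-pair win count next to the means guards against a single outlier run driving the average.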
Evidence lives in the database
The eval DB tracked runs / scores / aspect_scores / state_scores but had no home for
lint metrics. A lint_findings table now records the before/after for scanner changes, so
"better than before" is a query, not a memory:
SELECT sum(count_legacy), sum(count_gated) FROM lint_findings WHERE code='p4-stale-ask';
-- 7 -> 3 (57% of p4 flags were false positives)
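To show how that query turns into a false-positive share, here is a self-contained sketch using Python's built-in `sqlite3` with an in-memory database. Only the table name and the `code`, `count_legacy`, and `count_gated` columns come from the query above; the rest of the schema is an assumption, and the single row simply restates the 7 → 3 totals rather than any real per-flow breakdown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema: the real lint_findings table likely has more columns.
conn.execute(
    "CREATE TABLE lint_findings ("
    " code TEXT NOT NULL,"
    " count_legacy INTEGER NOT NULL,"
    " count_gated INTEGER NOT NULL)"
)
# One aggregate row restating the totals reported in the text.
conn.execute(
    "INSERT INTO lint_findings (code, count_legacy, count_gated) VALUES (?, ?, ?)",
    ("p4-stale-ask", 7, 3),
)

legacy, gated = conn.execute(
    "SELECT sum(count_legacy), sum(count_gated)"
    " FROM lint_findings WHERE code = ?",
    ("p4-stale-ask",),
).fetchone()

# Share of legacy flags that the gate removed.
removed_pct = round(100 * (legacy - gated) / legacy)
print(legacy, gated, removed_pct)   # 7 3 57
conn.close()
```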