New · 2026-06-24

Judge-the-Judge & Measurement Discipline

Never trust a verdict you can't cross-check. A panel of independent judges, a deterministic guard underneath, and repeat-N before any claim — because a single run lies, and the eval tooling keeps attributing failures to the wrong cause.

Every layer of the RIFF evaluator produces a verdict: the per-state cert says PASS/FAIL, the LLM judge scores humanness, a transcript scanner flags a defect. This page is about the uncomfortable thing we learned: those verdicts are wrong often enough that you must audit them — and the failure is almost always the same shape, a verdict attributed to the wrong cause.

Four times in one session, the same disease: a per-state routing "win" (a model gap that was really a flow-rule bug) → a transcript attributing a stumble to the wrong state → a scanner conflating two different situations under one code → a single-run A/B that pointed the opposite direction from the truth. Same root cause every time: the signal was at the per-turn runtime level, and the tooling read it one level too coarse.

Layer 1 — the mechanical guard (free, deterministic)

Before any LLM, cross-check every recorded verdict against the actual transcript. The bug that started this: the bench reported gemini FAIL on 5 states that were actually untested — the caller front-loaded every slot, so those states never ran, and the code collapsed untested into FAIL.

The guard recomputes state-visit counts straight from the transcript turns and flags any recorded status that disagrees — untested-but-visited, or pass/fail-but-never-visited. It runs automatically at the end of every bench run. Tool: scripts/audit_bench.py.
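The cross-check described above can be sketched in a few lines. This is a minimal illustration, not the contents of scripts/audit_bench.py: the record shapes (a `state` field on each turn, lowercase status strings) and the function name `audit_statuses` are assumptions made for the example.

```python
from collections import Counter

def audit_statuses(recorded, turns):
    """Flag every recorded status that the transcript contradicts.

    recorded: {state_name: "pass" | "fail" | "untested"}  (assumed shape)
    turns:    list of dicts, each with a "state" key       (assumed shape)
    """
    # Recompute visit counts straight from the transcript turns.
    visits = Counter(t["state"] for t in turns if t.get("state"))
    mismatches = []
    for state, status in recorded.items():
        n = visits.get(state, 0)
        if status == "untested" and n > 0:
            mismatches.append((state, status, n, "untested-but-visited"))
        elif status in ("pass", "fail") and n == 0:
            mismatches.append((state, status, n, "pass/fail-but-never-visited"))
    return mismatches

# Hypothetical run: the caller front-loaded every slot, so collect_size never ran.
turns = [
    {"state": "greet", "text": "Hi, what can I get you?"},
    {"state": "confirm", "text": "Two large pizzas, correct?"},
]
recorded = {"greet": "pass", "collect_size": "fail", "confirm": "untested"}

mismatches = audit_statuses(recorded, turns)
for row in mismatches:
    print(row)
```

Here `collect_size` is reported as FAIL although no turn ever reached it, which is the untested-collapsed-into-FAIL case that motivated the guard.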

It caught its own bug first. The audit's first run flagged 3 "mismatches", which turned out to be the audit tool's own keying bug (records are keyed by agent:model, transcripts by the short model name). The guard has to be auditable too; that recursion is the point.

Layer 2 — judge-the-judge, a fleet of independent judges

After an LLM-as-judge run, re-judge a random ~10% sample of conversations with a different model from the one that produced the score. A judge from another lineage breaks correlated bias; the disagreement rate is the trust signal.

[Diagram: a recorded verdict (e.g. correctness=10) goes to an audit agent (a random judge model, Gemini · Qwen · Codex/GPT, crossed with a random flow and the flow's actual text), which reads the transcript. AGREE: the verdict holds. DISAGREE: a human reviews; the human is the final arbiter.]

The fleet: each audit agent pairs a randomly selected judge model (Gemini / Qwen / Codex CLI) with a randomly selected flow and reads that flow's actual definition. It flags disagreements for a human; it never overturns a score itself.

The panel corrected the panel. On one conversation, a powerful judge (gemini-2.5-pro) given the flow text disagreed with correctness=10, but that was a false positive (the slot it said was missing had been correctly inferred). Codex/GPT agreed with the score, the deterministic metric agreed, and the human ruled it fine. A single powerful judge can be wrong; that's why the audit is a panel, not a new authority. Run it as a separate process: RIFF_JUDGE_AUDIT_RATE=0.10.
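The sampling and routing logic behind the audit might look like the sketch below. Only the RIFF_JUDGE_AUDIT_RATE variable and the three judge lineages come from this page; the function names, the lineage labels, and the verdict representation are hypothetical.

```python
import os
import random

JUDGE_LINEAGES = ["gemini", "qwen", "codex"]  # labels are illustrative

def pick_audit_sample(conversation_ids, rate, rng):
    """Choose roughly `rate` of the conversations, at least one if any exist."""
    if not conversation_ids:
        return []
    k = min(len(conversation_ids), max(1, round(len(conversation_ids) * rate)))
    return rng.sample(conversation_ids, k)

def pick_auditor(original_lineage, rng):
    """Pick a judge from a different lineage than the one that scored the run."""
    return rng.choice([j for j in JUDGE_LINEAGES if j != original_lineage])

def route(recorded_verdict, audit_verdict):
    """The audit never overturns a score; a disagreement only goes to a human."""
    return "verdict holds" if recorded_verdict == audit_verdict else "human review"

def disagreement_rate(pairs):
    """Fraction of (recorded, audited) pairs that differ: the trust signal."""
    return sum(1 for r, a in pairs if r != a) / len(pairs) if pairs else 0.0

rate = float(os.environ.get("RIFF_JUDGE_AUDIT_RATE", "0.10"))
rng = random.Random(0)  # seeded only so the example is reproducible
ids = [f"conv-{i}" for i in range(50)]
sample = pick_audit_sample(ids, rate, rng)
auditor = pick_auditor("gemini", rng)
print(len(sample), auditor, route(10, 7))
```

Keeping `route` free of any write path to the score is the design point: the auditor can raise a flag, but only the human arbiter changes a verdict.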

The scanner that over-flagged 57%, and the gate that fixed it

The deterministic p4-stale-ask scanner code had fired 324× historically. Live tracing showed it conflated two different things under one code:

Situation | Example | Verdict
Re-asks an already-filled slot | pizza: "How many?" when quantity=2 | real defect
Asks for an unfilled slot, doesn't echo others | dog_grooming: re-asks an invalid phone | false positive

The fix is a collect-complete gate: flag only when no required slot is still missing. Data source, best-first: per-turn missing_required_slots telemetry (stamped by the eval driver) → load the flow and compute it → legacy fallback (never go blind). Live A/B across flows: p4-stale-ask 7 → 3 — 57% were false positives, every real defect kept.
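A minimal sketch of the gate and its best-first data sources follows. The `missing_required_slots` field is named on this page; everything else (the `legacy_p4_flag` key, the `slots` and `required_slots` shapes, the function names) is assumed for illustration.

```python
def missing_slots(turn, flow=None):
    """Best-first lookup of still-missing required slots; None if unknown."""
    if "missing_required_slots" in turn:      # 1. per-turn telemetry
        return set(turn["missing_required_slots"])
    if flow is not None:                      # 2. compute from the flow definition
        filled = {k for k, v in turn.get("slots", {}).items() if v is not None}
        return set(flow["required_slots"]) - filled
    return None                               # 3. nothing to go on

def is_stale_ask(turn, flow=None):
    """Keep a legacy p4-stale-ask flag only when collection is complete."""
    if not turn.get("legacy_p4_flag"):
        return False
    missing = missing_slots(turn, flow)
    if missing is None:
        return True  # legacy fallback: keep the old flag rather than go blind
    return not missing

# Hypothetical turns mirroring the two rows of the table above.
pizza = {"legacy_p4_flag": True, "missing_required_slots": []}
grooming = {"legacy_p4_flag": True, "missing_required_slots": ["phone"]}
print(is_stale_ask(pizza), is_stale_ask(grooming))
```

The pizza turn stays flagged because nothing is left to collect; the dog_grooming turn is dropped because the phone slot is still legitimately missing.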

The discipline: repeat-N, because one run lies

RIFF eval is judge-noise-dominated (σ ≈ 1.5 pts single-run). A candidate fix for the remaining real defects looked like a win on one A/B (6 → 5). Repeat-N told the truth:

OFF: [2, 4, 3, 2]  mean 2.75
ON:  [4, 4, 4, 2]  mean 3.50      # the "fix" is WORSE
per-run: ON beat OFF in 0/4 pairs   # single-run 6->5 was noise pointing the wrong way

So the candidate was refuted and reverted. The rule: never flip a default on single-run signal; --repeat N (≥5) before any claim. See Methodology & lessons for the noise floor.
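The repeat-N comparison is simple arithmetic and can be reproduced from the numbers above. The sketch assumes the runs are paired by index (run 1 OFF against run 1 ON, and so on) and that a lower defect count is better; the helper name `compare` is illustrative.

```python
from statistics import mean

off = [2, 4, 3, 2]  # p4-stale-ask counts with the candidate fix OFF
on = [4, 4, 4, 2]   # the same runs with the candidate fix ON

def compare(off_runs, on_runs):
    """Return (mean OFF, mean ON, number of pairs where ON is strictly better)."""
    on_wins = sum(1 for a, b in zip(off_runs, on_runs) if b < a)
    return mean(off_runs), mean(on_runs), on_wins

off_mean, on_mean, on_wins = compare(off, on)
print(f"OFF mean {off_mean:.2f}, ON mean {on_mean:.2f}, ON wins {on_wins}/{len(off)}")
```

Reporting the per-pair win count next to the means guards against a single outlier run carrying the average.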

Evidence lives in the database

The eval DB tracked runs / scores / aspect_scores / state_scores but had no home for lint metrics. A lint_findings table now records the before/after for scanner changes, so "better than before" is a query, not a memory:

SELECT sum(count_legacy), sum(count_gated) FROM lint_findings WHERE code='p4-stale-ask';
-- 7 -> 3   (57% of p4 flags were false positives)
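For experimenting with the query outside the eval DB, an in-memory SQLite table is enough. The schema below is a guess: only the `code`, `count_legacy`, and `count_gated` columns appear in the query above, and the per-flow split of the counts is invented so that the totals match the reported 7 and 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE lint_findings (
           run_id TEXT,            -- hypothetical column
           flow TEXT,              -- hypothetical column
           code TEXT,
           count_legacy INTEGER,   -- flags raised by the old scanner
           count_gated INTEGER     -- flags that survive the gate
       )"""
)
# Illustrative rows only; the real per-flow breakdown is not given on this page.
rows = [
    ("run-1", "pizza", "p4-stale-ask", 3, 2),
    ("run-1", "dog_grooming", "p4-stale-ask", 4, 1),
]
conn.executemany("INSERT INTO lint_findings VALUES (?, ?, ?, ?, ?)", rows)

legacy, gated = conn.execute(
    "SELECT sum(count_legacy), sum(count_gated) FROM lint_findings WHERE code = ?",
    ("p4-stale-ask",),
).fetchone()
false_positive_share = (legacy - gated) / legacy
print(legacy, gated, f"{false_positive_share:.0%}")
```

With totals of 7 and 3, the removed share is 4/7, which rounds to the 57% quoted above.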