Methodology & Hard-Won Lessons

The whole system is shaped by a handful of painful discoveries. Each one is a trap that cost real time before it became a rule. Internalize these and the rest of the design explains itself.

1. The eval noise floor is ~1.5 points — and it's judge-dominated

Run the same transcripts through two judges and you get a Pearson correlation of ~0.92 (rankings agree) but a mean absolute error of ~0.95pt, and the single-run standard deviation is σ ≈ 1.48. In plain terms: any difference between two quality scores smaller than ~1.5pt is noise, and most of that noise comes from the judge, not the flow.
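To see how two judges can agree on ranking while still disagreeing on absolute score, here is a minimal sketch. The scores are made up for illustration (they are not the session's data), and the helper names pearson and mean_abs_error are hypothetical.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def mean_abs_error(xs, ys):
    """Mean absolute difference between paired scores."""
    return mean([abs(x - y) for x, y in zip(xs, ys)])

# Illustrative scores only: judge_b ranks the transcripts identically
# to judge_a but sits a full point higher on every one of them.
judge_a = [2.0, 3.0, 3.5, 4.0, 4.5]
judge_b = [3.0, 4.0, 4.5, 5.0, 5.5]

r = pearson(judge_a, judge_b)
mae = mean_abs_error(judge_a, judge_b)
print(f"pearson={r:.2f} mae={mae:.2f}")
```

A high correlation with a large absolute error is exactly the pattern described above: trust the ordering, not the raw number.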

Consequence: repeat-N, and check deltas against the noise floor. Never claim a regression from a single run. Combine ≥2 judges by taking the median to shrink the floor, run --repeat N to measure within-cell variance, and compare any delta to σ before believing it. Rankings are reliable; sub-1pt deltas are not.
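The rule above can be sketched as a small check. This is an illustration under assumptions, not the project's implementation: the data shape (a dict mapping judge name to a list of repeat scores) and the names cell_score and classify_delta are hypothetical, and comparing against the single-run σ is a deliberately conservative choice.

```python
from statistics import mean, median

NOISE_SIGMA = 1.48  # single-run standard deviation quoted in the text

def cell_score(runs_by_judge):
    """Median across judges of each judge's mean over its repeats."""
    per_judge = [mean(runs) for runs in runs_by_judge.values() if runs]
    if len(per_judge) < 2:
        raise ValueError("need scores from at least two judges")
    return median(per_judge)

def classify_delta(before, after, sigma=NOISE_SIGMA):
    """Return (delta, verdict); only deltas beyond sigma count as candidates."""
    delta = cell_score(after) - cell_score(before)
    verdict = "candidate signal" if abs(delta) > sigma else "within noise"
    return delta, verdict

# Illustrative repeat-3 scores for one cell, two judges (made-up numbers).
before = {"qwen": [4.0, 4.2, 4.4], "gemini": [4.4, 4.6, 4.8]}
after = {"qwen": [3.9, 4.1, 4.0], "gemini": [4.3, 4.5, 4.4]}

delta, verdict = classify_delta(before, after)
print(f"delta={delta:+.2f} -> {verdict}")
```

Even a "candidate signal" verdict is only a prompt to re-run, not proof of a regression.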

This is why the architecture pushes everything it can to computed aspects (zero noise) and isolates the LLM to one axis — humanness — where the noise is unavoidable. It is also why the CI gate is judge-free: gating on a ±1.5pt number would produce flaky red builds.

2. Harden the instrument — measurement flaws masquerade as subject flaws

The single most repeated mistake this session: a broken measuring tool produced a score crater that looked exactly like a broken flow. Every one of these wasted time chasing a flow bug that didn't exist:

| The instrument fault | What it looked like |
|---|---|
| The Claude API agent 400s in this environment | flows “regressing”; actually errored conversations scored as failures |
| An absent/degraded backend (calendar provider) | apartment_scheduler “broken”; actually it just couldn't reach its payoff (see backend mocks) |
| A persona with an invalid 7-digit phone number | dog_grooming scoring 3.3; actually the persona supplied un-validatable data |
| A fallback-exhausted qwen judge | a humanness crater; actually the judge 403'd |

The defenses are now structural: errored conversations are excluded from ingest rather than scored as zero (Codex robustness #3); the broken Claude agent was dropped from the calibration panel; the invalid persona phone was normalized (commit 5eaa022); and judges emit every attempt, including failures, so a degraded judge can't bias the median. Rule: before believing a subject is broken, prove the instrument is sound.
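Two of these defenses can be sketched in a few lines. The record fields (error, status, score) and the function names below are hypothetical stand-ins chosen for illustration; the real ingest schema is not shown in this section.

```python
from statistics import median

def ingestible(conversations):
    """Keep only conversations that completed; errored ones are dropped,
    not scored as zero."""
    return [c for c in conversations if c.get("error") is None]

def panel_median(attempts):
    """Median over successful judge attempts. Failed attempts stay in the
    log for visibility but contribute no score."""
    ok_scores = [a["score"] for a in attempts if a["status"] == "ok"]
    return median(ok_scores) if ok_scores else None

# Made-up records for illustration.
conversations = [
    {"id": "c1", "error": None},
    {"id": "c2", "error": "HTTP 400"},  # agent errored: exclude from ingest
    {"id": "c3", "error": None},
]
attempts = [
    {"judge": "qwen", "status": "error", "score": None},  # e.g. a 403
    {"judge": "gemini", "status": "ok", "score": 4.2},
    {"judge": "third_judge", "status": "ok", "score": 3.8},
]

print([c["id"] for c in ingestible(conversations)])
print(panel_median(attempts))
```

Returning None when every attempt failed keeps "no measurement" distinct from "measured zero", which is the point of the rule.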

3. Per-persona decomposition — an aggregate hides one broken persona

A flow's mean quality can look mediocre because one persona craters while the rest are robust — and the mean blends them into a flat “meh.” You will tune the wrong thing. The fix is to never trust the aggregate alone: decompose by persona (the scores_by_persona map), and by state (the per-state layer), so the one broken cell is visible. The same principle scales up: group cohesion exists because a group's mean hides one thrashing phase, and archetypes exist because a flow's mean hides a systemic pattern. At every level, an average can launder a localized failure — always keep the decomposition.
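A minimal sketch of the decomposition, assuming scores_by_persona is a plain mapping from persona name to mean score. The persona names, the numbers, and the 1.5pt gap are invented for illustration, and find_craters is a hypothetical helper.

```python
from statistics import mean, median

def find_craters(scores_by_persona, gap=1.5):
    """Return personas scoring more than `gap` below the median persona."""
    typical = median(scores_by_persona.values())
    return sorted(p for p, s in scores_by_persona.items() if s < typical - gap)

# Made-up scores: three robust personas and one that craters.
scores_by_persona = {
    "busy_parent": 4.5,
    "terse_texter": 4.4,
    "skeptic": 4.6,
    "bad_phone": 1.7,
}

overall = mean(list(scores_by_persona.values()))
print(f"aggregate mean: {overall:.2f}")  # looks like a flat, mediocre flow
print("craters:", find_craters(scores_by_persona))
```

Comparing against the median rather than the mean keeps the broken persona from dragging down its own baseline.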

4. Holdout discipline — never debug the held-out set

The leave-out split is only a generalization signal while it stays untouched. The instant you read a held-out transcript to fix it, that persona becomes a dev persona and the guard silently dies — giving you false confidence that a memorized fix generalized. You act on dev; you gate on held-out; you never inspect held-out while fixing. A dev-only lift is flagged ⚠️ OVERFIT on purpose.
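The gating logic can be sketched as follows. The function name, the return labels other than ⚠️ OVERFIT, and the min_lift threshold are assumptions for illustration; only the dev-only-lift flag comes from the text.

```python
def classify_lift(dev_delta, heldout_delta, min_lift=1.5):
    """Label a before/after change using dev and held-out deltas.

    Only aggregate held-out deltas are consumed here; held-out transcripts
    themselves are never read while fixing.
    """
    dev_up = dev_delta > min_lift
    heldout_up = heldout_delta > min_lift
    if dev_up and heldout_up:
        return "generalizes"
    if dev_up:
        return "⚠️ OVERFIT"  # lift on dev personas only
    if heldout_up:
        return "held-out only: recheck against noise"
    return "no lift"

print(classify_lift(dev_delta=2.1, heldout_delta=0.2))
print(classify_lift(dev_delta=2.1, heldout_delta=1.9))
```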

5. Computed vs judged — the cost/noise split is the spine

The clean separation isn't just tidy; it's load-bearing. Computed aspects are deterministic, free, and run per commit. The judged aspect is noisy, costly, and runs nightly. Conflating them gives you the worst of both: flaky gates and expensive per-commit runs. Keeping them separate gives a cheap deterministic gate that's trustworthy and an expensive judged matrix you run only when it pays.

| | Computed | Judged |
|---|---|---|
| noise | none | ~1.5pt |
| cost | free | API $ |
| cadence | every commit (CI gate) | nightly (judge panel) |
| gate? | yes, blocking | no, human review |

6. Quality is only comparable within a judge, and within a flow-set

Two subtler traps the timeline view defends against. First, judge: the same transcripts score differently under different judges (qwen ~3.85 vs gemini ~4.32 on overlapping work), so a quality trend is only meaningful within one judge column — the rescore / xjudge rows make the cross-judge spread visible as exactly that, noise. Second, flow-set composition: a 6-flow re-eval is not comparable to a 60-flow sweep; mixing them reads composition as quality. The --trend view only shows a delta between runs with the same flow set for this reason.
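The comparability guard might look like the sketch below. It is not the actual --trend implementation: the run record fields (judge, flows, quality) and the function names are hypothetical.

```python
def comparable(run_a, run_b):
    """Runs are comparable only with the same judge and the same flow set."""
    same_judge = run_a["judge"] == run_b["judge"]
    same_flows = set(run_a["flows"]) == set(run_b["flows"])
    return same_judge and same_flows

def trend_delta(prev, curr):
    """Quality delta between two runs, or None when they are not comparable."""
    if not comparable(prev, curr):
        return None
    return curr["quality"] - prev["quality"]

# Made-up runs for illustration.
run_1 = {"judge": "qwen", "flows": ["dog_grooming", "coffee_split"], "quality": 3.9}
run_2 = {"judge": "qwen", "flows": ["coffee_split", "dog_grooming"], "quality": 4.1}
run_3 = {"judge": "gemini", "flows": ["dog_grooming", "coffee_split"], "quality": 4.3}

print(trend_delta(run_1, run_2))  # same judge, same flow set: a delta is shown
print(trend_delta(run_2, run_3))  # different judge: no delta
```

Returning None instead of a number makes it impossible to plot a cross-judge or cross-composition delta by accident.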

7. Tie scores to changes — the three clocks

To answer “did the change at time T move the score?” you need three separate clocks: the flow's edit time (flow_mods, synced from git), the conversation's run time (run_ts), and the judgment's judge time (judged_at). Aligning edit → run → judge against the noise floor is the only honest way to attribute a score shift to a fix rather than to noise. A jump right after an edit is plausibly the edit; a jump with no edit is noise.
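A sketch of the alignment, assuming flow_mods is a list of edit timestamps and that run_ts and judged_at are datetimes. The function name, the verdict strings, and the example dates are invented for illustration.

```python
from datetime import datetime

NOISE_SIGMA = 1.48  # noise floor from lesson 1

def attribute_shift(delta, flow_mods, prev_run_ts, run_ts, judged_at,
                    sigma=NOISE_SIGMA):
    """Attribute a score shift using the edit, run, and judge clocks."""
    if judged_at < run_ts:
        raise ValueError("judgment predates the run it scores")
    if abs(delta) <= sigma:
        return "within noise floor"
    # Edits that landed between the previous run and this one.
    edits = [t for t in flow_mods if prev_run_ts < t <= run_ts]
    return "plausibly the edit" if edits else "no edit: treat as noise"

# Made-up timeline: an edit lands between two nightly runs.
prev_run = datetime(2024, 1, 1, 2, 0)
edit = datetime(2024, 1, 1, 15, 0)
this_run = datetime(2024, 1, 2, 2, 0)
judged = datetime(2024, 1, 2, 3, 0)

print(attribute_shift(2.0, [edit], prev_run, this_run, judged))
print(attribute_shift(2.0, [], prev_run, this_run, judged))
```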

A worked cautionary tale from this session

The repeat-3 before/after sweep measured the session's cumulative work at overall quality 4.3 → 4.1 — an apparent small regression. Decomposition (lesson 3) localized it: two flows dropped ~1.1pt and their only change was adding a name slot type, while coffee_split dropped 1.5pt with no flow change at all — i.e. runtime noise or an extractor side-effect. The honest reading, against the σ≈1.48 floor (lesson 1), is “small, possibly real, needs a HEAD-vs-HEAD re-run to separate signal from noise” — not “the session broke quality.” That is the methodology in action: decompose, attribute, and measure against the floor before you believe a delta.

The summary rule

Before you believe a number, ask: is the instrument sound, is the delta bigger than the noise floor, and have I looked under the aggregate? Almost every wasted hour this session came from skipping one of those three checks. The entire four-tier hierarchy, the computed/judged split, the holdout guard, and the replay tools exist to make those checks cheap and automatic.