Methodology & Hard-Won Lessons
The whole system is shaped by a handful of painful discoveries. Each one is a trap that cost real time before it became a rule. Internalize these and the rest of the design explains itself.
1. The eval noise floor is ~1.5 points — and it's judge-dominated
Run the same transcripts through two judges and you get a Pearson correlation of ~0.92 (the judges largely agree on ordering) but a mean absolute error of ~0.95pt, and the single-run standard deviation is σ ≈ 1.48. In plain terms: any quality delta smaller than ~1.5pt is indistinguishable from noise, and most of that noise comes from the judge, not the flow.
Use --repeat N to measure within-cell variance, and compare any delta to σ before believing it.
Rankings are reliable; sub-1pt deltas are not.
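The gap between "rankings agree" and "absolute scores disagree" is easy to reproduce with a toy example. The sketch below uses made-up scores and two hypothetical helpers (`pearson`, `mean_abs_error`); it is not the project's calibration code, only an illustration of how a constant judge offset leaves correlation perfect while the absolute error stays large.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_abs_error(xs, ys):
    """Mean absolute difference between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Made-up data: judge_b is judge_a shifted up by a constant 0.9pt.
judge_a = [2.0, 2.5, 3.0, 3.5, 4.0]
judge_b = [a + 0.9 for a in judge_a]

r = pearson(judge_a, judge_b)           # ordering is identical
mae = mean_abs_error(judge_a, judge_b)  # yet every score differs by 0.9pt
print(f"pearson={r:.2f} mae={mae:.2f}")
```

A high correlation therefore says nothing about whether two judges' numbers can be compared directly, which is why deltas are only read within one judge.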
This is why the architecture pushes everything it can to computed aspects (zero noise) and isolates the LLM to one axis — humanness — where the noise is unavoidable. It is also why the CI gate is judge-free: gating on a ±1.5pt number would produce flaky red builds.
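The "compare any delta to σ" rule from this lesson can be sketched as follows. The sketch assumes repeated runs are independent and share the single-run σ ≈ 1.48 quoted above; the function name `delta_vs_floor`, the standard-error formula, and the sample scores are illustrative assumptions, not the tool's own implementation.

```python
from math import sqrt
from statistics import mean

SIGMA_SINGLE_RUN = 1.48  # single-run standard deviation quoted in this lesson


def delta_vs_floor(before, after, sigma=SIGMA_SINGLE_RUN):
    """Return (delta, standard error of the delta, delta in units of that error).

    Assumes each repeat is an independent draw with the same sigma.
    """
    delta = mean(after) - mean(before)
    se = sigma * sqrt(1 / len(before) + 1 / len(after))
    return delta, se, delta / se


# Made-up repeat-3 scores for one cell, before and after a change.
before = [4.0, 4.5, 3.5]
after = [3.0, 2.5, 3.2]

delta, se, ratio = delta_vs_floor(before, after)
print(f"delta={delta:+.2f} se={se:.2f} ratio={ratio:+.2f}")
```

Under these assumptions a 1.1pt drop at repeat 3 is smaller than one standard error of the difference, so it should not be read as a regression without more repeats.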
2. Harden the instrument — measurement flaws masquerade as subject flaws
The single most repeated mistake this session: a broken measuring tool produced a score crater that looked exactly like a broken flow. Every one of these wasted time chasing a flow bug that didn't exist:
| The instrument fault | What it looked like |
|---|---|
| The Claude API agent 400s in this environment | flows “regressing” — actually errored conversations scored as failures |
| An absent/degraded backend (calendar provider) | apartment_scheduler “broken” — actually it just couldn't reach its payoff (see backend mocks) |
| A persona with an invalid 7-digit phone number | dog_grooming scoring 3.3 — actually the persona supplied un-validatable data |
| A fallback-exhausted qwen judge | a humanness crater — actually the judge 403'd |
The defenses are now structural: errored conversations are excluded from ingest
rather than scored as zero (Codex robustness #3); the broken Claude agent was dropped from the calibration
panel; the invalid persona phone was normalized (commit 5eaa022); and judges emit every
attempt, including failures, so a degraded judge can't bias the median.
Rule: before believing a subject is broken, prove the instrument is sound.
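Two of the defenses above, excluding errored conversations and keeping failed judge attempts out of the median, can be sketched as below. The record shapes (`error`, `status`, `score` keys) and the helper names `ingest` and `judge_median` are assumptions made for illustration; the real ingest schema is not shown in this document.

```python
from statistics import mean, median


def ingest(conversations):
    """Keep only conversations that completed; errored ones are dropped, not scored as 0."""
    return [c for c in conversations if c.get("error") is None]


def judge_median(attempts):
    """Median over successful judge attempts, plus a count of failed attempts."""
    ok = [a["score"] for a in attempts if a["status"] == "ok"]
    failed = len(attempts) - len(ok)
    return (median(ok) if ok else None), failed


# Made-up data: one conversation errored before the agent could answer.
conversations = [
    {"id": "c1", "score": 4.5, "error": None},
    {"id": "c2", "score": 4.0, "error": None},
    {"id": "c3", "score": None, "error": "HTTP 400 from agent"},
]
naive = mean((c["score"] or 0.0) for c in conversations)  # error counted as a zero
clean = mean(c["score"] for c in ingest(conversations))   # error excluded

attempts = [
    {"status": "ok", "score": 4.0},
    {"status": "http_403", "score": None},
    {"status": "ok", "score": 4.4},
]
score, failed = judge_median(attempts)
print(f"naive={naive:.2f} clean={clean:.2f} judge={score} failed={failed}")
```

Scoring the errored conversation as zero drags the mean from 4.25 to about 2.83, which is exactly the kind of crater that looks like a broken flow.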
3. Per-persona decomposition — an aggregate hides one broken persona
A flow's mean quality can look mediocre because one persona craters while the rest are
robust — and the mean blends them into a flat “meh.” You will tune the wrong thing. The
fix is to never trust the aggregate alone: decompose by persona (the scores_by_persona
map), and by state (the per-state layer), so the one
broken cell is visible. The same principle scales up: group
cohesion exists because a group's mean hides one thrashing phase, and
archetypes exist because a flow's mean hides a systemic
pattern. At every level, an average can launder a localized failure — always keep the
decomposition.
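A minimal sketch of the persona decomposition follows. It assumes `scores_by_persona` maps a persona name to a list of scores; the document names the map but not its shape. The helper `broken_personas`, the 1.5pt gap, and the persona names are hypothetical.

```python
from statistics import mean


def broken_personas(scores_by_persona, gap=1.5):
    """Personas whose mean sits more than `gap` points below the mean of the others."""
    means = {p: mean(s) for p, s in scores_by_persona.items()}
    flagged = []
    for persona, m in means.items():
        others = [v for p, v in means.items() if p != persona]
        if others and mean(others) - m > gap:
            flagged.append(persona)
    return flagged


# Made-up scores: two robust personas and one that craters.
scores_by_persona = {
    "terse": [4.5, 4.4],
    "chatty": [4.6, 4.3],
    "bad_phone": [1.2, 1.0],
}

aggregate = mean(s for scores in scores_by_persona.values() for s in scores)
flagged = broken_personas(scores_by_persona)
print(f"aggregate={aggregate:.2f} flagged={flagged}")
```

The aggregate of about 3.33 reads as a mediocre flow, while the decomposition shows two healthy personas and one broken cell.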
4. Holdout discipline — never debug the held-out set
The leave-out split is only a generalization signal
while it stays untouched. The instant you read a held-out transcript to fix it, that persona becomes a
dev persona and the guard silently dies — giving you false confidence that a memorized fix
generalized. You act on dev; you gate on held-out; you never inspect held-out while
fixing. A dev-only lift is flagged ⚠️ OVERFIT on purpose.
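The dev-versus-held-out check can be sketched as below. Only aggregate held-out scores enter the function, never transcripts, which is the point of the discipline. The function name `classify_lift`, the labels other than OVERFIT, and the `min_lift` threshold are assumptions for illustration; the actual flagging rule is not specified here.

```python
from statistics import mean


def classify_lift(dev_before, dev_after, held_before, held_after, min_lift=1.5):
    """Label a change by whether its dev lift also shows up on the held-out set."""
    dev_lift = mean(dev_after) - mean(dev_before)
    held_lift = mean(held_after) - mean(held_before)
    if dev_lift < min_lift:
        return "NO_LIFT"
    # Dev improved; it only counts if held-out improved too.
    return "GENERALIZED" if held_lift >= min_lift else "OVERFIT"


# Made-up scores: the fix helps dev personas but barely moves held-out ones.
label = classify_lift(
    dev_before=[3.0, 3.2], dev_after=[4.8, 5.0],
    held_before=[3.0, 3.2], held_after=[3.1, 3.3],
)
print(label)
```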
5. Computed vs judged — the cost/noise split is the spine
The clean separation isn't just tidy; it's load-bearing. Computed aspects are deterministic, free, and run per commit. The judged aspect is noisy, costly, and runs nightly. Conflating them gives you the worst of both: flaky gates and expensive per-commit runs. Keeping them separate gives a cheap deterministic gate that's trustworthy and an expensive judged matrix you run only when it pays.
| | Computed | Judged |
|---|---|---|
| noise | none | ~1.5pt |
| cost | free | API $ |
| cadence | every commit (CI gate) | nightly (judge panel) |
| gate? | yes — blocking | no — human review |
6. Quality is only comparable within a judge, and within a flow-set
Two subtler traps the timeline view defends
against. First, judge: the same transcripts score differently under different judges
(qwen ~3.85 vs gemini ~4.32 on overlapping work), so a quality trend is only meaningful within one
judge column — the rescore / xjudge rows make the cross-judge spread visible
as exactly that, noise. Second, flow-set composition: a 6-flow re-eval is not
comparable to a 60-flow sweep; mixing them reads composition as quality. The --trend view
only shows a delta between runs with the same flow set for this reason.
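The comparability rule can be expressed as a guard that refuses to compute a delta across judges or across different flow sets. This is a sketch under assumed record fields (`judge`, `flows`, `quality`); it is not the `--trend` implementation.

```python
def trend_delta(prev, curr):
    """Return curr - prev quality, or None when the two runs are not comparable."""
    same_judge = prev["judge"] == curr["judge"]
    same_flows = frozenset(prev["flows"]) == frozenset(curr["flows"])
    if not (same_judge and same_flows):
        return None
    return curr["quality"] - prev["quality"]


# Made-up runs: only run_a and run_b share both a judge and a flow set.
run_a = {"judge": "qwen", "flows": ["f1", "f2"], "quality": 3.8}
run_b = {"judge": "qwen", "flows": ["f2", "f1"], "quality": 4.1}
run_c = {"judge": "gemini", "flows": ["f1", "f2"], "quality": 4.3}
run_d = {"judge": "qwen", "flows": ["f1"], "quality": 4.0}

print(trend_delta(run_a, run_b), trend_delta(run_a, run_c), trend_delta(run_a, run_d))
```

Returning `None` instead of a number keeps a cross-judge or cross-composition difference from ever being plotted as a quality change.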
7. Tie scores to changes — the three clocks
To answer “did the change at time T move the score?” you need three separate clocks:
the flow's edit time (flow_mods, synced from git), the conversation's
run time (run_ts), and the judgment's judge time
(judged_at). Aligning edit → run → judge against the noise floor is the only
honest way to attribute a score shift to a fix rather than to noise. A jump right after an edit is
plausibly the edit; a jump with no edit is most likely noise or an instrument fault (lesson 2).
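Aligning the three clocks can be sketched as below. The field names `run_ts` and `judged_at` and the `flow_mods` list come from the text; the record layout, the function `attribute_shift`, the verdict strings, and all timestamps are hypothetical. Runs are ordered by `run_ts`, so a late rescore (a later `judged_at`) does not move a run on the timeline.

```python
from datetime import datetime

NOISE_FLOOR = 1.5  # approximate floor from lesson 1


def attribute_shift(flow_mods, before, after, floor=NOISE_FLOOR):
    """Classify a score shift between two runs of the same flow."""
    for run in (before, after):
        if run["judged_at"] < run["run_ts"]:
            raise ValueError("a judgment cannot predate its run")
    delta = after["score"] - before["score"]
    if abs(delta) < floor:
        return "within noise"
    # An edit counts only if it landed after the first run and before the second.
    edited = any(before["run_ts"] < t <= after["run_ts"] for t in flow_mods)
    return "plausibly the edit" if edited else "no edit between runs"


ts = datetime.fromisoformat
flow_mods = [ts("2024-01-02T10:00:00")]  # made-up edit time
before = {"score": 2.4, "run_ts": ts("2024-01-01T09:00:00"),
          "judged_at": ts("2024-01-01T21:00:00")}
after = {"score": 4.2, "run_ts": ts("2024-01-03T09:00:00"),
         "judged_at": ts("2024-01-05T21:00:00")}

verdict = attribute_shift(flow_mods, before, after)
print(verdict)
```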
A worked cautionary tale from this session
The repeat-3 before/after sweep measured the session's cumulative work at overall quality
4.3 → 4.1 — an apparent small regression. Decomposition (lesson 3) localized it: two
flows dropped ~1.1pt and their only change was adding a name slot type, while
coffee_split dropped 1.5pt with no flow change at all — i.e. runtime noise or an
extractor side-effect. The honest reading, against the σ≈1.48 floor (lesson 1), is “small,
possibly real, needs a HEAD-vs-HEAD re-run to separate signal from noise” — not “the
session broke quality.” That is the methodology in action: decompose, attribute, and
measure against the floor before you believe a delta.