
Holdout Validation — the overfitting guard

You tune against the personas you can read. A fix only counts if it also lifts the personas you never looked at. Otherwise you memorized; you didn't generalize.

What & why

A flow is evaluated by simulated personas — scripted callers with different styles (cooperative, terse, angry, an over-sharer who front-loads everything). When you fix a flow, you read the failing transcripts and tune against them. The danger is the same as in machine learning: overfitting. A change can make the exact transcripts you stared at pass, without improving the flow's real behavior at all. You memorized the test.

The guard is a leave-out split (Arbor-style): personas are partitioned into a DEV set you iterate against and a HELD-OUT set you never inspect while fixing. A fix must clear the bar on the held-out set — different personas, same failure axes — to prove it generalized.

How it works — the split

The split is defined in one place, riff/flow_eval/splits.py, which is the single source of truth. About a quarter of the personas are held out (3 of 12): enough to give a real generalization signal without halving the dev surface you actually iterate on. The held-out set deliberately mirrors the dev failure axes with different personas, plus stretch axes (adversarial, withholding) that dev omits entirely.

axis        | DEV (9): you read & debug these | HELD-OUT (3): never inspected
baseline    | cooperative                     |
paced       | paced_methodical                |
correction  | corrector, indecisive           |
front-load  | oversharer                      | bulk (different persona, same axis)
terse       | terse                           |
emotional   | angry, chatty                   |
off-scope   | offscope                        |
adversarial |                                 | skeptical (stretch)
withholding |                                 | privacy_guarded (stretch)

# riff/flow_eval/splits.py
DEV_PERSONAS     = ("cooperative", "corrector", "oversharer", "paced_methodical",
                    "terse", "indecisive", "chatty", "angry", "offscope")
HELDOUT_PERSONAS = ("skeptical", "bulk", "privacy_guarded")
ALL_GATE_PERSONAS = DEV_PERSONAS + HELDOUT_PERSONAS   # one eval over the union, split after

Never debug a fix against held-out transcripts. The instant you read a held-out transcript to fix it, that persona becomes a dev persona and the guard silently dies: you've contaminated your test set. split_quality() reports the dev mean and held-out mean separately; you act on dev, you gate on held-out.
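
To make "one eval over the union, split after" concrete, here is a minimal sketch of the bookkeeping. The persona tuples are copied from splits.py as shown above. Everything else is an assumption for illustration: split_means is a hypothetical stand-in (the real split_quality() signature is not shown on this page), and scores are assumed to be one number per persona on a 0.0 to 1.0 scale.

```python
from statistics import mean

# Copied from riff/flow_eval/splits.py as shown on this page.
DEV_PERSONAS = ("cooperative", "corrector", "oversharer", "paced_methodical",
                "terse", "indecisive", "chatty", "angry", "offscope")
HELDOUT_PERSONAS = ("skeptical", "bulk", "privacy_guarded")
ALL_GATE_PERSONAS = DEV_PERSONAS + HELDOUT_PERSONAS

# Invariants the guard relies on: no persona sits in both sets, 3 of 12 held out.
assert not set(DEV_PERSONAS) & set(HELDOUT_PERSONAS)
assert len(ALL_GATE_PERSONAS) == 12 and len(HELDOUT_PERSONAS) == 3


def split_means(scores):
    """Hypothetical stand-in for split_quality(): one eval result, two means.

    `scores` maps persona name -> score (assumed 0.0-1.0) from a single eval
    over ALL_GATE_PERSONAS. Returns (dev_mean, heldout_mean).
    """
    missing = [p for p in ALL_GATE_PERSONAS if p not in scores]
    if missing:
        raise KeyError(f"eval is missing personas: {missing}")
    dev_mean = mean(scores[p] for p in DEV_PERSONAS)
    heldout_mean = mean(scores[p] for p in HELDOUT_PERSONAS)
    return dev_mean, heldout_mean
```

Only the aggregate held-out number comes back from this helper; nothing in it requires opening a held-out transcript, which is the property the guard depends on.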

How the gate reads it

The flow build pipeline (scripts/flow-pipeline.sh) runs the union in one eval (cheaper than two runs), then gates on the held-out mean, showing dev for comparison. A fix that lifts dev but not held-out is flagged ⚠️ OVERFIT — a loud signal that you memorized the dev transcripts instead of generalizing.
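
The gate's decision rule can be sketched as follows. This is not the code in scripts/flow-pipeline.sh: the function name, the GateResult type, the quality bar of 0.8, and the minimum lift of 0.02 are all hypothetical values chosen for illustration. The sketch only encodes the two rules stated above: pass or fail is decided by the held-out mean, and a dev-only lift raises the OVERFIT flag.

```python
from typing import NamedTuple


class GateResult(NamedTuple):
    passed: bool   # held-out mean cleared the bar
    overfit: bool  # dev lifted, held-out did not


def gate(dev_before, dev_after, heldout_before, heldout_after,
         bar=0.8, min_lift=0.02):
    """Hypothetical gate: decide on held-out, use dev only for comparison.

    `bar` and `min_lift` are illustrative thresholds, not the pipeline's.
    """
    dev_lifted = (dev_after - dev_before) >= min_lift
    heldout_lifted = (heldout_after - heldout_before) >= min_lift
    return GateResult(
        passed=heldout_after >= bar,
        overfit=dev_lifted and not heldout_lifted,
    )


# Made-up numbers: dev jumps from 0.60 to 0.85 while held-out stays at 0.62.
result = gate(0.60, 0.85, 0.62, 0.62)
print(result)
```

With those made-up numbers the result is passed=False, overfit=True: the dev mean is reported but never enters the pass decision.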

Use case

You convert a flow to the Collect Pattern and the cooperative + oversharer transcripts you were watching now pass — dev mean jumps. Is the flow actually better? The held-out mean answers it. If bulk (a different front-loader you never saw) also improves, the fix generalized. If only dev moved, you tuned to the specific transcripts and the OVERFIT flag fires — back to the drawing board.

Example

# the pipeline runs DEV + HELD-OUT in one eval and gates on the held-out mean
scripts/flow-pipeline.sh austin_plumbing
# → prints dev mean (for comparison) and held-out mean (the gate);
#   a dev-only lift is flagged ⚠️ OVERFIT

The same two caller behaviors, paced (one answer at a time) and front-load (everything at once), are exercised inside a single flow's eval, so passing the bar on the held-out split (three personas, per splits.py) shows the flow serves both paths. There is nothing to A/B: a brand-new flow has no “before,” so it is gated against a fixed quality bar on the held-out split.

Where it fits

Holdout discipline governs every flow fix and every pattern rollout. It is the reason the Methodology page lists “holdout discipline” as a hard-won lesson: a guard that you quietly defeat is worse than no guard, because it gives false confidence. The split feeds the judged pipeline; the cheap CI gate uses a smaller deterministic persona set per commit.