Holdout Validation — the overfitting guard
You tune against the personas you can read. A fix only counts if it also lifts the personas you never looked at. Otherwise you memorized, you didn't generalize.
What & why
A flow is evaluated by simulated personas — scripted callers with different styles (cooperative, terse, angry, an over-sharer who front-loads everything). When you fix a flow, you read the failing transcripts and tune against them. The danger is the same as in machine learning: overfitting. A change can make the exact transcripts you stared at pass, without improving the flow's real behavior at all. You memorized the test.
The guard is a leave-out split (Arbor-style): personas are partitioned into a DEV set you iterate against and a HELD-OUT set you never inspect while fixing. A fix must clear the bar on the held-out set — different personas, same failure axes — to prove it generalized.
How it works — the split
The split lives in one place, riff/flow_eval/splits.py, which is the single source of truth. A quarter of the personas
are held out (3 of 12): enough to give a real generalization signal without halving the dev surface you
actually iterate on. The held-out set deliberately mirrors a dev failure axis with a
different persona (bulk on front-load) and adds stretch axes (adversarial, withholding) that dev omits entirely.
| axis | DEV (9) — you read & debug these | HELD-OUT (3) — never inspected |
|---|---|---|
| baseline | cooperative | — |
| paced | paced_methodical | — |
| correction | corrector, indecisive | — |
| front-load | oversharer | bulk (different persona, same axis) |
| terse | terse | — |
| emotional | angry, chatty | — |
| off-scope | offscope | — |
| adversarial | — | skeptical (stretch) |
| withholding | — | privacy_guarded (stretch) |
    # riff/flow_eval/splits.py
    DEV_PERSONAS = ("cooperative", "corrector", "oversharer", "paced_methodical",
                    "terse", "indecisive", "chatty", "angry", "offscope")
    HELDOUT_PERSONAS = ("skeptical", "bulk", "privacy_guarded")
    ALL_GATE_PERSONAS = DEV_PERSONAS + HELDOUT_PERSONAS  # one eval over the union, split after
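The properties that make the split trustworthy (no persona in both sets, 3 of 12 held out) can be checked mechanically. The sketch below is an illustration only: `check_split` is a hypothetical helper, not something this page says exists in `splits.py`; only the persona tuples are taken from the source.

```python
# Persona tuples copied from riff/flow_eval/splits.py.
DEV_PERSONAS = ("cooperative", "corrector", "oversharer", "paced_methodical",
                "terse", "indecisive", "chatty", "angry", "offscope")
HELDOUT_PERSONAS = ("skeptical", "bulk", "privacy_guarded")
ALL_GATE_PERSONAS = DEV_PERSONAS + HELDOUT_PERSONAS


def check_split(dev, heldout):
    """Hypothetical helper: raise if a persona leaks across the split
    or is listed twice, otherwise return the held-out fraction."""
    overlap = set(dev) & set(heldout)
    if overlap:
        raise ValueError(f"personas in both splits: {sorted(overlap)}")
    union = dev + heldout
    if len(set(union)) != len(union):
        raise ValueError("duplicate persona name within a split")
    return len(heldout) / len(union)


heldout_fraction = check_split(DEV_PERSONAS, HELDOUT_PERSONAS)  # 3 / 12
```

A check like this matters because a persona that appears in both tuples would be read during debugging and still counted toward the gate, which defeats the guard without any visible failure.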
split_quality() reports the
dev mean and held-out mean separately; you act on dev, you gate on held-out.
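A minimal sketch of what such a per-split report could look like. The real `split_quality()` signature is not shown on this page, so the function name `split_means`, the input shape (a dict mapping persona name to score), and the returned keys are all assumptions made for illustration.

```python
from statistics import mean


def split_means(scores, dev_personas, heldout_personas):
    """Hypothetical stand-in for split_quality(): average the per-persona
    scores of one eval run separately for each split."""
    return {
        "dev_mean": mean(scores[p] for p in dev_personas),
        "heldout_mean": mean(scores[p] for p in heldout_personas),
    }
```

Called with `DEV_PERSONAS` and `HELDOUT_PERSONAS` from `splits.py` and the scores of a single eval over the union, this gives the two numbers separately, so a strong dev mean cannot mask a weak held-out mean the way a single pooled average would.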
How the gate reads it
The flow build pipeline (scripts/flow-pipeline.sh) runs the union in one eval
(cheaper than two runs), then gates on the held-out mean, showing dev for comparison.
A fix that lifts dev but not held-out is flagged ⚠️ OVERFIT — a loud signal that you
memorized the dev transcripts instead of generalizing.
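The flagging rule can be sketched as a comparison of the two means before and after a fix. This is an illustration, not the logic inside `scripts/flow-pipeline.sh`: the function name `classify_fix`, the `min_lift` tolerance, and every verdict string other than OVERFIT are assumptions.

```python
def classify_fix(dev_before, dev_after, heldout_before, heldout_after,
                 min_lift=0.0):
    """Hypothetical verdict for a fix, based on the lift on each split.

    min_lift is an assumed tolerance: a mean must rise by more than
    this amount to count as lifted.
    """
    dev_lifted = (dev_after - dev_before) > min_lift
    heldout_lifted = (heldout_after - heldout_before) > min_lift
    if dev_lifted and not heldout_lifted:
        return "OVERFIT"       # dev moved, held-out did not
    if heldout_lifted:
        return "GENERALIZED"   # held-out improved, which is what the gate reads
    return "NO_CHANGE"         # neither split moved
```

In practice a nonzero `min_lift` would probably be needed, since judged scores vary from run to run and a tiny held-out wobble should not count as generalization.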
Use case
You convert a flow to the Collect Pattern and the
cooperative + oversharer transcripts you were watching now pass — dev mean jumps. Is the flow actually
better? The held-out mean answers it. If bulk (a different front-loader you never saw)
also improves, the fix generalized. If only dev moved, you tuned to the specific transcripts and the
OVERFIT flag fires — back to the drawing board.
Example
    # the pipeline runs DEV + HELD-OUT in one eval and gates on the held-out mean
    scripts/flow-pipeline.sh austin_plumbing
    # → prints dev mean (for comparison) and held-out mean (the gate);
    #   a dev-only lift is flagged ⚠️ OVERFIT
The same two caller behaviors, paced (one answer at a time) and front-load (everything at once), are exercised inside a single flow's eval, so clearing the bar on the held-out split is evidence that the flow serves both paths. There is nothing to A/B: a brand-new flow has no “before,” so it is gated against a fixed quality bar on the held-out split.
Where it fits
Holdout discipline governs every flow fix and every pattern rollout. It is the reason the Methodology page lists “holdout discipline” as a hard-won lesson: a guard that you quietly defeat is worse than no guard, because it gives false confidence. The split feeds the judged pipeline; the cheap CI gate uses a smaller deterministic persona set per commit.