The CI Quality Gate
A per-commit regression gate that is fast and free — because it scores only the deterministic computed aspects, never an LLM judge. The expensive judged matrix is a separate nightly job.
What & why
You want to catch a flow regression before it lands, on every push. But the judged humanness matrix is slow and costs API money — you can't run it per commit. The resolution is the computed/judged split: the computed aspects (correctness, completion) are deterministic, so a per-commit gate can run them on just the changed flows, compare to the DB baseline, and fail the build on a regression — all in ~22 seconds per flow, with zero judge calls.
How it works
1. scripts/ci_eval.sh — the gate body
- Determine which flows to gate: explicit args, else the flows touched by the last commit (tutorial/module/underscore flows excluded).
- Run computed aspects only: fast-tier agent (qwen-flash), 2 personas (cooperative,skeptical), --repeat 1, --no-judge. Deterministic, no LLM judge.
- Ingest the computed aspects into aspect_scores.
- Run eval_db.py --gate; its exit code is the build verdict.
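The flow-selection step can be sketched in Python. This is an illustration under assumptions, not the contents of scripts/ci_eval.sh: the function name select_flows is hypothetical, flow files are assumed to live at flows/<name>.yaml (as in the use case further down), and "tutorial/module/underscore flows" is read here as a name-prefix rule, which may not match the real exclusion logic.

```python
from pathlib import PurePosixPath

# Assumed reading of "tutorial/module/underscore flows excluded":
# skip any flow whose file name starts with one of these prefixes.
EXCLUDED_PREFIXES = ("_", "tutorial", "module")


def select_flows(args, changed_paths):
    """Return the flow names to gate.

    args          -- flow names given explicitly on the command line
    changed_paths -- repo-relative paths touched by the last commit
    """
    if args:
        # Explicit arguments always win over auto-detection.
        return list(args)

    flows = []
    for path in changed_paths:
        p = PurePosixPath(path)
        # Only YAML files directly under flows/ count as flow changes.
        if p.parent != PurePosixPath("flows") or p.suffix != ".yaml":
            continue
        name = p.stem
        if name.startswith(EXCLUDED_PREFIXES):
            continue
        if name not in flows:
            flows.append(name)
    return flows
```

An empty result means there is nothing to gate, so the remaining steps can be skipped.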
2. eval_db.py --gate — the regression check
For each flow it compares the latest run's correctness and completion scores to the
previous run's, and fails (exit 1) on either a drop of more than 1.0 point or an absolute
score below the floor (6.0). It prints a verdict line per flow:
✅ weather: correctness 10.0
❌ austin_plumbing: correctness 4.2 < floor 6.0; completion regressed 10.0->7.0
GATE: FAIL (1 flow(s) regressed)
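The rule behind those verdict lines can be sketched as follows. This is a minimal illustration, not the actual eval_db.py code: the function names and the dict-based inputs are hypothetical, the "GATE: PASS" line is assumed (only the failing form appears above), and reporting a floor violation in place of a drop for the same aspect is a guess based on the sample output.

```python
FLOOR = 6.0     # absolute minimum score, from the text
MAX_DROP = 1.0  # largest tolerated drop versus the previous run, from the text


def gate_flow(flow, latest, previous=None):
    """Return (passed, verdict_line) for one flow.

    latest / previous map aspect name -> score,
    e.g. {"correctness": 10.0, "completion": 7.0}.
    """
    previous = previous or {}
    problems = []
    for aspect, score in latest.items():
        if score < FLOOR:
            problems.append(f"{aspect} {score:.1f} < floor {FLOOR:.1f}")
        elif aspect in previous and previous[aspect] - score > MAX_DROP:
            problems.append(f"{aspect} regressed {previous[aspect]:.1f}->{score:.1f}")
    if problems:
        return False, f"❌ {flow}: " + "; ".join(problems)
    scores = ", ".join(f"{a} {s:.1f}" for a, s in latest.items())
    return True, f"✅ {flow}: {scores}"


def gate(runs):
    """runs maps flow -> (latest, previous). Returns (exit_code, lines)."""
    lines, failed = [], 0
    for flow, (latest, previous) in runs.items():
        ok, line = gate_flow(flow, latest, previous)
        lines.append(line)
        if not ok:
            failed += 1
    # The passing summary line is an assumption; the source shows only FAIL.
    lines.append(f"GATE: FAIL ({failed} flow(s) regressed)" if failed else "GATE: PASS")
    return (1 if failed else 0), lines
```

A drop of exactly 1.0 point passes under this reading, because the text specifies a drop greater than 1.0.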
3. The pre-push hook
scripts/hooks/pre-push gates only the flows changed in the commits being
pushed (diffed against origin/main), and instant-skips when no flow YAML changed. Install
it once per clone:
scripts/install_hooks.sh   # symlinks .git/hooks/pre-push -> scripts/hooks/pre-push

# bypass once when you need to:
git push --no-verify
RIFF_SKIP_EVAL=1 git push
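The hook's decision can be paraphrased in Python. This is a sketch under assumptions, not the contents of scripts/hooks/pre-push: both function names are hypothetical, the three-dot range is one plausible reading of "diffed against origin/main", and flow files are assumed to be .yaml files under flows/.

```python
import subprocess


def changed_flow_files(base="origin/main"):
    """List flow YAML paths changed on this branch since it diverged from base."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD", "--", "flows/"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [p for p in out.splitlines() if p.endswith(".yaml")]


def hook_action(changed, env):
    """Decide what the hook does: 'bypass', 'skip', or 'gate'."""
    if env.get("RIFF_SKIP_EVAL") == "1":
        return "bypass"  # explicit one-off opt-out via the environment
    if not changed:
        return "skip"    # no flow YAML changed, so there is nothing to gate
    return "gate"        # run the computed-aspect gate on the changed flows
```

git push --no-verify needs no handling here, because git does not invoke the pre-push hook at all when that flag is given.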
Use case
You edit flows/coffee.yaml to reorder the collect questions and push. The hook detects
the YAML change, runs the computed-aspect gate on just coffee (~22s), and finds
completion dropped from 10 to 7 because your reorder stranded a conditional slot. The push is blocked
with an exact verdict line — before a regression ever reaches main or the nightly judged
run.
Example
# gate explicit flows
scripts/ci_eval.sh austin_plumbing weather

# or let it auto-detect flows changed in the last commit
scripts/ci_eval.sh

# the gate alone, against whatever is already in the DB
python scripts/eval_db.py --gate austin_plumbing weather
Why judge-free is the right call here
Because the gate scores only computed aspects, it has no noise floor — a fail is a real regression, not a judge having a bad night. (The judged humanness number swings ~1.5pt run to run; gating on it would produce flaky red builds.) Humanness regressions are caught by the nightly matrix and human review, not the blocking gate. See Methodology for the noise-floor evidence.
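The flakiness argument can be illustrated with a small simulation. It is a toy model, not the Methodology evidence: it assumes the ~1.5pt swing is uniform noise around a true score that never changes, and it reuses the gate's 1.0-point drop threshold. The function name and the true score of 8.0 are made up for illustration.

```python
import random

MAX_DROP = 1.0  # gate threshold from the text


def false_fail_rate(swing, trials=10_000, seed=0):
    """Fraction of run pairs flagged as regressions when nothing actually changed.

    swing is the total run-to-run spread of the score in points, modelled
    (as an assumption) as uniform noise of +/- swing/2 around a fixed score.
    """
    rng = random.Random(seed)
    half = swing / 2
    true_score = 8.0  # arbitrary, well above the 6.0 floor
    fails = 0
    for _ in range(trials):
        previous = true_score + rng.uniform(-half, half)
        latest = true_score + rng.uniform(-half, half)
        if previous - latest > MAX_DROP:
            fails += 1
    return fails / trials
```

With swing=0.0, which models a deterministic computed aspect, the rate is exactly zero. With swing=1.5 a nonzero share of unchanged flows would still fail the build under this model, which is the flaky-red-build problem described above.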
Where it fits
The CI gate is the per-commit floor of the whole system. It reuses the exact same
--ingest-aspects path as the matrix, so what it
gates and what the matrix reports are the same numbers. Above it sit the nightly judged matrix and the
replay regression check.