Deep dive

Flow Evaluation

You can now automate agent flow grading using a persona-driven driver and an LLM judge. This setup replaces manual checks with consistent, scored reports for success and quality.

riff

What was built & why

Manual testing missed subtle flow breaks and lacked consistent quality metrics. You built `riff/flow_eval/driver.py` to simulate user interactions via a Qwen persona. The `riff/flow_eval/rubric.py` module scores these interactions against defined criteria. `riff/flow_eval/linter.py` catches static definition errors before runtime. This pipeline generates objective report cards instead of relying on ad-hoc debugging.
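The driver loop described above can be sketched as follows. This is a minimal illustration, not the contents of `riff/flow_eval/driver.py`: the `Transcript` type, the `drive_conversation` function, the `<done>` stop token, and both stub callables are hypothetical names chosen for the example. In a real run, `persona_reply` would wrap a call to the Qwen persona model and `agent_reply` would wrap the flow under test.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types: the real driver interfaces are not shown in this document.
Turn = Tuple[str, str]  # (speaker, text)


@dataclass
class Transcript:
    turns: List[Turn] = field(default_factory=list)


def drive_conversation(
    persona_reply: Callable[[List[Turn]], str],
    agent_reply: Callable[[List[Turn]], str],
    max_turns: int = 6,
    stop_token: str = "<done>",
) -> Transcript:
    """Alternate persona and agent turns until the persona signals it is done."""
    transcript = Transcript()
    for _ in range(max_turns):
        user_text = persona_reply(transcript.turns)
        if stop_token in user_text:
            break
        transcript.turns.append(("user", user_text))
        transcript.turns.append(("agent", agent_reply(transcript.turns)))
    return transcript


# Stand-ins so the sketch runs without any model; a real persona would be an LLM call.
def scripted_persona(turns: List[Turn]) -> str:
    script = ["I want to book a table.", "For two people at 7pm.", "<done>"]
    return script[len(turns) // 2]


def echo_agent(turns: List[Turn]) -> str:
    return f"Got it: {turns[-1][1]}"


transcript = drive_conversation(scripted_persona, echo_agent)
```

The returned transcript is the only thing handed to the scoring step, which keeps the driver unaware of how interactions are graded.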

Principles

Drive evaluation with realistic personas, not synthetic scripts.
Lint flow definitions statically to catch structural errors early.
Separate the driver logic from the scoring rubric.
Fall back to heuristics when LLM parsing fails.
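The last principle, falling back to heuristics when the judge output cannot be parsed, might look like the sketch below. The function names, the `score` key, and the slot-mention heuristic are all assumptions made for illustration; the actual rubric in `riff/flow_eval/rubric.py` may use a different schema and a different fallback.

```python
import json
import re
from typing import Dict, List


def heuristic_score(transcript: List[str], required_slots: List[str]) -> Dict:
    """Crude fallback: fraction of required slot names mentioned in the transcript."""
    text = " ".join(transcript).lower()
    hits = [slot for slot in required_slots if slot.lower() in text]
    score = len(hits) / len(required_slots) if required_slots else 0.0
    return {"score": score, "source": "heuristic"}


def parse_judge_output(
    raw: str, transcript: List[str], required_slots: List[str]
) -> Dict:
    """Read a JSON verdict from the judge, or fall back to the heuristic."""
    # Tolerate prose around the JSON object by grabbing the outermost braces.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            verdict = json.loads(match.group(0))
            if isinstance(verdict, dict):
                # Assumed schema: the judge returns a numeric "score" field.
                return {"score": float(verdict["score"]), "source": "llm"}
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            pass  # Malformed or incomplete verdict: use the fallback below.
    return heuristic_score(transcript, required_slots)
```

Recording the `source` alongside the score lets a report distinguish genuine judge verdicts from fallback values, which matters because the two are not equally trustworthy.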

Patterns

Persona-Driven Simulation
LLM-as-a-Judge Scoring
Static Flow Linting
Automated Report Generation
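As one illustration of these patterns, static flow linting can be sketched as a pass over the flow definition that never executes the flow. The dictionary schema used here (`start`, `states`, `next`) and the `lint_flow` function are hypothetical; the format that `riff/flow_eval/linter.py` actually checks is not shown in this document.

```python
from typing import Dict, List


def lint_flow(flow: Dict) -> List[str]:
    """Return structural problems found without running the flow."""
    problems: List[str] = []
    states = flow.get("states", {})
    start = flow.get("start")

    if start not in states:
        problems.append(f"start state {start!r} is not defined")

    # Every transition must point at a defined state.
    for name, state in states.items():
        for target in state.get("next", []):
            if target not in states:
                problems.append(
                    f"state {name!r} points to undefined state {target!r}"
                )

    # Walk from the start state to find states that can never be reached.
    reachable = set()
    stack = [start] if start in states else []
    while stack:
        current = stack.pop()
        if current in reachable:
            continue
        reachable.add(current)
        stack.extend(t for t in states[current].get("next", []) if t in states)

    for name in states:
        if name not in reachable:
            problems.append(f"state {name!r} is unreachable")
    return problems


example_flow = {
    "start": "greet",
    "states": {
        "greet": {"next": ["collect"]},
        "collect": {"next": ["confirm"]},  # "confirm" is never defined
        "orphan": {"next": []},            # nothing leads here
    },
}
issues = lint_flow(example_flow)
```

For `example_flow`, the linter reports the dangling `confirm` transition and the unreachable `orphan` state, both of which would otherwise surface only partway through a simulated conversation.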

How to apply

Add new test scenarios in `tests/scenarios/caller_persona/utterance.py`.
Run `riff/flow_eval/runner.py` to execute full evaluation suites.
Inspect `riff/flow_eval/report.py` output for specific failure points.
Configure credential resolution for Alibaba/Qwen models in the runner.
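For the credential step, one plausible resolution order is an explicitly passed key first, then environment variables, then a clear error. The sketch below assumes that order; the variable names `QWEN_API_KEY` and `ALIBABA_API_KEY` and the `resolve_api_key` function are hypothetical placeholders, so substitute whatever `riff/flow_eval/runner.py` actually reads.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable names; replace with the ones your runner expects.
ENV_CANDIDATES = ("QWEN_API_KEY", "ALIBABA_API_KEY")


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Prefer an explicit key, then the first non-empty environment variable."""
    if explicit:
        return explicit
    for name in ENV_CANDIDATES:
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        "No Qwen credential found; set one of: " + ", ".join(ENV_CANDIDATES)
    )
```

Failing fast with a named list of expected variables makes a missing credential obvious before any live smoke test starts.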

Pitfalls

The local MLX judge may return output that is not valid JSON, which triggers the heuristic fallback.
Goal and slot alignment is best-effort and may need manual overrides.
Live smoke tests depend on external LLM provider availability.
Heuristic scores lack the nuance of a successful LLM judgment.