Deep dive

Flow Evaluation

You can now grade agent flows automatically with a persona-driven driver and an LLM judge. The setup replaces manual checks with consistent reports that score each flow on success and quality.

riff

What was built & why

Manual testing missed subtle flow breaks and gave no consistent quality metrics. You built `riff/flow_eval/driver.py` to simulate user interactions through a Qwen persona, `riff/flow_eval/rubric.py` to score those interactions against defined criteria, and `riff/flow_eval/linter.py` to catch static definition errors before runtime. Together, the pipeline produces objective report cards instead of relying on ad-hoc debugging.
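To make the three stages concrete, here is a minimal sketch of how a lint, drive, and score pipeline could fit together. It is not the actual contents of the `riff/flow_eval` modules: every name in it (`lint_flow`, `run_driver`, `score`, the flow dictionary shape) is hypothetical, the Qwen persona is replaced by a scripted stub, and the LLM judge is replaced by plain Python functions so the example runs offline.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


def lint_flow(flow: dict[str, list[str]], start: str) -> list[str]:
    """Static check (hypothetical): flag a missing start step and dangling transitions."""
    errors = []
    if start not in flow:
        errors.append(f"start step '{start}' is not defined")
    for step, targets in flow.items():
        for target in targets:
            if target not in flow:
                errors.append(f"step '{step}' points to undefined step '{target}'")
    return errors


@dataclass
class Turn:
    user: str
    agent: str


def run_driver(
    persona: Callable[[list[Turn]], Optional[str]],
    agent: Callable[[str], str],
    max_turns: int = 5,
) -> list[Turn]:
    """Simulate a conversation: the persona speaks until it returns None."""
    transcript: list[Turn] = []
    for _ in range(max_turns):
        message = persona(transcript)
        if message is None:
            break
        transcript.append(Turn(user=message, agent=agent(message)))
    return transcript


def score(
    transcript: list[Turn],
    criteria: dict[str, Callable[[list[Turn]], float]],
) -> dict[str, float]:
    """Apply each rubric criterion to the transcript and collect the scores."""
    return {name: judge(transcript) for name, judge in criteria.items()}


# Stub persona: a fixed script standing in for a model-backed persona.
def scripted_persona(transcript: list[Turn]) -> Optional[str]:
    script = ["I want to book a table", "For two people at 7pm"]
    return script[len(transcript)] if len(transcript) < len(script) else None


# Stub agent under test: echoes the request back.
def echo_agent(message: str) -> str:
    return f"Got it: {message}"


# Stub judges: simple functions standing in for LLM-judge calls.
criteria = {
    "success": lambda t: 1.0 if any("7pm" in turn.agent for turn in t) else 0.0,
    "brevity": lambda t: 1.0 if all(len(turn.agent) < 80 for turn in t) else 0.5,
}

flow = {"greet": ["collect"], "collect": ["confirm"], "confirm": []}
errors = lint_flow(flow, "greet")
# Only drive the flow when the static definition is clean.
transcript = run_driver(scripted_persona, echo_agent) if not errors else []
report = score(transcript, criteria)
```

Running the linter first keeps definition errors from showing up as confusing runtime failures, and passing the persona and judges in as callables means the stubs can later be swapped for model-backed versions without changing the pipeline shape.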

Principles

Patterns

How to apply

Pitfalls