Deep dive
You can now automate agent flow grading with a persona-driven driver and an LLM judge. This setup replaces manual checks with consistent, scored reports on flow success and quality.
Manual testing missed subtle flow breaks and lacked consistent quality metrics. You built `riff/flow_eval/driver.py` to simulate user interactions via a Qwen persona. The `riff/flow_eval/rubric.py` module scores these interactions against defined criteria. `riff/flow_eval/linter.py` catches static definition errors before runtime. This pipeline generates objective report cards instead of relying on ad-hoc debugging.
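To make the judging step concrete, here is a minimal sketch of rubric-based scoring. It is an illustration under stated assumptions, not the contents of `riff/flow_eval/rubric.py`: the names `Criterion`, `score_transcript`, and `stub_judge` are hypothetical, and the stub stands in for the real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration; the real rubric module may differ.
@dataclass
class Criterion:
    name: str
    description: str
    weight: float = 1.0

# A judge maps (transcript, criterion description) to a score, ideally in [0, 1].
Judge = Callable[[str, str], float]

def score_transcript(transcript: str, criteria: list[Criterion], judge: Judge) -> dict:
    """Return per-criterion scores and a weight-averaged overall score."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs at least one criterion with positive weight")
    scores = {}
    for c in criteria:
        raw = judge(transcript, c.description)
        # Clamp, since an LLM judge may return out-of-range values.
        scores[c.name] = min(1.0, max(0.0, raw))
    overall = sum(scores[c.name] * c.weight for c in criteria) / total_weight
    return {"scores": scores, "overall": overall}

def stub_judge(transcript: str, description: str) -> float:
    """Stand-in for an LLM call: full marks if the agent confirmed something."""
    return 1.0 if "confirmed" in transcript.lower() else 0.0

if __name__ == "__main__":
    rubric = [
        Criterion("task_success", "The booking was confirmed", weight=2.0),
        Criterion("tone", "The agent stayed polite", weight=1.0),
    ]
    print(score_transcript("Agent: Your booking is confirmed.", rubric, stub_judge))
```

Passing the judge in as a callable keeps the scoring logic testable without network access; a model-backed judge would be swapped in for `stub_judge`.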
Persona-Driven Simulation
LLM-as-a-Judge Scoring
Static Flow Linting
Automated Report Generation
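As an illustration of what static flow linting can catch before any simulation runs, the sketch below checks a flow definition for undefined transition targets and unreachable nodes. The flow schema (a `start` key plus a `nodes` mapping of node name to next-node names) and the function `lint_flow` are assumptions made for this example; the actual format read by `riff/flow_eval/linter.py` may differ.

```python
def lint_flow(flow: dict) -> list[str]:
    """Return human-readable problems found in a flow definition.

    Assumed (hypothetical) schema:
        {"start": "greet", "nodes": {"greet": ["ask"], "ask": []}}
    """
    problems = []
    nodes = flow.get("nodes", {})
    start = flow.get("start")

    if start not in nodes:
        problems.append(f"start node {start!r} is not defined")

    # Every transition must point at a defined node.
    for name, targets in nodes.items():
        for target in targets:
            if target not in nodes:
                problems.append(f"{name!r} transitions to undefined node {target!r}")

    # Depth-first walk from the start node to find unreachable nodes.
    if start in nodes:
        seen = set()
        stack = [start]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(t for t in nodes[current] if t in nodes)
        for name in nodes:
            if name not in seen:
                problems.append(f"{name!r} is unreachable from start")

    return problems

if __name__ == "__main__":
    flow = {
        "start": "greet",
        "nodes": {
            "greet": ["ask"],
            "ask": ["done", "missing"],
            "done": [],
            "orphan": ["done"],
        },
    }
    for problem in lint_flow(flow):
        print(problem)
```

Checks like these are cheap and deterministic, so they can run ahead of the persona-driven simulation and save model calls on flows that are structurally broken.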
Add new test scenarios in `tests/scenarios/caller_persona/utterance.py`.
Run `riff/flow_eval/runner.py` to execute full evaluation suites.
Inspect `riff/flow_eval/report.py` output for specific failure points.
Configure credential resolution for Alibaba/Qwen models in the runner.
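For the credential step, one possible approach is to resolve the API key from an explicit argument first and fall back to environment variables. This is a sketch under assumptions: `resolve_api_key` is a hypothetical helper, and the variable names `DASHSCOPE_API_KEY` and `QWEN_API_KEY` are assumed for illustration; check which names the runner actually reads.

```python
import os
from typing import Mapping, Optional, Sequence

# Assumed variable names for illustration; the runner may use others.
DEFAULT_ENV_VARS = ("DASHSCOPE_API_KEY", "QWEN_API_KEY")

def resolve_api_key(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    env_vars: Sequence[str] = DEFAULT_ENV_VARS,
) -> str:
    """Return the first available key: explicit argument, then env vars in order."""
    if explicit:
        return explicit
    source = os.environ if env is None else env
    for name in env_vars:
        value = source.get(name, "").strip()
        if value:
            return value
    # Fail early with a clear message instead of failing mid-evaluation.
    raise RuntimeError(
        "No API key found; set one of: " + ", ".join(env_vars)
    )

if __name__ == "__main__":
    # Passing a mapping makes the lookup easy to test without touching os.environ.
    print(resolve_api_key(env={"QWEN_API_KEY": "example-key"}))
```

Resolving the key once at startup and raising immediately when it is missing avoids a partially completed evaluation run that fails on the first model call.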