The loop
- Describe the behavior as a scenario. A scenario file is an executable specification: the conversation, the expected events, and the criteria a response must meet.
- The assistant changes the agent. A prompt edit, a new tool, a pipeline change.
- The assistant runs the evals. One command, either against a running agent (pipecat eval run) or letting the suite spawn the agent itself (pipecat eval suite).
- The assistant reads the result. A non-zero exit code, a per-assertion failure message (“turn 1 expectation 0 (llm_response): judge said no: …”), and a full decision trace in <scenario>.eval.log.
- Repeat until green.
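The loop above can be sketched as a small driver script. This is a minimal illustration, not part of the framework: it relies only on the exit-code behavior described here (zero on success, non-zero on failure), and run_evals, loop_until_green, and the fix callback are hypothetical names standing in for whatever your assistant does between runs.

```python
import subprocess
from typing import Callable, Sequence


def run_evals(command: Sequence[str]) -> tuple[bool, str]:
    """Run the eval command once and return (passed, combined output)."""
    # Output is captured, so the command is not attached to a terminal.
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def loop_until_green(
    command: Sequence[str],
    fix: Callable[[str], None],
    max_attempts: int = 5,
) -> bool:
    """Run the evals, hand failures to `fix`, and repeat until they pass."""
    for attempt in range(max_attempts):
        passed, output = run_evals(command)
        if passed:
            return True
        # Hypothetical hook: the assistant reads the failure and edits the agent.
        # Skipped after the last attempt because no further run would check it.
        if attempt < max_attempts - 1:
            fix(output)
    return False
```

The command is passed as a list, for example ["pipecat", "eval", "suite", "manifest.yaml"], so no shell is involved. The attempt cap is a design choice for this sketch: it keeps an autonomous loop from running forever on a scenario the agent cannot satisfy.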
Why this works well for coding assistants
The framework was built to be driven by tools, not just humans:
- One command, one exit code. pipecat eval run scenarios/*.yaml exits 0 on success and 1 on failure, so an assistant knows mechanically whether it’s done.
- Plain-text output when piped. Outside a terminal the CLI streams one result line per scenario instead of rendering a live dashboard, which is exactly what an assistant running shell commands sees.
- Actionable failures. Failures name the turn, the expectation, and the reason, including what the judge said. The .eval.log decision trace shows every event the harness observed, so “why did this fail” is answerable from files.
- Suites are self-contained. pipecat eval suite spawns the agents itself, so an autonomous loop doesn’t need to manage processes: edit, run one command, read the result.
- Text mode is fast and cheap. Iterating on prompts and logic skips STT and TTS entirely, so an assistant can afford to run the evals after every change.
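To illustrate why failures that name the turn, the expectation, and the reason are easy for a tool to act on, here is a sketch that parses a failure message into fields. The pattern is inferred only from the single example quoted on this page (“turn 1 expectation 0 (llm_response): judge said no: …”); the real output format may differ, and Failure and parse_failure are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed shape, based on the one example message shown on this page:
#   turn <n> expectation <n> (<kind>): <reason>
FAILURE_PATTERN = re.compile(
    r"turn (?P<turn>\d+) expectation (?P<expectation>\d+) "
    r"\((?P<kind>[^)]+)\): (?P<reason>.*)"
)


@dataclass
class Failure:
    turn: int
    expectation: int
    kind: str
    reason: str


def parse_failure(line: str) -> Optional[Failure]:
    """Return the structured failure found in `line`, or None if there is none."""
    match = FAILURE_PATTERN.search(line)
    if match is None:
        return None
    return Failure(
        turn=int(match["turn"]),
        expectation=int(match["expectation"]),
        kind=match["kind"],
        reason=match["reason"].strip(),
    )
```

An assistant does not need a parser like this to read the message, but the exercise shows that each failure carries enough structure to point at one assertion in one turn.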
Setting up your project
Keep scenarios in the repo next to the agent and tell your assistant how to run them. For example, in your project’s CLAUDE.md or AGENTS.md:
Add a get_order_status tool to the agent and make sure it gets called when the user asks where their order is. Add a scenario for it and run the evals until they pass.
The assistant writes the tool, writes the scenario (a function_call assertion plus a judged response), runs pipecat eval run, reads any failure, and fixes its own work.
Evals as acceptance criteria
You can also run the loop in the other direction: write the scenario first, watch it fail, and hand the failure to the assistant. The scenario is the spec, and “make this pass” is the task.
order_status.yaml
Guardrails
A few practices keep autonomous loops honest:
- Review scenario changes like code. An assistant that can edit scenarios can also weaken them. Failing evals should usually be fixed in the agent, not in the scenario.
- Keep a regression set. As behaviors accumulate, so should scenarios. Run the full set (or a suite) before merging, not just the scenario being worked on.
- Gate merges in CI. pipecat eval suite manifest.yaml in CI makes “the evals pass” a property of the branch, whoever (or whatever) wrote it. See Eval Suites.
- Use audio mode for the final check. Iterate in text mode for speed, then run the audio variants before release to cover the full STT, LLM, and TTS path.
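One way to support the first practice is to flag any change that touches a scenario file so a human looks at it. The sketch below is an illustration under assumptions: it takes a list of changed paths (for instance, the output of git diff --name-only) and assumes scenarios are YAML files under a directory named scenarios, as in the scenarios/*.yaml example on this page. The function name scenario_changes is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def scenario_changes(
    changed_paths: Iterable[str],
    scenario_dir: str = "scenarios",  # assumed layout, adjust to your repo
) -> List[str]:
    """Return the changed paths that look like scenario files."""
    flagged = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # A scenario file is a YAML file with `scenario_dir` among its parent folders.
        in_scenario_dir = scenario_dir in path.parts[:-1]
        if in_scenario_dir and path.suffix in {".yaml", ".yml"}:
            flagged.append(raw)
    return flagged
```

A CI step or review bot could call this on the branch diff and request an explicit human review whenever the returned list is non-empty.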
Next steps
AI-Assisted Development
Give your coding assistant access to Pipecat docs and source context.
Production Evaluation
Layer in simulation platforms and observability once your agent is deployed.