Anatomy of a scenario
Each turn sends a user utterance (user:) and lists the events expected in response (expect:). Expected events must arrive in the order listed, but the agent may emit other events in between, so you don’t have to enumerate everything it does.
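The ordering rule amounts to an ordered-subsequence check over the event stream. The sketch below is only an illustration of that rule, not the harness's implementation; the function name `match_in_order` and the flat list of event names are assumptions made for the example.

```python
from typing import Iterable, List


def match_in_order(expected: List[str], emitted: Iterable[str]) -> bool:
    """Return True if `expected` occurs, in order, somewhere within `emitted`.

    Unlisted events between the expected ones are simply skipped.
    """
    remaining = iter(expected)
    current = next(remaining, None)
    for event in emitted:
        if current is None:
            break  # every expectation has already been matched
        if event == current:
            current = next(remaining, None)
    return current is None


# Hypothetical event stream for one turn.
emitted = ["user_started_speaking", "llm_started", "function_call", "response"]

in_order = match_in_order(["llm_started", "response"], emitted)
out_of_order = match_in_order(["response", "llm_started"], emitted)
print(in_order, out_of_order)
```

The first check passes even though `function_call` sits between the two expected events; the second fails because `llm_started` never arrives after `response`.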
A turn without a user: field is observation-only: the harness just waits for the expected events. This is how you test agent-first behavior like an on-connect greeting:
Events
Scenarios assert on a small set of semantic events, mapped from the RTVI messages the agent emits:
| Event | Meaning |
|---|---|
| response | The agent’s reply. In audio mode this is a transcription of the agent’s actual synthesized speech; in text mode it resolves to llm_response. Prefer this for content checks. |
| llm_response | The LLM’s text output for the turn. Available in both modes. |
| tts_response | The text the TTS reports speaking, one segment at a time. Audio mode only. |
| llm_started | The LLM began generating a response. |
| function_call | The LLM called a function. |
| user_transcription | The agent’s STT finalized a transcription of the user. Audio mode only. |
| user_started_speaking | The agent’s VAD detected the start of user speech. Audio mode only. |
| user_stopped_speaking | The agent’s VAD detected the end of user speech. Audio mode only. |
Assertions
Each entry in expect: names an event and, optionally, asserts on its content or timing.
Semantic judging with eval
The eval: field is a natural-language criterion that the event’s text must satisfy, decided by the judge LLM:
eval: only makes sense on the agent’s text output (response, llm_response, tts_response).
Substring checks with text_contains
For exact content, text_contains does a plain substring check, with no judge round-trip:
Latency budgets with within_ms
within_ms bounds how long after the turn’s user send the event may arrive. All of a turn’s expectations share that one anchor:
An expectation without within_ms falls back to the default budget (60 seconds, set with --timeout), so timing is only asserted when you ask for it.
Because every deadline is measured from the send, time spent matching earlier expectations counts against later ones. In the example above, if llm_started arrives at 1.5 seconds, the response (with the default 60 second budget) has 58.5 seconds left, and a turn that stalls completely fails within a single budget rather than one per expectation.
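The shared-anchor arithmetic can be written out as a small sketch. It only models the description above; the helper names are hypothetical, and treating an arrival exactly at the deadline as a pass is an assumption.

```python
DEFAULT_BUDGET_MS = 60_000  # the default 60 second budget described above


def remaining_ms(elapsed_since_send_ms: float,
                 within_ms: float = DEFAULT_BUDGET_MS) -> float:
    """Budget left for an expectation, measured from the turn's user send."""
    return max(0.0, within_ms - elapsed_since_send_ms)


def met_deadline(arrival_ms: float,
                 within_ms: float = DEFAULT_BUDGET_MS) -> bool:
    """True if an event that arrived `arrival_ms` after the send is in budget.

    Assumption: arriving exactly on the deadline still counts as a pass.
    """
    return arrival_ms <= within_ms


# llm_started arrives 1.5 s after the send; the response expectation that
# follows it still counts from the same send, so it has 58.5 s left.
left_for_response = remaining_ms(1_500)
print(left_for_response)
```

Because both helpers take time since the send rather than time since the previous match, a slow early event automatically shrinks what is left for the later ones.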
Function calls
A function_call expectation asserts that the turn invoked one or more tools. List the expected calls under calls:; they’re matched by name in any order, and the expectation passes once all are found:
args is a subset check: every listed key/value must be present in the call’s arguments, and extra arguments are ignored. A single expected call can use the name:/args: shorthand directly on the expectation, and a bare function_call with neither just asserts that some call happened.
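As a rough model of these matching rules, the sketch below checks expected calls against observed ones. It is not the harness's code: the dictionary shapes, the `arguments` key, and the function names are assumptions, and values are compared with plain equality at the top level only.

```python
from typing import Any, Dict, List


def args_match(expected: Dict[str, Any], actual: Dict[str, Any]) -> bool:
    """Subset check: every expected key/value must appear in the actual call."""
    return all(key in actual and actual[key] == value
               for key, value in expected.items())


def calls_satisfied(expected_calls: List[Dict[str, Any]],
                    observed_calls: List[Dict[str, Any]]) -> bool:
    """True once every expected call is found by name, in any order."""
    for exp in expected_calls:
        found = any(
            obs["name"] == exp["name"]
            and args_match(exp.get("args", {}), obs.get("arguments", {}))
            for obs in observed_calls
        )
        if not found:
            return False
    return True


# Hypothetical tool calls observed during one turn.
observed = [
    {"name": "get_weather", "arguments": {"city": "Paris", "units": "metric"}},
    {"name": "get_time", "arguments": {"tz": "Europe/Paris"}},
]

# Order differs from the observed order and "units" is not listed: still a pass.
expected = [{"name": "get_time"}, {"name": "get_weather", "args": {"city": "Paris"}}]
print(calls_satisfied(expected, observed))
```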
Interruptions
send_after: schedules a turn’s user send relative to a prior event, which is how you script barge-in tests:
Vision turns
A turn may register an image with image: (a path relative to the scenario file). When a vision agent requests a user image during the turn, the eval transport serves it:
Text and audio modes
Two top-level blocks control a scenario’s modalities, and each has its own modality: field:
- user: sets how each turn’s utterance is delivered to the agent: sent as text, bypassing its STT (modality: text), or synthesized into real speech (modality: audio).
- judge: sets what the judge evaluates: the agent’s LLM text, with its TTS skipped (modality: text), or a transcription of its actual spoken audio (modality: audio).
If modality: isn’t specified, or a block is omitted entirely, it defaults to text. The two sides are also independent: you can drive the agent with text while judging its real speech, or speak to it and judge the LLM text.
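The defaulting rule is easy to state in code. This is a sketch over an already-parsed scenario mapping; the function name is hypothetical and the real loader may well be structured differently.

```python
from typing import Any, Dict


def resolve_modality(scenario: Dict[str, Any], block: str) -> str:
    """Return the modality for a top-level block ("user" or "judge").

    A missing block, or a block without modality:, falls back to "text".
    """
    section = scenario.get(block) or {}
    return section.get("modality", "text")


# Speak to the agent, but judge the LLM text: only user: asks for audio.
scenario = {"user": {"modality": "audio"}}
sides = (resolve_modality(scenario, "user"), resolve_modality(scenario, "judge"))
print(sides)
```

Each side is resolved on its own, which is what makes the mixed combinations possible.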
A scenario with neither block runs entirely in text mode. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. The judge LLM is the only service the harness itself needs (Ollama with gemma2:9b by default).
The top-level user: block only configures delivery. Each turn’s user: field is the utterance itself, and is written the same way in both modes.
User input with user:
Text (the default). Each turn’s utterance is sent to the agent as text, bypassing its STT. This needs no configuration; it’s equivalent to:
Audio. Each turn’s utterance is synthesized into real speech before it is sent to the agent. A speech: block (the TTS service and voice) is required:
The built-in speech services are kokoro, a local model and the recommended default, and cartesia (HTTP) when you want a cloud voice.
Judging with judge:
Text (the default). The agent’s TTS is skipped automatically, including any on-connect greeting, and the judge evaluates the LLM’s text output. Fast and silent; equivalent to:
Audio. The agent’s spoken audio is transcribed, and the response event becomes that transcription, so the judge evaluates what a user would actually have heard. This is the true end-to-end check: STT in, LLM in the middle, TTS out. The transcription: block is required:
The built-in transcribers are moonshine and whisper, both local models. When transcription.service: is omitted, it defaults to moonshine.
The judge.eval: block selects the judge LLM: ollama (the default, gemma2:9b), openai, or any OpenAI-compatible endpoint via endpoint:.
Custom services with factory:
To use a TTS or STT beyond the built-ins, both blocks accept a factory: escape hatch: a dotted path to a callable that receives the block’s mapping and the resolved sample rate, and returns the service. Any extra keys you put in the block are passed through to your factory:
my_evals/services.py
The service your factory returns must be a local model or an HTTP-based
service. WebSocket-streaming services aren’t supported: they need a running
pipeline to manage their connection lifecycle, and keeping them out keeps the
evals code simple.
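To make the factory mechanism concrete, here is a minimal sketch of both halves: resolving a dotted path to a callable, and a factory that builds a service from the block's mapping and the sample rate. Everything named here is hypothetical (`resolve_dotted_path`, `FakeTTS`, `make_tts`, the `voice` key), and the positional `(config, sample_rate)` signature is an assumption based on the description above, not a confirmed API.

```python
import importlib
from typing import Any, Callable, Dict


def resolve_dotted_path(path: str) -> Callable[..., Any]:
    """Import "package.module.attr" and return the attribute it names."""
    module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


class FakeTTS:
    """Hypothetical stand-in for a local or HTTP-based TTS service."""

    def __init__(self, voice: str, sample_rate: int) -> None:
        self.voice = voice
        self.sample_rate = sample_rate


def make_tts(config: Dict[str, Any], sample_rate: int) -> FakeTTS:
    """Hypothetical factory: extra keys in the block arrive in `config`."""
    return FakeTTS(voice=config.get("voice", "default"), sample_rate=sample_rate)


# The resolver works on any importable dotted path; json.dumps is used here
# only because it is always available.
dumps = resolve_dotted_path("json.dumps")
print(dumps({"ok": True}))

# A block as it might look after parsing, with one extra key passed through.
block = {"factory": "my_evals.services.make_tts", "voice": "narrator"}
service = make_tts(block, 24_000)
print(service.voice, service.sample_rate)
```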
EvalSpeech or EvalTranscriber directly and inject them through the library.
Sharing config across scenarios
Any value can be pulled from another file with !include, resolved relative to the scenario file. This keeps per-scenario noise down when a whole directory of scenarios shares the same audio setup:
Seed the conversation context with context:
By default the harness leaves the bot’s LLM context alone: whatever the bot sets up for itself — for example, a system prompt added in its connect handler — is what the scenario runs against. Provide context: to replace that with messages of your own, which lets a scenario start mid-conversation:
The harness delivers these messages in an LLMMessagesUpdateFrame that replaces the bot’s context wholesale. Omit context: and the harness sends nothing, leaving the bot’s own context in place.
Running scenarios back to back
By default the bot keeps running between scenarios. When a scenario ends its eval connection closes, but the eval transport suppresses the bot’s on_client_disconnected handler, so the pipeline stays up to serve the next scenario. This is what lets pipecat eval run a.yaml b.yaml c.yaml drive a whole list against one bot instance with no reboot between them, which keeps a run fast.
The trade-off is that anything the bot accumulated in one scenario is still there for the next. For results to be independent, each scenario has to start from a clean slate, and clearing that state is split between the harness and your bot:
- Conversation context: seed or clear it per scenario with context:. The harness replaces the bot’s LLM context with the messages you provide (via an LLMMessagesUpdateFrame); without it, the previous conversation carries forward, which is rarely what you want across independent scenarios.
- Application state: counters, flags, cached data, anything your bot holds outside the LLM context. The harness can’t see this, so resetting it is your bot’s job. A common place is the bot’s connect handler, which runs again for each scenario’s connection.
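One way to handle the application-state half is to keep that state in a single object and reset it at connect time. The sketch below is framework-free on purpose: `BotState`, its fields, and the plain `on_client_connected` function are hypothetical, and in a real bot the reset would sit inside whatever connect handler the bot already registers.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class BotState:
    """Hypothetical container for state held outside the LLM context."""

    orders_placed: int = 0
    cache: Dict[str, Any] = field(default_factory=dict)

    def reset(self) -> None:
        """Return every field to its starting value."""
        self.orders_placed = 0
        self.cache.clear()


state = BotState()


def on_client_connected() -> None:
    """Stand-in for the bot's connect handler: start each scenario clean."""
    state.reset()


# Scenario A leaves state behind...
state.orders_placed += 2
state.cache["last_city"] = "Paris"

# ...and scenario B's connection clears it before the first turn.
on_client_connected()
print(state)
```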
Exercising the disconnect path
Some bots do meaningful work in on_client_disconnected, such as a goodbye message, session teardown, or resource cleanup. Because that handler is suppressed by default, set trigger_disconnect: true on a scenario to fire it when that scenario ends:
Bots typically shut down their pipeline in on_client_disconnected, so a scenario with trigger_disconnect: true usually ends the bot process. Treat it as a terminal run, last in a list.
Enable it for every scenario in a run with pipecat eval run --trigger-disconnect; a scenario’s own trigger_disconnect field still takes precedence. This is independent of --stop-bot, which tears the bot down via an eval-cancel message regardless of the disconnect handler.
Next steps
The Eval Loop
Let a coding assistant write agent code, run evals, and iterate
automatically until the agent is better.