Overview
AssemblyAISTTService provides real-time speech recognition using AssemblyAI’s WebSocket API with support for interim results, end-of-turn detection, and configurable audio processing parameters for accurate transcription in conversational AI applications.
AssemblyAI STT API Reference
Pipecat’s API methods for AssemblyAI STT integration
Example Implementation
Example with AssemblyAI built-in turn detection
Universal-3 Pro Streaming
U3 Pro streaming documentation and features
U3 Pro API Reference
Complete U3 Pro streaming API reference
AssemblyAI Console
Access API keys and transcription features
Installation
To use AssemblyAI services, install the required dependency.
Prerequisites
AssemblyAI Account Setup
Before using AssemblyAI STT services, you need:
- AssemblyAI Account: Sign up at AssemblyAI Console
- API Key: Generate an API key from your dashboard
- Model Selection: Choose from available transcription models and features
Required Environment Variables
ASSEMBLYAI_API_KEY: Your AssemblyAI API key for authentication
Configuration
AssemblyAISTTService
AssemblyAI API key for authentication.
Language code for transcription. AssemblyAI currently supports English. Deprecated in v0.0.105. Use settings=AssemblyAISTTService.Settings(...) instead.
WebSocket endpoint URL. Override for custom or proxied deployments.
Audio sample rate in Hz.
Audio encoding format.
Connection configuration parameters. Deprecated in v0.0.105. Use settings=AssemblyAISTTService.Settings(...) instead. See AssemblyAIConnectionParams below for field mapping.
Controls turn detection mode.
- True (Pipecat mode, default): Forces AssemblyAI to return finals as soon as possible so that Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. When VAD detects that the user has stopped speaking, the service sends ForceEndpoint as a ceiling. The STT service does not emit UserStartedSpeakingFrame or UserStoppedSpeakingFrame.
- False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI’s model controls turn endings using built-in turn detection. AssemblyAI API defaults apply to all parameters unless explicitly set. The STT service emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame.
Whether to interrupt the bot when the user starts speaking in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies when using AssemblyAI’s built-in turn detection.
Optional format string for speaker labels when diarization is enabled. Use {speaker} for the speaker label and {text} for the transcript text. Example: "<{speaker}>{text}</{speaker}>" or "{speaker}: {text}". If None, transcript text is not modified.
Runtime-configurable settings for the STT service. See Settings below.
P99 latency from speech end to final transcript in seconds. Override for your
deployment.
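The speaker label template described above can be pictured with a few lines of plain Python. This is a minimal sketch that assumes the template is filled in the same way as Python's str.format; the helper name apply_speaker_format is hypothetical and is not part of Pipecat.

```python
from typing import Optional


def apply_speaker_format(speaker_format: Optional[str], speaker: str, text: str) -> str:
    """Fill a speaker template, or return the text unchanged when none is set.

    Hypothetical helper: it only illustrates the {speaker}/{text} placeholders.
    """
    if speaker_format is None:
        # Matches the documented behavior: no template, no modification.
        return text
    return speaker_format.format(speaker=speaker, text=text)


# Both example templates from the parameter description.
print(apply_speaker_format("<{speaker}>{text}</{speaker}>", "Speaker A", "Hello"))
print(apply_speaker_format("{speaker}: {text}", "Speaker B", "Hi there"))
```

The tag-style template is convenient when the transcript is passed to an LLM that should tell speakers apart, while the "name: text" form reads better in logs.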
AssemblyAIConnectionParams
Connection-level parameters previously passed via the connection_params constructor argument.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sample_rate | int | 16000 | Audio sample rate in Hz. |
| encoding | Literal | "pcm_s16le" | Audio encoding format. Options: "pcm_s16le", "pcm_mulaw". |
| end_of_turn_confidence_threshold | float | None | Confidence threshold for end-of-turn detection. |
| min_turn_silence | int | None | Minimum silence duration (ms) when confident about end-of-turn. |
| min_end_of_turn_silence_when_confident | int | None | DEPRECATED. Use min_turn_silence instead. Will be removed in a future version. |
| max_turn_silence | int | None | Maximum silence duration (ms) before forcing end-of-turn. |
| keyterms_prompt | List[str] | None | List of key terms to guide transcription. Will be JSON serialized before sending. |
| prompt | str | None | BETA: Optional text prompt to guide transcription. Only used with U3 Pro models. Cannot be used with keyterms_prompt. We suggest starting with no prompt. See AssemblyAI prompting best practices for guidance. |
| speech_model | Literal | "u3-rt-pro" | Required. Speech model to use. Options: "universal-streaming-english", "universal-streaming-multilingual", "u3-rt-pro", "u3-rt-pro-beta-1", "universal-3-5-pro". Defaults to "u3-rt-pro" if not specified. The U3 Pro family (u3-rt-pro, u3-rt-pro-beta-1, universal-3-5-pro) supports built-in turn detection, prompting, continuous partials, context carryover, and voice focus. |
| language_detection | bool | None | Enable automatic language detection. Only applicable to universal-streaming-multilingual. Turn messages include language information. |
| format_turns | bool | True | Whether to format transcript turns. Only applicable to universal-streaming-english and universal-streaming-multilingual models. For u3-rt-pro, formatting is automatic and built-in. |
| speaker_labels | bool | None | Enable speaker diarization. Final transcripts include a speaker field (e.g., “Speaker A”, “Speaker B”). |
| vad_threshold | float | None | Voice activity detection confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Only applicable to U3 Pro models. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection. The API default is 0.3. For best performance with an external VAD (e.g., Silero), align this value with your VAD’s activation threshold. Defaults to None (not sent, so the API default applies). |
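When moving old connection parameters over to Settings, the one renamed field in the table above is min_end_of_turn_silence_when_confident, which is replaced by min_turn_silence. The sketch below shows one way to carry a legacy value forward in a plain dict before building Settings. The function name migrate_connection_params is hypothetical, and the rule that an explicit min_turn_silence wins over the deprecated field is an assumption, not documented behavior.

```python
from typing import Any, Dict

DEPRECATED_FIELD = "min_end_of_turn_silence_when_confident"
REPLACEMENT_FIELD = "min_turn_silence"


def migrate_connection_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params with the deprecated silence field renamed.

    Assumption: if both names are present, the new name takes precedence.
    """
    migrated = dict(params)  # never mutate the caller's dict
    if DEPRECATED_FIELD in migrated:
        legacy_value = migrated.pop(DEPRECATED_FIELD)
        migrated.setdefault(REPLACEMENT_FIELD, legacy_value)
    return migrated


legacy = {"min_end_of_turn_silence_when_confident": 160, "max_turn_silence": 1200}
print(migrate_connection_params(legacy))
```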
Settings
Runtime-configurable settings passed via the settings constructor argument using AssemblyAISTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language \| str | Language.EN | Language for speech recognition. (Inherited from base STT settings.) |
| formatted_finals | bool | True | Whether to enable transcript formatting. |
| word_finalization_max_wait_time | int | None | Maximum time to wait for word finalization in milliseconds. |
| end_of_turn_confidence_threshold | float | None | Confidence threshold for end-of-turn detection. |
| min_turn_silence | int | None | Minimum silence duration (ms) when confident about end-of-turn. |
| max_turn_silence | int | None | Maximum silence duration (ms) before forcing end-of-turn. |
| keyterms_prompt | List[str] | None | List of key terms to guide transcription. |
| prompt | str | None | Optional text prompt to guide transcription (U3 Pro only). |
| language_detection | bool | None | Enable automatic language detection. |
| format_turns | bool | True | Whether to format transcript turns. |
| speaker_labels | bool | None | Enable speaker diarization. |
| vad_threshold | float | None | VAD confidence threshold (0.0–1.0) for classifying audio frames as silence (U3 Pro only). |
| domain | str | None | Optional domain for specialized recognition modes (e.g., "medical-v1" for Medical Mode). |
| continuous_partials | bool | True | Emit partial transcripts continuously during long turns (U3 Pro only). |
| interruption_delay | int | None | Override for how soon the first partial is emitted, in milliseconds (0–1000). U3 Pro only. |
| agent_context | str | None | Context carryover seed: the agent’s most recent spoken reply. Improves transcription of the user’s next turn (U3 Pro only). Clipped to ~1500 characters. |
| previous_context_n_turns | int | None | Maximum prior conversation entries (0–100) carried forward as context. Set to 0 to disable context carryover (U3 Pro only). Defaults to server default (3). |
| voice_focus | Literal["near-field", "far-field"] | None | Isolate primary voice and suppress background noise. Set to "near-field" for close mics or "far-field" for distant capture (U3 Pro only). |
| voice_focus_threshold | float | None | How aggressively background audio is suppressed (0.0–1.0). Only takes effect when voice_focus is set (U3 Pro only). |
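Several rows in the table above carry numeric ranges or depend on another field. The following sketch collects those constraints into a small pre-flight check that can run before values are handed to Settings. It is illustrative only: check_settings is a hypothetical helper, and nothing here says whether Pipecat or AssemblyAI validates these values in the same way.

```python
from typing import Any, Dict, List

# Ranges copied from the Settings table. None always means "not set".
RANGES = {
    "vad_threshold": (0.0, 1.0),
    "interruption_delay": (0, 1000),
    "previous_context_n_turns": (0, 100),
    "voice_focus_threshold": (0.0, 1.0),
}


def check_settings(values: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means no issue was found."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = values.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside {low}-{high}")
    # prompt and keyterms_prompt are documented as mutually exclusive.
    if values.get("prompt") is not None and values.get("keyterms_prompt"):
        problems.append("prompt cannot be combined with keyterms_prompt")
    if values.get("voice_focus") not in (None, "near-field", "far-field"):
        problems.append("voice_focus must be 'near-field' or 'far-field'")
    # The threshold only takes effect when voice_focus is set.
    if values.get("voice_focus_threshold") is not None and values.get("voice_focus") is None:
        problems.append("voice_focus_threshold has no effect unless voice_focus is set")
    return problems


print(check_settings({"interruption_delay": 1500, "voice_focus": "near-field"}))
```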
Usage
Basic Setup
With Custom Settings
With AssemblyAI Built-in Turn Detection
AssemblyAI’s u3-rt-pro model supports built-in turn detection for more natural conversation flow.
With Speaker Diarization
Enable speaker identification for multi-party conversations.
With Context Carryover
Context carryover (U3 Pro only) improves transcription by giving the model memory of recent conversation turns. Set previous_context_n_turns=0 to disable automatic carryover.
With Voice Focus
Voice focus (U3 Pro only) isolates the primary speaker and suppresses background noise. Use "near-field" for close-talking mics (headsets, handsets) or "far-field" for distant capture (conference rooms, laptop mics).
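The options covered in this section can be combined in a single constructor call. The sketch below is one possible arrangement, not the page's official example. Two things in it are assumptions that do not appear in the text above: the import path pipecat.services.assemblyai.stt and the api_key keyword name. Check both against your installed Pipecat version. The import is done inside the function so the file loads even where pipecat is not installed.

```python
import os


def make_stt_service():
    """Build an STT service using AssemblyAI turn detection, diarization,
    context carryover, and voice focus together."""
    # Assumption: import path for recent Pipecat releases.
    from pipecat.services.assemblyai.stt import AssemblyAISTTService

    return AssemblyAISTTService(
        # Assumption: the API key keyword is named `api_key`.
        api_key=os.environ["ASSEMBLYAI_API_KEY"],
        # False hands turn endings to AssemblyAI's model (U3 Pro only).
        vad_force_turn_endpoint=False,
        settings=AssemblyAISTTService.Settings(
            speaker_labels=True,          # speaker diarization
            previous_context_n_turns=3,   # context carryover window; 0 disables
            voice_focus="near-field",     # headset or handset capture
        ),
        # Applied to final transcripts when diarization is enabled.
        speaker_format="{speaker}: {text}",
    )
```

Leaving vad_force_turn_endpoint at its default of True keeps Pipecat in charge of turn endings, in which case the remaining arguments can stay as shown.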
Methods
update_agent_context()
text (str): The agent’s spoken reply text. Clipped to ~1500 characters.
It is automatically invoked each time an assistant turn is completed.
Notes
- U3 Pro models: The default model is u3-rt-pro. The U3 Pro family includes u3-rt-pro, u3-rt-pro-beta-1, and universal-3-5-pro, all supporting built-in turn detection, prompting, continuous partials, context carryover, and voice focus.
- Turn detection modes:
  - Pipecat mode (vad_force_turn_endpoint=True, default): Forces AssemblyAI to return finals ASAP so Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. The service sends a ForceEndpoint message when VAD detects the user has stopped speaking.
  - AssemblyAI mode (vad_force_turn_endpoint=False, U3 Pro only): AssemblyAI’s model controls turn endings using built-in turn detection. The service emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame based on AssemblyAI’s detection.
- Context carryover (U3 Pro only): Seed the agent’s most recent reply, improving transcription of the user’s next turn (short answers, spelled-out entities, disambiguation). update_agent_context() is automatically invoked each time an assistant turn is completed. Control the window size with previous_context_n_turns (0–100, default 3); set to 0 to disable carryover entirely.
- Voice focus (U3 Pro only): Set voice_focus to "near-field" or "far-field" to isolate the primary voice and suppress background noise. Tune suppression strength with voice_focus_threshold (0.0–1.0, higher values suppress more).
- Speaker diarization: Enable speaker_labels=True in Settings to automatically identify different speakers. Final transcripts will include a speaker field (e.g., “Speaker A”, “Speaker B”). Use the speaker_format parameter to format transcripts with speaker labels.
- Language detection: When using universal-streaming-multilingual with language_detection=True, Turn messages include language_code and language_confidence fields for automatic language detection.
- Prompting: The prompt parameter (U3 Pro only) allows you to guide transcription for specific names, terms, or domain vocabulary. This is a beta feature; AssemblyAI recommends testing without a prompt first. Cannot be used with keyterms_prompt.
- Dynamic settings updates: Most settings can be updated at runtime using STTUpdateSettingsFrame. agent_context is hot-updatable without reconnecting; other settings require a reconnect.
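The reconnect rule in the last note can be stated as a one-line predicate. This sketch only models the rule as written (agent_context is hot-updatable, everything else reconnects). The service makes this decision internally, and the names HOT_UPDATABLE and needs_reconnect are hypothetical.

```python
from typing import Iterable

# Per the note above, only agent_context can change without a reconnect.
HOT_UPDATABLE = frozenset({"agent_context"})


def needs_reconnect(changed_fields: Iterable[str]) -> bool:
    """True when at least one changed field is not hot-updatable."""
    return any(name not in HOT_UPDATABLE for name in changed_fields)


print(needs_reconnect({"agent_context"}))                  # only the seed text changed
print(needs_reconnect({"agent_context", "voice_focus"}))   # voice_focus forces a reconnect
```

In practice this suggests batching changes to other settings into a single update, since each update that touches them implies a new WebSocket connection.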
Event Handlers
AssemblyAI STT supports the standard service connection events, plus turn-level events for conversation tracking:
| Event | Description |
|---|---|
| on_connected | Connected to AssemblyAI WebSocket |
| on_disconnected | Disconnected from AssemblyAI WebSocket |
| on_end_of_turn | End of turn detected (fires after final transcript is pushed) |
The on_end_of_turn event receives (service, transcript), where transcript is the final transcript text. It fires after the final transcript is pushed, which provides a reliable hook for end-of-turn logic that doesn’t race with TranscriptionFrame. It works in both Pipecat and AssemblyAI turn detection modes.
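A handler for on_end_of_turn is an async function taking the (service, transcript) pair described above. The sketch below defines one and calls it directly so that it runs without a live connection. The registration step is shown only as a comment, and it assumes Pipecat's decorator-style event_handler registration on an existing service instance named stt, neither of which is spelled out on this page.

```python
import asyncio
from typing import List

# Collected final transcripts, one entry per completed user turn.
turn_log: List[str] = []


async def on_end_of_turn(service, transcript: str) -> None:
    """Record the final transcript of a finished turn."""
    cleaned = transcript.strip()
    if cleaned:  # ignore empty turns
        turn_log.append(cleaned)


# Assumed registration on a real service instance (not executed here):
#
#     @stt.event_handler("on_end_of_turn")
#     async def handle_end_of_turn(service, transcript):
#         await on_end_of_turn(service, transcript)

# Simulate one event; `None` stands in for the service object.
asyncio.run(on_end_of_turn(None, "  book a table for two  "))
print(turn_log)
```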