Overview
AssemblyAISTTService provides real-time speech recognition using AssemblyAI’s WebSocket API with support for interim results, end-of-turn detection, and configurable audio processing parameters for accurate transcription in conversational AI applications.
AssemblyAI STT API Reference
Pipecat’s API methods for AssemblyAI STT integration
Example Implementation
Example with AssemblyAI built-in turn detection
AssemblyAI Documentation
Official AssemblyAI documentation and features
AssemblyAI Console
Access API keys and transcription features
Installation
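Pipecat distributes provider integrations as optional extras; the extra name below is assumed from that convention and should be checked against the Pipecat release you use:

```shell
pip install "pipecat-ai[assemblyai]"
```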
To use AssemblyAI services, install the required dependency.

Prerequisites
AssemblyAI Account Setup
Before using AssemblyAI STT services, you need:
- AssemblyAI Account: Sign up at AssemblyAI Console
- API Key: Generate an API key from your dashboard
- Model Selection: Choose from available transcription models and features
Required Environment Variables
ASSEMBLYAI_API_KEY: Your AssemblyAI API key for authentication
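For local development, the key can be provided via the environment (placeholder value shown):

```shell
# Placeholder value; substitute your real key from the AssemblyAI Console.
export ASSEMBLYAI_API_KEY="your-assemblyai-api-key"
```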
Configuration
AssemblyAISTTService
AssemblyAI API key for authentication.
Language code for transcription. AssemblyAI currently supports English for this setting; for other languages, use the universal-streaming-multilingual speech model.
WebSocket endpoint URL. Override for custom or proxied deployments.
Connection configuration parameters. See
AssemblyAIConnectionParams below.
Controls turn detection mode. When True (Pipecat mode, default): forces AssemblyAI to return finals ASAP so Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. VAD stop sends ForceEndpoint as a ceiling. No UserStarted/StoppedSpeakingFrame is emitted from STT. When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI’s model controls turn endings using built-in turn detection, uses AssemblyAI API defaults for all parameters unless explicitly set, and emits UserStarted/StoppedSpeakingFrame from STT.

Whether to interrupt the bot when the user starts speaking in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies when using AssemblyAI’s built-in turn detection.

Optional format string for speaker labels when diarization is enabled. Use {speaker} for the speaker label and {text} for the transcript text. Example: "<{speaker}>{text}</{speaker}>" or "{speaker}: {text}". If None, transcript text is not modified.

P99 latency from speech end to final transcript, in seconds. Override for your deployment.
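The effect of a speaker format string can be illustrated with plain string formatting. This is a sketch of the documented behavior, not Pipecat's internal implementation:

```python
def apply_speaker_format(template, speaker, text):
    """Apply a speaker_format-style template to a diarized transcript.

    If no template is configured (None), the transcript text is returned
    unchanged, mirroring the documented default behavior.
    """
    if template is None:
        return text
    # str.format substitutes every {speaker} and {text} placeholder.
    return template.format(speaker=speaker, text=text)


# Both documented template styles:
print(apply_speaker_format("<{speaker}>{text}</{speaker}>", "Speaker A", "Hello there."))
# → <Speaker A>Hello there.</Speaker A>
print(apply_speaker_format("{speaker}: {text}", "Speaker B", "Hi."))
# → Speaker B: Hi.
```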
AssemblyAIConnectionParams
Connection-level parameters passed via the connection_params constructor argument.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sample_rate | int | 16000 | Audio sample rate in Hz. |
| encoding | Literal | "pcm_s16le" | Audio encoding format. Options: "pcm_s16le", "pcm_mulaw". |
| formatted_finals | bool | True | Whether to enable transcript formatting. |
| word_finalization_max_wait_time | int | None | Maximum time to wait for word finalization, in milliseconds. |
| end_of_turn_confidence_threshold | float | None | Confidence threshold for end-of-turn detection. |
| min_turn_silence | int | None | Minimum silence duration (ms) when confident about end-of-turn. |
| min_end_of_turn_silence_when_confident | int | None | DEPRECATED. Use min_turn_silence instead. Will be removed in a future version. |
| max_turn_silence | int | None | Maximum silence duration (ms) before forcing end-of-turn. |
| keyterms_prompt | List[str] | None | List of key terms to guide transcription. JSON-serialized before sending. |
| prompt | str | None | Optional text prompt to guide transcription. Only used when speech_model is "u3-rt-pro". Cannot be used with keyterms_prompt. |
| speech_model | Literal | "u3-rt-pro" | Speech model. Options: "universal-streaming-english", "universal-streaming-multilingual", "u3-rt-pro". |
| language_detection | bool | None | Enable automatic language detection. Only applicable to universal-streaming-multilingual; Turn messages include language information. |
| format_turns | bool | True | Whether to format transcript turns. |
| speaker_labels | bool | None | Enable speaker diarization. Final transcripts include a speaker field (e.g., “Speaker A”, “Speaker B”). |
Usage
Basic Setup
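A minimal configuration sketch; the module path follows Pipecat's per-provider layout and may differ between versions:

```python
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

# Reads the API key from the ASSEMBLYAI_API_KEY environment variable.
stt = AssemblyAISTTService(api_key=os.getenv("ASSEMBLYAI_API_KEY"))
```

Add the service to your pipeline as you would any Pipecat STT service, between the transport input and downstream processors.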
With Custom Connection Parameters
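A sketch of passing connection-level parameters; the import path for the params model is assumed from Pipecat's module layout, and the numeric values are illustrative, not recommendations:

```python
import os

# Import paths assumed; verify against your Pipecat version.
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        sample_rate=16000,
        formatted_finals=True,
        # Key terms to bias transcription toward; JSON-serialized before sending.
        keyterms_prompt=["Pipecat", "AssemblyAI"],
        end_of_turn_confidence_threshold=0.7,
        min_turn_silence=160,
        max_turn_silence=2400,
    ),
)
```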
With AssemblyAI Built-in Turn Detection
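A sketch of switching to AssemblyAI's built-in turn detection by disabling Pipecat mode (import paths assumed from Pipecat's per-provider layout):

```python
import os

# Import paths assumed; verify against your Pipecat version.
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    # False = AssemblyAI mode: the model decides when a turn ends, and the
    # service emits UserStarted/StoppedSpeakingFrame itself.
    vad_force_turn_endpoint=False,
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",  # built-in turn detection requires u3-rt-pro
    ),
)
```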
AssemblyAI’s u3-rt-pro model supports built-in turn detection for more natural conversation flow.

With Speaker Diarization
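A sketch of enabling diarization together with a speaker label format (import paths assumed from Pipecat's per-provider layout):

```python
import os

# Import paths assumed; verify against your Pipecat version.
from pipecat.services.assemblyai.stt import AssemblyAISTTService
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    # Final transcripts render as e.g. "Speaker A: Hello there."
    speaker_format="{speaker}: {text}",
    connection_params=AssemblyAIConnectionParams(
        speaker_labels=True,  # final transcripts include a speaker field
    ),
)
```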
Enable speaker identification for multi-party conversations.

Notes
- u3-rt-pro model: The default model is now u3-rt-pro, which provides the best performance and supports built-in turn detection.
- Turn detection modes:
  - Pipecat mode (vad_force_turn_endpoint=True, default): Forces AssemblyAI to return finals ASAP so Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. The service sends a ForceEndpoint message when VAD detects the user has stopped speaking.
  - AssemblyAI mode (vad_force_turn_endpoint=False, u3-rt-pro only): AssemblyAI’s model controls turn endings using built-in turn detection. The service emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame based on AssemblyAI’s detection.
- Speaker diarization: Enable speaker_labels=True in connection_params to automatically identify different speakers. Final transcripts will include a speaker field (e.g., “Speaker A”, “Speaker B”). Use the speaker_format parameter to format transcripts with speaker labels.
- Language detection: When using universal-streaming-multilingual with language_detection=True, Turn messages include language_code and language_confidence fields for automatic language detection.
- Prompting: The prompt parameter (u3-rt-pro only) allows you to guide transcription for specific names, terms, or domain vocabulary. This is a beta feature; AssemblyAI recommends testing without a prompt first. Cannot be used with keyterms_prompt.
- Formatted finals: When formatted_finals=True, the service waits for formatted transcripts before emitting final TranscriptionFrames. This provides properly formatted text but may introduce a small delay.
- Dynamic settings updates: You can update keyterms_prompt, prompt, min_turn_silence, and max_turn_silence at runtime using STTUpdateSettingsFrame without reconnecting.
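The runtime update can be sketched as follows; STTUpdateSettingsFrame is from Pipecat's frames module, and the settings keys mirror the connection parameters documented above (the surrounding task object is assumed to be your PipelineTask):

```python
from pipecat.frames.frames import STTUpdateSettingsFrame

async def update_stt_settings(task):
    # Queue a settings update into the running pipeline; the service
    # applies the new values without reconnecting the WebSocket.
    await task.queue_frame(
        STTUpdateSettingsFrame(
            settings={
                "keyterms_prompt": ["Pipecat", "AssemblyAI"],
                "min_turn_silence": 160,
            }
        )
    )
```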
Event Handlers
AssemblyAI STT supports the standard service connection events:

| Event | Description |
|---|---|
| on_connected | Connected to AssemblyAI WebSocket |
| on_disconnected | Disconnected from AssemblyAI WebSocket |
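Handlers can be registered with Pipecat's event_handler decorator; the handler signature shown here is an assumption and may vary by version:

```python
# `stt` is an AssemblyAISTTService instance created as in the examples above.

@stt.event_handler("on_connected")
async def on_connected(service):
    # Fires once the WebSocket connection to AssemblyAI is established.
    print("Connected to AssemblyAI")

@stt.event_handler("on_disconnected")
async def on_disconnected(service):
    # Fires when the WebSocket connection closes.
    print("Disconnected from AssemblyAI")
```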