
Overview

AssemblyAISTTService provides real-time speech recognition using AssemblyAI’s WebSocket API with support for interim results, end-of-turn detection, and configurable audio processing parameters for accurate transcription in conversational AI applications.

AssemblyAI STT API Reference

Pipecat’s API methods for AssemblyAI STT integration

Example Implementation

Example with AssemblyAI built-in turn detection

Universal-3 Pro Streaming

U3 Pro streaming documentation and features

U3 Pro API Reference

Complete U3 Pro streaming API reference

AssemblyAI Console

Access API keys and transcription features

Installation

To use AssemblyAI services, install the required dependency:
uv add "pipecat-ai[assemblyai]"

Prerequisites

AssemblyAI Account Setup

Before using AssemblyAI STT services, you need:
  1. AssemblyAI Account: Sign up at AssemblyAI Console
  2. API Key: Generate an API key from your dashboard
  3. Model Selection: Choose from available transcription models and features

Required Environment Variables

  • ASSEMBLYAI_API_KEY: Your AssemblyAI API key for authentication
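Because `os.getenv` returns `None` when a variable is unset, a missing key may only surface later as an authentication failure. A minimal sketch of an early check follows; `require_env` is a hypothetical helper written for this illustration, not part of Pipecat.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; not part of Pipecat or AssemblyAI.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage: api_key = require_env("ASSEMBLYAI_API_KEY")
```

The returned string can then be passed as `api_key` to the service constructor.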

Configuration

AssemblyAISTTService

api_key
str
required
AssemblyAI API key for authentication.
language
Language
default:"Language.EN"
deprecated
Language code for transcription. AssemblyAI currently supports English. Deprecated in v0.0.105. Use settings=AssemblyAISTTService.Settings(...) instead.
api_endpoint_base_url
str
default:"wss://streaming.assemblyai.com/v3/ws"
WebSocket endpoint URL. Override for custom or proxied deployments.
sample_rate
int
default:"16000"
Audio sample rate in Hz.
encoding
str
default:"pcm_s16le"
Audio encoding format.
connection_params
AssemblyAIConnectionParams
default:"None"
deprecated
Connection configuration parameters. Deprecated in v0.0.105. Use settings=AssemblyAISTTService.Settings(...) instead. See AssemblyAIConnectionParams below for field mapping.
vad_force_turn_endpoint
bool
default:"True"
Controls the turn detection mode.
  • When True (Pipecat mode, default): Forces AssemblyAI to return finals as soon as possible so that Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. When VAD detects that the user has stopped speaking, the service sends a ForceEndpoint message as a ceiling. The STT service does not emit UserStartedSpeakingFrame or UserStoppedSpeakingFrame.
  • When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI’s model controls turn endings using built-in turn detection. AssemblyAI API defaults apply to all parameters unless explicitly set. The STT service emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame.
should_interrupt
bool
default:"True"
Whether to interrupt the bot when the user starts speaking in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies when using AssemblyAI’s built-in turn detection.
speaker_format
str | None
default:"None"
Optional format string for speaker labels when diarization is enabled. Use {speaker} for speaker label and {text} for transcript text. Example: "<{speaker}>{text}</{speaker}>" or "{speaker}: {text}". If None, transcript text is not modified.
settings
AssemblyAISTTService.Settings
default:"None"
Runtime-configurable settings for the STT service. See Settings below.
ttfs_p99_latency
float
default:"ASSEMBLYAI_TTFS_P99"
P99 latency from speech end to final transcript in seconds. Override for your deployment.

AssemblyAIConnectionParams

connection_params is deprecated as of v0.0.105. Use settings=AssemblyAISTTService.Settings(...) instead. The sample_rate and encoding fields remain as direct constructor arguments. All other fields have moved into Settings — speech_model maps to model.
Connection-level parameters previously passed via the connection_params constructor argument.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| sample_rate | int | 16000 | Audio sample rate in Hz. |
| encoding | Literal | "pcm_s16le" | Audio encoding format. Options: "pcm_s16le", "pcm_mulaw". |
| end_of_turn_confidence_threshold | float | None | Confidence threshold for end-of-turn detection. |
| min_turn_silence | int | None | Minimum silence duration (ms) when confident about end-of-turn. |
| min_end_of_turn_silence_when_confident | int | None | DEPRECATED. Use min_turn_silence instead. Will be removed in a future version. |
| max_turn_silence | int | None | Maximum silence duration (ms) before forcing end-of-turn. |
| keyterms_prompt | List[str] | None | List of key terms to guide transcription. Will be JSON serialized before sending. |
| prompt | str | None | BETA: Optional text prompt to guide transcription. Only used with U3 Pro models. Cannot be used with keyterms_prompt. We suggest starting with no prompt. See AssemblyAI prompting best practices for guidance. |
| speech_model | Literal | "u3-rt-pro" | Speech model to use. Options: "universal-streaming-english", "universal-streaming-multilingual", "u3-rt-pro", "u3-rt-pro-beta-1", "universal-3-5-pro". Defaults to "u3-rt-pro" if not specified. The U3 Pro family (u3-rt-pro, u3-rt-pro-beta-1, universal-3-5-pro) supports built-in turn detection, prompting, continuous partials, context carryover, and voice focus. |
| language_detection | bool | None | Enable automatic language detection. Only applicable to universal-streaming-multilingual. Turn messages include language information. |
| format_turns | bool | True | Whether to format transcript turns. Only applicable to universal-streaming-english and universal-streaming-multilingual models. For u3-rt-pro, formatting is automatic and built-in. |
| speaker_labels | bool | None | Enable speaker diarization. Final transcripts include a speaker field (e.g., “Speaker A”, “Speaker B”). |
| vad_threshold | float | None | Voice activity detection (VAD) confidence threshold (0.0 to 1.0) for classifying audio frames as silence. Only applicable to U3 Pro models. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection. For best performance with an external VAD (e.g., Silero), align this value with your VAD’s activation threshold. When None (the default), the parameter is not sent and the API default of 0.3 applies. |
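The field mapping described above can be sketched as a small migration helper. This is an illustration under stated assumptions, not Pipecat code: `split_connection_params` is a hypothetical function, and the rename table only encodes the two mappings this page names (speech_model to model, and min_end_of_turn_silence_when_confident to min_turn_silence).

```python
from typing import Any

# Fields that remain direct constructor arguments, per the deprecation note.
CONSTRUCTOR_FIELDS = {"sample_rate", "encoding"}

# Old field name -> Settings field name, as named on this page.
RENAMED_FIELDS = {
    "speech_model": "model",
    "min_end_of_turn_silence_when_confident": "min_turn_silence",
}


def split_connection_params(
    params: dict[str, Any],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split old connection_params values into constructor and Settings kwargs.

    Hypothetical helper for illustration; unset (None) values are dropped.
    """
    constructor_kwargs: dict[str, Any] = {}
    settings_kwargs: dict[str, Any] = {}
    for name, value in params.items():
        if value is None:
            continue
        if name in CONSTRUCTOR_FIELDS:
            constructor_kwargs[name] = value
        elif name in RENAMED_FIELDS:
            # A value given under the new name always wins over the old name.
            settings_kwargs.setdefault(RENAMED_FIELDS[name], value)
        else:
            settings_kwargs[name] = value
    return constructor_kwargs, settings_kwargs
```

The two returned dicts would then be passed as `AssemblyAISTTService(api_key=..., **constructor_kwargs, settings=AssemblyAISTTService.Settings(**settings_kwargs))`.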

Settings

Runtime-configurable settings passed via the settings constructor argument using AssemblyAISTTService.Settings(...). These can be updated mid-conversation with STTUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | None | STT model identifier. (Inherited from base STT settings.) |
| language | Language or str | Language.EN | Language for speech recognition. (Inherited from base STT settings.) |
| formatted_finals | bool | True | Whether to enable transcript formatting. |
| word_finalization_max_wait_time | int | None | Maximum time to wait for word finalization in milliseconds. |
| end_of_turn_confidence_threshold | float | None | Confidence threshold for end-of-turn detection. |
| min_turn_silence | int | None | Minimum silence duration (ms) when confident about end-of-turn. |
| max_turn_silence | int | None | Maximum silence duration (ms) before forcing end-of-turn. |
| keyterms_prompt | List[str] | None | List of key terms to guide transcription. |
| prompt | str | None | Optional text prompt to guide transcription (U3 Pro only). |
| language_detection | bool | None | Enable automatic language detection. |
| format_turns | bool | True | Whether to format transcript turns. |
| speaker_labels | bool | None | Enable speaker diarization. |
| vad_threshold | float | None | VAD confidence threshold (0.0–1.0) for classifying audio frames as silence (U3 Pro only). |
| domain | str | None | Optional domain for specialized recognition modes (e.g., "medical-v1" for Medical Mode). |
| continuous_partials | bool | True | Emit partial transcripts continuously during long turns (U3 Pro only). |
| interruption_delay | int | None | Override for how soon the first partial is emitted, in milliseconds (0–1000). U3 Pro only. |
| agent_context | str | None | Context carryover seed: the agent’s most recent spoken reply. Improves transcription of the user’s next turn (U3 Pro only). Clipped to ~1500 characters. |
| previous_context_n_turns | int | None | Maximum prior conversation entries (0–100) carried forward as context. Set to 0 to disable context carryover (U3 Pro only). When None, the server default (3) applies. |
| voice_focus | Literal["near-field", "far-field"] | None | Isolate primary voice and suppress background noise. Set to "near-field" for close mics or "far-field" for distant capture (U3 Pro only). |
| voice_focus_threshold | float | None | How aggressively background audio is suppressed (0.0–1.0). Only takes effect when voice_focus is set (U3 Pro only). |
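Several of these settings carry documented ranges or exclusions. The sketch below collects them into a client-side check that can run before constructing Settings. `check_settings` and `RANGES` are hypothetical names for this illustration; the limits are the ones stated on this page, and whether Pipecat or the AssemblyAI API enforces them in the same way is not confirmed here.

```python
from typing import Any

# (min, max) ranges as stated in the Settings table.
RANGES = {
    "vad_threshold": (0.0, 1.0),
    "voice_focus_threshold": (0.0, 1.0),
    "interruption_delay": (0, 1000),
    "previous_context_n_turns": (0, 100),
}


def check_settings(values: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a dict of Settings kwargs.

    Hypothetical helper for illustration; an empty list means no problem found.
    """
    problems = []
    for name, (low, high) in RANGES.items():
        value = values.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside {low}-{high}")

    # prompt and keyterms_prompt are documented as mutually exclusive.
    if values.get("prompt") is not None and values.get("keyterms_prompt") is not None:
        problems.append("prompt cannot be combined with keyterms_prompt")

    voice_focus = values.get("voice_focus")
    if voice_focus is not None and voice_focus not in ("near-field", "far-field"):
        problems.append(f"voice_focus={voice_focus!r} is not a documented option")
    if values.get("voice_focus_threshold") is not None and voice_focus is None:
        problems.append("voice_focus_threshold has no effect unless voice_focus is set")
    return problems
```

For example, `check_settings({"interruption_delay": 1500})` reports one out-of-range problem, while `check_settings({"vad_threshold": 0.3})` returns an empty list.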

Usage

Basic Setup

import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
)

With Custom Settings

import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        keyterms_prompt=["Pipecat", "AssemblyAI"],
    ),
    vad_force_turn_endpoint=True,
)

With AssemblyAI Built-in Turn Detection

AssemblyAI’s u3-rt-pro model supports built-in turn detection for more natural conversation flow:
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,  # Use AssemblyAI's built-in turn detection
    settings=AssemblyAISTTService.Settings(
        # Optional: Tune turn detection timing
        min_turn_silence=100,  # Minimum silence (ms) when confident about end-of-turn
        max_turn_silence=1000,  # Maximum silence (ms) before forcing end-of-turn
    ),
)
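Since AssemblyAI turn detection mode is limited to U3 Pro models, it can help to fail early when the mode and model do not match. The following is a minimal sketch of that check; `turn_detection_kwargs` is a hypothetical helper, and treating the whole U3 Pro family as eligible is an assumption based on the model list on this page.

```python
# Model names listed on this page as the U3 Pro family.
U3_PRO_MODELS = {"u3-rt-pro", "u3-rt-pro-beta-1", "universal-3-5-pro"}


def turn_detection_kwargs(model: str, use_assemblyai_turns: bool) -> dict:
    """Build the vad_force_turn_endpoint kwarg for the chosen mode.

    Hypothetical helper for illustration; raises if AssemblyAI turn
    detection is requested for a model outside the U3 Pro family.
    """
    if use_assemblyai_turns and model not in U3_PRO_MODELS:
        raise ValueError(
            f"AssemblyAI turn detection needs a U3 Pro model, got {model!r}"
        )
    # Pipecat mode is vad_force_turn_endpoint=True; AssemblyAI mode is False.
    return {"vad_force_turn_endpoint": not use_assemblyai_turns}
```

The result can be unpacked into the constructor call, for example `AssemblyAISTTService(api_key=..., **turn_detection_kwargs("u3-rt-pro", True))`.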

With Speaker Diarization

Enable speaker identification for multi-party conversations:
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        speaker_labels=True,  # Enable speaker diarization
    ),
    speaker_format="{speaker}: {text}",  # Format transcripts with speaker labels
)
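To preview what a given speaker_format produces, the template can be rendered locally. This sketch assumes the placeholders are filled with Python `str.format`-style substitution, which this page does not confirm; `render_transcript` is a hypothetical helper, not the service's implementation.

```python
from typing import Optional


def render_transcript(speaker_format: Optional[str], speaker: str, text: str) -> str:
    """Render a transcript the way a speaker_format template suggests.

    Hypothetical helper for illustration: assumes str.format-style
    substitution of {speaker} and {text}. None leaves the text unchanged.
    """
    if speaker_format is None:
        return text
    return speaker_format.format(speaker=speaker, text=text)


print(render_transcript("{speaker}: {text}", "Speaker A", "Hello"))
print(render_transcript("<{speaker}>{text}</{speaker}>", "Speaker B", "Hi"))
```

Under that assumption, the two calls print `Speaker A: Hello` and `<Speaker B>Hi</Speaker B>`.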

With Context Carryover

Context carryover (U3 Pro only) improves transcription by giving the model memory of recent conversation turns:
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        model="universal-3-5-pro",
        previous_context_n_turns=5,  # Keep last 5 turns (if None, server default is 3)
    ),
)
Context carryover helps with short answers, spelled-out entities (emails, IDs), and similar-sounding words. Set previous_context_n_turns=0 to disable automatic carryover.

With Voice Focus

Voice focus (U3 Pro only) isolates the primary speaker and suppresses background noise:
import os

from pipecat.services.assemblyai.stt import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    settings=AssemblyAISTTService.Settings(
        voice_focus="near-field",  # Or "far-field" for distant capture
        voice_focus_threshold=0.5,  # 0.0-1.0, higher suppresses more
    ),
)
Use "near-field" for close-talking mics (headsets, handsets) or "far-field" for distant capture (conference rooms, laptop mics).

Methods

update_agent_context()

async def update_agent_context(text: str)
Send the agent’s latest spoken reply to AssemblyAI as carryover context (U3 Pro only). Improves transcription of the user’s next turn — short answers, spelled-out entities, disambiguation. No-op for non-U3 Pro models. Parameters:
  • text (str): The agent’s spoken reply text. Clipped to ~1500 characters.
Example:
# After the agent finishes speaking
await stt.update_agent_context("Your order total is $42.50.")
This method is also invoked automatically each time an assistant turn completes, so calling it manually is optional.

Notes

  • U3 Pro models: The default model is u3-rt-pro. The U3 Pro family includes u3-rt-pro, u3-rt-pro-beta-1, and universal-3-5-pro, all supporting built-in turn detection, prompting, continuous partials, context carryover, and voice focus.
  • Turn detection modes:
    • Pipecat mode (vad_force_turn_endpoint=True, default): Forces AssemblyAI to return finals ASAP so Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. The service sends a ForceEndpoint message when VAD detects the user has stopped speaking.
    • AssemblyAI mode (vad_force_turn_endpoint=False, U3 Pro only): AssemblyAI’s model controls turn endings using built-in turn detection. The service emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame based on AssemblyAI’s detection.
  • Context carryover (U3 Pro only): Seeds the model with the agent’s most recent reply, which improves transcription of the user’s next turn (short answers, spelled-out entities, disambiguation). update_agent_context() is invoked automatically each time an assistant turn completes. Control the window size with previous_context_n_turns (0–100, default 3); set it to 0 to disable carryover entirely.
  • Voice focus (U3 Pro only): Set voice_focus to "near-field" or "far-field" to isolate the primary voice and suppress background noise. Tune suppression strength with voice_focus_threshold (0.0–1.0, higher values suppress more).
  • Speaker diarization: Enable speaker_labels=True in Settings to automatically identify different speakers. Final transcripts will include a speaker field (e.g., “Speaker A”, “Speaker B”). Use the speaker_format parameter to format transcripts with speaker labels.
  • Language detection: When using universal-streaming-multilingual with language_detection=True, Turn messages include language_code and language_confidence fields for automatic language detection.
  • Prompting: The prompt parameter (U3 Pro only) allows you to guide transcription for specific names, terms, or domain vocabulary. This is a beta feature - AssemblyAI recommends testing without a prompt first. Cannot be used with keyterms_prompt.
  • Dynamic settings updates: Most settings can be updated at runtime using STTUpdateSettingsFrame. agent_context is hot-updatable without reconnecting; other settings require a reconnect.
The connection_params= / InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.
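The dynamic settings note above implies that a runtime update is cheap only when it touches agent_context alone. A minimal sketch of that rule follows, for code that wants to anticipate a reconnect before sending an update. `needs_reconnect` and `HOT_UPDATABLE` are hypothetical names; the rule is taken from the note, not from Pipecat's source.

```python
from typing import Iterable

# Settings documented as updatable without reconnecting.
HOT_UPDATABLE = {"agent_context"}


def needs_reconnect(changed_fields: Iterable[str]) -> bool:
    """Return True if any changed setting is outside the hot-updatable set.

    Hypothetical helper for illustration, based on the note above.
    """
    return any(name not in HOT_UPDATABLE for name in changed_fields)
```

For example, `needs_reconnect(["agent_context"])` is False, while `needs_reconnect(["agent_context", "prompt"])` is True.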

Event Handlers

AssemblyAI STT supports the standard service connection events, plus turn-level events for conversation tracking:
| Event | Description |
| --- | --- |
| on_connected | Connected to AssemblyAI WebSocket |
| on_disconnected | Disconnected from AssemblyAI WebSocket |
| on_end_of_turn | End of turn detected (fires after final transcript is pushed) |
@stt.event_handler("on_connected")
async def on_connected(service):
    print("Connected to AssemblyAI")

@stt.event_handler("on_end_of_turn")
async def on_end_of_turn(service, transcript):
    print(f"Turn ended: {transcript}")
The on_end_of_turn event receives (service, transcript) where transcript is the final transcript text. This event fires after the final transcript is pushed, providing a reliable hook for end-of-turn logic that doesn’t race with TranscriptionFrame. Works in both Pipecat and AssemblyAI turn detection modes.
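As one possible use of this hook, the sketch below keeps a running list of completed user turns. `record_turn` and `completed_turns` are hypothetical names; the handler uses the (service, transcript) signature described above and is exercised here without a live connection by calling it directly.

```python
import asyncio

# Completed user turns, in the order they ended.
completed_turns: list[str] = []


async def record_turn(service, transcript: str) -> None:
    """Store each non-empty final transcript when a turn ends."""
    text = transcript.strip()
    if text:
        completed_turns.append(text)


# Wiring (not run here): decorate record_turn with
# @stt.event_handler("on_end_of_turn"), as in the example above.

# Simulate two events without a live AssemblyAI connection.
asyncio.run(record_turn(None, "  Hello there.  "))
asyncio.run(record_turn(None, "   "))
print(completed_turns)  # ['Hello there.']
```

Whitespace-only transcripts are skipped, so the list holds only turns with actual text.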