Overview

AssemblyAISTTService provides real-time speech recognition using AssemblyAI’s WebSocket API, with support for interim results, end-of-turn detection, and configurable audio processing parameters for accurate transcription in conversational AI applications.

Installation

To use AssemblyAI services, install the required dependency:
pip install "pipecat-ai[assemblyai]"

Prerequisites

AssemblyAI Account Setup

Before using AssemblyAI STT services, you need:
  1. AssemblyAI Account: Sign up at AssemblyAI Console
  2. API Key: Generate an API key from your dashboard
  3. Model Selection: Choose from available transcription models and features

Required Environment Variables

  • ASSEMBLYAI_API_KEY: Your AssemblyAI API key for authentication

Configuration

AssemblyAISTTService

api_key
str
required
AssemblyAI API key for authentication.
language
Language
default:"Language.EN"
Language code for transcription. AssemblyAI currently supports English.
api_endpoint_base_url
str
default:"wss://streaming.assemblyai.com/v3/ws"
WebSocket endpoint URL. Override for custom or proxied deployments.
connection_params
AssemblyAIConnectionParams
default:"AssemblyAIConnectionParams()"
Connection configuration parameters. See AssemblyAIConnectionParams below.
vad_force_turn_endpoint
bool
default:"True"
Controls turn detection mode.
  • When True (Pipecat mode, default): Forces AssemblyAI to return finals as soon as possible so Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. A VAD stop sends a ForceEndpoint message as a ceiling. No UserStartedSpeakingFrame or UserStoppedSpeakingFrame is emitted from the STT service.
  • When False (AssemblyAI turn detection mode, u3-rt-pro only): AssemblyAI’s model controls turn endings using its built-in turn detection, with AssemblyAI API defaults for all parameters unless explicitly set. Emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame from the STT service.
should_interrupt
bool
default:"True"
Whether to interrupt the bot when the user starts speaking in AssemblyAI turn detection mode (vad_force_turn_endpoint=False). Only applies when using AssemblyAI’s built-in turn detection.
speaker_format
Optional[str]
default:"None"
Optional format string for speaker labels when diarization is enabled. Use {speaker} for speaker label and {text} for transcript text. Example: "<{speaker}>{text}</{speaker}>" or "{speaker}: {text}". If None, transcript text is not modified.
ttfs_p99_latency
float
default:"ASSEMBLYAI_TTFS_P99"
P99 latency from speech end to final transcript in seconds. Override for your deployment.
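The speaker_format placeholders follow standard Python str.format semantics (an assumption based on the {speaker}/{text} syntax shown above), so you can preview how a format string will render with plain string formatting. The transcript values below are made up for illustration:

```python
# Preview how a speaker_format string renders a diarized transcript segment.
speaker_format = "<{speaker}>{text}</{speaker}>"
rendered = speaker_format.format(speaker="Speaker A", text="Hello there.")
print(rendered)  # <Speaker A>Hello there.</Speaker A>

# The simpler colon style from the example above:
colon_format = "{speaker}: {text}"
print(colon_format.format(speaker="Speaker B", text="Hi!"))  # Speaker B: Hi!
```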

AssemblyAIConnectionParams

Connection-level parameters passed via the connection_params constructor argument.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sample_rate | int | 16000 | Audio sample rate in Hz. |
| encoding | Literal | "pcm_s16le" | Audio encoding format. Options: "pcm_s16le", "pcm_mulaw". |
| formatted_finals | bool | True | Whether to enable transcript formatting. |
| word_finalization_max_wait_time | int | None | Maximum time to wait for word finalization in milliseconds. |
| end_of_turn_confidence_threshold | float | None | Confidence threshold for end-of-turn detection. |
| min_turn_silence | int | None | Minimum silence duration (ms) when confident about end-of-turn. |
| min_end_of_turn_silence_when_confident | int | None | DEPRECATED. Use min_turn_silence instead. Will be removed in a future version. |
| max_turn_silence | int | None | Maximum silence duration (ms) before forcing end-of-turn. |
| keyterms_prompt | List[str] | None | List of key terms to guide transcription. Will be JSON serialized before sending. |
| prompt | str | None | Optional text prompt to guide transcription. Only used when speech_model is "u3-rt-pro". Cannot be used with keyterms_prompt. |
| speech_model | Literal | "u3-rt-pro" | Speech model. Options: "universal-streaming-english", "universal-streaming-multilingual", "u3-rt-pro". |
| language_detection | bool | None | Enable automatic language detection. Only applicable to universal-streaming-multilingual. Turn messages include language information. |
| format_turns | bool | True | Whether to format transcript turns. |
| speaker_labels | bool | None | Enable speaker diarization. Final transcripts include a speaker field (e.g., “Speaker A”, “Speaker B”). |

Usage

Basic Setup

import os

from pipecat.services.assemblyai import AssemblyAISTTService

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
)

With Custom Connection Parameters

import os

from pipecat.services.assemblyai import AssemblyAISTTService
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        sample_rate=16000,
        formatted_finals=True,
        keyterms_prompt=["Pipecat", "AssemblyAI"],
        speech_model="u3-rt-pro",
    ),
    vad_force_turn_endpoint=True,
)

With AssemblyAI Built-in Turn Detection

AssemblyAI’s u3-rt-pro model supports built-in turn detection for more natural conversation flow:
import os

from pipecat.services.assemblyai import AssemblyAISTTService
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    vad_force_turn_endpoint=False,  # Use AssemblyAI's built-in turn detection
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        # Optional: Tune turn detection timing
        min_turn_silence=100,  # Minimum silence (ms) when confident about end-of-turn
        max_turn_silence=1000,  # Maximum silence (ms) before forcing end-of-turn
    ),
)

With Speaker Diarization

Enable speaker identification for multi-party conversations:
import os

from pipecat.services.assemblyai import AssemblyAISTTService
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams

stt = AssemblyAISTTService(
    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
    connection_params=AssemblyAIConnectionParams(
        speech_model="u3-rt-pro",
        speaker_labels=True,  # Enable speaker diarization
    ),
    speaker_format="{speaker}: {text}",  # Format transcripts with speaker labels
)

Notes

  • u3-rt-pro model: The default model is now u3-rt-pro, which provides the best performance and supports built-in turn detection.
  • Turn detection modes:
    • Pipecat mode (vad_force_turn_endpoint=True, default): Forces AssemblyAI to return finals ASAP so Pipecat’s turn detection (e.g., Smart Turn) decides when the user is done. The service sends a ForceEndpoint message when VAD detects the user has stopped speaking.
    • AssemblyAI mode (vad_force_turn_endpoint=False, u3-rt-pro only): AssemblyAI’s model controls turn endings using built-in turn detection. The service emits UserStartedSpeakingFrame and UserStoppedSpeakingFrame based on AssemblyAI’s detection.
  • Speaker diarization: Enable speaker_labels=True in connection_params to automatically identify different speakers. Final transcripts will include a speaker field (e.g., “Speaker A”, “Speaker B”). Use the speaker_format parameter to format transcripts with speaker labels.
  • Language detection: When using universal-streaming-multilingual with language_detection=True, Turn messages include language_code and language_confidence fields for automatic language detection.
  • Prompting: The prompt parameter (u3-rt-pro only) allows you to guide transcription for specific names, terms, or domain vocabulary. This is a beta feature; AssemblyAI recommends testing without a prompt first. Cannot be used with keyterms_prompt.
  • Formatted finals: When formatted_finals=True, the service waits for formatted transcripts before emitting final TranscriptionFrames. This provides properly formatted text but may introduce a small delay.
  • Dynamic settings updates: You can update keyterms_prompt, prompt, min_turn_silence, and max_turn_silence at runtime using STTUpdateSettingsFrame without reconnecting.
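As a sketch of the runtime-update note above, you might guard which settings go into an STTUpdateSettingsFrame before pushing it. The split_update helper below is hypothetical (not part of the Pipecat API); the list of runtime-updatable keys comes from this page:

```python
# Settings this page lists as updatable at runtime via STTUpdateSettingsFrame;
# anything else (e.g., sample_rate) requires a reconnect.
RUNTIME_UPDATABLE = {"keyterms_prompt", "prompt", "min_turn_silence", "max_turn_silence"}

def split_update(settings: dict) -> tuple[dict, dict]:
    """Split requested changes into runtime-safe and reconnect-required parts.

    Hypothetical helper for illustration; not part of the Pipecat API.
    """
    live = {k: v for k, v in settings.items() if k in RUNTIME_UPDATABLE}
    reconnect = {k: v for k, v in settings.items() if k not in RUNTIME_UPDATABLE}
    return live, reconnect

live, reconnect = split_update({
    "keyterms_prompt": ["Pipecat", "AssemblyAI"],
    "max_turn_silence": 1200,
    "sample_rate": 8000,  # connection-level: cannot change without reconnecting
})
print(live)       # {'keyterms_prompt': ['Pipecat', 'AssemblyAI'], 'max_turn_silence': 1200}
print(reconnect)  # {'sample_rate': 8000}
```

The live portion would then be wrapped in an STTUpdateSettingsFrame and queued into the running pipeline.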

Event Handlers

AssemblyAI STT supports the standard service connection events:
| Event | Description |
|---|---|
| on_connected | Connected to AssemblyAI WebSocket |
| on_disconnected | Disconnected from AssemblyAI WebSocket |
@stt.event_handler("on_connected")
async def on_connected(service):
    print("Connected to AssemblyAI")