Ultravox STT
The `UltravoxSTTService` implements speech-to-text using a locally loaded Ultravox multimodal model. It uses vLLM for efficient inference, making it well suited to privacy-first, on-device transcription.
Installation
To use Ultravox, install the required dependencies (requires GPU support):
pip install "piopiy-ai[ultravox]"
Prerequisites
- For gated models, a Hugging Face access token is required.
- Set your token in your environment:
export HF_TOKEN="your_hf_token_here"
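The token fallback described above (an explicit `hf_token` argument, otherwise the `HF_TOKEN` environment variable) can be sketched as follows; `get_hf_token` is a hypothetical helper for illustration, not part of the library:

```python
import os

def get_hf_token(explicit_token=None):
    """Return an explicit token if given, otherwise fall back to HF_TOKEN.

    Hypothetical helper mirroring the documented fallback behavior.
    """
    if explicit_token is not None:
        return explicit_token
    return os.environ.get("HF_TOKEN")
```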
Configuration
UltravoxSTTService Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"fixie-ai/ultravox-v0_5-llama-3_1-8b"` | The Ultravox model to load. |
| `hf_token` | `str` | `None` | Hugging Face token; falls back to the `HF_TOKEN` environment variable. |
| `temperature` | `float` | `0.7` | Sampling temperature for generation. |
| `max_tokens` | `int` | `100` | Maximum number of tokens to generate per turn. |
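The defaults in the table can be summarized as a plain settings object for reference; the dataclass below is illustrative only (the service's internal configuration type is not documented here), but the names and default values come from the table:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UltravoxSTTSettings:
    """Illustrative settings container; defaults mirror the parameter table."""
    model_name: str = "fixie-ai/ultravox-v0_5-llama-3_1-8b"
    hf_token: Optional[str] = None
    temperature: float = 0.7
    max_tokens: int = 100
```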
Usage
Basic Setup
import os
from piopiy.services.ultravox.stt import UltravoxSTTService
stt = UltravoxSTTService(
    model_name="fixie-ai/ultravox-v0_5-llama-3_1-8b",
    hf_token=os.getenv("HF_TOKEN"),
)
Notes
- Multimodal Inference: Ultravox processes audio data directly without an intermediate text transcription step, potentially capturing more conversational nuance.
- Model Warm-up: The service automatically runs a one-second silent-audio warm-up at initialization so that the first user utterance is processed with minimal latency.
- Hardware Requirements: This service requires a GPU and significantly more local resources than cloud-based API providers.
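The warm-up note above amounts to feeding the model a short buffer of silence. A one-second silent frame can be constructed as below; the 16 kHz sample rate and 16-bit mono PCM format are assumptions for illustration, not values documented for this service:

```python
def make_silence(sample_rate=16000, seconds=1.0):
    """Return raw 16-bit mono PCM silence of the given duration.

    Assumed format: little-endian signed 16-bit samples (2 bytes each).
    """
    num_samples = int(sample_rate * seconds)
    return b"\x00\x00" * num_samples  # two zero bytes per sample
```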