Ultravox STT

The UltravoxSTTService implements speech-to-text using a locally-loaded Ultravox multimodal model. It leverages vLLM for efficient inference and is ideal for privacy-first, on-device transcription.

Installation

To use Ultravox, install the required dependencies (requires GPU support):

pip install "piopiy-ai[ultravox]"

Prerequisites

  • For gated models, a Hugging Face access token is required (available from your Hugging Face account settings).
  • Set your token in your environment:
    export HF_TOKEN="your_hf_token_here"
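The fallback behavior described in the Configuration table (an explicit hf_token argument wins, otherwise the HF_TOKEN environment variable is used) can be sketched in plain Python. resolve_hf_token is a hypothetical helper for illustration, not part of the library:

```python
import os

# Hypothetical helper illustrating the token fallback: an explicitly
# passed token takes precedence; otherwise the HF_TOKEN environment
# variable is consulted, returning None if neither is set.
def resolve_hf_token(explicit_token=None):
    return explicit_token if explicit_token is not None else os.getenv("HF_TOKEN")

os.environ["HF_TOKEN"] = "env-token"
print(resolve_hf_token())            # falls back to the environment variable
print(resolve_hf_token("explicit"))  # explicit argument wins
```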

Configuration

UltravoxSTTService Parameters

Parameter      Type    Default                                 Description
model_name     str     "fixie-ai/ultravox-v0_5-llama-3_1-8b"   The Ultravox model to load.
hf_token       str     None                                    Hugging Face token (falls back to the HF_TOKEN environment variable).
temperature    float   0.7                                     Sampling temperature for generation.
max_tokens     int     100                                     Maximum tokens to generate per turn.

Usage

Basic Setup

import os
from piopiy.services.ultravox.stt import UltravoxSTTService

stt = UltravoxSTTService(
    model_name="fixie-ai/ultravox-v0_5-llama-3_1-8b",
    hf_token=os.getenv("HF_TOKEN"),
)

Notes

  • Multimodal Inference: Ultravox processes audio data directly without an intermediate text transcription step, potentially capturing more conversational nuance.
  • Model Warm-up: The service automatically performs a one-second silent audio warm-up on initialization, so the first user utterance is processed without cold-start latency.
  • Hardware Requirements: This service requires a GPU and significantly more local resources than cloud-based API providers.