Ultravox STT
The `UltravoxSTTService` implements speech-to-text using a locally loaded Ultravox multimodal model. It uses vLLM for efficient inference, making it well suited to privacy-first, on-device transcription.
Installation
To use Ultravox, install the required dependencies (requires GPU support):
pip install "piopiy-ai[ultravox]"
Prerequisites
- For gated models, a Hugging Face access token is required.
- Set your token in your environment:
export HF_TOKEN="your_hf_token_here"
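The token fallback described above (an explicit `hf_token` argument, otherwise the `HF_TOKEN` environment variable) can be sketched as follows; `get_hf_token` is a hypothetical helper for illustration, not part of the library:

```python
import os

def get_hf_token(explicit_token=None):
    """Return an explicit token if given, otherwise fall back to HF_TOKEN.

    Hypothetical helper mirroring the documented fallback behavior.
    """
    if explicit_token is not None:
        return explicit_token
    return os.environ.get("HF_TOKEN")
```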
Configuration
UltravoxSTTService Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"fixie-ai/ultravox-v0_5-llama-3_1-8b"` | The Ultravox model to load. |
| `hf_token` | `str` | `None` | Hugging Face token; falls back to the `HF_TOKEN` environment variable. |
| `temperature` | `float` | `0.7` | Sampling temperature for generation. |
| `max_tokens` | `int` | `100` | Maximum number of tokens to generate per turn. |
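The defaults in the table can be summarized as a plain settings object for reference; the dataclass below is illustrative only (the service's internal configuration type is not documented here), but the names and default values come from the table:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UltravoxSTTSettings:
    """Illustrative settings container; defaults mirror the parameter table."""
    model_name: str = "fixie-ai/ultravox-v0_5-llama-3_1-8b"
    hf_token: Optional[str] = None
    temperature: float = 0.7
    max_tokens: int = 100
```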
Usage
Basic Setup
import os
from piopiy.services.ultravox.stt import UltravoxSTTService
stt = UltravoxSTTService(
    model_name="fixie-ai/ultravox-v0_5-llama-3_1-8b",
    hf_token=os.getenv("HF_TOKEN"),
)
Notes
- Multimodal Inference: Ultravox processes audio data directly without an intermediate text transcription step, potentially capturing more conversational nuance.
- Model Warm-up: The service automatically runs a one-second silent-audio warm-up at initialization so that the first user utterance is processed with minimal latency.
- Hardware Requirements: This service requires a GPU and significantly more local resources than cloud-based API providers.
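The warm-up note above amounts to feeding the model a short buffer of silence. A one-second silent frame can be constructed as below; the 16 kHz sample rate and 16-bit mono PCM format are assumptions for illustration, not values documented for this service:

```python
def make_silence(sample_rate=16000, seconds=1.0):
    """Return raw 16-bit mono PCM silence of the given duration.

    Assumed format: little-endian signed 16-bit samples (2 bytes each).
    """
    num_samples = int(sample_rate * seconds)
    return b"\x00\x00" * num_samples  # two zero bytes per sample
```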