Speech-to-Text (STT)

Overview

In conversational voice AI, Speech-to-Text (STT) services—often called Automatic Speech Recognition (ASR)—are responsible for converting the human caller's spoken audio into text.

Unlike frameworks that require you to manually handle audio frames and pipe them into text aggregators, Piopiy handles STT integration natively via the Action engine.

How Piopiy Handles STT

The STT service sits directly behind the TeleCMI transport. When an inbound phone call or WebRTC connection starts:

  1. The VoiceAgent immediately begins buffering raw audio from the user.
  2. The audio is streamed directly to your chosen STT provider.
  3. The STT provider streams back partial (interim) and final text transcripts.
  4. Piopiy automatically formats this text and injects it into the conversational context for your LLM.
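The flow above can be sketched as a toy event loop. This is illustrative only, not Piopiy's actual internals: transcript events are simulated in-process, and the `(kind, text)` event shape is an assumption for the sketch.

```python
# Illustrative sketch of the STT pipeline steps above (not Piopiy internals).
# Each simulated event is a (kind, text) tuple: "interim" or "final".

def run_stt_pipeline(audio_events):
    """Consume simulated STT events and build the LLM context."""
    context = []          # messages the LLM will eventually see (step 4)
    latest_interim = ""   # most recent partial transcript (step 3);
                          # in a real agent this would drive barge-in detection

    for kind, text in audio_events:
        if kind == "interim":
            latest_interim = text  # partial result: track, but don't commit
        elif kind == "final":
            latest_interim = ""
            # only final transcripts enter the conversational context
            context.append({"role": "user", "content": text})

    return context

events = [
    ("interim", "what is"),
    ("interim", "what is my account"),
    ("final", "What is my account balance?"),
]
print(run_stt_pipeline(events))
# [{'role': 'user', 'content': 'What is my account balance?'}]
```

The key point is the split in step 3: interim results are useful for responsiveness, but only final transcripts should be committed to the LLM's context.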

Supported STT Providers

Piopiy provides pre-built, highly optimized native integrations for the industry's leading STT engines. You can easily switch between them by importing the corresponding service class.

Note: For complete parameter lists and API keys for each provider, refer to the Server API Reference.

Implementation Example

Integrating an STT provider in Piopiy is straightforward: initialize the service class with your API key and preferred model, then pass it directly to voice_agent.Action().

import os
from piopiy.agent import Agent
from piopiy.voice_agent import VoiceAgent
# Import the STT service of your choice
from piopiy.services.deepgram.stt import DeepgramSTTService

async def on_new_session(agent_id, call_id, from_number, to_number, metadata=None):
    voice_agent = VoiceAgent(
        instructions="You are a helpful assistant.",
        greeting="Hello! How can I help you today?"
    )

    # 1. Initialize the STT Service
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        model="nova-2",
        language="en-US",
        smart_format=True,    # Automatically format numbers and dates
        interim_results=True  # Stream partial results for lower latency
    )

    # ... Initialize LLM and TTS ...

    # 2. Pass the STT instance to the Action engine
    await voice_agent.Action(stt=stt, llm=llm, tts=tts)
    await voice_agent.start()

Best Practices for Conversational AI

To ensure your voice AI feels snappy and natural:

1. Enable Interim Results

When configuring your STT provider, always ensure interim results (or partial results) are enabled if the provider supports it. This allows Piopiy to begin reading the text as the user is speaking, powering features like real-time interruption detection (barge-in) and minimizing TTFB (Time to First Byte) latency for the LLM.
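As a hedged sketch of how interim results power barge-in, the helper below is illustrative logic only, not Piopiy's implementation; the function name and `min_chars` threshold are assumptions for the example.

```python
# Illustrative only: how interim transcripts can drive barge-in detection.
# Not Piopiy's internal implementation.

def should_interrupt(agent_is_speaking, interim_transcript, min_chars=3):
    """Return True when the agent's TTS playback should be cut off.

    A minimum length filters out noise that the STT engine briefly
    mis-recognizes as speech.
    """
    return agent_is_speaking and len(interim_transcript.strip()) >= min_chars

print(should_interrupt(True, "wait"))    # True  -> stop TTS and listen
print(should_interrupt(True, " "))       # False -> ignore noise
print(should_interrupt(False, "hello"))  # False -> agent is already silent
```

Without interim results, this check could only run after the user finished a full utterance, making interruptions feel sluggish.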

2. Use Smart Formatting

Enable smart_format=True or punctuate=True if your provider supports it. Clean, punctuated transcripts with correctly formatted numbers (e.g., "$100" instead of "one hundred dollars") drastically improve the LLM's comprehension and ability to trigger function calls correctly.
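To see why formatting matters downstream, the toy parser below (an illustrative example, not part of the Piopiy API) extracts a dollar amount for a hypothetical transfer function call. It succeeds on a smart-formatted transcript and fails on the raw, spelled-out form.

```python
# Illustrative only: formatted transcripts are far easier to parse for
# function-call arguments than raw spelled-out speech.
import re

def extract_amount(transcript):
    """Pull a digit-formatted dollar amount out of a transcript."""
    match = re.search(r"\$(\d+(?:\.\d{2})?)", transcript)
    return float(match.group(1)) if match else None

print(extract_amount("transfer one hundred dollars"))  # None -> words, no digits
print(extract_amount("Transfer $100 to savings."))     # 100.0
```

An LLM faces the same asymmetry: "$100" is unambiguous, while "one hundred dollars" invites mis-parses in longer, messier utterances.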

3. Match Languages

Ensure the language parameter in your STT configuration matches both the caller's expected language and the language you have instructed your LLM to speak. If you expect Spanish callers, set language="es-ES"; otherwise the STT engine will try to transcribe Spanish speech as English and produce gibberish.
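A cheap way to catch mismatches is a startup-time sanity check. The sketch below is an assumption-heavy illustration: the language mapping and the check function are hypothetical helpers, not part of the Piopiy API.

```python
# Hedged sketch: verify at startup that the STT language code matches the
# language the LLM prompt asks for. Mapping and helper are illustrative only.

STT_TO_LLM_LANGUAGE = {
    "en-US": "English",
    "es-ES": "Spanish",
    "fr-FR": "French",
}

def check_language_match(stt_language, llm_instructions):
    """Return True when the LLM prompt mentions the STT language by name."""
    name = STT_TO_LLM_LANGUAGE.get(stt_language)
    return name is not None and name.lower() in llm_instructions.lower()

print(check_language_match("es-ES", "You are a helpful assistant. Respond in Spanish."))  # True
print(check_language_match("es-ES", "You are a helpful assistant."))                       # False
```

Running a check like this before accepting calls turns a subtle transcription-quality bug into an immediate, visible configuration error.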

What's Next