Speech-to-Text (STT)

Overview

In conversational voice AI, Speech-to-Text (STT) services—often called Automatic Speech Recognition (ASR)—are responsible for converting the human caller's spoken audio into text.

Unlike frameworks that require you to manually handle audio frames and pipe them into text aggregators, Piopiy handles STT integration natively via the Action engine.

How Piopiy Handles STT

The STT service sits directly behind the TeleCMI transport. When an inbound phone call or WebRTC connection starts:

  1. The VoiceAgent immediately begins buffering raw audio from the user.
  2. The audio is streamed directly to your chosen STT provider.
  3. The STT provider streams back partial (interim) and final text transcripts.
  4. Piopiy automatically formats this text and injects it into the conversational context for your LLM.
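The flow above can be sketched as a toy event loop. This is illustrative only, not Piopiy's actual internals: transcript events are simulated in-process, and the `(kind, text)` event shape is an assumption for the sketch.

```python
# Illustrative sketch of the STT pipeline steps above (not Piopiy internals).
# Each simulated event is a (kind, text) tuple: "interim" or "final".

def run_stt_pipeline(audio_events):
    """Consume simulated STT events and build the LLM context."""
    context = []          # messages the LLM will eventually see (step 4)
    latest_interim = ""   # most recent partial transcript (step 3);
                          # in a real agent this would drive barge-in detection

    for kind, text in audio_events:
        if kind == "interim":
            latest_interim = text  # partial result: track, but don't commit
        elif kind == "final":
            latest_interim = ""
            # only final transcripts enter the conversational context
            context.append({"role": "user", "content": text})

    return context

events = [
    ("interim", "what is"),
    ("interim", "what is my account"),
    ("final", "What is my account balance?"),
]
print(run_stt_pipeline(events))
# [{'role': 'user', 'content': 'What is my account balance?'}]
```

The key point is the split in step 3: interim results are useful for responsiveness, but only final transcripts should be committed to the LLM's context.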

Supported STT Providers

Piopiy provides pre-built, highly optimized native integrations for the industry's leading STT engines. You can easily switch between them by importing the corresponding service class.

Note: For complete parameter lists and API keys for each provider, refer to the Server API Reference.

Implementation Example

Integrating an STT provider in Piopiy is straightforward: initialize the service class with your API key and preferred model, then pass it directly to voice_agent.Action().

import os
from piopiy.agent import Agent
from piopiy.voice_agent import VoiceAgent
# Import the STT service of your choice
from piopiy.services.deepgram.stt import DeepgramSTTService

async def on_new_session(agent_id, call_id, from_number, to_number, metadata=None):
    voice_agent = VoiceAgent(
        instructions="You are a helpful assistant.",
        greeting="Hello! How can I help you today?"
    )

    # 1. Initialize the STT Service
    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        model="nova-2",
        language="en-US",
        smart_format=True,    # Automatically format numbers and dates
        interim_results=True  # Stream partial results for lower latency
    )

    # ... Initialize LLM and TTS ...

    # 2. Pass the STT instance to the Action engine
    await voice_agent.Action(stt=stt, llm=llm, tts=tts)
    await voice_agent.start()

Best Practices for Conversational AI

To ensure your voice AI feels snappy and natural:

1. Enable Interim Results

When configuring your STT provider, always ensure interim results (or partial results) are enabled if the provider supports it. This allows Piopiy to begin reading the text as the user is speaking, powering features like real-time interruption detection (barge-in) and minimizing TTFB (Time to First Byte) latency for the LLM.
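As a hedged sketch of how interim results power barge-in, the helper below is illustrative logic only, not Piopiy's implementation; the function name and `min_chars` threshold are assumptions for the example.

```python
# Illustrative only: how interim transcripts can drive barge-in detection.
# Not Piopiy's internal implementation.

def should_interrupt(agent_is_speaking, interim_transcript, min_chars=3):
    """Return True when the agent's TTS playback should be cut off.

    A minimum length filters out noise that the STT engine briefly
    mis-recognizes as speech.
    """
    return agent_is_speaking and len(interim_transcript.strip()) >= min_chars

print(should_interrupt(True, "wait"))    # True  -> stop TTS and listen
print(should_interrupt(True, " "))       # False -> ignore noise
print(should_interrupt(False, "hello"))  # False -> agent is already silent
```

Without interim results, this check could only run after the user finished a full utterance, making interruptions feel sluggish.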

2. Use Smart Formatting

Enable smart_format=True or punctuate=True if your provider supports it. Clean, punctuated transcripts with correctly formatted numbers (e.g., "$100" instead of "one hundred dollars") drastically improve the LLM's comprehension and ability to trigger function calls correctly.
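To see why formatting matters downstream, the toy parser below (an illustrative example, not part of the Piopiy API) extracts a dollar amount for a hypothetical transfer function call. It succeeds on a smart-formatted transcript and fails on the raw, spelled-out form.

```python
# Illustrative only: formatted transcripts are far easier to parse for
# function-call arguments than raw spelled-out speech.
import re

def extract_amount(transcript):
    """Pull a digit-formatted dollar amount out of a transcript."""
    match = re.search(r"\$(\d+(?:\.\d{2})?)", transcript)
    return float(match.group(1)) if match else None

print(extract_amount("transfer one hundred dollars"))  # None -> words, no digits
print(extract_amount("Transfer $100 to savings."))     # 100.0
```

An LLM faces the same asymmetry: "$100" is unambiguous, while "one hundred dollars" invites mis-parses in longer, messier utterances.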

3. Match Languages

Ensure the language parameter in your STT configuration matches both the caller's expected language and the language you have instructed your LLM to speak. If you expect Spanish callers, set language="es-ES"; otherwise the STT engine will try to transcribe Spanish speech as English and produce gibberish.
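A cheap way to catch mismatches is a startup-time sanity check. The sketch below is an assumption-heavy illustration: the language mapping and the check function are hypothetical helpers, not part of the Piopiy API.

```python
# Hedged sketch: verify at startup that the STT language code matches the
# language the LLM prompt asks for. Mapping and helper are illustrative only.

STT_TO_LLM_LANGUAGE = {
    "en-US": "English",
    "es-ES": "Spanish",
    "fr-FR": "French",
}

def check_language_match(stt_language, llm_instructions):
    """Return True when the LLM prompt mentions the STT language by name."""
    name = STT_TO_LLM_LANGUAGE.get(stt_language)
    return name is not None and name.lower() in llm_instructions.lower()

print(check_language_match("es-ES", "You are a helpful assistant. Respond in Spanish."))  # True
print(check_language_match("es-ES", "You are a helpful assistant."))                       # False
```

Running a check like this before accepting calls turns a subtle transcription-quality bug into an immediate, visible configuration error.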

What's Next