Text-to-Speech (TTS)

Overview

In conversational voice AI, Text-to-Speech (TTS) services are responsible for converting the AI agent's text responses into natural-sounding spoken audio that the human caller hears.

Piopiy integrates TTS natively through the Action engine, which keeps playback latency low.

How Piopiy Handles TTS

The TTS service sits directly before the TeleCMI transport output. When the LLM generates a response:

1. The LLM streams text tokens to the VoiceAgent.
2. Piopiy buffers these tokens into sentences or phrases.
3. The chosen TTS provider synthesizes the text into an audio stream.
4. Piopiy streams this audio back to the user over the phone call or WebRTC connection in real time.
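
The token-buffering step can be illustrated with a small sketch. This is not Piopiy's internal implementation; the function name buffer_sentences and the punctuation-based boundary rule are assumptions chosen only to show the idea of turning a token stream into sentence-sized chunks that a TTS provider can synthesize one at a time.

```python
import re
from typing import Iterable, Iterator

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def buffer_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text tokens into sentence-sized chunks (hypothetical helper)."""
    pending = ""
    for token in tokens:
        pending += token
        parts = _BOUNDARY.split(pending)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be mid-sentence, so keep buffering it.
        pending = parts[-1]
    # Flush whatever is left when the LLM stream ends.
    if pending.strip():
        yield pending.strip()


tokens = ["Hel", "lo", "!", " How", " can", " I", " help", "?", " Bye"]
chunks = list(buffer_sentences(tokens))
print(chunks)
```

Note that this sketch only emits a sentence once the following token arrives with leading whitespace. A production buffer would likely also flush on a timeout or a maximum length so that a long sentence does not delay the first audio.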

Supported TTS Providers

Piopiy provides pre-built, highly optimized native integrations for the industry's leading TTS engines. You can easily switch between them by importing the corresponding service class.

ElevenLabs: Industry-leading, hyper-realistic voice synthesis with a wide emotional range and personality.

Deepgram: Ultra-low-latency text-to-speech optimized for real-time conversational agents.

Azure Speech: Microsoft's enterprise-grade cognitive service offering massive scale and extensive voice support.

Google Cloud TTS: Highly reliable, cloud-native voice synthesis backed by Google's language models.

OpenAI TTS: High-quality, natural-sounding voices integrated directly with the OpenAI ecosystem.

Cartesia: Cutting-edge sonic intelligence platform optimized for deeply conversational AI flows.

PlayHT: Premium AI voice generator providing lifelike text-to-speech capabilities.

LMNT: Fast and expressive voice AI tailored for interactive voice applications.

Rime: Specialized TTS engine focused on high-fidelity, real-time voice generation.

Speechmatics: Autonomous speech solutions delivering clear and accurate synthesized voices.

Fish AI: Innovative text-to-speech platform focused on dynamic voice rendering.

Inworld AI: Ultra-realistic, context-aware speech synthesis designed for interactive AI characters.

Sarvam AI: Specialized foundational AI models built from the ground up to excel at Indic languages.

Neuphonic: High-performance audio synthesis delivering rich and articulate voices.

Hume AI: Emotion-aware voice generation creating empathetic and expressive agent responses.

Gradium: Specialized TTS integration supporting custom voice synthesis workflows.

Note: For each provider's complete parameter list and API key setup, refer to Supported Services.

Implementation Example

Integrating a TTS provider in Piopiy is extremely straightforward. You initialize the service class with your API key, preferred voice, and model, then pass it directly to voice_agent.Action().

Below is a complete, copy-pasteable example of an agent using ElevenLabs for high-quality voice synthesis.

import asyncio
import os
from dotenv import load_dotenv

from piopiy.agent import Agent
from piopiy.voice_agent import VoiceAgent
from piopiy.services.deepgram.stt import DeepgramSTTService
from piopiy.services.openai.llm import OpenAILLMService
from piopiy.services.elevenlabs.tts import ElevenLabsTTSService

load_dotenv()


async def create_session(agent_id, call_id, from_number, to_number, metadata=None):
    print(f"📞 Call {call_id} - Using ElevenLabs TTS")
    
    voice_agent = VoiceAgent(
        instructions="You are a helpful AI assistant.",
        greeting="Hello! I'm using ElevenLabs for high-quality voice. How can I help you?",
    )

    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        model="nova-2"
    )

    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4o-mini"
    )

    # ElevenLabs TTS - Highest quality synthesis
    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model="eleven_turbo_v2_5",
        stability=0.5,
        similarity_boost=0.75
    )

    await voice_agent.Action(stt=stt, llm=llm, tts=tts, vad=True)
    await voice_agent.start()


async def main():
    agent = Agent(
        agent_id=os.getenv("AGENT_ID"),
        agent_token=os.getenv("AGENT_TOKEN"),
        create_session=create_session
    )
    
    print("🚀 ElevenLabs TTS Agent Running")
    print("   Waiting for calls...")
    await agent.connect()


if __name__ == "__main__":
    asyncio.run(main())
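
The example above reads five environment variables with os.getenv, which returns None when a variable is unset. A small pre-flight check can surface a missing key before the first call arrives. The helper below is a sketch, not part of the Piopiy SDK; the name missing_env is hypothetical, and the variable list simply mirrors the names used in the example.

```python
import os
from typing import List, Mapping

# Variable names taken from the example above.
REQUIRED_ENV = [
    "AGENT_ID",
    "AGENT_TOKEN",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
]


def missing_env(names: List[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in env (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


def check_env() -> None:
    """Exit with a readable message if any required variable is missing."""
    missing = missing_env(REQUIRED_ENV)
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Calling check_env() right after load_dotenv() would stop the process early with a list of the missing names instead of failing later inside a provider client.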

Best Practices for Text-to-Speech

To keep your agent sounding natural and responsive on a live call:

1. Optimize Output Formats

Telephony networks require specific sample rates (commonly 8 kHz or 16 kHz PCM). Configure your TTS provider to output raw PCM at a matching rate (e.g., pcm_16000), or a format that decodes cleanly to it, so that audio does not have to be resampled on every chunk, which adds latency.
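
To make the sample-rate arithmetic concrete, the sketch below computes how many bytes one audio frame occupies and how long a PCM buffer plays. It assumes 16-bit mono PCM and 20 ms frames, which are common telephony choices but are assumptions here, not values documented by Piopiy. The function names are hypothetical.

```python
def pcm_frame_bytes(
    sample_rate_hz: int,
    frame_ms: int,
    sample_width_bytes: int = 2,  # assumption: 16-bit samples
    channels: int = 1,            # assumption: mono audio
) -> int:
    """Bytes needed for one frame of raw PCM audio."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * sample_width_bytes * channels


def pcm_duration_ms(
    num_bytes: int,
    sample_rate_hz: int,
    sample_width_bytes: int = 2,
    channels: int = 1,
) -> float:
    """Playback duration in milliseconds of a raw PCM buffer."""
    bytes_per_second = sample_rate_hz * sample_width_bytes * channels
    return num_bytes * 1000 / bytes_per_second


print(pcm_frame_bytes(8000, 20))    # 20 ms frame at 8 kHz
print(pcm_frame_bytes(16000, 20))   # 20 ms frame at 16 kHz
print(pcm_duration_ms(32000, 16000))
```

A 20 ms frame is 320 bytes at 8 kHz and 640 bytes at 16 kHz under these assumptions, so a mismatch between the provider's output rate and the call's rate doubles or halves every buffer and forces a resampling step.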

2. Handle Streaming Latency

Piopiy handles token chunking natively, but you should still select "Turbo" or other latency-optimized models for interactive AI flows (e.g., ElevenLabs eleven_turbo_v2_5, as used in the example above). Avoid models designed strictly for high-fidelity audiobooks: their high time-to-first-byte introduces pauses long enough to break the conversational flow.
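
One way to compare models is to measure time-to-first-byte directly. The sketch below times any asynchronous stream of audio chunks. It does not call a real TTS provider: fake_tts_stream is a stand-in that simulates a delay before the first chunk, and measure_ttfb is a hypothetical helper, not a Piopiy API.

```python
import asyncio
import time
from typing import AsyncIterator, Tuple


async def measure_ttfb(stream: AsyncIterator[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = time.perf_counter()
    ttfb = None
    total = 0
    async for chunk in stream:
        if ttfb is None and chunk:
            ttfb = time.perf_counter() - start
        total += len(chunk)
    if ttfb is None:
        raise ValueError("stream produced no audio")
    return ttfb, total


async def fake_tts_stream(first_chunk_delay: float) -> AsyncIterator[bytes]:
    """Simulated provider: waits, then yields two 320-byte chunks of silence."""
    await asyncio.sleep(first_chunk_delay)
    yield b"\x00" * 320
    yield b"\x00" * 320


ttfb, total = asyncio.run(measure_ttfb(fake_tts_stream(0.05)))
print(f"time to first byte: {ttfb * 1000:.0f} ms, bytes received: {total}")
```

Replacing fake_tts_stream with a wrapper around a real provider's streaming response would let you compare candidate models on the same input sentence.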

3. Voice Consistency

Always pair the emotional tone and gender of the selected TTS voice with the persona defined in your VoiceAgent instructions to maintain a cohesive user experience.

What's Next

Context Management: Learn how to structure and manage conversation context in the Piopiy SDK.
Speech-to-Text: Learn how Piopiy converts caller audio to text.
