Text-to-Speech (TTS)
Overview
In conversational voice AI, Text-to-Speech (TTS) services are responsible for converting the AI agent's text responses into natural-sounding spoken audio that the human caller hears.
Piopiy integrates TTS natively through the Action engine and streams synthesized audio to the caller in real time, which keeps playback latency low.
How Piopiy Handles TTS
The TTS service sits directly before the TeleCMI transport output. When the LLM generates a response:
- The LLM streams text tokens to the VoiceAgent.
- Piopiy buffers these tokens into sentences or phrases.
- The chosen TTS provider synthesizes the text into an audio stream.
- Piopiy streams this audio back to the user over the phone call or WebRTC connection in real time.
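The buffering step above can be illustrated with a small, standalone sketch. This is not Piopiy's internal implementation, which this page does not show. It is a hypothetical `buffer_sentences` helper that assumes a sentence ends at `.`, `!`, or `?` followed by whitespace, and it shows why buffering lets synthesis start before the LLM has finished its full response.

```python
import re
from typing import Iterable, Iterator

# Assumption: sentence-ending punctuation followed by whitespace marks a flush point.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def buffer_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text tokens into sentence-sized chunks for TTS."""
    pending = ""
    for token in tokens:
        pending += token
        parts = _SENTENCE_END.split(pending)
        # Every part except the last is a complete sentence and can be synthesized.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it in the buffer.
        pending = parts[-1]
    # Flush whatever is left when the LLM stream ends.
    if pending.strip():
        yield pending.strip()


if __name__ == "__main__":
    stream = ["Hel", "lo there", ". How", " can I", " help", " you?", " Thanks"]
    for chunk in buffer_sentences(stream):
        print(chunk)
```

Each yielded chunk could be handed to the TTS provider as soon as it is complete, so the caller hears the first sentence while later tokens are still arriving.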
Supported TTS Providers
Piopiy provides pre-built native integrations for leading TTS engines. To switch providers, import the corresponding service class.
Note: For the complete parameter list and API key requirements of each provider, refer to the Supported Services page.
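If you want to choose the provider from configuration rather than editing imports, one possible approach is a small registry that imports the service class on demand. The sketch below is an assumption, not part of the Piopiy SDK: `TTS_PROVIDERS` and `load_tts_class` are hypothetical names, and only the ElevenLabs module path is taken from the example on this page. Paths for other providers would need to come from the Supported Services reference.

```python
import importlib
from typing import Dict, Tuple

# Provider name -> (module path, class name).
# Only the ElevenLabs entry is taken from this page; other entries are left
# for you to fill in from the Supported Services reference.
TTS_PROVIDERS: Dict[str, Tuple[str, str]] = {
    "elevenlabs": ("piopiy.services.elevenlabs.tts", "ElevenLabsTTSService"),
}


def load_tts_class(provider: str,
                   registry: Dict[str, Tuple[str, str]] = TTS_PROVIDERS):
    """Import and return the TTS service class registered for `provider`."""
    try:
        module_path, class_name = registry[provider]
    except KeyError:
        raise ValueError(f"Unknown TTS provider: {provider!r}") from None
    # The import happens here, so unused providers are never loaded.
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

With this in place, the provider name could be read from an environment variable of your choosing and the returned class instantiated with that provider's own parameters.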
Implementation Example
To integrate a TTS provider, initialize its service class with your API key, preferred voice, and model, then pass the instance to voice_agent.Action().
Below is a complete, copy-pasteable example of an agent using ElevenLabs for high-quality voice synthesis.
import asyncio
import os

from dotenv import load_dotenv
from piopiy.agent import Agent
from piopiy.voice_agent import VoiceAgent
from piopiy.services.deepgram.stt import DeepgramSTTService
from piopiy.services.openai.llm import OpenAILLMService
from piopiy.services.elevenlabs.tts import ElevenLabsTTSService

load_dotenv()


async def create_session(agent_id, call_id, from_number, to_number, metadata=None):
    print(f"📞 Call {call_id} - Using ElevenLabs TTS")

    voice_agent = VoiceAgent(
        instructions="You are a helpful AI assistant.",
        greeting="Hello! I'm using ElevenLabs for high-quality voice. How can I help you?",
    )

    stt = DeepgramSTTService(
        api_key=os.getenv("DEEPGRAM_API_KEY"),
        model="nova-2"
    )

    llm = OpenAILLMService(
        api_key=os.getenv("OPENAI_API_KEY"),
        model="gpt-4o-mini"
    )

    # ElevenLabs TTS - Highest quality synthesis
    tts = ElevenLabsTTSService(
        api_key=os.getenv("ELEVENLABS_API_KEY"),
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model="eleven_turbo_v2_5",
        stability=0.5,
        similarity_boost=0.75
    )

    await voice_agent.Action(stt=stt, llm=llm, tts=tts, vad=True)
    await voice_agent.start()


async def main():
    agent = Agent(
        agent_id=os.getenv("AGENT_ID"),
        agent_token=os.getenv("AGENT_TOKEN"),
        create_session=create_session
    )

    print("🚀 ElevenLabs TTS Agent Running")
    print("   Waiting for calls...")
    await agent.connect()


if __name__ == "__main__":
    asyncio.run(main())
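The example reads every credential with os.getenv(), which returns None when a variable is missing, so a forgotten key may only surface once a call arrives. A small pre-flight check can catch this at startup. The sketch below is an optional addition, not part of the Piopiy SDK; `missing_env_vars` is a hypothetical helper, and the variable names are the ones used in the example above.

```python
import os
from typing import Iterable, List, Mapping

# Environment variables read by the example above.
REQUIRED_ENV_VARS = (
    "AGENT_ID",
    "AGENT_TOKEN",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "ELEVENLABS_API_KEY",
)


def missing_env_vars(names: Iterable[str] = REQUIRED_ENV_VARS,
                     env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("All required environment variables are set.")
```

Calling `missing_env_vars()` after load_dotenv() and before asyncio.run(main()) would make the agent fail fast with a readable message.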
Best Practices for Text-to-Speech
To keep your agent's speech clear and responsive over the network:
1. Optimize Output Formats
Telephony networks require specific sample rates (commonly 8 kHz or 16 kHz PCM). Configure your TTS provider to output raw PCM at a matching rate (e.g., pcm_16000) so that the audio does not have to be decoded or resampled before playback, which adds latency.
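To make the sample-rate point concrete, the arithmetic below shows how much raw PCM data each audio frame carries. It assumes 16-bit mono samples and 20 ms frames, which are common in telephony but are assumptions here rather than values documented for Piopiy. `pcm_frame_bytes` is a hypothetical helper.

```python
def pcm_frame_bytes(sample_rate_hz: int, frame_ms: int = 20,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes in one frame of uncompressed PCM audio.

    Defaults assume 16-bit (2-byte) mono samples and 20 ms frames.
    """
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bytes_per_sample * channels


if __name__ == "__main__":
    for rate in (8000, 16000):
        print(f"{rate} Hz -> {pcm_frame_bytes(rate)} bytes per 20 ms frame")
```

Under these assumptions an 8 kHz stream carries 320 bytes per frame and a 16 kHz stream carries 640, so audio produced at a higher rate than the transport expects has to be resampled on every frame.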
2. Handle Streaming Latency
Piopiy handles token chunking natively, but you should still select a "Turbo" or otherwise latency-optimized model for interactive AI flows (e.g., ElevenLabs eleven_turbo_v2). Avoid models designed strictly for high-fidelity audiobook narration: their high time-to-first-byte (TTFB) leaves audible pauses that break the conversational flow.
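When comparing models, it helps to measure time-to-first-byte directly instead of relying on model names. The sketch below times any asynchronous stream of audio chunks. It does not call a real provider: `fake_tts_stream` is a stand-in that simulates a delay, and `measure_ttfb` is a hypothetical helper you would point at whatever async audio iterator your setup exposes.

```python
import asyncio
import time
from typing import AsyncIterator, Tuple


async def measure_ttfb(audio_stream: AsyncIterator[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk, total bytes) for an audio stream."""
    start = time.perf_counter()
    ttfb = None
    total_bytes = 0
    async for chunk in audio_stream:
        if ttfb is None:
            # First chunk received: record how long the caller would have waited.
            ttfb = time.perf_counter() - start
        total_bytes += len(chunk)
    if ttfb is None:
        raise RuntimeError("stream produced no audio")
    return ttfb, total_bytes


async def fake_tts_stream(first_chunk_delay: float,
                          chunks: int = 3) -> AsyncIterator[bytes]:
    """Stand-in for a provider stream: waits, then yields 320-byte chunks."""
    await asyncio.sleep(first_chunk_delay)
    for _ in range(chunks):
        yield b"\x00" * 320


if __name__ == "__main__":
    ttfb, size = asyncio.run(measure_ttfb(fake_tts_stream(0.05)))
    print(f"TTFB: {ttfb * 1000:.0f} ms, received {size} bytes")
```

Running the same measurement against two candidate models with an identical sentence gives a like-for-like view of which one keeps the conversation responsive.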
3. Voice Consistency
Always pair the emotional tone and gender of the selected TTS voice with the persona defined in your VoiceAgent instructions to maintain a cohesive user experience.
What's Next
- Context Management: Learn how to structure and manage conversation context in the Piopiy SDK.
- Speech-to-Text: Learn how Piopiy converts caller audio to text.