
Overview of Piopiy AI

What You'll Learn

This guide explains the foundational concepts of the Piopiy AI architecture for building real-time voice agents over telephony and WebRTC. Understanding these concepts will help you design reliable, highly responsive, full-duplex conversational applications that handle live phone calls and web-based audio streams seamlessly.

Why Voice AI is Challenging

Building real-time voice agents is fundamentally different from building text-based chat applications. It presents unique hurdles:

  • Ultra-low Latency: Voice agents need to respond within ~500ms to feel natural. Every millisecond counts.
  • Continuous Streaming: Audio data streams constantly over telephony networks and WebSocket/WebRTC connections. You cannot wait for a complete sentence before processing.
  • Full-Duplex Communication: The user and the agent might speak simultaneously. The system needs to handle interruptions (barge-in) accurately.
  • Complex Orchestration: You must coordinate Speech-to-Text (STT), Large Language Models (LLMs), Text-to-Speech (TTS), and Voice Activity Detection (VAD) concurrently without blocking the audio stream.
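One way to reason about the ~500 ms target above is as a budget split across the pipeline stages; the breakdown below is an illustrative sketch (the per-stage numbers are assumptions, not Piopiy measurements):

```python
# Illustrative latency budget for one voice turn (all numbers are assumptions).
BUDGET_MS = 500  # target mouth-to-ear response time

stage_budgets_ms = {
    "vad_endpointing": 100,   # detecting that the user stopped speaking
    "stt_final": 100,         # finalizing the transcript
    "llm_first_token": 200,   # time to first LLM token
    "tts_first_audio": 80,    # time to first synthesized audio chunk
    "network_transport": 20,  # telephony/WebRTC hops
}

total = sum(stage_budgets_ms.values())
assert total <= BUDGET_MS, f"over budget by {total - BUDGET_MS} ms"
print(f"total: {total} ms of {BUDGET_MS} ms budget")
```

The point of the exercise: any stage that waits for a full sentence instead of streaming blows the whole budget, which is why every stage below operates on partial results.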

Piopiy's Solution

Piopiy AI solves these challenges by providing a robust, multi-transport orchestration layer. Instead of forcing developers to manually manage WebSocket streams and audio buffers, Piopiy abstracts the complexity into high-level constructs (Agent and VoiceAgent).

It securely handles the underlying signaling, audio routing, and VAD, while allowing you to declaratively combine best-in-class AI models across the STT, LLM, and TTS layers using a unified Action pipeline.

Core Architecture Concepts

Agent and Session Management

The Agent is your application's secure bridge to external transports, whether it's the Piopiy telephony network or a WebRTC client. Configured with your unique credentials or signaling endpoints, it listens for incoming connections and automatically triggers a new session whenever a user dials your phone number or connects via web browser.

This guarantees that every concurrent call or web session runs in its own isolated asynchronous execution context, receiving context-specific metadata such as the caller's ID or intent.
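The isolation model can be sketched with plain asyncio (this is a conceptual illustration, not the Piopiy API): each incoming connection spawns its own task carrying its own call metadata, so concurrent sessions never share state.

```python
import asyncio

# Conceptual sketch (plain asyncio, not the Piopiy API): each incoming
# connection spawns an isolated task carrying its own call metadata.

async def handle_session(call_metadata: dict) -> str:
    # Each session sees only its own context; nothing is shared across calls.
    caller = call_metadata["caller_id"]
    await asyncio.sleep(0)  # stand-in for the real conversation loop
    return f"session finished for {caller}"

async def main() -> list[str]:
    calls = [{"caller_id": "+15550001"}, {"caller_id": "+15550002"}]
    # Concurrent calls run as independent tasks on one event loop.
    return await asyncio.gather(*(handle_session(c) for c in calls))

results = asyncio.run(main())
print(results)
```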

VoiceAgent and Action Orchestration

Within each session, the VoiceAgent acts as the conversational brain. It manages the persona, system prompts, and overall conversational state.

To make the agent functional, you define an Action. The action is a straightforward configuration where you plug in your chosen AI services (e.g., Deepgram for STT, OpenAI for LLM, Cartesia for TTS) alongside operational parameters like allow_interruptions and VAD sensitivity. The Action engine continuously coordinates the bidirectional flow of audio and text between these providers.
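The shape of such a configuration might look like the sketch below. The field and provider names are illustrative assumptions for this guide, not the exact Piopiy API:

```python
from dataclasses import dataclass

# Illustrative shape of an Action configuration; field names and defaults
# are assumptions for this sketch, not the exact Piopiy API.

@dataclass
class Action:
    stt: str                      # e.g. a Deepgram STT service
    llm: str                      # e.g. an OpenAI LLM service
    tts: str                      # e.g. a Cartesia TTS service
    allow_interruptions: bool = True
    vad_sensitivity: float = 0.5  # 0.0 (lenient) .. 1.0 (aggressive)

action = Action(stt="deepgram", llm="openai", tts="cartesia")
assert action.allow_interruptions
```

Because the Action is declarative, swapping one provider for another is a one-line change rather than a rewrite of the streaming logic.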

Service Providers

Piopiy is provider-agnostic. STT, LLM, and TTS capabilities are provided via standardized service classes. You can use ultra-fast APIs, high-quality enterprise engines, or fully local open-source models (such as Ollama and Whisper), substituting one for another without changing your core agent logic.
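The mechanism behind this interchangeability is a shared interface per capability. The sketch below illustrates the idea with a minimal STT contract; the class and method names are assumptions, not Piopiy's actual service classes:

```python
from typing import Protocol

# Sketch of provider-agnostic service classes: any STT engine satisfying a
# shared interface can be swapped in without touching agent logic.
# Class and method names here are illustrative assumptions.

class STTService(Protocol):
    def transcribe(self, audio_chunk: bytes) -> str: ...

class CloudSTT:
    def transcribe(self, audio_chunk: bytes) -> str:
        return "<cloud transcript>"  # stand-in for a hosted API call

class LocalWhisperSTT:
    def transcribe(self, audio_chunk: bytes) -> str:
        return "<local transcript>"  # stand-in for a local Whisper model

def run_agent(stt: STTService, audio: bytes) -> str:
    return stt.transcribe(audio)  # agent logic is identical either way

assert run_agent(CloudSTT(), b"...") == "<cloud transcript>"
assert run_agent(LocalWhisperSTT(), b"...") == "<local transcript>"
```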

Voice AI Processing Flow

When a user speaks during a live call or WebRTC session, the VoiceAgent orchestrates the interaction asynchronously:

  1. Audio Input: The Piopiy platform routes raw audio directly to your agent from the telephony network or WebRTC client.
  2. Speech Recognition: The defined STT service continuously transcribes the live audio stream.
  3. Context Management: The VoiceAgent appends the transcription to the conversation history alongside your system instructions.
  4. Language Processing: The LLM processes the history and streams back response tokens.
  5. Speech Synthesis: The TTS service synthesizes the streaming text back into natural-sounding audio.
  6. Audio Output: The optimized audio is streamed back through the transport to the caller's phone or browser.
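The middle of this flow (steps 2–5) can be pictured as chained streams, where each stage consumes a stream and yields a stream so that nothing waits for a complete sentence. The sketch below uses plain async generators with stand-in logic; the function names are illustrative, not the Piopiy API:

```python
import asyncio
from typing import AsyncIterator

# Conceptual sketch of steps 2-5: each stage consumes a stream and yields
# a stream, so audio flows through without waiting for complete sentences.
# Function names and logic are illustrative, not the Piopiy API.

async def stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    async for chunk in audio:
        yield chunk.decode()            # stand-in for live transcription

async def llm(text: AsyncIterator[str]) -> AsyncIterator[str]:
    async for sentence in text:
        yield f"reply to: {sentence}"   # stand-in for streamed tokens

async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    async for token in tokens:
        yield token.encode()            # stand-in for synthesized audio

async def main() -> list[bytes]:
    async def mic() -> AsyncIterator[bytes]:
        for chunk in (b"hello", b"world"):
            yield chunk                 # stand-in for step 1, raw audio in
    # Step 6 would stream these frames back over the transport.
    return [frame async for frame in tts(llm(stt(mic())))]

frames = asyncio.run(main())
print(frames)
```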

If the user interrupts the agent (barge-in), the built-in VAD immediately stops the TTS and LLM streams, allowing the agent to listen and re-evaluate the context.
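Barge-in maps naturally onto task cancellation: when VAD detects user speech, the in-flight playback task is cancelled. A minimal sketch of that pattern with plain asyncio (not the Piopiy internals):

```python
import asyncio

# Sketch of barge-in handling: when VAD detects user speech, the in-flight
# TTS playback task is cancelled so the agent can listen again.

async def speak() -> None:
    try:
        await asyncio.sleep(10)  # stand-in for long TTS playback
    except asyncio.CancelledError:
        print("playback stopped")
        raise

async def main() -> bool:
    playback = asyncio.create_task(speak())
    await asyncio.sleep(0.01)    # VAD fires: the user started talking
    playback.cancel()            # immediately stop the agent's audio
    try:
        await playback
    except asyncio.CancelledError:
        pass
    return playback.cancelled()

interrupted = asyncio.run(main())
assert interrupted
```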

Dynamic Flexibility

Piopiy's architecture is designed for adaptability in production.

  • Tool Calling: Connect your VoiceAgent directly to your backend APIs using structured JSON schemas, enabling the agent to look up orders, book appointments, or trigger workflows natively.
  • Provider Switching: Using the ServiceSwitcher, you can dynamically swap STT or TTS providers mid-conversation based on context, user preference, or failover requirements.
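Tool definitions of the kind described above typically take the form of a JSON schema the LLM can fill in; the sketch below shows that shape (the tool name and fields are hypothetical, and the exact registration call in Piopiy is not shown here):

```python
import json

# Illustrative tool definition using a structured JSON schema. The tool
# name and fields are hypothetical; the exact registration API in Piopiy
# is not shown here, only the schema shape common to LLM tool calling.

lookup_order_tool = {
    "name": "lookup_order",
    "description": "Fetch the status of a customer order by its ID.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order number"},
        },
        "required": ["order_id"],
    },
}

# A model tool call arrives as JSON arguments matching the schema:
call_args = json.loads('{"order_id": "A-1042"}')
assert call_args["order_id"] == "A-1042"
```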

What's Next

Now that you understand the core underlying architecture, it's time to put it to use.