Pipeline & Action Processing
Understanding the Architecture
Unlike traditional AI pipelines that require manually chaining individual node processors and managing complex queues, Piopiy abstracts audio routing and context aggregation into a single, declarative Action engine.
The core of Piopiy's execution model is the VoiceAgent. Instead of manipulating raw frames, you configure the VoiceAgent with best-in-class AI services, and Piopiy automatically constructs the underlying, highly optimized pipeline to coordinate the full-duplex conversational flow.
The Action Engine
The Action engine is responsible for orchestrating the bidirectional streaming of audio and text between your chosen providers.
# 1. Initialize Providers
stt = DeepgramSTTService(api_key="your-key")
llm = OpenAILLMService(api_key="your-key")
tts = CartesiaTTSService(api_key="your-key")
# 2. Define the Action
await voice_agent.Action(
    stt=stt,
    llm=llm,
    tts=tts,
    allow_interruptions=True,
    enable_metrics=True,
)
# 3. Start Execution
await voice_agent.start()
When voice_agent.start() runs, the Action engine automatically links the inputs and outputs of each service together.
How Data Flows Through the Action
Under the hood, the Piopiy engine manages a strictly ordered processing flow optimized for ultra-low latency:
1. Audio Ingestion (Transport -> STT)
Raw audio streams in natively from the TeleCMI telephony network or WebRTC client. The Action engine immediately routes chunks of this audio to your configured Speech-to-Text (STT) service.
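As a rough illustration of this ingestion step, incoming PCM audio is typically split into small fixed-duration frames before being streamed to the STT provider. The frame size and sample rate below are common defaults for telephony audio, not documented Piopiy values:

```python
# Illustrative only: splitting raw 16-bit mono PCM into fixed-duration
# frames suitable for streaming to an STT service. The 16 kHz sample rate
# and 20 ms frame size are assumptions, not Piopiy defaults.

def chunk_audio(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20) -> list[bytes]:
    """Split 16-bit mono PCM into fixed-duration frames."""
    bytes_per_frame = sample_rate * 2 * frame_ms // 1000  # 2 bytes per sample
    return [pcm[i:i + bytes_per_frame] for i in range(0, len(pcm), bytes_per_frame)]

frames = chunk_audio(b"\x00" * 1280)  # 40 ms of silence at 16 kHz -> 2 frames
```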
2. Context Aggregation (STT -> LLM)
As the STT service streams back transcribed text, Piopiy automatically formats these partial and final transcriptions into the specific prompt structure required by your chosen Large Language Model (LLM). It seamlessly appends this user input to the agent's conversational history.
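Conceptually, the aggregation step behaves like the following sketch: partial transcriptions are held back, and only final transcriptions are committed to an OpenAI-style message history. The helper name and message format here are illustrative, not Piopiy internals:

```python
# Illustrative sketch of context aggregation: only final STT transcriptions
# are appended to the conversational history sent to the LLM.

def append_user_turn(history: list[dict], transcript: str, is_final: bool) -> list[dict]:
    """Commit a transcription to the history only once it is final."""
    if is_final and transcript.strip():
        history.append({"role": "user", "content": transcript.strip()})
    return history

history = [{"role": "system", "content": "You are a helpful voice agent."}]
append_user_turn(history, "hello th", is_final=False)   # partial: ignored
append_user_turn(history, "hello there", is_final=True) # final: committed
```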
3. Language Processing (LLM -> TTS)
The LLM processes the conversational context and begins streaming response tokens back. The Action engine catches these tokens on the fly and routes them to the Text-to-Speech (TTS) engine.
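A common low-latency pattern for this hand-off is to buffer streamed LLM tokens until a sentence boundary, then flush each complete sentence to the TTS engine so synthesis can begin before the full response arrives. The sketch below shows the idea; it is not Piopiy's actual buffering logic:

```python
# Illustrative only: buffer streamed LLM tokens into complete sentences so
# TTS can start speaking the first sentence while later tokens still stream in.

def sentences_from_tokens(tokens):
    """Yield a complete sentence as soon as a terminator token arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing fragment
        yield buffer.strip()

chunks = list(sentences_from_tokens(["Hi", " there", ".", " How", " can", " I", " help", "?"]))
```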
4. Audio Synthesis & Egress (TTS -> Transport)
The TTS engine continuously converts the incoming text tokens into audible speech frames. Piopiy routes these synthesized audio frames back out through the TeleCMI transport to the caller's device.
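The four stages above can be sketched end to end as a chain of concurrent workers connected by queues. This toy model only mirrors the Transport -> STT -> LLM -> TTS -> Transport shape of the flow; every name in it is illustrative:

```python
# A toy end-to-end model of the four-stage flow, using asyncio queues as
# stand-ins for the real streaming transports. Illustrative only.
import asyncio

async def pipeline(audio_frames):
    transcripts, replies, out_audio = asyncio.Queue(), asyncio.Queue(), []

    async def stt():                                   # 1. Transport -> STT
        for frame in audio_frames:
            await transcripts.put(f"text<{frame}>")
        await transcripts.put(None)                    # sentinel: end of stream

    async def llm():                                   # 2/3. Context -> LLM
        while (text := await transcripts.get()) is not None:
            await replies.put(f"reply<{text}>")
        await replies.put(None)

    async def tts():                                   # 4. TTS -> Transport
        while (reply := await replies.get()) is not None:
            out_audio.append(f"speech<{reply}>")

    await asyncio.gather(stt(), llm(), tts())
    return out_audio

result = asyncio.run(pipeline(["frame1", "frame2"]))
```

Because each stage consumes from its predecessor's queue as soon as data arrives, later stages start work before earlier stages finish, which is the source of the pipeline's low end-to-end latency.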
Voice Activity Detection (VAD) & Interruptions
One of the most complex parts of voice AI is handling user interruptions (barge-in). Piopiy's Action engine handles this automatically when allow_interruptions=True.
- A built-in Voice Activity Detector (VAD) monitors the incoming audio stream from the user.
- If the user begins speaking while the VoiceAgent is currently playing TTS audio, the VAD triggers an immediate interruption signal.
- The Action engine instantly flushes the TTS buffer, cancels the in-flight LLM generation, and transitions the agent back to a "listening" state to ingest the new speech.
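The cancel-and-flush behavior can be modeled with task cancellation: when the VAD fires, the playback task is cancelled and its remaining audio buffer is discarded. This is a minimal sketch of the pattern, not Piopiy's implementation:

```python
# Illustrative sketch of barge-in handling: when the VAD detects user speech
# during TTS playback, the playback task is cancelled and its buffer flushed.
import asyncio

async def play_tts(buffer: list):
    try:
        while buffer:
            buffer.pop(0)               # simulate sending one audio chunk
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        buffer.clear()                  # flush remaining audio on barge-in
        raise

async def demo():
    buffer = ["chunk"] * 100
    playback = asyncio.create_task(play_tts(buffer))
    await asyncio.sleep(0.03)           # ...user starts speaking (VAD fires)
    playback.cancel()                   # interruption signal
    try:
        await playback
    except asyncio.CancelledError:
        pass                            # agent returns to "listening" state
    return buffer

remaining = asyncio.run(demo())         # buffer fully flushed
```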
Dynamic Service Switching
While the execution flow is strictly ordered (STT -> LLM -> TTS), the services powering that flow are dynamic.
Using Piopiy's ServiceSwitcher, you can inject multiple service providers into the Action engine and swap them mid-conversation. For example, you can switch from OpenAI to Anthropic, or change TTS voices from a friendly tone to an urgent tone, without ever halting the underlying pipeline execution.
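The switching idea can be sketched as a thin wrapper that routes each call to whichever registered provider is currently active. Note that the class below is a hypothetical stand-in: its constructor, switch() method, and call signature are illustrative assumptions, not Piopiy's actual ServiceSwitcher API:

```python
# Hypothetical sketch of mid-conversation provider switching. All names and
# signatures here are illustrative, not Piopiy's real ServiceSwitcher API.

class ServiceSwitcher:
    def __init__(self, **providers):
        self._providers = providers
        self._active = next(iter(providers))  # first registered provider

    def switch(self, name: str):
        if name not in self._providers:
            raise KeyError(f"unknown provider: {name}")
        self._active = name

    def __call__(self, prompt: str) -> str:
        return self._providers[self._active](prompt)

llm = ServiceSwitcher(
    openai=lambda p: f"openai:{p}",
    anthropic=lambda p: f"anthropic:{p}",
)
first = llm("hello")        # served by the openai provider
llm.switch("anthropic")     # swap mid-conversation; pipeline keeps running
second = llm("hello")       # served by the anthropic provider
```

Because the pipeline holds a reference to the switcher rather than to a concrete provider, the swap takes effect on the next call without tearing down the audio flow.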
Key Takeaways
- Piopiy simplifies voice AI by abstracting complex frame passing into a unified Action configuration.
- The VoiceAgent automatically handles latency optimization, context aggregation, and VAD interruptions.
- The execution flow always follows the core Transport -> STT -> Context -> LLM -> TTS -> Transport pattern, executed asynchronously.
- You only need to call voice_agent.start() to start the engine; Piopiy manages the rest of the lifecycle.
What's Next
- Getting Started: Ready to build? Follow the quickstart guide.
- Telephony: Learn deployment and routing patterns for real phone-call agents.