Speech Input & Turn Detection
Overview
In conversational voice AI, knowing exactly when a user starts speaking and when they finish their thought is critical for a natural experience.
Traditional pipelines require you to wire up a Voice Activity Detection (VAD) model by hand, pipe its output into a turn-detection strategy, and flush the audio buffers yourself when an interruption occurs.
Piopiy abstracts this entire process. Speech input, turn detection, and interruption handling (barge-in) are managed automatically by the Action engine using two simple parameters: vad and allow_interruptions.
Voice Activity Detection (VAD)
Voice Activity Detection is the engine that monitors the incoming audio stream from the TeleCMI transport and determines whether each audio packet contains human speech or only background noise.
By default, VAD is enabled (vad=True) whenever you start an Action.
await voice_agent.Action(
stt=stt,
llm=llm,
tts=tts,
vad=True, # Enabled by default
allow_interruptions=True
)
What VAD Does
- Listens: Continuously scans the incoming audio stream.
- Detects: Identifies the exact millisecond human speech starts.
- Triggers: Signals the Action pipeline to start transcribing the speech via your chosen STT provider.
- Concludes: Detects when the user has stopped speaking, signaling the LLM that the user's turn is complete.
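The listen/detect/trigger/conclude cycle above can be sketched as a toy state machine. This is an illustrative energy-based VAD, not Piopiy's implementation; the `EnergyVAD` class and the frame values are hypothetical.

```python
class EnergyVAD:
    """Toy VAD: flags a frame as speech when its energy crosses a threshold."""

    def __init__(self, threshold=0.5, silence_frames_to_end=3):
        self.threshold = threshold
        self.silence_frames_to_end = silence_frames_to_end
        self.in_speech = False
        self._silence_run = 0
        self.events = []

    def process(self, frame_energy):
        is_speech = frame_energy >= self.threshold
        if is_speech and not self.in_speech:
            self.in_speech = True                # Detects: speech started
            self._silence_run = 0
            self.events.append("speech_start")   # Triggers: STT begins
        elif is_speech:
            self._silence_run = 0                # Still talking: reset silence
        elif self.in_speech:
            self._silence_run += 1
            if self._silence_run >= self.silence_frames_to_end:
                self.in_speech = False           # Concludes: user's turn is over
                self.events.append("turn_end")

vad = EnergyVAD(threshold=0.5, silence_frames_to_end=2)
for energy in [0.1, 0.7, 0.8, 0.2, 0.1, 0.05]:
    vad.process(energy)
print(vad.events)  # ['speech_start', 'turn_end']
```

A real VAD runs a neural classifier on short audio frames rather than raw energy, but the gating logic (start on the first speech frame, end after sustained silence) is the same shape.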
Handling Interruptions (Barge-in)
The most difficult part of human-to-AI voice interaction is handling interruptions ("barge-in"). If the AI is currently speaking a long sentence, and the human user says "Wait, stop", the AI must instantly cease speaking and listen to the new instruction.
In Piopiy, this complex orchestration is achieved simply by setting allow_interruptions=True.
How Interruptions Work
When allow_interruptions=True, Piopiy monitors the VAD stream even while the agent is speaking.
- The VoiceAgent is currently playing TTS audio down the phone line.
- The user suddenly begins speaking.
- The VAD detects human speech (vad_params.threshold is met).
- The interruption occurs:
  - The Action engine instantly halts the outgoing TTS audio.
  - Any in-flight LLM generations are immediately cancelled.
  - The agent's conversational context registers that the bot was interrupted.
- The pipeline instantly pivots back to ingesting the user's new speech.
If allow_interruptions=False, the AI will ignore any user speech that occurs while it is currently talking, acting like a traditional walkie-talkie (half-duplex).
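Mechanically, barge-in amounts to cancelling the in-flight playback task the moment speech is detected. The sketch below illustrates this with asyncio task cancellation; it is a conceptual model, not Piopiy's internals, and `play_tts`/`conversation` are hypothetical names.

```python
import asyncio

async def play_tts(log):
    try:
        for chunk in range(10):
            log.append(f"tts_chunk_{chunk}")
            await asyncio.sleep(0.01)   # simulate streaming audio out
    except asyncio.CancelledError:
        log.append("tts_halted")        # engine halts the outgoing audio
        raise

async def conversation():
    log = []
    speaking = asyncio.create_task(play_tts(log))
    await asyncio.sleep(0.035)          # ...user starts talking mid-sentence
    speaking.cancel()                   # VAD fired: barge-in
    try:
        await speaking
    except asyncio.CancelledError:
        pass
    log.append("listening_to_user")     # pivot back to ingesting speech
    return log

log = asyncio.run(conversation())
print(log[-2:])  # ['tts_halted', 'listening_to_user']
```

With allow_interruptions=False, the cancel step simply never fires: playback runs to completion and any overlapping user speech is discarded.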
Fine-Tuning Turn Detection
Piopiy's default VAD settings are highly optimized for standard phone calls. However, you might want to adjust how sensitive the agent is to interruptions or how long it waits for the user to finish speaking.
You can modify these behaviors using vad_params:
vad_params = {
"threshold": 0.5, # Detection sensitivity (0.0 to 1.0)
"prefix_padding_ms": 300, # Audio buffered before speech detected
"silence_duration_ms": 500 # Silence required to confirm turn end
}
await voice_agent.Action(
stt=stt,
llm=llm,
tts=tts,
vad=True,
allow_interruptions=True,
vad_params=vad_params
)
Key Parameters
threshold (Sensitivity)
- Lower values (e.g. 0.3): Highly sensitive. Will detect very quiet speech, but might accidentally trigger on background noise or breathing.
- Higher values (e.g. 0.8): Less sensitive. Requires the user to speak clearly and loudly to trigger the VAD. Best for noisy environments.
silence_duration_ms (Turn End)
- What it does: Determines how many milliseconds of silence must pass before Piopiy decides the user has finished their sentence.
- Low values (e.g. 300ms): The agent responds extremely fast, but might accidentally cut the user off if they pause to take a breath mid-sentence.
- High values (e.g. 1500ms): The agent will comfortably let the user pause, but the resulting conversation will feel slower and more deliberate.
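Putting the two parameters together, here are two hedged vad_params presets. The keys follow the snippet above; the specific values are tuning suggestions, not documented defaults.

```python
# Rapid-fire IVR-style agent: quick turn-taking, tolerant of line noise.
fast_turns = {
    "threshold": 0.6,            # slightly less sensitive: ignore background noise
    "prefix_padding_ms": 200,    # small pre-roll so word onsets aren't clipped
    "silence_duration_ms": 400,  # end the turn quickly
}

# Patient, empathetic agent: let the caller pause mid-thought.
patient_turns = {
    "threshold": 0.4,            # more sensitive: catch quiet speech
    "prefix_padding_ms": 300,
    "silence_duration_ms": 1200, # wait longer before treating silence as done
}

# A caller who pauses 800 ms mid-sentence is cut off by the fast preset
# but not by the patient one:
pause_ms = 800
print(pause_ms >= fast_turns["silence_duration_ms"])     # True  -> turn ended
print(pause_ms >= patient_turns["silence_duration_ms"])  # False -> still their turn
```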
Best Practices
- Use allow_interruptions=True (default): Always leave this enabled for conversational AI to ensure the interaction feels natural.
- Tweak silence_duration_ms based on use case:
  - For rapid-fire customer service (e.g., collecting an account number), a short silence duration (400ms) keeps the call moving quickly.
  - For empathetic use cases (e.g., therapy bots or taking complex orders), a longer silence duration (1000ms+) prevents the bot from aggressively cutting off the user.
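The practices above can be wired back into the Action call from earlier sections. This is a sketch under assumptions: the helper function name is hypothetical, stt/llm/tts construction is elided, and the values are the rapid-fire suggestions, not defaults.

```python
import asyncio
from unittest.mock import AsyncMock

async def start_customer_service_agent(voice_agent, stt, llm, tts):
    # Rapid-fire flow (e.g. collecting an account number): short turn-end,
    # barge-in left on so the caller can always talk over the bot.
    await voice_agent.Action(
        stt=stt,
        llm=llm,
        tts=tts,
        vad=True,
        allow_interruptions=True,
        vad_params={"threshold": 0.5, "silence_duration_ms": 400},
    )

# Exercise the wiring with a mock agent to inspect what Action receives.
agent = AsyncMock()
asyncio.run(start_customer_service_agent(agent, stt="stt", llm="llm", tts="tts"))
kwargs = agent.Action.call_args.kwargs
print(kwargs["allow_interruptions"])  # True
```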
What's Next
- See the complete Server API Reference for all available configurations.