Assistant Modes

Each AI assistant on Larsa AI speaks in one of two modes.
The mode determines how a caller’s speech is understood and how the assistant’s reply is generated:

1. Pipeline

Label in UI: Pipeline
How it works: Speech-to-Text → LLM → Text-to-Speech
Latency: ~800 – 1500 ms (varies by language & model)
Best for: Complex reasoning, dynamic prompts, multi-sentence replies

Overview:
Pipeline mode first transcribes the caller’s words into text, processes them with the language model, then converts the response back into audio. This traditional approach offers maximum flexibility:

Works with all voices in the library (including custom-cloned voices).
Great for long-form, detailed, or paragraph-style answers.
Easily references earlier context and injects variables.
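
The three stages described above can be sketched as a simple chain of functions. This is only an illustrative outline, not Larsa AI's implementation: `run_pipeline_turn` and the three `fake_*` stand-ins are hypothetical names invented here so the example runs without any real speech or LLM service.

```python
from typing import Callable


def run_pipeline_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Handle one caller turn: audio in, audio out, with text in between."""
    text = transcribe(audio)       # Speech-to-Text
    reply = generate_reply(text)   # LLM
    return synthesize(reply)       # Text-to-Speech


# Hypothetical stand-ins: they only shuffle bytes and strings around.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate_reply(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_pipeline_turn(
    b"hello", fake_transcribe, fake_generate_reply, fake_synthesize
)
```

Because every stage passes plain text to the next one, any stage (for example, the voice used for synthesis) can be swapped without touching the others, which is where the flexibility of this mode comes from. The stages run one after another, so their delays add up.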

When to choose Pipeline:

You need rich, multi-sentence answers (e.g., support queries, detailed explanations).
The assistant must reason over structured data or complex prompts.
You want complete control over the spoken voice (clone or brand voice).

2. Speech-to-Speech (Multimodal)

Label in UI: Speech-to-Speech
How it works: Direct speech-to-speech generation (no text stage)
Latency: ~300 – 600 ms (ultra-low)
Best for: Natural back-and-forth, short and reactive replies

Overview:
Speech-to-Speech mode skips the separate transcription and text-to-speech stages. A multimodal model listens to the caller and responds directly in audio, which makes conversations flow more smoothly:

Near-instant responses with fast turn-taking.
Naturally expressive speech (intonation, pauses, fillers).
Limited voice options for now, but expanding regularly.

When to choose Speech-to-Speech:

You want snappy, conversational flow (sales, booking confirmations).
Responses are typically short sentences or quick acknowledgments.
You’re fine with the system-provided voices in exchange for faster interaction.

Note: If you need a custom cloned voice or advanced prompt logic, Pipeline is still the better choice.
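
The selection criteria from both "When to choose" lists can be condensed into a small decision helper. This is a rough sketch of the guidance in this article, not a Larsa AI feature: the `Requirements` class, its field names, and `choose_mode` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical checklist of what an assistant needs."""
    needs_custom_voice: bool = False       # cloned or brand voice
    needs_complex_reasoning: bool = False  # structured data, advanced prompt logic
    needs_long_replies: bool = False       # multi-sentence, detailed answers


def choose_mode(req: Requirements) -> str:
    """Return the mode label suggested by the guidance in this article."""
    if (
        req.needs_custom_voice
        or req.needs_complex_reasoning
        or req.needs_long_replies
    ):
        return "Pipeline"
    # Short, reactive replies with a system-provided voice.
    return "Speech-to-Speech"


booking_bot = choose_mode(Requirements())
support_bot = choose_mode(Requirements(needs_long_replies=True))
```

Here `booking_bot` resolves to "Speech-to-Speech" and `support_bot` to "Pipeline". Treat the result as a starting point and confirm it by testing both modes.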

Switching Modes

You can select the mode for each assistant in Assistant → Settings → Voice Engine.
Test both modes to find the best balance of speed and quality for your use case.

Pro Tip: Record two calls, one in each mode, and compare perceived latency and engagement to decide which works best.
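
One way to make that comparison less subjective is to note, for each turn in the two recordings, the gap between the caller finishing and the assistant starting to speak, then summarize the gaps. The sketch below assumes you have timed those gaps by hand. The function name is hypothetical and the numbers are made-up placeholders, not measurements from Larsa AI.

```python
from statistics import median


def summarize_latency(gaps_ms: list[float]) -> dict[str, float]:
    """Summarize response gaps (in milliseconds) from one recorded call."""
    return {"median_ms": median(gaps_ms), "worst_ms": max(gaps_ms)}


# Placeholder values for illustration only; replace with your own timings.
pipeline_gaps = [900, 1200, 1100, 1400]
speech_to_speech_gaps = [350, 500, 400, 450]

pipeline_summary = summarize_latency(pipeline_gaps)
speech_to_speech_summary = summarize_latency(speech_to_speech_gaps)
```

The median shows the typical delay a caller experiences, while the worst gap shows the longest silence, which tends to be what callers notice most.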
