

The whole game in AI voice assistant architecture is latency. A voice assistant is four stages chained together: wake-word detection, speech-to-text (STT), the LLM turn, and text-to-speech (TTS). Each stage adds delay, and those delays stack end to end. If the user finishes speaking and then waits two seconds in silence, it feels broken, no matter how good the answer is. So the takeaway up front: stream everything, and start speaking the first sentence of the reply before the model has finished generating the rest. Everything else in the design follows from protecting that latency budget. I built one of these for my own desk, and below is the pipeline, a sketch of the LLM turn against the Claude API, and the tradeoffs that actually bit me.

Before you commit, the single biggest decision is where each stage runs. I wrote up that whole question separately in local vs cloud AI assistants, and if you are doing the LLM stage from a backend you control, the request shape is the same one I cover in AI features in Laravel with the Anthropic API.

What are the four stages, and where does the time go?

Think of it as a relay. Audio comes in continuously, and you want the reply coming back out as audio as fast as possible. Here is the chain, with the rough budget I aim for from end of speech to perceived first audio:

  • Wake word — runs on-device, always listening. Detects "hey assistant" and opens the mic. Near-zero added latency; this is a tiny model, not a cloud call.
  • Speech-to-text (STT) — streams audio up and transcribes as the user speaks, so the transcript is nearly done the moment they stop. Endpointing (detecting the end of speech) is the latency-critical part here, not the transcription itself.
  • LLM turn — the model (I use Claude) takes the transcript plus conversation history and streams a reply token by token. Tool calls happen here for anything that touches the real world.
  • Text-to-speech (TTS) — streams the reply text into audio, sentence by sentence, and plays it back as it arrives.

The trick: do not wait for one stage to fully finish before starting the next. Overlap them. The user should hear the first words of the answer roughly 600-900 ms after they stop talking, not after the full reply exists.

Notice that two stages are the usual culprits for dead air: endpointing (deciding the user is done) and the LLM's time-to-first-token. You can tune endpointing aggressively, and you stream the LLM so the first token arrives fast. The rest is plumbing.

How do wake word and streaming STT fit together?

The wake word runs locally and cheaply because it has to run all the time, and streaming your living room to a cloud API 24/7 is a bad idea for both privacy and cost. Once it fires, you start streaming PCM frames from the mic to your STT in real time. The point of streaming STT is that transcription overlaps with the user still talking, so when they go quiet, you already have most of the transcript. Then your endpointer decides "they are done" and you fire the LLM turn immediately.

Endpointing is where most people lose time without realizing it. If you wait for a fixed 1.5 s of silence to be safe, you have just added 1.5 s of dead air to every single turn. I use a shorter silence threshold (around 500-700 ms) combined with the STT's own end-of-utterance signal, and accept the occasional early cutoff as a fair trade against constant lag.
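To make the combined rule concrete, here is a minimal sketch of an endpointer that ends the turn either on a full silence threshold or on a shorter silence once the STT has flagged end of utterance. The `Endpointer` class, its option names, and the exact way the two signals are combined are my assumptions for illustration; they are not from any particular STT library, and the thresholds are placeholders to tune.

```typescript
// Hypothetical endpointer; names and thresholds are illustrative assumptions.
interface EndpointerOptions {
  silenceMs: number;    // trailing silence that ends the turn on its own
  minSilenceMs: number; // shorter silence accepted once STT reports end of utterance
}

class Endpointer {
  private opts: EndpointerOptions;
  private silentMs = 0;
  private heardSpeech = false;

  constructor(opts: EndpointerOptions) {
    this.opts = opts;
  }

  /**
   * Feed one audio frame. Returns true when the turn should end.
   * isSpeech comes from a voice-activity detector; sttFinal is the
   * STT's own end-of-utterance flag for the audio seen so far.
   */
  push(frameMs: number, isSpeech: boolean, sttFinal: boolean): boolean {
    if (isSpeech) {
      this.heardSpeech = true;
      this.silentMs = 0; // any speech resets the silence counter
      return false;
    }
    if (!this.heardSpeech) return false; // ignore silence before the user starts
    this.silentMs += frameMs;
    if (sttFinal && this.silentMs >= this.opts.minSilenceMs) return true;
    return this.silentMs >= this.opts.silenceMs;
  }

  /** Call after each turn so the next one starts clean. */
  reset(): void {
    this.silentMs = 0;
    this.heardSpeech = false;
  }
}
```

With `silenceMs` around 600 and a much smaller `minSilenceMs`, the STT signal lets you cut in early on clean sentence endings while the plain silence timer remains the backstop.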

Four stages, one pipeline: wake word, STT, LLM, and TTS each add latency, so the design goal is to overlap them rather than run them strictly in sequence.

What does the LLM turn actually look like?

This is the core of it. You send the transcript and history to the model, stream the response, and give the model a tools array so it can do real things: set a timer, hit a weather API, control a light. The Claude API endpoint is POST https://api.anthropic.com/v1/messages with x-api-key and anthropic-version: 2023-06-01 headers. You declare tools as a list, each with a name, description, and input_schema (JSON Schema). When the model wants to act, the response comes back with stop_reason "tool_use" and a tool_use block; you run the function and send a tool_result back. Here is the request shape with two tools and streaming on, using the TypeScript SDK:

llm-turn.ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools: Anthropic.Tool[] = [
  {
    name: "set_timer",
    description: "Start a countdown timer. Call this when the user asks to set a timer or alarm for a duration.",
    input_schema: {
      type: "object",
      properties: {
        seconds: { type: "integer", description: "Duration in seconds" },
        label: { type: "string", description: "Optional name, e.g. 'pasta'" },
      },
      required: ["seconds"],
    },
  },
  {
    name: "get_weather",
    description: "Get the current weather for a city. Call this for any question about current conditions or temperature.",
    input_schema: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name, e.g. 'Dhaka'" },
      },
      required: ["city"],
    },
  },
];

// Stream so you can hand text to TTS as it arrives, not after the full reply.
const stream = client.messages.stream({
  model: "claude-opus-4-8",
  max_tokens: 1024,
  system: "You are a terse voice assistant. Answer in one or two short sentences. No markdown, no lists — your words are read aloud.",
  tools,
  messages: [
    { role: "user", content: transcript }, // from your STT stage
  ],
});

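// sentenceBuffer is your own sentence splitter / TTS queue, not part of the SDK.
stream.on("text", (delta) => sentenceBuffer.push(delta)); // feed TTS per sentence
const final = await stream.finalMessage(); // resolves once the reply is complete; check final.stop_reason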

Two things matter for voice specifically. First, the system prompt tells the model to answer in one or two short sentences with no markdown, because the output is spoken, not read. A bulleted list sounds insane out of a speaker. Second, you stream, and you do not feed raw token deltas to TTS; you buffer until you have a complete sentence (look for terminal punctuation) and then send that sentence to TTS while the model keeps generating the next one. That overlap is the single biggest latency win in the whole system.
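The sentence buffering described above can be sketched roughly like this. `SentenceBuffer` is a hypothetical helper, not part of the Anthropic SDK, and the boundary rule (terminal punctuation followed by whitespace) is a deliberately simple assumption that will split wrongly on abbreviations such as "Dr. Smith".

```typescript
// Hypothetical helper: accumulates streamed text deltas and releases whole sentences.
class SentenceBuffer {
  private pending = "";

  /** Append a streamed text delta; return any sentences that are now complete. */
  push(delta: string): string[] {
    this.pending += delta;
    const sentences: string[] = [];
    // A sentence ends at '.', '!' or '?' followed by whitespace. Requiring the
    // whitespace avoids splitting "3.5" when the delta boundary falls after "3.".
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start); // keep the unfinished tail
    return sentences;
  }

  /** Call when the stream ends to release whatever text is left. */
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest ? [rest] : [];
  }
}
```

In the streaming handler you would pass each delta to `push` and hand every returned sentence to your TTS call, then call `flush` once the final message arrives. Because the last sentence usually has no trailing whitespace, it only comes out on `flush`.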

When stop_reason comes back as "tool_use", you execute the function locally, append the model's turn and a tool_result block to the messages array, and call the API again so the model can turn the result into spoken words. That second call is unavoidable extra latency, so for a tool like a timer I play a short confirmation chime immediately on the tool_use, then let the spoken confirmation follow. Here is the manual loop, trimmed to the essentials:

tool-loop.ts
const messages: Anthropic.MessageParam[] = [{ role: "user", content: transcript }];

while (true) {
  const res = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    tools,
    messages,
  });

  if (res.stop_reason !== "tool_use") {
    speak(res.content); // narrow to text blocks, hand to TTS
    break;
  }

  // Run every tool the model asked for, collect results in ONE user message.
  messages.push({ role: "assistant", content: res.content });
  const results: Anthropic.ToolResultBlockParam[] = [];
  for (const block of res.content) {
    if (block.type === "tool_use") {
      const output = await runTool(block.name, block.input); // your function
      results.push({ type: "tool_result", tool_use_id: block.id, content: output });
    }
  }
  messages.push({ role: "user", content: results });
}

If cost or latency at scale matters more than peak reasoning, swap the model string to claude-haiku-4-5 for the routine turns and keep claude-opus-4-8 for the hard ones. The request shape is identical. The exact tool-use mechanics and the available models are in the official Claude tool-use docs.

Local or cloud, and what about barge-in?

Every stage can run locally or in the cloud, and you usually mix them. The tradeoff is the same across all four: local buys you privacy and offline operation but costs you quality and a lot of engineering; cloud buys you quality and simplicity but costs money per request and leaks audio off the device. My rule: wake word always local (it is always listening), STT and TTS often cloud for quality, the LLM cloud because that is where the intelligence lives. If a stage must work with no network or must never send audio off-device, that is the one you pay the local-quality tax on, deliberately.

Barge-in is the feature that separates a toy from something you actually use. It means the user can interrupt the assistant mid-sentence and it stops talking and listens. To support it you keep the mic open while TTS is playing, run wake-word or voice-activity detection on that incoming audio, and the instant you detect speech you stop playback, cancel the in-flight LLM stream, and start a new STT turn. Without barge-in, every time the assistant misunderstands you have to sit through its whole wrong answer before you can correct it, which is maddening.
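One way to wire the "stop playback and cancel the stream" step is a small controller built on the standard `AbortController`. This is a sketch under assumptions: `BargeInController` and the `Playback` interface are hypothetical names standing in for your own audio player, and how the abort signal reaches the LLM request depends on your client (check your SDK's request options for cancellation support).

```typescript
// Hypothetical stand-in for whatever plays TTS audio in your app.
interface Playback {
  stop(): void;
}

class BargeInController {
  private abort: AbortController | null = null;
  private playback: Playback | null = null;

  /** Call when a reply starts. Returns a signal to attach to the in-flight LLM request. */
  beginReply(playback: Playback): AbortSignal {
    this.abort = new AbortController();
    this.playback = playback;
    return this.abort.signal;
  }

  /** Call when the reply finished playing without interruption. */
  endReply(): void {
    this.abort = null;
    this.playback = null;
  }

  /**
   * Call when VAD or the wake word fires on mic audio.
   * Returns true if a reply was interrupted, so the caller can start a new STT turn.
   */
  onSpeechDetected(): boolean {
    if (!this.abort) return false; // nothing is playing; treat as a normal turn
    if (this.playback) this.playback.stop(); // silence the speaker first
    this.abort.abort();                      // then cancel generation
    this.endReply();
    return true;
  }
}
```

Stopping playback before aborting the request is a deliberate ordering: the user notices the speaker going quiet immediately, while the network cancellation can complete in the background.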

Latency is not one number you optimize at the end. It is a budget you spend across four stages, and the only way to stay under it is to overlap them: transcribe while they speak, and speak while the model still thinks.
Md Raihan Hasan

What breaks, and how do you handle it gracefully?

Things fail constantly in a voice pipeline, and the failures are audible. The patterns that kept mine usable:

  • STT mis-hears — confirm destructive actions out loud before doing them ("delete all timers, is that right?") instead of acting on a shaky transcript.
  • Network or API hiccup — have a spoken fallback line ready ("sorry, I missed that") so dead silence never happens; silence reads as a crash to the user.
  • Tool call fails — send the error back as a tool_result with the failure described, and let the model apologize and recover in words rather than throwing an exception into the void.
  • The model gets cut off (stop_reason "max_tokens") — keep max_tokens generous enough for a spoken answer but not so high that a runaway reply rambles for thirty seconds out of your speaker.
  • Endpointing fires early — if the transcript looks like a fragment, prefer asking a short clarifying question over guessing.
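For the failed-tool case in the list above, a wrapper along these lines keeps exceptions out of the loop and returns the failure to the model instead. `runToolSafely` and `ToolFn` are hypothetical names, and the local `ToolResult` interface is a stand-in for the SDK's tool_result block type; `is_error` is the flag the Messages API uses to mark a failed tool result.

```typescript
// Local stand-in for the tool_result block shape sent back to the model.
interface ToolResult {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
}

// Hypothetical signature for your own tool implementations.
type ToolFn = (input: unknown) => Promise<string>;

/** Run a tool and always return a tool_result, even when the tool throws. */
async function runToolSafely(
  toolUseId: string,
  fn: ToolFn,
  input: unknown,
): Promise<ToolResult> {
  try {
    const output = await fn(input);
    return { type: "tool_result", tool_use_id: toolUseId, content: output };
  } catch (err) {
    // Describe the failure in plain words so the model can apologize and recover.
    const reason = err instanceof Error ? err.message : String(err);
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: `Tool failed: ${reason}`,
      is_error: true,
    };
  }
}
```

Keep the error text short and free of stack traces; whatever you put in `content` may end up paraphrased out of the speaker.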

Build it stage by stage. Get wake word and streaming STT producing clean transcripts first, then wire the LLM turn with one tool and verify the tool_use loop, then add streaming TTS and tune the sentence buffering, and only then add barge-in on top. Measure the time from end-of-speech to first audio at every step and protect that number above all else, because that is the number your users feel. Everything in AI voice assistant architecture comes back to that latency budget, and the systems that feel alive are the ones that started speaking before they finished thinking.
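To measure that number, a tiny per-turn timer is enough. This is a minimal sketch: `TurnTimer` and the stage names are my own illustrative choices, and the clock is injectable only so the timer can be tested without real delays.

```typescript
// Hypothetical per-turn timer; create one instance per user turn.
class TurnTimer {
  private marks = new Map<string, number>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  /** Record the moment a stage event happened, e.g. "end_of_speech". */
  mark(stage: string): void {
    this.marks.set(stage, this.now());
  }

  /** Milliseconds between two marks, or undefined if either is missing. */
  between(from: string, to: string): number | undefined {
    const a = this.marks.get(from);
    const b = this.marks.get(to);
    return a === undefined || b === undefined ? undefined : b - a;
  }
}
```

Mark "end_of_speech" when the endpointer fires, "first_token" on the first streamed text delta, and "first_audio" when the first TTS chunk starts playing. Then log `between("end_of_speech", "first_audio")` for every turn so regressions show up as soon as you add a stage.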