System Design Library

Voice Assistant (Alexa)

Turn speech into an action and a spoken reply, fast.


Requirements

Functional

Wake-word
Speech→text (ASR)
Intent/NLU
Skill execution
Text→speech

Non-functional

Low end-to-end latency
Accurate

Scale

Hundreds of millions of devices

The approach

On-device wake-word detection; audio streamed to cloud ASR; NLU extracts intent + slots; routed to a skill/service; response synthesized (TTS) and streamed back. Pipeline stages overlap (stream, don't wait).

Key components

Device (wake-word) → ASR → NLU → skill router → TTS → device

Numbers that matter

End-to-end latency budget is ~1.5–2 seconds for a good UX. A representative split: ~100ms on-device wake-word, ~300ms ASR streaming, ~200ms NLU + skill dispatch, ~400ms skill execution, ~400ms TTS synthesis and buffering (~1.4s in total, leaving the remainder for network hops and jitter)
On-device wake-word models (Alexa, Hey Siri) run at <1% CPU on a DSP with false-positive rates of ~1–2 false triggers per day per device at production-tuned thresholds
TTS streaming (sending audio in chunks as synthesis proceeds) reduces perceived latency by ~30–40% vs. waiting for the full audio before playback begins
ASR word error rate (WER) for English in low-noise environments is now <5% with modern models (Whisper-class); WER rises to 10–20% with heavy accents or background noise
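The latency budget above is simple arithmetic, and writing it down makes the headroom explicit. The sketch below only restates the per-stage figures from the list; the dictionary keys and function names are illustrative, not part of any real system.

```python
# Per-stage budget in milliseconds, taken from the figures listed above.
BUDGET_MS = {
    "wake_word": 100,
    "asr_streaming": 300,
    "nlu_and_dispatch": 200,
    "skill_execution": 400,
    "tts_and_buffering": 400,
}


def total_budget_ms(budget: dict[str, int]) -> int:
    """Sum of all stage budgets, i.e. the latency if no stages overlap."""
    return sum(budget.values())


def headroom_ms(budget: dict[str, int], target_ms: int) -> int:
    """Time left for network hops and jitter under an end-to-end target."""
    return target_ms - total_budget_ms(budget)


print(total_budget_ms(BUDGET_MS))    # 1400
print(headroom_ms(BUDGET_MS, 2000))  # 600
```

Against the tighter 1.5s target the same budget leaves only 100ms of slack, which is why overlapping the stages matters.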

Senior deep-dive

The pipeline must stream, not batch: each stage (ASR, NLU, TTS) must begin processing before the prior stage completes, or every stage's full latency is added to the total instead of overlapping with its neighbors.

On-device wake-word detection is a hard product requirement, not an optimization: always-on cloud listening is a privacy violation, and its battery and bandwidth cost makes it infeasible.

Intent resolution is the correctness bottleneck: ASR accuracy is now >95% for English, but mapping 'turn off the lights in the thing' to the right smart-home entity requires entity resolution that is inherently ambiguous and context-dependent.

Pipeline parallelism is the latency unlock

A naive pipeline waits for ASR to finish before starting NLU, and waits for NLU before starting TTS. The optimized architecture streams tokens between stages: ASR streams partial transcripts every 100–200ms, NLU starts intent scoring on partial results (speculative execution), and TTS begins synthesizing the first sentence of the response while the skill is still computing the second. Streaming gRPC between services is the enabling technology. The risk: a partial transcript that changes meaning at the end (e.g. 'play jazz... NOT') causes the speculative path to be thrown away — acceptable, because it's infrequent.
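A minimal sketch of the speculative path described above, assuming partial transcripts arrive as `(text, is_final)` pairs. The keyword table and the trailing-"not" rule are toy stand-ins for a real NLU model, and all names here are hypothetical.

```python
from typing import Iterable, Iterator, Optional

# Hypothetical keyword table standing in for a real NLU model.
INTENT_KEYWORDS = {"play": "PlayMusic", "timer": "SetTimer", "lights": "SmartHome"}


def score_intent(transcript: str) -> Optional[str]:
    """Toy intent scorer: first keyword match wins, trailing 'not' cancels."""
    words = transcript.lower().split()
    if words and words[-1] == "not":
        return None
    for word in words:
        if word in INTENT_KEYWORDS:
            return INTENT_KEYWORDS[word]
    return None


def speculate(
    partials: Iterable[tuple[str, bool]],
) -> Iterator[tuple[str, Optional[str]]]:
    """Emit 'speculate', 'discard' and 'commit' events as transcripts arrive."""
    speculative: Optional[str] = None
    for transcript, is_final in partials:
        intent = score_intent(transcript)
        if not is_final:
            # Start downstream work early when a new intent guess appears.
            if intent is not None and intent != speculative:
                speculative = intent
                yield ("speculate", intent)
            continue
        # Final transcript: keep the speculative work only if the guess held.
        if speculative is not None and intent != speculative:
            yield ("discard", speculative)
        yield ("commit", intent)
```

Feeding it `[("play", False), ("play jazz not", True)]` yields a speculation on `PlayMusic`, then a discard, then a commit with no intent, which is the thrown-away path the text accepts as infrequent.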

Wake-word detection: always-on DSP, never cloud

The DSP (digital signal processor) runs continuously at ~1mW, listening for the wake phrase using a tiny on-device model (3–5MB). False positives (triggers without the wake word) create privacy violations and poor UX. The threshold is tuned along a ROC curve: a lower threshold means fewer misses but more false triggers; a higher threshold means the reverse. The device continuously writes recent audio into a local ring buffer, so on a true trigger it can include ~500ms of pre-roll from before the detection point, and the start of the wake phrase (the 'Hey' in 'Hey Siri') isn't lost while the cloud connection is established. The ring buffer stays on the device; its contents are never sent to the cloud without a trigger.
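The pre-roll behavior can be sketched with a fixed-size ring buffer. This is an illustration only: the 20ms frame size, the 0.8 threshold, and the `WakeWordGate` class are assumptions, and the per-frame score would come from the on-device model, which is not shown.

```python
from collections import deque

FRAME_MS = 20      # assumed frame duration
PRE_ROLL_MS = 500  # pre-roll length mentioned in the text
THRESHOLD = 0.8    # hypothetical detection threshold


class WakeWordGate:
    """Keeps recent audio locally and releases it only after a trigger."""

    def __init__(self, pre_roll_ms: int = PRE_ROLL_MS,
                 frame_ms: int = FRAME_MS, threshold: float = THRESHOLD):
        # deque with maxlen drops the oldest frame automatically.
        self.ring: deque[bytes] = deque(maxlen=pre_roll_ms // frame_ms)
        self.threshold = threshold
        self.triggered = False

    def push(self, frame: bytes, score: float) -> list[bytes]:
        """Feed one frame and its wake-word score; return frames to upload."""
        if self.triggered:
            return [frame]  # session is open: stream frames straight through
        self.ring.append(frame)
        if score >= self.threshold:
            self.triggered = True
            pre_roll = list(self.ring)  # buffered audio, oldest frame first
            self.ring.clear()
            return pre_roll
        return []  # no trigger: nothing leaves the device
```

Raising `threshold` trades false triggers for missed detections, which is the ROC tradeoff described above.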

ASR: streaming beats batch by 300ms

Automatic Speech Recognition must operate in streaming mode (sending audio chunks and receiving partial transcripts) rather than sending the full utterance at end-of-speech. End-of-speech detection itself takes ~300–500ms of trailing silence. Streaming ASR returns a final hypothesis after end-of-speech is detected and collapses intermediate partials. The ASR output is a 1-best transcript with a confidence score; alternatives (an N-best list) let NLU pick the interpretation that fits a known intent even if the top-1 transcript is a near-miss homophone (e.g. 'whether' vs 'weather').
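As a rough illustration of how an N-best list helps, the sketch below picks the highest-confidence hypothesis that maps to a known intent. The phrase table and the exact-match lookup are simplifying assumptions; a real NLU would score hypotheses with a model rather than a dictionary.

```python
from typing import Optional

# Hypothetical phrases the NLU can map to an intent.
KNOWN_PHRASES = {
    "what's the weather": "GetWeather",
    "set a timer": "SetTimer",
}


def pick_hypothesis(n_best: list[tuple[str, float]]) -> tuple[str, Optional[str]]:
    """Return (transcript, intent) for the best hypothesis with a known intent.

    Falls back to the top-confidence transcript with no intent when none of
    the hypotheses match.
    """
    if not n_best:
        raise ValueError("n_best must contain at least one hypothesis")
    ranked = sorted(n_best, key=lambda h: h[1], reverse=True)
    for transcript, _confidence in ranked:
        intent = KNOWN_PHRASES.get(transcript.lower())
        if intent is not None:
            return transcript, intent
    return ranked[0][0], None
```

With `[("what's the whether", 0.62), ("what's the weather", 0.58)]` the second hypothesis wins even though its ASR confidence is lower, because it is the one that resolves to an intent.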

NLU: intent + slot extraction is the skill routing contract

Natural Language Understanding maps a transcript to `(intent, slots)` — e.g. `(PlayMusic, {artist: 'Miles Davis', genre: null})`. The NLU model is a fine-tuned classifier, typically a small transformer. Ambiguous utterances go to a resolver that uses session context (which skill was active, which device) to break ties. Skill dispatch routes the resolved intent to the appropriate microservice (music skill, smart-home skill, timer skill) via a routing table keyed by intent namespace. Each skill is an independent service that returns a response directive and a TTS string — the skill framework, not the skill, handles synthesis.
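The routing contract can be sketched as a table from intent namespace to handler. The `Namespace.IntentName` naming scheme, the `Intent` and `SkillResponse` types, and the handlers are all assumptions made for illustration; the text only specifies that skills return a directive and a TTS string.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Intent:
    name: str  # assumed format: "<namespace>.<intent>", e.g. "Music.PlayMusic"
    slots: dict[str, Optional[str]] = field(default_factory=dict)


@dataclass
class SkillResponse:
    directive: str  # what the device should do
    tts_text: str   # text for the framework to synthesize


def music_skill(intent: Intent) -> SkillResponse:
    artist = intent.slots.get("artist") or "something you might like"
    return SkillResponse("StartPlayback", f"Playing {artist}.")


def timer_skill(intent: Intent) -> SkillResponse:
    duration = intent.slots.get("duration") or "a while"
    return SkillResponse("StartTimer", f"Timer set for {duration}.")


# Routing table keyed by intent namespace.
ROUTES: dict[str, Callable[[Intent], SkillResponse]] = {
    "Music": music_skill,
    "Timer": timer_skill,
}


def dispatch(intent: Intent) -> SkillResponse:
    """Route an intent to its skill, with a canned reply for unknown namespaces."""
    namespace = intent.name.split(".", 1)[0]
    handler = ROUTES.get(namespace)
    if handler is None:
        return SkillResponse("NoOp", "Sorry, I can't help with that yet.")
    return handler(intent)
```

Note that the skills return text only; synthesis stays in the framework, matching the division of responsibility described above.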

TTS: streaming synthesis enables sub-second first-audio

Text-to-Speech synthesis is compute-intensive (~50–200ms/sentence on GPU). The architecture synthesizes sentence by sentence as the response text arrives, streaming audio chunks to the device immediately. The device plays chunk 1 while chunk 2 is being synthesized, so perceived latency is time-to-first-audio, not time-to-full-response. Modern neural TTS (WaveNet-class) produces human-quality speech but requires a GPU serving fleet; latency is the primary constraint, so models are quantized and served on GPU with <50ms warmup. Precomputed audio caches for common phrases (e.g. 'I didn't understand that') avoid synthesis entirely.
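Sentence-by-sentence streaming maps naturally onto a generator. In this sketch the `synthesize` callable stands in for the real TTS model, the sentence splitter is deliberately naive, and the cache is assumed to hold ready-to-play audio keyed by the exact sentence text; none of these names come from a real library.

```python
import re
from typing import Callable, Iterator

# Hypothetical cache of pre-rendered audio for common phrases.
AUDIO_CACHE: dict[str, bytes] = {
    "I didn't understand that.": b"<cached-audio>",
}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stream_tts(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per sentence as soon as it is available.

    The caller can start playback on the first chunk while later sentences
    are still being synthesized.
    """
    for sentence in split_sentences(text):
        cached = AUDIO_CACHE.get(sentence)
        yield cached if cached is not None else synthesize(sentence)
```

Because the generator is lazy, time-to-first-audio depends only on the first sentence, not on the length of the whole response.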

What breaks at scale

The non-obvious failure mode is stale state in the context store: a user says 'turn it up' but the session has expired (the TTL fired because they paused for 90 seconds), so the system has no idea what 'it' refers to. The user gets a confused 'What would you like me to turn up?', which is a UX failure. The second failure is skill timeout propagation: if a third-party skill (e.g. a Jira integration) takes >3s to respond, the voice assistant must time out gracefully with 'Sorry, the Jira skill isn't responding' rather than leaving the user in silence. Every skill invocation must have a hard timeout with a canned fallback response, and the skill framework must enforce it rather than trust skills to self-terminate.
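Framework-enforced timeouts can be sketched with `asyncio.wait_for`, which cancels the awaited call when the deadline passes. The function name and the shape of `skill_call` are assumptions; the 3-second limit and the fallback wording come from the text.

```python
import asyncio
from typing import Awaitable, Callable

SKILL_TIMEOUT_S = 3.0  # hard limit from the text


async def invoke_with_timeout(
    skill_name: str,
    skill_call: Callable[[], Awaitable[str]],
    timeout_s: float = SKILL_TIMEOUT_S,
) -> str:
    """Run a skill under a hard deadline and return a canned reply on timeout.

    The framework owns the deadline, so a slow or hung skill cannot leave
    the user in silence.
    """
    try:
        return await asyncio.wait_for(skill_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return f"Sorry, the {skill_name} skill isn't responding."
```

A production version would also need to handle skill errors and record the timeout for monitoring, which this sketch leaves out.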

In production

Amazon Alexa splits processing across a DSP chip for wake-word (always-on, ~1mW), a local SoC for initial ASR buffering, and cloud services for full ASR, NLU, and skill execution via their Alexa Skills Kit. Apple Siri processes more on-device for privacy (many short queries have been handled entirely on-device since iOS 15), routing only complex or ambiguous utterances to the cloud. The real engineering challenge is context management: 'turn it up' is ambiguous without knowing the active device, media state, and prior conversation turn. Maintaining a session context store with a short TTL per device is both the hard systems problem and the UX differentiator between good and mediocre assistants.
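A per-device session context store with a TTL might look like the following in-memory sketch. The class, the `active_media_device` field, and the injectable clock are illustrative assumptions; a production store would typically be a shared low-latency cache rather than a Python dict.

```python
import time
from typing import Callable, Optional


class SessionContextStore:
    """In-memory context store where each device's entry expires after a TTL."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, dict]] = {}

    def put(self, device_id: str, context: dict) -> None:
        """Store context for a device and restart its TTL."""
        self._entries[device_id] = (self._clock() + self._ttl_s, context)

    def get(self, device_id: str) -> Optional[dict]:
        """Return live context, or None if it is missing or has expired."""
        entry = self._entries.get(device_id)
        if entry is None:
            return None
        expires_at, context = entry
        if self._clock() >= expires_at:
            del self._entries[device_id]
            return None
        return context


def resolve_referent(store: SessionContextStore, device_id: str) -> Optional[str]:
    """Resolve 'it' in 'turn it up' to the active media device, if known."""
    context = store.get(device_id)
    return context.get("active_media_device") if context else None
```

When `resolve_referent` returns `None`, the assistant has to ask a clarifying question, so the TTL directly sets how long a follow-up like 'turn it up' keeps working.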

Common mistakes

Blocking stage-by-stage (high latency)
Streaming all audio to cloud (privacy/cost)
No skill timeout/fallback

Part of System Design Library on SystemLore.