System Design Library

Spotify / Music Streaming

Stream audio to millions and recommend playlists.


Requirements

Functional

Stream tracks
Playlists
Search
Recommendations

Non-functional

Instant playback
Offline cache

Scale

Hundreds of millions of users

The approach

Audio files (multiple bitrates) in object storage + CDN; metadata/catalog in DB + search; recommendations via offline pipelines; client prefetches next tracks.

Key components

Catalog DB + search · audio (object store) + CDN · recsys pipeline

Numbers that matter

~100M tracks in Spotify's catalog served at multiple bitrates (96/160/320 kbps AAC/OGG) — multiply by ~3 quality tiers = ~300M audio files in object storage.
~4MB encoded size for a 4-minute song at 128kbps (128 kbit/s × 240 s ÷ 8 ≈ 3.8MB); the same track at 320kbps is ~10MB. Either way, a whole track is small enough to cache as a single object at a CDN edge.
~456M monthly active users × ~30 minutes/day listening = ~228M streaming hours/day across the CDN. This is an upper bound, since it assumes every monthly active user listens every day.
~30% of Spotify listening comes from algorithmically recommended tracks (Discover Weekly, Radio, autoplay) — the recsys pipeline directly drives nearly a third of all audio delivery.
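The size and throughput figures above are plain arithmetic, sketched here so the estimates can be rechecked with other inputs. The function names are illustrative, and the model assumes constant-bitrate audio and decimal megabytes (10^6 bytes).

```python
def encoded_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Size in decimal megabytes of a constant-bitrate stream."""
    bits = bitrate_kbps * 1000 * duration_s
    return bits / 8 / 1_000_000


def audio_file_count(tracks: int, quality_tiers: int) -> int:
    """One stored file per track per quality tier."""
    return tracks * quality_tiers


def daily_streaming_hours(active_users: float, minutes_per_user: float) -> float:
    """Total listening hours per day if every active user listens daily."""
    return active_users * minutes_per_user / 60


if __name__ == "__main__":
    four_minutes = 4 * 60
    for kbps in (96, 128, 160, 320):
        print(kbps, "kbps:", round(encoded_size_mb(kbps, four_minutes), 2), "MB")
    print("files:", audio_file_count(100_000_000, 3))
    print("hours/day:", daily_streaming_hours(456_000_000, 30))
```

Real encodes are usually variable bitrate, so actual file sizes will drift somewhat from these numbers.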

Senior deep-dive

Audio delivery is a largely solved CDN problem; recommendations are the hard part. The engineering investment at Spotify is overwhelmingly in the recommendation and personalization pipeline, not in serving audio bytes.

Client-side prefetch is what makes gapless playback feel magical — the client starts downloading the next track before the current one ends; the server's job is to make the next-track prediction accurate enough that the prefetched track is almost always the right one.

Catalog metadata and audio files are completely separate concerns — metadata (artist, album, tags, lyrics) lives in a relational/search layer; audio files live in object storage behind a CDN; they join only at the client.

Audio serving: CDN cache hit rate is everything

Popular tracks are served entirely from CDN edge caches — origin (GCS) sees minimal traffic for the top 1% of tracks. The long tail is harder: obscure tracks may not be cached at your nearest edge and require a round-trip to origin or a mid-tier cache. Spotify pre-warms CDN caches for anticipated spikes (new album releases from major artists) by pushing files to edge nodes before the release time. Byte-range requests allow clients to seek within a track without re-downloading from the start — CDN must support range requests and cache range-aligned chunks.
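One way to picture range-aligned caching: a cache maps any requested byte range onto fixed-size chunks, so different seeks into the same track reuse the same cached objects. This is a minimal sketch, not a description of any specific CDN. The 512 KiB chunk size and the function name are assumptions chosen for illustration.

```python
from typing import List, Tuple

CHUNK_SIZE = 512 * 1024  # hypothetical cache chunk size in bytes


def aligned_chunks(start: int, end: int,
                   chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return inclusive (first_byte, last_byte) chunk ranges covering [start, end].

    Each returned range begins on a multiple of chunk_size, so two requests
    that overlap the same part of a file map to identical cache keys.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if start < 0 or end < start:
        raise ValueError("invalid byte range")
    first_chunk = start // chunk_size
    last_chunk = end // chunk_size
    return [
        (i * chunk_size, (i + 1) * chunk_size - 1)
        for i in range(first_chunk, last_chunk + 1)
    ]
```

A production cache would also clamp the final chunk to the file length and trim the response back to the exact bytes the client asked for.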

Client prefetch: the architecture of gapless playback

Gapless playback requires the next track to be buffered before the current track ends. The client requests the next track ID from the server while the current track is at ~80% completion. The server's job: return the correct next track (based on queue, autoplay model, or radio) with enough confidence that the client doesn't have to change what it's buffering. A wrong prediction wastes the bandwidth already spent and can cause a gap while the correct track buffers. Spotify's autoplay model (what plays after the queue ends?) is a high-value, low-latency recommendation conditioned on the listener's current context.
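The client-side decision described above can be sketched as two small functions: one decides when to ask for the next track, the other reacts to the server's answer. The `PlaybackState` type, the function names, and the action strings are hypothetical; only the ~80% threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

PREFETCH_THRESHOLD = 0.8  # fraction of the current track played before prefetching


@dataclass
class PlaybackState:
    track_id: str
    position_s: float
    duration_s: float
    prefetched_track_id: Optional[str] = None


def should_request_next(state: PlaybackState) -> bool:
    """True once playback passes the threshold and nothing is buffered yet."""
    if state.duration_s <= 0:
        return False
    progress = state.position_s / state.duration_s
    return state.prefetched_track_id is None and progress >= PREFETCH_THRESHOLD


def on_prediction(state: PlaybackState, predicted_track_id: str) -> str:
    """Decide what to do with the server's next-track answer.

    Returns "fetch" when nothing is buffered, "keep" when the buffered track
    matches, and "refetch" when the buffer holds the wrong track.
    """
    if state.prefetched_track_id is None:
        action = "fetch"
    elif state.prefetched_track_id == predicted_track_id:
        action = "keep"
    else:
        action = "refetch"
    state.prefetched_track_id = predicted_track_id
    return action
```

The "refetch" branch is the expensive one: the bytes already downloaded are discarded, which is why prediction accuracy matters more than prediction speed here.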

Collaborative filtering at Spotify's scale

Spotify's recommendation starts with implicit matrix factorization over the user × track listening matrix. Implicit signals (stream = 1 if listened past 30s, skip = 0 if skipped in first 10s) are more plentiful, and in practice more reliable, than explicit ratings. The matrix is on the order of hundreds of millions of users × 100M tracks, too large to factorize on a single machine. Spotify trains this with Alternating Least Squares (ALS) distributed across a Spark cluster, producing user and track embedding vectors. Candidate generation for a user: retrieve the tracks whose embeddings have the highest dot product with the user vector; an HNSW approximate nearest neighbor (ANN) index over track embeddings makes this fast.
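A small sketch of the two ends of this pipeline: turning listening events into implicit labels, and retrieving candidates by dot product. It uses an exact brute-force scan in place of an HNSW index, which is only workable for toy catalogs. Treating listens between 10s and 30s as unlabeled is an assumption, as are all the names.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple


def implicit_label(listened_s: float) -> Optional[int]:
    """1 for a stream (past 30s), 0 for an early skip (under 10s).

    Listens between 10s and 30s are treated as ambiguous and left
    unlabeled; that choice is an assumption of this sketch.
    """
    if listened_s > 30:
        return 1
    if listened_s < 10:
        return 0
    return None


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k_tracks(user_vec: Sequence[float],
                 track_vecs: Dict[str, Sequence[float]],
                 k: int) -> List[Tuple[str, float]]:
    """Exact top-k by dot product; an ANN index would replace this scan at scale."""
    scored = ((dot(user_vec, vec), track_id) for track_id, vec in track_vecs.items())
    return [(track_id, score) for score, track_id in heapq.nlargest(k, scored)]
```

The scan is linear in catalog size per query, which is the cost an ANN index trades away in exchange for approximate results.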

Podcast and audio normalization: the same CDN, different contracts

Podcasts are user-generated uploads of arbitrary quality, length (3h episodes), and format (MP3/M4A/OGG). Spotify transcodes podcasts to a standard internal format, but must also serve at the original podcast feed URL for RSS compatibility. Audio normalization (loudness leveling via ReplayGain/EBU R128) is applied during transcoding so a quiet indie podcast doesn't jar next to a loud EDM track. The transcoding pipeline is a stateless worker pool pulling from an upload queue — throughput scales horizontally but latency from upload to available-in-app is ~5-15 minutes.
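Loudness leveling comes down to a gain calculation once the integrated loudness of a file has been measured. The sketch below assumes the measurement (in LUFS) already exists, for example from an EBU R128 meter. The -14 LUFS target and -1 dBFS peak ceiling are illustrative defaults, not values taken from the text.

```python
from typing import Optional


def normalization_gain_db(measured_lufs: float,
                          target_lufs: float = -14.0,
                          peak_dbfs: Optional[float] = None,
                          peak_ceiling_dbfs: float = -1.0) -> float:
    """Gain in dB that moves measured loudness to the target.

    If the file's peak level is known, the gain is capped so the
    boosted peak stays at or below the ceiling (no clipping).
    """
    gain = target_lufs - measured_lufs
    if peak_dbfs is not None:
        gain = min(gain, peak_ceiling_dbfs - peak_dbfs)
    return gain


def db_to_linear(gain_db: float) -> float:
    """Convert a dB gain to the amplitude multiplier applied to samples."""
    return 10 ** (gain_db / 20)
```

A quiet podcast measured at -20 LUFS gets a positive gain and a loud track at -8 LUFS gets a negative one, so both land at the same perceived level.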

Playlist and catalog search: the other hard query

Search over 100M tracks must be sub-200ms P99 with fuzzy matching (typos, transliterations) and multi-language support. Spotify uses Elasticsearch for catalog search with custom analyzers per language and phonetic matching for artist names. The harder problem is search result ranking: pure text match puts an obscure tribute band above the original artist. Ranking must blend text relevance + popularity signals + personalization (your listening history). This is a learned ranking model (LambdaMART or neural) trained on click-through data from search results.
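To make the blending concrete, here is a hand-weighted stand-in for the learned ranker: a linear mix of text relevance, log-scaled popularity, and a per-user affinity score. The weights, the `Candidate` fields, and the popularity cap are all assumptions. A learned model such as LambdaMART would fit this function from click data rather than use fixed weights.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    track_id: str
    text_score: float     # 0..1 text-match relevance
    monthly_streams: int  # raw popularity signal
    user_affinity: float  # 0..1 match to the searcher's listening history


def blended_score(c: Candidate, w_text: float = 0.5, w_pop: float = 0.3,
                  w_pers: float = 0.2,
                  max_streams: int = 1_000_000_000) -> float:
    """Linear blend of text relevance, log-scaled popularity, and personalization."""
    popularity = min(math.log1p(c.monthly_streams) / math.log1p(max_streams), 1.0)
    return w_text * c.text_score + w_pop * popularity + w_pers * c.user_affinity


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from highest to lowest blended score."""
    return sorted(candidates, key=blended_score, reverse=True)
```

With these weights, a tribute band with a perfect text match but few streams ranks below the original artist with a slightly weaker match, which is the behavior the paragraph asks for.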

What breaks at scale

New release day thundering herd: a major artist drops an album at midnight and 50M users hit play simultaneously. The CDN absorbs this if the tracks are pre-warmed; the metadata service (track info, lyrics, album art) then becomes the bottleneck and must handle the read spike without falling over.

Cold start for new tracks: tracks added to the catalog (60K+ per day) have no interaction data and therefore no learned embeddings, so recommendations for them fall back to content-based signals (audio features such as tempo, key, and genre from acoustic analysis).

Regional licensing: a track licensed in the US may be blocked in Germany. The delivery layer must enforce geo-based access control on every audio request, and the check must be fast enough that it adds no noticeable latency to the playback start path.
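The licensing check stays off the latency budget if it is an in-memory lookup rather than a database call. A minimal sketch, assuming a hypothetical per-track set of licensed country codes that is loaded into memory ahead of time, and a deny-by-default policy for unknown tracks:

```python
from typing import Dict, FrozenSet

# Hypothetical licensing table: track ID -> ISO country codes where it may stream.
LICENSES: Dict[str, FrozenSet[str]] = {
    "track-1": frozenset({"US", "CA"}),
    "track-2": frozenset({"US", "DE", "FR"}),
}


def can_stream(track_id: str, country: str,
               licenses: Dict[str, FrozenSet[str]] = LICENSES) -> bool:
    """Constant-time check; tracks with no licensing record are denied."""
    allowed = licenses.get(track_id)
    return allowed is not None and country.upper() in allowed
```

How the caller's country is determined (IP geolocation, account region) is outside this sketch.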

In production

Spotify stores audio in Google Cloud Storage served via their own CDN layer. Their Discover Weekly uses a matrix-factorization collaborative filtering model trained on implicit feedback (streams, saves, skips) over the listening graph, augmented by NLP on playlist names and track co-occurrence. The real engineering challenge is offline model freshness at scale: Discover Weekly is generated once per week per user (456M personalized playlists), which requires a massive Spark/Dataflow job running over terabytes of interaction data. The second hard problem is audio fingerprinting for catalog deduplication: the same recording delivered by both the label and a distributor must resolve to one canonical entity with merged play counts, while a cover version must be recognized as a distinct recording.

Common mistakes

Streaming from origin
No prefetch (gaps between tracks)
Online recommendation over the full catalog

Related System Design Library

Part of System Design Library on SystemLore: system design interview prep with 148 deep topics, interactive diagrams, and a practice game.