
Gmail / Email

Store, search, send and receive email at billions-of-mailboxes scale.


Requirements

Functional

Send/receive (SMTP)
Mailbox storage
Search
Spam filtering
Labels/threads

Non-functional

Durable
Fast search
Reliable delivery

Scale

Billions of mailboxes

The approach

Sharded mailbox storage (by user); inbound via SMTP → spam/virus pipeline → mailbox; outbound queued with retries; every message indexed for search; threads via references.
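The "outbound queued with retries" step can be sketched as a delay queue. This is a minimal illustration, assuming a hypothetical `send` callback that returns an SMTP reply code. It relies only on the standard SMTP convention that 2xx means accepted, 4xx is a transient failure worth retrying, and 5xx is permanent. The delay constants and attempt limit are illustrative assumptions, not Gmail's values.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable

# Illustrative assumptions, not production values.
BASE_DELAY_S = 60.0        # first retry after one minute
MAX_DELAY_S = 4 * 3600.0   # never wait more than four hours between attempts
MAX_ATTEMPTS = 8           # after this many attempts the message bounces


@dataclass(order=True)
class OutboundItem:
    next_attempt_at: float                      # the heap is ordered by this field only
    message_id: str = field(compare=False)
    attempts: int = field(default=0, compare=False)


def backoff(failures: int) -> float:
    """Exponential backoff: 60s, 120s, 240s, ... capped at MAX_DELAY_S."""
    return min(BASE_DELAY_S * (2 ** failures), MAX_DELAY_S)


class OutboundQueue:
    def __init__(self) -> None:
        self._heap: list[OutboundItem] = []
        self.delivered: list[str] = []
        self.bounced: list[str] = []

    def enqueue(self, message_id: str, now: float) -> None:
        heapq.heappush(self._heap, OutboundItem(now, message_id))

    def run_due(self, now: float, send: Callable[[str], int]) -> None:
        """Attempt every message whose retry time has arrived."""
        while self._heap and self._heap[0].next_attempt_at <= now:
            item = heapq.heappop(self._heap)
            code = send(item.message_id)
            item.attempts += 1
            if 200 <= code < 300:
                self.delivered.append(item.message_id)
            elif 400 <= code < 500 and item.attempts < MAX_ATTEMPTS:
                # Transient failure: reschedule with a growing delay.
                item.next_attempt_at = now + backoff(item.attempts - 1)
                heapq.heappush(self._heap, item)
            else:
                # 5xx (permanent) or retries exhausted: bounce to the sender.
                self.bounced.append(item.message_id)
```

A real mail transfer agent would persist this queue so retries survive restarts; the in-memory heap only shows the scheduling logic.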

Key components

SMTP in → spam pipeline → mailbox store (sharded) + search index · outbound queue

Numbers that matter

Gmail serves well over a billion users (Google reported 1.5 billion active users in 2018); each user's free tier is 15 GB shared across Gmail, Drive, and Google Photos, so quota must be accounted per user on every write.
The inbound SMTP pipeline adds roughly 100-300ms of processing latency for spam classification before delivery; borderline messages are delivered first and reclassified asynchronously.
Thread detection uses the References and In-Reply-To headers; broken mail clients that drop or mangle these headers affect roughly 1-2% of threads, which then need a heuristic fallback that groups by normalized subject line.
Gmail's search index ingests message bodies with custom tokenizers (email addresses, phone numbers, quote stripping) and targets <1 second search latency for a mailbox of millions of messages.
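The per-user quota accounting from the first bullet above can be sketched as a single shared byte budget that is charged before each write commits. This is a minimal in-memory illustration: treating GB as 10^9 bytes and the `product` keys are assumptions. In a real system the check-and-increment would run as one atomic transaction on the user's shard, so two concurrent deliveries cannot both pass the check.

```python
from dataclasses import dataclass, field

GB = 10**9                   # assumption: decimal gigabytes
FREE_TIER_BYTES = 15 * GB    # one budget shared by every product


class QuotaExceeded(Exception):
    pass


@dataclass
class UserQuota:
    limit_bytes: int = FREE_TIER_BYTES
    usage_by_product: dict[str, int] = field(default_factory=dict)

    @property
    def used_bytes(self) -> int:
        return sum(self.usage_by_product.values())

    def charge(self, product: str, size_bytes: int) -> None:
        """Reserve space before committing a write; reject if the shared limit would be crossed.
        Not thread-safe: a real store must make this check-and-add atomic."""
        if size_bytes < 0:
            raise ValueError("size must be non-negative")
        if self.used_bytes + size_bytes > self.limit_bytes:
            raise QuotaExceeded(f"{product}: {size_bytes} bytes exceeds remaining quota")
        self.usage_by_product[product] = self.usage_by_product.get(product, 0) + size_bytes

    def release(self, product: str, size_bytes: int) -> None:
        """Credit space back, e.g. after a message is permanently deleted."""
        current = self.usage_by_product.get(product, 0)
        self.usage_by_product[product] = max(0, current - size_bytes)
```

Because the budget is shared, a large Drive upload can cause an incoming email to be rejected, which is why the check sits on every write path rather than only on the mailbox.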

Senior deep-dive

Mailbox sharding by user is the foundation: all of a user's mail lives on one shard set, so thread assembly, search, and IMAP access are purely local operations with no cross-shard joins.

The inbound pipeline (SMTP → spam/virus → delivery) is a multi-stage asynchronous pipeline in which each stage can reject, quarantine, or transform the message. Because the stages are decoupled, each one can be scaled and updated independently without stalling delivery throughput.
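The stage contract described above can be sketched in a few lines. This is an in-process illustration only: the stage names, the blocklist, the `EICAR` marker check, and the `X-Scanned` header are all hypothetical, and a production pipeline would put a durable queue between stages so each can scale on its own. What it shows is the short-circuit: a reject or quarantine verdict stops the message, while a transform passes a modified copy downstream.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Callable


class Verdict(Enum):
    CONTINUE = "continue"      # pass to the next stage
    REJECT = "reject"          # refuse the message outright
    QUARANTINE = "quarantine"  # hold it away from the mailbox


@dataclass(frozen=True)
class Message:
    sender: str
    body: str
    headers: tuple[tuple[str, str], ...] = ()


# Every stage has the same contract: take a message, return a verdict and a (possibly new) message.
Stage = Callable[[Message], tuple[Verdict, Message]]


def run_pipeline(msg: Message, stages: list[Stage]) -> tuple[Verdict, Message]:
    for stage in stages:
        verdict, msg = stage(msg)
        if verdict is not Verdict.CONTINUE:
            return verdict, msg  # short-circuit: later stages never see it
    return Verdict.CONTINUE, msg  # survived every stage: hand off for delivery


BLOCKED_SENDERS = {"bad@example.net"}  # hypothetical reputation data


def reputation_stage(msg: Message) -> tuple[Verdict, Message]:
    return (Verdict.REJECT if msg.sender in BLOCKED_SENDERS else Verdict.CONTINUE), msg


def virus_stage(msg: Message) -> tuple[Verdict, Message]:
    # Stand-in for a real scanner: quarantine bodies containing a test marker.
    return (Verdict.QUARANTINE if "EICAR" in msg.body else Verdict.CONTINUE), msg


def stamp_stage(msg: Message) -> tuple[Verdict, Message]:
    # Transform: record that the message passed the checks (header name is hypothetical).
    return Verdict.CONTINUE, replace(msg, headers=msg.headers + (("X-Scanned", "ok"),))


STAGES: list[Stage] = [reputation_stage, virus_stage, stamp_stage]
```

Keeping messages immutable means a transforming stage hands a new object downstream, so a stage can be retried or replayed without seeing another stage's half-applied edits.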

Search is powered by a private inverted index per user rather than a shared cluster-wide index. This gives per-user isolation (one user's query can never return another user's mail) and keeps results fast because the index is co-located with the mailbox.

Sharding strategy: user is the partition key

All of a user's messages, labels, and thread metadata are co-located on one shard. This makes thread assembly, label filtering, and IMAP folder traversal local, with no distributed joins. The risk is a hot shard if a single user has a multi-GB mailbox; Gmail handles this with per-user data routing that can migrate heavy mailboxes to under-loaded shard groups.
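User-keyed routing with a migration escape hatch can be sketched as hash placement plus an override table. This is a hypothetical router: SHA-256 placement, the `least_loaded` helper, and the migration steps in the comments are assumptions for illustration, not Gmail's actual mechanism.

```python
import hashlib


class ShardRouter:
    """Routes a user ID to a shard group. Hash placement by default; an override
    table pins mailboxes that were migrated off a hot group."""

    def __init__(self, num_groups: int) -> None:
        self.num_groups = num_groups
        self.overrides: dict[str, int] = {}  # user_id -> shard group after migration

    def default_group(self, user_id: str) -> int:
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_groups

    def route(self, user_id: str) -> int:
        # Every read and write for this user goes to a single group, so
        # threads, labels, and search stay local to that group.
        return self.overrides.get(user_id, self.default_group(user_id))

    def migrate(self, user_id: str, target_group: int) -> None:
        # Assumed procedure: copy the mailbox, catch up on new writes, briefly
        # fence writes, then flip the route. Only the final flip is modeled here.
        if not 0 <= target_group < self.num_groups:
            raise ValueError(f"unknown shard group {target_group}")
        self.overrides[user_id] = target_group

    def least_loaded(self, load_by_group: dict[int, float]) -> int:
        """Pick a migration target: the group with the lowest reported load."""
        return min(range(self.num_groups), key=lambda g: load_by_group.get(g, 0.0))
```

Plain modulo hashing would reshuffle most users if the group count changed; an explicit override table (or a full directory) lets individual heavy mailboxes move without touching anyone else.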

Threading: harder than it looks

Gmail threads on References/In-Reply-To headers first, falling back to normalized subject for replies from broken clients. The thread ID is assigned at first-message ingestion; subsequent messages are linked by header lookup in the thread index. Mailing lists break threading because they munge headers; Gmail has special-cased dozens of list-server behaviors. A thread can span thousands of messages (legal hold mailboxes, mailing list subscriptions), so thread-level operations need pagination.
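The header-first, subject-fallback order described above can be sketched as follows. This is a simplified model under stated assumptions: the prefix regex (only `Re:`, `Fw:`, `Fwd:`), requiring a reply-style prefix before the subject fallback applies, and sequential integer thread IDs are all illustrative choices. Real systems also bound the fallback by time window and participants.

```python
import re
from dataclasses import dataclass, field
from itertools import count
from typing import Optional, Sequence

# Matches one or more leading reply/forward prefixes such as "Re:", "RE: Fwd:", "Fw:".
_REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    return _REPLY_PREFIX.sub("", subject).strip().lower()


@dataclass
class ThreadIndex:
    by_message_id: dict[str, int] = field(default_factory=dict)  # Message-ID -> thread ID
    by_subject: dict[str, int] = field(default_factory=dict)     # normalized subject -> thread ID
    _next_id: count = field(default_factory=lambda: count(1))

    def assign(self, message_id: str, subject: str,
               in_reply_to: Optional[str] = None,
               references: Sequence[str] = ()) -> int:
        thread_id = None
        # 1. Header lookup: In-Reply-To first, then References from newest to oldest.
        candidates = ([in_reply_to] if in_reply_to else []) + list(reversed(references))
        for ref in candidates:
            if ref in self.by_message_id:
                thread_id = self.by_message_id[ref]
                break
        # 2. Fallback for clients that drop headers: a "Re:"-style subject matching a known thread.
        key = normalize_subject(subject)
        if thread_id is None and _REPLY_PREFIX.match(subject) and key in self.by_subject:
            thread_id = self.by_subject[key]
        # 3. Nothing matched: this message starts a new thread (ID fixed at first ingestion).
        if thread_id is None:
            thread_id = next(self._next_id)
        self.by_message_id[message_id] = thread_id
        self.by_subject.setdefault(key, thread_id)
        return thread_id
```

Requiring a reply prefix before the subject fallback fires keeps two unrelated messages that both say "Lunch?" from being merged into one thread.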

Spam pipeline: the real-time ML challenge

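Every inbound message runs through a multi-stage classifier cascade: IP/domain reputation (fastest and cheapest), then heuristic rules, then ML models that score content and sender history. Sender reputation is the highest-signal feature: a first-time sender from a new domain is far more likely to be spam. The cascade targets a budget of under 500ms so the verdict is ready while the SMTP session is still open; rejecting during the session avoids sending bounces to forged sender addresses. SMTP's own timeouts are minutes long (RFC 5321 suggests 10 minutes for the final reply to DATA), so the limit is a throughput and connection-cost budget rather than a protocol deadline. This forces a tiered approach in which uncertain messages are delivered optimistically and reclassified asynchronously.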
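The tiered cascade with an "uncertain band" can be sketched as below. Everything concrete here is hypothetical: the blocklist, the known-domain set, the phrase-counting `content_score` standing in for a trained model, and the 0.25/0.75 thresholds. The point is the control flow: cheap checks first, a score with two thresholds, and optimistic delivery plus async re-scoring for the middle band.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REJECT = "reject"                # refused during the SMTP session
    SPAM = "spam"                    # accepted, delivered with the Spam label
    INBOX = "inbox"                  # accepted, delivered normally
    INBOX_RECHECK = "inbox_recheck"  # delivered optimistically, re-scored asynchronously


@dataclass
class Inbound:
    sender_ip: str
    sender_domain: str
    body: str


# Hypothetical signal sources and thresholds, for illustration only.
IP_BLOCKLIST = {"203.0.113.7"}
KNOWN_DOMAINS = {"example.com"}
SPAM_PHRASES = ("wire transfer", "you have won")
NEW_SENDER_PENALTY = 0.3
SPAM_THRESHOLD, HAM_THRESHOLD = 0.75, 0.25


def content_score(msg: Inbound) -> float:
    """Stand-in for an ML model: fraction of known spam phrases present (0.0 to 1.0)."""
    body = msg.body.lower()
    return sum(phrase in body for phrase in SPAM_PHRASES) / len(SPAM_PHRASES)


def classify(msg: Inbound, recheck_queue: list[Inbound]) -> Decision:
    # Tier 1 (cheapest): IP reputation lookup, hard reject known-bad sources.
    if msg.sender_ip in IP_BLOCKLIST:
        return Decision.REJECT
    # Tier 2: heuristic rule, unknown senders start with a penalty.
    score = NEW_SENDER_PENALTY if msg.sender_domain not in KNOWN_DOMAINS else 0.0
    # Tier 3 (most expensive): content model.
    score = min(1.0, score + content_score(msg))
    if score >= SPAM_THRESHOLD:
        return Decision.SPAM
    if score <= HAM_THRESHOLD:
        return Decision.INBOX
    # Uncertain band: deliver now, reclassify later off the SMTP path.
    recheck_queue.append(msg)
    return Decision.INBOX_RECHECK
```

Note that a harmless first message from an unknown domain lands in the recheck band rather than in Spam, which reflects the asymmetry the section describes: false positives cost more trust than false negatives.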

Labels as the data model: not folders

Gmail stores each message once and attaches a set of label IDs to each user-message pair, so a message can carry INBOX, STARRED, and a custom label simultaneously. This is a many-to-many relationship stored in an index table, not copies in folders. The IMAP projection maps labels to folder paths, and IMAP mutations must be translated back into label edits: moving a message out of INBOX into a folder means removing the INBOX label and adding that folder's label. Concurrent IMAP and web edits require careful conflict resolution.
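The label table and the IMAP translation can be sketched as a map from `(user, message)` to a label set. This is a simplified model: the projection shows each label name directly as a folder name, which is an assumption. Gmail's real IMAP folder naming for system labels differs.

```python
from collections import defaultdict


class LabelIndex:
    """(user, message) -> set of labels. The message body is stored once elsewhere;
    this table only records which labels each user has attached to it."""

    def __init__(self) -> None:
        self._labels: defaultdict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, user: str, msg: str, label: str) -> None:
        self._labels[(user, msg)].add(label)

    def remove(self, user: str, msg: str, label: str) -> None:
        self._labels[(user, msg)].discard(label)

    def labels(self, user: str, msg: str) -> frozenset[str]:
        return frozenset(self._labels.get((user, msg), set()))

    def imap_folders(self, user: str, msg: str) -> list[str]:
        """IMAP projection (simplified): each label is shown as a folder containing the message."""
        return sorted(self.labels(user, msg))

    def imap_move(self, user: str, msg: str, src_folder: str, dst_folder: str) -> None:
        """Translate an IMAP MOVE into label deltas: drop the source label, add the destination."""
        self.remove(user, msg, src_folder)
        self.add(user, msg, dst_folder)
```

Applying per-label add/remove deltas, rather than overwriting the whole label set, means an IMAP move and a concurrent web "star" touch different labels and do not clobber each other. Genuine conflicts remain only when both sides edit the same label.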

Search: per-user inverted index

Each mailbox has its own inverted index co-located with its mailbox data, so query results never mix users. The index strips quoted text, expands abbreviations, and normalizes email addresses. The hard part is freshness: a message delivered seconds ago must be searchable immediately, so the index is updated synchronously at delivery time, which adds latency to the ingest path.
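A toy version of the per-user index makes the pieces concrete: one `MailboxIndex` per user, quote stripping, address normalization, and indexing inside the delivery call. The tokenizer rules here are assumptions chosen for illustration, abbreviation expansion is omitted, and search is a simple AND over terms.

```python
import re
from collections import defaultdict

# An email address, or else a plain word.
_TOKEN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\w+")


def strip_quoted(body: str) -> str:
    """Drop quoted reply lines (starting with '>') so old text is not indexed again."""
    return "\n".join(line for line in body.splitlines() if not line.lstrip().startswith(">"))


def tokenize(text: str) -> list[str]:
    tokens: list[str] = []
    for tok in _TOKEN.findall(text.lower()):
        tokens.append(tok)
        if "@" in tok:
            # Address normalization: also index the local part and each domain label,
            # so "alice" or "example.com" finds mail mentioning alice@example.com.
            local, domain = tok.split("@", 1)
            tokens.append(local)
            tokens.extend(domain.split("."))
    return tokens


class MailboxIndex:
    """One instance per mailbox, so postings never mix users."""

    def __init__(self) -> None:
        self.postings: defaultdict[str, set[str]] = defaultdict(set)  # term -> message IDs

    def index(self, message_id: str, body: str) -> None:
        # Called synchronously at delivery so the message is searchable immediately.
        for term in tokenize(strip_quoted(body)):
            self.postings[term].add(message_id)

    def search(self, query: str) -> set[str]:
        """AND query: messages containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

Queries run through the same `tokenize` as documents, so normalization rules only need to be defined once and cannot drift between the index and query sides.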

What breaks at scale

Attachment storms: a user who receives thousands of emails with large attachments can exhaust their shard's disk I/O quota and degrade neighbors on the same physical host. IMAP clients doing bulk operations (mass-delete, a full folder sync on a 1M-message mailbox) generate write amplification that can overwhelm the label index. Spam model drift during major world events (elections, pandemics) means freshly trained models must be hot-swapped mid-stream without a delivery outage.
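One way to enforce a per-user disk I/O quota during an attachment storm is a token bucket, which lets a mailbox burst briefly but caps its sustained rate. This is a hypothetical mitigation sketched for illustration; the section does not say how Gmail enforces the quota, and the rate and burst values in any deployment would be tuning choices.

```python
from dataclasses import dataclass, field


@dataclass
class IoBudget:
    """Per-user token bucket over disk I/O bytes (a hypothetical isolation mechanism)."""
    rate_bytes_per_s: float     # sustained I/O the user is allowed
    burst_bytes: float          # bucket capacity: the largest burst allowed
    last_refill: float = 0.0    # timestamp of the last refill, in seconds
    tokens: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.burst_bytes  # start with a full bucket

    def try_consume(self, size_bytes: float, now: float) -> bool:
        """Refill for elapsed time, then spend tokens if the write fits."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst_bytes, self.tokens + elapsed * self.rate_bytes_per_s)
        self.last_refill = now
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return True
        # Over budget: the caller defers this write (e.g. queues the attachment)
        # instead of competing with neighbors for the disk.
        return False
```

A write larger than `burst_bytes` can never fit, so a real system would split large attachments into chunks that are charged separately.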

In production

Gmail keeps mailbox data on Colossus (Google's general-purpose successor to GFS) with a Bigtable-backed metadata layer that tracks message flags, labels, and thread membership. The hardest operational problem is spam classification at ingestion: Gmail runs heuristic rules and ML models (sender reputation, IP reputation, and content signals) in under a second per message, and false positives (legitimate mail landing in spam) damage user trust far more than false negatives. IMAP compatibility is a perpetual tax: the label model must be projected onto IMAP folders, and bulk deletes issued over IMAP cause heavy write amplification on the label index.

Common mistakes

No per-user sharding
Treating delivery as fire-and-forget
Linear scan instead of an index for search


Part of System Design Library on SystemLore: system design interview prep with 148 deep topics, interactive diagrams, and a practice game.