System Design Library

Gmail / Email

Store, search, send and receive email at billions-of-mailboxes scale.

Requirements

Functional

  • Send/receive (SMTP)
  • Mailbox storage
  • Search
  • Spam filtering
  • Labels/threads

Non-functional

  • Durable
  • Fast search
  • Reliable delivery

Scale

Billions of mailboxes

The approach

Shard mailbox storage by user. Inbound mail flows SMTP → spam/virus pipeline → mailbox; outbound mail is queued and retried on failure. Every message is indexed for search, and threads are assembled from the References and In-Reply-To headers.
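
The "outbound queued with retries" step can be illustrated with a small retry queue. This is a minimal sketch, not Gmail's implementation: the class name `OutboundQueue`, the `send` callback, and the constants `BASE_DELAY_S`, `MAX_DELAY_S`, and `MAX_ATTEMPTS` are all hypothetical, and a real mail transfer agent would persist the queue and retry over days rather than hours.

```python
import heapq

# Hypothetical tuning values, chosen only for illustration.
BASE_DELAY_S = 60          # delay before the first retry
MAX_DELAY_S = 4 * 3600     # cap on the backoff delay
MAX_ATTEMPTS = 8           # after this many failed attempts, bounce the message


def next_delay(attempt: int) -> int:
    """Exponential backoff: 60 s, 120 s, 240 s, ... capped at MAX_DELAY_S."""
    return min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)


class OutboundQueue:
    def __init__(self):
        # Min-heap of (due_time, message_id, attempts_so_far).
        self._heap = []

    def enqueue(self, message_id, now, attempt=0):
        heapq.heappush(self._heap, (now, message_id, attempt))

    def run_due(self, now, send):
        """Try every message that is due; return the IDs that must be bounced."""
        bounced = []
        while self._heap and self._heap[0][0] <= now:
            _, message_id, attempt = heapq.heappop(self._heap)
            if send(message_id):
                continue  # delivered, nothing more to do
            if attempt + 1 >= MAX_ATTEMPTS:
                bounced.append(message_id)
            else:
                # Reschedule strictly in the future, so this loop terminates.
                due = now + next_delay(attempt)
                heapq.heappush(self._heap, (due, message_id, attempt + 1))
        return bounced
```

Keeping the queue ordered by due time means a worker only ever inspects the head of the heap, so a large backlog of delayed messages costs nothing until it comes due.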

Key components

SMTP in → spam pipeline → mailbox store (sharded) + search index · outbound queue

Senior deep-dive

Mailbox sharding by user is the foundation — all of a user's mail lives on one shard set so thread assembly, search, and IMAP are purely local operations, no cross-shard joins needed.

The inbound path (SMTP → spam/virus → delivery) is a multi-stage asynchronous pipeline: each stage can reject, quarantine, or transform the message. Keeping the stages separate means each one can be scaled and updated independently without stalling delivery.

Search is powered by a private inverted index per user, not a shared cluster. This gives per-user isolation (no search leakage) and keeps queries fast, because the index is co-located with the mailbox and a search never has to leave the user's shard.

Sharding strategy: user is the partition key

All of a user's messages, labels, and thread metadata are co-located on one shard. This makes thread assembly, label filtering, and IMAP folder traversal local — no distributed joins. The risk is a hot shard if a single user has a multi-GB mailbox; Gmail handles this with per-user data routing that can migrate heavy mailboxes to under-loaded shard groups.
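
The routing idea above can be sketched as a hash-based default placement plus an override table for migrated mailboxes. This is an illustrative sketch only: `NUM_SHARDS`, the `overrides` table, and the function names are assumptions, not a description of Gmail's actual routing layer.

```python
import hashlib

NUM_SHARDS = 1024  # hypothetical shard count

# Hypothetical override table: users whose mailboxes were migrated
# away from their default shard (for example, very heavy mailboxes).
overrides = {}


def default_shard(user_id: str) -> int:
    """Stable hash of the user ID, so every lookup lands on the same shard."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def shard_for(user_id: str) -> int:
    """Check the override table first, then fall back to the hash."""
    return overrides.get(user_id, default_shard(user_id))


def migrate(user_id: str, target_shard: int) -> None:
    """Record that this user's mailbox now lives on target_shard.

    Copying the data itself is out of scope for this sketch.
    """
    overrides[user_id] = target_shard
```

Because every read and write goes through `shard_for`, moving a heavy mailbox only requires copying its data and flipping one routing entry; other users are unaffected.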

Threading: harder than it looks

Gmail threads on References/In-Reply-To headers first, falling back to normalized subject for replies from broken clients. The thread ID is assigned at first-message ingestion; subsequent messages are linked by header lookup in the thread index. Mailing lists break threading because they munge headers — Gmail has special-cased dozens of list server behaviors. A thread can span thousands of messages (legal hold mailboxes, mailing list subscriptions), so thread-level operations need pagination.
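
A simplified version of this lookup order (headers first, normalized subject as a fallback for replies) might look like the sketch below. The in-memory dictionaries, the prefix regex, and the function names are assumptions made for illustration; the mailing-list special cases mentioned above are not modeled.

```python
import re
import uuid
from typing import List, Optional

# Hypothetical per-user indexes; a real system would persist these.
thread_by_message_id = {}
thread_by_subject = {}

# Matches one or more reply/forward prefixes such as "Re:", "FW:", "Fwd:".
_PREFIX = re.compile(r"^\s*((re|fwd?|aw)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    return _PREFIX.sub("", subject).strip().lower()


def assign_thread(message_id: str, subject: str,
                  references: Optional[List[str]] = None,
                  in_reply_to: Optional[str] = None) -> str:
    """Return the thread ID for a newly ingested message."""
    parents = list(references or [])
    if in_reply_to:
        parents.append(in_reply_to)

    # 1. Header lookup: join the thread of any known parent message.
    thread_id = None
    for parent in parents:
        if parent in thread_by_message_id:
            thread_id = thread_by_message_id[parent]
            break

    # 2. Fallback: only for messages that look like replies ("Re: ...")
    #    but arrived without usable headers.
    key = normalize_subject(subject)
    looks_like_reply = key != subject.strip().lower()
    if thread_id is None and looks_like_reply:
        thread_id = thread_by_subject.get(key)

    # 3. Otherwise this message starts a new thread.
    if thread_id is None:
        thread_id = uuid.uuid4().hex

    thread_by_message_id[message_id] = thread_id
    thread_by_subject.setdefault(key, thread_id)
    return thread_id
```

Restricting the subject fallback to messages with a reply prefix keeps two unrelated messages that happen to share a subject line from being merged into one thread.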

Spam pipeline: the real-time ML challenge

Every inbound message runs through a multi-stage classifier cascade: IP/domain reputation (fastest, cheapest), heuristic rules, then ML models scoring content and sender history. Sender reputation is the highest-signal feature: a first-time sender from a new domain is far more likely to be spam. The synchronous part of the pipeline has a budget of under 500 ms, because the verdict is needed while the sending server is still waiting on the SMTP transaction. That forces a tiered approach in which uncertain messages are delivered optimistically and reclassified asynchronously.
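
The cascade can be sketched as an ordered list of stages with early exit. Everything here is a stand-in: the stage functions, the blocklist, the trusted-domain set, and the thresholds are hypothetical, and the "ML model" is a stub rather than a real classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Message:
    sender_ip: str
    sender_domain: str
    body: str


# A stage returns a spam probability in [0, 1], or None if it has no opinion.
Stage = Callable[[Message], Optional[float]]

BAD_IPS = {"203.0.113.7"}               # hypothetical blocklist
TRUSTED_DOMAINS = {"trusted.example"}   # hypothetical sender history


def ip_reputation(msg: Message) -> Optional[float]:
    return 1.0 if msg.sender_ip in BAD_IPS else None


def heuristics(msg: Message) -> Optional[float]:
    return 0.95 if "wire transfer urgently" in msg.body.lower() else None


def ml_model(msg: Message) -> Optional[float]:
    # Stub: a real model would score content and sender history.
    return 0.05 if msg.sender_domain in TRUSTED_DOMAINS else 0.5


def classify(msg: Message, stages: List[Stage],
             spam_at: float = 0.9, ham_at: float = 0.1) -> str:
    """Run the stages cheapest-first and stop at the first confident verdict."""
    for stage in stages:
        score = stage(msg)
        if score is None:
            continue
        if score >= spam_at:
            return "reject"
        if score <= ham_at:
            return "deliver"
    # No stage was confident: deliver now, reclassify in the background.
    return "deliver_and_recheck"


CASCADE = [ip_reputation, heuristics, ml_model]
```

Ordering the stages from cheapest to most expensive means most obvious spam is rejected before the costly model ever runs, which is what keeps the synchronous path inside its latency budget.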

Labels as the data model: not folders

Gmail stores messages once and attaches a set of label IDs per user-message pair — a message can have INBOX, STARRED, and a custom label simultaneously. This is a many-to-many relationship stored in an index table, not copies in folders. IMAP projection maps labels to folder paths, but mutations via IMAP (move = remove INBOX label, add label) must be translated, and concurrent IMAP + web edits require careful conflict resolution.
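
The many-to-many label model and the IMAP "move" translation can be sketched as follows. The `LabelIndex` class and its method names are hypothetical, and conflict resolution between concurrent IMAP and web edits is deliberately left out.

```python
from collections import defaultdict


class LabelIndex:
    """Per-user index mapping messages to labels and labels to messages."""

    def __init__(self):
        self.labels_by_message = defaultdict(set)
        self.messages_by_label = defaultdict(set)

    def add(self, message_id: str, label: str) -> None:
        self.labels_by_message[message_id].add(label)
        self.messages_by_label[label].add(message_id)

    def remove(self, message_id: str, label: str) -> None:
        self.labels_by_message[message_id].discard(label)
        self.messages_by_label[label].discard(message_id)

    def imap_folder(self, label: str) -> list:
        """IMAP projection: a folder is just the set of messages with a label."""
        return sorted(self.messages_by_label.get(label, set()))

    def imap_move(self, message_id: str, src: str, dst: str) -> None:
        """An IMAP move becomes two label mutations; the message is not copied."""
        self.remove(message_id, src)
        self.add(message_id, dst)
```

For example, moving a starred message from INBOX to a custom folder leaves the STARRED label untouched, which a copy-into-folder model could not express without duplicating the message.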

Search: per-user inverted index

Each mailbox has its own inverted index co-located with mailbox data — query results are not mixed across users. The index strips quoted text, expands abbreviations, and normalizes email addresses. The hard part is index freshness: a message delivered seconds ago must be searchable immediately, so the index is updated synchronously at delivery time, adding latency to the ingest path.
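
A toy version of a per-mailbox inverted index, updated synchronously at delivery, is sketched below. The `Mailbox` class and `tokenize` helper are illustrative assumptions; address normalization and abbreviation expansion are omitted, and only the quoted-text stripping is modeled.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase word tokens, skipping quoted reply lines that start with '>'."""
    tokens = []
    for line in text.splitlines():
        if line.lstrip().startswith(">"):
            continue
        tokens.extend(re.findall(r"[a-z0-9]+", line.lower()))
    return tokens


class Mailbox:
    """One user's messages plus that user's private inverted index."""

    def __init__(self):
        self.messages = {}
        self.index = defaultdict(set)  # token -> set of message IDs

    def deliver(self, message_id: str, body: str) -> None:
        # Index update happens in the delivery path, so the message is
        # searchable as soon as deliver() returns.
        self.messages[message_id] = body
        for token in tokenize(body):
            self.index[token].add(message_id)

    def search(self, query: str) -> set:
        """Return the IDs of messages containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        postings = [self.index.get(token, set()) for token in tokens]
        return set.intersection(*postings)
```

Since each `Mailbox` owns its own index, a query can only ever see that user's postings, and the cost of the synchronous update is paid once per delivered message.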

What breaks at scale

  • Attachment storms: a user who receives thousands of emails with large attachments can saturate their shard's disk I/O quota, degrading neighbors on the same physical host.
  • IMAP bulk operations: clients doing a mass-delete or a folder sync on a 1M-message mailbox generate write amplification that can overwhelm the label index.
  • Spam model drift: during major world events (elections, pandemics), freshly trained models must be hot-swapped mid-stream without a delivery outage.

In production

Gmail's mailbox data lives on Colossus (Google's successor to GFS), with a Bigtable-backed metadata layer tracking message flags, labels, and thread membership. The hardest operational problem is spam classification at ingestion: Gmail runs both heuristic rules and ML models (including sender reputation, IP reputation, and content signals) in under a second per message, and false positives (legitimate mail in spam) are far more damaging to user trust than false negatives. IMAP compatibility is a perpetual tax: the label model must be projected onto IMAP folders, and clients that bulk-delete via IMAP cause enormous write amplification on the label index.
