System Design Library

Notification System

Send push/email/SMS notifications reliably across channels, at scale, without spamming.

Requirements

Functional

  • Multi-channel (push/email/SMS)
  • Templates
  • User preferences/opt-out
  • Rate limiting and deduplication
  • Delivery tracking

Non-functional

  • Reliable, at-least-once
  • Idempotent
  • Respect quiet hours

Scale

Billions of notifications per day

The approach

Events → notification service → per-channel queues → channel workers (APNs/FCM/SES/Twilio). Preferences & dedup checked before send; retries with backoff; DLQ for failures.

Key components

Producers → notification service → queues (per channel) → channel adapters · preferences store · delivery tracking

Senior deep-dive

Per-channel queues with per-destination isolation are the core correctness mechanism — one slow or failing email provider must not block SMS delivery or delay push notifications for other users.

Idempotency and deduplication must happen before the send, not after: most providers (APNs, FCM, SES) do not guarantee idempotent delivery on retry, so the notification service has to enforce it. The DLQ is more than a safety net; it is a business-critical audit trail for compliance (did we send the legal notice?) and debugging (why didn't the user get the password reset?).

Per-channel queue isolation: the non-negotiable

If email, push, and SMS share a single queue, a transient SES outage causes a growing backlog that delays SMS and push notifications — which have much shorter relevance windows. Dedicate one queue per channel (one for APNs, one for FCM, one for SES, one for Twilio). This also lets you scale workers independently — email may need 10 workers, push may need 100. Queue isolation is the cheapest way to provide blast-radius containment between providers.
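The isolation argument can be illustrated with a small sketch. It assumes in-memory deques as stand-ins for real broker queues; the `ChannelRouter` class, the channel names, and `max_depth` are hypothetical names chosen for illustration, not part of any provider API.

```python
from collections import deque

# Hypothetical channel names; each deque stands in for a separate broker queue.
CHANNELS = ("apns", "fcm", "ses", "twilio")


class ChannelRouter:
    """Routes each message to its channel's own bounded queue."""

    def __init__(self, max_depth=10_000):
        self.queues = {name: deque() for name in CHANNELS}
        self.max_depth = max_depth

    def enqueue(self, channel, message):
        """Append to one channel's queue; return False if that queue is full."""
        queue = self.queues[channel]
        if len(queue) >= self.max_depth:
            return False  # backpressure is confined to this channel
        queue.append(message)
        return True

    def depth(self, channel):
        return len(self.queues[channel])


router = ChannelRouter(max_depth=3)
# Simulated SES outage: nothing drains the email queue, so it fills up.
email_results = [router.enqueue("ses", {"id": i}) for i in range(5)]
# SMS is unaffected because it has its own queue.
sms_accepted = router.enqueue("twilio", {"id": "otp-1"})
```

With a single shared queue, the SMS message would have been rejected or stuck behind the email backlog; here only the email queue reports backpressure.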

Idempotency: the most skipped requirement

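Notification events can be delivered more than once (at-least-once queues, upstream service retries, replays after an outage). Without idempotency, users receive duplicate 'Your order has shipped' emails. The fix: every notification event carries a stable idempotency key, e.g. SHA256(event_type + entity_id + timestamp_bucket). Derive the bucket from the event's own timestamp rather than the processing time; otherwise a retry that lands in the next bucket hashes to a different key. Before sending, claim the key in Redis atomically (SET with NX and a 24h TTL); if the key already exists, skip the send. A separate check followed by a write leaves a race window between concurrent workers. The claim is cheap (~1ms) and prevents the most common user-facing bug in notification systems.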
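A minimal sketch of the key-then-claim flow described above. It assumes an in-memory `TtlKeyStore` as a stand-in for Redis (its `set_if_absent` mimics `SET key value NX EX ttl`); `idempotency_key`, `send_once`, and the `release` step on failure are hypothetical helpers introduced for illustration.

```python
import hashlib
import time


class TtlKeyStore:
    """In-memory stand-in for Redis `SET key value NX EX ttl` (an assumption)."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}  # key -> expiry time
        self._clock = clock

    def set_if_absent(self, key, ttl_seconds):
        """Claim the key; return False if an unexpired claim already exists."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True

    def release(self, key):
        self._expiry.pop(key, None)


def idempotency_key(event_type, entity_id, bucket):
    """SHA-256 over the stable fields of the event, hex-encoded."""
    raw = f"{event_type}|{entity_id}|{bucket}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def send_once(store, key, send, ttl_seconds=24 * 3600):
    """Call send() only if this key has not been claimed within the TTL."""
    if not store.set_if_absent(key, ttl_seconds):
        return False  # duplicate delivery: skip
    try:
        send()
    except Exception:
        store.release(key)  # free the claim so a later retry can go through
        raise
    return True
```

Releasing the claim on failure is one possible design choice: without it, a send that fails after the claim would be suppressed on retry for the full TTL.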

Preference resolution: the hidden latency killer

Every notification must be filtered through user preferences: channel opt-outs, DND windows, and frequency caps (no more than 3 push/day for this category). Each lookup is a single keyed read, but at 10k notifications/sec that is 10k DB queries/sec. Cache preferences in a local in-process cache (or Redis) with a short TTL (~60 seconds), accepting that an opt-out can take up to one TTL to take effect. Frequency capping requires a Redis counter per (user, category, window). INCR + EXPIRE gives a fixed-window counter, which is usually sufficient; a true sliding window needs either a sorted set of timestamps or a weighted blend of the current and previous windows.
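The two mechanisms above can be sketched as follows. Note that the INCR + EXPIRE pattern behaves as a fixed-window counter, which is what this sketch models. Both classes are in-memory stand-ins; `PreferenceCache`, `FixedWindowCap`, and the `loader` callback are hypothetical names, and a real deployment would back the counter with Redis.

```python
import time


class PreferenceCache:
    """Read-through cache with a short TTL in front of a preferences loader."""

    def __init__(self, loader, ttl_seconds=60, clock=time.monotonic):
        self._loader = loader      # e.g. a function that queries the prefs DB
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # user_id -> (expires_at, prefs)

    def get(self, user_id):
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]        # cache hit: no DB query
        prefs = self._loader(user_id)
        self._entries[user_id] = (now + self._ttl, prefs)
        return prefs


class FixedWindowCap:
    """In-memory equivalent of INCR + EXPIRE on a (user, category, window) key."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._counts = {}          # old windows are never evicted in this sketch

    def allow(self, user_id, category):
        """Count this notification and report whether it is within the cap."""
        window = int(self._clock() // self._window)
        key = (user_id, category, window)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self._limit
```

In Redis, the EXPIRE on each window key handles the eviction that this sketch omits.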

Provider feedback loops: token hygiene

APNs and FCM report invalid push tokens (app uninstalled, device reset) in the response to a send: APNs answers with HTTP 410 and reason Unregistered, and the FCM HTTP v1 API answers with an UNREGISTERED error. Ignoring these responses and continuing to send to dead tokens wastes connections and can get your APNs account flagged. Process provider feedback asynchronously: a feedback consumer reads provider responses and marks tokens as invalid in your device registry. Schedule periodic token validation sweeps as well; tokens not seen in 90 days should be pruned. This reduces push queue depth and delivery latency for live tokens.
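A sketch of the feedback consumer and the pruning sweep, under stated assumptions: `DeviceRegistry` is a hypothetical in-memory registry, and the status codes in `DEAD_TOKEN_RESPONSES` (410 for APNs Unregistered, 404 for FCM UNREGISTERED) reflect my reading of the provider documentation and should be verified against the current docs.

```python
from datetime import datetime, timedelta, timezone

# Assumed provider responses that mean "this token is dead"; verify against docs.
DEAD_TOKEN_RESPONSES = {
    ("apns", 410),  # APNs: Unregistered
    ("fcm", 404),   # FCM HTTP v1: UNREGISTERED
}


class DeviceRegistry:
    """Hypothetical in-memory device registry keyed by push token."""

    def __init__(self):
        self.tokens = {}  # token -> {"valid": bool, "last_seen": datetime}

    def register(self, token, last_seen):
        self.tokens[token] = {"valid": True, "last_seen": last_seen}

    def handle_feedback(self, provider, status_code, token):
        """Mark the token invalid if the provider response says it is dead."""
        if (provider, status_code) in DEAD_TOKEN_RESPONSES and token in self.tokens:
            self.tokens[token]["valid"] = False
            return True
        return False

    def prune(self, now, max_age_days=90):
        """Drop invalid tokens and tokens not seen within max_age_days."""
        cutoff = now - timedelta(days=max_age_days)
        stale = [
            token
            for token, record in self.tokens.items()
            if not record["valid"] or record["last_seen"] < cutoff
        ]
        for token in stale:
            del self.tokens[token]
        return len(stale)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
registry = DeviceRegistry()
registry.register("live-token", now - timedelta(days=2))
registry.register("uninstalled-token", now - timedelta(days=5))
registry.register("dormant-token", now - timedelta(days=120))
registry.handle_feedback("apns", 410, "uninstalled-token")
pruned = registry.prune(now)
```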

DLQ strategy: not all failures are equal

A permanent failure (invalid email address, uninstalled app, provider rejected) should not be retried: move it immediately to the DLQ and trigger a compensating action (fallback to another channel, log for compliance). A transient failure (provider timeout, rate limit) should retry with exponential backoff, honor a Retry-After header when the provider sends one, and move to the DLQ once a maximum attempt count is reached. Distinguish these at the worker level: HTTP 400 from FCM is permanent; HTTP 429 and 503 are transient. DLQ messages must be actionable, so include full context (user_id, channel, provider response code, timestamp) for operational debugging.
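The permanent/transient split might look like the sketch below. The status-code sets, `max_attempts`, the backoff base and cap, and the use of full jitter are all illustrative assumptions rather than provider requirements; `handle_failure` and its return shape are hypothetical.

```python
import random
import time

# Illustrative classification; real workers should follow each provider's docs.
PERMANENT_STATUS = {400, 404, 410}


def classify(status_code):
    """Unknown codes default to transient; the attempt cap bounds the retries."""
    return "permanent" if status_code in PERMANENT_STATUS else "transient"


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Exponential backoff with full jitter: rng() * min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def handle_failure(message, status_code, attempt, max_attempts=5,
                   rng=random.random, now=time.time):
    """Decide what to do after the zero-based `attempt` has failed.

    Returns ("dlq", record) or ("retry", delay_seconds).
    """
    exhausted = attempt + 1 >= max_attempts
    if classify(status_code) == "permanent" or exhausted:
        record = {
            "user_id": message["user_id"],
            "channel": message["channel"],
            "provider_status": status_code,
            "attempts": attempt + 1,
            "timestamp": now(),
        }
        return ("dlq", record)
    return ("retry", backoff_delay(attempt, rng=rng))
```

The DLQ record carries the fields the text calls for, so an operator can tell from the record alone why the message stopped retrying.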

What breaks at scale

A mass notification (a system-wide blast to 100M users) must never go through the same path as transactional notifications: a blast saturates all channel queues, delaying password resets and order confirmations for hours. Maintain separate queue tiers (transactional = high priority, marketing = low priority, bulk blast = batch-only) with separate worker pools and queue depth limits. The second failure is APNs connection pool exhaustion. Each APNs HTTP/2 connection handles ~500 concurrent in-flight requests (the server advertises the actual limit in its HTTP/2 settings), and the number of requests in flight is throughput × round-trip latency. At 5,000 push/sec with a ~1 s worst-case round trip, ~5,000 requests are in flight, so you need ~10 connections; at 100 ms the same rate keeps only ~500 in flight. Pool exhaustion causes silent send failures with no clear error, so always instrument connection pool utilization as a key metric.
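The connection count follows from Little's law (requests in flight = arrival rate × latency). The sketch below works through that arithmetic; the 500-stream figure comes from the text above, while the latency values, the `headroom` factor, and the function names are assumptions for illustration.

```python
import math


def in_flight_requests(sends_per_sec, latency_ms):
    """Little's law: average requests in flight = rate * latency."""
    return sends_per_sec * latency_ms / 1000


def connections_needed(sends_per_sec, latency_ms, streams_per_conn=500, headroom=1.0):
    """Connections required to carry the in-flight load, with optional headroom."""
    demand = in_flight_requests(sends_per_sec, latency_ms) * headroom
    return max(1, math.ceil(demand / streams_per_conn))


def pool_utilization(in_flight, connections, streams_per_conn=500):
    """Fraction of available streams in use; a candidate metric to alert on."""
    return in_flight / (connections * streams_per_conn)


slow_case = connections_needed(5000, latency_ms=1000)   # 5,000 in flight
fast_case = connections_needed(5000, latency_ms=100)    # 500 in flight
```

Sizing for the slow case rather than the typical one is the safer choice, since provider latency tends to rise exactly when the pool is already under pressure.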

In production

Notification systems at large consumer companies such as Airbnb and Uber are generally described as following the same pattern: events flow into a fanout service that resolves user preferences and channels, then per-channel queues (one per provider) backed by Kafka or SQS feed channel-specific workers. Firebase Cloud Messaging (FCM) is used by most consumer apps as the Android push gateway; APNs (Apple Push Notification service) requires a separate HTTP/2 connection pool and authenticates with either a provider token (JWT) or a TLS certificate.

The real engineering challenge is preference resolution at scale: for every notification event, you must check user preferences (do they want email for this type? which phone? DND hours?) without adding more than 50ms of latency. This requires a local preference cache per notification worker node, not a DB query per notification. Deduplication (don't send the same notification twice if an event is processed twice) requires a Redis SET with a TTL keyed by (user_id, event_id).

Part of System Design Library on SystemLore.