System Design Library

Push Notification Gateway

Deliver mobile push to billions of devices via APNs/FCM reliably.


Requirements

Functional

Register device tokens
Send push (single/broadcast)
Provider adapters
Retry/feedback

Non-functional

High throughput
At-least-once
Token hygiene

Scale

Billions of devices

The approach

A token registry maps each user to their devices. Sends fan out through queues to provider adapters (APNs/FCM) that hold persistent connections; failed sends are retried, provider feedback is processed so that invalid tokens are pruned, and broadcasts are batched.

Key components

App → notif service → queues → APNs/FCM adapters · token registry

Numbers that matter

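APNs allows up to 1,500 concurrent HTTP/2 streams per connection and drops to ~500 on congestion; the effective limit is whatever the server advertises in its HTTP/2 SETTINGS frame (`SETTINGS_MAX_CONCURRENT_STREAMS`), so a sending fleet should read that value rather than hard-code it, and size its connection pools to avoid stream exhaustion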
Roughly 10–30% of registered tokens are invalid at any given time in a production app (uninstalls, re-installs, OS upgrades); regular pruning sharply reduces wasted sends
End-to-end push delivery (your server → APNs/FCM → device) averages 1–5 seconds under normal conditions, spiking to 30–120 seconds when provider queues are saturated
A single delivery worker using HTTP/2 multiplexing can achieve 5,000–10,000 sends/second per connection; a fleet of 10 workers with one connection each can sustain 50k–100k notifications/second aggregate throughput
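
As a rough sizing check on these figures, Little's law says the average number of in-flight requests equals the send rate multiplied by the round-trip time. The sketch below assumes a 100 ms provider round trip, which is an illustrative value rather than a documented one, and the function names are hypothetical.

```python
import math

def required_streams(sends_per_sec: int, round_trip_ms: int) -> int:
    """Little's law: in-flight requests = arrival rate x time in system."""
    return math.ceil(sends_per_sec * round_trip_ms / 1000)

def connections_needed(target_per_sec: int, per_connection_per_sec: int) -> int:
    """Connections required to hit a target rate at a given per-connection rate."""
    return math.ceil(target_per_sec / per_connection_per_sec)

# Assumption: 100 ms round trip to the provider (illustrative only).
streams = required_streams(10_000, 100)            # 1000 concurrent streams
conns_low = connections_needed(100_000, 5_000)     # 20 connections
conns_high = connections_needed(100_000, 10_000)   # 10 connections
```

Under that assumed latency, 10,000 sends/second on one connection needs about 1,000 streams in flight, which sits below the 1,500-stream figure quoted above but leaves little headroom if round trips slow down.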

Senior deep-dive

The provider connection pool, not your queue depth, is the real throughput bottleneck — APNs and FCM have persistent HTTP/2 connections that must be managed carefully to avoid rejections.

Invalid token pruning is an ongoing operational discipline: an invalid-token rate of 10–30% is normal in a live app, and sending to dead tokens wastes quota and triggers provider rate limits.

Broadcasts require fan-out infrastructure, not loops: sending a notification to 100M users via a serial loop takes more than a day at 1k sends/sec; batched topic-based delivery (FCM topics / SNS) or a dedicated fan-out tier cuts this to minutes.

Provider connection pooling is the hidden bottleneck

APNs requires a persistent HTTP/2 connection and allows 1,500 concurrent in-flight streams per connection. If your sending code naively opens a new connection per notification, you burn ~200ms TLS setup per send and providers will throttle you. The right model: a connection pool of persistent HTTP/2 connections per provider, with a send queue in front. Each worker in the pool holds a connection alive with keep-alives and multiplexes streams. Connection drops (provider-side restarts are common) must trigger reconnect with exponential backoff — not a crash.
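
A minimal sketch of the two mechanisms described here: a cap on in-flight streams per connection, and reconnect with exponential backoff. `PooledConnection` is a hypothetical stand-in for a real HTTP/2 connection (no network I/O happens), and the base delay, cap, and attempt count are assumed values.

```python
import asyncio

def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2^attempt, capped (values are assumptions)."""
    return min(cap, base * 2 ** attempt)

class PooledConnection:
    """Hypothetical stand-in for one persistent HTTP/2 connection."""

    def __init__(self, max_streams: int = 1500):
        # The semaphore models the provider's concurrent-stream limit.
        self._streams = asyncio.Semaphore(max_streams)
        self.in_flight = 0
        self.peak_in_flight = 0

    async def send(self, token: str, payload: dict) -> str:
        async with self._streams:  # wait for a free stream
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                await asyncio.sleep(0)  # placeholder for the real request
                return "ok"
            finally:
                self.in_flight -= 1

async def connect_with_backoff(connect, max_attempts: int = 5, sleep=asyncio.sleep):
    """Retry `connect` on ConnectionError, sleeping longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except ConnectionError:
            await sleep(reconnect_delay(attempt))
    raise ConnectionError("provider unreachable after retries")

async def demo():
    attempts = {"n": 0}
    delays = []

    async def flaky_connect():
        # Simulates two provider-side drops before a successful connect.
        attempts["n"] += 1
        if attempts["n"] <= 2:
            raise ConnectionError("simulated provider restart")
        return PooledConnection(max_streams=3)

    async def fake_sleep(seconds):
        delays.append(seconds)  # record the delay instead of waiting

    conn = await connect_with_backoff(flaky_connect, sleep=fake_sleep)
    results = await asyncio.gather(*(conn.send(f"token-{i}", {}) for i in range(10)))
    return delays, conn.peak_in_flight, results

delays, peak, results = asyncio.run(demo())
```

In the demo all ten sends complete, but never more than three are in flight at once, and the two simulated drops produce growing delays instead of a crash.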

Per-destination queue isolation prevents head-of-line blocking

If one destination is slow, for example because a provider is throttling a single customer app's sends, delivery through one shared queue stalls every notification behind it. The architecture requires per-destination (or per-app) queues so one slow consumer cannot block another. In practice, this means a topic per customer app in your internal queue, with each worker consuming from one topic. Dead-letter queues for permanently failing deliveries prevent retry storms. The visibility timeout on the queue (e.g. 30s) ensures that a message held by a crashed worker is redelivered rather than lost.
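
The isolation idea can be illustrated with an in-memory sketch: one queue per app, drained round-robin, so a blocked app is skipped and the others keep moving. `PerAppQueues` and its methods are hypothetical names; a production system would use topics or partitions in a real message broker.

```python
from collections import deque

class PerAppQueues:
    """One FIFO queue per app, so a slow app only delays its own messages."""

    def __init__(self):
        self._queues = {}  # app_id -> deque of pending messages

    def put(self, app_id: str, message: str) -> None:
        self._queues.setdefault(app_id, deque()).append(message)

    def next_round(self, blocked=frozenset()):
        """Take at most one message from each app that is not blocked."""
        taken = []
        for app_id, queue in self._queues.items():
            if app_id in blocked or not queue:
                continue
            taken.append((app_id, queue.popleft()))
        return taken

    def depth(self, app_id: str) -> int:
        return len(self._queues.get(app_id, ()))

queues = PerAppQueues()
for i in range(3):
    queues.put("slow-app", f"s{i}")
queues.put("fast-app", "f0")

# While slow-app is throttled, fast-app's message is still delivered.
first = queues.next_round(blocked={"slow-app"})
# Once slow-app recovers, its backlog drains in order.
second = queues.next_round()
```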

Token registry must be actively pruned

APNs and FCM both report invalid or expired device tokens in their send responses. APNs returns 410 with reason `Unregistered` and a `timestamp` marking when it confirmed the token was no longer valid; delete the token unless your registry shows the device re-registered after that timestamp, in which case keep the newer registration. FCM's legacy HTTP API returned a canonical `registration_id` in the response when a token had been rotated; the HTTP v1 API instead reports `UNREGISTERED` (404) for tokens that are no longer valid. Not consuming this feedback means you accumulate dead tokens, waste quota, and eventually get rate-limited by the provider for low delivery ratios. Process feedback inline after every send batch, and run a nightly prune job as a backstop.
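
The APNs timestamp check can be sketched as a comparison between the provider's timestamp and the time your registry last saw the token registered. The function name and the `last_registered_at` field are hypothetical; the timestamp is treated as milliseconds since the Unix epoch, which is how APNs documents it.

```python
from datetime import datetime, timezone

def should_delete_token(apns_timestamp_ms: int, last_registered_at: datetime) -> bool:
    """Decide what to do with a token after a 410 response.

    Delete it if the registry has no registration newer than the moment the
    provider confirmed the token invalid. `last_registered_at` must be a
    timezone-aware datetime.
    """
    invalid_since = datetime.fromtimestamp(apns_timestamp_ms / 1000, tz=timezone.utc)
    return last_registered_at <= invalid_since

invalid_at_ms = 1_700_000_000_000  # example timestamp from a 410 response

# Registered before the invalidation: the token is dead, prune it.
registered_before = datetime.fromtimestamp(1_699_999_000, tz=timezone.utc)
# Registered after the invalidation: the device re-registered, keep it.
registered_after = datetime.fromtimestamp(1_700_000_500, tz=timezone.utc)

prune_old = should_delete_token(invalid_at_ms, registered_before)   # True
prune_new = should_delete_token(invalid_at_ms, registered_after)    # False
```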

Fan-out for broadcasts cannot be a serial loop

Sending a push to 100M users by iterating a user table and calling your send API takes about 28 hours at 1k sends/sec (100M ÷ 1k/sec = 100,000 seconds). The architecture for broadcast is parallelized fan-out: partition the user table into shards, dispatch each shard to a worker, and have workers batch-send to the provider (FCM's multicast calls accept up to 500 tokens per request). Topic-based delivery (FCM topics or APNs broadcast pushes) offloads fan-out to the provider, but limits customization per recipient. For personalized broadcasts (different payload per user), the sharded worker fleet is the only option.
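
A small sketch of the fan-out pieces and the arithmetic behind them: a stable shard assignment, batching at 500 tokens, and the broadcast duration at a given send rate. The helper names are hypothetical, CRC32 is just one possible stable hash, and the 100k sends/sec rate is taken from the fleet figure quoted earlier.

```python
import zlib
from itertools import islice

def shard_of(user_id: str, num_shards: int) -> int:
    """Stable shard assignment (CRC32 is an arbitrary choice of stable hash)."""
    return zlib.crc32(user_id.encode()) % num_shards

def batches(tokens, size: int = 500):
    """Yield successive lists of at most `size` tokens."""
    iterator = iter(tokens)
    while chunk := list(islice(iterator, size)):
        yield chunk

def broadcast_hours(recipients: int, sends_per_sec: int) -> float:
    """Wall-clock hours to reach all recipients at a sustained send rate."""
    return recipients / sends_per_sec / 3600

batch_sizes = [len(b) for b in batches(range(1201))]          # [500, 500, 201]
serial_hours = broadcast_hours(100_000_000, 1_000)            # about 27.8 hours
fanout_minutes = broadcast_hours(100_000_000, 100_000) * 60   # about 16.7 minutes
```

The same 100M-recipient broadcast drops from more than a day to under twenty minutes when the aggregate rate rises from 1k to 100k sends/sec.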

Retry logic must handle provider semantics, not just HTTP errors

FCM returns `Unavailable` (503) when overloaded: retry with exponential backoff plus jitter. It returns `InvalidRegistration` for a malformed token: do not retry, delete the token. It returns `MessageRateExceeded` when you are sending too fast to one device: back off for that token specifically, not globally. These names follow FCM's legacy API; the HTTP v1 API reports the corresponding conditions as `UNAVAILABLE` (503), `INVALID_ARGUMENT` (400) or `UNREGISTERED` (404), and `QUOTA_EXCEEDED` (429). Treating every error as retryable is a common bug that amplifies storms: a spike of Unavailable responses triggers a retry wave that makes the provider more overloaded. A per-error-code state machine in the delivery worker is the correct implementation.
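
One way to sketch the per-error-code handling is a lookup table from error code to action, plus a jittered backoff for the retryable case. The error names are the ones used in this section; the `Action` enum, the dead-letter default for unknown codes, and the backoff parameters are assumptions.

```python
import random
from enum import Enum

class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    DELETE_TOKEN = "delete_token"
    THROTTLE_TOKEN = "throttle_token"
    DEAD_LETTER = "dead_letter"

# Error names as used in the text above.
ACTIONS = {
    "Unavailable": Action.RETRY_WITH_BACKOFF,
    "InvalidRegistration": Action.DELETE_TOKEN,
    "MessageRateExceeded": Action.THROTTLE_TOKEN,
}

def classify(error_code: str) -> Action:
    """Map a provider error to an action; unknown codes are dead-lettered
    (an assumption, chosen so unknown errors never cause blind retries)."""
    return ACTIONS.get(error_code, Action.DEAD_LETTER)

def retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                rng=random.random) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2^attempt))."""
    return rng() * min(cap, base * 2 ** attempt)

def handle(error_code: str, attempt: int, max_attempts: int = 5) -> Action:
    """Give up on retryable errors once the attempt budget is spent."""
    action = classify(error_code)
    if action is Action.RETRY_WITH_BACKOFF and attempt >= max_attempts:
        return Action.DEAD_LETTER
    return action
```

Capping attempts and spreading retries with jitter keeps a burst of `Unavailable` responses from turning into a synchronized retry wave.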

What breaks at scale

The catastrophic failure is token table corruption during a migration: if device tokens stored in your DB are truncated, encoded differently (base64 vs hex), or missing platform prefixes, every send returns InvalidRegistration. This has caused large-scale outages in which 80% of sends fail silently (no exception, just a provider rejection logged to a metrics counter nobody watches). The second failure mode is broadcast amplification: a bug sends the same notification 10× to the same user because the dedup check (hash of notification_id + device_id) was missing. Always attach an idempotency key to every send and dedup at the queue level.
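
A minimal sketch of the dedup check, assuming the key is a SHA-256 hash over notification_id and device_id. The in-memory set stands in for whatever shared store with expiry a real deployment would use, and the `Deduper` class is hypothetical.

```python
import hashlib

def idempotency_key(notification_id: str, device_id: str) -> str:
    """Hash of notification_id + device_id. The ':' separator assumes
    neither ID contains a colon."""
    raw = f"{notification_id}:{device_id}".encode()
    return hashlib.sha256(raw).hexdigest()

class Deduper:
    """In-memory stand-in for a shared dedup store (assumption: a real
    system would use a store with a TTL so the set does not grow forever)."""

    def __init__(self):
        self._seen = set()

    def first_time(self, notification_id: str, device_id: str) -> bool:
        """Return True only the first time this (notification, device) pair is seen."""
        key = idempotency_key(notification_id, device_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

dedup = Deduper()
# A buggy producer enqueues the same notification ten times for one device.
delivered = sum(dedup.first_time("promo-123", "device-abc") for _ in range(10))
# A different device with the same notification is still allowed through.
other_device = dedup.first_time("promo-123", "device-xyz")
```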

In production

Apple requires HTTP/2 over TLS and expects connections to stay open across many notifications rather than being opened per send; each reconnect incurs ~200ms of TLS handshake overhead. Meta/Facebook built a custom push system handling billions of daily notifications that separates token registration, routing, and delivery into independent services with dedicated fan-out for broadcast campaigns. AWS SNS wraps APNs/FCM behind a managed abstraction with per-platform queues, but provider rate limiting is still your problem: SNS will throttle you if you send to invalid tokens at high rates, and the fix is running your own feedback-loop cleaner against the provider's invalid token stream.

Common mistakes

New connection per push
Ignoring provider feedback (dead tokens)
Synchronous broadcast loops

Part of System Design Library on SystemLore, system design interview prep with 148 deep topics, interactive diagrams, and a practice game.