Push Notification Gateway
Deliver mobile push to billions of devices via APNs/FCM reliably.
Requirements
Functional
- Register device tokens
- Send push (single/broadcast)
- Provider adapters
- Retry/feedback
Non-functional
- High throughput
- At-least-once
- Token hygiene
Scale
Billions of devices
The approach
Keep a token registry keyed by user and device. Sends fan out through queues to provider adapters (APNs/FCM) that hold persistent connections. Adapters retry transient failures and act on provider feedback by pruning invalid tokens; broadcasts are sent in batches.
Key components
App → notif service → queues → APNs/FCM adapters · token registry
Numbers that matter
- APNs caps concurrent HTTP/2 streams per connection: about 1,500 is the figure used here, reportedly dropping to ~500 on congestion, and the server advertises the actual limit in its HTTP/2 settings. A sending fleet must size its connection pools carefully to avoid stream exhaustion
- Expect roughly 10–30% of registered tokens to be invalid at any given time in production apps (uninstalls, re-installs, OS upgrades); pruning regularly sharply reduces wasted sends
- End-to-end push delivery (your server → APNs/FCM → device) averages 1–5 seconds under normal conditions, spiking to 30–120 seconds when provider queues are saturated
- A single delivery worker using HTTP/2 multiplexing can achieve 5,000–10,000 sends/second per connection; a fleet of 10 workers can sustain 50k–100k notifications/second aggregate throughput
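The per-connection and fleet figures above are consistent with a simple estimate: throughput equals requests in flight divided by the time each one spends in flight (Little's law). The sketch below is a back-of-the-envelope calculation, not a benchmark. The 150 ms and 300 ms provider round-trip times are assumed values chosen for illustration, and `sends_per_second` is a hypothetical helper.

```python
def sends_per_second(max_streams: int, round_trip_ms: int) -> float:
    """Little's law: throughput = requests in flight / time each spends in flight."""
    return max_streams * 1000 / round_trip_ms


# Assumed provider round-trip times (illustrative, not measured).
slow = sends_per_second(1500, 300)  # every stream busy, 300 ms per request
fast = sends_per_second(1500, 150)  # every stream busy, 150 ms per request

workers = 10  # one connection per worker, as in the bullet above
fleet_low, fleet_high = workers * slow, workers * fast

print(f"per connection: {slow:,.0f} to {fast:,.0f} sends/s")
print(f"fleet of {workers}: {fleet_low:,.0f} to {fleet_high:,.0f} sends/s")
```

If the provider lowers the stream limit or latency rises, the same formula shows how far throughput falls, which is why the pool size should be derived from measured latency rather than fixed.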
Senior deep-dive
The provider connection pool, not your queue depth, is the real throughput bottleneck — APNs and FCM have persistent HTTP/2 connections that must be managed carefully to avoid rejections.
Invalid token pruning is an ongoing operational discipline: an invalid-token rate of up to 30% is normal in a live app, and sending to dead tokens wastes quota and triggers provider rate limits.
Broadcasts require fan-out infrastructure, not loops — sending a notification to 100M users via a serial loop takes hours; batched topic-based delivery (FCM topics / SNS) or a dedicated fan-out tier cuts this to minutes.
Provider connection pooling is the hidden bottleneck
APNs expects long-lived HTTP/2 connections and multiplexes many in-flight streams over each one (1,500 is the per-connection figure used here; the server advertises the actual limit). If your sending code naively opens a new connection per notification, you burn ~200ms of TLS setup per send and the provider will throttle you. The right model is a pool of persistent HTTP/2 connections per provider, with a send queue in front. Each worker in the pool keeps its connection alive with keep-alives and multiplexes streams over it. Connection drops (provider-side restarts are common) must trigger a reconnect with exponential backoff, not a crash.
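A minimal sketch of the pooling model described above, assuming an asyncio-based worker. It shows two ideas only: a semaphore that caps in-flight streams on one connection, and reconnects that use exponential backoff with jitter. `PooledConnection`, `backoff_delay`, and the `connect` factory are hypothetical names; a real adapter would wrap an actual HTTP/2 client here.

```python
import asyncio
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class PooledConnection:
    """One long-lived provider connection with a cap on in-flight streams."""

    def __init__(self, connect, max_streams=1500, backoff_base=0.5, max_attempts=8):
        self._connect = connect          # hypothetical: async () -> async send(token, payload)
        self._streams = asyncio.Semaphore(max_streams)
        self._lock = asyncio.Lock()      # only one task reconnects at a time
        self._send = None
        self._backoff_base = backoff_base
        self._max_attempts = max_attempts

    async def _ensure_connected(self):
        async with self._lock:
            attempt = 0
            while self._send is None:
                try:
                    self._send = await self._connect()
                except ConnectionError:
                    attempt += 1
                    if attempt >= self._max_attempts:
                        raise            # give up; the caller decides what to do
                    await asyncio.sleep(backoff_delay(attempt, self._backoff_base))

    async def send(self, token, payload):
        async with self._streams:        # stay within the stream budget
            await self._ensure_connected()
            send = self._send
            try:
                return await send(token, payload)
            except ConnectionError:
                if self._send is send:
                    self._send = None    # drop the dead connection; the next send reconnects
                raise
```

A pool is then a list of these objects, one per worker. The failed send is re-raised on purpose so that the queue layer, not the connection, decides whether to retry the message.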
Per-destination queue isolation prevents head-of-line blocking
If one app's sends are slow, throttled, or failing, naive delivery through a single shared queue stalls every other notification behind them. Use per-destination (or per-app) queues so one slow consumer cannot block another. In practice, this means a topic per customer app in your internal queue, with each worker consuming from one topic. Dead-letter queues for permanently failing deliveries prevent retry storms. A visibility timeout on the queue (e.g. 30s) ensures that a message held by a crashed worker is re-enqueued rather than lost.
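The isolation idea can be sketched with in-memory queues. This is an illustration under stated assumptions, not a production queue: `PerAppQueues`, `drain_round`, and the `deliver` callback are hypothetical names, a real system would use a broker with one topic per app, and the visibility timeout is not modeled here.

```python
from collections import defaultdict, deque


class PerAppQueues:
    """One FIFO queue per app, drained round-robin, with a dead-letter list."""

    def __init__(self, max_attempts=3):
        self.queues = defaultdict(deque)
        self.dead_letter = []
        self.max_attempts = max_attempts

    def enqueue(self, app_id, message):
        self.queues[app_id].append({"body": message, "attempts": 0})

    def drain_round(self, deliver):
        """Take at most one message per app, so a failing app cannot starve the rest."""
        delivered = []
        for app_id in list(self.queues):
            queue = self.queues[app_id]
            if not queue:
                continue
            item = queue.popleft()
            item["attempts"] += 1
            if deliver(app_id, item["body"]):
                delivered.append((app_id, item["body"]))
            elif item["attempts"] >= self.max_attempts:
                # Permanently failing: park it instead of retrying forever.
                self.dead_letter.append((app_id, item["body"]))
            else:
                queue.append(item)       # requeue for a later round
        return delivered
```

With one app whose sends always fail and another whose sends succeed, each round still delivers one message for the healthy app, and the failing message lands in `dead_letter` after `max_attempts` tries.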
Token registry must be actively pruned
APNs and FCM both report invalid or expired device tokens in their send responses. APNs returns 410 Gone with a timestamp marking when it confirmed the token was no longer valid for the topic: if your registry shows the device registered that token again after the timestamp, keep it; otherwise delete it. The legacy FCM HTTP API returned a canonical `registration_id` in the response when a token had been rotated, and the stored token should be replaced with it. If you do not consume this feedback, you accumulate dead tokens, waste quota, and eventually get rate-limited by the provider for low delivery ratios. Process feedback inline after every send batch, and run a nightly prune job as a backstop.
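The prune decision for an APNs 410 response can be sketched as a comparison between the provider's timestamp and the time the token was last registered. This assumes the registry stores a timezone-aware `registered_at` per token and that the timestamp is in milliseconds since the epoch; `should_delete_token` is a hypothetical helper, not part of any SDK.

```python
from datetime import datetime, timezone
from typing import Optional


def should_delete_token(
    status: int,
    invalid_since_ms: Optional[int],
    registered_at: datetime,
) -> bool:
    """Decide whether to prune a token after a send response.

    registered_at must be timezone-aware: it is when the device last
    registered this token with our service (an assumed registry field).
    """
    if status != 410:
        return False                     # only 410 Gone marks the token as dead
    if invalid_since_ms is None:
        return True                      # no timestamp to compare against: prune
    invalid_since = datetime.fromtimestamp(invalid_since_ms / 1000, tz=timezone.utc)
    # A registration newer than the provider's timestamp means the device
    # came back, so the token is kept.
    return registered_at <= invalid_since
```

Running this inline after each send batch keeps the registry current, and the nightly job only has to catch tokens that were never sent to.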
Fan-out for broadcasts cannot be a serial loop
Sending a push to 100M users by iterating a user table and calling your send API takes about 28 hours at 1k sends/sec (100M ÷ 1,000 = 100,000 seconds). The architecture for broadcast is parallelized fan-out: partition the user table into shards, dispatch each shard to a worker, and have the workers batch-send to the provider (FCM supports up to 500 tokens per batch). Topic-based delivery (FCM topics or APNs broadcast pushes) offloads fan-out to the provider, but limits customization per recipient. For personalized broadcasts (a different payload per user), the sharded worker fleet is the only option.
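The arithmetic and the shard-then-batch structure can be sketched in a few lines. The helper names (`broadcast_hours`, `shard_of`, `batches`) are hypothetical, and CRC32 is used only as an example of a hash that is stable across processes; any stable hash would do.

```python
import zlib
from itertools import islice


def broadcast_hours(recipients: int, sends_per_second: int) -> float:
    """Wall-clock hours to reach every recipient at a given aggregate send rate."""
    return recipients / sends_per_second / 3600


def shard_of(user_id: str, num_shards: int) -> int:
    """Stable shard assignment, identical on every worker and every run."""
    return zlib.crc32(user_id.encode("utf-8")) % num_shards


def batches(tokens, size=500):
    """Yield lists of at most `size` tokens (500 matches the FCM batch limit above)."""
    it = iter(tokens)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


serial = broadcast_hours(100_000_000, 1_000)        # one loop at 1k sends/sec
parallel = broadcast_hours(100_000_000, 100_000)    # fleet at 100k sends/sec
print(f"serial: {serial:.1f} h, parallel: {parallel * 60:.1f} min")
```

Each worker would read the users whose `shard_of` value matches its shard number and feed their tokens through `batches` into the provider adapter.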
Retry logic must handle provider semantics, not just HTTP errors
FCM returns `Unavailable` (503) when overloaded: retry with exponential backoff plus jitter. `InvalidRegistration` (400) means the token itself is bad: do not retry, delete the token. `MessageRateExceeded` means you are sending too fast to one device: back off for that token specifically, not globally. These are the legacy API's error names; the HTTP v1 API reports the corresponding conditions with codes such as `UNAVAILABLE`, `UNREGISTERED`, and `QUOTA_EXCEEDED`. Treating every error as "retry" is a common bug that amplifies storms: a spike of Unavailable responses triggers a retry wave that makes the provider more overloaded. A per-error-code state machine in the delivery worker is the correct implementation.
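A minimal sketch of the per-error-code decision, using the three error names from the paragraph above. The `Action` enum, `next_action`, the retry limit of 5, and the choice to dead-letter unknown codes are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"   # transient provider failure
    DELETE_TOKEN = "delete_token"               # token is invalid, never retry
    THROTTLE_TOKEN = "throttle_token"           # slow down for this one device
    DEAD_LETTER = "dead_letter"                 # park for inspection


# Error names as used in the text above.
ACTIONS = {
    "Unavailable": Action.RETRY_WITH_BACKOFF,
    "InvalidRegistration": Action.DELETE_TOKEN,
    "MessageRateExceeded": Action.THROTTLE_TOKEN,
}


def next_action(error_code: str, attempt: int, max_attempts: int = 5) -> Action:
    """Map a provider error to the next step for this message."""
    # Unknown codes are parked rather than retried blindly (an assumption).
    action = ACTIONS.get(error_code, Action.DEAD_LETTER)
    if action is Action.RETRY_WITH_BACKOFF and attempt >= max_attempts:
        return Action.DEAD_LETTER               # bounded retries stop the storm
    return action
```

Keeping the table as data makes it easy to review and extend as a provider adds or renames error codes.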
What breaks at scale
The catastrophic failure is token table corruption during a migration: if device tokens stored in your DB are truncated, encoded differently (base64 vs hex), or missing platform prefixes, every send returns InvalidRegistration. This has caused large-scale outages in which 80% of sends fail silently (no exception, just a provider rejection logged to a metrics counter nobody watches). The second failure mode is broadcast amplification: a bug sends the same notification 10× to the same user because the dedup check (hash of notification_id + device_id) was missing. Attach an idempotency key to every send and dedup at the queue level.
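The dedup check can be sketched as a hash over the (notification_id, device_id) pair. The in-memory set below stands in for a shared store with a TTL, which is what a multi-worker fleet would need; `idempotency_key` and `Deduplicator` are hypothetical names.

```python
import hashlib
import json


def idempotency_key(notification_id: str, device_id: str) -> str:
    """Hash of (notification_id, device_id).

    JSON-encoding the pair avoids collisions that plain string
    concatenation would allow, e.g. ("a:b", "c") vs ("a", "b:c").
    """
    raw = json.dumps([notification_id, device_id]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class Deduplicator:
    """Remembers which (notification, device) pairs were already enqueued."""

    def __init__(self):
        self._seen = set()               # stand-in for a shared store with TTL

    def first_time(self, notification_id: str, device_id: str) -> bool:
        key = idempotency_key(notification_id, device_id)
        if key in self._seen:
            return False                 # duplicate: drop before it reaches the queue
        self._seen.add(key)
        return True
```

Calling `first_time` at enqueue time means a buggy producer that emits the same notification ten times results in one send, not ten.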
In production
Apple requires HTTP/2 over TLS for APNs and advises keeping connections open across many notifications rather than opening one per notification; each reconnect incurs ~200ms of TLS handshake overhead. Meta/Facebook built a custom push system handling billions of daily notifications that separates token registration, routing, and delivery into independent services with dedicated fan-out for broadcast campaigns. AWS SNS wraps APNs/FCM behind a managed abstraction with per-platform queues, but provider rate limiting is still your problem: SNS will throttle you if you send to invalid tokens at high rates, and the fix is running your own feedback-loop cleaner against the provider's invalid token stream.
Common mistakes
- New connection per push
- Ignoring provider feedback (dead tokens)
- Synchronous broadcast loops