Webhook Delivery
Reliably deliver event callbacks to customer HTTP endpoints, with retries.
Requirements
Functional
- Subscribe to events
- Deliver via HTTP POST
- Retries w/ backoff
- Signing/security
Non-functional
- At-least-once
- Ordered (optional)
- Slow/dead endpoints don't block others
Scale
Millions of deliveries
The approach
Events → per-destination queues → delivery workers POST with HMAC signature; failures retry with exponential backoff; persistent failures go to a DLQ; per-endpoint isolation prevents head-of-line blocking.
Key components
Producers → queues (per endpoint) → delivery workers · DLQ
Numbers that matter
- 3-5 delivery attempts with exponential backoff (for example, retry delays of 1s, 5s, 30s, 5m, 30m) cover ~95% of transient endpoint outages without queue backup.
- <500ms P50 first-attempt delivery latency is a common expectation for webhooks; P99 should stay under 5s for healthy endpoints.
- ~100 bytes HMAC-SHA256 signature header overhead per request — negligible, but skipping it costs customers the ability to verify authenticity.
- ~10-50 concurrent HTTP connections per destination is a typical concurrency cap — beyond this you're likely overloading a small customer service.
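To make the retry numbers concrete, here is a small sketch that computes how long the example schedule above keeps retrying before it gives up. The names `SCHEDULE_S`, `retry_window_s`, and `delay_for_attempt` are illustrative, not part of any library.

```python
# Example backoff schedule from the list above, in seconds.
SCHEDULE_S = [1, 5, 30, 5 * 60, 30 * 60]


def retry_window_s(schedule):
    """Total time between the first failed attempt and the last retry."""
    return sum(schedule)


def delay_for_attempt(schedule, failures):
    """Delay before the next retry after `failures` failed attempts.

    Returns None once the schedule is exhausted, which is the point
    where the event should move to a dead-letter queue.
    """
    if failures < 1 or failures > len(schedule):
        return None
    return schedule[failures - 1]


print(retry_window_s(SCHEDULE_S) / 60)  # total retry window in minutes
```

With this schedule the retry window is 2136 seconds, about 36 minutes, so any outage longer than that exhausts the retries. That is one reason longer schedules or a replayable dead-letter queue are needed for endpoints that stay down for hours.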
Senior deep-dive
Per-destination isolation is the architecture primitive — one slow or dead endpoint must never block delivery to others; every destination gets its own queue.
Exponential backoff with a dead-letter queue is not optional — customer endpoints go down for hours, and a naive retry storm will amplify the outage by hammering an already-struggling server.
HMAC signatures are the security contract — without them, any attacker who discovers your webhook URL can forge events; customers must verify the signature on every delivery.
Per-destination queues: the only design that isolates failure
A naive design puts all pending deliveries in one queue with N workers consuming from it. When a customer's endpoint goes down, its retries pile up and consume worker slots, starving other customers of timely delivery. The fix is one queue (or priority partition) per destination endpoint, so each endpoint's retries are independent. At Stripe scale this means millions of queues. Those are usually logical queues keyed by endpoint ID (for example, per-endpoint records in DynamoDB) rather than one physical Kafka partition each, since Kafka clusters are typically run with far fewer partitions than that.
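A minimal in-memory sketch of the idea, assuming a single process and a hypothetical `PerEndpointQueues` class: each endpoint has its own FIFO, and the dispatcher takes one event per endpoint in rotation, so a backlog on one endpoint cannot monopolize the workers. A production system would back this with durable storage.

```python
from collections import OrderedDict, deque


class PerEndpointQueues:
    """One FIFO per endpoint; dispatch rotates across endpoints."""

    def __init__(self):
        # endpoint_id -> deque of pending events, in rotation order
        self._queues = OrderedDict()

    def enqueue(self, endpoint_id, event):
        self._queues.setdefault(endpoint_id, deque()).append(event)

    def next_delivery(self):
        """Pop one event from the endpoint at the front of the rotation,
        then move that endpoint to the back. Returns None when idle."""
        if not self._queues:
            return None
        endpoint_id, queue = next(iter(self._queues.items()))
        event = queue.popleft()
        del self._queues[endpoint_id]
        if queue:
            # Re-insert at the back so other endpoints get a turn first.
            self._queues[endpoint_id] = queue
        return endpoint_id, event


queues = PerEndpointQueues()
for i in range(3):
    queues.enqueue("slow-endpoint", f"evt_{i}")
queues.enqueue("healthy-endpoint", "evt_x")
print(queues.next_delivery())  # ('slow-endpoint', 'evt_0')
print(queues.next_delivery())  # ('healthy-endpoint', 'evt_x')
```

Even with three events queued for the slow endpoint, the healthy endpoint is served on the second dispatch rather than after the whole backlog.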
Retry strategy: backoff is not enough, you need a DLQ
Exponential backoff prevents hammering a struggling endpoint but doesn't solve what happens after max retries. A dead-letter queue holds these failed deliveries for inspection and manual replay; without it, events are silently dropped. Equally important is circuit breaking per endpoint: after 10 consecutive failures, pause delivery entirely and notify the customer rather than wasting compute on attempts that will almost certainly fail.
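The bookkeeping for the dead-letter queue and the circuit breaker can be sketched as below. `EndpointState`, `record_result`, and the `MAX_ATTEMPTS` value are assumptions made for illustration; the threshold of 10 comes from the text.

```python
from dataclasses import dataclass, field

FAILURE_THRESHOLD = 10  # consecutive failures before pausing the endpoint
MAX_ATTEMPTS = 5        # assumed per-event attempt cap


@dataclass
class EndpointState:
    consecutive_failures: int = 0
    paused: bool = False
    dead_letters: list = field(default_factory=list)


def record_result(state, event_id, attempt, ok):
    """Update per-endpoint state after one delivery attempt.

    Returns "delivered", "retry", or "dead_letter".
    """
    if ok:
        state.consecutive_failures = 0
        return "delivered"
    state.consecutive_failures += 1
    if state.consecutive_failures >= FAILURE_THRESHOLD:
        # Breaker opens: the caller should stop dispatching to this
        # endpoint and notify the customer.
        state.paused = True
    if attempt >= MAX_ATTEMPTS:
        # Retries exhausted: keep the event for inspection and replay.
        state.dead_letters.append(event_id)
        return "dead_letter"
    return "retry"
```

In this sketch the breaker stays open until something outside the function clears `paused`, for example a successful health probe or a manual resume by the customer.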
HMAC signatures: authentication you implement once, customers break forever
Sign the request body with HMAC-SHA256 using a per-endpoint secret and include it as a header (`X-Webhook-Signature`). Customers verify by recomputing the HMAC over the raw body. The failure mode: customers who verify the signature after JSON-parsing the body will fail on any whitespace or key-ordering change — they must verify over the raw byte body, not the parsed structure. Document this explicitly. Also add a timestamp in the signed payload and reject signatures >5 minutes old to prevent replay attacks.
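A sketch of signing and verification using Python's standard `hmac` module. The header layout `t=<timestamp>,v1=<hex digest>` and the choice to sign `timestamp + "." + raw_body` are assumptions for this example (other layouts work equally well); the 5-minute tolerance comes from the text.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_S = 300  # reject signatures more than 5 minutes old


def sign(secret: bytes, raw_body: bytes, timestamp: int) -> str:
    """Build the signature header value over the timestamp and raw body."""
    signed = str(timestamp).encode() + b"." + raw_body
    digest = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"


def verify(secret: bytes, raw_body: bytes, header: str,
           now: Optional[int] = None) -> bool:
    """Verify over the raw bytes as received, never a re-serialized body."""
    now = int(time.time()) if now is None else now
    try:
        parts = dict(p.split("=", 1) for p in header.split(","))
        timestamp = int(parts["t"])
        received = parts["v1"]
    except (KeyError, ValueError):
        return False  # malformed header
    if abs(now - timestamp) > TOLERANCE_S:
        return False  # too old (or too far in the future): possible replay
    expected = sign(secret, raw_body, timestamp).split("v1=", 1)[1]
    # Constant-time comparison avoids leaking digest bytes via timing.
    return hmac.compare_digest(expected.encode(), received.encode())


header = sign(b"endpoint-secret", b'{"id":"evt_1"}', 1_700_000_000)
print(verify(b"endpoint-secret", b'{"id":"evt_1"}', header, now=1_700_000_060))   # True
print(verify(b"endpoint-secret", b'{"id": "evt_1"}', header, now=1_700_000_060))  # False
```

The second call fails only because a space was added after the colon, which is exactly what happens when a customer parses and re-serializes the JSON before verifying.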
Ordering and exactly-once: the lie we tell customers
Webhooks are at-least-once delivery by design: a retry after a delivery that actually succeeded (because the response timed out or was lost to a network hiccup) means customers will receive duplicates. Exactly-once processing is achieved by the customer, not the sender: include an event ID in every payload, and have customers deduplicate against a processed-events store. Ordering is weaker still: a retried older event can arrive after a newer event delivered by a different worker. Include a sequence number per resource (e.g., `customer.123 event seq=42`) so customers can detect and handle out-of-order delivery.
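On the receiving side, the deduplication and ordering checks might look like the sketch below. The `WebhookReceiver` class and the payload fields `id`, `resource_id`, and `seq` are hypothetical, and the in-memory set stands in for a durable processed-events store.

```python
class WebhookReceiver:
    """Customer-side handler: dedupe by event ID, detect stale events by seq."""

    def __init__(self):
        self.processed_ids = set()  # stand-in for a durable store
        self.last_seq = {}          # resource_id -> highest seq applied

    def handle(self, event):
        """Return "applied", "duplicate", or "stale"."""
        if event["id"] in self.processed_ids:
            return "duplicate"  # redelivery of an event already handled
        self.processed_ids.add(event["id"])
        resource = event["resource_id"]
        if event["seq"] <= self.last_seq.get(resource, 0):
            # A newer event for this resource was already applied.
            return "stale"
        self.last_seq[resource] = event["seq"]
        return "applied"


receiver = WebhookReceiver()
print(receiver.handle({"id": "evt_b", "resource_id": "customer.123", "seq": 42}))  # applied
print(receiver.handle({"id": "evt_b", "resource_id": "customer.123", "seq": 42}))  # duplicate
print(receiver.handle({"id": "evt_a", "resource_id": "customer.123", "seq": 41}))  # stale
```

Dropping stale events is only one possible policy; a customer could instead refetch the resource's current state from the API whenever a gap or reordering is detected.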
Fanout: when one event has 1000 subscribers
An internal event (order.placed) may have N registered webhook endpoints across N customers. The fanout step — generating N delivery tasks from one event — must be fast and non-blocking. Write to the event log once; a fanout worker reads and enqueues N delivery tasks. At high throughput this fanout worker becomes a bottleneck: partition the fanout by event type or source, run multiple workers, and ensure the enqueue step is idempotent so crashes during fanout don't produce duplicate or missed deliveries.
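One way to make the enqueue step idempotent is to derive each delivery task's ID deterministically from the event and endpoint, so a fanout worker that crashes and restarts recreates the same IDs and skips the tasks that already exist. This is a sketch under that assumption; `task_id`, `fan_out`, and the dict used as `task_store` are illustrative stand-ins for a store with a unique-key constraint.

```python
import hashlib


def task_id(event_id: str, endpoint_id: str) -> str:
    """Deterministic ID: the same (event, endpoint) pair always maps to one task."""
    return hashlib.sha256(f"{event_id}:{endpoint_id}".encode()).hexdigest()


def fan_out(event_id, endpoint_ids, task_store):
    """Create one delivery task per endpoint; return how many were new.

    Safe to re-run after a crash: existing tasks are left untouched.
    """
    created = 0
    for endpoint_id in endpoint_ids:
        key = task_id(event_id, endpoint_id)
        if key not in task_store:
            task_store[key] = {"event_id": event_id, "endpoint_id": endpoint_id}
            created += 1
    return created


store = {}
print(fan_out("evt_1", ["ep_a", "ep_b", "ep_c"], store))  # 3
print(fan_out("evt_1", ["ep_a", "ep_b", "ep_c"], store))  # 0 on replay
```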
What breaks at scale
Queue depth explosion during a sustained customer outage: retries accumulate faster than they drain, and a weeks-long outage means millions of queued events. You need a max queue depth per endpoint with an overflow policy (drop the oldest events, or stop accepting new events for that endpoint).
Noisy neighbors: if one integration generates events at a very high rate (e.g., a marketplace firing order.updated on every inventory change), a single customer's webhook endpoint can receive more traffic than it can handle while consuming a disproportionate share of your delivery capacity. Add per-endpoint rate limiting on the sender side.
Finally, expired TLS certificates on customer endpoints are a common cause of mysterious webhook failures after months of working fine.
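Both safeguards can be sketched in a few lines: a bounded queue that applies the drop-oldest overflow policy, and a token bucket for sender-side rate limiting. The class names and parameters are assumptions for illustration, and the caller supplies the clock (`now`, in seconds) to keep the example deterministic.

```python
from collections import deque


class BoundedEndpointQueue:
    """Per-endpoint queue with a max depth and a drop-oldest overflow policy."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.events = deque()
        self.dropped = 0  # surface this as a metric and in customer notices

    def push(self, event):
        if len(self.events) >= self.max_depth:
            self.events.popleft()
            self.dropped += 1
        self.events.append(event)


class TokenBucket:
    """Allows up to `burst` sends at once, refilling at `rate_per_s`."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A delivery worker would call `allow()` before each POST to an endpoint and leave the event queued when it returns False, so a burst from one integration is smoothed instead of being forwarded at full speed.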
In production
Stripe, GitHub, and Twilio all run webhook delivery at large scale, but their retry behavior differs: Stripe's documentation says live-mode deliveries are retried for up to three days with exponential backoff, while GitHub's documentation says failed deliveries are not redelivered automatically. The real engineering challenge is backpressure isolation: without per-destination queues, one customer with a consistently failing endpoint causes head-of-line blocking that delays webhooks for all other customers sharing the same worker pool. The fix is to give each endpoint its own delivery queue with independent retry state. The second hard problem is exactly-once delivery: HTTP gives you at-most-once or at-least-once, so customers must be prepared for duplicates and use idempotency keys to deduplicate.
Common mistakes
- Shared queue (one slow endpoint blocks all)
- No retry/backoff
- Unsigned payloads