Distributed Rate Limiter
Cap each client to N requests/window across a fleet of servers.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Limit per API key/IP/user
- Multiple rules (per-second/minute)
- Return 429 + Retry-After
Non-functional
- Adds <5ms
- Accurate within tolerance
- Fail-open vs fail-closed decision
Scale
Every request; global counters
The approach
Token bucket or sliding-window-log/counter. Counts shared in Redis (atomic INCR/Lua) so all gateway nodes agree. Enforce at the edge/gateway, before expensive work.
Key components
Gateway/LB → limiter (Redis counters) → app
Numbers that matter
- A single Redis INCR + EXPIRE round-trip runs in ~0.2-0.5ms from co-located gateway nodes — fast enough to enforce on every request without measurable latency impact
- Sliding window log stores one timestamp per request; at 1000 req/s per user with a 60s window that's 60K entries per user — impractical at scale without capping the log size
- Token bucket replenishment math: at 100 req/min, the bucket refills at ~1.67 tokens/second; burst capacity is the bucket size (e.g. 20 tokens = 20-request burst allowance
- Distributed rate limiters using quorum writes across 3 Redis nodes (Redis Cluster) tolerate 1-node failure while staying consistent; overhead is ~2× a single-node write
Senior deep-dive
The algorithm choice matters less than where you enforce and how you store state — a token bucket in Redis with a Lua script beats a perfect algorithm in the wrong place.
Sliding window log is the most accurate but also the most memory-hungry (~1 entry per request); sliding window counter (two buckets, interpolated) gives 95% accuracy at a fraction of the cost.
Enforce at the edge/gateway before any compute — a rate limiter that runs after your auth layer, DB lookup, or business logic has already wasted the resources it was meant to protect.
Token bucket vs. sliding window: the actual tradeoff
Token bucket is great for bursty traffic — it naturally smooths spikes up to bucket size. But it's stateful in a way that's hard to reason about: a user can drain the bucket, wait exactly one refill interval, and burst again. Sliding window counter (two fixed windows + linear interpolation) is simpler to implement in Redis and gives predictably smooth limits without allowing boundary-edge double-bursting. Use token bucket when you explicitly want burst allowance; sliding window when you want strict rate enforcement.
The Lua script is not optional
Implementing rate limiting as GET → check → INCR in application code is a TOCTOU race — two concurrent requests both see count=N-1 and both proceed, blowing the limit. The atomic Lua script (INCR + EXPIRE in one round-trip, with a conditional abort) is the correct primitive. Redis executes Lua scripts single-threaded, so atomicity is guaranteed. Alternatives: Redis transactions (MULTI/EXEC) work but have worse latency characteristics than Lua for this pattern.
Per-user vs. per-IP vs. per-endpoint: the key design
Most systems need multiple rate limit dimensions simultaneously: per-user (logged in) AND per-IP (unauthenticated) AND per-endpoint (expensive endpoints get tighter limits). Your Redis key schema should encode all three: `rl:{user_id}:{endpoint}:{window}`. Global per-user limits prevent API abuse even across endpoints; per-endpoint limits protect expensive paths like search or export. Missing either dimension leaves you exposed.
Distributed systems: the race between nodes
When you have 50 gateway nodes sharing a Redis cluster, each node fires a Redis INCR independently — this is correct and consistent because Redis serializes commands. The problem is network partition: if the gateway can't reach Redis, do you fail open (allow all requests) or fail closed (block all)? Fail-open is standard for rate limiting (it's a best-effort protection, not a security boundary); fail-closed is appropriate for billing-tier enforcement. Document the policy explicitly.
Header feedback: the protocol that clients depend on
Well-behaved API clients use `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` headers to self-throttle before hitting 429. These headers cost almost nothing to compute (you have the counts from the Redis response) but dramatically reduce retry storms from clients that would otherwise hammer you after a 429. Retry-After header on 429 responses is equally important — without it, exponential backoff defaults vary wildly across client libraries.
What breaks at scale
Redis becomes the rate-limiter's single point of failure: if Redis goes down, your entire API is effectively unprotected (fail-open) or entirely blocked (fail-closed). Multi-region deployments need local node counters as fallback with async Redis sync — accept slightly loose limits during degraded mode. The second cliff: key cardinality explosion — a limit per (user × endpoint × window) at 10M users × 100 endpoints × sliding windows creates hundreds of millions of Redis keys; set aggressive TTLs and consider count-min sketch for approximate counting at extreme cardinality.
In production
Stripe enforces rate limits at the API gateway layer using a sliding window counter stored in Redis, publishing `X-RateLimit-Remaining` headers on every response. Cloudflare's rate limiting runs entirely at the edge in their PoP network, using local counters that sync periodically — meaning a brief burst can exceed the limit across PoPs before sync. The real engineering challenge is sticky-vs-distributed tradeoff: routing a user's requests to the same gateway node lets you use local memory, but any node failure or rebalance loses state; a shared Redis layer is consistent but adds a network hop that must stay under ~1ms.
Common mistakes
- Per-node counters (bypassable)
- Fixed-window edge bursts (use sliding window)
- Synchronous Redis call with no fallback