System Design Library

Distributed Rate Limiter

Cap each client to N requests/window across a fleet of servers.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Limit per API key/IP/user
Multiple rules (per-second/minute)
Return 429 + Retry-After

Non-functional

Adds <5ms
Accurate within tolerance
Fail-open vs fail-closed decision

Scale

Every request; global counters

The approach

Token bucket or sliding-window-log/counter. Counts shared in Redis (atomic INCR/Lua) so all gateway nodes agree. Enforce at the edge/gateway, before expensive work.

Key components

Gateway/LB → limiter (Redis counters) → app

Numbers that matter

A single Redis INCR + EXPIRE round-trip runs in ~0.2-0.5ms from co-located gateway nodes — fast enough to enforce on every request without measurable latency impact
Sliding window log stores one timestamp per request; at 1000 req/s per user with a 60s window that's 60K entries per user — impractical at scale without capping the log size
Token bucket replenishment math: at 100 req/min, the bucket refills at ~1.67 tokens/second; burst capacity is the bucket size (e.g. 20 tokens = 20-request burst allowance
Distributed rate limiters using quorum writes across 3 Redis nodes (Redis Cluster) tolerate 1-node failure while staying consistent; overhead is ~2× a single-node write

Senior deep-dive

The algorithm choice matters less than where you enforce and how you store state — a token bucket in Redis with a Lua script beats a perfect algorithm in the wrong place.

Sliding window log is the most accurate but also the most memory-hungry (~1 entry per request); sliding window counter (two buckets, interpolated) gives 95% accuracy at a fraction of the cost.

Enforce at the edge/gateway before any compute — a rate limiter that runs after your auth layer, DB lookup, or business logic has already wasted the resources it was meant to protect.

Token bucket vs. sliding window: the actual tradeoff

Token bucket is great for bursty traffic — it naturally smooths spikes up to bucket size. But it's stateful in a way that's hard to reason about: a user can drain the bucket, wait exactly one refill interval, and burst again. Sliding window counter (two fixed windows + linear interpolation) is simpler to implement in Redis and gives predictably smooth limits without allowing boundary-edge double-bursting. Use token bucket when you explicitly want burst allowance; sliding window when you want strict rate enforcement.

The Lua script is not optional

Implementing rate limiting as GET → check → INCR in application code is a TOCTOU race — two concurrent requests both see count=N-1 and both proceed, blowing the limit. The atomic Lua script (INCR + EXPIRE in one round-trip, with a conditional abort) is the correct primitive. Redis executes Lua scripts single-threaded, so atomicity is guaranteed. Alternatives: Redis transactions (MULTI/EXEC) work but have worse latency characteristics than Lua for this pattern.

Per-user vs. per-IP vs. per-endpoint: the key design

Most systems need multiple rate limit dimensions simultaneously: per-user (logged in) AND per-IP (unauthenticated) AND per-endpoint (expensive endpoints get tighter limits). Your Redis key schema should encode all three: `rl:{user_id}:{endpoint}:{window}`. Global per-user limits prevent API abuse even across endpoints; per-endpoint limits protect expensive paths like search or export. Missing either dimension leaves you exposed.

Distributed systems: the race between nodes

When you have 50 gateway nodes sharing a Redis cluster, each node fires a Redis INCR independently — this is correct and consistent because Redis serializes commands. The problem is network partition: if the gateway can't reach Redis, do you fail open (allow all requests) or fail closed (block all)? Fail-open is standard for rate limiting (it's a best-effort protection, not a security boundary); fail-closed is appropriate for billing-tier enforcement. Document the policy explicitly.

Header feedback: the protocol that clients depend on

Well-behaved API clients use `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` headers to self-throttle before hitting 429. These headers cost almost nothing to compute (you have the counts from the Redis response) but dramatically reduce retry storms from clients that would otherwise hammer you after a 429. Retry-After header on 429 responses is equally important — without it, exponential backoff defaults vary wildly across client libraries.

What breaks at scale

Redis becomes the rate-limiter's single point of failure: if Redis goes down, your entire API is effectively unprotected (fail-open) or entirely blocked (fail-closed). Multi-region deployments need local node counters as fallback with async Redis sync — accept slightly loose limits during degraded mode. The second cliff: key cardinality explosion — a limit per (user × endpoint × window) at 10M users × 100 endpoints × sliding windows creates hundreds of millions of Redis keys; set aggressive TTLs and consider count-min sketch for approximate counting at extreme cardinality.

In production

Stripe enforces rate limits at the API gateway layer using a sliding window counter stored in Redis, publishing `X-RateLimit-Remaining` headers on every response. Cloudflare's rate limiting runs entirely at the edge in their PoP network, using local counters that sync periodically — meaning a brief burst can exceed the limit across PoPs before sync. The real engineering challenge is sticky-vs-distributed tradeoff: routing a user's requests to the same gateway node lets you use local memory, but any node failure or rebalance loses state; a shared Redis layer is consistent but adds a network hop that must stay under ~1ms.

Common mistakes

Per-node counters (bypassable)
Fixed-window edge bursts (use sliding window)
Synchronous Redis call with no fallback

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →