System Design Library

Distributed Rate Limiter

Cap each client to N requests/window across a fleet of servers.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Limit per API key/IP/user
  • Multiple rules (per-second/minute)
  • Return 429 + Retry-After

Non-functional

  • Adds <5ms
  • Accurate within tolerance
  • Fail-open vs fail-closed decision

Scale

Every request; global counters

The approach

Token bucket or sliding-window-log/counter. Counts shared in Redis (atomic INCR/Lua) so all gateway nodes agree. Enforce at the edge/gateway, before expensive work.

Key components

Gateway/LB → limiter (Redis counters) → app

Numbers that matter

Senior deep-dive

The algorithm choice matters less than where you enforce and how you store state — a token bucket in Redis with a Lua script beats a perfect algorithm in the wrong place.

Sliding window log is the most accurate but also the most memory-hungry (~1 entry per request); sliding window counter (two buckets, interpolated) gives 95% accuracy at a fraction of the cost.

Enforce at the edge/gateway before any compute — a rate limiter that runs after your auth layer, DB lookup, or business logic has already wasted the resources it was meant to protect.

Token bucket vs. sliding window: the actual tradeoff

Token bucket is great for bursty traffic — it naturally smooths spikes up to bucket size. But it's stateful in a way that's hard to reason about: a user can drain the bucket, wait exactly one refill interval, and burst again. Sliding window counter (two fixed windows + linear interpolation) is simpler to implement in Redis and gives predictably smooth limits without allowing boundary-edge double-bursting. Use token bucket when you explicitly want burst allowance; sliding window when you want strict rate enforcement.

The Lua script is not optional

Implementing rate limiting as GET → check → INCR in application code is a TOCTOU race — two concurrent requests both see count=N-1 and both proceed, blowing the limit. The atomic Lua script (INCR + EXPIRE in one round-trip, with a conditional abort) is the correct primitive. Redis executes Lua scripts single-threaded, so atomicity is guaranteed. Alternatives: Redis transactions (MULTI/EXEC) work but have worse latency characteristics than Lua for this pattern.

Per-user vs. per-IP vs. per-endpoint: the key design

Most systems need multiple rate limit dimensions simultaneously: per-user (logged in) AND per-IP (unauthenticated) AND per-endpoint (expensive endpoints get tighter limits). Your Redis key schema should encode all three: `rl:{user_id}:{endpoint}:{window}`. Global per-user limits prevent API abuse even across endpoints; per-endpoint limits protect expensive paths like search or export. Missing either dimension leaves you exposed.

Distributed systems: the race between nodes

When you have 50 gateway nodes sharing a Redis cluster, each node fires a Redis INCR independently — this is correct and consistent because Redis serializes commands. The problem is network partition: if the gateway can't reach Redis, do you fail open (allow all requests) or fail closed (block all)? Fail-open is standard for rate limiting (it's a best-effort protection, not a security boundary); fail-closed is appropriate for billing-tier enforcement. Document the policy explicitly.

Header feedback: the protocol that clients depend on

Well-behaved API clients use `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` headers to self-throttle before hitting 429. These headers cost almost nothing to compute (you have the counts from the Redis response) but dramatically reduce retry storms from clients that would otherwise hammer you after a 429. Retry-After header on 429 responses is equally important — without it, exponential backoff defaults vary wildly across client libraries.

What breaks at scale

Redis becomes the rate-limiter's single point of failure: if Redis goes down, your entire API is effectively unprotected (fail-open) or entirely blocked (fail-closed). Multi-region deployments need local node counters as fallback with async Redis sync — accept slightly loose limits during degraded mode. The second cliff: key cardinality explosion — a limit per (user × endpoint × window) at 10M users × 100 endpoints × sliding windows creates hundreds of millions of Redis keys; set aggressive TTLs and consider count-min sketch for approximate counting at extreme cardinality.

In production

Stripe enforces rate limits at the API gateway layer using a sliding window counter stored in Redis, publishing `X-RateLimit-Remaining` headers on every response. Cloudflare's rate limiting runs entirely at the edge in their PoP network, using local counters that sync periodically — meaning a brief burst can exceed the limit across PoPs before sync. The real engineering challenge is sticky-vs-distributed tradeoff: routing a user's requests to the same gateway node lets you use local memory, but any node failure or rebalance loses state; a shared Redis layer is consistent but adds a network hop that must stay under ~1ms.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →