Autoscaling & capacity

The problem

Traffic is spiky. Provision for the peak and you pay for idle capacity most of the time; provision for the average and you fall over when the peak arrives.

The idea

Autoscaling adds capacity when load rises and removes it when load drops, within limits you set.

How it works

Track a leading signal: queue depth or request rate usually beats CPU, which lags. Scale out when the signal crosses a threshold and scale in when traffic goes quiet, with cooldowns to stop flapping. Reactive scaling always trails a spike because new capacity needs a warm-up (boot + cache fill + JIT), so keep headroom to cover that gap and add scheduled or predictive scaling for known peaks (a sale, a cron job, a launch).
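The threshold-plus-cooldown loop can be sketched in a few lines. This is a minimal illustration, not any particular autoscaler's algorithm: `ScalerConfig`, `desired_replicas`, and every default value (100 queued items per replica, a 50% to 120% dead band, a 300-second cooldown) are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalerConfig:
    # All values are hypothetical defaults for illustration.
    target_per_replica: float = 100.0  # queued items one replica should own
    scale_out_ratio: float = 1.2       # scale out above 120% of target
    scale_in_ratio: float = 0.5        # scale in below 50% of target
    min_replicas: int = 2              # lower limit you set
    max_replicas: int = 20             # upper limit you set
    cooldown_s: float = 300.0          # wait this long after any change


def desired_replicas(queue_depth: float, current: int, now: float,
                     last_change: float, cfg: ScalerConfig) -> int:
    """Return the replica count to run; returning `current` means no change."""
    if now - last_change < cfg.cooldown_s:
        return current  # still cooling down, ignore the signal
    per_replica = queue_depth / max(current, 1)
    low = cfg.scale_in_ratio * cfg.target_per_replica
    high = cfg.scale_out_ratio * cfg.target_per_replica
    if low <= per_replica <= high:
        return current  # inside the dead band between the two thresholds
    # Size the fleet so each replica lands back on target, then clamp.
    wanted = math.ceil(queue_depth / cfg.target_per_replica)
    return max(cfg.min_replicas, min(cfg.max_replicas, wanted))
```

With the defaults, `desired_replicas(1000, 4, now=1000, last_change=0, cfg=ScalerConfig())` asks for 10 replicas, while the same call at `now=100` returns 4 because the cooldown has not expired. The gap between the scale-in and scale-out thresholds is what keeps a signal hovering near the target from flapping. This sketch uses one cooldown for both directions; a shorter one for scale-out is a common variation.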

The tradeoff

Aggressive scale-out wastes money and can hammer cold dependencies; aggressive scale-in drops capacity right before the next spike and risks thrashing. Reactive scaling can't catch a sudden 10× burst — only headroom and graceful degradation (shed load, queue, serve cached) bridge the gap until capacity arrives.
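Graceful degradation during the gap can be pictured as a per-request admission decision. The sketch below is one possible policy, assuming hypothetical inputs (`in_flight`, `capacity`, `queued`, `queue_limit`, `cacheable`) that a real service would read from its own metrics; the ordering of the fallbacks is a design choice, not a rule.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve"
    SERVE_CACHED = "serve_cached"
    QUEUE = "queue"
    SHED = "shed"


def admit(in_flight: int, capacity: int, queued: int,
          queue_limit: int, cacheable: bool) -> Action:
    """Decide what to do with one request while new capacity warms up."""
    if in_flight < capacity:
        return Action.SERVE          # normal path, capacity is available
    if cacheable:
        return Action.SERVE_CACHED   # possibly stale answer, no backend work
    if queued < queue_limit:
        return Action.QUEUE          # absorb the burst in a bounded queue
    return Action.SHED               # reject fast rather than time out slowly
```

Bounding the queue matters: an unbounded queue turns a capacity shortfall into growing latency for every request, whereas shedding keeps the requests that are admitted fast.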

In the wild

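Black Friday traffic and product launches are known peaks, so scheduled scaling can add capacity ahead of time. A viral post is unplanned, so headroom and graceful degradation carry the load until reactive scaling catches up.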

Interview deep dive

Flow

Pick a leading signal (queue depth/RPS) over a lagging one (CPU).
Set scale-out/in thresholds with cooldowns to avoid flapping.
Keep headroom for the warm-up gap before new nodes are ready.
Add scheduled/predictive scaling for known peaks.
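The last two steps can be combined in a small planning function: size the fleet for current load plus headroom, then never go below a scheduled floor during a known peak. This is a sketch under stated assumptions; `ScheduledFloor`, `replicas_with_headroom`, `plan`, the hour-of-day windows, and the 30% headroom figure are all illustrative, not taken from any specific platform.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScheduledFloor:
    start_hour: int    # inclusive, 0-23
    end_hour: int      # exclusive
    min_replicas: int  # capacity to hold during this window


def replicas_with_headroom(rps: float, rps_per_replica: float,
                           headroom: float) -> int:
    """Replicas needed to serve `rps` with spare capacity for the warm-up gap."""
    return math.ceil(rps * (1.0 + headroom) / rps_per_replica)


def scheduled_floor(hour: int, floors: Sequence[ScheduledFloor]) -> int:
    """Largest floor whose window contains `hour`, or 0 if none applies."""
    return max((f.min_replicas for f in floors
                if f.start_hour <= hour < f.end_hour), default=0)


def plan(rps: float, rps_per_replica: float, headroom: float,
         hour: int, floors: Sequence[ScheduledFloor]) -> int:
    """Take whichever is larger: the reactive size or the scheduled floor."""
    return max(replicas_with_headroom(rps, rps_per_replica, headroom),
               scheduled_floor(hour, floors))
```

For example, at 900 RPS with 100 RPS per replica and 30% headroom, the reactive size is 12 replicas. With a floor of 20 replicas scheduled for hours 8 to 10, `plan` returns 20 at hour 9 and 12 at hour 12.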

Watch for

New capacity isn't instant — boot + warm-up lags the spike.
Scaling on CPU alone misses IO- or queue-bound saturation.
Scale-in that is too aggressive thrashes and drops capacity just before the next spike.
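One way to guard against over-eager scale-in is a stabilization window: act on the highest recommendation seen recently, so scale-out takes effect immediately while scale-in waits until load has stayed low for the whole window. The class below is a minimal sketch of that idea, similar in spirit to the scale-down stabilization window in the Kubernetes Horizontal Pod Autoscaler; `ScaleInStabilizer` and its 300-second window are assumptions for illustration.

```python
from collections import deque


class ScaleInStabilizer:
    """Delay scale-in by taking the max recommendation over a time window."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._history = deque()  # (timestamp, recommended_replicas) pairs

    def decide(self, now: float, recommended: int) -> int:
        self._history.append((now, recommended))
        # Drop recommendations that have aged out of the window.
        while self._history[0][0] < now - self.window_s:
            self._history.popleft()
        # A new high wins at once; a low wins only after older highs expire.
        return max(r for _, r in self._history)


stabilizer = ScaleInStabilizer(window_s=300)
decisions = [stabilizer.decide(t, r)
             for t, r in [(0, 10), (60, 4), (120, 12), (400, 4), (500, 4)]]
```

In this run `decisions` is `[10, 10, 12, 12, 4]`: the dip to 4 at t=60 is ignored, the rise to 12 is applied at once, and the fleet shrinks to 4 only at t=500, after the recommendation of 12 has left the window.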

Interviewer trap

Name the scaling signal and the warm-up gap that headroom has to cover; reactive scaling always lags.
