Autoscaling & capacity
Traffic is spiky. Provision for the peak and you waste money; for the average and you fall over.
The problem
Traffic is spiky. Provision for the peak and you pay for idle capacity most of the time; provision for the average and you fall over when the spike arrives.
The idea
Autoscaling adds capacity when load rises and removes it when load drops, within limits you set.
How it works
Track a leading signal: queue depth or request rate usually beats CPU, which lags. Scale out when the signal crosses a threshold and scale in when it stays quiet, with cooldowns between actions to stop flapping. Reactive scaling always trails a spike because new capacity needs a warm-up (boot + cache fill + JIT), so keep headroom and add scheduled or predictive scaling for known peaks (a sale, a cron, a launch).
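The threshold-plus-cooldown loop can be sketched roughly as below. This is a minimal illustration, not any particular autoscaler's algorithm: `ScalerConfig`, `desired_instances`, and every number in the config are hypothetical, and it assumes queue depth is the signal and that the caller invokes it once per evaluation tick.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalerConfig:
    # All values are illustrative assumptions, not recommendations.
    target_per_instance: float = 100.0  # queue depth one instance should carry
    scale_out_ratio: float = 1.2        # scale out above 120% of target
    scale_in_ratio: float = 0.5         # scale in below 50% of target
    min_instances: int = 2              # lower limit you set
    max_instances: int = 20             # upper limit you set
    cooldown_s: float = 300.0           # minimum gap between scaling actions


def desired_instances(queue_depth, current, last_action_ts, now, cfg):
    """Return (new_instance_count, new_last_action_ts)."""
    if now - last_action_ts < cfg.cooldown_s:
        return current, last_action_ts  # still cooling down: hold steady

    per_instance = queue_depth / current
    if per_instance > cfg.target_per_instance * cfg.scale_out_ratio:
        # Scale out straight to the size that brings load back to target.
        wanted = math.ceil(queue_depth / cfg.target_per_instance)
    elif per_instance < cfg.target_per_instance * cfg.scale_in_ratio:
        wanted = current - 1  # scale in one cautious step at a time
    else:
        return current, last_action_ts  # inside the dead band: no change

    wanted = max(cfg.min_instances, min(cfg.max_instances, wanted))
    if wanted == current:
        return current, last_action_ts  # clamped to a limit: nothing to do
    return wanted, now
```

Scaling out in one jump but scaling in one step at a time is a deliberate asymmetry in this sketch: being slow to add capacity costs availability, while being slow to remove it only costs money.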
The tradeoff
Aggressive scale-out wastes money and can hammer cold dependencies; aggressive scale-in drops capacity right before the next spike and risks thrashing. Reactive scaling can't catch a sudden 10× burst — only headroom and graceful degradation (shed load, queue, serve cached) bridge the gap until capacity arrives.
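A little arithmetic shows why headroom alone rarely covers a 10× burst. The sketch below is a back-of-the-envelope helper under simple assumptions: `burst_plan` is a hypothetical name, capacity is treated as a flat requests-per-second ceiling, and no new instances arrive during the warm-up window.

```python
def burst_plan(steady_rps, capacity_rps, burst_multiple):
    """Split a burst into what headroom absorbs and what must be degraded."""
    burst_rps = steady_rps * burst_multiple
    served = min(burst_rps, capacity_rps)  # existing fleet serves up to capacity
    overflow = burst_rps - served          # must be shed, queued, or served cached
    return {
        "headroom_multiple": capacity_rps / steady_rps,
        "served_rps": served,
        "overflow_rps": overflow,
        "overflow_fraction": overflow / burst_rps,
    }


# A fleet running at 60% utilization (600 of 1000 rps) facing a 10x burst.
plan = burst_plan(steady_rps=600, capacity_rps=1000, burst_multiple=10)
print(plan)
```

With these example numbers the fleet can absorb only about a 1.67× burst on headroom, so roughly five sixths of the 10× burst has to be handled by graceful degradation until new capacity is warm.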
In the wild
Black Friday traffic, a viral post, a product launch: known peaks get scheduled scaling ahead of time, and surprises are absorbed by autoscaling plus headroom.
Interview deep dive
Flow
- Pick a leading signal (queue depth/RPS) over a lagging one (CPU).
- Set scale-out/in thresholds with cooldowns to avoid flapping.
- Keep headroom for the warm-up gap before new nodes are ready.
- Add scheduled/predictive scaling for known peaks.
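One way to combine the last step with the reactive ones is to treat the schedule as a floor that the reactive signal can only raise. The sketch below assumes that design; `SCHEDULE`, `BASE_MIN`, `scheduled_floor`, and `target_instances` are hypothetical names, and the time windows and instance counts are made up for illustration.

```python
from datetime import datetime, time

# Hypothetical schedule for known peaks: (start, end, minimum instances).
# Windows are same-day only; this sketch does not handle spans across midnight.
SCHEDULE = [
    (time(8, 30), time(10, 0), 12),  # e.g. a morning rush
    (time(19, 0), time(22, 0), 16),  # e.g. an evening sale window
]
BASE_MIN = 2  # floor outside any scheduled window


def scheduled_floor(now: datetime) -> int:
    """Minimum instance count the schedule demands at this moment."""
    t = now.time()
    floors = [n for start, end, n in SCHEDULE if start <= t < end]
    return max(floors, default=BASE_MIN)


def target_instances(reactive_target: int, now: datetime) -> int:
    # The schedule pre-warms capacity; the reactive signal can only raise it.
    return max(reactive_target, scheduled_floor(now))
```

In practice the window would start earlier than the expected peak by at least the warm-up time, so that the extra nodes are ready when traffic arrives.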
Watch for
- New capacity isn't instant — boot + warm-up lags the spike.
- Scaling on CPU alone misses IO- or queue-bound saturation.
- Too-tight scale-in thrashes and drops capacity before spikes.
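To guard against the CPU-only blind spot, one option is to scale on whichever resource is most saturated. The sketch below assumes each signal has already been normalized to a 0.0 to 1.0 utilization; `bottleneck`, `should_scale_out`, the signal names, and the 0.8 threshold are all hypothetical.

```python
def bottleneck(utilization):
    """Return (name, value) of the most saturated signal."""
    name = max(utilization, key=utilization.get)
    return name, utilization[name]


def should_scale_out(utilization, threshold=0.8):
    # Scale on the worst signal, so an idle CPU cannot mask a full disk or queue.
    _, worst = bottleneck(utilization)
    return worst > threshold


# CPU looks healthy, but the service is IO-bound.
signals = {"cpu": 0.35, "disk_io": 0.92, "queue": 0.40}
print(bottleneck(signals), should_scale_out(signals))
```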
Interviewer trap
Name the scaling signal, and name the warm-up gap that headroom has to cover: reactive scaling always lags.