System Design Library

Health Monitor / Heartbeat

Detect when servers/services go down, fast and without false positives.

Requirements

Functional

  • Heartbeats from nodes
  • Detect failures
  • Alert/failover trigger

Non-functional

  • Fast detection
  • Few false positives

Scale

Thousands of nodes

The approach

Nodes send periodic heartbeats (push) or are polled (pull); a detector flags a node after N missed beats; phi-accrual detectors adapt the threshold to network jitter; gossip spreads liveness at large scale.

Key components

Nodes → heartbeats → failure detector → alerting/failover

Senior deep-dive

The fundamental tension is false positives vs. detection latency — a tight threshold catches failures fast but triggers spurious alerts on network blips; adaptive detectors (phi-accrual) tune themselves to each node's observed variance.

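Push-based heartbeats detect failure by absence and need no probing logic in the monitor; pull-based checks (polling) are simpler to roll out because nodes need not know about the monitor, but a single poller stops scaling at roughly 10,000 nodes unless polling is tiered.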

Health != liveness: a node can be alive and reachable but unable to serve traffic (OOM, deadlock, saturated thread pool) — combine heartbeat liveness with synthetic probes that test actual request handling.

Push vs. pull: the architectural choice

In push mode (heartbeat), each node sends a beat on a fixed interval — the monitor detects failure by absence. This doesn't scale indefinitely (the monitor becomes a funnel), but is simpler for homogeneous fleets. In pull mode (polling), the monitor actively probes each node — the node doesn't need to know about the monitor, making it easy to add monitoring without touching services. Tiered polling (regional sub-monitors aggregate to a central monitor) scales pull-based checking to tens of thousands of nodes.
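As a minimal sketch of the push model, the monitor below records each beat and flags a node once it has been silent for a fixed number of intervals. The class name, the `max_missed=3` default, and the injectable `clock` parameter are illustrative assumptions, not part of any particular product.

```python
import time


class HeartbeatMonitor:
    """Flags a node once it has been silent for `max_missed` heartbeat intervals."""

    def __init__(self, interval_s: float, max_missed: int = 3, clock=time.monotonic):
        self.interval_s = interval_s
        self.max_missed = max_missed
        self._clock = clock      # monotonic by default, so wall-clock jumps cannot fake a timeout
        self._last_seen = {}     # node_id -> clock reading at the last beat

    def record_beat(self, node_id: str) -> None:
        """Called whenever a heartbeat arrives from `node_id`."""
        self._last_seen[node_id] = self._clock()

    def missed_beats(self, node_id: str) -> int:
        """Number of whole intervals that have elapsed since the last beat."""
        elapsed = self._clock() - self._last_seen[node_id]
        return int(elapsed // self.interval_s)

    def failed_nodes(self) -> list[str]:
        """Nodes whose silence has reached the missed-beat threshold."""
        return sorted(
            n for n in self._last_seen if self.missed_beats(n) >= self.max_missed
        )
```

Failure is inferred purely from the absence of `record_beat` calls, which is why every node's beats must reach this one object and why it becomes the funnel described above.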

Phi-accrual detector: adaptive thresholds

Classical heartbeat detection uses a fixed timeout (N missed beats means failure). A phi-accrual detector instead computes a suspicion value φ from the historical distribution of heartbeat arrival intervals: φ rises continuously while a heartbeat is late and crosses the threshold only when the lateness is statistically improbable given past behavior. This adapts to each node's network variance and greatly reduces the false positives from GC pauses or network jitter that would trip a fixed-timeout detector, although a pause far outside the observed history will still raise suspicion. Cassandra and Akka Cluster use this.
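The suspicion value can be sketched as follows, assuming inter-arrival times are modeled with a normal distribution and φ is defined as -log10 of the probability that the next beat arrives even later than the current delay. The window size, the `min_std_s` floor, and the class name are assumptions for illustration; production implementations differ in the distribution and the tuning they use.

```python
import math
import statistics
from collections import deque


class PhiAccrualDetector:
    """Suspicion level from observed heartbeat inter-arrival times (normal model)."""

    def __init__(self, window: int = 100, min_std_s: float = 0.05):
        self._intervals = deque(maxlen=window)  # sliding window of recent intervals
        self._last_arrival = None
        self._min_std_s = min_std_s             # floor so a perfectly regular history still works

    def heartbeat(self, now_s: float) -> None:
        """Record a heartbeat that arrived at time `now_s` (seconds, monotonic)."""
        if self._last_arrival is not None:
            self._intervals.append(now_s - self._last_arrival)
        self._last_arrival = now_s

    def phi(self, now_s: float) -> float:
        """Return phi = -log10(P(interval > time elapsed since the last beat))."""
        if len(self._intervals) < 2:
            return 0.0  # not enough history to judge
        mean = statistics.fmean(self._intervals)
        std = max(statistics.pstdev(self._intervals), self._min_std_s)
        elapsed = now_s - self._last_arrival
        # Upper tail of the normal distribution, computed with erfc for accuracy.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        p_later = max(p_later, 1e-300)  # avoid log10(0) for extreme delays
        return -math.log10(p_later)
```

A caller would compare `phi(now)` against a threshold of its choosing; a higher threshold trades slower detection for fewer false positives.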

Gossip-based membership at scale

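SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) avoids a central monitor: each node pings a random peer every T seconds; if no ack arrives, it asks k other nodes to probe the target indirectly. A node is suspected only after both the direct and the indirect probes fail. Membership updates piggyback on the ping and ack messages and spread epidemically, reaching the whole cluster in O(log N) rounds. Each node therefore carries a constant, O(1), message load per protocol period, versus O(N) at the monitor for a centralized poller. Consul and Serf build their cluster membership on a SWIM-based library; Cassandra also uses gossip for membership, but with its own protocol combined with phi-accrual detection rather than SWIM.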
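One probe round of this scheme can be sketched as below. The network is abstracted into a hypothetical `can_reach(src, dst)` callback so the logic can be exercised without sockets; the function name, the `k=3` default, and the string results are assumptions for illustration.

```python
import random
from typing import Callable, Optional, Sequence


def probe(
    target: str,
    peers: Sequence[str],
    can_reach: Callable[[str, str], bool],
    self_id: str,
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Return 'alive' or 'suspect' for `target` after one SWIM-style probe round."""
    rng = rng or random.Random()
    # Step 1: direct ping from this node to the target.
    if can_reach(self_id, target):
        return "alive"
    # Step 2: ask up to k other peers to ping the target on our behalf.
    helpers = [p for p in peers if p not in (self_id, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        if can_reach(self_id, helper) and can_reach(helper, target):
            return "alive"
    # Both the direct and the indirect probes failed.
    return "suspect"
```

The indirect step is what keeps a single bad link between two nodes from being reported as a node failure.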

Liveness vs. readiness vs. health: the three checks

Liveness (is the process alive?) can be a TCP connection check. Readiness (can it serve traffic?) requires a semantic check — is the connection pool initialized, are dependencies reachable? Health (is it performing well?) requires metrics — is p99 latency under SLA? Kubernetes separates liveness and readiness probes for exactly this reason: a deadlocked container passes TCP liveness but fails readiness. Use all three or you'll restart containers that are merely slow rather than broken.
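The three checks map naturally onto three different reactions. The sketch below is one hypothetical way to encode that; the field names, the action strings, and the 500 ms SLA default are assumptions, not Kubernetes behavior.

```python
from dataclasses import dataclass


@dataclass
class NodeSignals:
    tcp_connect_ok: bool      # liveness: the process accepts connections
    dependencies_ok: bool     # readiness: pool initialized, dependencies reachable
    p99_latency_ms: float     # health: measured request latency


def decide_action(s: NodeSignals, p99_sla_ms: float = 500.0) -> str:
    """Pick the mildest reaction that fits the most basic check that fails."""
    if not s.tcp_connect_ok:
        return "restart"                 # not alive: restarting is justified
    if not s.dependencies_ok:
        return "remove_from_rotation"    # alive but cannot serve: stop routing to it
    if s.p99_latency_ms > p99_sla_ms:
        return "alert"                   # serving but degraded: tell a human, do not restart
    return "ok"
```

Ordering the checks from cheapest to most semantic means a slow node only triggers an alert, which is the distinction the paragraph above warns about.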

Self-preservation: distinguishing failure from partition

When 50% of your fleet stops heartbeating simultaneously, the right interpretation is usually network partition, not mass failure: evicting all those nodes would cause a larger outage than the partition itself. Netflix Eureka's self-preservation mode stops evictions when the rate of heartbeat renewals it receives falls below a threshold fraction of the expected rate. This is the right call but requires careful configuration: make the trigger too sensitive and you preserve genuinely dead nodes, creating ghost entries that cause routing failures.
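A self-preservation gate can be sketched as a single comparison, assuming the trigger is measured as the fraction of expected heartbeat renewals actually received in the last minute. The function name is hypothetical, and the 0.85 default is an assumption chosen to mirror a "more than 15% silent" rule.

```python
def should_evict(
    expected_renewals_per_min: int,
    actual_renewals_per_min: int,
    renewal_threshold: float = 0.85,
) -> bool:
    """Return True if evictions may proceed, False if self-preservation should hold them."""
    if expected_renewals_per_min == 0:
        return True  # nothing registered, so there is nothing to protect
    # Too few renewals overall suggests a partition rather than individual failures.
    return actual_renewals_per_min >= renewal_threshold * expected_renewals_per_min
```

For example, 100 instances beating every 30s should produce 200 renewals per minute; if half the fleet goes silent the monitor sees about 100, which is below 170, so evictions are held.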

What breaks at scale

Three failure modes show up as the fleet grows:

  • Monitor fan-in at 100,000 nodes: with one heartbeat per node per second, a centralized receiver must process 100,000 messages/sec and becomes a bottleneck. Shard the monitoring tier by node range or use gossip.
  • GC pauses cause spurious failures: a JVM with a 5-second stop-the-world GC pause will miss heartbeats and be falsely evicted. Either use a low-pause collector (ZGC, or G1 tuned toward sub-100ms pauses) or set heartbeat timeouts larger than your worst-case GC pause.
  • Clock skew in scheduled health checks: if your monitor's clock is ahead by 60s and nodes put absolute timestamps in heartbeats, every beat looks 60s old and every node looks dead. Always compute timeouts from relative elapsed time on a single clock, not from wall-clock comparisons across machines.
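Sharding the monitoring tier can be sketched with a stable hash of the node ID, used here in place of literal node ranges. The function names and the `node-<i>` ID format are assumptions for illustration.

```python
import hashlib
from collections import Counter


def monitor_shard(node_id: str, num_shards: int) -> int:
    """Map a node to one of `num_shards` monitor instances via a stable hash."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def beats_per_shard(node_ids, num_shards: int, interval_s: float) -> dict:
    """Approximate heartbeats per second each shard receives at the given interval."""
    nodes = Counter(monitor_shard(n, num_shards) for n in node_ids)
    return {shard: count / interval_s for shard, count in nodes.items()}
```

With 100,000 nodes, a 1-second interval, and 16 shards, each monitor handles roughly 6,250 beats/sec instead of 100,000. Note that plain modulo hashing reassigns most nodes when `num_shards` changes; consistent hashing is a common alternative if shards are added often.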

In production

Netflix Eureka uses a push-based heartbeat for service discovery (by default a renewal every 30s, with eviction after roughly three missed beats), plus a self-preservation mode that suppresses evictions when more than 15% of instances go silent simultaneously, correctly treating mass heartbeat loss as a network partition rather than individual failures. Kubernetes liveness probes poll each container via HTTP, TCP, or exec at configurable intervals and restart containers that fail; this is pull-based, driven by the kubelet. Consul combines gossip (SWIM) for scalable membership with HTTP health checks for application-level liveness. The real operational failure mode is the health check itself becoming a single point of failure (SPOF): a monitor that checks 10,000 nodes sequentially can fall behind and report stale health information.
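The staleness risk of a sequential checker is easy to quantify. The helper below is a back-of-the-envelope sketch; the 100 ms per-check cost and the worker count of 200 are assumed figures, not measurements.

```python
import math


def sweep_seconds(num_nodes: int, check_ms: int, concurrency: int = 1) -> float:
    """Worst-case time for one full polling sweep with `concurrency` parallel workers."""
    rounds = math.ceil(num_nodes / concurrency)  # batches needed to cover every node
    return rounds * check_ms / 1000


sequential = sweep_seconds(10_000, check_ms=100)                  # 1000.0 seconds
parallel = sweep_seconds(10_000, check_ms=100, concurrency=200)   # 5.0 seconds
```

Under these assumptions a sequential sweep takes over 16 minutes, so a node's recorded status can be that old; polling with 200 concurrent workers brings the sweep down to about 5 seconds.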
