Health Monitor / Heartbeat
Detect when servers/services go down, fast and without false positives.
Requirements
Functional
- Heartbeats from nodes
- Detect failures
- Alert/failover trigger
Non-functional
- Fast detection
- Few false positives
Scale
Thousands of nodes
The approach
Nodes send periodic heartbeats (push) or are polled (pull); a detector flags a node after N missed beats; phi-accrual detectors adapt the threshold to network jitter; gossip spreads liveness at large scale.
Key components
Nodes → heartbeats → failure detector → alerting/failover
Numbers that matter
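- Cassandra's phi-accrual failure detector uses a default convict threshold of phi=8 (the phi_convict_threshold setting). Because φ is -log10 of the probability that a heartbeat this late is still on its way, φ=8 corresponds to roughly a 1-in-10^8 chance of a false conviction under the detector's statistical model; real networks behave worse than the model.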
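- ZooKeeper session timeouts are negotiated per client within server-side bounds (by default 2x to 20x tickTime, i.e. 4s to 40s with a 2s tickTime); 30s is a common choice. A client that fails to heartbeat within its session timeout loses its ephemeral nodes and is considered failed by watchers.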
- AWS Route 53 health checks support a fast request interval of 10 seconds (30 seconds is standard); with a failure threshold of 3, that gives a detection window of roughly 30s before failover.
- A 1,000-node cluster with 1-second heartbeats generates 1,000 messages/sec at the monitor; gossip spreads the same information without a central receiver, reaching every node in O(log N) rounds.
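The O(log N) figure for gossip can be checked with a toy simulation. This is a minimal sketch, assuming a simple push-only gossip where every informed node tells `fanout` random peers per round; the function name and parameters are illustrative and not taken from any real gossip library.

```python
import random


def gossip_rounds(n, fanout=1, seed=0):
    """Count rounds until a rumor started at node 0 reaches all n nodes."""
    rng = random.Random(seed)      # seeded so repeated runs agree
    informed = {0}
    rounds = 0
    while len(informed) < n:
        newly = set()
        for _ in informed:
            # ASSUMPTION: targets are chosen uniformly at random, and a node
            # may waste a message on itself or an already-informed peer.
            for _ in range(fanout):
                newly.add(rng.randrange(n))
        informed |= newly
        rounds += 1
    return rounds


if __name__ == "__main__":
    for size in (100, 1_000, 10_000):
        print(size, gossip_rounds(size))
```

With a fanout of 1 the informed set can at most double each round, so a 1,024-node cluster needs at least 10 rounds; the simulation typically lands within a small multiple of that, growing with log N rather than with N.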
Senior deep-dive
The fundamental tension is false positives vs. detection latency — a tight threshold catches failures fast but triggers spurious alerts on network blips; adaptive detectors (phi-accrual) tune themselves to each node's observed variance.
Push-based heartbeats put the burden of proof on the node: silence is the failure signal, so a dead node cannot look healthy. Pull-based checks (polling) are simpler to adopt, but a single poller does not scale past ~10,000 nodes without a tiered polling architecture.
Health != liveness: a node can be alive and reachable but unable to serve traffic (OOM, deadlock, saturated thread pool) — combine heartbeat liveness with synthetic probes that test actual request handling.
Push vs. pull: the architectural choice
In push mode (heartbeat), each node sends a beat on a fixed interval — the monitor detects failure by absence. This doesn't scale indefinitely (the monitor becomes a funnel), but is simpler for homogeneous fleets. In pull mode (polling), the monitor actively probes each node — the node doesn't need to know about the monitor, making it easy to add monitoring without touching services. Tiered polling (regional sub-monitors aggregate to a central monitor) scales pull-based checking to tens of thousands of nodes.
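Push-mode detection by absence can be sketched in a few lines. This is a minimal illustration, assuming an in-memory monitor with a fixed "N missed beats" rule; the class name, the `clock` parameter, and the defaults are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Push-mode detector: nodes call beat(); prolonged silence means failure."""

    def __init__(self, interval_s=1.0, missed_beats=3, clock=time.monotonic):
        # A node is flagged once it has been silent for missed_beats intervals.
        self.timeout_s = interval_s * missed_beats
        self.clock = clock           # injectable so tests can control time
        self.last_seen = {}          # node id -> monitor-side time of last beat

    def beat(self, node_id):
        """Record a heartbeat using the monitor's own clock."""
        self.last_seen[node_id] = self.clock()

    def failed_nodes(self):
        """Return ids of nodes whose last beat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)
```

A pull-mode poller would invert the flow (the monitor calls each node and records the result) but could reuse the same timeout bookkeeping.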
Phi-accrual detector: adaptive thresholds
Classical heartbeat detection uses a fixed timeout (miss N beats = failure). Phi-accrual instead computes a suspicion value φ from the historical distribution of heartbeat inter-arrival times: φ is -log10 of the probability that a heartbeat would arrive this late given past behavior. φ rises continuously while a heartbeat is overdue and crosses the threshold only when the lateness is statistically improbable. This adapts to each node's network variance and greatly reduces (though it cannot eliminate) false positives from GC pauses or network jitter that would trip a fixed-timeout detector. Cassandra and Akka Cluster use this.
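The suspicion value can be made concrete with a small sketch. It assumes inter-arrival times follow an exponential distribution, which keeps the math to one line; this is a simplification chosen for illustration, not the exact model of any particular system, and the class and method names are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy phi-accrual detector over a sliding window of arrival intervals."""

    def __init__(self, window=1000, threshold=8.0):
        self.intervals = deque(maxlen=window)   # recent inter-arrival times
        self.last_arrival = None
        self.threshold = threshold

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 P(a heartbeat arrives later than now)."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_arrival
        # ASSUMPTION: exponential model, P(later) = exp(-elapsed / mean),
        # so -log10(P) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))

    def is_suspect(self, now):
        return self.phi(now) > self.threshold
```

Under this model a threshold of 8 trips after about 18.4 mean intervals of silence (8 × ln 10), so a node that normally beats every second is convicted after roughly 18 seconds, while a node with a noisier history gets proportionally more slack.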
Gossip-based membership at scale
SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) avoids a central monitor: every protocol period of T seconds, each node pings one random peer; if no ack arrives, it asks k other nodes to probe the target indirectly. The target is suspected only after both the direct and the indirect probes fail. Membership updates piggyback on the ping/ack messages and spread epidemically, reaching the cluster in O(log N) rounds. This gives O(1) message overhead per node per protocol period, vs. O(N) load on a centralized poller. Consul and Serf use SWIM-based membership; Cassandra also relies on gossip for cluster membership, but with its own protocol paired with phi-accrual rather than SWIM.
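One probe round of the direct-then-indirect scheme might look like the following. This is a simplified sketch: the network is abstracted behind a hypothetical `can_reach(src, dst)` callback, and the suspicion timeout and update dissemination of the full protocol are left out.

```python
import random


def swim_probe(prober, target, members, can_reach, k=3, rng=random):
    """Return 'alive' or 'suspect' for `target` after one probe round."""
    # Step 1: direct ping from the prober to the target.
    if can_reach(prober, target):
        return "alive"
    # Step 2: ask up to k other members to ping the target on our behalf.
    helpers = [m for m in members if m not in (prober, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        # The indirect probe succeeds only if both hops work.
        if can_reach(prober, helper) and can_reach(helper, target):
            return "alive"
    # Direct and indirect probes all failed: suspect, but do not declare dead yet.
    return "suspect"


if __name__ == "__main__":
    members = ["a", "b", "c", "d"]
    # Only the a -> d link is broken; d itself is healthy.
    flaky_link = lambda src, dst: (src, dst) != ("a", "d")
    print(swim_probe("a", "d", members, flaky_link, k=2))
```

The indirect step is what separates "my link to the target is bad" from "the target is down", which is where most of the false-positive reduction comes from.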
Liveness vs. readiness vs. health: the three checks
Liveness (is the process alive?) can be a TCP connection check. Readiness (can it serve traffic?) requires a semantic check — is the connection pool initialized, are dependencies reachable? Health (is it performing well?) requires metrics — is p99 latency under SLA? Kubernetes separates liveness and readiness probes for exactly this reason: a deadlocked container passes TCP liveness but fails readiness. Use all three or you'll restart containers that are merely slow rather than broken.
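The three checks map to three different reactions, which a short sketch makes explicit. The field names, the 200 ms SLA, and the action labels below are illustrative assumptions, not Kubernetes API terms.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    process_up: bool        # liveness: the process accepts a connection
    pool_ready: bool        # readiness: connection pool is initialized
    deps_reachable: bool    # readiness: downstream dependencies answer
    p99_latency_ms: float   # health: observed tail latency


def evaluate(status, p99_sla_ms=200.0):
    """Classify a node and pick the least drastic action that fits."""
    live = status.process_up
    ready = live and status.pool_ready and status.deps_reachable
    healthy = ready and status.p99_latency_ms <= p99_sla_ms
    if not live:
        action = "restart"               # only a failed liveness check restarts
    elif not ready:
        action = "remove_from_rotation"  # stop routing traffic, keep the process
    elif not healthy:
        action = "alert"                 # slow but working: tell a human
    else:
        action = "none"
    return {"live": live, "ready": ready, "healthy": healthy, "action": action}
```

A node that is merely slow ends up with the "alert" action here rather than "restart", which is the distinction the paragraph above is drawing.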
Self-preservation: distinguishing failure from partition
When 50% of your fleet stops heartbeating simultaneously, the right interpretation is usually network partition, not mass failure; evicting all those nodes would cause a larger outage than the partition itself. Netflix Eureka's self-preservation mode stops evictions when the rate of received heartbeat renewals drops below a threshold fraction of the expected rate. This is the right call but requires careful configuration: set the threshold too high and self-preservation triggers on ordinary churn, preserving genuinely dead nodes as ghost entries that cause routing failures.
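The core of a self-preservation guard is a single ratio check. The sketch below is loosely modeled on the behavior described above and is not Eureka's actual code; the function name and the 0.85 default are assumptions for illustration.

```python
def should_evict(expected_renewals, received_renewals, renewal_threshold=0.85):
    """Return True if expired leases may be evicted in this cycle.

    expected_renewals: heartbeats the registry should have received in the window.
    received_renewals: heartbeats it actually received.
    """
    if expected_renewals == 0:
        return True  # empty registry: nothing to protect
    ratio = received_renewals / expected_renewals
    # Too many missing renewals at once looks like a partition, not
    # individual failures, so the registry freezes evictions.
    return ratio >= renewal_threshold


if __name__ == "__main__":
    print(should_evict(1000, 980))  # a few silent nodes: evict them normally
    print(should_evict(1000, 500))  # half the fleet silent: preserve everything
```

Raising `renewal_threshold` toward 1.0 makes the guard trip on small dips, which is the ghost-entry risk noted above; lowering it lets mass evictions through during a real partition.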
What breaks at scale
Monitor fan-in at 100,000 nodes: a centralized heartbeat receiver processing 100,000 messages/sec becomes a bottleneck. Shard the monitoring tier by node range or use gossip.
GC pauses cause spurious failures: a JVM with a 5-second stop-the-world GC pause will miss heartbeats and be falsely evicted. Either use a low-pause collector (ZGC, or G1 tuned to a pause target) or set heartbeat timeouts larger than your worst-case GC pause.
Clock skew in scheduled health checks: if your monitor's clock is ahead by 60s and nodes put absolute timestamps in heartbeats, every beat looks 60s stale and every node looks dead. Always compute timeouts from elapsed time on the monitor's own monotonic clock, not from wall-clock comparisons across machines.
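The clock-skew failure is easy to reproduce on paper. The sketch below contrasts the two timeout calculations, assuming a monitor whose wall clock runs 60s ahead of the node's; both function names are hypothetical.

```python
def dead_by_sender_timestamp(sent_wall_s, monitor_wall_s, timeout_s):
    """Fragile: subtracts the node's wall clock from the monitor's wall clock."""
    return monitor_wall_s - sent_wall_s > timeout_s


def dead_by_receipt_time(received_mono_s, monitor_mono_s, timeout_s):
    """Robust: both readings come from the monitor's own monotonic clock."""
    return monitor_mono_s - received_mono_s > timeout_s


if __name__ == "__main__":
    timeout_s = 30.0
    node_wall = 1_000.0            # timestamp the node writes into its beat
    monitor_wall = node_wall + 60  # ASSUMPTION: monitor clock is 60s ahead
    # The beat arrives instantly, yet the skew alone exceeds the timeout.
    print(dead_by_sender_timestamp(node_wall, monitor_wall, timeout_s))
    # Measured on the monitor's own clock, zero time has elapsed.
    print(dead_by_receipt_time(500.0, 500.0, timeout_s))
```

In Python the monitor-side reading would typically come from `time.monotonic()`, which is unaffected by wall-clock adjustments.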
In production
Netflix Eureka uses a push-based heartbeat (every 30s, 3-missed-beats = eviction) for service discovery, with a self-preservation mode that suppresses evictions when more than 15% of instances go silent simultaneously — correctly interpreting mass heartbeat loss as a network partition rather than individual failures. Kubernetes liveness probes poll each container via HTTP/TCP/exec at configurable intervals and restart containers that fail — this is pull-based from the kubelet. Consul combines gossip (SWIM) for scalable membership with HTTP health checks for application-level liveness. The real operational failure mode is the health check itself becoming a SPOF: a monitor that checks 10,000 nodes sequentially can fall behind and give stale health information.
Common mistakes
- Fixed aggressive timeout (false positives)
- Central monitor as a bottleneck/SPOF
- No hysteresis (flapping)
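Hysteresis, the fix for the flapping mistake above, can be sketched as a small state machine that requires several consecutive contrary observations before changing state. The class name and thresholds are illustrative assumptions.

```python
class FlapDamper:
    """Change state only after a run of consecutive contrary check results."""

    def __init__(self, fail_after=3, recover_after=2):
        self.fail_after = fail_after        # failures needed to go up -> down
        self.recover_after = recover_after  # successes needed to go down -> up
        self.state = "up"
        self.streak = 0  # consecutive results that disagree with self.state

    def observe(self, check_ok):
        """Feed one check result; return the (possibly updated) state."""
        disagrees = (self.state == "up") != check_ok
        self.streak = self.streak + 1 if disagrees else 0
        needed = self.fail_after if self.state == "up" else self.recover_after
        if self.streak >= needed:
            self.state = "down" if self.state == "up" else "up"
            self.streak = 0
        return self.state
```

A single failed check between two good ones leaves the node "up", so one network blip neither pages anyone nor triggers failover.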