System Design Library

Presence Service

Track who is online/away/typing across millions of connections.

Requirements

Functional

  • Online/away/offline
  • Typing indicators
  • Subscribe to a friend's presence
  • Last-seen

Non-functional

  • Realtime
  • Scales to millions
  • Ephemeral

Scale

Millions of concurrent users

The approach

WebSocket (WS) gateways report connection state to an in-memory presence store whose entries are kept alive by TTL-based heartbeats; subscribers get updates via pub/sub. Presence is ephemeral: it is rebuilt from live connections, not durably stored.

Key components

WS gateways → presence store (TTL) → pub/sub to subscribers

Senior deep-dive

Presence is an ephemeral inference, not a stored fact — it must be rebuilt from live connections, not read from a database row.

TTL-based heartbeats are the core primitive: every connected client pings every 5–30s; the presence store expires entries if no ping arrives, and the failure mode is 'gone', never 'stuck online forever'.

Subscribers must receive deltas, not full roster snapshots: at 10M online users, rebroadcasting the full list on every change is catastrophically expensive, so the design relies on pub/sub with targeted per-user events.

Presence is ephemeral: rebuild from connections, never persist

The fundamental architecture decision is that presence must never be durably stored as a static field on a user record. A user can disconnect ungracefully (phone dies, network drops) and never send a 'logout' event, so a stored 'online' flag would stay stale indefinitely. TTL-based expiry is the reliable mechanism: the presence store is a key-value map of `user_id → {status, last_seen}` with each key expiring after 2–3 missed heartbeat intervals. On expiry, the system infers 'offline' and publishes a 'gone' event. Gossip-style liveness detection (as in the SWIM protocol) is an alternative for service meshes but overkill for user presence.
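
The expiry rule above can be sketched with an in-memory stand-in for the presence store. This is a minimal illustration, not a production design: the `PresenceStore` class, the 20-second heartbeat interval, and the injectable clock are all assumptions made for the example, and a real deployment would typically delegate expiry to the store itself.

```python
import time
from typing import Callable, Dict, List, Tuple

HEARTBEAT_INTERVAL_S = 20          # assumed client ping interval
TTL_S = 3 * HEARTBEAT_INTERVAL_S   # entry expires after 3 missed heartbeats


class PresenceStore:
    """In-memory stand-in for a TTL key-value store (hypothetical class)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # user_id -> (status, last_seen, expires_at)
        self._entries: Dict[str, Tuple[str, float, float]] = {}

    def heartbeat(self, user_id: str, status: str = "online") -> None:
        """Create or refresh the entry; every ping pushes the expiry forward."""
        now = self._clock()
        self._entries[user_id] = (status, now, now + TTL_S)

    def status(self, user_id: str) -> str:
        """A missing or expired entry is read as 'offline'."""
        entry = self._entries.get(user_id)
        if entry is None or entry[2] <= self._clock():
            return "offline"
        return entry[0]

    def sweep(self) -> List[str]:
        """Drop expired entries and return the user_ids that just went offline."""
        now = self._clock()
        gone = [uid for uid, (_, _, exp) in self._entries.items() if exp <= now]
        for uid in gone:
            del self._entries[uid]
        return gone
```

The list returned by `sweep()` is where a 'gone' event would be published. Because nothing ever writes 'offline' explicitly, a client that vanishes without a logout still ends up offline once its TTL lapses.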

Gateway owns the connection lifecycle

Each WebSocket gateway process maintains a table of active connections and their associated user IDs. On connect, it writes `SETEX presence:{user_id} 60 {status}` to the presence store and publishes an 'online' event. On heartbeat receipt, it refreshes the TTL. On disconnect detection (TCP FIN or missed heartbeats), it deletes the key and publishes 'offline'. The gateway is the authoritative source, so it must handle reconnect races: when a user reconnects before the old connection's cleanup runs, a late disconnect for the old connection can delete the entry the new connection just wrote. Tagging the entry with a connection ID (in the key or in the value) and deleting only when the ID matches prevents a stale disconnect from wiping the active entry.
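
One way to picture the reconnect race is a compare-and-delete guard keyed on connection ID. The sketch below is illustrative only: the `Gateway` class, its method names, and the plain dict standing in for the presence store are assumptions, and TTL refresh is left out to keep the focus on the race.

```python
from typing import Dict, List, Tuple


class Gateway:
    """Hypothetical gateway bookkeeping; a dict stands in for the presence store."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}           # user_id -> owning connection ID
        self.events: List[Tuple[str, str]] = []   # published (user_id, state) events

    def on_connect(self, user_id: str, conn_id: str) -> None:
        was_online = user_id in self.store
        self.store[user_id] = conn_id             # newest connection owns the entry
        if not was_online:
            self.events.append((user_id, "online"))

    def on_disconnect(self, user_id: str, conn_id: str) -> None:
        # Compare-and-delete: only the connection that owns the entry may remove it.
        if self.store.get(user_id) != conn_id:
            return                                # stale disconnect from an old connection
        del self.store[user_id]
        self.events.append((user_id, "offline"))
```

If connection `c2` replaces `c1` and the disconnect for `c1` arrives late, the guard ignores it and no 'offline' event is published. Against a shared store the comparison and the delete would need to happen atomically; in Redis that is commonly done with a short Lua script.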

Pub/sub fan-out must be selective, not broadcast

A user coming online should notify only users who are subscribed to them: friends, channel members, open DM threads. Broadcasting to all users costs O(total_users) per status change, which is catastrophic. The architecture uses per-user subscription topics or per-room topics. When user A connects, A's gateway subscribes to presence events for all of A's contacts. When a contact comes online, only the gateways that host one of that contact's subscribers, A's among them, receive the event. Topic explosion (a user in 500 channels means 500 topic subscriptions) is the real operational problem; bounded fan-out (cap contacts per user, paginate large channels) is the mitigation.
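
A rough sketch of selective fan-out is an index from each watched user to the gateways that subscribed on behalf of their connected clients. The `PresenceBus` class and the `MAX_CONTACTS` cap of 1000 are hypothetical, chosen only to make the bounded fan-out idea concrete.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

MAX_CONTACTS = 1000  # assumed cap that bounds subscriptions per user


class PresenceBus:
    """Hypothetical in-process pub/sub with one topic per watched user."""

    def __init__(self) -> None:
        # watched user_id -> gateways holding at least one subscriber
        self._watchers: DefaultDict[str, Set[str]] = defaultdict(set)
        self.delivered: List[Tuple[str, str, str]] = []  # (gateway, user_id, state)

    def subscribe(self, gateway_id: str, contacts: List[str]) -> None:
        """Called when a user connects: watch that user's contacts, up to the cap."""
        for contact in contacts[:MAX_CONTACTS]:
            self._watchers[contact].add(gateway_id)

    def publish(self, user_id: str, state: str) -> int:
        """Deliver only to gateways watching user_id; return the fan-out size."""
        targets = sorted(self._watchers.get(user_id, ()))
        for gateway_id in targets:
            self.delivered.append((gateway_id, user_id, state))
        return len(targets)
```

The cost of a status change is proportional to the number of watching gateways rather than the total user count, and a user nobody watches produces no traffic at all.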

Typing indicators are presence's hardest subproblem

Typing indicators must expire automatically (the user may stop typing without ever sending a 'stop' event) and must not require server persistence. The pattern: the client sends `typing_start` every 3–4 seconds while composing; the server propagates it to the channel; a TTL of about 5 seconds at the receiver, chosen to exceed the refresh interval so the indicator does not flicker between refreshes, makes it disappear if no refresh arrives. At high message volumes in large channels, throttle to 1 typing event per user per 3s to avoid flooding the channel's event stream. The ephemeral message bus (not the chat message store) handles these events; they're never durably stored.
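
The two halves of the pattern, server-side throttling and receiver-side expiry, might look like the sketch below. The class names and the specific constants (3-second throttle, 5-second display TTL) are assumptions for illustration; timestamps are passed in explicitly so the behavior is easy to follow.

```python
from typing import Dict, List, Tuple

THROTTLE_S = 3.0     # forward at most one typing event per user per channel per 3 s
DISPLAY_TTL_S = 5.0  # receiver hides the indicator 5 s after the last refresh


class TypingThrottle:
    """Server side: drop typing events that arrive inside the throttle window."""

    def __init__(self) -> None:
        self._last_forwarded: Dict[Tuple[str, str], float] = {}

    def allow(self, channel: str, user: str, now: float) -> bool:
        key = (channel, user)
        last = self._last_forwarded.get(key)
        if last is not None and now - last < THROTTLE_S:
            return False
        self._last_forwarded[key] = now
        return True


class TypingView:
    """Receiver side: each refresh extends the deadline; no 'stop' event is needed."""

    def __init__(self) -> None:
        self._expires_at: Dict[str, float] = {}

    def on_typing(self, user: str, now: float) -> None:
        self._expires_at[user] = now + DISPLAY_TTL_S

    def typing_now(self, now: float) -> List[str]:
        return sorted(u for u, exp in self._expires_at.items() if exp > now)
```

Keeping the display TTL longer than the throttle window means a continuously typing user never drops out of `typing_now` between forwarded events.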

Multi-device presence requires session aggregation

A user logged into mobile + desktop is 'online' if any session is active. A naive implementation incorrectly marks the user 'offline' as soon as any one session disconnects. The presence store must track presence per session_id (not just user_id), and derive user-level status by taking the maximum priority across sessions (active > idle > offline). Session-level TTLs expire independently; the last session to expire triggers the user-level 'offline' event. Without this, every mobile background-fetch disconnect causes a spurious 'went offline' notification to all contacts.
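
Session aggregation can be sketched as a per-user map of session statuses reduced with a priority ordering. The `SessionPresence` class and its method names are hypothetical; the ordering active > idle > offline follows the text above.

```python
from typing import Dict, Optional

PRIORITY = {"offline": 0, "idle": 1, "active": 2}


class SessionPresence:
    """Hypothetical per-session tracker that derives a user-level status."""

    def __init__(self) -> None:
        # user_id -> {session_id: status}
        self._sessions: Dict[str, Dict[str, str]] = {}

    def user_status(self, user_id: str) -> str:
        """Highest-priority status across sessions; no sessions means offline."""
        statuses = self._sessions.get(user_id, {}).values()
        return max(statuses, key=PRIORITY.__getitem__, default="offline")

    def set_session(self, user_id: str, session_id: str, status: str) -> Optional[str]:
        """Record a session status; return the new user status only if it changed."""
        before = self.user_status(user_id)
        self._sessions.setdefault(user_id, {})[session_id] = status
        after = self.user_status(user_id)
        return after if after != before else None

    def expire_session(self, user_id: str, session_id: str) -> Optional[str]:
        """Remove an expired session; return the new user status only if it changed."""
        before = self.user_status(user_id)
        sessions = self._sessions.get(user_id, {})
        sessions.pop(session_id, None)
        if not sessions:
            self._sessions.pop(user_id, None)
        after = self.user_status(user_id)
        return after if after != before else None
```

Returning `None` when the derived status is unchanged is what suppresses the spurious notification: a mobile session expiring while the desktop session is still active publishes nothing.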

What breaks at scale

An often-overlooked failure mode is losing the presence store during a Redis failover: while a new primary is being elected (~15–30s), heartbeats cannot refresh their keys and TTLs expire, marking millions of users offline simultaneously. This triggers a fan-out storm of offline events and then a reconnect storm. Mitigate with a longer TTL during known degraded periods (a circuit breaker on the presence store) and debounce offline event publication by 10–30s so transient disconnections don't trigger notifications. The second problem is cold start after deploy: all connections drop and reconnect simultaneously, creating a thundering herd on the presence store. Rolling deploys and jittered client reconnect backoff are the usual ways to spread that load.
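
The debounce mitigation can be sketched as a holding area between TTL expiry and event publication. The `OfflineDebouncer` class and the 15-second window are assumptions picked from within the 10–30s range mentioned above.

```python
from typing import Dict, List

OFFLINE_DEBOUNCE_S = 15.0  # assumed grace period before 'offline' is published


class OfflineDebouncer:
    """Hypothetical buffer that delays offline events and cancels them on recovery."""

    def __init__(self) -> None:
        self._publish_at: Dict[str, float] = {}  # user_id -> earliest publish time

    def on_expired(self, user_id: str, now: float) -> None:
        """TTL lapsed: schedule an offline event instead of publishing immediately."""
        self._publish_at.setdefault(user_id, now + OFFLINE_DEBOUNCE_S)

    def on_heartbeat(self, user_id: str) -> None:
        """The user is back (or the store recovered): cancel the pending event."""
        self._publish_at.pop(user_id, None)

    def due(self, now: float) -> List[str]:
        """Return and clear the users whose grace period has fully elapsed."""
        ready = sorted(u for u, at in self._publish_at.items() if at <= now)
        for user_id in ready:
            del self._publish_at[user_id]
        return ready
```

If heartbeats resume before the window closes, as they would after a short failover, the pending events are cancelled and contacts never see the blip. The trade-off is that genuine disconnects are reported up to 15 seconds late.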

In production

Discord's real-time backend is built on Elixir: as described in its engineering posts, each guild (server) runs as its own process that fans member presence updates out to connected sessions, so a guild effectively acts as a subscription topic. Slack scopes presence to a workspace and exposes two states through its API, `active` and `away`, with users marked away automatically after a period of inactivity. The real engineering challenge is fan-out skew: if a single celebrity account with 10M followers comes online, up to 10M presence updates have to fan out at once, which calls for dedicated fan-out infrastructure separate from the normal per-user pub/sub.

Part of System Design Library on SystemLore.