View / Like Counter
Count views/likes on hot items accurately enough, without a write hotspot on one row.
Open the interactive version → diagrams, practice & moreRequirements
Functional
- Increment count
- Read count
- Top-N (optional)
Non-functional
- High write throughput
- Approximate-OK for display
Scale
Millions of increments/sec on hot items
The approach
Shard the counter across N keys (counter:item:shard) and sum on read, or buffer increments in memory/Redis and flush to the DB in batches. Exact billing-grade counts reconcile from a log.
Key components
App → sharded counters (Redis) → periodic flush → DB
Numbers that matter
- A naive single-row counter on PostgreSQL saturates at ~5,000–10,000 increments/sec due to row-level lock contention; Redis `INCR` handles ~100,000–500,000 increments/sec on a single key before becoming the bottleneck.
- Sharding a counter across N=100 shards reduces per-shard write rate by 100× at the cost of N reads on the read path — a read that sums 100 Redis keys via `MGET` takes ~0.5–1 ms vs a single `GET`.
- In-process buffering (aggregate locally, flush every 1 second) reduces write amplification by 1,000–10,000× for viral content — a video receiving 10K views/sec flushes 1 write/sec to Redis instead of 10K.
- YouTube view counts use a ~30-second display lag deliberately — counts are shown from a cached/batch-aggregated value, while exact counts for monetization reconcile from a log with eventual consistency of ~hours.
Senior deep-dive
Write sharding is the core technique: a single `UPDATE counter SET n=n+1` on one row under millions of concurrent requests causes lock contention — split into N sharded keys and sum on read.
Exact vs approximate is a product decision: for display ("1.2M views") you need approximate — buffer in Redis, flush async; for billing or fraud you need exact — reconcile from an immutable event log.
The hot key problem is universal: any viral item creates a hotspot on the shard that owns its counter — local in-process aggregation (count in application memory, flush every second) is often more effective than more shards.
Write sharding: the standard pattern and its tradeoffs
Split `counter:item:X` into `counter:item:X:shard:{0..N-1}`. Each write picks a shard (random or round-robin) and does `INCR counter:item:X:shard:hash(request)%N`. Reads `MGET` all N keys and sum. N=10 is usually enough for most workloads — it reduces per-shard rate by 10× and keeps read cost manageable. The tradeoff: reads become N round-trips (or 1 pipelined MGET) and the true count is only consistent if you read all shards atomically — a partial read mid-write gives a temporarily low count. For display purposes, this is fine; for billing, you need a different approach.
In-process buffering: the underrated optimization
Instead of one Redis `INCR` per view, aggregate in the application process in a `ConcurrentHashMap<itemId, AtomicLong>` and flush to Redis every 1–5 seconds with a single `INCRBY`. This reduces Redis write QPS by 1,000–10,000× for hot items. The failure mode: a server restart loses the unflushed buffer — acceptable for display counts (you lose a second of views) but not for billing. The pattern requires application instances to be stateful (cannot be killed without draining), which conflicts with Kubernetes's default graceful shutdown — add a shutdown hook to flush on SIGTERM.
Event log as the source of truth
For exact counts (billing, fraud), every view event is an immutable append to a Kafka topic. Downstream consumers aggregate into a count store. The display counter can be an approximate fast path; the reconciliation job recomputes exact counts from the log on a schedule (hourly, daily). This separation lets you tolerate display counter drift while guaranteeing billing accuracy. The key architectural principle: never use a mutable counter as the source of truth for money — always derive it from an append-only log that can be replayed, audited, and corrected.
HyperLogLog for unique view counts
"Views" often means unique views (distinct users), not total increments — which requires deduplication. A naive set of all user IDs that viewed an item is O(cardinality) memory — unworkable at scale. HyperLogLog estimates cardinality with ~0.81% standard error using ~1.5 KB of memory regardless of cardinality. Redis has HyperLogLog built-in (`PFADD`/`PFCOUNT`). The caveat: HLL can't tell you whether a specific user has viewed something — it only gives an aggregate estimate. For "has this user seen this?" you need a Bloom filter or an actual set.
Cache stampede on count reads
When a cached counter expires and 10,000 concurrent requests all find a cache miss simultaneously, they all query the (expensive) backing store — the cache stampede. For view counters, this manifests when a viral video's cached count expires mid-spike. Mitigations: probabilistic early expiry (start refreshing the cache when TTL drops below a random threshold, before it expires), locking with a single refresh (one request refreshes, others wait), or stale-while-revalidate (serve the stale count immediately, refresh asynchronously). The last is most appropriate for display counts where freshness is less critical than availability.
What breaks at scale
Viral item hotspots — a single item receiving 1M+ writes/sec — exceed the capacity of any single shard and even a well-sharded counter. The only solution is local aggregation at every layer (CDN edge, application tier, cache tier) so the central store receives batched increments rather than individual events. Counter desync after failure is the second failure mode: a Redis failover that doesn't flush AOF/RDB leaves the counter behind by minutes of events — for billing, this means revenue undercount. Integer overflow is a real (if embarrassing) production issue: a 32-bit counter overflows at ~2.1 billion; YouTube's early view counter reset to 0 on viral videos for exactly this reason — use 64-bit counters always.
In production
Twitter uses a combination of in-process counters flushed to Redis and periodic reconciliation to a MySQL sharded store for like/retweet counts. YouTube's view count is famously approximate on display (frozen at 301+ for days on viral videos historically) and exact for monetization via a separate pipeline. The real engineering challenge is consistency on viral content: a video going from 0 to 10 million views in an hour creates a transient hotspot that overwhelms any single counter shard — the fix is adaptive sharding (dynamically increase shard count for hot items) or gossip-based aggregation (each edge server maintains a local counter and exchanges with neighbors, converging to a global sum without a central bottleneck).
Common mistakes
- One row per counter (lock contention)
- Synchronous DB increment per view
- Treating display counts as billing-accurate