Comment System (Disqus)
Embeddable threaded comments across millions of sites.
Requirements
Functional
- Post/reply (threads)
- Moderation
- Vote/sort
- Embed widget
Non-functional
- Fast load
- Read-heavy
Scale
Billions of comments
The approach
Comments keyed by page/thread with a materialized path for nested replies; heavily cached per thread; async moderation pipeline; embedded via a lightweight widget hitting a cached API.
Key components
Widget → cached comment API → comment store · moderation
Numbers that matter
- A closure table needs O(d×n) rows for n comments at average depth d — at depth 5 and 1M comments that's 5M rows, making it expensive for deep threads.
- Disqus serves ~50 million comments per month across millions of embedded sites — per-thread caching at the CDN is what makes that economically viable.
- A materialized path query for all descendants is a single LIKE 'path/%' index scan, consistently under 5 ms for threads up to ~10,000 comments.
- Spam comment rates average 40–60% of raw submissions on open embeds — without async ML pre-filtering, moderator queues are unworkable.
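The closure-table row count in the first bullet can be checked with a small sketch. It assumes the table stores one row per (ancestor, descendant) pair, including each comment paired with itself, and that the root counts as depth 1. The function name and the `parent_of` mapping are illustrative only.

```python
def closure_rows(parent_of):
    """Return every (ancestor, descendant) pair for a tree given as child -> parent.

    Each comment is also paired with itself, so a comment at depth d
    (root = depth 1) contributes exactly d rows.
    """
    rows = []
    for comment_id in parent_of:
        ancestor = comment_id
        while ancestor is not None:
            rows.append((ancestor, comment_id))
            ancestor = parent_of[ancestor]
    return rows


# A single reply chain five levels deep: 1 <- 2 <- 3 <- 4 <- 5.
chain = {1: None, 2: 1, 3: 2, 4: 3, 5: 4}
print(len(closure_rows(chain)))  # 15 rows = 1 + 2 + 3 + 4 + 5 (5 comments, average depth 3)
```

The total is the sum of the depths of all comments, which is where the d×n estimate comes from.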
Senior deep-dive
Thread storage is the core schema decision — flat (foreign key to parent), closure table (all ancestor/descendant pairs), or materialized path each trade write cost for read cost.
Cache the thread, not individual comments: a page of comments is almost always read as a unit, so cache the serialized thread blob by page URL with a short TTL rather than caching per-comment.
Moderation latency matters more than throughput — a comment stays visible until it's actioned, so async ML classification with a human-in-the-loop queue is the right architecture, not synchronous filtering.
Tree storage: flat adjacency vs. closure table vs. materialized path
Flat adjacency (parent_id FK) is simple to write but requires recursive CTEs or multiple queries to read a subtree. Closure tables store every ancestor-descendant pair, making subtree reads a single join but writes O(depth) inserts — painful for deeply nested replies. Materialized path (e.g., '1/42/107/') makes subtree reads a single index-range scan and is the pragmatic choice for most comment systems where trees are shallow (depth < 10).
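A minimal sketch of the materialized-path option, using Python's built-in sqlite3 module. The table layout, the function names, and the choice to express the prefix match as a range (`path >= ? AND path < ?`, equivalent to `LIKE 'path/%'`) are assumptions made for illustration, not a description of any particular production schema.

```python
import sqlite3


def open_store():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments ("
        " id INTEGER PRIMARY KEY,"
        " thread_id TEXT NOT NULL,"
        " path TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_thread_path ON comments (thread_id, path)")
    return conn


def add_comment(conn, thread_id, body, parent_id=None):
    """Insert a comment; its path is the parent's path plus its own id and '/'."""
    parent_path = ""
    if parent_id is not None:
        row = conn.execute(
            "SELECT path FROM comments WHERE id = ?", (parent_id,)
        ).fetchone()
        if row is None:
            raise ValueError(f"unknown parent {parent_id}")
        parent_path = row[0]
    cur = conn.execute(
        "INSERT INTO comments (thread_id, path, body) VALUES (?, '', ?)",
        (thread_id, body),
    )
    comment_id = cur.lastrowid
    conn.execute(
        "UPDATE comments SET path = ? WHERE id = ?",
        (f"{parent_path}{comment_id}/", comment_id),
    )
    return comment_id


def subtree(conn, thread_id, root_id):
    """Return the ids of root_id and all its descendants with one range query."""
    root_path = conn.execute(
        "SELECT path FROM comments WHERE id = ?", (root_id,)
    ).fetchone()[0]
    # '0' is the character right after '/' in ASCII, so the half-open range
    # [root_path, upper) contains exactly the paths that start with root_path.
    upper = root_path[:-1] + "0"
    rows = conn.execute(
        "SELECT id FROM comments"
        " WHERE thread_id = ? AND path >= ? AND path < ?"
        " ORDER BY path",
        (thread_id, root_path, upper),
    )
    return [r[0] for r in rows]


conn = open_store()
a = add_comment(conn, "t1", "root")          # path '1/'
b = add_comment(conn, "t1", "reply", a)      # path '1/2/'
c = add_comment(conn, "t1", "nested", b)     # path '1/2/3/'
d = add_comment(conn, "t1", "second root")   # path '4/'
print(subtree(conn, "t1", b))  # [2, 3]
```

Note that `ORDER BY path` sorts segments as text, so '10/' sorts before '2/'. A real schema would likely use fixed-width or zero-padded segments if path order is meant to double as display order.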
Thread-level caching: the right granularity
Per-comment caching has a low hit rate because a comment is rarely requested in isolation. Cache the rendered thread blob (sorted, paginated JSON) keyed by thread ID + page + sort order. A single new comment invalidates the thread cache, which is fine because reads outnumber writes by 99:1 or more. For very active threads (viral posts), writes arrive fast enough that active invalidation would evict the blob almost continuously, so rely on a short TTL (30s) instead and accept slightly stale reads; protect the expiry itself from a thundering herd by coalescing concurrent refreshes.
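A minimal in-process sketch of thread-level caching. The class name, the key format, and the `loader` callback are hypothetical; a real deployment would more likely sit in a shared cache or at the CDN, but the key shape and TTL logic are the same idea.

```python
import json
import time


class ThreadCache:
    """Caches one serialized blob per (thread, page, sort order) with a TTL."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # key -> (expires_at, json_blob)

    @staticmethod
    def key(thread_id, page, sort):
        return f"thread:{thread_id}:page:{page}:sort:{sort}"

    def get_or_load(self, thread_id, page, sort, loader):
        """Return the cached blob, calling loader(thread_id, page, sort) on a miss."""
        k = self.key(thread_id, page, sort)
        now = self._clock()
        entry = self._entries.get(k)
        if entry is not None and entry[0] > now:
            return entry[1]
        blob = json.dumps(loader(thread_id, page, sort))
        self._entries[k] = (now + self._ttl, blob)
        return blob

    def invalidate_thread(self, thread_id):
        """Active invalidation: drop every cached page and sort order of a thread."""
        prefix = f"thread:{thread_id}:"
        for k in [k for k in self._entries if k.startswith(prefix)]:
            del self._entries[k]
```

On a normal thread the write path would call `invalidate_thread` after storing a new comment; on a viral thread it would skip that call and let the 30-second TTL bound staleness.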
Embedding: the iframe vs. script tag tradeoff
Disqus's classic model loads via a script tag that rewrites a div — fast to embed but the comment widget runs in the host page's JavaScript context, a security concern. An iframe-based embed is fully sandboxed but prevents seamless styling and complicates scroll/height negotiation. Most modern embeds use a postMessage bridge between host page and an iframe — sandboxed but able to communicate dimensions and auth tokens.
Voting and ranking within threads
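Sorting by 'best' should use the lower bound of the Wilson score interval rather than raw upvote count or raw upvote ratio. At 95% confidence, a comment with 10 upvotes out of 10 votes scores about 0.72: that ranks it below one with 1,000 out of 1,100 (about 0.89) even though its raw ratio is higher, and above one with 1,000 out of 2,000 (about 0.48) even though its raw count is far lower. This is the basis of Reddit's 'best' sort. Storing a running Wilson score requires updating a float on every vote, which is a write hotspot on popular comments. Deferred batch recomputation (recalculate top comments every 60 seconds) is the practical approach; exact real-time ranking only matters for new threads.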
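The Wilson lower bound itself is a few lines of arithmetic. The sketch below uses the standard formula with z = 1.96 (roughly 95% confidence); the function name and the choice to score zero-vote comments as 0.0 are assumptions.

```python
import math


def wilson_lower_bound(upvotes, total_votes, z=1.96):
    """Lower bound of the Wilson score interval for the upvote fraction."""
    if total_votes == 0:
        return 0.0
    p = upvotes / total_votes
    z2 = z * z
    centre = p + z2 / (2 * total_votes)
    margin = z * math.sqrt(p * (1 - p) / total_votes + z2 / (4 * total_votes ** 2))
    return (centre - margin) / (1 + z2 / total_votes)


print(round(wilson_lower_bound(10, 10), 2))      # 0.72
print(round(wilson_lower_bound(1000, 1100), 2))  # 0.89
print(round(wilson_lower_bound(1000, 2000), 2))  # 0.48
```

The small sample is penalised for uncertainty, so a perfect 10 out of 10 sits between the well-supported 91% comment and the 50% comment with a hundred times more upvotes.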
Moderation pipeline: async, not synchronous
Synchronous ML filtering blocks the comment submission path and adds latency. The correct model: optimistically accept and display, classify async, auto-remove high-confidence spam/toxicity, route uncertain cases to a human review queue. Known-bad content (PhotoDNA for images, hash lists for text) should be blocked synchronously — the cost is low and the harm is high. Store every moderation decision for appeals and model retraining.
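The two-stage flow can be sketched as follows. Everything here is illustrative: the blocklist entry, the thresholds (0.95 and 0.60), the action names, and the assumption that a classifier elsewhere produces a spam score in [0, 1]. Only the cheap hash lookup runs on the submission path.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(text):
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Hypothetical blocklist; in practice this would be a shared hash list.
KNOWN_BAD = {fingerprint("buy cheap pills now")}


@dataclass
class ModerationLog:
    """Keeps every decision so appeals and retraining can replay them."""
    decisions: list = field(default_factory=list)

    def record(self, comment_id, stage, action, score=None):
        self.decisions.append(
            {"comment_id": comment_id, "stage": stage, "action": action, "score": score}
        )


def accept_on_submit(comment_id, text, log):
    """Synchronous step: block known-bad content, otherwise publish optimistically."""
    if fingerprint(text) in KNOWN_BAD:
        log.record(comment_id, "sync", "blocked")
        return False
    log.record(comment_id, "sync", "published")
    return True


def triage(comment_id, spam_score, log, remove_at=0.95, review_at=0.60):
    """Asynchronous step: act on a classifier score once it arrives."""
    if spam_score >= remove_at:
        action = "auto_removed"
    elif spam_score >= review_at:
        action = "human_review"
    else:
        action = "kept"
    log.record(comment_id, "async", action, spam_score)
    return action
```

A queue worker would call `triage` for each published comment after classification, so a slow model delays removal but never delays posting.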
What breaks at scale
Thundering herd on viral threads: when the cached blob for a post with 50,000 comments expires, hundreds of concurrent requests miss at once and hit the DB simultaneously. Fix: probabilistic early expiration (individual requests randomly refresh the entry shortly before its TTL runs out, e.g. within the last 10s, so the refresh is spread out instead of synchronized) or a read-through mutex (one request fetches, the rest wait). Embedding across millions of sites also means your comment widget is loaded by every browser visiting those sites, so your JS bundle is on the critical path of the web; keep it under 30 KB gzipped and load it async.
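One common way to implement probabilistic early expiration is the rule sketched below, in which each request adds a random, exponentially distributed head start to the current time and refreshes if that lands past the expiry. The function name and parameters are assumptions: `recompute_seconds` is how long rebuilding the thread blob takes, and `beta` above 1.0 makes early refreshes more eager.

```python
import math
import random


def should_refresh_early(now, expires_at, recompute_seconds, beta=1.0, rand=random.random):
    """Decide whether this request should rebuild the cached blob before it expires.

    rand() returns a float in [0.0, 1.0), so 1.0 - rand() lies in (0.0, 1.0]
    and the logarithm is always defined and non-positive.
    """
    head_start = -recompute_seconds * beta * math.log(1.0 - rand())
    return now + head_start >= expires_at


# With a fixed draw of 0.5 and a 10 s rebuild, the head start is about 6.9 s.
print(should_refresh_early(94.0, 100.0, 10.0, rand=lambda: 0.5))  # True
print(should_refresh_early(93.0, 100.0, 10.0, rand=lambda: 0.5))  # False
```

Because most draws give a small head start, only a few requests refresh early and the rest keep serving the cached blob; a read-through mutex can still be layered on top so that concurrent early refreshes collapse into one DB query.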
In production
Disqus uses a graph-like storage model sharded by thread (site + URL), serving the full comment tree as a pre-rendered JSON blob via CDN. Reddit uses a closure table internally for nested comments but caps display depth and shows 'load more' to avoid rendering enormous subtrees. YouTube moved from threaded to two-level comments (top-level + direct replies only) specifically to simplify storage and reduce the edge cases in tree rendering. The real challenge is identity federation: an embedded widget must authenticate users across the host site's auth without holding passwords — OAuth delegated tokens are the standard approach.
Common mistakes
- Recursive tree fetches
- No per-thread cache
- Synchronous moderation blocking posts