System Design Library

Content Moderation Pipeline

Detect and act on harmful content (text/image/video) at scale.

Requirements

Functional

  • Classify content (ML)
  • Auto-action + human review
  • Appeals
  • Audit trail

Non-functional

  • High throughput
  • Low latency for some surfaces
  • Accuracy

Scale

Billions of items/day

The approach

Content → async classification (text/image/video models) → confident cases auto-actioned, uncertain ones to a human-review queue; hashing (PhotoDNA) catches known-bad instantly; decisions logged for appeals/audit.
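
The flow above can be sketched as a single routing function. This is a minimal illustration, not a reference implementation: the two threshold values, the `Decision` fields, and the in-memory `audit_log` are assumptions made up for the example.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical thresholds; real values would be tuned per category.
AUTO_ACTION_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.60


@dataclass(frozen=True)
class Decision:
    action: str                    # "remove", "review", or "allow"
    reason: str                    # why the pipeline chose this action
    score: Optional[float] = None  # classifier score, if a model ran


def route(hash_matched: bool, score: float) -> Decision:
    """Decide what to do with one item.

    hash_matched: True if a perceptual-hash lookup hit known-bad content.
    score: classifier's estimated probability that the item violates policy.
    """
    if hash_matched:
        # Known-bad content is actioned without consulting the model score.
        return Decision("remove", "known_bad_hash")
    if score >= AUTO_ACTION_THRESHOLD:
        return Decision("remove", "classifier_confident", score)
    if score >= REVIEW_THRESHOLD:
        return Decision("review", "classifier_uncertain", score)
    return Decision("allow", "below_review_threshold", score)


audit_log = []  # stand-in for a durable, append-only decision log


def moderate(item_id: str, hash_matched: bool, score: float) -> Decision:
    """Route one item and record the decision for appeals and audit."""
    decision = route(hash_matched, score)
    audit_log.append({"item_id": item_id, **asdict(decision)})
    return decision
```

Logging happens on every path, including "allow", so that appeals and audits can reconstruct why an item was or was not actioned.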

Key components

Upload → classifiers (queue) → auto-action / review queue · hash matching

Numbers that matter

Senior deep-dive

Latency and accuracy are in direct tension — the fast ML path makes mistakes, and the slow human path doesn't scale.

Known-bad content is nearly free to catch: perceptual hashing (PhotoDNA / PDQ) identifies previously actioned images with a sub-millisecond lookup before any model runs.

Decision logging is not optional: every auto-action must be auditable for appeals, regulatory compliance, and model retraining — the pipeline is a data flywheel.

Hash-first: eliminate the known-bad for free

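Before any ML inference, compute a perceptual hash (PDQ for images, TMK for video) and look it up in an index of previously actioned content. Because a re-encoded copy produces a hash that differs from the original by a few bits, the lookup has to be a nearest-neighbor search within a Hamming-distance threshold; a Bloom filter or exact-match store only catches byte-identical re-uploads keyed by a cryptographic hash. The lookup can run in under 1ms and catches re-uploads of known CSAM, viral misinformation frames, and copyright material. With a conservative distance threshold, the false positive rate on PDQ is very low, because perceptual hashes tolerate minor re-encoding but reject different content. Every model inference saved here is a cost and latency win.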
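
A minimal sketch of the matching step, assuming 256-bit perceptual hashes stored as Python integers. The 31-bit threshold and the function names are assumptions for illustration, and the linear scan stands in for whatever nearest-neighbor index a real deployment would use. Computing the hash itself is out of scope here.

```python
from typing import Iterable, Optional

# Assumed match threshold, in differing bits out of 256; tune against real data.
MATCH_THRESHOLD_BITS = 31


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def find_match(
    candidate: int,
    known_bad: Iterable[int],
    max_distance: int = MATCH_THRESHOLD_BITS,
) -> Optional[int]:
    """Return the closest known-bad hash within max_distance, or None.

    Linear scan for clarity only; it is O(n) in the size of the hash set.
    """
    best = None
    best_distance = max_distance + 1
    for known in known_bad:
        distance = hamming_distance(candidate, known)
        if distance < best_distance:
            best, best_distance = known, distance
    return best
```

A hash that differs from a stored one by a couple of bits (a re-encoded copy) still matches, while an unrelated image whose hash differs in most positions does not. An exact-match lookup would miss the first case.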

Async vs synchronous enforcement: the product decision

Synchronous blocking (refuse upload until classification passes) gives clean UX but adds 200–500ms latency on every upload and blocks on classifier availability. Async post-publish (accept, classify, remove if bad) means harmful content is briefly live — acceptable for low-risk categories, unacceptable for CSAM. Most platforms use a hybrid: synchronous on upload for high-severity categories (CSAM, terrorism), async with rapid takedown for lower-severity policy violations.
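
The hybrid policy can be sketched as a per-category enforcement mode. The category names, the `is_violation` callback, and the list used as an async queue are all hypothetical stand-ins; the sketch also ignores classifier timeouts, which a real synchronous path would have to handle.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Mode(Enum):
    SYNC_BLOCK = "sync_block"          # hold the upload until classification finishes
    ASYNC_TAKEDOWN = "async_takedown"  # publish first, remove later if it violates


# Hypothetical category names; the split mirrors the text above.
HIGH_SEVERITY = {"csam", "terrorism"}


def enforcement_mode(category: str) -> Mode:
    """High-severity categories block the upload; everything else is async."""
    return Mode.SYNC_BLOCK if category in HIGH_SEVERITY else Mode.ASYNC_TAKEDOWN


def handle_upload(
    item: str,
    categories: List[str],
    is_violation: Callable[[str, str], bool],
    async_queue: List[Tuple[str, str]],
) -> bool:
    """Return True if the item was published.

    is_violation(item, category) is a hypothetical synchronous classifier call.
    """
    # High-severity checks run inline and can refuse the upload.
    for category in categories:
        if enforcement_mode(category) is Mode.SYNC_BLOCK and is_violation(item, category):
            return False
    # Lower-severity checks are deferred; the item goes live in the meantime.
    for category in categories:
        if enforcement_mode(category) is Mode.ASYNC_TAKEDOWN:
            async_queue.append((item, category))
    return True
```

Only the high-severity classifiers sit on the upload's critical path, so the latency cost of synchronous checking is paid for the categories where brief exposure is unacceptable.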

Threshold calibration: precision-recall isn't free

A single global threshold for all content categories is a mistake. Violence and spam have very different error costs: wrongly removing a news video is a worse outcome than letting a bot through briefly. Each category should have its own threshold tuned against appeal rates (a proxy for false positives) and escalation rates (a proxy for false negatives). Thresholds drift as the content distribution shifts, so re-evaluate them on a fixed cadence (weekly, for example) against production metrics.
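
One way to make per-category calibration concrete is to pick, for each category, the lowest score threshold that still meets a target precision on human-reviewed samples. The function name, sample data shape, and targets are assumptions for illustration; the text above describes tuning against appeal and escalation rates, for which reviewed labels are a stand-in here.

```python
from typing import List, Optional, Tuple


def pick_threshold(
    samples: List[Tuple[float, bool]], target_precision: float
) -> Optional[float]:
    """Lowest threshold whose precision meets the target, or None.

    samples: (score, is_violation) pairs from human-reviewed items.
    Precision at threshold t is measured over all samples with score >= t.
    """
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    true_positives = 0
    best = None
    for count, (score, is_violation) in enumerate(ordered, start=1):
        true_positives += int(is_violation)
        # Evaluate only at the last sample of a run of equal scores.
        if count < len(ordered) and ordered[count][0] == score:
            continue
        if true_positives / count >= target_precision:
            best = score  # scores are descending, so this keeps the lowest
    return best


def calibrate(samples_by_category, target_by_category):
    """Compute one threshold per category from that category's own samples."""
    return {
        category: pick_threshold(samples, target_by_category[category])
        for category, samples in samples_by_category.items()
    }
```

A category where false positives are expensive gets a high precision target and therefore a stricter threshold; a category like spam can accept a lower target and auto-action more aggressively. Re-running `calibrate` on fresh reviewed samples is one way to track drift.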

Human review queue: isolation prevents cross-contamination

If the human review queue is a single FIFO, a viral harmful event floods it and delays review of unrelated categories. Per-category priority queues with SLA targets isolate workloads. Reviewers seeing high volumes of traumatic content need exposure rotation — this is a product constraint that drives queue design. Blind inter-rater agreement on sampled items measures reviewer consistency and catches label drift in the training data.
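
A small sketch of queue isolation, assuming each category has its own SLA and a hard cap on queue depth. The class name, the SLA and cap values, and the "earliest deadline first" draining rule are illustrative choices, not a description of any particular platform.

```python
import heapq
from collections import defaultdict
from typing import Dict, Optional, Tuple


class ReviewQueues:
    """One min-heap per category, ordered by SLA deadline."""

    def __init__(self, sla_seconds: Dict[str, float], max_depth: Dict[str, int]):
        self.sla_seconds = sla_seconds  # category -> review SLA in seconds
        self.max_depth = max_depth      # category -> hard cap on queued items
        self.heaps = defaultdict(list)  # category -> heap of (deadline, item_id)

    def enqueue(self, category: str, item_id: str, now: float) -> bool:
        """Queue an item; return False if the category is at its cap."""
        heap = self.heaps[category]
        if len(heap) >= self.max_depth[category]:
            return False  # caller must shed load instead of crowding other queues
        heapq.heappush(heap, (now + self.sla_seconds[category], item_id))
        return True

    def next_item(self) -> Optional[Tuple[str, str]]:
        """Pop the item with the earliest SLA deadline across all categories."""
        heads = [(heap[0], category) for category, heap in self.heaps.items() if heap]
        if not heads:
            return None
        _, category = min(heads)
        _, item_id = heapq.heappop(self.heaps[category])
        return category, item_id
```

Because a flooded category hits its own cap rather than growing without bound, and because a short-SLA item sorts ahead of older long-SLA items, a viral event in one category does not push back review of another.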

Decision logging as a retraining flywheel

Every auto-action and every human decision is a labeled training example. Auto-actions that go unchallenged are cheap, weakly labeled positives; decisions overturned on appeal are gold negatives (the model was wrong and a human said so). A pipeline that doesn't log structured outcomes to a feature store is wasting its best signal. Active learning, which routes borderline-confidence items to humans first, yields more informative labels per review hour than random sampling.
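
Both halves of the flywheel can be sketched in a few lines: turning logged outcomes into weighted labels, and choosing which items to send to reviewers. The record fields (`auto_removed`, `appeal_overturned`, `human_confirmed`) and the weights are hypothetical, and "closest to 0.5" is just one simple uncertainty-sampling rule.

```python
from typing import Dict, List, Tuple


def label_from_log(record: Dict) -> Tuple[bool, float]:
    """Turn one decision-log record into (is_violation, training_weight)."""
    predicted = record["auto_removed"]
    if record.get("appeal_overturned"):
        # A human contradicted the model: flip the label and trust it fully.
        return (not predicted, 1.0)
    if record.get("human_confirmed"):
        return (predicted, 1.0)
    # Unreviewed auto-action: usable, but only as a weak label.
    return (predicted, 0.2)


def pick_for_review(
    scored_items: List[Tuple[str, float]], budget: int
) -> List[Tuple[str, float]]:
    """Uncertainty sampling: pick the items whose score is closest to 0.5."""
    return sorted(scored_items, key=lambda item: abs(item[1] - 0.5))[:budget]
```

With a review budget of two, items scored 0.52 and 0.40 would be chosen over items scored 0.98 and 0.05, since a human label on a confident prediction teaches the model little.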

What breaks at scale

Adversarial evasion is the primary scaling failure: once bad actors learn how your hash matching or model thresholds behave, they apply small perturbations (noise, crops, color shifts) to defeat detection. Ensemble models with diverse architectures raise the cost of evasion, but it remains an arms race. The second failure is queue starvation during viral events: a single high-volume harmful meme can fill the human review queue and delay unrelated high-severity items. The operational levers are preemptive capacity scaling (pulling in extra reviewers from other pools) and per-category SLAs with hard caps.

In production

Meta's content moderation stack layers PDQ perceptual hashing (instant known-bad match), NSFW classifiers (ResNet/CLIP-based), NLP models for text (XLM-R for multilingual), and a human review tier sourced from BPO vendors. YouTube's approach adds audio and video fingerprinting (Content ID) for copyright. The real challenge is multilingual and cultural context: a slur in one language is a common word in another, and a model that reaches 95% accuracy on English data can drop to around 70% on low-resource languages. Borderline content in those languages therefore needs its own routing path to specialized reviewers, separate from the main pipeline.

Common mistakes

Part of System Design Library on SystemLore.