Content Moderation Pipeline
Detect and act on harmful content (text/image/video) at scale.
Requirements
Functional
- Classify content (ML)
- Auto-action + human review
- Appeals
- Audit trail
Non-functional
- High throughput
- Low latency for some surfaces
- Accuracy
Scale
Billions of items/day
The approach
Content → async classification (text/image/video models) → confident cases auto-actioned, uncertain ones to a human-review queue; hashing (PhotoDNA) catches known-bad instantly; decisions logged for appeals/audit.
Key components
Upload → classifiers (queue) → auto-action / review queue · hash matching
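The flow above can be sketched as a single routing function. This is a minimal illustration, not a reference implementation: the names `moderate`, `Outcome`, `hash_lookup`, and `classifier`, and the cutoffs 0.9 and 0.2, are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    REMOVE = "remove"
    ALLOW = "allow"
    HUMAN_REVIEW = "human_review"


@dataclass
class Outcome:
    decision: Decision
    source: str   # "hash" or "model"; recorded for the audit trail
    score: float


def moderate(
    item: bytes,
    hash_lookup: Callable[[bytes], bool],   # hypothetical known-bad lookup
    classifier: Callable[[bytes], float],   # hypothetical model, returns P(violating)
    remove_at: float = 0.9,                 # illustrative cutoff
    allow_below: float = 0.2,               # illustrative cutoff
) -> Outcome:
    # 1. Hash match first: a hit skips model inference entirely.
    if hash_lookup(item):
        return Outcome(Decision.REMOVE, "hash", 1.0)

    # 2. Model score: confident cases are auto-actioned.
    score = classifier(item)
    if score >= remove_at:
        return Outcome(Decision.REMOVE, "model", score)
    if score < allow_below:
        return Outcome(Decision.ALLOW, "model", score)

    # 3. Everything in between goes to the human-review queue.
    return Outcome(Decision.HUMAN_REVIEW, "model", score)
```

Every returned `Outcome` would be written to the decision log, so that appeals and audits can see which stage acted and at what score.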
Numbers that matter
- PhotoDNA / PDQ hash matching runs in <1ms per image, typically orders of magnitude faster than ML inference on the same image.
- Human reviewers at scale handle ~1,000–2,000 items per day at high accuracy; queues grow to hours of backlog during viral events.
- Threshold tuning: moving a classifier threshold from 0.8 to 0.9 can halve false positives while only missing ~10% more true positives, but the right cutoff depends on appeal rates.
- Video moderation costs ~10–50× more compute per item than images due to per-frame sampling and audio transcription.
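A rough capacity check shows why the review fraction matters so much. The sketch below combines the per-reviewer throughput above with two assumed inputs that are not from this document: a volume of 1 billion items/day and 1,000 items per million (0.1%) routed to human review.

```python
def reviewers_needed(items_per_day: int, routed_per_million: int,
                     items_per_reviewer_day: int) -> int:
    """Head count needed to clear one day's review volume in one day."""
    review_items = items_per_day * routed_per_million // 1_000_000
    # Ceiling division without floats.
    return -(-review_items // items_per_reviewer_day)


ITEMS_PER_DAY = 1_000_000_000   # assumption: low end of "billions of items/day"
ROUTED_PER_MILLION = 1_000      # assumption: 0.1% of items need a human

fast = reviewers_needed(ITEMS_PER_DAY, ROUTED_PER_MILLION, 2_000)
slow = reviewers_needed(ITEMS_PER_DAY, ROUTED_PER_MILLION, 1_000)
print(fast, slow)  # 500 1000
```

Under these assumptions, each additional 0.1% of traffic sent to review costs another 500 to 1,000 reviewers, which is why the auto-action thresholds dominate the staffing plan.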
Senior deep-dive
Latency and accuracy are in direct tension — the fast ML path makes mistakes, and the slow human path doesn't scale.
Known-bad content is nearly free to catch: perceptual hashing (PhotoDNA / PDQ) identifies previously actioned images in under a millisecond, before any model runs.
Decision logging is not optional: every auto-action must be auditable for appeals, regulatory compliance, and model retraining — the pipeline is a data flywheel.
Hash-first: eliminate the known-bad for free
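Before any ML inference, compute a perceptual hash (PDQ for images, TMK+PDQF for video) and look it up in an index of previously actioned content. The index has to support near-match queries (hashes within a small Hamming distance), because a re-encoded copy rarely produces a bit-identical hash; a Bloom filter or exact-match store only catches copies whose hash is unchanged, so it works as a fast pre-filter rather than as the whole lookup. This catches re-uploads of known CSAM, viral misinformation frames, and copyright material in under 1ms. With a tight distance threshold the false positive rate is very low, because perceptual hashes tolerate minor re-encoding but differ widely between unrelated images. Every model inference saved here is a cost and latency win.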
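Because a re-encoded copy usually produces a hash that differs from the original in a few bits, the lookup is a distance comparison rather than an equality test. A minimal sketch, assuming 256-bit hashes held as Python integers and a distance cutoff of 31 bits (a value often quoted for PDQ, treated here as an assumption). The linear scan stands in for whatever near-neighbor index a real deployment would use, and `match_known_bad` is a hypothetical name.

```python
from typing import Iterable, Optional

HASH_BITS = 256      # PDQ hashes are 256 bits
MAX_DISTANCE = 31    # assumption: tune against your own false-positive data


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def match_known_bad(candidate: int, known_bad: Iterable[int],
                    max_distance: int = MAX_DISTANCE) -> Optional[int]:
    """Return the first known-bad hash within max_distance bits, else None."""
    for known in known_bad:
        if hamming(candidate, known) <= max_distance:
            return known
    return None


known = (1 << HASH_BITS) - 1          # toy hash: all 256 bits set
reencoded = known ^ 0b11111           # same image, 5 bits flipped
unrelated = 0                         # toy hash: differs in all 256 bits

print(match_known_bad(reencoded, [known]) == known)  # True
print(match_known_bad(unrelated, [known]))           # None
```

The scan is O(n) in the size of the database, so it only illustrates the matching rule; meeting a sub-millisecond budget at scale needs an index built for Hamming-distance search.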
Async vs synchronous enforcement: the product decision
Synchronous blocking (refuse upload until classification passes) gives clean UX but adds 200–500ms latency on every upload and blocks on classifier availability. Async post-publish (accept, classify, remove if bad) means harmful content is briefly live — acceptable for low-risk categories, unacceptable for CSAM. Most platforms use a hybrid: synchronous on upload for high-severity categories (CSAM, terrorism), async with rapid takedown for lower-severity policy violations.
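The hybrid can be sketched as an upload handler that runs only the high-severity checks inline and defers everything else. The names `handle_upload`, `UploadResult`, `sync_checks`, and `async_queue`, and the 0.9 blocking cutoff, are hypothetical; a plain list stands in for a real message queue.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class UploadResult:
    accepted: bool
    blocked_by: Optional[str]   # category that blocked the upload, if any


def handle_upload(
    item: bytes,
    sync_checks: Dict[str, Callable[[bytes], float]],  # e.g. {"csam": ..., "terrorism": ...}
    async_queue: List[bytes],                          # stand-in for a real queue
    block_at: float = 0.9,                             # illustrative cutoff
) -> UploadResult:
    # High-severity categories are checked before the content goes live.
    for category, check in sync_checks.items():
        if check(item) >= block_at:
            return UploadResult(accepted=False, blocked_by=category)

    # Lower-severity policies are classified after publish; a later
    # takedown handles anything they flag.
    async_queue.append(item)
    return UploadResult(accepted=True, blocked_by=None)
```

The sketch does not decide what happens when a synchronous classifier is unavailable. Failing closed (reject the upload) or failing open (accept and classify later) is itself a per-category product decision.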
Threshold calibration: precision-recall isn't free
A single global threshold for all content categories is a mistake. Violence and spam have very different error costs: wrongly removing a news video is a worse outcome than letting a bot through briefly. Each category should have its own threshold tuned against appeal rates (proxy for false positives) and escalation rates (proxy for false negatives). The right threshold drifts as the content distribution shifts, so regular (for example weekly) re-evaluation against production metrics is mandatory.
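One way to picture the periodic re-evaluation is a small rule that nudges a category's threshold from its proxy metrics. This is a deliberately crude sketch: `CategoryStats`, `suggest_threshold`, the target rates (2% overturned, 1% escalated), and the 0.02 step are all invented for illustration, and a real system would more likely refit against a labeled evaluation set.

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    auto_removed: int        # items removed by the model this period
    appeals_overturned: int  # removals reversed on appeal (false-positive proxy)
    allowed: int             # items the model let through this period
    escalations: int         # allowed items later escalated (false-negative proxy)


def suggest_threshold(current: float, stats: CategoryStats,
                      max_overturn_rate: float = 0.02,
                      max_escalation_rate: float = 0.01,
                      step: float = 0.02) -> float:
    overturn_rate = stats.appeals_overturned / max(stats.auto_removed, 1)
    escalation_rate = stats.escalations / max(stats.allowed, 1)

    if overturn_rate > max_overturn_rate:
        # Too many wrongful removals: require more confidence to act.
        return min(1.0, round(current + step, 2))
    if escalation_rate > max_escalation_rate:
        # Too much slipping through: act at lower confidence.
        return max(0.0, round(current - step, 2))
    return current


thresholds = {"violence": 0.90, "spam": 0.80}   # per-category, not global
stats = CategoryStats(auto_removed=1000, appeals_overturned=50,
                      allowed=10_000, escalations=10)
thresholds["violence"] = suggest_threshold(thresholds["violence"], stats)
print(thresholds["violence"])  # 0.92
```

When both proxies exceed their targets the rule favors reducing wrongful removals, which is a policy choice that could reasonably be reversed for high-severity categories.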
Human review queue: isolation prevents cross-contamination
If the human review queue is a single FIFO, a viral harmful event floods it and delays review of unrelated categories. Per-category priority queues with SLA targets isolate workloads. Reviewers seeing high volumes of traumatic content need exposure rotation — this is a product constraint that drives queue design. Blind inter-rater agreement on sampled items measures reviewer consistency and catches label drift in the training data.
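Queue isolation can be sketched with one bounded priority heap per category, so a flood in one category cannot consume another's capacity. `ReviewQueues` and its methods are hypothetical names, and the caps in the usage lines are arbitrary.

```python
import heapq
import itertools
from typing import Dict, List, Optional, Tuple


class ReviewQueues:
    """One bounded priority queue per category (higher priority pops first)."""

    def __init__(self, caps: Dict[str, int]) -> None:
        self._caps = caps
        self._heaps: Dict[str, List[Tuple[int, int, str]]] = {c: [] for c in caps}
        self._seq = itertools.count()   # tie-breaker keeps FIFO order within a priority

    def enqueue(self, category: str, item_id: str, priority: int) -> bool:
        heap = self._heaps[category]
        if len(heap) >= self._caps[category]:
            return False                # cap reached: caller must defer or shed
        # heapq is a min-heap, so negate priority to pop the highest first.
        heapq.heappush(heap, (-priority, next(self._seq), item_id))
        return True

    def next_item(self, category: str) -> Optional[str]:
        heap = self._heaps[category]
        return heapq.heappop(heap)[2] if heap else None


queues = ReviewQueues({"viral_meme": 2, "csam": 10})
queues.enqueue("viral_meme", "m1", priority=1)
queues.enqueue("viral_meme", "m2", priority=5)
print(queues.enqueue("viral_meme", "m3", priority=9))  # False: that queue is full
print(queues.enqueue("csam", "c1", priority=10))       # True: unaffected by the flood
print(queues.next_item("viral_meme"))                  # m2
```

Reviewers would pull from a category according to its SLA target and their rotation schedule, neither of which is modeled here.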
Decision logging as a retraining flywheel
Every auto-action and every human decision is a labeled training example. Auto-actions that stand are cheap positives; removals overturned on appeal are gold negatives (the model was wrong and a human said so). A pipeline that doesn't log structured outcomes to a feature store, or wherever training jobs can read them, is wasting its best signal. Active learning, which routes borderline-confidence items to humans first, gets you more informative labels per review hour than random sampling.
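A minimal sketch of what a structured decision record and a borderline-first selection might look like. The field names, `training_label`, and `most_uncertain` are assumptions for illustration, and treating 0.5 as the point of maximum uncertainty assumes reasonably calibrated scores.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class DecisionRecord:
    item_id: str
    category: str
    model_score: float
    action: str                      # "remove" or "allow"
    decided_by: str                  # "model" or "human"
    overturned_on_appeal: bool = False


def to_log_line(record: DecisionRecord) -> str:
    """Serialize one decision as a JSON line for the audit log."""
    return json.dumps(asdict(record), sort_keys=True)


def training_label(record: DecisionRecord) -> int:
    """1 = violating, 0 = not violating; an overturned decision flips the label."""
    violating = record.action == "remove"
    if record.overturned_on_appeal:
        violating = not violating
    return int(violating)


def most_uncertain(scored: List[Tuple[str, float]], budget: int) -> List[str]:
    """Pick the items whose scores sit closest to 0.5 for human labeling."""
    ranked = sorted(scored, key=lambda pair: abs(pair[1] - 0.5))
    return [item_id for item_id, _ in ranked[:budget]]


wrong = DecisionRecord("a1", "violence", 0.93, "remove", "model",
                       overturned_on_appeal=True)
print(training_label(wrong))  # 0
print(most_uncertain([("a", 0.95), ("b", 0.52), ("c", 0.10), ("d", 0.45)], 2))
# ['b', 'd']
```

The overturned removal with a score of 0.93 is the kind of high-confidence mistake that is most valuable to feed back into training.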
What breaks at scale
Adversarial evasion is the primary scaling failure: once bad actors have probed what your hash database or model thresholds let through, they apply small perturbations (noise, crop, color shift) to defeat detection. Ensemble models with diverse architectures raise the evasion cost, but it's an arms race. The second failure is queue starvation during viral events: a single high-volume harmful meme can fill the human review queue and delay unrelated high-severity items. The operational levers are preemptive capacity scaling (pulling in extra reviewers from other pools) and per-category SLAs with hard caps.
In production
Meta's content moderation stack layers PDQ perceptual hashing (instant known-bad match), NSFW classifiers (ResNet/CLIP-based), NLP models for text (XLM-R for multilingual), and a human review tier sourced from BPO vendors. YouTube's approach adds audio and video fingerprinting (Content ID) for copyright. The real challenge is multilingual and cultural context: a slur in one language is a common word in another, and an ML model that reaches roughly 95% accuracy on English data can drop to around 70% on low-resource languages. Routing borderline low-resource content to specialized reviewers is therefore architecturally distinct from the main pipeline.
Common mistakes
- Human-reviewing everything (doesn't scale)
- No hash matching for known content
- No audit trail/appeals path