System Design Library

Fraud Detection

Flag fraudulent transactions in real time without blocking good users.


Requirements

Functional

  • Score transactions live
  • Rules + ML model
  • Feature lookups
  • Feedback loop

Non-functional

  • Low latency (<100ms)
  • Low false-positive rate

Scale

Millions of transactions/sec

The approach

A streaming scorer combines fast rules with an ML model, using features from a low-latency feature store (velocity, history, device). High-risk transactions go to manual review or step-up authentication, and labeled outcomes are fed back to retrain the model.

Key components

Txn stream → feature store → rules + model → decision · review queue

Numbers that matter

Senior deep-dive

Feature freshness, not offline model accuracy, is the operational bottleneck. A model trained on last month's data already lags today's fraud patterns, and stale serving features widen that gap: how quickly velocity counters and behavioral features update in the feature store determines real-world precision.

The fraud check cannot consume the authorization budget: payment authorizations have a ~100–200ms SLA, so the ML scorer must return in <50ms. That means lightweight lookups of pre-materialized features, not on-the-fly aggregation over raw events.

False positives are the silent business killer: a 1% false-positive rate on a payment processor doing 10M transactions/day blocks roughly 100k legitimate purchases every day. The cost of false positives often exceeds the cost of missed fraud.
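The arithmetic behind that figure can be checked with a short sketch. The function name and the 0.1% fraud rate in the second call are assumptions made for illustration; the text only gives the 1% false-positive rate and the 10M transactions/day volume.

```python
def blocked_legit_per_day(daily_txns: int, fraud_rate: float,
                          false_positive_rate: float) -> int:
    """Legitimate transactions wrongly blocked per day.

    The false-positive rate applies to legitimate transactions only,
    so the fraudulent share is removed first.
    """
    legit = daily_txns * (1.0 - fraud_rate)
    return round(legit * false_positive_rate)


# Upper bound used in the text: treat every transaction as legitimate.
print(blocked_legit_per_day(10_000_000, 0.0, 0.01))    # 100000

# Hypothetical 0.1% fraud rate (an assumption): the figure barely moves.
print(blocked_legit_per_day(10_000_000, 0.001, 0.01))  # 99900
```

Because fraud is a small share of traffic, the false-positive count is driven almost entirely by total volume, which is why small rate changes translate into large absolute numbers.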

Rules engine first: cheap, fast, interpretable

Before invoking an ML model, a rule engine evaluates deterministic signals: blacklisted IPs, velocity rules (>10 transactions in 5 minutes from one card), known-fraudulent BINs, mismatched billing ZIP codes. Rules fire in <1ms and block a large fraction of obvious fraud with no model overhead. Rules are also interpretable: you can state exactly why a transaction was declined, which matters wherever regulation or dispute handling calls for a reason (EU PSD2, for example, requires notifying the payment service user of a refused payment order and, where possible, the reasons). The ML model handles the gray zone that rules can't classify with confidence. Rules also provide a safety net when the ML model is unavailable or degraded.
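A minimal sketch of such a rule pass, assuming a hypothetical Txn record that already carries a pre-computed 5-minute velocity count. The field names, deny-list values, and reason strings are all illustrative and not taken from any real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Txn:
    card_id: str
    ip: str
    card_bin: str       # leading digits of the card number
    billing_zip: str
    card_zip: str       # ZIP on file for the card
    txns_last_5m: int   # velocity feature, assumed pre-computed upstream


# Hypothetical deny lists; a real system would load these from configuration.
BLOCKED_IPS = {"203.0.113.7"}
FRAUD_BINS = {"411111"}

# Each rule returns a human-readable reason when it fires, else None.
Rule = Callable[[Txn], Optional[str]]

RULES: List[Rule] = [
    lambda t: "ip_blacklisted" if t.ip in BLOCKED_IPS else None,
    lambda t: "velocity_gt_10_in_5m" if t.txns_last_5m > 10 else None,
    lambda t: "known_fraud_bin" if t.card_bin in FRAUD_BINS else None,
    lambda t: "billing_zip_mismatch" if t.billing_zip != t.card_zip else None,
]


def evaluate_rules(txn: Txn) -> List[str]:
    """Return the reason for every rule that fired (empty list: no rule hit)."""
    return [reason for rule in RULES if (reason := rule(txn)) is not None]


txn = Txn("card_1", "198.51.100.4", "555555", "94107", "94107", txns_last_5m=12)
print(evaluate_rules(txn))  # ['velocity_gt_10_in_5m']
```

Returning reason strings rather than a bare boolean is what keeps the decision explainable: the same list can be logged, shown to support staff, or attached to a decline notice.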

Feature store is the system's beating heart

The ML model is only as good as its features. Velocity features (count of transactions per card in the last 1/5/60 minutes) are the most predictive but require real-time updates, typically a sliding-window counter in Redis updated atomically on every transaction. Behavioral features (average transaction amount, typical merchant category, device fingerprint) are computed offline in a batch pipeline and materialized to the feature store. The feature store exposes a <5ms read API that the scoring service calls synchronously. Feature skew (training features computed differently from serving features) is a leading source of unexplained model degradation in production.
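The sliding-window velocity counter can be sketched in memory as follows. This is a single-process stand-in, not the Redis implementation the text refers to: the class and method names are hypothetical, and timestamps are assumed to arrive in non-decreasing order per card. In Redis, one common mapping (an assumption here, since the text does not specify one) is a sorted set per card scored by timestamp.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class VelocityCounter:
    """In-memory stand-in for a per-card sliding-window transaction counter."""

    def __init__(self, max_window_s: float = 3600.0) -> None:
        # Longest window any feature needs (60 minutes in the text).
        self.max_window_s = max_window_s
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, card_id: str, ts: float) -> None:
        """Store a transaction timestamp and evict ones outside the longest window."""
        q = self._events[card_id]
        q.append(ts)
        while q and q[0] <= ts - self.max_window_s:
            q.popleft()

    def count(self, card_id: str, window_s: float, now: float) -> int:
        """Count transactions with now - window_s < ts <= now."""
        q = self._events.get(card_id, ())
        return sum(1 for ts in q if now - window_s < ts <= now)


vc = VelocityCounter()
for ts in (0, 30, 200, 290, 400):
    vc.record("card_1", ts)

print(vc.count("card_1", 60, now=400))   # 1  (only the transaction at t=400)
print(vc.count("card_1", 300, now=400))  # 3  (t=200, 290, 400)
```

Keeping raw timestamps lets one structure serve the 1, 5, and 60 minute windows at the cost of memory per card; bucketed counters trade some precision for a fixed footprint.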

Scoring can run asynchronously to the authorization decision

For low-risk transactions, the fraud scorer can run after the authorization is issued ('post-auth scoring') so that it does not block the user. The authorization is approved optimistically; if the scorer returns HIGH_RISK within a short window (e.g. 5 seconds), a reversal is issued. This suits high-volume, low-value flows where a synchronous 50ms ML call on every transaction is hard to justify. Pre-auth scoring is reserved for high-value or anomalous transactions where the expected cost of a chargeback exceeds the cost of a false decline. The choice between pre- and post-auth scoring must be made in the rule engine before calling the ML model.
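The routing decision and the reversal check might look like the sketch below. The amount cutoff, function names, and the use of rule hits as the "anomalous" signal are assumptions for illustration; only the HIGH_RISK label and the 5-second window come from the text.

```python
from enum import Enum
from typing import List


class ScoringMode(Enum):
    PRE_AUTH = "pre_auth"    # score synchronously, before authorizing
    POST_AUTH = "post_auth"  # authorize optimistically, score afterwards


# Hypothetical cutoff; in practice it would be derived from chargeback economics.
HIGH_VALUE_CENTS = 50_000


def choose_scoring_mode(amount_cents: int, rule_hits: List[str]) -> ScoringMode:
    """Send high-value or anomalous transactions to synchronous scoring."""
    if amount_cents >= HIGH_VALUE_CENTS or rule_hits:
        return ScoringMode.PRE_AUTH
    return ScoringMode.POST_AUTH


def needs_reversal(mode: ScoringMode, risk_label: str,
                   seconds_since_auth: float,
                   reversal_window_s: float = 5.0) -> bool:
    """Post-auth only: reverse when HIGH_RISK arrives inside the window."""
    return (mode is ScoringMode.POST_AUTH
            and risk_label == "HIGH_RISK"
            and seconds_since_auth <= reversal_window_s)


print(choose_scoring_mode(1_200, []))                      # ScoringMode.POST_AUTH
print(choose_scoring_mode(75_000, []))                     # ScoringMode.PRE_AUTH
print(needs_reversal(ScoringMode.POST_AUTH, "HIGH_RISK", 2.0))  # True
```

Keeping the mode choice in a pure function of cheap inputs means it can run inside the rule engine without any model or feature-store call.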

Model output is a risk score, not a binary decision

The ML model outputs a continuous risk score (0.0–1.0), not 'fraud/not fraud'. Business logic translates scores to actions: score <0.2 → approve, 0.2–0.7 → step-up auth (3DS challenge, SMS OTP), >0.7 → decline. This three-way split allows the business to tune the false-positive/false-negative tradeoff independently of the model. The threshold values must be monitored continuously: shifts in fraud patterns change the score distribution, so a fixed threshold drifts away from its intended operating point. Run A/B tests on threshold changes before deploying globally, because a misconfigured threshold can decline a large share of legitimate transactions within minutes of rollout.
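The score-to-action mapping is small enough to sketch directly. The Thresholds type and decide function are hypothetical names; the default cutoffs are the 0.2 and 0.7 values from the text, with boundary scores treated as step-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    step_up: float = 0.2  # scores below this are approved
    decline: float = 0.7  # scores above this are declined


def decide(score: float, t: Thresholds = Thresholds()) -> str:
    """Map a model risk score in [0.0, 1.0] to a business action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"risk score out of range: {score}")
    if score < t.step_up:
        return "approve"
    if score > t.decline:
        return "decline"
    return "step_up"  # 3DS challenge, SMS OTP, etc.


print(decide(0.05))  # approve
print(decide(0.45))  # step_up
print(decide(0.92))  # decline

# A stricter configuration can be trialled without touching the model.
print(decide(0.65, Thresholds(step_up=0.1, decline=0.6)))  # decline
```

Because the thresholds are plain data, an A/B test of a threshold change amounts to passing a different Thresholds value for the treatment group.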

Labeled feedback loop is the model's immune system

Fraud labels arrive days to weeks after the transaction (chargebacks take time to process). The pipeline must join chargeback events back to the original transaction to create labeled training samples. Survivorship bias is the hidden trap: the model never observes outcomes for transactions it declined, so it cannot learn from its own false positives. Running shadow mode (score but don't act) on a small sample of high-confidence declines, so that their real outcomes are observed, is the most direct way to measure the false-positive rate accurately. Without a feedback loop, model precision degrades silently over 3–6 months as fraud patterns shift.
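A minimal sketch of the label join, assuming transactions and chargebacks are plain dictionaries keyed by a hypothetical txn_id field. The 60-day maturity period is an assumption chosen for illustration; the text only says labels arrive days to weeks later.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def build_labels(transactions: List[Dict], chargebacks: List[Dict],
                 as_of: datetime,
                 maturity: timedelta = timedelta(days=60)) -> List[Tuple[str, int]]:
    """Join chargebacks to transactions and emit (txn_id, label) pairs.

    label 1 = charged back, label 0 = no chargeback after the maturity period.
    """
    fraud_ids = {cb["txn_id"] for cb in chargebacks}
    labeled = []
    for txn in transactions:
        if txn["decision"] == "decline":
            continue  # no outcome was ever observed: the survivorship gap
        if txn["txn_id"] in fraud_ids:
            labeled.append((txn["txn_id"], 1))
        elif as_of - txn["ts"] >= maturity:
            labeled.append((txn["txn_id"], 0))
        # otherwise too recent: a chargeback may still arrive, so skip for now
    return labeled


now = datetime(2024, 6, 1)
transactions = [
    {"txn_id": "t1", "ts": datetime(2024, 1, 10), "decision": "approve"},
    {"txn_id": "t2", "ts": datetime(2024, 1, 12), "decision": "approve"},
    {"txn_id": "t3", "ts": datetime(2024, 5, 30), "decision": "approve"},
    {"txn_id": "t4", "ts": datetime(2024, 1, 15), "decision": "decline"},
]
chargebacks = [{"txn_id": "t2"}]

print(build_labels(transactions, chargebacks, as_of=now))
# [('t1', 0), ('t2', 1)]
```

The declined transaction t4 never appears in the output, which is the survivorship gap in concrete form: only a holdout of would-be declines can produce labels for that region.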

What breaks at scale

The catastrophic failure is a feature store latency spike during a traffic surge (Black Friday): Redis CPU saturates, feature reads degrade from 3ms to 50ms, and the entire fraud scoring pipeline breaches its SLA. The mitigation is to cache frequently read features in the scoring service's process memory with a 1-second TTL; slightly stale velocity counts are preferable to timeouts. The second failure mode is model-serving OOM: gradient-boosted trees with 10k features and 1000 trees at 10k RPS require careful memory budgeting, and a 2GB model artifact loaded into each scoring pod can crash the pod on startup if it exceeds the pod's memory limit. Testing model artifact size in CI before every model deploy is a non-optional operational practice.
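The in-process cache can be sketched as a small read-through wrapper. The TTLCache class, its loader callback, and the injectable clock are hypothetical; the sketch has no size bound or eviction, and it is not thread-safe, both of which a real scoring service would need.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Read-through cache: serve a local copy until it is older than ttl_s."""

    def __init__(self, loader: Callable[[Hashable], Any], ttl_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader  # e.g. a function that reads the feature store
        self._ttl_s = ttl_s
        self._clock = clock    # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]  # fresh enough: no remote read
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value


# Stand-in for a remote feature read; counts how often it is actually called.
remote_reads = []

def load_features(card_id):
    remote_reads.append(card_id)
    return {"txns_last_5m": 3}

cache = TTLCache(load_features, ttl_s=1.0)
cache.get("card_1")
cache.get("card_1")       # served from process memory
print(len(remote_reads))  # 1
```

With a 1-second TTL, a hot key costs at most one remote read per second per process regardless of request rate, which is what caps load on the feature store during a surge.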

In production

Stripe Radar has used gradient-boosted trees (XGBoost-class) with hundreds of engineered features (transaction velocity, device fingerprint, BIN patterns, address mismatch scores), all pre-computed and stored in a low-latency feature store. PayPal is described as running a two-stage system: a fast rule engine (<5ms) blocks obvious patterns (same IP + 20 cards in 1 hour), followed by an ML model (~30ms) for ambiguous cases. The real challenge is concept drift: fraud patterns shift weekly as attackers adapt, and a model that is 95% accurate today may, for illustration, drop to 85% within 30 days without retraining. That makes the model monitoring and retraining pipeline as critical as the model itself.

Common mistakes
