System Design Library

Fraud Detection

Flag fraudulent transactions in real time without blocking good users.

Requirements

Functional

Score transactions live
Rules + ML model
Feature lookups
Feedback loop

Non-functional

Low latency (<100ms)
Low false-positive rate

Scale

Millions of transactions/sec

The approach

A streaming scorer combines fast rules with an ML model, using features from a low-latency feature store (velocity, history, device). High-risk transactions go to manual review or step-up authentication, and labeled outcomes feed back into retraining the model.

Key components

Txn stream → feature store → rules + model → decision · review queue

Numbers that matter

Card-network (Visa/Mastercard) authorization budgets are roughly 100–150ms end-to-end; fraud scoring must complete in <50ms, leaving ~50–100ms for network, auth logic, and response. Latency is therefore the binding constraint on model complexity.
Commonly cited fraud rates for card-not-present (e-commerce) transactions run 0.5–1.5% of transaction value; a well-tuned model can reduce chargebacks by 60–80% while keeping the false positive rate below 0.3%
Feature store read latency must be <5ms (p99) to fit in the fraud-check budget; this requires in-memory storage (Redis) for hot features like 'transactions in last 5 minutes' rather than a DB query
Retraining cycles at major fraud shops run daily to hourly for gradient-boosted models; deep learning models retrain weekly due to compute cost, with online learning for fast feature weight updates
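The budget figures above can be checked with simple arithmetic. This is a minimal sketch: the 50ms scoring ceiling and the 5ms feature read come from the list above, while the rules, model, and overhead splits (and all names) are illustrative assumptions.

```python
# Per-stage budgets in milliseconds. Only the 5 ms feature read and the
# 50 ms scoring ceiling come from the text; the other splits are assumptions.
SCORING_BUDGET_MS = 50.0

STAGE_BUDGET_MS = {
    "feature_store_read": 5.0,
    "rules_engine": 1.0,
    "model_inference": 30.0,
    "serialization_and_overhead": 4.0,
}


def remaining_headroom(stages: dict, budget_ms: float) -> float:
    """Milliseconds of the scoring budget left after all stages are allocated."""
    return budget_ms - sum(stages.values())


def fits_authorization_sla(scoring_ms: float, other_ms: float, sla_ms: float) -> bool:
    """True if scoring plus everything else (network, auth logic) fits the SLA."""
    return scoring_ms + other_ms <= sla_ms


headroom = remaining_headroom(STAGE_BUDGET_MS, SCORING_BUDGET_MS)
```

With these assumed splits, any headroom left over is what remains for retries or a slower model; a negative value would mean the model is too heavy for the budget.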

Senior deep-dive

Feature freshness, not model accuracy, is the operational bottleneck: a model trained on last month's data already lags today's fraud patterns, and stale features widen the gap. How quickly the feature store refreshes velocity counters and behavioral features determines real-world precision.

The fraud check cannot hold up the payment: authorizations have a ~100–150ms SLA, so the ML scorer must return in <50ms. That means lightweight lookups of pre-materialized features, not on-the-fly aggregation over raw events.

False positives are the silent business killer: a 1% false positive rate on a payment processor doing 10M transactions/day blocks roughly 100k legitimate purchases every day. Once lost sales and customer churn are counted, the cost of false positives often exceeds the cost of missed fraud.
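The false-positive arithmetic above can be reproduced in a few lines. This is a sketch only: the average ticket and margin in the usage line are hypothetical inputs, and the calculation treats nearly all traffic as legitimate, which is a simplification.

```python
def blocked_legitimate(txns_per_day: int, false_positive_rate: float) -> int:
    """Legitimate purchases declined per day.

    Simplification: treats all traffic as legitimate, which is close enough
    when fraud is a small fraction of volume.
    """
    return round(txns_per_day * false_positive_rate)


def daily_false_decline_cost(blocked: int, avg_ticket: float, margin: float) -> float:
    """Margin lost per day to false declines, ignoring churn and support costs."""
    return blocked * avg_ticket * margin


blocked = blocked_legitimate(10_000_000, 0.01)
# Hypothetical inputs: a 40.00 average ticket and a 2.5% margin per transaction.
lost_margin = daily_false_decline_cost(blocked, avg_ticket=40.0, margin=0.025)
```

Comparing `lost_margin` with the expected daily fraud loss at the same threshold is one way to frame the tradeoff; the real comparison also needs churn, which this sketch leaves out.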

Rules engine first: cheap, fast, interpretable

Before invoking an ML model, a rule engine evaluates deterministic signals: blocklisted IPs, velocity rules (>10 transactions in 5 minutes from one card), known-fraudulent BINs, mismatched billing ZIP codes. Rules fire in <1ms and block a large fraction of obvious fraud with no model overhead. Rules are interpretable: you can tell a customer exactly why a transaction was declined, which simplifies support and disputes and helps with regulatory expectations in some jurisdictions (EU PSD2, for example, expects providers to give the reason for a refused payment where possible). The ML model handles the gray zone that rules can't classify with confidence. Rules also provide a safety net when the ML model is unavailable or degraded.
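A rule engine of the kind described above might be sketched as an ordered list of named predicates, each carrying a human-readable decline reason. The `Txn` fields, the rule names, and the blocklist contents are all hypothetical; the velocity count is assumed to arrive pre-computed from the feature store.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Txn:
    card_id: str
    ip: str
    card_bin: str
    billing_zip: str
    zip_on_file: str
    txns_last_5_min: int  # assumed to be pre-materialized by the feature store


@dataclass(frozen=True)
class Rule:
    name: str
    reason: str  # shown to support staff or the customer on decline
    predicate: Callable[[Txn], bool]


BLOCKED_IPS = {"203.0.113.7"}  # hypothetical blocklist entry
FRAUD_BINS = {"400000"}        # hypothetical known-fraudulent BIN

RULES = [
    Rule("blocked_ip", "IP address is on the blocklist",
         lambda t: t.ip in BLOCKED_IPS),
    Rule("velocity", "More than 10 transactions in 5 minutes on this card",
         lambda t: t.txns_last_5_min > 10),
    Rule("fraud_bin", "Card BIN is flagged as fraudulent",
         lambda t: t.card_bin in FRAUD_BINS),
    Rule("zip_mismatch", "Billing ZIP does not match the ZIP on file",
         lambda t: t.billing_zip != t.zip_on_file),
]


def first_match(txn: Txn, rules=RULES) -> Optional[Rule]:
    """Return the first rule that fires, or None to pass the txn to the model."""
    for rule in rules:
        if rule.predicate(txn):
            return rule
    return None
```

Returning the matched `Rule` rather than a boolean keeps the decline reason attached to the decision, which is what makes the outcome explainable.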

Feature store is the system's beating heart

The ML model is only as good as its features. Velocity features (count of transactions per card in the last 1/5/60 minutes) are the most predictive but require real-time updates: a sliding-window counter in Redis, updated atomically on every transaction. Behavioral features (average transaction amount, typical merchant category, device fingerprint) are computed offline in a batch pipeline and materialized to the feature store. The feature store exposes a <5ms read API that the scoring service calls synchronously. Training-serving skew (training features computed differently from serving features) is a leading source of unexplained model degradation in production.
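The sliding-window velocity counter can be illustrated in-process. This sketch uses a Python deque per card as a stand-in for Redis; a Redis deployment might instead keep a sorted set per card (for example with ZADD, ZREMRANGEBYSCORE, and ZCARD), but that mapping is an assumption, not something the text specifies. The class and method names are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per card within the trailing `window_seconds`."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # card_id -> timestamps, oldest first

    def _evict(self, card_id: str, now: float) -> None:
        """Drop timestamps that have fallen out of the window."""
        q = self._events[card_id]
        cutoff = now - self.window_seconds
        while q and q[0] <= cutoff:
            q.popleft()

    def record(self, card_id: str, now: float) -> None:
        """Register one transaction for the card at time `now` (seconds)."""
        self._evict(card_id, now)
        self._events[card_id].append(now)

    def count(self, card_id: str, now: float) -> int:
        """Number of transactions for the card in the trailing window."""
        self._evict(card_id, now)
        return len(self._events[card_id])


five_min = SlidingWindowCounter(window_seconds=300)
for ts in (0.0, 100.0, 200.0):
    five_min.record("card-1", ts)
```

A production counter would also need atomic updates across scoring instances, which is the reason the text places it in a shared store rather than in process memory.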

Scoring can run async from the authorization decision

For low-risk transactions, the fraud scorer can run after the authorization is issued ('post-auth scoring'), so it does not block the user. The authorization is approved optimistically; if the scorer returns HIGH_RISK within a short window (e.g. 5 seconds), a reversal is issued. Some providers that handle large volumes of small payments, such as buy-now-pay-later services, are said to work this way because a 50ms ML call on every micro-transaction is too expensive. Pre-auth scoring is reserved for high-value or anomalous transactions where the cost of a chargeback exceeds the cost of a false decline. The choice between pre- and post-auth scoring must be made in the rule engine before the ML model is called.
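The pre- versus post-auth routing described above might look like the following sketch. The 5-second reversal window and the HIGH_RISK label come from the text; the high-value threshold, the `is_anomalous` flag, and the function names are assumptions made for illustration.

```python
from enum import Enum


class ScoringMode(Enum):
    PRE_AUTH = "pre_auth"    # score before approving; the user waits
    POST_AUTH = "post_auth"  # approve optimistically, score afterwards


HIGH_VALUE_THRESHOLD = 500.00   # hypothetical cutoff, in account currency
REVERSAL_WINDOW_SECONDS = 5.0   # "short window" example from the text


def choose_scoring_mode(amount: float, is_anomalous: bool) -> ScoringMode:
    """Decide in the rule engine whether the ML score must block the auth."""
    if amount >= HIGH_VALUE_THRESHOLD or is_anomalous:
        return ScoringMode.PRE_AUTH
    return ScoringMode.POST_AUTH


def should_reverse(risk_label: str, seconds_since_auth: float) -> bool:
    """Reverse an optimistic approval if HIGH_RISK arrives inside the window."""
    return risk_label == "HIGH_RISK" and seconds_since_auth <= REVERSAL_WINDOW_SECONDS
```

What happens when a HIGH_RISK score arrives after the window closes is a policy question the text leaves open; this sketch simply does not reverse in that case.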

Model output is a risk score, not a binary decision

The ML model outputs a continuous risk score (0.0–1.0), not 'fraud/not fraud'. Business logic translates scores to actions: score <0.2 → approve, 0.2–0.7 → step-up auth (3DS challenge, SMS OTP), >0.7 → decline. This three-way split allows the business to tune the false-positive/false-negative tradeoff independently of the model. The threshold values must be monitored continuously — fraud pattern shifts change the score distribution, so a fixed threshold drifts from its intended operating point. Run A/B tests on threshold changes before deploying globally; a misconfigured threshold once caused a major payment processor to decline 15% of legitimate transactions for 45 minutes.
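The score-to-action mapping above can be expressed as a small function whose thresholds are parameters rather than constants, so they can be tuned without retraining. The 0.2 and 0.7 defaults come from the text; the `Action` enum and the function name are illustrative.

```python
from enum import Enum


class Action(Enum):
    APPROVE = "approve"
    STEP_UP = "step_up"   # e.g. 3DS challenge or SMS OTP
    DECLINE = "decline"


def decide(score: float, approve_below: float = 0.2, decline_above: float = 0.7) -> Action:
    """Translate a risk score in [0.0, 1.0] into a business action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"risk score out of range: {score}")
    if score < approve_below:
        return Action.APPROVE
    if score > decline_above:
        return Action.DECLINE
    return Action.STEP_UP
```

Scores exactly on a boundary (0.2 or 0.7) fall into the step-up band here, matching the ranges given in the text. Passing different thresholds per experiment arm is one way to A/B test a change before rolling it out globally.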

Labeled feedback loop is the model's immune system

Fraud labels arrive days to weeks after the transaction because chargebacks take time to process. The pipeline must join chargeback events back to the original transaction to create labeled training samples. Selection bias (often described as survivorship bias) is the hidden trap: the model never sees outcomes for transactions it declined, so it cannot learn from its own false positives. Running shadow mode (score but don't act) on a small sample of would-be declines, so that they are approved and their outcomes observed, is the most direct way to measure the false positive rate. Without a feedback loop, model precision degrades silently over 3–6 months as fraud patterns shift.
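The label join described above might be sketched as follows. The record shape, the function names, and the 45-day maturity window are assumptions; the point is that a transaction only becomes a negative example once enough time has passed for a chargeback to have arrived, and that declined transactions produce no label at all.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class ScoredTxn:
    txn_id: str
    age_days: int    # days since the transaction was processed
    approved: bool   # False means declined, so no outcome was ever observed


LABEL_MATURITY_DAYS = 45  # hypothetical wait before "no chargeback" means legitimate


def label_for(txn: ScoredTxn, chargeback_ids: Set[str]) -> Optional[int]:
    """Return 1 (fraud), 0 (legitimate), or None when no label is available yet."""
    if not txn.approved:
        return None  # declined: outcome unknown, the source of selection bias
    if txn.txn_id in chargeback_ids:
        return 1
    if txn.age_days >= LABEL_MATURITY_DAYS:
        return 0
    return None  # too recent; a chargeback may still arrive


def build_training_set(txns: Iterable[ScoredTxn], chargeback_ids: Set[str]) -> Dict[str, int]:
    """Join chargebacks to transactions and keep only the rows with a label."""
    labeled = {}
    for txn in txns:
        label = label_for(txn, chargeback_ids)
        if label is not None:
            labeled[txn.txn_id] = label
    return labeled
```

Transactions approved through a holdout sample would flow through the same join with `approved=True`, which is how would-be declines eventually acquire labels.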

What breaks at scale

The catastrophic failure is a feature store latency spike during a traffic surge (Black Friday): Redis CPU saturates, feature reads degrade from 3ms to 50ms, and the entire fraud scoring pipeline breaches its SLA. The mitigation is to cache frequently read features in the scoring service's process memory with a 1-second TTL; slightly stale velocity counts are preferable to timeouts. The second failure mode is model-serving OOM: gradient-boosted trees with 10k features and 1000 trees at 10k RPS require careful memory budgeting, and a 2GB model artifact loaded into each scoring pod can crash the pod on startup if it exceeds the pod's memory limit. Testing model artifact size in CI before every model deploy is a non-optional operational practice.
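The local-cache mitigation above can be sketched as a small read-through cache with a 1-second TTL. The `loader` callable stands in for the remote feature store read and is hypothetical, as are the class and parameter names; the injectable clock exists only to make the behavior easy to test.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Read-through cache: serve a local copy until it is older than the TTL."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float = 1.0,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader  # hypothetical remote feature store read
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (loaded_at, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh enough: skip the remote read
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value


# Usage with a fake clock and a loader that records each remote read.
remote_reads = []
fake_now = [0.0]
cache = TTLCache(
    loader=lambda key: remote_reads.append(key) or len(remote_reads),
    ttl_seconds=1.0,
    clock=lambda: fake_now[0],
)
first = cache.get("card-1:txns_5m")
fake_now[0] = 0.5
second = cache.get("card-1:txns_5m")   # served locally, no remote read
fake_now[0] = 1.5
third = cache.get("card-1:txns_5m")    # TTL expired, reloaded
```

This sketch never evicts entries, so a real scoring service would also need a size bound, and it would have to decide what to serve when the loader itself times out.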

In production

Stripe Radar is reported to use a gradient-boosted tree ensemble (XGBoost-class) with hundreds of engineered features (transaction velocity, device fingerprint, BIN patterns, address mismatch scores), all pre-computed and stored in a low-latency feature store. PayPal is described as running a two-stage system: a fast rule engine (<5ms) blocks obvious patterns (same IP + 20 cards in 1 hour), followed by an ML model (~30ms) for ambiguous cases. The real challenge is concept drift: fraud patterns shift weekly as attackers adapt, and a model that is 95% accurate today may drop to 85% within 30 days without retraining. That makes the monitoring and retraining pipeline as critical as the model itself.

Common mistakes

Computing features synchronously (too slow)
Hard-blocking on borderline scores
No feedback loop (model goes stale)

Part of System Design Library on SystemLore.