A/B Testing Platform
Assign users to experiment variants and measure impact correctly.
Requirements
Functional
- Define experiments
- Deterministic bucketing
- Track exposure + metrics
- Stats/significance
Non-functional
- Consistent assignment
- Low overhead
- Trustworthy results
Scale
Many concurrent experiments
The approach
Deterministic hashing (hash(user+experiment) → bucket) gives stable, fast, offline-computable assignment; exposure + metric events flow to an analytics pipeline; a stats engine computes significance; guardrails detect conflicts/SRM.
Key components
Assignment SDK (hash) · exposure events → analytics → stats engine
Numbers that matter
- Deterministic hash assignment adds <1ms to any request: it's a single hash computation with no network call, which makes it well suited to high-throughput systems
- A typical experiment needs ~10,000–100,000 users per variant to detect a 5% relative change in a conversion metric at 80% power and 95% confidence; the exact number depends on the baseline rate. Smaller experiments are underpowered and produce noise
- Facebook runs >10,000 concurrent A/B experiments at any given time; a layered namespace system allows thousands of independent experiments without mutual interference
- Novelty effect inflates metrics for the first 3–7 days of any UI experiment; experiments must run a minimum of 1–2 weeks to wash out novelty bias, even if they reach statistical significance earlier
Senior deep-dive
Assignment must be deterministic and stateless: `hash(user_id + experiment_id) % 100` gives the same bucket on every call, on every server, with no DB lookup. Given a stable `user_id` (e.g. a logged-in account ID rather than a per-device cookie), this keeps the experience consistent across sessions and devices.
The stats engine, not the assignment system, is where A/B testing fails — p-hacking via peeking (stopping an experiment the moment it looks significant) is rampant and produces false discoveries; valid analysis requires pre-registered sample sizes and hold-to-completion discipline.
Experiment interactions (mutual exclusion and layering) are the operational complexity that kills velocity at scale — a user in experiment A and experiment B simultaneously may receive a confounded experience, requiring an experiment namespace architecture to prevent collisions.
Deterministic hashing: zero infrastructure for assignment
The assignment function `bucket = hash(user_id + experiment_id) % num_buckets` is the entire assignment system for simple cases. It's idempotent (same inputs → same output), stateless (no DB lookup), and consistent across platforms (same hash function in server SDK + mobile SDK + web SDK). The `experiment_id` salt ensures a user in bucket 42 for experiment A is not automatically in bucket 42 for experiment B, preventing systematic correlation. Murmur3 or xxHash are preferred over MD5/SHA1 for speed; crypto-grade hashing is unnecessary overhead. The only state needed is the experiment definition: a small ruleset that SDKs can cache locally and refresh when the definition changes (e.g. on ramp-up or shutdown).
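A minimal sketch of the bucketing function described above, using only the Python standard library. It substitutes SHA-256 for Murmur3/xxHash purely because `hashlib` ships with Python; the function names, the `:` separator, and the 50% default split are assumptions made for illustration.

```python
import hashlib

def assign_bucket(user_id: str, experiment_id: str, num_buckets: int = 100) -> int:
    """Map (user_id, experiment_id) to a stable bucket in [0, num_buckets)."""
    # The ":" separator reduces the chance that two different
    # (experiment_id, user_id) pairs concatenate to the same string.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer; any fixed slice works.
    return int.from_bytes(digest[:8], "big") % num_buckets

def assign_variant(user_id: str, experiment_id: str, treatment_pct: int = 50) -> str:
    """Buckets below treatment_pct get the treatment; the rest are control."""
    bucket = assign_bucket(user_id, experiment_id)
    return "treatment" if bucket < treatment_pct else "control"
```

Calling `assign_variant("user-1", "exp-A")` on any server or SDK that implements the same hash returns the same variant, with no lookup. Every SDK must use the same byte encoding and the same digest slice, or platforms will disagree on buckets.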
Exposure logging is the measurement contract
An experiment only 'counts' a user when they encounter the treated surface, not when they're assigned to a bucket. If 10% of assigned users never visit the page being tested, including them dilutes the measured effect (intent-to-treat vs. per-protocol analysis). Exposure events (`{user_id, experiment_id, variant, timestamp}`) must be logged at the moment the variant is shown, not at session start. This exposure log is the join key for metric computation: `metric_value JOIN exposure ON user_id WHERE experiment_id = X`. Missing or duplicate exposure logs are the most common source of metric computation errors in experimentation platforms.
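The join described above can be sketched in plain Python as follows. The exposure fields match the event shape given in the text; the metric event shape (`user_id`, `value`, `timestamp`) and both function names are assumptions for illustration. The sketch dedupes repeated exposures by keeping each user's earliest one and ignores metric events that precede it.

```python
from collections import defaultdict

def first_exposures(exposure_events, experiment_id):
    """Keep the earliest exposure per user for one experiment (dedupes repeats)."""
    first = {}
    for e in exposure_events:
        if e["experiment_id"] != experiment_id:
            continue
        prev = first.get(e["user_id"])
        if prev is None or e["timestamp"] < prev["timestamp"]:
            first[e["user_id"]] = e
    return first

def metric_by_variant(exposure_events, metric_events, experiment_id):
    """Aggregate a metric per variant over exposed users only.

    Metric events from users who were never exposed, or that happened
    before the user's first exposure, are excluded.
    """
    first = first_exposures(exposure_events, experiment_id)
    users = defaultdict(int)
    totals = defaultdict(float)
    for e in first.values():
        users[e["variant"]] += 1
    for m in metric_events:
        exposure = first.get(m["user_id"])
        if exposure is not None and m["timestamp"] >= exposure["timestamp"]:
            totals[exposure["variant"]] += m["value"]
    return {
        v: {"users": users[v], "total": totals[v], "mean": totals[v] / users[v]}
        for v in users
    }
```

Dividing by exposed users rather than assigned users is what keeps never-exposed users from diluting the measured effect.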
Layered namespace prevents experiment collisions
At scale, multiple experiments touch overlapping user populations. A layer (namespace) partitions users into non-overlapping buckets, and each experiment in the layer claims its own subset of those buckets, so a user is in at most one experiment per layer. Layers are bucketed independently of each other, so a user can be in one experiment per layer simultaneously. This allows, e.g., a UX layer (testing button color) and an algorithm layer (testing ranking model) to run concurrently on the same user without contaminating each other's metrics. Mutual exclusion within a layer is enforced by the assignment service refusing to allocate buckets already claimed by a running experiment. Without this system, experiment interactions create confounded results that are impossible to interpret.
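A rough sketch of the layer mechanics, assuming 1,000 buckets per layer and half-open bucket ranges per experiment. The `Layer` class, its method names, and the SHA-256 hash are hypothetical choices for illustration, not any particular platform's API. Hashing on the layer ID rather than the experiment ID is what makes bucketing independent across layers.

```python
import hashlib
from dataclasses import dataclass, field

NUM_BUCKETS = 1000  # assumed bucket count per layer

def _bucket(user_id: str, salt: str) -> int:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS

@dataclass
class Layer:
    layer_id: str
    # experiment_id -> (start, end) half-open range of claimed buckets
    claims: dict = field(default_factory=dict)

    def allocate(self, experiment_id: str, start: int, end: int) -> None:
        """Claim buckets [start, end); refuse ranges that overlap a running experiment."""
        if not (0 <= start < end <= NUM_BUCKETS):
            raise ValueError("bucket range out of bounds")
        for other, (s, e) in self.claims.items():
            if start < e and s < end:
                raise ValueError(f"buckets already claimed by {other}")
        self.claims[experiment_id] = (start, end)

    def experiment_for(self, user_id: str):
        """Return the single experiment this user falls into in this layer, if any."""
        b = _bucket(user_id, self.layer_id)
        for experiment_id, (s, e) in self.claims.items():
            if s <= b < e:
                return experiment_id
        return None
```

With a `Layer("ux")` and a `Layer("algo")`, the same user resolves to at most one experiment in each. The variant within the chosen experiment would then come from a separate hash salted with the experiment ID.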
Statistical validity requires pre-commitment, not peeking
Peeking (checking results daily and stopping when p<0.05) inflates the false positive rate to 20–30% even with a nominal 5% significance threshold. This is the well-documented optional stopping problem, a form of multiple comparisons: every extra look is another chance to cross the threshold by luck. The fix: pre-register sample size before starting (computed via power analysis on the primary metric), run to completion, analyze once. Sequential testing (always-valid p-values, e.g. mSPRT, or group-sequential designs) allows early stopping with valid Type I error control and is how Netflix and LinkedIn do it without the peeking problem. Platforms that show a live p-value counter encourage peeking; best practice is to hide the p-value until the pre-registered sample size is reached.
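The pre-registered sample size can be estimated with the standard normal-approximation formula for comparing two proportions. This is a sketch under that approximation (two-sided test, equal allocation); the function name and defaults are assumptions, and a production platform would typically use a statistics library instead.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per variant to detect a relative lift in a conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

For a 10% baseline conversion rate and a 5% relative lift, `sample_size_per_variant(0.10, 0.05)` lands in the tens of thousands of users per variant. Lower baseline rates or smaller lifts push the requirement up quickly, because the required size scales with the inverse square of the absolute difference.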
Metric pipeline: CUPED reduces variance, not just noise
CUPED (Controlled-experiment Using Pre-Experiment Data) adjusts the experiment metric using a covariate (e.g. user's pre-experiment conversion rate) to reduce variance by 20–50%, enabling smaller sample sizes or faster experiment completion. It's one of the most impactful statistical techniques in modern A/B testing platforms (used by Netflix, Booking.com, Microsoft). The implementation: compute the covariate `x` (a pre-period metric) for each user, fit the linear coefficient `theta = cov(x, y) / var(x)` on the pooled data, and replace each observation `y` with `y - theta * (x - mean(x))` before computing the t-statistic. CUPED requires that the covariate is unaffected by treatment, which random assignment guarantees for anything measured before the experiment starts. It's not magic variance reduction: it removes the variance explained by user heterogeneity that would otherwise inflate standard error.
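A minimal sketch of the CUPED adjustment, assuming one pre-period covariate value `x` and one in-experiment metric value `y` per user, with treatment and control pooled when fitting the coefficient. The function name is hypothetical.

```python
from statistics import fmean

def cuped_adjust(y, x):
    """Return CUPED-adjusted metric values: y - theta * (x - mean(x)).

    y: in-experiment metric per user (treatment and control pooled)
    x: pre-experiment covariate for the same users, in the same order
    """
    if len(x) != len(y) or len(y) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    # theta is the least-squares slope of y on x; the 1/(n-1) factors cancel.
    theta = cov_xy / var_x if var_x > 0 else 0.0
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
```

The adjusted values keep the same overall mean as `y`, and their variance shrinks by roughly the squared correlation between `x` and `y`. After adjusting, split the values back into treatment and control and run the usual t-test.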
What breaks at scale
The first production failure is SRM (Sample Ratio Mismatch): the observed assignment ratio (e.g. 49.7% treatment, 50.3% control) diverges from the intended 50/50 split by more than chance can explain, typically due to a bug in assignment logging, bot traffic, or redirect-based assignment loss. Whether a given gap counts as an SRM depends on sample size, so it is detected with a chi-squared goodness-of-fit test on the exposure counts rather than by eyeballing the ratio. An SRM means the treatment and control groups are no longer comparable, so the experiment results are invalid and must be discarded. Every experiment platform must compute and alarm on SRM as the first health check before any metric analysis. The second failure is carryover effects: a user who saw experiment A's variant last week has changed behavior that contaminates this week's experiment B. Mandatory washout periods between experiments on the same surface are the mitigation.
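An SRM health check can be sketched as a chi-squared goodness-of-fit test on the two exposure counts. With two groups the test has one degree of freedom, so the p-value has a closed form via `math.erfc`. The function names and the 0.001 alarm threshold are assumptions for illustration; platforms choose their own thresholds.

```python
import math

def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """P-value of a chi-squared goodness-of-fit test with 1 degree of freedom."""
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # Survival function of chi-squared with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2))

def has_srm(n_control: int, n_treatment: int,
            expected_treatment_share: float = 0.5,
            threshold: float = 0.001) -> bool:
    """Flag the experiment when the observed split is very unlikely under the design."""
    return srm_p_value(n_control, n_treatment, expected_treatment_share) < threshold
```

The same 49.7% / 50.3% split is unremarkable with 1,000 users (`has_srm(503, 497)` is false) but a clear mismatch with 1,000,000 users (`has_srm(503_000, 497_000)` is true), which is why the check has to be a statistical test rather than a fixed tolerance on the ratio.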
In production
Netflix's experimentation platform (XP) uses a layered namespace system with mutual exclusion within layers: each layer partitions 100% of users, and a user gets exactly one assignment per layer. Optimizely uses a client-side SDK with a cached ruleset that computes assignments locally, eliminating any per-request server call. Booking.com runs one of the densest A/B testing environments in industry (1000+ concurrent tests on a site with ~1M daily visitors) and has published extensively on experiment interaction detection: when two experiments touch overlapping UI surfaces, their effects can be non-additive, requiring a held-out 'control for interactions' group. The hardest operational problem is ramping experiments safely: a bad experiment rolled to 100% of users before reaching statistical significance can cause a major revenue regression, a failure that many large tech companies have experienced.
Common mistakes
- Random assignment without persistence (flip-flopping)
- Peeking/early-stopping (false positives)
- Ignoring interaction between experiments