Feature Flag Service
Toggle features per user/segment in real time with near-zero read latency.
Requirements
Functional
- Define flags + targeting rules
- Evaluate flag for a user
- Gradual percentage rollout
- Kill switch
Non-functional
- <1ms evaluation
- Highly available
- Fast propagation
Scale
Every request evaluates flags
The approach
Rules are stored centrally. SDKs cache the ruleset locally and evaluate in-process (no network call per check); changes arrive via a streaming push or periodic polling. Evaluation is local and effectively instant.
Key components
Admin → rules store → streaming/CDN → in-process SDK cache
Numbers that matter
- Sub-1ms flag evaluation latency in-process once the ruleset is cached locally — vs ~10-50ms per network call if evaluated remotely.
- ~10-100 KB compressed ruleset for a mature product with hundreds of flags — small enough to hold in every SDK instance.
- <5s P99 propagation latency target from flag change to all SDKs updated, using SSE/WebSocket streaming over polling.
- ~99.999% availability required — a flag service outage that falls back to all-off can disable critical features in production.
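To put the availability figure in concrete terms, the arithmetic below converts an availability target into a yearly downtime budget. It is a back-of-the-envelope sketch; the function name is illustrative and the year is taken as 365 days.

```python
# Downtime allowed per year at a given availability target.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (ignores leap years)


def downtime_budget_seconds(availability: float) -> float:
    """Seconds of allowed downtime per year for an availability fraction."""
    return SECONDS_PER_YEAR * (1.0 - availability)


five_nines = downtime_budget_seconds(0.99999)  # about 315 s, roughly 5.3 minutes
three_nines = downtime_budget_seconds(0.999)   # about 31,536 s, roughly 8.8 hours
```

Five nines leaves only a few minutes per year, which is why the SDK has to keep serving its cached ruleset when the flag service itself is unreachable.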
Senior deep-dive
Local evaluation is the entire architecture — if SDKs phone home per check, you've built a latency bomb into every hot path.
Push the full ruleset to clients on startup + stream deltas; the SDK evaluates in-process in microseconds with zero network hops.
Flag explosion and stale rules are the operational debt — without lifecycle enforcement, you accumulate hundreds of dead flags that nobody dares delete.
Local evaluation: the only architecture that works
Remote evaluation per flag check means every rendered page, every API call, every background job carries a network round-trip to the flag service — and now your flag service is a synchronous dependency in every hot path. The correct model: push the full ruleset to each SDK instance at startup; evaluate entirely in-process. The flag service only needs to deliver config, not answer queries.
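A minimal sketch of what "evaluate entirely in-process" can look like, assuming a hypothetical ruleset shape of `{flag_key: {"enabled": bool}}`. The class and method names are illustrative, not any vendor's SDK API.

```python
import threading
from typing import Any, Dict


class LocalFlagStore:
    """In-process ruleset cache; flag reads never touch the network."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._flags: Dict[str, Dict[str, Any]] = {}

    def replace_ruleset(self, flags: Dict[str, Dict[str, Any]]) -> None:
        # Called by the delivery layer at startup and on a full re-sync.
        # Writers are serialized; the dict is swapped in as a whole.
        with self._lock:
            self._flags = dict(flags)

    def is_enabled(self, flag_key: str, default: bool = False) -> bool:
        # Pure in-memory lookup. A flag missing from the cache yields the
        # caller-supplied default instead of raising or blocking.
        flag = self._flags.get(flag_key)
        if flag is None:
            return default
        return bool(flag.get("enabled", default))


store = LocalFlagStore()
store.replace_ruleset({"new-checkout": {"enabled": True}})
checkout_on = store.is_enabled("new-checkout")
unknown_on = store.is_enabled("not-defined-yet")
```

The caller-supplied default matters: it is what the application sees before the first ruleset arrives, or when a flag has not been defined yet.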
Deterministic bucketing: why you hash, not randomize
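Percentage rollouts must be sticky: user A must always see the same variant across requests, devices, and SDK instances. The solution is `bucket = hash(userId + flagKey + salt) % 100`, where `hash` is a stable hash function that returns the same value in every process and every SDK language. Mixing the flag key and a per-flag salt into the hash prevents correlation across flags; without them, a user in bucket 5 would be in bucket 5 for every flag, and the same users would always be first in line for every rollout. Never use session-random assignment: it breaks analytics and gives the same user an inconsistent experience from one session to the next.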
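The bucketing formula can be sketched as follows. SHA-256 and the `:` delimiter are choices made for this example (the formula only requires a stable hash); the function names are illustrative.

```python
import hashlib


def bucket(user_id: str, flag_key: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, 100) for one flag."""
    # The delimiter avoids ambiguous concatenations ("ab" + "c" vs "a" + "bc").
    material = f"{user_id}:{flag_key}:{salt}".encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Interpret the first 8 bytes as an unsigned integer; the modulo bias
    # at this width is negligible for a 100-bucket split.
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_id: str, flag_key: str, salt: str, percentage: int) -> bool:
    """True if the user falls inside the first `percentage` buckets."""
    return bucket(user_id, flag_key, salt) < percentage
```

Python's built-in `hash()` is unsuitable here because string hashing is randomized per process by default, so two SDK instances would disagree. Because membership is `bucket < percentage`, raising a rollout from 10% to 20% keeps everyone who was already in.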
Streaming deltas over polling
Polling every N seconds means your worst-case propagation lag is N — unacceptable for a kill switch on a P0 outage. SSE (Server-Sent Events) gives you push with trivial client implementation and works through most proxies; WebSocket is overkill since flag updates are unidirectional. Clients reconnect with a cursor/version so they don't re-download the full ruleset on reconnect. The stream channel is extremely low throughput — most clients idle for minutes at a time.
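The cursor/version idea can be sketched as a cache that applies versioned deltas and ignores replays. The `Delta` shape and the integer version are assumptions for illustration; over SSE, the standard `Last-Event-ID` reconnect header is one way to carry the cursor back to the server.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Delta:
    version: int                    # monotonically increasing ruleset version
    flag_key: str
    flag: Optional[Dict[str, Any]]  # None means the flag was deleted


@dataclass
class RulesetCache:
    version: int = 0
    flags: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def apply(self, delta: Delta) -> bool:
        """Apply one delta; return False if it is stale or a duplicate."""
        if delta.version <= self.version:
            return False  # already applied, e.g. replayed after a reconnect
        if delta.flag is None:
            self.flags.pop(delta.flag_key, None)
        else:
            self.flags[delta.flag_key] = delta.flag
        self.version = delta.version
        return True

    def resume_cursor(self) -> int:
        # Sent on reconnect so the server replays only deltas after this version.
        return self.version
```

Making `apply` idempotent means the server can err on the side of replaying a delta twice without corrupting the client's view.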
Flag targeting: the complexity you don't see coming
Boolean on/off is trivial. The real complexity is multi-variate flags (A/B/C/D), segment targeting (users in a segment defined by 10 attributes), and prerequisite flags (flag B only evaluates if flag A is on). Evaluation order matters: rules are checked top-to-bottom, first match wins, like firewall rules. Circular prerequisites are a correctness bug the SDK must detect.
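A compact sketch of first-match-wins evaluation with a prerequisite check and cycle detection. The flag schema here is hypothetical: each flag has `enabled`, an optional single `prerequisite` flag key, an ordered `rules` list of attribute-equality matches, and a `default` variant.

```python
from typing import Any, Dict, Optional, Set

Flag = Dict[str, Any]


class PrerequisiteCycle(Exception):
    """Raised when prerequisite flags reference each other in a loop."""


def evaluate(
    flags: Dict[str, Flag],
    key: str,
    user: Dict[str, Any],
    _seen: Optional[Set[str]] = None,
) -> Optional[str]:
    """Return the variant served to `user`, or None if the flag is off."""
    seen = _seen if _seen is not None else set()
    if key in seen:
        raise PrerequisiteCycle(f"cycle through {key!r}")
    seen.add(key)

    flag = flags.get(key)
    if flag is None or not flag.get("enabled", False):
        return None

    # The prerequisite must itself serve a variant before this flag evaluates.
    prereq = flag.get("prerequisite")
    if prereq is not None and evaluate(flags, prereq, user, seen) is None:
        return None

    # Rules are checked top to bottom; the first match wins.
    for rule in flag.get("rules", []):
        if user.get(rule["attribute"]) == rule["equals"]:
            return rule["variant"]
    return flag.get("default")
```

With a single prerequisite per flag the dependency graph is a chain, so one shared `seen` set is enough; flags with several prerequisites would need per-path tracking to avoid false positives on diamond-shaped dependencies.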
Analytics coupling: don't poison your experiments
A flag service doubles as an experimentation platform, and that means every exposure must be logged with the exact variant the user saw, with a timestamp, for later significance testing. The trap: if you log exposures lazily (only when the user hits a downstream event), you get survivorship bias in your funnel. Log exposure the moment the flag is evaluated, regardless of what the user does next.
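One way to make eager exposure logging hard to forget is to record it inside the evaluation call itself. This is a sketch: the wrapper class, the record fields, and the list used as a sink are all illustrative assumptions.

```python
import time
from typing import Callable, Dict, List

Exposure = Dict[str, object]


class ExposureLoggingClient:
    """Wraps an evaluation function and records an exposure on every call."""

    def __init__(
        self,
        evaluate: Callable[[str, str], str],
        sink: List[Exposure],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._evaluate = evaluate
        self._sink = sink
        self._clock = clock

    def variant(self, flag_key: str, user_id: str) -> str:
        chosen = self._evaluate(flag_key, user_id)
        # Record the exposure now, before any downstream event can occur,
        # so users who drop out of the funnel are still counted.
        self._sink.append(
            {"flag": flag_key, "user": user_id, "variant": chosen, "ts": self._clock()}
        )
        return chosen
```

In a real SDK the sink would more likely be a bounded in-memory buffer flushed asynchronously, so that logging never adds latency to the evaluation path.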
What breaks at scale
Flag cardinality explosion is the first failure: if 500 engineers each add 2 flags per sprint and nobody cleans up, that is 1,000 new flags per sprint, so within four sprints you have 4,000 active flags, many of them 100%-on (effectively dead code). The SDK ruleset balloons and evaluation gets slower. Targeting rule fanout is the second: a rule targeting 10M individual user IDs serializes into a massive payload. At scale, targeting segments whose membership is evaluated server-side, instead of shipping user ID lists, is mandatory. Finally, there is SDK version skew: old SDKs don't know about new rule types and silently fall back to defaults, creating invisible split-brain experiments.
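As one possible piece of lifecycle enforcement, a periodic job could list flags that have been fully rolled out and untouched for a while. The field names (`rollout_percentage`, `last_modified`) and the 30-day threshold are assumptions made for this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def cleanup_candidates(
    flags: Dict[str, Dict[str, Any]],
    now: datetime,
    max_age: timedelta = timedelta(days=30),
) -> List[str]:
    """Keys of flags serving 100% that have not changed for at least `max_age`."""
    return sorted(
        key
        for key, flag in flags.items()
        if flag["rollout_percentage"] == 100
        and now - flag["last_modified"] >= max_age
    )


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
flags = {
    "old-full": {"rollout_percentage": 100, "last_modified": now - timedelta(days=90)},
    "new-full": {"rollout_percentage": 100, "last_modified": now - timedelta(days=3)},
    "old-partial": {"rollout_percentage": 20, "last_modified": now - timedelta(days=90)},
}
stale = cleanup_candidates(flags, now)
```

Only `old-full` qualifies: it is fully rolled out and has been stable long enough that the flag check can likely be removed from the code.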
In production
LaunchDarkly pioneered the streaming ruleset delivery model: the SDK downloads the full flag ruleset on boot, evaluates locally, and maintains a persistent SSE stream for delta updates. Statsig, Flagsmith, and Unleash follow the same pattern. The real engineering challenge is targeting rule complexity: per-user, per-percentage-rollout, and per-segment evaluations all need to be deterministic (same user always gets the same bucket), so you can't just randomize; you hash `(userId + flagKey)` into a stable bucket. The second hard problem is consistency during deployments: if code and flag are released together, the flag definition must be live before the code that reads it ships. Otherwise the new code evaluates a flag the SDK has never received and falls back to its default, or the change has to stay on a feature branch until the flag exists.
Common mistakes
- Network call per flag check
- No kill switch
- Stale rules because nothing pushes or polls for updates