System Design Library

CRDT / Conflict-Free Store

Let many replicas update concurrently and converge automatically, even offline.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Concurrent writes anywhere
  • Automatic merge
  • Offline edits
  • Eventual convergence

Non-functional

  • Always writable
  • No coordination needed

Scale

Multi-region / offline-first

The approach

Conflict-free Replicated Data Types: design data so all concurrent operations commute (G-counters, OR-sets, LWW-registers, sequence CRDTs). Replicas exchange ops/states (gossip) and merge deterministically — same result regardless of order — so no consensus or locking is needed.

Key components

Replicas (each writable) ↔ gossip/sync · CRDT merge

Numbers that matter

Senior deep-dive

CRDTs guarantee convergence by construction — any two replicas that have seen the same set of operations will be in the same state, regardless of order or delivery gaps, because every operation is designed to be commutative and idempotent.

The design constraint is the challenge: not all data types are naturally CRDT-compatible; forcing arbitrary business logic into commutative operations often means accepting semantic losses (e.g., a counter can't go negative, a set can't distinguish "remove then re-add" from "add").

State-based CRDTs gossip full state; op-based CRDTs require exactly-once delivery — picking the wrong variant for your network model is the most common implementation mistake.

State-based vs op-based: pick based on your network

State-based CRDTs (CvRDTs) gossip the full state; merge is a join operation (like set union or max). They work correctly with any network model — duplicates and out-of-order delivery are fine. Cost: sending full state is expensive for large data structures. Op-based CRDTs (CmRDTs) send only the delta operation; the operation must be commutative and the network must guarantee at-least-once delivery with causal ordering. Delta-state CRDTs are a hybrid: gossip incremental state deltas, not full state, with idempotent merge — this is what most production systems use because it's bandwidth-efficient while remaining correct under duplication.

G-counter and PN-counter: the simplest useful CRDTs

A G-counter (grow-only counter) is a vector of per-replica counts; the global count is the sum of all components; merge takes the component-wise max. It's monotonically increasing — you can never decrement. A PN-counter pairs two G-counters (positive increments + negative increments) and returns `P - N` as the value. The trap: PN-counters can go negative, so you can't enforce a "stock never goes below zero" invariant without coordination. For inventory management, this is disqualifying — you need a coordinated atomic decrement, not a CRDT.

OR-Set: concurrent add/remove, correctly

The naïve set CRDT (add wins or remove wins on conflict) fails on the case: "Alice adds X, Bob concurrently removes X — who wins?" OR-Set (Observed-Remove Set) resolves this by tagging each add with a unique token; a remove only removes the specific tokens it has observed. A concurrent re-add generates a new token that survives the remove. Result: adds beat removes on concurrent operations — which is usually the right semantic for collaborative apps (a user re-adding something they just removed shouldn't silently lose their action). The cost is O(adds × replicas) metadata per element.

Sequence CRDTs: the hard case for text editing

Collaborative text is the hardest CRDT problem: concurrent insertions at the same position need a deterministic total order that all replicas agree on. WOOT assigns unique identifiers to each character and inserts relative to neighbors ("insert X between char 5 and char 6"). RGA (Replicated Growable Array) is more efficient. LSEQ uses fractional indexing with exponential density. All of these work but accumulate tombstones for deleted characters that must be retained to correctly position future inserts relative to them — a 100k-character document with heavy editing can balloon to millions of tombstone entries.

Tombstone GC: the convergence vs GC tension

Deleted elements in OR-Sets and sequence CRDTs leave tombstones — markers that "this thing was removed." Tombstones must be retained until all replicas have observed the delete, otherwise a replica that missed the delete message will re-add the element on next sync. Knowing when all replicas have observed a delete requires a global coordination step (a barrier or a stable vector clock), which defeats the purpose of a CRDT-based system. The practical answer is epoch-based GC: periodically coordinate a "stable frontier" across all online replicas and collect tombstones older than the frontier — accepting that offline replicas must reconcile before they can have their old operations GC'd.

What breaks at scale

Metadata explosion is the primary scaling failure: OR-Set elements accumulate vector clock entries for every replica that ever touched them; with hundreds of nodes and millions of elements, CRDT metadata can exceed the actual data size by 10–100×. The second failure is causal consistency violations under late-joining replicas: a replica that was offline for a week and reconnects needs to receive all operations since its last sync in causal order — if the op log is not retained for the full offline window, the replica must do a full state sync (expensive) rather than an incremental merge. Bounded op retention windows with forced full resync after expiry is the operational policy most systems use.

In production

Riak (Amazon Dynamo derivative) used CRDTs (G-counters, OR-Sets, Maps) as first-class data types to give users conflict-free data structures without application-level conflict resolution. Apple's CloudKit uses a CRDT-like model for offline-first sync. Figma built a custom CRDT for their multiplayer design tool using a hybrid of fractional indexing (for ordered layers) and LWW-registers (for properties). The real challenge is garbage collection: CRDT metadata (tombstones in OR-Sets, vector clock entries for retired replicas) accumulates indefinitely unless explicitly collected, and tombstone GC requires knowing that all replicas have seen the delete — which requires a coordination step that CRDTs are supposed to avoid, creating a fundamental tension between convergence and memory growth.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →