System Design Library

Google Docs (collaborative editor)

Many users edit one document simultaneously, see each other's keystrokes live, and converge on the same document state.

Requirements

Functional

  • Concurrent editing
  • Live cursors/presence
  • History/undo
  • Offline edits

Non-functional

  • Low-latency sync
  • Convergence (all clients agree)

Scale

Dozens of editors/doc

The approach

Operational Transformation (OT) or CRDTs (conflict-free replicated data types) merge concurrent edits deterministically. A per-document server (or Durable Object, DO) sequences ops and broadcasts them; clients apply the transformed ops to converge.

Key components

Client ⇄ WS ⇄ per-doc collaboration server · op log/store · presence

Numbers that matter

Senior deep-dive

Concurrent edit convergence is the only genuinely hard problem — if two users insert a character at the same position, naive last-write-wins corrupts the document.

OT, as usually deployed, relies on a central server to sequence operations; CRDTs remove the central sequencer but carry higher metadata overhead and trickier text semantics. The per-document server (or Durable Object) is the sequencing point: it must be stateful, sticky per document, and able to replay its operation log to rebuild document state after a crash.
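To make the convergence problem concrete, here is a minimal sketch of an insert-versus-insert transform. The `Insert` type, the `site` tie-breaker, and the function names are illustrative assumptions, not taken from any particular library; real OT systems also handle deletes and rich-text attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int   # index in the document where the text goes
    text: str
    site: int  # hypothetical client id, used only to break ties


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after `other` has been applied."""
    # Shift right if `other` landed before us, or at the same position
    # with a lower site id (an arbitrary but deterministic tie-break).
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


doc = "abc"
a = Insert(1, "X", site=1)  # user A types X at position 1
b = Insert(1, "Y", site=2)  # user B concurrently types Y at position 1

# Each side applies its own op first, then the other op transformed.
seen_by_a = apply(apply(doc, a), transform(b, a))
seen_by_b = apply(apply(doc, b), transform(a, b))
print(seen_by_a, seen_by_b)
```

Both replicas end with the same string and neither character is lost, which is exactly what last-write-wins cannot guarantee.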

OT vs CRDT: the actual decision criteria

Choose OT when you need a central server anyway (simpler metadata, smaller wire format, mature libraries such as ShareDB, which is often paired with the Quill editor). Choose CRDTs when you need peer-to-peer or offline-first semantics (e.g. local-first apps, offline mobile). OT's weakness is that transformation functions are notoriously hard to get right for complex document types (tables, embedded objects), and bugs here cause silent document corruption. CRDTs eliminate transformation but introduce tombstone bloat and GC complexity.
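The tombstone cost can be seen in a small RGA-style sequence CRDT sketch. It assumes causal delivery (an element's parent is always present before the element arrives) and Lamport-style ids of the form `(counter, site)`; the class and method names are hypothetical, and production CRDTs use far more compact structures.

```python
class RGA:
    """Minimal RGA-style list CRDT: ids are (counter, site) tuples."""

    def __init__(self):
        self.items = []  # each entry is [id, char, deleted]

    def _index(self, target_id):
        return next(i for i, item in enumerate(self.items) if item[0] == target_id)

    def insert(self, new_id, parent_id, ch):
        # Start right after the parent (or at the head when parent_id is None).
        i = 0 if parent_id is None else self._index(parent_id) + 1
        # Skip concurrent elements with a larger id so every replica
        # picks the same slot regardless of arrival order.
        while i < len(self.items) and self.items[i][0] > new_id:
            i += 1
        self.items.insert(i, [new_id, ch, False])

    def delete(self, target_id):
        # Deletion only marks a tombstone; the element stays in memory.
        self.items[self._index(target_id)][2] = True

    def text(self):
        return "".join(ch for _, ch, dead in self.items if not dead)


def fresh_replica():
    r = RGA()
    r.insert((1, 0), None, "a")
    r.insert((2, 0), (1, 0), "b")
    return r


x = ((3, 1), (1, 0), "X")  # site 1 inserts X after "a"
y = ((3, 2), (1, 0), "Y")  # site 2 concurrently inserts Y after "a"

r1, r2 = fresh_replica(), fresh_replica()
r1.insert(*x); r1.insert(*y)   # arrival order X, Y
r2.insert(*y); r2.insert(*x)   # arrival order Y, X
r1.delete((3, 1)); r2.delete((3, 1))
print(r1.text(), r2.text(), len(r1.items))
```

No transformation function is needed, but the deleted element still occupies a slot on every replica until some garbage-collection scheme removes it.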

The per-document server is your consistency boundary

In an OT system, the per-document server (or Durable Object) is the only thing that assigns global sequence numbers: if two users each submit what they believe is op #5, the server picks an order, transforms the second op against the first, and numbers them consecutively. This server must be single-writer per document (no horizontal scale for writes). For very large documents (think Wikipedia articles), this can be a bottleneck; the mitigation is document splitting (sections as independent CRDT/OT units) so multiple server instances can handle different sections.
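A sketch of that sequencing loop, assuming insert-only ops and a client that reports the revision its op was based on. `DocServer`, `submit`, and `base_rev` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: int  # hypothetical client id used as a tie-breaker


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it applies after `other`."""
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


@dataclass
class DocServer:
    doc: str = ""
    log: List[Insert] = field(default_factory=list)  # log[i] has sequence number i + 1

    def submit(self, op: Insert, base_rev: int) -> Tuple[int, Insert]:
        # Transform against every committed op the client had not seen yet.
        for committed in self.log[base_rev:]:
            op = transform(op, committed)
        self.doc = self.doc[:op.pos] + op.text + self.doc[op.pos:]
        self.log.append(op)
        # The (sequence number, transformed op) pair is what gets broadcast.
        return len(self.log), op


server = DocServer(doc="hello")
# Both clients were at revision 0 when they typed at position 5.
seq_a, op_a = server.submit(Insert(5, "!", site=1), base_rev=0)
seq_b, op_b = server.submit(Insert(5, "?", site=2), base_rev=0)
print(seq_a, seq_b, op_b.pos, server.doc)
```

Because `submit` mutates a single log, it only stays correct with one writer per document, which is the consistency boundary described above.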

Presence and cursor broadcasting

Cursor positions and selections are ephemeral presence data, not document operations, so keep them out of OT/CRDT to avoid polluting the operation log. Broadcast cursor updates via a separate pub/sub channel per document with no persistence. Cursor positions must still be transformed against incoming ops: if user A's cursor is at position 50 and user B inserts 5 characters at position 10, A's cursor shifts to 55. This is a simplified version of the same transform that is applied to operations.
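The cursor adjustment above can be sketched as two small functions. The function names are hypothetical, and leaving the cursor in place when a remote insert lands exactly at the cursor is a policy choice assumed here, not something the text prescribes.

```python
def cursor_after_insert(cursor: int, pos: int, length: int) -> int:
    """Shift a cursor after a remote insert of `length` chars at `pos`."""
    # Assumption: an insert exactly at the cursor does not push it right.
    return cursor + length if pos < cursor else cursor


def cursor_after_delete(cursor: int, pos: int, length: int) -> int:
    """Shift a cursor after a remote delete of `length` chars starting at `pos`."""
    if cursor <= pos:
        return cursor
    # A cursor inside the deleted range collapses to the start of the range.
    return max(pos, cursor - length)


# The example from the text: cursor at 50, five chars inserted at position 10.
moved = cursor_after_insert(50, 10, 5)
print(moved)
```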

Persistence: operation log vs document snapshots

Never store only the latest document state. Store the full operation log so you can replay, debug, and implement undo. Snapshots (materialized document state at sequence N) act as checkpoints so replay doesn't start from op #1. A good pattern: snapshot every 1,000 ops, keep the full log for 30 days, compress older logs. Undo is implemented by applying an inverse operation rather than by replaying history: building the inverse is O(1), whereas replay is O(n) in the number of ops. In a collaborative session the inverse must still be transformed against any ops applied after the one being undone.
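A minimal in-memory sketch of the log-plus-snapshot pattern with inverse-op undo. `DocStore`, `Op`, and `snapshot_every` are hypothetical names; delete ops carry the removed text so they can be inverted, and the undo shown targets only the most recent op, so no transform against later ops is needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str  # "ins" or "del"
    pos: int
    text: str  # inserted text, or the text that was removed


def apply(doc: str, op: Op) -> str:
    if op.kind == "ins":
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.pos] + doc[op.pos + len(op.text):]


def invert(op: Op) -> Op:
    return Op("del" if op.kind == "ins" else "ins", op.pos, op.text)


class DocStore:
    def __init__(self, snapshot_every: int = 1000):
        self.log = []              # log[i] has sequence number i + 1
        self.snapshots = {0: ""}   # sequence number -> materialized document
        self.snapshot_every = snapshot_every
        self._doc = ""

    def append(self, op: Op) -> int:
        self._doc = apply(self._doc, op)
        self.log.append(op)
        seq = len(self.log)
        if seq % self.snapshot_every == 0:
            self.snapshots[seq] = self._doc
        return seq

    def rebuild(self, upto=None) -> str:
        """Rebuild state at sequence `upto` from the nearest snapshot plus the log tail."""
        upto = len(self.log) if upto is None else upto
        base = max(s for s in self.snapshots if s <= upto)
        doc = self.snapshots[base]
        for op in self.log[base:upto]:
            doc = apply(doc, op)
        return doc


store = DocStore(snapshot_every=2)  # tiny interval so the demo takes a snapshot
store.append(Op("ins", 0, "ab"))
store.append(Op("ins", 2, "cd"))    # snapshot taken at sequence 2
store.append(Op("del", 1, "bc"))
before_undo = store.rebuild()
store.append(invert(store.log[-1])) # undo is just another appended op
after_undo = store.rebuild()
print(before_undo, after_undo)
```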

Offline and reconnect: the hardest case

A client offline for T minutes has local ops based on server state at sequence S. On reconnect, it must rebase its local ops against all server ops from S to S+N (N ops happened while offline). With OT this costs O(local_ops × N) transformations. In practice this stays workable up to a few hundred ops; beyond that, the client should discard local ops and show a conflict UI rather than attempt a transformation that may fail silently. CRDTs handle this more gracefully but still have merge complexity for deletions.
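The rebase step can be sketched as a nested loop over local and server ops, which is where the O(local_ops × N) cost comes from. This assumes insert-only ops; `rebase` and the `max_server_ops` threshold of 500 are hypothetical stand-ins for whatever limit a real client would choose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str
    site: int  # hypothetical client id used as a tie-breaker


def apply(doc: str, op: Insert) -> str:
    return doc[:op.pos] + op.text + doc[op.pos:]


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it applies after `other`."""
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site):
        return Insert(op.pos + len(other.text), op.text, op.site)
    return op


def rebase(local, server, max_server_ops=500):
    """Rebase offline `local` ops over `server` ops, or return None to signal a conflict UI."""
    if len(server) > max_server_ops:
        return None
    for s in server:
        rebased = []
        for l in local:
            # Both sides move: the local op shifts past s, and s shifts past
            # the local op so the next local op sees it in the right context.
            rebased.append(transform(l, s))
            s = transform(s, l)
        local = rebased
    return local


# Shared state at sequence S was "abc".
local_ops = [Insert(3, "X", site=2), Insert(4, "Y", site=2)]   # typed offline
server_ops = [Insert(0, "1", site=1), Insert(1, "2", site=1)]  # happened meanwhile

server_doc = "abc"
for op in server_ops:
    server_doc = apply(server_doc, op)

merged = server_doc
for op in rebase(local_ops, server_ops):
    merged = apply(merged, op)
print(merged)
```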

What breaks at scale

A viral shared document with 500 simultaneous editors overwhelms the broadcast fan-out: each op must be sent to 499 other WebSocket connections. The per-doc server becomes CPU-bound on serialization and network I/O. Fix with hierarchical fan-out (relay servers per region each holding a subset of connections) and op batching (coalesce 10ms of ops into one message). The second failure: large documents with long history cause slow initial load — mitigate with lazy loading (load the visible viewport's content first, load history only on request).
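A rough sketch of the batching idea and the fan-out arithmetic behind it. The 5 ops per second typing rate is an assumption made for illustration, as are the function names; the 500 editors and the 10 ms window come from the text.

```python
def batch_ops(timed_ops, window_ms=10):
    """Group (timestamp_ms, op) pairs, assumed sorted by time, into window_ms batches."""
    batches = []
    window_start = None
    for ts, op in timed_ops:
        if window_start is None or ts - window_start >= window_ms:
            batches.append([])
            window_start = ts
        batches[-1].append(op)
    return batches


def unbatched_sends_per_s(editors, ops_per_editor_per_s):
    # Every op goes to every other connection.
    return editors * (editors - 1) * ops_per_editor_per_s


def batched_sends_per_s_upper_bound(editors, window_ms=10):
    # Each connection receives at most one message per window.
    return editors * (1000 // window_ms)


batches = batch_ops([(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")])
unbatched = unbatched_sends_per_s(500, 5)      # assumed 5 ops/s per editor
batched = batched_sends_per_s_upper_bound(500)
print(batches, unbatched, batched)
```

Under these assumptions batching caps the send rate at a fixed number of messages per connection, so it grows linearly with editors instead of quadratically.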

In production

Google Docs uses Operational Transformation over a central per-document server: each client sends ops to the server, which assigns a global sequence number and broadcasts transformed ops to all other clients. Figma chose a CRDT-inspired approach over OT for its multiplayer canvas, with a central server still ordering updates, finding it simpler for spatial objects where operations naturally commute. Cloudflare Durable Objects are a common choice for per-document stateful servers (one DO per doc, sticky WebSocket connections, in-memory state plus durable KV backing). The real engineering challenge is handling offline edits and reconnection: a client that goes offline for 30 minutes accumulates ops that must be rebased against potentially thousands of server-side ops on reconnect, and this is where OT's transformation complexity is most painful.

Common mistakes

Part of System Design Library on SystemLore: system design interview prep with 148 deep topics, interactive diagrams, and a practice game.