System Design Library

Collaborative IDE

Multiple users editing and running code together in a shared cloud workspace.

Requirements

Functional

  • Shared editing
  • Run/terminal
  • Filesystem
  • Live cursors

Non-functional

  • Low-latency edits
  • Isolated execution

Scale

Many concurrent workspaces

The approach

Use a CRDT or OT for shared editing (as in Google Docs). Each workspace is a sandboxed container with its own filesystem; terminal and run output are streamed over a WebSocket; isolation is per workspace.

Key components

CRDT edit sync · per-workspace container · WS terminal stream

Senior deep-dive

Each workspace is a stateful container — you cannot route a user to any pod arbitrarily; their filesystem, running process, and terminal session are pinned to one instance.

CRDT for the editor buffer, container for the runtime: the hardest integration is keeping the two in sync — when a collaborator edits a file, the change must be flushed to the container's filesystem before the next compile or LSP request.

Latency budget is unforgiving: keystroke → character on collaborator's screen must be <100ms or the collaboration feels broken; this demands colocation of the CRDT server and the WebSocket gateway, not a round-trip through a central region.
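
The pinning constraint in the first point can be sketched as a placement table. This is a minimal illustration, not any vendor's scheduler: `WorkspaceRouter`, the host names, and the least-loaded placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WorkspaceRouter:
    """Maps each workspace to the single host that holds its state (hypothetical)."""

    hosts: list[str]
    placements: dict[str, str] = field(default_factory=dict)  # workspace_id -> host

    def _least_loaded(self) -> str:
        load = {h: 0 for h in self.hosts}
        for host in self.placements.values():
            load[host] += 1
        return min(self.hosts, key=lambda h: load[h])

    def route(self, workspace_id: str) -> str:
        # First open: choose a host. Every later request, from any
        # collaborator, must land on that same host.
        if workspace_id not in self.placements:
            self.placements[workspace_id] = self._least_loaded()
        return self.placements[workspace_id]


router = WorkspaceRouter(hosts=["host-a", "host-b"])
first = router.route("ws-1")
router.route("ws-2")
assert router.route("ws-1") == first  # pinned, not re-balanced
```

In a real deployment the table would live in a shared store so that every gateway resolves the same placement.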

Workspace isolation: VMs vs containers

Untrusted user code mandates strong isolation. Plain Docker containers share the host kernel, so a kernel exploit escapes the sandbox. Firecracker microVMs (used by Fly.io, Replit, AWS Lambda) boot to user space in as little as 125ms and provide VM-level isolation at close to container-level density: each microVM carries only a few MiB of memory overhead, and a host can launch them at up to roughly 150 per second. gVisor (used by Google Cloud Run) interposes a user-space kernel to intercept syscalls; it is lighter than a VM but not zero overhead. Choose based on your threat model: multi-tenant SaaS needs VM-level isolation; internal dev tooling can use containers with seccomp profiles.

Collaborative editing: CRDT over the editor buffer

Yjs is the de facto choice: it implements a YATA-based sequence CRDT optimized for text, has adapters for Monaco/CodeMirror, and supports a WebRTC or WebSocket provider. The per-workspace CRDT server (can be a Cloudflare Durable Object, a Fly Machine, or a dedicated pod) receives operations, applies them to the authoritative document, and broadcasts to all connected clients. Awareness (cursor positions, selections) is handled separately via ephemeral broadcast — no need to persist it.
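
The server role described above (apply, broadcast, keep awareness ephemeral) can be sketched without any CRDT library. This is a toy relay under stated assumptions: updates are treated as opaque bytes, `WorkspaceRoom` and its method names are hypothetical, and a per-client list stands in for a WebSocket. It does not implement Yjs or YATA.

```python
from collections import defaultdict


class WorkspaceRoom:
    """Per-workspace sync server: persisted update log plus ephemeral awareness."""

    def __init__(self):
        self.updates = []                # persisted: opaque CRDT updates, in arrival order
        self.awareness = {}              # memory only: client_id -> cursor/selection
        self.outbox = defaultdict(list)  # stand-in for one WebSocket per client

    def join(self, client_id):
        # A new client catches up by replaying every stored update.
        self.outbox[client_id].extend(("update", u) for u in self.updates)

    def on_update(self, sender, update: bytes):
        self.updates.append(update)      # record in the authoritative document log
        self._broadcast(sender, ("update", update))

    def on_awareness(self, sender, state: dict):
        self.awareness[sender] = state   # never written to storage
        self._broadcast(sender, ("awareness", sender, state))

    def _broadcast(self, sender, message):
        for client_id, queue in self.outbox.items():
            if client_id != sender:
                queue.append(message)

    def snapshot(self):
        # What gets persisted: document updates only, no cursors.
        return list(self.updates)


room = WorkspaceRoom()
room.join("alice")
room.join("bob")
room.on_update("alice", b"\x01insert")
room.on_awareness("alice", {"line": 3, "col": 7})
```

Because awareness never enters `snapshot()`, a server restart loses cursor positions but not document content, which matches the split described above.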

Filesystem sync: bridging CRDT and disk

The CRDT document is the in-memory truth; the container filesystem is the execution truth. The server must write each change (as a delta or a full file) to the container's filesystem before anything in the container reads it, so that `npm run build` sees current code. Three strategies: flush on save (edits are shared live between collaborators but reach disk only on save), a virtual FS (a FUSE mount backed by the CRDT store, which is Replit's approach), or continuous sync (a debounced write, for example every 500ms). The FUSE approach is cleanest but adds roughly 1ms of latency to every file read in the container.
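
A minimal sketch of the continuous-sync strategy, assuming a 500ms quiet period and a caller that supplies timestamps. `DebouncedFlusher` and its methods are hypothetical names; a production version would run `tick` from a timer and write atomically.

```python
import tempfile
from pathlib import Path


class DebouncedFlusher:
    """Writes dirty buffers to disk once they have been quiet for `quiet_s` seconds."""

    def __init__(self, root: Path, quiet_s: float = 0.5):
        self.root = root
        self.quiet_s = quiet_s
        self.dirty = {}  # relative path -> (latest text, time of last edit)

    def on_edit(self, rel_path: str, text: str, now: float):
        self.dirty[rel_path] = (text, now)

    def tick(self, now: float):
        """Flush files whose last edit is at least `quiet_s` old; return their paths."""
        ready = [p for p, (_, t) in self.dirty.items() if now - t >= self.quiet_s]
        for p in ready:
            self._write(p)
        return ready

    def flush_all(self):
        """Call before a build or LSP request so disk matches the editor."""
        for p in list(self.dirty):
            self._write(p)

    def _write(self, rel_path: str):
        text, _ = self.dirty.pop(rel_path)
        target = self.root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)


root = Path(tempfile.mkdtemp())
flusher = DebouncedFlusher(root)
flusher.on_edit("src/app.js", "let x = 1", now=0.0)
flusher.on_edit("src/app.js", "let x = 12", now=0.3)
flusher.tick(now=0.6)  # only 0.3s since the last edit: nothing written yet
flusher.tick(now=1.0)  # quiet long enough: the latest text reaches disk
```

The `flush_all` hook is what closes the gap before a compile: the debounce keeps disk writes cheap during typing, and the explicit flush guarantees the build never reads a stale file.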

Terminal and process streaming

A terminal session inside the workspace container is stateful: it has a PTY with scrollback, running processes, and an environment. It is proxied over a WebSocket as raw PTY bytes, which xterm.js renders on the client. The user's connection must stay pinned to that container; you cannot re-route mid-session. The PTY session must also survive a reconnect: run the shell under tmux or screen inside the container so a WebSocket drop doesn't kill the running process.
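
The PTY side of this can be sketched with Python's standard `pty` module on a Unix host. The helper names `attach_command` and `stream_pty` are hypothetical, the `send` callback stands in for a WebSocket send, and the sketch covers output only (forwarding keystrokes and resize events is omitted).

```python
import os
import pty
import subprocess


def attach_command(session: str) -> list[str]:
    # `tmux new-session -A` attaches if the session exists and creates it otherwise,
    # so a reconnecting client lands in the same shell with its processes intact.
    return ["tmux", "new-session", "-A", "-s", session]


def stream_pty(argv: list[str], send) -> int:
    """Run argv on a pseudo-terminal and pass each raw chunk of output to `send`."""
    master, slave = pty.openpty()
    proc = subprocess.Popen(argv, stdin=slave, stdout=slave, stderr=slave)
    os.close(slave)  # the child keeps its own copy
    try:
        while True:
            try:
                chunk = os.read(master, 4096)
            except OSError:  # Linux reports EIO once the child side closes
                break
            if not chunk:    # other platforms report EOF as an empty read
                break
            send(chunk)      # in a gateway this would forward bytes to the client
    finally:
        os.close(master)
    return proc.wait()


received = []
exit_code = stream_pty(["echo", "hello"], received.append)
```

A gateway would pass `attach_command(workspace_id)` instead of `echo`, assuming tmux is installed in the workspace image.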

Prebuild and warm-start optimization

Cold start (clone repo + npm install + compile) can take 2–5 minutes. Prebuilds run this pipeline headlessly on every push, snapshot the resulting container image or filesystem, and store it. When a user opens the workspace, the prebuild is cloned and started — cold start drops to <5 seconds. The tricky part is cache invalidation: a prebuild for `main` is useful for a feature branch only if the lockfile hasn't changed. Branch-level prebuild inheritance (walk the git graph to find the nearest ancestor with a valid prebuild) is what Gitpod implements.
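
Branch-level inheritance can be sketched as a breadth-first walk over parent commits. This is an illustration under assumptions, not Gitpod's implementation: the commit graph, the set of prebuilt commits, and the per-commit lockfile hashes are passed in as plain dictionaries, and "valid" is taken to mean "same lockfile hash as the commit being opened".

```python
from collections import deque


def nearest_prebuild(head, parents, prebuilt, lockfile_hash):
    """Return the closest ancestor of `head` (or `head` itself) with a usable prebuild.

    parents:       commit -> list of parent commits
    prebuilt:      set of commits that have a stored prebuild
    lockfile_hash: commit -> hash of the lockfile at that commit
    """
    wanted = lockfile_hash[head]
    seen, queue = {head}, deque([head])
    while queue:
        commit = queue.popleft()
        if commit in prebuilt and lockfile_hash[commit] == wanted:
            return commit
        for parent in parents.get(commit, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None  # fall back to a full cold start


# Feature branch f1 -> f2 forked from main at m2; the lockfile is unchanged.
parents = {"f2": ["f1"], "f1": ["m2"], "m2": ["m1"], "m1": []}
lockfile_hash = {"f2": "L1", "f1": "L1", "m2": "L1", "m1": "L0"}
chosen = nearest_prebuild("f2", parents, {"m1", "m2"}, lockfile_hash)
```

Here `chosen` is `"m2"`. If the feature branch had changed the lockfile, no ancestor would match and the function would return `None`.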

What breaks at scale

Workspace sprawl: a team of 1,000 engineers can accumulate 50,000 idle workspaces, each holding a snapshot on expensive block storage. Auto-sleep and garbage collection (sleep after 30 min idle, delete after 14 days) are mandatory but must not destroy unsaved work.

The second failure mode is the 9am spike: everyone opens their IDE at the same time on Monday morning. Container scheduling must absorb this burst without queue delays. Pre-warming a pool of initialized base-image containers (with common dependencies but no user code) can cut cold start during the burst by around 80%.
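
The sleep and delete thresholds above can be sketched as a pure policy function. The `Workspace` fields are hypothetical, and `has_unsaved_work` is assumed to be computed elsewhere (for example from uncommitted or unpushed changes).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

SLEEP_AFTER = timedelta(minutes=30)
DELETE_AFTER = timedelta(days=14)


@dataclass
class Workspace:
    last_active: datetime
    running: bool
    has_unsaved_work: bool  # hypothetical flag, e.g. uncommitted changes


def next_action(ws: Workspace, now: datetime) -> str:
    idle = now - ws.last_active
    # Delete only when nothing would be lost; otherwise keep the snapshot.
    if idle >= DELETE_AFTER and not ws.has_unsaved_work:
        return "delete"
    if ws.running and idle >= SLEEP_AFTER:
        return "sleep"  # snapshot the disk, release compute
    return "keep"


now = datetime(2024, 1, 15, 9, 0)
stale = Workspace(last_active=now - timedelta(days=20), running=False, has_unsaved_work=True)
action = next_action(stale, now)
```

Here `action` is `"keep"`: the workspace is past the deletion threshold, but the unsaved-work flag blocks garbage collection.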

In production

GitHub Codespaces runs each workspace as a Docker container on an Azure VM, proxying the VS Code Extension Host over a tunneled WebSocket. Gitpod pioneered the prebuild model: on every push, the CI-like system runs workspace initialization (npm install, compile) and snapshots the result so the next open is instant. Replit is the most aggressive: it runs workspaces as Nix-managed environments inside Firecracker VMs for strong isolation, and its Multiplayer feature uses an OT/CRDT layer on top of a per-workspace process. The real challenge is filesystem synchronization — CRDTs keep editor buffers in sync but the container's on-disk files must also be updated. Replit solves this with a virtual filesystem that is the CRDT state; others flush on save. The gap between "the CRDT says X" and "the file on disk says X" is the source of most "why did my build fail" bugs.
