Collaborative IDE
Multiple users editing and running code together in a shared cloud workspace.
Requirements
Functional
- Shared editing
- Run/terminal
- Filesystem
- Live cursors
Non-functional
- Low-latency edits
- Isolated execution
Scale
Many concurrent workspaces
The approach
CRDT/OT for shared editing (like Docs); each workspace is a sandboxed container with a filesystem; terminal/run streamed over WS; per-workspace isolation.
Key components
CRDT edit sync · per-workspace container · WS terminal stream
Numbers that matter
- A cloud IDE workspace container typically needs 2–4 vCPUs and 4–8 GB RAM for a mid-size Node or Go project; memory is the binding constraint — the language server (LSP) alone can consume 1–2 GB for large TypeScript projects.
- A Language Server Protocol (LSP) round-trip for autocompletion, as in VS Code, should stay under 100 ms; beyond 200 ms users notice lag. Network hops between the browser, gateway, and container are the dominant cost.
- Gitpod and GitHub Codespaces report workspace cold start times of 30–60 seconds without prebuild caching, dropping to <5 seconds with a prebuilt image for common branch states.
- A CRDT sequence (Yjs Y.Text) representing a 10,000-line file uses roughly 2–5 MB of in-memory state; the operation history kept for undo can grow to 10–50 MB over a long editing session without garbage collection.
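As a rough worked example of why memory tends to bind first, the sketch below estimates how many workspaces fit on one host. The host size (64 vCPUs, 256 GB), the 16 GB held back for the host itself, and the 4:1 CPU oversubscription ratio are illustrative assumptions, not figures from any provider; only the per-workspace ranges come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    vcpus: int            # vCPUs on the host (assumed)
    ram_gb: int           # total RAM in GB (assumed)
    reserved_ram_gb: int  # kept back for the OS and agents (assumed)
    cpu_oversub: float    # vCPU oversubscription ratio (assumed)


def workspaces_per_host(host: Host, ws_vcpus: int, ws_ram_gb: int) -> dict:
    """Return the per-resource limits and which resource binds first."""
    by_cpu = int(host.vcpus * host.cpu_oversub // ws_vcpus)
    by_ram = (host.ram_gb - host.reserved_ram_gb) // ws_ram_gb
    return {
        "by_cpu": by_cpu,
        "by_ram": by_ram,
        "fits": min(by_cpu, by_ram),
        "binding": "ram" if by_ram <= by_cpu else "cpu",
    }


host = Host(vcpus=64, ram_gb=256, reserved_ram_gb=16, cpu_oversub=4.0)
small = workspaces_per_host(host, ws_vcpus=2, ws_ram_gb=4)  # low end of the range
large = workspaces_per_host(host, ws_vcpus=4, ws_ram_gb=8)  # high end of the range
```

Under these assumptions the host holds between 30 and 60 workspaces and RAM is the limit in both cases. Bursty editing workloads tolerate CPU oversubscription, whereas memory held by a resident language server generally cannot be shared the same way.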
Senior deep-dive
Each workspace is a stateful container — you cannot route a user to any pod arbitrarily; their filesystem, running process, and terminal session are pinned to one instance.
CRDT for the editor buffer, container for the runtime: the hardest integration is keeping the two in sync — when a collaborator edits a file, the change must be flushed to the container's filesystem before the next compile or LSP request.
Latency budget is unforgiving: keystroke → character on collaborator's screen must be <100ms or the collaboration feels broken; this demands colocation of the CRDT server and the WebSocket gateway, not a round-trip through a central region.
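The pinning constraint can be sketched as a small placement table: the first open picks a host, and every later connection for that workspace is sent to the same one. `WorkspaceRouter`, the host names, and the least-loaded placement rule are hypothetical choices for illustration; a real system would also persist this table and handle host failure.

```python
class WorkspaceRouter:
    """Pins each workspace to one host; later connections reuse the placement."""

    def __init__(self, hosts: list[str]) -> None:
        self._load = {h: 0 for h in hosts}       # host -> number of workspaces
        self._placement: dict[str, str] = {}     # workspace id -> host

    def route(self, workspace_id: str) -> str:
        host = self._placement.get(workspace_id)
        if host is None:
            # First open: place on the least-loaded host (ties broken by name).
            host = min(self._load, key=lambda h: (self._load[h], h))
            self._placement[workspace_id] = host
            self._load[host] += 1
        return host

    def release(self, workspace_id: str) -> None:
        """Forget the placement once the workspace is stopped or deleted."""
        host = self._placement.pop(workspace_id, None)
        if host is not None:
            self._load[host] -= 1


router = WorkspaceRouter(["host-a", "host-b"])
first = router.route("ws-1")
other = router.route("ws-2")
again = router.route("ws-1")  # editor, terminal, and LSP sockets all land here
```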
Workspace isolation: VMs vs containers
Untrusted user code mandates strong isolation. Plain Docker containers share the host kernel, so a kernel exploit escapes the sandbox. Firecracker microVMs (used by Fly.io, Replit, and AWS Lambda) boot in as little as 125 ms and provide VM-level isolation at near-container density: each microVM adds less than 5 MiB of memory overhead, and a host can create up to 150 microVMs per second. gVisor (used by Google Cloud Run) interposes a user-space kernel to intercept syscalls; it is lighter than a VM but not zero overhead. Choose based on your threat model: multi-tenant SaaS needs VM-level isolation; internal dev tooling can use containers with seccomp profiles.
Collaborative editing: CRDT over the editor buffer
Yjs is the de facto choice: it implements a YATA-based sequence CRDT optimized for text, has adapters for Monaco/CodeMirror, and supports a WebRTC or WebSocket provider. The per-workspace CRDT server (can be a Cloudflare Durable Object, a Fly Machine, or a dedicated pod) receives operations, applies them to the authoritative document, and broadcasts to all connected clients. Awareness (cursor positions, selections) is handled separately via ephemeral broadcast — no need to persist it.
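To make the convergence property concrete, here is a minimal sequence CRDT in the style of RGA. It is a deliberately simplified stand-in, not Yjs's YATA implementation: it stores one item per character, keeps deletions as tombstones, and assumes operations are delivered in causal order (an insert never arrives before the item it is anchored to). All class and method names are hypothetical.

```python
from dataclasses import dataclass, field

ItemId = tuple[int, str]  # (Lamport counter, site id), compared lexicographically


@dataclass
class Item:
    id: ItemId
    char: str
    deleted: bool = False  # tombstone: deleted items remain as anchors


@dataclass
class Replica:
    site: str
    clock: int = 0
    items: list[Item] = field(default_factory=list)

    def text(self) -> str:
        return "".join(i.char for i in self.items if not i.deleted)

    def _index(self, item_id: ItemId) -> int:
        return next(n for n, i in enumerate(self.items) if i.id == item_id)

    def insert(self, pos: int, char: str) -> tuple:
        """Local insert at visible position `pos`; returns the op to broadcast."""
        visible = [i for i in self.items if not i.deleted]
        after = visible[pos - 1].id if pos > 0 else None
        self.clock += 1
        op = ("ins", (self.clock, self.site), char, after)
        self.apply(op)
        return op

    def delete(self, pos: int) -> tuple:
        """Local delete of the visible character at `pos`; returns the op."""
        visible = [i for i in self.items if not i.deleted]
        op = ("del", visible[pos].id)
        self.apply(op)
        return op

    def apply(self, op: tuple) -> None:
        """Apply a local or remote op (assumes causal delivery)."""
        if op[0] == "del":
            self.items[self._index(op[1])].deleted = True
            return
        _, new_id, char, after = op
        self.clock = max(self.clock, new_id[0])
        if any(i.id == new_id for i in self.items):
            return  # duplicate delivery is a no-op
        at = 0 if after is None else self._index(after) + 1
        # Skip concurrent inserts at the same anchor that carry a larger id,
        # so every replica orders them identically.
        while at < len(self.items) and self.items[at].id > new_id:
            at += 1
        self.items.insert(at, Item(new_id, char))


alice, bob = Replica("alice"), Replica("bob")
for op in (alice.insert(0, "a"), alice.insert(1, "b")):
    bob.apply(op)                      # both replicas now hold "ab"

op_x = alice.insert(1, "X")            # concurrent edits at the same position
op_y = bob.insert(1, "Y")
op_d = bob.delete(2)                   # bob also deletes "b"

for op in (op_y, op_d):
    alice.apply(op)
bob.apply(op_x)
```

Both replicas end with the same text even though they applied the operations in different orders; a last-write-wins scheme would have dropped one of the two inserts. A per-workspace server in this model only needs to apply each op to its own replica and relay it to the other clients.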
Filesystem sync: bridging CRDT and disk
The CRDT document is the in-memory truth; the container filesystem is the execution truth. The server must propagate CRDT changes to the container's filesystem (as a delta or a full-file write) so that `npm run build` inside the container sees current code. Strategies: flush on save (edits are shared between collaborators but reach disk only on save), virtual FS (a FUSE mount backed by the CRDT store, Replit's approach), or continuous sync (a debounced write, for example after 500ms without edits). The FUSE approach is cleanest but adds ~1ms latency to every file read in the container.
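The continuous-sync strategy might look like the following sketch: the server remembers the latest content per file, writes it once the file has been quiet for the debounce interval, and forces a flush before anything in the container reads the tree. `DebouncedFlusher` and its methods are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and full-file writes are assumed rather than deltas.

```python
import os
import tempfile


class DebouncedFlusher:
    """Buffers the latest content per path and writes it after a quiet period."""

    def __init__(self, root: str, quiet_seconds: float = 0.5) -> None:
        self.root = root
        self.quiet_seconds = quiet_seconds
        # relative path -> (latest content, time of last edit)
        self._dirty: dict[str, tuple[str, float]] = {}

    def on_edit(self, rel_path: str, content: str, now: float) -> None:
        """Record the document state after a CRDT update was applied."""
        self._dirty[rel_path] = (content, now)

    def tick(self, now: float) -> list[str]:
        """Flush files that have been quiet long enough; return their paths."""
        due = [p for p, (_, t) in self._dirty.items()
               if now - t >= self.quiet_seconds]
        for path in due:
            self._write(path, self._dirty.pop(path)[0])
        return due

    def flush_all(self) -> list[str]:
        """Call before a build or LSP request so disk matches the document."""
        paths = list(self._dirty)
        for path in paths:
            self._write(path, self._dirty.pop(path)[0])
        return paths

    def _write(self, rel_path: str, content: str) -> None:
        target = os.path.join(self.root, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        tmp = target + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            f.write(content)
        os.replace(tmp, target)  # rename so readers never see a partial file


workdir = tempfile.mkdtemp()
flusher = DebouncedFlusher(workdir)
flusher.on_edit("src/app.ts", "let x = 1", now=0.0)
flusher.on_edit("src/app.ts", "let x = 12", now=0.3)
early = flusher.tick(now=0.6)  # only 0.3 s since the last edit: nothing due
late = flusher.tick(now=1.0)   # quiet for 0.7 s: the file is written
```

Pure debouncing can starve a file while someone types continuously, which is why the sketch exposes `flush_all` for the moments when the container is about to read the tree.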
Terminal and process streaming
A terminal session inside the workspace container is not stateless: it has a PTY with scrollback, running processes, and environment. It is proxied over a WebSocket as raw PTY bytes (xterm.js on the client renders it). The user's connection must be sticky to the container; you cannot re-route it to another instance mid-session. On reconnect, the PTY session must survive: use tmux or screen as a resilience wrapper inside the container so a WebSocket drop doesn't kill the running process.
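The same idea, a session that outlives any single connection, can be sketched on the proxy side. This is a complement to the tmux/screen wrapper rather than a replacement for it, and the `TerminalSession` class, its 64 KB scrollback limit, and the callback-based `send` are all assumptions for illustration. Replaying raw bytes is also naive: full-screen programs would need a terminal state snapshot instead.

```python
from typing import Callable


class TerminalSession:
    """Holds PTY output independently of any one client connection."""

    def __init__(self, scrollback_bytes: int = 64 * 1024) -> None:
        self._limit = scrollback_bytes
        self._scrollback = bytearray()
        self._clients: dict[str, Callable[[bytes], None]] = {}

    def attach(self, client_id: str, send: Callable[[bytes], None]) -> None:
        """Register a connection and replay recent output to it."""
        if self._scrollback:
            send(bytes(self._scrollback))
        self._clients[client_id] = send

    def detach(self, client_id: str) -> None:
        """Drop a connection; the session and its buffer stay alive."""
        self._clients.pop(client_id, None)

    def on_pty_output(self, data: bytes) -> None:
        """Record bytes read from the PTY and fan them out to live clients."""
        self._scrollback += data
        overflow = len(self._scrollback) - self._limit
        if overflow > 0:
            del self._scrollback[:overflow]  # keep only the newest bytes
        for send in list(self._clients.values()):
            send(data)


session = TerminalSession()
first_conn: list[bytes] = []
session.attach("tab-1", first_conn.append)
session.on_pty_output(b"$ npm test\r\n")
session.detach("tab-1")                      # WebSocket drops, process keeps running
session.on_pty_output(b"42 passing\r\n")
second_conn: list[bytes] = []
session.attach("tab-1", second_conn.append)  # reconnect replays the scrollback
```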
Prebuild and warm-start optimization
Cold start (clone repo + npm install + compile) can take 2–5 minutes. Prebuilds run this pipeline headlessly on every push, snapshot the resulting container image or filesystem, and store it. When a user opens the workspace, the prebuild is cloned and started — cold start drops to <5 seconds. The tricky part is cache invalidation: a prebuild for `main` is useful for a feature branch only if the lockfile hasn't changed. Branch-level prebuild inheritance (walk the git graph to find the nearest ancestor with a valid prebuild) is what Gitpod implements.
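Prebuild inheritance can be sketched as a breadth-first walk over the commit graph that stops at the nearest ancestor whose prebuild was produced from the same lockfile. This is an illustration of the idea, not Gitpod's actual implementation; the commit names, the `parents` and `lockfile_hash` maps, and the rule "valid means identical lockfile hash" are assumptions.

```python
from collections import deque
from typing import Optional


def nearest_prebuild(
    commit: str,
    parents: dict[str, list[str]],
    lockfile_hash: dict[str, str],
    prebuilds: set[str],
) -> Optional[str]:
    """Return the closest commit (self or ancestor) with a usable prebuild."""
    wanted = lockfile_hash[commit]
    seen = {commit}
    queue = deque([commit])
    while queue:
        current = queue.popleft()
        # Usable only if a prebuild exists and dependencies are unchanged.
        if current in prebuilds and lockfile_hash[current] == wanted:
            return current
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None  # no match: fall back to a full cold start


# main: m1 <- m2 <- m3; feature branch: m2 <- f1 <- f2
parents = {"m2": ["m1"], "m3": ["m2"], "f1": ["m2"], "f2": ["f1"]}
lockfile_hash = {"m1": "L1", "m2": "L2", "m3": "L2", "f1": "L2", "f2": "L3"}
prebuilds = {"m1", "m2", "m3"}  # prebuilds were only run for pushes to main

reuse_for_f1 = nearest_prebuild("f1", parents, lockfile_hash, prebuilds)
reuse_for_f2 = nearest_prebuild("f2", parents, lockfile_hash, prebuilds)
```

Here `f1` can start from the `m2` prebuild because its lockfile is unchanged, while `f2` modified the lockfile and finds no usable ancestor.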
What breaks at scale
Workspace sprawl: a team of 1,000 engineers can accumulate 50,000 idle workspaces, each holding a snapshot on expensive block storage. Auto-sleep and garbage collection (sleep after 30 min idle, delete after 14 days) are mandatory but must not destroy unsaved work. The second failure mode is the 9am spike: everyone opens their IDE at the same time on Monday morning. Container scheduling must absorb this burst without queue delays; pre-warming a pool of initialized base-image containers (with common dependencies but no user code) cuts cold start for the burst by 80%.
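The sleep and delete rules can be written as one small policy function. The thresholds are the ones quoted above; treating "unsaved work" as uncommitted or unpushed changes, and the `Action` names, are assumptions made for this sketch.

```python
from datetime import timedelta
from enum import Enum


class Action(Enum):
    KEEP = "keep"      # leave the workspace in its current state
    SLEEP = "sleep"    # stop the container, keep the disk snapshot
    DELETE = "delete"  # drop the snapshot as well


SLEEP_AFTER = timedelta(minutes=30)
DELETE_AFTER = timedelta(days=14)


def next_action(idle: timedelta, running: bool, has_unpushed_changes: bool) -> Action:
    """Decide what the lifecycle controller should do with one workspace."""
    if idle < SLEEP_AFTER:
        return Action.KEEP
    if idle >= DELETE_AFTER and not has_unpushed_changes:
        return Action.DELETE
    # Idle, but either too recent to delete or still holding unpushed work:
    # make sure it is not burning compute, and keep the snapshot.
    return Action.SLEEP if running else Action.KEEP


decisions = [
    next_action(timedelta(minutes=10), running=True, has_unpushed_changes=False),
    next_action(timedelta(minutes=45), running=True, has_unpushed_changes=False),
    next_action(timedelta(days=20), running=False, has_unpushed_changes=False),
    next_action(timedelta(days=20), running=False, has_unpushed_changes=True),
]
```

The last case is the one that protects users: an old workspace with unpushed changes is kept asleep rather than garbage-collected, at the cost of retaining its snapshot.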
In production
GitHub Codespaces runs each workspace as a Docker container on an Azure VM, proxying the VS Code Extension Host over a tunneled WebSocket. Gitpod pioneered the prebuild model: on every push, the CI-like system runs workspace initialization (npm install, compile) and snapshots the result so the next open is instant. Replit is the most aggressive: it runs workspaces as Nix-managed environments inside Firecracker VMs for strong isolation, and its Multiplayer feature uses an OT/CRDT layer on top of a per-workspace process. The real challenge is filesystem synchronization — CRDTs keep editor buffers in sync but the container's on-disk files must also be updated. Replit solves this with a virtual filesystem that is the CRDT state; others flush on save. The gap between "the CRDT says X" and "the file on disk says X" is the source of most "why did my build fail" bugs.
Common mistakes
- No execution sandboxing
- Keeping idle containers hot (cost)
- Last-write-wins on shared files