System Design Library

Online Judge (LeetCode)

Run untrusted user code against test cases safely, at scale.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Submit code
  • Run against tests
  • Return verdict (AC/WA/TLE)
  • Languages

Non-functional

  • Secure isolation
  • Bounded time/memory
  • Fair queueing

Scale

Spiky during contests

The approach

Submissions → queue → pool of sandboxed runners (containers/gVisor) with CPU/mem/time limits and no network; results returned async; autoscale runners for contests.

Key components

API → submission queue → sandboxed runner pool → result store

Numbers that matter

Senior deep-dive

Sandboxed execution is the only non-negotiable — a compromised runner that lets user code escape to the host network or filesystem is a security incident, not a reliability bug; the entire architecture exists to make that impossible.

Async results via polling or WebSocket are required — code execution takes 0.1s–30s; a synchronous HTTP request would hold a connection for the full duration and collapse under contest load; the submission is a job, not a function call.

The queue absorbs contest spikes — submission rate at contest start can be 100× normal; the queue is the shock absorber, and autoscaling runners off the queue depth is the primary capacity management lever.

Sandbox isolation: the non-negotiable invariant

The sandbox must prevent network access (no exfiltration of test cases, no C&C), filesystem escape (no reading /etc/passwd or other users' files), and resource exhaustion (no fork bombs, no disk fill). seccomp-bpf is the primary syscall filter; a container with `--network none` and a read-only root filesystem covers most attack surface. gVisor adds a user-space kernel between the container and the host, preventing kernel exploits from escaping the sandbox entirely — the right choice when your threat model includes malicious users specifically targeting your infrastructure.

Async submission pipeline: job queue as the core pattern

A submission is written to a durable queue (SQS, Kafka, or a DB-backed queue) and the client gets a submission ID immediately. Workers pull jobs and execute them in sandboxes. Results are written to a result store (Redis or DB). Clients either poll (`GET /submission/{id}/result`) or receive results over a persistent WebSocket. The WebSocket approach is preferable for UX (instant notification) but requires the gateway to maintain a connection-to-submission-ID map for routing results back.

Container warm pools: hiding startup latency

Cold container starts (pulling image, running init) add 500ms–2s to the first submission's perceived latency. A warm pool pre-spawns N idle sandbox containers per runner host. When a job arrives, a warm container is assigned, the solution injected (via a shared volume or stdin), executed, and then the container is reset (not restarted) for the next job — resetting is faster than restarting but requires careful cleanup to prevent state leakage between users' submissions.

Test case design: fast-fail and hidden cases

Test cases run in order; the runner short-circuits on first failure to reduce occupancy. Hidden test cases (not shown to users) are the mechanism for preventing hardcoded answers — solutions that pass all visible cases but fail hidden ones get a Wrong Answer verdict. The hidden test set must be stored separately from the runner binary and injected at execution time, never exposed in error messages or timing signals (timing can leak which test case failed).

Contest autoscaling: queue depth is the signal

At contest start, submission rate spikes 10–100× within seconds. The correct autoscaling signal is queue depth + age (oldest unprocessed job age), not CPU utilization — CPU lags queue depth by the time a job is executing. Pre-scaling before the contest (predictive scaling) is the most reliable approach; reactive scaling is too slow to absorb the initial spike. Runner heterogeneity (different CPU speeds on different runners) produces unfair TLE verdicts — solutions should be benchmarked against a reference solution, not raw wall-clock time.

What breaks at scale

Warm pool exhaustion under sudden load spikes means all new submissions hit cold starts simultaneously, causing a latency cliff right when users care most (contest start). Queue back-pressure can let the oldest jobs age to minutes — users who submitted early get results last, which is unfair in a timed contest. Language runtime isolation is harder than it looks: Python's `import` can execute arbitrary code at module load time; a restricted `importlib` allowlist is required. Sandbox escape attempts via timing side-channels (measuring execution time of a no-op to infer system load) are low-risk but real, and suppressing them requires noise injection or deterministic time accounting.

In production

LeetCode uses Docker containers with seccomp profiles to sandbox execution; the seccomp profile blocks dangerous syscalls (network, fork bomb, device access). Google's internal systems use gVisor (a user-space kernel) for stronger isolation with lower overhead than full VMs. The real engineering challenge is I/O isolation — a user's solution must not be able to read another user's test cases or timing signals from the shared host. Contest fairness requires that two correct solutions in the same language get consistent timing — non-uniform runner hardware skews TLE verdicts, so runners should be homogeneous and results include the platform's own runtime measurement, not wall-clock time alone.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →