System Design Library

Code Deployment (CI/CD)

Build, test and safely roll out code to thousands of servers.

Requirements

Functional

  • Trigger pipeline
  • Build + test
  • Artifact store
  • Staged rollout + rollback

Non-functional

  • Reliable
  • Auditable
  • Fast rollback

Scale

Thousands of deploy targets

The approach

A pipeline orchestrator runs build and test stages on queued workers and publishes artifacts to a registry. Rollout starts with a canary and proceeds in progressive waves gated by health checks, with automatic rollback on regression.

Key components

Trigger → orchestrator → build/test workers → artifact registry → rollout controller

Numbers that matter

Senior deep-dive

The artifact registry is the contract between build and deploy — immutable, content-addressed images mean what you tested is exactly what runs in production.

Canary with automatic rollback is the safest default for progressive rollout: a small, fixed percentage of traffic hits the new version, a health signal (error rate, p99 latency) gates promotion, and a threshold breach triggers an automatic revert without human intervention.

Flaky tests are the number-one pipeline velocity killer: with 100 independent tests that each flake 5% of the time, 1 - 0.95^100 ≈ 99.4% of runs see at least one spurious failure. Invest heavily in test quarantine and determinism before raw throughput.
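The arithmetic behind that figure can be checked with a short sketch. It assumes every test flakes independently with the same probability, which is a simplification (real flakes are often correlated through shared infrastructure); the function name is illustrative only.

```python
def p_any_flake(per_test_flake_rate: float, num_tests: int) -> float:
    """Probability that at least one of num_tests independent tests flakes.

    Assumes identical, independent flake probabilities (a simplification).
    """
    if not 0.0 <= per_test_flake_rate <= 1.0:
        raise ValueError("per_test_flake_rate must be between 0 and 1")
    # P(at least one flake) = 1 - P(no test flakes)
    return 1.0 - (1.0 - per_test_flake_rate) ** num_tests


# The case from the text: 5% per-test flake rate across 100 tests.
print(f"{p_any_flake(0.05, 100):.1%}")
```

Under the same assumption, cutting the per-test rate to 0.1% brings the per-run figure down to roughly 9.5%, which is why determinism work pays off faster than adding runners.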

Build: remote caching and hermetic environments

The fastest build is one that doesn't run. Remote action caching (Bazel/Buck2 + a shared CAS) keys each build step on a hash of its inputs; a cache hit skips execution entirely. This only works if builds are hermetic — no ambient environment variables, system-installed tools, or network calls during build. Hermeticity is hard to enforce but is the prerequisite for any meaningful cache hit rate.
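To make the caching idea concrete, here is a minimal sketch of input-keyed action caching. It is not how Bazel or Buck2 implement their caches; `action_key` and `ActionCache` are hypothetical names, and the in-memory dictionary stands in for a shared remote store.

```python
import hashlib
import json
from typing import Callable, Dict, List


def action_key(command: List[str], input_digests: Dict[str, str], env: Dict[str, str]) -> str:
    """Derive a cache key from everything allowed to influence the step's output.

    Hermeticity is what makes this list complete: anything the step reads that
    is NOT in the key (ambient env vars, system tools) can poison the cache.
    """
    payload = json.dumps(
        {"command": command, "inputs": input_digests, "env": env},
        sort_keys=True,  # stable ordering so equal inputs give equal keys
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ActionCache:
    """In-memory stand-in for a shared remote cache (assumption for the sketch)."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def run(self, key: str, execute: Callable[[], bytes]) -> bytes:
        """Return the cached output for key, executing the step only on a miss."""
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        output = execute()
        self._store[key] = output
        return output
```

A step whose command, input digests, and declared environment are unchanged produces the same key and is never executed again; changing any one input digest produces a new key and forces a rebuild of that step only.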

Test: parallelism, sharding and flake management

Test suites that take 30+ minutes are not run; they are skipped. Horizontal sharding splits the test suite across N parallel runners; the bottleneck shifts to the slowest shard, so balance shards by historical duration. Flaky test quarantine is mandatory at scale: a test that fails intermittently is automatically quarantined (skipped from the required gate, tracked separately) so it doesn't block unrelated merges. Flake debt accumulates fast — track flakiness rate as a team KPI.
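Balancing shards by historical duration can be sketched with a greedy longest-first assignment: sort tests from slowest to fastest and always place the next test on the currently lightest shard. This is one common heuristic, not the only option; `balance_shards` and the duration map are illustrative assumptions.

```python
import heapq
from typing import Dict, List


def balance_shards(durations: Dict[str, float], num_shards: int) -> List[List[str]]:
    """Assign tests to shards so total historical duration per shard is roughly even."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Heap entries are (total_seconds, shard_index); the index breaks ties deterministically.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    # Slowest tests first; test name is a tie-breaker for reproducible output.
    for test, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        shards[idx].append(test)
        heapq.heappush(heap, (total + seconds, idx))
    return shards


history = {"a": 10.0, "b": 9.0, "c": 2.0, "d": 1.0}  # seconds, hypothetical data
print(balance_shards(history, 2))
```

With the sample data both shards end up with 11 seconds of work, whereas splitting alphabetically would give 19 seconds and 3 seconds.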

Artifact management: immutability and provenance

The build produces a content-addressed artifact (a Docker image referenced by digest, not tag; a Helm chart referenced by SHA). Tags are pointers, not guarantees, and `latest` is the enemy of reproducible deploys. A container registry (ECR, GAR, Harbor) stores layers and manifests; a promotion policy copies an image from the `dev` registry to `staging` and then `prod` without rebuilding it. SLSA provenance attestations are signed statements that record the build inputs and outputs, so you can prove which commit produced the running image.
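The difference between a digest and a tag, and promotion without rebuilding, can be illustrated with a toy registry. This is a sketch of the concept only, not the OCI registry API; `Registry` and `promote` are hypothetical names.

```python
import hashlib
from typing import Dict


class Registry:
    """Toy registry: blobs are stored by digest; tags are mutable pointers to digests."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.tags: Dict[str, str] = {}

    def push(self, content: bytes, tag: str) -> str:
        """Store content under its SHA-256 digest and point tag at it."""
        digest = "sha256:" + hashlib.sha256(content).hexdigest()
        self.blobs[digest] = content
        self.tags[tag] = digest  # re-pushing a tag silently repoints it
        return digest

    def pull(self, ref: str) -> bytes:
        """Resolve a digest or tag and verify the bytes match the digest."""
        digest = ref if ref.startswith("sha256:") else self.tags[ref]
        content = self.blobs[digest]
        if "sha256:" + hashlib.sha256(content).hexdigest() != digest:
            raise ValueError("digest mismatch: stored content was altered")
        return content


def promote(src: Registry, dst: Registry, digest: str, tag: str) -> None:
    """Copy the exact tested bytes to the next environment; nothing is rebuilt."""
    dst.blobs[digest] = src.pull(digest)
    dst.tags[tag] = digest
```

Pushing twice to `latest` changes what the tag resolves to, while the digest returned by the first push keeps resolving to the original bytes. Deploying by digest is what ties production to the artifact that was tested.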

Progressive rollout: canary, blue-green, ring

Blue-green gives instant rollback (flip the load balancer) but requires 2x capacity. Canary is cheaper: route 1% → 10% → 50% → 100% of traffic with metric gates between stages. The gate compares a baseline cohort (old version, same traffic shape) against the canary using a statistical test rather than a raw threshold; this is the model behind Spinnaker's Kayenta. Ring-based deploys (internal users → beta region → prod region) are common for infrastructure services where traffic shaping is harder.
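A metric gate of this kind can be sketched as a one-sided two-proportion z-test on error rates. This is a deliberately simplified stand-in and not Kayenta's algorithm; the `Cohort` type, the function names, and the 2.33 threshold (roughly a 1% one-sided significance level) are assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Cohort:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests


def canary_passes(baseline: Cohort, canary: Cohort, z_threshold: float = 2.33) -> bool:
    """Fail the gate only if the canary error rate is significantly worse than baseline."""
    pooled = (baseline.errors + canary.errors) / (baseline.requests + canary.requests)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline.requests + 1 / canary.requests))
    if se == 0.0:
        return True  # both cohorts have identical rates (all success or all failure)
    z = (canary.error_rate - baseline.error_rate) / se
    return z < z_threshold


STAGES = [1, 10, 50, 100]  # percent of traffic, as in the text


def next_action(stage_index: int, baseline: Cohort, canary: Cohort) -> str:
    """Decide what the rollout controller does after observing the current stage."""
    if not canary_passes(baseline, canary):
        return "rollback"
    if stage_index == len(STAGES) - 1:
        return "done"
    return f"promote to {STAGES[stage_index + 1]}%"
```

Comparing against a concurrent baseline, instead of a fixed error budget, keeps the gate from failing when both cohorts degrade together for reasons unrelated to the new version, such as a dependency outage.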

Rollback: automatic and tested

Automatic rollback on breach of an error-rate SLO is the goal, but it requires a reliable signal — flaky metrics cause false rollbacks, which erode trust. The rollback mechanism itself must be tested regularly (chaos engineering: deliberately deploy a bad canary and verify the rollback fires). A deploy lock prevents simultaneous rollouts from multiple branches from racing; a separate freeze window blocks deploys during high-traffic periods.
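One simple way to keep a noisy metric from causing false rollbacks is to require several consecutive breaching samples before firing. The sketch below assumes that policy; `RollbackTrigger`, the threshold, and the window size are hypothetical, and production systems often use burn-rate or statistical checks instead.

```python
from collections import deque
from typing import Deque


class RollbackTrigger:
    """Fire only after `required` consecutive samples exceed the error-rate threshold."""

    def __init__(self, threshold: float, required: int = 3) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self.threshold = threshold
        # Fixed-size window: old samples fall off automatically.
        self._recent: Deque[bool] = deque(maxlen=required)

    def observe(self, error_rate: float) -> bool:
        """Record one sample; return True when a rollback should start."""
        self._recent.append(error_rate > self.threshold)
        return len(self._recent) == self._recent.maxlen and all(self._recent)


trigger = RollbackTrigger(threshold=0.05, required=3)
samples = [0.10, 0.01, 0.10, 0.10, 0.10]  # hypothetical error rates per interval
print([trigger.observe(s) for s in samples])
```

The isolated spike in the first sample does not fire the trigger; only the sustained breach at the end does. The trade-off is detection latency: a larger window means fewer false rollbacks but a longer exposure to a genuinely bad release.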

What breaks at scale

Thundering herd on artifact pull: deploying to 10,000 pods simultaneously saturates the registry and the underlying object store. Mitigate with P2P image distribution (Dragonfly, Uber's Kraken), where nodes seed each other rather than all pulling from a central registry.

The second failure mode is pipeline queue starvation: a monorepo with 1,000 engineers generates more commits than runners can process, creating a queue that delays feedback by hours. The fix is merge queues (batching commits into a single CI run before merging) combined with change-impact analysis that runs only the tests affected by the diff.

In production

Google's Blaze/Forge stack and Meta's Buck2 + Sandcastle are the industrial benchmarks: with remote execution and a shared cache, a clean build of a multi-million-line monorepo can finish in minutes rather than hours, because most actions are cache hits. For deployments, Spinnaker (open-sourced by Netflix) codifies the canary-then-bake-then-full pattern with automated metric comparison against a baseline.

The real challenge is environment parity: a test that passes in CI can still fail in prod even though the container image is identical, because the surrounding infra (secrets injection, network policy, sidecar versions) differs. Teams that invest in ephemeral pre-prod environments (each PR gets its own namespace) catch this class of failure before merge. Artifact provenance (SLSA supply-chain levels) is the emerging requirement: signing images and attesting build inputs lets you prove what ran in prod and detect supply-chain compromise.

Common mistakes
