System Design Library

Service Mesh (Envoy/Istio)

Manage service-to-service traffic: routing, retries, mTLS, observability — without app changes.

Requirements

Functional

Traffic routing/splitting
Retries/timeouts/circuit-breaking
mTLS
Telemetry

Non-functional

Low overhead
Centrally controlled

Scale

Thousands of services

The approach

A sidecar proxy (Envoy) runs next to each service intercepting all traffic; a control plane pushes config (routes, policies, certs); the data plane enforces retries, timeouts, mTLS and emits uniform telemetry — all transparent to app code.

Key components

Control plane → sidecar proxies (data plane) per service

Numbers that matter

Envoy sidecar adds approximately 2–5ms of latency per hop under load (P99); P50 overhead is typically <1ms for simple proxy operations.
A service mesh control plane (Istiod) comfortably manages ~1,000–2,000 Envoy proxies before needing horizontal scaling or federated control planes.
mTLS certificate rotation via SPIFFE/SPIRE can happen every 24 hours automatically — compared to manual cert management where certs often live for years.
Envoy's memory footprint per sidecar is typically 50–150MB resident, which adds up materially at thousands of pods — a real cost that drives mesh adoption debates.
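To see how the per-sidecar footprint adds up, here is a back-of-the-envelope calculation. The fleet size of 5,000 pods is an assumed figure chosen for illustration (the text only says "thousands of pods"), and the helper name is hypothetical.

```python
def sidecar_fleet_memory_gb(pods: int, mb_per_sidecar: float) -> float:
    """Total resident memory consumed by sidecars, in GB (1 GB = 1000 MB here)."""
    return pods * mb_per_sidecar / 1000


# Assumed fleet size; the 50-150MB range comes from the figures above.
PODS = 5000
low = sidecar_fleet_memory_gb(PODS, 50)    # 250.0 GB
high = sidecar_fleet_memory_gb(PODS, 150)  # 750.0 GB
print(f"{PODS} sidecars use roughly {low:.0f}-{high:.0f} GB of memory")
```

Under these assumptions the mesh alone consumes memory on the order of several large nodes, before any application container is counted.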

Senior deep-dive

The sidecar is the insight: moving cross-cutting concerns (retries, mTLS, tracing) into a per-pod proxy decouples them from application code, but every proxy adds latency (typically under 1ms at P50, roughly 2–5ms at P99 under load) and 50–150MB of memory per pod.

The control plane is where the complexity lives: xDS protocol (Envoy's discovery API) lets the control plane push config to thousands of sidecars without restarts, but misconfigured policy propagation can silently break traffic.

mTLS is the real reason to adopt a mesh — automatic certificate rotation and workload identity replace a sprawling web of static API keys and network-level ACLs.

xDS: the control-plane protocol that everything rests on

Envoy discovers its config (listeners, routes, clusters, endpoints) via the xDS protocol, a streaming gRPC API. The control plane (Istiod or custom) pushes updates when topology or policy changes. The key insight is eventual consistency by design: there is a window between a change happening (for example, a pod starting or terminating) and each sidecar receiving the corresponding xDS update, during which that sidecar routes based on stale config. Envoy answers each update with an ACK or NACK to tell the control plane whether the config was accepted; a NACK means the previous config stays active, which is safer than a bad config going live.
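The ACK/NACK behavior can be illustrated with a toy model. This is a minimal sketch, not Envoy's implementation: the class `ProxyConfigState`, the `routes_are_valid` rule, and the config shape (a path-to-cluster mapping) are all hypothetical and exist only to show that a rejected update leaves the last accepted config in place.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class ProxyConfigState:
    """Hypothetical model of one sidecar's view of a single resource type."""
    active_version: Optional[str] = None
    active_config: Dict[str, str] = field(default_factory=dict)

    def apply(self, version: str, config: Dict[str, str],
              validate: Callable[[Dict[str, str]], bool]) -> str:
        """Switch to the new config and return "ACK" if it validates;
        otherwise return "NACK" and keep the previous config active."""
        if validate(config):
            self.active_version = version
            self.active_config = dict(config)
            return "ACK"
        return "NACK"


def routes_are_valid(config: Dict[str, str]) -> bool:
    # Toy validation rule (an assumption): every route needs a non-empty cluster.
    return all(cluster for cluster in config.values())


proxy = ProxyConfigState()
first = proxy.apply("v1", {"/pay": "payments-v1"}, routes_are_valid)
second = proxy.apply("v2", {"/pay": ""}, routes_are_valid)  # invalid, rejected
print(first, second, proxy.active_version)  # ACK NACK v1
```

After the rejected `v2` push, the proxy in this model still serves `v1`, which mirrors the "previous config stays active" guarantee described above.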

mTLS and SPIFFE workload identity

The mesh issues each workload a SPIFFE Verifiable Identity Document (SVID): an X.509 cert encoding the service's identity (`spiffe://cluster/ns/default/sa/payments`). Mutual TLS authenticates both sides of every connection, replacing network-level trust ("this IP is payments-service") with cryptographic workload identity. Certificate rotation every 1–24 hours is automatic, and the sidecar handles it transparently. The operational win is that a new service gets mTLS without any developer action; the risk is that cert issuance becomes a critical path: if the CA (SPIRE or istiod) is down, new pods can't get certs and fail readiness.
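As an illustration of what the identity string encodes, the sketch below splits an ID shaped like the example above into trust domain, namespace, and service account. It assumes the `ns/<namespace>/sa/<service-account>` path layout shown in the example; the function and type names are hypothetical, and other SPIFFE deployments may use different path conventions.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Parse an ID of the form spiffe://<trust-domain>/ns/<ns>/sa/<sa>."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    parts = parsed.path.strip("/").split("/")
    # Expect exactly: ns, <namespace>, sa, <service-account>, none empty.
    if len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa" or not all(parts):
        raise ValueError(f"unexpected SPIFFE path layout: {parsed.path!r}")
    return WorkloadIdentity(parsed.netloc, parts[1], parts[3])


identity = parse_spiffe_id("spiffe://cluster/ns/default/sa/payments")
print(identity.namespace, identity.service_account)  # default payments
```

An authorization policy can then match on namespace and service account rather than on source IP.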

Traffic management: where mesh earns its keep

Weighted routing (5% of traffic to v2, 95% to v1) for canaries is trivially configured in a VirtualService, with no code change needed. Circuit breaking (caps on connections and pending requests per destination) and outlier detection (eject a pod after 5 consecutive 5xx responses) are configurable at the mesh layer. The trap is retry amplification: if every service in a 5-hop call chain retries once on failure, a single failing request at the bottom of the chain can turn into `2^5 = 32` attempts against the last service. Retry budgets (max N retries per second per destination) prevent this but require mesh-level config discipline.
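The amplification arithmetic and a per-destination retry budget can be sketched as follows. This is a toy model under stated assumptions: `worst_case_requests` and `RetryBudget` are hypothetical names, the budget uses a simple fixed one-second window, and callers pass the clock value in explicitly so the behavior is easy to follow. A real mesh would configure this in the proxy rather than in application code.

```python
from typing import Dict, Tuple


def worst_case_requests(hops: int, retries_per_hop: int) -> int:
    """Worst-case attempts reaching the last service when every hop makes
    (1 + retries_per_hop) attempts for each request it receives."""
    return (1 + retries_per_hop) ** hops


class RetryBudget:
    """Allow at most max_retries_per_second retries per destination,
    counted in fixed one-second windows."""

    def __init__(self, max_retries_per_second: int) -> None:
        self.max_retries = max_retries_per_second
        # destination -> (window start in whole seconds, retries used)
        self.windows: Dict[str, Tuple[int, int]] = {}

    def allow_retry(self, destination: str, now: float) -> bool:
        second = int(now)
        start, used = self.windows.get(destination, (second, 0))
        if start != second:          # a new window begins, reset the count
            start, used = second, 0
        if used >= self.max_retries:
            self.windows[destination] = (start, used)
            return False
        self.windows[destination] = (start, used + 1)
        return True


print(worst_case_requests(hops=5, retries_per_hop=1))  # 32

budget = RetryBudget(max_retries_per_second=2)
decisions = [budget.allow_retry("payments", t) for t in (10.0, 10.5, 10.9, 11.0)]
print(decisions)  # [True, True, False, True]
```

The third retry inside the same second is refused, so a failing destination sees a bounded retry load instead of an exponential one.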

Observability: the underrated mesh benefit

Every Envoy sidecar emits golden signals (request count, error rate, latency percentiles) for every service-to-service call without application instrumentation. Distributed tracing works by Envoy auto-injecting B3 / W3C trace headers and exporting spans to Jaeger or Zipkin. The catch: trace header propagation still requires application cooperation — if a service doesn't forward the `x-b3-traceid` header when it makes downstream calls, the trace is broken. This is the most common "why doesn't tracing work" complaint in mesh deployments.
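The application-side cooperation amounts to copying trace context from the inbound request onto every outbound call. A minimal sketch, assuming headers are available as a plain mapping; the helper name `propagate_trace_headers` is hypothetical, while the header names are the standard B3 and W3C Trace Context ones plus Envoy's `x-request-id`.

```python
from typing import Dict, Mapping

# Headers that carry trace context between services.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(inbound: Mapping[str, str]) -> Dict[str, str]:
    """Return the subset of inbound headers that must be forwarded downstream.
    Header names are matched case-insensitively and emitted in lowercase."""
    lowered = {name.lower(): value for name, value in inbound.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


inbound_headers = {
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "Content-Type": "application/json",
}
outbound_headers = propagate_trace_headers(inbound_headers)
print(outbound_headers)
```

The service would merge `outbound_headers` into each downstream request it makes; tracing libraries typically do this automatically once they are wired into the HTTP client.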

Multi-cluster and federation: the complexity cliff

A single-cluster mesh is operationally manageable. Multi-cluster meshes (east-west gateway for cross-cluster mTLS, federated control planes) grow complexity non-linearly. Istio multi-cluster requires shared trust roots (same root CA) and explicit service endpoint export. The failure mode: a control plane networking partition between clusters can cause one cluster's sidecars to hold stale endpoint lists for services in the other cluster, routing to dead IPs until TTL expires. Health check integration and aggressive endpoint invalidation are the mitigations.

What breaks at scale

Config push storms are the primary scaling failure: a single Kubernetes node going down triggers endpoint updates for every service with pods on that node, causing a fan-out push to every sidecar that has a route to those services. At 1,000+ services and 10,000+ pods, this is thousands of simultaneous xDS pushes, so delta xDS and push rate limiting in the control plane are mandatory. The second failure is sidecar memory creep: unless its config is explicitly scoped, each Envoy holds routes, clusters, and endpoints for every service in the mesh, so large meshes with fine-grained routing rules can push per-sidecar memory above 500MB, which invalidates your pods' memory requests and triggers OOM kills.
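One common way to rate-limit pushes is to debounce them: collect changes until the event stream has been quiet for a short period, then send a single coalesced push. The sketch below is an illustrative model, not the control plane's actual code; `PushDebouncer`, its method names, and the one-second quiet period are assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Optional, Set


class PushDebouncer:
    """Coalesce endpoint-change events into one push per quiet period."""

    def __init__(self, quiet_period: float) -> None:
        self.quiet_period = quiet_period
        self.pending: Set[str] = set()
        self.last_event: Optional[float] = None

    def on_event(self, service: str, now: float) -> None:
        """Record that a service's endpoints changed at time `now`."""
        self.pending.add(service)
        self.last_event = now

    def flush_if_quiet(self, now: float) -> Optional[Set[str]]:
        """Return the batch of changed services once no event has arrived
        for quiet_period seconds; return None if there is nothing to push
        or events are still arriving."""
        if self.last_event is None or now - self.last_event < self.quiet_period:
            return None
        batch, self.pending = self.pending, set()
        self.last_event = None
        return batch


debouncer = PushDebouncer(quiet_period=1.0)
debouncer.on_event("payments", now=0.0)
debouncer.on_event("ledger", now=0.5)
early = debouncer.flush_if_quiet(now=1.0)   # None: last event was 0.5s ago
batch = debouncer.flush_if_quiet(now=2.0)   # both services in one push
print(early, sorted(batch))
```

Two endpoint changes result in one push instead of two; with a node failure touching hundreds of services, the same idea collapses hundreds of pushes into a handful, at the cost of slightly staler config during the quiet period.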

In production

Lyft originally built Envoy to solve their own service mesh problem; Istio packaged it with a control plane. Airbnb runs a large-scale Envoy mesh without Istio, using their own control plane built on xDS. The real challenge is not the data plane — it's control plane scalability: Istio's original Pilot/Galley architecture had a performance cliff around 500 services because every config change triggered a full push to all proxies. Delta xDS (incremental config updates) was the architectural fix, and it shipped in Envoy 1.16 / Istio 1.9. Without it, a Kubernetes HPA event caused a config flood that pushed stale routes to all sidecars simultaneously.

Common mistakes

Reimplementing resilience in each service
Ignoring sidecar latency/resource cost
Control plane as an unmonitored SPOF
