System Design Library

Distributed Transaction Coordinator

Commit a transaction spanning multiple services/databases atomically.

Open the interactive version → diagrams, practice & more

Requirements

Functional

  • Coordinate multi-service commit
  • All-or-nothing
  • Handle participant failure

Non-functional

  • Atomic across services
  • No permanent blocking

Scale

Cross-service workflows

The approach

Two options: 2PC (prepare→commit, strong but blocks if coordinator dies) over a consensus-replicated coordinator; or Saga (a sequence of local transactions with compensating actions) for long-running, loosely-coupled workflows.

Key components

Coordinator (consensus-backed) ↔ participants · or Saga orchestrator

Numbers that matter

Senior deep-dive

2PC blocks on coordinator failure — if the coordinator dies between prepare and commit, participants hold locks indefinitely until the coordinator recovers; this is not theoretical, it happens in every distributed system under rolling deploys.

Sagas trade atomicity for availability: you get eventual consistency and must write compensating transactions for every forward step, which is hard to get right under partial failure.

Idempotency keys are the load-bearing primitive — without them, retries after coordinator crashes cause double-commits, and no protocol above can save you.

2PC: correct but blocking

In the prepare phase, the coordinator asks all participants to lock resources and vote yes/no. In the commit phase, if all vote yes, the coordinator writes a commit record and tells participants to commit — the commit record's durability is the single point of correctness. If the coordinator crashes after writing the commit record but before notifying participants, participants are in-doubt: they hold locks and cannot decide. Recovery requires the coordinator to restart and re-send the decision — this is the blocking problem.

3PC and Paxos Commit: non-blocking alternatives

3PC adds a pre-commit phase to eliminate the blocking window but introduces new failure modes under network partition. In practice, production systems solve the coordinator failure problem by making the coordinator itself replicated via Raft/Paxos (as Spanner does) so it survives failures without blocking. The coordinator state machine — pending/prepared/committed/aborted — must be persisted before sending any message, following the write-ahead log principle.

Sagas: availability over atomicity

A Saga decomposes a distributed transaction into a sequence of local transactions T1→T2→T3, each with a compensating transaction C3→C2→C1 executed on failure. The key insight: no distributed locks are held between steps, so availability is high. The hard part is writing compensating transactions that are correct under all partial failure modes — a compensation that itself fails needs its own retry logic. Semantic rollback is not always possible: if T2 sent an email, C2 cannot un-send it.

Orchestration vs choreography

Orchestration (a central saga coordinator, e.g. Temporal workflow) makes the control flow explicit and observable — you can query the coordinator for saga state. Choreography (services react to events, no central coordinator) is simpler to deploy but makes the overall flow implicit and hard to debug. The non-obvious tradeoff: orchestration creates a single point of operational complexity (the orchestrator must be highly available); choreography creates distributed debugging hell when a saga gets stuck.

Idempotency: the foundational primitive

Every step in any distributed transaction protocol must be idempotent — the coordinator will retry on timeout without knowing if the previous attempt succeeded. The standard pattern is an idempotency key (UUID per logical operation) stored in the participant's DB; on retry, the participant checks the key and returns the cached result. Without this, coordinator retries cause phantom debits, duplicate inventory decrements, or double-sent emails that no transaction protocol above can prevent.

What breaks at scale

Lock contention under 2PC at high throughput — thousands of in-flight transactions each holding participant locks — creates deadlock cycles that the coordinator must detect and break via timeout, degrading throughput. Saga state explosion: a system with 1000 saga types and no centralized saga store makes it impossible to answer "why is order #X stuck?" — you need a durable saga log queryable by business key. Compensation storms: when a downstream failure causes thousands of concurrent sagas to roll back simultaneously, the compensating write load can overload the systems being compensated.

In production

XA transactions (the standard 2PC protocol) are supported by most SQL databases but are notoriously slow and rarely used in high-throughput systems — they were designed for database-to-message-broker atomicity, not microservices. Saga orchestration (as used by Temporal, Apache Camel, and Eventuate) is the dominant pattern in microservices: an orchestrator drives each step and issues compensations on failure. TCC (Try-Confirm-Cancel) is a hybrid: the Try phase reserves resources (like a seat hold), Confirm commits them, Cancel releases — Alibaba's Seata implements TCC at massive scale. The real challenge is compensation idempotency: a network retry of a Cancel can hit an already-cancelled resource; every compensating action must be idempotent.

Common mistakes

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →