Payment System / Wallet
Move money between accounts with zero double-spends, ever.
Requirements
Functional
- Charge/transfer
- Idempotent retries
- Ledger/audit
- Refunds/reconciliation
Non-functional
- Strong consistency
- Exactly-once effect
- Auditable
Scale
Correctness > raw throughput
The approach
Double-entry ledger (append-only); every operation carries an idempotency key; transfers are atomic within one database (ACID) and coordinated by a saga with compensating actions across services; reconciliation jobs catch drift. External providers are called with idempotency keys so retries are safe.
Key components
API (idempotency) → ledger DB (ACID) · saga orchestrator for cross-service · reconciliation
Numbers that matter
- A double-entry ledger append should complete in < 5ms at the DB layer — it's two inserts plus an index update, not a balance scan.
- Card network authorization round-trips (Visa/Mastercard via acquirer) add ~200–800ms of external latency — this is the dominant term in payment checkout latency.
- Idempotency windows are typically 24–72 hours — the period during which retrying the same key returns the cached result without re-executing; after that, the key is expired and a fresh request is allowed.
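- Stripe is commonly cited as processing ~1M API requests/minute at peak, with an append-only ledger replicated across regions for durability (figures of ~11 nines are quoted); treat both numbers as order-of-magnitude estimates rather than published guarantees.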
Senior deep-dive
The double-entry ledger is the only correct primitive — every money movement is two rows (debit + credit) in the same atomic write; a single-row balance update is a bug waiting to become a fraud vector.
Idempotency keys are not optional: external provider calls (banks, card networks) fail ambiguously, so every outbound request carries a stable key so a retry is a no-op, not a double-charge.
Reconciliation is the safety net, not the primary control — nightly or hourly batch jobs compare your ledger to bank statements and catch the drift that idempotency and transactions occasionally miss.
Double-entry: two rows or it never happened
Balance columns are a trap — they require read-modify-write under a lock or they race. A double-entry ledger (debit row + credit row, same transaction ID, atomic insert) never modifies existing rows, so it's lock-free at the row level and trivially auditable. Balance is derived by summing the ledger, not a stored field. Running balance denormalization (materializing current balance for fast reads) is an optimization layered on top, not the source of truth.
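The two-row write and the derived balance can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3`, not a production schema: the table name `ledger_entries`, the signed-amount convention, and the helper names are assumptions made for the example, and it deliberately omits overdraft checks.

```python
import sqlite3
import uuid


def open_ledger() -> sqlite3.Connection:
    # In-memory database for illustration; amounts are integer minor units (cents).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ledger_entries (
               entry_id INTEGER PRIMARY KEY,
               txn_id   TEXT NOT NULL,
               account  TEXT NOT NULL,
               amount   INTEGER NOT NULL  -- negative = leaves account, positive = enters
           )"""
    )
    conn.execute("CREATE INDEX idx_ledger_account ON ledger_entries(account)")
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount_cents: int) -> str:
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    txn_id = str(uuid.uuid4())
    insert = "INSERT INTO ledger_entries (txn_id, account, amount) VALUES (?, ?, ?)"
    # `with conn` commits both inserts together, or rolls both back on error.
    with conn:
        conn.execute(insert, (txn_id, src, -amount_cents))
        conn.execute(insert, (txn_id, dst, amount_cents))
    return txn_id


def balance(conn: sqlite3.Connection, account: str) -> int:
    # Balance is derived from the entries; no stored balance is ever updated.
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger_entries WHERE account = ?",
        (account,),
    ).fetchone()
    return row[0]
```

Because every transfer writes two rows that cancel out, the sum over the whole table stays at zero, which is a cheap invariant to check in monitoring.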
Idempotency keys: the contract with retrying callers
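Network failures leave callers uncertain: did the charge succeed? An idempotency key (a UUID the caller generates) lets the server return the original result for any retry within the window, without re-executing side effects. When the operation is a local database write, the server stores `(key → result)` in the same transaction as the operation itself, so if the key is in the store, the operation already ran. The subtle failure mode appears when the side effect is an external call that cannot share that transaction. Recording the key as succeeded before the charge means a crash in between leaves a 'succeeded' key for a charge that never happened; recording it only after the charge means a crash in between leaves a charge with no key, and the retry double-charges. The safe ordering is to record the key in an in-progress state first, pass the same key to the provider, and then store the result; a retry that finds an in-progress key resolves the outcome instead of re-executing.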
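For the case where the side effect lives in the same database as the key, storing the key and result atomically with the operation might look like the sketch below. The `charges` and `idempotency_keys` tables and the `charge` function are hypothetical names for illustration, and the "charge" here is only a local row insert, not a call to a real provider.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, result TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE charges ("
        "charge_id INTEGER PRIMARY KEY, account TEXT NOT NULL, amount INTEGER NOT NULL)"
    )
    return conn


def charge(conn: sqlite3.Connection, key: str, account: str, amount_cents: int) -> dict:
    # One transaction: the charge row and the key row commit together or not at all.
    with conn:
        row = conn.execute(
            "SELECT result FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            # Replay: return the stored result without a new side effect.
            return json.loads(row[0])
        cur = conn.execute(
            "INSERT INTO charges (account, amount) VALUES (?, ?)",
            (account, amount_cents),
        )
        result = {"charge_id": cur.lastrowid, "status": "succeeded"}
        conn.execute(
            "INSERT INTO idempotency_keys (key, result) VALUES (?, ?)",
            (key, json.dumps(result)),
        )
        return result
```

The primary key on `key` is what protects against two concurrent requests with the same key: the second insert fails and its transaction, including the duplicate charge row, rolls back. A fuller version would also verify that a replayed key arrives with the same request parameters.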
State machines for external calls: parking is not failure
Card network calls can return three answers: success, failure, or timeout with unknown outcome. A payment state machine must have a `pending_confirmation` state — the charge is in flight, outcome unknown. Background workers poll or wait for webhooks to resolve it. Without this state, a naive implementation either double-charges on retry or silently drops the payment. Every state transition is an append to the ledger, so replay reconstructs the full history.
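A small state machine makes the parked state concrete. In this sketch the state names other than `pending_confirmation`, the outcome strings `"approved"` and `"declined"`, and the `submit`/`resolve` helpers are assumptions for illustration; `call_provider` stands in for whatever client actually talks to the card network.

```python
from enum import Enum
from typing import Callable, List


class PaymentState(Enum):
    CREATED = "created"
    PENDING_CONFIRMATION = "pending_confirmation"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Legal transitions; terminal states allow none.
ALLOWED = {
    PaymentState.CREATED: {PaymentState.PENDING_CONFIRMATION},
    PaymentState.PENDING_CONFIRMATION: {PaymentState.SUCCEEDED, PaymentState.FAILED},
    PaymentState.SUCCEEDED: set(),
    PaymentState.FAILED: set(),
}

OUTCOMES = {"approved": PaymentState.SUCCEEDED, "declined": PaymentState.FAILED}


class Payment:
    def __init__(self, payment_id: str) -> None:
        self.payment_id = payment_id
        # Append-only history; the current state is the last entry.
        self.history: List[PaymentState] = [PaymentState.CREATED]

    @property
    def state(self) -> PaymentState:
        return self.history[-1]

    def transition(self, new_state: PaymentState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.history.append(new_state)


def resolve(payment: Payment, outcome: str) -> None:
    # Called by a webhook handler or polling worker once the outcome is known.
    payment.transition(OUTCOMES[outcome])


def submit(payment: Payment, call_provider: Callable[[], str]) -> None:
    # Park the payment before the call, so a crash mid-call leaves it pending.
    payment.transition(PaymentState.PENDING_CONFIRMATION)
    try:
        outcome = call_provider()
    except TimeoutError:
        return  # outcome unknown: stay parked, do not retry the charge blindly
    resolve(payment, outcome)
```

Moving to `pending_confirmation` before the external call is a design choice: a timeout or a process crash then leaves a record that a worker can find and resolve, rather than a payment that looks as if it was never attempted.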
Saga vs 2PC: pick based on latency tolerance
For a payment touching card network + ledger + notification, 2PC is impractical — the card network is an external party that doesn't speak XA. Sagas sequence local transactions with compensating actions on failure: charge card → debit ledger → notify. If the ledger debit fails, a compensating refund hits the card. The saga log is the coordination record — it must be durable and must survive coordinator restarts to drive completion or rollback.
Reconciliation: the safety net that runs after the fact
Idempotency and transactions catch most bugs; reconciliation catches the rest. Nightly (or hourly) jobs pull bank statements and acquirer settlement files, match them to ledger entries by reference ID, and flag unmatched debits, duplicate credits, and amount mismatches. These are ops alerts, not user-facing errors. The non-obvious part: settlement dates differ from authorization dates (T+1 or T+2), so the reconciliation window must account for in-flight authorizations.
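The matching step might be sketched as below. The `Record` type, the `reconcile` function, and the `in_flight` parameter (reference IDs authorized too recently to have settled) are assumptions for illustration; real settlement files carry many more fields and need currency and date handling.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Record:
    reference_id: str
    amount_cents: int


def reconcile(
    ledger: Iterable[Record],
    statement: Iterable[Record],
    in_flight: FrozenSet[str] = frozenset(),
) -> Dict[str, List[str]]:
    ledger = list(ledger)
    statement = list(statement)
    ledger_amounts = {r.reference_id: r.amount_cents for r in ledger}
    statement_amounts = {r.reference_id: r.amount_cents for r in statement}
    statement_counts = Counter(r.reference_id for r in statement)
    return {
        # In the ledger but not settled yet; in-flight authorizations are excluded.
        "missing_from_statement": sorted(
            ref for ref in ledger_amounts
            if ref not in statement_amounts and ref not in in_flight
        ),
        # The bank reports money movement that the ledger never recorded.
        "missing_from_ledger": sorted(
            ref for ref in statement_amounts if ref not in ledger_amounts
        ),
        "amount_mismatch": sorted(
            ref for ref in ledger_amounts.keys() & statement_amounts.keys()
            if ledger_amounts[ref] != statement_amounts[ref]
        ),
        "duplicate_in_statement": sorted(
            ref for ref, count in statement_counts.items() if count > 1
        ),
    }
```

Each non-empty list in the result would feed an ops alert rather than a user-facing error.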
What breaks at scale
Hotspot accounts (a single merchant receiving thousands of concurrent payments) cause row-level lock contention if you ever update a balance column. A pure append-only ledger sidesteps this, but balance queries become table scans without a materialized view. The second failure mode is idempotency store saturation under a thundering herd of retries: after an outage, millions of callers retry at the same moment, each replaying its own key, and the synchronized burst overwhelms the idempotency store (Redis, for example). Jitter plus exponential backoff on the caller side is the primary fix; server-side rate limiting and load shedding bound the damage but cannot desynchronize the callers.
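On the caller side, exponential backoff with jitter can be sketched as follows. The function name, the default `base` and `cap` values, and the choice of "full jitter" (a uniform delay between zero and the exponential bound) are assumptions for the example; `send` stands in for whatever client call carries the idempotency key.

```python
import random
import time
from typing import Callable, Optional


def retry_with_backoff(
    send: Callable[[str], str],
    idempotency_key: str,
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> str:
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            # Every attempt reuses the same key, so a retry cannot double-charge.
            return send(idempotency_key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: random delay in [0, min(cap, base * 2^attempt)].
            sleep(rng.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError("max_attempts must be at least 1")
```

The randomness is what spreads callers apart: with plain exponential backoff, clients that failed together would still retry together, only later.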
In production
Large processors such as Stripe, Square, and PayPal are commonly described as using the same building blocks: a double-entry ledger on a transactional database, idempotency keys whose results are cached so network retries are safe, and saga-style orchestration across payment, ledger, and notification services for long-running flows, with compensating transactions on failure. Specific storage engines and key TTLs vary by company and change over time. The real engineering challenge is handling ambiguous external outcomes: a bank times out without confirming or rejecting, so your state machine must park the transaction in a 'pending confirmation' state and poll or wait for a webhook, all while holding no locks.
Common mistakes
- Mutable balance column (no audit, race-prone)
- No idempotency key → double charges
- Distributed transaction without compensation/saga