System Design Library

Payment Gateway (Stripe)

Process card payments via external networks reliably and idempotently.

Requirements

Functional

  • Charge/refund
  • Tokenize cards
  • Talk to acquirers/networks
  • Webhooks
  • Ledger

Non-functional

  • Exactly-once effect
  • PCI compliance
  • Auditable

Scale

Global, money-critical

The approach

Idempotency keys on every request; card data tokenized (PCI vault); charges orchestrated against acquirer networks with retries and state machines; double-entry ledger; webhooks notify merchants.

Key components

API (idempotency) → payment orchestrator → acquirer adapters · ledger · webhook delivery

Senior deep-dive

Idempotency keys on every mutation are the foundation of the reliability model: without them, a client retry after a network timeout can become a double charge, and at high request volume timeouts are routine rather than rare.

Card data never touches your application servers: it goes directly to a PCI-compliant vault (Stripe Elements / client-side tokenization) and you handle only an opaque token. Any architecture that lets raw PANs reach your app pulls those servers into full PCI DSS scope.

The acquirer network is the unreliable external dependency: it times out, returns ambiguous responses, and has per-merchant rate limits. Design every call so that it can be retried idempotently, with exponential backoff and circuit breaking.

Idempotency: the foundation of payment reliability

Every charge, refund, and payout API call accepts an idempotency key (a client-generated UUID). Before executing the operation, the server inserts (key → {status, response}) into a database table with a unique constraint on the key. A duplicate request with the same key returns the stored response without re-executing. The key expires after 24 hours. The subtle failure mode: the first request inserts the key row but crashes before completing the charge. On retry, the row exists but has no result, so the server must resume or re-run the operation under the same key, which is only safe if the downstream acquirer call is itself idempotent for that key.
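The key-store behaviour described above can be sketched as follows. This is a minimal illustration, not Stripe's implementation: it uses an in-memory SQLite table, and the names `open_store`, `run_idempotent`, and the `in_progress`/`completed` status values are hypothetical. It assumes the wrapped operation forwards the same key downstream, so re-running after a crash is safe.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table; the PRIMARY KEY gives the unique constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE idempotency_keys ("
        " key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " response TEXT)"
    )
    return conn


def run_idempotent(conn: sqlite3.Connection, key: str, operation):
    """Run operation(key) at most once to completion; replay stored results."""
    try:
        # Claim the key before doing any work.
        with conn:
            conn.execute(
                "INSERT INTO idempotency_keys (key, status) VALUES (?, 'in_progress')",
                (key,),
            )
    except sqlite3.IntegrityError:
        status, response = conn.execute(
            "SELECT status, response FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if status == "completed":
            return json.loads(response)  # duplicate request: replay the result
        # Row exists without a result: an earlier attempt crashed midway.
        # Fall through and re-run. A real system would also need a lock or
        # lease here so two live requests do not execute concurrently.

    response = operation(key)
    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = 'completed', response = ? WHERE key = ?",
            (json.dumps(response), key),
        )
    return response
```

Calling `run_idempotent` twice with the same key executes the operation once and returns the same response both times.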

Tokenization and PCI scope reduction

Stripe Elements / Stripe.js renders card fields in an iframe served from Stripe's PCI-compliant domain, so raw PANs never reach the merchant's servers. The merchant receives a payment method token (pm_xxx) that references the card. The token is an opaque identifier that reveals nothing about the PAN, so it can be stored in any database without bringing that database into PCI scope; the PAN itself stays encrypted in the vault on Stripe's side. This is the most important architectural constraint: moving from SAQ D (the full self-assessment questionnaire for merchants that handle card data) to SAQ A (card handling fully outsourced) removes a large share of recurring compliance work.
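A toy model of the vault boundary may make the scope argument concrete. This sketch assumes tokens are random references with no mathematical relationship to the PAN; `TokenVault` and its methods are hypothetical names, and a real vault would encrypt PANs at rest and run inside the PCI-scoped environment.

```python
import secrets


class TokenVault:
    """Toy vault: only this object ever holds raw PANs."""

    def __init__(self):
        self._pan_by_token = {}

    def tokenize(self, pan: str) -> str:
        # The token is random, so it cannot be reversed without the vault.
        token = "pm_" + secrets.token_hex(12)
        self._pan_by_token[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        # Called only inside the vault boundary, e.g. when charging the card.
        return self._pan_by_token[token]

    def last4(self, token: str) -> str:
        # Safe display data that the merchant side may see.
        return self._pan_by_token[token][-4:]
```

The merchant database stores only the `pm_...` string and, optionally, the last four digits for display.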

Authorization state machine

A payment moves through created → processing → authorized → captured → settled (or failed/disputed branches). Each transition is a state machine step persisted atomically. The acquirer call happens in the `processing → authorized` step; a timeout parks the payment in a `processing` limbo state and a background job polls for resolution. Captures are separate from authorizations (hotels and car rentals hold an authorization for days before capturing); card networks model authorization and capture as distinct steps, so the state machine keeps them distinct too.
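The transitions above can be expressed as an explicit table so that illegal jumps are rejected. This is a sketch under assumptions: the text names the states, but exactly where the failed and disputed branches attach is an illustrative choice here, and `Payment`, `ALLOWED`, and `on_acquirer_result` are hypothetical names. Persistence is omitted.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(str, Enum):
    CREATED = "created"
    PROCESSING = "processing"
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    SETTLED = "settled"
    FAILED = "failed"
    DISPUTED = "disputed"


# Allowed next states; branch placement is an assumption for illustration.
ALLOWED = {
    State.CREATED: {State.PROCESSING},
    State.PROCESSING: {State.AUTHORIZED, State.FAILED},
    State.AUTHORIZED: {State.CAPTURED, State.FAILED},
    State.CAPTURED: {State.SETTLED},
    State.SETTLED: {State.DISPUTED},
    State.FAILED: set(),
    State.DISPUTED: set(),
}


class InvalidTransition(Exception):
    pass


@dataclass
class Payment:
    payment_id: str
    state: State = State.CREATED
    history: list = field(default_factory=list)

    def transition(self, new_state: State) -> None:
        if new_state not in ALLOWED[self.state]:
            raise InvalidTransition(f"{self.state.value} -> {new_state.value}")
        self.history.append((self.state, new_state))
        self.state = new_state


def on_acquirer_result(payment: Payment, result: str) -> None:
    """Apply an acquirer outcome: 'approved', 'declined', or 'timeout'."""
    if result == "approved":
        payment.transition(State.AUTHORIZED)
    elif result == "declined":
        payment.transition(State.FAILED)
    # On 'timeout' the payment stays in PROCESSING for the background poller.
```

Keeping the timeout case as a no-op makes the limbo state explicit: nothing advances until the poller learns the real outcome.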

Ledger and double-entry accounting

Every money movement creates two ledger entries (debit one account, credit another) in an append-only table. Balance is derived by summing entries — never stored directly — so there is no "balance update" that can be lost or double-applied. Reconciliation jobs compare the ledger against settlement files from card networks nightly; discrepancies trigger alerts. The ledger is the system of record; the payment object in the application DB is a derived view.
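A minimal sketch of the append-only, derived-balance idea follows. The sign convention (debits positive, credits negative), the integer minor units, and the account names in the usage are assumptions made for illustration; `Ledger` and `Entry` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    transfer_id: str
    account: str
    amount: int  # minor units (e.g. cents); debit > 0, credit < 0


class Ledger:
    """Append-only list of entries; balances are always computed, never stored."""

    def __init__(self):
        self._entries = []

    def record_transfer(self, transfer_id, debit_account, credit_account, amount):
        if amount <= 0:
            raise ValueError("amount must be a positive number of minor units")
        # Both legs are appended together so the books always sum to zero.
        self._entries.append(Entry(transfer_id, debit_account, amount))
        self._entries.append(Entry(transfer_id, credit_account, -amount))

    def balance(self, account: str) -> int:
        return sum(e.amount for e in self._entries if e.account == account)

    def is_balanced(self) -> bool:
        return sum(e.amount for e in self._entries) == 0
```

For example, `record_transfer("t1", "acquirer_receivable", "merchant_balance", 500)` leaves `acquirer_receivable` at 500 and `merchant_balance` at -500, and `is_balanced()` stays true after every transfer.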

Retry logic and acquirer circuit breaking

Acquirer APIs fail in two ways: transient (HTTP 5xx, timeout) and permanent (card declined, insufficient funds). Permanent failures are never retried. Transient failures are retried, but a timeout is ambiguous because the charge may have gone through, so the retry is only safe when it carries the same idempotency key or transaction reference. A response code classifier maps ISO 8583 / HTTP codes to retry vs no-retry. Retries use exponential backoff with jitter, because synchronized retries create a thundering herd against the acquirer. A circuit breaker per acquirer opens after N consecutive failures and either sheds load to a backup acquirer or returns degraded errors to the merchant rather than queueing indefinitely.
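The retry policy and breaker can be sketched together. This is an illustration under assumptions: the exception classes stand in for the output of a response-code classifier, the jitter variant is "full jitter" (a uniform draw up to the exponential ceiling), and all names and default values are hypothetical. Failover to a backup acquirer is left to the caller.

```python
import random
import time


class TransientError(Exception):
    """Classifier verdict: 5xx or timeout, worth retrying."""


class PermanentError(Exception):
    """Classifier verdict: decline, never retry."""


class CircuitOpenError(Exception):
    """The breaker is open; the call was not attempted."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)).
    return rng() * min(cap, base * (2 ** attempt))


class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through (half-open).
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def call_with_retry(fn, breaker, max_attempts=4, sleep=time.sleep):
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError()
        try:
            result = fn()
        except PermanentError:
            breaker.record_success()  # the acquirer answered; it is healthy
            raise
        except TransientError:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
```

`fn` is expected to send the same idempotency key on every attempt; the `clock` and `sleep` parameters exist so the behaviour can be tested without real waiting.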

What breaks at scale

Idempotency key collisions: merchants who generate keys non-randomly (sequential integers, timestamps) collide across their own requests — enforce UUID v4 or similar entropy in client libraries. The deeper failure is distributed transaction atomicity: charging the card (acquirer call) and recording the result in your DB are two operations that cannot be wrapped in one ACID transaction. If you charge the card and then your DB write fails, you've taken money without a record. The pattern is write a pending record first, then charge, then update to complete — the pending record is the source of truth for reconciliation if the charge succeeds but the update fails.
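The pending-first ordering can be sketched as below. This is an assumption-laden illustration: `PaymentStore` is an in-memory stand-in for the application DB, and `acquirer_charge` and `acquirer_lookup` are hypothetical callables representing the acquirer's charge call and a status lookup by payment reference.

```python
class PaymentStore:
    """In-memory stand-in for the application database (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def insert_pending(self, payment_id, amount):
        if payment_id in self.rows:
            raise ValueError("duplicate payment_id")
        self.rows[payment_id] = {
            "status": "pending", "amount": amount, "acquirer_ref": None,
        }

    def mark_complete(self, payment_id, acquirer_ref):
        self.rows[payment_id].update(status="complete", acquirer_ref=acquirer_ref)

    def pending_ids(self):
        return [pid for pid, row in self.rows.items() if row["status"] == "pending"]


def charge(store, payment_id, amount, acquirer_charge):
    store.insert_pending(payment_id, amount)    # 1. record intent first
    ref = acquirer_charge(payment_id, amount)   # 2. external side effect
    store.mark_complete(payment_id, ref)        # 3. may fail after money moved
    return ref


def repair_pending(store, acquirer_lookup):
    """Reconciliation pass: complete pending rows the acquirer knows about."""
    repaired = []
    for payment_id in store.pending_ids():
        ref = acquirer_lookup(payment_id)  # None if the acquirer has no record
        if ref is not None:
            store.mark_complete(payment_id, ref)
            repaired.append(payment_id)
    return repaired
```

If step 3 fails, the row stays `pending`, which is exactly the signal `repair_pending` needs. Rows the acquirer does not know about are left pending here; deciding when to mark them failed is a policy choice outside this sketch.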

In production

A pattern described in public write-ups of Stripe-style idempotency stores keys in a PostgreSQL table with a unique constraint on (key, user_id): concurrent duplicate requests hit this constraint, and the second one waits for the first to commit. Braintree (PayPal) is described as using a similar pattern, built on a custom per-transaction state machine persisted in MySQL. The real engineering challenge is handling ambiguous acquirer responses: an HTTP timeout on the authorize call means the charge may or may not have gone through at the bank. The retry must reference the original transaction at the acquirer level (ISO 8583 provides repeat messages and original data elements for this purpose). This only works where the acquirer supports it, and not all acquirers implement it consistently, so you also need a reconciliation job that checks next-day settlement files against your ledger to catch orphaned authorizations.
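The settlement check can be sketched as a set comparison. The CSV layout (`reference,amount_minor`) is a hypothetical format invented for this example; real settlement files vary by acquirer and network. `parse_settlement` and `reconcile` are illustrative names.

```python
import csv
import io


def parse_settlement(csv_text: str) -> dict:
    """Parse a hypothetical settlement file with columns reference,amount_minor."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["reference"]: int(row["amount_minor"]) for row in reader}


def reconcile(ledger_amounts: dict, settled_amounts: dict) -> dict:
    """Compare ledger totals per reference against the settlement file."""
    ledger_refs = set(ledger_amounts)
    settled_refs = set(settled_amounts)
    return {
        # Money moved at the bank with no record on our side (orphaned).
        "missing_from_ledger": sorted(settled_refs - ledger_refs),
        # We recorded a charge that the network never settled.
        "missing_from_settlement": sorted(ledger_refs - settled_refs),
        # Both sides know the charge but disagree on the amount.
        "amount_mismatch": sorted(
            ref for ref in ledger_refs & settled_refs
            if ledger_amounts[ref] != settled_amounts[ref]
        ),
    }
```

Any non-empty list in the result is a discrepancy that should raise an alert for investigation.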

Part of System Design Library on SystemLore.