System Design Library

Stock Exchange (Matching Engine)

Match buy/sell orders fairly with microsecond latency and strict ordering.

Open the interactive version → diagrams, practice & more

Requirements

Functional

Order book
Match orders (price-time priority)
Cancel/modify
Market data feed

Non-functional

Deterministic & ordered
Ultra-low latency
Fair

Scale

Millions of orders/sec

The approach

A single-threaded in-memory matching engine per symbol (determinism + no lock contention) processes a strictly ordered input log; matches by price-time priority; an event-sourced log enables replay/recovery; market data fans out.

Key components

Gateway → sequenced input log → per-symbol matching engine → trade + market-data feeds

Numbers that matter

NASDAQ's matching engine processes ~10 million orders per day on active symbols; peak throughput is ~500,000 orders/second during open/close auctions, handled by a single-threaded engine per symbol.
End-to-end latency from order receipt to market data broadcast is <10 microseconds at colocation (Mahwah, NJ) for Nasdaq; retail broker round-trips add ~5–50 milliseconds of network.
An order book for a liquid symbol like SPY holds ~10,000–50,000 resting limit orders in memory; the entire book fits in <10 MB, well within L3 cache on modern CPUs.
The NYSE Pillar matching engine achieves ~1μs order-to-acknowledgment latency for co-located market makers using kernel-bypass networking (Solarflare NICs with OpenOnload).

Senior deep-dive

The matching engine must be single-threaded — deterministic price-time priority requires a strict total order on events, which is impossible with concurrent mutation of the order book.

The input log is the database: the order book is reconstructed from an ordered event log; the matching engine is stateless given the log, enabling replay, debugging, and disaster recovery without a traditional DB.

Latency is measured in microseconds and dominated by kernel bypass: a TCP stack adds 50–100μs; DPDK/RDMA brings this to <5μs; the matching itself (a price-level sorted map lookup) takes <1μs with a cache-warm order book.

Single-threaded matching: why concurrency is wrong here

Price-time priority (best price wins; equal price → earlier order wins) requires a strict total order on all events for a symbol. Any concurrent access to the order book creates races: two simultaneous matching decisions for overlapping price levels produce non-deterministic fills. The solution is a single-threaded event loop per instrument — incoming orders are serialized onto an input queue, the engine processes them one at a time, emits execution reports, and never blocks. Throughput comes from CPU frequency, cache efficiency, and lock-free queues, not parallelism.

Order book data structure

The order book is a price-level map: a sorted map (by price) where each level holds a FIFO queue of orders. For a limit buy at $100.01, the engine looks up the ask side for any resting order at ≤$100.01 — this is a min-heap or sorted map lookup. Array-backed price levels (a fixed-size array indexed by price in ticks) outperform `std::map` for liquid instruments with a narrow spread — O(1) level access vs O(log n). The entire book for a liquid instrument fits in L3 cache, which is why cache-warm throughput is orders of magnitude higher than cold.

Event sourcing: the log is the database

The matching engine does not write to a database during matching — it appends order events (new, cancel, fill) to a durable log. The order book is the projection of that log. On restart, replay the log to rebuild state. This is not just a design philosophy — it is how exchanges satisfy regulatory audit requirements: the regulator can request the event log and independently reconstruct every fill. The log also enables backtesting and market simulation without a separate testing environment.

Market data: SBE and multicast

Exchange market data feeds use Simple Binary Encoding (SBE) — a fixed-layout binary format with no heap allocation, parseable in a few nanoseconds. Data is broadcast via UDP multicast to subscribers who co-locate servers in the exchange's data center. Retailers receive a slower consolidated tape via TCP. The matching engine publishes fill and order book update events atomically with the matching decision — sequence numbers on the feed let consumers detect gaps and request retransmit via a separate request/response channel.

Risk and pre-trade checks

Pre-trade risk (position limits, order size limits, price collars) must run in <1μs or it becomes the latency bottleneck. This is implemented as a set of in-memory check tables updated asynchronously — the risk engine reads stale position data (last-known, updated every few milliseconds) rather than computing exact real-time exposure. Post-trade risk (exact exposure) runs in a separate process after the fill. The 2010 Flash Crash was partly attributed to risk systems not keeping pace with order flow — circuit breakers (limit-up/limit-down bands) are now a regulatory requirement.

What breaks at scale

Clock synchronization failures produce misordered timestamps: if two co-located servers have 500ns of clock skew and both submit orders in the same microsecond, the time-priority ordering is wrong. PTP hardware timestamping solves this, but requires supported NICs and careful switch configuration — software NTP is not sufficient. The second failure mode is input queue saturation: at auction open, order flow spikes 10–100x; the engine's input queue must be deep enough to absorb the burst, and the queue itself must be lock-free (Disruptor pattern) or the queue-write becomes the bottleneck.

In production

LMAX Disruptor is the most widely studied open-source pattern for exchange-grade throughput: a lock-free ring buffer replaces queues between pipeline stages (ingest → match → risk → publish), achieving 6 million orders/second on a single core without garbage collection. CME Group's Globex and ICE both run separate matching engines per product with no cross-product coordination — instrument isolation is the scalability lever. The real challenge is deterministic replay: every exchange must be able to reconstruct the exact order book state from the event log for regulatory audit and post-trade reconciliation. This requires that the log is the ground truth — any in-memory state is derived. Clock synchronization is a regulatory requirement (MiFID II mandates <100μs timestamp accuracy) and is achieved with PTP/IEEE 1588 hardware timestamping on the NIC, not OS clocks.

Common mistakes

Multi-threading one order book (nondeterminism)
Disk/DB in the hot path
Losing strict input ordering

Related System Design Library

Part of System Design Library on SystemLore — system design interview prep with 148 deep topics, interactive diagrams, and a practice game. Practice this one →