System Design

Designing a Stock Exchange at Scale: Order Book, Matching Engine & Market Data Feed

Stock exchanges are among the most demanding distributed systems ever built — they must process millions of orders per second with microsecond latency, guarantee strict price-time priority fairness, survive hardware failures without losing a single trade, and feed real-time market data to thousands of subscribers simultaneously. This deep-dive covers every architectural layer from the FIX gateway to the clearing house, giving you the knowledge to ace a senior system design interview and to reason confidently about production trading infrastructure.

Md Sanwar Hossain · April 7, 2026 · 23 min read · Trading Systems

TL;DR — Architecture in One Paragraph

A production stock exchange routes orders through a FIX protocol gateway for pre-trade risk checks, serializes them into a sequencer for global ordering, then processes them in the matching engine — a single-threaded LMAX Disruptor ring buffer backed by a red-black tree order book per symbol with price-time priority. Executions are multicast as a market data feed to subscribers and forwarded to a central clearing counterparty (CCP) for T+1 settlement. A risk engine enforces position limits and circuit breakers. The entire critical path runs on kernel-bypass networking (DPDK/RDMA) targeting <10µs end-to-end latency.

Table of Contents

  1. Functional & Non-Functional Requirements
  2. High-Level Architecture Overview
  3. Order Gateway & FIX Protocol
  4. Order Book Design
  5. Matching Engine & LMAX Disruptor
  6. Market Data Feed
  7. Trade Settlement & Clearing
  8. Risk Engine & Circuit Breakers
  9. Persistence & Event Sourcing
  10. Capacity Estimation
  11. High Availability & Disaster Recovery
  12. System Design Interview Checklist

1. Functional & Non-Functional Requirements

Before jumping to architecture, nail the requirements. An interviewer asking you to "design a stock exchange" expects you to define scope precisely — the delta between a simple limit order system and a full exchange is enormous.

Functional Requirements

  • Accept, modify, and cancel orders: limit, market, stop, iceberg, IOC, FOK
  • Match orders under strict price-time priority and report executions to both sides
  • Run continuous trading plus opening and closing call auctions
  • Disseminate Level 1 and Level 2 market data in real time
  • Forward executed trades to clearing for settlement
  • Halt and resume trading per symbol (circuit breakers, regulatory halts)

Non-Functional Requirements (the hard part)

| Metric | Tier-1 Exchange Target | Notes |
|---|---|---|
| Order throughput | 500K–5M orders/sec | NSE India processes ~2M orders/sec at peak |
| End-to-end latency (P99) | <10 µs (colocation); <1 ms (internet) | NYSE Pillar: ~2 µs median |
| Market data fan-out | 10K–100K subscribers | Multicast UDP at <1 µs per symbol update |
| Availability | 99.999% (≈5.3 min downtime/year) | During market hours only |
| Durability | Zero order loss, zero duplicate fills | Idempotent sequencing mandatory |
| Settlement finality | T+1 (US equities as of 2024) | T+2 for most EU markets |

2. High-Level Architecture Overview

A production exchange is composed of six logically separate subsystems, each with distinct latency, throughput, and consistency requirements. Understanding the data flow between them is the key to getting this design right in an interview.

Order Lifecycle — Critical Path

Client → FIX Gateway (pre-trade risk) → Sequencer (global timestamp + sequence number) → Matching Engine (order book + trade execution) → Drop Copy (execution reports to broker) + Market Data Publisher (book updates) + Clearing Gateway (trade confirmation). The critical path from order receipt to execution acknowledgement must never cross a network hop — everything runs in a single NUMA node with kernel-bypass networking.

Figure: Stock Exchange Architecture — FIX Gateway → Sequencer → Matching Engine → Market Data & Clearing.

Why Single-Threaded Matching?

Counter-intuitively, the matching engine is single-threaded. Multi-threaded matching requires locks on the order book — and lock contention at microsecond timescales dominates execution time. By partitioning symbols across single-threaded matching engines (one engine owns a set of symbols), you achieve linear scalability with zero locking overhead. The LMAX Disruptor pattern is the canonical implementation: a pre-allocated ring buffer replaces all queue synchronization with memory barriers, achieving 25 million events/sec on commodity hardware.

3. Order Gateway & FIX Protocol

The order gateway is the exchange's front door. Every order from every broker passes through here. It must handle thousands of concurrent TCP sessions, validate orders against pre-trade risk rules, and serialize accepted orders into the sequencer — all in microseconds.

FIX Protocol Deep Dive

FIX (Financial Information eXchange) is the industry-standard messaging protocol for pre-trade and trade communication. The wire format uses tag=value pairs separated by SOH (ASCII 0x01). Modern exchanges support FIX 4.2/4.4 and the binary FAST (FIX Adapted for STreaming) encoding for latency-critical paths:

// FIX 4.4 New Order Single (tag 35=D) — wire format (SOH shown as |)
8=FIX.4.4|9=176|35=D|49=BROKER1|56=EXCHANGE|34=1001|52=20260407-09:30:00.123|
11=ORD-20260407-00001|55=AAPL|54=1|38=100|40=2|44=185.50|59=0|10=042|

// Tag reference:
// 35=D   → MsgType: New Order Single
// 49     → SenderCompID (broker)
// 56     → TargetCompID (exchange)
// 11     → ClOrdID (client order ID — must be unique per session per day)
// 55     → Symbol
// 54=1   → Side: Buy (2=Sell)
// 38     → OrderQty
// 40=2   → OrdType: Limit (1=Market, 3=Stop, 4=StopLimit)
// 44     → Price (only for Limit orders)
// 59=0   → TimeInForce: Day (1=GTC, 3=IOC, 4=FOK)
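To make the tag=value format concrete, here is a toy parser that splits a FIX message on SOH — purely illustrative (`FixParser` is not from any real gateway; production gateways use zero-copy parsers over raw byte buffers and never allocate Strings on the hot path):

```java
import java.util.HashMap;
import java.util.Map;

// Toy FIX tag=value parser — splits on the SOH delimiter (ASCII 0x01)
// and maps each integer tag to its string value.
class FixParser {
    static final char SOH = '\u0001';

    static Map<Integer, String> parse(String msg) {
        Map<Integer, String> fields = new HashMap<>();
        for (String pair : msg.split(String.valueOf(SOH))) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(Integer.parseInt(pair.substring(0, eq)),
                           pair.substring(eq + 1));
            }
        }
        return fields;
    }
}
```

A real gateway would also verify the BodyLength (tag 9) and CheckSum (tag 10) fields before accepting the message.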

Pre-Trade Risk Checks

Pre-trade risk is the gateway's most critical function — it stops bad orders before they can reach the matching engine. These checks must complete in <5 µs:

  • Fat-finger checks: reject orders priced far outside the current market or exceeding a maximum order size
  • Position and credit limits: reject orders that would breach the participant's configured limits
  • Duplicate detection: reject a ClOrdID already seen in this session
  • Instrument state: reject orders for halted, closed, or unknown symbols

Sequencer — The Global Order Arbiter

After passing pre-trade risk, each order is handed to the sequencer — the most critical single point of truth in the entire exchange. The sequencer assigns a globally monotonically increasing sequence number and a nanosecond-precision timestamp to every order, cancellation, and modification. This sequence number defines the canonical order in which the matching engine processes events. The sequencer is a single process — it must be replicated for HA, but only one instance is primary at any time. It publishes to a replicated log (similar to Kafka but with hardware-level ordering guarantees) that the matching engine reads from.
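A minimal sketch of the stamping step — names (`Sequencer`, `SequencedEvent`) are illustrative, and a real sequencer would also journal each stamped event before publishing it. Because the sequencer is a single pinned thread, plain fields suffice; no atomics are needed on the hot path:

```java
// Illustrative sequencer: stamps every inbound event with a monotonically
// increasing sequence number and a nanosecond timestamp.
class Sequencer {
    private long nextSeq = 1; // single-threaded — no synchronization needed

    static final class SequencedEvent {
        final long seq;            // canonical global ordering
        final long timestampNanos; // nanosecond-precision arrival time
        final Object payload;      // the order / cancel / modify
        SequencedEvent(long seq, long ts, Object payload) {
            this.seq = seq; this.timestampNanos = ts; this.payload = payload;
        }
    }

    SequencedEvent stamp(Object order) {
        return new SequencedEvent(nextSeq++, System.nanoTime(), order);
    }
}
```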

Colocation Services

Tier-1 exchanges offer colocation: high-frequency trading firms physically place their servers in the exchange's data center, cutting network latency to the exchange from milliseconds over wide-area fiber to single-digit microseconds over an in-building cross-connect. This gives colocated participants a structural speed advantage and raises fairness concerns, which exchanges address through:

  • Equal-length cross-connect cables, so every colocated rack sees the same propagation delay regardless of its position on the floor
  • Standardized, exchange-provided network hardware inside the colo facility
  • Published, uniform colocation pricing open to any qualifying participant

4. Order Book Design

The order book is the core data structure of any exchange. For each tradable symbol, the exchange maintains a separate order book that tracks all outstanding limit orders on both sides (buy = bids, sell = asks), sorted by price-time priority.

Red-Black Tree Implementation

The canonical order book data structure uses a red-black tree keyed by price level, where each price level node contains a doubly linked FIFO queue of orders. This gives O(log P) operations for adding/removing price levels (where P is the number of distinct price levels) and O(1) for inserting/canceling orders within a known price level:

// Conceptual Java order book structure
class PriceLevel {
    final long price;          // in ticks (integer — never float!)
    long totalQuantity;        // sum of all orders at this level
    final Deque<Order> orders; // FIFO — price-time priority
}

class OrderBook {
    // Bids: highest price first → TreeMap in DESCENDING order
    final TreeMap<Long, PriceLevel> bids = new TreeMap<>(Comparator.reverseOrder());
    // Asks: lowest price first → TreeMap in ASCENDING order
    final TreeMap<Long, PriceLevel> asks = new TreeMap<>();
    // O(1) cancel: direct lookup by order ID
    final HashMap<String, Order> orderIndex = new HashMap<>();

    PriceLevel bestBid() { return bids.isEmpty() ? null : bids.firstEntry().getValue(); }
    PriceLevel bestAsk() { return asks.isEmpty() ? null : asks.firstEntry().getValue(); }
    // Spread in ticks — defined only when both sides of the book are non-empty
    Long spread() {
        PriceLevel bid = bestBid(), ask = bestAsk();
        return (bid == null || ask == null) ? null : ask.price - bid.price;
    }
}

Order Types and Book Mechanics

| Order Type | Book Behavior | Matching Rule |
|---|---|---|
| Limit | Rests in book if not immediately matchable | Match at limit price or better |
| Market | Never rests — execute immediately or cancel | Walk the book, consume all available liquidity |
| Stop | Held in a separate stop book | Triggered when the last trade crosses the stop price |
| Iceberg | Only peak quantity visible; reserve hidden | Refills peak from reserve; loses time priority on refill |
| IOC (Immediate-or-Cancel) | Never rests in book | Fill what's available, cancel the remainder |
| FOK (Fill-or-Kill) | Never rests in book | Fill the full quantity atomically or cancel it all |

Price Representation — Never Use Floats

A critical implementation detail: prices are always represented as integers (ticks), never floating-point numbers. For a stock with a minimum tick size of $0.01, the price $185.50 is stored as the integer 18550. This eliminates floating-point rounding errors that could cause incorrect price comparisons — a bug that could result in regulatory fines and reputational damage. The tick size is an instrument-level configuration parameter.
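A small sketch of exact tick conversion, assuming the tick size is expressed as a number of decimal digits (2 for a $0.01 tick). Using `BigDecimal` with `longValueExact` also rejects prices that fall off the tick grid:

```java
import java.math.BigDecimal;

// Exact decimal-to-tick conversion — never touches floating point.
class TickMath {
    // "185.50" with 2 decimals → 18550; throws ArithmeticException if the
    // price has more precision than the tick grid allows (e.g., "185.505").
    static long toTicks(String price, int decimals) {
        return new BigDecimal(price).movePointRight(decimals).longValueExact();
    }

    static String fromTicks(long ticks, int decimals) {
        return BigDecimal.valueOf(ticks).movePointLeft(decimals).toPlainString();
    }
}
```

Conversion happens once at the gateway boundary; everything downstream — book, matching, market data — works in `long` ticks.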

5. Matching Engine & LMAX Disruptor

The matching engine is the heart of the exchange — the single component where every trade is born. Its design must satisfy two contradictory-seeming requirements: strict determinism (same inputs always produce same outputs, enabling replay) and extreme throughput (millions of events per second).

LMAX Disruptor Architecture

The LMAX Disruptor is a lock-free inter-thread communication library developed by LMAX Exchange for its retail FX trading platform. The key insight: replace a blocking queue with a pre-allocated ring buffer where producers and consumers coordinate via sequence counters and memory barriers (CAS only on the multi-producer path) rather than locks:

// LMAX Disruptor pipeline — order processing stages
//
//  [FIX Gateway] ──publish──▶ [RingBuffer<OrderEvent>] ──consume──▶
//
//  Stage 1: Journal Handler (write to WAL — must complete before Stage 2)
//  Stage 2: Matching Handler (single thread — owns all order books)
//  Stage 3 (parallel):
//    ├── Market Data Handler (publish book updates)
//    ├── Drop Copy Handler (execution reports to brokers)
//    └── Clearing Handler (forward fills to CCP)

// Ring buffer is a power-of-2 sized array — index wraps with bitwise AND
// All slots pre-allocated at startup: NO heap allocation on hot path
int RING_BUFFER_SIZE = 1 << 20; // 1,048,576 slots (~16MB at 16B/event)

// Producer: claims next slot, writes event, publishes sequence
long seq = ringBuffer.next();
OrderEvent event = ringBuffer.get(seq);
event.populate(orderId, symbol, side, price, quantity, type);
ringBuffer.publish(seq);  // single store-release fence — no lock

Price-Time Priority Matching Algorithm

The matching loop for a new incoming order is surprisingly straightforward once the order book is correctly structured:

// Pseudocode: match a new buy limit order against the ask side
List<Fill> match(Order incomingBuy, OrderBook book) {
    List<Fill> fills = new ArrayList<>();
    long remainingQty = incomingBuy.quantity;

    // Walk ask side from best (lowest) price upward
    while (remainingQty > 0 && !book.asks.isEmpty()) {
        PriceLevel bestAsk = book.bestAsk();

        // Price check: buy limit price must be ≥ ask price
        if (incomingBuy.price < bestAsk.price) break; // no match possible

        // Walk FIFO queue at this price level (time priority)
        while (remainingQty > 0 && !bestAsk.orders.isEmpty()) {
            Order restingOrder = bestAsk.orders.peekFirst();
            long fillQty = Math.min(remainingQty, restingOrder.quantity);
            long fillPrice = restingOrder.price; // resting order sets the price

            fills.add(new Fill(incomingBuy.id, restingOrder.id, fillPrice, fillQty));
            remainingQty -= fillQty;
            restingOrder.quantity -= fillQty;
            bestAsk.totalQuantity -= fillQty; // keep the level's aggregate in sync

            if (restingOrder.quantity == 0) {
                bestAsk.orders.pollFirst(); // fully filled — remove from queue
                book.orderIndex.remove(restingOrder.id);
            }
        }
        if (bestAsk.orders.isEmpty()) book.asks.pollFirstEntry(); // empty level
    }
    // If unfilled remainder and order is not IOC/FOK → rest in book
    if (remainingQty > 0 && incomingBuy.type == LIMIT) {
        book.addBid(incomingBuy.withQuantity(remainingQty));
    }
    return fills;
}

Opening Auction Call Algorithm

The opening auction (call auction) runs at market open to establish a reference price from accumulated orders. The algorithm finds the uncrossing price — the price that maximizes executable volume:

  1. Aggregate cumulative demand (total buy quantity at each price and above) and cumulative supply (total sell quantity at each price and below).
  2. Find the price that maximizes executable volume — for each candidate price p, executable volume is min(demand(p), supply(p)). If multiple prices tie on volume, choose the one with the smallest order imbalance; if still tied, the price closest to yesterday's close.
  3. At the uncrossing price, match all executable orders. Remaining unmatched orders enter the continuous book.
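A minimal sketch of the uncrossing computation, assuming buy and sell quantities have already been aggregated per price in ticks (tie-breaking rules omitted for brevity):

```java
import java.util.Collection;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeSet;

// Finds the price maximizing executable volume min(demand(p), supply(p)).
class Uncrossing {
    static long uncrossingPrice(NavigableMap<Long, Long> buyQtyByPrice,
                                NavigableMap<Long, Long> sellQtyByPrice) {
        SortedSet<Long> prices = new TreeSet<>();
        prices.addAll(buyQtyByPrice.keySet());
        prices.addAll(sellQtyByPrice.keySet());

        long bestPrice = -1, bestVolume = 0;
        for (long p : prices) {
            // demand(p): buy orders willing to pay p or more
            long demand = sum(buyQtyByPrice.tailMap(p, true).values());
            // supply(p): sell orders willing to accept p or less
            long supply = sum(sellQtyByPrice.headMap(p, true).values());
            long executable = Math.min(demand, supply);
            if (executable > bestVolume) { bestVolume = executable; bestPrice = p; }
        }
        return bestPrice; // -1 if the book does not cross at any price
    }

    private static long sum(Collection<Long> qs) {
        long s = 0; for (long q : qs) s += q; return s;
    }
}
```

This is O(P²) as written for clarity; a production implementation sweeps the price axis once with running cumulative sums.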

6. Market Data Feed

Market data dissemination is a massive fan-out problem: every trade execution and order book change must be broadcast to thousands of subscribers in real time with zero data loss and guaranteed ordering. The architecture is fundamentally different from a typical pub/sub system — unicast TCP is unsuitable at this scale.

Level 1 vs Level 2 Data

Level 1 (top of book) carries the best bid, best ask, and last trade per symbol — enough for most retail displays. Level 2 (depth of book) carries every price level with its aggregate size (and on some feeds, every individual order), which market makers and execution algorithms need to model available liquidity. Level 2 is an order of magnitude heavier on the wire, which is why exchanges publish the two as separate feeds.

Multicast UDP Feed Architecture

The exchange publishes market data over UDP multicast — one packet reaches all subscribers simultaneously, making fan-out O(1) regardless of subscriber count. Every message carries a monotonically increasing sequence number enabling gap detection:

// Market data message structure (binary, network byte order)
struct __attribute__((packed)) MarketDataMessage {  // packed — no compiler padding
    uint32_t seqNum;       // monotonically increasing per channel
    uint64_t timestamp;    // nanoseconds since epoch
    uint16_t msgType;      // QUOTE_UPDATE=1, TRADE=2, ORDER_ADD=3, CANCEL=4
    char     symbol[8];    // null-padded
    int64_t  bidPrice;     // in ticks
    uint32_t bidSize;
    int64_t  askPrice;     // in ticks
    uint32_t askSize;
    int64_t  lastPrice;    // last trade price in ticks
    uint32_t lastSize;     // last trade volume
    uint64_t volume;       // cumulative daily volume
};  // Total: 66 bytes packed — fits comfortably in a single UDP packet (MTU 1500)

// Gap detection on subscriber side:
if (msg.seqNum != expectedSeqNum) {
    // GAP DETECTED: request retransmission via unicast TCP recovery channel
    requestRetransmit(expectedSeqNum, msg.seqNum - 1);
}

Snapshot + Delta Architecture

A new subscriber cannot reconstruct the current book state by joining a live delta feed mid-session. The exchange solves this with a snapshot service: periodically (every few seconds) a full order book snapshot is published on a separate multicast group. Subscribers start by consuming the snapshot, then switch to the live delta feed starting from the snapshot's sequence number, replaying any buffered deltas that arrived after the snapshot was taken. This is identical to the read-model bootstrap pattern in CQRS/Event Sourcing.
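A sketch of the subscriber-side bootstrap under these assumptions (`Delta` and the apply step are illustrative stand-ins for real feed messages; gap recovery is omitted):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Bootstrap sketch: buffer live deltas while the snapshot loads, then
// discard deltas the snapshot already reflects and replay the rest.
class FeedBootstrap {
    record Delta(long seqNum, String payload) {}

    final Deque<Delta> buffered = new ArrayDeque<>();
    long appliedSeq = -1; // -1 → snapshot not yet loaded

    void onLiveDelta(Delta d) {
        if (appliedSeq < 0) {
            buffered.addLast(d);                 // snapshot still loading
        } else if (d.seqNum() == appliedSeq + 1) {
            appliedSeq = d.seqNum();             // apply(d) — in sequence
        }
        // else: gap detected → request retransmission (omitted)
    }

    void onSnapshot(long snapshotSeq) {
        appliedSeq = snapshotSeq;
        while (!buffered.isEmpty()) {
            Delta d = buffered.pollFirst();
            if (d.seqNum() > appliedSeq) {       // not yet in the snapshot
                appliedSeq = d.seqNum();         // apply(d)
            }
        }
    }
}
```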

7. Trade Settlement & Clearing

Trade execution and trade settlement are separated by days — a critical design feature of financial markets. Execution is real-time; settlement (the actual transfer of securities and cash) happens on T+1 (US equities) or T+2 (EU). This time gap introduces counterparty credit risk, which the Central Clearing Counterparty (CCP) is designed to eliminate.

Central Clearing Counterparty (CCP) and Novation

The moment a trade executes on the exchange, the CCP (e.g., DTCC in the US, LCH Clearnet in Europe) steps in as the legal counterparty to both sides through a process called novation: the original contract between Buyer A and Seller B is replaced by two new contracts — one between Buyer A and the CCP, and one between the CCP and Seller B. This means:

  • Neither side bears the other's default risk — counterparty credit risk is concentrated at, and managed by, the CCP
  • Trades settle anonymously; participants never learn who was on the other side
  • Obligations become nettable: the CCP can collapse thousands of gross trades into one net position per member
  • The CCP collateralizes its exposure through initial and variation margin

Settlement Flow

| Day | Activity | System |
|---|---|---|
| T+0 (trade day) | Trade executes; CCP steps in via novation; margin calculated | Exchange → CCP gateway |
| T+0 EOD | Multilateral netting; net settlement obligations calculated per member | CCP netting engine |
| T+1 morning | Variation margin payments; failed-trade handling; CNS (Continuous Net Settlement) | DTCC / CSD |
| T+1 EOD | Securities and cash transferred via the Central Securities Depository (CSD) | DTC / Fedwire |
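The T+0 EOD netting step can be sketched as follows — a toy cash-leg-only model (real netting also nets the securities leg per instrument and handles fees and corporate actions):

```java
import java.util.HashMap;
import java.util.Map;

// Multilateral netting sketch: each trade creates equal-and-opposite
// cash obligations; the CCP nets them to one figure per member.
class NettingEngine {
    // member → net cash obligation in ticks (negative = owes the CCP)
    final Map<String, Long> netCash = new HashMap<>();

    void onTrade(String buyer, String seller, long priceTicks, long qty) {
        long cash = priceTicks * qty;
        netCash.merge(buyer, -cash, Long::sum);  // buyer pays cash
        netCash.merge(seller, cash, Long::sum);  // seller receives cash
    }
}
```

Because every trade is zero-sum, the net obligations across all members always sum to zero — a cheap invariant the CCP can assert after each netting run.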

Position Ledger Design

The clearing system maintains a position ledger — a real-time record of every participant's net position in every instrument. The ledger must support:

  • Atomic per-fill updates — one trade adjusts both counterparties' positions in a single transaction
  • Real-time margin recalculation as positions and prices move
  • End-of-day multilateral netting into per-member settlement obligations
  • An immutable audit history of every position change

8. Risk Engine & Circuit Breakers

The risk engine is the exchange's immune system — it prevents runaway algorithms, cascading liquidations, and market-wide crashes from propagating. Risk operates at two timescales: microsecond pre-trade checks in the gateway, and millisecond post-trade surveillance across the full market.

Position and Value-at-Risk Limits

Each participant is assigned per-instrument position limits and an aggregate exposure limit. The risk engine continuously marks open positions to the latest trade price and compares the resulting exposure — typically a Value-at-Risk style estimate across the participant's whole portfolio — against posted margin. A breach blocks new risk-increasing orders and, in the extreme, triggers the kill switch that cancels all of the participant's resting orders.

Market Circuit Breakers

Circuit breakers halt trading when prices move too far, too fast — protecting against flash crashes like the 2010 US equity flash crash (Dow -1000 points in minutes) and the 2020 S&P 500 halts:

| Trigger | Action | Duration |
|---|---|---|
| S&P 500 drops 7% from prior close | Level 1 halt — all US equity markets pause | 15 minutes |
| S&P 500 drops 13% | Level 2 halt | 15 minutes |
| S&P 500 drops 20% | Level 3 halt — no resumption | Rest of the trading day |
| Stock leaves its LULD band (±5% Tier 1 / ±10% Tier 2 vs. the 5-minute reference price) | Single-stock trading pause (Limit-Up/Limit-Down) | 5 minutes, or until price returns to the band |

Circuit Breaker Implementation

Circuit breakers are implemented as a continuously evaluated rule engine inside the matching engine itself. After every fill, the engine checks:

// LULD (Limit Up / Limit Down) band check after each fill.
// symbolState, marketDataPublisher, and scheduler are engine-level
// collaborators declared elsewhere; MINUTES is java.util.concurrent.TimeUnit.MINUTES.
void checkLULD(String symbol, long fillPrice, long referencePrice) {
    double pctChange = (double) (fillPrice - referencePrice) / referencePrice * 100;
    double bandPct = isLargeCapStock(symbol) ? 5.0 : 10.0; // Tier 1 vs Tier 2 band width

    if (Math.abs(pctChange) >= bandPct) {
        // Transition symbol to LULD auction mode
        symbolState.put(symbol, LULD_PAUSE);
        // Broadcast halt message to all market data subscribers
        marketDataPublisher.publishHalt(symbol, LULD_PAUSE, System.nanoTime());
        // Schedule resumption check in 5 minutes
        scheduler.schedule(() -> checkLULDResumption(symbol), 5, MINUTES);
    }
}

9. Persistence & Event Sourcing

The exchange's state — every order, every cancellation, every fill — must be durably persisted without introducing latency into the matching engine's critical path. The solution is a combination of write-ahead logging, event sourcing, and asynchronous persistence.

Write-Ahead Log (WAL)

Every event entering the sequencer is synchronously written to a Write-Ahead Log on persistent storage before being published to the matching engine. This is a sequential append-only file on an NVMe SSD — sequential writes are extremely fast (<10 µs per event) because they avoid random I/O. The WAL is the source of truth for crash recovery: on restart, the matching engine replays the WAL from the last checkpoint to rebuild in-memory order book state.
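A minimal WAL sketch along these lines, using a length-prefixed frame per event and `FileChannel.force` as the durability barrier (production engines batch the force() across many events — group commit — to amortize its cost):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Append-only write-ahead log: sequential, length-prefixed frames.
class WriteAheadLog implements AutoCloseable {
    private final FileChannel channel;

    WriteAheadLog(Path path) {
        try {
            channel = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    void append(byte[] event) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(4 + event.length);
            buf.putInt(event.length).put(event).flip(); // length-prefixed frame
            while (buf.hasRemaining()) channel.write(buf);
            channel.force(false); // flush data (not metadata) to stable storage
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    long size() {
        try { return channel.size(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    @Override public void close() {
        try { channel.close(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```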

Event Store Architecture

The exchange's event store is a pure event sourcing system — the current state of every order book is derived by replaying the stream of events from the beginning of the trading day (or from the last checkpoint). This gives several powerful properties:

  • Deterministic replay: rerunning the same event stream reproduces the exact same state, enabling crash recovery and regression testing against recorded production traffic
  • Time-travel debugging: the book can be reconstructed as of any sequence number
  • A complete, ordered audit trail for regulators comes for free
  • Independent read models (market data, surveillance, analytics) are built by consuming the same stream

// Event types in the exchange event store
enum EventType {
    ORDER_SUBMITTED,   // new order arrived at gateway
    ORDER_ACCEPTED,    // passed pre-trade risk, entered sequencer
    ORDER_REJECTED,    // failed risk check
    ORDER_PARTIALLY_FILLED, // partial execution
    ORDER_FULLY_FILLED,
    ORDER_CANCELLED,
    ORDER_MODIFIED,    // price or quantity change
    TRADE_EXECUTED,    // complete fill record (both sides)
    SYMBOL_HALTED,     // circuit breaker or regulatory halt
    SYMBOL_RESUMED,
    MARKET_OPEN,       // phase transition
    MARKET_CLOSE,
    CHECKPOINT         // order book snapshot marker
}

// Each event is immutable, signed (HMAC-SHA256) for audit integrity,
// and replicated to at least 3 nodes before being ACKed to the producer.
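A toy recovery loop under these assumptions — `CHECKPOINT` marks the last snapshot, and the commented-out apply step stands in for dispatching into the order books, which must reuse the exact live matching code path to stay deterministic:

```java
import java.util.List;

// Replay sketch: skip to the last CHECKPOINT, then re-apply everything after it.
class Replayer {
    enum EventType { ORDER_ACCEPTED, TRADE_EXECUTED, CHECKPOINT }
    record Event(long seq, EventType type) {}

    // Returns the number of events re-applied after the last checkpoint.
    static long replay(List<Event> log) {
        int start = 0;
        for (int i = 0; i < log.size(); i++) {
            if (log.get(i).type() == EventType.CHECKPOINT) start = i + 1;
        }
        long applied = 0;
        for (int i = start; i < log.size(); i++) {
            // applyToBook(log.get(i)); // hypothetical — same path as live matching
            applied++;
        }
        return applied;
    }
}
```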

10. Capacity Estimation

Capacity planning for a stock exchange is critical — under-provisioning during a market volatility spike can cause outages at the worst possible moment. Let's work through the numbers for a mid-size exchange (comparable to CBOE or Nasdaq).

Order Flow Estimates

| Parameter | Estimate | Assumption |
|---|---|---|
| Listed symbols | 5,000 equities | Mid-size exchange |
| Peak order rate | 1,000,000 orders/sec | 10× average; volatility events |
| Average order size | 200 bytes | FIX message with all fields |
| Gateway inbound bandwidth | 200 MB/s = 1.6 Gbps | 1M × 200 bytes |
| Market data outbound (L1) | 5,000 symbols × 100 updates/sec × 66 bytes ≈ 33 MB/s | L1 per-symbol update rate |
| Market data outbound (L2) | 5,000 × 100 × 2 KB ≈ 1 GB/s | Full book; multicast to all subscribers |
| Daily trade storage | ~100M events × 200 bytes = 20 GB/day | 10% of orders result in trades |

Symbol Partitioning Strategy

With 5,000 symbols and 1M orders/sec, a single matching engine cannot handle the full load. Partition symbols across multiple matching engine instances:

  • Statically assign each symbol to exactly one engine (e.g., by hash), so a symbol's full order flow never needs cross-engine coordination
  • Pin hot symbols — index heavyweights, the day's most active names — to dedicated engines
  • Rebalance assignments only outside market hours; moving a symbol intraday would break sequencing guarantees
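The routing itself can be sketched as a stable hash with hypothetical hot-symbol overrides (`SymbolRouter` and its names are illustrative, not from any real exchange):

```java
import java.util.Map;

// Stable symbol → engine routing: the same symbol always lands on the
// same single-threaded matching engine.
class SymbolRouter {
    private final int numEngines;
    private final Map<String, Integer> hotOverrides; // e.g., "AAPL" → dedicated engine

    SymbolRouter(int numEngines, Map<String, Integer> hotOverrides) {
        this.numEngines = numEngines;
        this.hotOverrides = hotOverrides;
    }

    int engineFor(String symbol) {
        Integer pinned = hotOverrides.get(symbol);
        if (pinned != null) return pinned;
        // floorMod keeps the index non-negative for any hashCode
        return Math.floorMod(symbol.hashCode(), numEngines);
    }
}
```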

11. High Availability & Disaster Recovery

A stock exchange going down during market hours is a regulatory incident — regulators may impose fines, and brokers can sue for losses. The HA architecture must balance two competing pressures: achieving sub-microsecond latency (which argues against distributed consensus) and tolerating hardware failures without data loss (which normally requires distributed consensus).

Active-Passive Failover with WAL Replication

The standard pattern for exchange HA is active-passive failover: one primary matching engine processes all orders; one or more standby engines receive a real-time replica of the WAL over a dedicated high-speed link (InfiniBand or 100GbE). The standby applies WAL entries to its own in-memory order books, maintaining a replica state that is at most a few microseconds behind the primary.

Disaster Recovery: Geographic Redundancy

For data center-level failures (power, cooling, fiber cut), the exchange maintains a full DR site in a geographically separate location. This introduces a fundamental latency tradeoff: light in fiber travels roughly 5 µs per km one way, so a 100 µs round-trip budget caps synchronous replication at roughly 10 km. Beyond that, replication must be asynchronous, accepting a small RPO (Recovery Point Objective):

  • Metro standby (within ~10 km): synchronous WAL replication, RPO = 0
  • Remote DR site (hundreds of km away): asynchronous replication, RPO of a few seconds

Kernel-Bypass Networking for Latency

At microsecond timescales, the Linux kernel's network stack — interrupt handling, context switches, and buffer copies — adds tens of microseconds per packet, which is unacceptably slow. Exchanges use kernel-bypass networking to achieve 1–2 µs NIC-to-application latency:

  • DPDK: poll-mode NIC drivers in user space — no interrupts, no syscalls on the data path
  • RDMA (InfiniBand / RoCE): the NIC writes received data directly into application memory
  • User-space TCP stacks (e.g., OpenOnload) for FIX sessions that must remain TCP
  • Busy-wait polling on pinned, isolated cores instead of blocking I/O

12. System Design Interview Checklist

When asked to "design a stock exchange" in a system design interview, structure your answer around these decision points. Interviewers at FAANG-tier and quantitative finance firms will probe each area.

Requirements & Scope

  • ✅ Define order types (limit, market, stop, IOC, FOK) before designing the book
  • ✅ State throughput target (orders/sec) and latency SLA (µs vs ms) early
  • ✅ Clarify which phases to support (continuous trading? auctions?)
  • ✅ Ask about the number of symbols — this determines partitioning strategy

Data Structures

  • ✅ Use red-black tree (Java TreeMap) keyed by price with FIFO queues per level
  • ✅ Maintain a hash map for O(1) order lookup by order ID (for cancellations)
  • ✅ Always store prices as integers (ticks) — never floats
  • ✅ Separate stop order book from the main limit order book

Concurrency & Performance

  • ✅ Single-threaded matching engine per partition — no locks needed
  • ✅ LMAX Disruptor ring buffer for inter-thread communication (not a blocking queue)
  • ✅ Pre-allocate all objects at startup — no GC pressure on hot path
  • ✅ CPU affinity + busy-wait spin loop for the matching thread

Durability & Consistency

  • ✅ Sequencer assigns globally monotonic sequence numbers before matching
  • ✅ WAL written synchronously before execution ACK sent to broker
  • ✅ Event sourcing enables deterministic replay and time-travel debugging
  • ✅ Idempotent order IDs (ClOrdID) prevent duplicate fills on retry

Scalability & Availability

  • ✅ Partition symbols across matching engines; hot symbols get dedicated engines
  • ✅ Active-passive failover with WAL replication; target <50ms failover time
  • ✅ UDP multicast for market data fan-out — O(1) regardless of subscriber count
  • ✅ Snapshot + delta architecture for new subscriber bootstrap
  • ✅ Kernel-bypass networking (DPDK/RDMA) for µs-latency on critical path

Risk & Compliance

  • ✅ Pre-trade: fat finger, position limits, duplicate order detection at gateway
  • ✅ Post-trade: circuit breakers (LULD bands, market-wide halts) in matching engine
  • ✅ Kill switch cancels all orders for a participant in <100ms
  • ✅ CCP novation for counterparty risk elimination; multilateral netting for settlement efficiency
  • ✅ Immutable audit log (HMAC-signed events) for regulatory compliance

Key Architecture Tradeoffs at a Glance

| Decision | Choice | Tradeoff |
|---|---|---|
| Matching concurrency | Single-threaded per partition | No lock overhead; limited by single-core speed |
| Persistence | Sync WAL + async checkpoint | Durability without blocking the matching loop |
| Market data delivery | UDP multicast | O(1) fan-out; requires gap detection + retransmit |
| HA mechanism | Active-passive with WAL streaming | <50 ms failover; no Paxos overhead on the hot path |
| Networking | DPDK / RDMA kernel bypass | 1–2 µs latency; complex ops and deployment |
| Order book structure | Red-black tree + FIFO queue per level | O(log P) level ops; O(1) order insert/cancel |

Last updated: April 7, 2026