Designing a Stock Exchange at Scale: Order Book, Matching Engine & Market Data Feed
Stock exchanges are among the most demanding distributed systems ever built — they must process millions of orders per second with microsecond latency, guarantee strict price-time priority fairness, survive hardware failures without losing a single trade, and feed real-time market data to thousands of subscribers simultaneously. This deep-dive covers every architectural layer from the FIX gateway to the clearing house, giving you the knowledge to ace a senior system design interview and to reason confidently about production trading infrastructure.
TL;DR — Architecture in One Paragraph
A production stock exchange routes orders through a FIX protocol gateway for pre-trade risk checks, serializes them into a sequencer for global ordering, then processes them in the matching engine — a single-threaded LMAX Disruptor ring buffer backed by a red-black tree order book per symbol with price-time priority. Executions are multicast as a market data feed to subscribers and forwarded to a central clearing counterparty (CCP) for T+1 settlement. A risk engine enforces position limits and circuit breakers. The entire critical path runs on kernel-bypass networking (DPDK/RDMA) targeting <10µs end-to-end latency.
Table of Contents
- Functional & Non-Functional Requirements
- High-Level Architecture Overview
- Order Gateway & FIX Protocol
- Order Book Design
- Matching Engine & LMAX Disruptor
- Market Data Feed
- Trade Settlement & Clearing
- Risk Engine & Circuit Breakers
- Persistence & Event Sourcing
- Capacity Estimation
- High Availability & Disaster Recovery
- System Design Interview Checklist
1. Functional & Non-Functional Requirements
Before jumping to architecture, nail the requirements. An interviewer asking you to "design a stock exchange" expects you to define scope precisely — the delta between a simple limit order system and a full exchange is enormous.
Functional Requirements
- Order Types: Limit orders (price + quantity), market orders (execute at best available price), stop orders (trigger at a threshold price), stop-limit, iceberg/reserve orders (hidden quantity), immediate-or-cancel (IOC), fill-or-kill (FOK), and good-till-cancel (GTC).
- Matching Rules: Strict price-time priority — best price wins; ties resolved by earliest submission timestamp at that price level. All-or-nothing fill semantics for FOK orders.
- Instruments: Equities, ETFs, derivatives (futures/options), bonds — each with its own tick size, lot size, and trading hours configuration.
- Market Phases: Pre-open (order accumulation, no matching), opening auction (call auction determines reference price), continuous trading, closing auction, after-hours. Phase transitions must be atomic.
- Market Data: Real-time Level 1 feed (best bid/ask, last trade), Level 2 feed (full order book depth), trade prints, order acknowledgements, and end-of-day summary files.
- Participant Services: Order entry, order modification, order cancellation, position query, trade history, margin/collateral management.
Non-Functional Requirements (the hard part)
| Metric | Tier-1 Exchange Target | Notes |
|---|---|---|
| Order throughput | 500K–5M orders/sec | NSE India processes ~2M orders/sec peak |
| End-to-end latency (P99) | <10 µs (colocation); <1 ms (internet) | NYSE Pillar: ~2 µs median |
| Market data fan-out | 10K–100K subscribers | Multicast UDP at <1 µs per symbol update |
| Availability | 99.999% (5.25 min downtime/year) | During market hours only |
| Durability | Zero order loss, zero duplicate fills | Idempotent sequencing mandatory |
| Settlement finality | T+1 (US equities as of 2024) | T+2 for most EU markets |
2. High-Level Architecture Overview
A production exchange is composed of six logically separate subsystems, each with distinct latency, throughput, and consistency requirements. Understanding the data flow between them is the key to getting this design right in an interview.
Order Lifecycle — Critical Path
Client → FIX Gateway (pre-trade risk) → Sequencer (global timestamp + sequence number) → Matching Engine (order book + trade execution) → Drop Copy (execution reports to broker) + Market Data Publisher (book updates) + Clearing Gateway (trade confirmation). The critical path from order receipt to execution acknowledgement must never cross a network hop — everything runs in a single NUMA node with kernel-bypass networking.
Why Single-Threaded Matching?
Counter-intuitively, the matching engine is single-threaded. Multi-threaded matching requires locks on the order book — and lock contention at microsecond timescales dominates execution time. By partitioning symbols across single-threaded matching engines (one engine owns a set of symbols), you achieve linear scalability with zero locking overhead. The LMAX Disruptor pattern is the canonical implementation: a pre-allocated ring buffer replaces all queue synchronization with memory barriers, achieving 25 million events/sec on commodity hardware.
- No GC pauses: The engine is written in Java/C++ with object pooling. No heap allocation on the hot path — all order objects are pre-allocated from a memory pool.
- CPU affinity: The matching thread is pinned to a dedicated CPU core, isolated from the OS scheduler using isolcpus and taskset.
- Busy-wait spin loop: The engine never sleeps — it spins on the ring buffer's sequence barrier, trading CPU burn for guaranteed sub-microsecond response to new orders.
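The spin-wait idea can be sketched in a few lines of Java. This is a simplified stand-in for a Disruptor sequence barrier, not the real API: SpinBarrier and its methods are illustrative names, and a production implementation adds cache-line padding to avoid false sharing.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a busy-wait consumer spinning on a published sequence.
// The real Disruptor adds cache-line padding and batched sequence claims.
class SpinBarrier {
    private final AtomicLong published = new AtomicLong(-1);

    // Producer side: make sequence `seq` visible to consumers.
    void publish(long seq) { published.set(seq); }

    // Consumer side: spin (never sleep) until `seq` has been published.
    long waitFor(long seq) {
        long available;
        while ((available = published.get()) < seq) {
            Thread.onSpinWait(); // CPU hint (Java 9+); cheaper than yielding to the scheduler
        }
        return available;
    }
}
```

The cost is one core pegged at 100% per spinning consumer, which is exactly the trade the bullet above describes.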
3. Order Gateway & FIX Protocol
The order gateway is the exchange's front door. Every order from every broker passes through here. It must handle thousands of concurrent TCP sessions, validate orders against pre-trade risk rules, and serialize accepted orders into the sequencer — all in microseconds.
FIX Protocol Deep Dive
FIX (Financial Information eXchange) is the industry-standard messaging protocol for pre-trade and trade communication. The wire format uses tag=value pairs separated by SOH (ASCII 0x01). Modern exchanges support FIX 4.2/4.4 and the binary FAST (FIX Adapted for STreaming) encoding for latency-critical paths:
// FIX 4.4 New Order Single (tag 35=D) — wire format (SOH shown as |)
8=FIX.4.4|9=176|35=D|49=BROKER1|56=EXCHANGE|34=1001|52=20260407-09:30:00.123|
11=ORD-20260407-00001|55=AAPL|54=1|38=100|40=2|44=185.50|59=0|10=042|
// Tag reference:
// 35=D → MsgType: New Order Single
// 49 → SenderCompID (broker)
// 56 → TargetCompID (exchange)
// 11 → ClOrdID (client order ID — must be unique per session per day)
// 55 → Symbol
// 54=1 → Side: Buy (2=Sell)
// 38 → OrderQty
// 40=2 → OrdType: Limit (1=Market, 3=Stop, 4=StopLimit)
// 44 → Price (only for Limit orders)
// 59=0 → TimeInForce: Day (1=GTC, 3=IOC, 4=FOK)
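As an illustration of the wire format above, a naive tag=value parser might look like this. FixParser is a hypothetical sketch: production gateways use zero-allocation binary parsers and validate BodyLength (9) and the checksum (10), both of which this skips.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: split a FIX message on SOH (0x01) into a tag -> value map.
class FixParser {
    static final char SOH = '\u0001';

    static Map<Integer, String> parse(String msg) {
        Map<Integer, String> fields = new HashMap<>();
        for (String pair : msg.split(String.valueOf(SOH))) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(Integer.parseInt(pair.substring(0, eq)),
                           pair.substring(eq + 1));
            }
        }
        return fields;
    }
}
```

Note that real FIX messages also contain repeating groups (e.g. parties, legs), which a flat tag-to-value map cannot represent.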
Pre-Trade Risk Checks
Pre-trade risk is the gateway's most critical function — it stops bad orders before they enter the matching engine. These checks must complete in <5 µs:
- Position limit check: Reject if filling this order would take the broker's net position in a symbol beyond its approved limit (pre-loaded from the risk database at session start).
- Credit/margin check: For buy orders — available buying power ≥ order value. For short sells — check short sell approval and locate.
- Fat finger check: Reject if price deviates more than X% from the reference price (last trade or NBBO), or if order quantity exceeds per-order maximum.
- Duplicate order check: Reject if ClOrdID has already been seen this session (idempotency guard maintained in a lock-free hash map).
- Symbol status check: Reject if symbol is halted, in auction phase, or outside trading hours.
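Two of these checks, the fat-finger collar and the ClOrdID duplicate guard, can be sketched as follows. PreTradeRisk is an illustrative class with assumed thresholds; a production gateway would use a lock-free map and per-symbol reference prices pre-loaded at session start.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: gateway-side fat-finger and duplicate-order checks.
class PreTradeRisk {
    private final Set<String> seenClOrdIds = new HashSet<>(); // per-session idempotency guard
    private final double maxDeviationPct;

    PreTradeRisk(double maxDeviationPct) { this.maxDeviationPct = maxDeviationPct; }

    // Fat-finger: reject if price deviates too far from the reference price (both in ticks).
    boolean priceOk(long priceTicks, long referenceTicks) {
        double devPct = Math.abs(priceTicks - referenceTicks) * 100.0 / referenceTicks;
        return devPct <= maxDeviationPct;
    }

    // Duplicate guard: first sighting of a ClOrdID passes; repeats are rejected.
    boolean firstUse(String clOrdId) { return seenClOrdIds.add(clOrdId); }
}
```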
Sequencer — The Global Order Arbiter
After passing pre-trade risk, each order is handed to the sequencer — the most critical single point of truth in the entire exchange. The sequencer assigns a globally monotonically increasing sequence number and a nanosecond-precision timestamp to every order, cancellation, and modification. This sequence number defines the canonical order in which the matching engine processes events. The sequencer is a single process — it must be replicated for HA, but only one instance is primary at any time. It publishes to a replicated log (similar to Kafka but with hardware-level ordering guarantees) that the matching engine reads from.
Colocation Services
Tier-1 exchanges offer colocation: high-frequency trading firms physically place their servers in the exchange's data center. Colocation eliminates wide-area fiber latency (roughly 5 µs per km in fiber, so even a 20 km run adds ~200 µs round-trip), giving colocated participants a structural speed advantage. This raises fairness concerns, which exchanges address through:
- Equidistant cable runs: All colocated racks are connected with equal-length fiber from the sequencer — ensuring every participant's network RTT is identical.
- Randomized queue order within the same timestamp: When two orders arrive within the same nanosecond clock tick, a random tie-breaker prevents deterministic front-running.
4. Order Book Design
The order book is the core data structure of any exchange. For each tradable symbol, the exchange maintains a separate order book that tracks all outstanding limit orders on both sides (buy = bids, sell = asks), sorted by price-time priority.
Red-Black Tree Implementation
The canonical order book data structure uses a red-black tree keyed by price level, where each price level node contains a doubly linked FIFO queue of orders. This gives O(log P) operations for adding/removing price levels (where P is the number of distinct price levels) and O(1) for inserting/canceling orders within a known price level:
// Conceptual Java order book structure
class PriceLevel {
final long price; // in ticks (integer — never float!)
long totalQuantity; // sum of all orders at this level
final Deque<Order> orders; // FIFO — price-time priority
}
class OrderBook {
// Bids: highest price first → TreeMap in DESCENDING order
final TreeMap<Long, PriceLevel> bids = new TreeMap<>(Comparator.reverseOrder());
// Asks: lowest price first → TreeMap in ASCENDING order
final TreeMap<Long, PriceLevel> asks = new TreeMap<>();
// O(1) cancel: direct lookup by order ID
final HashMap<String, Order> orderIndex = new HashMap<>();
PriceLevel bestBid() { return bids.isEmpty() ? null : bids.firstEntry().getValue(); }
PriceLevel bestAsk() { return asks.isEmpty() ? null : asks.firstEntry().getValue(); }
long spread() { return bestAsk().price - bestBid().price; } // in ticks; valid only when both sides are non-empty
}
Order Types and Book Mechanics
| Order Type | Book Behavior | Matching Rule |
|---|---|---|
| Limit | Rests in book if not immediately matchable | Match at limit price or better |
| Market | Never rests — execute immediately or cancel | Walk the book, consume all available liquidity |
| Stop | Held in stop book (separate data structure) | Triggered when last trade crosses stop price |
| Iceberg | Only peak quantity visible; reserve hidden | Refills peak from reserve; loses time priority on refill |
| IOC (Immediate-or-Cancel) | Never rests in book | Fill what's available, cancel remainder |
| FOK (Fill-or-Kill) | Never rests in book | Check if full quantity available; fill all or cancel all atomically |
Price Representation — Never Use Floats
A critical implementation detail: prices are always represented as integers (ticks), never floating-point numbers. For a stock with a minimum tick size of $0.01, the price $185.50 is stored as the integer 18550. This eliminates floating-point rounding errors that could cause incorrect price comparisons — a bug that could result in regulatory fines and reputational damage. The tick size is an instrument-level configuration parameter.
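A sketch of the conversion at the gateway boundary, using BigDecimal so the decimal price string never touches binary floating point. TickConverter is an illustrative name; the tick size comes from per-instrument configuration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Sketch: convert decimal price strings to integer ticks at the gateway boundary.
// All downstream comparisons and arithmetic then use exact integer math.
class TickConverter {
    // e.g. priceToTicks("185.50", "0.01") == 18550
    static long priceToTicks(String price, String tickSize) {
        BigDecimal p = new BigDecimal(price);   // exact decimal, no binary rounding
        BigDecimal t = new BigDecimal(tickSize);
        // RoundingMode.UNNECESSARY makes a non-multiple of the tick size throw
        // instead of silently rounding — such orders should be rejected.
        return p.divide(t).setScale(0, RoundingMode.UNNECESSARY).longValueExact();
    }
}
```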
5. Matching Engine & LMAX Disruptor
The matching engine is the heart of the exchange — the single component where every trade is born. Its design must satisfy two contradictory-seeming requirements: strict determinism (same inputs always produce same outputs, enabling replay) and extreme throughput (millions of events per second).
LMAX Disruptor Architecture
The LMAX Disruptor is a lock-free inter-thread communication library developed by LMAX Exchange for their retail FX trading platform. The key insight: replace a blocking queue with a pre-allocated ring buffer where producers and consumers coordinate via memory barriers and CAS operations rather than locks:
// LMAX Disruptor pipeline — order processing stages
//
// [FIX Gateway] ──publish──▶ [RingBuffer<OrderEvent>] ──consume──▶
//
// Stage 1: Journal Handler (write to WAL — must complete before Stage 2)
// Stage 2: Matching Handler (single thread — owns all order books)
// Stage 3 (parallel):
// ├── Market Data Handler (publish book updates)
// ├── Drop Copy Handler (execution reports to brokers)
// └── Clearing Handler (forward fills to CCP)
// Ring buffer is a power-of-2 sized array — index wraps with bitwise AND
// All slots pre-allocated at startup: NO heap allocation on hot path
int RING_BUFFER_SIZE = 1 << 20; // 1,048,576 slots (~16MB at 16B/event)
// Producer: claims next slot, writes event, publishes sequence
long seq = ringBuffer.next();
OrderEvent event = ringBuffer.get(seq);
event.populate(orderId, symbol, side, price, quantity, type);
ringBuffer.publish(seq); // single store-release fence — no lock
Price-Time Priority Matching Algorithm
The matching loop for a new incoming order is surprisingly straightforward once the order book is correctly structured:
// Pseudocode: match a new buy limit order against the ask side
List<Fill> match(Order incomingBuy, OrderBook book) {
List<Fill> fills = new ArrayList<>();
long remainingQty = incomingBuy.quantity;
// Walk ask side from best (lowest) price upward
while (remainingQty > 0 && !book.asks.isEmpty()) {
PriceLevel bestAsk = book.bestAsk();
// Price check: buy limit price must be ≥ ask price
if (incomingBuy.price < bestAsk.price) break; // no match possible
// Walk FIFO queue at this price level (time priority)
while (remainingQty > 0 && !bestAsk.orders.isEmpty()) {
Order restingOrder = bestAsk.orders.peekFirst();
long fillQty = Math.min(remainingQty, restingOrder.quantity);
long fillPrice = restingOrder.price; // resting order sets the price
fills.add(new Fill(incomingBuy.id, restingOrder.id, fillPrice, fillQty));
remainingQty -= fillQty;
restingOrder.quantity -= fillQty;
if (restingOrder.quantity == 0) {
bestAsk.orders.pollFirst(); // fully filled — remove from queue
book.orderIndex.remove(restingOrder.id);
}
}
if (bestAsk.orders.isEmpty()) book.asks.pollFirstEntry(); // empty level
}
// If unfilled remainder and order is not IOC/FOK → rest in book
if (remainingQty > 0 && incomingBuy.type == LIMIT) {
book.addBid(incomingBuy.withQuantity(remainingQty));
}
return fills;
}
Opening Auction Call Algorithm
The opening auction (call auction) runs at market open to establish a reference price from accumulated orders. The algorithm finds the uncrossing price — the price that maximizes executable volume:
- Aggregate cumulative demand (sum of buy orders at each price and above) and supply (sum of sell orders at each price and below).
- Find the uncrossing price: the price that maximizes executable volume, i.e. min(cumulative demand, cumulative supply). If multiple prices tie on volume, choose the one that minimizes the residual order imbalance; if still tied, choose the price closest to yesterday's close.
- At the uncrossing price, match all orders that can execute. Remaining unmatched orders enter the continuous book.
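The uncrossing computation can be sketched as follows. This is a simplification: limit orders only, no market orders, and the imbalance and closest-to-previous-close tie-breaks are omitted (the first price found with maximum volume wins). CallAuction is an illustrative name.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: find the auction uncrossing price that maximizes executable volume.
// buys/sells map limit price (ticks) -> total quantity at that price.
class CallAuction {
    static long uncrossingPrice(TreeMap<Long, Long> buys, TreeMap<Long, Long> sells) {
        long bestPrice = -1, bestVolume = -1;
        // Candidate prices: every limit price present on either side.
        TreeMap<Long, Long> candidates = new TreeMap<>();
        candidates.putAll(buys);
        candidates.putAll(sells);
        for (long p : candidates.keySet()) {
            long demand = 0, supply = 0;
            for (Map.Entry<Long, Long> b : buys.entrySet())
                if (b.getKey() >= p) demand += b.getValue();  // buyers willing to pay >= p
            for (Map.Entry<Long, Long> s : sells.entrySet())
                if (s.getKey() <= p) supply += s.getValue();  // sellers willing to accept <= p
            long executable = Math.min(demand, supply);
            if (executable > bestVolume) { bestVolume = executable; bestPrice = p; }
        }
        return bestPrice;
    }
}
```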
6. Market Data Feed
Market data dissemination is a massive fan-out problem: every trade execution and order book change must be broadcast to thousands of subscribers in real time with zero data loss and guaranteed ordering. The architecture is fundamentally different from a typical pub/sub system — unicast TCP is unsuitable at this scale.
Level 1 vs Level 2 Data
- Level 1 (Top-of-book): Best bid price & size, best ask price & size, last trade price, volume, and VWAP. Broadcast on every book update. ~50–200 bytes per message. This is the NBBO (National Best Bid and Offer) data that retail brokers must use for best execution.
- Level 2 (Full depth): All price levels on both sides of the book, typically top 5–20 levels. Substantially larger messages (~2–10 KB per symbol update). Used by market makers and algorithmic traders for queue position analysis.
- Level 3 / Order-by-order: Every individual order event (new, cancel, modify, fill). Used by HFT firms for microstructure analysis. Highest bandwidth, often only available on premium feeds.
Multicast UDP Feed Architecture
The exchange publishes market data over UDP multicast — one packet reaches all subscribers simultaneously, making fan-out O(1) regardless of subscriber count. Every message carries a monotonically increasing sequence number enabling gap detection:
// Market data message structure (binary, network byte order)
struct MarketDataMessage {
uint32_t seqNum; // monotonically increasing per channel
uint64_t timestamp; // nanoseconds since epoch
uint16_t msgType; // QUOTE_UPDATE=1, TRADE=2, ORDER_ADD=3, CANCEL=4
char symbol[8]; // null-padded
int64_t bidPrice; // in ticks
uint32_t bidSize;
int64_t askPrice; // in ticks
uint32_t askSize;
int64_t lastPrice; // last trade price in ticks
uint32_t lastSize; // last trade volume
uint64_t volume; // cumulative daily volume
}; // Total: 66 bytes packed — fits in a single UDP packet (MTU 1500)
// Gap detection on subscriber side:
if (msg.seqNum != expectedSeqNum) {
// GAP DETECTED: request retransmission via unicast TCP recovery channel
requestRetransmit(expectedSeqNum, msg.seqNum - 1);
}
Snapshot + Delta Architecture
A new subscriber cannot reconstruct the current book state by joining a live delta feed mid-session. The exchange solves this with a snapshot service: periodically (every few seconds) a full order book snapshot is published on a separate multicast group. Subscribers start by consuming the snapshot, then switch to the live delta feed starting from the snapshot's sequence number, replaying any buffered deltas that arrived after the snapshot was taken. This is identical to the read-model bootstrap pattern in CQRS/Event Sourcing.
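The bootstrap sequence can be sketched with the book state reduced to a single number for illustration. FeedBootstrap and Delta are hypothetical names; a real subscriber reconstructs full per-symbol order books.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: subscriber bootstrap — buffer live deltas, apply a snapshot,
// then replay only the buffered deltas newer than the snapshot's sequence.
class FeedBootstrap {
    static class Delta {
        final long seq; final long change;
        Delta(long seq, long change) { this.seq = seq; this.change = change; }
    }

    private final List<Delta> buffered = new ArrayList<>();
    private long state;       // stand-in for reconstructed book state
    private long lastSeq = -1;

    // Phase 1: buffer live deltas while waiting for the snapshot.
    void onLiveDelta(Delta d) { buffered.add(d); }

    // Phase 2: apply snapshot, then replay buffered deltas with seq > snapshotSeq.
    void onSnapshot(long snapshotSeq, long snapshotState) {
        state = snapshotState;
        lastSeq = snapshotSeq;
        for (Delta d : buffered)
            if (d.seq > snapshotSeq) { state += d.change; lastSeq = d.seq; }
        buffered.clear();
    }

    long state()   { return state; }
    long lastSeq() { return lastSeq; }
}
```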
7. Trade Settlement & Clearing
Trade execution and trade settlement are separated by days — a critical design feature of financial markets. Execution is real-time; settlement (the actual transfer of securities and cash) happens on T+1 (US equities) or T+2 (EU). This time gap introduces counterparty credit risk, which the Central Clearing Counterparty (CCP) is designed to eliminate.
Central Clearing Counterparty (CCP) and Novation
The moment a trade executes on the exchange, the CCP (e.g., DTCC in the US, LCH Clearnet in Europe) steps in as the legal counterparty to both sides through a process called novation: the original contract between Buyer A and Seller B is replaced by two new contracts — one between Buyer A and the CCP, and one between the CCP and Seller B. This means:
- No participant is exposed to the default of their trading counterparty — only to the default of the CCP, which is backstopped by the default fund contributed by all clearing members.
- The CCP can net positions across thousands of trades per participant, reducing settlement obligations by 90%+ through multilateral netting.
- Margin (initial margin + variation margin) is collected from each clearing member daily to cover potential losses in the event of member default.
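A toy sketch of multilateral netting for a single instrument (MultilateralNetting is an illustrative class; a real CCP nets per instrument, per settlement date, and in cash as well). Note how a member whose buys and sells offset ends up with nothing to settle:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: CCP multilateral netting for one instrument.
// Each trade adds +qty to the buyer's net position and -qty to the seller's;
// after novation, every member settles only its net against the CCP.
class MultilateralNetting {
    private final Map<String, Long> netPosition = new HashMap<>();

    void onTrade(String buyer, String seller, long qty) {
        netPosition.merge(buyer, qty, Long::sum);
        netPosition.merge(seller, -qty, Long::sum);
    }

    long net(String member) { return netPosition.getOrDefault(member, 0L); }
}
```

In the usage below, member B trades 200 shares gross but settles zero net: that collapse of gross obligations is where the large netting efficiencies come from.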
Settlement Flow
| Day | Activity | System |
|---|---|---|
| T+0 (Trade Day) | Trade executes; CCP steps in via novation; margin call calculated | Exchange → CCP Gateway |
| T+0 EOD | Multilateral netting; net settlement obligations calculated per member | CCP Netting Engine |
| T+1 Morning | Variation margin payments; failed trade handling; CNS (Continuous Net Settlement) | DTCC / CSD |
| T+1 EOD | Securities and cash transferred via Central Securities Depository (CSD) | DTC / Fedwire |
Position Ledger Design
The clearing system maintains a position ledger — a real-time record of every participant's net position in every instrument. The ledger must support:
- Atomic double-entry accounting: Every fill simultaneously increases the buyer's position and decreases the seller's position; cash flows in the opposite direction. All updates use database transactions with SERIALIZABLE isolation.
- Intraday mark-to-market: Positions are revalued against current market prices to calculate unrealized P&L and margin requirements continuously throughout the day.
- Audit trail: Every position change is immutably recorded with trade ID, timestamp, and counterparty — regulatory requirement under MiFID II, Dodd-Frank.
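The double-entry rule for a single fill can be sketched in memory as follows. PositionLedger is an illustrative stand-in for the transactional ledger; cash is kept in ticks × shares so everything stays in integer math.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: double-entry position ledger. Each fill moves securities one way
// and cash the other; the totals across all members always sum to zero.
class PositionLedger {
    private final Map<String, Long> shares = new HashMap<>();
    private final Map<String, Long> cash = new HashMap<>();

    // Apply one fill atomically (a real ledger wraps this in a SERIALIZABLE txn).
    void applyFill(String buyer, String seller, long qty, long priceTicks) {
        long cashValue = qty * priceTicks;
        shares.merge(buyer, qty, Long::sum);
        shares.merge(seller, -qty, Long::sum);
        cash.merge(buyer, -cashValue, Long::sum);
        cash.merge(seller, cashValue, Long::sum);
    }

    long shares(String member) { return shares.getOrDefault(member, 0L); }
    long cash(String member)   { return cash.getOrDefault(member, 0L); }
}
```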
8. Risk Engine & Circuit Breakers
The risk engine is the exchange's immune system — it prevents runaway algorithms, cascading liquidations, and market-wide crashes from propagating. Risk operates at two timescales: microsecond pre-trade checks in the gateway, and millisecond post-trade surveillance across the full market.
Position and Value-at-Risk Limits
- Gross position limit: Maximum absolute notional value in a single symbol per participant. Enforced in the gateway risk check cache (updated from position ledger every 100ms).
- Net exposure limit: Maximum net long or short position across correlated symbols (e.g., index futures vs. underlying stocks).
- VaR (Value at Risk): Statistical measure of maximum expected loss at 99% confidence over 1-day horizon. Calculated nightly; used to set margin requirements.
- Order rate limit: Maximum orders per second per session. Prevents runaway order-flooding attacks. Implemented as a token bucket per SenderCompID (FIX tag 49).
- Kill switch: A single-button (or API call) that immediately cancels ALL outstanding orders for a participant and blocks any new orders. Must execute in <100ms for all symbols. Triggered automatically when position limit breach is detected, or manually by the exchange's market surveillance team.
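The per-session order rate limit is a textbook token bucket. Below is a sketch with an injectable clock so refill behavior is deterministic; TokenBucket and its parameters are illustrative.

```java
// Sketch: token-bucket order rate limiter, one bucket per FIX session.
// The clock (nanos) is passed in so refill logic is deterministic and testable.
class TokenBucket {
    private final long capacity;       // burst size
    private final long refillPerSec;   // sustained orders/sec
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, long refillPerSec, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = nowNanos;
    }

    // Returns true if the order is admitted, false if the rate limit is breached.
    boolean tryAcquire(long nowNanos) {
        double elapsedSec = (nowNanos - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}
```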
Market Circuit Breakers
Circuit breakers halt trading when prices move too far, too fast — protecting against flash crashes like the 2010 US equity flash crash (Dow -1000 points in minutes) and the 2020 S&P 500 halts:
| Trigger | Action | Duration |
|---|---|---|
| S&P 500 drops 7% from prior close | Level 1 halt — all US equity markets pause | 15 minutes |
| S&P 500 drops 13% | Level 2 halt | 15 minutes |
| S&P 500 drops 20% | Level 3 halt — no resumption until next day | Rest of trading day |
| Individual stock moves ±5% in 5 min (LULD) | Single-stock trading pause (Limit-Up/Limit-Down band) | 5 minutes or until price returns to band |
Circuit Breaker Implementation
Circuit breakers are implemented as a continuously evaluated rule engine inside the matching engine itself. After every fill, the engine checks:
// LULD (Limit Up / Limit Down) band check after each fill
void checkLULD(String symbol, long fillPrice, long referencePrice) {
double pctChange = (double)(fillPrice - referencePrice) / referencePrice * 100;
double bandPct = isLargeCapStock(symbol) ? 5.0 : 10.0; // LULD band width
if (Math.abs(pctChange) >= bandPct) {
// Transition symbol to LULD auction mode
symbolState.put(symbol, LULD_PAUSE);
// Broadcast halt message to all market data subscribers
marketDataPublisher.publishHalt(symbol, LULD_PAUSE, System.nanoTime());
// Schedule resumption check in 5 minutes
scheduler.schedule(() -> checkLULDResumption(symbol), 5, TimeUnit.MINUTES);
}
}
9. Persistence & Event Sourcing
The exchange's state — every order, every cancellation, every fill — must be durably persisted without introducing latency into the matching engine's critical path. The solution is a combination of write-ahead logging, event sourcing, and asynchronous persistence.
Write-Ahead Log (WAL)
Every event entering the sequencer is synchronously written to a Write-Ahead Log on persistent storage before being published to the matching engine. This is a sequential append-only file on an NVMe SSD — sequential writes are extremely fast (<10 µs per event) because they avoid random I/O. The WAL is the source of truth for crash recovery: on restart, the matching engine replays the WAL from the last checkpoint to rebuild in-memory order book state.
Event Store Architecture
The exchange's event store is a pure event sourcing system — the current state of every order book is derived by replaying the stream of events from the beginning of the trading day (or from the last checkpoint). This gives several powerful properties:
- Deterministic replay: Given the same event stream, the matching engine always produces the same output. This is essential for debugging disputes, regulatory audits, and disaster recovery.
- Time-travel debugging: Replay the event stream to any point in time to reconstruct the exact order book state at that moment — invaluable for post-trade analysis.
- Temporal decoupling: Secondary systems (surveillance, analytics, reporting) process the same event stream asynchronously without impacting the matching engine's latency.
- Checkpointing: Every N minutes, a full snapshot of all order books is written to disk. On recovery, replay starts from the last snapshot + subsequent WAL events, not from market open, dramatically reducing recovery time.
// Event types in the exchange event store
enum EventType {
ORDER_SUBMITTED, // new order arrived at gateway
ORDER_ACCEPTED, // passed pre-trade risk, entered sequencer
ORDER_REJECTED, // failed risk check
ORDER_PARTIALLY_FILLED, // partial execution
ORDER_FULLY_FILLED,
ORDER_CANCELLED,
ORDER_MODIFIED, // price or quantity change
TRADE_EXECUTED, // complete fill record (both sides)
SYMBOL_HALTED, // circuit breaker or regulatory halt
SYMBOL_RESUMED,
MARKET_OPEN, // phase transition
MARKET_CLOSE,
CHECKPOINT // order book snapshot marker
}
// Each event is immutable, signed (HMAC-SHA256) for audit integrity,
// and replicated to at least 3 nodes before being ACKed to the producer.
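Recovery from the last checkpoint plus WAL replay can be sketched with the book state reduced to a running volume counter. Recovery and Event are illustrative names; a real engine rebuilds full order books from a snapshot plus subsequent events.

```java
import java.util.List;

// Sketch: crash recovery = load last checkpoint, then replay WAL events
// whose sequence numbers come after the checkpoint. Deterministic by construction.
class Recovery {
    static class Event {
        final long seq; final long qty;
        Event(long seq, long qty) { this.seq = seq; this.qty = qty; }
    }

    // checkpointSeq/checkpointVolume stand in for a full order-book snapshot.
    static long recoverVolume(long checkpointSeq, long checkpointVolume, List<Event> wal) {
        long volume = checkpointVolume;
        for (Event e : wal)
            if (e.seq > checkpointSeq)  // events at or before the checkpoint are already included
                volume += e.qty;
        return volume;
    }
}
```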
10. Capacity Estimation
Capacity planning for a stock exchange is critical — under-provisioning during a market volatility spike can cause outages at the worst possible moment. Let's work through the numbers for a mid-size exchange (comparable to CBOE or Nasdaq).
Order Flow Estimates
| Parameter | Estimate | Assumption |
|---|---|---|
| Listed symbols | 5,000 equities | Mid-size exchange |
| Peak order rate | 1,000,000 orders/sec | 10× average; volatility events |
| Average order size | 200 bytes | FIX message with all fields |
| Gateway inbound bandwidth | 200 MB/s = 1.6 Gbps | 1M × 200 bytes |
| Market data outbound (L1) | 5,000 symbols × 100 updates/sec × 66 bytes ≈ 33 MB/s | L1 per-symbol update rate |
| Market data outbound (L2) | 5,000 × 100 × 2 KB = 1 GB/s | Full book; multicast to all subscribers |
| Daily trade storage | ~100M events × 200 bytes = 20 GB/day | 10% of orders result in trades |
Symbol Partitioning Strategy
With 5,000 symbols and 1M orders/sec, a single matching engine cannot handle the full load. Partition symbols across multiple matching engine instances:
- Static partitioning: Assign symbols to engines based on a consistent hash of the symbol ticker. Simple but requires manual rebalancing when adding engines.
- Volume-weighted partitioning: Route high-volume symbols (AAPL, TSLA, SPY) to dedicated single-symbol engines; bin low-volume symbols into shared engines. Rebalanced daily based on prior day's volume.
- Cross-symbol orders: Spread orders (buy symbol A + sell symbol B atomically) require a two-phase commit across two engines — use a coordinator with rollback support.
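Static partitioning can be sketched as a stable hash of the ticker. SymbolPartitioner is illustrative; a production router would use a consistent-hash ring or an explicit symbol-to-engine table to support rebalancing without moving every symbol.

```java
// Sketch: route every order for a symbol to the same matching-engine partition.
// Math.floorMod keeps the index non-negative even for negative hashCodes.
class SymbolPartitioner {
    private final int engineCount;

    SymbolPartitioner(int engineCount) { this.engineCount = engineCount; }

    int engineFor(String symbol) {
        return Math.floorMod(symbol.hashCode(), engineCount);
    }
}
```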
11. High Availability & Disaster Recovery
A stock exchange going down during market hours is a regulatory incident — regulators may impose fines, and brokers can sue for losses. The HA architecture must balance two competing pressures: achieving sub-microsecond latency (which argues against distributed consensus) and tolerating hardware failures without data loss (which normally requires distributed consensus).
Active-Passive Failover with WAL Replication
The standard pattern for exchange HA is active-passive failover: one primary matching engine processes all orders; one or more standby engines receive a real-time replica of the WAL over a dedicated high-speed link (InfiniBand or 100GbE). The standby applies WAL entries to its own in-memory order books, maintaining a replica state that is at most a few microseconds behind the primary.
- Failover trigger: The primary sends heartbeats every 1ms to a watchdog process. If three consecutive heartbeats are missed (>3ms gap), the watchdog promotes the standby to primary and re-routes the sequencer's output.
- Failover time: Target <50ms end-to-end (including TCP session re-establishment for gateway sessions). During the failover window, new orders are buffered in the gateway; resting orders are preserved in the standby's in-memory book.
- Split-brain prevention: Only one node can hold the "leader lease" at any time. The lease is granted by a quorum of 3 ZooKeeper nodes and expires if not renewed within 10ms — preventing both nodes from acting as primary simultaneously.
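The missed-heartbeat detection can be sketched as follows. FailoverWatchdog is illustrative; the lease acquisition and sequencer re-routing performed on promotion are omitted.

```java
// Sketch: watchdog promotes the standby after a >3ms heartbeat gap
// (three consecutive missed 1ms heartbeats, per the text above).
class FailoverWatchdog {
    private static final long GAP_THRESHOLD_NANOS = 3_000_000L; // 3 ms
    private long lastHeartbeatNanos;
    private boolean standbyPromoted = false;

    FailoverWatchdog(long nowNanos) { this.lastHeartbeatNanos = nowNanos; }

    void onHeartbeat(long nowNanos) { lastHeartbeatNanos = nowNanos; }

    // Called on a timer: promote the standby if the primary has gone silent.
    boolean check(long nowNanos) {
        if (!standbyPromoted && nowNanos - lastHeartbeatNanos > GAP_THRESHOLD_NANOS) {
            standbyPromoted = true; // in production: acquire leader lease, re-route sequencer
        }
        return standbyPromoted;
    }
}
```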
Disaster Recovery: Geographic Redundancy
For data center-level failures (power, cooling, fiber cut), the exchange maintains a full DR site in a geographically separate location. This introduces a fundamental latency tradeoff: light in fiber covers roughly 1 km per 5 µs, so a 100 µs round-trip budget caps synchronous replication at about 10 km; beyond that, replication must be asynchronous, accepting a small RPO (Recovery Point Objective):
- RPO (Recovery Point Objective): Maximum acceptable data loss. For exchanges, this is typically 0 orders — every accepted order must be in the DR site before an ACK is sent to the broker. Achieved via synchronous WAL replication to a near-site DR node (<10km away).
- RTO (Recovery Time Objective): Maximum acceptable downtime. Tier-1 exchanges target <15 minutes for a data center failover. This requires a warm standby DR site with in-memory order books already populated.
- Regulatory requirement: SEC Rule 17a-1 and CFTC Regulation 1.31 require records to be retained for at least 5 years, and SEC Regulation SCI requires exchanges to maintain documented business continuity/DR plans and test them at least annually.
Kernel-Bypass Networking for Latency
At microsecond timescales, the Linux kernel's network stack (~100µs syscall overhead) is unacceptably slow. Exchanges use kernel-bypass networking to achieve 1–2µs NIC-to-application latency:
- DPDK (Data Plane Development Kit): Application reads packets directly from NIC memory via DMA, bypassing the kernel's TCP/IP stack. Used for the FIX gateway and market data publisher.
- RDMA (Remote Direct Memory Access): For WAL replication between primary and standby, RDMA writes directly to the remote machine's memory without involving either CPU, achieving ~1µs replication latency.
- Solarflare / Xilinx SmartNIC: Application runs directly on NIC FPGA firmware — the matching engine's hot path never enters the host CPU. Used by the most latency-sensitive HFT firms.
12. System Design Interview Checklist
When asked to "design a stock exchange" in a system design interview, structure your answer around these decision points. Interviewers at FAANG-tier and quantitative finance firms will probe each area.
Requirements & Scope
- ✅ Define order types (limit, market, stop, IOC, FOK) before designing the book
- ✅ State throughput target (orders/sec) and latency SLA (µs vs ms) early
- ✅ Clarify which phases to support (continuous trading? auctions?)
- ✅ Ask about the number of symbols — this determines partitioning strategy
Data Structures
- ✅ Use red-black tree (Java TreeMap) keyed by price with FIFO queues per level
- ✅ Maintain a hash map for O(1) order lookup by order ID (for cancellations)
- ✅ Always store prices as integers (ticks) — never floats
- ✅ Separate stop order book from the main limit order book
Concurrency & Performance
- ✅ Single-threaded matching engine per partition — no locks needed
- ✅ LMAX Disruptor ring buffer for inter-thread communication (not a blocking queue)
- ✅ Pre-allocate all objects at startup — no GC pressure on hot path
- ✅ CPU affinity + busy-wait spin loop for the matching thread
Durability & Consistency
- ✅ Sequencer assigns globally monotonic sequence numbers before matching
- ✅ WAL written synchronously before execution ACK sent to broker
- ✅ Event sourcing enables deterministic replay and time-travel debugging
- ✅ Idempotent order IDs (ClOrdID) prevent duplicate fills on retry
Scalability & Availability
- ✅ Partition symbols across matching engines; hot symbols get dedicated engines
- ✅ Active-passive failover with WAL replication; target <50ms failover time
- ✅ UDP multicast for market data fan-out — O(1) regardless of subscriber count
- ✅ Snapshot + delta architecture for new subscriber bootstrap
- ✅ Kernel-bypass networking (DPDK/RDMA) for µs-latency on critical path
Risk & Compliance
- ✅ Pre-trade: fat finger, position limits, duplicate order detection at gateway
- ✅ Post-trade: circuit breakers (LULD bands, market-wide halts) in matching engine
- ✅ Kill switch cancels all orders for a participant in <100ms
- ✅ CCP novation for counterparty risk elimination; multilateral netting for settlement efficiency
- ✅ Immutable audit log (HMAC-signed events) for regulatory compliance
Key Architecture Tradeoffs at a Glance
| Decision | Choice | Tradeoff |
|---|---|---|
| Matching concurrency | Single-threaded per partition | No lock overhead; limited by CPU single-core speed |
| Persistence | Sync WAL + async checkpoint | Durability without blocking matching loop |
| Market data delivery | UDP multicast | O(1) fan-out; requires gap detection + retransmit |
| HA mechanism | Active-passive with WAL streaming | <50ms failover; no Paxos overhead on hot path |
| Networking | DPDK / RDMA kernel-bypass | 1–2µs latency; complex ops and deployment |
| Order book structure | Red-black tree + FIFO queue per level | O(log P) level ops; O(1) order insert/cancel |