Designing a Stock Exchange at Scale: Order Book, Matching Engine & Market Data Feed
Stock exchanges are among the most demanding distributed systems ever built — they must process millions of orders per second with microsecond latency, guarantee strict price-time priority fairness, survive hardware failures without losing a single trade, and feed real-time market data to thousands of subscribers simultaneously. This deep-dive covers every architectural layer from the FIX gateway to the clearing house, giving you the knowledge to ace a senior system design interview and to reason confidently about production trading infrastructure.
TL;DR — Architecture in One Paragraph
A production stock exchange routes orders through a FIX protocol gateway for pre-trade risk checks, serializes them into a sequencer for global ordering, then processes them in the matching engine — a single-threaded LMAX Disruptor ring buffer backed by a red-black tree order book per symbol with price-time priority. Executions are multicast as a market data feed to subscribers and forwarded to a central clearing counterparty (CCP) for T+1 settlement. A risk engine enforces position limits and circuit breakers. The entire critical path runs on kernel-bypass networking (DPDK/RDMA) targeting <10µs end-to-end latency.
Table of Contents
- Functional & Non-Functional Requirements
- High-Level Architecture Overview
- Order Gateway & FIX Protocol
- Order Book Design
- Matching Engine & LMAX Disruptor
- Market Data Feed
- Trade Settlement & Clearing
- Risk Engine & Circuit Breakers
- Persistence & Event Sourcing
- Capacity Estimation
- High Availability & Disaster Recovery
- System Design Interview Checklist
1. Functional & Non-Functional Requirements
Before jumping to architecture, nail the requirements. An interviewer asking you to "design a stock exchange" expects you to define scope precisely — the delta between a simple limit order system and a full exchange is enormous.
Functional Requirements
- Order Types: Limit orders (price + quantity), market orders (execute at best available price), stop orders (trigger at a threshold price), stop-limit, iceberg/reserve orders (hidden quantity), immediate-or-cancel (IOC), fill-or-kill (FOK), and good-till-cancel (GTC).
- Matching Rules: Strict price-time priority — best price wins; ties resolved by earliest submission timestamp at that price level. All-or-nothing fill semantics for FOK orders.
- Instruments: Equities, ETFs, derivatives (futures/options), bonds — each with its own tick size, lot size, and trading hours configuration.
- Market Phases: Pre-open (order accumulation, no matching), opening auction (call auction determines reference price), continuous trading, closing auction, after-hours. Phase transitions must be atomic.
- Market Data: Real-time Level 1 feed (best bid/ask, last trade), Level 2 feed (full order book depth), trade prints, order acknowledgements, and end-of-day summary files.
- Participant Services: Order entry, order modification, order cancellation, position query, trade history, margin/collateral management.
Non-Functional Requirements (the hard part)
| Metric | Tier-1 Exchange Target | Notes |
|---|---|---|
| Order throughput | 500K–5M orders/sec | NSE India processes ~2M orders/sec peak |
| End-to-end latency (P99) | <10 µs (colocation); <1 ms (internet) | NYSE Pillar: ~2 µs median |
| Market data fan-out | 10K–100K subscribers | Multicast UDP at <1 µs per symbol update |
| Availability | 99.999% (5.25 min downtime/year) | During market hours only |
| Durability | Zero order loss, zero duplicate fills | Idempotent sequencing mandatory |
| Settlement finality | T+1 (US equities as of 2024) | T+2 for most EU markets |
2. High-Level Architecture Overview
A production exchange is composed of six logically separate subsystems, each with distinct latency, throughput, and consistency requirements. Understanding the data flow between them is the key to getting this design right in an interview.
Order Lifecycle — Critical Path
Client → FIX Gateway (pre-trade risk) → Sequencer (global timestamp + sequence number) → Matching Engine (order book + trade execution) → Drop Copy (execution reports to broker) + Market Data Publisher (book updates) + Clearing Gateway (trade confirmation). The critical path from order receipt to execution acknowledgement must never cross a network hop — everything runs in a single NUMA node with kernel-bypass networking.
Why Single-Threaded Matching?
Counter-intuitively, the matching engine is single-threaded. Multi-threaded matching requires locks on the order book — and lock contention at microsecond timescales dominates execution time. By partitioning symbols across single-threaded matching engines (one engine owns a set of symbols), you achieve linear scalability with zero locking overhead. The LMAX Disruptor pattern is the canonical implementation: a pre-allocated ring buffer replaces all queue synchronization with memory barriers, achieving 25 million events/sec on commodity hardware.
- No GC pauses: The engine is written in Java/C++ with object pooling. No heap allocation on the hot path — all order objects are pre-allocated from a memory pool.
- CPU affinity: The matching thread is pinned to a dedicated CPU core, isolated from the OS scheduler using isolcpus and taskset.
- Busy-wait spin loop: The engine never sleeps — it spins on the ring buffer's sequence barrier, trading CPU burn for guaranteed sub-microsecond response to new orders.
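The spin-wait idea can be sketched in a few lines of Java. This is a simplified stand-in for a Disruptor sequence barrier, not the real API: SpinBarrier and its methods are illustrative names, and a production implementation adds cache-line padding to avoid false sharing.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a busy-wait consumer spinning on a published sequence.
// The real Disruptor adds cache-line padding and batched sequence claims.
class SpinBarrier {
    private final AtomicLong published = new AtomicLong(-1);

    // Producer side: make sequence `seq` visible to consumers.
    void publish(long seq) { published.set(seq); }

    // Consumer side: spin (never sleep) until `seq` has been published.
    long waitFor(long seq) {
        long available;
        while ((available = published.get()) < seq) {
            Thread.onSpinWait(); // CPU hint (Java 9+); cheaper than yielding to the scheduler
        }
        return available;
    }
}
```

The cost is one core pegged at 100% per spinning consumer, which is exactly the trade the bullet above describes.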
3. Order Gateway & FIX Protocol
The order gateway is the exchange's front door. Every order from every broker passes through here. It must handle thousands of concurrent TCP sessions, validate orders against pre-trade risk rules, and serialize accepted orders into the sequencer — all in microseconds.
FIX Protocol Deep Dive
FIX (Financial Information eXchange) is the industry-standard messaging protocol for pre-trade and trade communication. The wire format uses tag=value pairs separated by SOH (ASCII 0x01). Modern exchanges support FIX 4.2/4.4 and the binary FAST (FIX Adapted for STreaming) encoding for latency-critical paths:
// FIX 4.4 New Order Single (tag 35=D) — wire format (SOH shown as |)
8=FIX.4.4|9=176|35=D|49=BROKER1|56=EXCHANGE|34=1001|52=20260407-09:30:00.123|
11=ORD-20260407-00001|55=AAPL|54=1|38=100|40=2|44=185.50|59=0|10=042|
// Tag reference:
// 35=D → MsgType: New Order Single
// 49 → SenderCompID (broker)
// 56 → TargetCompID (exchange)
// 11 → ClOrdID (client order ID — must be unique per session per day)
// 55 → Symbol
// 54=1 → Side: Buy (2=Sell)
// 38 → OrderQty
// 40=2 → OrdType: Limit (1=Market, 3=Stop, 4=StopLimit)
// 44 → Price (only for Limit orders)
// 59=0 → TimeInForce: Day (1=GTC, 3=IOC, 4=FOK)
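As an illustration of the wire format above, a naive tag=value parser might look like this. FixParser is a hypothetical sketch: production gateways use zero-allocation binary parsers and validate BodyLength (9) and the checksum (10), both of which this skips.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: split a FIX message on SOH (0x01) into a tag -> value map.
class FixParser {
    static final char SOH = '\u0001';

    static Map<Integer, String> parse(String msg) {
        Map<Integer, String> fields = new HashMap<>();
        for (String pair : msg.split(String.valueOf(SOH))) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(Integer.parseInt(pair.substring(0, eq)),
                           pair.substring(eq + 1));
            }
        }
        return fields;
    }
}
```

Note that real FIX messages also contain repeating groups (e.g. parties, legs), which a flat tag-to-value map cannot represent.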
Pre-Trade Risk Checks
Pre-trade risk is the gateway's most critical function — it stops bad orders before they enter the matching engine. These checks must complete in <5 µs:
- Position limit check: Reject if filling this order would take the broker's net position in a symbol beyond its approved limit (pre-loaded from the risk database at session start).
- Credit/margin check: For buy orders — available buying power ≥ order value. For short sells — check short sell approval and locate.
- Fat finger check: Reject if price deviates more than X% from the reference price (last trade or NBBO), or if order quantity exceeds per-order maximum.
- Duplicate order check: Reject if ClOrdID has already been seen this session (idempotency guard maintained in a lock-free hash map).
- Symbol status check: Reject if symbol is halted, in auction phase, or outside trading hours.
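Two of these checks, the fat-finger collar and the ClOrdID duplicate guard, can be sketched as follows. PreTradeRisk is an illustrative class with assumed thresholds; a production gateway would use a lock-free map and per-symbol reference prices pre-loaded at session start.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: gateway-side fat-finger and duplicate-order checks.
class PreTradeRisk {
    private final Set<String> seenClOrdIds = new HashSet<>(); // per-session idempotency guard
    private final double maxDeviationPct;

    PreTradeRisk(double maxDeviationPct) { this.maxDeviationPct = maxDeviationPct; }

    // Fat-finger: reject if price deviates too far from the reference price (both in ticks).
    boolean priceOk(long priceTicks, long referenceTicks) {
        double devPct = Math.abs(priceTicks - referenceTicks) * 100.0 / referenceTicks;
        return devPct <= maxDeviationPct;
    }

    // Duplicate guard: first sighting of a ClOrdID passes; repeats are rejected.
    boolean firstUse(String clOrdId) { return seenClOrdIds.add(clOrdId); }
}
```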
Sequencer — The Global Order Arbiter
After passing pre-trade risk, each order is handed to the sequencer — the most critical single point of truth in the entire exchange. The sequencer assigns a globally monotonically increasing sequence number and a nanosecond-precision timestamp to every order, cancellation, and modification. This sequence number defines the canonical order in which the matching engine processes events. The sequencer is a single process — it must be replicated for HA, but only one instance is primary at any time. It publishes to a replicated log (similar to Kafka but with hardware-level ordering guarantees) that the matching engine reads from.
Colocation Services
Tier-1 exchanges offer colocation: high-frequency trading firms physically place their servers in the exchange's data center. Colocation eliminates wide-area fiber latency (roughly 5 µs per km in fiber, so even a 20 km run adds ~200 µs round-trip), giving colocated participants a structural speed advantage. This raises fairness concerns, which exchanges address through:
- Equidistant cable runs: All colocated racks are connected with equal-length fiber from the sequencer — ensuring every participant's network RTT is identical.
- Randomized queue order within the same timestamp: When two orders arrive within the same nanosecond clock tick, a random tie-breaker prevents deterministic front-running.
4. Order Book Design
The order book is the core data structure of any exchange. For each tradable symbol, the exchange maintains a separate order book that tracks all outstanding limit orders on both sides (buy = bids, sell = asks), sorted by price-time priority.
Red-Black Tree Implementation
The canonical order book data structure uses a red-black tree keyed by price level, where each price level node contains a doubly linked FIFO queue of orders. This gives O(log P) operations for adding/removing price levels (where P is the number of distinct price levels) and O(1) for inserting/canceling orders within a known price level:
// Conceptual Java order book structure
class PriceLevel {
final long price; // in ticks (integer — never float!)
long totalQuantity; // sum of all orders at this level
final Deque<Order> orders; // FIFO — price-time priority
}
class OrderBook {
// Bids: highest price first → TreeMap in DESCENDING order
final TreeMap<Long, PriceLevel> bids = new TreeMap<>(Comparator.reverseOrder());
// Asks: lowest price first → TreeMap in ASCENDING order
final TreeMap<Long, PriceLevel> asks = new TreeMap<>();
// O(1) cancel: direct lookup by order ID
final HashMap<String, Order> orderIndex = new HashMap<>();
PriceLevel bestBid() { return bids.isEmpty() ? null : bids.firstEntry().getValue(); }
PriceLevel bestAsk() { return asks.isEmpty() ? null : asks.firstEntry().getValue(); }
long spread() { return bestAsk().price - bestBid().price; } // in ticks; valid only when both sides are non-empty
}
Order Types and Book Mechanics
| Order Type | Book Behavior | Matching Rule |
|---|---|---|
| Limit | Rests in book if not immediately matchable | Match at limit price or better |
| Market | Never rests — execute immediately or cancel | Walk the book, consume all available liquidity |
| Stop | Held in stop book (separate data structure) | Triggered when last trade crosses stop price |
| Iceberg | Only peak quantity visible; reserve hidden | Refills peak from reserve; loses time priority on refill |
| IOC (Immediate-or-Cancel) | Never rests in book | Fill what's available, cancel remainder |
| FOK (Fill-or-Kill) | Never rests in book | Check if full quantity available; fill all or cancel all atomically |
Price Representation — Never Use Floats
A critical implementation detail: prices are always represented as integers (ticks), never floating-point numbers. For a stock with a minimum tick size of $0.01, the price $185.50 is stored as the integer 18550. This eliminates floating-point rounding errors that could cause incorrect price comparisons — a bug that could result in regulatory fines and reputational damage. The tick size is an instrument-level configuration parameter.
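A sketch of the conversion at the gateway boundary, using BigDecimal so the decimal price string never touches binary floating point. TickConverter is an illustrative name; the tick size comes from per-instrument configuration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Sketch: convert decimal price strings to integer ticks at the gateway boundary.
// All downstream comparisons and arithmetic then use exact integer math.
class TickConverter {
    // e.g. priceToTicks("185.50", "0.01") == 18550
    static long priceToTicks(String price, String tickSize) {
        BigDecimal p = new BigDecimal(price);   // exact decimal, no binary rounding
        BigDecimal t = new BigDecimal(tickSize);
        // RoundingMode.UNNECESSARY makes a non-multiple of the tick size throw
        // instead of silently rounding — such orders should be rejected.
        return p.divide(t).setScale(0, RoundingMode.UNNECESSARY).longValueExact();
    }
}
```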
5. Matching Engine & LMAX Disruptor
The matching engine is the heart of the exchange — the single component where every trade is born. Its design must satisfy two contradictory-seeming requirements: strict determinism (same inputs always produce same outputs, enabling replay) and extreme throughput (millions of events per second).
LMAX Disruptor Architecture
The LMAX Disruptor is a lock-free inter-thread communication library developed by LMAX Exchange for their retail FX trading platform. The key insight: replace a blocking queue with a pre-allocated ring buffer where producers and consumers coordinate via memory barriers and CAS operations rather than locks:
// LMAX Disruptor pipeline — order processing stages
//
// [FIX Gateway] ──publish──▶ [RingBuffer<OrderEvent>] ──consume──▶
//
// Stage 1: Journal Handler (write to WAL — must complete before Stage 2)
// Stage 2: Matching Handler (single thread — owns all order books)
// Stage 3 (parallel):
// ├── Market Data Handler (publish book updates)
// ├── Drop Copy Handler (execution reports to brokers)
// └── Clearing Handler (forward fills to CCP)
// Ring buffer is a power-of-2 sized array — index wraps with bitwise AND
// All slots pre-allocated at startup: NO heap allocation on hot path
int RING_BUFFER_SIZE = 1 << 20; // 1,048,576 slots (~16MB at 16B/event)
// Producer: claims next slot, writes event, publishes sequence
long seq = ringBuffer.next();
OrderEvent event = ringBuffer.get(seq);
event.populate(orderId, symbol, side, price, quantity, type);
ringBuffer.publish(seq); // single store-release fence — no lock
Price-Time Priority Matching Algorithm
The matching loop for a new incoming order is surprisingly straightforward once the order book is correctly structured:
// Pseudocode: match a new buy limit order against the ask side
List<Fill> match(Order incomingBuy, OrderBook book) {
List<Fill> fills = new ArrayList<>();
long remainingQty = incomingBuy.quantity;
// Walk ask side from best (lowest) price upward
while (remainingQty > 0 && !book.asks.isEmpty()) {
PriceLevel bestAsk = book.bestAsk();
// Price check: buy limit price must be ≥ ask price
if (incomingBuy.price < bestAsk.price) break; // no match possible
// Walk FIFO queue at this price level (time priority)
while (remainingQty > 0 && !bestAsk.orders.isEmpty()) {
Order restingOrder = bestAsk.orders.peekFirst();
long fillQty = Math.min(remainingQty, restingOrder.quantity);
long fillPrice = restingOrder.price; // resting order sets the price
fills.add(new Fill(incomingBuy.id, restingOrder.id, fillPrice, fillQty));
remainingQty -= fillQty;
restingOrder.quantity -= fillQty;
if (restingOrder.quantity == 0) {
bestAsk.orders.pollFirst(); // fully filled — remove from queue
book.orderIndex.remove(restingOrder.id);
}
}
if (bestAsk.orders.isEmpty()) book.asks.pollFirstEntry(); // empty level
}
// If unfilled remainder and order is not IOC/FOK → rest in book
if (remainingQty > 0 && incomingBuy.type == LIMIT) {
book.addBid(incomingBuy.withQuantity(remainingQty));
}
return fills;
}
Opening Auction Call Algorithm
The opening auction (call auction) runs at market open to establish a reference price from accumulated orders. The algorithm finds the uncrossing price — the price that maximizes executable volume:
- Aggregate cumulative demand (sum of buy orders at each price and above) and supply (sum of sell orders at each price and below).
- Find the uncrossing price: the price that maximizes executable volume, i.e. min(cumulative demand, cumulative supply). If multiple prices tie on volume, choose the one that minimizes the residual order imbalance; if still tied, choose the price closest to yesterday's close.
- At the uncrossing price, match all orders that can execute. Remaining unmatched orders enter the continuous book.
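The uncrossing computation can be sketched as follows. This is a simplification: limit orders only, no market orders, and the imbalance and closest-to-previous-close tie-breaks are omitted (the first price found with maximum volume wins). CallAuction is an illustrative name.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: find the auction uncrossing price that maximizes executable volume.
// buys/sells map limit price (ticks) -> total quantity at that price.
class CallAuction {
    static long uncrossingPrice(TreeMap<Long, Long> buys, TreeMap<Long, Long> sells) {
        long bestPrice = -1, bestVolume = -1;
        // Candidate prices: every limit price present on either side.
        TreeMap<Long, Long> candidates = new TreeMap<>();
        candidates.putAll(buys);
        candidates.putAll(sells);
        for (long p : candidates.keySet()) {
            long demand = 0, supply = 0;
            for (Map.Entry<Long, Long> b : buys.entrySet())
                if (b.getKey() >= p) demand += b.getValue();  // buyers willing to pay >= p
            for (Map.Entry<Long, Long> s : sells.entrySet())
                if (s.getKey() <= p) supply += s.getValue();  // sellers willing to accept <= p
            long executable = Math.min(demand, supply);
            if (executable > bestVolume) { bestVolume = executable; bestPrice = p; }
        }
        return bestPrice;
    }
}
```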
6. Market Data Feed
Market data dissemination is a massive fan-out problem: every trade execution and order book change must be broadcast to thousands of subscribers in real time with zero data loss and guaranteed ordering. The architecture is fundamentally different from a typical pub/sub system — unicast TCP is unsuitable at this scale.
Level 1 vs Level 2 Data
- Level 1 (Top-of-book): Best bid price & size, best ask price & size, last trade price, volume, and VWAP. Broadcast on every book update. ~50–200 bytes per message. This is the NBBO (National Best Bid and Offer) data that retail brokers must use for best execution.
- Level 2 (Full depth): All price levels on both sides of the book, typically top 5–20 levels. Substantially larger messages (~2–10 KB per symbol update). Used by market makers and algorithmic traders for queue position analysis.
- Level 3 / Order-by-order: Every individual order event (new, cancel, modify, fill). Used by HFT firms for microstructure analysis. Highest bandwidth, often only available on premium feeds.
Multicast UDP Feed Architecture
The exchange publishes market data over UDP multicast — one packet reaches all subscribers simultaneously, making fan-out O(1) regardless of subscriber count. Every message carries a monotonically increasing sequence number enabling gap detection:
// Market data message structure (binary, network byte order)
struct MarketDataMessage {
uint32_t seqNum; // monotonically increasing per channel
uint64_t timestamp; // nanoseconds since epoch
uint16_t msgType; // QUOTE_UPDATE=1, TRADE=2, ORDER_ADD=3, CANCEL=4
char symbol[8]; // null-padded
int64_t bidPrice; // in ticks
uint32_t bidSize;
int64_t askPrice; // in ticks
uint32_t askSize;
int64_t lastPrice; // last trade price in ticks
uint32_t lastSize; // last trade volume
uint64_t volume; // cumulative daily volume
}; // Total: 66 bytes packed — fits in a single UDP packet (MTU 1500)
// Gap detection on subscriber side:
if (msg.seqNum != expectedSeqNum) {
// GAP DETECTED: request retransmission via unicast TCP recovery channel
requestRetransmit(expectedSeqNum, msg.seqNum - 1);
}
Snapshot + Delta Architecture
A new subscriber cannot reconstruct the current book state by joining a live delta feed mid-session. The exchange solves this with a snapshot service: periodically (every few seconds) a full order book snapshot is published on a separate multicast group. Subscribers start by consuming the snapshot, then switch to the live delta feed starting from the snapshot's sequence number, replaying any buffered deltas that arrived after the snapshot was taken. This is identical to the read-model bootstrap pattern in CQRS/Event Sourcing.
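The bootstrap sequence can be sketched with the book state reduced to a single number for illustration. FeedBootstrap and Delta are hypothetical names; a real subscriber reconstructs full per-symbol order books.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: subscriber bootstrap — buffer live deltas, apply a snapshot,
// then replay only the buffered deltas newer than the snapshot's sequence.
class FeedBootstrap {
    static class Delta {
        final long seq; final long change;
        Delta(long seq, long change) { this.seq = seq; this.change = change; }
    }

    private final List<Delta> buffered = new ArrayList<>();
    private long state;       // stand-in for reconstructed book state
    private long lastSeq = -1;

    // Phase 1: buffer live deltas while waiting for the snapshot.
    void onLiveDelta(Delta d) { buffered.add(d); }

    // Phase 2: apply snapshot, then replay buffered deltas with seq > snapshotSeq.
    void onSnapshot(long snapshotSeq, long snapshotState) {
        state = snapshotState;
        lastSeq = snapshotSeq;
        for (Delta d : buffered)
            if (d.seq > snapshotSeq) { state += d.change; lastSeq = d.seq; }
        buffered.clear();
    }

    long state()   { return state; }
    long lastSeq() { return lastSeq; }
}
```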
7. Trade Settlement & Clearing
Trade execution and trade settlement are separated by days — a critical design feature of financial markets. Execution is real-time; settlement (the actual transfer of securities and cash) happens on T+1 (US equities) or T+2 (EU). This time gap introduces counterparty credit risk, which the Central Clearing Counterparty (CCP) is designed to eliminate.
Central Clearing Counterparty (CCP) and Novation
The moment a trade executes on the exchange, the CCP (e.g., DTCC in the US, LCH Clearnet in Europe) steps in as the legal counterparty to both sides through a process called novation: the original contract between Buyer A and Seller B is replaced by two new contracts — one between Buyer A and the CCP, and one between the CCP and Seller B. This means:
- No participant is exposed to the default of their trading counterparty — only to the default of the CCP, which is backstopped by the default fund contributed by all clearing members.
- The CCP can net positions across thousands of trades per participant, reducing settlement obligations by 90%+ through multilateral netting.
- Margin (initial margin + variation margin) is collected from each clearing member daily to cover potential losses in the event of member default.
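A toy sketch of multilateral netting for a single instrument (MultilateralNetting is an illustrative class; a real CCP nets per instrument, per settlement date, and in cash as well). Note how a member whose buys and sells offset ends up with nothing to settle:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: CCP multilateral netting for one instrument.
// Each trade adds +qty to the buyer's net position and -qty to the seller's;
// after novation, every member settles only its net against the CCP.
class MultilateralNetting {
    private final Map<String, Long> netPosition = new HashMap<>();

    void onTrade(String buyer, String seller, long qty) {
        netPosition.merge(buyer, qty, Long::sum);
        netPosition.merge(seller, -qty, Long::sum);
    }

    long net(String member) { return netPosition.getOrDefault(member, 0L); }
}
```

In the usage below, member B trades 200 shares gross but settles zero net: that collapse of gross obligations is where the large netting efficiencies come from.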
Settlement Flow
| Day | Activity | System |
|---|---|---|
| T+0 (Trade Day) | Trade executes; CCP steps in via novation; margin call calculated | Exchange → CCP Gateway |
| T+0 EOD | Multilateral netting; net settlement obligations calculated per member | CCP Netting Engine |
| T+1 Morning | Variation margin payments; failed trade handling; CNS (Continuous Net Settlement) | DTCC / CSD |
| T+1 EOD | Securities and cash transferred via Central Securities Depository (CSD) | DTC / Fedwire |
Position Ledger Design
The clearing system maintains a position ledger — a real-time record of every participant's net position in every instrument. The ledger must support:
- Atomic double-entry accounting: Every fill simultaneously increases the buyer's position and decreases the seller's position; cash flows in the opposite direction. All updates use database transactions with SERIALIZABLE isolation.
- Intraday mark-to-market: Positions are revalued against current market prices to calculate unrealized P&L and margin requirements continuously throughout the day.
- Audit trail: Every position change is immutably recorded with trade ID, timestamp, and counterparty — regulatory requirement under MiFID II, Dodd-Frank.
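The double-entry rule for a single fill can be sketched in memory as follows. PositionLedger is an illustrative stand-in for the transactional ledger; cash is kept in ticks × shares so everything stays in integer math.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: double-entry position ledger. Each fill moves securities one way
// and cash the other; the totals across all members always sum to zero.
class PositionLedger {
    private final Map<String, Long> shares = new HashMap<>();
    private final Map<String, Long> cash = new HashMap<>();

    // Apply one fill atomically (a real ledger wraps this in a SERIALIZABLE txn).
    void applyFill(String buyer, String seller, long qty, long priceTicks) {
        long cashValue = qty * priceTicks;
        shares.merge(buyer, qty, Long::sum);
        shares.merge(seller, -qty, Long::sum);
        cash.merge(buyer, -cashValue, Long::sum);
        cash.merge(seller, cashValue, Long::sum);
    }

    long shares(String member) { return shares.getOrDefault(member, 0L); }
    long cash(String member)   { return cash.getOrDefault(member, 0L); }
}
```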
8. Risk Engine & Circuit Breakers
The risk engine is the exchange's immune system — it prevents runaway algorithms, cascading liquidations, and market-wide crashes from propagating. Risk operates at two timescales: microsecond pre-trade checks in the gateway, and millisecond post-trade surveillance across the full market.
Position and Value-at-Risk Limits
- Gross position limit: Maximum absolute notional value in a single symbol per participant. Enforced in the gateway risk check cache (updated from position ledger every 100ms).
- Net exposure limit: Maximum net long or short position across correlated symbols (e.g., index futures vs. underlying stocks).
- VaR (Value at Risk): Statistical measure of maximum expected loss at 99% confidence over 1-day horizon. Calculated nightly; used to set margin requirements.
- Order rate limit: Maximum orders per second per session. Prevents runaway order-flooding attacks. Implemented as a token bucket per SenderCompID (FIX tag 49).
- Kill switch: A single-button (or API call) that immediately cancels ALL outstanding orders for a participant and blocks any new orders. Must execute in <100ms for all symbols. Triggered automatically when position limit breach is detected, or manually by the exchange's market surveillance team.
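The per-session order rate limit is a textbook token bucket. Below is a sketch with an injectable clock so refill behavior is deterministic; TokenBucket and its parameters are illustrative.

```java
// Sketch: token-bucket order rate limiter, one bucket per FIX session.
// The clock (nanos) is passed in so refill logic is deterministic and testable.
class TokenBucket {
    private final long capacity;       // burst size
    private final long refillPerSec;   // sustained orders/sec
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, long refillPerSec, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = nowNanos;
    }

    // Returns true if the order is admitted, false if the rate limit is breached.
    boolean tryAcquire(long nowNanos) {
        double elapsedSec = (nowNanos - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}
```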
Market Circuit Breakers
Circuit breakers halt trading when prices move too far, too fast — protecting against flash crashes like the 2010 US equity flash crash (Dow -1000 points in minutes) and the 2020 S&P 500 halts:
| Trigger | Action | Duration |
|---|---|---|
| S&P 500 drops 7% from prior close | Level 1 halt — all US equity markets pause | 15 minutes |
| S&P 500 drops 13% | Level 2 halt | 15 minutes |
| S&P 500 drops 20% | Level 3 halt — no resumption until next day | Rest of trading day |
| Individual stock moves ±5% in 5 min (LULD) | Single-stock trading pause (Limit-Up/Limit-Down band) | 5 minutes or until price returns to band |
Circuit Breaker Implementation
Circuit breakers are implemented as a continuously evaluated rule engine inside the matching engine itself. After every fill, the engine checks:
// LULD (Limit Up / Limit Down) band check after each fill
void checkLULD(String symbol, long fillPrice, long referencePrice) {
double pctChange = (double)(fillPrice - referencePrice) / referencePrice * 100;
double bandPct = isLargeCapStock(symbol) ? 5.0 : 10.0; // LULD band width
if (Math.abs(pctChange) >= bandPct) {
// Transition symbol to LULD auction mode
symbolState.put(symbol, LULD_PAUSE);
// Broadcast halt message to all market data subscribers
marketDataPublisher.publishHalt(symbol, LULD_PAUSE, System.nanoTime());
// Schedule resumption check in 5 minutes
scheduler.schedule(() -> checkLULDResumption(symbol), 5, TimeUnit.MINUTES);
}
}
9. Persistence & Event Sourcing
The exchange's state — every order, every cancellation, every fill — must be durably persisted without introducing latency into the matching engine's critical path. The solution is a combination of write-ahead logging, event sourcing, and asynchronous persistence.
Write-Ahead Log (WAL)
Every event entering the sequencer is synchronously written to a Write-Ahead Log on persistent storage before being published to the matching engine. This is a sequential append-only file on an NVMe SSD — sequential writes are extremely fast (<10 µs per event) because they avoid random I/O. The WAL is the source of truth for crash recovery: on restart, the matching engine replays the WAL from the last checkpoint to rebuild in-memory order book state.
Event Store Architecture
The exchange's event store is a pure event sourcing system — the current state of every order book is derived by replaying the stream of events from the beginning of the trading day (or from the last checkpoint). This gives several powerful properties:
- Deterministic replay: Given the same event stream, the matching engine always produces the same output. This is essential for debugging disputes, regulatory audits, and disaster recovery.
- Time-travel debugging: Replay the event stream to any point in time to reconstruct the exact order book state at that moment — invaluable for post-trade analysis.
- Temporal decoupling: Secondary systems (surveillance, analytics, reporting) process the same event stream asynchronously without impacting the matching engine's latency.
- Checkpointing: Every N minutes, a full snapshot of all order books is written to disk. On recovery, replay starts from the last snapshot + subsequent WAL events, not from market open, dramatically reducing recovery time.
// Event types in the exchange event store
enum EventType {
ORDER_SUBMITTED, // new order arrived at gateway
ORDER_ACCEPTED, // passed pre-trade risk, entered sequencer
ORDER_REJECTED, // failed risk check
ORDER_PARTIALLY_FILLED, // partial execution
ORDER_FULLY_FILLED,
ORDER_CANCELLED,
ORDER_MODIFIED, // price or quantity change
TRADE_EXECUTED, // complete fill record (both sides)
SYMBOL_HALTED, // circuit breaker or regulatory halt
SYMBOL_RESUMED,
MARKET_OPEN, // phase transition
MARKET_CLOSE,
CHECKPOINT // order book snapshot marker
}
// Each event is immutable, signed (HMAC-SHA256) for audit integrity,
// and replicated to at least 3 nodes before being ACKed to the producer.
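Recovery from the last checkpoint plus WAL replay can be sketched with the book state reduced to a running volume counter. Recovery and Event are illustrative names; a real engine rebuilds full order books from a snapshot plus subsequent events.

```java
import java.util.List;

// Sketch: crash recovery = load last checkpoint, then replay WAL events
// whose sequence numbers come after the checkpoint. Deterministic by construction.
class Recovery {
    static class Event {
        final long seq; final long qty;
        Event(long seq, long qty) { this.seq = seq; this.qty = qty; }
    }

    // checkpointSeq/checkpointVolume stand in for a full order-book snapshot.
    static long recoverVolume(long checkpointSeq, long checkpointVolume, List<Event> wal) {
        long volume = checkpointVolume;
        for (Event e : wal)
            if (e.seq > checkpointSeq)  // events at or before the checkpoint are already included
                volume += e.qty;
        return volume;
    }
}
```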
10. Capacity Estimation
Capacity planning for a stock exchange is critical — under-provisioning during a market volatility spike can cause outages at the worst possible moment. Let's work through the numbers for a mid-size exchange (comparable to CBOE or Nasdaq).
Order Flow Estimates
| Parameter | Estimate | Assumption |
|---|---|---|
| Listed symbols | 5,000 equities | Mid-size exchange |
| Peak order rate | 1,000,000 orders/sec | 10× average; volatility events |
| Average order size | 200 bytes | FIX message with all fields |
| Gateway inbound bandwidth | 200 MB/s = 1.6 Gbps | 1M × 200 bytes |
| Market data outbound (L1) | 5,000 symbols × 100 updates/sec × 66 bytes ≈ 33 MB/s | L1 per-symbol update rate |
| Market data outbound (L2) | 5,000 × 100 × 2 KB = 1 GB/s | Full book; multicast to all subscribers |
| Daily trade storage | ~100M events × 200 bytes = 20 GB/day | 10% of orders result in trades |
Symbol Partitioning Strategy
With 5,000 symbols and 1M orders/sec, a single matching engine cannot handle the full load. Partition symbols across multiple matching engine instances:
- Static partitioning: Assign symbols to engines based on a consistent hash of the symbol ticker. Simple but requires manual rebalancing when adding engines.
- Volume-weighted partitioning: Route high-volume symbols (AAPL, TSLA, SPY) to dedicated single-symbol engines; bin low-volume symbols into shared engines. Rebalanced daily based on prior day's volume.
- Cross-symbol orders: Spread orders (buy symbol A + sell symbol B atomically) require a two-phase commit across two engines — use a coordinator with rollback support.
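Static partitioning can be sketched as a stable hash of the ticker. SymbolPartitioner is illustrative; a production router would use a consistent-hash ring or an explicit symbol-to-engine table to support rebalancing without moving every symbol.

```java
// Sketch: route every order for a symbol to the same matching-engine partition.
// Math.floorMod keeps the index non-negative even for negative hashCodes.
class SymbolPartitioner {
    private final int engineCount;

    SymbolPartitioner(int engineCount) { this.engineCount = engineCount; }

    int engineFor(String symbol) {
        return Math.floorMod(symbol.hashCode(), engineCount);
    }
}
```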
11. High Availability & Disaster Recovery
A stock exchange going down during market hours is a regulatory incident — regulators may impose fines, and brokers can sue for losses. The HA architecture must balance two competing pressures: achieving sub-microsecond latency (which argues against distributed consensus) and tolerating hardware failures without data loss (which normally requires distributed consensus).
Active-Passive Failover with WAL Replication
The standard pattern for exchange HA is active-passive failover: one primary matching engine processes all orders; one or more standby engines receive a real-time replica of the WAL over a dedicated high-speed link (InfiniBand or 100GbE). The standby applies WAL entries to its own in-memory order books, maintaining a replica state that is at most a few microseconds behind the primary.
- Failover trigger: The primary sends heartbeats every 1ms to a watchdog process. If three consecutive heartbeats are missed (>3ms gap), the watchdog promotes the standby to primary and re-routes the sequencer's output.
- Failover time: Target <50ms end-to-end (including TCP session re-establishment for gateway sessions). During the failover window, new orders are buffered in the gateway; resting orders are preserved in the standby's in-memory book.
- Split-brain prevention: Only one node can hold the "leader lease" at any time. The lease is granted by a quorum of 3 ZooKeeper nodes and expires if not renewed within 10ms — preventing both nodes from acting as primary simultaneously.
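The missed-heartbeat detection can be sketched as follows. FailoverWatchdog is illustrative; the lease acquisition and sequencer re-routing performed on promotion are omitted.

```java
// Sketch: watchdog promotes the standby after a >3ms heartbeat gap
// (three consecutive missed 1ms heartbeats, per the text above).
class FailoverWatchdog {
    private static final long GAP_THRESHOLD_NANOS = 3_000_000L; // 3 ms
    private long lastHeartbeatNanos;
    private boolean standbyPromoted = false;

    FailoverWatchdog(long nowNanos) { this.lastHeartbeatNanos = nowNanos; }

    void onHeartbeat(long nowNanos) { lastHeartbeatNanos = nowNanos; }

    // Called on a timer: promote the standby if the primary has gone silent.
    boolean check(long nowNanos) {
        if (!standbyPromoted && nowNanos - lastHeartbeatNanos > GAP_THRESHOLD_NANOS) {
            standbyPromoted = true; // in production: acquire leader lease, re-route sequencer
        }
        return standbyPromoted;
    }
}
```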
Disaster Recovery: Geographic Redundancy
For data center-level failures (power, cooling, fiber cut), the exchange maintains a full DR site in a geographically separate location. This introduces a fundamental latency tradeoff: light in fiber covers roughly 1 km per 5 µs, so a 100 µs round-trip budget caps synchronous replication at about 10 km; beyond that, replication must be asynchronous, accepting a small RPO (Recovery Point Objective):
- RPO (Recovery Point Objective): Maximum acceptable data loss. For exchanges, this is typically 0 orders — every accepted order must be in the DR site before an ACK is sent to the broker. Achieved via synchronous WAL replication to a near-site DR node (<10km away).
- RTO (Recovery Time Objective): Maximum acceptable downtime. Tier-1 exchanges target <15 minutes for a data center failover. This requires a warm standby DR site with in-memory order books already populated.
- Regulatory requirement: SEC Rule 17a-1 and CFTC Regulation 1.31 require records to be retained for at least 5 years, and SEC Regulation SCI requires exchanges to maintain documented business continuity/DR plans and test them at least annually.
Kernel-Bypass Networking for Latency
At microsecond timescales, the Linux kernel's network stack (~100µs syscall overhead) is unacceptably slow. Exchanges use kernel-bypass networking to achieve 1–2µs NIC-to-application latency:
- DPDK (Data Plane Development Kit): Application reads packets directly from NIC memory via DMA, bypassing the kernel's TCP/IP stack. Used for the FIX gateway and market data publisher.
- RDMA (Remote Direct Memory Access): For WAL replication between primary and standby, RDMA writes directly to the remote machine's memory without involving either CPU, achieving ~1µs replication latency.
- Solarflare / Xilinx SmartNIC: Application runs directly on NIC FPGA firmware — the matching engine's hot path never enters the host CPU. Used by the most latency-sensitive HFT firms.
12. System Design Interview Checklist
When asked to "design a stock exchange" in a system design interview, structure your answer around these decision points. Interviewers at FAANG-tier and quantitative finance firms will probe each area.
Requirements & Scope
- ✅ Define order types (limit, market, stop, IOC, FOK) before designing the book
- ✅ State throughput target (orders/sec) and latency SLA (µs vs ms) early
- ✅ Clarify which phases to support (continuous trading? auctions?)
- ✅ Ask about the number of symbols — this determines partitioning strategy
Data Structures
- ✅ Use red-black tree (Java TreeMap) keyed by price with FIFO queues per level
- ✅ Maintain a hash map for O(1) order lookup by order ID (for cancellations)
- ✅ Always store prices as integers (ticks) — never floats
- ✅ Separate stop order book from the main limit order book
Concurrency & Performance
- ✅ Single-threaded matching engine per partition — no locks needed
- ✅ LMAX Disruptor ring buffer for inter-thread communication (not a blocking queue)
- ✅ Pre-allocate all objects at startup — no GC pressure on hot path
- ✅ CPU affinity + busy-wait spin loop for the matching thread
Durability & Consistency
- ✅ Sequencer assigns globally monotonic sequence numbers before matching
- ✅ WAL written synchronously before execution ACK sent to broker
- ✅ Event sourcing enables deterministic replay and time-travel debugging
- ✅ Idempotent order IDs (ClOrdID) prevent duplicate fills on retry
Scalability & Availability
- ✅ Partition symbols across matching engines; hot symbols get dedicated engines
- ✅ Active-passive failover with WAL replication; target <50ms failover time
- ✅ UDP multicast for market data fan-out — O(1) regardless of subscriber count
- ✅ Snapshot + delta architecture for new subscriber bootstrap
- ✅ Kernel-bypass networking (DPDK/RDMA) for µs-latency on critical path
Risk & Compliance
- ✅ Pre-trade: fat finger, position limits, duplicate order detection at gateway
- ✅ Post-trade: circuit breakers (LULD bands, market-wide halts) in matching engine
- ✅ Kill switch cancels all orders for a participant in <100ms
- ✅ CCP novation for counterparty risk elimination; multilateral netting for settlement efficiency
- ✅ Immutable audit log (HMAC-signed events) for regulatory compliance
Key Architecture Tradeoffs at a Glance
| Decision | Choice | Tradeoff |
|---|---|---|
| Matching concurrency | Single-threaded per partition | No lock overhead; limited by CPU single-core speed |
| Persistence | Sync WAL + async checkpoint | Durability without blocking matching loop |
| Market data delivery | UDP multicast | O(1) fan-out; requires gap detection + retransmit |
| HA mechanism | Active-passive with WAL streaming | <50ms failover; no Paxos overhead on hot path |
| Networking | DPDK / RDMA kernel-bypass | 1–2µs latency; complex ops and deployment |
| Order book structure | Red-black tree + FIFO queue per level | O(log P) level ops; O(1) order insert/cancel |