Designing a Distributed Key-Value Store: LSM Trees, Consistent Hashing & Quorum Replication
Key-value stores are among the simplest yet most widely deployed abstractions in distributed systems — Redis serves millions of operations per second from memory, DynamoDB backs much of Amazon's e-commerce traffic with a 99.999% availability SLA for global tables, and Cassandra stores petabytes of data across thousands of nodes. Building one from scratch requires mastering storage engines, partitioning, replication, and failure handling — all fundamental distributed systems concepts.
TL;DR — Core Architecture Decisions
A distributed key-value store needs: (1) a storage engine: an LSM Tree for write-heavy workloads (Cassandra, RocksDB) or a B-Tree for read-heavy workloads (PostgreSQL, MySQL InnoDB); (2) consistent hashing to partition keys across nodes with minimal reshuffling on membership changes; (3) leaderless replication with quorum writes (W) and reads (R), where W + R > N yields strong consistency; (4) Bloom filters to avoid expensive disk lookups for non-existent keys; and (5) a write-ahead log (WAL) for crash recovery.
Table of Contents
- Requirements & System Overview
- Storage Engine: LSM Tree vs B-Tree
- LSM Tree Internals: Memtable, SSTables & Compaction
- Partitioning with Consistent Hashing
- Replication: Quorum Reads & Writes
- Bloom Filters for Read Optimization
- Write-Ahead Log & Crash Recovery
- Gossip Protocol & Failure Detection
- Redis vs DynamoDB vs Cassandra: Design Comparisons
- TTL, Eviction & Data Lifecycle
- Capacity Estimation & Conclusion
1. Requirements & System Overview
Before designing any system, clarify requirements. For a distributed key-value store, the first decision is whether the store is primarily in-memory or disk-based, which drives every other design choice:
In-Memory (Redis)
- All data in RAM — sub-millisecond latency
- Data size bounded by memory (typically <1 TB)
- Optional persistence: RDB snapshots + AOF log
- Rich data types: String, Hash, List, Set, ZSet
- Use cases: caching, sessions, leaderboards, pub/sub
Disk-Based (DynamoDB / Cassandra)
- Petabyte-scale; data primarily on disk (SSD)
- Single-digit millisecond read latency
- LSM Tree storage engine (write-optimized)
- Horizontal scaling via consistent hashing
- Use cases: user profiles, IoT telemetry, product catalog
2. Storage Engine: LSM Tree vs B-Tree
The storage engine is the heart of any key-value store. The two dominant approaches are fundamentally different in their write path optimization strategy:
| Attribute | LSM Tree (Log-Structured Merge) | B-Tree |
|---|---|---|
| Write pattern | Sequential append (fast) | In-place update (slower, random I/O) |
| Read performance | Slower (may check multiple SSTables) | Faster (single tree traversal, O(log N)) |
| Write amplification | Higher overall (compaction rewrites each key multiple times) | Moderate (a full page written per update, plus WAL) |
| Space amplification | Higher (tombstones, old versions) | Lower (in-place updates) |
| SSD friendliness | High (sequential writes extend SSD life) | Moderate (random writes wear NAND cells) |
| Used by | RocksDB, LevelDB, Cassandra, HBase | PostgreSQL, MySQL InnoDB, SQLite |
Rule of thumb: Choose LSM Tree for write-heavy workloads (event logging, time-series, IoT); choose B-Tree for read-heavy workloads with complex queries (OLTP databases). Most modern distributed key-value stores (DynamoDB, Cassandra, RocksDB) use LSM Trees because distributed systems tend to be write-heavy and benefit from sequential I/O patterns.
3. LSM Tree Internals: Memtable, SSTables & Compaction
The LSM Tree has three key components:
Memtable (In-Memory Buffer)
All writes go first to the memtable — an in-memory sorted data structure (typically a red-black tree or skip list). Writes are O(log N) in memory, much faster than any disk write. The memtable is also protected by a write-ahead log (WAL) on disk — if the server crashes before the memtable is flushed, the WAL allows replay to reconstruct it.
SSTables (Sorted String Tables)
When the memtable reaches a size threshold (typically 64 MB), it is flushed to disk as an SSTable (Sorted String Table): an immutable, sorted file of key-value pairs. Each SSTable carries an index block (a sparse index, e.g., one key per KB of data) and a Bloom filter for fast existence checks.
```python
# LSM Tree write path
MEMTABLE_LIMIT = 64 * 1024 * 1024  # 64 MB flush threshold

def put(key, value):
    # 1. Append to WAL (sequential write, durable)
    wal.append(key, value)
    # 2. Insert into memtable (in-memory, sorted)
    memtable.put(key, value)
    # 3. If the memtable exceeds the threshold, flush it to an SSTable
    if memtable.size > MEMTABLE_LIMIT:
        sstable = memtable.flush_to_disk()  # sequential write
        sst_manager.add(sstable)
        memtable.clear()
        wal.clear()  # safe to clear only after the flush completes

# LSM Tree read path
def get(key):
    # 1. Check the memtable first (holds the most recent writes)
    if (val := memtable.get(key)) is not None:
        return val
    # 2. Check SSTables from newest to oldest
    for sstable in reversed(sst_manager.all()):
        if sstable.bloom_filter.might_contain(key):
            if (val := sstable.get(key)) is not None:  # disk seek
                return val
    return None
```
Compaction
Over time, many SSTables accumulate on disk. Reads become slow (every SSTable may need checking), and space is wasted on deleted or overwritten values. Compaction fixes this: it reads several SSTables and writes a single merged SSTable containing only the latest version of each key, dropping tombstones once they are no longer needed. Two common strategies:
- Size-tiered compaction (Cassandra default): Compact similarly-sized SSTables together. Write-optimized; reads can be slow with many SSTable levels. Good for write-heavy workloads.
- Leveled compaction (RocksDB/LevelDB default): Maintain fixed levels (L0, L1, L2...) where each level is 10× larger than the previous. Keys are sorted and non-overlapping within each level. Read performance is much better (only check one SSTable per level), but write amplification is higher.
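The merge step itself can be sketched in a few lines. This is a simplified model, not any real engine's code: each SSTable is represented as a key-sorted list of (key, (timestamp, value)) pairs, the input list is ordered oldest to newest, and a value of `None` stands for a tombstone.

```python
def compact(sstables):
    """Merge SSTables into one, keeping only the newest version of
    each key and dropping tombstones (value=None).

    Each SSTable is a list of (key, (timestamp, value)) pairs sorted
    by key; `sstables` is ordered oldest to newest.
    """
    merged = {}
    for run in sstables:
        for key, (ts, value) in run:
            # Higher timestamp wins; later runs overwrite earlier ones.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Emit a single sorted run, filtering out tombstones.
    return [(k, merged[k]) for k in sorted(merged) if merged[k][1] is not None]
```

A real compactor streams the inputs with a k-way merge instead of building a dict, but the version-resolution logic is the same.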
4. Partitioning with Consistent Hashing
A distributed KV store must partition data across nodes. Simple modulo hashing (hash(key) % numNodes) is unusable — adding or removing a node requires remapping almost all keys. Consistent hashing solves this.
The Hash Ring
- Map the hash space (0 to 2³²) onto a ring. Each node is placed on the ring at hash(nodeId).
- For a key, compute hash(key) and walk clockwise to the first node on the ring — that node stores the key.
- When a node is added, it takes over part of the key range from its clockwise successor; when a node is removed, its keys move to its clockwise successor. Either way, only ~1/N of the keys are remapped — vastly better than modulo hashing.
- Virtual nodes: each physical node is represented by V points on the ring (e.g., V=150). This keeps load uniform even with heterogeneous node sizes, and spreads the load spike from a failed node across many survivors instead of one.
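The ring can be sketched with a sorted list and binary search. `HashRing`, its `vnodes` parameter, and the MD5-based hash are illustrative choices for the sketch, not a reference to any particular implementation:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (teaching sketch)."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self._hashes = []   # sorted vnode positions on the ring
        self._nodes = []    # node id at each position

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def _rebuild(self, entries):
        entries.sort()
        self._hashes = [h for h, _ in entries]
        self._nodes = [n for _, n in entries]

    def add_node(self, node_id):
        entries = list(zip(self._hashes, self._nodes))
        entries += [(self._hash(f"{node_id}#{v}"), node_id)
                    for v in range(self.vnodes)]
        self._rebuild(entries)

    def remove_node(self, node_id):
        self._rebuild([(h, n) for h, n in zip(self._hashes, self._nodes)
                       if n != node_id])

    def get_node(self, key):
        # Walk clockwise: first vnode at or after hash(key), wrapping around.
        i = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[i]
```

The defining property: removing a node only remaps the keys that node owned; every other key keeps its owner.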
5. Replication: Quorum Reads & Writes
For fault tolerance, each key is replicated to N nodes (typically N=3). Cassandra and DynamoDB use leaderless replication — any node can accept reads or writes. Consistency is controlled by quorum parameters:
- N = replication factor (e.g., 3): number of replicas for each key
- W = write quorum: number of replicas that must acknowledge a write before it's considered successful
- R = read quorum: number of replicas that must respond to a read; the response with the highest timestamp wins
- Strong consistency: W + R > N ensures at least one replica overlaps between the write set and read set. E.g., N=3, W=2, R=2: any read must hit at least one node that received the write.
- Eventual consistency: W=1, R=1 — maximum throughput, lowest latency, but reads may return stale data until replication propagates.
```python
# Quorum write example (N=3, W=2)
def put(key, value, w=2):
    nodes = ring.get_n_nodes(key, n=3)  # primary + 2 replicas
    responses = []
    for node in nodes:  # real coordinators contact replicas in parallel
        try:
            responses.append(node.put(key, value, timestamp=now()))
        except NodeUnavailable:
            pass
    if len(responses) >= w:
        return SUCCESS  # quorum achieved
    else:
        return FAILURE  # not enough acks; the client retries

# Read repair: if a quorum read returns a stale value from one node,
# the coordinator writes the latest value back to that node asynchronously.
```
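The read-repair path can be made concrete with a self-contained sketch. `Replica` and `quorum_get` are illustrative names under assumed semantics (last-write-wins by timestamp); a real coordinator would query replicas in parallel and repair asynchronously:

```python
class NodeUnavailable(Exception):
    pass

class Replica:
    """One replica: stores (timestamp, value) per key; may be down."""

    def __init__(self):
        self.store = {}
        self.up = True

    def put(self, key, ts, value):
        if not self.up:
            raise NodeUnavailable(key)
        cur = self.store.get(key)
        if cur is None or ts > cur[0]:       # last-write-wins by timestamp
            self.store[key] = (ts, value)

    def get(self, key):
        if not self.up:
            raise NodeUnavailable(key)
        return self.store[key]               # KeyError if absent

def quorum_get(replicas, key, r=2):
    """Quorum read: collect replies, newest timestamp wins, then
    write the winner back to any stale replica (read repair)."""
    responses = []
    for rep in replicas:
        try:
            responses.append((rep, rep.get(key)))
        except (NodeUnavailable, KeyError):
            continue
    if len(responses) < r:
        return None                          # quorum not reached
    winner_ts, winner_val = max((tv for _, tv in responses),
                                key=lambda tv: tv[0])
    for rep, (ts, _) in responses:
        if ts < winner_ts:
            rep.put(key, winner_ts, winner_val)  # read repair
    return winner_val
```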
6. Bloom Filters for Read Optimization
In an LSM Tree with many SSTables, a read for a non-existent key would require checking every SSTable on disk — potentially hundreds of disk seeks. Bloom filters eliminate this cost.
A Bloom filter is a probabilistic bit-array data structure. For each SSTable, a Bloom filter is computed over all keys in the SSTable. When checking if a key exists in an SSTable:
- If the Bloom filter says NO → the key is definitely not in this SSTable (skip it entirely — no disk I/O).
- If the Bloom filter says YES → the key might be in the SSTable (do the actual disk seek). There is a small false-positive rate (typically 1%), but never a false-negative.
With a 1% false positive rate and 10 SSTables to check, the Bloom filter eliminates ~99% of unnecessary disk seeks for non-existent keys. The filter for a 1-million-key SSTable requires only ~10 bits/key = ~1.25 MB — trivially storable in memory. Every major LSM-based system (RocksDB, Cassandra, HBase) uses Bloom filters as a critical read optimization.
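A Bloom filter is small enough to sketch in full. The double-hashing probe scheme below is a standard construction; the 10 bits/key default and SHA-256 base hash are assumptions for the sketch:

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter sketch: k hash probes into an m-bit array.
    ~10 bits/key with k=7 probes gives roughly a 1% false-positive rate."""

    def __init__(self, expected_keys, bits_per_key=10):
        self.m = expected_keys * bits_per_key
        self.k = max(1, round(bits_per_key * math.log(2)))  # optimal k ≈ 0.69 * bits/key
        self.bits = bytearray((self.m + 7) // 8)

    def _probes(self, key):
        # Double hashing: derive k probe positions from two base hashes.
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))
```

Note the asymmetry the section describes: `might_contain` can return a false positive, but never a false negative for a key that was added.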
7. Write-Ahead Log & Crash Recovery
The memtable is in memory and is lost on a crash. The Write-Ahead Log (WAL) ensures durability: every write is first appended to a sequential log file on disk before being applied to the memtable. On recovery after a crash:
- The system reads the WAL from the last checkpoint.
- Replays all operations from the WAL into a fresh memtable.
- The memtable is back to the pre-crash state. Data loss: zero (for any write that received a WAL confirmation).
The WAL is a sequential append-only file — the cheapest possible write operation on both HDD and SSD. The write is considered durable only after fsync() is called, which flushes the OS page cache to the actual storage medium. The performance trade-off: every acknowledged write requires at least one fsync() on the WAL file, which is the primary latency bottleneck for write throughput in disk-based KV stores.
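A minimal WAL can be sketched with line-delimited JSON records. `WriteAheadLog` and its record format are illustrative, not any real engine's on-disk format:

```python
import json
import os

class WriteAheadLog:
    """Minimal WAL sketch: one JSON record per line, fsync'd before ack."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, key, value):
        self.f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())   # durable only after fsync returns

    def replay(self):
        """Rebuild memtable contents after a crash by replaying the log."""
        memtable = {}
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                memtable[rec["k"]] = rec["v"]   # later records win
        return memtable

    def clear(self):
        # Safe only after the memtable has been flushed to an SSTable.
        self.f.close()
        open(self.path, "w").close()
        self.f = open(self.path, "a")
```

Production WALs add checksums and length prefixes per record so a torn write at the tail can be detected and truncated during replay.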
8. Gossip Protocol & Failure Detection
In a cluster of hundreds of nodes, how does each node know which other nodes are alive? A centralized health monitor is a single point of failure. Gossip protocols provide decentralized failure detection.
How Gossip Works
- Every second, each node randomly selects 3 peers and exchanges its view of the cluster: a list of {nodeId, heartbeat counter, last seen timestamp} for all nodes.
- Upon receiving gossip, each node updates its local cluster view with the latest information (highest heartbeat counter wins).
- If a node's heartbeat hasn't been updated in T seconds (e.g., T=30s), it is marked as suspected failed. If still not updated after 2T seconds, it is declared dead and its key ranges are redistributed to neighboring nodes.
- Cassandra uses a Phi Accrual Failure Detector — a more nuanced approach that outputs a continuous "suspicion level" (Φ) based on the distribution of inter-heartbeat intervals, rather than a binary alive/dead threshold.
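The gossip merge rule ("highest heartbeat counter wins") is simple enough to sketch. `merge_gossip`, `is_suspect`, and the view layout are illustrative names for this sketch:

```python
def merge_gossip(local_view, remote_view, now):
    """Merge a peer's cluster view into ours: for each node, the entry
    with the higher heartbeat counter wins.

    Views map node_id -> {"heartbeat": int, "last_seen": float};
    `now` is the current wall-clock time in seconds.
    """
    for node_id, remote in remote_view.items():
        local = local_view.get(node_id)
        if local is None or remote["heartbeat"] > local["heartbeat"]:
            local_view[node_id] = {"heartbeat": remote["heartbeat"],
                                   "last_seen": now}
    return local_view

def is_suspect(view, node_id, now, timeout=30.0):
    """A node is suspected failed if its heartbeat hasn't advanced
    within `timeout` seconds (the T threshold described above)."""
    return now - view[node_id]["last_seen"] > timeout
```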
9. Redis vs DynamoDB vs Cassandra: Design Comparisons
| Feature | Redis | DynamoDB | Cassandra |
|---|---|---|---|
| Storage | In-memory (+ optional disk) | SSD (proprietary engine) | SSD (LSM / SSTables) |
| Read latency | <1ms | 1–5ms | 1–10ms |
| Throughput | ~1M ops/sec (single node) | Scales on demand (serverless) | Linear with node count |
| Consistency | Strong (single node) / eventual (cluster) | Configurable (strong or eventual) | Configurable (quorum tunable) |
| Data size | GB–TB (RAM bound) | Petabyte-scale | Petabyte-scale |
| Best for | Caching, sessions, real-time | Serverless, unpredictable load | Write-heavy, time-series, IoT |
10. TTL, Eviction & Data Lifecycle
TTL (Time To Live) is a first-class feature in key-value stores, especially for caching use cases. Three strategies for handling expired keys:
- Lazy expiration: Check the key's TTL only when it is accessed. If expired, delete and return null. Memory efficient (no background scan), but expired keys linger until accessed.
- Active expiration: Background thread scans a random sample of keys with TTLs every 100ms, deletes expired ones. Redis uses this approach: it samples 20 random keys from the expiry set every 100ms, deletes expired ones, and repeats if more than 25% were expired.
- Compaction cleanup (LSM stores): During compaction, expired tombstones and TTL-expired values are filtered out. This is the primary expiration mechanism in Cassandra and DynamoDB.
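Lazy and active expiration can be combined in one sketch. `TTLStore` is illustrative; the 20-key sample mirrors the Redis sampling behavior described above:

```python
import random
import time

class TTLStore:
    """Sketch of lazy + active TTL expiration (Redis-style sampling)."""

    def __init__(self):
        self.data = {}      # key -> value
        self.expiry = {}    # key -> absolute expiry time (seconds)

    def put(self, key, value, ttl=None):
        self.data[key] = value
        if ttl is not None:
            self.expiry[key] = time.time() + ttl

    def get(self, key):
        # Lazy expiration: the TTL is checked only when the key is accessed.
        exp = self.expiry.get(key)
        if exp is not None and time.time() >= exp:
            self.data.pop(key, None)
            self.expiry.pop(key, None)
            return None
        return self.data.get(key)

    def expire_cycle(self, sample_size=20):
        """Active expiration: sample keys with TTLs, delete expired ones.
        Returns the expired fraction (Redis repeats the cycle if > 25%)."""
        if not self.expiry:
            return 0.0
        sample = random.sample(list(self.expiry),
                               min(sample_size, len(self.expiry)))
        now = time.time()
        expired = [k for k in sample if self.expiry[k] <= now]
        for k in expired:
            self.data.pop(k, None)
            self.expiry.pop(k, None)
        return len(expired) / len(sample)
```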
Memory Eviction Policies (Redis)
When Redis reaches its memory limit (maxmemory), it must evict keys. The configured policy determines which keys are evicted:
- allkeys-lru: Evict least-recently-used key from all keys. Best for general caching.
- volatile-lru: LRU eviction only among keys with TTLs set. Safe for mixed persistent + cached data.
- allkeys-lfu: Evict least-frequently-used (LFU) key. Better than LRU for skewed access patterns (popular keys are almost never evicted).
- noeviction: Return error on write when full. For durable data that must not be lost.
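The allkeys-lru policy reduces to a classic exercise. The `OrderedDict` sketch below shows exact LRU; note that Redis itself approximates LRU by sampling candidate keys rather than tracking exact recency:

```python
from collections import OrderedDict

class LRUCache:
    """allkeys-lru sketch: evict the least-recently-used key when full."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()   # insertion order == recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)   # evict the LRU entry
```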
11. Capacity Estimation & Conclusion
Back-of-Envelope: Disk-Based KV Store
- 10 billion key-value pairs; average size: 1 KB/pair → 10 TB data total
- With N=3 replication: 30 TB raw storage across cluster
- With LSM space amplification ~2×: need 60 TB on disk
- At 4 TB per SSD node: 15 nodes needed for storage
- Write throughput: 100K writes/sec × 1 KB = 100 MB/s ingress; WAL needs sequential write at 100 MB/s — single SSD handles 500 MB/s writes
- Compaction: typically 10–50% of write bandwidth; provision accordingly
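The arithmetic above, written out step by step:

```python
# Back-of-envelope capacity math for the estimate above.
pairs = 10_000_000_000            # 10 billion key-value pairs
pair_kb = 1                       # 1 KB average pair size
data_tb = pairs * pair_kb / 1e9   # 1 TB = 1e9 KB -> 10 TB of data
raw_tb = data_tb * 3              # N=3 replication -> 30 TB
disk_tb = raw_tb * 2              # ~2x LSM space amplification -> 60 TB
nodes = disk_tb / 4               # 4 TB of SSD per node -> 15 nodes
ingress_mb_s = 100_000 * pair_kb / 1000   # 100K writes/sec -> 100 MB/s
```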
A distributed key-value store is the foundational building block of the modern distributed software stack — it underlies caches, session stores, feature flags, configuration management, rate limiters, and even other databases (Cassandra is built on an LSM key-value engine; MySQL InnoDB stores tables as B-trees keyed by primary key). Understanding the LSM vs B-Tree trade-off, consistent hashing, and quorum replication gives you the mental model to understand and design virtually any distributed storage system. These are among the most frequently examined concepts in senior system design interviews at Google, Meta, Amazon, and Stripe.