System Design

Designing a Distributed Key-Value Store: LSM Trees, Consistent Hashing & Quorum Replication

Key-value stores are the simplest and most powerful data structures in distributed systems — Redis serves millions of operations per second from memory, DynamoDB handles Amazon's entire e-commerce traffic at 99.999% availability, and Cassandra stores petabytes of data across thousands of nodes. Building one from scratch requires mastering storage engines, partitioning, replication, and failure handling — all fundamental distributed systems concepts.

Md Sanwar Hossain April 6, 2026 21 min read System Design

TL;DR — Core Architecture Decisions

A distributed key-value store needs: (1) a storage engine — an LSM Tree for write-heavy workloads (Cassandra, RocksDB) or a B-Tree for read-heavy ones (PostgreSQL, MySQL InnoDB); (2) consistent hashing to partition keys across nodes with minimal reshuffling on membership changes; (3) leaderless replication with quorum writes (W) and reads (R), where W + R > N gives strong consistency; (4) Bloom filters to avoid expensive disk lookups for non-existent keys; and (5) a write-ahead log (WAL) for crash recovery.

Table of Contents

  1. Requirements & System Overview
  2. Storage Engine: LSM Tree vs B-Tree
  3. LSM Tree Internals: Memtable, SSTables & Compaction
  4. Partitioning with Consistent Hashing
  5. Replication: Quorum Reads & Writes
  6. Bloom Filters for Read Optimization
  7. Write-Ahead Log & Crash Recovery
  8. Gossip Protocol & Failure Detection
  9. Redis vs DynamoDB vs Cassandra: Design Comparisons
  10. TTL, Eviction & Data Lifecycle
  11. Capacity Estimation & Conclusion

1. Requirements & System Overview

Before designing any system, clarify requirements. For a distributed key-value store, the first fork in the road is whether data lives primarily in memory or on disk:

In-Memory (Redis)

  • All data in RAM — sub-millisecond latency
  • Data size bounded by memory (typically <1 TB)
  • Optional persistence: RDB snapshots + AOF log
  • Rich data types: String, Hash, List, Set, ZSet
  • Use cases: caching, sessions, leaderboards, pub/sub

Disk-Based (DynamoDB / Cassandra)

  • Petabyte-scale; data primarily on disk (SSD)
  • Single-digit millisecond read latency
  • LSM Tree storage engine (write-optimized)
  • Horizontal scaling via consistent hashing
  • Use cases: user profiles, IoT telemetry, product catalog
Distributed key-value store architecture — consistent hashing ring, quorum replication, LSM Tree, and gossip-based failure detection. Source: mdsanwarhossain.me

2. Storage Engine: LSM Tree vs B-Tree

The storage engine is the heart of any key-value store. The two dominant approaches are fundamentally different in their write path optimization strategy:

| Attribute | LSM Tree (Log-Structured Merge) | B-Tree |
| --- | --- | --- |
| Write pattern | Sequential append (fast) | In-place update (slower, random I/O) |
| Read performance | Slower (may check multiple SSTables) | Faster (single tree traversal, O(log N)) |
| Write amplification | Low on ingest (one memtable write), though compaction adds background rewrites | High (page rewrites + WAL) |
| Space amplification | Higher (tombstones, old versions until compaction) | Lower (in-place updates) |
| SSD friendliness | High (sequential writes extend SSD life) | Moderate (random writes wear NAND cells) |
| Used by | RocksDB, LevelDB, Cassandra, HBase | PostgreSQL, MySQL InnoDB, SQLite |

Rule of thumb: Choose LSM Tree for write-heavy workloads (event logging, time-series, IoT); choose B-Tree for read-heavy workloads with complex queries (OLTP databases). Most modern distributed key-value stores (DynamoDB, Cassandra, RocksDB) use LSM Trees because distributed systems tend to be write-heavy and benefit from sequential I/O patterns.

3. LSM Tree Internals: Memtable, SSTables & Compaction

The LSM Tree has three key components:

Memtable (In-Memory Buffer)

All writes go first to the memtable — an in-memory sorted data structure (typically a red-black tree or skip list). Writes are O(log N) in memory, much faster than any disk write. The memtable is also protected by a write-ahead log (WAL) on disk — if the server crashes before the memtable is flushed, the WAL allows replay to reconstruct it.

SSTables (Sorted String Tables)

When the memtable reaches a size threshold (typically 64 MB), it is flushed to disk as an SSTable (Sorted String Table) — an immutable, sorted file of key-value pairs. Each SSTable has a sparse index block (one index entry every few KB of data) and a Bloom filter for fast existence checking.
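The sparse index can be sketched as a binary search over sorted (key, file offset) pairs — find the last indexed key at or before the lookup key, then scan forward from that offset. The class and names below are illustrative, not any specific engine's API:

```python
import bisect

class SparseIndex:
    """Sparse index over a sorted SSTable: one (key, offset) entry
    every few KB. A lookup binary-searches for the last indexed key
    <= the target, then the caller scans the file from that offset."""

    def __init__(self, entries):
        # entries: list of (key, offset) pairs, sorted by key
        self.keys = [k for k, _ in entries]
        self.offsets = [off for _, off in entries]

    def seek_offset(self, key):
        """Offset to start scanning from, or None if the key sorts
        before the first indexed key (i.e., cannot be in this table
        unless it is the very first block)."""
        i = bisect.bisect_right(self.keys, key) - 1
        if i < 0:
            return None
        return self.offsets[i]
```

Because the index holds only one entry per block, it stays small enough to keep entirely in memory even for large SSTables.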

# LSM Tree write path
MEMTABLE_THRESHOLD = 64 * 1024 * 1024  # 64 MB

def put(key, value):
    # 1. Append to WAL (sequential write, durable)
    wal.append(key, value)
    
    # 2. Insert into memtable (in-memory, sorted)
    memtable.put(key, value)
    
    # 3. If memtable exceeds threshold, flush to SSTable
    if memtable.size > MEMTABLE_THRESHOLD:
        sstable = memtable.flush_to_disk()  # sequential write
        sst_manager.add(sstable)
        memtable.clear()
        wal.clear()  # safe to discard only after the flush completes

# LSM Tree read path
def get(key):
    # 1. Check memtable first (most recent writes)
    if val := memtable.get(key):
        return val
    
    # 2. Check SSTables from newest to oldest
    for sstable in reversed(sst_manager.all()):
        if sstable.bloom_filter.might_contain(key):
            if val := sstable.get(key):  # disk seek
                return val
    return None

Compaction

Over time, many SSTables accumulate on disk. Reads slow down (each SSTable may need checking), and space is wasted by deleted or overwritten values. Compaction merges multiple SSTables into a new one containing only the latest version of each key, dropping tombstones once they are old enough to have propagated to all replicas. Two common strategies:

  • Size-tiered compaction (Cassandra's default): merge SSTables of similar size into larger ones — write-efficient, but a read may touch more tables.
  • Leveled compaction (LevelDB, RocksDB): organize SSTables into levels of increasing size with non-overlapping key ranges within each level — more predictable reads, at the cost of extra write amplification.
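The merge at the heart of compaction is a k-way merge over sorted inputs. A minimal sketch, assuming each SSTable is an in-memory list of key-sorted (key, value) pairs, ordered newest table first, with None standing in for a tombstone:

```python
import heapq

def compact(sstables):
    """Merge sorted SSTables (newest first) into one compacted table.
    For duplicate keys the newest version wins; tombstones (None)
    are dropped from the output."""
    # Tag each entry with the table's age so that, for equal keys,
    # entries from newer tables sort first in the merged stream.
    merged = heapq.merge(
        *[[(key, age, value) for key, value in table]
          for age, table in enumerate(sstables)]
    )
    result, last_key = [], object()   # sentinel: never equals a real key
    for key, age, value in merged:
        if key == last_key:
            continue                  # older version of an emitted key
        last_key = key
        if value is not None:         # drop tombstones
            result.append((key, value))
    return result
```

Real engines stream from disk rather than materializing lists, but the dedup-newest-wins, drop-tombstones logic is the same.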

LSM Tree levels and quorum replication: N=3 nodes, W=2 writes, R=2 reads guarantees strong consistency (W+R>N). Source: mdsanwarhossain.me

4. Partitioning with Consistent Hashing

A distributed KV store must partition data across nodes. Simple modulo hashing (hash(key) % numNodes) is unusable — adding or removing a node requires remapping almost all keys. Consistent hashing solves this.

The Hash Ring

5. Replication: Quorum Reads & Writes

For fault tolerance, each key is replicated to N nodes (typically N=3). Cassandra and DynamoDB use leaderless replication — any replica can accept reads or writes. Consistency is controlled by two quorum parameters: W, the number of write acknowledgements required, and R, the number of read responses required. If W + R > N, the read and write quorums overlap, so every read contacts at least one replica holding the latest write.

# Quorum write example (N=3, W=2)
def put(key, value, w=2):
    nodes = ring.get_n_nodes(key, n=3)  # primary + 2 replicas
    responses = []
    for node in nodes:  # real coordinators fan out in parallel
        try:
            responses.append(node.put(key, value, timestamp=now()))
        except NodeUnavailable:
            pass
    if len(responses) >= w:
        return SUCCESS  # quorum achieved
    return FAILURE  # not enough acks; client retries

# Read repair: if quorum read returns stale value from one node,
# coordinator writes the latest value back to that node asynchronously
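The read side can be sketched the same way: gather R responses, pick the newest by timestamp (last-write-wins), and repair any stale replica that answered. The ring and node interfaces below are illustrative stand-ins for the surrounding pseudocode:

```python
class NodeUnavailable(Exception):
    pass

class QuorumNotReached(Exception):
    pass

def quorum_get(key, ring, r=2, n=3):
    """Quorum read sketch (R=2 of N=3) with read repair."""
    nodes = ring.get_n_nodes(key, n=n)
    responses = []                        # (node, value, timestamp)
    for node in nodes:
        try:
            value, ts = node.get(key)
            responses.append((node, value, ts))
        except NodeUnavailable:
            continue
        if len(responses) >= r:
            break
    if len(responses) < r:
        raise QuorumNotReached(key)
    # Newest timestamp wins (last-write-wins conflict resolution).
    _, value, ts = max(responses, key=lambda resp: resp[2])
    for node, _, node_ts in responses:
        if node_ts < ts:
            node.put(key, value, timestamp=ts)   # read repair
    return value
```

A production coordinator would issue the repair asynchronously rather than on the read path, as the comment above notes.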

6. Bloom Filters for Read Optimization

In an LSM Tree with many SSTables, a read for a non-existent key would require checking every SSTable on disk — potentially hundreds of disk seeks. Bloom filters eliminate this cost.

A Bloom filter is a probabilistic bit-array data structure. For each SSTable, a Bloom filter is computed over all keys in the table. To check whether a key might exist in an SSTable, hash the key with k independent hash functions and test the corresponding k bits: if any bit is 0, the key is definitely absent (skip the SSTable entirely); if all k bits are 1, the key is possibly present (do the disk lookup). False positives are possible; false negatives are not.

With a 1% false positive rate and 10 SSTables to check, the Bloom filter eliminates ~99% of unnecessary disk seeks for non-existent keys. The filter for a 1-million-key SSTable requires only ~10 bits/key = ~1.25 MB — trivially storable in memory. Every major LSM-based system (RocksDB, Cassandra, HBase) uses Bloom filters as a critical read optimization.
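A minimal Bloom filter sketch, deriving its k hash functions by salting a single hash (parameters chosen for illustration, not tuned for the 1% rate above):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: m-bit array, k salted hash functions.
    might_contain() may return false positives, never false negatives."""

    def __init__(self, m_bits=8192, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, key):
        # Derive k bit positions by hashing the key with k salts.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # All k bits set -> possibly present; any bit clear -> absent.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Production filters (RocksDB, Cassandra) use faster non-cryptographic hashes and size m and k from the target false-positive rate, but the bit-testing logic is the same.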

7. Write-Ahead Log & Crash Recovery

The memtable is in memory and is lost on a crash. The Write-Ahead Log (WAL) ensures durability: every write is first appended to a sequential log file on disk before being applied to the memtable. On recovery after a crash:

  1. The system reads the WAL from the last checkpoint.
  2. Replays all operations from the WAL into a fresh memtable.
  3. The memtable is back to the pre-crash state. Data loss: zero (for any write that received a WAL confirmation).

The WAL is a sequential append-only file — the cheapest possible write operation on both HDD and SSD. The write is considered durable only after fsync() is called, which flushes the OS page cache to the actual storage medium. The performance trade-off: every acknowledged write requires at least one fsync() on the WAL file, which is the primary latency bottleneck for write throughput in disk-based KV stores.
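The append-fsync-replay cycle described above can be sketched as follows. The text framing is illustrative; real WALs use length-prefixed binary records with checksums:

```python
import os

class WriteAheadLog:
    """Toy WAL: each write is appended as a tab-separated line and
    fsync()'d before being acknowledged; replay() rebuilds the
    memtable from the log after a crash."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        self.f.write(f"{key}\t{value}\n")
        self.f.flush()                 # user-space buffer -> OS page cache
        os.fsync(self.f.fileno())      # page cache -> storage medium
        # Only now is the write durable and safe to acknowledge.

    def replay(self):
        memtable = {}
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                memtable[key] = value  # later entries overwrite earlier
        return memtable
```

Batching several writes per fsync() (group commit) is the standard way to amortize the fsync bottleneck mentioned above.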

8. Gossip Protocol & Failure Detection

In a cluster of hundreds of nodes, how does each node know which other nodes are alive? A centralized health monitor is a single point of failure. Gossip protocols provide decentralized failure detection.

How Gossip Works
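In push-style gossip, every node keeps a local membership table mapping each peer to a (heartbeat counter, last-seen time) pair. Each round, a node increments its own heartbeat, sends its table to a few random peers, and every recipient keeps the higher heartbeat per entry; a peer whose heartbeat stops advancing is eventually suspected dead. A toy sketch (the data shapes and names are illustrative):

```python
import random

def gossip_round(cluster, node_id, now, fanout=3):
    """One gossip round for `node_id`.
    cluster: {node: {member: (heartbeat, last_seen)}} — each node's
    local membership view."""
    view = cluster[node_id]
    hb, _ = view[node_id]
    view[node_id] = (hb + 1, now)          # bump own heartbeat

    peers = [n for n in cluster if n != node_id]
    for peer in random.sample(peers, min(fanout, len(peers))):
        peer_view = cluster[peer]
        for member, (member_hb, _) in view.items():
            known_hb, _ = peer_view.get(member, (-1, now))
            if member_hb > known_hb:       # fresher info wins
                peer_view[member] = (member_hb, now)

def suspected_dead(view, now, fail_after=10):
    """Members whose heartbeat hasn't advanced within `fail_after` ticks."""
    return [n for n, (_, seen) in view.items() if now - seen > fail_after]
```

Because each round touches only O(fanout) peers, information spreads through an n-node cluster in O(log n) rounds with no central health monitor.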

9. Redis vs DynamoDB vs Cassandra: Design Comparisons

| Feature | Redis | DynamoDB | Cassandra |
| --- | --- | --- | --- |
| Storage | In-memory (+ optional disk) | SSD (proprietary managed engine) | SSD (LSM / SSTable) |
| Read latency | <1 ms | 1–5 ms | 1–10 ms |
| Throughput | ~1M ops/sec per node | Virtually unlimited (managed auto-scaling) | Scales linearly with node count |
| Consistency | Strong (single node) / eventual (cluster) | Configurable (strong or eventual) | Configurable (tunable quorum) |
| Data size | GB–TB (RAM-bound) | Petabyte-scale | Petabyte-scale |
| Best for | Caching, sessions, real-time | Serverless, unpredictable load | Write-heavy, time-series, IoT |

10. TTL, Eviction & Data Lifecycle

TTL (Time To Live) is a first-class feature in key-value stores, especially for caching use cases. Three common strategies for handling expired keys:

  • Lazy (passive) expiration: check the TTL on access and delete the key if it has expired — zero background cost, but untouched dead keys linger.
  • Active (periodic) expiration: a background job samples keys with TTLs and deletes expired ones (Redis combines this with lazy expiration).
  • Expiry at compaction: in LSM-based stores like Cassandra, expired entries are simply dropped when SSTables are compacted.

Memory Eviction Policies (Redis)

When Redis reaches its memory limit (maxmemory), it must evict keys. The configured maxmemory-policy determines which keys are evicted:

  • noeviction — reject new writes once memory is full (the default)
  • allkeys-lru / volatile-lru — evict least-recently-used keys (all keys, or only keys with a TTL)
  • allkeys-lfu / volatile-lfu — evict least-frequently-used keys
  • allkeys-random / volatile-random — evict random keys
  • volatile-ttl — evict keys with the nearest expiry first
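The LRU family of policies can be sketched with an ordered map: every access moves the key to the "recent" end, and when the store exceeds its limit the least-recently-used key is dropped. (Redis actually samples a few candidate keys rather than maintaining a perfect LRU list, which approximates this cheaply.)

```python
from collections import OrderedDict

class LRUCache:
    """Sketch of allkeys-lru eviction with maxmemory counted in
    entries rather than bytes, for simplicity."""

    def __init__(self, maxmemory=3):
        self.maxmemory = maxmemory
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxmemory:
            self.data.popitem(last=False)   # evict least recently used
```

Swapping the eviction criterion (access frequency, TTL distance, random choice) yields the other policies in the list above.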

11. Capacity Estimation & Conclusion

Back-of-Envelope: Disk-Based KV Store

  • 10 billion key-value pairs; average size: 1 KB/pair → 10 TB data total
  • With N=3 replication: 30 TB raw storage across cluster
  • With LSM space amplification ~2×: need 60 TB on disk
  • At 4 TB per SSD node: 15 nodes needed for storage
  • Write throughput: 100K writes/sec × 1 KB = 100 MB/s ingress; WAL needs sequential write at 100 MB/s — single SSD handles 500 MB/s writes
  • Compaction: typically 10–50% of write bandwidth; provision accordingly
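The storage arithmetic above is mechanical enough to script. A small helper reproducing those numbers (decimal units, node disk size as a parameter):

```python
def capacity_plan(pairs, avg_kb, replication=3, space_amp=2.0,
                  disk_tb_per_node=4):
    """Back-of-envelope sizing: (logical TB, replicated TB,
    on-disk TB after LSM space amplification, nodes needed)."""
    data_tb = pairs * avg_kb / 1e9                # KB -> TB
    raw_tb = data_tb * replication
    on_disk_tb = raw_tb * space_amp
    nodes = int(-(-on_disk_tb // disk_tb_per_node))  # ceiling division
    return data_tb, raw_tb, on_disk_tb, nodes

print(capacity_plan(pairs=10_000_000_000, avg_kb=1))
# → (10.0, 30.0, 60.0, 15)
```

Varying replication factor or space amplification makes it easy to see how quickly the node count moves with each assumption.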

A distributed key-value store is a foundational building block of the modern distributed software stack — it underlies caches, session stores, feature flags, configuration management, rate limiters, and even other databases (Cassandra is built on an LSM key-value engine; PostgreSQL's indexes are B-trees). Understanding the LSM vs B-Tree trade-off, consistent hashing, and quorum replication gives you the mental model to reason about virtually any distributed storage system — and these are among the most frequently examined concepts in senior system design interviews at companies like Google, Meta, Amazon, and Stripe.

Md Sanwar Hossain

Software Engineer · Java · Spring Boot · Microservices · System Design

Last updated: April 6, 2026