Designing a Distributed Key-Value Store: LSM Trees, Consistent Hashing & Quorum Replication
Key-value stores are among the simplest yet most widely deployed abstractions in distributed systems — Redis serves millions of operations per second from memory, DynamoDB backs much of Amazon's e-commerce traffic with a 99.999% availability SLA for global tables, and Cassandra stores petabytes of data across thousands of nodes. Building one from scratch requires mastering storage engines, partitioning, replication, and failure handling — all fundamental distributed systems concepts.
TL;DR — Core Architecture Decisions
A distributed key-value store needs: (1) a storage engine: an LSM Tree for write-heavy workloads (Cassandra, RocksDB) or a B-Tree for read-heavy workloads (PostgreSQL, MySQL InnoDB); (2) consistent hashing to partition keys across nodes with minimal reshuffling on membership changes; (3) leaderless replication with quorum writes (W) and reads (R), where W + R > N yields strong consistency; (4) Bloom filters to avoid expensive disk lookups for non-existent keys; and (5) a write-ahead log (WAL) for crash recovery.
Table of Contents
- Requirements & System Overview
- Storage Engine: LSM Tree vs B-Tree
- LSM Tree Internals: Memtable, SSTables & Compaction
- Partitioning with Consistent Hashing
- Replication: Quorum Reads & Writes
- Bloom Filters for Read Optimization
- Write-Ahead Log & Crash Recovery
- Gossip Protocol & Failure Detection
- Redis vs DynamoDB vs Cassandra: Design Comparisons
- TTL, Eviction & Data Lifecycle
- Capacity Estimation & Conclusion
1. Requirements & System Overview
Before designing any system, clarify requirements. For a distributed key-value store, the first decision is whether the store is primarily in-memory or disk-based, which drives every other design choice:
In-Memory (Redis)
- All data in RAM — sub-millisecond latency
- Data size bounded by memory (typically <1 TB)
- Optional persistence: RDB snapshots + AOF log
- Rich data types: String, Hash, List, Set, ZSet
- Use cases: caching, sessions, leaderboards, pub/sub
Disk-Based (DynamoDB / Cassandra)
- Petabyte-scale; data primarily on disk (SSD)
- Single-digit millisecond read latency
- LSM Tree storage engine (write-optimized)
- Horizontal scaling via consistent hashing
- Use cases: user profiles, IoT telemetry, product catalog
2. Storage Engine: LSM Tree vs B-Tree
The storage engine is the heart of any key-value store. The two dominant approaches are fundamentally different in their write path optimization strategy:
| Attribute | LSM Tree (Log-Structured Merge) | B-Tree |
|---|---|---|
| Write pattern | Sequential append (fast) | In-place update (slower, random I/O) |
| Read performance | Slower (may check multiple SSTables) | Faster (single tree traversal, O(log N)) |
| Write amplification | Higher overall (compaction rewrites each key multiple times) | Moderate (a full page written per update, plus WAL) |
| Space amplification | Higher (tombstones, old versions) | Lower (in-place updates) |
| SSD friendliness | High (sequential writes extend SSD life) | Moderate (random writes wear NAND cells) |
| Used by | RocksDB, LevelDB, Cassandra, HBase | PostgreSQL, MySQL InnoDB, SQLite |
Rule of thumb: Choose LSM Tree for write-heavy workloads (event logging, time-series, IoT); choose B-Tree for read-heavy workloads with complex queries (OLTP databases). Most modern distributed key-value stores (DynamoDB, Cassandra, RocksDB) use LSM Trees because distributed systems tend to be write-heavy and benefit from sequential I/O patterns.
3. LSM Tree Internals: Memtable, SSTables & Compaction
The LSM Tree has three key components:
Memtable (In-Memory Buffer)
All writes go first to the memtable — an in-memory sorted data structure (typically a red-black tree or skip list). Writes are O(log N) in memory, much faster than any disk write. The memtable is also protected by a write-ahead log (WAL) on disk — if the server crashes before the memtable is flushed, the WAL allows replay to reconstruct it.
SSTables (Sorted String Tables)
When the memtable reaches a size threshold (typically 64 MB), it is flushed to disk as an SSTable (Sorted String Table): an immutable, sorted file of key-value pairs. Each SSTable carries an index block (a sparse index, e.g., one key per KB of data) and a Bloom filter for fast existence checks.
```python
# LSM Tree write path
MEMTABLE_LIMIT = 64 * 1024 * 1024  # 64 MB flush threshold

def put(key, value):
    # 1. Append to WAL (sequential write, durable)
    wal.append(key, value)
    # 2. Insert into memtable (in-memory, sorted)
    memtable.put(key, value)
    # 3. If the memtable exceeds the threshold, flush it to an SSTable
    if memtable.size > MEMTABLE_LIMIT:
        sstable = memtable.flush_to_disk()  # sequential write
        sst_manager.add(sstable)
        memtable.clear()
        wal.clear()  # safe to clear only after the flush completes

# LSM Tree read path
def get(key):
    # 1. Check the memtable first (holds the most recent writes)
    if (val := memtable.get(key)) is not None:
        return val
    # 2. Check SSTables from newest to oldest
    for sstable in reversed(sst_manager.all()):
        if sstable.bloom_filter.might_contain(key):
            if (val := sstable.get(key)) is not None:  # disk seek
                return val
    return None
```
Compaction
Over time, many SSTables accumulate on disk. Reads become slow (every SSTable may need checking), and space is wasted on deleted or overwritten values. Compaction fixes this: it reads several SSTables and writes a single merged SSTable containing only the latest version of each key, dropping tombstones once they are no longer needed. Two common strategies:
- Size-tiered compaction (Cassandra default): Compact similarly-sized SSTables together. Write-optimized; reads can be slow with many SSTable levels. Good for write-heavy workloads.
- Leveled compaction (RocksDB/LevelDB default): Maintain fixed levels (L0, L1, L2...) where each level is 10× larger than the previous. Keys are sorted and non-overlapping within each level. Read performance is much better (only check one SSTable per level), but write amplification is higher.
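The merge step itself can be sketched in a few lines. This is a simplified model, not any real engine's code: each SSTable is represented as a key-sorted list of (key, (timestamp, value)) pairs, the input list is ordered oldest to newest, and a value of `None` stands for a tombstone.

```python
def compact(sstables):
    """Merge SSTables into one, keeping only the newest version of
    each key and dropping tombstones (value=None).

    Each SSTable is a list of (key, (timestamp, value)) pairs sorted
    by key; `sstables` is ordered oldest to newest.
    """
    merged = {}
    for run in sstables:
        for key, (ts, value) in run:
            # Higher timestamp wins; later runs overwrite earlier ones.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Emit a single sorted run, filtering out tombstones.
    return [(k, merged[k]) for k in sorted(merged) if merged[k][1] is not None]
```

A real compactor streams the inputs with a k-way merge instead of building a dict, but the version-resolution logic is the same.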
4. Partitioning with Consistent Hashing
A distributed KV store must partition data across nodes. Simple modulo hashing (hash(key) % numNodes) is unusable — adding or removing a node requires remapping almost all keys. Consistent hashing solves this.
The Hash Ring
- Map the hash space (0 to 2³²) onto a ring. Each node is placed on the ring at hash(nodeId).
- For a key, compute hash(key) and walk clockwise to the first node on the ring — that node stores the key.
- When a node is added, it takes over part of the key range from its clockwise successor; when a node is removed, its keys move to its clockwise successor. Either way, only ~1/N of the keys are remapped — vastly better than modulo hashing.
- Virtual nodes: each physical node is represented by V points on the ring (e.g., V=150). This keeps load uniform even with heterogeneous node sizes, and spreads the load spike from a failed node across many survivors instead of one.
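The ring can be sketched with a sorted list and binary search. `HashRing`, its `vnodes` parameter, and the MD5-based hash are illustrative choices for the sketch, not a reference to any particular implementation:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (teaching sketch)."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self._hashes = []   # sorted vnode positions on the ring
        self._nodes = []    # node id at each position

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def _rebuild(self, entries):
        entries.sort()
        self._hashes = [h for h, _ in entries]
        self._nodes = [n for _, n in entries]

    def add_node(self, node_id):
        entries = list(zip(self._hashes, self._nodes))
        entries += [(self._hash(f"{node_id}#{v}"), node_id)
                    for v in range(self.vnodes)]
        self._rebuild(entries)

    def remove_node(self, node_id):
        self._rebuild([(h, n) for h, n in zip(self._hashes, self._nodes)
                       if n != node_id])

    def get_node(self, key):
        # Walk clockwise: first vnode at or after hash(key), wrapping around.
        i = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[i]
```

The defining property: removing a node only remaps the keys that node owned; every other key keeps its owner.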
5. Replication: Quorum Reads & Writes
For fault tolerance, each key is replicated to N nodes (typically N=3). Cassandra and DynamoDB use leaderless replication — any node can accept reads or writes. Consistency is controlled by quorum parameters:
- N = replication factor (e.g., 3): number of replicas for each key
- W = write quorum: number of replicas that must acknowledge a write before it's considered successful
- R = read quorum: number of replicas that must respond to a read; the response with the highest timestamp wins
- Strong consistency: W + R > N ensures at least one replica overlaps between the write set and read set. E.g., N=3, W=2, R=2: any read must hit at least one node that received the write.
- Eventual consistency: W=1, R=1 — maximum throughput, lowest latency, but reads may return stale data until replication propagates.
```python
# Quorum write example (N=3, W=2)
def put(key, value, w=2):
    nodes = ring.get_n_nodes(key, n=3)  # primary + 2 replicas
    responses = []
    for node in nodes:  # real coordinators contact replicas in parallel
        try:
            responses.append(node.put(key, value, timestamp=now()))
        except NodeUnavailable:
            pass
    if len(responses) >= w:
        return SUCCESS  # quorum achieved
    else:
        return FAILURE  # not enough acks; the client retries

# Read repair: if a quorum read returns a stale value from one node,
# the coordinator writes the latest value back to that node asynchronously.
```
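The read-repair path can be made concrete with a self-contained sketch. `Replica` and `quorum_get` are illustrative names under assumed semantics (last-write-wins by timestamp); a real coordinator would query replicas in parallel and repair asynchronously:

```python
class NodeUnavailable(Exception):
    pass

class Replica:
    """One replica: stores (timestamp, value) per key; may be down."""

    def __init__(self):
        self.store = {}
        self.up = True

    def put(self, key, ts, value):
        if not self.up:
            raise NodeUnavailable(key)
        cur = self.store.get(key)
        if cur is None or ts > cur[0]:       # last-write-wins by timestamp
            self.store[key] = (ts, value)

    def get(self, key):
        if not self.up:
            raise NodeUnavailable(key)
        return self.store[key]               # KeyError if absent

def quorum_get(replicas, key, r=2):
    """Quorum read: collect replies, newest timestamp wins, then
    write the winner back to any stale replica (read repair)."""
    responses = []
    for rep in replicas:
        try:
            responses.append((rep, rep.get(key)))
        except (NodeUnavailable, KeyError):
            continue
    if len(responses) < r:
        return None                          # quorum not reached
    winner_ts, winner_val = max((tv for _, tv in responses),
                                key=lambda tv: tv[0])
    for rep, (ts, _) in responses:
        if ts < winner_ts:
            rep.put(key, winner_ts, winner_val)  # read repair
    return winner_val
```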
6. Bloom Filters for Read Optimization
In an LSM Tree with many SSTables, a read for a non-existent key would require checking every SSTable on disk — potentially hundreds of disk seeks. Bloom filters eliminate this cost.
A Bloom filter is a probabilistic bit-array data structure. For each SSTable, a Bloom filter is computed over all keys in the SSTable. When checking if a key exists in an SSTable:
- If the Bloom filter says NO → the key is definitely not in this SSTable (skip it entirely — no disk I/O).
- If the Bloom filter says YES → the key might be in the SSTable (do the actual disk seek). There is a small false-positive rate (typically 1%), but never a false-negative.
With a 1% false positive rate and 10 SSTables to check, the Bloom filter eliminates ~99% of unnecessary disk seeks for non-existent keys. The filter for a 1-million-key SSTable requires only ~10 bits/key = ~1.25 MB — trivially storable in memory. Every major LSM-based system (RocksDB, Cassandra, HBase) uses Bloom filters as a critical read optimization.
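A Bloom filter is small enough to sketch in full. The double-hashing probe scheme below is a standard construction; the 10 bits/key default and SHA-256 base hash are assumptions for the sketch:

```python
import hashlib
import math

class BloomFilter:
    """Bloom filter sketch: k hash probes into an m-bit array.
    ~10 bits/key with k=7 probes gives roughly a 1% false-positive rate."""

    def __init__(self, expected_keys, bits_per_key=10):
        self.m = expected_keys * bits_per_key
        self.k = max(1, round(bits_per_key * math.log(2)))  # optimal k ≈ 0.69 * bits/key
        self.bits = bytearray((self.m + 7) // 8)

    def _probes(self, key):
        # Double hashing: derive k probe positions from two base hashes.
        h = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))
```

Note the asymmetry the section describes: `might_contain` can return a false positive, but never a false negative for a key that was added.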
7. Write-Ahead Log & Crash Recovery
The memtable is in memory and is lost on a crash. The Write-Ahead Log (WAL) ensures durability: every write is first appended to a sequential log file on disk before being applied to the memtable. On recovery after a crash:
- The system reads the WAL from the last checkpoint.
- Replays all operations from the WAL into a fresh memtable.
- The memtable is back to the pre-crash state. Data loss: zero (for any write that received a WAL confirmation).
The WAL is a sequential append-only file — the cheapest possible write operation on both HDD and SSD. The write is considered durable only after fsync() is called, which flushes the OS page cache to the actual storage medium. The performance trade-off: every acknowledged write requires at least one fsync() on the WAL file, which is the primary latency bottleneck for write throughput in disk-based KV stores.
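A minimal WAL can be sketched with line-delimited JSON records. `WriteAheadLog` and its record format are illustrative, not any real engine's on-disk format:

```python
import json
import os

class WriteAheadLog:
    """Minimal WAL sketch: one JSON record per line, fsync'd before ack."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, key, value):
        self.f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())   # durable only after fsync returns

    def replay(self):
        """Rebuild memtable contents after a crash by replaying the log."""
        memtable = {}
        with open(self.path) as f:
            for line in f:
                rec = json.loads(line)
                memtable[rec["k"]] = rec["v"]   # later records win
        return memtable

    def clear(self):
        # Safe only after the memtable has been flushed to an SSTable.
        self.f.close()
        open(self.path, "w").close()
        self.f = open(self.path, "a")
```

Production WALs add checksums and length prefixes per record so a torn write at the tail can be detected and truncated during replay.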
8. Gossip Protocol & Failure Detection
In a cluster of hundreds of nodes, how does each node know which other nodes are alive? A centralized health monitor is a single point of failure. Gossip protocols provide decentralized failure detection.
How Gossip Works
- Every second, each node randomly selects 3 peers and exchanges its view of the cluster: a list of {nodeId, heartbeat counter, last seen timestamp} for all nodes.
- Upon receiving gossip, each node updates its local cluster view with the latest information (highest heartbeat counter wins).
- If a node's heartbeat hasn't been updated in T seconds (e.g., T=30s), it is marked as suspected failed. If still not updated after 2T seconds, it is declared dead and its key ranges are redistributed to neighboring nodes.
- Cassandra uses a Phi Accrual Failure Detector — a more nuanced approach that outputs a continuous "suspicion level" (Φ) based on the distribution of inter-heartbeat intervals, rather than a binary alive/dead threshold.
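The gossip merge rule ("highest heartbeat counter wins") is simple enough to sketch. `merge_gossip`, `is_suspect`, and the view layout are illustrative names for this sketch:

```python
def merge_gossip(local_view, remote_view, now):
    """Merge a peer's cluster view into ours: for each node, the entry
    with the higher heartbeat counter wins.

    Views map node_id -> {"heartbeat": int, "last_seen": float};
    `now` is the current wall-clock time in seconds.
    """
    for node_id, remote in remote_view.items():
        local = local_view.get(node_id)
        if local is None or remote["heartbeat"] > local["heartbeat"]:
            local_view[node_id] = {"heartbeat": remote["heartbeat"],
                                   "last_seen": now}
    return local_view

def is_suspect(view, node_id, now, timeout=30.0):
    """A node is suspected failed if its heartbeat hasn't advanced
    within `timeout` seconds (the T threshold described above)."""
    return now - view[node_id]["last_seen"] > timeout
```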
9. Redis vs DynamoDB vs Cassandra: Design Comparisons
| Feature | Redis | DynamoDB | Cassandra |
|---|---|---|---|
| Storage | In-memory (+ optional disk) | SSD (proprietary engine) | SSD (LSM / SSTables) |
| Read latency | <1ms | 1–5ms | 1–10ms |
| Throughput | ~1M ops/sec (single node) | Scales on demand (serverless) | Linear with node count |
| Consistency | Strong (single node) / eventual (cluster) | Configurable (strong or eventual) | Configurable (quorum tunable) |
| Data size | GB–TB (RAM bound) | Petabyte-scale | Petabyte-scale |
| Best for | Caching, sessions, real-time | Serverless, unpredictable load | Write-heavy, time-series, IoT |
10. TTL, Eviction & Data Lifecycle
TTL (Time To Live) is a first-class feature in key-value stores, especially for caching use cases. Three strategies for handling expired keys:
- Lazy expiration: Check the key's TTL only when it is accessed. If expired, delete and return null. Memory efficient (no background scan), but expired keys linger until accessed.
- Active expiration: Background thread scans a random sample of keys with TTLs every 100ms, deletes expired ones. Redis uses this approach: it samples 20 random keys from the expiry set every 100ms, deletes expired ones, and repeats if more than 25% were expired.
- Compaction cleanup (LSM stores): During compaction, expired tombstones and TTL-expired values are filtered out. This is the primary expiration mechanism in Cassandra and DynamoDB.
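Lazy and active expiration can be combined in one sketch. `TTLStore` is illustrative; the 20-key sample mirrors the Redis sampling behavior described above:

```python
import random
import time

class TTLStore:
    """Sketch of lazy + active TTL expiration (Redis-style sampling)."""

    def __init__(self):
        self.data = {}      # key -> value
        self.expiry = {}    # key -> absolute expiry time (seconds)

    def put(self, key, value, ttl=None):
        self.data[key] = value
        if ttl is not None:
            self.expiry[key] = time.time() + ttl

    def get(self, key):
        # Lazy expiration: the TTL is checked only when the key is accessed.
        exp = self.expiry.get(key)
        if exp is not None and time.time() >= exp:
            self.data.pop(key, None)
            self.expiry.pop(key, None)
            return None
        return self.data.get(key)

    def expire_cycle(self, sample_size=20):
        """Active expiration: sample keys with TTLs, delete expired ones.
        Returns the expired fraction (Redis repeats the cycle if > 25%)."""
        if not self.expiry:
            return 0.0
        sample = random.sample(list(self.expiry),
                               min(sample_size, len(self.expiry)))
        now = time.time()
        expired = [k for k in sample if self.expiry[k] <= now]
        for k in expired:
            self.data.pop(k, None)
            self.expiry.pop(k, None)
        return len(expired) / len(sample)
```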
Memory Eviction Policies (Redis)
When Redis reaches its memory limit (maxmemory), it must evict keys. The configured policy determines which keys are evicted:
- allkeys-lru: Evict least-recently-used key from all keys. Best for general caching.
- volatile-lru: LRU eviction only among keys with TTLs set. Safe for mixed persistent + cached data.
- allkeys-lfu: Evict least-frequently-used (LFU) key. Better than LRU for skewed access patterns (popular keys are almost never evicted).
- noeviction: Return error on write when full. For durable data that must not be lost.
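The allkeys-lru policy reduces to a classic exercise. The `OrderedDict` sketch below shows exact LRU; note that Redis itself approximates LRU by sampling candidate keys rather than tracking exact recency:

```python
from collections import OrderedDict

class LRUCache:
    """allkeys-lru sketch: evict the least-recently-used key when full."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()   # insertion order == recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)   # evict the LRU entry
```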
11. Capacity Estimation & Conclusion
Back-of-Envelope: Disk-Based KV Store
- 10 billion key-value pairs; average size: 1 KB/pair → 10 TB data total
- With N=3 replication: 30 TB raw storage across cluster
- With LSM space amplification ~2×: need 60 TB on disk
- At 4 TB per SSD node: 15 nodes needed for storage
- Write throughput: 100K writes/sec × 1 KB = 100 MB/s ingress; WAL needs sequential write at 100 MB/s — single SSD handles 500 MB/s writes
- Compaction: typically 10–50% of write bandwidth; provision accordingly
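The arithmetic above, written out step by step:

```python
# Back-of-envelope capacity math for the estimate above.
pairs = 10_000_000_000            # 10 billion key-value pairs
pair_kb = 1                       # 1 KB average pair size
data_tb = pairs * pair_kb / 1e9   # 1 TB = 1e9 KB -> 10 TB of data
raw_tb = data_tb * 3              # N=3 replication -> 30 TB
disk_tb = raw_tb * 2              # ~2x LSM space amplification -> 60 TB
nodes = disk_tb / 4               # 4 TB of SSD per node -> 15 nodes
ingress_mb_s = 100_000 * pair_kb / 1000   # 100K writes/sec -> 100 MB/s
```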
A distributed key-value store is the foundational building block of the modern distributed software stack — it underlies caches, session stores, feature flags, configuration management, rate limiters, and even other databases (Cassandra is built on an LSM key-value engine; MySQL InnoDB stores tables as B-trees keyed by primary key). Understanding the LSM vs B-Tree trade-off, consistent hashing, and quorum replication gives you the mental model to understand and design virtually any distributed storage system. These are among the most frequently examined concepts in senior system design interviews at Google, Meta, Amazon, and Stripe.