Write-Ahead Logging (WAL) Internals: Crash Recovery and Durability in Distributed Databases
Write-Ahead Logging is one of the most foundational techniques in database engineering — and one of the least understood outside of database internals specialists. Understanding WAL deeply means understanding how systems like PostgreSQL, Kafka, CockroachDB, and etcd provide durability guarantees, and how to design your own systems with the same guarantees.
Table of Contents
- The Durability Problem: Why WAL Exists
- WAL Mechanics: The Core Algorithm
- Crash Recovery: The ARIES Algorithm
- PostgreSQL WAL: Deep Dive
- Kafka's Commit Log: WAL at Messaging Scale
- WAL-Based Replication and Change Data Capture
- Production Failure Scenarios
- Trade-offs: Durability vs. Performance
- Designing WAL Into Your Own System
- Key Takeaways
1. The Durability Problem: Why WAL Exists
Modern databases store their primary data structures in memory for performance — B-trees, hash indexes, buffer pools. A write to a database page goes to the in-memory buffer first; it gets flushed to disk later via a background checkpoint process. This creates a critical gap: if the system crashes after a transaction commits but before the dirty buffer page is flushed to disk, the committed data is lost.
This is the durability problem. The D in ACID says committed transactions must survive failures. WAL is the solution: before any change is applied to the in-memory data structures, a description of that change is written to the WAL — a sequential, append-only log file. The WAL write is synchronously flushed to disk before the transaction acknowledgment is returned to the client.
2. WAL Mechanics: The Core Algorithm
The WAL algorithm enforces two invariants:
- WAL Rule 1 (Undo Logging): Before a data page is written to disk, all WAL records for changes on that page (including the old values, for undo) must be written to the WAL and flushed to disk. This ensures that if a crash occurs after the data page write, the WAL contains enough information to undo the partial write.
- WAL Rule 2 (Redo Logging): Before a transaction COMMIT record is written and flushed to WAL, all WAL records for changes in that transaction must be on disk. This ensures that after a crash, any transaction whose COMMIT is in the WAL can be fully replayed (redone).
Every WAL record contains a Log Sequence Number (LSN) — a monotonically increasing identifier. LSNs are used during crash recovery to determine which changes have been applied and which need redo/undo.
1. Transaction starts
2. READ page X into buffer pool (if not already there)
3. MODIFY page X in buffer (in-memory change)
4. WRITE WAL record: [LSN=42, TxID=100, PageX: old=v1, new=v2]
5. FLUSH WAL to disk (fdatasync)
6. Return COMMIT to client ← durability guaranteed here
7. (Later) Background checkpoint: flush dirty page X to disk,
   then mark page X as flushed at LSN=42
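The sequence above can be sketched as a toy in Python. This is an illustrative model, not PostgreSQL's record format; `SimpleWAL` and its fields are invented for the sketch:

```python
import json
import os
import tempfile

class SimpleWAL:
    """Toy write-ahead log: append a record, fsync, only then acknowledge."""

    def __init__(self, path):
        # Append-only file descriptor; every record goes at the end.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.next_lsn = 1

    def commit(self, txid, page, old, new):
        record = {"lsn": self.next_lsn, "txid": txid,
                  "page": page, "old": old, "new": new}
        self.next_lsn += 1
        # Step 4: write the WAL record (newline-delimited JSON for simplicity).
        os.write(self.fd, (json.dumps(record) + "\n").encode())
        # Step 5: force the record to stable storage BEFORE acknowledging.
        os.fsync(self.fd)
        # Step 6: only now is it safe to report COMMIT to the client.
        return record["lsn"]

path = os.path.join(tempfile.mkdtemp(), "toy.wal")
wal = SimpleWAL(path)
lsn = wal.commit(txid=100, page="X", old="v1", new="v2")  # lsn == 1
```

Real systems batch records and use compact binary formats, but the ordering (write, flush, then acknowledge) is the entire durability guarantee.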
3. Crash Recovery: The ARIES Algorithm
The ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) algorithm, published by C. Mohan and colleagues at IBM in 1992, is the basis for crash recovery in DB2 and SQL Server. PostgreSQL implements a simpler, redo-only variant: because MVCC keeps old row versions in the heap, no undo phase is needed during recovery. Full ARIES proceeds in three phases:
Phase 1: Analysis
Scan the WAL from the last checkpoint forward. Build a "dirty page table" (pages modified after checkpoint) and a "transaction table" (which transactions were active at crash time). Determine the redo start LSN (the earliest LSN of any change to a dirty page that may not be on disk).
Phase 2: Redo
Starting from the redo start LSN, replay all WAL records in order. For each record, if the corresponding page's on-disk LSN is older than the WAL record's LSN, apply the change. This brings the database to its exact state at the moment of the crash — including changes from transactions that were in-progress.
Phase 3: Undo
Roll back all transactions that were in-progress at the crash time (no COMMIT record in WAL). Process these in reverse LSN order, using the "old value" in each WAL record to restore the page to its pre-transaction state. Write Compensation Log Records (CLRs) to WAL so that undo operations themselves are recoverable from a crash-during-recovery.
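A drastically simplified model of the three phases, assuming the whole log fits in memory and every record carries old/new values. Real ARIES also maintains dirty-page and transaction tables, pageLSNs, and CLRs; this sketch only shows the phase structure:

```python
# Simplified ARIES-style recovery over an in-memory log of dict records.

def recover(log):
    # Phase 1: Analysis. Transactions with no COMMIT record are "losers".
    committed = {r["txid"] for r in log if r["op"] == "COMMIT"}
    losers = {r["txid"] for r in log if r["op"] == "UPDATE"} - committed

    pages = {}
    # Phase 2: Redo. Replay every change in LSN order, winners and losers alike
    # ("repeating history" up to the moment of the crash).
    for r in log:
        if r["op"] == "UPDATE":
            pages[r["page"]] = r["new"]

    # Phase 3: Undo. Roll back loser transactions in reverse LSN order,
    # restoring the old value recorded in each log entry.
    for r in reversed(log):
        if r["op"] == "UPDATE" and r["txid"] in losers:
            pages[r["page"]] = r["old"]
    return pages

log = [
    {"lsn": 1, "txid": 1, "op": "UPDATE", "page": "A", "old": 0, "new": 10},
    {"lsn": 2, "txid": 2, "op": "UPDATE", "page": "B", "old": 0, "new": 20},
    {"lsn": 3, "txid": 1, "op": "COMMIT"},
    # tx 2 never committed, so its change to page B must be undone
]
state = recover(log)  # {"A": 10, "B": 0}
```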
4. PostgreSQL WAL: Deep Dive
PostgreSQL writes WAL records to 16MB WAL segment files in the pg_wal directory. Key configuration parameters that directly affect durability and performance:
# postgresql.conf key WAL settings
# Durability vs. performance trade-off
synchronous_commit = on # Default: flush WAL before ACK
# synchronous_commit = off # Risk: up to 3x wal_writer_delay data loss
# synchronous_commit = remote_write # For replicas: write but don't fsync
# WAL write strategy
wal_sync_method = fdatasync # Most portable; fdatasync or fsync
# On Linux, fdatasync can be faster than fsync (skips unchanged metadata)
# Checkpoint frequency (affects crash recovery time)
checkpoint_timeout = 5min # Max 5min of WAL to replay on crash
max_wal_size = 1GB # Trigger checkpoint at 1GB WAL growth
# WAL buffer (in-memory WAL buffer before OS write)
wal_buffers = 64MB # Default auto-tunes to at most 16MB; raise for write-heavy workloads
# WAL compression (reduce I/O at cost of CPU)
wal_compression = on # ~20-30% WAL size reduction
Critical production insight: Setting synchronous_commit = off can yield a several-fold improvement in write throughput, but risks losing the most recently committed transactions on a crash: up to 3 × wal_writer_delay worth of commits, roughly 600ms at the default 200ms setting. This is acceptable for non-critical writes (analytics events, session data) but not for financial records.
Checkpoint tuning: More frequent checkpoints (smaller checkpoint_timeout) mean faster crash recovery (less WAL to replay) but higher sustained write amplification. For large OLTP databases, a 5–15 minute checkpoint interval with checkpoint_completion_target = 0.9 is the standard compromise.
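A quick back-of-envelope estimate makes this trade-off concrete. The rates below are illustrative assumptions, not benchmarks:

```python
# Rough worst-case redo estimate: WAL accumulated between checkpoints
# divided by the replay rate. All numbers are illustrative assumptions.
wal_gen_rate_mb_s = 20         # assumed sustained WAL generation under load
checkpoint_timeout_s = 5 * 60  # checkpoint_timeout = 5min
replay_rate_mb_s = 100         # assumed single-threaded redo throughput

max_wal_mb = wal_gen_rate_mb_s * checkpoint_timeout_s   # WAL to replay, worst case
worst_case_recovery_s = max_wal_mb / replay_rate_mb_s   # redo time estimate
print(f"Up to {max_wal_mb} MB of WAL to replay, ~{worst_case_recovery_s:.0f}s redo")
```

Plugging in your own measured WAL generation and replay rates turns checkpoint_timeout from a guess into an RTO calculation.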
5. Kafka's Commit Log: WAL at Messaging Scale
Kafka's architecture is fundamentally a distributed WAL. Each topic partition is an append-only, immutable log segment file on disk. Producers write to the end of the log; consumers read from any offset. This WAL design gives Kafka its exceptional throughput characteristics: sequential writes saturate disk bandwidth far more efficiently than random writes.
Kafka's durability model: Durability is controlled by the producer's acks setting and the broker's min.insync.replicas:
- acks=0: Fire and forget; no durability guarantee.
- acks=1: Leader acknowledges after writing to its local log; a leader crash before replication means data loss.
- acks=all with min.insync.replicas=2: Acknowledged only after 2 replicas have written the record to their local logs. Strong durability guarantee.
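Rendered as configuration, the strongest of these settings looks roughly like this. The keys are standard Kafka producer and broker settings, but defaults vary by version, so check your client and broker documentation:

```properties
# Producer configuration
acks=all
enable.idempotence=true

# Broker / topic configuration
default.replication.factor=3
min.insync.replicas=2
```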
Kafka also maintains an offset index (.index) and a time index (.timeindex) file per segment. After an unclean shutdown, the broker verifies and rebuilds any inconsistent indexes by rescanning the log segments (classic WAL-based recovery), which is why recovery after an unclean restart takes longer as segment data grows.
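The rebuild cost is easy to see with a toy model. Kafka's real on-disk format is different (record batches, CRCs, varint encodings); this sketch uses a plain length-prefixed layout just to show that index rebuild is a full sequential scan of the segment:

```python
import struct

# Toy segment: each record is a 4-byte big-endian length prefix + payload.

def append(segment: bytearray, payload: bytes) -> None:
    segment += struct.pack(">I", len(payload)) + payload

def rebuild_index(segment: bytes) -> dict:
    """Map logical offset -> byte position by scanning the whole segment."""
    index, pos, offset = {}, 0, 0
    while pos < len(segment):
        (length,) = struct.unpack_from(">I", segment, pos)
        index[offset] = pos          # remember where this record starts
        pos += 4 + length            # skip prefix + payload
        offset += 1
    return index

seg = bytearray()
for msg in (b"alpha", b"beta", b"gamma"):
    append(seg, msg)
idx = rebuild_index(bytes(seg))  # {0: 0, 1: 9, 2: 17}
```

The scan is O(segment size), so doubling segment data doubles worst-case rebuild time.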
6. WAL-Based Replication and Change Data Capture
PostgreSQL's WAL is the foundation of logical replication and CDC. The WAL contains every row-level change — inserts, updates, deletes. Logical decoding plugins (pgoutput, wal2json, decoderbufs) transform the binary WAL records into human-readable or structured change events.
Tools like Debezium use this to stream database changes into Kafka with at-least-once delivery guarantees. The architecture:
PostgreSQL WAL (logical decoding)
↓
Debezium Connector
↓
Kafka Topic (change events)
↓
Search Index / Cache / Analytics DB
Critical production concern: Logical replication slots hold WAL until the consumer confirms processing. A stalled Debezium consumer means WAL accumulates on the primary indefinitely, eventually filling the disk and crashing the PostgreSQL server. Always monitor replication slot lag and set max_slot_wal_keep_size as a circuit breaker.
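One way to watch retained WAL per slot is a query along these lines (the pg_replication_slots view and these function names apply to PostgreSQL 10 and later; verify column names against your version's documentation):

```sql
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```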
7. Production Failure Scenarios
Scenario: WAL Disk Full → Database Crash
A PostgreSQL instance serving a high-write workload crashed because the WAL disk partition filled up. The WAL writer could not flush the next record, so all write transactions blocked indefinitely and the server eventually shut down. Root cause: a runaway bulk import job generated 500GB of WAL in 2 hours. Fix: separate WAL onto its own partition, monitor WAL directory size, and set max_wal_size (a soft limit that triggers checkpoints, not a hard cap) with alerting at 70% of available WAL disk.
Scenario: Long Recovery After Crash
After a power failure, a database took 45 minutes to restart because checkpoint_timeout was set to 60 minutes. The ARIES redo phase had to replay 60 minutes of WAL. Fix: Reduce checkpoint_timeout to 5 minutes with checkpoint_completion_target = 0.9. Recovery time dropped to 5 minutes maximum.
8. Trade-offs: Durability vs. Performance
- fsync = off: Never do this in production. If the OS crashes with unflushed WAL still in the page cache, the on-disk WAL and data files can be left inconsistent in ways recovery cannot repair; the database may be unrecoverably corrupted, and the amount of lost data is unbounded.
- Group commit: Modern databases batch WAL flushes across multiple transactions (group commit), amortizing the fsync overhead. PostgreSQL's commit_delay + commit_siblings enables this explicitly; Kafka's linger.ms is the producer-side equivalent.
- Asynchronous replication: Accepting a small window of data loss (the replica might lag by 100ms) in exchange for significantly lower write latency on the primary is a valid production trade-off for non-financial data.
9. Designing WAL Into Your Own System
When building custom stateful services (e.g., a distributed cache, an order matching engine), WAL principles apply directly:
- Append an intent record to a durable log before applying the change to your in-memory state.
- Include an LSN in each log record, and periodically write a checkpoint record indicating the LSN up to which the durable state already reflects the log, so replay can start there.
- On startup, replay log records from the last checkpoint forward to reconstruct in-memory state.
- Use an append-only file with periodic rotation (similar to PostgreSQL WAL segment files) rather than a single ever-growing file.
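Putting those points together, here is a minimal sketch of a recoverable key-value store under the stated assumptions (single-threaded, newline-delimited JSON records, no segment rotation). All names are invented for illustration:

```python
import json
import os

class RecoverableKV:
    """WAL + checkpoint: log intent, apply in memory, snapshot, replay on start."""

    def __init__(self, wal_path, ckpt_path):
        self.wal_path, self.ckpt_path = wal_path, ckpt_path
        self.state, self.lsn = {}, 0
        self._recover()
        self.wal = open(wal_path, "a")

    def _recover(self):
        # Load the last checkpoint: a snapshot of state plus the LSN it covers.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                ckpt = json.load(f)
            self.state, self.lsn = ckpt["state"], ckpt["lsn"]
        # Replay only WAL records newer than the checkpoint.
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["lsn"] > self.lsn:
                        self.state[rec["key"]] = rec["value"]
                        self.lsn = rec["lsn"]

    def put(self, key, value):
        self.lsn += 1
        # Intent goes to the durable log first...
        self.wal.write(json.dumps(
            {"lsn": self.lsn, "key": key, "value": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # ...and only then to in-memory state.
        self.state[key] = value

    def checkpoint(self):
        # Snapshot atomically: write to a temp file, fsync, then rename.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"state": self.state, "lsn": self.lsn}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt_path)
```

On restart, constructing a new RecoverableKV against the same files reconstructs the exact pre-crash state: checkpoint first, then replay of the WAL tail.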
This pattern is the foundation of Apache Flink's state backend, RocksDB's write-ahead log, and Redis's Append-Only File (AOF) — all variations on the same WAL concept.
10. Key Takeaways
- WAL provides durability by persisting intent before effect — the "write ahead" invariant is the core guarantee.
- ARIES crash recovery (Analysis → Redo → Undo) is the gold standard; understand it to reason about any database recovery.
- PostgreSQL WAL controls durability granularity via synchronous_commit; choose your durability/latency trade-off explicitly.
- Kafka's commit log is WAL at messaging scale: sequential writes plus replication give both durability and throughput.
- Monitor replication slot lag in PostgreSQL CDC setups; a stalled consumer can crash the entire primary by filling the WAL disk.
- Checkpoint frequency directly controls crash recovery time; tune it to meet your RTO requirements.
Conclusion
Write-Ahead Logging is one of those foundational concepts that pays compound interest. Every time you reason about database durability, replication lag, CDC reliability, or stateful service recovery, you're reasoning about WAL. Engineers who understand WAL deeply make better decisions at every layer of the stack.