System Design · March 19, 2026 · 21 min read · Distributed Systems Failure Handling Series

Write-Ahead Logging (WAL) Internals: Crash Recovery and Durability in Distributed Databases

Write-Ahead Logging is one of the most foundational techniques in database engineering — and one of the least understood outside of database internals specialists. Understanding WAL deeply means understanding how systems like PostgreSQL, Kafka, CockroachDB, and etcd provide durability guarantees, and how to design your own systems with the same guarantees.

Table of Contents

  1. The Durability Problem: Why WAL Exists
  2. WAL Mechanics: The Core Algorithm
  3. Crash Recovery: The ARIES Algorithm
  4. PostgreSQL WAL: Deep Dive
  5. Kafka's Commit Log: WAL at Messaging Scale
  6. WAL-Based Replication and Change Data Capture
  7. Production Failure Scenarios
  8. Trade-offs: Durability vs. Performance
  9. Designing WAL Into Your Own System
  10. Key Takeaways

1. The Durability Problem: Why WAL Exists

Modern databases store their primary data structures in memory for performance — B-trees, hash indexes, buffer pools. A write to a database page goes to the in-memory buffer first; it gets flushed to disk later via a background checkpoint process. This creates a critical gap: if the system crashes after a transaction commits but before the dirty buffer page is flushed to disk, the committed data is lost.

This is the durability problem. The D in ACID says committed transactions must survive failures. WAL is the solution: before any change is applied to the in-memory data structures, a description of that change is written to the WAL — a sequential, append-only log file. The WAL write is synchronously flushed to disk before the transaction acknowledgment is returned to the client.

Key insight: Sequential disk writes are dramatically faster than random writes. WAL is designed to be sequential — each new log record is appended to the end. This makes WAL I/O fast enough to not be a bottleneck in most workloads, while the random writes to data pages can be deferred and batched.

2. WAL Mechanics: The Core Algorithm

The WAL protocol enforces two invariants:

  1. Write-ahead rule: a modified (dirty) data page may be written back to disk only after the WAL records describing its changes have been flushed to stable storage.
  2. Force-log-at-commit: a transaction is acknowledged as committed only after its COMMIT record has been flushed to the WAL.

Every WAL record contains a Log Sequence Number (LSN) — a monotonically increasing identifier. During crash recovery, LSNs determine which changes have already reached disk and which need redo/undo.

Write path with WAL:

1. Transaction starts
2. READ page X into buffer pool (if not already there)
3. MODIFY page X in buffer (in-memory change)
4. WRITE WAL record: [LSN=42, TxID=100, PageX: old=v1, new=v2]
5. FLUSH WAL to disk (fdatasync)
6. Return COMMIT to client ← durability guaranteed here
7. (Later) Background checkpoint: flush dirty page X to disk
   Mark page X as flushed at LSN=42
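The write path above can be condensed into a few lines of Python. This is a toy illustration under simplifying assumptions — one page, JSON-encoded records, no checkpointing; the `WAL` class, `commit` function, and the file name `demo.wal` are invented for this example:

```python
import json, os

class WAL:
    """Minimal sketch of the write-ahead path (illustrative, not production code)."""
    def __init__(self, path):
        self.f = open(path, "ab")   # append-only: WAL I/O is strictly sequential
        self.lsn = 0

    def append(self, record):
        # Step 4: describe the change and append it to the log.
        self.lsn += 1
        record["lsn"] = self.lsn
        self.f.write(json.dumps(record).encode() + b"\n")
        # Step 5: flush to the OS cache, then force to stable storage.
        self.f.flush()
        os.fsync(self.f.fileno())   # the durability point
        return self.lsn

buffer_pool = {"X": "v1"}           # stand-in for in-memory data pages

def commit(wal, txid, page, new_value):
    old_value = buffer_pool[page]   # steps 2-3: read and modify in memory
    buffer_pool[page] = new_value
    # Steps 4-5: the WAL record reaches disk BEFORE the client is ACKed;
    # the dirty page itself can be flushed much later by a checkpoint.
    wal.append({"tx": txid, "page": page, "old": old_value, "new": new_value})
    return "COMMIT"                 # step 6: durable, safe to acknowledge

wal = WAL("demo.wal")
commit(wal, 100, "X", "v2")
```

Note the ordering: the in-memory page may change first, but nothing is acknowledged until the log record is on disk.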

3. Crash Recovery: The ARIES Algorithm

The ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) algorithm, published by IBM researchers in 1992, is the basis for crash recovery in DB2 and SQL Server and influenced most other databases (PostgreSQL implements a simplified, redo-only variant — MVCC makes WAL-based undo unnecessary there). It proceeds in three phases:

Phase 1: Analysis

Scan the WAL from the last checkpoint forward. Build a "dirty page table" (pages modified after checkpoint) and a "transaction table" (which transactions were active at crash time). Determine the redo start LSN (the earliest LSN of any change to a dirty page that may not be on disk).

Phase 2: Redo

Starting from the redo start LSN, replay all WAL records in order. For each record, if the corresponding page's on-disk LSN is older than the WAL record's LSN, apply the change. This brings the database to its exact state at the moment of the crash — including changes from transactions that were in-progress.

Phase 3: Undo

Roll back all transactions that were in-progress at the crash time (no COMMIT record in WAL). Process these in reverse LSN order, using the "old value" in each WAL record to restore the page to its pre-transaction state. Write Compensation Log Records (CLRs) to WAL so that undo operations themselves are recoverable from a crash-during-recovery.
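The three phases can be condensed into a toy Python sketch. It is heavily simplified — the whole log fits in memory, records are dicts shaped like the WAL record from earlier, Analysis is reduced to finding "loser" transactions, and CLRs are only noted in a comment; real ARIES also handles checkpoints and crashes during recovery:

```python
def recover(log, disk_pages, page_lsns):
    # Phase 1: Analysis -- separate committed transactions from in-flight
    # "losers" (no COMMIT record in the log).
    committed = {r["tx"] for r in log if r.get("commit")}
    losers = {r["tx"] for r in log if not r.get("commit")} - committed

    # Phase 2: Redo -- replay every change whose effect may not be on disk
    # (the page's on-disk LSN is older than the record's LSN).
    for r in log:
        if "page" in r and page_lsns.get(r["page"], 0) < r["lsn"]:
            disk_pages[r["page"]] = r["new"]
            page_lsns[r["page"]] = r["lsn"]

    # Phase 3: Undo -- roll back losers in reverse LSN order using old values.
    # A real system would write a CLR for each undo so that a crash during
    # recovery is itself recoverable.
    for r in reversed(log):
        if "page" in r and r["tx"] in losers:
            disk_pages[r["page"]] = r["old"]
    return disk_pages

log = [
    {"lsn": 1, "tx": 100, "page": "X", "old": "v1", "new": "v2"},
    {"lsn": 2, "tx": 100, "commit": True},
    {"lsn": 3, "tx": 101, "page": "Y", "old": "a", "new": "b"},  # never committed
]
recovered = recover(log, {}, {})  # X redone to "v2"; Y undone back to "a"
```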

4. PostgreSQL WAL: Deep Dive

PostgreSQL writes WAL records to 16MB WAL segment files in the pg_wal directory. Key configuration parameters that directly affect durability and performance:

# postgresql.conf key WAL settings

# Durability vs. performance trade-off
synchronous_commit = on          # Default: flush WAL before ACK
# synchronous_commit = off      # Risk: up to 3x wal_writer_delay data loss
# synchronous_commit = remote_write  # For replicas: write but don't fsync

# WAL write strategy
wal_sync_method = fdatasync      # Most portable; fdatasync or fsync
# On Linux, fdatasync is ~15% faster than fsync (no metadata update)

# Checkpoint frequency (affects crash recovery time)
checkpoint_timeout = 5min        # Max 5min of WAL to replay on crash
max_wal_size = 1GB               # Trigger checkpoint at 1GB WAL growth

# WAL buffer (in-memory buffer before WAL reaches the OS)
wal_buffers = 16MB               # -1 (default) auto-tunes; gains beyond
                                 # one 16MB segment are rare

# WAL compression (reduce I/O at cost of CPU)
wal_compression = on             # ~20-30% WAL size reduction

Critical production insight: Setting synchronous_commit = off can yield roughly 3x write throughput but risks losing the most recently committed transactions on crash — up to three times wal_writer_delay of work, i.e. about 600ms at the default 200ms setting. Crucially, the risk is data loss, not corruption: the database still recovers to a consistent state. This is acceptable for non-critical writes (analytics events, session data) but not for financial records.
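Because synchronous_commit is settable per transaction, this trade-off doesn't have to be made globally. A sketch (the analytics_events table is hypothetical):

```sql
-- Keep synchronous_commit = on globally, but relax durability for one
-- non-critical transaction:
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO analytics_events (payload) VALUES ('{"page": "home"}');
COMMIT;  -- returns without waiting for the WAL flush
```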

Checkpoint tuning: More frequent checkpoints (smaller checkpoint_timeout) mean faster crash recovery (less WAL to replay) but higher sustained write amplification. For large OLTP databases, a 5–15 minute checkpoint interval with checkpoint_completion_target = 0.9 is the standard compromise.

5. Kafka's Commit Log: WAL at Messaging Scale

Kafka's architecture is fundamentally a distributed WAL. Each topic partition is an append-only log, stored on disk as a sequence of immutable segment files. Producers write to the end of the log; consumers read from any offset. This WAL design gives Kafka its exceptional throughput characteristics: sequential writes saturate disk bandwidth far more efficiently than random writes.

Kafka's durability model: Durability is controlled by the producer's acks setting and the broker's min.insync.replicas:

  • acks=0 — fire and forget; no durability guarantee at all.
  • acks=1 — the partition leader has written the record to its local log; the record is lost if the leader fails before replication.
  • acks=all — all in-sync replicas have the record; combined with min.insync.replicas=2 and a replication factor of 3, this survives the loss of any single broker.

Kafka also maintains an offset index (.index) and a time index (.timeindex) file per segment. After an unclean shutdown, these are rebuilt by rescanning the log segments — classic WAL-based recovery — which is why brokers with many large, unflushed segments take longer to restart.
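A durability-oriented pairing of producer and topic settings might look like this (values are illustrative, not a recommendation for every workload):

```properties
# Topic / broker side (with replication.factor=3):
min.insync.replicas=2            # a write needs 2 replicas before it counts

# Producer side:
acks=all                         # wait for all in-sync replicas
enable.idempotence=true          # dedupe retried sends
delivery.timeout.ms=120000       # upper bound on retries per record
```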

6. WAL-Based Replication and Change Data Capture

PostgreSQL's WAL is the foundation of logical replication and CDC. The WAL contains every row-level change — inserts, updates, deletes. Logical decoding plugins (pgoutput, wal2json, decoderbufs) transform the binary WAL records into human-readable or structured change events.

Tools like Debezium use this to stream database changes into Kafka with at-least-once delivery guarantees. The architecture:

PostgreSQL WAL → Logical Replication Slot → Debezium Connector
                                                    ↓
                                     Kafka Topic (change events)
                                                    ↓
                             Search Index / Cache / Analytics DB

Critical production concern: Logical replication slots hold WAL until the consumer confirms processing. A stalled Debezium consumer means WAL accumulates on the primary indefinitely, eventually filling the disk and crashing the PostgreSQL server. Always monitor replication slot lag and set max_slot_wal_keep_size as a circuit breaker.
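Slot retention can be watched with a query against pg_replication_slots; pg_current_wal_lsn and pg_wal_lsn_diff are standard PostgreSQL functions (version 10+):

```sql
-- How much WAL is each slot forcing the primary to retain?
SELECT slot_name,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```

Alert when retained_wal approaches your WAL disk headroom, and treat an inactive slot as an incident, not a curiosity.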

7. Production Failure Scenarios

Scenario: WAL Disk Full → Database Crash

A PostgreSQL instance serving a high-write workload crashed because the WAL disk partition filled up. The WAL writer could not flush the next record, causing all write transactions to block indefinitely and, eventually, the server to shut down. Root cause: a runaway bulk import job generated 500GB of WAL in 2 hours. Fix: Separate WAL onto its own partition, monitor WAL directory size, and set max_wal_size with alerting at 70% of available WAL disk.

Scenario: Long Recovery After Crash

After a power failure, a database took 45 minutes to restart because checkpoint_timeout was set to 60 minutes. The ARIES redo phase had to replay 60 minutes of WAL. Fix: Reduce checkpoint_timeout to 5 minutes with checkpoint_completion_target = 0.9. Recovery time dropped to 5 minutes maximum.

8. Trade-offs: Durability vs. Performance

Every WAL decision in this post sits on the same spectrum: latency and I/O spent per operation versus how much committed work you can afford to lose, and how long recovery may take.

  • synchronous_commit: on gives full per-commit durability; off gives roughly 3x throughput but risks up to three wal_writer_delay intervals of lost commits.
  • Checkpoint frequency: frequent checkpoints shorten crash recovery but raise write amplification; infrequent ones do the opposite.
  • wal_compression: less WAL I/O and replication traffic, more CPU per record.
  • Kafka acks=all with min.insync.replicas=2: per-message durability across brokers at the cost of produce latency.

9. Designing WAL Into Your Own System

When building custom stateful services (e.g., a distributed cache, an order matching engine), WAL principles apply directly:

  1. Append a record describing each state change to a local log and fsync it before acknowledging the operation.
  2. Only then apply the change to the in-memory state.
  3. Periodically snapshot the in-memory state and truncate the log up to the snapshot.
  4. On restart, load the latest snapshot and replay the log tail.

This pattern is the foundation of Apache Flink's state backend, RocksDB's write-ahead log, and Redis's Append-Only File (AOF) — all variations on the same WAL concept.
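That pattern can be sketched as a toy durable key-value store. This is illustrative only — no snapshots, log truncation, checksums, or concurrency; the `DurableKV` class and its JSON log format are invented for this example:

```python
import json, os

class DurableKV:
    """Toy append-only-log key-value store: every write is logged and fsynced
    before it is applied in memory; restart recovers state by replaying the
    log. Snapshots/truncation (step 3 above) are elided."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()                       # crash recovery = replay the log
        self.log = open(log_path, "ab")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "rb") as f:
            for line in f:
                op = json.loads(line)
                self.data[op["k"]] = op["v"]

    def set(self, k, v):
        rec = json.dumps({"op": "set", "k": k, "v": v}).encode() + b"\n"
        self.log.write(rec)
        self.log.flush()
        os.fsync(self.log.fileno())          # durable before we apply and ACK
        self.data[k] = v

kv = DurableKV("toy.log")
kv.set("user:1", "alice")
```

Restarting the process (constructing a fresh DurableKV over the same file) rebuilds the same in-memory state from the log.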

10. Key Takeaways

  • WAL provides durability by persisting intent before effect — the "write ahead" invariant is the core guarantee.
  • ARIES crash recovery (Analysis → Redo → Undo) is the gold standard; understand it to reason about any database recovery.
  • PostgreSQL WAL controls durability granularity via synchronous_commit; choose your durability/latency trade-off explicitly.
  • Kafka's commit log is WAL at messaging scale — sequential writes + replication give both durability and throughput.
  • Monitor replication slot lag in PostgreSQL CDC setups; a stalled consumer can crash the entire primary by filling the WAL disk.
  • Checkpoint frequency directly controls crash recovery time; tune it to meet your RTO requirements.

Conclusion

Write-Ahead Logging is one of those foundational concepts that pays compound interest. Every time you reason about database durability, replication lag, CDC reliability, or stateful service recovery, you're reasoning about WAL. Engineers who understand WAL deeply make better decisions at every layer of the stack.

Md Sanwar Hossain

Software Engineer · Java · Spring Boot · Distributed Systems · System Design
