Write-Ahead Logging (WAL) Internals: Crash Recovery and Durability in Distributed Databases
Write-Ahead Logging is one of the most foundational techniques in database engineering — and one of the least understood outside of database internals specialists. Understanding WAL deeply means understanding how systems like PostgreSQL, Kafka, CockroachDB, and etcd provide durability guarantees, and how to design your own systems with the same guarantees.
Table of Contents
- The Durability Problem: Why WAL Exists
- WAL Mechanics: The Core Algorithm
- Crash Recovery: The ARIES Algorithm
- PostgreSQL WAL: Deep Dive
- Kafka's Commit Log: WAL at Messaging Scale
- WAL-Based Replication and Change Data Capture
- Production Failure Scenarios
- Trade-offs: Durability vs. Performance
- Designing WAL Into Your Own System
- Key Takeaways
1. The Durability Problem: Why WAL Exists
Modern databases store their primary data structures in memory for performance — B-trees, hash indexes, buffer pools. A write to a database page goes to the in-memory buffer first; it gets flushed to disk later via a background checkpoint process. This creates a critical gap: if the system crashes after a transaction commits but before the dirty buffer page is flushed to disk, the committed data is lost.
This is the durability problem. The D in ACID says committed transactions must survive failures. WAL is the solution: before any change is applied to the in-memory data structures, a description of that change is written to the WAL — a sequential, append-only log file. The WAL write is synchronously flushed to disk before the transaction acknowledgment is returned to the client.
2. WAL Mechanics: The Core Algorithm
The WAL algorithm enforces two invariants:
- WAL Rule 1 (Undo Logging): Before a data page is written to disk, all WAL records for changes on that page (including the old values, for undo) must be written to the WAL and flushed to disk. This ensures that if a crash occurs after the data page write, the WAL contains enough information to undo the partial write.
- WAL Rule 2 (Redo Logging): Before a transaction COMMIT record is written and flushed to WAL, all WAL records for changes in that transaction must be on disk. This ensures that after a crash, any transaction whose COMMIT is in the WAL can be fully replayed (redone).
Every WAL record contains a Log Sequence Number (LSN) — a monotonically increasing identifier. LSNs are used during crash recovery to determine which changes have been applied and which need redo/undo.
1. Transaction starts
2. READ page X into buffer pool (if not already there)
3. MODIFY page X in buffer (in-memory change)
4. WRITE WAL record: [LSN=42, TxID=100, PageX: old=v1, new=v2]
5. FLUSH WAL to disk (fdatasync)
6. Return COMMIT to client ← durability guaranteed here
7. (Later) Background checkpoint: flush dirty page X to disk,
   then mark page X as flushed at LSN=42
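The sequence above can be sketched as a toy in Python. This is an illustrative model, not PostgreSQL's record format; `SimpleWAL` and its fields are invented for the sketch:

```python
import json
import os
import tempfile

class SimpleWAL:
    """Toy write-ahead log: append a record, fsync, only then acknowledge."""

    def __init__(self, path):
        # Append-only file descriptor; every record goes at the end.
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.next_lsn = 1

    def commit(self, txid, page, old, new):
        record = {"lsn": self.next_lsn, "txid": txid,
                  "page": page, "old": old, "new": new}
        self.next_lsn += 1
        # Step 4: write the WAL record (newline-delimited JSON for simplicity).
        os.write(self.fd, (json.dumps(record) + "\n").encode())
        # Step 5: force the record to stable storage BEFORE acknowledging.
        os.fsync(self.fd)
        # Step 6: only now is it safe to report COMMIT to the client.
        return record["lsn"]

path = os.path.join(tempfile.mkdtemp(), "toy.wal")
wal = SimpleWAL(path)
lsn = wal.commit(txid=100, page="X", old="v1", new="v2")  # lsn == 1
```

Real systems batch records and use compact binary formats, but the ordering (write, flush, then acknowledge) is the entire durability guarantee.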
3. Crash Recovery: The ARIES Algorithm
The ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) algorithm, published by C. Mohan and colleagues at IBM in 1992, is the basis for crash recovery in DB2 and SQL Server. PostgreSQL implements a simpler, redo-only variant: because MVCC keeps old row versions in the heap, no undo phase is needed during recovery. Full ARIES proceeds in three phases:
Phase 1: Analysis
Scan the WAL from the last checkpoint forward. Build a "dirty page table" (pages modified after checkpoint) and a "transaction table" (which transactions were active at crash time). Determine the redo start LSN (the earliest LSN of any change to a dirty page that may not be on disk).
Phase 2: Redo
Starting from the redo start LSN, replay all WAL records in order. For each record, if the corresponding page's on-disk LSN is older than the WAL record's LSN, apply the change. This brings the database to its exact state at the moment of the crash — including changes from transactions that were in-progress.
Phase 3: Undo
Roll back all transactions that were in-progress at the crash time (no COMMIT record in WAL). Process these in reverse LSN order, using the "old value" in each WAL record to restore the page to its pre-transaction state. Write Compensation Log Records (CLRs) to WAL so that undo operations themselves are recoverable from a crash-during-recovery.
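A drastically simplified model of the three phases, assuming the whole log fits in memory and every record carries old/new values. Real ARIES also maintains dirty-page and transaction tables, pageLSNs, and CLRs; this sketch only shows the phase structure:

```python
# Simplified ARIES-style recovery over an in-memory log of dict records.

def recover(log):
    # Phase 1: Analysis. Transactions with no COMMIT record are "losers".
    committed = {r["txid"] for r in log if r["op"] == "COMMIT"}
    losers = {r["txid"] for r in log if r["op"] == "UPDATE"} - committed

    pages = {}
    # Phase 2: Redo. Replay every change in LSN order, winners and losers alike
    # ("repeating history" up to the moment of the crash).
    for r in log:
        if r["op"] == "UPDATE":
            pages[r["page"]] = r["new"]

    # Phase 3: Undo. Roll back loser transactions in reverse LSN order,
    # restoring the old value recorded in each log entry.
    for r in reversed(log):
        if r["op"] == "UPDATE" and r["txid"] in losers:
            pages[r["page"]] = r["old"]
    return pages

log = [
    {"lsn": 1, "txid": 1, "op": "UPDATE", "page": "A", "old": 0, "new": 10},
    {"lsn": 2, "txid": 2, "op": "UPDATE", "page": "B", "old": 0, "new": 20},
    {"lsn": 3, "txid": 1, "op": "COMMIT"},
    # tx 2 never committed, so its change to page B must be undone
]
state = recover(log)  # {"A": 10, "B": 0}
```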
4. PostgreSQL WAL: Deep Dive
PostgreSQL writes WAL records to 16MB WAL segment files in the pg_wal directory. Key configuration parameters that directly affect durability and performance:
# postgresql.conf key WAL settings
# Durability vs. performance trade-off
synchronous_commit = on # Default: flush WAL before ACK
# synchronous_commit = off # Risk: up to 3x wal_writer_delay data loss
# synchronous_commit = remote_write # For replicas: write but don't fsync
# WAL write strategy
wal_sync_method = fdatasync # Most portable; fdatasync or fsync
# On Linux, fdatasync can be faster than fsync (skips unchanged metadata)
# Checkpoint frequency (affects crash recovery time)
checkpoint_timeout = 5min # Max 5min of WAL to replay on crash
max_wal_size = 1GB # Trigger checkpoint at 1GB WAL growth
# WAL buffer (in-memory WAL buffer before OS write)
wal_buffers = 64MB # Default auto-tunes to at most 16MB; raise for write-heavy workloads
# WAL compression (reduce I/O at cost of CPU)
wal_compression = on # ~20-30% WAL size reduction
Critical production insight: Setting synchronous_commit = off can yield a several-fold improvement in write throughput, but risks losing the most recently committed transactions on a crash: up to 3 × wal_writer_delay worth of commits, roughly 600ms at the default 200ms setting. This is acceptable for non-critical writes (analytics events, session data) but not for financial records.
Checkpoint tuning: More frequent checkpoints (smaller checkpoint_timeout) mean faster crash recovery (less WAL to replay) but higher sustained write amplification. For large OLTP databases, a 5–15 minute checkpoint interval with checkpoint_completion_target = 0.9 is the standard compromise.
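A quick back-of-envelope estimate makes this trade-off concrete. The rates below are illustrative assumptions, not benchmarks:

```python
# Rough worst-case redo estimate: WAL accumulated between checkpoints
# divided by the replay rate. All numbers are illustrative assumptions.
wal_gen_rate_mb_s = 20         # assumed sustained WAL generation under load
checkpoint_timeout_s = 5 * 60  # checkpoint_timeout = 5min
replay_rate_mb_s = 100         # assumed single-threaded redo throughput

max_wal_mb = wal_gen_rate_mb_s * checkpoint_timeout_s   # WAL to replay, worst case
worst_case_recovery_s = max_wal_mb / replay_rate_mb_s   # redo time estimate
print(f"Up to {max_wal_mb} MB of WAL to replay, ~{worst_case_recovery_s:.0f}s redo")
```

Plugging in your own measured WAL generation and replay rates turns checkpoint_timeout from a guess into an RTO calculation.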
5. Kafka's Commit Log: WAL at Messaging Scale
Kafka's architecture is fundamentally a distributed WAL. Each topic partition is an append-only, immutable log segment file on disk. Producers write to the end of the log; consumers read from any offset. This WAL design gives Kafka its exceptional throughput characteristics: sequential writes saturate disk bandwidth far more efficiently than random writes.
Kafka's durability model: Durability is controlled by the producer's acks setting and the broker's min.insync.replicas:
- acks=0: Fire and forget; no durability guarantee.
- acks=1: Leader acknowledges after writing to its local log; a leader crash before replication means data loss.
- acks=all with min.insync.replicas=2: Acknowledged only after 2 replicas have written the record to their local logs. Strong durability guarantee.
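Rendered as configuration, the strongest of these settings looks roughly like this. The keys are standard Kafka producer and broker settings, but defaults vary by version, so check your client and broker documentation:

```properties
# Producer configuration
acks=all
enable.idempotence=true

# Broker / topic configuration
default.replication.factor=3
min.insync.replicas=2
```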
Kafka also maintains an offset index (.index) and a time index (.timeindex) file per segment. After an unclean shutdown, the broker verifies and rebuilds any inconsistent indexes by rescanning the log segments (classic WAL-based recovery), which is why recovery after an unclean restart takes longer as segment data grows.
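The rebuild cost is easy to see with a toy model. Kafka's real on-disk format is different (record batches, CRCs, varint encodings); this sketch uses a plain length-prefixed layout just to show that index rebuild is a full sequential scan of the segment:

```python
import struct

# Toy segment: each record is a 4-byte big-endian length prefix + payload.

def append(segment: bytearray, payload: bytes) -> None:
    segment += struct.pack(">I", len(payload)) + payload

def rebuild_index(segment: bytes) -> dict:
    """Map logical offset -> byte position by scanning the whole segment."""
    index, pos, offset = {}, 0, 0
    while pos < len(segment):
        (length,) = struct.unpack_from(">I", segment, pos)
        index[offset] = pos          # remember where this record starts
        pos += 4 + length            # skip prefix + payload
        offset += 1
    return index

seg = bytearray()
for msg in (b"alpha", b"beta", b"gamma"):
    append(seg, msg)
idx = rebuild_index(bytes(seg))  # {0: 0, 1: 9, 2: 17}
```

The scan is O(segment size), so doubling segment data doubles worst-case rebuild time.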
6. WAL-Based Replication and Change Data Capture
PostgreSQL's WAL is the foundation of logical replication and CDC. The WAL contains every row-level change — inserts, updates, deletes. Logical decoding plugins (pgoutput, wal2json, decoderbufs) transform the binary WAL records into human-readable or structured change events.
Tools like Debezium use this to stream database changes into Kafka with at-least-once delivery guarantees. The architecture:
PostgreSQL WAL (logical decoding)
↓
Debezium Connector
↓
Kafka Topic (change events)
↓
Search Index / Cache / Analytics DB
Critical production concern: Logical replication slots hold WAL until the consumer confirms processing. A stalled Debezium consumer means WAL accumulates on the primary indefinitely, eventually filling the disk and crashing the PostgreSQL server. Always monitor replication slot lag and set max_slot_wal_keep_size as a circuit breaker.
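One way to watch retained WAL per slot is a query along these lines (the pg_replication_slots view and these function names apply to PostgreSQL 10 and later; verify column names against your version's documentation):

```sql
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```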
7. Production Failure Scenarios
Scenario: WAL Disk Full → Database Crash
A PostgreSQL instance serving a high-write workload crashed because the WAL disk partition filled up. The WAL writer could not flush the next record, so all write transactions blocked indefinitely and the server eventually shut down. Root cause: a runaway bulk import job generated 500GB of WAL in 2 hours. Fix: separate WAL onto its own partition, monitor WAL directory size, and set max_wal_size (a soft limit that triggers checkpoints, not a hard cap) with alerting at 70% of available WAL disk.
Scenario: Long Recovery After Crash
After a power failure, a database took 45 minutes to restart because checkpoint_timeout was set to 60 minutes. The ARIES redo phase had to replay 60 minutes of WAL. Fix: Reduce checkpoint_timeout to 5 minutes with checkpoint_completion_target = 0.9. Recovery time dropped to 5 minutes maximum.
8. Trade-offs: Durability vs. Performance
- fsync = off: Never do this in production. If the OS crashes with unflushed WAL still in the page cache, the on-disk WAL and data files can be left inconsistent in ways recovery cannot repair; the database may be unrecoverably corrupted, and the amount of lost data is unbounded.
- Group commit: Modern databases batch WAL flushes across multiple transactions (group commit), amortizing the fsync overhead. PostgreSQL's commit_delay + commit_siblings enables this explicitly; Kafka's linger.ms is the producer-side equivalent.
- Asynchronous replication: Accepting a small window of data loss (the replica might lag by 100ms) in exchange for significantly lower write latency on the primary is a valid production trade-off for non-financial data.
9. Designing WAL Into Your Own System
When building custom stateful services (e.g., a distributed cache, an order matching engine), WAL principles apply directly:
- Append an intent record to a durable log before applying the change to your in-memory state.
- Include an LSN in each log record, and periodically write a checkpoint record indicating the LSN up to which the durable state already reflects the log, so replay can start there.
- On startup, replay log records from the last checkpoint forward to reconstruct in-memory state.
- Use an append-only file with periodic rotation (similar to PostgreSQL WAL segment files) rather than a single ever-growing file.
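Putting those points together, here is a minimal sketch of a recoverable key-value store under the stated assumptions (single-threaded, newline-delimited JSON records, no segment rotation). All names are invented for illustration:

```python
import json
import os

class RecoverableKV:
    """WAL + checkpoint: log intent, apply in memory, snapshot, replay on start."""

    def __init__(self, wal_path, ckpt_path):
        self.wal_path, self.ckpt_path = wal_path, ckpt_path
        self.state, self.lsn = {}, 0
        self._recover()
        self.wal = open(wal_path, "a")

    def _recover(self):
        # Load the last checkpoint: a snapshot of state plus the LSN it covers.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                ckpt = json.load(f)
            self.state, self.lsn = ckpt["state"], ckpt["lsn"]
        # Replay only WAL records newer than the checkpoint.
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["lsn"] > self.lsn:
                        self.state[rec["key"]] = rec["value"]
                        self.lsn = rec["lsn"]

    def put(self, key, value):
        self.lsn += 1
        # Intent goes to the durable log first...
        self.wal.write(json.dumps(
            {"lsn": self.lsn, "key": key, "value": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # ...and only then to in-memory state.
        self.state[key] = value

    def checkpoint(self):
        # Snapshot atomically: write to a temp file, fsync, then rename.
        tmp = self.ckpt_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"state": self.state, "lsn": self.lsn}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.ckpt_path)
```

On restart, constructing a new RecoverableKV against the same files reconstructs the exact pre-crash state: checkpoint first, then replay of the WAL tail.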
This pattern is the foundation of Apache Flink's state backend, RocksDB's write-ahead log, and Redis's Append-Only File (AOF) — all variations on the same WAL concept.
10. Key Takeaways
- WAL provides durability by persisting intent before effect — the "write ahead" invariant is the core guarantee.
- ARIES crash recovery (Analysis → Redo → Undo) is the gold standard; understand it to reason about any database recovery.
- PostgreSQL WAL controls durability granularity via synchronous_commit; choose your durability/latency trade-off explicitly.
- Kafka's commit log is WAL at messaging scale: sequential writes plus replication give both durability and throughput.
- Monitor replication slot lag in PostgreSQL CDC setups; a stalled consumer can crash the entire primary by filling the WAL disk.
- Checkpoint frequency directly controls crash recovery time; tune it to meet your RTO requirements.
Conclusion
Write-Ahead Logging is one of those foundational concepts that pays compound interest. Every time you reason about database durability, replication lag, CDC reliability, or stateful service recovery, you're reasoning about WAL. Engineers who understand WAL deeply make better decisions at every layer of the stack.