Md Sanwar Hossain
Senior Software Engineer · MySQL & Database Internals Series
April 4, 2026 · 24 min read

MySQL InnoDB Internals: Buffer Pool, Redo Log & MVCC Deep Dive for Java Developers

Understanding InnoDB's internals is the difference between a Java developer who configures Spring Boot's spring.datasource.* properties and one who can explain exactly why a query slowed down under load, why an OOM occurs despite a large heap, and how to tune InnoDB to match the access patterns of a Spring Data JPA application. This deep dive covers the Buffer Pool LRU mechanics, Write-Ahead Logging (WAL) via the Redo Log, MVCC via Undo Logs, B+Tree clustered index layout, and the locking model including Gap Locks that prevent phantom reads.

Table of Contents

  1. InnoDB Architecture: Memory vs Disk Components
  2. Buffer Pool: Pages, LRU, and the Young/Old Trick
  3. Redo Log (WAL): How InnoDB Survives Crashes
  4. Undo Log: The Foundation of ROLLBACK and MVCC
  5. MVCC Deep Dive: ReadView, trx_id, and Consistent Reads
  6. B+Tree Clustered Index: Why InnoDB Hates UUIDs
  7. Locking: Record Locks, Gap Locks, and Next-Key Locks
  8. Java Developer Tuning Guide: Key InnoDB Variables
  9. Key Takeaways

1. InnoDB Architecture: Memory vs Disk Components

MySQL InnoDB Engine Architecture — mdsanwarhossain.me

InnoDB separates its operation into in-memory structures (the Buffer Pool, Log Buffer, and Change Buffer) and on-disk structures (tablespace .ibd files, redo log files ib_logfile0/1, and undo tablespaces). Every read and write operation flows through the Buffer Pool — InnoDB never reads from disk if the page is already in memory. The fundamental design goal is to maximize buffer pool hit rate while ensuring crash recovery via the redo log.

InnoDB pages are 16KB by default (innodb_page_size). Every table row, index entry, and overflow data lives in these fixed-size pages. When a read request comes in for a row, InnoDB reads the entire 16KB page containing that row into the buffer pool, then returns the row. Subsequent reads of other rows on the same page hit memory directly — this spatial locality is why InnoDB performs well for range scans.
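To make the arithmetic concrete, here is a back-of-the-envelope sketch in Java; the 100-byte average row size and the ~1KB per-page overhead are illustrative assumptions, not InnoDB constants:

```java
// Rough page economics: how many rows share one 16KB page, and how many
// page reads a range scan needs. ROW_SIZE and PAGE_OVERHEAD are assumed
// values for illustration only.
public class PageMath {
    static final int PAGE_SIZE = 16 * 1024;  // innodb_page_size default
    static final int PAGE_OVERHEAD = 1024;   // headers, page directory (approx.)
    static final int ROW_SIZE = 100;         // assumed average row

    static int rowsPerPage() {
        return (PAGE_SIZE - PAGE_OVERHEAD) / ROW_SIZE;
    }

    static int pagesForRangeScan(int rowCount) {
        // ceiling division: every page read brings in all its rows at once
        return (rowCount + rowsPerPage() - 1) / rowsPerPage();
    }

    public static void main(String[] args) {
        System.out.println("rows/page: " + rowsPerPage());                        // 153
        System.out.println("pages for 10K-row scan: " + pagesForRangeScan(10_000)); // 66
    }
}
```

Under these assumptions a 10,000-row range scan touches only 66 pages, which is the spatial locality the paragraph above describes.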

2. Buffer Pool: Pages, LRU, and the Young/Old Trick

The Buffer Pool is InnoDB's cache. Its size (innodb_buffer_pool_size) should be 70–80% of available RAM on a dedicated database server. The standard advice is well-known; the why is less so: InnoDB uses a midpoint insertion LRU algorithm, not a standard LRU. This matters enormously for Java applications that run full-table scans (reports, batch jobs, admin exports).

-- Check buffer pool utilization
SELECT
    pool_id,
    pool_size,
    free_buffers,
    database_pages,
    old_database_pages,
    modified_database_pages,
    hit_rate,
    young_makes_per_thousand_gets,
    not_young_makes_per_thousand_gets
FROM information_schema.INNODB_BUFFER_POOL_STATS;

-- Buffer pool hit rate — aim for > 99% in production
-- hit_rate = 1000 means 100% hit rate (integer per mille)
-- young_makes high = pages being promoted from old to young sublist
-- If hit_rate is low, increase innodb_buffer_pool_size

The LRU list is split into a "young" (hot, 5/8 of the list by default) and "old" (cold, 3/8) sublist. New pages enter at the midpoint (top of the old sublist). If accessed again within innodb_old_blocks_time milliseconds (1000ms default), they stay in the old sublist; if accessed later, they are promoted to the young sublist. This prevents a full-table scan from evicting your working set: the scan pages enter the old sublist and are evicted quickly, preserving your hot OLTP pages.
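The midpoint-insertion behaviour can be sketched as a toy model (the capacity, page ids, and the boolean promotion flag are invented for illustration; real InnoDB tracks per-page access timestamps against innodb_old_blocks_time):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of midpoint-insertion LRU. New pages enter at the head of the
// "old" sublist; a page re-touched after the old-blocks time window
// (modelled here as a flag) is promoted into the "young" sublist.
// Evictions always come from the cold tail, so a one-pass scan cannot
// push hot pages out.
public class MidpointLru {
    final int capacity;
    final Deque<Integer> young = new ArrayDeque<>(); // hot sublist, MRU at head
    final Deque<Integer> old = new ArrayDeque<>();   // cold sublist

    MidpointLru(int capacity) { this.capacity = capacity; }

    void access(int page, boolean touchedAfterWindow) {
        if (young.remove(page)) { young.addFirst(page); return; }   // already hot
        if (old.remove(page)) {
            if (touchedAfterWindow) young.addFirst(page);           // promote
            else old.addFirst(page);                                // still on probation
            return;
        }
        old.addFirst(page);                                         // midpoint insert
        if (young.size() + old.size() > capacity) old.removeLast(); // evict cold tail
    }

    public static void main(String[] args) {
        MidpointLru lru = new MidpointLru(10);
        // Hot working set: 3 pages, each re-touched after the window
        for (int p = 1; p <= 3; p++) { lru.access(p, false); lru.access(p, true); }
        // Full-table scan: 100 pages touched exactly once
        for (int p = 100; p < 200; p++) lru.access(p, false);
        System.out.println("young after scan: " + lru.young); // hot pages survive
    }
}
```

After the simulated scan, the three hot pages are still in the young sublist; the scan churned only through the old sublist.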

# my.cnf — buffer pool tuning for an 8GB RAM dedicated MySQL server
innodb_buffer_pool_size = 6G         # 75% of RAM
innodb_buffer_pool_instances = 8     # Reduce mutex contention (aim for ~1GB per instance)
innodb_old_blocks_time = 1000        # ms before old block can be promoted
innodb_old_blocks_pct = 37           # 37% of buffer pool for old sublist

# Multiple buffer pool instances significantly reduce contention
# under high-concurrency Spring Boot applications

3. Redo Log (WAL): How InnoDB Survives Crashes

InnoDB uses Write-Ahead Logging (WAL). Every modification to a data page is first written to the redo log (a fixed-size circular file, ib_logfile0 and ib_logfile1) before the modified page is flushed to the tablespace. On crash recovery, InnoDB replays redo log records to bring data files back to a consistent state. This is how the Durability guarantee of ACID is implemented.

# The most important InnoDB durability setting
innodb_flush_log_at_trx_commit = 1   # Default: flush log on EVERY commit (full ACID)
# = 0: flush once per second (risk: up to 1s of data loss on crash)
# = 2: write to OS cache on commit, flush once per second (risk: OS crash loses data)

# Production: always use 1 for financial/critical data
# Use 2 only for read replicas or non-critical data with accepted risk

# Redo log file size — MySQL 8.0.30+ auto-manages; older versions:
innodb_log_file_size = 256M          # Larger = faster writes, longer recovery
innodb_log_buffer_size = 64M         # Buffer before flush; larger = fewer I/Os for big txns

A common Spring Boot performance issue: the developer keeps innodb_flush_log_at_trx_commit=1 (correct for ACID) but commits every insert in its own transaction. Each commit forces an fsync of the redo log, causing an fsync storm under load. The fix is to batch commits — insert 1000 rows per transaction instead of one row per transaction — reducing fsync calls by three orders of magnitude while maintaining the same durability guarantee.

// Spring Batch: tune chunk size to batch commits
@Bean
public Step importProductsStep(JobRepository jobRepository,
        PlatformTransactionManager txManager,
        ItemReader<ProductCsv> reader,
        ItemWriter<Product> writer) {
    return new StepBuilder("importProducts", jobRepository)
        .<ProductCsv, Product>chunk(500, txManager)  // 500 rows per txn = 500 rows per fsync
        .reader(reader)
        .writer(writer)
        .build();
    // At chunk=1, 100K rows = 100K fsyncs ≈ minutes
    // At chunk=500, 100K rows = 200 fsyncs ≈ seconds
}

4. Undo Log: The Foundation of ROLLBACK and MVCC

When InnoDB modifies a row, it writes the before image (the old version) to the undo log. This serves two purposes: enabling ROLLBACK (replay the undo records in reverse to restore the old state) and MVCC (other transactions can read old versions from the undo log instead of being blocked by the writer).

The undo log is stored in the system tablespace (ibdata1) or dedicated undo tablespaces (undo_001, undo_002 — the MySQL 8.0 default). A critical production pitfall: a long-running transaction that holds a consistent read snapshot prevents InnoDB from purging old undo log records, causing undo log bloat. The History List Length (HLL) metric measures this:

-- Monitor undo log health
SHOW ENGINE INNODB STATUS\G
-- Look for: "History list length" in the TRANSACTIONS section
-- > 1000 is a warning; > 10000 is a problem

-- Find long-running transactions causing HLL growth
SELECT trx_id, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS seconds_running,
       trx_query, trx_state
FROM information_schema.INNODB_TRX
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30
ORDER BY trx_started;

Spring Boot Pitfall: A @Transactional(readOnly = true) Spring service method that holds a connection open while doing HTTP calls or slow business logic creates a long-running read transaction, inflating the HLL and slowing down the purge thread. Keep transaction boundaries tight — fetch your data, close the transaction, then do business logic.

5. MVCC Deep Dive: ReadView, trx_id, and Consistent Reads

MySQL InnoDB MVCC - ReadView and Undo Log Version Chain | mdsanwarhossain.me
MySQL InnoDB MVCC and Locking — mdsanwarhossain.me

MVCC (Multi-Version Concurrency Control) is why InnoDB readers don't block writers and writers don't block readers. Each row in InnoDB has two hidden columns: trx_id (the transaction ID that last modified this row) and roll_ptr (a pointer to the previous version in the undo log chain). When a transaction reads a row, InnoDB follows this algorithm:

  1. Take a ReadView snapshot at transaction start (for REPEATABLE_READ) or at each statement (for READ_COMMITTED). The ReadView records the set of active transaction IDs at that moment.
  2. For each row version encountered, check if its trx_id is visible (committed before the ReadView was taken and not in the active set).
  3. If not visible, follow the roll_ptr to the previous version in the undo log and repeat.
  4. Return the first visible version found.
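The four steps above can be sketched in Java; the field names are simplified analogs of InnoDB's internal ReadView fields (upLimit for the smallest active trx_id at snapshot time, lowLimit for the next id to be assigned), not the actual identifiers:

```java
import java.util.Set;

// Simplified model of the ReadView visibility check and the undo-chain walk.
public class ReadViewDemo {
    // A row version carries the trx_id that wrote it and a pointer (roll_ptr)
    // to the previous version in the undo log chain.
    record RowVersion(long trxId, int value, RowVersion rollPtr) {}

    record ReadView(long upLimit, long lowLimit, Set<Long> activeIds, long creatorId) {
        boolean visible(long trxId) {
            if (trxId == creatorId) return true;   // our own changes
            if (trxId < upLimit) return true;      // committed before the snapshot
            if (trxId >= lowLimit) return false;   // started after the snapshot
            return !activeIds.contains(trxId);     // active at snapshot => invisible
        }

        int read(RowVersion v) {                   // walk the undo chain
            while (v != null && !visible(v.trxId())) v = v.rollPtr();
            if (v == null) throw new IllegalStateException("no visible version");
            return v.value();
        }
    }

    public static void main(String[] args) {
        // Chain: trx 90 wrote 1000 (committed), then trx 105 (still active) wrote 500
        RowVersion v1 = new RowVersion(90, 1000, null);
        RowVersion v2 = new RowVersion(105, 500, v1);
        // Reader trx 101 took its snapshot while trx 105 was active
        ReadView rv = new ReadView(100, 110, Set.of(105L), 101);
        System.out.println(rv.read(v2)); // 1000 — skips uncommitted trx 105
    }
}
```

The reader skips the uncommitted version from trx 105 and returns the older committed value, without taking any lock, which is exactly the non-blocking read the SQL demo below shows.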

-- Demonstrate MVCC in action
-- Session 1: REPEATABLE_READ (MySQL default)
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1;   -- Returns: 1000

-- Session 2 (concurrent): commits a write
UPDATE accounts SET balance = 500 WHERE id = 1;
COMMIT;

-- Session 1: still sees 1000 — ReadView taken at START TRANSACTION
SELECT balance FROM accounts WHERE id = 1;   -- Still returns: 1000

-- Session 1: only sees 500 after it starts a NEW transaction
COMMIT;
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1;   -- Now returns: 500

-- READ_COMMITTED: each statement gets a fresh ReadView
-- Session 1 would have seen 500 immediately after Session 2 committed

6. B+Tree Clustered Index: Why InnoDB Hates UUIDs

InnoDB stores all table data in a B+Tree clustered index, ordered by the primary key. This is fundamentally different from PostgreSQL's heap storage. The implication: the primary key determines physical row order. Sequential primary keys (AUTO_INCREMENT, BIGINT) are fast to insert because new rows append to the right side of the B+Tree. Random UUIDs (v4) cause page splits on every insert — the new UUID goes to a random position in the tree, requiring InnoDB to split an existing full page, causing write amplification and buffer pool fragmentation:
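A toy leaf-page model illustrates the effect; the page capacity and split policy are simplifications (a rightmost append leaves the page full and starts a fresh one, mirroring InnoDB's sequential-insert optimization, while any other overflow splits 50/50):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy B+Tree leaf model: sequential keys fill pages completely, random keys
// leave pages roughly half-full after repeated mid-page splits.
public class PageSplitDemo {
    static final int CAP = 100; // keys per leaf page (illustrative)

    static double avgFill(List<Long> keys) {
        List<List<Long>> pages = new ArrayList<>();
        pages.add(new ArrayList<>());
        for (long k : keys) {
            int i = findPage(pages, k);
            List<Long> page = pages.get(i);
            int pos = insertSorted(page, k);
            if (page.size() > CAP) {
                if (pos == page.size() - 1 && i == pages.size() - 1) {
                    // rightmost append: leave the page full, open a new page
                    List<Long> next = new ArrayList<>();
                    next.add(page.remove(page.size() - 1));
                    pages.add(next);
                } else {
                    // mid-page insert: split 50/50, both halves half-empty
                    List<Long> right = new ArrayList<>(page.subList(CAP / 2, page.size()));
                    page.subList(CAP / 2, page.size()).clear();
                    pages.add(i + 1, right);
                }
            }
        }
        return keys.size() / (double) (pages.size() * CAP);
    }

    static int findPage(List<List<Long>> pages, long k) {
        for (int i = pages.size() - 1; i > 0; i--)
            if (!pages.get(i).isEmpty() && pages.get(i).get(0) <= k) return i;
        return 0;
    }

    static int insertSorted(List<Long> page, long k) {
        int pos = 0;
        while (pos < page.size() && page.get(pos) < k) pos++;
        page.add(pos, k);
        return pos;
    }

    public static void main(String[] args) {
        List<Long> seq = new ArrayList<>();
        for (long i = 0; i < 10_000; i++) seq.add(i);          // AUTO_INCREMENT-like
        List<Long> rnd = new ArrayList<>();
        Random r = new Random(42);
        for (int i = 0; i < 10_000; i++) rnd.add(r.nextLong()); // UUIDv4-like
        System.out.printf("sequential fill: %.0f%%%n", 100 * avgFill(seq));
        System.out.printf("random fill:     %.0f%%%n", 100 * avgFill(rnd));
    }
}
```

Random keys end up needing noticeably more pages for the same row count: lower fill factor means more pages to cache, which is the buffer pool fragmentation and write amplification described above.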

-- BAD: UUID v4 primary key (random, causes page splits)
CREATE TABLE orders_bad (
    id CHAR(36) PRIMARY KEY DEFAULT (UUID()),  -- Random UUID causes fragmentation
    user_id BIGINT,
    total DECIMAL(10,2)
);

-- GOOD 1: AUTO_INCREMENT BIGINT (sequential, optimal InnoDB insert performance)
CREATE TABLE orders_good (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id BIGINT,
    total DECIMAL(10,2)
);

-- GOOD 2: UUID v7 (time-ordered, compatible with UUID format, no page splits)
-- MySQL 8.0 doesn't have UUID_v7 natively — use application-generated UUIDv7
-- or UUID_TO_BIN(UUID(), true) with swap_flag=true for time-ordering
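If you generate UUIDv7 on the application side, a minimal sketch following the RFC 9562 layout (48-bit Unix-millisecond timestamp, then version and variant bits, remainder random) looks like this:

```java
import java.security.SecureRandom;
import java.time.Instant;
import java.util.UUID;

// Minimal application-side UUIDv7 generator. The time-ordered prefix keeps
// inserts near the right edge of the B+Tree, avoiding random page splits.
public class UuidV7 {
    private static final SecureRandom RNG = new SecureRandom();

    public static UUID generate() {
        long millis = Instant.now().toEpochMilli();
        long msb = (millis << 16)              // 48-bit millisecond timestamp
                 | 0x7000L                     // version 7 in bits 12-15
                 | (RNG.nextLong() & 0x0FFFL); // 12 random bits
        long lsb = (RNG.nextLong() & 0x3FFFFFFFFFFFFFFFL)
                 | 0x8000000000000000L;        // RFC 4122 variant bits "10"
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID a = generate();
        System.out.println(a + " version=" + a.version()); // version=7
    }
}
```

UUIDs generated in later milliseconds compare greater than earlier ones, so inserts stay append-mostly like AUTO_INCREMENT while keeping the standard UUID format.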

// Spring Boot JPA entity: prefer @GeneratedValue(strategy = IDENTITY) with a BIGINT key
@Entity
@Table(name = "orders")
public class Order {

    // AUTO_INCREMENT — optimal for InnoDB insert performance
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    // If UUID is required for public API (security through obscurity):
    @Column(unique = true, nullable = false)
    private UUID externalId = UUID.randomUUID(); // Only for public exposure

    // Keep primary key as BIGINT for DB-level operations
}

7. Locking: Record Locks, Gap Locks, and Next-Key Locks

InnoDB's locking is more nuanced than most developers expect. Understanding it prevents deadlocks and explains why queries that "should be fast" sometimes wait:

Record Lock: a single index row. Applied by UPDATE/DELETE on a specific row and by SELECT ... FOR UPDATE.
Gap Lock: the space between index entries. Applied to range queries under REPEATABLE_READ; prevents phantom inserts.
Next-Key Lock: a record lock plus the gap before it. The default under REPEATABLE_READ; prevents phantom reads.
Insert Intention Lock: the gap at the insert position. Taken before inserting; compatible with other insert intention locks.

-- Deadlock example: concurrent insert to the same gap
-- Both sessions try to insert into a gap protected by a gap lock
-- Session 1: holds gap lock (10, 20), waits for Session 2's gap lock
-- Session 2: holds gap lock (10, 20), waits for Session 1's gap lock
-- → InnoDB detects cycle, kills the transaction with fewer undo log records

-- Prevention: reduce isolation level to READ_COMMITTED for non-financial operations
-- (disables gap locking, only record locks remain)
SET TRANSACTION ISOLATION LEVEL READ_COMMITTED;

-- Or: use SELECT ... FOR UPDATE to acquire locks upfront in a consistent order
-- Consistent lock ordering prevents deadlocks by eliminating circular waits
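The same consistent-ordering idea applies to application-level locks in the JVM; here is a minimal sketch with ReentrantLock (the account ids, balances map, and transfer API are invented for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Deadlock-free transfers via consistent lock ordering: every caller acquires
// the lower account id first, so no two threads can ever hold the two locks
// in opposite order — the circular wait is impossible by construction.
public class LockOrdering {
    static final Map<Long, ReentrantLock> locks = new ConcurrentHashMap<>();
    static final Map<Long, Long> balances = new ConcurrentHashMap<>();

    static ReentrantLock lockFor(long id) {
        return locks.computeIfAbsent(id, k -> new ReentrantLock());
    }

    static void transfer(long from, long to, long amount) {
        long first = Math.min(from, to), second = Math.max(from, to);
        lockFor(first).lock();               // same global order for every caller
        try {
            lockFor(second).lock();
            try {
                balances.merge(from, -amount, Long::sum);
                balances.merge(to, amount, Long::sum);
            } finally { lockFor(second).unlock(); }
        } finally { lockFor(first).unlock(); }
    }

    public static void main(String[] args) throws InterruptedException {
        balances.put(1L, 1000L); balances.put(2L, 1000L);
        // Opposite-direction transfers: would deadlock with naive lock order
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) transfer(1, 2, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) transfer(2, 1, 1); });
        t1.start(); t2.start(); t1.join(); t2.join();
        System.out.println(balances); // both accounts net back to 1000, no deadlock
    }
}
```

With SQL, the equivalent is always locking rows in primary-key order (e.g. SELECT ... WHERE id IN (...) ORDER BY id FOR UPDATE) before updating them.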

8. Java Developer Tuning Guide: Key InnoDB Variables

innodb_buffer_pool_size: default 128M; set to 70-80% of RAM. Governs hit rate and read performance.
innodb_buffer_pool_instances: default 8 (1 if the pool is under 1GB); aim for roughly one instance per GB. Reduces mutex contention under concurrency.
innodb_flush_log_at_trx_commit: default 1; keep 1 for ACID, use 2 only on replicas. Trades durability against write throughput.
innodb_io_capacity: default 200; set 2000-20000 on SSDs. Governs the flush rate of dirty pages.
innodb_read_io_threads: default 4; set 8-16. Enables parallel read I/O for large datasets.
max_connections: default 151; size to match HikariCP pool × number of service instances. Prevents connection exhaustion.

HikariCP + InnoDB: Set spring.datasource.hikari.maximum-pool-size to number_of_CPU_cores × 2 for OLTP workloads, not an arbitrarily large number. More connections than InnoDB can service concurrently causes thread scheduling overhead that hurts performance.
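As a sketch of that sizing rule (the property key is the standard Spring Boot HikariCP name; the multiplier is the rule of thumb above, not a hard law):

```java
// Derive the HikariCP pool size from CPU cores rather than picking an
// arbitrary large number. In containers, availableProcessors() reflects
// the CPU limit, so set limits deliberately.
public class PoolSizing {
    static int recommendedPoolSize(int cpuCores) {
        return cpuCores * 2; // OLTP rule of thumb from the text
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("spring.datasource.hikari.maximum-pool-size="
                + recommendedPoolSize(cores));
    }
}
```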

9. Key Takeaways

  1. Size innodb_buffer_pool_size to 70-80% of RAM on a dedicated server and watch the hit rate; the midpoint LRU keeps full-table scans from evicting your working set.
  2. Keep innodb_flush_log_at_trx_commit = 1 for critical data, and batch commits (e.g. Spring Batch chunks) to avoid redo log fsync storms.
  3. Keep transactions short: a long-running snapshot inflates the History List Length and stalls undo purge.
  4. MVCC serves consistent reads from undo log version chains via the ReadView, so readers and writers do not block each other.
  5. Prefer sequential primary keys (AUTO_INCREMENT or time-ordered UUIDv7) over random UUIDv4 to avoid page splits and fragmentation.
  6. Under REPEATABLE_READ, gap and next-key locks prevent phantoms but can deadlock; acquire locks in a consistent order.

Questions about InnoDB internals, buffer pool tuning, or MVCC behaviour in your Spring Boot application? Share below.