Java JMH Micro-Benchmarking: Measuring Throughput, Latency & Performance Regression Detection in CI/CD 2026
Measuring Java performance correctly is harder than it looks. The JVM's JIT compiler, dead code elimination, and warmup effects can silently invalidate naive benchmarks. This guide takes you from why System.currentTimeMillis() lies to production-grade JMH setups with CI/CD regression detection — the exact toolkit senior Java engineers use in 2026.
TL;DR — JMH Benchmarking Essentials
"Use JMH (Java Microbenchmark Harness) — never System.currentTimeMillis() for performance measurement. Always warmup (≥5 iterations). Use @Blackhole to prevent dead code elimination. Report in ops/sec (Throughput) or ns/op (AverageTime). Run benchmarks in CI to catch regressions before production."
Table of Contents
- Why Naive Benchmarking Lies: JIT, Warmup & Dead Code
- JMH Setup: Dependencies, Maven & Gradle
- Benchmark Modes: Throughput, AverageTime, SampleTime, SingleShotTime
- @State, @Setup & @TearDown: Managing Benchmark State
- Avoiding Dead Code Elimination with Blackhole
- Warmup Tuning & Fork Count Best Practices
- Real Benchmarks: HashMap vs ConcurrentHashMap, String vs StringBuilder
- Profilers & Async-Profiler Integration with JMH
- Continuous Benchmarking in CI/CD: Regression Detection
- JMH Pitfalls, Anti-Patterns & Conclusion
1. Why Naive Benchmarking Lies: JIT, Warmup & Dead Code
Java engineers often reach for System.currentTimeMillis() or System.nanoTime() wrapped around a loop to "measure" performance. This approach produces wildly inaccurate results because it ignores three fundamental JVM behaviors that dominate micro-scale execution time.
The JIT Warmup Problem
HotSpot JVM uses tiered compilation with three meaningful tiers:
- Tier 0 — Interpreter: All code starts here. Execution is ~10–50× slower than compiled code but starts immediately.
- Tier 1–3 (C1 compiler): After ~2,000 invocations a method gets a quick compile with limited optimizations. Moderate performance gains.
- Tier 4 (C2 / server compiler): After ~10,000–15,000 invocations the profiler-guided optimizer kicks in, applying aggressive inlining, loop unrolling, escape analysis, and SIMD vectorization. This is where peak performance is reached.
A naive benchmark that runs a method 1,000 times will measure mostly interpreter and C1 performance — not the steady-state throughput you care about in production. The warmup curve is steep: throughput can increase by 5–20× between the first invocation and after full JIT compilation.
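The warmup curve is easy to observe even without JMH. The following self-contained sketch (class name and batch sizes are illustrative, and this is deliberately not a valid benchmark) times successive batches of the same hot method; on a typical HotSpot JVM the later batches run noticeably faster than the first as C1 and then C2 compilation kick in:

```java
// WarmupDemo: illustrative sketch (NOT a valid benchmark) showing JIT warmup.
public class WarmupDemo {
    public static int work(int x) {
        int h = x;
        for (int i = 0; i < 100; i++) h = h * 31 + i; // small hot loop
        return h;
    }

    public static void main(String[] args) {
        int sink = 0; // accumulate results so the JIT cannot eliminate work()
        for (int batch = 0; batch < 10; batch++) {
            long start = System.nanoTime();
            for (int i = 0; i < 100_000; i++) sink += work(i);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("batch " + batch + ": " + micros + " us");
        }
        System.out.println("sink=" + sink); // publish sink so it is observably used
    }
}
```

Typically the last batches run several times faster than batch 0; the exact ratio depends on the JVM and hardware, which is exactly the variability JMH's warmup phases are designed to get past.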
On-Stack Replacement (OSR)
OSR is the JVM's mechanism for switching from interpreted to compiled code while a loop is still executing. OSR-compiled code has different performance characteristics than normally compiled code because the JVM must preserve interpreter state at the OSR entry point, preventing some optimizations. Benchmarks that put their entire logic inside one large loop (common in hand-rolled timing code) measure OSR-compiled performance — another source of error.
Dead Code Elimination (DCE)
The C2 compiler performs aggressive dead code elimination. If it proves that a computation's result is never used, it removes the computation entirely. Consider this common mistake:
// WRONG: JIT will eliminate the entire computation
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
int result = compute(i); // result never used → DCE removes compute()
}
long elapsed = System.nanoTime() - start;
System.out.println("Took: " + elapsed + "ns");
The JIT will prove result has no side effects and eliminate the loop body. Your benchmark reports near-zero nanoseconds — not because your code is fast, but because it was never executed.
Constant Folding
Related to DCE is constant folding: if the inputs to a computation are compile-time constants (or constants the JIT can prove), the JVM precomputes the result. A benchmark computing Math.sqrt(4.0) in a loop may be measuring nothing more than a constant load, because the sqrt call was hoisted out and replaced with 2.0 at JIT time. JMH's @State fields and the Blackhole class exist specifically to defeat these optimizations safely.
2. JMH Setup: Dependencies, Maven & Gradle
JMH (Java Microbenchmark Harness) is the official OpenJDK benchmarking framework maintained by the JVM engineers themselves. It handles warmup, iteration control, forking, and result statistics so you can focus on writing meaningful benchmarks.
Maven Setup
Add JMH dependencies and the shade plugin to build an executable fat JAR:
<dependencies>
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-core</artifactId>
<version>1.37</version>
</dependency>
<dependency>
<groupId>org.openjdk.jmh</groupId>
<artifactId>jmh-generator-annprocess</artifactId>
<version>1.37</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.5.2</version>
<executions>
<execution>
<phase>package</phase>
<goals><goal>shade</goal></goals>
<configuration>
<finalName>benchmarks</finalName>
<transformers>
<transformer implementation=
"org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>org.openjdk.jmh.Main</mainClass>
</transformer>
<transformer implementation=
"org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
Gradle Setup
Use the me.champeau.jmh Gradle plugin (version 0.7.2) for seamless integration:
// build.gradle.kts
plugins {
id("me.champeau.jmh") version "0.7.2"
java
}
jmh {
jmhVersion.set("1.37")
warmupIterations.set(5)
iterations.set(10)
fork.set(2)
resultsFile.set(project.file("build/reports/jmh/results.json"))
resultFormat.set("JSON")
}
dependencies {
jmh("org.openjdk.jmh:jmh-core:1.37")
jmhAnnotationProcessor("org.openjdk.jmh:jmh-generator-annprocess:1.37")
}
Running Benchmarks
# Maven: package then run
mvn clean package -DskipTests
java -jar target/benchmarks.jar
# Gradle
./gradlew jmh
# Run specific benchmarks matching regex
java -jar target/benchmarks.jar "HashMapBenchmark"
# List all available benchmarks
java -jar target/benchmarks.jar -l
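Benchmarks can also be launched programmatically from a main() method, which is convenient inside IDEs. A minimal sketch using JMH's Runner API (assumes jmh-core is on the classpath; the include pattern and class name are illustrative):

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include("MyFirstBenchmark")   // regex matched against benchmark names
                .warmupIterations(5)
                .measurementIterations(10)
                .forks(2)
                .build();
        new Runner(opt).run();                 // forks JVMs and prints results
    }
}
```

Command-line options override annotation defaults in the same way here: values set on the OptionsBuilder win over the annotations on the benchmark class.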
First Benchmark Class Skeleton
package com.example.benchmarks;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import java.util.concurrent.TimeUnit;
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class MyFirstBenchmark {
private int value;
@Setup(Level.Trial)
public void setUp() {
value = 42;
}
@Benchmark
public int addNumbers() {
return value + value;
}
@Benchmark
public void addNumbersBlackhole(Blackhole bh) {
bh.consume(value + value);
}
}
3. Benchmark Modes: Throughput, AverageTime, SampleTime, SingleShotTime
JMH provides four benchmark modes via the @BenchmarkMode annotation. Choosing the wrong mode gives you the right number for the wrong question. Pair @BenchmarkMode with @OutputTimeUnit to control the unit of reported results.
Mode.Throughput — Operations Per Second
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Benchmark
public int measureThroughput() {
return computeHash(input);
}
Reports ops/s — how many times the benchmark method executes per second. Higher is better. Ideal for: comparing alternative implementations of the same algorithm, measuring the impact of cache optimizations, or validating that a refactoring preserved throughput.
Mode.AverageTime — Average Invocation Time
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS) // or MICROSECONDS for slower ops
@Benchmark
public String measureAverageLatency() {
return buildString(input);
}
Reports ns/op or µs/op — the arithmetic mean time per benchmark invocation. Lower is better. Use for understanding individual request latency. Note: AverageTime hides tail latency; a p99 spike won't show up in the mean.
Mode.SampleTime — Percentile Distribution
@BenchmarkMode(Mode.SampleTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Benchmark
public void measureSampleTime(Blackhole bh) {
bh.consume(processRequest(request));
}
Samples individual invocation times and reports percentiles (p50, p90, p95, p99, p99.9). Critical for latency-sensitive systems where tail latency (p99) matters more than the mean. Use for: API endpoint response time, lock acquisition latency, GC pause simulation.
Mode.SingleShotTime — Cold Start / First Invocation
@BenchmarkMode(Mode.SingleShotTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 0) // NO warmup for cold start measurement
@Fork(value = 10) // Many forks to get statistical significance
@Benchmark
public void measureColdStart() {
initializeHeavyComponent();
}
Executes the benchmark exactly once per fork with no warmup. Measures cold start cost: class loading, static initialization, first-time JIT. Use for: Spring context startup time, AWS Lambda cold starts, first-query plan compilation.
Mode Selection Reference
| Mode | Unit | Use When | Example |
|---|---|---|---|
| Throughput | ops/s | Comparing fast algorithms | HashMap vs TreeMap get() |
| AverageTime | ns/op or µs/op | Single-operation latency | String.format() overhead |
| SampleTime | p50/p99 µs | Tail latency matters | DB query, HTTP call simulation |
| SingleShotTime | ms/op | Cold start / init cost | Spring context, Lambda init |
4. @State, @Setup & @TearDown: Managing Benchmark State
Benchmark correctness depends critically on state management. @State tells JMH how to scope and share objects between benchmark invocations and threads. Getting scope wrong produces either misleading results (Scope too broad → lock contention) or incorrect results (Scope too narrow → missing shared-state bugs).
@State Scope Options
- Scope.Thread (default): Each worker thread gets its own State instance. No sharing. Use for thread-local benchmarks like single-threaded algorithms, parser performance, or data structure operations where you don't want contention.
- Scope.Benchmark: One State instance shared across all worker threads running the benchmark. Use to benchmark concurrent access patterns — e.g., ConcurrentHashMap shared across threads. Requires your State fields to be thread-safe or read-only.
- Scope.Group: Shared within a @Group of methods. Useful for producer/consumer benchmarks where one method writes and another reads the same data structure simultaneously.
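Scope.Group pairs with the @Group and @GroupThreads annotations. A sketch of an asymmetric producer/consumer benchmark (queue choice and thread counts are illustrative; assumes jmh-core on the classpath):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.openjdk.jmh.annotations.*;

@State(Scope.Group) // one queue instance shared by all threads in the group
public class QueueGroupBenchmark {
    private Queue<Integer> queue;

    @Setup(Level.Iteration)
    public void setup() {
        queue = new ConcurrentLinkedQueue<>();
    }

    @Benchmark
    @Group("pingPong")
    @GroupThreads(1)            // one producer thread
    public void produce() {
        queue.offer(42);
    }

    @Benchmark
    @Group("pingPong")
    @GroupThreads(1)            // one consumer thread, running concurrently
    public Integer consume() {
        return queue.poll();    // may be null when the queue is momentarily empty
    }
}
```

JMH reports per-method scores within the group, so you can see producer and consumer throughput side by side.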
@Setup and @TearDown Levels
- Level.Trial: Run once per @Fork (JVM instance). Use for expensive one-time setup: loading files, building large data structures, establishing DB connections.
- Level.Iteration: Run before and after each measurement iteration (typically 1 second long). Use for resetting mutable state that accumulates across invocations.
- Level.Invocation: Run before and after every single benchmark method call. Avoid unless necessary — the setup overhead is included in the measurement unless you're very careful, and it usually makes benchmarks unreliable for sub-microsecond operations.
@State(Scope.Thread)
public static class BenchmarkState {
// Loaded once per JVM fork — expensive resources
private Map<String, String> largeMap;
private String[] testKeys;
@Setup(Level.Trial)
public void trialSetup() {
largeMap = new HashMap<>(100_000);
testKeys = new String[10_000];
for (int i = 0; i < 100_000; i++) {
largeMap.put("key-" + i, "value-" + i);
}
for (int i = 0; i < 10_000; i++) {
testKeys[i] = "key-" + ThreadLocalRandom.current().nextInt(100_000);
}
System.gc(); // Request GC before measurement starts
}
// Resets mutable counters between measurement iterations
private int iterationCounter;
@Setup(Level.Iteration)
public void iterationSetup() {
iterationCounter = 0;
}
@TearDown(Level.Trial)
public void trialTearDown() {
largeMap = null; // Allow GC
}
}
@Benchmark
public String benchmarkMapGet(BenchmarkState state) {
String key = state.testKeys[state.iterationCounter++ % state.testKeys.length];
return state.largeMap.get(key);
}
Thread Safety with Scope.Benchmark
When using Scope.Benchmark, all JMH worker threads share the same State instance, so fields accessed by benchmark methods must be thread-safe. Mutating a plain HashMap from multiple threads under Scope.Benchmark can corrupt the map or throw at runtime. Use ConcurrentHashMap for mutable shared state, or make fields effectively read-only, e.g. by wrapping them with Collections.unmodifiableMap().
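One cheap way to make shared read-only state safe under Scope.Benchmark is to freeze it after setup. A stdlib-only sketch (class and key names are illustrative):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class FrozenStateDemo {
    public static Map<String, Integer> buildFrozen() {
        Map<String, Integer> m = new HashMap<>();
        m.put("key-0", 0);
        m.put("key-1", 1);
        // The unmodifiable view rejects writes, so any accidental mutation
        // from a benchmark thread fails fast instead of silently corrupting the map.
        return Collections.unmodifiableMap(m);
    }

    public static void main(String[] args) {
        Map<String, Integer> frozen = buildFrozen();
        System.out.println(frozen.get("key-1")); // prints 1
        try {
            frozen.put("key-2", 2);
        } catch (UnsupportedOperationException e) {
            System.out.println("writes rejected"); // prints "writes rejected"
        }
    }
}
```

In a real benchmark you would build and freeze the map in a @Setup(Level.Trial) method and store only the frozen view in the State field.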
5. Avoiding Dead Code Elimination with Blackhole
Dead Code Elimination (DCE) is the most dangerous JMH pitfall. It's silent — your benchmark runs, reports results, and the numbers look plausible. They're just completely wrong because the JIT removed the code you thought you were measuring.
How the JIT Eliminates Dead Code
C2 compiler performs escape analysis and reachability analysis. If the result of a method call never escapes the method (never stored, never returned, never passed to another method), the call is proven to have no observable side effects and is eliminated. Similarly, constant folding replaces expressions with statically knowable inputs with their precomputed values at compile time.
Wrong vs Correct Benchmark Pattern
// ❌ WRONG — JIT will eliminate computeHash() if result is unused
@Benchmark
public void wrongBenchmark() {
// result computed but never used → DCE eliminates this
int hash = computeHash(data);
}
// ❌ ALSO WRONG — JIT may constant-fold if data is a static final
@Benchmark
public int alsoWrong() {
return computeHash(CONSTANT_DATA); // CONSTANT_DATA = "hello" → folded
}
// ✅ CORRECT — return value forces JVM to perform computation
@Benchmark
public int correctWithReturn() {
return computeHash(data); // returned value → not eliminated
}
// ✅ CORRECT — Blackhole sink consumes result preventing DCE
@Benchmark
public void correctWithBlackhole(Blackhole bh) {
bh.consume(computeHash(data));
}
// ✅ CORRECT — Multiple results consumed
@Benchmark
public void multipleResults(Blackhole bh) {
bh.consume(computeHash(data));
bh.consume(formatString(data));
bh.consume(parseJson(jsonData));
}
Blackhole.consumeCPU() for Timing Holes
Blackhole.consumeCPU(tokens) burns a predictable amount of CPU time without measurable memory allocation. It's used to simulate a "think time" between operations in producer/consumer benchmarks or to add realistic computational load to otherwise trivial benchmark methods:
@Benchmark
public void simulateWorkWithThinkTime(Blackhole bh) {
bh.consume(acquireLock());
Blackhole.consumeCPU(200); // burn ~200 abstract "tokens" of CPU work (tokens, not cycles)
releaseLock();
}
Return vs Blackhole: Which to Choose?
Both approaches prevent DCE. Use return when benchmarking a single computation that naturally returns a value — it's simpler and incurs minimal overhead. Use Blackhole when benchmarking multiple computations in one method, or when the method naturally returns void (e.g., writing to a buffer). Blackhole itself has negligible overhead but is designed to prevent both DCE and constant folding of its inputs.
6. Warmup Tuning & Fork Count Best Practices
Warmup iterations let the JIT compiler reach steady-state before measurement begins. Too few warmup iterations → measuring C1-compiled code, not peak C2 performance. Too many → wasted time without improved accuracy. Fork count controls JVM process isolation between benchmarks.
Warmup and Measurement Annotations
@Warmup(
iterations = 5, // number of warmup iterations
time = 1, // time per warmup iteration
timeUnit = TimeUnit.SECONDS
)
@Measurement(
iterations = 10, // number of measurement iterations
time = 1, // time per measurement iteration
timeUnit = TimeUnit.SECONDS
)
@Fork(
value = 3, // number of separate JVM processes (forks)
jvmArgs = {"-Xms512m", "-Xmx512m", "-XX:+UseG1GC"}
)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class WarmupExampleBenchmark {
@Benchmark
public long fibonacci() {
return fib(30);
}
}
Why Forking Matters
Each @Fork starts a fresh JVM process. This is critical for two reasons:
- JIT pollution isolation: If benchmarks A and B run in the same JVM, benchmark A's JIT profile can influence how B is compiled. Forked runs start with a clean JIT profile slate, giving independent measurements.
- Statistical independence: Multiple forks provide independent samples of JVM startup + warmup + measurement, giving you variance across JVM instances in addition to within-run variance.
Single fork (@Fork(1)) is acceptable for quick development-time checks but should never be used for CI baseline numbers. Use at least 3 forks for production benchmarks.
Recommended Configurations
| Context | Warmup | Measurement | Forks | Total Time |
|---|---|---|---|---|
| Development / Quick | 3 × 1s | 5 × 1s | 1 | ~8s / benchmark |
| CI / Regression | 5 × 1s | 10 × 1s | 2–3 | ~45–75s / benchmark |
| Production Baseline | 10 × 2s | 20 × 2s | 5 | ~5min / benchmark |
7. Real Benchmarks: HashMap vs ConcurrentHashMap, String vs StringBuilder
Let's write and interpret real JMH benchmarks that answer common Java performance questions. Both examples demonstrate complete, correct JMH usage you can run immediately.
HashMap vs ConcurrentHashMap: Read Performance
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 3)
@Threads(4) // @Threads takes a single value; vary concurrency with -t on the command line (e.g. -t 1, -t 8, -t 16)
public class MapBenchmark {
private Map<String, Integer> hashMap;
private ConcurrentHashMap<String, Integer> concurrentMap;
private String[] keys;
@Setup(Level.Trial)
public void setup() {
hashMap = new HashMap<>(10_000);
concurrentMap = new ConcurrentHashMap<>(10_000);
keys = new String[10_000];
for (int i = 0; i < 10_000; i++) {
String key = "key-" + i;
hashMap.put(key, i);
concurrentMap.put(key, i);
keys[i] = key;
}
}
@Benchmark
public Integer hashMapGet() {
// Safe only because the map is never modified after setup; HashMap is not thread-safe under concurrent writes
return hashMap.get(keys[ThreadLocalRandom.current().nextInt(keys.length)]);
}
@Benchmark
public Integer concurrentHashMapGet() {
// Thread-safe, non-blocking reads
return concurrentMap.get(keys[ThreadLocalRandom.current().nextInt(keys.length)]);
}
}
Benchmark Results (Approximate, JDK 21, M2 Pro)
| Benchmark | Threads | ops/s (approx) | Notes |
|---|---|---|---|
| hashMapGet | 1 | ~420M ops/s | Fastest single-thread |
| concurrentHashMapGet | 1 | ~380M ops/s | ~10% overhead vs HashMap |
| concurrentHashMapGet | 8 | ~2.8B ops/s | Near-linear scaling |
| concurrentHashMapGet | 16 | ~4.9B ops/s | Still scaling at 16 threads |
Takeaway: HashMap is ~10% faster for single-threaded reads, but ConcurrentHashMap's read scalability makes it the clear winner in any multi-threaded scenario. The read performance is non-blocking due to volatile reads on node values.
String Concatenation vs StringBuilder vs StringJoiner
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 10, time = 1)
@Fork(2)
public class StringBenchmark {
@Param({"5", "20", "100"}) // Benchmark with different iteration counts
private int iterations;
@Benchmark
public String stringPlusConcat() {
String result = "";
for (int i = 0; i < iterations; i++) {
result += "part-" + i; // Creates new String each iteration
}
return result;
}
@Benchmark
public String stringBuilder() {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < iterations; i++) {
sb.append("part-").append(i);
}
return sb.toString();
}
@Benchmark
public String stringJoiner() {
StringJoiner sj = new StringJoiner(", ");
for (int i = 0; i < iterations; i++) {
sj.add("part-" + i);
}
return sj.toString();
}
@Benchmark
public String streamJoin() {
return IntStream.range(0, iterations)
.mapToObj(i -> "part-" + i)
.collect(Collectors.joining(", "));
}
}
At 100 iterations, String + concatenation is roughly 40–60× slower than StringBuilder due to its quadratic copying behavior. At 5 iterations, the invokedynamic-based string concatenation that javac has emitted since Java 9 (JEP 280) nearly closes the gap. Use StringBuilder for loops, Collectors.joining() for streams, and don't obsess over String + for 2–3 concatenations.
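Note that the variants above do not produce identical strings: the StringBuilder version as written inserts no delimiter, while StringJoiner and Collectors.joining() insert ", ". A quick stdlib check of the two delimiter-aware variants (helper names are illustrative):

```java
import java.util.StringJoiner;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class JoinDemo {
    public static String viaJoiner(int n) {
        StringJoiner sj = new StringJoiner(", ");
        for (int i = 0; i < n; i++) sj.add("part-" + i);
        return sj.toString();
    }

    public static String viaStream(int n) {
        return IntStream.range(0, n)
                .mapToObj(i -> "part-" + i)
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        System.out.println(viaJoiner(3)); // part-0, part-1, part-2
        System.out.println(viaStream(3)); // part-0, part-1, part-2
    }
}
```

When comparing alternatives in JMH, always verify they compute the same result first; a "faster" variant that produces different output is not a fair comparison.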
8. Profilers & Async-Profiler Integration with JMH
JMH integrates with several profilers via the -prof flag. Profiler output answers why a benchmark performs as it does, not just how fast it is.
-prof gc: GC and Allocation Profiling
# Run benchmark with GC profiler
java -jar benchmarks.jar StringBenchmark -prof gc
# Sample output (allocation rate per operation):
# StringBenchmark.stringPlusConcat:·gc.alloc.rate 237.4 MB/sec
# StringBenchmark.stringPlusConcat:·gc.alloc.rate.norm 24576.0 B/op ← LARGE
# StringBenchmark.stringBuilder:·gc.alloc.rate.norm 1056.0 B/op ← SMALL
# StringBenchmark.streamJoin:·gc.alloc.rate.norm 3200.0 B/op
The gc.alloc.rate.norm metric (bytes allocated per operation) is one of the most valuable for identifying GC pressure. High allocation rates trigger frequent young-gen collections, adding jitter to latency measurements.
-prof async: Async-Profiler Flame Graphs
# Download async-profiler 3.x first, then:
java -jar benchmarks.jar HashMapBenchmark \
-prof "async:libPath=/path/to/libasyncProfiler.so;output=flamegraph;dir=profiles"
# Or with event-based profiling
java -jar benchmarks.jar \
-prof "async:event=alloc;output=flamegraph;dir=profiles"
Async-profiler uses Linux perf_events (CPU), Java TLAB events (allocation), or lock events to generate flame graphs. Open the generated .html flame graph in a browser. Wide frames indicate hot code paths. Look for unexpected frames from serialization, reflection, or logging inside benchmark loops.
-prof jfr: Java Flight Recorder
# Capture JFR recording during benchmark
java -jar benchmarks.jar MyBenchmark -prof jfr
# JFR file saved to working directory, open in JMC (Java Mission Control)
# or convert to flame graph with JFR to async-profiler converter
-prof stack: Stack Sampling
# Built-in stack sampler — no native library needed
java -jar benchmarks.jar MyBenchmark -prof stack
# Output: sorted stack frame frequencies
# 47.3% com.example.MyClass.hotMethod()
# 23.1% java.util.HashMap.get()
# 12.8% java.lang.String.hashCode()
The built-in -prof stack profiler requires no native libraries and works on all platforms. It's less accurate than async-profiler (uses JVMTI safepoint-based sampling which misses certain frames) but is perfect for initial investigation. Use async-profiler for production-grade flame graphs.
9. Continuous Benchmarking in CI/CD: Regression Detection
Running benchmarks in CI turns performance from a reactive investigation into a proactive quality gate. The pipeline: run JMH → emit JSON results → compare against stored baseline → fail build on >10% regression.
JMH JSON Output
# Output results in JSON format for programmatic processing
java -jar benchmarks.jar \
-rf json \
-rff build/reports/jmh/results.json \
-wi 5 -w 1s -i 10 -r 1s -f 2
# Gradle equivalent in build.gradle.kts
jmh {
resultsFile.set(project.file("build/reports/jmh/results.json"))
resultFormat.set("JSON")
warmupIterations.set(5)
iterations.set(10)
fork.set(2)
}
Regression Detection Script
#!/usr/bin/env python3
# check_regression.py: exits with code 1 if any benchmark regressed past the threshold
import json
import os
import shutil
import sys

THRESHOLD = 0.10  # 10% regression threshold

def load_results(path):
    with open(path) as f:
        data = json.load(f)  # JMH JSON output is a list of benchmark entries
    return {b["benchmark"]: b["primaryMetric"]["score"] for b in data}

baseline_path = os.environ.get("BASELINE_RESULTS", "baseline/results.json")
current_path = sys.argv[1] if len(sys.argv) > 1 else "build/reports/jmh/results.json"

if not os.path.exists(baseline_path):
    print("No baseline found; saving current results as baseline")
    os.makedirs("baseline", exist_ok=True)
    shutil.copy(current_path, baseline_path)
    sys.exit(0)

baseline = load_results(baseline_path)
current = load_results(current_path)

regressions = []
for name, curr_score in current.items():
    if name not in baseline:
        continue
    base_score = baseline[name]
    # NOTE: this check assumes Throughput mode, where higher scores are better.
    # For time-based modes (AverageTime, SampleTime) the sign flips: a regression
    # there is an INCREASE in score, i.e. (curr_score - base_score) / base_score.
    change = (base_score - curr_score) / base_score
    if change > THRESHOLD:
        regressions.append((name, base_score, curr_score, change * 100))

if regressions:
    print("❌ PERFORMANCE REGRESSIONS DETECTED:")
    for name, base, curr, pct in regressions:
        print(f"  {name}: {base:.0f} → {curr:.0f} ops/s ({pct:.1f}% regression)")
    sys.exit(1)
else:
    print(f"✅ No regressions detected (threshold: {THRESHOLD*100:.0f}%)")
    sys.exit(0)
GitHub Actions Workflow
# .github/workflows/benchmarks.yml
name: JMH Benchmark Regression Check

on:
  push:
    branches: [ main ]       # needed so the baseline-update step below can run
  pull_request:
    branches: [ main ]
  schedule:
    - cron: '0 2 * * 1'      # Weekly Monday 2AM for baseline refresh

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'
      - name: Cache Gradle packages
        uses: actions/cache@v4
        with:
          path: ~/.gradle/caches
          key: gradle-${{ hashFiles('**/*.gradle.kts') }}
      - name: Download baseline results
        uses: actions/download-artifact@v4
        with:
          name: jmh-baseline
          path: baseline/
        continue-on-error: true   # Don't fail if no baseline yet
      - name: Run JMH benchmarks
        run: ./gradlew jmh
      - name: Check for regressions
        run: python3 scripts/check_regression.py build/reports/jmh/results.json
      - name: Upload benchmark results as artifact
        uses: actions/upload-artifact@v4
        with:
          name: jmh-results-${{ github.sha }}
          path: build/reports/jmh/results.json
      - name: Update baseline on main branch push
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: jmh-baseline
          path: build/reports/jmh/results.json
CI Infrastructure Considerations
- Use dedicated benchmark runners: Shared CI runners have variable CPU frequency and background noise. Use runs-on: self-hosted with a dedicated machine, or at minimum a fixed instance type (e.g. c5.2xlarge on EC2), for consistency.
- Pin CPU frequency: On Linux runners, run cpupower frequency-set -g performance to reduce frequency-scaling variance.
- Use relative comparison only: Compare the current run against the previous run on the same hardware. Absolute numbers vary between machines.
- Calibrate the regression threshold: CI noise on shared runners is typically 5–15%. Use a 15–20% threshold on shared runners and 5–8% on dedicated hardware.
10. JMH Pitfalls, Anti-Patterns & Conclusion
Even experienced engineers make these JMH mistakes. This reference table summarizes the most common anti-patterns, their effects on benchmark validity, and the correct approach.
| Mistake | Effect | Correction |
|---|---|---|
| No Blackhole consumption / no return | DCE eliminates benchmark body, reports ~0 ns | Always consume the result via return or Blackhole |
| Scope.Benchmark + mutable state + multi-thread | Data corruption, NPE, or measures lock contention not the algorithm | Use Scope.Thread for mutable state; use Scope.Benchmark only for read-only or thread-safe structures |
| @Fork(1) for final results | Only one JVM sample; run-to-run JIT/GC variance goes unmeasured | Use @Fork(3) minimum for CI; @Fork(5) for baselines |
| Measuring I/O or network in JMH | Results dominated by OS/network jitter, not code under test | Mock I/O; JMH is for CPU/memory micro-benchmarks only |
| Missing @State (fields in benchmark class) | JMH may null-out fields or constant-fold them; NPE or DCE | Always put benchmark inputs in a @State class |
| Synchronization in benchmark method | Benchmark measures lock contention overhead, not the algorithm | Use Scope.Thread to isolate threads; add contention explicitly with @Group if needed |
| Too-short measurement time (100ms/iter) | High coefficient of variation (CV >5%) — results not reproducible | Use at least 1s per iteration; check CV in JMH output |
| Level.Invocation @Setup for sub-µs benchmarks | Setup overhead dwarfs measurement; results include setup cost | Use Level.Trial or Level.Iteration; reserve Invocation for multi-ms operations |
Reading JMH Output Correctly
# Typical JMH output:
Benchmark Mode Cnt Score Error Units
HashMapBenchmark.get thrpt 60 419432.3 ± 12341.2 ops/ms
ConcurrentHashMapBenchmark.get thrpt 60 381023.7 ± 8923.4 ops/ms
# ± Error = 99.9% confidence interval half-width (not standard deviation)
# Cnt = measurement iterations × forks; here 20 iterations × 3 forks = 60
# If Error/Score > 5%, results are noisy: increase iterations or fork count
Conclusion: JMH Is Non-Negotiable for Java Performance Work
JMH is not optional for Java performance measurement — it's the only correct tool for micro-benchmarking on the JVM. Any alternative (System.nanoTime loops, JUnit-based timing, hand-written benchmarks) will produce unreliable results due to JIT warmup, dead code elimination, OSR effects, and measurement timing errors that JMH specifically addresses.
The professional workflow in 2026: write JMH benchmarks alongside unit tests, run them in CI with JSON output, compare against stored baselines, and gate PRs on regression thresholds. This transforms performance from "we think it's still fast" to a measured, version-controlled, automatically enforced quality attribute — the same way correctness is enforced by unit tests.
- ✅ Use @BenchmarkMode(Mode.Throughput) for throughput comparisons, Mode.AverageTime for latency, Mode.SampleTime for tail-latency analysis
- ✅ Always return results or consume them with Blackhole to prevent DCE
- ✅ Always warm up (≥5 iterations at 1s each) and fork (≥2, ideally 3–5)
- ✅ Store benchmark inputs in @State fields, never in static final constants the JIT can fold
- ✅ Add -prof gc to identify allocation-heavy benchmarks and guide GC tuning
- ✅ Run in CI with JSON output and a regression check script gating every PR