Project CRaC for Java: Instant Cold Starts, Safe Checkpoints, and Production Playbooks

Project CRaC lets JVM services resume from a warmed checkpoint in milliseconds instead of paying full cold-start cost on every boot. This article focuses on practical checkpoint/restore operations, safety boundaries, and production guardrails for Java services running in cloud-native environments.

Md Sanwar Hossain · March 22, 2026 · 15 min read · Core Java
TL;DR

"Project CRaC for Java: instant cold starts, safe checkpoints, and production playbooks for JVM services on serverless, Kubernetes, and edge environments."

Table of Contents

  1. Introduction
  2. Why CRaC Now
  3. How CRaC Works
  4. Baseline Setup
  5. Application Lifecycle with CRaC
  6. Secret Management and Rotation
  7. Filesystem and Snapshot Strategy
  8. JVM Flags That Matter
  9. Containerization Patterns
  10. Benchmarks: What to Expect
  11. Troubleshooting Playbook
  12. Observability and Testing
  13. Governance and Operational Discipline
  14. Production Checklist
  15. Read Full Blog Here

Introduction

Figure: Java CRaC architecture (mdsanwarhossain.me)

Project CRaC (Coordinated Restore at Checkpoint) lets the JVM pause a fully warmed process and resume it in milliseconds. Instead of paying for class loading, dependency injection, and JIT warm-up on every cold start, you checkpoint a live runtime, serialize the process image, and restore it on demand. The payoff is largest for serverless and bursty workloads, where cold starts translate directly into latency SLO violations. CRaC also demands a new discipline: resource hygiene, snapshot-friendly filesystems, and restart-aware secrets. The orchestration rigor will feel familiar to teams that already practice scoped lifecycles in structured concurrency, except that here it applies to an entire JVM.

Why CRaC Now

How CRaC Works

Figure: Spring Boot with CRaC, architecture and checkpoint/restore flow (mdsanwarhossain.me)

CRaC coordinates JVM safepoints with OS-level checkpointing (via CRIU on Linux). When a checkpoint is requested, CRaC:

  1. Invokes resource hooks (beforeCheckpoint on every registered org.crac.Resource) so application code can release or quiesce external handles.
  2. Brings the JVM to a stable safepoint and hands control to CRIU. JIT-compiled code and metaspace state stay in process memory, so they become part of the image.
  3. CRIU serializes process memory, open file descriptors, CPU registers, and kernel state into an image directory.
  4. On restore, CRIU rehydrates the process, the JVM calls afterRestore on each registered Resource so application code can reattach external handles, and execution resumes from the checkpoint moment.

The win is that classpath scanning, Spring context boot, connection pool initialization, and JIT compilation have already happened. The risk is that anything time-sensitive, host-specific, or externally revoked must be rebuilt during restore.
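The hook ordering described above can be sketched in plain Java. This is a minimal illustration, not the CRaC API itself: `CheckpointAware`, `ManagedConnection`, and `HookRegistry` are hypothetical names, and `CheckpointAware` is a stand-in for `org.crac.Resource` (whose real methods also receive a `Context` argument and may throw) so that the sketch compiles without the `org.crac` dependency. The notification order used here (reverse registration order before checkpoint, registration order after restore) is an assumption of the sketch; check the `org.crac` Javadoc for the contract of the context you register with.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for org.crac.Resource (hypothetical, simplified signatures).
interface CheckpointAware {
    void beforeCheckpoint();
    void afterRestore();
}

// Hypothetical wrapper around an external handle such as a DB connection.
final class ManagedConnection implements CheckpointAware {
    private final String name;
    private final List<String> log;
    private boolean open = true;

    ManagedConnection(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    public void beforeCheckpoint() {
        open = false;               // release the handle so it is not captured in the image
        log.add(name + ":close");
    }

    @Override
    public void afterRestore() {
        open = true;                // rebuild the handle against the current environment
        log.add(name + ":open");
    }

    boolean isOpen() {
        return open;
    }
}

// Hypothetical registry that mimics a checkpoint/restore notification cycle.
final class HookRegistry {
    private final List<CheckpointAware> resources = new ArrayList<>();

    void register(CheckpointAware resource) {
        resources.add(resource);
    }

    // Assumed order: last registered is quiesced first.
    void checkpoint() {
        for (int i = resources.size() - 1; i >= 0; i--) {
            resources.get(i).beforeCheckpoint();
        }
    }

    // Assumed order: first registered is reattached first.
    void restore() {
        for (CheckpointAware resource : resources) {
            resource.afterRestore();
        }
    }
}
```

In a real service the equivalent class would implement `org.crac.Resource` and be registered with `Core.getGlobalContext().register(...)`; keep a strong reference to the registered object so it is not garbage collected before the checkpoint.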

Baseline Setup

Prerequisites for production-quality CRaC images:

Figure: Java CRaC checkpoint/restore (mdsanwarhossain.me)

Application Lifecycle with CRaC

A minimal Spring Boot flow:

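  1. Start the app on a CRaC-enabled JDK with -XX:CRaCCheckpointTo=<image-path> so the JVM knows where to write the image.
  2. Warm caches and execute synthetic traffic until latency plateaus.
  3. Trigger the checkpoint with jcmd <pid> JDK.checkpoint, or expose an internal admin endpoint that calls Core.checkpointRestore().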
  4. CRIU writes the image to disk; the process exits.
  5. Restore by invoking java -XX:CRaCRestoreFrom=<image-path>; the process resumes within milliseconds.
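The three commands in this flow can be assembled programmatically, for example by a CI step that drives them through `ProcessBuilder`. The sketch below only builds the argument lists. `CracCommands` and the `app.jar` name used in the usage note are hypothetical; the flags and the `jcmd` command are the ones named in the steps above.

```java
import java.util.List;

// Hypothetical helper that builds the command lines for a checkpoint/restore cycle.
final class CracCommands {
    private CracCommands() {
    }

    // Step 1: start the service so that a later checkpoint is written to imageDir.
    static List<String> startForCheckpoint(String imageDir, String jarPath) {
        return List.of("java", "-XX:CRaCCheckpointTo=" + imageDir, "-jar", jarPath);
    }

    // Step 3: ask the running JVM to checkpoint itself.
    static List<String> triggerCheckpoint(long pid) {
        return List.of("jcmd", Long.toString(pid), "JDK.checkpoint");
    }

    // Step 5: resume the process from the image directory.
    static List<String> restore(String imageDir) {
        return List.of("java", "-XX:CRaCRestoreFrom=" + imageDir);
    }
}
```

A pipeline step might call `new ProcessBuilder(CracCommands.startForCheckpoint("/opt/crac-images", "app.jar")).inheritIO().start()`, send warm-up traffic, run the `triggerCheckpoint` command against that process's PID, and then smoke-test the process started by `restore`.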

Use a CI/CD step to create fresh images on every build, run smoke tests against the restored process, then publish OCI images with the checkpoint directory baked in.

Secret Management and Rotation

Stale secrets are the most common cause of failure after restore, because tokens captured in the image can expire or be revoked while the image sits at rest.

Test worst-case timing: create a checkpoint, rotate secrets immediately, then restore. Verify hooks fetch fresh credentials and that TLS handshakes succeed.
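One way to make that drill pass is to keep credentials out of the image and fetch them again after restore. The sketch below assumes a hypothetical `RefreshingToken` holder whose `beforeCheckpoint` and `afterRestore` methods would be called from the corresponding CRaC resource hooks; the `Supplier` stands in for whatever secret store the service uses.

```java
import java.time.Clock;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical credential holder: drops the token before checkpoint, refetches after restore.
final class RefreshingToken {
    record Token(String value, Instant expiresAt) {
    }

    private final Supplier<Token> fetcher; // stand-in for a call to the secret store
    private final Clock clock;
    private Token current;

    RefreshingToken(Supplier<Token> fetcher, Clock clock) {
        this.fetcher = fetcher;
        this.clock = clock;
    }

    // Call from the beforeCheckpoint hook: the secret must not be serialized into the image.
    synchronized void beforeCheckpoint() {
        current = null;
    }

    // Call from the afterRestore hook: always start from a fresh credential.
    synchronized void afterRestore() {
        current = fetcher.get();
    }

    // Lazily fetches when no token is held or the held token has expired.
    synchronized String get() {
        if (current == null || !clock.instant().isBefore(current.expiresAt())) {
            current = fetcher.get();
        }
        return current.value();
    }
}
```

The expiry check in `get()` is a second line of defense: even if a hook is missed, an expired token is replaced on first use rather than sent to a downstream service.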

Filesystem and Snapshot Strategy

CRaC images capture file descriptors and inode references. Sloppy filesystem choices create brittle restores.

If using persistent volumes, take a filesystem snapshot right after checkpointing so the CRIU image and any referenced files stay consistent. For serverless distributions, bundle the image within the container layer to keep restore self-contained.

JVM Flags That Matter

Pair these with normal GC flags (e.g., G1 or ZGC tuned for your heap). Keep heap sizing identical between checkpoint creation and restore to avoid unexpected remapping overhead.

Containerization Patterns

CRaC images inside containers need extra care:

  1. Image build pipeline: Stage 1 builds the app and warms it. Stage 2 copies the CRIU dump into /opt/crac-images. Stage 3 produces the runnable image with only the restore bits and the CRaC-enabled JDK.
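  2. Capabilities: Grant CAP_CHECKPOINT_RESTORE (available since Linux 5.9) and CAP_SYS_PTRACE, and add CAP_SYS_ADMIN only if your kernel or CRIU build still requires it. Running as --privileged also works, but prefer the minimal capability set.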
  3. Init containers: Use an init container to prefetch secrets and mount tmpfs directories before the main container restores.
  4. Orchestrator hooks: Kubernetes lifecycle hooks (postStart) can trigger a health check or mini-warm-up after restore to repopulate any caches intentionally cleared before checkpoint.

Ensure the restored pod’s CPU and memory requests match the checkpointed environment. Mismatches can cause cgroup-related restore errors or unpredictable GC behavior.
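A service can also check this from the inside. The sketch below records what the JVM reports for CPU count and maximum heap before the checkpoint and compares it with the values seen after restore. `RuntimeFingerprint` is a hypothetical name, and the sketch assumes the JVM reports current values for these calls after a restore, which is worth verifying on your JDK build.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that compares the runtime environment at checkpoint and at restore.
final class RuntimeFingerprint {
    record Snapshot(int cpus, long maxHeapBytes) {
    }

    private RuntimeFingerprint() {
    }

    // Capture what the JVM currently reports; call before checkpoint and again after restore.
    static Snapshot capture() {
        Runtime runtime = Runtime.getRuntime();
        return new Snapshot(runtime.availableProcessors(), runtime.maxMemory());
    }

    // Returns one human-readable line per mismatch; an empty list means the environments agree.
    static List<String> diff(Snapshot atCheckpoint, Snapshot atRestore) {
        List<String> mismatches = new ArrayList<>();
        if (atCheckpoint.cpus() != atRestore.cpus()) {
            mismatches.add("cpus: " + atCheckpoint.cpus() + " -> " + atRestore.cpus());
        }
        if (atCheckpoint.maxHeapBytes() != atRestore.maxHeapBytes()) {
            mismatches.add("maxHeapBytes: " + atCheckpoint.maxHeapBytes()
                    + " -> " + atRestore.maxHeapBytes());
        }
        return mismatches;
    }
}
```

Logging the result of `diff` from an after-restore hook gives on-call engineers a quick signal when a pod was restored with different resource settings than the one that produced the image.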

Benchmarks: What to Expect

Representative measurements from a Spring Boot 3.3 service with 120 beans, JPA, and a Redis cache (8 vCPU, 8 GB RAM, G1 GC):

Expect larger images (300–500 MB) for services with heavy caches; balance checkpoint depth (e.g., emptying caches before checkpoint) against restore latency goals.

Run benchmarks in three phases: (1) a full cold boot to capture the baseline; (2) warm-up traffic for at least 5 minutes with representative load to populate JIT profiles and caches; (3) checkpoint and restore, followed immediately by the same load against the restored service. Compare tail latencies, GC pause times, and throughput across the phases. Pay attention to page faults after restore: a high minor fault count may indicate that the image holds more than the working set needs, in which case try trimming caches before checkpoint or experiment with -XX:+AlwaysPreTouch. Capture perf profiles pre- and post-restore to confirm code cache addresses remain stable.
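Comparing tail latencies between phases comes down to computing the same percentile over each phase's samples. The sketch below uses the nearest-rank method; `LatencyCompare` and its method names are hypothetical, and the unit of the samples (for example milliseconds) is whatever your load tool emits.

```java
import java.util.Arrays;

// Hypothetical helper for comparing latency percentiles between benchmark phases.
final class LatencyCompare {
    private LatencyCompare() {
    }

    // Nearest-rank percentile: the smallest sample such that at least p percent
    // of all samples are less than or equal to it.
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Ratio above 1.0 means the restored phase is slower than the baseline at percentile p.
    static double regressionRatio(long[] baseline, long[] restored, double p) {
        return (double) percentile(restored, p) / percentile(baseline, p);
    }
}
```

For example, `LatencyCompare.regressionRatio(warmSamples, restoredSamples, 99.0)` compares p99 of the warmed phase with p99 right after restore, where both arrays are hypothetical sample sets collected by the load generator.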

Troubleshooting Playbook

Create synthetic chaos drills: kill the restored process mid-traffic, rotate secrets, change DNS, and ensure hooks reestablish healthy state without manual intervention.

Observability and Testing

CRaC complicates metrics because a restored process never went through a full boot, so uptime and startup metrics no longer mean what dashboards usually assume.

Correlate checkpoint images with release versions in your telemetry backend so on-call engineers can filter graphs by image hash. Emit counters for failed restores by reason (capability, CRIU, secrets, sockets) to reveal systemic gaps. Include CRaC events in structured logs with request IDs so that a restore that happens mid-transaction can be reconstructed alongside API traces.
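A per-reason failure counter can be as small as the sketch below. `RestoreFailureCounters` is a hypothetical class; the reason names mirror the categories listed above, and in practice the values would be exported through whatever metrics library the service already uses.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process counters for failed restores, keyed by failure reason.
final class RestoreFailureCounters {
    enum Reason { CAPABILITY, CRIU, SECRETS, SOCKETS }

    // The map is fully populated in the constructor and never modified afterwards,
    // so concurrent reads are safe; LongAdder handles concurrent increments.
    private final Map<Reason, LongAdder> counters = new EnumMap<>(Reason.class);

    RestoreFailureCounters() {
        for (Reason reason : Reason.values()) {
            counters.put(reason, new LongAdder());
        }
    }

    void record(Reason reason) {
        counters.get(reason).increment();
    }

    long count(Reason reason) {
        return counters.get(reason).sum();
    }
}
```

Tagging each exported counter with the checkpoint image hash lets dashboards separate a bad image from a bad environment.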

Governance and Operational Discipline

CRaC adds new requirements to change management:

  1. Image provenance: Version checkpoint images alongside application builds; include Git SHA and dependency bill of materials in the image metadata.
  2. Retention policy: Prune old images aggressively; stale checkpoints often hold expired secrets and grow storage costs.
  3. Compliance: Treat checkpoint directories as sensitive—encrypt at rest, restrict read access, and scrub secrets before checkpoint.
  4. Runbooks: Document restore failures with clear steps to regenerate checkpoints. Cross-train SREs and developers; CRaC is not a magic button.
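The provenance and retention points can be combined in a small pruning rule: each image carries the Git SHA of its build, and anything older than a cutoff is selected for deletion. The sketch below is one possible shape, with `ImageRetention`, `ImageRecord`, and the age-based policy all being assumptions rather than anything CRaC prescribes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical retention rule for checkpoint images tracked by build metadata.
final class ImageRetention {
    // gitSha ties the image to an application build; createdAt drives retention.
    record ImageRecord(String gitSha, Instant createdAt) {
    }

    private ImageRetention() {
    }

    // Returns the images created strictly before (now - maxAge), in their original order.
    static List<ImageRecord> selectForPruning(List<ImageRecord> images, Instant now, Duration maxAge) {
        Instant cutoff = now.minus(maxAge);
        return images.stream()
                .filter(image -> image.createdAt().isBefore(cutoff))
                .toList();
    }
}
```

A scheduled job could call `selectForPruning` with the registry's image list and delete the returned entries, which also limits how long any secret accidentally captured in an image can survive.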

Production Checklist

Tags:

Java CRaC, coordinated restore at checkpoint, JVM cold start, Spring Boot CRaC, checkpoint snapshot Java

Read Full Blog Here

The full deep dive, with diagrams of checkpoint/restore call flows, structured task scopes that manage warm-up and post-restore hooks, and additional kernel tuning notes, is available at https://mdsanwarhossain.me/blog-java-structured-concurrency.html.

Last updated: March 22, 2026