Spring Retry & Resilience4j: Complete Fault Tolerance Guide for Spring Boot Microservices (2026)
Transient network blips, momentary service restarts, and cascading failures are the reality of distributed systems. In this guide, you will master every fault-tolerance tool available in the Spring Boot ecosystem — from simple @Retryable annotations to production-grade Resilience4j circuit breakers, bulkheads, rate limiters, and time limiters — with complete Java configuration and production-ready code.
Spring Retry handles simple retry scenarios via @Retryable / RetryTemplate with exponential backoff and jitter. Resilience4j is the production-grade choice: it provides CircuitBreaker (CLOSED → OPEN → HALF_OPEN state machine), Retry, Bulkhead (thread-pool or semaphore isolation), RateLimiter, and TimeLimiter — all with Micrometer metrics and Actuator health endpoints. Combine them in the correct order: Bulkhead → CircuitBreaker → Retry → TimeLimiter so each decorator protects the next.
Table of Contents
- Why Fault Tolerance? Cascading Failures in Microservices
- Spring Retry: @Retryable, @Recover, RetryTemplate
- Exponential Backoff with Jitter (Configuration Deep Dive)
- Resilience4j vs Hystrix (Migration Guide)
- Resilience4j CircuitBreaker (CLOSED / OPEN / HALF_OPEN States)
- Resilience4j Retry (decorateFunction, @Retry Annotation)
- Bulkhead Pattern (ThreadPoolBulkhead vs SemaphoreBulkhead)
- RateLimiter and TimeLimiter
- Combining Patterns (Order Matters)
- Actuator Integration and Metrics
- Testing Fault Tolerance with WireMock
- Production Checklist
1. Why Fault Tolerance? Cascading Failures in Microservices
In a monolith, a slow database query blocks a thread but does not propagate failure horizontally. In a microservices system, Service A calls Service B which calls Service C. When Service C degrades — perhaps due to a GC pause, a database lock, or a noisy-neighbor on a shared Kubernetes node — threads in Service B start queuing waiting for C. Service B's thread pool fills up and starts rejecting Service A's requests, which then back up into Service A's thread pool. Within seconds an isolated failure in one leaf service becomes a full system outage: the cascading failure.
Fault tolerance patterns break these cascade chains. The key patterns and their roles are:
- Retry — automatically repeat a transient failure (network blip, 503) without surfacing it to the caller
- Circuit Breaker — stop calling a downstream that is clearly unhealthy; fail fast and return a fallback
- Bulkhead — limit concurrency per dependency so one slow service cannot exhaust the global thread pool
- Rate Limiter — throttle outgoing calls to protect downstream services from being overwhelmed
- Time Limiter — cancel a call after a deadline to bound the worst-case latency contribution
| Pattern | Problem Solved | Library |
|---|---|---|
| Retry | Transient network / service errors | Spring Retry, Resilience4j |
| Circuit Breaker | Cascading failures, sustained outages | Resilience4j |
| Bulkhead | Thread pool exhaustion, resource isolation | Resilience4j |
| Rate Limiter | Downstream overload, quota enforcement | Resilience4j |
| Time Limiter | Unbounded latency, thread starvation | Resilience4j |
2. Spring Retry: @Retryable, @Recover, RetryTemplate
Spring Retry adds retry capability to any Spring bean method with minimal configuration. Add the dependency and enable retries with @EnableRetry:
<dependency>
<groupId>org.springframework.retry</groupId>
<artifactId>spring-retry</artifactId>
</dependency>
<!-- Spring Retry requires AOP -->
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-aspects</artifactId>
</dependency>
@SpringBootApplication
@EnableRetry
public class PaymentServiceApplication {
public static void main(String[] args) {
SpringApplication.run(PaymentServiceApplication.class, args);
}
}
Annotate the method you want retried. The @Recover method is invoked when all retry attempts are exhausted:
@Service
public class PaymentGatewayClient {
@Retryable(
retryFor = {HttpServerErrorException.class, ResourceAccessException.class},
maxAttempts = 3,
backoff = @Backoff(delay = 500, multiplier = 2.0, maxDelay = 5000, random = true)
)
public PaymentResponse charge(ChargeRequest request) {
log.info("Attempting charge for orderId={}", request.getOrderId());
return restTemplate.postForObject("/payments/charge", request, PaymentResponse.class);
}
@Recover
public PaymentResponse chargeRecover(Exception ex, ChargeRequest request) {
log.error("All retries exhausted for orderId={}: {}", request.getOrderId(), ex.getMessage());
// Return a graceful degraded response or throw a domain exception
throw new PaymentServiceUnavailableException("Payment gateway unavailable. Please retry later.");
}
}
For programmatic retry logic use RetryTemplate — useful when you need retry inside a non-Spring-managed class or want full control over the policy:
@Configuration
public class RetryConfig {
@Bean
public RetryTemplate retryTemplate() {
ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
backOff.setInitialInterval(300);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(10_000);
SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(3,
Map.of(HttpServerErrorException.class, true,
ResourceAccessException.class, true));
RetryTemplate template = new RetryTemplate();
template.setBackOffPolicy(backOff);
template.setRetryPolicy(retryPolicy);
template.registerListener(new RetryListenerSupport() {
@Override
public <T, E extends Throwable> void onError(
RetryContext ctx, RetryCallback<T, E> cb, Throwable t) {
log.warn("Retry attempt {} failed: {}", ctx.getRetryCount(), t.getMessage());
}
});
return template;
}
}
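Using the template is then a one-liner around the remote call. A minimal sketch, assuming a gatewayClient field comparable to the earlier example (the recovery callback plays the role of @Recover; PaymentResponse.unavailable is a hypothetical degraded-response factory):
// First argument: the retried operation; second: recovery once all attempts are exhausted
PaymentResponse response = retryTemplate.execute(
        context -> gatewayClient.charge(request),           // retried on the configured exceptions
        context -> {                                         // recovery after exhaustion
            log.error("All {} attempts failed", context.getRetryCount());
            return PaymentResponse.unavailable(request.getOrderId());   // hypothetical factory
        });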
3. Exponential Backoff with Jitter (Configuration Deep Dive)
Naive exponential backoff causes a thundering herd: if 100 microservice instances all restart at the same time after a shared dependency outage, they all retry at the same exponential intervals and hit the recovering service in synchronized waves. Adding jitter (randomization) spreads the retries across time, dramatically reducing load spikes during recovery.
| Strategy | Formula | Best For |
|---|---|---|
| Fixed | delay = constant | Simple retries, low concurrency |
| Exponential | delay = initial × multiplier^n | Most service-to-service calls |
| Full Jitter | delay = random(0, cap) | High-concurrency, many retrying clients |
| Decorrelated Jitter | delay = random(base, prev×3) | Best spread (AWS recommendation) |
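To make the difference concrete, here is a small self-contained sketch (not from any library) that prints the first few delays produced by plain exponential backoff versus full jitter; with 100 synchronized clients, only the jittered column spreads the load:
import java.util.concurrent.ThreadLocalRandom;

public class BackoffDemo {
    // Deterministic exponential backoff: initial * multiplier^attempt, capped at capMs
    static long exponential(long initialMs, double multiplier, int attempt, long capMs) {
        return Math.min(capMs, (long) (initialMs * Math.pow(multiplier, attempt)));
    }

    // Full jitter: pick uniformly between 0 and the exponential delay
    static long fullJitter(long initialMs, double multiplier, int attempt, long capMs) {
        return ThreadLocalRandom.current().nextLong(exponential(initialMs, multiplier, attempt, capMs) + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.printf("attempt %d: exponential=%dms, fullJitter=%dms%n",
                    attempt,
                    exponential(500, 2.0, attempt, 10_000),
                    fullJitter(500, 2.0, attempt, 10_000));
        }
    }
}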
# Spring Retry's @Backoff is configured on the annotation itself, not in YAML:
# @Backoff(delay=500, multiplier=2.0, maxDelay=10000, random=true)
# random=true switches to ExponentialRandomBackOffPolicy: each interval is drawn
# uniformly between the deterministic delay and delay * multiplier
# For Resilience4j exponential backoff with jitter:
resilience4j:
retry:
instances:
paymentService:
max-attempts: 3
wait-duration: 500ms
exponential-backoff-multiplier: 2.0
exponential-max-wait-duration: 10s
enable-exponential-backoff: true
randomized-wait-factor: 0.5 # jitter: 50% of wait-duration
retry-exceptions:
- org.springframework.web.client.HttpServerErrorException
- java.net.ConnectException
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
Duration.ofMillis(200), // initialInterval
2.0, // multiplier
0.5, // randomizationFactor (jitter)
Duration.ofSeconds(8) // maxInterval
))
.retryOnException(e -> e instanceof HttpServerErrorException
|| e instanceof ConnectException)
.build();
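To put that config to work, register it and decorate the call — a short sketch reusing the hypothetical gatewayClient and request from the Spring Retry example:
RetryRegistry registry = RetryRegistry.of(config);
Retry retry = registry.retry("paymentService");     // uses the registry's default config

Supplier<PaymentResponse> withBackoff =
        Retry.decorateSupplier(retry, () -> gatewayClient.charge(request));
PaymentResponse response = withBackoff.get();       // up to 4 attempts with jittered exponential backoff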
4. Resilience4j vs Hystrix (Migration Guide)
Netflix Hystrix entered maintenance mode in 2018 and is incompatible with Spring Boot 3 (Java 17+). Resilience4j is the recommended replacement. It is designed for Java 8+ functional programming, does not require a background health-check thread, and integrates natively with Micrometer, Spring Boot Actuator, and Spring Cloud Circuit Breaker abstraction.
| Feature | Hystrix | Resilience4j |
|---|---|---|
| Maintenance Status | EOL (2018) | Actively maintained |
| Spring Boot 3 Support | No | Yes (native) |
| Sliding Window | Count-based only | Count-based + Time-based |
| Reactive Support | RxJava 1 only | RxJava 2/3, Project Reactor |
| Metrics | Hystrix Stream / Dashboard | Micrometer (Prometheus, Grafana) |
| Thread Model | Mandatory HystrixCommand thread pool | Decorators on any Supplier/Function |
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
<version>2.2.0</version>
</dependency>
<!-- Spring Boot Actuator for health + metrics endpoints -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- AOP for annotation support -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-aop</artifactId>
</dependency>
5. Resilience4j CircuitBreaker (CLOSED / OPEN / HALF_OPEN States, Sliding Window)
The Resilience4j CircuitBreaker is a state machine with three states. In CLOSED state, calls pass through and outcomes are recorded in a sliding window. When the failure rate exceeds the threshold, it transitions to OPEN — all calls are immediately rejected with CallNotPermittedException, giving the downstream time to recover. After waitDurationInOpenState, it enters HALF_OPEN and allows a limited number of probe calls. Success closes it; failure re-opens it.
resilience4j:
circuitbreaker:
instances:
inventoryService:
sliding-window-type: COUNT_BASED # or TIME_BASED
sliding-window-size: 10 # last 10 calls
minimum-number-of-calls: 5 # min calls before evaluating
failure-rate-threshold: 50 # open if >=50% failed
slow-call-duration-threshold: 2s # count slow calls as failures
slow-call-rate-threshold: 80 # open if >=80% are slow
wait-duration-in-open-state: 30s # wait before half-open
permitted-number-of-calls-in-half-open-state: 3
automatic-transition-from-open-to-half-open-enabled: true
record-exceptions:
- java.io.IOException
- org.springframework.web.client.HttpServerErrorException
ignore-exceptions:
- com.example.BusinessValidationException
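The same settings can also be built in Java, which is handy when breakers are created dynamically (per tenant or per endpoint). A sketch equivalent to the YAML above, assuming the same exception classes:
CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
        .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
        .slidingWindowSize(10)
        .minimumNumberOfCalls(5)
        .failureRateThreshold(50)
        .slowCallDurationThreshold(Duration.ofSeconds(2))
        .slowCallRateThreshold(80)
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .permittedNumberOfCallsInHalfOpenState(3)
        .automaticTransitionFromOpenToHalfOpenEnabled(true)
        .recordExceptions(IOException.class, HttpServerErrorException.class)
        .ignoreExceptions(BusinessValidationException.class)
        .build();

CircuitBreaker inventoryBreaker =
        CircuitBreakerRegistry.of(cbConfig).circuitBreaker("inventoryService");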
@Service
public class InventoryClient {
@CircuitBreaker(name = "inventoryService", fallbackMethod = "inventoryFallback")
public InventoryResponse checkStock(String productId) {
return webClient.get()
.uri("/inventory/{id}", productId)
.retrieve()
.bodyToMono(InventoryResponse.class)
.block();
}
// Fallback must have the same return type and include the Throwable parameter
public InventoryResponse inventoryFallback(String productId, CallNotPermittedException ex) {
log.warn("Circuit open for inventory service, returning cached data: {}", ex.getMessage());
return InventoryResponse.ofCachedData(productId);
}
public InventoryResponse inventoryFallback(String productId, Exception ex) {
log.error("Inventory service error, returning default: {}", ex.getMessage());
return InventoryResponse.defaultOutOfStock(productId);
}
}
@Service
public class OrderService {
private final CircuitBreaker cb;
public OrderService(CircuitBreakerRegistry registry) {
this.cb = registry.circuitBreaker("inventoryService");
cb.getEventPublisher()
.onStateTransition(e -> log.info("CB state: {} -> {}",
e.getStateTransition().getFromState(),
e.getStateTransition().getToState()));
}
public InventoryResponse checkStock(String productId) {
Supplier<InventoryResponse> decorated =
CircuitBreaker.decorateSupplier(cb, () -> inventoryClient.call(productId));
return Try.ofSupplier(decorated)
.recover(CallNotPermittedException.class, ex -> InventoryResponse.cached(productId))
.get();
}
}
6. Resilience4j Retry (decorateFunction, @Retry Annotation)
Resilience4j Retry wraps a function and re-executes it on exception or on a predicate match. Unlike Spring Retry, it supports both synchronous and async (CompletableFuture) variants, and emits fine-grained Micrometer events per retry attempt.
resilience4j:
retry:
instances:
paymentService:
max-attempts: 4
wait-duration: 300ms
enable-exponential-backoff: true
exponential-backoff-multiplier: 2.0
exponential-max-wait-duration: 8s
randomized-wait-factor: 0.5
retry-exceptions:
- java.net.ConnectException
- org.springframework.web.client.HttpServerErrorException$ServiceUnavailable
ignore-exceptions:
- com.example.PaymentDeclinedException # do NOT retry 4xx business errors
@Service
public class PaymentService {
@Retry(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResult processPayment(PaymentRequest req) {
log.debug("Sending payment request for amount={}", req.getAmount());
return paymentGatewayClient.charge(req);
}
private PaymentResult paymentFallback(PaymentRequest req, Exception ex) {
log.error("Payment service unavailable after retries: {}", ex.getMessage());
return PaymentResult.pending(req.getOrderId(), "Gateway unavailable — queued for retry");
}
}
// Programmatic — decorateFunction for non-annotation use
Retry retry = retryRegistry.retry("paymentService");
Function<PaymentRequest, PaymentResult> decorated =
Retry.decorateFunction(retry, gatewayClient::charge);
// For CompletableFuture (async):
CompletableFuture<PaymentResult> future =
retry.executeCompletionStage(scheduler, () ->
CompletableFuture.supplyAsync(() -> gatewayClient.charge(req))
).toCompletableFuture();
A critical rule: never retry 4xx HTTP errors. A 400 Bad Request or 422 Unprocessable Entity indicates a client-side problem that will not be fixed by retrying. Only retry on 5xx server errors, network timeouts, and ConnectException. Configure ignoreExceptions to exclude business validation exceptions from retry logic.
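In programmatic form the rule looks like this — a minimal sketch assuming RestTemplate-style exceptions (HttpClientErrorException for 4xx, HttpServerErrorException for 5xx) and the PaymentDeclinedException from the YAML above:
RetryConfig no4xxRetries = RetryConfig.custom()
        .maxAttempts(3)
        .retryExceptions(HttpServerErrorException.class,      // 5xx: transient, worth retrying
                         ResourceAccessException.class)        // connect/read timeouts
        .ignoreExceptions(HttpClientErrorException.class,      // 4xx: retrying cannot help
                          PaymentDeclinedException.class)      // business rejection
        .build();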
7. Bulkhead Pattern (ThreadPoolBulkhead vs SemaphoreBulkhead)
The Bulkhead pattern isolates failures in one part of the system from affecting the rest — named after watertight compartments in ships. Resilience4j offers two implementations:
- SemaphoreBulkhead — uses a counting semaphore to limit the number of concurrent calls. Lightweight, works in the calling thread. Best for synchronous, blocking calls with predictable execution time.
- ThreadPoolBulkhead — submits work to a dedicated thread pool with a queue. Callers are isolated from the worker threads. Best for async calls and when you need strict thread isolation between dependencies.
resilience4j:
bulkhead:
instances:
inventoryService:
max-concurrent-calls: 20 # max concurrent semaphore permits
max-wait-duration: 50ms # time to wait for a permit before BulkheadFullException
thread-pool-bulkhead:
instances:
paymentService:
max-thread-pool-size: 10 # max worker threads
core-thread-pool-size: 4 # always-on threads
queue-capacity: 50 # pending task queue size
keep-alive-duration: 20ms # idle thread keep-alive
writable-stack-trace-enabled: true # include a full stack trace in BulkheadFullException
@Service
public class OrderOrchestrator {
// SEMAPHORE type (default) — synchronous
@Bulkhead(name = "inventoryService", fallbackMethod = "inventoryBulkheadFallback")
public InventoryResponse checkInventory(String skuId) {
return inventoryClient.getStock(skuId);
}
public InventoryResponse inventoryBulkheadFallback(String skuId, BulkheadFullException ex) {
log.warn("Inventory bulkhead full, returning cached data for sku={}", skuId);
return cache.getOrDefault(skuId, InventoryResponse.unknown());
}
// THREADPOOL type — returns CompletableFuture
@Bulkhead(name = "paymentService",
type = Bulkhead.Type.THREADPOOL,
fallbackMethod = "paymentBulkheadFallback")
public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest req) {
return CompletableFuture.supplyAsync(() -> paymentClient.charge(req));
}
public CompletableFuture<PaymentResult> paymentBulkheadFallback(
PaymentRequest req, BulkheadFullException ex) {
return CompletableFuture.completedFuture(PaymentResult.queued(req.getOrderId()));
}
}
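Outside of Spring beans, the same semaphore bulkhead can be built programmatically — a brief sketch assuming the blocking inventoryClient used above:
BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
        .maxConcurrentCalls(20)
        .maxWaitDuration(Duration.ofMillis(50))   // wait briefly for a permit, then BulkheadFullException
        .build();
Bulkhead inventoryBulkhead = Bulkhead.of("inventoryService", bulkheadConfig);

Supplier<InventoryResponse> guarded =
        Bulkhead.decorateSupplier(inventoryBulkhead, () -> inventoryClient.getStock("SKU-001"));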
| Aspect | SemaphoreBulkhead | ThreadPoolBulkhead |
|---|---|---|
| Execution Model | Calling thread | Dedicated thread pool |
| Return Type | Any | CompletableFuture only |
| Overhead | Very low | Thread context switch |
| Queue Support | No (waits up to max-wait-duration, then rejects) | Yes (configurable queue) |
| Reactive | Yes | Not recommended |
8. RateLimiter and TimeLimiter
RateLimiter limits the number of calls per time window — useful when calling third-party APIs with strict quotas (e.g., a payment gateway that allows 100 req/sec) or to protect your own services from burst traffic. TimeLimiter wraps a CompletableFuture and cancels it if it does not complete within a configured deadline, preventing threads from waiting indefinitely.
resilience4j:
ratelimiter:
instances:
smsGateway:
limit-for-period: 50 # max 50 calls per refresh period
limit-refresh-period: 1s # refresh window
timeout-duration: 100ms # wait time for a permission; 0 = fail fast
timelimiter:
instances:
inventoryService:
timeout-duration: 2s # cancel CompletableFuture after 2s
cancel-running-future: true # interrupt the underlying thread
@Service
public class SmsService {
// Rate-limit outgoing SMS to 50/sec to comply with gateway quota
@RateLimiter(name = "smsGateway", fallbackMethod = "smsFallback")
public SmsResult send(SmsRequest request) {
return smsGatewayClient.send(request);
}
public SmsResult smsFallback(SmsRequest req, RequestNotPermitted ex) {
log.warn("SMS rate limit reached, queuing message id={}", req.getMessageId());
smsQueue.enqueue(req);
return SmsResult.queued(req.getMessageId());
}
}
@Service
public class ShippingService {
// TimeLimiter wraps CompletableFuture — must return CompletableFuture
@TimeLimiter(name = "inventoryService", fallbackMethod = "shippingFallback")
public CompletableFuture<ShippingQuote> getQuoteAsync(ShippingRequest req) {
return CompletableFuture.supplyAsync(() -> shippingClient.getQuote(req));
}
public CompletableFuture<ShippingQuote> shippingFallback(ShippingRequest req, TimeoutException ex) {
log.warn("Shipping quote timed out, returning default estimate");
return CompletableFuture.completedFuture(ShippingQuote.defaultEstimate());
}
}
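Both can also be configured programmatically — a sketch mirroring the YAML above, with the smsGatewayClient and request names assumed:
RateLimiterConfig rlConfig = RateLimiterConfig.custom()
        .limitForPeriod(50)
        .limitRefreshPeriod(Duration.ofSeconds(1))
        .timeoutDuration(Duration.ofMillis(100))   // wait up to 100ms for a permit, then RequestNotPermitted
        .build();
RateLimiter smsLimiter = RateLimiter.of("smsGateway", rlConfig);
Supplier<SmsResult> limitedSend =
        RateLimiter.decorateSupplier(smsLimiter, () -> smsGatewayClient.send(request));

TimeLimiterConfig tlConfig = TimeLimiterConfig.custom()
        .timeoutDuration(Duration.ofSeconds(2))
        .cancelRunningFuture(true)
        .build();
TimeLimiter quoteTimeLimiter = TimeLimiter.of("inventoryService", tlConfig);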
9. Combining Patterns (Order Matters: Bulkhead → CircuitBreaker → Retry → TimeLimiter)
When combining multiple Resilience4j decorators, the decoration order determines which wraps which. The outermost decorator is executed first. The correct production order is:
- Bulkhead (outermost) — reject early if resource limit reached, before wasting any other resources
- CircuitBreaker — short-circuit immediately if downstream is known-unhealthy
- Retry — retry transient failures of the wrapped call; because the circuit breaker sits above it, the breaker records only the final post-retry outcome, so one-off blips that a retry absorbs never count toward the failure rate
- TimeLimiter (innermost) — enforce a per-attempt deadline on the actual I/O call
Why this order? If Retry sat outside the CircuitBreaker, retry attempts would keep firing even while the circuit is open, each one failing fast with CallNotPermittedException and adding pointless backoff delay. If TimeLimiter sat outside Retry, the deadline would cover the whole retry sequence instead of bounding each attempt, so a single hung attempt could consume the entire budget.
@Service
public class ResilientInventoryClient {
private final Supplier<InventoryResponse> resilientSupplier;
public ResilientInventoryClient(
CircuitBreakerRegistry cbRegistry,
RetryRegistry retryRegistry,
BulkheadRegistry bulkheadRegistry) {
CircuitBreaker cb = cbRegistry.circuitBreaker("inventoryService");
Retry retry = retryRegistry.retry("inventoryService");
Bulkhead bulkhead = bulkheadRegistry.bulkhead("inventoryService");
// Decoration order: the first with* call is innermost, the last is outermost.
// actual call -> Retry -> CircuitBreaker -> Bulkhead
// (TimeLimiter needs a CompletionStage, so it is omitted from this synchronous chain)
this.resilientSupplier = Decorators
.ofSupplier(() -> httpClient.fetchInventory())
.withRetry(retry)
.withCircuitBreaker(cb)
.withBulkhead(bulkhead)
.withFallback(
List.of(CallNotPermittedException.class,
BulkheadFullException.class,
Exception.class),
ex -> InventoryResponse.unavailable()
)
.decorate();
}
public InventoryResponse getInventory() {
return resilientSupplier.get();
}
}
When you rely on annotations instead of the programmatic builder, be aware that the Resilience4j aspects apply in a fixed default order: Retry (outermost) wraps CircuitBreaker, which wraps RateLimiter, TimeLimiter, and finally Bulkhead (innermost). You can still stack the annotations on a single method:
@Bulkhead(name = "svc", fallbackMethod = "bFallback")
@CircuitBreaker(name = "svc", fallbackMethod = "cbFallback")
@Retry(name = "svc", fallbackMethod = "rFallback")
public Response call() { ... }
but the order in which the annotations are written does not change the aspect order. Note that the default places Retry outside the CircuitBreaker — the opposite of the order recommended above — so either adjust the resilience4j.circuitbreaker.circuit-breaker-aspect-order and resilience4j.retry.retry-aspect-order properties, or use the programmatic Decorators builder shown above for fully explicit ordering.
10. Actuator Integration and Metrics
Resilience4j integrates with Spring Boot Actuator to expose circuit breaker state via /actuator/health and detailed metrics via /actuator/circuitbreakers, /actuator/retries, and /actuator/bulkheads. All metrics are also published to Micrometer for Prometheus/Grafana.
management:
endpoints:
web:
exposure:
include: health, metrics, circuitbreakers, retries, bulkheads, ratelimiters
endpoint:
health:
show-details: always
health:
circuitbreakers:
enabled: true
ratelimiters:
enabled: true
resilience4j:
circuitbreaker:
instances:
inventoryService:
register-health-indicator: true # appear in /actuator/health
event-consumer-buffer-size: 20 # buffer for /actuator/circuitbreakerevents
# CircuitBreaker
resilience4j_circuitbreaker_state{name="inventoryService",state} # one gauge per state; value 1 = breaker is currently in that state
resilience4j_circuitbreaker_failure_rate{name="inventoryService"} # failure %
resilience4j_circuitbreaker_calls_seconds_count{name,kind} # kind: successful, failed, ignored
resilience4j_circuitbreaker_not_permitted_calls_total{name} # calls rejected while the breaker is OPEN
# Retry
resilience4j_retry_calls_total{name="paymentService",kind} # kind: successful_with_retry, failed_with_retry
# Bulkhead
resilience4j_bulkhead_available_concurrent_calls{name}
resilience4j_bulkhead_max_allowed_concurrent_calls{name}
# RateLimiter
resilience4j_ratelimiter_available_permissions{name}
resilience4j_ratelimiter_waiting_threads{name}
# Prometheus alerting rules (2.x YAML rule format)
groups:
  - name: resilience4j-alerts
    rules:
      # Alert when any circuit breaker transitions to OPEN state
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is OPEN"
          description: "{{ $labels.name }} has been open for more than 30 seconds. Check downstream health."
      # Alert when more than 10% of calls still fail after retrying
      - alert: HighRetryRate
        expr: >
          sum by (name) (rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m]))
          /
          sum by (name) (rate(resilience4j_retry_calls_total[5m])) > 0.10
        for: 5m
        labels:
          severity: warning
11. Testing Fault Tolerance with WireMock
Testing resilience patterns requires the ability to inject faults on demand. WireMock simulates downstream HTTP services with configurable fault injection — connection resets, delays, 5xx responses — making it ideal for driving circuit breakers open and verifying retry behavior.
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-contract-wiremock</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
<scope>test</scope>
</dependency>
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@AutoConfigureWireMock(port = 0)
@TestPropertySource(properties = {
"inventory.service.url=http://localhost:${wiremock.server.port}",
"resilience4j.circuitbreaker.instances.inventoryService.sliding-window-size=3",
"resilience4j.circuitbreaker.instances.inventoryService.minimum-number-of-calls=3",
"resilience4j.circuitbreaker.instances.inventoryService.failure-rate-threshold=50",
"resilience4j.circuitbreaker.instances.inventoryService.wait-duration-in-open-state=1s"
})
class InventoryClientCircuitBreakerTest {
@Autowired private InventoryClient inventoryClient;
@Autowired private CircuitBreakerRegistry circuitBreakerRegistry;
@Test
void shouldOpenCircuitAfterThresholdFailures() {
// Stub 3 consecutive 503 responses to drive the circuit open
stubFor(get(urlPathMatching("/inventory/.*"))
.willReturn(aResponse().withStatus(503).withBody("Service Unavailable")));
// 3 calls: all fail, circuit should open (50% threshold met after 3/3 calls)
assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
.isInstanceOf(HttpServerErrorException.ServiceUnavailable.class);
assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
.isInstanceOf(HttpServerErrorException.ServiceUnavailable.class);
assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
.isInstanceOf(HttpServerErrorException.ServiceUnavailable.class);
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("inventoryService");
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
// Next call should be rejected immediately without hitting WireMock
assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
.isInstanceOf(CallNotPermittedException.class);
// Verify WireMock only received 3 calls, not 4
verify(3, getRequestedFor(urlPathMatching("/inventory/.*")));
}
@Test
void shouldTransitionToHalfOpenAndClosedOnSuccess() throws InterruptedException {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("inventoryService");
cb.transitionToOpenState(); // Force open
// Stub a success response for HALF_OPEN probe
stubFor(get(urlPathMatching("/inventory/.*"))
.willReturn(aResponse().withStatus(200)
.withBody("{\"sku\":\"SKU-001\",\"available\":true}")
.withHeader("Content-Type", "application/json")));
// Wait for waitDurationInOpenState=1s, then trigger transition
Thread.sleep(1100);
cb.transitionToHalfOpenState();
// Probe call should succeed and close the circuit
InventoryResponse response = inventoryClient.checkStock("SKU-001");
assertThat(response.isAvailable()).isTrue();
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
}
}
// Connection reset (simulates TCP RST — surfaces as a connection error, e.g. ResourceAccessException with RestTemplate)
stubFor(post("/payments/charge")
.willReturn(aResponse().withFault(Fault.CONNECTION_RESET_BY_PEER)));
// Fixed delay (tests TimeLimiter timeout behavior)
stubFor(get("/inventory/SKU-999")
.willReturn(aResponse().withStatus(200)
.withFixedDelay(3000) // 3s delay > 2s TimeLimiter timeout
.withBody("{\"available\":true}")));
// Random delay range (tests retry with jitter)
stubFor(get(urlPathMatching("/inventory/.*"))
.willReturn(aResponse().withStatus(503)
.withRandomDelay(new UniformDistribution(100, 500))));
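To verify that a retry actually recovers, WireMock scenarios can serve two failures followed by a success — a sketch assuming 503 is configured as retryable, the retry instance allows at least three attempts, and a hypothetical SKU-042 endpoint:
// Attempt 1: 503, advance the scenario
stubFor(get(urlPathEqualTo("/inventory/SKU-042")).inScenario("recovering")
        .whenScenarioStateIs(Scenario.STARTED)
        .willReturn(aResponse().withStatus(503))
        .willSetStateTo("second-attempt"));
// Attempt 2: still 503
stubFor(get(urlPathEqualTo("/inventory/SKU-042")).inScenario("recovering")
        .whenScenarioStateIs("second-attempt")
        .willReturn(aResponse().withStatus(503))
        .willSetStateTo("recovered"));
// Attempt 3: success
stubFor(get(urlPathEqualTo("/inventory/SKU-042")).inScenario("recovering")
        .whenScenarioStateIs("recovered")
        .willReturn(okJson("{\"sku\":\"SKU-042\",\"available\":true}")));

// The decorated client should succeed on the third attempt,
// and WireMock should have seen exactly three requests
assertThat(inventoryClient.checkStock("SKU-042").isAvailable()).isTrue();
verify(3, getRequestedFor(urlPathEqualTo("/inventory/SKU-042")));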
12. Production Checklist
Before deploying fault tolerance patterns to production, validate each item in this checklist to avoid common pitfalls:
- ✅ Do not retry 4xx errors — configure ignoreExceptions for business validation exceptions
- ✅ Add jitter to all exponential backoff configurations (randomizedWaitFactor ≥ 0.3)
- ✅ Set minimumNumberOfCalls appropriately — a window of 2 calls with 1 failure = 50% rate is statistically unreliable
- ✅ Register health indicators (register-health-indicator: true) so Kubernetes liveness/readiness probes reflect circuit breaker state
- ✅ Test fallbacks under load — ensure fallback methods themselves do not throw exceptions
- ✅ Tune TimeLimiter per SLA — do not use a global timeout; each downstream has a different p99 latency profile
- ✅ Use ThreadPoolBulkhead for async calls and SemaphoreBulkhead for synchronous/reactive calls
- ✅ Monitor retry rates with Prometheus/Grafana — sustained >5% retry rate indicates an underlying stability problem
- ✅ Verify decorator order in programmatic configuration — always Bulkhead → CircuitBreaker → Retry → TimeLimiter
- ✅ Configure event buffer sizes (event-consumer-buffer-size) to avoid dropping events in high-throughput services
- ✅ Test circuit breaker state transitions in integration tests with forced state transitions via circuitBreaker.transitionToOpenState()
- ✅ Document SLOs per dependency — circuit breaker thresholds should be informed by downstream SLOs, not guessed
| Configuration Property | Recommended Starting Value | Notes |
|---|---|---|
| sliding-window-size | 10–20 | Too small = noisy; too large = slow to open |
| failure-rate-threshold | 50% | Lower for critical paths (30%) |
| wait-duration-in-open-state | 30s–60s | Match downstream recovery SLA |
| max-attempts (retry) | 3 | More than 3 rarely helps and worsens latency |
| timeout-duration (time limiter) | 2× downstream p99 | Profile in staging before setting |
| max-concurrent-calls (bulkhead) | 10–25% of thread pool size | Reserve capacity for other dependencies |
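Assembled into a single starting-point configuration (someDownstream is a placeholder name — measure and tune every value against the real dependency before relying on it):
resilience4j:
  circuitbreaker:
    instances:
      someDownstream:
        sliding-window-size: 10
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
  retry:
    instances:
      someDownstream:
        max-attempts: 3
        enable-exponential-backoff: true
        randomized-wait-factor: 0.5
  timelimiter:
    instances:
      someDownstream:
        timeout-duration: 2s          # roughly 2x the measured downstream p99
  bulkhead:
    instances:
      someDownstream:
        max-concurrent-calls: 20      # 10-25% of the caller's thread pool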