
Spring Retry & Resilience4j: Complete Fault Tolerance Guide for Spring Boot Microservices (2026)

Transient network blips, momentary service restarts, and cascading failures are the reality of distributed systems. In this guide, you will master every fault-tolerance tool available in the Spring Boot ecosystem — from simple @Retryable annotations to production-grade Resilience4j circuit breakers, bulkheads, rate limiters, and time limiters — with complete Java configuration and production-ready code.

Senior Software Engineer & Tech Writer
TL;DR

Spring Retry handles simple retry scenarios via @Retryable / RetryTemplate with exponential backoff and jitter. Resilience4j is the production-grade choice: it provides CircuitBreaker (CLOSED → OPEN → HALF_OPEN state machine), Retry, Bulkhead (thread-pool or semaphore isolation), RateLimiter, and TimeLimiter — all with Micrometer metrics and Actuator health endpoints. When combining them, order matters: by default Resilience4j decorates as Retry → CircuitBreaker → TimeLimiter → Bulkhead (outermost to innermost), so every retry attempt is recorded by the circuit breaker.

Table of Contents

  1. Why Fault Tolerance? Cascading Failures in Microservices
  2. Spring Retry: @Retryable, @Recover, RetryTemplate
  3. Exponential Backoff with Jitter (Configuration Deep Dive)
  4. Resilience4j vs Hystrix (Migration Guide)
  5. Resilience4j CircuitBreaker (CLOSED / OPEN / HALF_OPEN States)
  6. Resilience4j Retry (decorateFunction, @Retry Annotation)
  7. Bulkhead Pattern (ThreadPoolBulkhead vs SemaphoreBulkhead)
  8. RateLimiter and TimeLimiter
  9. Combining Patterns (Order Matters)
  10. Actuator Integration and Metrics
  11. Testing Fault Tolerance with WireMock
  12. Production Checklist

1. Why Fault Tolerance? Cascading Failures in Microservices

In a monolith, a slow database query blocks a thread but does not propagate failure horizontally. In a microservices system, Service A calls Service B which calls Service C. When Service C degrades — perhaps due to a GC pause, a database lock, or a noisy-neighbor on a shared Kubernetes node — threads in Service B start queuing waiting for C. Service B's thread pool fills up and starts rejecting Service A's requests, which then back up into Service A's thread pool. Within seconds an isolated failure in one leaf service becomes a full system outage: the cascading failure.

Fault tolerance patterns break these cascade chains. The key patterns and their roles are:

  • Retry — automatically repeat a transient failure (network blip, 503) without surfacing it to the caller
  • Circuit Breaker — stop calling a downstream that is clearly unhealthy; fail fast and return a fallback
  • Bulkhead — limit concurrency per dependency so one slow service cannot exhaust the global thread pool
  • Rate Limiter — throttle outgoing calls to protect downstream services from being overwhelmed
  • Time Limiter — cancel a call after a deadline to bound the worst-case latency contribution
Pattern | Problem Solved | Library
Retry | Transient network / service errors | Spring Retry, Resilience4j
Circuit Breaker | Cascading failures, sustained outages | Resilience4j
Bulkhead | Thread pool exhaustion, resource isolation | Resilience4j
Rate Limiter | Downstream overload, quota enforcement | Resilience4j
Time Limiter | Unbounded latency, thread starvation | Resilience4j
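
Before reaching for a library, it helps to see the failure mode itself. The sketch below uses plain java.util.concurrent (all names and numbers are illustrative, no Spring involved) to show how one blocked dependency turns into rejected work for every caller sharing the pool:

```java
import java.util.concurrent.*;

// Sketch of how one slow dependency saturates a shared thread pool and starts
// rejecting everyone else's work -- the seed of a cascading failure.
public class CascadeDemo {

    public static int rejectedCalls() {
        // Shared pool: 2 worker threads, a queue of 2, reject when both are full
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2));
        CountDownLatch slowDependency = new CountDownLatch(1);
        int rejected = 0;

        for (int i = 0; i < 10; i++) {            // 10 concurrent callers arrive
            try {
                pool.submit(() -> {
                    try {
                        slowDependency.await();   // every task blocks on the slow service
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++;                        // pool and queue full: caller fails
            }
        }

        slowDependency.countDown();                // the dependency "recovers"
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return rejected;                           // 10 callers - 2 running - 2 queued = 6
    }
}
```

Two running tasks plus two queued tasks absorb only four of the ten callers; the other six are rejected outright, which is exactly the pressure that then backs up into the next service in the chain.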

2. Spring Retry: @Retryable, @Recover, RetryTemplate

Spring Retry adds retry capability to any Spring bean method with minimal configuration. Add the dependency and enable retries with @EnableRetry:

// pom.xml
<dependency>
    <groupId>org.springframework.retry</groupId>
    <artifactId>spring-retry</artifactId>
</dependency>
<!-- Spring Retry requires AOP -->
<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-aspects</artifactId>
</dependency>
// Enable Spring Retry on your @SpringBootApplication or @Configuration class
@SpringBootApplication
@EnableRetry
public class PaymentServiceApplication {
    public static void main(String[] args) {
        SpringApplication.run(PaymentServiceApplication.class, args);
    }
}

Annotate the method you want retried. The @Recover method is invoked when all retry attempts are exhausted:

// PaymentGatewayClient.java — @Retryable with @Recover fallback
@Service
public class PaymentGatewayClient {

    @Retryable(
        retryFor  = {HttpServerErrorException.class, ResourceAccessException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 500, multiplier = 2.0, maxDelay = 5000, random = true)
    )
    public PaymentResponse charge(ChargeRequest request) {
        log.info("Attempting charge for orderId={}", request.getOrderId());
        return restTemplate.postForObject("/payments/charge", request, PaymentResponse.class);
    }

    @Recover
    public PaymentResponse chargeRecover(Exception ex, ChargeRequest request) {
        log.error("All retries exhausted for orderId={}: {}", request.getOrderId(), ex.getMessage());
        // Return a graceful degraded response or throw a domain exception
        throw new PaymentServiceUnavailableException("Payment gateway unavailable. Please retry later.");
    }
}

For programmatic retry logic use RetryTemplate — useful when you need retry inside a non-Spring-managed class or want full control over the policy:

// RetryTemplate bean configuration
@Configuration
public class RetryConfig {

    @Bean
    public RetryTemplate retryTemplate() {
        ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
        backOff.setInitialInterval(300);
        backOff.setMultiplier(2.0);
        backOff.setMaxInterval(10_000);

        SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(3,
            Map.of(HttpServerErrorException.class, true,
                   ResourceAccessException.class,  true));

        RetryTemplate template = new RetryTemplate();
        template.setBackOffPolicy(backOff);
        template.setRetryPolicy(retryPolicy);
        // RetryListener has default methods since Spring Retry 2.x (RetryListenerSupport is deprecated)
        template.registerListener(new RetryListener() {
            @Override
            public <T, E extends Throwable> void onError(
                    RetryContext ctx, RetryCallback<T, E> cb, Throwable t) {
                log.warn("Retry attempt {} failed: {}", ctx.getRetryCount(), t.getMessage());
            }
        });
        return template;
    }
}
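
Conceptually, RetryTemplate's execute loop reduces to the plain-Java sketch below (illustrative names, no Spring dependency): invoke the callback, and on a retryable failure sleep an exponentially growing interval before the next attempt.

```java
import java.util.function.Supplier;

// Didactic sketch of a retry loop with exponential backoff.
// Assumes maxAttempts >= 1; RetryTemplate adds policies, listeners and state on top.
public class NaiveRetry {

    public static <T> T retry(Supplier<T> call, int maxAttempts,
                              long initialDelayMs, double multiplier) {
        long delay = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();                       // success: return immediately
            } catch (RuntimeException ex) {
                last = ex;                               // remember the failure
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delay);             // back off before the next attempt
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                    }
                    delay = (long) (delay * multiplier); // exponential growth
                }
            }
        }
        throw last;                                      // attempts exhausted: rethrow last error
    }
}
```

A call that fails twice and then succeeds returns normally on the third attempt, which is what the @Recover / fallback machinery builds on when even the final attempt fails.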

3. Exponential Backoff with Jitter (Configuration Deep Dive)

Naive exponential backoff causes a thundering herd: if 100 microservice instances all restart at the same time after a shared dependency outage, they all retry at the same exponential intervals and hit the recovering service in synchronized waves. Adding jitter (randomization) spreads the retries across time, dramatically reducing load spikes during recovery.

Strategy | Formula | Best For
Fixed | delay = constant | Simple retries, low concurrency
Exponential | delay = initial × multiplier^n | Most service-to-service calls
Full Jitter | delay = random(0, cap) | High-concurrency, many retrying clients
Decorrelated Jitter | delay = random(base, prev×3) | Best spread (AWS recommendation)
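
The formulas in the table translate directly into code. A plain-Java sketch (method names are mine, not from any library):

```java
import java.util.concurrent.ThreadLocalRandom;

// The backoff formulas from the table above, as standalone functions.
public class Backoff {

    // Exponential: delay = initial * multiplier^attempt, capped at capMs
    public static long exponential(long initialMs, double multiplier, int attempt, long capMs) {
        return Math.min(capMs, (long) (initialMs * Math.pow(multiplier, attempt)));
    }

    // Full jitter: delay = random(0, exponential delay for this attempt)
    public static long fullJitter(long initialMs, double multiplier, int attempt, long capMs) {
        long upper = exponential(initialMs, multiplier, attempt, capMs);
        return ThreadLocalRandom.current().nextLong(upper + 1);   // inclusive of upper
    }

    // Decorrelated jitter: delay = random(base, previous * 3), capped at capMs
    public static long decorrelated(long baseMs, long previousMs, long capMs) {
        long upper = Math.max(baseMs + 1, previousMs * 3);        // keep the random bound valid
        return Math.min(capMs, ThreadLocalRandom.current().nextLong(baseMs, upper));
    }
}
```

With full jitter, 100 clients retrying attempt 3 spread uniformly over [0, 800 ms] instead of all arriving at exactly 800 ms — that spread is what flattens the recovery-time load spikes.
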
// application.yml — Spring Retry backoff with jitter via @Backoff
# No YAML key for @Backoff — configure via annotation:
# @Backoff(delay=500, multiplier=2.0, maxDelay=10000, random=true)
# random=true switches to ExponentialRandomBackOffPolicy: each interval is
# multiplied by a random factor between 1 and the configured multiplier

# For Resilience4j exponential backoff with jitter:
resilience4j:
  retry:
    instances:
      paymentService:
        max-attempts: 3
        wait-duration: 500ms
        exponential-backoff-multiplier: 2.0
        exponential-max-wait-duration: 10s
        enable-exponential-backoff: true
        randomized-wait-factor: 0.5     # jitter: 50% of wait-duration
        retry-exceptions:
          - org.springframework.web.client.HttpServerErrorException
          - java.net.ConnectException
// Custom jitter implementation using IntervalFunction
RetryConfig config = RetryConfig.custom()
    .maxAttempts(4)
    .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
        Duration.ofMillis(200),   // initialInterval
        2.0,                      // multiplier
        0.5,                      // randomizationFactor (jitter)
        Duration.ofSeconds(8)     // maxInterval
    ))
    .retryOnException(e -> e instanceof HttpServerErrorException
                        || e instanceof ConnectException)
    .build();

4. Resilience4j vs Hystrix (Migration Guide)

Netflix Hystrix entered maintenance mode in 2018 and is incompatible with Spring Boot 3 (which requires Java 17+). Resilience4j is the recommended replacement: it is built around Java 8+ functional interfaces, needs no background health-check thread, and integrates natively with Micrometer, Spring Boot Actuator, and the Spring Cloud Circuit Breaker abstraction.

Feature | Hystrix | Resilience4j
Maintenance Status | EOL (2018) | Actively maintained
Spring Boot 3 Support | No | Yes (native)
Sliding Window | Count-based only | Count-based + Time-based
Reactive Support | RxJava 1 only | RxJava 2/3, Project Reactor
Metrics | Hystrix Stream / Dashboard | Micrometer (Prometheus, Grafana)
Thread Model | Mandatory HystrixCommand thread pool | Decorators on any Supplier/Function
// pom.xml — Resilience4j Spring Boot 3 starter (replaces Hystrix)
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.2.0</version>
</dependency>
<!-- Spring Boot Actuator for health + metrics endpoints -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- AOP for annotation support -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>

5. Resilience4j CircuitBreaker (CLOSED / OPEN / HALF_OPEN States, Sliding Window)

The Resilience4j CircuitBreaker is a state machine with three states. In CLOSED state, calls pass through and outcomes are recorded in a sliding window. When the failure rate exceeds the threshold, it transitions to OPEN — all calls are immediately rejected with CallNotPermittedException, giving the downstream time to recover. After waitDurationInOpenState, it enters HALF_OPEN and allows a limited number of probe calls. Success closes it; failure re-opens it.
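
To make the state machine concrete, here is a deliberately tiny sketch of those transitions — a consecutive-failure counter standing in for the sliding window, all names mine, not the Resilience4j implementation:

```java
// Toy three-state circuit breaker: CLOSED -> OPEN -> HALF_OPEN.
// No sliding window, no slow-call rate -- just the state transitions.
public class ToyCircuitBreaker {

    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;

    public ToyCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    public State state() { return state; }

    // OPEN rejects immediately (the real breaker throws CallNotPermittedException)
    public boolean tryAcquirePermission() { return state != State.OPEN; }

    public void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;            // trip, or re-trip after a failed probe
        }
    }

    public void recordSuccess() {
        if (state == State.HALF_OPEN) {    // successful probe closes the circuit
            state = State.CLOSED;
            consecutiveFailures = 0;
        }
    }

    // Called when waitDurationInOpenState elapses: allow probe calls
    public void onOpenTimerElapsed() {
        if (state == State.OPEN) state = State.HALF_OPEN;
    }
}
```

The real breaker replaces the failure counter with a count- or time-based sliding window and adds slow-call tracking, but the transition logic is the same shape as this toy.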

// application.yml — CircuitBreaker configuration
resilience4j:
  circuitbreaker:
    instances:
      inventoryService:
        sliding-window-type: COUNT_BASED        # or TIME_BASED
        sliding-window-size: 10                 # last 10 calls
        minimum-number-of-calls: 5              # min calls before evaluating
        failure-rate-threshold: 50              # open if >=50% failed
        slow-call-duration-threshold: 2s        # count slow calls as failures
        slow-call-rate-threshold: 80            # open if >=80% are slow
        wait-duration-in-open-state: 30s        # wait before half-open
        permitted-number-of-calls-in-half-open-state: 3
        automatic-transition-from-open-to-half-open-enabled: true
        record-exceptions:
          - java.io.IOException
          - org.springframework.web.client.HttpServerErrorException
          - org.springframework.web.reactive.function.client.WebClientResponseException  # thrown by the WebClient call below
        ignore-exceptions:
          - com.example.BusinessValidationException
// InventoryClient.java — @CircuitBreaker annotation with fallback
@Service
public class InventoryClient {

    @CircuitBreaker(name = "inventoryService", fallbackMethod = "inventoryFallback")
    public InventoryResponse checkStock(String productId) {
        return webClient.get()
            .uri("/inventory/{id}", productId)
            .retrieve()
            .bodyToMono(InventoryResponse.class)
            .block();
    }

    // Fallback must have the same return type and include the Throwable parameter
    public InventoryResponse inventoryFallback(String productId, CallNotPermittedException ex) {
        log.warn("Circuit open for inventory service, returning cached data: {}", ex.getMessage());
        return InventoryResponse.ofCachedData(productId);
    }

    public InventoryResponse inventoryFallback(String productId, Exception ex) {
        log.error("Inventory service error, returning default: {}", ex.getMessage());
        return InventoryResponse.defaultOutOfStock(productId);
    }
}
// Programmatic circuit breaker — useful for dynamic configuration
@Service
public class OrderService {

    private final CircuitBreaker cb;

    public OrderService(CircuitBreakerRegistry registry) {
        this.cb = registry.circuitBreaker("inventoryService");
        cb.getEventPublisher()
            .onStateTransition(e -> log.info("CB state: {} -> {}",
                e.getStateTransition().getFromState(),
                e.getStateTransition().getToState()));
    }

    public InventoryResponse checkStock(String productId) {
        Supplier<InventoryResponse> decorated =
            CircuitBreaker.decorateSupplier(cb, () -> inventoryClient.call(productId));
        return Try.ofSupplier(decorated)   // Try is Vavr's io.vavr.control.Try
                  .recover(CallNotPermittedException.class, ex -> InventoryResponse.cached(productId))
                  .get();
    }
}

6. Resilience4j Retry (decorateFunction, @Retry Annotation)

Resilience4j Retry wraps a function and re-executes it on exception or on a predicate match. Unlike Spring Retry, it supports both synchronous and async (CompletableFuture) variants, and emits fine-grained Micrometer events per retry attempt.

// application.yml — Resilience4j Retry with exponential backoff + jitter
resilience4j:
  retry:
    instances:
      paymentService:
        max-attempts: 4
        wait-duration: 300ms
        enable-exponential-backoff: true
        exponential-backoff-multiplier: 2.0
        exponential-max-wait-duration: 8s
        randomized-wait-factor: 0.5
        retry-exceptions:
          - java.net.ConnectException
          - org.springframework.web.client.HttpServerErrorException$ServiceUnavailable
        ignore-exceptions:
          - com.example.PaymentDeclinedException    # do NOT retry 4xx business errors
// PaymentService.java — @Retry with fallback
@Service
public class PaymentService {

    @Retry(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResult processPayment(PaymentRequest req) {
        log.debug("Sending payment request for amount={}", req.getAmount());
        return paymentGatewayClient.charge(req);
    }

    private PaymentResult paymentFallback(PaymentRequest req, Exception ex) {
        log.error("Payment service unavailable after retries: {}", ex.getMessage());
        return PaymentResult.pending(req.getOrderId(), "Gateway unavailable — queued for retry");
    }
}

// Programmatic — decorateFunction for non-annotation use
Retry retry = retryRegistry.retry("paymentService");
Function<PaymentRequest, PaymentResult> decorated =
    Retry.decorateFunction(retry, gatewayClient::charge);

// For CompletableFuture (async):
CompletableFuture<PaymentResult> future =
    retry.executeCompletionStage(scheduler, () ->
        CompletableFuture.supplyAsync(() -> gatewayClient.charge(req))
    ).toCompletableFuture();

A critical rule: never retry 4xx client errors. A 400 Bad Request or 422 Unprocessable Entity signals a client-side problem that retrying cannot fix. Retry only 5xx server errors, network timeouts, and ConnectException (the one common 4xx exception is 429 Too Many Requests, which may be retried after the advertised Retry-After delay). Configure ignoreExceptions to keep business validation exceptions out of the retry logic.
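
That rule can be encoded as plain predicates — an illustrative helper, not a Resilience4j API — suitable for plugging into retryOnException or a status check:

```java
import java.io.IOException;

// Retry classification: 5xx responses and transport-level failures only.
public class RetryClassifier {

    public static boolean isRetryableStatus(int httpStatus) {
        return httpStatus >= 500 && httpStatus < 600;   // 5xx server errors only
    }

    public static boolean isRetryableException(Throwable t) {
        // ConnectException, SocketTimeoutException etc. are all IOExceptions
        return t instanceof IOException;
    }
}
```
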

7. Bulkhead Pattern (ThreadPoolBulkhead vs SemaphoreBulkhead)

The Bulkhead pattern isolates failures in one part of the system from affecting the rest — named after watertight compartments in ships. Resilience4j offers two implementations:

  • SemaphoreBulkhead — uses a counting semaphore to limit the number of concurrent calls. Lightweight, works in the calling thread. Best for synchronous, blocking calls with predictable execution time.
  • ThreadPoolBulkhead — submits work to a dedicated thread pool with a queue. Callers are isolated from the worker threads. Best for async calls and when you need strict thread isolation between dependencies.
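
A SemaphoreBulkhead reduces to this sketch: a counting semaphore guarding the call, rejecting when no permit arrives in time. Names are mine; the real implementation throws BulkheadFullException rather than IllegalStateException.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Toy semaphore-style bulkhead: limit concurrent calls, fail fast when full.
public class ToyBulkhead {

    private final Semaphore permits;
    private final long maxWaitMs;

    public ToyBulkhead(int maxConcurrentCalls, long maxWaitMs) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMs = maxWaitMs;
    }

    public <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            acquired = false;
        }
        if (!acquired) {
            throw new IllegalStateException("Bulkhead full");  // ~ BulkheadFullException
        }
        try {
            return call.get();        // runs in the calling thread (semaphore style)
        } finally {
            permits.release();        // always hand the permit back
        }
    }
}
```

The ThreadPoolBulkhead variant instead hands the Supplier to its own executor and returns a CompletableFuture, which is why it requires an async return type.
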
// application.yml — SemaphoreBulkhead and ThreadPoolBulkhead
resilience4j:
  bulkhead:
    instances:
      inventoryService:
        max-concurrent-calls: 20      # max concurrent semaphore permits
        max-wait-duration: 50ms       # time to wait for a permit before BulkheadFullException

  thread-pool-bulkhead:
    instances:
      paymentService:
        max-thread-pool-size: 10      # max worker threads
        core-thread-pool-size: 4      # always-on threads
        queue-capacity: 50            # pending task queue size
        keep-alive-duration: 20ms     # idle thread keep-alive
        writable-stack-trace-enabled: true
// OrderOrchestrator.java — @Bulkhead annotation (semaphore type)
@Service
public class OrderOrchestrator {

    // SEMAPHORE type (default) — synchronous
    @Bulkhead(name = "inventoryService", fallbackMethod = "inventoryBulkheadFallback")
    public InventoryResponse checkInventory(String skuId) {
        return inventoryClient.getStock(skuId);
    }

    public InventoryResponse inventoryBulkheadFallback(String skuId, BulkheadFullException ex) {
        log.warn("Inventory bulkhead full, returning cached data for sku={}", skuId);
        return cache.getOrDefault(skuId, InventoryResponse.unknown());
    }

    // THREADPOOL type — returns CompletableFuture
    @Bulkhead(name = "paymentService",
              type = Bulkhead.Type.THREADPOOL,
              fallbackMethod = "paymentBulkheadFallback")
    public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest req) {
        return CompletableFuture.supplyAsync(() -> paymentClient.charge(req));
    }

    public CompletableFuture<PaymentResult> paymentBulkheadFallback(
            PaymentRequest req, BulkheadFullException ex) {
        return CompletableFuture.completedFuture(PaymentResult.queued(req.getOrderId()));
    }
}
Aspect | SemaphoreBulkhead | ThreadPoolBulkhead
Execution Model | Calling thread | Dedicated thread pool
Return Type | Any | CompletableFuture only
Overhead | Very low | Thread context switch
Queue Support | No (reject immediately) | Yes (configurable queue)
Reactive | Yes | Not recommended

8. RateLimiter and TimeLimiter

RateLimiter limits the number of calls per time window — useful when calling third-party APIs with strict quotas (e.g., a payment gateway that allows 100 req/sec) or to protect your own services from burst traffic. TimeLimiter wraps a CompletableFuture and cancels it if it does not complete within a configured deadline, preventing threads from waiting indefinitely.
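
RateLimiter in miniature is a permit counter that resets every refresh period. The fixed-window sketch below is didactic (names mine); Resilience4j's AtomicRateLimiter is more refined but follows the same permits-per-period idea:

```java
// Toy fixed-window rate limiter: limitForPeriod permits per refresh period.
public class ToyRateLimiter {

    private final int limitForPeriod;
    private final long refreshPeriodNanos;
    private long windowStart = System.nanoTime();
    private int usedPermits = 0;

    public ToyRateLimiter(int limitForPeriod, long refreshPeriodNanos) {
        this.limitForPeriod = limitForPeriod;
        this.refreshPeriodNanos = refreshPeriodNanos;
    }

    public synchronized boolean acquirePermission() {
        long now = System.nanoTime();
        if (now - windowStart >= refreshPeriodNanos) {
            windowStart = now;          // new window: reset the counter
            usedPermits = 0;
        }
        if (usedPermits < limitForPeriod) {
            usedPermits++;
            return true;                // within quota
        }
        return false;                   // over quota: fail fast (~ RequestNotPermitted)
    }
}
```

In the real library, timeout-duration controls how long a caller may wait for the next window's permit before RequestNotPermitted is thrown.
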

// application.yml — RateLimiter and TimeLimiter
resilience4j:
  ratelimiter:
    instances:
      smsGateway:
        limit-for-period: 50             # max 50 calls per refresh period
        limit-refresh-period: 1s         # refresh window
        timeout-duration: 100ms          # wait time for a permission; 0 = fail fast

  timelimiter:
    instances:
      inventoryService:
        timeout-duration: 2s             # cancel CompletableFuture after 2s
        cancel-running-future: true      # interrupt the underlying thread
// SmsService.java — @RateLimiter and @TimeLimiter
@Service
public class SmsService {

    // Rate-limit outgoing SMS to 50/sec to comply with gateway quota
    @RateLimiter(name = "smsGateway", fallbackMethod = "smsFallback")
    public SmsResult send(SmsRequest request) {
        return smsGatewayClient.send(request);
    }

    public SmsResult smsFallback(SmsRequest req, RequestNotPermitted ex) {
        log.warn("SMS rate limit reached, queuing message id={}", req.getMessageId());
        smsQueue.enqueue(req);
        return SmsResult.queued(req.getMessageId());
    }
}

@Service
public class ShippingService {

    // TimeLimiter wraps CompletableFuture — must return CompletableFuture
    @TimeLimiter(name = "inventoryService", fallbackMethod = "shippingFallback")
    public CompletableFuture<ShippingQuote> getQuoteAsync(ShippingRequest req) {
        return CompletableFuture.supplyAsync(() -> shippingClient.getQuote(req));
    }

    public CompletableFuture<ShippingQuote> shippingFallback(ShippingRequest req, TimeoutException ex) {
        log.warn("Shipping quote timed out, returning default estimate");
        return CompletableFuture.completedFuture(ShippingQuote.defaultEstimate());
    }
}

9. Combining Patterns (Order Matters: Retry → CircuitBreaker → TimeLimiter → Bulkhead)

When combining multiple Resilience4j decorators, the decoration order determines which wraps which: the outermost decorator runs first. Resilience4j's default order, which is also a sound production default, is, from outermost to innermost:

  1. Retry (outermost) — re-executes the entire inner chain, so every attempt passes through the circuit breaker and is recorded individually
  2. CircuitBreaker — short-circuits immediately with CallNotPermittedException if the downstream is known-unhealthy
  3. TimeLimiter — enforces a per-attempt deadline on the actual I/O call
  4. Bulkhead (innermost) — limits how many calls can actually execute concurrently

Why this order? If CircuitBreaker wrapped Retry instead, the breaker would record only one outcome per retried operation rather than one per attempt, so it would learn about failures more slowly. Keeping TimeLimiter and Bulkhead innermost means each individual attempt gets its own deadline and its own concurrency permit.
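
Note that with the annotation defaults the Retry aspect wraps the CircuitBreaker aspect, so attempts should stop as soon as the circuit opens: rejections are cheap, but retrying them is pointless. One way to express that (a sketch, reusing the inventoryService instance name) is to put CallNotPermittedException on the retry's ignore list:

```yaml
resilience4j:
  retry:
    instances:
      inventoryService:
        ignore-exceptions:
          # rejections from an open circuit should not be retried
          - io.github.resilience4j.circuitbreaker.CallNotPermittedException
```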

// Programmatic combination using Decorators builder (recommended)
@Service
public class ResilientInventoryClient {

    private final Supplier<InventoryResponse> resilientSupplier;

    public ResilientInventoryClient(
            CircuitBreakerRegistry cbRegistry,
            RetryRegistry retryRegistry,
            BulkheadRegistry bulkheadRegistry) {

        CircuitBreaker cb   = cbRegistry.circuitBreaker("inventoryService");
        Retry retry         = retryRegistry.retry("inventoryService");
        Bulkhead bulkhead   = bulkheadRegistry.bulkhead("inventoryService");

        // Each with* call wraps everything before it, so the last decorator
        // is outermost: Retry( CircuitBreaker( Bulkhead( actual call ) ) )
        this.resilientSupplier = Decorators
            .ofSupplier(() -> httpClient.fetchInventory())
            .withBulkhead(bulkhead)
            .withCircuitBreaker(cb)
            .withRetry(retry)
            .withFallback(
                List.of(CallNotPermittedException.class,
                        BulkheadFullException.class,
                        Exception.class),
                ex -> InventoryResponse.unavailable()
            )
            .decorate();
    }

    public InventoryResponse getInventory() {
        return resilientSupplier.get();
    }
}
// application.yml — annotation-based combination (Spring AOP aspect order)
# Resilience4j's default aspect order is:
#   Retry ( CircuitBreaker ( RateLimiter ( TimeLimiter ( Bulkhead ( method ) ) ) ) )
# i.e. Retry is the outermost decorator. Override via the properties
# resilience4j.retry.retry-aspect-order and
# resilience4j.circuitbreaker.circuit-breaker-aspect-order.

# Stacked annotations — the aspect order above applies regardless of the
# order in which the annotations are written:
# @Retry(name="svc", fallbackMethod="rFallback")
# @CircuitBreaker(name="svc", fallbackMethod="cbFallback")
# @Bulkhead(name="svc", fallbackMethod="bFallback")
# public Response call() { ... }

# Or use the programmatic Decorators builder shown above for explicit ordering.
⚠️ Important: When using annotations, the Resilience4j Spring Boot starter sets a default aspect order in which Retry wraps CircuitBreaker. Verify the resilience4j.circuitbreaker.circuit-breaker-aspect-order and resilience4j.retry.retry-aspect-order properties in your version to ensure the ordering you expect.

10. Actuator Integration and Metrics

Resilience4j integrates with Spring Boot Actuator to expose circuit breaker state via /actuator/health and detailed metrics via /actuator/circuitbreakers, /actuator/retries, and /actuator/bulkheads. All metrics are also published to Micrometer for Prometheus/Grafana.

// application.yml — Actuator + Resilience4j metrics exposure
management:
  endpoints:
    web:
      exposure:
        include: health, metrics, circuitbreakers, retries, bulkheads, ratelimiters
  endpoint:
    health:
      show-details: always
  health:
    circuitbreakers:
      enabled: true
    ratelimiters:
      enabled: true

resilience4j:
  circuitbreaker:
    instances:
      inventoryService:
        register-health-indicator: true    # appear in /actuator/health
        event-consumer-buffer-size: 20     # buffer for /actuator/circuitbreakerevents
// Key Prometheus metrics to monitor
# CircuitBreaker
resilience4j_circuitbreaker_state{name="inventoryService",state}    # one gauge per state; value 1 = breaker is in that state
resilience4j_circuitbreaker_failure_rate{name="inventoryService"}   # failure %
resilience4j_circuitbreaker_calls_total{name,kind}                  # kind: successful, failed, not_permitted, ignored

# Retry
resilience4j_retry_calls_total{name="paymentService",kind}          # kind: successful_with_retry, failed_with_retry

# Bulkhead
resilience4j_bulkhead_available_concurrent_calls{name}
resilience4j_bulkhead_max_allowed_concurrent_calls{name}

# RateLimiter
resilience4j_ratelimiter_available_permissions{name}
resilience4j_ratelimiter_waiting_threads{name}
// Prometheus alerting rules (PromQL)
groups:
  - name: resilience4j
    rules:
      # Alert when any circuit breaker transitions to OPEN state
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is OPEN"
          description: "{{ $labels.name }} has been open for more than 30 seconds. Check downstream health."

      # Alert when more than 10% of retried calls still fail
      - alert: HighRetryRate
        expr: |
          rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m])
            / rate(resilience4j_retry_calls_total[5m]) > 0.10
        for: 5m
        labels:
          severity: warning

11. Testing Fault Tolerance with WireMock

Testing resilience patterns requires the ability to inject faults on demand. WireMock simulates downstream HTTP services with configurable fault injection — connection resets, delays, 5xx responses — making it ideal for driving circuit breakers open and verifying retry behavior.

// pom.xml test dependencies
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-contract-wiremock</artifactId>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <scope>test</scope>
</dependency>
// InventoryClientCircuitBreakerTest.java
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@AutoConfigureWireMock(port = 0)
@TestPropertySource(properties = {
    "inventory.service.url=http://localhost:${wiremock.server.port}",
    "resilience4j.circuitbreaker.instances.inventoryService.sliding-window-size=3",
    "resilience4j.circuitbreaker.instances.inventoryService.minimum-number-of-calls=3",
    "resilience4j.circuitbreaker.instances.inventoryService.failure-rate-threshold=50",
    "resilience4j.circuitbreaker.instances.inventoryService.wait-duration-in-open-state=1s"
})
class InventoryClientCircuitBreakerTest {

    @Autowired private InventoryClient inventoryClient;
    @Autowired private CircuitBreakerRegistry circuitBreakerRegistry;

    @Test
    void shouldOpenCircuitAfterThresholdFailures() {
        // Stub 3 consecutive 503 responses to drive the circuit open
        stubFor(get(urlPathMatching("/inventory/.*"))
            .willReturn(aResponse().withStatus(503).withBody("Service Unavailable")));

        // 3 calls all fail — 100% failure rate exceeds the 50% threshold, so the circuit opens
        // (the WebClient-based client surfaces 503 as WebClientResponseException.ServiceUnavailable)
        assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
            .isInstanceOf(WebClientResponseException.ServiceUnavailable.class);
        assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
            .isInstanceOf(WebClientResponseException.ServiceUnavailable.class);
        assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
            .isInstanceOf(WebClientResponseException.ServiceUnavailable.class);

        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("inventoryService");
        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);

        // Next call should be rejected immediately without hitting WireMock
        assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
            .isInstanceOf(CallNotPermittedException.class);

        // Verify WireMock only received 3 calls, not 4
        verify(3, getRequestedFor(urlPathMatching("/inventory/.*")));
    }

    @Test
    void shouldTransitionToHalfOpenAndClosedOnSuccess() throws InterruptedException {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("inventoryService");
        cb.transitionToOpenState(); // Force open

        // Stub a success response for HALF_OPEN probe
        stubFor(get(urlPathMatching("/inventory/.*"))
            .willReturn(aResponse().withStatus(200)
                .withBody("{\"sku\":\"SKU-001\",\"available\":true}")
                .withHeader("Content-Type", "application/json")));

        // Wait for waitDurationInOpenState=1s, then trigger transition
        Thread.sleep(1100);
        cb.transitionToHalfOpenState();

        // Probe call should succeed and close the circuit
        InventoryResponse response = inventoryClient.checkStock("SKU-001");
        assertThat(response.isAvailable()).isTrue();
        assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
    }
}
// WireMock fault injection — simulate connection reset and latency
// Connection reset (simulates TCP RST — surfaces as a connection error,
// e.g. ResourceAccessException with RestTemplate or WebClientRequestException with WebClient)
stubFor(post("/payments/charge")
    .willReturn(aResponse().withFault(Fault.CONNECTION_RESET_BY_PEER)));

// Fixed delay (tests TimeLimiter timeout behavior)
stubFor(get("/inventory/SKU-999")
    .willReturn(aResponse().withStatus(200)
        .withFixedDelay(3000)          // 3s delay > 2s TimeLimiter timeout
        .withBody("{\"available\":true}")));

// Random delay range (tests retry with jitter)
stubFor(get(urlPathMatching("/inventory/.*"))
    .willReturn(aResponse().withStatus(503)
        .withRandomDelay(new UniformDistribution(100, 500))));

12. Production Checklist

Before deploying fault tolerance patterns to production, validate each item in this checklist to avoid common pitfalls:

  • Do not retry 4xx errors — configure ignoreExceptions for business validation exceptions
  • Add jitter to all exponential backoff configurations (randomizedWaitFactor ≥ 0.3)
  • Set minimumNumberOfCalls appropriately — a window of 2 calls with 1 failure = 50% rate is statistically unreliable
  • Register health indicators (register-health-indicator: true) so Kubernetes liveness/readiness probes reflect circuit breaker state
  • Test fallbacks under load — ensure fallback methods themselves do not throw exceptions
  • Tune TimeLimiter per SLA — do not use a global timeout; each downstream has a different p99 latency profile
  • Use ThreadPoolBulkhead for async calls and SemaphoreBulkhead for synchronous/reactive calls
  • Monitor retry rates with Prometheus/Grafana — sustained >5% retry rate indicates an underlying stability problem
  • Verify decorator order — Resilience4j's default is Retry → CircuitBreaker → TimeLimiter → Bulkhead (outermost to innermost); make programmatic Decorators chains match it deliberately
  • Configure event buffer sizes (event-consumer-buffer-size) to avoid dropping events in high-throughput services
  • Test circuit breaker state transitions in integration tests with forced state transitions via circuitBreaker.transitionToOpenState()
  • Document SLOs per dependency — circuit breaker thresholds should be informed by downstream SLOs, not guessed
Configuration Property | Recommended Starting Value | Notes
sliding-window-size | 10–20 | Too small = noisy; too large = slow to open
failure-rate-threshold | 50% | Lower for critical paths (30%)
wait-duration-in-open-state | 30s–60s | Match downstream recovery SLA
max-attempts (retry) | 3 | More than 3 rarely helps and worsens latency
timeout-duration (time limiter) | 2× downstream p99 | Profile in staging before setting
max-concurrent-calls (bulkhead) | 10–25% of thread pool size | Reserve capacity for other dependencies
Tags: spring retry, resilience4j, circuit breaker, bulkhead pattern, fault tolerance, microservices, spring boot, 2026
