Spring Retry & Resilience4j: Complete Fault Tolerance Guide for Spring Boot Microservices (2026)
Transient network blips, momentary service restarts, and cascading failures are the reality of distributed systems. In this guide, you will master every fault-tolerance tool available in the Spring Boot ecosystem — from simple @Retryable annotations to production-grade Resilience4j circuit breakers, bulkheads, rate limiters, and time limiters — with complete Java configuration and production-ready code.
Spring Retry handles simple retry scenarios via @Retryable / RetryTemplate with exponential backoff and jitter. Resilience4j is the production-grade choice: it provides CircuitBreaker (CLOSED → OPEN → HALF_OPEN state machine), Retry, Bulkhead (thread-pool or semaphore isolation), RateLimiter, and TimeLimiter — all with Micrometer metrics and Actuator health endpoints. Combine them in the correct order: Bulkhead → CircuitBreaker → Retry → TimeLimiter so each decorator protects the next.
Table of Contents
- Why Fault Tolerance? Cascading Failures in Microservices
- Spring Retry: @Retryable, @Recover, RetryTemplate
- Exponential Backoff with Jitter (Configuration Deep Dive)
- Resilience4j vs Hystrix (Migration Guide)
- Resilience4j CircuitBreaker (CLOSED / OPEN / HALF_OPEN States)
- Resilience4j Retry (decorateFunction, @Retry Annotation)
- Bulkhead Pattern (ThreadPoolBulkhead vs SemaphoreBulkhead)
- RateLimiter and TimeLimiter
- Combining Patterns (Order Matters)
- Actuator Integration and Metrics
- Testing Fault Tolerance with WireMock
- Production Checklist
1. Why Fault Tolerance? Cascading Failures in Microservices
In a monolith, a slow database query blocks a thread but does not propagate failure horizontally. In a microservices system, Service A calls Service B which calls Service C. When Service C degrades — perhaps due to a GC pause, a database lock, or a noisy-neighbor on a shared Kubernetes node — threads in Service B start queuing waiting for C. Service B's thread pool fills up and starts rejecting Service A's requests, which then back up into Service A's thread pool. Within seconds an isolated failure in one leaf service becomes a full system outage: the cascading failure.
Fault tolerance patterns break these cascade chains. The key patterns and their roles are:
- Retry — automatically repeat a transient failure (network blip, 503) without surfacing it to the caller
- Circuit Breaker — stop calling a downstream that is clearly unhealthy; fail fast and return a fallback
- Bulkhead — limit concurrency per dependency so one slow service cannot exhaust the global thread pool
- Rate Limiter — throttle outgoing calls to protect downstream services from being overwhelmed
- Time Limiter — cancel a call after a deadline to bound the worst-case latency contribution
| Pattern | Problem Solved | Library |
|---|---|---|
| Retry | Transient network / service errors | Spring Retry, Resilience4j |
| Circuit Breaker | Cascading failures, sustained outages | Resilience4j |
| Bulkhead | Thread pool exhaustion, resource isolation | Resilience4j |
| Rate Limiter | Downstream overload, quota enforcement | Resilience4j |
| Time Limiter | Unbounded latency, thread starvation | Resilience4j |
2. Spring Retry: @Retryable, @Recover, RetryTemplate
Spring Retry adds retry capability to any Spring bean method with minimal configuration. Add the dependency and enable retries with @EnableRetry:
<dependency>
<groupId>org.springframework.retry</groupId>
<artifactId>spring-retry</artifactId>
</dependency>
<!-- Spring Retry requires AOP -->
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-aspects</artifactId>
</dependency>
@SpringBootApplication
@EnableRetry
public class PaymentServiceApplication {
public static void main(String[] args) {
SpringApplication.run(PaymentServiceApplication.class, args);
}
}
Annotate the method you want retried. The @Recover method is invoked when all retry attempts are exhausted:
@Service
public class PaymentGatewayClient {
@Retryable(
retryFor = {HttpServerErrorException.class, ResourceAccessException.class},
maxAttempts = 3,
backoff = @Backoff(delay = 500, multiplier = 2.0, maxDelay = 5000, random = true)
)
public PaymentResponse charge(ChargeRequest request) {
log.info("Attempting charge for orderId={}", request.getOrderId());
return restTemplate.postForObject("/payments/charge", request, PaymentResponse.class);
}
@Recover
public PaymentResponse chargeRecover(Exception ex, ChargeRequest request) {
log.error("All retries exhausted for orderId={}: {}", request.getOrderId(), ex.getMessage());
// Return a graceful degraded response or throw a domain exception
throw new PaymentServiceUnavailableException("Payment gateway unavailable. Please retry later.");
}
}
For programmatic retry logic use RetryTemplate — useful when you need retry inside a non-Spring-managed class or want full control over the policy:
@Configuration
public class RetryConfig {
@Bean
public RetryTemplate retryTemplate() {
ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
backOff.setInitialInterval(300);
backOff.setMultiplier(2.0);
backOff.setMaxInterval(10_000);
SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(3,
Map.of(HttpServerErrorException.class, true,
ResourceAccessException.class, true));
RetryTemplate template = new RetryTemplate();
template.setBackOffPolicy(backOff);
template.setRetryPolicy(retryPolicy);
template.registerListener(new RetryListenerSupport() {
@Override
public <T, E extends Throwable> void onError(
RetryContext ctx, RetryCallback<T, E> cb, Throwable t) {
log.warn("Retry attempt {} failed: {}", ctx.getRetryCount(), t.getMessage());
}
});
return template;
}
}
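Using the template is then a one-liner around the remote call. A minimal sketch, assuming a gatewayClient field comparable to the earlier example (the recovery callback plays the role of @Recover; PaymentResponse.unavailable is a hypothetical degraded-response factory):
// First argument: the retried operation; second: recovery once all attempts are exhausted
PaymentResponse response = retryTemplate.execute(
        context -> gatewayClient.charge(request),           // retried on the configured exceptions
        context -> {                                         // recovery after exhaustion
            log.error("All {} attempts failed", context.getRetryCount());
            return PaymentResponse.unavailable(request.getOrderId());   // hypothetical factory
        });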
3. Exponential Backoff with Jitter (Configuration Deep Dive)
Naive exponential backoff causes a thundering herd: if 100 microservice instances all restart at the same time after a shared dependency outage, they all retry at the same exponential intervals and hit the recovering service in synchronized waves. Adding jitter (randomization) spreads the retries across time, dramatically reducing load spikes during recovery.
| Strategy | Formula | Best For |
|---|---|---|
| Fixed | delay = constant | Simple retries, low concurrency |
| Exponential | delay = initial × multiplier^n | Most service-to-service calls |
| Full Jitter | delay = random(0, cap) | High-concurrency, many retrying clients |
| Decorrelated Jitter | delay = random(base, prev×3) | Best spread (AWS recommendation) |
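To make the difference concrete, here is a small self-contained sketch (not from any library) that prints the first few delays produced by plain exponential backoff versus full jitter; with 100 synchronized clients, only the jittered column spreads the load:
import java.util.concurrent.ThreadLocalRandom;

public class BackoffDemo {
    // Deterministic exponential backoff: initial * multiplier^attempt, capped at capMs
    static long exponential(long initialMs, double multiplier, int attempt, long capMs) {
        return Math.min(capMs, (long) (initialMs * Math.pow(multiplier, attempt)));
    }

    // Full jitter: pick uniformly between 0 and the exponential delay
    static long fullJitter(long initialMs, double multiplier, int attempt, long capMs) {
        return ThreadLocalRandom.current().nextLong(exponential(initialMs, multiplier, attempt, capMs) + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.printf("attempt %d: exponential=%dms, fullJitter=%dms%n",
                    attempt,
                    exponential(500, 2.0, attempt, 10_000),
                    fullJitter(500, 2.0, attempt, 10_000));
        }
    }
}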
# Spring Retry's @Backoff is configured on the annotation itself, not in YAML:
# @Backoff(delay=500, multiplier=2.0, maxDelay=10000, random=true)
# random=true switches to ExponentialRandomBackOffPolicy: each interval is drawn
# uniformly between the deterministic delay and delay * multiplier
# For Resilience4j exponential backoff with jitter:
resilience4j:
retry:
instances:
paymentService:
max-attempts: 3
wait-duration: 500ms
exponential-backoff-multiplier: 2.0
exponential-max-wait-duration: 10s
enable-exponential-backoff: true
randomized-wait-factor: 0.5 # jitter: 50% of wait-duration
retry-exceptions:
- org.springframework.web.client.HttpServerErrorException
- java.net.ConnectException
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
Duration.ofMillis(200), // initialInterval
2.0, // multiplier
0.5, // randomizationFactor (jitter)
Duration.ofSeconds(8) // maxInterval
))
.retryOnException(e -> e instanceof HttpServerErrorException
|| e instanceof ConnectException)
.build();
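To put that config to work, register it and decorate the call — a short sketch reusing the hypothetical gatewayClient and request from the Spring Retry example:
RetryRegistry registry = RetryRegistry.of(config);
Retry retry = registry.retry("paymentService");     // uses the registry's default config

Supplier<PaymentResponse> withBackoff =
        Retry.decorateSupplier(retry, () -> gatewayClient.charge(request));
PaymentResponse response = withBackoff.get();       // up to 4 attempts with jittered exponential backoff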
4. Resilience4j vs Hystrix (Migration Guide)
Netflix Hystrix entered maintenance mode in 2018 and is incompatible with Spring Boot 3 (Java 17+). Resilience4j is the recommended replacement. It is designed for Java 8+ functional programming, does not require a background health-check thread, and integrates natively with Micrometer, Spring Boot Actuator, and Spring Cloud Circuit Breaker abstraction.
| Feature | Hystrix | Resilience4j |
|---|---|---|
| Maintenance Status | EOL (2018) | Actively maintained |
| Spring Boot 3 Support | No | Yes (native) |
| Sliding Window | Count-based only | Count-based + Time-based |
| Reactive Support | RxJava 1 only | RxJava 2/3, Project Reactor |
| Metrics | Hystrix Stream / Dashboard | Micrometer (Prometheus, Grafana) |
| Thread Model | Mandatory HystrixCommand thread pool | Decorators on any Supplier/Function |
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
<version>2.2.0</version>
</dependency>
<!-- Spring Boot Actuator for health + metrics endpoints -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- AOP for annotation support -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-aop</artifactId>
</dependency>
5. Resilience4j CircuitBreaker (CLOSED / OPEN / HALF_OPEN States, Sliding Window)
The Resilience4j CircuitBreaker is a state machine with three states. In CLOSED state, calls pass through and outcomes are recorded in a sliding window. When the failure rate exceeds the threshold, it transitions to OPEN — all calls are immediately rejected with CallNotPermittedException, giving the downstream time to recover. After waitDurationInOpenState, it enters HALF_OPEN and allows a limited number of probe calls. Success closes it; failure re-opens it.
resilience4j:
circuitbreaker:
instances:
inventoryService:
sliding-window-type: COUNT_BASED # or TIME_BASED
sliding-window-size: 10 # last 10 calls
minimum-number-of-calls: 5 # min calls before evaluating
failure-rate-threshold: 50 # open if >=50% failed
slow-call-duration-threshold: 2s # count slow calls as failures
slow-call-rate-threshold: 80 # open if >=80% are slow
wait-duration-in-open-state: 30s # wait before half-open
permitted-number-of-calls-in-half-open-state: 3
automatic-transition-from-open-to-half-open-enabled: true
record-exceptions:
- java.io.IOException
- org.springframework.web.client.HttpServerErrorException
ignore-exceptions:
- com.example.BusinessValidationException
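The same settings can also be built in Java, which is handy when breakers are created dynamically (per tenant or per endpoint). A sketch equivalent to the YAML above, assuming the same exception classes:
CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
        .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
        .slidingWindowSize(10)
        .minimumNumberOfCalls(5)
        .failureRateThreshold(50)
        .slowCallDurationThreshold(Duration.ofSeconds(2))
        .slowCallRateThreshold(80)
        .waitDurationInOpenState(Duration.ofSeconds(30))
        .permittedNumberOfCallsInHalfOpenState(3)
        .automaticTransitionFromOpenToHalfOpenEnabled(true)
        .recordExceptions(IOException.class, HttpServerErrorException.class)
        .ignoreExceptions(BusinessValidationException.class)
        .build();

CircuitBreaker inventoryBreaker =
        CircuitBreakerRegistry.of(cbConfig).circuitBreaker("inventoryService");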
@Service
public class InventoryClient {
@CircuitBreaker(name = "inventoryService", fallbackMethod = "inventoryFallback")
public InventoryResponse checkStock(String productId) {
return webClient.get()
.uri("/inventory/{id}", productId)
.retrieve()
.bodyToMono(InventoryResponse.class)
.block();
}
// Fallback must have the same return type and include the Throwable parameter
public InventoryResponse inventoryFallback(String productId, CallNotPermittedException ex) {
log.warn("Circuit open for inventory service, returning cached data: {}", ex.getMessage());
return InventoryResponse.ofCachedData(productId);
}
public InventoryResponse inventoryFallback(String productId, Exception ex) {
log.error("Inventory service error, returning default: {}", ex.getMessage());
return InventoryResponse.defaultOutOfStock(productId);
}
}
@Service
public class OrderService {
private final CircuitBreaker cb;
public OrderService(CircuitBreakerRegistry registry) {
this.cb = registry.circuitBreaker("inventoryService");
cb.getEventPublisher()
.onStateTransition(e -> log.info("CB state: {} -> {}",
e.getStateTransition().getFromState(),
e.getStateTransition().getToState()));
}
public InventoryResponse checkStock(String productId) {
Supplier<InventoryResponse> decorated =
CircuitBreaker.decorateSupplier(cb, () -> inventoryClient.call(productId));
return Try.ofSupplier(decorated)
.recover(CallNotPermittedException.class, ex -> InventoryResponse.cached(productId))
.get();
}
}
6. Resilience4j Retry (decorateFunction, @Retry Annotation)
Resilience4j Retry wraps a function and re-executes it on exception or on a predicate match. Unlike Spring Retry, it supports both synchronous and async (CompletableFuture) variants, and emits fine-grained Micrometer events per retry attempt.
resilience4j:
retry:
instances:
paymentService:
max-attempts: 4
wait-duration: 300ms
enable-exponential-backoff: true
exponential-backoff-multiplier: 2.0
exponential-max-wait-duration: 8s
randomized-wait-factor: 0.5
retry-exceptions:
- java.net.ConnectException
- org.springframework.web.client.HttpServerErrorException$ServiceUnavailable
ignore-exceptions:
- com.example.PaymentDeclinedException # do NOT retry 4xx business errors
@Service
public class PaymentService {
@Retry(name = "paymentService", fallbackMethod = "paymentFallback")
public PaymentResult processPayment(PaymentRequest req) {
log.debug("Sending payment request for amount={}", req.getAmount());
return paymentGatewayClient.charge(req);
}
private PaymentResult paymentFallback(PaymentRequest req, Exception ex) {
log.error("Payment service unavailable after retries: {}", ex.getMessage());
return PaymentResult.pending(req.getOrderId(), "Gateway unavailable — queued for retry");
}
}
// Programmatic — decorateFunction for non-annotation use
Retry retry = retryRegistry.retry("paymentService");
Function<PaymentRequest, PaymentResult> decorated =
Retry.decorateFunction(retry, gatewayClient::charge);
// For CompletableFuture (async):
CompletableFuture<PaymentResult> future =
retry.executeCompletionStage(scheduler, () ->
CompletableFuture.supplyAsync(() -> gatewayClient.charge(req))
).toCompletableFuture();
A critical rule: never retry 4xx HTTP errors. A 400 Bad Request or 422 Unprocessable Entity indicates a client-side problem that will not be fixed by retrying. Only retry on 5xx server errors, network timeouts, and ConnectException. Configure ignoreExceptions to exclude business validation exceptions from retry logic.
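In programmatic form the rule looks like this — a minimal sketch assuming RestTemplate-style exceptions (HttpClientErrorException for 4xx, HttpServerErrorException for 5xx) and the PaymentDeclinedException from the YAML above:
RetryConfig no4xxRetries = RetryConfig.custom()
        .maxAttempts(3)
        .retryExceptions(HttpServerErrorException.class,      // 5xx: transient, worth retrying
                         ResourceAccessException.class)        // connect/read timeouts
        .ignoreExceptions(HttpClientErrorException.class,      // 4xx: retrying cannot help
                          PaymentDeclinedException.class)      // business rejection
        .build();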
7. Bulkhead Pattern (ThreadPoolBulkhead vs SemaphoreBulkhead)
The Bulkhead pattern isolates failures in one part of the system from affecting the rest — named after watertight compartments in ships. Resilience4j offers two implementations:
- SemaphoreBulkhead — uses a counting semaphore to limit the number of concurrent calls. Lightweight, works in the calling thread. Best for synchronous, blocking calls with predictable execution time.
- ThreadPoolBulkhead — submits work to a dedicated thread pool with a queue. Callers are isolated from the worker threads. Best for async calls and when you need strict thread isolation between dependencies.
resilience4j:
bulkhead:
instances:
inventoryService:
max-concurrent-calls: 20 # max concurrent semaphore permits
max-wait-duration: 50ms # time to wait for a permit before BulkheadFullException
thread-pool-bulkhead:
instances:
paymentService:
max-thread-pool-size: 10 # max worker threads
core-thread-pool-size: 4 # always-on threads
queue-capacity: 50 # pending task queue size
keep-alive-duration: 20ms # idle thread keep-alive
writable-stack-trace-enabled: true # include a full stack trace in BulkheadFullException
@Service
public class OrderOrchestrator {
// SEMAPHORE type (default) — synchronous
@Bulkhead(name = "inventoryService", fallbackMethod = "inventoryBulkheadFallback")
public InventoryResponse checkInventory(String skuId) {
return inventoryClient.getStock(skuId);
}
public InventoryResponse inventoryBulkheadFallback(String skuId, BulkheadFullException ex) {
log.warn("Inventory bulkhead full, returning cached data for sku={}", skuId);
return cache.getOrDefault(skuId, InventoryResponse.unknown());
}
// THREADPOOL type — returns CompletableFuture
@Bulkhead(name = "paymentService",
type = Bulkhead.Type.THREADPOOL,
fallbackMethod = "paymentBulkheadFallback")
public CompletableFuture<PaymentResult> processPaymentAsync(PaymentRequest req) {
return CompletableFuture.supplyAsync(() -> paymentClient.charge(req));
}
public CompletableFuture<PaymentResult> paymentBulkheadFallback(
PaymentRequest req, BulkheadFullException ex) {
return CompletableFuture.completedFuture(PaymentResult.queued(req.getOrderId()));
}
}
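Outside of Spring beans, the same semaphore bulkhead can be built programmatically — a brief sketch assuming the blocking inventoryClient used above:
BulkheadConfig bulkheadConfig = BulkheadConfig.custom()
        .maxConcurrentCalls(20)
        .maxWaitDuration(Duration.ofMillis(50))   // wait briefly for a permit, then BulkheadFullException
        .build();
Bulkhead inventoryBulkhead = Bulkhead.of("inventoryService", bulkheadConfig);

Supplier<InventoryResponse> guarded =
        Bulkhead.decorateSupplier(inventoryBulkhead, () -> inventoryClient.getStock("SKU-001"));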
| Aspect | SemaphoreBulkhead | ThreadPoolBulkhead |
|---|---|---|
| Execution Model | Calling thread | Dedicated thread pool |
| Return Type | Any | CompletableFuture only |
| Overhead | Very low | Thread context switch |
| Queue Support | No (waits up to max-wait-duration, then rejects) | Yes (configurable queue) |
| Reactive | Yes | Not recommended |
8. RateLimiter and TimeLimiter
RateLimiter limits the number of calls per time window — useful when calling third-party APIs with strict quotas (e.g., a payment gateway that allows 100 req/sec) or to protect your own services from burst traffic. TimeLimiter wraps a CompletableFuture and cancels it if it does not complete within a configured deadline, preventing threads from waiting indefinitely.
resilience4j:
ratelimiter:
instances:
smsGateway:
limit-for-period: 50 # max 50 calls per refresh period
limit-refresh-period: 1s # refresh window
timeout-duration: 100ms # wait time for a permission; 0 = fail fast
timelimiter:
instances:
inventoryService:
timeout-duration: 2s # cancel CompletableFuture after 2s
cancel-running-future: true # interrupt the underlying thread
@Service
public class SmsService {
// Rate-limit outgoing SMS to 50/sec to comply with gateway quota
@RateLimiter(name = "smsGateway", fallbackMethod = "smsFallback")
public SmsResult send(SmsRequest request) {
return smsGatewayClient.send(request);
}
public SmsResult smsFallback(SmsRequest req, RequestNotPermitted ex) {
log.warn("SMS rate limit reached, queuing message id={}", req.getMessageId());
smsQueue.enqueue(req);
return SmsResult.queued(req.getMessageId());
}
}
@Service
public class ShippingService {
// TimeLimiter wraps CompletableFuture — must return CompletableFuture
@TimeLimiter(name = "inventoryService", fallbackMethod = "shippingFallback")
public CompletableFuture<ShippingQuote> getQuoteAsync(ShippingRequest req) {
return CompletableFuture.supplyAsync(() -> shippingClient.getQuote(req));
}
public CompletableFuture<ShippingQuote> shippingFallback(ShippingRequest req, TimeoutException ex) {
log.warn("Shipping quote timed out, returning default estimate");
return CompletableFuture.completedFuture(ShippingQuote.defaultEstimate());
}
}
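Both can also be configured programmatically — a sketch mirroring the YAML above, with the smsGatewayClient and request names assumed:
RateLimiterConfig rlConfig = RateLimiterConfig.custom()
        .limitForPeriod(50)
        .limitRefreshPeriod(Duration.ofSeconds(1))
        .timeoutDuration(Duration.ofMillis(100))   // wait up to 100ms for a permit, then RequestNotPermitted
        .build();
RateLimiter smsLimiter = RateLimiter.of("smsGateway", rlConfig);
Supplier<SmsResult> limitedSend =
        RateLimiter.decorateSupplier(smsLimiter, () -> smsGatewayClient.send(request));

TimeLimiterConfig tlConfig = TimeLimiterConfig.custom()
        .timeoutDuration(Duration.ofSeconds(2))
        .cancelRunningFuture(true)
        .build();
TimeLimiter quoteTimeLimiter = TimeLimiter.of("inventoryService", tlConfig);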
9. Combining Patterns (Order Matters: Bulkhead → CircuitBreaker → Retry → TimeLimiter)
When combining multiple Resilience4j decorators, the decoration order determines which wraps which. The outermost decorator is executed first. The correct production order is:
- Bulkhead (outermost) — reject early if resource limit reached, before wasting any other resources
- CircuitBreaker — short-circuit immediately if downstream is known-unhealthy
- Retry — retry transient failures of the wrapped call; because the circuit breaker sits above it, the breaker records only the final post-retry outcome, so one-off blips that a retry absorbs never count toward the failure rate
- TimeLimiter (innermost) — enforce a per-attempt deadline on the actual I/O call
Why this order? If Retry sat outside the CircuitBreaker, retry attempts would keep firing even while the circuit is open, each one failing fast with CallNotPermittedException and adding pointless backoff delay. If TimeLimiter sat outside Retry, the deadline would cover the whole retry sequence instead of bounding each attempt, so a single hung attempt could consume the entire budget.
@Service
public class ResilientInventoryClient {
private final Supplier<InventoryResponse> resilientSupplier;
public ResilientInventoryClient(
CircuitBreakerRegistry cbRegistry,
RetryRegistry retryRegistry,
BulkheadRegistry bulkheadRegistry) {
CircuitBreaker cb = cbRegistry.circuitBreaker("inventoryService");
Retry retry = retryRegistry.retry("inventoryService");
Bulkhead bulkhead = bulkheadRegistry.bulkhead("inventoryService");
// Decoration order: the first with* call is innermost, the last is outermost.
// actual call -> Retry -> CircuitBreaker -> Bulkhead
// (TimeLimiter needs a CompletionStage, so it is omitted from this synchronous chain)
this.resilientSupplier = Decorators
.ofSupplier(() -> httpClient.fetchInventory())
.withRetry(retry)
.withCircuitBreaker(cb)
.withBulkhead(bulkhead)
.withFallback(
List.of(CallNotPermittedException.class,
BulkheadFullException.class,
Exception.class),
ex -> InventoryResponse.unavailable()
)
.decorate();
}
public InventoryResponse getInventory() {
return resilientSupplier.get();
}
}
When you rely on annotations instead of the programmatic builder, be aware that the Resilience4j aspects apply in a fixed default order: Retry (outermost) wraps CircuitBreaker, which wraps RateLimiter, TimeLimiter, and finally Bulkhead (innermost). You can still stack the annotations on a single method:
@Bulkhead(name = "svc", fallbackMethod = "bFallback")
@CircuitBreaker(name = "svc", fallbackMethod = "cbFallback")
@Retry(name = "svc", fallbackMethod = "rFallback")
public Response call() { ... }
but the order in which the annotations are written does not change the aspect order. Note that the default places Retry outside the CircuitBreaker — the opposite of the order recommended above — so either adjust the resilience4j.circuitbreaker.circuit-breaker-aspect-order and resilience4j.retry.retry-aspect-order properties, or use the programmatic Decorators builder shown above for fully explicit ordering.
10. Actuator Integration and Metrics
Resilience4j integrates with Spring Boot Actuator to expose circuit breaker state via /actuator/health and detailed metrics via /actuator/circuitbreakers, /actuator/retries, and /actuator/bulkheads. All metrics are also published to Micrometer for Prometheus/Grafana.
management:
endpoints:
web:
exposure:
include: health, metrics, circuitbreakers, retries, bulkheads, ratelimiters
endpoint:
health:
show-details: always
health:
circuitbreakers:
enabled: true
ratelimiters:
enabled: true
resilience4j:
circuitbreaker:
instances:
inventoryService:
register-health-indicator: true # appear in /actuator/health
event-consumer-buffer-size: 20 # buffer for /actuator/circuitbreakerevents
# CircuitBreaker
resilience4j_circuitbreaker_state{name="inventoryService",state} # one gauge per state; value 1 = breaker is currently in that state
resilience4j_circuitbreaker_failure_rate{name="inventoryService"} # failure %
resilience4j_circuitbreaker_calls_seconds_count{name,kind} # kind: successful, failed, ignored
resilience4j_circuitbreaker_not_permitted_calls_total{name} # calls rejected while the breaker is OPEN
# Retry
resilience4j_retry_calls_total{name="paymentService",kind} # kind: successful_with_retry, failed_with_retry
# Bulkhead
resilience4j_bulkhead_available_concurrent_calls{name}
resilience4j_bulkhead_max_allowed_concurrent_calls{name}
# RateLimiter
resilience4j_ratelimiter_available_permissions{name}
resilience4j_ratelimiter_waiting_threads{name}
# Prometheus alerting rules (2.x YAML rule format)
groups:
  - name: resilience4j-alerts
    rules:
      # Alert when any circuit breaker transitions to OPEN state
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is OPEN"
          description: "{{ $labels.name }} has been open for more than 30 seconds. Check downstream health."
      # Alert when more than 10% of calls still fail after retrying
      - alert: HighRetryRate
        expr: >
          sum by (name) (rate(resilience4j_retry_calls_total{kind="failed_with_retry"}[5m]))
          /
          sum by (name) (rate(resilience4j_retry_calls_total[5m])) > 0.10
        for: 5m
        labels:
          severity: warning
11. Testing Fault Tolerance with WireMock
Testing resilience patterns requires the ability to inject faults on demand. WireMock simulates downstream HTTP services with configurable fault injection — connection resets, delays, 5xx responses — making it ideal for driving circuit breakers open and verifying retry behavior.
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-contract-wiremock</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
<scope>test</scope>
</dependency>
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@AutoConfigureWireMock(port = 0)
@TestPropertySource(properties = {
"inventory.service.url=http://localhost:${wiremock.server.port}",
"resilience4j.circuitbreaker.instances.inventoryService.sliding-window-size=3",
"resilience4j.circuitbreaker.instances.inventoryService.minimum-number-of-calls=3",
"resilience4j.circuitbreaker.instances.inventoryService.failure-rate-threshold=50",
"resilience4j.circuitbreaker.instances.inventoryService.wait-duration-in-open-state=1s"
})
class InventoryClientCircuitBreakerTest {
@Autowired private InventoryClient inventoryClient;
@Autowired private CircuitBreakerRegistry circuitBreakerRegistry;
@Test
void shouldOpenCircuitAfterThresholdFailures() {
// Stub 3 consecutive 503 responses to drive the circuit open
stubFor(get(urlPathMatching("/inventory/.*"))
.willReturn(aResponse().withStatus(503).withBody("Service Unavailable")));
// 3 calls: all fail, circuit should open (50% threshold met after 3/3 calls)
assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
.isInstanceOf(HttpServerErrorException.ServiceUnavailable.class);
assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
.isInstanceOf(HttpServerErrorException.ServiceUnavailable.class);
assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
.isInstanceOf(HttpServerErrorException.ServiceUnavailable.class);
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("inventoryService");
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.OPEN);
// Next call should be rejected immediately without hitting WireMock
assertThatThrownBy(() -> inventoryClient.checkStock("SKU-001"))
.isInstanceOf(CallNotPermittedException.class);
// Verify WireMock only received 3 calls, not 4
verify(3, getRequestedFor(urlPathMatching("/inventory/.*")));
}
@Test
void shouldTransitionToHalfOpenAndClosedOnSuccess() throws InterruptedException {
CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("inventoryService");
cb.transitionToOpenState(); // Force open
// Stub a success response for HALF_OPEN probe
stubFor(get(urlPathMatching("/inventory/.*"))
.willReturn(aResponse().withStatus(200)
.withBody("{\"sku\":\"SKU-001\",\"available\":true}")
.withHeader("Content-Type", "application/json")));
// Wait for waitDurationInOpenState=1s, then trigger transition
Thread.sleep(1100);
cb.transitionToHalfOpenState();
// Probe call should succeed and close the circuit
InventoryResponse response = inventoryClient.checkStock("SKU-001");
assertThat(response.isAvailable()).isTrue();
assertThat(cb.getState()).isEqualTo(CircuitBreaker.State.CLOSED);
}
}
// Connection reset (simulates TCP RST — surfaces as a connection error, e.g. ResourceAccessException with RestTemplate)
stubFor(post("/payments/charge")
.willReturn(aResponse().withFault(Fault.CONNECTION_RESET_BY_PEER)));
// Fixed delay (tests TimeLimiter timeout behavior)
stubFor(get("/inventory/SKU-999")
.willReturn(aResponse().withStatus(200)
.withFixedDelay(3000) // 3s delay > 2s TimeLimiter timeout
.withBody("{\"available\":true}")));
// Random delay range (tests retry with jitter)
stubFor(get(urlPathMatching("/inventory/.*"))
.willReturn(aResponse().withStatus(503)
.withRandomDelay(new UniformDistribution(100, 500))));
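To verify that a retry actually recovers, WireMock scenarios can serve two failures followed by a success — a sketch assuming 503 is configured as retryable, the retry instance allows at least three attempts, and a hypothetical SKU-042 endpoint:
// Attempt 1: 503, advance the scenario
stubFor(get(urlPathEqualTo("/inventory/SKU-042")).inScenario("recovering")
        .whenScenarioStateIs(Scenario.STARTED)
        .willReturn(aResponse().withStatus(503))
        .willSetStateTo("second-attempt"));
// Attempt 2: still 503
stubFor(get(urlPathEqualTo("/inventory/SKU-042")).inScenario("recovering")
        .whenScenarioStateIs("second-attempt")
        .willReturn(aResponse().withStatus(503))
        .willSetStateTo("recovered"));
// Attempt 3: success
stubFor(get(urlPathEqualTo("/inventory/SKU-042")).inScenario("recovering")
        .whenScenarioStateIs("recovered")
        .willReturn(okJson("{\"sku\":\"SKU-042\",\"available\":true}")));

// The decorated client should succeed on the third attempt,
// and WireMock should have seen exactly three requests
assertThat(inventoryClient.checkStock("SKU-042").isAvailable()).isTrue();
verify(3, getRequestedFor(urlPathEqualTo("/inventory/SKU-042")));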
12. Production Checklist
Before deploying fault tolerance patterns to production, validate each item in this checklist to avoid common pitfalls:
- ✅ Do not retry 4xx errors — configure ignoreExceptions for business validation exceptions
- ✅ Add jitter to all exponential backoff configurations (randomizedWaitFactor ≥ 0.3)
- ✅ Set minimumNumberOfCalls appropriately — a window of 2 calls with 1 failure = 50% rate is statistically unreliable
- ✅ Register health indicators (register-health-indicator: true) so Kubernetes liveness/readiness probes reflect circuit breaker state
- ✅ Test fallbacks under load — ensure fallback methods themselves do not throw exceptions
- ✅ Tune TimeLimiter per SLA — do not use a global timeout; each downstream has a different p99 latency profile
- ✅ Use ThreadPoolBulkhead for async calls and SemaphoreBulkhead for synchronous/reactive calls
- ✅ Monitor retry rates with Prometheus/Grafana — sustained >5% retry rate indicates an underlying stability problem
- ✅ Verify decorator order in programmatic configuration — always Bulkhead → CircuitBreaker → Retry → TimeLimiter
- ✅ Configure event buffer sizes (event-consumer-buffer-size) to avoid dropping events in high-throughput services
- ✅ Test circuit breaker state transitions in integration tests with forced state transitions via circuitBreaker.transitionToOpenState()
- ✅ Document SLOs per dependency — circuit breaker thresholds should be informed by downstream SLOs, not guessed
| Configuration Property | Recommended Starting Value | Notes |
|---|---|---|
| sliding-window-size | 10–20 | Too small = noisy; too large = slow to open |
| failure-rate-threshold | 50% | Lower for critical paths (30%) |
| wait-duration-in-open-state | 30s–60s | Match downstream recovery SLA |
| max-attempts (retry) | 3 | More than 3 rarely helps and worsens latency |
| timeout-duration (time limiter) | 2× downstream p99 | Profile in staging before setting |
| max-concurrent-calls (bulkhead) | 10–25% of thread pool size | Reserve capacity for other dependencies |
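Assembled into a single starting-point configuration (someDownstream is a placeholder name — measure and tune every value against the real dependency before relying on it):
resilience4j:
  circuitbreaker:
    instances:
      someDownstream:
        sliding-window-size: 10
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
  retry:
    instances:
      someDownstream:
        max-attempts: 3
        enable-exponential-backoff: true
        randomized-wait-factor: 0.5
  timelimiter:
    instances:
      someDownstream:
        timeout-duration: 2s          # roughly 2x the measured downstream p99
  bulkhead:
    instances:
      someDownstream:
        max-concurrent-calls: 20      # 10-25% of the caller's thread pool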