Handling Partial Failures in Distributed Systems: Circuit Breaker, Retry, and Bulkhead Patterns
In distributed systems, partial failures are more insidious than total failures. When one service slows to a crawl, it can drag down the entire platform through cascading thread exhaustion. Circuit breaker, retry with exponential backoff, and bulkhead patterns form the essential defense layer every production microservices platform must implement.
The Real-World Problem: Cascading Failures
Imagine a major e-commerce platform during Black Friday. At 11:56 PM, database connection pool saturation causes the payment service to start responding in 15–20 seconds instead of its normal 200ms. Nothing has crashed — the service is still up, still accepting requests, still returning responses eventually. This is the scenario that kills platforms.
The order service begins calling payment service for each checkout. Each call ties up a thread in the order service's HTTP client connection pool while it waits for the slow response. Within 90 seconds, all 200 threads in the order service are blocked waiting for payment service responses. New order requests start queuing. The queue fills. The order service begins rejecting requests with 503 errors — not because it failed, but because it ran out of threads waiting for a degraded downstream dependency.
Now the product recommendation service, which calls order service to fetch recent order history for personalized recommendations, starts seeing 503s from order service. Its retry logic kicks in. Thread exhaustion spreads upward. The API gateway, which fans out to multiple services including recommendations, product catalog, and order history, starts timing out on aggregated responses. By 11:59 PM, what began as a 3-minute period of slow payment processing has cascaded into a complete platform outage affecting every customer on the site.
This scenario differs fundamentally from monolith failures. In a monolith, a slow database call slows the request but does not consume resources from other independent subsystems. In a microservices architecture, services share nothing except the network — but thread pools, connection pools, and timeouts create invisible coupling across service boundaries. A partial failure in one leaf service propagates inward through the dependency graph until it reaches the edge, consuming resources at every level.
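To make the thread-exhaustion mechanics concrete, here is a tiny stdlib-only simulation (the class name, pool sizes, and the simulateSaturatedPool helper are all invented for illustration): a two-thread pool stands in for the order service, a latch stands in for the slow payment call, and a third request is rejected outright even though nothing has crashed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ThreadExhaustionDemo {

    /** Returns true if a new request was rejected because all workers were blocked. */
    static boolean simulateSaturatedPool() throws InterruptedException {
        // A tiny "order service": 2 worker threads, no request queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0, TimeUnit.SECONDS,
                new SynchronousQueue<>(),                 // no buffering
                new ThreadPoolExecutor.AbortPolicy());    // reject when saturated
        CountDownLatch paymentRecovered = new CountDownLatch(1);

        // Two "checkouts" block on a slow payment call; both workers are now stuck.
        for (int i = 0; i < 2; i++) {
            pool.submit(() -> { paymentRecovered.await(); return null; });
        }

        boolean rejected = false;
        try {
            pool.submit(() -> null);   // third request: no free thread, no queue slot
        } catch (RejectedExecutionException e) {
            rejected = true;           // the "503": the service ran out of threads
        }

        paymentRecovered.countDown();  // the slow dependency "recovers"
        pool.shutdown();
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("new request rejected: " + simulateSaturatedPool());
    }
}
```

Scale the numbers up (200 threads, 15-second payment calls) and you have the Black Friday scenario above.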
The three patterns that prevent this — circuit breaker, retry with backoff, and bulkhead — each address a different failure vector. Circuit breakers stop calls to known-failing services. Retry handles transient failures that resolve quickly. Bulkhead contains the blast radius so one slow dependency cannot exhaust shared resources. None of them alone is sufficient; all three together form a coherent resilience layer.
Pattern 1: Circuit Breaker
The circuit breaker pattern, popularized by Michael Nygard in Release It!, is modeled on electrical circuit breakers. When a downstream service is failing, the circuit breaker trips — opening the circuit and immediately returning a failure (or a cached/fallback response) without actually calling the downstream service. This prevents further thread exhaustion and gives the downstream service time to recover without being overwhelmed by continued load from a struggling caller.
A circuit breaker has three states. In the Closed state (normal operation), calls pass through and outcomes are recorded in a sliding window. When the failure rate or slow-call rate exceeds configured thresholds, the breaker transitions to the Open state. In the Open state, all calls are immediately rejected with a CallNotPermittedException — no network call is made. After a configured wait duration, the breaker transitions to Half-Open, allowing a limited number of probe calls through. If those probe calls succeed, the breaker returns to Closed. If they fail, it returns to Open and resets the wait timer.
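For intuition, the three-state machine can be sketched with plain JDK types. This is a deliberately simplified toy (the MiniCircuitBreaker name and all of its internals are invented here), not how Resilience4j implements it:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;

/** Toy count-based circuit breaker, for intuition only (not Resilience4j internals). */
class MiniCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;
    private final double failureRateThreshold;   // e.g. 0.5 == 50%
    private final Duration waitInOpen;
    private final int halfOpenProbes;

    private State state = State.CLOSED;
    private final boolean[] window;              // ring buffer, true == failed call
    private int index = 0, recorded = 0, failures = 0;
    private Instant openedAt;
    private int probesUsed = 0, probeSuccesses = 0;

    MiniCircuitBreaker(int windowSize, double failureRateThreshold,
                       Duration waitInOpen, int halfOpenProbes) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.waitInOpen = waitInOpen;
        this.halfOpenProbes = halfOpenProbes;
        this.window = new boolean[windowSize];
    }

    /** Returns false when the call must be rejected without touching the network. */
    synchronized boolean tryAcquirePermission(Instant now) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, now).compareTo(waitInOpen) < 0) {
                return false;                    // still open: reject immediately
            }
            state = State.HALF_OPEN;             // wait elapsed: allow limited probes
            probesUsed = probeSuccesses = 0;
        }
        if (state == State.HALF_OPEN) {
            if (probesUsed >= halfOpenProbes) return false;
            probesUsed++;
        }
        return true;
    }

    /** Report the outcome of a permitted call. */
    synchronized void record(boolean failed, Instant now) {
        if (state == State.HALF_OPEN) {
            if (failed) { state = State.OPEN; openedAt = now; }     // reopen, reset timer
            else if (++probeSuccesses >= halfOpenProbes) close();   // probes passed: reclose
            return;
        }
        if (window[index]) failures--;           // evict the oldest outcome
        window[index] = failed;
        if (failed) failures++;
        index = (index + 1) % windowSize;
        recorded = Math.min(recorded + 1, windowSize);
        if (recorded == windowSize
                && (double) failures / windowSize >= failureRateThreshold) {
            state = State.OPEN;
            openedAt = now;
        }
    }

    private void close() {
        state = State.CLOSED;
        Arrays.fill(window, false);
        index = recorded = failures = 0;
    }

    synchronized State state() { return state; }
}
```

The real library adds slow-call tracking, time-based windows, event publishing, and lock-free bookkeeping, but the state transitions follow this shape.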
Resilience4j is the de facto circuit breaker library for Spring Boot applications, having replaced Hystrix (which is no longer maintained). The integration is annotation-based and highly configurable:
@Service
public class PaymentServiceClient {

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    public PaymentResponse processPayment(PaymentRequest request) {
        return paymentClient.charge(request);
    }

    public PaymentResponse paymentFallback(PaymentRequest request, Exception ex) {
        log.warn("Circuit breaker open for paymentService: {}", ex.getMessage());
        // Queue payment for async retry processing
        return PaymentResponse.queued(request.getOrderId());
    }
}
The fallback method must have the same signature as the primary method plus an additional Exception parameter. The fallback is invoked both when the circuit is open (CallNotPermittedException) and when the underlying call throws an exception while the circuit is closed or half-open. This means your fallback must handle both cases gracefully.
The YAML configuration controls the breaker's sensitivity and recovery behavior:
resilience4j:
  circuitbreaker:
    instances:
      paymentService:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10
        failureRateThreshold: 50
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 2s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 3
        registerHealthIndicator: true
        recordExceptions:
          - java.net.ConnectException
          - java.util.concurrent.TimeoutException
          - feign.RetryableException
        ignoreExceptions:
          - com.example.PaymentDeclinedException
          - com.example.InsufficientFundsException
Let's examine each parameter. slidingWindowType: COUNT_BASED means the failure rate is computed over the last N calls rather than over a time window; COUNT_BASED is more predictable for services with variable call rates. slidingWindowSize: 10 means the last 10 calls are considered. failureRateThreshold: 50 opens the circuit once at least 50% of the last 10 calls have failed. slowCallRateThreshold: 80 opens the circuit once at least 80% of calls are slower than slowCallDurationThreshold: 2s — this catches degraded services that are responding but too slowly to be useful. waitDurationInOpenState: 30s means the breaker waits 30 seconds before probing recovery. permittedNumberOfCallsInHalfOpenState: 3 allows 3 probe calls in Half-Open before deciding to reclose or reopen.
Critically, ignoreExceptions contains business logic exceptions — a payment declined due to insufficient funds is not a system failure and should not count toward the failure rate. Only infrastructure failures (connection refused, timeout, gateway errors) should trip the breaker.
With registerHealthIndicator: true, circuit breaker state is exposed through Spring Boot Actuator's /actuator/health endpoint, giving you real-time visibility into every circuit breaker across your service fleet, whether through monitoring dashboards or Kubernetes probes.
Pattern 2: Retry with Exponential Backoff and Jitter
Transient failures — momentary network blips, brief resource exhaustion, pod restarts during rolling deployments — resolve quickly on their own. Retry logic allows the caller to transparently recover from these without surfacing an error to the user. However, naive retry implementations cause their own class of problems.
Constant-interval retry (retry every 500ms, 3 times) creates thundering herd problems. If a service restarts and 1,000 callers all begin retrying at exactly the same 500ms interval simultaneously, the recovering service is hit with 1,000 requests at second 0.5, then 1,000 more at second 1.0. This can re-crash a service that was just recovering. The solution is exponential backoff combined with jitter.
Exponential backoff increases the wait between retries geometrically: 500ms, then 1s, then 2s, then 4s (with a 2× multiplier). This gives the downstream service progressively more time to recover. Jitter adds a random offset (±20% is typical) to desynchronize retries across different caller instances. Instead of 1,000 instances all retrying at exactly 1s, they retry uniformly distributed between 0.8s and 1.2s — spreading the load and preventing synchronized stampedes.
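The backoff-plus-jitter schedule is easy to express directly. The following sketch (BackoffSchedule and delayMillis are hypothetical names for illustration, not a Resilience4j API) computes the delay for each attempt under the parameters described above:

```java
import java.util.Random;

public class BackoffSchedule {

    /**
     * Delay before retry attempt n (1-based): base * multiplier^(n-1),
     * capped at capMillis, then scaled by a random factor in [1-j, 1+j).
     */
    static long delayMillis(int attempt, long baseMillis, double multiplier,
                            long capMillis, double jitterFactor, Random rng) {
        double exponential = baseMillis * Math.pow(multiplier, attempt - 1);
        double capped = Math.min(exponential, capMillis);              // honor the max wait
        double jitter = 1 + (rng.nextDouble() * 2 - 1) * jitterFactor; // e.g. 0.8 to 1.2
        return Math.round(capped * jitter);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // With base 500ms, multiplier 2, cap 10s, and ±20% jitter, the schedule is
        // roughly 500ms, 1s, 2s, 4s, each spread across a ±20% band.
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.printf("attempt %d: wait %d ms%n",
                    attempt, delayMillis(attempt, 500, 2.0, 10_000, 0.2, rng));
        }
    }
}
```

Because each instance seeds its own random source, a fleet of callers naturally spreads its retries across the jitter band instead of stampeding in lockstep.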
@Service
public class InventoryServiceClient {

    @Retry(name = "inventoryService", fallbackMethod = "inventoryFallback")
    public InventoryStatus checkInventory(String productId) {
        return inventoryClient.getStatus(productId);
    }

    public InventoryStatus inventoryFallback(String productId, Exception ex) {
        log.error("Inventory service unavailable after retries for product {}: {}",
                productId, ex.getMessage());
        // Return optimistic status — oversell risk handled downstream
        return InventoryStatus.assumeAvailable(productId);
    }
}
resilience4j:
  retry:
    instances:
      inventoryService:
        maxAttempts: 4
        waitDuration: 500ms
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        exponentialMaxWaitDuration: 10s
        enableRandomizedWait: true
        randomizedWaitFactor: 0.2
        retryExceptions:
          - java.net.ConnectException
          - java.util.concurrent.TimeoutException
          - feign.RetryableException
        ignoreExceptions:
          - com.example.BadRequestException
          - com.example.ConflictException
          - com.example.ResourceNotFoundException
This configuration attempts up to 4 calls total (1 original + 3 retries) with delays of approximately 500ms, 1s, and 2s (each ±20% jitter), capped at 10s maximum. The ignoreExceptions list is critical: HTTP 400 Bad Request means the request itself is malformed — retrying it will never succeed and wastes resources. HTTP 409 Conflict means there is a business logic conflict (e.g., duplicate order) — retrying without changing the request will keep failing. Only retry on infrastructure failures that are genuinely transient.
A subtle but important rule: never retry non-idempotent operations without idempotency keys. Retrying a payment charge without an idempotency key can result in double charges. If you must retry payment processing, include an idempotency key in the request so the downstream service can detect and deduplicate retried requests.
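A minimal sketch of the receiving side, assuming the downstream service can key charges by a client-supplied idempotency key (the class and method names here are invented; a real implementation needs durable storage and key expiry, not an in-memory map):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Receiver-side idempotency sketch: one execution per key, retries replay the result. */
class IdempotentChargeHandler {
    // In production this map must be durable (database, Redis) and expired with a TTL;
    // an in-memory map is only enough to illustrate the contract.
    private final Map<String, String> resultByKey = new ConcurrentHashMap<>();

    /** Executes doCharge at most once per idempotency key; retries get the stored result. */
    String charge(String idempotencyKey, Supplier<String> doCharge) {
        return resultByKey.computeIfAbsent(idempotencyKey, key -> doCharge.get());
    }
}
```

With this contract in place, a retried charge carrying the same key returns the original transaction result instead of producing a second charge.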
Pattern 3: Bulkhead — Thread Pool and Semaphore Isolation
Even with circuit breakers and retry logic, a slow downstream service can still cause problems through resource exhaustion — specifically, consuming all available threads in a shared thread pool while waiting for responses. The bulkhead pattern isolates resources (thread pools or semaphore permits) per downstream dependency, ensuring that a slow or failing service can only consume its allocated share of resources, never the entire pool.
The name comes from ship design: a bulkhead is a watertight partition that prevents flooding in one compartment from sinking the entire ship. In microservices terms, it ensures that thread exhaustion caused by payment service calls cannot prevent inventory service calls from being processed.
Resilience4j offers two bulkhead implementations. Semaphore bulkhead limits the number of concurrent calls using a counting semaphore — simple and low overhead, but callers still block in the calling thread. Thread pool bulkhead executes calls in a dedicated thread pool — provides true isolation since even the calling thread is not blocked, but has higher resource overhead.
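The semaphore variant is simple enough to sketch with java.util.concurrent.Semaphore. This is a toy model, not Resilience4j's implementation; the MiniBulkhead name and its IllegalStateException (standing in for BulkheadFullException) are assumptions made for illustration:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Semaphore bulkhead sketch: at most maxConcurrentCalls in flight, bounded wait. */
class MiniBulkhead {
    private final Semaphore permits;
    private final long maxWaitMillis;

    MiniBulkhead(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls, true);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T execute(Supplier<T> call) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted waiting for bulkhead", e);
        }
        if (!acquired) {
            // analogous to Resilience4j's BulkheadFullException: fail fast, never block forever
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return call.get();
        } finally {
            permits.release();   // always free the slot, even when the call throws
        }
    }
}
```

The essential properties are the bounded wait (a saturated bulkhead fails in maxWaitMillis, not indefinitely) and the finally block (a slot is always returned, even on failure).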
resilience4j:
  bulkhead:
    instances:
      paymentService:
        maxConcurrentCalls: 20
        maxWaitDuration: 100ms
      inventoryService:
        maxConcurrentCalls: 50
        maxWaitDuration: 50ms
  thread-pool-bulkhead:
    instances:
      notificationService:
        maxThreadPoolSize: 10
        coreThreadPoolSize: 5
        queueCapacity: 20
        keepAliveDuration: 20ms
      reportingService:
        maxThreadPoolSize: 5
        coreThreadPoolSize: 2
        queueCapacity: 10
        keepAliveDuration: 20ms
In this configuration, payment service calls are limited to 20 concurrent calls maximum. A caller that cannot acquire a semaphore within 100ms receives a BulkheadFullException immediately rather than blocking indefinitely. Notification service runs in its own dedicated thread pool of 5–10 threads with a queue of 20. Even if the payment service bulkhead is completely saturated (20 threads all stuck waiting), the notification service thread pool continues to operate independently — those threads are not shared.
The distinction matters at 2 AM during an incident. With bulkhead isolation, a payment service degradation causes payment-related features to degrade, but order confirmation emails, push notifications, and reporting dashboards continue functioning. Without bulkhead isolation, a single slow dependency can take down the entire application by consuming every available thread.
Use semaphore bulkhead for synchronous request-response flows where caller thread blocking is acceptable and low overhead matters. Use thread pool bulkhead for async operations, non-critical background calls (notifications, analytics, reporting), or when you need the calling thread to remain free for other work while the downstream call executes.
Combining All Three: Decorator Order Matters
Circuit breaker, retry, and bulkhead are most powerful when composed together. However, the order in which they are applied as decorators is not arbitrary — it has significant behavioral implications.
The correct order from outermost to innermost is: Bulkhead → CircuitBreaker → Retry → TimeLimiter → actual call. Think about what each layer needs to see. The bulkhead is outermost because we want to limit total resource consumption before any other logic runs — including retry logic. If retry were outside bulkhead, a single failed call could trigger 4 retry attempts, each consuming a bulkhead slot. The circuit breaker sits inside the bulkhead but outside retry: when the circuit is open, the call is rejected with a CallNotPermittedException before the retry layer ever runs, so a dependency we already know is failing is not pointlessly retried. Retry sits inside the circuit breaker to handle transient failures that occur while the circuit is closed.
// Manual decoration using the Resilience4j Decorators API. Each withX call
// wraps everything chained before it, so decorators are chained innermost-first
// and the LAST one applied ends up outermost.
Supplier<PaymentResponse> supplier = () -> paymentClient.charge(request);

Supplier<PaymentResponse> decorated = Decorators.ofSupplier(supplier)
        .withRetry(retry)                    // innermost: retries transient failures
        .withCircuitBreaker(circuitBreaker)  // wraps retry: an open circuit skips it entirely
        .withBulkhead(bulkhead)              // outermost: caps total resource consumption
        .withFallback(
                List.of(BulkheadFullException.class,
                        CallNotPermittedException.class,
                        TimeoutException.class),
                ex -> PaymentResponse.degraded(request.getOrderId()))
        .decorate();
// Note: a TimeLimiter cannot wrap a plain Supplier; it requires an async pipeline,
// e.g. Decorators.ofCompletionStage(...).withTimeLimiter(timeLimiter, scheduledExecutorService).

try {
    return decorated.get();
} catch (Exception ex) {
    // Only non-fallback exceptions reach here
    throw new PaymentProcessingException("Payment failed after all resilience measures", ex);
}
The fallback at the end of the chain handles both bulkhead rejection (BulkheadFullException) and circuit breaker short-circuit (CallNotPermittedException) with a graceful degraded response. This ensures that even complete saturation of all resilience mechanisms produces a handled, predictable outcome rather than an unhandled exception propagating to the caller.
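The "last decorator applied is outermost" rule is easy to verify with a toy trace. The named wrapper below is purely illustrative and has nothing Resilience4j-specific in it; it just records entry and exit of each layer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class WrapOrderDemo {

    /** Wraps a supplier so entry/exit of this "decorator" is recorded in the trace. */
    static Supplier<String> named(String name, Supplier<String> inner, List<String> trace) {
        return () -> {
            trace.add("enter " + name);
            String result = inner.get();
            trace.add("exit " + name);
            return result;
        };
    }

    static List<String> traceNesting() {
        List<String> trace = new ArrayList<>();
        Supplier<String> call = () -> { trace.add("actual call"); return "ok"; };
        // Chain innermost-first: retry wraps the call, the circuit breaker wraps
        // retry, and the bulkhead (applied last) ends up outermost.
        Supplier<String> decorated =
                named("bulkhead",
                        named("circuitBreaker",
                                named("retry", call, trace), trace), trace);
        decorated.get();
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" -> ", traceNesting()));
    }
}
```

Running this prints the layers entering outermost-first and exiting innermost-first, which is exactly the nesting the resilience chain needs.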
When using annotations (@CircuitBreaker, @Retry, @Bulkhead), be aware that the nesting is not determined by the order in which the annotations appear on the method. Resilience4j's Spring aspects apply in a fixed default order (Retry outermost, then CircuitBreaker, RateLimiter, TimeLimiter, and Bulkhead innermost) regardless of annotation placement. To achieve the Bulkhead-outermost nesting described above, override the aspect order properties (for example, resilience4j.retry.retryAspectOrder) or use the functional Decorators API. Getting this wrong produces subtly incorrect behavior that is difficult to diagnose in production.
Observability and Monitoring
Circuit breakers, retries, and bulkheads are only useful if you can observe their behavior in production. Without metrics, a circuit breaker that has been open for 45 minutes due to a misconfigured threshold looks identical to a properly functioning system from the user's perspective — both return the fallback response. Metrics tell you which is which.
Resilience4j integrates with Micrometer out of the box when both libraries are on the classpath. Key metrics exposed include:
- resilience4j.circuitbreaker.calls.total, tagged with kind: successful, failed, not_permitted (rejected by an open circuit), ignored
- resilience4j.circuitbreaker.state: a gauge per state (closed, open, half_open, and so on), tagged with state; the gauge for the breaker's current state reads 1 and all others read 0
- resilience4j.circuitbreaker.failure.rate: current failure rate percentage
- resilience4j.retry.calls.total, tagged with kind: successful_without_retry, successful_with_retry, failed_with_retry, failed_without_retry
- resilience4j.bulkhead.available.concurrent.calls: remaining capacity
Log circuit breaker state transitions for post-incident debugging. A circuit breaker that opened at 11:56 PM and closed at 12:04 AM gives you the exact window of degradation to correlate with other signals:
@PostConstruct
public void registerCircuitBreakerListeners() {
    circuitBreakerRegistry.circuitBreaker("paymentService")
            .getEventPublisher()
            .onStateTransition(event ->
                    log.warn("Circuit breaker [{}] state change: {} -> {}",
                            event.getCircuitBreakerName(),
                            event.getStateTransition().getFromState(),
                            event.getStateTransition().getToState()))
            .onFailureRateExceeded(event ->
                    log.error("Circuit breaker [{}] failure rate exceeded threshold: {}%",
                            event.getCircuitBreakerName(),
                            event.getFailureRate()))
            .onSlowCallRateExceeded(event ->
                    log.warn("Circuit breaker [{}] slow call rate exceeded threshold: {}%",
                            event.getCircuitBreakerName(),
                            event.getSlowCallRate()));
}
A Prometheus AlertManager rule for circuit breaker state changes ensures your on-call engineer is paged immediately when a breaker opens, rather than discovering it during the next morning's review:
# Prometheus alerting rule
- alert: CircuitBreakerOpen
  expr: resilience4j_circuitbreaker_state{state="open"} == 1
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Circuit breaker {{ $labels.name }} is OPEN"
    description: "Circuit breaker {{ $labels.name }} has been open for more than 1 minute. Fallback responses are being served."
The for: 1m grace period prevents alert noise from brief circuit breaker trips during normal transient failures. A breaker that opens and closes within 30 seconds is operating correctly; one that stays open for over a minute indicates a genuine downstream problem requiring investigation.
Failure Scenarios and Debugging
Circuit breaker stays open even after service recovery: Check waitDurationInOpenState. If set to 60s and the downstream service recovered in 20s, the breaker will not probe recovery until 60s have elapsed from when it opened. If the breaker keeps reopening in Half-Open state, the permittedNumberOfCallsInHalfOpenState probe calls are failing — verify that the downstream service has truly recovered and is not in a slow-start state still warming up its connection pool.
Bulkhead rejection storm: If you see a sudden spike in BulkheadFullException, the maxConcurrentCalls is too low for your traffic pattern, or the downstream service has slowed dramatically, causing calls to take longer and occupy bulkhead slots for extended periods. Increase maxConcurrentCalls or reduce slowCallDurationThreshold to trip the circuit breaker faster, freeing up bulkhead capacity.
BulkheadFullException vs ThreadPoolBulkheadFullException: These are different exception types from semaphore and thread pool bulkheads respectively. Your fallback and exception handling code must handle both if you use both bulkhead types across your service. A common bug is handling only one type and seeing unhandled exceptions from the other.
Retry exhaustion masking root cause: When all retry attempts are exhausted, the final exception thrown is the last failure encountered — often a ConnectException or TimeoutException. Log the retry count and all intermediate exceptions in your fallback method to preserve the full retry history for debugging. Without this, all you see in logs is "connection refused after 4 attempts" with no information about what happened on attempts 1 through 3.
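One way to preserve that history, sketched here with a hand-rolled retry loop (this is not Resilience4j's mechanism; the class and method names are invented), is to attach each intermediate failure to the final exception as a suppressed exception:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

/** Retry loop that preserves every intermediate failure as a suppressed exception. */
class RetryWithHistory {

    static <T> T call(Callable<T> task, int maxAttempts) throws Exception {
        List<Exception> history = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                history.add(e);        // keep attempt-by-attempt evidence
            }
        }
        Exception last = history.get(history.size() - 1);
        // Attach attempts 1..n-1 so a single stack trace tells the whole story.
        for (int i = 0; i < history.size() - 1; i++) {
            last.addSuppressed(history.get(i));
        }
        throw last;
    }
}
```

When the final exception is logged, the suppressed exceptions appear in the stack trace, so the log shows what happened on every attempt rather than only the last one.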
Slow call threshold misconfiguration: Setting slowCallDurationThreshold too aggressively (e.g., 100ms on a service that legitimately takes 150ms under normal load) causes the circuit breaker to open under normal traffic. Always calibrate thresholds against your P99 latency baseline, not your P50.
When Not to Use These Patterns
Resilience patterns have real costs in complexity, configuration overhead, and debugging difficulty. Adding them indiscriminately to every service call is a category error. Apply them proportionally to failure impact.
Synchronous user-facing flows without a meaningful fallback often should fail fast naturally rather than returning a degraded response. If your checkout flow requires payment processing and you have no meaningful fallback (you cannot complete a checkout without charging payment), adding a circuit breaker that returns a "payment queued" response may mislead the user. In this case, a fast failure with a clear error message is better UX than a false success that will need to be unwound later.
Idempotent read replicas that can tolerate eventual consistency rarely need circuit breakers. A product detail page that reads from a read replica can simply return a cached response on failure without needing the full circuit breaker state machine overhead.
Non-idempotent operations like payment charges, email sends, or SMS messages should never be retried without idempotency keys. Adding a retry decorator to a payment endpoint without implementing idempotency at the receiving service is a double-charge bug waiting to happen in production.
Intra-service calls within the same JVM (method calls, in-process event bus) do not need circuit breakers. Circuit breakers are for distributed network calls where failure modes are fundamentally different — network partitions, timeouts, remote process crashes — from local method invocations.
Key Takeaways
- Always combine all three patterns: Circuit breaker stops calls to known-failing services; retry handles transient failures; bulkhead contains resource exhaustion. None is sufficient alone.
- Decorator order matters: Bulkhead → CircuitBreaker → Retry → TimeLimiter is the correct nesting. Wrong order produces subtle behavioral bugs that are difficult to diagnose under production load.
- Jitter prevents thundering herd: Enable enableRandomizedWait on every retry configuration. A ±20% random factor is sufficient to desynchronize retries across hundreds of service instances.
- Expose metrics via Micrometer: Without resilience4j.circuitbreaker.state and resilience4j.circuitbreaker.calls.total in your Prometheus scrape targets, circuit breakers are invisible in production. Add Alertmanager rules for state transitions.
- Test failure scenarios in staging: Use tools like Chaos Monkey, Toxiproxy, or Wiremock fault injection to verify that circuit breakers open correctly, fallbacks return sensible values, and bulkhead limits are appropriately sized for your traffic patterns.
- Do not add blindly: Keep configuration complexity proportional to failure impact. Calibrate thresholds against P99 latency baselines, not P50. Never retry non-idempotent operations without idempotency keys.