Software Engineer · Java · Spring Boot · Microservices
Kafka Consumer Group Rebalancing: Minimizing Downtime and Lag During Scaling Events
It was a routine Tuesday morning deployment. The on-call engineer scaled a payment processing Kafka consumer group from 4 to 8 pods to handle a surge ahead of a flash sale. What followed was 32 seconds of complete consumption pause — zero messages processed — while Kafka's group coordinator orchestrated a full stop-the-world rebalance. Consumer lag on the payment.transactions topic exploded from near-zero to 2.3 million records. Downstream services started timing out. The flash sale's first-minute revenue took a measurable hit. The root cause wasn't the scaling event itself — it was a fundamental misunderstanding of how Kafka consumer group rebalancing works, and of the rebalance protocol the team had never bothered to configure. This post tears apart every dimension of Kafka rebalancing so you never get caught off guard again.
Table of Contents
- What Triggers a Consumer Group Rebalance
- Eager vs Cooperative Sticky Rebalancing
- The Rebalance Storm: Cascading Failure
- Timeout Tuning: session, heartbeat, poll
- Static Group Membership with group.instance.id
- Incremental Rebalance in Kubernetes
- Partition Assignment Strategies Compared
- Monitoring and Observability
- Failure: max.poll.interval.ms Breach
- Key Takeaways
1. What Triggers a Consumer Group Rebalance
A rebalance is Kafka's mechanism for redistributing topic partition ownership across all active members of a consumer group. The group coordinator — a Kafka broker elected for a given __consumer_offsets partition — initiates a rebalance whenever the group's membership or the set of subscribed partitions changes. Understanding every trigger is the first step to preventing unintended rebalances.
New consumer joins the group: When a new consumer instance starts and sends a JoinGroup request to the coordinator, it signals that the partition assignment must be recalculated to include this new member. This is the expected scaling path — but it still triggers a rebalance, even with cooperative protocols (though a cooperative rebalance is incremental rather than stop-the-world).
Consumer leaves or dies: A clean shutdown sends a LeaveGroup request immediately. An ungraceful crash — OOM kill, network partition, or pod eviction — is detected only when the session timeout expires. Until then, the coordinator waits, holding back the rebalance. Partitions assigned to the dead consumer are not reassigned until that timeout fires, meaning those partitions accumulate lag silently while the group waits.
Heartbeat timeout / session timeout: Consumers send periodic heartbeats to the coordinator to prove they're alive. If session.timeout.ms elapses without a heartbeat (default: 45,000 ms), the coordinator marks that consumer as dead and triggers a rebalance. GC pauses, full stop-the-world collections, or CPU starvation on a shared Kubernetes node can all cause heartbeat misses without the consumer actually being dead.
max.poll.interval.ms breach: If a consumer hasn't called poll() within the configured interval (default: 300,000 ms — 5 minutes), the client-side library proactively sends a LeaveGroup, triggering a rebalance. This is particularly nasty because it happens silently from the broker's perspective and the consumer log may not clearly indicate the root cause.
Topic partition count change: Adding partitions to a topic (a common scaling operation) invalidates the existing assignment and triggers a full rebalance for all subscriber groups. This is often underappreciated — teams add partitions thinking it's a pure broker operation, unaware that every consumer group subscribed to that topic will pause briefly.
2. Eager (Stop-the-World) vs Cooperative Sticky Rebalancing
The original Kafka rebalance protocol — still the default in many frameworks — is called the eager protocol. Every consumer in the group revokes ALL its partition assignments before the new assignment is computed and distributed. Think of it as a full stop: every consumer stops processing, every partition becomes unowned, the leader computes new assignments, and only then do consumers resume with their new partitions. The duration of this pause is proportional to the group size and the network round-trips involved.
The Cooperative Sticky Assignor (introduced in Kafka 2.4, matured in 2.5+) fundamentally changes this. Instead of revoking everything first, the protocol operates in two phases. In the first phase, the coordinator identifies only which partitions need to move to new owners. Consumers that retain their assignments continue processing throughout this phase — the rebalance is invisible to them. Only in the second phase do the consumers that must give up specific partitions revoke those partitions, and the receiving consumers claim them. The majority of partitions are never interrupted.
// Spring Kafka consumer factory — switch to CooperativeStickyAssignor
@Bean
public ConsumerFactory<String, PaymentEvent> consumerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-processor-group");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);
    // Switch from the default RangeAssignor to CooperativeStickyAssignor
    props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
            CooperativeStickyAssignor.class.getName());
    // When migrating FROM eager to cooperative, list BOTH assignors for one
    // full rolling restart (the client negotiates the common protocol):
    // props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    //         List.of(CooperativeStickyAssignor.class.getName(),
    //                 RangeAssignor.class.getName()));
    // Remove RangeAssignor after one full rolling restart cycle.
    props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30_000);
    props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3_000);
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 120_000);
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
    return new DefaultKafkaConsumerFactory<>(props,
            new StringDeserializer(),
            new JsonDeserializer<>(PaymentEvent.class));
}
When NOT to use CooperativeStickyAssignor: For very small groups (2–3 consumers) processing stateless, low-throughput topics, the complexity of migration and two-phase rebalance overhead may not be worth it. Similarly, if consumers are truly stateless workers where partition affinity has zero value (e.g., simple log shippers), the plain RoundRobinAssignor is simpler to reason about. The cooperative protocol shines when consumers maintain per-partition state in memory (e.g., aggregation buffers, local caches keyed by partition) and reassigning partitions is expensive.
Migration path: You cannot flip from an eager to cooperative protocol in a single deployment without a brief extra rebalance. The safe migration uses a transitional configuration that lists both assignors, performs one full rolling restart, then removes the legacy assignor in the next deployment. The Kafka client handles mixed-mode negotiation automatically during the transition window.
3. The Rebalance Storm: Cascading Failure
A rebalance storm occurs when multiple consecutive rebalances fire in rapid succession, each triggered by the disruption caused by the previous one. The pattern is insidious: a rebalance pauses processing, causing backlog to build; when consumers resume and receive a flood of backed-up messages, processing slows; slower processing means some consumers breach max.poll.interval.ms; breaching the poll interval triggers another LeaveGroup; another LeaveGroup triggers another rebalance; and the cycle repeats.
The real incident: A fintech's fraud detection service used a consumer group of 12 instances, each consuming from a 24-partition topic. The team deployed a new feature that added a synchronous Redis call inside the message handler — a 15ms addition at P50. During a Black Friday traffic spike, the Redis cluster experienced elevated latency (P99 jumped to 600ms). With max.poll.records=500 and the new processing time, each poll batch now took 300ms × 500 records = 150 seconds to process. Since max.poll.interval.ms was left at the default 300 seconds (5 minutes), initially this was fine. But when Redis latency spiked to 1.2 seconds (P99) for 20 minutes, batch processing time exceeded 5 minutes, triggering consecutive LeaveGroup events across 8 of the 12 consumers. The group coordinator never had a stable membership for more than 30 seconds. Fraud checks halted entirely for 18 minutes.
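The failure condition in this incident reduces to one comparison: does a full poll batch fit inside max.poll.interval.ms? A minimal sketch, using the per-record times from the incident above:

```java
public class PollBudgetCheck {
    // Returns true if a full poll batch fits inside max.poll.interval.ms
    static boolean batchFits(int maxPollRecords, long perRecordMs, long maxPollIntervalMs) {
        return maxPollRecords * perRecordMs <= maxPollIntervalMs;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000; // default: 5 minutes
        // ~300 ms per record after the Redis call was added: 150 s batch, still fits
        System.out.println(batchFits(500, 300, maxPollIntervalMs));
        // Redis P99 at 1.2 s per record: 600 s batch, breach -> LeaveGroup
        System.out.println(batchFits(500, 1_200, maxPollIntervalMs));
    }
}
```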
// ConsumerRebalanceListener: commit offsets before partition revocation
public class PaymentConsumerRebalanceListener implements ConsumerRebalanceListener {

    private static final Logger log =
            LoggerFactory.getLogger(PaymentConsumerRebalanceListener.class);

    private final KafkaConsumer<String, PaymentEvent> consumer;
    // Per-partition in-memory state owned by this consumer instance
    private final Map<TopicPartition, AggregationBuffer> aggregationCache =
            new ConcurrentHashMap<>();

    public PaymentConsumerRebalanceListener(KafkaConsumer<String, PaymentEvent> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // CRITICAL: synchronously commit current offsets before giving up partitions.
        // Prevents reprocessing the same records after the rebalance completes.
        log.warn("Partitions being revoked: {}. Committing offsets synchronously.", partitions);
        consumer.commitSync(Duration.ofSeconds(5));
        // Release any in-memory partition-local state (e.g., aggregation buffers)
        partitions.forEach(aggregationCache::remove);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        log.info("Partitions assigned: {}. Initializing local state.", partitions);
        partitions.forEach(tp -> aggregationCache.put(tp, new AggregationBuffer()));
    }

    @Override
    public void onPartitionsLost(Collection<TopicPartition> partitions) {
        // Called under the cooperative protocol when ownership is lost without a
        // clean revocation (e.g., the consumer was fenced out after a session
        // timeout). Do NOT commit — offsets are stale.
        log.error("Partitions LOST (not revoked cleanly): {}. Clearing stale state.", partitions);
        partitions.forEach(aggregationCache::remove);
    }
}
The onPartitionsLost callback is a critical distinction added in Kafka 2.4 for the cooperative protocol. During an eager rebalance, partitions are always revoked cleanly. During a cooperative rebalance where a consumer is forcibly removed (e.g., session timeout), partitions may be "lost" rather than "revoked" — meaning the consumer didn't get a chance to commit. Your listener must handle both cases without committing stale offsets in the lost scenario.
4. Tuning session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms, max.poll.records
These four parameters form the core of rebalance behaviour, and they interact in non-obvious ways. Getting them right requires understanding what each actually controls at the protocol level.
session.timeout.ms (default: 45,000 ms) — How long the broker waits for a heartbeat before declaring the consumer dead. Set too low: spurious rebalances from GC pauses or transient CPU pressure. Set too high: partitions sit unowned for too long after a genuine crash. Recommended production value: 30,000 ms for most workloads. For GC-heavy applications (e.g., large heap with stop-the-world G1 pauses >5s), consider 60,000 ms combined with G1GC tuning to reduce pause duration.
heartbeat.interval.ms (default: 3,000 ms) — How frequently the consumer sends heartbeats. Must be significantly less than session.timeout.ms. The rule of thumb is heartbeat.interval.ms = session.timeout.ms / 3. At 30s session timeout, use 10,000 ms heartbeat interval. More frequent heartbeats add negligible overhead but allow faster detection of intent (e.g., the coordinator notifies the consumer of a pending rebalance via heartbeat response).
max.poll.interval.ms (default: 300,000 ms — 5 minutes) — The maximum time between two consecutive poll() calls. This is entirely a client-side enforcement: if your processing loop takes longer than this, the client sends LeaveGroup. This is the single most dangerous misconfiguration. Set it to at least (max.poll.records × P99 processing time per record) × 1.5 as a safety margin. If your P99 per record is 50 ms and you pull 200 records, the raw batch budget is 10 seconds and the 1.5× margin gives 15 seconds; rounding up to 30 seconds buys extra headroom.
max.poll.records (default: 500) — Number of records returned in a single poll() call. The most underappreciated lever for rebalance prevention. Reducing this from 500 to 50–100 dramatically reduces per-batch processing time, making it far easier to stay within max.poll.interval.ms even during downstream latency spikes.
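The sizing rule above can be captured as a tiny helper. This is a sketch of the arithmetic only; the 1.5× factor is the safety margin suggested in the text, not a Kafka-defined constant:

```java
public class PollIntervalSizing {
    // Recommended max.poll.interval.ms = records per poll x P99 per-record time x 1.5
    static long recommendedMaxPollIntervalMs(int maxPollRecords, long p99PerRecordMs) {
        return (long) (maxPollRecords * p99PerRecordMs * 1.5);
    }

    public static void main(String[] args) {
        // 200 records at 50 ms P99 -> 10 s raw budget, 15 s with the margin
        System.out.println(recommendedMaxPollIntervalMs(200, 50)); // 15000
    }
}
```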
# Spring Boot application.yml — production-tuned rebalance configuration
spring:
  kafka:
    consumer:
      group-id: payment-processor-group
      auto-offset-reset: earliest
      enable-auto-commit: false
      properties:
        # Rebalance-critical settings
        session.timeout.ms: 30000
        heartbeat.interval.ms: 10000
        max.poll.interval.ms: 60000   # Allow 60s for heavy processing batches
        max.poll.records: 100         # 100 records x P99 300ms = 30s max batch time
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
        # Static membership (see Section 5). HOSTNAME is stable across restarts
        # for StatefulSet pods; plain Deployment pods get a new name each restart.
        group.instance.id: ${HOSTNAME}
    listener:
      ack-mode: MANUAL_IMMEDIATE   # Manual offset commit — control in rebalance listener
      concurrency: 3               # 3 listener threads per consumer instance
      poll-timeout: 3000
5. Static Group Membership with group.instance.id
Kafka 2.3 introduced static group membership as a first-class solution to the rolling restart rebalance problem. When a consumer sets group.instance.id to a stable identifier (e.g., the pod hostname), the coordinator associates that instance ID with a partition assignment semi-permanently. If the consumer disconnects and reconnects within session.timeout.ms, the coordinator reassigns the same partitions to the returning consumer without triggering a group rebalance.
For Kubernetes rolling deployments, the ideal setup pairs static membership with a stable per-pod identifier. StatefulSet pods keep the same hostname across restarts, so the HOSTNAME environment variable works directly. Plain Deployment pods receive a new randomized name on every restart, so HOSTNAME does not survive a rolling update there; derive a stable ID instead (for example, from a StatefulSet ordinal or a fixed pool of instance IDs). When a terminated pod's replacement rejoins with the same group.instance.id within the session timeout, it reclaims the same partitions: no rebalance, no lag spike.
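A minimal sketch of the consumer properties involved in static membership. Plain string keys are used here so the snippet stands alone; in real code prefer the matching ConsumerConfig constants, and note that reading HOSTNAME assumes a StatefulSet-style stable pod name:

```java
import java.util.HashMap;
import java.util.Map;

public class StaticMembershipProps {
    // Builds the rebalance-relevant consumer properties for static membership.
    // The instanceId must be unique per consumer AND stable across restarts.
    static Map<String, Object> staticMembershipProps(String instanceId) {
        Map<String, Object> props = new HashMap<>();
        props.put("group.id", "payment-processor-group");
        props.put("group.instance.id", instanceId);   // enables static membership
        props.put("session.timeout.ms", 30_000);      // reclaim window after a restart
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }

    public static void main(String[] args) {
        // In Kubernetes, HOSTNAME is the pod name (stable only for StatefulSets)
        String podName = System.getenv().getOrDefault("HOSTNAME", "payment-consumer-0");
        System.out.println(staticMembershipProps(podName).get("group.instance.id"));
    }
}
```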
// Spring Kafka listener container factory: manual acks plus the rebalance listener
@Bean
public ConcurrentKafkaListenerContainerFactory<String, PaymentEvent> kafkaListenerContainerFactory(
        ConsumerFactory<String, PaymentEvent> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, PaymentEvent> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setConcurrency(3);
    ContainerProperties containerProps = factory.getContainerProperties();
    containerProps.setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
    containerProps.setPollTimeout(3000L);
    containerProps.setConsumerRebalanceListener(
            new PaymentConsumerRebalanceListener(/* consumer ref injected */));
    return factory;
}
// @KafkaListener with Acknowledgment for manual offset control
@KafkaListener(
        topics = "payment.transactions",
        groupId = "payment-processor-group",
        containerFactory = "kafkaListenerContainerFactory"
)
public void consume(
        @Payload PaymentEvent event,
        Acknowledgment ack,
        @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
        @Header(KafkaHeaders.OFFSET) long offset) {
    try {
        paymentService.process(event);
        ack.acknowledge(); // commit only after successful processing
    } catch (RetryableException e) {
        log.warn("Retryable error processing offset {} on partition {}", offset, partition);
        // Do NOT acknowledge — message will be redelivered
        throw e;
    } catch (PoisonPillException e) {
        log.error("Poison pill at offset {} partition {} — sending to DLQ", offset, partition);
        deadLetterService.send(event, e);
        ack.acknowledge(); // acknowledge to skip the poison pill
    }
}
The tradeoff with static membership is that if a consumer crashes hard and doesn't restart within session.timeout.ms, the partitions sit unowned for the full session timeout duration before being reassigned. For most deployments, the improvement from eliminating rolling restart rebalances far outweighs this marginally longer recovery window for genuine crashes.
6. Incremental Cooperative Rebalance in Kubernetes Pod Scaling
When a Kubernetes HPA scales a consumer Deployment from 4 to 8 replicas, four new pods join the consumer group at roughly the same time. With the eager protocol, this triggers a single massive rebalance: all 4 existing consumers revoke all their partitions, the leader recomputes assignments for 8 consumers, and all 8 resume. With cooperative-sticky, the rebalance is incremental: each existing consumer keeps roughly half of its partitions and revokes only the ones that must move, and the 4 new consumers claim just those revoked partitions (half of the total, in this case).
The practical difference: with 24 partitions and 4 existing consumers (6 partitions each), scaling to 8 with cooperative rebalance means each new consumer gets 3 partitions taken from the existing consumers. Each existing consumer gives up 1–2 partitions briefly, while retaining the other 4–5. At no point does the entire group stop. The lag delta during scale-out shrinks from "all partitions paused for 20–40 seconds" to "individual partitions paused for 2–5 seconds during handoff."
Kubernetes readiness probe interaction: A newly started consumer pod should not be considered "ready" until it has successfully joined the group and received its partition assignment. Configure your readiness probe to check a custom health endpoint that returns healthy only after the first successful poll(). Otherwise, Kubernetes considers the pod ready before it's actually consuming, and premature traffic routing (for HTTP services that dual-serve as Kafka consumers) may deliver requests to an unprepared pod.
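One way to implement that readiness gate is a simple flag flipped after the first successful poll. This is a sketch: KafkaReadiness is a hypothetical class, and you would call markFirstPollComplete() from your consumer loop (or a rebalance listener) and expose isReady() through the health endpoint your probe hits:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class KafkaReadiness {
    // Flipped to true once the consumer has joined the group and polled successfully
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Call from the consumer loop after the first successful poll()
    public void markFirstPollComplete() {
        ready.set(true);
    }

    // Back the readiness probe endpoint with this: return HTTP 200 only when true
    public boolean isReady() {
        return ready.get();
    }
}
```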
7. Partition Assignment Strategies Compared
Kafka ships with four built-in partition assignors, each trading off simplicity, stickiness, and balance differently:
- RangeAssignor (the default): assigns contiguous per-topic partition ranges; simple, but skew-prone across multiple topics (see below).
- RoundRobinAssignor: spreads all subscribed partitions evenly across consumers; no stickiness, so any rebalance can move everything.
- StickyAssignor: balanced and sticky (minimizes partition movement between rebalances), but still runs under the eager, stop-the-world protocol.
- CooperativeStickyAssignor: the same sticky assignment logic under the incremental cooperative protocol; the recommended choice for most groups on Kafka 2.4+.
RangeAssignor's hidden problem: When a consumer group subscribes to multiple topics with the same partition count, Range assigns the first N/C partitions of topic-A and the first N/C partitions of topic-B to the same consumer. Consumer 0 always gets the "low" partitions of every topic. This creates a systematic throughput skew: if partition 0 of topic-A is a hot partition (e.g., all single-key events route there), consumer 0 is perpetually overloaded while others sit idle. RoundRobin and Sticky spread this load evenly.
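The skew is easy to reproduce with the range arithmetic itself. The sketch below models the per-topic range computation (not the real RangeAssignor source): because the same formula is applied to every topic independently, consumer i always receives the same slice of each topic:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSkewDemo {
    // Range assignment for one topic: the first (N % C) consumers get one extra
    // partition, and each consumer receives a contiguous block from the low end.
    static List<Integer> rangeAssign(int numPartitions, int numConsumers, int consumerIndex) {
        int base = numPartitions / numConsumers;
        int extra = numPartitions % numConsumers;
        int start = consumerIndex * base + Math.min(consumerIndex, extra);
        int count = base + (consumerIndex < extra ? 1 : 0);
        List<Integer> assigned = new ArrayList<>();
        for (int p = start; p < start + count; p++) assigned.add(p);
        return assigned;
    }

    public static void main(String[] args) {
        // 6 partitions, 3 consumers: consumer 0 gets [0, 1] of EVERY subscribed
        // topic, so a hot partition 0 on each topic lands on the same consumer.
        System.out.println(rangeAssign(6, 3, 0)); // [0, 1]
        System.out.println(rangeAssign(6, 3, 2)); // [4, 5]
    }
}
```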
8. Monitoring and Observability
You cannot fix what you cannot see. Rebalance visibility requires tracking metrics at both the consumer client level and the broker level. The following set of metrics should be in every Kafka consumer Grafana dashboard:
- kafka.consumer.fetch-manager.records-lag-max — Maximum lag across all partitions for this consumer. Alert at >10,000 for near-real-time applications, >100,000 for batch-tolerant ones.
- kafka.consumer.coordinator.rebalance-rate-avg — Rebalance events per second. Any value above 0.01 in steady state (more than 1 rebalance per 100 seconds) is a red flag.
- kafka.consumer.coordinator.commit-latency-avg — Time to commit offsets. Spikes above 500ms indicate broker pressure or network issues that may precede session timeout failures.
- kafka.consumer.coordinator.join-time-avg — Time for the consumer to complete the join phase of the rebalance. High values indicate slow group coordinator response or large consumer group metadata.
- kafka.consumer.coordinator.sync-time-avg — Time for the sync phase (distributing assignments). High values suggest a slow group leader or complex assignment computation.
- kafka.consumer.fetch-manager.fetch-throttle-time-avg — Throttling imposed by the broker. Non-zero values signal quota exhaustion, which can indirectly trigger poll interval breaches.
# Prometheus alerting rules for Kafka consumer rebalancing
groups:
  - name: kafka_consumer_rebalancing
    rules:
      - alert: KafkaConsumerRebalanceStorm
        expr: |
          rate(kafka_consumer_coordinator_rebalance_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer group {{ $labels.group }} is rebalancing frequently"
          description: "Rebalance rate {{ $value }} rebalances/sec over last 5 minutes"
      - alert: KafkaConsumerLagCritical
        expr: |
          kafka_consumer_fetch_manager_records_lag_max > 500000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical consumer lag on {{ $labels.topic }}"
          description: "Lag of {{ $value }} records — likely caused by rebalance or slow consumer"
      - alert: KafkaConsumerCommitLatencyHigh
        expr: |
          kafka_consumer_coordinator_commit_latency_avg > 1000
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High offset commit latency for group {{ $labels.group }}"
9. Failure Scenario: max.poll.interval.ms Breach During Heavy Processing
When a consumer's processing loop exceeds max.poll.interval.ms without calling poll() again, the Kafka client library internally marks the consumer as needing to leave the group and sends a LeaveGroup request on the next heartbeat thread cycle. The consumer's application thread is typically unaware this has happened. The consumer continues "processing" its current batch, commits offsets for records it has processed, and then calls poll() — only to find it's been removed from the group and must rejoin.
The most dangerous aspect is the window between the LeaveGroup and the offset commit. If the consumer commits offsets after being removed from the group, those commits succeed (offsets are stored at the broker, not per-session). But since the consumer has left and rejoined, it may have been assigned different partitions. The "committed but after rebalance" offsets can leave some partitions with duplicate processing (if another consumer claimed that partition before the commit) or skipped records (if the partition was assigned to a different consumer that already advanced).
1. Check max.poll.interval.ms against actual batch processing time (instrument with Timer metrics).
2. Reduce max.poll.records — this is the fastest fix.
3. Move heavy I/O (DB writes, external API calls) to async processing with a bounded queue.
4. If using a Spring Kafka batch listener, ensure the batch processing method returns within the poll interval.
5. Consider splitting the consumer: one thread polls, another processes, with backpressure between them.
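Step 5 can be sketched with a bounded hand-off queue between the poll thread and a worker thread. This is a simplified model with Kafka specifics omitted (records are plain strings): the non-blocking offer keeps the poll thread calling poll() on schedule, and a false return is the signal to pause() partitions rather than fetch more:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollProcessSplit {
    // Bounded queue between the poll thread and the worker: a full queue means
    // the worker is behind, and the poll thread should pause partitions.
    private final BlockingQueue<String> handoff = new ArrayBlockingQueue<>(100);

    // Poll-thread side: non-blocking, so poll() keeps being called on time.
    // Returns false when the worker is falling behind (backpressure signal).
    public boolean tryHandOff(String record) {
        return handoff.offer(record);
    }

    // Worker-thread side: drains and processes records independently of poll timing
    public void workerLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            String record = handoff.poll(1, TimeUnit.SECONDS);
            if (record != null) {
                process(record);
            }
        }
    }

    void process(String record) {
        // placeholder for the real handler (DB writes, external API calls, etc.)
    }
}
```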
"A rebalance is not a failure — it is Kafka's normal response to group membership change. The goal of rebalance tuning is not to prevent all rebalances, but to make rebalances invisible: so incremental that the system's throughput never visibly drops and consumer lag never spikes outside acceptable bounds."
— Adapted from KIP-429 (Incremental Cooperative Rebalancing)
10. Key Takeaways
- Switch to CooperativeStickyAssignor — this single change eliminates stop-the-world rebalances during scaling. Use the two-step migration (list both assignors, roll, then remove the eager assignor).
- Set group.instance.id to a stable pod identifier — prevents rolling restart rebalances entirely, as members reconnecting within session.timeout.ms reclaim their partitions without a group rebalance.
- Tune max.poll.records before max.poll.interval.ms — reducing records per poll is safer than extending the poll interval, which also delays detection of genuinely stuck consumers.
- Implement ConsumerRebalanceListener — commit offsets synchronously in onPartitionsRevoked to prevent reprocessing; clear local state in onPartitionsLost without committing.
- Alert on rebalance-rate-avg — any rebalance rate above 0.01/s in steady state signals a configuration or processing problem that will eventually cascade.
- Design processing loops for poll interval compliance — bound your per-poll batch processing time to 50–60% of max.poll.interval.ms for headroom during downstream latency spikes.
- Cooperative rebalance doesn't eliminate all pauses — partitions being actively moved still pause briefly; stickiness ensures only the minimum number of partitions move per rebalance event.
Conclusion
Kafka consumer group rebalancing is not a bug — it is the mechanism Kafka uses to maintain partition ownership across a dynamic group of consumers. But the default eager protocol, combined with poorly tuned timeout parameters, turns routine operations like rolling deployments and scaling events into multi-second service disruptions and lag spikes that cascade downstream. The combination of CooperativeStickyAssignor, static group membership via group.instance.id, conservative max.poll.records, and a ConsumerRebalanceListener that commits offsets on revocation eliminates the vast majority of rebalance-induced incidents.
The payment processor from our opening story cut its rebalance-induced lag spikes from 2.3 million records to under 50,000 — a 98% reduction — by switching to cooperative-sticky and enabling static membership. The 32-second full-stop became a 2–3 second incremental handoff that monitoring barely registers. Invest the two hours to migrate; the on-call peace of mind is worth it.
Related Posts
Kafka Schema Registry
Master schema evolution with Confluent Schema Registry, Avro compatibility, and zero-downtime migrations.
Dead Letter Queue Patterns
Handle poison pills and failed messages gracefully with production-grade DLQ strategies.
Event-Driven Architecture
Design loosely coupled, event-driven systems with Kafka and messaging patterns at scale.
Backpressure in Reactive Microservices
Prevent cascading failures with backpressure strategies in reactive systems under load.
Last updated: March 2026 — Written by Md Sanwar Hossain