Md Sanwar Hossain · Software Engineer · Java · Spring Boot · Microservices

Microservices · March 19, 2026 · 20 min read · Distributed Systems Failure Handling Series

Kafka Consumer Group Rebalancing: Minimizing Downtime and Lag During Scaling Events

It was a routine Tuesday morning deployment. The on-call engineer scaled a payment processing Kafka consumer group from 4 to 8 pods to handle a surge ahead of a flash sale. What followed was 32 seconds of complete consumption pause — zero messages processed — while Kafka's group coordinator orchestrated a full stop-the-world rebalance. Consumer lag on the payment.transactions topic exploded from near-zero to 2.3 million records. Downstream services started timing out. The flash sale's first-minute revenue took a measurable hit. The root cause wasn't the scaling event itself — it was a fundamental misunderstanding of how Kafka consumer group rebalancing works, and which protocol the team had never bothered to configure. This post tears apart every dimension of Kafka rebalancing so you never get caught off guard again.

Table of Contents

  1. What Triggers a Consumer Group Rebalance
  2. Eager vs Cooperative Sticky Rebalancing
  3. The Rebalance Storm: Cascading Failure
  4. Timeout Tuning: session, heartbeat, poll
  5. Static Group Membership with group.instance.id
  6. Incremental Rebalance in Kubernetes
  7. Partition Assignment Strategies Compared
  8. Monitoring and Observability
  9. Failure: max.poll.interval.ms Breach
  10. Key Takeaways

1. What Triggers a Consumer Group Rebalance

A rebalance is Kafka's mechanism for redistributing topic partition ownership across all active members of a consumer group. The group coordinator — a Kafka broker elected for a given __consumer_offsets partition — initiates a rebalance whenever the group's membership or the set of subscribed partitions changes. Understanding every trigger is the first step to preventing unintended rebalances.

New consumer joins the group: When a new consumer instance starts and sends a JoinGroup request to the coordinator, it signals that the partition assignment must be recalculated to include this new member. This is the expected scaling path — but it still triggers a rebalance, even with cooperative protocols (though a cooperative rebalance is incremental rather than stop-the-world).

Consumer leaves or dies: A clean shutdown sends a LeaveGroup request immediately. An ungraceful crash — OOM kill, network partition, or pod eviction — is detected only when the session timeout expires. Until then, the coordinator waits, holding back the rebalance. Partitions assigned to the dead consumer are not reassigned until that timeout fires, meaning those partitions accumulate lag silently while the group waits.

Heartbeat timeout / session timeout: Consumers send periodic heartbeats to the coordinator to prove they're alive. If session.timeout.ms elapses without a heartbeat (default: 45,000 ms), the coordinator marks that consumer as dead and triggers a rebalance. GC pauses, full stop-the-world collections, or CPU starvation on a shared Kubernetes node can all cause heartbeat misses without the consumer actually being dead.

max.poll.interval.ms breach: If a consumer hasn't called poll() within the configured interval (default: 300,000 ms — 5 minutes), the client-side library proactively sends a LeaveGroup, triggering a rebalance. This is particularly nasty because it happens silently from the broker's perspective and the consumer log may not clearly indicate the root cause.

Topic partition count change: Adding partitions to a topic (a common scaling operation) invalidates the existing assignment and triggers a full rebalance for all subscriber groups. This is often underappreciated — teams add partitions thinking it's a pure broker operation, unaware that every consumer group subscribed to that topic will pause briefly.
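For reference, the partition-count change is a single broker-side command, which is why it is so easy to forget the consumer-side consequence. Broker address, topic name, and counts below are placeholders:

```shell
# Raising payment.transactions from 24 to 48 partitions. Every consumer
# group subscribed to this topic will rebalance when the change propagates.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic payment.transactions --partitions 48
```

Note that partition counts can only ever be increased, and keys will hash to different partitions afterwards, so key-ordering guarantees reset at the boundary.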

Production gotcha: Rolling restarts of consumer pods (e.g., during a Kubernetes deployment) trigger a rebalance for every pod that restarts, one after another. A 10-pod consumer group with a 30-second eager rebalance and rolling restarts fires 10 sequential rebalances — roughly 5 minutes of degraded throughput. This is exactly the scenario that static group membership and cooperative rebalancing were designed to fix.

2. Eager (Stop-the-World) vs Cooperative Sticky Rebalancing

The original Kafka rebalance protocol — still the default in many frameworks — is called the eager protocol. Every consumer in the group revokes ALL its partition assignments before the new assignment is computed and distributed. Think of it as a full stop: every consumer stops processing, every partition becomes unowned, the leader computes new assignments, and only then do consumers resume with their new partitions. The duration of this pause is proportional to the group size and the network round-trips involved.

The Cooperative Sticky Assignor (introduced in Kafka 2.4, matured in 2.5+) fundamentally changes this. Instead of revoking everything first, the protocol operates in two phases. In the first phase, the coordinator identifies only which partitions need to move to new owners. Consumers that retain their assignments continue processing throughout this phase — the rebalance is invisible to them. Only in the second phase do the consumers that must give up specific partitions revoke those partitions, and the receiving consumers claim them. The majority of partitions are never interrupted.

// Spring Kafka consumer factory — switch to CooperativeStickyAssignor
@Bean
public ConsumerFactory<String, PaymentEvent> consumerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-processor-group");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);

    // Switch from default RangeAssignor to CooperativeStickyAssignor
    props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());

    // When migrating FROM eager to cooperative, list BOTH assignors (pass a
    // List of class names, not a stringified List) and remove the legacy
    // assignor after one full rolling restart cycle:
    // props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    //     List.of(CooperativeStickyAssignor.class.getName(),
    //             RangeAssignor.class.getName()));

    props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30_000);
    props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3_000);
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 120_000);
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

    return new DefaultKafkaConsumerFactory<>(props,
        new StringDeserializer(),
        new JsonDeserializer<>(PaymentEvent.class));
}

When NOT to use CooperativeStickyAssignor: For very small groups (2–3 consumers) processing stateless, low-throughput topics, the complexity of migration and two-phase rebalance overhead may not be worth it. Similarly, if consumers are truly stateless workers where partition affinity has zero value (e.g., simple log shippers), the plain RoundRobinAssignor is simpler to reason about. The cooperative protocol shines when consumers maintain per-partition state in memory (e.g., aggregation buffers, local caches keyed by partition) and reassigning partitions is expensive.

Migration path: You cannot flip from an eager to cooperative protocol in a single deployment without a brief extra rebalance. The safe migration uses a transitional configuration that lists both assignors, performs one full rolling restart, then removes the legacy assignor in the next deployment. The Kafka client handles mixed-mode negotiation automatically during the transition window.
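Expressed as raw consumer properties, the two-step migration looks like this (file layout is illustrative):

```properties
# Step 1: deploy with both assignors listed, cooperative first, then perform
# one full rolling restart so every instance understands both protocols.
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor,org.apache.kafka.clients.consumer.RangeAssignor

# Step 2: in the NEXT deployment, once every instance runs the step-1 config,
# drop the legacy assignor.
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```

Skipping step 1 and flipping directly causes the group to fail assignment validation until all members agree on a common protocol.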

3. The Rebalance Storm: Cascading Failure

A rebalance storm occurs when multiple consecutive rebalances fire in rapid succession, each triggered by the disruption caused by the previous one. The pattern is insidious: a rebalance pauses processing, causing backlog to build; when consumers resume and receive a flood of backed-up messages, processing slows; slower processing means some consumers breach max.poll.interval.ms; breaching the poll interval triggers another LeaveGroup; another LeaveGroup triggers another rebalance; and the cycle repeats.

The real incident: A fintech's fraud detection service used a consumer group of 12 instances, each consuming from a 24-partition topic. The team deployed a new feature that added a synchronous Redis call inside the message handler — a 15ms addition at P50. During a Black Friday traffic spike, the Redis cluster experienced elevated latency (P99 jumped to 600ms). With max.poll.records=500 and the new processing time, each poll batch now took 300ms × 500 records = 150 seconds to process. Since max.poll.interval.ms was left at the default 300 seconds (5 minutes), initially this was fine. But when Redis latency spiked to 1.2 seconds (P99) for 20 minutes, batch processing time exceeded 5 minutes, triggering consecutive LeaveGroup events across 8 of the 12 consumers. The group coordinator never had a stable membership for more than 30 seconds. Fraud checks halted entirely for 18 minutes.

// ConsumerRebalanceListener: commit offsets before partition revocation
public class PaymentConsumerRebalanceListener implements ConsumerRebalanceListener {
    private static final Logger log =
        LoggerFactory.getLogger(PaymentConsumerRebalanceListener.class);

    private final KafkaConsumer<String, PaymentEvent> consumer;
    // Per-partition in-memory state owned by this consumer instance
    private final Map<TopicPartition, AggregationBuffer> aggregationCache =
        new ConcurrentHashMap<>();

    public PaymentConsumerRebalanceListener(
            KafkaConsumer<String, PaymentEvent> consumer) {
        this.consumer = consumer;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // CRITICAL: synchronously commit current offsets before giving up partitions
        // Prevents reprocessing the same records after the rebalance completes
        log.warn("Partitions being revoked: {}. Committing offsets synchronously.", partitions);
        consumer.commitSync(Duration.ofSeconds(5));
        // Release any in-memory partition-local state (e.g., aggregation buffers)
        partitions.forEach(tp -> aggregationCache.remove(tp));
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        log.info("Partitions assigned: {}. Initializing local state.", partitions);
        partitions.forEach(tp -> aggregationCache.put(tp, new AggregationBuffer()));
    }

    @Override
    public void onPartitionsLost(Collection<TopicPartition> partitions) {
        // Called during cooperative rebalance when partitions are lost
        // (e.g., consumer was kicked out). Do NOT commit — offsets are stale.
        log.error("Partitions LOST (not revoked cleanly): {}. Clearing stale state.", partitions);
        partitions.forEach(tp -> aggregationCache.remove(tp));
    }
}

The onPartitionsLost callback is a critical distinction added in Kafka 2.4 for the cooperative protocol. During an eager rebalance, partitions are always revoked cleanly. During a cooperative rebalance where a consumer is forcibly removed (e.g., session timeout), partitions may be "lost" rather than "revoked" — meaning the consumer didn't get a chance to commit. Your listener must handle both cases without committing stale offsets in the lost scenario.

4. Tuning session.timeout.ms, heartbeat.interval.ms, max.poll.interval.ms, max.poll.records

These four parameters form the core of rebalance behaviour, and they interact in non-obvious ways. Getting them right requires understanding what each actually controls at the protocol level.

session.timeout.ms (default: 45,000 ms) — How long the broker waits for a heartbeat before declaring the consumer dead. Set too low: spurious rebalances from GC pauses or transient CPU pressure. Set too high: partitions sit unowned for too long after a genuine crash. Recommended production value: 30,000 ms for most workloads. For GC-heavy applications (e.g., large heap with stop-the-world G1 pauses >5s), consider 60,000 ms combined with G1GC tuning to reduce pause duration.

heartbeat.interval.ms (default: 3,000 ms) — How frequently the consumer sends heartbeats. Must be significantly less than session.timeout.ms. The rule of thumb is heartbeat.interval.ms = session.timeout.ms / 3. At 30s session timeout, use 10,000 ms heartbeat interval. More frequent heartbeats add negligible overhead but allow faster detection of intent (e.g., the coordinator notifies the consumer of a pending rebalance via heartbeat response).

max.poll.interval.ms (default: 300,000 ms — 5 minutes) — The maximum time between two consecutive poll() calls. This is entirely a client-side enforcement: if your processing loop takes longer than this, the client sends LeaveGroup. This is the single most dangerous misconfiguration. Set it to (max.poll.records × P99 processing time per record) × 1.5 as a safety margin. If your P99 per record is 50 ms and you pull 200 records, a batch can take 10 seconds at P99, so the formula gives a 15-second floor; rounding up to 30 seconds buys extra headroom.

max.poll.records (default: 500) — Number of records returned in a single poll() call. The most underappreciated lever for rebalance prevention. Reducing this from 500 to 50–100 dramatically reduces per-batch processing time, making it far easier to stay within max.poll.interval.ms even during downstream latency spikes.
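The sizing rule above is simple enough to encode. A hypothetical helper (class and method names are ours, not a Kafka API) that applies the (records × P99) × 1.5 formula:

```java
// Derive a safe max.poll.interval.ms floor from batch size and per-record
// latency, per the (max.poll.records x P99) x 1.5 rule described above.
public class PollIntervalCalculator {

    public static long safeMaxPollIntervalMs(int maxPollRecords, long p99PerRecordMs) {
        long worstCaseBatchMs = (long) maxPollRecords * p99PerRecordMs;
        return (long) (worstCaseBatchMs * 1.5); // 50% safety margin
    }

    public static void main(String[] args) {
        // 200 records at 50 ms P99 -> 10 s worst-case batch -> 15 s floor
        System.out.println(safeMaxPollIntervalMs(200, 50));
        // The fraud-detection incident's numbers: 500 records at 300 ms P99
        // -> 150 s batch -> 225 s floor, uncomfortably close to the 300 s default
        System.out.println(safeMaxPollIntervalMs(500, 300));
    }
}
```

Running the incident's numbers through the formula shows the team was already within 25% of the default limit before Redis latency spiked at all.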

# Spring Boot application.yml — production-tuned rebalance configuration
spring:
  kafka:
    consumer:
      group-id: payment-processor-group
      auto-offset-reset: earliest
      enable-auto-commit: false
      properties:
        # Rebalance-critical settings
        session.timeout.ms: 30000
        heartbeat.interval.ms: 10000
        max.poll.interval.ms: 60000      # Allow 60s for heavy processing batches
        max.poll.records: 100            # 100 records × P99 300ms = 30s max batch time
        partition.assignment.strategy: org.apache.kafka.clients.consumer.CooperativeStickyAssignor
        # Static membership (see Section 5)
        group.instance.id: ${HOSTNAME}   # Pod name: unique per pod, stable across restarts only with StatefulSets
    listener:
      ack-mode: MANUAL_IMMEDIATE         # Manual offset commit — control in rebalance listener
      concurrency: 3                     # 3 listener threads per consumer instance
      poll-timeout: 3000

5. Static Group Membership with group.instance.id

Kafka 2.3 introduced static group membership as a first-class solution to the rolling restart rebalance problem. When a consumer sets group.instance.id to a stable identifier (e.g., the pod hostname), the coordinator associates that instance ID with a partition assignment semi-permanently. If the consumer disconnects and reconnects within session.timeout.ms, the coordinator reassigns the same partitions to the returning consumer without triggering a group rebalance.

For Kubernetes rolling deployments, the ideal setup pairs static membership with a stable per-pod identity. StatefulSets provide this out of the box: pod-0 is always recreated as pod-0, so HOSTNAME is stable across restarts. Plain Deployments generate a fresh, randomly suffixed pod name for every replacement pod, so HOSTNAME alone is not a stable identifier there; if you must use a Deployment, derive group.instance.id from something you control (for example, a per-replica ordinal you assign yourself). When a pod is terminated during a rolling update and its replacement reports the same group.instance.id within the session timeout, it reclaims the same partitions: no rebalance, no lag spike.
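A sketch of the Kubernetes side, assuming a StatefulSet named payment-consumer and a placeholder image. The fieldRef makes the stable identity explicit, although the kubelet already sets the pod hostname to the pod name for StatefulSet pods:

```yaml
# Hypothetical StatefulSet fragment: pod names are stable
# (payment-consumer-0, -1, ...), so HOSTNAME yields a group.instance.id
# that survives restarts and rolling updates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: payment-consumer
spec:
  serviceName: payment-consumer
  replicas: 4
  selector:
    matchLabels:
      app: payment-consumer
  template:
    metadata:
      labels:
        app: payment-consumer
    spec:
      containers:
        - name: consumer
          image: registry.example.com/payment-consumer:latest  # placeholder
          env:
            - name: HOSTNAME              # read by group.instance.id: ${HOSTNAME}
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
```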

// Java: listener container factory wiring manual acks and the rebalance listener
// (group.instance.id itself is a consumer property — see the YAML above)
@Bean
public ConcurrentKafkaListenerContainerFactory<String, PaymentEvent> kafkaListenerContainerFactory(
        ConsumerFactory<String, PaymentEvent> consumerFactory) {

    ConcurrentKafkaListenerContainerFactory<String, PaymentEvent> factory =
        new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setConcurrency(3);

    ContainerProperties containerProps = factory.getContainerProperties();
    containerProps.setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
    containerProps.setPollTimeout(3000L);
    containerProps.setConsumerRebalanceListener(
        new PaymentConsumerRebalanceListener(/* consumer ref injected */));

    return factory;
}

// @KafkaListener with acknowledgment for manual offset control
@KafkaListener(
    topics = "payment.transactions",
    groupId = "payment-processor-group",
    containerFactory = "kafkaListenerContainerFactory"
)
public void consume(
        @Payload PaymentEvent event,
        Acknowledgment ack,
        @Header(KafkaHeaders.RECEIVED_PARTITION) int partition,
        @Header(KafkaHeaders.OFFSET) long offset) {

    try {
        paymentService.process(event);
        ack.acknowledge(); // commit only after successful processing
    } catch (RetryableException e) {
        log.warn("Retryable error processing offset {} on partition {}", offset, partition);
        // Do NOT acknowledge: the offset stays uncommitted, so the container's
        // error handler seeks back and the record is redelivered
        throw e;
    } catch (PoisonPillException e) {
        log.error("Poison pill at offset {} partition {} — sending to DLQ", offset, partition);
        deadLetterService.send(event, e);
        ack.acknowledge(); // acknowledge to skip the poison pill
    }
}

The tradeoff with static membership is that if a consumer crashes hard and doesn't restart within session.timeout.ms, the partitions sit unowned for the full session timeout duration before being reassigned. For most deployments, the improvement from eliminating rolling restart rebalances far outweighs this marginally longer recovery window for genuine crashes.

6. Incremental Cooperative Rebalance in Kubernetes Pod Scaling

When a Kubernetes HPA scales a consumer Deployment from 4 to 8 replicas, four new pods join the consumer group simultaneously. With the eager protocol, this triggers a single massive rebalance: all 4 existing consumers revoke their partitions, the leader recomputes assignments for 8 consumers, and all 8 resume. With cooperative-sticky, the rebalance is incremental: each existing consumer keeps the partitions it retains under the new assignment and hands off only the ones that must move; the 4 new consumers claim exactly those handed-off partitions (half of the total, in this case).

The practical difference: with 24 partitions and 4 existing consumers (6 partitions each), scaling to 8 consumers means every consumer ends up with 3 partitions. Each existing consumer hands off 3 of its 6 partitions and keeps processing the other 3 uninterrupted. At no point does the entire group stop. The lag delta during scale-out shrinks from "all partitions paused for 20–40 seconds" to "individual partitions paused for 2–5 seconds during handoff."

Kubernetes readiness probe interaction: A newly started consumer pod should not be considered "ready" until it has successfully joined the group and received its partition assignment. Configure your readiness probe to check a custom health endpoint that returns healthy only after the first successful poll(). Otherwise, Kubernetes considers the pod ready before it's actually consuming, and premature traffic routing (for HTTP services that dual-serve as Kafka consumers) may deliver requests to an unprepared pod.
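A minimal, framework-agnostic sketch of that readiness gate (class and method names are ours; in Spring Boot you would expose isReady() through a custom HealthIndicator backing the /actuator/health/readiness group):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Gate flipped by the consumer thread after its first successful poll().
// The readiness probe endpoint reads isReady(), so Kubernetes only marks
// the pod Ready once it has joined the group and received an assignment.
public class ConsumerReadinessGate {
    private final AtomicBoolean joined = new AtomicBoolean(false);

    /** Call once from the poll loop after the first successful poll(). */
    public void markJoined() {
        joined.set(true);
    }

    /** Wire this to the readiness probe's health check handler. */
    public boolean isReady() {
        return joined.get();
    }
}
```

The gate is intentionally one-way: a later rebalance should not flip the pod back to NotReady, or Kubernetes would pull it from Service endpoints mid-handoff and make the disruption worse.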

7. Partition Assignment Strategies Compared

Kafka ships with four built-in partition assignors. Each makes different trade-offs between simplicity, stickiness, and balance:

Strategy                  | Protocol    | Balance                    | Stickiness              | Best For
RangeAssignor             | Eager       | Skewed for multiple topics | None                    | Single-topic groups, co-partitioned topics
RoundRobinAssignor        | Eager       | Optimal (even)             | None                    | Stateless workers, multi-topic even distribution
StickyAssignor            | Eager       | Good (near-optimal)        | High (retains existing) | Stateful consumers needing partition affinity
CooperativeStickyAssignor | Cooperative | Good (near-optimal)        | Highest                 | Production default: stateful, dynamic groups

RangeAssignor's hidden problem: When a consumer group subscribes to multiple topics with the same partition count, Range assigns partitions 0–N/C from topic-A and partitions 0–N/C from topic-B to the same consumer. Consumer 0 always gets the "low" partitions of every topic. This creates a systematic skew in throughput: if partition 0 of topic-A is a hot partition (e.g., all single-key events route there), consumer 0 is perpetually overloaded while others are idle. RoundRobin and Sticky spread this load evenly.
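To see the skew concretely, here is an illustrative re-implementation of the per-topic range math (our own demo class, not the actual Kafka assignor): with two 6-partition topics and 3 consumers, consumer 0 receives partitions 0–1 of both topics.

```java
import java.util.ArrayList;
import java.util.List;

// Reproduces RangeAssignor's per-topic arithmetic: consumer index i always
// receives the LOWEST-numbered partitions of every subscribed topic, so the
// same consumer attracts every topic's hot low partitions.
public class RangeSkewDemo {

    static List<Integer> rangeAssign(int partitions, int consumers, int index) {
        int perConsumer = partitions / consumers;
        int extra = partitions % consumers;           // first `extra` consumers get one more
        int start = index * perConsumer + Math.min(index, extra);
        int count = perConsumer + (index < extra ? 1 : 0);
        List<Integer> owned = new ArrayList<>();
        for (int p = start; p < start + count; p++) owned.add(p);
        return owned;
    }

    public static void main(String[] args) {
        // Consumer 0 gets [0, 1] for topic-A AND topic-B; consumer 2 gets [4, 5] of both.
        System.out.println("topic-A, consumer 0: " + rangeAssign(6, 3, 0));
        System.out.println("topic-B, consumer 0: " + rangeAssign(6, 3, 0));
    }
}
```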

8. Monitoring and Observability

You cannot fix what you cannot see. Rebalance visibility requires tracking metrics at both the consumer client level and the broker level. The following set of metrics should be in every Kafka consumer Grafana dashboard:

# Prometheus alerting rules for Kafka consumer rebalancing
groups:
  - name: kafka_consumer_rebalancing
    rules:
      - alert: KafkaConsumerRebalanceStorm
        expr: |
          rate(kafka_consumer_coordinator_rebalance_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer group {{ $labels.group }} is rebalancing frequently"
          description: "Rebalance rate {{ $value }} rebalances/sec over last 5 minutes"

      - alert: KafkaConsumerLagCritical
        expr: |
          kafka_consumer_fetch_manager_records_lag_max > 500000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical consumer lag on {{ $labels.topic }}"
          description: "Lag of {{ $value }} records — likely caused by rebalance or slow consumer"

      - alert: KafkaConsumerCommitLatencyHigh
        expr: |
          kafka_consumer_coordinator_commit_latency_avg > 1000
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "High offset commit latency for group {{ $labels.group }}"

9. Failure Scenario: max.poll.interval.ms Breach During Heavy Processing

When a consumer's processing loop exceeds max.poll.interval.ms without calling poll() again, the Kafka client library internally marks the consumer as needing to leave the group, and the background heartbeat thread sends a LeaveGroup request. The application thread is typically unaware this has happened: it keeps "processing" its current batch, then tries to commit offsets and call poll(), only to discover it has been removed from the group and must rejoin.

The most dangerous aspect is what happens to that late offset commit. Commits carry the consumer's group generation, and once the consumer has been removed that generation is stale, so the commit is rejected with a CommitFailedException rather than silently succeeding. The work the consumer just finished is therefore never committed, and whichever consumer inherits the partition resumes from the last committed offset and re-reads those records, producing duplicate processing. If your handler has non-idempotent side effects (payments, notifications), this is exactly where they double-fire.

Diagnosis checklist when you see frequent "Attempt to heartbeat failed since group is rebalancing" logs:
1. Check max.poll.interval.ms vs actual batch processing time (instrument with Timer metrics)
2. Reduce max.poll.records — this is the fastest fix
3. Move heavy I/O (DB writes, external API calls) to async processing with a bounded queue
4. If using Spring Kafka batch listener, ensure the batch processing method returns within the poll interval
5. Consider splitting the consumer: one thread polls, another processes, with backpressure between them
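Point 5, the decoupled poll/process pattern, hinges on a bounded handoff between the two threads. A minimal sketch under stated assumptions (class name is ours; the record type is simplified to String): when tryHandoff() returns false, the poll thread should consumer.pause() its partitions and keep calling poll() to stay in the group, then consumer.resume() once the backlog drains.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded handoff between the poll thread and a processing worker. The bound
// is the backpressure: a full queue signals the poll thread to pause
// partitions instead of blocking past max.poll.interval.ms.
public class BoundedHandoff {
    private final BlockingQueue<String> queue;

    public BoundedHandoff(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Non-blocking; false means the worker is saturated -> pause partitions. */
    public boolean tryHandoff(String record) {
        return queue.offer(record);
    }

    /** Worker thread drains here; null means nothing pending right now. */
    public String pollOne() {
        return queue.poll();
    }

    /** Poll thread checks this to decide when to consumer.resume(). */
    public int backlog() {
        return queue.size();
    }
}
```

Crucially, the paused consumer must still call poll() on every loop iteration; pausing stops record fetching for those partitions but keeps heartbeats and the poll-interval clock healthy.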

"A rebalance is not a failure — it is Kafka's normal response to group membership change. The goal of rebalance tuning is not to prevent all rebalances, but to make rebalances invisible: so incremental that the system's throughput never visibly drops and consumer lag never spikes outside acceptable bounds."
— Adapted from KIP-429 (Incremental Cooperative Rebalancing)

10. Key Takeaways

Kafka consumer group rebalancing is not a bug — it is the mechanism Kafka uses to maintain partition ownership across a dynamic group of consumers. But the default eager protocol, combined with poorly tuned timeout parameters, turns routine operations like rolling deployments and scaling events into multi-second service disruptions and lag spikes that cascade downstream. The combination of CooperativeStickyAssignor, static group membership via group.instance.id, conservative max.poll.records, and a ConsumerRebalanceListener that commits offsets on revocation eliminates the vast majority of rebalance-induced incidents.

The payment processor from our opening story cut its rebalance-induced lag spikes from 2.3 million records to under 50,000 — a 98% reduction — by switching to cooperative-sticky and enabling static membership. The 32-second full-stop became a 2–3 second incremental handoff that monitoring barely registers. Invest the two hours to migrate; the on-call peace of mind is worth it.



Last updated: March 2026 — Written by Md Sanwar Hossain