Leader Election in Distributed Systems - Raft Consensus, ZooKeeper and etcd
Md Sanwar Hossain - Software Engineer

Software Engineer · Java · Spring Boot · Microservices

System Design · April 1, 2026 · 20 min read · Distributed Systems Engineering Series

In any distributed system, the hardest guarantee to provide is also the most fundamental: ensuring that exactly one node acts as the authority for a given resource at any point in time. Leader election is the mechanism that makes this possible — and getting it wrong causes data corruption, split-brain clusters, and cascading failures that can last hours. This deep dive explores the theory behind leader election, the production-proven algorithms (Raft, ZooKeeper's ZAB, etcd's etcd-raft), and concrete Java/Spring Boot integration patterns to build rock-solid leader election into your microservices.

Table of Contents

  1. Why Leader Election Matters
  2. The Split-Brain Problem
  3. Raft Consensus Algorithm Deep Dive
  4. ZooKeeper Leader Election: Ephemeral Sequential Znodes
  5. etcd Leader Election: Spring Boot Integration
  6. Redis-Based Leader Election: SETNX with TTL
  7. Kubernetes Leader Election: controller-runtime Pattern
  8. Fencing Tokens: Preventing Stale Leader Writes
  9. Production Pitfalls
  10. Comparison Table: ZooKeeper vs etcd vs Redis vs Raft
  11. Key Takeaways
  12. Conclusion

1. Why Leader Election Matters

Leader Election Architecture — mdsanwarhossain.me

Distributed systems achieve high availability by replicating state across multiple nodes. But replication immediately creates a coordination problem: when multiple nodes can write, conflicts arise. Leader election solves this by designating a single node — the leader — that serialises all writes and coordinates replicas. Followers replicate the leader's log passively.

The pattern appears across the entire distributed systems stack. Kafka partition leaders handle all produce and consume requests for a partition; when a broker hosting a partition leader crashes, the controller must elect a new leader from in-sync replicas (ISR) within seconds or producers start seeing LeaderNotAvailableException. Kubernetes controller-manager runs a leader election loop: only the elected pod runs reconciliation loops, preventing concurrent controllers from fighting over resource state. Distributed schedulers like Quartz Clustered, JobRunr, and Spring Batch in cluster mode elect a leader to avoid duplicate job execution — running the same payment batch twice is catastrophic, not just wasteful.

Beyond preventing duplicates, a single leader simplifies consistency: all reads and writes go through one authoritative node, so you never need to reconcile divergent state. The cost is availability during leader re-election — the cluster is briefly unavailable for writes until a new leader is confirmed by a quorum. This is the classic CAP theorem trade-off: CP systems (ZooKeeper, etcd, Raft) sacrifice brief write availability for strong consistency.

2. The Split-Brain Problem

Split-brain occurs when a network partition divides the cluster into two or more groups, each of which elects its own leader. Both leaders accept writes, diverging the state. When the partition heals, you now have two conflicting histories with no deterministic way to merge them without losing data. In a financial ledger, split-brain means debits applied in one partition and credits in another — when they reconcile, account balances are wrong and audit trails are broken.

The canonical prevention mechanism is quorum: a leader is only valid if a majority of nodes (more than N/2) acknowledge it. In a 5-node cluster partitioned 3/2, only the majority partition (3 nodes) can elect a valid leader. The minority partition (2 nodes) cannot reach quorum, so it refuses writes and waits for the partition to heal. This is the mathematical guarantee that prevents two simultaneous leaders.
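
The majority rule is simple integer arithmetic, and it is worth seeing why 3/2 splits resolve the way they do. A minimal sketch (class and method names are illustrative):

```java
// Quorum arithmetic: a majority is floor(n/2) + 1 nodes
class QuorumMath {

    // Smallest node count that constitutes a majority of clusterSize
    static int majority(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // A partition can elect a leader only if it holds a majority
    static boolean canElect(int partitionSize, int clusterSize) {
        return partitionSize >= majority(clusterSize);
    }
}
```

For the 5-node cluster split 3/2, majority(5) is 3: canElect(3, 5) is true while canElect(2, 5) is false, so the minority side stalls rather than risking a second leader. Note that a 4-node cluster also needs 3 votes, which is why even-sized ensembles tolerate no more failures than the next smaller odd size.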

Epoch numbers (called terms in Raft, leader epochs in Kafka, and encoded in the high 32 bits of ZooKeeper's zxid) form a monotonically increasing counter stamped on every message. When a new leader is elected, it increments the epoch. Followers reject any message from a node claiming to be leader if its epoch is lower than the highest epoch they have seen. This prevents a previously isolated leader from rejoining the cluster and overwriting newer state: it will see higher epoch numbers and immediately step down.
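
The reject-if-stale rule reduces to a few lines of state; a hypothetical guard (not any particular system's API):

```java
// Epoch guard: followers reject messages stamped with a stale epoch/term
class EpochGuard {

    private long highestEpochSeen = 0;

    // Returns true if a message stamped with senderEpoch should be accepted
    synchronized boolean accept(long senderEpoch) {
        if (senderEpoch < highestEpochSeen) {
            return false; // a deposed leader rejoining; ignore it
        }
        highestEpochSeen = senderEpoch; // remember the newest epoch
        return true;
    }
}
```

A leader isolated at epoch 3 that resumes after epoch 5 has been established gets accept(3) == false on every follower, which is its signal to step down.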

⚠ Warning — GC pauses cause false leader timeouts: A JVM leader node experiencing a full GC pause of 8 seconds will miss heartbeat deadlines (typically 5–10 seconds) and be evicted by followers, even though it is perfectly healthy. When the GC pause ends, it may attempt to continue acting as leader — writing to storage — while a new leader has already been elected. This is a real split-brain scenario caused by the JVM's stop-the-world model. Use G1GC or ZGC with tuned pause targets, and set lease durations conservatively longer than your worst-case GC pause.

3. Raft Consensus Algorithm Deep Dive

K8s Leader Election — mdsanwarhossain.me

Raft was designed specifically to be understandable — a reaction to Paxos, which is notoriously difficult to implement correctly. Raft decomposes consensus into three sub-problems: leader election, log replication, and safety. For leader election, the state machine is clean:

         timeout / no heartbeat
Follower ─────────────────────► Candidate
    ▲                               │
    │ discovers higher term         │ receives quorum votes
    │                               ▼
    └──────────────────────────── Leader
           higher term seen           │
                                      │ sends heartbeats
                                      └──► all Followers

Every node starts as a Follower. If a follower doesn't hear a heartbeat from the leader within its randomised election timeout (typically 150–300ms), it transitions to Candidate. The randomisation is critical: it breaks symmetry so that two nodes rarely become candidates simultaneously, avoiding vote splits that would require re-elections.
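
The randomised timeout itself is a one-liner; a sketch using the 150–300ms band mentioned above (names are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

// Each follower draws a fresh random timeout for every election cycle,
// so two followers rarely fire at the same instant
class ElectionTimer {

    static final long MIN_TIMEOUT_MS = 150;
    static final long MAX_TIMEOUT_MS = 300;

    static long nextTimeoutMs() {
        return ThreadLocalRandom.current().nextLong(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS + 1);
    }
}
```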

A candidate increments its current term, votes for itself, and broadcasts a RequestVote RPC to all peers. Each peer grants its vote if: (1) it hasn't already voted in this term, and (2) the candidate's log is at least as up-to-date as its own (Raft's log completeness property). A candidate that receives votes from a majority of nodes becomes Leader and immediately begins sending heartbeat AppendEntries RPCs to suppress further elections.

Term numbers are the backbone of Raft's correctness. When any node receives a message with a higher term than its own, it immediately converts to follower, even if it believes itself to be leader. This single rule prevents stale leaders from causing inconsistency when they reconnect after isolation.

// Simplified Raft RequestVote logic (illustrative)
public class RaftNode {
    private int currentTerm = 0;
    private String votedFor = null;
    private NodeState state = NodeState.FOLLOWER;

    public synchronized RequestVoteResponse handleRequestVote(RequestVoteRequest req) {
        if (req.getTerm() < currentTerm) {
            // Reject: stale candidate
            return RequestVoteResponse.of(currentTerm, false);
        }
        if (req.getTerm() > currentTerm) {
            // Higher term: revert to follower, reset vote
            currentTerm = req.getTerm();
            state = NodeState.FOLLOWER;
            votedFor = null;
        }
        boolean voteGranted = (votedFor == null || votedFor.equals(req.getCandidateId()))
                && isLogUpToDate(req.getLastLogIndex(), req.getLastLogTerm());
        if (voteGranted) {
            votedFor = req.getCandidateId();
        }
        return RequestVoteResponse.of(currentTerm, voteGranted);
    }

    private boolean isLogUpToDate(int candidateLastIndex, int candidateLastTerm) {
        // Raft §5.4.1: a log is at least as up-to-date if its last term is
        // higher, or the terms match and its last index is at least as large
        int myLastTerm  = getLastLogTerm();
        int myLastIndex = getLastLogIndex();
        if (candidateLastTerm != myLastTerm) return candidateLastTerm > myLastTerm;
        return candidateLastIndex >= myLastIndex;
    }
}

4. ZooKeeper Leader Election: Ephemeral Sequential Znodes

ZooKeeper exposes a hierarchical namespace of znodes with strong consistency guarantees backed by its own consensus protocol (ZAB — ZooKeeper Atomic Broadcast). The leader election recipe is elegant: each candidate creates an ephemeral sequential znode under a common parent path (e.g., /election/candidate-). ZooKeeper appends a monotonically increasing number (e.g., /election/candidate-0000000001, /election/candidate-0000000002). The node with the smallest sequence number is the leader.

Crucially, each non-leader node watches the znode immediately preceding it (not the leader directly), forming a chain. If the leader crashes, ZooKeeper deletes its ephemeral znode and notifies only the second node — which then becomes leader without a thundering herd. This is the ZooKeeper recipe for leader election without the herd effect.

Using Apache Curator (the recommended high-level ZooKeeper client) in Java:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.springframework.beans.factory.DisposableBean;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import lombok.extern.slf4j.Slf4j;

@Slf4j // Lombok logger: provides the log field used below
@Service
public class ZooKeeperLeaderElection implements DisposableBean {

    private final LeaderLatch leaderLatch;
    private final CuratorFramework client;

    public ZooKeeperLeaderElection(
            @Value("${zookeeper.connect-string}") String connectString,
            @Value("${app.instance-id}") String instanceId) throws Exception {

        client = CuratorFrameworkFactory.builder()
                .connectString(connectString)
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .sessionTimeoutMs(15_000)
                .connectionTimeoutMs(5_000)
                .build();
        client.start();

        leaderLatch = new LeaderLatch(client, "/services/my-scheduler/leader", instanceId);

        leaderLatch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                log.info("Instance {} acquired leadership", instanceId);
                startSchedulerTasks();
            }

            @Override
            public void notLeader() {
                log.info("Instance {} lost leadership", instanceId);
                stopSchedulerTasks();
            }
        });

        leaderLatch.start(); // Joins the election
    }

    public boolean isCurrentLeader() {
        return leaderLatch.hasLeadership();
    }

    @Override
    public void destroy() throws Exception {
        // Releasing the latch deletes the ephemeral znode,
        // triggering immediate re-election among followers
        leaderLatch.close();
        client.close();
    }
}

ZooKeeper's session mechanism provides an important safety property: if the leader loses its ZooKeeper session (due to network partition, GC pause exceeding sessionTimeoutMs, or process crash), ZooKeeper automatically deletes all ephemeral znodes for that session. The election proceeds without any manual intervention. The sessionTimeoutMs is therefore your leader lease duration — tune it to balance between fast failover and GC-pause tolerance.

5. etcd Leader Election: Spring Boot Integration

etcd uses the etcd-raft library (a Go implementation of Raft) for its internal consensus. For client-facing leader election, etcd exposes a lease-based election API: a node creates a lease with a TTL, keeps it alive with periodic keepalives, and uses the lease as a guard in a campaign operation. If the leader fails, its claim persists at most until the lease TTL expires, so the TTL bounds the failover delay.

The Java client library is jetcd (official etcd Java client). Here is a production-grade Spring Boot service:

import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.Election;
import io.etcd.jetcd.Lease;
import io.etcd.jetcd.election.CampaignResponse;
import io.etcd.jetcd.lease.LeaseKeepAliveResponse;
import io.grpc.stub.StreamObserver;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.DisposableBean;
import org.springframework.beans.factory.InitializingBean;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import static java.nio.charset.StandardCharsets.UTF_8;

@Slf4j
@Service
public class EtcdLeaderElection implements InitializingBean, DisposableBean {

    // Not final: built in afterPropertiesSet(), after @Value injection
    private Client etcdClient;
    private Lease leaseClient;
    private Election electionClient;
    private long leaseId;
    private volatile CampaignResponse campaignResponse; // holds the leader key for resign()
    private volatile boolean isLeader = false;

    @Value("${etcd.endpoints:http://localhost:2379}")
    private String etcdEndpoints;
    @Value("${app.instance-id}")
    private String instanceId;

    private static final long LEASE_TTL_SECONDS = 15;
    private static final String ELECTION_NAME = "/my-service/leader-election";

    @Override
    public void afterPropertiesSet() throws Exception {
        etcdClient = Client.builder()
                .endpoints(etcdEndpoints.split(","))
                .build();
        leaseClient = etcdClient.getLeaseClient();
        electionClient = etcdClient.getElectionClient();

        // Grant a lease — this is our "session" with TTL
        leaseId = leaseClient.grant(LEASE_TTL_SECONDS).get().getID();

        // Keep the lease alive: jetcd's keepAlive stream sends renewals automatically
        leaseClient.keepAlive(leaseId, new StreamObserver<LeaseKeepAliveResponse>() {
            @Override public void onNext(LeaseKeepAliveResponse r) { /* renewal ok */ }
            @Override public void onError(Throwable t) { onLeaseExpired(); }
            @Override public void onCompleted() { onLeaseExpired(); }
        });

        // Campaign blocks until this node is elected leader
        CompletableFuture.runAsync(this::campaignForLeadership);
    }

    private void campaignForLeadership() {
        try {
            ByteSequence electionKey = ByteSequence.from(ELECTION_NAME, UTF_8);
            ByteSequence leaderValue = ByteSequence.from(instanceId, UTF_8);

            // campaign() blocks until this node becomes leader; the response
            // carries the leader key needed to resign() later
            campaignResponse = electionClient
                    .campaign(electionKey, leaseId, leaderValue).get();

            isLeader = true;
            log.info("Instance {} is now the leader (lease={})", instanceId, leaseId);
            onLeadershipAcquired();
        } catch (Exception e) {
            log.error("Campaign failed", e);
        }
    }

    private void onLeaseExpired() {
        if (isLeader) {
            isLeader = false;
            log.warn("Lease expired — stepping down from leadership");
            onLeadershipLost();
            // Re-campaign after brief backoff
            CompletableFuture.delayedExecutor(2, TimeUnit.SECONDS)
                    .execute(this::campaignForLeadership);
        }
    }

    public boolean isLeader() { return isLeader; }

    @Override
    public void destroy() throws Exception {
        if (isLeader && campaignResponse != null) {
            electionClient.resign(campaignResponse.getLeader()).get();
        }
        leaseClient.revoke(leaseId).get();
        etcdClient.close();
    }
}

The campaign() call is a blocking gRPC streaming call that returns only when this node has been elected. Internally, etcd uses a put-if-not-exists on the election key guarded by the lease ID. If the current leader's lease expires, the key is deleted and the next campaigner wins.

6. Redis-Based Leader Election: SETNX with TTL

Redis SET key value NX PX ttl (set if not exists, with a millisecond TTL) is the simplest leader election primitive available. The node whose SET succeeds is the leader; all others get a nil reply and must poll. The winner periodically refreshes its TTL, verifying ownership atomically, typically with a Lua script, since a plain GET followed by EXPIRE can race.

@Service
public class RedisLeaderElection {

    private final StringRedisTemplate redis;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    private static final String LEADER_KEY   = "my-service:leader";
    private static final long   LEASE_TTL_MS = 10_000; // 10 seconds
    private static final long   RENEW_INTERVAL_MS = 3_000;

    private final String instanceId = UUID.randomUUID().toString();
    private volatile boolean isLeader = false;

    @PostConstruct
    public void startElection() {
        // Attempt election immediately, then every RENEW_INTERVAL_MS
        scheduler.scheduleAtFixedRate(this::tryAcquireOrRenew,
                0, RENEW_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    private void tryAcquireOrRenew() {
        // SET my-service:leader <instanceId> NX PX 10000
        Boolean acquired = redis.opsForValue()
                .setIfAbsent(LEADER_KEY, instanceId,
                             Duration.ofMillis(LEASE_TTL_MS));
        if (Boolean.TRUE.equals(acquired)) {
            // Won the election
            if (!isLeader) {
                isLeader = true;
                log.info("Acquired leadership: {}", instanceId);
            }
            return;
        }

        // Key exists: renew our TTL atomically, but only if we still own it.
        // A plain GET + EXPIRE would race (the key could expire and be taken
        // by another node between the two calls), so we use a Lua script.
        String renewScript =
            "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "  return redis.call('pexpire', KEYS[1], ARGV[2]) " +
            "else return 0 end";
        Long renewed = redis.execute(
                new DefaultRedisScript<>(renewScript, Long.class),
                List.of(LEADER_KEY), instanceId, String.valueOf(LEASE_TTL_MS));

        if (renewed != null && renewed == 1L) {
            return; // still the leader; TTL refreshed
        }
        if (isLeader) {
            isLeader = false;
            log.warn("Lost leadership: {}", instanceId);
        }
    }

    public boolean isLeader() { return isLeader; }

    @PreDestroy
    public void release() {
        // Only delete the key if we own it (Lua for atomicity)
        String luaScript =
            "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "  return redis.call('del', KEYS[1]) " +
            "else return 0 end";
        redis.execute(new DefaultRedisScript<>(luaScript, Long.class),
                      List.of(LEADER_KEY), instanceId);
        scheduler.shutdownNow();
    }
}

Critical limitations: Redis leader election with a single Redis instance has a fundamental safety gap — Redis itself can fail, and if you're using Redis Sentinel or Cluster, there are windows during failover where two nodes can both believe they hold the lock (the Redlock problem). For workloads requiring true linearizable guarantees, Redis is not the right tool. Use it only for low-stakes coordination where brief split-brain windows are acceptable and you have compensating logic.

7. Kubernetes Leader Election: controller-runtime Pattern

Kubernetes uses etcd as its backing store and implements leader election via the Lease resource (or historically, ConfigMap/Endpoints annotations). The controller-runtime library (used by all Kubernetes operators and the kube-controller-manager itself) abstracts this into a LeaderElector that your Java Spring Boot operator can leverage via the Fabric8 Kubernetes client.

The mechanism: a Lease object (for system controllers, by convention in the kube-system namespace) stores the leader's identity and a renewTime. The elected leader renews the Lease roughly every leaseDuration/3. Competitors watch the Lease: if renewTime is more than leaseDuration old, they attempt to acquire it by updating the Lease with their own identity under an optimistic lock (the Kubernetes resourceVersion). Only one writer wins the update thanks to etcd's MVCC; the others get a conflict error and retry.
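
The staleness check that competitors apply can be sketched directly (illustrative names, not the Fabric8 or Kubernetes API):

```java
import java.time.Duration;
import java.time.Instant;

// Competitors treat the Lease as expired once renewTime + leaseDuration
// is in the past; only then do they attempt the optimistic-locked update
class LeaseStaleness {

    static boolean isExpired(Instant renewTime, Duration leaseDuration, Instant now) {
        return renewTime.plus(leaseDuration).isBefore(now);
    }
}
```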

// Spring Boot Kubernetes operator with Fabric8 leader election
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfig;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElectionConfigBuilder;
import io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector;
import io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;

import java.time.Duration;

@Slf4j
@Configuration // named to avoid clashing with Fabric8's own LeaderElectionConfig
public class LeaderElectionConfiguration {

    @Bean
    public LeaderElector leaderElector(KubernetesClient k8sClient,
                                       @Value("${app.instance-id}") String instanceId,
                                       ApplicationEventPublisher eventPublisher) {
        LeaderCallbacks callbacks = new LeaderCallbacks(
            () -> eventPublisher.publishEvent(new LeaderAcquiredEvent(instanceId)),
            () -> eventPublisher.publishEvent(new LeaderLostEvent(instanceId)),
            newLeader -> log.info("New leader: {}", newLeader)
        );

        LeaderElectionConfig config = new LeaderElectionConfigBuilder()
            // Lock on the Lease object "my-operator-leader" in "my-operator-ns",
            // carrying this instance's identity
            .withLock(new LeaseLock("my-operator-ns", "my-operator-leader", instanceId))
            .withLeaseDuration(Duration.ofSeconds(15))
            .withRenewDeadline(Duration.ofSeconds(10))
            .withRetryPeriod(Duration.ofSeconds(2))
            .withLeaderCallbacks(callbacks)
            .build();

        return k8sClient.leaderElector().withConfig(config).build();
    }

    @Bean
    CommandLineRunner startElection(LeaderElector leaderElector) {
        return args -> {
            // Runs the blocking election loop in a virtual thread
            Thread.ofVirtual().name("leader-election").start(leaderElector::run);
        };
    }
}

// In your reconciler — guard all write operations. Leadership state is
// tracked from the events published by the LeaderCallbacks, since leadership
// changes are delivered through those callbacks.
@Component
public class MyReconciler implements Reconciler<MyResource> {

    private volatile boolean leader = false;

    @EventListener
    public void onLeaderAcquired(LeaderAcquiredEvent event) { leader = true; }

    @EventListener
    public void onLeaderLost(LeaderLostEvent event) { leader = false; }

    @Override
    public UpdateControl<MyResource> reconcile(MyResource resource, Context ctx) {
        if (!leader) {
            // Standby pods skip reconciliation
            return UpdateControl.noUpdate();
        }
        // Only the leader performs writes
        return performReconciliation(resource);
    }
}

8. Fencing Tokens: Preventing Stale Leader Writes

Even with perfect leader election, a leader can fall into a dangerous state: it is elected, begins a write operation, then experiences a GC pause or network stall. During the pause, its lease expires and a new leader is elected. When the original leader resumes (unaware that it has been deposed), it completes its write to the storage layer — which now has a newer leader writing simultaneously. You have split-brain at the storage level.

The solution is fencing tokens — a monotonically increasing integer issued by the lock service with each new election. Every write to storage must include the current fencing token. The storage layer rejects writes with a token lower than the highest token it has ever seen. When the stale leader's write arrives with token 7, but the storage layer has already processed writes with token 8 from the new leader, the write is rejected.

// Fencing token pattern with ZooKeeper (using znode version as token)
@Service
public class FencedStorageWriter {

    private final ZooKeeperLeaderElection election;
    private final StorageClient storage;

    /**
     * Writes data to storage only if our fencing token (ZK znode version)
     * is still the highest the storage layer has seen.
     */
    public void writeWithFencing(String key, byte[] data) throws Exception {
        if (!election.isCurrentLeader()) {
            throw new NotLeaderException("This node is not the current leader");
        }

        // The fencing token is the sequential number of our election znode.
        // getLeaderEpoch() is illustrative: LeaderLatch does not expose this
        // directly, so in practice derive it from the znode name or version.
        long fencingToken = election.getLeaderEpoch();

        // Storage layer validates the token before accepting the write
        WriteResult result = storage.conditionalWrite(key, data, fencingToken);

        if (result == WriteResult.FENCING_TOKEN_REJECTED) {
            // Our token is stale — we are no longer the effective leader
            election.forceStepDown();
            throw new StaleLeaderException(
                "Write rejected: fencing token " + fencingToken + " is stale");
        }
    }
}

// Storage layer implementation
public class StorageClient {
    private final AtomicLong highestSeenToken = new AtomicLong(-1);

    public WriteResult conditionalWrite(String key, byte[] data, long token) {
        // CAS loop: accept only if token >= highest token seen so far.
        // Looping matters: a compareAndSet that fails must re-check,
        // otherwise a stale token could slip through a race.
        while (true) {
            long current = highestSeenToken.get();
            if (token < current) {
                return WriteResult.FENCING_TOKEN_REJECTED;
            }
            if (highestSeenToken.compareAndSet(current, token)) {
                return persistToStorage(key, data);
            }
            // CAS lost a race with another writer; retry with the fresh value
        }
    }
}

ℹ️ Key insight: Fencing tokens shift the safety guarantee from the lock service to the storage layer. The lock service only needs to provide monotonically increasing tokens (which ZooKeeper's sequential znodes, etcd's revision numbers, and Kafka's epoch numbers all provide naturally). The storage layer enforces the actual safety invariant. This is why Martin Kleppmann argues that fencing tokens are the only reliable way to prevent stale writes — no amount of careful leader election logic can prevent a GC pause from causing a race after the lease check.

9. Production Pitfalls

GC pause = unintended leader eviction: As described above, JVM stop-the-world GC pauses can exceed your heartbeat timeout, causing a live leader to be evicted. In Java 17+, ZGC and Shenandoah offer sub-millisecond pause times for most workloads. If you must use G1GC, tune -XX:MaxGCPauseMillis=200 and set your lease duration to at least 3–4× your observed worst-case GC pause. Monitor GC pause times via JFR or Micrometer gauges and alert before they approach lease timeout boundaries.

Network partition asymmetry: TCP can be asymmetric — node A can send to node B, but node B's packets to node A are dropped. In Raft, this means A receives heartbeats and stays a follower, while B times out and starts an election. A grants B its vote because B has a higher term. B becomes leader but cannot replicate logs back to A. You end up with a cluster that has a leader but cannot make progress because the leader cannot reach a quorum to commit entries. This is not split-brain (both cannot write) but it is a liveness failure. Use pre-vote (an etcd extension to Raft) to prevent disruptive elections from nodes that cannot reach a quorum.

Lease too short: A lease shorter than your P99 network round-trip time causes continuous re-elections under load spikes. If your heartbeat interval is 150ms and your P99 cross-AZ RTT spikes to 180ms during a network event, followers will time out before receiving heartbeats. In AWS multi-AZ deployments, set election timeouts to at least 10× your P99 cross-AZ RTT with headroom for load spikes.
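
A back-of-envelope helper for the 10× guideline (a sketch, not a library API):

```java
// Election timeout floor: at least 10x the observed P99 cross-AZ RTT
class TimeoutSizing {

    static final long RTT_MULTIPLIER = 10;

    static long minElectionTimeoutMs(long p99RttMs) {
        return RTT_MULTIPLIER * p99RttMs;
    }
}
```

With the 180ms P99 spike from the example, the floor is 1800ms, comfortably above any single delayed heartbeat.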

Lease too long: A lease of 60 seconds means 60 seconds of write unavailability after a leader crash. In a payment processing system with 800 TPS, that's 48,000 transactions that must be queued, retried, or rejected. Calibrate lease duration against your availability SLO: if your SLO is 99.9% uptime (8.7 hours/year downtime budget), a 60-second leader failover during a crash is a significant budget expenditure per incident.

Leader becoming a hotspot: Single-leader writes create a throughput ceiling. When your write throughput approaches the leader's capacity, consider partitioned leadership (Kafka's model: each partition has its own leader) rather than a single global leader. This distributes both the load and the blast radius of a leader failure.

"The fundamental problem with leader election is that you cannot tell the difference between a node that is dead and a node that is temporarily unreachable. Every timeout is a trade-off between false positives (evicting a live leader) and false negatives (failing to evict a dead one). Fencing tokens are the engineering acknowledgment that we can never perfectly distinguish the two."
— Martin Kleppmann, author of Designing Data-Intensive Applications

10. Comparison Table: ZooKeeper vs etcd vs Redis vs Raft

Choosing the right leader election substrate depends on your consistency requirements, operational complexity budget, and existing infrastructure. Here is a production-focused comparison:

| Dimension | ZooKeeper | etcd | Redis | Custom Raft |
|---|---|---|---|---|
| Consistency | Linearizable (ZAB) | Linearizable (Raft) | Eventual (single) / weak (cluster) | Linearizable (if implemented correctly) |
| Availability | CP — unavailable during leader loss | CP — unavailable during leader loss | AP — may serve stale data | CP — configurable |
| Election latency | 200–500ms (session timeout + ZAB) | 150–300ms (Raft election timeout) | Key TTL (1–30s typical) | 150–300ms (tunable) |
| Operational complexity | High (JVM-based, ensemble of 3–5) | Medium (Go binary, 3–5 node cluster) | Low (single instance) / medium (cluster) | Very high (must implement + operate) |
| Java client | Curator (excellent, mature) | jetcd (official, gRPC-based) | Lettuce / Jedis / Redisson | Custom (e.g., Apache Ratis) |
| Kubernetes native | No (external service required) | Yes (etcd backs Kubernetes) | No (separate Redis deployment) | No |
| Best for | Kafka, HBase, legacy Hadoop ecosystems | Kubernetes operators, cloud-native services | Low-stakes coordination, cache invalidation | Embedded databases, storage engines |

11. Key Takeaways

  - Exactly one leader at a time requires both quorum (majority acknowledgement) and monotonically increasing epochs; either mechanism alone leaves a split-brain window.
  - Randomised election timeouts (Raft) and chained znode watches (ZooKeeper) exist for the same reason: to avoid stampedes when a leader dies.
  - Lease duration is the central tuning knob: too short causes needless re-elections under latency spikes; too long extends write unavailability after a crash.
  - On the JVM, size leases to at least 3–4× your worst-case GC pause, and monitor pause times against the lease budget.
  - Fencing tokens enforced at the storage layer are the only reliable backstop against a deposed leader's in-flight writes.
  - Match the substrate to the stakes: ZooKeeper or etcd for linearizable election, Redis only for low-stakes coordination with idempotent writes.

12. Conclusion

Leader election is deceptively simple in concept and fiendishly subtle in production. The mathematical foundation — quorum-based consensus with monotonically increasing epoch numbers — is well understood, but the gap between theory and running JVM processes in a multi-AZ Kubernetes cluster filled with GC pauses and asymmetric network partitions is vast. The right approach is layered: a correct consensus protocol (Raft or ZAB) as the election substrate, lease-based heartbeats tuned to your observed infrastructure latencies, and fencing tokens at the storage layer as the ultimate safety backstop.

For most Spring Boot microservices running on Kubernetes, the Lease-based leader election via Fabric8 or controller-runtime is the right default — it leverages the same etcd cluster that Kubernetes itself depends on, has well-understood failure semantics, and integrates cleanly with Kubernetes RBAC. For services embedded in the Kafka or Hadoop ecosystem, Apache Curator with ZooKeeper remains the gold standard. Reserve Redis-based election for lightweight coordination scenarios where operational simplicity outweighs strict consistency guarantees, and always pair it with idempotent write semantics. Build fencing tokens into your architecture from day one — retrofitting them into an existing distributed system is one of the most painful migrations you will ever undertake.


Last updated: April 1, 2026