Md Sanwar Hossain

Software Engineer · Java · Spring Boot · Microservices

System Design · March 19, 2026 · 19 min read · Scalable System Design Series

Distributed Locking with Redis: SETNX, Redlock, and Production Edge Cases You Must Know

Distributed locks look deceptively simple: set a key in Redis, do your work, delete the key. In practice, clock skew, GC pauses, network partitions, and Redis failover each create race conditions that can corrupt data or cause duplicate execution. This article examines the full stack of Redis locking techniques — from the atomic SET NX EX primitive, through fencing tokens, to the multi-node Redlock algorithm — with production failure scenarios and safe implementation patterns for each.

Table of Contents

  1. Introduction — When "Simple" Redis Locks Cause Production Disasters
  2. SETNX: The Simplest Redis Lock and Its Dangers
  3. The Fencing Token: Solving Expiry Race Conditions
  4. Redlock: The Multi-Node Algorithm
  5. Production Failure Scenarios
  6. Better Alternatives for Strong Consistency Needs
  7. Production-Safe Implementation Patterns
  8. Trade-offs and When NOT to Use Redis Locks
  9. Key Takeaways
  10. Conclusion

1. Introduction — When "Simple" Redis Locks Cause Production Disasters

Distributed locks coordinate access to shared resources across multiple service instances — preventing double-processing of a payment, enforcing single-writer access to a configuration record, or ensuring a scheduled job runs on exactly one node. Redis, with its atomic operations and sub-millisecond latency, seems like the perfect substrate for implementing them.

Consider a production incident at a fintech company: their payment processing service used a Redis lock keyed on payment ID to enforce idempotency. The flow was: acquire lock, charge the card, record the transaction, release lock. The lock TTL was set to 30 seconds — enough margin for any realistic database transaction. Six months after launch, a PostgreSQL vacuum operation caused unexpected lock contention, stretching one transaction to 38 seconds. The Redis TTL expired at 30 seconds. A second service instance, polling for work, acquired the same lock. Both instances were now in the critical section simultaneously. The result: 847 customers were charged twice before the duplicate detection system caught the anomaly.

Distributed locking is hard because correctness requires assumptions that distributed systems routinely violate: clocks are not perfectly synchronized, processes pause arbitrarily (GC, OS scheduling, VM live migration), and networks partition at inconvenient moments. Understanding exactly where each Redis locking approach breaks down is the prerequisite for building systems that remain correct despite these failures.

2. SETNX: The Simplest Redis Lock and Its Dangers

The foundational Redis locking primitive is the atomic SET key value NX EX seconds command. NX means "only set if the key does not exist"; EX seconds sets the TTL in the same atomic operation. If the command returns OK, you hold the lock. If it returns nil, the lock is held by someone else.

A critical historical mistake was using two separate commands: SETNX key value followed by EXPIRE key seconds. This is broken by design. If the process crashes or is killed between the two commands, the key persists without an expiry, permanently blocking all future lock acquisitions on that key — a distributed deadlock. The atomic SET NX EX command was specifically introduced to eliminate this gap.

// BROKEN: two-command pattern — process crash between commands = permanent deadlock
jedis.setnx("lock:payment:" + paymentId, clientToken); // set key
jedis.expire("lock:payment:" + paymentId, 30);          // CRASH HERE = no TTL, lock never expires!

// CORRECT: single atomic SET NX EX command
String acquired = jedis.set(
    "lock:payment:" + paymentId,  // key
    clientToken,                   // unique value — identifies THIS lock holder
    SetParams.setParams().nx().ex(30) // NX = only if absent, EX = TTL in seconds
);
boolean lockAcquired = "OK".equals(acquired);

// Release with Lua script — atomically checks ownership before deleting
// Prevents releasing a lock acquired by a DIFFERENT holder after our TTL expired
String luaRelease = """
    if redis.call('GET', KEYS[1]) == ARGV[1] then
        return redis.call('DEL', KEYS[1])
    else
        return 0
    end
    """;
jedis.eval(luaRelease, List.of("lock:payment:" + paymentId), List.of(clientToken));

The value stored in the lock key must be unique per lock holder — typically a UUID generated at acquisition time. This is essential for the safe release script: before deleting the key, the Lua script atomically checks that the key's value matches your token. Without this check, you could release a lock acquired by another process after your TTL expired — a dangerous race condition that the Lua script atomically eliminates.

The expiry problem remains even with atomic acquisition. If your critical section takes longer than the TTL — due to a slow database query, a stop-the-world GC pause of 40+ seconds in a JVM under memory pressure, or a network call to a degraded downstream — your lock expires while you still consider yourself the holder. Another process acquires it. Now two processes are simultaneously inside the critical section. The atomic SET command cannot solve this; it is a fundamental limitation of time-based distributed locks.

3. The Fencing Token: Solving Expiry Race Conditions

Martin Kleppmann's fencing token approach addresses the expiry race at the storage layer rather than at the lock layer. Each time a client successfully acquires a lock, it receives a monotonically increasing integer token — a fencing token. The client includes this token in every write request to the protected resource. The storage system maintains the highest-seen token and rejects any write carrying a token lower than the current maximum.

When a GC-paused process wakes up after its lock has expired and a new holder has acquired lock token 43, the paused process still holds token 42. Its write attempt arrives at storage with token 42, which is rejected as stale. The new holder's writes with token 43 succeed. Mutual exclusion is enforced by the storage system, not by the timing of lock expiry.

// Fencing token generation: Redis INCR is atomic, monotonically increasing
public long acquireLockWithFencingToken(String resource, String clientId, int ttlSeconds) {
    String lockKey = "lock:" + resource;
    String tokenKey = "fence:" + resource;

    // Atomic acquisition with unique value
    String acquired = jedis.set(lockKey, clientId,
        SetParams.setParams().nx().ex(ttlSeconds));

    if (!"OK".equals(acquired)) {
        return -1; // lock not acquired
    }

    // Return monotonically increasing token
    return jedis.incr(tokenKey);
}

// Storage layer enforces fencing — example with optimistic locking in PostgreSQL
public void writeWithFence(long fenceToken, String data) {
    int updated = jdbcTemplate.update(
        "UPDATE protected_resource SET data = ?, last_fence_token = ? " +
        "WHERE last_fence_token < ?",
        data, fenceToken, fenceToken
    );
    if (updated == 0) {
        throw new StaleLockException(
            "Write rejected: fence token " + fenceToken + " is stale");
    }
}

The fencing token approach is the only mechanism that provides safety guarantees in the presence of arbitrary process pauses. Its limitation is that it requires the storage system to cooperate — to check and enforce the monotonic token on every write. For databases and custom storage engines this is straightforward, but for external APIs or third-party systems that do not support conditional writes, fencing is not applicable.

4. Redlock: The Multi-Node Algorithm

Single-node Redis locking has an obvious failure mode: if the Redis instance goes down while a lock is held, and the replica has not yet received the SET command (Redis replication is asynchronous), the promoted replica presents a clean slate — allowing a second client to acquire the same lock. Redlock, proposed by Redis creator Salvatore Sanfilippo (antirez), addresses this with a multi-node quorum algorithm.

Redlock uses N independent Redis nodes (no replication between them — N masters). To acquire a lock: record start time, attempt SET NX EX on all N nodes sequentially, check that you acquired a majority (N/2 + 1 or more), and verify that the total elapsed time is less than the lock TTL. If any condition fails, release all acquired locks. The lock validity time is TTL minus elapsed acquisition time.
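The quorum and validity arithmetic can be sketched independently of any Redis client. The helper below is a hypothetical `RedlockDecision` class (not part of any library); it assumes the per-node SET NX EX attempts have already run and applies the majority and validity checks described above:

```java
import java.util.List;

// Sketch of the Redlock decision step. nodeResults holds the outcome of the
// SET NX EX attempt on each of the N independent masters; elapsedMs is the
// wall-clock time the whole acquisition round took; clockDriftMs is the
// drift allowance the algorithm recommends subtracting.
final class RedlockDecision {

    // Returns the remaining validity time in ms, or -1 if the lock was
    // not acquired (no majority, or acquisition consumed the whole TTL).
    static long validityMs(List<Boolean> nodeResults, long ttlMs,
                           long elapsedMs, long clockDriftMs) {
        long acquiredCount = nodeResults.stream().filter(b -> b).count();
        long quorum = nodeResults.size() / 2 + 1;   // majority: N/2 + 1
        long validity = ttlMs - elapsedMs - clockDriftMs;
        return (acquiredCount >= quorum && validity > 0) ? validity : -1;
    }
}
```

With N = 5, three successful nodes form a quorum; a lock acquired in 50 ms with a 30 s TTL and 10 ms drift allowance is valid for 29 940 ms.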

// Redisson lock setup (production-ready Java client)
// Maven: org.redisson:redisson:3.27.0

@Configuration
public class RedissonConfig {
    @Bean
    public RedissonClient redissonClient() {
        Config config = new Config();
        // NOTE: useClusterServers() configures Redis Cluster, where each lock
        // key hashes to exactly one master; this is NOT the Redlock quorum.
        // True Redlock needs N independent, non-replicating masters, each
        // behind its own RedissonClient, combined via RedissonRedLock
        // (deprecated in recent Redisson releases, which recommend the
        // single-instance RLock instead).
        config.useClusterServers()
            .addNodeAddress(
                "redis://redis1:6379",
                "redis://redis2:6379",
                "redis://redis3:6379",
                "redis://redis4:6379",
                "redis://redis5:6379"
            );
        return Redisson.create(config);
    }
}

@Service
public class PaymentProcessor {
    private final RedissonClient redisson;

    public void processPayment(String paymentId, PaymentRequest request) {
        RLock lock = redisson.getLock("payment:lock:" + paymentId);
        boolean acquired = false;
        try {
            // tryLock(waitTime, leaseTime, unit)
            // waitTime: how long to wait for lock acquisition
            // leaseTime: fixed TTL of the lock. NOTE: with an explicit
            // leaseTime the Redisson watchdog does NOT renew the lock;
            // pass leaseTime = -1 to enable watchdog auto-renewal.
            acquired = lock.tryLock(0, 30, TimeUnit.SECONDS);
            if (!acquired) {
                throw new DuplicatePaymentException("Payment " + paymentId + " already in progress");
            }
            executePaymentTransaction(request);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new LockInterruptedException(e);
        } finally {
            if (acquired && lock.isHeldByCurrentThread()) {
                lock.unlock(); // always release in finally
            }
        }
    }
}

The Kleppmann vs Sanfilippo debate is worth understanding. Kleppmann argues that Redlock is unsafe because it relies on timing assumptions (the lock TTL must be significantly longer than clock drift plus GC pauses), and that even with a majority quorum, a process pause after acquiring but before using the lock can violate mutual exclusion. Sanfilippo counters that Redlock is designed for "efficiency" use cases — preventing duplicate work — not "correctness" use cases requiring strict mutual exclusion. Both are right in their domain: Redlock is not a replacement for fencing tokens where strict correctness is required.

5. Production Failure Scenarios

Failure 1: Clock jump during Redlock acquisition. An NTP resync on one Redis node advances its clock by several seconds while a client is mid-acquisition. The TTL that was set on that node effectively expires sooner than expected from the client's perspective. If the client acquired a majority but the clock-jumped node's key expires before the others, a second client can acquire a quorum on the remaining nodes plus the now-expired node, resulting in two lock holders simultaneously.

Failure 2: Redis replica failover race. In single-node Redis with replication: Client A acquires the lock on the master. The master crashes before replicating the SET to the replica. Redis Sentinel promotes the replica. The promoted master has no lock key. Client B acquires the lock. Client A is still in its critical section. Redlock with independent masters avoids this specific failure, but single-node setups remain vulnerable; note that the Redisson watchdog does not help here, since renewal cannot recreate state lost in an unreplicated failover. This failure mode is a key reason to back payment flows with fencing tokens or idempotency checks rather than the lock alone.

Failure 3: Long GC pause after acquisition. This is the most common real-world failure. A JVM running with CMS or a misconfigured G1GC triggers a 45-second stop-the-world pause after acquiring a 30-second lock. When the process resumes, it considers itself the lock holder. Another process has acquired the lock and may have already committed work. Detection: monitor GC pause durations with JFR or Micrometer GC metrics; alert when pauses approach lock TTL thresholds.

Failure 4: Network partition prevents lock release. The lock holder completes its work but cannot reach Redis to delete the lock key. All other clients must wait for the TTL to expire before proceeding. In a 30-second TTL system, this creates a 30-second availability gap. Design TTLs to be as short as safely possible. Redisson's watchdog mitigates this differently: it actively renews the lock every TTL/3 seconds, so a process that can still reach Redis keeps its lock alive, and a process that cannot reach Redis will not renew — the lock expires naturally.
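The renewal dynamics behind Failure 4 can be illustrated with a deterministic, Redis-free simulation. `WatchdogSim` below is a hypothetical class using a simulated clock; real Redisson does the equivalent with a timer task and an ownership-checked Lua script:

```java
// Deterministic illustration of watchdog-style renewal: a holder that keeps
// renewing keeps the lock alive indefinitely; a holder that is partitioned
// away stops renewing, and the lock expires naturally, ttl milliseconds
// after the last successful renewal.
final class WatchdogSim {
    private final long ttlMs;
    private long expiresAtMs;

    WatchdogSim(long nowMs, long ttlMs) {
        this.ttlMs = ttlMs;
        this.expiresAtMs = nowMs + ttlMs;
    }

    // Watchdog tick (every ttl/3 in Redisson): extend only if still held.
    void renew(long nowMs) {
        if (nowMs < expiresAtMs) {
            expiresAtMs = nowMs + ttlMs;
        }
    }

    boolean isHeld(long nowMs) {
        return nowMs < expiresAtMs;
    }
}
```

With a 30 s TTL and renewals at t = 10 s and t = 20 s, the lock survives well past 30 s; if the holder is cut off at t = 20 s, the lock frees itself at t = 50 s without any explicit release.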

6. Better Alternatives for Strong Consistency Needs

For use cases requiring strict mutual exclusion with correctness guarantees, two alternatives are fundamentally safer than Redis locks:

ZooKeeper / etcd distributed locks use linearizable consensus protocols (ZAB for ZooKeeper, Raft for etcd). Lock acquisition is modeled as an ephemeral node or a lease entry: if the client session ends, the coordination service releases the lock automatically, so a crashed holder cannot leave an orphaned lock behind. This does not eliminate the pause problem entirely, though: a long GC pause can still expire the session while the client believes it holds the lock, so strict safety still calls for a fencing token (ZooKeeper's zxid or etcd's revision can serve as one). ZooKeeper's sequential ephemeral nodes enable fair queuing among lock waiters. The trade-off: higher operational complexity and higher latency (5–20 ms vs sub-millisecond for Redis).

PostgreSQL advisory locks are an underutilised gem for services already using PostgreSQL. pg_try_advisory_lock(key) acquires a session-level lock that is released when the session ends (including on connection loss); pg_advisory_xact_lock(key) acquires a transaction-level lock that auto-releases when the transaction commits or rolls back. Because release is tied to the session or transaction, a crashed holder cannot leave a stale lock behind, and the locks integrate naturally with database transactions, a perfect fit for locks scoped to a database write operation. No additional infrastructure required.

// PostgreSQL advisory lock — transactional, auto-releases on commit/rollback
@Transactional
public void processOrderWithPgLock(Long orderId) {
    // Acquires lock for the duration of this transaction — auto-released on commit/rollback
    // pg_try_advisory_xact_lock returns a boolean, so map it as Boolean
    Boolean acquired = jdbcTemplate.queryForObject(
        "SELECT pg_try_advisory_xact_lock(?)", Boolean.class, orderId
    );
    if (!Boolean.TRUE.equals(acquired)) {
        throw new ResourceBusyException("Order " + orderId + " is being processed");
    }
    // Safe: lock held for full transaction duration, auto-released on commit/rollback
    orderRepository.updateStatus(orderId, OrderStatus.PROCESSING);
    paymentGateway.charge(orderId);
    orderRepository.updateStatus(orderId, OrderStatus.COMPLETED);
}

// etcd distributed lock with Java client (jetcd)
Client etcdClient = Client.builder().endpoints("http://etcd1:2379", "http://etcd2:2379").build();
Lock lockClient = etcdClient.getLockClient();
Lease leaseClient = etcdClient.getLeaseClient();

long leaseId = leaseClient.grant(30).get().getID(); // 30-second lease
LockResponse lockResponse = lockClient.lock(
    ByteSequence.from("order/" + orderId, StandardCharsets.UTF_8),
    leaseId
).get();
// Lock held — etcd auto-releases if client disconnects (lease expires)

7. Production-Safe Implementation Patterns

When Redis locking is appropriate for your use case, Redisson is the recommended Java client. It handles the Lua-based atomic release, implements a watchdog that automatically renews the lock every TTL/3 seconds while the holder is alive, and provides fair lock variants for ordered acquisition. Do not hand-roll Redis locking in production — the edge cases are too numerous.

// Production-safe Redisson lock pattern
@Service
public class InventoryService {
    private final RedissonClient redisson;
    private final MeterRegistry meterRegistry;

    public boolean reserveStock(String skuId, int quantity) {
        String lockKey = "inventory:lock:" + skuId;
        RLock lock = redisson.getLock(lockKey);

        Timer.Sample sample = Timer.start(meterRegistry);
        boolean acquired = false;
        try {
            // waitTime = 100 ms: bounded wait for the lock (use 0 for a
            // strictly non-blocking attempt); leaseTime = -1 enables the
            // Redisson watchdog (auto-renewal every TTL/3 while alive)
            acquired = lock.tryLock(100, -1, TimeUnit.MILLISECONDS);
            if (!acquired) {
                meterRegistry.counter("lock.contention", "resource", skuId).increment();
                return false;
            }

            // Critical section
            int currentStock = inventoryRepository.getStock(skuId);
            if (currentStock < quantity) return false;
            inventoryRepository.decrementStock(skuId, quantity);
            return true;

        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            sample.stop(meterRegistry.timer("lock.hold.duration", "resource", skuId));
            // Only unlock if we hold it — guards against accidental unlock after expiry
            if (acquired && lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
    }
}

// TTL sizing: 3x expected critical section + buffer for GC pauses
// Expected DB transaction: ~200ms -> TTL = 3 * 200ms + 2000ms buffer = 2.6s -> round to 5s
// Monitor: redis INFO stats for lock contention patterns
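The sizing rule in the comment above can be captured in a small helper (hypothetical; it rounds up to 5-second granularity, matching the worked example, and the 3x factor and buffer are heuristics, not universal constants):

```java
// TTL sizing rule of thumb: three times the expected critical-section time
// plus a pause buffer, rounded UP to the next multiple of 5 seconds for
// operational headroom.
final class LockTtl {
    static long ttlSeconds(long expectedCriticalSectionMs, long pauseBufferMs) {
        long rawMs = 3 * expectedCriticalSectionMs + pauseBufferMs;
        return ((rawMs + 4999) / 5000) * 5;   // round up to a 5s multiple
    }
}
```

For the 200 ms transaction above with a 2 s buffer, this yields 3 * 200 + 2000 = 2600 ms, rounded up to a 5-second TTL.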

Monitor lock contention continuously. Track the ratio of tryLock attempts to successful acquisitions. A contention rate above 5% signals that either your critical section is too long, your concurrency is too high for the resource, or you should reconsider whether distributed locking is the right coordination mechanism.
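The contention threshold described above can be expressed directly; `ContentionMonitor` is a hypothetical helper, and the 5% cutoff is this article's heuristic rather than a library constant:

```java
// Lock contention check: ratio of failed tryLock attempts to total attempts.
// A sustained failure rate above the threshold (5% per the rule of thumb
// above) signals an over-long critical section or excessive concurrency.
final class ContentionMonitor {
    static boolean contentionTooHigh(long attempts, long acquisitions) {
        if (attempts == 0) return false;   // no traffic, nothing to flag
        double failureRate = (double) (attempts - acquisitions) / attempts;
        return failureRate > 0.05;
    }
}
```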

8. Trade-offs and When NOT to Use Redis Locks

Redis locks are NOT suitable for: financial transactions requiring guaranteed mutual exclusion (payment double-charge prevention as the canonical example), strict inventory deduction where overselling is unacceptable, leader election in mission-critical distributed systems, or any scenario where incorrect behaviour under partial failure is unacceptable. In all these cases, use database transactions with pessimistic locking, PostgreSQL advisory locks, or ZooKeeper/etcd.

Redis locks ARE suitable for: preventing duplicate execution of idempotent background jobs (worst case: job runs twice, results are equivalent), distributed rate limiting (counter drift under failure is acceptable), cache warming (worst case: multiple processes populate the same cache entry simultaneously), and soft coordination where correctness under all failure scenarios is less critical than performance.

The deeper design principle: wherever possible, design systems to tolerate duplicate execution rather than prevent it. An idempotent operation protected by a Redis lock for efficiency — but safe to run twice — is vastly more robust than an operation that relies entirely on the lock for correctness. Combine Redis locks with idempotency keys, database unique constraints, and conditional writes to build systems that are safe regardless of whether the lock holds perfectly.
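That belt-and-braces principle can be illustrated with a minimal in-memory stand-in for a database unique constraint on an idempotency key. `IdempotencyGuard` and its key values are hypothetical; in production the map would be a table with a UNIQUE column, written via INSERT ... ON CONFLICT DO NOTHING:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory illustration of an idempotency guard: the first execution for a
// key wins; a duplicate execution (e.g. after a lock expired mid-flight)
// becomes a harmless no-op instead of a double charge.
final class IdempotencyGuard {
    private final ConcurrentMap<String, String> processed = new ConcurrentHashMap<>();

    // Returns true only for the first execution under this idempotency key.
    boolean markIfFirst(String idempotencyKey, String resultRef) {
        return processed.putIfAbsent(idempotencyKey, resultRef) == null;
    }
}
```

Even if two processes slip past the Redis lock simultaneously, only one of them observes true here, so the critical work runs once.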

9. Key Takeaways

  - Always acquire with a single atomic SET key value NX EX; never SETNX followed by EXPIRE.
  - Store a unique token per holder and release via a Lua script that checks ownership before DEL.
  - TTL expiry during a long critical section (GC pause, slow query) is the fundamental failure mode of time-based locks; fencing tokens move safety enforcement into the storage layer.
  - Redlock targets efficiency use cases, not strict correctness; for guaranteed mutual exclusion use fencing tokens, ZooKeeper/etcd, or PostgreSQL advisory locks.
  - Prefer Redisson over hand-rolled locks, monitor contention and TTL expiry rates, and make critical sections idempotent so a lock failure is an inconvenience, not a correctness bug.

10. Conclusion

Redis distributed locking sits at the intersection of simplicity and danger. The primitives are straightforward; the failure modes are subtle and production-only. The gap between a naive SETNX implementation and a production-safe Redisson-based lock represents years of hard-won operational experience in the Redis community.

The core lesson is about appropriate tool selection. Redis locks are excellent at what they are designed for: soft coordination, duplicate work prevention, and performance optimisation under concurrent access. They are not general-purpose distributed mutexes, and using them as such — particularly for financial or inventory operations — sets up systems for exactly the kind of rare, high-impact failures that erode user trust.

Build distributed locks with the awareness that they will fail under some failure scenarios. Make your critical sections idempotent. Add fencing tokens where storage systems support them. Monitor lock contention and TTL expiry rates. And when correctness under all failure conditions is truly required, invest in ZooKeeper, etcd, or PostgreSQL advisory locks. The extra operational complexity is a small price for the safety properties they provide.


Last updated: March 2026 — Written by Md Sanwar Hossain