Software Engineer · Java · Spring Boot · Microservices
Designing a Real-Time Leaderboard at Global Scale: Redis Sorted Sets, Sharding, and Consistency Trade-offs
In 2024, the World Cup event of the most popular mobile game attracted 10 million simultaneous players. For 90 minutes, every point scored needed to appear on a global ranking visible to all players. The engineering team's single Redis instance — which had handled previous events just fine — began experiencing 800ms write latency within the first five minutes. The leaderboard froze. Players complained. The problem was not Redis per se, but a fundamental misunderstanding of single-node limits and the lack of any sharding or read-path optimization. This post is the system design guide the team needed.
1. Requirements Analysis
Functional requirements: Submit a score update for a user; query the global top-N players; query a specific user's rank; query the N players surrounding a given user on the leaderboard; support time-windowed leaderboards (daily, weekly, all-time).
Non-functional requirements: 10 million concurrent active users; 1 million score updates per second during peak; sub-50ms p99 read latency for top-100 queries; sub-100ms p99 for user rank lookup; 99.99% availability; support global distribution across 5 regions.
Data estimates: 10M users × ~50 bytes per entry (user ID + score + metadata) = ~500MB for a single leaderboard. For 3 time-window leaderboards (daily, weekly, all-time) = ~1.5GB per shard. This is well within Redis memory limits per node, but write throughput of 1M ops/sec is the binding constraint.
2. Redis Sorted Sets Internals
A Redis Sorted Set (zset) is implemented as a dual data structure: a skiplist for ordered rank operations and a hashtable for O(1) score lookup by member. This hybrid gives excellent performance characteristics for leaderboard operations:
- ZADD leaderboard:global 9500 "user:12345" — O(log N) to insert or update a score
- ZREVRANK leaderboard:global "user:12345" — O(log N) to get 0-based rank (0 = highest score)
- ZREVRANGE leaderboard:global 0 99 WITHSCORES — O(log N + M) to get top-100 with scores
- ZSCORE leaderboard:global "user:12345" — O(1) score lookup via hashtable
- ZREVRANGEBYSCORE leaderboard:global +inf 9000 LIMIT 0 10 — O(log N + M) for score-range query
For the "show me the players just above and below me" use case, combine ZREVRANK with ZREVRANGE using the computed rank as an offset — both operations together still run in O(log N + M) time, fast enough for sub-5ms response at modest scale.
// Spring Boot with the Lettuce Redis client (via Spring Data Redis)
import java.util.Collections;
import java.util.Set;

import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.ZSetOperations;
import org.springframework.stereotype.Service;

@Service
public class LeaderboardService {

    private static final String GLOBAL_KEY = "leaderboard:global";

    private final RedisTemplate<String, String> redisTemplate;

    public LeaderboardService(RedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    // Add/update score - atomic O(log N) ZINCRBY
    public void updateScore(String userId, double delta) {
        redisTemplate.opsForZSet().incrementScore(GLOBAL_KEY, userId, delta);
    }

    // Get top N players with scores
    public Set<ZSetOperations.TypedTuple<String>> getTopN(int n) {
        return redisTemplate.opsForZSet()
                .reverseRangeWithScores(GLOBAL_KEY, 0, n - 1);
    }

    // Get user rank (0-based, 0 = #1); null if the user is not on the board
    public Long getUserRank(String userId) {
        return redisTemplate.opsForZSet().reverseRank(GLOBAL_KEY, userId);
    }

    // Get the players ranked just above and below a user
    public Set<ZSetOperations.TypedTuple<String>> getNeighbors(
            String userId, int windowSize) {
        Long rank = getUserRank(userId);
        if (rank == null) return Collections.emptySet();
        long start = Math.max(0, rank - windowSize);
        long end = rank + windowSize;
        return redisTemplate.opsForZSet()
                .reverseRangeWithScores(GLOBAL_KEY, start, end);
    }
}
3. Single Node Limits and Failure Points
A single Redis instance has practical limits that become the ceiling for any non-trivial leaderboard:
- Write throughput: Redis is single-threaded for command processing. A heavily loaded instance typically handles 100,000–200,000 ops/sec for ZADD. At 1M score updates/sec you need at minimum 5–10 shards.
- Memory ceiling: Practical usable memory per Redis instance is ~75% of physical RAM to leave headroom for replication buffers and RDB snapshots. A 64GB server gives ~48GB usable. At 10M users × ~100 bytes per entry (the ~50-byte payload plus skiplist and hashtable overhead) ≈ 1GB per leaderboard, memory is not the primary concern — throughput is.
- Persistence overhead: AOF with appendfsync everysec adds roughly 10% write latency. During RDB snapshots, the fork's copy-on-write can momentarily approach double memory usage under heavy writes. In production, separate read replicas from the write primary and pin RDB snapshots to replicas.
- Single point of failure: Without Redis Sentinel or Redis Cluster, a single primary failure loses all in-memory state (if AOF is disabled) or requires recovery time from AOF/RDB, during which the leaderboard is unavailable.
4. Sharding Strategies for Leaderboards
Hash-Based Sharding (Anti-Pattern for Rankings)
Hash the user ID modulo number of shards: shard = hash(userId) % N. Each shard holds a fraction of users. Write throughput scales linearly. However, computing global rank requires querying all shards for the top scores and merging them — an O(N_shards) scatter-gather on every rank read. For top-100 reads at 10,000 reads/sec, this creates massive read amplification.
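To make the read amplification concrete, here is a minimal sketch of hash-based shard selection and the scatter-gather merge a global top-N read must perform across every shard. All names are illustrative, not from the post's codebase; the per-shard lists stand in for the results of a ZREVRANGE on each shard.

```java
import java.util.*;
import java.util.stream.*;

public class ScatterGatherTopN {

    // Pick a write shard for a user: simple hash modulo N (assumed scheme).
    public static int shardFor(String userId, int numShards) {
        return Math.floorMod(userId.hashCode(), numShards);
    }

    // Merge the per-shard top-N lists (each already sorted descending)
    // into a global top-N. This runs on EVERY global read under hash sharding.
    public static List<Map.Entry<String, Double>> mergeTopN(
            List<List<Map.Entry<String, Double>>> perShardTopN, int n) {
        return perShardTopN.stream()
                .flatMap(List::stream)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Note that each global read fetches N candidates from every shard but keeps only N of them, which is exactly the read amplification the hierarchical aggregation below avoids.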
Hierarchical Aggregation (Recommended)
Partition users across N write shards by hash. Maintain a separate aggregation layer: a periodic (e.g., every 1–5 seconds) job merges the top-K scores from each write shard into a global leaderboard shard. This global shard serves all rank reads. The trade-off: the global leaderboard is eventually consistent with a lag equal to the aggregation interval — typically acceptable for games where millisecond rank precision is not required.
For user rank lookups (not just top-N), maintain a sorted structure per shard and compute approximate global rank as: global_rank ≈ Σ(count of users with higher scores in each shard). This approach scales writes horizontally while keeping reads fast.
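A sketch of the approximate-rank formula above, under the assumption that each shard can report how many of its users beat a given score. In Redis this count would be one ZCOUNT per shard with an exclusive lower bound; here the shards are plain maps for illustration.

```java
import java.util.*;

public class ApproximateRank {

    // Sum, across shards, the number of users with a strictly higher score.
    // In production each term is a single ZCOUNT call per shard rather than
    // a local map scan.
    public static long approximateGlobalRank(
            List<Map<String, Double>> shards, double userScore) {
        long rank = 0;
        for (Map<String, Double> shard : shards) {
            rank += shard.values().stream().filter(s -> s > userScore).count();
        }
        return rank; // 0-based: 0 means no one scored higher
    }
}
```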
Time-Window Leaderboards
Maintain separate sorted sets per time window: leaderboard:daily:2026-03-19, leaderboard:weekly:2026-W12, leaderboard:alltime. Use EXPIRE on daily and weekly keys to auto-evict stale data. Write each score update to all applicable keys atomically via a Lua script or pipeline to avoid partial updates.
-- Lua script for atomic multi-leaderboard update
local globalKey = KEYS[1]
local dailyKey = KEYS[2]
local weeklyKey = KEYS[3]
local userId = ARGV[1]
local delta = tonumber(ARGV[2])
redis.call('ZINCRBY', globalKey, delta, userId)
redis.call('ZINCRBY', dailyKey, delta, userId)
redis.call('ZINCRBY', weeklyKey, delta, userId)
return redis.call('ZREVRANK', globalKey, userId)
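The key names used above follow a date-based convention; a small hypothetical helper using java.time can generate them consistently. ISO week numbering is assumed for the weekly key.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.WeekFields;

public class LeaderboardKeys {

    // leaderboard:daily:YYYY-MM-DD
    public static String dailyKey(LocalDate date) {
        return "leaderboard:daily:" + date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // leaderboard:weekly:YYYY-Www using ISO week-based year and week number
    public static String weeklyKey(LocalDate date) {
        WeekFields wf = WeekFields.ISO;
        int week = date.get(wf.weekOfWeekBasedYear());
        int year = date.get(wf.weekBasedYear());
        return String.format("leaderboard:weekly:%d-W%02d", year, week);
    }
}
```

Using the week-based year (rather than the calendar year) avoids wrong keys in the days around New Year, when the ISO week can belong to the neighboring year.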
5. Geo-Distributed Leaderboards
For global events, maintaining a single leaderboard in one region forces all writes to cross the internet. A player in Singapore writing to a US-East Redis primary experiences 150–200ms of write latency — unacceptable for real-time gameplay.
Regional shards accept writes locally and sync to a global aggregation layer asynchronously. Each region maintains its own leaderboard (e.g., leaderboard:apac, leaderboard:emea, leaderboard:amer). A global aggregation service — running in each region — periodically merges regional top-scores into leaderboard:global. Players always write to their regional shard (sub-10ms), and global rankings reflect the aggregated state with a configurable staleness window (1–10 seconds).
Conflict resolution: Since scores are monotonically increasing (in most games, you can't lose points), the merge strategy is simple: take the maximum score per user across all regional shards. For games where scores decrease, use a last-writer-wins strategy with timestamp comparison, or maintain per-event sourcing.
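The max-merge for monotonically increasing scores fits in a few lines. This is an illustrative standalone sketch, not the aggregation service itself; the regional maps stand in for each region's sorted set contents.

```java
import java.util.*;

public class RegionalMerge {

    // Global score per user = maximum score seen across all regional shards.
    // Safe because scores only ever increase, so the max is always the newest.
    public static Map<String, Double> mergeMax(List<Map<String, Double>> regions) {
        Map<String, Double> global = new HashMap<>();
        for (Map<String, Double> region : regions) {
            region.forEach((user, score) -> global.merge(user, score, Math::max));
        }
        return global;
    }
}
```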
6. Caching the Read Path
The top-100 leaderboard is read orders of magnitude more than it is updated. Pre-compute and cache a serialized snapshot:
- Application-level cache: Cache the top-100 result in an in-process cache (Caffeine) with a 1–2 second TTL. Redis then sees roughly one read per app instance per TTL interval (for example, 50 instances with a 1-second TTL generate ~50 Redis reads/sec), regardless of how many clients are asking.
- CDN caching: For public leaderboards, serve top-N snapshots via CDN with short TTLs (5–30 seconds). A billion-user game can serve top-100 from CDN edge nodes without touching Redis at all.
- Warm-up strategy: Pre-populate the cache during off-peak hours. On cold start, the first request after cache expiry triggers a Redis read which repopulates the cache — not a thundering herd because only one in-flight request populates it (use request coalescing / cache locks).
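Request coalescing can be illustrated without any cache library: concurrent misses for the same key share a single in-flight load. This is a simplified sketch with hypothetical names; Caffeine's LoadingCache provides this behavior, plus TTL-based expiry, out of the box.

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

public class CoalescingCache<K, V> {

    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();

    // All callers that arrive while a load is in flight join the same future,
    // so only one request per key reaches Redis at a time.
    public V get(K key, Supplier<V> loader) {
        CompletableFuture<V> f = inFlight.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(loader));
        try {
            return f.join();
        } finally {
            // Entry lives only while the load is in flight; a real cache
            // would layer a TTL'd value store on top of this.
            inFlight.remove(key, f);
        }
    }
}
```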
7. Write Path Optimization
Pipelining: Instead of one round-trip per score update, batch multiple ZADD/ZINCRBY commands in a single pipeline. The Lettuce Redis client supports automatic pipelining. Batching 100 updates in one pipeline reduces network round-trips by 100x.
// Batch pipeline with Lettuce (via Spring Data Redis)
public void batchUpdateScores(Map<String, Double> scoreDeltas) {
    redisTemplate.executePipelined((RedisCallback<Object>) connection -> {
        scoreDeltas.forEach((userId, delta) ->
                connection.zIncrBy(
                        GLOBAL_KEY.getBytes(),
                        delta,
                        userId.getBytes()));
        return null;
    });
}
Kafka buffer: For extreme write volumes, publish score events to Kafka and have a Kafka consumer aggregate scores in memory per time window (e.g., 100ms tumbling window) before flushing to Redis. This converts 1M individual Redis writes/sec into 10K batched pipeline flushes/sec — a 100x write reduction. The trade-off is increased write latency (100ms aggregation window) and operational complexity.
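The consumer-side aggregation described above might look like the following sketch (names assumed): score events accumulate per user for one tumbling window, then drain as a single batch suitable for a pipelined flush like batchUpdateScores.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

public class WindowedScoreAggregator {

    private final ConcurrentHashMap<String, DoubleAdder> window =
            new ConcurrentHashMap<>();

    // Called once per consumed Kafka score event; deltas for the same user
    // within a window collapse into one accumulator.
    public void accept(String userId, double delta) {
        window.computeIfAbsent(userId, k -> new DoubleAdder()).add(delta);
    }

    // Called at each window boundary (e.g., every 100ms): returns one
    // combined delta per user and resets the window for the next interval.
    public Map<String, Double> drain() {
        Map<String, Double> batch = new HashMap<>();
        window.forEach((user, adder) -> batch.put(user, adder.sum()));
        window.clear();
        return batch;
    }
}
```

Many events for the same user in one window become a single ZINCRBY, which is where the 100x write reduction comes from.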
8. Consistency Trade-offs
The CAP theorem forces a choice between consistency and availability for leaderboards across network partitions. In practice, eventual consistency with bounded staleness is the right model for almost all leaderboard use cases:
- User tolerance: Players accept that the global leaderboard visible to them may be 1–5 seconds stale. Showing a rank that is 2 positions off is unnoticeable compared to the harm done by 500ms write latency.
- Rank precision vs latency: Strongly consistent rank queries require all shards to be in sync at query time — impossible at global scale without impractical coordination. Accept ±5 rank precision in exchange for sub-10ms local writes.
- Financial and competitive leaderboards (high stakes): For tournament finals where rank precision matters, use a separate strongly-consistent leaderboard for the top-1000 players only (stored on a single Redis primary with synchronous replica writes), and eventual consistency for the broader ranking.
| Storage | Write Throughput | Read Latency | Rank Query | Best For |
|---|---|---|---|---|
| Redis Sorted Set | 100K-200K/s (single) | <1ms | O(log N) | <5M users, single region |
| Redis Cluster (sharded) | 1M+/s | <5ms | O(shards × log N) | 10M+ users, high write load |
| PostgreSQL (window fn) | 10K-50K/s | 10-100ms | O(N log N) | Audit trail, complex queries |
| Apache Flink + Redis | 10M+/s | <10ms (cached) | Pre-aggregated | Streaming, time-windowed |
9. Failure Scenarios
Redis primary failure: Redis Sentinel or Redis Cluster performs automatic failover to a replica, typically within 10–30 seconds depending on configuration. During failover, writes fail; the application should queue writes in Kafka and replay them once the new primary is elected. Reads can be served from a replica (which may be slightly stale); in Redis Cluster, send the READONLY command on the replica connection to allow this.
Split-brain in Redis Cluster: Redis Cluster requires a majority of masters to agree before failing over; a primary isolated on the minority side of a partition stops accepting writes. The minority side can still serve stale reads. Mitigate by setting cluster-require-full-coverage to no for partial availability, or accept the brief write outage.
During outage — serve last known snapshot: Maintain a materialized snapshot of the top-100 leaderboard in a durable store (PostgreSQL or S3) that is refreshed every 10 seconds. During Redis downtime, serve this stale snapshot with a "Last updated: X seconds ago" indicator rather than returning an error.
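A minimal sketch of that fallback read path, with hypothetical names: try the live Redis read, fall back to the durable snapshot on any failure, and surface the snapshot's age to the client.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Supplier;

public class FallbackReader {

    public record Snapshot(List<String> top100, Instant updatedAt) {}

    // Try the live Redis read; on failure serve the durable snapshot instead.
    public static Snapshot readWithFallback(
            Supplier<Snapshot> live, Supplier<Snapshot> durable) {
        try {
            return live.get();
        } catch (RuntimeException redisDown) {
            return durable.get();
        }
    }

    // Staleness indicator shown alongside a fallback response.
    public static String stalenessLabel(Snapshot s, Instant now) {
        long age = Duration.between(s.updatedAt(), now).getSeconds();
        return "Last updated: " + age + " seconds ago";
    }
}
```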
10. Scaling Beyond Redis
At true internet scale — hundreds of millions of users with complex analytics needs — Redis alone becomes insufficient:
- Apache Flink: For streaming score aggregation with time-windowed leaderboards (tumbling, sliding, session windows), Flink provides native stateful processing, exactly-once semantics, and checkpointing. Flink emits pre-aggregated top-N results to Redis, reducing Redis write volume dramatically.
- Apache Pinot: For analytics queries — "show me the rank distribution by country", "histogram of scores in the top-1000" — Pinot provides sub-second OLAP queries over streaming event data via Kafka ingestion. Not for per-user rank lookups, but for business intelligence on leaderboard data.
- PostgreSQL with window functions: RANK() OVER (ORDER BY score DESC) works well for leaderboards up to ~5M users with proper indexing (B-tree on score DESC). Use it for audit trails, official tournament records, and complex queries that Redis cannot express.
"A leaderboard is one of the simplest real-time data structures to implement and one of the hardest to scale. Redis gives you the primitives; architecture gives you the scale."
11. Key Takeaways
- Redis Sorted Sets provide O(log N) write and rank operations ideal for leaderboards; understand memory and single-node throughput limits before scaling to Redis Cluster.
- Hash-based sharding distributes writes but creates scatter-gather read amplification; use hierarchical aggregation instead for efficient global ranking.
- Geo-distributed leaderboards require regional write shards with async sync to a global aggregation layer; design for bounded staleness, not strong consistency.
- The read path should be aggressively cached — top-N snapshots at application level (Caffeine) and CDN level reduce Redis read load by orders of magnitude.
- Use Kafka buffering to batch-reduce Redis write volume when write throughput exceeds single-node limits.
- Have a fallback strategy for Redis outages: a durable snapshot in PostgreSQL or S3 that can be served with a staleness indicator.
Architecture Diagram Idea
Three-tier diagram: (1) Write tier — Game Servers → Kafka topic → Score Consumer → 8x Redis Write Shards (each holding 1/8 of users). (2) Aggregation tier — Aggregation Service polls all write shards every 2s, computes top-10K globally, writes to Redis Global Leaderboard shard. (3) Read tier — API Gateway → App Cache (Caffeine, 1s TTL) → Redis Global Shard → Fallback: PostgreSQL snapshot. Show regional replication arrows for APAC, EMEA, AMER.
Discussion / Comments
Have experience building leaderboards at scale? Share your approach below.
Last updated: March 2026 — Written by Md Sanwar Hossain