Software Engineer · Java · Spring Boot · Microservices
Designing a Real-Time Leaderboard at Global Scale: Redis Sorted Sets, Sharding, and Consistency Trade-offs
In 2024, the World Cup event of the most popular mobile game attracted 10 million simultaneous players. For 90 minutes, every point scored needed to appear on a global ranking visible to all players. The engineering team's single Redis instance — which had handled previous events just fine — began experiencing 800ms write latency within the first five minutes. The leaderboard froze. Players complained. The problem was not Redis per se, but a fundamental misunderstanding of single-node limits and the lack of any sharding or read-path optimization. This post is the system design guide the team needed.
1. Requirements Analysis
Functional requirements: Submit a score update for a user; query the global top-N players; query a specific user's rank; query the N players surrounding a given user on the leaderboard; support time-windowed leaderboards (daily, weekly, all-time).
Non-functional requirements: 10 million concurrent active users; 1 million score updates per second during peak; sub-50ms p99 read latency for top-100 queries; sub-100ms p99 for user rank lookup; 99.99% availability; support global distribution across 5 regions.
Data estimates: 10M users × ~50 bytes per entry (user ID + score + metadata) = ~500MB for a single leaderboard. For 3 time-window leaderboards (daily, weekly, all-time) = ~1.5GB per shard. This is well within Redis memory limits per node, but write throughput of 1M ops/sec is the binding constraint.
2. Redis Sorted Sets Internals
A Redis Sorted Set (zset) is implemented as a dual data structure: a skiplist for ordered rank operations and a hashtable for O(1) score lookup by member. This hybrid gives excellent performance characteristics for leaderboard operations:
- ZADD leaderboard:global 9500 "user:12345" — O(log N) to insert or update a score
- ZREVRANK leaderboard:global "user:12345" — O(log N) to get 0-based rank (0 = highest score)
- ZREVRANGE leaderboard:global 0 99 WITHSCORES — O(log N + M) to get top-100 with scores
- ZSCORE leaderboard:global "user:12345" — O(1) score lookup via hashtable
- ZREVRANGEBYSCORE leaderboard:global +inf 9000 LIMIT 0 10 — O(log N + M) for score-range query
For the "show me the players just above and below me" use case, combine ZREVRANK with ZREVRANGE using the computed rank as an offset — both operations together still run in O(log N + M) time, fast enough for sub-5ms response at modest scale.
// Spring Boot with the Lettuce Redis client (via Spring Data Redis)
import java.util.Collections;
import java.util.Set;

import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.ZSetOperations;
import org.springframework.stereotype.Service;

@Service
public class LeaderboardService {

    private static final String GLOBAL_KEY = "leaderboard:global";

    private final RedisTemplate<String, String> redisTemplate;

    public LeaderboardService(RedisTemplate<String, String> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    // Add/update score - atomic O(log N) ZINCRBY
    public void updateScore(String userId, double delta) {
        redisTemplate.opsForZSet().incrementScore(GLOBAL_KEY, userId, delta);
    }

    // Get top N players with scores
    public Set<ZSetOperations.TypedTuple<String>> getTopN(int n) {
        return redisTemplate.opsForZSet()
                .reverseRangeWithScores(GLOBAL_KEY, 0, n - 1);
    }

    // Get user rank (0-based, 0 = #1); null if the user is not on the board
    public Long getUserRank(String userId) {
        return redisTemplate.opsForZSet().reverseRank(GLOBAL_KEY, userId);
    }

    // Get the players ranked just above and below a user
    public Set<ZSetOperations.TypedTuple<String>> getNeighbors(
            String userId, int windowSize) {
        Long rank = getUserRank(userId);
        if (rank == null) return Collections.emptySet();
        long start = Math.max(0, rank - windowSize);
        long end = rank + windowSize;
        return redisTemplate.opsForZSet()
                .reverseRangeWithScores(GLOBAL_KEY, start, end);
    }
}
3. Single Node Limits and Failure Points
A single Redis instance has practical limits that become the ceiling for any non-trivial leaderboard:
- Write throughput: Redis is single-threaded for command processing. A heavily loaded instance typically handles 100,000–200,000 ops/sec for ZADD. At 1M score updates/sec you need at minimum 5–10 shards.
- Memory ceiling: Practical usable memory per Redis instance is ~75% of physical RAM to leave headroom for replication buffers and RDB snapshots. A 64GB server gives ~48GB usable. At 10M users × ~100 bytes per entry (the ~50-byte payload plus skiplist and hashtable overhead) ≈ 1GB per leaderboard, memory is not the primary concern — throughput is.
- Persistence overhead: AOF with appendfsync everysec adds roughly 10% write latency. During RDB snapshots, the fork's copy-on-write can momentarily approach double memory usage under heavy writes. In production, separate read replicas from the write primary and pin RDB snapshots to replicas.
- Single point of failure: Without Redis Sentinel or Redis Cluster, a single primary failure loses all in-memory state (if AOF is disabled) or requires recovery time from AOF/RDB, during which the leaderboard is unavailable.
4. Sharding Strategies for Leaderboards
Hash-Based Sharding (Anti-Pattern for Rankings)
Hash the user ID modulo number of shards: shard = hash(userId) % N. Each shard holds a fraction of users. Write throughput scales linearly. However, computing global rank requires querying all shards for the top scores and merging them — an O(N_shards) scatter-gather on every rank read. For top-100 reads at 10,000 reads/sec, this creates massive read amplification.
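To make the read amplification concrete, here is a minimal sketch of hash-based shard selection and the scatter-gather merge a global top-N read must perform across every shard. All names are illustrative, not from the post's codebase; the per-shard lists stand in for the results of a ZREVRANGE on each shard.

```java
import java.util.*;
import java.util.stream.*;

public class ScatterGatherTopN {

    // Pick a write shard for a user: simple hash modulo N (assumed scheme).
    public static int shardFor(String userId, int numShards) {
        return Math.floorMod(userId.hashCode(), numShards);
    }

    // Merge the per-shard top-N lists (each already sorted descending)
    // into a global top-N. This runs on EVERY global read under hash sharding.
    public static List<Map.Entry<String, Double>> mergeTopN(
            List<List<Map.Entry<String, Double>>> perShardTopN, int n) {
        return perShardTopN.stream()
                .flatMap(List::stream)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

Note that each global read fetches N candidates from every shard but keeps only N of them, which is exactly the read amplification the hierarchical aggregation below avoids.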
Hierarchical Aggregation (Recommended)
Partition users across N write shards by hash. Maintain a separate aggregation layer: a periodic (e.g., every 1–5 seconds) job merges the top-K scores from each write shard into a global leaderboard shard. This global shard serves all rank reads. The trade-off: the global leaderboard is eventually consistent with a lag equal to the aggregation interval — typically acceptable for games where millisecond rank precision is not required.
For user rank lookups (not just top-N), maintain a sorted structure per shard and compute approximate global rank as: global_rank ≈ Σ(count of users with higher scores in each shard). This approach scales writes horizontally while keeping reads fast.
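A sketch of the approximate-rank formula above, under the assumption that each shard can report how many of its users beat a given score. In Redis this count would be one ZCOUNT per shard with an exclusive lower bound; here the shards are plain maps for illustration.

```java
import java.util.*;

public class ApproximateRank {

    // Sum, across shards, the number of users with a strictly higher score.
    // In production each term is a single ZCOUNT call per shard rather than
    // a local map scan.
    public static long approximateGlobalRank(
            List<Map<String, Double>> shards, double userScore) {
        long rank = 0;
        for (Map<String, Double> shard : shards) {
            rank += shard.values().stream().filter(s -> s > userScore).count();
        }
        return rank; // 0-based: 0 means no one scored higher
    }
}
```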
Time-Window Leaderboards
Maintain separate sorted sets per time window: leaderboard:daily:2026-03-19, leaderboard:weekly:2026-W12, leaderboard:alltime. Use EXPIRE on daily and weekly keys to auto-evict stale data. Write each score update to all applicable keys atomically via a Lua script or pipeline to avoid partial updates.
-- Lua script for atomic multi-leaderboard update
local globalKey = KEYS[1]
local dailyKey = KEYS[2]
local weeklyKey = KEYS[3]
local userId = ARGV[1]
local delta = tonumber(ARGV[2])
redis.call('ZINCRBY', globalKey, delta, userId)
redis.call('ZINCRBY', dailyKey, delta, userId)
redis.call('ZINCRBY', weeklyKey, delta, userId)
return redis.call('ZREVRANK', globalKey, userId)
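The key names used above follow a date-based convention; a small hypothetical helper using java.time can generate them consistently. ISO week numbering is assumed for the weekly key.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.WeekFields;

public class LeaderboardKeys {

    // leaderboard:daily:YYYY-MM-DD
    public static String dailyKey(LocalDate date) {
        return "leaderboard:daily:" + date.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    // leaderboard:weekly:YYYY-Www using ISO week-based year and week number
    public static String weeklyKey(LocalDate date) {
        WeekFields wf = WeekFields.ISO;
        int week = date.get(wf.weekOfWeekBasedYear());
        int year = date.get(wf.weekBasedYear());
        return String.format("leaderboard:weekly:%d-W%02d", year, week);
    }
}
```

Using the week-based year (rather than the calendar year) avoids wrong keys in the days around New Year, when the ISO week can belong to the neighboring year.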
5. Geo-Distributed Leaderboards
For global events, maintaining a single leaderboard in one region forces all writes to cross the internet. A player in Singapore writing to a US-East Redis primary experiences 150–200ms of write latency — unacceptable for real-time gameplay.
Regional shards accept writes locally and sync to a global aggregation layer asynchronously. Each region maintains its own leaderboard (e.g., leaderboard:apac, leaderboard:emea, leaderboard:amer). A global aggregation service — running in each region — periodically merges regional top-scores into leaderboard:global. Players always write to their regional shard (sub-10ms), and global rankings reflect the aggregated state with a configurable staleness window (1–10 seconds).
Conflict resolution: Since scores are monotonically increasing (in most games, you can't lose points), the merge strategy is simple: take the maximum score per user across all regional shards. For games where scores decrease, use a last-writer-wins strategy with timestamp comparison, or maintain per-event sourcing.
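The max-merge for monotonically increasing scores fits in a few lines. This is an illustrative standalone sketch, not the aggregation service itself; the regional maps stand in for each region's sorted set contents.

```java
import java.util.*;

public class RegionalMerge {

    // Global score per user = maximum score seen across all regional shards.
    // Safe because scores only ever increase, so the max is always the newest.
    public static Map<String, Double> mergeMax(List<Map<String, Double>> regions) {
        Map<String, Double> global = new HashMap<>();
        for (Map<String, Double> region : regions) {
            region.forEach((user, score) -> global.merge(user, score, Math::max));
        }
        return global;
    }
}
```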
6. Caching the Read Path
The top-100 leaderboard is read orders of magnitude more than it is updated. Pre-compute and cache a serialized snapshot:
- Application-level cache: Cache the top-100 result in an in-process cache (Caffeine) with a 1–2 second TTL. Redis then sees roughly one read per app instance per TTL interval (for example, 50 instances with a 1-second TTL generate ~50 Redis reads/sec), regardless of how many clients are asking.
- CDN caching: For public leaderboards, serve top-N snapshots via CDN with short TTLs (5–30 seconds). A billion-user game can serve top-100 from CDN edge nodes without touching Redis at all.
- Warm-up strategy: Pre-populate the cache during off-peak hours. On cold start, the first request after cache expiry triggers a Redis read which repopulates the cache — not a thundering herd because only one in-flight request populates it (use request coalescing / cache locks).
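Request coalescing can be illustrated without any cache library: concurrent misses for the same key share a single in-flight load. This is a simplified sketch with hypothetical names; Caffeine's LoadingCache provides this behavior, plus TTL-based expiry, out of the box.

```java
import java.util.concurrent.*;
import java.util.function.Supplier;

public class CoalescingCache<K, V> {

    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();

    // All callers that arrive while a load is in flight join the same future,
    // so only one request per key reaches Redis at a time.
    public V get(K key, Supplier<V> loader) {
        CompletableFuture<V> f = inFlight.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(loader));
        try {
            return f.join();
        } finally {
            // Entry lives only while the load is in flight; a real cache
            // would layer a TTL'd value store on top of this.
            inFlight.remove(key, f);
        }
    }
}
```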
7. Write Path Optimization
Pipelining: Instead of one round-trip per score update, batch multiple ZADD/ZINCRBY commands in a single pipeline. The Lettuce Redis client supports automatic pipelining. Batching 100 updates in one pipeline reduces network round-trips by 100x.
// Batch pipeline with Lettuce (via Spring Data Redis)
public void batchUpdateScores(Map<String, Double> scoreDeltas) {
    redisTemplate.executePipelined((RedisCallback<Object>) connection -> {
        scoreDeltas.forEach((userId, delta) ->
                connection.zIncrBy(
                        GLOBAL_KEY.getBytes(),
                        delta,
                        userId.getBytes()));
        return null;
    });
}
Kafka buffer: For extreme write volumes, publish score events to Kafka and have a Kafka consumer aggregate scores in memory per time window (e.g., 100ms tumbling window) before flushing to Redis. This converts 1M individual Redis writes/sec into 10K batched pipeline flushes/sec — a 100x write reduction. The trade-off is increased write latency (100ms aggregation window) and operational complexity.
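The consumer-side aggregation described above might look like the following sketch (names assumed): score events accumulate per user for one tumbling window, then drain as a single batch suitable for a pipelined flush like batchUpdateScores.

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

public class WindowedScoreAggregator {

    private final ConcurrentHashMap<String, DoubleAdder> window =
            new ConcurrentHashMap<>();

    // Called once per consumed Kafka score event; deltas for the same user
    // within a window collapse into one accumulator.
    public void accept(String userId, double delta) {
        window.computeIfAbsent(userId, k -> new DoubleAdder()).add(delta);
    }

    // Called at each window boundary (e.g., every 100ms): returns one
    // combined delta per user and resets the window for the next interval.
    public Map<String, Double> drain() {
        Map<String, Double> batch = new HashMap<>();
        window.forEach((user, adder) -> batch.put(user, adder.sum()));
        window.clear();
        return batch;
    }
}
```

Many events for the same user in one window become a single ZINCRBY, which is where the 100x write reduction comes from.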
8. Consistency Trade-offs
The CAP theorem forces a choice between consistency and availability for leaderboards across network partitions. In practice, eventual consistency with bounded staleness is the right model for almost all leaderboard use cases:
- User tolerance: Players accept that the global leaderboard visible to them may be 1–5 seconds stale. Showing a rank that is 2 positions off is unnoticeable compared to the harm done by 500ms write latency.
- Rank precision vs latency: Strongly consistent rank queries require all shards to be in sync at query time — impossible at global scale without impractical coordination. Accept ±5 rank precision in exchange for sub-10ms local writes.
- Financial and competitive leaderboards (high stakes): For tournament finals where rank precision matters, use a separate strongly-consistent leaderboard for the top-1000 players only (stored on a single Redis primary with synchronous replica writes), and eventual consistency for the broader ranking.
| Storage | Write Throughput | Read Latency | Rank Query | Best For |
|---|---|---|---|---|
| Redis Sorted Set | 100K-200K/s (single) | <1ms | O(log N) | <5M users, single region |
| Redis Cluster (sharded) | 1M+/s | <5ms | O(shards × log N) | 10M+ users, high write load |
| PostgreSQL (window fn) | 10K-50K/s | 10-100ms | O(N log N) | Audit trail, complex queries |
| Apache Flink + Redis | 10M+/s | <10ms (cached) | Pre-aggregated | Streaming, time-windowed |
9. Failure Scenarios
Redis primary failure: Redis Sentinel or Redis Cluster performs automatic failover to a replica, typically within 10–30 seconds depending on configuration. During failover, writes fail; the application should queue writes in Kafka and replay them once the new primary is elected. Reads can be served from a replica (which may be slightly stale); in Redis Cluster, send the READONLY command on the replica connection to allow this.
Split-brain in Redis Cluster: Redis Cluster requires a majority of masters to agree before failing over; a primary isolated on the minority side of a partition stops accepting writes. The minority side can still serve stale reads. Mitigate by setting cluster-require-full-coverage to no for partial availability, or accept the brief write outage.
During outage — serve last known snapshot: Maintain a materialized snapshot of the top-100 leaderboard in a durable store (PostgreSQL or S3) that is refreshed every 10 seconds. During Redis downtime, serve this stale snapshot with a "Last updated: X seconds ago" indicator rather than returning an error.
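A minimal sketch of that fallback read path, with hypothetical names: try the live Redis read, fall back to the durable snapshot on any failure, and surface the snapshot's age to the client.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Supplier;

public class FallbackReader {

    public record Snapshot(List<String> top100, Instant updatedAt) {}

    // Try the live Redis read; on failure serve the durable snapshot instead.
    public static Snapshot readWithFallback(
            Supplier<Snapshot> live, Supplier<Snapshot> durable) {
        try {
            return live.get();
        } catch (RuntimeException redisDown) {
            return durable.get();
        }
    }

    // Staleness indicator shown alongside a fallback response.
    public static String stalenessLabel(Snapshot s, Instant now) {
        long age = Duration.between(s.updatedAt(), now).getSeconds();
        return "Last updated: " + age + " seconds ago";
    }
}
```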
10. Scaling Beyond Redis
At true internet scale — hundreds of millions of users with complex analytics needs — Redis alone becomes insufficient:
- Apache Flink: For streaming score aggregation with time-windowed leaderboards (tumbling, sliding, session windows), Flink provides native stateful processing, exactly-once semantics, and checkpointing. Flink emits pre-aggregated top-N results to Redis, reducing Redis write volume dramatically.
- Apache Pinot: For analytics queries — "show me the rank distribution by country", "histogram of scores in the top-1000" — Pinot provides sub-second OLAP queries over streaming event data via Kafka ingestion. Not for per-user rank lookups, but for business intelligence on leaderboard data.
- PostgreSQL with window functions: RANK() OVER (ORDER BY score DESC) works well for leaderboards up to ~5M users with proper indexing (B-tree on score DESC). Use it for audit trails, official tournament records, and complex queries that Redis cannot express.
"A leaderboard is one of the simplest real-time data structures to implement and one of the hardest to scale. Redis gives you the primitives; architecture gives you the scale."
11. Key Takeaways
- Redis Sorted Sets provide O(log N) write and rank operations ideal for leaderboards; understand memory and single-node throughput limits before scaling to Redis Cluster.
- Hash-based sharding distributes writes but creates scatter-gather read amplification; use hierarchical aggregation instead for efficient global ranking.
- Geo-distributed leaderboards require regional write shards with async sync to a global aggregation layer; design for bounded staleness, not strong consistency.
- The read path should be aggressively cached — top-N snapshots at application level (Caffeine) and CDN level reduce Redis read load by orders of magnitude.
- Use Kafka buffering to batch-reduce Redis write volume when write throughput exceeds single-node limits.
- Have a fallback strategy for Redis outages: a durable snapshot in PostgreSQL or S3 that can be served with a staleness indicator.
Architecture Diagram Idea
Three-tier diagram: (1) Write tier — Game Servers → Kafka topic → Score Consumer → 8x Redis Write Shards (each holding 1/8 of users). (2) Aggregation tier — Aggregation Service polls all write shards every 2s, computes top-10K globally, writes to Redis Global Leaderboard shard. (3) Read tier — API Gateway → App Cache (Caffeine, 1s TTL) → Redis Global Shard → Fallback: PostgreSQL snapshot. Show regional replication arrows for APAC, EMEA, AMER.
Discussion / Comments
Have experience building leaderboards at scale? Share your approach below.
Last updated: March 2026 — Written by Md Sanwar Hossain