Rate Limiting, Caching & Load Balancing: Essential Building Blocks for Scalable APIs
Rate limiting, caching, and load balancing are the three foundational mechanisms that transform a single-server application into a scalable production system. Each has multiple implementation approaches with distinct trade-offs. Understanding these mechanisms at the algorithm level — not just the configuration level — is what enables you to design them correctly for your specific requirements.
Rate Limiting: Protecting Your API from Abuse and Overload
Rate limiting controls the number of requests a client can make to an API within a time window. Without rate limiting, a single misbehaving client can exhaust server resources, degrading service for all users. Rate limiting also provides the primary defense against denial-of-service attacks and abuse of free tier quotas.
Token Bucket Algorithm
The token bucket is the most widely used rate limiting algorithm. A bucket holds up to N tokens. Tokens are added at a fixed rate (e.g., 100 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected (HTTP 429 Too Many Requests). The key insight is that the bucket allows short bursts — if a client has been idle and accumulated tokens, it can make requests faster than the refill rate for a short time. This makes token bucket well-suited for API rate limiting where bursty traffic is legitimate.
// Redis-based token bucket rate limiter using a Lua script (atomic)
import java.util.List;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.stereotype.Service;

@Service
public class TokenBucketRateLimiter {

    private final StringRedisTemplate redis;

    public TokenBucketRateLimiter(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Lua script ensures atomicity — read + update in a single Redis round-trip
    private static final String RATE_LIMIT_SCRIPT = """
            local key = KEYS[1]
            local capacity = tonumber(ARGV[1])
            local refill_rate = tonumber(ARGV[2]) -- tokens per second
            local now = tonumber(ARGV[3])
            local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
            local tokens = tonumber(bucket[1]) or capacity
            local last_refill = tonumber(bucket[2]) or now
            -- Refill tokens based on elapsed time
            local elapsed = now - last_refill
            tokens = math.min(capacity, tokens + elapsed * refill_rate)
            local allowed = 0
            if tokens >= 1 then
                tokens = tokens - 1
                allowed = 1
            end
            redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return allowed
            """;

    public boolean isAllowed(String clientId, int capacity, double refillRate) {
        String key = "rate_limit:" + clientId;
        long now = System.currentTimeMillis() / 1000; // seconds, matching the refill rate unit
        Long result = redis.execute(
                new DefaultRedisScript<>(RATE_LIMIT_SCRIPT, Long.class),
                List.of(key),
                String.valueOf(capacity),
                String.valueOf(refillRate),
                String.valueOf(now));
        return Long.valueOf(1L).equals(result);
    }
}
Sliding Window Counter
The fixed window algorithm has a boundary problem: a client can nearly double its effective rate by bursting at the end of one window and the start of the next. Two refinements address this. The sliding window log stores a timestamp per request and counts entries within the trailing window; it is exact, but memory grows with request rate. The sliding window counter approximates the log by keeping counts for only the current and previous fixed windows and weighting the previous count by how much of that window still overlaps the sliding window; it uses constant memory per client with accuracy close to the log. For most production API rate limiting, the sliding window counter offers the best accuracy-to-cost trade-off.
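As a sketch of the hybrid counter approach, here is a minimal single-node, in-memory implementation. It is illustrative only; a distributed limiter would keep this state in Redis, as in the token bucket example above.

```java
// Minimal in-memory sliding window counter (single node, illustrative).
// Keeps counts for the current and previous fixed windows and weights the
// previous window's count by how much of it overlaps the sliding window.
public class SlidingWindowCounter {
    private final int limit;          // max requests per window
    private final long windowMillis;  // window length in milliseconds

    private long currentWindowStart;
    private int currentCount;
    private int previousCount;

    public SlidingWindowCounter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean isAllowed(long nowMillis) {
        long windowStart = nowMillis - (nowMillis % windowMillis);
        if (windowStart != currentWindowStart) {
            // Roll the window forward; if more than one full window elapsed,
            // the previous count no longer overlaps and drops to zero.
            previousCount = (windowStart - currentWindowStart == windowMillis)
                    ? currentCount : 0;
            currentCount = 0;
            currentWindowStart = windowStart;
        }
        // Fraction of the previous window still inside the sliding window
        double overlap = 1.0 - (double) (nowMillis - windowStart) / windowMillis;
        double estimated = previousCount * overlap + currentCount;
        if (estimated < limit) {
            currentCount++;
            return true;
        }
        return false;
    }
}
```

With a limit of 2 per second, a burst at the end of one window counts against the start of the next, which is exactly the leakage the fixed window algorithm permits.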
Spring Boot Rate Limiting with Bucket4j
Bucket4j is the de facto standard rate limiting library for Java applications. It implements the token bucket algorithm with Redis backend support for distributed rate limiting across multiple application instances.
// Spring Boot servlet filter with Bucket4j distributed rate limiting
import java.io.IOException;
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.ConsumptionProbe;
import org.springframework.stereotype.Component;

@Component
public class RateLimitingFilter implements Filter {

    private final Bucket4jConfiguration bucket4jConfig;

    public RateLimitingFilter(Bucket4jConfiguration bucket4jConfig) {
        this.bucket4jConfig = bucket4jConfig;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;
        HttpServletResponse httpRes = (HttpServletResponse) res;
        String clientId = extractClientId(httpReq);
        Bucket bucket = bucket4jConfig.getBucket(clientId);
        ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);
        if (probe.isConsumed()) {
            httpRes.setHeader("X-Rate-Limit-Remaining",
                    String.valueOf(probe.getRemainingTokens()));
            chain.doFilter(req, res);
        } else {
            long waitMs = probe.getNanosToWaitForRefill() / 1_000_000;
            // Retry-After is specified in whole seconds; round up so clients
            // never retry before the bucket has actually refilled
            long retryAfterSec = (waitMs + 999) / 1000;
            httpRes.setHeader("Retry-After", String.valueOf(retryAfterSec));
            // The servlet API defines no constant for 429, so use the literal
            httpRes.sendError(429, "Rate limit exceeded. Retry after " + waitMs + "ms");
        }
    }

    private String extractClientId(HttpServletRequest req) {
        // Illustrative: key by API key header if present, else by client IP
        String apiKey = req.getHeader("X-API-Key");
        return apiKey != null ? apiKey : req.getRemoteAddr();
    }
}
Caching: The Highest-ROI Scalability Tool
Caching stores the result of an expensive operation so subsequent requests can be served from the cache without repeating the operation. A well-designed cache can reduce database load by 90–99% on read-heavy workloads, cut response latency from milliseconds to microseconds, and enable a service to serve far more traffic than the underlying database could handle directly.
Caching Strategies
Cache-aside (lazy loading): The application checks the cache first. On a miss, it fetches from the database, stores the result in the cache, and returns it. This is the most flexible strategy and the most common. Cache entries are only populated for data that is actually requested, minimizing memory usage. The downside is that every cache miss incurs a database round-trip.
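The cache-aside flow can be sketched in a few lines. The loader function below stands in for the database query, and the in-memory map stands in for Redis; the names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal cache-aside sketch: consult the cache first, and only on a
// miss invoke the backing loader (the "database"), then populate the cache.
public class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stand-in for a database query

    public CacheAside(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {              // cache miss
            value = loader.apply(key);    // fetch from the database
            cache.put(key, value);        // populate for subsequent reads
        }
        return value;
    }

    public void invalidate(K key) {
        cache.remove(key);                // evict on writes to avoid stale reads
    }
}
```

Note the companion rule for writes: invalidate (or update) the cached entry when the underlying row changes, otherwise readers see stale data until the TTL expires.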
Write-through: Every database write is simultaneously written to the cache. The cache is always up to date with no staleness for recently written data. The downside is increased write latency and cache pollution (data is cached even if it is never read).
Write-behind (write-back): Writes go to the cache immediately and are flushed to the database asynchronously. This dramatically reduces write latency but risks data loss if the cache fails before the flush. Only appropriate for use cases where some data loss is tolerable (e.g., view counts, non-critical analytics).
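The write-behind trade-off is easiest to see in code. In this sketch the backing "database" is just a map, and flush() is called explicitly where a real implementation would run it on a background scheduler with a bounded queue.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Write-behind sketch: writes land in the cache immediately and are
// flushed to the backing store later, in batches.
public class WriteBehindCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Map<K, V> database;            // stand-in backing store
    private final Queue<K> pending = new ArrayDeque<>();

    public WriteBehindCache(Map<K, V> database) {
        this.database = database;
    }

    public synchronized void put(K key, V value) {
        cache.put(key, value);   // fast path: cache only
        pending.add(key);        // remember that this key must be flushed
    }

    public synchronized V get(K key) {
        return cache.containsKey(key) ? cache.get(key) : database.get(key);
    }

    // Normally run periodically by a scheduler. Anything still in the
    // pending queue is lost if the process dies first — the core
    // write-behind risk described above.
    public synchronized void flush() {
        K key;
        while ((key = pending.poll()) != null) {
            database.put(key, cache.get(key));
        }
    }
}
```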
Cache Eviction and Expiry
LRU (Least Recently Used) is the most common cache eviction policy — when the cache is full, the least recently accessed entry is evicted. Set explicit TTL (time-to-live) on all cache entries to prevent stale data from persisting indefinitely. The TTL should be calibrated to the acceptable staleness of the data — a user's profile can be cached for 30 minutes, but a user's account balance should be cached for at most a few seconds or not at all.
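As a mental model for LRU eviction, Java's LinkedHashMap can express the policy in a few lines: access-order mode reorders entries on every get(), and removeEldestEntry evicts the least recently used entry once capacity is exceeded. This is a sketch of the policy itself, not of how Redis implements its (approximate) LRU.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache built on LinkedHashMap's access-order mode.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);  // accessOrder = true → LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict once over capacity
    }
}
```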
Cache Stampede Prevention
Cache stampede occurs when many concurrent requests simultaneously experience a cache miss for the same key and all rush to the database. This can overload the database precisely when load is highest. Mitigations include TTL jitter (randomizing expiry so many keys do not expire at the same instant), probabilistic early refresh (recomputing a hot entry shortly before it expires), and request coalescing or locking, where only one request fetches from the database while the others wait for it to populate the cache.
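Distributed locking requires Redis, but the coalescing idea can be shown in-process: a "single-flight" loader in which concurrent misses for one key share a single database fetch. This is an illustrative sketch; for brevity the completed future is kept as the cache entry, whereas a real implementation would evict it on TTL.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Single-flight loading sketch: computeIfAbsent guarantees only one caller
// creates the future for a given key; everyone else joins the same
// in-flight load instead of hitting the database again.
public class SingleFlightCache<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stand-in for the database query

    public SingleFlightCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        CompletableFuture<V> future = inFlight.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(() -> loader.apply(k)));
        return future.join(); // all concurrent callers wait on the same load
    }
}
```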
Load Balancing: Distributing Traffic Across Instances
Load balancing distributes incoming requests across multiple application instances so no single instance is overwhelmed. It also provides resilience: if one instance fails, the load balancer routes traffic to healthy instances.
Load Balancing Algorithms
Round robin: Requests are distributed sequentially across instances. Simple and effective when all instances have identical capacity and request processing time is uniform.
Least connections: Each new request is routed to the instance with the fewest active connections. Better than round robin when request processing time varies significantly.
Consistent hashing: Requests with the same key (for example, the same client ID) are consistently routed to the same instance. Essential for stateful applications (session affinity) and for distributing cache load predictably, so that requests for a given key always reach the same cache shard regardless of how many application instances are active.
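A least-connections picker can be sketched in a few lines. Instance names here are illustrative, and a real load balancer would also track health checks and instance weights.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Least-connections sketch: track in-flight requests per instance and
// route each new request to the instance with the fewest of them.
public class LeastConnectionsBalancer {
    private final Map<String, AtomicInteger> active = new ConcurrentHashMap<>();

    public LeastConnectionsBalancer(List<String> instances) {
        instances.forEach(i -> active.put(i, new AtomicInteger()));
    }

    public String acquire() {
        String chosen = active.entrySet().stream()
                .min(Map.Entry.comparingByValue(
                        (a, b) -> Integer.compare(a.get(), b.get())))
                .orElseThrow()
                .getKey();
        active.get(chosen).incrementAndGet(); // request is now in flight
        return chosen;
    }

    public void release(String instance) {
        active.get(instance).decrementAndGet(); // request finished
    }
}
```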
Consistent Hashing for Distributed Caches
When distributing a cache across multiple Redis nodes, consistent hashing ensures that adding or removing a node remaps only roughly 1/N of the keys rather than all of them. This avoids the mass cache misses, and the resulting database stampede, that full remapping would cause during cluster scaling events.
// Consistent hash ring for cache node selection
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import com.google.common.hash.Hashing; // Guava

public class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(List<String> nodes, int virtualNodes) {
        this.virtualNodes = virtualNodes;
        nodes.forEach(this::addNode);
    }

    // Each physical node appears at many points on the ring ("virtual nodes")
    // so keys spread evenly even with few physical nodes
    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "-vn-" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "-vn-" + i));
        }
    }

    // Walk clockwise from the key's hash to the first node, wrapping at the end
    public String getNode(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("No nodes in ring");
        long hash = hash(key);
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private long hash(String key) {
        return Hashing.murmur3_128().hashString(key, StandardCharsets.UTF_8).asLong();
    }
}
Putting It Together: A Scalable API Stack
A production-grade scalable API combines all three mechanisms in layers: a CDN or API Gateway applies rate limiting at the edge, rejecting abusive traffic before it reaches application servers. A load balancer (Kubernetes Ingress / AWS ALB) distributes traffic across application replicas using least-connections or consistent hashing. Application instances apply Redis-backed rate limiting for fine-grained per-user quotas. Cache layers (L1 local in-memory cache + L2 Redis distributed cache) serve the majority of reads without touching the database. The database handles only writes and cache-miss reads.
"Rate limiting protects you from bad actors. Caching protects you from your own traffic. Load balancing protects individual instances from each other. All three are necessary at production scale."
Key Takeaways
- Token bucket is the preferred rate limiting algorithm for API throttling — it allows controlled bursts while enforcing average rate limits.
- Use Redis Lua scripts for atomic rate limit state updates in distributed environments.
- Cache-aside is the most flexible caching strategy; set explicit TTLs and implement stampede prevention for high-traffic keys.
- Consistent hashing is essential for distributed caches — it minimizes cache remapping when cluster topology changes.
- Layer rate limiting at the edge (CDN/gateway) and per-user (application), caching in L1+L2, and load balancing with least-connections or consistent hashing.