Rate Limiting, Caching & Load Balancing: Essential Building Blocks for Scalable APIs
Rate limiting, caching, and load balancing are the three foundational mechanisms that transform a single-server application into a scalable production system. Each has multiple implementation approaches with distinct trade-offs. Understanding these mechanisms at the algorithm level — not just the configuration level — is what enables you to design them correctly for your specific requirements.
Rate Limiting: Protecting Your API from Abuse and Overload
Rate limiting controls the number of requests a client can make to an API within a time window. Without rate limiting, a single misbehaving client can exhaust server resources, degrading service for all users. Rate limiting also provides the primary defense against denial-of-service attacks and abuse of free tier quotas.
Token Bucket Algorithm
The token bucket is the most widely used rate limiting algorithm. A bucket holds up to N tokens. Tokens are added at a fixed rate (e.g., 100 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected (HTTP 429 Too Many Requests). The key insight is that the bucket allows short bursts — if a client has been idle and accumulated tokens, it can make requests faster than the refill rate for a short time. This makes token bucket well-suited for API rate limiting where bursty traffic is legitimate.
// Redis-based token bucket rate limiter using a Lua script (atomic)
import java.util.List;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.stereotype.Service;

@Service
public class TokenBucketRateLimiter {

    private final StringRedisTemplate redis;

    public TokenBucketRateLimiter(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // Lua script ensures atomicity — read + update in a single Redis round-trip
    private static final String RATE_LIMIT_SCRIPT = """
            local key = KEYS[1]
            local capacity = tonumber(ARGV[1])
            local refill_rate = tonumber(ARGV[2]) -- tokens per second
            local now = tonumber(ARGV[3])
            local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
            local tokens = tonumber(bucket[1]) or capacity
            local last_refill = tonumber(bucket[2]) or now
            -- Refill tokens based on elapsed time
            local elapsed = now - last_refill
            tokens = math.min(capacity, tokens + elapsed * refill_rate)
            local allowed = 0
            if tokens >= 1 then
                tokens = tokens - 1
                allowed = 1
            end
            redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)
            return allowed
            """;

    public boolean isAllowed(String clientId, int capacity, double refillRate) {
        String key = "rate_limit:" + clientId;
        long now = System.currentTimeMillis() / 1000; // seconds, matching the refill rate unit
        Long result = redis.execute(
                new DefaultRedisScript<>(RATE_LIMIT_SCRIPT, Long.class),
                List.of(key),
                String.valueOf(capacity),
                String.valueOf(refillRate),
                String.valueOf(now));
        return Long.valueOf(1L).equals(result);
    }
}
Sliding Window Counter
The fixed window algorithm has a boundary problem: a client can nearly double its effective rate by bursting at the end of one window and the start of the next. Two refinements address this. The sliding window log stores a timestamp per request and counts entries within the trailing window; it is exact, but memory grows with request rate. The sliding window counter approximates the log by keeping counts for only the current and previous fixed windows and weighting the previous count by how much of that window still overlaps the sliding window; it uses constant memory per client with accuracy close to the log. For most production API rate limiting, the sliding window counter offers the best accuracy-to-cost trade-off.
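As a sketch of the hybrid counter approach, here is a minimal single-node, in-memory implementation. It is illustrative only; a distributed limiter would keep this state in Redis, as in the token bucket example above.

```java
// Minimal in-memory sliding window counter (single node, illustrative).
// Keeps counts for the current and previous fixed windows and weights the
// previous window's count by how much of it overlaps the sliding window.
public class SlidingWindowCounter {
    private final int limit;          // max requests per window
    private final long windowMillis;  // window length in milliseconds

    private long currentWindowStart;
    private int currentCount;
    private int previousCount;

    public SlidingWindowCounter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean isAllowed(long nowMillis) {
        long windowStart = nowMillis - (nowMillis % windowMillis);
        if (windowStart != currentWindowStart) {
            // Roll the window forward; if more than one full window elapsed,
            // the previous count no longer overlaps and drops to zero.
            previousCount = (windowStart - currentWindowStart == windowMillis)
                    ? currentCount : 0;
            currentCount = 0;
            currentWindowStart = windowStart;
        }
        // Fraction of the previous window still inside the sliding window
        double overlap = 1.0 - (double) (nowMillis - windowStart) / windowMillis;
        double estimated = previousCount * overlap + currentCount;
        if (estimated < limit) {
            currentCount++;
            return true;
        }
        return false;
    }
}
```

With a limit of 2 per second, a burst at the end of one window counts against the start of the next, which is exactly the leakage the fixed window algorithm permits.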
Spring Boot Rate Limiting with Bucket4j
Bucket4j is the de facto standard rate limiting library for Java applications. It implements the token bucket algorithm with Redis backend support for distributed rate limiting across multiple application instances.
// Spring Boot servlet filter with Bucket4j distributed rate limiting
import java.io.IOException;
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.ConsumptionProbe;
import org.springframework.stereotype.Component;

@Component
public class RateLimitingFilter implements Filter {

    private final Bucket4jConfiguration bucket4jConfig;

    public RateLimitingFilter(Bucket4jConfiguration bucket4jConfig) {
        this.bucket4jConfig = bucket4jConfig;
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest httpReq = (HttpServletRequest) req;
        HttpServletResponse httpRes = (HttpServletResponse) res;
        String clientId = extractClientId(httpReq);
        Bucket bucket = bucket4jConfig.getBucket(clientId);
        ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);
        if (probe.isConsumed()) {
            httpRes.setHeader("X-Rate-Limit-Remaining",
                    String.valueOf(probe.getRemainingTokens()));
            chain.doFilter(req, res);
        } else {
            long waitMs = probe.getNanosToWaitForRefill() / 1_000_000;
            // Retry-After is specified in whole seconds; round up so clients
            // never retry before the bucket has actually refilled
            long retryAfterSec = (waitMs + 999) / 1000;
            httpRes.setHeader("Retry-After", String.valueOf(retryAfterSec));
            // The servlet API defines no constant for 429, so use the literal
            httpRes.sendError(429, "Rate limit exceeded. Retry after " + waitMs + "ms");
        }
    }

    private String extractClientId(HttpServletRequest req) {
        // Illustrative: key by API key header if present, else by client IP
        String apiKey = req.getHeader("X-API-Key");
        return apiKey != null ? apiKey : req.getRemoteAddr();
    }
}
Caching: The Highest-ROI Scalability Tool
Caching stores the result of an expensive operation so subsequent requests can be served from the cache without repeating the operation. A well-designed cache can reduce database load by 90–99% on read-heavy workloads, cut response latency from milliseconds to microseconds, and enable a service to serve far more traffic than the underlying database could handle directly.
Caching Strategies
Cache-aside (lazy loading): The application checks the cache first. On a miss, it fetches from the database, stores the result in the cache, and returns it. This is the most flexible strategy and the most common. Cache entries are only populated for data that is actually requested, minimizing memory usage. The downside is that every cache miss incurs a database round-trip.
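The cache-aside flow can be sketched in a few lines. The loader function below stands in for the database query, and the in-memory map stands in for Redis; the names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal cache-aside sketch: consult the cache first, and only on a
// miss invoke the backing loader (the "database"), then populate the cache.
public class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stand-in for a database query

    public CacheAside(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        V value = cache.get(key);
        if (value == null) {              // cache miss
            value = loader.apply(key);    // fetch from the database
            cache.put(key, value);        // populate for subsequent reads
        }
        return value;
    }

    public void invalidate(K key) {
        cache.remove(key);                // evict on writes to avoid stale reads
    }
}
```

Note the companion rule for writes: invalidate (or update) the cached entry when the underlying row changes, otherwise readers see stale data until the TTL expires.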
Write-through: Every database write is simultaneously written to the cache. The cache is always up to date with no staleness for recently written data. The downside is increased write latency and cache pollution (data is cached even if it is never read).
Write-behind (write-back): Writes go to the cache immediately and are flushed to the database asynchronously. This dramatically reduces write latency but risks data loss if the cache fails before the flush. Only appropriate for use cases where some data loss is tolerable (e.g., view counts, non-critical analytics).
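The write-behind trade-off is easiest to see in code. In this sketch the backing "database" is just a map, and flush() is called explicitly where a real implementation would run it on a background scheduler with a bounded queue.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Write-behind sketch: writes land in the cache immediately and are
// flushed to the backing store later, in batches.
public class WriteBehindCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Map<K, V> database;            // stand-in backing store
    private final Queue<K> pending = new ArrayDeque<>();

    public WriteBehindCache(Map<K, V> database) {
        this.database = database;
    }

    public synchronized void put(K key, V value) {
        cache.put(key, value);   // fast path: cache only
        pending.add(key);        // remember that this key must be flushed
    }

    public synchronized V get(K key) {
        return cache.containsKey(key) ? cache.get(key) : database.get(key);
    }

    // Normally run periodically by a scheduler. Anything still in the
    // pending queue is lost if the process dies first — the core
    // write-behind risk described above.
    public synchronized void flush() {
        K key;
        while ((key = pending.poll()) != null) {
            database.put(key, cache.get(key));
        }
    }
}
```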
Cache Eviction and Expiry
LRU (Least Recently Used) is the most common cache eviction policy — when the cache is full, the least recently accessed entry is evicted. Set explicit TTL (time-to-live) on all cache entries to prevent stale data from persisting indefinitely. The TTL should be calibrated to the acceptable staleness of the data — a user's profile can be cached for 30 minutes, but a user's account balance should be cached for at most a few seconds or not at all.
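As a mental model for LRU eviction, Java's LinkedHashMap can express the policy in a few lines: access-order mode reorders entries on every get(), and removeEldestEntry evicts the least recently used entry once capacity is exceeded. This is a sketch of the policy itself, not of how Redis implements its (approximate) LRU.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache built on LinkedHashMap's access-order mode.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);  // accessOrder = true → LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict once over capacity
    }
}
```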
Cache Stampede Prevention
Cache stampede occurs when many concurrent requests simultaneously experience a cache miss for the same key and all rush to the database. This can overload the database precisely when load is highest. Mitigations include TTL jitter (randomizing expiry so many keys do not expire at the same instant), probabilistic early refresh (recomputing a hot entry shortly before it expires), and request coalescing or locking, where only one request fetches from the database while the others wait for it to populate the cache.
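Distributed locking requires Redis, but the coalescing idea can be shown in-process: a "single-flight" loader in which concurrent misses for one key share a single database fetch. This is an illustrative sketch; for brevity the completed future is kept as the cache entry, whereas a real implementation would evict it on TTL.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Single-flight loading sketch: computeIfAbsent guarantees only one caller
// creates the future for a given key; everyone else joins the same
// in-flight load instead of hitting the database again.
public class SingleFlightCache<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight =
            new ConcurrentHashMap<>();
    private final Function<K, V> loader; // stand-in for the database query

    public SingleFlightCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        CompletableFuture<V> future = inFlight.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(() -> loader.apply(k)));
        return future.join(); // all concurrent callers wait on the same load
    }
}
```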
Load Balancing: Distributing Traffic Across Instances
Load balancing distributes incoming requests across multiple application instances so no single instance is overwhelmed. It also provides resilience: if one instance fails, the load balancer routes traffic to healthy instances.
Load Balancing Algorithms
Round robin: Requests are distributed sequentially across instances. Simple and effective when all instances have identical capacity and request processing time is uniform.
Least connections: Each new request is routed to the instance with the fewest active connections. Better than round robin when request processing time varies significantly.
Consistent hashing: Requests with the same key (for example, the same client ID) are consistently routed to the same instance. Essential for stateful applications (session affinity) and for distributing cache load predictably, so that requests for a given key always reach the same cache shard regardless of how many application instances are active.
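A least-connections picker can be sketched in a few lines. Instance names here are illustrative, and a real load balancer would also track health checks and instance weights.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Least-connections sketch: track in-flight requests per instance and
// route each new request to the instance with the fewest of them.
public class LeastConnectionsBalancer {
    private final Map<String, AtomicInteger> active = new ConcurrentHashMap<>();

    public LeastConnectionsBalancer(List<String> instances) {
        instances.forEach(i -> active.put(i, new AtomicInteger()));
    }

    public String acquire() {
        String chosen = active.entrySet().stream()
                .min(Map.Entry.comparingByValue(
                        (a, b) -> Integer.compare(a.get(), b.get())))
                .orElseThrow()
                .getKey();
        active.get(chosen).incrementAndGet(); // request is now in flight
        return chosen;
    }

    public void release(String instance) {
        active.get(instance).decrementAndGet(); // request finished
    }
}
```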
Consistent Hashing for Distributed Caches
When distributing a cache across multiple Redis nodes, consistent hashing ensures that adding or removing a node remaps only roughly 1/N of the keys rather than all of them. This avoids the mass cache misses, and the resulting database stampede, that full remapping would cause during cluster scaling events.
// Consistent hash ring for cache node selection
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import com.google.common.hash.Hashing; // Guava

public class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(List<String> nodes, int virtualNodes) {
        this.virtualNodes = virtualNodes;
        nodes.forEach(this::addNode);
    }

    // Each physical node appears at many points on the ring ("virtual nodes")
    // so keys spread evenly even with few physical nodes
    public void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "-vn-" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "-vn-" + i));
        }
    }

    // Walk clockwise from the key's hash to the first node, wrapping at the end
    public String getNode(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("No nodes in ring");
        long hash = hash(key);
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private long hash(String key) {
        return Hashing.murmur3_128().hashString(key, StandardCharsets.UTF_8).asLong();
    }
}
Putting It Together: A Scalable API Stack
A production-grade scalable API combines all three mechanisms in layers: a CDN or API Gateway applies rate limiting at the edge, rejecting abusive traffic before it reaches application servers. A load balancer (Kubernetes Ingress / AWS ALB) distributes traffic across application replicas using least-connections or consistent hashing. Application instances apply Redis-backed rate limiting for fine-grained per-user quotas. Cache layers (L1 local in-memory cache + L2 Redis distributed cache) serve the majority of reads without touching the database. The database handles only writes and cache-miss reads.
"Rate limiting protects you from bad actors. Caching protects you from your own traffic. Load balancing protects individual instances from each other. All three are necessary at production scale."
Key Takeaways
- Token bucket is the preferred rate limiting algorithm for API throttling — it allows controlled bursts while enforcing average rate limits.
- Use Redis Lua scripts for atomic rate limit state updates in distributed environments.
- Cache-aside is the most flexible caching strategy; set explicit TTLs and implement stampede prevention for high-traffic keys.
- Consistent hashing is essential for distributed caches — it minimizes cache remapping when cluster topology changes.
- Layer rate limiting at the edge (CDN/gateway) and per-user (application), caching in L1+L2, and load balancing with least-connections or consistent hashing.