Designing Scalable Systems at Uber & Netflix Scale: Patterns, Trade-offs, and Architecture Decisions

High-scale distributed system architecture visualization

Designing systems that serve millions of users simultaneously requires a fundamentally different approach than building typical enterprise applications. The patterns used by Uber, Netflix, and similar platforms are not magic — they are well-understood architectural decisions with specific trade-offs. Understanding those trade-offs is what separates senior system designers from the rest.

The Three Axes of Scale

Before diving into specific patterns, it is important to distinguish the different dimensions of scale. Traffic scale: can the system handle millions of requests per second? Data scale: can the system store and query terabytes or petabytes efficiently? Geographic scale: can the system serve users in multiple regions with low latency? Each axis requires different architectural solutions, and real production systems at Uber/Netflix scale must address all three.

The common mistake is optimizing for one axis while ignoring others. A horizontally scaled application cluster with a single-region PostgreSQL database is not truly scalable at traffic scale — the database becomes the bottleneck. Thinking systematically about all three axes prevents architectural blind spots.

Capacity Estimation: The Foundation of Scalable Design

Good system design starts with numbers, not patterns. Before choosing technologies, estimate the scale requirements:

  • Traffic: Daily active users × average requests per user = daily request volume. Assume peak traffic runs 2–3× the daily average. For Uber, demand peaks on Friday evenings — that peak is your design point, not the average.
  • Storage: Data generated per event × events per day × retention period. Account for replication (typically 3×) and index overhead (typically 2–3×).
  • Bandwidth: Average request size × peak requests per second. Estimate for both ingress and egress separately.

These estimates drive hardware sizing decisions and identify the primary bottleneck — is it compute, storage, or network? — which determines where to focus architectural effort.
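
A quick back-of-envelope pass over these formulas, using purely illustrative numbers (none of them are real Uber or Netflix figures):

```java
// Back-of-envelope capacity estimation for a hypothetical ride-hailing service.
// Every input number below is an illustrative assumption.
public class CapacityEstimate {
    public static void main(String[] args) {
        long dailyActiveUsers = 10_000_000L;
        long requestsPerUserPerDay = 50;
        long dailyRequests = dailyActiveUsers * requestsPerUserPerDay; // 500M/day

        long avgRps = dailyRequests / 86_400; // seconds per day -> ~5,787 rps
        long peakRps = avgRps * 3;            // design point: ~17,361 rps

        long bytesPerEvent = 500;             // one GPS/trip event
        int retentionDays = 90;
        double replication = 3.0;             // typical 3x replication
        double indexOverhead = 2.5;           // typical 2-3x index overhead
        double storageTB = bytesPerEvent * (double) dailyRequests * retentionDays
                * replication * indexOverhead / 1e12; // ~169 TB

        long avgRequestBytes = 2_000;
        double peakIngressMBps = peakRps * avgRequestBytes / 1e6; // ~34.7 MB/s

        System.out.printf("peak rps: %d, storage: %.0f TB, peak ingress: %.1f MB/s%n",
                peakRps, storageTB, peakIngressMBps);
    }
}
```

Even this rough pass is informative: at these assumed numbers the bottleneck is storage growth, not request throughput, which is exactly the kind of conclusion that should precede any technology choice.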

The Read-Heavy System: Netflix's Content Delivery Approach

Netflix streams to hundreds of millions of subscribers worldwide, with tens of millions of concurrent streams at peak. Video content is inherently read-heavy — content is uploaded once and read millions of times. The architectural strategy is aggressive caching at every layer.

CDN as the Primary Serving Layer

Netflix's Open Connect CDN deploys servers inside internet service providers' data centers globally, caching the most popular content files close to end users. By the time a user requests a popular title, the content file is on a server in their ISP's facility — dramatically reducing latency and Netflix's egress bandwidth costs. Designing around a CDN for static and semi-static content is one of the highest-leverage architectural decisions for content-heavy systems.

Database Reads: Read Replicas and Caching

For metadata reads (movie information, user preferences, recommendations), the pattern is: cache first, read replica second, primary last. Redis or Memcached serves cache hits in under 1 ms. Read replicas handle cache misses, taking load off the primary. The primary handles only writes. This read-path hierarchy can absorb 99%+ of read traffic without touching the primary database.

// Cache-aside pattern for user preferences
import java.time.Duration;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class UserPreferenceService {
    private final RedisTemplate<String, UserPreference> redis;
    private final UserPreferenceRepository repository;
    private static final Duration CACHE_TTL = Duration.ofMinutes(30);

    public UserPreferenceService(RedisTemplate<String, UserPreference> redis,
                                 UserPreferenceRepository repository) {
        this.redis = redis;
        this.repository = repository;
    }

    public UserPreference getPreferences(String userId) {
        String cacheKey = "user:preferences:" + userId;
        UserPreference cached = redis.opsForValue().get(cacheKey);
        if (cached != null) return cached;

        UserPreference prefs = repository.findByUserId(userId);
        if (prefs != null) {  // don't cache misses; RedisTemplate rejects null values
            redis.opsForValue().set(cacheKey, prefs, CACHE_TTL);
        }
        return prefs;
    }

    public void updatePreferences(String userId, UserPreference prefs) {
        repository.save(prefs);                      // write to the source of truth first
        redis.delete("user:preferences:" + userId);  // then invalidate the cache entry
    }
}

The Write-Heavy System: Uber's Trip Data Approach

Uber processes millions of GPS location updates per second at peak. Unlike Netflix's read-heavy model, location tracking is write-heavy. The architectural strategy is write path optimization with eventual consistency.

Message Queue as Write Buffer

Raw GPS events are published to Apache Kafka topics rather than written directly to a database. Kafka's throughput (millions of messages per second per cluster) far exceeds any database's write throughput. Consumers process events asynchronously: the trip position service reads events for active trips, the fare calculation service reads events to estimate ETAs and fares, and the analytics service reads events for business intelligence. The database receives only aggregated, post-processed data.
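
The buffer-then-aggregate flow can be sketched in plain Java. This is an illustrative stand-in only: an ArrayBlockingQueue plays the role of the Kafka topic, a ConcurrentHashMap plays the role of the database, and all class and field names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the write-buffer pattern: raw location events land in a buffer
// (a Kafka topic in production, a BlockingQueue here), and a consumer
// aggregates them before anything touches the "database".
public class LocationWriteBuffer {
    record GpsEvent(String tripId, double lat, double lon) {}

    private final BlockingQueue<GpsEvent> buffer = new ArrayBlockingQueue<>(10_000);

    // The "database" receives only the latest aggregated position per trip.
    private final Map<String, GpsEvent> latestPosition = new ConcurrentHashMap<>();

    public boolean publish(GpsEvent e) {
        // Non-blocking write path: under backpressure, callers can drop or retry
        // rather than stalling the GPS ingestion hot path.
        return buffer.offer(e);
    }

    public void drainOnce() {
        // Consumer side: drain buffered events, keeping only the newest per trip.
        GpsEvent e;
        while ((e = buffer.poll()) != null) {
            latestPosition.put(e.tripId(), e);
        }
    }

    public GpsEvent latest(String tripId) {
        return latestPosition.get(tripId);
    }
}
```

The key property the sketch demonstrates is decoupling: the publish path never waits on the slow aggregation path, which is the same reason Kafka sits between GPS ingestion and the database in the real architecture.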

Database Sharding for Write Scale

When write volume exceeds a single database's capacity, horizontal sharding distributes data across multiple database instances. Each shard handles a subset of the key space — for trip data, sharding by driver ID or geographic region is natural. The application layer (or a proxy like Vitess, which Uber uses for MySQL sharding) routes writes and reads to the correct shard based on the sharding key.

Sharding introduces complexity: cross-shard queries require scatter-gather aggregation; shard rebalancing as the dataset grows requires careful migration planning; and operations that span multiple shards cannot be wrapped in a single ACID transaction. Design the sharding key to minimize cross-shard operations.
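
A minimal sketch of application-level routing by sharding key, assuming a modulo-based scheme and a hypothetical shard naming convention (production routers such as Vitess use range-based or lookup-based routing precisely so shards can be rebalanced without rehashing everything):

```java
// Minimal sketch of application-level shard routing by driver ID.
// Plain hash-modulo is shown for clarity; it makes adding shards expensive,
// which is why real systems prefer consistent hashing or a routing table.
public class ShardRouter {
    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    public int shardFor(String driverId) {
        // floorMod guards against negative hashCode values
        return Math.floorMod(driverId.hashCode(), shardCount);
    }

    public String jdbcUrlFor(String driverId) {
        // Hypothetical DSN naming scheme for illustration only.
        return "jdbc:mysql://trips-shard-" + shardFor(driverId) + ".internal:3306/trips";
    }
}
```

Note that the router is deterministic: the same driver ID always maps to the same shard, which is what keeps all of one driver's trip writes on a single instance and avoids cross-shard transactions for the common case.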

Geographic Distribution: Multi-Region Architecture

For true global scale, a single cloud region is not sufficient — neither for latency (roughly 250 ms round-trip from the US to Asia Pacific) nor for resilience (a single-region outage must not take down the entire service). Multi-region architecture typically follows one of two patterns: active-passive (the primary region handles all writes; a secondary region is a warm standby for disaster recovery) or active-active (multiple regions handle writes with cross-region synchronization). Active-active is harder to operate but achieves both lower global latency and higher resilience.

The challenge in active-active is handling conflicting writes to the same data from different regions. Design your data model to minimize cross-region write conflicts: partition data by geography (US users managed by US region, EU users by EU region), or use CRDTs (Conflict-free Replicated Data Types) for data that can be merged without conflict (counters, sets).
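
As a concrete example of conflict-free merging, a grow-only counter (G-Counter) fits in a few lines: each region increments only its own slot, and replicas merge by taking the per-region maximum, so concurrent increments from different regions never conflict. The sketch below is illustrative, not a production CRDT library.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a grow-only counter (G-Counter) CRDT. Each region increments
// its own entry; merging takes the per-region maximum, so merge order
// doesn't matter and replicas converge without coordination.
public class GCounter {
    private final Map<String, Long> counts = new HashMap<>();

    public void increment(String region) {
        counts.merge(region, 1L, Long::sum);
    }

    public long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    public void merge(GCounter other) {
        other.counts.forEach((region, n) -> counts.merge(region, n, Math::max));
    }
}
```

Because merge is commutative, associative, and idempotent, two regions can exchange state in any order, any number of times, and still agree on the total — the defining property that makes CRDTs safe for active-active replication.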

The CAP Theorem in Practice

The CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency (all nodes see the same data at the same time), Availability (every request receives a response), and Partition tolerance (the system continues despite network partition). In practice, network partitions happen, so systems must choose between consistency and availability under partition.

Uber's dispatch system prioritizes availability: it is better to show a driver's last known location than to block dispatch because the location database is temporarily unavailable. Netflix's streaming prioritizes availability: it is better to play content from a slightly stale CDN cache than to block playback waiting for the freshest metadata. On the other hand, financial transaction systems (payment processing, account balances) prioritize consistency: a stale or incorrect balance is not an acceptable trade-off for availability.

Know your business requirements and make an explicit consistency vs availability decision for each data type rather than applying a blanket policy.

"Scale is not a binary — you scale to your actual requirements. Over-designing for scale you don't have is just as damaging as under-designing for scale you do have."

Key Takeaways

  • Address all three axes of scale: traffic, data, and geographic distribution.
  • Start with capacity estimates — numbers drive architectural decisions, not pattern preferences.
  • Read-heavy systems: CDN + read replicas + aggressive caching reduce database load by 99%+.
  • Write-heavy systems: Kafka as a write buffer + database sharding for write throughput at scale.
  • Make explicit CAP theorem decisions for each data type based on business requirements.
