
Designing a Ride-Hailing System at Scale: Geolocation, Matching Engine & Surge Pricing

Uber handles 20 million trips per day, matching riders to drivers in under 15 seconds anywhere in the world. The technical challenges are deceptively hard: tracking hundreds of thousands of driver GPS positions in real time, finding the closest available driver in milliseconds, computing ETAs on live traffic graphs, and dynamically pricing rides based on supply-demand signals — all while maintaining sub-second response times globally.

Md Sanwar Hossain · April 6, 2026 · 20 min read · System Design

TL;DR — Core Architecture Decisions

"A ride-hailing system needs: (1) a geospatial location service using H3/S2 cells to index driver positions and answer nearby-driver queries in O(1), (2) a matching engine that scores candidates by ETA + rating + acceptance rate and assigns the best driver, (3) a trip state machine (REQUESTED → MATCHED → EN_ROUTE → ARRIVED → TRIP → COMPLETED) with idempotent transitions, (4) a surge pricing engine reading supply/demand ratios per geofence cell, and (5) WebSocket channels keeping driver and rider synchronized throughout the trip."

Table of Contents

  1. System Architecture Overview
  2. Real-Time Driver Location Tracking
  3. Geospatial Indexing: H3 vs S2 Cells
  4. Driver Matching Engine
  5. Trip State Machine
  6. Surge Pricing Algorithm
  7. Payment & Fare Calculation
  8. Real-Time Notifications & Messaging
  9. Maps & ETA Calculation
  10. Fault Tolerance & Global Scale
  11. Capacity Estimation & Conclusion

1. System Architecture Overview

A ride-hailing platform consists of several interconnected services that must collaborate in real time under hard latency requirements. The core services are:

  • Location service: ingests driver GPS updates and answers nearby-driver queries
  • Matching engine: scores candidate drivers and assigns the best one to each request
  • Trip service: owns the trip state machine and its event log
  • Surge pricing engine: computes per-zone price multipliers from supply/demand signals
  • Payment service: handles fare calculation, pre-authorization, and capture
  • Notification service and WebSocket gateway: keep rider and driver apps synchronized
  • Maps/ETA service: routes on live traffic graphs and computes arrival estimates

Complete ride-hailing platform architecture — from rider request to trip completion. Source: mdsanwarhossain.me

2. Real-Time Driver Location Tracking

The location service is the most write-intensive component. Uber's driver app sends GPS updates every 4 seconds. With 1 million active drivers globally, that's 250,000 writes per second. The design must handle this without falling behind.

Adaptive Location Update Frequency

Not all drivers need the same update frequency. The app adapts based on trip state: a driver on an active trip pings most often (the rider is watching the car move on the map), an idle-but-available driver pings less frequently, and a stationary driver can back off further to save battery and server write load.
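
A minimal sketch of how such adaptive intervals might look (the states and interval values below are illustrative assumptions, not Uber's actual numbers):

```python
# Illustrative adaptive GPS update intervals, keyed by driver state
UPDATE_INTERVAL_SECONDS = {
    "ON_TRIP": 2,        # rider is watching the car move: highest frequency
    "EN_ROUTE": 4,       # driver heading to pickup
    "AVAILABLE": 8,      # idle but matchable: moderate frequency
    "OFFLINE_SOON": 30,  # winding down: rare pings
}

def next_update_interval(driver_state: str, speed_kmh: float) -> int:
    """Pick the ping interval from trip state, pinging less often when stationary."""
    base = UPDATE_INTERVAL_SECONDS.get(driver_state, 8)
    if speed_kmh < 1.0:  # parked or waiting: halve the write load
        return base * 2
    return base
```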

Location Write Path

# Driver location update flow
Driver App → WebSocket gateway → Location Service API
                                         │
                         ┌───────────────┴───────────────┐
                         ▼                               ▼
              Redis GeoSet (hot)              Kafka topic "location-updates"
              GEOADD drivers                      │
              {lat, lng, driverId}                ▼
                         │                  Stream processor
                         │                  (Flink/Kafka Streams)
                         ▼                  - Enriches with city/zone
              Nearby driver queries          - Writes to geospatial DB
              answered in O(log N)           - Updates surge pricing engine

3. Geospatial Indexing: H3 vs S2 Cells

The core challenge: given a rider at (lat, lng), find all available drivers within a 3km radius in under 10ms. Brute-force comparison of millions of driver coordinates is too slow. The solution: hierarchical geospatial cell indexing.

H3 Hexagonal Hierarchy (Uber's Choice)

Uber built and open-sourced H3, a hexagonal hierarchical geospatial indexing system. The globe is divided into hexagonal cells at 16 resolution levels. At resolution 9, each hexagon is approximately 0.1 km² — the right granularity for driver density in urban areas.
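
To make the bucketing idea concrete, here is a self-contained sketch that stands in for H3 with a plain square lat/lng grid (the real h3 library exposes hexagonal cell and k-ring functions; `CELL_DEG` and the index layout here are assumptions for illustration only):

```python
from collections import defaultdict

# Square-grid stand-in for H3 cells: quantize lat/lng to a grid cell ID
CELL_DEG = 0.01  # ~1.1 km of latitude per cell

def cell_of(lat: float, lng: float) -> tuple[int, int]:
    return (int(lat // CELL_DEG), int(lng // CELL_DEG))

# Cell ID -> set of driver IDs currently in that cell
index: dict[tuple[int, int], set[str]] = defaultdict(set)

def update_driver(driver_id: str, lat: float, lng: float) -> None:
    index[cell_of(lat, lng)].add(driver_id)

def nearby_drivers(lat: float, lng: float, k: int = 1) -> set[str]:
    """k-ring query: union the driver sets of the (2k+1)^2 surrounding cells."""
    cx, cy = cell_of(lat, lng)
    found: set[str] = set()
    for dx in range(-k, k + 1):
        for dy in range(-k, k + 1):
            found |= index.get((cx + dx, cy + dy), set())
    return found
```

Each lookup touches a constant number of cells, which is why cell indexing answers nearby-driver queries so cheaply compared to scanning all coordinates.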

S2 Geometry (Google's Choice)

Google Maps and many geospatial systems use S2 — a library that maps the sphere surface to a space-filling Hilbert curve, allowing spatial regions to be represented as ranges of 64-bit integers. S2 cells are squares at each level. For a ride-hailing system, you can store driver IDs in a sorted set indexed by S2 cell ID. Range queries on the sorted set retrieve drivers in nearby cells efficiently. Lyft uses S2 for this purpose.
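
The sorted-index idea can be sketched in a few lines with Python's `bisect` (the cell IDs below are made-up integers, not real S2 cell IDs):

```python
import bisect

# Driver entries keyed by cell ID, kept sorted; a spatial region
# becomes one or more contiguous ID ranges over this list.
entries: list[tuple[int, str]] = []

def add_driver(cell_id: int, driver_id: str) -> None:
    bisect.insort(entries, (cell_id, driver_id))

def drivers_in_range(lo_cell: int, hi_cell: int) -> list[str]:
    """Return drivers whose cell ID falls in [lo_cell, hi_cell] via binary search."""
    lo = bisect.bisect_left(entries, (lo_cell, ""))
    hi = bisect.bisect_right(entries, (hi_cell, "\uffff"))  # sentinel above any driver ID
    return [driver for _, driver in entries[lo:hi]]
```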

Attribute              H3 (Hexagonal)                          S2 (Square)
Cell shape             Hexagon                                 Square
Neighbor equidistance  ✅ Equal (6 neighbors)                  ❌ Diagonal neighbors are farther
Range query support    k-ring query                            Sorted set range query
Used by                Uber, Airbnb                            Google, Lyft, Foursquare
Best for               Uniform density analysis, surge zones   Arbitrary polygon queries, regions
H3 hexagonal cell indexing vs S2 square cells for geospatial driver tracking and nearby-driver queries. Source: mdsanwarhossain.me

4. Driver Matching Engine

The matching engine must find the optimal driver for each rider request within a few hundred milliseconds. Simple nearest-distance matching is insufficient — a driver 0.3km away stuck in traffic may have a worse ETA than one 0.8km away on a clear road.

Matching Scoring Function

Each candidate driver is scored as a weighted combination of ETA to the pickup point, driver rating, and historical acceptance rate. The candidate with the best score wins the assignment.
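
One way to sketch such a scoring function (the weights and normalization constants are illustrative assumptions; lower score wins):

```python
def score_driver(eta_min: float, rating: float, acceptance_rate: float,
                 w_eta: float = 0.6, w_rating: float = 0.2,
                 w_accept: float = 0.2) -> float:
    """Lower is better. Real systems tune the weights per market."""
    eta_term = eta_min / 15.0            # normalize against a 15-min worst case
    rating_term = (5.0 - rating) / 5.0   # a 5-star driver contributes zero penalty
    accept_term = 1.0 - acceptance_rate  # frequent decliners are penalized
    return w_eta * eta_term + w_rating * rating_term + w_accept * accept_term
```

The matcher then simply picks `min(candidates, key=score)`; the ETA weight dominates because pickup time is what the rider actually feels.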

Batched Global Matching vs Greedy Local Matching

Two approaches to matching at scale:

  • Greedy local matching: assign each request immediately to the best currently available driver. Lowest latency, but an early request can claim a driver that a slightly later request needed more.
  • Batched global matching: hold requests for a short window (a few seconds), then solve the assignment problem across the whole batch to minimize total ETA. Slightly higher latency, better system-wide outcome.
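
A toy comparison of the two approaches on an invented ETA matrix; the brute-force `batched_match` stands in for the Hungarian algorithm or min-cost flow that a real system would use:

```python
from itertools import permutations

# Toy ETA matrix (minutes): eta[rider][driver]. Values invented for illustration.
eta = [
    [5, 6],   # rider 0
    [6, 20],  # rider 1
]

def greedy_match(eta):
    """Each rider, in arrival order, grabs the fastest still-free driver."""
    taken, assignment = set(), []
    for rider_etas in eta:
        best = min((d for d in range(len(rider_etas)) if d not in taken),
                   key=lambda d: rider_etas[d])
        taken.add(best)
        assignment.append(best)
    return assignment

def batched_match(eta):
    """Exhaustive global optimum over one batch (fine for a toy n=2)."""
    n = len(eta)
    return min(permutations(range(n)),
               key=lambda perm: sum(eta[r][perm[r]] for r in range(n)))

def total_eta(eta, assignment):
    return sum(eta[r][assignment[r]] for r in range(len(assignment)))
```

Here greedy gives rider 0 the 5-minute driver and leaves rider 1 a 20-minute pickup (total 25), while batched matching accepts 6 minutes for rider 0 so rider 1 also gets 6 (total 12).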

5. Trip State Machine

The trip lifecycle is modeled as a state machine with strict transition rules. Each transition is idempotent and persisted to PostgreSQL with an event log in Kafka for auditability.

REQUESTED
    │
    ▼
MATCHING
    │
    ▼
DRIVER_ASSIGNED ───┐
    │              │
    ▼              │
DRIVER_EN_ROUTE ───┼──► CANCELLED (by rider or driver)
    │              │
    ▼              │
DRIVER_ARRIVED ────┘
    │
    ▼
TRIP_IN_PROGRESS ─────► CANCELLED (rare: during trip)
    │
    ▼
COMPLETED

# Each transition:
# 1. Validates current state (prevents invalid transitions)
# 2. Updates trip record in PostgreSQL (optimistic lock on version column)
# 3. Publishes TRIP_STATE_CHANGED event to Kafka
# 4. Notification service consumes event → pushes to rider and driver

Idempotency is critical: if the driver app retries "I arrived" after a network glitch, the system must not double-bill or emit duplicate events. The trip service guards this with a current-state check: if the state is already DRIVER_ARRIVED, the repeated request is acknowledged without re-processing.
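
A minimal in-memory sketch of such an idempotent transition (the transition table is abridged; the PostgreSQL row with its version column is simulated with a dict):

```python
# Abridged legal-transition table for the trip state machine
VALID_TRANSITIONS = {
    "DRIVER_EN_ROUTE": {"DRIVER_ARRIVED", "CANCELLED"},
    "DRIVER_ARRIVED": {"TRIP_IN_PROGRESS", "CANCELLED"},
    "TRIP_IN_PROGRESS": {"COMPLETED"},
}

def transition(trip: dict, new_state: str) -> bool:
    """Return True if the event was applied, False if it was a no-op retry.
    Raise on a genuinely invalid transition."""
    if trip["state"] == new_state:
        return False  # duplicate delivery: acknowledge silently, emit nothing
    if new_state not in VALID_TRANSITIONS.get(trip["state"], set()):
        raise ValueError(f"illegal transition {trip['state']} -> {new_state}")
    trip["state"] = new_state
    trip["version"] += 1  # stands in for the optimistic-lock version column
    return True
```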

6. Surge Pricing Algorithm

Surge pricing (dynamic pricing) serves a dual purpose: it incentivizes more drivers to come online (supply-side response) and reduces demand during peak periods by raising prices. The algorithm must react in near-real-time and be computed per geographic zone.

Surge Multiplier Calculation

Each H3 cell (or a cluster of cells forming a "surge zone") is evaluated every 30 seconds:

# Surge multiplier formula (piecewise linear)
def surge_multiplier(demand: int, supply: int) -> float:
    # demand = open ride requests in zone (last 5 min)
    # supply = available drivers in zone
    ratio = demand / max(supply, 1)
    if ratio < 0.5:
        multiplier = 1.0                                  # No surge
    elif ratio < 1.0:
        multiplier = 1.0 + (ratio - 0.5) * 1.2            # Linear ramp
    elif ratio < 2.0:
        multiplier = 1.6 + (ratio - 1.0) * 0.8            # Slower ramp
    else:
        multiplier = min(2.4 + (ratio - 2.0) * 0.2, 5.0)  # Cap at 5x
    return round(multiplier, 1)  # Round to nearest 0.1 for display clarity

Surge Zone Architecture

The surge engine runs as a stream processing job (Kafka Streams or Flink) consuming the location-updates and ride-requests topics. It maintains a sliding 5-minute window of demand/supply per H3 cell, computes multipliers, and writes results to Redis. When the pricing service needs the current surge for a rider's zone, it reads from Redis in <1ms. Surge zones are persisted to PostgreSQL for fare auditing (regulators require price records for each trip).
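
The 5-minute window can be sketched with a per-zone sliding counter; in production this state lives inside the Flink/Kafka Streams job, not application memory:

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events (ride requests or driver pings) in a trailing time window."""

    def __init__(self, window_seconds: int = 300):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def record(self, ts: float) -> None:
        self.events.append(ts)

    def count(self, now: float) -> int:
        # Evict everything older than the window, then count what remains
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)
```

The surge job keeps one such counter per H3 cell for demand and one for supply, and feeds their ratio into the multiplier formula above every 30 seconds.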

7. Payment & Fare Calculation

The payment flow uses a two-phase approach to handle the case where the final fare is unknown at trip start (the rider doesn't know exactly how far they'll go or how long it will take).
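
A sketch of the two-phase flow against an in-memory gateway stub (class and method names are invented; a real system calls a payment service provider such as Stripe, Adyen, or Braintree):

```python
import uuid

class FakeGateway:
    """In-memory stand-in for a payment service provider."""

    def __init__(self):
        self.holds: dict[str, float] = {}

    def preauthorize(self, amount: float) -> str:
        """Phase 1 (trip start): place a hold for the estimated fare."""
        auth_id = str(uuid.uuid4())
        self.holds[auth_id] = amount
        return auth_id

    def capture(self, auth_id: str, final_amount: float) -> float:
        """Phase 2 (trip end): capture the metered fare, up to the held amount.
        If the final fare exceeds the hold, the overage needs a fresh charge."""
        held = self.holds.pop(auth_id)
        return min(final_amount, held)

    def void(self, auth_id: str) -> None:
        """Rider cancelled before pickup: release the hold entirely."""
        self.holds.pop(auth_id, None)
```

Pre-authorizing slightly above the estimate covers the common case where the metered fare drifts from the upfront quote, without ever charging the rider before the trip actually happens.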

8. Real-Time Notifications & Messaging

Both rider and driver need real-time updates throughout the trip. The notification architecture uses a tiered approach:

WebSocket (Primary)

  • Persistent connection during active session
  • Bidirectional: driver location → rider map
  • ~1ms delivery latency
  • Used for: driver location updates, trip status changes, chat

Push Notifications (Fallback)

  • Used when app is in background or WebSocket drops
  • APNs (iOS) and FCM (Android)
  • 2–5 second delivery latency
  • Used for: driver assigned, driver arrived, trip complete

The WebSocket gateway runs as a stateful service: each connection is pinned to a specific gateway node. The notification service publishes events to Kafka, and each gateway node runs a consumer that routes messages to its local WebSocket connections. For cross-node routing (an event for a user connected to node B arrives at node A's consumer), a Redis Pub/Sub channel broadcasts the message to all nodes; the node holding the active connection delivers it and the rest ignore it.
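
The fan-out can be simulated in-process: every node sees every broadcast, and only the node holding the user's connection delivers it (a stand-in for PUBLISH on the shared Redis channel):

```python
class GatewayNode:
    """One WebSocket gateway node; connections maps user_id -> delivered messages."""

    def __init__(self, name: str):
        self.name = name
        self.connections: dict[str, list[str]] = {}

    def connect(self, user_id: str) -> None:
        self.connections[user_id] = []

    def on_broadcast(self, user_id: str, message: str) -> bool:
        if user_id in self.connections:  # this node owns the WebSocket
            self.connections[user_id].append(message)
            return True
        return False  # some other node owns it: ignore

def broadcast(nodes: list, user_id: str, message: str) -> None:
    """Stand-in for publishing on the shared Redis Pub/Sub channel."""
    for node in nodes:
        node.on_broadcast(user_id, message)
```

Broadcasting to every node trades some wasted fan-out for simplicity; an alternative is a connection registry (user to node mapping in Redis) so events route to exactly one node.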

9. Maps & ETA Calculation

ETA accuracy is a critical driver of rider satisfaction. Showing "3 minutes" and the driver arriving in 9 minutes destroys trust. The ETA engine must incorporate live traffic conditions, not just static road graphs.

Routing Pipeline
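
As a minimal sketch of the idea, here is Dijkstra over a road graph whose edge weights are live travel times rather than static distances (the graph and times below are invented; production routers use contraction hierarchies or similar preprocessing on continent-scale graphs):

```python
import heapq

def eta_seconds(graph: dict, src: str, dst: str) -> float:
    """Shortest travel time over a graph of node -> [(neighbor, seconds), ...]."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, travel_time in graph.get(node, []):
            nd = d + travel_time
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")
```

Because edge weights are live travel times, a congested direct road can lose to a longer detour, which is exactly the behavior static distance graphs miss.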

10. Fault Tolerance & Global Scale

Uber operates in 10,000+ cities across 70 countries. The architecture must handle regional outages without global impact.

Cell-Based Regional Isolation

Uber's architecture partitions the world into geographic cells. Each cell (typically covering a metro area) runs its own instance of the location, matching, and trip services. A cell failure (e.g., AWS us-east-1 outage) affects only the cities in that cell — other regions continue operating normally. Cell boundaries are designed so no major metro spans two cells, preventing cross-cell coordination during failures.

Graceful Degradation

Failed Component          Degraded Behavior
Surge pricing service     Default to 1.0x multiplier (no surge). Revenue loss but service continues.
ETA service               Show distance-based estimate ("~5 min") without live traffic. Accuracy degrades.
WebSocket gateway         Fall back to push notifications. Higher latency, trip still functional.
Payment pre-auth fails    Block ride booking. Non-negotiable for revenue protection.

11. Capacity Estimation & Conclusion

Back-of-Envelope: Location Service

  • 1 million active drivers × update every 4s = 250,000 writes/sec to location service
  • Each update payload: ~50 bytes (driverId, lat, lng, timestamp, speed)
  • Bandwidth: 250,000 × 50 bytes = 12.5 MB/sec ingress
  • Redis GeoSet: 1M driver entries × ~100 bytes = 100 MB RAM — trivially fits in Redis
  • Kafka: 250,000 msg/sec × 50 bytes = 12.5 MB/sec; easily handled by a 3-broker cluster
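
The arithmetic above, reproduced as a quick sanity check:

```python
# Back-of-envelope numbers for the location service
drivers = 1_000_000
interval_s = 4          # GPS update every 4 seconds
payload_bytes = 50      # driverId, lat, lng, timestamp, speed

writes_per_sec = drivers / interval_s                      # 250,000 writes/sec
ingress_mb_per_sec = writes_per_sec * payload_bytes / 1e6  # 12.5 MB/sec
redis_ram_mb = drivers * 100 / 1e6                         # 100 MB of GeoSet entries
```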

A ride-hailing system is one of the most technically rich distributed system design problems because it touches every major distributed systems concept: real-time data streams, geospatial indexing, state machines, event-driven architecture, payment processing, and real-time communications — all under strict latency requirements. The H3 geospatial index and the batched global matching approach are the most commonly tested concepts in system design interviews for this domain.


Md Sanwar Hossain

Software Engineer · Java · Spring Boot · Microservices · System Design

Last updated: April 6, 2026