Designing a Hotel Booking System at Scale: Airbnb Architecture, Inventory & Double-Booking Prevention
Airbnb handles over 150 million users, 7 million active listings, and thousands of bookings per minute at peak. Building a hotel or short-term rental booking platform at this scale is one of the most challenging system design problems you'll face — touching distributed inventory, strong consistency guarantees, real-time search, dynamic pricing, saga-based transactions, payment escrow, and fraud prevention all at once. This guide walks through every layer with production-grade depth.
TL;DR — Core Design Decisions
"Use Elasticsearch with geo_point for sub-100ms listing discovery. Model availability as a date-range bitset or calendar table with optimistic locking to prevent double booking. Orchestrate bookings as a Saga with compensating transactions. Capture payment in escrow and release after check-in. Trigger reviews 24 hours post-checkout via an event pipeline — not a cron job."
Table of Contents
- Functional & Non-Functional Requirements
- High-Level Architecture Overview
- Search Service — Geo-Search, Facets & ML Ranking
- Inventory & Availability Calendar
- Dynamic Pricing Engine
- Booking Engine & Saga Pattern
- Payment & Escrow Service
- Review & Trust System
- Notification Service
- Capacity Estimation
- Scalability & Reliability Patterns
- System Design Interview Checklist
1. Functional & Non-Functional Requirements
Before touching any architecture, lock down what you're building and what you're not. In a system design interview, this step separates senior candidates from those who dive straight into database schemas.
Functional Requirements
- Search & Discovery: Users can search listings by location (city, lat/lng radius), date range, guest count, price range, amenities, and property type. Results are geo-ranked and personalized.
- Listing Management: Hosts create and manage property listings with photos, descriptions, amenities, house rules, cancellation policies, and pricing calendars.
- Availability Calendar: Real-time availability shown per night. Hosts can block dates manually; booked dates are automatically blocked.
- Booking Flow: Guest requests a booking for a date range. System checks availability, holds inventory, processes payment, and confirms booking — all atomically.
- Payment Processing: Capture guest payment at booking time; hold in escrow. Release funds to host 24 hours after check-in. Support refunds based on cancellation policy.
- Reviews & Ratings: Both guests and hosts can leave reviews post-stay. Ratings affect search ranking. Fraud detection prevents fake or coerced reviews.
- Notifications: Email, SMS, and push notifications for booking confirmations, reminders, cancellations, and messages.
- Messaging: In-platform messaging between guests and hosts before and during stays.
- Dynamic Pricing: Prices vary by demand, seasonality, local events, and competitor benchmarks. Hosts set base price; engine suggests optimal nightly rates.
Non-Functional Requirements
| Property | Target | Rationale |
|---|---|---|
| Search latency (p99) | < 150 ms | User conversion drops 7% per 100ms delay |
| Booking confirmation latency | < 3 s end-to-end | Includes availability lock + payment auth |
| Availability consistency | Strong (no double bookings) | Double bookings are catastrophic to trust |
| Search availability freshness | Eventual (< 30 s lag) | Search can show slightly stale results |
| System availability | 99.99% (4.3 min/month downtime) | Revenue-critical path |
| Payment idempotency | 100% — exactly once | Duplicate charges are legal/trust risks |
Out of Scope (for this design)
- Host identity verification and background check integration (third-party API)
- Multi-currency accounting and tax calculation engine
- Channel manager integration (connecting to Booking.com, Expedia OTA feeds)
- Mobile app — we focus on backend services
2. High-Level Architecture Overview
The platform is decomposed into bounded-context microservices aligned with domain ownership. Each service owns its database, communicates asynchronously via events for most flows, and synchronously only where consistency is critical (availability lock, payment).
Core Services & Responsibilities
- API Gateway: Rate limiting, JWT validation, request routing, SSL termination. AWS API Gateway or Kong.
- Search Service: Elasticsearch cluster with geo-search, faceted filtering, and ML ranking. Read-heavy, eventually consistent with listing data.
- Listing Service: CRUD for property listings, photos (S3), amenities, house rules. Publishes listing change events to Kafka.
- Inventory Service: Owns the availability calendar. The single source of truth for which dates are bookable. Enforces double-booking prevention with optimistic locking.
- Pricing Service: Computes nightly rates based on base price, demand signals, seasonal rules, and competitor benchmarks. Caches results in Redis.
- Booking Service: Orchestrates the booking saga. Calls inventory, pricing, and payment in sequence. Handles compensating transactions on failure.
- Payment Service: Integrates with Stripe/Braintree. Manages charge capture, escrow holds, payout scheduling, and refunds.
- Review Service: Collects post-stay reviews from both parties. Runs fraud detection. Publishes rating events to Search Service for re-ranking.
- Notification Service: Consumes events from Kafka and dispatches email (SES), SMS (Twilio), and push (FCM/APNs).
- User Service: Authentication (OAuth 2.0 + JWT), profile management, host/guest role management.
Data Store Selection
| Service | Primary Store | Cache | Reason |
|---|---|---|---|
| Search | Elasticsearch | Redis | Geo-queries, full-text, facets |
| Listing | PostgreSQL | Redis | Relational, ACID, complex joins |
| Inventory | PostgreSQL | Redis (read cache) | Strong consistency, row-level locks |
| Booking | PostgreSQL | — | Saga state machine, ACID required |
| Pricing | TimescaleDB | Redis | Time-series demand data |
| Reviews | PostgreSQL + Cassandra | CDN | Write-heavy review feed, read-heavy display |
3. Search Service — Geo-Search, Facets & ML Ranking
Search is the highest-traffic, most latency-sensitive component. At Airbnb scale, 80% of all requests are search queries. The search service runs on an Elasticsearch cluster with custom ML ranking layered on top of BM25 relevance.
Elasticsearch Geo-Search Design
Each listing document is indexed with a geo_point field. Search queries combine a geo filter with availability and facet filters:
```
// Elasticsearch query: location + date availability + price filter
// (comments are illustrative — strip them before sending to Elasticsearch)
GET /listings/_search
{
  "query": {
    "bool": {
      "must": [
        {"range": {"price_per_night": {"gte": 50, "lte": 300}}},
        {"term": {"property_type": "entire_apartment"}},
        {"range": {"max_guests": {"gte": 2}}}
      ],
      "filter": [
        {
          "geo_distance": {
            "distance": "10km",
            "location": {"lat": 40.7128, "lon": -74.0060}
          }
        },
        // Availability filter: stay range must NOT overlap any booked range
        {"bool": {"must_not": [
          {"nested": {
            "path": "booked_ranges",
            "query": {
              "bool": {
                "must": [
                  {"range": {"booked_ranges.start": {"lte": "2026-06-20"}}},
                  {"range": {"booked_ranges.end": {"gte": "2026-06-15"}}}
                ]
              }
            }
          }}
        ]}}
      ]
    }
  },
  "sort": [
    {"_score": "desc"},
    {"_geo_distance": {"location": {"lat": 40.7128, "lon": -74.0060}, "order": "asc", "unit": "km"}}
  ],
  "from": 0, "size": 20
}
```
Availability in Search vs. Inventory Service
A critical design decision: search availability is eventually consistent, but booking availability is strongly consistent. Here's the split:
- Search Service (Elasticsearch): Uses a denormalized snapshot of booked date ranges, updated via Kafka events from the Inventory Service. Staleness of up to 30 seconds is acceptable — guests just get an error at the booking step if truly unavailable.
- Inventory Service (PostgreSQL): The authoritative source. Every booking attempt does a real-time availability check here with row-level locking before proceeding.
- Why not make search strongly consistent? Elasticsearch doesn't support row-level locking. Making every search query go through the Inventory Service's PostgreSQL would collapse under load — you'd need 10× the database capacity just for read traffic.
ML Ranking Layer
Elasticsearch BM25 is the retrieval layer; a LambdaMART or XGBoost ranking model is the ranking layer. Features fed to the ranking model include:
- Listing signals: Average rating, number of reviews, Superhost status, response rate, booking acceptance rate, photo quality score.
- Price signals: Price deviation from median for this location/date, discount from original price, price-per-guest ratio.
- Personalization signals: User's historical booking price range, preferred property types, previously viewed but not booked listings.
- Contextual signals: Days until check-in (last-minute vs. planned), device type (mobile vs. desktop conversion rates differ), local event calendar.
- Freshness signal: Recently listed properties get a temporary boost to address the cold-start problem for new hosts.
Search Result Caching Strategy
Cache keys are constructed from hash(lat_lng_bucket + date_range + filters) where lat/lng is quantized to 0.01° cells (≈1 km). Cache TTL is 60 seconds for popular searches, 300 seconds for rare searches. Personalized ranking is applied post-cache, so the cache stores un-ranked result IDs. This approach gives 40–60% cache hit rates while still delivering personalized results.
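The quantization scheme above can be sketched in a few lines. Note that `SearchCacheKey`, `quantize`, and `geoCacheKey` are illustrative names invented for this sketch, not Airbnb's actual implementation:

```java
import java.time.LocalDate;
import java.util.Locale;

// Sketch of the search-cache key scheme described above: snap lat/lng to
// 0.01-degree cells (~1 km) so nearby searches share one cache entry.
public class SearchCacheKey {

    // Snap a coordinate to the floor of its 0.01-degree cell.
    static String quantize(double coord) {
        return String.format(Locale.ROOT, "%.2f", Math.floor(coord * 100) / 100);
    }

    // Un-personalized cache key: geo cell + date range + normalized filters.
    // Personalized ranking is applied after the cache lookup.
    static String geoCacheKey(double lat, double lng,
                              LocalDate checkIn, LocalDate checkOut,
                              String filters) {
        String cell = quantize(lat) + "," + quantize(lng);
        return "search:" + cell + ":" + checkIn + ":" + checkOut + ":" + filters.hashCode();
    }
}
```

Two searches a few hundred meters apart fall into the same cell and hit the same cached result-ID list.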
4. Inventory & Availability Calendar — Preventing Double Bookings
Double booking is the single worst failure mode in a booking system. It destroys host and guest trust and is the source of most legal disputes. The availability calendar design must guarantee that two concurrent booking requests for the same property and overlapping date range cannot both succeed.
Availability Data Model
We model availability as a per-listing, per-date table rather than a range table. This gives O(1) date lookup and enables atomic locking at the individual night level:
```sql
-- Inventory availability table (PostgreSQL)
CREATE TABLE availability (
    listing_id BIGINT NOT NULL,
    date       DATE   NOT NULL,
    status     VARCHAR(20) NOT NULL DEFAULT 'available',
               -- 'available' | 'booked' | 'blocked' | 'pending'
    booking_id BIGINT REFERENCES bookings(id),
    price      NUMERIC(10,2),
    version    BIGINT NOT NULL DEFAULT 0,  -- optimistic locking
    PRIMARY KEY (listing_id, date)
);

-- Partial index for availability range queries
CREATE INDEX idx_avail_listing_date ON availability (listing_id, date)
    WHERE status = 'available';

-- Partial index for pending holds (garbage-collected after TTL)
CREATE INDEX idx_avail_pending ON availability (listing_id, date, booking_id)
    WHERE status = 'pending';
```
Optimistic Locking for Double-Booking Prevention
When a guest initiates a booking, we atomically move every night in the requested range from available to pending. The hold combines a row lock (to serialize concurrent holds) with a per-row version check (to catch any modification between read and write). If any night is missing or its version changed, the transaction rolls back:

```sql
-- Atomic availability hold (all-or-nothing for the date range)

-- Step 1: Lock the candidate rows. FOR UPDATE serializes concurrent holds
-- on the same nights (the pessimistic half of the scheme).
SELECT date, version FROM availability
WHERE listing_id = $1
  AND date BETWEEN $2 AND $3
  AND status = 'available'
FOR UPDATE;

-- Step 2: Verify all nights came back (no gaps in availability).
-- If row count != expected nights, raise DATES_UNAVAILABLE.

-- Step 3: Mark the range as 'pending' with the booking reference.
-- The per-row version comparison is the optimistic check: if a concurrent
-- writer bumped a row's version, our UPDATE skips that row.
UPDATE availability a
SET status     = 'pending',
    booking_id = $4,
    version    = a.version + 1
FROM unnest($5::date[], $6::bigint[]) AS snapshot(d, v)  -- dates + versions read in Step 1
WHERE a.listing_id = $1
  AND a.date       = snapshot.d
  AND a.status     = 'available'
  AND a.version    = snapshot.v;

-- Step 4: If rows_updated != expected nights → concurrent conflict → ROLLBACK.
-- Saga compensating action: release hold.
```
Pending Hold TTL & Cleanup
A pending hold is created when availability is locked but payment has not yet been processed. If payment fails or the user abandons the flow, the hold must be released within a bounded time. Two mechanisms enforce this:
- Hold TTL (10 minutes): A Kafka message with a 10-minute delay is published when the hold is created. The Inventory Service consumes it and releases stale pending holds.
- Booking saga timeout: If the saga does not reach the payment-confirmed step within 10 minutes, a compensating transaction fires and releases all held inventory.
- Background reconciliation job: A scheduled job runs every 5 minutes querying `pending` rows older than 15 minutes (safety net for missed events).
5. Dynamic Pricing Engine
Dynamic pricing is a competitive differentiator. Airbnb's "Smart Pricing" tool is estimated to increase host revenue by 15–30% compared to flat pricing. The engine computes a per-night recommended price based on multiple signals and updates it daily or on-demand.
Pricing Factors & Weights
- Base price (set by host): Floor below which the engine never recommends. Hosts retain full control.
- Day-of-week pattern: Friday/Saturday commands a 20–40% premium for urban markets; Sunday–Thursday sees lower rates. Derived from historical booking patterns per city.
- Lead time: 60+ days out → lower "early bird" price to maximize occupancy. 7–0 days out → surge price for last-minute inventory with inelastic demand.
- Demand score: Real-time search views vs. bookings ratio for similar listings in the same geo-cell and date window. Computed from Kafka event stream with a 1-hour rolling window in Flink.
- Local events: A concert, conference, or sports event in a 20km radius can spike demand 2–5×. Integrated via Ticketmaster/Predicthq APIs. Cached in Redis keyed by (city_id, date).
- Competitor benchmarking: Median price of top-20 similar listings (same type, capacity, amenities, within 2km) for the same date. Computed nightly by a batch job.
- Seasonality: Pre-trained time-series model (Prophet or seasonal decomposition) captures annual patterns per city.
- Listing performance: High conversion rate listings can command a premium; low conversion listings get a discount signal to improve demand.
Pricing Computation Pipeline
```java
// Pricing service pseudo-code (Java Spring Boot)
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

@Service
public class DynamicPricingEngine {

    public BigDecimal computePrice(Long listingId, LocalDate date) {
        Listing listing = listingCache.get(listingId);
        BigDecimal base = listing.getBasePrice();

        // Multiplicative factors (each returns a ratio, e.g., 1.25 = +25%)
        double dowFactor    = dayOfWeekModel.getFactor(listing.getCityId(), date.getDayOfWeek());
        double leadFactor   = leadTimeModel.getFactor(ChronoUnit.DAYS.between(LocalDate.now(), date));
        double demandFactor = demandSignalService.getDemandFactor(listing.getGeoCell(), date);
        double eventFactor  = eventService.getEventFactor(listing.getCityId(), date);
        double compFactor   = competitorService.getMedianRatio(listingId, date);

        BigDecimal recommended = base
            .multiply(BigDecimal.valueOf(dowFactor))
            .multiply(BigDecimal.valueOf(leadFactor))
            .multiply(BigDecimal.valueOf(demandFactor))
            .multiply(BigDecimal.valueOf(eventFactor))
            .multiply(BigDecimal.valueOf(compFactor));

        // Clamp to [minPrice, maxPrice] set by host
        return recommended.max(listing.getMinPrice()).min(listing.getMaxPrice());
    }
}
```
Pricing Caching Strategy
Computing prices on every search request would be prohibitively expensive. Instead:
- Nightly batch job: Pre-computes recommended prices for all listings for the next 365 days. Stored in the pricing table (TimescaleDB) and cached in Redis with a 24-hour TTL.
- Real-time override: If a high-demand event is detected mid-day, a targeted re-computation is triggered for affected listings via a Kafka message.
- Cache key: `price:{listing_id}:{date}` in Redis. The Search Service reads from this cache; the Booking Service confirms the price at booking time from the authoritative pricing table.
6. Booking Engine & Saga Pattern
The booking flow spans multiple services — Inventory, Pricing, User, Payment — with no single ACID transaction boundary. We use the Saga pattern with an orchestrator (the Booking Service) coordinating the sequence and triggering compensating transactions on failure.
Booking Saga Steps
| Step | Action | Compensating Action | Owner Service |
|---|---|---|---|
| 1 | Create booking record (PENDING) | Mark booking CANCELLED | Booking Service |
| 2 | Lock inventory dates (PENDING hold) | Release inventory hold | Inventory Service |
| 3 | Confirm final price | N/A (read-only) | Pricing Service |
| 4 | Authorize & capture payment (escrow) | Void authorization or refund | Payment Service |
| 5 | Confirm inventory (PENDING → BOOKED) | Revert to AVAILABLE | Inventory Service |
| 6 | Publish BookingConfirmed event → Notifications | Publish BookingCancelled event | Booking Service |
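The forward/compensate loop behind the table above can be sketched as follows. `SagaStep` and `run` are illustrative names, not the real Booking Service API, and a production orchestrator would persist saga state between steps rather than keep it in memory:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

// Minimal saga-orchestration sketch: run steps in order, stack each completed
// step's compensation, and unwind the stack (LIFO) on the first failure.
public class BookingSaga {

    record SagaStep(String name, Supplier<Boolean> action, Runnable compensation) {}

    // Returns true if every step succeeded; otherwise runs the compensations
    // of all previously completed steps in reverse order and returns false.
    static boolean run(List<SagaStep> steps) {
        Deque<Runnable> compensations = new ArrayDeque<>();
        for (SagaStep step : steps) {
            if (step.action().get()) {
                compensations.push(step.compensation());
            } else {
                compensations.forEach(Runnable::run); // LIFO unwind
                return false;
            }
        }
        return true;
    }
}
```

Note that the failed step's own compensation is not run — only steps that completed are compensated, which is why each step must be atomic on its own.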
Idempotency in the Booking Saga
Network failures can cause the same saga step to be retried. Every saga step must be idempotent:
- Inventory lock: The `UPDATE ... WHERE status = 'available'` clause makes it naturally idempotent — if already pending, the update affects 0 rows and we query to verify our booking_id owns the hold.
- Payment capture: The payment intent is created with a unique `idempotency_key = "booking-{booking_id}-capture"` sent to Stripe. Retrying with the same key returns the same result without a double charge.
- Saga state machine: The Booking Service persists each saga step state to PostgreSQL. On retry, it reads the current state and skips already-completed steps.
Booking State Machine
```
// Booking states (stored in bookings table)
enum BookingStatus {
    PENDING,              // Created, awaiting inventory lock
    INVENTORY_HELD,       // Dates locked in availability table
    PAYMENT_AUTHORIZED,   // Payment captured in escrow
    CONFIRMED,            // Inventory confirmed BOOKED
    CHECKED_IN,           // Guest checked in (triggers escrow release timer)
    COMPLETED,            // Post-checkout, payout released to host
    CANCELLATION_PENDING, // Cancellation requested, calculating refund
    CANCELLED,            // Fully cancelled, refund processed
    FAILED                // Saga failed, all compensating actions run
}

// Saga transition table
PENDING              → INVENTORY_HELD       (on: inventory lock success)
PENDING              → FAILED               (on: inventory lock failure)
INVENTORY_HELD       → PAYMENT_AUTHORIZED   (on: payment captured)
INVENTORY_HELD       → FAILED               (on: payment failure → release inventory)
PAYMENT_AUTHORIZED   → CONFIRMED            (on: inventory confirmed)
CONFIRMED            → CHECKED_IN           (on: check-in event)
CHECKED_IN           → COMPLETED            (on: checkout + 24h timer)
CONFIRMED            → CANCELLATION_PENDING (on: cancellation request)
CANCELLATION_PENDING → CANCELLED            (on: refund processed)
```
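Enforcing the transition table in code keeps an out-of-order event (say, a replayed Kafka message) from corrupting booking state. A minimal sketch, with `BookingStateMachine` as an invented name:

```java
import java.util.Map;
import java.util.Set;

// Sketch of validating transitions against the saga table above. An illegal
// jump (e.g. PENDING → CONFIRMED, skipping payment) is simply rejected.
public class BookingStateMachine {

    enum Status { PENDING, INVENTORY_HELD, PAYMENT_AUTHORIZED, CONFIRMED,
                  CHECKED_IN, COMPLETED, CANCELLATION_PENDING, CANCELLED, FAILED }

    static final Map<Status, Set<Status>> ALLOWED = Map.of(
        Status.PENDING,              Set.of(Status.INVENTORY_HELD, Status.FAILED),
        Status.INVENTORY_HELD,       Set.of(Status.PAYMENT_AUTHORIZED, Status.FAILED),
        Status.PAYMENT_AUTHORIZED,   Set.of(Status.CONFIRMED),
        Status.CONFIRMED,            Set.of(Status.CHECKED_IN, Status.CANCELLATION_PENDING),
        Status.CHECKED_IN,           Set.of(Status.COMPLETED),
        Status.CANCELLATION_PENDING, Set.of(Status.CANCELLED));

    // Terminal states (COMPLETED, CANCELLED, FAILED) have no outgoing edges.
    static boolean canTransition(Status from, Status to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }
}
```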
7. Payment & Escrow Service
Payment in a booking platform is fundamentally different from an e-commerce checkout. The money is collected upfront but held for days or weeks before the host earns it. This escrow model protects guests (chargebacks on no-shows) while guaranteeing hosts eventual payment for delivered stays.
Payment Lifecycle
- Authorization at booking: A Stripe PaymentIntent is created with `capture_method: manual`. This pre-authorizes the card for the full amount without charging it. The authorization is valid for roughly 7 days.
- Capture at booking confirmation: Once the saga confirms successfully, the PaymentIntent is captured. Funds move to our platform Stripe account in escrow.
- Payout trigger (24h post check-in): A scheduled event fires 24 hours after the check-in date. The Payment Service creates a Stripe Transfer to the host's connected Stripe account for (total_charge - platform_fee - cleaning_fee_net). Platform fee is typically 3% from guests + 3% from hosts.
- Refund on cancellation: Refund amount depends on the listing's cancellation policy (flexible, moderate, strict). The Payment Service computes the refund amount, issues a partial or full Stripe Refund, and emits a BookingCancelled event.
Cancellation Policy Engine
| Policy | > 5 days before check-in | 2–5 days before | < 48 hours |
|---|---|---|---|
| Flexible | 100% refund | 100% refund | No refund |
| Moderate | 100% refund | 50% refund | No refund |
| Strict | 50% refund | No refund | No refund |
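The policy table translates directly into a small refund calculator. The boundary handling here (exactly 48 hours falls into the middle band) is an assumption the table leaves open; a production policy engine would pin this down in the listing terms:

```java
// Sketch of the cancellation-policy table above as a refund calculator.
public class RefundPolicy {

    enum Policy { FLEXIBLE, MODERATE, STRICT }

    // Returns the refundable fraction of the total charge.
    static double refundFraction(Policy policy, long hoursBeforeCheckIn) {
        if (hoursBeforeCheckIn > 5 * 24) {        // > 5 days before check-in
            return policy == Policy.STRICT ? 0.5 : 1.0;
        } else if (hoursBeforeCheckIn >= 48) {    // 2–5 days before
            switch (policy) {
                case FLEXIBLE: return 1.0;
                case MODERATE: return 0.5;
                default:       return 0.0;        // STRICT
            }
        }
        return 0.0;                               // < 48 hours: no refund
    }
}
```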
Preventing Duplicate Charges
The Payment Service maintains a payment_operations table with a unique constraint on (booking_id, operation_type). Before issuing any Stripe API call, it checks this table. If an entry already exists (prior successful operation), it skips the Stripe call and returns the cached result. This — combined with Stripe's idempotency keys — gives a two-layer guarantee against duplicate charges even under aggressive retries.
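A minimal in-memory sketch of that first dedup layer, with `ConcurrentHashMap.putIfAbsent` standing in for the database's unique constraint. `chargeOnce` is an invented name, and real code would also handle the brief in-progress race that this sketch glosses over:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of the payment_operations dedup check: the provider is called at
// most once per (booking_id, operation_type); retries return the recorded result.
public class PaymentDedup {

    private final Map<String, String> operations = new ConcurrentHashMap<>();

    String chargeOnce(long bookingId, String operationType, Supplier<String> providerCall) {
        String key = bookingId + ":" + operationType;
        String prior = operations.putIfAbsent(key, "IN_PROGRESS");
        if (prior == null) {
            String result = providerCall.get(); // first caller performs the charge
            operations.put(key, result);
            return result;
        }
        return operations.get(key); // retry: cached result, no duplicate charge
    }
}
```

Stripe's idempotency keys provide the second, provider-side layer of the same guarantee.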
8. Review & Trust System
Reviews are the trust mechanism that makes the entire marketplace work. Guests rely on them to choose listings; hosts rely on them to attract bookings. The review system must be tamper-resistant, prompt, and fair to both parties.
Post-Stay Review Trigger
Reviews are triggered by the BookingCompleted event, published when the booking transitions to COMPLETED state (checkout + 24 hours). The Review Service consumes this event and:
- Creates two review slots: one for the guest (reviewing the listing), one for the host (reviewing the guest).
- Publishes a notification event to prompt both parties to leave a review (email + push).
- Opens a 14-day review window. After 14 days, the review slots expire and the booking is permanently closed for reviews.
- Reviews are double-blind: neither party's review is published until both have submitted, or the 14-day window expires. This prevents bias from seeing the other party's review first.
Review Fraud Detection
Fake reviews (both positive and negative) are a marketplace threat. Multi-layer fraud detection runs on every submitted review:
- Booking verification: Reviews are only accepted from users who completed a verified stay. No booking record → review rejected at API level.
- Review ring detection: Graph analysis detects clusters of users who mutually review each other without genuine stays. Flag if reviewer and reviewee have reciprocal review patterns across >3 bookings.
- Sentiment vs. rating inconsistency: NLP model checks whether review text sentiment is consistent with the star rating. "Amazing place, would stay again!" + 1 star → flag for manual review.
- Account age and history: New accounts with no booking history that submit reviews are scored higher risk. Weighted into the fraud model.
- IP and device fingerprinting: Multiple reviews from the same IP/device in a short window → fraud flag.
- Review velocity: Host listing receiving 10+ 5-star reviews in 24 hours from accounts created in the last 7 days → auto-hold pending investigation.
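One way to combine these signals is a simple weighted risk score. The weights and the 0.7 hold threshold below are invented purely for illustration — the source describes a multi-layer system, and a real implementation would use a trained model rather than hand-set weights:

```java
// Illustrative rule-based risk score over the fraud signals listed above.
public class ReviewFraudScore {

    static double score(boolean noVerifiedStay, boolean reciprocalRing,
                        boolean sentimentMismatch, boolean newAccount,
                        boolean sharedDevice, boolean velocitySpike) {
        double s = 0;
        if (noVerifiedStay)    s += 1.0;  // hard-reject signal on its own
        if (reciprocalRing)    s += 0.5;
        if (sentimentMismatch) s += 0.3;
        if (newAccount)        s += 0.2;
        if (sharedDevice)      s += 0.4;
        if (velocitySpike)     s += 0.5;
        return Math.min(s, 1.0);
    }

    // Reviews at or above the threshold are held for manual investigation.
    static boolean shouldHold(double score) { return score >= 0.7; }
}
```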
Impact on Search Ranking
When a review is published (both parties submit or window expires), the Review Service publishes a RatingUpdated event to Kafka. The Search Service consumes this event and re-indexes the listing's aggregate rating in Elasticsearch. The ML ranking model uses the updated rating within the next 60-second reindex cycle. A single 1-star review from a previously 5-star listing will affect search position within minutes.
9. Notification Service — Email, SMS & Push
Notifications are the voice of your platform. A missed booking confirmation or a delayed check-in reminder erodes user trust as quickly as a bug in the booking flow. The notification service is event-driven, fan-out capable, and multi-channel.
Architecture
The Notification Service subscribes to multiple Kafka topics. Each domain service emits events; the Notification Service maps event types to notification templates and channels:
- BookingConfirmed: Email to guest (itinerary + directions) + Email to host (guest profile + check-in instructions reminder) + Push to both. Sent immediately.
- CheckInReminder: Email + SMS to guest 24 hours before check-in with house rules, access codes, and host contact. Sent via a scheduled Kafka message with a 24h delay.
- CheckOutReminder: Push to guest at 10 AM on checkout day reminding them of checkout time.
- BookingCancelled: Email to both parties with refund amount and timeline. Sent immediately.
- PayoutSent: Email to host confirming payout amount and expected arrival date.
- ReviewRequest: Email + Push to both parties when the review window opens (24 hours after checkout), and a reminder at 7 days if no review has been submitted.
- NewMessage: Push notification to recipient with message preview if they're not active in the app.
Channel Providers & Fallback
| Channel | Primary Provider | Fallback | Volume (Airbnb scale) |
|---|---|---|---|
| Email | AWS SES | SendGrid | ~5M emails/day |
| SMS | Twilio | MessageBird | ~500K SMS/day |
| Push (iOS) | Apple APNs | — | ~3M push/day |
| Push (Android) | Google FCM | — | ~4M push/day |
Deduplication & Rate Limiting
Kafka consumer retries can cause duplicate notifications — a guest should not receive three "Booking Confirmed" emails. The Notification Service maintains a sent_notifications table keyed by (user_id, event_id, channel). Before dispatching, it checks for an existing entry. This deduplication check uses Redis with a 72-hour TTL for high-throughput lookups before falling back to the database. Additionally, per-user rate limits (max 3 email notifications per hour, max 5 push per day for non-critical events) prevent spam fatigue.
10. Capacity Estimation
Back-of-envelope calculations anchor your infrastructure sizing decisions. These numbers reflect Airbnb-scale and are good baselines for system design interviews.
Traffic Estimates
- Users: 150 million registered users; 10 million DAU
- Listings: 7 million active listings; each with ~365 nights of availability data = 2.5 billion availability rows
- Search QPS: 10M DAU × 5 searches/user/day ÷ 86,400s = ~580 search QPS average; peak 3× = ~1,750 QPS
- Booking QPS: 1M bookings/day ÷ 86,400s = ~12 bookings/second average; peak = ~50/s
- Availability reads: Each search triggers availability checks on ~20 candidate listings → 1,750 × 20 = 35,000 availability reads/s (served from Elasticsearch)
- Availability writes: Each booking writes 1–14 nights of availability records = 50 writes/s × 7 avg nights = ~350 rows/s to inventory PostgreSQL
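The arithmetic above, checked in code. The inputs are this section's round-number assumptions, not measured values, and the prose rounds results to friendlier figures (e.g. 578.7 → "~580"):

```java
// Back-of-envelope traffic math from the estimates above.
public class CapacityMath {
    static final double DAU              = 10_000_000;
    static final double SEARCHES_PER_DAU = 5;
    static final double BOOKINGS_PER_DAY = 1_000_000;
    static final double SECONDS_PER_DAY  = 86_400;

    static double avgSearchQps()  { return DAU * SEARCHES_PER_DAU / SECONDS_PER_DAY; } // ≈ 579
    static double peakSearchQps() { return avgSearchQps() * 3; }                        // ≈ 1,736
    static double avgBookingQps() { return BOOKINGS_PER_DAY / SECONDS_PER_DAY; }        // ≈ 11.6
    static double availabilityReadsPerSec() { return peakSearchQps() * 20; }            // ≈ 34,700
}
```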
Storage Estimates
- Listing data: 7M listings × 2 KB/listing (metadata) = 14 GB. Trivial for PostgreSQL.
- Listing photos: 7M listings × 15 photos avg × 500 KB/photo = 52.5 TB in S3. Served via CloudFront CDN.
- Availability table: 7M listings × 365 nights × 60 bytes/row = ~153 GB in PostgreSQL. Fits in memory on a well-sized instance (r6g.4xlarge with 128 GB RAM).
- Elasticsearch index: 7M listings × 5 KB/doc (including nested booked ranges) = 35 GB per shard replica. A 3-node cluster with 1 primary + 2 replicas per shard handles this comfortably.
- Booking records: 1M bookings/day × 365 days × 1 KB/booking = 365 GB/year. Partition by created_date for efficient archival.
- Review data: With ~365M bookings/year and roughly half of stays reviewed, ~180M reviews/year × 1 KB ≈ 180 GB/year. Cassandra handles this write-heavy pattern at scale.
Infrastructure Sizing (Production Baseline)
| Component | Count | Instance Type | Rationale |
|---|---|---|---|
| Search (Elasticsearch) | 9 nodes (3 primaries + 6 replicas) | r6g.2xlarge (64GB RAM) | Memory for inverted index; HA |
| Inventory DB (PostgreSQL) | 1 primary + 2 read replicas | r6g.4xlarge (128GB RAM) | Availability table in memory |
| Redis Cluster | 6 nodes (3 masters, 3 replicas) | r6g.xlarge (32GB RAM) | Price cache + session + dedup |
| Kafka | 6 brokers | m6i.2xlarge + NVMe storage | Event backbone; 7-day retention |
| Booking Service (K8s pods) | 20 pods (HPA, min:10, max:50) | 2 vCPU / 4 GB | Saga orchestration is CPU-bound |
11. Scalability & Reliability Patterns
A booking platform must handle extreme seasonal peaks (New Year's Eve, major holidays) — traffic can spike 5–10× overnight. Several architectural patterns make the system elastic and fault-tolerant.
Hot Listing Problem
A listing that goes viral (featured in a magazine, shared on social media) can receive thousands of concurrent booking attempts for a handful of available dates. This creates a hotspot on specific rows in the availability table. Mitigation strategies:
- Virtual queue: Rate-limit booking attempts per listing to 100/second via a Redis token bucket. Excess requests receive a "Booking is in high demand, please wait" response with a queue position. Prevents thundering herd on the PostgreSQL row.
- Optimistic locking (already described): The first-committer-wins guarantee ensures only one booking succeeds; others get a clear conflict error and retry.
- Separate hot-listing database shard: If a listing is flagged as "viral" (search views > 10,000/hour), its availability rows are migrated to a dedicated shard with higher connection pool capacity.
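The token bucket from the first bullet, sketched in-memory with an injected clock for determinism. The 100/s capacity matches the prose; in production this state would live in Redis (typically behind a Lua script for atomicity), not the JVM:

```java
// Per-listing token bucket: allow `capacity` booking attempts, refilled
// continuously at `refillPerSecond`; excess callers are queued.
public class ListingRateLimiter {
    private final long capacity;
    private final double refillPerMs;
    private double tokens;
    private long lastRefillMs;

    ListingRateLimiter(long capacity, double refillPerSecond, long nowMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;        // start full
        this.lastRefillMs = nowMs;
    }

    // Returns true if a booking attempt may proceed; false → virtual queue.
    boolean tryAcquire(long nowMs) {
        tokens = Math.min(capacity, tokens + (nowMs - lastRefillMs) * refillPerMs);
        lastRefillMs = nowMs;
        if (tokens >= 1.0) { tokens -= 1.0; return true; }
        return false;
    }
}
```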
Multi-Region Deployment
Three AWS regions: us-east-1 (primary), eu-west-1 (Europe primary), ap-southeast-1 (APAC primary). Each region is a full active-active deployment for search and reads. Booking writes are routed to the listing's "home region" — the region where the listing was created — to keep all availability writes co-located and avoid cross-region consistency issues.
- Global load balancer: AWS Route 53 with latency-based routing directs users to the nearest region for search. Booking write requests carry a region header and are routed by the API Gateway.
- Cross-region read replicas: Listing metadata is replicated to all regions via PostgreSQL logical replication. Searches in any region can serve listing details locally.
- Cross-region event streaming: Kafka MirrorMaker 2 replicates notification and review events across regions for regional notification dispatch.
Circuit Breakers & Bulkheads
- Payment service circuit breaker: If Stripe's API error rate exceeds 5% in a 60-second window, the circuit opens. New booking requests receive a "Payment service temporarily unavailable" error. The inventory hold is never created — no ghost holds. Circuit half-opens after 30 seconds to probe recovery.
- Notification bulkhead: Notification dispatch runs in a separate thread pool from the booking confirmation path. A Twilio outage cannot block booking confirmations — the notification event is simply queued in Kafka until Twilio recovers.
- Elasticsearch fallback: If the search cluster becomes unavailable, the API Gateway falls back to serving stale results from a Redis full-page cache (5-minute TTL). Users see a "Results may be slightly outdated" banner.
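The payment circuit breaker from the first bullet can be sketched as below. The 5% threshold, 60-second window, and 30-second cooldown match the prose; the fixed-window counting (reset per window) and the minimum-sample guard are simplifications of a true sliding window:

```java
// Minimal circuit-breaker sketch: CLOSED → OPEN on error rate > 5%,
// OPEN → HALF_OPEN after a 30 s cooldown, probe result decides the rest.
public class PaymentCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int calls = 0, errors = 0;
    private long windowStartMs = 0, openedAtMs = 0;

    private static final double ERROR_THRESHOLD = 0.05;   // 5% error rate
    private static final long WINDOW_MS = 60_000, COOLDOWN_MS = 30_000;
    private static final int MIN_CALLS = 20;              // avoid tripping on tiny samples

    boolean allowRequest(long nowMs) {
        if (state == State.OPEN && nowMs - openedAtMs >= COOLDOWN_MS) state = State.HALF_OPEN;
        return state != State.OPEN;
    }

    void record(boolean success, long nowMs) {
        if (state == State.HALF_OPEN) {                   // single probe decides
            state = success ? State.CLOSED : State.OPEN;
            if (!success) openedAtMs = nowMs;
            calls = 0; errors = 0; windowStartMs = nowMs;
            return;
        }
        if (nowMs - windowStartMs >= WINDOW_MS) { calls = 0; errors = 0; windowStartMs = nowMs; }
        calls++;
        if (!success) errors++;
        if (calls >= MIN_CALLS && (double) errors / calls > ERROR_THRESHOLD) {
            state = State.OPEN;
            openedAtMs = nowMs;
        }
    }
}
```

While the circuit is open, the Booking Service rejects new bookings before creating any inventory hold — that ordering is what prevents ghost holds.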
Database Sharding Strategy
The availability table is the most write-intensive database table in the system. As the platform grows beyond 20 million listings, a single PostgreSQL instance will not sustain the write throughput. Sharding by listing_id % num_shards distributes the load evenly. The Inventory Service uses a consistent hash ring to determine which shard owns a given listing_id, with the shard mapping cached in Redis for sub-millisecond routing. New shards are added by splitting existing shards (Citus extension for PostgreSQL enables transparent sharding without application code changes).
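The consistent-hash routing described above can be sketched with a `TreeMap` ring. The FNV-1a hash and the virtual-node count are illustrative choices — any stable hash works, but `Object.hashCode()` would not, since routing must survive JVM restarts:

```java
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hash shard router: each shard places many virtual nodes on the
// ring; a listing maps to the first vnode clockwise from its hash.
public class ShardRouter {
    private final TreeMap<Long, Integer> ring = new TreeMap<>();

    ShardRouter(int numShards, int vnodesPerShard) {
        for (int shard = 0; shard < numShards; shard++)
            for (int v = 0; v < vnodesPerShard; v++)
                ring.put(hash(shard + ":" + v), shard);
    }

    int shardFor(long listingId) {
        SortedMap<Long, Integer> tail = ring.tailMap(hash(Long.toString(listingId)));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // FNV-1a: stable across processes, unlike Object.hashCode().
    private static long hash(String key) {
        long h = 0xcbf29ce484222325L;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= b;
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

Adding a shard moves only the keys between the new vnodes and their ring predecessors, which is what makes shard splits incremental rather than a full reshuffle.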
12. System Design Interview Checklist
When asked to design a hotel booking system like Airbnb or Booking.com in a system design interview, hit these points to demonstrate senior-level thinking:
Requirements Clarification (5 min)
- ✅ What is the scale? (number of listings, daily bookings, concurrent searches)
- ✅ Instant booking vs. host approval flow?
- ✅ What cancellation policies are supported?
- ✅ Is this a global system (multi-currency, multi-region)?
- ✅ What is the consistency requirement for availability? (This is the key question — answer: strong consistency for booking writes, eventual for search)
Core Design Decisions to Discuss
- ✅ Double-booking prevention: Explain optimistic locking on availability table, pending hold TTL, and saga compensating transactions
- ✅ Search vs. booking consistency split: Elasticsearch for search (eventually consistent), PostgreSQL for booking (strongly consistent) — explain why
- ✅ Saga pattern for distributed transaction: Name each step, its compensating action, and the idempotency mechanism
- ✅ Payment escrow: Capture at booking, release 24h post check-in — not a simple charge
- ✅ Hot listing handling: Virtual queue + rate limiting + dedicated shard
- ✅ Pricing architecture: Pre-computed nightly batch + real-time event override, cached in Redis
- ✅ Review double-blind window: 14-day window, neither review published until both submitted
Reliability & Failure Scenarios to Address
- ✅ Payment service is down mid-booking → circuit breaker, no ghost inventory holds
- ✅ Saga step fails after inventory held but before payment → compensating transaction releases hold
- ✅ Duplicate booking request (retry storm) → idempotency keys at every step
- ✅ Elasticsearch cluster goes down → Redis full-page cache fallback
- ✅ Database primary failure → automatic promotion of read replica (RDS Multi-AZ, < 30s failover)
- ✅ Kafka consumer lag spike → dead letter queue for failed notification events, replay after service recovery
Common Mistakes in Interviews
- ❌ Using a single global transaction across services (impossible in microservices without 2PC overhead)
- ❌ Querying the inventory PostgreSQL database for every search request (will not scale)
- ❌ Storing availability as a binary "available" flag on the listing row (no date granularity, race condition on concurrent updates)
- ❌ Making payment synchronously in the user's HTTP request without a timeout and saga retry (payment providers have p99 latencies of 3–8 seconds)
- ❌ Forgetting the pending hold TTL mechanism (orphaned holds permanently block availability)
- ❌ Not addressing idempotency for payment — "what if the same booking request is retried twice?" is always asked
Key Metrics to Monitor in Production
- Booking conversion funnel: Search → Listing view → Booking initiation → Booking confirmed. Drop-off at each stage.
- Saga failure rate per step: Which saga step fails most? Inventory conflicts mean demand exceeds supply for those dates. Payment failures mean card issuer issues.
- Pending hold leak rate: Number of holds older than 15 minutes that were not released by the saga — indicates saga failure without compensating action.
- Availability search accuracy: False-positive rate (search shows available, booking fails) — target < 1%. Higher means Elasticsearch sync lag is too long.
- Payment escrow balance: Total funds held in escrow at any point. Monitored for financial reconciliation and fraud detection (unusual accumulation).
- Review fraud flag rate: Percentage of submitted reviews flagged by fraud detection. Track over time for adversarial trends.