Designing Instagram at Scale: Photo/Video Upload, Feed Generation, Social Graph & CDN
Instagram serves 2 billion monthly active users, processes 100 million photo and video uploads every single day, and must generate a personalized, ML-ranked feed for each user in under 200 ms. This deep-dive covers every major subsystem — from resumable S3 multipart upload and async transcoding to the celebrity fan-out problem, Redis sorted-set timelines, Stories TTL, Elasticsearch hashtag search, and global CDN edge delivery — giving you a battle-ready mental model for the system design interview and production architecture decisions.
TL;DR — Key Design Decisions in One Paragraph
"Use resumable S3 multipart upload with async FFmpeg transcoding for media ingestion. Model the social graph in a sharded graph database and cache hot edges in Redis. Solve the celebrity fan-out problem with a hybrid push-pull feed: pre-compute timelines for normal users, pull-on-read for celebrity posts. Rank the feed with a two-tower ML model at retrieval + a lightweight scorer at re-rank. Serve media via a global CDN with adaptive WebP/AVIF resizing. Store Stories in Redis with a 24-hour TTL and append-only viewership logs."
1. Functional & Non-Functional Requirements
Every great system design interview starts by narrowing scope. Instagram's core product surface covers six major capability areas — pin them down before sketching any boxes and arrows.
Functional Requirements
- Media Post: Users upload photos (JPEG/PNG/HEIC, up to 20 MB) and videos (MP4/MOV; up to 60 s / 100 MB for feed posts, up to 90 s for Reels). Each post supports a caption, hashtags, location tag, and alt-text.
- Follow / Unfollow: Directed relationships (A follows B does not imply B follows A). Private accounts require a follow-request approval step. Users may block others, removing all social edges between them.
- Feed: Chronologically ordered (legacy) or ML-ranked personalized feed of posts from followed accounts. Must support infinite scroll with cursor-based pagination, not offset pagination.
- Stories: Short-lived ephemeral media (photo or 15-second video clip) visible for 24 hours to followers. Story viewership visible only to the poster. Multiple segments per day, sequential playback ordered by recency.
- Search & Discover: Full-text search for hashtags and usernames. Explore page showing trending and personalized content to non-followers.
- Direct Messages (DMs): One-to-one and group chats (up to 32 participants) supporting text, reactions, photo/video, voice messages, and link previews. Messages persist indefinitely.
- Notifications: Push (APNs / FCM) and in-app notifications for likes, comments, follows, DMs, and story views.
- Likes & Comments: Counter-based likes (approximate at scale) and threaded comments with nested replies up to 2 levels deep.
Non-Functional Requirements
| Attribute | Target | Notes |
|---|---|---|
| Feed load latency (P99) | < 200 ms | Pre-computed timeline read from Redis cache |
| Photo upload throughput | 100M uploads/day | ≈ 1,160 uploads/sec average; 3× for peak |
| Media storage growth | ~500 PB/year | Multi-resolution variants + originals in S3 |
| Availability (SLA) | 99.99% | < 53 min downtime/year; active-active multi-region |
| Media delivery latency | < 50 ms (TTFB via CDN) | Edge PoP within 20 ms RTT of 95% of users |
| DM delivery guarantee | At-least-once; idempotent | Client-side deduplication on message ID |
The system must also satisfy eventual consistency for likes/follower counts (acceptable to lag 1–2 seconds), strong consistency for financial-adjacent operations like account suspension, and read-heavy scalability with a read:write ratio exceeding 100:1 on the feed path.
2. High-Level Architecture Overview
At the highest level, Instagram's architecture is a collection of independently scalable microservices connected by asynchronous event streams. Traffic enters through a global load balancer tier and fans out to domain-specific services. The following diagram captures the major data flows.
Core Service Domains
- API Gateway & Auth: TLS termination, JWT/OAuth 2.0 validation, rate limiting (token bucket per user ID + IP), and request routing. Built on Nginx + Envoy for L7 traffic management.
- User Service: Account creation, profile management, authentication token issuance. PostgreSQL primary + read replicas. PII encrypted at rest with customer-managed KMS keys.
- Media Service: Upload coordinator, multipart S3 pre-signed URL generation, transcoding job dispatch, and metadata persistence (PostgreSQL + Cassandra for scale).
- Social Graph Service: Follow/unfollow mutations and follower/following reads. Graph stored in a custom sharded adjacency-list store (Facebook TAO-inspired) with Redis hot-edge caching.
- Feed Service: Timeline cache management in Redis Sorted Sets; fan-out worker consumers on Kafka; ML ranking pipeline integration.
- Story Service: Upload, metadata, TTL-based expiry, sequential playback ordering, and viewership tracking.
- Search Service: Elasticsearch cluster for hashtag + username indexing. Explore page ML ranking via a separate candidate-generation pipeline.
- Messaging Service: WebSocket connection manager, message persistence in Cassandra, end-to-end encryption key exchange.
- Notification Service: Fan-out to APNs/FCM push, in-app notification store with read/unread state.
- CDN & Media Delivery: Global CDN PoP network serving all media assets. Origin shield in each AWS region reduces S3 origin load by 95%+.
All inter-service communication uses gRPC for synchronous paths and Kafka for async event streams. Service mesh (Istio) enforces mutual TLS between services, provides circuit breakers, and emits distributed traces to Jaeger.
3. Media Upload Pipeline
Uploading 100 million media files per day on mobile networks — many of which are flaky 4G connections — demands a pipeline designed for resilience, efficiency, and deduplication from the ground up. A naive direct-upload approach fails at this scale for three reasons: mobile connections drop, large files timeout on standard HTTP connections, and storing duplicate content wastes petabytes of storage.
Step 1 — Resumable Upload via S3 Multipart
The client never uploads directly to the API servers. Instead:
- Client requests an upload session from the Media Service, sending file size, MIME type, and a client-generated UUID.
- Media Service calls S3 `CreateMultipartUpload` and generates a batch of pre-signed URLs (one per 5 MB part). These URLs expire in 1 hour and are returned to the client.
- Client uploads each part in parallel (4 concurrent connections) directly to S3. On network failure, it resumes from the last successfully uploaded part using the `uploadId` stored in local device storage.
- After all parts are uploaded, the client calls the Media Service to trigger `CompleteMultipartUpload`. The original file is now in an S3 `raw/` prefix.
// Media Service — initiate multipart upload (Spring Boot example)
@PostMapping("/upload/initiate")
public UploadSession initiateUpload(@RequestBody UploadRequest req) {
String uploadId = s3Client.createMultipartUpload(
CreateMultipartUploadRequest.builder()
.bucket("instagram-raw-media")
.key("raw/" + req.userId() + "/" + req.clientUuid())
.contentType(req.mimeType())
.serverSideEncryption(ServerSideEncryption.AWS_KMS)
.build()
).uploadId();
List<PresignedUrl> parts = IntStream.range(0, numParts(req.fileSizeBytes()))
.mapToObj(i -> presignPart(uploadId, i + 1, req))
.toList();
return new UploadSession(uploadId, parts, Instant.now().plusSeconds(3600));
}
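The client-side bookkeeping behind resumable upload is simple arithmetic over part numbers. A minimal Python sketch (function and variable names are illustrative, not Instagram's actual client code):

```python
import math

PART_SIZE = 5 * 1024 * 1024  # 5 MB parts, matching the pre-signed URL batch

def num_parts(file_size_bytes: int) -> int:
    """Number of 5 MB parts needed to cover the file."""
    return math.ceil(file_size_bytes / PART_SIZE)

def parts_to_resume(total_parts: int, completed: set) -> list:
    """Given the 1-based part numbers already ACKed by S3, return what's left.

    The client persists `completed` (plus the uploadId) in local storage so a
    dropped connection resumes instead of restarting from byte zero."""
    return [p for p in range(1, total_parts + 1) if p not in completed]

# A 23 MB photo needs 5 parts; if parts 1-3 finished before the network
# dropped, only parts 4 and 5 are re-requested on resume.
total = num_parts(23 * 1024 * 1024)            # → 5
remaining = parts_to_resume(total, {1, 2, 3})  # → [4, 5]
```

This is why the pre-signed URLs are issued as a batch up front: the client can retry any individual part without another round trip to the Media Service.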
Step 2 — Perceptual Hash Deduplication
Before triggering transcoding, the pipeline checks for duplicate content using perceptual hashing (pHash for images, video fingerprinting using frame-sampling). A perceptual hash is a 64-bit fingerprint that is similar for visually identical or near-identical images even after re-encoding. The process:
- An S3 event triggers a Lambda that computes the pHash of the uploaded file.
- The hash is queried against a Redis Bloom filter (billions of entries, < 1% false positive rate) for fast rejection of non-duplicates.
- On a Bloom filter hit, a secondary lookup in a DynamoDB deduplication table confirms exact match or near-duplicate (Hamming distance < 10 on pHash).
- Confirmed duplicates are not stored again. S3 has no filesystem-style hard links, so deduplication is logical: the new post gets a metadata record pointing at the existing media key, and no bytes are duplicated.
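The fingerprint-compare step is easy to illustrate. A toy Python sketch using an average hash (production pHash uses a DCT, but the 64-bit fingerprint and Hamming-distance comparison work the same way; all names here are illustrative):

```python
def average_hash(pixels_8x8: list) -> int:
    """Toy 64-bit perceptual hash: one bit per pixel of an 8x8 grayscale
    downsample, set when the pixel is brighter than the mean."""
    assert len(pixels_8x8) == 64
    mean = sum(pixels_8x8) / 64
    bits = 0
    for i, p in enumerate(pixels_8x8):
        if p > mean:
            bits |= 1 << i
    return bits

def is_near_duplicate(hash_a: int, hash_b: int, max_distance: int = 10) -> bool:
    """Hamming distance on the 64-bit hashes; fewer than 10 differing bits
    is treated as the same image, as in the DynamoDB confirmation step."""
    return bin(hash_a ^ hash_b).count("1") < max_distance

img = [10 * (i % 7) for i in range(64)]
reencoded = [p + 1 for p in img]  # slight brightness shift after re-encoding
assert is_near_duplicate(average_hash(img), average_hash(reencoded))
```

Because the hash survives re-encoding, the same photo uploaded twice through different apps still collides in the dedup table even though the raw bytes differ.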
Step 3 — Async Transcoding Queue
After deduplication, a Kafka message is published to the media.transcoding.requested topic. A fleet of EC2 GPU-backed transcoding workers (G4dn instances with FFmpeg + hardware-accelerated H.264/H.265/AV1 encoding) consumes these events. Each upload produces multiple output variants stored in S3 under a processed/ prefix:
- Photos: 320px, 640px, 1080px JPEG (progressive, quality 85) + WebP variants. HEIC inputs are converted to JPEG as the primary store.
- Videos: 360p, 720p, 1080p H.264 MP4 (for broad device support) + AV1 for modern clients. HLS segmented playlist (.m3u8) for adaptive bitrate streaming.
- Thumbnails: 3-frame GIF preview and a static poster frame JPEG used for timeline placeholder.
Once all variants are ready, the transcoding worker publishes a media.transcoding.completed Kafka event. The Post Service consumes this and updates the post status from PROCESSING to PUBLISHED, making it visible in feeds. The client polls a lightweight status endpoint (long-polling with 30-second timeout) to detect when the post is ready.
Uploading through API servers would require sticky sessions (defeats horizontal scaling), saturate API server memory with large file buffers, and block request threads during slow mobile uploads. Pre-signed S3 URLs shift the transfer burden directly to S3's massive ingestion infrastructure — the API server merely coordinates, never touches the raw bytes.
4. Social Graph Service
The social graph — who follows whom — is the connective tissue of every feed, notification, and recommendation on Instagram. At 2 billion users with an average of 300 follow relationships each, the graph contains ~600 billion directed edges. This is far too large for a single relational database.
Data Model & Storage
The graph is stored as an adjacency list in a horizontally sharded data store. Each edge is a row: (follower_id, followee_id, created_at, status). The table is sharded on follower_id for efficient "who does user X follow?" reads, and a mirror table is sharded on followee_id for "who follows user X?" reads. The two tables are kept in sync by writing each edge mutation to a durable log (e.g., a transactional Kafka topic or an outbox table) and applying it to both shards; a periodic reconciliation job repairs any divergence, since naive dual writes alone would drift under partial failures.
-- Followings table (sharded on follower_id)
CREATE TABLE followings (
follower_id BIGINT NOT NULL,
followee_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
status SMALLINT NOT NULL DEFAULT 1, -- 1=active, 2=pending, 3=blocked
PRIMARY KEY (follower_id, followee_id)
) PARTITION BY HASH (follower_id);
-- Followers table (mirror, sharded on followee_id)
CREATE TABLE followers (
followee_id BIGINT NOT NULL,
follower_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
status SMALLINT NOT NULL DEFAULT 1,
PRIMARY KEY (followee_id, follower_id)
) PARTITION BY HASH (followee_id);
Sharding Strategy
With 600 billion edges, a single PostgreSQL cluster cannot handle this. Instagram uses consistent hashing with 1,024 virtual nodes spread across a cluster of 128 shards, giving ~4.7 billion edges per shard — comfortably within range of a tuned PostgreSQL instance with SSD storage. A routing layer hashes the user_id onto the virtual-node ring to find the owning shard (a naive user_id % num_shards scheme would reshuffle nearly every key whenever a shard is added); the ring topology is cached in a local config file updated via ZooKeeper.
Hot spots occur for celebrity accounts with 100M+ followers. The followers table shard for Cristiano Ronaldo's followee_id would receive millions of fan-out reads during feed generation. The mitigation is a dedicated celebrity shard (identified by follower count > 10M) replicated with 10× read replicas and aggressive result caching in Redis with a 60-second TTL.
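The routing layer's ring lookup fits in a few lines. A minimal Python sketch of consistent hashing with virtual nodes (class and shard names are illustrative):

```python
import bisect
import hashlib

class ShardRing:
    """Consistent-hash ring: virtual nodes spread over the physical shards,
    so adding a shard remaps only ~1/num_shards of the keys (unlike naive
    user_id % num_shards, which remaps almost everything)."""

    def __init__(self, shards: list, vnodes_per_shard: int = 8):
        self.ring = sorted(
            (self._h(f"{s}#{v}"), s)
            for s in shards
            for v in range(vnodes_per_shard)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _h(key: str) -> int:
        # 64-bit hash point on the ring
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def shard_for(self, user_id: int) -> str:
        # first virtual node clockwise from the key's hash owns it
        i = bisect.bisect(self.points, self._h(str(user_id))) % len(self.ring)
        return self.ring[i][1]

ring = ShardRing([f"shard-{i}" for i in range(128)])
# Lookups are deterministic: the same user always routes to the same shard.
assert ring.shard_for(42) == ring.shard_for(42)
```

In production the equivalent of `ShardRing` is loaded from the ZooKeeper-distributed config so every service instance agrees on the topology.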
Redis Hot-Edge Cache
For feed generation, the most critical read pattern is: "give me the list of users that user X follows." This set is cached in Redis as a Sorted Set keyed by following:{user_id}, with score = created_at unix timestamp. The set is lazily populated on first feed request and has a 24-hour TTL. Cache size is bounded to the 1,000 most recent follows — for heavy-following accounts, only the most recent connections affect feed relevance anyway.
Friend-of-Friend Suggestions
The "Suggested for You" feature uses second-degree graph traversal. A nightly Spark batch job computes intersection counts of mutual connections for each user. Results are stored in a suggestions:{user_id} Redis list, sorted by mutual connection count descending. The job also incorporates signals like contact book matching (phone number hash lookup), location clustering (users frequently at the same geofence), and co-engagement on hashtags.
5. Feed Generation: Push, Pull & Hybrid
Feed generation is the central algorithmic and systems challenge in Instagram's architecture. The goal: when a user opens the app, surface the ~20 most relevant posts from their network within 200 ms. At 500 million daily active users refreshing their feed multiple times per day, this is billions of feed-read requests per day.
Pure Push (Fan-out on Write)
When user A publishes a post, the system immediately writes that post ID into the timeline cache (a Redis Sorted Set) of every follower of A. Feed reads become pure cache hits — O(1) per user.
- Pros: Near-instant feed reads. Read path is extremely simple (one Redis ZREVRANGE call).
- Cons: Celebrity fan-out is catastrophic. A user with 200M followers generates 200M Redis write operations on a single post — this takes minutes even with a large Kafka consumer fleet and Redis cluster, delaying post visibility. Also wastes resources writing to timelines of inactive users.
Pure Pull (Fan-out on Read)
On feed read, the system fetches the list of accounts user X follows, queries each account's post store for recent posts, merges and sorts the results, then applies ML ranking. No pre-computation needed.
- Pros: No write amplification. Works perfectly for celebrity accounts. Always fresh.
- Cons: Extremely slow for users following many accounts. A user following 2,000 accounts would require 2,000 database lookups per feed refresh, adding hundreds of milliseconds of latency. Does not scale to 500M DAU.
Hybrid Approach — The Production Solution
Instagram uses a hybrid model that combines push for normal users and pull for celebrities, merging results at read time:
- On post creation: the Post Service publishes a `post.created` Kafka event containing `post_id`, `author_id`, and `author_follower_count`.
- Fan-out workers consume this event. If `author_follower_count < 100,000` (non-celebrity), they fan out the `post_id` to the Redis timeline sorted set of each active follower (using a Lua script for atomic ZADD + ZREMRANGEBYRANK to cap timeline size at 300 entries). Active = logged in within the past 7 days.
- If `author_follower_count ≥ 100,000` (celebrity), fan-out is skipped at write time. The post is written only to the celebrity's own post store in Cassandra.
- On feed read: the Feed Service reads the user's pre-computed timeline from Redis (normal posts). It also fetches the list of celebrity accounts the user follows (cached in a small Redis set `celebrities_followed:{user_id}`) and pulls their recent posts from Cassandra directly.
- Both sets are merged, sorted by timestamp, and passed to the ML ranking layer. The final ranked list is returned to the client.
-- Redis timeline cache key structure:
-- timeline:{user_id} → Sorted Set of post_ids, score = unix_timestamp
-- Fan-out Lua script (atomic ZADD + trim)
local key = "timeline:" .. KEYS[1]
local score = ARGV[1] -- unix timestamp of post
local post_id = ARGV[2] -- post ID
local max_size = 300 -- cap timeline depth
redis.call("ZADD", key, score, post_id)
redis.call("ZREMRANGEBYRANK", key, 0, -(max_size + 1))
redis.call("EXPIRE", key, 86400 * 7) -- 7-day TTL
// Feed read pseudocode (Java-flavored)
List<Long> precomputedPostIds = redis.zrevrange("timeline:" + userId, 0, 99);
List<Post> precomputedPosts = postStore.getByIds(precomputedPostIds); // hydrate IDs into post objects
Set<Long> celebrityFollowees = redis.smembers("celebrities_followed:" + userId);
List<Post> celebrityPosts = cassandra.getRecentPosts(celebrityFollowees, 50);
List<Post> mergedFeed = mergeByTimestampDesc(precomputedPosts, celebrityPosts);
return mlRanker.rank(mergedFeed, userId);
Timeline Cache Invalidation
When a user unfollows another, all posts from the unfollowed account must be removed from the Redis timeline. This is done lazily — the Feed Service filters out posts from unfollowed accounts during the read path by re-checking a compact Bloom filter of the user's current follows rather than eagerly removing entries from Redis, avoiding expensive set-difference operations.
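The lazy-filtering idea hinges on a Bloom filter's asymmetry: it can false-positive (a stale post slips through briefly, which is harmless) but never false-negative (a followed author is never filtered out). A self-contained Python sketch (all names illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over followee IDs for read-path filtering."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: int):
        # derive k positions from one SHA-256 digest
        d = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(d[i * 4:(i + 1) * 4], "big") % self.size

    def add(self, item: int):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: int) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

follows = BloomFilter()
for author_id in (7, 8, 9):      # user's current follow list
    follows.add(author_id)

# Timeline entries are (post_id, author_id); author 55 was unfollowed.
timeline = [(101, 7), (102, 55), (103, 9)]
visible = [pid for pid, author in timeline if follows.might_contain(author)]
```

The filter for a user's follow set is tiny (a few KB) and is rebuilt whenever the follow list changes, which is far cheaper than scanning every timeline sorted set to evict one author's posts.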
6. ML Feed Ranking
The chronological feed was retired years ago. Today's Instagram feed is ranked by a multi-stage ML pipeline designed to predict the probability that a given user will engage with each candidate post — where engagement encompasses likes, comments, shares, saves, and "time spent viewing."
Stage 1 — Candidate Generation
The hybrid feed read produces ~500 candidate post IDs. A lightweight candidate retrieval model (a two-tower neural network with a query tower encoding user context and a document tower encoding post features) computes approximate nearest-neighbor scores using FAISS to quickly reduce the 500 candidates to the 100 most relevant, discarding obviously irrelevant content before the expensive re-ranking stage.
Feature Engineering
The ranking model consumes three categories of features, all pre-computed and stored in a low-latency feature store (Redis + offline batch precomputation in Spark):
- User features: Account age, past 7-day engagement rate per content type (photos vs. Reels vs. Stories), followed hashtags, preferred posting time zones, device type.
- Post features: Age of post, number of likes/comments/saves at prediction time, media type, caption sentiment score, detected objects/scenes from Vision AI, hashtag popularity percentile.
- Interaction features: Historical engagement rate between user and author (have they liked the last 5 posts?), time since last interaction, direct message history (strong positive signal), mutual follows.
Stage 2 — Re-Ranking
The top 100 candidates from Stage 1 are passed to a LightGBM gradient-boosted tree re-ranking model that outputs a probability score for each of 5 engagement types: like, comment, share, save, and 3-second video view. A weighted sum of these probabilities produces the final ranking score. The weights themselves are tuned via multi-objective Bayesian optimization to balance short-term engagement with longer-term user satisfaction metrics (session length, return visits).
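The final scoring step is just a dot product between predicted probabilities and tuned weights. A Python sketch (the weight values here are illustrative, not Instagram's; in production they come out of the Bayesian optimization described above):

```python
def final_score(probs: dict, weights: dict) -> float:
    """Weighted sum of the re-ranker's per-engagement probabilities."""
    return sum(weights[k] * probs[k] for k in weights)

# Hypothetical weights: comments/shares/saves are stronger satisfaction
# signals than a passive like or a 3-second view.
WEIGHTS = {"like": 1.0, "comment": 3.0, "share": 4.0, "save": 4.0,
           "video_view_3s": 0.5}

probs = {"like": 0.20, "comment": 0.02, "share": 0.01, "save": 0.03,
         "video_view_3s": 0.40}
score = final_score(probs, WEIGHTS)
# 1.0*0.20 + 3.0*0.02 + 4.0*0.01 + 4.0*0.03 + 0.5*0.40 = 0.62
```

Candidates are then sorted by `score` descending before the diversity layer runs.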
Diversity & Anti-Obsession Constraints
A post-ranking diversity layer enforces business rules to prevent the feed from becoming a single-author echo chamber:
- No more than 3 consecutive posts from the same author.
- At most 20% of the feed from accounts the user has not interacted with in 30 days (prevents ghosting of followed friends).
- Reels are capped at 40% of feed slots to avoid Reels-only feeds alienating photo-preference users.
- Content from accounts the user muted is excluded before ranking.
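One plausible way to implement the consecutive-author rule is a greedy pass over the score-ordered list. A Python sketch (the algorithm and names are illustrative, not Instagram's published implementation):

```python
def enforce_author_diversity(ranked_posts: list, max_run: int = 3) -> list:
    """Greedy re-order: take the highest-ranked remaining post whose author
    would not create a (max_run + 1)-th consecutive slot; if every remaining
    candidate violates the rule, relax it rather than starve the feed."""
    remaining, result = list(ranked_posts), []
    while remaining:
        pick = next(
            (p for p in remaining
             if not (len(result) >= max_run
                     and all(r["author"] == p["author"]
                             for r in result[-max_run:]))),
            remaining[0],  # all candidates violate → fall back to top-ranked
        )
        remaining.remove(pick)
        result.append(pick)
    return result

# Four posts from author A ranked above one from B:
feed = [{"id": i, "author": "A"} for i in range(4)] + [{"id": 9, "author": "B"}]
out = enforce_author_diversity(feed)
# The 4th consecutive A-post is deferred until B breaks the run:
# authors come out as A, A, A, B, A
```

The same pass can carry the Reels-share and low-interaction-account caps as additional predicates on `pick`.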
A/B Testing Infrastructure
Every ranking model change ships via a statistically rigorous A/B test. Instagram uses a user-level bucketing system (not request-level) so each user sees a consistent experience within an experiment cohort. Experiments run for minimum 7 days (to capture weekly usage patterns) and measure holdout impact on D1/D7 retention, not just immediate engagement rate — since hyper-optimizing for likes degrades long-term user satisfaction.
7. Stories System
More than 500 million accounts use Instagram Stories daily, generating billions of story plays per day. The ephemeral nature — 24-hour expiry — makes Stories a fundamentally different system from the main feed, requiring a TTL-first data model and a viewership tracking system that can handle extreme read/write concurrency.
TTL-Based Expiry Architecture
Stories metadata is stored in two layers:
- Redis (hot layer): `story:{story_id}` — a Hash containing media URL, caption, author ID, and publish timestamp. The TTL is set with `EXPIREAT` to the exact Unix timestamp 86,400 seconds (24 hours) after post time. The story is also added to `user_stories:{author_id}` (a Sorted Set with score = expiry timestamp), which powers the story ring ordering on profile pages.
- Cassandra (durable layer): Stories are written with an 86,400-second TTL (the table's `default_time_to_live`), so Cassandra's compaction process garbage-collects expired data without any application-level cleanup job. After expiry, media assets in S3 are moved to a cold `stories-archive/` bucket (S3 Glacier Instant Retrieval) via S3 Lifecycle rules — users retain access to their own archived stories.
-- Cassandra story schema with a table-level default TTL
CREATE TABLE stories (
author_id BIGINT,
story_id TIMEUUID, -- naturally ordered by creation time
media_url TEXT,
caption TEXT,
created_at TIMESTAMP,
expires_at TIMESTAMP,
PRIMARY KEY (author_id, story_id)
) WITH CLUSTERING ORDER BY (story_id DESC)
AND default_time_to_live = 86400; -- 24-hour TTL
-- Insert with per-row TTL override
INSERT INTO stories (author_id, story_id, media_url, caption, created_at, expires_at)
VALUES (?, now(), ?, ?, toTimestamp(now()), ?)
USING TTL 86400;
Story Tray Ordering
The story tray (the row of profile circles at the top of the home feed) is ordered by a combination of factors: (1) unseen stories first, sorted by the poster's last story timestamp descending; (2) seen stories ordered by recency of the post. The user's seen/unseen state for each story segment is tracked in a story_views:{viewer_id} Redis Bloom filter — a bit set per story ID. Bloom filters give O(1) membership checks at the cost of a tiny false positive rate, which is acceptable (a story mistakenly marked as "seen" is a minor UX issue, not a correctness bug).
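The two-level ordering (unseen first, then recency) reduces to a compound sort key. A Python sketch of the tray comparator (field names are illustrative):

```python
def tray_order(authors: list) -> list:
    """Story tray ordering: rings with unseen stories first, and within each
    group the author whose latest story is newest comes first. `has_unseen`
    is the result of the per-viewer Bloom-filter membership check."""
    return sorted(
        authors,
        # False sorts before True, so `not has_unseen` puts unseen rings
        # first; negating the timestamp makes newer stories sort earlier.
        key=lambda a: (not a["has_unseen"], -a["last_story_ts"]),
    )

tray = tray_order([
    {"name": "alice", "has_unseen": False, "last_story_ts": 500},
    {"name": "bob",   "has_unseen": True,  "last_story_ts": 100},
    {"name": "carol", "has_unseen": True,  "last_story_ts": 300},
])
# → carol (unseen, newest), bob (unseen), alice (seen)
```

In production this sort runs over a small per-user candidate list, so the occasional Bloom-filter false positive merely demotes one ring to the "seen" group.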
Viewership Tracking
When a user views a story, the system must record: who viewed it, when, and whether they replied or reacted. At Instagram scale, this is ~10 billion view events per day. The implementation:
- View events are published to the Kafka topic `story.views`. Kafka absorbs 10M+ events/sec and provides durability without blocking the user's request.
- A Flink streaming job consumes the events, incrementing an aggregated `story_view_count:{story_id}` Redis counter (INCR) and appending individual viewer IDs to a Cassandra `story_viewers` table (partition key = story_id, clustering key = viewer_id) with the same 86,400 s TTL.
- The poster sees "Viewed by @alice, @bob, and 1,247 others" — the top viewer display names come from a Redis sorted set (score = view timestamp) and the aggregate from the Redis counter.
8. CDN & Media Delivery
Instagram serves billions of media assets per day to users in every country on earth. Without a global CDN, every photo request would traverse thousands of kilometers to an AWS region, adding hundreds of milliseconds of latency and generating massive inter-region bandwidth costs. CDN delivery is not optional — it is foundational.
Global PoP Network
Instagram's CDN (built on Meta's own edge infrastructure, with Akamai and Cloudflare as fallback) operates 200+ Points of Presence (PoPs) across 100+ countries. Design goals for PoP placement:
- 95% of global internet users should be within 20 ms RTT of a PoP.
- Each PoP has 10–100 TB of SSD cache, sized to hold the "working set" of media accessed in the past 48 hours for that geography.
- PoPs are interconnected via a private backbone network (BGP-optimized for low latency) to avoid routing through public internet for cache misses going back to origin.
Adaptive Image Resizing & Format Conversion
Rather than storing fixed-resolution variants for every possible screen size, Instagram uses on-the-fly image processing at the CDN edge (or a regional Image Processing Service that sits behind the CDN as an origin):
- The image URL encodes dimensions and format, e.g. `https://cdn.instagram.com/media/{id}/320x320/f.webp` (not the actual URL format, but it illustrates the principle).
- The edge server parses the URL, fetches the original from S3 (via Origin Shield, a regional cluster that aggregates S3 requests), resizes using libvips (among the fastest image-processing libraries available), and converts to WebP for browsers/apps that support it (Accept-header negotiation) or AVIF for next-gen clients.
- The processed variant is cached at the PoP with a Cache-Control: public, max-age=31536000 (1 year) immutable header, since processed images are keyed by content hash and never mutate.
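The edge-side URL parse that drives all of this is a small exercise in validation. A Python sketch against the illustrative path shape above (the pattern is hypothetical, since the real CDN URL scheme is not public):

```python
import re

# Illustrative pattern for /media/{id}/{w}x{h}/f.{format}
VARIANT_RE = re.compile(
    r"^/media/(?P<id>\w+)/(?P<w>\d+)x(?P<h>\d+)/f\.(?P<fmt>webp|avif|jpeg)$"
)

def parse_variant(path: str):
    """Edge-side parse of the requested variant. The returned tuple serves as
    both the cache key and the resize/convert instruction on a cache miss;
    anything that doesn't match the whitelist is rejected at the edge."""
    m = VARIANT_RE.match(path)
    if not m:
        return None
    return m["id"], int(m["w"]), int(m["h"]), m["fmt"]

assert parse_variant("/media/abc123/320x320/f.webp") == ("abc123", 320, 320, "webp")
assert parse_variant("/media/abc123/huge/f.bmp") is None  # unsupported → reject
```

Whitelisting dimensions and formats in the pattern also prevents a malicious client from forcing the image processor to render arbitrary (cache-busting) sizes.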
Cache Invalidation Strategy
Media assets are immutable by design — a photo uploaded to Instagram never changes, only its metadata (caption, tags) may change. This makes cache invalidation rare. When it does occur (e.g., CSAM removal requiring emergency CDN purge), Instagram uses:
- Tag-based invalidation: Assets are tagged with the author's user ID and post ID at cache insertion. Purge APIs accept wildcard patterns matching all variants of a given post.
- Surrogate keys: CDN edge nodes maintain a surrogate key index. A single API call propagates the invalidation to all PoPs in < 30 seconds globally.
- Origin-side 404: If a media asset is deleted from S3, the CDN will eventually serve a 404 after the TTL expires. For urgent removals, the active purge path is used.
| CDN Layer | Cache Hit Rate | TTFB | Role |
|---|---|---|---|
| Edge PoP (L1) | 85–92% | < 5 ms | Nearest geographic cache |
| Regional Shield (L2) | 95% of L1 misses | < 30 ms | Collapses thundering-herd misses |
| Image Processor (L3) | N/A (compute layer) | < 80 ms | On-the-fly resize + format convert |
| S3 Origin | N/A (source of truth) | 50–200 ms | Durable original storage |
9. Search & Discover
Instagram's search bar handles hundreds of millions of queries per day — users searching for hashtags, accounts, locations, and audio tracks. The Explore page extends discovery beyond the user's existing social graph, surfacing content from accounts they don't follow based on interest affinity.
Hashtag Indexing
When a post is published, a background consumer extracts all hashtags from the caption using a simple lexer (tokenize on '#' boundaries). Each unique hashtag triggers an upsert into an Elasticsearch index in which the document represents the hashtag (analyzed, lowercased, normalized) and carries a running count of associated posts. Elasticsearch's inverted index enables sub-100ms prefix/full-text search across billions of hashtag-post associations.
For trending hashtags, a Flink streaming job counts hashtag occurrences in a sliding 1-hour window. The top 500 trending hashtags are written to a Redis list every 5 minutes, making the trending section a pure Redis read — no Elasticsearch needed at display time.
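The sliding-window count is the heart of that Flink job. A single-process Python stand-in (names illustrative; the real job is distributed and checkpointed):

```python
from collections import Counter, deque

class TrendingWindow:
    """Count hashtag occurrences in a sliding time window and report the
    current top-N — an in-memory stand-in for the Flink streaming job."""

    def __init__(self, window_secs: int = 3600):
        self.window = window_secs
        self.events = deque()      # (timestamp, hashtag), oldest first
        self.counts = Counter()

    def record(self, ts: int, tag: str):
        self.events.append((ts, tag))
        self.counts[tag] += 1
        self._evict(ts)

    def _evict(self, now: int):
        # drop events that have aged past the window
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1

    def top(self, n: int = 500) -> list:
        return [t for t, c in self.counts.most_common(n) if c > 0]

w = TrendingWindow()
w.record(0, "#sunset"); w.record(10, "#sunset"); w.record(20, "#coffee")
w.record(4000, "#coffee")   # the t=0..20 events have aged out of the hour
assert w.top(2) == ["#coffee"]
```

The periodic `top(500)` snapshot is what gets written to the Redis list every 5 minutes.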
User Search
User search must support prefix search (type "san" and get "sanwar", "sandra", "santana") with typo tolerance. Elasticsearch's edge_ngram analyzer tokenizes usernames and display names into n-grams at index time, enabling fast prefix lookups. A custom scoring function boosts results by:
- Relationship proximity: Mutual followers rank above strangers for the same text match score.
- Account popularity: Verified accounts and accounts with follower counts > 10K receive a mild boost.
- Prior interaction: Accounts the user has messaged or tagged in the past 30 days are surfaced first.
Explore Page ML Ranking
The Explore page is a separate candidate generation pipeline optimized for interest expansion (discovering new creators) rather than social-graph reinforcement. It uses a combination of:
- Collaborative filtering: "Users similar to you also engaged with these posts" — computed nightly via a matrix factorization job (ALS on Spark).
- Content-based filtering: Vision embeddings of posts the user has engaged with are compared against a FAISS index of all recently uploaded content. Semantically similar images are surfaced.
- Topic affinity: Users are assigned a 128-dimensional interest vector. Content is tagged with topic distributions (food, travel, fitness, technology). Dot-product similarity drives exploration candidates.
The Explore candidate set (~1,000 posts) is then re-ranked by the same LightGBM model used for feed, with an additional novelty penalty applied to content the user has already seen. The final grid (3 columns × N rows) is served from a Redis cache keyed by explore:{user_id} with a 30-minute TTL before a refresh job regenerates it.
10. Direct Messages
Instagram DMs handle hundreds of billions of messages per year across 1-on-1 and group conversations. The messaging infrastructure must provide low-latency delivery (< 500 ms end-to-end on the same continent), durable message storage (messages persist indefinitely unless deleted by the user), and message status (sent, delivered, read) with eventual consistency.
Connection Management — WebSocket Gateway
When a user opens the Instagram app, the client establishes a persistent WebSocket connection to the nearest WebSocket Gateway server (a stateful service layer behind a TCP load balancer). Connection state (which user is on which gateway server) is maintained in a Redis Hash: user_connection:{user_id} → gateway_server_id. This lookup allows any service in the system to push a message to any online user.
The WebSocket Gateway fleet is sized at roughly 1 WebSocket server per 50,000 concurrent connections. With 100M concurrent users in peak hours and 10% having active WebSocket sessions (app in foreground), that's 10M concurrent connections — ~200 gateway servers. Each runs on a single-threaded async event loop (Node.js or Go's goroutine model) since WebSocket connections are primarily I/O-bound.
Message Flow — Send Path
- Sender's app sends a message payload over WebSocket to the Gateway server. Payload: `{client_msg_id, conversation_id, content, timestamp}`. The `client_msg_id` is a client-generated UUID used for deduplication.
- Gateway forwards the message to the Messaging Service via gRPC.
- Messaging Service: (a) deduplicates using `client_msg_id` in Redis with a 24-hour TTL; (b) assigns a server-side `message_id` (Snowflake-style monotonic ID); (c) persists to Cassandra with a QUORUM write; (d) publishes a `message.sent` Kafka event.
- An ACK is returned through the sender's Gateway, and the sender's app shows the "sent" tick (✓).
- A Delivery Worker consumes the Kafka event, looks up the recipient's connection in Redis, and pushes the message to the recipient's Gateway server (which relays it over WebSocket). On success, a "delivered" status (✓✓) is relayed back to the sender.
- If the recipient is offline, the message is queued in a Redis List `offline_queue:{user_id}` and a push notification is dispatched via APNs/FCM. On next app open, the client fetches missed messages from Cassandra using its last-seen message ID as a cursor.
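The deduplication in step (a) is what turns at-least-once delivery into effectively-once persistence. A Python sketch of the logic (in production this is a Redis `SET key EX 86400` check, not an in-process dict; all names illustrative):

```python
class MessageDeduper:
    """Remember client_msg_ids for 24 hours so a retried send (same UUID)
    is acknowledged without a second Cassandra write."""

    TTL = 86_400  # seconds

    def __init__(self):
        self.seen = {}  # client_msg_id -> expiry timestamp

    def is_duplicate(self, client_msg_id: str, now: float) -> bool:
        # purge expired entries lazily (Redis does this via key expiry)
        self.seen = {k: exp for k, exp in self.seen.items() if exp > now}
        if client_msg_id in self.seen:
            return True
        self.seen[client_msg_id] = now + self.TTL
        return False

d = MessageDeduper()
assert d.is_duplicate("uuid-1", now=0) is False      # first delivery: persist + ACK
assert d.is_duplicate("uuid-1", now=5) is True       # client retry: ACK only
assert d.is_duplicate("uuid-1", now=90_000) is False # past 24 h: treated as new
```

The 24-hour TTL bounds memory while still covering any realistic client retry window (a phone that reconnects after a flight, for example).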
// Cassandra message storage schema
CREATE TABLE messages (
conversation_id UUID,
message_id BIGINT, -- Snowflake ID (time-ordered)
sender_id BIGINT,
content_type TINYINT, -- 1=text, 2=photo, 3=video, 4=voice, 5=reaction
content TEXT,
client_msg_id UUID, -- for deduplication
created_at TIMESTAMP,
deleted_at TIMESTAMP, -- soft delete
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
AND compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 7};
End-to-End Encryption
Instagram's end-to-end encrypted (E2EE) messaging uses the Signal Protocol (Double Ratchet Algorithm + X3DH key exchange). In E2EE mode:
- Each device generates an identity key pair, a signed pre-key pair, and a batch of one-time pre-keys. Public keys are uploaded to Instagram's key distribution server (KDS).
- On initiating an E2EE conversation, the sender fetches the recipient's public key bundle from the KDS and performs X3DH to derive a shared session secret.
- Subsequent messages use the Double Ratchet for forward secrecy — even if a session key is compromised, past messages remain protected.
- Encrypted ciphertext is stored in Cassandra. Instagram's servers cannot read E2EE message content. Backup is encrypted with a user-held passphrase-derived key.
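The forward-secrecy property of the Double Ratchet comes from its one-way KDF chain. Below is a deliberately simplified sketch of one symmetric-ratchet step (stdlib HMAC only; real Signal also interleaves a Diffie-Hellman ratchet, omitted here):

```python
import hmac
import hashlib

def kdf_chain_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """One symmetric-ratchet step: derive a one-time message key and the
    next chain key from the current chain key via HMAC-SHA256. HMAC is
    one-way, so compromising the current chain key reveals nothing about
    earlier message keys -- the forward-secrecy property."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key

# Stand-in for the X3DH-derived session secret
ck = hashlib.sha256(b"shared secret from X3DH").digest()
mk1, ck = kdf_chain_step(ck)  # key for message 1, then ratchet forward
mk2, ck = kdf_chain_step(ck)  # key for message 2
assert mk1 != mk2             # every message gets a fresh key
```

Each message key is used once and discarded; an attacker who captures the chain key at step N can only derive keys from step N onward, never backward.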
Group Chats (up to 32 Participants)
Group messages use a sender-key model (same as WhatsApp groups): the sender generates a single "sender key" for the group, encrypts one copy of the message, and distributes the sender key to each member once. This reduces encryption overhead from O(N) per message to O(1) per message + O(N) for key distribution — critical when N = 32 and message frequency is high.
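The cost difference is easy to quantify. A minimal arithmetic sketch (function names are illustrative) comparing naive pairwise encryption against the sender-key model:

```python
def pairwise_cost(n_members: int, n_messages: int) -> int:
    # Naive pairwise E2EE: every message is encrypted once per recipient
    return n_messages * (n_members - 1)

def sender_key_cost(n_members: int, n_messages: int) -> int:
    # Sender-key model: one-time O(N) key distribution, then O(1) per message
    return (n_members - 1) + n_messages

# A 32-person group exchanging 1,000 messages:
assert pairwise_cost(32, 1000) == 31_000   # 31 encryptions per message
assert sender_key_cost(32, 1000) == 1_031  # 31 key shares + 1 encryption each
```

At 32 members the sender-key model cuts encryption work roughly 30×, and the gap widens as message volume grows.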
11. Capacity Estimation
Capacity estimation in a system design interview grounds your design in reality and demonstrates engineering intuition. Work through these numbers methodically. All figures below are approximate and representative of Instagram's public scale.
Upload Traffic & Storage
| Metric | Calculation | Result |
|---|---|---|
| Daily uploads | Given (photos + videos) | 100M/day |
| Avg upload rate | 100M / 86,400s | ~1,160 uploads/sec |
| Peak upload rate | avg × 3× peak factor | ~3,500 uploads/sec |
| Avg photo size (processed) | 3 resolutions × ~500 KB avg | ~1.5 MB per photo |
| Avg video size (processed) | 3 resolutions × ~20 MB avg for 60s | ~60 MB per video |
| Daily media storage added | 80M photos × 1.5 MB + 20M videos × 60 MB | ~1.3 PB/day |
| Annual media storage | 1.3 PB × 365 days | ~475 PB/year |
| Metadata storage (PostgreSQL) | 100M posts × 500 bytes/row | ~50 GB/day (manageable) |
Feed & CDN Traffic
- Feed reads: 500M DAU × 5 feed refreshes/day = 2.5 billion feed reads/day ≈ 29,000 feed RPS average; ~90,000 at peak.
- CDN requests: Each feed refresh loads ~20 thumbnail images. 2.5B × 20 = 50 billion CDN requests/day ≈ 580,000 CDN requests/sec on average (~1.7M/sec at a 3× peak factor).
- CDN bandwidth: 50B requests × 30 KB avg thumbnail = 1.5 PB CDN traffic/day ≈ 140 Gbps average egress.
- Redis timeline cache: 500M active users × 300 post IDs × 8 bytes = ~1.2 TB Redis memory for timeline cache (fits in a 40-node Redis cluster at 32 GB/node).
- Cassandra message storage: 100B messages/year × 500 bytes avg = ~50 TB/year, comfortably managed across a 100-node Cassandra cluster with 3× replication.
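The table and bullets above reduce to a few lines of arithmetic. A back-of-envelope script (all inputs are the figures stated above) makes the derivation checkable:

```python
# Back-of-envelope capacity check using the inputs from the section above
DAY_S = 86_400

# Upload traffic & storage
uploads_day = 100e6
photos, videos = 80e6, 20e6          # daily split
photo_mb, video_mb = 1.5, 60.0       # processed sizes per item

upload_rps = uploads_day / DAY_S                              # ~1,160/s avg
storage_pb_day = (photos * photo_mb + videos * video_mb) / 1e9  # MB -> PB

# Feed & CDN traffic
feed_reads_day = 500e6 * 5           # 500M DAU x 5 refreshes
feed_rps = feed_reads_day / DAY_S                             # ~29K RPS avg
cdn_req_day = feed_reads_day * 20    # ~20 thumbnails per refresh
cdn_gbps = cdn_req_day * 30e3 * 8 / DAY_S / 1e9               # 30 KB avg, in Gbps

# Redis timeline cache
redis_tb = 500e6 * 300 * 8 / 1e12    # 300 post IDs x 8 bytes per user

print(f"{upload_rps:,.0f} uploads/s, {storage_pb_day:.2f} PB/day, "
      f"{feed_rps:,.0f} feed RPS, {cdn_gbps:.0f} Gbps CDN, {redis_tb:.1f} TB Redis")
```

Running it reproduces the headline numbers: ~1,157 uploads/sec, ~1.32 PB/day of media, ~29K feed RPS, ~139 Gbps average CDN egress, and 1.2 TB of Redis timeline memory.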
12. System Design Interview Checklist
Use this checklist during your interview walkthrough to ensure you cover all the dimensions a senior engineer and interviewer will probe. Instagram-style social media system design interviews typically run 45–60 minutes; allocate time accordingly.
Phase 1: Requirements Clarification (5 min)
- ✅ Confirm scope: photos only, or videos too? Stories? DMs?
- ✅ Clarify user scale (2B MAU? 500M DAU?)
- ✅ Confirm read:write ratio expectation (typically 100:1)
- ✅ Latency SLA for feed load (< 200 ms P99?)
- ✅ Consistency requirements (eventual OK for likes? Strong for user data?)
Phase 2: Capacity Estimation (5–7 min)
- ✅ Calculate uploads/sec (avg and peak 3×)
- ✅ Estimate media storage growth per year (~475 PB)
- ✅ Calculate feed RPS (500M DAU × 5 refreshes / 86,400s ≈ 29K RPS)
- ✅ Estimate CDN bandwidth (50B requests × 30 KB ≈ 140 Gbps)
- ✅ Size Redis timeline cache (1.2 TB)
Phase 3: High-Level Architecture (10 min)
- ✅ Draw API gateway, auth, and rate limiting tier
- ✅ Show major services: User, Media, Post, Feed, Social Graph, DM, Story, Search, Notification
- ✅ Show data stores: PostgreSQL (user/post metadata), Cassandra (timelines/messages), Redis (cache), S3 (media), Kafka (event bus), Elasticsearch (search)
- ✅ Show CDN in front of all media reads
Phase 4: Deep Dives (20–25 min)
- ✅ Media upload: explain pre-signed S3 multipart + resumability
- ✅ Feed generation: explain push vs. pull vs. hybrid, celebrity fan-out problem
- ✅ Feed ranking: explain two-tower candidate generation + LightGBM re-ranking
- ✅ Social graph sharding: adjacency list, dual-table mirror, celebrity shards
- ✅ Stories TTL: Redis EXPIREAT + Cassandra column TTL + S3 lifecycle
- ✅ CDN: PoP network, adaptive resizing, cache invalidation strategy
Phase 5: Bottlenecks, Failures & Tradeoffs (5–8 min)
- ✅ What happens if Kafka lags? Fan-out delay — use priority queue for verified accounts
- ✅ What if Redis goes down? Fall back to on-demand Cassandra pull; rebuild cache on recovery
- ✅ How do you handle a hot user (Justin Bieber problem)? Hybrid push-pull with celebrity threshold
- ✅ How do you handle partial failures in the upload pipeline? S3 multipart expiry, dead-letter queue for failed transcoding jobs
- ✅ Global vs. single-region: active-active multi-region with eventual consistency via Kafka cross-region replication
Expert-Level Differentiators
Candidates who score "Strong Hire" consistently mention:
- Perceptual hash deduplication before storing media (saves significant storage)
- Sender-key encryption for group chats (O(1) message encryption, O(N) key distribution)
- Bloom filter for story seen/unseen tracking (O(1) membership, sublinear memory)
- FAISS approximate nearest neighbor for ML candidate retrieval (< 10 ms at billion-scale)
- Time-Window Compaction Strategy in Cassandra for time-series message data
- Origin shield to collapse CDN miss thundering herds before they hit S3
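The Bloom-filter point above is worth being able to sketch on the spot. A minimal stdlib version for story seen/unseen tracking (class and sizing are illustrative; production systems would use Redis's `BF.ADD`/`BF.EXISTS` or a tuned library):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions per item derived
    from SHA-256. False positives are possible (a story may rarely be
    marked seen when it wasn't); false negatives are impossible (a seen
    story is never shown as unseen)."""
    def __init__(self, m_bits: int = 1 << 20, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = BloomFilter()
seen.add("user42:story1001")
assert seen.might_contain("user42:story1001")      # always true once added
assert not seen.might_contain("user42:story9999")  # unseen (tiny FP chance)
```

One megabit per user tracks tens of thousands of seen stories at a sub-1% false-positive rate, versus 8+ bytes per story ID in an exact set — the "sublinear memory" claim above.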