Designing Instagram at Scale: Photo/Video Upload, Feed Generation, Social Graph & CDN
Instagram serves 2 billion monthly active users, processes 100 million photo and video uploads every single day, and must generate a personalized, ML-ranked feed for each user in under 200 ms. This deep-dive covers every major subsystem — from resumable S3 multipart upload and async transcoding to the celebrity fan-out problem, Redis sorted-set timelines, Stories TTL, Elasticsearch hashtag search, and global CDN edge delivery — giving you a battle-ready mental model for the system design interview and production architecture decisions.
TL;DR — Key Design Decisions in One Paragraph
"Use resumable S3 multipart upload with async FFmpeg transcoding for media ingestion. Model the social graph in a sharded graph database and cache hot edges in Redis. Solve the celebrity fan-out problem with a hybrid push-pull feed: pre-compute timelines for normal users, pull-on-read for celebrity posts. Rank the feed with a two-tower ML model at retrieval + a lightweight scorer at re-rank. Serve media via a global CDN with adaptive WebP/AVIF resizing. Store Stories in Redis with a 24-hour TTL and append-only viewership logs."
1. Functional & Non-Functional Requirements
Every great system design interview starts by narrowing scope. Instagram's core product surface covers six major capability areas — pin them down before sketching any boxes and arrows.
Functional Requirements
- Media Post: Users upload photos (JPEG/PNG/HEIC, up to 20 MB) and videos (MP4/MOV; up to 60 s / 100 MB for feed posts, up to 90 s for Reels). Each post supports a caption, hashtags, location tag, and alt-text.
- Follow / Unfollow: Directed relationships (A follows B does not imply B follows A). Private accounts require a follow-request approval step. Users may block others, removing all social edges between them.
- Feed: Chronologically ordered (legacy) or ML-ranked personalized feed of posts from followed accounts. Must support infinite scroll with cursor-based pagination, not offset pagination.
- Stories: Short-lived ephemeral media (photo or 15-second video clip) visible for 24 hours to followers. Story viewership visible only to the poster. Multiple segments per day, sequential playback ordered by recency.
- Search & Discover: Full-text search for hashtags and usernames. Explore page showing trending and personalized content to non-followers.
- Direct Messages (DMs): One-to-one and group chats (up to 32 participants) supporting text, reactions, photo/video, voice messages, and link previews. Messages persist indefinitely.
- Notifications: Push (APNs / FCM) and in-app notifications for likes, comments, follows, DMs, and story views.
- Likes & Comments: Counter-based likes (approximate at scale) and threaded comments with nested replies up to 2 levels deep.
Non-Functional Requirements
| Attribute | Target | Notes |
|---|---|---|
| Feed load latency (P99) | < 200 ms | Pre-computed timeline read from Redis cache |
| Photo upload throughput | 100M uploads/day | ≈ 1,160 uploads/sec average; 3× for peak |
| Media storage growth | ~500 PB/year | Multi-resolution variants + originals in S3 |
| Availability (SLA) | 99.99% | < 53 min downtime/year; active-active multi-region |
| Media delivery latency | < 50 ms (TTFB via CDN) | Edge PoP within 20 ms RTT of 95% of users |
| DM delivery guarantee | At-least-once; idempotent | Client-side deduplication on message ID |
The system must also satisfy eventual consistency for likes/follower counts (acceptable to lag 1–2 seconds), strong consistency for financial-adjacent operations like account suspension, and read-heavy scalability with a read:write ratio exceeding 100:1 on the feed path.
2. High-Level Architecture Overview
At the highest level, Instagram's architecture is a collection of independently scalable microservices connected by asynchronous event streams. Traffic enters through a global load balancer tier and fans out to domain-specific services. The following diagram captures the major data flows.
Core Service Domains
- API Gateway & Auth: TLS termination, JWT/OAuth 2.0 validation, rate limiting (token bucket per user ID + IP), and request routing. Built on Nginx + Envoy for L7 traffic management.
- User Service: Account creation, profile management, authentication token issuance. PostgreSQL primary + read replicas. PII encrypted at rest with customer-managed KMS keys.
- Media Service: Upload coordinator, multipart S3 pre-signed URL generation, transcoding job dispatch, and metadata persistence (PostgreSQL + Cassandra for scale).
- Social Graph Service: Follow/unfollow mutations and follower/following reads. Graph stored in a custom sharded adjacency-list store (Facebook TAO-inspired) with Redis hot-edge caching.
- Feed Service: Timeline cache management in Redis Sorted Sets; fan-out worker consumers on Kafka; ML ranking pipeline integration.
- Story Service: Upload, metadata, TTL-based expiry, sequential playback ordering, and viewership tracking.
- Search Service: Elasticsearch cluster for hashtag + username indexing. Explore page ML ranking via a separate candidate-generation pipeline.
- Messaging Service: WebSocket connection manager, message persistence in Cassandra, end-to-end encryption key exchange.
- Notification Service: Fan-out to APNs/FCM push, in-app notification store with read/unread state.
- CDN & Media Delivery: Global CDN PoP network serving all media assets. Origin shield in each AWS region reduces S3 origin load by 95%+.
All inter-service communication uses gRPC for synchronous paths and Kafka for async event streams. Service mesh (Istio) enforces mutual TLS between services, provides circuit breakers, and emits distributed traces to Jaeger.
3. Media Upload Pipeline
Uploading 100 million media files per day on mobile networks — many of which are flaky 4G connections — demands a pipeline designed for resilience, efficiency, and deduplication from the ground up. A naive direct-upload approach fails at this scale for three reasons: mobile connections drop, large files timeout on standard HTTP connections, and storing duplicate content wastes petabytes of storage.
Step 1 — Resumable Upload via S3 Multipart
The client never uploads directly to the API servers. Instead:
- Client requests an upload session from the Media Service, sending file size, MIME type, and a client-generated UUID.
- Media Service calls S3 `CreateMultipartUpload` and generates a batch of pre-signed URLs (one per 5 MB part). These URLs expire in 1 hour and are returned to the client.
- Client uploads each part in parallel (4 concurrent connections) directly to S3. On network failure, it resumes from the last successfully uploaded part using the `uploadId` stored in local device storage.
- After all parts are uploaded, the client calls the Media Service to trigger `CompleteMultipartUpload`. The original file is now in an S3 `raw/` prefix.
// Media Service — initiate multipart upload (Spring Boot example)
@PostMapping("/upload/initiate")
public UploadSession initiateUpload(@RequestBody UploadRequest req) {
String uploadId = s3Client.createMultipartUpload(
CreateMultipartUploadRequest.builder()
.bucket("instagram-raw-media")
.key("raw/" + req.userId() + "/" + req.clientUuid())
.contentType(req.mimeType())
.serverSideEncryption(ServerSideEncryption.AWS_KMS)
.build()
).uploadId();
List<PresignedUrl> parts = IntStream.range(0, numParts(req.fileSizeBytes()))
.mapToObj(i -> presignPart(uploadId, i + 1, req))
.toList();
return new UploadSession(uploadId, parts, Instant.now().plusSeconds(3600));
}
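The client-side bookkeeping behind resumable upload is simple arithmetic over part numbers. A minimal Python sketch (function and variable names are illustrative, not Instagram's actual client code):

```python
import math

PART_SIZE = 5 * 1024 * 1024  # 5 MB parts, matching the pre-signed URL batch

def num_parts(file_size_bytes: int) -> int:
    """Number of 5 MB parts needed to cover the file."""
    return math.ceil(file_size_bytes / PART_SIZE)

def parts_to_resume(total_parts: int, completed: set) -> list:
    """Given the 1-based part numbers already ACKed by S3, return what's left.

    The client persists `completed` (plus the uploadId) in local storage so a
    dropped connection resumes instead of restarting from byte zero."""
    return [p for p in range(1, total_parts + 1) if p not in completed]

# A 23 MB photo needs 5 parts; if parts 1-3 finished before the network
# dropped, only parts 4 and 5 are re-requested on resume.
total = num_parts(23 * 1024 * 1024)            # → 5
remaining = parts_to_resume(total, {1, 2, 3})  # → [4, 5]
```

This is why the pre-signed URLs are issued as a batch up front: the client can retry any individual part without another round trip to the Media Service.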
Step 2 — Perceptual Hash Deduplication
Before triggering transcoding, the pipeline checks for duplicate content using perceptual hashing (pHash for images, video fingerprinting using frame-sampling). A perceptual hash is a 64-bit fingerprint that is similar for visually identical or near-identical images even after re-encoding. The process:
- An S3 event triggers a Lambda that computes the pHash of the uploaded file.
- The hash is queried against a Redis Bloom filter (billions of entries, < 1% false positive rate) for fast rejection of non-duplicates.
- On a Bloom filter hit, a secondary lookup in a DynamoDB deduplication table confirms exact match or near-duplicate (Hamming distance < 10 on pHash).
- Confirmed duplicates are not stored again. S3 has no filesystem-style hard links, so deduplication is logical: the new post gets a metadata record pointing at the existing media key, and no bytes are duplicated.
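The fingerprint-compare step is easy to illustrate. A toy Python sketch using an average hash (production pHash uses a DCT, but the 64-bit fingerprint and Hamming-distance comparison work the same way; all names here are illustrative):

```python
def average_hash(pixels_8x8: list) -> int:
    """Toy 64-bit perceptual hash: one bit per pixel of an 8x8 grayscale
    downsample, set when the pixel is brighter than the mean."""
    assert len(pixels_8x8) == 64
    mean = sum(pixels_8x8) / 64
    bits = 0
    for i, p in enumerate(pixels_8x8):
        if p > mean:
            bits |= 1 << i
    return bits

def is_near_duplicate(hash_a: int, hash_b: int, max_distance: int = 10) -> bool:
    """Hamming distance on the 64-bit hashes; fewer than 10 differing bits
    is treated as the same image, as in the DynamoDB confirmation step."""
    return bin(hash_a ^ hash_b).count("1") < max_distance

img = [10 * (i % 7) for i in range(64)]
reencoded = [p + 1 for p in img]  # slight brightness shift after re-encoding
assert is_near_duplicate(average_hash(img), average_hash(reencoded))
```

Because the hash survives re-encoding, the same photo uploaded twice through different apps still collides in the dedup table even though the raw bytes differ.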
Step 3 — Async Transcoding Queue
After deduplication, a Kafka message is published to the media.transcoding.requested topic. A fleet of EC2 GPU-backed transcoding workers (G4dn instances with FFmpeg + hardware-accelerated H.264/H.265/AV1 encoding) consumes these events. Each upload produces multiple output variants stored in S3 under a processed/ prefix:
- Photos: 320px, 640px, 1080px JPEG (progressive, quality 85) + WebP variants. HEIC inputs are converted to JPEG as the primary store.
- Videos: 360p, 720p, 1080p H.264 MP4 (for broad device support) + AV1 for modern clients. HLS segmented playlist (.m3u8) for adaptive bitrate streaming.
- Thumbnails: 3-frame GIF preview and a static poster frame JPEG used for timeline placeholder.
Once all variants are ready, the transcoding worker publishes a media.transcoding.completed Kafka event. The Post Service consumes this and updates the post status from PROCESSING to PUBLISHED, making it visible in feeds. The client polls a lightweight status endpoint (long-polling with 30-second timeout) to detect when the post is ready.
Uploading through API servers would require sticky sessions (defeats horizontal scaling), saturate API server memory with large file buffers, and block request threads during slow mobile uploads. Pre-signed S3 URLs shift the transfer burden directly to S3's massive ingestion infrastructure — the API server merely coordinates, never touches the raw bytes.
4. Social Graph Service
The social graph — who follows whom — is the connective tissue of every feed, notification, and recommendation on Instagram. At 2 billion users with an average of 300 follow relationships each, the graph contains ~600 billion directed edges. This is far too large for a single relational database.
Data Model & Storage
The graph is stored as an adjacency list in a horizontally sharded data store. Each edge is a row: (follower_id, followee_id, created_at, status). The table is sharded on follower_id for efficient "who does user X follow?" reads, and a mirror table is sharded on followee_id for "who follows user X?" reads. The two tables are kept in sync by writing each edge mutation to a durable log (e.g., a transactional Kafka topic or an outbox table) and applying it to both shards; a periodic reconciliation job repairs any divergence, since naive dual writes alone would drift under partial failures.
-- Followings table (sharded on follower_id)
CREATE TABLE followings (
follower_id BIGINT NOT NULL,
followee_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
status SMALLINT NOT NULL DEFAULT 1, -- 1=active, 2=pending, 3=blocked
PRIMARY KEY (follower_id, followee_id)
) PARTITION BY HASH (follower_id);
-- Followers table (mirror, sharded on followee_id)
CREATE TABLE followers (
followee_id BIGINT NOT NULL,
follower_id BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
status SMALLINT NOT NULL DEFAULT 1,
PRIMARY KEY (followee_id, follower_id)
) PARTITION BY HASH (followee_id);
Sharding Strategy
With 600 billion edges, a single PostgreSQL cluster cannot handle this. Instagram uses consistent hashing with 1,024 virtual nodes spread across a cluster of 128 shards, giving ~4.7 billion edges per shard — comfortably within range of a tuned PostgreSQL instance with SSD storage. A routing layer hashes the user_id onto the virtual-node ring to find the owning shard (a naive user_id % num_shards scheme would reshuffle nearly every key whenever a shard is added); the ring topology is cached in a local config file updated via ZooKeeper.
Hot spots occur for celebrity accounts with 100M+ followers. The followers table shard for Cristiano Ronaldo's followee_id would receive millions of fan-out reads during feed generation. The mitigation is a dedicated celebrity shard (identified by follower count > 10M) replicated with 10× read replicas and aggressive result caching in Redis with a 60-second TTL.
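The routing layer's ring lookup fits in a few lines. A minimal Python sketch of consistent hashing with virtual nodes (class and shard names are illustrative):

```python
import bisect
import hashlib

class ShardRing:
    """Consistent-hash ring: virtual nodes spread over the physical shards,
    so adding a shard remaps only ~1/num_shards of the keys (unlike naive
    user_id % num_shards, which remaps almost everything)."""

    def __init__(self, shards: list, vnodes_per_shard: int = 8):
        self.ring = sorted(
            (self._h(f"{s}#{v}"), s)
            for s in shards
            for v in range(vnodes_per_shard)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _h(key: str) -> int:
        # 64-bit hash point on the ring
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def shard_for(self, user_id: int) -> str:
        # first virtual node clockwise from the key's hash owns it
        i = bisect.bisect(self.points, self._h(str(user_id))) % len(self.ring)
        return self.ring[i][1]

ring = ShardRing([f"shard-{i}" for i in range(128)])
# Lookups are deterministic: the same user always routes to the same shard.
assert ring.shard_for(42) == ring.shard_for(42)
```

In production the equivalent of `ShardRing` is loaded from the ZooKeeper-distributed config so every service instance agrees on the topology.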
Redis Hot-Edge Cache
For feed generation, the most critical read pattern is: "give me the list of users that user X follows." This set is cached in Redis as a Sorted Set keyed by following:{user_id}, with score = created_at unix timestamp. The set is lazily populated on first feed request and has a 24-hour TTL. Cache size is bounded to the 1,000 most recent follows — for heavy-following accounts, only the most recent connections affect feed relevance anyway.
Friend-of-Friend Suggestions
The "Suggested for You" feature uses second-degree graph traversal. A nightly Spark batch job computes intersection counts of mutual connections for each user. Results are stored in a suggestions:{user_id} Redis list, sorted by mutual connection count descending. The job also incorporates signals like contact book matching (phone number hash lookup), location clustering (users frequently at the same geofence), and co-engagement on hashtags.
5. Feed Generation: Push, Pull & Hybrid
Feed generation is the central algorithmic and systems challenge in Instagram's architecture. The goal: when a user opens the app, surface the ~20 most relevant posts from their network within 200 ms. At 500 million daily active users refreshing their feed multiple times per day, this is billions of feed-read requests per day.
Pure Push (Fan-out on Write)
When user A publishes a post, the system immediately writes that post ID into the timeline cache (a Redis Sorted Set) of every follower of A. Feed reads become pure cache hits — O(1) per user.
- Pros: Near-instant feed reads. Read path is extremely simple (one Redis ZREVRANGE call).
- Cons: Celebrity fan-out is catastrophic. A user with 200M followers generates 200M Redis write operations on a single post — this takes minutes even with a large Kafka consumer fleet and Redis cluster, delaying post visibility. Also wastes resources writing to timelines of inactive users.
Pure Pull (Fan-out on Read)
On feed read, the system fetches the list of accounts user X follows, queries each account's post store for recent posts, merges and sorts the results, then applies ML ranking. No pre-computation needed.
- Pros: No write amplification. Works perfectly for celebrity accounts. Always fresh.
- Cons: Extremely slow for users following many accounts. A user following 2,000 accounts would require 2,000 database lookups per feed refresh, adding hundreds of milliseconds of latency. Does not scale to 500M DAU.
Hybrid Approach — The Production Solution
Instagram uses a hybrid model that combines push for normal users and pull for celebrities, merging results at read time:
- On post creation: the Post Service publishes a `post.created` Kafka event containing `post_id`, `author_id`, and `author_follower_count`.
- Fan-out workers consume this event. If `author_follower_count < 100,000` (non-celebrity), they fan out the `post_id` to the Redis timeline sorted set of each active follower (using a Lua script for atomic ZADD + ZREMRANGEBYRANK to cap timeline size at 300 entries). Active = logged in within the past 7 days.
- If `author_follower_count ≥ 100,000` (celebrity), fan-out is skipped at write time. The post is written only to the celebrity's own post store in Cassandra.
- On feed read: the Feed Service reads the user's pre-computed timeline from Redis (normal posts). It also fetches the list of celebrity accounts the user follows (cached in a small Redis set `celebrities_followed:{user_id}`) and pulls their recent posts from Cassandra directly.
- Both sets are merged, sorted by timestamp, and passed to the ML ranking layer. The final ranked list is returned to the client.
-- Redis timeline cache key structure:
-- timeline:{user_id} → Sorted Set of post_ids, score = unix_timestamp
-- Fan-out Lua script (atomic ZADD + trim)
local key = "timeline:" .. KEYS[1]
local score = ARGV[1] -- unix timestamp of post
local post_id = ARGV[2] -- post ID
local max_size = 300 -- cap timeline depth
redis.call("ZADD", key, score, post_id)
redis.call("ZREMRANGEBYRANK", key, 0, -(max_size + 1))
redis.call("EXPIRE", key, 86400 * 7) -- 7-day TTL
// Feed read pseudocode (Java-flavored)
List<Long> precomputedPostIds = redis.zrevrange("timeline:" + userId, 0, 99);
List<Post> precomputedPosts = postStore.getByIds(precomputedPostIds); // hydrate IDs into post objects
Set<Long> celebrityFollowees = redis.smembers("celebrities_followed:" + userId);
List<Post> celebrityPosts = cassandra.getRecentPosts(celebrityFollowees, 50);
List<Post> mergedFeed = mergeByTimestampDesc(precomputedPosts, celebrityPosts);
return mlRanker.rank(mergedFeed, userId);
Timeline Cache Invalidation
When a user unfollows another, all posts from the unfollowed account must be removed from the Redis timeline. This is done lazily — the Feed Service filters out posts from unfollowed accounts during the read path by re-checking a compact Bloom filter of the user's current follows rather than eagerly removing entries from Redis, avoiding expensive set-difference operations.
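The lazy-filtering idea hinges on a Bloom filter's asymmetry: it can false-positive (a stale post slips through briefly, which is harmless) but never false-negative (a followed author is never filtered out). A self-contained Python sketch (all names illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over followee IDs for read-path filtering."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: int):
        # derive k positions from one SHA-256 digest
        d = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(d[i * 4:(i + 1) * 4], "big") % self.size

    def add(self, item: int):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: int) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

follows = BloomFilter()
for author_id in (7, 8, 9):      # user's current follow list
    follows.add(author_id)

# Timeline entries are (post_id, author_id); author 55 was unfollowed.
timeline = [(101, 7), (102, 55), (103, 9)]
visible = [pid for pid, author in timeline if follows.might_contain(author)]
```

The filter for a user's follow set is tiny (a few KB) and is rebuilt whenever the follow list changes, which is far cheaper than scanning every timeline sorted set to evict one author's posts.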
6. ML Feed Ranking
The chronological feed was retired years ago. Today's Instagram feed is ranked by a multi-stage ML pipeline designed to predict the probability that a given user will engage with each candidate post — where engagement encompasses likes, comments, shares, saves, and "time spent viewing."
Stage 1 — Candidate Generation
The hybrid feed read produces ~500 candidate post IDs. A lightweight candidate retrieval model (a two-tower neural network with a query tower encoding user context and a document tower encoding post features) computes approximate nearest-neighbor scores using FAISS to quickly reduce the 500 candidates to the 100 most relevant, discarding obviously irrelevant content before the expensive re-ranking stage.
Feature Engineering
The ranking model consumes three categories of features, all pre-computed and stored in a low-latency feature store (Redis + offline batch precomputation in Spark):
- User features: Account age, past 7-day engagement rate per content type (photos vs. Reels vs. Stories), followed hashtags, preferred posting time zones, device type.
- Post features: Age of post, number of likes/comments/saves at prediction time, media type, caption sentiment score, detected objects/scenes from Vision AI, hashtag popularity percentile.
- Interaction features: Historical engagement rate between user and author (have they liked the last 5 posts?), time since last interaction, direct message history (strong positive signal), mutual follows.
Stage 2 — Re-Ranking
The top 100 candidates from Stage 1 are passed to a LightGBM gradient-boosted tree re-ranking model that outputs a probability score for each of 5 engagement types: like, comment, share, save, and 3-second video view. A weighted sum of these probabilities produces the final ranking score. The weights themselves are tuned via multi-objective Bayesian optimization to balance short-term engagement with longer-term user satisfaction metrics (session length, return visits).
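The final scoring step is just a dot product between predicted probabilities and tuned weights. A Python sketch (the weight values here are illustrative, not Instagram's; in production they come out of the Bayesian optimization described above):

```python
def final_score(probs: dict, weights: dict) -> float:
    """Weighted sum of the re-ranker's per-engagement probabilities."""
    return sum(weights[k] * probs[k] for k in weights)

# Hypothetical weights: comments/shares/saves are stronger satisfaction
# signals than a passive like or a 3-second view.
WEIGHTS = {"like": 1.0, "comment": 3.0, "share": 4.0, "save": 4.0,
           "video_view_3s": 0.5}

probs = {"like": 0.20, "comment": 0.02, "share": 0.01, "save": 0.03,
         "video_view_3s": 0.40}
score = final_score(probs, WEIGHTS)
# 1.0*0.20 + 3.0*0.02 + 4.0*0.01 + 4.0*0.03 + 0.5*0.40 = 0.62
```

Candidates are then sorted by `score` descending before the diversity layer runs.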
Diversity & Anti-Obsession Constraints
A post-ranking diversity layer enforces business rules to prevent the feed from becoming a single-author echo chamber:
- No more than 3 consecutive posts from the same author.
- At most 20% of the feed from accounts the user has not interacted with in 30 days (prevents ghosting of followed friends).
- Reels are capped at 40% of feed slots to avoid Reels-only feeds alienating photo-preference users.
- Content from accounts the user muted is excluded before ranking.
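One plausible way to implement the consecutive-author rule is a greedy pass over the score-ordered list. A Python sketch (the algorithm and names are illustrative, not Instagram's published implementation):

```python
def enforce_author_diversity(ranked_posts: list, max_run: int = 3) -> list:
    """Greedy re-order: take the highest-ranked remaining post whose author
    would not create a (max_run + 1)-th consecutive slot; if every remaining
    candidate violates the rule, relax it rather than starve the feed."""
    remaining, result = list(ranked_posts), []
    while remaining:
        pick = next(
            (p for p in remaining
             if not (len(result) >= max_run
                     and all(r["author"] == p["author"]
                             for r in result[-max_run:]))),
            remaining[0],  # all candidates violate → fall back to top-ranked
        )
        remaining.remove(pick)
        result.append(pick)
    return result

# Four posts from author A ranked above one from B:
feed = [{"id": i, "author": "A"} for i in range(4)] + [{"id": 9, "author": "B"}]
out = enforce_author_diversity(feed)
# The 4th consecutive A-post is deferred until B breaks the run:
# authors come out as A, A, A, B, A
```

The same pass can carry the Reels-share and low-interaction-account caps as additional predicates on `pick`.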
A/B Testing Infrastructure
Every ranking model change ships via a statistically rigorous A/B test. Instagram uses a user-level bucketing system (not request-level) so each user sees a consistent experience within an experiment cohort. Experiments run for minimum 7 days (to capture weekly usage patterns) and measure holdout impact on D1/D7 retention, not just immediate engagement rate — since hyper-optimizing for likes degrades long-term user satisfaction.
7. Stories System
More than 500 million accounts use Instagram Stories daily, generating billions of story plays per day. The ephemeral nature — 24-hour expiry — makes Stories a fundamentally different system from the main feed, requiring a TTL-first data model and a viewership tracking system that can handle extreme read/write concurrency.
TTL-Based Expiry Architecture
Stories metadata is stored in two layers:
- Redis (hot layer): `story:{story_id}` — a Hash containing media URL, caption, author ID, and publish timestamp. The TTL is set with `EXPIREAT` to the exact Unix timestamp 86,400 seconds (24 hours) after post time. The story is also added to `user_stories:{author_id}` (a Sorted Set with score = expiry timestamp), which powers the story ring ordering on profile pages.
- Cassandra (durable layer): Stories are written with an 86,400-second TTL (the table's `default_time_to_live`), so Cassandra's compaction process garbage-collects expired data without any application-level cleanup job. After expiry, media assets in S3 are moved to a cold `stories-archive/` bucket (S3 Glacier Instant Retrieval) via S3 Lifecycle rules — users retain access to their own archived stories.
-- Cassandra story schema with a table-level default TTL
CREATE TABLE stories (
author_id BIGINT,
story_id TIMEUUID, -- naturally ordered by creation time
media_url TEXT,
caption TEXT,
created_at TIMESTAMP,
expires_at TIMESTAMP,
PRIMARY KEY (author_id, story_id)
) WITH CLUSTERING ORDER BY (story_id DESC)
AND default_time_to_live = 86400; -- 24-hour TTL
-- Insert with per-row TTL override
INSERT INTO stories (author_id, story_id, media_url, caption, created_at, expires_at)
VALUES (?, now(), ?, ?, toTimestamp(now()), ?)
USING TTL 86400;
Story Tray Ordering
The story tray (the row of profile circles at the top of the home feed) is ordered by a combination of factors: (1) unseen stories first, sorted by the poster's last story timestamp descending; (2) seen stories ordered by recency of the post. The user's seen/unseen state for each story segment is tracked in a story_views:{viewer_id} Redis Bloom filter — a bit set per story ID. Bloom filters give O(1) membership checks at the cost of a tiny false positive rate, which is acceptable (a story mistakenly marked as "seen" is a minor UX issue, not a correctness bug).
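The two-level ordering (unseen first, then recency) reduces to a compound sort key. A Python sketch of the tray comparator (field names are illustrative):

```python
def tray_order(authors: list) -> list:
    """Story tray ordering: rings with unseen stories first, and within each
    group the author whose latest story is newest comes first. `has_unseen`
    is the result of the per-viewer Bloom-filter membership check."""
    return sorted(
        authors,
        # False sorts before True, so `not has_unseen` puts unseen rings
        # first; negating the timestamp makes newer stories sort earlier.
        key=lambda a: (not a["has_unseen"], -a["last_story_ts"]),
    )

tray = tray_order([
    {"name": "alice", "has_unseen": False, "last_story_ts": 500},
    {"name": "bob",   "has_unseen": True,  "last_story_ts": 100},
    {"name": "carol", "has_unseen": True,  "last_story_ts": 300},
])
# → carol (unseen, newest), bob (unseen), alice (seen)
```

In production this sort runs over a small per-user candidate list, so the occasional Bloom-filter false positive merely demotes one ring to the "seen" group.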
Viewership Tracking
When a user views a story, the system must record: who viewed it, when, and whether they replied or reacted. At Instagram scale, this is ~10 billion view events per day. The implementation:
- View events are published to the Kafka topic `story.views`. Kafka absorbs 10M+ events/sec and provides durability without blocking the user's request.
- A Flink streaming job consumes the events, incrementing an aggregated `story_view_count:{story_id}` Redis counter (INCR) and appending individual viewer IDs to a Cassandra `story_viewers` table (partition key = story_id, clustering key = viewer_id) with the same 86,400 s TTL.
- The poster sees "Viewed by @alice, @bob, and 1,247 others" — the top viewer display names come from a Redis sorted set (score = view timestamp) and the aggregate from the Redis counter.
8. CDN & Media Delivery
Instagram serves billions of media assets per day to users in every country on earth. Without a global CDN, every photo request would traverse thousands of kilometers to an AWS region, adding hundreds of milliseconds of latency and generating massive inter-region bandwidth costs. CDN delivery is not optional — it is foundational.
Global PoP Network
Instagram's CDN (built on Meta's own edge infrastructure, with Akamai and Cloudflare as fallback) operates 200+ Points of Presence (PoPs) across 100+ countries. Design goals for PoP placement:
- 95% of global internet users should be within 20 ms RTT of a PoP.
- Each PoP has 10–100 TB of SSD cache, sized to hold the "working set" of media accessed in the past 48 hours for that geography.
- PoPs are interconnected via a private backbone network (BGP-optimized for low latency) to avoid routing through public internet for cache misses going back to origin.
Adaptive Image Resizing & Format Conversion
Rather than storing fixed-resolution variants for every possible screen size, Instagram uses on-the-fly image processing at the CDN edge (or a regional Image Processing Service that sits behind the CDN as an origin):
- The image URL encodes dimensions and format, e.g. `https://cdn.instagram.com/media/{id}/320x320/f.webp` (not the actual URL format, but it illustrates the principle).
- The edge server parses the URL, fetches the original from S3 (via Origin Shield, a regional cluster that aggregates S3 requests), resizes using libvips (among the fastest image-processing libraries available), and converts to WebP for browsers/apps that support it (Accept-header negotiation) or AVIF for next-gen clients.
- The processed variant is cached at the PoP with a Cache-Control: public, max-age=31536000 (1 year) immutable header, since processed images are keyed by content hash and never mutate.
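The edge-side URL parse that drives all of this is a small exercise in validation. A Python sketch against the illustrative path shape above (the pattern is hypothetical, since the real CDN URL scheme is not public):

```python
import re

# Illustrative pattern for /media/{id}/{w}x{h}/f.{format}
VARIANT_RE = re.compile(
    r"^/media/(?P<id>\w+)/(?P<w>\d+)x(?P<h>\d+)/f\.(?P<fmt>webp|avif|jpeg)$"
)

def parse_variant(path: str):
    """Edge-side parse of the requested variant. The returned tuple serves as
    both the cache key and the resize/convert instruction on a cache miss;
    anything that doesn't match the whitelist is rejected at the edge."""
    m = VARIANT_RE.match(path)
    if not m:
        return None
    return m["id"], int(m["w"]), int(m["h"]), m["fmt"]

assert parse_variant("/media/abc123/320x320/f.webp") == ("abc123", 320, 320, "webp")
assert parse_variant("/media/abc123/huge/f.bmp") is None  # unsupported → reject
```

Whitelisting dimensions and formats in the pattern also prevents a malicious client from forcing the image processor to render arbitrary (cache-busting) sizes.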
Cache Invalidation Strategy
Media assets are immutable by design — a photo uploaded to Instagram never changes, only its metadata (caption, tags) may change. This makes cache invalidation rare. When it does occur (e.g., CSAM removal requiring emergency CDN purge), Instagram uses:
- Tag-based invalidation: Assets are tagged with the author's user ID and post ID at cache insertion. Purge APIs accept wildcard patterns matching all variants of a given post.
- Surrogate keys: CDN edge nodes maintain a surrogate key index. A single API call propagates the invalidation to all PoPs in < 30 seconds globally.
- Origin-side 404: If a media asset is deleted from S3, the CDN will eventually serve a 404 after the TTL expires. For urgent removals, the active purge path is used.
| CDN Layer | Cache Hit Rate | TTFB | Role |
|---|---|---|---|
| Edge PoP (L1) | 85–92% | < 5 ms | Nearest geographic cache |
| Regional Shield (L2) | 95% of L1 misses | < 30 ms | Collapses thundering-herd misses |
| Image Processor (L3) | N/A (compute layer) | < 80 ms | On-the-fly resize + format convert |
| S3 Origin | N/A (source of truth) | 50–200 ms | Durable original storage |
9. Search & Discover
Instagram's search bar handles hundreds of millions of queries per day — users searching for hashtags, accounts, locations, and audio tracks. The Explore page extends discovery beyond the user's existing social graph, surfacing content from accounts they don't follow based on interest affinity.
Hashtag Indexing
When a post is published, a background consumer extracts all hashtags from the caption using a simple lexer (tokenize on '#' boundaries). Each unique hashtag triggers an upsert into an Elasticsearch index in which the document represents the hashtag (analyzed, lowercased, normalized) and carries a running count of associated posts. Elasticsearch's inverted index enables sub-100ms prefix/full-text search across billions of hashtag-post associations.
For trending hashtags, a Flink streaming job counts hashtag occurrences in a sliding 1-hour window. The top 500 trending hashtags are written to a Redis list every 5 minutes, making the trending section a pure Redis read — no Elasticsearch needed at display time.
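The sliding-window count is the heart of that Flink job. A single-process Python stand-in (names illustrative; the real job is distributed and checkpointed):

```python
from collections import Counter, deque

class TrendingWindow:
    """Count hashtag occurrences in a sliding time window and report the
    current top-N — an in-memory stand-in for the Flink streaming job."""

    def __init__(self, window_secs: int = 3600):
        self.window = window_secs
        self.events = deque()      # (timestamp, hashtag), oldest first
        self.counts = Counter()

    def record(self, ts: int, tag: str):
        self.events.append((ts, tag))
        self.counts[tag] += 1
        self._evict(ts)

    def _evict(self, now: int):
        # drop events that have aged past the window
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1

    def top(self, n: int = 500) -> list:
        return [t for t, c in self.counts.most_common(n) if c > 0]

w = TrendingWindow()
w.record(0, "#sunset"); w.record(10, "#sunset"); w.record(20, "#coffee")
w.record(4000, "#coffee")   # the t=0..20 events have aged out of the hour
assert w.top(2) == ["#coffee"]
```

The periodic `top(500)` snapshot is what gets written to the Redis list every 5 minutes.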
User Search
User search must support prefix search (type "san" and get "sanwar", "sandra", "santana") with typo tolerance. Elasticsearch's edge_ngram analyzer tokenizes usernames and display names into n-grams at index time, enabling fast prefix lookups. A custom scoring function boosts results by:
- Relationship proximity: Mutual followers rank above strangers for the same text match score.
- Account popularity: Verified accounts and accounts with follower counts > 10K receive a mild boost.
- Prior interaction: Accounts the user has messaged or tagged in the past 30 days are surfaced first.
Explore Page ML Ranking
The Explore page is a separate candidate generation pipeline optimized for interest expansion (discovering new creators) rather than social-graph reinforcement. It uses a combination of:
- Collaborative filtering: "Users similar to you also engaged with these posts" — computed nightly via a matrix factorization job (ALS on Spark).
- Content-based filtering: Vision embeddings of posts the user has engaged with are compared against a FAISS index of all recently uploaded content. Semantically similar images are surfaced.
- Topic affinity: Users are assigned a 128-dimensional interest vector. Content is tagged with topic distributions (food, travel, fitness, technology). Dot-product similarity drives exploration candidates.
The Explore candidate set (~1,000 posts) is then re-ranked by the same LightGBM model used for feed, with an additional novelty penalty applied to content the user has already seen. The final grid (3 columns × N rows) is served from a Redis cache keyed by explore:{user_id} with a 30-minute TTL before a refresh job regenerates it.
10. Direct Messages
Instagram DMs handle hundreds of billions of messages per year across 1-on-1 and group conversations. The messaging infrastructure must provide low-latency delivery (< 500 ms end-to-end on the same continent), durable message storage (messages persist indefinitely unless deleted by the user), and message status (sent, delivered, read) with eventual consistency.
Connection Management — WebSocket Gateway
When a user opens the Instagram app, the client establishes a persistent WebSocket connection to the nearest WebSocket Gateway server (a stateful service layer behind a TCP load balancer). Connection state (which user is on which gateway server) is maintained in a Redis Hash: user_connection:{user_id} → gateway_server_id. This lookup allows any service in the system to push a message to any online user.
The WebSocket Gateway fleet is sized at roughly 1 WebSocket server per 50,000 concurrent connections. With 100M concurrent users in peak hours and 10% having active WebSocket sessions (app in foreground), that's 10M concurrent connections — ~200 gateway servers. Each runs on a single-threaded async event loop (Node.js or Go's goroutine model) since WebSocket connections are primarily I/O-bound.
Message Flow — Send Path
- Sender's app sends a message payload over WebSocket to the Gateway server. Payload: `{client_msg_id, conversation_id, content, timestamp}`. The `client_msg_id` is a client-generated UUID used for deduplication.
- Gateway forwards the message to the Messaging Service via gRPC.
- Messaging Service: (a) deduplicates using `client_msg_id` in Redis with a 24-hour TTL; (b) assigns a server-side `message_id` (Snowflake-style monotonic ID); (c) persists to Cassandra with a QUORUM write; (d) publishes a `message.sent` Kafka event.
- An ACK is returned through the sender's Gateway, and the sender's app shows the "sent" tick (✓).
- A Delivery Worker consumes the Kafka event, looks up the recipient's connection in Redis, and pushes the message to the recipient's Gateway server (which relays it over WebSocket). On success, a "delivered" status (✓✓) is relayed back to the sender.
- If the recipient is offline, the message is queued in a Redis List `offline_queue:{user_id}` and a push notification is dispatched via APNs/FCM. On next app open, the client fetches missed messages from Cassandra using its last-seen message ID as a cursor.
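The deduplication in step (a) is what turns at-least-once delivery into effectively-once persistence. A Python sketch of the logic (in production this is a Redis `SET key EX 86400` check, not an in-process dict; all names illustrative):

```python
class MessageDeduper:
    """Remember client_msg_ids for 24 hours so a retried send (same UUID)
    is acknowledged without a second Cassandra write."""

    TTL = 86_400  # seconds

    def __init__(self):
        self.seen = {}  # client_msg_id -> expiry timestamp

    def is_duplicate(self, client_msg_id: str, now: float) -> bool:
        # purge expired entries lazily (Redis does this via key expiry)
        self.seen = {k: exp for k, exp in self.seen.items() if exp > now}
        if client_msg_id in self.seen:
            return True
        self.seen[client_msg_id] = now + self.TTL
        return False

d = MessageDeduper()
assert d.is_duplicate("uuid-1", now=0) is False      # first delivery: persist + ACK
assert d.is_duplicate("uuid-1", now=5) is True       # client retry: ACK only
assert d.is_duplicate("uuid-1", now=90_000) is False # past 24 h: treated as new
```

The 24-hour TTL bounds memory while still covering any realistic client retry window (a phone that reconnects after a flight, for example).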
// Cassandra message storage schema
CREATE TABLE messages (
conversation_id UUID,
message_id BIGINT, -- Snowflake ID (time-ordered)
sender_id BIGINT,
content_type TINYINT, -- 1=text, 2=photo, 3=video, 4=voice, 5=reaction
content TEXT,
client_msg_id UUID, -- for deduplication
created_at TIMESTAMP,
deleted_at TIMESTAMP, -- soft delete
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
AND compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 7};
End-to-End Encryption
Instagram's end-to-end encrypted (E2EE) messaging uses the Signal Protocol (Double Ratchet Algorithm + X3DH key exchange). In E2EE mode:
- Each device generates an identity key pair, a signed pre-key pair, and a batch of one-time pre-keys. Public keys are uploaded to Instagram's key distribution server (KDS).
- On initiating an E2EE conversation, the sender fetches the recipient's public key bundle from the KDS and performs X3DH to derive a shared session secret.
- Subsequent messages use the Double Ratchet for forward secrecy — even if a session key is compromised, past messages remain protected.
- Encrypted ciphertext is stored in Cassandra. Instagram's servers cannot read E2EE message content. Backup is encrypted with a user-held passphrase-derived key.
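The forward-secrecy property of the Double Ratchet comes from its one-way KDF chain. Below is a deliberately simplified sketch of one symmetric-ratchet step (stdlib HMAC only; real Signal also interleaves a Diffie-Hellman ratchet, omitted here):

```python
import hmac
import hashlib

def kdf_chain_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """One symmetric-ratchet step: derive a one-time message key and the
    next chain key from the current chain key via HMAC-SHA256. HMAC is
    one-way, so compromising the current chain key reveals nothing about
    earlier message keys -- the forward-secrecy property."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return message_key, next_chain_key

# Stand-in for the X3DH-derived session secret
ck = hashlib.sha256(b"shared secret from X3DH").digest()
mk1, ck = kdf_chain_step(ck)  # key for message 1, then ratchet forward
mk2, ck = kdf_chain_step(ck)  # key for message 2
assert mk1 != mk2             # every message gets a fresh key
```

Each message key is used once and discarded; an attacker who captures the chain key at step N can only derive keys from step N onward, never backward.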
Group Chats (up to 32 Participants)
Group messages use a sender-key model (same as WhatsApp groups): the sender generates a single "sender key" for the group, encrypts one copy of the message, and distributes the sender key to each member once. This reduces encryption overhead from O(N) per message to O(1) per message + O(N) for key distribution — critical when N = 32 and message frequency is high.
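The cost difference is easy to quantify. A minimal arithmetic sketch (function names are illustrative) comparing naive pairwise encryption against the sender-key model:

```python
def pairwise_cost(n_members: int, n_messages: int) -> int:
    # Naive pairwise E2EE: every message is encrypted once per recipient
    return n_messages * (n_members - 1)

def sender_key_cost(n_members: int, n_messages: int) -> int:
    # Sender-key model: one-time O(N) key distribution, then O(1) per message
    return (n_members - 1) + n_messages

# A 32-person group exchanging 1,000 messages:
assert pairwise_cost(32, 1000) == 31_000   # 31 encryptions per message
assert sender_key_cost(32, 1000) == 1_031  # 31 key shares + 1 encryption each
```

At 32 members the sender-key model cuts encryption work roughly 30×, and the gap widens as message volume grows.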
11. Capacity Estimation
Capacity estimation in a system design interview grounds your design in reality and demonstrates engineering intuition. Work through these numbers methodically. All figures below are approximate and representative of Instagram's public scale.
Upload Traffic & Storage
| Metric | Calculation | Result |
|---|---|---|
| Daily uploads | Given (photos + videos) | 100M/day |
| Avg upload rate | 100M / 86,400s | ~1,160 uploads/sec |
| Peak upload rate | avg × 3× peak factor | ~3,500 uploads/sec |
| Avg photo size (processed) | 3 resolutions × ~500 KB avg | ~1.5 MB per photo |
| Avg video size (processed) | 3 resolutions × ~20 MB avg for 60s | ~60 MB per video |
| Daily media storage added | 80M photos × 1.5 MB + 20M videos × 60 MB | ~1.3 PB/day |
| Annual media storage | 1.3 PB × 365 days | ~475 PB/year |
| Metadata storage (PostgreSQL) | 100M posts × 500 bytes/row | ~50 GB/day (manageable) |
Feed & CDN Traffic
- Feed reads: 500M DAU × 5 feed refreshes/day = 2.5 billion feed reads/day ≈ 29,000 feed RPS average; ~90,000 at peak.
- CDN requests: Each feed refresh loads ~20 thumbnail images. 2.5B × 20 = 50 billion CDN requests/day ≈ 580,000 CDN requests/sec on average (~1.7M/sec at a 3× peak factor).
- CDN bandwidth: 50B requests × 30 KB avg thumbnail = 1.5 PB CDN traffic/day ≈ 140 Gbps average egress.
- Redis timeline cache: 500M active users × 300 post IDs × 8 bytes = ~1.2 TB Redis memory for timeline cache (fits in a 40-node Redis cluster at 32 GB/node).
- Cassandra message storage: 100B messages/year × 500 bytes avg = ~50 TB/year, comfortably managed across a 100-node Cassandra cluster with 3× replication.
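The table and bullets above reduce to a few lines of arithmetic. A back-of-envelope script (all inputs are the figures stated above) makes the derivation checkable:

```python
# Back-of-envelope capacity check using the inputs from the section above
DAY_S = 86_400

# Upload traffic & storage
uploads_day = 100e6
photos, videos = 80e6, 20e6          # daily split
photo_mb, video_mb = 1.5, 60.0       # processed sizes per item

upload_rps = uploads_day / DAY_S                              # ~1,160/s avg
storage_pb_day = (photos * photo_mb + videos * video_mb) / 1e9  # MB -> PB

# Feed & CDN traffic
feed_reads_day = 500e6 * 5           # 500M DAU x 5 refreshes
feed_rps = feed_reads_day / DAY_S                             # ~29K RPS avg
cdn_req_day = feed_reads_day * 20    # ~20 thumbnails per refresh
cdn_gbps = cdn_req_day * 30e3 * 8 / DAY_S / 1e9               # 30 KB avg, in Gbps

# Redis timeline cache
redis_tb = 500e6 * 300 * 8 / 1e12    # 300 post IDs x 8 bytes per user

print(f"{upload_rps:,.0f} uploads/s, {storage_pb_day:.2f} PB/day, "
      f"{feed_rps:,.0f} feed RPS, {cdn_gbps:.0f} Gbps CDN, {redis_tb:.1f} TB Redis")
```

Running it reproduces the headline numbers: ~1,157 uploads/sec, ~1.32 PB/day of media, ~29K feed RPS, ~139 Gbps average CDN egress, and 1.2 TB of Redis timeline memory.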
12. System Design Interview Checklist
Use this checklist during your interview walkthrough to ensure you cover all the dimensions a senior engineer and interviewer will probe. Instagram-style social media system design interviews typically run 45–60 minutes; allocate time accordingly.
Phase 1: Requirements Clarification (5 min)
- ✅ Confirm scope: photos only, or videos too? Stories? DMs?
- ✅ Clarify user scale (2B MAU? 500M DAU?)
- ✅ Confirm read:write ratio expectation (typically 100:1)
- ✅ Latency SLA for feed load (< 200 ms P99?)
- ✅ Consistency requirements (eventual OK for likes? Strong for user data?)
Phase 2: Capacity Estimation (5–7 min)
- ✅ Calculate uploads/sec (avg and peak 3×)
- ✅ Estimate media storage growth per year (~475 PB)
- ✅ Calculate feed RPS (500M DAU × 5 refreshes / 86,400s ≈ 29K RPS)
- ✅ Estimate CDN bandwidth (50B requests × 30 KB ≈ 140 Gbps)
- ✅ Size Redis timeline cache (1.2 TB)
Phase 3: High-Level Architecture (10 min)
- ✅ Draw API gateway, auth, and rate limiting tier
- ✅ Show major services: User, Media, Post, Feed, Social Graph, DM, Story, Search, Notification
- ✅ Show data stores: PostgreSQL (user/post metadata), Cassandra (timelines/messages), Redis (cache), S3 (media), Kafka (event bus), Elasticsearch (search)
- ✅ Show CDN in front of all media reads
Phase 4: Deep Dives (20–25 min)
- ✅ Media upload: explain pre-signed S3 multipart + resumability
- ✅ Feed generation: explain push vs. pull vs. hybrid, celebrity fan-out problem
- ✅ Feed ranking: explain two-tower candidate generation + LightGBM re-ranking
- ✅ Social graph sharding: adjacency list, dual-table mirror, celebrity shards
- ✅ Stories TTL: Redis EXPIREAT + Cassandra column TTL + S3 lifecycle
- ✅ CDN: PoP network, adaptive resizing, cache invalidation strategy
Phase 5: Bottlenecks, Failures & Tradeoffs (5–8 min)
- ✅ What happens if Kafka lags? Fan-out delay — use priority queue for verified accounts
- ✅ What if Redis goes down? Fall back to on-demand Cassandra pull; rebuild cache on recovery
- ✅ How do you handle a hot user (Justin Bieber problem)? Hybrid push-pull with celebrity threshold
- ✅ How do you handle partial failures in the upload pipeline? S3 multipart expiry, dead-letter queue for failed transcoding jobs
- ✅ Global vs. single-region: active-active multi-region with eventual consistency via Kafka cross-region replication
Expert-Level Differentiators
Candidates who score "Strong Hire" consistently mention:
- Perceptual hash deduplication before storing media (saves significant storage)
- Sender-key encryption for group chats (O(1) message encryption, O(N) key distribution)
- Bloom filter for story seen/unseen tracking (O(1) membership, sublinear memory)
- FAISS approximate nearest neighbor for ML candidate retrieval (< 10 ms at billion-scale)
- Time-Window Compaction Strategy in Cassandra for time-series message data
- Origin shield to collapse CDN miss thundering herds before they hit S3
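The Bloom-filter point above is worth being able to sketch on the spot. A minimal stdlib version for story seen/unseen tracking (class and sizing are illustrative; production systems would use Redis's `BF.ADD`/`BF.EXISTS` or a tuned library):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions per item derived
    from SHA-256. False positives are possible (a story may rarely be
    marked seen when it wasn't); false negatives are impossible (a seen
    story is never shown as unseen)."""
    def __init__(self, m_bits: int = 1 << 20, k: int = 7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

seen = BloomFilter()
seen.add("user42:story1001")
assert seen.might_contain("user42:story1001")      # always true once added
assert not seen.might_contain("user42:story9999")  # unseen (tiny FP chance)
```

One megabit per user tracks tens of thousands of seen stories at a sub-1% false-positive rate, versus 8+ bytes per story ID in an exact set — the "sublinear memory" claim above.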