Designing a Real-Time Notification System at Scale: WebSockets, SSE & Push
A real-time notification system is one of the most deceptively complex components in a large-scale platform. What appears to be "just sending a message to the browser" involves persistent connections, fan-out logic, mobile push infrastructure, cross-datacenter routing, and durability guarantees — all operating at the scale of millions of simultaneous users.
Requirements Analysis
Before diving into technology choices, it is critical to nail down requirements. Functional requirements for a notification system typically include:
- Deliver in-app notifications to online users in under 500ms
- Deliver push notifications to offline mobile users
- Support notification preferences (a user can mute channels)
- Persist unread notifications for users who reconnect
- Support multiple notification types (mentions, likes, system alerts, DMs)
Non-functional requirements are where the complexity lives:
- Support 10 million concurrent connections
- Handle 100,000 notifications/second at peak
- 99.99% delivery rate for high-priority notifications
- Horizontal scalability without downtime
- Notification deduplication
The scale numbers matter enormously for technology selection. A system handling 1,000 concurrent users can use polling. A system handling 10 million concurrent users cannot — with every client polling every 5 seconds, the fleet would absorb 2 million HTTP requests per second from polling alone, consuming server capacity on polling overhead before any real work is done.
Protocol Selection: WebSockets vs SSE vs Long Polling
Long Polling is the oldest technique and requires no special server infrastructure. The client makes an HTTP request, and the server holds it open until an event occurs or a timeout expires (typically 30–60 seconds), then returns the response. The client immediately makes another request. Long polling works everywhere HTTP works and requires no special browser support, but it has significant drawbacks: each connection consumes a full HTTP connection and thread on the server, latency equals the polling interval plus request round-trip time, and HTTP overhead (headers, TCP handshake) is repeated every cycle. At scale, long polling becomes extremely expensive.
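The long-polling cycle described above can be sketched as a small client loop. This is a minimal illustration with an injected `fetch` callable standing in for the held-open HTTP GET (the real transport and timeout handling are assumed, not shown):

```python
def long_poll_loop(fetch, handle, max_cycles=3, timeout_s=30):
    """Minimal long-polling client loop.

    `fetch(timeout_s)` stands in for an HTTP GET that the server holds
    open until an event arrives or the timeout expires; it returns a
    list of events (possibly empty on timeout).
    """
    for _ in range(max_cycles):
        events = fetch(timeout_s)  # blocks until event or timeout
        for event in events:
            handle(event)
        # the loop immediately re-issues the request, as described above

# Example with a stubbed transport: two cycles return events, one times out.
responses = iter([["mention"], [], ["like", "dm"]])
received = []
long_poll_loop(lambda t: next(responses), received.append)
print(received)  # → ['mention', 'like', 'dm']
```

Note that even this idle middle cycle cost a full request/response round trip — the overhead the section above warns about.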
Server-Sent Events (SSE) is a one-way HTTP/1.1 streaming protocol where the server pushes events to the client over a persistent HTTP connection. The browser has a built-in EventSource API that automatically reconnects on disconnect. SSE is perfect for notification feeds, activity streams, and dashboards — scenarios where the server pushes to the client and the client does not need to send messages back on the same connection. SSE works through HTTP/2 multiplexing and most proxies and CDNs, making it operationally simpler than WebSockets. The limitation is unidirectionality — the client cannot send messages back through the SSE connection.
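The SSE wire format itself is trivially simple, which is part of its operational appeal: each event is a few text fields terminated by a blank line. A small serializer, as a sketch of what a server would write to the open connection:

```python
def format_sse(data, event=None, event_id=None):
    """Serialize one Server-Sent Event per the SSE wire format:
    optional `event:` and `id:` fields, one `data:` line per payload
    line, terminated by a blank line. The `id:` field is what lets
    EventSource resume from the last seen event after auto-reconnect."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"

print(format_sse('{"type":"mention","from":"alice"}', event="notification", event_id="42"))
```

On the browser side, `new EventSource(url)` consumes this stream and fires `message` (or named-event) handlers, reconnecting automatically and sending the last `id:` back in a `Last-Event-ID` header.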
WebSockets provide full-duplex bidirectional communication over a single TCP connection. After an HTTP upgrade handshake, both client and server can send frames at any time without request/response overhead. WebSockets are ideal for chat applications, collaborative editing, gaming, and any scenario requiring bidirectional real-time communication. The overhead per message is minimal (2–14 bytes for the frame header vs hundreds of bytes for HTTP headers). The operational complexity is higher: WebSocket connections are stateful, proxies must be WebSocket-aware, and horizontal scaling requires session affinity or pub/sub routing.
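The 2–14 byte framing overhead quoted above follows directly from the RFC 6455 header layout, and can be computed explicitly:

```python
def ws_header_size(payload_len, masked):
    """Size in bytes of a WebSocket frame header (RFC 6455):
    2 base bytes, plus 2 extended-length bytes for payloads of
    126-65535 bytes or 8 bytes beyond that, plus a 4-byte masking
    key on client-to-server frames (which must be masked)."""
    size = 2
    if payload_len >= 126:
        size += 2 if payload_len <= 0xFFFF else 8
    if masked:
        size += 4
    return size

assert ws_header_size(100, masked=False) == 2        # smallest server frame
assert ws_header_size(1_000_000, masked=True) == 14  # largest client header
```

Compare that 2-byte minimum against the several hundred bytes of headers on a typical HTTP request, repeated on every poll cycle.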
For a general-purpose notification system at LinkedIn/Slack scale, the recommended approach is: WebSockets for mobile/desktop apps where bidirectionality is valuable (presence updates, message acknowledgements), SSE as a fallback for browser clients where WebSockets are blocked, and HTTPS long polling as the last-resort fallback for restrictive corporate networks.
High-Level Architecture
The system consists of several distinct layers. At the edge, a fleet of WebSocket/SSE gateway servers maintain persistent connections with clients. These are stateless in terms of business logic but stateful in terms of connections — each server holds open N connections from clients. A consistent hashing ring maps each user ID to a specific gateway server, ensuring reconnections land on the same server (or a predictable set of servers).
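The consistent hashing ring mentioned above can be sketched in a few lines. This is a toy illustration (CRC32 as a stable hash, virtual nodes to smooth the distribution), not a production implementation:

```python
import bisect
import zlib

class GatewayRing:
    """Toy consistent-hash ring mapping user IDs to gateway servers.
    `replicas` virtual nodes per server smooth the key distribution;
    removing one gateway only remaps the keys it owned."""

    def __init__(self, gateways, replicas=100):
        self.ring = sorted(
            (zlib.crc32(f"{g}#{i}".encode()), g)
            for g in gateways for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    def gateway_for(self, user_id):
        # First virtual node clockwise from the user's hash owns the user.
        idx = bisect.bisect(self.keys, zlib.crc32(user_id.encode())) % len(self.ring)
        return self.ring[idx][1]

ring = GatewayRing(["gw-1", "gw-2", "gw-3"])
# The same user always maps to the same gateway across reconnections.
assert ring.gateway_for("user-42") == ring.gateway_for("user-42")
```

This determinism is what makes reconnections land on a predictable server even when the client hits a load balancer first.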
Behind the gateways, a notification service receives events from upstream services (comment posted, payment received, friend request sent) via Kafka topics, processes them through user preference rules, and routes them to the appropriate gateway servers via Redis Pub/Sub. The notification service writes every notification to a notification store (Cassandra) for persistence, enabling delivery to offline users when they reconnect.
For mobile users, a push notification service integrates with Apple Push Notification service (APNs) and Firebase Cloud Messaging (FCM). Mobile push is the only delivery channel when users are offline — WebSocket connections don't survive app backgrounding on iOS/Android.
Connection Management at Scale
Managing 10 million concurrent WebSocket connections requires careful resource planning. Each WebSocket connection in a modern event-driven server (Node.js, Netty, Go) consumes roughly 50–100KB of memory for buffers and connection state. At 10M connections, that is 500GB–1TB of memory across the gateway fleet — roughly 64–128 servers with 8GB of connection memory each, or 16–32 servers with 32GB each.
The key insight is using non-blocking I/O event loops rather than thread-per-connection models. A Netty-based Java server or a Go HTTP server with goroutines can handle 50,000–100,000 concurrent WebSocket connections per server, whereas a traditional Tomcat thread-per-request model exhausts its thread pool after a few hundred to a few thousand connections per server, since each idle connection pins an OS thread (roughly 1MB of stack) for its entire lifetime.
Connection registration must be fast. When a client connects, the gateway registers the connection in Redis: SET user:{userId}:gateway {gatewayId} with a TTL equal to the connection heartbeat timeout (typically 90 seconds, renewed on each heartbeat). When a notification arrives, the system looks up which gateway holds the target user's connection and routes accordingly.
# Connection registration on WebSocket connect
HSET connections:{gatewayId} {userId} {connectionId}
SET user:{userId}:gateway {gatewayId} EX 90
# Heartbeat renewal every 30 seconds
EXPIRE user:{userId}:gateway 90
# On disconnect
HDEL connections:{gatewayId} {userId}
DEL user:{userId}:gateway
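The semantics of those commands can be modeled in memory to make the TTL behavior concrete. This is a sketch, not a Redis client — a real deployment would issue the commands above (e.g. via redis-py); an explicit `now` clock stands in for Redis key expiry:

```python
class ConnectionRegistry:
    """In-memory model of the registry commands above (SET ... EX / EXPIRE).
    An entry whose TTL has lapsed behaves exactly like a missing key,
    which is how dead connections fall out of the registry."""

    def __init__(self, ttl=90):
        self.ttl = ttl
        self.user_gateway = {}  # user_id -> (gateway_id, expires_at)

    def connect(self, user_id, gateway_id, now):
        self.user_gateway[user_id] = (gateway_id, now + self.ttl)

    def heartbeat(self, user_id, now):
        # EXPIRE: renew the TTL without changing the mapped gateway.
        if user_id in self.user_gateway:
            gw, _ = self.user_gateway[user_id]
            self.user_gateway[user_id] = (gw, now + self.ttl)

    def lookup(self, user_id, now):
        entry = self.user_gateway.get(user_id)
        if entry is None or now >= entry[1]:
            return None
        return entry[0]

reg = ConnectionRegistry()
reg.connect("u1", "gw-7", now=0)
assert reg.lookup("u1", now=60) == "gw-7"
assert reg.lookup("u1", now=120) is None  # TTL lapsed without heartbeat
```

The 90-second TTL with 30-second heartbeats gives two missed heartbeats of slack before a connection is considered gone.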
Fan-Out on Write vs Fan-Out on Read
Fan-out is the process of distributing a single notification event to potentially thousands of recipients. Consider a Twitter-like platform where a celebrity with 10 million followers posts a tweet — the system must deliver a notification to 10 million users. There are two fundamentally different strategies.
Fan-out on write (push model) writes the notification to each recipient's notification inbox at write time. When the event occurs, a worker iterates over all followers and writes one record per follower. Reads are fast (just fetch your own inbox), but writes are expensive — a celebrity post triggers 10 million writes. This model works well when follower counts are bounded (typical enterprise Slack workspace) or when read latency is critical.
Fan-out on read (pull model) writes the event once to a shared feed, and each reader fetches and merges from followed accounts at read time. Writes are cheap (one write regardless of follower count), but reads are expensive (must fetch from N followed accounts and merge). This is the model Twitter historically used for non-celebrity accounts.
Real systems use a hybrid: fan-out on write for normal users (follower count < 10,000), and fan-out on read for high-follower accounts ("celebrity problem"). LinkedIn uses this hybrid model. Slack, operating within bounded team sizes, uses fan-out on write consistently.
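The hybrid decision reduces to a threshold check at event time. A minimal sketch, with in-memory dicts standing in for the notification store and shared feed (the 10,000 threshold comes from the text above):

```python
CELEBRITY_THRESHOLD = 10_000  # follower count above which we switch to pull

def fan_out(event, author, followers, inboxes, shared_feed):
    """Hybrid fan-out: push (one write per follower) for normal
    accounts, pull (a single shared-feed write) for celebrity accounts
    whose followers merge it in at read time."""
    if len(followers) < CELEBRITY_THRESHOLD:
        for user in followers:  # fan-out on write
            inboxes.setdefault(user, []).append(event)
    else:
        shared_feed.setdefault(author, []).append(event)  # fan-out on read

inboxes, feed = {}, {}
fan_out("post:1", "alice", ["bob", "carol"], inboxes, feed)
assert inboxes["bob"] == ["post:1"] and "alice" not in feed
```

With this split, a celebrity post costs one write instead of millions, at the price of a merge step on each follower's read path.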
Redis Pub/Sub for Message Routing
Once a notification is ready for delivery, it must reach the specific gateway server holding the target user's WebSocket connection. Redis Pub/Sub provides an efficient in-memory channel for this inter-server routing.
Each gateway server subscribes to a Redis channel named for its own ID: SUBSCRIBE gateway:{gatewayId}. The notification service, after looking up which gateway holds a user's connection, publishes to that specific channel: PUBLISH gateway:{gatewayId} {notificationPayload}. The gateway receives the message and forwards it down the appropriate WebSocket connection.
Redis Pub/Sub has an important limitation: it is fire-and-forget with no persistence. If a gateway server is momentarily overloaded and misses a Pub/Sub message, the notification is lost. For high-priority notifications (payment confirmations, security alerts), use Redis Streams or Kafka instead, which provide consumer group semantics and message durability. The notification service writes to Kafka; each gateway server is a Kafka consumer in its own consumer group, processing only the partition assigned to it.
Kafka for Durability and Replay
Kafka serves as the backbone of the notification pipeline, providing durability and enabling offline delivery. Every notification event is published to a Kafka topic partitioned by user ID (ensuring ordered delivery per user). The notification service consumes from upstream topics (social events, transaction events, system events), enriches them with user preference rules (should this user receive this type of notification? are they in do-not-disturb hours?), and publishes to the notification delivery topic.
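Partitioning by user ID preserves per-user ordering because a stable hash sends every event for a user to the same partition, and each partition has exactly one consumer within a consumer group. A sketch of the assignment (CRC32 stands in for the partitioner a real Kafka client would use):

```python
import zlib

def partition_for(user_id, num_partitions=32):
    """Stable partition assignment: all events for one user land on the
    same partition, so a single consumer sees them in publish order."""
    return zlib.crc32(user_id.encode()) % num_partitions

# Repeated calls for the same user are deterministic:
assert partition_for("user-42") == partition_for("user-42")
assert 0 <= partition_for("user-42") < 32
```

Note the corollary: ordering is guaranteed only per key. Events for different users on different partitions may be consumed in any interleaving.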
Kafka's transactional producer and consumer-offset semantics enable effectively-once processing with careful design. The notification service consumes an upstream event, writes the enriched notification to the delivery topic, and commits the consumer offset atomically within a single Kafka transaction; the write to Cassandra (for persistence) sits outside that transaction, so it is made idempotent — keyed by notification ID — to make reprocessing safe. If the notification service crashes mid-flight, the offset is not committed and the notification is reprocessed from the last committed offset on restart.
Dead letter queues (DLQ) handle notifications that fail delivery after N retries. A notification that consistently fails delivery (invalid device token, user account deleted) is moved to a DLQ topic for asynchronous investigation rather than blocking the main delivery pipeline. A monitoring alert fires when DLQ depth exceeds a threshold, indicating a systemic delivery problem.
Mobile Push: APNs and FCM
Mobile push notifications operate through Apple Push Notification service (APNs) for iOS and Firebase Cloud Messaging (FCM) for Android. These are the only channels available when a user's app is backgrounded or the device is offline — the mobile OS maintains a persistent connection to Apple's/Google's infrastructure even when apps are not running.
The push notification service maintains a device token registry: a mapping from user ID to device tokens (a user may have multiple devices). When a mobile user is detected as offline (no active WebSocket connection), the push service sends a push notification through APNs/FCM rather than dropping it.
Device token management is operationally complex. Tokens expire, change on app reinstall, and become invalid when users uninstall the app. APNs returns error codes for invalid tokens; FCM returns registration_not_registered. The push service must handle these responses, marking tokens as invalid and cleaning up the registry. Failure to do so results in increasingly degraded push delivery rates as stale tokens accumulate.
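The cleanup loop is simple once provider responses are normalized. A sketch with an in-memory registry; the error strings are illustrative stand-ins for the APNs reason codes and the FCM `registration_not_registered` response mentioned above:

```python
# Responses that mean "purge this token"; illustrative stand-ins for
# the APNs/FCM error codes described in the text.
INVALID_TOKEN_ERRORS = {"Unregistered", "BadDeviceToken", "registration_not_registered"}

def prune_tokens(registry, user_id, send_results):
    """Drop device tokens the push provider rejected as invalid.
    `send_results` maps token -> error string (or None on success)."""
    tokens = registry.get(user_id, [])
    registry[user_id] = [
        t for t in tokens
        if send_results.get(t) not in INVALID_TOKEN_ERRORS
    ]

registry = {"u1": ["tok-a", "tok-b"]}
prune_tokens(registry, "u1", {"tok-a": None, "tok-b": "Unregistered"})
assert registry["u1"] == ["tok-a"]
```

Running this on every send, rather than as a periodic sweep, keeps the registry from accumulating the stale tokens that silently erode delivery rates.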
For high-volume push delivery (10M notifications in 10 minutes after a major event), the system must use APNs HTTP/2 connection pooling — APNs supports multiplexing hundreds of requests over a single HTTP/2 connection, and maintaining a pool of 50–100 persistent HTTP/2 connections to APNs enables throughputs of millions of notifications per minute from a single push service instance.
User-Level Throttling and Preferences
Notification spam is a major driver of user churn. A notification system without throttling will eventually produce the degenerate behavior seen in many apps: dozens of notifications per hour that users learn to ignore, leading to notification permission revocation on mobile. Throttling must be a first-class concern, not an afterthought.
Each notification type should have a configurable per-user rate limit. Social platform likes might be batched (instead of 50 individual "X liked your post" notifications, send one "X and 49 others liked your post" after a 5-minute batching window). System alerts bypass throttling. Marketing notifications are most aggressively throttled — typically no more than 2 per day per user across all channels.
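The batching rule above amounts to grouping like events by post within the window and emitting one summary. A minimal sketch of the aggregation step (the window timer itself is assumed):

```python
from collections import defaultdict

def batch_likes(events):
    """Collapse individual (actor, post_id) like events collected during
    one batching window into a single summary notification per post:
    'X liked your post' or 'X and N others liked your post'."""
    by_post = defaultdict(list)
    for actor, post_id in events:
        by_post[post_id].append(actor)
    summaries = []
    for post_id, actors in by_post.items():
        if len(actors) == 1:
            summaries.append(f"{actors[0]} liked your post")
        else:
            summaries.append(f"{actors[0]} and {len(actors) - 1} others liked your post")
    return summaries

msgs = batch_likes([("ana", "p1"), ("ben", "p1"), ("cy", "p1")])
assert msgs == ["ana and 2 others liked your post"]
```

Fifty raw events become one notification, which is exactly the reduction that keeps users from revoking push permission.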
User preferences must be enforced at the notification service layer, not at the gateway. Checking preferences at the gateway wastes all the compute spent on fan-out and message routing — if a user has muted a channel, that check should happen before fan-out begins. Store preferences in a low-latency read store (Redis or a distributed cache) so the preference lookup adds sub-millisecond overhead to the notification enrichment path.
Real-World Examples
Slack uses a combination of WebSockets for desktop/browser clients and APNs/FCM for mobile. Slack's gateway layer (called the "RTM" — Real Time Messaging — API) maintains WebSocket connections per user per client. When a message is sent to a Slack channel, the system fans out to all channel members who are currently connected. Slack's infrastructure runs on AWS and uses their internal pub/sub system built on top of Kafka for cross-region notification routing.
WhatsApp famously runs on Erlang/OTP, using the XMPP protocol over persistent TCP connections. WhatsApp's server design, documented in various technical talks, prioritizes connection efficiency — their servers handle hundreds of thousands of connections each, leveraging Erlang's lightweight process model where each connection is a separate Erlang process with minimal memory overhead.
LinkedIn built a notification system (described in engineering blog posts) processing over 7 billion notification events per day. LinkedIn uses Kafka extensively, with each notification type on its own topic. Their fan-out system uses the hybrid write/read model described above, with a "social graph cache" maintaining follower lists in memory for fast fan-out computation. LinkedIn's experience surfaced the celebrity problem acutely — certain LinkedIn influencers have millions of followers, making naive fan-out on write catastrophically expensive.
See also scalable system design principles and event-driven architecture patterns for broader context on the architecture decisions underpinning notification systems.
Scaling Considerations and Failure Modes
The most common failure mode in notification systems is the reconnection storm. When a gateway server restarts (during deployment or after a crash), all its clients detect the connection drop and attempt to reconnect simultaneously. If they all reconnect to the same new server, that server is immediately overwhelmed. Exponential backoff with jitter on the client side (±30% random factor, base 1 second, max 30 seconds) is essential to distribute reconnections over time. A fleet of 100 servers each handling 100,000 connections means a rolling restart affects 100,000 clients per server restart — without jitter, that is a 100,000-connection spike every few minutes.
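The backoff policy described above (base 1 second, max 30 seconds, ±30% jitter) is a few lines of client code. A sketch, with an injectable RNG for testability:

```python
import random

def reconnect_delay(attempt, base=1.0, cap=30.0, jitter=0.3, rng=random):
    """Exponential backoff with +/-30% jitter: double the delay on each
    failed attempt, cap it, then randomize so clients that disconnected
    together do not reconnect in lockstep."""
    delay = min(cap, base * (2 ** attempt))
    return delay * rng.uniform(1 - jitter, 1 + jitter)

# Attempt 3 targets 8s; jitter spreads it across [5.6s, 10.4s].
d = reconnect_delay(3)
assert 8 * 0.7 <= d <= 8 * 1.3
```

The jitter term is the critical part: without it, 100,000 clients compute identical delays and the spike simply repeats at each backoff step instead of dissipating.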
The second major failure mode is notification ordering. When using multiple notification service instances, two notifications generated close together may be processed by different instances and arrive at the gateway out of order. For most notifications, order does not matter. For chat messages and edit history, it matters critically. Use Kafka partitioning by conversation ID to ensure all messages in a conversation are processed by the same consumer instance, preserving order. For the rate limiting and caching strategies that apply here, see the dedicated post.
For the broader system design patterns and database sharding strategies that underpin the notification store, refer to those dedicated deep-dives.
Key Takeaways
- Protocol selection is driven by scale: SSE for simple push-only notifications, WebSockets for bidirectional real-time, long polling as a fallback. Never use short polling at scale.
- Fan-out strategy depends on follower distribution: Fan-out on write for bounded follower counts, fan-out on read or hybrid for the celebrity problem.
- Redis for routing, Kafka for durability: Use Redis Pub/Sub for low-latency gateway routing, Kafka for durable event storage and offline delivery.
- Mobile push requires token lifecycle management: Stale APNs/FCM tokens must be cleaned up promptly or delivery rates degrade over time.
- Reconnection storms are predictable and preventable: Implement exponential backoff with jitter on the client before deploying a WebSocket gateway.
- Notification preferences must be enforced early: Before fan-out, not at delivery time, to avoid wasting compute on unwanted notifications.