Md Sanwar Hossain - Software Engineer

Software Engineer · Java · Spring Boot · Microservices

System Design · April 1, 2026 · 22 min read · System Design Mastery Series

Designing a Global Chat System at Scale: WebSockets, Message Ordering & Read Receipts

Designing a chat system at the scale of Slack or Discord is one of the most instructive exercises in distributed systems engineering. It forces you to confront the full spectrum of hard problems: maintaining 50 million persistent WebSocket connections simultaneously, ordering messages across distributed nodes without a central clock, fanning out a single message to 10,000 channel members in under 100 milliseconds, tracking who is online across hundreds of millions of users, and delivering read receipts reliably even when recipients are offline. This deep-dive walks through every layer of the system with precise capacity estimates, architectural trade-offs, and production-tested design decisions.

Table of Contents

  1. Requirements and Scale Estimation
  2. WebSocket vs Long Polling vs SSE
  3. Connection Management at Scale
  4. Message Storage and Ordering
  5. Message Fanout Architecture
  6. Presence Detection System
  7. Read Receipts and Delivery Guarantees
  8. Multi-Region Deployment
  9. Key Takeaways
  10. Conclusion

1. Requirements and Scale Estimation

Chat System Design | mdsanwarhossain.me

Functional requirements define the core capabilities: one-to-one direct messaging between users; group channels supporting up to 10,000 members; searchable message history; file and media attachments up to 100 MB; read receipts showing which members have seen a message; online/offline/away presence indicators for a user's contact list; and push notifications for offline users via APNs (Apple) and FCM (Google).

Non-functional requirements drive the architecture: 1 billion registered users, 50 million daily active users (DAU), each sending an average of 20 messages per day. That gives us 50M × 20 = 1 billion messages per day. Converted: 1,000,000,000 ÷ 86,400 seconds ≈ 11,574 messages per second at average load. Accounting for 3× peak traffic: 34,722 messages per second at peak. At 200 bytes average per message: 200 bytes × 1B messages/day = 200 GB of new message data per day, or approximately 73 TB per year. Keeping 5 years of history: ~365 TB of message storage.

WebSocket connections: 50M DAU, each maintaining one persistent WebSocket connection while the app is open. Assuming average session duration of 4 hours spread across 16 active hours: average concurrent connections = 50M × (4/16) = 12.5M. At peak (lunch, evening): 50M concurrent connections. This is the number that shapes the entire infrastructure design — 50 million persistent stateful TCP connections requires a fundamentally different architecture from stateless HTTP endpoints.
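The arithmetic above is easy to misplace a zero in, so it helps to make it checkable. The following is a sketch; the class and method names are illustrative, and every constant is an assumption stated in the text:

```java
// Back-of-envelope capacity model; all constants are the assumptions above.
class CapacityEstimate {
    static final long DAU = 50_000_000L;
    static final long MSGS_PER_USER_PER_DAY = 20L;
    static final long AVG_MSG_BYTES = 200L;

    static long messagesPerDay()     { return DAU * MSGS_PER_USER_PER_DAY; }      // 1 billion
    static long avgMessagesPerSec()  { return messagesPerDay() / 86_400L; }       // ~11,574
    static long peakMessagesPerSec() { return avgMessagesPerSec() * 3L; }         // ~34,722
    static long newBytesPerDay()     { return messagesPerDay() * AVG_MSG_BYTES; } // 200 GB
    static long avgConcurrentConns() { return DAU * 4L / 16L; }                   // 12.5M
}
```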

Latency target: P99 message delivery under 100ms for users in the same region. Cross-region delivery under 300ms. Message ordering: within a channel, messages must appear in a globally consistent order to all participants. Availability: 99.99% uptime — maximum 52 minutes downtime per year. The combination of low latency, strong ordering, and high availability is the core design tension this system must navigate.

Key insight: The 50M concurrent WebSocket connections number is the single most constraining requirement. Each connection consumes approximately 10 KB of kernel memory for socket buffers. 50M connections = 500 GB RAM just for socket state, distributed across a large fleet of connection servers. Horizontal scaling with consistent hashing is the only viable approach.

2. WebSocket vs Long Polling vs SSE

The choice of transport protocol is the first architectural decision with major downstream consequences. Three options exist, each with distinct characteristics that make them suitable for different use cases.

WebSocket is the clear winner for chat. After an HTTP Upgrade handshake, both parties can send frames in either direction at any time over a single persistent TCP connection. The frame overhead is minimal — 2 bytes for small frames vs. full HTTP headers on every request. Full duplex means the client can send a message while simultaneously receiving an incoming message, with no queuing. The WebSocket protocol (RFC 6455) is universally supported across modern browsers and mobile platforms.

Server-Sent Events (SSE) are unidirectional — only the server can push to the client. The client must use a separate HTTP request to send messages. SSE uses regular HTTP, which makes load balancing simpler (no stickiness required if HTTP/2 multiplexing is used), and browsers automatically handle reconnection. SSE is ideal for notification feeds, stock tickers, and dashboards — one-way push use cases. For a chat system it requires two connections per client (SSE for receive, HTTP for send), doubling connection overhead.

Long Polling is the HTTP-compatible fallback: the client sends an HTTP request, the server holds it open until a message is available (or a timeout occurs), responds, and the client immediately sends another request. It works in environments that block WebSocket upgrades (certain corporate firewalls, older proxies). Its latency is inherently higher than WebSocket because of the HTTP request-response cycle on every message. Battery impact on mobile is significant — the radio must wake up for each polling request.

// WebSocket HTTP Upgrade handshake
// Client sends:
GET /chat HTTP/1.1
Host: chat.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

// Server responds:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

// After upgrade, bidirectional frames flow over the same TCP connection
// Client sends message frame:  { "type": "message", "channelId": "ch-123", "text": "Hello!" }
// Server pushes incoming frame: { "type": "message", "from": "user-456", "text": "Hi back!" }

// Java Netty WebSocket handler (connection server)
public class ChatWebSocketHandler extends SimpleChannelInboundHandler<WebSocketFrame> {

    // Collaborators (objectMapper, messageRouter, connectionRegistry,
    // presenceService) are injected by the server bootstrap.

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, WebSocketFrame frame) throws Exception {
        if (frame instanceof TextWebSocketFrame textFrame) {
            ChatMessage msg = objectMapper.readValue(textFrame.text(), ChatMessage.class);
            messageRouter.route(msg, ctx.channel());
        } else if (frame instanceof PingWebSocketFrame) {
            ctx.writeAndFlush(new PongWebSocketFrame(frame.content().retain()));
        }
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) {
        // Resolve the userId before deregistering, or the lookup comes back empty
        String userId = connectionRegistry.getUserId(ctx.channel());
        connectionRegistry.deregister(ctx.channel());
        presenceService.markOffline(userId);
        ctx.fireChannelInactive();
    }
}

In practice, production chat systems implement a fallback hierarchy: WebSocket is attempted first; if the upgrade is rejected (firewall or old proxy), the client falls back to HTTP/2 multiplexed long polling; as a last resort, traditional long polling over HTTP/1.1. The fallback paths serve less than 2% of clients in modern environments, but their absence generates disproportionate support volume from enterprise customers behind restrictive network policies.

3. Connection Management at Scale

Real-time Event Architecture | mdsanwarhossain.me

Connection servers are stateful — they hold open WebSocket connections in memory. This statefulness is both necessary and expensive. Each connection server (a Netty-based JVM process) can handle approximately 100,000 concurrent WebSocket connections before memory and CPU become bottlenecks, meaning 50M concurrent connections require approximately 500 connection server instances.

Routing messages to the correct connection server requires knowing which server holds a user's connection. The canonical solution is a Redis hash mapping userId to connection server identity: when user Alice connects to server CS-47, the connection server writes HSET user:alice connectionServer cs-47 with a TTL. When a message must be delivered to Alice, the message router reads her entry from Redis and forwards the message to CS-47 via an internal pub/sub channel or gRPC call.

// Connection registry in Redis
// On connection established:
HSET user:{userId} connectionServer "cs-47" connectedAt "2026-04-01T10:00:00Z"
EXPIRE user:{userId} 300  // 5-minute TTL, refreshed by heartbeat

// Heartbeat: client sends ping every 30s, server refreshes TTL
// If no heartbeat for 90s, server closes connection and cleans up

// Routing a message to a user:
String targetServer = redis.hget("user:" + recipientId, "connectionServer");
if (targetServer != null) {
    // Forward to the connection server that holds the WebSocket
    internalMessageBus.publish("server:" + targetServer, messageJson);
} else {
    // User is offline - queue for push notification
    pushNotificationQueue.enqueue(recipientId, message);
}

// Connection server subscribes to its own channel
redis.subscribe("server:cs-47", (channel, msg) -> {
    ChatMessage chatMsg = deserialize(msg);
    Channel wsChannel = connectionRegistry.getChannel(chatMsg.getRecipientId());
    if (wsChannel != null && wsChannel.isActive()) {
        wsChannel.writeAndFlush(new TextWebSocketFrame(msg));
    }
});

Load balancing connection servers uses consistent hashing keyed on userId. When a client reconnects — common on mobile due to network switches between WiFi and cellular — consistent hashing attempts to route them back to the same server, exploiting any warm in-memory state. However, this is a best-effort optimization; the system is correct even when a client connects to a different server, it simply re-populates the connection registry.
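A minimal sketch of such a consistent-hash ring, assuming CRC32 as the hash function and illustrative server names; production rings typically use a stronger hash (e.g. murmur3) and tuned virtual-node counts:

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Minimal consistent-hash ring for userId -> connection-server routing.
// Virtual nodes smooth out load when servers join or leave the fleet.
class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    private static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    void addServer(String serverId) {
        for (int i = 0; i < virtualNodes; i++)
            ring.put(hash(serverId + "#" + i), serverId);
    }

    void removeServer(String serverId) {
        for (int i = 0; i < virtualNodes; i++)
            ring.remove(hash(serverId + "#" + i));
    }

    // Walk clockwise: the first virtual node at or after the key's hash owns it.
    String serverFor(String userId) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(userId));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

Removing a server only remaps the users it owned; everyone else keeps their existing assignment, which is exactly the property that makes reconnect routing cheap.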

Session resumption after a brief disconnect (under 30 seconds) uses a message buffer. The connection server keeps a per-user circular buffer of the last 50 messages received while the connection was down. On reconnect, the client sends its last received message sequence number, and the server replays any buffered messages. This prevents message loss during network blips without requiring the client to do a full message history fetch from the database — which is significantly more expensive at scale.
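The replay buffer described above can be sketched as a bounded deque keyed by sequence number; `ReplayBuffer` and `Msg` are illustrative names, not the original system's types:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Per-user replay buffer for short disconnects: keep the last N messages
// with their sequence numbers; on reconnect, replay everything past the
// client's last-seen sequence number.
class ReplayBuffer {
    static final class Msg {
        final long seq;
        final String payload;
        Msg(long seq, String payload) { this.seq = seq; this.payload = payload; }
    }

    private final int capacity;
    private final Deque<Msg> buffer = new ArrayDeque<>();

    ReplayBuffer(int capacity) { this.capacity = capacity; }

    synchronized void add(long seq, String payload) {
        if (buffer.size() == capacity) buffer.removeFirst(); // evict oldest
        buffer.addLast(new Msg(seq, payload));
    }

    // Everything the client has not yet seen, oldest first.
    synchronized List<Msg> replayAfter(long lastSeenSeq) {
        List<Msg> missed = new ArrayList<>();
        for (Msg m : buffer)
            if (m.seq > lastSeenSeq) missed.add(m);
        return missed;
    }
}
```

If the client's last-seen sequence has already been evicted from the buffer, the gap is larger than the buffer covers and the client must fall back to a history fetch from storage.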

Warning: Never use database writes in the critical path of WebSocket message delivery. Synchronous database writes per message at 34,722 msg/sec will saturate any relational database. Write messages asynchronously to Kafka first, then to storage — the WebSocket push path should touch only Redis and in-memory structures.

4. Message Storage and Ordering

Message ordering is one of the genuinely hard problems in distributed systems. Physical clocks on different machines drift and are not monotonic. If two users send messages simultaneously through different servers, wall-clock time cannot determine which came first: the clocks may disagree by milliseconds, producing a different ordering on different recipients' screens. That inconsistency breaks the fundamental contract of a chat system.

The production solution is Snowflake ID generation, first developed by Twitter. A Snowflake ID is a 63-bit integer composed of: 41 bits for millisecond timestamp (since epoch), 10 bits for machine/datacenter ID (allowing 1024 unique generators), and 12 bits for sequence number (4096 unique IDs per millisecond per machine). This gives globally sortable IDs that are monotonically increasing on each machine and nearly monotonically increasing globally (within clock skew tolerance).

// Snowflake ID generator (Java)
public class SnowflakeIdGenerator {
    private static final long EPOCH = 1700000000000L; // custom epoch
    private static final long MACHINE_ID_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_MACHINE_ID = (1L << MACHINE_ID_BITS) - 1;  // 1023
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;      // 4095
    private static final long MACHINE_SHIFT = SEQUENCE_BITS;
    private static final long TIMESTAMP_SHIFT = SEQUENCE_BITS + MACHINE_ID_BITS;

    private final long machineId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long machineId) {
        if (machineId < 0 || machineId > MAX_MACHINE_ID)
            throw new IllegalArgumentException("machineId out of range: " + machineId);
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long timestamp = System.currentTimeMillis() - EPOCH;
        if (timestamp < lastTimestamp) {
            // Clock moved backwards; refuse to issue non-monotonic IDs
            throw new IllegalStateException(
                "Clock moved backwards by " + (lastTimestamp - timestamp) + " ms");
        }
        if (timestamp == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted — wait for next millisecond
                while ((timestamp = System.currentTimeMillis() - EPOCH) <= lastTimestamp);
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = timestamp;
        return (timestamp << TIMESTAMP_SHIFT) | (machineId << MACHINE_SHIFT) | sequence;
    }
}

// Cassandra message storage schema
CREATE TABLE messages (
    channel_id  uuid,
    message_id  bigint,          -- Snowflake ID: naturally time-ordered
    sender_id   uuid,
    content     text,
    attachments list<text>,
    created_at  timestamp,
    PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1}
  AND default_time_to_live = 157680000;  -- 5 years in seconds

Cassandra is the ideal message store because its data model maps perfectly to the access patterns of a chat system. Partitioning by channel_id keeps all messages for a channel co-located on the same node, making channel history reads a single-partition scan. Clustering by message_id DESC returns the most recent messages first with no sorting overhead — Cassandra stores them already in order. For DM conversations, the partition key is a deterministic hash of the two user IDs sorted lexicographically, ensuring both directions of the conversation land on the same partition.

For channels with extremely high message volume (busy public channels with millions of members), a single Cassandra partition can grow too large. The mitigation is partition bucketing: include a time bucket in the partition key — PRIMARY KEY ((channel_id, bucket), message_id) where bucket is derived from the week or month. This bounds partition size while maintaining efficient range reads within a time window.
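Both conventions are easy to capture in small helpers. The key formats below are illustrative sketches, not the exact production encoding:

```java
// Illustrative key helpers: a sorted two-user DM partition key, and a
// weekly time bucket for high-volume channels.
class PartitionKeys {
    // Same key regardless of who is sender and who is recipient.
    static String dmConversationKey(String userA, String userB) {
        return userA.compareTo(userB) <= 0 ? userA + ":" + userB
                                           : userB + ":" + userA;
    }

    // Compound partition key component: week number since the Unix epoch,
    // bounding partition size while keeping range reads efficient.
    static String bucketedChannelKey(String channelId, long epochMillis) {
        long week = epochMillis / (7L * 24 * 3600 * 1000);
        return channelId + ":" + week;
    }
}
```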

5. Message Fanout Architecture

When a user sends a message to a group channel, that message must be delivered to every online member of the channel. For a channel with 10,000 members, this means potentially 10,000 WebSocket pushes per message. The fanout strategy is one of the most critical architectural decisions in a chat system's design.

Fanout-on-write pre-computes each recipient's inbox at send time. When a message arrives, the system iterates over all channel members and writes the message to each member's inbox queue. This makes reads very fast (fetch from your inbox) but writes very expensive for large channels. A single message to a 10,000-member channel triggers 10,000 write operations — untenable at scale.

Fanout-on-read stores the message once in the channel timeline and each client fetches the channel timeline on demand. Writes are cheap (one write per message), but reads require querying the channel's message history. This is efficient for large channels where most members are offline — there is no point pre-computing push notifications for users who won't receive them.

The production approach is hybrid: use fanout-on-write via Redis pub/sub for channels with fewer than 1,000 members (covers 95% of channels), and fanout-on-read with caching for large channels. The threshold is configurable and can be adjusted based on observed behavior.

// Fanout architecture using Kafka and Redis pub/sub
// 1. Message arrives at API server
public void sendMessage(ChatMessage message) {
    // Generate Snowflake ID
    message.setMessageId(snowflake.nextId());

    // Publish to Kafka for durable storage and async processing
    kafka.send("messages", message.getChannelId(), message);

    // For small channels: immediate fanout via Redis pub/sub
    if (channelMemberCount(message.getChannelId()) <= 1000) {
        String serialized = objectMapper.writeValueAsString(message);
        redis.publish("channel:" + message.getChannelId(), serialized);
    }
    // For large channels: connection servers poll channel feed
}

// Connection server subscribes to channels of its connected users
public void onUserConnect(String userId, Channel wsChannel) {
    connectionRegistry.register(userId, wsChannel);
    List<String> channelIds = channelService.getChannelsForUser(userId);
    for (String channelId : channelIds) {
        redis.subscribe("channel:" + channelId, (ch, msg) -> {
            // Only push if this user is a member of the channel
            ChatMessage chatMsg = deserialize(msg);
            if (channelService.isMember(channelId, userId)) {
                wsChannel.writeAndFlush(new TextWebSocketFrame(msg));
            }
        });
    }
}

// Kafka consumer: durable write to Cassandra
@KafkaListener(topics = "messages", groupId = "message-storage")
public void persistMessage(ChatMessage message) {
    cassandraTemplate.insert(message);
    // Update channel recency so conversation lists can sort by last activity
    redis.zadd("channels-by-activity", message.getMessageId(), message.getChannelId());
}

The Kafka layer serves two purposes: it decouples the message sending path from the storage path (a slow Cassandra write does not block message delivery), and it provides replay capability for re-processing messages if a consumer crashes. Kafka's per-partition ordering guarantees that messages within a channel are processed in send order — a critical property for maintaining conversation coherence in storage.

6. Presence Detection System

Presence — knowing whether a user is online, offline, or away — sounds simple but becomes a hard distributed systems problem at scale. With 50M concurrent users, a naive implementation that writes a heartbeat to a database every 30 seconds generates 50M ÷ 30 ≈ 1.67M database writes per second from presence alone. No relational database handles this workload.

The Redis-based presence solution uses a key per user with a TTL: SET presence:{userId} online EX 90. Every heartbeat from the client (sent every 30 seconds via the WebSocket connection) refreshes the TTL. If no heartbeat arrives within 90 seconds, Redis automatically expires the key — the user is considered offline. Reading a user's presence is an O(1) Redis GET, and writing is an O(1) Redis SET. Redis can handle millions of operations per second on commodity hardware.

// Presence management using Redis
// Connection server: on heartbeat received from client
public void handleHeartbeat(String userId) {
    // Refresh presence TTL: user is online for next 90 seconds
    redis.set("presence:" + userId, "online", SetArgs.Builder.ex(90));
}

// Presence lookup for a user's contact list
public Map<String, String> getBulkPresence(List<String> userIds) {
    // Single MGET for the whole contact list - one round trip
    List<KeyValue<String, String>> results = redis.mget(
        userIds.stream()
               .map(id -> "presence:" + id)
               .toArray(String[]::new)
    );
    Map<String, String> presenceMap = new HashMap<>();
    for (int i = 0; i < userIds.size(); i++) {
        KeyValue<String, String> kv = results.get(i);
        // Lettuce KeyValue.getValue() throws if the key was absent - check first
        presenceMap.put(userIds.get(i), kv.hasValue() ? kv.getValue() : "offline");
    }
    return presenceMap;
}

// Presence notifications using Redis Sorted Set for last-seen timestamps
// On disconnect: record last_seen timestamp
public void onDisconnect(String userId) {
    redis.zadd("last-seen", System.currentTimeMillis(), userId);
    redis.del("presence:" + userId);  // immediately mark offline
    notifyContactsOfPresenceChange(userId, "offline");
}

// For large channels: aggregate presence (show count, not individual status)
public long getOnlineMemberCount(String channelId) {
    List<String> memberIds = channelService.getMemberIds(channelId);
    String[] presenceKeys = memberIds.stream()
        .map(id -> "presence:" + id)
        .toArray(String[]::new);
    // A single EXISTS over many keys returns how many of them are set
    return redis.exists(presenceKeys);
}

Propagating presence changes to all interested parties is a second-order fanout problem. When Alice's presence changes to offline, all users who have Alice in their contact list or share a channel with Alice need to be notified. The approach: maintain a reverse index in Redis mapping userId → set of users who are subscribed to their presence. On presence change, iterate this set and push update events via the connection server's pub/sub channels. For users in large public channels (where the reverse index would be enormous), aggregate presence (showing "1,247 online" rather than per-user status) avoids the notification explosion.
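An in-memory model of that reverse index makes the fanout mechanics concrete. This is a sketch with illustrative names; in production the sets would live in Redis (SADD/SMEMBERS) and the pushes would go through the connection servers' pub/sub channels:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Presence reverse index: watchers(target) is the set of users subscribed
// to target's presence; a change fans out only to that set.
class PresenceFanout {
    private final Map<String, Set<String>> watchers = new HashMap<>();
    final List<String> delivered = new ArrayList<>(); // stand-in for pub/sub pushes

    void subscribe(String watcher, String target) {
        watchers.computeIfAbsent(target, k -> new HashSet<>()).add(watcher);
    }

    void onPresenceChange(String userId, String status) {
        for (String w : watchers.getOrDefault(userId, Set.of()))
            delivered.add(w + "<-" + userId + ":" + status);
    }
}
```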

7. Read Receipts and Delivery Guarantees

Read receipts require tracking the lifecycle state of each message for each recipient. The message lifecycle has three states: SENT (message stored in Cassandra, delivery attempted), DELIVERED (WebSocket push acknowledged by the connection server, or push notification sent to APNs/FCM), and READ (user opened the chat window containing this message and the client sent a read ACK).

At-least-once delivery is achieved through the ACK mechanism. When a connection server pushes a message frame to a client, it records the message ID in a pending-ACK set. The client responds with an ACK frame containing the message ID within 5 seconds. If no ACK is received, the connection server retransmits up to 3 times before marking the message as failed delivery (the client has likely disconnected). Idempotent processing on the client side — deduplicating by message ID using a local seen-set — ensures that retransmitted messages are not displayed twice.
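The two halves of that machinery can be sketched as follows, with illustrative class names: a server-side pending-ACK table that counts delivery attempts against the retry cap, and a client-side seen-set that drops duplicate retransmits:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Server side: count delivery attempts per unacked message, cap at 3.
class AckTracker {
    static final int MAX_ATTEMPTS = 3;
    private final Map<Long, Integer> attempts = new HashMap<>(); // msgId -> pushes so far

    void onPush(long messageId) { attempts.merge(messageId, 1, Integer::sum); }
    void onAck(long messageId)  { attempts.remove(messageId); }

    // true while another retransmit is still permitted for an unacked message
    boolean shouldRetransmit(long messageId) {
        Integer n = attempts.get(messageId);
        return n != null && n < MAX_ATTEMPTS;
    }
}

// Client side: deduplicate by message ID so retransmits render only once.
class ClientDedup {
    private final Set<Long> seen = new HashSet<>();
    // true only the first time a message ID arrives
    boolean accept(long messageId) { return seen.add(messageId); }
}
```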

// Message delivery state tracking in Cassandra
CREATE TABLE message_receipts (
    conversation_id  uuid,
    message_id       bigint,
    recipient_id     uuid,
    status           text,     -- 'sent', 'delivered', 'read'
    delivered_at     timestamp,
    read_at          timestamp,
    PRIMARY KEY ((conversation_id, message_id), recipient_id)
);

// ACK flow: client acknowledges receipt
// Client sends WebSocket frame:
{ "type": "ack", "messageId": 1234567890123, "conversationId": "conv-abc" }

// Connection server processes ACK:
public void handleAck(String userId, long messageId, String conversationId) {
    // Update delivery status
    cassandra.execute(
        "UPDATE message_receipts SET status='delivered', delivered_at=? " +
        "WHERE conversation_id=? AND message_id=? AND recipient_id=?",
        Instant.now(), conversationId, messageId, userId
    );
    // Notify the original sender of the delivery confirmation
    String senderId = messageMetadataCache.getSenderId(messageId);
    pushDeliveryUpdate(senderId, messageId, userId, "delivered");
}

// Read receipt: triggered when user opens the conversation
public void markConversationRead(String userId, String conversationId, long upToMessageId) {
    // Single write covers all messages up to cursor — efficient bulk mark-read
    cassandra.execute(
        "UPDATE user_read_cursors SET last_read_message_id=?, read_at=? " +
        "WHERE user_id=? AND conversation_id=?",
        upToMessageId, Instant.now(), userId, conversationId
    );
    // Notify channel members of read receipt
    redis.publish("receipts:" + conversationId,
        "{\"userId\":\"" + userId + "\",\"readUpTo\":" + upToMessageId + "}");
}

For offline users, message delivery falls to push notifications. The connection server detects that a user has no active WebSocket connection (no entry in the connection registry) and enqueues a push notification request to APNs (iOS) or FCM (Android). The notification payload includes the message preview and deep-link data so tapping the notification opens the correct conversation. Push notifications are at-most-once: they may be dropped by the platform in battery-saving modes. This is acceptable — the message exists in Cassandra and will be fetched when the user opens the app.

The read cursor approach — storing the highest message ID that a user has read in a conversation, rather than individual per-message read receipts — is a critical optimization. Marking 500 messages as read one by one would generate 500 writes; the cursor approach collapses this to a single write. The trade-off: you lose per-message read granularity (cannot tell if a user read message 300 but not 301 if both are before the cursor), but this granularity is rarely needed in practice.
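With a cursor, computing an unread badge reduces to a comparison: any message ID above the cursor is unread, and because Snowflake IDs are time-ordered this matches "everything after the last message read". A minimal sketch:

```java
import java.util.List;

// Unread count from a read cursor over time-ordered Snowflake message IDs.
class ReadCursor {
    static long countUnread(List<Long> channelMessageIds, long lastReadMessageId) {
        return channelMessageIds.stream()
                .filter(id -> id > lastReadMessageId)
                .count();
    }
}
```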

8. Multi-Region Deployment

A global chat system serving users in North America, Europe, and Asia-Pacific cannot run from a single region — the speed of light alone makes cross-Pacific WebSocket latency unacceptable. A multi-region active-active deployment is required, where each region serves its local users independently while messages can cross regional boundaries when participants are in different regions.

Each user is homed to a primary region based on their geographic location. Their user profile, contact list, and recent message history are replicated to their home region. When a user connects, GeoDNS routes them to the nearest regional cluster. The regional cluster has its own full stack: connection servers, API servers, Kafka cluster, Cassandra cluster (as a multi-DC deployment), and Redis cluster.

Cross-region messaging — Alice in US-East messaging Bob in EU-West — requires inter-region coordination. The architecture is: Alice's message is written to the US-East Kafka topic and Cassandra DC. A Kafka MirrorMaker 2 bridge asynchronously replicates the message to the EU-West Kafka topic. Bob's connection server in EU-West, subscribed to the EU-West Kafka topic for the relevant channel, receives the replicated message and pushes it to Bob's WebSocket. The replication lag is typically 50-150ms for cross-Atlantic messages — within the 300ms SLO for cross-region delivery.

-- Multi-region Cassandra configuration
-- Cassandra replication across 3 datacenters
CREATE KEYSPACE chat
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east-1': 3,   -- 3 replicas in US-East
        'eu-west-1': 3,   -- 3 replicas in EU-West
        'ap-southeast-1': 2  -- 2 replicas in APAC
    }
    AND durable_writes = true;

-- Write consistency: LOCAL_QUORUM ensures writes are durable in local DC
-- Read consistency: LOCAL_QUORUM for low-latency local reads
-- Cross-region reads: use QUORUM only for critical global reads

# Kafka MirrorMaker 2 for cross-region message replication
# Source connector (US-East → EU-West)
connectors:
  - name: us-east-to-eu-west
    config:
      connector.class: org.apache.kafka.connect.mirror.MirrorSourceConnector
      source.cluster.alias: us-east
      target.cluster.alias: eu-west
      topics: messages,receipts,presence-updates
      replication.factor: 3
      offset-syncs.topic.replication.factor: 3
      # Only replicate channels with cross-region members
      topics.exclude: "^local-.*"

// GeoDNS routing: user connects to nearest regional API gateway
// Regional API Gateway → Regional Connection Server Fleet
// Message routing logic:
public void routeMessage(ChatMessage msg) {
    String senderRegion = userService.getHomeRegion(msg.getSenderId());
    Set<String> memberRegions = channelService.getMemberRegions(msg.getChannelId());

    // Always write to sender's home region first (lowest latency for sender)
    kafka.send(senderRegion + "-messages", msg.getChannelId(), msg);

    // Async replication to other regions via MirrorMaker handles cross-region delivery
    // No synchronous cross-region calls in the hot path
}

Message ordering across regions relies on Snowflake IDs. Because Snowflake IDs embed a millisecond-precision timestamp, messages from different regions are naturally ordered by their creation time, with sub-millisecond collisions resolved by the machine ID component. Clients receiving messages from the replication stream simply insert them in Snowflake ID order — no vector clocks or other complex ordering mechanisms are required. The trade-off is that network partition scenarios can lead to temporary out-of-order display until replication catches up, but this is visually corrected by the client within seconds and the stored order in Cassandra remains correct.
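Since cross-region ordering leans entirely on the Snowflake layout, a decoder for the format defined in section 4 (41-bit timestamp, 10-bit machine ID, 12-bit sequence, same custom epoch as the generator) is a handy debugging tool: the embedded timestamp and machine ID show when and where any message was created.

```java
// Decoder for the Snowflake layout used throughout this design.
class SnowflakeDecoder {
    static final long EPOCH = 1700000000000L; // must match the generator's epoch
    static final int MACHINE_ID_BITS = 10;
    static final int SEQUENCE_BITS = 12;

    static long timestampMillis(long id) {
        return (id >>> (MACHINE_ID_BITS + SEQUENCE_BITS)) + EPOCH;
    }
    static long machineId(long id) {
        return (id >>> SEQUENCE_BITS) & ((1L << MACHINE_ID_BITS) - 1);
    }
    static long sequence(long id) {
        return id & ((1L << SEQUENCE_BITS) - 1);
    }
}
```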

Failover between regions uses circuit-breaker logic at the GeoDNS layer. If US-East's health checks fail, DNS TTLs of 30 seconds allow rapid failover to redirect US-East users to EU-West or US-West, accepting higher latency in exchange for continued availability. Cassandra's multi-DC replication means EU-West already has a complete (or near-complete) copy of US-East data, so failover users can read their history immediately.

9. Key Takeaways

- The 50M concurrent WebSocket connections are the dominant constraint: plan for roughly 500 stateful connection servers behind consistent hashing.
- Keep the hot delivery path in Redis and memory; persist asynchronously through Kafka into Cassandra.
- Snowflake IDs give globally sortable message ordering without coordination or vector clocks.
- Use hybrid fanout: pub/sub fanout-on-write for small channels, fanout-on-read with caching for large ones.
- Presence is a Redis TTL problem, not a database problem; read receipts are a cursor, not per-message rows.
- Multi-region active-active with asynchronous MirrorMaker replication keeps cross-region delivery within the 300ms SLO.

10. Conclusion

Designing a global chat system is a masterclass in distributed systems trade-offs. Every component — transport protocol, connection management, message storage, fanout, presence, read receipts, and multi-region routing — has a naive solution that works at small scale and breaks at large scale. The patterns in this guide reflect how production systems like Slack, Discord, and WhatsApp have solved these problems through years of operational experience.

The most important design insight is that different operations have vastly different scale requirements, and their architectures must be designed independently. Message delivery latency is optimized by keeping only Redis and in-memory state in the hot path. Message durability is achieved by Kafka and Cassandra in the async path. Presence detection is solved by Redis TTL — a mechanism so simple it seems inadequate until you run the math and see that it handles 50M users at a fraction of the cost of database-based approaches. The composition of these specialized components, each perfect for its role, produces a system that is simultaneously fast, durable, and globally scalable.

| Aspect | WebSocket | Server-Sent Events | Long Polling |
|---|---|---|---|
| Direction | Full-duplex | Server → Client only | Server → Client (via response) |
| Protocol | ws:// / wss:// | HTTP/1.1 or HTTP/2 | HTTP/1.1 or HTTP/2 |
| Browser support | Universal (2012+) | Universal (except IE) | Universal |
| Load balancer stickiness | Required | Optional (HTTP/2) | Not required |
| Reconnection | Manual (app logic) | Automatic (browser) | Manual (resend request) |
| Latency | < 10 ms | < 10 ms | 50-200 ms |
| Mobile battery impact | Low (persistent connection) | Low (persistent connection) | High (frequent radio wake) |
| Use case | Chat, gaming, collaboration | Notifications, feeds, dashboards | Legacy support, restricted networks |
"The hardest part of building a real-time messaging system is not the real-time part — it's the correctness guarantees. Any system can push bytes fast. Only a well-designed system pushes the right bytes, in the right order, exactly once, to the right place."
— Engineering lesson from production chat systems


Last updated: April 1, 2026