Designing a Collaborative Editing System: Google Docs Architecture, OT vs CRDT & Real-Time Sync

Google Docs lets many users edit the same document simultaneously in real time, across millions of concurrent sessions. Every keystroke must appear on all collaborators' screens with minimal delay and without conflicts, even when users are concurrently editing the same paragraph. This problem, collaborative editing, is one of the most subtle and mathematically rich challenges in distributed systems, and it is solved by two competing paradigms: Operational Transformation (OT) and CRDTs.

Md Sanwar Hossain · April 6, 2026 · 19 min read · System Design

TL;DR — Core Architecture Decisions

"A collaborative editor needs: (1) a conflict resolution algorithm to merge concurrent edits correctly, either Operational Transformation (OT), as used by Google Docs, or a CRDT (Logoot, LSEQ, RGA, Yjs), the approach behind many newer collaborative tools, (2) a WebSocket server (or WebRTC for P2P) to propagate operations in real time, (3) a persistent operation log in the database for document reconstruction and undo history, (4) snapshot checkpointing to avoid replaying the full operation history on load, and (5) a presence service for cursor/selection awareness, delivered as ephemeral events that are never persisted."

Table of Contents

  1. The Core Problem: Concurrent Edit Conflicts
  2. Operational Transformation (OT): How Google Docs Works
  3. CRDTs: Conflict-Free Replicated Data Types
  4. OT vs CRDT: When to Choose Which
  5. System Architecture & Storage Design
  6. Real-Time Sync Protocol & WebSocket Architecture
  7. Cursor & Presence Awareness
  8. Offline Editing & Reconnection
  9. Undo/Redo in Collaborative Context
  10. Snapshot Checkpointing & Persistence
  11. Scale & Conclusion

1. The Core Problem: Concurrent Edit Conflicts

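Imagine Alice and Bob are both editing the same document: "Hello". Alice inserts "World" at position 5 (after "Hello"), and simultaneously Bob deletes position 0 (the "H"). The intended result is "elloWorld". If both operations are applied naively, Alice's replica gets there ("HelloWorld", then "elloWorld"), but on Bob's replica the document is already "ello" (length 4) when Alice's insert arrives, so position 5 no longer exists: the insert is rejected or lands wherever the implementation clamps it. Had Alice inserted in the middle of the word instead, her text would silently land one character too far to the right on Bob's replica.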

This is the concurrent edit conflict problem. Operations that were valid when issued become invalid or produce wrong results when applied after other operations. The solution: transform operations before applying them to account for concurrently-applied operations — this is the essence of Operational Transformation.
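
The Alice/Bob scenario can be replayed in a few lines. This is only an illustrative sketch: the `insert` and `delete` helpers are hypothetical, and they deliberately reject out-of-range positions so the failure is visible. The second pair of replicas uses a mid-document insert to show silent divergence rather than an error.

```python
def insert(doc: str, text: str, pos: int) -> str:
    """Insert `text` at index `pos`, rejecting out-of-range positions."""
    if not 0 <= pos <= len(doc):
        raise IndexError(f"insert position {pos} is outside a document of length {len(doc)}")
    return doc[:pos] + text + doc[pos:]


def delete(doc: str, pos: int) -> str:
    """Delete the single character at index `pos`."""
    if not 0 <= pos < len(doc):
        raise IndexError(f"delete position {pos} is outside a document of length {len(doc)}")
    return doc[:pos] + doc[pos + 1:]


base = "Hello"

# Alice's replica: her own insert first, then Bob's delete arrives.
alice_replica = delete(insert(base, "World", 5), 0)

# Bob's replica: his own delete first, then Alice's insert arrives unchanged.
try:
    bob_replica = insert(delete(base, 0), "World", 5)
except IndexError:
    bob_replica = None  # position 5 does not exist in "ello"

# A mid-document insert does not fail; the replicas silently diverge instead.
alice_mid = delete(insert(base, "X", 3), 0)  # "HelXlo" -> "elXlo"
bob_mid = insert(delete(base, 0), "X", 3)    # "ello"   -> "ellXo"
```

Both failure modes come from the same cause: an index that was computed against one document state is being applied to another.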

Figure: Collaborative editing platform architecture (OT/CRDT engine, operation log, WebSocket gateway, snapshot store, and presence service). Source: mdsanwarhossain.me

2. Operational Transformation (OT): How Google Docs Works

Operational Transformation was introduced in 1989 and is the algorithm behind Google Docs; the same server-mediated approach appears in Google Wave and in open-source libraries such as ShareJS. The core idea: before applying a remote operation, transform it against all locally applied operations it was concurrent with, adjusting its positions to account for them.

OT Transform Function

The transform function T(op1, op2) takes two concurrent operations and returns a transformed version of op1 that can be applied after op2 has already been applied:

# Simplified OT transform for text Insert/Delete
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Insert:
    text: str
    position: int

@dataclass(frozen=True)
class Delete:
    position: int  # index of the single character removed

def transform_insert_insert(op1, op2):
    # op1 = Insert("X", position=5)
    # op2 = Insert("Y", position=3)  [applied before op1]
    # op2 inserts at or before op1's position, so shift op1 right.
    # Ties go to op2 here; real systems break ties by site ID so both sides agree.
    if op2.position <= op1.position:
        return replace(op1, position=op1.position + len(op2.text))
    return op1  # op2 is after op1, no shift needed

def transform_insert_delete(op1, op2):
    # op1 = Insert("X", position=5)
    # op2 = Delete(position=3)  [applied before op1]
    # op2 deletes before op1's position, so shift op1 left
    if op2.position < op1.position:
        return replace(op1, position=op1.position - 1)
    return op1
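
To see how transforms of this kind restore convergence, the following self-contained sketch replays the Alice/Bob example from Section 1 on both replicas. The names `apply` and `transform` are illustrative, only single-character deletes are modeled, and ties between two inserts at the same position are resolved in a fixed direction, which a real system would replace with a site-ID tie-break.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


@dataclass(frozen=True)
class Insert:
    text: str
    position: int


@dataclass(frozen=True)
class Delete:
    position: int  # removes the single character at this index


Op = Union[Insert, Delete]


def apply(doc: str, op: Optional[Op]) -> str:
    if op is None:  # the operation was transformed into a no-op
        return doc
    if isinstance(op, Insert):
        return doc[:op.position] + op.text + doc[op.position:]
    return doc[:op.position] + doc[op.position + 1:]


def transform(op: Op, against: Op) -> Optional[Op]:
    """Rewrite `op` so it can be applied after `against` has been applied."""
    if isinstance(against, Insert):
        # Assumption: on equal positions `against` goes first.
        if against.position <= op.position:
            return replace(op, position=op.position + len(against.text))
        return op
    if isinstance(op, Delete) and op.position == against.position:
        return None  # both sides deleted the same character
    if against.position < op.position:
        return replace(op, position=op.position - 1)
    return op


base = "Hello"
alice = Insert("World", 5)
bob = Delete(0)

# Each replica applies its own op first, then the transformed remote op.
alice_replica = apply(apply(base, alice), transform(bob, alice))
bob_replica = apply(apply(base, bob), transform(alice, bob))
```

On Bob's replica, Alice's insert is shifted from position 5 to position 4 before it is applied, so both replicas end with the same text.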

Server-Mediated OT Architecture

Pure peer-to-peer OT is notoriously hard to get right for more than two users: without a central order, the transform functions must satisfy extra convergence properties, and the number of transform paths to reason about grows quickly. Production systems use a server-mediated approach in which the server is the single point that orders operations. Each operation carries a revision number indicating the document state it was based on. When the server, currently at revision R, receives an operation that the client created at revision C (C < R), it transforms the operation against operations [C+1, R] before applying and broadcasting it. Clients apply the broadcast operations in server order; the only client-side work is transforming an incoming operation against the client's own pending (unacknowledged) operations.
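
The revision-based transform loop can be sketched as follows. This is a minimal, insert-only model under stated assumptions: `DocumentServer` and `receive` are hypothetical names, the log is held in memory, and an already-logged insert wins position ties.

```python
from dataclasses import dataclass, field, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Insert:
    text: str
    position: int


def transform(op: Insert, against: Insert) -> Insert:
    """Shift `op` right if `against` inserted at or before its position."""
    if against.position <= op.position:
        return replace(op, position=op.position + len(against.text))
    return op


@dataclass
class DocumentServer:
    text: str = ""
    log: List[Insert] = field(default_factory=list)  # log[i] produced revision i + 1

    @property
    def revision(self) -> int:
        return len(self.log)

    def receive(self, op: Insert, client_revision: int) -> Tuple[Insert, int]:
        if not 0 <= client_revision <= self.revision:
            raise ValueError("client revision is outside the server's history")
        # Transform against every operation the client had not seen: [C+1, R].
        for applied in self.log[client_revision:]:
            op = transform(op, applied)
        self.text = self.text[:op.position] + op.text + self.text[op.position:]
        self.log.append(op)
        return op, self.revision  # the transformed op and its new revision


server = DocumentServer()
server.receive(Insert("Hello", 0), client_revision=0)   # revision 1
server.receive(Insert(" World", 5), client_revision=1)  # revision 2
server.receive(Insert(">> ", 0), client_revision=1)     # concurrent with revision 2
late_op, late_revision = server.receive(Insert("!", 5), client_revision=1)
```

The last call arrives two revisions behind, so its position is shifted twice (5 to 11, then 11 to 14) before it is appended to the log.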

3. CRDTs: Conflict-Free Replicated Data Types

CRDTs (Conflict-Free Replicated Data Types) take a different mathematical approach: design the data structure such that concurrent operations always commute — applying them in any order always produces the same result. No transformation is needed.

CRDT for Text: The Tombstone Approach

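The key insight in CRDT text types (Logoot, LSEQ, RGA, Yjs) is to give each character a globally unique, immutable identifier rather than an integer index. Logoot and LSEQ allocate a dense (fractional) position identifier that sorts between the two neighboring characters; RGA and Yjs instead record the ID of the character the new one was inserted after. When Alice inserts "W" at the end of "Hello", she gives it a fresh unique ID and anchors it after the "o". When Bob deletes "H", tombstone-based designs such as RGA mark it as deleted but retain it in the structure, so its ID can still be referenced.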

# CRDT text representation (simplified RGA/Yjs-style model)
document = [
    {"id": (1, "alice"), "char": "H", "deleted": False},
    {"id": (2, "alice"), "char": "e", "deleted": False},
    {"id": (3, "alice"), "char": "l", "deleted": False},
    {"id": (4, "alice"), "char": "l", "deleted": False},
    {"id": (5, "alice"), "char": "o", "deleted": False},
]

# Alice inserts "W" after id=(5, "alice"):
alice_op = {"type": "insert", "after": (5, "alice"), "id": (6, "alice"), "char": "W"}

# Bob deletes id=(1, "alice") (the "H"):
bob_op = {"type": "delete", "id": (1, "alice")}

# Both ops can be applied in ANY order and produce an identical result:
# [{H, deleted: True}, e, l, l, o, W] → visible: "elloW"
# The anchor (5, "alice") used by Alice's insert is NEVER invalidated by Bob's delete

Figure: CRDT-based collaborative sync. Operations commute in any order; tombstones enable offline merges without conflicts. Source: mdsanwarhossain.me
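
The commutativity claim can be checked with a small sketch. It is a simplified RGA-style model, not the Yjs implementation: `apply_op` and `visible` are hypothetical helpers, IDs are `(counter, site)` tuples compared as tuples, and causal delivery is assumed (an insert's anchor already exists when the insert arrives).

```python
from typing import List


def apply_op(doc: List[dict], op: dict) -> List[dict]:
    """Return a new element list with `op` applied."""
    doc = [dict(el) for el in doc]
    if op["type"] == "delete":
        for el in doc:
            if el["id"] == op["id"]:
                el["deleted"] = True  # tombstone: keep the element, hide it
        return doc
    # Insert: start right after the anchor, then skip any concurrent
    # inserts with a larger ID so every replica picks the same slot.
    index = next(i for i, el in enumerate(doc) if el["id"] == op["after"]) + 1
    while index < len(doc) and doc[index]["id"] > op["id"]:
        index += 1
    doc.insert(index, {"id": op["id"], "char": op["char"], "deleted": False})
    return doc


def visible(doc: List[dict]) -> str:
    return "".join(el["char"] for el in doc if not el["deleted"])


document = [
    {"id": (i + 1, "alice"), "char": c, "deleted": False}
    for i, c in enumerate("Hello")
]
alice_op = {"type": "insert", "after": (5, "alice"), "id": (6, "alice"), "char": "W"}
bob_op = {"type": "delete", "id": (1, "alice")}

insert_then_delete = visible(apply_op(apply_op(document, alice_op), bob_op))
delete_then_insert = visible(apply_op(apply_op(document, bob_op), alice_op))

# Two concurrent inserts at the same anchor also converge.
bob_insert = {"type": "insert", "after": (5, "alice"), "id": (6, "bob"), "char": "!"}
order_a = visible(apply_op(apply_op(document, alice_op), bob_insert))
order_b = visible(apply_op(apply_op(document, bob_insert), alice_op))
```

The ID comparison in the insert loop is what makes concurrent inserts at the same anchor land in the same order on every replica; which of the two characters comes first is arbitrary but consistent.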

4. OT vs CRDT: When to Choose Which

Dimension | Operational Transformation | CRDT
Server requirement | Required (for ordering) | Optional (P2P possible)
Offline support | Complex (buffer + sync on reconnect) | Natural (ops merge on any reconnect)
Memory overhead | Low (no tombstones) | Higher (tombstones grow over time)
Algorithm complexity | High (correct OT is notoriously hard) | Moderate (well-specified math)
Ordering intent preservation | High (server imposes total order) | Lower (concurrent inserts may interleave)
Used by | Google Docs, Etherpad | Apple Notes, Yjs-based editors, Figma (which describes its approach as CRDT-inspired)

Verdict: For centralized server architectures with a reliable network, server-mediated OT is simpler to operate and has lower storage overhead. For distributed, offline-first, or P2P architectures, CRDTs are the better fit. Yjs (a CRDT library) has become one of the most popular choices for new collaborative applications in 2024–2026 due to its performance, offline support, and rich ecosystem.

5. System Architecture & Storage Design

A production collaborative editor requires several storage layers:

Data | Storage | Purpose
Document snapshots | PostgreSQL / S3 (large docs) | Fast document load without replaying all ops
Operation log | PostgreSQL (append-only) | Undo history, audit log, catch-up for reconnects
Active session state | Redis | Current document state, pending ops, presence info
Presence / cursor data | Redis Pub/Sub | Ephemeral; expires on disconnect
Document metadata (title, ACL) | PostgreSQL | Ownership, sharing permissions, version history
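
As a rough sketch of the append-only operation log and snapshot tables, the example below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table names, column names, and the `append_op` / `ops_since` helpers are all assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE operations (
    doc_id   TEXT    NOT NULL,
    revision INTEGER NOT NULL,
    payload  TEXT    NOT NULL,      -- JSON-encoded operation
    PRIMARY KEY (doc_id, revision)  -- exactly one operation per revision
);
CREATE TABLE snapshots (
    doc_id   TEXT    NOT NULL,
    revision INTEGER NOT NULL,      -- revision the snapshot reflects
    content  TEXT    NOT NULL,
    PRIMARY KEY (doc_id, revision)
);
""")


def append_op(doc_id: str, revision: int, op: dict) -> None:
    """Append-only write; a duplicate revision raises sqlite3.IntegrityError."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO operations (doc_id, revision, payload) VALUES (?, ?, ?)",
            (doc_id, revision, json.dumps(op)),
        )


def ops_since(doc_id: str, revision: int) -> list:
    """Operations a reconnecting client needs after `revision`, in order."""
    rows = conn.execute(
        "SELECT payload FROM operations "
        "WHERE doc_id = ? AND revision > ? ORDER BY revision",
        (doc_id, revision),
    )
    return [json.loads(payload) for (payload,) in rows]


for rev, ch in enumerate("abc", start=1):
    append_op("doc-1", rev, {"kind": "insert", "text": ch, "position": rev - 1})
```

The composite primary key doubles as a guard against two writers claiming the same revision, which is one simple way to keep the log strictly ordered.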

6. Real-Time Sync Protocol & WebSocket Architecture

Each document session is managed by a dedicated document server (or a WebSocket connection to a document service). The protocol:

  1. Client connects: Opens WebSocket to document server. Server sends current document snapshot + revision number. Client renders document and subscribes to the operation stream.
  2. Client edits: User types "X" at position 12. Client immediately applies the operation locally (optimistic UI — the edit appears instantly). Client sends operation to server with the last-known revision: {op: Insert("X", 12), clientRevision: 47}.
  3. Server processes: Server receives the operation. If clientRevision == serverRevision, there is no conflict and the operation is applied directly. If clientRevision < serverRevision, the server transforms it against operations [clientRevision+1, serverRevision] and then applies it. The result is assigned the next revision, serverRevision + 1 (48 if the server was still at 47), and broadcast to all connected clients.
  4. Clients receive broadcast: Each client applies the operation to their local document state, adjusting for any pending local operations via OT/CRDT.
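
The message exchange in steps 2 and 3 might look like the sketch below. The JSON field names and the `encode_client_op` / `plan_server_step` helpers are illustrative assumptions, not a documented wire format.

```python
import json


def encode_client_op(kind: str, text: str, position: int, client_revision: int) -> str:
    """Serialize a client edit together with the revision it was based on."""
    return json.dumps({
        "type": "op",
        "op": {"kind": kind, "text": text, "position": position},
        "clientRevision": client_revision,
    })


def plan_server_step(message: str, server_revision: int) -> dict:
    """Decide which logged revisions the incoming op must be transformed against."""
    msg = json.loads(message)
    client_revision = msg["clientRevision"]
    if client_revision > server_revision:
        raise ValueError("client claims a revision the server has not reached")
    return {
        "transform_against": list(range(client_revision + 1, server_revision + 1)),
        "assigned_revision": server_revision + 1,
    }


message = encode_client_op("insert", "X", 12, client_revision=47)
up_to_date = plan_server_step(message, server_revision=47)  # no transform needed
behind = plan_server_step(message, server_revision=50)      # three ops to catch up on
```

When the client is current, the list of revisions to transform against is empty and the operation simply becomes revision 48; when the server has moved on to revision 50, the same message is transformed against revisions 48 to 50 and becomes revision 51.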

7. Cursor & Presence Awareness

Seeing other collaborators' cursors move in real time is a key part of the Google Docs experience. Cursor positions must be transmitted without interfering with document operations and must be adjusted when document operations shift text positions.

Cursor as an Operation

Cursor position is broadcast as a lightweight ephemeral event (not persisted). When Alice's cursor is at position 23 and Bob inserts 5 characters at position 10, Alice's cursor must shift to position 28 — exactly the same transformation logic applied to edits. The presence service sends cursor updates at most 30 times per second (throttled), transmitted via the same WebSocket channel. Cursor events are separated from document operations in the protocol and do not get an operation revision — they are purely ephemeral and can be dropped without affecting document correctness.
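
A minimal sketch of the cursor adjustment and the 30-per-second throttle is shown below. `transform_cursor` and `CursorThrottle` are hypothetical names, and the choice to push a cursor right when text is inserted exactly at its position is an assumption; some editors keep the cursor in place in that case.

```python
def transform_cursor(cursor: int, kind: str, position: int, length: int) -> int:
    """Adjust a cursor index for a remote insert or delete of `length` characters."""
    if kind == "insert":
        # Text inserted at or before the cursor pushes it right.
        return cursor + length if position <= cursor else cursor
    if kind == "delete":
        if position >= cursor:
            return cursor  # deletion is entirely after the cursor
        # Only the deleted characters that sat before the cursor pull it left.
        return cursor - min(length, cursor - position)
    raise ValueError(f"unknown operation kind: {kind}")


class CursorThrottle:
    """Allows at most `max_per_second` cursor updates; extra updates are dropped."""

    def __init__(self, max_per_second: int = 30):
        self.min_interval = 1.0 / max_per_second
        self.last_sent = float("-inf")

    def should_send(self, now: float) -> bool:
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False


# Alice's cursor is at 23 when Bob inserts 5 characters at position 10.
alice_cursor = transform_cursor(23, "insert", position=10, length=5)
```

Dropping a throttled cursor update is safe because the next update carries the full current position, which is why these events need no revision number.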

8. Offline Editing & Reconnection

Google Docs supports offline editing. When the network drops, the client continues accepting edits and buffers operations locally in IndexedDB (the browser's local database). When connectivity is restored:

  1. Client reconnects to document server and reports its last-applied revision (e.g., revision 50).
  2. Server sends all operations from revision 51 to current (e.g., operations 51–75 applied by other collaborators during the outage).
  3. Client transforms its buffered local operations against the received server operations and sends them to the server.
  4. Server integrates the client's operations via the same OT/CRDT pipeline.

For CRDT-based systems, this is even simpler: all buffered ops are simply merged with server state — no transformation needed since CRDT operations commute.
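
For the OT path, step 3 of the reconnection flow can be sketched as a "rebase" of the buffered operations over the ones the client missed. The sketch is insert-only and rests on assumptions: `transform_pair` and `rebase` are hypothetical names, and the server's text goes first when positions tie.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Insert:
    text: str
    position: int


def apply_all(doc: str, ops: List[Insert]) -> str:
    for op in ops:
        doc = doc[:op.position] + op.text + doc[op.position:]
    return doc


def transform_pair(client_op: Insert, server_op: Insert) -> Tuple[Insert, Insert]:
    """Return (client_op', server_op'), each valid after the other was applied."""
    if client_op.position < server_op.position:
        shifted = replace(server_op, position=server_op.position + len(client_op.text))
        return client_op, shifted
    shifted = replace(client_op, position=client_op.position + len(server_op.text))
    return shifted, server_op


def rebase(buffered: List[Insert], missed: List[Insert]) -> Tuple[List[Insert], List[Insert]]:
    """Transform offline ops over missed server ops, and vice versa."""
    missed = list(missed)
    rebased = []
    for client_op in buffered:
        for i, server_op in enumerate(missed):
            client_op, missed[i] = transform_pair(client_op, server_op)
        rebased.append(client_op)
    return rebased, missed  # send `rebased` to the server; apply `missed` locally


base = "abc"
buffered = [Insert("X", 0), Insert("Y", 1)]   # typed offline: "XYabc"
missed = [Insert("Z", 3), Insert("W", 0)]     # server meanwhile: "WabcZ"

to_send, to_apply_locally = rebase(buffered, missed)
server_doc = apply_all(apply_all(base, missed), to_send)
client_doc = apply_all(apply_all(base, buffered), to_apply_locally)
```

The server ends up applying the rebased client operations on top of its own history, while the client applies the adjusted server operations on top of its offline edits, and both arrive at the same text.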

9. Undo/Redo in Collaborative Context

Undo in a collaborative editor is conceptually harder than undo in a single-user editor. If Alice types "Hello", Bob then types "World", and Alice presses Undo, should Undo revert the most recent change to the document (Bob's "World"), or should it remove only Alice's own "Hello" while preserving Bob's "World"?

The widely accepted answer is selective undo: undo only the specific user's own operations, leaving operations from other users intact. This is implemented by inverting the user's operation (Insert(X, pos) → Delete(X, pos)) and transforming the inverse operation against all subsequent operations by any user. The transformed inverse is then applied as a new operation, which is itself broadcast and logged.
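
The invert-then-transform step can be sketched as follows. This is a simplified illustration under stated assumptions: `invert`, `transform_delete`, and `apply_deletes` are hypothetical names, and the inverse is transformed against a single later insert by another user (later deletes and longer histories are left out).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Insert:
    text: str
    position: int


@dataclass(frozen=True)
class Delete:
    position: int
    length: int


def invert(op: Insert) -> Delete:
    """The inverse of inserting text is deleting that same range."""
    return Delete(op.position, len(op.text))


def transform_delete(d: Delete, later: Insert) -> List[Delete]:
    """Adjust an inverse delete for an insert applied after the original op.

    Returns the deletes to apply, in order.
    """
    end = d.position + d.length
    if later.position <= d.position:
        return [Delete(d.position + len(later.text), d.length)]
    if later.position >= end:
        return [d]
    # The later insert landed inside the text being undone: keep it and
    # delete the pieces on either side (right piece first, so the left
    # piece's position stays valid).
    left_len = later.position - d.position
    right = Delete(later.position + len(later.text), d.length - left_len)
    return [right, Delete(d.position, left_len)]


def apply_deletes(doc: str, deletes: List[Delete]) -> str:
    for d in deletes:
        doc = doc[:d.position] + doc[d.position + d.length:]
    return doc


alice = Insert("Hello", 0)

# Bob appended after Alice's text: "Hello World" -> undo leaves " World".
after_append = apply_deletes("Hello World", transform_delete(invert(alice), Insert(" World", 5)))

# Bob typed inside Alice's text: "He--llo" -> undo leaves "--".
after_inside = apply_deletes("He--llo", transform_delete(invert(alice), Insert("--", 2)))
```

In both cases Bob's text survives Alice's undo, which is the behavior selective undo is meant to guarantee.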

10. Snapshot Checkpointing & Persistence

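A heavily edited document might accumulate millions of operations over its lifetime. Loading the document by replaying all operations from scratch would take seconds or minutes. The solution: periodic snapshot checkpointing. At regular intervals the server persists the full document state together with its revision number; loading then means fetching the latest snapshot and replaying only the operations logged after it.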
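
A minimal in-memory sketch of snapshot checkpointing is shown below. `DocumentStore` and its `snapshot_every` interval are assumptions for illustration, and only inserts are modeled; a real system would persist both the log and the snapshots.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DocumentStore:
    snapshot_every: int = 100  # assumption: checkpoint after every 100 operations
    # ops[i] is the insert (text, position) that produced revision i + 1
    ops: List[Tuple[str, int]] = field(default_factory=list)
    # (revision, full document text); revision 0 is the empty document
    snapshots: List[Tuple[int, str]] = field(default_factory=lambda: [(0, "")])

    def append(self, text: str, position: int) -> int:
        self.ops.append((text, position))
        revision = len(self.ops)
        if revision % self.snapshot_every == 0:
            self.snapshots.append((revision, self.load()))
        return revision

    def load(self) -> str:
        """Start from the latest snapshot and replay only the ops after it."""
        revision, doc = self.snapshots[-1]
        for text, position in self.ops[revision:]:
            doc = doc[:position] + text + doc[position:]
        return doc

    def ops_replayed_on_load(self) -> int:
        return len(self.ops) - self.snapshots[-1][0]


store = DocumentStore(snapshot_every=100)
for i in range(250):
    store.append("a", i)  # append one character at the end each time
```

After 250 operations the latest snapshot is at revision 200, so a load replays 50 operations instead of 250; the interval trades snapshot storage against replay time.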

11. Scale & Conclusion

Google Docs Scale Estimation

  • Google Docs: 1 billion+ active documents; millions of concurrent editing sessions
  • Peak concurrent editors per document: typically 1–20; outliers (shared class docs) 100–1000
  • Operation rate: ~5 operations/second per active editor × 5 editors = 25 ops/sec per document
  • 1 million active document sessions × 25 ops/sec = 25 million operations/sec peak
  • Each operation: ~50 bytes → 1.25 GB/sec of inbound operations at peak; broadcasting each one to the other 4 editors brings outbound traffic to roughly 5 GB/sec
  • Document servers: stateful (hold in-memory document state); shard by documentId
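
The estimate above can be reproduced as a short calculation. The input figures are the article's own assumptions; `shard_for` is a hypothetical helper showing one simple way to shard by documentId (a stable hash modulo the shard count), not a description of how Google Docs does it.

```python
import hashlib

ops_per_editor_per_sec = 5
editors_per_document = 5
active_sessions = 1_000_000
bytes_per_op = 50

ops_per_document = ops_per_editor_per_sec * editors_per_document  # ops/sec per document
total_ops_per_sec = ops_per_document * active_sessions            # ops/sec overall
inbound_bytes_per_sec = total_ops_per_sec * bytes_per_op          # client -> server
# Each operation is re-sent to every other editor of the same document.
outbound_bytes_per_sec = inbound_bytes_per_sec * (editors_per_document - 1)


def shard_for(document_id: str, shard_count: int) -> int:
    """Map a document to a stateful document server using a stable hash."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


print(f"{total_ops_per_sec:,} ops/sec")
print(f"{inbound_bytes_per_sec / 1e9:.2f} GB/sec inbound")
print(f"{outbound_bytes_per_sec / 1e9:.2f} GB/sec outbound")
```

A plain modulo reshuffles most documents when the shard count changes; consistent hashing is the usual refinement when servers are added or removed often.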

Collaborative editing is one of the most intellectually demanding distributed systems problems. The combination of concurrent state management (OT/CRDT), real-time networking (WebSocket), offline-first architecture, and user experience constraints (sub-100ms perceived latency for typing) makes it uniquely challenging. Many new systems now start from CRDT-based approaches, and Yjs in particular has become a widely used library for new collaborative tools, while Google Docs continues to use its battle-tested OT implementation. Understanding both paradigms gives you deep insight into distributed state management that applies far beyond editors.

Last updated: April 6, 2026