Distributed Search Engine Design — Inverted Index, Query Fanout and Relevance Ranking at Scale
Md Sanwar Hossain - Software Engineer

System Design · March 21, 2026 · 19 min read · Distributed Systems · Failure Handling Series

Designing a Distributed Search Engine from Scratch: Inverted Index, Query Fanout, and Relevance Ranking at Scale

A B2B SaaS platform serving 50,000 companies once built their search entirely on LIKE queries in PostgreSQL. At 10 million documents, p95 search latency was 4.2 seconds. At 50 million documents it climbed to 28 seconds — and users abandoned search entirely. The solution was a purpose-built distributed search layer using Elasticsearch internals as inspiration, but designed from first principles so engineers actually understand what is happening under the hood when they tune shard counts, tweak BM25 parameters, or diagnose a split-brain event.

Table of Contents

  1. Why RDBMS Full-Text Search Doesn't Scale
  2. The Inverted Index: Foundation of All Search Engines
  3. Distributed Index Architecture
  4. Query Processing and Fanout
  5. Relevance Ranking: BM25 and Beyond
  6. Real-Time Indexing and the Write Path
  7. Failure Scenarios in Production
  8. Scaling Considerations
  9. When NOT to Build Your Own Search
  10. Key Takeaways
  11. Conclusion

1. Why RDBMS Full-Text Search Doesn't Scale

The first instinct when adding search to an application is to reach for SQL. A LIKE '%query%' clause looks harmless on a dataset of ten thousand rows. But the predicate forces a full table scan at O(N) for every search request — the database must read every row to check whether the pattern matches. There is no index structure that can accelerate a leading-wildcard LIKE scan. At ten million rows, that is ten million comparisons per query, per concurrent user.

PostgreSQL's tsvector full-text search with GIN indexes is a genuine improvement and works well up to roughly 20–30 million rows under moderate write load. But the B2B SaaS platform described above discovered the ceiling painfully. When document volume crossed 50 million and concurrent writes exceeded 500/second, the GIN index degraded. Each write triggered partial GIN index reconstruction, and under concurrent write pressure, index maintenance stalled query paths — producing the 28-second p95 latency disaster.

Production incident: At 50M documents and 500 concurrent writes/sec, GIN index maintenance under write pressure caused search latency to spike from 4.2s (p95) to 28s. Users stopped using search. The root cause: GIN index pages were being rewritten faster than query plans could navigate them — a fundamental architectural limit, not a configuration problem.

You need a dedicated search index when any of these conditions apply: your document corpus exceeds 1 million entries, you require relevance ranking beyond simple keyword match, you need faceted search (filter by category, date range, price), or you need real-time indexing with sub-second query latency. All four applied to this platform. The answer is an inverted index — the data structure at the heart of every search engine ever built.

2. The Inverted Index: Foundation of All Search Engines

An inverted index maps each term in your corpus to the list of documents that contain it — along with position and frequency metadata. This structure inverts the natural document-centric view (doc → terms) into a term-centric view (term → docs), enabling O(1) lookup for any term regardless of corpus size.

Term             → Postings List
"elasticsearch"  → [(doc3, pos:5, tf:2), (doc7, pos:1, tf:1), (doc15, pos:8, tf:3)]
"distributed"    → [(doc1, pos:2, tf:1), (doc3, pos:1, tf:4)]
"search"         → [(doc1, pos:3, tf:2), (doc2, pos:1, tf:1), (doc3, pos:2, tf:3)]

Each entry in a postings list records the document ID, the term's position within the document (for phrase queries), and the term frequency (how many times it appears). This metadata is the raw material for relevance ranking. Before a document reaches the index, it passes through an analysis pipeline: tokenization splits text into terms, stemming reduces words to their root form (running → run), and stopword removal strips meaningless high-frequency words like the, is, and a that would pollute result ranking.
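The analysis pipeline can be sketched in a few lines. This is a hypothetical toy, not a production analyzer: the `stem` rule is a crude suffix stripper standing in for a real stemmer such as Porter, and the stopword set is deliberately tiny.

```java
import java.util.*;
import java.util.stream.*;

// Toy sketch of the analysis pipeline: tokenize -> stem -> remove stopwords.
public class AnalyzerSketch {
    static final Set<String> STOPWORDS = Set.of("the", "is", "a", "and");

    // Crude illustrative rule only; a real stemmer handles far more cases.
    static String stem(String term) {
        if (term.endsWith("ning")) return term.substring(0, term.length() - 4);
        if (term.endsWith("s") && term.length() > 3) return term.substring(0, term.length() - 1);
        return term;
    }

    static List<String> analyze(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z0-9]+"))
            .filter(t -> !t.isBlank())
            .map(AnalyzerSketch::stem)
            .filter(t -> !STOPWORDS.contains(t))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(analyze("The engine is running fast")); // [engine, run, fast]
    }
}
```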

Apache Lucene — the library underlying Elasticsearch and OpenSearch — organises its inverted index into segments: immutable, write-once structures flushed to disk from an in-memory buffer. Because segments are immutable, a document update is implemented as a soft-delete of the old segment entry plus a new document write. Periodically, the segment merge strategy combines many small segments into fewer large ones, reclaiming space from deleted documents and reducing the number of per-query segment reads.
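A minimal in-memory version of the term → postings structure shown above can be sketched as follows. The `Posting` record and `build` method are illustrative names, and the single-map layout glosses over everything Lucene does with segments and on-disk encoding:

```java
import java.util.*;

// Hypothetical sketch: building a term -> postings map from a tiny corpus.
public class InvertedIndexSketch {
    // Mirrors the (doc, pos, tf) entries shown above: doc id, first
    // position of the term in the document, and term frequency.
    record Posting(int docId, int firstPos, int tf) {}

    static Map<String, List<Posting>> build(Map<Integer, String> docs) {
        Map<String, List<Posting>> index = new TreeMap<>();
        for (var e : docs.entrySet()) {
            String[] terms = e.getValue().toLowerCase().split("\\s+");
            Map<String, int[]> local = new HashMap<>(); // term -> [firstPos, tf]
            for (int pos = 0; pos < terms.length; pos++) {
                int p = pos;
                local.compute(terms[pos], (t, v) ->
                    v == null ? new int[]{p, 1} : new int[]{v[0], v[1] + 1});
            }
            local.forEach((term, v) ->
                index.computeIfAbsent(term, t -> new ArrayList<>())
                     .add(new Posting(e.getKey(), v[0], v[1])));
        }
        return index;
    }

    public static void main(String[] args) {
        var index = build(Map.of(
            1, "distributed search search",
            3, "search distributed distributed"));
        System.out.println(index.get("search")); // postings for docs 1 and 3
    }
}
```

The term-centric map is what makes lookup independent of corpus size: answering "which documents contain search" is a single map access, not a scan.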

3. Distributed Index Architecture

A single Lucene index on a single node cannot scale past the capacity of that machine. The distributed layer shards the index across multiple nodes, each holding a subset of the total document corpus.

Write Path:
  Document → Indexer → Primary Shard → Replica Shard(s) → Segment Flush

Read Path:
  Query → Coordinator Node → Fan-out to ALL Shards → Merge & Rank → Return Results

Elasticsearch routes each document to a shard using a deterministic formula: shard = hash(docId) % numShards. Because this formula is evaluated at read time as well as write time, a query can always determine which shard holds a given document without a central routing table. Each primary shard has one or more replica shards on different nodes, providing both fault tolerance and read scalability — queries can be served from any replica.
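The routing formula is small enough to sketch directly. Note this is an illustration: Elasticsearch actually hashes the routing value with Murmur3, so a plain `hashCode` stands in here.

```java
// Hypothetical sketch of deterministic shard routing: shard = hash(docId) % numShards.
public class ShardRouter {
    static int shardFor(String docId, int numShards) {
        // floorMod keeps the result non-negative even when hashCode is negative
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        // The same docId lands on the same shard at write time and read time,
        // so no central routing table is needed
        System.out.println(shardFor("doc-42", 10) == shardFor("doc-42", 10)); // true
    }
}
```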

Critical planning constraint: The shard count is immutable after index creation because changing it would invalidate the routing formula for all existing documents. Plan for 3–5 years of growth. Over-sharding wastes memory and JVM overhead; under-sharding forces costly reindexing at scale. A common starting heuristic: size each shard between 10GB and 50GB of data.

4. Query Processing and Fanout

Because any document could reside on any shard, a search query must interrogate every shard — this is the scatter-gather (fan-out) pattern. The coordinator node dispatches the query in parallel to all N shards, each shard returns its local top-K candidates with their relevance scores, and the coordinator performs a global merge to produce the final ranked result set.

public SearchResult search(SearchQuery query) {
    // Phase 1: Scatter — send query to all N shards in parallel
    List<CompletableFuture<ShardResult>> futures = shards.stream()
        .map(shard -> CompletableFuture.supplyAsync(
            () -> shard.search(query, 100), // over-fetch: each shard returns its local top 100
            searchThreadPool
        ))
        .toList();

    // Phase 2: Gather — collect results from all shards
    List<ShardResult> results = futures.stream()
        .map(CompletableFuture::join)
        .toList();

    // Phase 3: Global merge — re-rank top K across all shard results
    return mergeAndRank(results, query.getSize());
}

The critical detail is that each shard must return more than the final page size to the coordinator. If the user requests 10 results and there are 20 shards, each shard returns its local top 100 candidates (a configurable over-fetch), and the coordinator selects the globally best 10 from the 2,000 candidates it receives. This over-fetching is the price of distributing an index — and it is also the source of coordinator OOM risk if top-K is set too large on a high-shard-count cluster.
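The global merge step described above can be sketched with a priority queue. The `Hit` record and `mergeAndRank` signature are illustrative, not the real Elasticsearch types:

```java
import java.util.*;

// Hypothetical sketch of the coordinator's global merge: given each shard's
// local top-K candidates, select the globally best `size` hits by score.
public class GlobalMerge {
    record Hit(String docId, double score) {}

    static List<Hit> mergeAndRank(List<List<Hit>> shardResults, int size) {
        // Max-heap over all candidates. With N shards each returning K hits,
        // the coordinator holds N*K candidates in memory -- the OOM risk
        // mentioned above when top-K is set too large.
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble(Hit::score).reversed());
        shardResults.forEach(heap::addAll);
        List<Hit> top = new ArrayList<>(size);
        while (!heap.isEmpty() && top.size() < size) top.add(heap.poll());
        return top;
    }

    public static void main(String[] args) {
        var merged = mergeAndRank(List.of(
            List.of(new Hit("a", 9.1), new Hit("b", 3.2)),
            List.of(new Hit("c", 7.5), new Hit("d", 1.0))), 2);
        System.out.println(merged); // the two highest-scoring hits: "a" then "c"
    }
}
```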

5. Relevance Ranking: BM25 and Beyond

Classic TF-IDF scoring rewards documents where the query term appears frequently (TF) and penalises terms that appear in many documents (IDF, a proxy for term informativeness). Its well-known weakness is that TF grows unboundedly — a document that mentions elasticsearch 100 times scores proportionally higher than one that mentions it 5 times, even if both are equally relevant.

BM25 (Best Match 25) is the industry standard relevance algorithm and the default in Elasticsearch since version 5.0. It saturates the TF component and applies a document-length normalisation that TF-IDF lacks:

BM25 score = IDF × (TF × (k1 + 1)) / (TF + k1 × (1 - b + b × fieldLen / avgLen))

k1 = 1.2  → controls TF saturation (higher = less saturation)
b  = 0.75 → controls length normalisation (1.0 = full normalisation, 0 = none)

The denominator's length-normalisation term is what makes BM25 robust: a short document where elasticsearch appears once can outscore a long document where it appears three times, because those three occurrences represent a lower density in the longer document. Tune k1 upward (1.5–2.0) for domains where term frequency is highly meaningful (legal documents, code), and set b lower (0.3–0.5) for fields where all documents are roughly the same length (product titles).
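A direct translation of the single-term BM25 formula above, using the Lucene-style IDF (`log(1 + (N - df + 0.5) / (df + 0.5))`); the class and method names are illustrative:

```java
// Sketch of the BM25 formula above for a single query term, with the
// k1 and b defaults quoted in the text.
public class Bm25 {
    static final double K1 = 1.2, B = 0.75;

    static double idf(long totalDocs, long docFreq) {
        return Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    static double score(double tf, long totalDocs, long docFreq,
                        double fieldLen, double avgLen) {
        double lengthNorm = 1 - B + B * fieldLen / avgLen; // the denominator term
        return idf(totalDocs, docFreq) * (tf * (K1 + 1)) / (tf + K1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Short doc with tf=1 vs long doc with tf=3: length normalisation
        // lets the short document win on term density
        double shortDoc = score(1, 1_000, 10, 20, 100);
        double longDoc  = score(3, 1_000, 10, 500, 100);
        System.out.println(shortDoc > longDoc); // true
    }
}
```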

For semantic relevance beyond keyword matching, vector search (KNN) retrieves documents whose embedding vectors are nearest to the query vector in high-dimensional space. Elasticsearch 8.x supports hybrid search using Reciprocal Rank Fusion (RRF), which combines BM25 lexical ranking with KNN semantic ranking without requiring manual score normalisation — a practical production solution for search that must handle both exact-match queries and natural-language semantic queries.

6. Real-Time Indexing and the Write Path

Lucene's near-real-time (NRT) indexing model means a newly indexed document is searchable within the refresh_interval — defaulting to one second in Elasticsearch. The refresh operation makes in-memory segment data visible to searchers without a full fsync to disk. This is the mechanism that delivers the "one-second freshness" guarantee, and it is also the primary source of write amplification: every refresh produces a new small segment, and small segments must be merged into larger ones in the background, consuming CPU and I/O.

For bulk ingestion scenarios — initial index population or large backfills — the _bulk API is mandatory. Sending documents one at a time over HTTP creates prohibitive per-request overhead. Optimal batch sizes lie between 5MB and 15MB per bulk request body; experiment within this range while monitoring indexing throughput and rejected request counts from the bulk thread pool.
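Client-side batching for the _bulk API can be sketched as follows. This is a simplified illustration: `BulkBatcher` and its 10MB budget (chosen from the 5–15MB range above) are assumptions, and a real client would POST each body to the _bulk endpoint and handle rejections with backoff.

```java
import java.util.*;

// Hypothetical sketch: accumulate NDJSON action/source pairs until the body
// approaches a byte budget, then cut a new bulk request body.
public class BulkBatcher {
    static final int MAX_BYTES = 10 * 1024 * 1024; // mid-range of 5-15MB guidance

    static List<String> batch(List<String> docsJson, String index) {
        List<String> bodies = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        for (String doc : docsJson) {
            // _bulk format: one action line, then one source line, per document
            String pair = "{\"index\":{\"_index\":\"" + index + "\"}}\n" + doc + "\n";
            if (body.length() + pair.length() > MAX_BYTES && body.length() > 0) {
                bodies.add(body.toString()); // flush a full bulk request body
                body = new StringBuilder();
            }
            body.append(pair);
        }
        if (body.length() > 0) bodies.add(body.toString());
        return bodies;
    }

    public static void main(String[] args) {
        var bodies = batch(List.of("{\"title\":\"a\"}", "{\"title\":\"b\"}"), "products");
        System.out.println(bodies.size()); // 1 -- both tiny docs fit in one body
    }
}
```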

# Zero-downtime reindex using index aliases
# 1. Create new index with correct mapping
PUT /products_v2 { "mappings": { ... }, "settings": { "number_of_shards": 10 } }

# 2. Reindex data from old index to new
POST /_reindex { "source": { "index": "products_v1" }, "dest": { "index": "products_v2" } }

# 3. Atomically swap the alias — zero downtime
POST /_aliases {
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add":    { "index": "products_v2", "alias": "products" } }
  ]
}

Index aliases allow application code to always write and read through a stable alias name (e.g., products), while the underlying index can be replaced atomically during a reindex operation. This is the standard technique for zero-downtime mapping changes, which otherwise require recreating an index from scratch.

7. Failure Scenarios in Production

Distributed search clusters fail in characteristically different ways than single-node systems. The following table covers the four most operationally significant failure modes:

Failure            | Root Cause                                      | Impact                                        | Fix
Split brain        | Network partition with 2 master-eligible nodes  | Data corruption, dual primaries               | Use 3 master-eligible nodes, quorum-based election
Shard imbalance    | Hot shard for popular terms                     | High latency on 1 shard, uneven node load     | Custom _routing with composite keys
OOM on coordinator | Too many top-K candidates from shards           | Full cluster outage                           | Limit from + size, enable circuit breaker
Segment explosion  | Too many small segments from rapid refreshes    | High CPU, merge storms, degraded query speed  | Increase refresh_interval, schedule force merge

Split brain deserves special emphasis. With only two master-eligible nodes, a network partition causes each node to elect itself master, resulting in two independent clusters that each accept writes — and produce divergent data that cannot be automatically reconciled. The cure is deploying three master-eligible nodes so that a quorum (2 of 3) is required to elect a master, making simultaneous dual-election impossible by construction.
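The quorum arithmetic is worth making explicit. A minimal sketch (the class and method names are illustrative):

```java
// Why three master-eligible nodes prevent split brain: a quorum is a strict
// majority, and two disjoint sides of a partition can never both hold one.
public class Quorum {
    static int quorum(int masterEligible) {
        return masterEligible / 2 + 1; // strict majority
    }

    static boolean canElect(int reachableMasters, int masterEligible) {
        return reachableMasters >= quorum(masterEligible);
    }

    public static void main(String[] args) {
        // 3 nodes partitioned 2/1: only the majority side can elect a master
        System.out.println(canElect(2, 3)); // true
        System.out.println(canElect(1, 3)); // false
        // With 2 nodes partitioned 1/1, neither side reaches quorum(2) == 2,
        // so a quorum rule halts both -- whereas a naive "elect yourself"
        // rule produces the dual-master scenario described above
    }
}
```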

8. Scaling Considerations

Horizontal scaling is straightforward: add data nodes to the cluster and Elasticsearch's shard allocation service automatically rebalances shards across the expanded node set. No application changes are required — the routing formula continues to work as long as the shard count is unchanged.

For time-series or age-stratified corpora, a warm/hot architecture dramatically reduces storage costs. Recent, frequently queried data lives on hot nodes (NVMe SSDs, high RAM), while older data migrates automatically via ILM (Index Lifecycle Management) to warm nodes (spinning HDDs, lower cost). Query routing transparently spans both tiers.

Cross-cluster search (CCS) allows a coordinator to fan queries out across multiple independent Elasticsearch clusters in different regions, enabling a single search API to span geographically distributed indices without data replication. CCS is the standard approach for multi-region search with data residency constraints.

Elasticsearch maintains three layers of search caching: the query cache (caches filter results at the shard level), the request cache (caches complete shard-level search responses for queries with a fixed total hits result), and the field data cache (holds uninverted field data in memory for aggregations and sorting). Tune heap allocation and cache eviction policies to keep hot queries in memory and prevent GC pressure from evictions.

9. When NOT to Build Your Own Search

A distributed search layer is powerful, but it is also a significant operational investment. Three conditions where you should explicitly avoid it:

"The best search architecture is the simplest one that meets your actual scale and relevance requirements — not the most sophisticated one that could theoretically handle 10× your future load."
— Principle of appropriate complexity

10. Key Takeaways

11. Conclusion

A distributed search engine is one of the most complex stateful systems you will operate. Its complexity is not arbitrary — every design decision from the immutable shard count to the BM25 saturation parameter to the three-node master quorum encodes a lesson learned from production failures at scale. The goal of this guide is not to have you replicate Elasticsearch from scratch, but to understand its architecture well enough to make sound operational decisions: when to add shards, why your p99 latency spiked after a bulk ingest job, and whether your coordinator heap is sized correctly for your top-K fan-out.

For teams designing the broader scalable backend that sits around a search layer, our Scalable System Design guide covers the architectural patterns for load balancing, caching, and data tier design that complement a distributed search layer in a production environment.



Last updated: March 2026 — Written by Md Sanwar Hossain