Md Sanwar Hossain

Senior Software Engineer · System Design · AI Infrastructure

System Design · March 19, 2026 · 22 min read · AI Infrastructure & System Design Series

Vector Database Architecture for Production AI: HNSW, ANN Search, and Billion-Scale Retrieval

Every production RAG system, semantic search engine, and recommendation pipeline eventually hits the same architectural inflection point: the naive nearest-neighbor search that worked fine at 100K embeddings becomes the 800ms query latency bottleneck at 50M embeddings. Understanding the internal architecture of vector databases — the indexing algorithms, the memory management strategies, the consistency trade-offs — is no longer optional knowledge for backend engineers building AI-native applications. This guide goes beyond "how to call the API" to explain why vector databases behave the way they do under production load, and how to architect your embedding storage for scale.

Table of Contents

  1. The Approximate Nearest Neighbor Problem
  2. HNSW: The Algorithm Behind Most Production Vector DBs
  3. IVF + PQ: Billion-Scale Indexing with FAISS
  4. Hybrid Search: Combining Dense and Sparse Retrieval
  5. Storage Architecture: Segments, WAL, and Persistence
  6. Scaling Vector Search: Sharding and Replication
  7. pgvector vs Pinecone vs Weaviate vs Qdrant: Decision Framework
  8. Production Failure Scenarios
  9. Trade-offs: Recall, Latency, Freshness, Cost

1. The Approximate Nearest Neighbor Problem

Given a query vector q and a database of n vectors, exact nearest-neighbor search requires computing the distance between q and every vector in the database — O(n · d) where d is the embedding dimension (typically 768 or 1536 for modern models). At 50M vectors with d=1536, that is 76.8 billion floating-point operations per query. Even with vectorized SIMD instructions processing ~10 GFLOP/s, that is 7.68 seconds per query — completely unacceptable for any interactive application.

Approximate Nearest Neighbor (ANN) algorithms trade recall for speed: they return a result set that contains the true nearest neighbors with high probability (e.g., 95%) but is not guaranteed to be exact. The key insight is that for most production applications — semantic search, RAG context retrieval, recommendation — 95% recall at 10ms latency is far more valuable than 100% recall at 8 seconds.
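To make the O(n · d) cost concrete, here is a minimal brute-force exact search in NumPy. This is an illustrative sketch (the `exact_knn` name and the toy 10K × 128 dataset are mine, not from any library): every query touches every vector, which is exactly the cost ANN indexes exist to avoid.

```python
import numpy as np

def exact_knn(query: np.ndarray, db: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact nearest-neighbor search by cosine similarity: O(n * d) per query."""
    # Normalize so cosine similarity reduces to a single dot product per vector.
    q = query / np.linalg.norm(query)
    db_norm = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = db_norm @ q                # n dot products of dimension d
    return np.argsort(-sims)[:k]     # indices of the k most similar vectors

# Tiny demo: 10K vectors, 128 dims (a real workload would be 50M x 1536)
rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 128)).astype(np.float32)
query = db[42] + 0.01 * rng.standard_normal(128).astype(np.float32)
top = exact_knn(query, db)
# A slightly perturbed copy of vector 42 should find vector 42 first
assert top[0] == 42
```

Scaling the matrix multiply from 10K × 128 to 50M × 1536 is what produces the multi-second latencies described above.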

2. HNSW: The Algorithm Behind Most Production Vector DBs

Hierarchical Navigable Small World (HNSW) is the algorithm at the core of Weaviate, Qdrant, Milvus, pgvector (HNSW index type), and Chroma. It builds a multi-layer graph where each node (embedding) is connected to its approximate nearest neighbors. The top layer is sparse (few long-range connections); lower layers are denser (many short-range connections). Search starts at the top layer, greedily descends toward the query, and terminates at the bottom layer with a precise local search.

The key tuning parameters with their production implications:

M (number of connections per node): Higher M improves recall but increases memory and construction time. For 768-dim embeddings: M=16 for balanced recall/memory (typical production default), M=32 for high-recall applications (medical/legal search where missing a relevant document has serious consequences).

ef_construction (search width during index build): Higher value improves index quality but slows build time linearly. Production recommendation: 200 for general search, 400 for high-recall requirements. Index build for 10M vectors at ef_construction=200 takes approximately 2–4 hours on a 32-core machine.

ef (search width at query time): The most important runtime parameter. Higher ef increases recall but increases query latency. Benchmark your specific dataset: typically ef=50 gives ~90% recall at 5ms, ef=200 gives ~98% recall at 15ms.

-- pgvector: Create HNSW index for production semantic search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (
  m = 16,                -- connections per layer
  ef_construction = 200  -- build quality
);

-- Set ef at query time for recall/latency trade-off
SET hnsw.ef_search = 100;

-- Semantic search query
SELECT id, title, 1 - (embedding <=> $1) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1) > 0.75  -- minimum similarity threshold
ORDER BY embedding <=> $1
LIMIT 20;

HNSW memory requirements are substantial and often underestimated. Each node stores up to 2M connections at the base layer (HNSW uses M_max0 = 2M at layer 0, where the vast majority of edges live), each as a 4-byte integer index. For 10M vectors at M=16: 10M × 32 connections × 4 bytes = 1.28 GB just for the graph structure, plus 10M × 1536 dimensions × 4 bytes = 61 GB for the raw embedding vectors. Total: ~62 GB RAM for the HNSW index to deliver sub-10ms queries — the entire index must be in memory. Plan accordingly.
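That arithmetic generalizes into a capacity-planning helper. A minimal sketch (the function name is mine; it assumes float32 vectors, 4-byte neighbor ids, and counts only the dominant base-layer edges, ignoring per-database overhead):

```python
def hnsw_memory_gb(n_vectors: int, dim: int, m: int = 16) -> dict:
    """Rough HNSW RAM estimate: raw float32 vectors + base-layer graph edges."""
    graph = n_vectors * (2 * m) * 4    # up to 2M neighbors per node at layer 0, 4-byte ids
    vectors = n_vectors * dim * 4      # float32 embeddings
    return {
        "graph_gb": graph / 1e9,
        "vectors_gb": vectors / 1e9,
        "total_gb": (graph + vectors) / 1e9,
    }

est = hnsw_memory_gb(10_000_000, 1536, m=16)
# ~1.28 GB graph + ~61.4 GB vectors: matches the worked example above
```

Running this before provisioning hardware avoids the most common HNSW deployment failure: an index that silently spills to swap.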

3. IVF + PQ: Billion-Scale Indexing with FAISS

When vector counts reach hundreds of millions, HNSW's memory requirements become prohibitive. Inverted File Index with Product Quantization (IVF+PQ) — the backbone of FAISS (Facebook AI Similarity Search) and Milvus's disk-based indexes — solves this with lossy compression.

IVF partitioning: k-means clusters the embedding space into nlist centroids (typically 1024–65536). Each vector is assigned to its nearest centroid. At query time, only the nprobe closest centroids are searched — dramatically reducing the search space. With nlist=4096 and nprobe=64, only 64/4096 ≈ 1.6% of vectors are compared per query.

Product Quantization (PQ) compression: Each 1536-dim vector is split into 48 sub-vectors of 32 dimensions each. Each sub-vector is quantized to one of 256 cluster centroids (8-bit index). The original 1536 × 4 = 6,144 bytes per vector becomes 48 bytes — a 128x compression ratio. This compression is lossy: PQ reduces recall by 5–15% depending on nprobe settings, but enables billion-scale indexes to fit on a single machine's RAM or NVMe.
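The encode/decode round trip can be sketched in NumPy. This is a toy illustration, not FAISS's implementation: random codebooks stand in for the k-means-trained ones FAISS would learn, and the `pq_encode`/`pq_decode` names are mine. It does show the real 6,144-byte → 48-byte compression, though:

```python
import numpy as np

def pq_encode(vecs: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Product-quantize vectors: (n, d) float32 -> (n, m) uint8 codes.
    codebooks: (m, 256, d//m) -- one 256-centroid codebook per sub-space."""
    m, _, dsub = codebooks.shape
    codes = np.empty((vecs.shape[0], m), dtype=np.uint8)
    for j in range(m):
        sub = vecs[:, j * dsub:(j + 1) * dsub]                  # (n, dsub) slice
        d2 = ((sub[:, None, :] - codebooks[j][None]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(axis=1)                         # nearest centroid id
    return codes

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """Reconstruct approximate (lossy) vectors from the 8-bit codes."""
    m = codebooks.shape[0]
    return np.concatenate([codebooks[j][codes[:, j]] for j in range(m)], axis=1)

rng = np.random.default_rng(0)
d, m = 1536, 48                        # 48 sub-vectors of 32 dims each
codebooks = rng.standard_normal((m, 256, d // m)).astype(np.float32)
vecs = rng.standard_normal((100, d)).astype(np.float32)
codes = pq_encode(vecs, codebooks)     # 48 bytes/vector vs 6,144 bytes raw
```

The `vecs.nbytes / codes.nbytes` ratio here is exactly 128×, matching the compression figure above; the reconstruction error from `pq_decode` is the source of the 5–15% recall loss.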

4. Hybrid Search: Combining Dense and Sparse Retrieval

Pure vector search underperforms on queries with precise keywords (product IDs, names, exact phrases), while pure BM25/TF-IDF keyword search misses semantic similarity. Production RAG systems typically combine both — outperforming either approach alone — using fusion methods such as Reciprocal Rank Fusion (RRF):

# Hybrid search with the Weaviate v3 Python client (dense vector + sparse BM25)
from weaviate.gql.get import HybridFusion

result = client.query.get("Document", ["title", "content"]) \
    .with_hybrid(
        query="Spring Boot reactive programming tutorial",
        alpha=0.75,        # 0 = pure BM25, 1 = pure vector, 0.75 = 75% vector weight
        fusion_type=HybridFusion.RELATIVE_SCORE
    ) \
    .with_limit(10) \
    .do()

# Qdrant hybrid search with Reciprocal Rank Fusion
from qdrant_client import models
from qdrant_client.models import FusionQuery, Fusion

results = client.query_points(
    collection_name="docs",
    prefetch=[
        models.Prefetch(query=embed(query_text), using="dense", limit=20),
        models.Prefetch(query=models.SparseVector(indices=..., values=...),
                        using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10,
)

5. Storage Architecture: Segments, WAL, and Persistence

Production vector databases (Qdrant, Weaviate, Milvus) use a segment-based storage model similar to LSM-tree databases. New embeddings are written to an in-memory write buffer (sometimes called a "growing segment" or "memtable"). When the buffer reaches a threshold, it is flushed to disk as an immutable segment with its own HNSW or IVF index. A background compaction process merges small segments into larger ones for query efficiency.

The critical durability mechanism is the Write-Ahead Log (WAL): every insert/update/delete is written to the WAL on disk before acknowledging the client. If the process crashes mid-flush, the WAL is replayed on restart to recover the in-memory state. Without the WAL, any crash during segment flush would lose all buffered writes.

The architectural tension: WAL replay is fast for small write buffers but can take minutes for large ones at startup. Operators must balance flush_interval (how often to flush to disk) against recovery_time_objective — a 10-minute flush interval means up to 10 minutes of WAL replay on restart, plus up to 10 minutes of recently written vectors being available only in memory (not persisted through backup snapshots).
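The append-before-ack and replay-on-restart protocol can be sketched in a few lines. This is an illustrative toy (JSON lines, a `WriteAheadLog` class of my own naming), not any particular database's on-disk format, but the fsync-before-acknowledge ordering is the part that matters:

```python
import json
import os
import tempfile

class WriteAheadLog:
    """Minimal WAL sketch: append + fsync before ack, replay after a crash."""

    def __init__(self, path: str):
        self.path = path
        self.f = open(path, "a")

    def append(self, op: dict) -> None:
        # Durably log the operation BEFORE acknowledging the client.
        self.f.write(json.dumps(op) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())   # force to stable storage, not just the page cache

    def replay(self) -> list[dict]:
        # On restart, rebuild the in-memory write buffer from the log.
        with open(self.path) as f:
            return [json.loads(line) for line in f]

path = os.path.join(tempfile.mkdtemp(), "vectors.wal")
wal = WriteAheadLog(path)
wal.append({"op": "upsert", "id": 1, "vec": [0.1, 0.2]})
wal.append({"op": "delete", "id": 7})
recovered = wal.replay()   # what the crash-recovery path would see on restart
```

The per-write fsync is also why WAL-backed inserts are slower than fire-and-forget writes: durability is paid for on every acknowledgment, and the replay cost grows with everything logged since the last flush.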

6. Scaling Vector Search: Sharding and Replication

Sharding distributes the vector collection across multiple nodes, with each node holding a subset of vectors. A query is broadcast to all shards, each shard returns its local top-k results, and the coordinator merges and re-ranks them to produce the final top-k. The merge step is cheap (comparing k × shard_count candidates), but the broadcast adds network fan-out — and tail-latency exposure — that grows with every additional shard.
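The coordinator-side merge can be sketched as follows (assuming each shard returns its local top-k as (similarity, doc_id) pairs; the function name is mine):

```python
import heapq

def merge_shard_results(shard_results, k: int):
    """Scatter-gather merge: global top-k from per-shard top-k lists.
    Each shard returns [(similarity, doc_id), ...]; higher similarity is better."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)   # compares k * shard_count candidates

shard_a = [(0.92, "a1"), (0.85, "a2"), (0.60, "a3")]
shard_b = [(0.95, "b1"), (0.70, "b2"), (0.55, "b3")]
top2 = merge_shard_results([shard_a, shard_b], k=2)
# Global top-2 interleaves the shards: b1 (0.95), then a1 (0.92)
assert [doc for _, doc in top2] == ["b1", "a1"]
```

Note that each shard must return a full k results (not k / shard_count), because in the worst case all of the true top-k live on a single shard.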

Key sharding decision: the number of shards is typically chosen based on memory: target <80% memory utilization per shard, accounting for the full HNSW index + raw vector data + overhead. For 100M vectors at 1536 dimensions: ~614 GB of raw vectors + ~13 GB of HNSW graph (scaling the per-10M figures from section 2) ≈ 630 GB. With 64 GB nodes: 630 / (64 × 0.8) ≈ 13 shards minimum (round up to 16 for headroom).

Replication provides fault tolerance and read throughput scaling. A replication factor of 2 means each shard has one replica; queries can be served by either. The consistency trade-off: with async replication, a query directed to the replica immediately after a write may return stale results. For RAG systems where freshness matters (customer support knowledge base updated in real time), use sync replication with acknowledgment from all replicas before confirming the write — at the cost of higher write latency.

7. pgvector vs Pinecone vs Weaviate vs Qdrant: Decision Framework

pgvector (PostgreSQL extension): Best choice when your application already uses PostgreSQL and vectors are <5M. You get ACID transactions, SQL joins between vectors and metadata, and zero additional infrastructure. Limitations: HNSW build is single-threaded in pgvector <0.7; query throughput caps around 100 QPS for large collections.

Qdrant: Best choice for 5M–500M vectors requiring high query throughput. Rust-native, memory-efficient, supports on-disk HNSW (mmap-based), native sparse vector support for hybrid search, and Kubernetes operator available. Strong production track record at mid-scale.

Weaviate: Best choice when you need tightly integrated hybrid search (BM25 + vector) with a graph data model. Higher operational complexity; module system adds flexibility but also surface area. GraphQL API is more expressive than most alternatives.

Pinecone: Best choice for teams that want zero operational overhead and are willing to accept vendor lock-in and per-query pricing. Strong managed SLA, global availability, but data leaves your infrastructure — may be a compliance blocker for regulated industries.

8. Production Failure Scenarios

Index eviction from RAM. When the OS evicts HNSW index pages from memory under memory pressure, subsequent queries trigger expensive page faults, causing latency spikes from 5ms to 500ms+. Mitigation: allocate dedicated nodes for the vector database with memory reserved exclusively for the process (use Linux cgroups or Kubernetes resource limits), and monitor cache_miss_rate as a leading indicator.

Recall degradation after bulk upsert. HNSW indexes are built incrementally — each new vector is connected to its nearest neighbors at insertion time. After a massive bulk load (e.g., 10M new vectors added in one batch), the graph structure may be suboptimal because the newly added vectors weren't connected based on the final index state. Schedule a background index re-optimization job after bulk loads.

9. Trade-offs: Recall, Latency, Freshness, Cost

Every vector database architecture involves a four-way tension: Recall (finding the true nearest neighbors), Latency (query response time), Freshness (time from write to searchability), and Cost (RAM, compute, storage). No system optimizes all four simultaneously. HNSW delivers high recall at low latency at the cost of RAM. IVF+PQ minimizes cost at the cost of recall. Segment-based architectures improve write throughput at the cost of freshness (new vectors are not immediately indexed in the HNSW graph — they live in the write buffer and are searched separately with a linear scan until the next flush/index build).

Define your system's requirements along these four axes before choosing an algorithm or database. A legal document search system (high recall, latency <100ms, freshness <1 hour, moderate cost) has a fundamentally different optimal architecture than a real-time product recommendation system (moderate recall, latency <5ms, freshness <1 minute, high cost acceptable).


Last updated: March 2026 — Written by Md Sanwar Hossain