Designing a Real-Time Analytics Platform at Scale: ClickHouse, Apache Druid, Lambda Architecture & Streaming Analytics
Building an analytics platform that ingests 10 billion events per day while delivering sub-second query responses requires careful architectural decisions. This comprehensive guide walks through Lambda vs Kappa architecture trade-offs, ClickHouse vs Apache Druid vs Pinot comparisons, Apache Flink stream processing, and the materialized view strategies that power dashboards at scale.
TL;DR — Architecture Decision in One Paragraph
"Use Lambda architecture for combined batch + streaming analytics. ClickHouse for sub-second OLAP queries. Apache Druid for sub-second slice-and-dice. Apache Flink for stateful stream processing. Kafka as the unified event backbone. Pre-aggregate with materialized views for dashboard queries."
Table of Contents
- Requirements & Scale Estimation
- Lambda vs Kappa Architecture Trade-offs
- Ingestion Layer: Kafka, Kinesis & Pulsar
- Stream Processing: Apache Flink at Scale
- Batch Layer: Spark & Data Lake Architecture
- OLAP Storage: ClickHouse vs Apache Druid vs Pinot
- Query Layer & API Design
- Materialized Views & Pre-Aggregation Strategies
- Reliability, Data Quality & SLOs
- Real-World Case Studies
1. Requirements & Scale Estimation
Before designing the system, we must define the scale, latency targets, and data retention requirements. Getting these wrong leads to over-engineered or under-provisioned infrastructure that is expensive to fix in production.
Functional Requirements
- Event ingestion: Ingest events from web, mobile, and IoT devices across multiple tenants
- Real-time dashboards: Dashboard queries with <5 second data freshness end-to-end
- Ad-hoc queries: Sub-second response time for pre-aggregated queries; <10 seconds for ad-hoc exploration
- Historical analysis: Query historical data up to 3 years back with consistent SQL interface
- Alerting: Threshold-based alerting on metric values with <30 second detection latency
Non-Functional Requirements
- Throughput: 10 billion events per day sustained, with 3× burst capacity
- P99 query latency: <1 second for pre-aggregated dashboard queries
- P99 ad-hoc latency: <10 seconds for exploratory queries on raw data
- Ingestion availability: 99.99% uptime for the ingestion pipeline
- Data retention: Hot tier (30 days in OLAP), warm tier (1 year in data lake), cold tier (3 years in object storage)
Scale Math & Capacity Estimation
Let's work through the numbers systematically before choosing technology:
- Events per second: 10B ÷ 86,400 seconds = ~115,740 events/second (peak with 3× = ~347K events/sec)
- Average event size: 500 bytes (JSON with typical web analytics fields)
- Daily ingestion volume: 10B × 500B = 5 TB/day raw data
- Monthly storage (uncompressed): 5 TB × 30 = 150 TB/month
- With 10:1 compression ratio: 15 TB/month compressed (ClickHouse LZ4/ZSTD typical)
- 3-year retention: 15 TB × 36 months = ~540 TB compressed total storage
- Kafka throughput needed: 115K events/sec × 500B = ~57 MB/sec; buffer 3× = 171 MB/sec across all partitions
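The arithmetic above is worth sanity-checking in code before committing to hardware. A short script, using only the assumptions already stated (500-byte events, 3× burst, 10:1 compression):

```python
# Back-of-envelope capacity estimation for the stated workload.
EVENTS_PER_DAY = 10_000_000_000
EVENT_SIZE_BYTES = 500        # assumed average JSON web-analytics event
BURST_FACTOR = 3
COMPRESSION_RATIO = 10        # assumed ClickHouse LZ4/ZSTD ratio
RETENTION_MONTHS = 36

events_per_sec = EVENTS_PER_DAY / 86_400
peak_events_per_sec = events_per_sec * BURST_FACTOR
daily_tb = EVENTS_PER_DAY * EVENT_SIZE_BYTES / 1e12        # raw TB/day
monthly_tb_compressed = daily_tb * 30 / COMPRESSION_RATIO  # TB/month in OLAP
total_tb = monthly_tb_compressed * RETENTION_MONTHS        # 3-year footprint
kafka_mb_per_sec = events_per_sec * EVENT_SIZE_BYTES / 1e6

print(f"~{events_per_sec:,.0f} ev/s steady, ~{peak_events_per_sec:,.0f} ev/s peak")
print(f"{daily_tb:.0f} TB/day raw -> {monthly_tb_compressed:.0f} TB/month compressed")
print(f"~{total_tb:.0f} TB total over {RETENTION_MONTHS} months")
```

Running this reproduces the figures above (~115,741 ev/s, 5 TB/day raw, 540 TB over three years), which is what the cluster sizing below is provisioned against.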
| Component | Count | Spec | Purpose |
|---|---|---|---|
| Kafka Brokers | 9 | 32 vCPU, 128 GB RAM, 10 TB SSD | Event backbone, 7-day retention |
| Flink TaskManagers | 20 | 16 vCPU, 64 GB RAM | Stream processing, windowed aggregations |
| ClickHouse Nodes | 12 | 32 vCPU, 256 GB RAM, 20 TB NVMe | OLAP hot tier, sub-second queries |
| Query Servers | 6 | 8 vCPU, 32 GB RAM | REST/GraphQL API, caching, routing |
2. Lambda vs Kappa Architecture Trade-offs
The architectural pattern you choose determines system complexity, operational burden, and your ability to correct historical data. Both Lambda and Kappa have legitimate use cases — the decision depends on your team size, accuracy requirements, and reprocessing frequency.
Lambda Architecture
Lambda architecture processes data through two parallel paths that merge at the serving layer:
- Speed layer: Apache Flink processes events in real-time and writes incremental results to the serving layer with low latency
- Batch layer: Apache Spark reprocesses all raw data on a schedule (hourly or daily) for accuracy correction and historical backfill
- Serving layer: ClickHouse or Druid merges batch + speed layer results, preferring batch results when available
- Pros: Highly accurate historical reprocessing, battle-tested at Hadoop-era scale, clear separation of concerns
- Cons: Maintaining two code paths (Flink streaming logic + Spark batch logic), data consistency challenges at merge time, higher operational complexity
Kappa Architecture
Kappa architecture simplifies by using a single streaming path for both real-time processing and historical reprocessing:
- Single streaming layer: Apache Flink handles both real-time event processing and full historical reprocessing
- Kafka as infinite log: Extended retention (7–30 days) or S3-backed Kafka (Tiered Storage) allows replaying from any offset
- Reprocessing strategy: Deploy new Flink job version alongside old one, replay Kafka from beginning, cut over serving layer when new job catches up
- Pros: Single codebase, simplified operations, faster iteration on processing logic changes
- Cons: Full reprocessing requires re-reading entire Kafka topic which is compute-intensive; long reprocessing windows (hours/days) for 3-year history
| Factor | Lambda | Kappa |
|---|---|---|
| Complexity | High (two code paths) | Lower (one code path) |
| Reprocessing | Accurate, efficient Spark batch | Replay from Kafka / data lake |
| Team size fit | Large teams with dedicated streaming + batch eng | Smaller teams, faster iteration |
| Latency profile | Dual: real-time + batch correction | Stream-first, eventual consistency |
| Use when | Accuracy is critical, frequent backfills | Speed-to-market, simpler ops |
3. Ingestion Layer: Kafka, Kinesis & Pulsar
The ingestion layer is the system's foundation. Every downstream component depends on it being reliable, ordered (within partitions), and fast. At 115K+ events/second, small design mistakes become catastrophic at scale.
Kafka Partitioning Strategy
Partition key choice affects ordering guarantees and downstream join performance:
- Partition by tenant_id: Keeps all events for a tenant in order; simplifies per-tenant aggregations in Flink
- Partition by user_id: Enables user session tracking with ordered events; better for user-centric analytics
- Compound key (tenant_id + event_type): Balances load while keeping analytically related events co-located
- Kafka producer tuning: `batch.size=65536`, `linger.ms=5`, `compression.type=lz4` for throughput optimization
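A sketch of the compound-key routing described above. Kafka's default Java partitioner hashes keys with murmur2; the stdlib hash here is an illustrative stand-in, since the only property that matters for this discussion is that equal keys always map to the same partition:

```python
import hashlib

def partition_for(tenant_id: str, event_type: str, num_partitions: int) -> int:
    # Compound key keeps a tenant's analytically related events co-located
    # while spreading different event types across partitions for balance.
    key = f"{tenant_id}:{event_type}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Equal keys always land on the same partition, which is what preserves
# per-key ordering for downstream windowed aggregations.
p1 = partition_for("acme-corp", "page_view", 64)
p2 = partition_for("acme-corp", "page_view", 64)
assert p1 == p2 and 0 <= p1 < 64
```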
Event Schema (Avro with Schema Registry)
{
"event_id": "uuid-v4",
"tenant_id": "acme-corp",
"user_id": "user-12345",
"event_type": "page_view",
"timestamp": "2026-04-11T10:30:00Z",
"properties": {
"page": "/dashboard",
"duration_ms": 1250,
"referrer": "google.com"
},
"session_id": "sess-xyz",
"ip_address": "192.168.1.1"
}
Kafka vs Kinesis vs Pulsar
| Feature | Apache Kafka | Amazon Kinesis | Apache Pulsar |
|---|---|---|---|
| Max throughput | GB/sec per broker | 1 MB/sec per shard | GB/sec per broker |
| Retention | Configurable (days–infinite) | Up to 365 days | Tiered (unlimited) |
| Ops burden | High (self-managed) | Low (managed) | High (self-managed) |
| Multi-tenancy | Namespace isolation | Stream-level | Native multi-tenancy |
| Best for | High-throughput, Kappa arch | AWS-native, serverless | Multi-region, geo-replication |
Dead Letter Queue & Idempotency
- Dead letter queue (DLQ): Route malformed or schema-invalid events to a separate Kafka topic for analysis and reprocessing without blocking the main pipeline
- Idempotency: Consumers use `event_id` (UUID v4) as the deduplication key; check against a Redis bloom filter before writing to OLAP storage to prevent duplicate aggregations
- Schema Registry: Confluent Schema Registry enforces Avro schema compatibility (BACKWARD, FORWARD, FULL) to prevent breaking schema changes
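Putting the DLQ and idempotency rules together, a minimal consumer sketch (an in-memory set stands in for the Redis bloom filter, and plain lists stand in for the OLAP writer and DLQ topic):

```python
REQUIRED = ("event_id", "tenant_id", "timestamp")
seen = set()              # stand-in for the Redis bloom filter
olap_sink, dlq = [], []   # stand-ins for the OLAP writer and DLQ topic

def handle(event: dict) -> None:
    # Schema/null validation: malformed events are routed to the DLQ
    # with error details instead of blocking the main pipeline.
    missing = [f for f in REQUIRED if not event.get(f)]
    if missing:
        dlq.append({"event": event, "error": f"missing fields: {missing}"})
        return
    # Idempotency: drop duplicate event_ids before the OLAP write
    # (log + discard in production rather than erroring).
    if event["event_id"] in seen:
        return
    seen.add(event["event_id"])
    olap_sink.append(event)

handle({"event_id": "e1", "tenant_id": "acme", "timestamp": "2026-04-11T10:30:00Z"})
handle({"event_id": "e1", "tenant_id": "acme", "timestamp": "2026-04-11T10:30:00Z"})  # duplicate
handle({"tenant_id": "acme"})  # malformed, lands in the DLQ
```

Note that a real bloom filter can report false positives, so this design occasionally discards a genuinely new event; for analytics counters that is usually an acceptable trade against double-counting.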
4. Stream Processing: Apache Flink at Scale
Apache Flink is the industry standard for stateful stream processing at scale. Its exactly-once semantics, rich windowing API, and SQL support make it the right choice for building the speed layer of a real-time analytics platform.
Window Types & When to Use Each
- Tumbling windows (1-minute): Non-overlapping fixed-size windows; ideal for per-minute metric aggregations like page views per minute, events per minute
- Sliding windows (5-minute window, 1-minute slide): Overlapping windows for moving averages; useful for trailing 5-minute averages shown on dashboards
- Session windows (gap-based): Dynamically sized windows closed when activity gap exceeds threshold; perfect for user session tracking with variable session lengths
- Global windows: Accumulate state indefinitely; used for all-time counters with custom triggers
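Tumbling-window assignment is simple enough to show directly. This sketch buckets event timestamps into 1-minute windows as the first bullet describes (Flink does this internally; the function here is purely illustrative):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(event_time_ms: int) -> int:
    # Tumbling windows are non-overlapping: every event falls in exactly
    # one window, aligned to the window size.
    return event_time_ms - (event_time_ms % WINDOW_MS)

counts = defaultdict(int)
for ts in (10_000, 59_999, 60_000, 61_500, 125_000):
    counts[window_start(ts)] += 1
# Windows [0, 60s) and [60s, 120s) each hold two events; [120s, 180s) holds one.
```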
Flink SQL for Real-Time Aggregations
SELECT
tenant_id,
event_type,
TUMBLE_START(rowtime, INTERVAL '1' MINUTE) as window_start,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users
FROM events
GROUP BY
tenant_id,
event_type,
TUMBLE(rowtime, INTERVAL '1' MINUTE)
Exactly-Once Semantics & State Backends
- Flink checkpointing: Enable distributed checkpoints with `execution.checkpointing.interval: 30s` and `execution.checkpointing.mode: EXACTLY_ONCE` for fault-tolerant processing
- RocksDB state backend: Use for large state (multiple GB per TaskManager); it spills to disk and supports incremental checkpoints to S3, making it effectively mandatory for session windows over millions of users
- HashMapStateBackend: Use for small state (<few GB) — in-memory, lower latency but no spill-to-disk; appropriate for short tumbling window aggregations
- Watermarks for late events: Configure allowed lateness of 30–60 seconds; emit results with watermark, update with late arrivals, use side outputs for very late events
- Backpressure handling: Monitor Flink metrics for backpressure indicators; scale TaskManagers horizontally or tune parallelism per operator when ingestion spikes occur
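The watermark rules above can be sketched as a small classifier. This is a simplification of Flink's semantics, not its API: events ahead of the watermark are on time, events within the allowed-lateness window trigger an updated result, and anything older goes to a side output:

```python
ALLOWED_LATENESS_MS = 60_000  # 60s allowed lateness, per the guidance above

def classify(event_ts_ms: int, watermark_ms: int) -> str:
    if event_ts_ms >= watermark_ms:
        return "on-time"            # window still open for this event
    if event_ts_ms >= watermark_ms - ALLOWED_LATENESS_MS:
        return "late-but-allowed"   # emits an updated window result
    return "very-late"              # routed to a side output

wm = 1_000_000
assert classify(1_000_500, wm) == "on-time"
assert classify(950_000, wm) == "late-but-allowed"
assert classify(900_000, wm) == "very-late"
```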
5. Batch Layer: Spark & Data Lake Architecture
In Lambda architecture, the batch layer provides the ground truth by reprocessing all raw data. Apache Spark running over a well-structured data lake handles this efficiently, correcting any errors from the speed layer while enabling long-horizon historical analysis.
Data Lake Zone Architecture
- Raw zone (S3/GCS): Immutable raw events written directly from Kafka consumers; no transformation; enables full reprocessing from source-of-truth
- Cleaned zone: Schema-validated, deduplicated, PII-masked data in Apache Parquet format; partitioned by year/month/day/hour for efficient time-range pruning
- Aggregated zone: Pre-computed Spark aggregations (hourly rollups, daily summaries) for batch serving and historical dashboard queries
Table Format: Delta Lake vs Apache Iceberg
| Feature | Delta Lake | Apache Iceberg |
|---|---|---|
| Schema evolution | Add/rename columns | Full schema evolution + type promotions |
| Time travel | Yes (transaction log) | Yes (snapshot-based) |
| Engine support | Spark-centric, Trino/Flink improving | Spark, Trino, Flink, Hive native |
| Best for | Databricks-centric stacks | Multi-engine open data lakes |
Spark Batch Job Patterns
- Hourly accuracy correction: Spark jobs run at H+15 to reprocess the previous hour, incorporating late-arriving events and correcting speed-layer approximations
- Daily compaction: Merge hundreds of small Parquet files written by Flink or Kafka consumers into larger files (512 MB target) for efficient query performance
- Partitioning strategy: Partition by `year=YYYY/month=MM/day=DD/hour=HH` with `tenant_id` as a secondary partition for multi-tenant isolation in scans
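The partition layout in the last bullet can be sketched as a path builder (a hypothetical helper for illustration, not part of any Spark API):

```python
from datetime import datetime, timezone

def partition_path(ts: datetime, tenant_id: str) -> str:
    # Time-first, Hive-style layout enables time-range pruning; tenant_id
    # as the secondary partition isolates multi-tenant scans.
    return (f"year={ts.year:04d}/month={ts.month:02d}/"
            f"day={ts.day:02d}/hour={ts.hour:02d}/tenant_id={tenant_id}")

path = partition_path(datetime(2026, 4, 11, 10, 30, tzinfo=timezone.utc), "acme-corp")
# -> year=2026/month=04/day=11/hour=10/tenant_id=acme-corp
```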
6. OLAP Storage: ClickHouse vs Apache Druid vs Pinot
The OLAP storage layer is where query performance lives. Your choice of ClickHouse, Apache Druid, or Apache Pinot determines the types of queries you can serve sub-second, the operational complexity you accept, and the cost structure at scale.
| Feature | ClickHouse | Apache Druid | Apache Pinot |
|---|---|---|---|
| Query latency | Sub-second | Sub-second | Sub-second |
| Ingestion mode | Batch + streaming | Native streaming | Native streaming |
| SQL support | Full ANSI SQL | Limited SQL | Limited SQL |
| Joins | Full joins supported | Limited joins | Limited joins |
| Data freshness | Seconds | Seconds | Seconds |
| Cardinality | High cardinality OK | HLL approximations | HLL approximations |
| Compression | 10:1 typical | 8:1 typical | 8:1 typical |
| Horizontal scaling | Sharding + replication | Tier-based (historical/real-time) | Tier-based |
| Best for | Ad-hoc SQL analytics | Time-series, dashboards | Pinpoint low-latency queries |
| Used by | Cloudflare, Yandex | Uber, Netflix | LinkedIn, Uber |
ClickHouse Table DDL: MergeTree Engine
The ORDER BY clause is the most critical design decision in ClickHouse — it becomes the primary index and determines which query patterns will be sub-second:
CREATE TABLE events (
event_id UUID,
tenant_id String,
event_type LowCardinality(String),
user_id String,
timestamp DateTime,
properties String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, event_type, timestamp)
SETTINGS index_granularity = 8192;
Key design choices: LowCardinality(String) for event_type (reduces memory by 3–5×), monthly partitioning for efficient data retention management, and ORDER BY starting with tenant_id to isolate scans per tenant.
7. Query Layer & API Design
The query layer abstracts the storage tier from consumers, enabling intelligent routing between materialized views (fast) and raw tables (flexible), caching to reduce OLAP load, and multi-tenancy enforcement at the API boundary.
REST API Design
- GET /api/v1/analytics/metrics: Dashboard query endpoint; accepts tenant_id, time_range, granularity, filters; returns pre-aggregated data from materialized views when available
- POST /api/v1/analytics/query: Ad-hoc query endpoint; accepts SQL-like query DSL or parameterized queries; routed to ClickHouse raw tables with 30-second timeout enforcement
- GET /api/v1/analytics/async/{query_id}: Async query result endpoint for long-running queries; pattern: submit → 202 Accepted → poll for results
Caching & Query Routing
- Redis caching layer: Cache dashboard query results with TTL=30 seconds for real-time data; TTL=5 minutes for historical queries; key = hash(tenant_id + query_params + time_bucket)
- Intelligent query routing: Detect if query can be served from materialized view (pre-defined aggregation granularities) vs requires raw table scan; materialized views serve ~80% of dashboard traffic
- Rate limiting: 1,000 queries per minute per tenant via Redis token bucket; burst of 100 queries/second for peak dashboard loads
- Query timeout enforcement: 30-second max for synchronous queries; async pattern for queries predicted to exceed 10 seconds based on query plan analysis
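A sketch of the cache key and TTL selection described above (the key format and names are illustrative, not a prescribed scheme):

```python
import hashlib
import json

REALTIME_TTL_S, HISTORICAL_TTL_S = 30, 300  # TTLs from the bullet above

def cache_entry(tenant_id: str, params: dict, time_bucket: str,
                is_realtime: bool) -> tuple[str, int]:
    # Deterministic key: identical tenant + params + time bucket always
    # hashes to the same Redis key, so repeat dashboard loads hit cache.
    payload = json.dumps({"t": tenant_id, "p": params, "b": time_bucket},
                         sort_keys=True).encode("utf-8")
    key = "q:" + hashlib.sha256(payload).hexdigest()[:16]
    ttl = REALTIME_TTL_S if is_realtime else HISTORICAL_TTL_S
    return key, ttl

k1, ttl = cache_entry("acme", {"metric": "page_views"}, "2026-04-11T10:30", True)
k2, _ = cache_entry("acme", {"metric": "page_views"}, "2026-04-11T10:30", True)
assert k1 == k2 and ttl == 30
```

Sorting the JSON keys before hashing is the important detail: without it, semantically identical queries could produce different cache keys and the hit rate collapses.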
GraphQL for Flexible Ad-Hoc Queries
GraphQL enables frontend teams to request exactly the metrics and dimensions they need without requiring backend changes for each new dashboard widget. Use persisted queries in production to prevent expensive arbitrary queries that could scan terabytes of data. Pair with query complexity limits (max depth = 5, max nodes = 50) to protect the OLAP tier.
8. Materialized Views & Pre-Aggregation Strategies
Pre-aggregation is the single highest-leverage optimization for analytics platform query performance. At 10B events/day, scanning raw events for every dashboard query is infeasible. Materialized views trade storage for query latency, and with 10:1 compression, the storage cost is manageable.
ClickHouse Materialized View: Hourly Rollup
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (tenant_id, event_type, hour)
AS SELECT
tenant_id,
event_type,
toStartOfHour(timestamp) AS hour,
countState() AS event_count,
uniqState(user_id) AS unique_users,
avgState(response_time_ms) AS avg_response_time
FROM events
GROUP BY tenant_id, event_type, hour;
Note the engine choice: SummingMergeTree correctly merges only plain numeric sums, so uniq() and avg() results would silently drift as background merges combine parts. Storing intermediate aggregate states (countState, uniqState, avgState) in an AggregatingMergeTree and reading them back with the -Merge combinators (countMerge, uniqMerge, avgMerge) keeps unique-user counts and averages correct.
Tiered Aggregation Strategy
| Tier | Granularity | Retention | Use case |
|---|---|---|---|
| Live | Second-level (Flink) | 1 hour | Real-time alerting, live counters |
| Minute | 1-minute rollup (ClickHouse MV) | 7 days | Short-range dashboard sparklines |
| Hour | 1-hour rollup (ClickHouse MV) | 90 days | Dashboard primary time series |
| Day | Daily rollup (Spark batch) | 3 years | Historical trend analysis, reports |
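One reasonable way to route queries across these tiers is to pick the finest granularity whose retention still covers the requested lookback; a sketch:

```python
from datetime import timedelta

# (tier, granularity, retention) mirroring the table above
TIERS = [
    ("live",   timedelta(seconds=1), timedelta(hours=1)),
    ("minute", timedelta(minutes=1), timedelta(days=7)),
    ("hour",   timedelta(hours=1),   timedelta(days=90)),
    ("day",    timedelta(days=1),    timedelta(days=3 * 365)),
]

def pick_tier(lookback: timedelta) -> str:
    # Route to the finest-grained tier whose retention covers the range.
    for name, _granularity, retention in TIERS:
        if lookback <= retention:
            return name
    raise ValueError("requested range exceeds 3-year retention")

assert pick_tier(timedelta(minutes=30)) == "live"
assert pick_tier(timedelta(days=30)) == "hour"
```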
Cache Invalidation Strategy
Event-driven cache busting using Flink: when a new minute-window closes, publish a cache invalidation event to a Kafka topic; query servers consume it and evict the relevant cache keys for the affected tenant_id + time_bucket combinations. This avoids polling-based TTL staleness while keeping cache hit rates above 70% for repeat dashboard loads.
9. Reliability, Data Quality & SLOs
An analytics platform that loses events silently is worse than one that fails loudly — customers make business decisions based on your data. Data quality must be enforced at every layer, and SLOs must be defined before the system is built, not after.
Data Quality Checks
- Schema validation: Validate against Avro schema at ingestion; reject malformed events to DLQ with error details for debugging
- Null checks: `event_id`, `tenant_id`, and `timestamp` are required fields; null values trigger DLQ routing
- Range checks: Reject events with timestamps >1 hour in the future or >7 days in the past as likely clock skew or replay attacks
- Anomaly detection: Alert when event rate for a tenant deviates >3 sigma from the rolling 7-day baseline; catches both data loss and sudden spike conditions
- Idempotent consumers: Check `event_id` against a Redis bloom filter before writing to OLAP; handle duplicates gracefully (log + discard) rather than erroring
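The range check above reduces to a few lines; a sketch of the timestamp validator:

```python
from datetime import datetime, timedelta, timezone

MAX_FUTURE = timedelta(hours=1)  # beyond this: likely clock skew
MAX_PAST = timedelta(days=7)     # beyond this: likely replay or stale batch

def timestamp_ok(event_ts: datetime, now: datetime) -> bool:
    # Events outside the accepted window are routed to the DLQ.
    return (now - MAX_PAST) <= event_ts <= (now + MAX_FUTURE)

now = datetime(2026, 4, 11, 12, 0, tzinfo=timezone.utc)
assert timestamp_ok(now - timedelta(days=1), now)
assert not timestamp_ok(now + timedelta(hours=2), now)
assert not timestamp_ok(now - timedelta(days=8), now)
```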
SLO Definitions
Production SLOs
- Ingestion availability: 99.99% measured over rolling 30 days (~4.3 minutes downtime/month)
- P99 dashboard load time: <2 seconds for pre-aggregated queries
- Data freshness: <5 seconds end-to-end for real-time dashboard data
- Query success rate: >99.9% of queries return results within timeout
- Data completeness: <0.01% event loss rate measured via end-to-end reconciliation
- Error budget policy: Freeze all non-critical deployments when monthly error budget is 50% consumed; incident review required at 75% consumed
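The error budget policy is easy to operationalize in code; a sketch of the 30-day budget math behind the 99.99% target:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the rolling window

def error_budget_minutes(slo: float) -> float:
    # Downtime allowed by an availability SLO over rolling 30 days.
    return MINUTES_PER_30_DAYS * (1 - slo)

budget = error_budget_minutes(0.9999)  # ~4.32 minutes/month at 99.99%
consumed_pct = 100 * 2.5 / budget      # e.g. 2.5 min of downtime so far
# consumed_pct is ~58%: past the 50% threshold, so the policy above
# freezes all non-critical deployments.
```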
Disaster Recovery
- Multi-AZ Kafka: 3 replicas across availability zones; automatic leader election on broker failure with no data loss
- ClickHouse replication: 2 replicas per shard via ClickHouse Keeper; reads can be served from either replica during failures
- Data lake as recovery source: Raw events in S3 with versioning enabled; Flink or Spark can replay any time range for OLAP tier reconstruction
- RTO/RPO targets: RTO = 30 minutes for OLAP tier recovery (replay last N hours from data lake); RPO = 0 for ingestion (Kafka durability guarantees)
10. Real-World Case Studies
The architectural patterns described above are not theoretical — they power the analytics infrastructure at some of the most data-intensive companies in the world. Here's what the published engineering literature reveals about their choices and lessons learned.
Pinterest — Apache Druid at 1.7 Trillion Rows
Scale: 1.7 trillion rows, 50K+ queries/day for ad analytics
Architecture: Kafka → Druid real-time nodes → historical nodes
Result: Sub-second P50 latency for ad campaign performance queries; enabled real-time budget pacing for advertisers
Key lesson: Druid's time-based partitioning (segments) makes time-range queries extremely efficient for ad analytics workloads
Uber — Apache Pinot at 600K+ Queries/Day
Scale: 600K+ queries/day across real-time driver supply/demand analytics
Architecture: Kafka → Pinot real-time tables; batch import for historical
Result: Real-time visibility into driver availability per geo-hash enabled dynamic surge pricing optimization
Key lesson: Pinot's upsert support enables real-time fact updates (e.g., trip status changes) without full reingestion
Cloudflare — ClickHouse at 6M Events/Second
Scale: 6+ million HTTP log events per second across global edge network
Architecture: Direct ClickHouse ingestion via the HTTP interface; ReplicatedMergeTree tables with ZooKeeper-coordinated replication
Result: Full SQL analytics on DNS and HTTP traffic; sub-second dashboards for security & DDoS detection
Key lesson: Full ANSI SQL support was the key ClickHouse differentiator — enabled ad-hoc security investigations without pre-defined query schemas
Netflix — Druid + Apache Iceberg for A/B Testing
Scale: Viewer engagement analytics across 200M+ subscribers; A/B test result dashboards
Architecture: Kafka → Flink → Iceberg data lake (truth) + Druid hot tier (dashboards)
Result: Recommendation model metrics available within 2 minutes of viewing events; A/B test results queryable in near real-time
Key lesson: Iceberg + Druid combination gives the best of both worlds — Iceberg for reliable batch reprocessing and schema evolution; Druid for sub-second dashboard serving
Lessons Learned Across All Case Studies
- Pre-aggregate aggressively: Across all four companies, 80%+ of queries hit materialized views or pre-computed rollups — raw table scans are the exception, not the rule
- Schema design is critical: The ORDER BY / sort key in ClickHouse (and segment granularity in Druid) affects query performance by 10× or more; changing it requires full data reingestion
- Kafka retention = safety net: Keep a minimum of 7 days of Kafka retention (or use Tiered Storage for unlimited retention) to allow replaying the speed layer without accessing the data lake — this has saved multiple production incidents
- Start with Kappa, graduate to Lambda: Most teams start with a single Flink pipeline (Kappa) and add a Spark batch layer (Lambda) only when reprocessing accuracy becomes a measurable business requirement
- Monitor data freshness as an SLO: Treat end-to-end data latency (event time to dashboard visibility) as a first-class SLO with alerting; silent lag is the most dangerous failure mode in analytics systems