Designing a Real-Time Analytics Platform at Scale: ClickHouse, Apache Druid, Lambda Architecture & Streaming Analytics
Building an analytics platform that ingests 10 billion events per day while delivering sub-second query responses requires careful architectural decisions. This comprehensive guide walks through Lambda vs Kappa architecture trade-offs, ClickHouse vs Apache Druid vs Pinot comparisons, Apache Flink stream processing, and the materialized view strategies that power dashboards at scale.
TL;DR — Architecture Decision in One Paragraph
"Use Lambda architecture for combined batch + streaming analytics. ClickHouse for sub-second OLAP queries. Apache Druid for sub-second slice-and-dice. Apache Flink for stateful stream processing. Kafka as the unified event backbone. Pre-aggregate with materialized views for dashboard queries."
Table of Contents
- Requirements & Scale Estimation
- Lambda vs Kappa Architecture Trade-offs
- Ingestion Layer: Kafka, Kinesis & Pulsar
- Stream Processing: Apache Flink at Scale
- Batch Layer: Spark & Data Lake Architecture
- OLAP Storage: ClickHouse vs Apache Druid vs Pinot
- Query Layer & API Design
- Materialized Views & Pre-Aggregation Strategies
- Reliability, Data Quality & SLOs
- Real-World Case Studies
1. Requirements & Scale Estimation
Before designing the system, we must define the scale, latency targets, and data retention requirements. Getting these wrong leads to over-engineered or under-provisioned infrastructure that is expensive to fix in production.
Functional Requirements
- Event ingestion: Ingest events from web, mobile, and IoT devices across multiple tenants
- Real-time dashboards: Dashboard queries with <5 second data freshness end-to-end
- Ad-hoc queries: Sub-second response time for pre-aggregated queries; <10 seconds for ad-hoc exploration
- Historical analysis: Query historical data up to 3 years back with consistent SQL interface
- Alerting: Threshold-based alerting on metric values with <30 second detection latency
Non-Functional Requirements
- Throughput: 10 billion events per day sustained, with 3× burst capacity
- P99 query latency: <1 second for pre-aggregated dashboard queries
- P99 ad-hoc latency: <10 seconds for exploratory queries on raw data
- Ingestion availability: 99.99% uptime for the ingestion pipeline
- Data retention: Hot tier (30 days in OLAP), warm tier (1 year in data lake), cold tier (3 years in object storage)
Scale Math & Capacity Estimation
Let's work through the numbers systematically before choosing technology:
- Events per second: 10B ÷ 86,400 seconds = ~115,740 events/second (peak with 3× = ~347K events/sec)
- Average event size: 500 bytes (JSON with typical web analytics fields)
- Daily ingestion volume: 10B × 500B = 5 TB/day raw data
- Monthly storage (uncompressed): 5 TB × 30 = 150 TB/month
- With 10:1 compression ratio: 15 TB/month compressed (ClickHouse LZ4/ZSTD typical)
- 3-year retention: 15 TB × 36 months = ~540 TB compressed total storage
- Kafka throughput needed: 115K events/sec × 500B = ~57 MB/sec; buffer 3× = 171 MB/sec across all partitions
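The arithmetic above is worth sanity-checking in code before committing to hardware. A short script, using only the assumptions already stated (500-byte events, 3× burst, 10:1 compression):

```python
# Back-of-envelope capacity estimation for the stated workload.
EVENTS_PER_DAY = 10_000_000_000
EVENT_SIZE_BYTES = 500        # assumed average JSON web-analytics event
BURST_FACTOR = 3
COMPRESSION_RATIO = 10        # assumed ClickHouse LZ4/ZSTD ratio
RETENTION_MONTHS = 36

events_per_sec = EVENTS_PER_DAY / 86_400
peak_events_per_sec = events_per_sec * BURST_FACTOR
daily_tb = EVENTS_PER_DAY * EVENT_SIZE_BYTES / 1e12        # raw TB/day
monthly_tb_compressed = daily_tb * 30 / COMPRESSION_RATIO  # TB/month in OLAP
total_tb = monthly_tb_compressed * RETENTION_MONTHS        # 3-year footprint
kafka_mb_per_sec = events_per_sec * EVENT_SIZE_BYTES / 1e6

print(f"~{events_per_sec:,.0f} ev/s steady, ~{peak_events_per_sec:,.0f} ev/s peak")
print(f"{daily_tb:.0f} TB/day raw -> {monthly_tb_compressed:.0f} TB/month compressed")
print(f"~{total_tb:.0f} TB total over {RETENTION_MONTHS} months")
```

Running this reproduces the figures above (~115,741 ev/s, 5 TB/day raw, 540 TB over three years), which is what the cluster sizing below is provisioned against.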
| Component | Count | Spec | Purpose |
|---|---|---|---|
| Kafka Brokers | 9 | 32 vCPU, 128 GB RAM, 10 TB SSD | Event backbone, 7-day retention |
| Flink TaskManagers | 20 | 16 vCPU, 64 GB RAM | Stream processing, windowed aggregations |
| ClickHouse Nodes | 12 | 32 vCPU, 256 GB RAM, 20 TB NVMe | OLAP hot tier, sub-second queries |
| Query Servers | 6 | 8 vCPU, 32 GB RAM | REST/GraphQL API, caching, routing |
2. Lambda vs Kappa Architecture Trade-offs
The architectural pattern you choose determines system complexity, operational burden, and your ability to correct historical data. Both Lambda and Kappa have legitimate use cases — the decision depends on your team size, accuracy requirements, and reprocessing frequency.
Lambda Architecture
Lambda architecture processes data through two parallel paths that merge at the serving layer:
- Speed layer: Apache Flink processes events in real-time and writes incremental results to the serving layer with low latency
- Batch layer: Apache Spark reprocesses all raw data on a schedule (hourly or daily) for accuracy correction and historical backfill
- Serving layer: ClickHouse or Druid merges batch + speed layer results, preferring batch results when available
- Pros: Highly accurate historical reprocessing, battle-tested at Hadoop-era scale, clear separation of concerns
- Cons: Maintaining two code paths (Flink streaming logic + Spark batch logic), data consistency challenges at merge time, higher operational complexity
Kappa Architecture
Kappa architecture simplifies by using a single streaming path for both real-time processing and historical reprocessing:
- Single streaming layer: Apache Flink handles both real-time event processing and full historical reprocessing
- Kafka as infinite log: Extended retention (7–30 days) or S3-backed Kafka (Tiered Storage) allows replaying from any offset
- Reprocessing strategy: Deploy new Flink job version alongside old one, replay Kafka from beginning, cut over serving layer when new job catches up
- Pros: Single codebase, simplified operations, faster iteration on processing logic changes
- Cons: Full reprocessing requires re-reading entire Kafka topic which is compute-intensive; long reprocessing windows (hours/days) for 3-year history
| Factor | Lambda | Kappa |
|---|---|---|
| Complexity | High (two code paths) | Lower (one code path) |
| Reprocessing | Accurate, efficient Spark batch | Replay from Kafka / data lake |
| Team size fit | Large teams with dedicated streaming + batch eng | Smaller teams, faster iteration |
| Latency profile | Dual: real-time + batch correction | Stream-first, eventual consistency |
| Use when | Accuracy is critical, frequent backfills | Speed-to-market, simpler ops |
3. Ingestion Layer: Kafka, Kinesis & Pulsar
The ingestion layer is the system's foundation. Every downstream component depends on it being reliable, ordered (within partitions), and fast. At 115K+ events/second, small design mistakes become catastrophic at scale.
Kafka Partitioning Strategy
Partition key choice affects ordering guarantees and downstream join performance:
- Partition by tenant_id: Keeps all events for a tenant in order; simplifies per-tenant aggregations in Flink
- Partition by user_id: Enables user session tracking with ordered events; better for user-centric analytics
- Compound key (tenant_id + event_type): Balances load while keeping analytically related events co-located
- Kafka producer tuning: `batch.size=65536`, `linger.ms=5`, `compression.type=lz4` for throughput optimization
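A sketch of the compound-key routing described above. Kafka's default Java partitioner hashes keys with murmur2; the stdlib hash here is an illustrative stand-in, since the only property that matters for this discussion is that equal keys always map to the same partition:

```python
import hashlib

def partition_for(tenant_id: str, event_type: str, num_partitions: int) -> int:
    # Compound key keeps a tenant's analytically related events co-located
    # while spreading different event types across partitions for balance.
    key = f"{tenant_id}:{event_type}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Equal keys always land on the same partition, which is what preserves
# per-key ordering for downstream windowed aggregations.
p1 = partition_for("acme-corp", "page_view", 64)
p2 = partition_for("acme-corp", "page_view", 64)
assert p1 == p2 and 0 <= p1 < 64
```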
Event Schema (Avro with Schema Registry)
{
"event_id": "uuid-v4",
"tenant_id": "acme-corp",
"user_id": "user-12345",
"event_type": "page_view",
"timestamp": "2026-04-11T10:30:00Z",
"properties": {
"page": "/dashboard",
"duration_ms": 1250,
"referrer": "google.com"
},
"session_id": "sess-xyz",
"ip_address": "192.168.1.1"
}
Kafka vs Kinesis vs Pulsar
| Feature | Apache Kafka | Amazon Kinesis | Apache Pulsar |
|---|---|---|---|
| Max throughput | GB/sec per broker | 1 MB/sec per shard | GB/sec per broker |
| Retention | Configurable (days–infinite) | Up to 365 days | Tiered (unlimited) |
| Ops burden | High (self-managed) | Low (managed) | High (self-managed) |
| Multi-tenancy | Namespace isolation | Stream-level | Native multi-tenancy |
| Best for | High-throughput, Kappa arch | AWS-native, serverless | Multi-region, geo-replication |
Dead Letter Queue & Idempotency
- Dead letter queue (DLQ): Route malformed or schema-invalid events to a separate Kafka topic for analysis and reprocessing without blocking the main pipeline
- Idempotency: Consumers use `event_id` (UUID v4) as the deduplication key; check against a Redis bloom filter before writing to OLAP storage to prevent duplicate aggregations
- Schema Registry: Confluent Schema Registry enforces Avro schema compatibility (BACKWARD, FORWARD, FULL) to prevent breaking schema changes
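Putting the DLQ and idempotency rules together, a minimal consumer sketch (an in-memory set stands in for the Redis bloom filter, and plain lists stand in for the OLAP writer and DLQ topic):

```python
REQUIRED = ("event_id", "tenant_id", "timestamp")
seen = set()              # stand-in for the Redis bloom filter
olap_sink, dlq = [], []   # stand-ins for the OLAP writer and DLQ topic

def handle(event: dict) -> None:
    # Schema/null validation: malformed events are routed to the DLQ
    # with error details instead of blocking the main pipeline.
    missing = [f for f in REQUIRED if not event.get(f)]
    if missing:
        dlq.append({"event": event, "error": f"missing fields: {missing}"})
        return
    # Idempotency: drop duplicate event_ids before the OLAP write
    # (log + discard in production rather than erroring).
    if event["event_id"] in seen:
        return
    seen.add(event["event_id"])
    olap_sink.append(event)

handle({"event_id": "e1", "tenant_id": "acme", "timestamp": "2026-04-11T10:30:00Z"})
handle({"event_id": "e1", "tenant_id": "acme", "timestamp": "2026-04-11T10:30:00Z"})  # duplicate
handle({"tenant_id": "acme"})  # malformed, lands in the DLQ
```

Note that a real bloom filter can report false positives, so this design occasionally discards a genuinely new event; for analytics counters that is usually an acceptable trade against double-counting.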
4. Stream Processing: Apache Flink at Scale
Apache Flink is the industry standard for stateful stream processing at scale. Its exactly-once semantics, rich windowing API, and SQL support make it the right choice for building the speed layer of a real-time analytics platform.
Window Types & When to Use Each
- Tumbling windows (1-minute): Non-overlapping fixed-size windows; ideal for per-minute metric aggregations like page views per minute, events per minute
- Sliding windows (5-minute window, 1-minute slide): Overlapping windows for moving averages; useful for trailing 5-minute averages shown on dashboards
- Session windows (gap-based): Dynamically sized windows closed when activity gap exceeds threshold; perfect for user session tracking with variable session lengths
- Global windows: Accumulate state indefinitely; used for all-time counters with custom triggers
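Tumbling-window assignment is simple enough to show directly. This sketch buckets event timestamps into 1-minute windows as the first bullet describes (Flink does this internally; the function here is purely illustrative):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(event_time_ms: int) -> int:
    # Tumbling windows are non-overlapping: every event falls in exactly
    # one window, aligned to the window size.
    return event_time_ms - (event_time_ms % WINDOW_MS)

counts = defaultdict(int)
for ts in (10_000, 59_999, 60_000, 61_500, 125_000):
    counts[window_start(ts)] += 1
# Windows [0, 60s) and [60s, 120s) each hold two events; [120s, 180s) holds one.
```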
Flink SQL for Real-Time Aggregations
SELECT
tenant_id,
event_type,
TUMBLE_START(rowtime, INTERVAL '1' MINUTE) as window_start,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users
FROM events
GROUP BY
tenant_id,
event_type,
TUMBLE(rowtime, INTERVAL '1' MINUTE)
Exactly-Once Semantics & State Backends
- Flink checkpointing: Enable distributed checkpoints with `execution.checkpointing.interval: 30s` and `execution.checkpointing.mode: EXACTLY_ONCE` for fault-tolerant processing
- RocksDB state backend: Use for large state (multiple GB per TaskManager); it spills to disk and supports incremental checkpoints to S3, making it effectively mandatory for session windows over millions of users
- HashMapStateBackend: Use for small state (<few GB) — in-memory, lower latency but no spill-to-disk; appropriate for short tumbling window aggregations
- Watermarks for late events: Configure allowed lateness of 30–60 seconds; emit results with watermark, update with late arrivals, use side outputs for very late events
- Backpressure handling: Monitor Flink metrics for backpressure indicators; scale TaskManagers horizontally or tune parallelism per operator when ingestion spikes occur
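The watermark rules above can be sketched as a small classifier. This is a simplification of Flink's semantics, not its API: events ahead of the watermark are on time, events within the allowed-lateness window trigger an updated result, and anything older goes to a side output:

```python
ALLOWED_LATENESS_MS = 60_000  # 60s allowed lateness, per the guidance above

def classify(event_ts_ms: int, watermark_ms: int) -> str:
    if event_ts_ms >= watermark_ms:
        return "on-time"            # window still open for this event
    if event_ts_ms >= watermark_ms - ALLOWED_LATENESS_MS:
        return "late-but-allowed"   # emits an updated window result
    return "very-late"              # routed to a side output

wm = 1_000_000
assert classify(1_000_500, wm) == "on-time"
assert classify(950_000, wm) == "late-but-allowed"
assert classify(900_000, wm) == "very-late"
```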
5. Batch Layer: Spark & Data Lake Architecture
In Lambda architecture, the batch layer provides the ground truth by reprocessing all raw data. Apache Spark running over a well-structured data lake handles this efficiently, correcting any errors from the speed layer while enabling long-horizon historical analysis.
Data Lake Zone Architecture
- Raw zone (S3/GCS): Immutable raw events written directly from Kafka consumers; no transformation; enables full reprocessing from source-of-truth
- Cleaned zone: Schema-validated, deduplicated, PII-masked data in Apache Parquet format; partitioned by year/month/day/hour for efficient time-range pruning
- Aggregated zone: Pre-computed Spark aggregations (hourly rollups, daily summaries) for batch serving and historical dashboard queries
Table Format: Delta Lake vs Apache Iceberg
| Feature | Delta Lake | Apache Iceberg |
|---|---|---|
| Schema evolution | Add/rename columns | Full schema evolution + type promotions |
| Time travel | Yes (transaction log) | Yes (snapshot-based) |
| Engine support | Spark-centric, Trino/Flink improving | Spark, Trino, Flink, Hive native |
| Best for | Databricks-centric stacks | Multi-engine open data lakes |
Spark Batch Job Patterns
- Hourly accuracy correction: Spark jobs run at H+15 to reprocess the previous hour, incorporating late-arriving events and correcting speed-layer approximations
- Daily compaction: Merge hundreds of small Parquet files written by Flink or Kafka consumers into larger files (512 MB target) for efficient query performance
- Partitioning strategy: Partition by `year=YYYY/month=MM/day=DD/hour=HH` with `tenant_id` as a secondary partition for multi-tenant isolation in scans
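The partition layout in the last bullet can be sketched as a path builder (a hypothetical helper for illustration, not part of any Spark API):

```python
from datetime import datetime, timezone

def partition_path(ts: datetime, tenant_id: str) -> str:
    # Time-first, Hive-style layout enables time-range pruning; tenant_id
    # as the secondary partition isolates multi-tenant scans.
    return (f"year={ts.year:04d}/month={ts.month:02d}/"
            f"day={ts.day:02d}/hour={ts.hour:02d}/tenant_id={tenant_id}")

path = partition_path(datetime(2026, 4, 11, 10, 30, tzinfo=timezone.utc), "acme-corp")
# -> year=2026/month=04/day=11/hour=10/tenant_id=acme-corp
```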
6. OLAP Storage: ClickHouse vs Apache Druid vs Pinot
The OLAP storage layer is where query performance lives. Your choice of ClickHouse, Apache Druid, or Apache Pinot determines the types of queries you can serve sub-second, the operational complexity you accept, and the cost structure at scale.
| Feature | ClickHouse | Apache Druid | Apache Pinot |
|---|---|---|---|
| Query latency | Sub-second | Sub-second | Sub-second |
| Ingestion mode | Batch + streaming | Native streaming | Native streaming |
| SQL support | Full ANSI SQL | Limited SQL | Limited SQL |
| Joins | Full joins supported | Limited joins | Limited joins |
| Data freshness | Seconds | Seconds | Seconds |
| Cardinality | High cardinality OK | HLL approximations | HLL approximations |
| Compression | 10:1 typical | 8:1 typical | 8:1 typical |
| Horizontal scaling | Sharding + replication | Tier-based (historical/real-time) | Tier-based |
| Best for | Ad-hoc SQL analytics | Time-series, dashboards | Pinpoint low-latency queries |
| Used by | Cloudflare, Yandex | Uber, Netflix | LinkedIn, Uber |
ClickHouse Table DDL: MergeTree Engine
The ORDER BY clause is the most critical design decision in ClickHouse — it becomes the primary index and determines which query patterns will be sub-second:
CREATE TABLE events (
event_id UUID,
tenant_id String,
event_type LowCardinality(String),
user_id String,
timestamp DateTime,
properties String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (tenant_id, event_type, timestamp)
SETTINGS index_granularity = 8192;
Key design choices: LowCardinality(String) for event_type (reduces memory by 3–5×), monthly partitioning for efficient data retention management, and ORDER BY starting with tenant_id to isolate scans per tenant.
7. Query Layer & API Design
The query layer abstracts the storage tier from consumers, enabling intelligent routing between materialized views (fast) and raw tables (flexible), caching to reduce OLAP load, and multi-tenancy enforcement at the API boundary.
REST API Design
- GET /api/v1/analytics/metrics: Dashboard query endpoint; accepts tenant_id, time_range, granularity, filters; returns pre-aggregated data from materialized views when available
- POST /api/v1/analytics/query: Ad-hoc query endpoint; accepts SQL-like query DSL or parameterized queries; routed to ClickHouse raw tables with 30-second timeout enforcement
- GET /api/v1/analytics/async/{query_id}: Async query result endpoint for long-running queries; pattern: submit → 202 Accepted → poll for results
Caching & Query Routing
- Redis caching layer: Cache dashboard query results with TTL=30 seconds for real-time data; TTL=5 minutes for historical queries; key = hash(tenant_id + query_params + time_bucket)
- Intelligent query routing: Detect if query can be served from materialized view (pre-defined aggregation granularities) vs requires raw table scan; materialized views serve ~80% of dashboard traffic
- Rate limiting: 1,000 queries per minute per tenant via Redis token bucket; burst of 100 queries/second for peak dashboard loads
- Query timeout enforcement: 30-second max for synchronous queries; async pattern for queries predicted to exceed 10 seconds based on query plan analysis
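A sketch of the cache key and TTL selection described above (the key format and names are illustrative, not a prescribed scheme):

```python
import hashlib
import json

REALTIME_TTL_S, HISTORICAL_TTL_S = 30, 300  # TTLs from the bullet above

def cache_entry(tenant_id: str, params: dict, time_bucket: str,
                is_realtime: bool) -> tuple[str, int]:
    # Deterministic key: identical tenant + params + time bucket always
    # hashes to the same Redis key, so repeat dashboard loads hit cache.
    payload = json.dumps({"t": tenant_id, "p": params, "b": time_bucket},
                         sort_keys=True).encode("utf-8")
    key = "q:" + hashlib.sha256(payload).hexdigest()[:16]
    ttl = REALTIME_TTL_S if is_realtime else HISTORICAL_TTL_S
    return key, ttl

k1, ttl = cache_entry("acme", {"metric": "page_views"}, "2026-04-11T10:30", True)
k2, _ = cache_entry("acme", {"metric": "page_views"}, "2026-04-11T10:30", True)
assert k1 == k2 and ttl == 30
```

Sorting the JSON keys before hashing is the important detail: without it, semantically identical queries could produce different cache keys and the hit rate collapses.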
GraphQL for Flexible Ad-Hoc Queries
GraphQL enables frontend teams to request exactly the metrics and dimensions they need without requiring backend changes for each new dashboard widget. Use persisted queries in production to prevent expensive arbitrary queries that could scan terabytes of data. Pair with query complexity limits (max depth = 5, max nodes = 50) to protect the OLAP tier.
8. Materialized Views & Pre-Aggregation Strategies
Pre-aggregation is the single highest-leverage optimization for analytics platform query performance. At 10B events/day, scanning raw events for every dashboard query is infeasible. Materialized views trade storage for query latency, and with 10:1 compression, the storage cost is manageable.
ClickHouse Materialized View: Hourly Rollup
CREATE MATERIALIZED VIEW events_hourly_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (tenant_id, event_type, hour)
AS SELECT
tenant_id,
event_type,
toStartOfHour(timestamp) AS hour,
countState() AS event_count,
uniqState(user_id) AS unique_users,
avgState(response_time_ms) AS avg_response_time
FROM events
GROUP BY tenant_id, event_type, hour;
Note the engine choice: SummingMergeTree correctly merges only plain numeric sums, so uniq() and avg() results would silently drift as background merges combine parts. Storing intermediate aggregate states (countState, uniqState, avgState) in an AggregatingMergeTree and reading them back with the -Merge combinators (countMerge, uniqMerge, avgMerge) keeps unique-user counts and averages correct.
Tiered Aggregation Strategy
| Tier | Granularity | Retention | Use case |
|---|---|---|---|
| Live | Second-level (Flink) | 1 hour | Real-time alerting, live counters |
| Minute | 1-minute rollup (ClickHouse MV) | 7 days | Short-range dashboard sparklines |
| Hour | 1-hour rollup (ClickHouse MV) | 90 days | Dashboard primary time series |
| Day | Daily rollup (Spark batch) | 3 years | Historical trend analysis, reports |
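One reasonable way to route queries across these tiers is to pick the finest granularity whose retention still covers the requested lookback; a sketch:

```python
from datetime import timedelta

# (tier, granularity, retention) mirroring the table above
TIERS = [
    ("live",   timedelta(seconds=1), timedelta(hours=1)),
    ("minute", timedelta(minutes=1), timedelta(days=7)),
    ("hour",   timedelta(hours=1),   timedelta(days=90)),
    ("day",    timedelta(days=1),    timedelta(days=3 * 365)),
]

def pick_tier(lookback: timedelta) -> str:
    # Route to the finest-grained tier whose retention covers the range.
    for name, _granularity, retention in TIERS:
        if lookback <= retention:
            return name
    raise ValueError("requested range exceeds 3-year retention")

assert pick_tier(timedelta(minutes=30)) == "live"
assert pick_tier(timedelta(days=30)) == "hour"
```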
Cache Invalidation Strategy
Event-driven cache busting using Flink: when a new minute-window closes, publish a cache invalidation event to a Kafka topic; query servers consume it and evict the relevant cache keys for the affected tenant_id + time_bucket combinations. This avoids polling-based TTL staleness while keeping cache hit rates above 70% for repeat dashboard loads.
9. Reliability, Data Quality & SLOs
An analytics platform that loses events silently is worse than one that fails loudly — customers make business decisions based on your data. Data quality must be enforced at every layer, and SLOs must be defined before the system is built, not after.
Data Quality Checks
- Schema validation: Validate against Avro schema at ingestion; reject malformed events to DLQ with error details for debugging
- Null checks: `event_id`, `tenant_id`, and `timestamp` are required fields; null values trigger DLQ routing
- Range checks: Reject events with timestamps >1 hour in the future or >7 days in the past as likely clock skew or replay attacks
- Anomaly detection: Alert when event rate for a tenant deviates >3 sigma from the rolling 7-day baseline; catches both data loss and sudden spike conditions
- Idempotent consumers: Check `event_id` against a Redis bloom filter before writing to OLAP; handle duplicates gracefully (log + discard) rather than erroring
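The range check above reduces to a few lines; a sketch of the timestamp validator:

```python
from datetime import datetime, timedelta, timezone

MAX_FUTURE = timedelta(hours=1)  # beyond this: likely clock skew
MAX_PAST = timedelta(days=7)     # beyond this: likely replay or stale batch

def timestamp_ok(event_ts: datetime, now: datetime) -> bool:
    # Events outside the accepted window are routed to the DLQ.
    return (now - MAX_PAST) <= event_ts <= (now + MAX_FUTURE)

now = datetime(2026, 4, 11, 12, 0, tzinfo=timezone.utc)
assert timestamp_ok(now - timedelta(days=1), now)
assert not timestamp_ok(now + timedelta(hours=2), now)
assert not timestamp_ok(now - timedelta(days=8), now)
```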
SLO Definitions
Production SLOs
- Ingestion availability: 99.99% measured over rolling 30 days (~4.3 minutes downtime/month)
- P99 dashboard load time: <2 seconds for pre-aggregated queries
- Data freshness: <5 seconds end-to-end for real-time dashboard data
- Query success rate: >99.9% of queries return results within timeout
- Data completeness: <0.01% event loss rate measured via end-to-end reconciliation
- Error budget policy: Freeze all non-critical deployments when monthly error budget is 50% consumed; incident review required at 75% consumed
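The error budget policy is easy to operationalize in code; a sketch of the 30-day budget math behind the 99.99% target:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the rolling window

def error_budget_minutes(slo: float) -> float:
    # Downtime allowed by an availability SLO over rolling 30 days.
    return MINUTES_PER_30_DAYS * (1 - slo)

budget = error_budget_minutes(0.9999)  # ~4.32 minutes/month at 99.99%
consumed_pct = 100 * 2.5 / budget      # e.g. 2.5 min of downtime so far
# consumed_pct is ~58%: past the 50% threshold, so the policy above
# freezes all non-critical deployments.
```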
Disaster Recovery
- Multi-AZ Kafka: 3 replicas across availability zones; automatic leader election on broker failure with no data loss
- ClickHouse replication: 2 replicas per shard via ClickHouse Keeper; reads can be served from either replica during failures
- Data lake as recovery source: Raw events in S3 with versioning enabled; Flink or Spark can replay any time range for OLAP tier reconstruction
- RTO/RPO targets: RTO = 30 minutes for OLAP tier recovery (replay last N hours from data lake); RPO = 0 for ingestion (Kafka durability guarantees)
10. Real-World Case Studies
The architectural patterns described above are not theoretical — they power the analytics infrastructure at some of the most data-intensive companies in the world. Here's what the published engineering literature reveals about their choices and lessons learned.
Pinterest — Apache Druid at 1.7 Trillion Rows
Scale: 1.7 trillion rows, 50K+ queries/day for ad analytics
Architecture: Kafka → Druid real-time nodes → historical nodes
Result: Sub-second P50 latency for ad campaign performance queries; enabled real-time budget pacing for advertisers
Key lesson: Druid's time-based partitioning (segments) makes time-range queries extremely efficient for ad analytics workloads
Uber — Apache Pinot at 600K+ Queries/Day
Scale: 600K+ queries/day across real-time driver supply/demand analytics
Architecture: Kafka → Pinot real-time tables; batch import for historical
Result: Real-time visibility into driver availability per geo-hash enabled dynamic surge pricing optimization
Key lesson: Pinot's upsert support enables real-time fact updates (e.g., trip status changes) without full reingestion
Cloudflare — ClickHouse at 6M Events/Second
Scale: 6+ million HTTP log events per second across global edge network
Architecture: Direct ClickHouse ingestion via the HTTP interface; ReplicatedMergeTree tables with ZooKeeper-coordinated replication
Result: Full SQL analytics on DNS and HTTP traffic; sub-second dashboards for security & DDoS detection
Key lesson: Full ANSI SQL support was the key ClickHouse differentiator — enabled ad-hoc security investigations without pre-defined query schemas
Netflix — Druid + Apache Iceberg for A/B Testing
Scale: Viewer engagement analytics across 200M+ subscribers; A/B test result dashboards
Architecture: Kafka → Flink → Iceberg data lake (truth) + Druid hot tier (dashboards)
Result: Recommendation model metrics available within 2 minutes of viewing events; A/B test results queryable in near real-time
Key lesson: Iceberg + Druid combination gives the best of both worlds — Iceberg for reliable batch reprocessing and schema evolution; Druid for sub-second dashboard serving
Lessons Learned Across All Case Studies
- Pre-aggregate aggressively: Across all four companies, 80%+ of queries hit materialized views or pre-computed rollups — raw table scans are the exception, not the rule
- Schema design is critical: The ORDER BY / sort key in ClickHouse (and segment granularity in Druid) affects query performance by 10× or more; changing it requires full data reingestion
- Kafka retention = safety net: Keep a minimum of 7 days of Kafka retention (or use Tiered Storage for unlimited retention) to allow replaying the speed layer without accessing the data lake — this has saved multiple production incidents
- Start with Kappa, graduate to Lambda: Most teams start with a single Flink pipeline (Kappa) and add a Spark batch layer (Lambda) only when reprocessing accuracy becomes a measurable business requirement
- Monitor data freshness as an SLO: Treat end-to-end data latency (event time to dashboard visibility) as a first-class SLO with alerting; silent lag is the most dangerous failure mode in analytics systems