Time-Series Database Design at Scale
Md Sanwar Hossain
Software Engineer · Java, Spring Boot, Kubernetes & AWS
System Design · March 22, 2026 · 18 min read · Distributed Systems · Failure Handling Series

Designing a Time-Series Database at Scale: LSM Trees, Delta Encoding & Long-Term Retention Policies

Time-series data is unlike any other kind of data your systems produce. It arrives in an unrelenting, high-velocity stream and demands storage engines with fundamentally different trade-offs from the relational databases most engineers reach for by default. This post dissects how purpose-built time-series databases solve the write amplification, compression, and retention challenges that make general-purpose databases buckle under real IoT and observability workloads.

Table of Contents

  1. Introduction
  2. Why Traditional Databases Fail for Time-Series Workloads
  3. LSM Tree Architecture: How TSDBs Store Data
  4. Delta Encoding, RLE, and Gorilla Compression
  5. Architecture and Code Examples
  6. Failure Scenarios and Trade-offs
  7. When NOT to Use a Time-Series Database
  8. Optimization Techniques for Production
  9. Key Takeaways
  10. Conclusion

1. Introduction

An IoT platform engineering team managing a fleet of 50,000 connected industrial sensors starts seeing alarming trends. Their PostgreSQL database, initially adequate for prototype scale, begins struggling as the sensor count grows. Each sensor emits 40 metrics per second — temperature, pressure, vibration, flow rate, and more. That is 2 million data points per second reaching the ingestion layer. The team notices write latency climbing from sub-millisecond to tens of milliseconds, and the dashboard that once loaded instantly now spins for seconds before rendering a single chart.

Queries that once returned instantly — "show me all pressure anomalies in pump array 7 over the last 6 hours" — begin timing out. The database's CPU pegs at 100% during peak hours, not from complex joins or poorly written queries but from simple time-range scans that should be trivial. The on-call rotation turns into a weekly ritual of emergency index rebuilds and VACUUM operations, with engineers spending more time firefighting the database than building features. Customer SLAs are starting to slip, and the operations team is losing confidence in the platform's reliability.

The team doesn't have a performance problem that can be solved with more RAM or faster SSDs. They have an architectural mismatch: they are using a general-purpose relational database to solve a specialized time-series storage problem. The characteristics of sensor data — monotonically increasing timestamps, high write throughput, low read selectivity, and predictable temporal decay in query frequency — are precisely the characteristics that relational databases were not designed to optimize for. Adding more indexes makes writes slower. Removing indexes makes queries slower. There is no tuning knob that makes a B-tree work well for both 2 million writes per second and sub-second range scans across billions of rows.

This post explores why that architectural mismatch is so severe, and how purpose-built time-series databases solve it with Log-Structured Merge (LSM) trees, delta encoding, Gorilla compression, time-partitioned compaction, and intelligent retention policies. Whether you are evaluating InfluxDB, TimescaleDB, Prometheus, or a managed TSDB service, understanding these underlying mechanisms will help you make better architectural decisions and avoid the class of problems this IoT team ran into.

2. Why Traditional Databases Fail for Time-Series Workloads

The core issue is write amplification caused by B-tree index rebalancing. Relational databases use B-tree structures to maintain sorted indexes. Every insert into an indexed table potentially triggers page splits, node rebalancing, and cascading updates throughout the tree. For time-series data — where inserts arrive at a ferocious and continuous rate — this creates constant write amplification: one logical write translates to multiple physical writes as the index reorganizes itself to maintain sort order. At 2 million writes per second, the IO subsystem is performing 5 to 10 times the logical write volume in actual physical IO operations, exhausting both disk bandwidth and CPU cycles that should be serving reads.

Index bloat compounds this problem over time. A B-tree index on a compound column like (device_id, timestamp) across billions of rows consumes tens of gigabytes of disk space and requires expensive ongoing maintenance. PostgreSQL's VACUUM process must continuously reclaim dead tuple space left behind by MVCC updates. Under high write throughput, VACUUM cannot keep pace, causing table bloat that further degrades scan performance. The index that was meant to speed up queries begins to drag down every insert, and the autovacuum processes designed to help become a source of contention themselves.

Compression efficiency is also dramatically worse in relational databases. A row containing (device_id VARCHAR, timestamp TIMESTAMPTZ, metric_name VARCHAR, value FLOAT8) might occupy 80 to 120 bytes of storage including tuple header overhead, alignment padding, and MVCC metadata. The same logical data encoded with delta compression and shared schemas takes 2 to 3 bytes per value. That is a 40x compression difference — which translates directly into proportional differences in storage cost, cache hit rates, and IO throughput. A TSDB that fits its working dataset in the OS page cache will serve queries from memory; a relational database with 40x more data on disk will not.

Finally, time-series query patterns are fundamentally misaligned with B-tree optimization strategies. Time-series queries are almost always time-range scans: "give me all values between T1 and T2 for device X." B-tree indexes are optimized for point lookups and prefix scans, not for returning large contiguous ranges of sequentially inserted data. A specialized time-series index, organized so that data for a given device and time window is physically co-located on disk, can serve the same range scan with a fraction of the IO operations a B-tree requires.

Real scenario: A fintech team running PostgreSQL for market tick data found (via pg_relation_size) that their primary time-series index consumed 3.2TB of disk while the actual table data was 800GB. The index was 4x the size of the data it indexed. Their VACUUM processes ran continuously but could never keep up with the write rate, causing HOT (Heap Only Tuple) update chains to grow unbounded, further degrading both read and write performance.

3. LSM Tree Architecture: How TSDBs Store Data

Log-Structured Merge-trees (LSM trees) solve write amplification by converting random writes into sequential writes, which are an order of magnitude faster on both SSDs and spinning disks. Incoming data is first written to an in-memory buffer called the memtable — a sorted data structure that accumulates recent writes. Simultaneously, each write is appended to a Write-Ahead Log (WAL) on disk, ensuring durability in case of a crash before the memtable is persisted. The WAL write is sequential and fast; there is no index to maintain and no rebalancing to perform. When the memtable reaches a configurable size threshold, it is flushed to disk as an immutable Sorted String Table (SSTable). SSTables are never modified after creation — updates and deletes are handled by writing new entries with newer timestamps, not by in-place modification of existing files.
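The write path described above — WAL append, memtable buffering, flush to an immutable sorted file — can be sketched in a few dozen lines of Python. This is a toy illustration, not any real engine's code; the names (`MiniLSM`, `flush_threshold`) and the JSON file format are assumptions chosen for readability:

```python
import json
import os
import tempfile

class MiniLSM:
    """Toy LSM write path: WAL append + sorted in-memory buffer + SSTable flush."""

    def __init__(self, data_dir, flush_threshold=4):
        self.data_dir = data_dir
        self.flush_threshold = flush_threshold   # entries per memtable
        self.memtable = {}                       # key -> value, sorted at flush time
        self.sstables = []                       # paths of immutable flushed files
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")

    def put(self, key, value):
        # 1. Sequential WAL append for durability — no index to maintain,
        #    no rebalancing to perform.
        self.wal.write(json.dumps([key, value]) + "\n")
        self.wal.flush()
        # 2. Buffer the write in the memtable.
        self.memtable[key] = value
        # 3. Flush to an immutable, sorted SSTable once the threshold is hit.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable_{len(self.sstables):04d}.json")
        with open(path, "w") as f:
            # Keys are written in sorted order -> a Sorted String Table.
            json.dump(sorted(self.memtable.items()), f)
        self.sstables.append(path)
        self.memtable = {}                       # never modify flushed files

db = MiniLSM(tempfile.mkdtemp())
for ts in range(8):
    db.put(f"pump-007:{ts}", 72.4 + ts * 0.1)
print(len(db.sstables))  # 8 writes / threshold 4 -> 2 flushed SSTables
```

Note that `_flush` sorts only the small in-memory batch; the expensive global sort order is achieved later by merging SSTables during compaction.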

Time-series databases exploit an additional optimization that general-purpose LSM stores like RocksDB or LevelDB cannot: time-ordering. Because sensor data almost always arrives in timestamp order, TSDBs use time-partitioned compaction rather than level-based compaction. Time-partitioned compaction groups SSTables by time window — one SSTable per hour, or one per day depending on write volume — and merges only within the same time window. This keeps individual compaction jobs small and bounded in scope, avoiding the "compaction storm" problem where a single large compaction job must merge gigabytes of data across the entire key space, blocking reads and writes for minutes at a time.
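Time-partitioned compaction planning reduces to grouping SSTables by time window and merging only within a window. A minimal sketch, assuming each SSTable's min/max timestamps are available from its metadata (as they are in real engines):

```python
from collections import defaultdict

def plan_compactions(sstables, window_secs=3600):
    """Group SSTables by time window so each merge job stays small and bounded.

    `sstables` is a list of (min_ts, max_ts, path) tuples. Assumption: each
    SSTable falls entirely inside one window, which holds when memtables are
    cut at window boundaries.
    """
    by_window = defaultdict(list)
    for min_ts, max_ts, path in sstables:
        by_window[min_ts // window_secs].append(path)
    # Only windows holding more than one SSTable need a merge job — so no
    # single job ever spans the whole key space.
    return {w: paths for w, paths in by_window.items() if len(paths) > 1}

tables = [
    (0,    1799, "a"), (1800, 3599, "b"),   # hour 0: two tables -> one merge job
    (3600, 5399, "c"),                      # hour 1: single table, nothing to do
]
print(plan_compactions(tables))  # {0: ['a', 'b']}
```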

For the read path, TSDBs add block indices and Bloom filters to each SSTable to minimize unnecessary IO. A block index is a lightweight in-memory structure that maps time ranges to byte offsets within the SSTable. When serving a query, the storage engine performs a binary search on the block index to jump directly to the relevant byte offset, avoiding a full file scan. A Bloom filter answers the question "does this SSTable contain any data for device X?" in O(1) time with a configurable false-positive rate, typically 1%. Without Bloom filters, serving a query might require opening dozens of SSTables; with them, the engine can skip all SSTables that definitely do not contain relevant data.
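A Bloom filter like the ones attached to each SSTable can be sketched with a plain integer as the bit array. The sizes and hash scheme here are illustrative, not tuned for a production 1% false-positive target:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: answers 'possibly present' or 'definitely absent'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key):
        # Derive k independent bit positions from one key.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # All bits set -> possibly present; any bit clear -> definitely absent.
        return all(self.bits & (1 << p) for p in self._positions(key))

# One filter per SSTable lets the engine skip files that definitely
# lack data for the queried device.
bf = BloomFilter()
bf.add("pump-007")
print(bf.might_contain("pump-007"))  # True
print(bf.might_contain("pump-999"))  # almost certainly False
```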

Together, these structures let a TSDB serve a query like "last 1 hour of data for device 7" by reading only the 2 to 3 SSTables covering the relevant time window and jumping directly to the correct byte offset within each. The result is predictable, low-latency reads even as the total dataset grows to petabyte scale, because query IO is bounded by time range width rather than total dataset size. The parallel compaction workers that LSM-based TSDBs use to merge SSTables are a classic producer-consumer concurrency problem. In JVM-based TSDB implementations, the scoped lifecycle guarantees of Java Structured Concurrency are well-suited to managing compaction job trees — each compaction scope owns its worker threads and cancels them cleanly if the job is aborted.

4. Delta Encoding, RLE, and Gorilla Compression

Delta encoding is the first and most impactful compression technique for time-series data. Instead of storing absolute values, you store the difference between consecutive values. For timestamps, which increment by a fixed interval in well-behaved sensor data, the delta is nearly constant — often exactly 1000ms or 5000ms for sensors reporting at 1Hz or 0.2Hz respectively. Delta-of-delta encoding takes this further: store the difference of the differences. A perfectly periodic signal has delta-of-delta equal to zero for every point after the first. This compresses a long sequence of timestamps into just the first value, the first delta, and a stream of zeros — a dramatic reduction in storage. Even for slightly irregular sensor emission rates, the delta-of-delta values are small integers that compress far better than raw 64-bit timestamps.

For metric values themselves, delta encoding works well for slowly changing sensors. A temperature sensor hovering around 72.4°C generates tiny deltas (0.1, -0.2, 0.1, 0.0) rather than full 64-bit IEEE 754 doubles. Run-Length Encoding (RLE) handles the case where a sensor goes offline or sends a flat-line signal. Instead of storing the same value 3,600 times — once per second for an hour — RLE stores a tuple of (value=0.0, count=3600), compressing one hour of data into two numbers. This is critical for IoT platforms where connectivity issues cause sensors to report the same last-known value repeatedly for extended periods. Combined, delta encoding and RLE can achieve 10 to 15x compression on typical IoT sensor data compared to raw storage.
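RLE itself is only a few lines. A minimal sketch of the encode/decode pair, using the flat-lined-sensor example above:

```python
def rle_encode(values):
    """Run-Length Encoding: collapse runs of identical values into (value, count)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    return [v for v, count in runs for _ in range(count)]

# A sensor flat-lining at 0.0 for an hour compresses to a single run:
flatline = [0.0] * 3600
print(rle_encode(flatline))                       # [(0.0, 3600)]
assert rle_decode(rle_encode(flatline)) == flatline
```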

Facebook's Gorilla compression algorithm, used in Prometheus, InfluxDB, and many other TSDBs, applies XOR-based compression to floating-point values. The insight is that consecutive floating-point values in a time series tend to share the same sign bit, exponent, and upper mantissa bits — because real-world metrics change gradually rather than jumping between unrelated values. XOR-ing consecutive values produces a small result where the leading and trailing bits are zero. Gorilla encodes these leading and trailing zero counts along with the meaningful bits in the middle as variable-length bit strings, exploiting the structure that physically adjacent values in a time series are semantically similar.

The practical results of Gorilla compression are striking. Facebook's published benchmarks report approximately 1.37 bytes per data point on typical monitoring data — covering the complete (timestamp, value) pair that would occupy 16 bytes stored naively, a compression ratio of close to 12x. At that rate, a one-second resolution sensor stream that would require 16 bytes per sample in a relational database compresses to under 2 bytes per sample — making it economically viable to retain high-resolution raw data for days or weeks before downsampling to coarser rollups.

5. Architecture and Code Examples

The following examples ground the theory in concrete implementation patterns. We begin with Python pseudocode illustrating delta-of-delta timestamp encoding and XOR float compression, then show production-ready InfluxDB and TimescaleDB configurations for a real IoT sensor fleet. Understanding these patterns helps you reason about what your TSDB is doing internally and how to configure it for optimal performance.

The key insight across all these examples is that the encoding and storage layers are inseparable. The way you model your data — tag cardinality, field naming, chunk intervals — directly determines which compression paths the engine can apply and how efficiently it can serve your most common query patterns. Design your schema and write path together.

# Delta-of-delta encoding for timestamps
def encode_timestamps(timestamps):
    if not timestamps:
        return []
    result = [timestamps[0]]          # Store first timestamp absolutely
    if len(timestamps) == 1:
        return result                 # Nothing to delta against yet
    first_delta = timestamps[1] - timestamps[0]
    result.append(first_delta)        # Store first delta
    prev_delta = first_delta
    for i in range(2, len(timestamps)):
        delta = timestamps[i] - timestamps[i-1]
        delta_of_delta = delta - prev_delta
        result.append(delta_of_delta) # Usually 0 for periodic sensors
        prev_delta = delta
    return result

# XOR-based float compression (Gorilla-style)
import struct

def xor_compress_floats(values):
    if not values:
        return []
    prev_bits = struct.unpack('<Q', struct.pack('<d', values[0]))[0]
    compressed = [values[0]]         # First value stored raw
    for v in values[1:]:
        curr_bits = struct.unpack('<Q', struct.pack('<d', v))[0]
        xor_val = prev_bits ^ curr_bits
        # Similar consecutive values share sign, exponent, and upper-mantissa
        # bits, so xor_val has long runs of leading/trailing zeros. A full
        # Gorilla encoder emits only the zero counts plus the meaningful
        # middle bits; we keep the raw XOR result here for clarity.
        compressed.append(xor_val)
        prev_bits = curr_bits
    return compressed

# InfluxDB line protocol: measurement,tag_key=tag_value field_key=value timestamp
sensor_metrics,device_id=pump-007,location=plant-A temperature=72.4,pressure=4.2,vibration=0.03 1711065600000000000
sensor_metrics,device_id=pump-007,location=plant-A temperature=72.5,pressure=4.1,vibration=0.02 1711065601000000000

# Flux query: downsample to 5-minute averages for last 24 hours
from(bucket: "iot_raw")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "sensor_metrics" and r.device_id == "pump-007")
  |> aggregateWindow(every: 5m, fn: mean, createEmpty: false)
  |> to(bucket: "iot_5min_rollup")

-- Create hypertable partitioned by time (1-day chunks)
SELECT create_hypertable('sensor_metrics', 'time',
    chunk_time_interval => INTERVAL '1 day',
    if_not_exists => TRUE);

-- Enable compression on chunks older than 7 days
ALTER TABLE sensor_metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('sensor_metrics', INTERVAL '7 days');

-- Continuous aggregate: 1-hour rollup
CREATE MATERIALIZED VIEW sensor_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    device_id,
    avg(temperature)            AS avg_temp,
    max(pressure)               AS max_pressure,
    avg(vibration)              AS avg_vibration
FROM sensor_metrics
GROUP BY bucket, device_id;

-- Refresh policy: keep aggregate up-to-date
SELECT add_continuous_aggregate_policy('sensor_hourly',
    start_offset => INTERVAL '2 hours',
    end_offset   => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

# Retention policies via the influx CLI — one bucket per downsampling tier
influx bucket create --name iot_raw      --retention 7d   --org my-org
influx bucket create --name iot_5min     --retention 90d  --org my-org
influx bucket create --name iot_hourly   --retention 365d --org my-org
influx bucket create --name iot_daily    --retention 0    --org my-org  # infinite

# Downsampling task: raw -> 5-min rollup (runs every 5 minutes)
option task = {name: "downsample_5min", every: 5m}
from(bucket: "iot_raw")
  |> range(start: -10m)
  |> aggregateWindow(every: 5m, fn: mean)
  |> to(bucket: "iot_5min")

For a Java service that ingests metrics and fans out writes to multiple TSDB shards in parallel, StructuredTaskScope.ShutdownOnFailure provides exactly the right semantics: if any shard write fails, cancel the remaining writes and propagate the error to the caller immediately, avoiding partial writes that would create inconsistent state across shards.

6. Failure Scenarios and Trade-offs

Compaction storms are the most common failure mode in production LSM-based TSDBs. When the compaction process cannot keep pace with ingestion — due to insufficient IO bandwidth, too few compaction threads, or a sudden burst of writes — the count of uncompacted SSTables grows. Each additional SSTable increases read amplification: a query for "device 7 from T1 to T2" must now check 20 SSTables instead of 3, performing 20 Bloom filter checks and potentially 20 index lookups. In the worst case, a sustained write burst followed by a compaction backlog causes read latency to spike 10 to 100x while the compaction workers catch up. Mitigation strategies include tuning compaction thread counts to match available IO parallelism, using time-partitioned compaction to bound the scope of each job, and monitoring SSTable count per time window as a key operational metric that triggers alerts before reads degrade.

The hot partition problem is subtle but severe at scale. Modern TSDBs partition data by time, and the "current time" chunk receives all incoming writes. This creates a natural hot partition — one chunk gets 100% of write traffic while all historical chunks are cold. In a distributed TSDB cluster, this means one storage node absorbs the entire write load regardless of how many nodes you add, effectively making horizontal scaling useless for write throughput. The fix is to partition by both time and a shard key derived from device identity, distributing current-time writes across multiple nodes. The shard key must be chosen carefully: too coarse and you recreate the hot partition, too fine and single-device queries must fan out to many nodes.

Tombstone accumulation is a trap for teams that implement row-level deletes in their TSDB. LSM-based storage handles deletes by writing tombstone markers — new entries indicating that a key should be considered deleted. These tombstones occupy space and slow down reads because the query engine must filter them. They are only reclaimed during compaction after all SSTables containing the original values have been merged with the tombstone. In a time-series context, where deletion is typically used for GDPR right-to-erasure requests or test data cleanup, tombstones can linger for days. The correct pattern is to use retention policies rather than row-level deletes: retention policies drop entire SSTable files at the block level, which is instantaneous and free of overhead.

Replication lag with out-of-order writes is a subtle but critical edge case in replicated TSDB deployments. Replication lag can cause writes to arrive out of chronological order at replica nodes. For delta-encoded storage, this is a serious correctness problem: if values arrive out of sequence, the delta chain is broken and decompression either fails or produces garbage values that silently corrupt metrics. TSDBs handle this with a configurable out-of-order write window, typically one to five minutes, that buffers late-arriving writes and re-orders them before encoding. Writes arriving outside this window are rejected with an explicit error code, requiring the client to implement a replay strategy. Teams must design their write path to handle these rejections gracefully, with retry logic and a dead-letter queue for writes that arrive too late.
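The out-of-order window can be modeled as a small reordering buffer sitting in front of the encoder. This is an illustrative sketch — `OutOfOrderBuffer` and `drain_sealed` are invented names, not any TSDB's actual API:

```python
import heapq

class OutOfOrderBuffer:
    """Buffer late-arriving points within a bounded window; reject older ones."""

    def __init__(self, window_secs=300):
        self.window = window_secs
        self.heap = []                  # min-heap of (timestamp, value)
        self.max_seen = float("-inf")   # newest timestamp accepted so far

    def write(self, ts, value):
        if ts < self.max_seen - self.window:
            return False                # too late: caller retries or dead-letters it
        self.max_seen = max(self.max_seen, ts)
        heapq.heappush(self.heap, (ts, value))
        return True

    def drain_sealed(self):
        """Pop points old enough that no earlier write can still be accepted."""
        sealed = []
        while self.heap and self.heap[0][0] <= self.max_seen - self.window:
            sealed.append(heapq.heappop(self.heap))
        return sealed                   # in timestamp order, safe to delta-encode

buf = OutOfOrderBuffer(window_secs=300)
assert buf.write(1000, 72.4)
assert buf.write(990, 72.3)      # 10s late: inside the window, accepted
assert not buf.write(600, 71.9)  # 400s late: outside the window, rejected
print(buf.drain_sealed())        # [] — nothing is sealed until newer data arrives
```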

Compaction storms often manifest as periodic latency spikes that are extremely difficult to correlate with application-level events in dashboards. Instrument your TSDB with the following metrics: compaction queue depth, bytes pending compaction, and read amplification factor. A read amplification factor above 10 is a clear signal that compaction is lagging significantly and immediate action is required before user-facing query latency degrades visibly.

7. When NOT to Use a Time-Series Database

If your time-series data has low write cardinality and you need complex joins with other business entities — for example, "show me sensor anomalies alongside the maintenance ticket that was open at that time, joined against the engineer who was on shift" — a relational database or a hybrid approach may be more appropriate. TSDBs are optimized exclusively for pure time-series ingestion and retrieval. Multi-entity joins requiring foreign key traversal, window functions over non-temporal dimensions, and aggregate queries that blend sensor readings with relational records are all weak points. Many production teams end up with a hybrid architecture: a TSDB for raw metrics ingestion and storage, and a relational database for metadata, tickets, and business context, with application-layer joins at query time.

Event logs with high textual content — application logs, audit trails, security events, user activity streams — are better served by columnar stores like ClickHouse or log aggregation platforms like OpenSearch and Elasticsearch. TSDBs are purpose-built for numeric metrics with simple tag-based dimensionality. Full-text search, inverted index queries, and multi-field filtering across semi-structured log payloads are not capabilities that TSDBs provide efficiently. Trying to stuff application logs into InfluxDB as string fields produces poor compression ratios and query performance compared to a store designed for that workload.

Transactional data requiring ACID guarantees across multiple entities does not belong in a TSDB. If your write operation requires "record the order AND record the associated sensor reading AND commit both atomically with rollback on failure," a TSDB cannot help — it provides no cross-entity transaction semantics. The correct pattern is to write the transactional data to your RDBMS within a database transaction, and then asynchronously forward a copy of the sensor reading to the TSDB after the transaction commits. This decouples the systems and prevents a TSDB unavailability from blocking critical transactional writes.

Finally, small-scale IoT projects with fewer than 10,000 sensors and sub-second query requirements can often use PostgreSQL with the TimescaleDB extension rather than a standalone time-series database. The operational complexity of deploying, monitoring, and scaling a dedicated InfluxDB cluster or a managed TSDB service is non-trivial. If your data volume fits comfortably on a single PostgreSQL instance with TimescaleDB's hypertable and continuous aggregate features, the operational simplicity of staying in Postgres is a genuine engineering advantage that should not be dismissed in favor of architectural novelty.

Insight: TimescaleDB occupies an interesting middle ground: it is a PostgreSQL extension that adds time-series optimizations (hypertables, continuous aggregates, columnar compression) while retaining full SQL compatibility and ACID transactions. For teams already operating PostgreSQL, TimescaleDB is often the lowest-friction upgrade path — you get 90% of the TSDB benefits with none of the operational overhead of a separate storage system.

8. Optimization Techniques for Production

Chunk compression in TimescaleDB provides one of the highest-leverage optimizations available for IoT sensor workloads. Each chunk — a time-partitioned table shard containing one day or one week of data — can be independently compressed using columnar storage. Unlike row-based storage where each row is stored as a unit, columnar compression groups all temperature values together, all pressure values together, and all vibration values together in contiguous storage blocks. Similar values compress far better when physically adjacent: a column of 86,400 temperature readings for a single day, slowly varying around 72°C, compresses dramatically better than 86,400 interleaved rows where temperature is mixed with pressure and vibration values. TimescaleDB reports typical compression ratios of 90 to 95% on IoT sensor data using this approach, reducing a 100GB raw dataset to 5 to 10GB of compressed storage.
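Why columnar grouping helps can be shown with run counts — the quantity that RLE and delta encoders feed on. A small sketch comparing an interleaved row layout with a columnar one (the sample values are made up):

```python
def run_count(seq):
    """Number of RLE runs in a sequence: fewer runs -> better compression."""
    runs = 1
    for a, b in zip(seq, seq[1:]):
        if a != b:
            runs += 1
    return runs

temp  = [72.4] * 10 + [72.5] * 10   # slowly changing temperature sensor
press = [4.2, 4.1] * 10             # pressure metric with a different rhythm

# Row layout interleaves metrics per sample; columnar groups each metric.
row_layout = [x for pair in zip(temp, press) for x in pair]
col_layout = temp + press

print(run_count(row_layout))  # 40 — interleaving breaks every run
print(run_count(col_layout))  # 22 — long temperature runs survive intact
```

The same 40 values produce nearly twice as many runs when interleaved, which is the intuition behind the 90 to 95% compression ratios quoted above.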

InfluxDB's Time Series Index (TSI) addresses a different scaling bottleneck: tag cardinality. The default InfluxDB storage engine maintains an in-memory inverted index mapping tag values to series IDs. At low cardinality (thousands of unique devices), this index is small and fast. At very high cardinality — millions of unique device IDs, especially in multi-tenant platforms where device IDs are UUIDs — this in-memory index can exhaust the JVM heap, causing garbage collection pressure that degrades write throughput. TSI moves the index to disk using an LSM-like structure, trading some read latency for dramatically reduced memory footprint. The operational guideline is to enable TSI when your platform exceeds approximately 100,000 unique series, which occurs quickly in platforms with per-metric series (one series per device per metric field).
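A quick back-of-the-envelope check against that guideline, using the fleet from the introduction (the ~100,000-series threshold is the operational rule of thumb stated above, not a hard limit):

```python
def estimated_series(devices, metrics_per_device, extra_tag_values=1):
    """Rough series-cardinality estimate: one series per unique tag combination."""
    return devices * metrics_per_device * extra_tag_values

# 50,000 devices x 40 metric fields, as in the introduction's IoT fleet.
series = estimated_series(devices=50_000, metrics_per_device=40)
print(series)            # 2000000 unique series
print(series > 100_000)  # True — well past the TSI guideline
```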

A four-tier downsampling pipeline is the standard approach for managing long-term retention costs without sacrificing analytical capability. Raw data at one-second resolution is retained for seven days — enough to debug any recent incident with full fidelity. A background task aggregates the raw data into five-minute averages and stores them for ninety days, providing three months of trend analysis with acceptable resolution. Hourly rollups extend to one year, enabling year-over-year comparisons and capacity planning. Daily rollups are retained indefinitely, providing the long-term baseline needed for anomaly detection models. Each tier is a separate bucket or schema with its own storage and compression settings, and queries are automatically routed to the appropriate tier based on the requested time range.
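Tier routing reduces to a lookup from requested lookback to the finest tier that still holds the data. A sketch using the bucket names from the retention examples earlier; the routing logic itself is illustrative:

```python
from datetime import timedelta

# Mirrors the four-tier pipeline above: (retention, bucket name).
TIERS = [
    (timedelta(days=7),   "iot_raw"),     # 1-second resolution
    (timedelta(days=90),  "iot_5min"),    # 5-minute averages
    (timedelta(days=365), "iot_hourly"),  # hourly rollups
    (timedelta.max,       "iot_daily"),   # daily rollups, kept indefinitely
]

def route_query(lookback):
    """Pick the finest-resolution tier whose retention covers the lookback."""
    for retention, bucket in TIERS:
        if lookback <= retention:
            return bucket
    return TIERS[-1][1]

print(route_query(timedelta(hours=6)))   # iot_raw
print(route_query(timedelta(days=30)))   # iot_5min
print(route_query(timedelta(days=400)))  # iot_daily
```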

Shard key design is the most critical architectural decision for distributed TSDB deployments. In systems like InfluxDB Clustered, Thanos, or Cortex, the shard key determines how time-series data is distributed across storage nodes. Using only time as the shard key creates the hot-partition problem: all writes for the current time window route to the same node. A composite shard key of (time_bucket + hash(device_id) modulo num_shards) distributes writes evenly across all nodes while keeping data for the same device and time window co-located on the same node. This locality property is critical for query performance: a query for "device 7 from T1 to T2" should need to contact at most one or two storage nodes, not fan out to all nodes in the cluster. Getting this balance right requires modeling your specific query patterns before choosing the shard key.
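The composite shard key from this paragraph can be written directly. A sketch, assuming MD5 as a stable, implementation-agnostic device hash and eight shards:

```python
import hashlib

def shard_for(device_id, timestamp_secs, num_shards=8, window_secs=3600):
    """Composite shard key: (time bucket, hash(device_id) % num_shards).

    Spreads the current time window's writes across all shards while keeping
    one device's data for a given window co-located on a single shard.
    """
    time_bucket = timestamp_secs // window_secs
    device_hash = int(hashlib.md5(device_id.encode()).hexdigest(), 16)
    return (time_bucket, device_hash % num_shards)

# Same device + same hour -> same shard: single-device queries stay local.
assert shard_for("pump-007", 1000) == shard_for("pump-007", 2000)
# Different devices in the same hour generally land on different shards,
# so the current window's write load is spread across the cluster.
print(shard_for("pump-007", 1000), shard_for("pump-042", 1000))
```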

9. Key Takeaways

  - B-trees amplify every write through page splits and rebalancing; LSM trees convert random writes into sequential WAL appends and immutable SSTable flushes.
  - Time-partitioned compaction bounds each merge job to a single time window, avoiding the compaction storms that level-based compaction invites.
  - Delta-of-delta timestamps plus Gorilla XOR value compression shrink a naive 16-byte sample to roughly 2 bytes, transforming storage economics and cache hit rates.
  - Prefer retention policies (instant whole-file drops) over row-level deletes, which leave tombstones that linger until compaction.
  - Shard by time plus a hash of device identity to spread current-window writes without losing single-device query locality.
  - Tiered downsampling (raw, 5-minute, hourly, daily) keeps long-term retention costs linear while preserving analytical capability.

10. Conclusion

The mismatch between general-purpose RDBMS and time-series workloads is not a matter of tuning or configuration — it is architectural. The write patterns, query patterns, compression opportunities, and retention semantics of time-series data are fundamentally different from transactional data, and they demand storage engines purpose-built for those characteristics. Applying VACUUM tuning and index-only scans to a PostgreSQL database receiving 2 million writes per second is the engineering equivalent of putting a spoiler on a sedan to make it competitive in Formula 1. The problem is not the tuning; it is the vehicle.

LSM trees, delta encoding, and time-partitioned compaction are not academic optimizations. They are the engineering foundations that make it possible to ingest millions of sensor readings per second on commodity hardware while keeping storage costs linear in data volume rather than super-linear. When you understand why these mechanisms exist — write amplification, compression opportunities in temporally correlated data, bounded compaction scope — you can make informed decisions about configuration, shard key design, and retention policy tiers rather than treating them as opaque dials to turn until performance improves.

The choice between InfluxDB and TimescaleDB comes down to operational context. If you operate PostgreSQL today, TimescaleDB minimizes architectural complexity and keeps your entire data stack in a single, well-understood system. If you are starting fresh at IoT scale or migrating from a struggling general-purpose database, a purpose-built TSDB like InfluxDB Clustered or a managed service like InfluxDB Cloud provides the operational tooling purpose-designed for time-series workloads. Either way, investing time upfront to design your retention policy tiers, shard key strategy, and downsampling pipelines — before your dataset grows to the point where changes require painful migrations — will pay dividends in both cost and query performance for years to come.
