The Evolution of TSDB Compression: From Gorilla to InfluxDB 3
In the world of high-scale observability, data volume is the enemy of performance. A system collecting 10 million metrics per second generates nearly a trillion data points a day. Without aggressive compression, storage costs and query latency would make such systems economically and technically unfeasible.
Over the last decade, TSDB (Time Series Database) compression has evolved from custom-tailored bit-stream algorithms to standardized columnar formats. This article explores that journey, focusing on the techniques that defined an era.
The Gorilla Era: Exploiting Regularity
In 2015, Facebook published the seminal paper “Gorilla: A Fast, Scalable, In-Memory Time Series Database”. It introduced two key techniques that became the industry standard for nearly a decade: Delta-of-Delta for timestamps and XOR compression for values.
1. Timestamps: Delta-of-Delta Encoding
Most time-series data is periodic (e.g., every 60 seconds). In a perfect world, the difference between consecutive timestamps (the “delta”) is constant. Delta-of-Delta encoding exploits this.
Step-by-Step Walk-through
Let’s compress the sequence: [1643673600, 1643673660, 1643673722, 1643673780]
-
Calculate Deltas:
-
Calculate Delta-of-Deltas ():
-
Bit-Stream Representation (Simplified Gorilla):
- : Store bit
0. (1 bit) - : Store bits
10followed by 7 bits of . (9 bits) - : Store bits
110followed by 9 bits of . (12 bits) - …and so on for larger ranges.
- : Store bit
For our sequence:
- : Prefix
10, value0000010(9 bits total) - : Prefix
10, value1111100(two’s complement, 9 bits total)
Compared to storing a 64-bit integer, we used only 9 bits.
2. Values: XOR Compression
Floating-point values (IEEE 754) often change slowly. When you XOR two consecutive values (), identical bits result in 0.
graph LR V1[Value N-1] -- XOR -- Result[XOR Result] V2[Value N] -- XOR -- Result Result --> Zeros[Leading/Trailing Zeros] Zeros --> Meaningful[Meaningful Bits]
Implementation: Core Gorilla in Rust
Below is a simplified implementation of the Gorilla-style bit-packing logic in Rust, focusing on the XOR compression of 64-bit floats.
/// A simplified XOR compressor for f64 values.
pub struct GorillaValueCompressor {
last_value: u64,
last_leading_zeros: u32,
last_trailing_zeros: u32,
first: bool,
}
impl GorillaValueCompressor {
pub fn new() -> Self {
Self {
last_value: 0,
last_leading_zeros: u32::MAX,
last_trailing_zeros: 0,
first: true,
}
}
pub fn compress(&mut self, val: f64) -> Vec<u8> {
let x = val.to_bits();
if self.first {
self.last_value = x;
self.first = false;
// In a real impl, we'd write the full 64 bits here.
return x.to_be_bytes().to_vec();
}
let xor = x ^ self.last_value;
self.last_value = x;
if xor == 0 {
return vec![0b0]; // Single '0' bit
}
let leading = xor.leading_zeros();
let trailing = xor.trailing_zeros();
// This is where bit-packing (writing specific bit lengths) occurs.
// For brevity, we return a symbolic representation.
self.encode_xor(xor, leading, trailing)
}
fn encode_xor(&mut self, xor: u64, leading: u32, trailing: u32) -> Vec<u8> {
// Logic for '1' prefix + leading/trailing zero checks...
// ... (truncated for conceptual clarity)
vec![0b1]
}
}The Paradigm Shift: InfluxDB 3 and Parquet
InfluxDB 3 moved away from custom bit-streams toward Apache Parquet, a columnar storage standard.
Gorilla vs. Parquet (Columnar)
sequenceDiagram participant TS as Time Series participant G as Gorilla (Row-Oriented Bitstream) participant P as Parquet (Columnar/SIMD) TS->>G: [T1,V1], [T2,V2], [T3,V3] Note over G: Bit-by-bit serial processing.<br/>Hard to parallelize. TS->>P: [T1, T2, T3], [V1, V2, V3] Note over P: Block-based processing.<br/>Optimized for SIMD & Vectorization.
Modern Techniques: SIMD and Bit-packing
Modern engines use SIMD (Single Instruction, Multiple Data) to process blocks of values at once. Instead of Gorilla’s bit-at-a-time branching, Parquet’s DELTA_BINARY_PACKED uses Bit-packing and Varint encoding.
- Block Deltas: Calculate deltas for a block (e.g., 128 values).
- Min Delta: Find the minimum delta.
- Subtract: Subtract min delta from all.
- Bit-width: Find the max bits needed and pack the entire block.
Summary
The evolution from Gorilla to InfluxDB 3 is a story of maturing infrastructure. We have moved from specialized, “hand-crafted” compression aimed at saving every possible bit in RAM, to standardized, block-oriented columnar formats optimized for disk I/O and ecosystem interoperability.
Technical References: