Prometheus Native Histograms on the Wire: BucketSpan Stitching, ZigZag Packed Deltas, and a Zero-Allocation Rust Decoder
Classic Prometheus histograms are “fake” at ingestion time: they arrive as many float samples (_bucket, _sum, _count). Native histograms are the opposite: a single structured sample carries count, sum, sparse buckets, plus optional exemplars.
That structure is great—until you try to:
- stream-decode Remote-Write histograms at 2M samples/sec,
- explain why “just a few more buckets” made your ingester CPU go vertical,
- or verify another language’s implementation byte-for-byte.
This is a byte-level deep dive into the prompb.Histogram message used by Remote-Write:
- how bucket indices are described via `BucketSpan { offset, length }`,
- how bucket counts are carried as packed `sint64` deltas (ZigZag + varint),
- how to reconstruct the logical bucket stream without allocating,
- and why Go sometimes beats Rust even when Rust uses SIMD.
Specs & sources (manually verified)
These links are used for specific details that we rely on:
- Prometheus native histograms: the “first-class histogram sample type”, sparse buckets, schema, spans, and deltas concepts
- The actual protobuf schema for `prompb.Histogram` and `BucketSpan` (field numbers and types)
- Protobuf binary encoding (varints, tag = `(field_number << 3) | wire_type`, packed repeated numeric fields)
If any of the above move, your decoder can still work—but your field numbers can’t.
The data model we have to reconstruct
In prompb/types.proto, the core pieces (integer histogram case) look like:
```proto
repeated BucketSpan positive_spans = 11;  // offset: sint32, length: uint32
repeated sint64 positive_deltas = 12;     // delta-of-counts
```
Semantics:
- A span describes where buckets exist, as runs of consecutive bucket indices.
- The deltas describe what counts those buckets have.
Bucket indices are implicit: you don’t get an (index, count) list. You get a compressed “index tape” (spans) + a compressed “count tape” (deltas).
BucketSpan = sparse index tape
Each span is:
- `offset`: gap to the previous span, or the starting index for the first span (can be negative)
- `length`: number of consecutive buckets in the span
So the bucket indices produced by spans are:
```
start_0   = offset_0
indices_0 = start_0 .. start_0 + length_0 - 1
start_1   = (start_0 + length_0) + offset_1
indices_1 = start_1 .. start_1 + length_1 - 1
...
```

That “+ length, then + offset” rule is the thing you must not get wrong.
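That rule can be sketched in a few lines of Rust. This is an illustrative, allocating expansion purely to show the index arithmetic; `Span` and `expand_indices` are stand-ins invented for this post, not Prometheus code:

```rust
// Stand-in for the decoded BucketSpan message.
struct Span { offset: i32, length: u32 }

// Expand spans into absolute bucket indices using the
// "previous end + offset" rule described above.
fn expand_indices(spans: &[Span]) -> Vec<i32> {
    let mut out = Vec::new();
    let mut idx: i32 = 0; // running cursor; the first offset is relative to 0
    for sp in spans {
        idx += sp.offset; // jump the gap (first span: absolute start)
        for _ in 0..sp.length {
            out.push(idx);
            idx += 1; // consecutive indices within a span
        }
    }
    out
}

// Two spans: [offset=1, length=2] then [offset=3, length=1].
// start_0 = 1 -> indices 1, 2; end is 3; start_1 = 3 + 3 = 6.
// expand_indices(...) -> [1, 2, 6]
```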
Deltas = delta-coded counts
For integer histograms, counts are sent as deltas:
```
delta[0] = count(bucket_0) - 0
delta[i] = count(bucket_i) - count(bucket_{i-1})
```
To recover absolute counts:
```
prev = 0
for d in deltas:
    prev += d
    count = prev
```

These deltas are `sint64`, which in protobuf means:
- value → ZigZag transform to unsigned
- then encode as base-128 varint
So negative deltas (yes, they happen when buckets are sparse and your distribution shifts) are cheap when their magnitude is small.
Protobuf at the bit level: tags, varints, ZigZag
Protobuf messages are a stream of key-value records:
- key/tag: varint of `(field_number << 3) | wire_type`
- then a payload whose shape depends on `wire_type`
For example, positive_deltas = 12 is a repeated numeric field. In proto3 it is typically encoded as a packed field:
- tag byte for `(12 << 3) | 2 = 0x62` (wire type 2 = LEN)
- then a varint length `L`
- then `L` bytes of concatenated varints (each varint is a ZigZag’d `sint64`)
ZigZag (sint64)
ZigZag maps signed integers to unsigned so small magnitudes stay small:
```
zz(x) = (x << 1) ^ (x >> 63)
```

Examples:
| signed x | ZigZag zz(x) | varint bytes |
|---|---|---|
| 0 | 0 | 00 |
| 1 | 2 | 02 |
| -1 | 1 | 01 |
| 2 | 4 | 04 |
| -2 | 3 | 03 |
That table is worth memorizing if you ever debug wire dumps.
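Both directions of the transform fit in a couple of lines. This is a sketch of the standard `sint64` ZigZag mapping, matching the table above:

```rust
// ZigZag encode: small magnitudes (positive or negative) map to small
// unsigned values. The arithmetic right shift smears the sign bit.
fn zigzag_encode(x: i64) -> u64 {
    ((x << 1) ^ (x >> 63)) as u64
}

// ZigZag decode: the exact inverse.
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}
```

Feeding the table's rows through `zigzag_encode` (0→0, 1→2, -1→1, 2→4, -2→3) is a quick way to validate a fresh implementation.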
A concrete micro-example: decode by hand
Let’s build an intentionally tiny Histogram submessage with only:
- `schema = 0` (field 4, `sint32`)
- `positive_spans = [{ offset: 1, length: 3 }]` (field 11, embedded message)
- `positive_deltas = [ +5, -2, +1 ]` (field 12, packed `sint64`)
Step 1: schema = 0
Field 4, wire type VARINT (0): tag = (4<<3)|0 = 0x20.
Value 0 as `sint32` ZigZags to 0, varint `00`.
Bytes:

```
20 00
```

Step 2: positive_spans (field 11)
Field 11 is LEN (embedded message): tag = (11<<3)|2 = 0x5a.
The embedded BucketSpan is:
- `offset = 1` (field 1, `sint32`): tag `0x08`, value ZigZag(1) = 2 → `02`
- `length = 3` (field 2, `uint32`): tag `0x10`, value `03`
So the span message bytes are:
```
08 02 10 03
```

Length is 4, so the outer encoding is:

```
5a 04 08 02 10 03
```

Step 3: positive_deltas packed sint64
Field 12 packed: tag = 0x62.
Encode each delta as sint64:
- +5 → ZigZag = 10 → varint `0a`
- -2 → ZigZag = 3 → varint `03`
- +1 → ZigZag = 2 → varint `02`
Packed payload = 0a 03 02, length = 3.
```
62 03 0a 03 02
```

Full message hex
Concatenate:
```
20 00
5a 04 08 02 10 03
62 03 0a 03 02
```

If you can round-trip this through your decoder, you’re already ahead of most “it parses in my language” implementations.
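To convince yourself the hex is right, you can rebuild it programmatically. This is a sketch: `zigzag`, `put_uvarint`, and `encode_example` are local helpers written for this post, not part of any protobuf library.

```rust
fn zigzag(x: i64) -> u64 {
    ((x << 1) ^ (x >> 63)) as u64
}

// Append a base-128 varint (little-endian groups of 7 bits, MSB = continuation).
fn put_uvarint(buf: &mut Vec<u8>, mut v: u64) {
    loop {
        let b = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 { buf.push(b); break; }
        buf.push(b | 0x80);
    }
}

fn encode_example() -> Vec<u8> {
    let mut m = Vec::new();
    // schema = 0: tag (4<<3)|0 = 0x20, sint32 value ZigZag(0) = 0x00
    m.push(0x20);
    put_uvarint(&mut m, zigzag(0));
    // positive_spans: tag (11<<3)|2 = 0x5a, then the embedded BucketSpan
    let mut span = Vec::new();
    span.push(0x08); put_uvarint(&mut span, zigzag(1)); // offset = 1
    span.push(0x10); put_uvarint(&mut span, 3);         // length = 3
    m.push(0x5a);
    put_uvarint(&mut m, span.len() as u64);
    m.extend_from_slice(&span);
    // positive_deltas: tag (12<<3)|2 = 0x62, packed ZigZag'd varints
    let mut packed = Vec::new();
    for d in [5i64, -2, 1] { put_uvarint(&mut packed, zigzag(d)); }
    m.push(0x62);
    put_uvarint(&mut m, packed.len() as u64);
    m.extend_from_slice(&packed);
    m
}

// encode_example() == [0x20,0x00, 0x5a,0x04,0x08,0x02,0x10,0x03, 0x62,0x03,0x0a,0x03,0x02]
```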
Mermaid: how spans + deltas become buckets (index tape + count tape)
```mermaid
flowchart LR
  subgraph S["BucketSpans (index tape)"]
    A["offset=1, length=3"]
  end
  subgraph D["Deltas (count tape)"]
    d1["+5"]
    d2["-2"]
    d3["+1"]
  end
  A --> i1["idx=1"]
  A --> i2["idx=2"]
  A --> i3["idx=3"]
  d1 --> c1["count=5"]
  d2 --> c2["count=3"]
  d3 --> c3["count=4"]
  i1 --> b1[("bucket 1: 5")]
  i2 --> b2[("bucket 2: 3")]
  i3 --> b3[("bucket 3: 4")]
  c1 --> b1
  c2 --> b2
  c3 --> b3
```
This is not a generic flow: it’s the exact data dependency you implement.
Zero-allocation Rust decoder (protobuf without codegen)
Below is the “crunchy” part: decoding tags/varints, reading packed sint64, and iterating spans to yield (bucket_index, absolute_count).
Key goals:
- no heap allocations while decoding a message
- streaming-compatible: you can feed it slices from an HTTP body
- explicit wire-type validation so corrupted inputs fail fast
Varint + ZigZag
```rust
#[inline]
fn read_uvarint(mut p: &[u8]) -> Option<(u64, &[u8])> {
    let mut x: u64 = 0;
    let mut s: u32 = 0;
    // protobuf varint is max 10 bytes for u64
    for _ in 0..10 {
        let b = *p.get(0)?;
        p = &p[1..];
        x |= ((b & 0x7f) as u64) << s;
        if (b & 0x80) == 0 {
            return Some((x, p));
        }
        s += 7;
    }
    None
}

#[inline]
fn zigzag_decode_u64(n: u64) -> i64 {
    // inverse of (x << 1) ^ (x >> 63)
    ((n >> 1) as i64) ^ (-((n & 1) as i64))
}

#[inline]
fn read_svarint(p: &[u8]) -> Option<(i64, &[u8])> {
    let (u, rest) = read_uvarint(p)?;
    Some((zigzag_decode_u64(u), rest))
}

#[inline]
fn read_len_delim(p: &[u8]) -> Option<(&[u8], &[u8])> {
    let (len, mut rest) = read_uvarint(p)?;
    let len = len as usize;
    if rest.len() < len { return None; }
    let msg = &rest[..len];
    rest = &rest[len..];
    Some((msg, rest))
}
```

Decode BucketSpan { offset, length }
```rust
#[derive(Clone, Copy, Debug)]
struct Span { offset: i32, length: u32 }

fn decode_span(mut p: &[u8]) -> Option<(Span, &[u8])> {
    let mut offset: i32 = 0;
    let mut length: u32 = 0;
    while !p.is_empty() {
        let (tag, rest) = read_uvarint(p)?;
        p = rest;
        let field = (tag >> 3) as u32;
        let wire = (tag & 0x07) as u32;
        match (field, wire) {
            (1, 0) => { // sint32 offset
                let (v, rest) = read_svarint(p)?;
                p = rest;
                offset = v as i32;
            }
            (2, 0) => { // uint32 length
                let (v, rest) = read_uvarint(p)?;
                p = rest;
                length = v as u32;
            }
            // skip unknown fields (proto forward-compat)
            (_, 0) => { let (_v, rest) = read_uvarint(p)?; p = rest; }
            (_, 2) => { let (_b, rest) = read_len_delim(p)?; p = rest; }
            // not expecting fixed32/fixed64 here
            _ => return None,
        }
    }
    Some((Span { offset, length }, p))
}
```

Stream buckets without allocating
We decode the histogram fields we care about (positive spans + positive deltas), then iterate them in lockstep.
Important: positive_deltas is usually packed (wire type LEN). You must accept both packed and unpacked in practice.
```rust
struct PackedSvarints<'a> { p: &'a [u8] }

impl<'a> Iterator for PackedSvarints<'a> {
    type Item = i64;
    fn next(&mut self) -> Option<i64> {
        if self.p.is_empty() { return None; }
        let (v, rest) = read_svarint(self.p)?;
        self.p = rest;
        Some(v)
    }
}

fn iter_buckets<'a>(
    spans: &'a [Span],
    mut deltas: PackedSvarints<'a>,
) -> impl Iterator<Item = (i32, i64)> + 'a {
    let mut span_i = 0usize;
    let mut in_span = 0u32;
    let mut idx: i32 = 0;
    let mut prev_count: i64 = 0;
    std::iter::from_fn(move || {
        loop {
            if span_i >= spans.len() { return None; }
            let sp = spans[span_i];
            if in_span == 0 {
                // first bucket of this span: idx currently sits one past the
                // end of the previous span (or at 0 for the first span), so
                // the new start is simply idx + offset
                idx += sp.offset;
            }
            if in_span >= sp.length {
                span_i += 1;
                in_span = 0;
                continue;
            }
            let d = deltas.next()?;
            prev_count = prev_count.wrapping_add(d);
            let out = (idx, prev_count);
            idx += 1;
            in_span += 1;
            return Some(out);
        }
    })
}
```

That idx update logic is where bugs hide.
Here is the canonical index update logic (less cute, more correct):
```rust
fn span_indices(spans: &[Span]) -> impl Iterator<Item = i32> + '_ {
    let mut cur_start: i32 = 0;
    let mut cur_end_excl: i32 = 0;
    let mut first = true;
    spans.iter().flat_map(move |sp| {
        if first {
            cur_start = sp.offset;
            first = false;
        } else {
            // next start is previous end + offset
            cur_start = cur_end_excl + sp.offset;
        }
        cur_end_excl = cur_start + sp.length as i32;
        cur_start..cur_end_excl
    })
}

fn buckets_from_spans_and_deltas<'a>(
    spans: &'a [Span],
    deltas: PackedSvarints<'a>,
) -> impl Iterator<Item = (i32, i64)> + 'a {
    let mut prev: i64 = 0;
    span_indices(spans).zip(deltas).map(move |(i, d)| {
        prev += d;
        (i, prev)
    })
}
```

Yes, it allocates nothing. No, it does not forgive mismatched lengths: if deltas run out early, you stop. If you have extra deltas, you should treat it as corruption.
Trade-offs (Prometheus native hist vs OTLP exponential hist vs custom encodings)
Native histograms are an engineering compromise, not a free lunch.
Prometheus native histograms
Pros:
- sparse representation: empty buckets don’t cost per-sample
- mergeable exponential schema and dynamic resolution
- integrates directly into PromQL & TSDB
Cons:
- decoding cost shifts to ingestion/query: you must stitch spans + apply deltas
- packed varints are branchy; CPU is dominated by varint loops on hot paths
- protobuf schema evolution is constrained: field numbers are forever
OTLP Exponential Histogram (OpenTelemetry)
OTLP also uses exponential bucketing, but the wire choices differ (notably around aggregation temporality and field layout). If you’re converting OTLP → Prometheus, you may end up re-bucketing or mapping indices.
Custom TSM / Parquet / columnar
- Parquet (columnar) can compress runs of repeated values extremely well, but typically wants you to batch a lot (latency trade-off).
- A custom TSM-like format can encode bucket indices and counts with domain-specific bit packing (e.g. 16-bit deltas) and beat protobuf—at the cost of interoperability.
The real comparison is not “format A vs format B”; it’s “how much CPU per byte saved” on your ingest and query nodes.
Go vs Rust: the uncomfortable benchmark reality
Even with Rust doing careful bounds checks and avoiding allocations, Go can win in end-to-end Remote-Write ingestion benchmarks because:
- Go’s runtime + compiler are aggressively tuned for protobuf-style varint decoding patterns.
- Your Rust implementation may lose to branch mispredicts in `read_uvarint`, not to memory bandwidth.
- If you use `prost` codegen, you often pay for intermediate allocations and Vec growth unless you redesign the API.
If you want Rust to win, you typically need either:
- hand-rolled parsing like above,
- or a SIMD-assisted varint decoder (hard; varints are the enemy of SIMD),
- or to change the upstream data representation (which you can’t for Prometheus Remote-Write).
Practical debugging tips
- Assume packed fields. Proto3 repeated numeric fields are packed by default.
- Verify tag bytes. For field 12 packed, the tag must be `0x62`.
- Treat extra bytes as corruption. If a packed field declares length `L`, you must consume exactly `L` bytes.
- Watch for negative deltas. They’re legal; don’t cast to `u64` by accident.
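The “consume exactly L bytes” rule can be enforced with a tiny scanner. This is a sketch; `validate_packed` is a hypothetical helper, not part of the decoder above:

```rust
// Check that a packed payload consists of whole varints and nothing else:
// a truncated trailing varint (continuation bit set at end of input) or an
// over-long varint means the declared length L does not match the content.
fn validate_packed(payload: &[u8]) -> bool {
    let mut p = payload;
    while !p.is_empty() {
        // scan one varint (max 10 bytes for u64)
        let mut n = 0;
        loop {
            if n >= 10 || n >= p.len() { return false; }
            let b = p[n];
            n += 1;
            if b & 0x80 == 0 { break; }
        }
        p = &p[n..];
    }
    true // every byte belonged to a complete varint
}
```

For the worked example, `validate_packed(&[0x0a, 0x03, 0x02])` passes, while a lone continuation byte like `0x8a` fails.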
Research question (a real paradox)
If protobuf varints are branch-heavy and histogram bucket streams are fundamentally serial, why do many production ingesters still show higher throughput in Go than in “zero-allocation Rust” implementations?
Is the answer:
- microarchitecture (branch predictor + front-end),
- compiler codegen quality (Go’s mature protobuf hot paths vs Rust’s inlining heuristics),
- or a deeper truth: that on modern CPUs, the shape of the data (varints + ZigZag) matters more than the language?
If you can produce a benchmark where Rust wins without changing the wire format, I’m interested in what trick you used.