Prometheus Native Histograms on the Wire: BucketSpan Stitching, ZigZag Packed Deltas, and a Zero-Allocation Rust Decoder

Classic Prometheus histograms are “fake” at ingestion time: they arrive as many float samples (_bucket, _sum, _count). Native histograms are the opposite: a single structured sample carries count, sum, sparse buckets, plus optional exemplars.

That structure is great—until you try to:

  • stream-decode Remote-Write histograms at 2M samples/sec,
  • explain why “just a few more buckets” made your ingester CPU go vertical,
  • or verify another language’s implementation byte-for-byte.

This is a byte-level deep dive into the prompb.Histogram message used by Remote-Write:

  • how bucket indices are described via BucketSpan { offset, length },
  • how bucket counts are carried as packed sint64 deltas (ZigZag + varint),
  • how to reconstruct the logical bucket stream without allocating,
  • and why Go sometimes beats Rust even when Rust uses SIMD.

Specs & sources (manually verified)

The specific field numbers and wire rules below come from prompb/types.proto in the Prometheus repository and from the official protobuf encoding documentation.

If those sources ever change, your decoder can still work—but your field numbers can’t.


The data model we have to reconstruct

In prompb/types.proto, the core pieces (integer histogram case) look like:

  • repeated BucketSpan positive_spans = 11; (offset:sint32, length:uint32)
  • repeated sint64 positive_deltas = 12; (delta-of-counts)

Semantics:

  • A span describes where buckets exist, as runs of consecutive bucket indices.
  • The deltas describe what counts those buckets have.

Bucket indices are implicit: you don’t get an (index, count) list. You get a compressed “index tape” (spans) + a compressed “count tape” (deltas).

BucketSpan = sparse index tape

Each span is:

  • offset: gap to previous span, or starting index for the first span (can be negative)
  • length: number of consecutive buckets in the span

So the bucket indices produced by spans are:

start_0 = offset_0
indices_0 = start_0 .. start_0 + length_0 - 1
 
start_1 = (start_0 + length_0) + offset_1
indices_1 = start_1 .. start_1 + length_1 - 1
...

That “+ length then + offset” rule is the thing you must not get wrong.
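A minimal sketch of that rule in Rust (the `Span` struct here is an illustrative stand-in for prompb's BucketSpan, not the generated type):

```rust
// Stand-in for prompb's BucketSpan, for illustration only.
struct Span { offset: i32, length: u32 }

// Expand spans into absolute bucket indices.
// Invariant: `end` is one past the last index of the previous span,
// and starts at 0, so start_0 = 0 + offset_0 = offset_0.
fn expand_indices(spans: &[Span]) -> Vec<i32> {
    let mut out = Vec::new();
    let mut end = 0i32;
    for sp in spans {
        let start = end + sp.offset;
        out.extend(start..start + sp.length as i32);
        end = start + sp.length as i32;
    }
    out
}

fn main() {
    // spans [{offset:2, length:2}, {offset:3, length:1}]
    // → indices 2, 3; then next start = 4 + 3 = 7
    let spans = [Span { offset: 2, length: 2 }, Span { offset: 3, length: 1 }];
    assert_eq!(expand_indices(&spans), vec![2, 3, 7]);
    println!("{:?}", expand_indices(&spans));
}
```

Note that initializing `end` to 0 makes the first span fall out of the same rule—no special case needed.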

Deltas = delta-coded counts

For integer histograms, counts are sent as deltas:

  • delta[0] = count(bucket_0) - 0
  • delta[i] = count(bucket_i) - count(bucket_{i-1})

To recover absolute counts:

prev = 0
for d in deltas:
  prev += d
  count = prev

These deltas are sint64, which in protobuf means:

  • value → ZigZag transform to unsigned
  • then encode as base-128 varint

So negative deltas (yes, they happen when buckets are sparse and your distribution shifts) are cheap when their magnitude is small.


Protobuf at the bit level: tags, varints, ZigZag

Protobuf messages are a stream of key-value records:

  • key/tag: varint of (field_number << 3) | wire_type
  • then a payload whose shape depends on wire_type

For example, positive_deltas = 12 is a repeated numeric field. In proto3 it is typically encoded as a packed field:

  • tag byte for (12<<3)|2 = 0x62 (wire type 2 = LEN)
  • then a varint length L
  • then L bytes of concatenated varints (each varint is a ZigZag’d sint64)

ZigZag (sint64)

ZigZag maps signed integers to unsigned so small magnitudes stay small:

zz(x) = (x << 1) ^ (x >> 63)

Examples:

signed x | ZigZag zz(x) | varint bytes
       0 |            0 | 00
       1 |            2 | 02
      -1 |            1 | 01
       2 |            4 | 04
      -2 |            3 | 03

That table is worth memorizing if you ever debug wire dumps.
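A few lines of Rust to verify that table against the formula above:

```rust
// ZigZag for sint64: small magnitudes (positive or negative) stay small.
fn zigzag_encode(x: i64) -> u64 {
    ((x << 1) ^ (x >> 63)) as u64
}

fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ (-((n & 1) as i64))
}

fn main() {
    // Exactly the rows of the table above.
    for (x, zz) in [(0i64, 0u64), (1, 2), (-1, 1), (2, 4), (-2, 3)] {
        assert_eq!(zigzag_encode(x), zz);
        assert_eq!(zigzag_decode(zz), x);
    }
    println!("table verified");
}
```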


A concrete micro-example: decode by hand

Let’s build an intentionally tiny Histogram submessage with only:

  • schema = 0 (field 4, sint32)
  • positive_spans = [{ offset: 1, length: 3 }] (field 11, embedded message)
  • positive_deltas = [ +5, -2, +1 ] (field 12, packed sint64)

Step 1: schema = 0

Field 4, wire type VARINT (0): tag = (4<<3)|0 = 0x20.

Value 0 as sint32 ZigZag’s to 0, varint 00.

Bytes:

20 00

Step 2: positive_spans (field 11)

Field 11 is LEN (embedded message): tag = (11<<3)|2 = 0x5a.

The embedded BucketSpan is:

  • offset = 1 (field 1, sint32): tag 0x08, value ZigZag(1)=2 → 02
  • length = 3 (field 2, uint32): tag 0x10, value 03

So the span message bytes are:

08 02 10 03

Length is 4, so outer encoding:

5a 04 08 02 10 03

Step 3: positive_deltas packed sint64

Field 12 packed: tag = 0x62.

Encode each delta as sint64:

  • +5 → ZigZag = 10 → varint 0a
  • -2 → ZigZag = 3 → varint 03
  • +1 → ZigZag = 2 → varint 02

Packed payload = 0a 03 02, length = 3.

62 03 0a 03 02

Full message hex

Concatenate:

20 00
5a 04 08 02 10 03
62 03 0a 03 02

If you can round-trip this through your decoder, you’re already ahead of most “it parses in my language” implementations.
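To double-check the hand assembly, here is a short encoder sketch that builds the same three fields and prints the hex. The varint/ZigZag helpers are written inline; this is illustrative, not a general protobuf encoder:

```rust
// Base-128 varint encoding of an unsigned value.
fn put_uvarint(out: &mut Vec<u8>, mut v: u64) {
    loop {
        let b = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 { out.push(b); break; }
        out.push(b | 0x80);
    }
}

// ZigZag then varint, i.e. protobuf sint64/sint32.
fn put_svarint(out: &mut Vec<u8>, x: i64) {
    put_uvarint(out, ((x << 1) ^ (x >> 63)) as u64);
}

fn main() {
    let mut m = Vec::new();

    // schema = 0: field 4, wire type 0 → tag 0x20
    m.push(0x20);
    put_svarint(&mut m, 0);

    // positive_spans: field 11, wire type 2 → tag 0x5a
    let mut span = Vec::new();
    span.push(0x08); put_svarint(&mut span, 1); // offset = 1 (sint32)
    span.push(0x10); put_uvarint(&mut span, 3); // length = 3 (uint32)
    m.push(0x5a);
    put_uvarint(&mut m, span.len() as u64);
    m.extend_from_slice(&span);

    // positive_deltas = [+5, -2, +1]: field 12, packed → tag 0x62
    let mut packed = Vec::new();
    for d in [5i64, -2, 1] { put_svarint(&mut packed, d); }
    m.push(0x62);
    put_uvarint(&mut m, packed.len() as u64);
    m.extend_from_slice(&packed);

    let hex: Vec<String> = m.iter().map(|b| format!("{:02x}", b)).collect();
    println!("{}", hex.join(" "));
    assert_eq!(hex.join(" "), "20 00 5a 04 08 02 10 03 62 03 0a 03 02");
}
```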


Mermaid: how spans + deltas become buckets (index tape + count tape)

flowchart LR
  subgraph S["BucketSpans (index tape)"]
    A[offset=1,length=3]
  end

  subgraph D["Deltas (count tape)"]
    d1[+5]
    d2[-2]
    d3[+1]
  end

  A --> i1[idx=1]
  A --> i2[idx=2]
  A --> i3[idx=3]

  d1 --> c1[count=5]
  d2 --> c2[count=3]
  d3 --> c3[count=4]

  i1 --> b1[("bucket 1: 5")]
  i2 --> b2[("bucket 2: 3")]
  i3 --> b3[("bucket 3: 4")]

  c1 --> b1
  c2 --> b2
  c3 --> b3

This is not a generic flow: it’s the exact data dependency you implement.


Zero-allocation Rust decoder (protobuf without codegen)

Below is the “crunchy” part: decoding tags/varints, reading packed sint64, and iterating spans to yield (bucket_index, absolute_count).

Key goals:

  • no heap allocations while decoding a message
  • streaming-compatible: you can feed it slices from an HTTP body
  • explicit wire-type validation so corrupted inputs fail fast

Varint + ZigZag

#[inline]
fn read_uvarint(mut p: &[u8]) -> Option<(u64, &[u8])> {
    let mut x: u64 = 0;
    let mut s: u32 = 0;
    // protobuf varint is max 10 bytes for u64
    for _ in 0..10 {
        let b = *p.get(0)?;
        p = &p[1..];
        x |= ((b & 0x7f) as u64) << s;
        if (b & 0x80) == 0 {
            return Some((x, p));
        }
        s += 7;
    }
    None
}
 
#[inline]
fn zigzag_decode_u64(n: u64) -> i64 {
    // inverse of (x<<1) ^ (x>>63)
    ((n >> 1) as i64) ^ (-((n & 1) as i64))
}
 
#[inline]
fn read_svarint(p: &[u8]) -> Option<(i64, &[u8])> {
    let (u, rest) = read_uvarint(p)?;
    Some((zigzag_decode_u64(u), rest))
}
 
#[inline]
fn read_len_delim(p: &[u8]) -> Option<(&[u8], &[u8])> {
    let (len, mut rest) = read_uvarint(p)?;
    let len = len as usize;
    if rest.len() < len { return None; }
    let msg = &rest[..len];
    rest = &rest[len..];
    Some((msg, rest))
}
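Quick sanity check: running these helpers over the packed-deltas payload from the worked example (0a 03 02). The helpers are restated in compressed form so the snippet compiles on its own:

```rust
// Same varint/ZigZag logic as above, compacted for a standalone check.
fn read_uvarint(mut p: &[u8]) -> Option<(u64, &[u8])> {
    let (mut x, mut s) = (0u64, 0u32);
    for _ in 0..10 {
        let b = *p.first()?;
        p = &p[1..];
        x |= ((b & 0x7f) as u64) << s;
        if b & 0x80 == 0 { return Some((x, p)); }
        s += 7;
    }
    None
}

fn read_svarint(p: &[u8]) -> Option<(i64, &[u8])> {
    let (u, rest) = read_uvarint(p)?;
    Some((((u >> 1) as i64) ^ (-((u & 1) as i64)), rest))
}

fn main() {
    // Packed payload from the micro-example: [+5, -2, +1].
    let mut p: &[u8] = &[0x0a, 0x03, 0x02];
    let mut out = Vec::new();
    while !p.is_empty() {
        let (v, rest) = read_svarint(p).expect("valid varint");
        out.push(v);
        p = rest;
    }
    assert_eq!(out, vec![5, -2, 1]);
    println!("{:?}", out);
}
```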

Decode BucketSpan { offset, length }

#[derive(Clone, Copy, Debug)]
struct Span { offset: i32, length: u32 }
 
fn decode_span(mut p: &[u8]) -> Option<(Span, &[u8])> {
    let mut offset: i32 = 0;
    let mut length: u32 = 0;
 
    while !p.is_empty() {
        let (tag, rest) = read_uvarint(p)?;
        p = rest;
        let field = (tag >> 3) as u32;
        let wire  = (tag & 0x07) as u32;
 
        match (field, wire) {
            (1, 0) => { // sint32 offset
                let (v, rest) = read_svarint(p)?;
                p = rest;
                offset = v as i32;
            }
            (2, 0) => { // uint32 length
                let (v, rest) = read_uvarint(p)?;
                p = rest;
                length = v as u32;
            }
            // skip unknown fields (proto forward-compat), any wire type
            (_, 0) => { let (_v, rest) = read_uvarint(p)?; p = rest; }
            (_, 2) => { let (_b, rest) = read_len_delim(p)?; p = rest; }
            (_, 1) => { if p.len() < 8 { return None; } p = &p[8..]; } // fixed64
            (_, 5) => { if p.len() < 4 { return None; } p = &p[4..]; } // fixed32
            _ => return None, // invalid wire type
        }
    }
 
    Some((Span { offset, length }, p))
}
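Feeding the span bytes from the worked example (08 02 10 03) into such a decoder should yield offset=1, length=3. Here's a compact, self-contained check version (unknown-field skipping omitted for brevity; this deliberately rejects anything but fields 1 and 2):

```rust
#[derive(Debug, PartialEq)]
struct Span { offset: i32, length: u32 }

fn read_uvarint(mut p: &[u8]) -> Option<(u64, &[u8])> {
    let (mut x, mut s) = (0u64, 0u32);
    for _ in 0..10 {
        let b = *p.first()?;
        p = &p[1..];
        x |= ((b & 0x7f) as u64) << s;
        if b & 0x80 == 0 { return Some((x, p)); }
        s += 7;
    }
    None
}

// Minimal BucketSpan decoder: field 1 (sint32 offset), field 2 (uint32 length).
fn decode_span(mut p: &[u8]) -> Option<Span> {
    let (mut offset, mut length) = (0i32, 0u32);
    while !p.is_empty() {
        let (tag, rest) = read_uvarint(p)?;
        p = rest;
        match (tag >> 3, tag & 7) {
            (1, 0) => {
                let (u, rest) = read_uvarint(p)?;
                p = rest;
                // ZigZag decode, then narrow to i32.
                offset = (((u >> 1) as i64) ^ (-((u & 1) as i64))) as i32;
            }
            (2, 0) => {
                let (u, rest) = read_uvarint(p)?;
                p = rest;
                length = u as u32;
            }
            _ => return None, // check-only version: fail on anything else
        }
    }
    Some(Span { offset, length })
}

fn main() {
    let span = decode_span(&[0x08, 0x02, 0x10, 0x03]).unwrap();
    assert_eq!(span, Span { offset: 1, length: 3 });
    println!("{:?}", span);
}
```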

Stream buckets without allocating

We decode the histogram fields we care about (positive spans + positive deltas), then iterate them in lockstep.

Important: positive_deltas is usually packed (wire type LEN). You must accept both packed and unpacked in practice.

struct PackedSvarints<'a> { p: &'a [u8] }
 
impl<'a> Iterator for PackedSvarints<'a> {
    type Item = i64;
    fn next(&mut self) -> Option<i64> {
        if self.p.is_empty() { return None; }
        let (v, rest) = read_svarint(self.p)?;
        self.p = rest;
        Some(v)
    }
}
 
fn iter_buckets<'a>(
    spans: &'a [Span],
    mut deltas: PackedSvarints<'a>,
) -> impl Iterator<Item = (i32, i64)> + 'a {
    let mut span_i = 0usize;
    let mut in_span = 0u32;
    let mut idx: i32 = 0;
    let mut prev_count: i64 = 0;
 
    std::iter::from_fn(move || {
        loop {
            if span_i >= spans.len() { return None; }
            let sp = spans[span_i];
 
            if in_span == 0 {
                // first bucket of this span: start = previous end + offset.
                // idx already holds the previous span's exclusive end
                // (and starts at 0, which covers the first span too).
                idx += sp.offset;
            }
 
            if in_span >= sp.length {
                span_i += 1;
                in_span = 0;
                continue;
            }
 
            let d = deltas.next()?;
            prev_count = prev_count.wrapping_add(d);
            let out = (idx, prev_count);
 
            idx += 1;
            in_span += 1;
            return Some(out);
        }
    })
}

That idx update is where bugs hide. The invariant to keep: each span's start is the previous span's exclusive end plus the offset, and the "previous end" before the first span is 0.

Here is the same index rule factored into a standalone iterator, which is easier to test in isolation:

fn span_indices(spans: &[Span]) -> impl Iterator<Item = i32> + '_ {
    let mut cur_start: i32 = 0;
    let mut cur_end_excl: i32 = 0;
    let mut first = true;
 
    spans.iter().flat_map(move |sp| {
        if first {
            cur_start = sp.offset;
            first = false;
        } else {
            // next start is previous end + offset
            cur_start = cur_end_excl + sp.offset;
        }
        cur_end_excl = cur_start + sp.length as i32;
        (cur_start..cur_end_excl)
    })
}
 
fn buckets_from_spans_and_deltas<'a>(
    spans: &'a [Span],
    deltas: PackedSvarints<'a>,
) -> impl Iterator<Item = (i32, i64)> + 'a {
    let mut prev: i64 = 0;
    span_indices(spans).zip(deltas).map(move |(i, d)| {
        prev += d;
        (i, prev)
    })
}

Yes, it allocates nothing. No, it does not forgive mismatched lengths: if the deltas run out early, iteration simply stops; if there are extra deltas left over, treat the message as corrupt.
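Putting the pieces together on the micro-example—indices from the span, prefix sum over the deltas. The deltas are given as plain i64s here since the varint layer was exercised above; `span_indices` is a Vec-returning variant for testability:

```rust
struct Span { offset: i32, length: u32 }

// Vec-returning variant of the span index expansion, for easy checking.
fn span_indices(spans: &[Span]) -> Vec<i32> {
    let mut out = Vec::new();
    let mut end = 0i32;
    for sp in spans {
        let start = end + sp.offset; // next start = previous end + offset
        out.extend(start..start + sp.length as i32);
        end = start + sp.length as i32;
    }
    out
}

fn main() {
    // The worked example: one span {offset:1, length:3}, deltas [+5, -2, +1].
    let spans = [Span { offset: 1, length: 3 }];
    let deltas = [5i64, -2, 1];

    let mut prev = 0i64;
    let buckets: Vec<(i32, i64)> = span_indices(&spans)
        .into_iter()
        .zip(deltas.iter().map(|&d| { prev += d; prev })) // prefix sum
        .collect();

    assert_eq!(buckets, vec![(1, 5), (2, 3), (3, 4)]);
    println!("{:?}", buckets);
}
```

This matches the Mermaid diagram earlier: indices 1, 2, 3 carrying counts 5, 3, 4.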


Trade-offs (Prometheus native hist vs OTLP exponential hist vs custom encodings)

Native histograms are an engineering compromise, not a free lunch.

Prometheus native histograms

Pros:

  • sparse representation: empty buckets don’t cost per-sample
  • mergeable exponential schema and dynamic resolution
  • integrates directly into PromQL & TSDB

Cons:

  • decoding cost shifts to ingestion/query: you must stitch spans + apply deltas
  • packed varints are branchy; CPU is dominated by varint loops on hot paths
  • protobuf schema evolution is constrained: field numbers are forever

OTLP Exponential Histogram (OpenTelemetry)

OTLP also uses exponential bucketing, but the wire choices differ (notably around aggregation temporality and field layout). If you’re converting OTLP → Prometheus, you may end up re-bucketing or mapping indices.

Custom TSM / Parquet / columnar

  • Parquet (columnar) can compress runs of repeated values extremely well, but typically wants you to batch a lot (latency trade-off).
  • A custom TSM-like format can encode bucket indices and counts with domain-specific bit packing (e.g. 16-bit deltas) and beat protobuf—at the cost of interoperability.

The real comparison is not “format A vs format B”; it’s “how much CPU per byte saved” on your ingest and query nodes.

Go vs Rust: the uncomfortable benchmark reality

Even with Rust doing careful bounds checks and avoiding allocations, Go can win in end-to-end Remote-Write ingestion benchmarks because:

  • Go’s runtime + compiler are aggressively tuned for protobuf-style varint decoding patterns.
  • Your Rust implementation may lose to branch mispredicts in read_uvarint, not to memory bandwidth.
  • If you use prost codegen, you often pay for intermediate allocations and Vec growth unless you redesign the API.

If you want Rust to win, you typically need either:

  • hand-rolled parsing like above,
  • or a SIMD-assisted varint decoder (hard; varints are the enemy of SIMD),
  • or to change the upstream data representation (which you can’t for Prometheus Remote-Write).

Practical debugging tips

  1. Assume packed fields. Proto3 repeated numeric fields are packed by default.
  2. Verify tag bytes. For field 12 packed: tag must be 0x62.
  3. Treat extra bytes as corruption. If a packed field declares length L, you must consume exactly L bytes.
  4. Watch for negative deltas. They’re legal; don’t cast to u64 by accident.
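Tip 2 as code: tags for low-numbered fields fit in a single varint byte, so you can precompute the constants your decoder matches on (sketch; only valid while the field number stays ≤ 15):

```rust
// Single-byte protobuf tag: (field_number << 3) | wire_type.
// Fits in one varint byte only when the result is <= 0x7f, i.e. field <= 15.
fn tag(field: u32, wire: u32) -> u8 {
    assert!(field <= 15 && wire <= 5);
    ((field << 3) | wire) as u8
}

fn main() {
    assert_eq!(tag(4, 0), 0x20);  // schema, VARINT
    assert_eq!(tag(11, 2), 0x5a); // positive_spans, LEN
    assert_eq!(tag(12, 2), 0x62); // positive_deltas (packed), LEN
    println!("tags ok");
}
```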

Research question (a real paradox)

If protobuf varints are branch-heavy and histogram bucket streams are fundamentally serial, why do many production ingesters still show higher throughput in Go than in “zero-allocation Rust” implementations?

Is the answer:

  • microarchitecture (branch predictor + front-end),
  • compiler codegen quality (Go’s mature protobuf hot paths vs Rust’s inlining heuristics),
  • or a deeper truth: that on modern CPUs, the shape of the data (varints + ZigZag) matters more than the language?

If you can produce a benchmark where Rust wins without changing the wire format, I’m interested in what trick you used.