Prometheus Native Histograms on the Wire: BucketSpan Stitching, ZigZag Packed Deltas, and a Zero-Allocation Rust Decoder

Classic Prometheus histograms are “fake” at ingestion time: they arrive as many float samples (_bucket, _sum, _count). Native histograms are the opposite: a single structured sample carries count, sum, sparse buckets, plus optional exemplars.

That structure is great—until you try to:

  • stream-decode Remote-Write histograms at 2M samples/sec,
  • explain why “just a few more buckets” made your ingester CPU go vertical,
  • or verify another language’s implementation byte-for-byte.

This is a byte-level deep dive into the prompb.Histogram message used by Remote-Write:

  • how bucket indices are described via BucketSpan { offset, length },
  • how bucket counts are carried as packed sint64 deltas (ZigZag + varint),
  • how to reconstruct the logical bucket stream without allocating,
  • and why Go sometimes beats Rust even when Rust uses SIMD.

Specs & sources (manually verified)

The specific field numbers and wire rules below come from prompb/types.proto in the Prometheus repository and from the official protobuf encoding documentation.

If those sources ever change, your decoder can still work—but your field numbers can’t.


The data model we have to reconstruct

In prompb/types.proto, the core pieces (integer histogram case) look like:

  • repeated BucketSpan positive_spans = 11; (offset:sint32, length:uint32)
  • repeated sint64 positive_deltas = 12; (delta-of-counts)

Semantics:

  • A span describes where buckets exist, as runs of consecutive bucket indices.
  • The deltas describe what counts those buckets have.

Bucket indices are implicit: you don’t get an (index, count) list. You get a compressed “index tape” (spans) + a compressed “count tape” (deltas).

BucketSpan = sparse index tape

Each span is:

  • offset: gap to previous span, or starting index for the first span (can be negative)
  • length: number of consecutive buckets in the span

So the bucket indices produced by spans are:

start_0 = offset_0
indices_0 = start_0 .. start_0 + length_0 - 1
 
start_1 = (start_0 + length_0) + offset_1
indices_1 = start_1 .. start_1 + length_1 - 1
...

That “+ length then + offset” rule is the thing you must not get wrong.
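A minimal sketch of that rule in Rust (the `Span` struct here is an illustrative stand-in for prompb's BucketSpan, not the generated type):

```rust
// Stand-in for prompb's BucketSpan, for illustration only.
struct Span { offset: i32, length: u32 }

// Expand spans into absolute bucket indices.
// Invariant: `end` is one past the last index of the previous span,
// and starts at 0, so start_0 = 0 + offset_0 = offset_0.
fn expand_indices(spans: &[Span]) -> Vec<i32> {
    let mut out = Vec::new();
    let mut end = 0i32;
    for sp in spans {
        let start = end + sp.offset;
        out.extend(start..start + sp.length as i32);
        end = start + sp.length as i32;
    }
    out
}

fn main() {
    // spans [{offset:2, length:2}, {offset:3, length:1}]
    // → indices 2, 3; then next start = 4 + 3 = 7
    let spans = [Span { offset: 2, length: 2 }, Span { offset: 3, length: 1 }];
    assert_eq!(expand_indices(&spans), vec![2, 3, 7]);
    println!("{:?}", expand_indices(&spans));
}
```

Note that initializing `end` to 0 makes the first span fall out of the same rule—no special case needed.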

Deltas = delta-coded counts

For integer histograms, counts are sent as deltas:

  • delta[0] = count(bucket_0) - 0
  • delta[i] = count(bucket_i) - count(bucket_{i-1})

To recover absolute counts:

prev = 0
for d in deltas:
  prev += d
  count = prev

These deltas are sint64, which in protobuf means:

  • value → ZigZag transform to unsigned
  • then encode as base-128 varint

So negative deltas (yes, they happen when buckets are sparse and your distribution shifts) are cheap when their magnitude is small.


Protobuf at the bit level: tags, varints, ZigZag

Protobuf messages are a stream of key-value records:

  • key/tag: varint of (field_number << 3) | wire_type
  • then a payload whose shape depends on wire_type

For example, positive_deltas = 12 is a repeated numeric field. In proto3 it is typically encoded as a packed field:

  • tag byte for (12<<3)|2 = 0x62 (wire type 2 = LEN)
  • then a varint length L
  • then L bytes of concatenated varints (each varint is a ZigZag’d sint64)

ZigZag (sint64)

ZigZag maps signed integers to unsigned so small magnitudes stay small:

zz(x) = (x << 1) ^ (x >> 63)

Examples:

signed x | ZigZag zz(x) | varint bytes
       0 |            0 | 00
       1 |            2 | 02
      -1 |            1 | 01
       2 |            4 | 04
      -2 |            3 | 03

That table is worth memorizing if you ever debug wire dumps.
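A few lines of Rust to verify that table against the formula above:

```rust
// ZigZag for sint64: small magnitudes (positive or negative) stay small.
fn zigzag_encode(x: i64) -> u64 {
    ((x << 1) ^ (x >> 63)) as u64
}

fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ (-((n & 1) as i64))
}

fn main() {
    // Exactly the rows of the table above.
    for (x, zz) in [(0i64, 0u64), (1, 2), (-1, 1), (2, 4), (-2, 3)] {
        assert_eq!(zigzag_encode(x), zz);
        assert_eq!(zigzag_decode(zz), x);
    }
    println!("table verified");
}
```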


A concrete micro-example: decode by hand

Let’s build an intentionally tiny Histogram submessage with only:

  • schema = 0 (field 4, sint32)
  • positive_spans = [{ offset: 1, length: 3 }] (field 11, embedded message)
  • positive_deltas = [ +5, -2, +1 ] (field 12, packed sint64)

Step 1: schema = 0

Field 4, wire type VARINT (0): tag = (4<<3)|0 = 0x20.

Value 0 as sint32 ZigZag’s to 0, varint 00.

Bytes:

20 00

Step 2: positive_spans (field 11)

Field 11 is LEN (embedded message): tag = (11<<3)|2 = 0x5a.

The embedded BucketSpan is:

  • offset = 1 (field 1, sint32): tag 0x08, value ZigZag(1)=2 → 02
  • length = 3 (field 2, uint32): tag 0x10, value 03

So the span message bytes are:

08 02 10 03

Length is 4, so outer encoding:

5a 04 08 02 10 03

Step 3: positive_deltas packed sint64

Field 12 packed: tag = 0x62.

Encode each delta as sint64:

  • +5 → ZigZag = 10 → varint 0a
  • -2 → ZigZag = 3 → varint 03
  • +1 → ZigZag = 2 → varint 02

Packed payload = 0a 03 02, length = 3.

62 03 0a 03 02

Full message hex

Concatenate:

20 00
5a 04 08 02 10 03
62 03 0a 03 02

If you can round-trip this through your decoder, you’re already ahead of most “it parses in my language” implementations.
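To double-check the hand assembly, here is a short encoder sketch that builds the same three fields and prints the hex. The varint/ZigZag helpers are written inline; this is illustrative, not a general protobuf encoder:

```rust
// Base-128 varint encoding of an unsigned value.
fn put_uvarint(out: &mut Vec<u8>, mut v: u64) {
    loop {
        let b = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 { out.push(b); break; }
        out.push(b | 0x80);
    }
}

// ZigZag then varint, i.e. protobuf sint64/sint32.
fn put_svarint(out: &mut Vec<u8>, x: i64) {
    put_uvarint(out, ((x << 1) ^ (x >> 63)) as u64);
}

fn main() {
    let mut m = Vec::new();

    // schema = 0: field 4, wire type 0 → tag 0x20
    m.push(0x20);
    put_svarint(&mut m, 0);

    // positive_spans: field 11, wire type 2 → tag 0x5a
    let mut span = Vec::new();
    span.push(0x08); put_svarint(&mut span, 1); // offset = 1 (sint32)
    span.push(0x10); put_uvarint(&mut span, 3); // length = 3 (uint32)
    m.push(0x5a);
    put_uvarint(&mut m, span.len() as u64);
    m.extend_from_slice(&span);

    // positive_deltas = [+5, -2, +1]: field 12, packed → tag 0x62
    let mut packed = Vec::new();
    for d in [5i64, -2, 1] { put_svarint(&mut packed, d); }
    m.push(0x62);
    put_uvarint(&mut m, packed.len() as u64);
    m.extend_from_slice(&packed);

    let hex: Vec<String> = m.iter().map(|b| format!("{:02x}", b)).collect();
    println!("{}", hex.join(" "));
    assert_eq!(hex.join(" "), "20 00 5a 04 08 02 10 03 62 03 0a 03 02");
}
```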


Mermaid: how spans + deltas become buckets (index tape + count tape)

flowchart LR
  subgraph S["BucketSpans (index tape)"]
    A[offset=1,length=3]
  end

  subgraph D["Deltas (count tape)"]
    d1[+5]
    d2[-2]
    d3[+1]
  end

  A --> i1[idx=1]
  A --> i2[idx=2]
  A --> i3[idx=3]

  d1 --> c1[count=5]
  d2 --> c2[count=3]
  d3 --> c3[count=4]

  i1 --> b1[("bucket 1: 5")]
  i2 --> b2[("bucket 2: 3")]
  i3 --> b3[("bucket 3: 4")]

  c1 --> b1
  c2 --> b2
  c3 --> b3

This is not a generic flow: it’s the exact data dependency you implement.


Zero-allocation Rust decoder (protobuf without codegen)

Below is the “crunchy” part: decoding tags/varints, reading packed sint64, and iterating spans to yield (bucket_index, absolute_count).

Key goals:

  • no heap allocations while decoding a message
  • streaming-compatible: you can feed it slices from an HTTP body
  • explicit wire-type validation so corrupted inputs fail fast

Varint + ZigZag

#[inline]
fn read_uvarint(mut p: &[u8]) -> Option<(u64, &[u8])> {
    let mut x: u64 = 0;
    let mut s: u32 = 0;
    // protobuf varint is max 10 bytes for u64
    for _ in 0..10 {
        let b = *p.get(0)?;
        p = &p[1..];
        x |= ((b & 0x7f) as u64) << s;
        if (b & 0x80) == 0 {
            return Some((x, p));
        }
        s += 7;
    }
    None
}
 
#[inline]
fn zigzag_decode_u64(n: u64) -> i64 {
    // inverse of (x<<1) ^ (x>>63)
    ((n >> 1) as i64) ^ (-((n & 1) as i64))
}
 
#[inline]
fn read_svarint(p: &[u8]) -> Option<(i64, &[u8])> {
    let (u, rest) = read_uvarint(p)?;
    Some((zigzag_decode_u64(u), rest))
}
 
#[inline]
fn read_len_delim(p: &[u8]) -> Option<(&[u8], &[u8])> {
    let (len, mut rest) = read_uvarint(p)?;
    let len = len as usize;
    if rest.len() < len { return None; }
    let msg = &rest[..len];
    rest = &rest[len..];
    Some((msg, rest))
}
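Quick sanity check: running these helpers over the packed-deltas payload from the worked example (0a 03 02). The helpers are restated in compressed form so the snippet compiles on its own:

```rust
// Same varint/ZigZag logic as above, compacted for a standalone check.
fn read_uvarint(mut p: &[u8]) -> Option<(u64, &[u8])> {
    let (mut x, mut s) = (0u64, 0u32);
    for _ in 0..10 {
        let b = *p.first()?;
        p = &p[1..];
        x |= ((b & 0x7f) as u64) << s;
        if b & 0x80 == 0 { return Some((x, p)); }
        s += 7;
    }
    None
}

fn read_svarint(p: &[u8]) -> Option<(i64, &[u8])> {
    let (u, rest) = read_uvarint(p)?;
    Some((((u >> 1) as i64) ^ (-((u & 1) as i64)), rest))
}

fn main() {
    // Packed payload from the micro-example: [+5, -2, +1].
    let mut p: &[u8] = &[0x0a, 0x03, 0x02];
    let mut out = Vec::new();
    while !p.is_empty() {
        let (v, rest) = read_svarint(p).expect("valid varint");
        out.push(v);
        p = rest;
    }
    assert_eq!(out, vec![5, -2, 1]);
    println!("{:?}", out);
}
```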

Decode BucketSpan { offset, length }

#[derive(Clone, Copy, Debug)]
struct Span { offset: i32, length: u32 }
 
fn decode_span(mut p: &[u8]) -> Option<(Span, &[u8])> {
    let mut offset: i32 = 0;
    let mut length: u32 = 0;
 
    while !p.is_empty() {
        let (tag, rest) = read_uvarint(p)?;
        p = rest;
        let field = (tag >> 3) as u32;
        let wire  = (tag & 0x07) as u32;
 
        match (field, wire) {
            (1, 0) => { // sint32 offset
                let (v, rest) = read_svarint(p)?;
                p = rest;
                offset = v as i32;
            }
            (2, 0) => { // uint32 length
                let (v, rest) = read_uvarint(p)?;
                p = rest;
                length = v as u32;
            }
            // skip unknown fields (proto forward-compat), any wire type
            (_, 0) => { let (_v, rest) = read_uvarint(p)?; p = rest; }
            (_, 2) => { let (_b, rest) = read_len_delim(p)?; p = rest; }
            (_, 1) => { if p.len() < 8 { return None; } p = &p[8..]; } // fixed64
            (_, 5) => { if p.len() < 4 { return None; } p = &p[4..]; } // fixed32
            _ => return None, // invalid wire type
        }
    }
 
    Some((Span { offset, length }, p))
}
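Feeding the span bytes from the worked example (08 02 10 03) into such a decoder should yield offset=1, length=3. Here's a compact, self-contained check version (unknown-field skipping omitted for brevity; this deliberately rejects anything but fields 1 and 2):

```rust
#[derive(Debug, PartialEq)]
struct Span { offset: i32, length: u32 }

fn read_uvarint(mut p: &[u8]) -> Option<(u64, &[u8])> {
    let (mut x, mut s) = (0u64, 0u32);
    for _ in 0..10 {
        let b = *p.first()?;
        p = &p[1..];
        x |= ((b & 0x7f) as u64) << s;
        if b & 0x80 == 0 { return Some((x, p)); }
        s += 7;
    }
    None
}

// Minimal BucketSpan decoder: field 1 (sint32 offset), field 2 (uint32 length).
fn decode_span(mut p: &[u8]) -> Option<Span> {
    let (mut offset, mut length) = (0i32, 0u32);
    while !p.is_empty() {
        let (tag, rest) = read_uvarint(p)?;
        p = rest;
        match (tag >> 3, tag & 7) {
            (1, 0) => {
                let (u, rest) = read_uvarint(p)?;
                p = rest;
                // ZigZag decode, then narrow to i32.
                offset = (((u >> 1) as i64) ^ (-((u & 1) as i64))) as i32;
            }
            (2, 0) => {
                let (u, rest) = read_uvarint(p)?;
                p = rest;
                length = u as u32;
            }
            _ => return None, // check-only version: fail on anything else
        }
    }
    Some(Span { offset, length })
}

fn main() {
    let span = decode_span(&[0x08, 0x02, 0x10, 0x03]).unwrap();
    assert_eq!(span, Span { offset: 1, length: 3 });
    println!("{:?}", span);
}
```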

Stream buckets without allocating

We decode the histogram fields we care about (positive spans + positive deltas), then iterate them in lockstep.

Important: positive_deltas is usually packed (wire type LEN). You must accept both packed and unpacked in practice.

struct PackedSvarints<'a> { p: &'a [u8] }
 
impl<'a> Iterator for PackedSvarints<'a> {
    type Item = i64;
    fn next(&mut self) -> Option<i64> {
        if self.p.is_empty() { return None; }
        let (v, rest) = read_svarint(self.p)?;
        self.p = rest;
        Some(v)
    }
}
 
fn iter_buckets<'a>(
    spans: &'a [Span],
    mut deltas: PackedSvarints<'a>,
) -> impl Iterator<Item = (i32, i64)> + 'a {
    let mut span_i = 0usize;
    let mut in_span = 0u32;
    let mut idx: i32 = 0;
    let mut prev_count: i64 = 0;
 
    std::iter::from_fn(move || {
        loop {
            if span_i >= spans.len() { return None; }
            let sp = spans[span_i];
 
            if in_span == 0 {
                // first bucket of this span: start = previous end + offset.
                // idx already holds the previous span's exclusive end
                // (and starts at 0, which covers the first span too).
                idx += sp.offset;
            }
 
            if in_span >= sp.length {
                span_i += 1;
                in_span = 0;
                continue;
            }
 
            let d = deltas.next()?;
            prev_count = prev_count.wrapping_add(d);
            let out = (idx, prev_count);
 
            idx += 1;
            in_span += 1;
            return Some(out);
        }
    })
}

That idx update is where bugs hide. The invariant to keep: each span's start is the previous span's exclusive end plus the offset, and the "previous end" before the first span is 0.

Here is the same index rule factored into a standalone iterator, which is easier to test in isolation:

fn span_indices(spans: &[Span]) -> impl Iterator<Item = i32> + '_ {
    let mut cur_start: i32 = 0;
    let mut cur_end_excl: i32 = 0;
    let mut first = true;
 
    spans.iter().flat_map(move |sp| {
        if first {
            cur_start = sp.offset;
            first = false;
        } else {
            // next start is previous end + offset
            cur_start = cur_end_excl + sp.offset;
        }
        cur_end_excl = cur_start + sp.length as i32;
        (cur_start..cur_end_excl)
    })
}
 
fn buckets_from_spans_and_deltas<'a>(
    spans: &'a [Span],
    deltas: PackedSvarints<'a>,
) -> impl Iterator<Item = (i32, i64)> + 'a {
    let mut prev: i64 = 0;
    span_indices(spans).zip(deltas).map(move |(i, d)| {
        prev += d;
        (i, prev)
    })
}

Yes, it allocates nothing. No, it does not forgive mismatched lengths: if the deltas run out early, iteration simply stops; if there are extra deltas left over, treat the message as corrupt.
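Putting the pieces together on the micro-example—indices from the span, prefix sum over the deltas. The deltas are given as plain i64s here since the varint layer was exercised above; `span_indices` is a Vec-returning variant for testability:

```rust
struct Span { offset: i32, length: u32 }

// Vec-returning variant of the span index expansion, for easy checking.
fn span_indices(spans: &[Span]) -> Vec<i32> {
    let mut out = Vec::new();
    let mut end = 0i32;
    for sp in spans {
        let start = end + sp.offset; // next start = previous end + offset
        out.extend(start..start + sp.length as i32);
        end = start + sp.length as i32;
    }
    out
}

fn main() {
    // The worked example: one span {offset:1, length:3}, deltas [+5, -2, +1].
    let spans = [Span { offset: 1, length: 3 }];
    let deltas = [5i64, -2, 1];

    let mut prev = 0i64;
    let buckets: Vec<(i32, i64)> = span_indices(&spans)
        .into_iter()
        .zip(deltas.iter().map(|&d| { prev += d; prev })) // prefix sum
        .collect();

    assert_eq!(buckets, vec![(1, 5), (2, 3), (3, 4)]);
    println!("{:?}", buckets);
}
```

This matches the Mermaid diagram earlier: indices 1, 2, 3 carrying counts 5, 3, 4.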


Trade-offs (Prometheus native hist vs OTLP exponential hist vs custom encodings)

Native histograms are an engineering compromise, not a free lunch.

Prometheus native histograms

Pros:

  • sparse representation: empty buckets don’t cost per-sample
  • mergeable exponential schema and dynamic resolution
  • integrates directly into PromQL & TSDB

Cons:

  • decoding cost shifts to ingestion/query: you must stitch spans + apply deltas
  • packed varints are branchy; CPU is dominated by varint loops on hot paths
  • protobuf schema evolution is constrained: field numbers are forever

OTLP Exponential Histogram (OpenTelemetry)

OTLP also uses exponential bucketing, but the wire choices differ (notably around aggregation temporality and field layout). If you’re converting OTLP → Prometheus, you may end up re-bucketing or mapping indices.

Custom TSM / Parquet / columnar

  • Parquet (columnar) can compress runs of repeated values extremely well, but typically wants you to batch a lot (latency trade-off).
  • A custom TSM-like format can encode bucket indices and counts with domain-specific bit packing (e.g. 16-bit deltas) and beat protobuf—at the cost of interoperability.

The real comparison is not “format A vs format B”; it’s “how much CPU per byte saved” on your ingest and query nodes.

Go vs Rust: the uncomfortable benchmark reality

Even with Rust doing careful bounds checks and avoiding allocations, Go can win in end-to-end Remote-Write ingestion benchmarks because:

  • Go’s runtime + compiler are aggressively tuned for protobuf-style varint decoding patterns.
  • Your Rust implementation may lose to branch mispredicts in read_uvarint, not to memory bandwidth.
  • If you use prost codegen, you often pay for intermediate allocations and Vec growth unless you redesign the API.

If you want Rust to win, you typically need either:

  • hand-rolled parsing like above,
  • or a SIMD-assisted varint decoder (hard; varints are the enemy of SIMD),
  • or to change the upstream data representation (which you can’t for Prometheus Remote-Write).

Practical debugging tips

  1. Assume packed fields. Proto3 repeated numeric fields are packed by default.
  2. Verify tag bytes. For field 12 packed: tag must be 0x62.
  3. Treat extra bytes as corruption. If a packed field declares length L, you must consume exactly L bytes.
  4. Watch for negative deltas. They’re legal; don’t cast to u64 by accident.
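Tip 2 as code: tags for low-numbered fields fit in a single varint byte, so you can precompute the constants your decoder matches on (sketch; only valid while the field number stays ≤ 15):

```rust
// Single-byte protobuf tag: (field_number << 3) | wire_type.
// Fits in one varint byte only when the result is <= 0x7f, i.e. field <= 15.
fn tag(field: u32, wire: u32) -> u8 {
    assert!(field <= 15 && wire <= 5);
    ((field << 3) | wire) as u8
}

fn main() {
    assert_eq!(tag(4, 0), 0x20);  // schema, VARINT
    assert_eq!(tag(11, 2), 0x5a); // positive_spans, LEN
    assert_eq!(tag(12, 2), 0x62); // positive_deltas (packed), LEN
    println!("tags ok");
}
```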

Research question (a real paradox)

If protobuf varints are branch-heavy and histogram bucket streams are fundamentally serial, why do many production ingesters still show higher throughput in Go than in “zero-allocation Rust” implementations?

Is the answer:

  • microarchitecture (branch predictor + front-end),
  • compiler codegen quality (Go’s mature protobuf hot paths vs Rust’s inlining heuristics),
  • or a deeper truth: that on modern CPUs, the shape of the data (varints + ZigZag) matters more than the language?

If you can produce a benchmark where Rust wins without changing the wire format, I’m interested in what trick you used.