OTLP/gRPC on the Wire: HTTP/2 Frame Headers, 5-Byte gRPC Records, and Zero-Copy Parsing in Rust

When an OpenTelemetry Collector “CPU spikes on ingest”, the easy explanation is “protobuf is slow” or “TLS is expensive”.

The real answer is usually framing:

  • HTTP/2 frames are chopped to satisfy flow control and max-frame-size.
  • gRPC messages are re-framed inside DATA frames as 5-byte records.
  • The protobuf payload is another TLV stream (varints + length-delimited submessages).

If you don’t understand the exact bytes, you can’t:

  • explain why packet captures show “fragmentation” even on a LAN,
  • implement a fast receiver without accidental copies,
  • or reason about how concurrency interacts with flow control.

This deep dive is about the actual on-the-wire layout for OTLP over gRPC:

  1. HTTP/2 frame header (9 bytes)
  2. gRPC length-prefixed message record (5 bytes)
  3. protobuf encoding costs (varints, embedded messages)

…and how to parse all of it with predictable branches and zero allocations.

Specs we rely on (manually verified)

(For the HTTP/2 and gRPC framing facts, I extracted the exact definitions from the upstream text: RFC 9113 §4.1 and PROTOCOL-HTTP2.md “Length-Prefixed-Message”.)


Layer 0 (not optional): TLS record boundaries are irrelevant

Even before HTTP/2, remember that TLS records are not message boundaries. You can’t assume “one TLS record = one HTTP/2 frame” or “one TCP segment = one gRPC message”.

Operational implication: a production receiver must treat the byte stream as an infinite tape and build framing explicitly.


Layer 1: HTTP/2 framing (9-byte header, then payload)

RFC 9113 §4.1 defines the HTTP/2 frame header as a fixed 9-octet header:

  • Length (24-bit, payload length only)
  • Type (8-bit)
  • Flags (8-bit)
  • Reserved (1-bit)
  • Stream Identifier (31-bit)

Bit/field layout (what you actually parse)

Mermaid view (field sizes):

flowchart LR
  subgraph H2["HTTP/2 Frame Header (9 bytes)"]
    L["Length: 24 bits\n(payload bytes)"] --> T["Type: 8 bits"] --> Fl["Flags: 8 bits"] --> R["R: 1 bit"] --> SID["Stream Identifier: 31 bits"]
  end

Concrete byte layout (network byte order):

Bytes   Bits     Field
0..2    24       Length (unsigned)
3       8        Type
4       8        Flags
5..8    1 + 31   R bit + Stream Identifier
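To pin the layout down, here is a minimal encoder for that header (a sketch; `encode_h2_header` is an illustrative helper, not a library API):

```rust
/// Encode a 9-byte HTTP/2 frame header (sketch; values are examples).
fn encode_h2_header(len: u32, ty: u8, flags: u8, stream_id: u32) -> [u8; 9] {
    let mut hdr = [0u8; 9];
    // Length is the low 24 bits, big-endian.
    hdr[0..3].copy_from_slice(&len.to_be_bytes()[1..4]);
    hdr[3] = ty;
    hdr[4] = flags;
    // Stream Identifier is 31 bits; the top (reserved) bit stays clear.
    hdr[5..9].copy_from_slice(&(stream_id & 0x7fff_ffff).to_be_bytes());
    hdr
}

fn main() {
    // DATA frame (type 0x0) with END_STREAM (0x1), 16-byte payload, stream 5.
    let hdr = encode_h2_header(16, 0x0, 0x1, 5);
    assert_eq!(hdr, [0, 0, 16, 0, 1, 0, 0, 0, 5]);
}
```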

Rust: parse the 9-byte header with minimal branching

The performance trick is not “use a faster HTTP/2 library”. It’s knowing what the library is doing so you can decide where the CPU goes.

Here’s a tight parser for the header using shifts and a single unaligned load for the last 4 bytes:

#[derive(Debug, Clone, Copy)]
pub struct H2FrameHeader {
    pub len: u32,      // 24-bit
    pub ty: u8,
    pub flags: u8,
    pub stream_id: u32 // 31-bit
}
 
#[inline(always)]
pub fn parse_h2_frame_header(buf: &[u8]) -> Option<(H2FrameHeader, &[u8])> {
    if buf.len() < 9 { return None; }
 
    // Length is 24-bit big-endian.
    let len = ((buf[0] as u32) << 16) | ((buf[1] as u32) << 8) | (buf[2] as u32);
    let ty = buf[3];
    let flags = buf[4];
 
    // StreamID is the low 31 bits of a big-endian u32.
    // Load bytes[5..9] as BE u32, mask off the reserved bit.
    let sid_be = u32::from_be_bytes([buf[5], buf[6], buf[7], buf[8]]);
    let stream_id = sid_be & 0x7fff_ffff;
 
    Some((H2FrameHeader { len, ty, flags, stream_id }, &buf[9..]))
}

The “gotcha”: DATA frame boundaries don’t align with gRPC message boundaries

HTTP/2 DATA frames are sized by flow control and SETTINGS, not by your OTLP Export request size.

So framing at the two layers is independent:

  • a single OTLP export (one protobuf message) can be split across many DATA frames, and
  • multiple gRPC messages can share a single DATA frame.

That’s why Wireshark “reassembly” feels magical: it’s doing the same tape-parsing you have to implement.


Layer 2: gRPC message framing (5-byte record header)

gRPC defines its own record format inside the byte stream carried by HTTP/2 DATA frames:

  • Compressed-Flag: 1 byte, value 0 or 1
  • Message-Length: 4 bytes, big-endian
  • Message: Message-Length bytes

From grpc/grpc’s protocol document:

  • Length-Prefixed-Message → Compressed-Flag Message-Length Message
  • Compressed-Flag → 0 / 1 ; encoded as 1 byte unsigned integer
  • Message-Length → ... ; encoded as 4 byte unsigned integer (big endian)

(See: https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md)

Bit/byte layout

flowchart LR
  subgraph GRPC["gRPC Length-Prefixed Message"]
    CF["Compressed flag\n1 byte"] --> ML["Message length\n4 bytes BE"] --> M["Message bytes\nN bytes"]
  end

Example: an uncompressed message of length 0x0000_0010 (16 bytes):

Offset  Meaning           Byte
0       compressed flag   00
1       len[31:24]        00
2       len[23:16]        00
3       len[15:8]         00
4       len[7:0]          10
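The table above can be reproduced in a few lines (the `grpc_prefix` helper is hypothetical, just to pin down the byte order):

```rust
/// Build the 5-byte gRPC record prefix: compressed flag, then BE length.
fn grpc_prefix(compressed: bool, msg_len: u32) -> [u8; 5] {
    let mut p = [0u8; 5];
    p[0] = compressed as u8;
    p[1..5].copy_from_slice(&msg_len.to_be_bytes());
    p
}

fn main() {
    // Uncompressed, length 0x10 = 16: matches the table above.
    assert_eq!(grpc_prefix(false, 16), [0x00, 0x00, 0x00, 0x00, 0x10]);
}
```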

Rust: parse the 5-byte prefix + payload as a streaming state machine

A fast receiver doesn’t “read message then parse”; it incrementally consumes bytes as they arrive.

Below is a minimal streaming decoder. It never allocates; it only yields borrowed slices when the full message is present.

pub struct GrpcFramer {
    // Partially filled 5-byte record header.
    hdr: [u8; 5],
    have: usize,
    // (compressed, payload length), set once the header is complete.
    // Stored as a pair: packing the compressed bit into the length's high
    // bit would collide with lengths >= 2^31.
    want_msg: Option<(bool, usize)>,
}

impl GrpcFramer {
    pub fn new() -> Self {
        Self { hdr: [0; 5], have: 0, want_msg: None }
    }

    /// Feed more bytes. Returns (consumed, maybe_one_complete_message).
    /// A production implementation would loop and yield multiple messages.
    pub fn feed<'a>(&mut self, mut buf: &'a [u8]) -> (usize, Option<(bool, &'a [u8])>) {
        let start_len = buf.len();

        // Step 1: fill the 5-byte header.
        while self.have < 5 && !buf.is_empty() {
            self.hdr[self.have] = buf[0];
            self.have += 1;
            buf = &buf[1..];
        }

        if self.have < 5 {
            return (start_len - buf.len(), None);
        }

        // Step 2: decode compressed flag and payload length, once.
        if self.want_msg.is_none() {
            let compressed = self.hdr[0] != 0;
            let len = u32::from_be_bytes([self.hdr[1], self.hdr[2], self.hdr[3], self.hdr[4]]) as usize;
            self.want_msg = Some((compressed, len));
        }
        let (compressed, len) = self.want_msg.unwrap();

        if buf.len() < len {
            return (start_len - buf.len(), None);
        }

        let msg = &buf[..len];

        // Reset for the next message.
        self.have = 0;
        self.want_msg = None;

        (start_len - buf.len() + len, Some((compressed, msg)))
    }
}

This is intentionally “low-level ugly” because it mirrors the wire.

Micro-optimization note: replace the per-byte copy with one unaligned load

In real ingestion code you typically have a contiguous ring buffer; you can take advantage of that:

  • memcpy the 5 bytes once (ptr::copy_nonoverlapping)
  • parse u32 length with read_unaligned + from_be

The branch predictor wins because the hot path is: “already have header → already have bytes → parse → yield”.
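A sketch of that trick, assuming the 5 header bytes are already contiguous (`parse_prefix` is an illustrative name; the caller must guarantee at least 5 readable bytes):

```rust
use core::ptr;

/// Parse the 5-byte gRPC prefix with one unaligned 4-byte load instead of
/// per-byte assembly. Sketch: assumes `buf.len() >= 5`.
fn parse_prefix(buf: &[u8]) -> (bool, usize) {
    assert!(buf.len() >= 5);
    let compressed = buf[0] != 0;
    // SAFETY: the assert above guarantees 5 readable bytes.
    let len_be = unsafe { ptr::read_unaligned(buf.as_ptr().add(1) as *const u32) };
    (compressed, u32::from_be(len_be) as usize)
}

fn main() {
    // Compressed flag set, BE length 0x00000100 = 256.
    assert_eq!(parse_prefix(&[0x01, 0x00, 0x00, 0x01, 0x00]), (true, 256));
}
```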


Layer 3: Protobuf inside the gRPC message (varints and length-delimited fields)

Once you’ve extracted Message, the protobuf wire format starts.

Two details matter for observability payloads:

  1. Tags are varints with (field_number << 3) | wire_type.
  2. Most OTLP payload is length-delimited (wire type 2): strings, bytes, embedded messages, packed repeated fields.

That means you see lots of:

  • varint tag
  • varint length
  • len bytes

…and that is fundamentally “scan, branch, bounds check”.

Rust: a “crunchy” varint decoder (pointer-based, branch-minimized)

This is the core inner loop you’ll end up writing if you ever implement a custom fast-path (e.g., to count spans without fully decoding):

use core::ptr;
 
/// Decodes a base-128 protobuf varint (at most 10 bytes for a u64).
///
/// # Safety
/// `p` and `end` must point into (or one past the end of) the same
/// allocated buffer, with `p <= end`.
#[inline(always)]
pub unsafe fn read_u64_varint(mut p: *const u8, end: *const u8) -> Option<(u64, *const u8)> {
    let mut x: u64 = 0;
    let mut shift = 0;
 
    while p < end && shift < 70 {
        let b = ptr::read(p);
        p = p.add(1);
 
        x |= ((b & 0x7f) as u64) << shift;
        if (b & 0x80) == 0 {
            return Some((x, p));
        }
        shift += 7;
    }
    None
}
 
#[inline(always)]
pub fn split_tag(tag: u64) -> (u32, u8) {
    let wire = (tag & 0x7) as u8;
    let field = (tag >> 3) as u32;
    (field, wire)
}
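As a sketch of the “avoid decoding what you don’t need” fast path, here is a safe, slice-based counter for top-level length-delimited fields (`count_field` and `read_varint` are illustrative helpers; the field number used below is an example, not a claim about the OTLP schema):

```rust
/// Safe, slice-based varint reader: returns (value, bytes consumed).
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut x = 0u64;
    for (i, &b) in buf.iter().take(10).enumerate() {
        x |= ((b & 0x7f) as u64) << (7 * i);
        if b & 0x80 == 0 {
            return Some((x, i + 1));
        }
    }
    None
}

/// Count top-level length-delimited occurrences of `want_field` without
/// decoding their contents; skips other wire types by size.
fn count_field(mut buf: &[u8], want_field: u32) -> Option<usize> {
    let mut count = 0;
    while !buf.is_empty() {
        let (tag, n) = read_varint(buf)?;
        buf = &buf[n..];
        let (field, wire) = ((tag >> 3) as u32, (tag & 0x7) as u8);
        match wire {
            0 => { let (_, n) = read_varint(buf)?; buf = &buf[n..]; } // varint
            1 => { buf = buf.get(8..)?; }                            // 64-bit
            2 => {                                                   // len-delimited
                let (len, n) = read_varint(buf)?;
                buf = buf.get(n + len as usize..)?;
                if field == want_field { count += 1; }
            }
            5 => { buf = buf.get(4..)?; }                            // 32-bit
            _ => return None,                                        // groups etc.
        }
    }
    Some(count)
}

fn main() {
    // Two length-delimited field-1 entries ("hi", "yo") and a varint field 2.
    let msg = [0x0a, 2, b'h', b'i', 0x10, 7, 0x0a, 2, b'y', b'o'];
    assert_eq!(count_field(&msg, 1), Some(2));
}
```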

If you want to go further, you can SIMD-scan for “bytes with MSB clear” to accelerate varint termination detection (common in dense streams), but in practice the biggest win is often: avoid decoding what you don’t need.


Why OTLP/gRPC hurts (and when it doesn’t)

Cost centers you can’t see in “protobuf parsing time”

  1. HTTP/2 field compression (HPACK) for headers/trailers
  2. Flow control: WINDOW_UPDATE bookkeeping and potential sender stalls
  3. DATA frame fragmentation: the receiver has to stitch bytes across frames
  4. gRPC decompression (gzip) per-message if enabled
  5. Protobuf: varints + nested length-delimited fields

Many profiles blame “protobuf” because that’s where app-level CPU shows up, but the actual throughput limiter is frequently: “copying and reassembling buffers in a way that defeats cache locality”.

Trade-offs: OTLP/gRPC vs OTLP/HTTP vs Prometheus Remote-Write

  • OTLP/gRPC

    • Pros: multiplexing, backpressure, rich status/trailers, ubiquitous client libs
    • Cons: HTTP/2 stack complexity; reassembly overhead; head-of-line blocking at the TCP layer; often ends up CPU-bound in user space
  • OTLP/HTTP (protobuf over HTTP/1.1 or HTTP/2)

    • Pros: simpler stack; easier debugging; proxies behave more predictably
    • Cons: fewer semantics than gRPC; still faces protobuf costs; batching often less disciplined
  • Prometheus Remote-Write

    • Pros: single POST payload; no per-message 5-byte header; “one big blob” amortizes framing
    • Cons: Snappy decode + protobuf; label amplification can dominate

Language/runtime trade-off (Go vs Rust)

  • Go’s gRPC + HTTP/2 stacks have been hammered in production for a decade; allocations are well-understood, and the runtime scheduler sometimes hides latency spikes.
  • Rust can be faster in tight loops (varints, masking, pointer math), but the ecosystem can lose to Go due to:
    • less mature HPACK implementations,
    • more copying in safe abstractions,
    • and mismatch between async runtimes and kernel I/O patterns.

Rust wins when you own the end-to-end pipeline and can keep telemetry bytes in L1/L2 as long as possible.

Go wins when you use the standard stack and benefit from years of “boring optimizations”.


Practical debugging: “what bytes did I actually receive?”

If you capture traffic and want to sanity-check your mental model:

  1. Identify the HTTP/2 stream carrying the gRPC call (HEADERS with :path like /opentelemetry.proto.collector.trace.v1.TraceService/Export).
  2. For DATA frames on that stream:
    • parse the 9-byte frame header
    • append payload bytes into a per-stream buffer
  3. Parse that buffer as a sequence of gRPC 5-byte records.
  4. For each record, protobuf-decode (or partially decode) the OTLP ExportRequest.

Once you can do steps 1–3, half of “collector CPU mystery” incidents become boring.
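Steps 1–3 can be sketched end-to-end on a synthetic capture (assuming the buffer starts at a frame boundary, i.e. after the connection preface and SETTINGS exchange; `stitch_data` is an illustrative helper):

```rust
/// Walk HTTP/2 frames in a capture buffer and concatenate DATA payloads for
/// one stream. Sketch: assumes complete frames starting at a frame boundary.
fn stitch_data(mut rest: &[u8], stream: u32) -> Vec<u8> {
    let mut tape = Vec::new();
    while rest.len() >= 9 {
        let len = ((rest[0] as usize) << 16) | ((rest[1] as usize) << 8) | rest[2] as usize;
        let ty = rest[3];
        let sid = u32::from_be_bytes([rest[5], rest[6], rest[7], rest[8]]) & 0x7fff_ffff;
        if ty == 0x0 && sid == stream {
            tape.extend_from_slice(&rest[9..9 + len]);
        }
        rest = &rest[9 + len..];
    }
    tape
}

fn main() {
    // One gRPC record (flag 0, BE length 3, "abc") split across two DATA
    // frames on stream 1.
    let mut wire = Vec::new();
    for chunk in [&[0u8, 0, 0, 0][..], &[3, b'a', b'b', b'c'][..]] {
        wire.extend_from_slice(&(chunk.len() as u32).to_be_bytes()[1..4]); // 24-bit length
        wire.extend_from_slice(&[0x0, 0x0]); // type DATA, no flags
        wire.extend_from_slice(&1u32.to_be_bytes()); // stream 1
        wire.extend_from_slice(chunk);
    }

    // The 5-byte record header parses only on the stitched tape, not per frame.
    let tape = stitch_data(&wire, 1);
    let msg_len = u32::from_be_bytes([tape[1], tape[2], tape[3], tape[4]]) as usize;
    assert_eq!(tape[0], 0); // uncompressed
    assert_eq!(&tape[5..5 + msg_len], b"abc");
}
```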


Provocative conclusion: the framing paradox

If the bottleneck is “too many small messages”, the obvious fix is “batch more”.

But batching more increases:

  • maximum message size → more buffering and larger working sets,
  • tail latency → bigger exports wait longer,
  • and in some stacks, worse allocator behavior.

Research Question:

In high-cardinality workloads, why do some Go OTLP receivers outperform SIMD-heavy Rust decoders?

Is it because Go is “faster”, or because the Go stack accidentally chooses better batch/flow-control points that keep cache and allocator behavior stable?