OTLP/gRPC on the Wire: HTTP/2 Frame Headers, 5-Byte gRPC Records, and Zero-Copy Parsing in Rust
When an OpenTelemetry Collector “CPU spikes on ingest”, the easy explanation is “protobuf is slow” or “TLS is expensive”.
The real answer is usually framing:
- HTTP/2 frames are chopped to satisfy flow control and max-frame-size.
- gRPC messages are re-framed inside DATA frames as 5-byte records.
- The protobuf payload is another TLV stream (varints + length-delimited submessages).
If you don’t understand the exact bytes, you can’t:
- explain why packet captures show “fragmentation” even on a LAN,
- implement a fast receiver without accidental copies,
- or reason about how concurrency interacts with flow control.
This deep dive is about the actual on-the-wire layout for OTLP over gRPC:
- HTTP/2 frame header (9 bytes)
- gRPC length-prefixed message record (5 bytes)
- protobuf encoding costs (varints, embedded messages)
…and how to parse all of it with predictable branches and zero allocations.
Specs we rely on (manually verified)
- OTLP spec: OTLP is implemented over gRPC and HTTP transports; payload schema is Protocol Buffers.
- gRPC over HTTP/2 protocol: DATA frames contain a repeated sequence of Length-Prefixed-Message items, defined as 1 byte compression flag + 4 byte big-endian length + message bytes.
- HTTP/2 frame layout: fixed 9-octet header with Length(24), Type(8), Flags(8), R(1), StreamID(31).
- Protobuf encoding (varints, tags = (field_number << 3) | wire_type).
(For the HTTP/2 and gRPC framing facts, I extracted the exact definitions from the upstream text: RFC 9113 §4.1 and PROTOCOL-HTTP2.md “Length-Prefixed-Message”.)
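As a quick check of the tag formula: field number 1 with wire type 2 (length-delimited) encodes to the single byte 0x0A, which is why so many protobuf payloads start with it. A minimal sketch:

```rust
fn main() {
    // tag = (field_number << 3) | wire_type, itself encoded as a varint
    let tag: u8 = (1 << 3) | 2; // field 1, wire type 2 (length-delimited)
    assert_eq!(tag, 0x0a);
    // Recover field number and wire type from the tag byte.
    assert_eq!((tag >> 3, tag & 0x7), (1, 2));
}
```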
Layer 0 (not optional): TLS record boundaries are irrelevant
Even before HTTP/2, remember that TLS records are not message boundaries. You can’t assume “one TLS record = one HTTP/2 frame” or “one TCP segment = one gRPC message”.
Operational implication: a production receiver must treat the byte stream as an infinite tape and build framing explicitly.
Layer 1: HTTP/2 framing (9-byte header, then payload)
RFC 9113 §4.1 defines the HTTP/2 frame header as a fixed 9-octet header:
Length (24 bits, payload length only) · Type (8 bits) · Flags (8 bits) · Reserved (1 bit) · Stream Identifier (31 bits)
Bit/field layout (what you actually parse)
Mermaid view (field sizes):
```mermaid
flowchart LR
  subgraph H2["HTTP/2 Frame Header (9 bytes)"]
    L["Length: 24 bits\n(payload bytes)"] --> T["Type: 8"] --> Fl["Flags: 8"] --> R["R: 1"] --> SID["Stream Identifier: 31"]
  end
```
Concrete byte layout (network byte order):
| Bytes | Bits | Field |
|---|---|---|
| 0..2 | 24 | Length (unsigned) |
| 3 | 8 | Type |
| 4 | 8 | Flags |
| 5..8 | 1 + 31 | R bit + StreamID |
Rust: parse the 9-byte header with minimal branching
The performance trick is not “use a faster HTTP/2 library”. It’s knowing what the library is doing so you can decide where the CPU goes.
Here’s a tight parser for the header using shifts and a single unaligned load for the last 4 bytes:
```rust
#[derive(Debug, Clone, Copy)]
pub struct H2FrameHeader {
    pub len: u32, // 24-bit
    pub ty: u8,
    pub flags: u8,
    pub stream_id: u32, // 31-bit
}

#[inline(always)]
pub fn parse_h2_frame_header(buf: &[u8]) -> Option<(H2FrameHeader, &[u8])> {
    if buf.len() < 9 {
        return None;
    }
    // Length is 24-bit big-endian.
    let len = ((buf[0] as u32) << 16) | ((buf[1] as u32) << 8) | (buf[2] as u32);
    let ty = buf[3];
    let flags = buf[4];
    // StreamID is the low 31 bits of a big-endian u32.
    // Load bytes[5..9] as BE u32, mask off the reserved bit.
    let sid_be = u32::from_be_bytes([buf[5], buf[6], buf[7], buf[8]]);
    let stream_id = sid_be & 0x7fff_ffff;
    Some((H2FrameHeader { len, ty, flags, stream_id }, &buf[9..]))
}
```

The “gotcha”: DATA frame boundaries don’t align with gRPC message boundaries
HTTP/2 DATA frames are sized by flow control and SETTINGS, not by your OTLP Export request size.
So a single OTLP export (a single protobuf message) can be:
- split across many DATA frames, or
- multiple gRPC messages can share a single DATA frame.
That’s why Wireshark “reassembly” feels magical: it’s doing the same tape-parsing you have to implement.
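To make the mismatch concrete, here is a sketch (illustrative payloads, not real OTLP bytes) that builds two gRPC records back-to-back, exactly as they could land inside a single DATA frame payload, or be split at any byte by flow control:

```rust
// Wrap a message in the gRPC 5-byte prefix: 1-byte compressed flag (0 here)
// followed by a 4-byte big-endian length.
fn grpc_record(msg: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8]; // compressed flag = 0
    out.extend_from_slice(&(msg.len() as u32).to_be_bytes());
    out.extend_from_slice(msg);
    out
}

fn main() {
    // Two records concatenated, as they might share one DATA frame payload.
    let mut data_frame_payload = grpc_record(b"export-1");
    data_frame_payload.extend(grpc_record(b"export-2"));
    // 2 * (5-byte prefix) + 8 + 8 payload bytes = 26
    assert_eq!(data_frame_payload.len(), 26);
}
```

Nothing in HTTP/2 prevents the sender from cutting this 26-byte run at byte 3 or byte 17; the receiver must reassemble.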
Layer 2: gRPC message framing (5-byte record header)
gRPC defines its own record format inside the byte stream carried by HTTP/2 DATA frames:
- Compressed-Flag: 1 byte, value 0 or 1
- Message-Length: 4 bytes, big-endian
- Message: `Message-Length` bytes
From grpc/grpc’s protocol document:
```
Length-Prefixed-Message → Compressed-Flag Message-Length Message
Compressed-Flag → 0 / 1 ; encoded as 1 byte unsigned integer
Message-Length → ... ; encoded as 4 byte unsigned integer (big endian)
```
(See: https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md)
Bit/byte layout
```mermaid
flowchart LR
  subgraph GRPC["gRPC Length-Prefixed Message"]
    CF["Compressed flag\n1 byte"] --> ML["Message length\n4 bytes BE"] --> M["Message bytes\nN bytes"]
  end
```
Example: an uncompressed message of length 0x0000_0010 (16 bytes):
| offset | meaning | byte |
|---|---|---|
| 0 | compressed flag | 00 |
| 1 | len[31:24] | 00 |
| 2 | len[23:16] | 00 |
| 3 | len[15:8] | 00 |
| 4 | len[7:0] | 10 |
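A one-shot parser for this prefix, assuming the whole record is already in one buffer, is only a few lines; the streaming version handles the harder case where even the 5 header bytes arrive split:

```rust
/// Parse one gRPC length-prefixed record from a complete buffer.
/// Returns (compressed, message bytes) if the buffer holds a full record.
fn parse_grpc_record(buf: &[u8]) -> Option<(bool, &[u8])> {
    if buf.len() < 5 {
        return None;
    }
    let compressed = buf[0] != 0;
    let len = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
    let rest = &buf[5..];
    if rest.len() < len {
        return None;
    }
    Some((compressed, &rest[..len]))
}

fn main() {
    // The example from the table above: flag 00, length 0x10, 16 payload bytes.
    let mut wire = vec![0x00, 0x00, 0x00, 0x00, 0x10];
    wire.extend_from_slice(&[0xAA; 16]);
    let (compressed, msg) = parse_grpc_record(&wire).unwrap();
    assert!(!compressed);
    assert_eq!(msg.len(), 16);
}
```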
Rust: parse the 5-byte prefix + payload as a streaming state machine
A fast receiver doesn’t “read message then parse”; it incrementally consumes bytes as they arrive.
Below is a minimal streaming decoder. It never allocates; it only yields borrowed slices when the full message is present.
```rust
pub struct GrpcFramer {
    // 0..=5 bytes of header
    hdr: [u8; 5],
    have: usize,
    want_msg: Option<usize>,
}

impl GrpcFramer {
    pub fn new() -> Self {
        Self { hdr: [0; 5], have: 0, want_msg: None }
    }

    /// Feed more bytes. Returns (consumed, maybe_one_complete_message).
    /// A production implementation would loop and yield multiple messages.
    pub fn feed<'a>(&mut self, mut buf: &'a [u8]) -> (usize, Option<(bool, &'a [u8])>) {
        let start_len = buf.len();
        // Step 1: fill the 5-byte header.
        while self.have < 5 && !buf.is_empty() {
            self.hdr[self.have] = buf[0];
            self.have += 1;
            buf = &buf[1..];
        }
        if self.have < 5 {
            return (start_len - buf.len(), None);
        }
        // Step 2: compute the wanted payload length.
        // The compressed flag stays in hdr[0]; don't pack it into the length
        // (stealing bit 31 would collide with lengths >= 2 GiB).
        if self.want_msg.is_none() {
            let len = u32::from_be_bytes([self.hdr[1], self.hdr[2], self.hdr[3], self.hdr[4]]);
            self.want_msg = Some(len as usize);
        }
        let len = self.want_msg.unwrap();
        if buf.len() < len {
            return (start_len - buf.len(), None);
        }
        let compressed = self.hdr[0] != 0;
        let msg = &buf[..len];
        // Reset for the next message.
        self.have = 0;
        self.want_msg = None;
        (start_len - buf.len() + len, Some((compressed, msg)))
    }
}
```

This is intentionally “low-level ugly” because it mirrors the wire.
Micro-optimization note: replace the per-byte copy with one unaligned load
In real ingestion code you typically have a contiguous ring buffer; you can take advantage of that:
- memcpy the 5 bytes once (`ptr::copy_nonoverlapping`)
- parse the `u32` length with `read_unaligned` + `from_be`
The branch predictor wins because the hot path is: “already have header → already have bytes → parse → yield”.
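A sketch of that load (the helper name is made up for illustration; the caller must guarantee at least 4 readable bytes at `p`):

```rust
use core::ptr;

/// Read the 4-byte big-endian gRPC length at `p` with one unaligned load
/// instead of four indexed byte loads.
///
/// Safety: `p` must point to at least 4 readable bytes.
unsafe fn read_len_be(p: *const u8) -> u32 {
    u32::from_be(ptr::read_unaligned(p as *const u32))
}

fn main() {
    let hdr = [0x00u8, 0x00, 0x00, 0x00, 0x10]; // flag byte + length 16
    // Skip the compressed flag, read the length field.
    let len = unsafe { read_len_be(hdr.as_ptr().add(1)) };
    assert_eq!(len, 16);
}
```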
Layer 3: Protobuf inside the gRPC message (varints and length-delimited fields)
Once you’ve extracted Message, the protobuf wire format starts.
Two details matter for observability payloads:
- Tags are varints: `(field_number << 3) | wire_type`.
- Most OTLP payload is length-delimited (wire type 2): strings, bytes, embedded messages, packed repeated fields.
That means you see lots of:
- varint tag
- varint length
- `len` bytes
…and that is fundamentally “scan, branch, bounds check”.
Rust: a “crunchy” varint decoder (pointer-based, branch-minimized)
This is the core inner loop you’ll end up writing if you ever implement a custom fast-path (e.g., to count spans without fully decoding):
```rust
use core::ptr;

#[inline(always)]
pub unsafe fn read_u64_varint(mut p: *const u8, end: *const u8) -> Option<(u64, *const u8)> {
    // Protobuf varints are at most 10 bytes (10 * 7 = 70 payload bits).
    let mut x: u64 = 0;
    let mut shift = 0;
    while p < end && shift < 70 {
        let b = ptr::read(p);
        p = p.add(1);
        x |= ((b & 0x7f) as u64) << shift;
        if (b & 0x80) == 0 {
            return Some((x, p));
        }
        shift += 7;
    }
    None
}

#[inline(always)]
pub fn split_tag(tag: u64) -> (u32, u8) {
    let wire = (tag & 0x7) as u8;
    let field = (tag >> 3) as u32;
    (field, wire)
}
```

If you want to go further, you can SIMD-scan for bytes with the MSB clear to accelerate varint termination detection (common in dense streams), but in practice the biggest win is often: avoid decoding what you don’t need.
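To illustrate “avoid decoding what you don’t need”: a safe sketch that skips over top-level fields and counts only the length-delimited ones, the shape you would use to count spans or `resource_spans` entries without materializing them. Error handling here is bail-out-on-malformed:

```rust
/// Decode one base-128 varint starting at index `i`; returns (value, next index).
fn read_varint(buf: &[u8], mut i: usize) -> Option<(u64, usize)> {
    let mut x = 0u64;
    let mut shift = 0u32;
    while i < buf.len() && shift < 70 {
        let b = buf[i];
        i += 1;
        x |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            return Some((x, i));
        }
        shift += 7;
    }
    None
}

/// Count top-level length-delimited (wire type 2) fields without decoding them.
fn count_len_delimited(buf: &[u8]) -> usize {
    let mut count = 0;
    let mut i = 0;
    while i < buf.len() {
        let (tag, next) = match read_varint(buf, i) {
            Some(t) => t,
            None => break,
        };
        i = next;
        match (tag & 0x7) as u8 {
            // varint field: skip its value
            0 => match read_varint(buf, i) {
                Some((_, n)) => i = n,
                None => break,
            },
            1 => i += 8, // fixed64
            // length-delimited: skip `len` bytes and count the field
            2 => {
                let (len, n) = match read_varint(buf, i) {
                    Some(t) => t,
                    None => break,
                };
                i = n + len as usize;
                count += 1;
            }
            5 => i += 4, // fixed32
            _ => break, // deprecated groups / malformed input: bail out
        }
    }
    count
}

fn main() {
    // field 1, wire type 2, 3 payload bytes; then field 2, varint value 5
    let msg = [0x0a, 0x03, 0x01, 0x02, 0x03, 0x10, 0x05];
    assert_eq!(count_len_delimited(&msg), 1);
}
```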
Why OTLP/gRPC hurts (and when it doesn’t)
Cost centers you can’t see in “protobuf parsing time”
- HTTP/2 field compression (HPACK) for headers/trailers
- Flow control: WINDOW_UPDATE bookkeeping and potential sender stalls
- DATA frame fragmentation: the receiver has to stitch bytes across frames
- gRPC decompression (gzip) per-message if enabled
- Protobuf: varints + nested length-delimited fields
Many profiles blame “protobuf” because that’s where app-level CPU shows up, but the actual throughput limiter is frequently: “copying and reassembling buffers in a way that defeats cache locality”.
Trade-offs: OTLP/gRPC vs OTLP/HTTP vs Prometheus Remote-Write
- OTLP/gRPC
  - Pros: multiplexing, backpressure, rich status/trailers, ubiquitous client libs
  - Cons: HTTP/2 stack complexity; reassembly overhead; head-of-line blocking at TCP; often ends up CPU-bound in user space
- OTLP/HTTP (protobuf over HTTP/1.1 or HTTP/2)
  - Pros: simpler stack; easier debugging; proxies behave more predictably
  - Cons: fewer semantics than gRPC; still faces protobuf costs; batching often less disciplined
- Prometheus Remote-Write
  - Pros: single POST payload; no per-message 5-byte header; “one big blob” amortizes framing
  - Cons: Snappy decode + protobuf; label amplification can dominate
Language/runtime trade-off (Go vs Rust)
- Go’s gRPC + HTTP/2 stacks have been hammered in production for a decade; allocations are well-understood, and the runtime scheduler sometimes hides latency spikes.
- Rust can be faster in tight loops (varints, masking, pointer math), but the ecosystem can lose to Go due to:
- less mature HPACK implementations,
- more copying in safe abstractions,
- and mismatch between async runtimes and kernel I/O patterns.
Rust wins when you own the end-to-end pipeline and can keep telemetry bytes in L1/L2 as long as possible.
Go wins when you use the standard stack and benefit from years of “boring optimizations”.
Practical debugging: “what bytes did I actually receive?”
If you capture traffic and want to sanity-check your mental model:
1. Identify the HTTP/2 stream carrying the gRPC call (HEADERS with `:path` like `/opentelemetry.proto.collector.trace.v1.TraceService/Export`).
2. For DATA frames on that stream:
   - parse the 9-byte frame header
   - append payload bytes into a per-stream buffer
3. Parse that buffer as a sequence of gRPC 5-byte records.
4. For each record, protobuf-decode (or partially decode) the OTLP ExportRequest.
Once you can do steps 1–3, half of “collector CPU mystery” incidents become boring.
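The DATA-collection step can be sketched as a raw walk over the captured byte stream (DATA is frame type 0x0; flags and padding handling are deliberately omitted here):

```rust
/// Walk raw HTTP/2 frames and concatenate DATA payloads for one stream
/// into a reassembly buffer. Padding and flag handling omitted for brevity.
fn collect_data_frames(mut buf: &[u8], want_stream: u32) -> Vec<u8> {
    let mut out = Vec::new();
    while buf.len() >= 9 {
        // 9-byte frame header: Length(24) Type(8) Flags(8) R(1)+StreamID(31).
        let len = ((buf[0] as usize) << 16) | ((buf[1] as usize) << 8) | buf[2] as usize;
        let ty = buf[3];
        let sid = u32::from_be_bytes([buf[5], buf[6], buf[7], buf[8]]) & 0x7fff_ffff;
        if buf.len() < 9 + len {
            break; // incomplete frame at end of capture
        }
        if ty == 0x0 && sid == want_stream {
            out.extend_from_slice(&buf[9..9 + len]);
        }
        buf = &buf[9 + len..];
    }
    out
}

fn main() {
    // Two DATA frames on stream 1 carrying "ab" then "cd".
    let frames = [
        0, 0, 2, 0, 0, 0, 0, 0, 1, b'a', b'b',
        0, 0, 2, 0, 0, 0, 0, 0, 1, b'c', b'd',
    ];
    assert_eq!(collect_data_frames(&frames, 1), b"abcd");
}
```

The `out` buffer is then exactly the per-stream tape you feed to the gRPC 5-byte record parser.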
Provocative conclusion: the framing paradox
If the bottleneck is “too many small messages”, the obvious fix is “batch more”.
But batching more increases:
- maximum message size → more buffering and larger working sets,
- tail latency → bigger exports wait longer,
- and in some stacks, worse allocator behavior.
Research Question:
In high-cardinality workloads, why do some Go OTLP receivers outperform SIMD-heavy Rust decoders?
Is it because Go is “faster”, or because the Go stack accidentally chooses better batch/flow-control points that keep cache and allocator behavior stable?