Observability on the Edge With OTel and FluentBit
When we design observability pipelines for modern cloud environments, we implicitly rely on a set of luxurious guarantees: limitless bandwidth, highly available networks, practically infinite storage, and abundant computing power. But move these workloads to the edge (think of a maritime vessel navigating the mid-Atlantic, or a remote wind turbine) and those guarantees vanish. Edge environments are constrained by intermittent connectivity, severe limits on CPU and RAM, and a lack of persistent storage guarantees. You cannot run a full, traditional observability stack locally, nor can you stream everything to the cloud without exhausting a limited satellite link.
The engineering challenge is clear: how do we build a pipeline that reliably captures traces, metrics, and logs, survives unpredictable network outages, and correlates signals perfectly, all without exceeding edge constraints? A compelling, production-realistic solution to this problem was showcased for KubeCon EU 2026: a fully correlated observability pipeline built for constrained edge environments using OpenTelemetry and Fluent Bit. You can explore the complete implementation in the graz-dev/observability-on-edge repository.
This article dives deep into the architecture, the technical trade-offs, and the specific configurations required to make observability work reliably when the network itself is your biggest enemy.
Taming the Bandwidth With Tail-Based Sampling
The fundamental problem with distributed tracing is the sheer volume of data it generates. A single HTTP request traversing various middleware and downstream services can easily produce up to 50 individual spans. At a relatively modest load of ten requests per second, you are suddenly dealing with hundreds of gigabytes of trace data over a year. In a cloud environment, you might simply scale up your storage. On a maritime vessel connected via an expensive, low-bandwidth satellite link, sending all of this is economically and technically impossible.
To solve this, we must aggressively sample the data, dropping the noise and keeping only what is actionable. The most naive approach is head-based sampling, where the system makes a keep-or-drop decision at the very first span of a trace. While head-based sampling adds almost no computational latency, it is entirely blind to the outcome of the request. If you decide to drop a trace at its inception, and that request subsequently fails or experiences a massive latency spike, that crucial diagnostic information is lost forever. In edge deployments where errors might be rare but highly critical, this is unacceptable.
The solution is tail-based sampling. By buffering all spans of a trace in memory within the OpenTelemetry Collector, the pipeline waits for the trace to fully complete before making a decision based on the final outcome. In this implementation, the tail-sampling policy is strictly configured to keep 100% of traces that contain an error, and 100% of traces where the duration exceeds 200 milliseconds. Normal, fast, successful traces are discarded entirely. The observed result is a massive reduction in bandwidth overhead, dropping roughly 80% of all spans before they are ever exported over the network.
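In Collector configuration terms, such a policy can be sketched with the contrib distribution's tail_sampling processor. This is a minimal illustration: the decision_wait value and policy names are assumptions, while the error and 200 ms criteria come from the article.

```yaml
processors:
  tail_sampling:
    # How long to buffer spans waiting for a trace to complete
    # before evaluating it (value is illustrative).
    decision_wait: 10s
    policies:
      # Keep 100% of traces that contain at least one error span.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep 100% of traces whose total duration exceeds 200 ms.
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 200
```

Any trace matching neither policy is dropped in its entirety, which is what produces the roughly 80% span reduction described above.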
To make this bandwidth reduction truly effective, we must apply the same philosophy to our logs. If we filter traces but send every single log line, we defeat the purpose. For log processing, Fluent Bit runs as a DaemonSet on the edge nodes, tailing container logs. Rather than using Fluent Bit's native grep filters, which struggle with complex multi-field conditional logic, a custom Lua filter is injected into the pipeline. This Lua script precisely mirrors the OpenTelemetry tail-sampling criteria, evaluating each log record and keeping only those with an error level or a duration exceeding 200 milliseconds. By performing this logic at the absolute edge, before the log data even leaves the node, the pipeline drops approximately 86% of log volume at the source, preventing unnecessary network I/O.
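The filter itself can be sketched as a Fluent Bit Lua callback. This is a hedged illustration, assuming logs arrive as parsed JSON records with `level` and `duration_ms` fields; the field names and the function name are assumptions, not taken from the repository.

```lua
-- Fluent Bit Lua filter mirroring the OTel tail-sampling criteria.
-- Assumes records are parsed JSON with `level` and `duration_ms`
-- fields (field names are illustrative).
function sample(tag, timestamp, record)
  local level = record["level"]
  local duration = tonumber(record["duration_ms"]) or 0
  if level == "error" or duration > 200 then
    -- Return code 0: keep the record unmodified.
    return 0, timestamp, record
  end
  -- Return code -1: drop the record before it leaves the node.
  return -1, timestamp, record
end
```

The callback would be wired into the pipeline via a `[FILTER] Name lua` section pointing at this script, keeping the drop decision on the node itself.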
Persistent Queuing for Intermittent Connectivity
Aggressive sampling solves the bandwidth issue during steady-state operations, but what happens when the satellite link inevitably fails? If the OpenTelemetry Collector cannot reach the central hub, it will quickly exhaust its in-memory retry buffers, and all telemetry accumulated during the outage will be permanently lost.
To survive these disconnections, the pipeline implements a file-backed persistent queue using the file_storage extension of the OpenTelemetry Collector. This provides a bbolt (BoltDB) key-value store on the edge node's local disk. When the exporter's sending queue is configured to use this storage, outgoing telemetry batches are serialized and written to the crash-safe bbolt database before dispatch, and they remain on disk until the collector receives a successful acknowledgment from the remote endpoints.
Configuring this queue correctly requires understanding a critical nuance of the OpenTelemetry Collector's internal architecture regarding consumers. By default, the exporter uses four concurrent consumer goroutines to claim batches from the queue. Because the processing pipeline generally produces batches relatively slowly at edge traffic volumes, these four consumers will claim batches almost instantly, holding them in their in-memory retry buffers rather than leaving them in the bbolt queue.
Consequently, your queue depth metrics will deceptively report zero even during an active network outage, blinding you to the growing backlog. By deliberately setting the num_consumers configuration to exactly one, only a single batch is ever held in flight in memory. All subsequent batches safely queue in bbolt, allowing the metric to accurately reflect the growing backlog during an outage.
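Putting both pieces together, the relevant Collector configuration can be sketched as follows. The directory, endpoint, and queue size are illustrative assumptions; the storage binding and num_consumers: 1 reflect the behavior described above.

```yaml
extensions:
  file_storage:
    # bbolt database location on the edge node's local disk (illustrative).
    directory: /var/lib/otelcol/queue

exporters:
  otlp:
    endpoint: hub.example.internal:4317  # illustrative hub endpoint
    sending_queue:
      enabled: true
      # Back the queue with the file_storage extension so batches
      # survive process restarts and network outages.
      storage: file_storage
      # A single consumer keeps only one batch in flight in memory,
      # so queue-depth metrics honestly reflect the on-disk backlog.
      num_consumers: 1
      queue_size: 5000  # illustrative

service:
  extensions: [file_storage]
```

With the default of four consumers, the same configuration would report a near-zero queue depth during an outage even as the retry buffers silently filled.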
Reconnection and the Time-Travel Problem
When the satellite link is eventually restored, the real chaos begins. The OpenTelemetry Collector detects the restored connection and immediately begins draining the accumulated bbolt queue at maximum network speed. However, the data stored in this queue possesses timestamps from minutes or hours in the past. We are essentially attempting to push historical, out-of-order data into our central observability backends. Different backends handle this "time travel" problem differently. Jaeger, our distributed trace storage, handles it gracefully by design. Its storage model is append-only, possessing no concept of out-of-order rejection. Traces originating from the failure window simply appear in the user interface precisely where they belong chronologically.
Loki, handling our logs, is much stricter. By default, Loki expects log entries for a given stream to arrive in roughly chronological order, and it will forcefully reject significantly older timestamps with HTTP 400 errors. If left unconfigured, the OpenTelemetry Collector would receive these errors continuously upon queue drain, leading to the permanent loss of all logs generated during the outage. To prevent this disaster, we must explicitly configure unordered_writes: true in the Loki settings. This crucial parameter disables the strict per-stream ordering requirement, allowing the massive burst of queued, historical log entries to be ingested successfully.
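On the Loki side, the change is a one-line entry in the limits configuration. The surrounding keys shown here are illustrative; only unordered_writes comes from the article.

```yaml
limits_config:
  # Allow the historical burst from the drained edge queue to be
  # ingested instead of rejected with HTTP 400 out-of-order errors.
  unordered_writes: true
  # Optionally relax the age cutoff as well (illustrative assumption;
  # not stated in the article).
  reject_old_samples: false
```

Without this, every queued log entry older than the newest accepted timestamp in its stream would bounce back during the drain.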
Metrics ingestion presents an even harsher reality. In this architecture, metrics are exported to Prometheus using the prometheusremotewrite exporter. Unlike logs and traces, the OpenTelemetry Collector library supporting this exporter lacks support for the file-backed bbolt queue, leaving the metrics queue strictly in-memory.
Furthermore, when the link is restored, any old metrics still held in memory are sent to Prometheus, which natively rejects out-of-order samples. While there are alternative protocols, such as OTLP over HTTP for metrics, using OTLP for Prometheus ingestion results in aggressive HTTP 400 rejections for out-of-order data. This causes the exporter to retry indefinitely, permanently blocking the queue and grinding the entire metrics pipeline to a halt.
It is crucial to note that this limitation, the inability to use a file-backed queue for metrics, is a known library constraint of the OpenTelemetry Collector versions (v0.95 and v0.96) used in this repository. Because these builds do not support sending_queue.storage: file_storage for the Prometheus remote-write exporter, the architecture is forced to keep the metrics queue in RAM. The decision here is a deliberate, calculated engineering trade-off: by using the prometheusremotewrite exporter against the remote-write endpoint, Prometheus silently skips the out-of-order samples, returning an HTTP 204 status with zero samples written. The queue drains cleanly and unblocks, but the metric data generated during the outage window is intentionally sacrificed. At the edge, pipeline integrity is often prioritized over absolute metric continuity.
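The exporter side of this trade-off can be sketched as follows. The endpoint is an illustrative assumption; note that this exporter exposes its own in-memory remote_write_queue rather than the storage-backed sending_queue available to the OTLP exporters.

```yaml
exporters:
  prometheusremotewrite:
    # Illustrative Prometheus remote-write endpoint on the hub.
    endpoint: http://prometheus.hub.example.internal:9090/api/v1/write
    # This queue is strictly in-memory in the Collector versions used
    # here; it cannot be bound to the file_storage extension.
    remote_write_queue:
      enabled: true
      num_consumers: 1
```

Samples that age out during an outage are skipped on ingestion rather than wedging the pipeline in an endless retry loop.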
Achieving Deterministic Signal Correlation
An observability stack is only as valuable as its ability to correlate signals. A latency spike on a dashboard must lead seamlessly to the exact distributed trace, which must seamlessly transition to the specific log lines emitted during that exact request. In this edge architecture, achieving perfect correlation is not reliant on best-effort timestamp matching; it is structurally guaranteed.
The process begins in the application code, which extracts the OpenTelemetry trace_id and span_id from the context of every incoming HTTP request and structurally injects them into every log line via a JSON logger. Because the Fluent Bit Lua filter and the OpenTelemetry tail-sampling processor use the exact same logic, we achieve deterministic alignment: every trace that survives the sampler has a corresponding log line surviving the Lua filter, and conversely, no log line exists without a parent trace. There are no orphaned traces and no orphaned logs.

To link high-level metrics directly to these traces, the architecture employs the OpenTelemetry spanmetrics connector. This connector reads the sampled spans and generates Prometheus histogram metrics for request rates and latencies. Crucially, it attaches an exemplar to each histogram bucket: a sparse metadata annotation carrying the specific trace_id that contributed to that latency measurement.
The placement of this connector is paramount. In the configuration pipeline, spanmetrics is wired strictly after the tail sampling processor. Because it runs post-sampling, every single trace_id it embeds into a metric exemplar is absolutely guaranteed to have survived the sampling process and exist in Jaeger. When operators view their Grafana dashboards, they see diamond markers on their latency graphs representing these exemplars; clicking a marker reliably drops them directly into the exact failing trace with zero dead links.
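In configuration terms, that ordering might look like the following sketch. The receiver and exporter names are illustrative assumptions; only the spanmetrics-after-tail-sampling wiring reflects the architecture described here.

```yaml
connectors:
  spanmetrics:
    # Attach exemplars carrying the trace_id of contributing spans.
    exemplars:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      # spanmetrics consumes spans only after tail sampling, so every
      # exemplar points at a trace guaranteed to exist in Jaeger.
      processors: [tail_sampling]
      exporters: [otlp/jaeger, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheusremotewrite]
```

Flipping the order, i.e. generating span metrics before sampling, would produce exemplars referencing trace_ids that were later discarded, exactly the dead links this design avoids.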
Performance Testing
To prove this architecture works under realistic conditions, the project doesn't just send a few manual requests. Instead, it employs the k6 Operator, a Kubernetes-native load test runner, to generate a continuous, high-volume telemetry stream from within the hub cluster. The load generator uses a custom TestRun resource that ramps up to 500 virtual users and sustains them over a 40-minute run.
This performance test follows a deliberate ramp-up profile: it scales from zero to 100 virtual users in the first 30 seconds, climbs to 250 in the next 30 seconds, reaches 500 at the one-minute mark, and sustains that peak for a 40-minute steady state. At its peak, this setup generates approximately 2,500 spans per second. The traffic is intelligently distributed across four distinct API endpoints simulating a vessel's systems, each with specific latency and error profiles, perfectly exercising the tail-sampling and Lua filtering logic.
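A TestRun manifest for this kind of load might look roughly like the sketch below. The names, parallelism, and ConfigMap layout are assumptions; only the 500-VU target comes from the article, and the ramp stages themselves would live in the referenced k6 script.

```yaml
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: edge-load-test        # illustrative name
spec:
  parallelism: 4              # illustrative; splits the 500 VUs across runner pods
  script:
    configMap:
      name: edge-load-script  # ConfigMap holding the k6 JS script that
      file: test.js           # encodes the 100 -> 250 -> 500 VU ramp profile
```

The operator spins up the runner pods, distributes the virtual users among them, and tears everything down when the steady-state window completes.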
While the k6 load test validates the pipeline's throughput, the underlying reality is that rigorous performance testing was essential for a much more critical goal: drastically minimizing the collector's resource footprint. In constrained edge environments, every megabyte of RAM and CPU cycle consumed by observability is stolen directly from the primary application workloads. Through an extensive, automated performance tuning campaign, we analyzed the complex interactions between the Go runtime and the collector's internal processors. The findings revealed that optimizing an edge node requires surgically tuning both the OTel configuration and the underlying Go environment rather than simply guessing at limits. By methodically testing various permutations, we discovered the exact "sweet spot" to maximize performance while shrinking the footprint.
The most impactful findings from these tests led to highly specific internal calibrations. For example, the memory_limiter processor was precisely tuned to enforce a soft limit of 320 MiB and a hard limit of 400 MiB. This was paired with a batch processor rigorously configured to accumulate exactly 512 spans or wait for a maximum 5-second timeout. Furthermore, these tests demonstrated that throttling the exporter queue to a single consumer (num_consumers: 1) was critical. It not only provided accurate backpressure metrics during a simulated satellite outage but structurally prevented the Go runtime's garbage collector from thrashing when the connection was restored and massive historical queues suddenly drained.
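These calibrations map onto the Collector's processor configuration roughly as follows. The check interval is an illustrative assumption; the limits and batch settings are the values reported above (in the memory_limiter, the soft limit is expressed as the hard limit minus the spike allowance).

```yaml
processors:
  memory_limiter:
    check_interval: 1s   # illustrative polling interval
    limit_mib: 400       # hard limit from the tuning campaign
    spike_limit_mib: 80  # 400 - 80 = soft limit of 320 MiB
  batch:
    send_batch_size: 512 # flush once 512 spans have accumulated...
    timeout: 5s          # ...or after 5 seconds, whichever comes first
```

When the soft limit is crossed, the memory_limiter starts refusing data and forcing garbage collection, giving the exporter queue backpressure room before the hard limit would trigger drops.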
The results of this optimization campaign are striking. Stripped of the bloated default components via the OpenTelemetry Collector Builder, the resulting 30 MB binary operates seamlessly under intense pressure. It continuously processes thousands of spans per second while consuming merely 1% to 5% of a single CPU core and hovering predictably between 80 MiB and 150 MiB of active memory. This definitively proves that with proper performance testing and exact Go runtime calibrations, you do not have to choose between rich telemetry and edge node stability.
Validating Resilience Through Network Chaos
Proving that resilience actually works requires simulating harsh physical realities within a controlled Kubernetes environment. Relying on high-level Kubernetes NetworkPolicies is insufficient for this testing, as they do not provide the surgical, instantaneous, and reversible IP-layer control needed to simulate an abrupt satellite drop. Instead, the project utilizes a privileged DaemonSet running netshoot, a network debugging container. Operating in the host network namespace, this pod can directly manipulate the edge node's kernel routing rules using iptables. A dedicated chaos script surgically inserts DROP rules into the FORWARD chain, specifically targeting the outbound ports for Jaeger, Loki, and Prometheus.
A critical detail in this simulation is the behavior of the Linux kernel's connection tracking framework, conntrack. Modern kernels maintain state for established TCP connections, allowing them to bypass newly inserted DROP rules. If you apply an iptables drop rule without further action, existing gRPC connections between the collector and the hub will simply continue to flow unaffected. The chaos script explicitly executes a conntrack flush command targeting the collector's IP address. This violently terminates the established states, forcing the client to initiate a new TCP handshake, which is immediately blocked by the new rules. This accurately triggers the failure state: the OpenTelemetry exporter begins failing, batches begin piling up in the bbolt database, and the queue depth metrics steadily climb. Removing the rules simulates link restoration, triggering the massive, satisfying spike in export throughput as the resilient queue drains historical data into the backends.
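The two-step chaos toggle can be sketched as a pair of shell commands, to be run inside the privileged netshoot pod in the host network namespace. The ports and collector IP here are illustrative assumptions, not taken from the repository.

```shell
# Assumed backend ports: 4317 (Jaeger OTLP), 3100 (Loki), 9090 (Prometheus).
# Requires NET_ADMIN in the host network namespace.

# 1. Black-hole outbound traffic to the observability backends.
for port in 4317 3100 9090; do
  iptables -I FORWARD -p tcp --dport "$port" -j DROP
done

# 2. Flush conntrack state for the collector (illustrative IP) so
#    established gRPC connections cannot bypass the new DROP rules.
conntrack -D -s 10.42.0.15

# Later: restore the "satellite link" by removing the rules.
for port in 4317 3100 9090; do
  iptables -D FORWARD -p tcp --dport "$port" -j DROP
done
```

Step 2 is the non-obvious half: without the conntrack flush, the drop rules only affect new connections, and the outage simulation silently does nothing.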
Conclusion
Observability at the edge forces engineering teams to abandon the comfortable defaults of cloud-native computing. We cannot afford to transmit every metric, log line, and trace span. By combining aggressive tail-based and source-based sampling, local persistent queuing, careful handling of out-of-order ingestion, and deterministic correlation through exemplars, it is entirely possible to maintain deep, actionable visibility into remote, constrained environments.
The implementation presented in the graz-dev/observability-on-edge repository serves as an example of these techniques. It proves that with strict resource management and a deep understanding of network behavior, robust edge observability is not just a theoretical concept, but a highly achievable engineering reality.
