OpenTelemetry Collector Deep Dive: Building Vendor-Neutral Observability Pipelines
Vendor lock-in in observability is a slow tax — you barely feel it until you try to leave. OpenTelemetry Collector is the infrastructure that makes telemetry truly portable. This post is not a getting-started guide; it is a production engineering deep dive covering pipeline architecture, tail-based sampling, scaling strategies, and the failure modes that bite teams after they go live.
OTel Collector Architecture: The Pipeline Model
The OpenTelemetry Collector is a vendor-agnostic proxy for observability data. Its architecture is a three-stage pipeline:
```
                     OTEL COLLECTOR PIPELINE

  RECEIVERS        PROCESSORS            EXPORTERS
  ─────────        ──────────            ─────────
  otlp (gRPC)  →   memory_limiter    →   otlp (Jaeger)
  otlp (HTTP)  →   batch             →   prometheusremotewrite
  prometheus   →   filter            →   loki
  jaeger       →   transform         →   kafka
  zipkin       →   resource          →   file (debug)
  hostmetrics  →   tail_sampling     →   otlp (cloud)
```
Critically: pipelines are typed. A traces pipeline can only connect trace receivers to trace processors to trace exporters. You define separate pipelines for traces, metrics, and logs, though processors like resource and batch work across all signal types.
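As a mental model (not Collector code), a pipeline is just function composition over typed signal data: a receiver produces items, each processor transforms or drops them, and whatever survives reaches the exporter. The names below are illustrative, not real Collector APIs:

```python
# Minimal mental model of a Collector traces pipeline. A processor returning
# None means "dropped"; everything else flows on to the exporter.
from dataclasses import dataclass, field

@dataclass
class Span:
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)

def run_pipeline(spans, processors, export):
    """Push each span through the processor chain in order."""
    exported = []
    for span in spans:
        for process in processors:
            span = process(span)
            if span is None:
                break  # span was dropped by a processor
        if span is not None:
            exported.append(export(span))
    return exported

# A filter-style processor: drop health-check spans.
drop_health = lambda s: None if s.name == "GET /health" else s
# A resource-style processor: enrich with a static attribute.
enrich = lambda s: (s.attributes.update({"k8s.node.name": "node-1"}) or s)

spans = [Span("a1", "GET /orders"), Span("b2", "GET /health")]
result = run_pipeline(spans, [drop_health, enrich], lambda s: s.name)
print(result)  # ['GET /orders']: the health check never reaches the exporter
```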
Real Scenario: Migrating from Datadog to a Vendor-Neutral Stack
A 40-service Spring Boot microservices platform was spending $180k/year on Datadog. The migration goal: move to Prometheus + Jaeger + Loki without changing a single line of application code. The OTel Collector made this possible.
The applications were already instrumented with opentelemetry-spring-boot-starter. They were sending OTLP to Datadog via the Datadog Agent. The migration involved:
- Deploying the OTel Collector as a Kubernetes DaemonSet (Agent mode)
- Reconfiguring the application's OTEL_EXPORTER_OTLP_ENDPOINT to point to the local Collector
- Configuring the Collector to export traces to Jaeger, metrics to Prometheus, and logs to Loki
- Running dual-export for 2 weeks (both Datadog and the new stack) for validation
- Cutting over from Datadog
Zero application code changes. Total migration time: 3 weeks.
Agent vs Gateway Deployment Modes
Agent Mode (DaemonSet)
One Collector instance per Kubernetes node, receiving telemetry from all pods on that node via localhost. Advantages: low-latency data collection, node-local resource enrichment (host metrics, pod metadata). Disadvantages: no tail-based sampling (data from a single trace may span multiple nodes and agents).
```yaml
# agent-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1000
  resource:
    attributes:
      - key: k8s.node.name
        from_attribute: host.name
        action: insert

exporters:
  otlp/gateway:
    endpoint: otel-collector-gateway:4317
    tls:
      insecure: false
      ca_file: /etc/ssl/certs/ca.crt

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]  # batch last, per recommended ordering
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/gateway]
```
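The batch processor above has two flush triggers: a full batch (send_batch_size) or an aged one (timeout), whichever fires first. An illustrative single-threaded simulation of that logic (not the Collector's actual implementation):

```python
# Illustrative sketch of batch-processor flush logic: emit a batch when it
# reaches send_batch_size OR when the oldest pending item exceeds timeout.
def simulate_batching(arrivals, send_batch_size=1000, timeout=5.0):
    """arrivals: list of (timestamp_seconds, item). Returns flushed batches."""
    batches, current, batch_started = [], [], None
    for ts, item in arrivals:
        # Flush first if the pending batch has aged past the timeout.
        if current and ts - batch_started >= timeout:
            batches.append(current)
            current = []
        if not current:
            batch_started = ts
        current.append(item)
        if len(current) >= send_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # final flush on shutdown
    return batches

# Two spans in quick succession, then one arriving 6 seconds later:
batches = simulate_batching([(0.0, "s1"), (0.1, "s2"), (6.0, "s3")])
print([len(b) for b in batches])  # [2, 1]: the timeout flushed s1+s2 before s3 arrived
```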
Gateway Mode (Deployment)
Centralized Collector deployment receiving from all agents. This is where stateful processors (tail-based sampling, span aggregation) live. Scale with horizontal pod autoscaling, but note that stateful processors need consistent hashing load balancing — all spans for a given trace ID must reach the same Collector instance.
Tail-Based Sampling: Why It Matters and How to Configure It
Head-based sampling (sampling at trace creation) is the naive approach: sample 10% of requests. The problem: errors are rare. If an error occurs in the 90% you dropped, you have no trace to debug with.
Tail-based sampling buffers spans for a configurable window, then decides whether to keep the trace based on its complete outcome — including error status, latency thresholds, and custom attributes. The trade-off: memory consumption and the need for all spans to reach the same Collector instance.
```yaml
# gateway-collector-config.yaml — tail sampling configuration
processors:
  tail_sampling:
    decision_wait: 10s                 # Buffer traces for 10 seconds before deciding
    num_traces: 100000                 # Max traces held in memory simultaneously
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces (>2 seconds)
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 2000
      # Keep 5% of healthy fast traces for a baseline
      - name: baseline-sampling-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
      # Always keep traces with specific attributes (e.g., canary deployments)
      - name: canary-policy
        type: string_attribute
        string_attribute:
          key: deployment.environment
          values: [canary, staging]
```
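The policy evaluation above amounts to an OR across policies: keep the trace if any policy matches. A hedged Python sketch of that decision (illustrative only; the real processor also handles inverted decisions and per-policy rate limits):

```python
import hashlib

# Illustrative sketch of tail-sampling policy evaluation over a buffered trace.
def keep_trace(trace, sampling_percentage=5, latency_threshold_ms=2000):
    # errors-policy: any span with ERROR status keeps the whole trace
    if any(span.get("status") == "ERROR" for span in trace["spans"]):
        return True
    # slow-traces-policy: total trace duration over the latency threshold
    duration = max(s["end_ms"] for s in trace["spans"]) - \
               min(s["start_ms"] for s in trace["spans"])
    if duration > latency_threshold_ms:
        return True
    # canary-policy: match on a string attribute
    if trace.get("deployment.environment") in ("canary", "staging"):
        return True
    # baseline probabilistic policy: hash the trace ID into a [0, 100) bucket
    bucket = int(hashlib.sha256(trace["trace_id"].encode()).hexdigest(), 16) % 100
    return bucket < sampling_percentage

errored = {"trace_id": "def456", "spans": [{"status": "ERROR", "start_ms": 0, "end_ms": 10}]}
slow    = {"trace_id": "ghi789", "spans": [{"status": "OK", "start_ms": 0, "end_ms": 2500}]}
print(keep_trace(errored), keep_trace(slow))  # True True: always kept
```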
Production Config with Spring Boot (OTLP over gRPC)
```yaml
# application.yaml — Spring Boot OTLP configuration
management:
  tracing:
    sampling:
      probability: 1.0  # Send 100% to the Collector; let the Collector do tail sampling
  otlp:
    tracing:
      endpoint: http://otel-collector-agent:4317
```

Required Spring Boot dependencies (pom.xml):

```
io.micrometer:micrometer-tracing-bridge-otel
io.opentelemetry:opentelemetry-exporter-otlp
io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter
```
```yaml
# Full gateway collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  batch:
    timeout: 5s
    send_batch_size: 2000
  filter/drop_health_checks:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/actuator/health"'
  transform/enrich:
    trace_statements:
      - context: span
        statements:
          - set(attributes["team"], "platform") where resource.attributes["service.name"] == "order-service"
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1500}
      - name: sample-5pct
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        k8s.pod.name: "pod"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_health_checks, transform/enrich, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```
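The transform/enrich statement reads naturally as a conditional assignment. In Python terms (an illustrative reading of the OTTL statement, not OTTL itself):

```python
# Illustrative reading of the OTTL statement:
#   set(attributes["team"], "platform")
#     where resource.attributes["service.name"] == "order-service"
def transform_enrich(span, resource):
    if resource.get("service.name") == "order-service":
        span["attributes"]["team"] = "platform"
    return span

matched = {"attributes": {}}
transform_enrich(matched, {"service.name": "order-service"})
print(matched["attributes"])  # {'team': 'platform'}

unmatched = {"attributes": {}}
transform_enrich(unmatched, {"service.name": "cart-service"})
print(unmatched["attributes"])  # {}: the where clause did not match
```

Because the condition references resource attributes while the assignment targets span attributes, the enrichment applies to every span of the matching service without touching other services' telemetry.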
Scaling the Collector: Load Balancing for Stateful Processors
Stateless processors (batch, filter, transform) scale horizontally without coordination — any Collector instance can handle any span. Stateful processors — specifically tail_sampling and groupbytrace — require that all spans for a given trace arrive at the same Collector instance. Use the loadbalancingexporter in agent Collectors to route by trace ID:
```yaml
# Agent collector — route to the gateway by trace ID hash
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-collector-gateway-headless
        ports: [4317]
```
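What routing_key: traceID buys you fits in a few lines: hashing the trace ID makes every span of a trace land on the same gateway instance, regardless of which agent forwarded it. An illustrative sketch (the real loadbalancing exporter uses a consistent-hash ring, so scaling the gateway pool only remaps a fraction of traces):

```python
import hashlib

# Illustrative sketch of trace-ID routing: deterministic hash of the trace ID
# picks the gateway backend, so spans of one trace always converge.
def pick_backend(trace_id, backends):
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return backends[h % len(backends)]

backends = ["gateway-0:4317", "gateway-1:4317", "gateway-2:4317"]
# Spans from the same trace, arriving via different node agents:
spans = [("4bf92f3577b34da6", "agent-node-a"), ("4bf92f3577b34da6", "agent-node-b")]
targets = {pick_backend(trace_id, backends) for trace_id, _ in spans}
print(len(targets))  # 1: both spans reach the same gateway instance
```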
Failure Scenarios and Mitigation
Collector OOM
The memory_limiter processor must always be the first processor in every pipeline. When memory exceeds the limit, the Collector refuses new data and returns a retryable error to senders (which will buffer and retry). Without it, a cardinality explosion or traffic spike kills the Collector process and you lose all buffered data.
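The limiter's behavior can be sketched as a soft limit derived from limit_mib and spike_limit_mib: above it, new data is refused with a retryable error so senders buffer locally. This is an illustrative simplification of the real processor, which rechecks usage every check_interval and also forces garbage collection near the hard limit:

```python
# Illustrative sketch of memory_limiter admission: the soft limit is
# limit_mib - spike_limit_mib; above it, refuse data with a retryable error.
class RetryableError(Exception):
    pass

def admit(current_mib, limit_mib=512, spike_limit_mib=128):
    soft_limit = limit_mib - spike_limit_mib  # 384 MiB with these values
    if current_mib >= soft_limit:
        # Sender SDKs/agents see a retryable gRPC error and buffer + retry.
        raise RetryableError(
            f"memory usage {current_mib} MiB >= soft limit {soft_limit} MiB")
    return True

print(admit(300))  # True: under the soft limit, data is accepted
try:
    admit(400)
except RetryableError as e:
    print("refused:", e)
```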
Data Loss on Collector Restart
The Collector is stateless by default — in-flight data at restart time is lost. For production, use the file_storage extension to enable persistent queuing:
```yaml
extensions:
  file_storage/persistent_queue:
    directory: /var/lib/otelcol/queue

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    sending_queue:
      enabled: true
      storage: file_storage/persistent_queue
      num_consumers: 10
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  extensions: [file_storage/persistent_queue]  # the extension must be enabled here
```
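The retry_on_failure settings describe an exponential backoff: grow from initial_interval, cap at max_interval, and give up once max_elapsed_time is exhausted. The schedule they imply can be sketched as follows (the real exporter helper also applies randomized jitter, and a multiplier of 1.5 is assumed here):

```python
# Illustrative sketch of the retry schedule implied by retry_on_failure.
def backoff_schedule(initial=5.0, max_interval=30.0, max_elapsed=300.0,
                     multiplier=1.5):
    waits, elapsed, interval = [], 0.0, initial
    while elapsed + interval <= max_elapsed:
        waits.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, max_interval)  # cap the growth
    return waits

schedule = backoff_schedule()
print(schedule[:4])            # first retries: [5.0, 7.5, 11.25, 16.875]
print(sum(schedule) <= 300.0)  # True: gives up within max_elapsed_time
```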
Cardinality Explosions
A single service adding high-cardinality labels (user ID, request ID) to metrics can generate millions of unique time series, crashing Prometheus. The filter and transform processors are your circuit breakers:
```yaml
processors:
  transform/drop_high_cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user.id")
          - delete_key(attributes, "request.id")
```
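The arithmetic behind the explosion is stark: series count is the product of each label's distinct values, so one high-cardinality key multiplies everything. A back-of-envelope example with hypothetical numbers:

```python
# Back-of-envelope cardinality arithmetic (hypothetical numbers): unique
# time series = product of distinct values per label.
endpoints, status_codes, pods = 100, 5, 40
distinct_users = 50_000  # e.g., user.id leaking into metric labels

with_user_id    = endpoints * status_codes * pods * distinct_users
without_user_id = endpoints * status_codes * pods

print(f"{with_user_id:,} series with user.id")       # 1,000,000,000 series
print(f"{without_user_id:,} series after delete_key")  # 20,000 series
```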
When NOT to Use the Collector
The OTel Collector adds operational overhead: it is another process to deploy, monitor, and upgrade. Avoid it when:
- You have a single service sending to a single backend — direct OTLP export is simpler
- Running at the edge (IoT, embedded) where memory and CPU are constrained — use SDK-level direct export
- Your team has no Kubernetes/Docker operational experience — the Collector's configuration YAML has a significant learning curve
Use it when you have 5+ services, want backend flexibility, need tail-based sampling, or require cross-service attribute enrichment.
Key Takeaways
- The OTel Collector's pipeline model (receivers → processors → exporters) is the key to understanding its power and constraints
- Always put memory_limiter as the first processor — it is the safety valve preventing OOM crashes
- Tail-based sampling requires stateful Collector instances with consistent-hash load balancing by trace ID
- Agent + Gateway topology is the production standard: agents handle local enrichment and fan-out, gateways handle sampling and routing
- Use file_storage persistent queuing for exporters to prevent data loss on Collector restarts
- Migrating from Datadog to a vendor-neutral stack requires zero application code changes when applications emit OTLP
Conclusion
The OpenTelemetry Collector transforms observability from vendor-dependent infrastructure into a programmable data pipeline. Its value is not just in vendor neutrality — it is in the ability to enrich, filter, sample, and route telemetry data independently of your application code. Building it correctly from the start — with proper memory limits, persistent queuing, and the right deployment topology — determines whether it becomes a reliable cornerstone of your platform or a source of production incidents. The patterns in this post represent production-tested configurations for services at significant scale.