OpenTelemetry Collector Deep Dive: Building Vendor-Neutral Observability Pipelines
Vendor lock-in in observability is a slow tax — you barely feel it until you try to leave. OpenTelemetry Collector is the infrastructure that makes telemetry truly portable. This post is not a getting-started guide; it is a production engineering deep dive covering pipeline architecture, tail-based sampling, scaling strategies, and the failure modes that bite teams after they go live.
Table of Contents
- OTel Collector Architecture: The Pipeline Model
- Real Scenario: Migrating from Datadog to a Vendor-Neutral Stack
- Agent vs Gateway Deployment Modes
- Tail-Based Sampling: Why It Matters and How to Configure It
- Production Config with Spring Boot (OTLP over gRPC)
- Scaling the Collector: Load Balancing for Stateful Processors
- Failure Scenarios and Mitigation
- When NOT to Use the Collector
- Conclusion
OTel Collector Architecture: The Pipeline Model
The OpenTelemetry Collector is a vendor-agnostic proxy for observability data. Its architecture is a three-stage pipeline:
┌─────────────────────────────────────────────────────────────┐
│                   OTEL COLLECTOR PIPELINE                   │
│                                                             │
│  RECEIVERS       PROCESSORS         EXPORTERS               │
│  ─────────       ──────────         ─────────               │
│  otlp (gRPC)  →  memory_limiter  →  otlp (Jaeger)           │
│  otlp (HTTP)  →  batch           →  prometheusremotewrite   │
│  prometheus   →  filter          →  loki                    │
│  jaeger       →  transform       →  kafka                   │
│  zipkin       →  resource        →  file (debug)            │
│  hostmetrics  →  tail_sampling   →  otlp (cloud)            │
└─────────────────────────────────────────────────────────────┘
Critically: pipelines are typed. A traces pipeline can only connect trace receivers to trace processors to trace exporters. You define separate pipelines for traces, metrics, and logs, though processors like resource and batch work across all signal types.
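To make the typing rule concrete, here is a minimal Python sketch of the kind of check it implies. It is not the Collector's actual validation code: the `SUPPORTED_SIGNALS` table and both function names are hypothetical, and the table lists only a handful of the components mentioned in this post.

```python
# Hypothetical model of signal-typed pipelines; not the Collector's real code.
SUPPORTED_SIGNALS = {
    "otlp": {"traces", "metrics", "logs"},
    "hostmetrics": {"metrics"},
    "memory_limiter": {"traces", "metrics", "logs"},
    "batch": {"traces", "metrics", "logs"},
    "tail_sampling": {"traces"},
    "prometheusremotewrite": {"metrics"},
}


def component_type(component_id: str) -> str:
    """Strip the optional '/name' suffix, e.g. 'otlp/jaeger' -> 'otlp'."""
    return component_id.split("/", 1)[0]


def validate_pipeline(signal: str, components: list[str]) -> list[str]:
    """Return the component IDs that cannot be used in a pipeline of this signal.

    Components missing from the table are reported too, since the sketch
    cannot vouch for them.
    """
    return [
        c for c in components
        if signal not in SUPPORTED_SIGNALS.get(component_type(c), set())
    ]


# A traces pipeline built from trace-capable components passes the check.
print(validate_pipeline("traces", ["otlp", "memory_limiter", "tail_sampling", "batch", "otlp/jaeger"]))
# tail_sampling only understands traces, so it is flagged in a metrics pipeline.
print(validate_pipeline("metrics", ["otlp", "tail_sampling", "prometheusremotewrite"]))
```

The same component ID (for example `otlp` or `batch`) can appear in several pipelines, but each pipeline only ever carries one signal type.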
Real Scenario: Migrating from Datadog to a Vendor-Neutral Stack
A 40-service Spring Boot microservices platform was spending $180k/year on Datadog. The migration goal: move to Prometheus + Jaeger + Loki without changing a single line of application code. The OTel Collector made this possible.
The applications were already instrumented with opentelemetry-spring-boot-starter. They were sending OTLP to Datadog via the Datadog Agent. The migration involved:
- Deploying OTel Collector as a Kubernetes DaemonSet (Agent mode)
- Reconfiguring each application's OTEL_EXPORTER_OTLP_ENDPOINT to point to the local Collector
- Configuring the Collector to export traces to Jaeger, metrics to Prometheus, and logs to Loki
- Running dual-export for 2 weeks (both Datadog and new stack) for validation
- Cutting over from Datadog to the new stack
Zero application code changes. Total migration time: 3 weeks.
Agent vs Gateway Deployment Modes
Agent Mode (DaemonSet)
One Collector instance per Kubernetes node, receiving telemetry from all pods on that node via localhost. Advantages: low-latency data collection, node-local resource enrichment (host metrics, pod metadata). Disadvantages: no tail-based sampling (data from a single trace may span multiple nodes and agents).
# agent-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1000
  resource:
    attributes:
      - key: k8s.node.name
        from_attribute: host.name
        action: insert
exporters:
  otlp/gateway:
    endpoint: otel-collector-gateway:4317
    tls:
      insecure: false
      ca_file: /etc/ssl/certs/ca.crt
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/gateway]
Gateway Mode (Deployment)
Centralized Collector deployment receiving from all agents. This is where stateful processors (tail-based sampling, span aggregation) live. Scale with horizontal pod autoscaling, but note that stateful processors need consistent hashing load balancing — all spans for a given trace ID must reach the same Collector instance.
Tail-Based Sampling: Why It Matters and How to Configure It
Head-based sampling (sampling at trace creation) is the naive approach: sample 10% of requests. The problem: errors are rare. If an error occurs in the 90% you dropped, you have no trace to debug with.
Tail-based sampling buffers spans for a configurable window, then decides whether to keep the trace based on its complete outcome — including error status, latency thresholds, and custom attributes. The trade-off: memory consumption and the need for all spans to reach the same Collector instance.
# gateway-collector-config.yaml: tail sampling configuration
processors:
  tail_sampling:
    decision_wait: 10s # Buffer traces for 10 seconds before deciding
    num_traces: 100000 # Max traces held in memory simultaneously
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always keep slow traces (>2 seconds)
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 2000
      # Keep 5% of healthy fast traces for baseline
      - name: baseline-sampling-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
      # Always keep traces with specific attributes (e.g., canary deployments)
      - name: canary-policy
        type: string_attribute
        string_attribute:
          key: deployment.environment
          values: [canary, staging]
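The decision the sampler makes for each buffered trace can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the processor's implementation: the `Span` class, `baseline_hit`, and `keep_trace` are hypothetical names, only three of the four policies above are modelled, and the hash-based baseline is a stand-in for whatever scheme the real probabilistic policy uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    start_ms: int
    end_ms: int
    is_error: bool = False


def baseline_hit(trace_id: str, percentage: float) -> bool:
    """Deterministic stand-in for probabilistic sampling (assumption:
    the real processor uses its own hashing scheme)."""
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest(), 16) % 10_000
    return bucket < percentage * 100


def keep_trace(spans: list[Span], threshold_ms: int = 2000, baseline_pct: float = 5.0) -> bool:
    """Evaluate one buffered trace; it is kept if any policy matches."""
    if any(s.is_error for s in spans):
        return True  # errors-policy
    # Trace duration: earliest span start to latest span end.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration > threshold_ms:
        return True  # slow-traces-policy
    return baseline_hit(spans[0].trace_id, baseline_pct)  # baseline-sampling-policy


error_trace = [Span("a", 0, 50), Span("a", 10, 40, is_error=True)]
slow_trace = [Span("b", 0, 2500)]
print(keep_trace(error_trace), keep_trace(slow_trace))
```

The point of the sketch is the ordering of information: none of these checks can run until every span of the trace has arrived, which is exactly why the sampler must buffer and why all spans of a trace must land on the same instance.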
Production Config with Spring Boot (OTLP over gRPC)
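# application.yaml: Spring Boot OTLP configuration
management:
  tracing:
    sampling:
      probability: 1.0 # Send 100% to Collector; let Collector do tail sampling
  otlp:
    tracing:
      # 4317 is the Collector's OTLP/gRPC port. Spring Boot's OTLP exporter defaults
      # to HTTP; the gRPC transport setting needs Spring Boot 3.4 or later. On older
      # versions, target http://otel-collector-agent:4318/v1/traces instead.
      endpoint: http://otel-collector-agent:4317
      transport: grpc
# Spring Boot dependencies (pom.xml)
# io.micrometer:micrometer-tracing-bridge-otel
# io.opentelemetry:opentelemetry-exporter-otlp
# io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter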
# Full gateway collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  batch:
    timeout: 5s
    send_batch_size: 2000
  filter/drop_health_checks:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/actuator/health"'
  transform/enrich:
    trace_statements:
      - context: span
        statements:
          - set(attributes["team"], "platform") where resource.attributes["service.name"] == "order-service"
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1500}
      - name: sample-5pct
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        k8s.pod.name: "pod"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_health_checks, transform/enrich, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
Scaling the Collector: Load Balancing for Stateful Processors
Stateless processors (batch, filter, transform) scale horizontally without coordination — any Collector instance can handle any span. Stateful processors — specifically tail_sampling and groupbytrace — require that all spans for a given trace arrive at the same Collector instance. Use the loadbalancingexporter in agent Collectors to route by trace ID:
# Agent collector: route to gateway by trace ID hash
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-collector-gateway-headless
        ports: [4317]
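The routing idea behind trace-ID load balancing can be illustrated with a small consistent-hash ring in Python. This is a sketch of the general technique, not the exporter's own code: the `HashRing` class, the SHA-256 hash, the virtual-node count, and the `gw-N` backend names are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, backends: list[str], vnodes: int = 100):
        # Each backend owns `vnodes` points on the ring to even out the load.
        self._ring = sorted((_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def backend_for(self, trace_id: str) -> str:
        # First ring point clockwise from the trace's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(trace_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["gw-0", "gw-1", "gw-2"])
# Every span of trace "4bf92f35" is sent to the same gateway instance.
print(ring.backend_for("4bf92f35"), ring.backend_for("4bf92f35"))
```

The reason to prefer a ring over a plain `hash % replicas` is what happens when the gateway scales: removing one backend only remaps the traces that backend owned, so in-flight traces on the other instances keep their buffered spans together.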
Failure Scenarios and Mitigation
Collector OOM
The memory_limiter processor must always be the first processor in every pipeline. When memory exceeds the limit, the Collector refuses new data and returns a retryable error to senders (which will buffer and retry). Without it, a cardinality explosion or traffic spike kills the Collector process and you lose all buffered data.
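A rough Python model of the limiter's two thresholds may help when choosing values. It assumes the documented relationship that the soft limit equals `limit_mib` minus `spike_limit_mib`; the function name, the returned labels, and the exact comparison operators are illustrative simplifications rather than the processor's real behaviour.

```python
def memory_limiter_state(used_mib: int, limit_mib: int = 512, spike_limit_mib: int = 128) -> str:
    """Classify current memory usage against the limiter's soft and hard limits."""
    soft_limit = limit_mib - spike_limit_mib  # 384 MiB with the agent values above
    if used_mib >= limit_mib:
        return "refuse data and force garbage collection"
    if used_mib >= soft_limit:
        return "refuse data"
    return "accept data"


for used in (300, 400, 600):
    print(used, "MiB ->", memory_limiter_state(used))
```

With the agent settings from earlier (512 MiB limit, 128 MiB spike allowance), refusal starts at 384 MiB, so the container memory limit should sit comfortably above 512 MiB to leave the limiter room to act before the kernel does.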
Data Loss on Collector Restart
The Collector is stateless by default — in-flight data at restart time is lost. For production, use the file_storage extension to enable persistent queuing:
extensions:
  file_storage/persistent_queue:
    directory: /var/lib/otelcol/queue
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    sending_queue:
      enabled: true
      storage: file_storage/persistent_queue
      num_consumers: 10
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
service:
  # The extension must also be enabled here, or the queue cannot use it
  extensions: [file_storage/persistent_queue]
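To see what these retry settings mean in wall-clock terms, the following Python sketch computes an exponential backoff schedule from the three values above. It is a simplification: the growth multiplier of 1.5 is an assumption, and random jitter, which real backoff implementations typically add, is left out.

```python
def retry_schedule(initial: float = 5.0, max_interval: float = 30.0,
                   max_elapsed: float = 300.0, multiplier: float = 1.5) -> list[float]:
    """Return the wait (in seconds) before each retry until the time budget is spent."""
    waits, elapsed, interval = [], 0.0, initial
    while elapsed + interval <= max_elapsed:
        waits.append(interval)
        elapsed += interval
        # Grow the interval, but never beyond max_interval.
        interval = min(interval * multiplier, max_interval)
    return waits


schedule = retry_schedule()
print(len(schedule), "retries over", sum(schedule), "seconds")
```

Under these assumptions a failing backend is retried for just under five minutes before the batch is given up, so the persistent queue needs enough capacity to absorb at least that much traffic.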
Cardinality Explosions
A single service adding high-cardinality labels (user ID, request ID) to metrics can generate millions of unique time series, crashing Prometheus. The filter and transform processors are your circuit breakers:
processors:
  transform/drop_high_cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user.id")
          - delete_key(attributes, "request.id")
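The arithmetic behind a cardinality explosion is simple multiplication, which this Python sketch makes explicit. The label cardinalities are made-up illustrative numbers, not measurements, and the product is a worst-case upper bound that assumes every combination of label values actually occurs.

```python
from math import prod


def series_upper_bound(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: the product of
    the number of distinct values of each label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for a single HTTP metric.
labels = {"http.method": 5, "http.status_code": 10, "user.id": 100_000}
before = series_upper_bound(labels)
after = series_upper_bound({k: v for k, v in labels.items() if k != "user.id"})
print(f"with user.id: {before:,} series; without: {after:,} series")
```

Dropping the one unbounded label takes the bound from millions of series to a few dozen, which is why removing the attribute at the Collector is such an effective circuit breaker.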
When NOT to Use the Collector
The OTel Collector adds operational overhead: it is another process to deploy, monitor, and upgrade. Avoid it when:
- You have a single service sending to a single backend — direct OTLP export is simpler
- Running at the edge (IoT, embedded) where memory and CPU are constrained — use SDK-level direct export
- Your team has no Kubernetes/Docker operational experience: the Collector's YAML configuration has a significant learning curve
Use it when you have 5+ services, want backend flexibility, need tail-based sampling, or require cross-service attribute enrichment.
Transforming and Enriching Telemetry Data with Built-In Processors
The OTel Collector ships with a powerful set of built-in processors, but two deserve particular attention for production enrichment workflows: the transform processor and the attributes processor. Together they allow you to reshape telemetry data in flight without touching application code — the key capability that makes the Collector a real data pipeline rather than just a router.
Transform Processor
The transform processor uses the OpenTelemetry Transformation Language (OTTL) to express complex mutations across spans, metrics, and logs. OTTL is expression-based and operates on the OpenTelemetry data model directly, enabling conditional logic, type conversion, and attribute manipulation in a declarative YAML configuration.
processors:
  transform/enrich_spans:
    trace_statements:
      - context: span
        statements:
          # Promote HTTP status code to span status
          - set(status.code, STATUS_CODE_ERROR) where attributes["http.status_code"] >= 500
          # Normalize service version attribute from resource
          - set(attributes["service.version"], resource.attributes["service.version"])
          # Add environment tag if missing
          - set(attributes["deployment.environment"], "production") where resource.attributes["deployment.environment"] == nil
          # Collapse long column lists in span names generated by ORMs
          - replace_pattern(name, "SELECT .* FROM", "SELECT ... FROM")
    metric_statements:
      - context: datapoint
        statements:
          # Convert milliseconds to seconds. This form only rescales gauge and sum
          # data points; a histogram such as http.server.duration also needs its
          # sum and bucket bounds rescaled, and the metric unit updated.
          - set(value_double, value_double * 0.001) where metric.unit == "ms"
          # Dropping internal runtime metrics (names matching "^go\\..*") is a job for
          # the filter processor; the transform processor has no drop function.
    log_statements:
      - context: log_record
        statements:
          # Promote log severity from body text if not set by SDK
          - set(severity_number, SEVERITY_NUMBER_ERROR) where severity_number == SEVERITY_NUMBER_UNSPECIFIED and body == "ERROR"
          # Hash PII fields for compliance before export
          - set(attributes["user.id"], SHA256(attributes["user.id"])) where attributes["user.id"] != nil
Attributes Processor
The attributes processor handles simpler key-value operations: insert, update, upsert, delete, hash, extract (from regex), and convert. It is less powerful than the transform processor but significantly easier to read and audit in team settings:
processors:
  attributes/enrich:
    actions:
      # Add datacenter region to all telemetry
      - key: cloud.region
        value: eu-west-1
        action: insert
      # Rename legacy field to OTel semantic convention: copy it, then delete the original
      - key: http.method
        from_attribute: legacy.verb
        action: upsert
      - key: legacy.verb
        action: delete
      # Hash customer email for GDPR compliance before export
      - key: customer.email
        action: hash
      # Delete internal debugging attributes
      - key: debug.internal_id
        action: delete
      # Extract service tier from hostname pattern "svc-prod-api-01".
      # For extract, `key` is the source attribute and each named group
      # becomes a new attribute (here: "tier").
      - key: host.name
        pattern: ^svc-(?P<tier>[a-z]+)-.*
        action: extract
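Extraction patterns are easy to get subtly wrong, so it can be worth checking one against sample values before deploying it. The Python sketch below does that for the hostname pattern; `extract_attributes` is a hypothetical helper that mimics the idea of an extract action (named groups become attributes), not the processor's code. Python's `re` module happens to accept the same `(?P<name>...)` group syntax as the pattern above.

```python
import re

TIER_PATTERN = re.compile(r"^svc-(?P<tier>[a-z]+)-.*")


def extract_attributes(attrs: dict[str, str], key: str, pattern: re.Pattern) -> dict[str, str]:
    """Return a copy of attrs with one new attribute per named group
    that matched in attrs[key]; unchanged if there is no match."""
    result = dict(attrs)
    match = pattern.match(attrs.get(key, ""))
    if match:
        result.update(match.groupdict())
    return result


print(extract_attributes({"host.name": "svc-prod-api-01"}, "host.name", TIER_PATTERN))
print(extract_attributes({"host.name": "db-primary-01"}, "host.name", TIER_PATTERN))
```

The second call shows the case worth testing explicitly: a hostname that does not follow the convention simply produces no tier attribute, so dashboards that group by it need to tolerate missing values.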
A production enrichment pipeline typically chains both processors: the attributes processor handles simple key-value enrichment from static values or other attributes, while the transform processor handles conditional logic and cross-signal operations. Keep each processor's responsibility narrow and document each action — these configurations accumulate undocumented changes over time and become hard to reason about without inline comments.
Connector Component: Linking Pipelines for Metrics from Traces
The Connector component is one of the most powerful and underused OTel Collector features. Connectors act as both exporters and receivers simultaneously, enabling data to flow from one pipeline into another. This is how you generate metrics from traces — a pattern that eliminates the need for application-level metrics instrumentation for latency histograms, error rates, and throughput counters.
Span-to-Metrics Connector
The spanmetrics connector aggregates span data into RED (Rate, Errors, Duration) metrics automatically. These metrics populate your service latency dashboards without requiring explicit histogram instrumentation in every service:
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]
    # service.name, span.name, span.kind and status.code are added by default;
    # list only the extra dimensions here
    dimensions:
      - name: http.method
      - name: http.status_code
    exemplars:
      enabled: true
    metrics_flush_interval: 15s
    namespace: traces
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, spanmetrics] # spanmetrics acts as exporter here
    metrics:
      receivers: [prometheus, spanmetrics] # spanmetrics acts as receiver here
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
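What the connector computes can be pictured as a simple aggregation over finished spans. The Python sketch below is a simplified illustration, not the connector's implementation: `red_metrics` is a hypothetical function, spans are reduced to `(service, duration_ms, is_error)` tuples, and only the service name is used as a dimension. The bucket bounds mirror the configuration above, in milliseconds.

```python
from bisect import bisect_left
from collections import defaultdict

BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000]


def red_metrics(spans):
    """Aggregate spans into per-service call counts, error counts and a
    duration histogram (one extra overflow bucket at the end)."""
    out = defaultdict(lambda: {"calls": 0, "errors": 0, "buckets": [0] * (len(BUCKETS_MS) + 1)})
    for service, duration_ms, is_error in spans:
        m = out[service]
        m["calls"] += 1                # Rate
        m["errors"] += int(is_error)   # Errors
        # Duration: index of the first bucket whose upper bound is >= the duration.
        m["buckets"][bisect_left(BUCKETS_MS, duration_ms)] += 1
    return dict(out)


spans = [("order-service", 7, False), ("order-service", 300, True), ("order-service", 6000, False)]
print(red_metrics(spans)["order-service"])
```

One design consequence follows directly: the metrics only reflect spans that reach the connector, so placing it after a sampling step would skew rates and error ratios toward whatever the sampler keeps.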
Service-Graph Connector
The servicegraph connector generates metrics representing relationships between services inferred from trace data. It produces a graph of service call latencies, error rates, and request counts that can be visualised in Grafana's service map panel:
connectors:
  servicegraph:
    latency_histogram_buckets: [10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s]
    dimensions:
      - http.method
    store:
      ttl: 2s
      max_items: 1000
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, servicegraph]
    metrics:
      receivers: [prometheus, servicegraph]
      processors: [batch]
      exporters: [prometheusremotewrite]
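The inference step, turning spans into service-to-service edges, can be sketched as a join between client spans and the server spans they caused. This Python example is a simplified model with hypothetical names (`service_edges`, dictionary-shaped spans); it ignores the TTL-based store and the latency histograms that the real connector maintains.

```python
from collections import Counter


def service_edges(spans: list[dict]) -> Counter:
    """Count calls per (caller, callee) pair by matching each server span
    to the client span that is its parent within the same trace."""
    clients = {
        (s["trace_id"], s["span_id"]): s["service"]
        for s in spans if s["kind"] == "client"
    }
    edges = Counter()
    for s in spans:
        if s["kind"] != "server":
            continue
        caller = clients.get((s["trace_id"], s["parent_span_id"]))
        if caller is not None:  # unpaired server spans produce no edge
            edges[(caller, s["service"])] += 1
    return edges


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "service": "gateway", "kind": "client"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "service": "orders", "kind": "server"},
    {"trace_id": "t2", "span_id": "c", "parent_span_id": "zz", "service": "orders", "kind": "server"},
]
print(service_edges(spans))
```

The unpaired span in trace `t2` hints at why the store settings matter: both halves of a call must be seen within the TTL window, so a very short `ttl` or small `max_items` can silently drop edges under load.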
With the service-graph connector, you get automatic service topology maps driven purely by trace data — no manual service mesh configuration or sidecar annotations required. This is particularly valuable during incident response when you need to quickly identify which upstream service is the source of latency in a call chain.
Security Hardening the OTel Collector
The OTel Collector receives, processes, and forwards all observability data for your platform — making it a high-value target and a compliance boundary. A misconfigured Collector can leak sensitive data, act as a lateral movement vector, or become a denial-of-service amplifier. Production deployments require explicit security hardening at multiple layers.
Mutual TLS (mTLS) for Receiver and Exporter Authentication
All OTLP connections — both incoming from services and outgoing to backends — should use mutual TLS to authenticate both parties. The Collector's configtls package provides consistent TLS configuration across all components:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /certs/collector.crt
          key_file: /certs/collector.key
          client_ca_file: /certs/ca.crt # enforce mTLS: clients must present a cert
exporters:
  otlp/backend:
    endpoint: backend.internal:4317
    tls:
      cert_file: /certs/collector.crt
      key_file: /certs/collector.key
      ca_file: /certs/backend-ca.crt # verify backend's certificate
Kubernetes RBAC and Network Policies
In Kubernetes deployments, the Collector should run under a dedicated ServiceAccount with minimal permissions. No Collector instance should need cluster-admin. Use a tight NetworkPolicy to restrict which pods can send data to agent Collectors and which services gateway Collectors can reach:
# NetworkPolicy: allow inbound only from the application namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: otel-agent-ingress
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: otel-agent
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production
      ports:
        - protocol: TCP
          port: 4317 # gRPC
        - protocol: TCP
          port: 4318 # HTTP
  # Note: restricting egress like this also blocks DNS lookups; add a rule for
  # your cluster DNS (typically port 53) if the agent resolves the gateway by name.
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: observability
      ports:
        - protocol: TCP
          port: 4317
Additionally, configure resource limits on Collector pods to prevent a traffic spike from consuming unbounded cluster resources, and use readOnlyRootFilesystem: true in the pod security context unless the file_storage extension requires a writable mount path.
Scrubbing Sensitive Data in Flight
Use the redaction processor to automatically detect and redact sensitive values matching configurable patterns. This prevents PII, API keys, and credentials from propagating into observability backends where they may be accessible to a wider audience than the originating service:
processors:
  redaction:
    allow_all_keys: false
    allowed_keys:
      - http.method
      - http.status_code
      - http.url
      - service.name
      - trace.id
    blocked_values:
      # Redact credit card numbers
      - '\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b'
      # Redact JWT tokens (base64url segments use "-" and "_")
      - 'eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+'
      # Redact AWS secret key patterns
      - '(?:aws_secret_access_key|AWS_SECRET)[^\s]*\s*=\s*\S+'
    summary: debug
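The two mechanisms in this configuration, an allow-list of keys and a block-list of value patterns, can be prototyped in Python to test patterns against sample data before rollout. This is a sketch under assumptions: `redact` is a hypothetical helper, only the credit card pattern is included, and the `****` mask is an illustrative choice rather than a statement about the processor's exact output.

```python
import re

ALLOWED_KEYS = {"http.method", "http.status_code", "http.url", "service.name", "trace.id"}
BLOCKED_VALUES = [
    re.compile(r"\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b"),  # credit card numbers
]


def redact(attrs: dict) -> dict:
    """Drop attributes whose key is not allowed, then mask blocked
    patterns inside the remaining string values."""
    cleaned = {}
    for key, value in attrs.items():
        if key not in ALLOWED_KEYS:
            continue  # allow_all_keys: false -> unknown keys are removed
        if isinstance(value, str):
            for pattern in BLOCKED_VALUES:
                value = pattern.sub("****", value)
        cleaned[key] = value
    return cleaned


span_attrs = {
    "http.method": "POST",
    "http.status_code": 200,
    "http.url": "/pay?card=4111111111111111",
    "user.email": "jane@example.com",
}
print(redact(span_attrs))
```

Note that `http.url` survives the allow-list but still needed value-level masking; allowing a key does not make its contents safe, which is why both layers are configured together.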
OTel Collector Benchmark Results and Resource Planning
Sizing the OTel Collector correctly is critical: under-provisioned Collectors drop data under load; over-provisioned Collectors waste infrastructure budget. The following benchmarks were collected on a standard 2 vCPU / 4 GB instance running the OTel Collector v0.97 with a representative production pipeline (memory_limiter → batch → tail_sampling → otlp exporter).
| Spans/sec | CPU (2 vCPU) | Memory | Recommended Replicas | Notes |
|---|---|---|---|---|
| 1,000 | 5–8% | 120 MB | 1 (agent DaemonSet) | Comfortable for small services |
| 5,000 | 18–25% | 250 MB | 2 (gateway HPA min) | Add tail sampling at this tier |
| 15,000 | 55–65% | 600 MB | 3–4 (enable HPA) | Tail sampling adds significant CPU overhead |
| 30,000 | 80–90% | 1.1 GB | 6+ (shard by service hash) | Consider dedicated gateway tier |
| 50,000+ | Saturation | 2 GB+ | Scale out + increase to 4 vCPU | Use load-balancing exporter |
Key scaling thresholds to monitor and act on before data loss occurs:
- CPU > 70%: Add replicas before the Collector starts dropping data — it cannot shed load gracefully under CPU saturation
- Memory > 80% of memory_limiter limit: The limiter will start refusing new spans; scale out or increase the limit
- Queue depth > 5,000: The exporter is falling behind; increase num_consumers or reduce backend latency
- Batch timeout > 10s: Batches are forming too slowly; reduce send_batch_max_size or increase concurrency
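These thresholds are straightforward to encode as an automated check. The Python sketch below turns the first three into a function; the function name and the way readings are passed in are hypothetical, and in practice the inputs would come from the Collector's own metrics rather than literals.

```python
def scaling_signals(cpu_pct: float, memory_mib: float, limiter_limit_mib: float,
                    queue_depth: int) -> list[str]:
    """Return the actions suggested by the thresholds listed above."""
    signals = []
    if cpu_pct > 70:
        signals.append("add replicas")
    if memory_mib > 0.8 * limiter_limit_mib:
        signals.append("scale out or increase the memory_limiter limit")
    if queue_depth > 5000:
        signals.append("increase num_consumers or reduce backend latency")
    return signals


# Example reading: 75% CPU, 1700 MiB used against a 2048 MiB limit, shallow queue.
print(scaling_signals(cpu_pct=75, memory_mib=1700, limiter_limit_mib=2048, queue_depth=100))
```

Evaluating all conditions rather than stopping at the first match matters here, because CPU and memory pressure usually appear together and call for the same remedy.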
For a 40-service microservices platform generating 10k–20k spans/second, a 3-replica gateway Collector deployment with 2 vCPU and 2 GB memory per pod, autoscaled by the otelcol_processor_batch_batch_size_trigger_send metric, provides adequate headroom with graceful burst handling. Always benchmark with your specific processor chain — tail-based sampling is 2–3x more CPU-intensive than simple batching.
Monitoring the Collector Itself
The Collector exposes its own Prometheus metrics on port 8888 by default. Critical metrics to alert on include otelcol_receiver_refused_spans (data being dropped at ingress), otelcol_exporter_queue_size (backpressure from slow backends), otelcol_processor_dropped_spans (tail sampler rejecting spans above memory limits), and go_memstats_heap_inuse_bytes (heap growth trending toward the memory limiter threshold). Build a dedicated "Collector Health" Grafana dashboard with these metrics and include it in your platform on-call runbook. A Collector silently dropping 5% of spans is difficult to detect without explicit monitoring — the symptom appears as missing traces in Jaeger, which engineers typically attribute to application-level instrumentation issues rather than the pipeline itself. This misattribution adds significant mean-time-to-resolution overhead during incidents. Instrument the Collector as rigorously as any other production service in your platform.
Key Takeaways
- The OTel Collector's pipeline model (receivers → processors → exporters) is the key to understanding its power and constraints
- Always put memory_limiter as the first processor; it is the safety valve preventing OOM crashes
- Tail-based sampling requires stateful Collector instances with consistent hash load balancing by trace ID
- Agent + Gateway topology is the production standard: agents handle local enrichment and fan-out, gateways handle sampling and routing
- Use file_storage persistent queuing for exporters to prevent data loss on Collector restarts
- The migration from Datadog to a vendor-neutral stack requires zero application code changes when applications use OTLP
- Use the spanmetrics and servicegraph connectors to generate RED metrics and service topology maps directly from trace data without additional instrumentation
Conclusion
The OpenTelemetry Collector transforms observability from vendor-dependent infrastructure into a programmable data pipeline. Its value is not just in vendor neutrality — it is in the ability to enrich, filter, sample, and route telemetry data independently of your application code. Building it correctly from the start — with proper memory limits, persistent queuing, and the right deployment topology — determines whether it becomes a reliable cornerstone of your platform or a source of production incidents. The patterns in this post represent production-tested configurations for services at significant scale.