OpenTelemetry Collector Deep Dive: Building Vendor-Neutral Observability Pipelines

Vendor lock-in in observability is a slow tax: you barely feel it until you try to leave. The OpenTelemetry Collector is the infrastructure that makes telemetry portable across backends. This post is not a getting-started guide; it is a production engineering deep dive covering pipeline architecture, tail-based sampling, scaling strategies, and the failure modes that bite teams after they go live.

Md Sanwar Hossain March 2026 12 min read DevOps

Table of Contents

  1. OTel Collector Architecture: The Pipeline Model
  2. Real Scenario: Migrating from Datadog to a Vendor-Neutral Stack
  3. Agent vs Gateway Deployment Modes
  4. Tail-Based Sampling: Why It Matters and How to Configure It
  5. Production Config with Spring Boot (OTLP over gRPC)
  6. Scaling the Collector: Load Balancing for Stateful Processors
  7. Failure Scenarios and Mitigation
  8. When NOT to Use the Collector
  9. Conclusion

OTel Collector Architecture: The Pipeline Model

The OpenTelemetry Collector is a vendor-agnostic proxy for observability data. Its architecture is a three-stage pipeline:

┌─────────────────────────────────────────────────────────────────┐
│                   OTEL COLLECTOR PIPELINE                       │
│                                                                 │
│  RECEIVERS          PROCESSORS             EXPORTERS           │
│  ─────────          ──────────             ─────────           │
│  otlp (gRPC)   →    memory_limiter    →    otlp (Jaeger)       │
│  otlp (HTTP)   →    batch             →    prometheusremote    │
│  prometheus    →    filter            →    loki                │
│  jaeger        →    transform         →    kafka               │
│  zipkin        →    resource          →    file (debug)        │
│  hostmetrics   →    tail_sampling     →    otlp (cloud)        │
└─────────────────────────────────────────────────────────────────┘

Critically: pipelines are typed. A traces pipeline can only connect trace receivers to trace processors to trace exporters. You define separate pipelines for traces, metrics, and logs, though processors like resource and batch work across all signal types.
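
As a minimal sketch of what typed pipelines look like in a config file, one otlp receiver and one batch processor definition can be referenced by three separate pipelines. The debug exporter is a stand-in chosen for illustration and is not part of the setup described in this post:

```yaml
# Illustrative only: one receiver and one processor definition,
# referenced by three separately typed pipelines.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  debug: {}        # stand-in exporter; writes telemetry to the Collector log

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```

Each pipeline gets its own instance of the batch processor, so batches never mix signal types even though the definition is shared.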

Real Scenario: Migrating from Datadog to a Vendor-Neutral Stack

A 40-service Spring Boot microservices platform was spending $180k/year on Datadog. The migration goal: move to Prometheus + Jaeger + Loki without changing a single line of application code. The OTel Collector made this possible.

The applications were already instrumented with opentelemetry-spring-boot-starter. They were sending OTLP to Datadog via the Datadog Agent. The migration involved:

  1. Deploying OTel Collector as a Kubernetes DaemonSet (Agent mode)
  2. Reconfiguring application OTEL_EXPORTER_OTLP_ENDPOINT to point to the local Collector
  3. Configuring Collector to export traces to Jaeger, metrics to Prometheus, logs to Loki
  4. Running dual-export for 2 weeks (both Datadog and new stack) for validation
  5. Cutting over from Datadog

Zero application code changes. Total migration time: 3 weeks.
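
The dual-export phase in step 4 might look roughly like the fragment below. This is a sketch, not the configuration used in the migration: the datadog exporter comes from the collector-contrib distribution, the DD_API_KEY environment variable is an assumption, and the receiver and processor definitions are omitted.

```yaml
# Sketch of the dual-export phase (step 4). Names are illustrative.
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  datadog:
    api:
      key: ${env:DD_API_KEY}     # assumed to be injected from a Secret
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, datadog]   # every span goes to both backends
```

Listing two exporters in one pipeline fans the same data out to both, which is what makes a side-by-side comparison possible. Cutting over is then a matter of removing one exporter name.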

Agent vs Gateway Deployment Modes

Agent Mode (DaemonSet)

One Collector instance per Kubernetes node, receiving telemetry from all pods on that node over the node-local network (typically the node IP via a hostPort, since a DaemonSet pod does not share localhost with application pods). Advantages: low-latency data collection, node-local resource enrichment (host metrics, pod metadata). Disadvantages: no tail-based sampling, because the spans of a single trace may be spread across multiple nodes and agents.

# agent-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1000
  resource:
    attributes:
      - key: k8s.node.name
        from_attribute: host.name
        action: insert

exporters:
  otlp/gateway:
    endpoint: otel-collector-gateway:4317
    tls:
      insecure: false
      ca_file: /etc/ssl/certs/ca.crt

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, batch last so enriched data is batched once
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/gateway]
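
Because a DaemonSet agent runs in its own pod, application pods usually reach it through the node's IP. One common pattern is to inject that IP with the Kubernetes Downward API. The fragment below is a sketch that assumes the agent exposes hostPort 4317 and that the application uses OpenTelemetry SDK autoconfiguration; NODE_IP is a name chosen for illustration.

```yaml
# Fragment of an application Deployment's container spec (illustrative).
env:
  - name: NODE_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP        # IP of the node this pod runs on
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://$(NODE_IP):4317"     # NODE_IP must be declared before this entry
```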

Gateway Mode (Deployment)

Centralized Collector deployment receiving from all agents. This is where stateful processors (tail-based sampling, span aggregation) live. Scale with horizontal pod autoscaling, but note that stateful processors need consistent hashing load balancing — all spans for a given trace ID must reach the same Collector instance.
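
A CPU-based HorizontalPodAutoscaler for the gateway could be sketched as follows. The resource names, replica bounds, and the 60% target are assumptions for illustration, not values from the deployment described here.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector-gateway       # hypothetical name
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector-gateway
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

Every scale event changes the set of gateway instances that trace IDs hash onto, so traces in flight at that moment can be split across instances. Conservative scaling bounds keep that churn low.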

Tail-Based Sampling: Why It Matters and How to Configure It

Head-based sampling (sampling at trace creation) is the naive approach: sample 10% of requests. The problem: errors are rare. If an error occurs in the 90% you dropped, you have no trace to debug with.

Tail-based sampling buffers spans for a configurable window, then decides whether to keep the trace based on its complete outcome — including error status, latency thresholds, and custom attributes. The trade-off: memory consumption and the need for all spans to reach the same Collector instance.

# gateway-collector-config.yaml — tail sampling configuration
processors:
  tail_sampling:
    decision_wait: 10s          # Buffer traces for 10 seconds before deciding
    num_traces: 100000          # Max traces held in memory simultaneously
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always keep slow traces (>2 seconds)
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 2000

      # Keep 5% of healthy fast traces for baseline
      - name: baseline-sampling-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

      # Always keep traces with specific attributes (e.g., canary deployments)
      - name: canary-policy
        type: string_attribute
        string_attribute:
          key: deployment.environment
          values: [canary, staging]
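
The memory trade-off can be estimated from the same settings. The annotated fragment below is a back-of-the-envelope sketch; the spans-per-trace and bytes-per-span figures are assumptions, so measure your own traffic before relying on them.

```yaml
processors:
  tail_sampling:
    # Traces held at steady state ~= new traces/sec x decision_wait
    #   1,000 traces/s x 10 s = 10,000 traces
    decision_wait: 10s
    expected_new_traces_per_sec: 1000
    # 100,000 leaves roughly 10x headroom over the 10,000 steady-state estimate.
    num_traces: 100000
    # Rough buffer size, ASSUMING 8 spans per trace at about 1 KB each:
    #   10,000 traces x 8 spans x 1 KB ~= 80 MB
    policies:
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
```

If the number of in-flight traces exceeds num_traces, older traces are evicted before a decision is made, so sustained bursts above the estimate show up as missing traces rather than as an error.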

Production Config with Spring Boot (OTLP over gRPC)

# application.yaml: Spring Boot OTLP configuration
management:
  tracing:
    sampling:
      probability: 1.0  # Send 100% to Collector; let Collector do tail sampling
  otlp:
    tracing:
      # gRPC transport needs Spring Boot 3.4+. On earlier versions the built-in
      # exporter speaks OTLP/HTTP: use http://otel-collector-agent:4318/v1/traces.
      transport: grpc
      endpoint: http://otel-collector-agent:4317

# Spring Boot dependencies (pom.xml)
# io.micrometer:micrometer-tracing-bridge-otel
# io.opentelemetry:opentelemetry-exporter-otlp
# io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter

# Full gateway collector config
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  batch:
    timeout: 5s
    send_batch_size: 2000
  filter/drop_health_checks:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/actuator/health"'
  transform/enrich:
    trace_statements:
      - context: span
        statements:
          - set(attributes["team"], "platform") where resource.attributes["service.name"] == "order-service"
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1500}
      - name: sample-5pct
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheusremotewrite:
    # Prometheus must run with --web.enable-remote-write-receiver
    endpoint: http://prometheus:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true
  # The dedicated loki exporter and its labels block are deprecated in
  # collector-contrib. Loki 3.x ingests OTLP natively, so otlphttp is used
  # here; service.name and k8s.pod.name are among the resource attributes
  # Loki promotes to index labels by default.
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_health_checks, transform/enrich, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/loki]

Scaling the Collector: Load Balancing for Stateful Processors

Stateless processors (batch, filter, transform) scale horizontally without coordination — any Collector instance can handle any span. Stateful processors — specifically tail_sampling and groupbytrace — require that all spans for a given trace arrive at the same Collector instance. Use the loadbalancingexporter in agent Collectors to route by trace ID:

# Agent collector — route to gateway by trace ID hash
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-collector-gateway-headless
        ports: [4317]
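
For the resolver to see individual gateway pods rather than a single virtual IP, the target Service is normally headless. A sketch of such a Service follows; the namespace and the pod label are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector-gateway-headless
  namespace: observability          # assumed namespace
spec:
  clusterIP: None                   # headless: endpoints list individual pod IPs
  selector:
    app: otel-collector-gateway     # hypothetical pod label
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```

The k8s resolver also needs RBAC permission to read the endpoints of this Service. Without it the exporter has no backends to route to.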

Failure Scenarios and Mitigation

Collector OOM

The memory_limiter processor must always be the first processor in every pipeline. When memory exceeds the limit, the Collector refuses new data and returns a retryable error to senders (which will buffer and retry). Without it, a cardinality explosion or traffic spike kills the Collector process and you lose all buffered data.
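
A minimal sketch of that ordering, using percentage-based limits so the same config works across pod sizes. The percentages are illustrative, and the otlp receiver and otlp/gateway exporter are assumed to be defined elsewhere in the file.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80         # hard limit as a share of total available memory
    spike_limit_percentage: 20   # refusal starts at (limit - spike), here 60%
  batch:
    timeout: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter first
      exporters: [otlp/gateway]
```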

Data Loss on Collector Restart

The Collector is stateless by default — in-flight data at restart time is lost. For production, use the file_storage extension to enable persistent queuing:

extensions:
  file_storage/persistent_queue:
    directory: /var/lib/otelcol/queue

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    sending_queue:
      enabled: true
      storage: file_storage/persistent_queue
      num_consumers: 10
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  # An extension only starts if it is listed here
  extensions: [file_storage/persistent_queue]
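
The queue directory only survives a restart if it sits on storage that outlives the container. A pod spec fragment along these lines is one way to arrange that; the volume and claim names are hypothetical.

```yaml
# Fragment of a Collector pod spec (illustrative; names are hypothetical).
containers:
  - name: otel-collector
    volumeMounts:
      - name: queue
        mountPath: /var/lib/otelcol/queue   # must match file_storage directory
volumes:
  - name: queue
    persistentVolumeClaim:
      claimName: otel-collector-queue       # hypothetical PVC
```

With several replicas, each instance needs its own directory. A StatefulSet with volumeClaimTemplates is the usual way to get one volume per pod.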

Cardinality Explosions

A single service adding high-cardinality labels (user ID, request ID) to metrics can generate millions of unique time series, crashing Prometheus. The filter and transform processors are your circuit breakers:

processors:
  transform/drop_high_cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user.id")
          - delete_key(attributes, "request.id")
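
Deleting known offenders is a deny-list. An allow-list is stricter, because a new high-cardinality label is dropped without anyone having to notice it first. A sketch using the OTTL keep_keys function follows; the processor name and the list of keys are assumptions for illustration.

```yaml
processors:
  transform/allowlist_metric_attributes:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          # Keep only the listed attribute keys; everything else is removed.
          - keep_keys(attributes, ["http.method", "http.status_code", "http.route"])
```

Removing attributes can leave several data points with identical label sets, which some backends treat as duplicate samples. Check the result in a staging environment before rolling it out.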

When NOT to Use the Collector

The OTel Collector adds operational overhead: it is another process to deploy, monitor, and upgrade. That cost is worth paying when you have 5+ services, want backend flexibility, need tail-based sampling, or require cross-service attribute enrichment. If none of those apply, exporting directly from the application is simpler.
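
For the case where the Collector's overhead is not justified, a Spring Boot service can export straight to a backend that accepts OTLP. This is a sketch: the endpoint assumes a Jaeger instance listening for OTLP over HTTP on port 4318, and the 10% sampling rate is illustrative.

```yaml
# application.yaml: exporting straight to a backend, with no Collector in between
management:
  tracing:
    sampling:
      probability: 0.1   # head-based sampling, since nothing downstream can tail-sample
  otlp:
    tracing:
      endpoint: http://jaeger-collector:4318/v1/traces   # assumed OTLP/HTTP endpoint
```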

Writing Custom Processors: Transforming and Enriching Telemetry Data

The OTel Collector ships with a powerful set of built-in processors, but two deserve particular attention for production enrichment workflows: the transform processor and the attributes processor. Together they allow you to reshape telemetry data in flight without touching application code — the key capability that makes the Collector a real data pipeline rather than just a router.

Transform Processor

The transform processor uses the OpenTelemetry Transformation Language (OTTL) to express complex mutations across spans, metrics, and logs. OTTL is expression-based and operates on the OpenTelemetry data model directly, enabling conditional logic, type conversion, and attribute manipulation in a declarative YAML configuration.

processors:
  transform/enrich_spans:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Promote HTTP status code to span status
          - set(status.code, STATUS_CODE_ERROR) where attributes["http.status_code"] >= 500
          # Normalize service version attribute from resource
          - set(attributes["service.version"], resource.attributes["service.version"])
          # Add environment tag if missing
          - set(attributes["deployment.environment"], "production") where resource.attributes["deployment.environment"] == nil
          # Truncate excessively long span names generated by ORMs
          - replace_pattern(name, "SELECT .* FROM", "SELECT ... FROM")
    metric_statements:
      - context: metric
        statements:
          # Convert milliseconds to seconds for latency histograms
          # (scale_metric is available in recent collector-contrib releases)
          - scale_metric(0.001, "s") where name == "http.server.duration" and unit == "ms"
    log_statements:
      - context: log_record
        statements:
          # Promote log severity from body text if not set by SDK
          - set(severity_number, SEVERITY_NUMBER_ERROR) where severity_number == SEVERITY_NUMBER_UNSPECIFIED and IsMatch(body, "ERROR")
          # Hash PII fields for compliance before export
          - set(attributes["user.id"], SHA256(attributes["user.id"])) where attributes["user.id"] != nil

  # The transform processor has no drop() function; dropping telemetry
  # is the filter processor's job.
  filter/drop_runtime_metrics:
    error_mode: ignore
    metrics:
      metric:
        # Drop internal runtime metrics not needed in the backend
        - 'IsMatch(name, "^go\\..*")'

Attributes Processor

The attributes processor handles simpler key-value operations: insert, update, upsert, delete, hash, extract (from regex), and convert. It is less powerful than the transform processor but significantly easier to read and audit in team settings:

processors:
  attributes/enrich:
    actions:
      # Add datacenter region to all telemetry
      - key: cloud.region
        value: eu-west-1
        action: insert
      # Copy legacy field into the OTel semantic-convention key, then remove it
      - key: http.method
        from_attribute: legacy.verb
        action: upsert
      - key: legacy.verb
        action: delete
      # Hash customer email for GDPR compliance before export
      - key: customer.email
        action: hash
      # Delete internal debugging attributes
      - key: debug.internal_id
        action: delete
      # Extract service tier from hostname pattern "svc-prod-api-01".
      # For extract, key is the source attribute and each named capture group
      # becomes a new attribute, so this yields service_tier=prod. host.name
      # must be present as a span/log/datapoint attribute, not only on the resource.
      - key: host.name
        pattern: ^svc-(?P<service_tier>[a-z]+)-.*
        action: extract

A production enrichment pipeline typically chains both processors: the attributes processor handles simple key-value enrichment from static values or other attributes, while the transform processor handles conditional logic across traces, metrics, and logs. Keep each processor's responsibility narrow and document each action: these configurations accumulate undocumented changes over time and become hard to reason about without inline comments.
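
Chained in a traces pipeline, the two processors defined above could be ordered like this. It is a sketch that assumes the otlp receiver, memory_limiter, batch, and otlp/jaeger components are defined as elsewhere in this post.

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors:
        - memory_limiter
        - attributes/enrich        # static and copy-style enrichment first
        - transform/enrich_spans   # conditional OTTL logic on the enriched data
        - batch
      exporters: [otlp/jaeger]
```

Processors run in the order listed, so anything the transform statements depend on has to be set by an earlier entry.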

Connector Component: Linking Pipelines for Metrics from Traces

The Connector component is one of the most powerful and underused OTel Collector features. Connectors act as both exporters and receivers simultaneously, enabling data to flow from one pipeline into another. This is how you generate metrics from traces — a pattern that eliminates the need for application-level metrics instrumentation for latency histograms, error rates, and throughput counters.

Span-to-Metrics Connector

The spanmetrics connector aggregates span data into RED (Rate, Errors, Duration) metrics automatically. These metrics populate your service latency dashboards without requiring explicit histogram instrumentation in every service:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s]
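    dimensions:
      # service.name, span.name, span.kind, and status.code are added by default;
      # listing them again is rejected as a duplicate dimension
      - name: http.method
      - name: http.status_code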
    exemplars:
      enabled: true
    metrics_flush_interval: 15s
    namespace: traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, spanmetrics]    # spanmetrics acts as exporter here
    metrics:
      receivers: [prometheus, spanmetrics]      # spanmetrics acts as receiver here
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

Service-Graph Connector

The servicegraph connector generates metrics representing relationships between services inferred from trace data. It produces a graph of service call latencies, error rates, and request counts that can be visualised in Grafana's service map panel:

connectors:
  servicegraph:
    latency_histogram_buckets: [10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s]
    dimensions:
      - http.method
    store:
      ttl: 2s
      max_items: 1000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, servicegraph]
    metrics:
      receivers: [prometheus, servicegraph]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

With the service-graph connector, you get automatic service topology maps driven purely by trace data — no manual service mesh configuration or sidecar annotations required. This is particularly valuable during incident response when you need to quickly identify which upstream service is the source of latency in a call chain.

Security Hardening the OTel Collector

The OTel Collector receives, processes, and forwards all observability data for your platform — making it a high-value target and a compliance boundary. A misconfigured Collector can leak sensitive data, act as a lateral movement vector, or become a denial-of-service amplifier. Production deployments require explicit security hardening at multiple layers.

Mutual TLS (mTLS) for Receiver and Exporter Authentication

All OTLP connections — both incoming from services and outgoing to backends — should use mutual TLS to authenticate both parties. The Collector's configtls package provides consistent TLS configuration across all components:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /certs/collector.crt
          key_file:  /certs/collector.key
          client_ca_file: /certs/ca.crt   # enforce mTLS — clients must present cert

exporters:
  otlp/backend:
    endpoint: backend.internal:4317
    tls:
      cert_file: /certs/collector.crt
      key_file:  /certs/collector.key
      ca_file:   /certs/backend-ca.crt   # verify backend's certificate
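
On the sending side, applications configured through OpenTelemetry SDK environment variables can present a client certificate with the standard OTLP exporter settings. The fragment below is a sketch: the certificate paths, the volume name, and the agent hostname are assumptions, and it does not apply to exporters configured only through Spring's management.otlp properties.

```yaml
# Fragment of an application container spec (illustrative paths and names).
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "https://otel-collector-agent:4317"
  - name: OTEL_EXPORTER_OTLP_CERTIFICATE          # CA used to verify the Collector
    value: /certs/ca.crt
  - name: OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE   # client cert presented to the Collector
    value: /certs/client.crt
  - name: OTEL_EXPORTER_OTLP_CLIENT_KEY
    value: /certs/client.key
volumeMounts:
  - name: otel-client-certs                       # hypothetical Secret-backed volume
    mountPath: /certs
    readOnly: true
```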

Kubernetes RBAC and Network Policies

In Kubernetes deployments, the Collector should run under a dedicated ServiceAccount with minimal permissions. No Collector instance should need cluster-admin. Use a tight NetworkPolicy to restrict which pods can send data to agent Collectors and which services gateway Collectors can reach:

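# NetworkPolicy: allow inbound only from the application namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: otel-agent-ingress
  namespace: observability
spec:
  podSelector:
    matchLabels:
      app: otel-agent
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production
      ports:
        - protocol: TCP
          port: 4317   # gRPC
        - protocol: TCP
          port: 4318   # HTTP
  egress:
    # Forward telemetry to the gateway in the same namespace
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: observability
      ports:
        - protocol: TCP
          port: 4317
    # Allow DNS lookups, otherwise the gateway hostname cannot be resolved
    # (assumes cluster DNS runs in kube-system)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53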

Additionally, configure resource limits on Collector pods to prevent a traffic spike from consuming unbounded cluster resources, and set readOnlyRootFilesystem: true in the container security context. If you use the file_storage extension, mount a dedicated writable volume at its directory rather than relaxing this setting.
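
Put together, the container-level settings might look like the fragment below. The CPU and memory figures are illustrative starting points, not recommendations derived from this post's benchmarks.

```yaml
# Fragment of a Collector pod spec (values are illustrative starting points).
containers:
  - name: otel-collector
    resources:
      requests:
        cpu: "500m"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 2Gi
    securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      capabilities:
        drop: ["ALL"]
```

Keep the memory_limiter setting comfortably below the container memory limit, so the Collector starts refusing data before the kernel kills the pod.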

Scrubbing Sensitive Data in Flight

Use the redaction processor to automatically detect and redact sensitive values matching configurable patterns. This prevents PII, API keys, and credentials from propagating into observability backends where they may be accessible to a wider audience than the originating service:

processors:
  redaction:
    allow_all_keys: false
    allowed_keys:
      - http.method
      - http.status_code
      - http.url
      - service.name
      - trace.id
    blocked_values:
      # Redact credit card numbers (Visa and Mastercard formats)
      - '\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b'
      # Redact JWT tokens (segments are base64url, so match - and _)
      - 'eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+'
      # Redact AWS secret key patterns
      - '(?:aws_secret_access_key|AWS_SECRET)[^\s]*\s*=\s*\S+'
    summary: debug

OTel Collector Benchmark Results and Resource Planning

Sizing the OTel Collector correctly is critical: under-provisioned Collectors drop data under load; over-provisioned Collectors waste infrastructure budget. The following benchmarks were collected on a standard 2 vCPU / 4 GB instance running the OTel Collector v0.97 with a representative production pipeline (memory_limiter → batch → tail_sampling → otlp exporter).

| Spans/sec | CPU (2 vCPU) | Memory | Recommended Replicas           | Notes                                       |
|-----------|--------------|--------|--------------------------------|---------------------------------------------|
| 1,000     | 5–8%         | 120 MB | 1 (agent DaemonSet)            | Comfortable for small services              |
| 5,000     | 18–25%       | 250 MB | 2 (gateway HPA min)            | Add tail sampling at this tier              |
| 15,000    | 55–65%       | 600 MB | 3–4 (enable HPA)               | Tail sampling adds significant CPU overhead |
| 30,000    | 80–90%       | 1.1 GB | 6+ (shard by service hash)     | Consider dedicated gateway tier             |
| 50,000+   | Saturation   | 2 GB+  | Scale out + increase to 4 vCPU | Use load-balancing exporter                 |

For a 40-service microservices platform generating 10k–20k spans/second, a 3-replica gateway Collector deployment with 2 vCPU and 2 GB memory per pod, autoscaled by the otelcol_processor_batch_batch_size_trigger_send metric, provides adequate headroom with graceful burst handling. Always benchmark with your specific processor chain — tail-based sampling is 2–3x more CPU-intensive than simple batching.

Monitoring the Collector Itself

The Collector exposes its own Prometheus metrics on port 8888 by default. Critical metrics to alert on include otelcol_receiver_refused_spans (data being refused at ingress), otelcol_exporter_queue_size (backpressure from slow backends), otelcol_processor_dropped_spans (spans discarded inside a processor), and otelcol_process_runtime_heap_alloc_bytes (heap growth trending toward the memory limiter threshold). Exact metric names and suffixes vary between Collector versions, so verify them against the /metrics output of the version you run. Build a dedicated "Collector Health" Grafana dashboard with these metrics and include it in your platform on-call runbook. A Collector silently dropping 5% of spans is difficult to detect without explicit monitoring: the symptom appears as missing traces in Jaeger, which engineers typically attribute to application-level instrumentation issues rather than the pipeline itself. This misattribution adds significant mean-time-to-resolution overhead during incidents. Instrument the Collector as rigorously as any other production service in your platform.
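
Two of these signals translate directly into Prometheus alerting rules. The rules below are a sketch with illustrative thresholds and alert names. Depending on the Collector version, counters may carry a _total suffix, so check the names against your own /metrics output.

```yaml
# Prometheus alerting rules (illustrative thresholds).
# Counters may be exposed with a _total suffix, for example
# otelcol_receiver_refused_spans_total.
groups:
  - name: otel-collector-health
    rules:
      - alert: CollectorRefusingSpans
        expr: rate(otelcol_receiver_refused_spans[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Collector is refusing spans at ingress"
      - alert: CollectorExporterQueueFilling
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Exporter queue above 80% of capacity"
```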

Conclusion

The OpenTelemetry Collector transforms observability from vendor-dependent infrastructure into a programmable data pipeline. Its value is not just in vendor neutrality — it is in the ability to enrich, filter, sample, and route telemetry data independently of your application code. Building it correctly from the start — with proper memory limits, persistent queuing, and the right deployment topology — determines whether it becomes a reliable cornerstone of your platform or a source of production incidents. The patterns in this post represent production-tested configurations for services at significant scale.


Last updated: March 18, 2026