OpenTelemetry Collector Deep Dive: Building Vendor-Neutral Observability Pipelines

Vendor lock-in in observability is a slow tax — you barely feel it until you try to leave. OpenTelemetry Collector is the infrastructure that makes telemetry truly portable. This post is not a getting-started guide; it is a production engineering deep dive covering pipeline architecture, tail-based sampling, scaling strategies, and the failure modes that bite teams after they go live.

OTel Collector Architecture: The Pipeline Model

The OpenTelemetry Collector is a vendor-agnostic proxy for observability data. Its architecture is a three-stage pipeline:

┌─────────────────────────────────────────────────────────────────┐
│                   OTEL COLLECTOR PIPELINE                       │
│                                                                 │
│  RECEIVERS          PROCESSORS             EXPORTERS           │
│  ─────────          ──────────             ─────────           │
│  otlp (gRPC)   →    memory_limiter    →    otlp (Jaeger)       │
│  otlp (HTTP)   →    batch             →    prometheusremote    │
│  prometheus    →    filter            →    loki                │
│  jaeger        →    transform         →    kafka               │
│  zipkin        →    resource          →    file (debug)        │
│  hostmetrics   →    tail_sampling     →    otlp (cloud)        │
└─────────────────────────────────────────────────────────────────┘

Critically: pipelines are typed. A traces pipeline can only connect trace receivers to trace processors to trace exporters. You define separate pipelines for traces, metrics, and logs, though processors like resource and batch work across all signal types.
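A minimal service section makes the typing concrete. This sketch wires illustrative components per signal; connecting the prometheus receiver to the traces pipeline would be rejected at startup, because it only emits metrics:

```yaml
# Minimal illustration: each signal type gets its own pipeline.
service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
```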

Real Scenario: Migrating from Datadog to a Vendor-Neutral Stack

A 40-service Spring Boot microservices platform was spending $180k/year on Datadog. The migration goal: move to Prometheus + Jaeger + Loki without changing a single line of application code. The OTel Collector made this possible.

The applications were already instrumented with opentelemetry-spring-boot-starter. They were sending OTLP to Datadog via the Datadog Agent. The migration involved:

  1. Deploying OTel Collector as a Kubernetes DaemonSet (Agent mode)
  2. Reconfiguring application OTEL_EXPORTER_OTLP_ENDPOINT to point to the local Collector
  3. Configuring Collector to export traces to Jaeger, metrics to Prometheus, logs to Loki
  4. Running dual-export for 2 weeks (both Datadog and new stack) for validation
  5. Cutting over and decommissioning Datadog

Zero application code changes. Total migration time: 3 weeks.
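The dual-export phase in step 4 is just a second exporter in the same pipeline. A sketch, assuming the contrib distribution's datadog exporter and a DD_API_KEY environment variable:

```yaml
# Dual-export during validation: the same traces go to both backends.
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger, datadog]  # fan-out to both; drop datadog at cutover
```

Because fan-out happens after the processors, both backends see identical data, which makes side-by-side validation meaningful.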

Agent vs Gateway Deployment Modes

Agent Mode (DaemonSet)

One Collector instance per Kubernetes node, receiving telemetry from all pods on that node via localhost. Advantages: low-latency data collection, node-local resource enrichment (host metrics, pod metadata). Disadvantages: no tail-based sampling (data from a single trace may span multiple nodes and agents).

# agent-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      disk:
      network:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1000
  resource:
    attributes:
      - key: k8s.node.name
        from_attribute: host.name
        action: insert

exporters:
  otlp/gateway:
    endpoint: otel-collector-gateway:4317
    tls:
      insecure: false
      ca_file: /etc/ssl/certs/ca.crt

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [memory_limiter, batch, resource]
      exporters: [otlp/gateway]

Gateway Mode (Deployment)

Centralized Collector deployment receiving from all agents. This is where stateful processors (tail-based sampling, span aggregation) live. Scale with horizontal pod autoscaling, but note that stateful processors need consistent hashing load balancing — all spans for a given trace ID must reach the same Collector instance.
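A sketch of the gateway topology in Kubernetes (names, image tag, and replica count are illustrative). The headless Service is what lets agent-side load balancing resolve individual gateway pods rather than a single virtual IP:

```yaml
# Gateway runs as a Deployment; agents discover pods via the headless Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector-gateway
  template:
    metadata:
      labels:
        app: otel-collector-gateway
    spec:
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          ports:
            - containerPort: 4317
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector-gateway-headless
spec:
  clusterIP: None   # headless: DNS returns individual pod IPs
  selector:
    app: otel-collector-gateway
  ports:
    - port: 4317
```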

Tail-Based Sampling: Why It Matters and How to Configure It

Head-based sampling (sampling at trace creation) is the naive approach: sample 10% of requests. The problem: errors are rare. If an error occurs in the 90% you dropped, you have no trace to debug with.

Tail-based sampling buffers spans for a configurable window, then decides whether to keep the trace based on its complete outcome — including error status, latency thresholds, and custom attributes. The trade-off: memory consumption and the need for all spans to reach the same Collector instance.

# gateway-collector-config.yaml — tail sampling configuration
processors:
  tail_sampling:
    decision_wait: 10s          # Buffer traces for 10 seconds before deciding
    num_traces: 100000          # Max traces held in memory simultaneously
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always keep slow traces (>2 seconds)
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 2000

      # Keep 5% of healthy fast traces for baseline
      - name: baseline-sampling-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

      # Always keep traces with specific attributes (e.g., canary deployments)
      - name: canary-policy
        type: string_attribute
        string_attribute:
          key: deployment.environment
          values: [canary, staging]
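Policies are OR-ed: a trace is kept if any one of them matches. To require several conditions at once, the processor also supports a composite and policy. A sketch (the service name is illustrative):

```yaml
      # Keep only slow traces from the checkout flow — both conditions must hold
      - name: slow-checkout-policy
        type: and
        and:
          and_sub_policy:
            - name: latency-part
              type: latency
              latency:
                threshold_ms: 1000
            - name: service-part
              type: string_attribute
              string_attribute:
                key: service.name
                values: [checkout-service]
```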

Production Config with Spring Boot (OTLP over gRPC)

# application.yaml — Spring Boot OTLP configuration
management:
  tracing:
    sampling:
      probability: 1.0  # Send 100% to Collector; let Collector do tail sampling
  otlp:
    tracing:
      endpoint: http://otel-collector-agent:4317
      transport: grpc  # Spring Boot defaults to OTLP/HTTP on 4318; grpc requires Boot 3.4+

# Spring Boot dependencies (pom.xml), Micrometer bridge approach:
#   io.micrometer:micrometer-tracing-bridge-otel
#   io.opentelemetry:opentelemetry-exporter-otlp
# Alternatively, use io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter alone.

# gateway-collector-config.yaml — full configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512
  batch:
    timeout: 5s
    send_batch_size: 2000
  filter/drop_health_checks:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/actuator/health"'
  transform/enrich:
    trace_statements:
      - context: span
        statements:
          - set(attributes["team"], "platform") where resource.attributes["service.name"] == "order-service"
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1500}
      - name: sample-5pct
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    resource_to_telemetry_conversion:
      enabled: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        k8s.pod.name: "pod"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop_health_checks, transform/enrich, tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Scaling the Collector: Load Balancing for Stateful Processors

Stateless processors (batch, filter, transform) scale horizontally without coordination — any Collector instance can handle any span. Stateful processors — specifically tail_sampling and groupbytrace — require that all spans for a given trace arrive at the same Collector instance. Use the loadbalancingexporter in agent Collectors to route by trace ID:

# Agent collector — route to gateway by trace ID hash
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-collector-gateway-headless
        ports: [4317]

Failure Scenarios and Mitigation

Collector OOM

The memory_limiter processor must always be the first processor in every pipeline. When memory exceeds the limit, the Collector refuses new data and returns a retryable error to senders (which will buffer and retry). Without it, a cardinality explosion or traffic spike kills the Collector process and you lose all buffered data.

Data Loss on Collector Restart

The Collector is stateless by default — in-flight data at restart time is lost. For production, use the file_storage extension to enable persistent queuing:

extensions:
  file_storage/persistent_queue:
    directory: /var/lib/otelcol/queue

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    sending_queue:
      enabled: true
      storage: file_storage/persistent_queue
      num_consumers: 10
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

Cardinality Explosions

A single service adding high-cardinality labels (user ID, request ID) to metrics can generate millions of unique time series, crashing Prometheus. The filter and transform processors are your circuit breakers:

processors:
  transform/drop_high_cardinality:
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user.id")
          - delete_key(attributes, "request.id")
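For metrics that are too high-cardinality to salvage, dropping the whole series at the Collector is cheaper than deleting attributes. A sketch using the filter processor's OTTL conditions (the metric name pattern is illustrative):

```yaml
processors:
  filter/drop_noisy_metrics:
    error_mode: ignore
    metrics:
      metric:
        # Drop any metric whose name matches; it never reaches Prometheus
        - 'IsMatch(name, "http\\.client\\.request\\.duration.*")'
```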

When NOT to Use the Collector

The OTel Collector adds operational overhead: it is another process to deploy, monitor, and upgrade. Avoid it when:

  • You have a single service sending to a single backend — direct OTLP export is simpler
  • Running at the edge (IoT, embedded) where memory and CPU are constrained — use SDK-level direct export
  • Your team has no Kubernetes/Docker operational experience — the Collector's configuration YAML has a significant learning curve

Use it when you have 5+ services, want backend flexibility, need tail-based sampling, or require cross-service attribute enrichment.

Key Takeaways

  • The OTel Collector's pipeline model (receivers → processors → exporters) is the key to understanding its power and constraints
  • Always put memory_limiter as the first processor — it is the safety valve preventing OOM crashes
  • Tail-based sampling requires stateful Collector instances with consistent hash load balancing by trace ID
  • Agent + Gateway topology is the production standard: agents handle local enrichment and fan-out, gateways handle sampling and routing
  • Use file_storage persistent queuing for exporters to prevent data loss on Collector restarts
  • The migration from Datadog to a vendor-neutral stack requires zero application code changes when applications already emit OTLP

Conclusion

The OpenTelemetry Collector transforms observability from vendor-dependent infrastructure into a programmable data pipeline. Its value is not just in vendor neutrality — it is in the ability to enrich, filter, sample, and route telemetry data independently of your application code. Building it correctly from the start — with proper memory limits, persistent queuing, and the right deployment topology — determines whether it becomes a reliable cornerstone of your platform or a source of production incidents. The patterns in this post represent production-tested configurations for services at significant scale.

