What is distributed tracing and why do I need it?

Distributed tracing tracks a single request as it flows through multiple microservices, databases, and queues. Without it, debugging latency issues means manually correlating logs across services, which is nearly impossible at scale. With tracing, you see a visual timeline (waterfall) of every service call, database query, and external API call that made up the request, which cuts root cause analysis from hours to minutes.

What is the difference between OpenTelemetry, Jaeger, and Zipkin?

OpenTelemetry is the instrumentation standard (vendor-neutral API + SDK for generating traces, metrics, logs). Jaeger and Zipkin are trace backends (storage + UI for visualizing traces). Use OTel to instrument your code once, then send data to Jaeger, Zipkin, Tempo (Grafana), or commercial APMs like Datadog/Dynatrace — all without changing your application code.

How does trace context propagate between microservices?

Trace context is propagated via HTTP headers. The W3C traceparent header carries a version, the trace-id (globally unique for the entire request chain), the parent-id (the span ID of the calling operation), and trace flags. Spring Boot 3 with Micrometer Tracing automatically injects traceparent on outgoing RestTemplate/WebClient calls and extracts it on incoming requests, provided the clients are built from the auto-configured RestTemplateBuilder or WebClient.Builder, so you get end-to-end traces without manual propagation code.

What sampling rate should I use in production?

At low traffic (under 100 req/s), 100% sampling is fine. At medium traffic (100-1000 req/s), sample about 10%, using either tail-based or parent-based sampling. At high traffic (10K+ req/s), use 1% or less, but still keep 100% of error traces and slow traces (over a 1s threshold). Keeping every error or slow trace requires tail-based sampling, because errors and total latency are only known once the trace is complete. Tail-based sampling in the OTel Collector evaluates the complete trace before deciding, so it preserves the interesting traces while discarding repetitive successful ones.

How do I add business context to traces (like order ID, user ID)?

Use span attributes and baggage. Span attributes are key-value pairs on the current span, visible in the trace backend. Baggage is propagated downstream to all child services; by default it travels with the request rather than being recorded on each span. In Spring Boot, use the @Observed annotation or call tracer.currentSpan().tag(key, value) on an injected Tracer to add attributes. Add orderId, userId, tenantId as attributes so you can search for traces by business entity in Jaeger/Zipkin.

Microservices April 11, 2026 · 22 min read

Distributed Tracing with OpenTelemetry & Spring Boot: Complete Production Guide (2026)

A complete guide to implementing distributed tracing across Spring Boot microservices: OpenTelemetry Java agent vs SDK, Micrometer Tracing auto-instrumentation, custom spans, trace context propagation via W3C traceparent, Jaeger and Zipkin backends, sampling strategies, and Grafana Tempo integration.

Md Sanwar Hossain
Senior Java & Backend Engineer

TL;DR: Spring Boot 3 + Micrometer Tracing + OTel instruments HTTP server and client calls out of the box; Kafka, JDBC, and Redis tracing may need an extra property or library depending on your versions. Add custom spans for business operations; use W3C traceparent for end-to-end trace propagation; send to Jaeger/Zipkin/Tempo via OTLP.

1. Core Concepts: Traces, Spans, Propagation

Trace: A complete record of one request as it flows through your entire system. Every trace has a globally unique traceId (a 128-bit value, written as 32 hex characters).
Span: A single unit of work within a trace (e.g., HTTP request, DB query, Kafka publish). Each span has a spanId, start/end time, parent span ID, status, and key-value attributes.
Parent-child relationship: When Service A calls Service B, Service B creates a child span with Service A's span as the parent. This forms the trace tree (waterfall view).
Trace context propagation: The traceId and parent spanId are forwarded to all downstream services via HTTP headers (W3C traceparent) or message headers (Kafka). Automatic in Spring Boot 3.
Attributes vs Events: Attributes are metadata on the span (userId, orderId, HTTP status). Events are time-stamped annotations within a span (e.g., "cache miss", "retry attempt 2").
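
The parent-child relationship described above can be sketched with plain Java, without any tracing library. This is a minimal illustration, assuming a simplified, hypothetical SpanData record (real SDK spans carry status, attributes, and events as well); it only shows how parent span IDs turn a flat list of spans into the indented waterfall a trace backend displays.

```java
import java.util.ArrayList;
import java.util.List;

public class TraceTreeSketch {

    // Hypothetical, simplified span record; not a type from OpenTelemetry or Micrometer.
    record SpanData(String traceId, String spanId, String parentSpanId,
                    String name, long startMs, long endMs) {
        long durationMs() {
            return endMs - startMs;
        }
    }

    // Renders spans as an indented waterfall: each child appears under its parent.
    static List<String> waterfall(List<SpanData> spans) {
        List<String> lines = new ArrayList<>();
        for (SpanData span : spans) {
            if (span.parentSpanId() == null) { // a root span has no parent
                render(span, spans, 0, lines);
            }
        }
        return lines;
    }

    private static void render(SpanData span, List<SpanData> all, int depth, List<String> out) {
        out.add("  ".repeat(depth) + span.name() + " (" + span.durationMs() + " ms)");
        for (SpanData candidate : all) {
            if (span.spanId().equals(candidate.parentSpanId())) {
                render(candidate, all, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        String traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
        List<SpanData> spans = List.of(
            new SpanData(traceId, "a1", null, "POST /api/orders", 0, 320),
            new SpanData(traceId, "b2", "a1", "inventory.deduct", 10, 130),
            new SpanData(traceId, "c3", "b2", "SELECT product", 15, 95),
            new SpanData(traceId, "d4", "a1", "order.payment", 140, 310));
        waterfall(spans).forEach(System.out::println);
    }
}
```

Running main prints the root span first, then inventory.deduct with its database query nested one level deeper, then order.payment. All four spans share one traceId, which is what lets a backend group them into a single trace.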

2. Tooling: OTel Java Agent vs Micrometer Tracing

Approach	How	Instrumentation	Best For
OTel Java Agent	JVM -javaagent flag	Auto (bytecode)	Legacy apps, zero code change
Micrometer Tracing (Spring Boot 3)	Spring Boot starter	Auto + @Observed API	Spring Boot microservices (recommended)
OTel SDK direct	SDK dependency	Manual Span API	Full control, non-Spring apps

Recommendation: Use Micrometer Tracing for Spring Boot 3+ apps — it bridges to the OTel SDK under the hood, auto-instruments RestTemplate, WebClient, Feign, Spring Data, Kafka, and Redis, and integrates with Spring's observation API.

3. Spring Boot 3 Setup: Zero-Code Auto-Instrumentation

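// pom.xml: Micrometer Tracing with OTel bridge + OTLP export

<!-- Actuator: brings in Spring Boot's tracing auto-configuration -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Micrometer Tracing bridge to the OpenTelemetry SDK -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<!-- OTel OTLP exporter (sends to Jaeger/Tempo/any OTLP-compatible backend) -->
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>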

# application.yml — tracing config

management:
  tracing:
    sampling:
      probability: 1.0   # 100% for dev; 0.1 for prod (10%)
    propagation:
      type: w3c           # W3C traceparent (recommended)
  otlp:
    tracing:
      endpoint: http://jaeger:4318/v1/traces
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans  # only if using Zipkin; needs opentelemetry-exporter-zipkin on the classpath

spring:
  application:
    name: order-service   # appears as service name in trace backend

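With this config, Spring Boot 3 instruments incoming HTTP requests and outbound RestTemplate/WebClient calls without code changes, as long as the clients are built from the auto-configured RestTemplateBuilder or WebClient.Builder. Other integrations depend on versions and settings: Feign relies on Spring Cloud OpenFeign's Micrometer support, Spring Kafka needs observation enabled on the template and listener, @Scheduled tasks are observed from Spring Framework 6.1 onward, and JDBC/JPA spans typically need an additional library such as datasource-micrometer. Verify each integration in your trace backend rather than assuming coverage.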

4. Custom Spans: @Observed & Tracer API

// ❌ BAD: No business context in traces — only technical spans

// You see: "POST /api/orders" taking 3s — but WHY? Which sub-operation is slow?

// ✅ GOOD: Custom spans with business attributes + @Observed annotation

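// Option 1: @Observed annotation (declarative, AOP-based)
// Requires spring-boot-starter-aop and an ObservedAspect bean, unless your Spring Boot version auto-configures one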
@Observed(name = "order.payment", contextualName = "processPayment",
          lowCardinalityKeyValues = {"payment.provider", "stripe"})
public PaymentResult processPayment(Order order) {
    return stripeService.charge(order);
}

// Option 2: Tracer API for fine-grained control
@Service
public class InventoryService {
    @Autowired private Tracer tracer;

    public void deductInventory(String productId, int qty) {
        Span span = tracer.nextSpan()
            .name("inventory.deduct")
            .tag("product.id", productId)
            .tag("quantity", String.valueOf(qty))
            .start();
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {  // span was already started above; do not start it twice
            // Business logic
            Product p = productRepository.findById(productId).orElseThrow();
            if (p.getStock() < qty) {
                span.tag("error", "insufficient_stock");
                span.event("insufficient_stock_detected");
                throw new InsufficientStockException(productId);
            }
            p.setStock(p.getStock() - qty);
            productRepository.save(p);
            span.tag("new.stock", String.valueOf(p.getStock()));
        } catch (Exception ex) {
            span.error(ex);
            throw ex;
        } finally {
            span.end();  // always end the span
        }
    }
}

5. Trace Context Propagation

Spring Boot 3 with Micrometer Tracing propagates the W3C traceparent header automatically for:

RestTemplate / WebClient / Feign: Auto-injects traceparent on all outgoing HTTP calls
Spring Kafka: Injects/extracts trace context in Kafka message headers
Incoming requests: Extracts traceparent from incoming HTTP headers to continue the trace

W3C traceparent format: 00-{traceId}-{parentSpanId}-{flags}. Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
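
To make the header layout concrete, here is a minimal sketch of parsing a traceparent value and deriving the header a service would send downstream. It uses only the JDK; the class and method names (TraceparentSketch, parse, forOutgoingCall) are hypothetical, and it handles version 00 only. Micrometer Tracing does all of this for you, so treat it as an illustration of the format rather than code to ship.

```java
import java.security.SecureRandom;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TraceparentSketch {

    // Version 00 layout: 00-<32 hex traceId>-<16 hex parentId>-<2 hex flags>, lowercase hex.
    private static final Pattern FORMAT = Pattern.compile(
        "^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
    private static final SecureRandom RANDOM = new SecureRandom();

    record TraceContext(String version, String traceId, String parentSpanId, String flags) {
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled() {
            return (Integer.parseInt(flags, 16) & 0x01) == 1;
        }

        String toHeader() {
            return version + "-" + traceId + "-" + parentSpanId + "-" + flags;
        }
    }

    static TraceContext parse(String header) {
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent: " + header);
        }
        String traceId = m.group(1);
        String parentSpanId = m.group(2);
        // The spec treats all-zero trace and parent ids as invalid.
        if (traceId.chars().allMatch(c -> c == '0') || parentSpanId.chars().allMatch(c -> c == '0')) {
            throw new IllegalArgumentException("All-zero ids are invalid: " + header);
        }
        return new TraceContext("00", traceId, parentSpanId, m.group(3));
    }

    // For an outgoing call the traceId and flags stay the same;
    // the current service's span id becomes the parent id seen downstream.
    static TraceContext forOutgoingCall(TraceContext incoming, String currentSpanId) {
        return new TraceContext(incoming.version(), incoming.traceId(), currentSpanId, incoming.flags());
    }

    // Generates a random non-zero 64-bit span id as 16 hex characters.
    static String newSpanId() {
        long value;
        do {
            value = RANDOM.nextLong();
        } while (value == 0L);
        return String.format("%016x", value);
    }

    public static void main(String[] args) {
        TraceContext incoming = parse("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
        System.out.println("traceId = " + incoming.traceId());
        System.out.println("sampled = " + incoming.sampled());
        System.out.println("outgoing = " + forOutgoingCall(incoming, newSpanId()).toHeader());
    }
}
```

The key point is in forOutgoingCall: the trace ID never changes along the request chain, while the parent ID is rewritten at every hop to the span that makes the call.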

// ✅ GOOD: Baggage for cross-service business context propagation

// application.yml (every service): baggage fields must be declared to cross service boundaries
//   management.tracing.baggage.remote-fields: tenant.id,user.id

// In order-service: set business baggage that all downstream services see
// (tracer is an injected io.micrometer.tracing.Tracer; createBaggageInScope assumes Micrometer Tracing 1.1+)
@PostMapping("/orders")
public ResponseEntity<Order> createOrder(@RequestBody OrderRequest request) {
    // Baggage travels in the W3C "baggage" header, alongside traceparent (not inside it)
    try (BaggageInScope tenant = tracer.createBaggageInScope("tenant.id", request.getTenantId());
         BaggageInScope user = tracer.createBaggageInScope("user.id", request.getUserId())) {
        return ResponseEntity.ok(orderService.create(request));
    }
}

// In inventory-service: read baggage without any code coupling
@Service
public class InventoryService {
    @Autowired private Tracer tracer;

    public void deductStock(String productId, int qty) {
        Baggage tenant = tracer.getBaggage("tenant.id");
        String tenantId = (tenant != null) ? tenant.get() : null;
        // tenantId arrives via the propagated baggage header, not via method parameters
        log.info("Deducting stock for tenant={} product={}", tenantId, productId);
    }
}
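
On the wire, baggage is a comma-separated list of key=value pairs in an HTTP header named baggage, as defined by the W3C Baggage specification. The following is a minimal sketch of that encoding using only the JDK, assuming hypothetical helper names (BaggageHeaderSketch, format, parse) and ignoring the spec's size limits. The tracing library does this for you; the sketch only shows what downstream services actually receive.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class BaggageHeaderSketch {

    // Builds a header value such as "tenant.id=acme,user.id=42".
    static String format(Map<String, String> entries) {
        return entries.entrySet().stream()
            .map(e -> e.getKey() + "=" + encode(e.getValue()))
            .collect(Collectors.joining(","));
    }

    // Parses a header value back into an ordered map, dropping optional ";property" suffixes.
    static Map<String, String> parse(String header) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String member : header.split(",")) {
            String keyValue = member.split(";", 2)[0].trim();
            int eq = keyValue.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed members
            }
            String key = keyValue.substring(0, eq).trim();
            String value = keyValue.substring(eq + 1).trim();
            result.put(key, decode(value));
        }
        return result;
    }

    // URLEncoder produces form encoding ("+" for space); baggage values use percent-encoding.
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Protect literal "+" before decoding, since URLDecoder would turn it into a space.
    private static String decode(String value) {
        return URLDecoder.decode(value.replace("+", "%2B"), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> baggage = new LinkedHashMap<>();
        baggage.put("tenant.id", "acme corp");
        baggage.put("user.id", "42");
        String header = format(baggage);
        System.out.println("baggage: " + header);
        System.out.println("parsed:  " + parse(header));
    }
}
```

Because every entry is copied onto every outgoing request, keep baggage small and avoid putting sensitive values in it.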

6. Tracing Through Kafka Messages

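// ✅ GOOD: Kafka trace propagation with Spring Kafka + Micrometer Tracing

// Prerequisite (application.yml): observation is off by default in Spring Kafka
//   spring.kafka.template.observation-enabled: true
//   spring.kafka.listener.observation-enabled: true
// Producer: with observation enabled, KafkaTemplate injects the trace headers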
@Service
public class OrderEventPublisher {
    @Autowired private KafkaTemplate<String, OrderCreatedEvent> kafkaTemplate;

    public void publish(OrderCreatedEvent event) {
        // traceparent header is auto-added to Kafka message headers — no manual code!
        kafkaTemplate.send("order-created", event.getOrderId(), event);
    }
}

// Consumer: trace automatically continued from message headers
@KafkaListener(topics = "order-created", groupId = "inventory-group")
@Observed(name = "kafka.order.inventory.process")
public void handleOrderCreated(OrderCreatedEvent event) {
    // This span is automatically a child of the producer's span — full trace!
    inventoryService.deductInventory(event.getProductId(), event.getQuantity());
}

7. Backends: Jaeger, Zipkin & Grafana Tempo

Backend	Protocol	Storage	Best For
Jaeger	OTLP / Thrift UDP	Cassandra, Elasticsearch, Badger	Self-hosted, mature UI, Kubernetes-native
Zipkin	HTTP JSON / OTLP	In-memory, MySQL, Elasticsearch	Lightweight, simple setup, dev environments
Grafana Tempo	OTLP	Object storage (S3, GCS)	Production scale, correlate with Loki logs & Prometheus

8. Sampling Strategies

Strategy	Decision Point	Pros	Cons
Head-based (probabilistic)	At trace start	Low overhead	Discards interesting error traces at low rates
Tail-based (in OTel Collector)	After trace complete	✅ Sample ALL errors, slow traces	Higher memory in collector
Always-on for errors	Status code check	Never miss error traces	Requires custom sampler
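
The difference between the strategies in the table can be sketched as a decision function. This is a minimal plain-Java illustration, assuming hypothetical names (TailSamplingSketch, CompletedTrace) and thresholds; in production, tail-based sampling is usually configured in the OTel Collector rather than written in application code. The sketch keeps every error trace and every slow trace, then falls back to a probabilistic decision derived from the trace ID so that repeated evaluations of the same trace agree.

```java
public class TailSamplingSketch {

    // Hypothetical summary of a finished trace, as a tail sampler would see it.
    record CompletedTrace(String traceId, boolean hasError, long durationMs) {}

    private final long slowThresholdMs;
    private final double baselineRatio;

    TailSamplingSketch(long slowThresholdMs, double baselineRatio) {
        this.slowThresholdMs = slowThresholdMs;
        this.baselineRatio = baselineRatio;
    }

    // Tail-based decision: made only after the whole trace is known.
    boolean keep(CompletedTrace trace) {
        if (trace.hasError()) {
            return true; // never drop error traces
        }
        if (trace.durationMs() > slowThresholdMs) {
            return true; // never drop slow traces
        }
        return ratioDecision(trace.traceId());
    }

    // Head-style probabilistic decision, deterministic per trace id:
    // the lower 64 bits of the id are mapped to a position in [0, 1).
    boolean ratioDecision(String traceId) {
        long low = Long.parseUnsignedLong(traceId.substring(16), 16);
        double position = (low >>> 11) / (double) (1L << 53);
        return position < baselineRatio;
    }

    public static void main(String[] args) {
        TailSamplingSketch policy = new TailSamplingSketch(1000, 0.10);
        String id = "4bf92f3577b34da6a3ce929d0e0e4736";
        System.out.println("error trace kept: " + policy.keep(new CompletedTrace(id, true, 40)));
        System.out.println("slow trace kept:  " + policy.keep(new CompletedTrace(id, false, 2500)));
        System.out.println("fast OK trace kept: " + policy.keep(new CompletedTrace(id, false, 40)));
    }
}
```

A purely head-based sampler can only run ratioDecision, because errors and duration are unknown when the trace starts. That is why it loses interesting traces at low rates, and why the tail-based variant costs more memory: spans must be buffered until the trace completes.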

9. Correlating Logs with Traces: MDC Integration

# application.yml — auto-inject traceId/spanId into every log line

logging:
  pattern:
    console: "%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} [%X{traceId},%X{spanId}] - %msg%n"
  level:
    io.micrometer.tracing: DEBUG   # see trace propagation in logs during debugging

# Result: every log line contains the traceId
# 09:15:42.001 [nio-8080-exec-1] INFO  OrderService [4bf92f3577b34da6a...] - Processing order 123
# Click traceId in Grafana/Jaeger to see the full request waterfall!

10. Production Observability Stack

The modern production observability stack for Spring Boot microservices in 2026:

Metrics: Micrometer + Prometheus + Grafana (JVM, business metrics)
Logs: Logback/Log4j2 → Loki (Grafana) or Elasticsearch (ELK)
Traces: Micrometer Tracing → OTel Collector → Grafana Tempo (all services)
Correlation: traceId in all three systems — click a log line to see the trace, click a slow trace to see related logs
Alerting: Prometheus AlertManager for metric-based alerts; Grafana for cross-signal alerts

11. Interview Questions & Observability Checklist

Q: A request takes 5 seconds but your health check says all services are healthy. How do you debug it?

A: Open the trace for that specific request in Jaeger/Tempo. The waterfall view shows which span takes 5 seconds — whether it's a database query, an external API call, or a specific microservice. Drill into that span's attributes (SQL query, endpoint URL). Cross-reference with logs for that traceId to get application-level context. Without distributed tracing, this investigation takes hours; with it, minutes.

✅ Distributed Tracing Production Checklist

Micrometer Tracing + OTel bridge in all services
Service name set per service (spring.application.name)
W3C traceparent propagation enabled
Custom spans for critical business operations
Business attributes on spans (orderId, userId)
traceId in log pattern (MDC)
Sampling 10% in prod; 100% for errors
Trace IDs in error API responses
Grafana Tempo linked to Loki logs
OTel Collector as sidecar (buffer + retry)

12. At BRAC IT: How Distributed Tracing Transformed Incident Response

Before we implemented distributed tracing, debugging cross-service failures at BRAC IT was a multi-hour archaeology project. We would identify the failing endpoint, find its logs in Kibana, look for errors, find a downstream service mentioned in the error, switch to that service's logs, filter by the approximate timestamp, find a new downstream reference, and repeat. With 20+ services, a 20-hop request chain could require opening 8 different Kibana queries in sequence. Average time to identify root cause: 2–4 hours.

After deploying OpenTelemetry with Grafana Tempo as the backend, every distributed request has a single trace ID that flows through all services. When an incident fires, the investigation starts with one query in Tempo: search for the trace ID from the failing request (included in our error API responses as X-Trace-ID). The complete waterfall diagram appears instantly, showing every service call, database query, and Kafka message publication in the correct sequence with timing. Root cause identification time dropped from 3 hours to 25 minutes on average.

Our most impactful incident: in October 2025, our loan officer portal was returning HTTP 504 timeouts on approximately 5% of loan application submissions. The trace showed a 20-hop request chain where hop 14 — a call to our credit bureau integration service — was taking 8 seconds instead of the expected 300 milliseconds. Hop 14 was waiting on a database query. The query plan showed a full table scan on a 40-million-row table because a recent migration had accidentally dropped a composite index. The fix: restore the index. Total investigation time: 18 minutes. Without distributed tracing: estimated 4+ hours.

13. Correlating Traces with Logs and Metrics

Distributed tracing's full value is realised when you can jump between the three pillars of observability — traces, logs, and metrics — without losing context. The key is injecting trace context into logs automatically, so every log line contains the trace ID of the active request:

# logback-spring.xml: inject traceId, spanId into every log line
<pattern>
  %d{ISO8601} [%thread] %-5level %logger{36}
  traceId=%mdc{traceId:-NONE}
  spanId=%mdc{spanId:-NONE}
  - %message%n
</pattern>

# Output example:
# 2026-04-28T10:32:01.234 [http-nio-8080-exec-1] INFO LoanService
# traceId=4bf92f3577b34da6a3ce929d0e0e4736
# spanId=00f067aa0ba902b7
# - Processing loan application for borrower b-uuid-123

With trace IDs in logs, the Grafana Explore view provides three-way navigation:

Trace → Logs: Click a span in Tempo, see the logs generated during that span in Loki
Logs → Trace: Click a traceId in Kibana/Loki, jump directly to the trace in Tempo
Metrics → Trace: From a Grafana alert (high latency metric), drill down to example traces showing slow requests

This three-way correlation collapses the "find the log, find the trace, find the metric" workflow into a single click. Configure Grafana Tempo as a datasource in Grafana, then add a "Derived Fields" configuration in your Loki datasource that creates a clickable link from any log line containing a traceId.
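
A derived field is driven by a regular expression with one capture group that pulls the trace ID out of each log line. Before pasting a pattern into Grafana, it can help to check it against sample log lines. The sketch below does that in plain Java, assuming the traceId=... log layout shown earlier and a hypothetical class name (TraceIdExtractorSketch). Grafana evaluates the expression with its own regex engine, so keep the pattern simple and verify it there as well.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TraceIdExtractorSketch {

    // One capture group: the 32 hex characters following "traceId=".
    static final Pattern TRACE_ID = Pattern.compile("traceId=([0-9a-f]{32})");

    static Optional<String> extract(String logLine) {
        Matcher m = TRACE_ID.matcher(logLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String traced = "2026-04-28T10:32:01.234 INFO LoanService "
            + "traceId=4bf92f3577b34da6a3ce929d0e0e4736 spanId=00f067aa0ba902b7 - Processing loan application";
        String untraced = "2026-04-28T10:32:02.000 INFO Scheduler traceId=NONE spanId=NONE - Heartbeat";

        System.out.println(extract(traced));   // contains the trace id
        System.out.println(extract(untraced)); // empty: NONE is not a valid id
    }
}
```

Requiring exactly 32 hex characters means lines logged outside a trace (traceId=NONE) produce no link instead of a broken one.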

14. OpenTelemetry Semantic Conventions: Why Consistency Matters

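OpenTelemetry semantic conventions define standard attribute names for common operations, for example db.system, db.statement, http.request.method, http.response.status_code, messaging.system, and messaging.destination.name. Several names changed as the conventions stabilised (http.method became http.request.method, and http.status_code became http.response.status_code), so pin one semantic-conventions version across services. When every service uses the same standard names, your dashboards, alerts, and queries work across services without customisation.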

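Prefer the attribute-key constants published in the opentelemetry-semconv artifact over string literals in your custom spans. This prevents typos and keeps your spans consistent with auto-instrumented spans. The snippet below uses the OpenTelemetry API Span (setAttribute) rather than Micrometer's Span (tag), and the SemanticAttributes class from older semconv releases; newer releases split the constants into per-domain classes, so check the version you depend on: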

// Wrong: string literals are error-prone
span.setAttribute("db.system", "postgresql");
span.setAttribute("http.method", "POST");

// Correct: use SemanticAttributes constants
span.setAttribute(SemanticAttributes.DB_SYSTEM, DbSystemValues.POSTGRESQL);
span.setAttribute(SemanticAttributes.HTTP_REQUEST_METHOD, "POST");
span.setAttribute(SemanticAttributes.HTTP_ROUTE, "/api/v1/loans");

// For Kafka producers:
span.setAttribute(SemanticAttributes.MESSAGING_SYSTEM, "kafka");
span.setAttribute(SemanticAttributes.MESSAGING_DESTINATION_NAME, "loan-events");
span.setAttribute(SemanticAttributes.MESSAGING_MESSAGE_ID, messageId);

Auto-instrumentation covers the most common scenarios: HTTP calls via RestTemplate/WebClient, JDBC queries, gRPC calls, Kafka producers and consumers, and Redis operations. Manual spans are needed for: business logic that deserves its own span for performance monitoring, external API calls via custom clients, and complex computations whose duration you want to track separately. The rule: if you want to set an alert on a specific operation's duration, it needs its own span.

Last updated: April 11, 2026
