Md Sanwar Hossain - Senior Software Engineer

Senior Software Engineer · Microservices · Spring Boot · gRPC

Microservices · March 19, 2026 · 21 min read · Microservices Reliability Series

gRPC Bidirectional Streaming in Production Microservices: Flow Control, Deadlines, and Failure Recovery

gRPC's streaming RPCs unlock low-latency, high-throughput communication patterns that REST/HTTP JSON cannot match, but they introduce failure modes that are categorically different from unary RPCs. A bidirectional stream between a trading client and a market-data service can carry millions of messages per hour across a single long-lived HTTP/2 connection. When that stream fails silently, or when a slow consumer on one side causes the sender to exhaust its H2 flow-control window, the result is a frozen connection that looks healthy from the outside and produces cascading slowdowns across every service sharing that multiplexed HTTP/2 connection.

Table of Contents

  1. gRPC Streaming Types and When to Use Each
  2. HTTP/2 Flow Control: The Hidden Backpressure Mechanism
  3. Deadline Propagation Across Streaming RPCs
  4. Half-Close Semantics and Clean Stream Termination
  5. The Most Common gRPC Streaming Failure Modes
  6. Retry and Hedging Policies for Streaming RPCs
  7. Observability: Stream Health Metrics and Tracing
  8. Trade-offs and When NOT to Use gRPC Streaming

1. gRPC Streaming Types and When to Use Each

gRPC defines four RPC patterns. Unary (single request, single response) is the default and the safest for most service-to-service calls. Server streaming (single request, multiple responses) is ideal for subscriptions: pushing real-time updates to a client that issued one query. Client streaming (multiple requests, single response) is suited for data upload scenarios: a client sending a bulk batch of records and receiving a summary. Bidirectional streaming (multiple requests and responses on the same stream, interleaved) is the highest-power and highest-risk pattern.

Bidirectional streaming is appropriate when the communication pattern is inherently conversational — order book updates where a trading system sends order modifications and receives confirmations in real time, or an AI inference service where the client streams input tokens and receives output tokens concurrently. The key question before choosing bidirectional: can the problem be solved with server streaming plus client-side request batching? If yes, server streaming is safer and simpler.
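The four patterns map directly onto where the stream keyword appears in the service definition. A minimal sketch (the service and message names below are illustrative, not from a real API):

```protobuf
syntax = "proto3";

message Request  { string id = 1; }
message Response { string payload = 1; }

service ExampleService {
  // Unary: one request, one response (the default, safest pattern)
  rpc GetOne (Request) returns (Response);

  // Server streaming: one request, many responses (subscriptions)
  rpc Subscribe (Request) returns (stream Response);

  // Client streaming: many requests, one response (bulk upload)
  rpc Upload (stream Request) returns (Response);

  // Bidirectional: both sides stream, interleaved on one H2 stream
  rpc Converse (stream Request) returns (stream Response);
}
```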

2. HTTP/2 Flow Control: The Hidden Backpressure Mechanism

Every gRPC stream runs over an HTTP/2 stream, which has a per-stream flow-control window. The default initial window size is 65,535 bytes (per the HTTP/2 spec). As the receiving application consumes data, the receiver sends WINDOW_UPDATE frames that replenish the sender's window, so the window grows back only as fast as the application layer actually processes messages. If the application stops reading, the window reaches zero and the sender blocks: it cannot send more data until the window reopens.

Real incident: A logistics platform streamed real-time position updates from 50,000 GPS devices through a single gRPC bidirectional stream aggregator service. A downstream database write slow-down caused the aggregator's message-processing rate to drop from 10,000 msg/s to 800 msg/s. The H2 flow-control window filled within seconds. All 50,000 sender connections stalled, waiting for WINDOW_UPDATE. From the monitoring system, the aggregator service appeared healthy (CPU low, no errors) — it was simply blocked on H2 flow control waiting to send data it wasn't reading.

The fix is two-pronged: (1) Increase the initial H2 window size for streaming services to match the expected burst capacity (e.g., 4 MB per stream for high-throughput scenarios), and (2) decouple stream consumption from processing — read from the stream into an in-memory queue and process asynchronously, so the H2 window is always open even if the downstream processor is slow. The queue provides bounded buffering and applies backpressure via its own capacity limit.

// Spring Boot gRPC server (grpc-spring-boot-starter): tune the H2 flow-control window
@Bean
public GrpcServerConfigurer serverConfigurer() {
    return serverBuilder -> {
        if (serverBuilder instanceof NettyServerBuilder nettyBuilder) {
            nettyBuilder
                // Fixed 4 MB window per stream; note that setting a fixed window
                // also disables grpc-java's BDP-based window auto-tuning
                .flowControlWindow(4 * 1024 * 1024)
                .maxInboundMessageSize(16 * 1024 * 1024)  // reject messages over 16 MB
                .keepAliveTime(30, TimeUnit.SECONDS)      // ping idle connections every 30 s
                .keepAliveTimeout(5, TimeUnit.SECONDS)    // declare dead if no ACK within 5 s
                .permitKeepAliveWithoutCalls(true);       // accept client pings between RPCs
        }
    };
}
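The second fix, decoupling stream reads from processing, can be sketched without any gRPC types. The idea: the stream callback offers each message into a bounded queue and a separate worker drains it; when the queue is full, the caller stops requesting messages and H2 flow control takes over. All names below are illustrative, and onMessage stands in for the StreamObserver's onNext callback:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Bounded buffer between the gRPC read path and a possibly slow processor.
class BoundedStreamBuffer<T> {
    private final BlockingQueue<T> queue;

    BoundedStreamBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called from the stream callback (event-loop thread). Non-blocking:
    // a false return means "buffer full, stop requesting messages".
    boolean onMessage(T msg) {
        return queue.offer(msg);
    }

    // Called by the consumer thread; returns null on timeout.
    T take(long timeoutMs) {
        try {
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    int pending() {
        return queue.size();
    }
}
```

The queue's capacity is the bounded buffering the text describes: the H2 window stays open as long as the event loop can offer, and backpressure kicks in precisely when the processor falls behind by more than one queue's worth of messages.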

3. Deadline Propagation Across Streaming RPCs

gRPC deadlines (sent as the grpc-timeout header) are absolute point-in-time deadlines, not per-hop timeouts. A client setting a 5-second deadline means: the entire RPC chain — including all server-side processing and streaming — must complete within 5 seconds of the RPC start. If a server receives a gRPC call with a 1-second remaining deadline, it must propagate that deadline to its own outgoing gRPC calls — not set a fresh 5-second timeout.

// Server-side: the incoming RPC's Context already carries the client's
// deadline (from the grpc-timeout header) and is cancelled when it expires
@Override
public StreamObserver<TradeRequest> streamTrades(
        StreamObserver<TradeConfirmation> responseObserver) {

    // Capture the RPC context once; stream callbacks may run on other threads
    Context ctx = Context.current();

    return new StreamObserver<>() {
        @Override
        public void onNext(TradeRequest trade) {
            if (ctx.isCancelled()) {
                responseObserver.onError(Status.CANCELLED
                    .withDescription("client cancelled or upstream deadline exceeded")
                    .asRuntimeException());
                return;
            }
            // Check the remaining deadline before an expensive downstream call
            Deadline deadline = ctx.getDeadline();
            if (deadline != null && deadline.isExpired()) {
                responseObserver.onError(Status.DEADLINE_EXCEEDED.asRuntimeException());
                return;
            }
            // Outgoing stub calls made while this Context is current
            // inherit the remaining deadline automatically
            TradeConfirmation conf = downstreamService.process(trade);
            responseObserver.onNext(conf);
        }
        @Override public void onError(Throwable t) { /* client cancelled or failed */ }
        @Override public void onCompleted() { responseObserver.onCompleted(); }
    };
}
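Stripped of gRPC types, deadline propagation is arithmetic on one absolute timestamp: every hop forwards what remains of the original budget, never a fresh one. A gRPC-free sketch (class and method names are illustrative):

```java
import java.util.concurrent.TimeUnit;

// Deadline propagation reduced to a single absolute timestamp (nanoTime-based).
// Each hop computes what is left and forwards that, never a new 5 s budget.
class PropagatedDeadline {
    private final long deadlineNanos;

    PropagatedDeadline(long timeoutMillisFromNow, long nowNanos) {
        this.deadlineNanos = nowNanos + TimeUnit.MILLISECONDS.toNanos(timeoutMillisFromNow);
    }

    // The remaining budget to attach to the outgoing hop's grpc-timeout header.
    long remainingMillis(long nowNanos) {
        return TimeUnit.NANOSECONDS.toMillis(deadlineNanos - nowNanos);
    }

    boolean isExpired(long nowNanos) {
        return nowNanos >= deadlineNanos;
    }
}
```

A server that burns 1 second of a 5-second deadline must forward 4 seconds downstream; forwarding a fresh 5 seconds is exactly the bug the text warns about.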

4. Half-Close Semantics and Clean Stream Termination

In bidirectional streaming, each side of the stream closes independently. When the client calls requestObserver.onCompleted(), it sends a half-close signal — meaning "I'm done sending requests but I'm still listening for responses." The server can continue sending responses after the client half-close. Only when the server calls responseObserver.onCompleted() is the stream fully terminated.

Ignoring half-close semantics is a common source of resource leaks. If the server never handles the client's onCompleted() signal and never calls its own responseObserver.onCompleted(), the stream hangs open indefinitely, consuming an H2 stream slot, memory, and potentially a server-side thread or selector registration. Under load this produces a "stream exhaustion" failure mode: the server accepts new connections but cannot create new streams because the SETTINGS_MAX_CONCURRENT_STREAMS limit is reached (the HTTP/2 spec recommends allowing at least 100 concurrent streams, and many servers and proxies default to a limit in that range).
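The half-close sequence can be demonstrated with a gRPC-free mock of the observer contract. The interface below mirrors io.grpc.stub.StreamObserver (onError omitted for brevity), and the upload handler is illustrative: the server stays silent until the client half-closes, then sends one summary and fully closes:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal mirror of the StreamObserver contract, for demonstration only.
interface Observer<T> {
    void onNext(T value);
    void onCompleted();
}

// Client-streaming upload handler: half-close drives the response.
class UploadHandler implements Observer<String> {
    private final Observer<String> response;
    private final List<String> received = new ArrayList<>();

    UploadHandler(Observer<String> response) {
        this.response = response;
    }

    @Override public void onNext(String record) {
        received.add(record);          // client is still sending
    }

    @Override public void onCompleted() {
        // Client half-closed: it can send nothing more, but still listens.
        response.onNext("received " + received.size() + " records");
        response.onCompleted();        // only now is the stream fully terminated
    }
}
```

The crucial detail is the last line: a handler that forgets to complete its response side after the client's half-close is exactly the leak described above.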

5. The Most Common gRPC Streaming Failure Modes

Based on production incident analysis across multiple microservices platforms, the most impactful gRPC streaming failure patterns are:

Silent stream death from network partitions. When a network partition occurs mid-stream with no keepalive configured, both sides believe the stream is alive. The client waits for responses that will never come; the server waits for the next request. Neither side detects the failure until a keepalive timeout fires — or never, if keepalive isn't configured. Fix: always configure gRPC keepalive with keepAliveTime ≤ 30s and keepAliveTimeout ≤ 5s.

Exception in onNext not propagated. If the server's onNext handler throws an unchecked exception without calling responseObserver.onError(), the gRPC framework catches the exception internally and may close the stream without the client receiving a proper error status. The client sees UNKNOWN or INTERNAL with no context. Wrap all onNext logic in try-catch and explicitly call onError with a descriptive Status.
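The try-catch discipline can be packaged once as a decorator instead of being repeated in every handler. A sketch against a minimal mirror of the observer contract (in real gRPC code the catch block would build Status.INTERNAL.withDescription(...).asRuntimeException(); all names here are illustrative):

```java
import java.util.function.Consumer;

// Minimal mirror of the observer contract, for demonstration only.
interface MsgObserver<T> {
    void onNext(T value);
    void onError(Throwable t);
}

// Decorator: any unchecked exception from the handler becomes an explicit
// onError instead of escaping into the framework as UNKNOWN/INTERNAL.
class SafeObserver<T> implements MsgObserver<T> {
    private final Consumer<T> handler;
    private final MsgObserver<T> downstream;

    SafeObserver(Consumer<T> handler, MsgObserver<T> downstream) {
        this.handler = handler;
        this.downstream = downstream;
    }

    @Override public void onNext(T value) {
        try {
            handler.accept(value);     // business logic for this message
            downstream.onNext(value);
        } catch (RuntimeException e) {
            downstream.onError(e);     // descriptive error instead of a silent close
        }
    }

    @Override public void onError(Throwable t) { downstream.onError(t); }
}
```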

Concurrent calls to onNext from multiple threads. gRPC's StreamObserver contract is not thread-safe. Calling responseObserver.onNext() concurrently from multiple threads produces a race condition that corrupts H2 frame serialization. Serialize all writes through a single dispatch thread, or guard the observer with external synchronization (a thin synchronized wrapper is enough).
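The single-writer discipline can be enforced with a wrapper that funnels every call through one monitor. A gRPC-free sketch, where an unsynchronized List stands in for the non-thread-safe write path:

```java
import java.util.ArrayList;
import java.util.List;

// Serializes concurrent onNext calls through one monitor so the
// non-thread-safe sink never sees two interleaved writes.
class SerializedObserver<T> {
    private final List<T> sink = new ArrayList<>();  // stands in for the H2 write path

    synchronized void onNext(T value) {
        sink.add(value);
    }

    synchronized int delivered() {
        return sink.size();
    }
}
```

Without the synchronized keyword, concurrent producers would race on the backing list and silently drop messages; with it, every write is delivered.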

Missing onError handling for client cancellation. When a client cancels a streaming RPC (e.g., due to its own deadline expiring), the server receives an onError callback with Status.CANCELLED. If the server ignores this and continues processing, it wastes CPU and downstream resources. Check ctx.isCancelled() in long loops and handle CANCELLED status explicitly.

6. Retry and Hedging Policies for Streaming RPCs

gRPC's built-in retry policy (configured via service config) applies only to unary and client-streaming RPCs where the server has not yet sent any response messages. Once a server has sent the first response message on a streaming RPC, retries are no longer automatic — the framework cannot safely determine whether replaying the request would produce correct results. Application-level retry logic is required for streaming RPCs.

// Application-level retry with exponential backoff for a streaming RPC
public void connectWithRetry(MarketDataService stub) throws InterruptedException {
    int attempt = 0;
    while (attempt < MAX_ATTEMPTS) {
        CountDownLatch latch = new CountDownLatch(1);
        AtomicReference<Throwable> error = new AtomicReference<>();

        StreamObserver<Quote> quoteObserver = new StreamObserver<>() {
            @Override public void onNext(Quote q) { processQuote(q); }
            @Override public void onError(Throwable t) {
                error.set(t);
                latch.countDown();
            }
            @Override public void onCompleted() { latch.countDown(); }
        };

        StreamObserver<Subscription> reqObserver =
            stub.streamQuotes(quoteObserver);
        reqObserver.onNext(buildSubscription());
        // Do NOT call onCompleted here — keep the stream open

        latch.await();                         // block until the stream ends

        Throwable t = error.get();
        if (t == null) return;                 // completed cleanly, no retry
        Status status = Status.fromThrowable(t);
        if (!isRetryable(status)) throw new RuntimeException(t);

        // Capped exponential backoff: 1s, 2s, 4s, ... up to 30s
        long backoffMs = Math.min(1000L * (1L << attempt), 30_000L);
        Thread.sleep(backoffMs);
        attempt++;
    }
}
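The backoff arithmetic in the loop above is worth isolating so it can be unit-tested on its own; the cap and the shift are the parts that are easy to get wrong (an unclamped shift overflows long for large attempt counts). A small sketch:

```java
// Capped exponential backoff: 1s, 2s, 4s, ... capped at 30s.
final class Backoff {
    static final long BASE_MS = 1_000L;
    static final long CAP_MS  = 30_000L;

    static long delayMs(int attempt) {
        // Clamp the shift: 2^30 * 1000 ms already far exceeds the cap,
        // and shifting further would eventually overflow long.
        int shift = Math.min(attempt, 30);
        return Math.min(BASE_MS << shift, CAP_MS);
    }
}
```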

7. Observability: Stream Health Metrics and Tracing

Standard gRPC metrics (from the official gRPC OpenTelemetry plugin or Micrometer gRPC integration) provide: grpc.server.call.started, grpc.server.call.duration, and grpc.server.call.rcvd_msgs_per_rpc. For streaming RPCs, the most important custom metrics are: active stream count (gauge — growing unboundedly signals stream leak), messages-per-stream rate (drops signal H2 flow control stalls), stream duration p99 (should be bounded for request-response patterns), and half-close lag (time between client's onCompleted and server's final onCompleted).
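The active-stream gauge needs nothing more than a counter bumped on stream open and close; the wiring to Micrometer or OpenTelemetry is omitted here and the class name is illustrative. The one rule that matters: decrement on every termination path, onCompleted and onError alike, or the gauge drifts upward and masquerades as a stream leak.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Backing state for an active-stream gauge. Increment when a stream opens;
// decrement on EVERY termination path (onCompleted or onError).
class StreamGauge {
    private final AtomicInteger active = new AtomicInteger();

    int onStreamOpen()  { return active.incrementAndGet(); }
    int onStreamClose() { return active.decrementAndGet(); }
    int activeStreams() { return active.get(); }
}
```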

For distributed tracing, gRPC's OpenTelemetry propagator automatically injects and extracts the traceparent header in gRPC metadata for unary calls. For streaming RPCs, each onNext message is typically part of the same trace context as the stream's initial call — but message-level spans for individual stream messages are not generated by default. Add explicit span events in your onNext handler for high-value message processing paths.

8. Trade-offs and When NOT to Use gRPC Streaming

Use gRPC streaming when: messages must be processed in order with low latency; high message volumes (>1,000 msg/s per connection) make per-call unary overhead significant; or long-lived session semantics (authentication state, a shared cursor position) are required.

Avoid gRPC streaming when your load balancer (L4 or L7) isn't gRPC-aware: without gRPC-aware load balancing, a plain Kubernetes Service routes every message from a streaming client to the same pod, defeating horizontal scaling. Avoid it when Kafka or another message broker can decouple producers and consumers, since streaming RPCs are synchronous and a slow consumer directly impacts the producer. And avoid it when your team lacks deep gRPC expertise; unary RPCs with REST/JSON are far simpler to debug and monitor.


Last updated: March 2026 — Written by Md Sanwar Hossain