Webhook System Design - reliable event delivery and retry architecture
Md Sanwar Hossain - Software Engineer

Software Engineer · Java · Spring Boot · Microservices

System Design · April 1, 2026 · 17 min read · System Design Deep Dive Series

Webhook System Design: Reliable Event Delivery, Retry Logic & At-Scale Architecture

Webhooks look deceptively simple: your system makes an HTTP POST to a subscriber's URL when something happens. In practice, building a webhook delivery platform that guarantees reliability under network failures, subscriber downtime, thundering-herd load spikes, and adversarial payloads is one of the more nuanced distributed systems problems in backend engineering. This deep dive covers the full production stack — from the outbox pattern and queue-backed dispatch to HMAC signature verification, exponential backoff, dead letter queues, Kafka fan-out, and the observability layer that keeps GitHub-, Stripe-, and Shopify-scale webhook systems healthy.

Table of Contents

  1. What Are Webhooks and Why Do They Fail?
  2. Core Webhook Architecture Components
  3. HMAC Signature Verification
  4. Reliable Delivery: Outbox Pattern + Queue-Backed Dispatch
  5. Exponential Backoff Retry Logic with Dead Letter Queue
  6. Fan-Out Architecture: One Event to Many Subscribers
  7. Webhook Payload Versioning and Schema Evolution
  8. Observability: Delivery Metrics, Alerting & Debug Console
  9. Production Pitfalls
  10. Key Takeaways
  11. Conclusion

1. What Are Webhooks and Why Do They Fail?


A webhook is an HTTP push notification: when an event occurs in System A, it makes a POST request to a URL registered by System B, delivering a JSON payload describing what happened. The subscriber URL is configuration data — the webhook consumer owns it and hands it to the producer during registration. This inverts the traditional polling model: instead of System B asking "did anything change?" every few seconds, System A tells System B the moment something does.

GitHub uses webhooks to notify CI/CD systems the instant a push lands on a repository. Stripe fires a payment_intent.succeeded event the moment a card charge completes. Shopify broadcasts an order.created event to all registered fulfilment partners when a customer places an order. The pull alternative — polling the Stripe API every second to check for new payments — wastes API quota, adds latency, and scales poorly. Webhooks solve the latency and efficiency problem, but they introduce a reliability problem that polling silently avoids: what happens when the HTTP POST fails?

Failure modes are numerous and real:

- The subscriber endpoint is down or unreachable (connection refused, DNS failure).
- The subscriber is up but slow, and the HTTP request times out mid-processing.
- The subscriber returns a 5xx because of a bug or an overloaded downstream dependency.
- The subscriber processed the event but the 200 response was lost, so the sender retries and delivers a duplicate.
- A recovering subscriber is flattened by the backlog of queued deliveries (the thundering herd).
- An attacker POSTs a forged payload to the subscriber's publicly reachable endpoint.

Each of these failure modes demands a different mitigation strategy. The cumulative solution is a reliable webhook delivery platform — not a single feature, but an architecture.

2. Core Webhook Architecture Components

A production-grade webhook platform is composed of five distinct subsystems, each with a clear ownership boundary:

1. Event Producer (Sender) — the application service that detects a domain event (e.g., an order is created) and writes a webhook delivery record to the outbox table. It never calls subscriber URLs directly.

2. Delivery Queue — a durable message queue (Kafka topic, AWS SQS, or RabbitMQ) that buffers webhook delivery jobs. The queue decouples the producer from the dispatcher and absorbs load spikes without backpressure on the business logic path.

3. Dispatcher Service — a pool of workers that consume from the delivery queue, look up subscriber endpoint URLs, sign the payload with HMAC, and make the HTTP POST. They record the outcome — status code, response body, latency — to the delivery log.

4. Delivery Log — an append-only audit trail of every delivery attempt: event ID, subscriber ID, attempt number, HTTP status, response body (truncated), timestamp, and latency. This is the source of truth for the debug console and alerting.

5. Retry Scheduler — a separate scheduled process that queries the delivery log for failed attempts that are eligible for retry (within the retry window, below the max attempt count), computes the exponential backoff delay, and re-enqueues the delivery job on the queue. Failed jobs that exhaust retries are routed to a Dead Letter Queue (DLQ) for manual inspection.

These five components, combined with the subscription registry (which maps event types to subscriber URLs and secrets), form the complete webhook platform. None of them are complex individually — the reliability emerges from their composition and the invariants maintained between them.

3. HMAC Signature Verification


Before a subscriber processes a webhook payload, it must verify that the request genuinely originated from the trusted sender and that the payload has not been tampered with in transit. The industry-standard mechanism is HMAC-SHA256 signature verification, used by GitHub (X-Hub-Signature-256), Stripe (Stripe-Signature), and Shopify (X-Shopify-Hmac-Sha256).

The protocol is straightforward: the sender and subscriber share a secret key established at subscription time. When sending a webhook, the sender computes HMAC-SHA256(secret, rawRequestBody) and includes the hex-encoded digest in a request header. The subscriber independently computes the same HMAC over the raw request body it received and compares the two values. A mismatch means the payload was tampered with or the request is not from the trusted sender.

Here is the sender-side signing in Spring Boot:

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

@Component
public class WebhookSigningService {

    private static final String HMAC_ALGORITHM = "HmacSHA256";

    public String sign(String secret, String payload) {
        try {
            Mac mac = Mac.getInstance(HMAC_ALGORITHM);
            SecretKeySpec keySpec = new SecretKeySpec(
                secret.getBytes(StandardCharsets.UTF_8), HMAC_ALGORITHM);
            mac.init(keySpec);
            byte[] digest = mac.doFinal(
                payload.getBytes(StandardCharsets.UTF_8));
            return "sha256=" + HexFormat.of().formatHex(digest);
        } catch (Exception e) {
            throw new WebhookSigningException("Failed to sign payload", e);
        }
    }
}

And the subscriber-side verification in a Spring Boot controller — critically, using constant-time comparison to prevent timing attacks:

@RestController
@RequestMapping("/webhooks")
public class WebhookReceiverController {

    @Value("${webhook.secret}")
    private String webhookSecret;

    private final WebhookSigningService signingService;
    private final OrderEventHandler orderEventHandler;
    private final ObjectMapper objectMapper;

    @PostMapping("/orders")
    public ResponseEntity<Void> receiveOrderEvent(
            @RequestHeader("X-Webhook-Signature") String signature,
            @RequestBody String rawBody) {

        // Compute expected signature
        String expectedSignature = signingService.sign(webhookSecret, rawBody);

        // Constant-time comparison — never use String.equals() here
        if (!MessageDigest.isEqual(
                expectedSignature.getBytes(StandardCharsets.UTF_8),
                signature.getBytes(StandardCharsets.UTF_8))) {
            return ResponseEntity.status(HttpStatus.UNAUTHORIZED).build();
        }

        // Parse and handle the event (readValue throws on malformed JSON)
        final OrderEvent event;
        try {
            event = objectMapper.readValue(rawBody, OrderEvent.class);
        } catch (JsonProcessingException e) {
            return ResponseEntity.badRequest().build();
        }
        orderEventHandler.handle(event);

        return ResponseEntity.ok().build();
    }
}
⚠ Warning: Never read the request body as a parsed object (e.g., @RequestBody OrderEvent event) before computing the HMAC. JSON serializers do not guarantee byte-for-byte reproducibility — field ordering, whitespace, and number formatting can differ between the sender's serializer and the subscriber's deserializer. Always compute the HMAC over the raw bytes of the HTTP request body, before any parsing. In Spring Boot, inject HttpServletRequest and read request.getInputStream() if needed, or use @RequestBody String rawBody.

4. Reliable Delivery: Outbox Pattern + Queue-Backed Dispatch

The most common webhook reliability failure is a race between the business transaction and the event dispatch. In the naive design, the application commits its database change and then enqueues the webhook delivery job. If the application crashes between the commit and the enqueue, the business event is committed but the webhook is silently lost. Conversely, if the developer wraps both operations in a single transaction, a failed enqueue rolls back the business change too, losing the customer's order entirely.

The Transactional Outbox Pattern solves this atomically. Instead of calling the queue directly, the application writes the webhook payload to an outbox table in the same database transaction as the business change. A separate relay process (CDC via Debezium or a polling relay) reads committed outbox rows and publishes them to the queue. The flow is:

/*
 * TEXT ARCHITECTURE DIAGRAM
 *
 *  Business Service
 *       |
 *       | (single DB transaction)
 *       |--[1]--> orders table (INSERT order row)
 *       |--[2]--> webhook_outbox table (INSERT delivery record)
 *       |
 *  Outbox Relay (Debezium CDC or polling job)
 *       |
 *       |--[3]--> Reads un-published outbox rows
 *       |--[4]--> Publishes to Kafka topic: webhook.delivery.pending
 *       |--[5]--> Marks outbox row as published
 *       |
 *  Dispatcher Workers (Kafka consumers)
 *       |
 *       |--[6]--> Consumes delivery job
 *       |--[7]--> Looks up subscriber URL + secret
 *       |--[8]--> Signs payload with HMAC-SHA256
 *       |--[9]--> HTTP POST to subscriber endpoint
 *       |--[10]-> Writes attempt result to webhook_delivery_log
 *       |
 *  Retry Scheduler (scheduled every 60 seconds)
 *       |
 *       |--[11]-> Queries failed attempts eligible for retry
 *       |--[12]-> Re-enqueues on Kafka with backoff delay
 *       |--[13]-> Routes exhausted jobs to DLQ topic
 */

The outbox table schema is minimal but sufficient:

CREATE TABLE webhook_outbox (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    event_type    VARCHAR(120)     NOT NULL,
    payload       JSONB            NOT NULL,
    subscriber_id UUID             NOT NULL,
    idempotency_key VARCHAR(255)   NOT NULL UNIQUE,
    status        VARCHAR(20)      NOT NULL DEFAULT 'PENDING',  -- PENDING | PUBLISHED
    created_at    TIMESTAMPTZ      NOT NULL DEFAULT now(),
    published_at  TIMESTAMPTZ
);

CREATE INDEX idx_outbox_status ON webhook_outbox (status, created_at)
    WHERE status = 'PENDING';

In the Spring Boot service layer, the entire operation is a single @Transactional method:

@Service
@Transactional
public class OrderService {

    private final OrderRepository orderRepository;
    private final WebhookOutboxRepository outboxRepository;
    private final ObjectMapper objectMapper;

    public Order createOrder(CreateOrderRequest request) {
        Order order = orderRepository.save(Order.from(request));

        // Write to outbox in the same transaction — atomically
        WebhookOutboxEntry outboxEntry = WebhookOutboxEntry.builder()
            .eventType("order.created")
            .payload(buildOrderCreatedPayload(order))
            .subscriberId(request.subscriberId())
            .idempotencyKey("order-created-" + order.getId())
            .build();

        outboxRepository.save(outboxEntry);

        // No Kafka call here — the relay will pick it up after commit
        return order;
    }

    private JsonNode buildOrderCreatedPayload(Order order) {
        return objectMapper.valueToTree(OrderCreatedEvent.builder()
            .eventId(UUID.randomUUID().toString())
            .eventType("order.created")
            .schemaVersion("2026-04-01")
            .orderId(order.getId())
            .customerId(order.getCustomerId())
            .totalAmount(order.getTotalAmount())
            .currency(order.getCurrency())
            .lineItems(order.getLineItems())
            .occurredAt(Instant.now())
            .build());
    }
}
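The relay in steps [3] through [5] is Spring and Kafka wiring around a small loop. The sketch below reduces the repository and the producer to functional stand-ins so the publish-then-mark logic is visible on its own; all names here are illustrative, not part of the article's codebase:

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Minimal polling-relay sketch: collaborators stand in for the outbox
// repository and the Kafka producer.
public class OutboxPollingRelay {

    record OutboxRow(String id, String payload) {}

    private final Supplier<List<OutboxRow>> fetchPendingBatch; // SELECT ... WHERE status='PENDING' FOR UPDATE SKIP LOCKED
    private final Consumer<OutboxRow> publishToQueue;          // kafkaTemplate.send("webhook.delivery.pending", ...)
    private final Consumer<String> markPublished;              // UPDATE ... SET status='PUBLISHED', published_at=now()

    public OutboxPollingRelay(Supplier<List<OutboxRow>> fetchPendingBatch,
                              Consumer<OutboxRow> publishToQueue,
                              Consumer<String> markPublished) {
        this.fetchPendingBatch = fetchPendingBatch;
        this.publishToQueue = publishToQueue;
        this.markPublished = markPublished;
    }

    // One polling tick: publish first, mark second.
    public int relayOnce() {
        List<OutboxRow> batch = fetchPendingBatch.get();
        for (OutboxRow row : batch) {
            publishToQueue.accept(row);
            markPublished.accept(row.id());
        }
        return batch.size();
    }
}
```

Publishing before marking means a crash between the two steps re-publishes the row on the next tick: at-least-once delivery, which is exactly why the payload carries an idempotency key.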

5. Exponential Backoff Retry Logic with Dead Letter Queue

When a webhook delivery attempt fails — whether by a timeout, a 5xx response, or a connection refusal — the dispatcher must not immediately retry. Hammering a struggling subscriber at full speed makes the subscriber's recovery harder and creates a thundering-herd problem if many deliveries are failing simultaneously. The solution is exponential backoff with jitter: each retry waits longer than the previous one, with a small random offset to spread the retry storm across time.

Stripe is the best-known reference: it retries failed deliveries with exponential backoff for up to three days before marking the endpoint as failing. A comparable schedule (attempts roughly 5 minutes, 30 minutes, 2 hours, 5 hours, and 10 hours after the initial failure) spreads five retries over about 18 hours; stretch the later delays if your subscribers need a multi-day window. GitHub, by contrast, does not retry failed deliveries automatically; it keeps a delivery log and lets operators redeliver manually. Here is a Spring Boot retry scheduler that implements the backoff pattern:

@Component
public class WebhookRetryScheduler {

    private static final int MAX_ATTEMPTS = 6;
    // Base delay in seconds: 5m, 30m, 2h, 5h, 10h, DLQ
    private static final long[] RETRY_DELAYS_SECONDS = {300, 1800, 7200, 18000, 36000};

    private final WebhookDeliveryLogRepository deliveryLogRepository;
    private final KafkaTemplate<String, WebhookDeliveryJob> kafkaTemplate;

    @Scheduled(fixedDelay = 60_000)  // Run every 60 seconds
    public void scheduleRetries() {
        List<WebhookDeliveryLog> failedAttempts =
            deliveryLogRepository.findEligibleForRetry(
                LocalDateTime.now(), MAX_ATTEMPTS);

        for (WebhookDeliveryLog failed : failedAttempts) {
            int attempt = failed.getAttemptNumber();

            if (attempt >= MAX_ATTEMPTS) {
                routeToDlq(failed);
                continue;
            }

            long delaySeconds = computeBackoffWithJitter(attempt);
            Instant nextAttemptAt = Instant.now().plusSeconds(delaySeconds);

            WebhookDeliveryJob retryJob = WebhookDeliveryJob.builder()
                .outboxId(failed.getOutboxId())
                .subscriberId(failed.getSubscriberId())
                .attemptNumber(attempt + 1)
                .scheduledAt(nextAttemptAt)
                .build();

            kafkaTemplate.send("webhook.delivery.retry", retryJob);
            deliveryLogRepository.markScheduledForRetry(failed.getId(), nextAttemptAt);
        }
    }

    private long computeBackoffWithJitter(int attempt) {
        // attempt is 1-based: the first retry (attempt 1) waits RETRY_DELAYS_SECONDS[0]
        long baseDelay = RETRY_DELAYS_SECONDS[
            Math.min(attempt - 1, RETRY_DELAYS_SECONDS.length - 1)];
        // Add up to 10% random jitter to spread the retry storm
        long jitter = (long) (baseDelay * 0.1 * Math.random());
        return baseDelay + jitter;
    }

    private void routeToDlq(WebhookDeliveryLog failed) {
        kafkaTemplate.send("webhook.delivery.dlq",
            DeadLetterEntry.from(failed, "MAX_ATTEMPTS_EXCEEDED"));
        deliveryLogRepository.markExhausted(failed.getId());
    }
}

The delivery log repository query is equally important — only events within the retry window and below the max attempt ceiling should be eligible:

@Query("""
    SELECT l FROM WebhookDeliveryLog l
    WHERE l.status IN ('FAILED', 'TIMEOUT')
      AND l.attemptNumber < :maxAttempts
      AND l.nextRetryAt IS NOT NULL
      AND l.nextRetryAt <= :now
      AND l.createdAt >= :now - INTERVAL '72 hours'
    ORDER BY l.nextRetryAt ASC
    LIMIT 1000
    """)
List<WebhookDeliveryLog> findEligibleForRetry(
    LocalDateTime now, int maxAttempts);
ℹ Idempotency is non-negotiable: Because retries are inherent to the system, every subscriber endpoint must be idempotent. The same order.created event may be delivered two or three times if a subscriber times out after processing but before returning HTTP 200. Include a stable event_id (UUID) in every payload and have subscribers deduplicate on it using a processed-events table keyed by (subscriber_id, event_id). The sender should likewise reuse the same idempotency_key across every retry of the same original event, so subscribers can deduplicate on either field.
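A minimal in-memory sketch of that deduplication contract, with a concurrent set standing in for the processed-events table (in production the same check is a single INSERT that ignores conflicts, e.g. Postgres ON CONFLICT DO NOTHING):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class EventDeduplicator {

    // Stands in for a processed-events table keyed by (subscriber_id, event_id).
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true exactly once per (subscriberId, eventId) pair. Process the
    // event only when this returns true; otherwise acknowledge and drop the
    // duplicate delivery.
    public boolean markProcessed(String subscriberId, String eventId) {
        return processed.add(subscriberId + ":" + eventId);
    }
}
```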

6. Fan-Out Architecture: One Event to Many Subscribers

A real webhook platform rarely delivers one event to one subscriber. A Shopify order.created event may need to be delivered to a fulfilment partner, a loyalty points service, an email marketing platform, and a fraud analytics service — all at once. This is the fan-out problem: one domain event producing N independent delivery jobs, each with its own subscriber URL, secret, retry state, and delivery log.

Kafka is the natural choice for fan-out at scale: a well-partitioned topic can sustain millions of events per second, and partition keys give per-subscriber ordering. The architecture separates two concerns: routing (deciding which subscribers receive a given event type) and delivery (executing and tracking the HTTP POST for each resulting job independently):

@Service
public class WebhookEventRouter {

    private final SubscriptionRegistry subscriptionRegistry;
    private final KafkaTemplate<String, WebhookDeliveryJob> kafkaTemplate;

    @KafkaListener(topics = "domain.events", groupId = "webhook-router")
    public void routeEvent(DomainEvent event, Acknowledgment ack) {
        List<WebhookSubscription> subscribers =
            subscriptionRegistry.findSubscribersFor(event.getType());

        if (subscribers.isEmpty()) {
            ack.acknowledge();
            return;
        }

        // Fan-out: create one delivery job per subscriber
        List<CompletableFuture<SendResult<String, WebhookDeliveryJob>>> futures =
            subscribers.stream()
                .map(sub -> {
                    WebhookDeliveryJob job = WebhookDeliveryJob.builder()
                        .eventId(event.getId())
                        .eventType(event.getType())
                        .payload(event.getPayload())
                        .subscriberId(sub.getId())
                        .endpointUrl(sub.getEndpointUrl())
                        .attemptNumber(1)
                        .scheduledAt(Instant.now())
                        .build();
                    // Key by subscriberId to preserve per-subscriber ordering
                    return kafkaTemplate.send(
                        "webhook.delivery.pending", sub.getId().toString(), job);
                })
                .toList();

        // Wait for all publishes to confirm before acknowledging the source event
        CompletableFuture.allOf(futures.toArray(CompletableFuture[]::new)).join();
        ack.acknowledge();
    }
}

Keying delivery jobs by subscriberId ensures Kafka routes all jobs for a given subscriber to the same partition, preserving per-subscriber event ordering. The isolation this buys is real but partial: a slow subscriber stalls the partition it hashes to, which also delays any other subscribers sharing that partition. For full isolation, combine partition keying with the per-subscriber concurrency limits described under Production Pitfalls, or give very large subscribers dedicated topics.

7. Webhook Payload Versioning and Schema Evolution

Webhook payloads are API contracts. The moment you ship order.created v1 to production subscribers, changing the shape of that payload is a breaking change that can crash subscriber code without warning. Versioning is therefore mandatory from day one.

The most pragmatic approach is a schema_version field in every payload envelope, combined with a versioned api_version header (matching Stripe's model where subscribers can pin to a specific API version at registration time). Here is a versioned payload envelope:

// Versioned webhook payload — always include envelope fields
{
  "id": "evt_01HXYZ789",
  "type": "order.created",
  "schema_version": "2026-04-01",
  "api_version": "v2",
  "created_at": "2026-04-01T10:23:45.000Z",
  "idempotency_key": "order-created-ord-00123",
  "data": {
    "order_id": "ord-00123",
    "customer_id": "cust-456",
    "total_amount": 9999,
    "currency": "USD",
    "line_items": [
      {
        "sku": "WIDGET-RED-L",
        "quantity": 2,
        "unit_price": 4999
      }
    ]
  }
}

Schema evolution rules for safe forward-compatibility:

| Change Type           | Delivery Strategy                                          | Guarantee                             |
| --------------------- | ---------------------------------------------------------- | ------------------------------------- |
| Add optional field    | Ship immediately (additive)                                | At-least-once                         |
| Rename existing field | Dual-write old + new field during transition               | At-least-once with versioning         |
| Remove field          | Deprecate in schema_version N, remove in N+1 after sunset  | Versioned, subscriber opt-in          |
| Change field type     | New field name with new type; retire old field             | Breaking — requires major version bump |
| New event type        | Ship immediately; subscribers opt-in at registration       | At-least-once                         |
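The dual-write strategy for renames implies a matching dual-read on the subscriber side. A minimal sketch, assuming a hypothetical earlier payload that used amount where the current schema uses total_amount:

```java
import java.util.Map;

public class VersionTolerantReader {

    // During a rename transition the sender dual-writes both fields; the
    // subscriber reads the new name and falls back to the old one, so the
    // same code works against either schema_version.
    public static long readTotalAmount(Map<String, ?> data) {
        Object value = data.get("total_amount");
        if (value == null) {
            value = data.get("amount"); // legacy field name (hypothetical)
        }
        if (value == null) {
            throw new IllegalArgumentException(
                "payload has neither total_amount nor amount");
        }
        return ((Number) value).longValue();
    }
}
```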

8. Observability: Delivery Metrics, Alerting & Debug Console

A webhook platform without observability is a black box. When a customer complains their order fulfilment partner never received the order.created event for order ord-00123, you need to answer in under 60 seconds: was it ever dispatched? What HTTP status did the subscriber return? How many retry attempts have been made? What was the exact request and response body?

The three layers of observability are metrics, alerting, and the debug console:

Metrics (Micrometer + Prometheus):

@Component
public class WebhookDispatcherMetrics {

    private final MeterRegistry meterRegistry;
    private final Counter deliveryAttempts;
    private final Counter deliverySuccesses;
    private final Counter deliveryFailures;
    private final Timer deliveryLatency;

    public WebhookDispatcherMetrics(MeterRegistry registry) {
        this.meterRegistry = registry;
        this.deliveryAttempts = Counter.builder("webhook.delivery.attempts")
            .description("Total webhook delivery attempts")
            .register(registry);
        this.deliverySuccesses = Counter.builder("webhook.delivery.successes")
            .description("Successful webhook deliveries (2xx)")
            .register(registry);
        this.deliveryFailures = Counter.builder("webhook.delivery.failures")
            .description("Failed webhook deliveries (non-2xx or exception)")
            .register(registry);
        this.deliveryLatency = Timer.builder("webhook.delivery.latency")
            .description("HTTP POST latency to subscriber endpoint")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
    }

    public void recordAttempt(String eventType, boolean success, Duration latency) {
        Tags tags = Tags.of("event_type", eventType);
        deliveryAttempts.increment();
        deliveryLatency.record(latency);
        if (success) {
            deliverySuccesses.increment();
        } else {
            deliveryFailures.increment();
            meterRegistry.counter("webhook.delivery.failures",
                tags.and("reason", "http_error")).increment();
        }
    }
}

Key Prometheus alerts:

- Delivery failure rate above 5% for any event type over a 5-minute window.
- DLQ depth growing above 100 entries.
- p99 delivery latency exceeding 10 seconds.
- Outbox relay lag: the oldest un-published outbox row is more than 30 seconds old, meaning the relay has fallen behind.

Debug Console: The delivery log table is the backbone of the debug console. Each row stores the event type, subscriber ID, attempt number, HTTP status code, response body (truncated to 4KB), and the request payload. Provide an internal API (GET /internal/webhook-logs?event_id=xxx) that customer support and engineers can query to reconstruct the full delivery history of any event in seconds. GitHub's webhook delivery log in repository settings is a well-known example of this pattern.

9. Production Pitfalls

Even well-designed webhook systems accumulate production debt in predictable ways. Here are the most common failure patterns observed in real systems:

Timeout too low: Setting an HTTP timeout of 5 seconds on the dispatcher is a common default. The problem is that subscriber endpoints doing database writes or calling downstream APIs can legitimately take 8–15 seconds. A 5-second timeout causes phantom failures — the subscriber actually processed the event and returned 200, but the dispatcher's HTTP client timed out and recorded the delivery as a failure, triggering unnecessary retries. Set timeouts to 30 seconds (Stripe uses 30 seconds) and separately measure response time distribution via the delivery log to identify genuinely slow subscribers for proactive capacity guidance.
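Configuring the two timeouts separately matters: connection establishment should fail fast while the overall request budget stays generous. A sketch using the JDK HTTP client (the 5-second connect timeout is an illustrative choice; the 30-second request timeout follows the guidance above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class DispatcherHttpConfig {

    // A dead host is not worth waiting 30 seconds for: fail the connect fast.
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(5);
    // Overall request budget is generous to avoid phantom failures.
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(30);

    public static HttpClient newDispatcherClient() {
        return HttpClient.newBuilder()
            .connectTimeout(CONNECT_TIMEOUT)
            .build();
    }

    public static HttpRequest newDeliveryRequest(String url, String payload,
                                                 String signature) {
        return HttpRequest.newBuilder(URI.create(url))
            .timeout(REQUEST_TIMEOUT)
            .header("Content-Type", "application/json")
            .header("X-Webhook-Signature", signature)
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();
    }
}
```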

No idempotency on the subscriber side: Without idempotency, retries cause duplicate order fulfilments, double loyalty point grants, and double-charged credit cards. Every subscriber endpoint must be idempotent. Include an event_id in every payload and verify with your subscriber partners that they implement deduplication before going to production.

Subscriber overload during catch-up: After a subscriber has been down for 12 hours and resumes, the retry scheduler will attempt to deliver hundreds or thousands of queued events simultaneously. Without per-subscriber rate limiting, this creates a second outage on the subscriber side from the sudden load spike. Implement a per-subscriber delivery concurrency limit (e.g., max 10 concurrent dispatches per subscriber) in the dispatcher worker pool.
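One way to enforce that cap is a semaphore per subscriber inside the dispatcher worker pool; a minimal sketch (class name and limit are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class PerSubscriberLimiter {

    private final int maxConcurrent;
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    public PerSubscriberLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    // Returns true if a dispatch slot was acquired; the caller must call
    // release() after the HTTP attempt completes, success or failure.
    public boolean tryAcquire(String subscriberId) {
        return permits
            .computeIfAbsent(subscriberId, id -> new Semaphore(maxConcurrent))
            .tryAcquire();
    }

    public void release(String subscriberId) {
        Semaphore s = permits.get(subscriberId);
        if (s != null) {
            s.release();
        }
    }
}
```

A worker calls tryAcquire before the HTTP POST and, on false, re-enqueues the job with a short delay instead of blocking a worker thread on a saturated subscriber.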

Secret rotation without zero-downtime: When a subscriber wants to rotate their webhook secret, a naive implementation breaks all deliveries during the rotation window. The solution is a dual-secret grace period: the platform accepts both the old and the new secret for a configurable window (e.g., 24 hours) after rotation, allowing the subscriber to deploy their updated secret-handling code without any delivery interruption. Only after the subscriber confirms the new secret is live should the old secret be revoked.
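The grace-period check itself is small: verify the incoming signature against both secrets and accept if either matches. A sketch reusing the HMAC scheme from section 3 (class and method names are illustrative):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class DualSecretVerifier {

    // Same hex-encoded HMAC-SHA256 scheme as the sender-side signing service.
    public static String sign(String secret, String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(
                secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            return "sha256=" + HexFormat.of().formatHex(
                mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("HMAC signing failed", e);
        }
    }

    // During the rotation window, accept a signature produced by either
    // secret; both comparisons stay constant-time via MessageDigest.isEqual.
    public static boolean matchesEither(String oldSecret, String newSecret,
                                        String payload, String signature) {
        byte[] sig = signature.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(
                   sign(oldSecret, payload).getBytes(StandardCharsets.UTF_8), sig)
            || MessageDigest.isEqual(
                   sign(newSecret, payload).getBytes(StandardCharsets.UTF_8), sig);
    }
}
```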

Missing replay capability: The delivery log retains the original payload, but without a replay API, operators must manually re-enqueue individual events from the DLQ. Build a POST /internal/webhook-events/{event_id}/replay endpoint that re-creates a delivery job from the delivery log and sends it back through the standard dispatch pipeline.
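Functionally, replay is just re-creating a delivery job from the logged payload with the attempt counter reset; a minimal sketch with stand-in record types (all names illustrative):

```java
import java.time.Instant;

public class WebhookReplayService {

    // Minimal stand-ins for the delivery-log row and the delivery job.
    record LoggedDelivery(String eventId, String subscriberId, String payload) {}
    record DeliveryJob(String eventId, String subscriberId, String payload,
                       int attemptNumber, Instant scheduledAt) {}

    // Replay restarts the attempt counter at 1 but preserves the eventId, so
    // subscriber-side deduplication still recognises the event if it was in
    // fact delivered the first time.
    public static DeliveryJob replay(LoggedDelivery logged) {
        return new DeliveryJob(logged.eventId(), logged.subscriberId(),
            logged.payload(), 1, Instant.now());
    }
}
```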

Key Takeaways

- Never call subscriber URLs from the business transaction: write to an outbox table in the same transaction and let a relay publish to the queue.
- Sign every payload with HMAC-SHA256 over the raw request body, and verify with a constant-time comparison.
- Retry with exponential backoff and jitter, cap the attempt count, and route exhausted deliveries to a DLQ with a replay path.
- The contract is at-least-once: give every payload a stable event_id and require subscribers to deduplicate on it.
- Version payloads from day one with a schema_version envelope field, and evolve additively.
- The delivery log is the platform's source of truth; it powers metrics, alerting, the debug console, and replay.

Conclusion

Webhooks are one of the most ubiquitous integration patterns in modern software — and one of the most underestimated in terms of the engineering required to make them truly reliable. The simple HTTP POST that defines a webhook belies the distributed systems complexity underneath: you are making an asynchronous, unreliable, one-way call across an untrusted network boundary, with no guarantee of delivery, ordering, or exactly-once semantics.

GitHub, Stripe, and Shopify have each spent years refining their webhook platforms to address these challenges. The good news is that the patterns are well understood and implementable in any Java/Spring Boot backend within a few weeks: the Transactional Outbox eliminates lost events, HMAC verification closes the security gap, exponential backoff with a DLQ handles the retry lifecycle gracefully, Kafka fan-out scales to thousands of subscribers, and the delivery log gives operators the visibility they need to debug issues in minutes rather than hours.

Start with the outbox pattern and HMAC verification — those two alone eliminate the most common classes of webhook failures. Add retry logic and the delivery log next. Fan-out and versioning are natural extensions once the foundation is solid. The result is a webhook platform that earns the trust of the subscribers who depend on it, and the engineers who operate it.


Last updated: April 1, 2026