System Design

Event-Driven Architecture: Design, Patterns, and Production Best Practices

Event-driven architecture decouples producers and consumers, enables independent scaling, and makes business workflows observable through their event streams. But EDA introduces significant complexity around ordering, idempotency, and distributed consistency that must be designed for explicitly. This guide covers the patterns that make EDA work in production.

Md Sanwar Hossain March 2026 21 min read System Design

Event-driven architecture showing event streams and services

What Makes an Architecture "Event-Driven"
Domain Events: The Heart of EDA
Event Sourcing: The Event Log as the Source of Truth
CQRS: Separating Reads from Writes
The Saga Pattern: Distributed Transactions Without 2PC
Production Challenges and How to Address Them

What Makes an Architecture "Event-Driven"

Event-Driven Architecture | mdsanwarhossain.me — Event-Driven Architecture — mdsanwarhossain.me

In an event-driven architecture, services communicate by producing and consuming events — immutable records of something that has already happened. "OrderPlaced", "PaymentProcessed", "ItemShipped" are events. An event is a statement of fact, in the past tense, with a complete description of what occurred. Unlike an RPC call (which says "do this now"), an event says "this happened" — and any interested consumer can react to it independently.

This distinction has profound architectural implications. The order service that publishes "OrderPlaced" does not know which other services will consume it. Today it might be the inventory service and the notification service. Next quarter, the fraud detection service might subscribe. Adding a new consumer does not require changing the order service. This is the fundamental decoupling that makes event-driven systems extensible — new capabilities can be added to the system without modifying the components that produce the events.

Domain Events: The Heart of EDA

Domain events represent significant occurrences within a bounded context. They are the primary integration mechanism between services in a Domain-Driven Design system. Designing domain events well — choosing the right level of granularity, including the right payload, and versioning them for backward compatibility — is as important as designing API contracts.

// Well-designed domain event — immutable, timestamped, versioned
public record OrderPlacedEvent(
    @JsonProperty("event_id")     String eventId,        // unique event identifier
    @JsonProperty("event_type")   String eventType,      // discriminator for consumers
    @JsonProperty("event_version") int eventVersion,     // schema version
    @JsonProperty("occurred_at")  Instant occurredAt,    // when it happened
    @JsonProperty("order_id")     String orderId,        // aggregate ID
    @JsonProperty("customer_id")  String customerId,
    @JsonProperty("items")        List<OrderItemDto> items,
    @JsonProperty("total_amount") BigDecimal totalAmount,
    @JsonProperty("currency")     String currency
) {
    public static OrderPlacedEvent from(Order order) {
        return new OrderPlacedEvent(
            UUID.randomUUID().toString(),
            "ORDER_PLACED",
            1,
            Instant.now(),
            order.getId(),
            order.getCustomerId(),
            order.getItems().stream().map(OrderItemDto::from).toList(),
            order.getTotalAmount(),
            order.getCurrency()
        );
    }
}

Event Sourcing: The Event Log as the Source of Truth

Distributed Events | mdsanwarhossain.me — Distributed Events — mdsanwarhossain.me

In event sourcing, instead of storing the current state of an entity in a database row, you store the sequence of events that led to the current state. The current state is derived by replaying all events from the beginning (or from a snapshot). This approach has powerful properties: complete audit trail (every state change is recorded); temporal queries (reconstruct state at any point in time); and debugging by replaying events to reproduce bugs.

Event sourcing is not appropriate for every entity. Use it for entities where auditability and temporal queries are requirements, and where the event volume per entity is manageable. Financial accounts, inventory items, and document revision history are natural fits. User profile data with frequent updates is a poor fit — the event log grows without bound and replay becomes expensive.

CQRS: Separating Reads from Writes

Command Query Responsibility Segregation (CQRS) separates the write model (commands that change state, producing events) from the read model (queries that return data, consuming events). The write model is optimized for transactional correctness; the read model is optimized for query performance, potentially using denormalized projections, search indexes, or materialized views that are updated by consuming events from the write model.

Event-Driven Architecture with Kafka | mdsanwarhossain.me — Event-Driven Architecture with Kafka — mdsanwarhossain.me

// CQRS: Command handler writes event; projection updates read model
@Service
public class PlaceOrderCommandHandler {
    private final OrderRepository orderRepository;  // write store
    private final EventPublisher eventPublisher;
    public String handle(PlaceOrderCommand cmd) {
        Order order = Order.create(cmd.customerId(), cmd.items());
        orderRepository.save(order);
        eventPublisher.publish(OrderPlacedEvent.from(order));
        return order.getId();  // return aggregate ID only
    }
}
@Component
public class OrderSummaryProjection {
    private final OrderSummaryRepository readRepository;  // read-optimized store
    @EventHandler
    public void on(OrderPlacedEvent event) {
        // Upsert a denormalized read model optimized for "list my orders" queries
        readRepository.save(new OrderSummary(
            event.orderId(),
            event.customerId(),
            event.totalAmount(),
            event.currency(),
            OrderStatus.PLACED,
            event.occurredAt()
        ));
    }
}

CQRS with event-driven projections creates eventual consistency: the read model reflects the write model after a small lag (typically milliseconds). Design your UI and API to acknowledge this — show a "processing" state after a command rather than immediately querying the read model.

The Saga Pattern: Distributed Transactions Without 2PC

In a microservices architecture, a business transaction often spans multiple services (place order → reserve inventory → process payment → confirm order). Traditional two-phase commit (2PC) across services is impractical — it creates tight coupling and availability dependency. The Saga pattern replaces 2PC with a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the preceding steps.

Choreography-Based Saga

Each service listens for events and decides independently what to do next. No central coordinator. This is simpler to implement but harder to trace and debug — the business workflow is implicit in the event routing rather than explicitly described anywhere.

Orchestration-Based Saga

A central saga orchestrator sends commands to each service and listens for their responses. The workflow is explicit and visible in the orchestrator's code. Failures trigger compensating commands. This is more complex to implement but much easier to monitor and debug, making it the preferred pattern for complex multi-step workflows.

// Orchestration-based saga for order placement
@Component
public class PlaceOrderSaga {
    @StartSaga
    @SagaEventHandler(associationProperty = "orderId")
    public void handle(OrderPlacedEvent event) {
        // Step 1: Reserve inventory
        commandGateway.send(new ReserveInventoryCommand(
            event.orderId(), event.items()
        ));
    }
    @SagaEventHandler(associationProperty = "orderId")
    public void handle(InventoryReservedEvent event) {
        // Step 2: Process payment
        commandGateway.send(new ProcessPaymentCommand(
            event.orderId(), event.totalAmount()
        ));
    }
    @SagaEventHandler(associationProperty = "orderId")
    public void handle(PaymentFailedEvent event) {
        // Compensate: release inventory reservation
        commandGateway.send(new ReleaseInventoryCommand(event.orderId()));
        commandGateway.send(new CancelOrderCommand(
            event.orderId(), "Payment failed"
        ));
    }
    @EndSaga
    @SagaEventHandler(associationProperty = "orderId")
    public void handle(PaymentProcessedEvent event) {
        commandGateway.send(new ConfirmOrderCommand(event.orderId()));
    }
}

Production Challenges and How to Address Them

Idempotency: Consumers may receive the same event more than once (Kafka at-least-once delivery). Every consumer must be idempotent — processing the same event twice must produce the same result as processing it once. Use the event ID as an idempotency key, storing processed event IDs in a deduplication table.

Ordering: Events in a Kafka topic partition are ordered, but events across partitions are not. Design your partitioning key so events that must be processed in order (all events for the same aggregate) land in the same partition.

Schema evolution: Consumer and producer evolve independently. Use a Schema Registry (Confluent or Apicurio) with Avro or Protobuf to enforce compatibility rules. Backward-compatible changes (adding optional fields) are safe; breaking changes (removing or renaming fields) require a new event version.

Observability: Distributed event chains are hard to trace. Propagate a correlation ID through all events. Use distributed tracing (OpenTelemetry with Kafka instrumentation) to trace the full journey of a business transaction across services and events.

"Events are the API contract of the future — but unlike REST contracts, they are asynchronous, durable, and replayable. Design them with the same care you give to your synchronous APIs."

Key Takeaways

Domain events are immutable facts about what happened — design them to be self-describing, versioned, and backward-compatible.
Event sourcing suits entities requiring full audit trails; CQRS separates the write model from read-optimized projections.
The Saga pattern replaces 2PC for distributed transactions — prefer orchestration-based sagas for complex multi-step workflows.
Every consumer must be idempotent; partition events by aggregate ID to ensure ordering within a stream.
Use a Schema Registry to manage event schema evolution safely across independently-deployed services.

At BRAC IT: Event-Driven Loan Processing

In 2023, our loan approval flow was a chain of 8 synchronous REST calls: eligibility check → credit bureau lookup → income verification → fraud scoring → compliance check → approval decision → disbursement trigger → notification. Every call was sequential. Average approval time: 4.2 seconds. When the fraud scoring service slowed to 3 seconds (it happened twice under load), the entire chain timed out and the application was dropped with a generic error. Engineers had to manually resubmit or call the applicant back.

We redesigned the flow around a LoanApplicationSubmitted event in Q1 2024. The four verification services (credit, income, fraud, compliance) now consume the event in parallel. Each publishes its own result event (CreditCheckCompleted, IncomeVerified, etc.) to its own Kafka topic. An orchestrator service aggregates the four results and publishes LoanDecisionMade when all four arrive.

Results after three months in production:

Average approval latency: 4.2 s → 820 ms (parallel execution of the 4 checks)
Timeout-related application drops: 12/day → 0/day (Kafka retries handle transient failures)
Credit bureau service outage impact: isolated (other checks continue, decision deferred)
Deployability: each service deploys independently with no coordination needed

The key insight: the 4 verification checks were logically independent — they could run in parallel — but the synchronous REST design forced them to run sequentially. The event-driven design revealed the true structure of the domain.

Dead Letter Queues and Poison Message Handling

Every event-driven system will eventually encounter a poison message — an event that cannot be processed successfully due to malformed data, schema mismatch, downstream dependency failure, or a bug in the consumer. Without proper handling, a poison message blocks the consumer, halting processing of all subsequent events in that partition. This is one of the most common production failures in Kafka-based systems.

The standard pattern: retry with exponential backoff, then route to a Dead Letter Queue (DLQ) after a fixed number of retries:

// Spring Kafka dead letter queue configuration
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<?, ?> template) {
    // Dead letter topic: original-topic.DLT
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(template);

    // Retry 3 times with exponential backoff before DLQ
    ExponentialBackOffWithMaxRetries backOff =
        new ExponentialBackOffWithMaxRetries(3);
    backOff.setInitialInterval(1_000);   // 1s, 2s, 4s
    backOff.setMultiplier(2.0);

    DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);

    // Don't retry on deserialization errors — straight to DLQ
    handler.addNotRetryableExceptions(
        DeserializationException.class,
        JsonParseException.class
    );
    return handler;
}

Monitor your DLQ depth as an SLO. A growing DLQ means either a schema breaking change broke consumers, or a downstream dependency is consistently unavailable. Automate DLQ alerting: when DLQ depth exceeds 10 messages, page the on-call engineer. Create a runbook for DLQ replay — most DLQ messages can be replayed after the root cause is fixed. Design consumers to be idempotent so replay is safe.

Event Schema Evolution: Staying Compatible

The hardest operational problem in event-driven architectures is schema evolution. Unlike REST APIs where you can version a single endpoint, events are durable — a LoanApplicationSubmitted event published in January 2025 may be replayed in January 2026. Your consumer must handle both versions.

Three strategies, in order of increasing robustness:

Strategy	How	Safe Change Types	Risk
Additive-only	Only add nullable fields; never remove or rename	Add fields	Schema bloat over time
Versioned events	Include event version field; consumer handles v1 and v2	Add/rename/remove fields with version bump	Consumer complexity
Schema Registry	Confluent or Apicurio Schema Registry with Avro/Protobuf	All backward-compatible changes enforced at build time	Operational overhead

At BRAC IT we use JSON with an additive-only rule for intra-cluster events, and Avro with the Confluent Schema Registry for events crossing system boundaries (between our microfinance core and third-party integrations). The Schema Registry prevents breaking changes from deploying — a CI check validates backward compatibility before merge. This has prevented three breaking-change incidents that would have taken down production consumers.

Event-Driven Architecture Production Checklist

Before releasing an event-driven service to production, validate every item in this checklist:

Consumers are idempotent — processing the same event twice produces the same result with no side effects
Dead Letter Queue configured — events that cannot be processed after N retries go to a DLQ; DLQ depth is monitored and alerted
Schema Registry or additive-only policy enforced — no producer can publish an event that breaks existing consumers
Partitioning key is the aggregate ID — all events for the same entity land in the same partition, preserving ordering
Correlation IDs propagated — every event carries the correlation ID of the business transaction that triggered it
Consumer group IDs are unique per service — two different services should never share a consumer group ID on the same topic
Lag alerts configured — if consumer lag grows beyond a threshold, on-call is paged before the lag becomes a business problem
Rollback plan documented — if a new event type causes consumer failures, how do you stop publishing without a service restart?
Event log retention set appropriately — for audit events: 7 years. For real-time notifications: 7 days. Retention is a compliance decision, not a Kafka default
Security: topic-level ACLs defined — producers cannot read their own topic; consumers cannot write to the topic they consume

Event-Driven Architecture: Design, Patterns, and Production Best Practices

Table of Contents

What Makes an Architecture "Event-Driven"

Domain Events: The Heart of EDA

Event Sourcing: The Event Log as the Source of Truth

CQRS: Separating Reads from Writes

The Saga Pattern: Distributed Transactions Without 2PC

Choreography-Based Saga

Orchestration-Based Saga

Production Challenges and How to Address Them

Key Takeaways

At BRAC IT: Event-Driven Loan Processing

Dead Letter Queues and Poison Message Handling

Event Schema Evolution: Staying Compatible

Event-Driven Architecture Production Checklist

Tags

Leave a Comment

Related Posts

Event-Driven Architecture: Design, Patterns, and Production Best Practices

Table of Contents

What Makes an Architecture "Event-Driven"

Domain Events: The Heart of EDA

Event Sourcing: The Event Log as the Source of Truth

CQRS: Separating Reads from Writes

The Saga Pattern: Distributed Transactions Without 2PC

Choreography-Based Saga

Orchestration-Based Saga

Production Challenges and How to Address Them

Key Takeaways

At BRAC IT: Event-Driven Loan Processing

Dead Letter Queues and Poison Message Handling

Event Schema Evolution: Staying Compatible

Event-Driven Architecture Production Checklist

Tags

Leave a Comment

Related Posts

Service Communication Patterns in Microservices: REST, gRPC, Messaging, and GraphQL Federation

Designing Scalable Systems at Uber & Netflix Scale: Patterns, Trade-offs, and Architecture Decisions

Microservices Architecture Patterns: Building Resilient, Scalable Distributed Systems

Cookie Notice