Event-Driven Architecture: Design, Patterns, and Production Best Practices
Event-driven architecture decouples producers and consumers, enables independent scaling, and makes business workflows observable through their event streams. But EDA introduces significant complexity around ordering, idempotency, and distributed consistency that must be designed for explicitly. This guide covers the patterns that make EDA work in production.
Table of Contents
What Makes an Architecture "Event-Driven"
In an event-driven architecture, services communicate by producing and consuming events — immutable records of something that has already happened. "OrderPlaced", "PaymentProcessed", "ItemShipped" are events. An event is a statement of fact, in the past tense, with a complete description of what occurred. Unlike an RPC call (which says "do this now"), an event says "this happened" — and any interested consumer can react to it independently.
This distinction has profound architectural implications. The order service that publishes "OrderPlaced" does not know which other services will consume it. Today it might be the inventory service and the notification service. Next quarter, the fraud detection service might subscribe. Adding a new consumer does not require changing the order service. This is the fundamental decoupling that makes event-driven systems extensible — new capabilities can be added to the system without modifying the components that produce the events.
Domain Events: The Heart of EDA
Domain events represent significant occurrences within a bounded context. They are the primary integration mechanism between services in a Domain-Driven Design system. Designing domain events well — choosing the right level of granularity, including the right payload, and versioning them for backward compatibility — is as important as designing API contracts.
// Well-designed domain event — immutable, timestamped, versioned
public record OrderPlacedEvent(
@JsonProperty("event_id") String eventId, // unique event identifier
@JsonProperty("event_type") String eventType, // discriminator for consumers
@JsonProperty("event_version") int eventVersion, // schema version
@JsonProperty("occurred_at") Instant occurredAt, // when it happened
@JsonProperty("order_id") String orderId, // aggregate ID
@JsonProperty("customer_id") String customerId,
@JsonProperty("items") List<OrderItemDto> items,
@JsonProperty("total_amount") BigDecimal totalAmount,
@JsonProperty("currency") String currency
) {
public static OrderPlacedEvent from(Order order) {
return new OrderPlacedEvent(
UUID.randomUUID().toString(),
"ORDER_PLACED",
1,
Instant.now(),
order.getId(),
order.getCustomerId(),
order.getItems().stream().map(OrderItemDto::from).toList(),
order.getTotalAmount(),
order.getCurrency()
);
}
}
Event Sourcing: The Event Log as the Source of Truth
In event sourcing, instead of storing the current state of an entity in a database row, you store the sequence of events that led to the current state. The current state is derived by replaying all events from the beginning (or from a snapshot). This approach has powerful properties: complete audit trail (every state change is recorded); temporal queries (reconstruct state at any point in time); and debugging by replaying events to reproduce bugs.
Event sourcing is not appropriate for every entity. Use it for entities where auditability and temporal queries are requirements, and where the event volume per entity is manageable. Financial accounts, inventory items, and document revision history are natural fits. User profile data with frequent updates is a poor fit — the event log grows without bound and replay becomes expensive.
CQRS: Separating Reads from Writes
Command Query Responsibility Segregation (CQRS) separates the write model (commands that change state, producing events) from the read model (queries that return data, consuming events). The write model is optimized for transactional correctness; the read model is optimized for query performance, potentially using denormalized projections, search indexes, or materialized views that are updated by consuming events from the write model.
// CQRS: Command handler writes event; projection updates read model
@Service
public class PlaceOrderCommandHandler {
private final OrderRepository orderRepository; // write store
private final EventPublisher eventPublisher;
public String handle(PlaceOrderCommand cmd) {
Order order = Order.create(cmd.customerId(), cmd.items());
orderRepository.save(order);
eventPublisher.publish(OrderPlacedEvent.from(order));
return order.getId(); // return aggregate ID only
}
}
@Component
public class OrderSummaryProjection {
private final OrderSummaryRepository readRepository; // read-optimized store
@EventHandler
public void on(OrderPlacedEvent event) {
// Upsert a denormalized read model optimized for "list my orders" queries
readRepository.save(new OrderSummary(
event.orderId(),
event.customerId(),
event.totalAmount(),
event.currency(),
OrderStatus.PLACED,
event.occurredAt()
));
}
}
CQRS with event-driven projections creates eventual consistency: the read model reflects the write model after a small lag (typically milliseconds). Design your UI and API to acknowledge this — show a "processing" state after a command rather than immediately querying the read model.
The Saga Pattern: Distributed Transactions Without 2PC
In a microservices architecture, a business transaction often spans multiple services (place order → reserve inventory → process payment → confirm order). Traditional two-phase commit (2PC) across services is impractical — it creates tight coupling and availability dependency. The Saga pattern replaces 2PC with a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the preceding steps.
Choreography-Based Saga
Each service listens for events and decides independently what to do next. No central coordinator. This is simpler to implement but harder to trace and debug — the business workflow is implicit in the event routing rather than explicitly described anywhere.
Orchestration-Based Saga
A central saga orchestrator sends commands to each service and listens for their responses. The workflow is explicit and visible in the orchestrator's code. Failures trigger compensating commands. This is more complex to implement but much easier to monitor and debug, making it the preferred pattern for complex multi-step workflows.
// Orchestration-based saga for order placement
@Component
public class PlaceOrderSaga {
@StartSaga
@SagaEventHandler(associationProperty = "orderId")
public void handle(OrderPlacedEvent event) {
// Step 1: Reserve inventory
commandGateway.send(new ReserveInventoryCommand(
event.orderId(), event.items()
));
}
@SagaEventHandler(associationProperty = "orderId")
public void handle(InventoryReservedEvent event) {
// Step 2: Process payment
commandGateway.send(new ProcessPaymentCommand(
event.orderId(), event.totalAmount()
));
}
@SagaEventHandler(associationProperty = "orderId")
public void handle(PaymentFailedEvent event) {
// Compensate: release inventory reservation
commandGateway.send(new ReleaseInventoryCommand(event.orderId()));
commandGateway.send(new CancelOrderCommand(
event.orderId(), "Payment failed"
));
}
@EndSaga
@SagaEventHandler(associationProperty = "orderId")
public void handle(PaymentProcessedEvent event) {
commandGateway.send(new ConfirmOrderCommand(event.orderId()));
}
}
Production Challenges and How to Address Them
Idempotency: Consumers may receive the same event more than once (Kafka at-least-once delivery). Every consumer must be idempotent — processing the same event twice must produce the same result as processing it once. Use the event ID as an idempotency key, storing processed event IDs in a deduplication table.
Ordering: Events in a Kafka topic partition are ordered, but events across partitions are not. Design your partitioning key so events that must be processed in order (all events for the same aggregate) land in the same partition.
Schema evolution: Consumer and producer evolve independently. Use a Schema Registry (Confluent or Apicurio) with Avro or Protobuf to enforce compatibility rules. Backward-compatible changes (adding optional fields) are safe; breaking changes (removing or renaming fields) require a new event version.
Observability: Distributed event chains are hard to trace. Propagate a correlation ID through all events. Use distributed tracing (OpenTelemetry with Kafka instrumentation) to trace the full journey of a business transaction across services and events.
"Events are the API contract of the future — but unlike REST contracts, they are asynchronous, durable, and replayable. Design them with the same care you give to your synchronous APIs."
Key Takeaways
- Domain events are immutable facts about what happened — design them to be self-describing, versioned, and backward-compatible.
- Event sourcing suits entities requiring full audit trails; CQRS separates the write model from read-optimized projections.
- The Saga pattern replaces 2PC for distributed transactions — prefer orchestration-based sagas for complex multi-step workflows.
- Every consumer must be idempotent; partition events by aggregate ID to ensure ordering within a stream.
- Use a Schema Registry to manage event schema evolution safely across independently-deployed services.
At BRAC IT: Event-Driven Loan Processing
In 2023, our loan approval flow was a chain of 8 synchronous REST calls: eligibility check → credit bureau lookup → income verification → fraud scoring → compliance check → approval decision → disbursement trigger → notification. Every call was sequential. Average approval time: 4.2 seconds. When the fraud scoring service slowed to 3 seconds (it happened twice under load), the entire chain timed out and the application was dropped with a generic error. Engineers had to manually resubmit or call the applicant back.
We redesigned the flow around a LoanApplicationSubmitted event in Q1 2024. The four verification services (credit, income, fraud, compliance) now consume the event in parallel. Each publishes its own result event (CreditCheckCompleted, IncomeVerified, etc.) to its own Kafka topic. An orchestrator service aggregates the four results and publishes LoanDecisionMade when all four arrive.
Results after three months in production:
- Average approval latency: 4.2 s → 820 ms (parallel execution of the 4 checks)
- Timeout-related application drops: 12/day → 0/day (Kafka retries handle transient failures)
- Credit bureau service outage impact: isolated (other checks continue, decision deferred)
- Deployability: each service deploys independently with no coordination needed
The key insight: the 4 verification checks were logically independent — they could run in parallel — but the synchronous REST design forced them to run sequentially. The event-driven design revealed the true structure of the domain.
Dead Letter Queues and Poison Message Handling
Every event-driven system will eventually encounter a poison message — an event that cannot be processed successfully due to malformed data, schema mismatch, downstream dependency failure, or a bug in the consumer. Without proper handling, a poison message blocks the consumer, halting processing of all subsequent events in that partition. This is one of the most common production failures in Kafka-based systems.
The standard pattern: retry with exponential backoff, then route to a Dead Letter Queue (DLQ) after a fixed number of retries:
// Spring Kafka dead letter queue configuration
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<?, ?> template) {
// Dead letter topic: original-topic.DLT
DeadLetterPublishingRecoverer recoverer =
new DeadLetterPublishingRecoverer(template);
// Retry 3 times with exponential backoff before DLQ
ExponentialBackOffWithMaxRetries backOff =
new ExponentialBackOffWithMaxRetries(3);
backOff.setInitialInterval(1_000); // 1s, 2s, 4s
backOff.setMultiplier(2.0);
DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
// Don't retry on deserialization errors — straight to DLQ
handler.addNotRetryableExceptions(
DeserializationException.class,
JsonParseException.class
);
return handler;
}
Monitor your DLQ depth as an SLO. A growing DLQ means either a schema breaking change broke consumers, or a downstream dependency is consistently unavailable. Automate DLQ alerting: when DLQ depth exceeds 10 messages, page the on-call engineer. Create a runbook for DLQ replay — most DLQ messages can be replayed after the root cause is fixed. Design consumers to be idempotent so replay is safe.
Event Schema Evolution: Staying Compatible
The hardest operational problem in event-driven architectures is schema evolution. Unlike REST APIs where you can version a single endpoint, events are durable — a LoanApplicationSubmitted event published in January 2025 may be replayed in January 2026. Your consumer must handle both versions.
Three strategies, in order of increasing robustness:
| Strategy | How | Safe Change Types | Risk |
|---|---|---|---|
| Additive-only | Only add nullable fields; never remove or rename | Add fields | Schema bloat over time |
| Versioned events | Include event version field; consumer handles v1 and v2 | Add/rename/remove fields with version bump | Consumer complexity |
| Schema Registry | Confluent or Apicurio Schema Registry with Avro/Protobuf | All backward-compatible changes enforced at build time | Operational overhead |
At BRAC IT we use JSON with an additive-only rule for intra-cluster events, and Avro with the Confluent Schema Registry for events crossing system boundaries (between our microfinance core and third-party integrations). The Schema Registry prevents breaking changes from deploying — a CI check validates backward compatibility before merge. This has prevented three breaking-change incidents that would have taken down production consumers.
Event-Driven Architecture Production Checklist
Before releasing an event-driven service to production, validate every item in this checklist:
- Consumers are idempotent — processing the same event twice produces the same result with no side effects
- Dead Letter Queue configured — events that cannot be processed after N retries go to a DLQ; DLQ depth is monitored and alerted
- Schema Registry or additive-only policy enforced — no producer can publish an event that breaks existing consumers
- Partitioning key is the aggregate ID — all events for the same entity land in the same partition, preserving ordering
- Correlation IDs propagated — every event carries the correlation ID of the business transaction that triggered it
- Consumer group IDs are unique per service — two different services should never share a consumer group ID on the same topic
- Lag alerts configured — if consumer lag grows beyond a threshold, on-call is paged before the lag becomes a business problem
- Rollback plan documented — if a new event type causes consumer failures, how do you stop publishing without a service restart?
- Event log retention set appropriately — for audit events: 7 years. For real-time notifications: 7 days. Retention is a compliance decision, not a Kafka default
- Security: topic-level ACLs defined — producers cannot read their own topic; consumers cannot write to the topic they consume
Leave a Comment
Related Posts
Software Engineer · Java · Spring Boot · Microservices