Software Engineer · Java · Spring Boot · Microservices
Outbox Pattern with Debezium CDC: Solving the Dual-Write Problem in Event-Driven Microservices
Every event-driven microservice faces the same treacherous moment: you commit a row to your database and publish a message to Kafka. These are two separate I/O operations with no shared transaction boundary. One of them will eventually fail at the worst possible time, and when it does, your system silently diverges. The Transactional Outbox Pattern with Debezium CDC is the production-proven, 2PC-free solution that guarantees at-least-once event delivery without ever touching a distributed transaction coordinator.
Table of Contents
- The Real Problem: Why Dual Writes Fail in Production
- What is the Dual-Write Problem?
- The Transactional Outbox Pattern — Architecture
- Implementing the Outbox Table in Spring Boot
- Debezium CDC: Reading the Outbox Without Polling
- Failure Scenarios & Edge Cases
- Trade-offs: When NOT to Use the Outbox Pattern
- Optimization
- Key Takeaways
- Conclusion
1. The Real Problem: Why Dual Writes Fail in Production
Consider a high-traffic e-commerce order service running on Spring Boot. When a customer places an order, the service persists an Order row in PostgreSQL and immediately publishes an OrderPlaced event to a Kafka topic that the Inventory service consumes to decrement stock.
This works perfectly in the happy path. But during a routine Kafka broker rolling restart one Tuesday morning — a completely normal maintenance window — something went silently wrong. For approximately fourteen minutes, the Kafka producer client was failing with NOT_LEADER_OR_FOLLOWER errors as leadership elections completed. During those fourteen minutes, 1,200 orders were committed to PostgreSQL successfully. The database transaction completed, the HTTP 200 was returned to the customer's browser, and the order appeared fully created.
The OrderPlaced events for those 1,200 orders were never published. The Inventory service never received a single message, so stock was never decremented. The mismatch surfaced three hours later, when fulfilment staff noticed stock counts were inconsistent with order volumes — silent data corruption at scale.
The post-mortem revealed the root cause immediately: the application had a classic dual-write without any atomicity guarantee. The developer who wrote the original service had added a try-catch around the Kafka publish call and a TODO comment: "add retry logic later". That TODO sat unaddressed for eleven months. The incident cost the team two days of manual reconciliation work and a P1 SLA breach. This post explains how to make that class of failure structurally impossible.
2. What is the Dual-Write Problem?
A dual write occurs when a single logical business operation requires writing to two independent systems with no shared transaction boundary. In a typical event-driven microservice, those two systems are a relational database and a message broker. The operation looks deceptively simple in code:
// Classic dual-write — looks harmless, is not
@Transactional
public Order placeOrder(OrderRequest request) {
    Order order = orderRepository.save(new Order(request));            // writes to PostgreSQL
    kafkaTemplate.send("order-placed", new OrderPlacedEvent(order));   // writes to Kafka
    return order;
}
// What does @Transactional actually protect here?
// Only the PostgreSQL write. Kafka has NO idea about the DB transaction.
The @Transactional annotation gives a false sense of safety. It wraps the PostgreSQL JDBC operations in a database transaction, but Kafka's KafkaTemplate.send() is entirely outside that transaction scope. There are three distinct failure modes:
- DB succeeds, Kafka fails: The order is persisted. The event is never published. Downstream services never react. Data diverges silently.
- Kafka publishes, DB rolls back: The event fires. The Inventory service decrements stock. The order never actually commits to the database. A phantom order consumed real inventory.
- Network partition between service and Kafka: The publish call hangs until timeout. The DB transaction already committed. You re-throw, but the order is already saved. A retry attempt sends a duplicate event.
The instinct to fix this with try-catch and retry is seductive but broken. If the database committed and the Kafka publish failed, retrying the entire business operation creates a duplicate order in the database. If you retry only the Kafka publish, you might succeed on attempt two — but now you've published the same event twice. Without end-to-end idempotency across all consumers, duplicate events cause duplicate side effects: double inventory decrements, double email confirmations, double payment charges.
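The duplication hazard is easy to demonstrate in isolation. The sketch below uses a hypothetical FlakyBroker class (not a real Kafka client) to model the timeout case: the broker durably receives the message, but the acknowledgement is lost, so a naive retry loop publishes the same event twice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a broker whose ack is lost on the first attempt.
// The message IS stored — only the acknowledgement fails, as in a network timeout.
class FlakyBroker {
    final List<String> stored = new ArrayList<>();
    private boolean firstAttempt = true;

    void publish(String event) {
        stored.add(event);                // broker durably receives the event
        if (firstAttempt) {
            firstAttempt = false;
            throw new RuntimeException("ack timed out"); // client never sees success
        }
    }
}

public class RetryDuplicationDemo {
    // Naive retry loop: retries until publish() returns without throwing.
    static int publishWithRetry(FlakyBroker broker, String event, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                broker.publish(event);
                return attempt;
            } catch (RuntimeException timeout) {
                // swallow and retry — the "fix" that causes duplication
            }
        }
        throw new IllegalStateException("gave up");
    }

    public static void main(String[] args) {
        FlakyBroker broker = new FlakyBroker();
        publishWithRetry(broker, "OrderPlaced:1234", 3);
        // The broker now holds TWO copies of the same event.
        System.out.println("copies stored: " + broker.stored.size()); // prints "copies stored: 2"
    }
}
```

The retry "succeeds" on attempt two, but the broker holds two copies of the event — exactly the duplicate side effect described above.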
3. The Transactional Outbox Pattern — Architecture
The Transactional Outbox Pattern has an elegant core insight: instead of writing to Kafka directly from your business logic, you append a row to an OUTBOX table inside the same database transaction that modifies your domain data. Since the OUTBOX row and the domain row are committed atomically by the database, they are always consistent. A separate relay process — the CDC connector — reads unprocessed outbox rows and publishes them to Kafka, then marks them processed.
The full data flow looks like this:
Order Service
├── BEGIN TRANSACTION
│   ├── INSERT INTO orders (id, customer_id, total, status) VALUES (...)
│   └── INSERT INTO outbox_events (aggregate_type, aggregate_id,
│                                  event_type, payload, created_at)
│       VALUES ('Order', '1234', 'OrderPlaced', '{...}', NOW())
└── COMMIT   ← both rows commit atomically

PostgreSQL WAL (Write-Ahead Log)
└── Debezium Connector reads WAL changes
    └── Detects INSERT on outbox_events table
        └── Routes to Kafka topic: orders.Order
            └── Inventory Service consumes OrderPlaced event
The key architectural properties that make this reliable are: (1) the outbox INSERT is part of the business transaction — if the order commit fails, no outbox row exists, so no spurious event is ever published; (2) Debezium reads the WAL, not the application, so there is no coupling between publishing availability and order creation availability; and (3) Debezium's offset tracking in Kafka means that even if the connector restarts, it resumes from exactly where it left off, ensuring at-least-once delivery without manual bookkeeping.
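Property (1) is the heart of the pattern, and it can be modelled in a few lines. The sketch below is a toy all-or-nothing unit of work — not a real database or JPA — that shows why the domain row and the outbox row can never diverge: there is exactly one commit step for both.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of property (1): domain row and outbox row commit atomically or not at all.
// A conceptual sketch, not a real database — the point is the single commit step.
public class AtomicOutboxDemo {
    final Map<String, String> orders = new LinkedHashMap<>();   // "orders" table
    final List<String> outbox = new ArrayList<>();              // "outbox_events" table

    // Runs both writes inside one all-or-nothing unit of work.
    void placeOrder(String orderId, boolean failBeforeCommit) {
        Map<String, String> pendingOrders = new LinkedHashMap<>(orders);
        List<String> pendingOutbox = new ArrayList<>(outbox);

        pendingOrders.put(orderId, "PLACED");          // INSERT INTO orders ...
        pendingOutbox.add("OrderPlaced:" + orderId);   // INSERT INTO outbox_events ...

        if (failBeforeCommit) {
            return; // rollback: neither pending write ever becomes visible
        }
        orders.clear();  orders.putAll(pendingOrders); // COMMIT: both become visible
        outbox.clear();  outbox.addAll(pendingOutbox); // together, never separately
    }
}
```

An aborted transaction leaves both "tables" untouched; a committed one leaves both written. There is no interleaving in which one write exists without the other — the invariant the dual-write code in section 2 could not provide.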
4. Implementing the Outbox Table in Spring Boot
Start with the outbox table schema and its corresponding JPA entity. The schema is intentionally minimal — the outbox is a pure durability buffer, not a query store.
-- PostgreSQL DDL
CREATE TABLE outbox_events (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(64)  NOT NULL,
    aggregate_id   VARCHAR(64)  NOT NULL,
    event_type     VARCHAR(128) NOT NULL,
    payload        JSONB        NOT NULL,
    created_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    processed      BOOLEAN      NOT NULL DEFAULT FALSE
);

CREATE INDEX idx_outbox_unprocessed ON outbox_events (created_at)
    WHERE processed = FALSE;
// OutboxEvent.java — JPA entity
@Entity
@Table(name = "outbox_events")
public class OutboxEvent {

    @Id
    @GeneratedValue(strategy = GenerationType.UUID)
    private UUID id;

    @Column(name = "aggregate_type", nullable = false, length = 64)
    private String aggregateType;

    @Column(name = "aggregate_id", nullable = false, length = 64)
    private String aggregateId;

    @Column(name = "event_type", nullable = false, length = 128)
    private String eventType;

    // Hibernate 6: @JdbcTypeCode(SqlTypes.JSON) maps the String onto the jsonb column;
    // without it, PostgreSQL rejects the bind as character varying
    @JdbcTypeCode(SqlTypes.JSON)
    @Column(name = "payload", nullable = false, columnDefinition = "jsonb")
    private String payload;

    @Column(name = "created_at", nullable = false)
    private Instant createdAt = Instant.now();

    @Column(name = "processed", nullable = false)
    private boolean processed = false;

    // constructors, getters, builder omitted for brevity
}
The service layer is where the pattern's elegance becomes visible. Both the domain save and the outbox insert happen inside the same @Transactional method, sharing the same database connection and transaction context:
@Service
@RequiredArgsConstructor
public class OrderService {

    private final OrderRepository orderRepository;
    private final OutboxEventRepository outboxEventRepository;
    private final ObjectMapper objectMapper;

    @Transactional // wraps BOTH writes in one DB transaction
    public Order placeOrder(OrderRequest request) throws JsonProcessingException {
        // Step 1: persist the domain entity
        Order order = orderRepository.save(Order.from(request));

        // Step 2: append outbox row IN THE SAME TRANSACTION
        OutboxEvent outboxEvent = OutboxEvent.builder()
                .aggregateType("Order")
                .aggregateId(order.getId().toString())
                .eventType("OrderPlaced")
                .payload(objectMapper.writeValueAsString(OrderPlacedEvent.from(order)))
                .build();
        outboxEventRepository.save(outboxEvent);

        // No Kafka call here — Debezium handles publishing asynchronously
        return order;
    }
}
Notice the complete absence of any Kafka code in the business service. The order service has zero runtime coupling to the message broker. It doesn't need Kafka to be healthy to accept orders. A Kafka outage simply means events accumulate in the outbox table — and are published in order as soon as the broker recovers. Your order service's availability is decoupled from your messaging infrastructure's availability.
5. Debezium CDC: Reading the Outbox Without Polling
Debezium is an open-source Change Data Capture (CDC) platform that tails a database's transaction log and streams row-level change events to Kafka. For PostgreSQL, Debezium reads the Write-Ahead Log (WAL) — the same binary log that powers streaming replication. This is fundamentally different from a polling-based relay: there are no SELECT WHERE processed = FALSE queries hammering your database every few seconds. The WAL tap is a push-based, low-latency stream with minimal database overhead.
Before configuring Debezium, enable logical replication in PostgreSQL:
# postgresql.conf
wal_level = logical
max_replication_slots = 4
max_wal_senders = 4

-- Then, in psql, create a replication user:
CREATE ROLE debezium REPLICATION LOGIN PASSWORD 'debezium_secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;
GRANT USAGE ON SCHEMA public TO debezium;
Deploy the Debezium PostgreSQL connector via the Kafka Connect REST API with the outbox event router SMT (Single Message Transform):
{
  "name": "order-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "debezium_secret",
    "database.dbname": "orderdb",
    "database.server.name": "orderdb",
    "table.include.list": "public.outbox_events",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_outbox_slot",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.id": "id",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload",
    "transforms.outbox.table.field.event.type": "event_type",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.route.topic.replacement": "orders.${routedByValue}",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}
The EventRouter SMT is doing critical work here. It intercepts each CDC event from the outbox table and re-routes it to a dynamic Kafka topic based on the aggregate_type column value. An outbox row with aggregate_type = "Order" is routed to the topic orders.Order. The aggregate_id becomes the Kafka message key, which ensures that all events for the same order ID land on the same Kafka partition — preserving per-aggregate ordering. The raw CDC envelope (which includes before/after images) is stripped away, and only the business payload is forwarded.
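The topic-naming step is nothing more than substituting the routed-by field's value into the ${routedByValue} placeholder. The helper below re-implements that substitution purely for illustration — it is not Debezium API:

```java
// Illustrative re-implementation of the EventRouter's topic naming — NOT Debezium API.
// route.by.field = "aggregate_type", route.topic.replacement = "orders.${routedByValue}"
public class TopicRouting {
    static String routedTopic(String template, String routedByValue) {
        return template.replace("${routedByValue}", routedByValue);
    }

    public static void main(String[] args) {
        // An outbox row with aggregate_type = "Order" lands on topic "orders.Order"
        System.out.println(routedTopic("orders.${routedByValue}", "Order")); // prints "orders.Order"
    }
}
```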
6. Failure Scenarios & Edge Cases
Understanding the failure envelope of your system is as important as understanding its happy path. Here are the four most common operational challenges and their mitigations:
Debezium connector crashes mid-stream. This is the most common concern. Debezium stores its WAL offset (LSN — Log Sequence Number) as committed offsets in a dedicated Kafka topic named connect-offsets. When the connector restarts, it reads the last committed offset and resumes from that exact WAL position. Because WAL events are committed to Kafka before the offset is advanced, a connector crash results in at-least-once delivery, not data loss. Your Kafka consumers must handle duplicate messages with idempotency keys — for example, checking whether an OrderPlaced event with a given order ID has already been processed before applying side effects.
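A minimal consumer-side idempotency check is a "first writer wins" insert of the event ID before applying side effects. The sketch below uses an in-memory set as a stand-in for a database table or Redis key store (class and method names are hypothetical); in production the seen-IDs store must survive consumer restarts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Consumer-side idempotency sketch: the event-ID set stands in for a DB table or a
// Redis SETNX — in production the "seen" check must be durable across restarts.
public class IdempotentConsumer {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    final AtomicInteger stockDecrements = new AtomicInteger();

    void onOrderPlaced(String eventId) {
        // add() returns false if the ID was already recorded — a duplicate delivery
        if (!processedEventIds.add(eventId)) {
            return; // already processed: skip the side effect
        }
        stockDecrements.incrementAndGet(); // the side effect runs once per distinct event
    }
}
```

Delivering the same event twice leaves the stock decrement count unchanged — at-least-once delivery plus an idempotent consumer yields effectively-once side effects.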
Outbox table growth. Without a cleanup strategy, the outbox table accumulates rows indefinitely. Once Debezium has confirmed delivery (you can track this via the processed flag or a separate archival job), rows should be deleted or moved to a cold archive table. A scheduled Spring Boot job running every 15 minutes that deletes rows older than 24 hours and marked processed is usually sufficient. Ensure the delete is batched — deleting millions of rows in a single statement causes table locks.
// Outbox cleanup job — delete processed events older than 24 hours in batches
@Scheduled(fixedDelay = 15 * 60 * 1000)
public void cleanupProcessedEvents() {
    Instant cutoff = Instant.now().minus(Duration.ofHours(24));
    int totalDeleted = 0;
    int deleted;
    do {
        // deleteProcessedBefore is a custom @Modifying query deleting at most `limit`
        // rows per call (derived delete queries cannot express a LIMIT), with each
        // batch committed in its own transaction to keep locks short
        deleted = outboxEventRepository.deleteProcessedBefore(cutoff, 1000);
        totalDeleted += deleted;
    } while (deleted == 1000); // keep deleting until a batch comes back short
    log.info("Outbox cleanup: deleted {} processed events", totalDeleted);
}
Debezium replication lag. Debezium's WAL position lags behind the current database LSN by the time taken to process and publish events. Under high write loads, this lag can grow. Monitor it via the Debezium JMX metric debezium.postgres:type=connector-metrics,context=streaming,server=orderdb → MilliSecondsBehindSource. A sustained lag exceeding 30 seconds usually indicates Kafka publish throughput is the bottleneck — increase connector tasks or Kafka partition count.
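Programmatically, that metric is read through the standard JMX API. The sketch below builds the ObjectName from the metric path quoted above and reads the attribute; exact bean naming can vary across Debezium versions, so treat the name as an assumption to verify against your deployment.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Sketch of reading Debezium's streaming-lag metric over JMX. The ObjectName mirrors
// the metric path quoted above; verify it against your Debezium version.
public class DebeziumLagCheck {
    static ObjectName streamingMetrics(String serverName) throws Exception {
        return new ObjectName(
            "debezium.postgres:type=connector-metrics,context=streaming,server=" + serverName);
    }

    static long millisBehindSource(MBeanServerConnection conn, String serverName) throws Exception {
        // Reads the MilliSecondsBehindSource attribute for alerting (e.g. > 30_000 ms)
        return (Long) conn.getAttribute(streamingMetrics(serverName), "MilliSecondsBehindSource");
    }

    public static void main(String[] args) throws Exception {
        // The bean only exists inside the Kafka Connect JVM; printing the name here
        // simply validates the ObjectName syntax.
        System.out.println(streamingMetrics("orderdb"));
    }
}
```

In practice you would attach via a JMXConnector to the Kafka Connect worker's JMX port (or scrape the same bean with the Prometheus JMX Exporter, as discussed in section 8).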
Message ordering guarantees. Events for the same aggregate (same order ID) are delivered in WAL order to the same Kafka partition because the EventRouter SMT uses aggregate_id as the Kafka message key. However, events for different aggregate IDs are only guaranteed to be ordered within their own partition. If your consumer logic requires cross-aggregate ordering — for example, you need all OrderPlaced events processed before any OrderShipped events across all orders — you need a different routing strategy or a consumer-side ordering layer, because Kafka only guarantees ordering per partition.
7. Trade-offs: When NOT to Use the Outbox Pattern
The Outbox Pattern is a powerful tool, but it introduces genuine operational complexity that is not always justified. Evaluate the following trade-offs honestly before adopting it in every service:
Infrastructure overhead. The pattern requires a running Debezium connector, Kafka Connect cluster, a dedicated replication slot in PostgreSQL, and an operational team comfortable debugging WAL-related issues. For a team of two engineers maintaining a handful of low-traffic services, this infrastructure cost may dwarf the reliability benefit. A simpler approach — retry with exponential backoff and a dead-letter table — might be acceptable if your SLA tolerates occasional event delays rather than requiring strict at-least-once guarantees.
Eventual consistency delay. Events are delivered within milliseconds to seconds of the database commit, not synchronously. If your downstream service needs to query the order immediately after the event is published — for example, a customer dashboard that calls the Inventory service synchronously after placing an order — the inventory update may not have happened yet. Design your UI and downstream reads to handle eventual consistency explicitly.
Not suitable for sub-millisecond requirements. Debezium's WAL-to-Kafka pipeline introduces latency on the order of tens of milliseconds to a few seconds. If your delivery requirement is sub-millisecond — financial tick data, real-time gaming state — the outbox is the wrong pattern. Kafka's transactional producer (KIP-98) removes the relay hop, though its transaction coordination carries its own latency cost; at that latency class you likely need specialized streaming infrastructure rather than a CDC pipeline.
When a retry with idempotency key is sufficient. If the downstream consumer is fully idempotent and your retry window is large enough to tolerate occasional missed events (for example, a nightly batch that reconciles inventory), a simple in-application retry loop with an idempotency key stored in Redis may be cheaper to operate than a full CDC pipeline. Reserve the Outbox Pattern for services where missed events cause real financial or consistency damage.
"Distributed systems are not about being clever. They are about being explicit. The Outbox Pattern makes the guarantee explicit in the data model itself, not in the hope that two independent I/O operations will always succeed together."
— Gunnar Morling, Debezium project lead
8. Optimization
Once the Outbox Pattern is working correctly, several optimizations significantly improve throughput and operational visibility:
Kafka partition key selection. The EventRouter SMT uses aggregate_id as the Kafka message key by default. This ensures all events for a given order land on the same partition and are consumed in order. For high-cardinality aggregates (millions of unique order IDs), this distributes load well across partitions. However, if your aggregate ID is a sequential database integer, all new orders land on the same partition until a rebalance. Prefer UUIDs or composite keys like customerId:orderId for better partition entropy.
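The property being exploited is that a keyed record always maps to the same partition. The sketch below illustrates this with String.hashCode purely as a stand-in — Kafka's default partitioner actually uses murmur2 over the serialized key bytes, not hashCode:

```java
// Illustration of key-based partitioning: the same key always maps to the same
// partition. Kafka's default partitioner uses murmur2 over the serialized key;
// String.hashCode here is a stand-in to show the property, not Kafka's algorithm.
public class PartitionSketch {
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int partitions = 12; // partitions on the orders.Order topic
        // All events for aggregate_id "1234" land on one partition — ordering holds
        System.out.println(partitionFor("1234", partitions));
        // A composite key such as customerId:orderId spreads sequential IDs around
        System.out.println(partitionFor("cust-42:1001", partitions));
    }
}
```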
Replication slot monitoring. PostgreSQL retains WAL segments until all replication slots have consumed them. If Debezium falls behind or is stopped without its slot being dropped, WAL files accumulate on disk and can fill your data volume. Cap retention with max_slot_wal_keep_size (PostgreSQL 13+) and alert when replication slot lag exceeds a threshold:
-- Monitor replication slot lag in PostgreSQL
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_size,
       active,
       confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Alert if lag_size exceeds 1 GB
-- Drop inactive slots to release WAL: SELECT pg_drop_replication_slot('slot_name');
Batching for high-throughput scenarios. Under extreme write loads (>50,000 events/sec), the default configuration may become a bottleneck because the connector emits one Kafka record per outbox row. Tune the max.batch.size and poll.interval.ms settings in the Debezium connector configuration to let it pull larger batches of WAL events per poll. Additionally, tune the Kafka producer's linger.ms and batch.size settings to improve throughput at the cost of a small latency increase.
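The producer-side knobs mentioned above can be captured as plain properties. The values below are illustrative starting points, not tuned recommendations:

```java
import java.util.Properties;

// Producer-side batching knobs (real Kafka producer config keys); the values are
// illustrative starting points to tune against your own latency budget.
public class ProducerBatchingConfig {
    static Properties batchingProps() {
        Properties props = new Properties();
        props.setProperty("linger.ms", "20");         // wait up to 20 ms to fill a batch
        props.setProperty("batch.size", "65536");     // 64 KiB batches vs the 16 KiB default
        props.setProperty("compression.type", "lz4"); // compress whole batches cheaply
        return props;
    }
}
```

For the Debezium relay path, these would go into the Kafka Connect worker's producer.* overrides rather than an application-owned producer.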
Debezium connector lag monitoring. Expose Debezium's JMX metrics to your Prometheus stack via the JMX Exporter agent. The key metric is MilliSecondsBehindSource — set an alert at 10 seconds of lag for production. Combine this with a Grafana dashboard that shows WAL lag, Kafka consumer group lag for each downstream service, and outbox table row count to get full end-to-end visibility of your event pipeline.
Key Takeaways
- The dual-write problem is structural, not a bug to patch — retries without idempotency only transform data loss into data duplication. The fix must be architectural.
- The Outbox Pattern makes the database the source of truth for both domain state and pending events — the outbox INSERT is atomic with the business transaction, eliminating the gap that causes dual-write failures.
- Debezium CDC removes polling overhead — reading the PostgreSQL WAL is push-based, low-latency, and imposes near-zero database load compared to repeated SELECT queries on an outbox table.
- At-least-once delivery requires consumer-side idempotency — design consumers to handle duplicate events safely using idempotency keys stored in a database or cache.
- Operational hygiene is non-negotiable — monitor replication slot lag, implement outbox cleanup jobs, and alert on Debezium connector health to prevent WAL disk accumulation or silent delivery failures.
Conclusion
The Transactional Outbox Pattern with Debezium CDC is the most pragmatic solution to the dual-write problem in event-driven microservices. It requires no distributed transaction coordinator, no two-phase commit protocol, and no changes to how your business logic is written beyond routing the outbox INSERT through the same @Transactional boundary as your domain save. The reliability guarantee comes from exploiting the atomicity that your relational database already provides — something it has done reliably for decades — rather than introducing a new coordination protocol between heterogeneous systems.
For teams managing complex multi-service workflows where the outbox is just one piece of a larger consistency puzzle, complement this pattern with a choreography or orchestration layer. Our Saga Pattern guide covers orchestration and choreography approaches for distributed transaction patterns that work hand-in-hand with the outbox. And for handling events that fail after repeated delivery attempts, our guide on Dead Letter Queue Patterns covers the full spectrum of error handling strategies for Kafka consumers.
Related Posts
Saga Pattern: Orchestration vs Choreography
Manage distributed transactions across microservices without 2PC using the Saga pattern.
Dead Letter Queue Patterns
Handle poison-pill messages and persistent consumer failures with robust DLQ strategies.
Event-Driven Architecture Patterns
Design loosely coupled systems with events, Kafka, and async communication patterns.
Last updated: March 2026 — Written by Md Sanwar Hossain