Dual Control Planes for Geo-Partitioned Event Sourcing: Surviving Region Blackouts
Audience: platform, SRE, and staff engineers running multi-region Kafka/Postgres event stores who need to survive full regional loss without breaking audit guarantees.
Introduction
Event-sourced systems love audit trails, reliable replays, and append-only histories. They hate ambiguous failovers, partial replicas, and control planes that vanish mid-incident. In a single region, the control plane that owns schemas, topic policies, consumer offsets, and projection rollouts can be centralized. In a geo-partitioned deployment, centralization becomes a liability: if the region hosting the control plane disappears, operators are blind and the replay choreography that keeps projections coherent freezes.
This piece distills the pattern into concrete commands, configs, and a step-by-step recovery flow you can lift into a runbook.
Story: When the East Coast Disappears
It is 08:12 UTC on a Tuesday. An upstream cloud networking incident silently isolates us-east-1. Your Kafka control plane, Schema Registry primary, and GitOps controllers all live there. Producers in eu-west-1 keep writing locally. Consumers in ap-southeast-1 stall because their offset commits target a control plane that is now unreachable. Within 15 minutes, incident commanders say, “Fail east traffic to EU.” The runbook says nothing about projection rebuild order or who owns schema evolution. Without a second control plane, you are left with hand-edited configs and hope.
Why Dual Control Planes in Event-Sourced Stacks
Dual control planes are not about hot spares; they are about autonomy. Each region needs a control plane that can:
- Approve or reject schema changes and topic ACLs without waiting for a remote admin API.
- Coordinate projection rebuilds and idempotent replay guards locally.
- Own consumer offset commits and DLQ routing within the region.
- Synchronize policy, not command, across peers so that one region’s outage does not block others from acting.
Instead of a single orchestration brain, you operate two peer brains that exchange state through a low-frequency, signed policy mirror. During an outage, each region keeps processing with its last validated policy and queues diffs for later reconciliation.
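The low-frequency policy mirror only works if each region can verify a bundle's signature offline, before applying it. Below is a minimal Java sketch of that check using the JDK's built-in ECDSA primitives; the class name and the bundle payload are illustrative assumptions, not part of any specific tool.

```java
import java.security.*;

// Sketch: verify a signed policy bundle before a regional control plane applies it.
// Key distribution (e.g. a publicKey.pem ConfigMap) is out of scope here.
public final class PolicyBundleVerifier {

    // Detached signature over the bundle bytes, SHA-256 + ECDSA.
    public static byte[] sign(byte[] bundle, PrivateKey key) throws GeneralSecurityException {
        Signature sig = Signature.getInstance("SHA256withECDSA");
        sig.initSign(key);
        sig.update(bundle);
        return sig.sign();
    }

    // Returns true only if the bundle bytes match the detached signature.
    public static boolean verify(byte[] bundle, byte[] signature, PublicKey key) throws GeneralSecurityException {
        Signature sig = Signature.getInstance("SHA256withECDSA");
        sig.initVerify(key);
        sig.update(bundle);
        return sig.verify(signature);
    }

    public static void main(String[] args) throws Exception {
        KeyPair pair = KeyPairGenerator.getInstance("EC").generateKeyPair();
        byte[] bundle = "schemas+acls+ownership@v42".getBytes();
        byte[] signature = sign(bundle, pair.getPrivate());
        if (!verify(bundle, signature, pair.getPublic())) throw new AssertionError("valid bundle rejected");
        bundle[0] ^= 1; // tamper with one byte
        if (verify(bundle, signature, pair.getPublic())) throw new AssertionError("tampered bundle accepted");
        System.out.println("signature checks passed");
    }
}
```

A region that fails this check keeps running on its last validated policy rather than applying an unverifiable diff.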
Architecture Blueprint
The pattern builds four planes per region: data (Kafka + event store), control (GitOps + registries + orchestration), compute (services + projections), and observability (metrics/logs/traces). Duplicate the control plane in at least two regions and mirror policy as code through signed Git remotes. In practice:
# flux-kustomization.yaml (region-scoped)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: control-plane
spec:
  interval: 2m
  targetNamespace: platform-system
  path: ./clusters/us-east-1
  prune: true
  sourceRef:
    kind: GitRepository
    name: platform-config
  suspend: false
  postBuild:
    substitute:
      REGION: us-east-1
      KAFKA_CLUSTER: kafka-east
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: policy-signing
  namespace: platform-system
data:
  publicKey.pem: |
    -----BEGIN PUBLIC KEY-----
    -----END PUBLIC KEY-----
Each region runs its own Flux/ArgoCD instance pointed at the same Git mirror but with region-scoped paths, so a regional loss does not freeze reconciliation elsewhere. The orchestration code that fans out projection rebuilds should still enforce structured lifecycles: every replay job owns its child tasks and cancels them cleanly when switching failover states.
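That structured-lifecycle requirement can be sketched in plain Java: a coordinator submits child rebuild tasks, stops on the first failure, and always cancels surviving siblings before returning, so no replay outlives a failover decision. The names (ReplayCoordinator, rebuildAll) are illustrative, not a real API.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a replay job that owns its child tasks: any child failure, or
// cancellation of the coordinator itself, tears down all siblings on exit.
public final class ReplayCoordinator {

    // Runs all child rebuilds; returns true only if every child finished.
    public static boolean rebuildAll(List<Callable<Void>> children) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        CompletionService<Void> done = new ExecutorCompletionService<>(pool);
        List<Future<Void>> futures = new ArrayList<>();
        try {
            for (Callable<Void> c : children) futures.add(done.submit(c));
            for (int i = 0; i < children.size(); i++) {
                try {
                    done.take().get(); // wait for the next child to finish
                } catch (ExecutionException e) {
                    return false; // first failure: the finally block cancels the siblings
                }
            }
            return true;
        } finally {
            for (Future<Void> f : futures) f.cancel(true); // no orphaned replays
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Void>> ok = List.of(() -> null, () -> null);
        if (!rebuildAll(ok)) throw new AssertionError("healthy replay reported failure");
        List<Callable<Void>> oneBad = List.of(
                () -> null,
                () -> { throw new IllegalStateException("poison batch"); });
        if (rebuildAll(oneBad)) throw new AssertionError("failed replay reported success");
        System.out.println("replay ownership checks passed");
    }
}
```

On Java 21+ the same discipline maps naturally onto StructuredTaskScope; the executor version above avoids preview features.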
Event Store Topology
For geo-partitioned event sourcing, the typical topology is:
- Kafka clusters per region with MirrorMaker 2 (or Cluster Linking) replicating topics across peer regions.
- Region-pinned partitions for latency-sensitive aggregates, plus geo-replicated audit topics for global reads.
- Write-local, read-anywhere semantics: producers write to their local cluster; global consumers read from replicas with well-defined lag budgets.
Example topic creation that enforces region pinning and retention aligned to replay windows:
$ kafka-topics --bootstrap-server kafka-east:9092 \
--create --topic orders.us \
--partitions 12 --replication-factor 3 \
--config min.insync.replicas=2 \
--config retention.ms=1209600000 # 14 days
# single-quoted heredoc prevents shell expansion and keeps the MM2 config literal
$ cat > mm2.properties <<'EOF'
clusters = east,eu
east.bootstrap.servers = kafka-east:9092
eu.bootstrap.servers = kafka-eu:9092
east->eu.enabled = true
east->eu.topics = orders.us
sync.topic.acls.enabled = true
tasks.max = 4
EOF
$ connect-mirror-maker.sh mm2.properties
Control Plane Mechanics and Failover
Dual control planes operate in active-active mode with a treaty:
- Policy is mirrored Git: schemas, ACLs, consumer group ownership, projection rollout manifests.
- Runtime state is local: offsets, DLQ topics, replay cursors.
- Leaders are regional: each control plane is authoritative for its region’s compute plane.
During normal operations, policy PRs can merge in either region; each merge is signed and mirrored to the peer. During a blackout, the surviving control plane freezes cross-region policy merges but keeps applying region-local manifests, then reconciles deterministically when the failed region returns.
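One way to make that post-blackout reconciliation deterministic is to version every policy key monotonically and merge by highest version, so both regions converge on the same result regardless of which side the merge starts from. A hedged sketch under that assumption; the Entry layout and key names are illustrative, and version ties are assumed to carry identical payloads from the same signed bundle.

```java
import java.util.*;

// Sketch: deterministic policy reconciliation once the failed region returns.
// Higher version wins per key, so merge(a, b) == merge(b, a).
public final class PolicyReconciler {
    record Entry(long version, String value) {}

    public static Map<String, Entry> merge(Map<String, Entry> a, Map<String, Entry> b) {
        Map<String, Entry> out = new TreeMap<>(a); // sorted keys keep output deterministic
        b.forEach((k, v) -> out.merge(k, v, (x, y) -> x.version() >= y.version() ? x : y));
        return out;
    }

    public static void main(String[] args) {
        var east = Map.of(
                "acl/orders", new Entry(3, "rw"),
                "schema/orders-value", new Entry(7, "v7"));
        var eu = Map.of(
                "acl/orders", new Entry(5, "ro"),
                "schema/orders-value", new Entry(7, "v7"));
        // Order of arguments must not matter.
        if (!merge(east, eu).equals(merge(eu, east))) throw new AssertionError("merge is order-dependent");
        // EU's newer ACL (version 5) beats east's stale one (version 3).
        if (!merge(east, eu).get("acl/orders").value().equals("ro")) throw new AssertionError();
        System.out.println("reconciliation is order-independent");
    }
}
```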
Step-by-Step Recovery Flow
The runbook below assumes us-east-1 went dark while eu-west-1 survived:
- Freeze cross-region changes: Pause MirrorMaker/Cluster Linking for topics sourced from the failed region to prevent stale backfill.
- Promote standby schemas: In EU, set Schema Registry compatibility to BACKWARD and apply the last signed schema bundle.
- Re-point producers: Toggle DNS or service mesh to route US traffic to EU ingress; writes land in orders.eu.
- Replay critical projections: Kick off scoped replays (orders, payments) from a checkpointed offset snapshot, canceling on first failure.
- Switch read models: Update API read endpoints to point to EU projections.
- Drain DLQs locally: EU control plane owns DLQ processing; halt cross-region DLQ shipping until east recovers.
- Observe lag budgets: Alert if replication lag for orders.us mirrors exceeds your RPO (e.g., 5 minutes).
- Restore east: When east recovers, keep it read-only, reconcile schemas/ACLs, then re-enable mirrors from EU to US.
- Offset reconciliation and unfreeze: Export EU offsets, import to US once caught up, then reopen cross-region Git mirrors and automation.
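The offset reconciliation step above reduces to a per-partition max-merge: never move a cursor backwards, and adopt the surviving region's offset wherever it is ahead. A sketch under that assumption; the partition names and snapshot shape are illustrative, not a Kafka API.

```java
import java.util.*;

// Sketch: import the surviving region's committed offsets into the recovered
// region without ever rewinding a consumer cursor.
public final class OffsetImporter {

    // For each topic-partition, keep the larger of the recovered region's own
    // offset and the surviving region's exported offset.
    public static Map<String, Long> reconcile(Map<String, Long> recovered, Map<String, Long> surviving) {
        Map<String, Long> out = new HashMap<>(recovered);
        surviving.forEach((tp, off) -> out.merge(tp, off, Math::max));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> us = Map.of("orders.us-0", 1_000L, "orders.us-1", 2_500L);
        Map<String, Long> eu = Map.of("orders.us-0", 1_400L, "orders.us-1", 2_400L, "orders.us-2", 10L);
        Map<String, Long> merged = reconcile(us, eu);
        if (merged.get("orders.us-0") != 1_400L) throw new AssertionError(); // EU was ahead
        if (merged.get("orders.us-1") != 2_500L) throw new AssertionError(); // never rewind
        if (merged.get("orders.us-2") != 10L) throw new AssertionError();    // new partition picked up
        System.out.println("offset reconciliation checks passed");
    }
}
```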
Operational Commands and Config Snippets
# Pause mirror links from failed region
$ kafka-cluster-links --bootstrap-server kafka-eu:9092 \
--alter --link east-to-eu --config "link.mode=paused"
# Promote EU schema bundle
$ curl -X PUT http://schema-eu:8081/config \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"compatibility": "BACKWARD"}'
$ tar -xf schemas/signed-bundle-us-east.tar.gz -C /tmp/schemas
$ curl -X POST http://schema-eu:8081/subjects/orders-value/versions \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d @/tmp/schemas/orders-value.json
# Replay projections with checkpointed offsets
$ kubectl -n platform-system create job replay-orders \
--image=registry/replayer:1.4 \
-- \
--topic orders.eu --group projections.orders \
--offset-snapshot s3://backups/checkpoints/orders-eu.json
# Drain DLQ locally
$ kafka-console-consumer --bootstrap-server kafka-eu:9092 \
--topic orders.dlq --from-beginning --property print.headers=true
Configuring consumer failover with region affinity:
spring:
  kafka:
    bootstrap-servers:
      - kafka-eu:9092
      - kafka-east:9092
    properties:
      client.rack: eu-west-1
      partition.assignment.strategy: org.apache.kafka.clients.consumer.StickyAssignor
    consumer:
      group-id: projections.orders
      auto-offset-reset: latest
      enable-auto-commit: false
Failure Modes to Drill
- Schema divergence: One region merges a backward-incompatible schema before the mirror pauses. Drill rollbacks using the signed bundle mechanism.
- Mirror partition skew: After reconnection, some partitions are ahead in EU; rehearse selective catch-up with kafka-reassign-partitions.
- Projection poisoning: A bad event batch replays twice. Validate idempotency keys and ensure replay jobs stop on first duplicate detection.
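The dedupe fence behind that last drill can be as simple as a set of idempotency keys consulted before each apply, with the replay halting the moment a key repeats. A minimal sketch; in production the seen-set would live in the projection's own store, not in memory.

```java
import java.util.*;

// Sketch: stop a replay on the first duplicate instead of double-applying
// a poisoned batch.
public final class DedupeFence {
    private final Set<String> seen = new HashSet<>();

    // Returns false when a key repeats; the caller should abort the replay.
    public boolean admit(String idempotencyKey) {
        return seen.add(idempotencyKey);
    }

    public static void main(String[] args) {
        DedupeFence fence = new DedupeFence();
        String[] replayedBatch = {"evt-1", "evt-2", "evt-3", "evt-2", "evt-4"};
        int applied = 0;
        for (String key : replayedBatch) {
            if (!fence.admit(key)) break; // stop on first duplicate
            applied++;
        }
        if (applied != 3) throw new AssertionError("expected replay to halt at the duplicate");
        System.out.println("dedupe fence halted after " + applied + " events");
    }
}
```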
Observability and Runbooks
Blackouts blur visibility. Keep observability regional and federate asynchronously. Runbooks should express steps as structured workflows rather than ad-hoc commands; nested tasks that own their children prevent orphaned replays when failover decisions change. The orchestration discipline in structured concurrency maps directly to incident automations.
- Lag dashboards per link: Mirror lag, offset export/import durations, DLQ depth.
- Health budgets: Alert when control-plane reconciliation exceeds 5 minutes or when schema bundles are older than 24 hours.
Data Repair and Consistency
Once the failed region returns:
- Read-only staging: Keep producers on the surviving region; mount recovered brokers as mirrors.
- Verify deltas: Compare partition checksums and event counts; only proceed when parity holds.
- Import and restore: Import surviving offsets, run a dry-run replay to confirm idempotency, then shift traffic back gradually (10%, 25%, 50%, 100%).
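The delta verification above boils down to comparing per-partition event counts plus a digest between surviving and recovered brokers. A toy sketch, using CRC32 purely as a stand-in for whatever checksum your tooling emits:

```java
import java.util.*;
import java.util.zip.CRC32;

// Sketch: only shift traffic back once counts and checksums agree per partition.
public final class ParityCheck {
    record PartitionDigest(long eventCount, long checksum) {}

    // Rolling CRC32 over the serialized events plus the event count.
    public static PartitionDigest digest(List<String> events) {
        CRC32 crc = new CRC32();
        for (String e : events) crc.update(e.getBytes());
        return new PartitionDigest(events.size(), crc.getValue());
    }

    public static boolean parity(List<String> surviving, List<String> recovered) {
        return digest(surviving).equals(digest(recovered));
    }

    public static void main(String[] args) {
        List<String> eu = List.of("e1", "e2", "e3");
        if (!parity(eu, List.of("e1", "e2", "e3"))) throw new AssertionError();
        if (parity(eu, List.of("e1", "e2"))) throw new AssertionError(); // missing event detected
        System.out.println("parity checks passed");
    }
}
```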
Trade-offs and Costs
Dual control planes add expense: duplicate GitOps controllers, schema registries, and CI runners per region, plus more storage for mirrored topics. The payoff is autonomy: regional outages become localized incidents, and you keep audit-grade guarantees because replay order and schema governance never rely on a single region.
Mistakes to Avoid
- Letting offset edits bypass signed policy bundles.
- Failing to pause mirrors before rerouting producers, causing backfill storms.
- Replaying projections without idempotency keys or dedupe fences.
- Running a single Schema Registry for all regions.
- Treating DLQs as trash cans instead of structured queues with ownership.
Key Takeaways
- Control planes must be regional citizens with their own autonomy, not distant overlords.
- Policy mirrors and signed bundles keep schemas, ACLs, and ownership aligned across blackouts.
- Replay discipline, lag budgets, and DLQ ownership matter more than raw replication speed.
- Runbooks should be executable scripts with clear cancellation semantics, not prose.
Conclusion
Surviving a regional blackout in an event-sourced world is less about “flip DNS” and more about choreography: pausing mirrors, rerouting producers, replaying projections in the right order, and reconciling offsets with proof. Dual control planes give you that choreography even when one side of the world goes dark. For the concurrency discipline behind those workflows, revisit structured concurrency and apply it to your incident automations.
For orchestration patterns behind these workflows, read the structured concurrency guide: https://mdsanwarhossain.me/blog-java-structured-concurrency.html.