System Design · March 22, 2026 · 18 min read

Dual Control Planes for Geo-Partitioned Event Sourcing: Surviving Region Blackouts

Audience: platform, SRE, and staff engineers running multi-region Kafka/Postgres event stores who need to survive full regional loss without breaking audit guarantees.

Table of Contents

  1. Introduction
  2. Story: When the East Coast Disappears
  3. Why Dual Control Planes in Event-Sourced Stacks
  4. Architecture Blueprint
  5. Event Store Topology
  6. Control Plane Mechanics and Failover
  7. Step-by-Step Recovery Flow
  8. Failure Modes to Drill
  9. Trade-offs and Costs
  10. Mistakes to Avoid
  11. Key Takeaways

Introduction

Event-sourced systems love audit trails, reliable replays, and append-only histories. They hate ambiguous failovers, partial replicas, and control planes that vanish mid-incident. In a single region, the control plane that owns schemas, topic policies, consumer offsets, and projection rollouts can be centralized. In a geo-partitioned deployment, centralization becomes a liability: if the region hosting the control plane disappears, operators are blind and the replay choreography that keeps projections coherent freezes.

This piece distills the pattern into concrete commands, configs, and a step-by-step recovery flow you can lift into a runbook.

Story: When the East Coast Disappears

It is 08:12 UTC on a Tuesday. An upstream cloud networking incident silently isolates us-east-1. Your Kafka control plane, Schema Registry primary, and GitOps controllers all live there. Producers in eu-west-1 keep writing locally. Consumers in ap-southeast-1 stall because their offset commits target a control plane that is now unreachable. Within 15 minutes, incident commanders say, "Fail east traffic to EU." The runbook says nothing about projection rebuild order or who owns schema evolution. Without a second control plane, you are left with hand-edited configs and hope.

Why Dual Control Planes in Event-Sourced Stacks

Dual control planes are not about hot spares; they are about autonomy. Each region needs a control plane that can:

  1. Validate and register schema changes on its own.
  2. Apply topic policies and ACLs without a remote authority.
  3. Track consumer offsets, DLQs, and replay cursors locally.
  4. Roll out and roll back projections for its own compute plane.

Instead of a single orchestration brain, you operate two peer brains that exchange state through a low-frequency, signed policy mirror. During an outage, each region keeps processing with its last validated policy and queues diffs for later reconciliation.
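As a minimal sketch of that policy mirror, the snippet below signs policy bundles, rejects anything unverified, and queues diffs whenever the peer region is unreachable. The symmetric HMAC key is a simplification for brevity (a real deployment would verify asymmetric signatures on Git tags or bundles), and the class and method names are illustrative, not from any particular tool:

```python
import hashlib
import hmac
import json

# Illustrative shared key; in practice each region holds the peer's public
# key and verifies asymmetric signatures (e.g., GPG-signed Git tags).
MIRROR_KEY = b"example-shared-secret"

def sign_policy(policy: dict) -> str:
    """Produce a deterministic signature over a policy document."""
    payload = json.dumps(policy, sort_keys=True).encode()
    return hmac.new(MIRROR_KEY, payload, hashlib.sha256).hexdigest()

def verify_policy(policy: dict, signature: str) -> bool:
    """Reject mirrored policy whose signature does not match."""
    return hmac.compare_digest(sign_policy(policy), signature)

class PolicyMirror:
    """Holds the last validated policy; queues diffs while the peer is dark."""

    def __init__(self, policy: dict):
        self.active = policy           # last validated policy, applied locally
        self.pending: list[dict] = []  # diffs queued for later reconciliation

    def receive(self, policy: dict, signature: str, peer_reachable: bool):
        if not verify_policy(policy, signature):
            raise ValueError("unsigned or tampered policy bundle rejected")
        if peer_reachable:
            self.active = policy       # normal operation: apply immediately
        else:
            self.pending.append(policy)  # outage: reconcile when peer returns
```

The key property is that an isolated region never applies unverifiable policy and never blocks: it keeps running on `self.active` and accumulates `self.pending` for the reconciliation pass.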

Architecture Blueprint

The pattern builds four planes per region: data (Kafka + event store), control (GitOps + registries + orchestration), compute (services + projections), and observability (metrics/logs/traces). Duplicate the control plane in at least two regions and mirror policy as code through signed Git remotes.

Event Store Topology

For geo-partitioned event sourcing, the typical topology is:

  1. A Kafka cluster per region that owns all locally produced topics.
  2. Asynchronous cross-region copies via MirrorMaker 2 or Cluster Linking.
  3. A region-local event store (Postgres) for the append-only history and snapshots.
  4. Region-local DLQ and replay-cursor topics; runtime state stays regional.
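Whatever the exact layout, consumers must be able to tell locally produced topics from mirrored copies. MirrorMaker 2's default replication policy does this by naming replicated topics `<source-cluster-alias>.<topic>`; the helpers below are an illustrative sketch of that convention (the function names are not from any library):

```python
SEPARATOR = "."  # MM2's DefaultReplicationPolicy separator

def remote_topic(source_alias: str, topic: str) -> str:
    """Name a mirrored topic the way MM2's default policy does."""
    return f"{source_alias}{SEPARATOR}{topic}"

def is_local(topic: str, known_aliases: set[str]) -> bool:
    """A topic whose prefix is not a known cluster alias was produced here."""
    prefix, _, _ = topic.partition(SEPARATOR)
    return prefix not in known_aliases
```

Keeping this convention explicit matters during failover: replays and offset reconciliation must target the mirrored copy (`us-east-1.orders` in EU), never the local topic of the same base name.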

Control Plane Mechanics and Failover

Dual control planes operate in active-active mode with a treaty:

  1. Policy is mirrored via Git: schemas, ACLs, consumer group ownership, projection rollout manifests.
  2. Runtime state is local: offsets, DLQ topics, replay cursors.
  3. Leaders are regional: each control plane is authoritative for its region's compute plane.

During normal operations, policy PRs merge in either region, signed, and mirrored. During a blackout, the surviving control plane freezes cross-region policy merges but keeps applying region-local manifests, then reconciles deterministically when the failed region returns.
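"Reconciles deterministically" is the load-bearing phrase: both control planes must replay the queued policy diffs in the same total order, regardless of when each diff arrived during the partition. One way to get that, sketched below with illustrative names, is to sort diffs by a key derived only from their content (timestamp, then authoring region, then a content hash as tiebreak) and apply them last-writer-wins:

```python
import hashlib
import json

def _key(change: dict) -> tuple:
    """Total order: timestamp, then region, then content hash as tiebreak."""
    digest = hashlib.sha256(
        json.dumps(change["policy"], sort_keys=True).encode()
    ).hexdigest()
    return (change["ts"], change["region"], digest)

def reconcile(base: dict, queued: list[dict]) -> dict:
    """Replay queued diffs in the same order on both control planes.

    Because the sort key depends only on the diffs themselves, both sides
    converge on identical policy no matter what order the diffs arrived in
    during the partition.
    """
    policy = dict(base)
    for change in sorted(queued, key=_key):
        policy.update(change["policy"])  # last-writer-wins per policy key
    return policy
```

The same function run in both regions over the union of queued diffs yields byte-identical policy, which is what lets you reopen cross-region automation without a manual merge.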

Step-by-Step Recovery Flow

The runbook below assumes us-east-1 went dark while eu-west-1 survived:

  1. Freeze cross-region changes: Pause MirrorMaker/Cluster Linking for topics sourced from the failed region.
  2. Promote standby schemas: In EU, set Schema Registry compatibility to BACKWARD and apply the last signed schema bundle.
  3. Re-point producers: Toggle DNS or service mesh to route US traffic to EU ingress.
  4. Replay critical projections: Kick off scoped replays from a checkpointed offset snapshot, canceling on first failure.
  5. Switch read models: Update API read endpoints to point to EU projections.
  6. Drain DLQs locally: EU control plane owns DLQ processing; halt cross-region DLQ shipping until east recovers.
  7. Observe lag budgets: Alert if replication lag exceeds your RPO (e.g., 5 minutes).
  8. Restore east: Keep it read-only, reconcile schemas/ACLs, then re-enable mirrors from EU to US.
  9. Offset reconciliation and unfreeze: Export EU offsets, import to US once caught up, then reopen cross-region automation.
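Step 9's offset merge has one invariant worth encoding: never rewind a consumer group. Assuming the EU offsets have already been translated into the US cluster's numbering (MirrorMaker 2 emits checkpoint records for exactly this translation), the merge reduces to taking the furthest committed position per topic-partition. A sketch, with illustrative names:

```python
def merge_offsets(us: dict, eu: dict) -> dict:
    """Merge committed offsets without ever rewinding a consumer group.

    Keys are (topic, partition) tuples; values are committed offsets already
    translated into the target cluster's numbering. For each partition seen
    in either region, keep the maximum position.
    """
    merged = dict(us)
    for tp, offset in eu.items():
        merged[tp] = max(merged.get(tp, 0), offset)
    return merged
```

Run this against the exported EU snapshot before reopening cross-region automation; any partition where the EU position exceeds the US one is a partition the restored region must catch up on before it resumes serving reads.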

Failure Modes to Drill

Run game days against at least these scenarios:

  1. Full regional blackout: verify the surviving control plane freezes cross-region policy merges and keeps applying region-local manifests.
  2. Split-brain policy merges: conflicting PRs merge in both regions during a partition; verify reconciliation is deterministic and produces identical policy on both sides.
  3. Offset divergence: mirrored consumer offsets lag the source; verify replays start from a checkpointed snapshot, not a guess.
  4. Registry drift: a schema registered in only one region; verify the other region rejects it until the signed bundle arrives.

Trade-offs and Costs

Dual control planes add expense: duplicate GitOps controllers, schema registries, and CI runners per region, plus more storage for mirrored topics. The payoff is autonomy: regional outages become localized incidents, and you keep audit-grade guarantees because replay order and schema governance never rely on a single region.

Mistakes to Avoid

  1. Centralizing Schema Registry, GitOps controllers, or offset authority in a single region.
  2. Mirroring runtime state (offsets, DLQs, replay cursors) as if it were policy; runtime state stays local.
  3. Unfreezing cross-region automation before offsets and schemas are reconciled.
  4. Treating failover as a DNS flip and skipping the projection replay order.

Key Takeaways

  1. Duplicate the control plane, not just the data plane; each region must be able to act alone.
  2. Mirror policy as signed code; keep runtime state regional.
  3. Make each control plane authoritative for its own region's compute plane.
  4. Script the choreography: freeze mirrors, promote schemas, re-point producers, replay projections, reconcile offsets with proof.

Conclusion

Surviving a regional blackout in an event-sourced world is less about "flip DNS" and more about choreography: pausing mirrors, rerouting producers, replaying projections in the right order, and reconciling offsets with proof. Dual control planes give you that choreography even when one side of the world goes dark.
