Kubernetes Operator Pattern: Building Custom Controllers for Stateful Applications
Kubernetes Deployments and StatefulSets handle stateless and basic stateful workloads well. But running databases, message brokers, and distributed datastores at production scale requires operational knowledge that can't be encoded in YAML — backup orchestration, primary election, rolling schema upgrades, disaster recovery. The Operator Pattern encodes this human operational knowledge as code.
Table of Contents
- The Problem with YAML-Driven StatefulSet Management
- The Operator Pattern: Controller + CRD
- The Reconciliation Loop
- Designing the Custom Resource (CRD)
- Implementing the Controller with Kubebuilder
- Real Stateful Application Scenarios
- Production Failure Scenarios
- Testing Operators: Unit, Integration, E2E
- Trade-offs and When NOT to Write an Operator
- Key Takeaways
1. The Problem with YAML-Driven StatefulSet Management
Consider deploying a Kafka cluster on Kubernetes. A StatefulSet can manage the pods, but it can't:
- Automatically rebalance partitions when a broker is added or removed
- Perform rolling upgrades while ensuring at least one in-sync replica (ISR) per partition
- Automatically trigger backup before a destructive operation
- Detect and remediate a broker that's joined with mismatched configuration
- Expose cluster-level metrics as Kubernetes status conditions
These operations require deep application-specific knowledge. Before Operators, teams encoded this knowledge in runbooks, manual scripts, and tribal knowledge. The Operator Pattern replaces the runbook with code — specifically, a Kubernetes controller that watches a Custom Resource and continuously reconciles the cluster state toward the desired state.
2. The Operator Pattern: Controller + CRD
An Operator consists of two components:
- Custom Resource Definition (CRD): Extends the Kubernetes API with a new resource type (e.g., KafkaCluster). Users create instances of this resource to express desired state.
- Controller: A program (typically written in Go) running as a Deployment inside the cluster. It watches Custom Resource instances and reconciles the cluster's actual state to match the desired state expressed in the resource.
User applies a KafkaCluster manifest (kubectl apply)
↓
Kubernetes API Server stores it in etcd
↓
Controller watches for KafkaCluster events (informer)
↓
Reconcile loop runs: compare desired vs. actual
↓
Controller creates/updates/deletes: StatefulSets, Services,
ConfigMaps, PVCs, RBAC, NetworkPolicies
↓
Controller updates KafkaCluster.Status (conditions, observedState)
3. The Reconciliation Loop
The reconciliation loop is the heart of every controller. It must be:
- Idempotent: Running reconcile 100 times on an already-converged cluster should produce no changes. Use CreateOrUpdate semantics, not just Create.
- Level-triggered, not edge-triggered: The controller doesn't act on "what changed" but on "what is the current desired vs. actual state." This makes it resilient to missed and duplicated events.
- Return RequeueAfter on partial progress: If the cluster is in a transitional state (e.g., a node is still starting up), return ctrl.Result{RequeueAfter: 30 * time.Second} to check again later.
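The three properties above can be sketched with a toy reconcile function. This is plain Go with no controller-runtime dependency: Result and the returned action list are simplified stand-ins for ctrl.Result and real API writes, and Cluster is a hypothetical model of observed state.

```go
package main

import "fmt"

// Result is a simplified stand-in for ctrl.Result.
type Result struct{ RequeueAfterSeconds int }

// Cluster models the actual state the controller observes (hypothetical).
type Cluster struct {
	Replicas      int
	ReadyReplicas int
}

// reconcile compares desired vs. actual state (level-triggered) and is
// idempotent: on a converged cluster it performs no writes.
func reconcile(desiredReplicas int, c *Cluster) (Result, []string) {
	var actions []string
	if c.Replicas != desiredReplicas {
		c.Replicas = desiredReplicas // CreateOrUpdate semantics: write only on drift
		actions = append(actions, fmt.Sprintf("scale to %d", desiredReplicas))
	}
	if c.ReadyReplicas < c.Replicas {
		// Partial progress: don't block, ask to be requeued.
		return Result{RequeueAfterSeconds: 30}, actions
	}
	return Result{}, actions
}

func main() {
	c := &Cluster{Replicas: 1, ReadyReplicas: 1}
	r1, a1 := reconcile(3, c) // first pass: acts and requeues
	c.ReadyReplicas = 3       // brokers come up between passes
	r2, a2 := reconcile(3, c) // converged pass: no-op
	fmt.Println(a1, r1.RequeueAfterSeconds, a2, r2.RequeueAfterSeconds)
}
```

Running reconcile a second time on the converged cluster produces no actions and no requeue, which is exactly the idempotence property the bullet list demands.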
4. Designing the Custom Resource (CRD)
apiVersion: kafka.myorg.io/v1alpha1
kind: KafkaCluster
metadata:
name: payments-kafka
namespace: production
spec:
version: "3.7.0"
replicas: 3
resources:
requests:
cpu: "2"
memory: "8Gi"
limits:
cpu: "4"
memory: "16Gi"
storage:
class: premium-ssd
size: 500Gi
config:
defaultReplicationFactor: 3
minInsyncReplicas: 2
logRetentionHours: 168
backup:
enabled: true
schedule: "0 3 * * *" # Daily at 3 AM
destination: s3://backups/kafka
monitoring:
prometheusEnabled: true
status:
phase: Running # Pending | Initializing | Running | Degraded
readyBrokers: 3
conditions:
- type: Available
status: "True"
lastTransitionTime: "2026-03-19T05:00:00Z"
- type: BrokersDegraded
status: "False"
CRD design principles: (1) Spec expresses desired state (user intent); Status expresses observed state. Never let users write to Status — it's owned by the controller (enable the /status subresource so spec and status updates are separate API calls). (2) Version your API from day one (v1alpha1 → v1beta1 → v1). Kubernetes CRD versioning with conversion webhooks handles schema evolution. (3) Validate early: use CEL (Common Expression Language) validation rules in the CRD schema to reject invalid specs before they ever reach the controller.
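A hedged sketch of the Go types that could back this schema, with kubebuilder validation markers. Field and type names are illustrative and would need to match the actual CRD; the CEL rule shown enforces that minInsyncReplicas never exceeds defaultReplicationFactor, which is the kind of cross-field constraint plain OpenAPI validation can't express.

```go
package main

import "fmt"

// KafkaClusterSpec sketches Go types behind the CRD schema above.
// The kubebuilder markers generate OpenAPI + CEL validation.
type KafkaClusterSpec struct {
	// +kubebuilder:validation:Pattern=`^\d+\.\d+\.\d+$`
	Version string `json:"version"`

	// +kubebuilder:validation:Minimum=1
	Replicas int32 `json:"replicas"`

	Config KafkaConfig `json:"config"`
}

// +kubebuilder:validation:XValidation:rule="self.minInsyncReplicas <= self.defaultReplicationFactor",message="min ISR cannot exceed replication factor"
type KafkaConfig struct {
	DefaultReplicationFactor int32 `json:"defaultReplicationFactor"`
	MinInsyncReplicas        int32 `json:"minInsyncReplicas"`
	LogRetentionHours        int32 `json:"logRetentionHours"`
}

func main() {
	spec := KafkaClusterSpec{
		Version:  "3.7.0",
		Replicas: 3,
		Config: KafkaConfig{
			DefaultReplicationFactor: 3,
			MinInsyncReplicas:        2,
			LogRetentionHours:        168,
		},
	}
	fmt.Println(spec.Version, spec.Replicas)
}
```

The markers are comments, so this compiles as plain Go; controller-gen reads them at build time to emit the CRD manifest.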
5. Implementing the Controller with Kubebuilder
// Kubebuilder controller skeleton. Imports shown for context (the module
// path for the generated API package is illustrative).
package controller

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/log"

    kafkav1alpha1 "myorg.io/kafka-operator/api/v1alpha1"
)

//+kubebuilder:rbac:groups=kafka.myorg.io,resources=kafkaclusters,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=apps,resources=statefulsets,verbs=get;list;watch;create;update;patch;delete

func (r *KafkaClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    // 1. Fetch the KafkaCluster instance
    cluster := &kafkav1alpha1.KafkaCluster{}
    if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Handle deletion via finalizer
    if !cluster.DeletionTimestamp.IsZero() {
        return r.handleDeletion(ctx, cluster)
    }

    // 3. Reconcile StatefulSet
    if result, err := r.reconcileStatefulSet(ctx, cluster); err != nil || result.Requeue {
        return result, err
    }

    // 4. Reconcile Services
    if result, err := r.reconcileServices(ctx, cluster); err != nil || result.Requeue {
        return result, err
    }

    // 5. Wait for brokers to be ready
    ready, err := r.checkBrokerReadiness(ctx, cluster)
    if err != nil {
        return ctrl.Result{}, err
    }
    if !ready {
        log.Info("Brokers not ready yet, requeueing")
        return ctrl.Result{RequeueAfter: 15 * time.Second}, nil
    }

    // 6. Update status
    return r.updateStatus(ctx, cluster)
}
6. Real Stateful Application Scenarios
Scenario: Safe Rolling Upgrade
When the user updates spec.version from 3.6 to 3.7, the operator can't just do a rolling restart like a Deployment would. It must: (1) Verify that all partitions have sufficient ISR replicas before touching any broker, (2) Upgrade one broker at a time, (3) Wait for the upgraded broker to re-join and ISR to stabilize before upgrading the next, (4) Roll back all upgraded brokers if any step fails. This multi-step coordinated upgrade is impossible with native StatefulSet rolling update logic.
Scenario: Automatic Backup Before Destructive Operation
When a user decreases spec.replicas from 5 to 3, the operator recognizes this as a destructive scale-down. Before proceeding, it triggers an immediate backup (creating a KafkaBackup CR which the operator also manages), waits for the backup to complete successfully, then proceeds with the scale-down. If the backup fails, it blocks the scale-down and sets a status condition explaining why.
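The gating decision can be modeled as a small state machine over the backup CR's phase. The KafkaBackup phases and the action names here are illustrative; a real controller would create the backup CR via the API and translate these actions into requeues and status conditions.

```go
package main

import (
	"errors"
	"fmt"
)

// BackupPhase mirrors a hypothetical KafkaBackup CR status phase.
type BackupPhase string

const (
	BackupNone     BackupPhase = ""
	BackupRunning  BackupPhase = "Running"
	BackupComplete BackupPhase = "Complete"
	BackupFailed   BackupPhase = "Failed"
)

// gateScaleDown decides whether a destructive scale-down may proceed,
// returning the action the controller should take this reconcile pass.
func gateScaleDown(desired, actual int32, backup BackupPhase) (string, error) {
	if desired >= actual {
		return "proceed", nil // not destructive: no backup gate
	}
	switch backup {
	case BackupNone:
		return "create-backup", nil // create a KafkaBackup CR, then requeue
	case BackupRunning:
		return "wait", nil // requeue until the backup finishes
	case BackupComplete:
		return "proceed", nil
	default:
		return "block", errors.New("backup failed; refusing scale-down")
	}
}

func main() {
	action, _ := gateScaleDown(3, 5, BackupNone)
	fmt.Println(action)
}
```

Because reconciliation is level-triggered, the same function runs on every pass: it first returns "create-backup", then "wait" while the backup runs, and only "proceed" once the backup has completed.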
7. Production Failure Scenarios
Failure: Controller Restart During Multi-Step Operation
If the controller pod restarts mid-upgrade (step 2 of 5), the reconciliation loop restarts from scratch. Idempotent reconciliation means it re-checks the current state and determines which brokers have been upgraded vs. which haven't — and continues from the correct point. Stateful multi-step operations must be encoded in the Custom Resource's Status (current step, checkpoints) so restarts can resume rather than restart from zero.
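A minimal sketch of resuming from such a checkpoint. The UpgradeStatus fields are hypothetical illustrations of what KafkaCluster.Status could carry; the point is that the next step is re-derived from persisted state, not from in-memory progress that dies with the pod.

```go
package main

import "fmt"

// UpgradeStatus is the checkpoint persisted in KafkaCluster.Status
// (field names are illustrative).
type UpgradeStatus struct {
	TargetVersion   string
	UpgradedBrokers []int // brokers already running TargetVersion
}

// nextBrokerToUpgrade resumes a rolling upgrade from the checkpoint:
// after a controller restart it derives the next step from Status
// instead of starting over at broker 0.
func nextBrokerToUpgrade(total int, st UpgradeStatus) (int, bool) {
	done := map[int]bool{}
	for _, b := range st.UpgradedBrokers {
		done[b] = true
	}
	for i := 0; i < total; i++ {
		if !done[i] {
			return i, true
		}
	}
	return -1, false // upgrade complete
}

func main() {
	// Controller restarted mid-upgrade: brokers 0 and 1 already done.
	st := UpgradeStatus{TargetVersion: "3.7.0", UpgradedBrokers: []int{0, 1}}
	next, more := nextBrokerToUpgrade(3, st)
	fmt.Println(next, more)
}
```

In practice the controller would also re-verify each "upgraded" broker's actual version before trusting the checkpoint, since Status can lag reality.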
Failure: Status Desync After etcd Compaction
etcd compaction can interrupt watches: if a controller's watch falls behind the compacted revision, the informer must relist, and there is a window during which Status may not reflect actual state. Implement a periodic status health check: every 5 minutes, regardless of events, verify that Status.ReadyBrokers matches the actual ready pod count, and correct any divergence immediately.
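The drift check itself is simple; the value is in running it on a timer independent of events (controller-runtime managers can also be configured with a periodic cache resync interval). A sketch, with the two counts passed in as plain integers rather than read from the API:

```go
package main

import "fmt"

// checkStatusDrift compares the Status the controller last wrote against
// the actual ready pod count and reports whether a corrective status
// write is needed, returning the value to write.
func checkStatusDrift(statusReadyBrokers, actualReadyPods int32) (drift bool, corrected int32) {
	if statusReadyBrokers != actualReadyPods {
		return true, actualReadyPods // Status lied; overwrite with reality
	}
	return false, statusReadyBrokers
}

func main() {
	// Status claims 3 ready brokers, but only 2 pods are actually ready.
	drift, corrected := checkStatusDrift(3, 2)
	fmt.Println(drift, corrected)
}
```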
8. Testing Operators: Unit, Integration, E2E
- Unit tests: Test the reconcile function with a fake Kubernetes client (controller-runtime's fake client). Verify that the correct K8s objects are created/updated for a given CRD spec without a real cluster.
- Integration tests with envtest: Kubebuilder's envtest runs a real etcd and API server in-process. Test the full reconciliation loop including watches, status updates, and webhook validation.
- E2E tests with kind: Spin up a kind (Kubernetes in Docker) cluster in CI, install the operator, apply test CRs, and verify cluster state with assertions. This catches RBAC issues and real Pod scheduling behavior.
- Chaos testing: Kill the controller pod mid-reconciliation; kill a managed pod; simulate PVC binding failures. Verify the operator correctly detects, reports, and recovers from each scenario.
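The fake-client pattern from the first bullet can be shown with a dependency-free toy: swap the real client for an in-memory store and assert on the objects the reconcile logic writes. The fakeClient here is a stand-in, not controller-runtime's actual fake package, and the name mapping (cluster name + "-broker") is illustrative.

```go
package main

import "fmt"

// fakeClient is a minimal stand-in for controller-runtime's fake client:
// an in-memory object store mapping StatefulSet name -> replica count.
type fakeClient struct {
	objects map[string]int32
}

func (c *fakeClient) createOrUpdate(name string, replicas int32) {
	c.objects[name] = replicas
}

// reconcileStatefulSet is the unit under test: given a cluster spec,
// ensure the backing "StatefulSet" exists with the right replica count.
func reconcileStatefulSet(c *fakeClient, cluster string, replicas int32) {
	c.createOrUpdate(cluster+"-broker", replicas)
}

func main() {
	// Unit-test style assertion against the fake client, no real cluster.
	c := &fakeClient{objects: map[string]int32{}}
	reconcileStatefulSet(c, "payments-kafka", 3)
	fmt.Println(c.objects["payments-kafka-broker"])
}
```

Real tests follow the same shape: seed the fake client with a KafkaCluster, call Reconcile, then Get the expected StatefulSet and assert on its spec.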
9. Trade-offs and When NOT to Write an Operator
- High engineering investment: A production-quality operator requires 3–6 months of engineering. Assess whether using an existing community operator (Strimzi for Kafka, Zalando for PostgreSQL) is sufficient before building custom.
- Operator lifecycle responsibility: You own the operator. When Kubernetes upgrades APIs or your application releases a new version, you must update the operator. Budget ongoing maintenance.
- Don't use for stateless apps: Deployments + Helm cover stateless application lifecycle perfectly. Operators add complexity where none is needed.
- Use when: You're running stateful applications (databases, message brokers) at scale, the operational runbook is longer than 20 pages, you have multiple teams that need to self-service cluster provisioning, or you need automated recovery from application-level failures (not just pod failures).
10. Key Takeaways
- The Operator Pattern encodes operational runbook knowledge (backup, upgrade, failover) as code in a Kubernetes controller.
- Reconciliation loops must be idempotent and level-triggered — they compare desired vs. actual state, not deltas between events.
- CRD Spec = user-desired state (never written by controller); CRD Status = observed state (owned by controller).
- Multi-step operations must checkpoint progress in Status so controller restarts resume correctly.
- Test with a fake client for unit tests and envtest for integration tests; kind-based E2E tests for full lifecycle validation; chaos testing for resilience.
- Before writing a custom operator, evaluate existing community operators (Strimzi, Zalando, etc.) — they encode years of production expertise.
Conclusion
Kubernetes Operators represent the evolution from "infrastructure managed by YAML" to "infrastructure managed by code." For teams running stateful systems at production scale, the investment in a well-designed operator pays for itself within a few months in reduced operational incidents and faster onboarding of new cluster instances.
Start by understanding the reconciliation model deeply — it's the conceptual foundation for everything else. Then choose a scaffolding framework; Kubebuilder is a solid default for Go projects (Operator SDK's Go workflow builds on it). Write your first operator for the smallest, simplest use case to build intuition before tackling complex stateful workloads.