Tags/Focus: agentic AI incident co-pilot, Kubernetes outages, runbook automation, SRE playbooks, AI on-call, remediation bots
Introduction
I first felt the ceiling of human on-call speed at 2:07 AM on a winter night in Stockholm. Our payment cluster was down, kubelet logs scrolled like slot machines, and Slack lit up with impatient VP pings. The incident command channel was noisy; juniors waited for approvals while seniors skimmed ten different runbooks. That night convinced me that an agentic AI co-pilot, grounded in our runbooks, could make Kubernetes outages less about heroics and more about reliable, repeatable response. Think of a model that remembers the scar tissue encoded in your wiki, executes safe actions behind guardrails, and narrates every step—using structured concurrency lessons to keep its own workflows orderly.
Real-world Problem
In real SRE rooms, outages rarely follow the happy path. DNS flaps mask node pressure, CNI quirks masquerade as pod crashes, and a redeploy triggered by habit silently invalidates cache warmups. Teams rely on tribal knowledge: “If CoreDNS restarts twice, cordon nodes before draining,” “Never restart the ingress during Friday peak.” Yet incident channels flood with repeated questions because runbooks live in Confluence, not in the flow of action. High-stress conditions make humans skip steps. We need a runbook-aware co-pilot that surfaces the exact snippet, executes vetted commands, and documents its reasoning, while respecting blast-radius policies and audit constraints.
I’ve seen a marketplace startup lose forty minutes while three engineers debated whether to drain or replace a node group. I’ve also watched a bank’s mobile outage worsen because a senior engineer restarted the API gateway before rotating stale client certs. The common theme: the right answer existed in a buried playbook, but no one pulled it up quickly. The co-pilot should be present where engineers work—Slack, CLI, PagerDuty notes—so the best practice is one slash command away, with contextual awareness of the cluster, time of day, and regulatory obligations.
Deep Dive
A credible co-pilot must absorb heterogeneous signals: Kubernetes events, Prometheus alerts, tracing spans, deployment metadata, feature flags, and even on-call calendar context. It must reason across them with deterministic scaffolding akin to structured concurrency discipline: bounded retries, deadline propagation, and cancellation when a hypothesis is disproven. The agent must also understand identity boundaries—what it is allowed to do in prod versus staging—while mapping symptoms to canonical runbook steps. When kube-apiserver latency spikes, the co-pilot should ask, “Is this an API overload, an etcd quorum issue, or an OIDC cache poison?” and branch into different diagnostic trees without creative hallucination.
Runbooks are not flat documents; they encode decision graphs. A good co-pilot builds a knowledge graph of entities (services, namespaces, SLOs, owners) and relationships (depends-on, consumes-metric, guarded-by-PDB). When it pulls logs from kubectl logs deploy/payments -n core, it should join that evidence with deployment age, container images, and pending PRs. It also needs negative knowledge: “We already tried scaling to zero; don’t repeat.” Statefulness across a single incident timeline is essential; the co-pilot becomes the consistent memory that humans often lose at 3 AM.
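That "negative knowledge" can be made concrete with a small per-incident memory type. This is an illustrative sketch, not part of any framework — `IncidentMemory` and its methods are hypothetical names:

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: per-incident memory that records attempted actions
// so the agent never repeats a remediation that was already tried.
public class IncidentMemory {
    // One attempted action and its outcome, timestamped for the timeline.
    public record Attempt(String action, boolean succeeded, Instant at) {}

    private final List<Attempt> attempts = new ArrayList<>();

    public void record(String action, boolean succeeded) {
        attempts.add(new Attempt(action, succeeded, Instant.now()));
    }

    // "We already tried scaling to zero; don't repeat."
    public boolean alreadyTried(String action) {
        return attempts.stream().anyMatch(a -> a.action().equals(action));
    }

    // The consistent memory humans lose at 3 AM: an ordered evidence trail.
    public List<Attempt> timeline() {
        return List.copyOf(attempts);
    }
}
```

The agent consults `alreadyTried` before proposing a step, and `timeline` becomes the seed of the postmortem narrative.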
Solution Approach
Start by encoding runbooks as machine-readable actions, not prose. Each step becomes a typed procedure: preconditions, commands, validation checks, rollback instructions, and owner approvals. The co-pilot ingests these steps, chains them with explicit control flow, and uses retrieval to pick the right play. Observability is the grounding layer: the agent reads alerts, fetches kubectl outputs, and correlates with recent deploys. Human-in-the-loop is a toggle: in “assist” mode it drafts the command set; in “auto” mode it executes low-blast actions while requesting approval for riskier ones. Every execution emits a timeline, keeping postmortems factual and fast.
Pragmatically, you can start with a YAML schema like:
```yaml
# policy.yaml
name: kube-api-latency
preconditions:
  - metric: apiserver_request_duration_seconds_p99 > 1.5
actions:
  - cmd: kubectl get --raw=/readyz
    validate: output contains ok
  - cmd: kubectl get pods -n kube-system -o wide
    validate: no restarts > 3
rollback:
  - cmd: kubectl delete pod
approvals:
  - group: sre-leads
```
Wrap execution in a policy engine: “Never run destructive commands without an approval token,” “Only mutate namespaces tagged safe-for-auto.” Track each step’s result so future prompts remain grounded in reality rather than model imagination. This scaffolding is the backbone that keeps the AI respectful, auditable, and fast.
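Those two rules can be sketched directly in code. A minimal, illustrative gate — a real deployment would express this in OPA/Rego; `PolicyGate` and its verb sets are assumptions made for the example:

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustrative policy gate encoding the two rules from the text:
// destructive commands need an approval token, and unattended mutations
// only run in namespaces tagged safe-for-auto.
public class PolicyGate {
    private static final Set<String> MUTATING =
        Set.of("delete", "drain", "scale", "apply", "patch", "restart");
    private static final Set<String> DESTRUCTIVE = Set.of("delete", "drain");

    private final Map<String, Set<String>> namespaceTags; // namespace -> tags

    public PolicyGate(Map<String, Set<String>> namespaceTags) {
        this.namespaceTags = namespaceTags;
    }

    public boolean allow(String verb, String namespace, Optional<String> approvalToken) {
        if (!MUTATING.contains(verb)) return true;   // reads are always allowed
        if (approvalToken.isPresent()) return true;  // explicit human approval
        // Unattended mutation: never destructive, and only in tagged namespaces.
        return !DESTRUCTIVE.contains(verb)
            && namespaceTags.getOrDefault(namespace, Set.of()).contains("safe-for-auto");
    }
}
```

Every tool call passes through `allow` before execution; rejected calls become approval requests instead of silent failures.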
Architecture
Picture an architecture diagram: on the left, event sources (Prometheus Alertmanager, PagerDuty webhooks, kube-apiserver audit logs). They flow into an ingestion bus (Kafka or NATS). An orchestrator service (written with the rigor of structured concurrency patterns) spawns bounded workflows per incident, each owning a context that carries deadlines and auth scopes. The agent layer combines a vector store of runbook chunks, a policy engine (OPA/Rego) for guardrails, and tool adapters for kubectl, helm, aws, and ticketing APIs. A Renderer posts markdown to Slack, creates Jira updates, and stores evidence in S3. Feature flags choose auto-vs-manual execution, and every action passes through a safety proxy that simulates before live runs when supported.
The control plane lives in your secure services cluster; the data plane sits close to the Kubernetes clusters to minimize round trips for kubectl and crictl calls. All commands are executed by short-lived runners with ephemeral credentials obtained via workload identity. Observability spans are emitted for each tool call, letting you trace “Alert → Hypothesis → Command → Validation” inside Grafana Tempo or Jaeger. If you already run GitOps, the co-pilot can open PRs for risky changes instead of mutating live state, letting Argo CD or Flux own convergence.
Failure Scenarios
- Etcd quorum loss masked by alert storms: the co-pilot throttles inbound alerts, prioritizes control-plane health, and proposes `etcdctl endpoint status` against surviving members.
- Node pressure cascading into pod evictions: it compares `kubectl top node` trends with VPA/HPA churn, suggests `kubectl cordon` for targeted nodes, and validates that eviction budgets are respected.
- CNI regression after rolling upgrade: it detects a spike in `NetworkUnavailable` conditions, captures `kubectl get events --field-selector reason=FailedCreatePodSandBox`, and recommends rollback via `helm rollback cni <previous-revision>` with smoke tests.
- Ingress cert expiry at midnight: it finds expiring certs from `kubectl get certificaterequests`, renews via the ACME controller hook, and rotates ingress pods with a staggered `kubectl rollout restart deployment/ingress -n ingress-system`.
Notice the pattern: detect → hypothesize → verify → act → validate. Each failure path has explicit guardrails: snapshot state before mutation, compare desired vs actual, and capture evidence so humans can intervene midstream. The co-pilot’s biggest value is not just speed, but the creation of a crisp narrative that regulators, customers, and future engineers can trust.
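The detect → hypothesize → verify → act → validate loop can be sketched as a small driver. `RemediationLoop` and `Hypothesis` are illustrative names, not a real library; each hypothesis bundles a verification probe, a remediation, and a validation check:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative loop driver: act only on the first hypothesis whose probe
// confirms it, then validate, recording every phase for the narrative.
public class RemediationLoop {
    public record Hypothesis(String name, Supplier<Boolean> verify,
                             Runnable act, Supplier<Boolean> validate) {}

    public static List<String> run(List<Hypothesis> hypotheses) {
        List<String> narrative = new ArrayList<>();
        for (Hypothesis h : hypotheses) {
            narrative.add("hypothesize: " + h.name());
            if (!h.verify().get()) {              // disproven -> cancel branch
                narrative.add("disproven: " + h.name());
                continue;
            }
            narrative.add("verified: " + h.name());
            h.act().run();                        // mutation (behind guardrails)
            narrative.add(h.validate().get()
                ? "validated: " + h.name()
                : "validation failed: " + h.name());
            break; // act on one confirmed hypothesis at a time
        }
        return narrative;
    }
}
```

The returned narrative is exactly the evidence trail the paragraph above calls for: every branch taken or abandoned is on the record.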
Trade-offs
Full automation accelerates mean-time-to-mitigate but increases blast radius if guardrails are weak. Human confirmation reduces risk but adds latency. Strictly codified runbooks lower hallucination risk but may trail new architectures. Model-powered retrieval improves relevance yet risks overfitting to stale data. Investing in simulation (dry-run, canary namespaces) slows first response but prevents double incidents. Per-tenant isolation in multi-cluster fleets keeps noise contained but complicates shared runbooks. Choose the balance intentionally for your risk appetite.
Optimization Techniques
- Pre-warm diagnostics: cache `kubectl get --raw` discovery data.
- Parallel-safe checks: use structured concurrency to cap parallel drains and enforce deadlines.
- Adaptive backoff: Increase intervals when control plane latency spikes.
- Result caching: Reuse pod lists across steps to avoid API thrash.
- Read replicas: Query metrics from replicas to avoid overloading primaries.
```shell
# Example safe drain with deadlines
kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data \
  --grace-period=30 --timeout=90s

# Observability check before restart
kubectl -n kube-system logs deploy/cilium-operator --tail=50
```
When concurrency is needed, follow patterns from structured concurrency to prevent runaway goroutines or threads.
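A minimal sketch of that discipline in plain Java: `ExecutorService.invokeAll` with a timeout caps parallelism, propagates a shared deadline, and cancels stragglers on return (Java's `StructuredTaskScope` offers stronger scoping where available; `BoundedDiagnostics` is an illustrative name):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Bounded parallel diagnostics with a shared deadline: at most poolSize
// checks run at once, and any check still running at the deadline is
// cancelled rather than left dangling.
public class BoundedDiagnostics {
    public static List<String> runChecks(List<Callable<String>> checks,
                                         int poolSize, long timeoutMs)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // invokeAll blocks until every check finishes or the deadline
            // hits; unfinished tasks are cancelled before it returns.
            List<Future<String>> futures =
                pool.invokeAll(checks, timeoutMs, TimeUnit.MILLISECONDS);
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());
                } catch (CancellationException | ExecutionException e) {
                    results.add("check failed or timed out");
                }
            }
            return results;
        } finally {
            pool.shutdownNow(); // no runaway threads survive the scope
        }
    }
}
```

The same shape works for drains: the pool size is the drain-concurrency cap, and the timeout is the incident step's deadline.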
Mistakes to Avoid
- Running cluster-wide `kubectl delete pod --all -A` during control-plane instability.
- Skipping `--dry-run=client` and applying manifests blindly.
- Forgetting storage attachments when cordoning/draining stateful nodes.
- Ignoring DNS/mesh dependencies before blaming the API server.
- Not capturing timelines—postmortems will lack evidence.
Key Takeaways
- Runbook-aware agentic co-pilots need guardrails, grounding, and transparent narration.
- Structured concurrency principles keep remediation bounded and safe.
- Optimization is about reducing control-plane load while keeping humans in the loop.
- Invest in drift detection between runbooks and live clusters.
Conclusion
That snowy 2:07 AM taught me that great teams deserve better tools. By weaving structured runbooks, principled orchestration, and tight safety loops, we can respond to Kubernetes outages with calm, not chaos. Agentic AI co-pilots reduce cognitive load, shrink MTTR, and create richer postmortems. Start small—one noisy alert, one curated runbook—then grow with confidence.
This article stands alone; for extended orchestration patterns, code snippets, and governance templates that reinforce these ideas, see my structured concurrency deep dive.
Architecture Deep Dive: LangChain4j + Kubernetes Event Listener
A production incident co-pilot watches Kubernetes events using the Java Kubernetes client and feeds them into a LangChain4j agent. The agent cross-references the event against a vector-indexed runbook store and proposes remediation actions, all without touching production resources until a human approves. Here's the core architecture:
```java
// Kubernetes event watcher — feeds the AI co-pilot
@Component
public class K8sEventWatcher implements Watcher<V1Event> {

    private final IncidentCoPilot coPilot;
    private final SlackNotifier slackNotifier;

    public K8sEventWatcher(IncidentCoPilot coPilot, SlackNotifier slackNotifier) {
        this.coPilot = coPilot;
        this.slackNotifier = slackNotifier;
    }

    @Override
    public void eventReceived(Action action, V1Event event) {
        if ("Warning".equals(event.getType()) && isCritical(event)) {
            IncidentContext ctx = IncidentContext.builder()
                .eventReason(event.getReason())
                .involvedObject(event.getInvolvedObject().getName())
                .namespace(event.getInvolvedObject().getNamespace())
                .message(event.getMessage())
                .timestamp(Instant.now())
                .build();
            // Non-blocking — runbook lookup + AI analysis run async
            coPilot.analyzeAsync(ctx)
                .thenAccept(recommendation -> slackNotifier.post(recommendation));
        }
    }

    @Override
    public void onClose(WatcherException cause) {
        // Watch connections drop routinely; reconnection is handled elsewhere
    }
}
```
```java
@AiService
public interface IncidentCoPilot {

    @SystemMessage("""
        You are a Kubernetes incident co-pilot. You have access to runbooks
        via the searchRunbook tool. Given an incident context, you must:
        1. Identify the root cause category
        2. Retrieve the matching runbook section
        3. Propose SAFE, READ-ONLY diagnostic commands first
        4. Only suggest write operations after explicit human approval
        Format: JSON with keys: rootCause, diagnosticSteps, remediationPlan, severity
        """)
    CompletableFuture<IncidentRecommendation> analyzeAsync(
        @UserMessage IncidentContext context
    );
}
```
The co-pilot uses three tools: searchRunbook() for vector-similarity retrieval of documented procedures, getPodsStatus() for read-only Kubernetes API inspection, and queryMetrics() for Prometheus queries. None of the tools mutate cluster state — mutations require the human-approval gate described below.
```java
@Service
public class RunbookTools {

    @Tool("Search the runbook knowledge base for procedures matching the query")
    public List<RunbookSection> searchRunbook(String query) {
        // pgvector similarity search over embedded runbook sections
        return runbookVectorStore.similaritySearch(
            SearchRequest.query(query).withTopK(3).withSimilarityThreshold(0.75)
        );
    }

    @Tool("Get current status of pods in the given namespace")
    public List<PodStatus> getPodsStatus(String namespace) {
        // Read-only Kubernetes API call — safe for any agent to invoke
        return k8sClient.listNamespacedPod(namespace)
            .getItems()
            .stream()
            .map(PodStatus::from)
            .toList();
    }

    @Tool("Query Prometheus metrics with PromQL")
    public MetricResult queryMetrics(String promqlExpression) {
        // Read-only Prometheus HTTP API call
        return prometheusClient.query(promqlExpression);
    }
}
```
Safety Guardrails and Human-in-the-Loop Escalation
The most dangerous failure mode of an agentic co-pilot is taking destructive action without authorization — deleting pods, scaling down deployments, or applying misconfigured manifests. A three-tier approval gate prevents this while keeping the automated-diagnostic loop fast:
| Tier | Action Type | Approval Required | Example |
|---|---|---|---|
| Green | Read-only diagnostics | None — auto-execute | kubectl get pods, Prometheus query |
| Yellow | Low-risk mutations | On-call engineer Slack approval | Pod restart, ConfigMap update |
| Red | High-risk mutations | Two-person approval + audit log | Rollback deployment, scale to zero |
```java
// Human-in-the-loop gate via Slack interactive message
@Service
public class ApprovalGate {

    public CompletableFuture<Boolean> requestApproval(
            RemediationAction action, String oncallUser) {
        String callbackId = UUID.randomUUID().toString();
        approvalStore.save(callbackId, action);

        // Post interactive Slack message with Approve/Reject buttons
        slackClient.postInteractiveMessage(oncallUser, SlackMessage.builder()
            .text("🚨 Incident co-pilot requests approval: " + action.description())
            .callbackId(callbackId)
            .actions(List.of(
                SlackAction.button("Approve", "approve", ButtonStyle.DANGER),
                SlackAction.button("Reject", "reject", ButtonStyle.DEFAULT)
            ))
            .build());

        // CompletableFuture completes when the Slack callback arrives
        return approvalStore.awaitDecision(callbackId, Duration.ofMinutes(5));
    }
}
```
Runbook-to-Code: Structured Remediation Actions
Effective incident co-pilots don't execute raw shell commands — they implement a typed RemediationAction hierarchy where each action type encapsulates validation, dry-run, and rollback logic. This design prevents prompt-injection attacks from producing unexpected shell escapes and gives your audit log structured, queryable records:
```java
// Sealed hierarchy — only known safe actions exist
public sealed interface RemediationAction
        permits RestartPod, RollbackDeployment, ScaleDeployment, PatchConfigMap {

    String description();
    ActionTier tier(); // GREEN / YELLOW / RED
    void dryRun(KubernetesClient client);
    void execute(KubernetesClient client);
    void rollback(KubernetesClient client);
}

public record RestartPod(String namespace, String podName)
        implements RemediationAction {

    public ActionTier tier() { return ActionTier.YELLOW; }

    public void execute(KubernetesClient client) {
        client.pods().inNamespace(namespace).withName(podName).delete();
        // Kubernetes reschedules the pod automatically via its ReplicaSet
        auditLog.record(ActionAudit.of("RESTART_POD", namespace, podName));
    }
}
```
The co-pilot LLM never executes code directly. It produces a structured JSON payload that your Java application parses into a RemediationAction. Invalid or unknown action types are rejected before reaching the approval gate — prompt injection attempts that try to construct an arbitrary shell command will fail at JSON deserialization, not at execution time.
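A minimal sketch of that reject-at-parse-time idea, assuming the LLM's JSON has already been decoded into a field map. `ActionParser` and its nested records are simplified stand-ins for the sealed hierarchy above, not a real deserializer:

```java
import java.util.Map;
import java.util.Optional;

// Illustrative strict parser: only whitelisted action types construct an
// action, so an injected "run arbitrary shell" payload is rejected before
// any approval or execution logic ever sees it.
public class ActionParser {
    public sealed interface RemediationAction permits RestartPod, ScaleDeployment {}
    public record RestartPod(String namespace, String pod)
            implements RemediationAction {}
    public record ScaleDeployment(String namespace, String deploy, int replicas)
            implements RemediationAction {}

    public static Optional<RemediationAction> parse(Map<String, String> fields) {
        return switch (fields.getOrDefault("type", "")) {
            case "RestartPod" -> Optional.of(
                new RestartPod(fields.get("namespace"), fields.get("pod")));
            case "ScaleDeployment" -> Optional.of(
                new ScaleDeployment(fields.get("namespace"), fields.get("deploy"),
                                    Integer.parseInt(fields.get("replicas"))));
            default -> Optional.empty(); // unknown or injected types never execute
        };
    }
}
```

In a real system the same effect comes from configuring your JSON library to fail on unknown polymorphic types rather than hand-rolling the switch.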
Automated Postmortem Enrichment
One of the highest-value applications of an incident co-pilot is automatic postmortem generation. After a resolved incident, the co-pilot aggregates the event timeline, diagnostic steps taken, Prometheus metric snapshots, and the approved remediation actions into a structured postmortem document. This eliminates the manual work that usually prevents teams from writing postmortems consistently:
```java
@AiService
public interface PostmortemWriter {

    @SystemMessage("""
        Write a blameless postmortem from the incident timeline.
        Use this exact format:
        ## Summary (2 sentences)
        ## Timeline (chronological bullet points with timestamps)
        ## Root Cause (single paragraph, technical)
        ## Impact (affected users, services, duration)
        ## Resolution (what fixed it)
        ## Action Items (numbered list with owner initials)
        """)
    String writePostmortem(@UserMessage IncidentSummary summary);
}
```
```java
@Service
public class IncidentCloser {

    public Postmortem closeIncident(IncidentId incidentId) {
        // Gather all data from the incident window
        IncidentSummary summary = IncidentSummary.builder()
            .events(auditLog.eventsFor(incidentId))
            .metricsSnapshot(prometheus.queryRange(incidentId.window()))
            .runbookSectionsUsed(runbookStore.getUsedSections(incidentId))
            .remediationActions(approvalLog.actionsFor(incidentId))
            .resolvedAt(Instant.now())
            .build();

        String markdown = postmortemWriter.writePostmortem(summary);

        // Post to Confluence and open GitHub issue for action items
        confluence.createPage("Incident Postmortems", markdown);
        return postmortemRepository.save(new Postmortem(incidentId, markdown));
    }
}
```
Teams using automated postmortem generation report 60–80% reduction in postmortem writing time and a 3× increase in postmortem completion rate. The co-pilot captures details that human responders often forget to record under stress — exact timestamps, intermediate diagnostic steps, and metric values at the time of the incident. Over time, the postmortem database becomes a knowledge base that improves runbook coverage and reduces MTTR for recurring incident patterns.
Runbook Drift Detection: Keeping AI Knowledge Current
A subtler but equally dangerous failure mode of a runbook-aware co-pilot is stale knowledge — runbooks that describe procedures for a Kubernetes cluster that has since been upgraded, services that have been renamed, or networking changes that invalidated the documented steps. Runbook drift — the gap between documented procedure and current reality — silently degrades co-pilot effectiveness until an incident reveals it.
Implement automated drift detection by cross-referencing runbook content against live cluster state weekly. For each runbook that references specific service names, namespace names, or config map keys, verify that those resources exist in the current cluster. Flag any runbook section where referenced resources can't be found in the live Kubernetes API:
```java
@Scheduled(cron = "0 0 3 * * MON") // Every Monday at 3 AM
public void detectRunbookDrift() {
    // Fetch live services once, not per runbook section, to avoid API thrash
    Set<String> liveServices = k8sClient.services()
        .inAnyNamespace()
        .list()
        .getItems()
        .stream()
        .map(s -> s.getMetadata().getName())
        .collect(Collectors.toSet());

    runbookStore.getAllSections().forEach(section -> {
        List<String> referencedServices = serviceExtractor.extract(section.content());
        List<String> missing = referencedServices.stream()
            .filter(svc -> !liveServices.contains(svc))
            .toList();

        if (!missing.isEmpty()) {
            driftAlertService.alert(RunbookDriftAlert.builder()
                .runbookId(section.id())
                .missingResources(missing)
                .lastValidated(section.lastValidated())
                .build());
        }
    });
}
```
Combine drift detection with a runbook confidence score displayed to the on-call engineer during incidents: "This runbook was validated against live cluster 3 days ago (score: 94%). 2 referenced services could not be verified." This transparency lets engineers trust the co-pilot's suggestions appropriately, taking higher-risk actions only when the runbook is fresh and verified against the current cluster state.
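One possible scoring formula, purely illustrative — the weights and the 30-day freshness window are assumptions for this sketch, not a standard:

```java
// Illustrative runbook confidence score combining resource verification
// with staleness of the last validation, on a 0–100 scale.
public class RunbookConfidence {
    // referenced: resources named in the runbook; verified: how many were
    // found in the live cluster; daysSinceValidated: age of the last check.
    public static int score(int referenced, int verified, long daysSinceValidated) {
        if (referenced == 0) return 100; // nothing to verify
        double resourceScore = (double) verified / referenced;
        // Linear decay: a validation older than 30 days adds no freshness.
        double freshness = Math.max(0.0, 1.0 - daysSinceValidated / 30.0);
        return (int) Math.round(100 * resourceScore * (0.5 + 0.5 * freshness));
    }
}
```

The exact curve matters less than the contract: the score must drop visibly when referenced resources go missing or validation goes stale, so the on-call engineer can calibrate trust at a glance.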