Md Sanwar Hossain - Software Engineer

DevOps · March 21, 2026 · 15 min read · DevOps Reliability Engineering Series

Kubernetes KEDA and HPA: Event-Driven Autoscaling for Production Workloads That HPA Can't Handle

Standard Kubernetes Horizontal Pod Autoscaling works brilliantly for stateless HTTP services — until the moment your workload is driven by an event queue rather than CPU utilization. At that inflection point, HPA's reliance on reactive resource metrics introduces a dangerous lag between when work arrives and when your pods are ready to process it. KEDA (Kubernetes-based Event Driven Autoscaler) was built specifically to close that gap, scaling on the actual workload signal — queue depth, consumer lag, Prometheus query results — rather than the downstream resource symptoms those signals eventually produce.

Table of Contents

  1. The Real Problem: Why HPA Fails for Event-Driven Workloads
  2. What is KEDA?
  3. Architecture: How KEDA Works
  4. Implementing KEDA for Kafka Consumer Scaling
  5. Scale to Zero in Practice
  6. KEDA with Prometheus Metrics (Custom Business Metrics)
  7. Failure Scenarios and Debugging
  8. KEDA vs HPA: When to Use Which
  9. Performance Optimization
  10. Key Takeaways
  11. Conclusion

1. The Real Problem: Why HPA Fails for Event-Driven Workloads

A media streaming platform was running a video transcoding fleet on Kubernetes. When a user uploaded a new video, an event landed on a Kafka topic and a pool of transcoding pods consumed those events to convert the file to multiple resolutions. The ops team had configured an HPA targeting 70% CPU utilization — which seemed perfectly reasonable.

Then a creator with 2 million followers uploaded a video that went immediately viral. Within seconds, hundreds of concurrent upload events flooded the Kafka topic. The transcoding pods that were already running were immediately saturated. But the HPA did not trigger. Why? Because the existing pods had not yet started processing the backlog — they had not yet driven their CPU up to the 70% threshold. The CPU metric was still low. HPA saw nothing to act on.

Production incident: A 4-minute lag elapsed between the viral video upload event and transcoding pods scaling up to handle the load. During those 4 minutes, 847 concurrent users were stuck on a "processing" spinner with no video available. The root cause was not insufficient capacity — it was that HPA was watching CPU while the real backlog signal was in Kafka consumer lag.

The fundamental disconnect is this: HPA thinks about resource consumption; event-driven workloads need to think about workload intent. CPU and memory are trailing indicators. A queue filling with unprocessed messages is the leading indicator. For batch processors, Kafka consumers, SQS workers, and cron-triggered jobs, scaling on resource metrics means you will always scale after users are already waiting — not before they notice a problem. The fix was KEDA scaling on Kafka consumer lag, which scaled pods up within seconds of the lag crossing a threshold, before CPU had any chance to respond.

2. What is KEDA?

KEDA (Kubernetes-based Event Driven Autoscaler) is a CNCF graduated project that adds event-driven autoscaling capabilities to any Kubernetes cluster without replacing the existing HPA machinery — it extends it. KEDA was originally created by Microsoft and Red Hat and donated to the CNCF in 2020. Its graduation status reflects production maturity across thousands of enterprise deployments.

KEDA's architecture consists of three primary components:

  1. The KEDA operator (keda-operator), which manages ScaledObject and ScaledJob lifecycle and handles the scale-from-zero and scale-to-zero transitions.
  2. The metrics server (keda-operator-metrics-apiserver), which exposes event-source metrics to the HPA controller through the Kubernetes External Metrics API.
  3. Scalers, the per-event-source connectors (Kafka, RabbitMQ, SQS, Prometheus, and dozens more) that query the source directly for queue depth, consumer lag, or metric values.

The critical difference from native HPA is that KEDA reads directly from the event source. When you configure a Kafka trigger, KEDA connects to the Kafka broker, queries the consumer group lag for a specific topic, and immediately surfaces that lag as a scaling metric — before any pod has consumed a single message. Native HPA cannot do this because it is limited to CPU, memory, and custom metrics that must already be emitted by running pods.

Scale to Zero — the killer feature: KEDA supports minReplicaCount: 0, allowing a deployment to scale all the way down to zero pods when there is no work. Native HPA enforces a minimum of 1 replica at all times. This single capability can reduce idle infrastructure costs by 60–80% for batch workloads, nightly ETL jobs, and development environments.

3. Architecture: How KEDA Works

Understanding KEDA's internal data flow is essential for tuning it correctly in production. The scaling loop works as follows:

Kafka Topic (consumer lag grows)
  └── KEDA Scaler polls broker every pollingInterval seconds
        └── Reads consumer group lag for target topic
              └── Exposes metric via External Metrics API
                    └── Kubernetes HPA controller reads metric
                          └── HPA calculates desired replicas:
                                desiredReplicas = ceil(lag / lagThreshold)
                          └── Kubernetes scales the Deployment
                                └── New transcoder pods start consuming
                                      └── Consumer lag drops → scale-down after cooldownPeriod

KEDA does not bypass or replace the HPA controller. Instead, it acts as a metric provider that feeds the standard HPA loop. When you create a ScaledObject, KEDA automatically creates a corresponding HorizontalPodAutoscaler resource managed by the KEDA operator. The HPA controller then drives actual pod scaling using Kubernetes's native replica management. This design means KEDA inherits all of HPA's stability guarantees — including the scale-down stabilization window — while adding event-source awareness that HPA alone cannot provide.
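To make the relationship concrete, here is a sketch of roughly what the KEDA-generated HPA looks like for a Kafka-driven ScaledObject. The names and the internal metric identifier are illustrative; KEDA prefixes the HPA name with keda-hpa- and owns the resource, so you inspect it rather than create it:

```yaml
# Approximate shape of the HPA that KEDA generates and manages.
# Do not create this yourself; KEDA derives it from the ScaledObject.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keda-hpa-video-transcoder-scaler   # keda-hpa-<scaledobject-name>
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: video-transcoder
  minReplicas: 1          # the HPA floor; KEDA itself handles the 0 <-> 1 hop
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: s0-kafka-video-upload-events  # KEDA-internal metric identifier
      target:
        type: AverageValue
        averageValue: "5"                   # mirrors lagThreshold
```

Inspecting this generated HPA with kubectl is often the fastest way to confirm which metric and target KEDA is actually feeding the scaling loop.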

KEDA runs in the keda namespace as two core Deployments: keda-operator, which manages ScaledObject lifecycle and drives scaling decisions, and keda-operator-metrics-apiserver, which serves the External Metrics API endpoint. Both are stateless and can run with multiple replicas for high availability (the operator uses leader election).

4. Implementing KEDA for Kafka Consumer Scaling

The following ScaledObject is the exact configuration that resolved the viral video transcoding incident described in the introduction. It scales the video-transcoder deployment based on Kafka consumer lag, with a 1-pod-per-5-messages ratio:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-transcoder-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: video-transcoder
  pollingInterval: 10        # Check every 10 seconds
  cooldownPeriod: 60         # Wait 60s before scaling down
  minReplicaCount: 0         # Scale to zero when idle!
  maxReplicaCount: 50        # Cap at 50 pods
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-broker:9092
      consumerGroup: video-transcoder-group
      topic: video-upload-events
      lagThreshold: "5"      # 1 pod per 5 messages in lag
      offsetResetPolicy: latest

Let's break down the key fields. pollingInterval: 10 tells KEDA to query the Kafka broker every 10 seconds for updated consumer lag. This is what enables near-real-time reaction to queue depth changes, in stark contrast to CPU-based HPA, whose metrics pipeline typically scrapes every 15–30 seconds and then still requires CPU to actually climb before triggering. lagThreshold: "5" means KEDA will target one pod for every 5 unprocessed messages: with 50 messages in lag, KEDA drives the deployment to 10 pods, and at 250 messages or more it reaches the maxReplicaCount cap of 50.

cooldownPeriod: 60 is the number of seconds KEDA waits after the last active event before beginning to scale down. This prevents flapping when lag briefly drops to zero between bursts. For video transcoding, 60 seconds is appropriate — enough to absorb inter-upload gaps without keeping idle pods running for minutes.

Result after deployment: Scale-up latency dropped from 4 minutes (CPU-based HPA) to under 30 seconds (KEDA Kafka lag). The 847-user spinner problem was eliminated. During the next viral upload event, pods were provisioned and consuming messages before any user had waited more than 45 seconds for their video.

5. Scale to Zero in Practice

Scale to zero is KEDA's most financially impactful feature for workloads with predictable idle periods. A nightly ETL job, a report generation service, a development-environment worker — these run for hours and then sit idle consuming node resources for the remainder of the day. With minReplicaCount: 0, the deployment drops to zero pods when the event source is empty, and KEDA restores replicas the moment a new trigger event arrives.

The cold start problem is the main trade-off: scaling from zero means the first event after an idle period must wait for a new pod to be scheduled, pulled, and initialized before processing begins. Common mitigations include:

  1. Keep minReplicaCount: 1 for latency-sensitive consumers, trading a small standing cost for zero cold-start delay.
  2. Keep container images small and pre-pulled on nodes so startup time is dominated by application initialization, not image download.
  3. Minimize application startup work (lazy initialization, trimmed dependency graphs) so a fresh pod begins consuming quickly.
  4. Reserve scale to zero for workloads whose SLA tolerates tens of seconds of first-event latency, such as batch and ETL jobs.

Here is a ScaledJob configuration for a batch image processing pipeline that runs to completion and terminates, with zero idle cost:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor-job
  namespace: production
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: image-processor
          image: myrepo/image-processor:latest
        restartPolicy: Never
  pollingInterval: 15
  maxReplicaCount: 20
  triggers:
  - type: rabbitmq
    metadata:
      host: amqp://rabbitmq:5672/
      queueName: image-processing-queue
      queueLength: "1"       # 1 Job per message in queue

6. KEDA with Prometheus Metrics (Custom Business Metrics)

Scaling on Prometheus metrics sounds straightforward until you try to wire the Prometheus Adapter to Kubernetes's Custom Metrics API — at which point you discover that getting the adapter's RBAC, APIService registration, and PromQL configuration aligned correctly takes days, not hours. KEDA's prometheus trigger bypasses this entirely: you give KEDA a Prometheus server address and a PromQL query, and KEDA handles the rest.

This opens a powerful capability: scaling on business-level metrics, not just infrastructure metrics. Consider an order processing service where the right scaling signal is not pod CPU but the number of orders sitting in a queued state in the database:

triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus:9090
    metricName: pending_orders_total
    threshold: "100"
    query: sum(pending_orders{status="queued"})

With this configuration, KEDA evaluates the PromQL query sum(pending_orders{status="queued"}) every pollingInterval seconds. When the result exceeds 100, KEDA scales up: if there are 300 pending orders, it drives the deployment to 3 replicas; 500 pending orders yields 5 replicas. The metric is derived directly from application instrumentation — a Micrometer gauge emitting pending_orders with a status label — not from inferred infrastructure consumption.
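Embedded in a complete ScaledObject, the trigger might look like the following. This is a sketch; the order-processor deployment name and the replica bounds are illustrative, not taken from a specific incident:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor        # hypothetical deployment name
  pollingInterval: 15
  cooldownPeriod: 120
  minReplicaCount: 1             # keep one warm pod; orders are latency-sensitive
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pending_orders_total
      threshold: "100"
      query: sum(pending_orders{status="queued"})
```

Note minReplicaCount: 1 here rather than 0: an order pipeline usually cannot tolerate a cold start on the first order after a quiet period.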

This approach aligns infrastructure cost directly with business throughput. The order processor fleet is sized exactly proportional to business backlog, not to the CPU that backlog eventually causes. Teams using this pattern typically see 30–40% cost reduction compared to static replica counts or CPU-based HPA, because the fleet is never over-provisioned in anticipation of load that never arrives.

7. Failure Scenarios and Debugging

KEDA introduces a new failure surface between your event source and your scaling loop. Here are the most common production failure modes and how to investigate them.

ScaledObject stuck at wrong replica count. Run kubectl describe scaledobject video-transcoder-scaler -n production and examine the Conditions block. A healthy ScaledObject shows ScalerReady=True and Active=True. If ScalerReady=False, the trigger cannot connect to its event source — usually a network policy block, wrong bootstrap server address, or authentication failure. The kubectl get events -n production output will show the specific error message from the scaler.

KEDA operator crash or unavailability. KEDA is designed to fail safe: if the KEDA operator pod crashes, the last-known replica count is preserved by the underlying HPA. The deployment does not immediately scale to zero or to max. The HPA will continue to function based on the most recently computed metric until KEDA recovers. This means a KEDA outage during low-traffic periods is generally graceful, though a KEDA outage during a traffic surge means no scale-up until the operator recovers.

Kafka SASL/TLS authentication. Production Kafka clusters require authentication. KEDA handles this via the TriggerAuthentication CRD, which references Kubernetes Secrets containing credentials. Never embed credentials directly in the ScaledObject metadata.

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-trigger-auth
  namespace: production
spec:
  secretTargetRef:
  - parameter: username
    name: kafka-credentials
    key: username
  - parameter: password
    name: kafka-credentials
    key: password
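Referenced from the ScaledObject, the trigger might be wired as follows. This is a sketch: sasl and tls are standard Kafka scaler metadata keys, but the correct mechanism value depends on your broker configuration:

```yaml
triggers:
- type: kafka
  metadata:
    bootstrapServers: kafka-broker:9092
    consumerGroup: video-transcoder-group
    topic: video-upload-events
    lagThreshold: "5"
    sasl: plaintext            # or scram_sha256 / scram_sha512, per broker config
    tls: enable
  authenticationRef:
    name: kafka-trigger-auth   # the TriggerAuthentication defined above
```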

Reference the TriggerAuthentication from your ScaledObject trigger with authenticationRef: { name: kafka-trigger-auth }. Common failures and their resolutions:

Failure                  | Symptom                              | Fix
-------------------------|--------------------------------------|----------------------------------------------
Kafka auth error         | Scaler shows Error, pods stuck at 0  | Add TriggerAuthentication with SASL/TLS creds
lagThreshold too high    | Slow scale-up despite large lag      | Lower threshold value, reduce pollingInterval
cooldownPeriod too short | Pods flapping up and down repeatedly | Increase cooldownPeriod to 120s or more

8. KEDA vs HPA: When to Use Which

KEDA is not a wholesale replacement for HPA — both tools have clear, non-overlapping use cases. Use HPA when your workload is CPU or memory bound and the resource metric accurately represents demand. Use KEDA when your workload reacts to external event sources, queues, or business-level metrics that precede resource consumption.

Scenario                                     | Use HPA | Use KEDA
---------------------------------------------|---------|---------
CPU/memory bound workloads                   |    ✓    |
Event queue consumers (Kafka, SQS, RabbitMQ) |         |    ✓
Scale to zero needed                         |         |    ✓
Prometheus-driven business metrics           |         |    ✓
Simple stateless web API scaling             |    ✓    |
Cron-based batch jobs                        |         |    ✓
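For the cron-based batch jobs row, KEDA ships a cron scaler that scales on a schedule rather than a queue. A minimal sketch, with an illustrative schedule and replica count:

```yaml
triggers:
- type: cron
  metadata:
    timezone: Asia/Dhaka       # IANA timezone name (illustrative)
    start: 0 2 * * *           # scale up at 02:00
    end: 0 4 * * *             # scale back down at 04:00
    desiredReplicas: "5"
```

This is useful when load is predictable by clock time, and it can be combined with a queue trigger so the schedule pre-warms capacity while lag handles overflow.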

One nuance: rather than running a second, independent HPA against the same deployment (which KEDA discourages, since two autoscalers will fight over the replica count), you can combine event triggers with cpu and memory triggers in a single ScaledObject, so one KEDA-managed HPA weighs both queue depth and resource utilization. KEDA also exposes the advanced.horizontalPodAutoscalerConfig field in the ScaledObject, which lets you pass custom HPA behavior settings alongside the triggers.

9. Performance Optimization

pollingInterval tuning. Lower values (5–10 seconds) reduce scale-up latency but increase API load against the event source broker. For latency-sensitive consumer groups, 10 seconds is a practical floor. For batch jobs with relaxed SLAs, 30–60 seconds reduces broker query load significantly. Never set pollingInterval below 5 seconds without confirming your broker can handle the increased polling frequency across all KEDA-managed consumer groups.

Combining triggers for dual-axis scaling. KEDA governs the outer signal, how many pods exist based on queue depth, while a cpu or memory trigger in the same ScaledObject adds resource-based scaling through the same KEDA-managed HPA. Fine-grained control over scale-up and scale-down comes from advanced.horizontalPodAutoscalerConfig.behavior in the ScaledObject spec; these behavior settings apply to the generated HPA, including stabilization windows and scaling rate policies.
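Assuming the video-transcoder ScaledObject from section 4, a dual-axis configuration might be sketched like this. The behavior values and the cpu trigger threshold are illustrative starting points, not tuned recommendations:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-transcoder-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: video-transcoder
  minReplicaCount: 0
  maxReplicaCount: 50
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0    # react immediately to lag spikes
          policies:
          - type: Percent
            value: 100                     # at most double the fleet per period
            periodSeconds: 15
        scaleDown:
          stabilizationWindowSeconds: 300  # five minutes of calm before shrinking
          policies:
          - type: Pods
            value: 5                       # shed at most 5 pods per minute
            periodSeconds: 60
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-broker:9092
      consumerGroup: video-transcoder-group
      topic: video-upload-events
      lagThreshold: "5"
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"                          # also scale if CPU crosses 70%
```

The asymmetric behavior (instant scale-up, slow scale-down) matches the economics of queue consumers: reacting late to a burst costs user-visible latency, while shrinking late only costs a few pod-minutes.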

Cluster Autoscaler integration. When KEDA scales a deployment beyond available node capacity, pods enter Pending state. The Kubernetes Cluster Autoscaler detects pending pods and provisions additional nodes. This creates a cascading scale-out: KEDA scales pods → CA scales nodes → pods become schedulable → consumers process lag → KEDA scales down → CA removes idle nodes. Tune your CA scale-down-delay-after-add to match KEDA's cooldownPeriod to avoid premature node removal that would immediately trigger another CA scale-out.

"The best scaling policy is one that can see the future — not the past. KEDA gives you the closest thing to that: scaling on intent, not on consequence."
— KEDA project maintainers, KubeCon 2024

Key Takeaways

  1. HPA scales on trailing resource metrics; KEDA scales on leading workload signals such as queue depth, consumer lag, and Prometheus query results.
  2. KEDA extends HPA rather than replacing it: each ScaledObject generates and manages a standard HorizontalPodAutoscaler.
  3. minReplicaCount: 0 enables scale to zero, eliminating idle cost for batch, ETL, and bursty event-driven workloads.
  4. pollingInterval, lagThreshold, and cooldownPeriod jointly determine scale-up latency and scale-down stability; tune them together.
  5. Keep credentials in TriggerAuthentication resources backed by Kubernetes Secrets, never in ScaledObject metadata.

Conclusion

KEDA resolves the fundamental mismatch between how Kubernetes HPA observes load and how event-driven workloads actually generate load. By scaling directly on queue depth, consumer lag, and business-level Prometheus metrics, KEDA eliminates the 3–4 minute response lag that CPU-based autoscaling inevitably introduces for consumer workloads — a lag that translates directly into user-visible delays and unnecessary costs from over-provisioned standing fleets.

The transcoding fleet example from the introduction is representative of a broad class of production problems that appear as capacity issues but are actually scaling-signal latency issues. The hardware was always sufficient — the signal to use it simply arrived too late. KEDA's architecture of reading the event source directly, rather than waiting for that event to propagate through CPU utilization and then through HPA's scrape interval, is the correct abstraction for the event-driven systems that increasingly define modern backend infrastructure.

For teams managing complex multi-cluster deployments and advanced Kubernetes workload patterns, our Kubernetes Advanced Patterns guide covers StatefulSets, Operators, admission webhooks, and the full spectrum of production Kubernetes primitives that complement KEDA in a mature platform engineering setup.



Last updated: March 2026 — Written by Md Sanwar Hossain