Md Sanwar Hossain - Software Engineer

DevOps · March 21, 2026 · 15 min read · DevOps Reliability Engineering Series

Kubernetes KEDA and HPA: Event-Driven Autoscaling for Production Workloads That HPA Can't Handle

Standard Kubernetes Horizontal Pod Autoscaling works brilliantly for stateless HTTP services — until the moment your workload is driven by an event queue rather than CPU utilization. At that inflection point, HPA's reliance on reactive resource metrics introduces a dangerous lag between when work arrives and when your pods are ready to process it. KEDA (Kubernetes-based Event Driven Autoscaler) was built specifically to close that gap, scaling on the actual workload signal — queue depth, consumer lag, Prometheus query results — rather than the downstream resource symptoms those signals eventually produce.

Table of Contents

  1. The Real Problem: Why HPA Fails for Event-Driven Workloads
  2. What is KEDA?
  3. Architecture: How KEDA Works
  4. Implementing KEDA for Kafka Consumer Scaling
  5. Scale to Zero in Practice
  6. KEDA with Prometheus Metrics (Custom Business Metrics)
  7. Failure Scenarios and Debugging
  8. KEDA vs HPA: When to Use Which
  9. Performance Optimization
  10. Key Takeaways
  11. Conclusion

1. The Real Problem: Why HPA Fails for Event-Driven Workloads

A media streaming platform was running a video transcoding fleet on Kubernetes. When a user uploaded a new video, an event landed on a Kafka topic and a pool of transcoding pods consumed those events to convert the file to multiple resolutions. The ops team had configured an HPA targeting 70% CPU utilization — which seemed perfectly reasonable.

Then a creator with 2 million followers uploaded a video that went immediately viral. Within seconds, hundreds of concurrent upload events flooded the Kafka topic. The transcoding pods that were already running were immediately saturated. But the HPA did not trigger. Why? Because the existing pods had not yet started processing the backlog — they had not yet driven their CPU up to the 70% threshold. The CPU metric was still low. HPA saw nothing to act on.

Production incident: A 4-minute lag elapsed between the viral video upload event and transcoding pods scaling up to handle the load. During those 4 minutes, 847 concurrent users were stuck on a "processing" spinner with no video available. The root cause was not insufficient capacity — it was that HPA was watching CPU while the real backlog signal was in Kafka consumer lag.

The fundamental disconnect is this: HPA thinks about resource consumption; event-driven workloads need to think about workload intent. CPU and memory are trailing indicators. A queue filling with unprocessed messages is the leading indicator. For batch processors, Kafka consumers, SQS workers, and cron-triggered jobs, scaling on resource metrics means you will always scale after users are already waiting — not before they notice a problem. The fix was KEDA scaling on Kafka consumer lag, which scaled pods up within seconds of the lag crossing a threshold, before CPU had any chance to respond.

2. What is KEDA?

KEDA (Kubernetes-based Event Driven Autoscaler) is a CNCF graduated project that adds event-driven autoscaling capabilities to any Kubernetes cluster without replacing the existing HPA machinery — it extends it. KEDA was originally created by Microsoft and Red Hat and donated to the CNCF in 2020. Its graduation status reflects production maturity across thousands of enterprise deployments.

KEDA's architecture consists of three primary components:

  1. The KEDA operator (keda-operator), which manages ScaledObject and ScaledJob lifecycle and handles the scale-from-zero and scale-to-zero transitions.
  2. The metrics server (keda-operator-metrics-apiserver), which exposes event-source metrics to the HPA controller through the Kubernetes External Metrics API.
  3. Scalers, the per-event-source connectors (Kafka, RabbitMQ, SQS, Prometheus, and dozens more) that query the source directly for queue depth, consumer lag, or metric values.

The critical difference from native HPA is that KEDA reads directly from the event source. When you configure a Kafka trigger, KEDA connects to the Kafka broker, queries the consumer group lag for a specific topic, and immediately surfaces that lag as a scaling metric — before any pod has consumed a single message. Native HPA cannot do this because it is limited to CPU, memory, and custom metrics that must already be emitted by running pods.

Scale to Zero — the killer feature: KEDA supports minReplicaCount: 0, allowing a deployment to scale all the way down to zero pods when there is no work. Native HPA enforces a minimum of 1 replica at all times. This single capability can reduce idle infrastructure costs by 60–80% for batch workloads, nightly ETL jobs, and development environments.

3. Architecture: How KEDA Works

Understanding KEDA's internal data flow is essential for tuning it correctly in production. The scaling loop works as follows:

Kafka Topic (consumer lag grows)
  └── KEDA Scaler polls broker every pollingInterval seconds
        └── Reads consumer group lag for target topic
              └── Exposes metric via External Metrics API
                    └── Kubernetes HPA controller reads metric
                          └── HPA calculates desired replicas:
                                desiredReplicas = ceil(lag / lagThreshold)
                          └── Kubernetes scales the Deployment
                                └── New transcoder pods start consuming
                                      └── Consumer lag drops → scale-down after cooldownPeriod

KEDA does not bypass or replace the HPA controller. Instead, it acts as a metric provider that feeds the standard HPA loop. When you create a ScaledObject, KEDA automatically creates a corresponding HorizontalPodAutoscaler resource managed by the KEDA operator. The HPA controller then drives actual pod scaling using Kubernetes's native replica management. This design means KEDA inherits all of HPA's stability guarantees — including the scale-down stabilization window — while adding event-source awareness that HPA alone cannot provide.
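To make the relationship concrete, here is a sketch of roughly what the KEDA-generated HPA looks like for a Kafka-driven ScaledObject. The names and the internal metric identifier are illustrative; KEDA prefixes the HPA name with keda-hpa- and owns the resource, so you inspect it rather than create it:

```yaml
# Approximate shape of the HPA that KEDA generates and manages.
# Do not create this yourself; KEDA derives it from the ScaledObject.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keda-hpa-video-transcoder-scaler   # keda-hpa-<scaledobject-name>
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: video-transcoder
  minReplicas: 1          # the HPA floor; KEDA itself handles the 0 <-> 1 hop
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: s0-kafka-video-upload-events  # KEDA-internal metric identifier
      target:
        type: AverageValue
        averageValue: "5"                   # mirrors lagThreshold
```

Inspecting this generated HPA with kubectl is often the fastest way to confirm which metric and target KEDA is actually feeding the scaling loop.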

KEDA runs in the keda namespace as two core Deployments: keda-operator, which manages ScaledObject lifecycle and drives scaling decisions, and keda-operator-metrics-apiserver, which serves the External Metrics API endpoint. Both are stateless and can run with multiple replicas for high availability (the operator uses leader election).

4. Implementing KEDA for Kafka Consumer Scaling

The following ScaledObject is the exact configuration that resolved the viral video transcoding incident described in the introduction. It scales the video-transcoder deployment based on Kafka consumer lag, with a 1-pod-per-5-messages ratio:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-transcoder-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: video-transcoder
  pollingInterval: 10        # Check every 10 seconds
  cooldownPeriod: 60         # Wait 60s before scaling down
  minReplicaCount: 0         # Scale to zero when idle!
  maxReplicaCount: 50        # Cap at 50 pods
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-broker:9092
      consumerGroup: video-transcoder-group
      topic: video-upload-events
      lagThreshold: "5"      # 1 pod per 5 messages in lag
      offsetResetPolicy: latest

Let's break down the key fields. pollingInterval: 10 tells KEDA to query the Kafka broker every 10 seconds for updated consumer lag. This is what enables near-real-time reaction to queue depth changes, in stark contrast to CPU-based HPA, whose metrics pipeline typically scrapes every 15–30 seconds and then still requires CPU to actually climb before triggering. lagThreshold: "5" means KEDA will target one pod for every 5 unprocessed messages: with 50 messages in lag, KEDA drives the deployment to 10 pods, and at 250 messages or more it reaches the maxReplicaCount cap of 50.

cooldownPeriod: 60 is the number of seconds KEDA waits after the last active event before beginning to scale down. This prevents flapping when lag briefly drops to zero between bursts. For video transcoding, 60 seconds is appropriate — enough to absorb inter-upload gaps without keeping idle pods running for minutes.

Result after deployment: Scale-up latency dropped from 4 minutes (CPU-based HPA) to under 30 seconds (KEDA Kafka lag). The 847-user spinner problem was eliminated. During the next viral upload event, pods were provisioned and consuming messages before any user had waited more than 45 seconds for their video.

5. Scale to Zero in Practice

Scale to zero is KEDA's most financially impactful feature for workloads with predictable idle periods. A nightly ETL job, a report generation service, a development-environment worker — these run for hours and then sit idle consuming node resources for the remainder of the day. With minReplicaCount: 0, the deployment drops to zero pods when the event source is empty, and KEDA restores replicas the moment a new trigger event arrives.

The cold start problem is the main trade-off: scaling from zero means the first event after an idle period must wait for a new pod to be scheduled, pulled, and initialized before processing begins. Common mitigations include:

  1. Keep minReplicaCount: 1 for latency-sensitive consumers, trading a small standing cost for zero cold-start delay.
  2. Keep container images small and pre-pulled on nodes so startup time is dominated by application initialization, not image download.
  3. Minimize application startup work (lazy initialization, trimmed dependency graphs) so a fresh pod begins consuming quickly.
  4. Reserve scale to zero for workloads whose SLA tolerates tens of seconds of first-event latency, such as batch and ETL jobs.

Here is a ScaledJob configuration for a batch image processing pipeline that runs to completion and terminates, with zero idle cost:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor-job
  namespace: production
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: image-processor
          image: myrepo/image-processor:latest
        restartPolicy: Never
  pollingInterval: 15
  maxReplicaCount: 20
  triggers:
  - type: rabbitmq
    metadata:
      host: amqp://rabbitmq:5672/
      queueName: image-processing-queue
      queueLength: "1"       # 1 Job per message in queue

6. KEDA with Prometheus Metrics (Custom Business Metrics)

Scaling on Prometheus metrics sounds straightforward until you try to wire the Prometheus Adapter to Kubernetes's Custom Metrics API — at which point you discover that getting the adapter's RBAC, APIService registration, and PromQL configuration aligned correctly takes days, not hours. KEDA's prometheus trigger bypasses this entirely: you give KEDA a Prometheus server address and a PromQL query, and KEDA handles the rest.

This opens a powerful capability: scaling on business-level metrics, not just infrastructure metrics. Consider an order processing service where the right scaling signal is not pod CPU but the number of orders sitting in a queued state in the database:

triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus:9090
    metricName: pending_orders_total
    threshold: "100"
    query: sum(pending_orders{status="queued"})

With this configuration, KEDA evaluates the PromQL query sum(pending_orders{status="queued"}) every pollingInterval seconds. When the result exceeds 100, KEDA scales up: if there are 300 pending orders, it drives the deployment to 3 replicas; 500 pending orders yields 5 replicas. The metric is derived directly from application instrumentation — a Micrometer gauge emitting pending_orders with a status label — not from inferred infrastructure consumption.
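Embedded in a complete ScaledObject, the trigger might look like the following. This is a sketch; the order-processor deployment name and the replica bounds are illustrative, not taken from a specific incident:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor        # hypothetical deployment name
  pollingInterval: 15
  cooldownPeriod: 120
  minReplicaCount: 1             # keep one warm pod; orders are latency-sensitive
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: pending_orders_total
      threshold: "100"
      query: sum(pending_orders{status="queued"})
```

Note minReplicaCount: 1 here rather than 0: an order pipeline usually cannot tolerate a cold start on the first order after a quiet period.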

This approach aligns infrastructure cost directly with business throughput. The order processor fleet is sized exactly proportional to business backlog, not to the CPU that backlog eventually causes. Teams using this pattern typically see 30–40% cost reduction compared to static replica counts or CPU-based HPA, because the fleet is never over-provisioned in anticipation of load that never arrives.

7. Failure Scenarios and Debugging

KEDA introduces a new failure surface between your event source and your scaling loop. Here are the most common production failure modes and how to investigate them.

ScaledObject stuck at wrong replica count. Run kubectl describe scaledobject video-transcoder-scaler -n production and examine the Conditions block. A healthy ScaledObject shows ScalerReady=True and Active=True. If ScalerReady=False, the trigger cannot connect to its event source — usually a network policy block, wrong bootstrap server address, or authentication failure. The kubectl get events -n production output will show the specific error message from the scaler.

KEDA operator crash or unavailability. KEDA is designed to fail safe: if the KEDA operator pod crashes, the last-known replica count is preserved by the underlying HPA. The deployment does not immediately scale to zero or to max. The HPA will continue to function based on the most recently computed metric until KEDA recovers. This means a KEDA outage during low-traffic periods is generally graceful, though a KEDA outage during a traffic surge means no scale-up until the operator recovers.

Kafka SASL/TLS authentication. Production Kafka clusters require authentication. KEDA handles this via the TriggerAuthentication CRD, which references Kubernetes Secrets containing credentials. Never embed credentials directly in the ScaledObject metadata.

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-trigger-auth
  namespace: production
spec:
  secretTargetRef:
  - parameter: username
    name: kafka-credentials
    key: username
  - parameter: password
    name: kafka-credentials
    key: password
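Referenced from the ScaledObject, the trigger might be wired as follows. This is a sketch: sasl and tls are standard Kafka scaler metadata keys, but the correct mechanism value depends on your broker configuration:

```yaml
triggers:
- type: kafka
  metadata:
    bootstrapServers: kafka-broker:9092
    consumerGroup: video-transcoder-group
    topic: video-upload-events
    lagThreshold: "5"
    sasl: plaintext            # or scram_sha256 / scram_sha512, per broker config
    tls: enable
  authenticationRef:
    name: kafka-trigger-auth   # the TriggerAuthentication defined above
```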

Reference the TriggerAuthentication from your ScaledObject trigger with authenticationRef: { name: kafka-trigger-auth }. Common failures and their resolutions:

Failure                  | Symptom                              | Fix
-------------------------|--------------------------------------|----------------------------------------------
Kafka auth error         | Scaler shows Error, pods stuck at 0  | Add TriggerAuthentication with SASL/TLS creds
lagThreshold too high    | Slow scale-up despite large lag      | Lower threshold value, reduce pollingInterval
cooldownPeriod too short | Pods flapping up and down repeatedly | Increase cooldownPeriod to 120s or more

8. KEDA vs HPA: When to Use Which

KEDA is not a wholesale replacement for HPA — both tools have clear, non-overlapping use cases. Use HPA when your workload is CPU or memory bound and the resource metric accurately represents demand. Use KEDA when your workload reacts to external event sources, queues, or business-level metrics that precede resource consumption.

Scenario                                     | Use HPA | Use KEDA
---------------------------------------------|---------|---------
CPU/memory bound workloads                   |    ✓    |
Event queue consumers (Kafka, SQS, RabbitMQ) |         |    ✓
Scale to zero needed                         |         |    ✓
Prometheus-driven business metrics           |         |    ✓
Simple stateless web API scaling             |    ✓    |
Cron-based batch jobs                        |         |    ✓
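For the cron-based batch jobs row, KEDA ships a cron scaler that scales on a schedule rather than a queue. A minimal sketch, with an illustrative schedule and replica count:

```yaml
triggers:
- type: cron
  metadata:
    timezone: Asia/Dhaka       # IANA timezone name (illustrative)
    start: 0 2 * * *           # scale up at 02:00
    end: 0 4 * * *             # scale back down at 04:00
    desiredReplicas: "5"
```

This is useful when load is predictable by clock time, and it can be combined with a queue trigger so the schedule pre-warms capacity while lag handles overflow.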

One nuance: rather than running a second, independent HPA against the same deployment (which KEDA discourages, since two autoscalers will fight over the replica count), you can combine event triggers with cpu and memory triggers in a single ScaledObject, so one KEDA-managed HPA weighs both queue depth and resource utilization. KEDA also exposes the advanced.horizontalPodAutoscalerConfig field in the ScaledObject, which lets you pass custom HPA behavior settings alongside the triggers.

9. Performance Optimization

pollingInterval tuning. Lower values (5–10 seconds) reduce scale-up latency but increase API load against the event source broker. For latency-sensitive consumer groups, 10 seconds is a practical floor. For batch jobs with relaxed SLAs, 30–60 seconds reduces broker query load significantly. Never set pollingInterval below 5 seconds without confirming your broker can handle the increased polling frequency across all KEDA-managed consumer groups.

Combining triggers for dual-axis scaling. KEDA governs the outer signal, how many pods exist based on queue depth, while a cpu or memory trigger in the same ScaledObject adds resource-based scaling through the same KEDA-managed HPA. Fine-grained control over scale-up and scale-down comes from advanced.horizontalPodAutoscalerConfig.behavior in the ScaledObject spec; these behavior settings apply to the generated HPA, including stabilization windows and scaling rate policies.
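Assuming the video-transcoder ScaledObject from section 4, a dual-axis configuration might be sketched like this. The behavior values and the cpu trigger threshold are illustrative starting points, not tuned recommendations:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-transcoder-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: video-transcoder
  minReplicaCount: 0
  maxReplicaCount: 50
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0    # react immediately to lag spikes
          policies:
          - type: Percent
            value: 100                     # at most double the fleet per period
            periodSeconds: 15
        scaleDown:
          stabilizationWindowSeconds: 300  # five minutes of calm before shrinking
          policies:
          - type: Pods
            value: 5                       # shed at most 5 pods per minute
            periodSeconds: 60
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka-broker:9092
      consumerGroup: video-transcoder-group
      topic: video-upload-events
      lagThreshold: "5"
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"                          # also scale if CPU crosses 70%
```

The asymmetric behavior (instant scale-up, slow scale-down) matches the economics of queue consumers: reacting late to a burst costs user-visible latency, while shrinking late only costs a few pod-minutes.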

Cluster Autoscaler integration. When KEDA scales a deployment beyond available node capacity, pods enter Pending state. The Kubernetes Cluster Autoscaler detects pending pods and provisions additional nodes. This creates a cascading scale-out: KEDA scales pods → CA scales nodes → pods become schedulable → consumers process lag → KEDA scales down → CA removes idle nodes. Tune your CA scale-down-delay-after-add to match KEDA's cooldownPeriod to avoid premature node removal that would immediately trigger another CA scale-out.

"The best scaling policy is one that can see the future — not the past. KEDA gives you the closest thing to that: scaling on intent, not on consequence."
— KEDA project maintainers, KubeCon 2024

Key Takeaways

  1. HPA scales on trailing resource metrics; KEDA scales on leading workload signals such as queue depth, consumer lag, and Prometheus query results.
  2. KEDA extends HPA rather than replacing it: each ScaledObject generates and manages a standard HorizontalPodAutoscaler.
  3. minReplicaCount: 0 enables scale to zero, eliminating idle cost for batch, ETL, and bursty event-driven workloads.
  4. pollingInterval, lagThreshold, and cooldownPeriod jointly determine scale-up latency and scale-down stability; tune them together.
  5. Keep credentials in TriggerAuthentication resources backed by Kubernetes Secrets, never in ScaledObject metadata.

Conclusion

KEDA resolves the fundamental mismatch between how Kubernetes HPA observes load and how event-driven workloads actually generate load. By scaling directly on queue depth, consumer lag, and business-level Prometheus metrics, KEDA eliminates the 3–4 minute response lag that CPU-based autoscaling inevitably introduces for consumer workloads — a lag that translates directly into user-visible delays and unnecessary costs from over-provisioned standing fleets.

The transcoding fleet example from the introduction is representative of a broad class of production problems that appear as capacity issues but are actually scaling-signal latency issues. The hardware was always sufficient — the signal to use it simply arrived too late. KEDA's architecture of reading the event source directly, rather than waiting for that event to propagate through CPU utilization and then through HPA's scrape interval, is the correct abstraction for the event-driven systems that increasingly define modern backend infrastructure.

For teams managing complex multi-cluster deployments and advanced Kubernetes workload patterns, our Kubernetes Advanced Patterns guide covers StatefulSets, Operators, admission webhooks, and the full spectrum of production Kubernetes primitives that complement KEDA in a mature platform engineering setup.



Last updated: March 2026 — Written by Md Sanwar Hossain