Designing a Distributed Task Queue at Scale: Kafka, Celery & Beyond
Every production system eventually needs to defer work. Send an email after checkout. Resize an image after upload. Generate a PDF invoice after payment. Process a machine learning inference batch overnight. These are background tasks, and how you design the infrastructure to run them at scale determines whether your system handles 1,000 or 10 million jobs per day without data loss, duplicate processing, or silent failures.
Part of the System Design Deep Dive Series.
Introduction
A distributed task queue is one of the most fundamental infrastructure components in a production system, yet it is one of the most frequently underdesigned. Teams start with a simple database table with a status column (PENDING, PROCESSING, DONE) polled by worker threads — a pattern that works at low volume but degrades catastrophically under load due to polling overhead, lock contention on the status update, and lack of priority, retry, and visibility capabilities.
At scale, a distributed task queue must solve: durable task storage (tasks must survive worker crashes), exactly-once or at-least-once delivery semantics (depending on task idempotency), priority queuing (critical tasks must not wait behind bulk background jobs), worker scaling (adding workers should increase throughput linearly), dead-letter handling (tasks that fail repeatedly must not block the queue), observability (every task must be traceable through its entire lifecycle), and rate limiting (task execution must respect downstream capacity).
Real-World Problem: The Database Queue That Killed the Database
An HR SaaS company used a PostgreSQL table as their task queue. It worked fine at 10,000 jobs/day. At 500,000 jobs/day, after a major enterprise customer onboarding, the table grew to 8 million rows. Workers polled with SELECT ... WHERE status='PENDING' FOR UPDATE SKIP LOCKED LIMIT 10 every 500ms from 50 worker threads. The index on (status, created_at) was being scanned by 100 concurrent queries per second. PostgreSQL's autovacuum could not keep pace with the update churn (rows transitioning PENDING → PROCESSING → DONE). Table bloat grew to 40 GB. VACUUM FULL runs took exclusive locks, causing 20-second pauses. Database CPU hit 95%, and primary application queries started timing out. The task queue had become the database killer.
Architecture Options
Option 1: Redis-Based Queues (Sidekiq, BullMQ)
Redis lists and sorted sets provide O(1) enqueue/dequeue with microsecond latency. Sidekiq (Ruby) and BullMQ (Node.js) use Redis as the underlying store. Advantages: extremely fast, built-in retry with exponential backoff, job visibility (web UI), priority queues via multiple lists/sorted sets, delayed jobs via a sorted set scored by the scheduled run time. Limitations: Redis memory is finite — at millions of queued jobs, memory cost is significant. Redis is not designed for multi-day job retention. Persistence requires AOF/RDB configuration. Not suitable for jobs requiring exactly-once semantics — Redis does not support distributed transactions.
Option 2: Message Brokers (RabbitMQ, ActiveMQ)
Traditional message queues provide durable, acknowledged delivery. RabbitMQ supports: per-message TTL, dead-letter exchanges (DLX) for failed message routing, priority queues (separate internal queues per priority level), consumer acknowledgements for at-least-once delivery, and publisher confirms for durable message persistence. Limitations: RabbitMQ does not retain messages after consumption — you cannot replay a processed job. Queue length is bounded by memory/disk configuration. At very high throughput (>100K messages/second per queue), RabbitMQ performance degrades significantly.
Option 3: Kafka as a Task Queue
Kafka is not designed as a task queue but is frequently used as one for high-volume, high-durability scenarios. Advantages: virtually unlimited retention, replay capability, horizontal scaling via partitions, consumer group lag monitoring, and schema evolution via Confluent Schema Registry. Limitations: Kafka has no native priority queue support (you need separate topics per priority level). Tasks are consumed in partition order — one stuck task blocks all subsequent tasks on that partition (head-of-line blocking). Exactly-once semantics require careful configuration of producer idempotence, transactions, and consumer offset management.
Option 4: Purpose-Built Systems (Temporal, Celery, Quartz)
Temporal (a fork of Uber's Cadence by its original creators) is the most advanced option for complex, long-running workflows. It provides: durable execution state (workflows survive worker crashes and continue where they left off), timer-based delays, sagas with compensation, child workflows, and a complete history of every step. The overhead is higher — Temporal requires its own cluster — but for workflows that span hours or days, it is the correct choice. Celery (Python) is the most widely deployed task queue for Python services, using Redis or RabbitMQ as the broker. Quartz Scheduler is the Java standard for scheduled and triggered background jobs, used in Spring Boot via @Scheduled or the full Quartz API.
Deep Dive: Key Design Decisions
Delivery Semantics: At-Least-Once vs Exactly-Once
At-least-once delivery means a task may execute more than once (due to worker crash after execution but before acknowledgement). This is the default for most queue systems and is acceptable if tasks are idempotent — executing them twice produces the same result. For non-idempotent tasks (charge a credit card, send an email), you need idempotency at the task level: a unique task ID stored in a database, checked before execution, set atomically on completion. Exactly-once delivery at the infrastructure level is only achievable with Kafka transactions + transactional outbox pattern, or with Temporal's durable execution model.
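A minimal sketch of the task-level idempotency check described above, using SQLite's unique constraint to make the claim atomic. Table and function names are illustrative, not from any specific framework:

```python
import sqlite3

# The unique constraint on task_id makes "claim then execute" safe under
# at-least-once delivery: a duplicate delivery fails the INSERT and skips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_tasks (task_id TEXT PRIMARY KEY)")

def run_once(task_id: str, handler) -> bool:
    """Execute handler at most once per task_id. Returns True if it ran."""
    try:
        # Claim the ID first; the primary key makes the claim atomic.
        conn.execute(
            "INSERT INTO processed_tasks (task_id) VALUES (?)", (task_id,)
        )
    except sqlite3.IntegrityError:
        return False  # already processed (or in flight) -- skip
    try:
        handler()
        conn.commit()  # persist the claim only after the handler succeeds
        return True
    except Exception:
        conn.rollback()  # release the claim so a retry can run the task
        raise

calls = []
run_once("task-42", lambda: calls.append("charged"))
run_once("task-42", lambda: calls.append("charged"))  # duplicate: no-op
```

Note the ordering: the claim is committed only after the side effect succeeds, so a crash mid-execution releases the claim and the redelivered task runs again — at-least-once, with the side effect deduplicated.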
Priority Queue Design
Priority queues are commonly needed: user-facing tasks (image processing triggered by a user action) should not wait behind bulk background tasks (nightly data export). Design: multiple queues/topics per priority tier (CRITICAL, HIGH, NORMAL, LOW) with separate consumer groups consuming at different rates. Alternatively, use a single queue with task metadata and a priority-aware dispatcher that dequeues by priority score. Redis sorted sets are excellent for priority queues — compose the score from the priority tier and the enqueue timestamp (for example, priority * 10^13 + millisecond timestamp) so that higher tiers always dequeue first and ties within a tier break by submission order.
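The composite-score idea can be sketched in a few lines, here using Python's heapq as an in-process stand-in for a Redis sorted set (ZADD/ZPOPMIN). The tier values and score formula are illustrative assumptions:

```python
import heapq
import itertools

PRIORITY = {"CRITICAL": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3}
_seq = itertools.count()  # monotonic tie-breaker standing in for a timestamp
queue = []

def enqueue(task: str, tier: str) -> None:
    # Lower score pops first. The large multiplier keeps tiers strictly
    # ordered; the sequence number breaks ties by submission order
    # (with Redis: score = tier * 10**13 + millisecond timestamp).
    score = PRIORITY[tier] * 10**13 + next(_seq)
    heapq.heappush(queue, (score, task))

def dequeue() -> str:
    return heapq.heappop(queue)[1]

enqueue("nightly-export", "LOW")
enqueue("resize-avatar", "CRITICAL")
enqueue("send-receipt", "NORMAL")
# dequeues in tier order: resize-avatar, send-receipt, nightly-export
```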
Dead-Letter Queue Design
Every task that fails must have a bounded retry policy and a dead-letter destination. Design: retry with exponential backoff (attempt 1: 1s, 2: 4s, 3: 16s, 4: 64s, 5: 256s). After max retries, move to the DLQ — a separate queue for manually inspectable, replayable failed tasks. DLQ should include: original task payload, failure reason, stack trace, retry history, and submitter context. Build a monitoring dashboard on the DLQ that alerts when DLQ depth exceeds a threshold — a growing DLQ indicates a systemic failure in the task processing logic.
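A sketch of the retry-then-DLQ policy above, with the queues modeled as plain lists and the field names as assumptions; a real implementation would publish to broker queues instead:

```python
MAX_RETRIES = 5

def backoff_seconds(attempt: int) -> int:
    # attempt 1 -> 1s, 2 -> 4s, 3 -> 16s, 4 -> 64s, 5 -> 256s: 4^(n-1)
    return 4 ** (attempt - 1)

def handle_failure(task: dict, error: Exception, retry_queue, dlq) -> None:
    task["attempts"] = task.get("attempts", 0) + 1
    task.setdefault("retry_history", []).append(str(error))
    if task["attempts"] > MAX_RETRIES:
        # Exhausted retries: move to the DLQ with everything an operator
        # needs to inspect and replay the task.
        dlq.append({
            "payload": task["payload"],
            "failure_reason": str(error),
            "retry_history": task["retry_history"],
            "submitter": task.get("submitter"),
        })
    else:
        # Re-enqueue with the exponential delay for this attempt.
        retry_queue.append((backoff_seconds(task["attempts"]), task))
```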
Poison Pill Handling
A poison pill task is one that consistently crashes its worker — due to a malformed payload, a bug triggered by specific data, or an infinite loop. In Kafka, a poison pill blocks the entire partition (head-of-line blocking). Mitigations: catch all exceptions in the task handler and record failures without crashing the worker; implement a "sick skip" threshold — if the same task fails 5 times in a row, move it to the DLQ and continue; use schema validation to reject malformed payloads before they enter the queue.
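The "sick skip" mitigation can be sketched as follows — an illustrative handler wrapper, with the threshold and failure-tracking structure as assumptions:

```python
SKIP_THRESHOLD = 5

consecutive_failures: dict[str, int] = {}

def process(task_id: str, handler, dlq: list) -> None:
    # Catch everything: a poison pill must never crash the worker,
    # because in Kafka that would stall the entire partition.
    try:
        handler()
        consecutive_failures.pop(task_id, None)  # success resets the count
    except Exception as exc:
        count = consecutive_failures.get(task_id, 0) + 1
        consecutive_failures[task_id] = count
        if count >= SKIP_THRESHOLD:
            # Sidestep the poison pill so the partition keeps moving.
            dlq.append((task_id, repr(exc)))
            consecutive_failures.pop(task_id, None)
```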
Visibility Timeout Pattern
When a worker dequeues a task, it takes a visibility lease — the task becomes invisible to other workers for a configured duration (e.g., 30s). If the worker completes successfully, it deletes the task. If the worker crashes, the visibility timeout expires and the task becomes visible again for reprocessing. This is the standard pattern in SQS and mimics "FOR UPDATE SKIP LOCKED" in PostgreSQL queues. The visibility timeout must be longer than the expected task execution time, with a heartbeat extension for long-running tasks.
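A toy in-memory version of the lease mechanics, assuming a 30-second lease; SQS implements the same state machine server-side:

```python
import time

LEASE_SECONDS = 30

class VisibilityQueue:
    """Toy queue illustrating the SQS-style visibility lease."""

    def __init__(self):
        self.tasks = {}            # task_id -> payload
        self.invisible_until = {}  # task_id -> lease expiry (epoch seconds)

    def receive(self, now=None):
        now = time.time() if now is None else now
        for task_id, payload in self.tasks.items():
            if self.invisible_until.get(task_id, 0) <= now:
                # Take the lease: hidden from other workers until expiry.
                self.invisible_until[task_id] = now + LEASE_SECONDS
                return task_id, payload
        return None

    def heartbeat(self, task_id, now=None):
        # Long-running tasks extend the lease instead of oversizing it.
        now = time.time() if now is None else now
        self.invisible_until[task_id] = now + LEASE_SECONDS

    def delete(self, task_id):
        # Only successful completion removes the task; a crashed worker
        # simply lets the lease expire and the task reappears.
        self.tasks.pop(task_id, None)
        self.invisible_until.pop(task_id, None)
```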
Scaling Strategies
Horizontal worker scaling: Consumer groups in Kafka scale to N partitions maximum — partition count is the ceiling on parallelism. For RabbitMQ/Redis, workers scale independently. Implement KEDA (Kubernetes Event-Driven Autoscaler) to automatically scale worker deployments based on queue depth. KEDA supports native scalers for Kafka, RabbitMQ, Redis, SQS, and PostgreSQL-based queues.
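A hedged sketch of a KEDA ScaledObject for the Kafka case — deployment name, topic, and thresholds are placeholders to adapt:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: task-worker-scaler
spec:
  scaleTargetRef:
    name: task-worker          # the worker Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 50          # keep at or below the partition count
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        consumerGroup: task-workers
        topic: tasks
        lagThreshold: "100"    # target consumer lag per replica
```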
Rate limiting task execution: When tasks call downstream APIs with rate limits (third-party payment APIs, email services), implement a token bucket at the worker level. Spring Boot's RateLimiter (from Resilience4j) or Redis-based token buckets provide distributed rate limiting across multiple worker instances.
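A minimal in-process token bucket illustrating the mechanism; a distributed version would keep the same two pieces of state (token count, last refill time) in Redis, updated atomically via a Lua script:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity, self.tokens + (now - self.last) * self.rate
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should delay or requeue the task
```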
Task chunking for large payloads: Tasks with large payloads (>1 MB) should store the payload in S3/GCS and pass only a reference in the task message. This keeps queue messages small, reduces broker memory pressure, and allows payload reuse across retries without re-uploading.
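This is the claim-check pattern; a sketch with an in-memory dict standing in for S3/GCS, and the threshold and key scheme as assumptions:

```python
import uuid

PAYLOAD_THRESHOLD = 1_000_000  # ~1 MB

blob_store = {}  # stand-in for S3/GCS: key -> bytes

def enqueue_message(payload: bytes) -> dict:
    # Claim check: large payloads go to object storage; the queue
    # carries only a small reference.
    if len(payload) > PAYLOAD_THRESHOLD:
        key = f"task-payloads/{uuid.uuid4()}"
        blob_store[key] = payload
        return {"payload_ref": key}
    return {"payload": payload.decode()}

def load_payload(message: dict) -> bytes:
    if "payload_ref" in message:
        # Retries re-read the same stored object -- no re-upload needed.
        return blob_store[message["payload_ref"]]
    return message["payload"].encode()
```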
Failure Scenarios
Worker crash mid-execution: Handled by visibility timeout expiry (SQS/Redis) or uncommitted consumer offset (Kafka). The task is reprocessed by another worker. Ensure all task logic is idempotent or uses idempotency keys to prevent double-processing side effects.
Queue overload (producer faster than consumer): Monitor queue depth. Implement producer-side rate limiting before queue depth becomes unbounded. Alert at 80% of maximum queue capacity. Provide a backpressure signal to the producer (HTTP 503 / circuit breaker open) to slow submission.
Cascade from downstream failures: If all tasks call a downstream service that is down, all tasks fail and retry simultaneously, creating a retry storm that hammers the downstream service as it recovers. Mitigate with jitter in retry backoff, circuit breaker around downstream calls, and rate-limited DLQ replay after recovery.
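Jitter is a one-liner: instead of every worker sleeping exactly 2^n seconds, pick uniformly from the backoff window ("full jitter"), so retries spread out rather than arriving in synchronized waves. The base and cap values here are illustrative:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0,
                        cap: float = 300.0) -> float:
    # Full jitter: uniform in [0, min(cap, base * 2^attempt)].
    # Spreading retries over the window breaks up retry storms.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```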
Technology Selection Guide
Use Redis/BullMQ/Sidekiq for: sub-second job latency, <10M jobs/day, short retention (hours), JavaScript/Ruby ecosystems. Use RabbitMQ for: traditional message routing, dead-letter exchange natively, polyglot consumers, <100K messages/second. Use Kafka for: high throughput (>100K/second), long retention, replay, event sourcing, Java/Go ecosystems. Use Temporal for: long-running workflows (>minutes), saga coordination, complex retry logic, workflow versioning. Use Quartz/Spring @Scheduled for: simple cron-based jobs with low concurrency requirements.
Key Takeaways
- Database tables as task queues work at low volume and collapse under high load due to polling overhead and MVCC bloat
- At-least-once delivery + idempotent tasks is the most practical production model for most use cases
- Separate priority tiers into distinct queues/topics — never mix high-priority and bulk tasks in the same queue
- Every queue needs a dead-letter queue with monitoring — a silent DLQ is data loss
- Use KEDA for automatic worker scaling based on queue depth in Kubernetes
- Temporal is the correct choice for workflows with complex state, compensation logic, or multi-hour execution
Conclusion
Designing a distributed task queue is a microcosm of distributed systems design: you are trading consistency, availability, performance, and operational complexity against each other. The right design depends on your volume, latency requirements, delivery semantics needs, and team capabilities. What is universal: a queue without a dead-letter queue is not a queue — it is a silent data loss machine. Invest in DLQ monitoring, idempotent task design, and backpressure mechanisms before scaling your worker fleet, and your background job infrastructure will support your growth rather than becoming its bottleneck.