
Building Scalable Task Queues with Redis and Node.js

Lessons learned from building a distributed task processing system handling 100K+ daily jobs.

Node.js · Redis · Architecture

When you need to process tasks asynchronously at scale, a robust queue system becomes essential. Here's what I learned building one that handles 100K+ jobs daily.

Why Not Just Use SQS?

Amazon SQS is great, but for our use case we needed:

  • Sub-second job pickup latency
  • Complex retry strategies per job type
  • Real-time job progress tracking
  • Priority queues

Redis with Bull gave us all of this with lower operational overhead.
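Conceptually, priority pickup just means lower-numbered jobs are served first (Bull's convention: priority 1 beats priority 10). Here's a toy in-memory sketch of that ordering guarantee — not Bull's actual Redis-backed implementation, just the behavior it gives you:

```javascript
// Toy priority queue illustrating pickup order only.
// The real system uses Bull on Redis; names here are illustrative.
class PriorityQueue {
  constructor() {
    this.jobs = [];
  }
  add(name, priority) {
    this.jobs.push({ name, priority });
    // Lower number = higher priority; Array.sort is stable in Node >= 11,
    // so equal-priority jobs keep FIFO order.
    this.jobs.sort((a, b) => a.priority - b.priority);
  }
  next() {
    return this.jobs.shift();
  }
}

const q = new PriorityQueue();
q.add("bulk-export", 10);
q.add("password-reset", 1);
console.log(q.next().name); // "password-reset" — picked before bulk work
```

With Bull, the equivalent is passing a priority when enqueueing, e.g. `queue.add(payload, { priority: 1 })`.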

Architecture Overview

The system has three main components:

  • Producers - API servers that enqueue jobs
  • Redis - The queue storage and pub/sub backbone
  • Workers - Horizontally scalable job processors

Key Design Decisions

1. Job Persistence

While Redis is fast, we needed durability. Every job is:

  • Written to Redis for processing
  • Logged to PostgreSQL for an audit trail
  • Persisted with its result after completion
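The dual-write flow above can be sketched with in-memory stand-ins for Redis (the `queue` array) and PostgreSQL (the `auditLog` map). The function names are illustrative, not from our codebase:

```javascript
// Sketch of the dual-write pattern: record an audit row when a job is
// enqueued, then update it when the job settles.
const queue = [];              // stand-in for the Redis-backed Bull queue
const auditLog = new Map();    // stand-in for a PostgreSQL audit table

function enqueueWithAudit(id, payload) {
  // Audit row first, so a crash after this point still leaves a trace.
  auditLog.set(id, { status: "queued", payload, result: null });
  queue.push({ id, payload }); // in production: emailQueue.add(payload)
}

function complete(id, result) {
  auditLog.set(id, { ...auditLog.get(id), status: "completed", result });
}

enqueueWithAudit("job-1", { to: "user@example.com" });
complete("job-1", { delivered: true });
```

The ordering matters: writing the audit row before enqueueing means every job in Redis has a corresponding trail entry, even if the worker or producer dies mid-flight.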

2. Retry Strategy

Different jobs need different retry behaviors:

const emailQueue = new Bull("email", {
  defaultJobOptions: {
    attempts: 5,
    backoff: {
      type: "exponential",
      delay: 2000, // 2s, 4s, 8s, 16s, 32s
    },
  },
});
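The delays in that comment follow `delay * 2^(attempt - 1)`. A small helper makes the schedule explicit — this is a sketch of the arithmetic for tuning, not Bull's internal backoff code:

```javascript
// Compute the retry delay schedule for exponential backoff:
// the delay doubles on each attempt, starting from baseDelayMs.
function backoffSchedule(attempts, baseDelayMs) {
  return Array.from({ length: attempts }, (_, i) => baseDelayMs * 2 ** i);
}

console.log(backoffSchedule(5, 2000));
// [2000, 4000, 8000, 16000, 32000] — matches the 2s…32s comment above
```

Plotting the schedule like this before picking `attempts` is worth the minute it takes: five attempts at a 2s base means a job can sit in retry for over a minute total, which may or may not suit a given job type.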

3. Worker Scaling

Workers scale based on queue depth using Kubernetes HPA:

metrics:
  - type: External
    external:
      metric:
        name: redis_queue_depth
      target:
        type: AverageValue
        averageValue: "100"

Lessons Learned

  1. Always set job timeouts - Stuck jobs will block workers
  2. Use separate queues for different priorities - Don't let bulk jobs block critical ones
  3. Monitor everything - Queue depth, processing time, failure rates
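On lesson 1: Bull accepts a per-job `timeout` option, and the same guard can be hand-rolled as a `Promise.race` inside a processor. A generic sketch (not our production code):

```javascript
// Reject a promise that doesn't settle within ms, so a stuck job
// fails fast instead of pinning a worker indefinitely.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`job timed out after ${ms}ms`)),
      ms
    );
  });
  // Clear the timer either way so it doesn't keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage: a fast task resolves normally, well within its budget.
withTimeout(
  new Promise((resolve) => setTimeout(resolve, 10, "done")),
  1000
).then((v) => console.log(v)); // "done"
```

A timed-out job then flows into the normal failure path, where the retry strategy above decides whether to try again.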

The system has been running in production for 2 years with 99.9% job completion rate.