How to Build a Reliable Queue in Redis

The simplest Redis queue uses LPUSH to add jobs and RPOP to fetch them. This works until a worker crashes after popping a job but before finishing it. The job vanishes, lost forever. For background tasks that matter, you need a queue where jobs survive worker failures.

A reliable queue keeps jobs visible until a worker explicitly acknowledges completion. If a worker crashes, the job remains in a "processing" state where a recovery process can find it and return it to the queue. This pattern gives you at-least-once delivery without the complexity of Redis Streams.

Which Redis data types we will use

List is used in two ways in this implementation:

  1. As the pending queue where new jobs wait to be processed. Producers push jobs with LPUSH, and workers claim them from the other end.
  2. As the processing list where jobs live while being worked on. The LMOVE command atomically transfers a job from pending to processing in one step.

Hash stores metadata for each job, including the timestamp when processing started, which worker claimed it, how many times it has been attempted, and any error messages from failed attempts. This metadata enables retry logic and helps identify stuck jobs.

Sorted Set tracks when each job entered the processing state. The score is the claim timestamp, letting you efficiently query for jobs that have been processing longer than a timeout threshold. This powers the recovery process that rescues stuck jobs.

The basic pattern and why jobs get lost

The naive queue pattern pushes jobs to a list and pops them for processing. A producer calls LPUSH to add a job, and a worker calls BRPOP to wait for and retrieve a job. This is simple and fast, but the moment BRPOP returns, the job exists only in the worker's memory.

If the worker crashes, loses network connectivity, or gets killed before completing the job, that job disappears. There is no record that it ever existed. For jobs like sending emails, processing payments, or updating critical data, this is unacceptable.

# Producer adds a job
LPUSH jobs:pending '{"id":"job_123","task":"send_email"}'
> 1

# Worker fetches a job (blocks until one is available)
BRPOP jobs:pending 30
> ["jobs:pending", '{"id":"job_123","task":"send_email"}']

# If the worker crashes here, the job is gone forever
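The same naive flow in application code, sketched with a redis-py-style client (the client `r`, the function names, and `decode_responses=True` string handling are assumptions for illustration, not part of the original commands):

```python
import json

def produce(r, job):
    # Push a job onto the pending list (left side)
    r.lpush("jobs:pending", json.dumps(job))

def naive_consume(r, timeout=30):
    # Block up to `timeout` seconds waiting for a job
    item = r.brpop("jobs:pending", timeout=timeout)
    if item is None:
        return None  # queue stayed empty
    _list_name, payload = item  # BRPOP returns (list name, value)
    # If the worker dies past this point, the job exists only in memory
    return json.loads(payload)
```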

Claiming jobs atomically

The fix is to never let a job exist in only one place. Instead of popping a job and hoping you finish it, atomically move it from the pending queue to a processing list. The LMOVE command does this in one step: it removes an element from one list and pushes it to another atomically.

After LMOVE, the job is in the processing list. If the worker crashes, the job is still there. A separate recovery process can find it and move it back to pending. The job is always either pending, processing, or done. It never disappears.

Use BLMOVE for the blocking variant that waits when the pending queue is empty. This is more efficient than polling with LMOVE in a tight loop.

# Worker atomically moves job from pending to processing
BLMOVE jobs:pending jobs:processing RIGHT LEFT 30
> '{"id":"job_123","task":"send_email"}'

# Job is now in the processing list, not pending
LRANGE jobs:processing 0 -1
> ['{"id":"job_123","task":"send_email"}']

# Record when we started processing (for timeout detection)
ZADD jobs:processing:times 1699900060 "job_123"
> 1
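The claim step above can be wrapped in a single helper. A minimal sketch assuming a redis-py-style client `r` with string responses; the function name `claim_job` is invented for illustration:

```python
import json
import time

def claim_job(r, timeout=30):
    # Atomically move one job from pending to processing;
    # blocks up to `timeout` seconds while the queue is empty.
    payload = r.blmove("jobs:pending", "jobs:processing",
                       timeout, src="RIGHT", dest="LEFT")
    if payload is None:
        return None
    # Record the claim time so a recovery process can spot stuck jobs
    job_id = json.loads(payload)["id"]
    r.zadd("jobs:processing:times", {job_id: int(time.time())})
    return payload
```

If the worker dies anywhere after `blmove` returns, the payload is still in `jobs:processing`, which is the whole point of the pattern.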

Acknowledging completed jobs

When a worker finishes a job successfully, it must acknowledge completion by removing the job from the processing list. This is the ACK operation. Use LREM to remove the specific job from the processing list, and clean up the timestamp tracking.

After acknowledgment, the job is completely gone from the queue system. If you need to keep a record of completed jobs, add them to a separate completed list or log them to your database before acknowledging.

# Job completed successfully, remove from processing
LREM jobs:processing 1 '{"id":"job_123","task":"send_email"}'
> 1 (removed)

# Clean up the timestamp tracking
ZREM jobs:processing:times "job_123"
> 1
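In code, the ACK is two calls that mirror the commands above. A sketch assuming the same redis-py-style client `r`; the `ack` name is an assumption:

```python
import json

def ack(r, payload):
    # Job finished: remove it from the processing list and
    # stop tracking its claim time. After this, the job is gone.
    job_id = json.loads(payload)["id"]
    r.lrem("jobs:processing", 1, payload)
    r.zrem("jobs:processing:times", job_id)
```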

Returning failed jobs to the queue

When processing fails, you have choices: retry immediately, retry with a delay, or give up after too many attempts. The NACK operation removes the job from processing and either returns it to the pending queue for retry or moves it to a dead letter queue.

Track retry attempts in a hash. Each time a job fails, increment its attempt counter. If the counter exceeds your maximum, move the job to a dead letter list for manual inspection instead of retrying forever.

# Job failed, check how many attempts so far
HINCRBY jobs:attempts job_123 1
> 2 (second attempt)

# Under max retries? Return to pending queue
LREM jobs:processing 1 '{"id":"job_123","task":"send_email"}'
RPUSH jobs:pending '{"id":"job_123","task":"send_email"}'
ZREM jobs:processing:times "job_123"

# Over max retries? Move to dead letter queue
LREM jobs:processing 1 '{"id":"job_123","task":"send_email"}'
RPUSH jobs:dead '{"id":"job_123","task":"send_email"}'
ZREM jobs:processing:times "job_123"
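The retry-or-dead-letter decision above can live in one NACK helper. A sketch assuming a redis-py-style client `r`; the `nack` name and the `max_attempts` default are assumptions:

```python
import json

def nack(r, payload, max_attempts=3):
    # Job failed: bump its attempt counter, pull it out of
    # processing, then either retry it or dead-letter it.
    job_id = json.loads(payload)["id"]
    attempts = r.hincrby("jobs:attempts", job_id, 1)
    r.lrem("jobs:processing", 1, payload)
    r.zrem("jobs:processing:times", job_id)
    destination = "jobs:pending" if attempts < max_attempts else "jobs:dead"
    r.rpush(destination, payload)
    return destination
```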

Recovering jobs from crashed workers

Workers can crash without calling ACK or NACK. A background recovery process handles this by finding jobs that have been in the processing state too long. Query the sorted set for jobs with timestamps older than your timeout threshold, then move them back to the pending queue.

Run this recovery process periodically, perhaps every minute. It catches jobs abandoned by crashed workers and returns them to the queue for another attempt. The retry counter ensures jobs do not retry forever even if workers keep crashing.

# Find jobs claimed at least 5 minutes (300 seconds) ago
# Current time is 1699900360, so the cutoff score is 1699900060
ZRANGEBYSCORE jobs:processing:times -inf 1699900060
> ["job_123", "job_456"]

# For each stuck job, find it in the processing list and requeue
# This requires knowing the full job data, so store job_id -> job_data mapping
# or scan the processing list
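One way to close the gap the comments above describe is to scan the processing list and match job IDs parsed out of the payloads, which is reasonable while the processing list stays short. A sketch assuming a redis-py-style client `r`; the function name is an assumption:

```python
import json
import time

def recover_stuck_jobs(r, timeout_s=300):
    # Find job IDs claimed more than `timeout_s` seconds ago...
    cutoff = time.time() - timeout_s
    stuck = set(r.zrangebyscore("jobs:processing:times", "-inf", cutoff))
    requeued = 0
    # ...then scan processing to map those IDs back to full payloads
    for payload in r.lrange("jobs:processing", 0, -1):
        job_id = json.loads(payload)["id"]
        if job_id in stuck:
            r.lrem("jobs:processing", 1, payload)
            r.rpush("jobs:pending", payload)
            r.zrem("jobs:processing:times", job_id)
            requeued += 1
    return requeued
```

An alternative design stores a `job_id -> payload` hash at enqueue time so recovery never has to scan; the scan version trades that bookkeeping for simplicity.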

When to use this pattern

This reliable queue pattern works well for moderate workloads where you need job persistence without external dependencies. It handles worker crashes gracefully, supports retries with backoff, and provides visibility into stuck jobs. The implementation uses only basic Redis commands and is easy to understand and debug.

For high-throughput scenarios or when you need consumer groups, message IDs, and built-in acknowledgment tracking, consider Redis Streams. Streams provide these features natively with better performance for complex use cases. But for many applications, this simple list-based pattern provides enough reliability without the added complexity.