Implementing dead letter queues for failed webhooks

Webhooks fail. Endpoints go down, networks partition, and servers crash. A good retry strategy handles temporary failures, but some webhooks never succeed no matter how many times you retry. Without a plan for these permanent failures, events disappear silently and data gets lost.

Dead letter queues solve this problem. When a webhook exhausts all retry attempts, instead of discarding it, you move it to a separate queue for failed messages. These messages wait there until someone investigates, fixes the underlying problem, and either replays them or acknowledges the failure.

This article covers when to use dead letter queues, how to implement them, and what to do with the messages that end up there.

When webhooks should go to the dead letter queue

The goal of a DLQ is to capture webhooks that cannot be delivered through the normal retry process. This happens in several scenarios.

Retry exhaustion is the most common case. You have tried delivering five or ten times over several days, and the endpoint still fails. Maybe it returns 500 errors, maybe it times out, maybe it refuses connections entirely. At some point, you stop trying and move the webhook to the DLQ.

Invalid endpoints also belong in the DLQ. If an endpoint returns 404 or 410 (Gone), retrying will not help. The URL is wrong or the resource was deleted. Similarly, authentication failures like 401 or 403 suggest the endpoint exists but rejects your requests, which retries will not fix.

Malformed responses can indicate problems worth capturing. If an endpoint returns 200 but with a body that suggests something went wrong, you might want to flag it for investigation rather than marking it successful.

Some failures should not go to the DLQ. A brief network timeout during a deployment will likely succeed on retry. A 429 rate limit response means you should slow down, not give up. Reserve the DLQ for failures that human intervention might resolve.
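These rules can be sketched as a small classification function. The function name, the string outcomes, and the five-attempt cutoff are illustrative choices, not a standard:

```python
RETRY = "retry"              # transient failure: try again later
SLOW_DOWN = "slow_down"      # rate limited: retry with a longer delay
DEAD_LETTER = "dead_letter"  # give up and capture for investigation

def classify_failure(status_code, attempt_number, max_attempts=5):
    """Decide what to do with a failed delivery attempt.

    status_code is None when there was no HTTP response at all
    (timeout, connection refused).
    """
    if attempt_number >= max_attempts:
        return DEAD_LETTER   # retry exhaustion
    if status_code in (404, 410):
        return DEAD_LETTER   # invalid URL or deleted resource
    if status_code in (401, 403):
        return DEAD_LETTER   # auth failure; retries will not fix it
    if status_code == 429:
        return SLOW_DOWN     # back off, do not give up
    return RETRY             # timeouts, 5xx, connection errors
```

Note that 429 is the one failure mode where neither retrying immediately nor dead-lettering is right: the endpoint is healthy but asking you to slow down.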

Structuring your dead letter queue

The DLQ stores everything needed to understand and potentially replay failed webhooks. At minimum, this includes the original payload, the destination URL, the timestamp of the original event, and a record of delivery attempts.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class DeadLetteredWebhook:
    id: str                            # Unique identifier
    original_event_id: str             # The event this webhook represents
    endpoint_url: str                  # Where we tried to deliver
    payload: bytes                     # The exact bytes we sent
    headers: dict[str, str]            # Headers including signature
    created_at: datetime               # When the event occurred
    dead_lettered_at: datetime         # When we gave up
    attempts: list["DeliveryAttempt"]  # History of all attempts
    failure_reason: str                # Why we stopped trying

Store the complete delivery history. For each attempt, record the timestamp, response status code, response body (truncated if large), and latency. This history helps diagnose whether the endpoint was completely unreachable, returning errors, or timing out.

Include the failure reason in a human-readable form. "Exhausted 5 retry attempts" is more useful than a status code when someone is triaging the DLQ.

Surfacing dead lettered webhooks

A DLQ nobody monitors is useless. Build visibility into your system so operators and customers know when webhooks fail permanently.

For webhook providers, create a dashboard showing DLQ contents grouped by customer and endpoint. Alert on sudden spikes in dead lettered webhooks, which might indicate a customer outage or a bug in your system. Provide API access so customers can query their own failed webhooks.

For webhook consumers, monitor your own failure patterns. If you are receiving webhooks and processing fails, dead letter the message internally so you can investigate. Log enough context to debug later.

Set up alerts for DLQ growth. A few dead letters are normal. A sudden increase suggests a systemic problem: an endpoint configuration change, an expired certificate, or a bug in webhook handling code.
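A spike check over hourly dead-letter counts might look like the sketch below. The window size, spike factor, and absolute floor are assumptions to tune for your traffic:

```python
def dlq_growth_alert(hourly_counts, baseline_window=24, spike_factor=3.0):
    """Return True if the latest hourly dead-letter count spikes above baseline.

    hourly_counts: counts of newly dead-lettered webhooks per hour,
    oldest first; the last entry is the current hour.
    """
    if len(hourly_counts) <= baseline_window:
        return False  # not enough history to judge
    baseline = hourly_counts[-baseline_window - 1:-1]
    avg = sum(baseline) / len(baseline)
    # A few dead letters are normal; alert only on a clear multiple
    # of the recent baseline, with an absolute floor to avoid noise.
    return hourly_counts[-1] > max(avg * spike_factor, 10)
```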

Replaying dead lettered webhooks

Once someone fixes the underlying problem, you need a way to replay failed webhooks. This is trickier than it sounds.

The simplest approach is to resend the original payload. Move the webhook back to the main delivery queue and let normal processing take over. This works for idempotent events but can cause problems if the consumer is not prepared for old events arriving after newer ones.

Consider adding a header indicating this is a replay. Something like X-Replay: true or X-Original-Timestamp: 2024-01-15T10:30:00Z lets consumers handle replayed events specially if needed.

For events where replay ordering matters, provide a bulk replay option that sends events in chronological order. If a customer missed events 5 through 50, they probably want to process them in sequence rather than random order.
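Chronological ordering and replay headers can be combined in one preparation step. This sketch assumes the `DeadLetteredWebhook` fields above; `build_replay_batch` is a hypothetical helper, and the header names are illustrative:

```python
def build_replay_batch(webhooks):
    """Order dead-lettered webhooks chronologically and stamp replay headers.

    Returns (url, payload, headers) tuples ready to hand to an HTTP client.
    """
    batch = []
    for wh in sorted(webhooks, key=lambda w: w.created_at):
        headers = dict(wh.headers)  # copy; keep the original signature intact
        headers["X-Replay"] = "true"
        headers["X-Original-Timestamp"] = wh.created_at.isoformat()
        batch.append((wh.endpoint_url, wh.payload, headers))
    return batch
```

Resending the exact original payload and signature headers means consumers can verify replayed webhooks the same way they verify live ones.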

Not all dead letters should be replayed. If the original event is stale or superseded by newer data, replaying might cause confusion. Provide an option to acknowledge and discard without replaying.

Retention and cleanup

Dead letter queues grow indefinitely without cleanup. Set a retention policy based on your SLA and storage constraints.

A typical policy might keep dead letters for 30 days, giving ample time for investigation and replay. After 30 days, archive to cold storage or delete entirely. Notify customers before deletion so they have a chance to act.

Consider tiered retention. Keep full payloads for 7 days, then keep only metadata for another 23 days. This reduces storage costs while preserving the ability to investigate older failures.
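A tiered retention pass could decide the fate of each entry like this. The thresholds match the example above but are illustrative; `apply_retention` is a hypothetical name:

```python
from datetime import datetime, timedelta, timezone

FULL_RETENTION = timedelta(days=7)   # keep complete payloads
META_RETENTION = timedelta(days=30)  # keep metadata only, then delete

def apply_retention(webhook, now=None):
    """Return 'keep', 'strip_payload', or 'delete' for one DLQ entry."""
    now = now or datetime.now(timezone.utc)
    age = now - webhook.dead_lettered_at
    if age > META_RETENTION:
        return "delete"         # or archive to cold storage first
    if age > FULL_RETENTION:
        return "strip_payload"  # metadata survives for investigation
    return "keep"
```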

For high-volume systems, monitor DLQ size as a capacity metric. A runaway producer or a persistently broken endpoint can fill your DLQ faster than you expect.

Dead letters on the consumer side

If you are receiving webhooks, implement your own DLQ for events you cannot process. When your handler throws an exception or validation fails, do not just log and discard. Store the event for later investigation.

Your consumer-side DLQ captures different failures than the provider's DLQ. The provider only knows about delivery failures. You know about processing failures: a malformed payload your code did not expect, a database error mid-processing, or a bug in your event handler.

Structure consumer DLQ entries with the raw payload, the error message or stack trace, and the timestamp. Build tooling to replay events once you fix the bug. This turns mysterious data inconsistencies into debuggable, recoverable problems.
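A minimal consumer-side wrapper, assuming `process` is your event handler and `dlq` is any store with an `append` method (both hypothetical names), might look like:

```python
import traceback
from datetime import datetime, timezone

def handle_webhook(raw_payload, process, dlq):
    """Process an incoming webhook; dead-letter it instead of discarding on failure.

    Returns True on success, False if the event was dead-lettered.
    """
    try:
        process(raw_payload)
        return True
    except Exception:
        dlq.append({
            "payload": raw_payload,           # raw bytes for later replay
            "error": traceback.format_exc(),  # stack trace for debugging
            "received_at": datetime.now(timezone.utc).isoformat(),
        })
        return False
```

Once the bug in `process` is fixed, replaying is just feeding each stored payload back through the handler.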