Webhook retry strategies

Webhooks fail. Networks are unreliable, servers go down, deployments cause brief outages. A webhook system without retries would lose events constantly. But retries done wrong can overwhelm recovering servers, waste resources on permanently broken endpoints, and create confusing duplicate deliveries.

This article covers the strategies for retrying failed webhooks effectively: exponential backoff to avoid hammering struggling servers, jitter to prevent thundering herds, and knowing when to stop retrying entirely.

Why naive retries cause problems

The simplest retry approach is to immediately try again when a delivery fails. This is almost always wrong. If a server returned an error because it is overloaded, immediately retrying adds more load. If a network blip caused a timeout, the request might still be in flight, and retrying creates duplicates. If a deployment is rolling out, the server might be down for 30 seconds, and rapid retries just pile up failed attempts.

Immediate retries also create feedback loops. Imagine a server that starts failing under load. Immediate retries double the incoming traffic. More requests fail, causing more retries, causing more failures. The server never recovers because the retry storm keeps it overwhelmed.

Exponential backoff spreads out retries

Exponential backoff solves these problems by waiting progressively longer between retry attempts. The first retry might happen after 30 seconds, the second after 2 minutes, the third after 8 minutes, and so on. Each retry waits a fixed multiple of the previous delay; the schedule below multiplies it by four.

This gives transient failures time to resolve. A brief network hiccup clears up in seconds. A deployment completes in a minute or two. A rate limit window resets. By the time your retry arrives, the problem has likely fixed itself.

A typical exponential backoff schedule looks like this:

Attempt 1: immediate (initial delivery)
Attempt 2: 30 seconds after failure
Attempt 3: 2 minutes after attempt 2
Attempt 4: 8 minutes after attempt 3
Attempt 5: 32 minutes after attempt 4
Attempt 6: 2 hours after attempt 5
Attempt 7: 8 hours after attempt 6

The formula is usually base_delay * (multiplier ^ attempt_number); the schedule above uses a 30-second base and a multiplier of four, with a cap to prevent absurdly long waits. You might cap the maximum delay at 8 hours even when the raw formula would suggest a day or more.

Adding jitter prevents thundering herds

Pure exponential backoff has a subtle problem. If a server goes down and 10,000 webhooks fail at the same moment, they all schedule their first retry for exactly 30 seconds later. When that retry happens, 10,000 requests hit the server simultaneously. Even if the server has recovered, this coordinated spike might knock it down again.

Jitter solves this by adding randomness to the retry delay. Instead of waiting exactly 30 seconds, you wait somewhere between 15 and 45 seconds. The 10,000 retries spread out over a 30-second window instead of arriving simultaneously.

There are two common approaches to jitter. Full jitter picks a random delay between zero and the calculated backoff time. Decorrelated jitter uses the previous delay to calculate a range for the next one, creating more variation across attempts.
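
Both can be expressed in a few lines of Python. This sketch follows the well-known formulas from the AWS Architecture Blog post on backoff and jitter, adapted to the 30-second base and 4x multiplier from the schedule above; the function names are illustrative:

import random

def full_jitter(attempt, base=30, cap=28800):
    # Pick uniformly between zero and the full exponential backoff value.
    return random.uniform(0, min(cap, base * (4 ** attempt)))

def decorrelated_jitter(previous_delay, base=30, cap=28800):
    # The range for the next delay depends on the previous actual delay,
    # not the attempt count, which spreads retries out even further.
    return min(cap, random.uniform(base, previous_delay * 3))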

A third, even simpler scheme multiplies the backoff by a random factor:

import random

def calculate_delay(attempt, base_delay=30, multiplier=4, max_delay=28800):
    """Seconds to wait before retry number `attempt` (0-indexed)."""
    exponential = base_delay * (multiplier ** attempt)
    with_jitter = exponential * (0.5 + random.random())  # 0.5x to 1.5x
    return min(with_jitter, max_delay)
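
Calling it for successive attempts produces a schedule close to the table above, with each delay randomly stretched or shrunk:

# Sample the schedule; output varies between runs because of the jitter.
for n in range(6):
    print(f"retry {n + 1}: wait about {calculate_delay(n) / 60:.1f} minutes")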

The combination of exponential backoff and jitter gives you the best of both worlds: spreading out individual endpoint retries over time while also spreading out retries across different endpoints.

Knowing when to stop

Retrying forever wastes resources and can mask configuration problems. If an endpoint has been returning 404 for three days, something is fundamentally wrong, and continued retries accomplish nothing.

Most webhook systems set a maximum retry window, typically 24 to 72 hours. After exhausting all attempts within this window, the delivery is marked as permanently failed. Some systems send a notification to the webhook owner so they can investigate and potentially replay the failed events manually.

The maximum attempts and time window should balance reliability against practicality. Too few retries and you lose events to brief outages. Too many and you waste resources on dead endpoints while the failed event queue grows unbounded.

A reasonable default might be 6 to 8 attempts over 24 to 48 hours. This handles most transient failures while giving up on endpoints that are clearly broken.
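
As a sketch of how the give-up logic fits together, reusing the calculate_delay helper from earlier: the delivery record's shape, the attempt limit, and the notification hook are all illustrative.

import time

MAX_ATTEMPTS = 8  # illustrative; pair with a 24-48 hour window

def notify_owner(delivery):
    # Stub: in practice, alert the webhook owner so they can replay events.
    print(f"delivery {delivery['id']} permanently failed")

def handle_failure(delivery):
    # Called after a delivery attempt fails with a retryable error.
    delivery["attempts"] += 1
    if delivery["attempts"] >= MAX_ATTEMPTS:
        delivery["status"] = "permanently_failed"
        notify_owner(delivery)
        return None
    # Schedule the next attempt using the backoff-with-jitter helper.
    delivery["next_attempt_at"] = time.time() + calculate_delay(delivery["attempts"] - 1)
    return delivery["next_attempt_at"]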

Distinguishing retryable from non-retryable failures

Not all failures deserve retries. A 500 Internal Server Error suggests a temporary server problem and is worth retrying. A 404 Not Found means the endpoint does not exist and retrying is pointless. A 401 Unauthorized indicates a credential problem that retrying will not fix.

Smart retry logic categorizes responses:

Retry these: 408 Request Timeout, 429 Too Many Requests, 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout, and network errors like connection refused or DNS failure.

Do not retry these: 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 405 Method Not Allowed. These indicate problems with the request itself or the endpoint configuration, not transient failures.

For 429 Too Many Requests, check if the response includes a Retry-After header. If it does, respect that delay instead of your normal backoff schedule.
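
In code, this categorization plus the Retry-After handling can be one small function, again reusing calculate_delay from the earlier example. A sketch; the status sets mirror the lists above, and the parsing assumes Retry-After carries a number of seconds (the header may also be an HTTP date, which a fuller implementation would parse):

RETRYABLE = {408, 429, 500, 502, 503, 504}
NON_RETRYABLE = {400, 401, 403, 404, 405}

def next_retry_delay(status, headers, attempt):
    """Return seconds to wait before retrying, or None to give up."""
    if status in NON_RETRYABLE:
        return None
    if status == 429 and "Retry-After" in headers:
        try:
            return float(headers["Retry-After"])  # honor the server's delay
        except ValueError:
            pass  # HTTP-date form; fall back to the normal schedule
    # Retryable statuses and network errors use the backoff schedule.
    return calculate_delay(attempt)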

What receivers should know

If you are receiving webhooks, understand that retries are coming. Build your endpoint to be idempotent using the event ID, so processing the same webhook twice produces the same result. Respond quickly with a 200 OK before doing heavy processing, so you do not trigger unnecessary retries due to timeouts.

If you need the sender to back off because you are overwhelmed, return 429 with a Retry-After header. This is more graceful than returning 500 errors or timing out, and good webhook senders will respect it.
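
A minimal receiver sketch, using Flask for illustration and assuming each event carries a unique id field; the in-memory set and the overload threshold stand in for real persistence and load checks:

from queue import Queue
from flask import Flask, request

app = Flask(__name__)
seen_event_ids = set()  # use a durable store in production
work_queue = Queue()

@app.post("/webhooks")
def receive_webhook():
    # Overloaded? Ask the sender to back off instead of erroring out.
    if work_queue.qsize() > 1000:
        return "", 429, {"Retry-After": "120"}
    event = request.get_json()
    # Idempotency: a retried delivery reuses the same event ID.
    if event["id"] in seen_event_ids:
        return "", 200
    seen_event_ids.add(event["id"])
    work_queue.put(event)  # heavy processing happens off the request path
    # Acknowledge fast so slow processing never triggers timeout retries.
    return "", 200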

Choosing your retry strategy

For most webhook systems, a sensible default is exponential backoff starting at 30 seconds, multiplying the delay by roughly four each time up to a cap of 8 hours, with jitter applied, over a maximum of 6 to 8 attempts spanning 24 to 48 hours. Distinguish retryable from non-retryable status codes to avoid wasting attempts on permanent failures.

This strategy handles the common cases well: brief outages, rate limiting, network hiccups, and deployment windows. It gives struggling servers room to recover while eventually giving up on endpoints that are genuinely broken.

If you are building a webhook sender for customers, consider letting them customize the retry policy. Some customers have endpoints that recover quickly and want aggressive retries. Others have systems that need more breathing room. Allowing configuration of the retry count, backoff multiplier, or maximum retry window accommodates different operational needs without requiring you to pick a one-size-fits-all default.
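
One way to expose that configuration is a per-endpoint policy object. This sketch uses illustrative field names and the same jitter scheme as the earlier helper:

import random
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    max_attempts: int = 8
    base_delay: float = 30.0         # seconds before the first retry
    multiplier: float = 4.0          # growth factor between attempts
    max_delay: float = 8 * 3600.0    # cap on any single delay
    max_window: float = 48 * 3600.0  # give up after this much total time

    def delay_for(self, attempt: int) -> float:
        raw = self.base_delay * (self.multiplier ** attempt)
        return min(raw * (0.5 + random.random()), self.max_delay)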