Using circuit breakers for webhook delivery

When a webhook endpoint goes down, naive retry logic keeps hammering it with requests. Each attempt fails, consumes resources, and delays other deliveries. The endpoint's owners might be trying to bring their server back up while your retries add load. Your queue backs up as failing requests block the workers.

Circuit breakers prevent this cascade. When an endpoint fails repeatedly, the circuit "opens" and stops sending requests. After a cooldown period, the circuit allows a test request through. If that succeeds, the circuit "closes" and normal delivery resumes. If it fails, the circuit stays open longer.

This pattern protects both sides: your infrastructure stops wasting resources on doomed requests, and the struggling endpoint gets breathing room to recover.

How circuit breakers work

A circuit breaker tracks the health of each endpoint and transitions between three states: closed, open, and half-open.

In the closed state, requests flow normally. The circuit breaker monitors outcomes, counting successes and failures. As long as failures stay below a threshold, everything proceeds as usual.

When failures exceed the threshold, the circuit opens. In the open state, requests fail immediately without attempting delivery. You might queue them for later or mark them for retry after the circuit closes. No actual HTTP requests go out, which is the whole point.

After a timeout, the circuit transitions to half-open. In this state, the breaker allows a limited number of requests through to test whether the endpoint has recovered. If these test requests succeed, the circuit closes and normal delivery resumes. If they fail, the circuit opens again with a potentially longer timeout.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"
        self.opened_at = None

    def record_success(self):
        # Any success heals the circuit and clears the failure streak.
        self.failure_count = 0
        self.state = "closed"

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.time()

    def allow_request(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            # Cooldown elapsed: let one request through to probe the endpoint.
            if time.time() - self.opened_at > self.reset_timeout:
                self.state = "half-open"
                return True
            return False
        if self.state == "half-open":
            return True  # Allow test request
        return False
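
Wiring the breaker into a delivery worker is straightforward. Here is a minimal sketch, assuming a requests-based sender; attempt_delivery and its return convention are illustrative additions, not part of the class above:

import requests

def attempt_delivery(breaker, url, payload):
    # Skip the HTTP call entirely while the circuit is open.
    if not breaker.allow_request():
        return False  # caller decides whether to requeue or fast-fail

    try:
        response = requests.post(url, json=payload, timeout=10)
    except requests.RequestException:
        breaker.record_failure()
        return False

    # Counting only 5xx and network errors against the circuit is a design
    # choice: a 4xx means the endpoint is up, even if this event was rejected.
    if response.status_code < 500:
        breaker.record_success()
        return response.status_code < 300
    breaker.record_failure()
    return False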

Choosing thresholds and timeouts

The failure threshold determines how many failures trigger the circuit to open. Too low, and brief hiccups cause unnecessary circuit opens. Too high, and you waste many requests before reacting to a real outage.

A threshold of five consecutive failures works well for most webhook systems. This tolerates occasional timeouts while catching actual outages quickly. You might use a higher threshold for endpoints with naturally variable latency or a lower one for endpoints that rarely fail.

The reset timeout determines how long the circuit stays open before testing recovery. Shorter timeouts mean faster recovery detection but more test traffic to struggling endpoints. Longer timeouts give endpoints more time to recover but delay resumption of normal delivery.

Start with 30 to 60 seconds for the initial timeout. Consider exponential backoff for repeated circuit openings: if the endpoint fails again after half-open testing, wait longer before the next test. This prevents aggressive probing of endpoints that need extended recovery time.
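
One way to implement that backoff is to derive the cooldown from how many times the circuit has opened. A minimal sketch, assuming you track an open_count alongside the breaker (that counter is an addition, not a field of the class above):

def next_reset_timeout(base_timeout, open_count, cap=3600):
    # Double the cooldown on each consecutive opening, capped at one hour.
    return min(base_timeout * (2 ** open_count), cap)

With a 60-second base, the circuit waits 60, 120, 240 seconds and so on between probes, which keeps test traffic away from an endpoint that clearly needs more time.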

Circuit breakers per endpoint

Each customer endpoint needs its own circuit breaker. If one customer's server goes down, you should not stop delivering to other customers. Isolate failures so they affect only the failing endpoint.

This means maintaining state for potentially thousands of endpoints. In-memory circuit breakers work for moderate scale but lose state on restarts. For larger systems, store circuit state in Redis or a database. Keep the state simple: just the failure count, current state, and timestamp.

def get_circuit_breaker(endpoint_id):
    # Assumes a redis-py client created with decode_responses=True,
    # so hash fields come back as strings rather than bytes.
    state = redis.hgetall(f"circuit:{endpoint_id}")
    if not state:
        return CircuitBreaker()  # New endpoint, start fresh

    breaker = CircuitBreaker()
    breaker.failure_count = int(state.get("failures", 0))
    breaker.state = state.get("state", "closed")
    breaker.opened_at = float(state.get("opened_at", 0))
    return breaker
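
The write path is the mirror image: persist the breaker's fields back to the same hash after each recorded outcome. A sketch under the same assumptions (redis-py client, key layout as above; the 24-hour expiry is an arbitrary choice):

def save_circuit_breaker(endpoint_id, breaker):
    key = f"circuit:{endpoint_id}"
    redis.hset(key, mapping={
        "failures": breaker.failure_count,
        "state": breaker.state,
        "opened_at": breaker.opened_at or 0,
    })
    # Let state for endpoints that stop receiving traffic age out on its own.
    redis.expire(key, 86400)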

Consider grouping related endpoints. If a customer has multiple webhook URLs that all point to the same server, opening one circuit might justify opening the others. This requires knowledge of endpoint relationships, which you may or may not have.

What to do when the circuit is open

When a circuit opens, you have several options for pending webhooks.

Queue for retry: Keep webhooks in your normal queue with a future delivery time. When the circuit closes, they will be delivered in order. This preserves events but can create backlogs if the endpoint is down for extended periods.

Fast-fail and rely on polling: Mark webhooks as undeliverable and let consumers catch up through API polling. This keeps your system responsive but relies on consumers having a reconciliation process.

Buffer separately: Move queued webhooks for that endpoint to a separate buffer. When the circuit closes, process the buffer first. This isolates backlog from your main queue.

Most systems use queuing with a maximum retry window. If the circuit stays open beyond a time limit, webhooks eventually fail permanently and move to the dead letter queue.
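
A rough sketch of that policy, assuming the breaker from earlier and a queue object with requeue_with_delay and dead_letter methods (both hypothetical names):

import time

MAX_RETRY_WINDOW = 6 * 3600  # give up after six hours of open circuit

def handle_open_circuit(webhook, breaker, queue):
    open_for = time.time() - (breaker.opened_at or time.time())
    if open_for > MAX_RETRY_WINDOW:
        queue.dead_letter(webhook)  # permanent failure, surfaced to the customer
    else:
        # Try again roughly when the circuit is due to go half-open.
        queue.requeue_with_delay(webhook, delay=breaker.reset_timeout)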

Observability and alerting

Circuit breaker state is valuable operational data. Track how often circuits open, how long they stay open, and which endpoints are most problematic.

Alert when circuits open for important endpoints. A customer's production webhook going down might warrant proactive outreach. Unusual patterns like many circuits opening simultaneously could indicate a problem on your side rather than theirs.

Expose circuit state through your API or dashboard. Customers troubleshooting failed webhooks should be able to see if their endpoint's circuit is open. This saves support time and helps them understand why deliveries paused.
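
As a sketch, a read-only status route (Flask chosen here for illustration; the path and field names are assumptions) could reuse the Redis loader from earlier:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v1/endpoints/<endpoint_id>/circuit")
def circuit_status(endpoint_id):
    breaker = get_circuit_breaker(endpoint_id)  # Redis-backed loader from above
    return jsonify({
        "state": breaker.state,
        "failure_count": breaker.failure_count,
        "opened_at": breaker.opened_at,
    })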

Log circuit state transitions with enough context to debug later. Include the endpoint, the failure count, the triggering error, and timestamps. These logs help post-incident analysis and capacity planning.
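
A minimal transition log using the standard logging module could look like this; the field layout is illustrative, and the natural call sites are the state changes inside record_failure and allow_request:

import logging

logger = logging.getLogger("webhooks.circuit")

def log_transition(endpoint_id, old_state, new_state, failure_count, error=None):
    # One structured line per transition: enough to reconstruct an incident timeline.
    logger.warning(
        "circuit transition endpoint=%s %s->%s failures=%d error=%s",
        endpoint_id, old_state, new_state, failure_count, error,
    )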