- Tom Hacohen
GitHub was down again today. This follows multiple such incidents in the last few weeks alone. From the short update they posted a few days ago, it looks like their webhooks system is at least partially to blame.
Downtime can happen to anyone, especially at GitHub's scale. We send our greatest sympathies to the GitHub team, both as fellow developers and as (otherwise) happy customers, and hope they find a lasting solution for this.
What is going on?
GitHub has been down multiple times in the last few weeks. Each incident was a bit different in terms of affected systems, but all (most?) of GitHub's systems have taken turns being down, including GitHub Actions, API requests, Codespaces, Git operations, issues, and webhooks. As of this moment, they have yet to share an official post-mortem. However, from what they have shared so far, it looks like they are suffering from high load due to some usage patterns. Their current mitigation is to throttle their webhooks, which suggests that their webhook system is putting a significant load on the rest of their infrastructure.
Webhooks are used extensively by GitHub customers. The most common use case is PR bots and similar integrations: external checks that run when a PR is updated, deployments to preview environments (e.g. with Vercel), chat integrations (e.g. Slack), and syncing with ticketing systems (e.g. Jira, Linear, or Kitemaker), among others.
With this in mind, it's no surprise that webhooks are causing significant load. They cause load because everyone uses them. GitHub customers have built workflows on top of GitHub and use GitHub to drive other processes, which is exactly what you aim for when building a service.
Lessons learned for other webhook systems
The above issue is made even worse because GitHub's webhooks system is fairly primitive. They provide no visibility into the webhooks being sent (or failed!), and they don't do retries. This means that a missed webhook with GitHub is a webhook lost forever.
If GitHub is facing such reliability issues and load with their primitive webhooks implementation, what will happen with more full-featured systems that also implement retries and observability?
This is definitely a concern and something you should keep in mind when building your webhook system. The solution, though, is not to make your webhooks worse (see the mayhem the lack of retries causes with GitHub), but rather to make them more resilient.
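To make the retry point concrete, here is a minimal sketch of delivery with exponential backoff. It is an illustration under stated assumptions, not how GitHub or any particular service does it: the names `backoff_schedule` and `deliver_with_retries`, the 5-second base delay, and the one-hour cap are all hypothetical choices.

```python
import time
from typing import Callable, List, Sequence


def backoff_schedule(retries: int = 5, base_seconds: float = 5.0,
                     cap_seconds: float = 3600.0) -> List[float]:
    """Return the wait before each retry: base, 2x base, 4x base, ... up to a cap.

    The defaults are illustrative assumptions, not values from any real system.
    """
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(retries)]


def deliver_with_retries(send: Callable[[], bool], delays: Sequence[float],
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try `send` once, then once more after each delay, until it succeeds.

    `send` is a hypothetical callable that returns True when the receiving
    endpoint accepted the webhook. Returns False if every attempt failed.
    """
    if send():
        return True
    for delay in delays:
        sleep(delay)  # wait longer each time so a struggling endpoint can recover
        if send():
            return True
    return False
```

In a real system the schedule would more likely be persisted (for example as delayed jobs in a queue) rather than slept through in-process, so that pending retries survive a restart. The growing delays are what keep retries from piling extra load onto a receiver that is already struggling.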
I'm not usually an advocate for microservices (not against them either!), but sometimes it makes sense to keep a logically separate system separate. Webhooks are essentially a notification dispatcher, so they can definitely live on their own. This way they can scale independently of the main system, and they won't bring the whole thing down when they come under significant load.
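As a rough sketch of that separation, the main application can hand events to a bounded queue and move on, while a separate worker does the actual sending. Everything here is a hypothetical stand-in: `WebhookDispatcher` is an invented name, and the in-process thread and `queue.Queue` play the role that a separate service and a message broker would play in production.

```python
import queue
import threading
from typing import Callable, Dict

Event = Dict[str, str]


class WebhookDispatcher:
    """Hypothetical dispatcher that keeps webhook sending off the main path."""

    def __init__(self, deliver: Callable[[Event], None],
                 max_pending: int = 1000) -> None:
        self._deliver = deliver
        # Bounded backlog: a flood of webhooks cannot grow memory without limit.
        self._queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, event: Event) -> bool:
        """Called by the main system. Never blocks; False means the backlog is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            return False

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            if event is None:  # sentinel pushed by close()
                break
            try:
                self._deliver(event)
            except Exception:
                # A failed delivery must not take the worker down.
                # A real system would record the failure and schedule a retry.
                pass

    def close(self) -> None:
        """Drain everything already queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

The key design choice is that `enqueue` returns immediately even when the backlog is full, so a slow or failing webhook path degrades webhook delivery only and never blocks the request that triggered the event.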
My experience with webhooks is what led me to start the Svix webhooks service (and open source project). I've been exposed to the issues some of the world's top companies experienced with webhooks, and figured we could just make it into a robust service, the same way SendGrid does for email and Twilio does for SMS.
Whether you're building your own in-house webhooks service, or using Svix, make sure that your webhooks are secure, robust, and implement retries. Webhooks are too important to be anything less than great.