- Tom Hacohen
We suffered a partial outage on Saturday the 11th of March that caused most API calls in the US region to return 5xx errors (other regions were not affected). Before I go into greater detail on what went wrong and what we are going to do next, I want to apologize to all affected customers. We know you all rely on us for your webhooks, and we take this responsibility very seriously.
While we have had occasional API errors in the past, they were always very rare and only affected a small fraction of our customers and requests. This is our first outage since we started Svix, and while we are aware that outages can happen to anyone, we work really hard to make sure that they never happen to us.
The issue was that our containers kept getting killed due to OOM (out of memory) errors, even though the maximum memory utilization was never above 40%. What's even weirder is that our memory utilization is normally at 10-15%, so 40% was already well outside the ordinary.
After releasing a mitigation, we spent a few hours investigating with AWS, and neither they nor we were able to find the reason for the sudden jump in memory usage, or any indication as to why the services would get killed for OOM. AWS was unable to provide us with additional information on the affected hosts.
Another interesting fact is that even though our services were being killed for OOM, there weren't any auto-scaling events (to increase the number of nodes), which we would have expected if memory usage were high. This increases our suspicion that there's more to this story.
What can we rule out?
It was not due to a code change. Or rather, if it was due to a code change, it wasn't something that manifested immediately. We didn't deploy any code changes that day. It was a Saturday, and we don't deploy code on the weekend unless it's an urgent fix. There are a few reasons for that, but one of the most important ones is: if there's ever a customer-facing issue, we want to make sure our customers can discover it during working hours, and not get paged on their day off. We are here to make our customers' lives easier, not harder, and this is an important part of it.
There was no unusual traffic. We had traffic spikes before and after the issue, but nothing out of the ordinary, and much smaller than what we usually deal with. Our initial thought was that it might be some malicious activity (e.g. a DDoS), but we detected no such anomalies.
What do we think happened?
We are still investigating. Our working theory is that there may be an edge case in one of the libraries we use, e.g. serde (Rust de/serialization library), where a specific set of inputs may cause sudden and very large memory usage spikes as well as memory fragmentation.
It feels like a bit of a stretch, and the big question is: why hasn't it happened before? That said, we have since managed to reproduce the memory spikes locally, and we also managed to tame them by switching to a more fragmentation-resistant allocator (jemalloc), so we believe this theory is plausible. Though again: why has this never happened before? We plan on finding out.
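To make the "reproduce locally and compare allocators" step a bit more concrete, here is a minimal sketch of how heap usage can be observed from inside a Rust process. It wraps the system allocator and records the peak number of live bytes while a workload runs. The names TrackingAllocator and peak_growth_during are hypothetical and this is not our actual test harness. It also only counts bytes requested by the program, so it shows spikes but not fragmentation, which has to be observed at the process level (e.g. RSS).

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper around the system allocator that records
/// how many heap bytes are live and the highest value seen so far.
pub struct TrackingAllocator;

static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);
static PEAK_BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let live = LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK_BYTES.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// The global allocator is chosen in exactly one place. Swapping in jemalloc
// would mean pointing this static at an allocator type from a jemalloc
// binding crate instead (which crate to use is an assumption, not stated here).
#[global_allocator]
static GLOBAL: TrackingAllocator = TrackingAllocator;

/// Heap bytes currently allocated through the tracking allocator.
pub fn live_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}

/// Runs `workload` and returns how far the peak rose above the
/// number of live bytes at the moment the workload started.
pub fn peak_growth_during<F: FnOnce()>(workload: F) -> usize {
    let baseline = live_bytes();
    PEAK_BYTES.store(baseline, Ordering::Relaxed);
    workload();
    PEAK_BYTES.load(Ordering::Relaxed).saturating_sub(baseline)
}
```

Passing a closure that deserializes a suspicious payload to peak_growth_during gives a rough number for how large a transient spike a single input can cause, which can then be compared across inputs.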
The incident itself
At roughly 19:25 UTC on March 11th we were alerted that most API calls in the US region had started returning 5xx errors, and that our containers there were repeatedly getting killed. This was only happening in the US region; the EU region and our private regions were fine. The engineer on call and another engineer started debugging the issue the moment the alerting started.
It wasn't clear why the containers were getting killed. The AWS dashboard said they were being killed because of OOM. However, even though the memory usage was high (stable at 40%, when it's normally 10-15%), it didn't look like a memory leak or anything of that sort, as the memory usage wasn't growing. In fact, we were also not getting alerts about excessive memory usage, as it was below our alerting threshold.
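One way a "stable at 40%" graph and repeated OOM kills can both be true is if the metric is an average over a reporting window while the spikes are much shorter than that window. We don't know that this is what happened. The sketch below only illustrates the arithmetic, using a made-up trace of 60 one-second samples with a single burst that reaches the limit. All names and numbers in it are hypothetical.

```rust
/// Averages fine-grained samples into fixed-size reporting windows,
/// the way a coarse dashboard metric might (an assumption for illustration).
pub fn window_averages(samples: &[f64], window: usize) -> Vec<f64> {
    assert!(window > 0, "window must be non-zero");
    samples
        .chunks(window)
        .map(|chunk| chunk.iter().sum::<f64>() / chunk.len() as f64)
        .collect()
}

/// Largest single sample in the trace (f64::MIN for an empty trace).
pub fn peak(samples: &[f64]) -> f64 {
    samples.iter().copied().fold(f64::MIN, f64::max)
}

/// Made-up trace: memory utilization in percent, sampled once per second
/// for a minute, sitting at 38% except for one burst that hits the limit.
pub fn hypothetical_trace() -> Vec<f64> {
    let mut samples = vec![38.0; 60];
    samples[30] = 100.0; // hypothetical one-second spike
    samples
}
```

With this trace, the single one-minute average comes out at roughly 39%, below a 40% line on a graph, while the peak is 100%, which is the value that matters for an OOM kill.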
A week prior we had introduced a change that increased the amount of caching we do in memory. We suspected that maybe the cache was growing out of control, even though it was limited in size and scope (only used sparsely) and had been behaving well throughout the week. We tried reverting it, but to no avail; the issue was still present.
One way to solve an OOM issue is to just give the containers more memory, but even after increasing the memory capacity 8x, we were still getting these OOM errors. So if it really was OOM, memory usage was growing fast, and in an almost unbounded manner.
It wasn't a case of overwhelming a partially initialized small cluster either. We verified that by blocking all the traffic to the services, letting them start and become healthy, and only then letting the traffic flow back there. Even with all of that, they were still dying. This was even after increasing the number of services running 4x, and giving each 8x the amount of RAM.
Luckily, one of the engineers came up with a quick and clever mitigation. In addition to our normal API services, we are also able to serve the API from AWS Lambda. While Lambda may also suffer from the same issue, the advantage (and disadvantage) of Lambda is that each Lambda just handles one HTTP request. This means that even if memory gets exceedingly fragmented, it will probably still manage to handle at least one request. We deployed this fix at roughly 20:30 UTC, which fixed the issue and brought everything back up.
We have since switched back from Lambda to our normal ECS tasks with much more memory attached to them. We are closely monitoring memory usage, which is back to normal levels, and are also continuing our investigation.
As mentioned above, we are still investigating, though we are now able to reproduce the memory spikes in a controlled (local) environment, and have already started testing jemalloc as an alternative allocator.
Another lesson learned from this incident is that we just can't trust AWS's metrics. They are too much of a black box, which impaired our ability to investigate and resolve this sooner. This is actually a lesson we learned a while back, and most of our observability doesn't rely on AWS anymore, though there are still a few missing spots (like ECS tasks) where we need to add our own metrics gathering.
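As a rough idea of what gathering memory metrics from inside a task could look like, here is a minimal sketch that reads the Linux cgroup v2 memory files. It assumes the container exposes a cgroup v2 hierarchy (typically under /sys/fs/cgroup), which depends on the host platform and may not hold for every ECS setup. The function and struct names are hypothetical, and this is not what we currently run.

```rust
use std::fs;
use std::io;
use std::num::ParseIntError;

/// Parses `memory.max`: a byte count, or the literal `max` meaning "no limit".
pub fn parse_memory_max(contents: &str) -> Result<Option<u64>, ParseIntError> {
    let trimmed = contents.trim();
    if trimmed == "max" {
        Ok(None)
    } else {
        trimmed.parse().map(Some)
    }
}

/// Extracts the `oom_kill` counter from the contents of `memory.events`.
pub fn parse_oom_kills(contents: &str) -> Option<u64> {
    contents.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            (Some("oom_kill"), Some(count)) => count.parse().ok(),
            _ => None,
        }
    })
}

/// Utilization in percent, or `None` when the cgroup has no memory limit.
pub fn utilization_percent(current: u64, limit: Option<u64>) -> Option<f64> {
    limit
        .filter(|&bytes| bytes > 0)
        .map(|bytes| current as f64 * 100.0 / bytes as f64)
}

/// Hypothetical snapshot of the values we would want to export.
pub struct MemorySnapshot {
    pub current_bytes: u64,
    pub limit_bytes: Option<u64>,
    pub oom_kills: Option<u64>,
}

/// Reads a snapshot from a cgroup v2 directory such as `/sys/fs/cgroup`.
pub fn read_snapshot(cgroup_dir: &str) -> io::Result<MemorySnapshot> {
    let read = |name: &str| fs::read_to_string(format!("{cgroup_dir}/{name}"));
    let invalid = |e: ParseIntError| io::Error::new(io::ErrorKind::InvalidData, e);

    let current_bytes = read("memory.current")?.trim().parse::<u64>().map_err(invalid)?;
    let limit_bytes = parse_memory_max(&read("memory.max")?).map_err(invalid)?;
    let oom_kills = parse_oom_kills(&read("memory.events")?);

    Ok(MemorySnapshot { current_bytes, limit_bytes, oom_kills })
}
```

Polling read_snapshot at a short interval and exporting the values alongside our other metrics would give us both a finer-grained utilization signal and an explicit OOM kill counter, independent of the AWS dashboard.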
I would like to again offer my deepest apologies to those affected. This is our first partial outage, and the team is gutted that we adversely affected our customers. As I said above, we know you all rely on us for your webhooks, and we take this responsibility very seriously.
In case you haven't already, please check out our status page. You can register there to be notified about incidents, though we hope that you never will. Additionally, you can always contact us via email and Slack in case you notice any issues or you have any questions.