We had an outage between Wednesday the 12th and Thursday the 13th of January 2022. After our standup we noticed some strange behavior with our Slack notifications: the volume of notifications we were sending didn't look right. Looking at our data, we couldn't see some of our events from the last 18 hours.
While investigating why we weren't processing as many Slack notifications, we learned that a code change had broken the ingestion of some events from Segment.
After a short time, we identified and fixed the issue.
If you were one of our impacted customers, you should've already received an email about this. If you haven't received an email, your workspace was not affected.
Between 14:00 UTC on the 12th of January and 10:00 UTC on the 13th of January 2022, we didn't ingest some events coming from backend SDKs. So you might have some data gaps in June for that interval.
After some discussions over the last 10 days, Segment was able to replay the missing data into June, which fixed these gaps. Unfortunately, they could only do this for customers on the Segment Business plan, so we couldn't recover the data for everyone who was impacted.
The technical details
What happened is that we introduced some faulty code in our event ingestion pipeline. When we receive backend events through Segment, the payload looks like this:
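The original payload example didn't survive here, so as a rough sketch: a server-side (backend) track call delivered through Segment carries a JSON body along these lines. Field names follow Segment's public Track spec; the event name, user id, and properties below are made up for illustration, and the exact body June receives may differ.

```python
# Hypothetical example of a Segment server-side (backend) "track" payload.
# Values are illustrative, not taken from the incident.
backend_event = {
    "type": "track",
    "event": "Project Created",
    "userId": "user_123",
    "properties": {"plan": "business"},
    "timestamp": "2022-01-12T14:00:00Z",
    "context": {
        # Backend SDKs report a different library than browser events do.
        "library": {"name": "analytics-node", "version": "4.0.1"},
    },
}

# Backend events often omit fields that browser events always carry
# (e.g. anonymousId) — exactly the kind of shape difference that faulty
# ingestion code can start rejecting.
assert "anonymousId" not in backend_event
```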
So for 19 hours we rejected many of the backend events Segment sent us.
This caused a spike in 401 (unauthorized) errors that our instrumentation couldn't detect at the time, because the total number of requests we receive is highly volatile.
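When absolute request counts swing wildly, the ratio of 401 responses to total responses is a much more stable signal to alert on. A minimal sketch of that idea (the function name and the 5% threshold are ours, not June's actual monitoring setup):

```python
def unauthorized_ratio(status_codes):
    """Fraction of responses in a window that were 401 Unauthorized."""
    if not status_codes:
        return 0.0
    return status_codes.count(401) / len(status_codes)

# Hypothetical threshold: alert if more than 5% of responses are 401s.
# A ratio stays meaningful whether the window saw 1k or 100k requests,
# unlike an absolute count of errors.
ALERT_THRESHOLD = 0.05

window = [200] * 180 + [401] * 20  # 10% unauthorized -> should alert
should_alert = unauthorized_ratio(window) > ALERT_THRESHOLD
```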
Our changes moving forward
In the last two weeks, we followed up on this incident to make sure we can catch problems like this much earlier, as they happen.
- We improved our instrumentation setup so we can now identify problems like this within minutes, making it much easier to spot new instances of similar errors.
- We improved our approval process for any code changes that impact our data ingestion. This should make it less likely for similar incidents to happen in the future.
We deeply regret the disruption in service. Every incident is an opportunity to learn, and an unplanned investment in future reliability. We learned a lot from this incident and, as always, we intend to make the most of this unplanned investment to make our infrastructure better in 2022 and beyond.
If you have any questions, don't hesitate to reach out; we're here to address any of your concerns.