In an effort to provide transparency and share key learnings, we’re making our Post-Incident Review notes public. Read on to understand our approach to detection, response, remediation, and analysis for this recent incident, as well as our readiness going forward.
Reliable alert delivery is, without question, a core operational tenet at VictorOps. On December 22, 2017, we encountered a technical issue that adversely affected our ability to deliver alerts in a timely manner.
Beyond providing the technical forensics around this incident, we extend a genuine and sincere apology for any difficulties it caused you, your colleagues, or your customers.

Summary of the Incident
On December 22 at 02:32 MST, the VictorOps production Cassandra cluster began experiencing failures that affected alert delivery and functionality for all customers. Alerts were still being processed and notifications were still being sent, but with delays.
This continued until 03:50 MST, when our support team began receiving reports of delayed alerts and of delayed acknowledgment and resolve operations issued from our clients. Support escalated to engineering at 03:59 MST to assist with troubleshooting. A fix was identified and deployed by 04:17 MST, and the issue was resolved at 04:22 MST.
- Alert processing was delayed
- Ack/resolve operations were delayed
This incident lasted 110 minutes, from 02:32 MST to 04:22 MST.
Severity: P1 – this was a customer-affecting issue with a straightforward remediation.
Most customers experienced delays in alert processing due to increased latency and failures in our production Cassandra services. Critical alerts were held up in processing, causing intermittent delays in sending notifications.
Some customers also experienced delays in acknowledgment and resolve operations issued from all clients. This resulted in our application continuing to page some users after the incidents were acknowledged.
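To make the failure mode concrete, here is a minimal illustrative sketch in Python, using the DataStax Cassandra driver. This is not our actual code; the contact points, keyspace, table, and function names are all hypothetical. It shows how datastore latency propagates into the two symptoms above: each timeout-and-retry on a slow write pushes a notification further out, and a failed read of acknowledgment state can keep paging a user who has already acknowledged.

```python
# Illustrative sketch only (hypothetical names throughout). A worker that
# blocks on Cassandra at each step: slow writes delay alert delivery, and
# slow reads leave the worker acting on stale acknowledgment state.
import time
from datetime import datetime

from cassandra import OperationTimedOut, ReadTimeout, WriteTimeout
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1", "10.0.0.2"])  # hypothetical contact points
session = cluster.connect("alerting")        # hypothetical keyspace

INSERT_ALERT = session.prepare(
    "INSERT INTO alerts (alert_id, state, created_at) VALUES (?, ?, ?)")
GET_STATE = session.prepare(
    "SELECT state FROM alerts WHERE alert_id = ?")


def deliver_alert(alert_id, page_user):
    """Persist the alert, then page. Each retry against a slow cluster
    pushes the notification further out: the delay customers observed."""
    for attempt in range(5):
        try:
            session.execute(INSERT_ALERT,
                            (alert_id, "TRIGGERED", datetime.utcnow()),
                            timeout=2.0)
            break
        except (OperationTimedOut, WriteTimeout):
            time.sleep(2 ** attempt)  # backoff compounds the delivery delay
    page_user(alert_id)


def renotify_if_unacked(alert_id, page_user):
    """Re-page unless the alert was acknowledged. If the read times out,
    we fall back to the last known (stale) state, so an already-acknowledged
    user can keep getting paged, as some customers experienced."""
    try:
        row = session.execute(GET_STATE, (alert_id,), timeout=2.0).one()
        state = row.state if row else "TRIGGERED"
    except (OperationTimedOut, ReadTimeout):
        state = "TRIGGERED"  # stale assumption while Cassandra is slow
    if state != "ACKED":
        page_user(alert_id)
```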
We hope the information in this report gives you a detailed and transparent view into the incident itself, and into the lessons we’ll carry forward.