On Wednesday, January 11, at approximately 10:15 AM (MST), VictorOps began a planned maintenance operation on our Cassandra database cluster. Due to an undetected replication error, the operation triggered an inconsistency that made a segment of the cluster unresponsive for some customers.
Our normal monitoring detected the problem immediately, and VictorOps returned the cluster to a fully operational state within 30 minutes of the error occurring. Because of the downtime, a small number of alerts, totaling less than 1% of alerts received, were not processed by the platform, and some notifications were delayed by up to 30 minutes. This is a highly unusual situation within our architecture and operational history, so we immediately reached out to DataStax, our Enterprise Cassandra partner, for diagnosis.
Working with the DataStax support team in the 24 hours following the event, we identified the problem that triggered the downtime and have taken several steps to ensure that a similar replication error will not occur in the future. We will continue to work with DataStax to test and validate future maintenance steps as a secondary check before executing them.
The availability of our service is of paramount importance, and we understand that our customers count on that availability. This issue has top-level visibility within VictorOps. We sincerely apologize for the inconvenience.
Please contact the VictorOps Support Team at support@victorops.com if you have any additional questions regarding this incident.