Delays in Incident Processing
Incident Report for Splunk On-Call
Postmortem

On Wednesday, January 11, at approximately 10:15 AM (MST), VictorOps began a planned maintenance operation on our Cassandra database cluster. Due to an undetected replication error, the operation triggered an inconsistency that made a segment of the cluster unresponsive for some customers.

The problem was detected immediately through normal monitoring, and VictorOps returned the cluster to a fully operational state within 30 minutes of the error occurring. Due to the downtime, a small number of alerts, totaling less than 1% of alerts received, were not processed by the platform, and some notifications were delayed by up to 30 minutes. This is a highly unusual situation within our architecture and operational history, so we immediately reached out to Datastax, our Enterprise Cassandra partner, for diagnosis.

Working with the Datastax support team in the 24 hours following the event, we identified the problem that triggered the downtime and have taken several steps to ensure that a similar replication error will not occur in the future. We will continue to work with Datastax to test and validate future maintenance operations as a secondary check before executing them.

The availability of our service is of paramount importance, and we understand that our customers count on it. This issue has top-level visibility within VictorOps. We sincerely apologize for the inconvenience.

Please contact the VictorOps Support Team at support@victorops.com if you have any additional questions regarding this incident.

Posted Jan 13, 2017 - 11:28 MST

Resolved
A problem was identified with incident processing jobs on isolated database servers that may have resulted in some alerts being dropped. Incoming alert ingestion was paused briefly to prevent further loss, which delayed alert ingestion for isolated organizations. The problem has been fixed and processing has returned to normal. Please contact support@victorops.com with any related questions.
Posted Jan 11, 2017 - 11:10 MST
Identified
The back-end problem affecting incident data has been identified and is currently being addressed with the highest level of urgency. Incoming alert ingestion will be delayed while the fix is being implemented.
Posted Jan 11, 2017 - 10:52 MST
Investigating
Some organizations may experience slight delays in incident processing and/or display of incidents in the incident pane. We are actively investigating the issue. More to follow.
Posted Jan 11, 2017 - 10:30 MST
This incident affected: Alert Processing & Integrations (Alert Ingestion, Alert Ingestion - Inbound email) and Clients (Web client (portal), Android client, iOS client).