Issue Discovered - Service disruption in North American Region - Multiple Services

Incident Report for xMatters

Postmortem

What happened?

On October 16, 2019 at approximately 5:35 PM PDT, the xMatters monitoring systems alerted the Customer Support team to a potential issue with an xMatters service within the North America region. While this incident was in progress, all North American customers may have experienced delays or rejections when injecting events into xMatters, and delays or failures in notification delivery. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident.

Why did it happen?

This issue occurred during routine scheduled maintenance involving security enhancements to an xMatters service that is responsible for delivering notifications. The maintenance was near completion when the service experienced an unexpected error that caused the entire service cluster to fail, resulting in cascading failures to other dependent services. This maintenance was completed across other regions prior to North America without any issues, delays, or downtime.

How did we respond?

As soon as the xMatters monitoring tools detected connectivity issues, the xMatters Customer Support team escalated the issue to a Severity-1 and initiated the internal major incident management process. While the incident response teams began simultaneously investigating the underlying cause and working to restore services for clients, Customer Support posted a notice to the xMatters status page to inform clients about the incident. The teams immediately identified that the issue was related to the scheduled security enhancements for an xMatters service responsible for delivering notifications. The teams began troubleshooting the issue and identified a failure that occurred during the maintenance on the last remaining service node. This resulted in a cascade of failures, leading to the service disruption.

The teams decided the fastest approach to restoring operations would be to perform a rolling restart of the affected notification service. They also determined that promoting services to another region would be a "last resort" option, as the unique circumstances of this failure could potentially cause a longer delay in restoration of services. After the teams completed the rolling restart, they determined the service showed no significant improvement. The teams continued to perform additional troubleshooting steps in an attempt to alleviate the issue, but notification queues continued to grow and all attempts to restore service were unsuccessful.

With guidance from the xMatters executive, the teams decided to start preparing to promote all services and client instances to an alternate data center in North America. To rule out the possibility that the issue was related to the underlying hardware, the teams performed a rolling restart of each virtual machine in the notification service cluster in an attempt to reschedule them to different hardware. Just before the promotion of services was about to begin, the teams confirmed that the rolling restart was successful, and the system was processing events and delivering notifications. With service apparently being restored, the teams held back the promotion of services to an alternate data center until they confirmed that all queues were clearing. Due to the duration of the service disruption, the teams waited for the backlog of notifications to clear before starting up other dependent services. They then confirmed that all services were restored and normal operations had resumed.
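The rolling restart described above can be sketched in general terms as follows. This is a minimal illustration only, not the xMatters tooling: the function name, the `restart` and `is_healthy` callbacks, and the `max_checks` parameter are all hypothetical, and a real implementation would wait between health checks rather than polling back to back.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_checks: int = 5,
) -> List[str]:
    """Restart nodes one at a time, halting at the first node that does not recover.

    Returns the nodes that were restarted and passed their health check.
    All names here are hypothetical placeholders for illustration.
    """
    restarted: List[str] = []
    for node in nodes:
        restart(node)
        # Poll the node a bounded number of times; any() stops at the first success.
        if not any(is_healthy(node) for _ in range(max_checks)):
            raise RuntimeError(f"{node} did not recover; halting rolling restart")
        restarted.append(node)
    return restarted
```

Restarting one node at a time and stopping on the first failed health check keeps the rest of the cluster untouched if a restart makes things worse.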

What are we doing to prevent it from happening again?

To prevent this issue from occurring again, xMatters has committed to the following action items:

Increase monitoring thresholds to help identify any latency with notification delivery earlier in the process. (In progress)
Review the schedule for promotion of client instances to identify specific guidelines around acceptable delays and data retention during incidents. (In progress)
Investigate current architecture and cluster configuration to determine any potential avenues towards improving overall system resiliency. (In progress)
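
As a rough illustration of the kind of check the first action item points toward, the sketch below flags a notification queue whose depth keeps rising across consecutive samples. It is an assumption-laden example, not the xMatters monitoring configuration: the function name, the sampling approach, and the default of three consecutive increases are hypothetical.

```python
from typing import Sequence


def queue_is_backing_up(
    depth_samples: Sequence[int],
    min_consecutive_increases: int = 3,
) -> bool:
    """Return True if queue depth rose in each of the most recent sampling intervals.

    depth_samples is ordered oldest to newest. The threshold of three
    consecutive increases is an arbitrary, hypothetical default.
    """
    if len(depth_samples) <= min_consecutive_increases:
        return False  # Not enough samples to judge a trend.
    recent = depth_samples[-(min_consecutive_increases + 1):]
    # Compare each sample with the one that follows it.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A trend-based check like this can raise an alert while the backlog is still small, instead of waiting for an absolute queue-depth limit to be crossed.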

In addition, the Engineering and Operations teams are conducting a full post-mortem of the incident to help identify any potential improvements to testing suites, playbooks, and other collateral used to help isolate and identify root causes during and after an incident.

Timeline:

Date/Time PDT	Description
2019-10-16 5:00PM	xMatters Engineering begins applying security enhancements to the xMatters notification service
2019-10-16 5:35PM	xMatters monitoring tools alert Customer Support to possible latency issues for clients in North America
2019-10-16 6:20PM	Severity-1 issue raised, internal major incident management process initiated
2019-10-16 6:35PM	Bulletin posted to xMatters status page: https://status.xmatters.com/incidents/096sszlgyz0n
2019-10-16 6:40PM	Rolling restart of notification service completed
2019-10-16 7:00PM	Additional troubleshooting steps begin
2019-10-16 8:50PM	Rolling restart of notification server cluster begins
2019-10-16 9:15PM	Events begin processing and notifications start being delivered
2019-10-16 9:23PM	Notification server cluster restart is completed
2019-10-16 9:30PM	Remaining dependent services are restarted
2019-10-16 10:45PM	Full restoration completed; services resume normal operations

If you have any questions, please visit http://support.xmatters.com

Posted Oct 18, 2019 - 14:54 PDT

Resolved

The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.

Posted Oct 16, 2019 - 22:49 PDT

Monitoring

The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.

Please note that there may be a backlog of notifications to process for some customers.

Posted Oct 16, 2019 - 21:41 PDT

Update

Notifications and events are beginning to process at this time - the backlog of notifications will begin to process shortly.

Posted Oct 16, 2019 - 21:24 PDT

Update

Recovery efforts are progressing - some notifications are processing. ETA for full recovery pending.

Posted Oct 16, 2019 - 21:18 PDT

Update

The xMatters incident team is taking corrective action at this time. ETA is pending.

Posted Oct 16, 2019 - 20:54 PDT

Update

We are continuing to work on a solution to this issue.

Posted Oct 16, 2019 - 20:19 PDT

Update

We are continuing to troubleshoot. Update in the next 15 minutes.

Posted Oct 16, 2019 - 19:50 PDT

Update

Some events are processing, but teams are still working on a resolution. Troubleshooting continues.

Posted Oct 16, 2019 - 19:26 PDT

Update

Engineering teams are continuing to troubleshoot the issue. Please watch this page for updates.

Posted Oct 16, 2019 - 19:07 PDT

Identified

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.

Posted Oct 16, 2019 - 18:48 PDT

Investigating

xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.

Please see incident details for specific services impacted.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.

Posted Oct 16, 2019 - 18:35 PDT

This incident affected: North America (Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).