On October 17, 2019 at approximately 5:35 PM PDT, the xMatters monitoring systems alerted the Customer Support team to a potential issue with an xMatters service within the North America region. While this incident was in progress, all North American customers may have experienced delays or rejections when injecting events into xMatters, and delays or failures in notification delivery. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident.
This issue occurred during routine scheduled maintenance involving security enhancements to an xMatters service that is responsible for delivering notifications. The maintenance was near completion when the service experienced an unexpected error that caused the entire service cluster to fail, resulting in cascading failures in other dependent services. The same maintenance had already been completed in other regions before North America without any issues, delays, or downtime.
As soon as the xMatters monitoring tools detected connectivity issues, the xMatters Customer Support team escalated the issue to Severity-1 and initiated the internal major incident management process. While the incident response teams began simultaneously investigating the underlying cause and working to restore services for clients, Customer Support posted a notice to the xMatters status page to inform clients about the incident. The teams immediately identified that the issue was related to the scheduled security enhancements for an xMatters service responsible for delivering notifications. The teams began troubleshooting and identified a failure that occurred during the maintenance on the last remaining service node. This resulted in a cascade of failures, leading to the service disruption.
The teams decided the fastest approach to restoring operations would be to perform a rolling restart of the affected notification service. They also determined that promoting services to another region would be a "last resort" option, as the unique circumstances of this failure could potentially cause a longer delay in restoration of services. After the teams completed the rolling restart, they determined the service showed no significant improvement. The teams continued to perform additional troubleshooting steps in an attempt to alleviate the issue, but notification queues continued to grow and all attempts to restore service were unsuccessful.
With guidance from the xMatters executive, the teams decided to start preparing to promote all services and client instances to an alternate data center in North America. To rule out the possibility that the issue was related to the underlying hardware, the teams performed a rolling restart of each virtual machine in the notification service cluster in an attempt to reschedule them onto different hardware. Just before the promotion of services was about to begin, the teams confirmed that the rolling restart was successful, and the system was processing events and delivering notifications. With service apparently restored, the teams held back the promotion of services to an alternate data center until they confirmed that all queues were clearing. Due to the duration of the service disruption, the teams waited for the backlog of notifications to clear before starting up other dependent services. They then confirmed that all services were restored and normal operations had resumed.
To prevent this issue from occurring again, xMatters has committed to the following action items:
In addition, the Engineering and Operations teams are conducting a full post-mortem of the incident to help identify any potential improvements to testing suites, playbooks, and other collateral used to help isolate and identify root causes during and after an incident.
|Time (PDT)|Event|
|---|---|
|2019-10-16 5:00PM|xMatters Engineering begins applying security enhancements to the xMatters notification service|
|2019-10-16 5:35PM|xMatters monitoring tools alert Customer Support to possible latency issues for clients in North America|
|2019-10-16 6:20PM|Severity-1 issue raised, internal major incident management process initiated|
|2019-10-16 6:35PM|Bulletin posted to xMatters status page: https://status.xmatters.com/incidents/096sszlgyz0n|
|2019-10-16 6:40PM|Rolling restart of notification service completed|
|2019-10-16 7:00PM|Additional troubleshooting steps begin|
|2019-10-16 8:50PM|Rolling restart of notification server cluster begins|
|2019-10-16 9:15PM|Events begin processing and notifications start being delivered|
|2019-10-16 9:23PM|Notification server cluster restart is completed|
|2019-10-16 9:30PM|Remaining dependent services are restarted|
|2019-10-16 10:45PM|Full restoration completed; services resume normal operations|
If you have any questions, please visit http://support.xmatters.com