Issue Discovered - Service disruption in North American and European Region - Multiple Services

Incident Report for xMatters

Postmortem

What happened?

On November 15, 2019, at approximately 6:30 AM PST, xMatters internal monitoring systems alerted the Engineering teams to an issue with a service in the North America region. While the incident was in progress, North American customers may have experienced intermittent delays in notification delivery, including a 15-minute window where notifications were not processing for some customers. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, save for a brief period during one of the remediation procedures.

Why did it happen?

This issue occurred when the services responsible for processing events experienced a sudden spike in usage, resulting in an unusually high load. Although the Engineering teams immediately initiated standard remediation practices for the notification delivery service, a dependent service used for queuing notifications began to experience instability approximately 10 minutes after the initial remediation began. This instability caused the queuing service to intermittently reject new incoming connection attempts from upstream services.

How did we respond?

When the queuing errors were discovered, xMatters initiated the major incident management process and gathered the incident response team. The team began to troubleshoot and performed a rolling recycle of the affected services. When the recycle failed to address the issue, the team decided to promote affected customers to the secondary site. They initiated the promotion at 7:57 AM PST and completed the process at 8:34 AM PST. At that point, the majority of customers were able to process notifications without issue. The teams continued troubleshooting and resolved the underlying issue on the primary site by performing a full restart of the queuing service. Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational.

What are we doing to prevent it from happening again?

While attempting to reproduce this issue in our test environments, we identified a number of potential improvements and optimizations in the configuration and usage of the queuing service. To prevent this issue from recurring, the xMatters Engineering teams are working to implement all of these changes. The teams are still investigating the source of the initial resource spike.

Timeline:

November 15, 2019

6:30 AM xMatters internal monitoring tools alert Engineering to unusual load on notification processing nodes

6:45 AM Engineering performs rolling recycle of nodes and discovers queuing errors

7:25 AM Major incident raised and internal major incident management process initiated

7:29 AM Bulletin posted to xMatters status page: https://status.xmatters.com/incidents/qy9l66599jnf

7:43 AM Promotion of services to secondary site begins

8:34 AM Promotion is complete

9:15 AM Issue is resolved on primary

9:22 AM Promotion of service to primary begins

9:44 AM Promotion to primary complete; all services resume normal operations
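
The intervals implied by the timeline above can be worked out with a short script. This is only an illustrative sketch, not part of any xMatters tooling: the dictionary keys and the `minutes_between` helper are hypothetical names, and the timestamps are copied from the timeline entries (the narrative section gives 7:57 AM for the start of the promotion, whereas the timeline's 7:43 AM is used here).

```python
from datetime import datetime

# Timestamps copied from the timeline (November 15, 2019, PST).
# The key names are hypothetical labels chosen for this sketch.
TIMELINE = {
    "alert": "6:30 AM",
    "major_incident_raised": "7:25 AM",
    "promotion_to_secondary_begins": "7:43 AM",
    "promotion_to_secondary_complete": "8:34 AM",
    "resolved_on_primary": "9:15 AM",
    "promotion_to_primary_complete": "9:44 AM",
}


def minutes_between(start_key, end_key):
    """Return whole minutes elapsed between two timeline entries."""
    fmt = "%I:%M %p"  # 12-hour clock with AM/PM marker
    start = datetime.strptime(TIMELINE[start_key], fmt)
    end = datetime.strptime(TIMELINE[end_key], fmt)
    return int((end - start).total_seconds() // 60)


if __name__ == "__main__":
    print("Alert to major incident raised:",
          minutes_between("alert", "major_incident_raised"), "min")
    print("Promotion to secondary duration:",
          minutes_between("promotion_to_secondary_begins",
                          "promotion_to_secondary_complete"), "min")
    print("Alert to full restoration on primary:",
          minutes_between("alert", "promotion_to_primary_complete"), "min")
```

Under these timestamps, the script reports 55 minutes from the first alert to the major incident being raised, 51 minutes for the promotion to the secondary site, and 194 minutes (3 hours 14 minutes) from the first alert to full restoration on the primary site.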

Posted Nov 19, 2019 - 12:54 PST

Resolved

The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter. We will provide a full root cause analysis once the post-mortem activities have been completed.

Posted Nov 15, 2019 - 09:47 PST

Monitoring

The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.

Posted Nov 15, 2019 - 09:25 PST

Update

Some customers in North America should start to see notifications delivered again without any delays. We are continuing to work on restoring services for the remaining customers. We will continue to post updates here as they become available.

Posted Nov 15, 2019 - 08:58 PST

Update

European region customers should now be seeing notifications delivered without any delay. We are continuing to work on resolving delays for North American customers. We will continue to provide updates as they become available.

Posted Nov 15, 2019 - 08:10 PST

Identified

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.

Posted Nov 15, 2019 - 07:42 PST

Investigating

xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the North America region. Some clients may notice delays in receiving notifications. We are currently investigating the issue and will update as information becomes available.

Please see incident details for specific services impacted.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.

Posted Nov 15, 2019 - 07:29 PST

This incident affected: Europe, Middle East, and Africa (Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API) and North America (Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).