Service disruption in North American Region
Incident Report for xMatters
Postmortem

What happened?

On January 12, 2020, at approximately 7:05 AM PST, xMatters internal monitoring tools and customer reports alerted Customer Support to an issue with event processing and delivery in the North America region. While the incident was in progress, some North American customers may have experienced delays in event processing and notification delivery, including a window where notifications were not being generated for active events. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, save for a brief period during one of the remediation procedures.

Why did it happen?

This issue occurred when a process responsible for inter-service communication ran into resource limits. The issue was traced to an earlier change that increased the internal process's retention period to improve xMatters' ability to recover data. Resources for the process had been sized for processing, disk, and memory, but the setting that controls the number of open files the process can retain was not sized appropriately for the longer retention period.

How did we respond?

As soon as the monitoring tools alerted on the error, Customer Support initiated the Severity-1 process and engaged the incident response teams. The teams began troubleshooting and restarted the affected process. When the process failed to recover after the restart, the teams decided to promote affected customers to the secondary site to ensure reliable processing of events and notifications. Once the teams initiated the promotion at approximately 7:35 AM PST, notifications began processing properly for most customers. The promotion procedures were completed at 7:50 AM PST, and the majority of notifications continued processing without issue. The teams continued troubleshooting, then identified and resolved the underlying issue on the primary site by adjusting the setting that governs the number of open files for the process. Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational.

What are we doing to prevent it from happening again?

To resolve this issue permanently, the xMatters teams have adjusted the setting that governs the number of open files for the process.
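
As an illustration only, the kind of check that can surface this class of problem before it causes an outage might look like the sketch below. It is not the xMatters implementation: the function names, the 20% threshold, and the assumption of a Linux host (where a process's open descriptors are listed under /proc/self/fd) are all hypothetical. The sketch compares the number of files a process currently has open against its soft open-file limit and flags when the remaining headroom is low.

```python
import os
import resource  # Unix-only standard library module


def open_file_headroom(open_count, soft_limit):
    """Return the unused fraction of the open-file limit (0.0 to 1.0)."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return max(0.0, 1.0 - open_count / soft_limit)


def is_near_limit(open_count, soft_limit, threshold=0.2):
    """True when less than `threshold` of the limit remains (hypothetical 20% default)."""
    return open_file_headroom(open_count, soft_limit) < threshold


def current_usage():
    """Return (open_count, soft_limit) for this process.

    Assumes Linux: /proc/self/fd lists one entry per open descriptor.
    Returns soft_limit as None when the limit is unlimited.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_count = len(os.listdir("/proc/self/fd"))
    if soft == resource.RLIM_INFINITY:
        return open_count, None
    return open_count, soft


if __name__ == "__main__":
    count, limit = current_usage()
    if limit is None:
        print(f"{count} files open, no soft limit set")
    elif is_near_limit(count, limit):
        print(f"WARNING: {count} of {limit} file descriptors in use")
    else:
        print(f"OK: {count} of {limit} file descriptors in use")
```

A check like this is most useful when it is re-evaluated after any change that affects how many files a process keeps open, such as a longer retention period.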

Timeline: Date/Time (PST)

2020-01-12 7:05 AM - Monitoring alerts to incident with notification processing; Severity 1 incident declared

7:21 AM Rolling restart completed

7:24 AM Errors do not clear, notifications still impacted

7:34 AM Promotion to secondary site begins

7:50 AM Promotion to secondary site completed, notifications begin to process as expected

7:55 AM Team begins to monitor the mitigation

8:20 AM Incident resolved

If you have any questions, please visit http://support.xmatters.com

Posted Jan 17, 2020 - 14:18 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Jan 12, 2020 - 08:20 PST
Monitoring
Customers may continue to experience some slowness as the incident team continues to implement the fixes for this issue. Events are being processed; however, some customers may experience delays. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Jan 12, 2020 - 08:02 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Jan 12, 2020 - 07:40 PST
Update
We are continuing to investigate this issue.
Posted Jan 12, 2020 - 07:38 PST
Investigating
xMatters monitoring tools have identified a potential issue with the xMatters Integration Platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.
Posted Jan 12, 2020 - 07:35 PST
This incident affected: North America (SMS Notifications, Voice Notifications, Integration Platform).