Issue Discovered - Service disruption in North American Region – Integration Platform

Incident Report for xMatters

Postmortem

What happened?
On January 9, 2020, at approximately 3:50 PM PST, xMatters internal monitoring tools and customer reports alerted Customer Support to an issue with event processing through the xMatters Integration Builder in the North America region. While the incident was in progress, some North American customers may have experienced intermittent delays in integration processing, including a 15-minute window where integrations were not accepting or processing events. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, save for a brief period during one of the remediation procedures.

Why did it happen?
This issue occurred when a node in the queuing service cluster experienced high levels of load and unexpectedly disconnected from its cluster. This caused execution of integrations on that node to be delayed.

How did we respond?
When the queuing errors were discovered, Customer Support initiated the Severity-1 process and engaged the incident response teams. The teams began to troubleshoot and restarted the affected node. The node failed to recover properly after the restart, and the teams decided to promote affected customers to the secondary site to ensure reliable processing of integrations. They initiated the promotion at 4:34 PM PST and completed the process at 4:57 PM PST. After the promotion, the majority of customers were able to process notifications without issue. The teams continued troubleshooting and resolved the underlying issue on the primary site by increasing the available resources for all nodes in the queuing service and then performing a full restart of the queuing service. Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational.

What are we doing to prevent it from happening again?
To prevent this issue from recurring, the xMatters teams have provided the service with a significant increase in computing resources. The teams have also implemented more robust monitoring that will alert the service teams if a node disconnects from the cluster. Through further investigation and testing, the teams have also identified a method of recovering nodes faster and more reliably.
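
As an illustration only, a node-disconnect check of the kind described above could be sketched as follows. This is not the xMatters implementation; the node names, the function names, and the idea of comparing an expected node list against the membership the cluster reports are all assumptions made for the example.

```python
from typing import Iterable, Set


def find_disconnected_nodes(expected: Iterable[str], reported: Iterable[str]) -> Set[str]:
    """Return the expected node names that are absent from the reported membership."""
    # Hypothetical inputs: 'expected' is the configured node list,
    # 'reported' is whatever the cluster currently says its members are.
    return set(expected) - set(reported)


def build_alert(missing: Set[str]) -> str:
    """Format an alert message, or an empty string when every node is present."""
    if not missing:
        return ""
    return "ALERT: nodes disconnected from cluster: " + ", ".join(sorted(missing))


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    expected_nodes = ["queue-1", "queue-2", "queue-3"]
    reported_nodes = ["queue-1", "queue-3"]
    print(build_alert(find_disconnected_nodes(expected_nodes, reported_nodes)))
```

In practice a check like this would run on a schedule and hand any non-empty message to the paging system, so that a single missing node is noticed before integration processing is delayed.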

Timeline:
2020-01-09

3:50 PM Monitoring alerts to incident with notification processing; Severity 1 incident declared
3:59 PM Rolling restart completed
4:01 PM Errors clear, performance still impacted
4:27 PM Errors return, more intervention required
4:34 PM Promotion to secondary site begins
4:57 PM Promotion to secondary site completed, notifications begin to process as expected
5:20 PM Services restored, team continued monitoring
6:35 PM Incident resolved

If you have any questions, please visit http://support.xmatters.com

Posted Jan 15, 2020 - 13:12 PST

Resolved

The incident has now been resolved. Thank you for your patience while we addressed this matter. A root cause will be available after post-mortem activities have been completed.

Posted Jan 09, 2020 - 18:35 PST

Update

The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.

Posted Jan 09, 2020 - 18:06 PST

Update

We continue to monitor. Some customers may see delays in event processing.

Posted Jan 09, 2020 - 17:48 PST

Update

We are continuing to monitor for any further issues.

Posted Jan 09, 2020 - 17:32 PST

Update

Issue has been mitigated. Performance may be degraded.

Posted Jan 09, 2020 - 17:14 PST

Monitoring

The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.

Posted Jan 09, 2020 - 17:13 PST

Update

We are seeing intermittent delays with Integration event processing at this time.

Posted Jan 09, 2020 - 16:55 PST

Identified

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.

Posted Jan 09, 2020 - 16:51 PST

Investigating

xMatters monitoring tools have identified a potential issue with the xMatters Integration Platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.

Posted Jan 09, 2020 - 16:32 PST

This incident affected: North America (Integration Platform).