What happened?
On January 9, 2020, at approximately 3:50 PM PST, xMatters internal monitoring tools and customer reports alerted Customer Support to an issue with event processing through the xMatters Integration Builder in the North America region. While the incident was in progress, some North American customers may have experienced intermittent delays in integration processing, including a 15-minute window during which integrations were not accepting or processing events. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, except for a brief period during one of the remediation procedures.
Why did it happen?
This issue occurred when a node in the queuing service cluster experienced high levels of load and unexpectedly disconnected from its cluster. This caused execution of integrations on that node to be delayed.
How did we respond?
When the queuing errors were discovered, Customer Support initiated the Severity-1 process and engaged the incident response teams. The teams began troubleshooting and restarted the affected node. When the node failed to recover properly after the restart, the team decided to promote affected customers to the secondary site to ensure reliable processing of integrations. They initiated the promotion at 4:34 PM PST and completed the process at 4:57 PM PST, after which the majority of customers were able to process notifications without issue. The teams continued troubleshooting and resolved the underlying issue on the primary site by increasing the available resources for all nodes in the queuing service and then performing a full restart of the queuing service. Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational.
What are we doing to prevent it from happening again?
To prevent this issue from recurring, the xMatters teams have provided the service with a significant increase in computing resources. The team has also implemented more robust monitoring that will alert the service teams if a node disconnects from the cluster. Through further investigation and testing, the teams have also identified a method of recovering nodes faster and more reliably.
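As an illustration only, a node-disconnect check of the kind described above can be as simple as comparing the cluster's reported membership against the expected node list and alerting on any gap. The report does not name the queuing technology or its APIs, so the node names and functions below are hypothetical:

```python
# Hypothetical sketch of a cluster-membership monitor; node names and
# function names are illustrative, not the actual xMatters implementation.

EXPECTED_NODES = {"queue-node-1", "queue-node-2", "queue-node-3"}  # assumed roster

def find_disconnected_nodes(reported_nodes):
    """Return the expected nodes missing from the cluster's reported membership."""
    return EXPECTED_NODES - set(reported_nodes)

def check_cluster(reported_nodes):
    """Produce an alert string if any expected node has dropped out of the cluster."""
    missing = find_disconnected_nodes(reported_nodes)
    if missing:
        # A real monitor would page the on-call team here rather than return a string.
        return "ALERT: nodes disconnected from cluster: " + ", ".join(sorted(missing))
    return "OK: all nodes present"
```

Run on a schedule, a check like this surfaces a disconnected node within one polling interval instead of waiting for customer-visible processing delays.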
Timeline (all times PST)
3:50 PM Monitoring alerts to an incident with notification processing; Severity-1 incident declared
3:59 PM Rolling restart completed
4:01 PM Errors clear, performance still impacted
4:27 PM Errors return, more intervention required
4:34 PM Promotion to secondary site begins
4:57 PM Promotion to secondary site completed, notifications begin to process as expected
5:20 PM Services restored, team continued monitoring
6:35 PM Incident resolved
If you have any questions, please visit http://support.xmatters.com