On November 15, 2019, at approximately 6:30 AM PST, xMatters internal monitoring systems alerted the Engineering teams to an issue with a service in the North America region. While the incident was in progress, North American customers may have experienced intermittent delays in notification delivery, including a 15-minute window during which notifications were not processed for some customers. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, save for a brief period during one of the remediation procedures.
This issue occurred when the services responsible for processing events experienced a sudden spike in usage, resulting in an unusually high load. Although the Engineering teams immediately initiated standard remediation practices for the notification delivery service, a dependent service used for queuing notifications became unstable approximately 10 minutes after the initial remediation began. This instability caused the queuing service to intermittently reject incoming connection attempts from upstream services.
When the queuing errors were discovered, xMatters initiated the major incident management process and gathered the incident response team. The team began to troubleshoot and performed a rolling recycle of the affected services. When the recycle failed to address the issue, the team decided to promote affected customers to the secondary site. They initiated the promotion at 7:57 AM PST and completed the process at 8:34 AM PST. The majority of customers were now able to process notifications without issue. The teams continued troubleshooting and resolved the underlying issue on the primary site by performing a full restart of the queuing service. Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational.
While attempting to reproduce this issue in our test environments, we identified a number of potential improvements and optimizations to the configuration and usage of the queuing service. To prevent this issue from recurring, the xMatters Engineering teams are working to implement all of these changes. The teams are still investigating the source of the initial resource spike.
November 15, 2019
6:30 AM xMatters internal monitoring tools alert Engineering to unusual load on notification processing nodes
6:45 AM Engineering performs rolling recycle of nodes and discovers queuing errors
7:25 AM Major incident raised and internal major incident management process initiated
7:29 AM Bulletin posted to xMatters status page: https://status.xmatters.com/incidents/qy9l66599jnf
7:43 AM Promotion of services to secondary site begins
8:34 AM Promotion is complete
9:15 AM Issue is resolved on primary
9:22 AM Promotion of service to primary begins
9:44 AM Promotion to primary is complete; all services resume normal operations