On March 9, 2019, at approximately 5:04 PM GMT, the xMatters monitoring tools alerted Client Assistance to an issue with the notification services in the European region. During the resolution of the issue, which lasted approximately 25 minutes, notifications were being created but not sent to the intended recipients. New events and responses to existing notifications were still being accepted and processed, and the web user interface was accessible and fully responsive, but no new notifications were going out.
This issue occurred when a queuing mechanism shared between multiple services ran out of available connections, resulting in a lack of available resources for notification delivery in the European region. The root cause of the issue was that unused or expired connections between services were not being cleared, causing a degradation in performance that triggered the alert from the xMatters monitoring tools.
As soon as the xMatters monitoring tools raised the alert, Client Assistance launched a Severity-1 incident and initiated the internal major incident management process. The incident response teams quickly verified that notifications were not being sent and began working to isolate the cause of the performance degradation and to mitigate the impact on customers. The teams began a rolling restart of the affected services to reduce the bottleneck in the queuing mechanism, which immediately improved performance and restored notification delivery service for all affected customers. Once the teams confirmed that notifications were being sent, they continued monitoring the performance of the affected services and investigating the root cause. When the rolling restarts had completed, the teams confirmed that all services had been restored.
To prevent the issue from recurring while working on a permanent solution, the teams implemented an automatic restart schedule for the affected services that purges queue connections and ensures that capacity is freed on a regular basis. Due to service redundancy within the xMatters infrastructure, this action does not affect performance or notification delivery. As the permanent solution, the Engineering team optimized the use of connections by the queuing mechanism and designed an automated connection clearing schedule. The changes were developed and tested for the xMatters On-Demand 5.5.250 release, which was implemented in all production systems on March 14, 2019.
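The report does not describe how the connection clearing is implemented. As an illustration only, the following Python sketch shows the general technique of a bounded connection pool that purges idle connections so that capacity is freed. The names ConnectionPool, Connection, max_size, max_idle, and purge_idle are hypothetical and are not taken from xMatters code.

```python
import time


class Connection:
    """Hypothetical stand-in for a connection to the shared queue."""

    def __init__(self, conn_id, last_used):
        self.conn_id = conn_id
        self.last_used = last_used


class ConnectionPool:
    """Bounded pool that can purge connections idle longer than max_idle seconds."""

    def __init__(self, max_size, max_idle, clock=time.monotonic):
        self.max_size = max_size    # hard limit on open connections (assumed)
        self.max_idle = max_idle    # idle threshold in seconds (assumed)
        self._clock = clock         # injectable clock, useful for testing
        self._open = {}             # conn_id -> Connection
        self._next_id = 0

    def acquire(self):
        """Open a new connection, or fail when the pool is exhausted."""
        if len(self._open) >= self.max_size:
            raise RuntimeError("no connections available")
        conn = Connection(self._next_id, self._clock())
        self._open[conn.conn_id] = conn
        self._next_id += 1
        return conn

    def touch(self, conn):
        """Record that a connection was just used."""
        conn.last_used = self._clock()

    def purge_idle(self):
        """Drop connections unused for more than max_idle; return how many."""
        now = self._clock()
        stale = [cid for cid, conn in self._open.items()
                 if now - conn.last_used > self.max_idle]
        for cid in stale:
            del self._open[cid]
        return len(stale)

    def open_count(self):
        """Number of connections currently held by the pool."""
        return len(self._open)
```

In this sketch, a scheduler would call purge_idle() at a fixed interval. Without that call, connections that are never released accumulate until acquire() fails, which is the kind of exhaustion described in the root cause above.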
March 9, 2019 (all times GMT)
5:04 PM - xMatters monitoring tools alert to notification issues in the European region
5:05 PM - Severity-1 incident initiated
5:06 PM - Issue verified; multiple services cannot get connection to queue
5:07 PM - Impacted services restarted
5:30 PM - Performance improvement verified; services are restored
5:45 PM - Rolling restarts continue; no impact to customer services
5:50 PM - Verification and service checks continue
6:19 PM - Monitoring to ensure full service and performance
7:01 PM - Issue resolved
If you have any questions, please visit: http://support.xmatters.com