Issue Discovered - Service disruption in Europe
Incident Report for xMatters
Postmortem

What happened?

On March 9, 2019, at approximately 5:04 PM GMT, the xMatters monitoring tools alerted Client Assistance to an issue with the notification services in the European region. During the resolution of the issue, which lasted approximately 25 minutes, notifications were being created but not sent to the intended recipients. New events and responses to existing notifications were still being accepted and processed, and the web user interface was accessible and fully responsive, but no new notifications were going out.

Why did it happen?

This issue occurred when a queuing mechanism shared between multiple services ran out of available connections, resulting in a lack of available resources for notification delivery in the European region. The root cause of the issue was that unused or expired connections between services were not being cleared, causing a degradation in performance that triggered the alert from the xMatters monitoring tools.
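The failure mode described above can be illustrated with a minimal sketch. It is not the xMatters implementation: the `NaivePool` class, the `PoolExhausted` exception, and the `simulate_leak` helper are hypothetical names invented for this illustration. It assumes a fixed-size connection pool in which callers never release their connections and nothing reclaims them, so the pool eventually refuses new work.

```python
class PoolExhausted(Exception):
    """Raised when every connection slot is already taken."""


class NaivePool:
    """Hypothetical bounded pool that never reclaims abandoned connections."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.in_use = set()
        self._next_id = 0

    def acquire(self):
        # Refuse new work once every slot is occupied, stale or not.
        if len(self.in_use) >= self.max_connections:
            raise PoolExhausted("no connections available")
        self._next_id += 1
        self.in_use.add(self._next_id)
        return self._next_id

    def release(self, conn_id):
        self.in_use.discard(conn_id)


def simulate_leak(pool, requests):
    """Acquire connections without releasing them; return how many succeeded."""
    served = 0
    for _ in range(requests):
        try:
            pool.acquire()  # the caller never releases, so the slot is leaked
        except PoolExhausted:
            break
        served += 1
    return served
```

With a pool of three slots, `simulate_leak(NaivePool(3), 10)` serves three requests and then stops, which mirrors how delivery can stall even though upstream services keep accepting events.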

How did we respond?

As soon as the xMatters monitoring tools alerted Client Assistance to an issue, they launched a Severity-1 incident and initiated the internal major incident management process. The incident response teams quickly verified that notifications were not being sent and began working to isolate the cause of the performance degradation and to mitigate the impact on customers. The teams began a rolling restart of the affected services to reduce the bottleneck in the queuing mechanism, which immediately improved performance and restored notification delivery service for all affected customers. Once the teams confirmed that notifications were being sent, they continued monitoring the performance of the affected services and investigating the root cause. When the rolling restarts had completed, the teams confirmed that all services had been restored.

What are we doing to prevent it from happening again?

To prevent the issue from recurring while working on a permanent solution, the teams implemented an automatic restart schedule for the affected services that purges queue connections and ensures that capacity is freed on a regular basis. Due to service redundancy within the xMatters infrastructure, this action does not affect performance or notification delivery. The Engineering team optimized the use of connections by the queuing mechanism and designed an automated connection-clearing schedule. The changes were developed and tested for the xMatters On-Demand 5.5.250 release, which was deployed to all production systems on March 14, 2019.
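As a rough illustration of what an automated connection-clearing pass might look like, the sketch below tracks when each connection was last used and drops those idle for longer than a limit. Everything here is an assumption made for illustration: the `ExpiringPool` class, its `touch` and `purge_expired` methods, and the `max_idle_seconds` parameter are hypothetical and do not describe the actual xMatters change.

```python
import time


class ExpiringPool:
    """Hypothetical bounded pool that can purge connections idle for too long."""

    def __init__(self, max_connections, max_idle_seconds, clock=time.monotonic):
        self.max_connections = max_connections
        self.max_idle_seconds = max_idle_seconds
        self._clock = clock      # injectable so the behavior can be tested
        self._last_used = {}     # connection id -> time of last activity
        self._next_id = 0

    def acquire(self):
        if len(self._last_used) >= self.max_connections:
            raise RuntimeError("no connections available")
        self._next_id += 1
        self._last_used[self._next_id] = self._clock()
        return self._next_id

    def touch(self, conn_id):
        """Record activity on a connection so it is not treated as idle."""
        if conn_id in self._last_used:
            self._last_used[conn_id] = self._clock()

    def purge_expired(self):
        """Drop connections idle longer than the limit; return how many were dropped."""
        now = self._clock()
        expired = [
            conn_id
            for conn_id, last in self._last_used.items()
            if now - last > self.max_idle_seconds
        ]
        for conn_id in expired:
            del self._last_used[conn_id]
        return len(expired)
```

In this sketch, a scheduler of some kind (a timer thread or an external job, for example) would call `purge_expired()` at a fixed interval, so capacity is freed regularly without restarting the service.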

Timeline (all times GMT):

March 9, 2019 - 5:04 PM - xMatters monitoring tools alert to notification issues in the European region

5:05 PM - Severity-1 incident initiated

5:06 PM - Issue verified; multiple services cannot get connection to queue

5:07 PM - Impacted services restarted

5:30 PM - Performance improvement verified; services are restored

5:45 PM - Rolling restarts continue; no impact to customer services

5:50 PM - Verification and service checks continue

6:19 PM - Monitoring to ensure full service and performance

7:01 PM - Issue resolved

If you have any questions, please visit: http://support.xmatters.com

Posted Mar 15, 2019 - 15:30 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Mar 09, 2019 - 11:19 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Mar 09, 2019 - 11:12 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Mar 09, 2019 - 11:01 PST
Investigating
The xMatters monitoring tools have identified a potential issue with notification delivery for xMatters On-Demand for some clients located in Europe. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Mar 09, 2019 - 10:49 PST
This incident affected: Europe, Middle East, and Africa (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).