Issue Discovered - Service disruption in Europe
Incident Report for xMatters
Postmortem

What happened?

On Monday, February 26, 2018, at approximately 5:50 AM PST, the xMatters monitoring systems alerted the Client Assistance team to an issue with the xMatters On-Demand service for some clients located in Europe. Some users may have experienced delays in notification delivery after injecting an event into xMatters via the Integration Builder. Some users may also have experienced a brief service outage while the issue was being resolved.

Why did it happen?

This issue occurred when multiple systems failed within a short period of time and experienced issues with their networking components. While the cluster recovered, this caused significant performance degradation and other stability issues for some customers hosted in one of our European data centers.

How did we respond?

As soon as the xMatters network monitoring tools detected unreliable connectivity, the Client Assistance and Operations teams initiated the internal Severity-1 process. The incident response teams simultaneously began investigating the underlying cause and working to restore services for clients. The teams launched the failover procedures and moved the impacted clients to an alternate data center. Once the failover was complete, all services were restored and functioning as expected.

What are we doing to prevent it from happening again?

The Operations team is still investigating why several systems failed in short order, and is performing preventative software and firmware patching based on the observed behaviors.

Timeline:

2018-02-26 05:52 - xMatters monitoring tools alert the Client Assistance team to an issue with On-Demand services in the European region

2018-02-26 06:00 - Internal Severity-1 process initiated

2018-02-26 06:40 - Integration Builder issues detected for some EU customers

2018-02-26 07:04 - Bulletin posted to xMatters status page: http://status.xmatters.com/incidents/ftmng8h9j6h3

2018-02-26 07:37 - Notification delays detected for some EU customers

2018-02-26 07:37 - Failover process initiated for some customers

2018-02-26 08:05 - Solution is implemented

2018-02-26 08:20 - Services are confirmed restored

If you have any questions, please visit http://support.xmatters.com

Posted Mar 02, 2018 - 15:21 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Feb 26, 2018 - 08:20 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Feb 26, 2018 - 08:05 PST
Update
The xMatters Incident Response team has identified the source of the issue and is working on a fix. Some customers may be experiencing delayed notifications as well as Integration Builder issues. We will update once a solution has been identified and implemented.
Posted Feb 26, 2018 - 07:37 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Feb 26, 2018 - 07:05 PST
Investigating
The xMatters monitoring tools have identified a potential issue with our Integration Builder service for some clients located in Europe. We are currently investigating the issue, and will update as information becomes available.
Posted Feb 26, 2018 - 07:04 PST
This incident affected: Europe, Middle East, and Africa (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).