Issue Discovered - Service disruption in Europe
Incident Report for xMatters
Postmortem

What happened?

On Sunday, February 11, 2018 at approximately 9:40 PM GMT, the xMatters monitoring systems alerted the Client Assistance team to an issue with the xMatters On-Demand services for clients located in Europe. Some users may have experienced delays in notification delivery when injecting an event into xMatters. 

Why did it happen?

This issue was caused by a failure of redundant components within one of the data centers located in the European region. The system uses anti-affinity rules within the private cloud platform to prevent redundant services from running on the same physical hardware. The hosting infrastructure did not respect this redundancy constraint, so a single failure took down multiple DNS servers and put the xMatters service into a state where it could no longer process notifications.

How did we respond?

As soon as the xMatters network monitoring detected connectivity issues, the xMatters Client Assistance and Operations teams initiated the internal Severity-1 process. The incident response teams simultaneously began investigating the underlying cause and working to restore services for clients. The team quickly identified a DNS-related issue within the private cloud infrastructure, which was causing some clients to experience delays in notifications when injecting an event into xMatters. The Operations team isolated the component that caused the issue and implemented a solution. Once the solution was in place, notification delivery returned to normal thresholds, and clients confirmed that all services had been restored.

What are we doing to prevent it from happening again?

To prevent this issue from occurring again, xMatters is committed to the following actions:

• Audit the data centers and private cloud infrastructure to identify any other violations of the redundancy constraint. (In progress)

• Implement additional monitoring to identify and proactively respond to any similar constraint violations. (Currently in the design stage; internal reference EVO-2223.)
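
As an illustration only, the kind of check that such an audit or monitor performs can be sketched as follows. This is a minimal sketch, not the xMatters implementation: the function name, the VM and host names, and the inventory format (a mapping of each virtual machine to its physical host, plus named anti-affinity groups) are all hypothetical assumptions, and a real check would read placements from the private cloud platform's own inventory.

```python
from collections import defaultdict


def find_anti_affinity_violations(placements, groups):
    """Return {group: {host: [vm, ...]}} for every host that runs
    two or more members of the same anti-affinity group.

    placements: dict mapping VM name -> physical host name (hypothetical format)
    groups:     dict mapping group name -> list of VM names that must not share a host
    """
    violations = {}
    for group, members in groups.items():
        by_host = defaultdict(list)
        for vm in members:
            host = placements.get(vm)
            if host is not None:  # skip VMs with no known placement
                by_host[host].append(vm)
        # A host holding more than one group member breaks the constraint.
        shared = {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}
        if shared:
            violations[group] = shared
    return violations


# Hypothetical inventory: two of the three DNS servers share host-a.
placements = {
    "dns-1": "host-a",
    "dns-2": "host-a",
    "dns-3": "host-b",
    "app-1": "host-b",
    "app-2": "host-c",
}
groups = {
    "dns": ["dns-1", "dns-2", "dns-3"],
    "app": ["app-1", "app-2"],
}

violations = find_anti_affinity_violations(placements, groups)
print(violations)
```

With this sample data, the check reports that dns-1 and dns-2 share host-a, which is the kind of placement where a single hardware failure can remove more than one DNS server at once.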

Timeline:

2018-02-11 21:43 - xMatters monitoring tools alert the Client Assistance team to an issue with On-Demand services in the European region

2018-02-11 21:45 - Internal Severity-1 process initiated

2018-02-11 22:14 - Bulletin posted to xMatters status page: http://status.xmatters.com/incidents/p221y9h7f76q

2018-02-11 22:20 - Issue is isolated to a DNS-related component within the private cloud infrastructure

2018-02-11 22:45 - Solution is implemented

2018-02-11 22:51 - Services are restored

If you have any questions, please visit http://support.xmatters.com

Posted Feb 15, 2018 - 11:01 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Feb 11, 2018 - 15:29 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Feb 11, 2018 - 14:51 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Feb 11, 2018 - 14:21 PST
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in Europe. Some clients may be experiencing delays in notifications. We are currently investigating the issue, and will update as information becomes available.
Posted Feb 11, 2018 - 14:14 PST
This incident affected: Europe, Middle East, and Africa (Email Notifications, SMS Notifications, Voice Notifications).