What happened?
On Sunday, February 11, 2018 at approximately 9:40 PM GMT, the xMatters monitoring systems alerted the Client Assistance team to an issue with the xMatters On-Demand services for clients located in Europe. Some users may have experienced delays in notification delivery when injecting an event into xMatters.
Why did it happen?
This issue was caused by a failure of redundant components within one of the data centers located in the European region. The system uses anti-affinity rules within the private cloud platform to prevent services from running on the same physical hardware, but this redundancy constraint was not respected by the hosting infrastructure, resulting in the failure of multiple DNS servers and putting the xMatters service into a state where it could no longer process notifications.
How did we respond?
As soon as the xMatters network monitoring detected connectivity issues, the xMatters Client Assistance and Operations teams initiated the internal Severity-1 process. The incident response teams began investigating the underlying cause while simultaneously working to restore services for clients. The team quickly identified a DNS-related issue within the private cloud infrastructure, which was causing some clients to experience delays in notifications when injecting an event into xMatters. The Operations team isolated the component that caused the issue and implemented a solution. Once the solution was in place, notification delivery returned to normal thresholds, and clients confirmed that all services had been restored.
What are we doing to prevent it from happening again?
To prevent this issue from occurring again, xMatters is committed to the following actions:
• Audit the data centers and private cloud infrastructure to identify any other violations of the redundancy constraint. (In progress)
• Implement additional monitoring to identify and proactively respond to any similar constraint violations. (Currently in the design stage; internal reference EVO-2223.)
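As an illustration of the kind of check the audit and monitoring actions above describe, the following is a minimal sketch only, not the xMatters implementation. It assumes a hypothetical inventory that maps each virtual machine to its physical host, and hypothetical anti-affinity groups listing the machines that must not share hardware; all names and data are invented for the example.

```python
from collections import defaultdict


def find_anti_affinity_violations(placements, groups):
    """Return {group: {host: [vms]}} for every host that runs more than
    one member of the same anti-affinity group.

    placements: dict mapping VM name -> physical host name (hypothetical inventory)
    groups:     dict mapping group name -> list of VM names that must be
                kept on separate hosts (hypothetical rule set)
    """
    violations = {}
    for group, members in groups.items():
        by_host = defaultdict(list)
        for vm in members:
            host = placements.get(vm)
            if host is not None:  # skip VMs that are not currently placed
                by_host[host].append(vm)
        # A host with two or more members of the group breaks the constraint.
        shared = {h: sorted(vms) for h, vms in by_host.items() if len(vms) > 1}
        if shared:
            violations[group] = shared
    return violations


# Example data (invented): two DNS servers have landed on the same host.
placements = {
    "dns-1": "host-a",
    "dns-2": "host-a",
    "dns-3": "host-b",
    "app-1": "host-a",
    "app-2": "host-c",
}
groups = {
    "dns": ["dns-1", "dns-2", "dns-3"],
    "app": ["app-1", "app-2"],
}

violations = find_anti_affinity_violations(placements, groups)
for group, hosts in violations.items():
    for host, vms in hosts.items():
        print(f"Group '{group}': {', '.join(vms)} share {host}")
```

Run on a schedule against real placement data, a check like this could raise an alert whenever the result is non-empty, rather than relying on the hosting platform alone to enforce the rule.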
Timeline:
2018-02-11 21:43 - xMatters monitoring tools alert the Client Assistance team to an issue with On-Demand services in the European region
2018-02-11 21:45 - Internal Severity-1 process initiated
2018-02-11 22:14 - Bulletin posted to xMatters status page: http://status.xmatters.com/incidents/p221y9h7f76q
2018-02-11 22:20 - Issue is isolated to a DNS-related component within the private cloud infrastructure
2018-02-11 22:45 - Solution is implemented
2018-02-11 22:51 - Services are restored
If you have any questions, please visit http://support.xmatters.com