What happened?
On Tuesday, June 27, 2017, at approximately 01:10pm PDT, the xMatters network monitoring systems alerted the Operations team to a disruption affecting the xMatters On-Demand services in one of the data centers located in North America. Some users may have experienced intermittent access to the user interface, as well as delays or rejections when injecting events into xMatters.
Why did it happen?
This issue was caused by a hardware failure that occurred on our data center provider's network equipment, resulting in services being unavailable for a brief period (less than ten minutes).
How did we respond?
As soon as the xMatters network monitoring tools detected unreliable connectivity and notified Client Assistance and Operations, the teams initiated the internal Major Incident Management process and posted a bulletin to the xMatters status page. The incident response teams simultaneously began investigating the underlying cause and working to restore services for clients. They quickly determined that the issue was caused by a network problem within one of the North American data centers. The Operations team immediately created a Severity 1 ticket with the data center provider and began promoting services to an alternate data center. However, while the automated failover process was being initiated, the vendor confirmed that the issue had been resolved. The Operations team continued to monitor the situation and decided to hold the promotion of services to an alternate data center, as all services had been restored and were reported as stable.
What are we doing to prevent it from happening again?
This disruption was caused by an unexpected network event that affected the entire hosting data center. The data center provider is currently conducting an internal investigation and is providing more information as it is discovered. The provider is also continuing its internal processes and working with its network vendors to identify any potential remediation actions, including replacing the impacted hardware. While these kinds of issues are difficult to predict and prevent, the xMatters teams are continually reviewing the failover processes to identify potential areas of improvement and ways to reduce the time required to get clients back online.
Timeline (all times PDT):
2017-06-27 01:10PM - xMatters monitoring detects a networking issue in one of the data centers located in North America
2017-06-27 01:13PM - Teams initiate the internal Major Incident Management process
2017-06-27 01:15PM - Client Assistance posts a support bulletin: http://status.xmatters.com/incidents/npf13g87cl2t
2017-06-27 01:17PM - Operations team confirms the issue appears to be resolved
2017-06-27 01:20PM - Data center provider confirms the hardware failure and that the issue has been resolved
2017-06-27 01:20PM - All services are restored
If you have any questions, please visit http://support.xmatters.com