On Thursday, December 27, 2018 at approximately 8:41 AM PST, the xMatters network monitoring systems alerted Client Assistance to an issue with xMatters On-Demand services for some clients located in North America. During the issue, some clients may have experienced intermittent access to the xMatters user interface or a delay when injecting events into xMatters. In addition, some clients may have experienced intermittent delays or interruptions in the delivery and reception of xMatters emails.
The root cause of this issue was a high-impact service outage experienced by a primary internet service provider (ISP) in North America. This wide-reaching ISP outage impacted connectivity, email service, and Internet access across North America and even parts of Europe, and caused some issues common to large ISP outages, such as DNS gaps and mobile app connectivity problems. Throughout the incident, the xMatters web user interface was operating and functional, event injection methods were working properly, and non-email notifications and responses were being sent and processed normally. However, most clients may have experienced increased latency during the event, which affected the overall user experience.
As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Engineering teams escalated the issue to Severity 1 and initiated the internal major incident management process. While the incident response teams began investigating the underlying cause, Client Assistance identified and informed affected clients about the incident. The teams immediately identified that the issue was limited to a specific data center within the North American region and determined that the problem was due to a widespread ISP outage in North America. The team connected with the ISP and began working in collaboration with them to determine the impact on xMatters customers, and rerouted email services through an unaffected path.
To avoid the possibility of further impact, all in-flight deployments and upgrades were paused during the event until network access was fully restored. Our incident management team continued to monitor the situation closely and update clients as the ISP reported on their restoration progress.
xMatters uses multiple network backbones and automatically routes traffic across other networks and through other data centers in the event of an Internet failure. During this event, these systems were working as designed and connectivity was reestablished within the expected period of re-convergence.
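The general idea of routing around a failed network can be illustrated with a short sketch. This is a simplified illustration only, not the actual xMatters routing mechanism: the `Path` type, the `select_path` function, and the path names are all hypothetical, and the sketch assumes some separate health probe has already marked each path as healthy or not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Path:
    name: str      # hypothetical label for a network backbone or data center route
    healthy: bool  # assumed result of a health probe performed elsewhere


def select_path(paths: List[Path]) -> Optional[Path]:
    """Return the first healthy path in priority order, or None if all are down."""
    for path in paths:
        if path.healthy:
            return path
    return None


# Hypothetical priority list: the primary route is down, so traffic
# falls through to the next healthy route.
paths = [
    Path("primary-isp", healthy=False),
    Path("secondary-backbone", healthy=True),
    Path("alternate-data-center", healthy=True),
]

chosen = select_path(paths)
print(chosen.name if chosen else "no path available")
```

In practice, re-convergence of this kind is typically handled by network routing protocols rather than application code; the sketch only shows the priority-ordered fallback logic.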
As part of our commitment to continuous improvement, we are making hosting service improvements to our infrastructure-as-a-service, scheduled for the North American region in January 2019. These improvements will greatly reduce the potential impact of ISP outages. For more information, see the article on our support site: https://support.xmatters.com/hc/en-us/articles/115005269506.
December 27, 2018 - 8:41 AM - xMatters internal monitoring alerts Client Assistance to issue in North America
8:43 AM - Client Assistance confirms all services are accessible and operational
8:58 AM - Client Assistance escalates issue to Severity 1; incident response teams begin investigation
9:03 AM - Team confirms issue with ISP
9:28 AM - xMatters engages ISP and obtains point of contact
5:46 PM - Issues identified with email service and delivery
6:04 PM - Email traffic re-routed to alternate path
6:07 PM - Email services restored
9:22 PM - ISP provides 4-hour ETA for resolution
December 28, 2018 - 9:19 AM - ISP indicates progress and claims to be nearing resolution
6:16 PM - ISP indicates that a solution has been implemented; currently monitoring connection for stability
11:44 PM - xMatters confirms all services restored
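The intervals implied by the timeline above can be worked out with a few lines of Python. This is a minimal sketch for illustration, not part of any xMatters tooling: the helper name `elapsed` is hypothetical, the timestamps are copied from the timeline entries, and all times are assumed to be in PST.

```python
from datetime import datetime

# Format matching the timeline entries, e.g. "December 27, 2018 8:41 AM"
# (assumes an English locale for month names).
FMT = "%B %d, %Y %I:%M %p"


def elapsed(start: str, end: str) -> str:
    """Return the time between two timestamps as 'Hh MMm'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Timestamps taken from the timeline; all assumed to be PST.
first_alert = "December 27, 2018 8:41 AM"
severity_1 = "December 27, 2018 8:58 AM"
email_issue = "December 27, 2018 5:46 PM"
email_restored = "December 27, 2018 6:07 PM"
all_restored = "December 28, 2018 11:44 PM"

print("Alert to Severity 1 escalation:", elapsed(first_alert, severity_1))
print("Email issue identified to restored:", elapsed(email_issue, email_restored))
print("Alert to all services restored:", elapsed(first_alert, all_restored))
```

Under these assumptions, the escalation to Severity 1 took 17 minutes, email services were restored 21 minutes after the email issue was identified, and the full incident lasted 39 hours and 3 minutes from first alert to confirmed restoration.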
If you have any questions, please visit http://support.xmatters.com