Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On Thursday, December 27, 2018 at approximately 8:41 AM PST, the xMatters network monitoring systems alerted Client Assistance to an issue with xMatters On-Demand services for some clients located in North America. During the issue, some clients may have experienced intermittent access to the xMatters user interface or a delay when injecting events into xMatters. In addition, some clients may have experienced intermittent delays or interruptions with the delivery and reception of xMatters emails.

Why did it happen?

The root cause of this issue was a high-impact service outage experienced by a primary Internet service provider (ISP) in North America. This wide-reaching ISP outage impacted connectivity, email service, and Internet access across North America and even parts of Europe, and caused some issues common to large ISP outages, such as DNS gaps and mobile app connectivity problems. Throughout the incident, the xMatters web user interface was operating and functional, event injection methods were working properly, and non-email notifications and responses were being sent and processed normally. However, most clients may have experienced increased latency during the event, which affected the overall user experience.

How did we respond?

As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Engineering teams escalated the issue to Severity 1 and initiated the internal major incident management process. While the incident response teams began investigating the underlying cause, Client Assistance identified and informed affected clients about the incident. The teams immediately identified that the issue was limited to a specific data center within the North American region and determined that the problem was due to a widespread ISP outage in North America. The team connected with the ISP and began working in collaboration with them to determine the impact on xMatters customers, and rerouted email services through an unaffected path.

During the event, to avoid any possible impact, all in-flight deployments and upgrades were paused until network access was fully restored. Our incident management team continued to monitor the situation closely and update clients as the ISP reported on their restoration progress.

What are we doing to prevent it from happening again?

xMatters uses multiple network backbones and automatically routes traffic across other networks and through other data centers in the event of an Internet failure. During this event, these systems were working as designed and connectivity was reestablished within the expected period of re-convergence.

As part of our commitment to continuous improvement, we are making hosting service improvements to our infrastructure-as-a-service, scheduled for the North American region in January 2019. These improvements will greatly reduce the potential impact of ISP outages. For more information, see the article on our support site: https://support.xmatters.com/hc/en-us/articles/115005269506.

Timeline:

December 27, 2018 - 8:41 AM - xMatters internal monitoring alerts Client Assistance to issue in North America

8:43 AM - Client Assistance confirms all services are accessible and operational

8:58 AM - Client Assistance escalates issue to Severity 1; incident response teams begin investigation

9:03 AM - Team confirms issue with ISP

9:28 AM - xMatters engages ISP and obtains point of contact

5:46 PM - Issues identified with email service and delivery

6:04 PM - Email traffic re-routed to alternate path

6:07 PM - Email services restored

9:22 PM - ISP provides 4-hour ETA for resolution

December 28, 2018 - 9:19 AM - ISP indicates progress and claims to be nearing resolution

6:16 PM - ISP indicates that a solution has been implemented; xMatters monitors connection for stability

11:44 PM - xMatters confirms all services restored

If you have any questions, please visit http://support.xmatters.com

Posted Jan 03, 2019 - 09:02 PST

Resolved
The issue has been addressed by the ISP and network services have been restored. Thank you for your patience while this issue was being addressed.
Posted Dec 28, 2018 - 23:44 PST
Monitoring
The ISP has confirmed that most of the networking issues it was experiencing should now be resolved. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Dec 28, 2018 - 18:16 PST
Update
The ISP has made some progress, but is still working on fully restoring its service. Some users will continue to see issues accessing the web user interface depending on their geographic location. We continue to monitor the situation and will provide updates as we get them.
Posted Dec 28, 2018 - 09:19 PST
Identified
As mentioned previously, this issue has been identified as a widespread outage impacting a primary ISP in North America. We continue to monitor the situation and will provide another update when one becomes available.
Posted Dec 27, 2018 - 22:20 PST
Investigating
xMatters has received several reports today of users not being able to access the web user interface. The root cause of this issue is related to a wide-impact service outage experienced by a primary Internet service provider (ISP) in North America. xMatters services are running and operational; however, some users may not be able to access their xMatters instance, depending on their geographic location. We continue to monitor the situation closely and will provide updates as they become available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Dec 27, 2018 - 21:17 PST
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).