Issue Discovered - Service disruption in North America for Email delivery
Incident Report for xMatters
Postmortem

What happened?

On Thursday, December 27, 2018 at approximately 8:41 AM PST, the xMatters network monitoring systems alerted Client Assistance to an issue with xMatters On-Demand services for some clients located in North America. During the issue, some clients may have experienced intermittent access to the xMatters user interface or a delay when injecting events into xMatters. In addition, some clients may have experienced intermittent delays or interruptions with the delivery and reception of xMatters emails.

Why did it happen?

The root cause of this issue was a high-impact service outage experienced by a primary Internet service provider (ISP) in North America. This wide-reaching ISP outage impacted connectivity, email service, and Internet access across North America and even parts of Europe, and caused some issues common to large ISP outages, such as DNS gaps and mobile app connectivity problems. Throughout the incident, the xMatters web user interface was operating and functional, event injection methods were working properly, and non-email notifications and responses were being sent and processed normally. However, most clients may have experienced increased latency during the event, which affected the overall user experience.

How did we respond?

As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Engineering teams escalated the issue to Severity 1 and initiated the internal major incident management process. While the incident response teams began investigating the underlying cause, Client Assistance identified and informed affected clients about the incident. The teams immediately identified that the issue was limited to a specific data center within the North American region and determined that the problem was due to a widespread ISP outage in North America. The team connected with the ISP and began working in collaboration with them to determine the impact to xMatters customers, and rerouted email services through an unaffected path.

During the event, all in-flight deployments and upgrades were paused until network access was fully restored to avoid the possibility of impact. Our incident management team continued to monitor the situation closely and update clients as the ISP reported on their restoration progress.

What are we doing to prevent it from happening again?

xMatters uses multiple network backbones and automatically routes traffic across other networks and through other data centers in the event of an Internet failure. During this event, these systems were working as designed and connectivity was reestablished within the expected period of re-convergence.
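
As a rough illustration only, and not a description of the actual xMatters implementation, priority-based failover of the kind described above can be sketched as picking the first healthy path from an ordered list. The `Path` type, the `select_path` function, and the path names are all hypothetical; in practice, rerouting between backbones is typically handled by network routing protocols rather than by application code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Path:
    """A hypothetical outbound route, such as one ISP or network backbone."""
    name: str
    healthy: bool


def select_path(paths: Sequence[Path]) -> Optional[Path]:
    """Return the first healthy path in priority order, or None if all are down."""
    for path in paths:
        if path.healthy:
            return path
    return None


# Illustrative names only: the primary provider is down, so traffic
# falls through to the next path in the list.
paths = [
    Path("primary-isp", healthy=False),
    Path("alternate-backbone", healthy=True),
]

chosen = select_path(paths)
print(chosen.name if chosen else "no healthy path")
```

The ordered list stands in for a routing preference; a real deployment would also need health checks that update each path's status, which this sketch assumes are already available.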

As part of our commitment to continuous improvement, we are conducting hosting service improvements to our infrastructure-as-a-service, scheduled to occur in the North American region in January 2019. These improvements will greatly reduce the potential impact of ISP outages. For more information, see the article on our support site: https://support.xmatters.com/hc/en-us/articles/115005269506.

Timeline:

December 27, 2018 - 8:41 AM - xMatters internal monitoring alerts Client Assistance to issue in North America

8:43 AM - Client Assistance confirms all services are accessible and operational

8:58 AM - Client Assistance escalates issue to Severity 1; incident response teams begin investigation

9:03 AM - Team confirms issue with ISP

9:28 AM - xMatters engages ISP and obtains point of contact

5:46 PM - Issues identified with email service and delivery

6:04 PM - Email traffic re-routed to alternate path

6:07 PM - Email services restored

9:22 PM - ISP provides 4-hour ETA for resolution

December 28, 2018 - 9:19 AM - ISP indicates progress and claims to be nearing resolution

6:16 PM - ISP indicates that a solution has been implemented; currently monitoring connection for stability

11:44 PM - xMatters confirms all services restored

If you have any questions, please visit http://support.xmatters.com

Posted 3 months ago. Jan 03, 2019 - 14:45 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted 3 months ago. Dec 27, 2018 - 18:18 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted 3 months ago. Dec 27, 2018 - 18:13 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted 3 months ago. Dec 27, 2018 - 18:10 PST
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America where email delivery is being delayed. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted 3 months ago. Dec 27, 2018 - 18:00 PST
This incident affected: North America (Web Interface, Mobile Interface, Email Notifications, SMS Notifications, Voice Notifications, Mobile Push Notifications, Conferencing, Integration Platform, REST API, Email Initiation).