Issue Discovered - Service disruption in North America

Incident Report for xMatters

Postmortem

What happened?

On Thursday, December 8, 2016, at 10:40am PST, the xMatters network monitoring systems alerted the Operations team to an issue with the On-Demand services in one of the data centers located in North America. Some users may have briefly experienced intermittent access to the user interface, as well as delays or rejections when injecting events into xMatters.

Why did it happen?

The root cause of this issue was a service outage experienced by the primary internet service provider (ISP) for one of the North American data centers.

How did we respond?

As soon as the xMatters network monitoring tools detected unreliable connectivity, they notified the Client Assistance and Operations teams, who initiated the internal Major Incident Management process and posted a bulletin to the xMatters status page. The incident response teams then investigated the underlying cause while working in parallel to restore services for clients.

The xMatters failover tools automatically began routing traffic through another data center, a change that can take up to five minutes to reach all clients because of DNS time-to-live (TTL) rules. Because the root cause was not yet known, the Operations team also began the failover process to promote the affected clients to another data center located in North America.

During the investigation, the teams determined that the impacted data center was not accessible over the public internet. Over the next few minutes, connectivity with the affected data center remained intermittent while the ISP restored its services. Once services were fully restored, the incident team decided to halt any further failovers to the other data center. They continued to monitor the situation closely over the next several hours, and no further issues occurred.
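
To illustrate why DNS-based rerouting is not instantaneous, here is a minimal sketch of how long a resolver can keep serving a stale record after a failover. The 300-second TTL is an assumption chosen to match the "up to five minutes" figure above, and the function name and parameters are hypothetical; this is not a description of the actual xMatters tooling.

```python
# Assumption: a 300-second TTL, chosen only to match the "up to five minutes"
# figure in the report. The real TTL is not stated.
TTL_SECONDS = 300.0


def stale_seconds_after_failover(cached_at: float,
                                 failover_at: float,
                                 ttl: float = TTL_SECONDS) -> float:
    """Return how long after a DNS change a resolver keeps the old record.

    cached_at   -- time (seconds) at which the resolver last fetched the record
    failover_at -- time (seconds) at which the DNS record was changed
    """
    if cached_at >= failover_at:
        # The record was fetched after the change, so it is already current.
        return 0.0
    expires_at = cached_at + ttl
    # A cache entry that expired before the change causes no extra delay.
    return max(0.0, expires_at - failover_at)


if __name__ == "__main__":
    # Worst case: the record was cached just before the change.
    print(stale_seconds_after_failover(cached_at=0.0, failover_at=0.001))
    # A resolver that cached the record 299 s before the change waits about 1 s.
    print(stale_seconds_after_failover(cached_at=0.0, failover_at=299.0))
```

Under these assumptions the delay any one client sees ranges from zero up to nearly the full TTL, depending on when its resolver last refreshed the record.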

What are we doing to prevent it from happening again?

To help mitigate the effects of any issues experienced by internet service providers, the xMatters Operations team will be improving the network monitoring dashboards and alerts to better detect ISP service outages or degradations. They will also be researching and implementing better mechanisms to prevent excessive traffic rebalancing between data centers during automatic remediation.
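
One common way to limit excessive rebalancing during intermittent connectivity is hysteresis: requiring several consecutive failed health checks before failing over, and several consecutive healthy checks before failing back. The sketch below is purely illustrative. The class name, thresholds, and health-check model are assumptions, and the report does not say which mechanism the Operations team will adopt.

```python
class FailoverDamper:
    """Hypothetical hysteresis gate for automatic failover decisions."""

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 5):
        # Consecutive unhealthy checks required before failing over.
        self.fail_threshold = fail_threshold
        # Consecutive healthy checks required before failing back.
        self.recover_threshold = recover_threshold
        self.failed_over = False
        self._streak = 0

    def observe(self, healthy: bool) -> bool:
        """Record one health-check result and return the failover state."""
        # An observation "contradicts" the current state when it argues
        # for switching: unhealthy while primary, healthy while failed over.
        contradicts = healthy if self.failed_over else not healthy
        self._streak = self._streak + 1 if contradicts else 0

        threshold = (self.recover_threshold if self.failed_over
                     else self.fail_threshold)
        if self._streak >= threshold:
            self.failed_over = not self.failed_over
            self._streak = 0
        return self.failed_over


if __name__ == "__main__":
    damper = FailoverDamper(fail_threshold=2, recover_threshold=3)
    # Flapping connectivity never reaches the threshold, so no failover.
    for result in [False, True, False, True]:
        damper.observe(result)
    print(damper.failed_over)
```

With a gate like this, alternating healthy and unhealthy checks reset the streak each time, so traffic only moves when the primary path is consistently down or consistently recovered.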

Furthermore, the Operations team is investigating ways to improve the internal tools used to troubleshoot service outages related to an ISP. This will allow the incident team to identify the root cause of such issues much faster and prevent unnecessary downtime for our clients.

Timeline:

2016-12-08 10:39AM - xMatters monitoring tools alert the incident team to accessibility issues with one of the data centers in North America

2016-12-08 10:41AM - Internal Major Incident process initiated

2016-12-08 10:44AM - Support bulletin posted: http://status.xmatters.com/incidents/0cc69cgqnzs1

2016-12-08 10:48AM - Failover process is initiated for all clients in the impacted data center

2016-12-08 10:53AM - Cause identified as an issue with the primary internet service provider

2016-12-08 10:57AM - Failover process halted because the issue is related to the ISP; routing through a different data center is completed

2016-12-08 11:01AM - All services are restored

Posted Dec 14, 2016 - 14:25 PST

Resolved

The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Dec 08, 2016 - 11:22 PST

Monitoring

The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Dec 08, 2016 - 11:06 PST

Identified

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Dec 08, 2016 - 10:58 PST

Investigating

The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.
Posted Dec 08, 2016 - 10:44 PST