What happened?
On Thursday, December 8, 2016, at approximately 10:40am PST, the xMatters network monitoring systems alerted the Operations team to an issue with the On-Demand services in one of the data centers located in North America. Some users may have briefly experienced intermittent access to the user interface, as well as delays or rejections when injecting events into xMatters.
Why did it happen?
The root cause of this issue was a service outage experienced by the primary internet service provider (ISP) for one of the North American data centers.
How did we respond?
As soon as the xMatters network monitoring tools detected unreliable connectivity, they notified the Client Assistance and Operations teams, who initiated the internal Major Incident Management process and posted a bulletin to the xMatters status page. The incident response teams simultaneously began investigating the underlying cause and working to restore services for clients. The xMatters failover tools automatically began routing traffic through another data center, a change that could take up to five minutes to take effect due to DNS time-to-live rules. Because the root cause was not yet known, the Operations team also began the failover process to promote the affected clients to another data center located in North America.
During the investigation, the teams determined that the impacted data center was not accessible over the public internet. Over the next few minutes, connectivity with the affected data center remained intermittent while the ISP restored its services. Once services were fully restored, the incident team decided to halt any further failovers to the other data center. They continued to monitor the situation closely over the next several hours, and no further issues occurred.
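The "up to five minutes" delay mentioned above for automatic rerouting follows from DNS caching: a resolver that has cached the old address keeps using it until the record's time-to-live expires. The following is a minimal sketch of that arithmetic only. The 300-second TTL is an assumption inferred from the five-minute figure, not a confirmed xMatters setting, and the function name is hypothetical.

```python
# Illustrative only: TTL_SECONDS is an assumption inferred from the
# "up to five minutes" figure; it is not a confirmed xMatters setting.
TTL_SECONDS = 300


def seconds_until_client_sees_change(cached_at: float, changed_at: float,
                                     ttl: float = TTL_SECONDS) -> float:
    """Return how long after a DNS change a caching resolver keeps the old record.

    cached_at:  time (in seconds) the resolver last fetched the record
    changed_at: time (in seconds) the record was repointed to another data center
    """
    expires_at = cached_at + ttl
    # If the cache entry had already expired before the change, the next
    # lookup fetches the new record right away.
    return max(0.0, expires_at - changed_at)


# A resolver that cached the record at the moment of the change waits a full TTL.
worst_case = seconds_until_client_sees_change(cached_at=1000.0, changed_at=1000.0)
# A resolver whose entry was about to expire picks up the change quickly.
best_case = seconds_until_client_sees_change(cached_at=710.0, changed_at=1000.0)
```

Under these assumptions, individual clients would see the rerouting take effect at different times within the TTL window, which is consistent with some users experiencing only brief or intermittent disruption.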
What are we doing to prevent it from happening again?
To help mitigate the effects of any issues experienced by internet service providers, the xMatters Operations team will be improving the network monitoring dashboards and alerts to better detect ISP service outages or degradations. They will also be researching and implementing better mechanisms to prevent excessive traffic rebalancing between data centers during automatic remediation.
Furthermore, the Operations team is investigating ways to improve the internal tools used to troubleshoot service outages related to an ISP. This will allow the incident team to identify the root cause of any issues much faster and prevent unnecessary downtime for our clients.
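One common way to limit excessive traffic rebalancing when connectivity is intermittent is to require several consecutive probe results before switching data centers in either direction. The sketch below illustrates that general technique under stated assumptions; it is not a description of the xMatters failover tools, and the class name, thresholds, and probe model are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverDamper:
    """Hypothetical helper: switch data centers only after sustained probe results."""

    fail_threshold: int = 3      # consecutive failed probes before failing over
    recover_threshold: int = 5   # consecutive good probes before failing back
    on_primary: bool = True
    # Number of probes in a row that disagree with the current routing state.
    _streak: int = field(default=0, init=False)

    def observe(self, primary_healthy: bool) -> bool:
        """Record one probe of the primary; return True if traffic should use it."""
        # A probe "disagrees" when it suggests traffic should be on the other side.
        disagrees = primary_healthy != self.on_primary
        self._streak = self._streak + 1 if disagrees else 0
        needed = self.fail_threshold if self.on_primary else self.recover_threshold
        if self._streak >= needed:
            self.on_primary = not self.on_primary
            self._streak = 0
        return self.on_primary


# Intermittent connectivity: isolated failures do not trigger a failover,
# and a brief recovery does not trigger an immediate failback.
damper = FailoverDamper()
probes = [True, False, True, False, False, False, True, True]
decisions = [damper.observe(p) for p in probes]
```

Using a higher threshold for failing back than for failing over is a deliberate choice in this sketch: it keeps traffic on the healthy data center until the primary has been stable for a while, at the cost of a slower return to normal routing.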
Timeline (all times PST):
2016-12-08 10:39AM - xMatters monitoring tools alert the incident team to accessibility issues with one of the data centers in North America
2016-12-08 10:41AM - Internal Major Incident process initiated
2016-12-08 10:44AM - Support bulletin posted: http://status.xmatters.com/incidents/0cc69cgqnzs1
2016-12-08 10:48AM - Failover process is initiated for all clients in the impacted data center
2016-12-08 10:53AM - Root cause identified as an issue with the primary internet service provider
2016-12-08 10:57AM - Failover process halted because the issue is ISP-related; rerouting through a different data center is complete
2016-12-08 11:01AM - All services are restored