What happened?
On Friday, April 6, 2018, at 11:25am PST, the xMatters network monitoring systems alerted the Operations team to an issue with the On-Demand services in one of the data centers located in North America. Some users may have briefly experienced intermittent access to the user interface, as well as delays or rejections when injecting events into xMatters.
Why did it happen?
The root cause of this issue was a brief service outage experienced by the primary Internet service provider (ISP) for one of the North American data centers.
How did we respond?
As soon as the xMatters network monitoring tools detected unreliable connectivity, they notified the Client Assistance and Operations teams, who initiated the internal Major Incident Management process and posted a bulletin to the xMatters status page. The incident response teams simultaneously began investigating the underlying cause and working to restore services for clients. During the investigation, the teams determined that the impacted data center was not accessible over the public Internet. Over the next few minutes, while the ISP restored its services, connectivity with the affected data center was intermittent. Throughout the event, automatic network failovers to other providers were occurring. Once services were fully restored, the incident team decided to halt any failovers to the other data center. They continued to monitor the situation closely over the next several hours, and no further issues occurred.
What are we doing to prevent it from happening again?
xMatters uses multiple network backbones and performs failover to other networks by routing traffic through other data centers in the event of an Internet failure. During this event, these systems were working as designed and connectivity was re-established within the expected period of re-convergence.
Timeline:
2018-04-06 11:25AM - xMatters monitoring tools alert the Operations and Client Assistance teams to accessibility issues with one of the data centers in North America
2018-04-06 11:30AM - Internal Major Incident process initiated
2018-04-06 11:30AM - Support bulletin posted: http://status.xmatters.com/incidents/gymstt3gz0s2
2018-04-06 11:30AM - xMatters monitoring tools report all systems are restored
2018-04-06 11:35AM - Issue is identified as an outage with the primary Internet service provider
2018-04-06 11:35AM - All services are confirmed restored
If you have any questions, please visit http://support.xmatters.com