What happened?
On July 19, 2018, at approximately 03:17 AM AEST, the xMatters monitoring tools alerted to an issue with our hosting service in the Australian region. During the incident, which lasted less than 30 minutes, some customers reported encountering errors when they attempted to access their xMatters instance, and events and notifications were not being accepted or processed.
Why did it happen?
The root cause of this issue was traced to a service failure with an upstream provider for one of our data centers in the Asia-Pacific region. Our automated failover process to another data center restored connectivity for some customers, and the problem was resolved within 30 minutes of the first report, before the failover completed.
How did we respond?
As soon as the issue was detected, the Client Assistance team initiated the internal major incident management process and posted a notice to the xMatters Status page. The incident response teams quickly determined that external issues were preventing access to xMatters instances, as all internal services were functioning normally. The teams immediately escalated the incident to the data center provider, who confirmed that they were experiencing issues. While the provider continued to investigate and attempt to restore their service, the incident response teams began implementing workarounds to bypass the problematic data center. During the implementation of the workaround, the provider restored service and all instances were reported as functional and healthy.
What are we doing to prevent it from happening again?
At xMatters, we understand that availability is at the core of our service, and we treat our customers' requirements as mission-critical. This disruption was caused by an unexpected network event that affected the entire hosting data center. The data center provider is currently conducting an internal investigation and is providing more information as they discover it. The provider is also continuing their internal processes and working with their network vendors to identify any potential remediation actions, including replacing the impacted hardware.
While these kinds of issues are difficult to predict and prevent, the xMatters teams are continually reviewing the failover processes and seeking to identify any potential areas of improvement or ways to reduce the amount of time required to get clients back online. As part of this commitment, we are conducting hosting service improvements to our infrastructure-as-a-service, scheduled to occur in the Asia-Pacific region in October 2018. For more information, see the article on our support site: https://support.xmatters.com/hc/en-us/articles/115005269506
Timeline:
July 19, 2018 - 03:17 AM - Monitoring tools alert to an issue in the Australian region
03:22 AM - Client Assistance initiates major incident management process, launches investigation
03:24 AM - Status page updated: https://status.xmatters.com/incidents/qfjqy9pfpntd
03:24 AM - Issue identified as external
03:24 AM - Issue reported to data center provider
03:24 AM - Operations begins workaround to attempt to mitigate issue for xMatters customers
03:28 AM - Majority of customers reported back up
03:35 AM - All services restored
If you have any questions, please visit http://support.xmatters.com