On October 25th, 2018, at approximately 1:16 PM AEST, the xMatters monitoring tools alerted the Client Assistance team to an issue impacting the On-Demand service for some clients located in the Australian region. During the incident, some clients may have experienced intermittent access to the user interface, a delay or rejection when injecting an event into xMatters, and delays in notification delivery. There was no impact on, or loss of, client data during this incident.
This issue was caused by a sudden, unexpected failure of a network interface card within the hosted data center supporting our services in Australia. Although the impacted hardware was redundant, the failure created a condition that resulted in a cascade of failures. An automated failover to an alternate data center was initiated immediately, but the process of redirecting services around the issue took longer than expected due to the nature of the failure.
As soon as they were alerted by the monitoring systems, Client Assistance initiated the internal major incident management process and launched an investigation. The xMatters incident response teams confirmed the issue and began monitoring the automated failover process. The Client Assistance team proactively contacted each client individually to let them know about the issue and to update them on the status of their services. The failover was completed, and all services were fully restored less than an hour after the issue was identified.
Hardware failure is difficult to predict, and this condition was unique in that existing services and redundancies failed to perform as previously tested. The hosting service improvements and migrations just completed in the Australian region provide a new and significantly more robust infrastructure, on which similar issues are highly unlikely. For more information about these changes, see the article on our support site: https://support.xmatters.com/hc/en-us/articles/115005269506
October 25, 2018 (all times AEDT)
1:16 PM - Monitoring tools alert to an issue in the Australian region
1:18 PM - Client Assistance initiates major incident management process, launches investigation
1:20 PM - Issue identified as impacting some clients hosted in one of the APAC data centers
1:25 PM - Failover process begins for clients impacted
1:40 PM - Status page updated: https://status.xmatters.com/incidents/jtcs9w4grlh4
2:05 PM - All affected customers reported back up
2:08 PM - All services restored