Issue Discovered - Service disruption in Asia Pacific Region
Incident Report for xMatters
Postmortem

What happened?

On October 25th, 2018, at approximately 1:16 PM AEST, the xMatters monitoring tools alerted the Client Assistance team to an issue impacting the On-Demand service for some clients located in the Australian region. During the incident, some clients may have experienced intermittent access to the user interface, a delay or rejection when injecting an event into xMatters, and delays in notification delivery. No client data was impacted or lost during this incident.

Why did it happen?

This issue was caused by a sudden, unexpected failure of a network interface card within the hosted data center supporting our services in Australia. Although the impacted hardware was redundant, the failure caused a condition that resulted in a cascade of failures. An automated failover to an alternate data center was initiated immediately, but the process of redirecting services around the issue took longer than expected due to the nature of the failure.

How did we respond?

As soon as they were alerted by the monitoring systems, Client Assistance initiated the internal major incident management process and launched an investigation. The xMatters incident response teams confirmed the issue and began monitoring the automated failover process. The Client Assistance team proactively contacted each client individually to let them know about the issue and to update them on the status of their services. The failover was completed, and all services were fully restored less than an hour after the issue was identified. 

What are we doing to prevent it from happening again?

Hardware failure is difficult to predict, and this condition was unique in that existing services and redundancies did not perform as they had in previous testing. The hosting service improvements and migrations recently completed in the Australian region will make similar issues highly unlikely on this new and significantly more robust infrastructure. For more information about these changes, see the article on our support site: https://support.xmatters.com/hc/en-us/articles/115005269506

Timeline:

October 25, 2018 1:16 PM (AEST) - Monitoring tools alert to an issue in the Australian region

1:18 PM - Client Assistance initiates major incident management process, launches investigation

1:20 PM - Issue identified as impacting some clients hosted in one of the APAC data centers

1:25 PM - Failover process begins for clients impacted

1:40 PM - Status page updated: https://status.xmatters.com/incidents/jtcs9w4grlh4

2:05 PM - All affected customers reported back up

2:08 PM - All services restored

Posted Nov 21, 2018 - 11:45 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Oct 24, 2018 - 21:09 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Oct 24, 2018 - 20:40 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Oct 24, 2018 - 20:25 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Oct 24, 2018 - 20:22 PDT
This incident affected: Asia Pacific (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).