Issue Discovered - Service disruption in Asia Pacific Region
Incident Report for xMatters
Postmortem

What happened?

On January 9, 2019, at approximately 11:20AM AEDT, the xMatters monitoring tools alerted Client Assistance to an issue with our hosting service in the Asia-Pacific region. During the incident, which lasted less than 20 minutes, some customers reported encountering a 503 error or a blank screen when attempting to access their xMatters instance, and events and notifications were not being accepted or processed.

Why did it happen?

This issue was caused by a connectivity failure within the Google Cloud Platform (GCP) Infrastructure-as-a-Service (IaaS) in the Asia-Pacific region. Because of issues within Google's networks, no egress traffic from the Australian region could reach the Internet.

How did we respond?

As soon as the issue was detected, the Client Assistance team initiated the internal major incident management process and launched an investigation. The incident response teams quickly determined that all internal services were functioning normally, but that outbound traffic was not reaching the Internet. The teams immediately escalated the incident to the GCP team, who confirmed that they were experiencing issues and posted information about the problem on their status portals. While Google continued to investigate and work to restore their service, the incident response teams began implementing a workaround to re-route traffic through another region. Google restored their services during the implementation of the workaround, and by 11:35AM AEDT (16:35 PST on January 8) all instances were reporting as functional and healthy.

What are we doing to prevent it from happening again?

At xMatters, we understand that availability is at the core of our service, and we treat our customers' requirements as mission-critical. As a precaution against possible future issues with Google services, the Operations and Engineering teams are committed to establishing a formalized workaround procedure that will bypass problematic services and allow us to continue delivering service in the event of a failure.

Timeline (all times AEDT):

January 9, 2019 11:15AM - xMatters team discovers connectivity issues in the APAC Region

11:21AM - xMatters Client Assistance initiates major incident management process, launches investigation

11:23AM - Issue identified as external

11:25AM - Issue reported to Google Support

11:33AM - Operations begins workaround to attempt to mitigate issue for xMatters customers

11:35AM - Google reports service restored, all instances reported functional

Posted 2 months ago. Jan 10, 2019 - 16:21 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted 2 months ago. Jan 08, 2019 - 16:40 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted 2 months ago. Jan 08, 2019 - 16:36 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted 2 months ago. Jan 08, 2019 - 16:33 PST
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted 2 months ago. Jan 08, 2019 - 16:31 PST
This incident affected: Asia Pacific (Web Interface, Mobile Interface, Email Notifications, SMS Notifications, Voice Notifications, Mobile Push Notifications, Conferencing, Integration Platform, REST API, Email Initiation).