Issue Discovered - Service disruption in Europe
Incident Report for xMatters
Postmortem

What happened?

On July 17, 2018, at approximately 19:14 GMT, the xMatters monitoring tools alerted to an issue with our hosting service in the European region. During the incident, which lasted less than an hour, some customers reported encountering a "502 Bad Gateway" error whenever they attempted to access their xMatters instance, and events and notifications were not being accepted or processed.

Why did it happen?

This issue was caused by a failure of the load balancing service within the Google Cloud Platform (GCP) Infrastructure-as-a-Service (IaaS). Google experienced a major failure of the Google Load Balancer (GLB) service, which prevented traffic from reaching xMatters instances and also affected a number of Google's other European customers.

How did we respond?

As soon as the issue was detected, the Client Assistance team initiated the internal major incident management process and launched an investigation. Because all internal services were functioning normally, the incident response teams quickly determined that external issues were preventing access to xMatters instances. The teams immediately escalated the incident to the GCP team, who confirmed that they were experiencing issues and posted information about the problem on their status portals. While Google continued to investigate and attempt to restore their service, the incident response teams began investigating and implementing work-around solutions to bypass the problematic Google service. During the implementation of the work-around, Google restored the GLB service, and by 19:57 GMT (12:57 PM PDT) all instances were reporting as functional and healthy.

What are we doing to prevent it from happening again?

At xMatters, we understand that availability is at the core of our service, and we treat our customers' requirements as mission critical. As a precaution against possible future issues with the GLB service, the Operations and Engineering teams are committed to establishing a formalized work-around procedure that will bypass the GLB service and allow us to continue to deliver services in the event of a failure. In addition, Google has committed to adding safeguards and further hardening to prevent a recurrence of this issue.

For more information about the root cause of this issue, Google has published their RCA of the incident at https://status.cloud.google.com/incident/cloud-networking/18012.

Timeline (all times GMT):

July 17, 2018 19:14 - Monitoring tools alert to an issue in the European region

19:16 - xMatters Client Assistance initiates major incident management process, launches investigation

19:28 - Issue identified as external

19:40 - Issue reported to Google Support

19:45 - Operations begins work-around to attempt to mitigate issue for xMatters customers

19:57 - Google reports service restored, all instances reported functional

If you have any questions, please visit http://support.xmatters.com

Posted Jul 20, 2018 - 15:44 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Jul 17, 2018 - 13:01 PDT
Monitoring
The xMatters Incident Response team has confirmed an issue with our global cloud provider. The provider has now implemented a fix and sites are returning to normal operations. We will provide another update once we receive confirmation that the issue has been resolved.
Posted Jul 17, 2018 - 12:54 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Jul 17, 2018 - 12:35 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for clients located in Europe. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Jul 17, 2018 - 12:18 PDT
This incident affected: Europe, Middle East, and Africa (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).