What happened?
On July 17, 2018, at approximately 19:14 GMT, the xMatters monitoring tools alerted to an issue with our hosting service in the European region. During the incident, which lasted approximately 43 minutes, some customers reported encountering a "502 Bad Gateway" error whenever they attempted to access their xMatters instance, and events and notifications were not being accepted or processed.
Why did it happen?
This issue was caused by a failure of the load balancing service within the Google Cloud Platform (GCP) Infrastructure-as-a-Service (IaaS). Google experienced a major failure of the Google Load Balancer (GLB) service, which prevented traffic from reaching xMatters instances and also impacted a number of other European customers.
How did we respond?
As soon as the issue was detected, the Client Assistance team initiated the internal major incident management process and launched an investigation. Because all internal services were functioning normally, the incident response teams quickly determined that external issues were preventing access to xMatters instances. The teams immediately escalated the incident to the GCP team, who confirmed that they were experiencing issues and posted information about the problem on their status portals. While Google continued to investigate and attempt to restore their service, the incident response teams began investigating and implementing workaround solutions to bypass the problematic Google service. During the implementation of the workaround, Google restored the GLB service, and by 19:57 GMT all instances were reporting as functional and healthy.
What are we doing to prevent it from happening again?
At xMatters, we understand that availability is at the core of our service, and we treat our customers' requirements as mission-critical. As a precaution against possible future issues with the GLB service, the Operations and Engineering teams are committed to establishing a formalized workaround procedure that will bypass the GLB service and allow us to continue to deliver services in the event of a failure. In addition, Google has committed to adding safeguards and further hardening to prevent a recurrence of this issue.
For more information about the root cause of this issue, Google has published their RCA of the incident at https://status.cloud.google.com/incident/cloud-networking/18012.
Timeline (all times GMT):
July 17, 2018 19:14 - Monitoring tools alert to an issue in the European region
19:16 - xMatters Client Assistance initiates major incident management process, launches investigation
19:28 - Issue identified as external
19:40 - Issue reported to Google Support
19:45 - Operations begins workaround to attempt to mitigate issue for xMatters customers
19:57 - Google reports service restored, all instances reported functional
If you have any questions, please visit http://support.xmatters.com