On November 16, 2021, at approximately 09:40 AM PT, xMatters monitoring tools alerted technical teams to Google 404 errors from xMatters instances across all regions. For the duration of the incident, users were unable to access the web user interface, incoming signals were not processed, and notifications were not generated.
xMatters uses Google Cloud Load Balancing (GCLB) services. GCLB was not operational during the outage, which caused the errors seen by customers.
Based on the RCA provided by Google:
"Google Cloud Networking experienced issues with Google Cloud Load Balancing (GCLB) service resulting in impact to several downstream Google Cloud services. Impacted customers observed Google 404 errors on their websites. From preliminary analysis, the root cause of the issue was a latent bug in a network configuration service which was triggered during routine system operation."
See https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh for the complete report from Google.
After receiving alert notifications from the xMatters monitoring tools, xMatters Customer Support and the operations team initiated a Severity-1 incident. The incident team quickly traced the issue to an incident within the Google Cloud Platform that impacted a wide range of Google-hosted SaaS operators worldwide.
xMatters Customer Support began communicating with customers by updating https://status.xmatters.com/incidents/rtl4qyz4nj3m with detailed, real-time information. xMatters also opened a dialog with Google to gather updates on resolution progress. The incident team remained engaged until Google resolved the incident to ensure that xMatters recovered smoothly once services were restored. No intervention was required after Google resolved the issue, but some customers may have experienced slow loading times until all Google networking components fully recovered.
xMatters is committed to providing redundancy and high availability to all customers. Our architecture allows for multiple regional and international failover scenarios, including regionally redundant databases and international traffic rerouting. A worldwide service provider failure is difficult to account for and generally unprecedented. Based on this incident, we are reviewing feasibility options for cloud vendor redundancy; however, there is no imminent action plan for this type of incident.
November 16, 2021
09:43 PT – xMatters monitoring tools alert teams to Google 404 failures; teams initiate Severity-1 incident
09:50 PT – Verification of incident external to xMatters
09:55 PT – xMatters status page posted
10:07 PT – xMatters instances begin to recover
10:09 PT – Google declares incident mitigated
10:42 PT – xMatters declares incident closed
If you have any questions, please visit http://support.xmatters.com