Issue Discovered - Service disruption in All Regions – Multiple Services
Incident Report for xMatters
Postmortem

What happened?

On November 16, 2021, at approximately 09:40 AM PT, xMatters monitoring tools alerted technical teams to Google 404 errors from xMatters instances across all regions. For the duration of the incident, users were unable to access the web user interface, incoming signals were not being processed, and notifications were not being generated.

Why did it happen?

xMatters relies on Google Cloud Load Balancing (GCLB) services. GCLB was not operational during the outage, which caused the errors seen by customers.

Based on the RCA provided by Google:

"Google Cloud Networking experienced issues with Google Cloud Load Balancing (GCLB) service resulting in impact to several downstream Google Cloud services. Impacted customers observed Google 404 errors on their websites. From preliminary analysis, the root cause of the issue was a latent bug in a network configuration service which was triggered during routine system operation."

See https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh for the complete report from Google.

How did we respond?

After receiving alert notifications from the xMatters monitoring tools, xMatters Customer Support and the operations team initiated a Severity-1 incident. The incident team quickly determined that the issue was caused by an incident within the Google Cloud Platform, which impacted a wide range of Google-hosted SaaS operators worldwide.

xMatters Customer Support began communicating with customers by updating https://status.xmatters.com/incidents/rtl4qyz4nj3m with detailed, real-time information. xMatters initiated a dialog with Google to gather updates on resolution progress. The incident team remained engaged until Google resolved the incident to ensure that xMatters recovered smoothly once services were restored. There was no intervention required after Google resolved the issue, but some customers may have experienced slow loading times until all Google networking components fully recovered.

What are we doing to prevent it from happening again?

xMatters is committed to providing redundancy and high availability to all customers. Our architecture allows for multiple regional and international failover scenarios, including regionally redundant databases and international traffic rerouting. A worldwide service provider failure is difficult to account for and generally unprecedented. Based on this incident, we are reviewing the feasibility of cloud vendor redundancy; however, there is no imminent action plan for this type of incident.

Timeline:

November 16, 2021

09:43 PT – xMatters monitoring tools alert teams to Google 404 failures; teams initiate Severity-1 incident
09:50 PT – Verification of incident external to xMatters
09:55 PT – xMatters status page posted
10:07 PT – xMatters instances begin to recover
10:09 PT – Google declares incident mitigated
10:42 PT – xMatters declares incident closed

If you have any questions, please visit http://support.xmatters.com

Posted Nov 23, 2021 - 08:36 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Nov 16, 2021 - 10:42 PST
Update
We are seeing traffic to all xMatters instances and continue to monitor. Some instances may experience increased latency.
Posted Nov 16, 2021 - 10:27 PST
Update
We are continuing to monitor for any further issues.
Posted Nov 16, 2021 - 10:20 PST
Monitoring
The xMatters Incident Response team is seeing some instances recovering. xMatters engineering is monitoring the situation to ensure that the system is stable and that all services are restored.
Posted Nov 16, 2021 - 10:12 PST
Identified
We are currently tracking a problem with our cloud provider and are working directly with them to resolve the issue. We will provide updates as soon as we know more. This outage is impacting multiple services across the internet.
Posted Nov 16, 2021 - 10:08 PST
Investigating
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for clients in All Regions. We are currently investigating the issue and will update as information becomes available.

Please see incident details for specific services impacted.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support team is waiting to help.
Posted Nov 16, 2021 - 09:55 PST
This incident affected: Europe, Middle East, and Africa (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App), Asia Pacific (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App), and North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).