Issue Discovered - Service disruption in All Regions – Integration Platform

Incident Report for xMatters

Postmortem

What happened?

On April 3, 2025, at approximately 9:32 AM Pacific, the xMatters internal monitoring systems identified an issue where the system was not processing events initiated via Flow Designer across multiple regions. Customers may have observed the system not processing events or creating alerts while the issue was in progress.

Why did it happen?

The issue occurred when a routine update intended to add new permissions to the xMatters Google Cloud Platform (GCP) environment unexpectedly removed required permissions. The update should not have had any impact on customers, but the policy used in the automation script was authoritative at the GCP project level rather than at the individual resource level. The teams tested the change before deploying it and found no changes beyond what was included in the update. When the update was deployed to production, however, GCP removed, in the background, all permissions that were not in the policy. Because the policy included only the new permissions, all other permissions were removed.
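The difference between an authoritative and an additive policy update can be illustrated with a minimal sketch. This is a toy in-memory model, not the GCP API or the automation script used by xMatters; the function names, role bindings, and member names are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set

# A policy is modeled as a mapping from role name to the members holding it.
Bindings = Dict[str, Set[str]]


def apply_authoritative(existing: Bindings, desired: Bindings) -> Bindings:
    """Replace the whole policy: anything absent from `desired` is dropped."""
    return {role: set(members) for role, members in desired.items()}


def apply_additive(existing: Bindings, desired: Bindings) -> Bindings:
    """Merge `desired` into the existing policy, leaving other bindings alone."""
    merged = {role: set(members) for role, members in existing.items()}
    for role, members in desired.items():
        merged.setdefault(role, set()).update(members)
    return merged


# Hypothetical state before the update, including a provider-generated binding.
existing = {
    "roles/pubsub.publisher": {"serviceAccount:app@example"},
    "roles/pubsub.subscriber": {"serviceAccount:google-generated-agent@example"},
}
# Hypothetical update that only lists the new permission.
update = {"roles/storage.objectViewer": {"serviceAccount:app@example"}}

after_authoritative = apply_authoritative(existing, update)
after_additive = apply_additive(existing, update)

print(sorted(after_authoritative))  # only the new role survives
print(sorted(after_additive))       # old roles are kept, new role is added
```

In this model the authoritative update leaves only the role named in the update, which mirrors the behavior described above, while the additive update preserves every earlier binding.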

How did we respond?

As soon as the xMatters monitoring tools reported an issue with the system not processing events, the incident response teams initiated the internal Major Incident Management process and engaged the Engineering and Support teams. The teams were able to quickly identify the recent update as the root cause of the issue and reverted the change to restore the permissions that were managed by the automation process. This restored access and functionality for most of the xMatters services, but restoring permissions for Flow Designer proved to be more complicated.

The teams determined that the permissions still missing from the xMatters infrastructure were Google-generated permissions essential for specific xMatters services, and engaged GCP Support to aid in the investigation. The teams generated a list of all permissions that existed prior to the update and designed a fix to re-apply them, starting with the development environment. Once the teams had implemented the change and validated that the missing permissions and all services had been restored in the development environment, they moved quickly to apply the fix across all staging and production environments. Monitoring tools and customers confirmed that services were fully functional, and the teams continued to monitor the system as it processed all messages queued by the Flow Designer services. The system was fully restored at 2:46 PM Pacific.
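The recovery approach described above, comparing a pre-update list of permissions with the current state and re-applying what is missing, might look roughly like the following sketch. The snapshot contents and function names are assumptions made for illustration; the actual tooling and data used by the teams are not described in this report.

```python
from typing import Dict, Set

Bindings = Dict[str, Set[str]]


def missing_bindings(snapshot: Bindings, current: Bindings) -> Bindings:
    """Return role/member pairs present in `snapshot` but absent from `current`."""
    missing: Bindings = {}
    for role, members in snapshot.items():
        gone = members - current.get(role, set())
        if gone:
            missing[role] = gone
    return missing


def restore(snapshot: Bindings, current: Bindings) -> Bindings:
    """Re-add everything from `snapshot` without removing newer bindings."""
    restored = {role: set(members) for role, members in current.items()}
    for role, members in missing_bindings(snapshot, current).items():
        restored.setdefault(role, set()).update(members)
    return restored


# Hypothetical snapshot taken before the update and the state after it.
snapshot = {
    "roles/pubsub.publisher": {"serviceAccount:app@example"},
    "roles/pubsub.subscriber": {"serviceAccount:google-generated-agent@example"},
}
current = {"roles/storage.objectViewer": {"serviceAccount:app@example"}}

to_reapply = missing_bindings(snapshot, current)
restored = restore(snapshot, current)
```

Validating `restored` in a development environment first, as the teams did, limits the risk of the fix itself causing further disruption.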

What are we doing to prevent it from happening again?

The Engineering teams identified all of the permissions created in the xMatters environment, including those created by Google, and are adding them to the management scripts. The teams are also adding rigor to how these types of infrastructure changes are applied: idempotence tests will run after each change to confirm that no further changes are pending. A change that fails this test will cause failures in development environments, which would catch a similar issue and prevent it from recurring.
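An idempotence test of this kind can be sketched as follows: apply every step, then check whether any step would still change the resulting state. This is a simplified model under stated assumptions; the step functions, role names, and members are hypothetical and do not represent the xMatters management scripts.

```python
from typing import Callable, Dict, List, Set

Bindings = Dict[str, Set[str]]
Step = Callable[[Bindings], Bindings]


def copy_bindings(bindings: Bindings) -> Bindings:
    return {role: set(members) for role, members in bindings.items()}


def set_policy(desired: Bindings) -> Step:
    """Authoritative step: the result is exactly `desired`."""
    return lambda current: copy_bindings(desired)


def add_member(role: str, member: str) -> Step:
    """Additive step: grant one member one role, leaving the rest untouched."""
    def step(current: Bindings) -> Bindings:
        updated = copy_bindings(current)
        updated.setdefault(role, set()).add(member)
        return updated
    return step


def apply_all(steps: List[Step], current: Bindings) -> Bindings:
    for step in steps:
        current = step(current)
    return current


def pending_changes(steps: List[Step], current: Bindings) -> List[int]:
    """Indexes of steps that would still modify `current` if run again."""
    return [i for i, step in enumerate(steps) if step(current) != current]


def is_idempotent(steps: List[Step], start: Bindings) -> bool:
    """True when nothing is pending after all steps have been applied once."""
    return pending_changes(steps, apply_all(steps, start)) == []


# An authoritative step combined with an additive one never settles:
# after applying both, the authoritative step still wants to remove the addition.
conflicting = [
    set_policy({"roles/viewer": {"user:a@example"}}),
    add_member("roles/editor", "serviceAccount:b@example"),
]
# Purely additive steps reach a stable state after one pass.
consistent = [
    add_member("roles/viewer", "user:a@example"),
    add_member("roles/editor", "serviceAccount:b@example"),
]
```

In this model, `is_idempotent(conflicting, {})` is false because the authoritative step remains pending, which is the kind of signal that would fail a development pipeline before the change is promoted.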

Posted Apr 07, 2025 - 11:55 PDT

Resolved

This incident has been resolved.
Posted Apr 03, 2025 - 14:45 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 03, 2025 - 14:18 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is currently testing a fix. We will provide another update shortly.
Posted Apr 03, 2025 - 14:13 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Apr 03, 2025 - 14:00 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Apr 03, 2025 - 13:59 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is still actively working on a fix; there is no current estimate on resolution time. We will update once a solution has been implemented.
Posted Apr 03, 2025 - 13:26 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is still actively working on a fix; there is no current estimate on resolution time. We will update once a solution has been implemented.
Posted Apr 03, 2025 - 12:56 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is still actively working on a fix; there is no current estimate on resolution time. We will update once a solution has been implemented.
Posted Apr 03, 2025 - 12:26 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is still actively working on a fix; there is no current estimate on resolution time. We will update once a solution has been implemented.
Posted Apr 03, 2025 - 12:00 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is actively working on a fix; there is no current estimate on resolution time. We will update once a solution has been implemented.
Posted Apr 03, 2025 - 11:44 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is still working on a fix. We will update once a solution has been identified and implemented.
Posted Apr 03, 2025 - 11:28 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Apr 03, 2025 - 11:01 PDT

Update

The xMatters Incident Response team has identified the source of the issue and is still working on a fix. We will update once a solution has been identified and implemented.
Posted Apr 03, 2025 - 10:46 PDT

Identified

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Apr 03, 2025 - 10:17 PDT

Investigating

xMatters monitoring tools have identified a potential issue with the xMatters Integration Platform for some clients in all regions. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Apr 03, 2025 - 10:15 PDT
This incident affected: Europe, Middle East, and Africa (Integration Platform), Asia Pacific (Integration Platform), and North America (Integration Platform).