What happened?
On April 3, 2025, at approximately 9:32 AM Pacific, xMatters internal monitoring systems detected that the system was not processing events initiated via Flow Designer across multiple regions. While the issue was in progress, customers may have observed that events were not processed and alerts were not created.
Why did it happen?
The issue occurred when a routine update to add new permissions to the xMatters Google Cloud Platform (GCP) environment unexpectedly removed required permissions. The update should not have had any impact on customers, but the policy used in the automation script was authoritative at the GCP project level rather than at the individual resource level. The teams tested the change before deploying it and found no changes beyond those included in the update. When the update was deployed to production, however, GCP removed, in the background, all permissions that were not in the policy. Because the policy included only the new permissions, all other permissions were removed.
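The difference between an authoritative update and an additive one can be illustrated with a minimal sketch. This is a simplified model and not the actual xMatters automation script or the GCP API: the function names, role names, and service account names below are hypothetical, chosen only to show why a policy that lists only the new permissions removes everything else when it is applied authoritatively.

```python
# Hypothetical model: a policy maps a role to the set of members holding it.
Policy = dict[str, set[str]]


def apply_authoritative(current: Policy, desired: Policy) -> Policy:
    """Replace the whole policy: any binding absent from `desired` is dropped."""
    return {role: set(members) for role, members in desired.items()}


def apply_additive(current: Policy, desired: Policy) -> Policy:
    """Merge `desired` into the existing policy, leaving other bindings intact."""
    merged = {role: set(members) for role, members in current.items()}
    for role, members in desired.items():
        merged.setdefault(role, set()).update(members)
    return merged


# Illustrative state before the update (names are assumptions).
existing: Policy = {
    "roles/pubsub.publisher": {"serviceAccount:flow-designer@example"},
    "roles/compute.serviceAgent": {"serviceAccount:google-managed@example"},
}

# The update contains only the new permission.
update: Policy = {
    "roles/storage.objectViewer": {"serviceAccount:new-service@example"},
}

after_authoritative = apply_authoritative(existing, update)
after_additive = apply_additive(existing, update)

# Roles that existed before but are gone after the authoritative apply.
removed_roles = sorted(set(existing) - set(after_authoritative))
print(removed_roles)
```

In this model the authoritative apply leaves only the new role in place, while the additive apply keeps the two pre-existing roles alongside it.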
How did we respond?
As soon as the xMatters monitoring tools reported that the system was not processing events, the incident response teams initiated the internal Major Incident Management process and engaged the Engineering and Support teams. The teams quickly identified the recent update as the root cause and reverted the change, restoring the permissions that were managed by the automation process. This restored access and functionality for most of the xMatters services, but restoring permissions for Flow Designer proved to be more complicated.
The teams determined that the permissions still missing from the xMatters infrastructure were Google-generated permissions essential for specific xMatters services, and engaged GCP Support to aid in the investigation. The teams generated a list of all permissions that existed prior to the update and designed a fix to re-apply them, starting with the development environment. Once the teams had implemented the change and validated that the missing permissions and all services had been restored in the development environment, they quickly applied the fix across all staging and production environments. Monitoring tools and customers confirmed that services were fully functional, and the teams continued to monitor the system as it processed all messages queued by the Flow Designer services. The system was fully restored at 2:46 PM Pacific.
What are we doing to prevent it from happening again?
The Engineering teams have identified all of the permissions created in the xMatters environment, including those created by Google, and are ensuring they are added to the management scripts. The teams are also adding rigor to how these types of infrastructure changes are applied: idempotence tests will run after each change is applied to confirm that no changes remain pending. A change that fails this test will cause failures in development environments, which would catch the problem and prevent a similar issue from occurring.
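One way to picture an idempotence test is as a check that, after a change is applied, comparing the environment against the management scripts yields nothing left to add or remove. The sketch below is an assumption about the general shape of such a test, not the teams' actual tooling; the helper names and the example bindings are hypothetical.

```python
# Hypothetical model: a policy maps a role to the set of members holding it.
Policy = dict[str, set[str]]


def flatten(policy: Policy) -> set[tuple[str, str]]:
    """Turn a policy into a set of (role, member) pairs for easy comparison."""
    return {(role, member) for role, members in policy.items() for member in members}


def pending_changes(actual: Policy, desired: Policy) -> dict[str, set[tuple[str, str]]]:
    """Return what a further apply would still add to or remove from `actual`."""
    have, want = flatten(actual), flatten(desired)
    return {"add": want - have, "remove": have - want}


def is_idempotent(actual: Policy, desired: Policy) -> bool:
    """True when re-applying `desired` would change nothing."""
    pending = pending_changes(actual, desired)
    return not pending["add"] and not pending["remove"]


# Illustrative environment state after a change (names are assumptions).
environment: Policy = {
    "roles/pubsub.publisher": {"serviceAccount:flow-designer@example"},
    "roles/compute.serviceAgent": {"serviceAccount:google-managed@example"},
}

# Management scripts that include the Google-generated binding: nothing pending.
scripts_complete: Policy = {
    "roles/pubsub.publisher": {"serviceAccount:flow-designer@example"},
    "roles/compute.serviceAgent": {"serviceAccount:google-managed@example"},
}

# Management scripts that omit it: a re-apply would still remove a binding.
scripts_incomplete: Policy = {
    "roles/pubsub.publisher": {"serviceAccount:flow-designer@example"},
}

complete_ok = is_idempotent(environment, scripts_complete)
incomplete_ok = is_idempotent(environment, scripts_incomplete)
would_remove = pending_changes(environment, scripts_incomplete)["remove"]
print(complete_ok, incomplete_ok)
```

In this model a failed check in a development environment flags the Google-generated binding that the scripts would otherwise remove.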