What happened?
On November 21, 2025, at 12:50 PM UTC, xMatters internal monitoring tools detected irregular behavior in how internal traffic was being routed. Some customers in the APAC region communicating with services in North America (specifically US-East) may have encountered intermittent request failures or increased latency. Only traffic between these two regions was affected; all other systems and regions continued to operate normally.
Why did it happen?
A temporary network disruption between Australia Southeast and US-East caused one internal routing node in Australia to lose accurate information about available backend systems in US-East. The node generated an incomplete routing configuration and temporarily stopped directing traffic to US-East. Under normal circumstances, routing updates refresh automatically when connectivity returns. In this case, the affected node did not recover cleanly and remained in a stale state until Engineering intervened.
How did we respond?
As soon as Engineering was alerted through internal monitoring, they engaged the platform engineering team, service owners, and Customer Support to launch an investigation. The teams reached out to impacted customers to validate issue symptoms and restarted routing components in both affected regions to force a configuration refresh. Once the restart completed, routing returned to normal while the teams continued to monitor and investigate the root cause. They confirmed that only one routing node and specific cross-region traffic were impacted.
What are we doing to prevent it from happening again?
While teams were mitigating this issue, they created new alerting rules to detect the routing patterns they observed during the incident and expanded internal monitoring to help identify when routing nodes fail to refresh their configuration or otherwise enter a ‘stale’ state. The teams have also planned and prepared infrastructure updates that will further reduce the risk of similar issues. These include improved configuration recovery behavior, enhanced stability for routing components, and additional logging and observability improvements for diagnosing routing anomalies. They will deploy these updates once the current code freeze window has elapsed.
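
For readers who want a concrete picture of what a ‘stale’ routing node check could look like, the following is a minimal illustrative sketch only. It is not the xMatters implementation: the names NodeStatus and stale_reasons, the region labels, and the five-minute threshold are all hypothetical assumptions chosen for the example. It flags a node when its configuration has not refreshed recently or when it lists no backends for a region it is expected to serve.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Set


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one routing node (illustrative only)."""
    name: str
    last_refresh: datetime              # when the node last rebuilt its routing config
    backends_by_region: Dict[str, int]  # number of known backends per target region


def stale_reasons(
    node: NodeStatus,
    expected_regions: Set[str],
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # assumed threshold, not a real setting
) -> List[str]:
    """Return the reasons a node looks stale; an empty list means it looks healthy."""
    reasons = []
    # Signal 1: the routing configuration has not been refreshed recently.
    if now - node.last_refresh > max_age:
        reasons.append("config_not_refreshed")
    # Signal 2: the node knows of no backends in a region it should route to.
    for region in sorted(expected_regions):
        if node.backends_by_region.get(region, 0) == 0:
            reasons.append(f"no_backends:{region}")
    return reasons


if __name__ == "__main__":
    now = datetime(2025, 11, 21, 13, 0, tzinfo=timezone.utc)
    node = NodeStatus(
        name="au-node-1",
        last_refresh=now - timedelta(minutes=10),
        backends_by_region={"au-southeast": 4, "us-east": 0},
    )
    print(stale_reasons(node, {"au-southeast", "us-east"}, now))
```

Checking both signals is a deliberate choice in this sketch: a node could refresh on schedule yet still produce an incomplete configuration, so an age check alone might miss the condition described above.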