Issue Discovered - Service disruption in North American Region – Web User Interface
Incident Report for xMatters
Postmortem

What happened?

On April 14, 2023, at approximately 9:05 AM Pacific, some customers reported to xMatters Customer Support that users were encountering errors when attempting to log in to the xMatters web interface. During the incident, some customers in North America may have experienced 503 errors when attempting to access or use xMatters, or encountered errors with integrations that communicated with the xMatters API. These errors were intermittent and impacted only a subset of customers whose primary instance was based in the us-east data center. Customers in the EMEA and APAC regions, and in other North American data centers, were not impacted.

Why did it happen?

The issue was caused when a customer inadvertently initiated a denial-of-service attack by launching an excessive number of API requests. The incoming request rate peaked at over 90,000 per minute and overwhelmed the capacity of edge systems to manage the volume, causing a cascade that eventually blocked access to API endpoints and triggered 503 errors for systems that rely on them.

How did we respond?

xMatters monitoring systems alerted to the issue just before customers reported encountering errors. xMatters Customer Support confirmed the issue and initiated the major incident management process. The incident response teams determined that the best course of action was to promote impacted customers to unaffected regions and mitigate the inbound traffic by redirecting it away from critical systems. Once the traffic was mitigated, impacted systems were able to recover and customers were migrated back to their original data centers.

A status page notification was posted to status.xmatters.com, but due to the limited scope and intermittent impact, it was classified as a degraded service. By design, this classification does not send email to status page subscribers.

What are we doing to prevent it from happening again?

xMatters Engineering has determined that additional protections are needed at entry points to identify any excessive inbound volume and allow for quick mitigation. The teams are in the process of determining the best parameters and implementation of these protections to address both intentional and unintentional denial-of-service incidents.
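
As an illustration only, one common form of entry-point protection is a per-client token bucket rate limiter. The sketch below is not the xMatters implementation, which had not yet been determined at the time of writing. The class names, the client identifier, and the rate and burst values are all hypothetical and chosen only to show the technique.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Token bucket: refills at a steady rate, allows bursts up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start with a full bucket
        self.last = clock()         # time of the last refill

    def allow(self) -> bool:
        """Return True if one request may pass, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last
        # Add tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Hypothetical helper: keeps one bucket per client identifier."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_sec, self.capacity, self.clock)
            self._buckets[client_id] = bucket
        return bucket.allow()
```

In a sketch like this, an edge component would call limiter.allow(client_id) for each inbound API request and reject the request (for example, with an HTTP 429 response) when it returns False, so that a single client sending excessive volume is throttled without affecting others.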

Timeline:

Friday, April 14, 2023

9:00 AM - Customers report 503 issues

9:06 AM - xMatters Customer Support initiates Severity-1 incident

9:15 AM - Investigation reveals high volume to us-east

9:25 AM - Source of volume identified; begin routing customers to other regions

9:45 AM - Routing changes complete

10:17 AM - Incident mitigated

If you have any questions, please visit http://support.xmatters.com

Posted Apr 25, 2023 - 09:26 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Apr 14, 2023 - 10:17 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Apr 14, 2023 - 09:46 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Apr 14, 2023 - 09:24 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Apr 14, 2023 - 09:23 PDT
Investigating
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new. Our support agents are waiting to help.
Posted Apr 14, 2023 - 09:15 PDT
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).