Issue Discovered - Service disruption in North American Region - Multiple Services
Incident Report for xMatters
Postmortem

What happened?

On August 3, 2023, at approximately 4:00 PM Pacific, xMatters internal monitoring detected an issue where some customers in the US-EAST data center experienced a blank screen with a "We've run into a problem..." error while attempting to log in to their instances. This incident only impacted login to the web user interface and did not affect notification processing.

Why did it happen?

An unusually high volume of user delivery requests caused connection timeouts when the xMatters API service attempted to access the historical data storage service. Because the API service is a critical component of user login processing, these timeouts left some customers unable to log in to their instances. Further investigation revealed that the automated resource sizing for the data storage service was unable to compensate for the temporary increase in request load.

How did we respond?

As soon as xMatters Customer Support confirmed the issue, they escalated it to the xMatters Engineering teams. xMatters Engineering isolated the issue and restarted the storage service. The restart dropped all pending connection requests, which allowed the service to recover. However, some event requests may have had to retry in order to complete, which may have delayed some event processing.

What are we doing to prevent it from happening again?

xMatters Engineering has started a review of the existing data storage service to better handle periods of unexpectedly high usage. This includes reviewing new throttling options and improving our ability to speed up recovery through manual intervention.

Timeline:

Date/Time       Action

August 3, 2023

4:00 PM PT      Internal monitoring detects login failures.
4:04 PM PT      Severity-1 Incident raised.
4:19 PM PT      Issue identified - increased error rate for xM-API.
4:26 PM PT      Data storage service restarted.
4:35 PM PT      Instances recovering.
4:48 PM PT      Incident resolved.

If you have any questions, please visit http://support.xmatters.com.

Posted Aug 15, 2023 - 09:20 PDT

Resolved
This incident has been resolved.
Posted Aug 03, 2023 - 17:12 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 03, 2023 - 16:30 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Aug 03, 2023 - 16:29 PDT
Investigating
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.

Please see incident details for specific services impacted.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Aug 03, 2023 - 16:17 PDT
This incident affected: North America (Web Interface, Integration Platform, API).