On August 3, 2023, at approximately 4:00 PM Pacific, xMatters internal monitoring detected an issue where some customers in the US-EAST data center experienced a blank screen with a "We've run into a problem..." error while attempting to log in to their instances. This incident only impacted login to the web user interface and did not affect notification processing.
An unusually high volume of user delivery requests caused connection timeouts when the xMatters API service attempted to access the historical data storage service. Since the API service is a critical component in user login processing, the connection timeouts resulted in some customers being unable to log in to their instances. Further investigation revealed that automated sizing of resources for the data storage service was unable to mitigate the temporary increase in request load.
As soon as xMatters Customer Support confirmed the issue, they escalated it to the xMatters Engineering teams. xMatters Engineering isolated the issue and restarted the data storage service. The restart dropped all pending connection requests, which allowed the service to recover; however, some event requests may have had to retry in order to complete, which may have delayed some event processing.
xMatters Engineering has started a review of the existing data storage service to better handle unexpected spikes in usage. This includes reviewing new throttling options and improving our ability to speed up recovery through manual intervention.
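For readers unfamiliar with request throttling, one common general technique is a token bucket, which admits short bursts but rejects requests that exceed a sustained rate. The sketch below is a generic Python illustration only. It is not xMatters code and does not describe the throttling options under review; the class name, parameters, and rates are hypothetical.

```python
import time


class TokenBucket:
    """Generic token-bucket throttle (illustrative only; not xMatters code)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens added back per second
        self.capacity = float(capacity)   # maximum burst size
        self.tokens = float(capacity)     # start with a full bucket
        self.clock = clock                # injectable clock, useful for testing
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In a sketch like this, a caller would check `allow()` before forwarding each request to a backend and return a "retry later" response when it is False, so that a burst is shed early instead of queuing until connections time out.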
August 3, 2023
4:00 PM PT Internal monitoring detects login failures.
4:04 PM Severity-1 Incident raised.
4:19 PM Issue identified - increased error rate for xM-API.
4:26 PM Data storage service restarted.
4:35 PM Instances recovering.
4:48 PM Incident resolved.
If you have any questions, please visit http://support.xmatters.com.