Issue Discovered - Service disruption in North America Region – Web User Interface
Incident Report for xMatters
Postmortem

What happened?

On October 22, 2020, at approximately 9:45 AM Pacific, internal monitoring tools alerted xMatters Customer Support to an issue impacting xMatters database storage services. During the incident, some customers reported not being able to access the xMatters user interface. This impacted some customers in North America for approximately 20 minutes; events processed normally and notifications were not affected.

Why did it happen?

The investigation revealed a loss of network connectivity between two xMatters components, specifically the xMatters API service and the analytics database, which led to the inability to service login requests. These connectivity issues led to a failure of the xMatters API to reconnect with the database. This loss of connectivity to the analytics database had a cascading effect that impacted the querying of a small subset of customer databases and access to the xMatters web user interface. The incident investigation determined that the xMatters API was able to create connections to the database but was unable to complete some queries. This condition resulted in a backlog of connection requests, which eventually impacted the xMatters web user interface.

How did we respond?

xMatters engineering restarted the API service as part of the investigation into the cause of the errors. After the restart, xMatters Customer Support confirmed there was still an issue accessing the xMatters web user interface and initiated a Severity-1 incident. The incident response team gathered and promoted impacted instances to redundant architecture. Once that was complete, customers were able to log in to xMatters without issue. The connectivity errors cleared without xMatters intervention after the load was removed from the impacted services.

What are we doing to prevent it from happening again?

Once mitigated, the connection issue was resolved. We expect the issue to be a one-time occurrence with a very low likelihood of recurring; however, we are taking additional steps to improve the resiliency of the retry logic in case a future connection failure occurs. Additional monitoring has been added to alert the team to similar conditions, which will allow proactive measures to be taken before customers are impacted.
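
For readers interested in what more resilient retry logic can look like in general, the following is a minimal illustrative sketch, not the xMatters implementation. It retries a failing operation a bounded number of times with capped exponential backoff and jitter, then gives up so that requests fail fast rather than accumulating in a backlog. The names TransientConnectionError and call_with_retry, and all parameter values, are hypothetical and chosen only for illustration.

```python
import random
import time


class TransientConnectionError(Exception):
    """Hypothetical error standing in for a dropped or stalled database connection."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    sleep=time.sleep, rng=random.random):
    """Run operation(), retrying transient failures with capped exponential backoff.

    sleep and rng are injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientConnectionError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller fails fast
                # instead of queueing further work behind a broken dependency.
                raise
            # Capped exponential delay, scaled by a random factor (jitter)
            # so that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1)) * rng()
            sleep(delay)
```

Bounding the number of attempts is the important design choice here: unbounded retries against a dependency that accepts connections but cannot complete queries tend to add load rather than relieve it.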

Timeline:

Date & Time PDT

October 21, 2020 - 09:45 - Some customer instances begin reporting errors

October 22, 2020 - 00:45 - Rolling restart of API Service
October 22, 2020 - 00:50 - Login errors identified, Severity 1 Incident called
October 22, 2020 - 00:53 - Impacted instances routed to redundant architecture
October 22, 2020 - 01:06 - Impact mitigated
October 22, 2020 - 01:27 - Incident verified as resolved

If you have any questions, please visit http://support.xmatters.com

Posted Nov 04, 2020 - 09:19 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Oct 22, 2020 - 02:42 PDT
Monitoring
Customers may continue to experience some slowness as the incident team continues to implement the fixes for this issue. Events are being processed; however, some customers may experience delays. We are currently monitoring the situation to ensure the implementation is stable and services are restored.
Posted Oct 22, 2020 - 01:50 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Oct 22, 2020 - 01:25 PDT
Update
We are continuing to investigate this issue.
Posted Oct 22, 2020 - 01:16 PDT
Investigating
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients in All Regions. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Oct 22, 2020 - 01:11 PDT
This incident affected: North America (Web Interface).