On October 22, 2020, at approximately 9:45 AM Pacific, internal monitoring tools alerted xMatters Customer Support to an issue impacting xMatters database storage services. During the incident, some customers reported not being able to access the xMatters user interface. This impacted some customers in North America for approximately 20 minutes; events processed normally and notifications were not affected.
The investigation revealed a loss of network connectivity between two xMatters components, specifically the xMatters API service and the analytics database, which led to an inability to service login requests. These connectivity issues caused the xMatters API to fail to reconnect with the database. The loss of connectivity to the analytics database had a cascading effect that impacted queries against a small subset of customer databases and access to the xMatters web user interface. The incident investigation determined that the xMatters API was able to create connections to the database but was unable to complete some queries. This condition resulted in a backlog of connection requests, which eventually impacted the xMatters web user interface.
xMatters engineering restarted the API service as part of the investigation into the cause of the errors. After the restart, xMatters Customer Support confirmed there was still an issue accessing the xMatters web user interface and initiated a Severity-1 incident. The incident response team gathered and promoted the impacted instances to redundant architecture. Once that was complete, customers were able to log in to xMatters without issue. The connectivity errors cleared without xMatters intervention after the load was removed from the impacted services.
Once the impact was mitigated, the connection issue was resolved. We expect this to be a one-time occurrence with a very low likelihood of recurring; however, we are taking additional steps to improve the resiliency of the retry logic in case a future connection failure occurs. Additional monitoring has been added to alert the team to similar conditions, which will allow proactive measures to be taken before customers are impacted.
Date & Time PDT
October 21, 2020 - 09:45 - Some customer instances begin reporting errors
October 22, 2020 - 00:45 - Rolling restart of API Service
October 22, 2020 - 00:50 - Login errors identified, Severity 1 Incident called
October 22, 2020 - 00:53 - Impacted instances routed to redundant architecture
October 22, 2020 - 01:06 - Impact mitigated
October 22, 2020 - 01:27 - Incident verified as resolved
If you have any questions, please visit http://support.xmatters.com