On July 17, 2019, at approximately 1:45 PM PDT, the xMatters internal monitoring systems alerted Customer Support to an issue with potentially unresponsive customer instances. Shortly afterwards, some customers reported that they were unable to reach their xMatters instances or were encountering a 503 error when attempting to log in to the xMatters web user interface. Some customers may also have noticed a brief delay in notification delivery (less than five minutes).
This incident occurred when a customer used the web user interface to delete a very large group from their instance while that group was being targeted for notification by an active event. The deletion became a long-running database request that rapidly consumed all available processing resources. While the service waited for that request to complete, other processes on the same database cluster were blocked, causing instances hosted on that cluster to become unresponsive.
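For illustration only: one common way to keep a bulk deletion from holding database locks for the entire operation is to remove rows in small batches and commit between batches, so concurrent work (such as notification queries) can proceed. The sketch below shows that general pattern using a Python DB-API connection with PostgreSQL-style placeholders; the table and column names, batch size, and connection are hypothetical and do not describe xMatters' actual schema or remediation.

```python
# Illustrative only: batched deletion keeps each transaction short so a single
# large delete cannot block other sessions for its entire duration.
# Table/column names and the connection are hypothetical placeholders,
# not xMatters' actual schema or code.

def delete_group_members_in_batches(conn, group_id, batch_size=1000):
    """Delete a group's membership rows in small, separately committed batches."""
    deleted_total = 0
    while True:
        cur = conn.cursor()
        # Delete at most `batch_size` rows per transaction; locks are released
        # at each commit, letting concurrent queries against the table proceed.
        cur.execute(
            "DELETE FROM group_members "
            "WHERE id IN (SELECT id FROM group_members "
            "             WHERE group_id = %s LIMIT %s)",
            (group_id, batch_size),
        )
        deleted = cur.rowcount
        conn.commit()
        deleted_total += deleted
        if deleted < batch_size:
            break  # nothing left to delete
    return deleted_total
```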
As soon as the internal monitoring systems alerted Customer Support to an issue with customer instances, Customer Support confirmed the problem and declared a Severity-1 incident. The incident response teams began their investigation immediately, even as the affected services began to recover automatically. System records and reports showed that a single database cluster had been consuming processing resources at an exceptionally high rate, and the teams traced the problem to a customer-initiated deletion request, made during an active event, that caused a brief database lock. The automatic recovery process and redundant service architecture restored service quickly, and once the customer's active event completed, system performance returned to normal levels. All services were restored, and the teams continued to investigate the root cause while manually clearing any remaining deadlocks.
The incident was quickly mitigated by the redundant service architecture and automated recovery capabilities of the xMatters On-Demand service, and all services have been restored. The teams have confirmed that all affected database clusters are operating at optimal performance levels and that no deadlocks remain.
To determine the best way to prevent similar issues, the Engineering teams responsible for the affected services and database performance are investigating the issue and reproducing the problem on internal testing systems. Once they have completed a full evaluation of the conditions that occurred during this incident, they will implement any necessary changes to the service through the internal development and testing procedures. While this work is underway, the Customer Support and Engineering teams have implemented additional monitoring checks to notify the appropriate teams about potential deadlocks so they can respond before customers are impacted.
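As a rough illustration of what such a check might look like (the report does not describe xMatters' actual monitoring implementation), the sketch below periodically polls a PostgreSQL-style pg_stat_activity view for sessions that have been waiting on locks longer than a threshold and raises an alert. The connection string, threshold, polling interval, and alert hook are placeholders, and the underlying database technology is an assumption made only for the example.

```python
# Illustrative only: a periodic check for sessions blocked on database locks.
# Assumes a PostgreSQL-style catalog (pg_stat_activity); the DSN, threshold,
# and alert hook are placeholders, not xMatters' actual monitoring code.
import time
import psycopg2

BLOCKED_THRESHOLD_SECONDS = 30
DSN = "dbname=app user=monitor host=db.example.internal"  # placeholder

def find_blocked_sessions(conn, threshold_seconds):
    """Return sessions that have been waiting on a lock longer than the threshold."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT pid, usename, state, wait_event_type, query,
                   now() - query_start AS waiting_for
            FROM pg_stat_activity
            WHERE wait_event_type = 'Lock'
              AND now() - query_start > make_interval(secs => %s)
            """,
            (threshold_seconds,),
        )
        return cur.fetchall()

def main():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    while True:
        blocked = find_blocked_sessions(conn, BLOCKED_THRESHOLD_SECONDS)
        if blocked:
            # Placeholder alert hook: in practice this would notify the on-call team.
            print(f"ALERT: {len(blocked)} session(s) blocked on locks: {blocked}")
        time.sleep(60)  # poll once a minute

if __name__ == "__main__":
    main()
```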
July 17, 2019: