On July 17, 2019, at approximately 1:45 PM PDT, the xMatters internal monitoring systems alerted Customer Support to an issue with potentially unresponsive customer instances. Shortly afterwards, some customers reported that they were unable to reach their xMatters instances or were encountering a 503 error when attempting to log in to the xMatters web user interface. Some customers may also have noticed a brief delay in notification delivery (less than five minutes).
This incident occurred when a customer used the web user interface to delete a very large group from their instance while that group was being targeted for notification by an active event. The deletion became a long-running database request that rapidly consumed all available processing resources. While the service waited for that request to complete, other processes on the same database cluster were blocked, causing instances hosted on that cluster to become unresponsive.
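For illustration only: one common way to keep a bulk deletion from holding database locks for the entire operation is to remove rows in small batches and commit between batches, so concurrent work (such as notification queries) can proceed. The sketch below shows that general pattern using a Python DB-API connection with PostgreSQL-style placeholders; the table and column names, batch size, and connection are hypothetical and do not describe xMatters' actual schema or remediation.

```python
# Illustrative only: batched deletion keeps each transaction short so a single
# large delete cannot block other sessions for its entire duration.
# Table/column names and the connection are hypothetical placeholders,
# not xMatters' actual schema or code.

def delete_group_members_in_batches(conn, group_id, batch_size=1000):
    """Delete a group's membership rows in small, separately committed batches."""
    deleted_total = 0
    while True:
        cur = conn.cursor()
        # Delete at most `batch_size` rows per transaction; locks are released
        # at each commit, letting concurrent queries against the table proceed.
        cur.execute(
            "DELETE FROM group_members "
            "WHERE id IN (SELECT id FROM group_members "
            "             WHERE group_id = %s LIMIT %s)",
            (group_id, batch_size),
        )
        deleted = cur.rowcount
        conn.commit()
        deleted_total += deleted
        if deleted < batch_size:
            break  # nothing left to delete
    return deleted_total
```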
As soon as the internal monitoring systems alerted Customer Support to an issue with customer instances, Customer Support confirmed the problem and declared a Severity-1 incident. The incident response teams began their investigation immediately, even as the affected services began to recover automatically. System records and reports showed that a single database cluster had been consuming processing resources at an exceptionally high rate, and the teams traced the problem to a customer-initiated deletion request, made during an active event, that caused a brief database lock. The automatic recovery process and redundant service architecture restored service quickly, and once the customer's active event completed, system performance returned to normal levels. All services were restored, and the teams continued to investigate the root cause while manually clearing any remaining deadlocks.
The incident was quickly mitigated by the redundant service architecture and automated recovery capabilities of the xMatters On-Demand service, and all services have been restored. The teams have confirmed that all affected database clusters are operating at optimal performance levels and that no deadlocks remain.
To determine the best way to prevent similar issues, the Engineering teams responsible for the affected services and database performance are investigating the issue and reproducing the problem on internal testing systems. Once they have completed a full evaluation of the conditions that occurred during this incident, they will implement any necessary changes to the service through the internal development and testing procedures. While this work is underway, the Customer Support and Engineering teams have implemented additional monitoring checks to notify the appropriate teams about potential deadlocks so they can respond before customers are impacted.
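As a rough illustration of what such a check might look like (the report does not describe xMatters' actual monitoring implementation), the sketch below periodically polls a PostgreSQL-style pg_stat_activity view for sessions that have been waiting on locks longer than a threshold and raises an alert. The connection string, threshold, polling interval, and alert hook are placeholders, and the underlying database technology is an assumption made only for the example.

```python
# Illustrative only: a periodic check for sessions blocked on database locks.
# Assumes a PostgreSQL-style catalog (pg_stat_activity); the DSN, threshold,
# and alert hook are placeholders, not xMatters' actual monitoring code.
import time
import psycopg2

BLOCKED_THRESHOLD_SECONDS = 30
DSN = "dbname=app user=monitor host=db.example.internal"  # placeholder

def find_blocked_sessions(conn, threshold_seconds):
    """Return sessions that have been waiting on a lock longer than the threshold."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT pid, usename, state, wait_event_type, query,
                   now() - query_start AS waiting_for
            FROM pg_stat_activity
            WHERE wait_event_type = 'Lock'
              AND now() - query_start > make_interval(secs => %s)
            """,
            (threshold_seconds,),
        )
        return cur.fetchall()

def main():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    while True:
        blocked = find_blocked_sessions(conn, BLOCKED_THRESHOLD_SECONDS)
        if blocked:
            # Placeholder alert hook: in practice this would notify the on-call team.
            print(f"ALERT: {len(blocked)} session(s) blocked on locks: {blocked}")
        time.sleep(60)  # poll once a minute

if __name__ == "__main__":
    main()
```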
July 17, 2019: