Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On Tuesday, March 6, 2018, at approximately 8:00 AM PST, and again on Thursday, March 8, 2018, at approximately 5:00 AM PST, the xMatters monitoring systems alerted the Client Assistance team to a potential issue with the xMatters On-Demand services for some clients located in North America. During both incidents, users may have experienced intermittent access to the user interface, delays or rejections when injecting an event into xMatters, and delays in notification delivery.

Why did it happen?

These two incidents were directly related. Both were caused by a database exhausting its available connections because of the xMatters component responsible for creating notifications. That component required a high number of database queries to resolve event recipients for notifications, so it took an unusually long time to process requests.
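As a rough illustration of why a high query count per request can exhaust a connection pool, the sketch below applies Little's law (average busy connections = arrival rate × time each request holds a connection). The pool size, request rate, and query timings are hypothetical numbers chosen for the example and are not taken from the xMatters environment.

```python
def connections_in_use(requests_per_sec: float,
                       queries_per_request: int,
                       seconds_per_query: float) -> float:
    """Estimate the average number of busy connections (Little's law).

    Assumes each request holds one connection while it runs its
    queries one after another.
    """
    hold_time = queries_per_request * seconds_per_query
    return requests_per_sec * hold_time


def pool_exhausted(pool_size: int,
                   requests_per_sec: float,
                   queries_per_request: int,
                   seconds_per_query: float) -> bool:
    """Return True when the estimated demand meets or exceeds the pool size."""
    demand = connections_in_use(requests_per_sec,
                                queries_per_request,
                                seconds_per_query)
    return demand >= pool_size


if __name__ == "__main__":
    # Hypothetical figures: 100 connections, 50 requests/s, 5 ms per query.
    for queries in (10, 500):
        demand = connections_in_use(50, queries, 0.005)
        print(queries, "queries per request ->", demand, "busy connections;",
              "exhausted:", pool_exhausted(100, 50, queries, 0.005))
```

Under these assumed figures, the request rate is identical in both cases; only the number of queries per request changes, and that alone is enough to push demand past the size of the pool.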

How did we respond?

As soon as the xMatters network monitoring detected connectivity issues, the xMatters Client Assistance and Operations teams initiated the internal Severity-1 process. The incident response teams began investigating the underlying cause and working to restore services for clients, while Client Assistance posted a notice to the xMatters status page. The teams immediately identified that the issue was limited to a specific subset of users within the North America region, and determined that it was related to a database consuming nearly all of its resources. To mitigate the problem, the Operations team restarted the database service, which restored normal operations for the affected clients.

During the continuing investigation, the teams determined that the incident was due to the amount of time some database queries were taking to resolve event recipients for notifications. The teams identified some approaches that could mitigate the problem, but the issue recurred on Thursday, March 8, before they could implement a solution. During this second disruption, the team applied one of the recommended fixes and restarted the services. Shortly afterwards, clients confirmed that all services had been restored.

What are we doing to prevent it from happening again?

To prevent this issue from occurring again, the xMatters Engineering team has committed to the following:

  1. Apply the fix to the underlying database and update to the latest patch release version. (Completed)

  2. Improve the efficiency of the database queries. (In progress - BUG-11787)

  3. Increase monitoring thresholds to identify any latency with notification delivery earlier in the process. (In progress - EVO-2292)
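The second item above (improving query efficiency) can be illustrated with a minimal sketch that contrasts resolving recipients one group at a time with resolving them in a single batched query. The `group_members` table, its columns, and the function names are hypothetical and exist only for this example; they do not describe the actual xMatters schema or the change tracked in BUG-11787.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a small in-memory database with a hypothetical membership table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE group_members (group_id INTEGER, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO group_members VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 11), (2, 12), (3, 13)],
    )
    return conn


def resolve_one_by_one(conn, group_ids):
    """Issue one query per group; the query count grows with the group count."""
    recipients, queries = set(), 0
    for group_id in group_ids:
        rows = conn.execute(
            "SELECT user_id FROM group_members WHERE group_id = ?",
            (group_id,),
        ).fetchall()
        queries += 1
        recipients.update(row[0] for row in rows)
    return recipients, queries


def resolve_batched(conn, group_ids):
    """Resolve all groups in one query using an IN clause.

    Very long lists would need to be split into chunks, because databases
    limit the number of bound parameters per statement.
    """
    group_ids = list(group_ids)
    if not group_ids:
        return set(), 0
    placeholders = ",".join("?" for _ in group_ids)
    rows = conn.execute(
        "SELECT DISTINCT user_id FROM group_members "
        f"WHERE group_id IN ({placeholders})",
        group_ids,
    ).fetchall()
    return {row[0] for row in rows}, 1


if __name__ == "__main__":
    conn = make_db()
    print(resolve_one_by_one(conn, [1, 2, 3]))
    print(resolve_batched(conn, [1, 2, 3]))
```

Both functions return the same set of recipients, but the batched version makes one round trip to the database regardless of how many groups are involved, which shortens the time each request holds a connection.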

Timeline:

2018-03-06 08:00AM - Client Assistance team receives reports of latency for some clients with On-Demand services in North America

2018-03-06 08:20AM - Internal Severity-1 process initiated

2018-03-06 08:45AM - Status page bulletin posted: http://status.xmatters.com/incidents/hst4wjdvqy46

2018-03-06 08:55AM - Engineering deploys a fix for the issue to restore services

2018-03-06 09:05AM - Services are restored

2018-03-08 05:30AM - Client Assistance team receives reports of notification delays for some clients with On-Demand services in North America

2018-03-08 05:50AM - Internal Severity-1 process initiated

2018-03-08 06:13AM - Engineering identifies the issue to be related to the incident on 2018-03-06

2018-03-08 06:17AM - Status page bulletin posted: http://status.xmatters.com/incidents/zyttzrz9phwt

2018-03-08 06:30AM - A fix is deployed to the database and services are restarted

2018-03-08 06:35AM - All services are restored

If you have any questions, please visit http://support.xmatters.com

Posted Mar 13, 2018 - 11:03 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Mar 08, 2018 - 06:49 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Mar 08, 2018 - 06:35 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Mar 08, 2018 - 06:23 PST
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.
Posted Mar 08, 2018 - 06:17 PST
This incident affected: North America (Web Interface, Mobile Interface, Email Notifications, SMS Notifications, Voice Notifications, Mobile Push Notifications, Conferencing, Integration Platform, REST API, Email Initiation).