Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On Tuesday, October 16, 2018 at approximately 12:40 PM PDT, the xMatters monitoring systems alerted the Client Assistance team to a potential issue with On-Demand services for some clients located in North America. During the incident, some users may have experienced intermittent access to the user interface, a delay or rejection when injecting an event into xMatters, and delays in notification delivery. Early the next morning, on October 17, some clients reported that their inbound requests via the Integration Builder were not being processed. During this second incident, some users may have experienced delays in notification delivery. The issue recurred a third time early in the afternoon on Thursday, October 18, when Client Assistance noticed ongoing performance issues in one of the North American data centers. Some clients may have encountered intermittent access to the user interface and delays in notification delivery during this time.

Why did it happen?

This issue was caused by a database query change that was introduced as part of a bug fix in the recent xMatters On-Demand 5.5.230 release, which entered production on Monday, October 15. The change caused databases to take longer to process certain requests, but only under specific conditions that arose during periods of increased concurrency or increased notification requests. The teams had some difficulty identifying the root cause because the performance issues appeared to abate after each fix was applied. It was not until the third occurrence that the teams were able to gather enough information about the common elements to correctly isolate the source of the problem.

How did we respond?

As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Operations teams escalated the issue to Severity 1 and initiated the internal major incident management process. While the incident response teams began simultaneously investigating the underlying cause and working to restore services for clients, Client Assistance posted a notice to the xMatters status page to inform clients about the incident. The teams immediately identified that the issue was limited to a specific subset of client instances within the North America region, and determined that the problem was related to a database consuming nearly all of its resources. In an attempt to mitigate the issue, the Operations team restarted the database service, resulting in marginal improvements to notification delivery. Upon further investigation, the team identified additional approaches that could mitigate the problem, and applied one of the recommended fixes to the database. Once the services were restarted, notification delivery resumed normal operations and all services appeared to be restored.

On October 17, the Client Assistance team began receiving reports from clients that some injected events were not delivering notifications. The Client Assistance team confirmed the issue and initiated the internal major incident management process to engage the incident response teams. The teams identified that a service responsible for handling inbound requests from the Integration Builder was in a blocked state. Once the impacted service was restarted, the block was cleared and events began processing notifications. The teams continued to investigate and determined that the original incident had blocked certain database tables and that additional components required a restart. The Operations team unblocked the database tables and restarted the affected components to ensure that all services were fully restored. The teams continued to search for the underlying cause of the incident while monitoring the affected systems.

At approximately 12:30 PM on Thursday, October 18, Client Assistance again noticed performance issues with one of the data centers in North America. They immediately launched the major incident management process and engaged the response teams to begin resolving the issue. The teams were able to start simultaneously restoring services and investigating the root cause. The third occurrence provided the teams with the information necessary to link the issues and review similar behavior during all three incidents. By comparing common elements that occurred during each incident, the teams managed to isolate and identify the query that caused the database performance issues. Once they were certain that they had identified the correct source of the problems, the Operations and Engineering teams devised and implemented a hotfix to mitigate any further impact to customers. Clients then confirmed that all services had been restored.

What are we doing to prevent it from happening again?

To prevent this issue from occurring again, xMatters has committed to the following action items:

  1. Upgrade the underlying database to the latest patch release version. (Completed)
  2. Increase monitoring thresholds to help identify any latency with notification delivery earlier in the process. (In progress)
  3. Deploy a hotfix to fix the problematic query on the impacted systems. (Completed)
  4. Deploy a permanent fix to the query to eliminate the issue across all customers and all systems. (Deployed as part of the 5.5.231 release on Monday, October 22.)

In addition, the Engineering and Operations teams are conducting a full post-mortem of the incident to help identify any potential improvements to testing suites, playbooks, and other collateral used to help isolate and identify root causes during and after an incident.

Timeline:

October 16, 2018, 12:40 PM xMatters monitoring tools alert the Client Assistance team to possible latency issues for some clients in North America
12:50 PM Internal Severity 1 process initiated
1:15 PM Engineering attempts to restore services for clients by restarting impacted notification service
1:32 PM Client Assistance posts status page bulletin: https://status.xmatters.com/incidents/c7vqmddldtbl
1:50 PM Engineering recommends mitigation steps to recover the notification service
2:01 PM Fix deployed to database; impacted service restarted
2:10 PM Services are restored
October 17, 2018, 6:00 AM Client Assistance receives reports that some events are not processing
7:58 AM Client Assistance initiates internal major incident process
8:05 AM Engineering begins investigating the issue
9:10 AM Engineering applies fix, events begin processing notifications
9:14 AM Services are restored
October 18, 2018, 12:30 PM xMatters Client Assistance is alerted to possible latency issues in a North American data center
12:34 PM Issue escalated to Severity 1
12:58 PM Client Assistance posts notice to xMatters status page: https://status.xmatters.com/incidents/7yptsvdrm2p5
1:13 PM Teams confirm that all three incidents are related and identify updated query as the root cause
1:37 PM Engineering and Operations teams deploy a hotfix to repair the query
5:13 PM All services are confirmed restored

If you have any questions, please visit http://support.xmatters.com

Posted Oct 25, 2018 - 09:14 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Oct 16, 2018 - 14:16 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Oct 16, 2018 - 14:09 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Oct 16, 2018 - 13:55 PDT
Investigating
The xMatters team has identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Oct 16, 2018 - 13:32 PDT
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).