What happened?
On Tuesday, March 6, 2018, at approximately 8:00 AM PST, and again on Thursday, March 8, 2018, at approximately 5:00 AM PST, the xMatters monitoring systems alerted the Client Assistance team to a potential issue with the xMatters On-Demand services for some clients located in North America. During both incidents, users may have experienced intermittent access to the user interface, delays or rejections when injecting events into xMatters, and delays in notification delivery.
Why did it happen?
These two incidents were directly related. Both were caused by a database exhausting its available connections because of an xMatters component responsible for creating notifications. The system required a high number of database queries to resolve event recipients for notifications, which caused it to take an unusually long time to process requests.
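As a rough illustration of how recipient resolution can multiply database work, the sketch below contrasts a one-query-per-recipient pattern with a single joined query. It is a generic example only, not the xMatters implementation: the table names (group_members, devices), the function names, and the use of an in-memory SQLite database are all assumptions made for this sketch, and the report does not state which query pattern was actually involved.

```python
import sqlite3


def build_db(recipient_count=50):
    """Create a hypothetical in-memory schema with one group of recipients."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE group_members (group_id INTEGER, user_id INTEGER)")
    conn.execute("CREATE TABLE devices (user_id INTEGER, address TEXT)")
    users = range(1, recipient_count + 1)
    conn.executemany("INSERT INTO group_members VALUES (?, ?)", [(1, u) for u in users])
    conn.executemany(
        "INSERT INTO devices VALUES (?, ?)",
        [(u, f"user{u}@example.com") for u in users],
    )
    return conn


def resolve_per_recipient(conn, group_id):
    """One query for the group, then one more query for every member."""
    queries = 1
    user_ids = [
        row[0]
        for row in conn.execute(
            "SELECT user_id FROM group_members WHERE group_id = ? ORDER BY user_id",
            (group_id,),
        )
    ]
    addresses = []
    for user_id in user_ids:
        queries += 1
        rows = conn.execute("SELECT address FROM devices WHERE user_id = ?", (user_id,))
        addresses.extend(row[0] for row in rows)
    return addresses, queries


def resolve_batched(conn, group_id):
    """Resolve the same recipients with a single joined query."""
    rows = conn.execute(
        "SELECT d.address FROM group_members AS m "
        "JOIN devices AS d ON d.user_id = m.user_id "
        "WHERE m.group_id = ? ORDER BY m.user_id",
        (group_id,),
    )
    return [row[0] for row in rows], 1


if __name__ == "__main__":
    conn = build_db()
    _, per_recipient_queries = resolve_per_recipient(conn, 1)
    _, batched_queries = resolve_batched(conn, 1)
    print(per_recipient_queries, batched_queries)
```

In a pattern like the first function, the number of queries grows with the number of recipients, so large events hold database connections for longer; the joined version returns the same addresses in one round trip.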
How did we respond?
As soon as the xMatters network monitoring detected connectivity issues, the xMatters Client Assistance and Operations teams initiated the internal Severity-1 process. The incident response teams simultaneously began investigating the underlying cause and working to restore services for clients, while Client Assistance posted a notice to the xMatters status page. The teams immediately identified that the issue was limited to a specific subset of users within the North America region, and determined that it was related to a database consuming nearly all of its resources. To mitigate the problem, the Operations team restarted the database service, which restored normal operations for the affected clients. During the continuing investigation, the teams determined that the incident was due to the amount of time some database queries were taking to resolve event recipients for notifications. The teams identified some approaches that could mitigate the problem, but the issue recurred on Thursday, March 8, before they could implement a solution. During this second disruption, the team applied one of the recommended fixes and restarted the services. Shortly afterwards, clients confirmed that all services had been restored.
What are we doing to prevent it from happening again?
To prevent this issue from occurring again, the xMatters Engineering team has committed to the following:
Apply the fix to the underlying database and update to the latest patch release version. (Completed)
Improve the efficiency of the database queries. (In progress - BUG-11787)
Tighten monitoring thresholds to identify any latency with notification delivery earlier in the process. (In progress - EVO-2292)
Timeline:
2018-03-06 08:00AM - Client Assistance team receives reports of latency for some clients with On-Demand services in North America
2018-03-06 08:20AM - Internal Severity-1 process initiated
2018-03-06 08:45AM - Status page bulletin posted: http://status.xmatters.com/incidents/hst4wjdvqy46
2018-03-06 08:55AM - Engineering deploys a fix for the issue to restore services
2018-03-06 09:05AM - Services are restored
2018-03-08 05:30AM - Client Assistance team receives reports of notification delays for some clients with On-Demand services in North America
2018-03-08 05:50AM - Internal Severity-1 process initiated
2018-03-08 06:13AM - Engineering identifies the issue to be related to the incident on 2018-03-06
2018-03-08 06:17AM - Status page bulletin posted: http://status.xmatters.com/incidents/zyttzrz9phwt
2018-03-08 06:30AM - A fix is deployed to the database and services are restarted
2018-03-08 06:35AM - All services are restored
If you have any questions, please visit http://support.xmatters.com