Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On Tuesday, January 23, 2018 at approximately 2:13 PM PST, the xMatters monitoring systems alerted the Client Assistance team to an issue with the xMatters On-Demand services for clients located in North America. Users may have experienced intermittent access to the user interface, and a delay or rejection when injecting an event into xMatters.

Why did it happen?

This issue occurred when a previously unknown defect within an internal, back-end service responsible for removing older, deprecated services caused it to erroneously remove services that were still in use.

How did we respond?

As soon as the xMatters network monitoring detected connectivity issues, xMatters Client Assistance and Operations teams initiated the internal Major Incident Management process. The incident response teams began simultaneously investigating the underlying cause and working to restore services for clients while Client Assistance posted a notice to the xMatters status page. The teams immediately identified that the connectivity problems were impacting a majority of clients located in North America, and that the problem was related to a back-end service responsible for cleaning up services no longer in use. In this case, the service had erroneously removed live, in-use services. To quickly mitigate client issues, the incident response team began redeploying the services that had been removed. Once the services were redeployed, clients confirmed that all services were restored and their users were able to access the user interface and inject events into xMatters.

What are we doing to prevent it from happening again?

To prevent this issue from occurring again, the xMatters Engineering team will undertake the following steps:

  1. Repair the defect within the service to prevent the erroneous deletion of live services. (In progress)

  2. If possible, update the back-end service to clean up old services on an individual basis, rather than all at once. (Currently investigating)

  3. Implement additional monitoring and alerting on the at-fault service to ensure any similar scenarios can be captured and prevented before they can impact clients. (In progress)
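As an illustration of steps 1 and 2 only, the following is a minimal Python sketch of a cleanup routine that removes services one at a time and refuses to remove anything still in use. It is not the xMatters implementation, which is not described in this report. Every name here (`Service`, `cleanup_services`, `is_in_use`, `remove`, `max_removals`) is hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Service:
    """Hypothetical record for a deployed back-end service."""
    name: str
    deprecated: bool


def cleanup_services(
    candidates: Iterable[Service],
    is_in_use: Callable[[Service], bool],
    remove: Callable[[Service], None],
    max_removals: int = 1,
) -> List[str]:
    """Remove deprecated, unused services individually, up to max_removals per run.

    is_in_use and remove are assumed hooks supplied by the caller; how a real
    system would check liveness or delete a service is outside this sketch.
    """
    removed: List[str] = []
    for service in candidates:
        # Cap the number of removals per run so a defect cannot remove everything at once.
        if len(removed) >= max_removals:
            break
        # Never touch a service that is not marked deprecated.
        if not service.deprecated:
            continue
        # Re-check liveness immediately before removal; skip anything still in use.
        if is_in_use(service):
            continue
        remove(service)
        removed.append(service.name)
    return removed
```

Capping removals per run and checking liveness right before each removal limits how much a single defect can take down, at the cost of a slower cleanup.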

Timeline (all times PST):

2018-01-23 02:13 PM - xMatters monitoring tools alert the Client Assistance team to an issue with On-Demand services in North America

2018-01-23 02:15 PM - Internal Major Incident process initiated

2018-01-23 02:16 PM - Status page bulletin posted: http://status.xmatters.com/incidents/pp24rkxdf0pg

2018-01-23 02:19 PM - Incident team identifies the source of the issue and works to redeploy services

2018-01-23 02:40 PM - Services begin to come back online

2018-01-23 03:15 PM - Clients confirm service fully restored

If you have any questions, please visit http://support.xmatters.com

Posted Jan 26, 2018 - 18:39 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we resolved this matter.
Posted Jan 23, 2018 - 15:25 PST
Update
The majority of services should now be back online. We are continuing to work on the remaining services and will provide another update in 30 minutes.
Posted Jan 23, 2018 - 14:54 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue and services are now coming back online. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored. We will provide the next update in 10 minutes.
Posted Jan 23, 2018 - 14:42 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Jan 23, 2018 - 14:33 PST
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.
Posted Jan 23, 2018 - 14:16 PST
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).