What happened?
On Tuesday, January 23, 2018 at approximately 2:13 PM PST, the xMatters monitoring systems alerted the Client Assistance team to an issue with the xMatters On-Demand services for clients located in North America. Users may have experienced intermittent access to the user interface, and a delay or rejection when injecting an event into xMatters.
Why did it happen?
This issue was caused by a previously unknown defect in an internal back-end service that removes older, deprecated services. The defect caused that service to erroneously remove services that were still in use.
How did we respond?
As soon as xMatters network monitoring detected connectivity issues, the xMatters Client Assistance and Operations teams initiated the internal Major Incident Management process. The incident response teams investigated the underlying cause and worked to restore services for clients in parallel, while Client Assistance posted a notice to the xMatters status page. Within minutes, the teams identified that the connectivity problems were affecting a majority of clients located in North America, and that the cause was a back-end service responsible for cleaning up services no longer in use, which had erroneously removed live, in-use services. To mitigate client impact quickly, the incident response team began redeploying the services that had been removed. Once the services were redeployed, clients confirmed that all services were restored and that their users could access the user interface and inject events into xMatters.
What are we doing to prevent it from happening again?
To prevent this issue from occurring again, the xMatters Engineering team will undertake the following steps:
Repair the defect within the service to prevent the erroneous deletion of live services. (In progress)
If possible, update the back-end service to clean up old services on an individual basis, rather than all at once. (Currently investigating)
Implement additional monitoring and alerting on the at-fault service to ensure any similar scenarios can be captured and prevented before they can impact clients. (In progress)
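The ideas behind these steps can be illustrated with a short sketch. This is a minimal example in Python under stated assumptions, not the xMatters implementation: the `Service` record, its `active_connections` liveness signal, the `cleanup_services` function, the `max_deletions` cap, and the `CleanupAborted` exception are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Service:
    """Hypothetical record describing one deployed service."""
    name: str
    deprecated: bool
    active_connections: int  # assumed liveness signal; a real system may use another


class CleanupAborted(Exception):
    """Raised when one cleanup run tries to remove more services than allowed."""


def cleanup_services(
    services: Iterable[Service],
    delete: Callable[[Service], None],
    max_deletions: int = 5,
) -> List[str]:
    """Remove deprecated, idle services one at a time and return their names."""
    removed: List[str] = []
    for service in services:
        # Guard 1: skip anything not marked deprecated or still in use.
        if not service.deprecated or service.active_connections > 0:
            continue
        # Guard 2: stop once the per-run cap is reached, so a defect
        # cannot remove a large number of services in a single pass.
        if len(removed) >= max_deletions:
            raise CleanupAborted(
                f"refusing to delete more than {max_deletions} services in one run"
            )
        delete(service)  # each service is handled individually
        removed.append(service.name)
    return removed
```

In a design like this, the exception raised by the second guard is a natural place to attach alerting, because an unusually large cleanup run is the kind of scenario the additional monitoring is meant to catch before it reaches clients.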
Timeline (all times PST):
2018-01-23 02:13 PM - xMatters monitoring tools alert the Client Assistance team to an issue with On-Demand services in North America
2018-01-23 02:15 PM - Internal Major Incident process initiated
2018-01-23 02:16 PM - Status page bulletin posted: http://status.xmatters.com/incidents/pp24rkxdf0pg
2018-01-23 02:19 PM - Incident team identifies the source of the issue and works to redeploy services
2018-01-23 02:40 PM - Services begin to come back online
2018-01-23 03:15 PM - Clients confirm service fully restored
If you have any questions, please visit http://support.xmatters.com