On March 27, 2019, at approximately 11:38 PM PDT, some clients reported to xMatters Client Assistance that they could not see the correct list of users in the On-Demand web user interface. The Users page was displaying an incomplete list of users, or no users at all. During the investigation and resolution of the issue, additional reports confirmed that only the Users page was affected. Other aspects of the web user interface were not affected, and the On-Demand service continued to accept all incoming events, send notifications, and process responses without interruption.
This issue was caused by a software defect introduced in the 5.5.252 release of xMatters On-Demand, which included a change to the way that historical user roles were retrieved and displayed.
As soon as Client Assistance received reports about an issue with the web user interface, they launched an investigation and began attempting to reproduce the issue. Initial findings seemed to indicate that the problem was limited in scope as internal checkpoints could not reproduce the issue. As further reports came in and clarified the issue and its scope, Client Assistance successfully reproduced the problem and immediately escalated it to a Severity-1, initiating the internal major incident management process. While the incident response teams began working to identify the root cause, Client Assistance posted a notice to the xMatters On-Demand status page.
The Engineering teams identified an error in the query used to retrieve user roles, but determined that changing the query in place could have unforeseeable consequences. To mitigate the issue and restore service as safely as possible, the teams decided to roll back the service to the previous release. Although a rollback could take longer than changing the query in place, the teams identified it as the safest, most effective solution. The Engineering team immediately began the rollback process while Client Assistance updated affected clients on progress. As soon as the rollback was complete, clients confirmed that all services had been restored.
The defect introduced in the release was repaired and the release redeployed via hotfix to all production instances later the same day. All clients were successfully updated to the 5.5.252 release and have confirmed that the issue was resolved.
As a proactive approach to preventing these types of incidents, the Engineering teams are currently reviewing all user-interface-related incidents from the past year and identifying potential enhancements and areas for further improvement. In addition, the Client Assistance team has identified that the notice posted to the xMatters status page was too general and did not identify the client impact narrowly enough. This may have caused some clients undue stress, as the issue affected only the web user interface and did not impact underlying data, event processing, or notification and response handling. To help prevent similar miscommunications, Client Assistance is reviewing its status page updates and communication practices to ensure that future updates are more focused and better represent the nature of any incident.
March 27, 2019 11:38 PM - Client Assistance receives reports of issues displaying users in the web user interface
12:03 AM - Client Assistance begins attempting to reproduce the issue
2:05 AM - Other clients report encountering the issue
3:40 AM - Scope of impact identified; Severity-1 incident initiated
4:02 AM - Incident response teams assemble and begin work to identify cause
5:25 AM - Cause of incident determined
6:53 AM - Rollback process initiated
9:18 AM - Rollback completed; all services confirmed restored
If you have any questions, please visit http://support.xmatters.com