Issue Discovered - Service disruption
Incident Report for xMatters
Postmortem

What happened?

On March 27, 2019, at approximately 11:38 PM PDT, some clients reported to xMatters Client Assistance that they could not see the correct list of users in the On-Demand web user interface. Users reported that the Users page in the web interface was displaying an incomplete list of users, or was not displaying any users at all. During the investigation and resolution of the issue, additional reports came in that confirmed the issue was impacting only the Users page. Other aspects of the web user interface were not affected, and the On-Demand service continued to accept all incoming events, send notifications, and process responses without interruption.

Why did it happen?

This issue was caused by a software defect introduced in the 5.5.252 release of xMatters On-Demand, which included a change to the way that historical user roles were retrieved and displayed.

How did we respond?

As soon as Client Assistance received reports about an issue with the web user interface, they launched an investigation and began attempting to reproduce the issue. Initial findings seemed to indicate that the problem was limited in scope as internal checkpoints could not reproduce the issue. As further reports came in and clarified the issue and its scope, Client Assistance successfully reproduced the problem and immediately escalated it to a Severity-1, initiating the internal major incident management process. While the incident response teams began working to identify the root cause, Client Assistance posted a notice to the xMatters On-Demand status page.

The Engineering teams identified an error in the query used to retrieve user roles, but determined that changing the query in place could have unforeseeable consequences. To mitigate the issue and restore service as safely as possible, the teams decided to roll back the service to the previous release. Although the rollback process could take longer, the teams identified it as the safest, most effective solution. The Engineering teams immediately began the rollback process while Client Assistance updated affected clients on progress. As soon as the rollback was complete, clients confirmed that all services had been restored.

What are we doing to prevent it from happening again?

The defect introduced in the release was repaired, and the release was redeployed via hotfix to all production instances later the same day. All clients were successfully updated to the 5.5.252 release and have confirmed that the issue is resolved.

As a proactive approach to preventing these types of incidents, the Engineering teams are currently reviewing all user-interface-related incidents from the past year and identifying potential enhancements or areas for further improvement. In addition, the Client Assistance team has identified that the notice posted to the xMatters status page was too general and did not identify the client impact precisely enough. This may have caused some clients undue stress, as the issue affected only the web user interface and did not impact underlying data, event processing, or notification and response handling. To help prevent similar miscommunications, Client Assistance is reviewing its status page updates and communication practices to ensure that future updates are more focused and better represent the nature of any incident.

Timeline:

March 27, 2019 11:38 PM - Client Assistance receives reports of issues displaying users in the web user interface
12:03 AM - Client Assistance begins attempting to reproduce the issue
2:05 AM - Other clients report encountering the issue
3:40 AM - Scope of impact identified; Severity-1 incident initiated
4:02 AM - Incident response teams assemble and begin work to identify cause
5:25 AM - Cause of incident determined
6:53 AM - Rollback process initiated
9:18 AM - Rollback completed; all services confirmed restored

If you have any questions, please visit http://support.xmatters.com

Posted Apr 04, 2019 - 09:28 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Mar 28, 2019 - 08:38 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.

If you receive an error when browsing the Web UI, please refresh or restart your browser.
Posted Mar 28, 2019 - 08:09 PDT
Update
A fix is being implemented at the moment. We'll provide further updates as we get them.

The impact is still isolated to the Web User Interface and specifically the Users list not displaying all users. There are no issues with notifications or accessing your instance.
Posted Mar 28, 2019 - 08:02 PDT
Update
We have confirmed that the impact is limited to the Web User Interface, where full user lists are not available. No other services are impacted.
Posted Mar 28, 2019 - 07:21 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 28, 2019 - 06:44 PDT
Update
We are continuing to investigate this issue.
Posted Mar 28, 2019 - 06:43 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Mar 28, 2019 - 06:43 PDT
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App), Europe, Middle East, and Africa (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App), and Asia Pacific (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).