Issue Discovered - Some users receiving User Interface errors
Incident Report for xMatters
Postmortem

What happened?

On March 21, 2019, at approximately 8:54 AM PDT, some clients began reporting an issue to xMatters Client Assistance where they were encountering a "404" error when attempting to access the On-Demand web user interface. Clients were able to log in but could not perform any actions or access any pages due to the error. While the issue prevented clients from using the web user interface to send messages, view event status, or run reports, the system continued to process events as well as all notifications and user responses.

Why did it happen?

This issue was caused by a mismatch in file creation dates that the web server uses to determine which files to serve. The Engineering team created and deployed a hotfix for an issue in the web user interface for a specific release after the artifacts for the subsequent scheduled release had already been built. When that release was deployed to the On-Demand service, the inconsistency in the creation dates for the files on the web server caused the interface to display an error instead of the necessary web pages.

How did we respond?

As soon as clients reported the errors, Client Assistance confirmed the reports and immediately escalated the issue to a Severity-1 incident. They launched the internal major incident management process to engage the incident response teams and posted a notice to the xMatters status page. The incident response teams began investigating and quickly identified the web server artifacts that were causing the date mismatch. To help immediately mitigate the impact and restore access to the web user interface, the teams began rolling back affected clients to the previous known good deployment while the Engineering team began rebuilding the release artifacts. As soon as the rollback was complete, clients reported that they could properly access the web user interface and that all services had been restored. The Engineering team completed the rebuild of the release artifacts and successfully redeployed the release later the same day.

What are we doing to prevent it from happening again?

To help prevent similar issues from happening in the future, the Engineering team has added checkpoints to the build and deployment process. These checkpoints test for file creation date mismatches throughout all phases of the rollout and release process.
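
As an illustration only, a checkpoint of this kind could be sketched as follows. This is not the actual xMatters check: the function name, the five-minute tolerance, and the use of modification time in place of creation time (which is not available portably) are all assumptions made for the example. The sketch flags any file whose timestamp is far from the median timestamp of the artifact directory, on the assumption that files built for one release are created close together.

```python
from pathlib import Path


def find_timestamp_mismatches(artifact_dir, tolerance_seconds=300.0):
    """Return files whose timestamp is far from the directory's median.

    Hypothetical helper, not part of any xMatters tooling.
    Assumption: files built for one release share roughly the same
    timestamp, so an outlier suggests artifacts from different builds.
    Modification time (st_mtime) is used because creation time is not
    exposed consistently across platforms.
    """
    files = sorted(p for p in Path(artifact_dir).rglob("*") if p.is_file())
    if not files:
        return []

    # Use the median as the reference so a few outliers cannot shift it.
    mtimes = sorted(p.stat().st_mtime for p in files)
    median = mtimes[len(mtimes) // 2]

    return [
        p for p in files
        if abs(p.stat().st_mtime - median) > tolerance_seconds
    ]
```

A build or deployment step could call this function on the web server artifact directory and stop the rollout if the returned list is not empty.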

Timeline:

March 21, 2019 - 8:54 AM - Some clients report 404 errors when using the web user interface

8:55 AM - Client Assistance confirms and replicates the issue

8:56 AM - Client Assistance issues a Severity-1 incident

8:57 AM - Status page notice: https://status.xmatters.com/incidents/hjhj8sty2g3b

9:26 AM - Incident team isolates the cause and begins to investigate rollback to last known state

10:00 AM - Rollback initiated

10:07 AM - Rollback confirmed, team begins to monitor for further errors

10:28 AM - Confirmation that all services are restored

Posted Mar 29, 2019 - 16:25 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Mar 21, 2019 - 10:28 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Mar 21, 2019 - 10:15 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Mar 21, 2019 - 10:08 PDT
Investigating
The xMatters team has been receiving some reports of errors when viewing certain pages in the Web UI. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Mar 21, 2019 - 09:57 PDT
This incident affected: North America (Web Interface), Europe, Middle East, and Africa (Web Interface), and Asia Pacific (Web Interface).