On March 21, 2019, at approximately 8:54 AM PDT, some clients began reporting to xMatters Client Assistance that they were encountering a "404" error when attempting to access the On-Demand web user interface. Clients were able to log in but could not perform any actions or access any pages due to the error. While the issue prevented clients from using the web user interface to send messages, view event status, or run reports, the system continued to process events as well as all notifications and user responses.
This issue was caused by a mismatch in the file creation dates that the web server uses to determine which files to serve. The Engineering team created and deployed a hotfix for an issue in the web user interface for a specific release after the artifacts for the subsequent scheduled release had already been built. When that subsequent release was deployed to the On-Demand service, the inconsistency in the creation dates for the files on the web server caused the interface to display an error instead of the necessary web pages.
As soon as clients reported the errors, Client Assistance confirmed the reports and immediately escalated the issue to a Severity-1 incident. They launched the internal major incident management process to engage the incident response teams and posted a notice to the xMatters status page. The incident response teams began investigating and quickly identified the web server artifacts that were causing the date mismatch. To help immediately mitigate the impact and restore access to the web user interface, the teams began rolling back affected clients to the previous known good deployment while the Engineering team began rebuilding the release artifacts. As soon as the rollback was complete, clients reported that they could properly access the web user interface and that all services had been restored. The Engineering team completed the rebuild of the release artifacts and successfully redeployed the release later the same day.
To help prevent similar issues from happening in the future, the Engineering team has added checkpoints to the build and deployment process. These checkpoints test for file creation date mismatches throughout all phases of the rollout and release process.
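As an illustration only, a checkpoint of this kind could be sketched as follows. This is not the xMatters implementation: the function name, the one-hour tolerance, and the use of modification time as a stand-in for creation time (which is not portably available on every platform) are all assumptions made for the example.

```python
import statistics
from pathlib import Path


def find_timestamp_outliers(artifact_dir, tolerance_seconds=3600.0):
    """Hypothetical check: list files whose timestamp is far from the rest.

    Assumption: st_mtime stands in for the file creation date, and a file
    counts as mismatched when it is more than tolerance_seconds away from
    the median timestamp of all files in the artifact directory.
    """
    files = [p for p in Path(artifact_dir).rglob("*") if p.is_file()]
    if not files:
        return []

    # Collect one timestamp per file, then take the median as the baseline.
    mtimes = {p: p.stat().st_mtime for p in files}
    baseline = statistics.median(mtimes.values())

    # Any file too far from the baseline is reported as a mismatch.
    return sorted(
        p for p, t in mtimes.items() if abs(t - baseline) > tolerance_seconds
    )
```

A build or deployment step could call this function on the artifact directory and stop the rollout when the returned list is not empty.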
March 21, 2019 - 8:54 AM - Some clients report 404 errors when using the web user interface
8:55 AM - Client Assistance confirms and replicates the issue
8:56 AM - Client Assistance issues a Severity-1 incident
8:57 AM - Status page notice: https://status.xmatters.com/incidents/hjhj8sty2g3b
9:26 AM - Incident team isolates the cause and begins to investigate rollback to last known state
10:00 AM - Rollback initiated
10:07 AM - Rollback confirmed, team begins to monitor for further errors
10:28 AM - Confirmation that all services are restored