What happened?
On March 26, 2018, the xMatters monitoring tools reported an issue with the On-Demand service for some clients located in Europe and North America. Some users may have experienced errors and performance issues when attempting to access the web user interface, and a delay or rejection when attempting to inject an event into xMatters.
Why did it happen?
This issue occurred during a regularly scheduled deployment update, when a previously unidentified defect prevented the update from completing successfully. The warning messages raised by the defect did not properly alert the monitoring systems that the deployment had not completed.
How did we respond?
The initial report from the monitoring tools indicated that the system had quickly recovered from a minor error, but Client Assistance received a second report later the same day and immediately began investigating the issue. Once the nature and extent of the problem became clear, they initiated the internal major incident management process and escalated the issue to a Severity 1. The incident response teams isolated the issue and related it to a series of minor warning messages that occurred during the deployment earlier in the day. Once they identified the problem, they engaged the Engineering team to help devise a solution. The team applied the fix to the affected data centers, and clients confirmed that all services had been restored.
What are we doing to prevent it from happening again?
To prevent this issue from occurring again, the xMatters Engineering team is currently developing a permanent fix for the defect that was identified during the investigation. Part of the fix will also ensure that the correct alerts are in place to flag any potential issues during the deployment process. (BUG-11899 - In Progress)
Timeline:
2018-03-26 01:36PM - xMatters monitoring tools alert the Client Assistance team to a potential issue for some clients in Europe and North America
2018-03-26 02:15PM - Internal Severity-1 process is initiated
2018-03-26 02:45PM - Issue is identified as related to the deployment of release 5.5.204
2018-03-26 02:51PM - Support Bulletin is posted: http://status.xmatters.com/incidents/4q16gbd43by0
2018-03-26 03:10PM - Fix is applied to the impacted clients
2018-03-26 03:12PM - All services are restored
If you have any questions, please visit http://support.xmatters.com