Issue Discovered - Service disruption with the User Interface and Notifications
Incident Report for xMatters
Postmortem

What happened?

On March 26, 2018, the xMatters monitoring tools reported an issue with the On-Demand service for some clients located in Europe and North America. Some users may have experienced errors and performance issues when attempting to access the web user interface, and a delay or rejection when attempting to inject an event into xMatters.

Why did it happen?

This issue occurred during a regularly scheduled deployment update, when a previously unidentified defect prevented the update from completing successfully. The warning messages raised by the defect did not alert the monitoring systems that the deployment had not completed.

How did we respond?

The initial report from the monitoring tools indicated that the system had quickly recovered from a minor error, but Client Assistance received a second report later the same day and immediately began investigating the issue. Once the nature and extent of the problem became clear, they initiated the internal major incident management process and escalated the issue to Severity 1. The incident response teams isolated the issue and traced it to a series of minor warning messages that occurred during the deployment earlier in the day. Once they identified the problem, they engaged the Engineering team to help devise a solution. The team applied the fix to the affected data centers, and clients confirmed that all services had been restored.

What are we doing to prevent it from happening again?

To prevent this issue from occurring again, the xMatters Engineering team is developing a permanent fix for the defect that was identified during the investigation. Part of the fix will also ensure that the correct alerts are in place to flag any potential issues during the deployment process. (BUG-11899 - In Progress)

Timeline (all times PDT):

2018-03-26 01:36PM - xMatters monitoring tools alert the Client Assistance team to a potential issue for some clients in Europe and North America

2018-03-26 02:15PM - Internal Severity-1 process is initiated

2018-03-26 02:45PM - Issue is identified as related to the deployment of release 5.5.204

2018-03-26 02:51PM - Support Bulletin is posted: http://status.xmatters.com/incidents/4q16gbd43by0

2018-03-26 03:10PM - Fix is applied to the impacted clients

2018-03-26 03:12PM - All services are restored

If you have any questions, please visit http://support.xmatters.com

Posted Mar 27, 2018 - 13:22 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Mar 26, 2018 - 15:12 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will post an update once a solution has been identified and implemented.
Posted Mar 26, 2018 - 15:01 PDT
Investigating
The xMatters monitoring tools have identified a potential issue affecting some xMatters On-Demand clients, who may have problems accessing the user interface or experience delays in notifications. We are currently investigating the issue and will post an update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Mar 26, 2018 - 14:51 PDT
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications) and Europe, Middle East, and Africa (Web Interface, Email Notifications, SMS Notifications, Voice Notifications).