Issue Discovered - Service disruption in Notification Delivery
Incident Report for xMatters
Postmortem

What happened?

On March 14, 2019, at approximately 2:52 PM (PDT), the xMatters monitoring tools alerted Client Assistance to an issue involving notification delivery. The On-Demand service was accepting and processing events, but was not creating or sending notifications. Some clients reported the issue to Client Assistance while the incident was being investigated, confirming that they were unable to initiate or send notifications.

Why did it happen?

The issue was caused by an operator error during a clean-up process that reverted some services to a prior state, resulting in a misconfiguration between services. The misconfiguration prevented notifications from being processed after events were submitted to xMatters.

How did we respond?

As soon as the internal monitoring tools raised the alert, Client Assistance launched an investigation. Once they had reproduced the issue and identified its scope, they immediately initiated the internal major incident management process and posted a notice for customers on the xMatters status page. The incident response teams began working to restore services and searching for the root cause. They identified a misconfiguration within the services required for notification creation and distribution, and initiated a resolution process to restore the service configurations to a prior, known good state. As soon as the resolution was applied, notifications began processing, and the teams continued to monitor the notification queues until the backlogs had cleared. Clients confirmed that they were receiving notifications promptly and that all services had been restored.

What are we doing to prevent it from happening again?

The xMatters Engineering team has completed an internal review and is developing and implementing an automated process for all clean-up activities for the On-Demand service. This process will include the following:

Additional monitoring checkpoints to optimize clean-up activities.

Automated rerouting of live traffic prior to reverting any services.
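
The two measures above can be illustrated with a minimal sketch. It is not the xMatters implementation: the `Service` and `CleanupRun` classes, the `revert` method, and the health-check callback are all hypothetical names chosen for this example. The sketch only shows the general ordering of steps, which is to reroute traffic first, revert second, and verify at a checkpoint before traffic returns.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Service:
    """Hypothetical stand-in for one service touched by a clean-up run."""
    name: str
    version: str
    receiving_traffic: bool = True


@dataclass
class CleanupRun:
    """Records each step so the run can be audited afterwards."""
    log: List[str] = field(default_factory=list)

    def revert(self, service: Service, target_version: str,
               health_check: Callable[[Service], bool]) -> bool:
        """Revert one service; return True if the checkpoint passed."""
        previous_version = service.version

        # Step 1: reroute live traffic away before touching the service.
        service.receiving_traffic = False
        self.log.append(f"rerouted traffic away from {service.name}")

        # Step 2: perform the revert while the service is out of rotation.
        service.version = target_version
        self.log.append(f"reverted {service.name} to {target_version}")

        # Step 3: monitoring checkpoint. If it fails, undo the revert so
        # the service returns to the state it was in before the run.
        passed = health_check(service)
        if not passed:
            service.version = previous_version
            self.log.append(
                f"checkpoint failed; restored {service.name} "
                f"to {previous_version}"
            )

        # Step 4: send traffic back only after the state is settled.
        service.receiving_traffic = True
        self.log.append(f"restored traffic to {service.name}")
        return passed
```

In this sketch a failed checkpoint leaves the service on its original version, so a mismatch between services is caught while no live traffic depends on the reverted service.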

Timeline:

March 14, 2019 - 2:52 PM - Internal monitoring alerts Client Assistance to issue with notification processing

3:04 PM - Client Assistance confirms and replicates the issue

3:05 PM - Issue escalated to major incident management (MIM); incident response teams assembled

3:12 PM - Notification posted to xMatters status page

3:15 PM - Incident response teams isolate issue

3:27 PM - Corrective action designed and tested

3:30 PM - Fix promoted to production; notifications begin processing

3:30 PM - Incident response teams monitor event processing and clearing of backlog

3:57 PM - Backlogs cleared; all services restored

If you have any questions, please visit http://support.xmatters.com

Posted Mar 19, 2019 - 16:22 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Mar 14, 2019 - 15:57 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Mar 14, 2019 - 15:35 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Mar 14, 2019 - 15:15 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Mar 14, 2019 - 15:12 PDT
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App) and Europe, Middle East, and Africa (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).