Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On Friday, July 21, 2017, the xMatters monitoring tools alerted the Client Assistance and Operations teams to an issue with a component in one of the North American data centers. While investigating the issue, the teams began receiving reports from some clients about delays in notification delivery for all device types. Some clients may also have experienced a brief service disruption when accessing the xMatters web user interface or injecting an event.

Why did it happen?

The impact of this issue was similar to that of a disruption earlier in the same week, which was related to a slowdown of one of the xMatters components responsible for creating notifications in a North American data center. The xMatters teams were still implementing the additional fixes and updates required to address the original issue when it recurred. The slowdown of the component caused internal components to refuse connections and delayed notification processing.

How did we respond?

As soon as the xMatters Client Assistance team was alerted, they notified the Operations team, initiated the internal Major Incident Management process, and posted a bulletin to the xMatters status page. The incident response teams began simultaneously investigating the underlying cause and working to restore service for clients. The investigation quickly identified the problem as related to a previous incident that occurred on Tuesday, July 18, 2017 (http://status.xmatters.com/incidents/cqjzcm4m47zj) with the message bus used in delivering notifications. The incident team immediately began remediation by redistributing the work to healthy components in the cluster. Once this was complete, the service caught up on the backlog and notifications were delivered without delay.

Approximately five hours later, the issue reappeared and the monitoring tools alerted the incident team again. The team initiated the internal Major Incident Management process and posted a new bulletin to the xMatters status page. The team immediately began remediation by performing another rebalancing operation on the impacted component. Unfortunately, this produced only marginal improvements and the problem continued to recur. They then began promoting the impacted clients to an alternate site located in North America. During this process, some clients may have experienced a brief disruption in their ability to access the web user interface or to inject an event.

Simultaneously, the incident team began redeploying the impacted service components in the affected data center to restore normal operation. Once the redeployment was completed, the notification queues in the affected data center began to process again. As the service began to clear the queued notifications, the team suspended any further client promotions to the alternate data center. The xMatters teams continued to monitor the queues closely until they could confirm that the system was again delivering notifications without delay and that all services had been restored.

What are we doing to prevent it from happening again?

The xMatters teams were still implementing the changes and updates they devised after the initial occurrence of this issue when the second incident occurred. The second occurrence gave the Engineering and Operations teams an additional opportunity to learn about the factors leading up to the problem. Based on the information they gathered, the teams have confirmed that their intended approach - once complete - should effectively prevent this issue from recurring.

xMatters will complete the following actions:

  1. Ensure all resolution service components have the necessary resources available. (Completed throughout North American data centers.)

  2. Review component architecture for potential changes to improve resiliency under different situations. (Engineering work is currently in progress - tentatively scheduled to deploy mid-August 2017.)

  3. Reproduce the issue in development test environments to identify additional changes. (Engineering work is currently in progress.)

  4. Increase monitoring thresholds to identify any latency with notification delivery earlier in the process. (Complete.)

  5. Review service metrics used to predict thresholds and usage limits. (Ongoing.)

  6. Review potential improvements to service component automation. (Ongoing.)

xMatters strives to provide high availability to our clients and we recognize that reliability of services is of utmost importance to our customers and their businesses. xMatters is committed to improving our resiliency and investing in the tools and processes required to prevent and minimize service disruptions.

Timeline:

2017-07-21 02:52 - xMatters monitoring tools alert of an issue with notification delays in one of the data centers located in North America

2017-07-21 02:56 - Internal Major Incident process is initiated

2017-07-21 03:04 - Support Bulletin is posted: http://status.xmatters.com/incidents/6d7zfwbsh71x

2017-07-21 03:56 - Rebalance and recycle of impacted service component is completed

2017-07-21 04:02 - Service is restored

2017-07-21 09:20 - xMatters monitoring tools alert of the same issue with notification delays in one of the data centers located in North America

2017-07-21 09:25 - Internal Major Incident process is initiated

2017-07-21 09:35 - Support Bulletin is posted: http://status.xmatters.com/incidents/5nf7gv1ytw0c

2017-07-21 09:51 - Rebalance and recycle of impacted service component is completed

2017-07-21 10:16 - Incident team identifies that the remediation has only made a marginal improvement and notifications are still delayed

2017-07-21 10:20 - A second attempt to rebalance and recycle impacted service components begins

2017-07-21 10:39 - Incident team decides to begin service promotion to alternate site

2017-07-21 11:20 - Incident team determines that the second attempt has only made a marginal improvement and notifications are still delayed

2017-07-21 11:40 - Full redeploy of the impacted service component begins in the impacted data center

2017-07-21 12:20 - Service promotion to alternate site continues

2017-07-21 13:25 - Redeployment is complete and services are back online in the impacted data center with no delays in notifications

2017-07-21 13:30 - Promotion of services to alternate data center is halted

2017-07-21 13:49 - All services are restored and notifications are processing without any delays
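
For reference, the length of each disruption window can be derived from the timeline entries above. The following is a minimal Python sketch of that arithmetic. The names `WINDOWS`, `window_duration`, and `format_duration` are illustrative and not part of any xMatters tooling, and the timestamps are used exactly as written in the timeline, which does not state a time zone.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# (first alert, service restored) pairs copied from the timeline above.
# The labels are illustrative names, not xMatters terminology.
WINDOWS = {
    "first occurrence": ("2017-07-21 02:52", "2017-07-21 04:02"),
    "second occurrence": ("2017-07-21 09:20", "2017-07-21 13:49"),
}


def window_duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timeline timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as 'Xh YYm'."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


if __name__ == "__main__":
    for label, (start, end) in WINDOWS.items():
        print(f"{label}: {format_duration(window_duration(start, end))}")
    # Gap between the first restoration and the second alert.
    gap = window_duration(
        WINDOWS["first occurrence"][1], WINDOWS["second occurrence"][0]
    )
    print(f"gap between occurrences: {format_duration(gap)}")
```

Run as written, this reports 1h 10m for the first occurrence, 4h 29m for the second, and a gap of 5h 18m between them, which is consistent with the "approximately five hours" described in the response summary.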

If you have any questions, please visit http://support.xmatters.com

Posted Jul 27, 2017 - 16:12 PDT

Resolved
The issue has now been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Jul 21, 2017 - 13:49 PDT
Update
The xMatters Incident Response team is still working on implementing the fix. Notifications for some customers are still delayed but are improving. We will provide more information as it becomes available.
Posted Jul 21, 2017 - 12:45 PDT
Update
The xMatters Incident Response team is still working on implementing the fix. Notifications for some customers are still delayed. We will provide more information as it becomes available.
Posted Jul 21, 2017 - 11:49 PDT
Update
The xMatters Incident Response team is still working on implementing the fix. Notifications are still delayed. We will provide more information as it becomes available.
Posted Jul 21, 2017 - 10:50 PDT
Update
The xMatters Incident Response team has identified the source of the issue and is still working on the fix. Customers will see improvement in notification delivery but there are still delays on some notifications.
Posted Jul 21, 2017 - 10:10 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Jul 21, 2017 - 09:38 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. Notifications on all delivery methods will be delayed; events will be sent as normal once the issue is resolved.

We are currently investigating the issue, and will update as information becomes available.
Posted Jul 21, 2017 - 09:35 PDT
This incident affected: North America (Email Notifications, SMS Notifications, Voice Notifications, Conferencing).