What happened?
On Friday, July 21, 2017, the xMatters monitoring tools alerted the Client Assistance and Operations teams to an issue with a component in one of the North American data centers. While investigating the issue, the teams began receiving reports from some clients about delays in notification delivery for all device types. Some clients may also have experienced a brief service disruption when accessing the xMatters web user interface or injecting an event.
Why did it happen?
This issue had a similar impact to a disruption that occurred earlier in the same week, which was related to a slowdown of one of the xMatters components responsible for creating notifications in a North American data center. The xMatters teams were still implementing the additional fixes and updates required to address the original issue when it recurred. The slowdown of the component caused internal components to refuse connections and delayed notification processing.
How did we respond?
As soon as the xMatters Client Assistance team was alerted, they notified the Operations team, initiated the internal Major Incident Management process, and posted a bulletin to the xMatters status page. The incident response teams began simultaneously investigating the underlying cause and working to restore service for clients. The investigation quickly identified the problem as related to a previous incident with the message bus used in delivering notifications, which occurred on Tuesday, July 18, 2017 (http://status.xmatters.com/incidents/cqjzcm4m47zj). The incident team immediately began remediation by redistributing the work to healthy components in the cluster. Once this was complete, the service was able to catch up on the backlog and notifications were delivered without any delays.
Approximately five hours later, the issue reappeared and the monitoring tools alerted the incident team again. The team initiated the internal Major Incident Management process and posted a new bulletin to the xMatters status page. The team immediately began remediation efforts by performing another rebalancing operation on the impacted component. Unfortunately, this produced only marginal improvements and the problem continued to recur. The team then began the process of promoting the impacted clients to an alternate site located in North America. During this process, some clients may have experienced a brief disruption in their ability to access the web user interface or to inject an event.
Simultaneously, the incident team began redeploying the impacted service components in the affected data center in an effort to restore service to normal operation. Once the redeployment was completed, the notification queues in the affected data center began to process again. As the service began to clear the queued notifications, the team suspended any further client promotions to the alternate data center. The xMatters teams continued to monitor the queues closely until they could confirm that the system was again delivering notifications without delay and that all services had been restored.
What are we doing to prevent it from happening again?
The xMatters teams were still implementing the changes and updates they devised after the first occurrence of this issue when the second incident occurred. The second occurrence gave the Engineering and Operations teams an additional opportunity to learn about the factors leading up to the problem. Based on the information they gathered, the teams have confirmed that their intended approach - once complete - should effectively prevent this issue from recurring.
xMatters will complete the following actions:
Ensure all resolution service components have the necessary resources available. (Completed in all North American data centers.)
Review component architecture for potential changes to improve resiliency under different situations. (Engineering work is currently in progress - tentatively scheduled to deploy mid-August 2017.)
Reproduce the issue in development test environments to identify additional changes. (Engineering work is currently in progress.)
Increase monitoring thresholds to identify any latency with notification delivery earlier in the process. (Complete.)
Review service metrics used to predict thresholds and usage limits. (Ongoing.)
Review potential improvements to service component automation. (Ongoing.)
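To illustrate the kind of latency check that the monitoring item above refers to, here is a minimal sketch that flags a notification queue once its oldest pending item has waited longer than a threshold. This is not the xMatters monitoring implementation: the function names, the 60-second threshold, and the shape of the data are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def oldest_pending_age(enqueued_at, now):
    """Return the age of the oldest pending notification (zero if the queue is empty)."""
    if not enqueued_at:
        return timedelta(0)
    return now - min(enqueued_at)

def is_delayed(enqueued_at, now, threshold=timedelta(seconds=60)):
    """Hypothetical check: True when the oldest pending item exceeds the threshold."""
    return oldest_pending_age(enqueued_at, now) > threshold

# Illustrative data only; these are not real queue measurements.
now = datetime(2017, 7, 21, 2, 52, 0)
healthy = [now - timedelta(seconds=5), now - timedelta(seconds=12)]
backed_up = [now - timedelta(seconds=5), now - timedelta(minutes=4)]

print(is_delayed(healthy, now))
print(is_delayed(backed_up, now))
```

Measuring the age of the oldest pending item, rather than the queue length, is one way to surface delivery latency early, since a short queue can still be stalled.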
xMatters strives to provide high availability to our clients, and we recognize that the reliability of our services is of utmost importance to our customers and their businesses. xMatters is committed to improving our resiliency and investing in the tools and processes required to prevent and minimize service disruptions.
Timeline:
2017-07-21 02:52 - xMatters monitoring tools raise an alert about notification delays in one of the data centers located in North America
2017-07-21 02:56 - Internal Major Incident process is initiated
2017-07-21 03:04 - Support Bulletin is posted: http://status.xmatters.com/incidents/6d7zfwbsh71x
2017-07-21 03:56 - Rebalance and recycle of impacted service component is completed
2017-07-21 04:02 - Service is restored
2017-07-21 09:20 - xMatters monitoring tools raise an alert about the same issue with notification delays in one of the data centers located in North America
2017-07-21 09:25 - Internal Major Incident process is initiated
2017-07-21 09:35 - Support Bulletin is posted: http://status.xmatters.com/incidents/5nf7gv1ytw0c
2017-07-21 09:51 - Rebalance and recycle of impacted service component is completed
2017-07-21 10:16 - Incident team identifies that the remediation has only made a marginal improvement and notifications are still delayed
2017-07-21 10:20 - A second attempt to rebalance and recycle impacted service components begins
2017-07-21 10:39 - Incident team decides to begin service promotion to alternate site
2017-07-21 11:20 - Incident team determines that the second attempt has only made a marginal improvement and notifications are still delayed
2017-07-21 11:40 - Full redeploy of the impacted service component begins in the impacted data center
2017-07-21 12:20 - Service promotion to alternate site continues
2017-07-21 13:25 - Redeployment is complete and services are back online in the impacted data center with no delays in notifications
2017-07-21 13:30 - Promotion of services to alternate data center is halted
2017-07-21 13:49 - All services are restored and notifications are processing without any delays
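
As a rough cross-check of the timeline above, the following sketch computes how long each of the two delay windows lasted (from the monitoring alert to service restoration) and the gap between them. The timestamps are copied from the timeline; the helper and variable names are illustrative only. The timeline does not state a time zone, so the values are treated as naive times in a single zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# First alert to first restoration.
first_window = minutes_between("2017-07-21 02:52", "2017-07-21 04:02")
# First restoration to the second alert.
gap = minutes_between("2017-07-21 04:02", "2017-07-21 09:20")
# Second alert to full restoration.
second_window = minutes_between("2017-07-21 09:20", "2017-07-21 13:49")

print(first_window, gap, second_window)  # prints "70 318 269"
```

In other words, the first window lasted 1 hour 10 minutes, the issue returned 5 hours 18 minutes after the first restoration, and the second window lasted 4 hours 29 minutes.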
If you have any questions, please visit http://support.xmatters.com