On Tuesday, July 18, 2017, at approximately 12pm PDT, the xMatters monitoring tools alerted the Client Assistance team to an issue with a component in one of the North American data centers. While investigating the issue, the team began receiving reports from some clients about delays in notification delivery for all device types. Some clients may have also experienced a brief service disruption (less than 10 minutes) when accessing the xMatters web user interface or injecting an event.
Why did it happen?
The root cause of this issue was in the xMatters component responsible for creating notifications in one of the North American data centers. Usage reached a level that had not been anticipated from observed growth trends, past requirements, and available indicators. When the component's utilization spiked past the monitoring threshold and approached its configured limit, the system required an unusual amount of time to process requests.
How did we respond?
As soon as the xMatters Client Assistance team was alerted, they notified the Operations team, initiated the internal Major Incident Management process, and posted a bulletin to the xMatters status page. The incident response teams began simultaneously investigating the underlying cause and working to restore service for clients. The investigation quickly identified an issue with the service responsible for determining where to deliver notifications. Despite the high-availability, redundant configuration of the service, the notification queues had increased to levels that were causing significant delays in notification delivery.
During initial efforts to remediate the issue, the incident team attempted to rebalance the service's components to allow notification queues to be processed normally. After multiple attempts resulted in only marginal improvements, the teams initiated the automated process to begin promoting clients to an alternate data center within North America. During this process, some clients may have experienced a brief disruption in their ability to access the web user interface or to inject an event.
While the promotion process was underway, the investigation discovered a service configuration that required an unexpected amount of resources during peak periods of load. To mitigate the issue as quickly as possible, the incident team reallocated additional resources to the notification service components. As the service began to clear the queued notifications, the team suspended any further client promotions to the alternate data center. The xMatters teams continued to monitor the queues closely until they could confirm that the system was again delivering notifications without delay and that all services had been restored.
What are we doing to prevent it from happening again?
To prevent this issue from recurring, xMatters will perform the following actions:
Ensure all resolution service components have the necessary resources available. (Completed across all North American data centers.)
Review and revise potential component changes to improve resiliency under different situations. (Engineering work is currently in progress - tentatively scheduled to deploy mid-August 2017.)
Increase monitoring thresholds to identify any latency with notification delivery earlier in the process. (Complete.)
Review service metrics used to predict thresholds and usage limits. (Ongoing.)
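As a rough illustration of the monitoring action above, the following minimal sketch flags notification delivery latency by checking the age of the oldest queued notification against a warning level and an alert level. The function names, the 30-second and 2-minute levels, and the sample queue are hypothetical; they are not taken from the xMatters monitoring configuration.

```python
from datetime import datetime, timedelta


def oldest_queue_age(enqueued_at, now):
    """Return the age of the oldest queued notification (zero for an empty queue)."""
    if not enqueued_at:
        return timedelta(0)
    return now - min(enqueued_at)


def delivery_latency_status(enqueued_at, now,
                            warn_after=timedelta(seconds=30),
                            alert_after=timedelta(minutes=2)):
    """Classify queue latency as 'ok', 'warning', or 'alert'.

    The default levels are illustrative assumptions only.
    """
    age = oldest_queue_age(enqueued_at, now)
    if age >= alert_after:
        return "alert"
    if age >= warn_after:
        return "warning"
    return "ok"


# Hypothetical queue snapshot: two notifications waiting 45 s and 5 s.
now = datetime(2017, 7, 18, 12, 0, 0)
queue = [now - timedelta(seconds=45), now - timedelta(seconds=5)]
print(delivery_latency_status(queue, now))
```

A warning level well below the alert level gives responders time to act while the queue is still growing, rather than after deliveries are already significantly delayed.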
xMatters strives to provide high availability to our clients and we recognize that reliability of services is of utmost importance to our customers and their businesses. xMatters is committed to improving our resiliency and investing in the tools and processes required to prevent and minimize service disruptions.
2017-07-18 12:00 - xMatters Client Assistance begins receiving alerts from the monitoring tools; clients report delays in their notifications.
2017-07-18 12:18 - xMatters Operations identifies an issue with a component responsible for delivering notifications.
2017-07-18 12:19 - Internal Major Incident Management process is initiated.
2017-07-18 12:23 - Support bulletin is posted: http://status.xmatters.com/incidents/cqjzcm4m47zj
2017-07-18 13:20 - Service components are rebalanced; notification delivery improves only marginally.
2017-07-18 14:00 - Some clients promoted to alternate data center; services for those clients are restored.
2017-07-18 14:10 - Resources reallocated on the impacted service.
2017-07-18 14:29 - Notification delivery is back to normal levels; incident team continues to monitor to ensure all systems are back to normal.
2017-07-18 14:43 - All services are confirmed restored.
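The elapsed times implied by the timeline can be worked out with a short calculation. This is only a sketch: the timestamps are copied from the entries above (the first of which is approximate), while the shortened labels and the helper name `minutes_between` are illustrative.

```python
from datetime import datetime

# Selected entries copied from the timeline above; labels are shortened.
TIMELINE = [
    ("2017-07-18 12:00", "First monitoring alerts and client reports"),
    ("2017-07-18 12:23", "Support bulletin posted"),
    ("2017-07-18 14:29", "Notification delivery back to normal levels"),
    ("2017-07-18 14:43", "All services confirmed restored"),
]


def minutes_between(start, end, fmt="%Y-%m-%d %H:%M"):
    """Return the whole minutes elapsed between two timestamp strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


first_alert = TIMELINE[0][0]
for stamp, label in TIMELINE[1:]:
    print(f"{label}: {minutes_between(first_alert, stamp)} min after first alert")
```

By this calculation, the bulletin was posted 23 minutes after the first alerts, notification delivery returned to normal after 149 minutes, and all services were confirmed restored after 163 minutes.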
If you have any questions, please visit http://support.xmatters.com