On Tuesday, October 17, 2017, the xMatters network monitoring systems alerted the Client Assistance and Operations teams to an issue with the On-Demand services in one of the North American data centers. Some users may have experienced intermittent access to the xMatters On-Demand web user interface, and a delay or rejection when injecting events into xMatters.
Why did it happen?
On Monday, October 16, the Engineering and Operations teams completed the final phase of an update to a new service in the xMatters infrastructure. After approximately 24 hours, the monitoring tools detected increased latency within the infrastructure, which impacted service availability. The problem was traced to a configuration issue: the front end of the service was accepting more traffic than its back end was capable of processing. As the number of requests waiting to be processed by the back end increased, the service took progressively longer to process each request, which affected access to the web user interface and caused a potential delay or rejection of new events. These behaviors did not occur or were not observed in testing. In production, the front end of the service continued to report that it was accepting and processing requests, which prevented the automatic fallback safety mechanism from triggering.
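The queuing behavior described above can be illustrated with a minimal sketch. This is not the xMatters service or its configuration; the function name, the per-tick rates, and the optional `max_queue` limit are all hypothetical, chosen only to show how a backlog grows when a front end admits more requests than the back end can process, and how a bounded queue trades that growth for early rejections.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, ticks, max_queue=None):
    """Return (queue depth after each tick, total rejected requests).

    All parameters are illustrative assumptions, not measured values.
    """
    queue = 0
    rejected = 0
    depths = []
    for _ in range(ticks):
        # The front end admits every arriving request.
        queue += arrivals_per_tick
        # With a bounded queue, anything over the limit is rejected up front.
        if max_queue is not None and queue > max_queue:
            rejected += queue - max_queue
            queue = max_queue
        # The back end drains at most its capacity each tick.
        queue -= min(queue, capacity_per_tick)
        depths.append(queue)
    return depths, rejected


# Unbounded: the backlog (and therefore the wait time) grows every tick.
unbounded_depths, unbounded_rejected = simulate_backlog(12, 10, 5)

# Bounded: the backlog levels off, at the cost of rejecting the excess.
bounded_depths, bounded_rejected = simulate_backlog(12, 10, 5, max_queue=15)
```

In the unbounded case the front end still appears to be accepting requests while the wait time keeps growing, which matches the symptom described in this section.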
How did we respond?
As soon as the xMatters network monitoring tools detected unreliable connectivity, the Client Assistance and Operations teams initiated the internal Major Incident Management process, notified the impacted clients, and posted a notice to the xMatters StatusPage. The incident response teams began simultaneously investigating the underlying cause and working to restore services for clients. The deployment process for the new service included the ability to quickly bypass the changes in the event of any issues, and the teams used it to immediately restore the previous configuration of the service. The latency decreased to normal levels, and all services were restored and operating as expected. The investigating teams observed that requests involving the service were queuing and not being processed before eventually timing out, and identified the issue as related to the configuration of the back end of the new service.
What are we doing to prevent it from happening again?
To prevent this issue from happening again, the xMatters Engineering team will be performing the following:
1. Configuring the service to ensure it can handle all requests even through peak periods (in progress).
2. Improving the health checks performed on the service to be able to quickly pivot to an alternate service based on performance (in progress).
3. Increasing monitoring on the service to ensure any issues can be detected and alerted on prior to any service disruptions (completed).
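As an illustration of the second item, a health check based on performance might look something like the sketch below. It is a simplified assumption, not the actual xMatters health check: the class name, the averaging window, the latency threshold, and the route names are all hypothetical. The idea is that the check judges the service by how long recent requests took, so a service that is still accepting requests but responding slowly is reported as unhealthy.

```python
from collections import deque


class LatencyHealthCheck:
    """Report unhealthy when the average of recent latencies exceeds a threshold."""

    def __init__(self, threshold_seconds, window=20):
        self.threshold_seconds = threshold_seconds
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def record(self, latency_seconds):
        self.samples.append(latency_seconds)

    def healthy(self):
        # With no samples yet, assume healthy rather than failing over blindly.
        if not self.samples:
            return True
        average = sum(self.samples) / len(self.samples)
        return average <= self.threshold_seconds


def choose_route(check, primary="new-service", fallback="previous-path"):
    """Route to the fallback whenever the primary is judged unhealthy."""
    return primary if check.healthy() else fallback


check = LatencyHealthCheck(threshold_seconds=0.5, window=3)
check.record(0.1)
check.record(0.2)
route_when_fast = choose_route(check)

# Requests are still being accepted, but they are slow.
for _ in range(3):
    check.record(2.0)
route_when_slow = choose_route(check)
```

A real check would likely use a percentile rather than a plain average and would add some hysteresis to avoid flapping between routes; those refinements are omitted here for brevity.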
2017-10-18 06:41AM - xMatters monitoring tools alert the Operations and Client Assistance teams to accessibility issues within one of the data centers in North America
2017-10-18 06:43AM - Internal Major Incident process is initiated
2017-10-18 06:47AM - Support Bulletin Posted - http://status.xmatters.com/incidents/h4nhr10fvww0
2017-10-18 07:01AM - The Operations team discovers the issue to be related to the service responsible for processing internet requests
2017-10-18 07:05AM - Service is disabled and bypassed by the Operations team
2017-10-18 07:06AM - All services are restored and working as expected
If you have any questions, please visit http://support.xmatters.com