Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On Tuesday, October 17, 2017, the xMatters network monitoring systems alerted the Client Assistance and Operations teams to an issue with the On-Demand services in one of the North American data centers. Some users may have experienced intermittent access to the xMatters On-Demand web user interface, and a delay or rejection when injecting events into xMatters.

Why did it happen?

On Monday, October 16, the Engineering and Operations teams completed the final phase of an update to a new service in the xMatters infrastructure. After approximately 24 hours, the monitoring tools detected increased latency within the infrastructure, which impacted service availability. The problem was traced to a configuration issue: the front end of the service was accepting more traffic than its back end was capable of processing. As the number of requests waiting to be processed by the back end increased, the service took longer and longer to process each request, which affected access to the web user interface and caused a potential delay or rejection of new events. These behaviors did not occur, or were not observed, in testing. In addition, because the front end of the service continued to report that it was accepting and processing requests, the automatic fallback safety mechanism was not triggered.
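The queuing behavior described above can be illustrated with a minimal sketch. This is not the xMatters service or its configuration; the function name `simulate`, the request rates, and the queue limit are all hypothetical values chosen for illustration. It models a front end that accepts a fixed number of requests per tick and a back end that can process fewer, with and without a limit on how many requests may wait.

```python
from collections import deque


def simulate(arrivals_per_tick, served_per_tick, ticks, max_queue=None):
    """Model a front end feeding a slower back end (hypothetical numbers).

    Returns (queue depth after each tick, total requests rejected).
    """
    queue = deque()
    depths = []
    rejected = 0
    for tick in range(ticks):
        # Front end: accept new requests. If a limit is set, reject the
        # excess immediately instead of letting it wait and time out.
        for _ in range(arrivals_per_tick):
            if max_queue is not None and len(queue) >= max_queue:
                rejected += 1
            else:
                queue.append(tick)
        # Back end: process at most served_per_tick waiting requests.
        for _ in range(min(served_per_tick, len(queue))):
            queue.popleft()
        depths.append(len(queue))
    return depths, rejected


# Unbounded: the backlog grows by 2 requests every tick.
unbounded_depths, unbounded_rejected = simulate(10, 8, 5)
# Bounded at 10 waiting requests: the backlog stays flat, excess is rejected.
bounded_depths, bounded_rejected = simulate(10, 8, 5, max_queue=10)
```

In the unbounded case the backlog, and therefore the wait time, keeps growing for as long as arrivals exceed capacity. With a limit, the excess is rejected up front and the requests that are accepted are still processed promptly.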

How did we respond?

As soon as the xMatters network monitoring tools detected unreliable connectivity, the Client Assistance and Operations teams initiated the internal Major Incident Management process, notified the impacted clients, and posted a notice to the xMatters StatusPage. The incident response teams simultaneously began investigating the underlying cause and working to restore services for clients. When releasing the new service, xMatters had implemented a deployment process that included the ability to quickly bypass the changes in the event of any issues, so the teams immediately restored the previous configuration of the service. The latency decreased to normal levels, and all services were restored and operating as expected. The investigating teams observed that requests involving the service were queuing and not being processed before eventually timing out, and identified the issue as related to the configuration of the back end of the new service.

What are we doing to prevent it from happening again?

To prevent this issue from happening again, the xMatters Engineering team will be performing the following:

1. Configuring the service to ensure it can handle all requests even through peak periods (in progress).
2. Improving the health checks performed on the service to be able to quickly pivot to an alternate service based on performance (in progress).
3. Increasing monitoring on the service to ensure any issues can be detected and alerted on prior to any service disruptions (completed).
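As an illustration of a health check based on performance rather than only on whether the front end is accepting requests, here is a minimal sketch. It does not represent the xMatters implementation; the `HealthSample` type, the `is_healthy` function, and the thresholds are hypothetical and assumed for illustration only.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List


@dataclass
class HealthSample:
    """One observation of the service (hypothetical fields)."""
    accepting: bool            # front end reports it is accepting requests
    latencies_ms: List[float]  # recent end-to-end request latencies
    queue_depth: int           # requests waiting for the back end


def is_healthy(sample, max_p95_ms=500.0, max_queue_depth=100):
    """Return False when traffic should pivot to an alternate service.

    The thresholds are assumed values, not taken from the incident report.
    """
    if not sample.accepting:
        return False
    # A front end can keep accepting requests while the backlog grows,
    # so the queue depth is checked independently.
    if sample.queue_depth > max_queue_depth:
        return False
    # quantiles() needs at least two data points; n=20 yields cut points
    # at 5% steps, so the last one is the 95th percentile.
    if len(sample.latencies_ms) >= 2:
        p95 = quantiles(sample.latencies_ms, n=20)[-1]
        if p95 > max_p95_ms:
            return False
    return True


normal = HealthSample(accepting=True, latencies_ms=[50.0] * 10, queue_depth=5)
slow = HealthSample(accepting=True, latencies_ms=[2000.0] * 10, queue_depth=5)
backlogged = HealthSample(accepting=True, latencies_ms=[50.0] * 10, queue_depth=500)
```

In this sketch both `slow` and `backlogged` are reported as unhealthy even though the front end is still accepting requests, which is the condition a check based on acceptance alone would miss.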

Timeline:

2017-10-17 06:41AM - xMatters monitoring tools alert the Operations and Client Assistance teams to accessibility issues within one of the data centers in North America

2017-10-17 06:43AM - Internal Major Incident process is initiated

2017-10-17 06:47AM - Support Bulletin Posted - http://status.xmatters.com/incidents/h4nhr10fvww0

2017-10-17 07:01AM - The Operations team discovers the issue to be related to the service responsible for processing internet requests

2017-10-17 07:05AM - Service is disabled and bypassed by the Operations team

2017-10-17 07:06AM - All services are restored and working as expected

If you have any questions, please visit http://support.xmatters.com

Posted Oct 25, 2017 - 14:01 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Oct 17, 2017 - 07:18 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Oct 17, 2017 - 07:13 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Oct 17, 2017 - 07:11 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.
Posted Oct 17, 2017 - 06:47 PDT