Issue Discovered - Service disruption in North America

Incident Report for xMatters

Postmortem

What happened?

On Thursday, November 15, 2018 at approximately 5:25 PM PST, the xMatters monitoring systems alerted Client Assistance to a potential issue with one of the data centers located in North America. For the remainder of Thursday, much of Friday morning, and into the weekend, North American customers experienced intermittent access to the user interface, delays or rejections when injecting events into xMatters, and delays or failures in notification delivery. During this time, some customers in regions outside of North America may also have experienced delayed or rejected voice notifications.

Why did it happen?

This incident was caused by several issues occurring in succession. The major incident was caused by a software defect with a storage array located in one of our North American data centers, which resulted in the inaccessibility of the array and its associated disk volumes for several hours. Several hours after completing a failover of the affected databases and applications to an alternate data center, the xMatters teams observed additional failures caused by services attempting to access unresponsive databases. This resulted in the connection pools for those services filling up and rejecting new connections. The following day, the teams identified a defect with the Integration Builder platform that was causing intermittent failures of web servers and the user interface.

How did we respond?

As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Operations teams escalated the issue to a major incident and initiated the internal major incident management process. While the incident response teams began simultaneously investigating the underlying cause and working to restore services for clients, Client Assistance posted a notice to the xMatters status page to inform clients about the incident. The teams immediately identified that the issue was limited to a specific data center within the North America region, and determined that the problem was related to the storage array. The Operations team began migrating production services to an alternate data center. These steps restored service for impacted clients, but a new issue continued to cause intermittent service disruptions for several clients throughout the night and into the following morning, with some applications and web services experiencing intermittent failures and high error rates.

The investigation determined that this new issue was caused by some databases in the failed data center appearing as still accessible over the network when they were not actually responsive. Customers attempting to access these non-production services caused connection and performance degradation in some healthy services at the live data center, where database connection pools began to fill up until connections were rejected. When the teams discovered this issue, all systems at the failed data center were powered off to ensure they were unavailable and inaccessible, and this issue was resolved.
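The pool exhaustion described above can be illustrated with a toy model. This is a minimal sketch, not the xMatters implementation: the `BoundedPool` class, the `simulate` function, and all of the numbers are hypothetical. It assumes that a request to an unresponsive database never returns its connection, while a healthy request returns its connection as soon as it finishes.

```python
import threading


class BoundedPool:
    """Toy connection pool (hypothetical): at most max_size connections out at once."""

    def __init__(self, max_size):
        self._slots = threading.BoundedSemaphore(max_size)

    def try_acquire(self):
        # Non-blocking: returns False immediately when the pool is full.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def simulate(pool_size, hung_requests, healthy_requests):
    """Return (slots held by hung requests, healthy requests rejected)."""
    pool = BoundedPool(pool_size)
    # Assumption: requests to an unresponsive database never release their slot.
    held = sum(1 for _ in range(hung_requests) if pool.try_acquire())
    rejected = 0
    for _ in range(healthy_requests):
        if pool.try_acquire():
            pool.release()  # a healthy request finishes and frees its slot
        else:
            rejected += 1
    return held, rejected
```

With a pool of three connections, `simulate(3, 2, 5)` leaves one free slot and rejects nothing, while `simulate(3, 3, 5)` rejects all five healthy requests. In this model, removing the source of the hung requests is what frees the pool, which is consistent with the resolution described above (powering off the systems at the failed data center).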

Later, some customers reported reduced performance and slowness of the web user interface. This issue was traced to a problem in the Activity Stream of the Integration Builder that occurred when the system attempted to process a large number of sizable integration logs related to failed integrations. The issue caused the web services to run out of memory, at which point they were automatically restarted. These restart loops caused rotating failures throughout the pool of web servers. The team mitigated the issue and resolved the out-of-memory errors by truncating the largest integration logs.
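The effect of truncating the largest logs can be sketched as follows. This is an illustrative example only: the function names, the log names, and the character limit are assumptions, and the report does not describe how the actual truncation was performed.

```python
def truncate_logs(logs, max_chars):
    """Return a copy of logs (name -> text) with each body capped at max_chars."""
    return {name: body[:max_chars] for name, body in logs.items()}


def total_size(logs):
    """Total number of characters a process would hold if it loaded every log."""
    return sum(len(body) for body in logs.values())


# Hypothetical data: one oversized log from a failed integration, one small log.
logs = {"failed-integration": "E" * 1_000_000, "healthy-integration": "ok" * 50}
capped = truncate_logs(logs, max_chars=10_000)
```

Here `total_size(logs)` is 1,000,100 characters and `total_size(capped)` is 10,100, so capping only the oversized entry removes almost all of the memory pressure while leaving the small log untouched.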

The Operations and Engineering teams continued to review the existing state of the failed data center and began systematically bringing services and data back online in a safe and coordinated manner. The teams reviewed client instances and performance throughout the day and made any necessary configuration modifications to ensure the systems were operational. At 4:00 PM PST, the Operations team began restoring the redundant databases affected by the storage array malfunction.

What are we doing to prevent it from happening again?

This series of incidents caused a major disruption of service, and at xMatters we know we can do better. While the incidents themselves were unrelated, their occurrence in short succession prolonged the period of instability. While these kinds of issues are difficult to predict and prevent, xMatters teams continually review our processes and seek areas for improvement and ways to reduce the amount of time clients are impacted. We are still working with the storage array and data center vendors and providers to determine the root cause of the initial failure, and will update this root cause analysis if and when those investigations uncover any further information. To mitigate and eliminate other issues uncovered during this disruption, the xMatters teams have committed to the following actions:

The Engineering team has developed a fix for the issue related to the non-responsive databases causing connection pool consumption. The fix is scheduled to be deployed as part of the 5.5.235 release (scheduled for Wednesday, November 21).
The Engineering team is developing and testing a fix for the issue related to the Activity Stream in the Integration Builder. The fix will be deployed as a hotfix for the 5.5.235 release as soon as the testing is complete.

As part of our commitment to continuous improvement, we are making hosting service improvements to our infrastructure-as-a-service, scheduled to occur in the North American region in January 2019. These improvements will remove points of failure such as the storage array involved in this incident. For more information, see the article on our support site: https://support.xmatters.com/hc/en-us/articles/115005269506. The robustness of this new infrastructure is dramatically improved, with increased resiliency across the entire service implementation. To help reduce the load on our existing data centers and prevent similar issues from recurring, we are currently investigating ways to accelerate the migration process for some customers.

Timeline:

November 15, 2018, 5:25 PM - xMatters monitoring tools alert the Client Assistance team to a potential issue with clients in North America

5:26 PM - Internal major incident management process initiated

5:33 PM - Engineering identifies the issue as related to the storage array; begins fail-over to alternate data center

5:37 PM - Client Assistance posts status page bulletin: https://status.xmatters.com/incidents/pj2bj697gkxw

5:41 PM - Systems begin to come online in new data center

5:52 PM - Engineering implements mitigation steps to reduce load on storage array

5:57 PM - Majority of services restored; major incident team continues to work through systems

6:52 PM - All fail-over complete; some services require additional rehabilitation

7:31 PM - All services restored

November 16, 2018, 1:43 AM - xMatters monitoring tools alert the Client Assistance team to an intermittent issue with some clients in North America

2:00 AM - Internal major incident management process re-initiated

2:04 AM - Engineering begins investigating the issue

2:09 AM - Client Assistance posts notice to xMatters status page: https://status.xmatters.com/incidents/bz1hxfxfbrlt

2:53 AM - Operations begins work-around to attempt to mitigate issue for xMatters customers in the new data center

3:01 AM - All services reporting as restored

5:23 AM - Clients contact xMatters Client Assistance, report slowness in navigating/accessing the web user interface

5:36 AM - Client Assistance posts notice to xMatters status page: https://status.xmatters.com/incidents/3s1n4l1kldmt

6:33 AM - Services stabilize

7:30 AM - 3:30 PM - Major incident teams continue to review each instance and make necessary corrections or restarts

4:00 PM - Back-end database replication started to restore data replication

If you have any questions, please visit http://support.xmatters.com

Posted Nov 21, 2018 - 10:05 PST

Resolved

The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.

Posted Nov 16, 2018 - 06:51 PST

Monitoring

The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.

Posted Nov 16, 2018 - 06:33 PST

Identified

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.

Posted Nov 16, 2018 - 05:36 PST