Issue Discovered - Service disruption in North American Region – Web User Interface
Incident Report for xMatters
Postmortem

Details

What happened?

On July 12, 2022, at approximately 10:30 AM Pacific, some customers in North America reported to xMatters Customer Support that they were unable to load User and Group Performance reports. Some users also reported slow loading of dashboard widgets in the Communication Center or errors when attempting to log in to the xMatters user interface. The issue affected only the performance reports, dashboard widgets, and login; all other services, including signal processing, notification creation and delivery, and response processing, were not impacted.

Why did it happen?

The issue was traced to enhancements to the User and Group Performance reports that had been enabled, or toggled on, shortly before the first reported issues. The backend services that query data for the performance reports and related dashboard widgets were not appropriately sized for a production load. This caused a backlog in request processing, which led to delays in accessing the data via the web user interface. The scale of the change required for that morning's Pole Position release led to the misconfiguration as the interaction between features was missed during the QA process.
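The backlog mechanism described above can be illustrated with a minimal sketch. It assumes a simple model in which requests arrive and are served at constant rates; the function name and the rates below are hypothetical and are not measurements from this incident.

```python
def backlog_over_time(arrival_rate, service_rate, seconds):
    """Return the queue depth at the end of each second.

    Assumes constant per-second arrival and service rates (illustrative only).
    """
    backlog = 0.0
    history = []
    for _ in range(seconds):
        # The queue grows by arrivals, shrinks by served requests, and never goes negative.
        backlog = max(0.0, backlog + arrival_rate - service_rate)
        history.append(backlog)
    return history


# Hypothetical numbers: a service sized below the incoming load
# versus the same load with more capacity.
undersized = backlog_over_time(arrival_rate=120, service_rate=100, seconds=60)
right_sized = backlog_over_time(arrival_rate=120, service_rate=150, seconds=60)

print(undersized[-1])   # backlog keeps growing while load exceeds capacity
print(right_sized[-1])  # no backlog when capacity exceeds load
```

In this model, any sustained gap between incoming load and capacity makes the backlog grow for as long as the gap persists, which is why reducing the load (or raising capacity) lets the queue drain.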

How did we respond?

As soon as customers reported the issue, Customer Support confirmed performance issues via the internal monitoring tools and initiated the major incident management process. The incident response team determined that the best course of action to mitigate the issue quickly was to toggle off the recently changed reporting features to reduce the load on the backend services. This allowed the web user interface to more easily complete its processing requirements and the backlog of requests quickly cleared. Customers confirmed that performance had returned to normal levels and service had been restored.

The teams continued to investigate the cause of the issue and identified that the backend services that query performance reporting for dashboard widgets and the report pages in the web user interface were unable to retrieve data in a timely manner. This also caused the web login issue, as delays in loading dashboards eventually led to login timeouts. The teams were able to determine that the resources allocated to dashboard widgets were not processing requests quickly enough, leading to delays in responses to requests and causing upstream services to create backlogs of incoming requests.

What are we doing to prevent it from happening again?

To prevent this issue from reoccurring, the Engineering and Operations teams revised the resource allocations for all of the new reporting and dashboard updates. Over the course of July 13 and 14, they enabled each of the new features in sequence and verified all new features were operating normally and that no other issues occurred. 

To prevent similar issues, and to ensure that QA in both Development and Non-Production environments properly accounts for production load and is able to surface these types of misconfiguration, the teams are reviewing QA and release practices to reduce the level of complexity required for large-scale releases. The teams are currently implementing the following changes:

  1. Enacting a process to sequence the enablement of features using an Enable > Test > Verify process during large-scale deployments.
  2. Reviewing QA processes to better identify potential performance-related impacts. 
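
The Enable > Test > Verify sequence in item 1 could be sketched roughly as follows. This is an illustration under stated assumptions, not the xMatters implementation: the feature names, the in-memory flag store, and the `test` and `verify` callbacks are all hypothetical placeholders.

```python
from typing import Callable, List


def rollout_in_sequence(
    features: List[str],
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    test: Callable[[str], bool],
    verify: Callable[[str], bool],
) -> List[str]:
    """Enable features one at a time; stop and roll back at the first failure."""
    enabled = []
    for feature in features:
        enable(feature)
        # Test the feature itself, then verify overall service health.
        if test(feature) and verify(feature):
            enabled.append(feature)
            continue
        # Toggle the failing feature back off and halt the rollout.
        disable(feature)
        break
    return enabled


# Hypothetical in-memory flag store and checks, for demonstration only.
flags = {}
result = rollout_in_sequence(
    features=["user_report", "group_report", "dashboard_widgets"],
    enable=lambda name: flags.update({name: True}),
    disable=lambda name: flags.update({name: False}),
    test=lambda name: True,
    verify=lambda name: name != "dashboard_widgets",  # simulate a failed health check
)
print(result)
print(flags)
```

Enabling one feature at a time means a failed check points at a single change, so only that feature needs to be toggled off rather than the whole release.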

Timeline:

Tuesday, July 12 (all times PT)

Time      Action
10:15 AM  Pole Position features are toggled on in production deployments
10:30     Internal monitoring tools alert to potential performance impact
10:37     Initial customer reports of performance or login issues
10:41     Severity 1 Incident initiated
10:55     Mitigation actions begin
11:40     Mitigation actions complete
11:47     Services restored
12:05     Incident resolved

If you have any questions, please visit http://support.xmatters.com

Posted Jul 15, 2022 - 14:36 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Jul 12, 2022 - 12:04 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored. Some customers may still experience some performance degradation in dashboard widgets.
Posted Jul 12, 2022 - 11:48 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Jul 12, 2022 - 11:40 PDT
Update
We are continuing to investigate the issue. Some customers may be experiencing intermittent errors when logging into the xMatters Web UI.
Posted Jul 12, 2022 - 11:37 PDT
Investigating
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new. Our support agents are waiting to help.
Posted Jul 12, 2022 - 11:30 PDT
This incident affected: North America (Web Interface).