On July 12, 2022, at approximately 10:30 AM Pacific, some customers in North America reported to xMatters Customer Support that they were unable to load User and Group Performance reports. Some users also reported slow loading of dashboard widgets in the Communication Center or errors when attempting to log in to the xMatters user interface. The issue affected only the performance reports, dashboard widgets, and login; all other services, including signal processing, notification creation and delivery, and response processing, were not impacted.
The issue was traced to enhancements to the User and Group Performance reports that had been enabled, or toggled on, shortly before the first reported issues. The backend services that query data for the performance reports and related dashboard widgets were not appropriately sized for a production load. This caused a backlog in request processing, which led to delays in accessing the data via the web user interface. The scale of the change required for that morning's Pole Position release led to the misconfiguration, because the interaction between features was missed during the QA process.
As soon as customers reported the issue, Customer Support confirmed performance issues via the internal monitoring tools and initiated the major incident management process. The incident response team determined that the best course of action to mitigate the issue quickly was to toggle off the recently changed reporting features to reduce the load on the backend services. This allowed the web user interface to more easily complete its processing requirements and the backlog of requests quickly cleared. Customers confirmed that performance had returned to normal levels and service had been restored.
The teams continued to investigate the cause of the issue and identified that the backend services that query performance reporting data for dashboard widgets and the report pages in the web user interface were unable to retrieve data in a timely manner. This also caused the web login issue, as delays in loading dashboards eventually led to login timeouts. The teams determined that the resources allocated to dashboard widgets were not processing requests quickly enough, which delayed responses and caused upstream services to build up backlogs of incoming requests.
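The backlog behavior described above can be illustrated with a minimal sketch. It is not the xMatters implementation; the function name `simulate_backlog` and all request rates are hypothetical values chosen only to show how a queue grows while incoming requests exceed processing capacity and clears once the load is reduced.

```python
from typing import List, Sequence


def simulate_backlog(arrivals: Sequence[int], capacity: int) -> List[int]:
    """Return the number of queued requests at the end of each second.

    arrivals: requests arriving in each one-second interval (hypothetical).
    capacity: requests the backend can complete per second (hypothetical).
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        backlog += arriving
        # The backend completes at most `capacity` requests per second.
        backlog -= min(backlog, capacity)
        history.append(backlog)
    return history


# Three seconds with the new reporting features on (120 req/s against a
# capacity of 100 req/s), then three seconds with them toggled off (40 req/s).
print(simulate_backlog([120, 120, 120, 40, 40, 40], capacity=100))
# [20, 40, 60, 0, 0, 0]
```

In this toy model the queue grows by the difference between arrival rate and capacity every second, which is why reducing the load (or increasing the capacity) is enough to let the backlog drain.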
To prevent this issue from reoccurring, the Engineering and Operations teams revised the resource allocations for all of the new reporting and dashboard updates. Over the course of July 13 and 14, they enabled each of the new features in sequence and verified all new features were operating normally and that no other issues occurred.
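Enabling features one at a time and verifying health after each step could be sketched as follows. This is an illustrative outline under stated assumptions, not the process the teams actually used: `enable_in_sequence`, the `enable`/`disable` callbacks, the `is_healthy` check, and the feature names are all hypothetical stand-ins for a real feature-toggle system and real monitoring.

```python
from typing import Callable, Iterable, List


def enable_in_sequence(
    features: Iterable[str],
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> List[str]:
    """Enable features one by one; stop and roll back the first one that fails.

    Returns the list of features left enabled.
    """
    enabled = []
    for feature in features:
        enable(feature)
        if not is_healthy():
            # Roll back only the feature that caused the failed check.
            disable(feature)
            break
        enabled.append(feature)
    return enabled


# Usage with an in-memory stand-in for a toggle system (hypothetical names).
active = set()
result = enable_in_sequence(
    ["report-filters", "heavy-widget", "group-export"],
    enable=active.add,
    disable=active.discard,
    is_healthy=lambda: "heavy-widget" not in active,
)
print(result)  # ['report-filters']
```

Stopping at the first failed check keeps the blast radius to a single feature, at the cost of a slower rollout.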
To prevent similar issues, and to ensure that QA in both Development and Non-Production environments properly accounts for production load and is able to surface these types of misconfiguration, the teams are reviewing QA and release practices to reduce the complexity of large-scale releases. The teams are currently implementing the following changes:
Time (PT) | Event |
---|---|
Tuesday, July 12 10:15 AM PT | Pole Position features are toggled on in production deployments |
10:30 | Internal monitoring tools alert to potential performance impact |
10:37 | Initial customer reports of performance or login issues |
10:41 | Severity 1 Incident initiated |
10:55 | Mitigation actions begin |
11:40 | Mitigation actions complete |
11:47 | Services restored |
12:05 | Incident resolved |
If you have any questions, please visit http://support.xmatters.com