On January 12, 2020, at approximately 7:05 AM PST, xMatters internal monitoring tools and customer reports alerted Customer Support to an issue with event processing and delivery in the North America region. While the incident was in progress, some North American customers may have experienced delays in event processing and notification delivery, including a window where notifications were not being generated for active events. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, save for a brief period during one of the remediation procedures.
Why did it happen?
This issue occurred when a process responsible for inter-service communication encountered resource constraints. The issue was traced to an earlier change that increased the internal process's retention period to improve xMatters' ability to recover data. Resources for the process were sized appropriately in terms of processing, disk, and memory, but the setting that controls the number of open files to be retained was not.
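The report does not name the process or the exact setting involved. As a rough illustration only, the sketch below assumes the setting behaves like an operating system limit on open file descriptors (on Linux, `RLIMIT_NOFILE`); that is an assumption, not something stated in this report. The function names and the 80% warning threshold are hypothetical. It shows how a process could compare its open file count against its limit and flag shrinking headroom before the limit is reached.

```python
import os
import resource


def is_near_limit(open_count, soft_limit, warn_ratio=0.8):
    """Return True when open_count has reached warn_ratio of the limit.

    warn_ratio is a hypothetical threshold chosen for illustration.
    """
    if soft_limit == resource.RLIM_INFINITY:
        # No limit is configured, so there is nothing to run into.
        return False
    return open_count >= warn_ratio * soft_limit


def current_open_file_usage():
    """Return (open_count, soft_limit) for the current process (Linux only)."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd holds one entry per open descriptor. The count is
    # approximate because listing the directory briefly opens one itself.
    open_count = len(os.listdir("/proc/self/fd"))
    return open_count, soft_limit


if __name__ == "__main__":
    count, limit = current_open_file_usage()
    print(f"open files: {count}, soft limit: {limit}")
    if is_near_limit(count, limit):
        print("warning: open file usage is approaching the limit")
```

A longer retention period generally means more files stay open at once, so a check like this would need to be revisited whenever retention changes.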
How did we respond?
As soon as the monitoring tools alerted to the error, Customer Support initiated the Severity-1 process and engaged the incident response teams. The teams began to troubleshoot and restarted the affected process. When the process failed to recover after the restart, the team decided to promote affected customers to the secondary site to ensure reliable processing of events and notifications. Once the teams initiated the promotion at approximately 7:35 AM PST, notifications began processing properly for most customers. The promotion procedures were completed at 7:50 AM PST, and the majority of notifications continued processing without issue. The teams continued troubleshooting, then identified and resolved the underlying issue on the primary site by increasing the setting that governs the number of open files for the process. Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational.
What are we doing to prevent it from happening again?
To resolve this issue permanently, the xMatters teams have adjusted the setting that governs the number of open files for the process.
Timeline (Date/Time in PST)
2020-01-12 7:05 AM - Monitoring alerts to incident with notification processing; Severity 1 incident declared
2020-01-12 7:21 AM - Rolling restart completed
2020-01-12 7:24 AM - Errors do not clear, notifications still impacted
2020-01-12 7:34 AM - Promotion to secondary site begins
2020-01-12 7:50 AM - Promotion to secondary site completed, notifications begin to process as expected
2020-01-12 7:55 AM - Team begins to monitor the mitigation
2020-01-12 8:20 AM - Incident resolved
If you have any questions, please visit http://support.xmatters.com