On March 14, 2019, at approximately 2:52 PM (PDT), the xMatters monitoring tools alerted Client Assistance to an issue involving notification delivery. The On-Demand service was accepting and processing events, but was not creating or sending notifications. Some clients reported the issue to Client Assistance while the incident was being investigated, confirming that they were unable to initiate or send notifications.
The issue was caused by an operator error during a clean-up process that reverted some services to a prior state, resulting in a misconfiguration between services. The misconfiguration prevented notifications from being processed after events were submitted to xMatters.
As soon as the internal monitoring tools alerted Client Assistance to an issue, they launched an investigation. When they were able to reproduce the issue and identify the scope, they immediately initiated the internal major incident management process and posted a notice for customers on the xMatters status page. The incident response teams began working to restore services and searching for the root cause. They identified a misconfiguration within services required for notification creation and distribution. They quickly initiated a resolution process to restore service configurations to a prior, known good state. As soon as the resolution was applied, notifications began processing, and the teams continued to monitor the notification queues until the backlogs had cleared. Clients confirmed that they were receiving notifications promptly and that all services had been restored.
The xMatters Engineering team has completed an internal review and is developing and implementing an automated process for all clean-up activities for the On-Demand service. This process will include the following:
Additional monitoring checkpoints to optimize clean-up activities
Automated rerouting of live traffic prior to reverting any services
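As an illustration only, the two measures above could be combined along the following lines. This is a minimal sketch under stated assumptions, not the xMatters implementation: the class names (CleanupStep, CleanupRunner) and the traffic and health-check hooks are hypothetical placeholders for whatever tooling the automated process actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CleanupStep:
    """One clean-up action on a service (hypothetical model)."""
    service: str
    revert: Callable[[], None]        # reverts the service to a prior state
    health_check: Callable[[], bool]  # monitoring checkpoint for the service


@dataclass
class CleanupRunner:
    """Runs a clean-up step with checkpoints and traffic rerouting."""
    drain_traffic: Callable[[str], None]    # hypothetical hook: reroute live traffic away
    restore_traffic: Callable[[str], None]  # hypothetical hook: route traffic back
    log: List[str] = field(default_factory=list)

    def run(self, step: CleanupStep) -> bool:
        # Checkpoint before the change: skip services that are already unhealthy.
        if not step.health_check():
            self.log.append(f"abort:{step.service}:unhealthy-before")
            return False

        # Reroute live traffic before reverting anything.
        self.drain_traffic(step.service)
        self.log.append(f"drained:{step.service}")

        step.revert()
        self.log.append(f"reverted:{step.service}")

        # Checkpoint after the change: keep traffic away if the service is unhealthy.
        if not step.health_check():
            self.log.append(f"hold:{step.service}:unhealthy-after")
            return False

        self.restore_traffic(step.service)
        self.log.append(f"restored:{step.service}")
        return True
```

In this sketch, a revert that leaves a service misconfigured fails the second checkpoint, so live traffic stays rerouted until the configuration is corrected.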
March 14, 2019
2:52 PM - Internal monitoring alerts Client Assistance to issue with notification processing
3:04 PM - Client Assistance confirms and replicates the issue
3:05 PM - Issue escalated to major incident management (MIM) - incident response teams assembled
3:12 PM - Notification posted to xMatters status page
3:15 PM - Incident response teams isolate issue
3:27 PM - Corrective action designed and tested
3:30 PM - Fix promoted to production; notifications begin processing
3:30 PM - Incident response teams monitor event processing and clearing of backlog
3:57 PM - Backlogs cleared; all services restored
If you have any questions, please visit http://support.xmatters.com