What happened?
On April 9, 2018, beginning at approximately 07:33 AM PDT, some clients reported to xMatters Client Assistance that the On-Demand service did not appear to be processing any of their notifications. When these clients checked the Events report, the Log was empty and showed no signs of processing, even after they waited a few minutes and refreshed several times. The system continued to accept injected events, but did not create or send any notifications.
Why did it happen?
The issue was caused by a previously unknown defect: company names containing ampersands (&) triggered parsing errors, which in turn caused notification processing to stop and events to be queued.
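The following is a minimal sketch of this class of defect, not the actual xMatters code. It assumes, purely for illustration, that the company name is embedded in an XML message; the report does not state which format was involved, and the function names and message layout are hypothetical. An unescaped ampersand makes the message malformed, while escaping it before building the message lets the parser recover the original name.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape


def build_payload(company_name: str, escape_text: bool) -> str:
    """Build a hypothetical XML message containing the company name."""
    # escape() converts &, < and > into XML entities (& becomes &amp;).
    name = escape(company_name) if escape_text else company_name
    return f"<event><company>{name}</company></event>"


def parse_company(payload: str) -> Optional[str]:
    """Return the company name, or None if the message cannot be parsed."""
    try:
        return ET.fromstring(payload).findtext("company")
    except ET.ParseError:
        # A bare ampersand is not well-formed XML, so parsing fails here.
        return None


if __name__ == "__main__":
    raw = build_payload("Smith & Sons", escape_text=False)
    safe = build_payload("Smith & Sons", escape_text=True)
    print(parse_company(raw))
    print(parse_company(safe))
```

In this sketch the unescaped message fails to parse and the escaped one round-trips the name intact. A consumer that retries a failing message indefinitely instead of setting it aside could stall everything queued behind it, which is one way a single bad name can halt processing.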
How did we respond?
As soon as Client Assistance received reports of an issue with notification processing, they began collecting information and initiated the internal major incident management process to engage Engineering and Operations. The incident response teams identified ongoing instability within the notification processing infrastructure, and determined that the issue was related to a problem with specific messages. The teams immediately took measures to remove the problematic messages and prevent them from recurring, and then restarted the notification services. Client Assistance confirmed that all services had been restored and notified the impacted clients that the issue had been resolved.
What are we doing to prevent it from happening again?
The xMatters Engineering team is currently designing a permanent fix for the underlying defect (reference: BUG-11989). As soon as the solution has been developed and tested, it will be released as part of the regular release deployment schedule.
Timeline (all times PDT):
2018-04-09 - 07:33 AM - Client Assistance received reports from some clients that notifications were not being processed
2018-04-09 - 07:45 AM - Internal major incident management process initiated
2018-04-09 - 08:10 AM - Errors in company names identified
2018-04-09 - 08:18 AM - Changes made to the affected company names to prevent further occurrences
2018-04-09 - 08:30 AM - Residual errors seen; specific notifications cleaned up
2018-04-09 - 09:00 AM - Rolling service restarts performed; issue resolved
If you have any questions, please visit http://support.xmatters.com