Issue Discovered - Notification Delays for some North American Customers
Incident Report for xMatters
Postmortem

What happened?

On April 9, 2018, beginning at approximately 07:33 AM PDT, some clients reported to xMatters Client Assistance that the On-Demand service did not appear to be processing any of their notifications. When clients checked the Events report, the Log was empty and showed no signs of processing, even after they waited a few minutes and refreshed several times. The system continued to accept injected events, but did not create or send any notifications.

Why did it happen?

The issue was caused by a previously unknown defect involving company names containing ampersands (&) that resulted in parsing errors, which in turn caused notifications to stop processing and events to be queued.
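As an illustration of this class of defect only, the following minimal Python sketch shows how an unescaped ampersand can make a message unparseable. This report does not state the message format or the code involved; the XML payload, the element names, and the functions build_message and parse_company are all hypothetical assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_message(company_name: str, escape_name: bool) -> str:
    """Build a hypothetical XML notification payload (format is an assumption)."""
    name = escape(company_name) if escape_name else company_name
    return f"<notification><company>{name}</company></notification>"


def parse_company(payload: str) -> str:
    """Return the company name; raises ET.ParseError if the XML is malformed."""
    return ET.fromstring(payload).findtext("company")


# An unescaped "&" is not well-formed XML, so parsing fails.
raw = build_message("Smith & Sons", escape_name=False)
try:
    parse_company(raw)
    raw_parsed = True
except ET.ParseError:
    raw_parsed = False

# Escaping the name ("&" becomes "&amp;") lets the same payload parse cleanly.
safe = build_message("Smith & Sons", escape_name=True)
safe_name = parse_company(safe)
```

In a sketch like this, escaping at the point where the message is built keeps a single company name from blocking every message that follows it in the queue.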

How did we respond?

As soon as Client Assistance received reports of an issue with notification processing, they began collecting information and initiated the internal major incident management process to engage Engineering and Operations. The incident response teams identified ongoing instability within the notification processing infrastructure, and determined that the issue was related to a problem with specific messages. The teams immediately took measures to remove the problematic messages and prevent them from recurring, and then restarted the notification services. Client Assistance notified the impacted clients that the issue had been resolved and confirmed that all services had been restored.

What are we doing to prevent it from happening again?

The xMatters Engineering team is currently designing a permanent fix for the underlying defect (reference: BUG-11989). As soon as the solution has been developed and tested, it will be released as part of the regular release deployment schedule.

Timeline (all times PDT):

2018-04-09 - 07:33 AM - Client Assistance receives reports from some clients about notification processing

2018-04-09 - 07:45 AM - Internal major incident management processes initiated

2018-04-09 - 08:10 AM - Errors in company names identified

2018-04-09 - 08:18 AM - Changes made to the affected company names to prevent the issue from happening further

2018-04-09 - 08:30 AM - Residual errors seen, specific notifications cleaned up

2018-04-09 - 09:00 AM - Rolling service restarts performed; issue resolved

If you have any questions, please visit http://support.xmatters.com

Posted Apr 17, 2018 - 17:26 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Apr 09, 2018 - 09:28 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Apr 09, 2018 - 09:08 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Apr 09, 2018 - 08:28 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. Customers may experience notification delays. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Apr 09, 2018 - 08:16 PDT
Monitoring
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. Customers may experience notification delays. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Apr 09, 2018 - 08:14 PDT
This incident affected: North America (Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).