On July 11, 2019, at approximately 2:18 PM PDT, xMatters monitoring alerted Customer Support to possible delays in notification delivery within the North America region. While the issue was being addressed, some users may have experienced delays when receiving SMS and voice notifications, or when injecting events for some integrations. Although the system continued to process events and notifications, some events were delayed or did not complete properly and required manual termination.
This issue occurred when a backend queuing service stopped unexpectedly. The stoppage produced a large backlog of events waiting to be processed, which in turn prevented the services responsible for event processing from connecting to the queuing process. The issue did not impact all of the available queues, but the remaining queues took longer to process events.
As soon as the monitoring tools alerted to an issue with notification delivery, Customer Support began troubleshooting, connected with subject matter experts to assist, and created a Severity-1 incident. The incident response teams discovered an issue with a backend service responsible for delivering notifications within the North American region. Once the teams identified the impacted services, they began a rolling restart of the event processing services. When the restart had no effect, the teams began a rolling restart of the queuing process. This restart had the desired effect, and the system began processing events and clearing the backlog. Once the backlog had cleared, the teams confirmed that all services had been restored.
To prevent this issue from recurring, we have identified enhancements to the queuing process. These enhancements include adding capacity to build further redundancy.
The net effect of these updates will be more queue-processing capacity in the event of another single-queue incident. These changes will be implemented as soon as development and testing procedures are complete.