North America Service Disruption - Notification Delivery
Incident Report for xMatters
Postmortem

What happened?

On July 11, 2019, at approximately 2:18 PM PDT, xMatters monitoring alerted Customer Support to possible delays in notification delivery within the North America region. While the issue was being addressed, some users may have experienced delays when receiving SMS and voice notifications, or when injecting events for some integrations. The system continued to process events and notifications, but some events were delayed or did not complete properly and required manual termination.

Why did it happen?

This issue occurred when a backend queuing service stopped unexpectedly. The stoppage created a large backlog of events waiting to be processed, which in turn prevented the services responsible for event processing from connecting to the queuing process. The issue did not impact all of the available queues, but the remaining queues took longer to process events.

How did we respond?

As soon as the monitoring tools alerted to an issue with notification delivery, Customer Support began troubleshooting, connected with subject matter experts to assist, and created a Severity-1 incident. The incident response teams discovered an issue with a backend service responsible for delivering notifications within the North American region. Once the teams identified the impacted services, they began a rolling restart of the event processing services. When the restart had no effect, the teams began a rolling restart of the queuing process. This restart had the desired effect, and the system began processing events and clearing the backlog. Once the backlog had cleared, the teams confirmed that all services had been restored.

What are we doing to prevent it from happening again?

To prevent this issue from recurring, we have identified the following enhancements to the queuing process. These enhancements add capacity and build further redundancy:

  • Increase queue process cluster size
  • Allocate additional memory and processing resources to the queue node

These updates will add capacity for queue processing in the event of another single-queue incident. The changes will be implemented as soon as development and testing procedures are complete.

Timeline:

  • July 11, 2019 - 2:11 PM PDT - Monitoring alerts Customer Support to queuing delays
  • 2:18 PM - Severity-1 Incident called
  • 2:30 PM - Restarted event processing nodes
  • 2:43 PM - Restarted queuing process
  • 2:50 PM - Backlog begins to clear
  • 3:17 PM - Services begin to restore
  • 3:29 PM - Second reset of queue process
  • 3:41 PM - Queuing restart complete
  • 3:45 PM - Incident team monitors queues for performance issues
  • 4:28 PM - All services restored; incident closed
Posted Jul 23, 2019 - 13:52 PDT

Resolved
On Thursday, July 11, 2019, at approximately 2:15 PM PDT, the xMatters monitoring tools detected an issue with a backend service responsible for delivering notifications within the North American region. Some xMatters clients may have experienced a rejection or delay in notification delivery during this time. The issue was identified and rectified by 3:30 PM PDT, and all queued notifications were processed and delivered. In some cases, the incident may have left events in a non-terminated state. If you notice events that have not terminated properly, you can terminate them manually. We will provide a full root cause analysis once we have concluded the incident investigation.
Posted Jul 11, 2019 - 14:15 PDT