What happened?
On July 5th, at approximately 8:35AM Pacific, the xMatters monitoring tools alerted Customer Support to an issue where alert notifications were not being sent out for some customers in the North America region. Some customers attempting to initiate alerts may have encountered long delays in processing or may have had requests time out.
Why did it happen?
This issue occurred due to a sudden spike in the resources required by our backend services. The resulting memory overload caused some request handlers to time out before they could properly process incoming alerts.
How did we respond?
As soon as the xMatters Customer Support team confirmed the issue from the monitoring tools, they initiated the internal major incident management process and engaged the xMatters Engineering teams. To mitigate the issue and restore service quickly, the incident response teams performed a rolling restart of the affected services. As soon as the restart was completed, the system resumed processing alerts and all services were restored.
What are we doing to prevent it from happening again?
The Engineering teams have implemented a performance enhancement for backend service queries. In addition, the teams are evaluating and testing additional methods to help mitigate resource spikes and prevent them from impacting alert notifications in the future. Once development and testing are complete, we'll deploy these changes with our regularly scheduled maintenance.
Timeline:
July 5th, 2024
8:35AM PT - xMatters internal monitoring tools alert to potential issue.
9:22AM PT - Issue identified.
9:36AM PT - Rolling restart initiated.
10:16AM PT - Issue resolved.