On July 28, 2019, at approximately 5:09 AM PDT, the xMatters internal monitoring systems alerted Customer Support to an issue with potentially unresponsive integrations in North America. Shortly afterwards, some customers reported delays in notification delivery. The delays affected only notifications generated through the xMatters Integration Builder; manually entered notifications were processed normally.
This incident was caused by an error within the cloud infrastructure-as-a-service provider hosting the xMatters On-Demand service. The error caused instability within a process responsible for allocating Integration Builder resources to incoming event requests. During this brief period, the Integration Builder was accepting notification requests but not processing outbound notifications, which resulted in the delays experienced by some customers. Due to the distributed, redundant architecture of the On-Demand service, the issue was extremely localized and affected service only in a limited geographical region.
As soon as the internal monitoring systems alerted to an issue with client instances, Customer Support confirmed the issue and launched a Severity-1 incident. The incident response teams immediately began investigating and identified the impacted process along with corresponding events reported by the cloud provider. They initiated a reset of the affected components and confirmed that the process was allocating resources as expected. The teams confirmed that all services had been restored and continued to monitor the system while gathering further data around the incident.
This issue was resolved as soon as the affected services were reset. The problem has not recurred, and the system continues to operate at optimum performance levels.
The xMatters Engineering teams have completed an investigation into the issue and have confirmed that there were no code changes or other updates to the affected service that could have led to this incident. Although the investigation identified no changes to make to the impacted service or its supporting processes, the teams are designing and implementing additional monitoring metrics around the consumption of these resources to ensure that allocation does not fluctuate outside normal operating parameters. These improvements will allow the system to self-heal in the event of any similar infrastructure-related issues. Until these changes can be implemented, the teams have configured additional monitoring that will allow the team responsible for the service to respond to and mitigate fluctuations before they impact any customers.
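As an illustration only, the idea of checking that resource allocation stays within normal operating parameters can be sketched as follows. This is not the xMatters implementation: the names `AllocationBand` and `first_sustained_breach`, the sample values, and the thresholds are all hypothetical, and the sketch assumes allocation is sampled as a simple numeric series.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AllocationBand:
    """Hypothetical normal operating range for an allocation metric."""
    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def first_sustained_breach(
    samples: Sequence[float],
    band: AllocationBand,
    min_consecutive: int,
) -> Optional[int]:
    """Return the index of the sample that completes the first run of
    `min_consecutive` out-of-band readings, or None if no such run exists.

    Requiring several consecutive readings avoids alerting on a single
    transient spike or dip.
    """
    streak = 0
    for index, value in enumerate(samples):
        if band.contains(value):
            streak = 0  # back in range, so the run is broken
        else:
            streak += 1
            if streak >= min_consecutive:
                return index
    return None


# Hypothetical readings: one isolated dip, then a sustained drop.
readings = [40.0, 42.0, 5.0, 41.0, 3.0, 2.0, 1.0]
band = AllocationBand(lower=20.0, upper=80.0)
breach_index = first_sustained_breach(readings, band, min_consecutive=3)
```

In a sketch like this, a non-`None` result would be the point at which an alert is raised or an automated reset is triggered; the choice of band and run length is a tuning decision that depends on how the metric normally behaves.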
July 29, 2019 - All times are in PDT
05:06 AM Monitoring alerts Customer Support of slow notification processing
05:09 AM Severity-1 Incident called
05:16 AM Incident team gathers
05:25 AM Issue identified
05:29 AM Services restored
05:46 AM Incident is resolved