Issue Discovered - Service disruption in North American Region – Integration Builder

Incident Report for xMatters

Postmortem

What happened?

On July 28, 2019, at approximately 5:09 AM PDT, the xMatters internal monitoring systems alerted Customer Support to an issue with potentially unresponsive integrations in North America. Shortly afterwards, some customers reported that they were noticing delays in notification delivery. The delays affected only notifications generated through the xMatters Integration Builder; manually entered notifications were processing normally.

Why did it happen?

This incident was caused by an error within the cloud infrastructure-as-a-service provider hosting the xMatters On-Demand service. The error caused instability within a process responsible for allocating Integration Builder resources to incoming event requests. During this brief period, the Integration Builder was accepting notification requests but not processing outbound notifications, which resulted in the delays experienced by some customers. Due to the distributed, redundant architecture of the On-Demand service, the issue was extremely localized, and only impacted service in a limited geographical region.

How did we respond?

As soon as the internal monitoring systems alerted to an issue with client instances, Customer Support confirmed the issue and launched a Severity-1 incident. The incident response teams immediately began investigating and identified the impacted process along with corresponding events reported by the cloud provider. They initiated a reset of the affected components and confirmed that the process was allocating resources as expected. The teams confirmed that all services had been restored and continued to monitor the system while gathering further data around the incident.

What are we doing to prevent it from happening again?

This issue was resolved as soon as the affected services were reset. The problem has not reoccurred, and the system continues to operate at optimum performance levels.
The xMatters Engineering teams have completed an investigation into the issue and have confirmed that there were no code changes or other updates to the affected service that could have led to this incident. While the teams have not identified any changes to make to the impacted service or supporting processes, they are designing and implementing additional monitoring metrics around the consumption of these resources to ensure that allocation does not fluctuate outside normal operating parameters. These improvements will allow the system to self-heal in the event of any similar infrastructure-related issues. Until these changes can be implemented, the teams have configured additional monitoring for the system that will allow the team responsible for the service to respond to and mitigate fluctuations before they impact any customers.

Timeline:

July 28, 2019 - All times are in PDT

05:06 AM Monitoring alerts Customer Support to slow notification processing
05:09 AM Severity-1 Incident called
05:16 AM Incident team gathers
05:25 AM Issue identified
05:29 AM Services restored
05:46 AM Incident is resolved

Posted Aug 02, 2019 - 12:21 PDT

Resolved

The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.

Posted Jul 28, 2019 - 05:47 PDT

Monitoring

The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.

Posted Jul 28, 2019 - 05:30 PDT

Identified

The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.

Posted Jul 28, 2019 - 05:17 PDT

Investigating

xMatters monitoring tools have identified a potential issue with the xMatters Integration Builder platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.

Posted Jul 28, 2019 - 05:13 PDT

This incident affected: North America (Integration Platform).