Issue Discovered - Service disruption in North America with the Integration Platform
Incident Report for xMatters
Postmortem

What happened?

On Thursday, February 22, 2018, at approximately 9:50 AM PST, the xMatters monitoring systems alerted the Client Assistance team to an issue with the xMatters On-Demand service for some clients located in North America. Some users may have experienced delays in notification delivery after injecting an event into xMatters via the Integration Builder.

Why did it happen?

This issue was caused by a previously unknown defect within an Integration Builder service that prevented the service from automatically reconnecting after a related component in the data center was restarted. Without the connection, the service failed to process new events and left them in a backlog.
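
For illustration only, the failure mode described above (a service that does not reconnect after a dependency restarts, leaving events queued) is commonly guarded against with a retry loop that uses exponential backoff. The following is a minimal sketch of that general technique, not the xMatters implementation or the fix tracked for this incident. `FlakyBroker`, `connect_with_backoff`, and `drain_backlog` are hypothetical names invented for this example, and the retry limits and delays are arbitrary assumptions.

```python
import time
from collections import deque


class FlakyBroker:
    """Hypothetical stand-in for a component that is still restarting."""

    def __init__(self, failures_before_up):
        self.failures_before_up = failures_before_up
        self.delivered = []

    def connect(self):
        # Refuse connections until the simulated restart has finished.
        if self.failures_before_up > 0:
            self.failures_before_up -= 1
            raise ConnectionError("component still restarting")
        return self

    def send(self, event):
        self.delivered.append(event)


def connect_with_backoff(broker, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Retry connect(), doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return broker.connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up so the failure is visible to monitoring
            sleep(base_delay * 2 ** attempt)


def drain_backlog(broker, backlog):
    """Reconnect, then deliver queued events in arrival order."""
    conn = connect_with_backoff(broker)
    while backlog:
        conn.send(backlog[0])  # send first, so a failed send keeps the event queued
        backlog.popleft()
    return conn.delivered


if __name__ == "__main__":
    queued = deque(["event-1", "event-2", "event-3"])
    print(drain_backlog(FlakyBroker(failures_before_up=2), queued))
```

In this sketch an event is removed from the backlog only after it has been sent, so a connection that drops partway through leaves the remaining events queued for the next attempt.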

How did we respond?

After identifying an issue with Integration Builder event creation, the xMatters Client Assistance and Operations teams initiated the internal Severity-1 process. The incident response teams simultaneously investigated the underlying cause and worked to restore services for clients. The team quickly identified a failed back-end service that was causing some clients to experience notification delays when injecting events into xMatters via the Integration Builder, isolated the component responsible, and implemented a solution. Once the solution was in place, notification delivery returned to normal thresholds, and clients confirmed that all services had been restored.

What are we doing to prevent it from happening again?

The xMatters Engineering team has identified and isolated the defect within the Integration Builder and is currently developing and implementing a permanent solution. (BUG-11673 - In Progress)

Description

2018-02-22 09:52 - xMatters monitoring tools alert the Client Assistance team to an issue with On-Demand services in the North American region

2018-02-22 10:40 - Client reports an issue with injecting events via the Integration Builder

2018-02-22 11:05 - Internal Severity-1 process initiated

2018-02-22 11:08 - Bulletin posted to xMatters status page: http://status.xmatters.com/incidents/ytyrn1gk1pb3

2018-02-22 11:20 - Issue is isolated to a specific data center Integration Builder component

2018-02-22 11:33 - Solution is implemented

2018-02-22 11:35 - Services are restored

If you have any questions, please visit http://support.xmatters.com

Posted Mar 01, 2018 - 10:56 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Feb 22, 2018 - 11:35 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Feb 22, 2018 - 11:33 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Feb 22, 2018 - 11:20 PST
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. Some clients may be experiencing delays in events being injected into xMatters through the Integration Platform. We are currently investigating the issue, and will update as information becomes available.
Posted Feb 22, 2018 - 11:08 PST
This incident affected: North America (Integration Platform).