Issue Discovered - Service disruption - Integration Platform
Incident Report for xMatters
Postmortem

What happened?

On Friday, July 6, 2018, at approximately 11:45 AM PDT, some clients reported to xMatters Client Assistance that previously healthy integrations had stopped accepting events and were not sending notifications for specific communication plans and forms. The underlying issue was resolved later that day. The xMatters web user interface remained accessible throughout, and clients could still initiate events for affected integrations using other methods.

Why did it happen?

This issue occurred during a regularly scheduled update to the xMatters On-Demand service, which included a refactoring of the Integration Builder code to implement some additional inbound integration authentication features. The code changes introduced a bug that caused integrations using non-preemptive authentication to fail because the Integration Builder service was not returning the correct response information.
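
For readers unfamiliar with the term, non-preemptive authentication means the client sends its first request without credentials and only retries with them after the server answers with a 401 challenge. This report does not state which response information was incorrect, so the sketch below makes an assumption: it models a faulty endpoint that omits the WWW-Authenticate challenge header. All names, credentials, and status codes are hypothetical and are not taken from the Integration Builder.

```python
import base64

# Hypothetical credentials, for illustration only.
USER, PASSWORD = "svc-user", "s3cret"
EXPECTED = "Basic " + base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()


def handle(headers, send_challenge=True):
    """Toy inbound endpoint: returns (status, response_headers)."""
    if headers.get("Authorization") == EXPECTED:
        return 202, {}  # accepted (status code chosen for illustration)
    if send_challenge:
        # A well-formed 401 carries a WWW-Authenticate challenge.
        return 401, {"WWW-Authenticate": 'Basic realm="integration"'}
    return 401, {}  # assumed faulty variant: challenge header omitted


def non_preemptive_post(send_challenge=True):
    """Client that sends credentials only after being challenged."""
    status, resp_headers = handle({}, send_challenge)
    if status == 401 and "WWW-Authenticate" in resp_headers:
        status, _ = handle({"Authorization": EXPECTED}, send_challenge)
    return status


def preemptive_post(send_challenge=True):
    """Client that sends credentials on the very first request."""
    status, _ = handle({"Authorization": EXPECTED}, send_challenge)
    return status
```

Under this assumption, a preemptive client is unaffected by the faulty response, while a non-preemptive client never retries with credentials and its request is rejected. That pattern would be consistent with only some integrations failing.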

How did we respond?

As soon as the first client reported the issue to xMatters Client Assistance, the support engineer began troubleshooting and engaged the Engineering team to assist in the investigation. The support engineer suggested a workaround to temporarily mitigate the issue; once the client implemented the workaround, they confirmed the integration was again injecting events.

Other clients then reported a similar problem with their integrations, and the support engineer escalated the issue to Severity 1 to initiate the internal major incident management process. The incident response teams began investigating and posted a notice containing the workaround to the xMatters status page so that it was available to all customers.

The teams first isolated the issue to the authentication process, and then identified the underlying bug related to non-preemptive integration requests. The Engineering team developed and tested a fix, which was deployed for affected clients that evening. The Operations team redeployed the fix for some clients later that night, once the regularly scheduled maintenance update had finished for their systems. Once the fix was redeployed, all clients confirmed that the issue had been addressed, and services were restored.

What are we doing to prevent it from happening again?

The hotfix deployed on July 6, 2018, permanently addressed this issue for all xMatters On-Demand environments, and the issue should not recur. To help prevent the need to redeploy fixes for some clients after the initial release, the Engineering and Client Assistance teams are reviewing the internal deployment process to confirm that all environments are included when a hotfix is applied. Furthermore, a test for the specific regression linked to this issue has been added to the QA process to detect whether it occurs again in this scenario.
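
As an illustration of the kind of check such a deployment review might add, the sketch below compares the environments that should receive a hotfix against those that report having it. It is a minimal sketch and not the xMatters deployment tooling; the function and environment names are hypothetical.

```python
def missing_environments(all_envs, deployed_envs):
    """Return environments that should have the hotfix but do not, sorted."""
    return sorted(set(all_envs) - set(deployed_envs))


# Hypothetical environment names, for illustration only.
all_envs = ["na-prod-1", "emea-prod-1", "apac-prod-1"]
deployed = ["na-prod-1", "apac-prod-1"]

missing = missing_environments(all_envs, deployed)
if missing:
    print("Hotfix not applied to:", ", ".join(missing))
```

Running a comparison like this as a gate before declaring a hotfix complete would surface missed environments before a client has to report them.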

Timeline:

2018-07-06 11:45AM - Client reports issue to xMatters Client Assistance; support engineer begins investigating.

2018-07-06 12:46PM - Support engineer suggests a workaround and informs the client that the issue has been escalated to the Engineering team.

2018-07-06 01:23PM - Client confirms workaround is in place and integration is working.

2018-07-06 02:33PM - Other clients report the same issue; support engineer escalates issue to Severity 1.

2018-07-06 02:19PM - Client Assistance posts an update to xMatters status page: https://status.xmatters.com/incidents/8zqdpdk4r1t3

2018-07-06 02:31PM - Engineering team determines root cause, begins working on a fix.

2018-07-06 07:35PM - Operations team deploys fix.

2018-07-06 10:40PM - Another client reports issue still occurring.

2018-07-06 10:50PM - Operations team identifies missed environments, redeploys fix as necessary.

2018-07-06 10:54PM - Operations confirms all environments updated.

2018-07-06 11:04PM - All clients report issue is resolved.

If you have any questions, please visit http://support.xmatters.com

Posted Jul 12, 2018 - 15:56 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Jul 06, 2018 - 20:54 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Jul 06, 2018 - 19:35 PDT
Update
We are continuing to work on a fix for this issue.
Posted Jul 06, 2018 - 15:41 PDT
Identified
xMatters has identified an issue pertaining to Basic Authentication when using the Integration Builder with REST calls from the Integration Agent.

The current workaround is to use URL authentication.
See https://help.xmatters.com/ondemand/xmodwelcome/integrationbuilder/create-inbound-updates.htm.

Workaround:
Step 1 - Modify the inbound integration to use URL authentication as described in the guide above. For more detailed information, see https://help.xmatters.com/ondemand/xmodwelcome/integrationbuilder/generate-urls.htm

Step 2 - Update your WEB_SERVICE URL variable in /integrationservices//configuration.js with the new URL created in Step 1.

Step 3 - Restart the IA.
Posted Jul 06, 2018 - 14:31 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Jul 06, 2018 - 14:19 PDT
This incident affected: Europe, Middle East, and Africa (Integration Platform), Asia Pacific (Integration Platform), and North America (Integration Platform).