What happened?
On Friday, July 6, 2018 at approximately 11:45AM PST, some clients reported to xMatters Client Assistance that previously healthy integrations had stopped accepting events and were not sending notifications for specific communication plans and forms. The xMatters web user interface remained accessible throughout the incident, and clients were still able to initiate events for affected integrations using other methods. The underlying issue was resolved later the same day.
Why did it happen?
This issue occurred during a regularly scheduled update to the xMatters On-Demand service, which included a refactoring of the Integration Builder code to implement some additional inbound integration authentication features. The code changes introduced a bug that caused integrations using non-preemptive authentication to fail because the Integration Builder service was not returning the correct response information.
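For readers unfamiliar with the term, the difference between preemptive and non-preemptive authentication can be sketched as follows. This is an illustration only, not the Integration Builder code. It assumes HTTP Basic authentication, and it assumes that the incorrect response information was a missing authentication challenge, which this report does not confirm. The credentials, function names, and status codes are hypothetical.

```python
import base64

# Hypothetical credentials, used only for this illustration.
VALID_USER, VALID_PASSWORD = "svc-user", "s3cret"


def basic_header(user, password):
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def handle_request(headers, send_challenge=True):
    """Toy inbound endpoint; returns (status, response_headers)."""
    if headers.get("Authorization") == basic_header(VALID_USER, VALID_PASSWORD):
        return 202, {}  # event accepted
    if send_challenge:
        # Healthy behavior: reject, but tell the client how to authenticate.
        return 401, {"WWW-Authenticate": 'Basic realm="integration"'}
    # Assumed failure mode: reject without a challenge header.
    return 401, {}


def non_preemptive_post(endpoint, user, password):
    """Send without credentials first; retry only if challenged."""
    status, resp_headers = endpoint({})
    if status == 401 and resp_headers.get("WWW-Authenticate", "").startswith("Basic"):
        status, _ = endpoint({"Authorization": basic_header(user, password)})
    return status


def preemptive_post(endpoint, user, password):
    """Send credentials on the very first request."""
    status, _ = endpoint({"Authorization": basic_header(user, password)})
    return status


def healthy(headers):
    return handle_request(headers, send_challenge=True)


def broken(headers):
    return handle_request(headers, send_challenge=False)


print(non_preemptive_post(healthy, VALID_USER, VALID_PASSWORD))
print(non_preemptive_post(broken, VALID_USER, VALID_PASSWORD))
print(preemptive_post(broken, VALID_USER, VALID_PASSWORD))
```

Under these assumptions, a non-preemptive client depends entirely on the content of the first response, so a change to that response can break it while clients that send credentials up front are unaffected.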
How did we respond?
As soon as the first client reported the issue to xMatters Client Assistance, the support engineer began troubleshooting and engaged the Engineering team to assist in the investigation. The support engineer suggested a workaround to temporarily mitigate the issue; once the client implemented the workaround, they confirmed the integration was again injecting events. Other clients then reported a similar problem with their integrations, and the support engineer escalated the issue to Severity 1 to initiate the internal major incident management process. The incident response teams began investigating and posted a notice containing the workaround to the xMatters status page to ensure it was available to all customers. The teams first isolated the issue to the authentication process, and then identified the underlying bug related to non-preemptive integration requests. The Engineering team began working on a fix, which was developed, tested, and deployed for affected clients later that day. The Operations team redeployed the fix for some clients later in the evening, once the regularly scheduled maintenance update had finished for their systems. Once the fix was redeployed, all clients confirmed that the issue had been addressed and services were restored.
What are we doing to prevent it from happening again?
The hotfix deployed on July 6, 2018 permanently addressed this issue for all xMatters On-Demand environments, and the issue should not recur. To help prevent the need to redeploy fixes for some clients after the initial release, the Engineering and Client Assistance teams are reviewing the internal deployment process to confirm that all environments are included when a hotfix is applied. Furthermore, a check for the specific regression linked to this issue has been added to the QA process so that it is detected if it occurs again in this scenario.
Timeline:
2018-07-06 11:45AM - Client reports issue to xMatters Client Assistance; support engineer begins investigating.
2018-07-06 12:46PM - Support engineer suggests a workaround, informs client issue has been escalated to Engineering team.
2018-07-06 01:23PM - Client confirms workaround is in place and integration is working.
2018-07-06 02:33PM - Other clients report the same issue; support engineer escalates issue to Severity 1.
2018-07-06 02:19PM - Client Assistance posts an update to xMatters status page: https://status.xmatters.com/incidents/8zqdpdk4r1t3
2018-07-06 02:31PM - Engineering team determines root cause, begins working on a fix.
2018-07-06 07:35PM - Operations team deploys fix.
2018-07-06 10:40PM - Another client reports issue still occurring.
2018-07-06 10:50PM - Operations team identifies missed environments, redeploys fix as necessary.
2018-07-06 10:54PM - Operations confirms all environments updated.
2018-07-06 11:04PM - All clients report issue is resolved.
If you have any questions, please visit http://support.xmatters.com