Issue Discovered - Service disruption in North American Region – Integration Platform
Incident Report for xMatters
Postmortem

What happened?

On September 9, 2019, at approximately 10:20 AM Pacific, xMatters monitoring alerted xMatters Customer Support that some customer integrations had become unresponsive and stopped processing events. Some customers may have seen integration logs showing a number of errors related to script failures, and some events may not have been properly processed.

Why did it happen?

This incident was caused by integration scripts containing code that was not fully compliant JavaScript. During regularly scheduled maintenance on the morning of September 9, xMatters released a new version of the Integration Builder service to enable a faster, more efficient scripting engine. This change required that the Integration Builder be updated to Java Development Kit (JDK) version 11. The new scripting engine, GraalJS, requires that all code be fully compliant JavaScript. The previous scripting engine, Nashorn, accepted some Java String methods that are not part of standard JavaScript. While the upgrade process did enable backwards compatibility with the previous version of the JDK, the compatibility features did not cover these non-compliant method calls. As a result, some integration scripts that included non-compliant code returned errors and failed to execute correctly.
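
As an illustration only, the sketch below shows the kind of difference described above. This report does not name the affected methods, so equalsIgnoreCase, contains, and isEmpty are assumed examples: they are Java String methods that standard JavaScript strings do not define. The helper failsInStandardJs is hypothetical and exists only for this demonstration.

```javascript
// Illustrative only: the affected methods are not named in this report.
// equalsIgnoreCase, contains, and isEmpty are java.lang.String methods
// that are not defined on standard JavaScript strings.

// Hypothetical helper: returns true when the call throws a TypeError,
// which is what a standard-compliant engine does for an undefined method.
function failsInStandardJs(fn) {
  try {
    fn();
    return false;
  } catch (e) {
    return e instanceof TypeError;
  }
}

var status = "Active";

// Java-style calls that a standard engine rejects.
var javaStyleFails = [
  failsInStandardJs(function () { return status.equalsIgnoreCase("ACTIVE"); }),
  failsInStandardJs(function () { return status.contains("Act"); }),
  failsInStandardJs(function () { return status.isEmpty(); })
];

// Standard JavaScript equivalents of the same checks.
var sameIgnoringCase = status.toLowerCase() === "ACTIVE".toLowerCase();
var containsAct = status.indexOf("Act") !== -1;
var isEmpty = status.length === 0;
```

In a standard engine, each Java-style call fails with a TypeError, while the standard equivalents run without error.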

How did we respond?

When the internal monitoring tools flagged errors in integration scripts, xMatters Customer Support began their investigation. As soon as customers reported issues with their integrations, Customer Support escalated the issue to a Severity-1 Incident and launched the internal major incident management process. They were able to quickly determine that the issue was related to the release of the JDK 11 upgrade, and initiated an immediate code rollback. Once the rollback was complete, the teams confirmed the issue was no longer occurring and that all services had been restored.

What are we doing to prevent it from happening again?

The xMatters Engineering teams have examined the errors from the Integration Builder logs, and isolated some differences between the two versions of the JDK that were causing the issue. Specifically, they were able to identify three Java String methods that the previous iteration of the scripting engine could process that were not being handled by GraalJS. The teams added handling to the Integration Builder service that will allow the new scripting engine to process the Java String methods without breaking script functionality. They tested the changes on internal systems and confirmed that while the Integration Builder logs will mark the errors for easy identification, the scripts will continue to execute without any developer or integrator intervention.
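
This report does not describe how the handling was implemented inside the Integration Builder. Purely as an assumption, a compatibility layer of this kind could behave like the sketch below: the missing method is defined, each use is recorded for later review, and the script keeps running. The names addLoggedShim and compatWarnings are hypothetical, and the two shimmed methods are assumed examples that only approximate the Java behavior.

```javascript
// Hypothetical compatibility shim; not the actual Integration Builder code.
// Records each use of a non-standard String method, then delegates to a
// standard JavaScript implementation so the script continues to execute.
var compatWarnings = [];

function addLoggedShim(name, impl) {
  if (typeof String.prototype[name] === "function") {
    return; // never override a method the engine already provides
  }
  Object.defineProperty(String.prototype, name, {
    configurable: true,
    writable: true,
    enumerable: false,
    value: function () {
      compatWarnings.push("Non-standard String method used: " + name);
      return impl.apply(String(this), arguments);
    }
  });
}

// Assumed examples; the report does not list the three affected methods.
addLoggedShim("equalsIgnoreCase", function (other) {
  return other != null && this.toLowerCase() === String(other).toLowerCase();
});
addLoggedShim("isEmpty", function () {
  return this.length === 0;
});
```

With the shim in place, a call such as "Active".equalsIgnoreCase("ACTIVE") returns a result instead of throwing, and compatWarnings holds one entry per call for later review.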

Customer Support posted a notice about the upcoming availability of the new scripting engine on the support site at https://support.xmatters.com/hc/en-us/articles/360033568811 and rescheduled the deployment of the JDK upgrade for Tuesday, September 17. In addition, they updated the xMatters Status page (status.xmatters.com) with a scheduled maintenance notice about the change.

While the Engineering teams are confident that even customers whose JavaScript is not fully compliant will not see any issues arise from the deployment of the JDK 11 update, they were only able to target known errors. It is possible that integration scripts containing non-compliant code for which the Engineering team has not added handling may result in a similar error. We highly recommend that all customers using custom Integration Builder scripts review their integrations and ensure they are using only fully compliant, standard JavaScript code.
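
One possible way to start the review recommended above is to search script source for calls that look like Java String methods. The sketch below is a rough aid, not an xMatters tool. The method list is an assumption (the report does not name the affected methods), and findSuspectCalls is a hypothetical helper. Because it matches on text alone, it cannot tell whether the receiver is really a string, so every finding needs a manual check.

```javascript
// Assumed list of java.lang.String methods that standard JavaScript
// strings do not define; extend it to suit your own scripts.
var JAVA_ONLY_STRING_METHODS = [
  "equalsIgnoreCase", "contains", "isEmpty", "compareTo", "getBytes"
];

// Hypothetical helper: returns the line number and method name of each
// call in the script source that matches one of the listed methods.
function findSuspectCalls(source) {
  var pattern = new RegExp(
    "\\.(" + JAVA_ONLY_STRING_METHODS.join("|") + ")\\s*\\(", "g"
  );
  var findings = [];
  source.split("\n").forEach(function (line, index) {
    var match;
    pattern.lastIndex = 0; // restart matching for every line
    while ((match = pattern.exec(line)) !== null) {
      findings.push({ line: index + 1, method: match[1] });
    }
  });
  return findings;
}

// Example: scan a three-line script.
var sampleScript = [
  "var empty = name.isEmpty();",
  "var alsoEmpty = name.length === 0;",
  "if (a.equalsIgnoreCase(b)) { notify(); }"
].join("\n");
var sampleFindings = findSuspectCalls(sampleScript);
```

For the sample script, the scan reports isEmpty on line 1 and equalsIgnoreCase on line 3. The standard check on line 2 is not flagged.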

Timeline:

September 9, 2019

10:00 AM xMatters deploys new version of the Integration Builder service (JDK 11 upgrade)
10:20 AM Internal monitoring flags integration errors along with customer reports of integration errors
10:30 AM Customer Support launches Severity-1 Incident
10:35 AM Issue discovered - rollback to previous version initiated
10:45 AM Verification of resolution and return to normal operation
10:55 AM SEV-1 Issue closed

Posted Sep 17, 2019 - 08:50 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Sep 09, 2019 - 10:53 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Sep 09, 2019 - 10:47 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Sep 09, 2019 - 10:43 PDT
Investigating
xMatters monitoring tools have identified a potential issue with the xMatters Integration Platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Sep 09, 2019 - 10:38 PDT
This incident affected: North America (Integration Platform).