Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

Beginning on Tuesday, October 9, 2018 at approximately 7:40 PM PDT, the xMatters monitoring systems alerted Client Assistance to a potential issue with xMatters On-Demand services for clients in the North America region. During the incident, some customers may have experienced delays of up to 15 minutes in notification delivery. No notifications were lost during this period, and event injection and responses continued processing as normal.

Why did it happen?

This issue was caused by a previously unidentified defect within a service responsible for handling notification processing, triggered by an unusually high volume of notification requests.

How did we respond?

As soon as the automated monitoring tools alerted xMatters Client Assistance to a possible delay in notification delivery, the teams began working to reproduce the problem and determine its cause. Once the issue was confirmed, xMatters Client Assistance escalated it to Severity 1 and initiated the internal major incident management process. The incident response teams worked to isolate the issue and quickly identified a problem with the notification service. The team discovered that some back-end services were automatically recovering from a failure, and restarted one of the affected components to speed up the recovery. This appeared to resolve the issue, and all services were restored. The teams concluded the major incident process while continuing to monitor the situation, and were subsequently able to identify the root cause as a defect in the notification service.

What are we doing to prevent it from happening again?

To prevent this issue from recurring, the Engineering team will upgrade to a newer version of the affected back-end service that contains a fix for the defect that caused the delay in processing notifications. This new version is currently in development and will be deployed as soon as testing and validation have been completed. To ensure that the issue does not recur before the team can deploy the fix, the Engineering and Operations teams have implemented a rate limit on the affected service so that it will not experience the unusually high volume of requests that triggered the defect.
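
This report does not describe how the rate limit is implemented. As an illustration only, a token bucket is one common way to cap request volume: the sketch below is a hypothetical Python example, and the class name `TokenBucket`, its parameters, and the chosen rates are all assumptions rather than details of the xMatters service.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket limiter (an assumption, not the actual xMatters code)."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate_per_sec <= 0 or capacity <= 0:
            raise ValueError("rate_per_sec and capacity must be positive")
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = capacity      # maximum burst size
        self._clock = clock           # injectable clock, useful for testing
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        """Return True if one request may proceed now, False if it should wait."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # Assumed values: sustain 2 requests per second, with bursts of up to 3.
    bucket = TokenBucket(rate_per_sec=2.0, capacity=3)
    for i in range(5):
        print(i, "allowed" if bucket.allow() else "deferred")
```

In a design like this, a request that is not allowed would typically be queued and retried rather than dropped, which would be consistent with the report's statement that notifications were delayed but not lost.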

Timeline:

October 9, 2018 - 7:40 PM xMatters internal monitoring alerts Operations to issue in North America region
7:47 PM Client Assistance begins testing notification delivery
8:00 PM Client Assistance escalates issue to Severity 1; incident response teams begin investigation
8:30 PM Notification service restarted.
8:40 PM Client Assistance confirms notifications are being processed; all services restored.

If you have any questions, please visit http://support.xmatters.com

Posted Oct 17, 2018 - 09:31 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Oct 09, 2018 - 20:52 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Oct 09, 2018 - 20:44 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Oct 09, 2018 - 20:30 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Oct 09, 2018 - 20:21 PDT
This incident affected: North America (Email Notifications, SMS Notifications, Voice Notifications, Mobile Push Notifications).