Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On February 4, 2019, at approximately 11:35 PM PST, the xMatters monitoring tools alerted Client Assistance to an issue affecting service in the North American region. During the incident, some clients may have experienced brief interruptions or delays, including 503 errors, when attempting to access the xMatters web user interface. Although the underlying issue took approximately 12 hours to resolve completely, clients experienced a total of 12 minutes of actual impact. This impact was spread across the incident in short intervals while the underlying issues were being resolved.

Why did it happen?

The issue was related to an unexpected increase in connection pool usage within the xMatters platform, which caused the web user interface and some API services to reach capacity and auto-heal multiple times. Increased query times on some databases created back pressure on user-facing services; this decrease in performance caused connection pools to reach capacity.
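The relationship between query time and pool usage described above can be illustrated with Little's law. This is a minimal sketch, not a description of the xMatters platform: the function names, the request rate, the query times, and the pool size are all hypothetical values chosen for illustration.

```python
def connections_in_use(requests_per_second: float, query_ms: float) -> float:
    """Little's law: average busy connections = arrival rate x time each one is held."""
    return requests_per_second * query_ms / 1000


def pool_is_saturated(requests_per_second: float, query_ms: float, pool_size: int) -> bool:
    """True when steady-state demand meets or exceeds the pool's capacity."""
    return connections_in_use(requests_per_second, query_ms) >= pool_size


# Hypothetical numbers: 200 requests/s against a pool of 50 connections.
normal_demand = connections_in_use(200, 50)   # queries take 50 ms
slow_demand = connections_in_use(200, 500)    # queries slow down to 500 ms

print(normal_demand, pool_is_saturated(200, 50, 50))
print(slow_demand, pool_is_saturated(200, 500, 50))
```

With the request rate unchanged, a tenfold increase in query time raises the number of connections held at once tenfold, which is how slower database queries alone can push a pool to capacity.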

How did we respond?

As soon as the issue was detected, the Client Assistance team initiated the internal Major Incident Management process and launched an investigation. The incident response team quickly identified an issue impacting the web user interface and declared a Severity 1 incident while engaging additional subject matter experts. Their first priority was to mitigate any client impact, and then to identify a root cause and build a solution. When the issue recurred on February 5 at approximately 7:24 AM PST, the teams were able to immediately isolate the affected components and problematic services and perform remediation. They confirmed that this had mitigated the problem and that all services had been restored.

What are we doing to prevent it from happening again?

To prevent this issue from recurring, we are adding monitoring that will allow us to detect these types of incidents much earlier and to automatically trigger additional self-healing processes for affected services. In addition, we have conducted a thorough post-mortem and identified multiple areas where system resiliency can be improved.
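As an illustration of the kind of early detection described above, the following sketch raises an alert when connection pool utilization stays high for several consecutive samples. It is an assumption-laden example, not the monitoring xMatters deployed: the function name, the 0.8 threshold, and the three-sample window are hypothetical.

```python
from typing import Iterable


def should_alert(
    utilization_samples: Iterable[float],
    threshold: float = 0.8,   # hypothetical: alert at 80% of pool capacity
    consecutive: int = 3,     # hypothetical: require 3 high samples in a row
) -> bool:
    """Return True once `consecutive` samples in a row are at or above `threshold`."""
    streak = 0
    for sample in utilization_samples:
        # A sample below the threshold resets the streak, filtering out brief spikes.
        streak = streak + 1 if sample >= threshold else 0
        if streak >= consecutive:
            return True
    return False


print(should_alert([0.5, 0.85, 0.9, 0.95]))  # sustained high usage
print(should_alert([0.9, 0.5, 0.9, 0.5]))    # isolated spikes only
```

Requiring a run of high samples rather than a single one trades a little detection latency for fewer false alarms; the right window depends on the sampling interval.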

Timeline:

February 04, 2019 - 23:35 - Monitoring tools alert Client Assistance to an issue in the North American region. Brief interruptions are detected.

February 04, 2019 - 23:40 - xMatters Client Assistance initiates major incident management process, launches investigation.

February 05, 2019 - 00:08 - Interruptions are no longer occurring.

February 05, 2019 - 07:24 - Second service impact begins (brief interruptions continue until 11:31).

February 05, 2019 - 08:04 - Status page updated to investigating: https://status.xmatters.com/incidents/t86p7lvvdn5g

February 05, 2019 - 08:05 - Status page updated to identified, incident team works to determine root cause and resolution options.

February 05, 2019 - 11:31 - Resolution initiated, status page set to monitoring.

February 05, 2019 - 12:42 - Production environment is determined to be stable, no further impact detected. Status page updated to resolved.

If you have any questions, please visit http://support.xmatters.com

Posted Feb 13, 2019 - 14:10 PST

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Feb 05, 2019 - 10:23 PST
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Feb 05, 2019 - 10:07 PST
Update
We are continuing to work on resolving this issue. The majority of clients should be able to access the web interface; however, they may temporarily experience access issues. We will provide another update in 30 minutes.
Posted Feb 05, 2019 - 09:39 PST
Update
The xMatters Incident Response team has identified the source of the issue and is still working on a fix. We will update once a solution has been identified and implemented.
Posted Feb 05, 2019 - 09:07 PST
Update
The xMatters Incident Response team has identified the source of the issue and is still working on a fix. We will update once a solution has been identified and implemented.
Posted Feb 05, 2019 - 08:36 PST
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Feb 05, 2019 - 08:05 PST
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America; accessing the system may result in a temporary error. We are currently investigating the issue and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Feb 05, 2019 - 08:04 PST
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).