Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On April 6, 2019, at approximately 4:37 AM PDT, the xMatters monitoring systems alerted the Engineering teams to a service disruption with On-Demand services within the North American region. Users may have experienced intermittent access to the user interface, as well as delays or rejections when injecting events into xMatters.

Why did it happen?

This issue was caused by a monitoring service. While buffering metrics for reporting, the service consumed an excessive amount of memory, which caused some database queries to fail.
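
For readers unfamiliar with this failure mode, the following is a minimal Python sketch of one general way to keep a metrics buffer from growing without limit. It is an illustration only: it is not the code of the affected monitoring service, whose internals are not described in this report, and the class name `BoundedMetricBuffer` and the `max_items` parameter are hypothetical.

```python
from collections import deque


class BoundedMetricBuffer:
    """Hypothetical in-memory metrics buffer with a fixed upper size."""

    def __init__(self, max_items):
        # A deque created with maxlen discards its oldest entry when a new
        # one is appended to a full buffer, so memory use stays bounded.
        self._items = deque(maxlen=max_items)
        self.dropped = 0  # number of metrics evicted to stay within the cap

    def add(self, metric):
        # Count the eviction that the next append is about to cause.
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(metric)

    def flush(self):
        """Return the buffered metrics (oldest first) and empty the buffer."""
        batch = list(self._items)
        self._items.clear()
        return batch

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    buf = BoundedMetricBuffer(max_items=3)
    for value in range(5):
        buf.add(value)
    print(len(buf), buf.dropped, buf.flush())
```

The trade-off in a design like this is that the oldest metrics are lost under load in exchange for a predictable memory ceiling; the `dropped` counter makes that loss visible so it can be alerted on.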

How did we respond?

As soon as the xMatters network monitoring tools detected unreliable connectivity in the xMatters system, the Client Assistance team launched the internal severity-1 investigation process, which was later upgraded to a major incident, and posted a notice to the xMatters status page. The incident response teams simultaneously investigated the underlying cause and worked to restore services for clients. The teams determined that the fastest way to restore service with the least impact to clients was to perform a manual database failover to a system that was not experiencing resource exhaustion. Once the failover was complete, clients confirmed that all services were restored and functioning as expected.

What are we doing to prevent it from happening again?

To help prevent similar incidents in the future, the xMatters Engineering teams are investigating a potential way to improve our current method of resource monitoring. Any knowledge or information they identify will be added to the relevant playbooks to ensure that it becomes a consistent part of our standard processes. In addition, Engineering teams are working with the service vendor to review the issue and determine what additional actions can be taken to ensure the issue does not recur.

Timeline:

April 6, 2019 4:37 AM - First notification of potential issue with On-Demand services. No client impact at this time.

4:47 AM - Investigation begins.

5:37 AM - Severity-1 process launched. Issue becomes client-impacting.

6:20 AM - Cause is identified. Manual database failover performed.

6:30 AM - Monitoring service responsible is disabled.

6:33 AM - Client impact is mitigated. Teams continue to monitor.

6:37 AM - Confirmation of system recovery.

6:47 AM - All services restored.

If you have any questions, please visit http://support.xmatters.com

Posted Apr 12, 2019 - 16:04 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Apr 06, 2019 - 07:09 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Apr 06, 2019 - 06:46 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. Customers may receive an error when trying to access the system. The error is intermittent.

We will update once a solution has been identified and implemented.
Posted Apr 06, 2019 - 06:17 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Apr 06, 2019 - 06:03 PDT
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).