Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On October 3, 2018, at approximately 3:05 AM PDT, the xMatters monitoring systems alerted Client Assistance to a potential issue with one of the data centers located in North America. No customers reported any issues, though some users may have experienced a very brief interruption when attempting to access the On-Demand web user interface. No alerts or events were lost during this incident, and all notifications were delivered promptly.

Why did it happen?

This issue was caused by a connectivity problem with the Internet service provider for one of our North American data centers. The connection issue occurred beyond the xMatters environments, outside our firewalls.

How did we respond?

As soon as the monitoring tools raised the alert, Client Assistance began checking client environments for connection issues. The monitoring tools continued to show fluctuations in connectivity, although early checks showed that the client environments first reported as down recovered within one minute. Client Assistance initiated the major incident management process and engaged the Operations and Engineering teams to assist in identifying any possible issues. The incident response teams isolated the fluctuations as occurring beyond the xMatters firewalls and identified the root cause as an issue with the Internet provider for the data center. Within minutes of the initial alarm, the Internet connection stabilized, and the teams confirmed that all services were operating normally.

What are we doing to prevent it from happening again?

Although the xMatters monitoring tools indicated intermittent connectivity between 3:04 and 3:11 AM PDT, the Internet service provider could not confirm the issue, reporting that it had not received any reports of maintenance or outages on its network at that time. While it is difficult, if not impossible, to predict connection issues with Internet service providers, we are taking steps to address these types of problems through the hosting service improvements described here: https://support.xmatters.com/hc/en-us/articles/115005269506-Improving-our-hosting-services

The robustness of this new infrastructure should help avoid similar issues by reducing dependence on any individual service provider. In the short term, we will continue to work with our existing carrier to identify ways to prevent customer impact should a similar issue occur in the future.

Timeline

October 3, 2018 (all times PDT)

3:05 AM - xMatters monitoring tools alert Client Assistance to a potential issue with client environments being down

3:14 AM - Major incident management process initiated; incident response teams begin investigation

3:16 AM - Root cause identified as connectivity fluctuations that have since ceased; all customer environments reported up

3:22 AM - All services confirmed restored
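
For readers who want the elapsed time at each step, the following is a minimal sketch that works out the minutes between the first alert and each later timeline entry. The timestamps are copied from the timeline above; the names `TIMELINE` and `minutes_since_first_alert` are hypothetical and chosen only for illustration, and the entry labels are shortened.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all PDT, October 3, 2018).
# The labels are abbreviated versions of the timeline entries.
TIMELINE = [
    ("03:05", "Monitoring tools alert Client Assistance"),
    ("03:14", "Major incident management process initiated"),
    ("03:16", "Root cause identified"),
    ("03:22", "All services confirmed restored"),
]


def minutes_since_first_alert(timeline):
    """Return (elapsed_minutes, label) pairs relative to the first entry."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    result = []
    for clock, label in timeline:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        result.append((int(elapsed.total_seconds() // 60), label))
    return result


if __name__ == "__main__":
    for minutes, label in minutes_since_first_alert(TIMELINE):
        print(f"+{minutes:2d} min  {label}")
```

By this arithmetic, the major incident process began 9 minutes after the first alert, the root cause was identified at 11 minutes, and services were confirmed restored at 17 minutes.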

If you have any questions, please visit http://support.xmatters.com

Posted Oct 10, 2018 - 12:46 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Oct 03, 2018 - 03:29 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Oct 03, 2018 - 03:27 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Oct 03, 2018 - 03:23 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Oct 03, 2018 - 03:19 PDT
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).