Issue Discovered - Service disruption in North America
Incident Report for xMatters
Postmortem

What happened?

On Tuesday, June 27, 2017, at approximately 1:10 PM PDT, the xMatters network monitoring systems alerted the Operations team to a disruption of the xMatters On-Demand services in one of the data centers located in North America. Some users may have experienced intermittent access to the user interface, and delays or rejections when injecting events into xMatters.

Why did it happen?

This issue was caused by a hardware failure in our data center provider's network equipment, which made services unavailable for a brief period (less than ten minutes).

How did we respond?

As soon as the xMatters network monitoring tools detected unreliable connectivity and notified Client Assistance and Operations, the teams initiated the internal Major Incident Management process and posted a bulletin to the xMatters status page. The incident response teams simultaneously began investigating the underlying cause and working to restore services for clients. They quickly determined that the issue was caused by a network problem within one of the North American data centers. The Operations team immediately opened a Severity 1 ticket with the data center provider and began the process of promoting services to an alternate data center. However, while the automated failover process was being initiated, the vendor confirmed that the issue had been resolved. The Operations team continued to monitor the situation and decided to hold the promotion of services to the alternate data center because all services had been restored and were reported as stable.

What are we doing to prevent it from happening again?

This disruption was caused by an unexpected network event that affected the entire hosting data center. The data center provider is currently conducting an internal investigation and is providing more information as it is discovered. The provider is also continuing its internal processes and working with its network vendors to identify any potential remediation actions, including replacing the impacted hardware. While these kinds of issues are difficult to predict and prevent, the xMatters teams continually review the failover processes to identify potential areas of improvement and ways to reduce the time required to get clients back online.

Timeline (all times PDT):

2017-06-27 01:10PM - xMatters monitoring detects a networking issue in one of the data centers located in North America

2017-06-27 01:13PM - Teams initiate the internal Major Incident Management process

2017-06-27 01:15PM - Client Assistance posts a support bulletin: http://status.xmatters.com/incidents/npf13g87cl2t

2017-06-27 01:17PM - Operations team confirms the issue appears to be resolved

2017-06-27 01:20PM - Data center provider confirms the hardware failure and that the issue has been resolved

2017-06-27 01:20PM - All services are restored

If you have any questions, please visit http://support.xmatters.com

Posted Jun 28, 2017 - 15:24 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Jun 27, 2017 - 13:35 PDT
Monitoring
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Posted Jun 27, 2017 - 13:25 PDT
Investigating
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available.
Posted Jun 27, 2017 - 13:15 PDT
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).