Issue Discovered - Service disruption in Asia Pacific Region - Multiple Services
Incident Report for xMatters
Postmortem

What happened?

On July 25, 2019 at approximately 10:04 PDT, the xMatters internal monitoring systems alerted Customer Support to an issue that caused a "Server Internal Error" message to display when accessing an xMatters instance. Clients in the Asia Pacific region may also have seen this message when attempting to access their instances, or experienced difficulty accessing xMatters services. All notifications continued to process as expected during the incident.

Why did it happen?

The incident occurred during scheduled database maintenance. During the upgrade activity, standby databases are upgraded and then promoted to master. This process typically takes seconds to complete and does not impact customers. During this upgrade, there was an unexpected and undetected delay in synchronous replication to other services, which caused the process to wait until the data was in sync.

How did we respond?

As soon as the internal monitoring systems sent the alert about an issue impacting client instances, Customer Support confirmed the issue and launched a Severity-1 incident. The incident response teams immediately began investigating and identified that the maintenance process was waiting for synchronous replication to complete. While the process waited, customers received an error message when attempting to access their systems. To resolve the issue, the incident team performed a manual intervention, and all services were restored.

What are we doing to prevent it from happening again?

We are updating our upgrade process to detect delays in synchronous replication and to postpone the upgrade if a delay exists. This event has a very low likelihood of recurrence; however, the teams are continuing their testing and are replicating the issue to determine whether any additional changes are required. The teams are also reviewing the maintenance process and monitoring settings to identify any potential improvements.
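
As an illustration only, a pre-upgrade check of this kind might look like the sketch below. It is not the xMatters implementation: the function name, the replica names, and the one-second threshold are all hypothetical, and the lag measurements are assumed to come from whatever monitoring the database already exposes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UpgradeDecision:
    proceed: bool        # True only if every measured replica is within the threshold
    lagging: List[str]   # names of replicas whose lag exceeds the threshold


def check_replication_before_upgrade(
    lag_seconds: Dict[str, float],
    max_lag_seconds: float = 1.0,  # hypothetical threshold, not from the report
) -> UpgradeDecision:
    """Decide whether to start the upgrade based on measured replication lag.

    lag_seconds maps a replica name to its current replication delay in seconds.
    An empty mapping means the lag could not be verified, so the upgrade is
    postponed rather than started blind.
    """
    lagging = sorted(
        name for name, lag in lag_seconds.items() if lag > max_lag_seconds
    )
    proceed = bool(lag_seconds) and not lagging
    return UpgradeDecision(proceed=proceed, lagging=lagging)


# Example: one standby is far behind, so the upgrade should be postponed.
decision = check_replication_before_upgrade({"standby-a": 0.2, "standby-b": 45.0})
if not decision.proceed:
    print("Postponing upgrade; lagging replicas:", decision.lagging)
```

The sketch postpones the upgrade rather than waiting for replication to catch up, so that a replication delay surfaces before the maintenance window starts instead of during promotion.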

Timeline:

July 25, 2019. All times PDT

10:04 AM - Monitoring alerts Customer Support to Server Internal Error
10:06 AM - System auto recovers
10:09 AM - All services restored
10:11 AM - Incident closed

Posted Aug 02, 2019 - 12:47 PDT

Resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Posted Jul 25, 2019 - 10:11 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 25, 2019 - 10:08 PDT
Identified
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Posted Jul 25, 2019 - 10:06 PDT
Investigating
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue and will update as information becomes available.

Please see incident details for specific services impacted.

If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Posted Jul 25, 2019 - 10:04 PDT
This incident affected: Asia Pacific (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).