On July 25, 2019 at approximately 10:04 AM PDT, the xMatters internal monitoring systems alerted Customer Support to an issue that caused a "Server Internal Error" message to display when users accessed an xMatters instance. Clients in the Asia-Pacific region may also have seen this message when attempting to access their instances, or experienced difficulty accessing xMatters services. All notifications continued to process as expected during the incident.
The incident occurred during scheduled database maintenance. During the upgrade activity, the standby databases are upgraded and then promoted to master. This process typically takes seconds to complete and does not impact customers. During this upgrade, there was an unexpected and undetected delay in synchronous replication to other services, which caused the upgrade process to wait until the data was in sync.
As soon as the internal monitoring systems sent the alert about an issue impacting client instances, Customer Support confirmed the issue and launched a Severity-1 incident. The incident response teams immediately began investigating and identified that the maintenance process was waiting for synchronous replication to complete. While the process waited, customers received an error message when attempting to access their systems. To resolve the issue, the incident team performed a manual intervention, and all services were restored.
We are updating our upgrade process to detect delays in synchronous replication and to postpone an update if a delay exists. This event has a very low likelihood of recurrence; however, the teams are continuing their testing and are replicating the issue to determine if any additional changes are required. The teams are also reviewing the maintenance process and monitoring settings to identify any potential improvements.
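As an illustration only, a pre-upgrade check of the kind described above might look something like the following Python sketch. It is not the xMatters implementation. The function name, the `sample_lag` callback, the one-second threshold, and the sample count are all hypothetical, and how replication lag is actually measured depends on the database in use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PrecheckResult:
    proceed: bool              # True if the upgrade may start
    worst_lag_seconds: float   # largest lag seen across all samples
    reason: str                # human-readable explanation for logs


def check_replication_before_upgrade(
    sample_lag: Callable[[], float],
    max_lag_seconds: float = 1.0,   # assumed threshold, not a real setting
    samples: int = 3,               # assumed sample count
) -> PrecheckResult:
    """Sample replication lag several times and decide whether to postpone.

    `sample_lag` is a hypothetical callback that returns the current
    synchronous replication lag in seconds.
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")

    # Track the worst lag so a single slow sample is enough to postpone.
    worst = 0.0
    for _ in range(samples):
        worst = max(worst, sample_lag())

    if worst > max_lag_seconds:
        return PrecheckResult(
            proceed=False,
            worst_lag_seconds=worst,
            reason=f"replication lag {worst:.2f}s exceeds {max_lag_seconds:.2f}s; postponing upgrade",
        )
    return PrecheckResult(
        proceed=True,
        worst_lag_seconds=worst,
        reason="replication is in sync; upgrade may proceed",
    )
```

In this sketch the decision uses the worst sample rather than the average, so a brief replication stall is enough to postpone the upgrade instead of being smoothed away.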
July 25, 2019. All times PDT
10:04 AM - Monitoring alerts Customer Support to Server Internal Error
10:06 AM - System auto recovers
10:09 AM - All services restored
10:11 AM - Incident closed