On April 6, 2019, at approximately 4:37 AM PDT, the xMatters monitoring systems alerted the Engineering teams to a service disruption affecting On-Demand services in the North American region. Users may have experienced intermittent access to the user interface, as well as delays or rejections when injecting events into xMatters.
This issue was caused by a monitoring service that, while buffering metrics for reporting, consumed an excessive amount of memory. The resulting resource exhaustion caused some database queries to fail.
As soon as the xMatters network monitoring tools detected unreliable connectivity in the xMatters system, the Client Assistance team launched the internal severity-1 investigation process, which was later upgraded to a major incident, and posted a notice to the xMatters status page. The incident response teams began investigating the underlying cause while simultaneously working to restore services for clients. The teams determined that the fastest way to restore service with the least impact to clients was a manual database failover to a system that was not experiencing resource exhaustion. Once the failover was complete, clients confirmed that all services were restored and functioning as expected.
To help prevent similar incidents in the future, the xMatters Engineering teams are investigating ways to improve their current method of resource monitoring. Any findings will be added to the relevant playbooks so that they become a consistent part of xMatters standard processes. In addition, the Engineering teams are working with the service vendor to review the issue and determine what additional actions can be taken to ensure the issue does not recur.
April 6, 2019 (all times PDT)
4:37 AM - First notification of a potential issue with On-Demand services. No client impact at this time.
4:47 AM - Investigation begins.
5:37 AM - Severity-1 process launched. Issue becomes client impacting.
6:20 AM - Cause is identified. Manual database failover performed.
6:30 AM - The responsible monitoring service is disabled.
6:33 AM - Client impact is mitigated. Teams continue to monitor.
6:37 AM - Confirmation of system recovery.
6:47 AM - All services restored.
If you have any questions, please visit http://support.xmatters.com