On Thursday, November 7, 2019, at approximately 5:20 AM PST, the xMatters network monitoring systems alerted the Customer Support teams to an issue with the On-Demand services within North America. Some users may have experienced intermittent access to the xMatters On-Demand web user interface, and a delay or rejection when injecting events into xMatters.
This incident was caused by a single database within one of the database clusters consuming a disproportionate amount of resources. This limited the ability of other databases in the cluster to accept new requests, resulting in intermittent access to the web user interface.
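As an illustration only, the condition described above (one database taking a disproportionate share of a cluster's resources) can be expressed as a simple share-of-total check. This is a minimal sketch under stated assumptions: the function name `find_noisy_databases`, the database names, the CPU-seconds metric, and the 50% threshold are all hypothetical and do not describe the monitoring xMatters actually uses.

```python
from typing import Dict, List


def find_noisy_databases(cpu_seconds: Dict[str, float],
                         max_share: float = 0.5) -> List[str]:
    """Return databases whose share of total cluster CPU exceeds max_share.

    cpu_seconds maps a database name to the CPU time it used in some
    sampling window (hypothetical metric, for illustration only).
    """
    total = sum(cpu_seconds.values())
    if total <= 0:
        # No measurable load in this window, so nothing can be flagged.
        return []
    return sorted(
        name for name, used in cpu_seconds.items()
        if used / total > max_share
    )


# Hypothetical sample: one database dominates the cluster.
sample = {"db_a": 12.0, "db_b": 9.5, "db_c": 310.0, "db_d": 14.5}
noisy = find_noisy_databases(sample)
print(noisy)
```

A relative share is used here rather than an absolute limit so that the check still works when overall cluster load varies; a real deployment would likely combine several metrics.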
As soon as the internal monitoring systems raised an alert about customer instances, Customer Support confirmed the issue and launched the internal major incident management process. The incident response teams immediately began their investigation and identified a database cluster that was consuming processing resources at an exceptionally high rate. The teams determined that the issue was confined to a single database in the cluster, which was causing latency and preventing other databases from serving their requests. The teams concluded that the quickest way to remedy the issue was to promote a standby database cluster to become the new primary. The recovery process and redundant service architecture restored services, and system performance returned to normal.
To prevent this issue from recurring, the Engineering teams will be taking the following steps:
xMatters strives to provide high availability to our clients and we recognize that reliability of services is of utmost importance to our customers and their businesses. xMatters is committed to improving our resiliency and investing in the tools and processes required to prevent and minimize service disruptions.
| Date/Time (PST) | Description |
|---|---|
| 2019-11-07 05:20 AM | xMatters monitoring tools alert Customer Support to intermittent access to some client instances in North America. |
| 05:45 AM | Severity-1 issue raised; internal major incident management process initiated. |
| 06:19 AM | Bulletin posted to xMatters status page: https://status.xmatters.com/incidents/xrq45x6g0zpp |
| 06:43 AM | Incident team identifies issue as related to a database within the cluster. |
| 07:00 AM | Promotion of secondary database cluster begins. |
| 07:09 AM | All services are restored. |
If you have any questions, please visit http://support.xmatters.com