Service disruption in North American Region
Incident Report for xMatters
Postmortem

Details

What happened?

On Thursday, November 7, 2019, at approximately 5:20 AM PST, the xMatters network monitoring systems alerted the Customer Support teams to an issue with the On-Demand services within North America. Some users may have experienced intermittent access to the xMatters On-Demand web user interface, and a delay or rejection when injecting events into xMatters.

Why did it happen?

This incident was caused by a single database within one of the database clusters consuming a disproportionate amount of resources. This limited the ability of other databases in the cluster to accept new requests, resulting in intermittent access to the web user interface.
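
As an illustration of the condition described above, the following is a minimal sketch of how a single database taking a disproportionate share of a cluster's resources could be flagged. It is not xMatters' actual monitoring logic: the function name, the 50% share limit, and the sample figures are all hypothetical.

```python
from typing import Dict, List


def find_noisy_databases(cpu_by_db: Dict[str, float], max_share: float = 0.5) -> List[str]:
    """Return the databases whose share of total cluster CPU exceeds max_share.

    cpu_by_db maps a database name to the CPU time it consumed in one
    sampling interval. Both the metric and the 0.5 default are assumptions.
    """
    total = sum(cpu_by_db.values())
    if total <= 0:
        # No measurable load, so nothing can be disproportionate.
        return []
    return sorted(name for name, cpu in cpu_by_db.items() if cpu / total > max_share)


# Hypothetical sample: CPU seconds per database over one interval.
sample = {"db_a": 12.0, "db_b": 9.5, "db_c": 310.0, "db_d": 14.5}
noisy = find_noisy_databases(sample)
print(noisy)
```

A relative share is used rather than an absolute limit so that the check still works after the cluster is resized.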

How did we respond?

As soon as the internal monitoring systems alerted to an issue with customer instances, Customer Support confirmed the issue and launched the internal major incident management process. The incident response teams immediately began their investigation and identified a database cluster that was consuming processing resources at an exceptionally high rate. The teams determined that the issue was confined to a specific database in the cluster, which was causing latency and preventing the other databases in the cluster from serving their requests. The teams concluded that the quickest remedy was to promote a standby database cluster to become the new primary. The recovery process and redundant service architecture restored services, and system performance returned to normal.

What are we doing to prevent it from happening again?

To prevent this issue from recurring, the Engineering teams are taking the following steps:

  1. Resize the database cluster to accommodate potential usage spikes and to increase tolerance for similar issues. (Completed)
  2. Rebalance the database cluster to increase bandwidth for all impacted customers. (Scheduled for completion on or before November 14, 2019)
  3. Increase monitoring thresholds to identify spikes in usage during peak periods. (Completed)

xMatters strives to provide high availability to our clients, and we recognize that reliability of services is of utmost importance to our customers and their businesses. xMatters is committed to improving our resiliency and investing in the tools and processes required to prevent and minimize service disruptions.

Timeline:

Date/Time (PST)        Description
2019-11-07 05:20 AM    xMatters monitoring tools alert Customer Support to intermittent access to some client instances in North America.
           05:45 AM    Severity-1 issue raised; internal major incident management process initiated.
           06:19 AM    Bulletin posted to xMatters status page: https://status.xmatters.com/incidents/xrq45x6g0zpp
           06:43 AM    Incident team identifies issue as related to a database within the cluster.
           07:00 AM    Promotion of secondary database cluster begins.
           07:09 AM    All services are restored.
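
The intervals implied by the timeline can be worked out with a short sketch like the one below. The times are taken from the timeline above; the shortened labels and the helper name are illustrative only.

```python
from datetime import datetime

# Times (PST, 2019-11-07) copied from the timeline; labels are shortened.
timeline = [
    ("05:20", "Monitoring alert"),
    ("05:45", "Severity-1 raised"),
    ("06:19", "Status page bulletin posted"),
    ("06:43", "Database identified"),
    ("07:00", "Promotion begins"),
    ("07:09", "All services restored"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


first_alert = timeline[0][0]
# Minutes elapsed since the first alert for each timeline entry.
elapsed = {label: minutes_between(first_alert, time) for time, label in timeline}

for label, minutes in elapsed.items():
    print(f"{minutes:>4} min  {label}")
```

By this calculation, services were restored 109 minutes after the first alert, and the promotion of the secondary cluster took 9 minutes.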

If you have any questions, please visit http://support.xmatters.com

Posted Nov 08, 2019 - 13:37 PST

Resolved
At approximately 5:56am PST, we experienced an issue with xMatters that prevented users in North America from accessing the web user interface. Services were restored at approximately 7:12am PST.
Posted Nov 07, 2019 - 07:16 PST
Update
We are continuing to investigate this issue.
Posted Nov 07, 2019 - 07:12 PST
Update
We are continuing to investigate this issue.
Posted Nov 07, 2019 - 06:50 PST
Investigating
At approximately 5:56am PST, our internal monitoring detected an issue affecting some customers in North America. Some customers may be unable to log into their xMatters instances.

We are currently investigating this issue.
Posted Nov 07, 2019 - 06:19 PST
This incident affected: North America (Web Interface, Email Notifications, SMS Notifications, Voice Notifications, Conferencing, Integration Platform, API, Mobile App).