On June 21, 2021, at approximately 10:05 AM Pacific, the xMatters monitoring tools alerted Customer Support to an issue where the web user interface was unresponsive or exhibiting slow performance. During the incident, some customers may have noticed "Instance Unavailable" errors or experienced longer page load times when accessing the web user interface. This issue only affected the web user interface; events continued to be accepted and created, and notifications and responses were processed normally.
This issue was caused by a single instance attempting to load approximately 140,000 user records into memory. This eventually increased memory usage to 100%, resulting in an unresponsive service. Although the condition correctly triggered an automated restart of the web user interface service, the service could not fully recover until the underlying issue was mitigated.
As soon as Customer Support received the alert from the monitoring tools and confirmed the issue, they initiated a Severity-1 incident and gathered the major incident response team. The team identified the instance responsible for consuming resources and isolated it within a dedicated resource stack to prevent any potential recurrence. The team then manually cleared the cache and restarted the web user interface service, confirming that it had resumed normal operation.
The Engineering team has isolated the source of the memory usage and reconfigured it with dedicated CPU and separate resources to eliminate future incidents of this type. They are currently developing additional memory cleanup routines to further improve automated recovery, and are investigating how the single instance was able to consume the available memory. Until these improvements are in place, the team will continue to isolate the source of the memory consumption.
Date/Time (Pacific) | Action |
---|---|
Monday, June 21, 2021 - 10:05 AM | xMatters monitoring tools alert Customer Support to slow or unresponsive customer instances |
10:17 | Severity-1 Incident initiated |
10:20 | Source of memory usage identified |
10:22 | Instance isolated and web UI service restarted |
10:30 | Web user interface service declared stable |
10:45 | Incident resolved |
If you have any questions, please visit http://support.xmatters.com