Historical record of incidents for xMatters
Report: "Issue Discovered - Service disruption in North American Region – Integration Platform"
Last update: xMatters monitoring tools have identified a potential issue with the xMatters Integration Platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in European Region - Multiple Services"
Last update:

**What happened?**
On May 13, 2025, at approximately 10:45 AM Pacific, xMatters internal monitoring tools identified an issue where customers in the EU region experienced intermittent web UI and API timeouts.

**Why did it happen?**
The issue occurred because a backend queueing service experienced network timeouts during an unpredictable, rapid increase in usage. The increase in resource consumption due to the surge in network usage caused service timeouts and restarts, as well as higher latency, which caused further delays in responses to backend requests.

**How did we respond?**
xMatters internal monitoring tools alerted the xMatters Incident Response Team to the issue, and the team launched the internal SEV-1 process. Due to early detection, Engineering teams were able to scale up the queueing services to prevent further service degradation and availability issues. The network timeouts were resolved after resources were scaled up to accommodate the increase in usage.

**What are we doing to prevent it from happening again?**
The Engineering teams have adjusted resources to better compensate for sudden usage increases and to prevent them from affecting backend services. The improvement in resource allocation and adaptability should prevent similar issues from occurring in the future.
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
We are continuing to work on a fix for this issue.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Europe region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in All Regions – Integration Platform"
Last update:

**What happened?**
On April 3, 2025, at approximately 9:32 AM Pacific, the xMatters internal monitoring systems identified an issue where the system was not processing events initiated via Flow Designer across multiple regions. Customers may have observed the system not processing events or creating alerts while the issue was in progress.

**Why did it happen?**
The issue occurred when a routine update to add new permissions to xMatters' Google Cloud Platform (GCP) environment unexpectedly removed required permissions. When the teams performed the update, which should not have had any impact on customers, the policy used in the automation script was authoritative at the GCP project level rather than at the individual resource level. Although the teams tested the change before deploying it and found no changes beyond what was included in the update, when the update was deployed to production, GCP removed, in the background, all permissions that were not in the policy. Because the policy only included the new permissions, all other permissions were removed.

**How did we respond?**
As soon as the xMatters monitoring tools reported an issue with the system not processing events, the incident response teams initiated the internal Major Incident Management process and engaged the Engineering and Support teams. The teams were able to quickly identify the recent update as the root cause of the issue and reverted the change to restore the permissions that were managed by the automation process. This restored access and functionality for most of the xMatters services, but restoring permissions for Flow Designer proved to be more complicated. The teams determined that the missing permissions for the xMatters infrastructure were Google-generated permissions essential for specific xMatters services and engaged GCP Support to aid in the investigation.

The teams generated a list of all permissions that existed prior to the update and designed a fix to re-apply them to the development environment. Once the teams had implemented the change and validated that the missing permissions and all services had been restored in the development environment, they moved quickly to apply the fix across all staging and production environments. Monitoring tools and customers confirmed that services were fully functional, and the teams continued to monitor the system as it processed all messages queued by the Flow Designer services. The system was fully restored at 2:46 PM Pacific.

**What are we doing to prevent it from happening again?**
The Engineering teams were able to identify all of the permissions created in the xMatters environment, including those that are created by Google, and are ensuring they are added to the management scripts. The teams are adding rigor to the application of these types of infrastructure changes by running idempotence tests after the changes are applied to ensure that there are no changes pending. Should a change be applied that fails this test, it will cause failures in development environments, which would catch and prevent a similar issue from occurring.
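The difference between an authoritative and an additive policy update, and the idempotence check described above, can be illustrated with a minimal sketch. This is a simplified in-memory model and not the GCP API or the xMatters automation: the `Policy` type, the three functions, and every role and account name below are hypothetical, chosen only for illustration.

```python
from typing import Dict, Set

# role -> members; a simplified, hypothetical stand-in for a permissions policy
Policy = Dict[str, Set[str]]


def apply_authoritative(current: Policy, desired: Policy) -> Policy:
    """Replace the whole policy: anything not listed in `desired` is dropped."""
    return {role: set(members) for role, members in desired.items()}


def apply_additive(current: Policy, desired: Policy) -> Policy:
    """Merge `desired` into `current`: existing bindings are left in place."""
    merged = {role: set(members) for role, members in current.items()}
    for role, members in desired.items():
        merged.setdefault(role, set()).update(members)
    return merged


def pending_changes(live: Policy, managed: Policy) -> Dict[str, Policy]:
    """Idempotence check: what would applying `managed` to `live` add or remove?"""
    roles = set(live) | set(managed)
    added = {r: managed.get(r, set()) - live.get(r, set()) for r in roles}
    removed = {r: live.get(r, set()) - managed.get(r, set()) for r in roles}
    return {
        "add": {r: m for r, m in added.items() if m},
        "remove": {r: m for r, m in removed.items() if m},
    }


# Illustrative data only: one binding managed by scripts, one generated by the platform.
live = {
    "role/queue-publisher": {"account:flows"},
    "role/platform-generated-agent": {"account:platform-managed"},
}
update = {"role/report-viewer": {"account:reports"}}

after_authoritative = apply_authoritative(live, update)  # only the new binding survives
after_additive = apply_additive(live, update)            # all three bindings present
drift = pending_changes(live, update)                    # lists both bindings as removals
```

In this model, a non-empty `remove` entry from `pending_changes` is the signal that an apply would silently drop bindings the managed definition does not know about, which is the kind of result an idempotence test in a development environment is meant to surface.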
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The xMatters Incident Response team has identified the source of the issue and is currently testing a fix. We will provide another update shortly.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters Incident Response team has identified the source of the issue and is still actively working on a fix; there is no current estimate for resolution time. We will update once a solution has been implemented.
The xMatters Incident Response team has identified the source of the issue and is still actively working on a fix; there is no current estimate for resolution time. We will update once a solution has been implemented.
The xMatters Incident Response team has identified the source of the issue and is still actively working on a fix; there is no current estimate for resolution time. We will update once a solution has been implemented.
The xMatters Incident Response team has identified the source of the issue and is still actively working on a fix; there is no current estimate for resolution time. We will update once a solution has been implemented.
The xMatters Incident Response team has identified the source of the issue and is actively working on a fix; there is no current estimate for resolution time. We will update once a solution has been implemented.
The xMatters Incident Response team has identified the source of the issue and is still working on a fix. We will update once a solution has been identified and implemented.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters Incident Response team has identified the source of the issue and is still working on a fix. We will update once a solution has been identified and implemented.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters Integration Platform for some clients in All Regions. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North American Region - Multiple Services"
Last update:

**What happened?**
On July 5, 2024, at approximately 8:35 AM Pacific, the xMatters monitoring tools alerted Customer Support to an issue where alert notifications were not being sent out for some customers in the North America region. Some customers attempting to initiate alerts may have encountered long delays in processing or may have had requests time out.

**Why did it happen?**
This issue occurred due to a sudden spike in the number of resources required by our backend services. The resulting memory overload caused some request handlers to time out before they could properly process incoming alerts.

**How did we respond?**
As soon as the xMatters Customer Support team confirmed the issue from the monitoring tools, they initiated the internal major incident management process and engaged the xMatters Engineering teams. To immediately mitigate the issue and restore service quickly, the incident response teams performed a rolling restart of the affected services. As soon as the restart was completed, the system resumed processing alerts and all services were restored.

**What are we doing to prevent it from happening again?**
The Engineering teams have implemented a performance enhancement for backend service queries. In addition, the teams are evaluating and testing additional methods to help mitigate resource spikes and prevent them from impacting alert notifications in the future. Once development and testing are complete, we'll deploy these changes with our regularly scheduled maintenance.

**Timeline:** July 5, 2024

8:35 AM PT - xMatters internal monitoring tools alert to potential issue.

9:22 AM PT - Issue identified.

9:36 AM PT - Rolling restart initiated.

10:16 AM PT - Issue resolved.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America Region – Partial Outage"
Last update: The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
xMatters monitoring tools have identified a potential issue with xMatters for some clients in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America Region – Partial Outage"
Last update: The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters for some clients in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Voice calls to China blocked by carriers"
Last update: We have resolved this issue with assistance and cooperation from China Mobile. Due to regulatory changes and new requirements for Chinese telephone carriers, customers who need to place voice calls to users in China must request a Direct Inward Dialing number from China Mobile International. For more information, see the knowledge base article at https://support.xmatters.com/hc/en-us/articles/25940694575131-Voice-calls-known-issues-and-limitations
We are finalizing the technical solution we identified in October and working through the last of the product and architecture updates required. With the current code freezes in place until the end of the year and a reduced availability of some resources over the upcoming holiday season, our new expected implementation date is early in 2024.
We have identified a solution to this issue that will involve using one of the main Chinese carriers to more directly deliver voice calls to China. We have made significant progress in proving out the technical solution, and are moving ahead with the necessary product updates to adopt this approach. Given our current progress and planning, we expect to have a solution before the end of 2023.
We are currently unable to deliver voice calls to China due to phone providers blocking traffic from multiple sources. We continue to look for a resolution to this issue and are working closely with our suppliers, but we cannot provide an estimate of when a solution will be available. Other delivery methods are still able to send notifications to Chinese users, and voice calls to other countries in the Asia-Pacific region are unaffected. For more information, see the accompanying article on the Support site at https://support.xmatters.com/hc/en-us/articles/17026765547547
Report: "Intermittent scheduler services in APAC region"
Last update: The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating an issue with one of our scheduler services impacting customers in Australia.
Report: "Issue Discovered - Service disruption in Asia Pacific Region - Multiple Services"
Last update: The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in All Regions – Mobile App (Android only)"
Last update: The issue has been resolved, and no app update is required at this time. We are continuing to investigate the root cause and develop a permanent resolution to prevent this issue from reoccurring.
The issue has been resolved; no app update is required. The problem appears to have stemmed from a third-party conflict: the way the certificate transparency library handles updates from the Google log list conflicts with its own concurrency and job cancellation. We are continuing to monitor the situation and investigating ways to resolve the issue permanently.
We are continuing to work toward mitigation and resolution of this issue, and believe we have identified the root cause. We are currently developing and testing a potential fix. In the meantime, push notifications are still being delivered and customers have reported that allowing pages within the app to fully load and waiting before attempting to navigate away or perform another action has significantly reduced occurrences of the error. We will continue to update as more information becomes available, but note that it will likely be necessary to update the app to resolve the problem.
We've identified this issue as related to the Android certificate transparency library we use for enhanced security in xMatters. Certificate transparency is an important part of the mobile app's end-to-end security and we are currently testing mitigation strategies to ensure they will not cause any disruption of this feature. We are continuing to investigate the issue and will provide more information as it becomes available.
We are investigating an issue with the Android mobile app not working on some Android devices. Reports indicate that users are receiving an error message that their devices cannot connect to xMatters. We will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Support at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in Asia Pacific Region - Multiple Services"
Last update:

**What happened?**
On November 9, 2023, at approximately 5:35 AM AEDT, some customers in the APAC region reported an issue to xMatters Customer Support where they were unable to add a new user. The Add User button was greyed out, and hovering over the button showed the message "You've reached the maximum number of user licenses for your account" despite additional licenses being available. Some users may also have experienced an intermittent inability to log in to the web user interface. Throughout this issue and the subsequent mitigation procedures, the system continued to accept events and generate alerts, and all notifications and responses were processed correctly.

**Why did it happen?**
During a regularly scheduled update to the backend services in the APAC region, a timing issue caused the service responsible for instance configuration and license tracking to be directed to a version that hadn't received the latest configuration data. This conflict caused the system to calculate allotted licenses incorrectly and caused intermittent login issues.

**How did we respond?**
The Engineering teams were monitoring the update and were not encountering any warnings or errors within the process that they considered outside acceptable levels for this specific operation. When customers reported the issue to xMatters Customer Support, however, the teams made the decision to roll back the deployment immediately to mitigate any potential problems. As soon as the rollback was completed, customers confirmed that all services had been restored. The Engineering team launched an internal review process, identified some avenues of improvement, and successfully redeployed the update without incident.

**What are we doing to prevent it from happening again?**
In addition to adding automated checks to ensure configuration data is always up to date across services prior to an update, the teams isolated the specific cause of the configuration data mismatch to a timeout issue and have updated the timing settings to ensure that it will not happen again.
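A pre-update check of the kind described above might look like the following minimal sketch. It assumes each service version reports an integer configuration revision; the function names, the revision scheme, and the example service names are hypothetical and are not taken from the xMatters platform.

```python
from typing import Dict, List


def stale_services(config_revisions: Dict[str, int], latest_revision: int) -> List[str]:
    """Return the names of services whose configuration revision is behind the latest."""
    return sorted(
        name for name, revision in config_revisions.items()
        if revision < latest_revision
    )


def safe_to_switch_traffic(config_revisions: Dict[str, int], latest_revision: int) -> bool:
    """Allow the switch-over only when no service is running on stale configuration."""
    return not stale_services(config_revisions, latest_revision)


# Hypothetical example: the new version of one service has not yet received revision 42.
revisions = {"licensing-v2": 41, "login-v2": 42, "scheduler-v2": 42}
blocked_by = stale_services(revisions, latest_revision=42)
```

Gating the traffic switch on an empty `blocked_by` list keeps a deployment from directing requests to a version that is still waiting on configuration data, regardless of how long the data takes to arrive.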
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
We are continuing to work on a fix for this issue.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in All Regions – Multiple Services"
Last update:

**What happened?**
On October 31, 2023, at approximately 10:22 AM Pacific, some customers reported that they were unable to log in to xMatters or, if they were already logged in, encountered "503" errors in the web user interface. Customers may also have noticed some flows failing to execute.

**Why did it happen?**
xMatters deployed a regularly scheduled update to one of the backend services that comprise the platform. Due to recent hosting changes that included the physical relocation of a data center, the deployment caused a conflict that resulted in a lack of processing availability.

**How did we respond?**
As soon as customers reported an inability to access the web user interface, the Support team confirmed the issue and initiated the internal major incident process. The response teams quickly identified the root cause and rolled back the deployment to the previous version of the service. This resolved the issue and customers reported that all services were restored. The xMatters Engineering teams then investigated the recent deployment and were able to reconfigure the update and redeploy the service. The service was deployed and restarted without further impact to customers within 20 minutes of resolving the initial issue.

**What are we doing to prevent it from happening again?**
The xMatters teams regularly deploy updates to backend services and aim for a seamless transition between versions that won't impact customers. To help prevent this type of issue from reoccurring, the teams are adding more process checks to ensure that updates meet backend service requirements and dependencies before customers are switched over to a new version of a service.
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
We are continuing to monitor for any further issues.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for clients in All Regions. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support team is waiting to help.
Report: "Issue Discovered - Service disruption in North American Region - Multiple Services"
Last update:

### What happened?

On August 3, 2023, at approximately 4:00 PM Pacific, xMatters internal monitoring detected an issue where some customers in the US-EAST data center experienced a blank screen with a "We've run into a problem..." error while attempting to log in to their instances. This incident only impacted login to the web user interface and did not affect notification processing.

### Why did it happen?

An unusually high volume of user delivery requests caused connection timeouts when the xMatters API service attempted to access the historical data storage service. Since the API service is a critical component in user login processing, the connection timeouts resulted in some customers being unable to log in to their instances. Further investigation revealed that automated sizing of resources for the data storage service was unable to mitigate the temporary increase in request load.

### How did we respond?

As soon as xMatters Customer Support confirmed the issue, they escalated it to the xMatters Engineering teams. xMatters Engineering was able to isolate the issue and restart the storage service. The restart dropped all pending connection requests, which allowed the service to recover; however, this may have caused some event requests to retry in order to complete and led to some delay in event processing.

### What are we doing to prevent it from happening again?

xMatters Engineering has started a review of the existing data storage to better address times of unexpected usage. This includes reviewing new throttling options and improving our ability to speed recovery through manual intervention.

### Timeline:

| Date/Time | Action |
| --- | --- |
| August 3, 2023 4:00 PM PT | Internal monitoring detects login failures. |
| 4:04 PM | Severity-1 Incident raised. |
| 4:19 PM | Issue identified - increased error rate for xM-API. |
| 4:26 PM | Data storage service restarted. |
| 4:35 PM | Instances recovering. |
| 4:48 PM | Incident resolved. |

If you have any questions, please visit http://support.xmatters.com.
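One common form of throttling for a storage dependency is to cap the number of requests in flight and reject the excess immediately, so that pending connections do not pile up until they time out. The sketch below illustrates that general technique only; it is not the option xMatters Engineering reviewed, and the `AdmissionGate` and `Overloaded` names are hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class Overloaded(Exception):
    """Raised when the gate is full and a request is shed instead of queued."""


class AdmissionGate:
    """Caps concurrent requests to a downstream service; excess requests fail fast."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def admit(self) -> Iterator[None]:
        # Non-blocking acquire: never wait for a slot, reject right away instead.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            yield
        finally:
            self._slots.release()
```

A caller would wrap each storage request in `with gate.admit():` and translate `Overloaded` into a retryable response, which trades a fast, explicit rejection for the slow connection timeouts described above.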
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North American Region – Web User Interface"
Last update:

### What happened?

On April 14, 2023, at approximately 9:05 AM Pacific, some customers reported an issue to xMatters Customer Support where users were encountering errors when attempting to log in to the xMatters web interface. During the incident, some customers in North America may have experienced 503 errors when attempting to access or use xMatters or encountered errors with integrations that communicated with the xMatters API. These errors were intermittent and only impacted a subset of customers whose primary instance was based in the us-east data center. Customers in the EMEA and APAC regions, and in other North American data centers, were not impacted.

### Why did it happen?

The issue was caused when a customer inadvertently initiated a denial-of-service attack by launching an excessive number of API requests. The incoming requests peaked at over 90,000 per minute and overwhelmed the capacity of edge systems to manage the volume, causing a cascade that eventually blocked access to API endpoints and triggered 503 errors for systems that rely on them.

### How did we respond?

xMatters monitoring systems alerted to the issue just before customers reported encountering errors. xMatters Customer Support confirmed the issue and initiated the major incident management process. The incident response teams determined that the best course of action was to promote impacted customers to unaffected regions and mitigate the inbound traffic by redirecting it away from critical systems. Once the traffic was mitigated, impacted systems were able to recover and customers were migrated back to their original data centers. A status page notification was posted to status.xmatters.com, but due to the limited scope and intermittent impact, it was noted as a degraded service. This classification intentionally does not email status page subscribers.

### What are we doing to prevent it from happening again?

xMatters Engineering has determined that additional protections are needed at entry points to identify any excessive inbound volume and allow for quick mitigation. The teams are in the process of determining the best parameters and implementation of these protections to address both intentional and unintentional denial-of-service incidents.

### Timeline:

**Friday, April 14, 2023**

**9:00 AM** - Customers report 503 issues

**9:06 AM** - xMatters Customer Support initiates Severity-1 incident

**9:15 AM** - Investigation reveals high volume to us-east

**9:25 AM** - Source of volume identified; begin routing customers to other regions

**9:45 AM** - Routing changes complete

**10:17 AM** - Incident mitigated

If you have any questions, please visit http://support.xmatters.com.
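Entry-point protection against excessive inbound volume is often built on a per-client rate limit. The following is a minimal token-bucket sketch of that idea, assuming a single process and an in-memory store; the class name, parameters, and limits are hypothetical and do not describe the protections xMatters is implementing.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-client token bucket: each request spends one token; tokens refill over time."""

    def __init__(self, rate_per_minute: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_second = rate_per_minute / 60.0
        self.burst = burst
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._buckets: Dict[str, Tuple[float, float]] = {}  # client -> (tokens, last_seen)

    def allow(self, client_id: str) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        tokens, last_seen = self._buckets.get(client_id, (float(self.burst), now))
        # Refill in proportion to elapsed time, never above the burst capacity.
        tokens = min(float(self.burst), tokens + (now - last_seen) * self.rate_per_second)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[client_id] = (tokens, now)
        return allowed
```

Because each client has its own bucket, a single integration sending tens of thousands of requests per minute exhausts only its own allowance and starts receiving rejections, while other clients behind the same entry point continue to be served.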
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service degradation in North American Region – Email Notification"
Last updateThe issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
A fix has been deployed for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The issue has been identified and a fix is being implemented.
We are continuing to work with our email vendor to resolve this issue. We'll provide updates as we have them.
We are continuing to push our email vendor for updates and resolution on this issue. We'll provide additional updates as we have them.
We are continuing to work with our email vendor to resolve this issue. We'll provide updates as we have them.
We are continuing to work on a fix for this issue.
The xMatters Incident Response team has identified the source of the issue to be with a downstream vendor, and we are engaged with them on a resolution. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters Email Notification for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North American Region – Web User Interface"
Last update## Details ### What happened? On July 12, 2022, at approximately 10:30 AM Pacific, some customers in North America reported an issue to xMatters Customer Support where they were unable to load User and Group Performance reports. Some users also reported performance issues involving slow loading of dashboard widgets in the Communication Center or errors when attempting to log in to the xMatters user interface. The issue affected only the performance reports, dashboard widgets, and login; all other services, including signal processing, notification creation and delivery, and response processing were not impacted. ### Why did it happen? The issue was traced to enhancements to the User and Group Performance reports that had been enabled, or toggled on, shortly before the first reported issues. The backend services that query data for the performance reports and related dashboard widgets were not appropriately sized for a production load. This caused a backlog in request processing, which led to delays in accessing the data via the web user interface. The scale of the change required for that morning's Pole Position release led to the misconfiguration as the interaction between features was missed during the QA process. ### How did we respond? As soon as customers reported the issue, Customer Support confirmed performance issues via the internal monitoring tools and initiated the major incident management process. The incident response team determined that the best course of action to mitigate the issue quickly was to toggle off the recently changed reporting features to reduce the load on the backend services. This allowed the web user interface to more easily complete its processing requirements and the backlog of requests quickly cleared. Customers confirmed that performance had returned to normal levels and service had been restored. 
The teams continued to investigate the cause of the issue and identified that the backend services that query performance reporting for dashboard widgets and the report pages in the web user interface were unable to retrieve data in a timely manner. This also caused the web login issue, as delays in loading dashboards eventually led to login timeouts. The teams were able to determine that the resources allocated to dashboard widgets were not processing requests quickly enough, leading to delays in responses to requests and causing upstream services to create backlogs of incoming requests. ### What are we doing to prevent it from happening again? To prevent this issue from reoccurring, the Engineering and Operations teams revised the resource allocations for all of the new reporting and dashboard updates. Over the course of July 13 and 14, they enabled each of the new features in sequence and verified all new features were operating normally and that no other issues occurred. To prevent similar issues, and to ensure that QA in both Development and Non-Production environments properly accounts for production load and is able to surface these types of misconfiguration, the teams are reviewing QA and release practices to reduce the level of complexity required for large-scale releases. The teams are currently implementing the following changes: 1. Enacting a process to sequence the enablement of features using an Enable > Test > Verify process during large-scale deployments. 2. Reviewing QA processes to better identify potential performance-related impacts. 
### Timeline:

| Time (PT) | Action |
| --- | --- |
| Tuesday, July 12 10:15 AM PT | Pole Position features are toggled on in production deployments |
| 10:30 | Internal monitoring tools alert to potential performance impact |
| 10:37 | Initial customer reports of performance or login issues |
| 10:41 | Severity 1 Incident initiated |
| 10:55 | Mitigation actions begin |
| 11:40 | Mitigation actions complete |
| 11:47 | Services restored |
| 12:05 | Incident resolved |

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com/)
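The Enable > Test > Verify sequencing described in this report can be sketched as a small loop. This is only an illustration under stated assumptions: the function name, the callback parameters, and the feature names in the usage note are hypothetical, and the report does not describe the tooling xMatters uses for feature toggles.

```python
from typing import Callable, Iterable


def enable_in_sequence(
    features: Iterable[str],
    enable: Callable[[str], None],
    disable: Callable[[str], None],
    healthy: Callable[[], bool],
) -> list[str]:
    """Enable features one at a time; stop at the first one that fails verification."""
    enabled: list[str] = []
    for name in features:
        enable(name)              # Enable
        if not healthy():         # Test and verify against health signals
            disable(name)         # toggle the offending feature back off
            break                 # leave the remaining features disabled
        enabled.append(name)
    return enabled
```

Called with a list such as `["reports", "widgets"]`, the function returns only the features that passed verification, so a feature that degrades the service is isolated immediately rather than discovered after everything is live.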
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored. Some customers may still experience some performance degradation in dashboard widgets.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
We are continuing to investigate the issue. Some customers may be experiencing intermittent errors when logging into the xMatters Web UI.
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in All Regions – Multiple Services"
Last update### What happened? On November 16, 2021, at approximately 09:40 AM PT, xMatters monitoring tools alerted technical teams of Google 404 errors from xMatters instances across all regions. For the duration of the incident, users were unable to access the web user interface, incoming signals were not processing, and notifications were not being generated. ### Why did it happen? xMatters uses Google Cloud Load Balancing \(GCLB\) services, which were not operational during the outage and resulted in the errors seen by customers. Based on the RCA provided by Google: "Google Cloud Networking experienced issues with Google Cloud Load Balancing \(GCLB\) service resulting in impact to several downstream Google Cloud services. Impacted customers observed Google 404 errors on their websites. From preliminary analysis, the root cause of the issue was a latent bug in a network configuration service which was triggered during routine system operation." See [https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh](https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh) for the complete report from Google. ### How did we respond? After receiving alert notifications from the xMatters monitoring tools, xMatters Customer Support and the operations team initiated a Severity-1 incident. The incident team quickly identified the issue as related to an incident within the Google Cloud Platform, which impacted a wide range of SaaS operators worldwide hosted by Google. xMatters Customer Support began communicating with customers by updating [https://status.xmatters.com/incidents/rtl4qyz4nj3m](https://status.xmatters.com/incidents/rtl4qyz4nj3m) with detailed, real-time information. xMatters initiated a dialog with Google to gather updates on resolution progress. The incident team remained engaged until Google resolved the incident to ensure that xMatters recovered smoothly once services were restored. 
There was no intervention required after Google resolved the issue, but some customers may have experienced slow loading times until all Google networking components fully recovered. ### What are we doing to prevent it from happening again? xMatters is committed to providing redundancy and high availability to all customers. Our architecture allows for multiple regional and international failover scenarios, including regionally redundant databases and international traffic rerouting. A worldwide service provider failure is difficult to account for and generally unprecedented. Based on this incident, we are reviewing feasibility options for cloud vendor redundancy; however, there is no imminent action plan for this type of incident. ### Timeline: November 16, 2021 09:43 PT – xMatters monitoring tools alert teams to Google 404 failures; teams initiate Severity-1 incident 09:50 PT – Verification of incident external to xMatters 09:55 PT – xMatters status page posted 10:07 PT – xMatters instances begin to recover 10:09 PT – Google declares incident mitigated 10:42 PT – xMatters declares incident closed If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
We are seeing traffic to all xMatters instances and continue to monitor. Some instances may experience increased latency.
We are continuing to monitor for any further issues.
The xMatters Incident Response team is seeing some instances recovering. xMatters Engineering is monitoring the situation to ensure the system is stable and that all services are restored.
We are currently tracking a problem with our cloud provider and are working directly with them to resolve the issue. We will provide updates as soon as we know more. This outage is impacting multiple services across the internet.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for clients in All Regions. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support team is waiting to help.
Report: "Issue Discovered - Service disruption in North American Region – Web User Interface"
Last update### **What happened?** On June 21, 2021, at approximately 10:05 AM Pacific, the xMatters monitoring tools alerted Customer Support to an issue where the web user interface was unresponsive or exhibiting slow performance. During the incident, some customers may have noticed "Instance Unavailable" errors, or experienced longer page loading times when accessing the web user interface. This issue only affected the web user interface; events continued to be accepted and created, and notifications and responses were processed normally. ### **Why did it happen?** This issue was caused by a single instance attempting to load approximately 140,000 user records into memory. This eventually increased memory usage to 100%, resulting in an unresponsive service. While the condition properly triggered an automated restart of the web user interface service, the service was unable to recover properly until the underlying issue could be mitigated. ### **How did we respond?** As soon as Customer Support received the alert from the monitoring tools and confirmed the issue, they initiated a Severity-1 incident and gathered the major incident response team. The team identified the instance responsible for consuming resources and isolated it within a dedicated resource stack to prevent any potential recurrence. The team then manually cleared the cache and restarted the web user interface service, confirming that it had resumed normal operation. ### **What are we doing to prevent it from happening again?** The Engineering team has isolated the source of the memory usage and reconfigured it with dedicated CPU and separate resources to eliminate future incidents of this type. They are currently developing additional memory cleanup routines to further improve automated recovery, and investigating how the single instance was able to consume the available memory. Until these improvements are in place, the team will continue to isolate the source of the memory consumption. 
### **Timeline:**

| **Date/Time \(Pacific\)** | **Action** |
| --- | --- |
| Monday June 21, 2021 - 10:05 AM | xMatters monitoring alerts to slow or unresponsive customer instances |
| 10:17 | Severity-1 Incident initiated |
| 10:20 | Source of memory usage identified |
| 10:22 | Instance isolated and web UI service restarted |
| 10:30 | Web user interface service declared stable |
| 10:45 | Incident resolved |

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
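The report does not say how the user records were loaded or how the planned memory improvements will work. As one common way to keep memory bounded when a large record set must be processed, the sketch below reads records in fixed-size batches instead of all at once. The function name `in_batches` and the batch size are assumptions made for this example only.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield records in lists of at most batch_size, holding one batch at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))  # pull the next slice lazily
        if not batch:
            return                            # source exhausted
        yield batch
```

With a lazy source, such as a database cursor, only `batch_size` records are resident at any moment, regardless of whether the full set holds 140 or 140,000 records.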
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America Region – Web User Interface"
Last update### What happened? On October 22, 2020, at approximately 9:45 AM Pacific, internal monitoring tools alerted xMatters Customer Support to an issue impacting xMatters database storage services. During the incident, some customers reported not being able to access the xMatters user interface. This impacted some customers in North America for approximately 20 minutes; events processed normally and notifications were not affected. ### Why did it happen? The investigation revealed a loss of network connectivity between two xMatters components, specifically the xMatters API service and analytics database, which led to the inability to service login requests. These connectivity issues led to a failure of the xMatters API to reconnect with the database. This loss of connectivity to the analytics database had a cascading effect that impacted the querying of a small subset of customer databases and access to the xMatters web user interface. The incident investigation determined that the xMatters API was able to create connections to the database but was unable to complete some queries. This condition resulted in a backlog of connection requests which eventually impacted the xMatters web user interface. ### How did we respond? xMatters engineering restarted the API service as part of the investigation into the cause of the errors. After the restart, xMatters Customer Support confirmed there was still an issue accessing the xMatters web user interface and initiated a Severity-1 incident. The incident response team gathered and promoted impacted instances to redundant architecture. Once that was complete, customers were able to log in to xMatters without issue. The connectivity errors cleared without xMatters intervention after the load was removed from the impacted services. ### What are we doing to prevent it from happening again? Once mitigated, the connection issue was resolved. 
It is expected that the issue is a one-time occurrence with a very low likelihood of recurring; however, we are taking additional steps to improve the resiliency of the retry logic if a future connection failure occurs. Additional monitoring has been added to alert the team of similar conditions, which will allow for proactive measures to be taken before impacting customers. ### Timeline: **Date & Time PDT** **October 21, 2020 - 09:45** - Some customer instances begin reporting errors **October 22, 2020 - 00:45** - Rolling restart of API Service **October 22, 2020 - 00:50** - Login errors identified, Severity 1 Incident called **October 22, 2020 - 00:53** - Impacted instances routed to redundant architecture **October 22, 2020 - 01:06** - Impact mitigated **October 22, 2020 - 01:27** - Incident verified as resolved If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
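To illustrate what more resilient retry logic for a failed database connection might look like, here is a minimal sketch of retrying with exponential backoff and jitter. It is an assumption-based example, not the xMatters API service's code: the function name, the default attempt count, and the delay values are all hypothetical.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, backing off exponentially with jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                        # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries
    raise ValueError("attempts must be at least 1")
```

Capping the attempts and spreading retries over time limits the backlog of connection requests that a struggling database has to absorb, which is the failure pattern this report describes.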
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Customers may continue to experience some slowness as the incident team continues to implement the fixes for this issue. Events will continue to process; however, some customers may experience delays. We are currently monitoring the situation to ensure the implementation is stable and services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
We are continuing to investigate this issue.
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients in All Regions. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service degradation in All Regions – Android Mobile App"
Last update### What happened? On August 27, 2020, at approximately 12 AM Pacific, some customers reported an issue to xMatters Customer Support where they were unable to log into the latest version of the xMatters mobile app on Android. Customers who installed the 2.26.0 version of the Android app were being logged out of the app, and encountered either a "Host does not exist or is unavailable" or an "Invalid server certificate" error message when attempting to log back in. Customers who had not upgraded their apps to the latest version were not impacted. ### Why did it happen? This issue was caused by a bug introduced in the 2.26.0 version of the Android app. The release build process includes a step that adds code optimization supplied by a third-party vendor. The software points to several public certificate guarantors and is added as part of the final build before the app is released to the Google Play store. In this case, the third-party optimization software was pointing to incorrect guarantors. Although the app development team performed a final set of tests prior to release and after the app was optimized, the optimization code was pointed at development guarantors which were not affected by the bug. The app passed the testing phase and was promoted to public release. ### How did we respond? As soon as customers began reporting issues with logging into the Android app, the xMatters development team began to troubleshoot the issue. After testing, they determined that the issue was not due to localized external factors and decided to roll back the app to the previous version \(2.25.2\), which would appear in the store as version 2.26.1. They recompiled the app and uploaded it to the Google Play store, simultaneously turning off automatic upgrade prompts to help prevent anyone else from upgrading to version 2.26.0. 
Although a working version of the app was uploaded to the store at 3:10 AM, and the testing and verification process from Google typically takes less than two hours, the process has been slowed recently due to the impact of COVID-19. Due to the delay in the verification process, xMatters Customer Support posted a status page update to inform customers of the issue and to help prevent any other updates before the fixed version was available. The updated 2.26.1 version of the app was released to the store at 10:15 AM. ### What are we doing to prevent it from happening again? To help prevent the issue from reoccurring, the development team has implemented a second round of QA in the release process. Due to the difference between test guarantors and production guarantors, testers outside of the standard QA process will perform a final functionality test before applications are promoted to production. This will account for any future differences between production and test environments. ### Timeline: **August 26, 2020** 5:40 PM - xMatters Android app version 2.26.0 released to Google Play store **August 26, 2020** 11:40 PM - Google publishes app to Google Play store **August 27, 2020** 2:00 AM - First reports of login issues from customers 2:20 AM - xMatters Engineering identifies an issue with the app and initiates rollback procedures 3:08 AM - Rollback of app to previous version \(2.25.2\), published as 2.26.1 in the Google Play store 3:15 AM - App auto update notification disabled 8:49 AM - xMatters publishes status page update 10:29 AM - Version 2.26.1 verified and published to the Google Play store If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The fix is now available in all Google Play stores. If you are experiencing login issues, please reinstall the xMatters Android app through the store. Some customers may need to uninstall the xMatters app before installing the fix.
Some Google Play stores have now posted version 2.25.2. If you recently installed version 2.26.0, please go to the store and install version 2.25.2. If you did not install version 2.26.0 or are not experiencing login issues, you do not need to take any action.
The latest version of the xMatters Android App (2.26.0) has impacted the ability to log in for some customers. We've already identified the issue and rolled back the app version while we work on a fix. The rollback has been submitted to Android, but it can take some time for it to reach all stores around the world. We have also turned off the prompts for in-app update and recommend that customers do not upgrade to version 2.26.0 at this time. If you have already updated your app to version 2.26.0, we recommend that you roll back to version 2.25.2 once it's available in the Google Play store.
Report: "Issue Discovered - Service disruption in All Regions – Web User Interface"
Last update**What happened?** On July 13, 2020, at approximately 16:26 PDT, some customers reported issues with logging into the xMatters web user interface \(web UI\). When attempting to log in via SSO/SAML, some customers received one of two error messages: _"We've run into a problem while retrieving your data. Refresh the page to try again or, if the problem persists, contact xMatters Client Assistance"_ or _"Server Internal Error on URI /sp/SSO.saml2; error: RFC6265 Cookie values may not contain character..."._ The issue only affected customers using SSO/SAML to log in to the web UI. There was no impact to event processing or notification delivery, and customers using the native xMatters login did not experience the issue. **Why did it happen?** The investigation discovered that an update to the service which supports the xMatters Web UI impacted authentication using SSO/SAML. The update enforced compliance with RFC 6265, which defines how web cookies are used and no longer supports version-1 cookies. Since many SSO systems rely on cookies, this resulted in some cookies being read as invalid, causing the login error. **How did we respond?** As soon as Customer Support verified that there were login issues via SSO, they escalated the issue to a Severity-1 incident and initiated the internal major incident management process. When the incident team identified that the issue was related to the latest release of the web UI service, they rolled back to the previous version of the service. Once the rollback was complete, the incident team and customers verified that SSO login functioned as expected. **What are we doing to prevent it from happening again?** The xMatters Engineering team is working to ensure backwards compatibility with version-1 cookies before the next release of the web UI service. The team is also updating the internal QA processes to add a verification of SSO/SAML for all cookie versions. 
**Timeline:** July 12, 2020 15:50 Release of new web UI version 16:26 Customers begin to report errors when logging in with SAML/SSO 16:43 Issue linked to web UI release; Severity-1 incident initiated 16:50 web UI service rollback initiated 16:59 Rollback complete 17:01 Customers confirm ability to log in 17:05 Incident resolved
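For readers unfamiliar with the restriction behind this error, the sketch below checks a string against the `cookie-value` grammar in RFC 6265, section 4.1.1. It is an illustration only and is not taken from the web UI service; the function name `is_rfc6265_value` is hypothetical. Under this grammar, characters such as spaces and commas, which older version-1 style cookies could carry inside quoted strings, are rejected.

```python
import re

# cookie-octet per RFC 6265 section 4.1.1: printable US-ASCII except
# whitespace, double quote, comma, semicolon, and backslash.
_COOKIE_OCTETS = r"[\x21\x23-\x2B\x2D-\x3A\x3C-\x5B\x5D-\x7E]*"

# cookie-value is either bare octets or octets wrapped in one pair of quotes.
_COOKIE_VALUE = re.compile(rf'(?:{_COOKIE_OCTETS}|"{_COOKIE_OCTETS}")')


def is_rfc6265_value(value: str) -> bool:
    """Return True if value matches the RFC 6265 cookie-value grammar."""
    return _COOKIE_VALUE.fullmatch(value) is not None
```

A QA check along these lines could be run against the cookies issued by each supported SSO provider before a release, which is in the spirit of the verification step the report describes.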
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
xMatters monitoring tools have identified a potential issue with the xMatters Web User Interface for some clients in All Regions. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Service disruption in North American Region"
Last update**What happened?** On January 12, 2020, at approximately 7:05 AM PST, xMatters internal monitoring tools and customer reports alerted Customer Support to an issue with event processing and delivery in the North America region. While the incident was in progress, some North American customers may have experienced delays in event processing and notification delivery, including a window where notifications were not being generated for active events. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, save for a brief period during one of the remediation procedures. **Why did it happen?** This issue occurred when a process responsible for inter-service communication encountered resource issues. The issue was traced to an earlier change which increased the internal process's retention period to improve xMatters' ability to recover data. Resources for the process were sized in terms of processing, disk and memory, but a setting that controls the number of open files to be retained was not sized appropriately. **How did we respond?** As soon as the monitoring tools alerted to the error, Customer Support initiated the Severity-1 process and engaged the incident response teams. The teams began to troubleshoot and restarted the affected process. When the restart failed to recover properly, the team decided to promote affected customers to the secondary site to ensure reliable processing of events and notifications. Once the teams initiated the promotion at 7:35 AM PST, notifications began processing properly for most customers. The promotion procedures were completed at 7:50 AM PST, and the majority of notifications continued processing without issue. The teams continued troubleshooting and identified and resolved the underlying issue on the primary site by adjusting the setting that controls the number of open files. 
Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational. **What are we doing to prevent it from happening again?** To resolve this issue permanently, the xMatters teams have adjusted the setting that governs the number of open files for the process. **Timeline: Date/Time \(PST\)** 2020-01-12 7:05 AM - Monitoring alerts to incident with notification processing; Severity 1 incident declared 7:21 AM Rolling restart completed 7:24 AM Errors do not clear, notifications still impacted 7:34 AM Promotion to secondary site begins 7:50 AM Promotion to secondary site completed, notifications begin to process as expected 7:55 AM Team begins to monitor the mitigation 8:20 AM Incident resolved If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
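The sizing gap described above, where processing, disk and memory were accounted for but the open-file setting was not, can be checked with a few lines of code. The sketch below is an assumption-based illustration for Unix-like systems, not the check xMatters uses; the function names and the 20% headroom figure are hypothetical.

```python
import resource


def nofile_soft_limit() -> int:
    """Soft limit on open file descriptors for this process (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def retention_fits(expected_open_files: int, soft_limit: int, headroom: float = 0.2) -> bool:
    """True if the expected open-file count leaves the given headroom under the limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured, so any count fits
    return expected_open_files <= soft_limit * (1.0 - headroom)
```

For example, `retention_fits(900, nofile_soft_limit())` would flag a process that expects to hold 900 files open under a limit of 1024, prompting a limit increase before a longer retention period is rolled out.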
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Customers may continue to experience some slowness as the incident team continues to implement the fixes for this issue. Events will continue to process; however, some customers may experience delays. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
We are continuing to investigate this issue.
xMatters monitoring tools have identified a potential issue with the xMatters Integration Platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available.
Report: "Issue Discovered - Service disruption in North American Region – Integration Platform"
Last update**What happened?** On January 9, 2020, at approximately 3:50 PM PST, xMatters internal monitoring tools and customer reports alerted Customer Support to an issue with event processing through the xMatters Integration Builder in the North America region. While the incident was in progress, some North American customers may have experienced intermittent delays in integration processing, including a 15-minute window where integrations were not accepting or processing events. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, save for a brief period during one of the remediation procedures. **Why did it happen?** This issue occurred when a node in the queuing service cluster experienced high levels of load and unexpectedly disconnected from its cluster. This caused execution of integrations on that node to be delayed. **How did we respond?** When the queuing errors were discovered, Customer Support initiated the Severity-1 process and engaged the incident response teams. The teams began to troubleshoot and restarted the affected node. The node failed to recover properly after the restart and the team decided to promote affected customers to the secondary site to ensure reliable processing of integrations. They initiated the promotion at 4:34 PM PST and completed the process at 4:57 PM PST. The majority of customers were then able to process notifications without issue. The teams continued troubleshooting and resolved the underlying issue on the primary site by increasing the available resources for all nodes in the queuing service and then performing a full restart of the queuing service. Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational. 
**What are we doing to prevent it from happening again?** To prevent this issue from reoccurring, the xMatters teams have provided the service with a significant increase in computing resources. The team has also implemented more robust monitoring that will alert the service teams if a node disconnects from the cluster. Through further investigation and testing, the teams have also identified a method of recovering nodes faster and more reliably. **Timeline:** 2020-01-09 3:50 PM Monitoring alerts to incident with notification processing; Severity 1 incident declared 3:59 PM Rolling restart completed 4:01 PM Errors clear, performance still impacted 4:27 PM Errors return, more intervention required 4:34 PM Promotion to secondary site begins 4:57 PM Promotion to secondary site completed, notifications begin to process as expected 5:20 PM Services restored, team continued monitoring 6:35 PM Incident resolved If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
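
The node-disconnect monitoring described in this report amounts to comparing the nodes a cluster is expected to contain with the nodes it currently reports. The following is a minimal sketch of that comparison, not the xMatters implementation. The function name `findDisconnectedNodes` and the node names are hypothetical, and how membership is queried depends on the queuing service, which the report does not name.

```typescript
// Hypothetical helper: returns the expected nodes that the cluster no longer reports.
// `expected` is the configured node list; `reported` is what the cluster says is connected.
function findDisconnectedNodes(expected: string[], reported: string[]): string[] {
  return expected.filter((node) => reported.indexOf(node) === -1);
}

// Example with made-up node names: one node has dropped out of the cluster.
const disconnected = findDisconnectedNodes(
  ["queue-1", "queue-2", "queue-3"],
  ["queue-1", "queue-3"],
);
if (disconnected.length > 0) {
  // A real check would page the service team here instead of logging.
  console.log("Nodes missing from cluster: " + disconnected.join(", "));
}
```

A check like this only helps if it runs often enough to catch a disconnect before queued work backs up, so the polling interval matters as much as the comparison itself.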
The incident has now been resolved. Thank you for your patience while we addressed this matter. A root cause will be available after post-mortem activities have been completed.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
We continue to monitor. Some customers may see delays in event processing.
We are continuing to monitor for any further issues.
Issue has been mitigated. Performance may be degraded.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
We are seeing intermittent delays with Integration event processing at this time.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with the xMatters Integration Platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North American and European Region - Multiple Services"
Last update

### What happened?

On November 15, 2019, at approximately 6:30 AM PST, xMatters internal monitoring systems alerted the Engineering teams to an issue with a service in the North America region. While the incident was in progress, North American customers may have experienced intermittent delays in notification delivery, including a 15-minute window where notifications were not processing for some customers. No other regions were affected, and the web user interface remained accessible and responsive throughout the incident, save for a brief period during one of the remediation procedures.

### Why did it happen?

This issue occurred when the services responsible for processing events experienced a sudden spike in usage, resulting in an unusually high load. Although the Engineering teams immediately initiated standard remediation practices for the notification delivery service, a dependent service used for queuing notifications began to experience instability approximately 10 minutes after the initial remediation began. The instability in the queuing service caused it to intermittently reject incoming connection attempts from upstream services.

### How did we respond?

When the queuing errors were discovered, xMatters initiated the major incident management process and gathered the incident response team. The team began to troubleshoot and performed a rolling recycle of the affected services. When the recycle failed to address the issue, the team decided to promote affected customers to the secondary site. They initiated the promotion at 7:57 AM PST and completed the process at 8:34 AM PST. The majority of customers were then able to process notifications without issue. The teams continued troubleshooting and resolved the underlying issue on the primary site by performing a full restart of the queuing service. Once testing was completed, all customers were promoted back to the primary site and all services were confirmed as operational.

### What are we doing to prevent it from happening again?

While attempting to reproduce this issue in our test environments, we have identified a number of potential improvements and optimizations within the configuration and usage of the queuing service. To prevent this issue from reoccurring, the xMatters Engineering teams are working to implement all of these changes. The teams are still investigating the source of the initial resource spike.

### Timeline: November 15, 2019

* **6:30 AM** xMatters internal monitoring tools alert Engineering to unusual load on notification processing nodes
* **6:45 AM** Engineering performs rolling recycle of nodes and discovers queuing errors
* **7:25 AM** Major incident raised and internal major incident management process initiated
* **7:29 AM** Bulletin posted to xMatters status page: [https://status.xmatters.com/incidents/qy9l66599jnf](https://status.xmatters.com/incidents/qy9l66599jnf)
* **7:43 AM** Promotion of services to secondary site begins
* **8:34 AM** Promotion is complete
* **9:15 AM** Issue is resolved on primary
* **9:22 AM** Promotion of service to primary begins
* **9:44 AM** Promotion to primary complete, all services resume normal operations
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter. We will provide a full root cause analysis once the post-mortem activities have been completed.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
Some customers in North America should start to see notifications delivered again without any delays. We are continuing to work on restoring services for the remaining customers. We will continue to post updates here as they become available.
European region customers should now be seeing notifications delivered without any delay. We are continuing to work on resolving delays for North American customers. We will continue to provide updates as they become available.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the North America region. Some clients may notice delays in receiving notifications. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Service disruption in North American Region"
Last update

## Details

### What happened?

On Thursday, November 7, 2019, at approximately 5:20 AM PST, the xMatters network monitoring systems alerted the Customer Support teams to an issue with the On-Demand services within North America. Some users may have experienced intermittent access to the xMatters On-Demand web user interface, and a delay or rejection when injecting events into xMatters.

### Why did it happen?

This incident was caused by a single database within one of the database clusters consuming a disproportionate amount of resources. This limited the ability of other databases in the cluster to accept new requests, resulting in intermittent access to the web user interface.

### How did we respond?

As soon as the internal monitoring systems alerted to an issue with customer instances, Customer Support confirmed the issue and launched the internal major incident management process. The incident response teams immediately began their investigation and identified a database cluster that was consuming processing resources at an exceptionally high rate. The teams determined that the issue was confined to a specific database in the cluster that was causing latency and preventing other resources from serving their requests. The teams concluded that the best way to remedy the issue quickly was to promote a standby database cluster to become the new primary. The recovery process and redundant service architecture restored services, and system performance resumed normal operations.

### What are we doing to prevent it from happening again?

To prevent this issue from reoccurring, the Engineering teams will be taking the following steps:

1. Resize the database cluster to accommodate potential usage spikes and to increase tolerance for similar issues. \(Completed\)
2. Rebalance the database cluster to increase bandwidth for all impacted customers. \(Scheduled for completion on or before November 14, 2019\)
3. Increase monitoring thresholds to identify spikes in usage during peak periods. \(Completed\)

xMatters strives to provide high availability to our clients and we recognize that reliability of services is of utmost importance to our customers and their businesses. xMatters is committed to improving our resiliency and investing in the tools and processes required to prevent and minimize service disruptions.

### Timeline:

| **Date/Time \(PST\)** | **Description** |
| --- | --- |
| 2019-11-07 05:20 AM | xMatters monitoring tools alert Customer Support to intermittent access to some client instances in North America. |
| 05:45 AM | Severity-1 issue raised; internal major incident management process initiated. |
| 06:19 AM | Bulletin posted to xMatters status page: [https://status.xmatters.com/incidents/xrq45x6g0zpp](https://status.xmatters.com/incidents/xrq45x6g0zpp) |
| 06:43 AM | Incident team identifies issue as related to a database within the cluster. |
| 07:00 AM | Promotion of secondary database cluster begins. |
| 07:09 AM | All services are restored. |

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
At approximately 5:56am PDT, we experienced an issue with xMatters that prevented users in North America from accessing the web user interface. Services were restored at approximately 7:12am PDT.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
At approximately 5:56am PDT, our internal monitoring detected an issue affecting some customers in North America. Some customers may be unable to log into their xMatters instances. We are currently investigating this issue.
Report: "Issue Discovered - Service disruption in North American Region - Multiple Services"
Last update

### What happened?

On October 17, 2019 at approximately 5:35 PM PDT, the xMatters monitoring systems alerted the Customer Support team to a potential issue with an xMatters service within the North America region. While this incident was in progress, all North American customers may have experienced delays or a rejection when injecting an event into xMatters, and delays or failures in notification delivery. No other regions were affected, and the web user interface remained accessible and responsive through the incident.

### Why did it happen?

This issue occurred during routine scheduled maintenance involving security enhancements to an xMatters service that is responsible for delivering notifications. The maintenance was near completion when the service experienced an unexpected error that caused the entire service cluster to fail, resulting in cascading failures to other dependent services. This maintenance was completed across other regions prior to North America without any issues, delays, or downtime.

### How did we respond?

As soon as the xMatters monitoring tools detected connectivity issues, the xMatters Customer Support team escalated the issue to a Severity-1 and initiated the internal major incident management process. While the incident response teams began simultaneously investigating the underlying cause and working to restore services for clients, Customer Support posted a notice to the xMatters status page to inform clients about the incident.

The teams immediately identified that the issue was related to the scheduled security enhancements for an xMatters service responsible for delivering notifications. The teams began troubleshooting the issue and identified a failure that occurred during the maintenance with the last remaining service node. This resulted in a cascade of failures, leading to the service disruption.

The teams decided the fastest approach to restoring operations would be to perform a rolling restart of the affected notification service. They also determined that promoting services to another region would be a "last resort" option, as the unique circumstances of this failure could potentially cause a longer delay in restoration of services. After the teams completed the rolling restart, they determined the service showed no significant improvement. The teams continued to perform additional troubleshooting steps in an attempt to alleviate the issue, but notification queues were continuing to increase and all attempts to restore service were unsuccessful.

With guidance from the xMatters executive, the teams decided to start preparing to promote all services and client instances to an alternate data center in North America. To rule out the possibility that the issue was related to the underlying hardware, the teams performed a rolling restart of each virtual machine in the notification service cluster in an attempt to reschedule them to different hardware. Just before the promotion of services was about to begin, the teams confirmed that the rolling restart was successful, and the system was processing events and delivering notifications.

With service apparently being restored, the teams held back the promotion of services to an alternate data center until they confirmed that all queues were clearing. Due to the duration of the service disruption, the teams waited for the backlog of notifications to clear before starting up other dependent services. They then confirmed that all services were restored and normal operations had resumed.

### What are we doing to prevent it from happening again?

To prevent this issue from occurring again, xMatters has committed to the following action items:

1. Increase monitoring thresholds to help identify any latency with notification delivery earlier in the process. \(In progress\)
2. Review the schedule for promotion of client instances to identify specific guidelines around acceptable delays and data retention during incidents. \(In progress\)
3. Investigate current architecture and cluster configuration to determine any potential avenues towards improving overall system resiliency. \(In progress\)

In addition, the Engineering and Operations teams are conducting a full post-mortem of the incident to help identify any potential improvements to testing suites, playbooks, and other collateral used to help isolate and identify root causes during and after an incident.

### Timeline:

| **Date/Time PDT** | **Description** |
| --- | --- |
| 2019-10-16 5:00PM | xMatters Engineering begins applying security enhancements to the xMatters notification service |
| 2019-10-16 5:35PM | xMatters monitoring tools alert Customer Support to possible latency issues for clients in North America |
| 2019-10-16 6:20PM | Severity-1 issue raised, internal major incident management process initiated |
| 2019-10-16 6:35PM | Bulletin posted to xMatters status page: [https://status.xmatters.com/incidents/096sszlgyz0n](https://status.xmatters.com/incidents/096sszlgyz0n) |
| 2019-10-16 6:40PM | Rolling restart of notification service completed |
| 2019-10-16 7:00PM | Additional troubleshooting steps begin |
| 2019-10-16 8:50PM | Rolling restart of notification server cluster begins |
| 2019-10-16 9:15PM | Events begin processing and notifications start being delivered |
| 2019-10-16 9:23PM | Notification server cluster restart is completed |
| 2019-10-16 9:30PM | Remaining dependent services are restarted |
| 2019-10-16 10:45PM | Full restoration completed; services resume normal operations |

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored. Please note that there may be a backlog of notifications to process for some customers.
Notifications and events are beginning to process at this time. The backlog of notifications will begin to process shortly.
Recovery efforts are progressing - some notifications are processing. ETA for full recovery pending.
The xMatters incident team is taking corrective action at this time. ETA is pending at this time.
We are continuing to work on a solution to this issue.
We are continuing to troubleshoot. Update in the next 15 minutes.
Some events are processing, teams are still working on resolution. Troubleshooting continues.
Engineering teams are continuing to troubleshoot the issue. Please watch this page for updates.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North American Region – Integration Platform"
Last update

### What happened?

On September 9, 2019, at approximately 10:20 AM Pacific, xMatters monitoring reported an issue to xMatters Customer Support where some customer integrations became unresponsive and stopped processing events. Some customers may have seen integration logs showing a number of errors related to script failures, and some events may not have been properly processed.

### Why did it happen?

This incident was caused by integration scripts containing code that was not fully compliant JavaScript. During regularly scheduled maintenance on the morning of September 9, xMatters released a new version of the Integration Builder service to enable a faster, more efficient scripting engine. This change required that the Integration Builder be updated to Java Development Kit \(JDK\) version 11. The new scripting engine, GraalJS, is native to JavaScript and requires that all code be fully JavaScript compliant. The previous scripting engine, Nashorn, accepted some Java String methods that are not part of JavaScript. While the upgrade process did enable backwards compatibility with the previous version of the JDK, the compatibility features did not cover the inconsistency with non-compliant JavaScript. As a result, some integration scripts that included non-compliant code returned errors and prevented the scripts from executing correctly.

### How did we respond?

When the internal monitoring tools flagged errors in integration scripts, xMatters Customer Support began their investigation. As soon as customers reported issues with their integrations, Customer Support escalated the issue to a Severity-1 Incident and launched the internal major incident management process. They were able to quickly determine that the issue was related to the release of the JDK 11 upgrade, and initiated an immediate code rollback. Once the rollback was complete, the teams confirmed the issue was no longer occurring and that all services had been restored.

### What are we doing to prevent it from happening again?

The xMatters Engineering teams have examined the errors from the Integration Builder logs, and isolated some differences between the two versions of the JDK that were causing the issue. Specifically, they were able to identify three Java String methods that the previous iteration of the scripting engine could process that were not being handled by GraalJS. The teams added handling to the Integration Builder service that will allow the new scripting engine to process the Java String methods without breaking script functionality. They tested the changes on internal systems and confirmed that while the Integration Builder logs will mark the errors for easy identification, the scripts will continue to execute without any developer or integrator intervention.

Customer Support posted a notice about the upcoming availability of the new scripting engine on the support site at [https://support.xmatters.com/hc/en-us/articles/360033568811](https://support.xmatters.com/hc/en-us/articles/360033568811) and rescheduled the deployment of the JDK upgrade for Tuesday, September 17. In addition, they updated the xMatters Status page \([status.xmatters.com](http://status.xmatters.com)\) with a scheduled maintenance notice about the change.

While the Engineering teams are confident that even customers with non-compliant JavaScript will not see any issues arise from the deployment of the JDK 11 update, they were only able to target known errors. It is possible that integration scripts containing non-compliant code for which the Engineering team has not added handling may result in a similar error. We highly recommend that all customers using custom Integration Builder scripts review their integrations and ensure they are using only fully compliant, standard JavaScript code.

### Timeline: September 9, 2019

* 10:00 AM xMatters deploys new version of the Integration Builder \(JDK 11\)
* 10:20 AM Internal monitoring flags integration errors along with customer reports of integration errors
* 10:30 AM Customer Support launches Severity-1 Incident
* 10:35 AM Issue discovered - rollback to previous version initiated
* 10:45 AM Verification of resolution and return to normal operation
* 10:55 AM SEV-1 Issue closed
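
For customers reviewing their own scripts, the kind of code this report describes can be illustrated with a short sketch. Nashorn accepted calls to Java String methods on JavaScript strings, whereas a standard JavaScript engine does not define those methods. The report does not name the three methods involved, so the method list below (`equalsIgnoreCase`, `contains`, `isEmpty`) is an assumption chosen for illustration, and `findJavaStringCalls` is a hypothetical helper rather than part of any xMatters tooling.

```typescript
// Hypothetical mapping: Java-only String method -> a standard JavaScript alternative.
// The method names are illustrative assumptions; the incident report does not list them.
const JAVA_ONLY_STRING_METHODS: Record<string, string> = {
  equalsIgnoreCase: "a.toLowerCase() === b.toLowerCase()",
  contains: "a.indexOf(b) !== -1",
  isEmpty: "a.length === 0",
};

interface Finding {
  line: number;       // 1-based line number within the script
  method: string;     // Java-only method name that was called
  suggestion: string; // standard JavaScript replacement to consider
}

// Naive text scan: reports every `.method(` call whose name appears in the map.
// It cannot tell whether the receiver is actually a string, so results are review hints only.
function findJavaStringCalls(source: string): Finding[] {
  const findings: Finding[] = [];
  const lines = source.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const text = lines[i] ?? "";
    const pattern = /\.([A-Za-z_$][\w$]*)\s*\(/g; // fresh regex per line resets lastIndex
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const method = match[1];
      if (method !== undefined &&
          Object.prototype.hasOwnProperty.call(JAVA_ONLY_STRING_METHODS, method)) {
        findings.push({
          line: i + 1,
          method,
          suggestion: JAVA_ONLY_STRING_METHODS[method] as string,
        });
      }
    }
  }
  return findings;
}
```

Running the helper over a script that contains `status.equalsIgnoreCase("open")` would return one finding pointing at that line, together with the suggested standard comparison. Because the match is purely textual, a call such as `list.contains(x)` on a non-string object would also be flagged and needs a manual check.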
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with the xMatters Integration Platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in Asia Pacific Region - Multiple Services"
Last update

#### What happened?

On July 25, 2019 at approximately 10:04 AM PDT, the xMatters internal monitoring systems alerted Customer Support to an issue that was resulting in a "Server Internal Error" message displaying when accessing an xMatters instance. Clients in the Asia-Pacific region may also have seen this message when attempting to access their instances, or experienced difficulty in accessing xMatters services. All notifications continued to process as expected during the incident.

#### Why did it happen?

The incident occurred during scheduled database maintenance. During the upgrade activity, the standby databases are upgraded and promoted to become a master database. This process typically takes seconds to complete and is not customer impacting. During this upgrade, there was an unexpected and undetected delay in synchronous replication to other services. This caused the process to wait until data was in sync.

#### How did we respond?

As soon as the internal monitoring systems sent the alert about an issue impacting client instances, Customer Support confirmed the issue and launched a Severity-1 incident. The incident response teams immediately began investigating and identified that the maintenance process was waiting for synchronous replication to complete. This resulted in customers receiving an error message when attempting to access their systems. To resolve the issue, the incident team performed a manual intervention, and all services were restored.

#### What are we doing to prevent it from happening again?

We are updating our upgrade process to detect delays in synchronous replication and to postpone an update if a delay exists. This event has a very low likelihood of recurrence; however, the teams are continuing their testing and are replicating the issue to determine if any additional changes are required. The teams are also reviewing the maintenance process and monitoring settings to identify any potential improvements.

#### Timeline: July 25, 2019. All times PDT

* 10:04 AM - Monitoring alerts Customer Support to Server Internal Error
* 10:06 AM - System auto recovers
* 10:09 AM - All services restored
* 10:11 AM - Incident closed
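
The planned safeguard, detecting a delay in synchronous replication and postponing the update when one exists, can be sketched as a check that runs before a standby is promoted. This is a minimal sketch under stated assumptions, not the xMatters implementation: the `ReplicaStatus` shape, the `maxLagMs` threshold, and the `decidePromotion` function are all hypothetical, and how replication lag is measured depends on the database in use.

```typescript
interface ReplicaStatus {
  name: string;
  lagMs: number; // assumed metric: how far the replica trails the primary, in milliseconds
}

type PromotionDecision =
  | { action: "proceed" }
  | { action: "postpone"; lagging: string[] };

// Postpone the upgrade if any replica is further behind than the allowed threshold.
function decidePromotion(replicas: ReplicaStatus[], maxLagMs: number): PromotionDecision {
  const lagging = replicas
    .filter((replica) => replica.lagMs > maxLagMs)
    .map((replica) => replica.name);
  return lagging.length === 0
    ? { action: "proceed" }
    : { action: "postpone", lagging };
}
```

Returning the names of the lagging replicas, rather than a bare yes or no, gives the maintenance operator something concrete to investigate before rescheduling.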
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
A fix has been implemented and we are monitoring the results.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North American Region – Integration Builder"
Last update

#### What happened?

On July 28, 2019, at approximately 5:09 AM PDT, the xMatters internal monitoring systems alerted Customer Support to an issue with potentially unresponsive integrations in North America. Shortly afterwards, some customers reported that they were noticing delays in notification delivery. The delays affected only notifications generated through the xMatters Integration Builder; manually entered notifications were processing normally.

#### Why did it happen?

This incident was caused by an error within the cloud infrastructure-as-a-service provider hosting the xMatters On-Demand service. The error caused instability within a process responsible for allocating Integration Builder resources to incoming event requests. During this brief period, the Integration Builder was accepting notification requests but not processing outbound notifications, which resulted in the delays experienced by some customers. Due to the distributed, redundant architecture of the On-Demand service, the issue was extremely localized, and only impacted service in a limited geographical region.

#### How did we respond?

As soon as the internal monitoring systems alerted to an issue with client instances, Customer Support confirmed the issue and launched a Severity-1 incident. The incident response teams immediately began investigating and identified the impacted process along with corresponding events reported by the cloud provider. They initiated a reset of the affected components and confirmed that the process was allocating resources as expected. The teams confirmed that all services had been restored and continued to monitor the system while gathering further data around the incident.

#### What are we doing to prevent it from happening again?

This issue was resolved as soon as the affected services were reset. The problem has not reoccurred, and the system continues to operate at optimum performance levels. The xMatters Engineering teams have completed an investigation into the issue and have confirmed that there were no code changes or other updates to the affected service that could have led to this incident.

While there are no potential changes to the impacted service or supporting processes, the teams are engaged in designing and implementing additional monitoring metrics around the consumption of these resources to ensure that allocation does not fluctuate outside normal operating parameters. These improvements will allow the system to self-heal in the event of any similar infrastructure-related issues. Until these changes can be implemented, the teams have configured additional monitoring for the system that will allow the team responsible for the service to respond to and mitigate fluctuations before they impact any customers.

#### Timeline: July 29, 2019 - All times are in PDT

* 05:06 AM Monitoring alerts Customer Support of slow notification processing
* 05:09 AM Severity-1 Incident called
* 05:16 AM Incident team gathers
* 05:25 AM Issue identified
* 05:29 AM Services restored
* 05:46 AM Incident is resolved
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with the xMatters Integration Builder platform for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North American Region - Multiple Services"
Last update#### What happened? On July 17, 2019, at approximately 1:45 PM PDT, the xMatters internal monitoring systems alerted Customer Support to an issue with potentially unresponsive customer instances. Shortly afterwards, some customers reported that they were unable to reach their xMatters instances or were encountering a 503 error when attempting to log in to the xMatters web user interface. Some clients may also have noticed delays in notification delivery for a very brief period \(less than 5 minutes\). #### Why did it happen? This incident occurred when, during an active event, a client used the web user interface to delete a very large group from their instance while that group was being targeted for notification. The deletion process became a long-running database request which rapidly consumed all available processing resources. This blocked other processes on the database cluster while the service waited for the request to complete, causing instances using the same cluster to become unresponsive. #### How did we respond? As soon as the internal monitoring systems alerted to an issue with client instances, Customer Support confirmed the issue and launched a Severity-1 incident. The incident response teams immediately began their investigation, even as the affected services began to recover automatically. The system records and reports showed that a single database cluster had been consuming processing resources at an exceptionally high rate. The teams were able to trace the problem to a customer-initiated deletion request during an active event that caused a brief database lock. The automatic recovery process and redundant service architecture restored service quickly, and once the client's active event completed, system performance resumed normal levels. All services were restored, though the teams continued to investigate the root cause while manually clearing any remaining deadlocks. #### What are we doing to prevent it from happening again? 
The incident in question was quickly mitigated by the redundant service architecture and automated recovery capabilities of the xMatters On-Demand service, and all services have been restored. The teams have confirmed that all affected database clusters are operating at optimum performance levels and there are no remaining deadlocks. To determine the best method of preventing similar issues, the Engineering teams responsible for the affected services and database performance are currently investigating this issue and reproducing the problem on internal testing systems. Once they have completed a full evaluation of the conditions encountered during this incident, they will implement any necessary changes to the service via the internal development and testing procedures. While this process is underway, the Customer Support and Engineering teams have implemented additional monitoring checks to notify the appropriate resources about any potential deadlocks so they can respond before any customers are impacted. #### Timeline: July 17, 2019: * 1:45 PM PDT - Monitoring alerts Customer Support to unresponsive customer instances * 1:47 PM - Severity-1 Incident called * 1:50 PM - Incident team identifies high CPU utilization on a database cluster * 1:50 PM - Customer instances begin to respond * 2:00 PM - Issue identified; all clusters cleared of deadlocks * 2:20 PM - Incident resolved
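The report does not describe how the additional monitoring checks are implemented. As a minimal sketch only, a check of this kind might flag any request that has run longer than a threshold while other sessions wait on it. The `ActiveRequest` record, the `find_blocking_requests` function, the 60-second threshold, and the sample data are all hypothetical and are not taken from the xMatters service.

```python
from dataclasses import dataclass


@dataclass
class ActiveRequest:
    """Hypothetical snapshot of one request running on a database cluster."""
    request_id: str
    running_seconds: float
    blocked_sessions: int  # other sessions waiting on locks this request holds


def find_blocking_requests(requests, max_seconds=60.0):
    """Return IDs of requests that have run past max_seconds while blocking others."""
    return [
        r.request_id
        for r in requests
        if r.running_seconds > max_seconds and r.blocked_sessions > 0
    ]


# Illustrative data only; none of these values come from the incident.
snapshot = [
    ActiveRequest("delete-group", running_seconds=240.0, blocked_sessions=35),
    ActiveRequest("nightly-report", running_seconds=300.0, blocked_sessions=0),
    ActiveRequest("login-lookup", running_seconds=0.2, blocked_sessions=0),
]
flagged = find_blocking_requests(snapshot)
print(flagged)
```

Requiring both conditions keeps a slow but harmless request (the report job in the sample data) from paging anyone, while a long-running request that holds up other sessions is surfaced.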
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the North America region. We are currently investigating the issue and will update as information becomes available. Please see incident details for specific services impacted. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "North America Service Disruption - Notification Delivery"
Last update#### **What happened?** On July 11, 2019, at approximately 2:18 PM PDT, xMatters monitoring alerted Customer Support to possible delays in notification delivery within the North American region. While the issue was being addressed, some users may have experienced delays when receiving SMS and voice notifications, or when injecting events for some integrations. While the system continued to process events and notifications, some events were delayed or did not complete properly and required manual termination. #### Why did it happen? This issue was caused when a backend queuing service stopped unexpectedly, resulting in a large backlog of events waiting to be processed which in turn prevented the services responsible for event processing from connecting to the queuing process. While the issue did not impact all of the available queues, the remaining queues took longer to process events. #### How did we respond? As soon as the monitoring tools alerted to an issue with notification delivery, Customer Support began troubleshooting, connected with subject matter experts to assist, and created a Severity-1 incident. The incident response teams discovered an issue with a backend service responsible for delivering notifications within the North American region. Once the teams identified the impacted services, they began a rolling restart of the event processing services. When the restart had no effect, the teams began a rolling restart of the queuing process. This restart had the desired effect, and the system began processing events and clearing the backlog. Once the backlog had cleared, the teams confirmed that all services had been restored. #### What are we doing to prevent it from happening again? To prevent this issue from recurring, we have identified the following enhancements to the queuing process. 
Implementing these enhancements will include additional capacity to build further redundancy: * Increase queue process cluster size * Allocate additional memory and processing resources to the queue node The end result of these updates will be to add more capacity for queue processing in the event of another single queue incident. These changes will be implemented as soon as development and testing procedures are complete. #### Timeline: * July 11, 2019 - 2:11 PM PDT - Monitoring alerts Customer Support of queuing delays * 2:18 PM - Severity-1 Incident called * 2:30 PM - Restarted event processing nodes * 2:43 PM - Restarted queuing process * 2:50 PM - Backlog begins to clear * 3:17 PM - Services begin to restore * 3:29 PM - Second reset of queue process * 3:41 PM - Queuing restart complete * 3:45 PM - Incident team monitors queues for performance issues * 4:28 PM - All services restored; incident closed
On Thursday July 11, 2019 at approximately 2:15 PM PDT, the xMatters monitoring tools detected an issue with a backend service responsible for delivering notifications within the North American region. Some xMatters clients may have experienced a rejection or delay in notification delivery during this time. The issue was identified and rectified by 3:30 PM PDT and all queued notifications processed and delivered. In some cases, the incident may have left events in a non-terminated state. If you notice events that have not terminated properly, you can terminate them manually. We will provide a full root cause analysis once we have concluded the incident investigation.
Report: "Issue Discovered - Service disruption in North America"
Last update## What happened? On April 6, 2019, at approximately 4:37 AM PDT, the xMatters monitoring systems alerted the Engineering teams to a service disruption with On-Demand services within the North American region. Users may have experienced intermittent access to the user interface, and a delay or rejection when injecting an event into xMatters. ## Why did it happen? This issue was caused by excessive memory consumption by a monitoring service. The monitoring service was buffering metrics for reporting and consumed an excessive amount of memory, causing some database queries to fail. ## How did we respond? As soon as the xMatters network monitoring tools detected unreliable connectivity in the xMatters system, the Client Assistance team launched the internal severity-1 investigation process, which was later upgraded to a major incident, and posted a notice to the xMatters status page. The incident response teams began simultaneously investigating the underlying cause and working to restore services for clients. The teams determined that the fastest way to restore service and cause the least impact to clients would be to perform a manual database failover to a system not experiencing resource exhaustion. Once the promotion process was complete, clients confirmed that all services were restored and functioning as expected. ## What are we doing to prevent it from happening again? To help prevent similar incidents in the future, the xMatters Engineering teams are investigating a potential way to improve their current method of resource monitoring. Any knowledge or information they identify will be added to the relevant playbooks to ensure that it becomes a consistent part of our standard processes. In addition, Engineering teams are working with the service vendor to review the issue and determine what additional actions can be taken to ensure the issue does not reoccur. 
## Timeline: April 6, 2019 4:37 AM - First notification of potential issue with On-Demand services. No client impact at this time 4:47 AM - Investigation begins 5:37 AM - Severity-1 process launched. Issue becomes client impacting 6:20 AM - Cause is identified. Manual database failover performed 6:30 AM - Monitoring service responsible is disabled 6:33 AM - Client impact is mitigated. Teams continue to monitor 6:37 AM - Confirmation of system recovery 6:47 AM - All services restored. If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. Customers may receive an error when trying to access the system. The error is intermittent. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption"
Last update### What happened? On March 27, 2019, at approximately 11:38 PM PDT, some clients reported an issue to xMatters Client Assistance where they were not able to see the correct list of users in the On-Demand web user interface. Users reported that the Users page in the web interface was displaying an incomplete list of users, or was not displaying any users at all. During the investigation and resolution of the issue, additional reports came in that confirmed the issue was impacting only the Users page. Other aspects of the web user interface were not affected, and the On-Demand service continued to accept all incoming events, send notifications, and process responses without interruption. ### Why did it happen? This issue was caused by a software defect introduced in the 5.5.252 release of xMatters On-Demand, which included a change to the way that historical user roles were retrieved and displayed. ### How did we respond? As soon as Client Assistance received reports about an issue with the web user interface, they launched an investigation and began attempting to reproduce the issue. Initial findings seemed to indicate that the problem was limited in scope as internal checkpoints could not reproduce the issue. As further reports came in and clarified the issue and its scope, Client Assistance successfully reproduced the problem and immediately escalated it to a Severity-1 incident, initiating the internal major incident management process. While the incident response teams began working to identify the root cause, Client Assistance posted a notice to the xMatters On-Demand status page. The Engineering teams identified an error in the query used to retrieve user roles, but determined that changing the query in place could have unforeseeable consequences. To mitigate the issue and restore service as safely as possible, the teams decided to roll back the service to the previous release. 
Although the rollback process could take longer, the teams identified it as the safest, most effective solution. The Engineering team immediately began the rollback process while Client Assistance updated affected clients on progress. As soon as the rollback was complete, clients confirmed that all services had been restored. ### What are we doing to prevent it from happening again? The defect introduced in the release was repaired and the release redeployed via hotfix to all production instances later the same day. All clients were successfully updated to the 5.5.252 release and have confirmed that the issue was resolved. As a proactive approach to preventing these types of incidents, the Engineering teams are currently reviewing all user-interface-related incidents from the past year, and identifying any potential enhancements or areas of further improvement. In addition, the Client Assistance team has identified that the notice posted to the xMatters status page was too general, and did not narrowly identify the client impact sufficiently. This may have caused some clients undue stress as the issue affected only the web user interface, and did not impact underlying data, event processing, or notification and response handling. To help prevent similar miscommunications, Client Assistance is reviewing their status page updates and communication practices to ensure that future updates are more focused and better represent the nature of any incidents. 
### Timeline: March 27, 2019 11:38 PM - Client Assistance receives reports of issues displaying users in the web user interface 12:03 AM - Client Assistance begins attempting to reproduce the issue 2:05 AM - Other clients report encountering the issue 3:40 AM - Scope of impact identified; Severity-1 incident initiated 4:02 AM - Incident response teams assemble and begin work to identify cause 5:25 AM - Cause of incident determined 6:53 AM - Rollback process initiated 9:18 AM - Rollback completed; all services confirmed restored If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored. If you receive an error when browsing the web UI, please refresh or restart your browser.
A fix is being implemented at the moment; we'll provide further updates as we get them. The impact is still isolated to the web user interface, specifically the Users list not displaying all users. There are no issues with notifications or accessing your instance.
We have confirmed that the impact is limited to the Web User interface, where full user lists are not available. No other services are impacted.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Some users receiving User Interface errors"
Last update### What happened? On March 21, 2019, at approximately 8:54 AM PDT, some clients began reporting an issue to xMatters Client Assistance where they were encountering a "404" error when attempting to access the On-Demand web user interface. Clients were able to login but could not perform any actions or access any pages due to the error. While the issue prevented clients from being able to use the web user interface to send messages, view event status, or run reports, the system continued to process events as well as all notifications and user responses. ### Why did it happen? This issue was caused by a mismatch in file creation dates that the web server uses to determine which files to serve. The Engineering team created and deployed a hotfix for an issue in the web user interface for a specific release after the artifacts for the subsequent scheduled release had already been built. When that release was deployed to the On-Demand service, the inconsistency in the creation dates for the files on the web server caused the interface to display an error instead of the necessary web pages. ### How did we respond? As soon as clients reported the errors, Client Assistance confirmed the reports and immediately escalated the issue to a Severity-1 incident. They launched the internal major incident management process to engage the incident response teams and posted a notice to the xMatters status page. The incident response teams began investigating and quickly identified the web server artifacts that were causing the date mismatch. To help immediately mitigate the impact and restore access to the web user interface, the teams began rolling back affected clients to the previous known good deployment while the Engineering team began rebuilding the release artifacts. As soon as the rollback was complete, clients reported that they could properly access the web user interface and that all services had been restored. 
The Engineering team completed the rebuild of the release artifacts and successfully redeployed the release later the same day. ### What are we doing to prevent it from happening again? To help prevent similar issues from happening in the future, the Engineering team has added checkpoints to the build and deployment process. These checkpoints test for file creation date mismatches throughout all phases of the rollout and release process. ### Timeline: March 21, 2019 - 8:54 AM - Some clients report 404 errors when using the web user interface 8:55 AM - Client Assistance confirms and replicates the issue 8:56 AM - Client Assistance issues a Severity-1 incident 8:57 AM - Status page notice: [https://status.xmatters.com/incidents/hjhj8sty2g3b](https://status.xmatters.com/incidents/hjhj8sty2g3b) 9:26 AM - Incident team isolates the cause and begins to investigate rollback to last known state 10:00 AM - Rollback initiated 10:07 AM - Rollback confirmed, team begins to monitor for further errors 10:28 AM - Confirmation that all services are restored
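The report does not say how the new build and deployment checkpoints work. As a rough sketch under stated assumptions, one such check could compare the creation date of each file in the release about to be deployed against the copy already on the web server, and stop the rollout if the deployed copy is newer (as happens when a hotfix ships after the next release's artifacts were built). The function name `find_date_mismatches`, the file names, and the dates below are hypothetical.

```python
from datetime import datetime


def find_date_mismatches(release_files, deployed_files):
    """Return names of files whose deployed copy is newer than the release copy.

    Both arguments map a file name to its creation time.
    """
    return sorted(
        name
        for name, built_at in release_files.items()
        if name in deployed_files and deployed_files[name] > built_at
    )


# Illustrative values only: the release artifacts were built before a hotfix
# replaced app.js on the web server.
release = {
    "app.js": datetime(2019, 1, 10, 9, 0),
    "users.html": datetime(2019, 1, 10, 9, 0),
}
deployed = {
    "app.js": datetime(2019, 1, 11, 14, 0),
    "users.html": datetime(2019, 1, 8, 9, 0),
}
mismatches = find_date_mismatches(release, deployed)
print(mismatches)
```

A non-empty result would be a signal to rebuild the release artifacts before continuing the deployment.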
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters team have been receiving some reports of errors when viewing certain pages in the Web UI. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in Notification Delivery"
Last update### What happened? On March 14, 2019, at approximately 2:52 PM \(PDT\), the xMatters monitoring tools alerted Client Assistance to an issue involving notification delivery. The On-Demand service was accepting and processing events, but was not creating or sending notifications. Some clients reported the issue to Client Assistance while the incident was being investigated, confirming that they were unable to initiate or send notifications. ### Why did it happen? The issue was caused by an operator error during a clean-up process that reverted some services to a prior state, resulting in a misconfiguration between services. The misconfiguration prevented notifications from being processed after events were submitted to xMatters. ### How did we respond? As soon as the internal monitoring tools alerted Client Assistance to an issue, they launched an investigation. When they were able to reproduce the issue and identify the scope, they immediately initiated the internal major incident management process and posted a notice for customers on the xMatters status page. The incident response teams began working to restore services and searching for the root cause. They identified a misconfiguration within services required for notification creation and distribution. They quickly initiated a resolution process to restore service configurations to a prior, known good state. As soon as the resolution was applied, notifications began processing, and the teams continued to monitor the notification queues until the backlogs had cleared. Clients confirmed that they were receiving notifications promptly and that all services had been restored. ### What are we doing to prevent it from happening again? The xMatters Engineering team has already completed an internal review, and is developing and implementing an automated process for all clean-up activities for the On-Demand service. 
This process will include the following: additional monitoring checkpoints to optimize clean-up activities, and automated rerouting of live traffic prior to reverting any services. ### Timeline: March 14, 2019 - 2:52 PM - Internal monitoring alerts Client Assistance to issue with notification processing 3:04 PM - Client Assistance confirms and replicates the issue 3:05 PM - Issue updated to MIM - incident response teams assembled 3:12 PM - Notification posted to xMatters status page 3:15 PM - Incident response teams isolate issue 3:27 PM - Corrective action designed and tested 3:30 PM - Fix promoted to production; notifications begin processing 3:30 PM - Incident response teams monitor event processing and clearing of backlog 3:57 PM - Backlogs cleared; all services restored If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in Europe"
Last update### What happened? On March 9, 2019, at approximately 5:04 PM GMT, the xMatters monitoring tools alerted Client Assistance to an issue with the notification services in the European region. During the resolution of the issue, which lasted approximately 25 minutes, notifications were being created but not sent to the intended recipients. New events and responses to existing notifications were still being accepted and processed, and the web user interface was accessible and fully responsive, but no new notifications were going out. ### Why did it happen? This issue occurred when a queuing mechanism shared between multiple services ran out of available connections, resulting in a lack of available resources for notification delivery in the European region. The root cause of the issue was that unused or expired connections between services were not being cleared, causing a degradation in performance that triggered the alert from the xMatters monitoring tools. ### How did we respond? As soon as the xMatters monitoring tools alerted Client Assistance to an issue, they launched a Severity-1 incident and initiated the internal major incident management process. The incident response teams quickly verified that notifications were not being sent and began working to isolate the cause of the performance degradation and to mitigate the impact to customers. The teams began a rolling restart of the affected services to reduce the bottleneck in the queuing mechanism, which immediately improved performance and restored notification delivery service for all affected customers. Once the teams confirmed that notifications were being sent, they continued monitoring the performance of the affected service and investigating the root cause. When the rolling restarts had completed, the teams confirmed that all services had been restored. ### What are we doing to prevent it from happening again? 
To prevent the issue from reoccurring while working on a permanent solution, the teams implemented an automatic restart schedule for the affected services that purges queue connections and ensures that capacity is freed on a regular basis. Due to service redundancy within the xMatters infrastructure, this action does not affect performance or notification delivery. The Engineering team optimized the use of connections by the queuing mechanism and designed an automated connection clearing schedule. The changes were developed and tested for the xMatters On-Demand 5.5.250 release, which was implemented in all production systems on March 14, 2019. ### Timeline: March 9, 2019 - 5:04 PM - xMatters monitoring tools alert to notification issues in the European region 5:05 PM - Severity-1 incident initiated 5:06 PM - Issue verified; multiple services cannot get connection to queue 5:07 PM - Impacted services restarted 5:30 PM - Performance improvement verified; services are restored 5:45 PM - Rolling restarts continue; no impact to customer services 5:50 PM - Verification and service checks continue 6:19 PM - Monitoring to ensure full service and performance 7:01 PM - Issue resolved. If you have any questions, please visit: [http://support.xmatters.com](http://support.xmatters.com)
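To illustrate the failure mode and the connection clearing described in this report, the following is a minimal sketch, not the xMatters implementation. It assumes a hypothetical `QueueBroker` with a fixed number of connection slots and a periodic sweep, `purge_stale`, that drops connections with no recent activity. All names, limits, and timestamps are illustrative.

```python
class QueueBroker:
    """Hypothetical queue front end with a fixed number of connection slots."""

    def __init__(self, max_connections, idle_timeout):
        self.max_connections = max_connections
        self.idle_timeout = idle_timeout  # seconds without activity before purge
        self.last_seen = {}  # connection id -> timestamp of last activity

    def connect(self, conn_id, now):
        """Register a new connection, failing if every slot is taken."""
        if len(self.last_seen) >= self.max_connections:
            raise ConnectionError("no connections available")
        self.last_seen[conn_id] = now

    def touch(self, conn_id, now):
        """Record activity on an existing connection."""
        self.last_seen[conn_id] = now

    def purge_stale(self, now):
        """Drop connections idle for longer than idle_timeout; return their ids."""
        stale = [
            conn_id
            for conn_id, seen in self.last_seen.items()
            if now - seen > self.idle_timeout
        ]
        for conn_id in stale:
            del self.last_seen[conn_id]
        return stale


broker = QueueBroker(max_connections=3, idle_timeout=60.0)
for name in ("svc-a", "svc-b", "svc-c"):
    broker.connect(name, now=0.0)
broker.touch("svc-c", now=100.0)  # only svc-c is still active

try:
    broker.connect("svc-d", now=100.0)  # all slots held, two by stale connections
    exhausted = False
except ConnectionError:
    exhausted = True

purged = broker.purge_stale(now=100.0)  # frees the two stale slots
broker.connect("svc-d", now=100.0)      # now succeeds
print(exhausted, purged)
```

Running the sweep on a schedule plays the same role as the interim restart schedule: it frees capacity regularly, but without restarting the services that hold live connections.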
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with notification delivery for xMatters On-Demand for some clients located in Europe. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
Last update### What happened? On February 4, 2019, at approximately 11:35 PM PST, the xMatters monitoring tools alerted Client Assistance to an issue affecting service in the North American region. During the incident, some clients may have experienced some brief interruptions or delays, including 503 errors, when attempting to access the xMatters web user interface. While the underlying issue required approximately 12 hours to completely resolve, there was a total of 12 minutes of actual impact to clients. These impacts were spread across the incident duration in short intervals while the underlying issues were resolved. ### Why did it happen? The issue was related to unexpected and increased connection pool usage within the xMatters platform, which caused the web user interfaces and some API services to reach capacity and auto-heal multiple times. Increased query times on some databases resulted in back pressure on user-facing services; this decrease of performance resulted in connection pools reaching capacity. ### How did we respond? As soon as the issue was detected, the Client Assistance team immediately initiated the internal Major Incident Management process and launched an investigation. The incident response team quickly identified an issue impacting the web user interface and declared a Severity 1 incident while engaging additional subject matter experts. Their first priority was to mitigate any client impact, and then work to identify a root cause and build a solution. When the issue reoccurred on February 5 at approximately 7:24 AM PST, the teams were able to immediately isolate the affected components and isolate the problematic services to perform remediation. They confirmed that this had correctly mitigated the problem, and that all services had been restored. ### What are we doing to prevent it from happening again? 
To prevent this issue from reoccurring, we are adding additional monitoring that will allow us to detect these types of incidents much earlier and automatically implement additional self-healing processes for affected services. In addition, we have conducted a thorough post-mortem and identified multiple areas where system resiliency can be improved. ### Timeline: **February 04, 2019 - 23:35 -** Monitoring tools alert Client Assistance to an issue in the North American region. Brief interruptions are detected. **February 04, 2019 - 23:40 -** xMatters Client Assistance initiates major incident management process, launches investigation. **February 05, 2019 - 00:08 -** Interruptions are no longer occurring. **February 05, 2019 - 07:24** - Second service impact begins \(brief interruptions continue until 11:31\) **February 05, 2019 - 08:04 -** Status page updated to investigating: [https://status.xmatters.com/incidents/t86p7lvvdn5g](https://status.xmatters.com/incidents/t86p7lvvdn5g) **February 05, 2019 - 08:05 -** Status page updated to identified, incident team works to determine root cause and resolution options. **February 05, 2019 - 11:31** - Resolution initiated, status page set to monitoring. **February 05, 2019 - 12:42 -** Production environment is determined to be stable, no further impact detected. Status page updated to resolved. If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
We are continuing to work on resolving this issue. The majority of clients should be able to access the web interface, but may temporarily see accessibility issues. We will provide another update in 30 minutes.
The xMatters Incident Response team has identified the source of the issue and is still working on a fix. We will update once a solution has been identified and implemented.
The xMatters Incident Response team has identified the source of the issue and is still working on a fix. We will update once a solution has been identified and implemented.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America; accessing the system may result in a temporary error. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in Asia Pacific Region"
Last update### What happened? On January 9, 2019, at approximately 11:20AM AEDT, the xMatters monitoring tools alerted Client Assistance to an issue with our hosting service in the Asia-Pacific region. During the incident, which lasted less than 20 minutes, some customers reported encountering a 503 error or a blank screen when attempting to access their xMatters instance, and events and notifications were not being accepted or processed. ### Why did it happen? This issue was caused by a connectivity failure within the Google Cloud Platform \(GCP\) Infrastructure-as-a-Service \(IAAS\) in the Asia-Pacific region. No egress traffic to the Internet from the Australian region was functional due to issues within Google's networks. ### How did we respond? As soon as the issue was detected, the Client Assistance team initiated the internal major incident management process and launched an investigation. The incident response teams quickly determined that all internal services were functioning normally, but traffic was not being sent to the internet. The teams immediately escalated the incident to the GCP team, who confirmed that they were experiencing issues and posted information about the problem on their status portals. While Google continued to investigate and attempt to restore their service, the incident response teams began implementing a work-around solution to re-route traffic through another region. During the implementation of the workaround, Google restored their services and by 16:35 all instances were reporting as functional and healthy. ### What are we doing to prevent it from happening again? At xMatters, we understand that availability is at the core of our service and treat the requirements of our customers as a mission critical service. 
As a precaution against possible future issues with Google services, the Operations and Engineering teams are committed to establishing a formalized work-around procedure that will bypass problematic services and allow us to continue to deliver services in the event of a failure.

### Timeline:

January 9, 2019

* 11:15AM - xMatters team discovers connectivity issues in the APAC Region
* 11:21AM - xMatters Client Assistance initiates major incident management process, launches investigation
* 11:23AM - Issue identified as external
* 11:25AM - Issue reported to Google Support
* 11:33AM - Operations begins work-around to attempt to mitigate issue for xMatters customers
* 11:35AM - Google reports service restored, all instances reported functional
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America for Email delivery"
Last update### What happened? On Thursday, December 27, 2018 at approximately 8:41 AM PST, the xMatters networking monitoring systems alerted Client Assistance to an issue with xMatters On-Demand services for some clients located in North America. During the issue, some clients may have experienced intermittent access to the xMatters user interface or a delay when injecting events into xMatters. In addition, some clients may have experienced intermittent delays or interruptions with the delivery and reception of xMatters emails. ### Why did it happen? The root cause of this issue was a high-impact service outage experienced by a primary internet service provider \(ISP\) in North America. This wide-reaching ISP outage impacted connectivity, email service, and Internet access across North America and even parts of Europe, and caused some issues common to large ISP outages, such as DNS gaps and mobile app connectivity problems. Throughout the incident, the xMatters web user interface was operating and functional, event injection methods were working properly, and non-email notifications and responses were being sent and processed normally. Most clients may have experienced increased latency during the event that affected the overall user experience. ### How did we respond? As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Engineering teams escalated the issue to Severity 1 and initiated the internal major incident management process. While the incident response teams began investigating the underlying cause, Client Assistance identified and informed affected clients about the incident. The teams immediately identified that the issue was limited to a specific data center within the North American region and determined that the problem was due to a widespread ISP outage in North America.
The team connected with the ISP and began working in collaboration with them to determine the impact to xMatters customers, and rerouted email services through an unaffected path. During the event, all in-flight deployments and upgrades were paused until network access was fully restored to avoid the possibility of impact. Our incident management team continued to monitor the situation closely and update clients as the ISP reported on their restoration progress. ### What are we doing to prevent it from happening again? xMatters uses multiple network backbones and automatically routes traffic across other networks and through other data centers in the event of an Internet failure. During this event, these systems were working as designed and connectivity was reestablished within the expected period of re-convergence. As part of our commitment to continuous improvement, we are conducting hosting service improvements to our infrastructure-as-a-service, scheduled to occur in the North American region in January 2019. These improvements will greatly reduce the potential impact of ISP outages. For more information, see the article on our support site: [https://support.xmatters.com/hc/en-us/articles/115005269506](https://support.xmatters.com/hc/en-us/articles/115005269506). 
### Timeline:

December 27, 2018

* 8:41 AM - xMatters internal monitoring alerts Client Assistance to issue in North America
* 8:43 AM - Client Assistance confirms all services are accessible and operational
* 8:58 AM - Client Assistance escalates issue to Severity 1; incident response teams begin investigation
* 9:03 AM - Team confirms issue with ISP
* 9:28 AM - xMatters engages ISP and obtains point of contact
* 5:46 PM - Issues identified with email service and delivery
* 6:04 PM - Email traffic re-routed to alternate path
* 6:07 PM - Email services restored
* 9:22 PM - ISP provides 4-hour ETA for resolution

December 28, 2018

* 9:19 AM - ISP indicates progress and claims to be nearing resolution
* 6:16 PM - ISP indicates that a solution has been implemented; currently monitoring connection for stability
* 11:44 PM - xMatters confirms all services restored

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
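The analysis above mentions that traffic is automatically routed across other networks when one path fails, and that email was rerouted through an unaffected path. The following is a minimal sketch of that general idea only: it picks the first path in a preference list whose health probe passes. The route names and the `is_healthy` probe are hypothetical placeholders, not xMatters components, and real network re-convergence happens at the routing layer rather than in application code like this.

```python
from typing import Callable, Optional, Sequence


def pick_route(routes: Sequence[str],
               is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first route whose health probe passes, or None if all fail."""
    for route in routes:
        try:
            if is_healthy(route):
                return route
        except Exception:
            # A probe that raises is treated the same as an unhealthy route.
            continue
    return None


# Hypothetical route names, listed in order of preference.
ROUTES = ["primary-isp", "secondary-isp", "alternate-datacenter"]

if __name__ == "__main__":
    # Simulate an outage on the preferred path.
    down = {"primary-isp"}
    print(pick_route(ROUTES, lambda route: route not in down))
```

Keeping the probe as a parameter means the same selection logic can be exercised with a simulated outage, as in the example, without touching a real network.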
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America where email delivery is being delayed. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
Last update### What happened? On Thursday, December 27, 2018 at approximately 8:41 AM PST, the xMatters networking monitoring systems alerted Client Assistance to an issue with xMatters On-Demand services for some clients located in North America. During the issue, some clients may have experienced intermittent access to the xMatters user interface or a delay when injecting events into xMatters. In addition, some clients may have experienced intermittent delays or interruptions with the delivery and reception of xMatters emails. ### Why did it happen? The root cause of this issue was a high-impact service outage experienced by a primary internet service provider \(ISP\) in North America. This wide-reaching ISP outage impacted connectivity, email service, and Internet access across North America and even parts of Europe, and caused some issues common to large ISP outages, such as DNS gaps and mobile app connectivity problems. Throughout the incident, the xMatters web user interface was operating and functional, event injection methods were working properly, and non-email notifications and responses were being sent and processed normally. Most clients may have experienced increased latency during the event that affected the overall user experience. ### How did we respond? As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Engineering teams escalated the issue to Severity 1 and initiated the internal major incident management process. While the incident response teams began investigating the underlying cause, Client Assistance identified and informed affected clients about the incident. The teams immediately identified that the issue was limited to a specific data center within the North American region and determined that the problem was due to a widespread ISP outage in North America.
The team connected with the ISP and began working in collaboration with them to determine the impact to xMatters customers, and rerouted email services through an unaffected path. During the event, all in-flight deployments and upgrades were paused until network access was fully restored to avoid the possibility of impact. Our incident management team continued to monitor the situation closely and update clients as the ISP reported on their restoration progress. ### What are we doing to prevent it from happening again? xMatters uses multiple network backbones and automatically routes traffic across other networks and through other data centers in the event of an Internet failure. During this event, these systems were working as designed and connectivity was reestablished within the expected period of re-convergence. As part of our commitment to continuous improvement, we are conducting hosting service improvements to our infrastructure-as-a-service, scheduled to occur in the North American region in January 2019. These improvements will greatly reduce the potential impact of ISP outages. For more information, see the article on our support site: [https://support.xmatters.com/hc/en-us/articles/115005269506](https://support.xmatters.com/hc/en-us/articles/115005269506). 
### Timeline:

December 27, 2018

* 8:41 AM - xMatters internal monitoring alerts Client Assistance to issue in North America
* 8:43 AM - Client Assistance confirms all services are accessible and operational
* 8:58 AM - Client Assistance escalates issue to Severity 1; incident response teams begin investigation
* 9:03 AM - Team confirms issue with ISP
* 9:28 AM - xMatters engages ISP and obtains point of contact
* 5:46 PM - Issues identified with email service and delivery
* 6:04 PM - Email traffic re-routed to alternate path
* 6:07 PM - Email services restored
* 9:22 PM - ISP provides 4-hour ETA for resolution

December 28, 2018

* 9:19 AM - ISP indicates progress and claims to be nearing resolution
* 6:16 PM - ISP indicates that a solution has been implemented; currently monitoring connection for stability
* 11:44 PM - xMatters confirms all services restored

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed by the ISP and network services have been restored. Thank you for your patience while this issue was being addressed.
The ISP have confirmed that most networking issues they were experiencing should now be resolved. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The ISP have made some progress, but are still working on fully restoring their service. Some users will continue to see issues accessing the web user interface depending on the geographic location. We continue to monitor the situation and will provide updates as we get them.
As mentioned previously, this issue has been identified to be a widespread issue impacting a primary ISP in North America. We continue to monitor the situation and will provide another update as it becomes available.
xMatters have received several reports today of users not being able to access the web user interface. The root cause of this issue is related to a wide-impact service outage experienced by a primary internet service provider (ISP) in North America. xMatters services are running and operational; however, some users may not be able to access their xMatters instance based on their geographic location. We continue to monitor the situation closely and will provide updates as they become available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "On-Demand Deployment Completed (All Regions)"
Last updateThe latest release of xMatters On-Demand has been successfully deployed to all regions. For information about the fixes and updates in this release, refer to the Support Notes document at https://support.xmatters.com/hc/en-us/articles/360019493351 For information about features and updates to the mobile apps and xMatters API, check the Development Highlights for this quarter at https://support.xmatters.com/hc/en-us/articles/360018776612
Report: "Issue Discovered - Service disruption in Asia Pacific Region"
Last update### **What happened?** On October 25th, 2018, at approximately 1:16 PM AEST, the xMatters monitoring tools alerted the Client Assistance team to an issue impacting the On-Demand service for some clients located in the Australian region. During the incident, some clients may have experienced intermittent access to the user interface, a delay or rejection when injecting an event into xMatters, and delays in notification delivery. There was no impact or loss to client data during this incident. ### **Why did it happen?** This issue was caused by a sudden, unexpected failure of a network interface card within the hosted data center supporting our services in Australia. While the impacted hardware was redundant, the failure caused a condition that resulted in a cascade of failures. An automated failover to an alternate data center was initiated immediately, but the process of redirecting services around the issue took longer than expected due to the nature of the failure. ### **How did we respond?** As soon as they were alerted by the monitoring systems, Client Assistance initiated the internal major incident management process and launched an investigation. The xMatters incident response teams confirmed the issue and began monitoring the automated failover process. The Client Assistance team proactively contacted each client individually to let them know about the issue and to update them on the status of their services. The failover was completed, and all services were fully restored less than an hour after the issue was identified. ### **What are we doing to prevent it from happening again?** Hardware failure is difficult to predict, and this condition was unique in that existing services and redundancies failed to perform as previously tested. The hosting service improvements and migrations just completed in the Australian region will make similar issues highly unlikely on this new and significantly more robust infrastructure.
For more information about these changes, see the article on our support site: [https://support.xmatters.com/hc/en-us/articles/115005269506](https://support.xmatters.com/hc/en-us/articles/115005269506)

### Timeline:

October 25, 2018

* 1:16 PM \(AEDT\) - Monitoring tools alert to an issue in the Australian region
* 1:18 PM - Client Assistance initiates major incident management process, launches investigation
* 1:20 PM - Issue identified as impacting some clients hosted in one of the APAC data centers
* 1:25 PM - Failover process begins for clients impacted
* 1:40 PM - Status page updated: [https://status.xmatters.com/incidents/jtcs9w4grlh4](https://status.xmatters.com/incidents/jtcs9w4grlh4)
* 2:05 PM - All affected customers reported back up
* 2:08 PM - All services restored
Resolved - The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Monitoring - The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region (AU1 Clients only). We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in Asia Pacific Region"
Last update### **What happened?** On October 25th, 2018, at approximately 1:16 PM AEST, the xMatters monitoring tools alerted the Client Assistance team to an issue impacting the On-Demand service for some clients located in the Australian region. During the incident, some clients may have experienced intermittent access to the user interface, a delay or rejection when injecting an event into xMatters, and delays in notification delivery. There was no impact or loss to client data during this incident. ### **Why did it happen?** This issue was caused by a sudden, unexpected failure of a network interface card within the hosted data center supporting our services in Australia. While the impacted hardware was redundant, the failure caused a condition that resulted in a cascade of failures. An automated failover to an alternate data center was initiated immediately, but the process of redirecting services around the issue took longer than expected due to the nature of the failure. ### **How did we respond?** As soon as they were alerted by the monitoring systems, Client Assistance initiated the internal major incident management process and launched an investigation. The xMatters incident response teams confirmed the issue and began monitoring the automated failover process. The Client Assistance team proactively contacted each client individually to let them know about the issue and to update them on the status of their services. The failover was completed, and all services were fully restored less than an hour after the issue was identified. ### **What are we doing to prevent it from happening again?** Hardware failure is difficult to predict, and this condition was unique in that existing services and redundancies failed to perform as previously tested. The hosting service improvements and migrations just completed in the Australian region will make similar issues highly unlikely on this new and significantly more robust infrastructure.
For more information about these changes, see the article on our support site: [https://support.xmatters.com/hc/en-us/articles/115005269506](https://support.xmatters.com/hc/en-us/articles/115005269506)

### Timeline:

October 25, 2018

* 1:16 PM \(AEDT\) - Monitoring tools alert to an issue in the Australian region
* 1:18 PM - Client Assistance initiates major incident management process, launches investigation
* 1:20 PM - Issue identified as impacting some clients hosted in one of the APAC data centers
* 1:25 PM - Failover process begins for clients impacted
* 1:40 PM - Status page updated: [https://status.xmatters.com/incidents/jtcs9w4grlh4](https://status.xmatters.com/incidents/jtcs9w4grlh4)
* 2:05 PM - All affected customers reported back up
* 2:08 PM - All services restored
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in the Asia Pacific region. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
Last update### What happened? On Thursday, November 15, 2018 at approximately 5:25 PM PST, the xMatters monitoring systems alerted Client Assistance to a potential issue with one of the data centers located in North America. For the remainder of Thursday, much of Friday morning, and into the weekend, North American customers experienced intermittent access to the user interface, delays or rejection when injecting an event into xMatters, and delays or failures in notification delivery. Also during this time, some customers in regions outside of North America may have experienced a delay or rejection with voice notifications. ### Why did it happen? This incident was caused by several issues occurring in succession. The major incident was caused by a software defect with a storage array located in one of our North American data centers, which resulted in the inaccessibility of the array and its associated disk volumes for several hours. Several hours after completing a failover of the affected databases and applications to an alternate data center, the xMatters teams observed additional failures caused by services attempting to access unresponsive databases. This resulted in the connection pools for those services filling up and rejecting new connections. The following day, the teams identified a defect with the Integration Builder platform that was causing intermittent failures of web servers and the user interface. ### How did we respond? As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Operations teams escalated the issue to a major incident and initiated the internal major incident management process. While the incident response teams began simultaneously investigating the underlying cause and working to restore services for clients, Client Assistance posted a notice to the xMatters status page to inform clients about the incident.
The teams immediately identified that the issue was limited to a specific data center within the North America region, and determined that the problem was related to the storage array. The Operations team began migrating production services to an alternate data center. These steps restored service for impacted clients, but a new issue continued to cause intermittent service disruptions for several clients throughout the night and into the following morning, with some applications and web services experiencing intermittent failures and high error rates. The investigation determined that this new issue was caused by some databases in the failed data center appearing as still accessible over the network when they were not actually responsive. Customers attempting to access these non-production services caused connection and performance degradation in some healthy services at the live data center, where database connection pools began to fill up until connections were rejected. When the teams discovered this issue, all systems at the failed data center were powered off to ensure they were unavailable and inaccessible, and this issue was resolved. Later, some customers reported reduced performance and slowness of the web user interface. This issue was traced to a problem in the Activity Stream of the Integration Builder and occurred when the system was attempting to process a large number of sizable integration logs related to failed integrations. The issue was causing the web services to run out of memory, where they would be automatically restarted. These restart loops caused rotating failures throughout the pool of web servers. The team was able to mitigate the issue and resolve the out-of-memory errors by truncating the largest integration logs. The Operations and Engineering teams continued to review the existing state of the failed data center and began systematically bringing services and data back online in a safe and coordinated manner. 
The teams reviewed client instances and performance throughout the day and made any necessary configuration modifications to ensure the systems were operational. At 4:00 PM PST, the Operations team began restoring the redundant databases affected by the storage array malfunction. ### What are we doing to prevent it from happening again? This series of incidents caused a major disruption of service and at xMatters, we know we can do better. While the incidents themselves were unrelated, their occurrence in short succession prolonged a period of instability. While these kinds of issues are difficult to predict and prevent, xMatters teams continually review our processes and seek areas of improvement or ways to reduce the amount of time clients are impacted. We are still working with the storage array and data center vendors and providers to determine the root cause of the initial failure and will update this root cause analysis if and when those investigations uncover any further information. To mitigate and eliminate other issues uncovered during this disruption, the xMatters teams have committed to the following actions:

* The Engineering team has developed a fix for the issue related to the non-responsive databases causing connection pool consumption, and the fix is scheduled to be deployed as part of the 5.5.235 release \(scheduled for Wednesday, November 21\).
* The Engineering team is developing and testing a fix for the issue related to the Activity Stream in the Integration Builder. The fix will be deployed as a hotfix for the 5.5.235 release as soon as the testing is complete.

As part of our commitment to continuous improvement, we are conducting hosting service improvements to our infrastructure-as-a-service, scheduled to occur in the North American region in January 2019. These improvements will remove points of failure such as the storage array involved in this incident.
For more information, see the article on our support site: [https://support.xmatters.com/hc/en-us/articles/115005269506](https://support.xmatters.com/hc/en-us/articles/115005269506). The robustness of this new infrastructure is dramatically improved with increased resiliency across the entire service implementation. To help reduce the load on our existing data centers and prevent similar issues from reoccurring, we are currently investigating ways to accelerate the migration process for some customers.

### Timeline:

November 15, 2018

* 5:25 PM - xMatters monitoring tools alert the Client Assistance team to a potential issue with clients in North America
* 5:26 PM - Internal major incident management process initiated
* 5:33 PM - Engineering identifies the issue as related to the storage array; begins fail-over to alternate data center
* 5:37 PM - Client Assistance posts status page bulletin: [https://status.xmatters.com/incidents/pj2bj697gkxw](https://status.xmatters.com/incidents/pj2bj697gkxw)
* 5:41 PM - Systems begin to come online in new data center
* 5:52 PM - Engineering implements mitigation steps to reduce load on storage array
* 5:57 PM - Majority of services restored; major incident team continues to work through systems
* 6:52 PM - All fail-over complete; some services require additional rehabilitation
* 7:31 PM - All services restored

November 16, 2018

* 1:43 AM - xMatters monitoring tools alert the Client Assistance team to an intermittent issue with some clients in North America
* 2:00 AM - Internal major incident management process re-initiated
* 2:04 AM - Engineering begins investigating the issue
* 2:09 AM - Client Assistance posts notice to xMatters status page: [https://status.xmatters.com/incidents/bz1hxfxfbrlt](https://status.xmatters.com/incidents/bz1hxfxfbrlt)
* 2:53 AM - Operations begins work-around to attempt to mitigate issue for xMatters customers in the new data center
* 3:01 AM - All services reporting as restored
* 5:23 AM - Clients contact xMatters Client Assistance, report slowness in navigating/accessing the web user interface
* 5:36 AM - Client Assistance posts notice to xMatters status page: [https://status.xmatters.com/incidents/3s1n4l1kldmt](https://status.xmatters.com/incidents/3s1n4l1kldmt)
* 6:33 AM - Services stabilize
* 7:30 AM - 3:30 PM - Major incident teams continue to review each instance and make necessary corrections or restarts
* 4:00 PM - Back-end database replication started to restore data replication

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
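The root cause analysis above describes connection pools filling up and rejecting new connections because callers were stuck on databases that were reachable but unresponsive. As an illustration of the general failure mode only, the sketch below shows a fixed-size pool whose callers wait a bounded time for a free connection and then fail fast instead of queueing indefinitely. `BoundedPool`, `PoolExhausted`, and the `connect` factory are hypothetical names invented for this example; this is not the fix shipped in the 5.5.235 release, whose details the source does not describe.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the acquire timeout."""


class BoundedPool:
    """Fixed-size pool that fails fast instead of blocking forever."""

    def __init__(self, connect, size, acquire_timeout=1.0):
        # Pre-create `size` connections using the caller-supplied factory.
        self._free = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(connect())
        self._acquire_timeout = acquire_timeout

    @contextmanager
    def connection(self):
        try:
            # Wait a bounded time for a free slot rather than indefinitely.
            conn = self._free.get(timeout=self._acquire_timeout)
        except queue.Empty:
            raise PoolExhausted("no free connection") from None
        try:
            yield conn
        finally:
            # Always return the slot, even if the caller's work raised.
            self._free.put(conn)


if __name__ == "__main__":
    pool = BoundedPool(connect=object, size=1, acquire_timeout=0.1)
    with pool.connection():
        # The only slot is held, so a second caller is rejected quickly.
        try:
            with pool.connection():
                pass
        except PoolExhausted as exc:
            print("rejected:", exc)
```

A bounded wait only limits how long new callers are stalled. It does not free slots held by requests that are hung on an unresponsive database, which is why a query or socket timeout in the database driver is usually needed alongside it.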
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
Report: "Issue Discovered - Service disruption in North America"
Last update### What happened? On Thursday, November 15, 2018 at approximately 5:25 PM PST, the xMatters monitoring systems alerted Client Assistance to a potential issue with one of the data centers located in North America. For the remainder of Thursday, much of Friday morning, and into the weekend, North American customers experienced intermittent access to the user interface, delays or rejection when injecting an event into xMatters, and delays or failures in notification delivery. Also during this time, some customers in regions outside of North America may have experienced a delay or rejection with voice notifications. ### Why did it happen? This incident was caused by several issues occurring in succession. The major incident was caused by a software defect with a storage array located in one of our North American data centers, which resulted in the inaccessibility of the array and its associated disk volumes for several hours. Several hours after completing a failover of the affected databases and applications to an alternate data center, the xMatters teams observed additional failures caused by services attempting to access unresponsive databases. This resulted in the connection pools for those services filling up and rejecting new connections. The following day, the teams identified a defect with the Integration Builder platform that was causing intermittent failures of web servers and the user interface. ### How did we respond? As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Operations teams escalated the issue to a major incident and initiated the internal major incident management process. While the incident response teams began simultaneously investigating the underlying cause and working to restore services for clients, Client Assistance posted a notice to the xMatters status page to inform clients about the incident.
The teams immediately identified that the issue was limited to a specific data center within the North America region, and determined that the problem was related to the storage array. The Operations team began migrating production services to an alternate data center. These steps restored service for impacted clients, but a new issue continued to cause intermittent service disruptions for several clients throughout the night and into the following morning, with some applications and web services experiencing intermittent failures and high error rates. The investigation determined that this new issue was caused by some databases in the failed data center appearing as still accessible over the network when they were not actually responsive. Customers attempting to access these non-production services caused connection and performance degradation in some healthy services at the live data center, where database connection pools began to fill up until connections were rejected. When the teams discovered this issue, all systems at the failed data center were powered off to ensure they were unavailable and inaccessible, and this issue was resolved. Later, some customers reported reduced performance and slowness of the web user interface. This issue was traced to a problem in the Activity Stream of the Integration Builder and occurred when the system was attempting to process a large number of sizable integration logs related to failed integrations. The issue was causing the web services to run out of memory, where they would be automatically restarted. These restart loops caused rotating failures throughout the pool of web servers. The team was able to mitigate the issue and resolve the out-of-memory errors by truncating the largest integration logs. The Operations and Engineering teams continued to review the existing state of the failed data center and began systematically bringing services and data back online in a safe and coordinated manner. 
The teams reviewed client instances and performance throughout the day and made any necessary configuration modifications to ensure the systems were operational. At 4:00 PM PST, the Operations team began restoring the redundant databases affected by the storage array malfunction.

### What are we doing to prevent it from happening again?

This series of incidents caused a major disruption of service, and at xMatters we know we can do better. While the incidents themselves were unrelated, their occurrence in short succession prolonged a period of instability. While these kinds of issues are difficult to predict and prevent, xMatters teams continually review our processes and seek areas of improvement or ways to reduce the amount of time clients are impacted. We are still working with the storage array and data center vendors and providers to determine the root cause of the initial failure and will update this root cause analysis if and when those investigations uncover any further information. To mitigate and eliminate other issues uncovered during this disruption, the xMatters teams have committed to the following actions:

* The Engineering team has developed a fix for the issue related to the non-responsive databases causing connection pool consumption; the fix is scheduled to be deployed as part of the 5.5.235 release \(scheduled for Wednesday, November 21\).
* The Engineering team is developing and testing a fix for the issue related to the Activity Stream in the Integration Builder. The fix will be deployed as a hotfix for the 5.5.235 release as soon as the testing is complete.

As part of our commitment to continuous improvement, we are conducting hosting service improvements to our infrastructure-as-a-service, scheduled to occur in the North American region in January 2019. These improvements will remove points of failure such as the storage array involved in this incident.
For more information, see the article on our support site: [https://support.xmatters.com/hc/en-us/articles/115005269506](https://support.xmatters.com/hc/en-us/articles/115005269506). The robustness of this new infrastructure is dramatically improved with increased resiliency across the entire service implementation. To help reduce the load on our existing data centers and prevent similar issues from recurring, we are currently investigating ways to accelerate the migration process for some customers.

### Timeline:

November 15, 2018

* 5:25 PM - xMatters monitoring tools alert the Client Assistance team to a potential issue with clients in North America
* 5:26 PM - Internal major incident management process initiated
* 5:33 PM - Engineering identifies the issue as related to the storage array; begins fail-over to alternate data center
* 5:37 PM - Client Assistance posts status page bulletin: [https://status.xmatters.com/incidents/pj2bj697gkxw](https://status.xmatters.com/incidents/pj2bj697gkxw)
* 5:41 PM - Systems begin to come online in new data center
* 5:52 PM - Engineering implements mitigation steps to reduce load on storage array
* 5:57 PM - Majority of services restored; major incident team continues to work through systems
* 6:52 PM - All fail-over complete; some services require additional rehabilitation
* 7:31 PM - All services restored

November 16, 2018

* 1:43 AM - xMatters monitoring tools alert the Client Assistance team to an intermittent issue with some clients in North America
* 2:00 AM - Internal major incident management process re-initiated
* 2:04 AM - Engineering begins investigating the issue
* 2:09 AM - Client Assistance posts notice to xMatters status page: [https://status.xmatters.com/incidents/bz1hxfxfbrlt](https://status.xmatters.com/incidents/bz1hxfxfbrlt)
* 2:53 AM - Operations begins work-around to attempt to mitigate issue for xMatters customers in the new data center
* 3:01 AM - All services reporting as restored
* 5:23 AM - Clients contact xMatters Client Assistance to report slowness in navigating/accessing the web user interface
* 5:36 AM - Client Assistance posts notice to xMatters status page: [https://status.xmatters.com/incidents/3s1n4l1kldmt](https://status.xmatters.com/incidents/3s1n4l1kldmt)
* 6:33 AM - Services stabilize
* 7:30 AM - 3:30 PM - Major incident teams continue to review each instance and make necessary corrections or restarts
* 4:00 PM - Back-end database replication started to restore the redundant databases

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
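The connection pool exhaustion described above (services waiting on databases that looked reachable but never answered) is the kind of failure a circuit breaker is commonly used to contain. The following is a minimal sketch of that general pattern, not the fix xMatters shipped in 5.5.235; the class name, thresholds, and injected clock are all hypothetical and chosen only for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the backend is marked unavailable."""


class CircuitBreaker:
    """Hypothetical helper: stop calling a backend after repeated failures,
    then allow a single trial call once a cooldown has elapsed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cooldown in seconds (assumed value)
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of tying up a pooled connection.
                raise CircuitOpenError("backend marked unavailable")
            # Cooldown elapsed: permit one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

In this sketch, callers would wrap each database request in `breaker.call(...)` and pair it with a query timeout, so that an unresponsive database produces a quick rejection rather than a held connection.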
Resolved - The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
Monitoring - The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
Last update

### What happened?

On Tuesday, October 16, 2018 at approximately 12:40 PM PDT, the xMatters monitoring systems alerted the Client Assistance team to a potential issue with On-Demand services for some clients located in North America. During the incident, some users may have experienced intermittent access to the user interface, a delay or rejection when injecting an event into xMatters, and delays in notification delivery. Early the next morning, on October 17, some clients reported that their inbound requests via the Integration Builder were not processing messages. During this incident, some users may have experienced delays in notification delivery. The issue occurred a third time early in the afternoon on Thursday, October 18, when Client Assistance noticed ongoing performance issues in one of the North American data centers. Some clients may have encountered intermittent access to the user interface and delays in notification delivery during this time.

### Why did it happen?

This issue was caused by a database query change that was introduced as part of a bug fix in the xMatters On-Demand 5.5.230 release, which entered production on Monday, October 15. This change caused databases to take longer to process certain requests, but only under specific conditions that arose during periods of increased concurrency or increased notification requests. The teams had some difficulty in identifying the root cause because the performance issues appeared to abate after each solution was implemented. It was not until the third occurrence that the teams were able to gather enough information about the common elements to correctly isolate the source of the problem.

### How did we respond?

As soon as the xMatters network monitoring tools detected connectivity issues, the xMatters Client Assistance and Operations teams escalated the issue to Severity 1 and initiated the internal major incident management process.
While the incident response teams began simultaneously investigating the underlying cause and working to restore services for clients, Client Assistance posted a notice to the xMatters status page to inform clients about the incident. The teams immediately identified that the issue was limited to a specific subset of client instances within the North America region, and determined that the problem was related to a database consuming nearly all of its resources. In an attempt to mitigate the issue, the Operations team restarted the database service, resulting in marginal improvements to notification delivery. Upon further investigation, the team identified additional approaches that could mitigate the problem, and applied one of the recommended fixes to the database. Once the services were restarted, notification delivery resumed normal operations and all services appeared to be restored. On October 17, the Client Assistance team began receiving reports from clients that some injected events were not delivering notifications. The Client Assistance team confirmed the issue and initiated the internal major incident management process to engage the incident response teams. The teams identified that a service responsible for handling inbound requests from the Integration Builder was in a blocked state. Once the impacted service was restarted the block was cleared, and events began processing notifications. The teams continued to investigate and determined that the original incident had blocked certain database tables and that additional components required a restart. The Operations team unblocked the database tables, and restarted affected components to ensure that all services were fully restored. The teams continued to search for the underlying cause of the incident while monitoring the affected systems. At approximately 12:30 PM on Thursday, October 18, Client Assistance again noticed performance issues with one of the data centers in North America. 
They immediately launched the major incident management process and engaged the response teams to begin resolving the issue. The teams were able to start simultaneously restoring services and investigating the root cause. The third occurrence provided the teams with the information necessary to link the issues and review similar behavior during all three incidents. By comparing common elements that occurred during each incident, the teams managed to isolate and identify the query that caused the database performance issues. Once they were certain that they had identified the correct source of the problems, the Operations and Engineering teams devised and implemented a hotfix to mitigate any further impact to customers. Clients then confirmed that all services had been restored.

### What are we doing to prevent it from happening again?

To prevent this issue from occurring again, xMatters has committed to the following action items:

1. Upgrade the underlying database to the latest patch release version. \(Completed\)
2. Increase monitoring thresholds to help identify any latency with notification delivery earlier in the process. \(In progress\)
3. Deploy a hotfix to fix the problematic query on the impacted systems. \(Completed\)
4. Deploy a permanent fix to the query to eliminate the issue across all customers and all systems. \(Deployed as part of the 5.5.231 release on Monday, October 22.\)

In addition, the Engineering and Operations teams are conducting a full post-mortem of the incident to help identify any potential improvements to testing suites, playbooks, and other collateral used to help isolate and identify root causes during and after an incident.
### Timeline:

October 16, 2018

* 12:40 PM - xMatters monitoring tools alert the Client Assistance team to possible latency issues for some clients in North America
* 12:50 PM - Internal Severity 1 process initiated
* 1:15 PM - Engineering attempts to restore services for clients by restarting impacted notification service
* 1:32 PM - Client Assistance posts status page bulletin: [https://status.xmatters.com/incidents/c7vqmddldtbl](https://status.xmatters.com/incidents/c7vqmddldtbl)
* 1:50 PM - Engineering recommends mitigation steps to recover the notification service
* 2:01 PM - Fix deployed to database; impacted service restarted
* 2:10 PM - Services are restored

October 17, 2018

* 6:00 AM - Client Assistance receives reports that some events are not processing
* 7:58 AM - Client Assistance initiates internal major incident process
* 8:05 AM - Engineering begins investigating the issue
* 9:10 AM - Engineering applies fix, events begin processing notifications
* 9:14 AM - Services are restored

October 18, 2018

* 12:30 PM - xMatters Client Assistance is alerted to possible latency issues in a North American data center
* 12:34 PM - Issue escalated to Severity 1
* 12:58 PM - Client Assistance posts notice to xMatters status page: [https://status.xmatters.com/incidents/7yptsvdrm2p5](https://status.xmatters.com/incidents/7yptsvdrm2p5)
* 1:13 PM - Teams confirm that all three incidents are related and identify updated query as the root cause
* 1:37 PM - Engineering and Operations teams deploy a hotfix to repair the query
* 5:13 PM - All services are confirmed restored.

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a temporary fix for the issue. Notification delivery times have returned to normal and a permanent fix is being worked on. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. Users may see a delay of up to 20 minutes for notifications. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters team has identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
Last update

### What happened?

Beginning on Tuesday, October 9, 2018 at approximately 7:40 PM PDT, the xMatters monitoring systems alerted Client Assistance to a potential issue with xMatters On-Demand services for clients in the North America region. During the incident, some customers may have experienced delays of up to 15 minutes in notification delivery. No notifications were lost during this period, and event injection and responses continued processing as normal.

### Why did it happen?

This issue was caused by a previously unidentified defect within a service responsible for handling notification processing, triggered by an unusually high volume of notification requests.

### How did we respond?

As soon as the automated monitoring tools alerted xMatters Client Assistance to a possible delay in notification delivery, the teams began attempting to both reproduce the problem and determine the cause of the issue. Once the issue was confirmed, xMatters Client Assistance escalated the issue to Severity 1 and initiated the internal major incident management process. The incident response teams began working to identify and isolate the issue and quickly identified a problem with the notification service. The team discovered that some back-end services were in the process of automatically recovering from a failure and restarted one of the affected components to speed the recovery process. This appeared to resolve the issue and all services were restored. The teams concluded the major incident process while continuing to monitor the situation, and were able to identify the root cause as a defect in the notification service.

### What are we doing to prevent it from happening again?

To prevent this issue from recurring, the Engineering team will upgrade to a newer version of the affected back-end service which contains a fix for the defect that caused the delay in processing notifications.
This new version is currently in development and will be deployed as soon as testing and validation has been completed. To ensure that the issue does not reoccur before the team can deploy the fix, the Engineering and Operations teams have implemented a rate limit on the affected service so that it will not experience the unusually high volume of requests that triggered the defect. ### Timeline: October 9, 2018 - 7:40 PM xMatters internal monitoring alerts Operations to issue in North America region 7:47 AM Client Assistance begins testing notification delivery 8:00 PM Client Assistance escalates issue to Severity 1; incident response teams begin investigation 8:30 PM Notification service restarted. 8:40 PM Client Assistance confirms notifications are being processed; all services restored. If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
Last update

### What happened?

On October 3, 2018 at approximately 3:05 AM PDT, the xMatters monitoring systems alerted Client Assistance to a potential issue with one of the data centers located in North America. No customers reported any issues, though it is possible that some users experienced a very brief interruption when attempting to access the On-Demand web user interface. No alerts or events were lost during this incident, and all notifications were delivered promptly.

### Why did it happen?

This issue was caused by a connectivity problem with the Internet service provider for one of our North American data centers. The connection issue occurred beyond the xMatters environments, outside our firewalls.

### How did we respond?

As soon as the monitoring tools alerted Client Assistance to an issue, the team began checking client environments for connection issues. The monitoring tools continued to show fluctuations in connectivity, though initial checks showed that client environments first reported as down recovered within one minute.

Client Assistance initiated the major incident management process and engaged the Operations and Engineering teams to assist in identifying any possible issues. The incident response teams isolated the fluctuations as occurring beyond the xMatters firewalls and identified the root cause as an issue with the Internet provider for the data center. Within minutes of the initial alarm, the Internet connection stabilized, and the teams confirmed that all services were operating normally.

### What are we doing to prevent it from happening again?

Although the xMatters monitoring tools indicated intermittent connectivity between 3:04 and 3:11 AM, the Internet service provider could not confirm the issue, reporting that they had not received any reports of maintenance or outages on their network at that time.

While it is difficult, if not impossible, to predict connection issues with Internet service providers, we are taking steps to resolve these types of problems via our hosting service improvements described here: [https://support.xmatters.com/hc/en-us/articles/115005269506-Improving-our-hosting-services](https://support.xmatters.com/hc/en-us/articles/115005269506-Improving-our-hosting-services)

The robustness of this new infrastructure should help avoid similar issues by reducing dependence on any individual service provider. In the short term, we will continue to work with our existing carrier to identify ways to prevent customer impact should a similar issue occur in the future.

### Timeline

October 3, 2018

- 3:05 AM: xMatters monitoring tools alert Client Assistance to a potential issue with client environments being down
- 3:14 AM: Major incident management process initiated; incident response teams begin investigation
- 3:16 AM: Root cause identified as connectivity fluctuations that have since ceased; all customer environments reported up
- 3:22 AM: All services confirmed restored

If you have any questions, please visit [http://support.xmatters.com](http://support.xmatters.com)
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.
Report: "Issue Discovered - Service disruption in North America"
Last update
The issue has been addressed, and all services have been restored. Thank you for your patience while we addressed this matter.
The xMatters Incident Response team has deployed a fix for the issue. We are currently monitoring the situation to ensure the implementation is stable and that all services are restored.
The xMatters Incident Response team has identified the source of the issue and is working on a fix. We will update once a solution has been identified and implemented.
The xMatters monitoring tools have identified a potential issue with xMatters On-Demand for some clients located in North America where notifications may be delayed. We are currently investigating the issue, and will update as information becomes available. If you are also experiencing issues, or if you're not sure whether this issue impacts your service, please contact xMatters Client Assistance at https://support.xmatters.com/hc/en-us/requests/new - our support agents are waiting to help.