Auvik Networks Inc.

Is Auvik Networks Inc. Down Right Now? Check whether there is currently an ongoing outage.

Auvik Networks Inc. is currently Operational

Last checked from Auvik Networks Inc.'s official status page

Historical record of incidents for Auvik Networks Inc.

Report: "Legacy alerts for sites on the US1 cluster delayed"

Last update
resolved

The processing of legacy alerts for clients on the US1 cluster was delayed on May 11, 2025, from 00:00 to 12:00 UTC. The service has been restored, and alerts have been processed through the system with the proper time codes. No other services or clusters were affected. Auvik apologizes for the delay in alerts and will post an RCA after conducting an internal analysis.

Report: "Service Degraded - Auvik Dashboard in AU1"

Last update
resolved

This incident has been resolved.

investigating

We are continuing to investigate this issue.

investigating

Affected Services: Auvik Dashboard Cluster(s): AU1 Description: We are currently experiencing degraded performance loading the Auvik Dashboard. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience slower load times of the Auvik Dashboard. Monitoring services are not impacted. Next Steps: We will update as more information becomes available. Thank you for your patience as we work to restore full functionality.

Report: "Service Degraded - Auvik Dashboard in AU1"

Last update
Resolved

This incident has been resolved.

Update

We are continuing to investigate this issue.

Investigating

Affected Services: Auvik Dashboard Cluster(s): AU1 Description: We are currently experiencing degraded performance loading the Auvik Dashboard. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience slower load times of the Auvik Dashboard. Monitoring services are not impacted. Next Steps: We will update as more information becomes available. Thank you for your patience as we work to restore full functionality.

Report: "Scheduled Maintenance"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We will be upgrading the Auvik cloud. The session will take about three hours. During this time, you may not be able to log into Auvik. There may also be interruptions to your network monitoring. If you have any questions, please contact support@auvik.com.

Report: "Auvik Reporting Sites Down Post After Maintenance"

Last update
postmortem

# Service Disruption - Sites are not available after maintenance

## Root Cause Analysis

### Duration of incident
Discovered: May 10, 2025 13:04 UTC
Resolved: May 11, 2025 01:00 UTC

### Cause
A scheduled upgrade of the system failed to complete successfully.

### Effect
Auvik functionality was impacted after the upgrade was implemented. This began a cascade of product functionality failures that required reimplementing the upgraded version using a stepped restart of Auvik.

### Action taken
_All times are in UTC_

**05/10/2025**
**11:00** Upgrade process begins on core components.
**12:45** An issue is detected affecting data replication, and some clusters experience connectivity problems.
**13:05** Engineering begins active investigation into the connectivity issue.
**13:24** Recovery actions initiated for affected clusters.
**13:49** Maintenance window extended to address ongoing issues.
**14:00-14:05** Impacted clusters begin recovering.
**14:21** Post-upgrade validation reveals a new issue affecting dashboard display in most regions.
**14:35** Further analysis confirms the issue affects multiple clusters.
**15:00** Deeper technical investigation begins to isolate the root cause, which is suspected to involve backend services.
**17:04** Root cause identified as an issue with a core data processing component.
**17:20** Mitigation strategies explored; decision made to re-attempt the upgrade with a modified approach.
**18:30-20:17** Second upgrade process begins; similar issues surface in specific regions.
**21:00-21:25** Recovery actions for affected clusters show positive results; services begin to stabilize.
**21:30-21:40** Core services successfully rolled out to additional clusters with improved configuration.
**23:47** One final cluster exhibits recovery issues, addressed through targeted intervention.

**05/11/2025**
**00:00-01:00** Final recovery actions completed; all services return to normal.
**01:00** Complete system restoration is confirmed.

### Future consideration(s)
* Implement additional alerting to monitor bandwidth issues on the backend systems more effectively and proactively to prevent bottlenecks.
* Complete the improvements that are already in progress.
* Mitigate the load placed on all backend systems simultaneously after a maintenance window.
* Replace several single-point-of-failure configurations with more scalable configurations.

resolved

Towards the end of Auvik's scheduled maintenance window, on 5/10/2025, Engineering noticed some loading issues with sites on several clusters. Upon investigation, it was determined there was an issue with the data flow between systems. This interruption required Auvik to extend its maintenance window. Auvik was able to bring each cluster's tenants up throughout the process. This work was considered completed at 21:00 EDT. Auvik will furnish an RCA after an internal review has been completed.

Report: "Service Degraded - Some Clients on the US4 cluster are offline."

Last update
postmortem

# Service Disruption - Over 50% of clients on the US4 cluster experienced service interruptions.

## Root Cause Analysis

### Duration of incident
Discovered: Apr 14, 2025 19:45 UTC
Resolved: Apr 15, 2025 04:05 UTC

### Cause
A configuration change related to Meraki devices.

### Effect
About 55% of tenants in US4 became inaccessible due to increased traffic and system load.

### Action taken
_All times are in UTC_

**04/14/2025**
**19:45** - Auvik receives internal alerts for abnormal CPU usage on its backend systems for the US4 cluster.
**19:50** - Engineering begins an investigation into the issue, actively taking measures to stabilize the system.
**20:42** - A large number of sites become inaccessible, and Auvik implements its incident response.
**20:42-21:45** - Engineering continues to investigate.
**21:45** - A possible root cause of the issue is identified, and Engineering begins recovering sites.

**04/14/25-04/15/25**
**21:45-00:10** - Engineering continues to bring most of the affected sites back online.

**04/15/25**
**00:10** - All sites, except those of one client, are back up and accessible.
**00:10-01:00** - Auvik continues to work on bringing the last client's tenants online and getting them up and running.
**01:00** - The root cause of the incident is determined. Engineering creates mitigation steps.
**01:00-03:05** - Mitigation steps are implemented, and the remaining sites of the last client are brought online and made accessible.

### Future consideration(s)
* Auvik has implemented safeguards to prevent a recurrence.

resolved

Affected Services: Site availability Cluster(s): US4 Description: The issue affecting site availability has been fully resolved. Regular service has been restored, and all systems are now operating as expected. Impact: Users should no longer experience issues related to this incident except for select clients we have communicated with. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Site availability Cluster(s): US4 Description: Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the rest of the sites to be available online. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services should operate normally, except for the remaining sites, which we continue working to make fully available. Services: None of the other clusters and services are affected. Next Steps: We will provide a final update once all issues are resolved. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Site availability Cluster(s): US4 Description: Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the rest of the sites to be available online. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services should operate normally, except for the remaining sites, which we are continuing to work to make fully available. Services: None of the other clusters and services are affected. Next Steps: We will provide a final update once all issues are resolved. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Site availability Cluster(s): US4 Description: Our team has implemented a fix for the issue affecting site connectivity on the US4 cluster. We are waiting for the rest of the sites to be available online. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services should be operating normally, except for the remaining site, which we are waiting on to become fully available. Services: None of the other clusters and services are affected. Next Steps: We will provide a final update once all issues are resolved. Thank you for your patience, and we apologize for any inconvenience caused.

identified

Affected Services: Site availability Cluster(s): US4 Description: Our team has identified the root cause of the degraded performance affecting client site availability in the US4 cluster. We are currently investigating a solution to restore normal service levels. Impact: While we work on the resolution, users may experience connectivity issues as sites become available again. Services: None of the other clusters and services are affected. Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 23:00 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Site availability Cluster(s): US4 Description: We are currently experiencing degraded performance with sites running on the US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience connectivity issues with their tenants. Services: None of the other clusters and services are affected. Next Steps: We will provide updates as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.

Report: "Service Degraded - Internet Connection Checks are creating false alerts on the US3 cluster."

Last update
postmortem

# Service Disruption - Cloud Ping Checks create false alerts on the US3 cluster.

## Root Cause Analysis

### Duration of incident
Discovered: Mar 31, 2025 13:15 UTC
Resolved: Apr 01, 2025 15:52 UTC

### Cause
The performance of the ping server service on the US3 cluster degraded and produced invalid data.

### Effect
The ping server service sent incorrect data based on the internet connection checks to the alerting service, which created large batches of false alerts sent to customers on the US3 cluster.

### Action taken
_All times are in UTC_

**03/31/2025**
**17:10 -** The ping server starts showing symptoms of degradation.
**17:15 -** Internet connections are marked offline. Customers experience excessive false alert reports based on the cloud ping check service on the US3 cluster.
**17:20 -** The Auvik engineering team begins its investigation.
**17:20-20:00 -** Auvik continues its investigation and disables the cloud ping service for several large customers on the US3 cluster to prevent excessive alerting once the service is restored.
**20:00 -** Auvik resets the ping server service on the US3 cluster. Ping services fail over to the backup primary ping server service.
**22:25 -** The primary ping server service load rises to a level that begins impacting customers on other clusters.

**04/01/25**
**00:00 -** The US3 cluster is restarted to revert cloud ping checks to the US3 cluster ping server services. Auvik notifies the customers whose cloud ping checks were disabled that the service will remain down until engineering can confirm they can be enabled without causing excessive alerting.
**01:00-01:25 -** The US3 cluster fully restarts successfully. Functionality is restored for most clients on the US3 cluster.
**12:00-15:30 -** Engineering reviews the disabled configurations and disables the responses to the cloud ping check-based alerts.
**15:30-15:52 -** Auvik validates that all cloud ping check services and alerts are enabled for all customers on the US3 cluster. Additional clean-up commences. The incident is concluded.

### Future consideration(s)
* Auvik is building a new cloud ping check server service for the product. This new server service will be rolled out gradually and is expected to be fully deployed into production over the next month.
* Our error handling in the service that processes the cloud ping server data has been improved to identify and ignore invalid data.
* Addresses will no longer be considered offline when invalid data is received.
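The last two future considerations above describe validating ping data and no longer marking an address offline when invalid data is received. Auvik has not published how this is implemented; the following is a minimal, hypothetical Java sketch of that kind of guard, with invented names (`PingSample`, `isValid`, the five-minute staleness cutoff) used purely for illustration.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical ping-check sample as reported by a cloud ping server. */
record PingSample(String targetAddress, Instant observedAt, double latencyMs,
                  int packetsSent, int packetsReceived) {

    /** Reject samples that are internally inconsistent or impossibly stale. */
    boolean isValid(Instant now) {
        return packetsSent > 0
                && packetsReceived >= 0
                && packetsReceived <= packetsSent
                && latencyMs >= 0
                && !observedAt.isAfter(now)
                && Duration.between(observedAt, now).compareTo(Duration.ofMinutes(5)) <= 0;
    }

    boolean indicatesOffline() {
        return packetsReceived == 0;
    }
}

/** Only treat an address as offline (and alert) on valid samples; drop invalid data. */
class InternetConnectionCheck {
    void process(PingSample sample) {
        Instant now = Instant.now();
        if (!sample.isValid(now)) {
            // Invalid or corrupted data from a degraded ping server is ignored rather
            // than treated as "address offline", so no false alert is generated.
            System.out.println("Dropping invalid sample for " + sample.targetAddress());
            return;
        }
        if (sample.indicatesOffline()) {
            System.out.println("ALERT: " + sample.targetAddress() + " appears offline");
        }
    }

    public static void main(String[] args) {
        InternetConnectionCheck check = new InternetConnectionCheck();
        // A corrupted sample (negative latency, more packets received than sent) is dropped.
        check.process(new PingSample("203.0.113.10", Instant.now(), -1.0, 4, 7));
        // A valid sample with total packet loss still raises an alert.
        check.process(new PingSample("203.0.113.11", Instant.now(), 0.0, 4, 0));
    }
}
```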

resolved

Affected Services: Internet Connection Service Cluster(s):US3 Description: The issue affecting Internet Connection Ping Checks has been fully resolved. Regular service has been restored, and all systems are operating as expected. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Internet Connection Service Cluster(s): US3 Description: Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services are operating normally for most sites. We continue monitoring for irregularities with a few sites that have been contacted. Next Steps: Tenants on the US3 cluster are still recovering and look healthy. We are attending to a few sites to regain full functionality. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Internet Connection Service Cluster(s):US3 Description: Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services should be operating normally; however, we continue monitoring for irregularities. Next Steps: Tenants on the US3 cluster are still recovering and look healthy. We will continue to monitor the status of the tenants on US3 overnight and report back in the morning. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Internet Connection Service Cluster(s):US3 Description: Our team has implemented a fix for the issue affecting the Internet connection ping check for the tenants on the US3 cluster, and performance is returning to normal. We are currently monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services should be operating normally; however, we continue monitoring for irregularities. Next Steps: Tenants on the US3 cluster are still recovering and look healthy. Thank you for your patience, and we apologize for any inconvenience caused.

identified

Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: The 20-minute maintenance window for the internet connection service for all clusters has been completed. Services, including other monitoring services, are not impacted. Next Steps: The US3 cluster is still going through its restart process. We sincerely apologize for the extended window for this action. Thank you for your patience as we work to restore full functionality.

identified

We are continuing to work on a fix for this issue.

identified

Affected Services: Internet Connection Service Cluster(s):All Clusters Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: Auvik will perform an emergency cluster restart on US3 tenants at 00:00, which will take approximately 1.5 hours to complete. At this time, Auvik will also perform a 20-minute maintenance window to allow for a restart of the Internet connection service for all of Auvik. We sincerely apologize for the extended window for this action. Thank you for your patience as we work to restore full functionality.

identified

Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: Auvik has disabled alerts for clients on the US3 cluster. This action will continue for an additional hour until 23:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks. Clients may experience a slowed UI response time during this work. Any UI slowness should be very short, if noticeable at all. We apologize for the extended window for this action. Thank you for your patience as we work to restore full functionality.

identified

Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: Auvik will disable alerts for clients on the US3 cluster for up to 1 hour starting at 22:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks. Clients may experience a slowed UI response time during this work. This UI slowness should be very short if it is noticeable at all. We apologize for the late notice. Thank you for your patience as we work to restore full functionality.

identified

Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: Auvik will disable alerts for clients on the US3 cluster for up to 1 hour starting at 22:00 UTC. This is a preventative measure as we work through false alerts for the internet connection checks. We apologize for the late notice. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: We will update you as more information becomes available or by 22:00 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: We will update you as more information becomes available or by 21:00 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Some clients may see the alerts associated with the Internet connection service missing from the alert dashboard. This is temporary, and these alerts will be restored when the internet connection check service is restored. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: We will update you as more information becomes available or by 20:00 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Internet Connection Service Cluster(s):US3 Description: We are currently experiencing degraded performance with the internet connection check ping service. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience false internet connection disconnects. Services, including other monitoring services, are not impacted. Next Steps: We will update you as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.

Report: "Service Disruption - US4"

Last update
postmortem

# Service Disruption - Clients on the US4 Cluster Unreachable

## Root Cause Analysis

### Duration of incident
Discovered: Feb 28, 2025 16:32 UTC
Resolved: Feb 28, 2025 19:30 UTC

### Cause
Overload of backend resources for services on the US4 cluster.

### Effect
Tenants on the US4 cluster became inaccessible.

### Action taken
_All times in UTC_

**02/28/2025**
**16:32** - Auvik Engineering discovers several non-responsive backends on the US4 cluster, which causes some tenants to be unresponsive. Engineering begins investigating.
**17:00** - Attempts are made to revive the non-responsive backends.
**17:28** - The cluster is in distress, with more backends starting to fail.
**17:45** - Engineering restarts the entire cluster.
**18:10-19:30** - The cluster is observed as it restarts and monitored as it comes up to full functionality. The incident is declared resolved.

### Future consideration(s)
* Auvik is currently improving backend monitoring and stability within the product and infrastructure. These improvements aim to help mitigate potential issues proactively in the future.

resolved

Affected Services: Tenants on US4 Services not impacted: Tenants on all other clusters Description: The issue affecting tenant inaccessibility on the US4 cluster has been fully resolved. Regular service has been restored, and all systems are now operating as expected. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Clients on the US4 cluster Services not impacted: Clients on other clusters Description: Our team has fixed the issue affecting tenant inaccessibility on the US4 cluster. The remaining tenants are recovering. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Service should operate normally; some tenant sites are still becoming accessible. Services: Sites on other clusters are not affected. Next Steps: We will provide a final update once the issue is resolved. Thank you for your patience, and we apologize for any inconvenience caused.

identified

Affected Services: Clients on the US4 cluster Services not impacted: Clients on other clusters Description: Our team has identified the root cause of the degraded performance with tenants on the US4 cluster. We are seeing tenants return to normal service levels. Impact: While we work on the resolution, users should start to see their tenants become responsive. Services: Other clusters are not impacted. Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 19:30 UTC. Thank you for your patience as we work to restore full functionality.

identified

Affected Services: Clients on the US4 cluster Services not impacted: Clients on other clusters Description: Our team has identified the root cause of the degraded performance affecting tenants on the US4 cluster and is currently investigating a solution to restore normal service levels. Impact: Users will experience issues with connectivity to their tenants. Services: Other clusters are not experiencing issues. Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 18:30 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Clients on the US4 cluster Services not impacted: Clients on other clusters Description: We are experiencing degraded performance with tenants on the US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users will experience issues with connectivity to their tenants. Services: Other clusters are not experiencing issues. Next Steps: We will provide updates as more information becomes available or by 18:30 UTC. Thank you for your patience as we work to restore full functionality.

Report: "Service Degraded - Cloud Ping check not working for some tenants"

Last update
postmortem

# Service Degraded - Cloud Ping Services Check Failing Intermittently on the US3 Cluster

## Root Cause Analysis

### Duration of incident
Discovered: Feb 19, 2025 14:18 UTC
Resolved: Mar 01, 2025 15:00 UTC

### Cause
The Cloud Ping service became unstable due to a large number of clients running ping checks at a 5-second interval, leading to widespread ping check failures.

### Effect
Clients received excessive Cloud Ping check alerts corresponding to failed pings.

### Action taken
_All times in UTC_

**02/13/2025-02/19/2025**
Auvik started receiving complaints about an unusually high number of internet connection failures. A general investigation begins with the customers reporting these issues.

**02/19/2025**
**14:18** - Auvik Engineering ascertains that the US3 cluster has several clients with a high number of internet connection checks set to the 5-second setting. An internal investigation then begins.
**17:42** - Auvik disables Cloud Ping alerts in the US3 cluster for those affected.
**17:53-18:44** - Auvik Engineering decides to restart the ping service to help clear the lag and re-stabilize it. A maintenance window is required to perform this action.
**19:00** - A one-hour maintenance window is started.
**19:21** - The work required under the maintenance window concludes early, and the services are back up and running. Cloud Ping alerts are restored for all clients.

**02/24/2025**
It is noted that while the ping service is behaving normally for most clients, there continue to be intermittent problems. It is determined that a complete cluster restart is required. To minimize the impact on all customers, a decision is made to perform maintenance on 03/01/2025.

**03/01/2025**
**12:00-15:00** - Auvik undergoes maintenance, during which US3 is safely restarted to restore the health of all services.

### Future consideration(s)
* Auvik has worked with several clients who had set up 5-second ping checks to regulate the flow and prevent system overload.
* Auvik will remove clients' ability to perform a 5-second cloud ping check and default the check frequency to one minute. The timing of this change will follow in future Auvik release notes.
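The future considerations above say the 5-second cloud ping check option will be removed and the frequency will default to one minute. The exact enforcement mechanism is not described; below is a small, hypothetical Java sketch of clamping a requested interval to a safe floor, with invented names used only to illustrate the idea.

```java
import java.time.Duration;

/** Hypothetical policy that clamps cloud ping check frequency to a safe floor. */
final class PingCheckIntervalPolicy {

    // Assumed values: the incident report only states a one-minute default going forward.
    private static final Duration MINIMUM_INTERVAL = Duration.ofMinutes(1);
    private static final Duration DEFAULT_INTERVAL = Duration.ofMinutes(1);

    /** Returns the interval actually scheduled for a tenant's ping check. */
    static Duration effectiveInterval(Duration requested) {
        if (requested == null || requested.isZero() || requested.isNegative()) {
            return DEFAULT_INTERVAL;
        }
        // A 5-second request is silently raised to the one-minute floor.
        return requested.compareTo(MINIMUM_INTERVAL) < 0 ? MINIMUM_INTERVAL : requested;
    }

    public static void main(String[] args) {
        System.out.println(effectiveInterval(Duration.ofSeconds(5)));  // PT1M
        System.out.println(effectiveInterval(Duration.ofMinutes(5)));  // PT5M
        System.out.println(effectiveInterval(null));                   // PT1M
    }
}
```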

resolved

Affected Services: Cloud Ping Check Cluster(s): All Clusters Description: The issue affecting the Cloud Ping Service check has been resolved, and regular service has been restored. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Cloud Ping Check Cluster(s): All Clusters Description: Performance of the Cloud Ping Service check is returning to normal, and our team is monitoring the situation to confirm the service remains fully functional. Impact: Service should operate normally; however, we continue monitoring for any irregularities. Services: All other monitoring, alerting, maps and integrations are not impacted. Next Steps: We will provide a final update once we confirm the issue is fully resolved. Thank you for your patience, and we apologize for any inconvenience caused.

identified

Affected Services: All alerts Cluster(s): All Clusters The alerting maintenance window has ended. Alerts will now flow as intended. Thank you for your patience as we work to restore full functionality.

identified

Affected Services: All alerts Cluster(s): All Clusters Auvik is posting an emergency maintenance window to disable alerts starting at 19:00 UTC. Alerts are scheduled to be re-enabled by 20:00 UTC. Thank you for your patience as we work to restore full functionality.

identified

Affected Services: Cloud Ping Check Cluster(s): All Clusters Description: We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience excessive false alerts. Resources failing over from US3 may affect alerting in other clusters. Services: All other monitoring, alerting, maps and integrations are not impacted. Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 19:00 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Cloud Ping Check Cluster(s): US3 Description: We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience excessive false alerts. Services: All other monitoring, alerting, maps and integrations are not impacted. Auvik recommends you disable your Cloud Ping Check and any customized Cloud Ping Check alerts until the problem is resolved. Next Steps: We will provide updates as more information becomes available or by 18:00 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Cloud Ping Check Cluster(s): US3 Description: We are currently experiencing degraded performance with the Cloud Ping Service check. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience excessive false alerts. Services: All other monitoring, alerting, maps and integrations are not impacted. Next Steps: We will provide updates as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.

Report: "Service Disruption - US3 is down"

Last update
postmortem

# Service Disruption - Clients on the US3 Cluster Unreachable

## Root Cause Analysis

### Duration of incident
Discovered: Feb 14, 2025 22:42 UTC
Resolved: Feb 17, 2025 16:30 UTC

### Cause
Auvik made changes to its system to address a third-party integration for a client on US3 that was not processing information as expected.

### Effect
This change exposed a bug in the code that caused the backend systems to become overloaded. This led to data corruption in the hierarchical tables, which caused further instability in the system for clients on the US3 cluster.

### Action taken
_All times in UTC_

**02/14/2025**
**22:42** - First signs of increased backend pressure on the systems on the US3 cluster.

**02/15/2025**
**15:40** - Backend pressure on the US3 cluster increases. Engineering begins to monitor its systems for performance issues.
**16:00-23:00** - Engineering attempts several interventions to reduce backend pressure. Success is intermittent. Ultimately, the root cause is identified as an abnormally growing dataset due to a bug.
**23:00** - The tenant associated with the data table is disabled. However, several backends in the cluster are in severe distress, requiring a complete reboot of the cluster. A reboot is initiated.

**02/16/2025**
**00:00** - Most tenants are observed to be functional.
**00:23** - Steps are initiated to re-enable the offending tenant. Unforeseen issues during this step create a cascading failure that results in another cluster reboot.
**00:42** - The offending tenant is disabled again, and US3 is rebooted.
**03:15** - The cluster is deemed stable.
**16:00** - Engineering diagnoses a further root cause of the instability arising from the offending tenant.
**17:00-21:30** - This tenant's data is cleaned up manually. Finally, the tenant is restarted successfully.

**02/17/25**
**16:30** - Engineering notes that some hierarchical datasets have been corrupted, causing some tenants' alert notifications to be set to default values. Engineering initiates a cleanup of all such occurrences.

**03/01/25**
A fix is applied to prevent further occurrences of such issues. All clusters are upgraded with this bug fix.

### Future consideration(s)
* The bug that caused the instability has been addressed.
* How Auvik processes data in its hierarchical tables is under review.
* Improved internal processes have been implemented to diagnose the cause of similar issues more quickly should they occur.

resolved

Affected clusters: US3 Description: The issue affecting US3 has been fully resolved. Normal service has been restored, and all systems are now operating as expected. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected clusters: US3 Description: We have encountered an issue during monitoring of the cluster. The cluster is non-operational at this time. Our team is actively working to restore sites on this cluster as quickly as possible. Impact: Tenants hosted on US3 are not accessible at this time. Next Steps: We will provide updates as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.

monitoring

Affected clusters: US3 Description: The majority of tenants have been restored and we are performing final cluster checks. Impact: N/A Next Steps: We will provide updates as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.

monitoring

Affected clusters: US3 Description: Our team has restarted US3. Tenants are starting and we are continuing to monitor recovery of all tenants. Impact: Some tenants hosted on US3 may still be starting and unreachable at this time. Next Steps: We will provide a final update once we confirm the issue is fully resolved. Thank you for your patience, and we apologize for any inconvenience caused.

identified

Affected clusters: US3 Description: We are currently experiencing an outage on US3. Our team is actively working to restore sites on this cluster as quickly as possible. Impact: Sites hosted on US3 are currently unreachable. Next Steps: We will provide updates as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.

Report: "Service Disruption - The US4 cluster is down"

Last update
postmortem

# Service Disruption - Clients on US4 are not accessible

## Root Cause Analysis

### Duration of incident
Discovered: Feb 03, 2025 09:27 UTC
Resolved: Feb 03, 2025 11:55 UTC

### Cause
Overload of backend resources for services on the US4 cluster.

### Effect
Tenants on the US4 cluster became inaccessible.

### Action taken
_All times in UTC_

**02/03/2025**
**09:27** - Engineering receives alerts that tenants on the US4 cluster are not accessible.
**09:33** - Engineering reacts to the outage and begins its investigation.
**09:53** - Engineering restarts US4 cluster backends to address their non-responsiveness.
**09:53-11:55** - The cluster is observed as it restarts and monitored as it comes up to full functionality. The incident is declared resolved.

### Future consideration(s)
* Auvik is currently improving backend monitoring and stability within the product and infrastructure. These improvements aim to help proactively mitigate potential issues in the future.

resolved

Affected Services: clients in US4 are now accessible. Description: The issue affecting US4 tenants has been resolved. Regular service has been restored, and all systems are operating as expected. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Clients on US4 Cluster Description: Our team has implemented a fix for the issue, and tenants are in the process of becoming fully accessible. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Services should be operating normally, with a few client sites still in the process of starting up. We continue to monitor for any irregularities. Next Steps: We will provide a final update once we confirm the issue is fully resolved. Thank you for your patience, and we apologize for any inconvenience caused.

identified

Affected Services: All clients are currently not accessible Description: Our team has identified the root cause of the outage. We are currently investigating a solution to restore normal service levels. Impact: While we work on the resolution, users may experience slower load times and intermittent connectivity issues. Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or by 11:00 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: All clients are currently not accessible Services not impacted: N/A Description: Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users are currently unable to access their tenants. Next Steps: We will provide updates as more information becomes available or by 11:00 UTC. Thank you for your patience as we work to restore full functionality.

Report: "Service Degraded - Discovery Consolidation on US6 Cluster"

Last update
postmortem

# Service Degraded - Newly discovered devices and consolidation are not working for clients on the US6 cluster.

## Root Cause Analysis

### Duration of incident
Discovered: Feb 02, 2025 17:00 UTC
Resolved: Feb 02, 2025 21:30 UTC

### Cause
A reorganization of engineering caused a permission change for tenant migrations.

### Effect
This change caused permission issues with a tenant migration to another cluster, which, in turn, also caused problems with consolidation on the same cluster.

### Action taken
_All times in UTC_

**02/03/2025**
**16:58** - A tenant is migrated off of the US6 cluster.
**18:00** - Engineering becomes aware of consolidation issues for clients on the US6 cluster and begins investigating.
**20:28** - The initial cause of the interruption is determined. Engineering disables the migration service.
**20:34** - The tenant migration that caused the issues is identified.
**20:47** - The root cause of the interruption of services is identified.
**21:15** - The underlying issues that caused the service interruption are fixed.
**21:45** - Tenant migrations are re-enabled in the consolidation service.
**22:01** - The problematic tenant is successfully migrated.
**22:39** - All services are confirmed to be running as intended.

### Future consideration(s)
* Auvik is reviewing the permission changes that have occurred and validating tests of the blast radius of those changes.
* Any such changes will be fully commented and documented so they can be tracked more easily.
* Auvik will run a regular test migration to validate tenant migration functionality.
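The root cause above was a permission change that broke a tenant migration and, with it, consolidation on the cluster. Purely as an illustration (Auvik has not described its migration tooling), a migration service could verify its required permissions up front and refuse to start rather than fail partway through; the Java sketch below uses hypothetical permission names and class names throughout.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical pre-flight check run before a tenant migration is started. */
final class MigrationPreflight {

    // Assumed permission names; the real set is not documented in the incident report.
    private static final List<String> REQUIRED = List.of(
            "tenant.read", "tenant.write", "cluster.migrate", "consolidation.update");

    /** Fails before any work begins if the service account is missing a permission. */
    static void verify(Set<String> grantedPermissions) {
        List<String> missing = REQUIRED.stream()
                .filter(p -> !grantedPermissions.contains(p))
                .toList();
        if (!missing.isEmpty()) {
            throw new IllegalStateException("Refusing to migrate: missing permissions " + missing);
        }
    }

    public static void main(String[] args) {
        try {
            // A grant set missing "cluster.migrate" fails fast instead of breaking mid-migration.
            verify(Set.of("tenant.read", "tenant.write", "consolidation.update"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```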

resolved

Affected Services: Discovery Consolidation Cluster(s): US6 Description: The issue affecting Discovery Consolidation has been fully resolved. Normal service has been restored, and all systems are now operating as expected. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Discovery Consolidation Cluster(s): US6 Description: Our team has implemented a fix for the issue affecting the consolidation of devices, and device consolidation performance is returning to normal. We are monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Service is returning to normal; however, we continue monitoring for irregularities. Services: Alerting was not impacted. Next Steps: We will provide a final update once we confirm the issue is fully resolved. Thank you for your patience, and we apologize for any inconvenience caused.

identified

Affected Services: Discovery Consolidation Cluster(s): US6 Description: Our team has identified the root cause of the degraded performance affecting new device discovery consolidation. We are currently investigating a solution to restore normal service levels. Impact: While we work on the resolution, users will continue to experience issues with new device discovery and consolidation. Services: Alerting is not impacted. Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or within the next hour. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Discovery Consolidation Cluster(s): US6 Description: We are currently experiencing degraded performance with the consolidation of devices. Our team is still actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience issues with new device discovery and consolidation. Services: Alerting is not impacted. Next Steps: We will provide updates as more information becomes available or by 20:30 UTC. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Discovery Consolidation Cluster(s): US6 Description: We are currently experiencing degraded performance with the consolidation of devices. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience issues with new device discovery and consolidation. Services: Alerting is not impacted. Next Steps: We will provide updates as more information becomes available or by 19:30 UTC. Thank you for your patience as we work to restore full functionality.

Report: "Service Disruption - US4 cluster is unreachable"

Last update
postmortem

# Service Disruption - Cluster US4 is unreachable for customers

## Root Cause Analysis

### Duration of incident
Discovered: Dec 13, 2024 17:03 UTC
Resolved: Dec 13, 2024 18:23 UTC

### Cause
Routine maintenance tasks in preparation for the upcoming weekend's maintenance caused an unexpected load on the system.

### Effect
The backend systems on the US4 cluster were overwhelmed, which caused a communication interruption with the tenants.

### Action taken
_All times in UTC_

**12/13/2024**
**16:57 -** Steps to prepare the system for the next day's maintenance are performed.
**17:03 -** Tenants on the US4 cluster become unreachable.
**17:09 -** The Auvik engineering team assembles stakeholders to investigate the service interruption.
**17:25 -** The backend systems on the US4 cluster begin to recover independently.
**17:39 -** Tenants begin to become reachable internally.
**17:40 -** Tenants become visible in the UI.
**17:57 -** Engineering addresses tenants that are not coming back up gracefully.
**18:23 -** Tenants on US4 have recovered.

### Future consideration(s)
* Auvik has altered its preparation for scheduled maintenance, eliminating processes that could affect system performance in the future.

resolved

Affected Services: US4 Cluster Description: The issue affecting US4 has been addressed and the system has recovered. Impact: Users should now be able to access their tenants on US4. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: US4 cluster Description: Our team has implemented a fix for the issue affecting the US4 cluster. Tenants are being restored and we are continuing to monitor the recovery progress. Impact: Any unreachable tenant is queued to be started and will be reachable within approximately 1 hour. Next Steps: We will provide a final update once we confirm the issue is fully resolved. Thank you for your patience, and we apologize for any inconvenience caused.

investigating

We are continuing to investigate this issue.

investigating

Affected Services: US4 Cluster Description: We are currently experiencing an outage on tenants hosted on our US4 cluster. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users will not be able to reach their tenants hosted in US4. Next Steps: We will provide updates as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.

Report: "Service Degraded - Alerting Integration"

Last update
resolved

Affected Services: Alert Integrations Cluster(s): All Description: The issue affecting Alert Integrations has been fully resolved. Normal service has been restored, and all systems are now operating as expected. Impact: Users should no longer experience any issues related to this incident. Next Steps: We are preparing a detailed Root Cause Analysis (RCA) report to provide further insights into the incident and preventive measures. Thank you for your patience, and we apologize for any inconvenience caused.

monitoring

Affected Services: Alert Integrations Cluster(s): All Description: Our team has implemented a fix for the alert integration issue, and performance has returned to normal. We are currently monitoring the situation to ensure stability and confirm that the service remains fully functional. Impact: Service should operate normally; however, we continue monitoring for any irregularities. Services: Monitoring and UI are not impacted. Next Steps: We will provide a final update once we confirm the issue is fully resolved. Thank you for your patience, and we apologize for any inconvenience caused.

identified

Affected Services: Alert Integrations Cluster(s): All Description: Our team has identified the root cause of the degraded performance affecting alert integrations and is currently investigating a solution to restore normal service levels. Impact: While we work on the resolution, users may continue to experience alerts that are not posted to their integrated systems. Services: Monitoring and UI are not impacted. Next Steps: Our team is actively working to resolve the issue and will provide updates as progress is made or within the next hour. Thank you for your patience as we work to restore full functionality.

investigating

Affected Services: Alert Integrations Cluster(s): All Description: We are currently experiencing degraded performance with alert integrations. Our team is actively investigating the root cause and working to resolve the issue as quickly as possible. Impact: Users may experience alerts that are not posted to their integrated systems. Services: Monitoring and UI are not impacted. Next Steps: We will provide updates as more information becomes available or within the next hour. Thank you for your patience as we work to restore full functionality.

Report: "Service Disruption - maps and some UI elements are unavailable"

Last update
postmortem

# Service Disruption - Maps and data not being populated in the UI

## Root Cause Analysis

### Duration of incident
Discovered: Oct 26, 2024 13:13 UTC
Resolved: Oct 26, 2024 17:59 UTC

### Cause
Requests to the permission cache prevented data from syncing properly across clusters.

### Effect
Key product features, like map functionality, became unavailable to users.

### Action taken
_All times in UTC_

**10/26/2024**
**11:00** Planned upgrade of backend data started.
**13:13** Issues were noted with permissions that affected UI components.
**13:30** Confirmed issues with Maps during the post-upgrade check.
**14:24** Attempt to restart services to enable UI updates.
**15:30** Decision was made to restart services from the beginning to flush out any lingering issues.
**15:47** Backend services restarted with an additional emphasis on enabling a clean restart.
**17:59** Service fully restored.

### Future consideration(s)
* Parallelize investigations and timebox responses for efficiency to avoid prolonged troubleshooting when a complete restart could resolve issues.
* Improve upgrade and detection protocols to catch errors earlier.

resolved

The source of the disruption has been resolved. Services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We are continuing to monitor the system as tenants are becoming available. We estimate that all tenants will be fully operational within the hour and will provide an update accordingly.

monitoring

The system has been restarted and services are coming back online. Tenants are beginning to start and become available at this time. We are monitoring the system and will provide an update in 30 minutes.

identified

We’ve identified the source of the service disruption with maps and other UI elements. We will be restarting the system to recover from this incident.

investigating

We’re experiencing disruption with network topology and select UI data not populating in the Auvik UI. Some data may be unavailable. We will continue to provide updates as they become available.

Report: "Service Disruption - Traffic Insights did not render protocol/services data the UI on the AU1 cluster"

Last update
postmortem

# Service Disruption - Traffic Insights did not render protocol/services data in the UI in the AU1 cluster

## Root Cause Analysis

### Duration of incident
Discovered: Oct 21, 2024 15:03 UTC
Resolved: Oct 21, 2024 16:15 UTC

### Cause
A recent refactoring of the Traffic Insights services inadvertently altered the offset management behavior.

### Effect
This caused traffic identification to not render properly in the UI for clients in the AU1 cluster.

### Action taken
_All times in UTC_

**10/21/2024**
**15:01-15:35** Engineering is alerted to an issue involving Traffic Insights data for clients in the AU1 cluster not rendering correctly. Engineering begins its investigation. It is noted that traffic has not been rendered correctly since October 19 at approximately 10:00.
**15:41-15:55** A solution to fix the problem is designed and implemented. Traffic Insights data is rendered properly in the UI.
**16:15** Traffic data classification is confirmed to be flowing as it should in the UI.

### Future consideration(s)
* Timing of changes and a more thorough review process of results will be implemented for any future work performed on the affected services.
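The cause cited above is a refactor that altered offset management in the Traffic Insights pipeline, leaving protocol/services data unrendered until the offsets were corrected. The report does not say which streaming technology is involved; the generic Java sketch below, using invented types, only illustrates why committing an offset before the data is actually processed can make records appear to vanish.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical consumer of traffic-flow records keyed by a monotonically increasing offset. */
final class TrafficInsightConsumer {

    private final AtomicLong committedOffset = new AtomicLong(-1);

    record FlowRecord(long offset, String protocol, long bytes) {}

    /**
     * At-least-once behaviour: the offset is committed only after the record has been
     * rendered/stored. If a refactor flips this ordering (commit first, process second),
     * a failure between the two steps silently skips records, which is consistent with
     * data "not rendering" until the offsets are repaired.
     */
    void consume(List<FlowRecord> batch) {
        for (FlowRecord r : batch) {
            if (r.offset() <= committedOffset.get()) {
                continue; // already processed
            }
            render(r);                        // 1. do the work first
            committedOffset.set(r.offset());  // 2. then record progress
        }
    }

    private void render(FlowRecord r) {
        System.out.printf("offset=%d protocol=%s bytes=%d%n", r.offset(), r.protocol(), r.bytes());
    }

    public static void main(String[] args) {
        new TrafficInsightConsumer().consume(List.of(
                new FlowRecord(0, "https", 120_000),
                new FlowRecord(1, "dns", 4_200)));
    }
}
```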

resolved

Clients on the AU1 cluster that subscribed to Auvik Performance did not have protocol/services data rendered in Traffic Insights from approximately 10:00 UTC on October 18 until 14:00 UTC on October 21. Data rendering is now working as expected. We apologize for the interruption of services. An RCA will follow after a complete investigation.

Report: "Service Disruption - Clients on the US3 cluster are receiving 500 errors when trying to access their sites"

Last update
postmortem

# Service Disruption

## Backend Resource Strain and Service Disruption over a multiple-day period

### Root Cause Analysis

### Duration of incident

Discovered: Oct 07, 2024 09:56 - UTC Resolved: Oct 07, 2024 19:00 - UTC

Discovered: Oct 14, 2024 10:55 - UTC Resolved: Oct 14, 2024 14:00 - UTC

Discovered: Oct 16, 2024 05:42 - UTC Resolved: Oct 17, 2024 13:37 - UTC

### Cause

The primary cause of this multi-day incident was a combination of backend instability and resource management challenges triggered by technical bugs and configuration issues. Specifically, a non-thread-safe map in the Autotask integration led to excessive CPU consumption, compounded by frequent tenant migrations and high memory usage across multiple clusters. Excessive API requests through the Web Application Firewall (WAF) and misconfigurations further strained backend resources, resulting in widespread service disruptions and extended recovery time.

### Effect

The incident significantly impacted service availability and performance across multiple clusters. Users experienced frequent 500 and 504 errors, delays in accessing tenant data, and slow UI loading times. The high CPU usage and backend instability led to tenant migrations and disrupted connectivity, causing certain features to become intermittently unavailable. Additionally, the ongoing backend strain increased support cases and required multiple restarts and resource reallocations, prolonging the disruption and leading to a degraded experience for affected users over several days.

### Action taken

_All times in UTC_

**10/07/2024 - Initial Detection and Escalation**

**09:56 - 10:02** Key symptoms identified:

* High heap usage across multiple backends.
* Communication failures between nodes in clusters CA1 and US1, causing tenant access issues.
* Multiple tenants stuck in a verifying state.

**10:20 - 11:30** Escalated mitigations: decided to restart CA1, followed by US1, to address node communication issues. The status page is updated to notify users of ongoing disruptions.

**12:19 - 13:06** Status recap and monitoring of ongoing issues, including continued high heap usage, tenant availability errors (504s) due to lost seed nodes, and investigation of tenant verification issues.

**14:00 - 19:00** Work continues on the model instability investigation and backend performance issues, with some partial fixes applied.

**19:00** Temporary workaround applied to stabilize model flapping.

**10/14/2024 - Continued Investigation and Remediation**

**11:30** Focused mitigation for US4 clients to stabilize tenant access and service performance.

**14:00** Affected sites and tenants restarted, resolving some availability issues.

**10/16/2024 - Addressing WAF and High CPU Issues**

**17:15** WAF mitigation steps taken, blocking excessive requests from specific IPs.

**18:31** WAF issues confirmed resolved after blocking IPs responsible for high traffic.

**10/16/2024 - High CPU Issues and Tenant Rebalancing**

**10:55 - 11:25** High CPU usage detected on multiple backends. Affected backends are capped, restarted, and drained to mitigate load.

**12:12 - 12:29** Specific problematic tenants were identified that triggered frequent backend moves and further resource strain.

**15:00 - 18:00** Troubleshooting and tenant isolation continue; problematic tenants are isolated, and partial recovery is achieved.

**10/17/2024 - Root Cause Fixes and Final Resolution**

**10:35** Further diagnosis identifies the root cause as the non-thread-safe map, leading to high CPU usage.

**13:27** A short-term fix was applied to stabilize the problematic tenant and manage resource allocation.

**13:37** Confirmed complete restoration of affected tenants and systems.

### Future consideration(s)

* Auvik has installed a repair for the model identification instability.
* Auvik has implemented a repair for tenants stuck in a verifying state that cannot locate their tenant manager.
* Auvik has implemented a fix to prevent the identified third-party integration from locking CPU processes and causing backend failures due to high resource consumption.
* Auvik has installed a fix to prevent long device names from causing continual tenant failures across backends.
* Auvik has added enhanced monitoring for excessive backend tenant failures.
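
The root cause called out above, a non-thread-safe map accessed by concurrent integration workers, is a classic JVM failure mode: unsynchronized writes to a plain `HashMap` can corrupt its internal structure and leave threads spinning at full CPU. The sketch below is illustrative only and does not reflect Auvik's actual Autotask integration code; the class and method names are hypothetical, and it simply contrasts the unsafe pattern with the conventional `ConcurrentHashMap` fix.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache used by an integration worker pool (not Auvik's actual code).
public class IntegrationCache {

    // UNSAFE: a plain HashMap mutated by many worker threads can corrupt its
    // internal buckets; on older JVMs a corrupted bucket chain can loop forever,
    // pinning a CPU core in the way the postmortem describes.
    private final Map<String, String> unsafeCache = new HashMap<>();

    // SAFE: ConcurrentHashMap tolerates concurrent reads and writes without
    // external locking, avoiding the corruption/CPU-spin failure mode.
    private final Map<String, String> safeCache = new ConcurrentHashMap<>();

    public void recordUnsafe(String tenantId, String payload) {
        unsafeCache.put(tenantId, payload); // racy when called from multiple threads
    }

    public void recordSafe(String tenantId, String payload) {
        safeCache.put(tenantId, payload); // thread-safe
    }

    public static void main(String[] args) throws InterruptedException {
        IntegrationCache cache = new IntegrationCache();
        // Simulate many integration workers writing concurrently.
        Runnable writer = () -> {
            for (int i = 0; i < 100_000; i++) {
                cache.recordSafe("tenant-" + (i % 50), "sync-" + i);
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("entries: " + cache.safeCache.size());
    }
}
```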

resolved

The fix has been implemented for sites with 500 errors and inaccessible sites. The source of the disruption has been resolved, and services have been fully restored.

monitoring

We’ve identified the source of the service disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We are implementing the fix and will keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption with client sites on the US3 Cluster. When they try to access their sites, they receive 500 errors. We will continue to provide updates as they become available.

Report: "Service Disruption - Clients on the US4 cluster are receiving 500 errors when trying to access their sites"

Last update
postmortem

This incident shares its Root Cause Analysis with the report above, "Service Disruption - Clients on the US3 cluster are receiving 500 errors when trying to access their sites"; please see that report's postmortem for the full details of the multi-day backend resource strain and its remediation.

resolved

The disruption affecting client sites on the US4 cluster, which were receiving 500 errors when trying to access their sites, has been resolved. Services have been restored. A few large client sites are still verifying and should recover shortly. A Root Cause Analysis (RCA) will follow after a full review.

monitoring

We’ve identified the source of the service disruption with sites on the US4 cluster. We have performed an emergency cluster restart and are monitoring the situation. Sites on the cluster are recovering, and we anticipate all sites will be up and running by 14:00 UTC. We’ll keep you posted on a resolution.

identified

We’re experiencing disruption with client sites on the US4 Cluster. When they try to access their sites, they receive 500 errors. Auvik requires an emergency restart of the US4 cluster. This action will take about half an hour. Tenants on the US4 cluster are expected to start recovering after the restart, and all sites are expected to be fully functional within 1.5 hours of the restart.

investigating

We’re experiencing disruption with client sites on the US4 Cluster. When they try to access their sites, they receive 500 errors. We will continue to provide updates as they become available.

Report: "Service Disruption - For several tenants on the CA1 and US3 clusters"

Last update
postmortem

This incident shares its Root Cause Analysis with the report "Service Disruption - Clients on the US3 cluster are receiving 500 errors when trying to access their sites" above; please see that report's postmortem for the full details of the multi-day backend resource strain and its remediation.

resolved

We experienced disruption with several tenants on clusters CA1 and US3. Sites were unavailable. The source of the disruption has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full internal review.

identified

The disruption to tenants on cluster CA1 appears to have been addressed. We are monitoring the situation to validate all sites are responding as expected. We will update this page when the validation is complete.

identified

We’re experiencing disruption with several tenants on cluster CA1. Some sites are responding slowly in the UI. We are also receiving reports of 401 errors when accessing sites. We will continue to provide updates as they become available.

identified

We’re experiencing disruption with several tenants on cluster CA1. Some sites are responding slowly in the UI. We will continue to provide updates as they become available.

monitoring

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available. We are receiving reports of UI responsiveness issues for clients on the CA1 cluster and are investigating. Clients on US3 are continuing to start up. We will continue to monitor this process throughout the action.

monitoring

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available. Clients on the CA1 cluster have recovered. Clients on US3 are continuing to start up. We will continue to monitor this process throughout the action.

monitoring

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available. The downtime for CA1 proceeds as expected, with 90% of sites reporting up. The remaining 10% are being monitored for completion. Clients on US3 have begun their downtime window. We will continue to monitor this process throughout the action.

identified

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available. Auvik is required to restart all tenants on the US3 cluster at 15:30 UTC (11:30 EDT), a delay from the previously posted 14:50 UTC restart time. This will create a maintenance window of up to 1.5 hours, with most sites recovering before that.

identified

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available. Auvik has begun restarting the CA1 cluster. This will take up to 1.5 hours, but most sites will recover before that.

identified

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available. Auvik is required to restart all tenants on the US3 cluster at 14:50 UTC (10:50 EDT). This will create a maintenance window of up to 1.5 hours, with most sites recovering before that.

identified

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available. Auvik is required to restart all tenants on the CA1 cluster at 14:35 UTC (10:35 EDT). This will create a maintenance window of up to 1.5 hours, with most sites recovering before that.

investigating

We’re experiencing disruption with several tenants on clusters CA1 and US3. Sites are unavailable. We will continue to provide updates as they become available.

Report: "Service Disruption - Shared Collectors lost association with shared sites after maintenance"

Last update
postmortem

# Service Disruption - Customers with Shared Collectors on the US2 Cluster Lost the Association with the Sites Monitored by the Shared Collectors

## Root Cause Analysis

### Duration of incident

Discovered: Oct 07, 2024 13:00 UTC Resolved: Oct 07, 2024 19:00 UTC

### Cause

During maintenance on October 05, 2024, at 10:00 UTC, modifications were made to address a bug with protocol handling on the US2 cluster. This produced an excessive load on the system, which caused the cluster's start-up to fail. The cluster was successfully restarted.

### Effect

The information for shared collector sites was not processed properly on the restart, and the association with the sites for those collectors was removed. This caused the sites not to be monitored during this time.

### Action taken

_All times in UTC_

**10/05/2024**

**11:00 -** Scheduled maintenance upgrade of the system begins.

**11:44 -** The US2 cluster is started.

**12:20 -** The code change is enabled for clients on the US2 cluster to address the protocol handling.

**12:22 -** Tenants are started on the US2.

**12:55 -** The US2 cluster is found to be disconnected. Auvik takes steps to be able to restart the US2 cluster.

**13:10 -** Tenants are started on the US2 cluster for the second time, and the maintenance banner is removed from the Auvik site.

**10/07/2024**

**9:00-12:00 -** Auvik support receives multiple reports of client issues. Data is gathered from tenants on several different clusters for different problems. This data is collected and sent to engineering.

**12:00-13:00 -** Engineering determines that multiple issues in the product occurred during the scheduled maintenance. These issues are not associated with each other and will need separate teams to address them.

**13:00-15:30 -** Engineering is able to determine that the shared collectors of clients on the US2 cluster have lost their association with their sites.

**15:30-18:00 -** Engineering investigates which clients were explicitly affected by the loss of the collectors and their states before the maintenance on October 05.

**18:00-18:20 -** The decision to reset the shared collector states to what they were before maintenance is made, and Engineering is given the go-ahead to proceed with the reset.

**18:20 -** The process for the restore is executed.

**19:00 -** Sites with shared collectors are validated to be restored to their state before the October 05 maintenance.

### Future consideration(s)

* Improvements to the process for system health after maintenance will be made to protect against significant changes with collectors connected to the system.
* Engineering is investigating improvements to the system to provide safeguards against the system processing changes to sites with shared collectors due to a lack of data after maintenance.
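
The second future consideration above describes safeguarding against bulk association changes that arrive when the system has little data right after maintenance. The following sketch is a generic illustration of that idea under assumed names (`AssociationChangeGuard`, the thresholds, and the quiet-period length are all hypothetical, not Auvik's implementation): a large disassociation proposed shortly after a maintenance window is held for review instead of being applied automatically.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative guard, not Auvik's implementation: refuse to commit a bulk
// change to collector/site associations when it arrives too soon after a
// maintenance window and touches an implausibly large share of collectors.
public class AssociationChangeGuard {

    private final Instant maintenanceEndedAt;  // end of the last maintenance window
    private final Duration quietPeriod;        // how long to distrust bulk changes
    private final double maxAffectedFraction;  // e.g. 0.05 = 5% of collectors

    public AssociationChangeGuard(Instant maintenanceEndedAt,
                                  Duration quietPeriod,
                                  double maxAffectedFraction) {
        this.maintenanceEndedAt = maintenanceEndedAt;
        this.quietPeriod = quietPeriod;
        this.maxAffectedFraction = maxAffectedFraction;
    }

    /** Returns true if the proposed bulk change may be applied automatically. */
    public boolean allow(int collectorsAffected, int collectorsTotal, Instant now) {
        boolean insideQuietPeriod = now.isBefore(maintenanceEndedAt.plus(quietPeriod));
        double fraction = (double) collectorsAffected / Math.max(collectorsTotal, 1);
        // Inside the quiet period, large disassociations are held for review
        // instead of being applied, which would have prevented the silent loss
        // of shared-collector associations described above.
        return !(insideQuietPeriod && fraction > maxAffectedFraction);
    }

    public static void main(String[] args) {
        Instant maintenanceEnd = Instant.parse("2024-10-05T13:10:00Z");
        AssociationChangeGuard guard =
                new AssociationChangeGuard(maintenanceEnd, Duration.ofHours(24), 0.05);

        // A change touching 40% of 500 collectors one hour after maintenance is held.
        System.out.println(guard.allow(200, 500, Instant.parse("2024-10-05T14:10:00Z"))); // false
        // The same change two days later is allowed through.
        System.out.println(guard.allow(200, 500, Instant.parse("2024-10-07T14:10:00Z"))); // true
    }
}
```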

resolved

We experienced a disruption in which Shared Collectors lost their association with shared sites after maintenance on Saturday at 13:00 UTC. The current fix is to re-associate the Shared Collector with the site. We are investigating to determine a root cause.

Report: "Service Disruption - Issues with Customer-Created Filters & Devices In a Stuck State"

Last update
resolved

Auvik has addressed the issues with customer-created device filters. These filters were not correctly filtering the desired devices, which caused problems with any customer-created device filter for maintenance windows, alerting, customer OID pollers, discovery settings, and managed credentials. Additionally, Auvik is still addressing another issue concerning devices becoming stuck in a state with their credentials; the fix will prevent the need for additional restarts. (Update) A fix for this was implemented for a limited number of clients to validate through testing over the next two weeks ahead of a permanent, Auvik-wide fix. These clients were informed of this change. Auvik is still on track for a release to all customers after the September 7th maintenance window. We apologize for the inconvenience caused to our customers. Please report any follow-up issues to Auvik support.

identified

Auvik is currently addressing issues with customer-created device filters. These filters are not properly filtering the desired devices, causing problems with any customer-created device filter for maintenance windows, alerting, customer OID pollers, discovery settings, and managed credentials. Other areas of the product may also be impacted. The issue does not affect default Auvik filtering. Additionally, Auvik is addressing another issue concerning devices becoming stuck in a state with their credentials; the fix will prevent the need for additional restarts. (Update) This fix will be implemented under a testing regimen after this maintenance and will be available to all customers after the regular maintenance scheduled for September 7th, 2024. A permanent fix for the customer-created device filter issue will be released under a scheduled maintenance window on Saturday, August 24, 2024.

identified

Auvik is currently addressing issues with customer-created device filters. These filters are not properly filtering the desired devices, causing problems with any customer-created device filter for maintenance windows, alerting, customer OID pollers, discovery settings, and managed credentials. Other areas of the product may also be impacted. The issue does not affect default Auvik filtering. Additionally, Auvik is addressing another issue concerning devices becoming stuck in a state with their credentials; the fix will prevent the need for additional restarts. A permanent fix for these outstanding issues will be released under a scheduled maintenance window on Saturday, August 24, 2024. Any additional updates will be posted here.

Report: "Data lag in AU1 cluster for Syslog and Traffic Insights"

Last update
resolved

Auvik experienced an interruption in processing Syslog and Traffic Insights data on the AU1 cluster while performing standard system upkeep. This interruption has delayed the processing of live Syslog and Traffic Insights data to the UI. There has been no loss of data. We anticipate recovering this delay over the next several hours as the backlog is processed. If any change affects the outcome, we will post an update. We apologize for the delay in services.

Report: "Service Disruption - Performance issues with clients on the US4 cluster."

Last update
postmortem

Please see [Performance Issue - Map Rendering Delayed on US Cluster Customers](https://status.auvik.com/incidents/6t543bp2xdwr) for post mortem.

resolved

We've applied the fix for the performance disruption with services for clients on the US4 cluster. It is related to the incident earlier today. Tenants in the other clusters are not affected. There may be some associated lag as the system comes up to full speed. Alerting will become active for US4 clients at 22:00 UTC (18:00 EDT). We apologize for the issues today. The source of the disruption has been resolved. Services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review and will be posted to the original incident.

monitoring

We've identified the source of the performance disruption with services for clients on the US4 cluster. It is related to the incident earlier today. Tenants in the other clusters are not affected. The maintenance window began at 20:00 UTC (16:00 EDT), during which updates to the UI and alerts will be delayed. This maintenance window is expected to last for around 90 minutes. We apologize for the continued issues. We will continue to provide updates as they become available.

identified

We've identified the source of the performance disruption with services for clients on the US4 cluster. It is related to the incident earlier today. Tenants in the other clusters are not affected. We will begin a maintenance window at 20:00 UTC (16:00 EDT), during which updates to the UI and alerting will be delayed. We apologize for the continued issues. We will continue to provide updates as they become available.

investigating

We’re experiencing a performance disruption with services for clients on the US4 cluster. It is related to the incident earlier today. Tenants in the other clusters are not affected. We apologize for the continued issues. We will continue to provide updates as they become available.

Report: "Service Disruption - Auvik Site Performance and Device Health Issues"

Last update
postmortem

Please see [Performance Issue - Map Rendering Delayed on US Cluster Customers](https://status.auvik.com/incidents/6t543bp2xdwr) for post mortem.

resolved

The fix for service disruption with site performance and device discovery has been fully deployed and implemented. The source of the disruption has been resolved, and services have been fully restored. There may be a slight delay with some connectors reconnecting and map updating, but this will resolve itself. Delays with alerts have ended, and sites are again communicating as normal. A Root Cause Analysis (RCA) will follow after a full review.

monitoring

We’ve identified the source of the service disruption with site performance and device discovery. In some cases, this may include the Map and Network dashboard. We have deployed the hotfix. The application is taking longer to recover than anticipated but is recovering. We are anticipating another hour for all sites to recover. During this window, alerting and site communication may be interrupted or delayed. We apologize for this inconvenience. We will monitor the progress and provide updates here and via the banner on the website.

monitoring

We’ve identified the source of the service disruption with site performance and device discovery. In some cases, this may include the Map and Network dashboard. We have begun deploying the hotfix, which is estimated to take approximately two hours to fully deploy. During this window, alerting and site communication may be interrupted or delayed. We apologize for this inconvenience. We will monitor the progress and provide updates here, as well as via the banner on the website.

identified

We’ve identified the source of the service disruption with site performance and device discovery. In some cases, this may include the Map and Network dashboard. We will deploy a hotfix to the affected clusters starting at 15:30 UTC (11:30 EDT), which will take approximately two hours to deploy. During this window, alerting and site communication may be delayed. We apologize for this inconvenience. We will monitor the progress and provide updates here, as needed.

identified

We’ve identified the source of the service disruption with site performance and device discovery. In some cases, this may include the Map and Network dashboard. We are currently testing a fix for the issue and working to restore service as quickly as possible.

Report: "Performance Issue - Map Rendering Delayed on US Cluster Customers"

Last update
postmortem

# Service Disruption - Data update delays caused performance delays and delays in Map rendering.

## Root Cause Analysis

### Duration of incident

Discovered: Jun 1, 2024, 20:30 - UTC Resolved: Jun 7, 2024, 13:05 - UTC

### Cause

Updates performed during scheduled maintenance on June 1, 2024, caused an improper assertion on data in the Auvik application’s data stream.

### Effect

The service disruption resulted in a significant increase in the data in the streaming queue, leading to noticeable delays in data processing for our customers. This was particularly evident in map rendering and updating, impacting the real-time visibility of our services for our stakeholders.

### Action taken

_All times in UTC_

**06/01/2024**

**20:30 -** Auvik support alerts the on-call engineering team of abnormal CPU spikes in processing data.

**20:59 -** The engineering team begins its initial investigation.

**21:11 -** Engineering determines that the system is, indeed, seeing increased data input within the system.

**21:15 -** The team works to identify the cause of the increased input.

**23:00 -** The team identifies the specific data flows and increased input and turns off the presumed change that caused these issues.

**06/02/2024**

**02:00 -** The team implements the changes into one cluster and waits to validate that the change resolves the issues.

**11:28 -** It is reported that the change did not resolve the issue and the ongoing incident. The engineering team assembles to determine the root cause.

**11:45-17:00 -** Engineering continues investigating the issue to determine a fix.

**17:00 -** The root cause of the issues is determined, and the next steps to resolve the incident are formulated.

**17:00-21:30 -** A fix for the issues is written and tested successfully.

**22:45 -** A plan for deploying the fix to production is formulated.

**06/03/2024**

**01:00-02:45 -** The proposed fix is deployed to one cluster to test and validate its correctness in the production environment.

**13:30 -** The team validates the desired results in the test cluster and formulates a plan for the remaining clusters.

**16:00-21:30 -** The fix is deployed to the remaining clusters. The team will wait for the backlog to catch up.

**06/04/2024**

**05:00-18:55 -** Engineering makes several changes to increase resourcing and velocity of the backlog processing. During this time, all non-US clusters recover from their data delay.

**23:15 -** The US4 cluster recovers from its data delay.

**06/06/2024**

**09:00 -** The US3 and US5 clusters recover from their data delay.

**06/07/2024**

**08:35 -** The US1 cluster recovers from its data delay.

**13:05 -** The US2 cluster recovers from its data delay. The incident is closed on the status page.

### Future consideration(s)

* Adjust the alert workflow to understand when problems arise with the product in a more timely manner.
* Auto-scale resources to adjust to dynamic demands for resourcing.
* Investigate the testing environment to provide more valuable results when implementing system changes that reflect the actual impact on the production systems.
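
Two of the future considerations above, earlier alerting on lag growth and auto-scaling resources, amount to watching the streaming backlog and reacting before it becomes customer-visible. The sketch below is a generic illustration under assumed names and thresholds, not Auvik's monitoring stack: it flags the case where lag is both above a limit and still growing across recent samples.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch, not Auvik's implementation: track streaming-queue lag
// over time and raise a signal when lag is both large and growing, so that
// extra processing capacity can be added before customers notice map delays.
public class BacklogLagMonitor {

    private final long lagAlertThreshold;   // e.g. messages behind the head of the stream
    private final int window;               // number of recent samples to compare
    private final Deque<Long> samples = new ArrayDeque<>();

    public BacklogLagMonitor(long lagAlertThreshold, int window) {
        this.lagAlertThreshold = lagAlertThreshold;
        this.window = window;
    }

    /** Record the latest measured lag and decide whether to scale out. */
    public boolean recordAndCheck(long currentLag) {
        samples.addLast(currentLag);
        if (samples.size() > window) {
            samples.removeFirst();
        }
        boolean overThreshold = currentLag > lagAlertThreshold;
        // "Growing" means the newest sample exceeds the oldest in the window,
        // i.e. consumers are not keeping up even after previous adjustments.
        boolean growing = samples.size() == window && currentLag > samples.peekFirst();
        return overThreshold && growing;
    }

    public static void main(String[] args) {
        BacklogLagMonitor monitor = new BacklogLagMonitor(1_000_000, 3);
        long[] observedLag = {200_000, 900_000, 1_400_000, 2_100_000};
        for (long lag : observedLag) {
            if (monitor.recordAndCheck(lag)) {
                System.out.println("Lag " + lag + ": scale out / page on-call");
            } else {
                System.out.println("Lag " + lag + ": within tolerance");
            }
        }
    }
}
```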

resolved

The impact of any remaining lag is negligible for customers on the US2 cluster and should resolve itself. All other clusters are running optimally. We are closing this incident at this time. A Root Cause Analysis (RCA) will follow after completing a full review.

monitoring

The US1 cluster has fully recovered. Most of the US2 cluster’s clients have fully recovered. The delay is only in processing interface information in the map and only applies to a small subsection of clients. The estimate for recovery on this final part of the data lag for this map component depends on the influx of data it receives today. All other parts of the product are running normally. We continue actively monitoring the situation while waiting for this final component to recover from its data lag. We understand the impact of this incident on your experience with the product and sincerely apologize for the inconvenience it has caused.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US1 and US2 clusters. We are waiting as the informational lag works through the data backlog. The US1 cluster has had most of its clients fully recover. However, a very small subsection of clients still has a data lag and is delayed. Due to a heavy influx of data, the cluster is processing the data but maintaining the backlog size. The delay is only in processing interface information in the map. The US2 cluster is still delayed but is still decreasing in lag. Customers are only experiencing interface information delays in the map. We anticipate a full recovery by 11:00 UTC (7:00 EDT) tomorrow. Rest assured, the dashboard information and alerts remain unaffected, providing up-to-date and accurate information. We are diligently and actively monitoring the situation. We are waiting for the remaining components to catch up and be current. We understand the impact of this incident on your experience with the product and sincerely apologize for the inconvenience it has caused.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US1 and US2 clusters. We are waiting as the informational lag works through the data backlog. The US1 cluster has had most of its clients fully recover. However, a very small subsection of clients still has a data lag and is delayed. Due to a heavy influx of data, the cluster is processing the data but maintaining the backlog size. The delay is only in processing interface information in the map. The US2 cluster is still delayed but is still decreasing in lag. Customers are only experiencing interface information delays in the map. Rest assured, the dashboard information and alerts remain unaffected, providing up-to-date and accurate information. We are diligently and actively monitoring the situation. We are waiting for the remaining components to catch up and be current. We understand the impact of this incident on your experience with the product and sincerely apologize for the inconvenience it has caused.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US1 and US2 clusters. We are waiting as the informational lag works through the data backlog. The US1 cluster has almost recovered, with only a small subset of customers experiencing interface information delays in the map. The US2 cluster is still delayed; customers are only experiencing interface information delays in the map. Customers on the US3 and US5 clusters have fully recovered since the last update. Dashboard information and alerts are not affected and are providing up-to-date information. We are actively monitoring the situation and waiting for the remaining components to catch up and be current. We understand the impact of this incident on your experience with the product and we sincerely apologize for the inconvenience it has caused.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog. Dashboard information and alerts are not affected and are providing up-to-date information. The maps for a small percentage of customers on the US5 cluster still show delayed inferred connections, but the rest of the map should be current. The inferred connection delay should conclude in the next several hours. Clients on US clusters US1, US2, and US3 continue to decrease their lag. We now estimate it will take another 10-12 hours for all clusters' Map discovery and rendering to be current again. Several components are again current in the map. We are waiting for the remaining components to catch up and be current. We continue to monitor this. We understand the impact this is having on your experience with the product and apologize for any impact this may be having on you and your clients.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog. Dashboard information and alerts are not affected and are providing up-to-date information. The maps for customers on the US5 cluster still show delayed inferred connections, but the rest of the map should be current. The inferred connection delay is still dropping and should become current in the next 4 hours. Clients on US clusters US1, US2, and US3 are continuing to decrease their lag. We now estimate it will take another 18-20 hours for all clusters' Map discovery and rendering to be current again. We will continue to monitor this. We understand the impact this is having on your experience with the product and apologize for any impact this may be having on you and your clients.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog. Dashboard information and alerts are not affected and are providing up-to-date information. The maps for customers on the US5 cluster still show delayed inferred connections, but the rest of the map should be current. The inferred connection delay is still dropping and should become current in the next 4-8 hours. Clients on US clusters US1, US2, and US3 are continuing to decrease their lag. We estimate it will take another 24 hours for all clusters' Map discovery and rendering to be current again. We will continue to monitor it. We apologize for the impact this may be causing you and your clients.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog. Dashboard information and alerts are not affected and are providing up-to-date information. We still expect clients on the US5 cluster to recover from their lag sometime during the evening, most likely in the next four hours. Clients on US clusters US1, US2, and US3 are slowly decreasing their lag. We do not have an estimate of when their Map discovery and rendering will be current, but we continue monitoring it closely. We apologize for the impact this may be causing you and your clients. We continue to monitor progress and will post relevant updates.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog. Dashboard information and alerts are not affected and are providing up-to-date information. We expect clients on the US5 cluster to recover from their lag at some point during the evening. Clients on US clusters US1, US2, and US3 are slowly decreasing their lag. We do not have an estimate of when their Map discovery and rendering will be current, but we continue monitoring it closely. We apologize for the impact this may be causing you and your clients. We continue to monitor progress and will post relevant updates.

monitoring

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US5). We are waiting as the informational lag works through the data backlog. The lag for clients on the US4 cluster should be recovered in the next hour. Dashboard information and alerts are not affected and are providing up-to-date information. We apologize for the impact this may be causing you and your clients. We continue to monitor progress and will post updates throughout the delay.

identified

We’ve identified the source of the performance issue with Map discovery and rendering for customers running under the US clusters (US1, US2, US3, US4, US5). We are waiting as the informational lag works its way through the data backlog. Dashboard information and alerts are not affected and are providing up-to-date information. All relevant resources have been upgraded to provide the most expedient resolution. We apologize for the impact this may be causing you and your clients. We will continue to monitor the progress and post updates throughout the day.

Report: "Performance Issue - Device Discovery Delayed"

Last update
postmortem

# Performance Disruption - Delays with New Device Discovery

## Root Cause Analysis

### Duration of incident

Discovered: Apr 25, 2024 14:00 - UTC Resolved: Apr 26, 2024 01:30 - UTC

### Cause

Changes were placed into production to address findings from the Auvik March 15, 2024, incident. The changes were not behind a feature flag to prevent them from affecting production data.

### Effect

The changes were not granted proper permissions, which caused a data crash loop. This delayed newly discovered devices.

### Action taken

_All times in UTC_

**04/24/2024**

**14:00-17:30 -** Updated code merged into production code to address the bug discovered in the Auvik March 15, 2024, incident.

**04/25/2024**

**14:35 -** An approved tenant migration causes a crash loop of data for newly discovered devices.

**18:04 -** The Auvik engineering team responsible for the implemented change is made aware of the crash loop and delay in rendering new devices in the product.

**18:17 -** Engineering determines the cause of the crash loop and adjusts permissions for the implemented changes.

**18:23 -** The changes implemented for permissions have the desired effect, and consumer lag begins to improve. Data will be delayed as the lag catches up to the live production data.

**04/26/2024**

**01:30 -** Consumer lag fully recovers, and all data is current. The incident is closed.

### Future consideration(s)

* Changes have been implemented to automatically adjust service account permissions for code improvements.
* An internal review was performed of the code change review and approval processes for production.
* Adjustments to internal alerting were made to highlight the prioritization of production-impacting changes.
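
The cause section notes that the change was not behind a feature flag. A minimal sketch of flag gating is shown below, with a hypothetical flag name and a simple in-memory store standing in for whatever flag service a production system would actually use; the point is that a merged change stays dormant until explicitly enabled, and can be switched off again without a redeploy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch, not Auvik's code: gate a risky code path behind a flag
// that defaults to OFF, so a change merged to production stays dormant until
// it is deliberately enabled (and can be disabled again without a deploy).
public class FeatureFlags {

    private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

    public boolean isEnabled(String flagName) {
        // Unknown flags default to false: new behaviour never leaks into
        // production data by accident, which is the gap described above.
        return flags.getOrDefault(flagName, false);
    }

    public void set(String flagName, boolean enabled) {
        flags.put(flagName, enabled);
    }

    public static void main(String[] args) {
        FeatureFlags flags = new FeatureFlags();

        // Hypothetical flag name for a new discovery-processing path.
        String newDiscoveryPath = "device-discovery-v2";

        if (flags.isEnabled(newDiscoveryPath)) {
            System.out.println("Processing discovery with the new code path");
        } else {
            System.out.println("Falling back to the existing discovery pipeline");
        }

        // Later, the flag is turned on for a single cluster or tenant cohort
        // once the change has been validated.
        flags.set(newDiscoveryPath, true);
        System.out.println("Flag enabled: " + flags.isEnabled(newDiscoveryPath));
    }
}
```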

resolved

The delay for device discovery has been resolved. The source of the performance impact has been addressed, and performance should again be optimal. A Root Cause Analysis (RCA) will follow after completing a full review.

monitoring

We’ve identified the source of the performance issue with delays in new device discovery and are monitoring the situation. We've implemented the fix and are waiting for device information to catch up in the system. As the lag catches up, we expect to be back to optimal performance in a few hours. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the performance issue in the discovery of new devices. We are working to restore optimal service as quickly as possible.

Report: "Service Disruption - EU1 cluster is experiencing an outage"

Last update
postmortem

# Service Disruption - EU1 Customers Experienced an Outage Following the April 20, 2024, Upgrade

## Root Cause Analysis

### Duration of incident

Discovered: Apr 20, 2024, 12:36 - UTC Resolved: Apr 20, 2024, 16:10 - UTC

### Cause

A scheduled upgrade was performed on the EU1 cluster to address software requirements for performance and security improvements.

### Effect

Scheduled processes would not run, and network connectivity issues were experienced for clients on the EU1 cluster.

### Action taken

_All times in UTC_

**04/20/2024**

**10:31 -** Planned upgrade occurring during scheduled maintenance.

**12:36 -** Issues from the upgrade are detected.

**12:50 -** Initial mitigation to address issues taken.

**13:00 -** Initial mitigation step deemed insufficient. Investigation for the next steps started.

**13:34 -** Additional mitigation steps implemented.

**14:15 -** The concluding steps to address the disruption are taken by engineering to clear out the failed upgrade.

**15:10 -** EU1 cluster and clients appear to be recovering.

**16:00 -** The old data is cleared from the internal pods.

**16:10 -** The incident is declared resolved.

### Future consideration(s)

* The order of operations list will be reviewed and standardized for upgrades to parts of the Auvik product.
* Ensure that Subject Matter Expert (SME) approval has been signed off on and that an SME is available when pertinent upgrades are scheduled.
* Enforce the preferred roll-back processes when upgrades to the product are implemented.

resolved

The source of the disruption has been resolved, and services have been fully restored.

monitoring

We’ve identified the source of the service disruption and applied a fix. Sites are starting, and we are monitoring to ensure all systems are functional.

identified

We’ve identified the source of the service disruption to EU1. Sites continue to be down at this time. We are working to apply changes and restore service as quickly as possible.

investigating

We’re experiencing an outage on the EU1 cluster. Customers will be unable to access their sites at this time. We will continue to provide updates as they become available.

Report: "Performance Disruption - Internal Data Requests to Auvik’s Systems Delayed to Customers on the US2 and US5 Clusters"

Last update
postmortem

# Performance Disruption - Internal Data Requests to Auvik’s Systems Delayed to Customers on the US2 and US5 Clusters

## Root Cause Analysis

### Duration of incident

Discovered: Mar 28, 2024, 17:11 - UTC Resolved: Mar 28, 2024, 20:20 - UTC

### Cause

The delay in Auvik’s ability to process internal data requests was due to additional overhead created by implementing Auvik’s new Beta Alert testing.

### Effect

The system delayed all requests: mapping, UI updates, data retrieval, monitoring, and alerting. This was limited to customers with tenants on the US2 and US5 clusters.

### Action taken

_All times in UTC_

**03/18/2024**

**14:00 -** Auvik Alerting Beta was deployed to the US2 and US5 clusters.

**03/19/2024**

**13:00 -** Additional resources were added to the US2 and US5 clusters to address the lag in processing due to the addition of the Auvik beta alerting deployment.

**03/19/2024 - 03/28/2024**

Alerting Beta continues to run on the US2 and US5 clusters.

**03/28/2024**

**17:11 -** Engineering is addressing data processing lag issues reported by customers on the US2 and US5 clusters and has discovered a considerable lag in data processing for several Auvik processes.

**17:20 -** Internal Auvik resources meet to determine the root cause of the performance issues.

**18:05 - 18:15 -** Auvik increases processing resources to the affected clusters, locking out the system for approximately 10 minutes for customers. Soon after, an update reporting the interruption is posted on the Auvik Status page.

**18:15 - 22:20 -** The engineering teams work with the hosting company to adjust the resources on the US2 and US5 clusters to handle the system's new processing requirements created by the Auvik Beta Alerting.

**03/29/2024 - 03/30/2024**

Non-optimized data and unused space are cleaned from the system to improve system efficiency and performance.

### Future consideration(s)

* Better understand the differences in database instances and implement the proper builds within the product.
* Implement the proper internal alerting to prevent the growth of lag that was discovered in the incident.
* Create an internal performance insight metric to better understand the effects of implementing significant scale changes to the system.
* Evaluate engineering team permissions to the system and address blockers to resolve issues where appropriate.

resolved

Auvik’s systems experienced delays with data requests for customers on the US2 and US5 clusters on March 28th, 2024. This impact on performance occurred between 17:11 and 20:20 UTC. There was no data loss or downtime.

Report: "500 errors when accessing the Auvik UI"

Last update
resolved

Auvik experienced an interruption with access to the UI this afternoon at approximately 18:05 UTC. The cause was discovered and addressed quickly. Interruption to access was approximately 10 minutes. Services have fully recovered. An RCA will follow as soon as an internal investigation has been completed.

Report: "Possible Service Disruption - Users may experience possible delays accessing their Auvik site"

Last update
resolved

The possible interruption of access to Auvik sites for customers has been resolved. The third-party vendor, who was the cause of the disruption, has been replaced with a different third-party vendor. Services have been fully restored.

monitoring

It has come to Auvik’s attention that one of its third-party vendors is experiencing an incident that may cause delays of 1-3 minutes for users to access their site. We apologize for the inconvenience. We are working on a way to bypass this issue. We’ll keep you posted on a resolution.

identified

It has come to Auvik’s attention that one of its third-party vendors is experiencing an incident that may cause delays of 1-3 minutes for users to access their site. We apologize for the inconvenience. We are closely monitoring their situation and will report any updates.

Report: "Service Disruption - Disruption of IP associations with devices on approximately 11% of devices on US5 Cluster."

Last update
postmortem

# Service Disruption - Disruption of IP associations with devices for some US5 and US1 cluster clients.

## Root Cause Analysis

### Duration of incident

Discovered: Mar 14, 2024 09:15 - UTC Resolved: Mar 16, 2024 01:20 - UTC

### Cause

After the incident on March 14th, approximately 2,000 tenants who had previously been migrated reprocessed steps in the initial migration.

### Effect

Clients affected by the incident on the US5 cluster lost 134,727 IP addresses (approximately 10% of devices across the affected tenants). The US1 cluster had five tenants who experienced similar issues.

### Action taken

_All times in UTC_

**03/14/2024**

**21:00 -** Cluster recovery from the March 14th incident leads to unexpected tenant migrations.

**03/15/2024**

**13:23 -** The relevant Auvik engineering team is informed of the issue with a specific client.

**13:30 -** The cause is misdiagnosed, and the tenant is restarted to address the issue.

**14:00 -** The restart does not resolve the issue, and a deeper investigation into the reason for the problem is begun.

**16:45 -** The engineering team discovers that the cause of the issue is an unexpected rerun of tenant migrations that were kicked off from the previous day’s incident.

**16:55 -** A plan is developed to reset the lost IPs against affected devices. This action will only reattach IPs to the proper device. Previous configuration customizations, backups, or alerting will be lost with the reconsolidation of the devices.

**17:07 -** The Auvik engineering team kicks off a systematic reattachment of deleted IPs.

**03/16/2024**

**01:18 -** The engineering team finishes the reattachment of the removed IPs.

**02:20 -** The incident is declared closed.

### Future consideration(s)

* Auvik will develop improved safeguards around tenant restarts and migrations.
* Auvik will deploy an improved safety configuration to restore lost configuration data from IP reassignments that cause device reconsolidation.

resolved

We’ve identified the source of the service disruption to IPs associated with devices. The disruption has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review.

identified

We’ve identified the source of the service disruption to IPs associated with devices on the US5 cluster. Approximately 11% of devices on US5 and five sites on US1 are not appearing. The tenants affected on the US1 cluster have had services restored. The tenants on the US5 cluster are having their services systematically restored. The ETA for recovering services to all affected tenants is within the next three hours. We will post when it is completed or if there is any change in outlook.

identified

We’ve identified the source of the service disruption to IPs associated with devices on the US5 cluster. Approximately 11% of devices on US5 and five sites on US1 are not appearing. We are continuing to restore service as quickly as possible.

identified

We’ve identified the source of the service disruption to IPs associated with devices on the US5 cluster. Approximately 11% of devices on US5 and devices on 5 sites on US1 are not appearing. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption to IPs associated with devices on the US5 cluster. We will continue to provide updates as they become available.

Report: "Service Disruption - Network monitoring"

Last update
postmortem

# Service Disruption - Network Monitoring Interruption to Services

## Root Cause Analysis

### Duration of incident

Discovered: Mar 14, 2024 18:41 - UTC Resolved: Mar 15, 2024 00:15 - UTC

### Cause

After testing in a stage environment, a change was deployed to the Auvik production environment.

### Effect

The modifications in the scaled-up production environment caused the system to overload and stop processing the streaming monitoring data.

### Action taken

_All times in UTC_

**03/14/2024**

**13:55 -** Changes to production were introduced.

**14:35 -** Attention was raised to Auvik engineering as the systems for TrafficInsights, Syslog, and integrations began to crash loop.

**14:40 -** It was determined the issues raised in the initial report were more widespread than initially thought. The system as a whole was being overloaded with data.

**14:45 -** An incident was raised, and resources were called in to address it.

**15:03 -** Replication topic services were taken offline to reduce system load.

**15:20 - 16:20 -** An engineering team attempts to remove the additional data created by the overload.

**16:20 -** Engineering attempts to run commands to bulk remove the extraneous data from the system.

**16:25 - 17:30 -** Engineering waits for the system to process the commands. Additional resources are added to the system to provide resources to process the load.

**17:30 -** The first cluster has now recovered.

**17:50 -** Engineering is seeing improvement across the system, with several other clusters starting to come back online.

**18:35 -** The bulk changes have the desired effect, and services are starting to recover. Engineering begins going through each service to validate health and functionality.

**03/14/2024 - 03/15/2024**

**18:35 - 00:15 -** Engineering continues working through all affected services and troubleshoots any unresolved issues. Temporary resources are added to speed things along.

**03/15/2024**

**00:15 -** The incident is declared closed.

### Future consideration(s)

* Auvik will better understand the effect that new software used in the system may have on performance and system load.
* Auvik has scheduled the replacement of the soon-to-be end-of-support software the system currently relies on, to address a bug discovered during the incident post-mortem.
* Auvik will better define its internal alerting to focus on relevance and emphasize actual issues rather than “noise.”

resolved

The service disruption with the clusters and services in production has recovered. Services are working as they should, and the production environment is currently working as expected. A Root Cause Analysis (RCA) will follow after a full review.

monitoring

We’ve identified the source of the service disruption and most clusters and services have recovered. We are continuing to monitor the remaining data processing jobs and will provide an update in approximately 1 hour.

monitoring

We’ve identified the source of the service disruption affecting, but not limited to, Maps, TrafficInsights, Alerts, and product UI across all clusters. We are continuing to monitor the system as clusters and services continue to recover. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption affecting, but not limited to, Maps, TrafficInsights, Alerts, and product UI across all clusters. We are continuing to work to restore service as quickly as possible.

identified

We’ve identified the source of the service disruption affecting, but not limited to, Maps, TrafficInsights, Alerts, and product UI. We are working to restore service as quickly as possible.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate. We will continue to provide updates as they become available.

Report: "Service Disruption - Customers with sites on the US1 cluster are not receiving."

Last update
postmortem

# Service Disruption - Clients on the US1 cluster experienced a delay with alerts.

## Root Cause Analysis

### Duration of incident

Discovered: Mar 11, 2024 13:35 - UTC Resolved: Mar 11, 2024 16:30 - UTC

### Cause

Issues with the March 9th system update.

### Effect

The internal alerting engine for the US1 cluster got into a state where alerts were delayed to clients on US1 from March 9th at 11:00 UTC until March 11 at 16:30 UTC.

### Action taken

_All times in UTC_

**03/09/2024**

**11:00 - 14:00 -** Auvik performs scheduled maintenance on the system. The planned maintenance is extended due to issues with the system during the restart. It is believed the system has recovered successfully.

**03/11/2024**

**08:00 -** Auvik Engineering identifies from internal alerting that the alerts on the US1 cluster are lagging when displayed in the system.

**08:55 -** Engineering restarts the alerting service and attempts to create a new checkpoint for alerting.

**10:05 -** The new checkpoint failed, and the responsible engineering team was notified about the system delay with the US1 cluster alerts.

**13:18 -** An external incident was posted to alert clients about the delay. Engineering continues to work on the issue.

**14:00 -** Progress is achieved by saving checkpoints within the system for the alerting assembler.

**15:00 -** Alert lag begins to fall.

**16:00 -** Alert lag has been processed successfully.

**16:20 -** Engineering declares the incident closed.

### Future consideration(s)

* Auvik will review the checklist of resources and systems after maintenance to properly ensure complete recovery of its systems.
* Auvik will review and update systems used in its streaming architecture to ensure against possible performance-related issues.
* Auvik will update its method of cloud-related data systems to handle data lags in the future.
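
Recovery above progressed once checkpoints were being saved for the alerting assembler. The sketch below is a generic illustration of that pattern under assumed names and a file-based store, not Auvik's streaming architecture: the consumer periodically persists the offset it has processed so a restart resumes from the last checkpoint instead of losing its place in the backlog.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Generic sketch, not Auvik's alerting assembler: a consumer that records how
// far it has processed so that, after a restart, it resumes from the last
// checkpoint rather than falling further behind or replaying everything.
public class CheckpointingConsumer {

    private final Path checkpointFile;
    private long position;                 // offset of the last processed event
    private long eventsSinceCheckpoint;
    private static final long CHECKPOINT_EVERY = 1_000;

    public CheckpointingConsumer(Path checkpointFile) throws IOException {
        this.checkpointFile = checkpointFile;
        this.position = Files.exists(checkpointFile)
                ? Long.parseLong(Files.readString(checkpointFile, StandardCharsets.UTF_8).trim())
                : 0L;
    }

    /** Process one event at the given offset and checkpoint periodically. */
    public void process(long offset, String event) throws IOException {
        if (offset <= position) {
            return; // already handled before the restart; skip instead of duplicating alerts
        }
        // ... evaluate alert conditions for `event` here ...
        position = offset;
        if (++eventsSinceCheckpoint >= CHECKPOINT_EVERY) {
            Files.writeString(checkpointFile, Long.toString(position), StandardCharsets.UTF_8);
            eventsSinceCheckpoint = 0;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Path.of("alerting-assembler.checkpoint");
        CheckpointingConsumer consumer = new CheckpointingConsumer(file);
        for (long offset = 1; offset <= 2_500; offset++) {
            consumer.process(offset, "event-" + offset);
        }
        System.out.println("Resumes from offset " + Files.readString(file).trim() + " after a restart");
    }
}
```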

resolved

The delay of alerts for devices and services for customers on the US1 cluster has been resolved. The alert lag has been cleared, and alerts are now current. The source of the disruption has been resolved, and services have been fully restored.

monitoring

We’ve identified the source of the delayed alerts for devices and services for customers on the US1 cluster and continue to monitor the situation. Alert lag is steadily decreasing. We expect a resolution in the near term. We’ll keep you posted when resolved.

monitoring

We’ve identified the source of the service disruption with alerts for devices and services for customers on the US1 cluster and are monitoring the situation. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption, with alerts for devices and services for customers on the US1 cluster. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption to alerts for devices and services for customers on the US1 cluster. We will continue to provide updates as they become available.

Report: "Service Disruption - network map may be unavailable for some customers"

Last update
postmortem

# Service Disruption - Maps Unavailable in the UI ## Root Cause Analysis ### Duration of incident Discovered: Mar 09, 2024 11:00 - UTC Resolved: Mar 10, 2024 03:21 - UTC ### Cause Significant maintenance upgrade to the system. ### Effect Maps were unavailable to customers across the platform. ### Action taken _All times in UTC_ **03/09/2024** **11:00 -** Regularly planned maintenance performed on the system. This included infrastructure upgrades. **13:00 -** Maintenance completed. A few internal issues were noticed, and action was taken to address them. **13:40 -** Internal issues noticed at the end of maintenance appear to be addressed and resolved. **15:20 -** Auvik Engineering is aware of issues with maps not loading in the UI. **15:30 -** Additional permission issues were also discovered. An incident is declared, and the on-call team is assembled. **15:30 - 18:40 -** The engineering team begins its investigation and works to discover the incident's underlying cause. **18:40 -** The Core data actors and injector are restarted. Engineering must wait for results as the system reloads data. **18:40 - 21:40 -** Engineering observes the results as they update. It is determined that the restart did not provide the desired outcome and that recovery is occurring too slowly for a production environment. Engineering decides to perform a complete system restart. The restart will involve staggering individual cluster restarts to prevent overloading the core part of the product. **21:40 -** Engineering performs the complete system restart with staggered starts of each cluster. **03/10/2024** **03:21 -** All clusters have successfully restarted, and Map functionality is back at an acceptable level. The incident is declared closed. ### Future consideration\(s\) * Improve tenant inspection after maintenance windows to validate that there are no adverse effects from the changes implemented, especially after a more significant or complex upgrade. * Review and update the current post-maintenance checklist. * Create improved guidance on when a complete system restart is warranted and the specific criteria for applying it. * Investigate why changes to the system from this upgrade caused a delay in map rendering that forced a staggered reboot.
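The recovery here relied on staggering individual cluster restarts so that shared core services were not hit by every cluster at once. A simplified sketch of that pattern is below; the cluster list, pacing, and the `restart_cluster`/`cluster_healthy` stand-ins are illustrative assumptions, not Auvik's orchestration tooling.

```python
import time

# Hypothetical cluster names and pacing, used only to illustrate a staggered restart.
CLUSTERS = ["US1", "US2", "US3", "US4", "CA1", "EU1", "EU2", "AU1"]

def restart_cluster(name: str) -> None:
    print(f"restarting {name} ...")        # stand-in for the real orchestration call

def cluster_healthy(name: str) -> bool:
    return True                            # stand-in for a post-restart health check

def staggered_restart(clusters: list[str], stagger_s: float = 900, poll_s: float = 30) -> None:
    """Restart clusters one at a time, waiting for health plus a pause between each,
    so shared core services never absorb every cluster coming back at once."""
    for name in clusters:
        restart_cluster(name)
        while not cluster_healthy(name):
            time.sleep(poll_s)             # wait for the cluster to settle before moving on
        time.sleep(stagger_s)              # bound the load placed on shared core services

if __name__ == "__main__":
    staggered_restart(CLUSTERS, stagger_s=0)   # zero pause for a quick local demo
```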

resolved

The network maps issue has been resolved.

monitoring

Most clusters have recovered at this time. We are monitoring the remaining clusters before resolving this incident.

monitoring

We’ve identified the source of the service disruption with network maps. We have restarted clusters and are seeing systems recover. We will continue to monitor the situation until all systems have recovered.

identified

We’ve identified the source of the service disruption with network maps. We will be restarting all of our clusters at this time. We will continue to update the status as the system begins to recover from the restart.

identified

We’ve identified the source of the service disruption with network maps. We are continuing to restore service as quickly as possible.

identified

We’ve identified the source of the service disruption with network maps. We are working to restore service as quickly as possible.

investigating

We are continuing to investigate the network map errors. Customers in CA1 will encounter additional errors for a short period of time as additional troubleshooting is being performed on that cluster. We will continue to provide updates as they become available.

investigating

We’re experiencing disruption to the network map for some customers. Impacted customers may encounter errors loading the map. We will continue to provide updates as they become available.

Report: "Service Disruption - US1 Cluster Experienced Collector Disconnects"

Last update
resolved

At around 16:30 UTC (11:30 EST) approximately 20% of tenants on the US1 cluster experienced a two to five-minute connection gap between Auvik and their collectors. This was due to actions taken to pre-emptively address an internal authentication issue with collectors. An interruption to service was not expected when the action was taken. The connection "blip" was momentary, and all services recovered quickly and are now stable.

Report: "Service Disruption - Traffic Insights data on the EU1 Cluster"

Last update
postmortem

# Service Disruption - Delay of TrafficInsights data on the EU1 Cluster ## Root Cause Analysis ### Duration of incident Discovered: Jan 23, 2024 12:30 - UTC Resolved: Jan 23, 2024 15:30 - UTC ### Cause TrafficInsights' back-end in the EU1 cluster exhausted its resources. The services involved stopped processing data. ### Effect TrafficInsights stopped processing data to the UI. ### Action taken _All times in UTC_ **01/23/2024** **12:30 -** Auvik Engineering receives an internal alert that TrafficInsights has stopped processing data on the EU1 cluster. **12:35 -** Engineering confirms no current work has caused the stoppage of data flow. **12:40 -** Engineering cancels the TrafficInsights processing job and restarts it to begin processing TrafficInsights data again on the EU1 cluster. **12:45 -** The restart fails to complete successfully. **13:30 -** Additional resources are added to the processes being called. **14:20 -** The point at which data flow resumes is adjusted to start from the failure point instead of an older safe point, to bring the TrafficInsights data on the EU1 cluster current in the most efficient time. **14:38 -** TrafficInsights data in the EU1 cluster begins flowing successfully. **15:30 -** Data lag for TrafficInsights on the EU1 cluster has caught up with the current data being processed from the devices. The incident is marked as resolved. ### Future consideration\(s\) * Auvik will improve monitoring around resources of the backend services of TrafficInsights. * Resources for backend services for TrafficInsights will be increased across clusters to accommodate increased data. * Outline and begin discovery for deprecating the current processing engine used by TrafficInsights to increase its resilience.

resolved

The delay of TrafficInsights data in the EU1 cluster has been resolved. The source of the disruption has been addressed, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption with Traffic Insights on the EU1 cluster, have addressed the cause, and are monitoring the situation. Traffic Insight data is delayed and will need approximately two hours to be caught up and current. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the performance issue with Traffic Insights on the EU1 cluster. Data is delayed. We are working to restore optimal service as quickly as possible.

investigating

We’re experiencing disruption with Traffic Insights on the EU1 cluster. Data is delayed. We will continue to provide updates as they become available.

Report: "Service Disruption - Auvik Network Management Web UI Intermittent Errors"

Last update
postmortem

# Service Disruption - Unable to Connect to Auvik Tenants ## Root Cause Analysis ### Duration of incident Discovered: Jan 17, 2024 15:27 - UTC Resolved: Jan 18, 2024 04:30 - UTC ### Cause A CORE settings change was implemented on 50% of Auvik clients after a successful initial rollout to 5% of Auvik clients the day before. ### Effect Clients that were part of the 50% under the setting change became inaccessible. A percentage of these clients were automatically disabled due to the number of attempted restarts that accompanied the disconnection. ### Action taken _All times in UTC_ **01/17/2024** **15:27 –** Auvik Engineering enables the same CoreSettings for 50% of tenants after a successful dry run from the previous day with 5% of its clients. **15:35 –** Internal Auvik alerts notify Engineering of a significant service disruption. **15:36 –** Engineering begins its investigation. **15:39 –** The backend services of the clients where changes were implemented stop reporting metrics. **15:53 –** Engineering reverts the change that was implemented. **16:30 –** Engineering manually begins restarting clusters of the affected clients. **18:40 –** Engineering begins manually repairing the connections to back-end services of clients that are not starting or reporting metrics properly. **21:00 –** All clusters are recovered. Engineering is seeing successful reporting of services and believes the incident to be over. The incident is marked as resolved on the Status page. **21:51 –** Auvik Support receives notice that one of the affected client’s tenants has been unexpectedly disabled. **01/18/2024** **01:31 –** Auvik continues to receive more client reports of unexpectedly disabled tenants. **02:30 –** The Auvik Engineering On-Call team is engaged. **03:37 –** Engineering determines the number of tenants unexpectedly disabled to be just over 1000. **03:50 –** Engineering re-enables the disabled tenants. **04:30 –** The number of running tenants is back to its pre-incident level. This incident is officially closed. ### Future consideration\(s\) * Auvik will review and adjust its rollout process & guidelines. Enforcement and training on the updated process will be implemented. * Recovery procedures for resolving an incident have been updated to check for unexpected deactivation of clients.
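The rollout described above jumped from 5% to 50% of tenants in a single step, and the future considerations call for adjusting the rollout process. One generic way to structure a staged rollout with a health gate between stages is sketched below; the stage fractions and the `metrics_healthy` check are hypothetical and do not describe Auvik's internal tooling.

```python
import random

# Hypothetical stage fractions and health check; real rollout tooling is internal to Auvik.
STAGES = [0.05, 0.25, 0.50, 1.00]          # fraction of tenants enabled at each stage

def enable_setting(tenants: list[str]) -> None:
    print(f"enabling CORE setting for {len(tenants)} more tenants")

def revert_setting(tenants: list[str]) -> None:
    print(f"reverting CORE setting for {len(tenants)} tenants")

def metrics_healthy() -> bool:
    return random.random() > 0.05          # stand-in for checking backend metrics

def staged_rollout(all_tenants: list[str]) -> None:
    """Widen the rollout one stage at a time, stopping and reverting the latest batch
    if backend metrics degrade after a stage."""
    enabled = 0
    for fraction in STAGES:
        target = int(len(all_tenants) * fraction)
        batch = all_tenants[enabled:target]
        enable_setting(batch)
        enabled = target
        if not metrics_healthy():
            print("health check failed; reverting this stage and pausing the rollout")
            revert_setting(batch)
            return
    print("rollout complete")

if __name__ == "__main__":
    staged_rollout([f"tenant-{i}" for i in range(1000)])
```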

resolved

The fix for intermittent errors and disconnections has been applied. The source of the disruption has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We are continuing to monitor for any further issues on the remaining clusters.

monitoring

We’ve identified the source of the service disruption with Auvik Network Management. Some customers may still experience intermittent errors when accessing the web UI, delays in data processing, or collectors disconnecting and reconnecting intermittently. We are monitoring the situation. We are currently implementing a fix for the involved issues. We appreciate your patience as we continue to work through the issues. We’ll keep you posted on a resolution.

monitoring

We’ve identified the source of the service disruption with Auvik Network Management. Some customers may still experience intermittent errors when accessing the web UI, delays in data processing, or collectors disconnecting and reconnecting intermittently. We are monitoring the situation. We are currently implementing a fix for the involved issues. We appreciate your patience as we continue to work through the issues. We’ll keep you posted on a resolution.

monitoring

We’ve identified the source of the service disruption with Auvik Network Management. Some customers may still experience intermittent errors when accessing the web UI, delays in data processing, or collectors disconnecting and reconnecting intermittently. We are monitoring the situation. We are currently implementing a fix for the involved issues. We appreciate your patience as we continue to work through the issues. We’ll keep you posted on a resolution.

monitoring

We’ve identified the source of the service disruption with Auvik Network Management. Some customers may still experience intermittent errors when accessing the web UI, delays in data processing, or collectors disconnecting and reconnecting intermittently. We are monitoring the situation. We are currently implementing a fix for the involved issues. We’ll keep you posted on a resolution.

monitoring

We’ve identified the source of the service disruption with Auvik Network Management. Some customers may experience intermittent errors when accessing the web UI or with data processing. We are monitoring the situation. We are currently implementing a fix for the involved issues. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption to Auvik Network Management. We are working to restore service as quickly as possible.

identified

We’ve identified the source of the service disruption with Auvik Network Management. Some customers may experience intermittent errors when accessing the web UI or with the processing of data. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption to Auvik Network Management. Some customers may experience intermittent errors when accessing the web UI or with the processing of data. We will continue to provide updates as they become available.

Report: "Service Disruption - network disconnection alerts"

Last update
postmortem

# Service Disruption - Cloud Ping Check Not Responding ## Root Cause Analysis ### Duration of incident Discovered: Jan 16, 2024, 21:40 - UTC Resolved: Jan 16, 2024, 22:33 - UTC ### Cause There was a significant spike in CPU/memory usage for the ping services in the product. ### Effect Auvik clients with Internet connection checks enabled received a large volume of connection alert failures. ### Action taken _All times in UTC_ **01/16/2024** **21:17 -** Auvik Support alerted Auvik Engineering of a sudden influx of tickets concerning failed Internet connection checks. **21:27 -** Engineering confirms there was no disruption to the number of connected agents. **21:31 -** Engineering confirms there has been an escalation in CPU/memory usage for the ping server. **21:48 -** A broken backend connection was deleted and recreated. **21:56 -** Engineering confirms that resource demands start to decrease and manually confirms clients that reported connection alerts are now responding. **22:08 -** Engineering confirms with its alerting team that there’s no manual intervention needed for the alerts that were fired; they will resolve themselves. **22:33 -** The incident has been resolved - alerts resolved themselves, and resource usage decreased to expected values for the affected service. ### Future consideration\(s\) * Auvik will create internal alerting for the Ping services. * Auvik will create a failover instance of the Ping service to prevent a single point of failure situation in the future.

resolved

The source of the disruption has been resolved, and services have been fully restored.

monitoring

We’ve identified the source of the service disruption with network disconnection alerts and implemented a fix. We are monitoring the situation.

identified

We’ve identified a service disruption with alerts for network disconnection. Some customers may receive erroneous network disconnection alerts. We are working to restore service as quickly as possible.

Report: "Service Disruption - Cluster US3 clients fail to connect to tenants"

Last update
postmortem

# Service Disruption - Clients Fail to Connect to Their Tenants on the US3 Cluster ## Root Cause Analysis ### Duration of incident Discovered: Dec 19, 2023 15:58 - UTC Resolved: Dec 19, 2023 18:15 - UTC ### Cause Action taken to clean up residual issues from the earlier incident, Service Disruption - Devices Deleted from Auvik UI when seen as down in health check. ### Effect Resources on the backend of the US3 cluster were overloaded. Collectors then disconnected from tenants on backends. This caused logins to fail until the tenants were restarted. A reboot of the US3 cluster was then performed to regain cluster stability, which behaved like an Auvik biweekly maintenance window. This is when all collectors disconnected, and tenants could not log in for a few minutes up to a few hours, depending on the order in which they restarted. No loss of existing collected data occurred during this incident through to the recovery. ### Action taken _All times in UTC_ **12/19/2023** **15:58 -** Auvik Engineering notices issues with US3 customer tenants. **16:20 -** Initial investigation into metrics on the US3 cluster. **17:00 -** Engineering decides to reboot the US3 cluster. **17:00 - 18:15 -** Engineering monitors the tenants after the reboot, much like after a maintenance window. Engineering manually brought up larger tenants. **18:15 -** The incident is deemed closed. ### Future consideration\(s\) * Improve timeliness of communication when making changes to production. Over-communicate actions at the time they occur and not too far in front of the actions themselves. * Add documentation for this case of resource overload and protections to prevent it from being repeated in the future.

resolved

Service on the US3 cluster has been restored.

monitoring

We’ve identified the source of the service disruption on the US3 cluster and are monitoring the situation. We have taken steps to mitigate the cause. Tenants may still have issues connecting as we work through the issue. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with the US3 Cluster. Access to tenants will be sporadic at this time. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption to tenants on the US3 cluster. Clients are failing to connect. We will continue to provide updates as they become available.

Report: "Service Disruption - Orphaned IPs on cluster US4"

Last update
postmortem

# Service Disruption - Devices with Orphaned IPs in Cluster US4 ## Root Cause Analysis ### Duration of incident Discovered: Dec 6, 2023, 15:45 - UTC Resolved: Dec 7, 2023, 01:30 - UTC ### Cause An internal service that injects network and IP data into the product was in a crash loop. \(Repartioner Service\) ### Effect This caused the Consolidation services that attach IPs to devices to believe the IPs had been deleted. This mismatch of data then caused the devices to lose their association with their actual IPs, resulting in orphaned devices. ### Action taken _All times in UTC_ **12/05/2023** **16:15 -** Backend services related to the Juniper Mist Release to GA on the US4 cluster are beginning to report errors. The backend Repartioner service fell into a crash loop. **12/06/2023** **15:45 -** Auvik Support reports a client has devices with what appear to be deletions of attached IPs. Several more tickets follow in quick succession. Engineering is alerted to the issue and begins its investigation. **16:30 -** An incident is declared and posted to the Auvik status page. Engineering continues to investigate the cause. Engineering turns off the consolidation engine on the US4 cluster to prevent any more deletions. **16:30 - 17:00 -** Engineering identifies that the Repartioner service is crash looping and restarts the service successfully. It is determined the Repartioner service needs more resources to process the accumulated data lag from the last day. Additional resources are provisioned. **17:00 -** The lag is processed through the Repartioner service. The processed data is now attempting to catch up with the production environment. **17:30 -** Injecting the delayed data back into the product on the US4 cluster will take a while. Adjustments to US4 cluster processing services are made to allow the lagged data to catch up more expediently. It is noted that devices with orphaned IPs are recovering. **12/06/2023 - 12/07/2023** **17:30 - 01:30 -** Engineering monitors the data lag decrease and validates the data can catch up. **12/07/2023** **01:30 -** Data lag for the IP and network data on cluster US4 has caught up. **09:41 -** The Auvik status page posts that the incident has been closed. ### Future consideration\(s\) * Improved monitoring of legacy services \(Repartioner\) will be implemented to prevent long-duration issues from occurring without timely action being taken. * Auvik will add greater resilience to the Consolidation services to prevent orphaning large amounts of device networking data.
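The Repartioner service sat in a crash loop for roughly a day before action was taken, and the future considerations call for improved monitoring of legacy services. A small sketch of crash-loop detection over a sliding window is shown below; the window length and restart limit are illustrative assumptions, not Auvik's monitoring configuration.

```python
import time
from collections import deque

# Hypothetical window and limit, illustrating how a restart loop could be flagged.
WINDOW_SECONDS = 600
MAX_RESTARTS_IN_WINDOW = 3

class CrashLoopDetector:
    def __init__(self) -> None:
        self.restart_times: deque = deque()

    def record_restart(self, when: float | None = None) -> bool:
        """Record one restart; return True if the service now looks like it is crash looping."""
        now = time.time() if when is None else when
        self.restart_times.append(now)
        # Drop restarts that have aged out of the sliding window.
        while self.restart_times and now - self.restart_times[0] > WINDOW_SECONDS:
            self.restart_times.popleft()
        return len(self.restart_times) >= MAX_RESTARTS_IN_WINDOW

if __name__ == "__main__":
    detector = CrashLoopDetector()
    base = time.time()
    for offset in (0, 120, 240):            # three restarts within ten minutes
        if detector.record_restart(base + offset):
            print("ALERT: possible crash loop detected")
```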

resolved

The issue with devices with orphaned IPs on Cluster US4 has been resolved. The source of the disruption has been addressed, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption with devices with orphaned IPs on Cluster US4 and continue to monitor the situation. There will continue to be a delay for data to catch up in the UI. The lag catch-up has proceeded more slowly than anticipated. The new estimated time for the lag to become current is early in the morning of December 7th EST. We apologize for this delay. We’ll keep you posted on a resolution.

monitoring

We’ve identified the source of the service disruption with devices with orphaned IPs on Cluster US4 and continue to monitor the situation. There will continue to be a delay for data to catch up in the UI. The estimated time for the lag to become current is 23:00 UTC or 6:00 PM EST. We’ll keep you posted on a resolution.

monitoring

We’ve identified the source of the service disruption with devices with orphaned IPs on Cluster US4 and are monitoring the situation. There will continue to be a delay for data to catch up in the UI. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with devices with orphaned IPs on Cluster US4. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption with devices with orphaned IPs on Cluster US4. We will continue to provide updates as they become available.

Report: "Service Disruption - US4 cluster may be unavailable to some customers"

Last update
postmortem

# Service Disruption - Some clients on the US4 cluster could not connect to their tenants ## Root Cause Analysis ### Duration of incident Discovered: Dec 5, 2023, 15:47 - UTC Resolved: Dec 5, 2023, 19:00 - UTC ### Cause Rolling out the new Juniper Mist capabilities for GA release to clients. ### Effect The amount of accumulated data in the product for the Juniper Mist feature overloaded the capabilities of the US4 cluster to process the data. The amount of historical data pushed to production was too large for a few clients on this cluster, which caused a few backend nodes to fail. This caused connectivity issues with the clients associated with the failed backend nodes. ### Action taken _All times in UTC_ **12/05/2023** **15:47 -** Auvik Engineering begins to roll out the GA release for the Juniper Mist monitoring. **15:55 -** Backend services related to this upgrade show signs of stress. **16:15 -** Errors occur on the US4 cluster, with some tenants having issues connecting. **16:30 -** All other clusters complete the Juniper Mist release action except US4. **17:05 -** A decision is made to roll back changes for the Juniper Mist release on the US4 cluster. Engineering performs the rollback and waits for the changes to propagate in the US4 Cluster. **17:26 -** A few of the backend nodes continued to throw errors. Engineering restarts these backend nodes to clear the errors. **18:00 -** The US4 cluster is running normally. The incident is closed. ### Future consideration\(s\) * Auvik will roll out the Juniper Mist GA release to the US4 cluster during its scheduled maintenance window on December 16, 2023, to complete the release. * Auvik will adjust how it rolls out new functionality if it entails large amounts of data movement within the product. It will roll out the changes in discrete stages instead of to the cluster as a whole.

resolved

The disruption to the US4 cluster, which prevented some customers hosted on this cluster from accessing their sites, has been resolved. The source of the disruption has been addressed, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption preventing clients on the US4 cluster from connecting to their sites and are monitoring the situation. Clients should be able to connect successfully. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with the US4 cluster. Some customers hosted on this cluster may be unable to access their site. We are working to restore service as quickly as possible.

investigating

We’re experiencing a disruption to the US4 cluster. Some customers hosted on this cluster may not be able to access their site. We will continue to provide updates as they become available.

Report: "Service Disruption - Devices Deleted from Auvik UI when seen as down in health check"

Last update
postmortem

# Service Disruption - Auvik Deleting Devices After Appearing Offline In A Health Check ## Root Cause Analysis ### Duration of incident Discovered: Nov 22, 2023, 18:30 - UTC Resolved: Nov 27, 2023, 18:00 - UTC ### Cause Incorrect message data for networks and IPs was delivered to the device consolidation tables. ### Effect The incorrect message data provided information to delete network and IP information of active devices, which Auvik believed had been removed from the product. This, in turn, deleted devices from clients' tenants, which also deleted the backups and historical data of the deleted devices. ### Action taken _All times in UTC_ **11/22/2023** **17:30 -** The first noticeable ticket of a significant device loss is sent to engineering. This is followed by two more over the next hour. **18:30 -** Auvik declares an incident for the loss of networks and devices. **18:30 - 22:30 -** Engineering begins investigating the cause of the incident and how to arrest the deletion of client networks and IPs. **22:30 -** Auvik Engineering is able to determine a way to turn off the deletion of client networks and IPs from the platform. A change is implemented into production. This stops the devices from continuing to be deleted. **11/22/2023 - 11/23/2023** **22:30 - 00:30 -** Auvik Engineering has the platform run a discovery of the lost networks and IPs to recreate the devices lost on 11/22. This action did not restore the historical data, backups, and customized alerting of the recreated devices. **11/23/2023** **00:30 -** Auvik declares the deletion part of the incident closed. **11/23/2023 - 11/27/2023** **00:30 - 18:00 -** The Auvik consolidation team continues its analysis of the network and IP deletions to backtrack any other devices that may have been deleted before 11/22. It periodically runs scripts to replace lost devices at tenants' sites. While the devices are rediscovered, historical data, backups, and customized alerting are not recovered. Measures are put into place to prevent the system from being able to delete devices when receiving incorrect data. **11/27/2023** **18:00 -** The incident is closed for replacing lost devices. ### Future consideration\(s\) * Currently in development with engineering: new tooling to retain device data so that deleted devices can be restored with their original device history. * Auvik reviewed its backup frequency to validate the ability to perform a daily restore if required. This was validated to work as expected. * Auvik will improve internal alerting for mass device, IP, or network removal to gain earlier insight into similar incidents in the future.
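The future considerations mention alerting on mass device, IP, or network removal. A minimal sketch of a guard that blocks implausibly large bulk deletions and raises an alert instead is shown below; the 10% limit and the `alert`/`delete_device` helpers are hypothetical placeholders, not Auvik's consolidation code.

```python
# Hypothetical guard: refuse bulk deletions that would remove an implausible share of a
# tenant's devices in one pass, and alert for manual review instead of proceeding.
MAX_DELETE_FRACTION = 0.10                 # illustrative limit, not an Auvik-published value

def alert(message: str) -> None:
    print(f"ALERT: {message}")

def delete_device(device_id: str) -> None:
    print(f"deleting {device_id}")          # stand-in for the real deletion call

def safe_bulk_delete(device_ids: list[str], total_devices: int) -> list[str]:
    """Return the ids actually deleted; block the batch and alert if it looks like bad data."""
    if total_devices == 0 or not device_ids:
        return []
    fraction = len(device_ids) / total_devices
    if fraction > MAX_DELETE_FRACTION:
        alert(f"blocked deletion of {len(device_ids)}/{total_devices} devices "
              f"({fraction:.0%}); manual review required")
        return []
    for device_id in device_ids:
        delete_device(device_id)
    return device_ids

if __name__ == "__main__":
    safe_bulk_delete([f"dev-{i}" for i in range(300)], total_devices=1000)   # blocked
    safe_bulk_delete(["dev-1", "dev-2"], total_devices=1000)                 # allowed
```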

resolved

The source of the disruption has been resolved, and services have been fully restored.

monitoring

We’ve identified the source of the service disruption with devices being deleted from the UI when registered as offline in the health check and are monitoring the situation. Devices are being brought back into the UI. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with devices being deleted from the UI when registered as offline in the health check and are continuing to work to restore service as quickly as possible.

identified

We’ve identified the source of the service disruption with devices being deleted from the UI when registered as offline in the health check and are continuing to work to restore service as quickly as possible.

identified

We’ve identified the source of the service disruption that deleted devices from the UI. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption with devices being deleted from the UI when registered as offline in the health check. We will continue to provide updates as they become available.

Report: "Service Disruption - Delay in processing TrafficInsights data in US4 cluster"

Last update
postmortem

# Service Disruption - Traffic Insights \(TI\) Data Stopped Processing on US4 Cluster ## Root Cause Analysis ### Duration of incident Discovered: Nov 6, 2023, 13:04 - UTC Resolved: Nov 7, 2023, 06:32 - UTC ### Cause Updates to code caused an unexpected restart of services that affected TI data flow on the US4 cluster. ### Effect The restart of services began a restart loop that prevented TI data in the US4 cluster from flowing into the user interface as expected. ### Action taken _All times in UTC_ **11/06/2023** **13:04 -** After approval, code is released into production. **13:05 -** Services unexpectedly restart. The restart loop of services begins, which causes a delay in updating TI data in the US4 cluster. **13:34 -** An internal alert is fired, notifying Auvik Engineering that TI data on the US4 cluster was delayed. **15:52 -** Engineering begins its investigation. **16:45 -** Engineering adjusts the TI data flow for clients on the US4 cluster to bypass the restart issue. **16:48 -** Engineering confirms TI data is flowing back to US4 cluster clients. Engineering monitors the reduction of TI data lag in the US4 cluster. **18:00 -** Engineering continues to monitor the reduction of the TI data lag for clients in the US4 cluster. Additional resources are allocated to speed up the lag reduction. **11/07/2023** **02:38 -** All TI data lag is confirmed to have caught up. **06:32 -** All data processes are confirmed as up-to-date and working correctly. The incident is closed. ### Future consideration\(s\) * Auvik will update CPU limits on services related to TrafficInsights to prevent resource bottlenecks. * Auvik will investigate and determine service dependencies and better document possible conflicts. * Auvik will update older code to take advantage of new services to prevent this type of incident with these services in the future.

resolved

The resolution for the disruption to TrafficInsights data processing in US4 has been implemented. The source of the disruption has been resolved, services have been fully restored, and TrafficInsights data is flowing normally. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption in TrafficInsights data processing in the US4 cluster and are monitoring the situation. The TrafficInsights data is catching up to the current flow. TrafficInsights data is projected to be current at around 03:00 UTC on Nov 7. Updates will follow.

identified

We’ve identified the source of the service disruption with TrafficInsights data processing in US4. We are working to restore service as quickly as possible.

investigating

We continue investigating the disruption of Traffic Insights data with clients in the US4 cluster. We will continue to provide updates as they become available.

investigating

We’re experiencing disruption to TrafficInsights data processing in US4. We will continue to provide updates as they become available.

Report: "Service Disruption - Delay with the Delivery Syslog Messages to Clients on Cluster US2"

Last update
postmortem

# Service Disruption - Delay with the Delivery of Syslog Messages to Clients in US2 Cluster ## Root Cause Analysis ### Duration of incident Discovered: Oct 26, 2023, 05:30 - UTC Resolved: Oct 27, 2023, 18:52 - UTC ### Cause Disk space ran out on the processing disks for Syslog on the US2 cluster. ### Effect Syslog message delivery was stopped to clients on the US2 cluster. ### Action taken _All times in UTC_ **10/26/2023** **05:30 -** An internal alert was created that Syslog messaging was not working on the US2 cluster. **07:15 -** Auvik Engineering begins its investigation. **08:30 -** Engineering begins action to increase disk space to be able to process Syslog messages. **09:20 -** Engineering alters the data retention policy to ensure no data is lost due to the delay. **10:02 -** Engineering triggers the new policy to test the rollout. **11:10 -** Engineering validates the new settings, sees the data lag continue to shrink, and confirms customer information is now flowing appropriately. **11:15 -** The initial incident is marked as closed. **10/27/2023** **09:10 -** Data was checked for the cluster as part of standard operating procedure. Data restored by the policy implementation was no longer there. **09:20 -** The Auvik Engineering team proceeds to launch an investigation. **09:45 -** Engineering confirms that Syslog data from the last 20 days was absent for US2 cluster clients. **10:10 - 10:35 -** The log entry explaining why the Syslog data was deleted was located. The location of the backup of the data was also obtained. **10:43 -** Engineering begins to restore the absent Syslog data to the US2 cluster. **10:43 - 18:50 -** The data for the Syslog messages is restored to the US2 cluster for clients. **18:52 -** The restoration is finished. The incident is closed. ### Future consideration\(s\) * Auvik will add alerting for the particular disk space issues attributed to this incident, including the cause and repair process. * Auvik will upgrade the specific systems to avoid disk space issues like this occurring again. * Auvik will update the documentation with retention policies to reflect timing issues with policy changes.
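The root cause here was exhausted disk space on the Syslog processing disks, and the future considerations include alerting for that condition. A simple free-space check of the kind that could feed such an alert is sketched below; the mount point and threshold are assumptions for illustration, not Auvik's configuration.

```python
import shutil

# Hypothetical mount point and threshold, illustrating the disk-space alert described above.
SYSLOG_DATA_PATH = "/var/lib/syslog-processing"
FREE_SPACE_ALERT_FRACTION = 0.15           # alert when less than 15% of the disk remains free

def check_disk(path: str = SYSLOG_DATA_PATH) -> bool:
    """Return True if free space is healthy; print an alert otherwise."""
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total
    if free_fraction < FREE_SPACE_ALERT_FRACTION:
        print(f"ALERT: only {free_fraction:.0%} free on {path}; Syslog processing may stall")
        return False
    return True

if __name__ == "__main__":
    check_disk("/")                         # check the root filesystem as a runnable example
```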

resolved

The disruption with delivering Syslog messages to clients on the US2 cluster has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption with the delivery of Syslog messages to clients on the US2 cluster and are monitoring the situation. The Syslog delay is now catching up. All clients should see current Syslog messages in the next couple of hours. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with the delivery of Syslog messages to clients on the US2 cluster. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption with the delivery of Syslog messages to clients on the US2 cluster. We will continue to provide updates as they become available.

Report: "Service Disruption - Clients on the EU2 cluster are having access issues to their sites. Throwing 502 errors."

Last update
postmortem

# Service Disruption - Clients on the EU2 Cluster Intermittently Receive 502 Errors When Connecting ## Root Cause Analysis ### Duration of incident Discovered: Oct 31, 2023, 22:35 - UTC Resolved: Nov 01, 2023, 18:58 - UTC ### Cause An update to an internal service caused repeated service reloads. ### Effect The repeated service reloads caused increasing memory depletion for the service, which in turn caused intermittent connection issues to client websites on the EU2 cluster. ### Action taken _All times in UTC_ **10/31/2023** **19:15 -** The kOps service is updated on the EU2 cluster. **20:00 -** Service issues begin with affected services on the EU2 cluster. Memory usage starts to increase. **20:00 -** The EU2 cluster clients start receiving 502 errors when they attempt to log in. **22:35 -** Auvik internal alerting reports disconnection issues with the EU2 cluster clients. **11/01/2023** **08:30 -** Auvik Engineering begins the investigation. **09:50 -** Auvik declares an incident and posts to the status page. **10:15 -** Engineering adds additional memory resources to the service. This resolves the connection issues for the clients. **10:15 - 14:33 -** Engineering continues investigating to determine the root cause and permanent fix. **15:37 -** The fix is tested in a stage environment successfully. **16:15 -** Auvik alerts its clients on the EU2 cluster that it will implement the fix at 18:00 UTC, with possible service disruptions over the hour it will take to complete. **18:00 -** Auvik implements the fix on the EU2 cluster. **18:58 -** Auvik completes installing the fix and the clean-up processes from the fix implementation. The incident is resolved. ### Future consideration\(s\) * Auvik has reviewed and updated the documentation for upgrading the affected kOps service to prevent this incident from reoccurring. * Auvik will create improved internal alerting to notify Auvik when resources are repeatedly being restarted abnormally. * Auvik will validate the restoration procedure for the kOps service if required.

resolved

The fix for clients having issues connecting to their tenants on cluster EU2 has been implemented. The source of the disruption has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption for clients connecting on the EU2 cluster. To resolve this issue, Auvik was required to upgrade a degraded service. This work has been completed. Internal services are resetting for ingress connections. We are actively monitoring the follow-up and will update this page when complete.

monitoring

We’ve identified the source of the service disruption for clients connecting on the EU2 cluster. To resolve this issue, Auvik was required to upgrade a degraded service. This work has been completed. We are actively monitoring the follow-up and will update this page when complete.

monitoring

We’ve identified the source of the service disruption for clients connecting on the EU2 cluster. To resolve this issue, Auvik is required to upgrade a degraded service. This work has begun. This overall action should take up to an hour, with any disruptions to any individual tenant lasting no more than one minute if an interruption occurs. We apologize for any unscheduled downtime that may arise due to this action. We are actively monitoring and will update this message once complete.

identified

We’ve identified the source of the service disruption for clients connecting on the EU2 cluster. To resolve this issue, Auvik is required to upgrade a degraded service. Auvik will perform this work at 18:00 UTC (6:00 PM GMT). This overall action should take an hour with any disruptions to any individual tenant lasting no more than one minute, if an interruption occurs. We apologize for any unscheduled downtime that may occur due to this action.

monitoring

We’ve identified the source of the service disruption with access for clients to their sites on EU2 and are monitoring the situation. We have implemented changes to alleviate disruption. We continue to work on resolving the root cause. All clients should have no issue connecting. We’ll keep you posted on a resolution.

monitoring

We’ve identified the source of the service disruption with access for clients to their sites on EU2 and are monitoring the situation. We have implemented changes to alleviate disruption. We’ll keep you posted on a resolution.

investigating

We’re experiencing disruption for clients in the EU2 cluster. Access to their sites is impacted. We will continue to provide updates as they become available.

Report: "Service Disruption - Clients on Cluster US2 are unable to log into Auvik"

Last update
postmortem

# Service Disruption - Clients in Cluster US2 Failing to Load Sites ## Root Cause Analysis ### Duration of incident Discovered: Oct 20, 2023, 17:29 - UTC Resolved: Oct 20, 2023, 18:50 - UTC ### Cause The root backend could not communicate with other parts of the website backend on the US2 cluster. ### Effect Resources on the US2 cluster backend were overwhelmed with calls for processor resources, which caused the backend to drop connections in the UI. This prevented clients from accessing their tenants. Data on the backend was delayed but not lost. ### Action taken _All times in UTC_ **10/20/2023** **17:29 -** Auvik engineering teams receive internal notifications of processor issues on the US2 cluster. **17:37 -** The Auvik engineering team starts an investigation into the issue. **17:50 -** The engineering team decides on a course of action to resolve the issue. **18:07 -** Action is taken by engineering to retain data for discovery of a root cause and a reset of the backends on US2. **18:07 - 18:50 -** US2 backends recovered, and tenants started to recover. **18:50 -** The incident is declared over. ### Future consideration\(s\) * Auvik will alter its monitoring and alerting to better predict processor overload on its backends for the Auvik website.

resolved

The repair allowing clients on US2 to log into their site(s) has been implemented. The source of the disruption has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption with clients connecting to their site(s) on cluster US2 and continue to monitor the situation. Clients on US2 should now be able to start logging back into their site(s). We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with clients being able to connect to their site(s) in Auvik. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption with clients on US2. Clients cannot log into their sites. We will continue to provide updates as they become available.

Report: "Service Disruption - Some Devices were Deleted from Some Tenants Sites"

Last update
postmortem

# Service Disruption - Offline Devices Accidentally Deleted from Site Before Set Deletion Threshold ## Root Cause Analysis ### Duration of incident Discovered: Oct 19, 2023 22:00 - UTC Resolved: Oct 20, 2023 13:00 - UTC ### Cause Efforts to resolve a deletion threshold issue inadvertently affected separate work to resolve issues with tenant migration that used a linked identifier. ### Effect Auvik deleted several devices out of client sites that, at the time, were disconnected but had not yet exceeded the time-to-live threshold at which they would normally be deleted. ### Action taken _All times in UTC_ **10/19/2023** **22:00 -** Auvik completes work on repairing the deletion threshold issue. **23:00 -** An Auvik engineering team member becomes aware that the number of devices deleted is higher than expected and alerts the rest of the team responsible for this work. **10/20/2023** **00:00 -** The team begins the analysis of deletions and determines what caused the additional device deletions. **00:30 -** Auvik determines the cause of the additional device deletions and stops the changes from continuing. **00:30 - 13:00 -** The engineering team determines that 178 billable devices were deleted during the issue before the action was stopped. Engineering works to restore all billable devices to the clients' sites. Non-billable devices were not restored at this time since they should reconnect when they are rediscovered by the Auvik scan\(s\). **13:00 -** The issue is considered resolved. ### Future consideration\(s\) * Auvik will proactively address code not written to best practices when it will be affected by other code changes. * Auvik will design code unit tests intended to catch more edge-case issues, not just direct-result issues. * Auvik will improve testing of its Scanner \(Discovery\) services to account for unrelated work possibly affecting the discovery or deletion services used for consolidation.

resolved

It was discovered that a clean-up of stale devices on 10/19/2023 resulted in the accidental deletion of still-active devices on a small number of client sites. The vast majority of these devices were offline at the time. Devices that come back online will reconnect to their tenants. These devices will be re-identified as new devices and will not sync with their historical data. A Root Cause Analysis (RCA) will follow after a full review has been completed.

Report: "Service Disruption - Hierarchy Issues in AU1 are causing issues with a subset of clients in AU1"

Last update
postmortem

# Service Disruption - Parts of the UI are Inaccessible Due to Hierarchical Issues in AU1 ## Root Cause Analysis ### Duration of incident Discovered: Oct 19, 2023, 19:00 - UTC Resolved: Oct 21, 2023, 12:30 - UTC ### Cause A backend service responsible for some non-critical elements of our data hierarchy fell out of sync with other services on the AU1 cluster, corrupting the hierarchical data rendering in the user interface \(UI\). ### Effect Clients in the AU1 cluster had degraded functionality in the UI. ### Action taken _All times in UTC_ **10/19/2023** **19:00 - 20:44 -** Investigation into a ticket from a client uncovers an incident-level issue. This is determined after correlating several different issues submitted by Auvik clients. **23:00 -** The issue is escalated to an incident to provide proper resourcing to resolve. **10/20/2023** **00:00 - 03:30 -** Engineering uncovers issues with the hierarchical services and investigates possible resolutions. Alerts are validated to be intact and working properly; they just cannot be seen in the UI. Ancillary issues dependent on the service are also uncovered, which degrade some other capabilities in the UI. There was no impact on back-end functionality for alerting or other services. **13:00 - 13:30 -** Access to other reported functionality in the UI is validated to affect only Auvik Support and not clients. **13:30 -** A determination is made that, because of the downtime the fix would cause clients and because the data and functionality of alerting are not affected by this incident, the changes required to resolve the incident can wait until the next day’s scheduled maintenance window. **10/21/2023** **11:00 - 12:30 -** The hierarchy service is rebuilt with no loss of data. Full functionality in the UI is restored. The incident is resolved. ### Future consideration\(s\) * Auvik added additional monitoring to alert when the hierarchy services find corruption in processing data.

resolved

The fix for the hierarchy issues in AU1 was deployed successfully during maintenance. The source of the disruption has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption with hierarchy issues in AU1 and are monitoring the situation. This issue is only impacting UI functionality for a small subset of clients in AU1. All services remain unaffected, even if they cannot be seen in the UI. Alerts, etc., will still function as they have been set up to do. Resetting the hierarchical services is expected to take clients down for about an hour. Since all backend functionality is working, Auvik has elected to perform the reset work during the maintenance window at 11:00 UTC on Saturday, October 21, to lessen any impact on the affected clients. We’ll keep you posted if there are any changes to this status.

identified

We’ve identified the source of the service disruption with Hierarchy Issues in AU1. This has caused issues with permissions and missing data in the UI. We are working to restore service as quickly as possible.

Report: "Service Disruption Syslog AU1"

Last update
postmortem

# Service Disruption - Syslog data delayed in AU1 Cluster ## Root Cause Analysis ### Duration of incident Discovered: Oct 20, 2023 15:00 - UTC Resolved: Oct 20, 2023 16:42 - UTC ### Cause An upgrade to Auvik’s production environment, for which no downtime was expected, failed due to a new procedure that was not properly documented. ### Effect Syslog data was delayed for clients in the AU1 cluster for just under two hours. ### Action taken _All times in UTC_ **10/20/2023** **15:00 -** Upgrade work for Auvik backend services on the AU1 cluster started. **15:30 -** An internal alarm was raised about Syslog not processing in the AU1 cluster. **15:55 -** Auvik’s engineering team investigates the outage. **16:00 -** The Syslog job processor was restarted. Syslog data begins to flow. **16:42 -** Delayed Syslog data on cluster AU1 has caught back up. The incident ended. ### Future consideration\(s\) * Auvik will better review new documentation and more thoroughly test it prior to implementing it in the production environment. * Auvik will improve its pre-production environment testing and monitoring to better understand the impact of changes to the production environment.

resolved

Syslog data was delayed for approximately 1.5 hours on Cluster AU1. The interruption started at approximately 15:00 UTC. Data is now flowing properly. There was no data loss.

Report: "Device credentials could not be created or edited."

Last update
postmortem

# Service Disruption - Device Credentials Could Not Be Added Or Changed ## Root Cause Analysis ### Duration of incident Discovered: Sep 26, 2023 16:30 - UTC Resolved: Sep 26, 2023 18:19 - UTC ### Cause A small update to the device configuration page\(s\) was done without sufficient testing. ### Effect Clients could not add, remove, or change device credentials in their tenants in the Auvik product. ### Action taken _All times in UTC_ **09/26/2023** **16:30 -** A device configuration page change rolled out to the Auvik production environment. **16:45 - 17:22 -** Auvik receives notifications from clients that they cannot adjust or create device credentials within the product. **17:22 -** Auvik engineering begins an investigation. **17:42 -** Auvik engineering validates that the issue was not present before the page update. The decision is made to roll back the changes, and Auvik engineering begins preparing to roll back to the previous page code. **17:50 -** Auvik begins the rollback. **18:13 -** The rollback is finished. **18:19 -** Auvik is able to validate that the rollback was successful, and the incident is declared resolved. ### Future consideration\(s\) * Auvik will improve testing methods before introducing any changes to its UI. * Auvik will validate changes to the production environment with on-call personnel prior to implementation.

resolved

Auvik received several reports that device credentials could not be created or altered. This caused an interruption in the ability to update connectivity to devices under Discovery, SNMP pollers, and CLI connections. The functionality was restored by 18:30 UTC. RCA to follow.

Report: "Clients disconnected from sites under AU1 CA1 and EU1 clusters"

Last update
postmortem

# **Service Disruption - Clients in Clusters AU1, CA1 and EU1 were Disconnected** ## Root Cause Analysis ### Duration of incident Discovered: Sep 27, 2023, 11:11 - UTC Resolved: Sep 27, 2023, 12:30 - UTC ### Cause An Auvik engineer inadvertently deleted connectivity from Auvik’s Core Data to the Auvik website. ### Effect * Collectors for clusters AU1, CA1, and EU1 were disconnected from Auvik’s Core Data. * Users with tenants on AU1, CA1, and EU1 could not access their tenants/sites for approximately 10 minutes. * Collectors and new data were unavailable to clients on AU1, CA1, and EU1 for anywhere from 15 minutes up to 1.5 hours, depending on the tenant. ### Action taken _All times in UTC_ **09/27/2023** **11:11 -** An engineer was performing a maintenance task and inadvertently deleted connectivity to Auvik’s Core data. The engineer immediately recognized their error and rescinded the command. Clusters AU1, CA1 and EU1 were affected. Other Auvik clusters were not affected. **11:12 -** The engineer redeployed connectivity to Auvik’s Core data to the three affected clusters. **11:13 -** An internal alert was raised to the engineering team. **11:20 - 12:30 -** Tenants are manually restarted to validate connectivity. **12:30 -** The incident is closed and posted on the Auvik Status page. ### Future consideration\(s\) * Improved safety notification for the Engineering team has been added to the current maintenance workflow to ensure connectivity cannot be terminated without secondary consent. * Review processes for connecting to the production environment and implement any recommended changes.
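The future considerations describe requiring secondary consent before connectivity can be terminated. A generic confirmation guard for destructive maintenance commands, where the operator must retype the affected scope before the action runs, is sketched below; this is an illustrative pattern and not Auvik's internal workflow.

```python
from typing import Callable

def run_destructive(action: Callable[[], None], description: str,
                    confirm: Callable[[str], str] = input) -> bool:
    """Run `action` only after the operator retypes the affected scope word for word."""
    expected = description.strip()
    answer = confirm(f"Type '{expected}' to confirm this destructive action: ").strip()
    if answer != expected:
        print("Confirmation mismatch; aborting.")
        return False
    action()
    return True

if __name__ == "__main__":
    # Hypothetical scope string; in practice this would name the exact resources affected.
    run_destructive(lambda: print("connectivity removed"),
                    "delete core-data connectivity on AU1, CA1, EU1")
```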

resolved

Auvik had tenants under the AU1, EU1, and CA1 clusters become disconnected due to a backend process that removed connectivity. There was no data loss. The connection has been restored. Any remaining client tenants not showing a connection will be restored by 12:30 UTC. RCA to follow.

Report: "Service Disruption - Several collectors in the AU1 cluster have lost connection with Auvik"

Last update
resolved

The resolution for the collectors in cluster AU1 that were not connecting has been implemented. The source of the disruption has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption with collectors not connecting on the AU1 cluster and are monitoring the situation. We are currently reconnecting the collectors having connection issues. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with several collector connections to Auvik in the AU1 cluster. We are working to restore service as quickly as possible.

investigating

We’re experiencing a service disruption to several collectors connecting to Auvik in the AU1 cluster. We will continue to provide updates as they become available.

Report: "Service Disruption - A number of collectors in EU1 are experiencing connectivity issues with Auvik."

Last update
resolved

The changes to restore connectivity to affected customers on the EU1 cluster have been completed. The source of the disruption has been resolved, and services have been fully restored. A Root Cause Analysis (RCA) will follow after a full review has been completed.

monitoring

We’ve identified the source of the service disruption with a number of clients connected to the EU1 Cluster and are monitoring the situation. Changes to restore connectivity are nearly complete. We anticipate full restoration of service within the hour. We’ll keep you posted on a resolution.

monitoring

We’ve identified the source of the service disruption with a number of clients connected to the EU1 Cluster and are monitoring the situation. We are making changes to restore connectivity as quickly as possible. We’ll keep you posted on a resolution.

identified

We’ve identified the source of the service disruption with a number of clients connected to the EU1 Cluster. We are working to restore service as quickly as possible.

investigating

We’re experiencing disruption to collector connectivity with a number of clients connected to the EU1 Cluster. We will continue to provide updates as they become available.