Historical record of incidents for Netdata
Report: "MQTT broker failure"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We had a major problem with our MQTT broker. It is currently up and running and agents are reconnecting to the cloud.
Report: "MQTT broker failure"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We had a major problem with our MQTT broker. It is currently up and running and agents are reconnecting to the cloud.
Report: "Nightly static builds overwrite node and metric data upon install"
Last update: Affected Agents can cause the creation of multiple duplicate nodes in Netdata Cloud. All but the last one will appear as offline, and the last one will appear as if it were created from scratch with no data. Unfortunately, the previously stored metrics for the affected nodes cannot be recovered. The duplicate offline nodes can be safely deleted from Space Settings -> Nodes. Note that you may have to add the newest copy of these nodes to the appropriate rooms. The fixed nightly static build is v2.3.0-102.
We have found that a recent change in the nightly static builds of the Netdata Agent causes metadata on the Agent to be overwritten. Specifically, the sqlite3 database that keeps metadata on which time series stored in dbengine correspond to which metrics, and the information on the Agent's "machine GUID", are overwritten with the copies shipped in the build package.

Not affected are:
- All stable releases
- Native packages (.deb and .rpm)

Affected are all nightly static builds with the following version numbers:
- 2.3.0-50-nightly
- 2.3.0-60-nightly
- 2.3.0-72-nightly
- 2.3.0-78-nightly
- 2.3.0-87-nightly

The initial impact is that all affected Agent installs, even though they still have the time series data stored on disk, have lost all metadata associated with it, so these time series become inaccessible. This is unrecoverable. Additionally, the main form of identification is overwritten, too. We are assessing the impact for users of Netdata Cloud, and will update this incident with more information when the investigation is completed. The bug itself has been fixed and merged. We will issue a new nightly build shortly.
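For reference, one way to check whether a node is running one of the affected nightly static builds (a minimal sketch; the `.environment` path and variable names are assumptions based on a default static install under /opt/netdata):

```sh
# Print the running Agent version, e.g. "netdata v2.3.0-87-nightly"
netdata -v

# Static (kickstart) installs typically record the install type and release channel here
# (path and variable names are assumptions; adjust for your layout):
grep -E 'INSTALL_TYPE|RELEASE_CHANNEL' /opt/netdata/etc/netdata/.environment 2>/dev/null
```

If the reported version matches one of the builds listed above and the install is a static nightly, the node is affected.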
Report: "Lastest Agent nightly build (v2.2.0-245) broken at first start"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
The builds are completed, so we are watching out for any remaining related issues.
We have identified the issue, committed a fix, and initiated new nightly builds for all platforms. This will take several hours. In the meantime, please restart Netdata to work around the issue.
We are investigating an issue with today's nightly (v2.2.0-245) that causes alerting ("health") to not work and external plugins, including go.d, to fail to connect properly. This may be resolved by restarting the Agent. Stable versions of the Agent are not affected.
Report: "Alarm Processing Delays"
Last update: Our alarm processing infrastructure was running behind, which caused inaccurate alarms for some nodes. No data has been lost and the systems should already be up to date.
Report: "Alerting is working slower"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Due to the release of Netdata Agent 2.0, we have quite a big backlog of alarms. We are investigating this issue.
Report: "Delays in alarms on the Netdata Cloud"
Last update: This incident has been resolved.
Currently, we are waiting for the fix to take effect, and some users might experience delays in all cloud operations.
The issue has been identified and a fix is being implemented.
We were alerted to a delay in alarms for some users and are investigating the matter.
Report: "Recent nightly static and local builds of Netdata Agent overwrite netdata.conf with defaults"
Last update: Prior to [netdata/netdata#17475](https://github.com/netdata/netdata/pull/17475), the `netdata.conf` and `netdata-updater.conf` files were handled by the installer code outside of the build system. With the shift to using the build system to produce packages, handling for them needed to be moved into the build system. However, insufficient testing was performed to confirm that this would not break other installation types, and the change was not properly made conditional on packages being built.

As a result, static and local builds with version `v1.45.0-315-nightly` will overwrite these configuration files with the default templates for those files. This causes all local changes to those files to be lost. In particular, if the Agent configuration had been changed for longer retention, the overwritten configuration will have undone those settings, causing any metrics data **beyond the _default_ retention to be lost** on the first run of this version. We have pulled the affected build artifacts to prevent our installer from using them.

While [the fix](https://github.com/netdata/netdata/pull/17572) ensures the issue won't occur in future versions, starting with version `v1.45.0-326-nightly`, it is important to note that affected installations **will not automatically recover** their previous configurations. If you were using a non-default `netdata.conf` and/or `netdata-updater.conf` and experienced this bug, you will need to **manually reconfigure** your Netdata install.

As we aim to carefully develop Netdata for many platforms and hardware architectures, we release nightly builds of the Netdata Agent to catch any issues our changes may have caused, beyond our own internal testing. Unfortunately, we sometimes make mistakes that we do not catch in our testing, with data loss as an extreme possible outcome. We therefore strongly recommend using our **stable releases for production systems**. You can review the [difference between nightly and stable builds](https://learn.netdata.cloud/docs/netdata-agent/installation#nightly-vs-stable-releases), and our recommended [best practices](https://www.netdata.cloud/blog/netdata-best-practices/). If you have been affected by this issue and/or have any questions, please let us know.
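If you need to manually reconfigure an affected install, a minimal sketch (assuming the default static-build layout under /opt/netdata and a systemd-managed service; adjust paths for native or local builds):

```sh
cd /opt/netdata/etc/netdata
sudo ./edit-config netdata.conf           # re-apply your custom settings, e.g. longer retention
sudo ./edit-config netdata-updater.conf   # re-apply any updater customizations
sudo systemctl restart netdata            # restart the Agent so the restored settings take effect
```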
The build artifacts for the new nightly release (1.45.0-326) are now available, and we consider the incident resolved. Should you experience any issues, please let us know!
Update regarding potential data loss: this will happen if the configuration had been changed to increase metric retention (with respect to the defaults). Unfortunately, any stored data beyond the default metric retention will be lost on running installs of the affected builds. The only way to prevent this is by not using (or having used) version v1.45.0-315-nightly. We have made sure that the corresponding artifacts are no longer accessible by the installer.
The affected build is v1.45.0-315-nightly, as well as local builds starting with commit https://github.com/netdata/netdata/commit/5973417027606bacf044b3ead40a882931ce773f (April 30, 11:45 UTC) up until commit https://github.com/netdata/netdata/commit/0f2a261839d5ffc42f17383b4292673aa93d6a1f (May 1, 15:13 UTC).
We've identified an issue with static and local builds of the Netdata Agent that causes its main configuration in `/etc/netdata/netdata.conf` or `/opt/netdata/etc/netdata/netdata.conf` to be overwritten with the defaults. The `netdata-updater.conf` file is similarly affected. Depending on which settings had been changed with respect to the defaults, this may result in data loss. We will update this incident with more detailed information on the impact as soon as possible. Docker image and native package builds, as well as stable builds, are not affected. We have created a fix (https://github.com/netdata/netdata/pull/17572) and have triggered a new nightly build. As soon as those artifacts are available, we will also update this incident.
Report: "The "kickstart" way of installation is broken"
Last update: This incident has been resolved.
The installation method described in https://learn.netdata.cloud/docs/installing/one-line-installer-for-all-linux-systems is broken. We are currently investigating the issue.
Report: "Delay with sending cloud alarms"
Last update: This incident has been resolved.
We have noticed that there could be some delays with sending alarms. The situation should get back to normal in about an hour.
Report: "Cloud connectivity issue"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our metrics show that some agents are experiencing intermittent disconnections from the cloud. We are currently investigating the issue.
Report: "Problem with Alert Configuration in the Cloud Web UI"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've implemented a fix and are monitoring for left-over issues.
We have applied a fix that allows for seeing all configured alerts again, but the detail page for individual alerts may still not render. We are working on addressing that latter issue.
We've identified that we are getting some malformed data from an API after a recent update and are working on a fix.
We are currently investigating the issue.
Report: "Processing Events feed problem"
Last update: This incident has been resolved.
We detected problems with processing the Events feed and are working on a solution.
Report: "Possible login problem - Netdata Cloud"
Last update: We have successfully addressed the issue from our end, eliminating the need to remove cookies.
We have detected a potential problem with the cookies that are necessary for logging into the Netdata Cloud platform. The easy and quick solution is to clear cookies for app.netdata.cloud. We are currently looking for a permanent way to resolve this issue on our end.
Report: "Startup issue in latest Agent nightly (1.40.0-6-nightly)"
Last update: All packages have been published. If your nodes are still on 1.40.0-6, please refer to the instructions to upgrade: https://learn.netdata.cloud/docs/maintaining/update-netdata-agents#updates-for-most-systems. We are now closing this incident, but please let us know if things are still not working on your nodes.
The source tarballs with the fix for native builds are now available. Packages for ARM systems are still building but should be fully published and available by 17:00 UTC at the latest.
The native packages for x86-based distributions have been published. The ARM ones are still building and should follow shortly, as well as the static builds. We're watching Netdata Cloud and the various social channels to monitor the outcome of the new builds.
The fix has been merged and we've kicked off the build process for the packages. We will provide another update when the packages for the affected systems have been pushed.
We have created a fix for this issue, which is a combination of making systemd not change the ownership and permissions of the directories the Agent uses, and the Agent properly changing permissions recursively to recover from the effects of the bad version. As soon as we've tested the fix and the packages have been built, we will trigger an explicit push to the nightlies repos.
While we are working on a fix, which requires a new package to be built, we have developed a workaround. It requires downgrading the Agent to 1.40.0-2-nightly and fixing the permissions. For Debian-based systems, this script should work, run as root: https://gist.github.com/ralphm/1326498c474aaacf0a12f9e569dac863
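For illustration, the permission-fixing part of that workaround amounts to something like the following (a sketch only, assuming a standard Debian package layout with the Agent running as the netdata user; the linked gist remains the authoritative version):

```sh
# Stop the Agent, restore ownership of its working directories, then start it again.
systemctl stop netdata
chown -R netdata:netdata /var/lib/netdata /var/cache/netdata /var/log/netdata
systemctl start netdata
```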
Agents running the most recent nightly (1.40.0-6-nightly) fail to start on some platforms because of a permissions issue. We believe the culprit is this change: https://github.com/netdata/netdata/pull/14890, and are working on a fix. As this happens early in the Agent's startup, it affects Cloud and non-Cloud users alike.
We are currently investigating an issue with agent connectivity to the cloud.
Report: "Agent connectivity problem."
Last update: Connected clients metrics are going back to normal values, and the new Netdata Agent works as expected.
We had to ban agent version 1.39.0-97 from connecting to the cloud. The exact affected agent versions are 1.39.0-97-nightly and 1.39.0-97-{hash}. This incident will be closed when the new Netdata release is available for download. Please update your endpoints then, or wait for the automatic update to take place tomorrow.
We found that the issue is caused by the latest nightly version of the agent. We are releasing the fix.
We are continuing to investigate this issue.
We are currently investigating the issue.
Report: "Node status updates are delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We had a problem with updating node status in the Cloud UI. In practice this means there was a delay between a node changing status (for example, from online to offline) and the cloud registering that fact. This also affected newly added agents and the deletion of existing ones. Due to the connection update delay, there is also a delay in chart metadata updates in our database, which we are dealing with right now. From the user's perspective, this means that charts for newly installed applications, and charts for new nodes in general, are not yet showing up in the Cloud UI.
Report: "Problem with reconnecting agents"
Last update: The incident has been resolved. The problem was triggered by an automatic configuration reload in the load balancer; during that time there was a delay in processing alarms.
Netdata agents are being forced to reconnect. We are investigating the root cause.
Report: "Netdata Agents [NIGHTLY] with ML turned off might crash."
Last update: Dear valued users,

We would like to inform you of a recent development regarding our latest nightly image of the Netdata Agent. It has come to our attention that some machines may experience an issue when the ML setting is turned off. We apologize for any inconvenience this may have caused. However, we are pleased to announce that our team has already taken action to address this matter. A revert has been successfully merged through this pull request: https://github.com/netdata/netdata/pull/14908. Rest assured that we are currently in the process of re-testing and building a new image after the fix to ensure that the issue will be resolved. Please note that the stable release is not affected by this matter, and you may continue to use it without any concern. We appreciate your understanding and patience as we work to improve our services. Thank you for your continued support.

Sincerely,
The Netdata Team
Report: "Agent connectivity disruption"
Last update: As we see the number of connected agents going back to expected levels, and the number of agents running the previous nightly going down, we consider this incident resolved.
The new build (1.37.0-55) has completed for most platforms. Please follow the instructions at https://learn.netdata.cloud/docs/agent/packaging/installer/update if you are on the affected version (1.37.0-48) and want to upgrade your agents manually. If you have automatic updates configured, you can also wait for the update to be done during your night. We will be monitoring the progress of Agents as they reconnect.
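For convenience, manually updating usually amounts to running the updater script that ships with the Agent (a sketch; the linked instructions are authoritative, and the script's location varies by install type):

```sh
# Native or local builds typically install the updater here:
sudo /usr/libexec/netdata/netdata-updater.sh

# Static (kickstart) builds typically ship it under the /opt/netdata prefix instead:
sudo /opt/netdata/usr/libexec/netdata/netdata-updater.sh
```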
The new build (1.37.0-55) has been triggered and we will post an update when it is ready. We will include instructions on how to update manually, or you can wait until the auto-upgrade happens during your night. Note:
- If you are running a nightly build older than 1.37.0-48, you are not affected and no action is required.
- If you are running a stable build, you are not affected and no action is required. However, we do strongly recommend upgrading to 1.37.1 because of two security vulnerabilities in older versions.
We have identified the offending change in the Agent. Only the latest nightly build (1.37.0-48-nightly) of the Agent is affected. The problem only occurs if the Agent tries to reconnect after having lost its first connection to Cloud. This means that if you restart your agent, the problem is avoided until its connection to Cloud drops. We will issue a new nightly build that removes the offending change.
We are able to reproduce the issue and are attempting to pinpoint the cause.
We are seeing an increasing number of Agents that cannot (properly) connect to Cloud. We are investigating the cause, but initial indications are that it may be related to the latest nightly release of the Agent (version 1.37.0-48-nightly).
Report: "Delay in processing node availability changes"
Last update: This incident has been resolved.
The backlog has been consumed. We are monitoring the situation.
We are working through the backlog of availability updates and should be done in about 30 minutes.
We've identified an issue with delayed processing of node availability (online, stale, offline) changes. For a fraction of our users this means that these changes are not reflected properly in Netdata Cloud. As the availability affects what metrics are shown in Cloud, it may be that some metrics are not visible even though the node is supposed to be available.
Report: "Node status updates are delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are facing an issue where our backlog for node status updates is consumed more slowly than normal. Users might find that node status is slow to change from online to offline or the other way around.
Report: "Degradated performance on charts metadata"
Last update: This incident has been resolved.
Requests for retrieving and updating charts metadata are slower than they should be. Users might not see changes to their charts instantly replicated to the Cloud UI (e.g. after software installation or removal), while the Agent's local UI is updated instantly. We have already identified the problem and we are fixing it.
Report: "Delayed alarms"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Hi all, we have just noticed that for some of our clients there might be a bit of a delay between the Agent triggering an alert and the Cloud UI showing it. We are investigating the issue.
Report: "Routing problem on netdata cloud"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are facing a small issue with our internal routing between the apps. We have identified the problem and a solution is going to be deployed soon.
Report: "Sign-ups via email magic link are failing"
Last update: This incident has been resolved.
We've reverted and are monitoring the issue.
We have found the offending change that causes sign-ups via email magic link to fail, and are reverting to work on a proper fix.
We are currently investigating this issue.
Report: "Slow and failing Agent chart data responses"
Last update: Reverting the default away from MQTT5 removed the immediate issue, and most Agents on the nightlies are now on the latest (v1.35.0-104-nightly). In the meantime we've also found the true cause: the Agent was not properly processing incoming commands in the MQTT5 implementation, due to a bug in how the parser interacted with the buffer of incoming data. This has been resolved in the upcoming nightly build of the Agent. As we want to do some more testing, for now the Agent will keep using the older MQTT library by default.
For completeness, the affected versions are v1.35.0-84-nightly and v1.35.0-96-nightly. The latest, corrected version is v1.35.0-104-nightly.
The new nightly version of the Netdata Agent has been published and installed by a large portion of the agents that auto-update. We are monitoring the results.
We have identified part of the cause of the failing responses for alarm values. In yesterday's nightly build of the Agent, we enabled the use of the newer MQTT5 library by default. We will create another build to revert that. In the meantime, you can explicitly disable this library using the mqtt5 setting in your configuration as described here: https://github.com/netdata/cloud-backend/issues/178. Additionally, the other latencies appear to be another instance of a known issue that causes responses with a small payload to be delayed. We are working on resolving this issue.
Users with nightly versions of the Netdata Agent are experiencing slow responses between Cloud and Agent, resulting in failing or slow charts in their Cloud dashboards. We are investigating the issue.
Report: "Missing charts"
Last update: We have implemented a change that restores all charts. Unfortunately there remains a bug that, in certain situations, causes the top gauges to be missing from the single node tabs. We will fix this in the coming week. Updates on this, including a workaround (using the overview tab with node filtering), can be found here: https://github.com/netdata/netdata-cloud/issues/484#issuecomment-1166306503
The issue has been identified and a fix is being implemented.
We are investigating an issue that causes some charts to be missing from the single node view and overview tabs.
Report: "Charts metadata updates are delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Charts metadata is not refreshed instantly, so users might see old or outdated charts in their UI. Data points in displayed charts are up to date, since they are streamed directly from the node.
Report: "Newly connected / re-connected agents to cloud shown as offline"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Newly connected / re-connected agents to cloud shown as offline"
Last update: We are currently investigating this issue.
Report: "Newly connected / re-connected agents to cloud shown as offline"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Nodes missing from spaces"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've identified the cause of the problem and we're working on recovering the missing nodes.
We're currently investigating an issue where some nodes are missing from user spaces. We estimate that 10% of nodes are affected by this.
Report: "Slow charts and dimensions metadata updates"
Last update: We had a bigger than usual backlog of metadata updates. Some users might have had a problem with quick updates to their charts and dimensions. This is all solved now.
Report: "Netdata Cloud Alarms status updates are delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Degraded chart syncing performance"
Last update: This incident has been resolved.
We've applied the fix and are carefully monitoring the situation now.
We are working on a fix now.
Some users reported missing charts in the nodes tab and charts for older metrics that are not currently recording data. We found a synchronization issue that affects about 5% of nodes. No (meta)data is lost, and we are working on a solution.
We're investigating an issue with slow chart syncing.
Report: "Agent Cloud connectivity issue."
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
Our backbone infrastructure is currently unstable; it looks like it might be related to problems at our cloud vendor. We are investigating the issue right now.
Report: "Alerting might be delayed"
Last update: This incident has been resolved.
Everything has been stable for some time now, but we are still observing the changes made to the affected applications.
We are currently having some issues with the alerting component (Netdata Cloud). Some users might experience delayed alerts. We are working on solving the issue.
Report: "Cloud application degraded performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our Pulsar cluster is encountering stability problems. We're actively working on it.
We're investigating a degradation on our cloud application performance.
Report: "MQTT Broker problems"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We finished the restart and managed to fix the problems. All services are now stable.
It looks like we need to restart the MQTT brokers (in about 30 minutes) to apply some bug fixes. This will result in agents reconnecting to Netdata Cloud.
Our MQTT brokers are not performing up to the standard we would like. At this moment, end users are not experiencing any difficulties or degradation of service. The issue is being investigated; we will provide more details soon.
Report: "Agent -> Cloud connection issue"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We have an issue with our database. We are investigating.
Report: "Agent -> Cloud connection issue"
Last update: We discovered that the performance of one of our MQTT brokers was worse than expected, resulting in numerous agents not being able to connect to the cloud infrastructure. The problem has been fixed.
Report: "Agent - Cloud Connectivity Issue"
Last update: We are seeing the number of connected agents at the same level as before, and the processing backlog has been resolved. Closing.
Connectivity has been restored and agents are reconnecting.
We are currently investigating this issue.
Report: "Significant drop of traffic between agents and cloud"
Last update: This incident has been resolved and pending messages have been processed.
A fix has been applied and agent connections are recovering. We are monitoring to make sure everything is working as expected.
We've identified a problem in connections between Netdata agents and Cloud, and are working on a fix.
Report: "Broken agent connections"
Last update: Incident resolved.
We've applied a change that should improve reconnection speed.
The issue has been identified and a fix is being implemented.
We have identified an issue with our proxies that caused agents to be disconnected. Agents should reconnect automatically.
Report: "SSL certificate verification errors connecting to the cloud after Sep 30th"
Last update: User action required, please see https://community.netdata.cloud/t/certificate-verification-error-connecting-to-the-cloud/1790
Report: "We notice cloud application degrated performance"
Last update: Incident is resolved.
We observe nominal behavior on all micro-services. Pulsar is consuming messages at a normal rate. We continue to monitor all services.
We have rolled back all Pulsar replications. Services are stabilized. Residual effects (like late notifications) may still exist at this point. We continue to monitor the services.
Geo-replication on pulsar partially failed due to increased RAM requirements. The extra requirements forced Kubernetes to restart specific pods. As a result some messages have been transmitted out of order, and many notifications were transmitted with delay.
The Web UI is back online. Service is partially restored.
It seems that Netdata messages are not being properly consumed. The issue relates to a new Pulsar replication that was introduced today. We are proceeding with an immediate rollback.
We are currently investigating the issue.
Report: "Notification Center Service does not consume notification events."
Last update: All service indicators are nominal. Incident is considered resolved.
The fix is applied and we are monitoring the performance.
Staging tests on the new fix completed successfully. We are proceeding with deployment to production.
The issue has been identified in a new query introduced in the latest release. An immediate fix has been applied and is currently under testing.
We are proceeding to drop messages from the queue in order to reduce load on the DB and bring the system back to a normal state. If the problem is not resolved, we will proceed with a rollback of the latest updates.
We are currently investigating the issue. MongoDB is experiencing high load.
Report: "VerneMQ / Pulsar drops messages without processing them."
Last update: Since the service has been stable for the last 12 hours, we are declaring the case resolved.
The service is stable and all messages from agents are consumed properly. We will continue to monitor for any inconsistencies and close the incident in the coming hours.
Monitoring continues. We see no anomalies so far.
Service under monitoring.
A few Kubernetes pods experienced a race condition, hanging on a response from the Redis services. The issue has been resolved, and currently all messages are being properly handled. We will leave the additional resources in place, and monitor performance and stability during the following hours. An appropriate root cause analysis will follow for Redis. Further to that, additional monitoring metrics will be introduced in order to react to and rectify similar incidents in the future.
We continue to investigate the root cause. We have added more pods on Kubernetes in an effort to reduce the number of dropped messages, and we have significantly improved message consumption, but we still observe messages being lost. Further updates will follow once we identify what is causing the issue.
Issue still under investigation.
We are currently investigating the issue.
Report: "Persistent timeouts for some nodes"
Last update: We don't see the same pattern anymore. There are occasional delays, but they are unrelated to the persistent timeouts we were observing before.
A fix has been implemented and we are monitoring the results.
We just restarted a piece of our infrastructure that will cause all agents to reconnect to the cloud. It will take a few minutes until the app works again.
About 4% of requests for charts from the agents are timing out, due to an issue we are aware of. We are trying different approaches to resolve the situation for now and have identified what we need to do for a permanent fix.
Report: "Service unavailable"
Last update: Resolved at 16:00 UTC. All services are operational.
A fix has been implemented and we are monitoring the results.
We are aware of the root cause of the outage that started about 30 min ago and are working on returning to proper operation.