Historical record of incidents for Netdata
Report: "MQTT broker failure"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We had a major problem with our MQTT broker. It is currently up and running and agents are reconnecting to the cloud.
Report: "MQTT broker failure"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We had a major problem with our MQTT broker. It is currently up and running and agents are reconnecting to the cloud.
Report: "Nightly static builds overwrite node and metric data upon install"
Last update: Affected Agents can cause the creation of multiple duplicate nodes in Netdata Cloud. All but the last one will appear as offline, and the last one will appear as if it were created from scratch with no data. Unfortunately, the previously stored metrics for the affected nodes cannot be recovered. The duplicate offline nodes can be safely deleted from Space Settings -> Nodes. Note that you may have to add the newest copy of these nodes to the appropriate rooms. The fixed nightly static build is v2.3.0-102.
We have found that a recent change in the nightly static builds of the Netdata Agent causes metadata on the Agent to be overwritten. Specifically, the sqlite3 database that keeps metadata on which time series stored in dbengine correspond to which metrics, and the information on the Agent's "machine GUID", are overwritten with the copies shipped in the build package.

Not affected are:
- All stable releases
- Native packages (.deb and .rpm)

Affected are all nightly static builds with the following version numbers:
- 2.3.0-50-nightly
- 2.3.0-60-nightly
- 2.3.0-72-nightly
- 2.3.0-78-nightly
- 2.3.0-87-nightly

The initial impact is that all affected Agent installs, even though they still have the time series data stored on disk, have lost all metadata associated with it, so these time series become inaccessible. This is unrecoverable. Additionally, the main form of identification is overwritten, too. We are assessing the impact for users of Netdata Cloud, and will update this incident with more information when the investigation is completed. The bug itself has been fixed and merged. We will issue a new nightly build shortly.
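For reference, one way to check whether a node is running one of the affected nightly static builds (a minimal sketch; the `.environment` path and variable names are assumptions based on a default static install under /opt/netdata):

```sh
# Print the running Agent version, e.g. "netdata v2.3.0-87-nightly"
netdata -v

# Static (kickstart) installs typically record the install type and release channel here
# (path and variable names are assumptions; adjust for your layout):
grep -E 'INSTALL_TYPE|RELEASE_CHANNEL' /opt/netdata/etc/netdata/.environment 2>/dev/null
```

If the reported version matches one of the builds listed above and the install is a static nightly, the node is affected.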
Report: "Lastest Agent nightly build (v2.2.0-245) broken at first start"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
The builds are completed, so we are watching out for any remaining related issues.
We have identified the issue, committed a fix, and initiated new nightly builds for all platforms. This will take several hours. In the meantime, please restart Netdata to work around the issue.
We are investigating an issue with today's nightly (v2.2.0-245) that causes alerting ("health") to not work and external plugins, including go.d, to fail to connect properly. This may be resolved by restarting the Agent. Stable versions of the Agent are not affected.
Report: "Alarm Processing Delays"
Last update: Our alarm processing infrastructure was running behind, which caused inaccurate alarms for some nodes. No data has been lost and the systems should already be up to date.
Report: "Alerting is working slower"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Due to the release of Netdata Agent 2.0, we have quite a big backlog of alarms. We are investigating this issue.
Report: "Delays in alarms on the Netdata Cloud"
Last update: This incident has been resolved.
Currently, we are waiting for the fix to take effect, and some users might experience delays in all cloud operations.
The issue has been identified and a fix is being implemented.
We were alerted to a delay in alarms for some users and are investigating the matter.
Report: "Recent nightly static and local builds of Netdata Agent overwrite netdata.conf with defaults"
Last update: Prior to [netdata/netdata#17475](https://github.com/netdata/netdata/pull/17475), the `netdata.conf` and `netdata-updater.conf` files were handled by the installer code outside of the build system. With the shift to using the build system to produce packages, handling for them needed to be moved into the build system. However, insufficient testing was performed to confirm that this would not break other installation types, and the change was not properly made conditional on packages being built.

As a result, static and local builds with version `v1.45.0-315-nightly` will overwrite these configuration files with the default templates for those files. This causes all local changes to those files to be lost. In particular, if the Agent configuration had been changed for longer retention, the overwritten configuration will have undone those settings, causing any metrics data **beyond the _default_ retention to be lost** on the first run of this version. We have pulled the affected build artifacts to prevent our installer from using them.

While [the fix](https://github.com/netdata/netdata/pull/17572) ensures the issue won't occur in future versions, starting with version `v1.45.0-326-nightly`, it is important to note that affected installations **will not automatically recover** their previous configurations. If you were using a non-default `netdata.conf` and/or `netdata-updater.conf` and experienced this bug, you will need to **manually reconfigure** your Netdata install.

As we aim to carefully develop Netdata for many platforms and hardware architectures, we release nightly builds of the Netdata Agent to catch any issues our changes may have caused, beyond our own internal testing. Unfortunately, we sometimes make mistakes that we do not catch in our testing, with data loss as an extreme possible outcome. We therefore strongly recommend using our **stable releases for production systems**. You can review the [difference between nightly and stable builds](https://learn.netdata.cloud/docs/netdata-agent/installation#nightly-vs-stable-releases), and our recommended [best practices](https://www.netdata.cloud/blog/netdata-best-practices/). If you have been affected by this issue and/or have any questions, please let us know.
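If you need to manually reconfigure an affected install, a minimal sketch (assuming the default static-build layout under /opt/netdata and a systemd-managed service; adjust paths for native or local builds):

```sh
cd /opt/netdata/etc/netdata
sudo ./edit-config netdata.conf           # re-apply your custom settings, e.g. longer retention
sudo ./edit-config netdata-updater.conf   # re-apply any updater customizations
sudo systemctl restart netdata            # restart the Agent so the restored settings take effect
```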
The build artifacts for the new nightly release (1.45.0-326) are now available, and we consider the incident resolved. Should you experience any issues, please let us know!
Update regarding potential data loss: this will happen if the configuration had been changed to increase metric retention (with respect to the defaults). Unfortunately, any stored data beyond the default metric retention will be lost on running installs of the affected builds. The only way to prevent this is by not using (or having used) version v1.45.0-315-nightly. We have made sure that the corresponding artifacts are no longer accessible by the installer.
The affected build is v1.45.0-315-nightly, as well as local builds starting with commit https://github.com/netdata/netdata/commit/5973417027606bacf044b3ead40a882931ce773f (April 30, 11:45 UTC) up until commit https://github.com/netdata/netdata/commit/0f2a261839d5ffc42f17383b4292673aa93d6a1f (May 1, 15:13 UTC).
We've identified an issue with static and local builds of the Netdata Agent that causes its main configuration in `/etc/netdata/netdata.conf` or `/opt/netdata/etc/netdata/netdata.conf` to be overwritten with the defaults. The `netdata-updater.conf` file is similarly affected. Depending on which settings had been changed with respect to the defaults, this may result in data loss. We will update this incident with more detailed information on the impact as soon as possible. Docker image and native package builds, as well as stable builds, are not affected. We have created a fix (https://github.com/netdata/netdata/pull/17572) and have triggered a new nightly build. As soon as those artifacts are available, we will also update this incident.
Report: "The "kickstart" way of installation is broken"
Last update: This incident has been resolved.
The installation method described in https://learn.netdata.cloud/docs/installing/one-line-installer-for-all-linux-systems is broken. We are currently investigating the issue.
Report: "Delay with sending cloud alarms"
Last update: This incident has been resolved.
We have noticed that there could be some delays with sending alarms. The situation should get back to normal in about an hour.
Report: "Cloud connectivity issue"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our metrics show that some agents are experiencing intermittent disconnections from the cloud. We are currently investigating the issue.
Report: "Problem with Alert Configuration in the Cloud Web UI"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've implemented a fix and are monitoring for left-over issues.
We have applied a fix that allows for seeing all configured alerts again, but the detail page for individual alerts may still not render. We are working on addressing that latter issue.
We've identified that we are getting some malformed data from an API after a recent update and are working on a fix.
We are currently investigating the issue.
Report: "Processing Events feed problem"
Last update: This incident has been resolved.
We detected problems with processing the Events feed and are working on a solution.
Report: "Possible login problem - Netdata Cloud"
Last update: We have successfully addressed the issue from our end, eliminating the need to remove cookies.
We have detected a potential problem with the cookies that are necessary for logging into the Netdata Cloud platform. The easy and quick solution is to clear cookies for app.netdata.cloud. We are currently looking for a permanent way to resolve this issue on our end.
Report: "Startup issue in latest Agent nightly (1.40.0-6-nightly)"
Last update: All packages have been published. If your nodes are still on 1.40.0-6, please refer to the instructions to upgrade: https://learn.netdata.cloud/docs/maintaining/update-netdata-agents#updates-for-most-systems. We are now closing this incident, but please let us know if things are still not working on your nodes.
The source tarballs with the fix for native builds are now available. Packages for ARM systems are still building but should be fully published and available by 17:00 UTC at the latest.
The native packages for x86-based distributions have been published. The ARM ones are still building and should follow shortly, as well as the static builds. We're watching Netdata Cloud and the various social channels to monitor the outcome of the new builds.
The fix has been merged and we've kicked off the build process for the packages. We will provide another update when the packages for the affected systems have been pushed.
We have created a fix for this issue, which is a combination of making systemd not change the ownership and permissions of the directories the Agent uses, and the Agent properly changing permissions recursively to recover from the effects of the bad version. As soon as we've tested the fix and the packages have been built, we will trigger an explicit push to the nightlies repos.
While we are working on a fix, which requires a new package to be built, we have developed a workaround. It requires downgrading the Agent to 1.40.0-2-nightly and fixing the permissions. For Debian-based systems, this script should work, run as root: https://gist.github.com/ralphm/1326498c474aaacf0a12f9e569dac863
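For illustration, the permission-fixing part of that workaround amounts to something like the following (a sketch only, assuming a standard Debian package layout with the Agent running as the netdata user; the linked gist remains the authoritative version):

```sh
# Stop the Agent, restore ownership of its working directories, then start it again.
systemctl stop netdata
chown -R netdata:netdata /var/lib/netdata /var/cache/netdata /var/log/netdata
systemctl start netdata
```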
Agents running the most recent nightly (1.40.0-6-nightly) fail to start on some platforms because of a permissions issue. We believe the culprit is this change: https://github.com/netdata/netdata/pull/14890, and are working on a fix. As this happens early in the Agent's startup, it affects Cloud and non-Cloud users alike.
We are currently investigating an issue with agent connectivity to the cloud.
Report: "Agent connectivity problem."
Last update: Connected clients metrics are going back to normal values, and the new Netdata Agent works as expected.
We had to ban agent version 1.39.0-97 from connecting to the cloud. The exact affected agent versions are 1.39.0-97-nightly and 1.39.0-97-{hash}. This incident will be closed when the new Netdata release is available for download. Please update your endpoints then, or wait for the automatic update to take place tomorrow.
We found that the issue is caused by the latest nightly version of the agent. We are releasing the fix.
We are continuing to investigate this issue.
We are currently investigating the issue.
Report: "Node status updates are delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We had a problem with updating node status in the Cloud UI. In practice this means there was a delay between a node changing status (for example, from online to offline) and the cloud registering that fact. This also affected newly added agents and the deletion of existing ones. Due to the connection update delay, there is also a delay in chart metadata updates in our database, which we are dealing with right now. From the user's perspective, this means that charts for newly installed applications, and charts for new nodes in general, are not yet showing up in the Cloud UI.
Report: "Problem with reconnecting agents"
Last update: The incident has been resolved. The problem was triggered by an automatic configuration reload in the load balancer; during that time there was a delay in processing alarms.
Netdata agents are being forced to reconnect. We are investigating the root cause.
Report: "Netdata Agents [NIGHTLY] with ML turned off might crash."
Last update: Dear valued users,

We would like to inform you of a recent development regarding our latest nightly image of the Netdata Agent. It has come to our attention that some machines may experience an issue when the ML setting is turned off. We apologize for any inconvenience this may have caused. However, we are pleased to announce that our team has already taken action to address this matter. A revert has been successfully merged through this pull request: https://github.com/netdata/netdata/pull/14908. Rest assured that we are currently in the process of re-testing and building a new image after the fix to ensure that the issue will be resolved. Please note that the stable release is not affected by this matter, and you may continue to use it without any concern. We appreciate your understanding and patience as we work to improve our services. Thank you for your continued support.

Sincerely,
The Netdata Team
Report: "Agent connectivity disruption"
Last update: As we see the number of connected agents going back to expected levels, and the number of agents running the previous nightly going down, we consider this incident resolved.
The new build (1.37.0-55) has completed for most platforms. Please follow the instructions at https://learn.netdata.cloud/docs/agent/packaging/installer/update if you are on the affected version (1.37.0-48) and want to upgrade your agents manually. If you have automatic updates configured, you can also wait for the update to be done during your night. We will be monitoring the progress of Agents as they reconnect.
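For convenience, manually updating usually amounts to running the updater script that ships with the Agent (a sketch; the linked instructions are authoritative, and the script's location varies by install type):

```sh
# Native or local builds typically install the updater here:
sudo /usr/libexec/netdata/netdata-updater.sh

# Static (kickstart) builds typically ship it under the /opt/netdata prefix instead:
sudo /opt/netdata/usr/libexec/netdata/netdata-updater.sh
```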
The new build (1.37.0-55) has been triggered and we will post an update when it is ready. We will include instructions on how to update manually, or you can wait until the auto-upgrade happens during your night. Note:
- If you are running a nightly build older than 1.37.0-48, you are not affected and no action is required.
- If you are running a stable build, you are not affected and no action is required. However, we do strongly recommend upgrading to 1.37.1 because of two security vulnerabilities in older versions.
We have identified the offending change in the Agent. Only the latest nightly build (1.37.0-48-nightly) of the Agent is affected. The problem only occurs if the Agent tries to reconnect after having lost its first connection to Cloud. This means that if you restart your agent, the problem is avoided until its connection to Cloud drops. We will issue a new nightly build that removes the offending change.
We are able to reproduce the issue and are attempting to pinpoint the cause.
We are seeing an increasing number of Agents that cannot (properly) connect to Cloud. We are investigating the cause, but initial indications are that it may be related to the latest nightly release of the Agent (version 1.37.0-48-nightly).
Report: "Delay in processing node availability changes"
Last update: This incident has been resolved.
The backlog has been consumed. We are monitoring the situation.
We are working through the backlog of availability updates and should be done in about 30 minutes.
We've identified an issue with delayed processing of node availability (online, stale, offline) changes. For a fraction of our users this means that these changes are not reflected properly in Netdata Cloud. As the availability affects what metrics are shown in Cloud, it may be that some metrics are not visible even though the node is supposed to be available.
Report: "Node status updates are delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are facing an issue where our backlog for node status updates is consumed more slowly than normal. Users might find that node status is slow to change from online to offline or the other way around.
Report: "Degradated performance on charts metadata"
Last update: This incident has been resolved.
Requests for retrieving and updating charts metadata are slower than they should be. Users might not see changes to their charts instantly replicated to the Cloud UI (e.g. after software installation or removal), while the Agent's local UI is updated instantly. We have already identified the problem and we are fixing it.
Report: "Delayed alarms"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Hi all, we have just noticed that for some of our clients there might be a bit of a delay between the Agent triggering an alert and the Cloud UI showing it. We are investigating the issue.
Report: "Routing problem on netdata cloud"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are facing a small issue with our internal routing between the apps. We have identified the problem and a solution is going to be deployed soon.
Report: "Sign-ups via email magic link are failing"
Last update: This incident has been resolved.
We've reverted and are monitoring the issue.
We have found the offending change that causes sign-ups via email magic link to fail, and are reverting to work on a proper fix.
We are currently investigating this issue.
Report: "Slow and failing Agent chart data responses"
Last update: Reverting the default away from MQTT5 removed the immediate issue, and most Agents on the nightlies are now on the latest (v1.35.0-104-nightly). In the meantime we've also found the true cause: the Agent was not properly processing incoming commands in the MQTT5 implementation, due to a bug in how the parser interacted with the buffer of incoming data. This has been resolved in the upcoming nightly build of the Agent. As we want to do some more testing, for now the Agent will keep using the older MQTT library by default.
For completeness, the affected versions are v1.35.0-84-nightly and v1.35.0-96-nightly. The latest, corrected version is v1.35.0-104-nightly.
The new nightly version of the Netdata Agent has been published and installed by a large portion of the agents that auto-update. We are monitoring the results.
We have identified part of the cause of the failing responses for alarm values. In yesterday's nightly build of the Agent, we enabled the use of the newer MQTT5 library by default. We will create another build to revert that. In the meantime, you can explicitly disable this library using the mqtt5 setting in your configuration as described here: https://github.com/netdata/cloud-backend/issues/178. Additionally, the other latencies appear to be another instance of a known issue that causes responses with a small payload to be delayed. We are working on resolving this issue.
Users with nightly versions of the Netdata Agent are experiencing slow responses between Cloud and Agent, resulting in failing or slow charts in their Cloud dashboards. We are investigating the issue.
Report: "Missing charts"
Last update: We have implemented a change that restores all charts. Unfortunately there remains a bug that, in certain situations, causes the top gauges to be missing from the single node tabs. We will fix this in the coming week. Updates on this, including a workaround (using the overview tab with node filtering), can be found here: https://github.com/netdata/netdata-cloud/issues/484#issuecomment-1166306503
The issue has been identified and a fix is being implemented.
We are investigating an issue that causes some charts to be missing from the single node view and overview tabs.
Report: "Charts metadata updates are delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Charts metadata is not refreshed instantly, so users might see old or outdated charts in their UI. Data points in displayed charts are up to date, since they are streamed directly from the node.
Report: "Newly connected / re-connected agents to cloud shown as offline"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Newly connected / re-connected agents to cloud shown as offline"
Last update: We are currently investigating this issue.
Report: "Newly connected / re-connected agents to cloud shown as offline"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Nodes missing from spaces"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've identified the cause of the problem and we're working on recovering the missing nodes.
We're currently investigating an issue where some nodes are missing from user spaces. We estimate that 10% of nodes are affected by this.
Report: "Slow charts and dimensions metadata updates"
Last update: We had a bigger than usual backlog of metadata updates. Some users might have had a problem with quick updates to their charts and dimensions. This is all solved now.
Report: "Netdata Cloud Alarms status updates are delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Degraded chart syncing performance"
Last update: This incident has been resolved.
We've applied the fix and are carefully monitoring the situation now.
We are working on a fix now.
Some users reported missing charts in the nodes tab and charts for older metrics that are not currently recording data. We found a synchronization issue that affects about 5% of nodes. No (meta)data is lost, and we are working on a solution.
We're investigating an issue with slow chart syncing.
Report: "Agent Cloud connectivity issue."
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
Our backbone infrastructure is currently unstable; it looks like it might be related to problems at our cloud vendor. We are investigating the issue right now.
Report: "Alerting might be delayed"
Last update: This incident has been resolved.
Everything has been stable for some time now, but we are still observing the changes made to the affected applications.
We are currently having some issues with the alerting component (Netdata Cloud). Some users might experience delayed alerts. We are working on solving the issue.
Report: "Cloud application degraded performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our Pulsar cluster is encountering stability problems. We're actively working on it.
We're investigating a degradation on our cloud application performance.
Report: "MQTT Broker problems"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We finished the restart and managed to fix the problems. All services are now stable.
It looks like we need to restart the MQTT brokers (in about 30 minutes) to apply some bug fixes. This will result in agents reconnecting to Netdata Cloud.
Our MQTT brokers are not performing up to the standard we would like. At this moment, end users are not experiencing any difficulties or degradation of service. The issue is being investigated; we will provide more details soon.
Report: "Agent -> Cloud connection issue"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We have an issue with our database. We are investigating.
Report: "Agent -> Cloud connection issue"
Last update: We discovered that the performance of one of our MQTT brokers was worse than expected, resulting in numerous agents not being able to connect to the cloud infrastructure. The problem has been fixed.
Report: "Agent - Cloud Connectivity Issue"
Last update: We are seeing the number of connected agents at the same level as before, and the processing backlog has been resolved. Closing.
Connectivity has been restored and agents are reconnecting.
We are currently investigating this issue.
Report: "Significant drop of traffic between agents and cloud"
Last update: This incident has been resolved and pending messages have been processed.
A fix has been applied and agent connections are recovering. We are monitoring to make sure everything is working as expected.
We've identified a problem in connections between Netdata agents and Cloud, and are working on a fix.
Report: "Broken agent connections"
Last update: Incident resolved.
We've applied a change that should improve reconnection speed.
The issue has been identified and a fix is being implemented.
We have identified an issue with our proxies that caused agents to be disconnected. Agents should reconnect automatically.
Report: "SSL certificate verification errors connecting to the cloud after Sep 30th"
Last update: User action required, please see https://community.netdata.cloud/t/certificate-verification-error-connecting-to-the-cloud/1790
Report: "We notice cloud application degrated performance"
Last update: Incident is resolved.
We observe nominal behavior on all micro-services. Pulsar is consuming messages at a normal rate. We continue to monitor all services.
We have rolled back all Pulsar replications. Services are stabilized. Residual effects (like late notifications) may still exist at this point. We continue to monitor the services.
Geo-replication on pulsar partially failed due to increased RAM requirements. The extra requirements forced Kubernetes to restart specific pods. As a result some messages have been transmitted out of order, and many notifications were transmitted with delay.
The Web UI is back online. Service is partially restored.
It seems that Netdata messages are not being properly consumed. The issue relates to a new Pulsar replication that was introduced today. We are proceeding with an immediate rollback.
We are currently investigating the issue.
Report: "Notification Center Service does not consume notification events."
Last update: All service indicators are nominal. Incident is considered resolved.
The fix is applied and we are monitoring the performance.
Staging tests on the new fix completed successfully. We are proceeding with deployment to production.
The issue has been identified in a new query introduced in the latest release. An immediate fix has been applied and is currently under testing.
We are proceeding to drop messages from the queue in order to reduce load on the DB and bring the system back to a normal state. If the problem is not resolved, we will proceed with a rollback of the latest updates.
We are currently investigating the issue. MongoDB is experiencing high load.
Report: "VerneMQ / Pulsar drops messages without processing them."
Last update: Since the service has been stable for the last 12 hours, we are declaring the case resolved.
The service is stable and all messages from agents are consumed properly. We will continue to monitor for any inconsistencies and close the incident in the coming hours.
Monitoring continues. We see no anomalies so far.
Service under monitoring.
A few Kubernetes pods experienced a race condition, hanging on a response from the Redis services. The issue has been resolved, and currently all messages are being properly handled. We will leave the additional resources in place, and monitor performance and stability during the following hours. An appropriate root cause analysis will follow for Redis. Further to that, additional monitoring metrics will be introduced in order to react to and rectify similar incidents in the future.
We continue to investigate the root cause. We have added more pods on Kubernetes in an effort to reduce the number of dropped messages, and we have significantly improved message consumption, but we still observe messages being lost. Further updates will follow once we identify what is causing the issue.
Issue still under investigation.
We are currently investigating the issue.
Report: "Persistent timeouts for some nodes"
Last update: We don't see the same pattern anymore. There are occasional delays, but they are unrelated to the persistent timeouts we were observing before.
A fix has been implemented and we are monitoring the results.
We just restarted a piece of our infrastructure that will cause all agents to reconnect to the cloud. It will take a few minutes until the app works again.
About 4% of requests for charts from the agents are timing out, due to an issue we are aware of. We are trying different approaches to resolve the situation for now and have identified what we need to do for a permanent fix.
Report: "Service unavailable"
Last update: Resolved at 16:00 UTC. All services are operational.
A fix has been implemented and we are monitoring the results.
We are aware of the root cause of the outage that started about 30 min ago and are working on returning to proper operation.