Netdata

Netdata is currently Operational

Last checked from Netdata's official status page

Historical record of incidents for Netdata

Report: "MQTT broker failure"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We had a major problem with our MQTT broker. It is currently up and running and agents are reconnecting to the cloud.

Report: "Nightly static builds overwrite node and metric data upon install"

Last update
resolved

Affected Agents can cause the creation of multiple duplicate nodes in Netdata Cloud. All but the last one will appear as offline, and the last one will look as if it had been created from scratch, with no data. Unfortunately, the previously stored metrics for the affected nodes cannot be recovered. The duplicate offline nodes can be safely deleted from Space Settings -> Nodes. Note that you may have to add the newest copy of these nodes to the appropriate rooms. The fixed nightly static build is v2.3.0-102.

investigating

We have found that a recent change in the nightly static builds of Netdata Agent causes metadata on the Agent to be overwritten. Specifically, the sqlite3 database that records which timeseries stored in dbengine correspond to which metrics, and the information on the Agent's "machine GUID", are overwritten with the copy included in the build package.

Not affected are:
- All stable releases
- Native packages (.deb and .rpm)

Affected are all nightly static builds with the following version numbers:
- 2.3.0-50-nightly
- 2.3.0-60-nightly
- 2.3.0-72-nightly
- 2.3.0-78-nightly
- 2.3.0-87-nightly

The initial impact is that all affected Agent installs, even though they still have the timeseries data stored on disk, have lost all metadata associated with it, so these timeseries become inaccessible. This is unrecoverable. Additionally, the main form of identification is overwritten, too. We are assessing what the impact is for users of Netdata Cloud, and will update this incident with more information when the investigation is completed. The bug itself has been fixed and merged. We will issue a new nightly build shortly.
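As a minimal sketch, the affected version numbers listed in this update can be checked against a version string in a few lines of Python. The function names are hypothetical, and the accepted input formats (an optional `netdata ` prefix and an optional leading `v`) are assumptions about how the version is reported. The check also cannot tell a static build from a native package, so a match only means the install deserves a closer look.

```python
import re

# Affected nightly static builds, copied from the incident update.
AFFECTED_BUILDS = {
    "2.3.0-50-nightly",
    "2.3.0-60-nightly",
    "2.3.0-72-nightly",
    "2.3.0-78-nightly",
    "2.3.0-87-nightly",
}


def normalize_version(raw: str) -> str:
    """Strip whitespace, an optional 'netdata ' prefix, and an optional leading 'v'.

    Assumption: the input looks like 'v2.3.0-50-nightly' or
    'netdata v2.3.0-50-nightly'. Adjust if your Agent reports it differently.
    """
    version = raw.strip()
    version = re.sub(r"^netdata\s+", "", version)
    return version[1:] if version.startswith("v") else version


def is_affected_build(raw: str) -> bool:
    """Return True if the version matches one of the listed nightly builds."""
    return normalize_version(raw) in AFFECTED_BUILDS
```

For example, `is_affected_build("v2.3.0-72-nightly")` is true, while a stable version string such as `v2.3.0` is not matched.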

Report: "Latest Agent nightly build (v2.2.0-245) broken at first start"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

The builds are completed, so we are watching out for any remaining related issues.

identified

We have identified the issue, committed a fix, and initiated new nightly builds for all platforms. This will take several hours. In the meantime, please restart Netdata to work around the issue.

investigating

We are investigating an issue with today's nightly (v2.2.0-245) that causes alerting ("health") to not work and external plugins, including go.d, to not connect properly. This may be resolved by restarting the Agent. Stable versions of the Agent are not affected.

Report: "Alarm Processing Delays"

Last update
resolved

Our alarm processing infrastructure was running behind, which caused inaccurate alarms for some nodes. No data has been lost and the systems should already be up to date.

Report: "Alerting is working slower"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

Due to the release of Netdata Agent 2.0, we have a sizable backlog of alarms. We are investigating this issue.

Report: "Delays in alarms on the Netdata Cloud"

Last update
resolved

This incident has been resolved.

identified

Currently, we are waiting for the fix to take effect, and some users might experience delays in all cloud operations.

identified

The issue has been identified and a fix is being implemented.

investigating

We were alerted to a delay in alarms for some users and are investigating the matter.

Report: "Recent nightly static and local builds of Netdata Agent overwrite netdata.conf with defaults"

Last update
postmortem

Prior to [netdata/netdata#17475](https://github.com/netdata/netdata/pull/17475), the `netdata.conf` and `netdata-updater.conf` files were handled by the installer code outside of the build system. With the shift to using the build system to produce packages, handling for them needed to be moved into the build system. However, insufficient testing was performed to confirm that this would not break other installation types, and the change was not properly made conditional on packages being built.

As a result, the static and local builds with version `v1.45.0-315-nightly` will overwrite these configuration files with their default templates. This causes all local changes to those files to be lost. In particular, if the Agent configuration had been changed for longer retention, the overwritten configuration will have undone those settings, causing any metrics data **beyond the _default_ retention to be lost** on the first run of this version.

We have pulled the affected build artifacts to prevent our installer from using them. While [the fix](https://github.com/netdata/netdata/pull/17572) ensures the issue won't occur in future versions, starting with version `v1.45.0-326-nightly`, it is important to note that affected installations **will not automatically recover** their previous configurations. If you were using a non-default `netdata.conf` and/or `netdata-updater.conf` and experienced this bug, you will need to **manually reconfigure** your Netdata install.

As we aim to carefully develop Netdata for many platforms and hardware architectures, we release nightly builds of the Netdata Agent to catch any issues our changes may have caused, beyond our own internal testing. Unfortunately, we make mistakes that we did not catch in our testing, with data loss as an extreme possible outcome. Therefore we strongly recommend using our **stable releases for production systems**. You can review the [difference between nightly and stable builds](https://learn.netdata.cloud/docs/netdata-agent/installation#nightly-vs-stable-releases), and our recommended [best practices](https://www.netdata.cloud/blog/netdata-best-practices/).

If you have been affected by this issue and/or have any questions, please let us know.

resolved

The build artifacts for the new nightly release (1.45.0-326) are now available, and we consider the incident resolved. Should you experience any issues, please let us know!

identified

Update regarding potential data loss. This will happen if the configuration had been changed to increase metric retention (with respect to the defaults). Unfortunately, any stored data beyond the default metric retention will be lost on running installs of the affected builds. The only way to prevent this is by not using (or having used) version v1.45.0-315-nightly. We have made sure that the corresponding artifacts are no longer accessible by the installer.

identified

The affected build number is v1.45.0-315-nightly, and local builds starting with commit https://github.com/netdata/netdata/commit/5973417027606bacf044b3ead40a882931ce773f (April 30, 11:45 UTC) up until commit https://github.com/netdata/netdata/commit/0f2a261839d5ffc42f17383b4292673aa93d6a1f (May 1, 15:13 UTC).

identified

We've identified an issue with static and local builds of the Netdata Agent that causes its main configuration in `/etc/netdata/netdata.conf` or `/opt/netdata/etc/netdata/netdata.conf` to be overwritten with the default. The `netdata-updater.conf` file is similarly affected. Depending on which of your configuration settings differ from the defaults, this may result in data loss. We will update this incident with more detailed information on the impact as soon as possible. Docker image or native package builds, as well as stable builds, are not affected. We have created a fix (https://github.com/netdata/netdata/pull/17572) and have triggered a new nightly build. As soon as the new builds are available, we will also update this incident.
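Since the affected files are overwritten in place, keeping a copy before updating a nightly build is one way to make manual recovery easier. The following is a rough sketch under stated assumptions, not an official Netdata tool: the helper name `backup_configs` and the backup layout are hypothetical, and only the two `netdata.conf` paths named in this update are listed, because the location of `netdata-updater.conf` is not given here.

```python
import shutil
import time
from pathlib import Path

# Paths named in the incident update; usually only one of the two layouts exists.
CANDIDATE_CONFIGS = [
    Path("/etc/netdata/netdata.conf"),
    Path("/opt/netdata/etc/netdata/netdata.conf"),
]


def backup_configs(paths, backup_root):
    """Copy each existing file in `paths` into a timestamped folder under `backup_root`.

    Returns the list of backup copies that were written.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target_dir = Path(backup_root) / stamp
    written = []
    for index, src in enumerate(Path(p) for p in paths):
        if not src.is_file():
            continue  # skip layouts that are not present on this system
        target_dir.mkdir(parents=True, exist_ok=True)
        # Prefix with the index so same-named files from different layouts do not collide.
        dest = target_dir / f"{index}-{src.name}"
        shutil.copy2(src, dest)
        written.append(dest)
    return written
```

A call such as `backup_configs(CANDIDATE_CONFIGS, "/var/backups/netdata")` would copy whichever of the two files exists; the backup directory in that call is only an example.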

Report: "The "kickstart" way of installation is broken"

Last update
resolved

This incident has been resolved.

monitoring

The installation method described in https://learn.netdata.cloud/docs/installing/one-line-installer-for-all-linux-systems is broken. We are currently investigating the issue.

Report: "Delay with sending cloud alarms"

Last update
resolved

This incident has been resolved.

identified

We have noticed that there could be some delays with sending alarms. The situation should return to normal in about one hour.

Report: "Cloud connectivity issue"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We have noticed in our metrics that some agents are experiencing intermittent disconnections from the cloud. We are currently investigating the issue.

Report: "Problem with Alert Configuration in the Cloud Web UI"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We've implemented a fix and are monitoring for left-over issues.

identified

We have applied a fix that allows for seeing all configured alerts again, but the detail page for individual alerts may still not render. We are working on addressing that latter issue.

identified

We've identified that we are getting some malformed data from an API after a recent update and are working on a fix.

investigating

We are currently investigating the issue.

Report: "Processing Events feed problem"

Last update
resolved

This incident has been resolved.

identified

We detected problems with processing the Events feed and are working on a solution.

Report: "Possible login problem - Netdata Cloud"

Last update
resolved

We have successfully addressed the issue from our end, eliminating the necessity of cookie removal.

investigating

We have detected a potential problem with the cookies that are necessary for logging into the Netdata Cloud platform. The easy and quick solution is to clear cookies for app.netdata.cloud. We are currently looking for a permanent way to resolve this issue on our end.

Report: "Startup issue in latest Agent nightly (1.40.0-6-nightly)"

Last update
resolved

All packages have been published. If your nodes are still on 1.40.0-6, please refer to the instructions to upgrade: https://learn.netdata.cloud/docs/maintaining/update-netdata-agents#updates-for-most-systems. We are now closing this incident, but please let us know if things are still not working on your nodes.

monitoring

The source tarballs with the fix for native builds are now available. Packages for ARM systems are still building but should be fully published and available by 17:00 UTC at the latest.

monitoring

The native packages for x86-based distributions have been published. The ARM ones are still building and should follow shortly, as well as the static builds. We're watching Netdata Cloud and the various social networking tools to track the outcome of the new builds.

identified

The fix has been merged, and we've kicked off the build process for the packages. We will provide another update when the packages for the affected systems have been pushed.

identified

We have created a fix for this issue, which is a combination of making systemd not change the ownership and permissions of the directories the Agent uses, and the Agent properly changing permissions recursively to recover from the effects of the bad version. As soon as we've tested the fix, and the packages have been built, we will trigger an explicit push to the nightlies repos.

identified

While we are working on a fix, which requires a new package to be built, we have developed a workaround. It requires downgrading the Agent to 1.40.0-2-nightly and fixing the permissions. For Debian-based systems, this script should work when run as root: https://gist.github.com/ralphm/1326498c474aaacf0a12f9e569dac863

identified

Agents running the most recent nightly (1.40.0-6-nightly) fail to start on some platforms because of a permissions issue. We believe the culprit is this change: https://github.com/netdata/netdata/pull/14890, and are working on a fix. As this happens early on in the Agent, this affects Cloud and non-Cloud users alike.

investigating

We are currently investigating an issue with agent connectivity to the cloud.

Report: "Agent connectivity problem."

Last update
resolved

Connected client metrics are returning to normal values, and the new Netdata Agent works as expected.

monitoring

We had to ban the 1.39.0-97 agent version from connecting to the cloud. The exact affected agent versions are: 1.39.0-97-nightly and 1.39.0-97-{hash}. This incident will be closed when a new Netdata release is available for download. Please update your endpoints then, or wait for an automatic update to take place tomorrow.

identified

We found that the issue is caused by the latest nightly version of the agent. We are releasing the fix.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating the issue.

Report: "Node status updates are delayed"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We had a problem with updating node status in the Cloud UI. In practice, there was a delay between a node changing status (for example, from online to offline) and Cloud registering the change. This also affected adding new agents and deleting existing ones. Because of the delay in connection updates, chart metadata updates in our database are also delayed, and we are dealing with that right now. From the user's perspective, this means that charts for newly installed applications, or charts in general for new nodes, are not yet showing up in the Cloud UI.

Report: "Problem with reconnecting agents"

Last update
resolved

The incident has been resolved. The problem was triggered by an automatic configuration reload in the load balancer. During that time there was a delay in processing the alarms.

investigating

Netdata agents are being forced to reconnect. We are investigating the root cause.

Report: "Netdata Agents [NIGHTLY] with ML turned off might crash."

Last update
resolved

Dear valued users, We would like to inform you of a recent development regarding our latest nightly image of Netdata Agent. It has come to our attention that some machines may experience an issue when the ML setting is turned off. We apologize for any inconvenience this may have caused. However, we are pleased to announce that our team has already taken action to address this matter. A revert has been successfully merged through this pull request: https://github.com/netdata/netdata/pull/14908. Rest assured that we are currently in the process of re-testing and building a new image after the fix to ensure that the issue will be resolved. Please note that the stable release is not affected by this matter, and you may continue to use it without any concern. We appreciate your understanding and patience as we work to improve our services. Thank you for your continued support. Sincerely, The Netdata Team

Report: "Agent connectivity disruption"

Last update
resolved

As we see the number of connected agents go back to expected levels, and the number of agents running the previous nightly going down, we consider this incident resolved.

monitoring

The new build (1.37.0-55) has completed for most platforms. Please follow the instructions at https://learn.netdata.cloud/docs/agent/packaging/installer/update if you are on the affected version (1.37.0-48) and want to upgrade your agents manually. If you have automatic updates configured, you can also wait for the update to be done during your night. We will be monitoring the progress of Agents as they reconnect.

identified

The new build (1.37.0-55) has been triggered and we will post an update when it is ready. We will include instructions on how to update manually, or you can wait until the auto-upgrade happens during your night. Note: * If you are running a nightly build older than 1.37.0-48, you are not affected and no action is required. * If you are running a stable build, you are not affected and no action is required. However, we do strongly recommend upgrading to 1.37.1 because of two security vulnerabilities in older versions.

identified

We have identified the offending change in the Agent. Only the latest nightly build (1.37.0-48-nightly) of the Agent is affected. The problem only occurs if the Agent tries to reconnect after having lost its first connection to Cloud. This means that if you restart your agent, the problem is avoided until its connection to Cloud drops. We will issue a new nightly build that removes the offending change.

investigating

We are able to reproduce the issue and are attempting to pinpoint the cause.

investigating

We are seeing an increasing number of Agents that cannot (properly) connect to Cloud. We are investigating the cause, but initial indications are that it may be related to the latest nightly release of the Agent (version 1.37.0-48-nightly).

Report: "Delay in processing node availability changes"

Last update
resolved

This incident has been resolved.

monitoring

The backlog has been consumed. We are monitoring the situation.

identified

We are working through the backlog of availability updates and should be done in about 30 minutes.

identified

We've identified an issue with delayed processing of node availability (online, stale, offline) changes. For a fraction of our users this means that these changes are not reflected properly in Netdata Cloud. As the availability affects what metrics are shown in Cloud, it may be that some metrics are not visible even though the node is supposed to be available.

Report: "Node status updates are delayed"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are facing an issue where our backlog for node status updates is consumed more slowly than normal. Users might find that node status is slow to change from online to offline or the other way around.

Report: "Degraded performance on charts metadata"

Last update
resolved

This incident has been resolved.

identified

Requests for retrieving and updating charts metadata are slower than they should be. Users might not see changes in their charts (e.g. software installation or removal) instantly replicated to the Cloud UI, while the Agent's local UI will be instantly updated. We have already identified the problem and are fixing it.

Report: "Delayed alarms"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

Hi all, we just noticed that for some of our clients there might be a bit of a delay between the Agent triggering an alert and the Cloud UI showing it. We are investigating the issue.

Report: "Routing problem on netdata cloud"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are facing a small issue with our internal routing between the apps. We have identified the problem, and a solution is going to be deployed soon.

Report: "Sign-ups via email magic link are failing"

Last update
resolved

This incident has been resolved.

monitoring

We've reverted and are monitoring the issue.

identified

We have found the offending change that causes sign-ups via email magic link to fail, and are reverting to work on a proper fix.

investigating

We are currently investigating this issue.

Report: "Slow and failing Agent chart data responses"

Last update
resolved

Reverting the default away from MQTT5 removed the immediate issue, and most Agents on the nightlies are now on the latest (v1.35.0-104-nightly). In the meantime, we've also found the true cause: the Agent was not properly processing incoming commands in the MQTT5 implementation, due to a bug in how the parser interacted with the buffer of incoming data. This has been resolved in the upcoming nightly build of the Agent. As we want to do some more testing, for now the Agent will keep using the older MQTT library by default.
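To illustrate the general class of problem, where a parser has to cope with packets that arrive split across several reads, here is a minimal Python sketch of incremental MQTT packet framing. It is not the Agent's code and does not reproduce the actual bug. The class name `PacketBuffer` is hypothetical, and the framing follows the MQTT fixed header layout: one control byte, then a Remaining Length encoded as a variable byte integer of one to four bytes.

```python
class PacketBuffer:
    """Accumulates received bytes and returns complete MQTT control packets."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes) -> list:
        """Append newly received bytes and return every complete packet now available."""
        self._buf.extend(data)
        packets = []
        while True:
            packet_len = self._complete_packet_length()
            if packet_len is None:
                break  # header or body still incomplete; keep the bytes for later
            packets.append(bytes(self._buf[:packet_len]))
            del self._buf[:packet_len]
        return packets

    def _complete_packet_length(self):
        """Return the total length of the first packet if fully buffered, else None."""
        if len(self._buf) < 2:
            return None
        remaining = 0
        multiplier = 1
        # The Remaining Length field occupies 1 to 4 bytes after the control byte.
        for offset in range(1, 5):
            if offset >= len(self._buf):
                return None  # the length field itself is split across reads
            byte = self._buf[offset]
            remaining += (byte & 0x7F) * multiplier
            if (byte & 0x80) == 0:
                total = 1 + offset + remaining
                return total if len(self._buf) >= total else None
            multiplier *= 128
        raise ValueError("malformed Remaining Length field")
```

The key design point is that unconsumed bytes stay in the buffer between calls to `feed`, so a packet split at any position, including inside the length field, is only returned once it is complete.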

monitoring

For completeness, the affected versions are v1.35.0-84-nightly and v1.35.0-96-nightly. Latest, corrected version is v1.35.0-104-nightly.

monitoring

The new nightly version of the Netdata Agent has been published and installed by a large portion of the agents that auto-update. We are monitoring the results.

identified

We have identified part of the cause of failing responses for alarm values. In yesterday's nightly build of the Agent, we enabled the use of the newer MQTT5 library by default. We will create another build to revert that. In the meantime, you can explicitly disable this library using the mqtt5 setting in your configuration as described here: https://github.com/netdata/cloud-backend/issues/178. Additionally, the other latencies appear to be another instance of a known issue that causes responses with a small payload to be delayed. We are working on resolving this issue.

investigating

Users with nightly versions of the Netdata Agent are experiencing slow responses between Cloud and Agent, resulting in failing or slow charts in their Cloud dashboards. We are investigating the issue.

Report: "Missing charts"

Last update
resolved

We have implemented a change that restores all charts. Unfortunately, there remains a bug that causes the top gauges to be missing from the single node tabs in certain situations. We will fix this in the coming week. Updates on this, including a workaround that uses the overview tab with node filtering, can be found here: https://github.com/netdata/netdata-cloud/issues/484#issuecomment-1166306503

identified

The issue has been identified and a fix is being implemented.

investigating

We are investigating an issue that causes some charts to be missing from the single node view and overview tabs.

Report: "Charts metadata updates are delayed"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

investigating

Charts metadata is not refreshed instantly, so users might see old or outdated charts in their UI. Datapoints in displayed charts are up to date, since they are streamed directly from the node.

Report: "Newly connected / re-connected agents to cloud shown as offline"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "Newly connected / re-connected agents to cloud shown as offline"

Last update
resolved

We are currently investigating this issue.

Report: "Newly connected / re-connected agents to cloud shown as offline"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Nodes missing from spaces"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We've identified the cause of the problem and we're working on recovering the missing nodes.

investigating

We're currently investigating an issue with some nodes missing from user spaces. We estimate that 10% of the nodes are affected by this.

Report: "Slow charts and dimensions metadata updates"

Last update
resolved

We had a bigger than usual backlog of metadata updates. Some users might have had problems with quick updates to their charts and dimensions. This is all solved now.

Report: "Netdata Cloud Alarms status updates are delayed"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Degraded chart syncing performance"

Last update
resolved

This incident has been resolved.

monitoring

We've applied the fix. We are now carefully monitoring the situation.

identified

We are working on a fix now.

investigating

Some users reported missing charts in the nodes tab and charts for older metrics that are not currently recording data. We found a synchronization issue that affects about 5% of nodes. No (meta)data is lost, and we are working on a solution.

investigating

We're investigating an issue with slow chart syncing.

Report: "Agent Cloud connectivity issue."

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

Our backbone infrastructure is currently unstable; it looks like this might be related to problems at our cloud vendor. We are investigating the issue right now.

Report: "Alerting might be delayed"

Last update
resolved

This incident has been resolved.

monitoring

Everything has been stable for some time now, but we are still observing the changes made to the affected applications.

investigating

We are currently having some issues with the alerting component (Netdata Cloud). Some users might experience delayed alerts. We are working on solving the issue.

Report: "Cloud application degraded performance"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Our Pulsar cluster is encountering stability problems. We're actively working on it.

investigating

We're investigating a degradation in our cloud application performance.

Report: "MQTT Broker problems"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We finished the restart and managed to fix the problems. All services are now stable.

investigating

Looks like we need to perform a restart of MQTT Brokers (in about 30 minutes) due to some bug fixes. It will result in agent reconnections to Netdata Cloud.

investigating

Our MQTT Brokers are not performing up to the standard we would like. As of this moment, end users are not experiencing any difficulties or degradation of services. The issue is being investigated, and we will provide more details soon.

Report: "Agent -> Cloud connection issue"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We have an issue on our Database. We are investigating.

Report: "Agent -> Cloud connection issue"

Last update
resolved

We discovered that the performance of one of our MQTT brokers was worse than expected, resulting in numerous agents not being able to connect to the cloud infrastructure. The problem has been fixed.

Report: "Agent - Cloud Connectivity Issue"

Last update
resolved

We are seeing the number of connected agents at the same level as before, and the processing backlog resolved. Closing.

monitoring

Connectivity has been restored and agents are reconnecting.

investigating

We are currently investigating this issue.

Report: "Significant drop of traffic between agents and cloud"

Last update
resolved

This incident has been resolved and pending messages have been processed.

monitoring

A fix has been applied and agent connections are recovering. We are monitoring to make sure everything is working as expected.

identified

We've identified a problem in connections between Netdata agents and Cloud, and are working on a fix.

Report: "Broken agent connections"

Last update
resolved

Incident resolved

monitoring

We've applied a change that should improve reconnection speed.

identified

The issue has been identified and a fix is being implemented.

investigating

We have identified an issue with our proxies that caused agents to be disconnected. Agents should reconnect automatically.

Report: "SSL certificate verification errors connecting to the cloud after Sep 30th"

Last update
resolved

User action required, please see https://community.netdata.cloud/t/certificate-verification-error-connecting-to-the-cloud/1790

Report: "We notice cloud application degraded performance"

Last update
resolved

Incident is resolved.

monitoring

We observe nominal behavior on all micro-services. Pulsar is consuming messages at a normal rate. We continue to monitor all services.

monitoring

We have rolled back all Pulsar replications. Services are stabilized. Residual effects (like late notifications) may still exist at this point. We continue to monitor the services.

identified

Geo-replication on Pulsar partially failed due to increased RAM requirements. The extra requirements forced Kubernetes to restart specific pods. As a result, some messages have been transmitted out of order, and many notifications were transmitted with delay.

investigating

Web UI is back online. Service is partially restored.

investigating

It seems that Netdata messages are not properly consumed. The issue relates to a new Pulsar replication that was introduced today. We are proceeding with an immediate rollback.

investigating

We are currently investigating the issue.

Report: "Notification Center Service does not consume notification events."

Last update
resolved

All service indicators are nominal. Incident is considered resolved.

monitoring

The fix is applied and we are monitoring the performance.

identified

Staging tests on the new fix completed successfully. We proceed with deployment to production.

identified

The issue has been traced to a new query introduced in the latest release. An immediate fix has been applied and is currently under testing.

investigating

We are proceeding to drop messages on the queue in order to reduce load on the DB and bring the system to a normal state. If the problem is not resolved, we are going to proceed with a rollback of the latest updates.

investigating

We are currently investigating the issue. MongoDB experiences high load.

Report: "VerneMQ / Pulsar drops messages without processing them."

Last update
resolved

Since services have been stable for the last 12 hours, we are declaring the case resolved.

monitoring

The service is stable and all messages from agents are consumed properly. We will continue to monitor for any inconsistencies and close the incident in the coming hours.

monitoring

Monitoring continues. We see no anomalies so far.

monitoring

Service under monitoring.

monitoring

A few Kubernetes pods experienced a race condition, hanging on a response from Redis services. The issue has been resolved, and currently all messages are properly handled. We will leave the additional resources in place, and monitor performance and stability during the following hours. Appropriate root cause analysis of Redis will follow. In addition, further monitoring metrics will be introduced so that we can react to and rectify similar incidents in the future.

investigating

We continue to investigate the root cause. We have added more pods on Kubernetes in an effort to reduce the number of dropped messages, and we have significantly improved message consumption, but we still observe lost messages. Further updates will follow once we identify what is causing the issue.

investigating

Issue still under investigation.

investigating

We are currently investigating the issue.

Report: "Persistent timeouts for some nodes"

Last update
resolved

We don't see the same pattern any more. There are occasional delays, but ones that are unrelated to the persistent timeouts we were observing before.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We just restarted a piece of our infrastructure that will cause all agents to reconnect to the cloud. It will take a few minutes until the app works again.

identified

About 4% of requests for charts from the agents are timing out, due to an issue we are aware of. We are trying different approaches to resolve the situation for now and have identified what we need to do for a permanent fix.

Report: "Service unavailable"

Last update
resolved

Resolved at 16:00 UTC. All services are operational.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are aware of the root cause of the outage that started about 30 min ago and are working on returning to proper operation.