Historical record of incidents for balena.io
Report: "Elevated GIT/Application Builder Errors"
Last update: The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "balenaCloud infrastructure maintenance"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We will be undergoing scheduled maintenance during this time.
Report: "Elevated Device URLs/VPN Errors"
Last update: We were notified of an issue with memory consumption on some of the Cloudlink pods in our cluster on Sunday. We were able to redeploy the pods while we investigated the root cause. There should have been little or no interruption of service while the new pods came online and the old ones were slowly drained. We have since implemented one improvement to avoid memory consumption issues, and are looking into other possible root causes.
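As an illustration of the rolling redeploy described above, the sketch below shows generic kubectl commands for restarting a deployment while the old pods drain; the namespace and deployment names are hypothetical and not a description of balena's actual cluster.

```shell
# Hypothetical sketch only: the "balena" namespace and "cloudlink" deployment
# are placeholder names, not balena's real resources.

# Check per-pod memory usage (requires metrics-server in the cluster).
kubectl -n balena top pods -l app=cloudlink

# Trigger a rolling restart: replacement pods come online while the old ones
# are drained and terminated gracefully, so connections migrate gradually.
kubectl -n balena rollout restart deployment/cloudlink
kubectl -n balena rollout status deployment/cloudlink
```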
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.
Report: "Elevated API Errors"
Last update: A database schema update caused a deadlock that led to an elevated number of API errors. We have aborted the schema update that caused the deadlock. We will be applying the schema update in a non-blocking way to avoid affecting the performance of the services.
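A "non-blocking" schema change usually means avoiding long-held locks during DDL. The sketch below is a generic example and assumes a PostgreSQL-style database; the connection string, table, and index names are hypothetical and not taken from balena's schema.

```shell
# Hypothetical example of a non-blocking schema change on PostgreSQL.
# CREATE INDEX CONCURRENTLY builds the index without holding a write-blocking
# lock on the table (it cannot run inside an explicit transaction block).
psql "$DATABASE_URL" <<'SQL'
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_device_name ON device (device_name);
SQL
```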
Report: "Elevated API Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Degraded Performance on device state endpoints"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Unable to pin releases in the dashboard"
Last update: A UI update included updates to a few nested dependencies, which changed a behavior we relied on. As a result, the list of available releases in the Target Release section on the Fleet Summary page was not being populated. We initially reverted the changed package, and subsequently changed our approach so that it works with the latest version of the package.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue where users are unable to pin releases in the dashboard.
Report: "Degraded cloudlink connection performance"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Report: "Degraded cloudlink connection performance"
Last update: Cloudlink connections are all stable and all old connections have been drained and reestablished.
We cycled the cloudlink deployment and are monitoring the connections as they recover.
The internal connection metrics collector failed to report properly to connection monitoring.
We are currently investigating the cause of connection drops on our cloudlink services.
Extended connection drops over a period of time.
Report: "Elevated API Errors"
Last update: An internal observability feature led to an unreasonable base memory footprint for API instances under production load, leading to frequent evictions. For now, we've rolled back to a previous API version to restore stability while we investigate the root cause.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Elevated Dashboard Errors"
Last update: A bug was introduced in the UI after the recent upgrade to React Router 6: when a JWT was already present in localStorage, the user was redirected away from the email verification page before verification was initiated. Many of our pages use a base "route component" that handles return URLs and redirects authenticated users away from non-authenticated pages. With the old React Router, the page rendered before the JWT was loaded, so verification worked fine; after the upgrade, the JWT is loaded before the page renders, exposing a race condition that had gone unnoticed for years. Semantically, what we were doing was wrong, and probably the only thing we needed was the returnUrl handling. We've put together a workaround that no longer uses the base route component on the email verification page, and we may reconsider how we handle our routing based on what we learned.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Users may find that the link in the email verification does not work when signing up with an e-mail address. We are aware of the issue and are investigating. Users can try using incognito mode to open the link.
Report: "Elevated API Log Errors"
Last update: We deployed an upgrade to our log system that worked well in staging but hit issues with log streams in production, so we quickly rolled it back. We have identified the issue and why it only affected one environment, and are investigating an alternate solution going forward.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of log stream errors and are currently looking into the issue.
Report: "Elevated Device SSH Errors"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of device SSH errors and are currently looking into the issue.
Report: "Elevated Device URL Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of Device URL errors and are currently looking into the issue.
Report: "The OS versions list for upgrading HosOS throught the dashboard does not show any version for some devices."
Last update: The OS versions list under the settings menu may not show any versions for some devices, preventing users from upgrading their HostOS. Some users may also see a notice on their device summary page saying "OS downgrades are not allowed". We have reverted the dashboard to a previous version that has a working OS version list for upgrading HostOS. We are still investigating why the OS version list is not rendered in the latest dashboard version.
Report: "Elevated Dashboard Errors"
Last update: A configuration change to our API that enforced stricter policies for new tokens resulted in validation failures when the Dashboard was used to download balenaOS images. We quickly noticed the issue when our internal tests started failing, and we rolled back the API to the previous release. Once the changes have been retested against the current releases of the Dashboard, they will be redeployed.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors on our Dashboard and are currently looking into the issue.
Report: "Elevated Device URLs/VPN Errors"
Last update: We observed degraded Cloudlink (VPN) connections following several subsequent API release deployments. These were spread out over the course of a day, and took some time to settle without any manual intervention. This is generally referred to as a "thundering herd": thousands of devices attempt to connect to a new node at the same time and get rate limited. Upon investigation we found that, when we are running at peak usage, the load balancing policies in place for our TCP Cloudlink connections were not optimized to avoid proxying through nodes that were scaling up and down during deploys. Due to the nature of TCP, even though our Cloudlink instances were largely unmoved, the proxied TCP connections were being interrupted by the shuffle of other backend services. We have since implemented some changes to our load balancers to only route TCP Cloudlink traffic via nodes that have online and ready Cloudlink pods running. We are also in the early stages of enabling UDP connections for this endpoint and will announce more details in the future.
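One common Kubernetes mechanism for routing external traffic only to nodes with ready backend pods is `externalTrafficPolicy: Local` together with a load balancer health check against `healthCheckNodePort`. The sketch below is a generic way to inspect that setup; the service name and namespace are placeholders, not balena's actual configuration.

```shell
# Hypothetical inspection of a Service of type LoadBalancer. With
# externalTrafficPolicy=Local, the cloud load balancer health-checks
# healthCheckNodePort and only sends traffic to nodes hosting ready pods.
kubectl -n balena get svc cloudlink \
  -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}{.spec.healthCheckNodePort}{"\n"}'

# List which endpoints (and therefore nodes) currently back the service.
kubectl -n balena get endpointslices -l kubernetes.io/service-name=cloudlink -o wide
```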
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: A temporary issue due to a switchover between internally hosted certificate authorities.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "Elevated Delta Image Downloads Errors"
Last update: The delta image back-end was failing to connect to our internal worker nodes because it was using an outdated authentication certificate. The back-end and the worker nodes were to be configured to use a new certificate at the same time to avoid disruptions, but there was a delay in re-configuring the back-end. We are reviewing the process to avoid more disruptions like this in the future.
Report: "Elevated GIT/Application Builder Errors"
Last update: An error was introduced in our builder firewall rules that dropped all outbound traffic from the build context. During this time, users would see builds start but fail when a RUN instruction included a command that required internet access. We rolled the builders back to a previous release while we investigated the root cause of the bug. A fix is now being reviewed and tested to catch this edge case, which takes time to manifest and was missed in the original testing.
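For readers who want to reproduce this class of failure locally, the sketch below shows a generic way to test outbound access from a container (mirroring what a Dockerfile RUN step would hit) and to inspect the host's Docker forwarding rules; it is illustrative only and says nothing about balena's builder internals.

```shell
# Test egress from a throwaway container: if outbound traffic is dropped by
# host firewall rules, the DNS lookup and the HTTPS fetch below both fail.
docker run --rm alpine:3.19 sh -c \
  'nslookup registry-1.docker.io && wget -qO- https://deb.debian.org >/dev/null && echo "egress OK"'

# Inspect the host firewall chains Docker routes container traffic through.
sudo iptables -L DOCKER-USER -v -n
sudo iptables -L FORWARD -v -n | head -20
```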
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue. Some users may experience DNS resolution errors during builds using the balenaCloud builders.
Report: "Balena API reporting outdated device heartbeat status"
Last update: Some devices may show a status of "Reduced Functionality" on the dashboard because the API returned an outdated heartbeat status. We found that the error occurred after we switched to a new cache for the API backend. We applied a fix to the API to allow it to display the latest reported heartbeat status from the devices. We are still investigating how to prevent this error from occurring in the future.
Report: "Elevated Device SSH Errors"
Last update: This incident has been resolved.
There was a brief outage of remote SSH access to devices while we rotated some access keys in our cluster. Going forward we will adapt our process to avoid the downtime, or include it as part of planned maintenance windows.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of device SSH errors and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: A token used for caching pulls from DockerHub expired on our cloud builders. This resulted in pulls from DockerHub appearing as unauthenticated (even though the images were public). We quickly generated a new token and updated the builder workers to use it for the local registry caches.
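For context, a pull-through cache in front of Docker Hub is often run with the upstream `registry:2` image, authenticated with a Docker Hub access token; when that token expires, cached pulls fall back to anonymous rate limits. The sketch below is generic, with placeholder credentials and ports, and is not a description of balena's builder setup.

```shell
# Generic Docker Hub pull-through cache. DOCKERHUB_USER / DOCKERHUB_TOKEN are
# placeholders; rotate the token before it expires, or pulls silently degrade
# to unauthenticated rate limits.
docker run -d --name dockerhub-mirror -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  -e REGISTRY_PROXY_USERNAME="$DOCKERHUB_USER" \
  -e REGISTRY_PROXY_PASSWORD="$DOCKERHUB_TOKEN" \
  registry:2

# Then point the Docker daemon at the mirror via /etc/docker/daemon.json:
#   { "registry-mirrors": ["http://localhost:5000"] }
```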
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "Builder returns 403 error"
Last update: This incident has been resolved.
A temporary fix has been deployed.
The issue has been identified and we're working on a fix.
We're currently investigating the builder returning 403 errors
Report: "Elevated Application Registry Errors"
Last update: We experienced timeouts for deltas and builds when pushing images to our registry hosted in the US East (N. Virginia) region. This issue impacted our cloud builders in Finland and Germany, among other regions. The root cause was identified as a public routing issue between certain regions, affecting the ability of some of our systems to access the registry efficiently. We resolved the issue by enabling proxied routing protocols for our registry endpoint. This allowed us to bypass the impacted network paths and restore normal operations.

## Impact

* Cloud builders in Finland and Germany experienced delays in image pushing
* Potential delays in deployment pipelines for affected regions
* No data loss or security breaches occurred
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We're experiencing degraded performance around creating new releases on our builders and are currently looking into the issue. Users experiencing issues creating releases using the `balena push` command are advised to temporarily switch to local builds via `balena deploy`, if possible, until the incident is resolved.
Report: "Elevated builder failure for arm64"
Last update: This incident has been resolved.
All affected arm64 builders are back online. We're monitoring the situation.
Some builds sent to our arm64 builder infrastructure are failing. The problem is caused by a faulty garbage collection process. We expect everything to return to normal quickly.
Report: "Elevated balenaOS Download Errors"
Last update:

**Incident Summary:** A modification in our Continuous Integration (CI) system inadvertently led to the creation of non-annotated GitHub release tags. This triggered a disruption in the balenaOS finalized release pipeline, with duplicate releases containing no associated artifacts.

**Impact:** While balenaOS releases were marked as final and publicly available for download, the corresponding image artifacts were missing from our backend storage. This issue did not affect device hostApp updates; however, attempts to download OS release images via the latest tags resulted in failures.

**Resolution:** We invalidated the affected releases, resolved the tagging issue in our CI pipeline to ensure the generation of correct tags in the future, and implemented a fallback in our OS deployment pipeline to accommodate both types of git tags.
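For reference, the difference between annotated and lightweight git tags, and how a pipeline can tell them apart, looks like this in generic git (the tag names are examples, not balena's release tags):

```shell
# Annotated tags are full git objects with a tagger and message;
# lightweight tags are plain refs pointing directly at a commit.
git tag -a v1.2.3 -m "release v1.2.3"   # annotated
git tag v1.2.4                          # lightweight

# Detect the tag type: "tag" means annotated, "commit" means lightweight.
git cat-file -t v1.2.3    # -> tag
git cat-file -t v1.2.4    # -> commit

# Or list all tags with their object types in one pass.
git for-each-ref refs/tags --format='%(objecttype) %(refname:short)'
```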
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of balenaOS download errors and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: This was discovered because builders were crash-looping, indicated by error codes related to long strings from large buffer sizes, and was quickly linked to an accumulation of uncleaned images on the builder instances. A manual volume prune was executed to remove images and temporarily recover builder performance. A patch is awaiting review to lower the maximum storage consumption of the builder workers before automatic cleanup is performed, and projects have been proposed to patch the upstream libraries and/or move to ephemeral builder workers. We also have a project in the queue to add host metrics monitoring to the builder-worker nodes to catch issues like this earlier.
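For illustration, the kind of storage inspection and manual cleanup referred to above looks roughly like the following in generic Docker terms; the retention filters are arbitrary examples, not balena's builder policy.

```shell
# See how much space images, containers, volumes and build cache consume.
docker system df -v | head -40

# Reclaim space: unused images older than a week, unused volumes, and stale
# build cache beyond an arbitrary budget.
docker image prune -a --filter "until=168h" -f
docker volume prune -f
docker builder prune --keep-storage 50GB -f
```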
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "old balenaOS fail to connect to balena API and download delta as it lacks support for elliptic curve encryption in openssl library."
Last update: We've now switched to another certificate provider which offers certificates compatible with older devices. Our cache proxy is back online and performance is now restored for everyone. We're still working on a migration path for users running old balenaOS releases, in anticipation of future technology deprecations.
We found out that deltas were also affected, which prevents hostOS upgrades on older devices. We have temporarily disabled our proxy for deltas.
We have temporarily bypassed our cache proxy to restore connectivity to older devices while we identify a migration path for users running affected balenaOS releases from 2019 and older.
Balena utilises automated TLS certificate updates to improve the security and trustworthiness of our service landscape. This process has now updated to a certificate chain that enforces ECDSA (elliptic curve) instead of RSA encryption. Based on all known information, this affects balenaOS releases from around 2019 and older. We are currently investigating which balenaOS versions (with their specific openssl library versions) are affected. Moreover, we are assessing how this situation can be solved for such old balenaOS versions in a continuously changing security landscape.
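To check which key and signature algorithms an endpoint's certificate currently uses (and hence whether a very old openssl build could negotiate with it), a generic inspection looks like the sketch below; the hostname is only an example.

```shell
# Inspect the leaf certificate presented by an endpoint. openssl builds
# without elliptic-curve support cannot complete the handshake when the
# chain requires ECDSA keys/signatures.
HOST=api.balena-cloud.com
openssl s_client -connect "$HOST:443" -servername "$HOST" </dev/null 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -E 'Signature Algorithm|Public Key Algorithm'
```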
Report: "Builder partially degraded service"
Last update: This was discovered because builders were crash-looping, indicated by error codes related to long strings from large buffer sizes, and was quickly linked to an accumulation of uncleaned images on the builder instances. A manual `balena image prune` command was executed to remove images and temporarily recover builder performance. We recognized the need for more strategic, long-term solutions.
This incident has been resolved.
We are currently investigating this issue.
Report: "Device settings undefined property reference"
Last update: We identified an issue on the device settings page caused by re-fetching the selected device and receiving it as undefined during the first few renders, despite already having information about it. We addressed the issue by using the information we already have about the device when on the device settings page, instead of attempting to re-fetch it.
Balena Dashboard device settings show an `undefined` property reference. `TypeError: Cannot read properties of undefined (reading 'reduce')`
Report: "Dashboard terminals were not working"
Last update: We were using the wrong `styled` function: the one from `styled-components` instead of Material UI's.
Dashboard terminals were not working briefly as a result of an incorrect import in the code following a migration. The dashboard was promptly reverted back to a functional version. A fix has been merged and deployed.
Report: "Elevated API Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Delta generation timeout"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Delta generation timeout"
Last update: Timeouts occurred when generating on-demand deltas between releases, starting around 1pm UTC, with service resuming around 4pm.
Report: "Elevated Application Registry Errors"
Last update: A misconfiguration during a registry change resulted in elevated errors when pushing application images. The issue has since been resolved.
Report: "Elevated GIT/Application Builder Errors"
Last update: A version of our cloud builder API was deployed that changed some variable types and did not properly report the build ID back to the balena CLI when using the push command with release tags. We identified the issue and rolled back the release in production while we work on a fix.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: One of our x86 cloud builders was in a bad state and would hang forever when pulling previous releases. It has since been recovered and is accepting new builds again.
Report: "Elevated Dashboard Errors"
Last update: We pushed a change intended to migrate the Download Image Dialog from rendition to MUI, as we are in the process of migrating between UI libraries. This involved adding some analytics/tracking content to the Download Image Dialog, which added cookies; in conjunction with how we checked cookies, this inadvertently created an infinite loop when checking cookies. The quick solution while a fix was worked on was to pin the dashboard back to the previous version. A solution has been PRed and is aimed to be merged tomorrow, when the team has more time to monitor the situation.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors on our Dashboard and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: An expired credit card was tied to the DockerHub account we use to authenticate public image pulls and increase DockerHub rate-limits on our cloud builders. After several attempts to charge the card, the account was downgraded to a free plan and became subject to the rate-limits published by DockerHub. See [https://docs.docker.com/docker-hub/download-rate-limit/](https://docs.docker.com/docker-hub/download-rate-limit/). Once the issue was identified, we upgraded the account to a new paid plan and the increased rate-limits were automatically restored to our cloud builders.
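The Docker documentation linked above also describes how to check the pull rate limit currently applied to a client; roughly:

```shell
# Check the Docker Hub pull rate limit seen by an anonymous client
# (per Docker's documented method; requires curl and jq).
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -s --head -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
  | grep -i ratelimit

# For an authenticated check, request the token with: curl -su "<user>:<token>" ...
```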
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of rate-limit errors in our application builder infrastructure and are currently looking into the issue.
Report: "Elevated Delta Image Downloads Service Errors"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our delta image downloads infrastructure and are currently looking into the issue.
Report: "Elevated Delta Image Updates Service Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors with delta image update service and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: The image caches on our build worker nodes were running full; we have cleaned them up to restore functionality.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "Elevated API Errors"
Last update:

> **tl;dr** an operations error led to inadvertently targeting the live cluster instead of test, causing a full re-provisioning of all backend assets, including load balancers and DNS records

balenaCloud services are deployed into Kubernetes, using the [Flux](https://fluxcd.io/) GitOps framework, with configuration historically split between test and production in different Git repositories. There is quite a lot of configuration duplication between these repositories, leading to operational drift between environments. One of the improvements we are currently working on is to combine them into a single source of truth using the [Kustomize bases/overlays approach](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#bases-and-overlays).

In order to begin the refactoring of the configuration, we deployed a new Flux instance into the test cluster. Unfortunately, the operator ran the CLI with the assumption that the tooling respected the `KUBECTL_CONTEXT` environment variable, which at the time was pointing to the test cluster. The specific tooling did not respect the environment variable and instead used the default Kubernetes context set in `~/.kube/config`, which was pointing at the live production cluster. Once this new instance of Flux, pointing to an empty GitHub configuration repository, was installed into the production cluster, it deleted all of the existing production assets, including replica-sets, deployments and their corresponding services and ingresses (i.e. AWS ELB/NLB/ALBs).

The recovery effort involved four engineers and took just under an hour to restore the service. It involved manually applying the configuration back to the production cluster using Kubernetes CLI tooling, addressing some configuration inconsistencies which were blocking creation of some cloud assets, and recreating/repointing DNS records at the new load balancers. Lastly, Flux was re-configured/re-enabled against the production GitHub repositories.

To prevent these sorts of events in the future, we've made a note of updating the default context to point to the test cluster, as well as double and triple checking assumptions made when using less familiar tool chains, e.g.:

```shell
$ grep current-context ~/.kube/config
current-context: arn:aws:eks:us-east-1:1234567890:cluster/test
```
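A related habit, shown below only as a generic illustration of the mitigation described above, is to verify or pin the kubectl context explicitly rather than relying on tooling to honour an environment variable; the context name "test" is a placeholder.

```shell
# Show and switch the default context explicitly before running any tooling.
kubectl config current-context
kubectl config use-context test

# Or pass the context per command, so the default context never matters.
kubectl --context test get nodes
```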
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Web Terminal SSH Errors"
Last update: The service handling the web-socket connection for the web terminal silently failed. Restarting the service fixed the issue. While we investigate the root cause, we have added some checks to ensure the service is restarted quickly if the web-socket server is not responsive.
The web-terminal on the dashboard is now back to a fully working state.
Dashboard Web Terminal is now available again and we're monitoring the situation.
We're experiencing some disruption with web terminal SSH access on the dashboard. Note that SSH access using the CLI is still working (`$ balena ssh <device-uuid>`). We are currently looking into the issue.
Report: "Elevated Device URLs/Cloudlink Errors"
Last update: We've recently experienced some instabilities with our Cloudlink service, which impacts webterminal, ssh, tunnels, actions, and public-url. We host our Cloudlink servers, along with some other parts of the balena-cloud infrastructure, on a Kubernetes cluster. Kubernetes scales and reorganizes those servers depending on the load. When a Cloudlink server needs to be shut down, connections are drained; this is expected and in most cases transparent, as devices reconnect to another Cloudlink server automatically. What happened lately, and is problematic, is that our Kubernetes cluster moved Cloudlink servers onto hosts that were not able to sustain them at that moment, causing another move shortly after, before devices had time to reconnect. Fortunately, only a small number of devices are affected, as only a few servers are moved at any given time, but it is still a very big problem for the customers who own those devices. We have already deployed some remediation to reduce the time it takes to move one of those servers, and we're working on multiple different solutions to continue to both reduce the recovery time and reduce the occurrence of such issues.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our Device URLs and Cloudlink infrastructure and are currently looking into the issue.
Report: "Elevated Device URLs/VPN Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our Device URLs and Cloudfront/VPN infrastructure and are currently looking into the issue.
Report: "Elevated Device URLs/VPN Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.
Report: "VPN Degraded Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've identified an issue with a resource deficiency causing degraded cloudlink performance and connection failures. We're working on scaling up available resources to meet demand.
We are currently investigating this issue.
Report: "Dashboard could not establish terminal connections to devices"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have restarted the service and are continuing the investigation to find the cause.
The dashboard could not establish terminal connections to the devices. We are currently looking into the issue. Users can still use the balena CLI's ssh command to connect to their devices.
Report: "Dashboard could not establish terminal connections to devices"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The web terminal backend terminated prematurely. We have restarted the service and are continuing the investigation to find the cause.
The dashboard could not establish terminal connections to the devices. We are currently looking into the issue. Users can still use the balena CLI's ssh command to connect to their devices.
Report: "Elevated API Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We're experiencing an elevated level of API errors for device registration and are currently looking into the issue.