Historical record of incidents for balena.io
Report: "Elevated GIT/Application Builder Errors"
Last update: The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "balenaCloud infrastructure maintenance"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We will be undergoing scheduled maintenance during this time.
Report: "Elevated Device URLs/VPN Errors"
Last update: We were notified of an issue with memory consumption on some of the Cloudlink pods in our cluster on Sunday. We were able to redeploy the pods while we investigated the root cause. There should have been little or no interruption of service while the new pods came online and the old ones were slowly drained. We have since implemented one improvement to avoid memory consumption issues, and are looking into other possible root causes.
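As an illustration of the rolling redeploy described above, the sketch below shows generic kubectl commands for restarting a deployment while the old pods drain; the namespace and deployment names are hypothetical and not a description of balena's actual cluster.

```shell
# Hypothetical sketch only: the "balena" namespace and "cloudlink" deployment
# are placeholder names, not balena's real resources.

# Check per-pod memory usage (requires metrics-server in the cluster).
kubectl -n balena top pods -l app=cloudlink

# Trigger a rolling restart: replacement pods come online while the old ones
# are drained and terminated gracefully, so connections migrate gradually.
kubectl -n balena rollout restart deployment/cloudlink
kubectl -n balena rollout status deployment/cloudlink
```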
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.
Report: "Elevated API Errors"
Last update: A database schema update caused a deadlock that led to an elevated number of API errors. We have aborted the schema update that caused the deadlock. We will be applying the schema update in a non-blocking way to avoid affecting the performance of the services.
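A "non-blocking" schema change usually means avoiding long-held locks during DDL. The sketch below is a generic example and assumes a PostgreSQL-style database; the connection string, table, and index names are hypothetical and not taken from balena's schema.

```shell
# Hypothetical example of a non-blocking schema change on PostgreSQL.
# CREATE INDEX CONCURRENTLY builds the index without holding a write-blocking
# lock on the table (it cannot run inside an explicit transaction block).
psql "$DATABASE_URL" <<'SQL'
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_device_name ON device (device_name);
SQL
```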
Report: "Elevated API Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Degraded Performance on device state endpoints"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Unable to pin releases in the dashboard"
Last update: A UI update included updates to a few nested dependencies, which changed a behavior we relied on. As a result, the list of available releases in the Target Release section on the Fleet Summary page was not being populated. We initially reverted the changed package, and subsequently changed our approach so that it works with the latest version of the package.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue where users are unable to pin releases in the dashboard.
Report: "Degraded cloudlink connection performance"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Report: "Degraded cloudlink connection performance"
Last update: Cloudlink connections are all stable and all old connections have been drained and reestablished.
We cycled the cloudlink deployment and are monitoring the connections as they recover.
The internal connection metrics collector failed to report properly to connection monitoring.
We are currently investigating the cause of connection drops on our cloudlink services.
Extended connection drops over a period of time.
Report: "Elevated API Errors"
Last update: An internal observability feature led to an unreasonable base memory footprint for API instances under production load, leading to frequent evictions. For now, we've rolled back to a previous API version to restore stability while we investigate the root cause.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Elevated Dashboard Errors"
Last update: A bug was introduced in the UI after the recent upgrade to React Router 6: when a JWT was already present in localStorage, the user was redirected away from the email verification page before verification was initiated. Many of our pages use a base "route component" that handles return URLs and redirects authenticated users away from non-authenticated pages. With the old React Router, the page rendered before the JWT was loaded, so verification worked fine; after the upgrade, the JWT is loaded before the page renders, exposing a race condition that had gone unnoticed for years. Semantically, what we were doing was wrong, and probably the only thing we needed was the returnUrl handling. We've put together a workaround that no longer uses the base route component on the email verification page, and we may reconsider how we handle our routing based on what we learned.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Users may find that the link in the email verification does not work when signing up with an e-mail address. We are aware of the issue and are investigating. Users can try using incognito mode to open the link.
Report: "Elevated API Log Errors"
Last update: We deployed an upgrade to our log system that worked well in staging but hit issues with log streams in production, so we quickly rolled it back. We have identified the issue and why it only affected one environment, and are investigating an alternate solution going forward.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of log stream errors and are currently looking into the issue.
Report: "Elevated Device SSH Errors"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of device SSH errors and are currently looking into the issue.
Report: "Elevated Device URL Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of Device URL errors and are currently looking into the issue.
Report: "The OS versions list for upgrading HosOS throught the dashboard does not show any version for some devices."
Last update: The OS versions list under the settings menu may not show any versions for some devices, preventing users from upgrading their HostOS. Some users may also see a notice on their device summary page saying "OS downgrades are not allowed". We have reverted the dashboard to a previous version that has a working OS version list for upgrading HostOS. We are still investigating why the OS version list is not rendered in the latest dashboard version.
Report: "Elevated Dashboard Errors"
Last update: A configuration change to our API that enforced stricter policies for new tokens resulted in validation failures when the Dashboard was used to download balenaOS images. We quickly noticed the issue when our internal tests started failing, and we rolled back the API to the previous release. Once the changes have been retested against the current releases of the Dashboard, they will be redeployed.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors on our Dashboard and are currently looking into the issue.
Report: "Elevated Device URLs/VPN Errors"
Last update: We observed degraded Cloudlink (VPN) connections following several subsequent API release deployments. These were spread out over the course of a day, and took some time to settle without any manual intervention. This is generally referred to as a "thundering herd": thousands of devices attempt to connect to a new node at the same time and get rate limited. Upon investigation we found that, when we are running at peak usage, the load balancing policies in place for our TCP Cloudlink connections were not optimized to avoid proxying through nodes that were scaling up and down during deploys. Due to the nature of TCP, even though our Cloudlink instances were largely unmoved, the proxied TCP connections were being interrupted by the shuffle of other backend services. We have since implemented some changes to our load balancers to only route TCP Cloudlink traffic via nodes that have online and ready Cloudlink pods running. We are also in the early stages of enabling UDP connections for this endpoint and will announce more details in the future.
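One common Kubernetes mechanism for routing external traffic only to nodes with ready backend pods is `externalTrafficPolicy: Local` together with a load balancer health check against `healthCheckNodePort`. The sketch below is a generic way to inspect that setup; the service name and namespace are placeholders, not balena's actual configuration.

```shell
# Hypothetical inspection of a Service of type LoadBalancer. With
# externalTrafficPolicy=Local, the cloud load balancer health-checks
# healthCheckNodePort and only sends traffic to nodes hosting ready pods.
kubectl -n balena get svc cloudlink \
  -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}{.spec.healthCheckNodePort}{"\n"}'

# List which endpoints (and therefore nodes) currently back the service.
kubectl -n balena get endpointslices -l kubernetes.io/service-name=cloudlink -o wide
```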
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: A temporary issue due to a switchover between internally hosted certificate authorities.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "Elevated Delta Image Downloads Errors"
Last update: The delta image back-end was failing to connect to our internal worker nodes because it was using an outdated authentication certificate. The back-end and the worker nodes were to be configured to use a new certificate at the same time to avoid disruptions, but there was a delay in re-configuring the back-end. We are reviewing the process to avoid more disruptions like this in the future.
Report: "Elevated GIT/Application Builder Errors"
Last update: An error was introduced in our builder firewall rules that dropped all outbound traffic from the build context. During this time, users would see builds start but fail when a RUN instruction included a command that required internet access. We rolled the builders back to a previous release while we investigated the root cause of the bug. A fix is now being reviewed and tested to catch this edge case, which takes time to manifest and was missed in the original testing.
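For readers who want to reproduce this class of failure locally, the sketch below shows a generic way to test outbound access from a container (mirroring what a Dockerfile RUN step would hit) and to inspect the host's Docker forwarding rules; it is illustrative only and says nothing about balena's builder internals.

```shell
# Test egress from a throwaway container: if outbound traffic is dropped by
# host firewall rules, the DNS lookup and the HTTPS fetch below both fail.
docker run --rm alpine:3.19 sh -c \
  'nslookup registry-1.docker.io && wget -qO- https://deb.debian.org >/dev/null && echo "egress OK"'

# Inspect the host firewall chains Docker routes container traffic through.
sudo iptables -L DOCKER-USER -v -n
sudo iptables -L FORWARD -v -n | head -20
```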
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue. Some users may experience DNS resolution errors during builds using the balenaCloud builders.
Report: "Balena API reporting outdated device heartbeat status"
Last update: Some devices may show a status of "Reduced Functionality" on the dashboard because the API returned an outdated heartbeat status. We found that the error occurred after we switched to a new cache for the API backend. We applied a fix to the API to allow it to display the latest reported heartbeat status from the devices. We are still investigating how to prevent this error from occurring in the future.
Report: "Elevated Device SSH Errors"
Last update: This incident has been resolved.
There was a brief outage of remote SSH access to devices while we rotated some access keys in our cluster. Going forward we will adapt our process to avoid the downtime, or include it as part of planned maintenance windows.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of device SSH errors and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: A token used for caching pulls from DockerHub expired on our cloud builders. This resulted in pulls from DockerHub appearing as unauthenticated (even though the images were public). We quickly generated a new token and updated the builder workers to use it for the local registry caches.
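For context, a pull-through cache in front of Docker Hub is often run with the upstream `registry:2` image, authenticated with a Docker Hub access token; when that token expires, cached pulls fall back to anonymous rate limits. The sketch below is generic, with placeholder credentials and ports, and is not a description of balena's builder setup.

```shell
# Generic Docker Hub pull-through cache. DOCKERHUB_USER / DOCKERHUB_TOKEN are
# placeholders; rotate the token before it expires, or pulls silently degrade
# to unauthenticated rate limits.
docker run -d --name dockerhub-mirror -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  -e REGISTRY_PROXY_USERNAME="$DOCKERHUB_USER" \
  -e REGISTRY_PROXY_PASSWORD="$DOCKERHUB_TOKEN" \
  registry:2

# Then point the Docker daemon at the mirror via /etc/docker/daemon.json:
#   { "registry-mirrors": ["http://localhost:5000"] }
```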
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "Builder returns 403 error"
Last update: This incident has been resolved.
A temporary fix has been deployed.
The issue has been identified and we're working on a fix.
We're currently investigating the builder returning 403 errors
Report: "Elevated Application Registry Errors"
Last update: We experienced timeouts for deltas and builds when pushing images to our registry hosted in the US East (N. Virginia) region. This issue impacted our cloud builders in Finland and Germany, among other regions. The root cause was identified as a public routing issue between certain regions, affecting the ability of some of our systems to access the registry efficiently. We resolved the issue by enabling proxied routing protocols for our registry endpoint. This allowed us to bypass the impacted network paths and restore normal operations.

## Impact

* Cloud builders in Finland and Germany experienced delays in image pushing
* Potential delays in deployment pipelines for affected regions
* No data loss or security breaches occurred
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We're experiencing degraded performance around creating new releases on our builders and are currently looking into the issue. Users experiencing issues creating releases using the `balena push` command are advised to temporarily switch to local builds via `balena deploy`, if possible, until the incident is resolved.
Report: "Elevated builder failure for arm64"
Last update: This incident has been resolved.
All affected arm64 builders are back online. We're monitoring the situation.
Some builds sent to our arm64 builder infrastructure are failing. The problem is caused by a faulty garbage collection process. We expect everything to return to normal quickly.
Report: "Elevated balenaOS Download Errors"
Last update:

**Incident Summary:** A modification in our Continuous Integration (CI) system inadvertently led to the creation of non-annotated GitHub release tags. This triggered a disruption in the balenaOS finalized release pipeline, with duplicate releases containing no associated artifacts.

**Impact:** While balenaOS releases were marked as final and publicly available for download, the corresponding image artifacts were missing from our backend storage. This issue did not affect device hostApp updates; however, attempts to download OS release images via the latest tags resulted in failures.

**Resolution:** We invalidated the affected releases, resolved the tagging issue in our CI pipeline to ensure the generation of correct tags in the future, and implemented a fallback in our OS deployment pipeline to accommodate both types of git tags.
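For reference, the difference between annotated and lightweight git tags, and how a pipeline can tell them apart, looks like this in generic git (the tag names are examples, not balena's release tags):

```shell
# Annotated tags are full git objects with a tagger and message;
# lightweight tags are plain refs pointing directly at a commit.
git tag -a v1.2.3 -m "release v1.2.3"   # annotated
git tag v1.2.4                          # lightweight

# Detect the tag type: "tag" means annotated, "commit" means lightweight.
git cat-file -t v1.2.3    # -> tag
git cat-file -t v1.2.4    # -> commit

# Or list all tags with their object types in one pass.
git for-each-ref refs/tags --format='%(objecttype) %(refname:short)'
```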
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of balenaOS download errors and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: This was discovered because builders were crash-looping, indicated by error codes related to long strings from large buffer sizes, and was quickly linked to an accumulation of uncleaned images on the builder instances. A manual volume prune was executed to remove images and temporarily recover builder performance. A patch is awaiting review to lower the maximum storage consumption of the builder workers before automatic cleanup is performed, and projects have been proposed to patch the upstream libraries and/or move to ephemeral builder workers. We also have a project in the queue to add host metrics monitoring to the builder-worker nodes to catch issues like this earlier.
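For illustration, the kind of storage inspection and manual cleanup referred to above looks roughly like the following in generic Docker terms; the retention filters are arbitrary examples, not balena's builder policy.

```shell
# See how much space images, containers, volumes and build cache consume.
docker system df -v | head -40

# Reclaim space: unused images older than a week, unused volumes, and stale
# build cache beyond an arbitrary budget.
docker image prune -a --filter "until=168h" -f
docker volume prune -f
docker builder prune --keep-storage 50GB -f
```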
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "old balenaOS fail to connect to balena API and download delta as it lacks support for elliptic curve encryption in openssl library."
Last update: We've now switched to another certificate provider which offers certificates compatible with older devices. Our cache proxy is back online and performance is now restored for everyone. We're still working on a migration path for users running old balenaOS releases, in anticipation of future technology deprecations.
We found out that deltas were also affected, which prevents hostOS upgrades on older devices. We have temporarily disabled our proxy for deltas.
We have temporarily bypassed our cache proxy to restore connectivity to older devices while we identify a migration path for users running affected balenaOS releases from 2019 and older.
Balena utilises automated TLS certificate updates to improve the security and trustworthiness of our service landscape. This process has now updated to a certificate chain that enforces ECDSA (elliptic curve) instead of RSA encryption. Based on all known information, this affects balenaOS releases from around 2019 and older. We are currently investigating which balenaOS versions (with their specific openssl library versions) are affected. Moreover, we are assessing how this situation can be solved for such old balenaOS versions in a continuously changing security landscape.
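To check which key and signature algorithms an endpoint's certificate currently uses (and hence whether a very old openssl build could negotiate with it), a generic inspection looks like the sketch below; the hostname is only an example.

```shell
# Inspect the leaf certificate presented by an endpoint. openssl builds
# without elliptic-curve support cannot complete the handshake when the
# chain requires ECDSA keys/signatures.
HOST=api.balena-cloud.com
openssl s_client -connect "$HOST:443" -servername "$HOST" </dev/null 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -E 'Signature Algorithm|Public Key Algorithm'
```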
Report: "Builder partially degraded service"
Last update: This was discovered because builders were crash-looping, indicated by error codes related to long strings from large buffer sizes, and was quickly linked to an accumulation of uncleaned images on the builder instances. A manual `balena image prune` command was executed to remove images and temporarily recover builder performance. We recognized the need for more strategic, long-term solutions.
This incident has been resolved.
We are currently investigating this issue.
Report: "Device settings undefined property reference"
Last update: We identified an issue on the device settings page caused by re-fetching the selected device and receiving it as undefined during the first few renders, despite already having information about it. We addressed the issue by using the information we already have about the device when on the device settings page, instead of attempting to re-fetch it.
Balena Dashboard device settings show an `undefined` property reference. `TypeError: Cannot read properties of undefined (reading 'reduce')`
Report: "Dashboard terminals were not working"
Last update: We were using the wrong `styled` function: the one from `styled-components` instead of Material UI's.
Dashboard terminals were not working briefly as a result of an incorrect import in the code following a migration. The dashboard was promptly reverted back to a functional version. A fix has been merged and deployed.
Report: "Elevated API Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Delta generation timeout"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Delta generation timeout"
Last update: Timeouts occurred when generating on-demand deltas between releases, starting around 1pm UTC, with service resuming around 4pm.
Report: "Elevated Application Registry Errors"
Last update: A misconfiguration during a registry change resulted in elevated errors when pushing application images. The issue has since been resolved.
Report: "Elevated GIT/Application Builder Errors"
Last update: A version of our cloud builder API was deployed that changed some variable types and did not properly report the build ID back to the balena CLI when using the push command with release tags. We identified the issue and rolled back the release in production while we work on a fix.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: One of our x86 cloud builders was in a bad state and would hang forever when pulling previous releases. It has since been recovered and is accepting new builds again.
Report: "Elevated Dashboard Errors"
Last update: We pushed a change intended to migrate the Download Image Dialog from rendition to MUI, as we are in the process of migrating between UI libraries. This involved adding some analytics/tracking content to the Download Image Dialog, which added cookies; in conjunction with how we checked cookies, this inadvertently created an infinite loop when checking cookies. The quick solution while a fix was worked on was to pin the dashboard back to the previous version. A solution has been PRed and is aimed to be merged tomorrow, when the team has more time to monitor the situation.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors on our Dashboard and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: An expired credit card was tied to the DockerHub account we use to authenticate public image pulls and increase DockerHub rate-limits on our cloud builders. After several attempts to charge the card, the account was downgraded to a free plan and became subject to the rate-limits published by DockerHub. See [https://docs.docker.com/docker-hub/download-rate-limit/](https://docs.docker.com/docker-hub/download-rate-limit/). Once the issue was identified, we upgraded the account to a new paid plan and the increased rate-limits were automatically restored to our cloud builders.
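The Docker documentation linked above also describes how to check the pull rate limit currently applied to a client; roughly:

```shell
# Check the Docker Hub pull rate limit seen by an anonymous client
# (per Docker's documented method; requires curl and jq).
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -s --head -H "Authorization: Bearer $TOKEN" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest \
  | grep -i ratelimit

# For an authenticated check, request the token with: curl -su "<user>:<token>" ...
```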
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of rate-limit errors in our application builder infrastructure and are currently looking into the issue.
Report: "Elevated Delta Image Downloads Service Errors"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our delta image downloads infrastructure and are currently looking into the issue.
Report: "Elevated Delta Image Updates Service Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors with delta image update service and are currently looking into the issue.
Report: "Elevated GIT/Application Builder Errors"
Last update: The image caches on our build worker nodes were running full; we have cleaned them up to restore functionality.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Report: "Elevated API Errors"
Last update:

> **tl;dr** an operations error led to inadvertently targeting the live cluster instead of test, causing a full re-provisioning of all backend assets, including load balancers and DNS records

balenaCloud services are deployed into Kubernetes, using the [Flux](https://fluxcd.io/) GitOps framework, with configuration historically split between test and production in different Git repositories. There is quite a lot of configuration duplication between these repositories, leading to operational drift between environments. One of the improvements we are currently working on is to combine them into a single source of truth using the [Kustomize bases/overlays approach](https://kubernetes.io/docs/tasks/manage-kubernetes-objects/kustomization/#bases-and-overlays).

In order to begin the refactoring of the configuration, we deployed a new Flux instance into the test cluster. Unfortunately, the operator ran the CLI with the assumption that the tooling respected the `KUBECTL_CONTEXT` environment variable, which at the time was pointing to the test cluster. The specific tooling did not respect the environment variable and instead used the default Kubernetes context set in `~/.kube/config`, which was pointing at the live production cluster. Once this new instance of Flux, pointing to an empty GitHub configuration repository, was installed into the production cluster, it deleted all of the existing production assets, including replica-sets, deployments and their corresponding services and ingresses (i.e. AWS ELB/NLB/ALBs).

The recovery effort involved four engineers and took just under an hour to restore the service. It involved manually applying the configuration back to the production cluster using Kubernetes CLI tooling, addressing some configuration inconsistencies which were blocking creation of some cloud assets, and recreating/repointing DNS records at the new load balancers. Lastly, Flux was re-configured/re-enabled against the production GitHub repositories.

To prevent these sorts of events in the future, we've made a note of updating the default context to point to the test cluster, as well as double and triple checking assumptions made when using less familiar tool chains, e.g.:

```shell
$ grep current-context ~/.kube/config
current-context: arn:aws:eks:us-east-1:1234567890:cluster/test
```
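A related habit, shown below only as a generic illustration of the mitigation described above, is to verify or pin the kubectl context explicitly rather than relying on tooling to honour an environment variable; the context name "test" is a placeholder.

```shell
# Show and switch the default context explicitly before running any tooling.
kubectl config current-context
kubectl config use-context test

# Or pass the context per command, so the default context never matters.
kubectl --context test get nodes
```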
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Web Terminal SSH Errors"
Last update: The service handling the web-socket connection for the web terminal silently failed. Restarting the service fixed the issue. While we investigate the root cause, we have added some checks to ensure the service is restarted quickly if the web-socket server is not responsive.
The web-terminal on the dashboard is now back to a fully working state.
Dashboard Web Terminal is now available again and we're monitoring the situation.
We're experiencing some disruption with web terminal SSH access on the dashboard. Note that SSH access using the CLI is still working (`$ balena ssh <device-uuid>`). We are currently looking into the issue.
Report: "Elevated Device URLs/Cloudlink Errors"
Last update: We've recently experienced some instabilities with our Cloudlink service, which impacts webterminal, ssh, tunnels, actions, and public-url. We host our Cloudlink servers, along with some other parts of the balena-cloud infrastructure, on a Kubernetes cluster. Kubernetes scales and reorganizes those servers depending on the load. When a Cloudlink server needs to be shut down, connections are drained; this is expected and in most cases transparent, as devices reconnect to another Cloudlink server automatically. What happened lately, and is problematic, is that our Kubernetes cluster moved Cloudlink servers onto hosts that were not able to sustain them at that moment, causing another move shortly after, before devices had time to reconnect. Fortunately, only a small number of devices are affected, as only a few servers are moved at any given time, but it is still a very big problem for the customers who own those devices. We have already deployed some remediation to reduce the time it takes to move one of those servers, and we're working on multiple different solutions to continue to both reduce the recovery time and reduce the occurrence of such issues.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our Device URLs and Cloudlink infrastructure and are currently looking into the issue.
Report: "Elevated Device URLs/VPN Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our Device URLs and Cloudfront/VPN infrastructure and are currently looking into the issue.
Report: "Elevated Device URLs/VPN Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.
Report: "VPN Degraded Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've identified an issue with a resource deficiency causing degraded cloudlink performance and connection failures. We're working on scaling up available resources to meet demand.
We are currently investigating this issue.
Report: "Dashboard could not establish terminal connections to devices"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have restarted the service and are continuing the investigation to find the cause.
The dashboard could not establish terminal connections to the devices. We are currently looking into the issue. Users can still use the balena CLI's ssh command to connect to their devices.
Report: "Dashboard could not establish terminal connections to devices"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The web terminal backend terminated prematurely. We have restarted the service and are continuing the investigation to find the cause.
The dashboard could not establish terminal connections to the devices. We are currently looking into the issue. Users can still use the balena CLI's ssh command to connect to their devices.
Report: "Elevated API Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We're experiencing an elevated level of API errors for device registration and are currently looking into the issue.