Astro

Is Astro Down Right Now? Discover if there is an ongoing service outage.

Astro is currently Operational

Last checked from Astro's official status page

Historical record of incidents for Astro

Report: "Astro clusters in GCP are having scaling issues due to a GCP outage"

Last update
identified

We have seen task failure rates increase on Astro Deployments on GCP. At this time we believe that Deployments on Azure and AWS are unaffected, and we are checking our components to be certain. We will pass along any updates we receive from Google regarding this issue.

investigating

Google is tracking this outage at https://status.cloud.google.com/. We will continue to monitor the issue and update this page.

investigating

There is an active GCP outage that is affecting Astro customers using GCP

Report: "Creating a Connection can crash the browser tab"

Last update
investigating

Creating certain Connection types in the Environments menu can crash your browser tab. The SSH, SMTP, SFTP, and Generic types are currently affected; there may be others. We are investigating.

Report: "403 Errors for Image Deploys"

Last update
resolved

We have determined that this error is caused by cached credentials that are no longer valid after an internal change to Astro's image registry. The fix must be performed client-side (i.e. on the machine running `astro deploy`). If you experience this error, run `docker logout` for each Astro registry that the machine has cached credentials for. By default, credentials are stored in `~/.docker/config.json`; if you are using this default setting, the following bash one-liner identifies cached credentials and runs `docker logout` for those that correspond to Astro registries: `for domain in $(grep 'registry.astronomer.run' ~/.docker/config.json | awk '{print $1}' | tr -d '":' | sort | uniq); do docker logout "$domain"; done`
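For readers who prefer a more explicit version, the following Python sketch performs the same cleanup. It is illustrative only and assumes the default credential location and the standard `auths` layout of Docker's `config.json`.

```python
# Illustrative sketch: log out of every cached Astro registry.
# Assumes the default Docker credential store at ~/.docker/config.json.
import json
import pathlib
import subprocess

config_path = pathlib.Path.home() / ".docker" / "config.json"
config = json.loads(config_path.read_text())

# Docker keeps cached registry credentials under the "auths" key.
for registry in sorted(config.get("auths", {})):
    if "registry.astronomer.run" in registry:
        # Drop the stale credentials for this Astro registry.
        subprocess.run(["docker", "logout", registry], check=True)
```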

monitoring

We've implemented a mitigation for this issue and the affected clusters should see successful image pushes. We will continue to monitor for additional errors.

investigating

A small subset of customers have reported 403 errors when running the `astro deploy` command to deploy a new image. We are actively investigating this issue. If you are experiencing these errors, we encourage you to contact support and include the login command you used, your Astro CLI and Docker versions, and any log messages.

Report: "403 Errors for Image Deploys"

Last update
Investigating

A small subset of customers have reported 403 errors when running the astro deploy command to deploy a new image. We are actively investigating this issue. If you are experiencing these errors, we encourage you to contact support and include the login command you used, astro cli and docker versions, and any log messages.

Report: "Identified a configuration issue affecting Runtime 9 which is affecting DAG execution on these deployments"

Last update
resolved

This incident has been resolved.

identified

Fix has been validated and is rolling out to affected deployments.

investigating

We are currently investigating the issue.

Report: "Identified a configuration issue affecting Runtime 9 which is affecting DAG execution on these deployments"

Last update
Investigating

We are currently investigating the issue

Report: "Identified a configuration issue affecting Runtime 9 and below which is affecting DAG execution on these deployments"

Last update
Investigating

We are currently investigating the issue

Report: "Stuck worker pods resulting in tasks failing in the queued state"

Last update
resolved

This incident has been resolved.

investigating

The incident is resolved.

investigating

We are continuing to investigate this issue.

investigating

In some deployments, worker pods are getting stuck in the initialization state for an extended period of time. Due to this, queued tasks are unable to run and fail. This is not affecting all deployments. We are investigating which deployments are affected and why.

Report: "Stuck worker pods resulting in tasks failing in the queued state"

Last update
Investigating

In some deployments, worker pods are getting stuck in the initialization state for an extended period of time. Due to this, queued tasks are unable to run and fail.This is not affecting all deployments. We are investigating which deployments are affected and why.

Report: "Customers will not be able to create new Azure Clusters"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Cluster maintenance

Report: "Customers will not be able to create new Azure Clusters"

Last update
Monitoring

A fix has been implemented and we are monitoring the results.

Investigating

Cluster maintenance

Report: "Cost Breakdown Dashboard data update delayed"

Last update
resolved

This issue is now resolved except for one customer who we have contacted directly.

identified

Deployment costs are now up to date, but compute costs for some customers remain outdated. We are working with our billing vendor to determine the source of the issue.

identified

Data shown in the Organization Dashboards Cost Breakdown (for Enterprise customers) is delayed. As stated on the page itself, the latest data is as of April 4th. The processing to update this dashboard is currently ongoing, and we expect the data to be refreshed at approximately 16:00 UTC.

Report: "Cost Breakdown Dashboard data update delayed"

Last update
Identified

Data shown in the Organization Dashboards Cost Breakdown (for Enterprise customers) is delayed. As stated on the page itself, the latest data is as of April 4th. The processing to update this dashboard is currently ongoing, and we expect the data to be refreshed at approximately 16:00 UTC.

Report: "We are experiencing an issue with new task execution on AWS clusters"

Last update
postmortem

# Overview

Between March 18 and March 26, 2025, Astro experienced a series of related incidents. This write-up serves as the analysis of the full set of incidents, as the events overlap.

On Tuesday, March 18, an Azure outage prevented our control plane components that handle authentication requests from reaching our authentication vendor, Auth0. This lasted for 45 minutes and was resolved without intervention from Astronomer. Because the control plane for Astro is hosted in Azure, all customers were affected, including those that host their data planes in AWS or GCP. During this period, customers were unable to reach the Astro or Airflow UI and were unable to get a response from the Airflow API.

On Friday, March 21, our monitoring systems alerted us to a critical issue affecting the control plane authentication components, again preventing customers from accessing the Astro UI and API. Our investigation revealed the root cause was unusually high volumes of automated authentication requests that overwhelmed the scaling capabilities of our authentication components. While we cannot share specific details about this traffic for security reasons, we can confirm that our security team verified no customer data was compromised or exposed. Our engineering team implemented protective measures against similar authentication scaling issues, including reconfiguring several network settings to improve resilience.

Early Sunday, March 23, we discovered that our network reconfiguration had unintentionally triggered a previously unknown bug in our system.

* Astro uses a component called "Harmony client" to synchronize changes between the control plane (where customers make configuration changes) and the data plane (where customer workloads run).
* During normal operations, when the Harmony client cannot connect to the control plane, it waits a few seconds and retries. However, we discovered a specific bug in how the Harmony client handles connection errors for AWS clusters. When receiving a 404 error, it interpreted this as an empty response signaling that no resources were needed, and inadvertently deleted all worker node pools, the computing resources that run tasks.
* This affected around a dozen customer clusters, disrupting workflow execution and UI access. Most clusters were protected by safety mechanisms that prevented the deletion of node pools, but this also blocked automatic recovery.

The team manually restored node pools to all affected clusters by Sunday afternoon. We developed a fix for the Harmony client bug and implemented enhanced monitoring to detect any similar issues. Given the complexity of these interconnected incidents, we made the decision to implement a temporary change freeze while thoroughly reviewing the fix and our deployment procedures to avoid introducing new issues.

On Wednesday, March 26, our monitoring systems detected a recurrence of the Harmony client issue. Our incident response team immediately engaged and identified the cause: during our troubleshooting process, a change intended for our staging environment had been applied to production through a gap in our change management process. This change caused 404 errors to once again be received by the Harmony client, re-triggering the issue. Given the large number of clusters, our efforts to apply manual fixes were taking too long to address the issue. As we no longer had confidence that it would be safer to wait, we initiated a hotfix release of the patch fixing the root cause in the Harmony client across the fleet.

# Long-Term Preventive Measures

We have corrected the underlying bug in the Harmony client and put initial steps in place to deal with bursts of unusual traffic. In the coming weeks, we plan to put further mitigations in place around automated traffic to further protect the system. We are looking at additional process measures to further separate staging and production access. We have also improved our monitoring to detect similar issues and are discussing further monitoring improvements. Additionally, we will deploy a script for automated recovery of a set of clusters to improve remediation time.

If you have any additional questions or concerns, please contact us at [support@astronomer.io](mailto:support@astronomer.io).
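As an illustration of the defensive pattern described above (and not Astronomer's actual Harmony client code, which is internal), a sync client should treat a 404 from the control plane as a retryable error rather than as an empty desired state:

```python
# Hypothetical sketch of the defensive handling described above; the function
# and endpoint names are illustrative, not Astro internals.
import time
import requests

def fetch_desired_node_pools(url, retries=5, backoff_s=5.0):
    for attempt in range(1, retries + 1):
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200:
            return resp.json()  # the authoritative desired state
        # A 404 (or any other failure) is NOT an empty desired state:
        # back off and retry instead of reconciling toward "delete everything".
        time.sleep(backoff_s * attempt)
    raise RuntimeError(f"control plane unreachable after {retries} attempts")
```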

resolved

This incident has been resolved.

identified

The team has identified the issue and is currently actively working on mitigating it. The issue concerns new task execution on AWS clusters and accessing the UI.

investigating

We are experiencing an issue with new task execution on AWS clusters and with accessing the UI. The team is actively investigating the issue.

Report: "We are experiencing an issue with new task execution on AWS clusters"

Last update
Postmortem
Resolved

This incident has been resolved.

Identified

The team has identified the issue and is currently actively working on mitigating it. The issue concerns new task execution on AWS clusters and accessing the UI.

Investigating

We are experiencing an issue with new task execution on AWS clusters and issue with accessing UI. The team is actively investigating the issue.

Report: "Airflow UI and API Unavailable for a few customers"

Last update
resolved

We have confirmed that no additional clusters were affected beyond those that were initially identified. This incident is fully resolved.

identified

We have identified that this issue is specific to clusters that have custom networking, specifically route tables that require a carve-out for traffic back to Astro's control plane. The public IPs for the control plane were changed, and certain custom networking setups required that the IPs be updated accordingly. We have fixed this for all customers who reported this issue and are checking all clusters to determine if there are any others affected.

investigating

We are experiencing an issue in a few clusters, causing the Airflow UI and API to become unavailable. The team is actively investigating the issue.

Report: "Airflow UI and API Unavailable for a few customers"

Last update
Resolved

We have confirmed that no additional clusters were affected beyond those that were initially identified. This incident is fully resolved.

Identified

We have identified that this issue is specific to clusters that have custom networking, specifically route tables that require a carve-out for traffic back to Astro's control plane. The public IPs for the control plane were changed, and certain custom networking setups required that the IPs be updated accordingly.We have fixed this for all customers who reported this issue and are checking all clusters to determine if there are any others affected.

Investigating

We are experiencing an issue in a few clusters, causing the Airflow UI and API to become unavailable.The team is actively investigating the issue.

Report: "Certain Astro clusters on AWS experiencing downtime"

Last update
resolved

We have determined the event that caused this downtime and we are confident that it will not occur again. We will post a public RCA in the coming week.

monitoring

We have applied a remediation for all of the affected clusters. No clusters are currently experiencing downtime. We are continuing to examine the root cause and will update again when we are confident that the issue will not recur.

identified

We have identified a problem with scaling behavior that is causing a limited number of clusters to experience downtime. The message 'Internal Server Error' is displayed in the UI, preventing access to the Airflow UI and the viewing of DAGs. In some cases this is affecting task execution. We are currently working on a fix.

Report: "Certain Astro clusters on AWS experiencing downtime"

Last update
Resolved

We have determined the event that caused this downtime and we are confident that it will not occur again. We will post a public RCA in the coming week.

Monitoring

We have applied a remediation for all of the affected clusters. No clusters are currently experiencing downtime. We are continuing to examine the root cause and will update again when we are confident that the issue will not recur.

Identified

We have identified a problem with scaling behavior that is causing a limited number of clusters to experience downtime. The message 'Internal Server Error' displays on the UI preventing the viewing of DAGs and the Airflow UI. This is in some cases affecting task execution. We are working on a fix currently.

Report: "'Internal Server Error' when attempting to access Airflow UI"

Last update
resolved

This incident has been resolved

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue. Tasks do not appear to be impacted.

Report: "'Internal Server Error' when attempting to access Airflow UI"

Last update
Resolved

This incident has been resolved

Update

We are continuing to investigate this issue.

Investigating

We are currently investigating this issue. Tasks do not appear to be impacted.

Report: "Astro UI and Astro API not available."

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently experiencing an issue impacting the Astro Control Plane Cluster during routine maintenance activities. Current impact: the Astro UI and Astro API are not available at this moment; however, Airflow tasks will continue to run. Actions being taken: our engineering team is actively monitoring and working to restore services promptly. Next update: we will provide further status updates as more information becomes available. We apologize for the inconvenience and thank you for your patience.

Report: "Astro UI and Astro API not available."

Last update
Resolved

This incident has been resolved.

Update

We are continuing to monitor for any further issues.

Monitoring

A fix has been implemented and we are monitoring the results.

Identified

The issue has been identified and a fix is being implemented.

Investigating

We are currently experiencing an issue impacting the Astro Control Plane Cluster during routine maintenance activities.Current Impact:Astro UI and Astro API are not available at this moment. However, airflow tasks will continue to run.Actions Being Taken:Our engineering team is actively monitoring and working to restore services promptly.Next Update:We will provide further status updates as more information becomes available.We apologize for the inconvenience and thank you for your patience.

Report: "Service Degradation: Delayed or Missing DAG Alerts"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Our engineering team is actively investigating the root cause of this issue. We are working on implementing a long-term fix to restore full functionality. Further updates will be provided as we make progress.

investigating

Affected services: Astro Alerts (DAG success, failure, and SLA miss notifications). Description: we are currently investigating an issue affecting DAG alerts on Astro. Impact: customers may experience delays or failures in receiving DAG alerts.

Report: "Airflow DAGs cannot be triggered via the Astro UI"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and the fix is being implemented

Report: "Airflow incorrectly being reported as being deployed"

Last update
resolved

The issue has been fixed and all deployment information is accurate.

identified

We have identified an issue where Airflow deployments are incorrectly being marked as being deployed. This causes the "Open Airflow" button in the Astro UI to be greyed out. Despite being greyed out, the "Open Airflow" button will still work, and the deployment is still able to be accessed. This does not affect the running Airflow deployments, so no DAGs will be affected.

Report: "Deployment unhealthy after downgrading from Runtime 12.3.0+"

Last update
resolved

We are declaring an end to this incident, as the impact has been determined to be quite narrow. This issue only affects a deployment that was at one point on an Astro Runtime version below 12.3.0, then upgraded to a version in the 12.3.0 - 12.7.1 range, upgraded again within that range, and then rolled back to the earlier 12.3.0 - 12.7.1 version. For example, if a Deployment went from 11.0.0 -> 12.3.0 -> 12.7.1, a rollback to 11.0.0 would be performed as expected, but a rollback to 12.3.0 would fail. We are working on an update to the Astro Data Plane to fix this issue, which we expect to release this month. Our support team will assist if any deployments encounter this bug in the meantime.

monitoring

The issue has been verified as a missed minor database update that does not impact deployment behaviour unless a deployment rollback is applied. A deployment is affected if it was upgraded to Runtime 12.3.0 or later from a Runtime version below 12.3.0. For example:
- Runtime 12.2.0 upgraded to Runtime 12.7.0 (Affected)
- Runtime 12.0.0 upgraded to Runtime 12.3.0 (Affected)
- Runtime 12.4.0 upgraded to Runtime 12.7.0 (Unaffected)
The underlying issue has been fixed, and any upgrades going forward are unaffected. We are currently applying a permanent fix for affected deployments.

investigating

Astronomer has identified an issue with the upgrade to Runtime 12.3.0+ where a database migration is not applied. While the upgrade will complete without errors, the deployment will become unhealthy when attempting to downgrade from Runtime 12.3.0+. We do not recommend upgrading to Runtime 12.3.0+ at this time, as there may be other incompatible interactions with the database.

Report: ".airflowignore file not being respected"

Last update
resolved

The fix is rolled out and confirmed on affected deployments.

identified

A fix has been prepared and is currently being rolled out to affected deployments.

identified

Astronomer has identified an issue with the component responsible for injecting customer code into Airflow environments. The .airflowignore file is not being respected, causing files that should be ignored to be parsed. This issue only affects some deployments, and it does not affect any deployments that have not been updated today.

Report: "Certain Shared Deployments in AWS US West 2 are down"

Last update
resolved

This incident has been resolved.

monitoring

We have restored the Shared Cluster to functionality. We are monitoring for any further disruptions.

investigating

A particular shared cluster in AWS US West 2 is experiencing an issue with its database server that is causing an outage. 20% of shared cluster deployments in this region are affected. We are working to restore operations as quickly as possible. Deployments on Dedicated Clusters and Deployments in other clouds or regions are unaffected.

Report: "Issue with Airflow Deployments on Newly Created Hosted GCP Clusters"

Last update
resolved

We have implemented a mitigation for the issue preventing access to the Airflow UI on newly created hosted GCP clusters.

investigating

Newly created hosted GCP clusters are being created successfully, but users are unable to access the Airflow UI. Our team is actively investigating the issue and will provide updates as we progress.

investigating

We are currently investigating an issue preventing newly created hosted GCP clusters from successfully deploying fully functional Airflow environments. Our team is actively working to identify the root cause and will provide updates as we progress.

Report: "Unable to create new clusters on GCP"

Last update
resolved

This issue has been resolved, and new Hosted Clusters are being created successfully on Google Cloud.

investigating

We are currently investigating an issue preventing new Astro Hosted clusters from being created on Google Cloud.

Report: "Azure Outage causing scaling issues on Azure Clusters"

Last update
resolved

We are no longer seeing issues with node scaling in our Azure clusters

identified

Deployments in the Azure East US2 region are experiencing issues running DAGs and tasks due to an ongoing Azure outage affecting node scale-up. See https://azure.status.microsoft/en-us/status for details.

Report: "Issue with the DAGs page on the Astro UI"

Last update
postmortem

At 23:59 UTC on November 22, Astronomer identified a bug in Astro's implementation of AWS customer-managed egress. This resulted in all Astro clusters on AWS routing all traffic through a single NAT gateway, which was determined to be a reliability issue. Astro API was accidentally downgraded when a hotfix was released to production at 21:14 UTC on November 25; this removed its ability to properly issue tokens, which resulted in the Dag view in the Astro UI (not the Airflow UI) being unable to list Dags. The initial solution to address the Astro API issue resulted in a small number of Astro customers whose clusters had custom IP Allowlists being unable to access the Airflow UI. Custom IP Allowlists were not properly tested against the changes made to address the previous issue, which introduced a regression that broke authentication to the Airflow UI. These issues did not cause any impact to Dag runs or task instances. These bugs were introduced because of a complex process for releasing hotfixes, which the Astronomer engineering team is actively reviewing.

resolved

This incident has been resolved.

monitoring

The issue affecting the DAGs page on the Astro UI has been resolved by our engineering team and is under monitoring. Users can now access the page without any disruptions.

investigating

We have identified an issue affecting the DAGs page on the Astro UI. Our team has escalated this matter to our engineering team for further investigation. Please note that this issue does not impact the execution of running DAGs, and users can still view their DAGs through the Airflow UI.

Report: "Creation of GCP-based Astro clusters is currently failing due to an issue with CPU quotas"

Last update
resolved

GCP support has increased the quota, allowing us to create GCP clusters on Astro without issues. This issue has been resolved.

investigating

We are currently investigating this issue.

Report: "Airflow UI inaccessible with the error Internal Server Error"

Last update
resolved

Auth0 has fixed the issue on their end (https://status.auth0.com/), and the Astro Airflow UI is now accessible.

identified

We have identified an issue with Auth0, and our team is actively working on implementing a fix.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Astro deployment schedulers crash looping due to incorrect airflow local settings config"

Last update
postmortem

On August 14 at 2:18 PM UTC, Astronomer's internal monitoring system began receiving sporadic, seemingly disparate alerts indicating that the scheduler component in some Astro Deployments was unhealthy. At 3:40 PM UTC, once it became clear that these alerts were related and impacting Astro users, Astronomer initiated its incident response process. At 3:44 PM UTC, the root cause of the incident was identified, and by 4:02 PM UTC, the incident was resolved.

The incident was caused by a code change to the [airflow_local_settings.py](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html#configuring-local-settings) file that is managed by Astronomer to implement cluster policies that usually make Airflow more reliable. For example, one such cluster policy prevents KubernetesPodOperator pods or KubernetesExecutor pods created by users from attempting to consume more resources than are available in the user's Deployment. This code change introduced a bug that impacted users running an Astro Runtime with a Python version lower than 3.10. The bug was caused by the use of Python's [platform.freedesktop_os_release()](https://docs.python.org/3/library/platform.html#platform.freedesktop_os_release), which was introduced in Python 3.10 and is unavailable in earlier versions.

To prevent this from happening again, Astronomer is now testing against all Python versions that are supported by Astro Runtime. In order to respond more quickly to potentially systematic issues impacting scheduler availability, Astronomer will raise the priority of alerts indicating schedulers are unhealthy so that the on-call support engineer is paged.
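As an illustration of the class of fix involved (not Astronomer's actual cluster-policy code), guarding a 3.10-only call keeps the same settings file importable on older interpreters:

```python
# Illustrative sketch: platform.freedesktop_os_release() only exists on
# Python 3.10+, so code that must also run on older interpreters needs an
# explicit guard and fallback.
import platform
import sys

def os_release_info():
    """Return /etc/os-release data where available, or an empty dict."""
    if sys.version_info >= (3, 10):
        try:
            return platform.freedesktop_os_release()
        except OSError:
            # No os-release file on this system.
            return {}
    # Older Pythons: skip the optional lookup rather than failing at import time.
    return {}
```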

resolved

This incident has been resolved.

monitoring

Correction, the affected Runtime images are those using versions of Python *less than* 3.10. We have implemented a fix and are monitoring the results now.

identified

We have identified the issue, which only affects deployments using Astro Runtime with a Python version below 3.10. We are working on releasing a hotfix now.

investigating

Astronomer is investigating an issue with Astro deployment scheduler pods entering a CrashLoop state due to an error in the airflow local settings file. We're investigating the root cause and will update this page as we have more information.

Report: "cloud.astronomer.io was not loading properly for all users"

Last update
resolved

From 14:52 to 15:09 UTC, cloud.astronomer.io was inaccessible. DAGs and tasks continued to run, but the UI was not accessible, nor were deploys from the CLI or API possible.

Report: "Astro does not respect .airflowignore when deploying to Astro with Astro CLI 1.28.0"

Last update
resolved

This incident has been resolved. Please upgrade to Astro CLI version 1.28.1.

identified

The issue has been identified and a fix is being implemented.

Report: "All KubernetesExecutor tasks failing for some customers using DAG-only deploy"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Update Deployment feature impacted on Hybrid GCP Clusters"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently experiencing an issue where customers hosted on Hybrid GCP Clusters are unable to update their deployments. Our engineering team is actively investigating the cause of this problem and working to resolve it as quickly as possible. Please note that existing DAGs are running fine; however, updates to deployment resources (variables, worker queues, etc.) are blocked.

Report: "Astronomer is currently investigating an issue preventing customers from deploying to Astro"

Last update
resolved

This incident has been resolved.

monitoring

The issue has been mitigated, and we are continuing to monitor.

investigating

We are currently investigating this issue.

Report: "Airflow API 500 Errors"

Last update
resolved

The incident has been resolved. During this time, DAGs that were making API calls to Airflow deployments may have failed; those tasks can now be rerun.

investigating

When accessing the Airflow API programmatically, requests may fail and return an HTTP 500 status code. We are investigating the cause of this issue. Access through the UI is unaffected and all DAGs will continue to run as normal.

Report: "GCP outage causing scaling issues for GCP clusters"

Last update
resolved

The service disruption caused by the Google Cloud Platform (GCP) outage has been resolved. All affected services have been restored to normal operation. Please refer to the GCP status page for more details on the incident. https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre#RP1d9aZLNFZEJmTBk8e1

investigating

A fairly widespread GCP outage is preventing new Astro nodes in GCP from pulling images. This should not affect existing nodes and should not affect running DAGs unless they need to scale. See https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre#RP1d9aZLNFZEJmTBk8e1 for details.

Report: "Authentication Errors"

Last update
resolved

This incident has been resolved.

monitoring

We have received notification from our authentication provider that this is now resolved. If you are still unable to log in to Astro, please refresh your browser and try again.

identified

We are experiencing a failure with our upstream authentication provider that is causing users to receive a 404 error when trying to log in to the Astro platform. We are currently working with the provider to resolve the issue. This issue does not prevent DAGs from running.

Report: "Worker Nodes Not Spinning Up in GCP Dataplane Clusters"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and the fix is being implemented.

investigating

Incident description: some worker nodes within several GCP data plane clusters are failing to spin up as expected, causing delays in task execution and potentially leading to DAGs/tasks getting stuck in the queued state or failing. Current status: we have pinpointed and confirmed the issue, and our engineering team is actively working to resolve it. Impact: delays in task execution within affected clusters, with a risk of DAGs/tasks getting stuck in the queued state or failing due to the inability to spin up worker nodes. Resolution: our engineering team is working diligently to implement a fix. Communication: regular updates will be provided to keep you informed of any developments. We apologize for any inconvenience this may cause and appreciate your patience as we work to resolve this issue promptly.

Report: "Deployment metrics sometimes failing to load"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Listing or viewing some deployments will display "metrics failed to load" instead of showing Dag Run and Task Instance summaries. Actual DAG Runs and tasks are continuing to execute correctly, and the Airflow UI is still accessible. We have identified the problem and are working on deploying a fix.

Report: "Quay.io image registry is having an outage"

Last update
resolved

This incident has been resolved.

monitoring

Quay.io appears to be back up

investigating

This outage affects:
* New cluster creation
* CI/CD pipelines which pull public images (e.g. Astro Runtime) from Quay
* Provisioning new worker pods and nodes (scale-up) for some clusters

Report: "Astronomer Cloud UI and API Unavailable"

Last update
resolved

We have identified the issue and a mitigation was applied. Services have resumed healthy operation. This issue is now resolved.

investigating

We are currently investigating an issue with the Astronomer Cloud UI and API. Please stand by for further updates.

Report: "Astro CLI versions <= 1.22 are unable to successfully execute some commands"

Last update
resolved

The fix has been made and the issue is now resolved.

identified

Upgrading the Astro CLI to 1.24.1 is known to fix this issue. A change to backend systems broke some functionality of the Astro CLI, including the ability to deploy code to Astro. We've identified the issue and are working to implement a fix.

Report: "Unable to update deployment from Astro UI"

Last update
resolved

The issue has been resolved!

monitoring

The issue has been fixed and deployment updates are now working from the UI. We will continue to monitor.

identified

Updates to non-development deployments via the Astro UI may be declined with an invalid request error. A fix is being worked on.

Report: "Intermittent Network and Scheduling Issues in AWS us-west-2 Region for Airflow Deployments"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Airflow deployments in the AWS US-West-2 region may encounter occasional network and scheduling disruptions. The affected cluster has been cordoned off to mitigate the impact on new implementations. Our team is actively investigating the issues within this cluster.

Report: "Intermittent Network and Scheduling Issues in AWS us-west-2 Region for Airflow Deployments"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Airflow deployments in the AWS US-West-2 region may encounter occasional network and scheduling disruptions. The affected cluster has been cordoned off to mitigate the impact on new implementations. Our team is actively investigating the issues within this cluster.

Report: "Hybrid customers are unable to create or update deployments"

Last update
postmortem

**The Problem**

On Wednesday, January 17th, 2024, some customers were unable to create or apply changes to Airflow Deployments running on Astro Hybrid clusters, which included deploying image or DAG updates to these Deployments. This outage spanned from 22:20 to 23:25 UTC, for a total of 1 hour and 5 minutes. The outage impacted a total of 720 individual HTTP requests to Astro API, the main API powering all user interactions with Astro. During this time period, Airflow tasks continued to run and were unaffected.

Ultimately, the outage was caused by an incompatibility between two services that power the Astro control plane: Astro API and an internal service named Harmony. Astro API is our public-facing API that all user interactions flow through, which includes creating, updating and deleting Airflow deployments. Astro API requests that create, update or delete an Airflow deployment then flow to Harmony, which generates and synchronizes Kubernetes manifest files that describe the desired state of a given Astro data plane. Prior to the release that caused this outage, we were running two distinct versions of Harmony. One version managed the Hybrid data planes, and one version managed the Hosted data planes. When the release that triggered this outage was deployed to production, Astro Hybrid deployment requests from Astro API to Harmony began to fail due to an incompatibility with the version of the Harmony service that was running. Astro Hosted deployments were unaffected.

**Root Cause**

This outage was caused by an incompatibility between our external Astro API service and our internal Harmony service. While our Astro API service is rolled out via an automated deployment process, our Harmony service is upgraded differently, using a canary rollout mechanism. We use this canary system to gradually roll out changes to the data planes under management. At the moment, progressing through the canary rollout requires a human in the loop. There were two main contributing factors that led to this outage. First, the changes we deployed to our Astro API were not backward compatible with the currently running version of our Harmony service. Regardless of our canary rollout process, the changes made here should have been able to work with the older _and_ newer version of Harmony being deployed. Second, the Harmony rollout process is completely automated in our Development and Staging environments, which creates a difference between the lower environments and our Production environment. Our Production environment _currently_ requires human intervention to operate the canary rollout process. While investigating this issue, we found that the issue actually _did_ occur in our lower environments, but it was automatically resolved within minutes due to our automated data plane upgrade process. This led to the issue not being discovered until it was released in our Production environment.

**What We're Doing to Prevent this from Recurring**

At the lowest level, this issue surfaced due to an API incompatibility. This particular change happened as we were actually simplifying our system. As mentioned above, prior to this release, we were running two versions of our Harmony service, one for Hybrid and one for Hosted. Unfortunately, as we rolled out the change to consolidate this to a single service, we mistakenly hadn't made the Astro API change backward compatible with the previous version of Harmony. Once this outage had been mitigated, our system was actually simpler than it was prior to the outage. We now send both Hosted and Hybrid requests to a single Harmony service, and no longer need to manage the complexity of two different versions. Now that we've reduced the surface area of the Harmony service, we are working to harden the service with an OpenAPI specification, allowing our consuming services to share code and take advantage of an auto-generated client. This will help prevent certain types of bugs in this communication path in the future. Beyond preventing code-level bugs, we are also working to get our Staging and Production environments more aligned regarding the Harmony service's canary rollout system. As mentioned, if our Staging environment's data planes were subject to our canary rollout over a longer period of time like Production, we would have caught this issue much earlier in the process. At a higher level, we are working to build a much more robust system to roll out upgrades to our data planes. Instead of the entire Hybrid data plane fleet being potentially exposed to this issue, we could have verified the changes in a smaller subset before rolling the changes out any further.

resolved

Creating and updating deployments will result in failure which could manifest as Internal server errors (status code 500).

Report: "Astro Cloud UI "No Healthy Upstream""

Last update
postmortem

**Problem**

Early on the morning of January 22nd, customers faced issues accessing the Cloud UI and performing certain operations with the Astro CLI. This was an intermittent outage which spanned from 3:25 AM to 4:07 AM PST. It resulted in unavailability of the Cloud UI for some customers, prevented Astro CLI deploy commands from running successfully, and prevented successful node scale-ups in Astro data planes. We don't believe that this outage caused any tasks to fail, but it might have slowed task scheduling for some customers.

* For Deployments using the Celery executor on Astro Hybrid, workers could not properly scale up during the outage periods.
* For Deployments using the Kubernetes executor, or Deployments using the Celery executor on Astro Hosted, new worker pods could start, but new nodes could not be added, so autoscaling was limited to the nodes available at the start of the outage.

The component that caused the intermittent outage is called the _Astro API_, which is a critical service in the control plane that mediates most actions in Astro. However, because Airflow itself doesn't use this component, losing the Astro API prevents most Astro functions from working but doesn't directly impact the operation of running Airflow Deployments. The problem was triggered when an end user configured the "Linked to all Deployments" option in the Airflow connection management menu for their Astro Workspace. This end user action resulted in a segmentation fault that bypassed middleware designed to trap and recover from segmentation faults, killing the Astro API container and taking down one of the running replicas. Because a typical user action is to retry the operation upon failure, it's possible that the end user kept retrying this while the back-end system was bringing up new Astro API containers to replace the container which had just crashed, thereby triggering a degradation of service and errors for users trying to access the service at the same time. One of the customers that encountered an error reported the problem to Astronomer Support almost simultaneously with our internal alert which detected the Astro API containers crashing. This was immediately escalated by the Support team to Engineering via our Incident Management process, and a mitigation was put in place within 40 minutes of the escalation.

**Root Cause**

We introduced code in the release we rolled out on January 17 that has now been determined to be thread unsafe. Ironically, this code was written to reduce duplication in our code, make it modular and testable, and reduce the risk of introducing bugs. We deployed logic in a specific code path that would read from the database using transactions. However, in golang (and the ORM library we use), the concurrent transaction reads were actually using different connections and hence thread unsafe. This led to the database calls not being able to retrieve anything but not returning any errors either, which in turn led to a nil pointer dereference panic on Astro API pods, causing them to restart. We also now know the user scenario that executes this thread-unsafe code:

1. A user goes to a Workspace with more than 1 Deployment in it.
2. The user opens the Environment tab for that Workspace.
3. The user switches the "Linked to all Deployments" option from true to false OR from false to true, OR adds/updates a connection.

We believe that switching the "Linked to all Deployments" option in the UI is such a spammable operation that a single user trying it a few times could have led the individual pods handling those requests in the backend to panic and crash, thereby resulting in a degradation of the service. The initial mitigation fix was rolled out within 40 minutes so that, instead of panicking in those situations, Astro API would return HTTP 500 errors. The individual problem requests still failed but did not cause the pods to restart. We also released a medium-term fix on Jan 24, wherein we replaced the concurrent DB read with serial (less optimal) reads that loop through all Deployments in the Workspace instead. This change has been validated to have fixed the problem, which means no more panics nor 500 errors for users changing the auto-link option.

**What We're Doing to Prevent this from Recurring**

We are focusing on what can be done to **improve the robustness of the Astro API** given its key role in the Astro control plane, as well as how to have a **faster response** and **improved monitoring** for issues that arise. The first alert that was raised for this incident was a pod restart alert for the Astro API pods. This alert, however, was not tuned correctly in two ways. Firstly, it auto-resolved because of an incorrect setting. Secondly, and more critically, receiving this alert does not always mean that there is customer impact. Because of this and the auto-resolution, the engineer who was paged mistakenly believed that this issue was momentary and would not have customer impact. To prevent this, we are adding two new alerts based on the dashboards we used to determine the amount of impact during the outage. These alerts will measure both the real customer-generated traffic to the Astro API and our own synthetic traffic (in case the real traffic is blocked by an ingress issue) to raise high-priority alarms to both our Support and Engineering teams if the Astro API drops below very high levels of consistency. Because these alerts will always indicate an important customer-visible problem and go to multiple teams within Astronomer, we are confident that they will be acted on with the appropriate urgency.

Our postmortem review also revealed that with a more robust Astro API pod setup, even issues of this magnitude would have much less customer impact. The problem was that the minimum size of the Astro API autoscaling group is too small, and it was at that minimum size during the outage. We currently run a small number of large pods, and we are now looking to have a significantly larger number of smaller pods instead. With a larger number of smaller pods, the panics would have been less likely to crash all of the pods at once. There are some nuances to work out about how to size and manage database connections with a larger number of pods, so this change will not be rolled out until we can be sure it will not have other unintended negative consequences.

We have analyzed the specific bug that triggered this outage, and we don't believe that we could have reasonably implemented a regression test for the behavior that would have detected this. Without the deep understanding that the database connections would later be made concurrent and accidentally thread-unsafe, we would not have been able to predict which tests would be required. We also evaluated the feasibility of enabling a quick rollback so that we would not have to determine the full root cause and fix to resolve the outage. However, although we are capable of doing a rollback in our deployment model, it would not have been feasible to do this. This is because the change that included the bug was committed over a week before the outage and deployed five days before the outage. We have a weekly release cadence in the control plane, but even if we went to smaller and more frequent releases, this change did not cause a problem immediately; it had to be triggered by a specific user action. Because other updates since this change involve database schema updates and other infrastructure changes, it is not obviously safe to roll back to before the bug was deployed.

resolved

The incident has been resolved; all systems are operational.

monitoring

The issue was identified, and a fix has been applied. We are currently monitoring the deployments.

investigating

The impact has been reassessed as major to expedite mitigation.

investigating

We are currently investigating this issue.

Report: "Astro Analytics - Degraded Performance"

Last update
resolved

This incident has been resolved. Astronomer builds metrics in part by using a logging tool. The performance of the logging tool's indexer was adversely impacted by an increase in scheduled queries, which overwhelmed the logging tool, resulting in a backup of queries, which in turn impacted the monitors in the Astro UI. After optimizing scheduled queries, performance returned to normal.

investigating

Our team is currently investigating the degraded performance of the Astro Analytics service.

Report: "New worker pods in Azure AKS clusters unable to start"

Last update
resolved

This incident has been resolved.

monitoring

The issue has been identified and we are beginning to update the affected clusters. Worker pods that were stuck in Pending state are spinning up now.

investigating

We are aware of an issue with Azure and are currently investigating it. Pods older than 1:30 PM CST (0630 UTC) are not affected.

Report: "Monitoring service in Astro Standard Clusters experiencing issues"

Last update
resolved

This incident has been resolved.

monitoring

The hotfix has been deployed to production, and the affected clusters are being bootstrapped with it. We are monitoring the results of the fix.

identified

The hotfix has been released to staging; we are validating the results and will proceed with the release to production following that validation.

identified

The issue has been identified and a hotfix is being created and rolled out.

investigating

We are currently investigating an issue in the monitoring service Astronomer uses to monitor Astro Standard Clusters.

Report: "Modifying environment variables from the Astro UI may delete the values for other environment variables marked as "Secret""

Last update
postmortem

Astronomer is undergoing a migration from a legacy internal API to the recently released [Astro API](https://docs.astronomer.io/astro/api/overview), which is more resilient and performant. The two APIs handle [falsy](https://developer.mozilla.org/en-US/docs/Glossary/Falsy) values differently, and this introduced a bug in Astro API's updating of environment variables, resulting in the values of environment variables marked "Secret" being functionally deleted when environment variables were modified. When environment variables on Astro are saved, they are saved together, so updating any environment variable resulted in the deletion of values for all environment variables marked "Secret". Beyond implementing the fix to resolve this issue, we are implementing the following remediations to prevent something similar from happening again:
* Review functional testing for secret environment variables
* Implement feature flags to enable staged rollouts for future API migration tasks
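For illustration only (the helper below is hypothetical, not the Astro API), the safe way to merge a save payload is to treat a falsy placeholder for a secret as "unchanged" rather than as a new value:

```python
# Hypothetical sketch: secrets are never echoed back to the client, so an
# update payload carries a falsy placeholder ("" or None) for each secret.
# Treating that placeholder as the new value wipes the secret; a safe merge
# keeps the stored value unless a non-empty replacement is submitted.
def merge_env_vars(stored, submitted, secret_keys):
    merged = dict(stored)
    for key, value in submitted.items():
        if key in secret_keys and not value:
            continue  # falsy placeholder: preserve the existing secret value
        merged[key] = value
    return merged

# Updating a plain variable must not clear the secret's stored value.
stored = {"API_URL": "https://old.example", "DB_PASSWORD": "s3cret"}
submitted = {"API_URL": "https://new.example", "DB_PASSWORD": ""}
assert merge_env_vars(stored, submitted, {"DB_PASSWORD"})["DB_PASSWORD"] == "s3cret"
```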

resolved

This incident has been resolved.

investigating

We are currently investigating this issue. If you rely on setting environment variables in the Astro UI, please refrain from updating environment variables at this time.

Report: "Bug in AstroAPI endpoint call deleting connections from Astro Environment"

Last update
resolved

This incident has been resolved.

identified

A bug has been identified in the Managed Connections of Astro Hosted Environments that deletes existing connections. A fix has been made and is being deployed.

Report: "Hybrid customers Unable to view Teams"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Quay.io outage causing new pods to be stuck in Pending waiting to download container images"

Last update
resolved

This incident has been resolved.

monitoring

Quay.io has indicated that the fix is complete and that pushes and pulls are operating correctly. New Airflow services that come up are operational. We are continuing to monitor the situation, as Quay.io has not yet marked their incident as resolved.

identified

This issue is ongoing. We have observed some instances where images are able to be pulled, but we're continuing to observe widespread image pull issues. We will update as more information becomes available.

identified

Quay has indicated that they are continuing to experience instability and are moving their image repo to read-only mode, which will affect image push operations.

identified

Quay.io, the container image repository used by Astronomer, is experiencing issues with image pull failures. Quay.io incident: https://status.quay.io/incidents/z7sbjqmb34p1 We will continue monitoring the situation and update this incident as more information becomes available. Existing pods should be unaffected and will continue executing tasks.

Report: "Tasks from deployments with KubernetesExecutor are unable to execute"

Last update
postmortem

From Nov 8, 16:09Z until the issue was resolved on Nov 8, 20:21Z, Astro Hybrid Deployments using Kubernetes Executor with DAG-Only Deploy that were updated by customers were unable to start new Airflow workers. In total, there were 4 such Deployments across Astro.

As part of releasing Deploy Rollbacks, the location of the URL from which the Deployment checks for new DAGs was changed to be stored locally within the Astro Data Plane instead of requiring an API call to the Control Plane. This design is more robust to intermittent network issues, and we continue to believe this change was conceptually correct. However, Kubernetes worker pods were not updated with correct access to the new location of the URL. This prevented the worker from starting up, as retrieving this URL and then using it to download the DAGs is a critical step in initializing new workers. Once the issue was understood, an update was rolled out to all Hybrid clusters to properly give access to the new location.

This issue clearly should have been caught in testing. We reviewed our testing procedure and found a gap. On Celery Executor, a deployment procedure's impact on starting workers can be tested without running any DAGs, because Celery Executor starts a worker even when no tasks are running (unless scale to zero is on, but we disable that setting for these particular tests). We were using the same test procedure for Kubernetes Executor, but because no DAGs were running, no workers were started, and thus no errors were raised. We have now adjusted the Kubernetes Executor test suite to include actually running a DAG after testing a new deployment procedure, thus ensuring a much more realistic test.

Our existing alerting did detect these issues as they started happening to customers. We are nevertheless going to invest in more targeted alerting around the DAG Download process.

resolved

This incident has been resolved.

monitoring

Further correction: the only affected deployments are those on Astro Hybrid that have been updated in the last day and have DAG-only deploy enabled.

monitoring

Update - We have identified that the only affected deployments are those on Astro Hybrid that have been updated in the last day. We are continuing to monitor the results of the fix.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We have identified that the only affected deployments are those on Astro Hybrid that have been updated in the last day.

identified

Due to an issue with Kubernetes Airflow worker pods being unable to download DAGs, these pods cannot initialize, leaving task instances stuck in the queued state. This only affects Airflow deployments using KubernetesExecutor.

Report: "Astro CLI cannot access Airflow variables and connections"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

When using the Astro CLI to access deployment variables and connections, you may receive the error "failed to decode response from API". If you receive this message while modifying variables or connections, the modifications have not taken effect. Examples of CLI commands that may fail:
- astro deployment airflow-variable list
- astro deployment connection list
- astro deployment connection update
Our team has identified the issue and is releasing a fix.