Astro

Is Astro Down Right Now? Discover if there is an ongoing service outage.

Astro is currently Operational

Last checked from Astro's official status page

Historical record of incidents for Astro

Report: "Astro clusters in GCP are having scaling issues due to a GCP outage"

Last update
identified

We have seen task failure rates increase on Astro Deployments on GCP. At this time we believe that Deployments on Azure and AWS are unaffected, and we are checking our components to be certain. We will pass along any updates we receive from Google regarding this issue.

investigating

Google is tracking this outage at https://status.cloud.google.com/. We will continue to monitor the issue and update this page.

investigating

There is an active GCP outage that is affecting Astro customers using GCP

Report: "Creating a Connection can crash the browser tab"

Last update
investigating

Creating certain Connection types in the Environments menu can crash your browser tab. The SSH, SMTP, SFTP, and Generic types are currently affected; there may be others. We are investigating.

Report: "403 Errors for Image Deploys"

Last update
resolved

We have determined that this error is caused by cached credentials that are no longer valid after an internal change to Astro's image registry. The fix must be performed client-side (i.e. on the machine running `astro deploy`). If you experience this error, run `docker logout` for each Astro registry that the machine has cached credentials for. By default, credentials are stored in `~/.docker/config.json`; if you are using this default setting, the following bash one-liner identifies cached credentials and runs `docker logout` for those that correspond to Astro registries: `for domain in $(grep 'registry.astronomer.run' ~/.docker/config.json | awk '{print $1}' | tr -d '":' | sort | uniq); do docker logout "$domain"; done`
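For readers who prefer a more explicit version, the following Python sketch performs the same cleanup. It is illustrative only and assumes the default credential location and the standard `auths` layout of Docker's `config.json`.

```python
# Illustrative sketch: log out of every cached Astro registry.
# Assumes the default Docker credential store at ~/.docker/config.json.
import json
import pathlib
import subprocess

config_path = pathlib.Path.home() / ".docker" / "config.json"
config = json.loads(config_path.read_text())

# Docker keeps cached registry credentials under the "auths" key.
for registry in sorted(config.get("auths", {})):
    if "registry.astronomer.run" in registry:
        # Drop the stale credentials for this Astro registry.
        subprocess.run(["docker", "logout", registry], check=True)
```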

monitoring

We've implemented a mitigation for this issue and the affected clusters should see successful image pushes. We will continue to monitor for additional errors.

investigating

A small subset of customers have reported 403 errors when running the `astro deploy` command to deploy a new image. We are actively investigating this issue. If you are experiencing these errors, we encourage you to contact support and include the login command you used, your Astro CLI and Docker versions, and any log messages.

Report: "403 Errors for Image Deploys"

Last update
Investigating

A small subset of customers have reported 403 errors when running the astro deploy command to deploy a new image. We are actively investigating this issue. If you are experiencing these errors, we encourage you to contact support and include the login command you used, astro cli and docker versions, and any log messages.

Report: "Identified a configuration issue affecting Runtime 9 which is affecting DAG execution on these deployments"

Last update
resolved

This incident has been resolved.

identified

Fix has been validated and is rolling out to affected deployments.

investigating

We are currently investigating the issue.

Report: "Identified a configuration issue affecting Runtime 9 which is affecting DAG execution on these deployments"

Last update
Investigating

We are currently investigating the issue

Report: "Identified a configuration issue affecting Runtime 9 and below which is affecting DAG execution on these deployments"

Last update
Investigating

We are currently investigating the issue

Report: "Stuck worker pods resulting in tasks failing in the queued state"

Last update
resolved

This incident has been resolved.

investigating

The incident is resolved.

investigating

We are continuing to investigate this issue.

investigating

In some deployments, worker pods are getting stuck in the initialization state for an extended period of time. Due to this, queued tasks are unable to run and fail. This is not affecting all deployments. We are investigating which deployments are affected and why.

Report: "Stuck worker pods resulting in tasks failing in the queued state"

Last update
Investigating

In some deployments, worker pods are getting stuck in the initialization state for an extended period of time. Due to this, queued tasks are unable to run and fail.This is not affecting all deployments. We are investigating which deployments are affected and why.

Report: "Customers will not be able to create new Azure Clusters"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Cluster maintenance

Report: "Customers will not be able to create new Azure Clusters"

Last update
Monitoring

A fix has been implemented and we are monitoring the results.

Investigating

Cluster maintenance

Report: "Cost Breakdown Dashboard data update delayed"

Last update
resolved

This issue is now resolved except for one customer who we have contacted directly.

identified

Deployment costs are now up to date, but compute costs for some customers remain outdated. We are working with our billing vendor to determine the source of the issue.

identified

Data shown in the Organization Dashboards Cost Breakdown (for Enterprise customers) is delayed. As stated on the page itself, the latest data is as of April 4th. The processing to update this dashboard is currently ongoing, and we expect the data to be refreshed at approximately 16:00 UTC.

Report: "Cost Breakdown Dashboard data update delayed"

Last update
Identified

Data shown in the Organization Dashboards Cost Breakdown (for Enterprise customers) is delayed. As stated on the page itself, the latest data is as of April 4th. The processing to update this dashboard is currently ongoing, and we expect the data to be refreshed at approximately 16:00 UTC.

Report: "We are experiencing an issue with new task execution on AWS clusters"

Last update
postmortem

# Overview

Between March 18 and March 26, 2025, Astro experienced a series of related incidents. This write-up serves as the analysis of the full set of incidents, as the events overlap.

On Tuesday, March 18, an Azure outage prevented our control plane components that handle authentication requests from reaching our authentication vendor, Auth0. This lasted for 45 minutes and was resolved without intervention from Astronomer. Because the control plane for Astro is hosted in Azure, all customers were affected, including those that host their data planes in AWS or GCP. During this period, customers were unable to reach the Astro or Airflow UI and were unable to get a response from the Airflow API.

On Friday, March 21, our monitoring systems alerted us to a critical issue affecting the control plane authentication components, again preventing customers from accessing the Astro UI and API. Our investigation revealed the root cause was unusually high volumes of automated authentication requests that overwhelmed the scaling capabilities of our authentication components. While we cannot share specific details about this traffic for security reasons, we can confirm that our security team verified no customer data was compromised or exposed. Our engineering team implemented protective measures against similar authentication scaling issues, including reconfiguring several network settings to improve resilience.

Early Sunday, March 23, we discovered that our network reconfiguration had unintentionally triggered a previously unknown bug in our system.

* Astro uses a component called "Harmony client" to synchronize changes between the control plane (where customers make configuration changes) and the data plane (where customer workloads run).
* During normal operations, when the Harmony client cannot connect to the control plane, it waits a few seconds and retries. However, we discovered a specific bug in how the Harmony client handles connection errors for AWS clusters. When receiving a 404 error, it interpreted this as an empty response signaling that no resources were needed, and inadvertently deleted all worker node pools, the computing resources that run tasks.
* This affected around a dozen customer clusters, disrupting workflow execution and UI access. Most clusters were protected by safety mechanisms that prevented the deletion of node pools, but this also blocked automatic recovery.

The team manually restored node pools to all affected clusters by Sunday afternoon. We developed a fix for the Harmony client bug and implemented enhanced monitoring to detect any similar issues. Given the complexity of these interconnected incidents, we made the decision to implement a temporary change freeze while thoroughly reviewing the fix and our deployment procedures to avoid introducing new issues.

On Wednesday, March 26, our monitoring systems detected a recurrence of the Harmony client issue. Our incident response team immediately engaged and identified the cause: during our troubleshooting process, a change intended for our staging environment had been applied to production through a gap in our change management process. This change caused 404 errors to once again be received by the Harmony client, re-triggering the issue. Given the large number of clusters, our efforts to apply manual fixes were taking too long to address the issue. As we no longer had confidence that it would be safer to wait, we initiated a hotfix release of the patch fixing the root cause in the Harmony client across the fleet.

# Long-Term Preventive Measures

We have corrected the underlying bug in the Harmony client and put initial steps in place to deal with bursts of unusual traffic. In the coming weeks, we plan to put further mitigations in place around automated traffic to further protect the system. We are looking at additional process measures to further separate staging and production access. We have also improved our monitoring to detect similar issues and are discussing further monitoring improvements. Additionally, we will deploy a script for automated recovery of a set of clusters to improve remediation time.

If you have any additional questions or concerns, please contact us at [support@astronomer.io](mailto:support@astronomer.io).
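As an illustration of the defensive pattern described above (and not Astronomer's actual Harmony client code, which is internal), a sync client should treat a 404 from the control plane as a retryable error rather than as an empty desired state:

```python
# Hypothetical sketch of the defensive handling described above; the function
# and endpoint names are illustrative, not Astro internals.
import time
import requests

def fetch_desired_node_pools(url, retries=5, backoff_s=5.0):
    for attempt in range(1, retries + 1):
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200:
            return resp.json()  # the authoritative desired state
        # A 404 (or any other failure) is NOT an empty desired state:
        # back off and retry instead of reconciling toward "delete everything".
        time.sleep(backoff_s * attempt)
    raise RuntimeError(f"control plane unreachable after {retries} attempts")
```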

resolved

This incident has been resolved.

identified

The team has identified the issue and is currently actively working on mitigating it. The issue concerns new task execution on AWS clusters and accessing the UI.

investigating

We are experiencing an issue with new task execution on AWS clusters and with accessing the UI. The team is actively investigating the issue.

Report: "We are experiencing an issue with new task execution on AWS clusters"

Last update
Postmortem
Resolved

This incident has been resolved.

Identified

The team has identified the issue and is currently actively working on mitigating it. The issue concerns new task execution on AWS clusters and accessing the UI.

Investigating

We are experiencing an issue with new task execution on AWS clusters and issue with accessing UI. The team is actively investigating the issue.

Report: "Airflow UI and API Unavailable for a few customers"

Last update
resolved

We have confirmed that no additional clusters were affected beyond those that were initially identified. This incident is fully resolved.

identified

We have identified that this issue is specific to clusters that have custom networking, specifically route tables that require a carve-out for traffic back to Astro's control plane. The public IPs for the control plane were changed, and certain custom networking setups required that the IPs be updated accordingly. We have fixed this for all customers who reported this issue and are checking all clusters to determine if there are any others affected.

investigating

We are experiencing an issue in a few clusters, causing the Airflow UI and API to become unavailable. The team is actively investigating the issue.

Report: "Airflow UI and API Unavailable for a few customers"

Last update
Resolved

We have confirmed that no additional clusters were affected beyond those that were initially identified. This incident is fully resolved.

Identified

We have identified that this issue is specific to clusters that have custom networking, specifically route tables that require a carve-out for traffic back to Astro's control plane. The public IPs for the control plane were changed, and certain custom networking setups required that the IPs be updated accordingly.We have fixed this for all customers who reported this issue and are checking all clusters to determine if there are any others affected.

Investigating

We are experiencing an issue in a few clusters, causing the Airflow UI and API to become unavailable.The team is actively investigating the issue.

Report: "Certain Astro clusters on AWS experiencing downtime"

Last update
resolved

We have determined the event that caused this downtime and we are confident that it will not occur again. We will post a public RCA in the coming week.

monitoring

We have applied a remediation for all of the affected clusters. No clusters are currently experiencing downtime. We are continuing to examine the root cause and will update again when we are confident that the issue will not recur.

identified

We have identified a problem with scaling behavior that is causing a limited number of clusters to experience downtime. The message 'Internal Server Error' is displayed in the UI, preventing access to the Airflow UI and the viewing of DAGs. In some cases this is affecting task execution. We are currently working on a fix.

Report: "Certain Astro clusters on AWS experiencing downtime"

Last update
Resolved

We have determined the event that caused this downtime and we are confident that it will not occur again. We will post a public RCA in the coming week.

Monitoring

We have applied a remediation for all of the affected clusters. No clusters are currently experiencing downtime. We are continuing to examine the root cause and will update again when we are confident that the issue will not recur.

Identified

We have identified a problem with scaling behavior that is causing a limited number of clusters to experience downtime. The message 'Internal Server Error' displays on the UI preventing the viewing of DAGs and the Airflow UI. This is in some cases affecting task execution. We are working on a fix currently.

Report: "'Internal Server Error' when attempting to access Airflow UI"

Last update
resolved

This incident has been resolved

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue. Tasks do not appear to be impacted.

Report: "'Internal Server Error' when attempting to access Airflow UI"

Last update
Resolved

This incident has been resolved

Update

We are continuing to investigate this issue.

Investigating

We are currently investigating this issue. Tasks do not appear to be impacted.

Report: "Astro UI and Astro API not available."

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently experiencing an issue impacting the Astro Control Plane Cluster during routine maintenance activities. Current impact: the Astro UI and Astro API are not available at this moment; however, Airflow tasks will continue to run. Actions being taken: our engineering team is actively monitoring and working to restore services promptly. Next update: we will provide further status updates as more information becomes available. We apologize for the inconvenience and thank you for your patience.

Report: "Astro UI and Astro API not available."

Last update
Resolved

This incident has been resolved.

Update

We are continuing to monitor for any further issues.

Monitoring

A fix has been implemented and we are monitoring the results.

Identified

The issue has been identified and a fix is being implemented.

Investigating

We are currently experiencing an issue impacting the Astro Control Plane Cluster during routine maintenance activities.Current Impact:Astro UI and Astro API are not available at this moment. However, airflow tasks will continue to run.Actions Being Taken:Our engineering team is actively monitoring and working to restore services promptly.Next Update:We will provide further status updates as more information becomes available.We apologize for the inconvenience and thank you for your patience.

Report: "Service Degradation: Delayed or Missing DAG Alerts"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Our engineering team is actively investigating the root cause of this issue. We are working on implementing a long-term fix to restore full functionality. Further updates will be provided as we make progress.

investigating

Affected services: Astro Alerts (DAG success, failure, and SLA miss notifications). Description: we are currently investigating an issue affecting DAG alerts on Astro. Impact: customers may experience delays or failures in receiving DAG alerts.

Report: "Airflow DAGs cannot be triggered via the Astro UI"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and the fix is being implemented

Report: "Airflow incorrectly being reported as being deployed"

Last update
resolved

The issue has been fixed and all deployment information is accurate.

identified

We have identified an issue where Airflow deployments are incorrectly being marked as being deployed. This causes the "Open Airflow" button in the Astro UI to be greyed out. Despite being greyed out, the "Open Airflow" button will still work, and the deployment is still able to be accessed. This does not affect the running Airflow deployments, so no DAGs will be affected.

Report: "Deployment unhealthy after downgrading from Runtime 12.3.0+"

Last update
resolved

We are declaring an end to this incident, as the impact has been determined to be quite narrow. This issue only affects a deployment that was at one point on an Astro Runtime version below 12.3.0, then upgraded to a version in the 12.3.0 - 12.7.1 range, upgraded again within that range, and then rolled back to the earlier 12.3.0 - 12.7.1 version. For example, if a Deployment went from 11.0.0 -> 12.3.0 -> 12.7.1, a rollback to 11.0.0 would be performed as expected, but a rollback to 12.3.0 would fail. We are working on an update to the Astro Data Plane to fix this issue, which we expect to release this month. Our support team will assist if any deployments encounter this bug in the meantime.

monitoring

The issue has been verified as a missed minor database update that does not impact deployment behaviour unless a deployment rollback is applied. A deployment is affected if it was upgraded to Runtime 12.3.0 or later from a Runtime version below 12.3.0. For example:
- Runtime 12.2.0 upgraded to Runtime 12.7.0 (Affected)
- Runtime 12.0.0 upgraded to Runtime 12.3.0 (Affected)
- Runtime 12.4.0 upgraded to Runtime 12.7.0 (Unaffected)
The underlying issue has been fixed, and any upgrades going forward are unaffected. We are currently applying a permanent fix for affected deployments.

investigating

Astronomer has identified an issue with the upgrade to Runtime 12.3.0+ where a database migration is not applied. While the upgrade will complete without errors, the deployment will become unhealthy when attempting to downgrade from Runtime 12.3.0+. We do not recommend upgrading to Runtime 12.3.0+ at this time, as there may be other incompatible interactions with the database.

Report: ".airflowignore file not being respected"

Last update
resolved

The fix is rolled out and confirmed on affected deployments.

identified

A fix has been prepared and is currently being rolled out to affected deployments.

identified

Astronomer has identified an issue with the component responsible for injecting customer code into Airflow environments. The .airflowignore file is not being respected, causing files that should be ignored to be parsed. This issue only affects some deployments, and it does not affect any deployments that have not been updated today.

Report: "Certain Shared Deployments in AWS US West 2 are down"

Last update
resolved

This incident has been resolved.

monitoring

We have restored the Shared Cluster to functionality. We are monitoring for any further disruptions.

investigating

A particular shared cluster in AWS US West 2 is experiencing an issue with its database server that is causing an outage. 20% of shared cluster deployments in this region are affected. We are working to restore operations as quickly as possible. Deployments on Dedicated Clusters and Deployments in other clouds or regions are unaffected.

Report: "Issue with Airflow Deployments on Newly Created Hosted GCP Clusters"

Last update
resolved

We have implemented a mitigation for the issue preventing access to the Airflow UI on newly created hosted GCP clusters.

investigating

Newly created hosted GCP clusters are being created successfully, but users are unable to access the Airflow UI. Our team is actively investigating the issue and will provide updates as we progress.

investigating

We are currently investigating an issue preventing newly created hosted GCP clusters from successfully deploying fully functional Airflow environments. Our team is actively working to identify the root cause and will provide updates as we progress.

Report: "Unable to create new clusters on GCP"

Last update
resolved

This issue has been resolved, and new Hosted Clusters are being created successfully on Google Cloud.

investigating

We are currently investigating an issue preventing new Astro Hosted clusters from being created on Google Cloud.

Report: "Azure Outage causing scaling issues on Azure Clusters"

Last update
resolved

We are no longer seeing issues with node scaling in our Azure clusters

identified

Deployments in the Azure East US2 region are experiencing issues running DAGs and tasks due to an ongoing Azure outage affecting node scale-up. See https://azure.status.microsoft/en-us/status for details.

Report: "Issue with the DAGs page on the Astro UI"

Last update
postmortem

At 23:59 UTC on November 22, Astronomer identified a bug in Astro's implementation of AWS customer-managed egress. This resulted in all Astro clusters on AWS routing all traffic through a single NAT gateway, which was determined to be a reliability issue. Astro API was accidentally downgraded when a hotfix was released to production at 21:14 UTC on November 25; this removed its ability to properly issue tokens, which resulted in the Dag view in the Astro UI (not the Airflow UI) being unable to list Dags. The initial solution to address the Astro API issue resulted in a small number of Astro customers whose clusters had custom IP Allowlists being unable to access the Airflow UI. Custom IP Allowlists were not properly tested against the changes made to address the previous issue, which introduced a regression that broke authentication to the Airflow UI. These issues did not cause any impact to Dag runs or task instances. These bugs were introduced because of a complex process for releasing hotfixes, which the Astronomer engineering team is actively reviewing.

resolved

This incident has been resolved.

monitoring

The issue affecting the DAGs page on the Astro UI has been resolved by our engineering team and is under monitoring. Users can now access the page without any disruptions.

investigating

We have identified an issue affecting the DAGs page on the Astro UI. Our team has escalated this matter to our engineering team for further investigation. Please note that this issue does not impact the execution of running DAGs, and users can still view their DAGs through the Airflow UI.

Report: "Creation of GCP-based Astro clusters is currently failing due to an issue with CPU quotas"

Last update
resolved

GCP support has increased the quota, allowing us to create GCP clusters on Astro without issues. This issue has been resolved.

investigating

We are currently investigating this issue.

Report: "Airflow UI inaccessible with the error Internal Server Error"

Last update
resolved

Auth0 has fixed the issue on their end (https://status.auth0.com/), and the Astro Airflow UI is now accessible.

identified

We have identified an issue with Auth0, and our team is actively working on implementing a fix.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Astro deployment schedulers crash looping due to incorrect airflow local settings config"

Last update
postmortem

On August 14 at 2:18 PM UTC, Astronomer's internal monitoring system began receiving sporadic, seemingly disparate alerts indicating that the scheduler component in some Astro Deployments was unhealthy. At 3:40 PM UTC, once it became clear that these alerts were related and impacting Astro users, Astronomer initiated its incident response process. At 3:44 PM UTC, the root cause of the incident was identified, and by 4:02 PM UTC, the incident was resolved.

The incident was caused by a code change to the [airflow_local_settings.py](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html#configuring-local-settings) file that is managed by Astronomer to implement cluster policies that usually make Airflow more reliable. For example, one such cluster policy prevents KubernetesPodOperator pods or KubernetesExecutor pods created by users from attempting to consume more resources than are available in the user's Deployment. This code change introduced a bug that impacted users running an Astro Runtime with a Python version lower than 3.10. The bug was caused by the use of Python's [platform.freedesktop_os_release()](https://docs.python.org/3/library/platform.html#platform.freedesktop_os_release), which was introduced in Python 3.10 and is unavailable in earlier versions.

To prevent this from happening again, Astronomer is now testing against all Python versions that are supported by Astro Runtime. In order to respond more quickly to potentially systematic issues impacting scheduler availability, Astronomer will raise the priority of alerts indicating schedulers are unhealthy so that the on-call support engineer is paged.
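As an illustration of the class of fix involved (not Astronomer's actual cluster-policy code), guarding a 3.10-only call keeps the same settings file importable on older interpreters:

```python
# Illustrative sketch: platform.freedesktop_os_release() only exists on
# Python 3.10+, so code that must also run on older interpreters needs an
# explicit guard and fallback.
import platform
import sys

def os_release_info():
    """Return /etc/os-release data where available, or an empty dict."""
    if sys.version_info >= (3, 10):
        try:
            return platform.freedesktop_os_release()
        except OSError:
            # No os-release file on this system.
            return {}
    # Older Pythons: skip the optional lookup rather than failing at import time.
    return {}
```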

resolved

This incident has been resolved.

monitoring

Correction, the affected Runtime images are those using versions of Python *less than* 3.10. We have implemented a fix and are monitoring the results now.

identified

We have identified the issue, which only affects deployments using Astro Runtime with a Python version below 3.10. We are working on releasing a hotfix now.

investigating

Astronomer is investigating an issue with Astro deployment scheduler pods entering a CrashLoop state due to an error in the airflow local settings file. We're investigating the root cause and will update this page as we have more information.

Report: "cloud.astronomer.io was not loading properly for all users"

Last update
resolved

From 14:52 to 15:09 UTC, cloud.astronomer.io was inaccessible. DAGs and tasks continued to run, but the UI was not accessible, nor were deploys from the CLI or API possible.

Report: "Astro does not respect .airflowignore when deploying to Astro with Astro CLI 1.28.0"

Last update
resolved

This incident has been resolved. Please upgrade to Astro CLI version 1.28.1.

identified

The issue has been identified and a fix is being implemented.

Report: "All KubernetesExecutor tasks failing for some customers using DAG-only deploy"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Update Deployment feature impacted on Hybrid GCP Clusters"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently experiencing an issue where customers hosted on Hybrid GCP Clusters are unable to update their deployments. Our engineering team is actively investigating the cause of this problem and working to resolve it as quickly as possible. Please note that existing DAGs are running fine; however, updates to deployment resources (variables, worker queues, etc.) are blocked.

Report: "Astronomer is currently investigating an issue preventing customers from deploying to Astro"

Last update
resolved

This incident has been resolved.

monitoring

The issue has been mitigated, and we are continuing to monitor.

investigating

We are currently investigating this issue.

Report: "Airflow API 500 Errors"

Last update
resolved

The incident has been resolved. During this time, DAGs that were making API calls to Airflow deployments may have failed; those tasks can now be rerun.

investigating

When accessing the Airflow API programmatically, requests may fail and return an HTTP 500 status code. We are investigating the cause of this issue. Access through the UI is unaffected and all DAGs will continue to run as normal.

Report: "GCP outage causing scaling issues for GCP clusters"

Last update
resolved

The service disruption caused by the Google Cloud Platform (GCP) outage has been resolved. All affected services have been restored to normal operation. Please refer to the GCP status page for more details on the incident. https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre#RP1d9aZLNFZEJmTBk8e1

investigating

A fairly widespread GCP outage is preventing new Astro nodes in GCP from pulling images. This should not affect existing nodes and should not affect running DAGs unless they need to scale. See https://status.cloud.google.com/incidents/xVSEV3kVaJBmS7SZbnre#RP1d9aZLNFZEJmTBk8e1 for details.

Report: "Authentication Errors"

Last update
resolved

This incident has been resolved.

monitoring

We have received notification from our authentication provider that this is now resolved. If you are still unable to log in to Astro, please refresh your browser and try again.

identified

We are experiencing a failure with our upstream authentication provider that is causing users to receive a 404 error when trying to log in to the Astro platform. We are currently working with the provider to resolve the issue. This issue does not prevent DAGs from running.

Report: "Worker Nodes Not Spinning Up in GCP Dataplane Clusters"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and the fix is being implemented.

investigating

Incident description: some worker nodes within several GCP data plane clusters are failing to spin up as expected, causing delays in task execution and potentially leading to DAGs/tasks getting stuck in the queued state or failing. Current status: we have pinpointed and confirmed the issue, and our engineering team is actively working to resolve it. Impact: delays in task execution within affected clusters, with a risk of DAGs/tasks getting stuck in the queued state or failing due to the inability to spin up worker nodes. Resolution: our engineering team is working diligently to implement a fix. Communication: regular updates will be provided to keep you informed of any developments. We apologize for any inconvenience this may cause and appreciate your patience as we work to resolve this issue promptly.

Report: "Deployment metrics sometimes failing to load"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Listing or viewing some deployments will display "metrics failed to load" instead of showing Dag Run and Task Instance summaries. Actual DAG Runs and tasks are continuing to execute correctly, and the Airflow UI is still accessible. We have identified the problem and are working on deploying a fix.

Report: "Quay.io image registry is having an outage"

Last update
resolved

This incident has been resolved.

monitoring

Quay.io appears to be back up

investigating

This outage affects:
* New cluster creation
* CI/CD pipelines which pull public images (e.g. Astro Runtime) from Quay
* Provisioning new worker pods and nodes (scale-up) for some clusters

Report: "Astronomer Cloud UI and API Unavailable"

Last update
resolved

We have identified the issue and a mitigation was applied. Services have resumed healthy operation. This issue is now resolved.

investigating

We are currently investigating an issue with the Astronomer Cloud UI and API. Please stand by for further updates.

Report: "Astro CLI versions <= 1.22 are unable to successfully execute some commands"

Last update
resolved

The fix has been made and the issue is now resolved.

identified

Upgrading the Astro CLI to 1.24.1 is known to fix this issue. A change to backend systems broke some functionality of the Astro CLI, including the ability to deploy code to Astro. We've identified the issue and are working to implement a fix.

Report: "Unable to update deployment from Astro UI"

Last update
resolved

The issue has been resolved!

monitoring

The issue has been fixed and deployment updates are now working from the UI. We will continue to monitor.

identified

Updates to non-development deployments via the Astro UI may be declined with an invalid request error. A fix is being worked on.

Report: "Intermittent Network and Scheduling Issues in AWS us-west-2 Region for Airflow Deployments"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Airflow deployments in the AWS US-West-2 region may encounter occasional network and scheduling disruptions. The affected cluster has been cordoned off to mitigate the impact on new implementations. Our team is actively investigating the issues within this cluster.

Report: "Intermittent Network and Scheduling Issues in AWS us-west-2 Region for Airflow Deployments"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Airflow deployments in the AWS US-West-2 region may encounter occasional network and scheduling disruptions. The affected cluster has been cordoned off to mitigate the impact on new implementations. Our team is actively investigating the issues within this cluster.

Report: "Hybrid customers are unable to create or update deployments"

Last update
postmortem

**The Problem**

On Wednesday, January 17th, 2024, some customers were unable to create or apply changes to Airflow Deployments running on Astro Hybrid clusters, which included deploying image or DAG updates to these Deployments. This outage spanned from 22:20 to 23:25 UTC, for a total of 1 hour and 5 minutes. The outage impacted a total of 720 individual HTTP requests to Astro API, the main API powering all user interactions with Astro. During this time period, Airflow tasks continued to run and were unaffected.

Ultimately, the outage was caused by an incompatibility between two services that power the Astro control plane: Astro API and an internal service named Harmony. Astro API is our public-facing API that all user interactions flow through, which includes creating, updating and deleting Airflow deployments. Astro API requests that create, update or delete an Airflow deployment then flow to Harmony, which generates and synchronizes Kubernetes manifest files that describe the desired state of a given Astro data plane. Prior to the release that caused this outage, we were running two distinct versions of Harmony. One version managed the Hybrid data planes, and one version managed the Hosted data planes. When the release that triggered this outage was deployed to production, Astro Hybrid deployment requests from Astro API to Harmony began to fail due to an incompatibility with the version of the Harmony service that was running. Astro Hosted deployments were unaffected.

**Root Cause**

This outage was caused by an incompatibility between our external Astro API service and our internal Harmony service. While our Astro API service is rolled out via an automated deployment process, our Harmony service is upgraded differently, using a canary rollout mechanism. We use this canary system to gradually roll out changes to the data planes under management. At the moment, progressing through the canary rollout requires a human in the loop. There were two main contributing factors that led to this outage. First, the changes we deployed to our Astro API were not backward compatible with the currently running version of our Harmony service. Regardless of our canary rollout process, the changes made here should have been able to work with the older _and_ newer version of Harmony being deployed. Second, the Harmony rollout process is completely automated in our Development and Staging environments, which creates a difference between the lower environments and our Production environment. Our Production environment _currently_ requires human intervention to operate the canary rollout process. While investigating this issue, we found that the issue actually _did_ occur in our lower environments, but it was automatically resolved within minutes due to our automated data plane upgrade process. This led to the issue not being discovered until it was released in our Production environment.

**What We're Doing to Prevent this from Recurring**

At the lowest level, this issue surfaced due to an API incompatibility. This particular change happened as we were actually simplifying our system. As mentioned above, prior to this release, we were running two versions of our Harmony service, one for Hybrid and one for Hosted. Unfortunately, as we rolled out the change to consolidate this to a single service, we mistakenly hadn't made the Astro API change backward compatible with the previous version of Harmony. Once this outage had been mitigated, our system was actually simpler than it was prior to the outage. We now send both Hosted and Hybrid requests to a single Harmony service, and no longer need to manage the complexity of two different versions. Now that we've reduced the surface area of the Harmony service, we are working to harden the service with an OpenAPI specification, allowing our consuming services to share code and take advantage of an auto-generated client. This will help prevent certain types of bugs in this communication path in the future. Beyond preventing code-level bugs, we are also working to get our Staging and Production environments more aligned regarding the Harmony service's canary rollout system. As mentioned, if our Staging environment's data planes were subject to our canary rollout over a longer period of time like Production, we would have caught this issue much earlier in the process. At a higher level, we are working to build a much more robust system to roll out upgrades to our data planes. Instead of the entire Hybrid data plane fleet being potentially exposed to this issue, we could have verified the changes in a smaller subset before rolling the changes out any further.

resolved

Creating and updating deployments will result in failure which could manifest as Internal server errors (status code 500).

Report: "Astro Cloud UI "No Healthy Upstream""

Last update
postmortem

**Problem**

Early on the morning of January 22nd, customers faced issues accessing the Cloud UI and performing certain operations with the Astro CLI. This was an intermittent outage which spanned from 3:25 AM to 4:07 AM PST. It resulted in unavailability of the Cloud UI for some customers, prevented Astro CLI deploy commands from running successfully, and prevented successful node scale-ups in Astro data planes. We don't believe that this outage caused any tasks to fail, but it might have slowed task scheduling for some customers.

* For Deployments using the Celery executor on Astro Hybrid, workers could not properly scale up during the outage periods.
* For Deployments using the Kubernetes executor, or Deployments using the Celery executor on Astro Hosted, new worker pods could start, but new nodes could not be added, so autoscaling was limited to the nodes available at the start of the outage.

The component that caused the intermittent outage is called the _Astro API_, which is a critical service in the control plane that mediates most actions in Astro. However, because Airflow itself doesn't use this component, losing the Astro API prevents most Astro functions from working but doesn't directly impact the operation of running Airflow Deployments. The problem was triggered when an end user configured the "Linked to all Deployments" option in the Airflow connection management menu for their Astro Workspace. This end user action resulted in a segmentation fault that bypassed middleware designed to trap and recover from segmentation faults, killing the Astro API container and taking down one of the running replicas. Because a typical user action is to retry the operation upon failure, it's possible that the end user kept retrying this while the back-end system was bringing up new Astro API containers to replace the container which had just crashed, thereby triggering a degradation of service and errors for users trying to access the service at the same time. One of the customers that encountered an error reported the problem to Astronomer Support almost simultaneously with our internal alert which detected the Astro API containers crashing. This was immediately escalated by the Support team to Engineering via our Incident Management process, and a mitigation was put in place within 40 minutes of the escalation.

**Root Cause**

We introduced code in the release we rolled out on January 17 that has now been determined to be thread unsafe. Ironically, this code was written to reduce duplication in our code, make it modular and testable, and reduce the risk of introducing bugs. We deployed logic in a specific code path that would read from the database using transactions. However, in golang (and the ORM library we use), the concurrent transaction reads were actually using different connections and hence thread unsafe. This led to the database calls not being able to retrieve anything but not returning any errors either, which in turn led to a nil pointer dereference panic on Astro API pods, causing them to restart. We also now know the user scenario that executes this thread-unsafe code:

1. A user goes to a Workspace with more than 1 Deployment in it.
2. The user opens the Environment tab for that Workspace.
3. The user switches the "Linked to all Deployments" option from true to false OR from false to true, OR adds/updates a connection.

We believe that switching the "Linked to all Deployments" option in the UI is such a spammable operation that a single user trying it a few times could have led the individual pods handling those requests in the backend to panic and crash, thereby resulting in a degradation of the service. The initial mitigation fix was rolled out within 40 minutes so that, instead of panicking in those situations, Astro API would return HTTP 500 errors. The individual problem requests still failed but did not cause the pods to restart. We also released a medium-term fix on Jan 24, wherein we replaced the concurrent DB read with serial (less optimal) reads that loop through all Deployments in the Workspace instead. This change has been validated to have fixed the problem, which means no more panics nor 500 errors for users changing the auto-link option.

**What We're Doing to Prevent this from Recurring**

We are focusing on what can be done to **improve the robustness of the Astro API** given its key role in the Astro control plane, as well as how to have a **faster response** and **improved monitoring** for issues that arise. The first alert that was raised for this incident was a pod restart alert for the Astro API pods. This alert, however, was not tuned correctly in two ways. Firstly, it auto-resolved because of an incorrect setting. Secondly, and more critically, receiving this alert does not always mean that there is customer impact. Because of this and the auto-resolution, the engineer who was paged mistakenly believed that this issue was momentary and would not have customer impact. To prevent this, we are adding two new alerts based on the dashboards we used to determine the amount of impact during the outage. These alerts will measure both the real customer-generated traffic to the Astro API and our own synthetic traffic (in case the real traffic is blocked by an ingress issue) to raise high-priority alarms to both our Support and Engineering teams if the Astro API drops below very high levels of consistency. Because these alerts will always indicate an important customer-visible problem and go to multiple teams within Astronomer, we are confident that they will be acted on with the appropriate urgency.

Our postmortem review also revealed that with a more robust Astro API pod setup, even issues of this magnitude would have much less customer impact. The problem was that the minimum size of the Astro API autoscaling group is too small, and it was at that minimum size during the outage. We currently run a small number of large pods, and we are now looking to have a significantly larger number of smaller pods instead. With a larger number of smaller pods, the panics would have been less likely to crash all of the pods at once. There are some nuances to work out about how to size and manage database connections with a larger number of pods, so this change will not be rolled out until we can be sure it will not have other unintended negative consequences.

We have analyzed the specific bug that triggered this outage, and we don't believe that we could have reasonably implemented a regression test for the behavior that would have detected this. Without the deep understanding that the database connections would later be made concurrent and accidentally thread-unsafe, we would not have been able to predict which tests would be required. We also evaluated the feasibility of enabling a quick rollback so that we would not have to determine the full root cause and fix to resolve the outage. However, although we are capable of doing a rollback in our deployment model, it would not have been feasible to do this. This is because the change that included the bug was committed over a week before the outage and deployed five days before the outage. We have a weekly release cadence in the control plane, but even if we went to smaller and more frequent releases, this change did not cause a problem immediately; it had to be triggered by a specific user action. Because other updates since this change involve database schema updates and other infrastructure changes, it is not obviously safe to roll back to before the bug was deployed.

resolved

The incident has been resolved; all systems are operational.

monitoring

The issue was identified, and a fix has been applied. We are currently monitoring the deployments.

investigating

The impact has been reassessed as major to expedite mitigation.

investigating

We are currently investigating this issue.

Report: "Astro Analytics - Degraded Performance"

Last update
resolved

This incident has been resolved. Astronomer builds metrics in part by using a logging tool. The performance of the logging tool's indexer was adversely impacted by an increase in scheduled queries, which overwhelmed the logging tool, resulting in a backup of queries, which in turn impacted the monitors in the Astro UI. After optimizing scheduled queries, performance returned to normal.

investigating

Our team is currently investigating the degraded performance of the Astro Analytics service.

Report: "New worker pods in Azure AKS clusters unable to start"

Last update
resolved

This incident has been resolved.

monitoring

The issue has been identified and we are beginning to update the affected clusters. Worker pods that were stuck in Pending state are spinning up now.

investigating

We are aware of an issue with Azure and are currently investigating it. Pods older than 1:30 PM CST (0630 UTC) are not affected.

Report: "Monitoring service in Astro Standard Clusters experiencing issues"

Last update
resolved

This incident has been resolved.

monitoring

The hotfix has been deployed to production, and the affected clusters are being bootstrapped with it. We are monitoring the results of the fix.

identified

The hotfix has been released to staging; we are validating the results and will proceed with the release to production following that validation.

identified

The issue has been identified and a hotfix is being created and rolled out.

investigating

We are currently investigating an issue in the monitoring service Astronomer uses to monitor Astro Standard Clusters.

Report: "Modifying environment variables from the Astro UI may delete the values for other environment variables marked as "Secret""

Last update
postmortem

Astronomer is undergoing a migration from a legacy internal API to the recently released [Astro API](https://docs.astronomer.io/astro/api/overview), which is more resilient and performant. The two APIs handle [falsy](https://developer.mozilla.org/en-US/docs/Glossary/Falsy) values differently, and this introduced a bug in Astro API's updating of environment variables, resulting in the values of environment variables marked "Secret" being functionally deleted when environment variables were modified. When environment variables on Astro are saved, they are saved together, so updating any environment variable resulted in the deletion of values for all environment variables marked "Secret". Beyond implementing the fix to resolve this issue, we are implementing the following remediations to prevent something similar from happening again:
* Review functional testing for secret environment variables
* Implement feature flags to enable staged rollouts for future API migration tasks
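For illustration only (the helper below is hypothetical, not the Astro API), the safe way to merge a save payload is to treat a falsy placeholder for a secret as "unchanged" rather than as a new value:

```python
# Hypothetical sketch: secrets are never echoed back to the client, so an
# update payload carries a falsy placeholder ("" or None) for each secret.
# Treating that placeholder as the new value wipes the secret; a safe merge
# keeps the stored value unless a non-empty replacement is submitted.
def merge_env_vars(stored, submitted, secret_keys):
    merged = dict(stored)
    for key, value in submitted.items():
        if key in secret_keys and not value:
            continue  # falsy placeholder: preserve the existing secret value
        merged[key] = value
    return merged

# Updating a plain variable must not clear the secret's stored value.
stored = {"API_URL": "https://old.example", "DB_PASSWORD": "s3cret"}
submitted = {"API_URL": "https://new.example", "DB_PASSWORD": ""}
assert merge_env_vars(stored, submitted, {"DB_PASSWORD"})["DB_PASSWORD"] == "s3cret"
```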

resolved

This incident has been resolved.

investigating

We are currently investigating this issue. If you rely on setting environment variables in the Astro UI, please refrain from updating environment variables at this time.

Report: "Bug in AstroAPI endpoint call deleting connections from Astro Environment"

Last update
resolved

This incident has been resolved.

identified

A bug has been identified in the Managed Connections of Astro Hosted Environments that deletes existing connections. A fix has been made and is being deployed.

Report: "Hybrid customers Unable to view Teams"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Quay.io outage causing new pods to be stuck in Pending waiting to download container images"

Last update
resolved

This incident has been resolved.

monitoring

Quay.io has indicated that the fix is complete and that pushes and pulls are operating correctly. New Airflow services that come up are operational. We are continuing to monitor the situation, as Quay.io has not yet marked their incident as resolved.

identified

This issue is ongoing. We have observed some instances where images are able to be pulled, but we're continuing to observe widespread image pull issues. We will update as more information becomes available.

identified

Quay has indicated that they are continuing to experience instability and are moving their image repo to read-only mode, which will affect image push operations.

identified

Quay.io, the container image repository used by Astronomer, is experiencing issues with image pull failures. Quay.io incident: https://status.quay.io/incidents/z7sbjqmb34p1 We will continue monitoring the situation and update this incident as more information becomes available. Existing pods should be unaffected and will continue executing tasks.

Report: "Tasks from deployments with KubernetesExecutor are unable to execute"

Last update
postmortem

From Nov 8, 16:09Z until the issue was resolved on Nov 8, 20:21Z, Astro Hybrid Deployments using Kubernetes Executor with DAG-Only Deploy that were updated by customers were unable to start new Airflow workers. In total, there were 4 such Deployments across Astro.

As part of releasing Deploy Rollbacks, the location of the URL from which the Deployment checks for new DAGs was changed to be stored locally within the Astro Data Plane instead of requiring an API call to the Control Plane. This design is more robust to intermittent network issues, and we continue to believe this change was conceptually correct. However, Kubernetes worker pods were not updated with correct access to the new location of the URL. This prevented the worker from starting up, as retrieving this URL and then using it to download the DAGs is a critical step in initializing new workers. Once the issue was understood, an update was rolled out to all Hybrid clusters to properly give access to the new location.

This issue clearly should have been caught in testing. We reviewed our testing procedure and found a gap. On Celery Executor, a deployment procedure's impact on starting workers can be tested without running any DAGs, because Celery Executor starts a worker even when no tasks are running (unless scale to zero is on, but we disable that setting for these particular tests). We were using the same test procedure for Kubernetes Executor, but because no DAGs were running, no workers were started, and thus no errors were raised. We have now adjusted the Kubernetes Executor test suite to include actually running a DAG after testing a new deployment procedure, thus ensuring a much more realistic test.

Our existing alerting did detect these issues as they started happening to customers. We are nevertheless going to invest in more targeted alerting around the DAG Download process.

resolved

This incident has been resolved.

monitoring

Further correction: the only affected deployments are those on Astro Hybrid that have been updated in the last day and have DAG-only deploy enabled.

monitoring

Update - We have identified that the only affected deployments are those on Astro Hybrid that have been updated in the last day. We are continuing to monitor the results of the fix.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We have identified that the only affected deployments are those on Astro Hybrid that have been updated in the last day.

identified

Due to an issue with Kubernetes Airflow worker pods being unable to download DAGs, these pods cannot initialize, leaving task instances stuck in the queued state. This only affects Airflow deployments using KubernetesExecutor.

Report: "Astro CLI cannot access Airflow variables and connections"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

When using the Astro CLI to access deployment variables and connections, you may receive the error "failed to decode response from API". If you receive this message while modifying variables or connections, the modifications have not taken effect. Examples of CLI commands that may fail:
- astro deployment airflow-variable list
- astro deployment connection list
- astro deployment connection update
Our team has identified the issue and is releasing a fix.