Fly.io

Is Fly.io Down Right Now? Check whether there is an ongoing outage.

Fly.io is currently Operational

Last checked from Fly.io's official status page

Historical record of incidents for Fly.io

Report: "IPv6 Connectivity Loss in GDL"

Last update
identified

We have experienced a temporary loss of IPv6 connectivity in Guadalajara, Mexico (GDL) and are currently working with our upstream providers to resolve the issue. IPv4 connectivity is currently unaffected.

Report: "Network issues in LHR"

Last update
investigating

We are observing network issues in the LHR region. Apps continue to run but may experience network issues, and deploying or updating apps may fail.

Report: "Network maintenance in GRU (São Paulo, Brazil)"

Last update
Scheduled

An upstream provider is performing network maintenance in GRU, from 2025-05-30 at 12:00 UTC (9:00am BRT local time) to 14:00 UTC (11:00am BRT local time). You may experience a short total loss of connectivity for up to 5 minutes within the scheduled maintenance window hours.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Report: "Network maintenance in LHR"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

An upstream provider is performing network maintenance on a subset of our servers in LHR, from 2025-05-29 at 23:00 UTC to 2025-05-30 at 03:00 UTC. You may experience network connectivity disruptions for some time within the maintenance window.

Report: "Network maintenance in CDG (Paris, France)"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

An upstream provider is performing network maintenance in CDG, from 2025-05-29 at 22:00 UTC (2025-05-30 12:00am CEST local time) to 2025-05-30 at 02:00 UTC (4:00am CEST local time). You may experience a short total loss of connectivity for up to 5 minutes within the scheduled maintenance window hours.

Report: "Burst of network related alerts from some servers in LHR"

Last update
resolved

This incident has been resolved.

monitoring

Alerts appear to be related to a network blip caused by an upstream provider's router failover, with no ongoing disruption.

investigating

We've received a flood of networking related alerts from a subset of servers running in LHR. We are not yet sure of the impact on customer workloads.

Report: "Burst of network related alerts from some servers in LHR"

Last update
Monitoring

Alerts appear to be related to a network blip caused by an upstream provider's router failover, with no ongoing disruption.

Investigating

We've received a flood of networking related alerts from a subset of servers running in LHR. We are not yet sure of the impact on customer workloads.

Report: "WireGuard gateway issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are investigating issues with WireGuard over websockets (the default connection mode in flyctl). `flyctl ssh`, `flyctl proxy`, `flyctl logs` commands as well as others may fail. If you are on a network that allows UDP connections, running `fly wg websockets disable` may fix the issue as a workaround.
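
For customers hitting this, here is a minimal sketch of the UDP workaround described above (assuming flyctl is installed and your network allows outbound UDP; the app name is a placeholder, and the `enable` subcommand to switch back is assumed to mirror the `disable` form shown in the update):

```
# Switch flyctl's WireGuard tunnel from websockets mode back to plain UDP
fly wg websockets disable

# Retry the affected command, e.g. an SSH session into your app
fly ssh console -a my-app

# Optionally switch back to websockets mode once the incident is resolved
fly wg websockets enable
```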

Report: "WireGuard gateway issues"

Last update
Investigating

We are investigating issues with WireGuard over websockets (the default connection mode in flyctl).`flyctl ssh`, `flyctl proxy`, `flyctl logs` commands as well as others may fail.If you are on a network that allows UDP connections, running `fly wg websockets disable` may fix the issue as a workaround.

Report: "Production database is being migrated"

Last update
resolved

This incident has been resolved.

monitoring

The issue has been resolved

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented, and we're monitoring the results. API performance should be back to normal, although app creates may still be degraded.

identified

We're continuing to work on fully restoring the Machines API. API calls are still taking longer than usual but we're no longer seeing failures.

identified

We are continuing to work on a fix for this issue.

identified

We identified an issue while migrating our production traffic, and have applied a fix to restore dashboard functionality. We're continuing to work on fully restoring the Machines API.

investigating

The Fly Dashboard is also affected, and certain dashboard functionality, like the support portal, may be unavailable. If you're on a paid support plan, please submit tickets using your support email address in the meantime.

investigating

We’re migrating production traffic over to a new production database. GraphQL queries, including flyctl commands, may be slow.

Report: "Production database is being migrated"

Last update
Investigating

We’re migrating production traffic over to a new production database. GraphQL queries, including flyctl commands, may be slow.

Report: "Network maintenance in AMS (Amsterdam, The Netherlands)"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

An upstream provider is performing network maintenance in AMS, from 2025-05-19 22:30 UTC (00:30 local time) to 2025-05-20 04:00 UTC (06:00 local time). No operational impact is expected.

Report: "Network maintenance in BOG (Bogotá, Colombia)"

Last update
Scheduled

An upstream provider is performing network maintenance in BOG on 2025-05-17, from 11:00 UTC (06:00am local time) to 15:00 UTC (10:00am local time). No operational impact is expected.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Report: "Machines API degraded performance"

Last update
resolved

We identified the problem and deployed a fix.

investigating

We're investigating degraded performance with the Machines API metadata update endpoint.

Report: "Machines API degraded performance"

Last update
Investigating

We're investigating degraded performance with the Machines API metadata update endpoint.

Report: "Network issues in NRT/HKG"

Last update
resolved

This incident has been resolved.

investigating

Machines API requests (including `fly deploy` or `fly machines` commands) may occasionally fail when trying to create/update machines in NRT or HKG regions. We are investigating.

investigating

An upstream provider is investigating a network issue in NRT and HKG regions. Apps continue to run, but requests may occasionally fail.

Report: "Network issues in NRT/HKG"

Last update
Investigating

An upstream provider is investigating a network issue in NRT and HKG regions. Apps continue to run, but requests may occasionally fail.

Report: "Network maintenance in SEA (Seattle, Washington, USA)"

Last update
Scheduled

An upstream provider is performing critical network maintenance in SEA, from 14:00 UTC (07:00am PDT local time) to 16:00 UTC (09:00am PDT local time). You may experience a short total loss of connectivity for up to 15 minutes within the scheduled maintenance window hours.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Report: "New MPG clusters cannot be provisioned in FRA"

Last update
resolved

This incident has been resolved, all operations in FRA are working as expected.

monitoring

A fix has been implemented and we are seeing MPG creations in FRA succeed again. The MPG tab of the fly.io dashboard is working again for users with clusters in FRA.

identified

New MPG cluster creations in Frankfurt (FRA) region are currently failing. Cluster creation in other MPG regions is working as normal. We are working to restore FRA cluster creation. Existing, running database clusters in FRA are not impacted and continue to work as normal. However the MPG page in the Fly.io dashboard may not load for users with clusters in FRA.

Report: "Errors (5xx, timeouts) in Fly.io dashboard"

Last update
resolved

This incident is resolved, Dashboard, API and CLI operations should be working normally now.

monitoring

We continue to monitor the deployed fix. Dashboard and API/CLI operations should be functional now.

identified

We have identified the troublesome component and a fix has been rolled out. We are monitoring the results and may need to perform further updates to fully stabilize things.

investigating

Our metrics and user reports show that the Fly.io dashboard and portions of the API backend are timing out or returning 5xx errors. All operations in the Fly dashboard and most operations using the fly CLI will fail or time out at this point. Currently-running machines or workloads should not be affected.

Report: "New MPG clusters cannot be provisioned in FRA"

Last update
Identified

New MPG cluster creations in Frankfurt (FRA) region are currently failing. Cluster creation in other MPG regions is working as normal. We are working to restore FRA cluster creation.Existing, running database clusters in FRA are not impacted and continue to work as normal. However the MPG page in the Fly.io dashboard may not load for users with clusters in FRA.

Report: "Errors (5xx, timeouts) in Fly.io dashboard"

Last update
Investigating

Our metrics and user reports show Fly.io/dashboard and portions of the API backend are timing out or returning 5xx errors. All operations in the Fly dashboard and most operations using fly CLI will fail or timeout at this point.Currently-running machines or workloads should not be affected.

Report: "Depot builders experiencing issues"

Last update
resolved

From roughly 11:00AM Pacific to 3:00PM Pacific, Depot builders were unable to complete deploys (https://status.depot.dev/cmafni8la004z9pwuozks8vwx). During this time, deploys defaulted back to our legacy Fly builders, and users may have seen slower-than-usual deploys depending on the size of the build. This has been resolved, and deploys are now defaulting to Depot builders again.

Report: "Depot builders experiencing issues"

Last update
Resolved

From roughly 11:00AM Pacific to 3:00PM Pacific, Depot builders were unable to complete deploys (https://status.depot.dev/cmafni8la004z9pwuozks8vwx). During this time, deploys defaulted back to our legacy Fly builders, and users may have seen slower-than-usual deploys depending on the size of the build.This has been resolved, and deploys are now defaulting to Depot builders again.

Report: "IAD Managed Postgres control plane unavailability"

Last update
resolved

This incident has been resolved.

investigating

We are investigating intermittent unavailability of the Managed Postgres control plane in IAD region. Database clusters continue to run.

Report: "IAD Managed Postgres control plane unavailability"

Last update
Investigating

We are investigating intermittent unavailability of the Managed Postgres control plane in IAD region. Database clusters continue to run.

Report: "Some or all *.fly.dev subdomains are currently returning NXDOMAIN errors in IAD"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Applications may be inaccessible via DNS.

Report: "WireGuard connectivity into CDG is unavailable"

Last update
resolved

We have re-enabled the CDG gateway for flyctl.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Inbound WireGuard connections to our CDG gateways are currently unavailable due to an upstream networking issue. Any static peers configured in CDG will be unavailable until this is resolved.

Report: "Loss of connectivity in IAD"

Last update
resolved

We experienced an outage with one of our upstream transit providers in IAD for around 10 minutes. Traffic has been re-routed to alternate paths and connectivity should be back to normal.

Report: "Loss of connectivity in IAD"

Last update
Resolved

We are experienced an outage with one of our upstream transit providers in IAD for around 10 minutes. Traffic has been re-routed to alternate paths and connectivity should be back to normal.

Report: "Some or all *.fly.dev subdomains are currently returning NXDOMAIN errors in IAD"

Last update
Investigating

applications may be inaccessible via DNS.

Report: "WireGuard connectivity into CDG is unavailable"

Last update
Identified

Inbound wireguard connections to our CDG gateways is currently unavailable due to an upstream networking issue. Any static peers configured in CDG will be unavailable until this is resolved.

Report: "Upstream network outage in MAD"

Last update
resolved

This incident has been resolved.

monitoring

Power has been brought back online for the region. We're closely monitoring for any further complications.

identified

Our edges in Madrid, Spain are currently affected by an upstream outage caused by ongoing power issues in the region. Regional and static egress IPs may be temporarily unavailable. Access via Anycast IPs is currently unaffected. We are working with our upstream to resolve this situation.

Report: "Upstream network outage in MAD"

Last update
Resolved

This incident has been resolved.

Monitoring

Power has been brought back online for the region. We're closely monitoring for any further complications.

Identified

Our edges in Madrid, Spain are currently affected by an upstream outage caused by ongoing power issues in the region. Regional and static egress IPs may be temporarily unavailable. Access via Anycast IPs is currently unaffected. We are working with our upstream to resolve this situation.

Report: "Fly.io dashboard down"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Fly.io dashboard down"

Last update
Resolved

This incident has been resolved.

Monitoring

A fix has been implemented and we are monitoring the results.

Update

We are continuing to investigate this issue.

Investigating

We are currently investigating this issue.

Report: "Network performance issues in ORD"

Last update
resolved

This incident has been resolved. The issues impacting performance on the affected routes do not seem to have been caused by issues within our network infrastructure.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

Some network paths in a single region (ORD) are slightly slower than expected. You may experience lower network performance for requests in ORD.

Report: "Network performance issues in ORD"

Last update
Resolved

This incident has been resolved. The issues impacting performance on the affected routes do not seem to have been caused by issues within our network infrastructure.

Update

We are continuing to investigate this issue.

Update

We are continuing to investigate this issue.

Investigating

Some network paths in a single region (ORD) are slightly slower than expected. You may experience lower network performance for requests in ORD.

Report: "Degraded performance"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We're investigating degraded performance on our web dashboard and GraphQL API. You may notice slower responses as well as occasional 500 errors at this time.

Report: "Degraded performance"

Last update
Resolved

This incident has been resolved.

Monitoring

A fix has been implemented and we are monitoring the results.

Update

We are continuing to investigate this issue.

Investigating

We're investigating degraded performance on our web dashboard and GraphQL API. You may notice slower responses as well as occasional 500 errors at this time.

Report: "Network maintenance in SCL (Santiago, Chile)"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

An upstream provider is performing critical network maintenance in SCL, from 7:00am UTC (3:00am local time) to 9:00am UTC (5:00am local time). You may experience a short total loss of connectivity for up to 25 minutes within the scheduled maintenance window hours.

Report: "Scheduled Maintenance in GIG Region (Rio De Janeiro)"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We are performing networking upgrades at our GIG data centre from 06:00 - 09:00 UTC (03:00 - 06:00 Local Time). Users with machines in GIG may experience networking downtime of up to 40 minutes within the scheduled maintenance period. We recommend users scale up to nearby regions, such as GRU, if needed.

Report: "Network maintenance in QRO"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

An upstream provider is performing critical network maintenance in QRO. You may experience a short total loss of connectivity for up to 25 minutes within the scheduled maintenance window hours.

Report: "Issues with API"

Last update
resolved

A fix has been deployed and the API is back up.

investigating

We are currently investigating issues with our GraphQL API. You might experience issues connecting to the dashboard and flyctl.

Report: "Organization invites failing on dashboard"

Last update
resolved

This incident has been resolved.

investigating

We are investigating an issue where inviting users to an organization from the web dashboard may fail. As a workaround, inviting users with the flyctl command line (`fly orgs invite`) is working.
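
As a concrete illustration of that CLI workaround, the invocation looks roughly like this (the org slug and email are placeholders; check `fly orgs invite --help` for the exact arguments):

```
# Invite a user to an organization from the command line instead of the dashboard
fly orgs invite my-org-slug teammate@example.com
```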

Report: "Networking issues in HKG"

Last update
resolved

This incident has been resolved.

investigating

We are continuing to investigate this issue.

investigating

We are investigating intermittent network issues in the HKG region. Apps running in the region may have trouble reaching apps in other regions at this time.

Report: "Network issues in GDL"

Last update
resolved

This incident has been resolved.

investigating

We are investigating network issues in the GDL region. Apps running in the region may be unreachable at this time.

Report: "504 Errors from Logs API"

Last update
resolved

Historical logs are back up.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating an issue with looking up historical logs. The `fly logs` command may fail. Streaming logs with NATS is not affected.

Report: "Capacity issues in FRA"

Last update
resolved

This incident has been resolved.

monitoring

New capacity has been added in FRA; we will continue to monitor the region for capacity constraints.

identified

We are continuing to work on a fix for this issue.

identified

We are continuing to work on a fix for this issue.

identified

We are continuing to work on a fix for this issue.

identified

We are actively working to add additional capacity in the FRA region. We'll provide another update in the next 15-30 minutes.

identified

We are experiencing low capacity in FRA. You may see machine launch failures. We are working on adding new capacity to FRA as soon as possible.

Report: "Network issues in SJC"

Last update
resolved

Networking in SJC is working as expected on all hosts. This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

monitoring

We've identified the cause of the issue and have applied a fix. We are seeing improvements and are continuing to monitor for full recovery.

investigating

A small number of hosts in SJC are continuing to experience networking issues after the earlier scheduled maintenance. We are working with our upstream provider to restore full connectivity to these hosts. Machines on impacted hosts may see reduced networking performance connecting to other machines within Fly.io and the broader internet.

investigating

We are investigating network issues resulting from the earlier scheduled maintenance in SJC.

Report: "Capacity issues in LHR region"

Last update
resolved

This incident has been resolved.

monitoring

We've provisioned new host capacity in LHR region, machine/volume creates have been re-enabled and deploys should now be possible again. We are monitoring capacity and will provide updates if the situation changes.

identified

New machine/volume creates in the LHR region are currently unavailable as there is no host capacity available. Any workloads currently running will continue to run; it is also still possible to update existing machines/volumes. Increasing `fly scale count` in the LHR region is not possible. Blue-green deploys are also not possible at the moment, as well as deploys with `release_command`. We expect more capacity to become available in the coming weeks. For the time being, please choose a nearby region for new workloads, such as AMS (Amsterdam, Netherlands) or ARN (Stockholm, Sweden).
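
For apps that need capacity while LHR is constrained, a rough sketch of shifting new Machines to a nearby region (the region codes, count, and Machine ID below are illustrative):

```
# Run two Machines for this app in Amsterdam instead of London
fly scale count 2 --region ams

# Or clone an existing Machine into Stockholm
fly machine clone <machine-id> --region arn
```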

Report: "Management plane for managed postgres in ORD is unavailable"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Degraded connectivity to Fly Registry"

Last update
resolved

We have identified that transoceanic subsea cable faults resulted in degraded connectivity to some registry instances in AMS, FRA, WAW regions. Our monitoring indicates error rates have improved after cordoning the affected instances at 16:40 UTC.

monitoring

We are continuing to monitor results after cordoning affected registry instances.

investigating

We are investigating timeouts connecting to instances of registry.fly.io in AMS, FRA, WAW regions. Customers may experience slower image pushes and pulls within Fly Machines in the affected regions.

monitoring

We have cordoned the affected registry instances in AMS, FRA, WAW and are seeing timeout errors decrease.

investigating

We are continuing to investigate the cause of increased connection timeouts to instances of our primary registry in AMS, FRA, WAW. Affected customers may be able to work around this by pushing images to an alternate registry, registry2.fly.io: `FLY_REGISTRY_HOST=registry2.fly.io fly deploy`
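
For clarity, that workaround amounts to overriding the registry host for a single deploy, roughly:

```
# Push the image to the alternate registry for this deploy only
FLY_REGISTRY_HOST=registry2.fly.io fly deploy
```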

investigating

We are investigating timeouts connecting to registry.fly.io. Customers may experience slower image pushes and pulls within Fly Machines.

Report: "Capacity issues in IAD and AMS"

Last update
resolved

We have provisioned additional capacity in the affected regions.

monitoring

New machine/volume creates in the IAD region may fail as there is no host capacity available. Any workloads currently running will continue to run; it is also still possible to update existing machines/volumes. Increasing `fly scale count` in these regions may not work. Blue-green deploys may also be unavailable at the moment, as well as deploys with `release_command`. We are provisioning additional capacity in this region.

Report: "Leader Election Issues with PG Flex Clusters close to NA region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are investigating an issue where postgres flex clusters are unable to elect a new leader.

Report: "Network issues in AMS region"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

Networking on the impacted hosts has been restored. Machines and apps on those hosts will now be reachable. We're continuing to monitor to ensure everything remains stable.

identified

The hardware switchover is complete. We are continuing the process of re-connecting the downed hosts to the network.

identified

Installation of the new hardware has completed and we are starting the switchover process. A networking blip may be observed on Machines in the AMS region during this process.

identified

Installation of the replacement hardware is still ongoing.

identified

Replacement hardware is onsite and is being installed.

identified

The upstream provider has traced this issue to a broken switch and is working to replace it. They expect connectivity to return in ~1 hour.

investigating

Various hosts in AMS region have lost network connectivity. We are investigating this along with our upstream provider.

Report: "Network issues in ARN"

Last update
resolved

Load has subsided on the edge nodes and we are not observing any related errors at this time.

investigating

Our edge nodes in Stockholm are currently experiencing high load. Some incoming connections may fail while we work to address the issue.

Report: "Network outage"

Last update
resolved

This incident has been resolved.

identified

Network connectivity in IAD has been restored. Our APIs should be working again, but might have higher response times.

identified

Network connectivity in IAD has been restored. Our APIs should be working again, but might have higher response times.

identified

We're bringing our platform up in another region and waiting for things to settle. Our upstream provider is also replacing the affected networking devices in IAD.

identified

We're continuing work to move our APIs away from affected regions/providers. Another update will be provided at 13h00 UTC or earlier.

identified

The IAD region is unavailable due to an incident at an upstream provider. Our API is hosted in this region and as such is unavailable.

investigating

We are investigating widespread reports of networking issues. Apps appear to be running correctly but requests made to the apps may fail. The API and dashboard are also unavailable at the moment.

Report: "Edge network issues in GRU and SCL"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are seeing network issues on our edge servers in the GRU and SCL regions. Machines are running correctly, but inbound requests from clients in those regions may fail intermittently.

Report: "Network issues in JNB"

Last update
resolved

This incident has been resolved.

monitoring

We have implemented a workaround for the network issue and are monitoring the situation.

investigating

There is an issue with an upstream network provider in JNB. Apps are still running but may observe network issues. New deploys for apps may fail.

Report: "Depot builders failing with internal error"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented on Depot's side. https://status.depot.dev/cm6zolsn40009f2dj5ss7lrd7

identified

The Depot service is currently degraded due to a database outage. We're continuing to monitor for recovery. Customers can also follow the Depot status page at https://status.depot.dev/ for updates. Customers that need to deploy can use legacy Fly.io hosted builders with `fly deploy --depot=false`

investigating

We are investigating failures when trying to build using the default Depot builders. The recommended workaround is to use `--depot=false` with `fly deploy`. The error from Depot builders is `Error: failed to fetch an image or build from source: error building: input:3: ensureDepotRemoteBuilder {"code"=>"internal", "message"=>"internal error"}`

Report: "SSH failing for newly created machines"

Last update
resolved

This incident has been resolved.

monitoring

This issue has been fixed, newly created machines will have working SSH. Machines created during this incident will need to be updated (`fly machine update --yes <id>`) or deleted/recreated to fix SSH.

investigating

As a workaround, run the `fly ssh console` command with the `--pty --command /bin/sh` flags.
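
Put together, the workaround looks roughly like this (the app name is a placeholder):

```
# Open an SSH session with an explicit PTY and shell, per the workaround above
fly ssh console --pty --command /bin/sh -a my-app
```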

investigating

We are investigating reports that connecting to newly created machines via SSH (`fly ssh console`) may fail.

Report: "Elevated network latency in FRA"

Last update
resolved

Network functionality is fully restored in FRA.

monitoring

We've deployed a fix for this incident and we are monitoring while network latency and bandwidth return to normal. All user apps should start seeing improved and normal response times.

identified

We're addressing elevated network latency and saturation affecting the FRA region. Apps with machines in this region might experience longer response times and possible timeouts (502 errors).

Report: "Capacity Constraints in IAD"

Last update
resolved

This incident has been resolved.

monitoring

We have brought additional IAD capacity online. Customers should see machine creation, deploy, and scaling operations succeed as normal in the region. We're continuing to monitor to ensure full recovery.

identified

We are continuing the process of adding additional machine capacity in the IAD region.

investigating

Machine capacity in the IAD region is currently low. We're working to bring additional capacity online. In the meantime, you may see errors deploying new machines in IAD or increasing the size of existing machines in the region. Customers may want to deploy machines to nearby regions, such as EWR.

Report: "Deploys using Depot Builders failing"

Last update
resolved

This issue has been resolved, deploys using Depot Builders are succeeding as expected.

monitoring

The Depot builder service is partially recovered and we are seeing deploys using Depot builders succeed again. Some customers may still experience degraded performance using Depot builders at this time. We're continuing to monitor for full recovery. Customers can still deploy using Fly.io hosted builders with `fly deploy --depot=false`

identified

The Depot service is currently degraded due to a database outage. We're continuing to monitor for recovery. Customers can also follow the Depot status page at https://status.depot.dev/ for updates. Customers can still deploy using Fly.io hosted builders with `fly deploy --depot=false`

investigating

We are investigating increased error rates when deploying apps using the default Depot Builders. Customers who experience this issue can work around it by using `fly deploy --depot=false` to deploy their image with a Fly.io hosted builder.

Report: "API errors"

Last update
resolved

This incident has been resolved.

investigating

We are investigating 503 errors when making requests to our GraphQL API or running flyctl commands.

Report: "Bluegreen healthchecks not passing"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and bluegreen deploys are succeeding as expected. We're continuing to monitor deploys to ensure stability, but customers should see BlueGreen deploys succeed in all regions.

identified

The issue has been identified and a fix is being implemented.

investigating

We are seeing signs of recovery, with Bluegreen deployments succeeding for many customers. We are continuing to investigate the root cause of the issue. Customers who still experience a Bluegreen deployment failure can retry using the rolling strategy with `fly deploy --strategy rolling`.

investigating

A temporary workaround for new deployments is to use rolling strategy: `fly deploy --strategy rolling`.

investigating

We are still investigating the issue.

investigating

When deploying with the bluegreen strategy, some green machines (new app version) won't pass healthchecks. Temporary workaround: unless bluegreen is a must for your app, you can temporarily deploy using a different strategy with `fly deploy --strategy NAME`.

Report: "Machine creation errors in LHR"

Last update
resolved

We observed several periods where Machine creations in LHR resulted in authentication errors from 11 Jan to 15 Jan 2025. Customers creating new Machines in the region may have seen failures with: `failed to launch VM: permission_denied: bolt token: failed to verify service token: no verified tokens; token <token>: verify: context deadline exceeded`

The disruptions were caused by degraded connectivity to our token creation service from three hosts. We deployed a preventative fix for the network issues on 15 Jan 2025 at 12:58 UTC.

Timestamps of occurrences (UTC):
2025-01-11 03:32 to 2025-01-11 04:11
2025-01-11 17:07 to 2025-01-11 17:54
2025-01-14 11:36 to 2025-01-14 12:14
2025-01-15 07:46 to 2025-01-15 09:49

Report: "Network issues in SJC region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating inbound network connectivity issues in SJC region. Users routed to SJC may be unable to access apps, or latency may be increased.

Report: "Transient networking issue in FRA"

Last update
resolved

This incident has been resolved.

monitoring

We have noticed a spike in packet loss across the FRA region at around 14:44 UTC caused by an upstream issue. This has recovered since 14:47 UTC, and we are currently monitoring the situation along with our upstream providers.

Report: "IPv6 Networking Issue in SCL"

Last update
resolved

This incident has been resolved.

identified

We are aware of a temporary IPv6 networking issue in SCL when accessing certain IPv6 ranges/providers, caused by upstream maintenance, and are working with our upstream on a fix. IPv6 requests originating from your machines in SCL may see increased error rates.

Report: "Network Instability"

Last update
resolved

This incident has been resolved.

monitoring

We're monitoring the platform, which continues to be stable and working normally. Additionally, we are in the process of deploying the Fly Proxy build that contains the fix for the bug that caused this issue.

monitoring

We have identified the cause of the network blip to be a bug in our Fly proxy and we're applying a fix.

monitoring

We noticed a temporary blip in our upstream network(s) between 16:38-16:40 UTC that affected our platform. This has since resolved, and we are monitoring for any continuing effects.

Report: "Machine Creates and Updates currently failing"

Last update
resolved

All changes have been deployed and Machine Create/Update API operations are healthy.

monitoring

The validation fix has been deployed and our monitoring alert for the API error rate has resolved.

identified

We were alerted to elevated error rates for machine creates and updates. A deploy introduced a validation error; that deploy is now being reverted.

Report: "Networking issues in GDL"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is currently being implemented.

Report: "sjc region capacity"

Last update
resolved

This incident has been resolved.

identified

We are currently at capacity in our SJC region. We're actively working on fixing this, however you may wish to deploy to nearby regions (lax or phx) as a workaround.

investigating

We are currently at capacity in our SJC region. We're actively working on fixing this, however you may wish to deploy to nearby regions (lax or phx) as a workaround.

Report: "Elevated API Latency and Timeout Errors"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and both Machines API and GraphQL API performance have returned to normal.

identified

We have identified the cause of the API latency increase and are working to mitigate it.

investigating

We are currently investigating elevated error rates with our Machines and GraphQL APIs. Users may experience slower responses or timeouts using the Machines API and flyctl commands.

Report: "Degraded Connectivity"

Last update
resolved

We have determined that some customers' machines are being throttled due to our full rollout of CPU quotas, separate from the incident yesterday. This in turn caused apparent networking issues. We have now temporarily rolled back these changes while we work with customers to better adapt to CPU quotas.

investigating

We are aware of customer-reported issues with internal networking and are investigating.