Historical record of incidents for Cloud 66
Report: "Google login issues"
Last update: Google authentication is experiencing issues, causing login and access problems for our systems and customers. We are investigating the impact and mitigation strategies.
Report: "Kubernetes Cluster Scale/Creation"
Last update: This incident has been resolved.
We are currently investigating this issue.
The issue has been identified and a fix is being implemented.
We are aware of problems when scaling/creating Kubernetes clusters. This appears to be caused by new rate limits imposed by Docker Hub. The team is working on a solution.
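For context on this incident, the sketch below (not part of the original report) shows one way to check remaining Docker Hub pull quota using the rate-limit headers Docker documents; it assumes the Python requests library and uses the public ratelimitpreview/test repository with an anonymous token.

    import requests

    # Fetch an anonymous token for the public ratelimitpreview/test repository,
    # which Docker documents for checking pull-quota headers.
    token = requests.get(
        "https://auth.docker.io/token",
        params={
            "service": "registry.docker.io",
            "scope": "repository:ratelimitpreview/test:pull",
        },
        timeout=10,
    ).json()["token"]

    # Per Docker's documentation, a HEAD request to the manifest endpoint
    # returns the quota headers without counting as a pull.
    resp = requests.head(
        "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )

    print("ratelimit-limit:    ", resp.headers.get("ratelimit-limit"))
    print("ratelimit-remaining:", resp.headers.get("ratelimit-remaining"))

Low ratelimit-remaining values would be consistent with cluster creation or scaling failures caused by anonymous image pulls being throttled.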
Report: "Stuck Deployments"
Last update: This incident has now been resolved.
We have identified and fixed the issue causing deployments to be stuck. It was caused by a database failure at our cloud provider, which we are investigating and monitoring.
We are currently investigating a number of stuck deployments.
Report: "CustomConfig git access"
Last update: The CustomConfig git backend migration has been completed.
Applications can still be deployed, but the CustomConfig backend is unavailable for the moment.
We are aware of issues with our CustomConfig git repositories. This issue impacts direct git access to the CustomConfig git repository. Direct access to CustomConfig pages via the web dashboard and API is not affected. We are investigating the root cause of the problem.
Report: "Production Issues"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are investigating issues affecting some production systems.
Report: "Github API Outage"
Last update: This incident has been resolved.
GitHub is experiencing some API outages; deployments do not currently appear to be affected. We are monitoring the situation (https://www.githubstatus.com/).
Report: "Multiple Google Cloud services in the europe-west9 region are impacted"
Last update: Maestro installations are working as expected again.
We are waiting on a resolution from Google.
Due to a major outage in Google Cloud (https://status.cloud.google.com/regional/europe), Kubernetes installations may fail on servers located in the same region, regardless of the cloud provider, because Kubernetes images themselves are hosted on Google Cloud.
Water intrusion in a data center in europe-west9 has caused a multi-cluster failure and has led to a shutdown of multiple zones. We expect general unavailability of the europe-west9 region. There is no ETA for recovery of operations in the europe-west9 region at this time, but it is expected to be an extended outage. Customers are advised to fail over to other regions if they are impacted.
Report: "Github Not Listing Repositories"
Last update: Issue confirmed to be fixed by GitHub.
The issue has been identified and a fix is being implemented; this should resolve the issue for most users. If the error persists, please use the workaround provided: 1. Remove the GitHub installation. 2. Manually configure SSH key access to GitHub (see the link below). https://help.cloud66.com/rails/how-to-guides/common-tools/access-your-code#manually-configuring-github-access
Currently, GitHub is having an issue with listing repositories while using the Cloud 66 GitHub app. As a workaround while this issue is ongoing: 1. Remove the GitHub installation. 2. Manually configure SSH key access to GitHub (see the link below). https://help.cloud66.com/rails/how-to-guides/common-tools/access-your-code#manually-configuring-github-access
Report: "Subset of Buildgrid Image pushes slower than normal"
Last update: This incident has been resolved.
The team has completed the first part of the mitigation strategy for intermittent slow Buildgrid pushes. We are now monitoring ongoing performance.
Our engineering team is continuing to look into mitigations for this issue.
The team is aware that some Buildgrid pushes are taking longer than normal, and is working on a mitigation strategy.
Report: "Production Agent Reporting Outage"
Last update: At approximately 07:00 UTC we started to perform a minor quality-of-life backend update. The update had unintended consequences on an older system component which hadn’t been changed in a while. The component handles agent-server communications, and it ended up disabled after the update. Although the same component was unaffected during testing in dev/staging and passed UAT, it turned out that a necessarily different configuration around rewrites in production caused the issue. The nature of the update meant that rolling back wasn’t straightforward, so the team resolved to fix the incompatibility in place, and as such were unable to stop the erroneous “server down” notifications that then went out. The issue was resolved approximately 1 hour later, at 08:00 UTC, after which time servers would have started to appear as “online” again. Subsequent to this outage, the difference in configuration has been added to our operations processes so that this will not occur again. Our apologies for the inconvenience and concern that this may have created!
This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Github outage"
Last update: This incident has been resolved.
GitHub is having an outage at this time. This may adversely affect Deployments.
Report: "Azure reporting DNS problems (affecting Ubuntu 18.04)"
Last update: This incident has been resolved.
We are continuing to monitor this incident.
Azure is recommending that affected customers reboot their servers to obtain an updated DHCP lease to resolve this issue.
From Azure (https://status.azure.com/en-gb/status): Starting at approximately 06:00 UTC on 30 Aug 2022, a number of customers running Ubuntu 18.04 (bionic) VMs that were recently upgraded to systemd version 237-3ubuntu10.54 reported experiencing DNS errors when trying to access their resources. Reports of this issue are confined to this single Ubuntu version.