Historical record of incidents for amazee.io
Report: "Partially available workloads after maintenance"
Last update: After the maintenance on us2, some workloads are only partially available. We are investigating the issue and are working on mitigations to make these workloads available again.
Report: "Regular Maintenance - EMEA"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We are conducting regular maintenance on our systems.
Report: "Regular Maintenance - APAC"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We are conducting regular maintenance on our systems.
Report: "Regular Maintenance - AMERICAS"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We are conducting regular maintenance on our systems.
Report: "MySQL 8 Update for Switzerland (CH4) Environments"
Last update: The scheduled maintenance has been completed.
New environments created in our Switzerland (CH4) region will be provisioned with MySQL 8 by default. What this means for you:
- Existing environments will remain on MySQL 5.7 (no action required)
- Only applies to newly created environments in CH4
- Most applications will experience no issues with MySQL 8
This is phase 1 of our MySQL upgrade plan. We'll provide separate communication about upgrading existing environments in the future. If you experience any compatibility issues with your application on newly created environments, please contact our support team.
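To check which MySQL version a given environment is running, you can query the database from the environment's cli pod. This is a minimal sketch assuming the standard Lagoon MARIADB_* service variables are present; verify the variable names for your project before relying on them.

```sh
# Minimal sketch: print the MySQL server version from inside the cli pod.
# The MARIADB_* variables are the standard Lagoon service variables and
# are an assumption here; adjust to your project's configuration.
mysql -h "$MARIADB_HOST" -u "$MARIADB_USERNAME" -p"$MARIADB_PASSWORD" \
  -e "SELECT VERSION();"
```

A newly created CH4 environment should report an 8.x version, while existing environments continue to report 5.7.x.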
Report: "Varying support coverage during holiday season"
Last update: This incident has been resolved.
From April 18th to April 21st our support coverage varies due to the holiday season. While we monitor the platform as usual, you might experience slower response times on support cases. For critical issues that affect production services, please open a support ticket and call the emergency number noted in your contract.
We are continuing to monitor for any further issues.
From April 18th to April 21st our support coverage varies due to the holiday season. While we monitor the platform as usual, you might experience slower response times on support cases. For critical issues that affect production services, please open a support ticket and call the emergency number noted in your contract.
Report: "Fastly Certificate Error"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating some discrepancies in our Fastly certificate automation. While existing certificates remain unaffected, you may encounter errors during deployment.
Report: "Global docker image registry errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Logging partially available"
Last update: This incident has been resolved.
We identified a higher than usual load on the logging system and assigned it more resources. The logging system is partially available while it is restarting.
Report: "Volumes in read-only mode"
Last update: With the latest applied patch, the issue has been resolved.
We applied a patch that resolves the read-only issue which some workloads experienced.
We are investigating multiple cases of volumes that were only mounted in read-only mode leading to failed writes for applications. While we are working on a permanent solution, please do reach out to our support should you see your application being impacted by this.
Report: "Volumes in read-only mode"
Last update: All potentially affected workloads have been restarted.
We have updated the impacted regions to include au2.lagoon.
We have rolled out a patch and will restart possibly affected workloads.
We are investigating multiple cases of volumes that were only mounted in read-only mode leading to failed writes for applications. While we are working on a permanent solution, please do reach out to our support should you see your application being impacted by this.
Report: "Volumes in read-only mode"
Last update: Since we rolled back some components, we have not seen any further issues with volumes being mounted read-only.
Since the rollback we have not seen any volumes falsely mounted as read-only. We will keep monitoring the situation.
As a temporary workaround we are immediately reverting some components to a previous version. The impact on workloads is comparable to a regular maintenance window.
We are investigating multiple cases of volumes that were only mounted in read-only mode leading to failed writes for applications. While we are working on a permanent solution, please do reach out to our support should you see your application being impacted by this.
Report: "GCP MySQL 8 Migration Update"
Last update: MySQL 8 Upgrade Schedule:
- Development environments: February 26, 2025
- Production environments: March 12, 2025

Impact:
- Expected database connection interruption: <10 minutes per environment
- Affected clusters: CH4, FI2, and US3

What to expect:
- No action required from customers
- Our team will monitor upgrades and handle any issues
- Status updates will be provided during maintenance windows
We are actively working on implementing a seamless migration to MySQL 8 on our Google Cloud Platform (GCP) environments. Our engineering team is in the final stages of the testing phase, ensuring a smooth transition with minimal impact on our services. We are committed to maintaining system stability throughout this upgrade process. We will announce the specific migration date and detailed timeline next week, along with any relevant instructions for our users.
Report: "Postgresql US2 upgrade to 14.10"
Last update: This incident has been resolved.
We are performing the PostgreSQL upgrade.
We will perform an upgrade of PostgreSQL on the US2 cluster to version 14.10.
Report: "GCP MySQL 8 Upgrade"
Last update: The planned MySQL 8 upgrade for GCP development environments is currently paused while we investigate implementation challenges. This delay affects only test/development environments and does not impact current production systems. We are actively working on resolving these issues and will announce a new upgrade schedule through the status page once our investigation is complete. Customers planning to test their applications with MySQL 8 will be given advance notice before the upgrade resumes.
During a pre-upgrade check for the upgrade of the development environments to MySQL 8 we identified a technical requirement that we need to test further before we do the upgrade. Therefore we will not be upgrading the development environments during today's maintenance windows.
Report: "Intermittent SSH access issues"
Last update: The issue has been resolved. It was caused by a communication fault between the remotes and the core.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're investigating intermittent SSH access issues on cloud clusters.
Report: "JSM Assist sync issues"
Last update: The issue has been resolved by Atlassian.
We are monitoring updates from Atlassian for JSM Cloud customers concerning a sync issue with Assist. You might experience delays awaiting responses from our Support team.
Report: "Failing image builds"
Last update: The caching issue has been resolved and image builds are fully operational.
Some image builds on DE3 are timing out due to an issue with the build cache. We are working on a solution.
Report: "Delayed logs"
Last update: The backlog of logs was processed completely and logs are showing up as usual in the logging system.
Logs might not appear in the logging system as quickly as usual due to a larger backlog that is currently being processed. Real-time logs through the Lagoon CLI (https://docs.amazee.io/cloud/logging/#real-time-logs-via-lagoon-cli) are not affected by this.
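For reference, tailing logs live with the Lagoon CLI sidesteps the delayed logging system entirely. This is a sketch only; the flag names below are our reading of the linked documentation and should be verified there.

```sh
# Tail live logs for a single service of an environment.
# Project, environment, and service names are placeholders.
lagoon logs -p my-project -e main -s nginx -f
```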
Report: "Let's Encrypt Certificate creation issues"
Last update: New certificates are again issued without any delays.
A fix has been implemented and we are monitoring the results.
Customers with valid certificates are not impacted; only newly issued certificates seem to take longer than usual to be loaded onto the route. If you see immediate issues, please get in touch with Support.
Report: "DE3 production MySQL 8 upgrade"
Last update: For DE3 the upgrade of the production database to MySQL 8 was completed successfully.
A fix has been implemented and we are monitoring the results.
The DE3 cluster is running on MySQL 8.
Report: "MySQL 8 upgrade"
Last update: This incident has been resolved.
Hello Team! We are upgrading from MySQL 5.7 to MySQL 8. We expect limited downtime during the maintenance window.
Report: "Degraded performance on UK3 MySQL production databases"
Last update: This incident has been resolved.
A database failover was executed to resolve the performance issue and we are monitoring the situation.
We are currently investigating this issue.
Report: "Absent router logs"
Last update: This incident has been resolved.
The router logs from ch4 were not shipped to the logging infrastructure from 2024-07-24 15:53 to 2024-08-12 06:16 UTC due to a misconfiguration in the logging system. Application and container logs were not impacted by this misconfiguration and are available as usual.
Report: "Status update delays of builds and tasks, and webhooks not being processed"
Last update: The status and webhook delays are now resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring. Webhooks may be delayed while the received webhook queue is processed.
The issue has been identified and a fix is being implemented.
Report: "Changes in un-idling behavior"
Last update: This incident has been resolved.
Workloads in non-production environments will only be un-idled when a client accessing them can run JavaScript. This change will prevent most cases of undesired un-idling triggered by automated requests. More information on environment idling can be found in the Lagoon documentation https://docs.lagoon.sh/concepts-advanced/environment-idling/.
Report: "Degraded database performance on FI2 MySQL production"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
In order to stabilize the performance of the database we will trigger a failover. This will lead to short interruptions for applications connecting to the database.
Report: "Fastly API Issues"
Last update: The issue has been resolved upstream.
Disruptions in access to manage.fastly.com, configuration propagation, and access to the Fastly API have been fixed. We are monitoring the current API state.
We are currently investigating problems caused by an upstream issue with Fastly: https://www.fastlystatus.com/incident/376458
Report: "Intermittent Workload Restarts"
Last update: This incident has been resolved.
We're still investigating certain workload restarts. Some workloads seem to trigger conditions on the compute nodes that lead to all workloads on the affected node being rescheduled.
A fix has been implemented and we are monitoring the results.
Following up from the earlier Incident regarding the intermittent workload restarts: We'll run an additional maintenance window after 21:00 UTC today to move workloads onto a new set of compute nodes. This action should stabilize the intermittent workload restarts we are seeing.
Report: "Lagoon tasks error out"
Last update: We've resolved the issue now for the majority of users. A previous update identified a workaround in the unlikely event that you may still experience the issue. Reach out to support if you do encounter the error and aren't quite sure how to resolve it.
A fix has been implemented and we are monitoring the results.
After the release of Lagoon 2.18, triggering tasks that require the cli pod from the UI will result in this error: `Environment <id> has no service cli`. A short-term fix is to trigger a deployment, OR to run this API mutation for the environment the task is broken on:

  mutation {
    addOrUpdateEnvironmentService(input: {
      environment: <environment-id>
      name: "cli"
      type: "cli"
    }) {
      id
      name
      type
    }
  }

The Lagoon team is working on a permanent fix.
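For illustration, a mutation like the one above can be submitted with curl. This sketch assumes the amazee.io Lagoon API endpoint and the SSH-based token retrieval described in the amazee.io documentation; verify both against the current docs before use.

```sh
# Fetch a short-lived API token via the Lagoon SSH service
# (host and port are the documented amazee.io defaults, assumed here).
TOKEN=$(ssh -p 32222 -t lagoon@ssh.lagoon.amazeeio.cloud token)

# Re-create the missing cli service for a hypothetical environment ID 1234.
curl -s https://api.lagoon.amazeeio.cloud/graphql \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"query":"mutation { addOrUpdateEnvironmentService(input: { environment: 1234, name: \"cli\", type: \"cli\" }) { id name type } }"}'
```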
Report: "Client support availability during Easter Holidays"
Last update: This incident has been resolved.
During the upcoming Easter holidays (Mar 29th 2024 - Apr 1st 2024), amazee.io will continue to offer support, albeit at reduced availability. Our on-call engineers will continue to monitor the platform and the ticketing system. As a reminder: should you need support, you can create a support ticket (via email, Slack if available, the Support portal, or the chat widget within the amazee.io Lagoon dashboard). For critical or high-severity issues that require more immediate attention, please call the emergency number written in your contract. Full support services will resume as of Tuesday, Apr 2nd, 2024. From all of us at amazee.io, we wish you a safe and happy holiday break.
Report: "Timeouts during log retrieval"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're working on getting the log storage back to operational. A lot of data is currently being loaded in the background, which leads to slower response times while retrieving logs.
Some queries to retrieve logs are currently failing due to timeouts.
Report: "Login to Logs Backend fails with redirect error"
Last update: The issue has been solved; login to the Logs Backend should work again without issues.
We're seeing reports from users that login to logs.amazeeio.cloud is currently failing. We're looking into this at the moment.
Report: "Lagoon API Outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the situation.
We are continuing to work on a fix for this issue.
We've identified the issue and are working on restoration.
We are currently investigating this issue.
Report: "Timeouts on Lagoon Logs"
Last update: The logging system is fully operational again.
After rebalancing the data, some parts of the logging system failed, and we are working on making it fully operational again.
We have rolled out some improvements and the logging system is currently rebalancing data.
We have seen an increase in timeouts for log queries and are working on identifying the root cause of this.
Report: "Emergency Maintenance Window"
Last update: This incident has been resolved.
We'll be running an emergency maintenance window today within the usual maintenance window times for clusters on AWS infrastructure.
Report: "Intermittent Workload Restarts"
Last update: We've implemented a fix that should lower the impact of workload restarts. Our team is monitoring the situation and taking action where necessary.
We're investigating workloads being rescheduled intermittently. This only affects a small subset of projects on amazeeio-ch4. This can lead to availability issues on standard availability projects. We've found a cause of this behavior and are working on rolling out a fix for this issue during the maintenance window.
Report: "Scaling activities"
Last update: The situation is stable. We will resolve this incident here and follow up with a post incident review in the coming days.
The original database cluster can only be started in read mode. In accordance with our backup and recovery processes, we promoted the new database cluster with the state of 2024-01-11 03:05 UTC as the new production cluster. We updated all workloads to use this new database cluster. Please note that this does not contain data between 2024-01-11 03:05 UTC and the moment the database cluster went offline (~ 2024-01-11 07:22 UTC). Dumps of the original database with the latest data can be exported and shared on request. A summary of the incident will be shared in the upcoming days. We are sorry for the inconvenience this caused you and your clients. If you have any questions regarding this, please reach out to us.
Recovering the database was interrupted due to an unforeseen issue. We are working with the AWS RDS team to bring the database back online. As an alternative option for recovery we can point single environments to a new database cluster, containing data up until 2024-01-11 03:05 UTC. Please be aware that this option would lead to data loss. If you would like to pursue this route, please contact us through our support channels.
We're making good progress on recovering the database cluster. We're expecting the database cluster to be back online within the next 2 hours.
Recovery is still underway. We're evaluating additional ways to recover from the current situation quicker and restore services.
We're making progress in recovery, but we can't give a firm ETA as the recovery speed hasn't fully settled yet. We're still in discussions with the AWS RDS team on timings and additional recovery options.
We've identified the issue in the meantime and are working on recovering from the outage. We can't give an ETA for now and are evaluating several options.
We're still working with the AWS RDS team to investigate what is causing the connectivity issues.
We're seeing connectivity issues to the database cluster after the scaling operation. We'll involve our upstream provider to look into this issue as well.
We're seeing issues with the database cluster and are investigating.
We observed an increase in resource usage on the shared MySQL cluster on UK3. To account for the increase we are scaling the cluster which will lead to one failover.
Report: "FI2 - Database Load"
Last update: This incident has been resolved.
We've identified the issue and are limiting the impact on customers. We're continuing to monitor the situation.
We're currently investigating issues on amazeeio-fi2 related to increased DB load.
Report: "Cluster Scaling Operations"
Last update: The scaling operations have finished. We're monitoring the situation, but everything looks clear now.
Some clusters had an increase in node count. In order to lower the compute node footprint, we've enabled downscaling on all clusters. This can have an intermittent impact on sites that are not highly available.
Report: "Image registry issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're seeing some issues with the image registry after yesterday's maintenance. Our team is working on resolving this and is re-running the maintenance tasks to do so.
Report: "AU2 - Database load issues"
Last update: This incident has been resolved.
To handle the increased load, we've scaled the database infrastructure. We're continuing to monitor the situation closely.
A subset of customers are seeing slow database queries. We're looking into the situation and will take action where needed.
Report: "SSH Connectivity Issues"
Last update: This incident has been resolved.
A fix has been implemented and rolled out to all clusters. We're monitoring the situation.
We've identified an issue where SSH connections might fail, and we're working on rolling out a fix for it.
Report: "Increased workload rescheduling"
Last update: The changes have been effective and rescheduling activities are back to a normal level.
The changes have been rolled out during the last maintenance window. We will monitor the workloads closely during the next few hours to verify that the rescheduling activity stays within an expected range.
We're working on a mitigation for the issue at hand. The changes will be rolled out in the upcoming maintenance window and should improve scheduling speed as well as lower the possibility of unplanned workload rescheduling - We'll monitor the situation as soon as the change has been rolled out.
We observed an increase in workload rescheduling and are currently exploring possible fixes for the root cause.
Report: "Site unavailability on DE3"
Last update: The incident has been resolved.
We've identified the issue and added a workaround - affected sites should have recovered. We are monitoring the situation closely.
We're seeing reports of sites being unavailable and getting timeouts on amazeeio-de3. Our team is currently investigating based on those reports. This seems to impact only a subset of sites.
Report: "Development Database Scaling - Finland"
Last update: The development database instance has been scaled successfully.
We've identified that there are workloads impacting the development database performance. We'll scale up the resources, which might lead to a temporary unavailability of the development environments for the FI region during the scaling operation.
Report: "Drupal build failures"
Last update: This incident has been resolved.
Additional resources have been provisioned and recently failing builds are no longer blocked.
Some Drupal builds on US2 are currently failing due to networking issues with an upstream provider. We are provisioning additional resources to prevent further build failures.
Report: "Fastly API Issues"
Last update: The upstream issue has been resolved.
A fix has been implemented and we are monitoring the results.
We're currently investigating issues caused by an upstream API issue with Fastly (https://www.fastlystatus.com/incident/376081). Live traffic is not affected; we mostly see this incident causing issues with actions where we integrate with Fastly, e.g. certificate updates, domain updates, or changes to Fastly services.
Report: "Deployments not starting"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're working on a permanent solution for this issue. Customers whose deployments on US2 are stuck in "New" can contact support to get the stuck deployment fixed.
We have identified the issue; it only blocks a small subset of deployments from progressing. Our engineers are looking into solving the problem.
Deployments on us2.lagoon are blocked from starting and stay in "New" status. We are looking into resolving this issue. There is no impact on site availability.
Report: "Logging Infrastructure not available"
Last update: The logging infrastructure is fully operational again.
A fix has been implemented and we are monitoring the results.
We are working on fully restoring the logging service. Currently, responses might be slow while data is being recovered. Recent logging data will become available in the next few hours.
The issue has been identified and a fix is being implemented.
We're currently investigating an issue with the logging infrastructure
Report: "Lagoon API unavailability and slowness"
Last update: We're closing the incident. The changes mentioned earlier show that API stability is back to normal levels.
We have identified the most likely root cause of the slowness and stability issues over the last couple of weeks. We have rewritten and deployed the relevant code, and are monitoring closely. All signs are currently positive, and services are running normally.
We're seeing the issue returning, leading to SSH and API timeouts. We will monitor and work on short term improvements, as required.
Performance of the API and Dashboard has improved; the cause was a high volume of messages and requests to be handled by the API. We are continuing to monitor and work on improvements to be able to handle the additional load.
Unfortunately we're seeing problems with the performance of the API and Dashboard again; we're working on identifying and fixing the problem.
We have implemented a fix and are continuing to monitor the situation.
We are continuing to investigate this issue.
Issues with the API have started again; we are investigating.
We've put measures in place to stabilize API and the Lagoon Dashboard. There might be slow responses, and our team is working on getting everything back to speed. As we focus on fully resolving this issue, the updates regarding this incident may become less frequent.
We're seeing the issue returning, leading to SSH and API timeouts. Our team is investigating.
The issue has been identified and a fix is being implemented. Some customers might see intermittent SSH connectivity issues.
The limitations that were put in place were successful, and we were able to scale the API back up to standard capacity. The API and Lagoon Dashboard are operating normally, although customers may encounter intermittent delays in API response times. We continue to monitor the situation and will take appropriate action if needed.
The issue has been identified. A few limitations are currently in place while we watch how the situation stabilizes, and we're working to open the API, Dashboard, and SSH connections up to full capacity again.
We continue to investigate the issue; there might be temporary problems with SSH connections. An unusually high volume of API requests appears to be causing issues with the Lagoon API, Dashboard, and SSH connections. We are looking into isolating the problem and putting limits in place.
The API is stable, but may be slow as things recover.
We are continuing to investigate this issue.
We are currently experiencing degraded API performance.
Report: "Isolated connectivity issues"
Last update: This incident has been resolved.
The workloads have been evacuated from the faulty compute host and we are monitoring the connectivity between hosts.
We identified connectivity issues originating from one of the compute hosts and will evacuate workloads running on this host.
Report: "Partial Request Failures"
Last update: This incident has been resolved.
Due to load spikes, some requests on uk3 failed. This was mitigated by automated scale-ups and we are monitoring the situation.
Report: "Deployment Failures on New Environments Containing Special Characters"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
## Impact
Currently, new environments with consecutive and trailing special characters such as dashes cannot be deployed. We have identified the issue and are working on a permanent solution.

## Workaround
Remove consecutive and trailing special characters from the environment/branch name; a small sketch follows the example below.

## Example
Invalid: feature--new-ui-
Valid: feature-new-ui
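As a purely illustrative sketch of the workaround, the following collapses consecutive dashes and strips trailing ones from a branch name before the environment is created; any equivalent cleanup works just as well.

```sh
# Collapse runs of dashes, then strip any trailing dash.
branch="feature--new-ui-"
clean=$(printf '%s' "$branch" | sed -e 's/--*/-/g' -e 's/-*$//')
echo "$clean"   # -> feature-new-ui
```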
Report: "Intermittent connection issues between CDN and AWS Clusters"
Last update: After many hours of work together with Fastly and AWS, the root cause has been found and resolved. The workaround was removed in April 2023. Active monitoring over the last weeks shows that the connection issues have been permanently resolved.
We are continuing to monitor and to search for the root cause of this issue; as there are many different engineering teams involved, this takes time. We are, however, very confident that the currently implemented workaround solves the issue, and therefore there should be no impact on customer websites. We will keep this issue open and update it as soon as we have found the root cause and a permanent resolution.
Over the last 7 days, environments that use the amazee.io CDN (Fastly) and are hosted on AWS clusters have experienced elevated connection issues. While this affected only a very small share (less than 0.01%) of requests, we started to analyze and investigate the issue together with the teams at Fastly and AWS. While we have not found the exact root cause yet, we have found a workaround on the AWS load balancers that reduces the connection issues to the background level expected of ordinary internet connectivity. We are continuing to monitor this issue and to search for the root cause together with Fastly and AWS. We will continue to provide updates here.