amazee.io

Is amazee.io Down Right Now? Check whether there is an ongoing outage.

amazee.io is currently Operational

Last checked from amazee.io's official status page

Historical record of incidents for amazee.io

Report: "Partially available workloads after maintenance"

Last update
investigating

After the maintenance on us2, some workloads are only partially available. We are investigating the issue and are working on mitigations to make these workloads available again.

Report: "Regular Maintenance - EMEA"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We are conducting regular maintenance on our systems.

Report: "Regular Maintenance - APAC"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We are conducting regular maintenance on our systems.

Report: "Regular Maintenance - AMERICAS"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We are conducting regular maintenance on our systems.

Report: "MySQL 8 Update for Switzerland (CH4) Environments"

Last update
Completed

The scheduled maintenance has been completed.

Scheduled

New environments created in our Switzerland (CH4) region will be provisioned with MySQL 8 by default.

What this means for you:
- Existing environments will remain on MySQL 5.7 (no action required)
- Only applies to newly created environments in CH4
- Most applications will experience no issues with MySQL 8

This is phase 1 of our MySQL upgrade plan. We'll provide separate communication about upgrading existing environments in the future. If you experience any compatibility issues with your application on newly created environments, please contact our support team.
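If you want to confirm which MySQL version a newly created environment is running, a minimal sketch like the one below can be run from the environment's cli pod. It assumes the pymysql package is available and that the database credentials are exposed through MARIADB_HOST, MARIADB_USERNAME, MARIADB_PASSWORD, and MARIADB_DATABASE environment variables; those names are assumptions and may differ for your service configuration.

```python
# Minimal sketch: print the MySQL server version for the current environment.
# Assumes pymysql is installed and that the credentials are exposed via
# MARIADB_* environment variables (adjust the names to your service setup).
import os
import pymysql

conn = pymysql.connect(
    host=os.environ["MARIADB_HOST"],
    user=os.environ["MARIADB_USERNAME"],
    password=os.environ["MARIADB_PASSWORD"],
    database=os.environ["MARIADB_DATABASE"],
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone()[0])  # e.g. "8.0.x" or "5.7.x"
finally:
    conn.close()
```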

Report: "Varying support coverage during holiday season"

Last update
resolved

This incident has been resolved.

monitoring

From April 18th to April 21st our support coverage varies due to the holiday season. While we monitor the platform as usual, you might experience slower response times on support cases. For critical issues that affect production services, please open a support ticket and call the emergency number noted in your contract.

monitoring

We are continuing to monitor for any further issues.

monitoring

From April 18th to April 21st our support coverage varies due to the holiday season. While we monitor the platform as usual, you might experience slower response times on support cases. For critical issues that affect production services, please open a support ticket and call the emergency number noted in your contract.

Report: "Fastly Certificate Error"

Last update
resolved

This incident has been resolved.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating some discrepancies in our Fastly certificate automation. While existing certificates remain unaffected, you may encounter errors during deployment.

Report: "Global docker image registry errors"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Logging partially available"

Last update
resolved

This incident has been resolved.

monitoring

We identified a higher than usual load on the logging system and assigned it more resources. The logging system is partially available while it is restarting.

Report: "Volumes in read-only mode"

Last update
resolved

With the latest applied patch the issue has been resolved.

monitoring

We applied a patch that resolves the read-only issue which some workloads experienced.

identified

We are investigating multiple cases of volumes that were only mounted in read-only mode leading to failed writes for applications. While we are working on a permanent solution, please do reach out to our support should you see your application being impacted by this.
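As a quick way to check whether your application is affected, a minimal sketch like the following can be run inside a pod that mounts the volume. It simply attempts a write inside the mounted path and reports whether the filesystem is read-only; the path /app/files used here is only an example.

```python
# Minimal sketch: test whether a mounted volume is writable.
# The path below is an example; replace it with your volume's mount point.
import errno
import os

MOUNT_PATH = "/app/files"
probe = os.path.join(MOUNT_PATH, ".rw-probe")

try:
    with open(probe, "w") as fh:
        fh.write("ok")
    os.remove(probe)
    print(f"{MOUNT_PATH} is writable")
except OSError as exc:
    if exc.errno == errno.EROFS:
        print(f"{MOUNT_PATH} is mounted read-only")
    else:
        raise
```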

Report: "Volumes in read-only mode"

Last update
resolved

All potentially affected workloads have been restarted.

identified

We have updated the impacted regions to include au2.lagoon

identified

We have rolled out a patch and will restart possibly affected workloads.

investigating

We are investigating multiple cases of volumes that were only mounted in read-only mode leading to failed writes for applications. While we are working on a permanent solution, please do reach out to our support should you see your application being impacted by this.

Report: "Volumes in read-only mode"

Last update
resolved

Since we rolled back some components, we have no longer seen issues with volumes getting mounted read-only.

monitoring

Since the rollback we have no longer seen volumes getting falsely mounted as read-only. We will keep monitoring the situation.

investigating

As a temporary workaround we are immediately reverting some components to a previous version. The impact on workloads is comparable to a regular maintenance window.

investigating

We are investigating multiple cases of volumes that were only mounted in read-only mode leading to failed writes for applications. While we are working on a permanent solution, please do reach out to our support should you see your application being impacted by this.

Report: "GCP MySQL 8 Migration Update"

Last update
resolved

MySQL 8 Upgrade Schedule:
- Development environments: February 26, 2025
- Production environments: March 12, 2025

Impact:
- Expected database connection interruption: <10 minutes per environment
- Affected clusters: CH4, FI2, and US3

What to expect:
- No action required from customers
- Our team will monitor upgrades and handle any issues
- Status updates will be provided during maintenance windows

identified

We are actively working on implementing a seamless migration to MySQL 8 on our Google Cloud Platform (GCP) environments. Our engineering team is in the final stages of the testing phase, ensuring a smooth transition with minimal impact on our services. We are committed to maintaining system stability throughout this upgrade process. We will announce the specific migration date and detailed timeline next week, along with any relevant instructions for our users.

Report: "Postgresql US2 upgrade to 14.10"

Last update
resolved

This incident has been resolved.

monitoring

We are performing the PostgreSQL upgrade.

identified

We will perform an upgrade of PostgreSQL on the US2 cluster to version 14.10.

Report: "GCP MySQL 8 Upgrade"

Last update
resolved

The planned MySQL 8 upgrade for GCP development environments is currently paused while we investigate implementation challenges. This delay affects only test/development environments and does not impact current production systems. We are actively working on resolving these issues and will announce a new upgrade schedule through the status page once our investigation is complete. Customers planning to test their applications with MySQL 8 will be given advance notice before the upgrade resumes.

identified

During a pre-upgrade check for the upgrade of the development environments to MySQL 8 we identified a technical requirement that we need to test further before we do the upgrade. Therefore we will not be upgrading the development environments during today's maintenance windows.

Report: "Intermittent SSH access issues"

Last update
resolved

The issue has been resolved. The issue was caused by a communication fault between the remotes and core.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

identified

The issue has been identified and a fix is being implemented.

investigating

We're investigating intermittent SSH access issues to cloud clusters.
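If you want to verify basic reachability of the SSH endpoint from your side while such an incident is open, a minimal sketch like the one below opens a TCP connection and reads the SSH banner. The endpoint ssh.lagoon.amazeeio.cloud on port 32222 is an assumption based on the Lagoon documentation; substitute the endpoint that applies to your cluster.

```python
# Minimal sketch: check that the Lagoon SSH endpoint accepts TCP connections
# and returns an SSH identification banner. Endpoint and port are assumptions;
# use the values documented for your cluster.
import socket

HOST = "ssh.lagoon.amazeeio.cloud"
PORT = 32222

with socket.create_connection((HOST, PORT), timeout=10) as sock:
    banner = sock.recv(256).decode(errors="replace").strip()
    print(f"Connected to {HOST}:{PORT}, banner: {banner}")
```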

Report: "JSM Assist sync issues"

Last update
resolved

The issue has been resolved by Atlassian.

monitoring

We are monitoring updates from Atlassian for JSM Cloud customers concerning a sync issue with Assist. You might experience delays awaiting responses from our Support team.

Report: "Failing image builds"

Last update
resolved

The caching issue has been resolved and image builds are fully operational.

identified

Some image builds on DE3 are timing out due to an issue with the build cache. We are working on a solution.

Report: "Delayed logs"

Last update
resolved

The backlog of logs was processed completely and logs are showing up as usual in the logging system.

monitoring

Logs might not appear in the logging system as quickly as usual due to a larger backlog that is currently being processed. Real-time logs through the Lagoon CLI (https://docs.amazee.io/cloud/logging/#real-time-logs-via-lagoon-cli) are not affected by this.

Report: "Let's Encrypt Certificate creation issues"

Last update
resolved

New certificates are again issued without any delays.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Customers with valid certificates are not impacted; only newly issued certificates seem to take longer than usual to be loaded onto the route. If you see immediate issues, please get in touch with Support.
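If you are unsure whether a newly issued certificate has been loaded onto your route, a minimal sketch along the lines of the one below reads the certificate the route currently serves and prints its validity dates. The hostname used is only a placeholder.

```python
# Minimal sketch: print the validity period of the certificate currently
# served for a hostname. Replace the placeholder hostname with your route.
import socket
import ssl

HOST = "www.example.com"

ctx = ssl.create_default_context()
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()

print("notBefore:", cert["notBefore"])
print("notAfter: ", cert["notAfter"])
```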

Report: "DE3 production MySQL 8 upgrade"

Last update
resolved

For DE3 the upgrade of the production database to MySQL 8 was completed successfully.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The DE3 cluster is running on MySQL 8.

investigating

The DE3 cluster is running on MySQL 8.

Report: "MySQL 8 upgrade"

Last update
resolved

This incident has been resolved.

investigating

Hello Team! We are upgrading from MySQL 5.7 to MySQL 8. We expect limited downtime during the maintenance window.

Report: "Degraded performance on UK3 MySQL production databases"

Last update
resolved

This incident has been resolved.

monitoring

A database failover was executed to resolve the performance issue and we are monitoring the situation.

investigating

We are currently investigating this issue.

Report: "Absent router logs"

Last update
resolved

This incident has been resolved.

identified

The router logs from ch4 were not shipped to the logging infrastructure from 2024-07-24 15:53 to 2024-08-12 06:16 UTC due to a misconfiguration in the logging system. Application and container logs were not impacted by this misconfiguration and are available as usual.

Report: "Status update delays of builds and tasks, and webhooks not being processed"

Last update
resolved

The status and webhook delays are now resolved

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring. Webhooks may be delayed while the received webhook queue is processed.

identified

The issue has been identified and a fix is being implemented.

Report: "Changes in un-idling behavior"

Last update
resolved

This incident has been resolved.

investigating

Workloads in non-production environments will only be un-idled when a client accessing them can run JavaScript. This change will prevent most cases of undesired un-idling triggered by automated requests. More information on environment idling can be found in the Lagoon documentation https://docs.lagoon.sh/concepts-advanced/environment-idling/.

Report: "Degraded database performance on FI2 MySQL production"

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for this issue.

identified

In order to stabilize the performance of the database we will trigger a failover. This will lead to short interruptions for applications connecting to the database.

Report: "Fastly API Issues"

Last update
resolved

The issue has been resolved upstream.

monitoring

Disruptions in access to manage.fastly.com, configuration propagation, and access to the Fastly API have been fixed. We are monitoring the current API state.

investigating

We are currently investigating issues caused by an upstream issue with Fastly - https://www.fastlystatus.com/incident/376458

Report: "Intermittent Workload Restarts"

Last update
resolved

This incident has been resolved.

monitoring

We're still investigating certain workload restarts. Certain workloads appear to trigger conditions on the compute nodes which then lead to the workloads on that compute node being rescheduled.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Following up from the earlier Incident regarding the intermittent workload restarts: We'll run an additional maintenance window after 21:00 UTC today to move workloads onto a new set of compute nodes. This action should stabilize the intermittent workload restarts we are seeing.

Report: "Lagoon tasks error out"

Last update
resolved

We've now resolved the issue for the majority of users. A previous update describes a workaround, in the unlikely event that you still experience the issue. Reach out to support if you do encounter the error and aren't quite sure how to resolve it.

monitoring

A fix has been implemented and we are monitoring the results.

identified

After the release of Lagoon 2.18, triggering tasks that require the cli pod from the UI will result in this error: `Environment <id> has no service cli`. A short-term fix is to trigger a deployment OR run this API mutation for the environment that the task is broken on:

mutation {
  addOrUpdateEnvironmentService(input: {
    environment: <environment-id>
    name: "cli"
    type: "cli"
  }) {
    id
    name
    type
  }
}

The Lagoon team is working on a permanent fix.
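For reference, a mutation like the one above can be sent to the Lagoon GraphQL API over plain HTTPS. The sketch below assumes the amazee.io API endpoint https://api.lagoon.amazeeio.cloud/graphql and a valid Lagoon API token; the token and environment id are placeholders, so adjust them to your setup.

```python
# Minimal sketch: send the addOrUpdateEnvironmentService mutation to the
# Lagoon GraphQL API. The endpoint, token, and environment id below are
# assumptions/placeholders; adjust them to your setup.
import json
import urllib.request

API_URL = "https://api.lagoon.amazeeio.cloud/graphql"  # assumed endpoint
TOKEN = "<your-lagoon-api-token>"
ENVIRONMENT_ID = 12345  # placeholder environment id

query = """
mutation {
  addOrUpdateEnvironmentService(input: {environment: %d, name: "cli", type: "cli"}) {
    id
    name
    type
  }
}
""" % ENVIRONMENT_ID

payload = json.dumps({"query": query}).encode()
req = urllib.request.Request(
    API_URL,
    data=payload,
    headers={"Content-Type": "application/json", "Authorization": f"Bearer {TOKEN}"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read().decode()))
```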

Report: "Client support availability during Easter Holidays"

Last update
resolved

This incident has been resolved.

monitoring

During the upcoming Easter holidays (Mar 29th 2024 - Apr 1st 2024), amazee.io will continue to offer support, albeit at reduced availability. Our on-call engineers will continue to monitor the platform and the ticketing system. This is a reminder that if you need support, you can create a support ticket (via email, Slack if available, the Support portal, or the chat widget within the amazee.io Lagoon dashboard). For critical or high-severity issues that require more immediate attention, please call the emergency number written down in your contract. Full support services will resume as of Tuesday, Apr 2nd, 2024. From all of us at amazee.io, we wish you a safe and happy holiday break.

Report: "Timeouts during log retrieval"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We're working on getting the log storage operational again. A large amount of data is currently being loaded in the background, which leads to slower response times when retrieving logs.

investigating

Some queries to retrieve logs are currently failing due to timeouts.

Report: "Login to Logs Backend fails with redirect error"

Last update
resolved

The issue has been solved - login to the Logs Backend should work again without issues.

investigating

We're seeing reports from users that login to logs.amazeeio.cloud is currently failing. We're looking into this at the moment.

Report: "Lagoon API Outage"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the situation

identified

We are continuing to work on a fix for this issue.

identified

We've identified the issue and are working on restoration

investigating

We are currently investigating this issue.

Report: "Timeouts on Lagoon Logs"

Last update
resolved

The logging system is fully operational again.

investigating

After rebalancing the data some parts of the logging system failed and we are working on making it fully operational again.

monitoring

We have rolled out some improvements and the logging system is currently rebalancing data.

investigating

We have seen an increase in timeouts for log queries and are working on identifying the root cause of this.

Report: "Emergency Maintenance Window"

Last update
resolved

This incident has been resolved.

identified

We'll be running an emergency maintenance window today within the usual maintenance window times for clusters on AWS infrastructure.

Report: "Intermittent Workload Restarts"

Last update
resolved

We've implemented a fix that should lower the impact of workload restarts. Our team is monitoring the situation and taking action where necessary.

identified

We're investigating workloads being rescheduled intermittently. This only affects a small subset of projects on amazeeio-ch4. This can lead to availability issues on standard availability projects. We've found a cause of this behavior and are working on rolling out a fix for this issue during the maintenance window.

Report: "Scaling activities"

Last update
resolved

The situation is stable. We will resolve this incident here and follow up with a post incident review in the coming days.

monitoring

The original database cluster can only be started in read mode. In accordance with our backup and recovery processes, we promoted the new database cluster with the state of 2024-01-11 03:05 UTC as the new production cluster. We updated all workloads to use this new database cluster. Please note that this does not contain data between 2024-01-11 03:05 UTC and the moment the database cluster went offline (~ 2024-01-11 07:22 UTC). Dumps of the original database with the latest data can be exported and shared on request. A summary of the incident will be shared in the upcoming days. We are sorry for the inconvenience this caused you and your clients. If you have any questions regarding this, please reach out to us.

identified

Recovering the database was interrupted due to an unforeseen issue. We are working with the AWS RDS team to bring the database back online. As an alternative option for recovery we can point single environments to a new database cluster, containing data up until 2024-01-11 03:05 UTC. Please be aware that this option would lead to data loss. If you would like to pursue this route, please contact us through our support channels.

identified

We're making good progress on recovering the database cluster. We're expecting the database cluster to be back online within the next 2 hours.

identified

Recovery is still underway. We're evaluating additional ways to recover from the current situation quicker and restore services.

identified

We're making progress in recovery - We can't give a firm ETA as the recovery speed hasn't settled fully yet. Still in discussions with the AWS RDS team on timings and additional recovery options.

identified

We've identified the issue in the meantime and working on recovering from the outage. We can't give an ETA for now and evaluating several options.

investigating

We're still working with AWS RDS team to investigate the issue and what causes the connectivity issues.

investigating

We're seeing connectivity issues to the database cluster after the scaling operation. We'll involve our upstream provider to look into this issue as well.

investigating

We're seeing issues with the database cluster and are investigating.

monitoring

We observed an increase in resource usage on the shared MySQL cluster on UK3. To account for the increase we are scaling the cluster which will lead to one failover.

Report: "FI2 - Database Load"

Last update
resolved

This incident has been resolved.

identified

We've identified the issue and are limiting the impact on customers. We're continuing to monitor the situation.

investigating

We're currently investigating issues on amazeeio-fi2 related to increased DB load.

Report: "Cluster Scaling Operations"

Last update
resolved

The scaling operations have finished. We're monitoring the situation, but everything looks all clear now.

monitoring

Some clusters had an increase in node count. In order to lower the compute node footprint, we've enabled downscaling on all clusters. This can have an intermittent impact on sites that are not highly available.

Report: "Image registry issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We're seeing some issues on the image registry after yesterday's maintenance - Our team is working on resolving this issue. We're re-running the maintenance tasks to solve those issues.

Report: "AU2 - Database load issues"

Last update
resolved

This incident has been resolved.

monitoring

To handle the increased load, we've scaled the database infrastructure. We're continuing to monitor the situation closely.

investigating

A subset of customers is seeing slowness on database queries. We're looking into the situation and will take action where needed.

Report: "SSH Connectivity Issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and rolled out to all clusters. We're monitoring the situation.

identified

We've identified an issue where SSH connections might fail, and we're working on rolling out a fix for this.

Report: "Increased workload rescheduling"

Last update
resolved

The changes have been effective and rescheduling activities are back to a normal level.

monitoring

The changes have been rolled out during the last maintenance window. We will monitor the workloads closely during the next few hours to verify that the rescheduling activity stays within an expected range.

identified

We're working on a mitigation for the issue at hand. The changes will be rolled out in the upcoming maintenance window and should improve scheduling speed as well as lower the possibility of unplanned workload rescheduling - We'll monitor the situation as soon as the change has been rolled out.

identified

We observed an increase in workload rescheduling and are currently exploring possible fixes for the root cause.

Report: "Site unavailability on DE3"

Last update
resolved

The incident has been resolved.

identified

We've identified the issue and added a workaround - affected sites should have recovered. We are monitoring the situation closely.

investigating

We're seeing reports of sites being unavailable and getting timeouts on amazeeio-de3. Our team is currently investigating based on those reports. This seems to impact only a subset of sites.

Report: "Development Database Scaling - Finland"

Last update
resolved

The development database instance has been scaled successfully.

identified

We've identified that there are workloads impacting the development database performance. We'll scale up the resources, which might lead to a temporary unavailability of the development environments for the FI region during the scaling operation.

Report: "Drupal build failures"

Last update
resolved

This incident has been resolved.

monitoring

Additional resources have been provisioned and recently failing builds are no longer blocked.

identified

Some Drupal builds on US2 are currently failing due to networking issues with an upstream provider. We are provisioning additional resources to prevent further build failures.

Report: "Fastly API Issues"

Last update
resolved

The Upstream issue has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We're currently investigating issues caused by an upstream API issue with Fastly - https://www.fastlystatus.com/incident/376081. Live traffic is not affected; we mostly see this incident causing issues on actions where we integrate with Fastly, e.g. certificate updates, domain updates, or changes to Fastly services.

Report: "Deployments not starting"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We're working on a permanent solution for this issue. Customers whose deployments on US2 are stuck in "New" can contact support to get the stuck deployment fixed.

identified

We have identified the issue. It only blocks a small subset of deployments from progressing. Our engineers are looking into solving the problem.

investigating

Deployments on us2.lagoon are blocked from starting and stay in "New" status. We are looking into resolving this issue. There is no impact on site availability.

Report: "Logging Infrastructure not available"

Last update
resolved

The logging infrastructure is fully operational again.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are working on fully restoring the logging service. Currently, responses might be slow while data is being recovered. Recent logging data will become available in the next few hours.

identified

The issue has been identified and a fix is being implemented.

investigating

We're currently investigating an issue with the logging infrastructure

Report: "Lagoon API unavailability and slowness"

Last update
resolved

We're closing the incident. The earlier-mentioned changes show that API stability is back to normal levels.

monitoring

We have identified the most likely root cause of the slowness and stability issues over the last couple of weeks. We have rewritten the relevant code and deployed, and are monitoring closely. All signs are currently positive, and services are running normally.

identified

We're seeing the issue returning, leading to SSH and API timeouts. We will monitor and work on short term improvements, as required.

monitoring

Performance of the API and Dashboard has improved; the cause was a high volume of messages and requests to be handled by the API. We are continuing to monitor and work on improvements to be able to handle the additional load.

identified

Unfortunately we're seeing problems again with the performance of the API and Dashboard. We're working on identifying the problem and fixing it.

monitoring

We have implemented a fix and we are continuing to monitor the situation.

investigating

We are continuing to investigate this issue.

investigating

Issues with the API have started again and we are investigating.

monitoring

We've put measures in place to stabilize API and the Lagoon Dashboard. There might be slow responses, and our team is working on getting everything back to speed. As we focus on fully resolving this issue, the updates regarding this incident may become less frequent.

identified

We're seeing the issue returning, leading to SSH and API timeouts. Our team is investigating.

identified

The issue has been identified and a fix is being implemented. Some customers might see intermittent SSH connectivity issues.

monitoring

The limitations that have been put in place were successful, and we were able to scale up to the standard capacity of the API. The API and Lagoon Dashboard are operating normally, although customers may encounter intermittent delays in API response times. We continue to monitor the situation and will take appropriate action if needed.

identified

The issue has been identified - Currently, there are a few limitations in place to see how the situation stabilizes and we're working to open up the API, Dashboard and SSH connections to full capacity again.

investigating

We continue to investigate the issue. There might be temporary issues with SSH connections. There seems to be an unusual amount of API request volume, which causes issues with the Lagoon API, Dashboard and SSH connections. We are looking into isolating the problem and putting limits in place.

monitoring

API is stable, but may be slow as things recover

investigating

We are continuing to investigate this issue.

investigating

Currently experiencing degraded API performance

Report: "Isolated connectivity issues"

Last update
resolved

This incident has been resolved.

monitoring

The workloads have been evacuated from the faulty compute host and we are monitoring the connectivity between hosts.

identified

We identified connectivity issues originating from one of the compute hosts and will evacuate workloads running on this host.

Report: "Partial Request Failures"

Last update
resolved

This incident has been resolved.

monitoring

Due to load spikes some requests on uk3 failed. This was mitigated by automated scale ups and we are monitoring the situation.

Report: "Deployment Failures on New Environments Containing Special Characters"

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for this issue.

identified

## Impact

Currently, new environments with consecutive or trailing special characters such as dashes cannot be deployed. We have identified the issue and are working on a permanent solution.

## Workaround

Remove consecutive and trailing special characters from the environment/branch name.

## Example

Invalid: feature--new-ui-
Valid: feature-new-ui
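As a sketch of the workaround, the snippet below strips consecutive and trailing dashes from a branch name before it is used to create an environment. The rule implemented here (collapse repeated dashes, drop leading/trailing dashes) only illustrates the guidance above; it is not the exact validation Lagoon applies.

```python
# Minimal sketch: normalize a branch name per the workaround above by
# collapsing consecutive dashes and stripping leading/trailing dashes.
# This mirrors the example in the report, not Lagoon's exact rules.
import re

def normalize_branch_name(name: str) -> str:
    name = re.sub(r"-{2,}", "-", name)  # collapse consecutive dashes
    return name.strip("-")              # drop leading/trailing dashes

print(normalize_branch_name("feature--new-ui-"))  # -> feature-new-ui
```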

Report: "Intermittent connection issues between CDN and AWS Clusters"

Last update
resolved

After many hours of work together with Fastly and AWS the root cause has been found and resolved. The workaround has been removed in April 2023. Active monitoring over the last weeks shows that the connection issues have been permanently resolved.

monitoring

We are continuing to monitor and trying to find the root cause of this issue; as there are many different engineering teams involved, this takes time. We are, however, very confident that the currently implemented workaround solves the issue, and therefore there should be no impact on customer websites from this issue. We will keep this issue open and update it as soon as we have found the root cause and a permanent resolution.

monitoring

Over the last 7 days, environments that are using the amazee.io CDN (Fastly) and are hosted on AWS clusters have experienced elevated connection issues. While this only affected a very small share (less than 0.01%) of requests, we started to analyze and investigate this issue together with the teams at Fastly and AWS. While we have not found the exact root cause yet, we found a workaround on the AWS load balancers that reduces the connection issues to the level expected from ordinary internet connection issues. We are continuing to monitor this issue and trying to find the root cause together with Fastly and AWS. We will continue to provide updates here.