Historical record of incidents for Metabase Cloud
Report: "Some Metabase Cloud customers hosted in the EU are reporting their site is down or slowness"
Last update: We have identified the issue that caused some EU-based instances to become unresponsive. We are tracking it at https://github.com/metabase/metabase/issues/56148 and have added monitoring to our cloud infrastructure to alert us if the issue reappears while we work on fixing the root cause.
The issue has been resolved for impacted customers. We will be monitoring closely for the next several hours to ensure there is no recurrence.
We are currently investigating this issue.
Report: "Some Metabase Cloud customers hosted in the EU are reporting their site is down or slowness"
Last update: We are currently investigating this issue.
Report: "Some Metabase Cloud customers are reporting their site is down or slowness"
Last update: We are currently investigating this issue.
Report: "Metabase Store unavailable"
Last update: The Metabase Store was intermittently unavailable for an hour due to a problem during a store update.
Report: "Issue with SSL certificates"
Last update:

**Summary**

On Thursday, Nov 9th, the default certificate for the external Nginx ingress was replaced with a dummy certificate. Users attempting to reach their hosted instances would get an invalid certificate error. This was due to a misconfiguration in the new CI for releasing our Helm charts, which used the internal Nginx ingress values files for the external Nginx ingress.

**Impact**

All hosted customers not using custom domains were affected.

**How was the root cause diagnosed?**

We confirmed that the values were properly set in the values files. We then investigated the CI steps to determine why the correct values were not being applied, and found that the wrong values files were being used for the external Nginx ingress.

**How will we prevent this from happening again?**

* Update staging and production values files to match.
* Add a PagerDuty alert for non-custom domains (to catch issues faster).

Ingress controller certificates are now back to normal. You should be able to get the correct certificate in the browser now. We're very sorry for the inconvenience. We'll publish a retrospective about the issue in the next few days.
We're investigating an issue in our cloud with certificate management that's causing browsers to receive incorrect certificates.
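The alerting follow-up in this retrospective could be approximated by a probe that checks which certificate a hosted, non-custom domain actually presents. The following is only a minimal sketch, not Metabase's monitoring: the function names `cert_covers_host` and `probe` are hypothetical, and it assumes the certificate is available in the dict layout returned by Python's `ssl.SSLSocket.getpeercert()`.

```python
import socket
import ssl


def cert_covers_host(cert: dict, hostname: str) -> bool:
    """Return True if any DNS subjectAltName in `cert` matches `hostname`.

    `cert` uses the dict layout returned by ssl.SSLSocket.getpeercert().
    Only a single leading wildcard label ("*.example.com") is supported.
    """
    host = hostname.lower().rstrip(".")
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        pattern = value.lower().rstrip(".")
        if pattern == host:
            return True
        if pattern.startswith("*."):
            # The wildcard stands for exactly one label.
            _, _, parent = host.partition(".")
            if parent and parent == pattern[2:]:
                return True
    return False


def probe(hostname: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Handshake with default verification; False means "raise an alert"."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname) as tls:
                return cert_covers_host(tls.getpeercert(), hostname)
    except (ssl.SSLError, OSError):
        # Untrusted or mismatched certificates fail the handshake itself.
        return False
```

A scheduler could call `probe` for a sample of hosted domains and page on any `False` result. The hostnames and the paging integration are left out because they are not described in the retrospective.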
Report: "EU region not available"
Last update: We had an issue in our EU cluster where instances were not being provisioned due to an autoscaler problem in that region. The issue is now resolved and we'll be monitoring the situation.
Report: "Metabase Store is unavailable due to maintenance"
Last update: We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are performing routine maintenance on the Metabase Store infrastructure. We expect it to be fully operational within the hour.
Report: "Unable to send outbound emails"
Last update: This incident has been resolved.
Metabase Cloud customers are temporarily unable to send outbound emails.
Report: "Unable to create new Metabase cloud instances"
Last update: This incident has been resolved.
Metabase customers trying to create a new cloud trial or cloud instance are unable to do so.
Report: "Unable to create new database connections"
Last update: This incident has been resolved.
Metabase Cloud customers are currently unable to add new databases or edit database connections.
Report: "Google SSO bug affecting LATAM customers"
Last update: We've built and pushed new release artifacts for our Cloud and self-hosted users who are facing this issue. Please refer to our website, Docker Hub, or our GitHub releases page to pull the new artifact, which resolves this issue. Our Cloud instances will be upgraded as soon as possible to restore normal service.
We are aware of an issue in Google identity services that's affecting our customers in LATAM. If you're running Metabase self-hosted or in our Cloud and have Google SSO enabled, you'll see an error on the login screen that prevents you from accessing the product. We will keep monitoring the situation, as we expect it to be fixed soon.
Report: "Problems when creating instances"
Last update: AWS reports that the service should be back to normal now.
There is a problem with our cloud service provider that doesn’t let us create new instances of Metabase. We have identified the problem and we are waiting for the cloud service provider to restore its services.
Report: "Cloud email quota exceeded"
Last update: During the day we hit the quota on our email gateway, which caused subscriptions and alerts for our Cloud-hosted customers to fail. The service is now restored.
Report: "Issue with Metabase Cloud"
Last update: While provisioning new K8S clusters in our AWS environment, DNS and load balancer configurations were accidentally overwritten with incorrect parameters, preventing ingress network access to all Metabase Cloud instances. We have restored the correct configurations and added process controls and automated checks to prevent this from happening in the future.
We have restored the service. We will provide a description of the issue in the next few days.
We identified an issue with our Cloud. We're currently investigating.
Report: "Google SSO might not work in LATAM/other regions"
Last update: We have started rolling out version 44.6, which includes the fix for the SSO issue. All instances should be upgraded in the next few hours.
We have finished the release process for a new Metabase version that includes the fix for the issue. We'll be starting our Cloud upgrade process in the next hour.
On Oct 31st, Google pushed a change to their OAuth2 flows in LATAM and other regions without notice, renaming a parameter and breaking all SSO flows. We're actively working to release a new version of Metabase with a hotfix and to upgrade our Cloud instances. In the meantime, please use username/password authentication if you cannot authenticate to your Metabase instance.
Report: "Issues on our Cloud"
Last update: Service has been restored. We will provide more details about the issue as soon as possible. If you have further questions, please reach out to our support email.
We are continuing to investigate this issue.
We're currently investigating an issue that is causing downtime on our Cloud.
Report: "Emails not sending"
Last update: We've identified that we ran over quota on our Cloud email service, so we requested an increase to the limits. Emails on our instances are now back to normal.
We're aware that emails are not being sent on instances that run on Metabase Cloud. We're investigating the issue.
Report: "AWS Outage"
Last update: This incident has been resolved.
AWS has restored most of their services and as a result, our customers who were experiencing downtime are now operational. We will continue to monitor closely. Please do not hesitate to contact support if you have any questions or experience any issues.
Due to an ongoing issue with the AWS us-east-1 region, a few of our customers are experiencing temporary downtime. We’re monitoring the situation closely.
Report: "Degraded service in our Business tier"
Last update: A failed upgrade of our Elastic Beanstalk deployment, intended to improve the timeout on reverse proxies, left instances unable to revert to the previous version and therefore in an inconsistent state.
This incident has been resolved.
We've deployed a fix and we're monitoring the health of the instances.
We've identified a fix for the outage and we're in the process of issuing an update to resolve the situation.
We identified an issue affecting our Business tier customers that run on isolated environments. Our load balancers appear to be unable to flag instances as healthy, which results in application servers being recycled continuously.
Report: "Issues creating new instances in Metabase Cloud"
Last update: A customer alerted us that instances were not being created in our Cloud environment. After some investigation, our engineers found that an endpoint in our Docker image registry was returning a 500 error status, so customers were receiving emails about instance creation even though no instance had been created. Systems should be back to normal now.
Report: "Database incident"
Last update: Around 16:30 PT we received a report from a customer who could not connect to their Metabase instance. Our engineering team identified an issue with the application database: the clusters weren't available in our cloud service provider (CSP). After further investigation, they found that a cluster resize had not applied correctly due to a mismatch between the configuration scripts and the CSP offering, resulting in downtime. The configuration script was modified to be compatible with the CSP and the database instances came back online.