Cluvio

Is Cluvio down right now? Check whether an outage is currently in progress.

Cluvio is currently Operational

Last checked from Cluvio's official status page

Historical record of incidents for Cluvio

Report: "Reports with R script are not finishing"

Last update
resolved

A fix for the issue has been deployed to our production environments and reports with R script work correctly again.

identified

We have identified an issue with a change deployed today, where reports saved on a dashboard with an R script do not finish query execution. A fix is being implemented and will be deployed tomorrow morning UTC (2022-08-02).

Report: "Investigating an issue rendering dashboard PDF/Image"

Last update
resolved

The issue has been resolved and the PDF/Image rendering is working correctly again.

identified

The root cause has been identified and a fix is being deployed.

investigating

We are currently investigating an issue causing rendering of dashboards as PDF or image to fail

Report: "Sending schedules and exporting as PDF/PNG delayed"

Last update
postmortem

**What happened:** Sending of dashboard schedule and SQL alert emails was delayed for about 2 hours.

**Root cause:** An attacker created a trial account and, by abusing our API, triggered the sending of spam/phishing emails. Our ops team was alerted to the abuse and blocked the account. Rate-limiting capped the number of requests processed, but the job processing queue was still affected, which delayed email sending by about 2 hours.

**What we did to avoid these types of issues in the future:** To prevent this type of abuse, we tightened some aspects of trial account usage (which otherwise has full functionality without any restrictions) regarding email sending for alerts, schedules and user invitations. As a result, abuse attempts like the one that caused this issue will no longer be possible.
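The postmortem does not say how the trial restrictions are implemented. Purely as an illustration, a per-account cap on outgoing email might look like the sketch below. The class name, the limit of 20 emails, and the one-hour window are hypothetical and not taken from Cluvio.

```python
import time
from collections import defaultdict, deque


class TrialEmailQuota:
    """Sliding-window cap on emails sent per account (hypothetical limits)."""

    def __init__(self, max_emails=20, window_seconds=3600.0):
        self.max_emails = max_emails
        self.window_seconds = window_seconds
        # account_id -> timestamps of emails sent within the current window
        self._sent = defaultdict(deque)

    def allow(self, account_id, now=None):
        """Return True and record the send if the account is under its cap."""
        now = time.monotonic() if now is None else now
        sent = self._sent[account_id]
        # Drop timestamps that have fallen out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_emails:
            return False
        sent.append(now)
        return True
```

A check like this would run before an alert, schedule or invitation email is enqueued, so that a single account cannot fill the shared job queue.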

resolved

Email sending and dashboard exporting are working properly again. The slowdown was caused by the dashboard schedule job queue being clogged by an attempt to abuse the service for distributing spam. We have disabled the spamming account and will deploy better protection against this type of abuse later today.

investigating

We are investigating an issue with dashboard schedule emails being delayed and exporting PDF/PNG not completing

Report: "Major service performance degradation and outage"

Last update
postmortem

**What happened:** The primary Postgres database cluster was running at 100% CPU, adversely affecting backend response times and overall performance. The overall duration of the outage was ~6.5 hours (10:54 UTC to 17:30 UTC), with performance intermittently worsening and improving during this period.

**Root cause:** A maintenance job failed to run for several days unnoticed, which pushed the database to its limits in several high-frequency operations. Ultimately, a change in query planner strategy and resource exhaustion caused the DB performance to drop suddenly.

**A secondary major impact** was our inability to notify our customers in a timely manner about the outage in progress. Due to the unfortunate combination of a laptop HW failure and no access credentials for a new member of our ops team, we were not able to update [status.cluvio.com](http://status.cluvio.com) with information about the outage. Additionally, the lack of support staff made it very hard to answer individual support requests while simultaneously working on fixing the core issue behind the outage. As a result, more than a few customers experienced several hours of an outage without any information or responses from the Cluvio side. This was ultimately likely the more significant impact of this incident.

**What we did to fix the issue:** After determining the root cause, we manually applied the necessary maintenance cleanup in several steps, reducing the database load along the way (which temporarily worsened the external performance), increased the DB capacity, and performed a subsequent database optimization.

**What we are going to implement to avoid these types of issues in the future:**

1. We will improve proactive monitoring to verify that ongoing maintenance is being performed correctly, and add early detection of failure symptoms, further improving the checks we have in place (response time degradation, elevated error rates, etc.).
2. We will add clearly defined (and quick to perform) steps for any outage that prioritize updating the operational status on [status.cluvio.com](http://status.cluvio.com) promptly when issues are first detected, and posting frequent updates during any longer-duration issue.
3. We will improve the redundancy of staff for the Ops team (primary / secondary on-call, with necessary redundancy of actual ability to effectively intervene - HW, systems access), and make sure we have support capacity to handle spikes of support requests during business hours.
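The first item above, monitoring that maintenance is actually being performed, is often implemented as a freshness check on each job's last successful run. The sketch below is an illustration under that assumption and is not Cluvio's actual monitoring. The function name and the job names in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_jobs(last_success, max_age, now=None):
    """Return names of jobs whose last successful run is older than max_age.

    last_success maps a job name to the UTC datetime of its last successful
    run, or None if the job has never succeeded.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, finished_at in last_success.items()
        if finished_at is None or now - finished_at > max_age
    )
```

For example, calling `stale_jobs({"db_cleanup": last_run}, timedelta(days=1))` on a schedule and alerting on any non-empty result would flag a job that silently stopped running, rather than waiting for its downstream symptoms to appear.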

resolved

The incident has been resolved and the Cluvio service is back to nominal.

monitoring

We have addressed the core issue behind the slowdown and the service should be back to nominal for all customers. We will continue to monitor the situation closely.

investigating

We are investigating serious performance issues that affect most customers. The cause is still unknown; we will provide an update as soon as we know more.

Report: "Cluvio service outage"

Last update
resolved

The issue has been resolved and the service is fully restored. Apologies for the ~12-minute downtime!

identified

We have identified the root cause of the issue and a fix is being deployed

investigating

We are currently investigating a database issue that affects the availability of Cluvio dashboards and API.

Report: "Query queue processing is causing delays or internal errors for some customers."

Last update
resolved

The query queue processing issues were identified and fixed, query execution now flows correctly again. Apologies for the issues!

investigating

Query queue processing is causing delays or internal errors for some customers.

Report: "Degraded performance of dashboards"

Last update
resolved

This incident has been resolved.

monitoring

The performance is back to normal, we continue to monitor the server traffic and capacity.

monitoring

We are currently investigating reduced performance due to an unusual spike in usage. We are addressing it by increasing server capacity while we investigate the root cause.

Report: "Further performance issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and the performance is back to normal again, we will continue to investigate the root cause to avoid further service disruptions.

investigating

We are investigating performance issues that reappeared.

Report: "Performance issues affecting most users"

Last update
resolved

The performance issues have been resolved. Apologies for the ~22 mins of availability issues!

identified

The issue has been identified and a fix is currently being deployed.

investigating

We are currently investigating performance issues that prevent the Cluvio dashboards from loading for most users.

Report: "Investigating intermittent performance issues on Cluvio backend"

Last update
resolved

This incident has been resolved.

monitoring

Our monitoring shows performance has returned to nominal for all customers, we are continuing to monitor the situation.

identified

We have identified the issue and a fix has been deployed. Performance should return to nominal within the next 10 minutes.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating intermittent performance issues with our main database cluster, affecting the performance of dashboards for most customers

Report: "Intermittent availability issues"

Last update
resolved

This incident has been resolved.

monitoring

The fix has been applied and performance is back to nominal for all customers. We will continue to closely monitor the systems.

identified

We identified the root cause of the issue in a database cluster and a fix is being applied. We expect availability to return to normal within the next 5-10 minutes.

investigating

We are currently investigating server-side problems affecting the availability of Cluvio for most customers.

Report: "Performance degradation"

Last update
resolved

The fix has been deployed and the performance is fully restored. We will continue to monitor the situation, apologies for the issues.

identified

The performance impact was caused by a recently deployed change. We are in the process of deploying a fix, which should finish in ~10 minutes. In the meantime, a database-level mitigation was applied to reduce latency, so performance should be back to normal while the fix is being deployed.

investigating

We are currently investigating reduced performance accessing dashboards

Report: "Intermittent performance issues loading dashboards"

Last update
resolved

The database issue has now been correctly identified and fully resolved. Performance should be back to normal for all customers. We will continue to monitor the situation; apologies for the issues.

identified

We are continuing to investigate performance issues loading and displaying dashboards.

Report: "Long load times for dashboards"

Last update
resolved

We have identified and resolved a performance issue that was causing long load times for dashboards

Report: "Increased error rate and latency"

Last update
resolved

We have resolved the issue with a primary database that was causing the performance degradation and the operation is back to normal. We will continue to monitor the situation, apologies for the issues.

identified

We are investigating an issue that causes increased error rate and latency for Cluvio dashboards.

Report: "Small number of dashboard schedules are sent without rendered PDF"

Last update
resolved

The bugfix has been deployed and all dashboard schedules will work correctly again, we apologize for the inconvenience.

identified

We discovered a bug in new code deployed earlier today that caused a small number of dashboard schedules to be sent without the rendered image / PDF. A bugfix is ready and being deployed.

Report: "Degraded performance of dashboards"

Last update
resolved

We fixed an issue that had been causing random degraded performance when loading dashboards over the last 48 hours. The performance is now back to normal.

Report: "SQL alerts and Dashboard schedules queries failing"

Last update
resolved

We identified and resolved an issue where, for some customers, the queries for SQL alerts and the rendering of dashboards for dashboard schedules were failing with DB authentication errors. We apologize for the issues.

Report: "Cluvio REST API down"

Last update
resolved

This incident has been resolved.

investigating

The issue has been resolved and all services are working again, apologies for the short downtime.

investigating

We are currently investigating an outage causing our backend services API to be down.

Report: "Some Alert and Schedules emails are delayed or not being sent"

Last update
resolved

This incident has been resolved.

identified

The bugfix was deployed and everything is back to working order, we apologise for the problems.

identified

We identified a performance issue that is causing a small number of SQL alerts and dashboard schedules not to be sent. We are deploying a fix; next update in ~20 minutes.

Report: "Intermittent errors"

Last update
resolved

The problem was fixed, we are back to normal again, apologies for the issues!

investigating

We are currently investigating an issue with intermittent errors on backend services.

Report: "Dashboard schedules and alerts not sent in last 6 hours"

Last update
resolved

We have identified a problem with a deployment earlier today, which caused dashboard schedules and SQL alerts not to be sent. This is fully resolved now.

Report: "Web application frontend not loading for some users"

Last update
resolved

The CDN provider notified us that all the issues were resolved, we are back to normal operation.

monitoring

The situation seems to be almost back to normal for most users, we continue to monitor the situation as the CDN provider is finishing the resolution.

identified

Some users are reporting very slow loading times, but ultimately the web application finishes loading (once the assets are loaded, you can use Cluvio without issues, as long as you navigate with regular links and avoid full page reloads)

identified

We are experiencing issues with our CDN provider, where for some users the frontend assets for the web application are not loading. We have reported the issue and are monitoring the situation, and will post an update when we know more.

Report: "Degraded performance due to CDN performance issues"

Last update
resolved

And we are back at full speed, thanks for the patience!

monitoring

The recent short outage is still causing ripples on the CDN side while the provider recovers. For some time, the web application load times will likely be (much) worse than usual.

Report: "Problems loading frontend due to CDN partial outage"

Last update
resolved

The service is fully restored, apologies for the ~15 minutes of downtime.

monitoring

Our CDN provider has a partial outage in progress, currently causing our web frontend to not load. Our backend API and query execution clusters are operating normally, so this outage only affects those who do not have Cluvio already open in their browser. Rendering of PDFs and images as part of dashboard schedules is also affected; dashboard schedules are being sent without the PDF / images at the moment. We are monitoring the situation and will update as things progress.

Report: "Query execution slow"

Last update
resolved

An issue with executor cluster auto-scaling affected some user accounts with query executions being stuck in the queue and as a result very slow to complete.

Report: "Rolling back failed deployment"

Last update
resolved

We are back on, apologies for the short downtime.

identified

We are currently rolling back an unsuccessful deployment of several backend services. Expected downtime is less than 10 minutes.

Report: "Short downtime during scheduled maintenance"

Last update
resolved

Scheduled maintenance resulted in a brief, unplanned unavailability (5 minutes) of the Cluvio web app and APIs.

Report: "Intermittent errors"

Last update
resolved

The incident has been resolved, all systems are back, apologies for the issues experienced.

investigating

We are investigating intermittent errors on the Cluvio API, affecting API access and the Cluvio web application users.

Report: "Degraded performance"

Last update
resolved

We are back to full speed now, apologies for the issues.

identified

We have identified a problem with database performance that is negatively affecting app responsiveness. Our team is working on mitigating the problem; full functionality should be back in ~30 minutes.

Report: "Deployment causing problems with backend API"

Last update
resolved

Problems resolved, all functionality is back up

identified

We are currently resolving an outage caused by a deployment in the EU region. Estimated downtime is less than 20 minutes.