Historical record of incidents for Cluvio
Report: "Reports with R script are not finishing"
Last update: A fix for the issue has been deployed to our production environments and reports with an R script work correctly again.
We have identified an issue with a change deployed today, where reports saved on a dashboard with an R script do not finish query execution. A fix is being implemented and will be deployed tomorrow morning UTC (2022-08-02).
Report: "Investigating an issue rendering dashboard PDF/Image"
Last update: The issue has been resolved and the PDF/Image rendering is working correctly again.
The root cause has been identified and a fix is being deployed.
We are currently investigating an issue causing rendering of dashboards as PDF or image to fail
Report: "Sending schedules and exporting as PDF/PNG delayed"
Last update:
**What happened:** Sending of dashboard schedule and SQL alert emails was delayed for about 2 hours.
**Root cause:** An attacker created a trial account and, by abusing our API, triggered the sending of spam/phishing emails. Our ops team was alerted to the abuse and blocked the account. Rate limiting capped the number of requests processed, but email sending was still delayed by about 2 hours because the job processing queue was affected.
**What we did to avoid these types of issues in the future:** To prevent this type of abuse, we tightened some aspects of trial account usage (which otherwise has full functionality without any restrictions) regarding email sending for alerts, schedules and user invitations. As a result, abuse attempts like the one that caused this issue are no longer possible.
The email sending and dashboard exporting are working properly again. The slowdown was caused by the dashboard schedule job queue being clogged by an attempt to abuse the service for distributing spam. We have disabled the spamming account and will deploy better protection against this type of abuse later today.
We are investigating an issue with dashboard schedule emails being delayed and exporting PDF/PNG not completing
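The tightening of trial-account email sending described in the post-mortem above could look roughly like the following. This is a minimal sketch, not Cluvio's actual implementation: the class name `EmailQuota`, the per-day limit, and the in-memory counter are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class EmailQuota:
    """Hypothetical per-account daily cap on emails sent by trial accounts."""

    def __init__(self, trial_limit_per_day=20):
        # The limit value is an assumption made for this example.
        self.trial_limit_per_day = trial_limit_per_day
        self._sent = defaultdict(int)  # (account_id, day) -> emails sent

    def allow(self, account_id, is_trial, day):
        """Return True if the account may send one more email on `day`."""
        if not is_trial:
            return True  # paid accounts are not capped in this sketch
        key = (account_id, day)
        if self._sent[key] >= self.trial_limit_per_day:
            return False
        self._sent[key] += 1
        return True


quota = EmailQuota(trial_limit_per_day=2)
results = [quota.allow("trial-1", True, "2022-01-01") for _ in range(3)]
```

Rejecting the request before it is enqueued keeps abusive traffic out of the shared job queue, so legitimate schedules and alerts are not delayed behind it.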
Report: "Major service performance degradation and outage"
Last update:
**What happened:** The primary Postgres database cluster was running at 100% CPU, adversely affecting backend response times and overall performance. The overall duration of the outage was ~6.5 hours (10:54 UTC to 17:30 UTC), with performance intermittently worsening and improving during this period.
**Root cause:** A maintenance job failed to run for several days unnoticed, which pushed the database to its limits in several high-frequency operations. Ultimately, a change in query planner strategy and resource exhaustion caused the DB performance to drop suddenly.
**A secondary major impact** was our inability to notify our customers about the ongoing outage in a timely manner. Due to the unfortunate combination of a laptop HW failure and missing access credentials for a new member of our ops team, we were not able to update [status.cluvio.com](http://status.cluvio.com) with information about the outage. Additionally, the lack of support staff made it very hard to answer individual support requests while simultaneously working on fixing the core issue. As a result, more than a few customers experienced several hours of outage without any information or response from Cluvio. This was ultimately likely the more significant impact of this incident.
**What we did to fix the issue:** After determining the root cause, we manually applied the necessary maintenance cleanup in several steps, reducing the database load along the way (which temporarily worsened the external performance), increased the DB capacity, and performed a subsequent database optimization.
**What we are going to implement to avoid these types of issues in the future:**
1. We will improve proactive monitoring to verify that ongoing maintenance is being performed correctly, and add early detection of failure symptoms, further improving the checks we have in place (response time degradation, elevated error rates, etc.).
2. We will add clearly defined (and quick to perform) steps for any outage that prioritize updating the operational status on [status.cluvio.com](http://status.cluvio.com) as soon as issues are first detected, with frequent updates during any longer-duration issue.
3. We will improve staff redundancy, both for the Ops team (primary / secondary on-call, with the necessary redundancy in the actual ability to intervene effectively: HW, systems access) and for support, making sure we have the capacity to handle spikes of support requests during business hours.
The incident has been resolved and the Cluvio service is back to nominal.
We have addressed the core issue of the slowdown and the service should be back to nominal for all customers. We will continue to monitor the situation closely.
We have been investigating serious performance issues that affect most customers. The cause is still unknown; we will provide an update as soon as we know more.
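The first improvement in the post-mortem above (monitoring that maintenance is actually being performed) can be illustrated with a simple freshness check. This is a minimal sketch under stated assumptions: it presumes the maintenance job records the timestamp of its last successful run somewhere the monitor can read it. The function name and the 26-hour threshold are hypothetical and not taken from Cluvio's systems.

```python
from datetime import datetime, timedelta, timezone


def maintenance_is_stale(last_success, now, max_age=timedelta(hours=26)):
    """Return True when the job has never succeeded or its last success is too old.

    `max_age` is an assumed threshold: a daily job plus a small grace period.
    """
    return last_success is None or now - last_success > max_age


now = datetime(2022, 1, 10, 12, 0, tzinfo=timezone.utc)
ran_recently = maintenance_is_stale(now - timedelta(hours=3), now)
missed_for_days = maintenance_is_stale(now - timedelta(days=4), now)
never_ran = maintenance_is_stale(None, now)
```

Alerting on the absence of a recent success, rather than on an explicit failure, also catches the case where the job silently stops being scheduled at all.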
Report: "Cluvio service outage"
Last update: The issue has been resolved and the service is fully restored. Apologies for the ~12-minute downtime!
We have identified the root cause of the issue and a fix is being deployed
We are currently investigating a database issue that affects the availability of Cluvio dashboards and API.
Report: "Query queue processing is causing delays or internal errors for some customers."
Last update: The query queue processing issues were identified and fixed; query execution now flows correctly again. Apologies for the issues!
Query queue processing is causing delays or internal errors for some customers.
Report: "Degraded performance of dashboards"
Last update: This incident has been resolved.
The performance is back to normal. We continue to monitor the server traffic and capacity.
We are currently investigating reduced performance due to an unusual spike in usage, which we are addressing with increased server capacity while investigating the root cause.
Report: "Further performance issues"
Last update: This incident has been resolved.
A fix has been implemented and the performance is back to normal again. We will continue to investigate the root cause to avoid further service disruptions.
We are investigating performance issues that reappeared.
Report: "Performance issues affecting most users"
Last update: The performance issues have been resolved. Apologies for the ~22 mins of availability issues!
The issue has been identified and a fix is currently being deployed.
We are currently investigating performance issues that cause the Cluvio dashboards to not load for most users.
Report: "Investigating intermittent performance issues on Cluvio backend"
Last update: This incident has been resolved.
Our monitoring shows performance has returned to nominal for all customers. We are continuing to monitor the situation.
We have identified the issue and a fix has been deployed. The performance should return to nominal in the next 10 minutes.
We are continuing to investigate this issue.
We are currently investigating intermittent performance issues with our main database cluster, affecting the performance of dashboards for most customers
Report: "Intermittent availability issues"
Last update: This incident has been resolved.
The fix has been applied and performance is back to nominal for all customers. We will continue to closely monitor the systems.
We identified the root cause of the issue in a database cluster and a fix is being applied. We expect availability to be back to normal in the next 5-10 minutes.
We are currently investigating server-side problems affecting the availability of Cluvio for most customers.
Report: "Performance degradation"
Last update: The fix has been deployed and the performance is fully restored. We will continue to monitor the situation. Apologies for the issues.
The performance impact was caused by a recently deployed change. We are in the process of deploying a fix, which should finish in ~10 minutes. In the meantime, a database-level mitigation was performed to reduce the latency, so performance should remain normal until the fix is deployed.
We are currently investigating reduced performance accessing dashboards
Report: "Intermittent performance issues loading dashboards"
Last update: The database issue has now been correctly identified and fully resolved. The performance should be back to normal for all customers. We will continue to monitor the situation. Apologies for the issues.
We are continuing to investigate performance issues loading and displaying dashboards.
Report: "Long load times for dashboards"
Last update: We have identified and resolved a performance issue that was causing long load times for dashboards.
Report: "Increased error rate and latency"
Last update: We have resolved the issue with a primary database that was causing the performance degradation, and operation is back to normal. We will continue to monitor the situation. Apologies for the issues.
We are investigating an issue that causes increased error rate and latency for Cluvio dashboards.
Report: "Small number of dashboard schedules are sent without rendered PDF"
Last update: The bugfix has been deployed and all dashboard schedules will work correctly again. We apologize for the inconvenience.
We discovered a bug in new code deployed earlier today that caused a small number of dashboard schedules to be sent without the rendered image / PDF. A bugfix is ready and being deployed.
Report: "Degraded performance of dashboards"
Last update: We fixed an issue that had been causing randomly degraded performance when loading dashboards over the last 48 hours. The performance is now back to normal.
Report: "SQL alerts and Dashboard schedules queries failing"
Last update: We identified and resolved an issue where, for some customers, the queries for SQL alerts and the rendering of dashboards for dashboard schedules were failing with DB authentication errors. We apologize for the issues.
Report: "Cluvio REST API down"
Last update: This incident has been resolved.
The issue has been resolved and all services are working again. Apologies for the short downtime.
We are currently investigating an outage causing our backend services API to be down.
Report: "Some Alert and Schedules emails are delayed or not being sent"
Last update: This incident has been resolved.
The bugfix was deployed and everything is back to working order. We apologise for the problems.
We identified a performance issue that is causing a small number of SQL alerts and dashboard schedules to not be sent. We are deploying a fix; next update in ~20 minutes.
Report: "Intermittent errors"
Last update: The problem was fixed and we are back to normal again. Apologies for the issues!
We are currently investigating an issue with intermittent errors on backend services.
Report: "Dashboard schedules and alerts not sent in last 6 hours"
Last update: We have identified a problem with a deployment earlier today, which caused dashboard schedules and SQL alerts not to be sent. This is fully resolved now.
Report: "Web application frontend not loading for some users"
Last update: The CDN provider notified us that all issues have been resolved; we are back to normal operation.
The situation seems to be almost back to normal for most users. We continue to monitor it as the CDN provider finishes the resolution.
Some users are reporting very slow loading times, but ultimately the web application finishes loading (once the assets are loaded, you can use Cluvio without issues, as long as you navigate with regular links and avoid full page reloads)
We are experiencing issues with our CDN provider, where for some users the frontend assets for the web application are not loading. We reported the issue and are monitoring the situation, will update as we know more.
Report: "Degraded performance due to CDN performance issues"
Last update: And we are back at full speed, thanks for the patience!
The recent outage is still causing ripples on the CDN side as they recover from the short outage. For some time, web application load times will likely be (much) worse than usual.
Report: "Problems loading frontend due to CDN partial outage"
Last update: The service is fully restored, apologies for the ~15 minutes of downtime.
Our CDN provider has a partial outage in progress, currently causing our web frontend to not load. Our backend API and query execution clusters are operating normally, so this outage only affects those who do not have Cluvio already open in their browser. Rendering of PDFs and images as part of dashboard schedules is also affected; dashboard schedules are being sent without the PDF / images at the moment. We are monitoring the situation and will update as things progress.
Report: "Query execution slow"
Last update: An issue with executor cluster auto-scaling affected some user accounts: query executions were stuck in the queue and, as a result, very slow to complete.
Report: "Rolling back failed deployment"
Last update: We are back online, apologies for the short downtime.
We are currently rolling back an unsuccessful deployment of several backend services. Expected downtime is less than 10 minutes.
Report: "Short downtime during scheduled maintenance"
Last update: Scheduled maintenance resulted in a brief, unplanned unavailability (5 minutes) of the Cluvio web app and APIs.
Report: "Intermittent errors"
Last update: The incident has been resolved and all systems are back. Apologies for the issues experienced.
We are investigating intermittent errors on the Cluvio API, affecting API access and the Cluvio web application users.
Report: "Degraded performance"
Last update: We are back to full speed now. Apologies for the issues.
We have identified a problem with database performance that is negatively affecting app responsiveness. Our team is working on mitigating the problem; full functionality should be back in ~30 minutes.
Report: "Deployment causing problems with backend API"
Last update: Problems resolved, all functionality is back up.
We are currently resolving an outage caused by a deployment in the EU region. Estimated downtime is less than 20 minutes.