Historical record of incidents for Upstash
Report: "QStash: Degraded performance"
Last update: We are currently investigating this issue.
Report: "Degraded Performance"
Last update: We are currently investigating this issue.
Report: "Performance Degradation"
Last update: We are currently investigating this issue.
Report: "Global Ireland (eu-west-1) Degraded Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Report: "Global Ireland (eu-west-1) Degraded Performance"
Last update: A fix has been implemented and we are monitoring the results.
Report: "QStash: Degraded performance"
Last update: An internal cleanup task for QStash events, coinciding with a disk-layer compaction task, caused performance degradation for some users. Mitigation: The team mitigated the event by pausing some of these tasks and monitored the status for a while. Fixes: Improvements are being applied to these tasks so they use resources more gracefully. Disk resources have been increased to handle a bigger burst of load.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "QStash: Degraded performance"
Last update: We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "QStash: Degraded performance in request processing and event logs"
Last update:

### Product: QStash

### Incident Summary
Due to high load, the volume of QStash event logs reached a point that caused latency in the underlying data store operations. Event log creation slowed down and led to performance degradation in QStash request processing. To resolve the performance degradation in QStash requests, the event logging module was turned off temporarily. After deploying a hotfix and configuration changes, we eventually turned event logging back on and the system returned to a stable state.

### Root Cause
At 07:15 UTC we received alerts about the performance degradation and started the investigation. We discovered long-running queries for syncing event logs from the main QStash servers to the QStash event server. To resolve the performance degradation, we turned off the event logging functionality as an immediate action. This brought performance back to normal levels for QStash requests but left event log processing disabled. We deployed a hotfix during the day to remove some redundant calls and alleviate the impact. Around 16:20 UTC, we observed another performance degradation on QStash requests due to a load increase, and disabled event log processing again. In the following hours, we deployed a configuration change to relax the job interval durations for the event log tasks and turned event logging back on. This configuration change resolved the performance issues without any further problems.

### Impact
During the problematic timeframes, when slow event log processing was observed, QStash requests experienced high latency, which caused timeouts for customers. No events were lost. Duplicate event deliveries were observed due to a number of restarts during the incident.

### Resolution
Improvements have been applied to the event logging module to prevent the same issue from happening again. We have also planned to upgrade the underlying disks to stronger models.
QStash service and Event logs are fully functional without any remaining issues.
Monitoring: Main QStash service is back to normal. Event logs service is back online but events will lag by a few minutes.
Main QStash service is back to normal. Event logs are still temporarily unavailable.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "QStash: Degraded performance in request processing and event logs"
Last updateQStash service and Event logs are fully functional without any remaning issues.
Monitoring: Main QStash service is back to normal. Event logs service is back online but events will be lagging a few mins.
Main QStash service is back to normal. Event logs are still temporarily unavailable.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Performance degradation on QStash"
Last update: We are currently investigating this issue.
Report: "Performance degradation on QStash"
Last update: At 09:43 UTC, QStash experienced degraded service when a high number of requests to a specific domain were throttled, resulting in timeouts during an unexpected phase of TCP connection establishment. These requests and the resulting retries consumed excessive network resources and negatively impacted all users. We added more resources to QStash as a quick remediation and, as the resolution, improved the timeout mechanism to detect such cases and fail faster.
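For readers unfamiliar with the fail-fast pattern referenced in the resolution above, the general idea is to bound the whole request, including connection establishment, with an explicit deadline instead of waiting on default TCP timeouts. The sketch below is purely illustrative; the helper name, URL, and 2-second deadline are assumptions, not QStash internals.

```typescript
// Illustrative fail-fast HTTP call: abort if the request (including the
// TCP/TLS connection phase) does not complete within a short deadline.
async function fetchWithDeadline(url: string, timeoutMs = 2000): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // fetch rejects with an AbortError once the controller aborts,
    // so a stalled connection attempt fails fast instead of hanging.
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical usage:
// const res = await fetchWithDeadline("https://example.com/webhook", 2000);
```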
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Regional AWS eu-west-1 Cluster Performance Degradation Issue"
Last update:

### Incident Summary
During a maintenance update to the regional Upstash Redis databases in AWS eu-west-1, several databases hosted in that region unnecessarily triggered a full synchronisation between their primary and backup replicas.

### Root Cause
A full synchronisation invalidates all data in the target replica and starts a fresh re-population from the source replica. Under normal circumstances, full synchronisation is required only when data integrity is lost in one of the replicas, which was not the case here.

### Impact
This incident impacted the performance of regional databases on AWS eu-west-1 only. The full synchronisation caused very high CPU load and degraded performance on some of the databases that have a replica in this region. Moreover, our system throttles databases that are going through this operation to allocate more CPU to the synchronisation so it finishes sooner. No data was lost and no consistency was violated.

### Resolution
As a quick remediation, we unthrottled the affected databases at 15:06 UTC and enabled more throughput; however, high latency was still observed until the full synchronisation completed at 21:23 UTC. A fix has been prepared to avoid this unnecessary full synchronisation on regional databases and will be deployed shortly. This issue is not present on Upstash Global databases, which are our new-generation infrastructure and now our default offering. We will reach out to our regional users about how to migrate to Upstash Global going forward.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Regional AWS eu-west-1 cluster is experiencing performance degradation, and we are adding more resources to the cluster.
Report: "We experienced a very short period of API downtime for the incoming requests to QStash due to urgent maintenance to ensure the stability and performance of our services. Our team acted quickly to address the issue, and everything is now fully operational."
Last update: We experienced a very short period of API downtime for the incoming requests to QStash due to urgent maintenance to ensure the stability and performance of our services. Our team acted quickly to address the issue, and everything is now fully operational.
Report: "QStash Workflow Run Failure"
Last update: The latest QStash release caused some Workflow runs to fail due to a bug in the Workflow URL detection mechanism. Affected workflows did not start at all and moved directly to the DLQ. These workflow runs, which show a "detected non-workflow destination" message in the response body, can be retried from the DLQ. This was a partial failure lasting from 10:10 to 11:00 UTC; not all users' workflows were affected.
Report: "Disk failure in some Regional Databases"
Last update: We observed a disk failure on some instances at 16:52 UTC and restarted the affected components to bring them back online. The issue was resolved at 17:13 UTC. During the incident, database reachability was affected; no data was lost.
Report: "Performance degradation on QStash"
Last update:

**Product:** QStash
**Impact:** Degraded performance, delayed processing of events, and duplicate event deliveries for some customers

## Incident Summary
QStash experienced an incident marked by a sudden and extreme load on our servers. This caused a degradation in performance, with extremely high latency for event processing for all users. We also noticed some events being delivered multiple times to some users. To mitigate the high load, we increased capacity as our initial response while the investigation proceeded. Eventually, fixes for the issues were confirmed with an issue reproducer and deployed to production.

## Root Cause Analysis
In a certain type of usage, failure handling via [failureFunction](https://upstash.com/docs/workflow/basics/serve#failurefunction) can cause recursive calls, which leak tasks into the queue and put severe load on the QStash servers. This also triggered an edge case that caused some events to be delivered multiple times.

## Resolution
Two hotfixes to the QStash processes were deployed:
1. Prevent recursive calls within the failure function.
2. Eliminate duplicate deliveries while keeping the "at least once delivery" guarantee.

These were verified to resolve the root cause, normalizing server load and restoring standard event processing operations.

## Impact on Customers
High latency of event processing was observed for all users. Some users received duplicate event deliveries. No events were lost, and all were delivered as part of our "at least once delivery" guarantee. Customers do not need to take any corrective action, as workflows have returned to normal and preventive fixes have been deployed.
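As context for the root cause above, the fix implies that a failureFunction should not itself publish or enqueue new QStash work, since a failure inside the handler could otherwise re-trigger runs recursively. Below is a minimal sketch with the @upstash/workflow Next.js adapter; the route contents, step name, and logging are illustrative assumptions rather than the code involved in this incident, and the exact callback signature is documented at the failureFunction link above.

```typescript
// Hypothetical app/api/workflow/route.ts using @upstash/workflow.
import { serve } from "@upstash/workflow/nextjs";

export const { POST } = serve(
  async (context) => {
    // Normal workflow steps run here.
    await context.run("example-step", async () => {
      // ... business logic (placeholder) ...
    });
  },
  {
    // Invoked when the workflow run ultimately fails.
    // Keeping this handler free of calls that publish or enqueue new
    // QStash work avoids the recursive-call pattern described above.
    failureFunction: async (failureInfo) => {
      console.error("workflow run failed", failureInfo);
    },
  }
);
```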
We will be sharing a postmortem about the incident soon.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Performance degradation on QStash"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Performance degradation on QStash"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Our data processing infrastructure is running behind. No data has been lost and the system should be caught up shortly.
Report: "QStash API Not Reachable"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Partial Degraded Performance"
Last update: Some of the databases in the us-east-1 region might have experienced increased latencies.
Report: "Degraded availability on Vector us-east-1"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Degraded performance on us-east-1 Kafka"
Last update: This incident has been resolved.
We are working on a fix and monitoring the cluster performance.
We are currently investigating this issue.
Report: "Maintenance - Global Ap-Southeast-1"
Last update: We have taken actions to increase the capacity of the region. During the operation, clients might have experienced higher than usual latencies for about 15 minutes.
Report: "Connectivity Issues - Global eu-central-1"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Connectivity Issues - Global Ap-Southeast-1"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Degraded Availability in Global Us-East-1"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Degraded Availability on Redis"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Some databases are not reachable. We are currently investigating the issue.
Report: "Client connections are intermittently dropping"
Last update: This incident has been resolved.
We found that a small number of database connections were intermittently dropping. We have applied a fix and are observing now.
Report: "Degraded Availability on QStash"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
An issue in the persistence layer is causing degraded availability for QStash. We have identified the cause and are working on a fix right now.
Report: "Partial Downtime for REST Clients on Kafka EU-WEST-1 Clusters"
Last update: We regret to inform you that we experienced partial downtime during our recent maintenance. This downtime specifically impacted our REST clients. We apologize for any inconvenience caused and assure you that our team has fixed the connectivity issues.
Report: "Degraded performance on QStash"
Last update: QStash experienced degraded performance. After receiving an alert from our monitoring system, our team intervened and restored stability.
Report: "Elevated latency in us-east-1"
Last update: This incident has been resolved.
We are monitoring the status now.
We have identified the issue and applied the fix.
We are currently experiencing elevated latency in us-east-1 and are investigating the issue. We will share updates as they become available.
Report: "Degraded performance on QStash"
Last update: A surge in user activity resulted in unusually high traffic, causing a temporary disruption of the QStash service.
Report: "Degraded performance on QStash"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Degraded performance on Stash"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Degraded performance on QStash.
Report: "Degraded performance on AWS EU-WEST-1 region"
Last update: This incident has been resolved.
We have identified a heavy system load on some of the database servers. We are adding new machines to the pool to prevent this from happening again. Some databases may have observed high latencies or short disconnections during the event.
Report: "QStash Unavailable"
Last update: QStash had a short period of unavailability. More resources were automatically allocated to the instance during this time. We are taking steps to optimize resource availability by allocating further resources.
Report: "QStash Unavailable"
Last update: We have allocated more resources to the instance. QStash is stable.
QStash had a short period of unavailability. More resources were automatically allocated to the instance during this time. We are taking steps to optimize resource availability by allocating further resources.
Report: "Degraded performance on Kafka eu-west-1 cluster"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the cluster.
We have identified the problem and are waiting for a resolution from our cloud provider.
We are experiencing a cloud-provider-related problem at the moment and are investigating.
Report: "Database Backup/Restore Performance Degradation on AWS US-EAST-1 Cluster"
Last update: This incident has been resolved.
We have observed degraded performance on database backup/restore operations on the AWS US-EAST-1 cluster. The functionality has been temporarily disabled.
Report: "US-EAST-1 Free Tier Rest Service Outage"
Last update: There was a partial outage in the REST service in one of the us-east free tier clusters.
Report: "US-EAST-1 Free Tier Rest Service Outage"
Last updateThere was a partial outage in the REST service in one of the us-east free tier clusters
Report: "This is an example incident"
Last updateWhen your product or service isn’t functioning as expected, let your customers know by creating an incident. Communicate early, even if you don’t know exactly what’s going on.
Empathize with those affected and let them know everything is operating as normal.
As you continue to work through the incident, update your customers frequently.
Let your users know once a fix is in place, and keep communication clear and precise.