Historical record of incidents for InfluxDB Cloud
Report: "Elevated Query Error Rate"
Last update: We have observed an elevated query error rate and are investigating.
Report: "Query performance degradation in Azure eu-west"
Last update: We are currently investigating a potential issue in the Azure EU-West cluster. The cluster appears to be degraded, and we’re observing slow query performance.
Report: "Management Service Availability User Interface and API"
Last update: We are currently investigating availability of the management user interface and API.
Report: "GCP Query Availability"
Last update: We are continuing to investigate this issue.
We are currently investigating query availability in GCP regions.
Report: "Cloud2 and Serverless Query and Write API Availability Issues AWS EU-Central-1"
Last update:

**RCA - Query and Write Outage on May 28, 2025**

**Summary**

On May 28, 2025, at 14:33 UTC, we deployed an infrastructure change to multiple production InfluxDB Cloud (TSM and IOx) clusters. At 15:07 UTC, the continuous deployment (CD) process required a manual step to update an immutable Kubernetes resource. This step involves deleting and recreating the resource. Processes like this are not uncommon and have been executed successfully in staging environments without issue.

Shortly after the deployment completed, an alert at 15:09 UTC indicated that certain microservices were unable to authenticate with each other across multiple production clusters. These services are critical to data ingestion and querying, and the authentication failure prevented them from processing incoming writes and queries. Writes and queries began to fail at 15:09 UTC and remained unavailable until authentication paths were restored on each affected cluster. The outage duration differed for each cluster because each cluster was examined individually to confirm our diagnosis, apply a fix, and ensure all services restored communication properly. The full outage timeframe ranged from 15:09 UTC to 17:43 UTC, with the first cluster fully recovering at 17:04 UTC. We define a full recovery as the point when a cluster can accept writes, return queries, and resume normal software deployments.

**Cause of the Incident**

Our software is deployed via a CD pipeline to three staging clusters (one per cloud provider), where a suite of automated tests is run. If the tests pass, the software is deployed simultaneously to all production clusters. This is our standard cloud service deployment process.

On May 28, 2025, an engineer submitted a Kubernetes infrastructure change aimed at improving the resiliency of the service responsible for secure application authentication. The change was peer reviewed and successfully applied, and initial deployments across several clusters appeared to complete without issue. We were alerted to a malfunction when the change landed on larger, more active clusters.

Upon investigation, we found that while the manual CD step to remove and recreate the Kubernetes resource completed successfully, it did not behave as expected. After Kubernetes 1.29, certain service account tokens are no longer generated automatically. As a result, the token required for microservices to authenticate with each other was missing, preventing them from processing writes and queries.

**Investigation and Recovery**

Our initial focus was to verify that the PR itself did not tamper with the resource. Next, we examined the CD mechanics to ensure that the CD platform and pipeline were performing as designed. Our attention then turned to Kubernetes itself, as the resource was missing in-cluster yet remained in our IaC (Infrastructure-as-Code) repository. Further investigation revealed that the missing resource was the result of a deliberate change in a recent Kubernetes release.

Once we identified the root cause, we quickly mitigated the outage by recreating and applying the new resource to each cluster. As soon as this mitigation change was deployed within each cluster, success rates improved from 0 percent to 100 percent within minutes. Each cluster was examined to ensure that failure rates returned to pre-incident levels.
We verified that all cluster components were consuming the new resource, and each cluster was peer reviewed to confirm full functionality before being marked as restored (both internally and on the InfluxData status page).

**Future mitigations**

We are implementing several measures to reduce the likelihood of a similar incident in the future:

1. **Alert on or increase telemetry of secret-related deletion.** We are examining improvements to monitoring and alerts for the deletion or absence of critical authentication tokens.
2. **Isolate and stage infrastructure changes to critical systems.** Infrastructure changes impacting key services will be implemented in stages. Time-based staging will be added between cluster deployments, along with further validation checks at each step.
3. **Ongoing investigation and continued hardening.** We are continuing to investigate additional contributing factors and will implement further mitigation steps as they are identified.

[https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#control-plane-details](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#control-plane-details)
[https://status.influxdata.com](https://status.influxdata.com)
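For readers who want to see the shape of the fix described above: because newer Kubernetes releases no longer auto-generate long-lived ServiceAccount token Secrets, the token has to be created explicitly. The sketch below is illustrative only, not InfluxData's actual manifest; the `app-auth` ServiceAccount and `platform` namespace are hypothetical placeholders. It uses the documented mechanism of creating a Secret of type `kubernetes.io/service-account-token` annotated with the ServiceAccount name, via the Python Kubernetes client.

```python
# Minimal sketch (not InfluxData's actual manifest): explicitly create the
# long-lived ServiceAccount token Secret that newer Kubernetes no longer
# auto-generates. "app-auth" and "platform" are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod

token_secret = client.V1Secret(
    metadata=client.V1ObjectMeta(
        name="app-auth-token",
        namespace="platform",
        # This annotation binds the Secret to the ServiceAccount; the control
        # plane then populates the Secret with a signed token.
        annotations={"kubernetes.io/service-account.name": "app-auth"},
    ),
    type="kubernetes.io/service-account-token",
)

client.CoreV1Api().create_namespaced_secret(namespace="platform", body=token_secret)
```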
All regions fully back online. A full RCA will be provided as soon as it is completed.
AWS EU-Central is now operational. We are continuing to monitor
All regions except EU Central are now operational. Work continues on EU-Central
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Cloud2 and Serverless Query and Write API Availability Issues AWS EU-Central-1"
Last updateAll regions fully back online. A full RCA will be provided as soon as it is completed.
AWS EU-Central is now operational. We are continuing to monitor
All regions except EU Central are now operational. Work continues on EU-Central
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Cloud Dedicated - Management API is unavailable"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Cloud Dedicated - Management API is unavailable"
Last updateThis incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Increase in query errors reported in Azure west europe"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Increased rate of InfluxQL V1 API errors for prod01-us-west-2 only"
Last update: The issue has now been resolved; the service is stable and working as expected.
Our engineers are currently working on this issue.
We are seeing an increased rate of InfluxQL V1 API errors for prod01-us-west-2 only.
Report: "Increase in query errors in AWS US-East-1"
Last update: This incident has been resolved, and query operations have returned to normal.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are still investigating the issue and will continue to update here.
We are currently investigating this issue.
Report: "InfluxDB Cloud 2 signups are down, but logins are working."
Last update: The issue has been resolved, and signups are now back to normal.
We have identified the issue and implemented a fix. We will continue to monitor the situation
Signups are working again. We are continuing to monitor while we identify the root cause
We're aware that InfluxDB Cloud 2 signups are down, and we're currently investigating the issue. Logins are working as expected.
Report: "Degraded reads in us-west-2"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Degraded query performance issues are resolving - the team is continuing to monitor the recovery
The issue has been identified and a fix is being implemented.
In the Cloud2 us-west-2 region, we identified degraded reads and have implemented a solution
Report: "Cloud Dedicated Upgrade Issue"
Last update: All services have been recovered. We apologize for the service disruption, and will be publishing a full RCA when we have completed our internal review of what happened today.
All services have been recovered, but we are continuing to monitor performance.
We are working to recover the affected components.
We are continuing to investigate this issue.
We have an issue with an upgrade to our infrastructure that we are reverting
Report: "Increased TTBR and degradation to some queries in us-east-1"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are aware of a potential increase in TTBR and query degradation in the us-east-1 region. Our team is actively investigating the issue, and we will provide updates as soon as more information becomes available.
Report: "Degraded Query Performance in AWS US-East"
Last update: This incident has been resolved.
Services have recovered and we are continuing to monitor
The issue has been identified and a fix is being implemented.
Report: "Write outage in AWS eu-central-1"
Last update: This incident has been resolved.
A brief (2-minute) write outage occurred. We have investigated, a fix has been implemented, and we are monitoring the results.
Report: "Query performance degradation in AWS eu-central-1 & us-east-1"
Last update: This incident has been resolved.
A fix has been deployed. We believe it has resolved the issue and are monitoring the fix.
We are experiencing a higher-than-normal error rate for Flux queries in AWS eu-central-1 & AWS us-east-1. The issue has been identified and a fix is being implemented.
Report: "Query errors in Azure us-east region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Flux Query errors in eu-cental-1 and us-east-1"
Last update: The incident has been resolved, and operations have returned to normal.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Degraded query performance in AWS us-east-1 and AWS eu-central"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Queries are recovering in AWS us-east-1 and AWS eu-central and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are working on resolving the problem
We are still investigating the issue with degraded query performance in the US-east-1 region and will continue to update here.
We are currently investigating this issue.
Report: "Increased TTBR and degradation to some queries in eu-central-1."
Last update: The issue has been resolved and all operations are back to normal.
TTBR has returned to normal, and all operations are back to normal. We are continuing to monitor for any further issues.
TTBR has returned to normal, and all operations are back to normal
The issue has been identified and a fix is being implemented.
We are aware of a potential increase in TTBR and query degradation in the eu-central-1 region. Our team is actively investigating the issue, and we will provide updates as soon as more information becomes available.
Report: "Query performance degradation in AWS us-east-1"
Last update: This incident has been resolved.
We are continuing to closely monitor as conditions continue to improve.
Our team have implemented a fix and are currently monitoring the recovery
We are currently investigating this issue.
Report: "Minor issue affecting writes + queries in AWS us-east-1"
Last update: This issue has now been resolved.
We're aware of an increase in TTBR on one partition within this region. Some points may not be available immediately after being written; the team are currently investigating.
Report: "Query issues in AWS us-east-1"
Last update: The incident has been resolved, and operations have returned to normal.
Query duration issues have been resolved; however, some queries continue to experience delays for data that was recently written. The team is continuing to monitor the recovery.
Degraded performance issues are resolving - the team is continuing to monitor the recovery
The majority of queries are now completing correctly. However, there may still be delays in the time for some data to become readable - the team are continuing to monitor the recovery.
The team has implemented a fix and is currently monitoring to ensure that it results in recovery.
We are currently investigating this issue.
Report: "Management Tokens not working"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an issue with management tokens failing to authorize correctly in Cloud Dedicated. It's causing 403 authorization errors when valid Admin users attempt to perform actions.
Report: "Degraded query performance in AWS US West"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are aware of query performance degradation in this cloud2 region, the team is currently investigating.
Report: "Degraded query performance in AWS eu-central-1"
Last update: This incident has been resolved.
A fix has been implemented. Queries are running normally; the team continue to monitor.
We are aware of degraded query performance. The team is working on resolving the issue.
Report: "Degraded Performance on Queries"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the result.
We are aware of degraded performance on queries and some queries may be failing. The team is working on resolving the issue.
Report: "Increase in read errors - Azure US-East Region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating an increase in rate of read errors in the Azure US East region.
Report: "Partial outage for Queries in Eu-Central - We are investigating the issue"
Last update:

# RCA

**Sustained query error rate in AWS eu-central-1 on June 7, 2024**

### Background

Data stored in InfluxDB Cloud is distributed across 64 partitions. Distribution is performed using a persistent hash of the series key, with the intent that the write and query load will, on average, be distributed evenly across partitions.

When an InfluxDB Cloud user writes data, their writes first go into a durable queue. Rather than being written directly by users, storage pods consume ingest data from the other end of this queue; among other things, this allows writes to be accepted even if storage is encountering issues. One of the metrics used to reflect the status of this pipeline is [Time To Become Readable (TTBR)](https://docs.influxdata.com/influxdb/cloud/reference/internals/ttbr/), the time between a write being accepted and its data becoming available to queries.

In order to respond to a query, the compute tier needs to request relevant data from each of the 64 partitions. For a query to succeed, the compute tier **must** receive a response from every partition (this ensures that incomplete results are not returned). Each partition has multiple pods responsible for it, and query activity is distributed across them.

### Start of Incident

On June 7, 2024, partition 44 started to report large increases in TTBR. This meant that, while customers' writes were being safely accepted into the durable queue, they were delayed in becoming available to queries. At around the same time, alerts were received indicating an elevated query failure rate, accompanied by an increase in the query queue depth.

### Investigation

Investigation showed that the pods responsible for partition 44 were periodically trying to consume more RAM than permitted, causing them to exit and report an out-of-memory (OOM) event. InfluxData allocated additional RAM to the pods to try to mitigate the customer-facing impact quickly. However, they continued to OOM, so the investigation moved on to identifying the source of the excessive resource usage.

In a multi-tenant system, the resource usage of a single user impacting other users is known as a noisy neighbor issue. The best way to address the problem is to identify the tenant that is the source of the problematic query and temporarily block their queries while we engage with them to correct the problematic query.

In this case, the customer had automated the execution of a query that attempted to run an in-memory _sort()_ against data taken from a particularly dense series. With the problematic query being submitted regularly, more RAM was consumed until the storage pods ultimately OOMed. As a result of these regular OOMs, Kubernetes moved the pods into a CrashLoopBackOff state, which lengthened the recovery time between each OOM. The extended recovery periodically caused all pods responsible for partition 44 to be offline, preventing the query tier from authoritatively answering queries.

### Actions

We are working on several changes to better identify the source of incidents and reduce the likelihood of them occurring in the future. These changes include:

* Better visualization to make it easier to identify small noisy neighbors
* Disabling CrashLoopBackOff for storage pods, once the ability to do so has been added to Kubernetes
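The partition routing described in the Background above is easy to picture with a small sketch. This is illustrative only: the RCA does not state which persistent hash InfluxDB Cloud uses, so `crc32` below is an assumed stand-in, and the series key is made up.

```python
# Illustrative sketch: map a series key to one of 64 partitions with a
# persistent (deterministic) hash. crc32 is an assumption standing in for
# the unspecified hash InfluxDB Cloud actually uses.
import zlib

NUM_PARTITIONS = 64

def partition_for(series_key: str) -> int:
    return zlib.crc32(series_key.encode("utf-8")) % NUM_PARTITIONS

# The same series key always lands on the same partition, which is why a
# single very dense series (as in this incident) concentrates its load on
# one partition rather than spreading it across all 64.
print(partition_for("cpu,host=server-a,region=eu-central-1"))
```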
This incident has been resolved.
We are continuing to monitor for any further issues.
The issue has been addressed and we are monitoring
The issue has been addressed and we are monitoring
We have added more capacity to support the increased query workload that we are seeing and continue to investigate.
We are currently investigating this issue.
Report: "Query issues in GCP us-central-1"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Query issues in Azure us-east-1"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Degradable performance on read and write in azure-us-east"
Last update:

# RCA

**Degraded performance on read and write in azure-us-east on April 1, 2024**

**Summary**

Alerts were received indicating an increase in Time To Be Readable (TTBR), followed by an increase in the number of queries failing within the region.

**Cause**

The cluster experienced a significant increase in workload, which consumed the available CPU time within the storage tier. This led to the query queue growing deeper, with some queries timing out before being processed.

Whenever a multi-tenant cluster experiences performance issues that do not appear to correlate to any changes that we’ve made, we first check the larger customers in the cluster to see whether they’ve exhibited any change in behavior. In this instance, a large customer had increased the number of queries being run throughout the day. The incident coincided with another large customer writing a large number of new series into the cluster, which will have led to indexes being locked more frequently than usual. The combination of the two led to the storage tier answering queries far more slowly than normal, allowing more queries to queue and therefore sustaining pressure on the system.

On InfluxDB Cloud2, the performance of writes and queries is inextricably linked, with changes in behavior within one path able to affect the other (note: this is no longer the case in v3-based products such as InfluxDB Cloud Dedicated).

**Future mitigations**

Multi-tenant clusters come with an inherent risk that an increase or change in workload can negatively impact other users of the cluster. When we observe significant and persistent changes in workload, we look to adjust the cluster to handle the new workload.
This incident has been resolved.
The cluster is now healthy, and operations have returned to normal.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Serverless: Degraded Query Performance/Errors in AWS us-east-1 and eu-central-1"
Last update: All regions have resumed normal operations
The issue has been identified. AWS us-east-1 has recovered.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Degraded query performance on AWS eu-central-1"
Last update: Resource usage has returned to and remained at the level it was prior to the incident.
There has been no further impact on customer queries; however, the team continue to monitor the recovery
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
A subset of queries is experiencing higher latency than normal. Our team is currently investigating.
Report: "Query degradation in eu-central-1"
Last update:

# RCA

## Query Degradation in eu-central-1 on Jan 9, 2024

### Background

Data stored in InfluxDB Cloud is distributed across 64 partitions. Distribution is performed using a persistent hash of the series key to ensure even write and query load distribution. When users write data into InfluxDB Cloud, their writes first enter a durable queue. Storage pods consume ingest data from the queue, allowing writes to be accepted even during storage issues. Time To Become Readable (TTBR) measures the time between a write being accepted and its data becoming available for queries.

### Summary

On January 9, 2024, a single partition experienced significant increases in TTBR, causing delays in data availability for queries. CPU usage on the pods responsible for this partition rose to high levels. An investigation revealed a noisy neighbor issue caused by a small organization with infinite retention running resource-intensive queries.

### Internal Visibility of Issue

Identifying the affected queries took longer than usual because:

* Queries were timing out in the query tier but continuing to run on storage, creating a disconnect in observed logs.
* The organization's small size caused it to not appear prominently in metrics.
* Failing queries represented a tiny proportion of the organization's usage, making shifts in query success ratios minimal.
* Metrics and logs relied on completed gRPC calls, which were not completing for the problematic queries.

### Cause

The resource usage of a single user impacting other users, known as a noisy neighbor issue, was identified. Resources were consumed by a relatively small organization attempting to run an expensive function against all data in a dense series. Queries timing out led to continued consumption of resources, eventually leaving insufficient CPU for reliable queue consumption and thus pushing TTBR up.

### Mitigation

Additional compute resources were deployed to mitigate any impact on customers and allow for a smooth recovery without customer-visible impact.

### Prevention

Planned or ongoing changes include:

* Improvements to profiling for reporting usage per organization.
* Enhancements in visualization to facilitate easier identification of noisy neighbors.
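TTBR comes up repeatedly in these RCAs, and it can also be approximated from the client side. The sketch below is a rough external approximation, not how InfluxData measures TTBR internally; the URL, token, org, and bucket are placeholders, and it uses the `influxdb-client` Python package.

```python
# Rough client-side approximation of TTBR: write a uniquely tagged point,
# then poll a query until it becomes readable. Credentials and bucket names
# below are placeholders, not real values.
import time
import uuid

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="https://eu-central-1-1.aws.cloud2.influxdata.com",
                        token="MY_TOKEN", org="my-org")

marker = str(uuid.uuid4())
write_start = time.monotonic()
client.write_api(write_options=SYNCHRONOUS).write(
    bucket="my-bucket",
    record=Point("ttbr_probe").tag("marker", marker).field("value", 1))

query = f'''
from(bucket: "my-bucket")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement == "ttbr_probe" and r.marker == "{marker}")
'''
query_api = client.query_api()
while not query_api.query(query):  # returns an empty list until the point is readable
    time.sleep(0.5)

print(f"TTBR ~= {time.monotonic() - write_start:.1f}s")
```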
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are aware of query degradation in eu-central-1; the team are currently investigating
Report: "Intermittent query errors in Serverless in eu-central-1 region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Increased internal query error rates in eu-central-1 region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Intermitted Write Errors (Frankfurt, DE)"
Last update: This incident has been resolved.
We are experiencing intermittent write errors on AWS - DE, Frankfurt, most recently at 12:15 and 12:25 UTC. We are currently monitoring the issue.
Report: "Intermitted Write Errors"
Last update: This incident has been resolved.
We are experiencing intermittent write errors on AWS - US, Virginia at this moment. We are investigating the issue.
Report: "Task runs and MQTT availability issues in Azure West Europe."
Last update: This incident has been resolved.
The team identified an issue affecting the creation of dynamic roles within the platform. A fix has been implemented and the team are monitoring the recovery.
We are aware of issues affecting Task runs and MQTT availability, the team are currently investigating.
Report: "Write failures in eu-central-1 in 3.0 Serverless."
Last update: There was a write outage between 08:50 and 09:16 UTC for eu-central-1 in 3.0 Serverless. Writes are back to normal now.
Report: "Increase query error rate in 3.0 Serverless US-EAST-1 region"
Last update: This incident has been resolved.
It has been over an hour since we’ve seen an availability alarm and the querier metrics have returned to normal. We will continue to monitor.
Within the last hour the CPU usage has dropped again, and the timeout rate appears to have returned to normal. We continue to search for the root cause.
We have increased the number of queriers in the system. This has alleviated the pressure slightly. We are still processing queries more slowly than before the incident started.
We are still investigating the issue. The high CPU usage and error rate persist.
We have observed a reduction in query performance on InfluxDB Serverless prod101-us-east-1 causing an increased query failure rate. We are currently investigating the issue.
Report: "Query issues reported in AWS us-west-2 region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue affecting a small portion of queries in the AWS us-west-2 region.
Report: "Delayed writes in AWS eu-central"
Last update: This incident has been resolved.
Investigating and monitoring the issue
Report: "Increased query and write latency in AWS us-west"
Last update: This incident has been resolved.
A fix has been successfully implemented, and we will continue to monitor to ensure that the issue does not reoccur.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Serverless : IOx update causing query failures."
Last update: This incident has been resolved.
The issue is resolved.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are aware of a problem affecting a small number of customer queries. We have identified the cause and are working to resolve the issue.
Report: "Investigating possible higher latency on AWS, Virginia - US (US-East-1-1)"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Investigating potential problem with reads and writes in us-west-2 region"
Last update:

**RCA - Write and Query failures on prod102-us-west-2 on September 7, 2023**

**Summary**

We upgraded nginx from 1.3 to 1.7 in prod102-us-west-2 and the upgrade appeared to be fine, with the cluster functioning normally immediately after the upgrade. We had already done the same upgrade of nginx in other Cloud 2 clusters, with no negative impact. Approximately 8 hours later, we were alerted by a high rate of write and query errors in the cluster. We ultimately resolved the issue by reverting the nginx upgrade.

**Cause of the Incident**

We believe that the cause of the incident was an interaction between the nginx upgrade and stale connection handling, although we were not able to definitively prove this, as we could not cause the problem to occur in later testing. We also did not see this interaction in any of the other clusters where we upgraded nginx.

Eight hours after nginx was upgraded, we did a normal deployment which caused all of our gateway pods to restart (normal behavior). When they restarted, we got a very high rate (though not quite 100%) of connection failures on the gateway pods. This was visible to our customers, who saw 100% failures for queries, and between 80% and 90% failures for writes.

**Recovery**

On receiving the alerts, we investigated and traced the high failure rate to the gateway pods. As nginx had already been running for 8 hours, we didn't immediately assume that nginx was at fault; instead, we investigated other potential causes (including the changes that had shipped in the deployment that had just occurred). When we could not identify any other potential source, we rolled back the nginx upgrade, and the cluster recovered. We have since upgraded nginx again on this cluster, with no negative impact.

Our assumption is not that there is a problem with the particular version of nginx, but that the action of upgrading nginx caused it to hold onto stale connections, so after the gateway pods restarted they could no longer connect successfully.

**Timeline**

* Sept 7, 2023 18:15 UTC - Alerted that we are seeing a high rate of write and query errors.
* Sept 7, 2023 18:20 UTC - Engineering team began investigating.
* Sept 7, 2023 18:20 UTC to 21:10 UTC - Our early investigation showed that all the gateway pods had restarted. We investigated the changes that had been in that deployment, but there was nothing in the updated code that could have caused an issue of this magnitude (impacting queries and writes). We manually restarted one gateway pod and it did not recover. We also investigated whether this could have been caused by something external (e.g. a network issue within the AWS infrastructure) but could not identify any cause there. As we could not find a root cause, we chose to undo all recent changes in that cluster, including the nginx upgrade.
* Sept 7, 21:15 UTC - Rolled back nginx to v1.3.
* Sept 7, 21:20 UTC - All gateway connections recovered. Query and write latency was still high as there was a backlog of failed requests to catch up on.
* Sept 7, 22:30 UTC - Cluster fully recovered.
* Sept 8, 10:00 UTC - We upgraded nginx to 1.7 and restarted all the gateway pods, without any negative consequences.

**Future mitigations**

We will force a deployment directly after each nginx upgrade, to ensure that all connections are refreshed and to avoid any potential interaction with stale connections.
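The "force a deployment after each nginx upgrade" mitigation can be approximated with a rollout restart. The sketch below uses the same mechanism as `kubectl rollout restart` (patching the pod template's `kubectl.kubernetes.io/restartedAt` annotation) via the Python Kubernetes client; the `gateway` deployment and `ingress` namespace are hypothetical placeholders, not InfluxData's actual resource names.

```python
# Hedged sketch: trigger a rolling restart of a deployment so every pod is
# recreated and any stale connections are dropped. Resource names below are
# hypothetical placeholders.
from datetime import datetime, timezone

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    # Same annotation that `kubectl rollout restart` sets.
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}

apps.patch_namespaced_deployment(name="gateway", namespace="ingress", body=patch)
```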
This incident has been resolved.
Writes and queries appear to be succeeding for most users now. While the cluster catches up on previous write traffic, query and write latencies may be elevated.
Writes and queries appear to be succeeding for most users now. While the cluster catches up on previous write traffic, query and write latencies may be elevated.
We are continuing to investigate this issue.
We are investigating a potential problem with reads and writes.
Report: "Issue with InfluxQL queries in eu-central-1"
Last update: This incident has been resolved.
We have addressed the problem and are monitoring the changes.
We are continuing to investigate this issue.
The team is aware of the issue impacting InfluxQL queries for both C2 and Serverless and is currently working on it.
Report: "Degraded Query and Writes in AWS us-east-1"
Last update: This incident has been resolved.
The team have identified the cause of the latency and are monitoring the recovery
The team are aware of an issue impacting read and write latency in the AWS us-east-1 region.
Report: "AWS us-east increased error rate for queries (Serverless/IOx)"
Last update: This incident has been resolved.
The query issue has been identified and should be resolved; we are continuing to monitor
We are currently investigating this issue.
Report: "Write failures in eu-central-1 & us-east-1 regions"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Read/Write failures in us-west-2 region"
Last update: This incident has been resolved and we will continue to monitor.
We are actively working on this issue.
Report: "Write failures in us-east-1 region (Cloud Serverless)"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue affecting IOx instances only.