Historical record of incidents for InfluxDB Cloud
Report: "Elevated Query Error Rate"
Last update: We have observed an elevated query error rate and are investigating.
Report: "Query performance degradation in Azure eu-west"
Last update: We are currently investigating a potential issue in the Azure EU-West cluster. The cluster appears to be degraded, and we’re observing slow query performance.
Report: "Management Service Availability User Interface and API"
Last update: We are currently investigating availability of the management user interface and API.
Report: "GCP Query Availability"
Last update: We are continuing to investigate this issue.
We are currently investigating query availability in GCP regions.
Report: "Cloud2 and Serverless Query and Write API Availability Issues AWS EU-Central-1"
Last update:

**RCA - Query and Write Outage on May 28, 2025**

**Summary**

On May 28, 2025, at 14:33 UTC, we deployed an infrastructure change to multiple production InfluxDB Cloud (TSM and IOx) clusters. At 15:07 UTC, the continuous deployment (CD) process required a manual step to update an immutable Kubernetes resource. This step involves deleting and recreating the resource. Processes like this are not uncommon and have been executed successfully in staging environments without issue.

Shortly after the deployment completed, an alert at 15:09 UTC indicated that certain microservices were unable to authenticate with each other across multiple production clusters. These services are critical to data ingestion and querying, and the authentication failure prevented them from processing incoming writes and queries. Writes and queries began to fail at 15:09 UTC and remained unavailable until authentication paths were restored on each affected cluster. The outage duration differed for each cluster because each cluster was examined individually to confirm our diagnosis, apply a fix, and ensure all services restored communication properly. The full outage timeframe ranged from 15:09 UTC to 17:43 UTC, with the first cluster fully recovering at 17:04 UTC. We define a full recovery as the point when a cluster can accept writes, return queries, and resume normal software deployments.

**Cause of the Incident**

Our software is deployed via a CD pipeline to three staging clusters (one per cloud provider), where a suite of automated tests is run. If the tests pass, the software is deployed simultaneously to all production clusters. This is our standard cloud service deployment process.

On May 28, 2025, an engineer submitted a Kubernetes infrastructure change aimed at improving the resiliency of the service responsible for secure application authentication. The change was peer reviewed and successfully applied, and initial deployments across several clusters appeared to complete without issue. We were alerted to a malfunction when the change landed on larger, more active clusters.

Upon investigation, we found that while the manual CD step to remove and recreate the Kubernetes resource completed successfully, it did not behave as expected. After Kubernetes 1.29, certain service account tokens are no longer generated automatically. As a result, the token required for microservices to authenticate with each other was missing, preventing them from processing writes and queries.

**Investigation and Recovery**

Our initial focus was to verify that the PR itself did not tamper with the resource. Next, we examined the CD mechanics to ensure that the CD platform and pipeline were performing as designed. Our attention then turned to Kubernetes itself, as the resource was missing in-cluster yet remained in our IaC (Infrastructure-as-Code) repository. Further investigation revealed that the missing resource was the result of a deliberate change in a recent Kubernetes release.

Once we identified the root cause, we quickly mitigated the outage by recreating and applying the new resource to each cluster. As soon as this mitigation change was deployed within each cluster, success rates improved from 0 percent to 100 percent within minutes. Each cluster was examined to ensure that failure rates returned to pre-incident levels.
We verified that all cluster components were consuming the new resource, and each cluster was peer reviewed to confirm full functionality before being marked as restored (both internally and on the InfluxData status page).

**Future mitigations**

We are implementing several measures to reduce the likelihood of a similar incident in the future:

1. **Alert on or increase telemetry of secret-related deletion.** We are examining improvements to monitoring and alerts for the deletion or absence of critical authentication tokens.
2. **Isolate and stage infrastructure changes to critical systems.** Infrastructure changes impacting key services will be implemented in stages. Time-based staging will be added between cluster deployments, along with further validation checks at each step.
3. **Ongoing investigation and continued hardening.** We are continuing to investigate additional contributing factors and will implement further mitigation steps as they are identified.

[https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#control-plane-details](https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#control-plane-details)
[https://status.influxdata.com](https://status.influxdata.com)
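For readers who want to see the shape of the fix described above: because newer Kubernetes releases no longer auto-generate long-lived ServiceAccount token Secrets, the token has to be created explicitly. The sketch below is illustrative only, not InfluxData's actual manifest; the `app-auth` ServiceAccount and `platform` namespace are hypothetical placeholders. It uses the documented mechanism of creating a Secret of type `kubernetes.io/service-account-token` annotated with the ServiceAccount name, via the Python Kubernetes client.

```python
# Minimal sketch (not InfluxData's actual manifest): explicitly create the
# long-lived ServiceAccount token Secret that newer Kubernetes no longer
# auto-generates. "app-auth" and "platform" are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod

token_secret = client.V1Secret(
    metadata=client.V1ObjectMeta(
        name="app-auth-token",
        namespace="platform",
        # This annotation binds the Secret to the ServiceAccount; the control
        # plane then populates the Secret with a signed token.
        annotations={"kubernetes.io/service-account.name": "app-auth"},
    ),
    type="kubernetes.io/service-account-token",
)

client.CoreV1Api().create_namespaced_secret(namespace="platform", body=token_secret)
```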
All regions fully back online. A full RCA will be provided as soon as it is completed.
AWS EU-Central is now operational. We are continuing to monitor
All regions except EU Central are now operational. Work continues on EU-Central
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Cloud2 and Serverless Query and Write API Availability Issues AWS EU-Central-1"
Last updateAll regions fully back online. A full RCA will be provided as soon as it is completed.
AWS EU-Central is now operational. We are continuing to monitor
All regions except EU Central are now operational. Work continues on EU-Central
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Cloud Dedicated - Management API is unavailable"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Cloud Dedicated - Management API is unavailable"
Last updateThis incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Increase in query errors reported in Azure west europe"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Increased rate of InfluxQL V1 API errors for prod01-us-west-2 only"
Last update: The issue has now been resolved; the service is stable and working as expected.
Our engineers are currently working on this issue.
We are seeing an increased rate of InfluxQL V1 API errors for prod01-us-west-2 only.
Report: "Increase in query errors in AWS US-East-1"
Last update: This incident has been resolved, and query operations have returned to normal.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are still investigating the issue and will continue to update here.
We are currently investigating this issue.
Report: "InfluxDB Cloud 2 signups are down, but logins are working."
Last update: The issue has been resolved, and signups are now back to normal.
We have identified the issue and implemented a fix. We will continue to monitor the situation
Signups are working again. We are continuing to monitor while we identify the root cause
We're aware that InfluxDB Cloud 2 signups are down, and we're currently investigating the issue. Logins are working as expected.
Report: "Degraded reads in us-west-2"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Degraded query performance issues are resolving - the team is continuing to monitor the recovery
The issue has been identified and a fix is being implemented.
In the Cloud2 us-west-2 region, we identified degraded reads and have implemented a solution
Report: "Cloud Dedicated Upgrade Issue"
Last update: All services have been recovered. We apologize for the service disruption, and will be publishing a full RCA when we have completed our internal review of what happened today.
All services have been recovered, but we are continuing to monitor performance.
We are working to recover the affected components.
We are continuing to investigate this issue.
We have an issue with an upgrade to our infrastructure that we are reverting
Report: "Increased TTBR and degradation to some queries in us-east-1"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are aware of a potential increase in TTBR and query degradation in the us-east-1 region. Our team is actively investigating the issue, and we will provide updates as soon as more information becomes available.
Report: "Degraded Query Performance in AWS US-East"
Last update: This incident has been resolved.
Services have recovered and we are continuing to monitor
The issue has been identified and a fix is being implemented.
Report: "Write outage in AWS eu-central-1"
Last update: This incident has been resolved.
A brief (2-minute) write outage occurred. We have investigated, a fix has been implemented, and we are monitoring the results.
Report: "Query performance degradation in AWS eu-central-1 & us-east-1"
Last update: This incident has been resolved.
A fix has been deployed. We believe it has resolved the issue and are monitoring the fix.
We are experiencing a higher-than-normal error rate for Flux queries in AWS eu-central-1 & AWS us-east-1. The issue has been identified and a fix is being implemented.
Report: "Query errors in Azure us-east region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Flux Query errors in eu-cental-1 and us-east-1"
Last update: The incident has been resolved, and operations have returned to normal.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Degraded query performance in AWS us-east-1 and AWS eu-central"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Queries are recovering in AWS us-east-1 and AWS eu-central and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are working on resolving the problem
We are still investigating the issue with degraded query performance in the US-east-1 region and will continue to update here.
We are currently investigating this issue.
Report: "Increased TTBR and degradation to some queries in eu-central-1."
Last update: The issue has been resolved and all operations are back to normal.
TTBR has returned to normal, and all operations are back to normal. We are continuing to monitor for any further issues.
TTBR has returned to normal, and all operations are back to normal
The issue has been identified and a fix is being implemented.
We are aware of a potential increase in TTBR and query degradation in the eu-central-1 region. Our team is actively investigating the issue, and we will provide updates as soon as more information becomes available.
Report: "Query performance degradation in AWS us-east-1"
Last update: This incident has been resolved.
We are continuing to closely monitor as conditions continue to improve.
Our team have implemented a fix and are currently monitoring the recovery
We are currently investigating this issue.
Report: "Minor issue affecting writes + queries in AWS us-east-1"
Last update: This issue has now been resolved.
We're aware of an increase in TTBR on one partition within this region. Some points may not be available immediately after being written; the team are currently investigating.
Report: "Query issues in AWS us-east-1"
Last update: The incident has been resolved, and operations have returned to normal.
Query duration issues have been resolved; however, some queries continue to experience delays for data that was recently written. The team is continuing to monitor the recovery.
Degraded performance issues are resolving - the team is continuing to monitor the recovery
The majority of queries are now completing correctly. However, there may still be delays in the time for some data to become readable - the team are continuing to monitor the recovery.
The team has implemented a fix and is currently monitoring to ensure that it results in recovery.
We are currently investigating this issue.
Report: "Management Tokens not working"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an issue with management tokens failing to authorize correctly in Cloud Dedicated. It's causing 403 authorization errors when valid Admin users attempt to perform actions.
Report: "Degraded query performance in AWS US West"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are aware of query performance degradation in this cloud2 region, the team is currently investigating.
Report: "Degraded query performance in AWS eu-central-1"
Last update: This incident has been resolved.
A fix has been implemented. Queries are running normally; the team continue to monitor.
We are aware of degraded query performance. The team is working on resolving the issue.
Report: "Degraded Performance on Queries"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the result.
We are aware of degraded performance on queries and some queries may be failing. The team is working on resolving the issue.
Report: "Increase in read errors - Azure US-East Region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating an increase in rate of read errors in the Azure US East region.
Report: "Partial outage for Queries in Eu-Central - We are investigating the issue"
Last update:

# RCA

**Sustained query error rate in AWS eu-central-1 on June 7, 2024**

### Background

Data stored in InfluxDB Cloud is distributed across 64 partitions. Distribution is performed using a persistent hash of the series key, with the intent that the write and query load will, on average, be distributed evenly across partitions.

When an InfluxDB Cloud user writes data, their writes first go into a durable queue. Rather than being written directly by users, storage pods consume ingest data from the other end of this queue; among other things, this allows writes to be accepted even if storage is encountering issues. One of the metrics used to reflect the status of this pipeline is [Time To Become Readable (TTBR)](https://docs.influxdata.com/influxdb/cloud/reference/internals/ttbr/), the time between a write being accepted and its data becoming available to queries.

In order to respond to a query, the compute tier needs to request relevant data from each of the 64 partitions. For a query to succeed, the compute tier **must** receive a response from every partition (this ensures that incomplete results are not returned). Each partition has multiple pods responsible for it, and query activity is distributed across them.

### Start of Incident

On June 7, 2024, partition 44 started to report large increases in TTBR. This meant that, while customers' writes were being safely accepted into the durable queue, they were delayed in becoming available to queries. At around the same time, alerts were received indicating an elevated query failure rate, accompanied by an increase in the query queue depth.

### Investigation

Investigation showed that the pods responsible for partition 44 were periodically trying to consume more RAM than permitted, causing them to exit and report an out-of-memory (OOM) event. InfluxData allocated additional RAM to the pods to try to mitigate the customer-facing impact quickly. However, they continued to OOM, so the investigation moved on to identifying the source of the excessive resource usage.

In a multi-tenant system, the resource usage of a single user impacting other users is known as a noisy neighbor issue. The best way to address the problem is to identify the tenant that is the source of the problematic query and temporarily block their queries while we engage with them to correct the problematic query.

In this case, the customer had automated the execution of a query that attempted to run an in-memory _sort()_ against data taken from a particularly dense series. With the problematic query being submitted regularly, more RAM was consumed until the storage pods ultimately OOMed. As a result of these regular OOMs, Kubernetes moved the pods into a CrashLoopBackOff state, which lengthened the recovery time between each OOM. The extended recovery periodically caused all pods responsible for partition 44 to be offline, preventing the query tier from authoritatively answering queries.

### Actions

We are working on several changes to better identify the source of incidents and reduce the likelihood of them occurring in the future. These changes include:

* Better visualization to make it easier to identify small noisy neighbors
* Disabling CrashLoopBackOff for storage pods, once the ability to do so has been added to Kubernetes
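The partition routing described in the Background above is easy to picture with a small sketch. This is illustrative only: the RCA does not state which persistent hash InfluxDB Cloud uses, so `crc32` below is an assumed stand-in, and the series key is made up.

```python
# Illustrative sketch: map a series key to one of 64 partitions with a
# persistent (deterministic) hash. crc32 is an assumption standing in for
# the unspecified hash InfluxDB Cloud actually uses.
import zlib

NUM_PARTITIONS = 64

def partition_for(series_key: str) -> int:
    return zlib.crc32(series_key.encode("utf-8")) % NUM_PARTITIONS

# The same series key always lands on the same partition, which is why a
# single very dense series (as in this incident) concentrates its load on
# one partition rather than spreading it across all 64.
print(partition_for("cpu,host=server-a,region=eu-central-1"))
```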
This incident has been resolved.
We are continuing to monitor for any further issues.
The issue has been addressed and we are monitoring
The issue has been addressed and we are monitoring
We have added more capacity to support the increased query workload that we are seeing and continue to investigate.
We are currently investigating this issue.
Report: "Query issues in GCP us-central-1"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Query issues in Azure us-east-1"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Degradable performance on read and write in azure-us-east"
Last update:

# RCA

**Degraded performance on read and write in azure-us-east on April 1, 2024**

**Summary**

Alerts were received indicating an increase in Time To Be Readable (TTBR), followed by an increase in the number of queries failing within the region.

**Cause**

The cluster experienced a significant increase in workload, which consumed the available CPU time within the storage tier. This led to the query queue growing deeper, with some queries timing out before being processed.

Whenever a multi-tenant cluster experiences performance issues that do not appear to correlate to any changes that we’ve made, we first check the larger customers in the cluster to see whether they’ve exhibited any change in behavior. In this instance, a large customer had increased the number of queries being run throughout the day. The incident coincided with another large customer writing a large number of new series into the cluster, which will have led to indexes being locked more frequently than usual. The combination of the two led to the storage tier answering queries far more slowly than normal, allowing more queries to queue and therefore sustaining pressure on the system.

On InfluxDB Cloud2, the performance of writes and queries is inextricably linked, with changes in behavior within one path able to affect the other (note: this is no longer the case in v3-based products such as InfluxDB Cloud Dedicated).

**Future mitigations**

Multi-tenant clusters come with an inherent risk that an increase or change in workload can negatively impact other users of the cluster. When we observe significant and persistent changes in workload, we look to adjust the cluster to handle the new workload.
This incident has been resolved.
The cluster is now healthy, and operations have returned to normal.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Serverless: Degraded Query Performance/Errors in AWS us-east-1 and eu-central-1"
Last update: All regions have resumed normal operations
The issue has been identified. AWS us-east-1 has recovered.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Degraded query performance on AWS eu-central-1"
Last update: Resource usage has returned to and remained at the level it was prior to the incident.
There has been no further impact on customer queries; however, the team continue to monitor the recovery
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
A subset of queries is experiencing higher latency than normal. Our team is currently investigating.
Report: "Query degradation in eu-central-1"
Last update:

# RCA

## Query Degradation in eu-central-1 on Jan 9, 2024

### Background

Data stored in InfluxDB Cloud is distributed across 64 partitions. Distribution is performed using a persistent hash of the series key to ensure even write and query load distribution. When users write data into InfluxDB Cloud, their writes first enter a durable queue. Storage pods consume ingest data from the queue, allowing writes to be accepted even during storage issues. Time To Become Readable (TTBR) measures the time between a write being accepted and its data becoming available for queries.

### Summary

On January 9, 2024, a single partition experienced significant increases in TTBR, causing delays in data availability for queries. CPU usage on the pods responsible for this partition rose to high levels. An investigation revealed a noisy neighbor issue caused by a small organization with infinite retention running resource-intensive queries.

### Internal Visibility of Issue

Identifying the affected queries took longer than usual because:

* Queries were timing out in the query tier but continuing to run on storage, creating a disconnect in observed logs.
* The organization's small size caused it to not appear prominently in metrics.
* Failing queries represented a tiny proportion of the organization's usage, making shifts in query success ratios minimal.
* Metrics and logs relied on completed gRPC calls, which were not completing for the problematic queries.

### Cause

The resource usage of a single user impacting other users, known as a noisy neighbor issue, was identified. Resources were consumed by a relatively small organization attempting to run an expensive function against all data in a dense series. Queries timing out led to continued consumption of resources, eventually leaving insufficient CPU for reliable queue consumption and thus pushing TTBR up.

### Mitigation

Additional compute resources were deployed to mitigate any impact on customers and allow for a smooth recovery without customer-visible impact.

### Prevention

Planned or ongoing changes include:

* Improvements to profiling for reporting usage per organization.
* Enhancements in visualization to facilitate easier identification of noisy neighbors.
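TTBR comes up repeatedly in these RCAs, and it can also be approximated from the client side. The sketch below is a rough external approximation, not how InfluxData measures TTBR internally; the URL, token, org, and bucket are placeholders, and it uses the `influxdb-client` Python package.

```python
# Rough client-side approximation of TTBR: write a uniquely tagged point,
# then poll a query until it becomes readable. Credentials and bucket names
# below are placeholders, not real values.
import time
import uuid

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="https://eu-central-1-1.aws.cloud2.influxdata.com",
                        token="MY_TOKEN", org="my-org")

marker = str(uuid.uuid4())
write_start = time.monotonic()
client.write_api(write_options=SYNCHRONOUS).write(
    bucket="my-bucket",
    record=Point("ttbr_probe").tag("marker", marker).field("value", 1))

query = f'''
from(bucket: "my-bucket")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement == "ttbr_probe" and r.marker == "{marker}")
'''
query_api = client.query_api()
while not query_api.query(query):  # returns an empty list until the point is readable
    time.sleep(0.5)

print(f"TTBR ~= {time.monotonic() - write_start:.1f}s")
```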
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are aware of query degradation in eu-central-1; the team are currently investigating
Report: "Intermittent query errors in Serverless in eu-central-1 region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Increased internal query error rates in eu-central-1 region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Intermitted Write Errors (Frankfurt, DE)"
Last update: This incident has been resolved.
We are experiencing intermittent write errors on AWS - DE, Frankfurt, most recently at 12:15 and 12:25 UTC. We are currently monitoring the issue.
Report: "Intermitted Write Errors"
Last update: This incident has been resolved.
We are experiencing intermittent write errors on AWS - US, Virginia at this moment. We are investigating the issue.
Report: "Task runs and MQTT availability issues in Azure West Europe."
Last update: This incident has been resolved.
The team identified an issue affecting the creation of dynamic roles within the platform. A fix has been implemented and the team are monitoring the recovery.
We are aware of issues affecting Task runs and MQTT availability, the team are currently investigating.
Report: "Write failures in eu-central-1 in 3.0 Serverless."
Last update: There was a write outage between 08:50 and 09:16 UTC for eu-central-1 in 3.0 Serverless. Writes are back to normal now.
Report: "Increase query error rate in 3.0 Serverless US-EAST-1 region"
Last update: This incident has been resolved.
It has been over an hour since we’ve seen an availability alarm and the querier metrics have returned to normal. We will continue to monitor.
Within the last hour the CPU usage has dropped again, and the timeout rate appears to have returned to normal. We continue to search for the root cause.
We have increased the number of queriers in the system. This has alleviated the pressure slightly. We are still processing queries more slowly than before the incident started.
We are still investigating the issue. The high CPU usage and error rate persist.
We have observed a reduction in query performance on InfluxDB Serverless prod101-us-east-1 causing an increased query failure rate. We are currently investigating the issue.
Report: "Query issues reported in AWS us-west-2 region"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue affecting a small portion of queries in the AWS us-west-2 region.
Report: "Delayed writes in AWS eu-central"
Last update: This incident has been resolved.
Investigating and monitoring the issue
Report: "Increased query and write latency in AWS us-west"
Last update: This incident has been resolved.
A fix has been successfully implemented, and we will continue to monitor to ensure that the issue does not reoccur.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Serverless : IOx update causing query failures."
Last update: This incident has been resolved.
The issue is resolved.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
We are aware of a problem affecting a small number of customer queries. We have identified the cause and are working to resolve the issue.
Report: "Investigating possible higher latency on AWS, Virginia - US (US-East-1-1)"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Investigating potential problem with reads and writes in us-west-2 region"
Last update:

**RCA - Write and Query failures on prod102-us-west-2 on September 7, 2023**

**Summary**

We upgraded nginx from 1.3 to 1.7 in prod102-us-west-2 and the upgrade appeared to be fine, with the cluster functioning normally immediately after the upgrade. We had already done the same upgrade of nginx in other Cloud 2 clusters, with no negative impact. Approximately 8 hours later, we were alerted by a high rate of write and query errors in the cluster. We ultimately resolved the issue by reverting the nginx upgrade.

**Cause of the Incident**

We believe that the cause of the incident was an interaction between the nginx upgrade and stale connection handling, although we were not able to definitively prove this, as we could not cause the problem to occur in later testing. We also did not see this interaction in any of the other clusters where we upgraded nginx.

Eight hours after nginx was upgraded, we did a normal deployment which caused all of our gateway pods to restart (normal behavior). When they restarted, we got a very high rate (though not quite 100%) of connection failures on the gateway pods. This was visible to our customers, who saw 100% failures for queries, and between 80% and 90% failures for writes.

**Recovery**

On receiving the alerts, we investigated and traced the high failure rate to the gateway pods. As nginx had already been running for 8 hours, we didn't immediately assume that nginx was at fault; instead, we investigated other potential causes (including the changes that had shipped in the deployment that had just occurred). When we could not identify any other potential source, we rolled back the nginx upgrade, and the cluster recovered. We have since upgraded nginx again on this cluster, with no negative impact.

Our assumption is not that there is a problem with the particular version of nginx, but that the action of upgrading nginx caused it to hold onto stale connections, so after the gateway pods restarted they could no longer connect successfully.

**Timeline**

* Sept 7, 2023 18:15 UTC - Alerted that we are seeing a high rate of write and query errors.
* Sept 7, 2023 18:20 UTC - Engineering team began investigating.
* Sept 7, 2023 18:20 UTC to 21:10 UTC - Our early investigation showed that all the gateway pods had restarted. We investigated the changes that had been in that deployment, but there was nothing in the updated code that could have caused an issue of this magnitude (impacting queries and writes). We manually restarted one gateway pod and it did not recover. We also investigated whether this could have been caused by something external (e.g. a network issue within the AWS infrastructure) but could not identify any cause there. As we could not find a root cause, we chose to undo all recent changes in that cluster, including the nginx upgrade.
* Sept 7, 21:15 UTC - Rolled back nginx to v1.3.
* Sept 7, 21:20 UTC - All gateway connections recovered. Query and write latency was still high as there was a backlog of failed requests to catch up on.
* Sept 7, 22:30 UTC - Cluster fully recovered.
* Sept 8, 10:00 UTC - We upgraded nginx to 1.7 and restarted all the gateway pods, without any negative consequences.

**Future mitigations**

We will force a deployment directly after each nginx upgrade, to ensure that all connections are refreshed and to avoid any potential interaction with stale connections.
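The "force a deployment after each nginx upgrade" mitigation can be approximated with a rollout restart. The sketch below uses the same mechanism as `kubectl rollout restart` (patching the pod template's `kubectl.kubernetes.io/restartedAt` annotation) via the Python Kubernetes client; the `gateway` deployment and `ingress` namespace are hypothetical placeholders, not InfluxData's actual resource names.

```python
# Hedged sketch: trigger a rolling restart of a deployment so every pod is
# recreated and any stale connections are dropped. Resource names below are
# hypothetical placeholders.
from datetime import datetime, timezone

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    # Same annotation that `kubectl rollout restart` sets.
                    "kubectl.kubernetes.io/restartedAt":
                        datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}

apps.patch_namespaced_deployment(name="gateway", namespace="ingress", body=patch)
```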
This incident has been resolved.
Writes and queries appear to be succeeding for most users now. While the cluster catches up on previous write traffic, query and write latencies may be elevated.
Writes and queries appear to be succeeding for most users now. While the cluster catches up on previous write traffic, query and write latencies may be elevated.
We are continuing to investigate this issue.
We are investigating a potential problem with reads and writes.
Report: "Issue with InfluxQL queries in eu-central-1"
Last update: This incident has been resolved.
We have addressed the problem and are monitoring the changes.
We are continuing to investigate this issue.
The team is aware of the issue impacting InfluxQL queries for both C2 and Serverless and is currently working on it.
Report: "Degraded Query and Writes in AWS us-east-1"
Last update: This incident has been resolved.
The team have identified the cause of the latency and are monitoring the recovery
The team are aware of an issue impacting read and write latency in the AWS us-east-1 region.
Report: "AWS us-east increased error rate for queries (Serverless/IOx)"
Last update: This incident has been resolved.
The query issue has been identified and should be resolved; we are continuing to monitor
We are currently investigating this issue.
Report: "Write failures in eu-central-1 & us-east-1 regions"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Read/Write failures in us-west-2 region"
Last update: This incident has been resolved and we will continue to monitor.
We are actively working on this issue.
Report: "Write failures in us-east-1 region (Cloud Serverless)"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue affecting IOx instances only.