Historical record of incidents for Datadog US1
Report: "Delayed processing of APM Trace Metrics"
Last update: We are investigating delayed processing of APM Trace metrics starting around 21:40 UTC. Dashboards and monitors relying on these metrics are affected.
Report: "Elevated error rates in queries across multiple products"
Last update: We are actively investigating issues querying data affecting multiple products. As a result of this issue, there might be errors when trying to load data from queries on different pages of the web application or through the API.
Report: "Monitors - Delayed Evaluation"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are investigating delays in Distribution Monitors Evaluation, which began at 17:30 UTC. Monitors for other types of metrics are evaluating as usual.
Report: "Delayed Traces and Spans in APM"
Last update: The incident is now resolved. APM trace ingestion and all downstream systems, including monitors, have fully recovered and are up to date.
We are monitoring a fix for the increased processing latency in APM Metrics. APM data in live view is current, but distributed tracing metrics are delayed by 20 minutes. Monitors sourced from this data are impacted until the data becomes current.
As a result of the issue, we are monitoring delays in Monitors Evaluation.
A fix has been implemented and we are monitoring the results.
We are investigating increased latency processing Traces and Spans in APM. As a result of this issue, some users may see missing or delayed Traces and Spans starting at 18:33 UTC.
Report: "Delayed Traces and Spans in APM"
Last update: We are investigating increased latency processing Traces and Spans in APM. As a result of this issue, some users may see missing or delayed Traces and Spans starting at 18:45 UTC.
Report: "Delayed AWS Metrics and Events"
Last update: This incident has been resolved.
A fix has been implemented and recovery is in progress. To prevent spurious alerts, monitors on AWS Metrics and Events remain disabled until recovery is complete.
The issue has been identified and a fix is being implemented.
We are investigating increased latency processing AWS metrics and events. As a result of this issue, some users may see delays or gaps in graphs that contain these metrics and events. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed AWS Metrics and Events"
Last update: We are investigating increased latency processing AWS metrics and events. As a result of this issue, some users may see delays or gaps in graphs that contain these metrics and events. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Monitors - Delayed Evaluation"
Last update: This incident has been resolved.
The incident has fully recovered. The service is now fully operational.
We are investigating delays in Monitors Evaluation, which began at 12:45 UTC.
Report: "Monitors - Delayed Evaluation"
Last update: We are investigating delays in Monitors Evaluation, which began at 12:45 UTC.
Report: "Delayed processing of APM Trace Metrics"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are investigating delayed processing of APM Trace metrics starting around 07:00 UTC. Dashboards and monitors relying on these metrics are affected.
Report: "Login Issues"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are investigating user login issues related to reCAPTCHA for customers using password login. If you experience an issue with reCAPTCHA, refreshing the page can often mitigate the issue. Please note that data processing and alerts are not affected by this incident.
Report: "Delayed Processing for a Subset of Metrics"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and continue to work on a fix. It is important to note that no data has been lost: data is being backfilled and will be available once the service is operational again.
We have identified the underlying issue and continue to work on a fix. It is important to note that no data has been lost, and it will be backfilled and available once the service is operational again.
We have identified the underlying issue and are working on a fix. It is important to note that no data has been lost, and it will be backfilled and available once the service is operational again.
We are investigating increased latency processing Trace Metrics. As a result of this issue, some users may see delays or gaps for a subset of their metrics on graphs and statistics on Service Catalog.
Report: "Degraded Web Application Performance"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are continuing to work on a fix. Degraded web application performance is primarily observed by customers with low network bandwidth.
We have identified the underlying issue and are working on a fix.
We are investigating degraded performance with the web application.
Report: "Increased delay processing events"
Last update: This incident has been resolved.
We are continuing to monitor the progress of processing the backlog in Events. The majority of the backlog has been processed. Event Monitor evaluation remains delayed while we finish processing the backlog.
We've implemented a fix, and are currently working through the backlog of delayed Events. Event Monitor evaluation remains delayed while we work through the backlog. All other monitor types have recovered and are currently evaluating.
We have identified the issue causing delayed ingestion of Events. Alerting evaluation continues to be delayed for Event Monitors, Process Monitors, and Cloud Network monitors. All other monitor types have recovered and are currently evaluating.
We are continuing to investigate this issue.
We are investigating increased latency processing Events. As a result of this issue, some users may see delays in the event stream or for event queries on dashboards, and event alert evaluation is delayed. This issue also caused a delay in the processing of alerts across other products. We've implemented a fix for this, and are monitoring the recovery of the alert evaluation pipeline. As a result, a subset of alerts may be delayed while the system recovers.
Report: "APM connections retrying"
Last update: This incident has been resolved.
We have mitigated the cause of transient agent submission errors for APM, and customers should no longer observe these errors. The Datadog Agent automatically retried these errors and succeeded on retry; this incident did not result in any data loss.
The issue has been identified and a fix is being implemented.
Some US1 customers are experiencing degraded performance for APM. Customers may see transient errors, but these should resolve with an automatic retry from the Datadog Agent.
Report: "Delayed APM Distribution Metrics, Data Streams Monitoring Metrics & Monitor Notifications"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Data Streams Monitoring metrics and associated monitor notifications based on these metrics have recovered.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are investigating increased latency in processing APM Distribution Metrics and Data Streams Monitoring Metrics, as well as monitor notifications based on these metrics, which began at 17:47 UTC. As a result of this issue, some users may see delays or gaps for these metrics on graphs, including APM pages, as well as delayed monitor notifications.
Report: "Delayed APM data ingestion"
Last update: This incident has been resolved.
A fix has been implemented and systems are recovering.
We are investigating increased ingestion latency of APM data.
Report: "Monitors - Delayed Evaluation for Distribution Metric Monitors"
Last update: This incident has been resolved.
We have rolled out a fix and all distribution monitors are up to date. We are continuing to monitor the customer experience and expect to resolve this incident in the next 30 minutes.
We are in the process of rolling out a fix that will bring all distribution monitors up to date. We will update again when the issue is resolved.
The root cause has been identified. We are working on a fix so that distribution metric monitor evaluations are up to date.
We are investigating delays in monitor evaluations for monitors based on distribution metrics, starting at 15:35 UTC. This is causing a delay in notifications.
We are investigating delays in Distribution Metric Monitors Evaluation, which began at 15:35 UTC.
Report: "Monitors - Delayed Evaluation"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating delays in Events-based Monitor Evaluation, which began at 16:00 UTC.
Report: "Delayed Distribution Metrics"
Last update: This incident has been resolved. All distribution metrics are being processed and monitors are no longer disabled for distribution metrics.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are investigating increased latency processing Distribution Metrics. As a result, some users may see delays or gaps for distribution metrics on graphs, including APM pages. Monitors based on this data may also be delayed. We have identified the problem and are actively working to resolve the issue.
Report: "Delayed distribution metrics & monitor notifications"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are working on a fix.
We are investigating delays in distribution metrics and in monitor notifications for monitors based on these metrics, which began at 17:40 UTC.
Report: "Delayed Distribution Metrics"
Last update: This incident has been resolved. All distribution metrics are being processed and monitors are no longer disabled for distribution metrics.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and remediation steps are underway.
We are investigating increased latency processing Distribution Metrics. As a result of this issue, some users may see delays or gaps for distribution metrics on graphs. To prevent spurious alerts, we have temporarily disabled monitors based on distribution metrics.
Report: "[SSO] Login Errors"
Last update: This incident has been resolved. If you continue to see issues, please contact Datadog technical support.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
We have identified the issue and are implementing a fix.
We are investigating user login issues with the web application when using Okta SSO.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are investigating delays in Monitors Notifications for distribution metrics, which began at 20:00 UTC.
Report: "Degraded Web Application Performance"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are investigating degraded performance with the web application related to metrics-based widgets.
Report: "Web UI features may be hidden"
Last update: This incident has been resolved. Please refresh your Datadog web page to resolve the issue completely.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue that is causing certain features to be hidden from our UI. There is no data loss or monitoring impact.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating delays in Monitors Notifications, which began at 06:05 ET.
Report: "Delayed Metrics Monitor Evaluations"
Last update: This incident has been resolved.
Monitors with long intervals may still be delayed, but the service has recovered.
We have identified the issue and deployed a fix; we are monitoring the recovery.
We are investigating increased delays in metrics-based monitors for some customers.
Report: "Delayed Monitors Evaluations"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We are investigating delayed evaluation of a subset of metric monitors. Customers may experience delayed or missing monitor notifications as a result.
Report: "APM - degraded performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating an issue in executing trace queries; the team is working on a fix.
Report: "CI Visibility - Page Load issue"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an issue that prevents most Software Delivery pages from loading. Intelligent Test Runner, Quality Gates, and GitHub PR comments are also affected.
Report: "Application Security Management - Issue Updating Configurations"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating an issue in updating configurations in the product; the team is working on a fix.
Report: "Partial outage on components of RUM product"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an issue that affects the use of Sankey and Cohorts Analysis in the RUM product; the team is working on a fix.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
We identified a delay in Monitor Notifications between 13:52 UTC and 14:05 UTC. The issue has been resolved, but we continue to monitor the situation.
Report: "Delayed AWS, GCP, Azure, and SaaS Integration Metrics"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We are investigating increased latency processing some AWS, GCP, Azure and SaaS Integration Metrics. As a result of this issue, some users may see delays or gaps in graphs that contain these metrics. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
We are finalizing our recovery and expect that customers will see no further impact at this time. We will continue to monitor for issues.
We are seeing continuing improvements and are recovering as quickly as possible while maintaining system stability. Distribution metrics remain delayed and evaluation of associated monitors is currently skipped. Point metrics and associated monitors are fully recovered.
We are seeing continuing improvements. Distribution metrics remain delayed and evaluation of associated monitors is currently skipped.
We are seeing improvements in metrics processing. Distribution metrics remain delayed and evaluation of associated monitors is currently skipped.
We are investigating issues in metrics processing that are impacting monitor evaluation, dashboards, and other products.
We are continuing to investigate this issue.
We are investigating delays in Monitors Notifications, which began at 14:40 UTC.
Report: "We are investigating user login issues with the web application"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are working on a fix.
We are investigating user login issues with the web application when logging in by email. Please note that data processing and alerts are not affected by this incident.
Report: "Metrics historical data failed queries"
Last update: Our metrics system has recovered and all historical metrics are now queryable.
The system continues to recover. Data is available, but some results may be slow or incomplete until full recovery. Our teams continue to monitor the incident.
A fix has been implemented. While the system continues to recover, data will be available but some results may be slow or incomplete until full recovery is complete.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are investigating failing queries for historical metrics data, affecting time ranges more than one day in the past. Queries for recent data are not affected by this incident.
Report: "Partial outage of metrics query"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Elevated Error Rates for Metrics Submission"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are investigating elevated error rates for Metrics Submission APIs. As a result of this issue, submitting new metric data through the API might fail temporarily. Please note that the Datadog Agent and Client Libraries will buffer data or retry to avoid data loss.
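The buffer-or-retry behavior mentioned in the update above can be illustrated with a minimal sketch. This is not the Datadog Agent's or any client library's actual implementation: the class name `BufferedSubmitter`, the `send` callback, and all parameters are hypothetical, and the exponential backoff with jitter is an assumed policy chosen for illustration.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, Dict, List

Point = Dict[str, object]


class BufferedSubmitter:
    """Hypothetical client-side buffer that retries failed submissions."""

    def __init__(
        self,
        send: Callable[[List[Point]], bool],  # assumed: returns True on success
        max_buffer: int = 1000,
        max_attempts: int = 5,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._send = send
        # Bounded buffer: when full, the oldest points are dropped first.
        self._buffer: Deque[Point] = deque(maxlen=max_buffer)
        self._max_attempts = max_attempts
        self._base_delay = base_delay
        self._sleep = sleep

    def submit(self, point: Point) -> None:
        """Queue a point locally; nothing is sent until flush() is called."""
        self._buffer.append(point)

    def pending(self) -> int:
        """Number of points still waiting to be sent."""
        return len(self._buffer)

    def flush(self) -> bool:
        """Try to send everything buffered, backing off between attempts."""
        if not self._buffer:
            return True
        batch = list(self._buffer)
        for attempt in range(self._max_attempts):
            if self._send(batch):
                # Drop only the points that were actually delivered.
                for _ in batch:
                    self._buffer.popleft()
                return True
            if attempt < self._max_attempts - 1:
                # Exponential backoff with jitter before the next attempt.
                delay = self._base_delay * (2 ** attempt)
                self._sleep(delay + random.uniform(0, delay))
        # All attempts failed: the points stay buffered for a later flush.
        return False
```

Because failed batches stay in the buffer, a later `flush()` after the API recovers delivers the same points, which is how a temporary submission failure can avoid becoming data loss.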
Report: "Elevated Errors for API Key Validation"
Last update: From 12:45 PM to 1:15 PM US EST, Datadog’s endpoint to validate Datadog API keys was unavailable. During this window, Datadog Agents would be unable to validate their API key. In all cases, Agents would continue to send data. Some Agents running in Kubernetes may be marked unhealthy until restarted. Newly started Agents would fail to start. Build jobs using our CI Visibility product would be missing custom tags and measures.
Report: "Google SSO login issues for web application"
Last update: This incident has been resolved.
We are continuing to monitor progress. We will post further updates when we have them.
We are seeing signs of recovery and are continuing to monitor progress. We will post further updates when we have them.
We are investigating user login issues with the web application via Google SSO. Users switching orgs might also be affected. Please note that data processing and alerts are not affected by this incident.
Report: "Elevated Error Rates for Metrics Submission"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring recovery. Metric monitor evaluations still might be delayed; we will post an update when this recovers.
We are still investigating elevated error rates for Metrics Submission APIs and delays in processing metric monitors.
We are investigating elevated error rates for Metrics Submission APIs. As a result of this issue, submitting new metric data through the API might fail temporarily. Please note that the Datadog Agent and Client Libraries will buffer data or retry to avoid data loss.
Report: "Logs Status Elevated Error Rates"
Last update: This incident has been resolved. As a result of this incident, logs from AWS Lambda (specifically, those tagged with source:lambda) were incorrectly categorized as errors from 18:30 UTC to 20:50 UTC on 2024-01-22. All logs after this date are being processed as normal.
A fix has been implemented and we are monitoring the results.
We have identified the underlying issues for elevated error rates for Log Management. As a result of this issue, some users may see incorrect statuses for logs from AWS Lambda.
Report: "Delayed Infrastructure Updates"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update in 30 minutes once the service is fully operational.
We are investigating increased latency processing host updates. As a result of this issue, some users may see delays in host activity status updates on the infrastructure list.
Report: "Elevated error rate for metrics and delayed metric monitors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are actively investigating elevated errors and slow queries for metrics data. As a result of this issue, some users may see errors when trying to load data on dashboards, and metric monitor evaluation may be delayed.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We are continuing to work on a fix for this issue.
We have identified the underlying issue and are working on a fix.
We are investigating delays in the processing of Metrics and corresponding Monitor Notifications, which began at 22:30 UTC.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We are continuing to investigate the issue.
We are investigating delays in Monitors Notifications, which began at 15:16 UTC.
Report: "Degraded Web Application Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and we are working on a fix. Note that this only affects the Monitors, SLOs, and Incident Management web APIs.
We are investigating degraded performance with the web application.
Report: "Elevated Error Rates for Log Queries and Monitors"
Last update: This incident has been resolved.
Fix rollout has now been completed.
The fix rollout is currently ongoing. Once completed we will confirm resolution.
We have successfully tested a fix for this issue and are currently deploying it to resolve this incident.
We're still working on a fix for historical data impacted by this incident.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved. At this time, newly ingested data is properly queryable, and monitors targeting Logs sent from 2023-10-03 20:40 UTC onwards are valid. Queries targeting logs between 2023-10-02 11:40 UTC and 2023-10-03 20:40 UTC may return erroneous data. We are evaluating a fix that will restore query correctness for this time-window.
We have identified the underlying issue and are working on a fix.
We are continuing to investigate these issues, and will provide an update as soon as possible.
We are actively investigating issues with Log Queries returning unexpected results. As a result of this issue, some users may experience issues querying logs on the web application or API, and with Log-based Monitors and Log-based Metrics.
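For anyone re-checking log queries made during this incident, the affected window quoted in the updates above (2023-10-02 11:40 UTC to 2023-10-03 20:40 UTC) can be tested against a query's time range. The sketch below is illustrative only: the function name is hypothetical, and treating the window as half-open (end excluded) is an assumption based on the statement that logs sent from 20:40 UTC onwards are valid.

```python
from datetime import datetime, timezone

# Window quoted in the incident update; queries over logs in this range
# may have returned erroneous data.
AFFECTED_START = datetime(2023, 10, 2, 11, 40, tzinfo=timezone.utc)
AFFECTED_END = datetime(2023, 10, 3, 20, 40, tzinfo=timezone.utc)


def overlaps_affected_window(query_start: datetime, query_end: datetime) -> bool:
    """Return True if [query_start, query_end) intersects the affected window.

    Both arguments must be timezone-aware so that the comparison against
    the UTC bounds is unambiguous.
    """
    if query_start.tzinfo is None or query_end.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    # Assumption: the window is half-open, so a query starting exactly at
    # AFFECTED_END is treated as unaffected.
    return query_start < AFFECTED_END and query_end > AFFECTED_START
```

A query range that returns True here would be a candidate for re-running once historical data has been repaired.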
Report: "Delayed Metric Monitor Notifications"
Last update: This incident has been resolved.
We have identified the underlying issue and are working on a fix. It is important to note that no data has been lost, and notifications will be caught up once the service is operational again.
We are investigating delays in Metrics Monitors Notifications, which began at 02:35 UTC.
Report: "Monitors Notifications Delayed"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We are aware of delays in Metric Monitors Notifications, which began at 20:55 UTC. We have identified the underlying issue and are working on a fix.
Report: "Delayed Processing for a Subset of Metrics"
Last update: This incident has been resolved.
We are continuing to investigate the issue. To prevent spurious alerts, we have temporarily disabled affected monitors based on this data.
We are investigating increased latency processing a subset of metrics. As a result of this issue, some users may see delays or gaps for a subset of their metrics on graphs.
Report: "Delayed monitor notifications & metrics graphing issues"
Last update: This incident has been resolved.
Users may still be experiencing some issues with graphs not loading in the web application. We will provide another update once the issue is fully resolved.
Issues with monitor notifications have been resolved. Users may still be experiencing some issues with graphs not loading in the web application. We will provide another update once the issue is fully resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We are continuing to investigate the issue.
We are investigating delays in Monitors Notifications for monitors which rely on distribution metrics, which began at 17:58 UTC. Users may also experience some issues with graphs not loading in the web application. Please note that data ingest is not affected by this incident.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are working on a fix.
We are investigating delays in Monitors Notifications, which began at 13:53 UTC.