Historical record of incidents for Datadog US3
Report: "Delayed Monitors Notifications"
Last update: The issue has been identified and a fix is being implemented.
We are investigating delays in Monitors Notifications, which began at 15:36 UTC.
Report: "Delayed Monitors Notifications"
Last update: We are investigating delays in Monitors Notifications, and submission errors for customers using private link, which began at 08:37 UTC.
Report: "Delayed Metrics"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results. Full recovery for all impacted metrics is estimated to take up to 60 minutes. We will provide an update if recovery happens sooner.
The issue has been identified and a fix is being implemented.
We are investigating increased latency processing Metrics impacting metrics generated from APM (traces), Logs, Synthetics, RUM, Containers, Integrations, DBM, Estimated Usage Metrics, and Distribution Metrics. As a result of this issue, some users may see delays or gaps for metrics on graphs. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed Metrics"
Last update: We are investigating increased latency processing Metrics impacting metrics generated from APM (traces), Logs, Synthetics, RUM, Containers, Integrations, DBM, and Distribution Metrics. As a result of this issue, some users may see delays or gaps for metrics on graphs. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are investigating delays in Distribution Monitors Evaluation, which began at 5:30pm UTC. Monitors for other types of metrics are evaluating as usual.
Report: "Delayed Monitors Notifications"
Last update: We are investigating delays in Monitors Evaluation, which began at 5:30pm UTC.
Report: "Metrics delayed"
Last update: This incident has been resolved.
All metrics data during the impacted window is available. We will begin re-enabling monitors with an evaluation window greater than 60 minutes. Monitors with an evaluation window of less than 60 minutes continue to be evaluated.
We have identified the issues and are backfilling data. Monitors with an alert window of one hour or less have been restored, and live metrics data is available.
We are continuing to work on a fix for this issue.
For the period May 2, 2025, 11:25 - 13:00 UTC, metrics are delayed. We are backfilling the data for that time period and anticipate no data loss. Metric monitors that include data from the 11:25 - 13:00 UTC time range are delayed. Metric queries and metric monitors evaluating data after 13:00 UTC are correct and working as expected.
All metric monitor notifications have been delayed starting at 14:57 UTC. We are working on identifying the issue.
We are continuing to work on a fix for this issue.
For the period May 2, 2025, 11:25 - 13:00 UTC, metrics are delayed. We are backfilling the data for that time period and anticipate no data loss. Metric monitors that include data in that time range are delayed. Metrics after 13:00 UTC are correct, and metric monitors that only consider that timeframe are working properly.
For the period May 2, 2025, 11:25 - 13:00 UTC, metrics are delayed. We are backfilling the data for that time period and anticipate no data loss. Metric monitors that include data in that time range are delayed. Metrics after 13:00 UTC are correct, and metric monitors that only consider that timeframe are working properly.
We are continuing to investigate this issue.
We’re investigating increased metric latencies. Graphs may be delayed. To avoid spurious alerts, we’ve temporarily disabled alerts for Metric Monitors.
Report: "Login Issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are investigating user login issues related to reCAPTCHA for customers using password login. If you experience an issue with reCAPTCHA, refreshing the page can often mitigate the issue. Please note that data processing and alerts are not affected by this incident.
Report: "Delayed Log Monitors Notifications"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are working on a fix.
We are investigating delays in Log Monitors Notifications, which began at 1:40 PM UTC.
Report: "Delayed Metrics Based Monitor Notifications"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are investigating delays in metrics based Monitor Notifications, which began at 20:20 UTC.
Report: "Degraded Web Application Performance"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are continuing to work on a fix. Degraded web application performance is primarily observed in customers with low network bandwidth.
We have identified the underlying issue and are working on a fix.
We are investigating degraded performance with the web application.
Report: "Degraded Web Application Performance"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are investigating degraded performance with the web application.
Report: "Web UI features may be hidden"
Last update: This incident has been resolved. Please refresh your Datadog web page to resolve the issue completely.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue that is causing certain features to be hidden from our UI. There is no data loss or monitoring impact.
Report: "Metric monitor evaluations delayed for aws.* metrics"
Last update: This incident has been resolved.
Monitor evaluations for aws.* metrics are no longer affected. We will continue monitoring the recovery.
The issue has been identified and a fix is being implemented.
We are experiencing issues with processing cloud integrations, which is resulting in delayed aws.* integration metrics. We have disabled notifications relying on these metrics. We are investigating the issue and will provide additional information as it becomes available.
Report: "CI Visibility - Page Load issue"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an issue that prevents most Software Delivery pages from loading. Intelligent Test Runner, Quality Gates, and GitHub PR comments are also affected.
Report: "APM - degraded performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating an issue in executing trace queries. The team is working on a fix.
Report: "Web Application Not Loading"
Last update: This incident has been resolved.
A workaround has been implemented and we are monitoring the results.
We are investigating loading issues on our web application. As a result, some users might be getting errors when loading the web application. Please note that data processing and alerts are not affected by this incident.
Report: "Issues with data ingesting and alerting"
Last update: This incident has been resolved.
Remediation efforts continue. RUM and Application Vulnerability Management are operational again. Cloud SIEM and Cloud Security Management, as well as alerting off these data types, continue to be impacted. We will provide another update once the issue is fully resolved.
Remediation efforts continue. Profiling is operational again. RUM, Cloud SIEM, Cloud Security Management, and Application Vulnerability Management, as well as alerting off these data types, continue to be impacted. We will provide another update once the issue is fully resolved.
We continue to deploy fixes and are monitoring the results. We will provide another update once the issue is fully resolved.
We continue to deploy fixes and are monitoring the results. RUM, Profiling, Cloud SIEM, Cloud Security Management, and Application Vulnerability Management, as well as alerting off these data types, continue to be impacted. We will provide another update once the issue is fully resolved.
We continue to deploy fixes and are monitoring the results. We will provide another update once the issue is fully resolved.
APM is operational at this time, and alerting based on APM data has also resumed. RUM, Profiling, Cloud Security Management, and Application Vulnerability Management, as well as alerting off these data types, continue to be impacted. We will provide another update once the issue is fully resolved.
We have deployed a fix and are monitoring the results. Certain data types (Logs, NPM, and Synthetics) are operational again, and alerting from those types has also resumed. APM, RUM, Profiling, Cloud Security Management, and Application Vulnerability Management, as well as alerting off these data types, continue to be impacted. We will provide another update once the issue is fully resolved.
We are investigating an issue with ingesting data which began around 20:40 UTC. As a result, data from Log Management, APM, Synthetics, Profiling, RUM, CSM, and NPM is delayed. Additionally, monitors derived from this data are delayed.
Report: "We are investigating user login issues with the web application"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are working on a fix.
We are continuing to investigate this issue.
We are investigating user login issues affecting email-based login to the web application. Please note that data processing and alerts are not affected by this incident.
Report: "Delayed events for Logs, Synthetics and Container"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
The issue has been identified and a fix is being implemented.
We are investigating increased latency processing Logs, Synthetics Test Results and Container updates. As a result of this issue, some users may see delays or gaps for data in their logs queries and synthetics tests status. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Elevated Error Rates for Error Tracking"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results.
We are investigating increased errors in Error Tracking processing. As a result of this issue, some users may experience gaps in Error Tracking updates and alerts.
Report: "Delayed Monitors Notifications"
Last update: All monitor notifications have caught up.
We are continuing to investigate this issue.
We are investigating delays in Monitors Notifications, which began at 7:37 UTC.
Report: "Elevated Error Rates for Monitors"
Last update: This incident has been resolved.
A fix has been implemented. Monitors have recovered and live data is now available. Backfilling is ongoing for historical data.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are actively investigating elevated error rates for Monitors. As a result of this issue, some users may also experience issues with CI Visibility, NDM, NPM, Profiling, RUM, and Synthetics.
We are actively investigating elevated error rates for Monitors. Metrics monitors are not affected.
Report: "Delays in Processes Monitors Evaluation (Other Monitor types unaffected)"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are investigating delays in Processes Monitors Evaluation, which began at 15:15 UTC. Note that this only affects monitors based on the Processes product. All other monitor types are unaffected.
Report: "User Login Issues and Delayed Synthetics/APM/Sketch Metrics"
Last update: This incident has been resolved.
The majority of metrics processing delays have been resolved and we are actively monitoring the recovery for the remaining metrics.
We are continuing to work on recovery. We will provide another update at 9:45 pm EST.
We are continuing to work on recovery. We will provide another update at 9:15 pm EST.
We are continuing to work on recovery. We will provide another update at 8:45 pm EST.
We have identified the issue and are working towards recovery. We will provide another update at 8:15 pm EST.
We are continuing to investigate this issue.
We are investigating increased latency processing synthetics metrics, APM metrics, and sketch metrics. As a result, notifications for these monitor types may be delayed.
Report: "Elevated Errors for API Key Validation"
Last update: From 12:45 to 1:15 PM US EST, Datadog’s endpoint to validate Datadog API keys was unavailable. During this window, Datadog Agents were unable to validate their API key. In all cases, Agents continued to send data. Some Agents running in Kubernetes may have been marked unhealthy until restarted. Newly started Agents failed to start. Build jobs using our CI Visibility product were missing custom tags and measures.
Report: "Elevated Error Rates for Metrics Queries"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are actively investigating elevated error rates for Metrics Queries. As a result of this issue, some users may see errors with metrics graphs on the web application or API. Metrics monitors evaluations are also delayed as a consequence.
Report: "Elevated Error Rates for Metric Monitors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are actively investigating elevated error rates for Metric Monitors. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed monitor evaluation and query failures across multiple products"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
We experienced an increased rate of query failure across multiple products including Logs Management, APM, RUM, Synthetics, CI Visibility, Error Tracking, Audit Logs, Database Monitoring and NPM. This resulted in delayed monitor evaluation and notification for a subset of monitors. We are monitoring recovery. Metric Monitors and queries are unaffected by this incident.
We are currently investigating this issue.
Report: "[SSO] Login Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating user login issues with the web application. Please note that data processing and alerts are not affected by this incident.
Report: "Data delays and web errors"
Last update: This incident has been resolved.
All components have recovered now. We are replaying a few failed notifications.
We are continuing to work on a fix for this issue.
We have identified an issue which caused temporarily elevated error rates on our web application (500 error pages) and increased latency processing monitoring data. As a result of this issue, customers might still see delays in some metrics, as well as in monitors relying on this data. We are currently working on a fix and the data is being backfilled. We will provide another update once the service is fully operational again.
Report: "Delays in Logs ingestion"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are working on a fix. It is important to note that no data has been lost, and notifications will be caught up once the service is operational again.
We are investigating delays in Logs Ingestion which will delay log data and log-based monitors notifications. This began at 21:56 UTC.
Report: "Elevated Access Denied errors"
Last update: This incident has been resolved.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
We have identified the underlying issue and are working on a fix. Users should be able to access the Datadog web application at this time, but may still see occasional errors.
We are investigating elevated error rates on our web application. As a result, some users might be getting Access Denied errors when loading the web application. Please note that data processing and alerts are not affected by this incident.
Report: "Elevated Error Rates for Log Queries and Monitors"
Last update: This incident has been resolved.
A fix has been rolled out and we are currently monitoring to confirm full resolution.
We have successfully tested a fix for this issue and are currently deploying it to resolve this incident.
We're still working on a fix for historical data impacted by this incident.
We're still working on a fix for historical data impacted by this incident.
We're still working on a fix for historical data impacted by this incident.
We're still working on a fix for historical data impacted by this incident.
We're still working on a fix for historical data impacted by this incident.
We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved. At this time, newly ingested data is properly queryable, and monitors targeting Logs sent from 2023-10-03 20:40 UTC onwards are valid. Queries targeting logs between 2023-10-02 11:40 UTC and 2023-10-03 20:40 UTC may return erroneous data. We are evaluating a fix that will restore query correctness for this time-window.
We have identified the underlying issue and are working on a fix.
We are continuing to investigate these issues, and will provide an update as soon as possible.
We are actively investigating issues with Log Queries returning unexpected results. As a result of this issue, some users may experience issues querying logs on the web application or API, and with Logs based Monitors and Log-Based Metrics.
Report: "Delayed Metrics"
Last update: This incident has been resolved.
Metrics are no longer delayed for any customers.
We are continuing to work on a fix for this issue.
We are investigating increased latency processing Metrics. As a result of this issue, some users may see delays or gaps for metrics on graphs. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed Azure Native Integration Metrics"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We’re actively investigating increased latencies for collecting Azure Native Integration metrics due to third party errors. As a result, there might be delays in graphs displaying these metrics. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed Synthetic Browser Test Results"
Last update: We have scaled up the underlying system and we no longer observe latency in synthetic browser test results.
We have identified an issue that resulted in an increased latency executing Synthetics browser tests. As a result of this issue, some users may experience delays in receiving test results and notifications.
Report: "Degraded Web Application Performance"
Last update: This incident has been resolved.
We have identified the underlying issue, and are recovering. We are monitoring the recovery and will provide another update once the issue is fully resolved.
We are continuing to investigate this issue.
We are investigating loading issues on our web application and delays in ingesting metrics data and evaluating monitors on this data, which began at 18:51 UTC.
We are investigating degraded performance with the web application.
Report: "Some monitor notifications are delayed"
Last update: This incident has been resolved.
We are investigating delays in Monitors Notifications, which began at 3:43 AM UTC. This only impacts monitors which rely on APM trace distribution metrics. We have deployed a fix and we are monitoring the results. We will provide another update once the issue is fully resolved.
Report: "Delayed Azure Native Integration Metrics"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We’re actively investigating increased latencies for collecting Azure Native Integration metrics due to third party errors. As a result, there might be delays in graphs displaying these metrics. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed Metrics"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an issue causing increased latency processing Metrics and are working on a fix. As a result of this issue, some users may see delays or gaps for metrics on graphs. To avoid spurious alerts, we’ve temporarily disabled “no data” alerts for Metric Monitors.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved. Delays may have been observed for a subset of Distribution Metrics Monitor notifications between 22:30 and 00:56 UTC.
We are investigating delays in Monitors Notifications for distribution metrics, which began at 22:30 UTC.
Report: "Delayed Synthetics tests results"
Last update: Backfill is finished. This incident has been resolved.
All services are fully operational and processing live data. We have started to backfill Synthetics tests results and will provide another update once the backfills are finished.
We have deployed a fix and we are monitoring the results. It is important to note that no data has been lost, and it will be backfilled and available once the service is operational again. We will provide another update once the issue is fully resolved.
We have identified an issue that resulted in an increased latency processing Synthetics tests results and are working on a fix. As a result of this issue, some users may see delays with test results and in notifications based on this test data.
Report: "Web application performance degraded"
Last update: This incident has been resolved.
We have identified the underlying issue and are working on a fix.
We are investigating loading issues on our web application. As a result, some users might be getting errors or degraded performance when loading the web application, specifically on dashboards.
Report: "Web Application Not Loading"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified. Synthetic tests may be temporarily running with an outdated configuration and new Synthetic tests may not start immediately.
We are investigating loading issues on our web application. As a result, some users might be getting errors when loading the web application. Please note that data processing and alerts are not affected by this incident.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are investigating delays in Monitors Notifications, which began at 21:24 UTC.
Report: "Backfilling historical data for March 8, 2023 incident"
Last update: We have finished backfilling data across all products. All data received during the incident that had been successfully buffered but unprocessed is now fully accessible on the platform. Due to the nature of this outage, you may see some residual gaps in the data we received within the first few hours after the start of the incident. We truly appreciate your patience and understanding during this incident.
We have completed backfill of data for the following products: Database Monitoring and Serverless Monitoring. We are now in the process of validating and verifying data across all customers in those products. For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
We have also completed backfilling data for the following products: RUM. We are now in the process of validating and verifying data across all customers in those products. For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
We have completed backfill of data for the following products: APM traces and services, Logs, Network Performance Monitoring, Network Device Monitoring, Profiling, and CI Visibility. We are now in the process of validating and verifying data across all customers in those products. For other products, we are actively working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
All Datadog services are now available and able to receive, query, and report on live data. Monitors continue to be evaluated correctly since live data has been restored. Some customers may still observe gaps in historical data for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
All Datadog services are now available and able to receive, query, and report on live data. Monitors continue to be evaluated correctly since live data has been restored. Some customers may still observe gaps in historical data for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Monitors continue to be evaluated correctly since live data has been restored. Unless noted otherwise, all Datadog services are now available and able to receive and query live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
APM Traces and Error Tracking are operational. We will continue to monitor progress towards recovering the remaining services. Unless noted otherwise, all Datadog services are now available and able to receive, query, and report on live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Unless noted otherwise, all Datadog services are now available and able to receive, query, and report on live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
Unless noted otherwise, all Datadog services are now available and able to receive, query, and report on live data. Some customers may still observe gaps in historical data for certain products for parts of the last 24 hours. We are now working on backfilling data and will provide updates every 2 - 3 hours until the backfill effort is complete and the incident is fully resolved.
APM Traces and Error Tracking are operational. We will continue to monitor progress towards recovering the remaining services.
Security Monitoring is operational. SLOs are operational. Cloud Integrations are operational. Profiling recent data is available for queries. We will continue to monitor progress towards recovering the remaining services.
RUM is fully operational. We will continue to monitor progress towards recovering the remaining services.
Logs Management is operational, live data and alerting are back to normal. External Archives and Log Forwarding are still delayed. Metrics are fully operational. Serverless monitoring is operational. We will continue to monitor progress towards recovering the remaining services.
Network Device Monitoring is fully operational. Metrics generated from Logs are now available. We will continue to monitor progress towards recovering the remaining services.
We're in the process of enabling metric alerts for some customers for time windows less than 1 hour. Network Performance Monitoring is fully operational. Event Management is fully operational. Error Tracking is partially available. We will continue to monitor progress towards recovering the remaining services.
The Synthetics product is fully operational. We're seeing partial recovery for Serverless Monitoring, as well as metrics from our cloud provider integrations. We will continue to monitor progress towards recovering the remaining services.
Monitors for Logs and Service Checks are operational. Database Monitoring is operational. We will continue to monitor progress towards recovering the remaining services.
Live data is now available for Logs, and CI Visibility is fully operational. We're seeing partial recovery for Watchdog. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
We are continuing to work on a fix for this issue.
Live Search on last 15 mins for APM Traces is recovered. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
We're seeing partial recovery across several products including Security Monitoring, CI Visibility and Network Performance Monitoring. These products may have gaps in data and partial limitations based on data available to monitors. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
We're seeing partial recovery across several products including SLOs and Logs. These products may have gaps in data and partial limitations based on data available to monitors. We will continue to monitor progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
Processes and their respective monitors, and Metrics, are operational in US3. There may be gaps in historical metric data. We continue to make progress towards recovering the remaining services. Data ingestion and monitor notifications remain delayed across non-metric data types.
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
At 06:00 UTC on March 8th, 2023, the Datadog platform started experiencing widespread issues across multiple products and regions. The web application was unavailable or intermittently loading, and data ingestion & monitor evaluation were delayed.
We will share a more detailed analysis post-recovery, but at a very high level:
- A system update on a number of hosts controlling our compute clusters caused a subset of these hosts to lose network connectivity.
- As a result, a number of the corresponding clusters entered unhealthy states and caused failures in a number of the internal services, datastores and applications hosted on these clusters.
Our current status is:
- We identified and mitigated the initial issue, and rebuilt our clusters.
- We also have recovered a number of our applications and services, including our web portals.
- We are now working on recovering and catching up the rest of our data systems for metrics, traces and logs across the regions that are still affected (see region-specific status pages). The recovery work is currently constrained by the number and large scale of the systems involved.
What to expect next:
- We are focusing on bringing back live data for all customers and all products before catching up on any historical data we may have stored during the outage.
- We expect live data recovery in a matter of hours (not minutes, and not days).
- We will continue to issue regular updates as the situation unfolds.
We understand how critical Datadog is to your business. We sincerely apologize for the inconvenience, and we are working hard to resolve this issue.
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We are continuing to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We continue to make progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We are still working on the identified issue and are making continued progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We are still working on the identified issue and are making continued progress towards recovering all services. Data ingestion and monitor notifications remain delayed across all data types.
We are still working on the identified issue and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.
We are still working on the identified issue and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.
We are still working on the identified issue and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.
We have identified the issue, and are making continued progress towards recovery. Data ingestion and monitor notifications remain delayed across all data types.
We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app. Additionally, the web application continues to have elevated error rates.
We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app. Additionally, the web application continues to have elevated error rates.
We are continuing to work on mitigating and investigating the issue causing delayed data ingestion across all data types. Monitor notifications are delayed, and you may observe delayed data throughout the app. Additionally, the web application continues to have elevated error rates.
We are continuing to investigate this issue.
We are still investigating issues causing delayed data ingestion across all data types. Monitor notifications may be delayed, and you may observe delayed data throughout the web app.
We are still investigating issues causing delayed data ingestion across all data types. Monitor notifications may be delayed, and you may observe delayed data throughout the web app.
We are investigating issues causing delayed data ingestion across all data types. As a result monitor notifications may be delayed, and you may observe delayed data throughout the web app.
We are investigating loading issues on our web application. As a result, some users might be getting errors when loading the web application.
Report: "GCP metrics delayed"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue with our metrics collection from Google Cloud Platform. Metrics collected from the Google Cloud Platform may be delayed.
Report: "Delayed Events"
Last update: This incident has been resolved. Remaining data are being processed.
We are continuing to monitor for any further issues. Backfilling is still in progress.
A fix has been implemented and we are monitoring the results. Recent data are being processed normally; older data impacted by the incident are currently being backfilled.
We have identified the underlying issue and are working on a fix. It is important to note that no data has been lost, and it will be backfilled and available once the service is operational again.
Report: "Delayed Monitors Notifications"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate the issue. Notifications are back to normal for all users, except for those sent to Microsoft Teams.
We are investigating delays in Monitors Notifications affecting a subset of customers, which began at 07:10 UTC on January 25, 2023.
Report: "[SSO] Login Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating user login issues with the web application via SSO. The issue causes the "Login with SAML" button to not appear for some users. While we work on a fix, users may contact support@datadoghq.com to get the correct link to log in with SAML.
Report: "Issue processing cloud integration data."
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are experiencing issues processing cloud integrations, resulting in delayed integration metrics and delayed processing of X-Ray traces. We have disabled notifications relying on these metrics. We are investigating the issue and will provide additional information as it becomes available.
Report: "Delayed Metrics"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are investigating increased latency processing Metrics. As a result of this issue, some users may see delays or gaps for metrics on graphs. To prevent spurious alerts, we have temporarily disabled monitors based on this data.
Report: "Delayed Monitors Notifications and Events"
Last update: This incident has been resolved.
Delays in Monitor notifications are now resolved as of 03:15 UTC. We continue to investigate delays with Events together with our cloud service provider.
We are continuing to investigate this issue with our cloud service provider.
We continue to investigate delays in Monitors Notifications and Events, and we have raised the issue with our cloud service provider for further investigation.
We are investigating delays in Monitors Notifications and Events, which began at 00:33 UTC.
Report: "Composite monitors evaluations are failing"
Last update: This incident has been resolved. Composite monitors have been evaluating again since 8:44am GMT.
Between 7:27am GMT and 7:57am GMT, half of the composite monitors were not evaluated for all customers. Since 7:57am GMT, none of them have been evaluated. Other types of monitors are not affected. We have identified the issue and a fix is being implemented.