Historical record of incidents for DataRobot
Report: "DataRobot Experiencing Stability Issues"
Last update: External cloud providers are experiencing outages, which are impacting the DataRobot platform and services.
Report: "Managed EU AI Cloud Degraded Performance"
Last update: Managed EU AI Cloud experienced degraded performance between 7:00 and 7:30 AM UTC. Some users may have experienced intermittent connection interruptions. Please reach out to support@datarobot.com if you have any questions.
Report: "Deployment drift and accuracy charts are broken"
Last update: The issue is now resolved and the drift charts are working as expected.
The team is working on implementing the fix.
The root cause has been identified and the team is working on the fix.
Report: "Codespaces are not starting"
Last update: The Engineering team has applied the fix and the issue has been resolved.
Codespaces are not starting on EU MTS because of internal connection issues. The Engineering team is working on the fix.
Report: "Notebooks sessions getting terminated."
Last update: Engineering has successfully applied the fix, and all notebooks are now running as expected across all MTS environments.
Engineering has found the potential root cause and applied a fix. Engineering is now monitoring impacted services.
A subset of running notebook sessions is being terminated in all environments (US/EU/JP). Multiple customers are impacted. The Engineering team is investigating the root cause.
Report: "Creation of new trial users is occasionally failing"
Last update: The Engineering team has applied the fix to US and EU MTS and the Trial user creation has been restored across all MTS environments.
The Engineering team has applied the fix to Japan MTS and the Trial user creation has been restored in that cluster. The team is now proceeding to apply the fix to US and EU MTS.
The Engineering team has identified the root cause and is working on a fix.
Report: "Users with language settings other than English in profile unable to launch or access AI apps on MTS"
Last update: A fix has been applied to all production environments, and no-code applications are operational in all languages.
ETA for the deployment of the hotfix to the Japan (JP) environment is approximately 4 hours.
We are continuing to work on a fix for this issue.
Engineering has identified the root cause of the issue and is currently preparing a fix.
Users with language settings other than English in their profile are unable to launch or access AI apps on MTS because the translation files are not loading correctly. As a workaround, users can switch to English. We are actively investigating the issue and working on a fix.
Report: "DataRobot STS degraded performance"
Last update: The Engineering team has identified the issue and is currently applying mitigation steps. If you continue to experience any issues, please contact DataRobot Support.
STS customers might observe degraded performance caused by regular maintenance performed by DataRobot Engineering. The Engineering team is working on the mitigation.
Report: "AWS outage in ap-northeast-1 region"
Last update: AWS had an outage in the ap-northeast-1 region that has since been resolved. Outage times: 7:47 AM GMT to 9:30 AM GMT. Engineering is working to restore any affected customers on STS.
Report: "Code Assistant functionality in Notebooks is unstable on US MTS"
Last update: This incident has been resolved.
Code Assistant API requests are currently taking longer than usual for US Production customers. This is a result of a partial outage on Azure. The Engineering team is monitoring the situation and will provide updates when functionality is restored.
Report: "Issues during VDB creation in Multi-Tenant SaaS environments"
Last update: The issue has been successfully mitigated in the MTS production environments. The incident is now contained, and VDB creation functionality has been fully restored.
Engineering has observed that VDB creation began failing across all MTS environments following today's production deployment. The team is currently investigating the root cause and actively working on a fix.
Report: "DataRobot SSO Issues"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
SSO login is broken, which may affect users attempting to log in. The Engineering team has prepared the fix. The approximate ETA for the hotfix release is 8 hours across all MTS environments.
Report: "Issues while accessing Custom Applications on US Prod"
Last update: The affected users can now access their Custom Applications. The incident has been mitigated.
Some users on US SaaS are unable to access their Custom Applications. The Engineering team is currently investigating. We will keep you updated.
Report: "The Trial user provisioning is not working."
Last update: The Trial user provisioning is back on and working as expected in the DataRobot US production environment. The issue is resolved.
The Trial user provisioning is not working on the DataRobot US production environment. The engineering team is currently investigating.
Report: "Issues with creating Custom Models from LLM Playground"
Last update: This incident has been resolved.
The issue is mitigated and users are able to create custom models again. The engineering team will continue to monitor the environment and prepare a permanent fix until the incident is contained. The estimate is ~ 2 hrs at the moment.
Japan MTS cluster is experiencing issues with creating custom models from LLM Playground. The engineering team is investigating.
Report: "Delay updating the Deployment Monitoring Information."
Last update: This incident has been contained.
The unprocessed message backlog continues to catch up. The engineering team is closely monitoring the process. We will provide an update once the processing of delayed messages is caught up.
Our team has identified the root cause and implemented the fix. Service Health and Accuracy no longer have a delay and are operating normally. The delay in Data Drift monitoring is improving; however, the Engineering team expects it will take several hours to fully recover as the system processes the accumulated data. The team can confirm there has been no data loss during this time. The team is currently monitoring the situation.
Our team has identified an issue with our Deployment Monitoring Information. This is a processing delay and no data loss is expected. Our team is currently investigating the root cause and is working on a fix. The following services are currently impacted: Service Health, Data Drift, and Accuracy monitoring.
Report: "The custom workload is failing to start"
Last update: Engineering has identified the root cause and fixed the issue. The Custom Workloads are starting normally on the US MTSaaS environment, and the incident is resolved.
The Custom Workload is failing to start in the US MTSaaS environment. Existing running Custom Apps aren't affected. The Engineering team is investigating the root cause.
Report: "Notebooks are not starting in the US MTS environment"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently experiencing an issue with Notebook services which is preventing Notebooks from starting in the US MTS environment. Our engineering team is actively working on resolving the problem.
Report: "A Network Policy service issue is causing an interruption"
Last update: The problem related to Network Policy has been resolved. All services are operational.
Engineering has identified the problem and mitigation has been applied. Engineering is currently monitoring the progress.
A Network Policy service issue is causing an interruption in functionality for some components on app.datarobot.com. Engineering is actively working on a resolution.
Report: "Custom model services issue"
Last update: Engineering has resolved the issue. This incident is now contained.
We are continuing to work on a fix for this issue.
Engineering has identified the root cause and is currently working on fixing the issue.
We are observing an issue with the Custom model services on the US production environment causing Custom Models, Custom Jobs and Custom Apps to stop working. Engineering is currently investigating the issue.
Report: "Notification policies, health checks, and scheduled custom jobs are currently not updating"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Due to an ongoing incident, notification policies, health checks, and scheduled custom jobs are currently not updating. Deployments themselves remain unaffected. The DataRobot team is working on resolution.
Report: "Search engine for documentation is not working for https://docs.datarobot.com/"
Last update: Engineering has implemented the fix and the search engine for the documentation at https://docs.datarobot.com/ is working as expected. This incident is now contained.
The DataRobot engineering team has identified the issue and is working towards a permanent fix.
The search engine for the documentation at https://docs.datarobot.com/ is not working, but the documentation is still available. The DataRobot team is working on resolving the issue. We apologise for the inconvenience caused.
Report: "Issue with Custom Models and Custom Applications."
Last update: Custom Models and Custom Applications are back to Operational state. No further updates are expected.
Our team has implemented a fix and the next update will be shared in an hour.
Our team has noticed some issues with custom models and custom applications after the recent production deployment on the US MTSaaS. The Engineering team is working on mitigation.
Report: "Sending LLM blueprint to Model Workshop doesn't work for LLM Blueprints without a VDB"
Last update: This incident has been resolved.
Starting the morning of September 16, 10 AM UTC, sending an LLM blueprint to the Model Workshop will fail if the Vector Database is not specified. The fix is ready, and the engineering team is currently verifying it.
Report: "Connection Issues with DataRobot Notebooks"
Last update: Engineering has implemented a fix and the notebooks on the DataRobot US, EU, and JP clouds are functioning as expected. The incident is marked as contained.
The issue has been identified and a fix is being implemented.
We are currently experiencing issues with Notebooks. Users on the DataRobot US, EU, and JP clouds are experiencing issues with new and existing Notebooks. Engineering is working on identifying the root cause.
Report: "Predictions for custom models and custom apps are unavailable for users on the US cloud"
Last update: Custom Models and Custom Applications are back to Operational state. No further updates are expected.
Predictions for custom models and custom apps are unavailable for users on the US cloud. The Engineering team is investigating the issue. The next update will be in 30 minutes.
Report: "Free Trial users on US and EU SaaS environments have issues with real-time predictions on text generation deployment features"
Last update: This incident has been resolved.
We are aware that a subset of free trial users on US and EU SaaS environments may experience some issues with real-time predictions on text generation deployment features. Engineering is working on a resolution.
Report: "Issue with a subset of email notifications on US and EU production."
Last update: Users may have experienced problems with receiving email notifications related to deployed models.
Report: "Users on US Production may experience slower allocation of Workers than normal"
Last update: The issue has been mitigated.
DataRobot Engineering has identified an issue where certain users on US Production may experience slower allocation of Workers than normal. We are investigating potential causes.
Report: "Issues Occurring with Project Creation and Model Deployments"
Last update: This incident has been resolved.
A fix has been implemented and engineering is currently monitoring the results.
Users are experiencing issues when creating projects or deploying models in DataRobot. Our engineering team is currently investigating the issue.
Report: "Customers Experiencing Errors with Custom Modeling and Notebooks"
Last update: Engineering has applied a fix in the Managed EU AI Cloud, which resolved the issue. The issue is contained.
Engineering has identified an issue with custom modeling and notebooks for customers in the Managed EU AI Cloud. Engineering has found a potential fix and is currently working on deploying that fix to production.
Report: "5xx errors from the prediction servers on US production"
Last update: The engineering team has identified the root cause and rolled back the changes. That has mitigated the issue and all the services are working as expected. The incident is marked as contained.
Multiple services including dedicated predictions are experiencing degraded performance. Engineering is investigating.
Report: "Project and Notebook Creation Down"
Last update: Our Engineering team noticed that Project and Notebook creation were down temporarily. Engineering applied a fix, and Project and Notebook creation are working as expected.
Report: "Increased error rate on US production"
Last update: The issue with the increased error rate on US production has been resolved after the token certificates update and service restart.
DataRobot is experiencing an increased error rate on US production. Engineering is investigating the root cause.
Report: "Predictions monitoring getting dropped intermittently."
Last update: The fix was deployed to the US production environment and has been verified. The incident is contained.
The Engineering team has identified the root cause of the issue with predictions monitoring getting dropped intermittently. A fix is being worked on to resolve the issue. As the fix deploys this week, the Engineering team will continue to monitor and ensure no further issues arise.
Engineering has identified an issue with predictions monitoring getting dropped intermittently. Predictions themselves are not impacted by this issue. Engineering is currently investigating the root cause.
Report: "Customer experiencing 503 error on US prod when launching Custom apps."
Last update: The incident has been resolved.
Users are unable to launch custom apps on US prod. The Engineering team is investigating the issue.
Report: "Users cannot start Notebook session since 05:00 UTC in US SaaS environment"
Last update: The Engineering team has been monitoring the fix for the last 3 days and the issue has been contained. No further updates are expected.
Engineering has applied a fix, and notebook services are operational again. Engineering is continuing to monitor.
Notebooks are not able to start on US Prod. Engineering is currently investigating.
Report: "DataRobot STS - Single Tenant SaaS (Managed by DataRobot)"
Last update: This incident has been resolved.
The Engineering team has identified the issue and a fix is under way.
The DataRobot Single Tenant SaaS platform is unable to ingest data through AI Catalog. Our engineering team is investigating the issue.
Report: "Users cannot start Notebook session since 11:30 CET in US SaaS environment"
Last update: The issue has been mitigated as of 12:45 CET. Notebooks are working as expected.
We are continuing to work on a fix for this issue.
Engineers are working on fixing the issue.
Report: "Delay in Deployment Alerts"
Last update: This incident has been resolved.
Engineering has applied a mitigating solution to stabilize the system. The issue is being monitored and a permanent fix is under way.
Engineering has noticed a delay in health notifications. The issue has been identified and a fix is under way. Scheduled health check notifications are not working, whereas real-time notifications are not impacted, for the DataRobot US Managed AI Cloud.
We are continuing to investigate this issue.
Some DataRobot users on US SaaS may see a delay in deployment alerts. Engineering is investigating.
Report: "Users are unable to sign-up for trial"
Last update: Users are now able to sign up for a trial in both the US and EU environments. Engineering has resolved the issue.
Users are unable to sign up for a trial in both the EU and US environments. Engineering is investigating.
Report: "Issue with Notebooks in the DataRobot EU cluster (app.eu.datarobot.com)"
Last update: This incident has now been resolved.
Notebook functionality has been restored on the EU cluster. Engineering is currently monitoring this issue.
We are continuing to investigate this issue.
The creation of new Notebooks on the EU cluster is currently down. Engineering is currently investigating.
Report: "Issue with creating new AI applications on DataRobot US cluster (app.datarobot.com)"
Last update: The issue is contained. Please contact support@datarobot.com if you have any questions.
The engineering team is working on fixing the issue. No impact is expected for the existing applications.
Report: "Network Latency/Timeout noticed in Kubeworkers"
Last update: AWS has reported that the operational issue in the US-EAST-1 (use1-az1) Region is fully resolved. DataRobot has monitored its services and confirmed that everything is up and operational. The issue is contained.
AWS is experiencing operational issues in the US-EAST-1 (use1-az1) Region. This could impact some of the DataRobot services in that region. Our team will continue to monitor for any impact on the DataRobot services.
Report: "Users unable to start notebooks in the EU Production Environment"
Last update: Engineering has applied a fix and the problem of starting notebooks in the EU Production environment has been contained.
There is an incident affecting starting of notebooks in the EU Production environment. Our Engineering team is currently investigating.
Report: "Deployment report generation failure in US and EU Prod"
Last update: This incident has been resolved.
The engineering team is still working on mitigating the broken deployment reports on production. The new ETA for the resolution of the issue is Thursday, 13:00 UTC, 31st of August.
Some customers may experience unexpected behavior when generating deployment reports. This issue is limited to report generation only and all other DataRobot services, including predictions and model monitoring, are functioning normally. Fix will be provided on Monday evening UTC.
We are continuing to work on a fix for this issue.
Some customers may experience unexpected behavior when generating deployment reports. This issue is limited to report generation only and all other DataRobot services, including predictions and model monitoring, are functioning normally. Engineering is currently working on a fix for the issue.
Report: "A delay in prediction monitoring was observed on the EU SaaS cluster"
Last update: From 10:50 UTC to 13:27 UTC on 13 July 2023, a delay in prediction monitoring was observed for EU SaaS customers. The issue has been identified and corrected. All systems are functioning normally at this time.
Report: "Partial outage is reported with DataRobot"
Last update: This incident has been resolved. All affected DataRobot Services are functional.
Our team has implemented a fix, and we are actively monitoring the status.
The DataRobot SaaS platform is experiencing import errors on Prediction requests. Our team has identified the issue and is working on a fix. Please expect the next update within 60 minutes, or reach out to support@datarobot.com if there are any questions.
Report: "Partial outage is reported with DataRobot"
Last update: This incident has been resolved. All affected DataRobot Services are functional.
Certain services within DataRobot are down, including but not limited to Notebooks and Modeling Jobs. Engineering is investigating.
Report: "Users might have experienced delay in data upload job processing on US Production."
Last update: An issue delayed the upload jobs from May 10th 2:34 AM UTC to May 10th 3:35 AM UTC. Users who tried uploading data might have experienced a delay. The incident has been contained.
Report: "DataRobot Notebooks Unavailable."
Last update: This incident has been resolved.
Engineering has identified the root cause and applied a mitigation. We are currently monitoring.
There is an incident affecting all usage of notebooks in our US Production environment. Our Engineering team is currently investigating.
Report: "Zepl Is Experiencing Partial Outage"
Last update: Engineering has applied additional fixes to the database, and this incident has been resolved. Zepl services are back to normal.
The fix that engineering applied to Zepl did not resolve the 404 errors. We have applied additional modifications to the Zepl backend and are monitoring the stability of the application.
We have applied a fix to the Zepl platform backend. We are currently monitoring.
Zepl is experiencing intermittent degraded application performance and 404 Page Not Found errors. Our Engineering team is currently investigating.