Historical record of incidents for Scale AI
Report: "Degraded performance"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Site Outage"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Nucleus Degraded Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Nucleus and Data Engine are currently experiencing degraded performance. Pages and queries will take longer than usual to respond.
Report: "Cloudflare Outage"
Last update: This incident has been resolved.
Confirmed that the issue has been resolved.
Cloudflare has implemented a fix, which appears to be resolving the certificate issues.
We are currently investigating this issue.
Report: "Donovan Web Application Outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an application issue. We will post updates shortly.
Report: "Site Outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue; some operations are experiencing high latency or failing.
Report: "Elevated Error Rate"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A temporary fix has been implemented and we are monitoring the status of the platform.
We are continuing to work on mitigating the issue. Our website is affected by a technical issue with our hosting provider. The current ETA for resolution is ~2 hours, or around 6:30 PT.
We are seeing elevated error rates with our cloud provider. We are currently investigating the issue.
We are currently investigating this issue.
Report: "Spellbook Application Error"
Last update: The Spellbook application is fully back up and operational.
Deployment of the fix is now in progress.
We have identified a platform issue impacting all Spellbook applications. Applications will not appear in the UI or be available over the API. We are deploying a fix for this issue.
Report: "Catalog Forge Platform Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "ML Pod Startup Failures"
Last update: From 7:06am PT to 7:50am PT, there was an outage of a service that manages the metadata for models. It prevented model services from starting up new pods. ML services that received security updates, deployed a new version, or attempted to scale up were impacted. After the impacted ML services were restored, there was an additional 8 to 12 minutes of service degradation while they caught up on missed requests.
Report: "Site Outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Elevated latencies"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Elevated API Error Rates"
Last update: This incident has been resolved.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Catalog Forge slow or unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Degraded Site Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Many users may still be experiencing slow application loading times and reduced responsiveness. We have identified that our database has been experiencing internal issues and are continuing to work to resolve them. We have taken steps to mitigate the impact and have tuned our database to optimize for the current conditions.
We are currently investigating this issue.
Report: "ML Inference Failures"
Last update: This was caused by a networking issue with the ML infrastructure.
Report: "Platform slow or unresponsive"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
We are seeing elevated error rates that may indicate a slow or unresponsive platform. We are currently investigating the issue.
Report: "Platform slow or unresponsive"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Degraded performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue with our asynchronous jobs.
Report: "Site Outage"
Last update: This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating the issue.
Report: "Site Outage"
Last update: This incident has been resolved.
We are starting to see signs of recovery and are monitoring for issues.
We are currently investigating an issue; jobs across all build types may be affected.
Report: "Error processing signed S3 attachment URLs"
Last update: A fix has been implemented and we are monitoring the results.
Report: "ML Infrastructure Latency Issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Site incident"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Platform slow or unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Platform Slow or Unresponsive"
Last update: Scale AI is now operational.
We are seeing elevated error rates that may indicate a slow or unresponsive platform. We are currently investigating the issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Increased site latency & error rate"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are actively working with our underlying service providers to resolve the issues impacting Scale's service.
We are seeing elevated error rates that may indicate a slow or unresponsive platform. We are currently investigating the issue and its potential impact.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work with our underlying service providers to resolve the issues impacting Scale's service.
We are actively working with our underlying service providers to resolve the issues impacting Scale's service.
We are continuing to investigate the issue.
We are seeing elevated error rates that may indicate a slow or unresponsive platform. We are currently investigating the issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are seeing elevated error rates that may indicate a slow or unresponsive platform. We are currently investigating the issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
The site is unstable again, and we are working to implement a fix.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Platform availability issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Platform availability issues"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Platform Outage"
Last update: We have identified an issue related to database index performance and have finished deploying a fix. The platform is up and running, and error rates are no longer elevated.
The platform is experiencing a degraded service level and we are continuing to investigate. API and website error rates are elevated.
The platform has recovered. We are actively monitoring and investigating the root cause of the issue.
We are currently experiencing a platform outage. All services are impacted, including the Scale Dashboard, API, and associated services. We are currently reviewing this issue.
Report: "Elevated error rates & increased latency"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Elevated error rates"
Last update: This incident has been resolved.
We are experiencing elevated error rates across the platform.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Nucleus Database Offline"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The upgrade appears stuck. We're restoring the database.
Report: "Intermittent site errors"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Site down"
Last update: This incident was resolved at 4:56 PM PST.
We are investigating the issue.
The site became inaccessible at around 3:55 PM PST.
We are currently investigating this issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue. Instability started at around 7:15 PM PST.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
We have identified the cause, and the error rate returned to normal at around 9:40 PM. We will continue to monitor the issue.
We are seeing significantly higher latencies starting at around 9 PM and are investigating.
Report: "Platform Slow or Unresponsive"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.