CircleCI

Is CircleCI Down Right Now? Check whether CircleCI is currently experiencing an outage.

CircleCI is currently Operational

Last checked from CircleCI's official status page

Historical record of incidents for CircleCI

Report: "Maintenance window for Runner"

Last update
Scheduled

Maintenance window for Runner is scheduled for June 3rd, 2025, at 19:00 PST / 22:00 EST. The maintenance window will last until 19:10 PST / 22:10 EST. During this period:

- Resource management will not be available
- The Runner web UI for inventory and installation will not be available
- Up to a 5-minute delayed start time for Runner jobs

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Report: "Delays to start macOS jobs on m2pro.medium and m2pro.large"

Last update
resolved

Thanks for the patience everyone. Everything back to normal.

monitoring

Job start times have returned to normal. We'll continue to monitor.

identified

We are experiencing delays starting macOS jobs on m2pro.medium and m2pro.large. Thank you for your patience.

Report: "Delays to start macOS jobs on m2pro.medium and m2pro.large"

Last update
Identified

We are experiencing delays starting macOS jobs on m2pro.medium and m2pro.large. Thanks you for your patience.

Report: "Delays in starting Mac Jobs"

Last update
resolved

This is now resolved. Wait times have recovered.

identified

The fix is still rolling out across our fleet, all looking good so far.

identified

We are experiencing delays in starting Mac Jobs. We have identified the issue and are in the process of rolling out a fix. Thank you for your patience.

Report: "Delays in starting Mac Jobs"

Last update
Identified

We are experiencing delays in starting Mac Jobs. We have identified the issue and are in the process of rolling out a fix. Thank you for your patience.

Report: "Dropped webhooks for GitHub pipelines"

Last update
resolved

GitHub have updated their API status to operational, and we are no longer seeing related customer impact. Customers will need to push new commits for any lost pipelines.

monitoring

Some GitHub webhooks are being dropped due to an incident with GitHub. Customers may also experience a delay in scheduled workflows.

Report: "Dropped webhooks for GitHub pipelines"

Last update
Monitoring

Some GitHub webhooks are being dropped due to an incident with GitHub.Customers may also experience a delay in scheduled workflows.

Report: "Delays in starting some jobs"

Last update
postmortem

## Summary

On May 1, 2025, from 22:20 UTC to May 2, 2025, 02:00 UTC, CircleCI customers experienced delays in starting most jobs. Affected jobs were contained to the following resource classes: Docker large, Docker medium, Docker small, and Linux large. During this time customers may have also experienced delays in obtaining status checks.

## What Happened (all times UTC)

At approximately 22:05 on May 1, 2025, we initiated a database upgrade to the service that dispatches jobs. We used a blue/green deployment to stand up a second database running the upgraded version and used logical replication to keep the data across the two databases in sync. We had been running the blue (old version) and the green (new version) without issues for a couple of days, and replication was confirmed to be in sync when we triggered the cutover from blue to green.

Upon completion of the cutover process, we noticed application errors for jobs, which meant the application pods had failed to automatically pick up the new DNS route. A rolling restart of the pods was performed, and all pods were back online with no further application errors as of 22:17.

At 22:40, teams were alerted that Docker jobs were backing up. They initially investigated whether the pod restarts had caused fewer processing nodes to be online, and began to manually scale up the nodes. At 23:47, it was confirmed that only a small quantity of jobs was making it through to the processing pods, causing the backlog and ruling out an infrastructure issue. It was determined that jobs in the following resource classes were not executing: Docker large, Docker medium, Docker small, and Linux large.

At 00:40 on May 2, 2025, orphaned task records for the above-mentioned resource classes were identified. An orphaned task record is an item of work with no associated job; when these records were picked up by the service, they caused a failure that prevented the next record from being picked up. The team updated the task status to "completed" and immediately saw more jobs processing, and the backlog of jobs dropped. By 00:45, the backlog of jobs had completely cleared and the issue was thought to be remediated.

At 00:56, an alert triggered, warning of a backlog of jobs once again. Upon investigation, it was determined that only some Docker resource classes were affected (large, medium, and small); all other resource classes, including Linux jobs, were operating as expected. An investigation determined that additional orphaned task records had been written to the database after 00:40. Logical replication was manually disabled and the orphaned task records were updated at 01:55. At 02:10 the backlog of jobs had once again cleared. The team continued to monitor over the following hour with no additional occurrences of orphaned tasks and declared the incident closed at 03:39.

Post-incident, the team continued to investigate. The root cause was determined to be a race condition between the application and logical replication when the application pods were restarted. A task event was rerun and wrote to the green (new) database before the original task event status was replicated from the blue (old) database. This created a unique constraint error that broke replication. Because logical replication does not respect foreign key constraints, task records older than those already in the green database were replicated into it, creating the orphaned task records seen during the incident. The issue resurfaced immediately after draining the job queue as the failed replication task tried to restart.

## Future Prevention and Process Improvement

The incident has exposed the need to implement further controls on database writes during the upgrade process while using logical replication. Even if replication is in sync, the milliseconds of network delay incurred in transferring the data can be enough to trigger this scenario.

1. We will update the upgrade procedure to limit writes to the database for a short period of time while logical replication writes the final updates from the old database version to the new version.
2. A second data replication verification test will be added to the procedure before turning writes on for the new version.
3. Once replication is confirmed to be in sync, replication will be disabled to avoid any possibility of conflicts.
4. We will be implementing a more in-depth review process between the database and service owner teams to review the upgrade process and risks prior to performing the change.

We sincerely apologize for the disruption this incident caused to your ability to build on our platform. We understand the critical role CircleCI plays in your development workflow and take any service disruption seriously. We're committed to learning from this experience and have already implemented several measures to prevent similar occurrences in the future. Thank you for your patience and continued trust in CircleCI.
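
As an illustration of the remediation step described above (marking orphaned task records as completed so the dispatcher can move past them), here is a minimal sketch in Python. It assumes a hypothetical `tasks` table with a `status` column and a nullable reference to a `jobs` table; CircleCI's actual schema and internal tooling are not public, so every name here is an assumption.

```python
# Hypothetical sketch: find task records with no associated job ("orphaned")
# and mark them completed so the dispatcher can move past them.
# Table, column, and resource class names are assumptions, not CircleCI's schema.
import psycopg2

RESOURCE_CLASSES = ["docker/large", "docker/medium", "docker/small", "linux/large"]

def complete_orphaned_tasks(dsn: str) -> int:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE tasks t
               SET status = 'completed'
             WHERE t.status = 'pending'
               AND t.resource_class = ANY(%s)
               AND NOT EXISTS (SELECT 1 FROM jobs j WHERE j.id = t.job_id)
            """,
            (RESOURCE_CLASSES,),
        )
        return cur.rowcount  # number of orphaned tasks unblocked

if __name__ == "__main__":
    print(complete_orphaned_tasks("postgresql://localhost/dispatch"))
```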

resolved

All jobs are now running normally. Thank you for your patience whilst we resolved the issue.

monitoring

We are continuing to monitor for any further issues.

monitoring

Jobs for the following resource classes will have suffered significant delays in running; these will be processed over the next X minutes:

* Docker Large, Medium and Small
* Linux Large

Those jobs will start within the next 15 minutes, so you should not need to retry them. We thank you for your patience whilst we resolve this issue.

monitoring

We're continuing to monitor the delays with starting Docker jobs. Thank you for your patience.

monitoring

Docker jobs have not recovered as expected, and customers may continue to see delays for Docker jobs starting. We are working to increase capacity and thank you for your patience.

monitoring

This incident impacted final result delivery between 22:06 and 22:17 UTC. Customers may experience delays starting Docker Large jobs as the system recovers. We will continue to monitor recovery and thank you for your patience.

monitoring

This also impacts status checks which may not have been sent to GitHub.

Report: "Delays in insights dashboard data"

Last update
resolved

We've verified our fix and insights data is refreshing as expected.

monitoring

We are monitoring our change to catch up on delayed insights data.

identified

We have identified an issue with delays in insights data. The cause has been identified and we are working on a solution.

Report: "Delays in insights dashboard data"

Last update
Identified

We have identified an issue with delays in insights data.The cause has been identified and we are working on a solution.

Report: "We are currently experiencing an outage affecting v1 and v2 API documentation pages"

Last update
resolved

This issue has been resolved. Thank you for your patience.

investigating

We are currently experiencing an outage affecting the following API documentation pages: V1 API Documentation: https://circleci.com/docs/api/v1/index.html V2 API Documentation: https://circleci.com/docs/api/v2

Report: "We are currently experiencing an outage affecting v1 and v2 API documentation pages"

Last update
Investigating

We are currently experiencing an outage affecting the following API documentation pages:V1 API Documentation: https://circleci.com/docs/api/v1/index.htmlV2 API Documentation: https://circleci.com/docs/api/v2

Report: "Test results are delayed for test insights"

Last update
resolved

This incident has been resolved.

monitoring

We will continue monitoring overnight. Thank you for your patience.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

Some users may notice a delay in their test insights. We are working on a fix. Thank you for your patience.

Report: "Test results are delayed for test insights"

Last update
Investigating

Some users may notice a delay in their test insights. We are working on a fix. Thank you for your patience.

Report: "Delays in May 1, 2025 Data in Usage API"

Last update
resolved

The issue causing delays in Usage API data has now been resolved. We thank you for your patience while our engineers worked to resolve it.

identified

Some customers will see a delay in Usage API Data for May 1, 2025. We've identified the problem and are working to resolve it. Thank you for your patience.

Report: "CircleCI UI Loading & build triggering issues"

Last update
postmortem

## Summary

On April 4, 2025, from 00:16 to 01:49 UTC (approximately 1 hour and 33 minutes), CircleCI experienced a service disruption affecting both our user interface and build capabilities. During this time, customers were unable to access the CircleCI UI or initiate new builds. The incident was caused by an inadvertently applied Web Application Firewall (WAF) rule that blocked legitimate traffic to CircleCI services. It was resolved when our engineering team identified and removed this rule. [The original status page can be found here.](https://status.circleci.com/incidents/zh1qd6lrntl7)

## What Happened (all times UTC)

The WAF is a critical security component that sits in front of our services and protects them from malicious traffic while allowing legitimate requests to pass through.

* **00:16**: A WAF rule was inadvertently introduced that began blocking legitimate traffic to CircleCI services.
* **00:26 - 00:52**: Our monitoring systems detected degraded performance across multiple services. This occurred just as our teams were concluding another [unrelated incident](https://status.circleci.com/incidents/31n0h4tcl02g), which initially caused some confusion about whether the issues might be connected. Customers began reporting an inability to access the CircleCI UI or initiate new builds, and our teams pivoted to investigate these new symptoms. The team noted a drop in GitHub webhooks and widespread connectivity issues between our frontend and backend services, spending time to ensure these weren't aftereffects of the previous incident.
* **00:52**: We established that we were looking at a completely separate incident, launched our incident process with a new incident, and assembled a dedicated response team to investigate the service disruption.
* **01:15**: Initial investigation revealed broad connectivity issues between the frontend and our backing APIs, including CORS (Cross-Origin Resource Sharing) errors. The team explored multiple potential causes, including recent deployments and infrastructure changes, but the cause remained unclear.
* **01:35**: Our automated Terraform drift detection identified a difference between our defined and current WAF settings. This discovery revealed that a WAF rule had been changed outside of our standard Terraform deployment process and was blocking legitimate traffic to the [api.circleci.com](http://api.circleci.com) and [circleci.com](http://circleci.com) CloudFront distributions.
* **01:41**: The problematic WAF rule was reverted from both affected CloudFront distributions.
* **01:49**: Our monitoring confirmed that error rates decreased across all affected services as traffic was properly routed again.
* **01:55**: Full service restoration was confirmed across the board.
* **02:59**: The incident was officially closed after a period of monitoring confirmed stable operation.

## Root Cause Analysis

While we manage all our infrastructure, including the WAF, almost entirely with Terraform, we discovered during this incident a misconfiguration in IAM controls that allowed a specific role to make changes without using our infrastructure-as-code tooling. As a result, while investigating routine security monitoring, an operator manually modified the WAF configuration, believing they were taking read-only actions. The resulting change blocked legitimate traffic to our services.

Based on the same assumptions, those investigating the incident did not prioritize the WAF configuration, expecting that any changes would have gone through our Terraform pipeline, and there was no record of such changes. The diverse symptoms produced across our platform, combined with the occurrence shortly after a separate, [unrelated incident](https://status.circleci.com/incidents/31n0h4tcl02g), led to time spent on paths of inquiry that ultimately proved fruitless.

Eventually, our automated drift detection process ran and identified the issue. While this safeguard was invaluable, nearly 80 minutes passed between the initial change and the detection. Despite the confusion, drift detection identified the exact configuration change that caused the issue and led directly to the resolution of the incident.

## Future Prevention and Process Improvement

This incident highlighted the strength of our existing systems while identifying several areas where we can improve and make them even more robust:

1. We have implemented stricter IAM policies that prevent direct modification of infrastructure managed by our infrastructure-as-code pipeline.
2. Terraform's drift detection was instrumental in identifying the root cause of this incident. We are enhancing these capabilities to provide faster alerts when critical infrastructure components deviate from their expected state. We are also adding technical guardrails to ensure all configuration management follows this approach, which helps prevent human error and provides better visibility into changes.
3. We're establishing better protocols for implementing and testing WAF rules before they reach production environments. Additionally, we are adding monitoring specifically for WAF behavior and traffic patterns to detect potential issues more quickly.
4. We're investigating additional technical controls through Service Control Policies (SCPs) that provide organization-wide restrictions on IAM roles, reducing the risk of accidental misconfigurations. These policies create hard boundaries on what actions can be performed on critical systems like our WAFs, adding an extra layer of protection against unintended changes.

We sincerely apologize for the disruption this incident caused to your ability to build on our platform. We understand the critical role CircleCI plays in your development workflow and take any service disruption seriously. We're committed to learning from this experience and have already implemented several measures to prevent similar occurrences in the future. Thank you for your patience and continued trust in CircleCI.
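
For a concrete picture of what WAF drift detection can look like, here is a minimal sketch that compares the rules deployed on a CloudFront-scoped AWS WAFv2 web ACL against a locally stored baseline using boto3. The web ACL name, ID, and baseline file are placeholders, and this is far cruder than Terraform's own drift detection; it is illustrative only, not CircleCI's implementation.

```python
# Hypothetical drift check: compare the rules deployed on a CloudFront-scoped
# WAFv2 web ACL against a locally stored baseline (e.g. exported from IaC).
# The web ACL name/id and baseline path are placeholders.
import json
import boto3

def waf_rules_drifted(name: str, web_acl_id: str, baseline_path: str) -> bool:
    # CLOUDFRONT-scoped web ACLs must be queried via us-east-1
    client = boto3.client("wafv2", region_name="us-east-1")
    deployed = client.get_web_acl(Name=name, Scope="CLOUDFRONT", Id=web_acl_id)
    deployed_rules = deployed["WebACL"]["Rules"]

    with open(baseline_path) as f:
        expected_rules = json.load(f)

    # A naive whole-object comparison; `terraform plan` is far more precise,
    # but any mismatch here is worth an alert.
    def canon(rules):
        return json.dumps(rules, sort_keys=True, default=str)

    return canon(deployed_rules) != canon(expected_rules)

if __name__ == "__main__":
    if waf_rules_drifted("frontend-acl", "00000000-0000-0000-0000-000000000000",
                         "waf_baseline.json"):
        print("WAF configuration drift detected - page the on-call")
```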

resolved

The incident has now been resolved. Thank you for your understanding and patience while our engineers investigated and mitigated the issue.

monitoring

A fix has been implemented, and we are currently monitoring the system to ensure everything is functioning as expected. Thank you for your patience.

investigating

We are investigating intermittent issues triggering pipelines or sending status updates.

investigating

We are investigating intermittent issues with loading the CircleCI UI.

Report: "Delays in May 1, 2025 Data in Usage API"

Last update
Identified

Some customers will see a delay in Usage API Data for May 1, 2025. We've identified the problem and are working to resolve it. Thank you for your patience.

Report: "Delay in starting some jobs"

Last update
resolved

This incident has been resolved.

monitoring

A fix was put in place. We are monitoring the situation.

identified

Queues should be clearing, and jobs starting normally.

investigating

We are investigating a delay in starting some jobs.

Report: "Delay in starting some jobs"

Last update
Investigating

We are investigating a delay in starting some jobs.

Report: "Final results of some jobs may not be reported in the UI"

Last update
Monitoring

This also impacts status checks which may not have been sent to GitHub.

Report: "intermittent checkout job failures"

Last update
resolved

This incident has been resolved.

monitoring

We've rolled out the fix and are monitoring.

investigating

Some customers are experiencing checkout step failures.

Report: "intermittent checkout job failures"

Last update
Investigating

Some customers are experiencing checkout step failures.

Report: "Delays in starting workflows"

Last update
postmortem

## Summary

On April 3, 2025, from 22:08 UTC to 23:45 UTC, CircleCI customers experienced increased latency and some failures with starting and canceling workflows and jobs. During this time customers may have experienced delays and difficulty viewing workflows in the UI. We appreciate your patience and understanding as we worked to resolve this incident.

## What Happened (all times UTC)

At approximately 22:00 on April 3, we initiated an upgrade to the service responsible for workflows. We expected a short delay (< 90 seconds) during the database upgrade, where calls to the database from the workflows service would be sent to a queue and retried over a 10 minute period. We expected to see the queues grow slightly during and immediately after the upgrade.

At 22:08, when the blue/green deployment was complete, we verified queries were being served. At 22:17, we identified increased latency in the workflows service, as well as some errors from jobs being dropped due to exhausting their 10 minutes of retries. At 22:29 additional engineers were engaged, and at 22:30 the team restarted the workflows pods to ensure they were all connecting to the correct database. At 22:35 a public incident was declared.

At 22:41, it was observed that all queries on the new database were hitting disk, which indicated that the database statistics tables had not been updated. The team immediately upsized the database and disabled any non-business-critical operations on the database. At 23:00, the workflows service was scaled down to a single pod to give the database capacity to recover while the statistics table was rebuilt. At 23:10, the team observed the workflows queue backing up due to the reduction in pods, as expected, but did not see an improvement in database performance.

At 23:19, the team decided to re-enable writes on the old database and reinstate its primary status to restore service to customers sooner. This work completed at 23:29. The team continued to monitor the workflows queue. At 23:45 it was determined that the workflow queue was back to normal operating levels, and no further errors were observed.

Post-incident, the team continued to investigate. The root cause was determined to be that the analyze operation to rebuild the database's statistics table, which is used for indexes, had been executed too early in the operation and was made stale by a second major version upgrade within the same deployment.

## Future Prevention and Process Improvement

The blue/green database deployment procedures have been updated to run an analysis procedure after every major version change. The team has also tested running the analyze command while a database is under pressure to confirm it has no further degrading effects on database performance. This will be noted for future remediation. Before any additional migrations are run, the team will add additional automated tests and manual checkpoints throughout the process to identify and resolve issues before the blue/green cutover.
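
To make the "run an analysis procedure after every major version change" step concrete, here is a minimal sketch, assuming direct psycopg2 access to a PostgreSQL instance, that re-runs ANALYZE on tables whose planner statistics look stale. The DSN, the staleness window, and the overall approach are illustrative assumptions, not CircleCI's actual upgrade tooling.

```python
# Hypothetical post-cutover check: after a major-version upgrade, find tables
# whose planner statistics were never (or long ago) refreshed and re-run
# ANALYZE on them. DSN and staleness window are placeholders.
import psycopg2

def analyze_stale_tables(dsn: str, max_age: str = "1 hour") -> list[str]:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # apply each ANALYZE immediately
    reanalyzed = []
    try:
        with conn.cursor() as cur:
            # last_analyze / last_autoanalyze come from pg_stat_user_tables
            cur.execute(
                """
                SELECT schemaname, relname
                  FROM pg_stat_user_tables
                 WHERE GREATEST(COALESCE(last_analyze, 'epoch'),
                                COALESCE(last_autoanalyze, 'epoch'))
                       < now() - %s::interval
                """,
                (max_age,),
            )
            for schema, table in cur.fetchall():
                # identifiers come from the catalog, so quoting them is safe here
                cur.execute(f'ANALYZE "{schema}"."{table}"')
                reanalyzed.append(f"{schema}.{table}")
    finally:
        conn.close()
    return reanalyzed
```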

resolved

The issue impacting workflows and pipelines has now been resolved.

monitoring

Our engineers have implemented a fix for the issue impacting workflows and pipelines and are back within normal operation range. We will continue to monitor the situation. We thank you for your patience while we worked to resolve this issue.

identified

We are continuing to work on the issue impacting workflows and pipelines and are starting to see our systems recover. Thank you for your patience while our engineers are working to resolve this.

identified

We have identified the issue causing workflows and pipelines to be delayed or not start at all. Our engineers are working on a fix. We appreciate your patience and understanding as we actively work to resolve this disruption. We will keep you updated on our progress.

investigating

We are continuing to investigate this issue.

investigating

We are investigating delays in starting workflows.

Report: "Our engineers investigated an issue impacting log-in"

Last update
resolved

This incident has been resolved.

monitoring

Auth0 has applied a fix and we are seeing reduced log-in error rates as well. We will continue to monitor our systems. Thank you for your patience.

identified

We've identified an issue affecting users attempting to log in with username and password credentials. Our engineering team has determined this is related to an ongoing incident with Auth0, our authentication provider. Users can track the Auth0 incident status at https://status.auth0.com/incidents/zgyzzt12c40v. We're actively monitoring the situation and will provide updates as the issue is resolved. We apologize for any inconvenience this may cause.

Report: "Our engineers investigated an issue impacting log-in"

Last update
Identified

We've identified an issue affecting users attempting to log in with username and password credentials. Our engineering team has determined this is related to an ongoing incident with Auth0, our authentication provider. Users can track the Auth0 incident status at https://status.auth0.com/incidents/zgyzzt12c40v. We're actively monitoring the situation and will provide updates as the issue is resolved. We apologize for any inconvenience this may cause.

Report: "Delay in starting jobs"

Last update
resolved

Between 09:54 UTC and 10:01 UTC, all jobs experienced a slight delay in starting. All jobs will have run, so it is not necessary to rerun any of them. We apologize for the delay.

Report: "Delay in starting jobs"

Last update
Resolved

Between 09:54UTC and 10:01UTC all jobs experienced a slight delay in starting. All jobs will have run, so it is not necessary rerun any of them. We apologize for the delay.

Report: "CircleCI UI Loading & build triggering issues"

Last update
Update

We are investigating intermittent issues triggering pipelines or sending status updates.

Investigating

We are investigating intermittent issues with loading the CircleCI UI.

Report: "CircleCI UI Loading Issues"

Last update
Investigating

We are investigating intermittent issues with loading the CircleCI UI.

Report: "Delays in starting workflows"

Last update
Update

We are continuing to investigate this issue.

Investigating

We are investigating a delays in starting workflows.

Report: "Orb fetch causing pipeline failure"

Last update
resolved

This incident is now resolved. Orbs are functioning as normal, thank you for your patience.

monitoring

We are monitoring a solution to this issue.

identified

The error message displayed indicates that configs were deemed invalid. A fix has been identified and is being implemented.

identified

The issue has been identified and a fix is being implemented.

Report: "Orb fetch causing pipeline failure"

Last update
Update

Error message displayed is configs deemed to be invalid. A fix is identified and being implemented.

Identified

The issue has been identified and a fix is being implemented.

Report: "Customers may see delays with status updates on Github"

Last update
resolved

The issue impacting status updates for GitHub App basic status has now been resolved. Please note that a small percentage of pipelines triggered directly from the CircleCI API did not post status to GitHub and must be re-run. If you have any further issues, please reach out to support for assistance. We thank you for your patience and understanding as our engineers worked towards mitigation.

monitoring

We are continuing to monitor for any further issues.

monitoring

A small percentage of pipelines triggered directly from the CircleCI API did not successfully post status to GitHub. To post status, the pipeline must be re-run.

investigating

Our engineers are investigating an issue where some customers may see delays with status updates. The impact is limited to GitHub App basic status and does not affect OAuth basic status or GitHub checks. We appreciate your patience and understanding as we actively work to resolve this delay. We will keep you updated on our progress.

Report: "Customers may see delays with status updates on Github"

Last update
Investigating

Our engineers are investigating an issue where some customers may see delays with status updates. The impact is limited to Github App basic status and does not affect oAuth basic or Github checks.We appreciate your patience and understanding as we actively work to resolve this delay. We will keep you updated on our progress.

Report: "Increased Job Start Latencies"

Last update
Resolved

Between 7:15 UTC and 7:50 UTC, some customers may have been impacted by increased job start latency. Our engineering team promptly identified and resolved this issue and job start times have now returned to normal levels. We thank you for your patience and understanding.

Report: "Increased Job Start Latencies"

Last update
resolved

Between 7:15 UTC and 7:50 UTC, some customers may have been impacted by increased job start latency. Our engineering team promptly identified and resolved this issue and job start times have now returned to normal levels. We thank you for your patience and understanding.

Report: "Docker Executor Infrastructure Upgrade"

Last update
In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

For additional details, please refer to this announcement: https://discuss.circleci.com/t/docker-executor-infrastructure-upgrade/52282

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We are updating the infrastructure supporting the arm resource class and the ip_ranges feature on the Docker Executors through April 1st. Customers may experience build failures during this time. View the announcement post for more information: link.

Report: "Increase wait time for M2pro medium resource class"

Last update
resolved

The issue impacting wait times for the M2 Pro medium resource class has now been resolved. We thank you for your patience while we worked to resolve the delays.

monitoring

A fix has been implemented for the issue impacting our M2 pro medium resource class. The wait times are within normal range and we will continue to monitor the situation as the fix gets rolled out. We thank you for your patience while our engineers worked to resolve this delay.

identified

Our engineers have identified the issue causing builds using the m2pro.medium resource class to be delayed by up to 6 minutes and have implemented a fix. We are starting to see recovery in wait times. We thank you for your patience as we continue to work on this issue.

investigating

Our engineers are investigating an issue where customers using M2 Pro medium resource class may experience a higher wait time (delays of up to 6 minutes). We appreciate your patience and understanding as we actively work to resolve this delay. We will keep you updated on our progress.

Report: "Increase wait time for M2pro medium resource class"

Last update
Resolved

The issue impacting the wait time on M2 pro medium resource class has now been resolved. We thank you for your patience while we worked through to resolve the delays caused.

Monitoring

A fix has been implemented for the issue impacting our M2 pro medium resource class. The wait times are within normal range and we will continue to monitor the situation as the fix gets rolled out. We thank you for your patience while our engineers worked to resolve this delay.

Identified

Our engineers have identified the issue and implemented a fix where builds using m2pro.medium resource class were delayed for up to 6 minutes. We are starting to see recovery in the wait time. We thank you for your patience as we continue to work on this issue.

Investigating

Our engineers are investigating an issue where customers using M2 Pro medium resource class may experience a higher wait time (delays of up to 6 minutes). We appreciate your patience and understanding as we actively work to resolve this delay. We will keep you updated on our progress.

Report: "Pipelines page intermittently not loading pipelines."

Last update
resolved

This incident is now resolved.

monitoring

The pipelines page is recovering and we are observing normal behaviour. We are continuing to monitor.

identified

We have identified the cause and are continuing to work on a fix.

identified

We have identified an issue causing a timeout when loading the Pipelines page on projects. We are currently working on a potential fix. Pipelines are running as normal.

Report: "Pipelines page intermittently not loading pipelines."

Last update
Resolved

This incident is now resolved.

Monitoring

The pipelines page is recovering and we are observing normal behaviour. We are continuing to monitor.

Update

We have identified the cause and are continuing to work on a fix.

Identified

We have identified an issue causing a timeout when loading the Pipelines page on projects. We are currently working on a potential fix. Pipelines are running as normal.

Report: "Increased wait times for M2 Pro Large"

Last update
resolved

There were increased wait times for M2 Pro Large.

Report: "Increased wait times for M2 Pro Large"

Last update
Resolved

There was increased wait times for M2 Pro Large.

Report: "Outbound webhooks delayed"

Last update
resolved

The issue causing delays in outbound webhooks set up for the job-completed event has now been resolved. We thank you for your patience and understanding while our engineers worked to fix this issue.

monitoring

The issue causing delays in outbound webhooks has now been mitigated, and latencies for outbound webhooks are back to normal. We will continue to monitor the situation. Thank you for your understanding while we worked to investigate the issue.

investigating

Our engineers are currently investigating an issue causing delays in outbound webhooks. The impact is limited to customers that have outbound webhooks set up for the job-completed event. We will provide updates as soon as more information is available. Thank you for your understanding.

Report: "Outbound webhooks delayed"

Last update
Resolved

The issue causing delays in outbound webhooks setup for job completed event has now been resolved. We thank you for your patience and understanding while our engineers worked to fix this issue.

Monitoring

The issue causing delays in outbound webhooks has now been mitigated, the latencies in outbound webhooks are back to normal. We will continue to monitor the situation. Thank you for your understanding while we worked to investigate the issue.

Investigating

Our engineers are currently investigating an issue causing delays in outbound webhooks, the impact is limited to customers that have outbound webhooks setup for job completed. We will provide updates as soon as more information is available. Thank you for your understanding.

Report: "Jobs using contexts are not running"

Last update
resolved

We are continuing to observe normal behaviour. This incident is now resolved. If you see any affected workflows/jobs, they will need to be re-run on CircleCI or a new commit pushed.

monitoring

We have identified the cause and implemented a fix to the affected service. We are seeing recovery and are currently monitoring.

investigating

Jobs that use contexts are not running. We are currently investigating the cause.

Report: "Incread Queue Times for macos.m1.large.gen1"

Last update
resolved

This incident has been resolved.

monitoring

Queue times have stabilized.

identified

Capacity on the M1 resource class is limited. Customers can experience less queuing by moving to m2pro.medium or m2pro.large.

Report: "Issues loading the jobs page"

Last update
resolved

This incident has been resolved.

monitoring

The jobs page has automatically recovered. We have identified the change that caused this issue and have reverted it so that the issue cannot reoccur. Thank you for your patience.

investigating

Our engineering team is currently investigating an issue affecting some customers with loading the jobs page. Please note that jobs continue to flow through the system without interruption; the impact is limited to the user interface only. We will provide updates as soon as more information is available. Thank you for your understanding!

Report: "Delays in starting M1 Large Mac Jobs"

Last update
resolved

The issue with higher queue times for the M1 Large resource class has now been resolved. If you encounter any further delays, please consider switching to the M2 resource class for improved performance. Thank you for your patience!

monitoring

We experienced higher queue times for customers requesting M1 Large resource class between 17:05 and 17:20 UTC. Queue times have now returned to normal. If you continue to experience delays with the M1 resource class, we recommend switching to the M2 resource class for optimal performance. We will continue to monitor the situation with M1 Large resource class capacity. Thank you for your understanding!

Report: "Login Page issues with Bitbucket user not able to login"

Last update
resolved

The issue has been resolved. Thank you for your patience. Please refresh the login page and try logging in with Bitbucket again.

monitoring

We have implemented a fix and are currently monitoring the results. Please refresh the login page and try logging in using Bitbucket again.

identified

We have identified the issue affecting our system and are actively working on a resolution. Thank you for your patience while we resolve this.

investigating

We are currently investigating an issue where Bitbucket users are unable to log in through our login page.

Report: "Delays in starting M1 Large Mac Jobs"

Last update
resolved

There was a delay in starting M1 Large Mac Jobs. In some cases the delay could reach up to 10 minutes. No work was lost.

Report: "Delays in starting Mac jobs using m2pro instance"

Last update
resolved

This incident has been resolved.

monitoring

Extra capacity has been added and we are seeing wait times decrease to normal levels. Thank you for your patience, we will continue to monitor recovery.

identified

There are presently delays starting m2pro machines. We are working to resolve this issue, but it will take time due to high demand. Thank you for your patience whilst we resolve this.

Report: "Some pipelines failed to be created"

Last update
resolved

This incident is resolved. Our engineers detected failures in pipeline creation at 20:17 UTC, and the system automatically recovered by 20:18 UTC.

monitoring

The system seems to have recovered and we are monitoring. Customers should re-trigger impacted pipelines, either in the UI or by re-pushing the work to the code repository.

Report: "Some commit status updates were not updated"

Last update
postmortem

## Summary

On January 23, 2025, from 19:48 UTC to 20:43 UTC, customers using CircleCI GitHub OAuth and Bitbucket projects stopped receiving commit status updates. This was due to a code change deployed at 19:48 UTC that negatively impacted the service responsible for sending commit statuses to the Version Control System (VCS) provider.

## What Happened

On January 23, 2025, at 19:48 UTC, we deployed a change in how we send events from our service that orchestrates workflows. This change inadvertently modified the value of a key field used by a downstream service responsible for setting commit statuses. At 20:03 UTC, the team responsible for the downstream service was alerted to an increase in errors when setting commit statuses. This alert auto-resolved without intervention, delaying our response time. At 20:12 UTC, our support team notified us that customers were experiencing issues with commit status updates. This prompted an investigation. By 20:40 UTC, we had identified and reverted the faulty code change, with customer impact ceasing at 20:43 UTC.

## Future Prevention and Process Improvement

We will add more comprehensive testing to cover the events sent by our orchestration service. Additionally, we will implement synthetic tests to catch failures in setting proper commit status updates. We are also investigating why the alert auto-resolved to ensure similar issues are actioned sooner.

While investigation and remediation started promptly after being notified of the issue, there was a delay in initializing our incident protocol, which delayed the creation of a status page update and limited the information available to provide clear timing on the published update. We are revisiting our incident declaration procedures and tool configuration to provide further clarity around incident declaration and improve response time.
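
As a rough illustration of the kind of synthetic test mentioned above, the sketch below sets a commit status on a hypothetical canary repository via the GitHub REST API and then reads it back, alerting if the round trip fails. The repository, commit SHA, token, and status context are all placeholders; CircleCI's real checks are not public and are certainly more involved.

```python
# Hypothetical synthetic check: push a commit status to a canary repository via
# the GitHub REST API, then read it back and alert if it is missing.
# Repo, SHA, and token are placeholders.
import os
import requests

GITHUB_API = "https://api.github.com"
REPO = "example-org/status-canary"       # placeholder canary repository
SHA = os.environ["CANARY_COMMIT_SHA"]    # placeholder commit to decorate
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def run_synthetic_check() -> bool:
    # 1. Set a commit status (what the status-setting service normally does).
    create = requests.post(
        f"{GITHUB_API}/repos/{REPO}/statuses/{SHA}",
        headers=HEADERS,
        json={"state": "success", "context": "ci/synthetic-check"},
        timeout=10,
    )
    create.raise_for_status()

    # 2. Read back the combined status and confirm our context is present.
    combined = requests.get(
        f"{GITHUB_API}/repos/{REPO}/commits/{SHA}/status",
        headers=HEADERS,
        timeout=10,
    )
    combined.raise_for_status()
    contexts = {s["context"] for s in combined.json().get("statuses", [])}
    return "ci/synthetic-check" in contexts

if __name__ == "__main__":
    if not run_synthetic_check():
        print("commit status round-trip failed - raise an alert")
```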

resolved

At 19:48 UTC, some customers' projects may have stopped receiving commit status updates. The incident was resolved at 20:43 UTC. To ensure that the checks are reported correctly, we recommend rerunning the impacted workflows from the start.

Report: "Customers may be seeing delays with their workflows starting and may notice issues viewing their workflows through our UI"

Last update
postmortem

## Summary

From January 21, 2025 at 23:50 UTC to January 22, 2025 at 00:56 UTC, CircleCI customers experienced increased latency with starting and canceling workflows and jobs, and experienced delays and difficulty viewing workflows in the UI. We appreciate your patience and understanding as we worked to resolve this incident.

## What Happened (all times UTC)

At approximately 23:00 on January 21, an automated alert indicated that a database instance responsible for holding archived data was almost out of free storage space. At 23:09, the team halted a blue/green deployment on the database to free a logical replication slot, thinking that may have been the cause, but that did not help the database recover.

The archival service is called synchronously by the service responsible for orchestrating workflows. When the archival service's database reached capacity, these requests started timing out, which impacted the overall performance of the workflows service. At 23:26, the workflows queue began to grow, leading to increased latency starting workflows and jobs, canceling jobs, and viewing workflows in the UI. This was not immediately attributed to the archival database issues, in part because there was a separate alert at approximately the same time related to request volume, but when the queue continued to grow after that issue resolved, a separate team began to investigate workflows further and scaled up the event consumer responsible for processing the queue at 23:44.

The team investigating the unhealthy database instance promoted a read replica to a standalone primary at 23:55. By 00:03, the workflows queue depth returned to normal, which resolved workflow latency and UI impacts. However, at around the same time, Linux machine jobs began to queue downstream due to errors trying to provision instances with our cloud provider, which was actively investigating increased API error rates to the provisioning endpoint in our region. Requests began to be fulfilled around 00:32, but due to the volume of requests being processed, we also experienced rate limiting that extended the length of impact. Our queues returned to normal levels at 00:56, and the incident was resolved at 01:26 after confirming there was no further impact.

Post-incident, the team continued to investigate. The root cause was determined to be a code change made to a function in the impacted database on January 16th, which unintentionally created an excessive number of log messages. The function has been updated to fix this behavior.

## Future Prevention and Process Improvement

We have added a max duration to the workflows retry policy for archiving workflows to allow it to fail earlier than the default timeout, limiting the potential impact on the workflows service should there be a future issue with the archival service. Longer-term, we intend to shift the workflow archival process to an event-based model to decouple the services.

While alerting did indicate an issue with the archival database, the team did not have much time to address the problem before it caused customer impact because the database was filling significantly more quickly than previously forecasted. We will be implementing forecast and anomaly monitoring for our databases to alert us of unusual activity before it reaches critical levels.
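
The "max duration on the retry policy" change can be illustrated with a small, generic sketch: retry a call with backoff, but give up once a total time budget is exhausted so a slow dependency fails fast instead of stalling its caller. The names and timings below are illustrative assumptions, not CircleCI's code.

```python
# Generic sketch of a retry policy with a maximum total duration, in the spirit
# of the change described above. Timings and exception types are placeholders.
import time

class ArchivalUnavailable(Exception):
    """Raised when the call cannot complete within the retry budget."""

def call_with_deadline(fn, *, max_duration_s: float = 5.0, base_backoff_s: float = 0.2):
    deadline = time.monotonic() + max_duration_s
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:  # in practice, catch only retryable errors
            attempt += 1
            sleep_for = min(base_backoff_s * (2 ** (attempt - 1)), 1.0)
            if time.monotonic() + sleep_for >= deadline:
                # Fail fast so the caller (e.g. a workflow orchestrator) is not
                # dragged down by a struggling dependency.
                raise ArchivalUnavailable("call exceeded its retry budget") from exc
            time.sleep(sleep_for)
```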

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue. Thank you for your patience and understanding.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue. You can still view pipelines for a specific project in the UI.

Report: "Delayed Status checks and outbound webhooks"

Last update
resolved

The incident impacting status and outbound webhooks has now been resolved. We thank you for your patience while our engineers worked on the issue.

monitoring

We are seeing signs of recovery from the issue causing delays in status checks and outbound webhooks. We will continue to monitor the situation closely.

investigating

Our engineers are currently investigating an issue that may have an impact on Status checks and outbound webhooks. We will provide further updates as more information becomes available.

Report: "M2-Pro Medium jobs delayed"

Last update
resolved

The incident impacting m2pro.medium resource class has now been resolved. We thank you for your patience while our engineers worked through this issue.

monitoring

We have implemented a fix for the issue affecting the MacOS m2pro.medium resource class and are currently observing signs of improvements. The task start time has decreased and has returned to normal levels. We will continue to monitor the situation closely. Thank you for your continued patience.

identified

Our engineers have identified an issue where builds using m2pro.medium resource class are facing delays of up to 6-8 minutes. We are actively working to mitigate the issue and increase the capacity to resolve this delay. We appreciate your patience and understanding as we work to enhance our service. We will keep you updated on our progress.

Report: "Documentation site is down"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Unexpected Build Failures"

Last update
resolved

Between 20:01 UTC and 21:06 UTC, some users may have experienced unexpected build failures related to the configured working_directory in their configuration. The cause of these failures has been identified and reverted, and builds should now complete successfully.

Report: "MacOS m2pro.large Jobs delayed"

Last update
resolved

The issue affecting the macOS m2pro.large resource class, causing them to have a delayed job-start time, has now been fully resolved. We thank you for your patience while our engineers worked through this incident.

monitoring

We have implemented a fix for the issue affecting the MacOS m2pro.large resource class and are currently observing signs of improvements. The task start time has decreased and has returned to normal levels. We will continue to monitor the situation closely. Thank you for your continued patience.

identified

Our engineers have identified an issue causing delays to macOS m2pro.large tasks. We are working to mitigate the issue and will provide further updates as more information becomes available.

Report: "Delays sending webhooks"

Last update
resolved

Outbound webhook processing time has recovered.

monitoring

Our mitigations are working as expected, we are monitoring the change.

investigating

Outbound webhook processing is delayed. We have identified the issue and are rolling out a fix to mitigate the issue.

Report: "Trigger Pipeline modal in web UI not working"

Last update
resolved

The incident is now resolved.

monitoring

We have identified the root cause of the issue and have reverted the change. Users can now use the previous Trigger Pipeline modal as required.

investigating

We are investigating the cause of the Trigger Pipelines modal in the web UI not working as expected. Affected users can trigger pipelines via the API if required.

Report: "Auto-cancellation disabled for GitHub App pipelines"

Last update
resolved

From October 14th until November 17th, the auto-cancellation feature was disabled for all GitHub App pipelines. The issue did not impact any pipelines integrated through GitHub OAuth or any other VCS. The issue is now resolved, and expected behaviour has been restored.

Report: "Machine jobs are not starting"

Last update
resolved

This incident has been resolved.

monitoring

We're successfully processing the backlog and continuing to monitor it.

identified

We are continuing to see a backlog for machine jobs and are working on resolving that.

identified

We are continuing to work on a fix for this issue.

identified

We have implemented a fix; however, jobs are delayed as we work through the backlog of jobs which arrived during the outage. Thank you for your patience whilst we work through the backlog.

investigating

We have implemented a fix; however, jobs are delayed as we work through the backlog of jobs which arrived during the outage. Thank you for your patience whilst we work through the backlog.

identified

We're currently investigating a possible issue. We'll update as soon as we know more details.

Report: "Some customers may experience delays with Runner builds"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Customers may see delays receiving credits"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Checks and statuses were not updated for GitHub App users and GitLab users"

Last update
resolved

Starting at 2:40 AM UTC, we did not update checks or statuses for GitHub App or GitLab users. This continued until 3:40 PM UTC. Statuses during that time will not be sent, but any after 3:40 PM UTC will update as normal in GitHub or GitLab. Thank you for your patience while we resolved this issue.

Report: "Jobs failing to start or in progress fail"

Last update
postmortem

## Summary:

On October 22, 2024, from 14:45 to 15:52 and again from 17:41 to 18:22 UTC, CircleCI customers experienced failures on new job submissions as well as failures on jobs that were in progress. A sudden increase in the number of tasks completing simultaneously and requests to upload artifacts from jobs overloaded the service responsible for managing job output. On October 28, 2024, from 13:27 to 14:13 and from 14:58 to 15:50, CircleCI customers experienced a recurrence of these effects due to a similar cause.

During these sets of incidents, customers would have experienced their jobs failing to start with an infrastructure failure. Jobs that were already in progress also failed with an infrastructure failure. We want to thank our customers for your patience and understanding as we worked to resolve these incidents.

The original status pages for the incidents on October 22 can be found [here](https://status.circleci.com/incidents/6yjv79g764yc) and [here](https://status.circleci.com/incidents/0crxbhkflndc). The status pages for the incidents on October 28 can be found [here](https://status.circleci.com/incidents/xk37ycndxbhc) and [here](https://status.circleci.com/incidents/8ktdwlsf2lm8).

## What Happened: (All times UTC)

On October 22, 2024, at 14:45 there was a sudden increase in customer tasks completing at the same time within CircleCI. In order to record each of these task end events, including the amount of storage the task used, the system that manages task state (distributor) made calls to our internal API gateway, which subsequently queried the system responsible for storing job output (output service). At this point, output service became overwhelmed with requests; although some requests were handled successfully, the vast majority were delayed before finally receiving a `499 Client Closed Request` error response.

![](https://global.discourse-cdn.com/circleci/original/3X/2/b/2b68322aaf27124eb5ae63a15bc0f8f2118c3f7b.png)
`Distributor task end calls to the internal API gateway`

Additionally, at 14:50, output service received an influx of artifact upload requests, further straining resources in the service. An incident was officially declared at 14:57. Output service was scaled horizontally at 15:16 to handle the additional load it was receiving. Internal health checks began to recover at 15:25, and we continued to monitor output service until incoming requests returned to normal levels. The incident was resolved at 15:52 and we kept output service horizontally scaled.

At 17:41, output service received another sharp increase in requests to upload artifacts and was unable to keep up with the additional load, causing jobs to fail again. An incident was declared at 17:57. Because output service was still horizontally scaled from the initial incident, it automatically recovered by 18:00. As a proactive measure, we further scaled output service horizontally at 18:02. We continued to monitor our systems until the incident was resolved at 18:22.

Following incident resolution, we continued our investigation and uncovered on October 25 that our internal API gateway was configured with low values for the maximum number of connections allowed to each of the services that experienced increased load on October 22. We immediately increased these values so that the gateway could handle an increased volume of task end events moving forward.

Despite these improvements, on October 28, 2024, at 13:27, customer jobs started to fail in the same way as they previously did on October 22. An incident was officially declared at 13:38. By 13:48, the system automatically recovered without any intervention and the incident was resolved at 14:13. We continued to investigate the root cause of the delays and failures, but at 14:45 customer jobs started to fail again in the same way. We declared another incident at 14:50.

In order to reduce the load on output service, we removed the retry logic when requesting storage used per task from output service. This allowed tasks to complete even if storage used could not be retrieved (to the customer's benefit). Additionally, we scaled distributor horizontally at 15:19 in order to handle the increased load. At 15:21, our systems began to recover. We continued to monitor and resolved the incident at 15:51.

We returned to our investigation into the root cause of this recurring behavior and discovered that there was an additional client in distributor that was configured with a low value for the maximum number of connections to our internal API gateway. We increased this value at 17:33.

## Future Prevention and Process Improvement:

Following the remediation on October 28, we conducted an audit of **all** of the HTTP clients in the execution environment and proactively increased those that were similarly configured to the ones in the internal API gateway and distributor. Additionally, we identified a gap in observability with these HTTP clients that prevented us from identifying the root cause of these sets of incidents sooner. We immediately added additional observability to all of the clients in order to enable better alerting if connection pools were to become exhausted again in the future.
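
To illustrate the class of fix described above (explicitly sizing HTTP client connection pools and making pool exhaustion observable), here is a generic sketch using Python's `requests`/urllib3. The pool sizes and the gateway name are placeholders; this is not CircleCI's internal client configuration.

```python
# Generic sketch: size an HTTP client's connection pool explicitly instead of
# relying on small defaults, and make pool exhaustion visible in logs.
import logging
import requests
from requests.adapters import HTTPAdapter

# urllib3 logs a warning when a pool is full and a connection is discarded;
# make sure those warnings are actually emitted somewhere we can see them.
logging.basicConfig(level=logging.WARNING)

def build_session(max_connections: int = 100) -> requests.Session:
    session = requests.Session()
    adapter = HTTPAdapter(
        pool_connections=max_connections,  # number of distinct host pools to keep
        pool_maxsize=max_connections,      # connections kept per host pool
        pool_block=False,                  # don't block callers when the pool is full
    )
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# e.g. a client for a (hypothetical) internal API gateway
gateway_session = build_session(max_connections=200)
```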

resolved

The incident has been resolved. Thanks for your patience.

monitoring

Jobs are working again. If you had any jobs showing failures, you will have to re-run them. We will continue monitoring.

investigating

Some jobs are failing to start, and some jobs are having infrastructure failures. We are looking into it.

Report: "GitHub App branch config fetching failures"

Last update
resolved

From 17:30 UTC to 07:20 UTC, config fetching for GitHub App customers on branches that included a / character in the branch name was failing. A fix has been implemented and we are seeing successful config fetches from these affected branches. Please rerun any failed jobs, or push a new commit.

Report: "Bitbucket checkout failing"

Last update
resolved

From 16:39 UTC to 17:24 UTC, Bitbucket checkouts were failing. A fix has been implemented and we are seeing Bitbucket checkouts pass. Please rerun any failed jobs, or push a new commit.

Report: "Jobs failing to start or in progress fails."

Last update
resolved

The incident has been resolved. Thanks for your patience.

monitoring

Jobs are working again. If you had any jobs showing failures, you will have to re-run them. We will continue monitoring.

investigating

Some jobs are failing to start, and some jobs are having infrastructure failures. We are looking into it.

Report: "MacOS Job Starts Delayed: M2 Pro Medium"

Last update
resolved

This incident has been resolved.

monitoring

We are seeing recovery and will continue to monitor.

identified

Wait times continue to decrease. We are monitoring the fix.

identified

MacOS job starts delayed for M2 Pro medium resource class. We've identified the issue and we are working to resolve it. We will provide more updates as information becomes available and we appreciate your continued patience.

identified

The issue has been identified and a fix is being implemented.

Report: "Plans and Usage pages are unavailable"

Last update
resolved

This incident has been resolved.

monitoring

The plans and usage pages are now accessible and functioning normally.

identified

We have identified the cause of the issue and have begun remediating it. We appreciate your patience whilst we work through the issue.

investigating

We're continuing to investigate this issue. Thank you for your patience.

investigating

Users are unable to view the plans or usage pages. We're investigating this issue.

Report: "Some Runner jobs not starting"

Last update
resolved

During this incident, customers could not access the Runner Inventory page and experienced infrastructure failures for Runner jobs.