Historical record of incidents for Harness
Report: "All clusters experiencing feature loss or degradation of functionality due to our sub-provider functionality being degraded"
Last update: The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "CI builds for MacOS are experiencing an outage"
Last update: We are currently investigating this issue.
Report: "CI/STO Stage Failures"
Last update: The issue has been identified and a fix is being implemented.
We are currently investigating an issue where CI/STO stages are getting stuck or aborted.
Report: "Prod1: Unified Dashboards may be experiencing delays"
Last update: Some of our unified dashboards might be experiencing delays.
Report: "Customers using Feature Flag module are not able to Login in PROD2"
Last update: No errors in the last ~15 minutes. Marking it as resolved.
No errors observed in the last 10 minutes; we are monitoring now.
The issue has been identified and we will monitor.
We are currently investigating.
Report: "Helm deployments failing with older delegates. (< 25.05.858XX)"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We restored the system to the previous version.
The issue has been identified.
We are currently investigating.
Helm deployments failing with older delegates (< 25.05.858XX)
Report: "Helm deployments failing with older delegates. (< 25.05.858XX)"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
we restored system to previous version
The issue has been identified.
We are currently investigating
Helm deployments failing with older delegates (< 25.05.858XX)
Report: "Database Maintenance Notification"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
To enhance Harness service reliability and optimize performance, we will be conducting a database maintenance activity on Friday, May 23rd, from 9:30 p.m. to 10:30 p.m. Pacific Time on the PROD3 environment. The maintenance is planned for the following services in the Prod2 environment only: CD, CI, Pipeline, CCM, SSCA, DB DevOps, Chaos, IDP, CV. We do not anticipate any downtime or service disruption during this window.
Report: "Data migration for unified dashboards"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
This maintenance is scheduled for a planned data migration for unified dashboards. During this period, we do not expect any downtime but there may be some data staleness for custom dashboards for the next 36 hours.
Report: "Unified pipeline dashboard is experiencing delay in Prod3"
Last update:
## **Summary:**
On 08 May 2025 at 4:05 AM UTC, a system release introduced a change that led to increased data ingestion activity. As a result, customers experienced latency in the Unified Pipeline dashboard, causing temporary data staleness within the Prod3 environment. There was no downtime or data loss, but dashboard visibility was temporarily impacted. The issue has since been mitigated, and preventive measures are being implemented to avoid recurrence.
## **What was the issue?**
A system release triggered high-frequency data ingestion, which caused processing delays and led to temporary latency in the Unified Pipeline dashboard within the Prod3 environment.
## **Timeline**
| **Time (UTC)** | **Activity** |
| --- | --- |
| 08 May 2025, 4:05 AM | Prod3 system release completed |
| 08 May 2025, 12:17 PM | Issue was identified |
| 08 May 2025, 1:48 PM | A fix to increase system resources of our databases was validated and implemented |
| 08 May 2025, 3:20 PM | Data was caught up and the issue was resolved |
## **Resolution**
To mitigate the processing delays, the database resources were scaled up. This scaling operation helped restore normal processing throughput. Once completed, the latency issue in the Unified Pipeline dashboard was resolved, and full visibility was restored in the Prod3 environment.
### **Next Steps**
Move major data migrations under a feature flag and schedule them over weekends to ensure better processing performance and minimize impact on live systems.
This incident has been resolved.
We're currently experiencing delays in the Unified Pipeline Dashboard on Prod3. Our team is actively investigating the issue and will share an update shortly
Report: "EU1: Codebase Expression Fails to Resolve in Pipelines Across Multiple Projects (Partial outage)"
Last update: Issue is resolved.
Rolled back the deployment, and the customer confirms resolution.
We are rolling back the deployment and validating.
We are currently looking into an issue where the codebase expression is failing to resolve in pipelines across multiple projects.
Report: "EU1: Codebase Expression Fails to Resolve in Pipelines Across Multiple Projects (Partial outage)"
Last updateIssue is resolved.
Rolled back the deployment and customer confirms
We are rolling back the deployment and validating
We are currently looking into an issue where codebase Expression is failing to Resolve in Pipelines Across Multiple Projects
Report: "CCM - Azure datasync delay"
Last update: We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
We are currently experiencing issues with Azure datasync for August 16. We are actively investigating the issue.
Report: "Prod2 - Resource Constraint Issues"
Last update:
# RCA: Prod2 - Resource Constraint Issues
## **Summary:**
Pipeline executions were getting queued for multiple customers with the message "_Current execution is queued as another execution is running with a given resource key_".
## **What was the issue?**
Pipelines scheduled for execution were experiencing prolonged queuing delays. In certain cases, pipelines remained in the queued state long enough to eventually expire. This behavior impacted deployment pipelines as well as other pipelines incorporating a queue step, leading to execution delays and timeouts.
## **Resolution:**
We found that a large number of resource restraint entries were created during pipeline runs. This buildup caused a backlog, which slowed down new pipeline processing. To mitigate the issue, we manually drained the queue. We also added capacity to help handle the load better and avoid the problem in the future.
## **RCA**
Harness pipelines leverage resource restraint instances to control the number of concurrent pipeline executions. During the incident, an unexpected spike in load triggered the creation of significantly more instances than usual. As these are processed in the background at scheduled intervals, the sudden surge led to processing delays, causing pipelines to queue and resulting in slower execution times. A simplified sketch of this resource-key queueing model is shown after the updates below.
**Action Items**
1. Harness is enhancing the internal management of resource locks to better support scaling and improve concurrency handling across pipelines.
2. Monitoring will be strengthened to include alerts for delays in processing resource restraint instances, allowing quicker detection and response to similar issues moving forward.
We have successfully resolved the issue.
Pipelines are executing successfully; we are monitoring further.
Mitigation efforts are still ongoing.
Mitigation progress is being made, though efforts are still ongoing at this time.
A ResourceRestraintID lock is being held in a single customer's pipeline, causing other pipelines to be stuck. This issue is currently limited to a small number of customers, and we're working to mitigate it now.
We are currently investigating an issue with resource constraints in our Prod2 environment, which is causing stuck pipelines for some customers.
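The sketch below is a simplified, hypothetical model of the resource-key queueing behaviour described in the RCA above; it is not Harness's actual implementation, and all class, method, and key names are made up. Executions that share a resource key wait on a per-key permit, and when permits are released or processed slowly, later executions queue until they give up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one permit per resource key controls concurrent executions.
public class ResourceKeyQueueSketch {
    private static final Map<String, Semaphore> RESTRAINTS = new ConcurrentHashMap<>();

    // Returns false when the execution stays queued past its wait budget (it "expires").
    static boolean tryRun(String resourceKey, Runnable execution, long waitSeconds)
            throws InterruptedException {
        Semaphore permit = RESTRAINTS.computeIfAbsent(resourceKey, k -> new Semaphore(1));
        if (!permit.tryAcquire(waitSeconds, TimeUnit.SECONDS)) {
            System.out.println("Current execution is queued as another execution is running"
                    + " with resource key " + resourceKey);
            return false;
        }
        try {
            execution.run();
        } finally {
            permit.release(); // slow or missed releases are what backs up the queue
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread longRunning = new Thread(() -> {
            try {
                tryRun("deploy-prod", () -> sleep(10_000), 1); // holds the permit for a while
            } catch (InterruptedException ignored) { }
        });
        longRunning.start();
        Thread.sleep(200);                    // let the first execution grab the permit
        tryRun("deploy-prod", () -> { }, 2);  // queues, then gives up after 2 seconds
        longRunning.interrupt();
    }

    private static void sleep(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
    }
}
```

In this toy model the second execution for `deploy-prod` waits two seconds and then expires, mirroring the queued-then-expired pipelines described in the postmortem.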
Report: "Gitops agent using mtls is failing to connect to the gitops service"
Last update:
## **Summary:**
All GitOps agents configured to use mTLS authentication were disconnected. Ticket: [#83615](https://harnesssupport.zendesk.com/agent/tickets/83615)
## **What was the issue?**
The disconnection was caused by a misconfiguration in the gateway component, introduced during a recent configuration update. This resulted in traffic being routed to a non-existent endpoint, blocking communication with the GitOps service. The issue was not identified in lower environments because of the absence of automated tests for mTLS-based scenarios.
## **Timeline**
| **Time (UTC)** | **Event** |
| --- | --- |
| Wednesday, 29th April, 05:00 PM | Incident started |
| Wednesday, 29th April, 05:30 PM | Issue was identified |
| Wednesday, 29th April, 06:30 PM | Fix was validated in QA |
| Wednesday, 29th April, 06:45 PM | Fix was released in the Prod environments |
| Wednesday, 29th April, 07:00 PM | The system became operational again and agents started reconnecting |
## **Resolution**
Fixed the incorrect gateway configuration.
### **Next Steps**
1. Expand our release testing to include mTLS-authenticated agents to ensure better coverage and early detection of similar issues.
2. Enhance monitoring and alerting based on agent connectivity patterns, particularly for mTLS-based agents, to enable faster response and resolution.
We have deployed a fix for this issue, and the GitOps service is working correctly again.
We noticed that GitOps agents using mTLS are failing to connect to the GitOps service in the Prod-1 environment. The issue has been identified and we are working on the resolution. Thanks for your patience.
Report: "PROD1: Stale Data Observed for Unified Custom Dashboards"
Last update:
## **Summary**
On April 4, 2025, for 35 minutes, customers in the prod-1 production environment observed that the following custom dashboards were loading stale data: pipeline, stage, and step executions. We discovered that an incorrect version of the ETL process was accidentally deployed, which caused periodic failures in executing it.
## **Resolution**
Upgrading the ETL process to a newer version addressed this issue.
## **RCA**
Pipeline, stage, and step execution custom dashboards were not loading correctly due to an incorrect upgrade of the ETL process. The upgrade caused periodic execution failures and disrupted the views' data refresh. While no data loss was experienced, dashboards briefly rendered stale data.
## **Action Items**
* Implement a robust deployment process and runbooks to prevent unintended upgrades.
This incident has been resolved. Thanks for your patience.
We have identified the issue and are working on a fix.
We are currently investigating this issue.
Report: "Custom Dashboards are not loading in Prod2"
Last update:
## **Summary**
On January 30th, TimescaleDB was running low on storage. To mitigate the issue, we began cleaning up unused indexes. However, one of the removed indexes was actively used by a custom dashboard, which subsequently led to degraded dashboard performance.
## **Resolution**
In response, we promptly initiated an index rollback to restore dashboard performance and minimize customer impact.
## **RCA**
One of the databases supporting custom dashboards was nearing its storage limit. To address this, we began reclaiming space through reindexing and applying data retention policies. During this cleanup process, a specific index - believed to be unused - was dropped to free up space. However, this index was actively used by the custom dashboard, leading to degraded performance.
## **Action Items**
* Implement automation for `VACUUM` and `ANALYZE` operations to ensure accurate index usage statistics, and establish a robust review process to validate dependencies before dropping any indexes.
* Plan and execute a database migration to a higher storage capacity.
This incident has been resolved.
We are continuing to investigate this issue.
Custom dashboards are failing to load in Prod2. We are currently looking into the issue.
Report: "GCE VM Reboots in us-west1-a Zone"
Last update:
## **Summary**
Google experienced an incident on February 25th with Compute Engine in the us-west1-a zone, where some nodes, specifically E2 and N1 types, would reboot. The reboot caused the ungraceful restart of containers on the affected nodes.
## **Resolution**
Our monitoring systems alerted us to the issue. In response, we decided to be proactive and utilize nodeAffinity to remove core service workloads from the us-west1-a zone in the affected environments until Google resolved the issue and to mitigate potential customer impact.
## **RCA**
Google has yet to post an RCA for their incident, but a small blurb from the resolved incident page states, "From preliminary analysis, the issue was due to a latent bug that manifested under specific conditions, which resulted in unexpected VM reboots in the us-west1-a zone."
## **Action Items**
There was no known customer impact due to this incident because our workloads are multi-zonal, and our actions were entirely proactive to prevent possible impact.
Google has marked their incident as resolved and stated that VMs utilizing the us-west1-a zone are fully operational again.
GCP is experiencing an issue with VMs in the us-west1-a zone. At this time, we've migrated our critical workloads out of this zone to negate any customer impact, and we are fully operational. We will continue to monitor the GCP incident in the event the scope changes.
We are experiencing an issue with Google Compute Engine beginning Monday, 2025-02-25 01:41 UTC. This is causing some services to intermittently restart, resulting in some workloads terminating unexpectedly. Our engineering team is working with GCP to investigate the issue and will post updates as we receive them from Google.
Report: "PROD1: Unified Custom Dashboards are not loading properly"
Last update:
## **Summary**
On April 8, 2025, for 25 minutes, customers in the prod-1 production environment observed that the following custom dashboards were not loading properly: pipeline, stage, and step executions. We discovered that necessary model changes were missed during the version upgrade of our ETL process.
## **Resolution**
Upgrading the ETL process to a newer version addressed this issue.
## **RCA**
Pipeline, stage, and step execution custom dashboards were not loading correctly due to an incorrect upgrade of the ETL process. The incorrect upgrade resulted in our views not having the necessary data to render the dashboards. While no data loss was experienced, dashboards were not rendering correctly for a brief period.
## **Action Items**
* **Improve Pre-Deployment Checks for ETL service upgrade**: Enhance pre-deployment checks to validate that critical model updates are part of the upgrade process.
We have resolved the issue. Dashboards are up and running.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
We are currently investigating this issue. This is impacting steps, stages and pipeline execution dashboards.
Report: "Custom Dashboards [Unified View Explores] are experiencing delays in updating in Prod2"
Last update:
## **Summary**
For 26 hours, customers on Prod-2 observed stale data on the following custom dashboards: pipeline executions, stage executions, and step executions. The metadata state tables managing the ETL process were corrupted during a plan application upgrade, requiring a rebuild of the customer-facing data marts for the dashboards. No data was lost during this process.
## **Resolution**
The metadata state was reset to trigger data mart updates.
## **RCA**
Plan application errors were due to metadata corruption. While no data loss was experienced, data staleness was observed because the data marts were not updated with the latest ETL intervals during the metadata recreation.
## **Action Items**
* The ETL framework will be updated more frequently. Harness will set a regular cadence for testing new updates and deploying them into production to reduce drift in metadata rollbacks.
* Metadata tables will be decoupled from raw data storage to better manage state effects. Decoupling state from raw ingestion will allow faster iteration loops if a database rollback is needed.
Custom Dashboards [Unified View Explores] are now updating normally. The issue is now resolved. We appreciate your patience.
We are still working hard to address the problem and aim to resolve it by 5 PM PST. We understand the inconvenience this may cause and appreciate your patience.
We are still working hard to address the problem and aim to resolve it by 3 PM PST. We understand the inconvenience this may cause and appreciate your patience.
We are still working hard to address the problem and aim to resolve it by 12 PM PST. We understand the inconvenience this may cause and appreciate your patience.
We are still working hard to address the problem and aim to resolve it by 10 AM PST. We understand the inconvenience this may cause and appreciate your patience.
We are experiencing an issue where Custom Dashboards [Unified View Explores] are not updating as expected. We have identified the problem and aim to resolve it by 8 AM PST. We understand the inconvenience this may cause and appreciate your patience.
Report: "Prod-3 was intermittently unavailable"
Last update:
## **Summary**
On 17th April, between 10:42 AM UTC and 11:12 AM UTC, customers experienced intermittent errors when trying to access [app3.harness.io](http://app3.harness.io) on our Prod-3 cluster. The issue was caused by a configuration change on a failover cluster in the backend ingress-controller service setup associated with [app3.harness.io](http://app3.harness.io).
## **Resolution**
Our monitoring system alerted us to the issue; we identified and reverted the change, which restored all functionality in the Prod-3 cluster.
## **RCA**
As part of preparation work for a planned Disaster Recovery (DR) activity, we introduced a new configuration in the Prod-3 cluster. This change unintentionally made the Prod-3 DR environment eligible to receive live customer traffic. Since this environment was not fully operational, some of the requests were returned with 503 errors.
## **Action Items**
* Enhanced monitoring on traffic going to inactive environments.
* Additional safeguards in the deployment process to avoid unintentional traffic routing changes.
This incident has been resolved.
We noticed intermittent failures in our Prod-3 clusters where app3.harness.io was resulting in 5xx errors. This issue has been identified and is now resolved. Please monitor this incident for the postmortem report. Thanks for your patience.
Report: "Harness overview dashboard is not loading on Prod-Eu1"
Last update:
## **Summary**
Following the core release on February 5th, the Overview Dashboard in the EU cluster experienced degraded functionality. Investigation revealed that a version mismatch between the newly released core service and the existing dashboard service caused compatibility issues, leading to the degradation.
## **Resolution**
We deployed the latest released version of the dashboard service to resolve the issue.
## **RCA**
The Dashboard service relies on the core services for proper functionality. The core services were updated without updating the Dashboard service, resulting in a failure of the Overview dashboard to operate as expected.
## **Action Item**
We have implemented a check to ensure the dashboard service is also updated when its dependencies are updated.
We have fixed the issue with the Overview/landing dashboard. This incident is resolved now. Please monitor this page for the postmortem report.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
We are currently investigating this issue.
Report: "PROD3: UI not loading"
Last update:
## **Summary**
On April 1st 2025, the Prod-3 cluster experienced performance degradation, resulting in access check API calls timing out intermittently. The impact was traced to degraded performance in the underlying MongoDB database, which is critical for access control validation.
## **Resolution**
Scaled up the database cluster to address the issue.
## **RCA**
The issue was caused by temporarily degraded performance in our database, which handles access validation for API calls. A memory optimization activity briefly reduced system capacity, and during this window, traffic increased unexpectedly, leading to a delay in the system scaling back to full performance. As a result, some access check operations experienced timeouts, impacting overall request performance.
## **Action Items**
* Utilize database cluster scale-up to address any memory fragmentation issues.
* Improve query and index optimization for better database efficiency.
* Delete stale data to reduce memory usage.
* Optimize retry mechanisms to avoid overwhelming the system during failures.
This incident has been resolved.
A fix has been implemented and we are monitoring.
Report: "PROD2: Login is failing"
Last update:
## **Summary**
On April 8th, in preparation for our scheduled deployment, we started an index build. This caused the database to become unresponsive, resulting in login failures for a few customers.
## **Resolution**
Our monitoring systems alerted us to the issue. In response, we initiated an index rollback to restore database responsiveness and mitigate customer impact.
## **RCA**
To support upcoming changes in the new deployment, we followed best practices and suggestions from MongoDB and began index creation ahead of time. However, high I/O activity on the target collection caused both index and data storage to consume significantly more space than anticipated. The increased storage and index size led to poor database performance. This was a result of how our managed MongoDB service provider handles storage management internally. As a result, the database became unresponsive, leading to login failures. We are currently awaiting a root cause analysis (RCA) from our managed MongoDB service provider to understand the underlying cause of the issue from their side.
## **Action Items**
* We have disabled index building on the specific database collection in question in the short term.
* We are actively working with MongoDB support to investigate and identify the root cause of the issue.
The incident has been resolved. A detailed Root Cause Analysis (RCA) will be shared.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "PROD1: Unable to login"
Last update:
## **Summary**
On April 6th, during our scheduled production deployment, multiple customers could not log in because the services failed to start due to issues with the index build.
## **Resolution**
Our monitoring system alerted us to the issue. Upon investigation, we identified an unexpectedly heavy load on the database resulting in service failures. In response, we initiated a system rollback, which resolved the issue.
## **RCA**
As part of our planned deployment in the production environment (prod1), indexes are created during service startup. However, the combination of high I/O activity on a specific collection and concurrent index creation led to resource contention in MongoDB due to locking that persisted longer than usual. As a result, a few critical services failed to start up, causing the login issue. We are currently awaiting a root cause analysis (RCA) from our managed MongoDB service provider to understand the underlying cause of the issue from their side.
## **Action Items**
* Index creation during service startup has been disabled as part of the deployment process.
* We are actively working with MongoDB support to investigate and identify the root cause of the issue.
This incident currently stands resolved. We will publish an RCA.
A fix has been implemented and we are monitoring the results.
The Harness service is currently unavailable. We are currently working to identify the root cause and restore the service as soon as possible.
Report: "PROD2: Delegates got disconnected from Harness"
Last update:
#### **Summary**
A subset of Delegates in the prod2 cluster got disconnected, causing pipeline failures for customers. It was due to an increased load on the backend database caused by an ad-hoc read query.
#### **What was the issue?**
Customer delegates were disconnected and pipelines were failing.
#### **Resolution**
We cancelled the runaway query and upscaled the database. Overall recovery took ~17 minutes, and the majority of Kubernetes delegates reconnected automatically. A few customers had to restart their non-Kubernetes delegates.
#### **RCA**
As part of regular operational work, we ran a read query in the database which spiked the CPU usage on the database. Unfortunately, this query was run against the primary replica, which increased query latency, resulting in some delegates getting marked disconnected.
#### **Action Items**
1. **Enhance access control:** We have Just-In-Time read access to our database for operational tasks. We are enhancing our system to only provide access to non-primary replicas for such operations.
2. **Enhanced resiliency:** We are planning to run chaos experiments simulating database latency to improve resiliency in our delegate management sub-system against such faults.
All delegate connectivity is resumed. Detailed RCA will follow soon.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
A few delegates got disconnected from Harness.
Report: "PROD1: STO pipeline execution failing"
Last update: [Postmortem is the same as CI/STO Pipeline Execution failing for Customers](https://status.harness.io/incidents/gcmjzvkrrmzy)
This incident has been resolved.
A fix was released to Prod 1 and 2, and initial testing shows that the issue has been resolved. The team will continue to monitor.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue in our Prod-1/2 cluster where users are not able to run STO pipelines.
Report: "Login failures in prod4"
Last update:
## Summary
On Monday, March 31, 2025, at 6:53 PM UTC, some customers experienced authentication issues, including getting logged out of Harness. This incident affected users with accounts hosted on our Prod-4 cluster. This issue was resolved by 7:06 PM UTC, resulting in approximately 13 minutes of downtime.
## Impact
* Duration: 13 minutes (6:53 PM - 7:06 PM UTC)
* Affected Users: Customers with accounts hosted on the Prod-4 cluster
* Symptoms: Authentication failures, unexpected logouts, and traffic drop
## Resolution
Our engineering team identified the issue and took immediate action:
1. Reverted the configuration change at 7:06 PM UTC
2. Rolled back the deployment to the previous stable version (1.16.0) at 7:08 PM UTC
3. Verified service restoration across all affected systems
## RCA
The incident was caused by a routing configuration error in our Global Gateway service. During a planned deployment, a change to our routing logic inadvertently prevented requests from being correctly directed to the Prod-4 cluster. As a result, authentication sessions for affected customers could not be appropriately maintained.
## Action Items
To prevent similar incidents in the future, we are implementing the following improvements:
1. Improved validation of routing configuration changes
2. Additional monitoring to detect routing anomalies earlier
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating login failures in prod4.
Report: "Stale Data Observed for Custom Dashboards in Prod1"
Last update:
### **Summary**
On March 25, 2025, for 2 hours and 22 minutes, customers in the prod-1 production environment observed stale data on the following custom dashboards: pipeline executions, stage executions, and step executions.
### **What was the issue?**
The metadata state tables managing the ETL process were corrupted during a version upgrade, requiring fixes to this table. No data was lost during this process.
### **Resolution**
The metadata state was reset to trigger data mart updates.
| **Time (UTC)** | **Event** |
| --- | --- |
| 26 Mar 2:04 AM | We identified the ETL process that timed out after the upgrade. |
| 26 Mar 3:18 PM | Redeployed the ETL process, applied the plan, and recreated the views. |
| 26 Mar 4:22 AM | The metadata schema was rebuilt, and all data quality checks were confirmed to be passing. |
| 26 Mar 4:25 AM | The incident was resolved. |
### **RCA**
Plan application errors were due to an upgrade of the ETL process timing out after running for two hours. This resulted in metadata corruption, requiring data fixing. While no data loss was experienced, data staleness was observed because the data marts were not updated with the latest ETL intervals during the metadata recreation.
### **Action Items**
* Update the ETL framework frequently to avoid significant version number jumps.
* Set up a regular cadence for testing new updates and deploying them into production.
This incident has been resolved. Thanks for your patience.
We are working towards testing a fix in our dev environment.
We are continuing to work on a fix for the issue.
We are continuing to work on a fix for the issue.
We are working on a fix. We have identified that only Unified Dashboards for pipeline, stage, and steps are currently impacted.
We are currently investigating this issue.
Report: "Custom dashboards are not loading - Prod1,2,3,4 and Eu1 due to Looker managed service outage"
Last update:
### **Summary**
Customers in all the production environments observed that custom dashboards were not loading correctly.
### **What was the issue?**
Harness custom dashboards rely on Looker Studio, a managed service from Google. During that period, Looker experienced an outage, which directly affected custom dashboards.
### **Resolution**
Once Google Looker was back to a stable state, custom dashboards started working correctly.
| **Time (UTC)** | **Event** |
| --- | --- |
| 26 Mar 12:17 AM | Google Looker started experiencing an outage on login and dashboard functionality, impacting Harness custom dashboards. |
| 26 Mar 1:29 AM | Google Looker managed service returned to a steady state. |
### **RCA**
Custom dashboards lost availability due to an outage with Looker, a managed service from Google. No data loss was experienced.
### **Action Items**
* We are awaiting a follow-up RCA from the Google Looker team.
The dashboards are now rendering correctly with the recovery of Looker Service.
The dashboards are now rendering correctly with the recovery of the Looker service. We will continue to monitor the situation.
We are observing a gradual recovery of the Looker service. Some dashboards are now rendering correctly. However, a partial outage remains in effect. We will continue to monitor the situation and provide updates as they become available.
Custom dashboards are not loading in Prod1, Prod2 and Prod3 because our managed service, Looker, is currently facing an outage. We will monitor this outage and provide an update once we have a status update from Looker.
Report: "Pipeline failures due to secret decryption in Prod2"
Last update:
#### Summary:
Pipelines experienced failures in resolving secrets in cases where more than one secret was used in a custom secret manager. This issue was isolated to secrets associated with custom secret managers.
#### Root Cause Analysis:
The pipeline failures happened because the system failed to resolve secrets correctly. A code change to improve the performance of secret decryption was deployed, which resulted in failures for secrets stored in custom secret managers. The code change was behind a feature flag. The feature flag was disabled, which restored normal pipeline operations.
#### Action Items:
1. **Add New Test Cases:** Add new test cases to the automation suite to cover different configuration combinations for custom secret managers.
2. **Add Metrics and Alerts:** Implement appropriate metrics and alerts to detect secret/expression resolution failures proactively and mitigate them.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "CIE Pipeline Execution Failures in Prod1"
Last update:
## **Summary:**
Pipelines failed because delegates and build pods got disconnected. This also impacted our hosted CI operations, and all hosted build pipelines failed.
## Root Cause Analysis:
For performance improvement and development agility, our engineering team had been making changes to disconnect legacy delegates (no longer used by customers but still running) from the platform. This change was behind a feature flag which was enabled and resulted in the incident. The code change had the unintended effect of not accepting connection requests from build containers. The feature flag was disabled, which restored normal pipeline operations.
## Action Items:
1. **Improve feature flag operations:** Our engineering team operated in a silo while enabling this feature, which resulted in an incorrect implementation and an ineffective rollout of this functionality. We are improving our internal process to templatize and manage feature flag rollouts through an external operations team.
2. **Improve Automation:** Add steps in our QA process to catch the dependency between delegates and build pod connection requests so any change in this area is validated internally.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Prod3: Unable to Create Environments,Service Overrides and Infrastructure"
Last update:
#### **Summary**
Users in the Prod3 cluster were unable to create entities such as Environments, Service Overrides, and Infrastructure.
#### Resolution
A fix was implemented to address the system issue. As a result, the system stabilized, enabling users to successfully create Environments and other affected entities.
#### Timeline
| **Time (UTC)** | **Event** |
| --- | --- |
| March 4, 2025, 8:35 AM UTC | Issue reported: unable to create Environments, Service Overrides, and Infrastructure |
| March 4, 2025, 11:34 AM UTC | A fix has been implemented and we are monitoring the results |
| March 4, 2025, 11:59 AM UTC | Incident resolved |
#### **Root Cause Analysis (RCA)**
Harness previously had an issue with login in the Prod3 cluster. To address that, Harness had reverted the system release to the previous version. This rollback led to data inconsistencies, resulting in failures for certain entities, such as Environment and Service overrides. The inconsistencies were later resolved in a subsequent update, after which functionality was confirmed to be working as expected.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Login issues on our Prod3 environment"
Last update:
#### **Summary**
The Prod3 cluster experienced downtime, preventing users from accessing the Harness UI. Only access to Prod3 was affected; pipeline executions were not impacted.
#### Resolution
To mitigate the issue, Harness services were auto-scaled. Additionally, rate limiting and timeouts were implemented for specific API endpoints to regulate the load. These measures effectively reduced system strain, allowing the platform to recover and resume normal operations.
#### Timeline
| **Time (UTC)** | **Event** |
| --- | --- |
| March 4, 2025, 7:25 AM UTC | Investigating login issue in the prod3 environment. The Prod3 cluster was under pressure and rejecting requests |
| March 4, 2025, 7:30 AM UTC | Reverted system release |
| March 4, 2025, 7:38 AM UTC | Changed status to monitoring. System is operating normally |
#### **Root Cause Analysis (RCA)**
One of the core micro-services in the Harness platform was receiving a high volume of external traffic. The API endpoint under load was executing a long-running analytical query, which became slow during this period. This slowdown triggered a cascading effect across the infrastructure, leading to the unavailability of underlying services. As the load increased, new requests began to fail. Since the Harness UI depends on responses from backend APIs, the pages failed to load.
#### **Action Items**
1. Move analytical services to a separate endpoint to prevent such issues from impacting critical workflows.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating a login issue in the prod3 environment.
Report: "Prod2: CI/STO Pipeline Execution failing for Customers"
Last update:
# **Summary:**
Customers encountered an issue with pipeline execution. The executions failed with the exception "Error Creating Plan: Could not create plan for node". This impacted CI and STO stage execution.
# **Timeline:**
| **Time (UTC)** | **Event** |
| --- | --- |
| March 6, 2025, 4:02 AM UTC | The team reviewed the series of events for a previous incident and, since the load on pipeline runs was lower, decided to roll back the release to 1.66.1. |
| March 6, 2025, 9:03 AM UTC | Customers reported that they are intermittently unable to run CI pipelines with a plan creation error |
| March 6, 2025, 9:17 AM UTC | Status page was updated |
| March 6, 2025, 9:25 AM UTC | Identified the gap in the licensing API that led to cache corruption. |
| March 6, 2025, 10:00 AM UTC | CI manager deployment to version 1.67.3 (Prod 2) was done and the errors stopped. |
| March 6, 2025, 10:31 AM UTC | Got customer confirmation that CI is now operational. |
| March 6, 2025, 11:45 AM UTC | STO errors were still occurring due to the rollback |
| March 6, 2025, 2:14 PM UTC | STO was rolled forward to version 1.54 |
# Resolution:
The STO service was rolled forward to version 1.54 to resolve the issue.
# RCA:
When a pipeline execution is triggered, we check the license details for the module and verify that a valid license exists. As part of this check, we ran into an issue with an unknown license type, which triggered an exception causing the pipeline execution failures. The license details API had a gap in the license details fetch call which, when encountered, corrupted the cache for consecutive executions with non-onboarded license types.
# Action Items:
* Improvement in alerting for plan creation errors
* Improve automation tests to cover advanced filtering scenarios for the licensing API
This incident has been resolved.
We have applied the fix (internal tests passed); services are restored.
We are working on the fix.
We are also seeing intermittent new pipeline creation failure. We are currently investigating.
Pipeline Execution failing for Customers in Prod2.
Report: "CI stages are getting queued in Prod2"
Last update:
## **Summary:**
Customers have reported experiencing longer queue times for their Continuous Integration (CI) stages when using Harness Cloud infrastructure. Although the queue limits were not reached, builds remained queued, leading to extended waiting periods as they awaited progression.
## **Timeline:**
| **Time (UTC)** | **Event** |
| --- | --- |
| March 4, 2025, 10:16 PM | Customer reported queued builds |
| March 4, 2025, 11:06 PM | Increased the limits for customers to unblock |
| March 5, 2025, 2:00 AM | Reverted the application which was suspected to have caused the issue |
| March 5, 2025, 7:30 AM | We continued to investigate the issue as we were still seeing some missed cleanups, and we also performed cleanup of stale metadata captured to prevent further queuing |
| March 5, 2025, 4:50 PM | We saw a spike in resources consumed by our apps as the peak load approached, which was mitigated by increasing the resources and stabilized the app. |
| March 5, 2025, 7:40 PM | The issue was narrowed down to the Jackson library upgrade and we started the rollback test on a lower environment. |
| March 5, 2025, 9:54 PM | We rolled back to the previous version of the application and continued to monitor. During this time we noticed increased resource consumption on our Mongo instance, which further caused the stability issue and stuck CI stages. |
| March 5, 2025, 11:18 PM | We decided to roll forward the release and undo the revert, after which the system stabilized. |
| March 6, 2025, 3:26 PM | We worked on the forward fix post stabilization and released it to production. |
## **Resolution:**
We immediately increased the queue sizes for impacted customers to enable their build stages to progress. Subsequently we fixed the library issue and rolled out a newer release. We are improving our alerting and automation to proactively determine any potential issue with resource cleanup at scale.
## **RCA:**
A recent **Jackson library** upgrade slowed down the CI manager's cleanup thread, causing back pressure on the system during peak periods. With the Jackson library upgrade from 2.15.2 to 2.17.2, the `ObjectMapper` implementation changed to use a `ReentrantLock` object. During persistence, Spring recursively reads instance objects and serializes them via reflection. However, Java restricts access to `ReentrantLock` fields via reflection, causing serialization exceptions. A simplified sketch of this reflection restriction is shown after the updates below. As a side effect of the Jackson library upgrade, the load on one of our services increased significantly, causing restarts of the pods, which led to stuck executions of a few CI stages. The above led to pipelines entering a queued state and, due to resource constraints, some pipelines failing to execute.
## **Action Items:**
* Improve the monitoring and alerting for resource cleanup
* Implement a cross-team process for validating library upgrades
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
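To illustrate the reflection restriction the RCA above points to, here is a small standalone sketch. It is an assumption-based illustration rather than Harness or Spring code, and it assumes JDK 17+ with default strong encapsulation: it shows the JVM rejecting deep reflective access to `ReentrantLock` fields, the object the Jackson 2.17 `ObjectMapper` holds internally according to the RCA.

```java
import java.lang.reflect.Field;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: shows the JDK refusing deep reflective access to ReentrantLock
// internals, the restriction the RCA attributes the serialization exceptions to.
// Assumes JDK 17+ with default strong encapsulation (no --add-opens flags).
public class ReentrantLockReflectionSketch {
    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock(); // held inside the upgraded ObjectMapper per the RCA
        for (Field field : ReentrantLock.class.getDeclaredFields()) {
            try {
                field.setAccessible(true);        // reflection-based persistence does this per field
                System.out.println(field.getName() + " = " + field.get(lock));
            } catch (RuntimeException e) {        // InaccessibleObjectException on modern JDKs
                System.out.println("Cannot read field '" + field.getName() + "': " + e);
            }
        }
    }
}
```

Any persistence layer that walks object graphs reflectively would hit the same wall when it reaches the lock held by the upgraded mapper, which matches the serialization exceptions described above.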
Report: "Some CI pipelines are experiencing stage aborts in Prod2"
Last update: [Same postmortem as CI stages are getting queued in Prod2](https://status.harness.io/incidents/wh0xbx7h2x6l)
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Intermittent UI Failures on Prod4"
Last update: This incident is a recurrence of a previously reported issue on March 2nd. The root cause, contributing factors, and corrective actions have already been documented in the earlier RCA. For detailed analysis and remediation steps, please refer to the RCA here: [https://status.harness.io/incidents/w7m7fgcmdhh0](https://status.harness.io/incidents/w7m7fgcmdhh0)
This incident has been resolved.
We terminated the degraded pod and are actively monitoring the situation to ensure stability.
The issue has been identified and a fix is being implemented.
Report: "Intermittent connections errors on Prod4"
Last update:
### **Summary**
On Friday 7 Mar 2025, the Prod4 cluster experienced a disruption when the Global Gateway service stopped serving incoming requests. The incident was caused by a configuration mismatch during a planned version upgrade. The system was fully recovered after approximately 12 minutes of downtime, out of which 7 minutes were full downtime and 5 minutes were partial service disruption.
### Resolution
The team quickly identified the configuration mismatch and reverted to the previous configuration settings. After bouncing the Global Gateway pods, the system recovered, and normal service was restored.
### RCA
During a planned upgrade from version 1.16.0 to version 1.17.2 of the Global Gateway service, a procedural error caused the new configuration intended for version 1.17.2 to be deployed while the older version 1.16.0 was still running in production. The older version was incompatible with the new configuration parameters, causing the service to stop responding to requests.
### Action Items
1. **Enhanced Deployment Oversight and Controls**: Implement additional validation checks in the deployment pipeline to verify version compatibility with configuration changes.
2. **Improved Architecture Resilience**: Accelerate our planned architecture improvements to make the system more resilient to configuration changes and prevent similar failures in the future.
Our team is committed to implementing these improvements to prevent similar incidents in the future.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Global Gateway intermittently encountering connection errors in Prod 4
Report: "CI/CD Pipeline failure on Prod4"
Last update:
## **Summary:**
A deployment pipeline execution on Prod4 resulted in the removal of a few workload identities from the DR cluster that are shared across the Primary and DR clusters. This caused the pods in both the primary and DR clusters that depend on workload identity to fail, affecting service availability.
## **What was the issue?**
Customers faced issues with their CI/CD pipelines, which started failing in the Prod4 environment.
## **Timeline:**
| **Time (UTC)** | **Event** |
| --- | --- |
| January 24, 12:40 AM UTC | During the Prod4 deployment, some services failed to come up healthy, and a FireHydrant incident was triggered. |
| January 24, 12:57 AM UTC | The issue was identified as missing workload identity bindings in the DR cluster. The team decided to redeploy to restore the configuration. |
| January 24, 1:30 AM UTC | The redeployment fixed the issue by syncing the Terraform state, which resolved the mismatch in the DR cluster configuration |
## **Resolution:**
Re-deployment resolved the issue by ensuring that the Terraform state was aligned with the intended configuration, which restored the missing workload identity bindings in the DR cluster.
## **RCA**
During a recent Disaster Recovery (DR) exercise, manual changes were made to the DR cluster but were not properly captured in the Terraform configuration. When the deployment pipeline executed, Terraform applied its last known state, inadvertently removing the workload identity bindings in the DR cluster. This led to pod failures in both the Primary and DR clusters, causing the CI/CD pipeline to fail in Prod4.
**Action Items**
1. **Ensure no manual changes** are made to the system. In case of unforeseen manual changes, document them and incorporate them into Terraform.
2. **Automate Drift Detection:** Implement automated drift detection to identify discrepancies between the live infrastructure and the Terraform state.
3. **Pre-Deployment Validations:** Introduce additional pre-deployment checks to verify workload identity bindings before applying changes.
This incident has been resolved.
The Harness service is experiencing performance issues. We are working to identify the cause and restore normal operations as soon as possible.
Report: "Some Feature Flag customers are experiencing intermittent issues with evaluating target groups on Prod2"
Last update:
## What was the issue?
After a recent change to the Feature Flag authentication gateway, some evaluations failed for TargetGroups with rules that use a custom attribute. Once identified, the team reverted the configuration change, and evaluations returned to normal.
| **Time (UTC)** | **Event** |
| --- | --- |
| 09:02 | Feature Flag authentication gateway configuration change applied |
| 16:49 | First report of issues relating to evaluations of custom rules on TargetGroups |
| 18:34 | Feature Flag authentication gateway configuration reverted back |
| 18:39 | Evaluations of targetGroups return to normal |
## RCA
As part of improvements to our disaster recovery strategy, a change was made to make the Feature Flag authentication gateway more robust. Initial testing performed failed to account for the scenario of client-side SDKs with target groups using rules that use a custom attribute (rather than a core attribute like identifier). Client SDKs generally make two types of request:
1. An auth request
2. An evaluation request to get flag values
During the auth flow, the provided target and its attributes are stored in a DB and Redis, e.g.
```
{
  "identifier": "123-456-789",
  "name": "bob",
  "custom_attribute_1": "value1"
}
```
During the evaluation flow, the target is retrieved from the cache if still present, and if not, it will be retrieved from the DB and stored in the cache. After the Feature Flag authentication gateway change, targets were being written to a different Redis. They would still be persisted to the DB, but if the Redis instance used during evaluations contained an older version of the target that did not have the attributes, then the code would never go to the DB, i.e. the Redis may contain
```
{
  "identifier": "123-456-789",
  "name": "bob"
}
```
In this case, if the custom rule used identifier or name, it would work as expected, but if it used the custom attribute, then that would be missing during the evaluation. A simplified sketch of this cache-first lookup is shown after the updates below.
## Action Items
1. Update our test suite to include additional user authentication flows, to account for the impacted use case
2. Update the Feature Flag authentication gateway to use distinct Redis instances for each environment
3. Review functionality that supports additional reading of target attributes, to provide a failsafe to ensure the correct evaluation is returned
The issue has now been resolved, and we will share the RCA shortly.
We are continuing to monitor for any further issues.
The issue has been identified, and a fix has been put in place. The team is continuing to monitor the issue.
We are currently getting reports of some customers experiencing intermittent issues with Feature Flags when evaluating target groups in the prod2 environment. The team is actively diagnosing the issue and will keep you updated.
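The following minimal sketch (in-memory maps standing in for Redis and the DB; all class and method names are hypothetical, not Harness code) illustrates the cache-first lookup described in the RCA above and why an older cached target without the custom attribute is never refreshed from the DB during evaluation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the evaluation lookup: cache first, DB only on a cache miss.
public class TargetLookupSketch {
    // Stand-ins for Redis (cache) and the database (source of truth).
    private static final Map<String, Map<String, String>> REDIS = new HashMap<>();
    private static final Map<String, Map<String, String>> DB = new HashMap<>();

    static Map<String, String> targetForEvaluation(String identifier) {
        Map<String, String> cached = REDIS.get(identifier);
        if (cached != null) {
            return cached;                 // a stale entry short-circuits the DB read
        }
        Map<String, String> fromDb = DB.get(identifier);
        REDIS.put(identifier, fromDb);     // populate the cache for later evaluations
        return fromDb;
    }

    public static void main(String[] args) {
        // The auth flow persisted the full target to the DB...
        DB.put("123-456-789", Map.of(
                "identifier", "123-456-789",
                "name", "bob",
                "custom_attribute_1", "value1"));
        // ...but the Redis used during evaluations still holds an older copy of the target.
        REDIS.put("123-456-789", Map.of(
                "identifier", "123-456-789",
                "name", "bob"));

        Map<String, String> target = targetForEvaluation("123-456-789");
        System.out.println("identifier rule sees: " + target.get("identifier"));               // works
        System.out.println("custom attribute rule sees: " + target.get("custom_attribute_1")); // null
    }
}
```

In the sketch, rules keyed on `identifier` or `name` still evaluate correctly, while a rule keyed on `custom_attribute_1` sees a missing value, matching the behaviour described in the postmortem.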
Report: "Seeing intermittent login issues on our Prod4 environment"
Last update:
## **Summary:**
Users experienced login failures on the Prod4 cluster due to backend connection limits being exceeded. The issue was triggered by a surge of WebSocket connections following a customer account migration, which triggered the circuit breaker limit on the Harness Global Gateway for the Prod4 cluster.
## **Timeline:**
| **Time (UTC)** | **Event** |
| --- | --- |
| March 2nd, 2:14 PM | We received an alert for Prod4 login failures |
| March 2nd, 3:00 PM | Scaled Global Gateway pods from 2 to 4 and functionality restored |
| March 2nd, 3:06 PM | Increased RPS for the Prod4 ILB |
| March 2nd, 3:10 PM | Confirmed that login is restored |
## **Resolution:**
* Increased backend capacity by scaling the Global Gateway service to distribute the load more effectively.
* Set up necessary alerts to monitor system stability and confirm cluster connectivity.
## **RCA:**
Following a customer account migration, there was a significant increase in WebSocket connections from delegate agents, exceeding the connection limits set for backend hosts. The backend system reached its maximum capacity, preventing new connections from being established. Additionally, one of the backend pods restarted unexpectedly, leaving only a single pod to handle all incoming traffic. This led to the circuit breaker being activated, causing login failures.
## **Action Items:**
* Implement a dedicated traffic splitting configuration to handle WebSocket connections separately from other API requests to prevent similar incidents in the future.
* Improve monitoring and alerting to detect when connection limits are approaching critical thresholds.
* Conduct scalability testing to ensure the system can handle large numbers of WebSocket connections without reaching critical limits.
After continued monitoring and further investigation, the issue has been considered resolved.
A fix has been implemented and we are monitoring the results.
We have identified the issue, and a migration has been applied. The team is continuing to investigate the source of the issue.
An issue has been identified with our global gateway, affecting routing to our Prod4 environment. The team is continuing to investigate the issue.
Seeing intermittent login issues on our Prod4 environment
Report: "Users in Prod-2 cluster facing unexpected pipeline failures"
Last update:
#### What was the issue?
Customers experienced pipeline failures due to intermittent errors when submitting delegate tasks. The issue was identified by the error message:
> UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR, debug data: max_age
**Timeline**
| **Time (UTC)** | **Event** |
| --- | --- |
| 06:00 AM | First occurrence of the issue. |
| 06:06 AM | Alert from our monitoring system received; team started investigating. |
| 06:42 AM | Service instances scaled up to restore service. |
| 07:00 AM | Functionality enabled (which was already being rolled out behind a Feature Flag) to prevent the recurrence of the issue. |
**RCA**
Pipeline execution functionality was degraded due to exhaustion of thread pool resources (responsible for secret resolution from the custom secret manager). The trigger was a pipeline run with a large number of secrets, which overwhelmed the thread pool responsible for resolving secrets. This reduced the capacity of the system, resulting in a build-up of delegate tasks awaiting submission. Eventually, those requests timed out, leading to pipeline failures. Once the issue was identified, we immediately scaled up our service infrastructure to handle the increased load. Subsequently, a feature flag to optimize the **secrets resolution flow** was enabled. (This feature flag was already in the process of being enabled across all Harness environments over the next few days.) A simplified sketch of this thread-pool exhaustion pattern is shown after the updates below.
**Action Items**
1. Roll out the feature in all environments. (done)
2. Enforce a limit on the number of simultaneous secret resolutions in a pipeline execution.
This incident has been resolved. Please monitor this incident for the postmortem report. Thanks for your patience.
The issue is mitigated now; we are actively monitoring it.
We are currently investigating an issue in our Prod-2 cluster with CD pipelines.
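As a rough illustration of the thread-pool exhaustion described in the postmortem above, the sketch below floods a small secret-resolution pool so that a later task waits in the queue until its caller times out. The pool size, task count, and timeouts are invented for the example and are not Harness's real configuration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a small fixed pool for secret resolution is flooded by one
// pipeline run with many secrets, so a later task never runs before its caller gives up.
public class SecretPoolSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService secretResolvers = Executors.newFixedThreadPool(4);

        // One pipeline run submits a large number of slow secret-resolution tasks.
        for (int i = 0; i < 500; i++) {
            secretResolvers.submit(() -> {
                try { Thread.sleep(2_000); } catch (InterruptedException ignored) { }
            });
        }

        // A subsequent delegate-task submission queues behind all of them.
        Future<String> delegateTask = secretResolvers.submit(() -> "resolved");
        try {
            System.out.println(delegateTask.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("secret resolution timed out; the pipeline step fails");
        }
        secretResolvers.shutdownNow();
    }
}
```

Scaling up the service (more pool capacity) or limiting simultaneous secret resolutions per execution, as the action items describe, both reduce the chance of this queue build-up.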
Report: "Harness performance degraded in Prod3."
Last update:
#### **Summary**
The Prod3 environment observed slowness and degraded performance across the board, along with intermittent MongoDB errors while accessing the application.
#### **What was the issue?**
On **Tuesday, February 4th at 9:30 AM UTC**, we observed a sudden spike in MongoDB utilization, reaching **90%** CPU usage, which resulted in **degraded cluster performance**. This surge led to blocked DB connections due to the load, causing multiple queries to starve for connections and impacting user experience across the platform.
#### **Resolution**
* The team investigated the issue and identified the degradation of MongoDB as the root cause.
* Increased the resources for MongoDB, which helped process the load and allowed the system to return to regular operation.
#### **Timeline**
| **Time (UTC)** | **Event** |
| --- | --- |
| 4th Feb 9:30 AM | CPU usage on MongoDB went up and persisted |
| 4th Feb 10:30 AM | Mongo connections spiked, application retries increased |
| 4th Feb 10:30 AM | Cache dirty fill ratio on Mongo reached and persisted at 20%. Writes to MongoDB started to fail |
| 4th Feb 11:40 AM | Upgraded MongoDB instance tier to M40 to handle the load |
| 4th Feb 12:00 PM | System stabilized |
#### **RCA**
We observed that MongoDB CPU utilization spiked. We also observed that MongoDB memory utilization exceeded 90%, leading to an increase in system error rates. During this period, the cache dirty fill ratio started rising and surpassed 20%, remaining elevated. At this point, Mongo application threads became involved in eviction processes instead of executing usual database operations such as CRUD actions, replication, and other core functions. This shift in thread activity caused operations to stall, leading to excessive memory consumption across the nodes. As system memory utilization increased, overall database performance degraded, further compounding the issue. To mitigate the impact, we performed a cluster tier upscale on 02/04/25 at 03:40 AM PST, which successfully alleviated the memory pressure on the affected nodes. Following the upgrade, we observed that system performance returned to acceptable levels.
#### **Action Items**
Since the incident, the team has been actively working on several action items to prevent similar occurrences, including:
* **Query Optimization:** Identifying and optimizing slow-running queries to reduce load on the database.
* **Scaling Strategy:** Evaluating a proactive cluster tier auto-scaling approach to handle traffic spikes efficiently.
* **Monitoring & Alerts:** Enhancing monitoring to detect query bottlenecks earlier and prevent performance degradation.
This incident is resolved now. Please monitor this page for the postmortem report.
The issue has been mitigated and we are currently monitoring the system.
We are actively investigating the service degradation issue in the prod3 environment.
Report: "CI Plugin Image Retrieval Failure from the Docker Hub"
Last update#### **Summary** Certain CI pipelines utilizing Harness CI steps, such as PluginStep and `Setup Build Intelligence`, encountered the error `failed to get image entrypoint`. CI Build Intelligence is enabled by default to enhance build caching. This improvement introduces a background step in each CI stage that operates a cache proxy server, which fetches images from Docker Hub. #### **What was the issue?** A recent outage at Docker Hub, as reported on the Docker Systems [Status Page](https://www.dockerstatus.com/pages/incident/533c6539221ae15e3f000031/67a479b283fb1305d10af103), caused the "Setup Build Intelligence" and CI PluginStep in the CI stage to fail because the image entrypoint could not be retrieved. According to Docker Hub, the outage was limited to unauthenticated \(anonymous\) clients. #### **Timeline** | **Timestamp** | **Event** | **Action** | | --- | --- | --- | | 6th Feb 8:48 AM UTC | A customer reported that the Build Intelligence step was failing | Initiated a SWAT call to address the issue. | | 6th Feb 9:00 AM UTC | Issue identified as stemming from Docker Hub downtime | Docker Systems Status Page | | 6th Feb 9:20 AM UTC | Docker Hub issue resolved and pipeline failures stopped | | #### **Action Items** 1. To mitigate such issues, Harness recommends that customers configure the built-in Harness Image Docker connector to use credentials instead of anonymous access and to pull images from GCR or ECR rather than Docker Hub. For detailed instructions, please refer to [Configure Harness to always use credentials to pull Harness images](https://developer.harness.io/docs/platform/connectors/artifact-repositories/connect-to-harness-container-image-registry-using-docker-connector/#configure-harness-to-always-use-credentials-to-pull-harness-images). 2. We are actively working to eliminate dependencies on external systems to enhance our reliability even further.
This incident has been resolved.
The Docker Hub incident has been resolved, and we are continuing to monitor on our side.
We are seeing an increase in failed pipelines that use Build Intelligence, as it pulls images from Docker Hub.
Docker Hub is experiencing an incident and degraded performance: https://www.dockerstatus.com/
Report: "Feature Flags metrics service down on Prod1"
Last update#### **Summary** Requests to [https://events.ff.harness.io](https://events.ff.harness.io) were failing for all customers in Prod1 accounts, preventing metrics data from being updated. #### **What was the issue?** Testing being performed on the cluster resulted in unexpected behaviour on one of the load-balancing backend instances used to route traffic for the Feature Flag metrics service. #### **Resolution** The load balancer backend was re-synced with the running applications, and traffic resumed. #### **Timeline** | Time \(UTC\) | Event | | --- | --- | | 13 Nov 14:55 | Testing service was brought down, causing a cascading teardown of the backend instance | | 13 Nov 14:56 | On-call engineer alerted to the traffic errors on Prod1 | | 13 Nov 15:02 | Issue identified, and team started to determine the cause | | 13 Nov 15:10 | Issue was fixed, and system sync started | | 13 Nov 15:14 | Traffic resumed | #### **RCA** On Nov 13, 2024, between 14:55 and 15:14 UTC, traffic going into the Feature Flag metrics service on Prod1 received 500 errors. #### **Action Items** * Review policy checks on services to ensure no load balancer backends have more than one label associated.
Between 14:55 UTC and 15:14 UTC the Feature Flag service on Prod1 experienced an outage of the metrics service. Customers sending metrics data during this window will have received a 500 error. We have identified the issue and a fix has been applied. An RCA will follow shortly.
Report: "Queue-Service is impacted for Prod3 customers"
Last update#### Summary We encountered issues with the Queue Service, where bidirectional webhooks were marked as queued, and Git changes were not reflected on Harness. #### Timeline | **TIMELINE \(UTC\)** | **Event** | | --- | --- | | Nov 18, 2024 - 12:48 PM | The customer reported an issue with Bidirectional GitX webhooks in the queued status. | | Nov 18, 2024 - 12:49 PM | The team analysed the monitoring and service logs and observed issues with Redis connectivity after the deployment. | | Nov 18, 2024 - 01:05 PM | DBRE was involved and credentials were rotated. | #### Immediate Resolution We updated the production Redis configuration and performed a configuration deployment. #### RCA The issue with queued webhooks occurred due to Redis errors affecting the Queue Service, which caused bidirectional GitX webhooks to be queued and not processed. The error during redeployment occurred when an incorrect configuration was pushed during a Redis credential rotation, temporarily disrupting the Queue Service. Connectivity remained intact until the alert was received, which prompted an update to the credentials. #### Action Items We have implemented monitoring for the customer webhook queued status to prevent future issues.
The issue has been resolved. We apologize for the inconvenience and will share the root cause analysis (RCA) shortly.
Git-backed entities will have stale data. This will impact pipeline executions. As a workaround, we recommend disabling the webhook until further notice.
Report: "Pipelines custom webhook executions are delayed"
Last update#### **Summary** Custom webhook triggers observed delayed execution due to a surge in incoming trigger executions that created a backlog for processing these types of trigger executions. Only executions via custom webhook triggers were affected; executions via the API and UI were not impacted. #### **What was the issue?** Harness received a surge of custom webhook events for processing triggers. These triggers were executing Git-backed pipelines that were taking longer than usual to resolve, which caused back pressure on trigger processing and led to delays in pipeline executions. This happened because a limited number of resources are available for processing custom webhook triggers. #### **Resolution** We increased the resources on our systems to manage the surge, which helped bring the system back to normal. #### **Timeline** | **Time \(UTC\)** | **Event** | | --- | --- | | Dec 20th 05:15pm | Identified the system was observing some delays in processing triggers. | | Dec 20th 05:50pm | Identified the issue causing the delays. | | Dec 20th 06:10pm | Increased the available resources for processing triggers. | | Dec 20th 07:25pm | Incident was identified as resolved. | #### **RCA** The allocated resources were unable to process the large number of custom webhooks, leading to delays in processing them and thereby causing delayed pipeline executions. As a result, we had to allocate additional resources. #### **Action Items** 1. We have increased the number of threads assigned to process the custom webhooks. 2. We will be working on enhancing the business logic to decouple pipeline resolution from the custom webhook trigger processing flow.
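The resource constraint described above is the classic bounded-worker-pool problem. The sketch below is a hedged, generic illustration (hypothetical names, not the actual Harness trigger service) of a bounded pool that surfaces back pressure explicitly once both the workers and the queue are full, instead of letting events pile up silently.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Minimal sketch: a bounded pool for custom webhook trigger processing.
// When workers and the backlog queue are both full, new submissions are
// rejected rather than accumulating and delaying pipeline executions.
public class WebhookTriggerProcessor {
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            8, 32,                                  // core and max worker threads
            60, TimeUnit.SECONDS,                   // idle worker keep-alive
            new LinkedBlockingQueue<>(1000),        // bounded backlog of webhook events
            new ThreadPoolExecutor.AbortPolicy());  // reject when saturated

    public void submit(Runnable webhookEvent) {
        pool.execute(webhookEvent); // throws RejectedExecutionException when saturated
    }
}
```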
This incident has been resolved. We regret the inconvenience and will be providing an RCA for review.
We have mitigated the issue and are continuing to monitor the iterator queue. The iterator queue will gradually clear, and the webhook queue will clear as well.
We are continuing to make progress and have partially mitigated the issue.
We are continuing to work on a fix for this issue. Thank you for your patience!
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
We are actively investigating the issue. Thank you for your patience!
We are currently investigating the issue
Report: "Some customers on Prod1 may be experiencing degraded performance"
Last update## Summary On **October 16th, 2024**, our **Prod1** environment experienced a significant increase in service response time and multiple 5xx errors. This led to degraded performance and outages for several services, including the NG-Manager pods, which went into an unhealthy state and restarted multiple times. ## What caused the issue The issue was caused by an overload on one of the backend service **databases** due to a large number of **background tasks** being re-assigned at once. This surge in tasks was triggered by **delegate disconnections**, which were caused by a spike in CPU usage on the **Ingress pod**. The overload on the database led to: * Increased memory usage * Slow database queries * Service pods restarting due to unhealthy states ## Resolution The following steps were taken to mitigate the issue: 1. Increased the size of the **MongoDB** instance. 2. Stopped **~1200 background tasks** that were running, which helped reduce the load on the database. These actions led to system recovery, and the NG-Manager pods returned to a healthy state. ## Follow-up Actions To prevent similar issues in the future, we are implementing the following changes: * **Improved Background Task Handling**: Modify task reset jobs to depend on task heartbeat rather than delegate disconnection status. * **MongoDB Autoscaling**: Enable autoscaling for MongoDB to handle CPU and memory spikes. * **Rate-limiting of Instance Sync Requests**: Implement throttling to ensure the database is not overwhelmed during peak activity. * **Enhanced Monitoring and Alerts**: Add alerts for MongoDB resource usage and instance sync updates to catch potential issues earlier.
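The first follow-up item, resetting tasks based on their own heartbeat rather than delegate disconnection, might look roughly like the sketch below. The `Task` abstraction and `reassign` call are hypothetical stand-ins for illustration, not the actual Harness task model.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Minimal sketch: only reassign tasks whose own heartbeat is stale,
// instead of reassigning every task the moment its delegate disconnects.
public class TaskResetJob {
    private static final Duration HEARTBEAT_TIMEOUT = Duration.ofMinutes(5);

    public void resetStaleTasks(List<Task> runningTasks) {
        Instant cutoff = Instant.now().minus(HEARTBEAT_TIMEOUT);
        for (Task task : runningTasks) {
            if (task.lastHeartbeat().isBefore(cutoff)) {
                task.reassign(); // hypothetical: re-queue only genuinely stale tasks
            }
        }
    }

    // Hypothetical task abstraction, for illustration only.
    interface Task {
        Instant lastHeartbeat();
        void reassign();
    }
}
```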
This incident has been resolved. We will provide an RCA after findings are complete.
The issue has been mitigated. We are still monitoring the system to ensure healthy operation of the cluster.
We have identified the service that is causing the degradation. We have scaled up the DB resource for that service. We are still working to mitigate the issue.
We have internally found an issue that is impacting the optimal performance for Prod1 customers. We are actively investigating this.
Report: "Pipeline executions are briefly not visible on the platform."
Last update# Summary Due to a delayed background index build sync on the analytical node, a data replication lag was introduced. This lag prevented the latest pipeline executions from appearing in the Prod2 environment. # Resolution | **Timeline \(UTC\)** | **Event** | | --- | --- | | Oct 28, 9:09 AM | The customer reported being unable to access the execution. | | Oct 28, 9:11 AM | The team engaged to troubleshoot the issue. | | Oct 28, 9:43 AM | DBRE identified an issue with replication lag, paused further index creation, and started bringing up a new analytical node. | | Oct 28, 10:31 AM | Index build sync completed and the lag issue was resolved. | # RCA On Oct 28, 2024, between 9:09 AM UTC and 10:31 AM UTC, customers faced an issue where their recent pipeline executions were not visible in the Prod2 environment. The issue stemmed from an index creation job executed as part of a recent release. This job caused replication lag in one of the read replicas of our MongoDB database, preventing up-to-date data from being available. The index creation job was halted, which restored normal replication and resolved the visibility issue for recent pipeline executions. # Action Items **Implement maxStalenessSeconds**: - Application teams were advised to include the `maxStalenessSeconds` parameter in their connection configuration. This setting ensures read queries are directed to secondary nodes with replication lag below the specified `maxStalenessSeconds` threshold.
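For reference, `maxStalenessSeconds` is a standard MongoDB read-preference option (minimum value 90, and only applicable when reads may go to secondaries). A minimal Java-driver sketch with a placeholder connection string:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

// Minimal sketch: direct secondary reads only to replicas whose replication
// lag is below maxStalenessSeconds, so a lagging analytical node is skipped.
public class StalenessAwareClient {
    public static MongoClient create() {
        // Placeholder URI; maxStalenessSeconds must be >= 90 and requires a
        // read preference other than "primary".
        String uri = "mongodb://host1:27017,host2:27017/"
                + "?replicaSet=rs0"
                + "&readPreference=secondaryPreferred"
                + "&maxStalenessSeconds=90";
        return MongoClients.create(uri);
    }
}
```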
The replication issue has been resolved. We will share the RCA as soon as possible.
The replication is now under control, and the issue should be resolved. We will conduct a root cause analysis (RCA) and share our findings as soon as possible.
We have identified the issue affecting data replication on our database analytics node.
We are currently investigating this issue.
Report: "Chaos Engineering users in Prod2 cluster may be experiencing issues"
Last update**Incident Summary:** The Chaos dashboard on the Prod2 cluster was inaccessible to users, returning a 404 error code when they attempted to access it. **Timeline** | **Time \(UTC\)** | **Event** | | --- | --- | | November 5, 6:58 AM | An alert was triggered. | | November 5, 7:08 AM | We identified that the issue was related to ingress. | | November 5, 7:16 AM | The ingress class issue caused by the Helm migration was fixed and monitoring started. | | November 5, 7:19 AM | The incident was successfully resolved. | **Root Cause Analysis:** During the Helm switchover, all module ingresses were duplicated under a new Ingress Class. Automation was created to ensure that each service's ingress rule was duplicated. However, the Chaos ingress rules followed an older method of specifying the Ingress Class via annotations, which was missed by the automation. **Immediate Resolution:** The Chaos ingress rules were recreated with a valid Ingress Class. **Action Items:** * Update the Chaos ingress rules to use the new method of associating the Ingress Class via a key-value pair. * Enhance the post-migration validation framework to include automated checks for data accuracy and integrity.
This incident has been resolved. We regret the inconvenience that this might have caused.
The fix has been put in and we are monitoring the system at the moment.
The issue has been identified and the team is actively working to fix it
We are currently investigating an issue where Chaos Engineering customers in Prod2 may be experiencing issues.
Report: "CI Harness Cloud builds were non-operational for Prod3."
Last update## **Summary:** Customers in Prod3 were unable to run Hosted CI builds. ## **What was the issue:** The delegate lite microservice responsible for hosted CI builds got scaled down due to a misconfiguration, so customers were unable to run CI builds. ## **Resolution** | **Time** | **Event** | | --- | --- | | Dec 10 at 5:34 AM UTC | Customer reported an issue with running a CI pipeline. | | Dec 10 at 5:52 AM UTC | The delegate lite microservice was redeployed in Prod3 and the issue was resolved. | ## **RCA** The delegate lite deployment follows a blue-green deployment model. However, due to a misconfiguration, both the old and the new deployment were stopped after the maintenance window, which was intended to shut down only the old deployment. The issue was not detected internally due to a misconfigured alert. ## **Action Items** * Fix and test the alerts for Hosted CI and related services to ensure they trigger as desired. * Fix the delegate lite deployment to scale down only if the required number of containers are active.
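The second action item boils down to a guard condition before the old side of the blue-green pair is removed. The sketch below is purely illustrative: the names are hypothetical, and the replica counts would come from whatever deployment system is in use.

```java
// Minimal sketch of a blue-green scale-down guard: the old deployment is only
// scaled down once the new deployment reports the required number of ready
// containers. Inputs are assumed to come from the deployment platform.
public class BlueGreenScaleDownGuard {
    public static boolean canScaleDownOld(int newReadyReplicas, int requiredReplicas) {
        return newReadyReplicas >= requiredReplicas;
    }

    public static void main(String[] args) {
        int requiredReplicas = 3;
        int newReadyReplicas = 0; // e.g. the state that caused this incident
        if (canScaleDownOld(newReadyReplicas, requiredReplicas)) {
            System.out.println("Safe to scale down the old deployment.");
        } else {
            System.out.println("Blocking scale-down: new deployment is not fully ready.");
        }
    }
}
```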
This incident is resolved and services are operational.
We would like to notify you that CI Cloud builds (Mac Cloud Builds & Linux Cloud Builds) were non-operational from 5:46 AM UTC until 6:06 AM UTC. The services are now restored and working fine.
Report: "Harness Platform was briefly Unavailable"
Last update### **Summary** Customers were unable to access [https://app.harness.io/](https://app.harness.io/) for 2 minutes. ### **What was the issue?** A recent deployment for the gateway component in the prod-1 environment had an incorrect configuration that downscaled all the gateway pods. ### **Resolution** The configuration was reverted to restore service availability. | **Time \(UTC\)** | **Event** | | --- | --- | | 5 Nov 12:52:50 PM | Service deployment downscaled the gateway pods. | | 5 Nov 12:54:50 PM | Scaled up gateway pods. New pods were up and running to serve traffic. | ### RCA On Nov 5, 2024, for 2 minutes, users experienced an HTTP 503 \(service unavailable\) error when attempting to access [https://app.harness.io](https://app.harness.io). This occurred due to the downscaling of the gateway service. The issue originated from a recent deployment that applied an incorrect configuration. The configuration was immediately reverted to restore service availability. ### Action Items **Improve Pre-Deployment Checks**: Enhance pre-deployment checks to validate critical service configurations and prevent unintended downscaling.
This incident has been resolved.
We would like to notify you of a disruption to the Harness Platform that took place at 12:53 PM UTC today. This was a temporary glitch, and the Platform is now operating normally. Further details regarding the precise impact and underlying cause of this disruption will be provided in a postmortem report here. We appreciate your understanding and patience.
Report: "Pipeline Steps Timing out for a subset of customers in Prod2"
Last update## **Summary:** Pipeline executions were failing with a time-out error on Prod2. This affected ~3% of pipeline executions. ## **What was the issue?** Tasks are execution units that run on a delegate as part of a pipeline execution. As a pipeline runs, its tasks are broadcast to delegates, and one eligible delegate picks up each task for execution. If no delegate acquires a task within the stipulated time, it is rebroadcast. During this incident, rebroadcast functionality was affected, resulting in pipeline executions timing out. ## **Resolution:** We rolled back the service to resolve the issue. ## **RCA** An incompatible change was rolled out in one of our microservices, causing deserialization failures for a subset of task types. The rebroadcast threads went into an error state due to this deserialization error, resulting in the failure of pipelines that required task rebroadcasts. The system recovered upon the service's rollback. **Action Items** 1. Added a critical alert for rebroadcast events. 2. Rebroadcast logic was made resilient to task deserialization errors. 3. A unit test was added to catch incompatible contract changes for task data.
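Making the rebroadcast logic resilient to deserialization errors (action item 2) essentially means isolating per-task failures so a single bad payload cannot put the rebroadcast thread into an error state. A hedged sketch with hypothetical types, not the actual Harness code:

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal sketch: deserialize and rebroadcast each pending task independently,
// so one incompatible task payload no longer stops rebroadcasts for the rest.
public class RebroadcastLoop {
    private static final Logger LOG = Logger.getLogger(RebroadcastLoop.class.getName());

    public void rebroadcastPending(List<byte[]> pendingTaskPayloads, TaskCodec codec, Broadcaster broadcaster) {
        for (byte[] payload : pendingTaskPayloads) {
            try {
                broadcaster.broadcast(codec.deserialize(payload));
            } catch (Exception e) {
                // Previously an error like this escaped and halted the rebroadcast thread.
                LOG.log(Level.SEVERE, "Skipping task that failed to deserialize", e);
            }
        }
    }

    // Hypothetical abstractions, for illustration only.
    interface TaskCodec { Object deserialize(byte[] payload) throws Exception; }
    interface Broadcaster { void broadcast(Object task); }
}
```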
The incident has been resolved. We will be sharing an RCA covering improvements in monitoring and other steps.
The issue has been fixed and we are monitoring the system.
The issue has been identified and we are still working on a fix.
We are currently investigating an issue where the clone codebase step is failing for a subset of customers in Prod2.
Report: "Harness cloud builds failing at initialise step for MAC users"
Last update### **Summary** CI-hosted MacOS pipelines were failing during the initialisation step, impacting specific customers using our MacOS-hosted service. ### What was the issue? We tightened a firewall rule for our Mac VM registry that was previously too permissive. As a result, another component couldn’t access the registry, leading to pipeline failures. ### **Resolution** | **Time** | **Event** | | --- | --- | | Sept 1st, 17:00 UTC | Restricted the firewall rule. | | Sept 04, 06:03 UTC | Issue reported by the customer. | | Sept 04, 08:39 UTC | We re-created the firewall rule and validated that the issue was fixed. | ### RCA Our MacOS production setup includes several components. When we restricted the permissive firewall rule, the new rule did not account for the NAT IP address of one of these components. After the change, we ran a full sanity pipeline on the Mac machines, which passed successfully. The issue didn’t surface immediately as the affected component maintains a persistent socket connection, unaffected by the firewall until the connection is re-established or restarted. This explains why the failure didn’t occur immediately after we removed the permissive rule on September 1st. We restored the rule, and the issue was resolved. ### Action Items 1. Restrict the firewall rule again, ensuring that necessary NAT IPs are included. 2. Restart all relevant services when applying firewall rule restrictions. 3. Ensure that all connections are properly drained and re-established when the change is implemented.
We apologise for the inconvenience caused by this outage. We will make sure to provide the root cause analysis soon.
The issue is resolved now. We will be sharing RCA for the problem as soon as possible.
We are currently investigating this issue.
Report: "Login issues on Prod4"
Last update## **Summary:** Logged-in users started getting redirected to the enrollment screen with an “Email verified successfully” message and were forced to enter their user details again. Pipeline executions and backend tasks were not impacted. The impact was limited to accounts in the Prod4 cluster. ## **What was the issue?** We released an incompatible version of the NextGen UI service, resulting in an unexpected new sign-up flow for existing users. This was a human error. ## **Timeline:** | **Time** | **Event** | | --- | --- | | September 03 7:45 PM UTC | Customer reported login redirection to the sign-up page | | September 03 8:15 PM UTC | New deployment happened around the same time. Decided to roll back | | September 03 8:20 PM UTC | Started the partial rollback of FF Proxy changes | | September 03 8:30 PM UTC | Partial rollback didn’t fix the issue. Initiated full rollback | | September 03 9:00 PM UTC | Complete rollback completed and issue resolved | ## **Resolution:** Rollback resolved the issue. ## **RCA** There was a human error in picking the version of the NextGen UI service. Post-deployment sanity checks did not catch this issue. Rolling back took longer than expected because multiple services were deployed together. **Action Items** 1. Remove the manual process of picking service versions. Automate the promotion process from lower environments. 2. Improve sanity tests to catch the above UI flow. 3. Make the rollback process atomic based on the previous known good state.
We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
Logged in users started getting redirected to the enrollment screen. Currently investigating
Report: "Pipeline Services are having degraded performance"
Last update## **Summary** After the Redis isolation maintenance on Prod1, internal monitoring tools showed that pipelines were running slower. ## **What was the issue?** The Harness platform uses a set of services that act as producers and consumers for Redis streams. The order in which these services were brought up caused some of the streams not to be consumed. ## **Timeline** | **Time** | **Event** | | --- | --- | | 9:55AM PT | Noticed intermittent slowness in Pipelines | | 10:00AM PT | Core services were rolled out again | | 10:10AM PT | Pipeline performance improved and services were running well | ## **Resolution** Restarting the services in the correct order made the Redis producers/consumers available. Pipeline performance also improved and returned to normal latency.
We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
Report: "Customers unable to access Harness on Prod4 Cluster"
Last update## **Summary:** Customers experienced login failures with 5xx errors on the Prod4 cluster. ## **What was the issue?** The Harness platform internally uses a managed memStore, which experienced a “Host error”; this triggered a master switchover within seconds. Backend microservices that connect to memStore were not able to reconnect quickly. The issue affected Java-based services, while Go services reconnected properly. ## **Timeline:** | **Time** | **Event** | | --- | --- | | 21 August 4:06:41 PM UTC | Primary memStore went down | | 21 August 4:07:00 PM UTC | Secondary memStore promoted to Primary | | 21 August 4:06:41 PM UTC | Harness services experienced RedisResponseTimeoutException | | 21 August 4:14:30 PM UTC | Harness services restored connectivity to the new Primary | | 21 August 4:14:53 PM UTC | New instance of memStore added and promoted as Secondary | ## **Resolution:** After 8 minutes, services reconnected to the new primary memStore on their own and the system recovered. ## **RCA** Java services use the Redisson library to connect to memStore. The established connection pool doesn’t detect the endpoint going away, and these connections eventually time out. This issue doesn’t occur during a graceful failover; we encounter it only in the case of a catastrophic failure. **Action Item** * Detect this catastrophic failure so that services reconnect more quickly
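As context for the action item, Redisson exposes connection-health settings (for example `pingConnectionInterval`) that help pooled connections notice a dead endpoint sooner after an abrupt failover. The sketch below is a hedged example of such a configuration with a placeholder address and a simplified single-server topology; it is not the exact settings Harness applied.

```java
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

// Minimal sketch: Redisson client settings that help pooled connections detect
// an endpoint that disappeared abruptly, rather than waiting for long timeouts.
public class MemStoreClientFactory {
    public static RedissonClient create() {
        Config config = new Config();
        config.useSingleServer()                              // topology simplified for illustration
              .setAddress("redis://memstore.example.internal:6379") // placeholder address
              .setConnectTimeout(5000)          // connection establishment timeout (ms)
              .setTimeout(3000)                 // command response timeout (ms)
              .setRetryAttempts(3)
              .setPingConnectionInterval(1000)  // periodically ping pooled connections (ms)
              .setKeepAlive(true);              // enable TCP keep-alive on connections
        return Redisson.create(config);
    }
}
```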
We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
We are currently investigating this issue.
Report: "Users were unable to review details of security issues and STO pipeline steps were delayed"
Last updateWe can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
We are continuing to monitor for any further issues.
Users were unable to review details of security issues and STO pipeline steps were delayed