Historical record of incidents for Harness
Report: "All clusters experiencing feature loss or degradation of functionality due to our sub-provider functionality being degraded"
Last update: The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "CI builds for MacOS are experiencing an outage"
Last update: We are currently investigating this issue.
Report: "CI/STO Stage Failures"
Last update: The issue has been identified and a fix is being implemented.
We are currently investigating an issue where CI/STO stages are getting stuck or aborted.
Report: "Prod1: Unified Dashboards may be experiencing delays"
Last update: Some of our unified dashboards might be experiencing delays.
Report: "Customers using Feature Flag module are not able to Login in PROD2"
Last update: No errors in the last ~15 minutes. Marking it as resolved.
No errors observed in the last 10 minutes; we are monitoring now.
The issue has been identified and we will monitor.
We are currently investigating.
Report: "Helm deployments failing with older delegates. (< 25.05.858XX)"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We restored the system to the previous version.
The issue has been identified.
We are currently investigating.
Helm deployments failing with older delegates (< 25.05.858XX)
Report: "Helm deployments failing with older delegates. (< 25.05.858XX)"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
we restored system to previous version
The issue has been identified.
We are currently investigating
Helm deployments failing with older delegates (< 25.05.858XX)
Report: "Database Maintenance Notification"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
To enhance Harness service reliability and optimize performance, we will be conducting a database maintenance activity on Friday, May 23rd, from 9:30 p.m. to 10:30 p.m. Pacific Time on the PROD3 environment. The maintenance is planned for the following services in the Prod2 environment only: CD, CI, Pipeline, CCM, SSCA, DB DevOps, Chaos, IDP, CV. We do not anticipate any downtime or service disruption during this window.
Report: "Data migration for unified dashboards"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
This maintenance is scheduled for a planned data migration for unified dashboards. During this period, we do not expect any downtime but there may be some data staleness for custom dashboards for the next 36 hours.
Report: "Unified pipeline dashboard is experiencing delay in Prod3"
Last update:
## **Summary:**
On 08 May 2025 at 4:05 AM UTC, a system release introduced a change that led to increased data ingestion activity. As a result, customers experienced latency in the Unified Pipeline dashboard, causing temporary data staleness within the Prod3 environment. There was no downtime or data loss, but dashboard visibility was temporarily impacted. The issue has since been mitigated, and preventive measures are being implemented to avoid recurrence.
## **What was the issue?**
A system release triggered high-frequency data ingestion, which caused processing delays and led to temporary latency in the Unified Pipeline dashboard within the Prod3 environment.
## **Timeline**
| **Time (UTC)** | **Activity** |
| --- | --- |
| 08 May 2025, 4:05 AM | Prod3 system release completed |
| 08 May 2025, 12:17 PM | Issue was identified |
| 08 May 2025, 1:48 PM | A fix to increase system resources of our databases was validated and implemented |
| 08 May 2025, 3:20 PM | Data was caught up and the issue was resolved |
## **Resolution**
To mitigate the processing delays, the database resources were scaled up. This scaling operation helped restore normal processing throughput. Once completed, the latency issue in the Unified Pipeline dashboard was resolved, and full visibility was restored in the Prod3 environment.
### **Next Steps**
Move major data migrations under a feature flag and schedule them over weekends to ensure better processing performance and minimize impact on live systems.
This incident has been resolved.
We're currently experiencing delays in the Unified Pipeline Dashboard on Prod3. Our team is actively investigating the issue and will share an update shortly
Report: "EU1: Codebase Expression Fails to Resolve in Pipelines Across Multiple Projects (Partial outage)"
Last update: Issue is resolved.
Rolled back the deployment, and the customer confirms resolution.
We are rolling back the deployment and validating.
We are currently looking into an issue where the codebase expression is failing to resolve in pipelines across multiple projects.
Report: "EU1: Codebase Expression Fails to Resolve in Pipelines Across Multiple Projects (Partial outage)"
Last updateIssue is resolved.
Rolled back the deployment and customer confirms
We are rolling back the deployment and validating
We are currently looking into an issue where codebase Expression is failing to Resolve in Pipelines Across Multiple Projects
Report: "CCM - Azure datasync delay"
Last update: We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
We are currently experiencing issues with Azure datasync for August 16. We are actively investigating the issue.
Report: "Prod2 - Resource Constraint Issues"
Last update:
# RCA: Prod2 - Resource Constraint Issues
## **Summary:**
Pipeline executions were getting queued for multiple customers with the message "_Current execution is queued as another execution is running with a given resource key_".
## **What was the issue?**
Pipelines scheduled for execution were experiencing prolonged queuing delays. In certain cases, pipelines remained in the queued state long enough to eventually expire. This behavior impacted deployment pipelines as well as other pipelines incorporating a queue step, leading to execution delays and timeouts.
## **Resolution:**
We found that a large number of resource restraint entries were created during pipeline runs. This buildup caused a backlog, which slowed down new pipeline processing. To mitigate the issue, we manually drained the queue. We also added capacity to help handle the load better and avoid the problem in the future.
## **RCA**
Harness pipelines leverage resource restraint instances to control the number of concurrent pipeline executions. During the incident, an unexpected spike in load triggered the creation of significantly more instances than usual. As these are processed in the background at scheduled intervals, the sudden surge led to processing delays, causing pipelines to queue and resulting in slower execution times. A simplified sketch of this resource-key queueing model is shown after the updates below.
**Action Items**
1. Harness is enhancing the internal management of resource locks to better support scaling and improve concurrency handling across pipelines.
2. Monitoring will be strengthened to include alerts for delays in processing resource restraint instances, allowing quicker detection and response to similar issues moving forward.
We have successfully resolved the issue.
Pipelines are executing successfully; we are monitoring further.
Mitigation efforts are still ongoing.
Mitigation progress is being made, though efforts are still ongoing at this time.
A ResourceRestraintID lock is being held in a single customer's pipeline, causing other pipelines to be stuck. This issue is currently limited to a small number of customers, and we're working to mitigate it now.
We are currently investigating an issue with resource constraints in our Prod2 environment, which is causing stuck pipelines for some customers.
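The sketch below is a simplified, hypothetical model of the resource-key queueing behaviour described in the RCA above; it is not Harness's actual implementation, and all class, method, and key names are made up. Executions that share a resource key wait on a per-key permit, and when permits are released or processed slowly, later executions queue until they give up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one permit per resource key controls concurrent executions.
public class ResourceKeyQueueSketch {
    private static final Map<String, Semaphore> RESTRAINTS = new ConcurrentHashMap<>();

    // Returns false when the execution stays queued past its wait budget (it "expires").
    static boolean tryRun(String resourceKey, Runnable execution, long waitSeconds)
            throws InterruptedException {
        Semaphore permit = RESTRAINTS.computeIfAbsent(resourceKey, k -> new Semaphore(1));
        if (!permit.tryAcquire(waitSeconds, TimeUnit.SECONDS)) {
            System.out.println("Current execution is queued as another execution is running"
                    + " with resource key " + resourceKey);
            return false;
        }
        try {
            execution.run();
        } finally {
            permit.release(); // slow or missed releases are what backs up the queue
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread longRunning = new Thread(() -> {
            try {
                tryRun("deploy-prod", () -> sleep(10_000), 1); // holds the permit for a while
            } catch (InterruptedException ignored) { }
        });
        longRunning.start();
        Thread.sleep(200);                    // let the first execution grab the permit
        tryRun("deploy-prod", () -> { }, 2);  // queues, then gives up after 2 seconds
        longRunning.interrupt();
    }

    private static void sleep(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
    }
}
```

In this toy model the second execution for `deploy-prod` waits two seconds and then expires, mirroring the queued-then-expired pipelines described in the postmortem.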
Report: "Gitops agent using mtls is failing to connect to the gitops service"
Last update:
## **Summary:**
All GitOps agents configured to use mTLS authentication were disconnected. Ticket: [#83615](https://harnesssupport.zendesk.com/agent/tickets/83615)
## **What was the issue?**
The disconnection was caused by a misconfiguration in the gateway component, introduced during a recent configuration update. This resulted in traffic being routed to a non-existent endpoint, blocking communication with the GitOps service. The issue was not identified in lower environments because of the absence of automated tests for mTLS-based scenarios.
## **Timeline**
| **Time (UTC)** | **Event** |
| --- | --- |
| Wednesday, 29th April, 05:00 PM | Incident started |
| Wednesday, 29th April, 05:30 PM | Issue was identified |
| Wednesday, 29th April, 06:30 PM | Fix was validated in QA |
| Wednesday, 29th April, 06:45 PM | Fix was released in the Prod environments |
| Wednesday, 29th April, 07:00 PM | The system became operational again and agents started reconnecting |
## **Resolution**
Fixed the incorrect gateway configuration.
### **Next Steps**
1. Expand our release testing to include mTLS-authenticated agents to ensure better coverage and early detection of similar issues.
2. Enhance monitoring and alerting based on agent connectivity patterns, particularly for mTLS-based agents, to enable faster response and resolution.
We have deployed a fix for this issue, and the GitOps service is working correctly again.
We noticed that GitOps agents using mTLS are failing to connect to the GitOps service in the Prod-1 environment. The issue has been identified and we are working on the resolution. Thanks for your patience.
Report: "PROD1: Stale Data Observed for Unified Custom Dashboards"
Last update:
## **Summary**
On April 4, 2025, for 35 minutes, customers in the prod-1 production environment observed that the following custom dashboards were loading stale data: pipeline, stage, and step executions. We discovered that an incorrect version of the ETL process was accidentally deployed, which caused periodic failures in executing it.
## **Resolution**
Upgrading the ETL process to a newer version addressed this issue.
## **RCA**
Pipeline, stage, and step execution custom dashboards were not loading correctly due to an incorrect upgrade of the ETL process. The upgrade caused periodic execution failures and disrupted the views' data refresh. While no data loss was experienced, dashboards briefly rendered stale data.
## **Action Items**
* Implement a robust deployment process and runbooks to prevent unintended upgrades.
This incident has been resolved. Thanks for your patience.
We have identified the issue and are working on a fix.
We are currently investigating this issue.
Report: "Custom Dashboards are not loading in Prod2"
Last update:
## **Summary**
On January 30th, TimescaleDB was running low on storage. To mitigate the issue, we began cleaning up unused indexes. However, one of the removed indexes was actively used by a custom dashboard, which subsequently led to degraded dashboard performance.
## **Resolution**
In response, we promptly initiated an index rollback to restore dashboard performance and minimize customer impact.
## **RCA**
One of the databases supporting custom dashboards was nearing its storage limit. To address this, we began reclaiming space through reindexing and applying data retention policies. During this cleanup process, a specific index - believed to be unused - was dropped to free up space. However, this index was actively used by the custom dashboard, leading to degraded performance.
## **Action Items**
* Implement automation for `VACUUM` and `ANALYZE` operations to ensure accurate index usage statistics, and establish a robust review process to validate dependencies before dropping any indexes.
* Plan and execute a database migration to a higher storage capacity.
This incident has been resolved.
We are continuing to investigate this issue.
Custom dashboards are failing to load in Prod2. We are currently looking into the issue.
Report: "GCE VM Reboots in us-west1-a Zone"
Last update:
## **Summary**
Google experienced an incident on February 25th with Compute Engine in the us-west1-a zone, where some nodes, specifically E2 and N1 types, would reboot. The reboot caused the ungraceful restart of containers on the affected nodes.
## **Resolution**
Our monitoring systems alerted us to the issue. In response, we decided to be proactive and utilize nodeAffinity to remove core service workloads from the us-west1-a zone in the affected environments until Google resolved the issue and to mitigate potential customer impact.
## **RCA**
Google has yet to post an RCA for their incident, but a small blurb from the resolved incident page states, "From preliminary analysis, the issue was due to a latent bug that manifested under specific conditions, which resulted in unexpected VM reboots in the us-west1-a zone."
## **Action Items**
There was no known customer impact due to this incident because our workloads are multi-zonal, and our actions were entirely proactive to prevent possible impact.
Google has marked their incident as resolved and stated that VMs utilizing the us-west1-a zone are fully operational again.
GCP is experiencing an issue with VMs in the us-west1-a zone. At this time, we've migrated our critical workloads out of this zone to negate any customer impact, and we are fully operational. We will continue to monitor the GCP incident in the event the scope changes.
We are experiencing an issue with Google Compute Engine beginning Monday, 2025-02-25 01:41 UTC. This is causing some services to intermittently restart, resulting in some workloads terminating unexpectedly. Our engineering team is working with GCP to investigate the issue and will post updates as we receive them from Google.
Report: "PROD1: Unified Custom Dashboards are not loading properly"
Last update:
## **Summary**
On April 8, 2025, for 25 minutes, customers in the prod-1 production environment observed that the following custom dashboards were not loading properly: pipeline, stage, and step executions. We discovered that necessary model changes were missed during the version upgrade of our ETL process.
## **Resolution**
Upgrading the ETL process to a newer version addressed this issue.
## **RCA**
Pipeline, stage, and step execution custom dashboards were not loading correctly due to an incorrect upgrade of the ETL process. The incorrect upgrade resulted in our views not having the necessary data to render the dashboards. While no data loss was experienced, dashboards were not rendering correctly for a brief period.
## **Action Items**
* **Improve Pre-Deployment Checks for ETL service upgrade**: Enhance pre-deployment checks to validate that critical model updates are part of the upgrade process.
We have resolved the issue. Dashboards are up and running.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
We are currently investigating this issue. This is impacting steps, stages and pipeline execution dashboards.
Report: "Custom Dashboards [Unified View Explores] are experiencing delays in updating in Prod2"
Last update:
## **Summary**
For 26 hours, customers on Prod-2 observed stale data on the following custom dashboards: pipeline executions, stage executions, and step executions. The metadata state tables managing the ETL process were corrupted during a plan application upgrade, requiring a rebuild of the customer-facing data marts for the dashboards. No data was lost during this process.
## **Resolution**
The metadata state was reset to trigger data mart updates.
## **RCA**
Plan application errors were due to metadata corruption. While no data loss was experienced, data staleness was observed because the data marts were not updated with the latest ETL intervals during the metadata recreation.
## **Action Items**
* The ETL framework will be updated more frequently. Harness will set a regular cadence for testing new updates and deploying them into production to reduce drift in metadata rollbacks.
* Metadata tables will be decoupled from raw data storage to better manage state effects. Decoupling state from raw ingestion will allow faster iteration loops if a database rollback is needed.
Custom Dashboards [Unified View Explores] are now updating normally. The issue is now resolved. We appreciate your patience.
We are still working hard to address the problem and aim to resolve it by 5 PM PST. We understand the inconvenience this may cause and appreciate your patience.
We are still working hard to address the problem and aim to resolve it by 3 PM PST. We understand the inconvenience this may cause and appreciate your patience.
We are still working hard to address the problem and aim to resolve it by 12 PM PST. We understand the inconvenience this may cause and appreciate your patience.
We are still working hard to address the problem and aim to resolve it by 10 AM PST. We understand the inconvenience this may cause and appreciate your patience.
We are experiencing an issue where Custom Dashboards [Unified View Explores] are not updating as expected. We have identified the problem and aim to resolve it by 8 AM PST. We understand the inconvenience this may cause and appreciate your patience.
Report: "Prod-3 was intermittently unavailable"
Last update:
## **Summary**
On 17th April, between 10:42 AM UTC and 11:12 AM UTC, customers experienced intermittent errors when trying to access [app3.harness.io](http://app3.harness.io) on our Prod-3 cluster. The issue was caused by a configuration change on a failover cluster in the backend ingress-controller service setup associated with [app3.harness.io](http://app3.harness.io).
## **Resolution**
Our monitoring system alerted us to the issue; we identified and reverted the change, which restored all functionality in the Prod-3 cluster.
## **RCA**
As part of preparation work for a planned Disaster Recovery (DR) activity, we introduced a new configuration in the Prod-3 cluster. This change unintentionally made the Prod-3 DR environment eligible to receive live customer traffic. Since this environment was not fully operational, some of the requests were returned with 503 errors.
## **Action Items**
* Enhanced monitoring on traffic going to inactive environments.
* Additional safeguards in the deployment process to avoid unintentional traffic routing changes.
This incident has been resolved.
We noticed intermittent failures in our Prod-3 clusters where app3.harness.io was resulting in 5xx errors. This issue has been identified and is now resolved. Please monitor this incident for the postmortem report. Thanks for your patience.
Report: "Harness overview dashboard is not loading on Prod-Eu1"
Last update:
## **Summary**
Following the core release on February 5th, the Overview Dashboard in the EU cluster experienced degraded functionality. Investigation revealed that a version mismatch between the newly released core service and the existing dashboard service caused compatibility issues, leading to the degradation.
## **Resolution**
We deployed the latest released version of the dashboard service to resolve the issue.
## **RCA**
The Dashboard service relies on the core services for proper functionality. The core services were updated without updating the Dashboard service, resulting in a failure of the Overview dashboard to operate as expected.
## **Action Item**
We have implemented a check to ensure the dashboard service is also updated when its dependencies are updated.
We have fixed the issue with the Overview/landing dashboard. This incident is resolved now. Please monitor this page for the postmortem report.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
We are currently investigating this issue.
Report: "PROD3: UI not loading"
Last update:
## **Summary**
On April 1st 2025, the Prod-3 cluster experienced performance degradation, resulting in access check API calls timing out intermittently. The impact was traced to degraded performance in the underlying MongoDB database, which is critical for access control validation.
## **Resolution**
Scaled up the database cluster to address the issue.
## **RCA**
The issue was caused by temporarily degraded performance in our database, which handles access validation for API calls. A memory optimization activity briefly reduced system capacity, and during this window, traffic increased unexpectedly, leading to a delay in the system scaling back to full performance. As a result, some access check operations experienced timeouts, impacting overall request performance.
## **Action Items**
* Utilize database cluster scale-up to address any memory fragmentation issues.
* Improve query and index optimization for better database efficiency.
* Delete stale data to reduce memory usage.
* Optimize retry mechanisms to avoid overwhelming the system during failures.
This incident has been resolved.
A fix has been implemented and we are monitoring.
Report: "PROD2: Login is failing"
Last update:
## **Summary**
On April 8th, in preparation for our scheduled deployment, we started an index build. This caused the database to become unresponsive, resulting in login failures for a few customers.
## **Resolution**
Our monitoring systems alerted us to the issue. In response, we initiated an index rollback to restore database responsiveness and mitigate customer impact.
## **RCA**
To support upcoming changes in the new deployment, we followed best practices and suggestions from MongoDB and began index creation ahead of time. However, high I/O activity on the target collection caused both index and data storage to consume significantly more space than anticipated. The increased storage and index size led to poor database performance. This was a result of how our managed MongoDB service provider handles storage management internally. As a result, the database became unresponsive, leading to login failures. We are currently awaiting a root cause analysis (RCA) from our managed MongoDB service provider to understand the underlying cause of the issue from their side.
## **Action Items**
* We have disabled index building on the specific database collection in question in the short term.
* We are actively working with MongoDB support to investigate and identify the root cause of the issue.
The incident has been resolved. A detailed Root Cause Analysis (RCA) will be shared.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "PROD1: Unable to login"
Last update:
## **Summary**
On April 6th, during our scheduled production deployment, multiple customers could not log in because the services failed to start due to issues with the index build.
## **Resolution**
Our monitoring system alerted us to the issue. Upon investigation, we identified an unexpectedly heavy load on the database resulting in service failures. In response, we initiated a system rollback, which resolved the issue.
## **RCA**
As part of our planned deployment in the production environment (prod1), indexes are created during service startup. However, the combination of high I/O activity on a specific collection and concurrent index creation led to resource contention in MongoDB due to locking that persisted longer than usual. As a result, a few critical services failed to start up, causing the login issue. We are currently awaiting a root cause analysis (RCA) from our managed MongoDB service provider to understand the underlying cause of the issue from their side.
## **Action Items**
* Index creation during service startup has been disabled as part of the deployment process.
* We are actively working with MongoDB support to investigate and identify the root cause of the issue.
This incident currently stands resolved. We will publish an RCA.
A fix has been implemented and we are monitoring the results.
The Harness service is currently unavailable. We are currently working to identify the root cause and restore the service as soon as possible.
Report: "PROD2: Delegates got disconnected from Harness"
Last update:
#### **Summary**
A subset of Delegates in the prod2 cluster got disconnected, causing pipeline failures for customers. It was due to an increased load on the backend database caused by an ad-hoc read query.
#### **What was the issue?**
Customer delegates were disconnected and pipelines were failing.
#### **Resolution**
We cancelled the runaway query and upscaled the database. Overall recovery took ~17 minutes, and the majority of Kubernetes delegates reconnected automatically. A few customers had to restart their non-Kubernetes delegates.
#### **RCA**
As part of regular operational work, we ran a read query in the database which spiked the CPU usage on the database. Unfortunately, this query was run against the primary replica, which increased query latency, resulting in some delegates getting marked disconnected.
#### **Action Items**
1. **Enhance access control:** We have Just-In-Time read access to our database for operational tasks. We are enhancing our system to only provide access to non-primary replicas for such operations.
2. **Enhanced resiliency:** We are planning to run chaos experiments simulating database latency to improve resiliency in our delegate management sub-system against such faults.
All delegate connectivity is resumed. Detailed RCA will follow soon.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
A few delegates got disconnected from Harness.
Report: "PROD1: STO pipeline execution failing"
Last update: [Postmortem is the same as CI/STO Pipeline Execution failing for Customers](https://status.harness.io/incidents/gcmjzvkrrmzy)
This incident has been resolved.
A fix was released to Prod 1 and 2, and initial testing shows that the issue has been resolved. The team will continue to monitor.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue in our Prod-1/2 cluster where users are not able to run STO pipelines.
Report: "Login failures in prod4"
Last update:
## Summary
On Monday, March 31, 2025, at 6:53 PM UTC, some customers experienced authentication issues, including getting logged out of Harness. This incident affected users with accounts hosted on our Prod-4 cluster. This issue was resolved by 7:06 PM UTC, resulting in approximately 13 minutes of downtime.
## Impact
* Duration: 13 minutes (6:53 PM - 7:06 PM UTC)
* Affected Users: Customers with accounts hosted on the Prod-4 cluster
* Symptoms: Authentication failures, unexpected logouts, and traffic drop
## Resolution
Our engineering team identified the issue and took immediate action:
1. Reverted the configuration change at 7:06 PM UTC
2. Rolled back the deployment to the previous stable version (1.16.0) at 7:08 PM UTC
3. Verified service restoration across all affected systems
## RCA
The incident was caused by a routing configuration error in our Global Gateway service. During a planned deployment, a change to our routing logic inadvertently prevented requests from being correctly directed to the Prod-4 cluster. As a result, authentication sessions for affected customers could not be appropriately maintained.
## Action Items
To prevent similar incidents in the future, we are implementing the following improvements:
1. Improved validation of routing configuration changes
2. Additional monitoring to detect routing anomalies earlier
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating login failures in prod4.
Report: "Stale Data Observed for Custom Dashboards in Prod1"
Last update:
### **Summary**
On March 25, 2025, for 2 hours and 22 minutes, customers in the prod-1 production environment observed stale data on the following custom dashboards: pipeline executions, stage executions, and step executions.
### **What was the issue?**
The metadata state tables managing the ETL process were corrupted during a version upgrade, requiring fixes to this table. No data was lost during this process.
### **Resolution**
The metadata state was reset to trigger data mart updates.
| **Time (UTC)** | **Event** |
| --- | --- |
| 26 Mar 2:04 AM | We identified the ETL process that timed out after the upgrade. |
| 26 Mar 3:18 PM | Redeployed the ETL process, applied the plan, and recreated the views. |
| 26 Mar 4:22 AM | The metadata schema was rebuilt, and all data quality checks were confirmed to be passing. |
| 26 Mar 4:25 AM | The incident was resolved. |
### **RCA**
Plan application errors were due to an upgrade of the ETL process timing out after running for two hours. This resulted in metadata corruption, requiring data fixing. While no data loss was experienced, data staleness was observed because the data marts were not updated with the latest ETL intervals during the metadata recreation.
### **Action Items**
* Update the ETL framework frequently to avoid significant version number jumps.
* Set up a regular cadence for testing new updates and deploying them into production.
This incident has been resolved. Thanks for your patience.
We are working towards testing a fix in our dev environment.
We are continuing to work on a fix for the issue.
We are continuing to work on a fix for the issue.
We are working on a fix. We have identified that only Unified Dashboards for pipeline, stage, and steps are currently impacted.
We are currently investigating this issue.
Report: "Custom dashboards are not loading - Prod1,2,3,4 and Eu1 due to Looker managed service outage"
Last update:
### **Summary**
Customers in all the production environments observed that custom dashboards were not loading correctly.
### **What was the issue?**
Harness custom dashboards rely on Looker Studio, a managed service from Google. During that period, Looker experienced an outage, which directly affected custom dashboards.
### **Resolution**
Once Google Looker was back to a stable state, custom dashboards started working correctly.
| **Time (UTC)** | **Event** |
| --- | --- |
| 26 Mar 12:17 AM | Google Looker started experiencing an outage on login and dashboard functionality, impacting Harness custom dashboards. |
| 26 Mar 1:29 AM | Google Looker managed service returned to a steady state. |
### **RCA**
Custom dashboards lost availability due to an outage with Looker, a managed service from Google. No data loss was experienced.
### **Action Items**
* We are awaiting a follow-up RCA from the Google Looker team.
The dashboards are now rendering correctly with the recovery of Looker Service.
The dashboards are now rendering correctly with the recovery of the Looker service. We will continue to monitor the situation.
We are observing a gradual recovery of the Looker service. Some dashboards are now rendering correctly. However, a partial outage remains in effect. We will continue to monitor the situation and provide updates as they become available.
Custom dashboards are not loading in Prod1, Prod2 and Prod3 because our managed service, Looker, is currently facing an outage. We will monitor this outage and provide an update once we have a status update from Looker.
Report: "Pipeline failures due to secret decryption in Prod2"
Last update:
#### Summary:
Pipelines experienced failures in resolving secrets in cases where more than one secret was used in a custom secret manager. This issue was isolated to secrets associated with custom secret managers.
#### Root Cause Analysis:
The pipeline failures happened because the system failed to resolve secrets correctly. A code change to improve the performance of secret decryption was deployed, which resulted in failures for secrets stored in custom secret managers. The code change was behind a feature flag. The feature flag was disabled, which restored normal pipeline operations.
#### Action Items:
1. **Add New Test Cases:** Add new test cases to the automation suite to cover different configuration combinations for custom secret managers.
2. **Add Metrics and Alerts:** Implement appropriate metrics and alerts to detect secret/expression resolution failures proactively and mitigate them.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "CIE Pipeline Execution Failures in Prod1"
Last update:
## **Summary:**
Pipelines failed because delegates and build pods got disconnected. This also impacted our hosted CI operations, and all hosted build pipelines failed.
## Root Cause Analysis:
For performance improvement and development agility, our engineering team had been making changes to disconnect legacy delegates (no longer used by customers but still running) from the platform. This change was behind a feature flag which was enabled and resulted in the incident. The code change had the unintended effect of not accepting connection requests from build containers. The feature flag was disabled, which restored normal pipeline operations.
## Action Items:
1. **Improve feature flag operations:** Our engineering team operated in a silo while enabling this feature, which resulted in an incorrect implementation and an ineffective rollout of this functionality. We are improving our internal process to templatize and manage feature flag rollouts through an external operations team.
2. **Improve Automation:** Add steps in our QA process to catch the dependency between delegates and build pod connection requests so any change in this area is validated internally.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Prod3: Unable to Create Environments,Service Overrides and Infrastructure"
Last update:
#### **Summary**
Users in the Prod3 cluster were unable to create entities such as Environments, Service Overrides, and Infrastructure.
#### Resolution
A fix was implemented to address the system issue. As a result, the system stabilized, enabling users to successfully create Environments and other affected entities.
#### Timeline
| **Time (UTC)** | **Event** |
| --- | --- |
| March 4, 2025, 8:35 AM UTC | Issue reported: unable to create Environments, Service Overrides, and Infrastructure |
| March 4, 2025, 11:34 AM UTC | A fix has been implemented and we are monitoring the results |
| March 4, 2025, 11:59 AM UTC | Incident resolved |
#### **Root Cause Analysis (RCA)**
Harness previously had an issue with login in the Prod3 cluster. To address that, Harness had reverted the system release to the previous version. This rollback led to data inconsistencies, resulting in failures for certain entities, such as Environment and Service overrides. The inconsistencies were later resolved in a subsequent update, after which functionality was confirmed to be working as expected.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Login issues on our Prod3 environment"
Last update:
#### **Summary**
The Prod3 cluster experienced downtime, preventing users from accessing the Harness UI. Only access to Prod3 was affected; pipeline executions were not impacted.
#### Resolution
To mitigate the issue, Harness services were auto-scaled. Additionally, rate limiting and timeouts were implemented for specific API endpoints to regulate the load. These measures effectively reduced system strain, allowing the platform to recover and resume normal operations.
#### Timeline
| **Time (UTC)** | **Event** |
| --- | --- |
| March 4, 2025, 7:25 AM UTC | Investigating login issue in the prod3 environment. The Prod3 cluster was under pressure and rejecting requests |
| March 4, 2025, 7:30 AM UTC | Reverted system release |
| March 4, 2025, 7:38 AM UTC | Changed status to monitoring. System is operating normally |
#### **Root Cause Analysis (RCA)**
One of the core micro-services in the Harness platform was receiving a high volume of external traffic. The API endpoint under load was executing a long-running analytical query, which became slow during this period. This slowdown triggered a cascading effect across the infrastructure, leading to the unavailability of underlying services. As the load increased, new requests began to fail. Since the Harness UI depends on responses from backend APIs, the pages failed to load.
#### **Action Items**
1. Move analytical services to a separate endpoint to prevent such issues from impacting critical workflows.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating a login issue in the prod3 environment.
Report: "Prod2: CI/STO Pipeline Execution failing for Customers"
Last update:
# **Summary:**
Customers encountered an issue with pipeline execution. The executions failed with the exception "Error Creating Plan: Could not create plan for node". This impacted CI and STO stage execution.
# **Timeline:**
| **Time (UTC)** | **Event** |
| --- | --- |
| March 6, 2025, 4:02 AM UTC | The team reviewed the series of events for a previous incident and, since the load on pipeline runs was lower, decided to roll back the release to 1.66.1. |
| March 6, 2025, 9:03 AM UTC | Customers reported that they are intermittently unable to run CI pipelines with a plan creation error |
| March 6, 2025, 9:17 AM UTC | Status page was updated |
| March 6, 2025, 9:25 AM UTC | Identified the gap in the licensing API that led to cache corruption. |
| March 6, 2025, 10:00 AM UTC | CI manager deployment to version 1.67.3 (Prod 2) was done and the errors stopped. |
| March 6, 2025, 10:31 AM UTC | Got customer confirmation that CI is now operational. |
| March 6, 2025, 11:45 AM UTC | STO errors were still occurring due to the rollback |
| March 6, 2025, 2:14 PM UTC | STO was rolled forward to version 1.54 |
# Resolution:
The STO service was rolled forward to version 1.54 to resolve the issue.
# RCA:
When a pipeline execution is triggered, we check the license details for the module and verify that a valid license exists. As part of this check, we ran into an issue with an unknown license type, which triggered an exception causing the pipeline execution failures. The license details API had a gap in the license details fetch call which, when encountered, corrupted the cache for consecutive executions with non-onboarded license types.
# Action Items:
* Improvement in alerting for plan creation errors
* Improve automation tests to cover advanced filtering scenarios for the licensing API
This incident has been resolved.
We have applied the fix (internal tests passed); services are restored.
We are working on the fix.
We are also seeing intermittent new pipeline creation failure. We are currently investigating.
Pipeline Execution failing for Customers in Prod2.
Report: "CI stages are getting queued in Prod2"
Last update:
## **Summary:**
Customers have reported experiencing longer queue times for their Continuous Integration (CI) stages when using Harness Cloud infrastructure. Although the queue limits were not reached, builds remained queued, leading to extended waiting periods as they awaited progression.
## **Timeline:**
| **Time (UTC)** | **Event** |
| --- | --- |
| March 4, 2025, 10:16 PM | Customer reported queued builds |
| March 4, 2025, 11:06 PM | Increased the limits for customers to unblock |
| March 5, 2025, 2:00 AM | Reverted the application which was suspected to have caused the issue |
| March 5, 2025, 7:30 AM | We continued to investigate the issue as we were still seeing some missed cleanups, and we also performed cleanup of stale metadata captured to prevent further queuing |
| March 5, 2025, 4:50 PM | We saw a spike in resources consumed by our apps as the peak load approached, which was mitigated by increasing the resources and stabilized the app. |
| March 5, 2025, 7:40 PM | The issue was narrowed down to the Jackson library upgrade and we started the rollback test on a lower environment. |
| March 5, 2025, 9:54 PM | We rolled back to the previous version of the application and continued to monitor. During this time we noticed increased resource consumption on our Mongo instance, which further caused the stability issue and stuck CI stages. |
| March 5, 2025, 11:18 PM | We decided to roll forward the release and undo the revert, after which the system stabilized. |
| March 6, 2025, 3:26 PM | We worked on the forward fix post stabilization and released it to production. |
## **Resolution:**
We immediately increased the queue sizes for impacted customers to enable their build stages to progress. Subsequently we fixed the library issue and rolled out a newer release. We are improving our alerting and automation to proactively determine any potential issue with resource cleanup at scale.
## **RCA:**
A recent **Jackson library** upgrade slowed down the CI manager's cleanup thread, causing back pressure on the system during peak periods. With the Jackson library upgrade from 2.15.2 to 2.17.2, the `ObjectMapper` implementation changed to use a `ReentrantLock` object. During persistence, Spring recursively reads instance objects and serializes them via reflection. However, Java restricts access to `ReentrantLock` fields via reflection, causing serialization exceptions. A simplified sketch of this reflection restriction is shown after the updates below. As a side effect of the Jackson library upgrade, the load on one of our services increased significantly, causing restarts of the pods, which led to stuck executions of a few CI stages. The above led to pipelines entering a queued state and, due to resource constraints, some pipelines failing to execute.
## **Action Items:**
* Improve the monitoring and alerting for resource cleanup
* Implement a cross-team process for validating library upgrades
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
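To illustrate the reflection restriction the RCA above points to, here is a small standalone sketch. It is an assumption-based illustration rather than Harness or Spring code, and it assumes JDK 17+ with default strong encapsulation: it shows the JVM rejecting deep reflective access to `ReentrantLock` fields, the object the Jackson 2.17 `ObjectMapper` holds internally according to the RCA.

```java
import java.lang.reflect.Field;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: shows the JDK refusing deep reflective access to ReentrantLock
// internals, the restriction the RCA attributes the serialization exceptions to.
// Assumes JDK 17+ with default strong encapsulation (no --add-opens flags).
public class ReentrantLockReflectionSketch {
    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock(); // held inside the upgraded ObjectMapper per the RCA
        for (Field field : ReentrantLock.class.getDeclaredFields()) {
            try {
                field.setAccessible(true);        // reflection-based persistence does this per field
                System.out.println(field.getName() + " = " + field.get(lock));
            } catch (RuntimeException e) {        // InaccessibleObjectException on modern JDKs
                System.out.println("Cannot read field '" + field.getName() + "': " + e);
            }
        }
    }
}
```

Any persistence layer that walks object graphs reflectively would hit the same wall when it reaches the lock held by the upgraded mapper, which matches the serialization exceptions described above.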
Report: "Some CI pipelines are experiencing stage aborts in Prod2"
Last update: [Same postmortem as CI stages are getting queued in Prod2](https://status.harness.io/incidents/wh0xbx7h2x6l)
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Intermittent UI Failures on Prod4"
Last update: This incident is a recurrence of a previously reported issue on March 2nd. The root cause, contributing factors, and corrective actions have already been documented in the earlier RCA. For detailed analysis and remediation steps, please refer to the RCA here: [https://status.harness.io/incidents/w7m7fgcmdhh0](https://status.harness.io/incidents/w7m7fgcmdhh0)
This incident has been resolved.
We terminated the degraded pod and are actively monitoring the situation to ensure stability.
The issue has been identified and a fix is being implemented.
Report: "Intermittent connections errors on Prod4"
Last update:
### **Summary**
On Friday 7 Mar 2025, the Prod4 cluster experienced a disruption when the Global Gateway service stopped serving incoming requests. The incident was caused by a configuration mismatch during a planned version upgrade. The system was fully recovered after approximately 12 minutes of downtime, out of which 7 minutes were full downtime and 5 minutes were partial service disruption.
### Resolution
The team quickly identified the configuration mismatch and reverted to the previous configuration settings. After bouncing the Global Gateway pods, the system recovered, and normal service was restored.
### RCA
During a planned upgrade from version 1.16.0 to version 1.17.2 of the Global Gateway service, a procedural error caused the new configuration intended for version 1.17.2 to be deployed while the older version 1.16.0 was still running in production. The older version was incompatible with the new configuration parameters, causing the service to stop responding to requests.
### Action Items
1. **Enhanced Deployment Oversight and Controls**: Implement additional validation checks in the deployment pipeline to verify version compatibility with configuration changes.
2. **Improved Architecture Resilience**: Accelerate our planned architecture improvements to make the system more resilient to configuration changes and prevent similar failures in the future.
Our team is committed to implementing these improvements to prevent similar incidents in the future.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Global Gateway intermittently encountering connection errors in Prod 4
Report: "CI/CD Pipeline failure on Prod4"
Last update:
## **Summary:**
A deployment pipeline execution on Prod4 resulted in the removal of a few workload identities from the DR cluster that are shared across the Primary and DR clusters. This caused the pods in both the primary and DR clusters that depend on workload identity to fail, affecting service availability.
## **What was the issue?**
Customers faced issues with their CI/CD pipelines, which started failing in the Prod4 environment.
## **Timeline:**
| **Time (UTC)** | **Event** |
| --- | --- |
| January 24, 12:40 AM UTC | During the Prod4 deployment, some services failed to come up healthy, and a FireHydrant incident was triggered. |
| January 24, 12:57 AM UTC | The issue was identified as missing workload identity bindings in the DR cluster. The team decided to redeploy to restore the configuration. |
| January 24, 1:30 AM UTC | The redeployment fixed the issue by syncing the Terraform state, which resolved the mismatch in the DR cluster configuration |
## **Resolution:**
Re-deployment resolved the issue by ensuring that the Terraform state was aligned with the intended configuration, which restored the missing workload identity bindings in the DR cluster.
## **RCA**
During a recent Disaster Recovery (DR) exercise, manual changes were made to the DR cluster but were not properly captured in the Terraform configuration. When the deployment pipeline executed, Terraform applied its last known state, inadvertently removing the workload identity bindings in the DR cluster. This led to pod failures in both the Primary and DR clusters, causing the CI/CD pipeline to fail in Prod4.
**Action Items**
1. **Ensure no manual changes** are made to the system. In case of unforeseen manual changes, document them and incorporate them into Terraform.
2. **Automate Drift Detection:** Implement automated drift detection to identify discrepancies between the live infrastructure and the Terraform state.
3. **Pre-Deployment Validations:** Introduce additional pre-deployment checks to verify workload identity bindings before applying changes.
This incident has been resolved.
The Harness service is experiencing performance issues. We are working to identify the cause and restore normal operations as soon as possible.
Report: "Some Feature Flag customers are experiencing intermittent issues with evaluating target groups on Prod2"
Last update:
## What was the issue?
After a recent change to the Feature Flag authentication gateway, some evaluations failed for TargetGroups with rules that use a custom attribute. Once identified, the team reverted the configuration change, and evaluations returned to normal.
| **Time (UTC)** | **Event** |
| --- | --- |
| 09:02 | Feature Flag authentication gateway configuration change applied |
| 16:49 | First report of issues relating to evaluations of custom rules on TargetGroups |
| 18:34 | Feature Flag authentication gateway configuration reverted back |
| 18:39 | Evaluations of targetGroups return to normal |
## RCA
As part of improvements to our disaster recovery strategy, a change was made to make the Feature Flag authentication gateway more robust. Initial testing performed failed to account for the scenario of client-side SDKs with target groups using rules that use a custom attribute (rather than a core attribute like identifier). Client SDKs generally make two types of request:
1. An auth request
2. An evaluation request to get flag values
During the auth flow, the provided target and its attributes are stored in a DB and Redis, e.g.
```
{
  "identifier": "123-456-789",
  "name": "bob",
  "custom_attribute_1": "value1"
}
```
During the evaluation flow, the target is retrieved from the cache if still present, and if not, it will be retrieved from the DB and stored in the cache. After the Feature Flag authentication gateway change, targets were being written to a different Redis. They would still be persisted to the DB, but if the Redis instance used during evaluations contained an older version of the target that did not have the attributes, then the code would never go to the DB, i.e. the Redis may contain
```
{
  "identifier": "123-456-789",
  "name": "bob"
}
```
In this case, if the custom rule used identifier or name, it would work as expected, but if it used the custom attribute, then that would be missing during the evaluation. A simplified sketch of this cache-first lookup is shown after the updates below.
## Action Items
1. Update our test suite to include additional user authentication flows, to account for the impacted use case
2. Update the Feature Flag authentication gateway to use distinct Redis instances for each environment
3. Review functionality that supports additional reading of target attributes, to provide a failsafe to ensure the correct evaluation is returned
The issue has now been resolved, and we will share the RCA shortly.
We are continuing to monitor for any further issues.
The issue has been identified, and a fix has been put in place. The team is continuing to monitor the issue.
We are currently getting reports of some customers experiencing intermittent issues with Feature Flags when evaluating target groups in the prod2 environment. The team is actively diagnosing the issue and will keep you updated.
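The following minimal sketch (in-memory maps standing in for Redis and the DB; all class and method names are hypothetical, not Harness code) illustrates the cache-first lookup described in the RCA above and why an older cached target without the custom attribute is never refreshed from the DB during evaluation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the evaluation lookup: cache first, DB only on a cache miss.
public class TargetLookupSketch {
    // Stand-ins for Redis (cache) and the database (source of truth).
    private static final Map<String, Map<String, String>> REDIS = new HashMap<>();
    private static final Map<String, Map<String, String>> DB = new HashMap<>();

    static Map<String, String> targetForEvaluation(String identifier) {
        Map<String, String> cached = REDIS.get(identifier);
        if (cached != null) {
            return cached;                 // a stale entry short-circuits the DB read
        }
        Map<String, String> fromDb = DB.get(identifier);
        REDIS.put(identifier, fromDb);     // populate the cache for later evaluations
        return fromDb;
    }

    public static void main(String[] args) {
        // The auth flow persisted the full target to the DB...
        DB.put("123-456-789", Map.of(
                "identifier", "123-456-789",
                "name", "bob",
                "custom_attribute_1", "value1"));
        // ...but the Redis used during evaluations still holds an older copy of the target.
        REDIS.put("123-456-789", Map.of(
                "identifier", "123-456-789",
                "name", "bob"));

        Map<String, String> target = targetForEvaluation("123-456-789");
        System.out.println("identifier rule sees: " + target.get("identifier"));               // works
        System.out.println("custom attribute rule sees: " + target.get("custom_attribute_1")); // null
    }
}
```

In the sketch, rules keyed on `identifier` or `name` still evaluate correctly, while a rule keyed on `custom_attribute_1` sees a missing value, matching the behaviour described in the postmortem.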
Report: "Seeing intermittent login issues on our Prod4 environment"
Last update:
## **Summary:**
Users experienced login failures on the Prod4 cluster due to backend connection limits being exceeded. The issue was triggered by a surge of WebSocket connections following a customer account migration, which triggered the circuit breaker limit on the Harness Global Gateway for the Prod4 cluster.
## **Timeline:**
| **Time (UTC)** | **Event** |
| --- | --- |
| March 2nd, 2:14 PM | We received an alert for Prod4 login failures |
| March 2nd, 3:00 PM | Scaled Global Gateway pods from 2 to 4 and functionality restored |
| March 2nd, 3:06 PM | Increased RPS for the Prod4 ILB |
| March 2nd, 3:10 PM | Confirmed that login is restored |
## **Resolution:**
* Increased backend capacity by scaling the Global Gateway service to distribute the load more effectively.
* Set up necessary alerts to monitor system stability and confirm cluster connectivity.
## **RCA:**
Following a customer account migration, there was a significant increase in WebSocket connections from delegate agents, exceeding the connection limits set for backend hosts. The backend system reached its maximum capacity, preventing new connections from being established. Additionally, one of the backend pods restarted unexpectedly, leaving only a single pod to handle all incoming traffic. This led to the circuit breaker being activated, causing login failures.
## **Action Items:**
* Implement a dedicated traffic splitting configuration to handle WebSocket connections separately from other API requests to prevent similar incidents in the future.
* Improve monitoring and alerting to detect when connection limits are approaching critical thresholds.
* Conduct scalability testing to ensure the system can handle large numbers of WebSocket connections without reaching critical limits.
After continued monitoring and further investigation, the issue has been considered resolved.
A fix has been implemented and we are monitoring the results.
We have identified the issue, and a migration has been applied. The team is continuing to investigate the source of the issue.
An issue has been identified with our global gateway, affecting routing to our Prod4 environment. The team is continuing to investigate the issue.
Seeing intermittent login issues on our Prod4 environment
Report: "Users in Prod-2 cluster facing unexpected pipeline failures"
Last update:
#### What was the issue?
Customers experienced pipeline failures due to intermittent errors when submitting delegate tasks. The issue was identified by the error message:
> UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR, debug data: max_age
**Timeline**
| **Time (UTC)** | **Event** |
| --- | --- |
| 06:00 AM | First occurrence of the issue. |
| 06:06 AM | Alert from our monitoring system received; team started investigating. |
| 06:42 AM | Service instances scaled up to restore service. |
| 07:00 AM | Functionality enabled (which was already being rolled out behind a Feature Flag) to prevent the recurrence of the issue. |
**RCA**
Pipeline execution functionality was degraded due to exhaustion of thread pool resources (responsible for secret resolution from the custom secret manager). The trigger was a pipeline run with a large number of secrets, which overwhelmed the thread pool responsible for resolving secrets. This reduced the capacity of the system, resulting in a build-up of delegate tasks awaiting submission. Eventually, those requests timed out, leading to pipeline failures. Once the issue was identified, we immediately scaled up our service infrastructure to handle the increased load. Subsequently, a feature flag to optimize the **secrets resolution flow** was enabled. (This feature flag was already in the process of being enabled across all Harness environments over the next few days.) A simplified sketch of this thread-pool exhaustion pattern is shown after the updates below.
**Action Items**
1. Roll out the feature in all environments. (done)
2. Enforce a limit on the number of simultaneous secret resolutions in a pipeline execution.
This incident has been resolved. Please monitor this incident for the postmortem report. Thanks for your patience.
The issue is mitigated now; we are actively monitoring it.
We are currently investigating an issue in our Prod-2 cluster with CD pipelines.
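As a rough illustration of the thread-pool exhaustion described in the postmortem above, the sketch below floods a small secret-resolution pool so that a later task waits in the queue until its caller times out. The pool size, task count, and timeouts are invented for the example and are not Harness's real configuration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: a small fixed pool for secret resolution is flooded by one
// pipeline run with many secrets, so a later task never runs before its caller gives up.
public class SecretPoolSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService secretResolvers = Executors.newFixedThreadPool(4);

        // One pipeline run submits a large number of slow secret-resolution tasks.
        for (int i = 0; i < 500; i++) {
            secretResolvers.submit(() -> {
                try { Thread.sleep(2_000); } catch (InterruptedException ignored) { }
            });
        }

        // A subsequent delegate-task submission queues behind all of them.
        Future<String> delegateTask = secretResolvers.submit(() -> "resolved");
        try {
            System.out.println(delegateTask.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("secret resolution timed out; the pipeline step fails");
        }
        secretResolvers.shutdownNow();
    }
}
```

Scaling up the service (more pool capacity) or limiting simultaneous secret resolutions per execution, as the action items describe, both reduce the chance of this queue build-up.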
Report: "Harness performance degraded in Prod3."
Last update:
#### **Summary**
The Prod3 environment observed slowness and degraded performance across the board, along with intermittent MongoDB errors while accessing the application.
#### **What was the issue?**
On **Tuesday, February 4th at 9:30 AM UTC**, we observed a sudden spike in MongoDB utilization, reaching **90%** CPU usage, which resulted in **degraded cluster performance**. This surge led to blocked DB connections due to the load, causing multiple queries to starve for connections and impacting user experience across the platform.
#### **Resolution**
* The team investigated the issue and identified the degradation of MongoDB as the root cause.
* Increased the resources for MongoDB, which helped process the load and allowed the system to return to regular operation.
#### **Timeline**
| **Time (UTC)** | **Event** |
| --- | --- |
| 4th Feb 9:30 AM | CPU usage on MongoDB went up and persisted |
| 4th Feb 10:30 AM | Mongo connections spiked, application retries increased |
| 4th Feb 10:30 AM | Cache dirty fill ratio on Mongo reached and persisted at 20%. Writes to MongoDB started to fail |
| 4th Feb 11:40 AM | Upgraded MongoDB instance tier to M40 to handle the load |
| 4th Feb 12:00 PM | System stabilized |
#### **RCA**
We observed that MongoDB CPU utilization spiked. We also observed that MongoDB memory utilization exceeded 90%, leading to an increase in system error rates. During this period, the cache dirty fill ratio started rising and surpassed 20%, remaining elevated. At this point, Mongo application threads became involved in eviction processes instead of executing usual database operations such as CRUD actions, replication, and other core functions. This shift in thread activity caused operations to stall, leading to excessive memory consumption across the nodes. As system memory utilization increased, overall database performance degraded, further compounding the issue. To mitigate the impact, we performed a cluster tier upscale on 02/04/25 at 03:40 AM PST, which successfully alleviated the memory pressure on the affected nodes. Following the upgrade, we observed that system performance returned to acceptable levels.
#### **Action Items**
Since the incident, the team has been actively working on several action items to prevent similar occurrences, including:
* **Query Optimization:** Identifying and optimizing slow-running queries to reduce load on the database.
* **Scaling Strategy:** Evaluating a proactive cluster tier auto-scaling approach to handle traffic spikes efficiently.
* **Monitoring & Alerts:** Enhancing monitoring to detect query bottlenecks earlier and prevent performance degradation.
This incident is resolved now. Please monitor this page for the postmortem report.
The issue has been mitigated and we are currently monitoring the system.
We are actively investigating the service degradation issue in the prod3 environment.
Report: "CI Plugin Image Retrieval Failure from the Docker Hub"
Last update#### **Summary** Certain CI pipelines utilizing Harness CI steps, such as PluginStep and `Setup Build Intelligence`, encountered the error `failed to get image entrypoint`. CI Build Intelligence is enabled by default to enhance build caching. This improvement introduces a background step in each CI stage that operates a cache proxy server, which fetches images from Docker Hub. #### **What was the issue?** A recent outage at Docker Hub, as reported on the Docker Systems [Status Page](https://www.dockerstatus.com/pages/incident/533c6539221ae15e3f000031/67a479b283fb1305d10af103), caused the "Setup Build Intelligence" and CI PluginStep in the CI stage to fail because the image entrypoint could not be retrieved. According to Docker Hub, the outage was limited to unauthenticated \(anonymous\) clients. #### **Timeline** | **Timestamp** | **Event** | **Action** | | --- | --- | --- | | 6th Feb 8:48 AM UTC | A customer reported that the Build Intelligence step was failing | Initiated a SWAT call to address the issue. | | 6th Feb 9:00 AM UTC | Issue identified as stemming from Docker Hub downtime | Docker Systems Status Page | | 6th Feb 9:20 AM UTC | Docker Hub issue resolved and pipeline failures stopped | | #### **Action Items** 1. To mitigate such issues, Harness recommends that customers configure the built-in Harness Image Docker connector to use credentials instead of anonymous access and to pull images from GCR or ECR rather than Docker Hub. For detailed instructions, please refer to [Configure Harness to always use credentials to pull Harness images](https://developer.harness.io/docs/platform/connectors/artifact-repositories/connect-to-harness-container-image-registry-using-docker-connector/#configure-harness-to-always-use-credentials-to-pull-harness-images). 2. We are actively working to eliminate dependencies on external systems to enhance our reliability even further.
This incident has been resolved.
The Docker Hub incident has been resolved, and we are continuing to monitor on our side.
We are seeing an increase in failed pipelines that use Build Intelligence, as it pulls images from Docker Hub.
Docker Hub is experiencing an incident and degraded performance: https://www.dockerstatus.com/
Report: "Feature Flags metrics service down on Prod1"
Last update#### **Summary** Requests to [https://events.ff.harness.io](https://events.ff.harness.io) were failing for all customers in Prod1 accounts, preventing metrics data from being updated. #### **What was the issue?** Testing being performed on the cluster resulted in unexpected behaviour on one of the load-balancing backend instances used to route traffic for the Feature Flag metrics service. #### **Resolution** The load balancer backend was re-synced with the running applications, and traffic resumed. #### **Timeline** | Time \(UTC\) | Event | | --- | --- | | 13 Nov 14:55 | Testing service was brought down, causing a cascading teardown of the backend instance | | 13 Nov 14:56 | On-call engineer alerted to the traffic errors on Prod1 | | 13 Nov 15:02 | Issue identified, and team started to determine the cause | | 13 Nov 15:10 | Issue was fixed, and system sync started | | 13 Nov 15:14 | Traffic resumed | #### **RCA** On Nov 13, 2024, between 14:55 and 15:14 UTC, traffic going into the Feature Flag metrics service on Prod1 received 500 errors. #### **Action Items** * Review policy checks on services to ensure no load balancer backends have more than one label associated.
Between 14:55 UTC and 15:14 UTC the Feature Flag service on Prod1 experienced an outage of the metrics service. Customers sending metrics data during this window will have received a 500 error. We have identified the issue and a fix has been applied. An RCA will follow shortly.
Report: "Queue-Service is impacted for Prod3 customers"
Last update#### Summary We encountered issues with the Queue Service, where bidirectional webhooks were marked as queued, and Git changes were not reflected on Harness. #### Timeline | **TIMELINE \(UTC\)** | **Event** | | --- | --- | | Nov 18, 2024 - 12:48 PM | The customer reported an issue with Bidirectional GitX webhooks in the queued status. | | Nov 18, 2024 - 12:49 PM | The team analysed the monitoring and service logs and observed issues with Redis connectivity after the deployment. | | Nov 18, 2024 - 01:05 PM | DBRE was involved and credentials were rotated. | #### Immediate Resolution We updated the production Redis configuration and performed a configuration deployment. #### RCA The issue with queued webhooks occurred due to Redis errors affecting the Queue Service, which caused bidirectional GitX webhooks to be queued and not processed. The error during redeployment occurred when an incorrect configuration was pushed during a Redis credential rotation, temporarily disrupting the Queue Service. Connectivity remained intact until the alert was received, which prompted an update to the credentials. #### Action Items We have implemented monitoring for the customer webhook queued status to prevent future issues.
The issue has been resolved. We apologize for the inconvenience and will share the root cause analysis (RCA) shortly.
Git-backed entities will have stale data. This will impact pipeline executions. As a workaround, we recommend disabling the webhook until further notice.
Report: "Pipelines custom webhook executions are delayed"
Last update#### **Summary** Custom webhook triggers observed delayed execution due to a surge in incoming trigger executions that created a backlog for processing these types of trigger executions. Only executions via custom webhook triggers were affected; executions via the API and UI were not impacted. #### **What was the issue?** Harness received a surge of custom webhook events for processing triggers. These triggers were executing Git-backed pipelines that were taking longer than usual to resolve, which caused back pressure on trigger processing and led to delays in pipeline executions. This happened because a limited number of resources are available for processing custom webhook triggers. #### **Resolution** We increased the resources on our systems to manage the surge, which helped bring the system back to normal. #### **Timeline** | **Time \(UTC\)** | **Event** | | --- | --- | | Dec 20th 05:15pm | Identified the system was observing some delays in processing triggers. | | Dec 20th 05:50pm | Identified the issue causing the delays. | | Dec 20th 06:10pm | Increased the available resources for processing triggers. | | Dec 20th 07:25pm | Incident was identified as resolved. | #### **RCA** The allocated resources were unable to process the large number of custom webhooks, leading to delays in processing them and thereby causing delayed pipeline executions. As a result, we had to allocate additional resources. #### **Action Items** 1. We have increased the number of threads assigned to process the custom webhooks. 2. We will be working on enhancing the business logic to decouple pipeline resolution from the custom webhook trigger processing flow.
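The resource constraint described above is the classic bounded-worker-pool problem. The sketch below is a hedged, generic illustration (hypothetical names, not the actual Harness trigger service) of a bounded pool that surfaces back pressure explicitly once both the workers and the queue are full, instead of letting events pile up silently.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Minimal sketch: a bounded pool for custom webhook trigger processing.
// When workers and the backlog queue are both full, new submissions are
// rejected rather than accumulating and delaying pipeline executions.
public class WebhookTriggerProcessor {
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            8, 32,                                  // core and max worker threads
            60, TimeUnit.SECONDS,                   // idle worker keep-alive
            new LinkedBlockingQueue<>(1000),        // bounded backlog of webhook events
            new ThreadPoolExecutor.AbortPolicy());  // reject when saturated

    public void submit(Runnable webhookEvent) {
        pool.execute(webhookEvent); // throws RejectedExecutionException when saturated
    }
}
```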
This incident has been resolved. We regret the inconvenience and will be providing an RCA for review.
We have mitigated the issue and are continuing to monitor the iterator queue. The iterator queue will gradually clear, and the webhook queue will clear as well.
We are continuing to make progress and have partially mitigated the issue.
We are continuing to work on a fix for this issue. Thank you for your patience!
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
We are actively investigating the issue. Thank you for your patience!
We are currently investigating the issue
Report: "Some customers on Prod1 may be experiencing degraded performance"
Last update## Summary On **October 16th, 2024**, our **Prod1** environment experienced a significant increase in service response time and multiple 5xx errors. This led to degraded performance and outages for several services, including the NG-Manager pods, which went into an unhealthy state and restarted multiple times. ## What caused the issue The issue was caused by an overload on one of the backend service **databases** due to a large number of **background tasks** being re-assigned at once. This surge in tasks was triggered by **delegate disconnections**, which were caused by a spike in CPU usage on the **Ingress pod**. The overload on the database led to: * Increased memory usage * Slow database queries * Service pods restarting due to unhealthy states ## Resolution The following steps were taken to mitigate the issue: 1. Increased the size of the **MongoDB** instance. 2. Stopped **~1200 background tasks** that were running, which helped reduce the load on the database. These actions led to system recovery, and the NG-Manager pods returned to a healthy state. ## Follow-up Actions To prevent similar issues in the future, we are implementing the following changes: * **Improved Background Task Handling**: Modify task reset jobs to depend on task heartbeat rather than delegate disconnection status. * **MongoDB Autoscaling**: Enable autoscaling for MongoDB to handle CPU and memory spikes. * **Rate-limiting of Instance Sync Requests**: Implement throttling to ensure the database is not overwhelmed during peak activity. * **Enhanced Monitoring and Alerts**: Add alerts for MongoDB resource usage and instance sync updates to catch potential issues earlier.
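The first follow-up item, resetting tasks based on their own heartbeat rather than delegate disconnection, might look roughly like the sketch below. The `Task` abstraction and `reassign` call are hypothetical stand-ins for illustration, not the actual Harness task model.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Minimal sketch: only reassign tasks whose own heartbeat is stale,
// instead of reassigning every task the moment its delegate disconnects.
public class TaskResetJob {
    private static final Duration HEARTBEAT_TIMEOUT = Duration.ofMinutes(5);

    public void resetStaleTasks(List<Task> runningTasks) {
        Instant cutoff = Instant.now().minus(HEARTBEAT_TIMEOUT);
        for (Task task : runningTasks) {
            if (task.lastHeartbeat().isBefore(cutoff)) {
                task.reassign(); // hypothetical: re-queue only genuinely stale tasks
            }
        }
    }

    // Hypothetical task abstraction, for illustration only.
    interface Task {
        Instant lastHeartbeat();
        void reassign();
    }
}
```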
This incident has been resolved. We will provide an RCA after findings are complete.
The issue has been mitigated. We are still monitoring the system to ensure healthy operation of the cluster.
We have identified the service that is causing the degradation. We have scaled up the DB resource for that service. We are still working to mitigate the issue.
We have internally found an issue that is impacting the optimal performance for Prod1 customers. We are actively investigating this.
Report: "Pipeline executions are briefly not visible on the platform."
Last update# Summary Due to a delayed background index build sync on the analytical node, a data replication lag was introduced. This lag prevented the latest pipeline executions from appearing in the Prod2 environment. # Resolution | **Timeline \(UTC\)** | **Event** | | --- | --- | | Oct 28, 9:09 AM | The customer reported being unable to access the execution. | | Oct 28, 9:11 AM | The team engaged to troubleshoot the issue. | | Oct 28, 9:43 AM | DBRE identified an issue with replication lag, paused further index creation, and started bringing up a new analytical node. | | Oct 28, 10:31 AM | Index build sync completed and the lag issue was resolved. | # RCA On Oct 28, 2024, between 9:09 AM UTC and 10:31 AM UTC, customers faced an issue where their recent pipeline executions were not visible in the Prod2 environment. The issue stemmed from an index creation job executed as part of a recent release. This job caused replication lag in one of the read replicas of our MongoDB database, preventing up-to-date data from being available. The index creation job was halted, which restored normal replication and resolved the visibility issue for recent pipeline executions. # Action Items **Implement maxStalenessSeconds**: - Application teams were advised to include the `maxStalenessSeconds` parameter in their connection configuration. This setting ensures read queries are directed to secondary nodes with replication lag below the specified `maxStalenessSeconds` threshold.
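For reference, `maxStalenessSeconds` is a standard MongoDB read-preference option (minimum value 90, and only applicable when reads may go to secondaries). A minimal Java-driver sketch with a placeholder connection string:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

// Minimal sketch: direct secondary reads only to replicas whose replication
// lag is below maxStalenessSeconds, so a lagging analytical node is skipped.
public class StalenessAwareClient {
    public static MongoClient create() {
        // Placeholder URI; maxStalenessSeconds must be >= 90 and requires a
        // read preference other than "primary".
        String uri = "mongodb://host1:27017,host2:27017/"
                + "?replicaSet=rs0"
                + "&readPreference=secondaryPreferred"
                + "&maxStalenessSeconds=90";
        return MongoClients.create(uri);
    }
}
```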
The replication issue has been resolved. We will share the RCA as soon as possible.
The replication is now under control, and the issue should be resolved. We will conduct a root cause analysis (RCA) and share our findings as soon as possible.
We have identified the issue affecting data replication on our database analytics node.
We are currently investigating this issue.
Report: "Chaos Engineering users in Prod2 cluster may be experiencing issues"
Last update**Incident Summary:** The Chaos dashboard on the Prod2 cluster was inaccessible to users, returning a 404 error code when they attempted to access it. **Timeline** | **Time \(UTC\)** | **Event** | | --- | --- | | November 5, 6:58 AM | An alert was triggered. | | November 5, 7:08 AM | We identified that the issue was related to ingress. | | November 5, 7:16 AM | The ingress class issue caused by the Helm migration was fixed and monitoring started. | | November 5, 7:19 AM | The incident was successfully resolved. | **Root Cause Analysis:** During the Helm switchover, all module ingresses were duplicated under a new Ingress Class. Automation was created to ensure that each service's ingress rule was duplicated. However, the Chaos ingress rules followed an older method of specifying the Ingress Class via annotations, which was missed by the automation. **Immediate Resolution:** The Chaos ingress rules were recreated with a valid Ingress Class. **Action Items:** * Update the Chaos ingress rules to use the new method of associating the Ingress Class via a key-value pair. * Enhance the post-migration validation framework to include automated checks for data accuracy and integrity.
This incident has been resolved. We regret the inconvenience that this might have caused.
The fix has been put in and we are monitoring the system at the moment.
The issue has been identified and the team is actively working to fix it
We are currently investigating an issue where Chaos Engineering customers in Prod2 may be experiencing issues.
Report: "CI Harness Cloud builds were non-operational for Prod3."
Last update## **Summary:** Customers in Prod3 were unable to run Hosted CI builds. ## **What was the issue:** The delegate lite microservice responsible for hosted CI builds got scaled down due to a misconfiguration, so customers were unable to run CI builds. ## **Resolution** | **Time** | **Event** | | --- | --- | | Dec 10 at 5:34 AM UTC | Customer reported an issue with running a CI pipeline. | | Dec 10 at 5:52 AM UTC | The delegate lite microservice was redeployed in Prod3 and the issue was resolved. | ## **RCA** The delegate lite deployment follows a blue-green deployment model. However, due to a misconfiguration, both the old and the new deployment were stopped after the maintenance window, which was intended to shut down only the old deployment. The issue was not detected internally due to a misconfigured alert. ## **Action Items** * Fix and test the alerts for Hosted CI and related services to ensure they trigger as desired. * Fix the delegate lite deployment to scale down only if the required number of containers are active.
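The second action item boils down to a guard condition before the old side of the blue-green pair is removed. The sketch below is purely illustrative: the names are hypothetical, and the replica counts would come from whatever deployment system is in use.

```java
// Minimal sketch of a blue-green scale-down guard: the old deployment is only
// scaled down once the new deployment reports the required number of ready
// containers. Inputs are assumed to come from the deployment platform.
public class BlueGreenScaleDownGuard {
    public static boolean canScaleDownOld(int newReadyReplicas, int requiredReplicas) {
        return newReadyReplicas >= requiredReplicas;
    }

    public static void main(String[] args) {
        int requiredReplicas = 3;
        int newReadyReplicas = 0; // e.g. the state that caused this incident
        if (canScaleDownOld(newReadyReplicas, requiredReplicas)) {
            System.out.println("Safe to scale down the old deployment.");
        } else {
            System.out.println("Blocking scale-down: new deployment is not fully ready.");
        }
    }
}
```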
This incident is resolved and services are operational.
We would like to notify you that CI Cloud builds (Mac Cloud Builds & Linux Cloud Builds) were non-operational from 5:46 AM UTC until 6:06 AM UTC. The services are now restored and working fine.
Report: "Harness Platform was briefly Unavailable"
Last update### **Summary** Customers were unable to access [https://app.harness.io/](https://app.harness.io/) for 2 minutes. ### **What was the issue?** A recent deployment for the gateway component in the prod-1 environment had an incorrect configuration that downscaled all the gateway pods. ### **Resolution** The configuration was reverted to restore service availability. | **Time \(UTC\)** | **Event** | | --- | --- | | 5 Nov 12:52:50 PM | Service deployment downscaled the gateway pods. | | 5 Nov 12:54:50 PM | Scaled up gateway pods. New pods were up and running to serve traffic. | ### RCA On Nov 5, 2024, for 2 minutes, users experienced an HTTP 503 \(service unavailable\) error when attempting to access [https://app.harness.io](https://app.harness.io). This occurred due to the downscaling of the gateway service. The issue originated from a recent deployment that applied an incorrect configuration. The configuration was immediately reverted to restore service availability. ### Action Items **Improve Pre-Deployment Checks**: Enhance pre-deployment checks to validate critical service configurations and prevent unintended downscaling.
This incident has been resolved.
We would like to notify you of a disruption to the Harness Platform that took place at 12:53 PM UTC today. This was a temporary glitch, and the Platform is now operating normally. Further details regarding the precise impact and underlying cause of this disruption will be provided in a postmortem report here. We appreciate your understanding and patience.
Report: "Pipeline Steps Timing out for a subset of customers in Prod2"
Last update## **Summary:** Pipeline executions were failing with a time-out error on Prod2. This affected ~3% of pipeline executions. ## **What was the issue?** Tasks are execution units that run on a delegate as part of a pipeline execution. As a pipeline runs, its tasks are broadcast to delegates, and one eligible delegate picks up each task for execution. If no delegate acquires a task within the stipulated time, it is rebroadcast. During this incident, rebroadcast functionality was affected, resulting in pipeline executions timing out. ## **Resolution:** We rolled back the service to resolve the issue. ## **RCA** An incompatible change was rolled out in one of our microservices, causing deserialization failures for a subset of task types. The rebroadcast threads went into an error state due to this deserialization error, resulting in the failure of pipelines that required task rebroadcasts. The system recovered upon the service's rollback. **Action Items** 1. Added a critical alert for rebroadcast events. 2. Rebroadcast logic was made resilient to task deserialization errors. 3. A unit test was added to catch incompatible contract changes for task data.
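Making the rebroadcast logic resilient to deserialization errors (action item 2) essentially means isolating per-task failures so a single bad payload cannot put the rebroadcast thread into an error state. A hedged sketch with hypothetical types, not the actual Harness code:

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Minimal sketch: deserialize and rebroadcast each pending task independently,
// so one incompatible task payload no longer stops rebroadcasts for the rest.
public class RebroadcastLoop {
    private static final Logger LOG = Logger.getLogger(RebroadcastLoop.class.getName());

    public void rebroadcastPending(List<byte[]> pendingTaskPayloads, TaskCodec codec, Broadcaster broadcaster) {
        for (byte[] payload : pendingTaskPayloads) {
            try {
                broadcaster.broadcast(codec.deserialize(payload));
            } catch (Exception e) {
                // Previously an error like this escaped and halted the rebroadcast thread.
                LOG.log(Level.SEVERE, "Skipping task that failed to deserialize", e);
            }
        }
    }

    // Hypothetical abstractions, for illustration only.
    interface TaskCodec { Object deserialize(byte[] payload) throws Exception; }
    interface Broadcaster { void broadcast(Object task); }
}
```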
The incident has been resolved. We will be sharing an RCA covering improvements in monitoring and other steps.
The issue has been fixed and we are monitoring the system.
The issue has been identified and we are still working on a fix.
We are currently investigating an issue where the clone codebase step is failing for a subset of customers in Prod2.
Report: "Harness cloud builds failing at initialise step for MAC users"
Last update### **Summary** CI-hosted MacOS pipelines were failing during the initialisation step, impacting specific customers using our MacOS-hosted service. ### What was the issue? We tightened a firewall rule for our Mac VM registry that was previously too permissive. As a result, another component couldn’t access the registry, leading to pipeline failures. ### **Resolution** | **Time** | **Event** | | --- | --- | | Sept 1st, 17:00 UTC | Restricted the firewall rule. | | Sept 04, 06:03 UTC | Issue reported by the customer. | | Sept 04, 08:39 UTC | We re-created the firewall rule and validated that the issue was fixed. | ### RCA Our MacOS production setup includes several components. When we restricted the permissive firewall rule, the new rule did not account for the NAT IP address of one of these components. After the change, we ran a full sanity pipeline on the Mac machines, which passed successfully. The issue didn’t surface immediately as the affected component maintains a persistent socket connection, unaffected by the firewall until the connection is re-established or restarted. This explains why the failure didn’t occur immediately after we removed the permissive rule on September 1st. We restored the rule, and the issue was resolved. ### Action Items 1. Restrict the firewall rule again, ensuring that necessary NAT IPs are included. 2. Restart all relevant services when applying firewall rule restrictions. 3. Ensure that all connections are properly drained and re-established when the change is implemented.
We apologise for the inconvenience caused by this outage. We will make sure to provide the root cause analysis soon.
The issue is resolved now. We will be sharing RCA for the problem as soon as possible.
We are currently investigating this issue.
Report: "Login issues on Prod4"
Last update## **Summary:** Logged-in users started getting redirected to the enrollment screen with an “Email verified successfully” message and were forced to enter their user details again. Pipeline executions and backend tasks were not impacted. The impact was limited to accounts in the Prod4 cluster. ## **What was the issue?** We released an incompatible version of the NextGen UI service, resulting in an unexpected new sign-up flow for existing users. This was a human error. ## **Timeline:** | **Time** | **Event** | | --- | --- | | September 03 7:45 PM UTC | Customer reported login redirection to the sign-up page | | September 03 8:15 PM UTC | New deployment happened around the same time. Decided to roll back | | September 03 8:20 PM UTC | Started the partial rollback of FF Proxy changes | | September 03 8:30 PM UTC | Partial rollback didn’t fix the issue. Initiated full rollback | | September 03 9:00 PM UTC | Complete rollback completed and issue resolved | ## **Resolution:** Rollback resolved the issue. ## **RCA** There was a human error in picking the version of the NextGen UI service. Post-deployment sanity checks did not catch this issue. Rolling back took longer than expected because multiple services were deployed together. **Action Items** 1. Remove the manual process of picking service versions. Automate the promotion process from lower environments. 2. Improve sanity tests to catch the above UI flow. 3. Make the rollback process atomic based on the previous known good state.
We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
Logged in users started getting redirected to the enrollment screen. Currently investigating
Report: "Pipeline Services are having degraded performance"
Last update## **Summary** After the Redis isolation maintenance on Prod1, internal monitoring tools showed that pipelines were running slower. ## **What was the issue?** The Harness platform uses a set of services that act as producers and consumers for Redis streams. The order in which these services were brought up caused some of the streams not to be consumed. ## **Timeline** | **Time** | **Event** | | --- | --- | | 9:55AM PT | Noticed intermittent slowness in Pipelines | | 10:00AM PT | Core services were rolled out again | | 10:10AM PT | Pipeline performance improved and services were running well | ## **Resolution** Restarting the services in the correct order made the Redis producers/consumers available. Pipeline performance also improved and returned to normal latency.
We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
Report: "Customers unable to access Harness on Prod4 Cluster"
Last update## **Summary:** Customers experienced login failures with 5xx errors on the Prod4 cluster. ## **What was the issue?** The Harness platform internally uses a managed memStore, which experienced a “Host error”; this triggered a master switchover within seconds. Backend microservices that connect to memStore were not able to reconnect quickly. The issue affected Java-based services, while Go services reconnected properly. ## **Timeline:** | **Time** | **Event** | | --- | --- | | 21 August 4:06:41 PM UTC | Primary memStore went down | | 21 August 4:07:00 PM UTC | Secondary memStore promoted to Primary | | 21 August 4:06:41 PM UTC | Harness services experienced RedisResponseTimeoutException | | 21 August 4:14:30 PM UTC | Harness services restored connectivity to the new Primary | | 21 August 4:14:53 PM UTC | New instance of memStore added and promoted as Secondary | ## **Resolution:** After 8 minutes, services reconnected to the new primary memStore on their own and the system recovered. ## **RCA** Java services use the Redisson library to connect to memStore. The established connection pool doesn’t detect the endpoint going away, and these connections eventually time out. This issue doesn’t occur during a graceful failover; we encounter it only in the case of a catastrophic failure. **Action Item** * Detect this catastrophic failure so that services reconnect more quickly
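As context for the action item, Redisson exposes connection-health settings (for example `pingConnectionInterval`) that help pooled connections notice a dead endpoint sooner after an abrupt failover. The sketch below is a hedged example of such a configuration with a placeholder address and a simplified single-server topology; it is not the exact settings Harness applied.

```java
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

// Minimal sketch: Redisson client settings that help pooled connections detect
// an endpoint that disappeared abruptly, rather than waiting for long timeouts.
public class MemStoreClientFactory {
    public static RedissonClient create() {
        Config config = new Config();
        config.useSingleServer()                              // topology simplified for illustration
              .setAddress("redis://memstore.example.internal:6379") // placeholder address
              .setConnectTimeout(5000)          // connection establishment timeout (ms)
              .setTimeout(3000)                 // command response timeout (ms)
              .setRetryAttempts(3)
              .setPingConnectionInterval(1000)  // periodically ping pooled connections (ms)
              .setKeepAlive(true);              // enable TCP keep-alive on connections
        return Redisson.create(config);
    }
}
```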
We can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
We are currently investigating this issue.
Report: "Users were unable to review details of security issues and STO pipeline steps were delayed"
Last updateWe can confirm normal operation. Get Ship Done! We will continue to monitor and ensure stability.
We are continuing to monitor for any further issues.
Users were unable to review details of security issues and STO pipeline steps were delayed