Historical record of incidents for Twingate
Report: "Admin and Reports Issues"
Last update: Our cloud provider is having issues impacting certain parts of the Twingate service, specifically the Admin console, Reports downloads, and Network Dashboards. Everything else works as expected. We are looking into it.
Report: "Twingate Service Down"
Last update: This incident is resolved now. We'll provide a post-mortem as soon as we have details.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Twingate Service Down"
Last update: This incident is resolved now. We'll provide a post-mortem as soon as we have details.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Twingate Service Issues"
Last update: **Twingate Public RCA: March 21, 2025 Authentication/Authorization Incident**

**Summary**
On March 21, 2025, between 21:10 and 21:41 UTC, a subset of authentication and authorization requests to Twingate services experienced elevated error rates. The issue was mitigated by 21:41 UTC and services returned to normal operation.

**What Happened**
Twingate runs an active-active architecture across multiple Google Kubernetes Engine (GKE) clusters, balanced by a Global Load Balancer (GLB). At the onset of the incident, we observed anomalies in one of our clusters and, to protect overall system health, we proactively scaled down deployments in that cluster. This action shifted traffic to other clusters in the topology, primarily one that typically handles a lighter load and had been scaled accordingly. The target cluster began autoscaling as expected, but the increased traffic caused elevated error rates, which triggered our retry mechanisms across services. While these retries helped many requests succeed, they also increased the overall system load. Simultaneously, the cluster underwent a cloud provider-initiated update operation that caused pod restarts and reduced capacity. To stabilize the system, we reintroduced capacity in the previously affected cluster, rebalancing the traffic across regions. Once this occurred, error rates subsided and retries diminished.

**Root Cause**
Anomalous behavior from our cloud provider, including unexpected request timeouts at the load balancer level and instability during a cluster update, led to a cascade of retry traffic that temporarily overwhelmed parts of the system. We are actively investigating both the unexpected timeout configuration and the behavior of the cluster during the update with our cloud provider.

**Corrective actions**
Short-Term:
* Continue collaboration with our cloud provider to understand the root cause of the unexpected timeouts and cluster update impact.
* COMPLETED: Keep deployments in all regions with the same number of replicas in the HPA configuration.
* COMPLETED: Increase max node counts on cluster node autoscalers to give us more room to scale up.
* COMPLETED: Configure the HPA to scale up faster.
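The completed HPA actions above can be illustrated with a short sketch. This is a hypothetical example using the kubernetes Python client; the kubeconfig context names, HPA name, namespace, and replica count are placeholders, not Twingate's actual tooling or configuration. It patches each regional cluster's HPA so every region keeps the same replica floor:

```python
# Hypothetical sketch: keep the controller HPA's replica floor identical in every
# regional cluster, so a failover region is never left scaled down to a "light load"
# size. Context names, HPA name, and values below are illustrative only.
from kubernetes import client, config

CONTEXTS = ["gke-us-east", "gke-eu-west", "gke-asia-se"]  # assumed kubeconfig contexts
HPA_NAME = "controller-hpa"                               # assumed HPA name
NAMESPACE = "default"
MIN_REPLICAS = 12                                         # same floor everywhere

for ctx in CONTEXTS:
    api = client.AutoscalingV2Api(api_client=config.new_client_from_config(context=ctx))
    # Strategic-merge patch: raise minReplicas without touching the rest of the spec.
    api.patch_namespaced_horizontal_pod_autoscaler(
        name=HPA_NAME,
        namespace=NAMESPACE,
        body={"spec": {"minReplicas": MIN_REPLICAS}},
    )
    print(f"{ctx}: minReplicas set to {MIN_REPLICAS}")
```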
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Issues with Dashboards and Events Processing"
Last update: All internal alerts have cleared. You should no longer see delays or errors with dashboards and reports.
BigQuery service latencies have dropped and we are seeing improvements in our internal dashboards.
We are seeing higher latency than usual from our cloud provider's APIs used to process data for our dashboards and reports. The provider has confirmed the issue with their service. We expect this to be resolved soon.
Report: "Temporary Disruption in DNS Activity Reports"
Last update: This incident has been resolved.
Recent DNS Activity is now restored and we are monitoring the system.
We're currently experiencing a temporary disruption with the DNS Activity Reports. The team is working on restoring access, and we'll have them back up shortly.
Report: "Jamf Device Integration Issue"
Last update: The issue with syncing trusted devices from Jamf has been fixed.
This incident is still being investigated. It started after a recent upgrade performed by Jamf. We have reached out to Jamf about the issue and are working with them on a resolution.
We're seeing increased errors when syncing trusted devices from Jamf. We're currently investigating.
Report: "DNS Events Ingestion Issue"
Last update: This incident has been resolved.
DNS events are being ingested again and we're monitoring the system.
We've identified the issue and are working on a fix.
We've detected an issue with DNS events ingestion affecting some of our customers, who are not able to see DNS events when exporting to S3. We're currently looking for the root cause.
Report: "GitHub Social Login is Down"
Last update: This incident has been resolved.
GitHub is currently experiencing an incident affecting its authentication: https://www.githubstatus.com/incidents/kz4khcgdsfdv. Social login via GitHub is therefore temporarily unavailable until GitHub resolves the issue.
Report: "Secure DNS Dashboards Issue"
Last update: Dashboards are back to normal.
Dashboards are gradually coming back online.
Issue has been identified and we're working on a fix.
We are continuing to investigate this issue.
We are seeing issues with Secure DNS (Internet Security) Dashboards loading. We are investigating.
Report: "Issue with S3 Sync of Internet Security Events"
Last update: This incident has been resolved.
There was an ingestion delay, which has been fixed. We see logs flowing again.
We are having an issue with S3 Sync of Internet Security events. The team is engaged and the issue is being investigated.
Report: "Issue with S3 Sync of Internet Security Events"
Last update: The fix has been pushed to production and we have confirmed that Internet Security events are now syncing to S3 buckets again.
We have identified the issue and are working on a fix. We will update once the fix is deployed.
We are continuing to investigate this issue.
We are seeing issues with AWS S3 Sync of Internet Security events. We are currently investigating.
Report: "MFA Incident"
Last update: During the incident, some of our customers' MFA tokens became invalid. Fewer than 15 of our customers were impacted by this incident. We identified the issue and will make this key rotation safer going forward so that it doesn't impact any of our customers. Impact duration: 12:47 - 14:40 UTC.
Report: "Database Connection Issues"
Last update: **Components impacted**
Management: Public API
Management: Admin Console

**Summary**
On June 26, 2024, between 20:16 and 20:24 UTC, Twingate's SQL proxies restarted, causing a brief failure for a small percentage of calls made to our Public API (Terraform, Pulumi, k8s Operator, etc.) and to the Admin console. There was no impact to Clients or Connectors. A change to our SQL proxy deployments that was targeting staging and development environments was pushed to production due to a misconfiguration, causing our SQL proxy instances to restart.

**Root cause**
Due to a misconfiguration, a change to our SQL proxy deployments intended for staging and development environments was pushed to production, causing them to restart.

**Corrective actions**
Short Term:
* Ensure that SQL proxy deployments are only pushed in a controlled manner by resuming the GitOps workflow manually.
* Fix the misconfiguration in our GitOps deployment mechanism for our SQL proxy deployments and set the Helm chart version to a static value so that all upgrades are done in a controlled manner.
* Enhance our SQL proxy Helm chart to reduce the impact to services during updates and upgrades.
On June 26, 2024, between 8:16 pm UTC and 8:28 pm UTC, Twingate experienced several database connectivity alerts due to a failed rollout of one of its components. The rollout was promptly reversed, and our existing reliability measures prevented any major disruption to customer traffic.
Report: "Recent DNS Activity Unavailable"
Last update: This incident has been resolved.
Recent DNS Activity screen on the Admin console is back up.
Recent DNS Activity screen on the Admin console isn't available.
Report: "Logins with Github Not Working"
Last update: **Components impacted**
Control Plane: Authentication

**Summary**
On April 30, 2024, between 15:44 and 18:38 UTC, users were unable to log in to Twingate through GitHub. The Twingate Engineering team investigated the issue upon receiving support communications and found that logins with other identity providers were functioning normally, but GitHub logins were not working as expected. The problem was traced back to a software rollout that had inadvertently impacted GitHub logins. Engineering was able to create a fix and roll it out at 18:38 UTC, which restored GitHub logins to normal functionality. The team is investigating better ways to identify and prevent these types of issues from reaching production and to address them as quickly as possible if they ever arise again.

**Root cause**
A recent package upgrade introduced a bug that impacted Twingate logins via GitHub.

**Corrective actions**
Short-term:
* Increase our testing coverage to allow early detection of login issues for all supported identity providers in all environments. This will aid in early detection when software is deployed to lower, non-production environments.
* Improve alerting for login issues.
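A hedged sketch of what broader identity-provider login test coverage could look like. The staging URL, provider slugs, and login path below are assumptions made for illustration, not Twingate's actual endpoints; the test only checks that each provider's login entry point still issues a redirect toward the IdP.

```python
# Hypothetical smoke test for the corrective action above: exercise the first leg of
# the OAuth/OIDC login flow for every supported identity provider in a pre-production
# environment, so a broken provider integration is caught before production rollout.
import pytest
import requests

STAGING_BASE = "https://staging.example-tenant.test"   # assumed test tenant
PROVIDERS = ["github", "google", "okta", "azuread"]    # assumed provider slugs

@pytest.mark.parametrize("provider", PROVIDERS)
def test_login_redirects_to_identity_provider(provider):
    # The login entry point should redirect toward the IdP's authorize endpoint
    # rather than returning an error page.
    resp = requests.get(
        f"{STAGING_BASE}/auth/{provider}/login",  # assumed login path
        allow_redirects=False,
        timeout=10,
    )
    assert resp.status_code in (301, 302, 303, 307), (
        f"{provider} login did not redirect; got {resp.status_code}"
    )
    assert "Location" in resp.headers, f"{provider} login returned no redirect target"
```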
We have successfully rolled out the hotfix for the issue with GitHub logins.
A software rollout has broken logins with GitHub. We are working on rolling out a hotfix.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Recent DNS Activity Unavailable"
Last update: **Components impacted**
Management: Admin Console

**Summary**
On April 20, 2024, between 5:32 GMT and 6:57 GMT, Recent DNS Activity on the Admin Console became unavailable. Shortly after the incident began, the Twingate on-call team received alerts regarding abnormal database activity. Workers on the clusters that manage DNS filtering logs started seeing errors from the logs API, leading to excessive retries and database writes. To mitigate the issue, the DNS Log Streaming workers were temporarily disabled. The root cause was identified as a malfunction in the DNS Filtering Log API caused by a problematic dependency upgrade. Consequently, viewing DNS filtering logs and analytics in the Admin Console was temporarily unavailable. A rollback of the update was issued, and normal operations were restored at 6:57 GMT, after which DNS filtering logs and analytics were available in the Admin Console.

**Root cause**
The DNS Filtering Log API went down due to a bad dependency upgrade.

**Corrective actions**
Already completed:
* Rectified the Admin Console's infinite retry logic by enhancing the retrieval of DNS activity logs during error states.
* Optimized DNS Log Streaming retry and database write procedures to reduce unnecessary operations when no events are returned from the DNS Filtering API.

Short-term:
* Improve the dependency upgrade process for the DNS Filtering API.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Recent DNS Activity on the Admin console is unavailable. We have identified the issue and are working on a fix.
Report: "Empty DNS filtering logs"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We’re currently investigating an issue where DNS filtering logs are sometimes empty.
Report: "Database Connection Issues"
Last update: **Components impacted**
Control Plane: Authn, Authz, Connector Heartbeat
Management: Admin Console, Public API, Identity Providers Sync

**Summary**
On Dec 11, 2023, between the hours of 5:02 pm UTC and 6:24 pm UTC, Twingate received network connectivity alerts at 3 distinct times, each for a few minutes. After investigation, it was identified that a change in our application had caused our connections to exceed the allowed maximum. We reverted the changes at 6:24 pm, after which the increased connections receded and our connection counts returned to healthy levels.

**Root cause**
The Twingate database connections exceeded the maximum allowed.

**Corrective actions**
Already Completed:
* Adjusted capacity to keep the number of connections at a healthy number

Short Term:
* Increase the maximum number of connections allowed
* Add alerting for connection utilization
Summary: On Dec 11, 2023, between the hours of 5:02 pm UTC and 6:24 pm UTC, Twingate received network connectivity alerts at 3 distinct times, each for a few minutes. After investigation, it was identified that a change in our application had caused our connections to exceed the allowed maximum. We reverted the changes at 6:24 pm, after which the increased connections receded and our connection counts returned to healthy counts. (See Postmortem for more details)
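As an illustration of the connection-limit corrective actions above, here is a minimal sketch (not Twingate's code) using SQLAlchemy. The connection string, pool sizes, and alert threshold are placeholders; the point is that each application instance gets a hard per-instance cap and exposes its pool utilization so alerting can fire before the database-wide maximum is hit.

```python
# Minimal sketch: cap connections per instance and expose pool utilization for alerting.
from sqlalchemy import create_engine

POOL_SIZE, MAX_OVERFLOW = 10, 5  # hard ceiling per instance = POOL_SIZE + MAX_OVERFLOW

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.internal/appdb",  # placeholder DSN
    pool_size=POOL_SIZE,
    max_overflow=MAX_OVERFLOW,
    pool_timeout=30,      # fail fast instead of queueing forever when saturated
    pool_pre_ping=True,   # drop dead connections after proxy/database restarts
)

def pool_utilization() -> float:
    """Fraction of this instance's connection budget currently checked out."""
    return engine.pool.checkedout() / (POOL_SIZE + MAX_OVERFLOW)

# A metrics exporter could publish pool_utilization() and page on sustained values
# above ~0.8, well before the server-side connection limit is reached.
```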
Report: "Twingate Service Incident - Aug 19, 2023"
Last update: **Summary**
On August 19 at 7:51 AM UTC, Twingate received alerts of issues with the login services. Within a few minutes, the Twingate engineering team began investigating. The team quickly identified that our backend was seeing excessive timeouts from a 3rd-party API, preventing it from being able to process other requests such as authentication. After some initial fixes were unsuccessful, Twingate contacted the 3rd party and also disabled support for real-time updates that make use of these specific 3rd-party API calls. As a result, the issues started resolving at 8:10 AM UTC. Most of the services recovered quickly and full resolution occurred at 8:15 AM UTC. The vendor later confirmed and fixed the issue, and Twingate re-enabled the real-time update feature shortly after on the same day, August 19.

**Root cause**
The Twingate backend was exhausted due to timeouts from a 3rd-party API.

**Post-incident Analysis**
Twingate had already separated out most services to their own deployments, allowing those services to function throughout the incident. Therefore, only some users that needed to authenticate or re-authenticate were affected; any user that had authenticated prior to the incident was not impacted. Analysis of logs post-incident showed that the incident started at 7:49 AM UTC and fully recovered at 8:15 AM UTC.

**Corrective actions**
Short Term:
* Separate Authentication and real-time services to their own deployments - COMPLETED

Medium / Long Term:
* Reevaluate and optimize timeout values for various backend and 3rd party services
* Simplify the internal Twingate process for enabling and disabling features
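The timeout corrective action can be illustrated with a small sketch. This is a generic example, not Twingate's backend code; the vendor URL and timeout budgets are assumptions. The idea is that every outbound 3rd-party call carries an explicit, short timeout so a slow vendor cannot hold worker capacity long enough to starve unrelated requests such as authentication.

```python
# Illustrative sketch: explicit timeouts plus graceful degradation for a 3rd-party call.
from typing import Optional
import requests

CONNECT_TIMEOUT = 3.05  # seconds to establish the TCP/TLS connection
READ_TIMEOUT = 5.0      # seconds to wait for the vendor's response

def fetch_realtime_update(session: requests.Session, resource_id: str) -> Optional[dict]:
    try:
        resp = session.get(
            f"https://vendor.example.com/v1/updates/{resource_id}",  # placeholder URL
            timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
        )
        resp.raise_for_status()
        return resp.json()
    except (requests.Timeout, requests.ConnectionError):
        # Degrade gracefully: skip the real-time update rather than blocking the
        # request path that also serves authentication traffic.
        return None
```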
This incident has been resolved. We'll publish the RCA as soon as we can.
We are continuing to investigate this issue.
We are seeing issues with the Twingate service and are investigating.
Report: "Admin Console Authentication Issues for tenants with JumpCloud IDP integration"
Last update: Admin Console authentication was broken for our tenants that use the JumpCloud IdP integration. This was due to a bug in the authentication flows of our latest software, which was deployed at 15:24 UTC on June 21st. Resolution was accomplished by reverting to the previous software at 12:42 pm UTC on June 22nd. We will add more tests and monitoring/alerting for the JumpCloud/SAML integration to avoid this in the future and to detect it faster.
Report: "admin console and authentication incident"
Last update: **Summary**
On June 7th at 17:08 UTC, a new version of our Controller software was rolled out. Shortly after the rollout completed, our on-call team received automated exception alerts and began investigating. The new version had inadvertently included changes to clean up our database that were out of sync with the deployed code, and the team decided to roll back the new software deployment. The rollback proceeded smoothly; however, the previous code version was missing the database fields that had been cleaned up, and the incident started at 17:24 UTC. A fix was prepared and rolled out starting at 17:29 UTC. Deployment completed on the first cluster at 17:38 UTC and proceeded to the remaining clusters once we verified that the error state had been resolved. Deployment was completed on all clusters at 17:48 UTC.

**Post-incident Analysis**
We initially thought the incident had only impacted Admin Console users; however, the following systems were impacted:
* Admin Console sign-in.
* Client initial authentication and re-authentication requests.
* Linux and container-based Connectors.

This incident exposed an issue that resulted in Connectors incorrectly shutting down on transient Controller unavailability.

**Root Cause**
An error in our deployment process logic led to a mismatch between deployed code and the database schema.

**Corrective Actions**
1. Improve our processes for merging software changes that are linked to database schema changes.
2. We have fixed and will be testing the bug in our Connector uptime / retry logic.
3. Improve overall build and rollout performance to be able to push fixes more promptly.
This incident has been resolved. It was due to an issue with a software update, and we are working on a plan to avoid this in the future. The authentication flow was also impacted; already authenticated sessions continued to function.
We are seeing issues with our admin consoles not loading properly. The issue is being investigated.
Report: "Issue in us-east4 region"
Last update: Our cloud provider has announced that the issue in the us-east4 region has been resolved. We will monitor a little longer before re-enabling traffic in that region.
Our cloud provider is having a networking issue in the us-east4 region that caused 500 errors for our customers hitting that region. Due to retry logic in our service, this shouldn't have caused issues for our customers. We diverted traffic to other regions, and no traffic is handled in the problematic region any more. We are working with our cloud provider to make sure the problem is completely resolved before we re-enable traffic for that region.
Report: "Relay2 (us-east4) and Relay4 (eu-west6) Issue"
Last update: **Components impacted**
Relay clusters in us-east4 (Ashburn, Virginia) and europe-west6 (Zurich, Switzerland)

**Summary**
We've recently been working on adding spot instance scaling to our Relay cluster infrastructure. On Feb 14th at 01:00 UTC, we initiated this upgrade process via a Terraform change to all of our Relay clusters globally, which took approximately 45 minutes to complete. At 01:40 UTC, we noticed a decrease in the number of connections in two clusters (us-east4 and europe-west6) and started an investigation. We also engaged our cloud infrastructure provider proactively to rule out a regional cloud provider issue. At 02:04 UTC, we disabled pods in the two affected clusters, which caused connected Clients and Connectors to re-connect to the next closest Relay cluster. Overall connection metrics were seen to be normal across the redistributed connections. On further investigation, we determined that an error in our Terraform configuration affecting network firewall rules had caused the issue. At 03:30 UTC we corrected the error and redeployed the affected clusters, which resolved the issue.

**Root cause**
An error in the deployed Terraform configuration removed a critical network tag, which was required to set the correct network firewall rules within our Relay cluster. The result was that Relay clusters were discoverable, but not reachable, leading to the deadlock state experienced by the Clients and Connectors attempting to attach to the affected clusters. This issue only affected two of our Relay clusters because of a configuration difference that was dependent on the overall sizing of these clusters. This ultimately hid the issue during testing in our staging environment because this cluster sizing difference, which in turn leads to different configuration outcomes, was not accurately reflected.

**Corrective actions**
We are taking the following short term actions, some of which are already completed, to avoid this problem in the future:
* Accurately reflect Relay cluster sizing and configuration differences in our staging environment.
* Auto-create plans for all environments, including all configuration variations in product, with feature development branches.
* Enhance Relay health checks to ensure clusters are non-discoverable to Connectors and Clients if the necessary network firewall tag is not in place.
* Research and implement staggered rollouts with Terraform for our Relay cluster infrastructure.

We also have medium term plans to add multi-cluster connectivity to our Connectors to handle regional Relay cluster problems automatically.
We have identified the issue and a fix has been implemented. Both relay clusters are healthy and processing requests.
We found issues with 2 of our relay clusters, one in the US (region: Virginia) and one in Europe (region: Zurich). While we are working on bringing them up, Connectors and Clients should have automatically reconnected to the other relay clusters, which may cause some slowness for customers closer to the affected clusters, since they now need to connect to more distant clusters.
We see issues with one of our relay clusters (US relay cluster 2) and are investigating.
Report: "Twingate Service Incident"
Last update: **Summary**
On January 24 at 19:58 UTC, our on-call team started to receive automated alerts regarding system performance degradation. The team began an investigation and, by 20:02 UTC, the degradation had escalated to a point where some Twingate Clients began to experience request timeouts. The Client behavior on request timeout is to initiate request retries, which triggered additional requests to our infrastructure. Due to the overall performance degradation, the increase in inbound requests overloaded the system to a point where internal health check requests also began to fail. This resulted in system components being marked as offline, further reducing the available capacity to respond to requests. Autoscaling of serving infrastructure did occur, but the increase in capacity was insufficient to remedy the system's overall decrease in performance, on top of the additional request workload.

Between 20:05 UTC and 20:45 UTC, we identified that the performance degradation was exclusively affecting our authorization engine, independent of other system capabilities. At 20:47 UTC, we promoted our physically separate standby cluster to share load with the existing cluster in an active-active mode. Both clusters began serving traffic at 20:48 UTC, and some improvement to authorization engine throughput was seen, but individual requests were still taking much longer than normal. Noticing that the authorization engine was experiencing a higher load from certain tenants, the team next separated these tenants' traffic to an isolated replica cluster in order to provide a surplus of processing bandwidth. System load returned to normal on the main cluster, and the traffic was gradually recombined between 21:13 UTC and 21:48 UTC. The system fully recovered at this point.

**Post-incident Analysis**
In our analysis across all tenant traffic during the incident, we determined that for tenants with the latest Connector and most up-to-date Client applications, less than 10% of users experienced any downtime related to Resource access. Many users were unaware of this incident as their connections remained active due to changes we implemented last year that were introduced in Client and Connector updates. The experienced severity of this incident was hence highly correlated with whether Clients and Connectors were up to date for a given tenant. However, this version disparity also affected the severity of the incident as a whole, and we discuss this in both the root cause and corrective actions below.

**Root Cause**
This incident occurred because of two independent events that occurred simultaneously and that were in turn made worse by deployed Connectors and Clients with out-of-date behaviors. Specifically:
1. A temporary anomaly in our infrastructure provider's load balancer caused a short term, but very significant (greater than 10 _seconds_), increase in request latency. This in turn triggered Client request retry behavior, increasing the overall load on the system in a short time span.
2. Independent of the above event, a large number of computationally costly changes were triggered in our authorization engine through non-anomalous tenant activity. This increased the processing time for authorization requests.
3. Sufficient Connectors and Clients are deployed in our tenant base that do not have the most up-to-date logic in place for handling connection degradation. Clients and Connectors released _before_ approximately May 2022 do _not_ back off their retry requests, leading to an overwhelmingly large volume of requests to our system from a relatively small number of deployed Clients and Connectors. This exacerbated both (1) and (2).

We are confident that if any of the above three conditions were not true, this incident would not have occurred.

**Corrective Actions**
Our corrective actions focus on addressing the three contributing factors above. In short, we will be: making upgrades and configuration changes to our infrastructure provider's load balancers; improving authorization engine performance; and forcing upgrades of out-of-date deployed components. Many of these tasks were already underway before the incident, and some related tasks' completion will be accelerated. A detailed breakdown is provided below.

Immediate
We have already taken the following immediate corrective actions:
1. Increased authorization engine capacity and distributed the load between multiple clusters located in different geographic regions
2. Isolated authorization requests to a dedicated deployment
3. Increased backend and health check timeouts to more appropriately match the potential for authorization request latency increases
4. Upgraded our infrastructure provider's load balancer to improve container-awareness

Short Term
1. Complete a significant upgrade of our authorization engine. This includes removing a subsystem that was identified as the bottleneck during this incident and previous incidents. This project began in 2022 Q4 and we expect this replacement upgrade to complete by early 2023 Q2.
2. Introduce additional deployment isolation for different request types so that a failure in one part of the system doesn't affect other subsystems. This proved to work very well during the incident, and we will be further standardizing this in product.
3. Introduce additional logging to help accelerate future troubleshooting.

Medium & Long Term
1. Gradually move more parts of our application servers from synchronous request processing to asynchronous processing.
2. Consider the use of a sidecar proxy in front of our application servers.
3. Consider the use of an improved load-based auto-scaling mechanism for the authorization engine.
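The retry behavior described in the root cause can be illustrated with a short, generic sketch; this is not Twingate's Client code, and the request function is a placeholder. Exponential backoff with full jitter is the standard way to keep a fleet of clients from retrying in lockstep and amplifying a transient latency spike into an outage.

```python
# Illustrative sketch: exponential backoff with full jitter for client-side retries.
import random
import time

def call_with_backoff(request_fn, max_attempts=6, base_delay=0.5, max_delay=60.0):
    """Retry request_fn(), sleeping a jittered, capped exponential delay between tries."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: spread retries uniformly over the backoff window so
            # thousands of components don't retry at the same instant.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)

# Example usage with a hypothetical authorization call:
# call_with_backoff(lambda: client.authorize(resource_id))
```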
We are marking the issue as resolved. The system works as expected with healthy metrics.
The Public API (Admin API) has been brought up too. While all the metrics for the service look healthy, we will continue to monitor them.
We have identified the issue, and the Twingate system has looked healthy since 1:48 pm PST. We are still monitoring. The Public API is being kept down for the time being.
We are continuing to investigate this issue.
Twingate engineering is still working on identifying the root cause of the issue. We'll continue to provide updates as we find out more. - The Public API is disabled. - Admin and logins should work. - We are still seeing issues with authorization.
Twingate Engineering is fully engaged and we are still investigating the issue. We'll provide further updates as soon as we can.
We are currently investigating this issue.
Report: "Twingate Admin Impacted"
Last update: The Twingate Admin console is fully functional again.
The Twingate Admin console is currently broken. We have identified the issue and are rolling out the fix.
Report: "Twingate controller service impacted"
Last update: **Summary**
On September 21 at 4:14 am UTC, Twingate released a new controller version as part of improvements to the authorization engine. This new release contained both code changes and a data migration. The change caused an unexpectedly significant increase in load on the system, which only fully manifested itself several hours later once cached data started expiring at approximately 7:54 am UTC. At this time, Twingate customers started to see issues initiating access to resources and login failures.

Based on initial evidence around intermittent responses and increased network latency, our initial suspicion was that the failures were related to infrastructure problems. Increasing backend application capacity and other efforts to mitigate these issues were not successful. These efforts, combined with information we received from our cloud vendor support team, led us to shift our focus away from infrastructure issues and towards investigating the application layer. Our next step, at approximately 10:00 am UTC, was to roll back the recent software changes to the controller, including the associated data migration that was performed as part of this update. This software rollback task, which also incorporated a supervised data migration rollback, was completed at approximately 10:30 am UTC.

After rolling back the software changes and data migration, we observed improvements in the behavior of the system. Both network latency and cache hit ratios were much improved, but not back to normal operational levels. Continuing to investigate the issue, with software and data migration rolled back to a known state, we initiated the process to fail over to our standby cluster to fully rule out any infrastructure issues. At 11:24 am UTC we initiated the failover process to our standby cluster, which completed at 11:27 am UTC. At this point the system fully recovered with normal operational metrics.

**Root cause**
After detailed investigation, we identified two separate issues in the application layer that interacted with each other. First, a code bug caused the system to re-evaluate the permissions of all our users at the same time, causing a significant load that saturated the system. Second, the data migration process failed to replace existing cached values, which led to failed requests at the application layer. This second factor only became apparent as existing cached data expired.

Failing over to our standby cluster was only effective after the data migration and software changes were reversed. This is because the cache was empty at the time that the standby cluster was brought online. Due to the nature of the software bug, data migration, and caching interactions, performing cluster failover earlier in the incident would have replicated the same problem on our standby cluster.

**Corrective actions**
Upon postmortem investigation, we also noticed that certain metrics were available that could have allowed us to detect similar issues before they fully impact the entire system. This early warning mechanism could potentially have caught this issue earlier, preventing the faulty code change from reaching our production environment. We have initiated a number of improvements:
* Short-term
  * We are increasing monitoring and alerting coverage for the performance of the authorization engine.
  * We are continuing our efforts to compartmentalize our controller application, so degradation in one part of the system doesn't impact the whole system.
  * We are writing an integration test to simulate this exact issue.
  * Based on log analysis, we concluded we should update our incident protocol to immediately turn on "read-only" mode when an incident occurs to improve client and connector offline behavior.
  * We are reinforcing the engineering team's use of feature flags and dark launches for new features and data migrations.
* Medium/Long-term
  * Based on data collected in this incident, we have identified areas where we can improve the behavior of our client and connector applications to better handle similar situations and allow connectivity even under controller downtime.
This incident has been resolved.
We have restored service for all customers but are verifying each of our internal services in turn and checking for any residual issues.
Although we identified a problem earlier, it appears that it was not the root cause of the issue. We are serving some customer requests, but there remains an ongoing impact to service availability that we are investigating.
We are still working on the fix. It is taking a bit longer than we expected.
The issue has been identified and a fix is being implemented. The system is slowly recovering. We will keep you updated.
We are working with our cloud provider to identify the issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Delay in display for network events"
Last update: This incident has been resolved.
Our infrastructure is running behind, which is causing delays in the display of network events in the Admin console. No data has been lost, and the system should catch up shortly.
Report: "Twingate Docs down"
Last update: This incident has been resolved.
The Twingate docs site is up now, but we are still monitoring the issue as our docs hosting provider hasn't yet released an update on the resolution.
Twingate docs site (docs.twingate.com) is down due to issues with our vendor.
Report: "download or update of client/connector packages"
Last update: This incident has been resolved.
Our third-party repository provider is recovering.
The issue is due to an incorrect infrastructure upgrade by a third party.
We are investigating an issue with our third-party repository provider for Linux packages. While it is not impacting Twingate clients or connectors it does appear to be preventing download or update of client/connector packages.
Report: "www.twingate.com down - no service impact"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Our service provider still reports an issue with their network; however, our website came back up and has been operational. We'll continue to monitor.
Due to an issue with our service provider, our homepage (www.twingate.com) is down. There should be no impact to service for our customers.
Report: "Twingate ingress partially unavailable"
Last update: Due to unexpected delays associated with a planned ingress update, Twingate's controller was partially unavailable for ~2.5 minutes between 10:15 pm and 10:18 pm PST. During this short window, no existing network connections were interrupted. New connections were blocked during this period. Ingress changes are rare and no additional changes are planned. Looking ahead, we will schedule a brief downtime if future ingress updates are necessary.
Report: "Cloud provider multiple services degraded"
Last update: The incident with our cloud provider (Google) is now resolved, and our services in europe-west2 (London) are fully operational.
Our cloud provider (Google) is continuing to experience an outage in the europe-west2 (London) region. This is not directly impacting the Twingate service, as our Clients and Connectors have built-in redundancy by connecting to multiple regions, but users in that region may experience slowness. We are monitoring the incident and will provide updates as we find out more.
Our cloud provider (Google) is experiencing an incident affecting multiple services in their europe-west2 (London) region. Our service spans multiple regions, but users may experience intermittent slowness.
Report: "Documentation site is unavailable in some regions"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified by our vendor as a CDN issue; we will continue to monitor and provide updates.
Report: "Twingate Controller outage"
Last update: **Summary**
On June 3rd at 3 AM UTC, Twingate started a regular Kubernetes upgrade on its main cluster. This maintenance is usually done once a month and had been performed successfully many times prior to this upgrade. It includes a version upgrade of the cluster followed by a version upgrade of the node pools, completed one at a time. Around 4:01 AM UTC, HTTP 502 errors started on our cloud load balancer instance, indicating an issue with the service. While these errors were a small portion of the overall volume of requests at first, around 4:15 AM the system became fully overloaded and it turned into a full outage. Shortly after, we failed over to our standby cluster in a different region of our cloud provider, but saw the same issue happening on our standby cluster too. We downgraded our active cluster's node pools to the previous Kubernetes version. This added extra capacity, and we then failed back to our active cluster at 5:05 AM UTC. Recovery started immediately and the service was fully recovered at 5:10 AM UTC.

**Root cause**
After a detailed investigation, we found that during the upgrade, network connectivity between internal components was not stable, triggering failures and retries. As a result of our application being overloaded, we failed to answer load balancer health checks, which caused the 502 errors. We are working with our cloud provider to analyze why the network instability happened during the upgrade.

**Corrective actions**
We have initiated a number of improvements:
* Completed: We increased our main application capacity, tuned application and network settings between the mentioned services, upgraded our in-memory key-value store, and added a PDB (pod disruption budget).
* Short-term: We will continue to tune the application and network settings between various components of Twingate. We found a bug in how our client handles 502 errors and are working on making the client handle them better.
* Medium-term: We are looking into two major changes: 1) implementing circuit breaker functionality so our main application can stay up when a downstream service goes down, and 2) implementing a multi-region active-active setup on our cloud provider, which will enable us to better control Kubernetes upgrades (as well as other code and configuration changes).
We are marking the incident as resolved. We will provide post-mortem notes as soon as we have them.
After reverting the Kubernetes version and failing back to our previously active cluster, we see the Twingate service has recovered. We continue to monitor.
During a planned Kubernetes version upgrade, our application started to fail. We failed over to our standby region/cluster, but it has the same issue. We are downgrading the Kubernetes version and continuing to work on the issue.
Controller is currently experiencing an outage. Our team is investigating the issue.
Report: "Issue with Twingate"
Last update: **Summary**
At approximately 15:31 UTC on May 13, 2022, we received alerts from our monitoring systems pointing to a problem with Twingate. Our cloud provider's load balancer started to return 502 (Bad Gateway) errors due to issues with our backend system. Looking into our backend logs, we noticed only 10-15% of requests were being handled properly and decided to restart our application pod in our Kubernetes cluster. Once the backend application pod restarted, the load balancer stopped returning 502 errors and things returned to normal around 15:44 UTC.

During the outage, both our private and public APIs were affected. These APIs are used to drive most of the functionality that end users and administrators experience in Twingate. Specifically, this means that customers' admin consoles were not accessible, the public API was not responsive to requests, and Clients and Connectors were unable to initiate authentication. Existing connections continued to function as a part of our reliability efforts completed in Q1 of 2022 (provided that the Clients and Connectors were running the latest versions). With this, we recommend that all of our customers upgrade their Clients and Connectors as soon as they can.

**Root cause**
After a detailed investigation, we found potential network glitches that caused connectivity issues and higher latency with different and unrelated parts of our system. While some components self-healed (i.e., our Redis instance), our main backend application was impacted. This was due to much higher latency associated with a 3rd-party service we use, leading to connection saturation of our API layer and the resulting rejection of additional requests, which manifested as 502 errors to the requestor.

**Corrective actions**
In order to mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements:
* Completed: We increased CPU and memory reservations for our backend application and relay pods. We decreased the connection timeout threshold for the third party so it doesn't cause connection saturation again.
* Short-term: We are working on adding more metrics and enabling more logging to help with investigation and post-mortem analysis in the future.
* Medium-term: While we already had some circuit breaker capabilities and flags to turn off certain features, we will look for a complete service mesh solution with circuit breaker capabilities that should keep upstream applications and APIs running when issues and latencies arise for downstream dependencies.
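The circuit-breaker corrective action can be illustrated with a minimal sketch. This is the generic pattern only, not Twingate's implementation or the service mesh they ultimately chose; the wrapped function at the end is hypothetical. After a few consecutive failures, calls to the degraded dependency are rejected immediately for a cool-down period instead of tying up API workers on timeouts.

```python
# Minimal circuit-breaker sketch: fail fast on a degraded downstream dependency.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: downstream dependency unavailable")
            # Cool-down elapsed: allow one trial call ("half-open" state).
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit and resets the count
        return result

# Example: wrap a slow 3rd-party lookup so the API layer fails fast when it degrades.
# breaker = CircuitBreaker()
# profile = breaker.call(fetch_third_party_profile, user_id)  # hypothetical function
```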
We are marking this issue as resolved. The impact was between 8:31 am and 8:44 am. We will add the post-mortem to the incident as soon as we have it ready.
The system is fully up. We will continue to monitor.
We are continuing to investigate this issue.
We have seen improvements; we are monitoring the situation. We'll update as we find out more details on the issue.
We are currently investigating the incident.
Report: "Issue connecting to Twingate"
Last update: **Summary**
At approximately 02:26 UTC on January 19th, we observed an increase in latency between our API layer and our backend database system. Within a few minutes, this spike in latency developed into an outage that resulted in 90% of requests returning one of two responses to the requestor: either a 500 (Internal Server Error) or a 502 (Bad Gateway Error), depending on where the error in our system occurred. These error conditions were caused by timeouts occurring between our API layer and the database and persisted until approximately 03:36 UTC.

During the outage, both our private and public APIs were affected. These APIs are used to drive most of the functionality that end users and administrators experience in Twingate. Specifically, this means that customers' admin consoles were not accessible, the public API was not responsive to requests, Clients and Connectors were unable to initiate authentication, and existing connections were eventually dropped without the ability to re-authenticate.

**Root cause**
The root cause of the issue is attributed to significant degradation in database performance due to a spike in CPU utilization, which increased latency across the system. The consequence of this increased latency was that even though our API layer was available to respond to requests, requests were taking significantly more time, leading to connection saturation of our API layer and the resulting rejection of additional requests, manifested as 500 or 502 errors to the requestor.

**Corrective actions**
In order to mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements:
* **Completed:** We have doubled the master database cluster server size in order to prevent utilization spikes from disrupting our ability to continue to serve requests.
* **Short term:** We are working on introducing zonal database read replicas, which will improve the distribution of system load and will also remove the master database as a single point of failure. These improvements will also allow our service to maintain partial connectivity in situations when the master database is unavailable.
* **Medium term:** We are implementing changes to Client connection session management to maintain connectivity in cases when backend services are unreachable. This will introduce an additional layer of resiliency to our system beyond the changes described above.
We are continuing to monitor the system, and it remains stable and available. We are closing out this incident and we will follow up with a post mortem here.
We have re-established connectivity and Twingate services have been restored. We are continuing to monitor our systems.
Our engineers have isolated the problem to a network connectivity issue between our application servers and our database infrastructure. Our team is working to restore network connectivity and we will continue to post regular updates.
We are aware of an incident affecting our production system and are currently actively investigating the issue. We will be posting regular updates pertaining to this incident.
Report: "Twingate MFA challenge page is slow to load"
Last update: We are closing this incident as the MFA challenge page has been operating normally since our last update.
We have switched to our backup CDN and will continue to monitor the system.
We have identified the issue and are currently switching to our backup CDN.
We are currently investigating degraded performance of loading the MFA page of Twingate. Customers are experiencing slower page load times due to a CDN issue but services are still functional and available. We will be posting regular updates pertaining to this incident.
Report: "Inbound requests have heavily downgraded availability"
Last update: **Components impacted**
* Controller

**Summary**
At approximately 17:06 UTC on December 13th, we observed an increase in latency between our API layer and our backend database system. Within a few minutes, this spike in latency developed into an outage that resulted in 90-95% of requests returning one of two responses to the requestor: either a 500 (Internal Server Error) or a 502 (Bad Gateway Error), depending on where the error in our system occurred. These error conditions were caused by timeouts occurring between our API layer and the database and persisted until approximately 19:08 UTC.

During the outage, both our private and public APIs were affected. These APIs are used to drive most of the functionality that end users and administrators experience in Twingate. Specifically, this means that customers' admin consoles were not accessible, the public API was not responsive to requests, Clients and Connectors were unable to initiate authentication, and existing connections were eventually dropped without the ability to re-authenticate.

**Root cause**
The root cause of the issue was a temporary loss of connectivity and increased network latency in our cloud service provider between our API layer and backend database. The consequence of this increased latency was that even though our API layer was available to respond to requests, requests were taking significantly more time, leading to connection saturation of our API layer and the resulting rejection of additional requests, manifested as 500 or 502 errors to the requestor.

**Corrective actions**
In order to mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements to isolate the impact of backend service disruptions from end user connectivity. These projects include decoupling our backend database from the Controller, scaling cross-regional database replicas for additional resiliency, and implementing changes to user connection behaviors to maintain connectivity in cases when backend services are unreachable.
We are continuing to monitor the system, and it remains stable and available. We are closing out this incident, and we will continue to post updates and follow up with a post mortem here.
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 09:00 PST / 17:00 UTC.
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 21:00 PST / 05:00 UTC.
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 17:00 PST / 01:00 UTC.
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will continue to post additional updates regularly.
We are continuing to monitor the system, and we are still investigating the root cause of the outage. We will continue to post additional updates regularly.
Inbound requests are now being accepted and the service is operational again. We have verified that all operational tests are succeeding. We are continuing to investigate to determine the root cause of this incident.
We are continuing to investigate this issue. We have narrowed the source of the problem to the public-facing frontend servers that handle requests inbound to our service. As a result, this is broadly affecting our public API, the private API calls used by Clients and Connectors, and our web interface, resulting in heavily downgraded response availability across our service. We are still trying to identify the root cause at this time and will continue to post regular updates.
We are continuing to investigate this issue. We will be posting regular updates pertaining to this incident.
We are continuing to investigate this issue. We will be posting regular updates pertaining to this incident.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Admin console billing functionality is temporarily unavailable"
Last update: **Components impacted**
* Admin console

**Summary**
At 18:59 UTC we received an automated alert that requests to our 3rd-party billing system (Chargebee) were failing. A few minutes later we confirmed that this was causing the Twingate admin console interface to fail to load with a 500 error, impacting all customers. At 19:15 UTC our engineering team submitted a hotfix to resolve the issue and update our production systems. The fix was deployed at 19:33 UTC and all customers were able to access the Twingate admin console with billing functionality disabled. We spoke to Chargebee at 20:00 UTC and they confirmed the issue on their side. At 20:47 UTC Chargebee confirmed that the outage was resolved in their system and the incident was closed.

**Root cause**
The Twingate admin console relies on Chargebee API access in order to load billing information specific to the particular customer account. This API call is made when the Twingate admin console loads in the browser. The underlying 3rd-party API returned a 503 (Service Temporarily Unavailable) error, which was not captured as an exception in the error object returned by the 3rd-party API library. This led to an uncaught exception, which caused the admin console to fail to load with a generic 500 (Internal Server Error).

**Corrective actions**
We have updated the admin console logic to capture this type of exception from Chargebee and ensure that the admin console will continue to load. We will be taking the following actions:
1. Audit all 3rd-party calls in the admin console to ensure that all exceptions are caught and do not result in the admin console being unavailable.
2. Update the billing behavior specifically to incorporate an unavailability message.
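The corrective action can be illustrated with a short, generic sketch; this is not Twingate's code or Chargebee's SDK, and the billing fetch function and return shape are placeholders. The point is that a failing 3rd-party billing call should degrade to an "unavailable" message rather than take the whole console down.

```python
# Illustrative sketch: never let a failing 3rd-party billing call break the console.
import logging

logger = logging.getLogger(__name__)

def load_billing_summary(fetch_billing_fn, account_id: str) -> dict:
    """Return billing data for the console, or a 'temporarily unavailable' stub."""
    try:
        return {"available": True, "billing": fetch_billing_fn(account_id)}
    except Exception:
        # Catch *any* failure from the vendor library (503s, timeouts, parsing
        # errors) so the console still renders, just without the billing panel.
        logger.exception("billing provider unavailable for account %s", account_id)
        return {"available": False, "message": "Billing is temporarily unavailable."}
```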
Our 3rd party billing system is now available and we are marking this incident as resolved. We will follow up with a post mortem update.
We have received confirmation from our 3rd party billing provider that they are aware of their system outage and are working on a fix. We will resolve this incident when the 3rd party system is available. Until this incident is marked as resolved, no billing functionality will be available, but all other functionality remains unaffected.
A fix has been deployed and the admin console is now available to all customers. We'll continue to monitor the system for the next 30 minutes before marking this incident as resolved.
The root cause of the issue has been identified and a fix is in progress. We will post another update shortly.
The admin console is currently experiencing an outage caused by our 3rd party billing system being unavailable. Our team is investigating the issue and working on a fix currently. This issue is isolated to loading the admin console and does not affect any resource access or end user authentication.
Report: "Okta connectivity issue"
Last update: Customers using Okta should now be able to log in without issue. We understand that the problem was related to an internet connectivity issue unrelated to Twingate that is now resolved.
We are observing that Okta appears available again. Customers that use Okta as their Identity Provider should try to log in again if they have experienced trouble logging in.
We are aware of and investigating reports that customers using Okta for authentication are unable to log in. It appears that customers' Okta domains are not reachable. At this stage, we are not aware of any issue affecting Twingate services itself and are working to restore access to those customers relying on Okta for authentication.
Report: "Service provider networking outage"
Last update: **Components impacted**
Controller
Relays

**Summary**
Twingate services were unavailable to service requests from approximately 17:48 to 20:09 UTC on November 16th. The result was that during this period of time, access to Twingate and protected resources was limited, existing connections were dropped, and new connections were refused. At the end of the period, normal service resumed. Remediation required that customers reconnect their Connectors in order to restore access to protected resources.

**Root cause**
Google Cloud Platform (GCP) deployed a configuration change in their infrastructure that caused all requests to return 404 errors ([GCP incident description](https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh)). Because Twingate relies on GCP infrastructure, access to the Twingate network and protected resources was impacted. GCP confirms that the incident was resolved as of 19:28 UTC. As GCP began to restore their service, impacted Twingate services automatically came back online. Currently, Twingate Clients and Connectors view 404 errors as unrecoverable states and thus did not automatically reconnect. Consequently, customers were required to restart their Connectors, and the Windows service on the Windows Client, to restore access.

**Corrective actions**
Automated monitoring alerted Twingate to the outage and our DevOps and on-call engineering teams started tracking the issue. Manual testing confirmed the outage, and additional investigation showed that other GCP customers were impacted. While traffic was being restored, systems indicated that Connectors did not automatically recover. For customers using our Managed Connectors, these were restarted at 20:50 UTC. We began notifying customers about the need to restart Connectors at approximately 19:00 UTC, and all customers were notified by 02:02 UTC on November 17th. Looking ahead, we plan to:
* Prioritize Client and Connector reconnection behavior and extend it to include all non-recoverable errors
* Introduce functionality to notify customers of Connector downtime via email notifications
We are marking this issue as resolved as our monitoring shows that our infrastructure is operating normally and Google Cloud Platform has resolved the incident on their network. We will be following up with a post-mortem shortly.
Google Cloud Platform has marked their Cloud Networking issue as resolved and has posted a status update: https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh We are continuing to monitor our infrastructure and will mark this incident as resolved when we are confident that everything has returned to normal.
We have verified that all of our infrastructure is fully operational at this time and will continue to monitor for any changes. Until our service provider (Google Cloud Platform) has closed their incident, we will leave this incident open in Monitoring status and provide regular updates as we receive them. Customers should verify that all of their Connectors are up and running if any Resources are inaccessible at this time.
We are continuing to monitor for any further issues.
We are continuing to monitor the status of the service. Customers may need to restart Connectors to restore connectivity to resources due to the nature of the networking outage.
The Twingate admin console is now accessible and the Twingate Controller is operational. Customers may need to restart Connectors to restore connectivity to resources due to the nature of the networking outage. The originating cause appears to be related to an outage in Google Cloud Platform's Networking service. Google Cloud Platform has opened an incident: https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh
We are continuing to investigate this issue.
We are investigating reports of an outage affecting Twingate. At this time we suspect the issue is affecting broader Internet services and is not isolated to Twingate. We will continue to post regular updates as we learn more.
Report: "Controller downtime"
Last update**Components impacted** Relay Controller **Summary** A physical hardware failure occurred in a node within one of our Eastern US Relay clusters at approximately 18:19 UTC. Connectors and Clients attached to this node automatically failed over to a new node. This failover process resulted in a partial outage of the Controller, which was only partially available to service requests from approximately 18:21 to 18:40 UTC. At the end of the period, normal service resumed with no remediation required. **Root cause** A physical hardware failure occurred in a single node within one of our Eastern US Relay clusters. Although the hardware was swapped out automatically by our service provider, the failure caused all Connectors attached to this particular Relay node to automatically fail over to a new Relay node, resulting in a flood of connection requests. This process proceeded normally; however, the volume of connection requests was sufficient in this particular instance to temporarily prevent the Controller from accepting new connection requests. This in turn resulted in additional reconnection requests, exacerbating the original problem. **Corrective actions** As soon as we received monitoring alerts, the DevOps and on-call engineering teams started triaging the issue. Additional nodes were started to handle the spike in connection requests, and the system was monitored as the request rate recovered; normal operation resumed at 18:40 UTC. Looking ahead, we have already taken, or plan to take, the following actions: 1. Add additional nodes and increase memory limits across the board to serve as an additional buffer for failover-based connection spikes. 2. Make changes to our heartbeat monitoring logic to increase overall resilience during transient traffic peaks. 3. Introduce changes to the Connector logic to maintain connections to multiple Relay nodes at all times, resulting in a flatter spike in failover re-connection requests. 4. Introduce additional resiliency in token issuance to prevent temporary spikes in connection requests from influencing otherwise healthy Clients and Connectors.
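As an illustration of corrective action 3 above, the following hedged sketch shows a connector keeping a warm standby Relay connection so that a node failure promotes the standby locally instead of every Connector requesting a new token from the Controller at once. The addresses, types, and function names are hypothetical and are not Twingate's actual Connector implementation.

```go
// Hedged sketch: dial a primary and a backup relay so failover does not
// require an immediate Controller round trip.
package main

import (
	"fmt"
	"net"
	"time"
)

type relaySession struct {
	addr string
	conn net.Conn
}

func dialRelay(addr string) (*relaySession, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return nil, err
	}
	return &relaySession{addr: addr, conn: conn}, nil
}

// connectWithStandby dials the first reachable relay as primary and the next
// as a warm standby; the caller switches to the standby on primary failure.
func connectWithStandby(addrs []string) (primary, standby *relaySession, err error) {
	for _, addr := range addrs {
		s, dialErr := dialRelay(addr)
		if dialErr != nil {
			continue
		}
		if primary == nil {
			primary = s
		} else {
			standby = s
			break
		}
	}
	if primary == nil {
		return nil, nil, fmt.Errorf("no relay reachable out of %d candidates", len(addrs))
	}
	return primary, standby, nil
}

func main() {
	// Hypothetical relay addresses for illustration only.
	primary, standby, err := connectWithStandby([]string{"relay-a.example.com:443", "relay-b.example.com:443"})
	if err != nil {
		fmt.Println("failover pool empty:", err)
		return
	}
	fmt.Println("primary:", primary.addr)
	if standby != nil {
		fmt.Println("warm standby:", standby.addr)
	}
}
```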
This incident has been resolved.
We are still investigating the root cause of the incident. We didn't find any issue on our side, and we are working with our cloud provider support team to investigate the matter further.
We are continuing to investigate this issue.
We are back and operational now. We are still investigating the root cause.
We are continuing to investigate this issue.
We are looking into it and will provide more information as soon as we have it.
Report: "Connector restart may be required"
Last updateAll admins with affected Connectors were notified.
Connectors older than v1.26.0 require a restart due to a database update. You can find Connector version information in the Connector detail page in the Twingate Admin console. We are currently in the process of contacting Twingate admins.
Report: "Controller downtime"
Last update**Components impacted** Controller **Summary** The Controller was unavailable to service new authentication requests from approximately 15:17 to 15:19 UTC. The result was that during this period, new connection requests were rejected. Existing connections were not impacted. At the end of the outage period, normal service resumed with no remediation required. **Root cause** Leading up to the start of the outage period, automated monitoring alerted us to spikes in memory usage. At approximately 15:16 UTC we introduced a change to our cluster that was intended to increase memory availability. At approximately 15:17 UTC, as this change was rolled out, it had the unintended consequence of decreasing service availability, resulting in the rejection of most requests. **Corrective actions** At 15:18 UTC, seeing the decrease in service availability, we reverted the change and simultaneously made additional hardware available to the cluster. Normal service resumed approximately 45 seconds later as the change propagated. Looking ahead, we plan to: 1. Investigate decoupling inbound requests from our backend, since the tight coupling between them is the likely cause of the memory spikes that triggered the change behind this outage.
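To illustrate the decoupling idea in the corrective action above, here is a minimal sketch, assuming a simple HTTP front end: a bounded queue and a fixed worker pool sit between inbound requests and the backend, so a burst is shed early with a 503 instead of accumulating unbounded per-request state in memory. The route, queue size, and pool size are arbitrary placeholders, not Twingate's architecture.

```go
// Hedged sketch: bounded queue + worker pool as a backpressure layer
// between request handlers and the backend.
package main

import (
	"fmt"
	"log"
	"net/http"
)

type job struct {
	w    http.ResponseWriter
	done chan struct{}
}

func main() {
	queue := make(chan job, 512) // bounded buffer decouples accept rate from processing rate

	// Fixed pool of backend workers; memory use is bounded by queue and pool size.
	for i := 0; i < 8; i++ {
		go func() {
			for j := range queue {
				// ... call the real backend here ...
				fmt.Fprintln(j.w, "ok")
				close(j.done)
			}
		}()
	}

	http.HandleFunc("/auth", func(w http.ResponseWriter, r *http.Request) {
		j := job{w: w, done: make(chan struct{})}
		select {
		case queue <- j:
			<-j.done // wait for a worker rather than growing per-request state
		default:
			http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The design choice here is that the queue gives explicit backpressure: when the system is saturated, excess requests fail fast and can be retried, rather than piling up and driving memory usage toward the limits that triggered this incident.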
This incident has been resolved.
The system is now confirmed as fully operational. We are working on an incident report and taking steps to ensure that this issue will not happen in the future.
We've resolved the immediate issue by adding additional processing capacity and increasing memory limits on our Controller infrastructure.
We've identified the issue, which is being caused by excessive memory usage on our infrastructure.
Report: "Controller downtime"
Last update**Components impacted** Controller **Summary** The Controller was partially unavailable to service requests from approximately 19:39 to 19:46 UTC. The result was that during this period, access to protected resources was limited, some existing connections were dropped, and new connections were refused. At the end of the period, normal service resumed with no remediation required. **Root cause** Leading up to the start of the incident was a planned maintenance period. The maintenance propagated a configuration change across our Relay clusters. Due to human error, that change was not applied sequentially, one cluster at a time, but was instead released to all of our US clusters in parallel. Once the configuration change was applied, it triggered reconnection requests from all active Clients and Connectors to our Relay infrastructure. As part of the reconnect process, Clients and Connectors needed to obtain new tokens from the Controller. At 19:39 UTC the spike of requests triggered our health-check system, which incorrectly determined that the Controller was misbehaving and required restarting. The frequent Controller restarts resulted in a decrease in service availability. **Corrective actions** As soon as the health-check system kicked in, the DevOps and on-call engineering teams started tracking down the issue. Logs and system metrics confirmed that, except for the health-check system, everything was performing well, so a decision was made to disable it. Seconds after disabling it, the system returned to a fully operational state. At 21:22 UTC a hot fix was deployed to the health-check system and it was enabled once again. Looking ahead, we plan to: 1. Only perform planned Relay maintenance operations that require connection migration outside of peak traffic hours. 2. Enforce a stricter limit on the number of parallel Relay cluster deployments. 3. Fix issues identified with our health-check system and improve our performance and stress testing to include more aggressive connection migration scenarios. 4. Update the Twingate status page immediately upon confirmation of an issue impacting customers.
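As an illustration of corrective actions 1 and 2 above, this hypothetical rollout helper applies a configuration change to Relay clusters strictly one at a time and gates each step on a health check, so a bad change cannot reach every US cluster in parallel. Cluster names, timeouts, and function names are placeholders and do not describe Twingate's actual deployment pipeline.

```go
// Hedged sketch of a sequential, health-gated rollout.
package main

import (
	"errors"
	"fmt"
	"time"
)

func applyConfig(cluster string) error { /* push the change to one cluster */ return nil }

func healthy(cluster string) bool { /* query the cluster's health endpoint */ return true }

func rolloutSequentially(clusters []string) error {
	for _, c := range clusters {
		if err := applyConfig(c); err != nil {
			return fmt.Errorf("apply failed on %s: %w", c, err)
		}
		// Wait for the cluster to settle before moving on; halt the rollout
		// entirely if it does not recover within the deadline.
		deadline := time.Now().Add(5 * time.Minute)
		for !healthy(c) {
			if time.Now().After(deadline) {
				return errors.New("rollout halted: " + c + " unhealthy after change")
			}
			time.Sleep(15 * time.Second)
		}
		fmt.Println("cluster updated:", c)
	}
	return nil
}

func main() {
	// Cluster names are placeholders for illustration.
	if err := rolloutSequentially([]string{"us-east", "us-central", "us-west"}); err != nil {
		fmt.Println(err)
	}
}
```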
This incident has been resolved. We will be posting a post mortem description shortly.
The Controller is currently fully available, and we are actively investigating the root cause of the issue.
The Controller infrastructure was experiencing degraded availability. The issue began at 19:39 UTC and continued until 19:47 UTC. Our team is currently investigating the root cause of the issue, and we will post additional updates here.
Report: "US East Coast Relay issue"
Last update**Components impacted** Relay Connector **Summary** On this date we had an outage during routine maintenance of our relay infrastructure. The issue started at 04:00 UTC and was resolved within 2 hours, requiring some customers to restart their connectors in order to re-establish connectivity to our relay infrastructure. **Root cause** In our investigation we determined that the connector received a malformed response from the relay cluster during its maintenance cycle. The response in question contained the address of the relay node to which the connector is instructed to connect. This malformed response resulted in the connector retrying access to a non-existent relay node without failing over to another relay cluster. **Corrective actions** After correcting the specific issue that caused the malformed response, we modified both the relay and connector logic so that failover now happens automatically any time a malformed response is received. We also modified our maintenance procedures to add additional health checks to prevent malformed responses. Finally, we took the opportunity to enhance the failover logic to incorporate multiple levels of relay redundancy in the connector's initial configuration, which it receives after authentication.
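For illustration, here is a minimal sketch of the failover behavior described above, assuming the connector receives an ordered list of candidate relay addresses after authentication: a malformed or unreachable address is skipped and the next candidate is tried, rather than retrying a non-existent node indefinitely. The addresses and function names are hypothetical, not Twingate's actual connector logic.

```go
// Hedged sketch: skip malformed or unreachable relay addresses and fail over
// to the next candidate in the redundancy list.
package main

import (
	"fmt"
	"net"
)

// pickRelay walks an ordered list of candidate relay addresses and returns
// the first one that both parses and accepts a connection.
func pickRelay(candidates []string) (string, error) {
	for _, addr := range candidates {
		// Reject malformed addresses (e.g. missing port or empty host) up front.
		host, port, err := net.SplitHostPort(addr)
		if err != nil || host == "" || port == "" {
			fmt.Println("malformed relay address, failing over:", addr)
			continue
		}
		conn, err := net.Dial("tcp", addr)
		if err != nil {
			fmt.Println("relay unreachable, failing over:", addr)
			continue
		}
		conn.Close()
		return addr, nil
	}
	return "", fmt.Errorf("no usable relay among %d candidates", len(candidates))
}

func main() {
	// Hypothetical addresses; the real list would come from the post-authentication configuration.
	relay, err := pickRelay([]string{"bad-response", "relay-1.example.com:443", "relay-2.example.com:443"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("connected to relay:", relay)
}
```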
The affected Relay cluster is now fully operational.
We are monitoring as our Relay cluster is coming back online. Any affected Connectors that did not automatically reconnect may require a restart in order to resolve any connectivity issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating the issue.
Report: "Mumbai (asia-south1-a) cluster is unavailable"
Last update**Components impacted** Relay **Summary** We encountered an issue during an upgrade to our relay cluster monitoring infrastructure. As a result, we were unable to bring the Mumbai regional cluster up during this maintenance window, and so it was left down. There was no customer impact as any connections were re-routed to another relay cluster. **Root cause** We determined that the root cause was a configuration error introduced during a deployment configuration upgrade. This was fixed and the cluster was brought back up during a low traffic period at the end of the day. **Corrective actions** We identified an issue in our CI/CD process that resulted in the initial misconfiguration, which has been corrected.
Relay cluster in Mumbai (GCP region asia-south1-a) is now fully operational.
Engineering has a fix in place. We are currently monitoring the cluster and expect it to be back up by 21:00 PST.
We have identified an issue with our Relay cluster in Mumbai (GCP region asia-south1-a) and are working to fix it.