Historical record of incidents for Twingate
Report: "Admin and Reports Issues"
Last update: Our cloud provider is having issues impacting certain parts of the Twingate service, specifically the Admin console, Reports downloads, and Network Dashboards. Everything else works as expected. We are looking into it.
Report: "Twingate Service Down"
Last update: This incident is resolved now. We'll provide a post-mortem as soon as we have details.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Twingate Service Down"
Last update: This incident is resolved now. We'll provide a post-mortem as soon as we have details.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Twingate Service Issues"
Last update: **Twingate Public RCA: March 21, 2025 Authentication/Authorization Incident**

**Summary**
On March 21, 2025, between 21:10 and 21:41 UTC, a subset of authentication and authorization requests to Twingate services experienced elevated error rates. The issue was mitigated by 21:41 UTC and services returned to normal operation.

**What Happened**
Twingate runs an active-active architecture across multiple Google Kubernetes Engine (GKE) clusters, balanced by a Global Load Balancer (GLB). At the onset of the incident, we observed anomalies in one of our clusters and, to protect overall system health, we proactively scaled down deployments in that cluster. This action shifted traffic to other clusters in the topology, primarily one that typically handles a lighter load and had been scaled accordingly. The target cluster began autoscaling as expected, but the increased traffic caused elevated error rates, which triggered our retry mechanisms across services. While these retries helped many requests succeed, they also increased the overall system load. Simultaneously, the cluster underwent a cloud provider-initiated update operation that caused pod restarts and reduced capacity. To stabilize the system, we reintroduced capacity in the previously affected cluster, rebalancing the traffic across regions. Once this occurred, error rates subsided and retries diminished.

**Root Cause**
Anomalous behavior from our cloud provider, including unexpected request timeouts at the load balancer level and instability during a cluster update, led to a cascade of retry traffic that temporarily overwhelmed parts of the system. We are actively investigating both the unexpected timeout configuration and the behavior of the cluster during the update with our cloud provider.

**Corrective actions**
Short-Term:
* Continue collaboration with our cloud provider to understand the root cause of the unexpected timeouts and cluster update impact.
* COMPLETED: Keep deployments in all regions with the same number of replicas in the HPA configuration.
* COMPLETED: Increase max node counts on cluster node autoscalers to give us more room to scale up.
* COMPLETED: Configure the HPA to scale up faster.
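The completed HPA actions above can be illustrated with a short sketch. This is a hypothetical example using the kubernetes Python client; the kubeconfig context names, HPA name, namespace, and replica count are placeholders, not Twingate's actual tooling or configuration. It patches each regional cluster's HPA so every region keeps the same replica floor:

```python
# Hypothetical sketch: keep the controller HPA's replica floor identical in every
# regional cluster, so a failover region is never left scaled down to a "light load"
# size. Context names, HPA name, and values below are illustrative only.
from kubernetes import client, config

CONTEXTS = ["gke-us-east", "gke-eu-west", "gke-asia-se"]  # assumed kubeconfig contexts
HPA_NAME = "controller-hpa"                               # assumed HPA name
NAMESPACE = "default"
MIN_REPLICAS = 12                                         # same floor everywhere

for ctx in CONTEXTS:
    api = client.AutoscalingV2Api(api_client=config.new_client_from_config(context=ctx))
    # Strategic-merge patch: raise minReplicas without touching the rest of the spec.
    api.patch_namespaced_horizontal_pod_autoscaler(
        name=HPA_NAME,
        namespace=NAMESPACE,
        body={"spec": {"minReplicas": MIN_REPLICAS}},
    )
    print(f"{ctx}: minReplicas set to {MIN_REPLICAS}")
```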
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Issues with Dashboards and Events Processing"
Last update: All internal alerts have cleared. You should no longer see delays or errors with dashboards and reports.
BigQuery service latencies have dropped and we are seeing improvements in our internal dashboards.
We are seeing higher latency than usual from our cloud provider's APIs used to process data for our dashboards and reports. The provider has confirmed the issue with their service. We expect this to be resolved soon.
Report: "Temporary Disruption in DNS Activity Reports"
Last update: This incident has been resolved.
Recent DNS Activity is now restored and we are monitoring the system.
We're currently experiencing a temporary disruption with the DNS Activity Reports. The team is working on restoring access, and we'll have them back up shortly.
Report: "Jamf Device Integration Issue"
Last update: The issue with syncing trusted devices from Jamf has been fixed.
This incident is still being investigated. It started after a recent upgrade performed by Jamf. We have reached out to Jamf about the issue and are working with them on a resolution.
We're seeing increased errors when syncing trusted devices from Jamf. We're currently investigating.
Report: "DNS Events Ingestion Issue"
Last update: This incident has been resolved.
DNS events are being ingested again and we're monitoring the system.
We've identified the issue and are working on a fix.
We've detected an issue with DNS events ingestion affecting some of our customers, who are not able to see DNS events when exporting to S3. We're currently looking for the root cause.
Report: "GitHub Social Login is Down"
Last update: This incident has been resolved.
GitHub is currently experiencing an incident affecting its authentication: https://www.githubstatus.com/incidents/kz4khcgdsfdv. Social login via GitHub is therefore temporarily unavailable until GitHub resolves the issue.
Report: "Secure DNS Dashboards Issue"
Last update: Dashboards are back to normal.
Dashboards are gradually coming back online.
Issue has been identified and we're working on a fix.
We are continuing to investigate this issue.
We are seeing issues with Secure DNS (Internet Security) Dashboards loading. We are investigating.
Report: "Issue with S3 Sync of Internet Security Events"
Last update: This incident has been resolved.
There was an ingestion delay, which has been fixed. We see logs flowing again.
We are having an issue with S3 Sync of Internet Security events. The team is engaged and the issue is being investigated.
Report: "Issue with S3 Sync of Internet Security Events"
Last update: The fix has been pushed to production and we have confirmed that Internet Security events are now syncing to S3 buckets again.
We have identified the issue and are working on a fix. We will update once the fix is deployed.
We are continuing to investigate this issue.
We are seeing issues with AWS S3 Sync of Internet Security events. We are currently investigating.
Report: "MFA Incident"
Last update: During the incident, some of our customers' MFA tokens became invalid. Fewer than 15 of our customers were impacted by this incident. We identified the issue and will make this key rotation safer going forward so that it doesn't impact any of our customers. Impact duration: 12:47 - 14:40 UTC.
Report: "Database Connection Issues"
Last update: **Components impacted**
Management: Public API
Management: Admin Console

**Summary**
On June 26, 2024, between 20:16 and 20:24 UTC, Twingate's SQL proxies restarted, causing a brief failure for a small percentage of calls made to our Public API (Terraform, Pulumi, k8s Operator, etc.) and to the Admin console. There was no impact to Clients or Connectors. A change to our SQL proxy deployments that was targeting staging and development environments was pushed to production due to a misconfiguration, causing our SQL proxy instances to restart.

**Root cause**
Due to a misconfiguration, a change to our SQL proxy deployments intended for staging and development environments was pushed to production, causing them to restart.

**Corrective actions**
Short Term:
* Ensure that SQL proxy deployments are only pushed in a controlled manner by resuming the GitOps workflow manually.
* Fix the misconfiguration in our GitOps deployment mechanism for our SQL proxy deployments and set the Helm chart version to a static value so that all upgrades are done in a controlled manner.
* Enhance our SQL proxy Helm chart to reduce the impact to services during updates and upgrades.
On June 26, 2024, between 8:16 pm UTC and 8:28 pm UTC, Twingate experienced several database connectivity alerts due to a failed rollout of one of its components. The rollout was promptly reversed, and our existing reliability measures prevented any major disruption to customer traffic.
Report: "Recent DNS Activity Unavailable"
Last update: This incident has been resolved.
Recent DNS Activity screen on the Admin console is back up.
Recent DNS Activity screen on the Admin console isn't available.
Report: "Logins with Github Not Working"
Last update: **Components impacted**
Control Plane: Authentication

**Summary**
On April 30, 2024, between 15:44 and 18:38 UTC, users were unable to log in to Twingate through GitHub. The Twingate Engineering team investigated the issue upon receiving support communications and found that logins with other identity providers were functioning normally, but GitHub logins were not working as expected. The problem was traced back to a software rollout that had inadvertently impacted GitHub logins. Engineering was able to create a fix and roll it out at 18:38 UTC, which restored GitHub logins to normal functionality. The team is investigating better ways to identify and prevent these types of issues from reaching production and to address them as quickly as possible if they ever arise again.

**Root cause**
A recent package upgrade introduced a bug that impacted Twingate logins via GitHub.

**Corrective actions**
Short-term:
* Increase our testing coverage to allow early detection of login issues for all supported identity providers in all environments. This will aid in early detection when software is deployed to lower, non-production environments.
* Improve alerting for login issues.
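A hedged sketch of what broader identity-provider login test coverage could look like. The staging URL, provider slugs, and login path below are assumptions made for illustration, not Twingate's actual endpoints; the test only checks that each provider's login entry point still issues a redirect toward the IdP.

```python
# Hypothetical smoke test for the corrective action above: exercise the first leg of
# the OAuth/OIDC login flow for every supported identity provider in a pre-production
# environment, so a broken provider integration is caught before production rollout.
import pytest
import requests

STAGING_BASE = "https://staging.example-tenant.test"   # assumed test tenant
PROVIDERS = ["github", "google", "okta", "azuread"]    # assumed provider slugs

@pytest.mark.parametrize("provider", PROVIDERS)
def test_login_redirects_to_identity_provider(provider):
    # The login entry point should redirect toward the IdP's authorize endpoint
    # rather than returning an error page.
    resp = requests.get(
        f"{STAGING_BASE}/auth/{provider}/login",  # assumed login path
        allow_redirects=False,
        timeout=10,
    )
    assert resp.status_code in (301, 302, 303, 307), (
        f"{provider} login did not redirect; got {resp.status_code}"
    )
    assert "Location" in resp.headers, f"{provider} login returned no redirect target"
```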
We have successfully rolled out the hotfix for the issue with GitHub logins.
A software rollout has broken logins with GitHub. We are working on rolling out a hotfix.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Recent DNS Activity Unavailable"
Last update: **Components impacted**
Management: Admin Console

**Summary**
On April 20, 2024, between 5:32 GMT and 6:57 GMT, Recent DNS Activity on the Admin Console became unavailable. Shortly after the incident began, the Twingate on-call team received alerts regarding abnormal database activity. Workers on the clusters that manage DNS filtering logs started seeing errors from the logs API, leading to excessive retries and database writes. To mitigate the issue, the DNS Log Streaming workers were temporarily disabled. The root cause was identified as a malfunction in the DNS Filtering Log API caused by a problematic dependency upgrade. Consequently, viewing DNS filtering logs and analytics in the Admin Console was temporarily unavailable. A rollback of the update was issued, and normal operations were restored at 6:57 GMT, after which DNS filtering logs and analytics were available in the Admin Console.

**Root cause**
The DNS Filtering Log API went down due to a bad dependency upgrade.

**Corrective actions**
Already completed:
* Rectified the Admin Console's infinite retry logic by enhancing the retrieval of DNS activity logs during error states.
* Optimized DNS Log Streaming retry and database write procedures to reduce unnecessary operations when no events are returned from the DNS Filtering API.

Short-term:
* Improve the dependency upgrade process for the DNS Filtering API.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Recent DNS Activity on the Admin console is unavailable. We have identified the issue and are working on a fix.
Report: "Empty DNS filtering logs"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We’re currently investigating an issue where DNS filtering logs are sometimes empty.
Report: "Database Connection Issues"
Last update: **Components impacted**
Control Plane: Authn, Authz, Connector Heartbeat
Management: Admin Console, Public API, Identity Providers Sync

**Summary**
On Dec 11, 2023, between the hours of 5:02 pm UTC and 6:24 pm UTC, Twingate received network connectivity alerts at 3 distinct times, each for a few minutes. After investigation, it was identified that a change in our application had caused our connections to exceed the allowed maximum. We reverted the changes at 6:24 pm, after which the increased connections receded and our connection counts returned to healthy levels.

**Root cause**
The Twingate database connections exceeded the maximum allowed.

**Corrective actions**
Already Completed:
* Adjusted capacity to keep the number of connections at a healthy number

Short Term:
* Increase the maximum number of connections allowed
* Add alerting for connection utilization
Summary: On Dec 11, 2023, between the hours of 5:02 pm UTC and 6:24 pm UTC, Twingate received network connectivity alerts at 3 distinct times, each for a few minutes. After investigation, it was identified that a change in our application had caused our connections to exceed the allowed maximum. We reverted the changes at 6:24 pm, after which the increased connections receded and our connection counts returned to healthy counts. (See Postmortem for more details)
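As an illustration of the connection-limit corrective actions above, here is a minimal sketch (not Twingate's code) using SQLAlchemy. The connection string, pool sizes, and alert threshold are placeholders; the point is that each application instance gets a hard per-instance cap and exposes its pool utilization so alerting can fire before the database-wide maximum is hit.

```python
# Minimal sketch: cap connections per instance and expose pool utilization for alerting.
from sqlalchemy import create_engine

POOL_SIZE, MAX_OVERFLOW = 10, 5  # hard ceiling per instance = POOL_SIZE + MAX_OVERFLOW

engine = create_engine(
    "postgresql+psycopg2://app:secret@db.internal/appdb",  # placeholder DSN
    pool_size=POOL_SIZE,
    max_overflow=MAX_OVERFLOW,
    pool_timeout=30,      # fail fast instead of queueing forever when saturated
    pool_pre_ping=True,   # drop dead connections after proxy/database restarts
)

def pool_utilization() -> float:
    """Fraction of this instance's connection budget currently checked out."""
    return engine.pool.checkedout() / (POOL_SIZE + MAX_OVERFLOW)

# A metrics exporter could publish pool_utilization() and page on sustained values
# above ~0.8, well before the server-side connection limit is reached.
```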
Report: "Twingate Service Incident - Aug 19, 2023"
Last update: **Summary**
On August 19 at 7:51 AM UTC, Twingate received alerts of issues with the login services. Within a few minutes, the Twingate engineering team began investigating. The team quickly identified that our backend was seeing excessive timeouts from a 3rd-party API, preventing it from being able to process other requests such as authentication. After some initial fixes were unsuccessful, Twingate contacted the 3rd party and also disabled support for real-time updates that make use of these specific 3rd-party API calls. As a result, the issues started resolving at 8:10 AM UTC. Most of the services recovered quickly and full resolution occurred at 8:15 AM UTC. The vendor later confirmed and fixed the issue, and Twingate re-enabled the real-time update feature shortly after on the same day, August 19.

**Root cause**
The Twingate backend was exhausted due to timeouts from a 3rd-party API.

**Post-incident Analysis**
Twingate had already separated out most services to their own deployments, allowing those services to function throughout the incident. Therefore, only some users that needed to authenticate or re-authenticate were affected; any user that had authenticated prior to the incident was not impacted. Analysis of logs post-incident showed that the incident started at 7:49 AM UTC and fully recovered at 8:15 AM UTC.

**Corrective actions**
Short Term:
* Separate Authentication and real-time services to their own deployments - COMPLETED

Medium / Long Term:
* Reevaluate and optimize timeout values for various backend and 3rd party services
* Simplify the internal Twingate process for enabling and disabling features
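The timeout corrective action can be illustrated with a small sketch. This is a generic example, not Twingate's backend code; the vendor URL and timeout budgets are assumptions. The idea is that every outbound 3rd-party call carries an explicit, short timeout so a slow vendor cannot hold worker capacity long enough to starve unrelated requests such as authentication.

```python
# Illustrative sketch: explicit timeouts plus graceful degradation for a 3rd-party call.
from typing import Optional
import requests

CONNECT_TIMEOUT = 3.05  # seconds to establish the TCP/TLS connection
READ_TIMEOUT = 5.0      # seconds to wait for the vendor's response

def fetch_realtime_update(session: requests.Session, resource_id: str) -> Optional[dict]:
    try:
        resp = session.get(
            f"https://vendor.example.com/v1/updates/{resource_id}",  # placeholder URL
            timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
        )
        resp.raise_for_status()
        return resp.json()
    except (requests.Timeout, requests.ConnectionError):
        # Degrade gracefully: skip the real-time update rather than blocking the
        # request path that also serves authentication traffic.
        return None
```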
This incident has been resolved. We'll publish the RCA as soon as we can.
We are continuing to investigate this issue.
We are seeing issues with the Twingate service and are investigating.
Report: "Admin Console Authentication Issues for tenants with JumpCloud IDP integration"
Last update: Admin Console authentication was broken for our tenants that use the JumpCloud IdP integration. This was due to a bug in the authentication flows of our latest software, which was deployed at 15:24 UTC on June 21st. Resolution was accomplished by reverting to the previous software at 12:42 pm UTC on June 22nd. We will add more tests and monitoring/alerting for the JumpCloud/SAML integration to avoid this in the future and to detect it faster.
Report: "admin console and authentication incident"
Last update: **Summary**
On June 7th at 17:08 UTC, a new version of our Controller software was rolled out. Shortly after the rollout completed, our on-call team received automated exception alerts and began investigating. The new version had inadvertently included changes to clean up our database that were out of sync with the deployed code, and the team decided to roll back the new software deployment. The rollback proceeded smoothly; however, the previous code version was missing the database fields that had been cleaned up, and the incident started at 17:24 UTC. A fix was prepared and rolled out starting at 17:29 UTC. Deployment completed on the first cluster at 17:38 UTC and proceeded to the remaining clusters once we verified that the error state had been resolved. Deployment was completed on all clusters at 17:48 UTC.

**Post-incident Analysis**
We initially thought the incident had only impacted Admin Console users; however, the following systems were impacted:
* Admin Console sign-in.
* Client initial authentication and re-authentication requests.
* Linux and container-based Connectors.

This incident exposed an issue that resulted in Connectors incorrectly shutting down on transient Controller unavailability.

**Root Cause**
An error in our deployment process logic led to a mismatch between deployed code and the database schema.

**Corrective Actions**
1. Improve our processes for merging software changes that are linked to database schema changes.
2. We have fixed and will be testing the bug in our Connector uptime / retry logic.
3. Improve overall build and rollout performance to be able to push fixes more promptly.
This incident has been resolved. It was due to an issue with a software update, and we are working on a plan to avoid this in the future. The authentication flow was also impacted; already authenticated sessions continued to function.
We are seeing issues with our admin consoles not loading properly. The issue is being investigated.
Report: "Issue in us-east4 region"
Last update: Our cloud provider has announced that the issue in the us-east4 region has been resolved. We will monitor a little longer before re-enabling traffic in that region.
Our cloud provider is having a networking issue in the us-east4 region that caused 500 errors for our customers hitting that region. Due to retry logic in our service, this shouldn't have caused issues for our customers. We diverted traffic to other regions, and no traffic is handled in the problematic region any more. We are working with our cloud provider to make sure the problem is completely resolved before we re-enable traffic for that region.
Report: "Relay2 (us-east4) and Relay4 (eu-west6) Issue"
Last update: **Components impacted**
Relay clusters in us-east4 (Ashburn, Virginia) and europe-west6 (Zurich, Switzerland)

**Summary**
We've recently been working on adding spot instance scaling to our Relay cluster infrastructure. On Feb 14th at 01:00 UTC, we initiated this upgrade process via a Terraform change to all of our Relay clusters globally, which took approximately 45 minutes to complete. At 01:40 UTC, we noticed a decrease in the number of connections in two clusters (us-east4 and europe-west6) and started an investigation. We also engaged our cloud infrastructure provider proactively to rule out a regional cloud provider issue. At 02:04 UTC, we disabled pods in the two affected clusters, which caused connected Clients and Connectors to re-connect to the next closest Relay cluster. Overall connection metrics were seen to be normal across the redistributed connections. On further investigation, we determined that an error in our Terraform configuration affecting network firewall rules had caused the issue. At 03:30 UTC we corrected the error and redeployed the affected clusters, which resolved the issue.

**Root cause**
An error in the deployed Terraform configuration removed a critical network tag, which was required to set the correct network firewall rules within our Relay cluster. The result was that Relay clusters were discoverable, but not reachable, leading to the deadlock state experienced by the Clients and Connectors attempting to attach to the affected clusters. This issue only affected two of our Relay clusters because of a configuration difference that was dependent on the overall sizing of these clusters. This ultimately hid the issue during testing in our staging environment because this cluster sizing difference, which in turn leads to different configuration outcomes, was not accurately reflected.

**Corrective actions**
We are taking the following short term actions, some of which are already completed, to avoid this problem in the future:
* Accurately reflect Relay cluster sizing and configuration differences in our staging environment.
* Auto-create plans for all environments, including all configuration variations in product, with feature development branches.
* Enhance Relay health checks to ensure clusters are non-discoverable to Connectors and Clients if the necessary network firewall tag is not in place.
* Research and implement staggered rollouts with Terraform for our Relay cluster infrastructure.

We also have medium term plans to add multi-cluster connectivity to our Connectors to handle regional Relay cluster problems automatically.
We have identified the issue and a fix has been implemented. Both relay clusters are healthy and processing requests.
We found issues with 2 of our relay clusters, one in the US (region: Virginia) and one in Europe (region: Zurich). While we are working on bringing them up, Connectors and Clients should have automatically reconnected to the other relay clusters, which may cause some slowness for customers closer to the affected clusters, since they now need to connect to more distant clusters.
We see issues with one of our relay clusters (US relay cluster 2) and are investigating.
Report: "Twingate Service Incident"
Last update: **Summary**
On January 24 at 19:58 UTC, our on-call team started to receive automated alerts regarding system performance degradation. The team began an investigation and, by 20:02 UTC, the degradation had escalated to a point where some Twingate Clients began to experience request timeouts. The Client behavior on request timeout is to initiate request retries, which triggered additional requests to our infrastructure. Due to the overall performance degradation, the increase in inbound requests overloaded the system to a point where internal health check requests also began to fail. This resulted in system components being marked as offline, further reducing the available capacity to respond to requests. Autoscaling of serving infrastructure did occur, but the increase in capacity was insufficient to remedy the system's overall decrease in performance, on top of the additional request workload.

Between 20:05 UTC and 20:45 UTC, we identified that the performance degradation was exclusively affecting our authorization engine, independent of other system capabilities. At 20:47 UTC, we promoted our physically separate standby cluster to share load with the existing cluster in an active-active mode. Both clusters began serving traffic at 20:48 UTC, and some improvement to authorization engine throughput was seen, but individual requests were still taking much longer than normal. Noticing that the authorization engine was experiencing a higher load from certain tenants, the team next separated these tenants' traffic to an isolated replica cluster in order to provide a surplus of processing bandwidth. System load returned to normal on the main cluster, and the traffic was gradually recombined between 21:13 UTC and 21:48 UTC. The system fully recovered at this point.

**Post-incident Analysis**
In our analysis across all tenant traffic during the incident, we determined that for tenants with the latest Connector and most up-to-date Client applications, less than 10% of users experienced any downtime related to Resource access. Many users were unaware of this incident as their connections remained active due to changes we implemented last year that were introduced in Client and Connector updates. The experienced severity of this incident was hence highly correlated with whether Clients and Connectors were up to date for a given tenant. However, this version disparity also affected the severity of the incident as a whole, and we discuss this in both the root cause and corrective actions below.

**Root Cause**
This incident occurred because of two independent events that occurred simultaneously and that were in turn made worse by deployed Connectors and Clients with out-of-date behaviors. Specifically:
1. A temporary anomaly in our infrastructure provider's load balancer caused a short term, but very significant (greater than 10 _seconds_), increase in request latency. This in turn triggered Client request retry behavior, increasing the overall load on the system in a short time span.
2. Independent of the above event, a large number of computationally costly changes were triggered in our authorization engine through non-anomalous tenant activity. This increased the processing time for authorization requests.
3. Sufficient Connectors and Clients are deployed in our tenant base that do not have the most up-to-date logic in place for handling connection degradation. Clients and Connectors released _before_ approximately May 2022 do _not_ back off their retry requests, leading to an overwhelmingly large volume of requests to our system from a relatively small number of deployed Clients and Connectors. This exacerbated both (1) and (2).

We are confident that if any of the above three conditions were not true, this incident would not have occurred.

**Corrective Actions**
Our corrective actions focus on addressing the three contributing factors above. In short, we will be: making upgrades and configuration changes to our infrastructure provider's load balancers; improving authorization engine performance; and forcing upgrades of out-of-date deployed components. Many of these tasks were already underway before the incident, and some related tasks' completion will be accelerated. A detailed breakdown is provided below.

Immediate
We have already taken the following immediate corrective actions:
1. Increased authorization engine capacity and distributed the load between multiple clusters located in different geographic regions
2. Isolated authorization requests to a dedicated deployment
3. Increased backend and health check timeouts to more appropriately match the potential for authorization request latency increases
4. Upgraded our infrastructure provider's load balancer to improve container-awareness

Short Term
1. Complete a significant upgrade of our authorization engine. This includes removing a subsystem that was identified as the bottleneck during this incident and previous incidents. This project began in 2022 Q4 and we expect this replacement upgrade to complete by early 2023 Q2.
2. Introduce additional deployment isolation for different request types so that a failure in one part of the system doesn't affect other subsystems. This proved to work very well during the incident, and we will be further standardizing this in product.
3. Introduce additional logging to help accelerate future troubleshooting.

Medium & Long Term
1. Gradually move more parts of our application servers from synchronous request processing to asynchronous processing.
2. Consider the use of a sidecar proxy in front of our application servers.
3. Consider the use of an improved load-based auto-scaling mechanism for the authorization engine.
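The retry behavior described in the root cause can be illustrated with a short, generic sketch; this is not Twingate's Client code, and the request function is a placeholder. Exponential backoff with full jitter is the standard way to keep a fleet of clients from retrying in lockstep and amplifying a transient latency spike into an outage.

```python
# Illustrative sketch: exponential backoff with full jitter for client-side retries.
import random
import time

def call_with_backoff(request_fn, max_attempts=6, base_delay=0.5, max_delay=60.0):
    """Retry request_fn(), sleeping a jittered, capped exponential delay between tries."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: spread retries uniformly over the backoff window so
            # thousands of components don't retry at the same instant.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)

# Example usage with a hypothetical authorization call:
# call_with_backoff(lambda: client.authorize(resource_id))
```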
We are marking the issue as resolved. The system works as expected with healthy metrics.
The Public API (Admin API) has been brought up too. While all the metrics for the service look healthy, we will continue to monitor them.
We have identified the issue, and the Twingate system has looked healthy since 1:48 pm PST. We are still monitoring. The Public API is being kept down for the time being.
We are continuing to investigate this issue.
Twingate engineering is still working on identifying the root cause of the issue. We'll continue to provide updates as we find out more. - The Public API is disabled. - Admin and logins should work. - We are still seeing issues with authorization.
Twingate Engineering is fully engaged and we are still investigating the issue. We'll provide further updates as soon as we can.
We are currently investigating this issue.
Report: "Twingate Admin Impacted"
Last update: The Twingate Admin console is fully functional again.
The Twingate Admin console is currently broken. We have identified the issue and are rolling out the fix.
Report: "Twingate controller service impacted"
Last update: **Summary**
On September 21 at 4:14 am UTC, Twingate released a new controller version as part of improvements to the authorization engine. This new release contained both code changes and a data migration. The change caused an unexpectedly significant increase in load on the system, which only fully manifested itself several hours later once cached data started expiring at approximately 7:54 am UTC. At this time, Twingate customers started to see issues initiating access to resources and login failures.

Based on initial evidence around intermittent responses and increased network latency, our initial suspicion was that the failures were related to infrastructure problems. Increasing backend application capacity and other efforts to mitigate these issues were not successful. These efforts, combined with information we received from our cloud vendor support team, led us to shift our focus away from infrastructure issues and towards investigating the application layer. Our next step, at approximately 10:00 am UTC, was to roll back the recent software changes to the controller, including the associated data migration that was performed as part of this update. This software rollback task, which also incorporated a supervised data migration rollback, was completed at approximately 10:30 am UTC.

After rolling back the software changes and data migration, we observed improvements in the behavior of the system. Both network latency and cache hit ratios were much improved, but not back to normal operational levels. Continuing to investigate the issue, with software and data migration rolled back to a known state, we initiated the process to fail over to our standby cluster to fully rule out any infrastructure issues. At 11:24 am UTC we initiated the failover process to our standby cluster, which completed at 11:27 am UTC. At this point the system fully recovered with normal operational metrics.

**Root cause**
After detailed investigation, we identified two separate issues in the application layer that interacted with each other. First, a code bug caused the system to re-evaluate the permissions of all our users at the same time, causing a significant load that saturated the system. Second, the data migration process failed to replace existing cached values, which led to failed requests at the application layer. This second factor only became apparent as existing cached data expired.

Failing over to our standby cluster was only effective after the data migration and software changes were reversed. This is because the cache was empty at the time that the standby cluster was brought online. Due to the nature of the software bug, data migration, and caching interactions, performing cluster failover earlier in the incident would have replicated the same problem on our standby cluster.

**Corrective actions**
Upon postmortem investigation, we also noticed that certain metrics were available that could have allowed us to detect similar issues before they fully impact the entire system. This early warning mechanism could potentially have caught this issue earlier, preventing the faulty code change from reaching our production environment. We have initiated a number of improvements:
* Short-term
  * We are increasing monitoring and alerting coverage for the performance of the authorization engine.
  * We are continuing our efforts to compartmentalize our controller application, so degradation in one part of the system doesn't impact the whole system.
  * We are writing an integration test to simulate this exact issue.
  * Based on log analysis, we concluded we should update our incident protocol to immediately turn on "read-only" mode when an incident occurs to improve client and connector offline behavior.
  * We are reinforcing the engineering team's use of feature flags and dark launches for new features and data migrations.
* Medium/Long-term
  * Based on data collected in this incident, we have identified areas where we can improve the behavior of our client and connector applications to better handle similar situations and allow connectivity even under controller downtime.
This incident has been resolved.
We have restored service for all customers but are verifying each of our internal services in turn and checking for any residual issues.
Although we identified a problem earlier, it appears that it was not the root cause of the issue. We are serving some customer requests, but there remains an ongoing impact to service availability that we are investigating.
We are still working on the fix. It is taking a bit longer than we expected.
The issue has been identified and a fix is being implemented. The system is slowly recovering. We will keep you updated.
We are working with our cloud provider to identify the issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Delay in display for network events"
Last update: This incident has been resolved.
Our infrastructure is running behind, which is causing delays in the display of network events in the Admin console. No data has been lost, and the system should catch up shortly.
Report: "Twingate Docs down"
Last update: This incident has been resolved.
The Twingate docs site is up now, but we are still monitoring the issue as our docs hosting provider hasn't yet released an update on the resolution.
Twingate docs site (docs.twingate.com) is down due to issues with our vendor.
Report: "download or update of client/connector packages"
Last update: This incident has been resolved.
Our third-party repository provider is recovering.
The issue is due to an incorrect infrastructure upgrade by a third party.
We are investigating an issue with our third-party repository provider for Linux packages. While it is not impacting Twingate clients or connectors it does appear to be preventing download or update of client/connector packages.
Report: "www.twingate.com down - no service impact"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Our service provider still reports an issue with their network; however, our website came back up and has been operational. We'll continue to monitor.
Due to an issue with our service provider, our homepage (www.twingate.com) is down. There should be no impact to service for our customers.
Report: "Twingate ingress partially unavailable"
Last update: Due to unexpected delays associated with a planned ingress update, Twingate's controller was partially unavailable for ~2.5 minutes between 10:15 pm and 10:18 pm PST. During this short window, no existing network connections were interrupted. New connections were blocked during this period. Ingress changes are rare and no additional changes are planned. Looking ahead, we will schedule a brief downtime if future ingress updates are necessary.
Report: "Cloud provider multiple services degraded"
Last update: The incident with our cloud provider (Google) is now resolved, and our services in europe-west2 (London) are fully operational.
Our cloud provider (Google) is continuing to experience an outage in the europe-west2 (London) region. This is not directly impacting the Twingate service, as our Clients and Connectors have built-in redundancy by connecting to multiple regions, but users in that region may experience slowness. We are monitoring the incident and will provide updates as we find out more.
Our cloud provider (Google) is experiencing an incident affecting multiple services in their europe-west2 (London) region. Our service spans multiple regions, but users may experience intermittent slowness.
Report: "Documentation site is unavailable in some regions"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified by our vendor as a CDN issue; we will continue to monitor and provide updates.
Report: "Twingate Controller outage"
Last update: **Summary**
On June 3rd at 3 AM UTC, Twingate started a regular Kubernetes upgrade on its main cluster. This maintenance is usually done once a month and had been performed successfully many times prior to this upgrade. It includes a version upgrade of the cluster followed by a version upgrade of the node pools, completed one at a time. Around 4:01 AM UTC, HTTP 502 errors started on our cloud load balancer instance, indicating an issue with the service. While these errors were a small portion of the overall volume of requests at first, around 4:15 AM the system became fully overloaded and it turned into a full outage. Shortly after, we failed over to our standby cluster in a different region of our cloud provider, but saw the same issue happening on our standby cluster too. We downgraded our active cluster's node pools to the previous Kubernetes version. This added extra capacity, and we then failed back to our active cluster at 5:05 AM UTC. Recovery started immediately and the service was fully recovered at 5:10 AM UTC.

**Root cause**
After a detailed investigation, we found that during the upgrade, network connectivity between internal components was not stable, triggering failures and retries. As a result of our application being overloaded, we failed to answer load balancer health checks, which caused the 502 errors. We are working with our cloud provider to analyze why the network instability happened during the upgrade.

**Corrective actions**
We have initiated a number of improvements:
* Completed: We increased our main application capacity, tuned application and network settings between the mentioned services, upgraded our in-memory key-value store, and added a PDB (pod disruption budget).
* Short-term: We will continue to tune the application and network settings between various components of Twingate. We found a bug in how our client handles 502 errors and are working on making the client handle them better.
* Medium-term: We are looking into two major changes: 1) implementing circuit breaker functionality so our main application can stay up when a downstream service goes down, and 2) implementing a multi-region active-active setup on our cloud provider, which will enable us to better control Kubernetes upgrades (as well as other code and configuration changes).
We are marking the incident as resolved. We will provide post-mortem notes as soon as we have them.
After reverting the Kubernetes version and failing back to our previously active cluster, we see the Twingate service has recovered. We continue to monitor.
During a planned Kubernetes version upgrade, our application started to fail. We failed over to our standby region/cluster, but it has the same issue. We are downgrading the Kubernetes version and continuing to work on the issue.
Controller is currently experiencing an outage. Our team is investigating the issue.
Report: "Issue with Twingate"
Last update: **Summary**
At approximately 15:31 UTC on May 13, 2022, we received alerts from our monitoring systems pointing to a problem with Twingate. Our cloud provider's load balancer started to return 502 (Bad Gateway) errors due to issues with our backend system. Looking into our backend logs, we noticed only 10-15% of requests were being handled properly and decided to restart our application pod in our Kubernetes cluster. Once the backend application pod restarted, the load balancer stopped returning 502 errors and things returned to normal around 15:44 UTC.

During the outage, both our private and public APIs were affected. These APIs are used to drive most of the functionality that end users and administrators experience in Twingate. Specifically, this means that customers' admin consoles were not accessible, the public API was not responsive to requests, and Clients and Connectors were unable to initiate authentication. Existing connections continued to function as a part of our reliability efforts completed in Q1 of 2022 (provided that the Clients and Connectors were running the latest versions). With this, we recommend that all of our customers upgrade their Clients and Connectors as soon as they can.

**Root cause**
After a detailed investigation, we found potential network glitches that caused connectivity issues and higher latency with different and unrelated parts of our system. While some components self-healed (i.e., our Redis instance), our main backend application was impacted. This was due to much higher latency associated with a 3rd-party service we use, leading to connection saturation of our API layer and the resulting rejection of additional requests, which manifested as 502 errors to the requestor.

**Corrective actions**
In order to mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements:
* Completed: We increased CPU and memory reservations for our backend application and relay pods. We decreased the connection timeout threshold for the third party so it doesn't cause connection saturation again.
* Short-term: We are working on adding more metrics and enabling more logging to help with investigation and post-mortem analysis in the future.
* Medium-term: While we already had some circuit breaker capabilities and flags to turn off certain features, we will look for a complete service mesh solution with circuit breaker capabilities that should keep upstream applications and APIs running when issues and latencies arise for downstream dependencies.
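The circuit-breaker corrective action can be illustrated with a minimal sketch. This is the generic pattern only, not Twingate's implementation or the service mesh they ultimately chose; the wrapped function at the end is hypothetical. After a few consecutive failures, calls to the degraded dependency are rejected immediately for a cool-down period instead of tying up API workers on timeouts.

```python
# Minimal circuit-breaker sketch: fail fast on a degraded downstream dependency.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: downstream dependency unavailable")
            # Cool-down elapsed: allow one trial call ("half-open" state).
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit and resets the count
        return result

# Example: wrap a slow 3rd-party lookup so the API layer fails fast when it degrades.
# breaker = CircuitBreaker()
# profile = breaker.call(fetch_third_party_profile, user_id)  # hypothetical function
```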
We are marking this issue as resolved. The impact was between 8:31 am and 8:44 am. We will add the post-mortem to the incident as soon as we have it ready.
The system is fully up. We will continue to monitor.
We are continuing to investigate this issue.
We have seen improvements; we are monitoring the situation. We'll update as we find out more details on the issue.
We are currently investigating the incident.
Report: "Issue connecting to Twingate"
Last update: **Summary**
At approximately 02:26 UTC on January 19th, we observed an increase in latency between our API layer and our backend database system. Within a few minutes, this spike in latency developed into an outage that resulted in 90% of requests returning one of two responses to the requestor: either a 500 (Internal Server Error) or a 502 (Bad Gateway Error), depending on where the error in our system occurred. These error conditions were caused by timeouts occurring between our API layer and the database and persisted until approximately 03:36 UTC.

During the outage, both our private and public APIs were affected. These APIs are used to drive most of the functionality that end users and administrators experience in Twingate. Specifically, this means that customers' admin consoles were not accessible, the public API was not responsive to requests, Clients and Connectors were unable to initiate authentication, and existing connections were eventually dropped without the ability to re-authenticate.

**Root cause**
The root cause of the issue is attributed to significant degradation in database performance due to a spike in CPU utilization, which increased latency across the system. The consequence of this increased latency was that even though our API layer was available to respond to requests, requests were taking significantly more time, leading to connection saturation of our API layer and the resulting rejection of additional requests, manifested as 500 or 502 errors to the requestor.

**Corrective actions**
In order to mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements:
* **Completed:** We have doubled the master database cluster server size in order to prevent utilization spikes from disrupting our ability to continue to serve requests.
* **Short term:** We are working on introducing zonal database read replicas, which will improve the distribution of system load and will also remove the master database as a single point of failure. These improvements will also allow our service to maintain partial connectivity in situations when the master database is unavailable.
* **Medium term:** We are implementing changes to Client connection session management to maintain connectivity in cases when backend services are unreachable. This will introduce an additional layer of resiliency to our system beyond the changes described above.
We are continuing to monitor the system, and it remains stable and available. We are closing out this incident and we will follow up with a post mortem here.
We have re-established connectivity and Twingate services have been restored. We are continuing to monitor our systems.
Our engineers have isolated the problem to a network connectivity issue between our application servers and our database infrastructure. Our team is working to restore network connectivity and we will continue to post regular updates.
We are aware of an incident affecting our production system and are currently actively investigating the issue. We will be posting regular updates pertaining to this incident.
Report: "Twingate MFA challenge page is slow to load"
Last update: We are closing this incident as the MFA challenge page has been operating normally since our last update.
We have switched to our backup CDN and will continue to monitor the system.
We have identified the issue and are currently switching to our backup CDN.
We are currently investigating degraded performance of loading the MFA page of Twingate. Customers are experiencing slower page load times due to a CDN issue but services are still functional and available. We will be posting regular updates pertaining to this incident.
Report: "Inbound requests have heavily downgraded availability"
Last update: **Components impacted**
* Controller

**Summary**
At approximately 17:06 UTC on December 13th, we observed an increase in latency between our API layer and our backend database system. Within a few minutes, this spike in latency developed into an outage that resulted in 90-95% of requests returning one of two responses to the requestor: either a 500 (Internal Server Error) or a 502 (Bad Gateway Error), depending on where the error in our system occurred. These error conditions were caused by timeouts occurring between our API layer and the database and persisted until approximately 19:08 UTC.

During the outage, both our private and public APIs were affected. These APIs are used to drive most of the functionality that end users and administrators experience in Twingate. Specifically, this means that customers' admin consoles were not accessible, the public API was not responsive to requests, Clients and Connectors were unable to initiate authentication, and existing connections were eventually dropped without the ability to re-authenticate.

**Root cause**
The root cause of the issue was a temporary loss of connectivity and increased network latency in our cloud service provider between our API layer and backend database. The consequence of this increased latency was that even though our API layer was available to respond to requests, requests were taking significantly more time, leading to connection saturation of our API layer and the resulting rejection of additional requests, manifested as 500 or 502 errors to the requestor.

**Corrective actions**
In order to mitigate the risk of this root cause impacting our service in the future, we have initiated a number of improvements to isolate the impact of backend service disruptions from end user connectivity. These projects include decoupling our backend database from the Controller, scaling cross-regional database replicas for additional resiliency, and implementing changes to user connection behaviors to maintain connectivity in cases when backend services are unreachable.
We are continuing to monitor the system, and it remains stable and available. We are closing out this incident, and we will continue to post updates and follow up with a post mortem here.
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 09:00 PST / 17:00 UTC.
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 21:00 PST / 05:00 UTC.
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will post our next update at 17:00 PST / 01:00 UTC.
We are continuing to monitor the system, and it remains stable and available. We are investigating the root cause of the outage. We will continue to post additional updates regularly.
We are continuing to monitor the system, and we are still investigating the root cause of the outage. We will continue to post additional updates regularly.
Inbound requests are now being accepted and the service is operational again. We have verified that all operational tests are succeeding. We are continuing to investigate to determine the root cause of this incident.
We are continuing to investigate this issue. We have narrowed the source of the problem to the public-facing frontend servers that handle requests inbound to our service. As a result, this is broadly affecting our public API, the private API calls used by Clients and Connectors, and our web interface, resulting in heavily downgraded response availability across our service. We are still trying to identify the root cause at this time and will continue to post regular updates.
We are continuing to investigate this issue. We will be posting regular updates pertaining to this incident.
We are continuing to investigate this issue. We will be posting regular updates pertaining to this incident.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Admin console billing functionality is temporarily unavailable"
Last update: **Components impacted**
* Admin console

**Summary**
At 18:59 UTC we received an automated alert that requests to our 3rd-party billing system (Chargebee) were failing. A few minutes later we confirmed that this was causing the Twingate admin console interface to fail to load with a 500 error, impacting all customers. At 19:15 UTC our engineering team submitted a hotfix to resolve the issue and update our production systems. The fix was deployed at 19:33 UTC and all customers were able to access the Twingate admin console with billing functionality disabled. We spoke to Chargebee at 20:00 UTC and they confirmed the issue on their side. At 20:47 UTC Chargebee confirmed that the outage was resolved in their system and the incident was closed.

**Root cause**
The Twingate admin console relies on Chargebee API access in order to load billing information specific to the particular customer account. This API call is made when the Twingate admin console loads in the browser. The underlying 3rd-party API returned a 503 (Service Temporarily Unavailable) error, which was not captured as an exception in the error object returned by the 3rd-party API library. This led to an uncaught exception, which caused the admin console to fail to load with a generic 500 (Internal Server Error).

**Corrective actions**
We have updated the admin console logic to capture this type of exception from Chargebee and ensure that the admin console will continue to load. We will be taking the following actions:
1. Audit all 3rd-party calls in the admin console to ensure that all exceptions are caught and do not result in the admin console being unavailable.
2. Update the billing behavior specifically to incorporate an unavailability message.
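The corrective action can be illustrated with a short, generic sketch; this is not Twingate's code or Chargebee's SDK, and the billing fetch function and return shape are placeholders. The point is that a failing 3rd-party billing call should degrade to an "unavailable" message rather than take the whole console down.

```python
# Illustrative sketch: never let a failing 3rd-party billing call break the console.
import logging

logger = logging.getLogger(__name__)

def load_billing_summary(fetch_billing_fn, account_id: str) -> dict:
    """Return billing data for the console, or a 'temporarily unavailable' stub."""
    try:
        return {"available": True, "billing": fetch_billing_fn(account_id)}
    except Exception:
        # Catch *any* failure from the vendor library (503s, timeouts, parsing
        # errors) so the console still renders, just without the billing panel.
        logger.exception("billing provider unavailable for account %s", account_id)
        return {"available": False, "message": "Billing is temporarily unavailable."}
```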
Our 3rd party billing system is now available and we are marking this incident as resolved. We will follow up with a post mortem update.
We have received confirmation from our 3rd party billing provider that they are aware of their system outage and are working on a fix. We will resolve this incident when the 3rd party system is available. Until this incident is marked as resolved, no billing functionality will be available, but all other functionality remains unaffected.
A fix has been deployed and the admin console is now available to all customers. We'll continue to monitor the system for the next 30 minutes before marking this incident as resolved.
The root cause of the issue has been identified and a fix is in progress. We will post another update shortly.
The admin console is currently experiencing an outage caused by our 3rd party billing system being unavailable. Our team is investigating the issue and working on a fix currently. This issue is isolated to loading the admin console and does not affect any resource access or end user authentication.
Report: "Okta connectivity issue"
Last update: Customers using Okta should now be able to log in without issue. We understand that the problem was related to an internet connectivity issue unrelated to Twingate that is now resolved.
We are observing that Okta appears available again. Customers that use Okta as their Identity Provider should try to log in again if they have experienced trouble logging in.
We are aware of and investigating reports that customers using Okta for authentication are unable to log in. It appears that customers' Okta domains are not reachable. At this stage, we are not aware of any issue affecting Twingate services itself and are working to restore access to those customers relying on Okta for authentication.
Report: "Service provider networking outage"
Last update: **Components impacted**
Controller
Relays

**Summary**
Twingate services were unavailable to service requests from approximately 17:48 to 20:09 UTC on November 16th. The result was that during this period of time, access to Twingate and protected resources was limited, existing connections were dropped, and new connections were refused. At the end of the period, normal service resumed. Remediation required that customers reconnect their Connectors in order to restore access to protected resources.

**Root cause**
Google Cloud Platform (GCP) deployed a configuration change in their infrastructure that caused all requests to return 404 errors ([GCP incident description](https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh)). Because Twingate relies on GCP infrastructure, access to the Twingate network and protected resources was impacted. GCP confirms that the incident was resolved as of 19:28 UTC. As GCP began to restore their service, impacted Twingate services automatically came back online. Currently, Twingate Clients and Connectors view 404 errors as unrecoverable states and thus did not automatically reconnect. Consequently, customers were required to restart their Connectors, and the Windows service on the Windows Client, to restore access.

**Corrective actions**
Automated monitoring alerted Twingate to the outage and our DevOps and on-call engineering teams started tracking the issue. Manual testing confirmed the outage, and additional investigation showed that other GCP customers were impacted. While traffic was being restored, systems indicated that Connectors did not automatically recover. For customers using our Managed Connectors, these were restarted at 20:50 UTC. We began notifying customers about the need to restart Connectors at approximately 19:00 UTC, and all customers were notified by 02:02 UTC on November 17th. Looking ahead, we plan to:
* Prioritize Client and Connector reconnection behavior and extend it to include all non-recoverable errors
* Introduce functionality to notify customers of Connector downtime via email notifications
We are marking this issue as resolved as our monitoring shows that our infrastructure is operating normally and Google Cloud Platform has resolved the incident on their network. We will be following up with a post-mortem shortly.
Google Cloud Platform has marked their Cloud Networking issue as resolved and has posted a status update: https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh We are continuing to monitor our infrastructure and will mark this incident as resolved when we are confident that everything has returned to normal.
We have verified that all of our infrastructure is fully operational at this time and will continue to monitor for any changes. Until our service provider (Google Cloud Platform) has closed their incident, we will leave this incident open in Monitoring status and provide regular updates as we receive them. Customers should verify that all of their Connectors are up and running if any Resources are inaccessible at this time.
We are continuing to monitor for any further issues.
We are continuing to monitor the status of the service. Customers may need to restart Connectors to restore connectivity to resources due to the nature of the networking outage.
The Twingate admin console is now accessible and the Twingate Controller is operational. Customers may need to restart Connectors to restore connectivity to resources due to the nature of the networking outage. The originating cause appears to be related to an outage in Google Cloud Platform's Networking service. Google Cloud Platform has opened an incident: https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh
We are continuing to investigate this issue.
We are investigating reports of an outage affecting Twingate. At this time we suspect the issue is affecting broader Internet services and is not isolated to Twingate. We will continue to post regular updates as we learn more.
Report: "Controller downtime"
Last update**Components impacted** Relay Controller **Summary** A physical hardware failure occurred in a node within one of our Eastern US Relay clusters at approximately 18:19 UTC. Connectors and Clients attached to this node automatically failed over to a new node. This failover process resulted in a partial outage of the Controller, which was only partially available to service requests from approximately 18:21 to 18:40 UTC. At the end of the period, normal service resumed with no remediation required. **Root cause** A physical hardware failure occurred in a single node within one of our Eastern US Relay clusters. Although the hardware was swapped out automatically by our service provider, the failure caused all Connectors attached to this particular Relay node to automatically fail over to a new Relay node, resulting in a flood of connection requests. This process proceeded normally; however, the volume of connection requests was sufficient in this particular instance to temporarily prevent the Controller from accepting new connection requests. This in turn resulted in additional reconnection requests, exacerbating the original problem. **Corrective actions** As soon as we received monitoring alerts, the DevOps and on-call engineering teams started triaging the issue. Additional nodes were started to handle the spike in connection requests, and the system was monitored as the request rate recovered; normal operation resumed at 18:40 UTC. Looking ahead, we have already taken, or plan to take, the following actions: 1. Add additional nodes and increase memory limits across the board to serve as an additional buffer for failover-based connection spikes. 2. Make changes to our heartbeat monitoring logic to increase overall resilience during transient traffic peaks. 3. Introduce changes to the Connector logic to maintain connections to multiple Relay nodes at all times, resulting in a flatter spike in failover re-connection requests. 4. Introduce additional resiliency in token issuance to prevent temporary spikes in connection requests from influencing otherwise healthy Clients and Connectors.
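As an illustration of corrective action 3 above, the following hedged sketch shows a connector keeping a warm standby Relay connection so that a node failure promotes the standby locally instead of every Connector requesting a new token from the Controller at once. The addresses, types, and function names are hypothetical and are not Twingate's actual Connector implementation.

```go
// Hedged sketch: dial a primary and a backup relay so failover does not
// require an immediate Controller round trip.
package main

import (
	"fmt"
	"net"
	"time"
)

type relaySession struct {
	addr string
	conn net.Conn
}

func dialRelay(addr string) (*relaySession, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return nil, err
	}
	return &relaySession{addr: addr, conn: conn}, nil
}

// connectWithStandby dials the first reachable relay as primary and the next
// as a warm standby; the caller switches to the standby on primary failure.
func connectWithStandby(addrs []string) (primary, standby *relaySession, err error) {
	for _, addr := range addrs {
		s, dialErr := dialRelay(addr)
		if dialErr != nil {
			continue
		}
		if primary == nil {
			primary = s
		} else {
			standby = s
			break
		}
	}
	if primary == nil {
		return nil, nil, fmt.Errorf("no relay reachable out of %d candidates", len(addrs))
	}
	return primary, standby, nil
}

func main() {
	// Hypothetical relay addresses for illustration only.
	primary, standby, err := connectWithStandby([]string{"relay-a.example.com:443", "relay-b.example.com:443"})
	if err != nil {
		fmt.Println("failover pool empty:", err)
		return
	}
	fmt.Println("primary:", primary.addr)
	if standby != nil {
		fmt.Println("warm standby:", standby.addr)
	}
}
```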
This incident has been resolved.
We are still investigating the root cause of the incident. We didn't find any issue on our side, and we are working with our cloud provider support team to investigate the matter further.
We are continuing to investigate this issue.
We are back and operational now. We are still investigating the root cause.
We are continuing to investigate this issue.
We are looking into it and will provide more information as soon as we have it.
Report: "Connector restart may be required"
Last updateAll admins with affected Connectors were notified.
Connectors older than v1.26.0 require a restart due to a database update. You can find Connector version information in the Connector detail page in the Twingate Admin console. We are currently in the process of contacting Twingate admins.
Report: "Controller downtime"
Last update**Components impacted** Controller **Summary** The Controller was unavailable to service new authentication requests from approximately 15:17 to 15:19 UTC. The result was that during this period, new connection requests were rejected. Existing connections were not impacted. At the end of the outage period, normal service resumed with no remediation required. **Root cause** Leading up to the start of the outage period, automated monitoring alerted us to spikes in memory usage. At approximately 15:16 UTC we introduced a change to our cluster that was intended to increase memory availability. At approximately 15:17 UTC, as this change was rolled out, it had the unintended consequence of decreasing service availability, resulting in the rejection of most requests. **Corrective actions** At 15:18 UTC, seeing the decrease in service availability, we reverted the change and simultaneously made additional hardware available to the cluster. Normal service resumed approximately 45 seconds later as the change propagated. Looking ahead, we plan to: 1. Investigate decoupling inbound requests from our backend, since the tight coupling between them is the likely cause of the memory spikes that triggered the change behind this outage.
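To illustrate the decoupling idea in the corrective action above, here is a minimal sketch, assuming a simple HTTP front end: a bounded queue and a fixed worker pool sit between inbound requests and the backend, so a burst is shed early with a 503 instead of accumulating unbounded per-request state in memory. The route, queue size, and pool size are arbitrary placeholders, not Twingate's architecture.

```go
// Hedged sketch: bounded queue + worker pool as a backpressure layer
// between request handlers and the backend.
package main

import (
	"fmt"
	"log"
	"net/http"
)

type job struct {
	w    http.ResponseWriter
	done chan struct{}
}

func main() {
	queue := make(chan job, 512) // bounded buffer decouples accept rate from processing rate

	// Fixed pool of backend workers; memory use is bounded by queue and pool size.
	for i := 0; i < 8; i++ {
		go func() {
			for j := range queue {
				// ... call the real backend here ...
				fmt.Fprintln(j.w, "ok")
				close(j.done)
			}
		}()
	}

	http.HandleFunc("/auth", func(w http.ResponseWriter, r *http.Request) {
		j := job{w: w, done: make(chan struct{})}
		select {
		case queue <- j:
			<-j.done // wait for a worker rather than growing per-request state
		default:
			http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The design choice here is that the queue gives explicit backpressure: when the system is saturated, excess requests fail fast and can be retried, rather than piling up and driving memory usage toward the limits that triggered this incident.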
This incident has been resolved.
The system is now confirmed as fully operational. We are working on an incident report and taking steps to ensure that this issue will not happen in the future.
We've resolved the immediate issue by adding additional processing capacity and increasing memory limits on our Controller infrastructure.
We've identified the issue, which is being caused by excessive memory usage on our infrastructure.
Report: "Controller downtime"
Last update**Components impacted** Controller **Summary** The Controller was partially unavailable to service requests from approximately 19:39 to 19:46 UTC. The result was that during this period, access to protected resources was limited, some existing connections were dropped, and new connections were refused. At the end of the period, normal service resumed with no remediation required. **Root cause** Leading up to the start of the incident was a planned maintenance period. The maintenance propagated a configuration change across our Relay clusters. Due to human error, that change was not applied sequentially, one cluster at a time, but was instead released to all of our US clusters in parallel. Once the configuration change was applied, it triggered reconnection requests from all active Clients and Connectors to our Relay infrastructure. As part of the reconnect process, Clients and Connectors needed to obtain new tokens from the Controller. At 19:39 UTC the spike of requests triggered our health-check system, which incorrectly determined that the Controller was misbehaving and required restarting. The frequent Controller restarts resulted in a decrease in service availability. **Corrective actions** As soon as the health-check system kicked in, the DevOps and on-call engineering teams started tracking down the issue. Logs and system metrics confirmed that, except for the health-check system, everything was performing well, so a decision was made to disable it. Seconds after disabling it, the system returned to a fully operational state. At 21:22 UTC a hot fix was deployed to the health-check system and it was enabled once again. Looking ahead, we plan to: 1. Only perform planned Relay maintenance operations that require connection migration outside of peak traffic hours. 2. Enforce a stricter limit on the number of parallel Relay cluster deployments. 3. Fix issues identified with our health-check system and improve our performance and stress testing to include more aggressive connection migration scenarios. 4. Update the Twingate status page immediately upon confirmation of an issue impacting customers.
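As an illustration of corrective actions 1 and 2 above, this hypothetical rollout helper applies a configuration change to Relay clusters strictly one at a time and gates each step on a health check, so a bad change cannot reach every US cluster in parallel. Cluster names, timeouts, and function names are placeholders and do not describe Twingate's actual deployment pipeline.

```go
// Hedged sketch of a sequential, health-gated rollout.
package main

import (
	"errors"
	"fmt"
	"time"
)

func applyConfig(cluster string) error { /* push the change to one cluster */ return nil }

func healthy(cluster string) bool { /* query the cluster's health endpoint */ return true }

func rolloutSequentially(clusters []string) error {
	for _, c := range clusters {
		if err := applyConfig(c); err != nil {
			return fmt.Errorf("apply failed on %s: %w", c, err)
		}
		// Wait for the cluster to settle before moving on; halt the rollout
		// entirely if it does not recover within the deadline.
		deadline := time.Now().Add(5 * time.Minute)
		for !healthy(c) {
			if time.Now().After(deadline) {
				return errors.New("rollout halted: " + c + " unhealthy after change")
			}
			time.Sleep(15 * time.Second)
		}
		fmt.Println("cluster updated:", c)
	}
	return nil
}

func main() {
	// Cluster names are placeholders for illustration.
	if err := rolloutSequentially([]string{"us-east", "us-central", "us-west"}); err != nil {
		fmt.Println(err)
	}
}
```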
This incident has been resolved. We will be posting a post mortem description shortly.
The Controller is currently fully available, and we are actively investigating the root cause of the issue.
The Controller infrastructure was experiencing degraded availability. The issue began at 19:39 UTC and continued until 19:47 UTC. Our team is currently investigating the root cause of the issue, and we will post additional updates here.
Report: "US East Coast Relay issue"
Last update**Components impacted** Relay Connector **Summary** On this date we had an outage during routine maintenance of our relay infrastructure. The issue started at 04:00 UTC and was resolved within 2 hours, requiring some customers to restart their connectors in order to re-establish connectivity to our relay infrastructure. **Root cause** In our investigation we determined that the connector received a malformed response from the relay cluster during its maintenance cycle. The response in question contained the address of the relay node to which the connector is instructed to connect. This malformed response resulted in the connector retrying access to a non-existent relay node without failing over to another relay cluster. **Corrective actions** After correcting the specific issue that caused the malformed response, we modified both the relay and connector logic so that failover now happens automatically any time a malformed response is received. We also modified our maintenance procedures to add additional health checks to prevent malformed responses. Finally, we took the opportunity to enhance the failover logic to incorporate multiple levels of relay redundancy in the connector's initial configuration, which it receives after authentication.
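For illustration, here is a minimal sketch of the failover behavior described above, assuming the connector receives an ordered list of candidate relay addresses after authentication: a malformed or unreachable address is skipped and the next candidate is tried, rather than retrying a non-existent node indefinitely. The addresses and function names are hypothetical, not Twingate's actual connector logic.

```go
// Hedged sketch: skip malformed or unreachable relay addresses and fail over
// to the next candidate in the redundancy list.
package main

import (
	"fmt"
	"net"
)

// pickRelay walks an ordered list of candidate relay addresses and returns
// the first one that both parses and accepts a connection.
func pickRelay(candidates []string) (string, error) {
	for _, addr := range candidates {
		// Reject malformed addresses (e.g. missing port or empty host) up front.
		host, port, err := net.SplitHostPort(addr)
		if err != nil || host == "" || port == "" {
			fmt.Println("malformed relay address, failing over:", addr)
			continue
		}
		conn, err := net.Dial("tcp", addr)
		if err != nil {
			fmt.Println("relay unreachable, failing over:", addr)
			continue
		}
		conn.Close()
		return addr, nil
	}
	return "", fmt.Errorf("no usable relay among %d candidates", len(candidates))
}

func main() {
	// Hypothetical addresses; the real list would come from the post-authentication configuration.
	relay, err := pickRelay([]string{"bad-response", "relay-1.example.com:443", "relay-2.example.com:443"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("connected to relay:", relay)
}
```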
The affected Relay cluster is now fully operational.
We are monitoring as our Relay cluster is coming back online. Any affected Connectors that did not automatically reconnect may require a restart in order to resolve any connectivity issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating the issue.
Report: "Mumbai (asia-south1-a) cluster is unavailable"
Last update**Components impacted** Relay **Summary** We encountered an issue during an upgrade to our relay cluster monitoring infrastructure. As a result, we were unable to bring the Mumbai regional cluster up during this maintenance window, and so it was left down. There was no customer impact as any connections were re-routed to another relay cluster. **Root cause** We determined that the root cause was a configuration error introduced during a deployment configuration upgrade. This was fixed and the cluster was brought back up during a low traffic period at the end of the day. **Corrective actions** We identified an issue in our CI/CD process that resulted in the initial misconfiguration, which has been corrected.
Relay cluster in Mumbai (GCP region asia-south1-a) is now fully operational.
Engineering has a fix in place. We are currently monitoring the cluster and expect it to be back up by 21:00 PST.
We have identified an issue with our Relay cluster in Mumbai (GCP region asia-south1-a) and are working to fix it.