Frontegg

Is Frontegg Down Right Now? Check whether there is a current outage.

Frontegg is currently Operational

Last checked from Frontegg's official status page

Historical record of incidents for Frontegg

Report: "Service Outage"

Last update
postmortem

**Root Cause Analysis (RCA): DDoS Attack Incident**

**Incident Summary**
On May 23, 2025, between 16:53 and 17:16 UTC, our service in the Europe region experienced a temporary outage due to a sophisticated DDoS attack. Despite mitigation efforts by Cloudflare, the scale and speed of the attack overwhelmed our system's autoscaling capabilities, leading to service unavailability for a short period.

**Timeline of Events:**
* **16:53 UTC:** DDoS attack begins.
* **16:54 UTC:** Monitoring system alerts the on-call team.
* **17:03 UTC:** On-call team identifies the DDoS attack.
* **17:10 UTC:** Attack characteristics scoped.
* **17:15 UTC:** Blocking and rate limit rules applied.
* **17:16 UTC:** Service recovers.

**Root Cause**
The attack's high volume and rapid escalation exceeded our system's ability to scale automatically in time, causing service disruption.

**Incident Resolution & Next Steps:**
To resolve the incident, we took the following actions:
* We successfully blocked the malicious traffic and hardened our defenses.
* Preventive measures are being implemented, including enhancing our CDN and infrastructure autoscaling, automated tools to identify attacks faster, and DDoS protection in collaboration with Cloudflare.
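The mitigation applied blocking and rate-limit rules at the Cloudflare edge. As a rough illustration of the rate-limiting idea only (not Frontegg's or Cloudflare's actual rules; the window length, per-IP budget, and function names below are assumptions), here is a minimal fixed-window limiter sketch in TypeScript:

```typescript
// Minimal fixed-window rate limiter keyed by client IP. All names and limits
// are hypothetical; the real mitigation in this incident used Cloudflare
// blocking and rate-limit rules, not application code.
type WindowState = { windowStart: number; count: number };

const WINDOW_MS = 60_000;   // 1-minute window (assumed value)
const MAX_REQUESTS = 300;   // per-IP budget per window (assumed value)

const windows = new Map<string, WindowState>();

export function allowRequest(clientIp: string, now: number = Date.now()): boolean {
  const state = windows.get(clientIp);

  // Start a fresh window if none exists or the previous one expired.
  if (!state || now - state.windowStart >= WINDOW_MS) {
    windows.set(clientIp, { windowStart: now, count: 1 });
    return true;
  }

  // Reject once the per-window budget is exhausted.
  if (state.count >= MAX_REQUESTS) {
    return false;
  }

  state.count += 1;
  return true;
}

// Example: a caller would reject over-budget requests with HTTP 429, e.g.
// if (!allowRequest(req.ip)) { res.statusCode = 429; res.end("Too Many Requests"); }
```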

resolved

This incident has been resolved.

Timeline of Events:
* 16:53 UTC: DDoS attack begins.
* 16:54 UTC: Monitoring system detects service degradation and alerts the on-call team.
* 17:03 UTC: On-call team identifies that a DDoS attack is ongoing.
* 17:10 UTC: On-call team scopes the characteristics of the attack (volume, source IPs, and traffic patterns).
* 17:15 UTC: The on-call team applies blocking and rate limit rules on Cloudflare to mitigate the attack.
* 17:16 UTC: System recovers and service is restored.

Report: "Service Outage"

Last update
Monitoring

A fix has been implemented and we are monitoring the results.

Investigating

We're currently investigating an issue affecting some users. Our team is working to identify the cause and will provide updates as we learn more.

Report: "Service Degradation"

Last update
Monitoring

A fix has been implemented and we are monitoring the results.

Investigating

We're currently investigating an issue affecting some users. Our team is working to identify the cause and will provide updates as we learn more.

Report: "EU environment issues"

Last update
resolved

This incident has been resolved.

monitoring

The fix has been rolled out, and all indications are positive. Backoffice sync will be delayed

identified

The fix has been implemented and is being rolled out.

identified

We are continuing to work on a fix for this issue.

identified

We have identified the issue and implemented a fix.

investigating

We are currently investigating this issue.

Report: "EU environment issues"

Last update
Investigating

We are currently investigating this issue.

Report: "Infrastructure Upgrades"

Last update
Scheduled

We will be performing scheduled maintenance on our infrastructure during this time.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Report: "US Environment Degradation - Potential 504 Errors"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are currently implementing a patch to improve system performance. Some services may experience temporary disruptions

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented

investigating

We are currently investigating the issue.

Report: "Increased Latency in EU region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Monitoring

investigating

We are currently investigating this issue.

Report: "Email service"

Last update
resolved

The incident is resolved. Email should be sent now.

investigating

We are working with our email provider on a solution at the moment

investigating

Some emails are not being sent, for example Magic code and Magic link emails. We are investigating with our email provider.

Report: "[EU Region] - Sporadic System Latency for Traffic Originating from IL"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Increased reports in issues loading Hosted Login Page"

Last update
resolved

This incident has been resolved.

monitoring

We are monitoring the issue and in contact with Azure

investigating

We are receiving reports of sporadic issues loading the hosted login page for some users. It does not appear to be widely affecting usage, and the team is currently investigating. The issue appears to be caused by an Azure incident affecting our CDN service.

Report: "EU Degraded State - Partial Outage"

Last update
postmortem

# **Root Cause Analysis (RCA) Report**

**Date and Time**: July 24, 2024
**Duration**: 22 minutes
**Affected Services**: Authentication and core services
**Impact**: Requests from customers in the EU region were hanging and returned as 504 timeouts
**Reported By**: Internal monitoring systems and customers

**Executive summary:**
On Wednesday, July 24th, at 08:43 GMT, Frontegg's internal monitoring systems indicated that the API Gateway encountered an issue following the deployment of a new OpenTelemetry propagator (OTEL instrumentation), causing service disruptions in the EU. As a result, some of our customers experienced timeout errors (HTTP status 504) returned by Frontegg. During the upgrade of our API Gateway, Frontegg also updated the OpenTelemetry library. Due to a misconfiguration in the data handling settings, this update inadvertently caused the system to send data one piece at a time instead of in efficient batches: OTEL transmitted millions of traces individually rather than in aggregated batches. Although our system was rigorously tested under various conditions, the high load in the EU environment caused our auto-scaling mechanism to lag behind the incoming traffic. This led to the API Gateway being overwhelmed by the volume of client requests.

**Cause Analysis:**
The primary cause of the incident was the deployment of new OTEL instrumentation in the API Gateway, which led to a significant increase in trace data volume. Contributing factors included:
* The API Gateway's OTEL was configured with the BasicPropagator instead of a BatchPropagator, sending each trace individually as part of the request flow.
* The rapid rise in HTTP requests to the OTEL collector overloaded the API Gateway's ability to handle incoming requests. Although it autoscaled, it could not keep up with the number of requests.
* With the increase in traces being sent, the OTEL Collector failed to handle millions of traces at such a rate, increasing request handling time, which in turn caused a further increase in API Gateway HTTP requests.

**Customer Impact**
During the incident, customers in the European region experienced significant service degradation. Specific issues included failures in hosted login monitors and general service instability.

**Mitigation and resolution:**
Upon receiving the initial alerts, the Frontegg team began investigating the issue promptly. After identifying the problem with the OTEL propagator and collector, we increased the allocated resources and reverted to the latest working version. Following this change, the systems returned to normal operations.

**Mitigation**:
* Increased the CPU allocation for the OTEL Gateway to handle the increased workload.
* Reverted to the latest working API Gateway version.

**Resolution**:
* Restarted the API Gateway to clear hanging requests and stabilize the OTEL Gateway.
* Deployed a new version of the API Gateway with the correct configuration.

**Prevention and Future steps:**
* **Enhance OTEL Propagator**: Implement batch processing, asynchronous handling, and strict timeouts.
* **Upgrade OTEL Gateway**: Allocate additional resources to the OTEL Gateway and implement autoscaling to handle increased workloads effectively.
* **Implement Aggressive Timeouts**: Implement stringent timeout policies for all HTTP requests that are not customer-related. This measure will proactively prevent delays and mitigate the risk of unresponsive requests.
* **Stress tests**: Change the deployment pipeline to include stress testing instead of the nightly testing suite.

**Communication:**
**Enhance Status Page Communication**: Ensure the status page provides clear and timely updates during incidents. Develop and maintain standardized templates for incident communication to facilitate prompt and consistent information, even if the root cause is not immediately identified.
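The root cause above comes down to spans being exported one at a time instead of in batches. As a general illustration of batched trace export with the OpenTelemetry JS SDK (1.x API; this is not Frontegg's gateway configuration, and the collector endpoint and tuning values are assumptions), a minimal sketch:

```typescript
// Illustration only: export traces in batches with the OpenTelemetry JS SDK
// (1.x API). Endpoint, service wiring, and tuning values are assumptions.
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import {
  BatchSpanProcessor,
  // SimpleSpanProcessor would export every span individually, which is the
  // failure mode described in the RCA, so it is deliberately not used here.
} from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const exporter = new OTLPTraceExporter({
  url: "http://otel-collector:4318/v1/traces", // assumed collector endpoint
});

const provider = new NodeTracerProvider();

// Batch spans before export instead of sending each one as its own HTTP request.
provider.addSpanProcessor(
  new BatchSpanProcessor(exporter, {
    maxQueueSize: 4096,         // spans buffered before dropping (assumed)
    maxExportBatchSize: 512,    // spans per export request (assumed)
    scheduledDelayMillis: 5000, // flush interval in ms (assumed)
  })
);

provider.register();
```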

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "US Degraded State - Partial Outage"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "US region services partial outage"

Last update
postmortem

## **Executive summary:**
On June 3rd, at 12:06 GMT, the Frontegg team received an indication from our monitoring system of increased latency for refresh token requests (average greater than 750 ms) in our US region. Starting at 12:12 GMT, the first customer reached out to Frontegg noting request timeouts. At 12:13 GMT, we updated our status page and officially began the investigation. As a preliminary measure, the team began a number of different mitigation actions in an attempt to remedy the situation as quickly as possible. After seeing no improvement, at 12:30 GMT the team began a full cross-regional disaster recovery protocol. At 12:40 GMT we also began a same-region disaster recovery protocol (starting a new same-region cluster) as part of the escalation to ensure a successful recovery. At 13:25 GMT we began to divert traffic to the new same-region cluster, and by 13:30 GMT we saw a stabilization of traffic to Frontegg. Upon further investigation, we discovered the root cause to be a networking issue inside our main cluster, which caused a chain reaction affecting the general latency of the cluster. Additionally, we are working with our cloud provider to gather further details on the event from their side.

## **Effect:**
From 12:06 GMT to 13:30 GMT on June 3rd, Frontegg accounts hosted in our US region experienced substantial latency on a significant portion of identity-based requests to Frontegg. This meant many requests timed out, causing users to be unable to log in or refresh their tokens. Additionally, access to the Frontegg Portal was partially blocked due to this issue.

## **Mitigation and resolution:**
Once the Frontegg team received the initial alert on refresh latency, we began an investigation into our traffic, request latency, workload, hanging requests, and database latency. Upon finding inconclusive results, the team initiated a handful of mitigation efforts:
* At 12:14 GMT, we increased our cluster workload.
* At 12:30 GMT, the team began a full cross-regional disaster recovery protocol.
* At 12:40 GMT, we also began a same-region disaster recovery protocol (starting a new same-region cluster).
* By 13:00 GMT, we increased the number of Kafka brokers as an additional mitigation measure.

After a preliminary check on the new same-region cluster, we began diverting traffic to it. By 13:30 GMT we saw a stabilization of traffic to this cluster and moved the incident to monitoring. We continued to monitor traffic for the next hour before resolving the incident.

## **Preventive steps:**
* We are adding a same-region hot failover cluster for quick mitigation of P0 issues.
* We are applying finer-grained rate limits on all routes within the system to add additional protection for our cluster health.
* We are working closely with our cloud provider to gather additional information on the event in order to increase the predictability of future events.

At Frontegg, we take any downtime incident very seriously. We understand that Frontegg is an essential service, and when we are down, our customers are down. To prevent further incidents, Frontegg is focusing all efforts on a zero-downtime delivery model. We apologize for any issues caused by this incident.

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "[US Instance] - Authentication Service in degraded state"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Email Service in Degraded State"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Frontegg's email sending service is experiencing issues. We've identified the issue and are working with our service provider on a fix.

Report: "CA region - Portal access"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

The portal in the Canada region is inaccessible; we are investigating the issue.

Report: "Degraded performance in the US region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "EU region - System degraded performance"

Last update
resolved

This incident has been resolved.

investigating

We are experiencing a system degradation, user login flows might be affected

Report: "[US Environment] - Backoffice opperating with degraded performance"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "US region - Management APIs and MFA service degradation"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are experiencing degradation in the MFA service.

Report: "Webhooks performance degradation"

Last update
resolved

This incident has been resolved.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Entitlements Service is in a degraded state"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Management portal is partially available"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Latency to Partial Traffic on US Cluster"

Last update
resolved

Some traffic on the US cluster experienced high latency for roughly 1 hour, resulting in some users being unable to log in or receiving 504 (timeout) responses to Frontegg identity calls.

Report: "Webhooks & Backoffice services in US cluster are in a degraded state"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Frontegg services in EU cluster are in a degraded state"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

Report: "Partial Service Degradation in EU Cluster"

Last update
resolved

This incident has been resolved. RCA Investigation is ongoing at the moment.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "DB upgrade on U.S. region"

Last update
resolved

The maintenance has been successfully completed

monitoring

The maintenance was completed and we are monitoring the results.

identified

The maintenance is still in progress

identified

The maintenance is still in progress

identified

The maintenance is still in progress

identified

A maintenance procedure on our DB may result in intermittent 500s in identity flows. We applied a caching mechanism to make sure that no data is lost during the flow in case of an error. Note: the issue should not affect active refresh tokens.
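The update above describes a caching mechanism that keeps identity-flow data from being lost if the database errors during maintenance. A very rough sketch of that fallback pattern follows (all types, names, and the replay strategy are assumptions, not Frontegg's implementation):

```typescript
// Illustrative only: buffer writes when the database errors during maintenance,
// then replay them once the database recovers. Types and names are hypothetical.
type IdentityRecord = { userId: string; payload: unknown };

interface Database {
  save(record: IdentityRecord): Promise<void>;
}

const pending: IdentityRecord[] = []; // in-memory buffer acting as the cache

export async function saveWithFallback(db: Database, record: IdentityRecord): Promise<void> {
  try {
    await db.save(record);
  } catch {
    // DB is unavailable (e.g. mid-maintenance): keep the record so the
    // identity flow can complete without losing data.
    pending.push(record);
  }
}

export async function replayPending(db: Database): Promise<void> {
  // Re-attempt buffered writes once maintenance is over.
  while (pending.length > 0) {
    const record = pending[0];
    await db.save(record); // if this throws, remaining records stay buffered
    pending.shift();
  }
}
```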

Report: "Partial Outage on Hosted Login Service"

Last update
resolved

From 15:55 to 16:02 there was a partial outage on our hosted login service. The root cause is currently under investigation, but the issue has been mitigated.

Report: "Degraded performance on custom domains in the EU region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Frontegg Services are showing Degraded Performance in EU & US"

Last update
postmortem

### **Executive summary:**
On Wednesday, May 31st, 2023, at 12:55 GMT we deployed a minor version to one of our services. Shortly after, at 12:56 GMT, Frontegg's US monitoring system started sending alerts for an authentication service that was not performing as expected, and the team immediately began investigating the issue. At 13:01 GMT we started getting alerts from Frontegg's EU monitoring as well regarding the same service, and shortly after we started to get complaints from customers. At 13:04 GMT, 8 minutes after we started getting the alerts, the team concluded that the issue was caused by a recently deployed change. As part of the change, there was a database migration for one of our primary services. However, the migration job didn't run due to an edge race condition in our CD infrastructure, causing the service to remain in a schema mismatch state. At this point we immediately started a rollback process for both the EU and US regions, which was completed by 13:16 GMT. Once the rollback completed, we confirmed that our services were working as expected again, and customers also reported that they were no longer experiencing issues.

**Impact:**
Most requests to customers' custom Frontegg domains resulted in 401/404 responses or an inability to authenticate.
For the EU region: between 12:59 and 13:16 GMT.
For the US region: between 12:56 and 13:14 GMT.

### **Mitigation and resolution:**
Following the monitoring alerts, the incident response team immediately identified the potentially corrupted service and started a rollback to the previous successful deployment.

### **Preventive steps:**
* We defined a gated process for deploying DB migration changes.
* A schema validation on service init was added to prevent schema mismatch cases.
* We will add deployment validation that fails the deployment if the migration didn't run.
* We will remove the high dependency on that specific service as a single point of failure for the main system flows.
* We will reduce service rollback time by running only the relevant part of the CD pipeline.
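One of the preventive steps above is schema validation on service init. A minimal sketch of that idea, assuming a node-postgres client and a hypothetical schema_migrations table (this is not Frontegg's actual mechanism, and the table, column, and migration names are assumptions):

```typescript
// Sketch only: verify on startup that all expected DB migrations have been
// applied, and refuse to serve traffic otherwise. Table/column names and the
// expected version list are assumptions.
import { Client } from "pg";

const EXPECTED_MIGRATIONS = ["20230510_add_tenants", "20230531_alter_users"]; // hypothetical

export async function assertSchemaUpToDate(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();
  try {
    const result = await client.query("SELECT version FROM schema_migrations");
    const applied = new Set(result.rows.map((row) => row.version as string));

    const missing = EXPECTED_MIGRATIONS.filter((m) => !applied.has(m));
    if (missing.length > 0) {
      // Failing fast keeps the service from running against a mismatched schema.
      throw new Error(`Schema mismatch: missing migrations ${missing.join(", ")}`);
    }
  } finally {
    await client.end();
  }
}
```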

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Degradation in Tenant API Token authentication"

Last update
resolved

Users with tenant API tokens created with client credentials had an issue authenticating the API token: the authenticate API token route was returning a 400 response. Access-token API tokens were fully operational.

Report: "Networking issue for US region working with Frontegg EU cluster"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are working with our external services and providers on this issue.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Degraded performance in US region"

Last update
resolved

This incident has been resolved.

monitoring

There might be delays in responding to some requests in the US region

Report: "Api are having Performance degradation"

Last update
postmortem

### **Executive summary:**
On August 15th, 2022 at 02:01 IST (UTC+2), Frontegg underwent a sophisticated, organized DDoS attack against its subdomains. The attackers used multiple servers spread across a variety of Digital Ocean IPs. Each server executed a low number of requests per second, so our WAF did not trigger rate-limiting rules, yet we recognized that many of the targeted paths were related to known weaknesses of the WordPress engine. By 03:21 the attack had been successfully mitigated. At 04:46 a second organized attack began. The restrictions put in place during the first attack helped mitigate the second one, and by 05:30 all traffic had returned to normal.

### **Impact:**
The incident caused degraded performance of our API gateway. As a result, our API returned 504 and 524 errors for part of the traffic over the course of the incident. The majority of these errors were returned between 02:01 IST and 02:30 IST, when our mitigation efforts began to take effect. A majority of traffic was still able to go through without error during this time.

### **Mitigation and resolution:**
Our initial response to the attack was to tighten our rate limiting and WAF constraints. This initial step was implemented at 02:30 IST. Once we understood the level of sophistication and distribution of the attack, we implemented changes at the application level, including a different routing mechanism, and added more specific WAF constraints based on the origins of the attacking traffic, which took effect by 03:21 IST.

### **Preventive steps:**
In order to prevent attacks like this in the future, we are implementing a more sophisticated route-blocking mechanism in our API gateway. Additionally, we have reported the incident to the cloud provider which hosted the majority of the attacking traffic, and we are consulting with our WAF provider for further guidance on preventing such attacks.
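The preventive steps mention a more sophisticated route-blocking mechanism for probe paths such as the WordPress ones seen in this attack. A minimal sketch of the general idea in TypeScript (the blocked-path list and wiring are assumptions, not Frontegg's gateway code):

```typescript
// Illustration only: reject requests to well-known CMS probe paths before they
// reach application services. Path list and handler wiring are assumptions.
import { createServer } from "node:http";

const BLOCKED_PATH_PREFIXES = [
  "/wp-login.php",
  "/wp-admin",
  "/xmlrpc.php",
  "/wp-content",
];

export function isBlockedPath(path: string): boolean {
  const normalized = path.toLowerCase();
  return BLOCKED_PATH_PREFIXES.some((prefix) => normalized.startsWith(prefix));
}

// Example wiring with Node's built-in HTTP server:
createServer((req, res) => {
  if (isBlockedPath(req.url ?? "")) {
    res.statusCode = 403; // drop probe traffic early
    res.end("Forbidden");
    return;
  }
  res.statusCode = 200;
  res.end("OK"); // placeholder for real routing
}).listen(8080);
```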

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Frontegg services are in a Degraded State"

Last update
resolved

Frontegg services were in a degraded state, causing some users to experience login issues. The problem was fixed and is now under close monitoring on our side.

Report: "Partial outage in Frontegg services for some regions due to Cloudflare major outage"

Last update
resolved

This incident has been resolved.

monitoring

A fix was implemented and we have bypassed Cloudflare services

identified

Our DNS and WAF provider Cloudflare is partially down in some regions.

investigating

We are currently investigating this issue.

Report: "Management Portal in degraded state"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "US Region in partial outage"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "US region - Degraded performance in Frontegg services"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Degraded performance on Frontegg services"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Delay with sending webhooks"

Last update
resolved

This incident has been resolved.

identified

We are working on a fix to reduce the delay. Will continue to update

identified

We are currently investigating reports of delays in sending webhooks.

Report: "We have identified an issue with processing user operational emails"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Delay with audit logs"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating reports from some of our customers of delays with audit logs.

Report: "Frontegg Portal is in a Degraded State"

Last update
resolved

The incident has been resolved, all systems operational.

monitoring

A fix has been implemented and we are monitoring the results

Report: "Frontegg Portal is in a Degraded State"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Connectivity issues"

Last update
resolved

We are currently investigating this issue.

Report: "Performance degradation in Portal"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Issues with hosted login"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are working on a fix. This does not affect customers using the embedded version of the Frontegg login

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "US Server In a Degraded State"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Frontegg services where in a Degraded Performance"

Last update
resolved

During a maintenance operation on the database, secure access services were in a degraded performance state.

Report: "Frontegg services are in a Degraded State"

Last update
resolved

This incident has been resolved.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Frontegg services are in a Degraded State"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.