Flagsmith

Is Flagsmith Down Right Now? Check whether there is an ongoing outage.

Flagsmith is currently Operational

Last checked from Flagsmith's official status page

Historical record of incidents for Flagsmith

Report: "Public Website Down"

Last update
resolved

Our website was down momentarily this morning, but it did not affect our main application (app.flagsmith.com). Users were unable to view our marketing page due to an operational issue with the management of our content management system. We'll share a post mortem in 24 hours.

Report: "Small number of intermittent requests to Edge API failing"

Last update
postmortem

## What Happened

On Monday 16th December at around 14:25 GMT, we started receiving reports from customers that they were seeing elevated errors in their monitoring regarding connections to the Flagsmith API. Our investigation suggested that there were no application level issues in the Flagsmith platform.

Since all customer reports were coming from Eastern US regions, we moved traffic away from the region, redirecting to our US west region. By this time, we were also able to carry out our own testing using infrastructure set up in US east. Our testing, and feedback from our customers, showed that moving the traffic away did not resolve the issue.

At 19:15 GMT, we opened a ticket with our infrastructure partner. Their investigations also confirmed that there were no issues with the Flagsmith platform itself. At 21:53 GMT, we created a ticket directly with our infrastructure provider (AWS) for them to investigate.

At around 23:30 GMT on 16th December, we implemented a workaround by creating a new DNS record to point directly to our infrastructure in US east, bypassing the latency based routing (as provided by AWS Global Accelerator). We shared this with customers experiencing the issues, and they confirmed that it resolved the issue.

From here, we continued to investigate the issue with AWS support, providing them with additional information based on our testing and reports from our customers. On 18th December at 17:58, we received the following information from AWS confirming that there had been an issue with Global Accelerator in the US East region.

> The team confirmed that between December 13 5:00 PM PST and December 17 2:50 PM PST, AWS Global Accelerator experienced intermittent connection failures for client traffic served by the Ashburn, Virginia edge location. The issue has been resolved and the service is operating normally.

Following this, we were able to confirm that the issue was no longer reproducible for us or for our customers.

## What's Next?

We have requested, and are currently waiting for, a full post-mortem from the AWS team which may affect our next steps. In the meantime, we have begun looking at alternatives to Global Accelerator that we may be able to keep as a cold standby in case of similar issues in the future.

resolved

We have received confirmation from multiple customers in the US east region now that the issue has been resolved.

monitoring

Our infrastructure provider has confirmed they were intermittently unavailable between December 14, 01:00 UTC and December 17, 22:50 UTC. We are working with affected customers to confirm that this issue is resolved.

identified

The issue has been isolated to a specific component in our hosting provider's network infrastructure. We are working with them, and escalating to get the issue resolved.

investigating

We have passed on all of the troubleshooting information to our infrastructure provider and are awaiting further information from them on this issue. If you are affected by this issue, please get in touch with us at support@flagsmith.com with any information that you can share so we can identify this intermittent issue.

investigating

We have had further reports of similar issues following the migration of traffic away from us-east-2, we are continuing to investigate.

monitoring

We have migrated all traffic away from us-east-2 and are monitoring the impact.

investigating

The issue seems to only be affecting clients connecting to the us-east-2 region. We're currently redirecting traffic away from the region.

investigating

We are currently investigating the Edge API sporadically timing out some requests or refusing connections.

Report: "Issues persisting identities with empty trait values"

Last update
resolved

Resolved at 15:22 UTC. A bug in the validation logic for traits meant that identities with traits that have a value of `""` would cause an exception, resulting in the identity not being stored correctly when updates were made. This issue was resolved at 15:22 UTC; any new requests to retrieve an identity's flags or store their traits will correctly store those traits.
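
A hedged illustration of the class of bug described above: validation that treats an empty string as a missing value. This is a minimal sketch with made-up names, not Flagsmith's actual code.

```python
# Hypothetical sketch: trait validation that accepts "" as a valid value.
def validate_trait_value(value):
    # A check like `if not value:` would wrongly reject "", 0 and False;
    # only None should be treated as a missing value.
    if value is None:
        raise ValueError("trait value is required")
    return value


assert validate_trait_value("") == ""  # empty string is a legitimate trait value
assert validate_trait_value(0) == 0    # falsy, but still valid
```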

Report: "Increased API latency"

Last update
resolved

This incident has been resolved.

monitoring

Environment creation is functional again now.

monitoring

API latency is back to normal. We are continuing to monitor.

identified

The cause was a migration applied to the production database taking more resources than we anticipated. The migration has been completed and the load is normalising. We are monitoring and managing the load. While this issue is ongoing, it is not possible to create new environments.

investigating

We are currently investigating this issue.

Report: "API occasionally slow or returning HTTP status 500"

Last update
resolved

Over a period of 5 days we experienced either slowdowns or 500 error response codes from our API. These brown-outs occurred for around 1 minute, 4 times a day. This was due to a misconfiguration in a client's SDK implementation that was sending us very high bursts of traffic following a push notification that was sent out to a large user population. In order to mitigate this outage, we have upsized our core database to 8x the capacity, and our app server cluster to 16x the capacity. This has provided us with enough capacity to serve these traffic bursts. We have also been in contact with the customer to help improve their SDK implementation to reduce the load on our API. We apologise for the degradation in service.

Report: "API outage"

Last update
postmortem

## Overview

On the 20th of January 2021, at 16:47 UTC, our REST API suffered a partial outage for 38 minutes, with partial service resuming over the course of 6 minutes, resulting in total downtime of 44 minutes. The core reason for the outage was a database migration that failed to apply correctly. We manually corrected the migration and service was resumed.

We're really sorry for this downtime. We work hard to try to ensure 100% uptime, and will take on these learnings to improve the service into the future.

## Background

As part of the development work around 3rd party integrations, we have been working on an integration with [Amplitude](https://amplitude.com/). This integration requires a new table to be created in the core Postgres database. Consequently, a Django database migration was created to facilitate this.

As part of this work, one of our developers manually edited the migration to make a change to the data schema. This was an error; migrations should not be manually edited; the engineer should have created a second migration to modify the data schema.

We have also been migrating our code to use the Black python formatter. This caused issues with regards to our code review process by polluting the code review with additional formatting that made reading the code harder than it ought to be.

## Testing

The code worked in our local, development and staging environments. This was because test data was present in the production environment but not in the development or staging environments. The migration failed to apply everywhere (because the app thought there was no migration to apply), but the exception was only thrown in production because there was data in the table.

## Outage

Once our code review had progressed, we merged our code to master and the CI/CD pipelines pushed it into production. This caused the outage. We were also late to be alerted to this on account of it not taking down endpoints like /health; /health was still reporting 200 OK response codes.

## Immediate Fix

We identified the issue quickly, wrote and tested a fix and then deployed it into production.

## Learnings

* We will ensure a 3rd set of eyes review any commits that include data migration code.
* We will ensure a better consistency of data during testing.
* Our downtime alerting has been improved to make synthetic API calls to core SDK endpoints like retrieving flags. This better simulates real world usage.
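
As a hedged illustration of the "second migration" point above: rather than hand-editing a migration that may already have been applied elsewhere, the schema change goes into a new Django migration so every environment applies the same migration history. The app, model and field names below are made up for the example.

```python
# Hypothetical follow-up migration sketch (names are illustrative, not
# Flagsmith's schema): the change lives in its own migration instead of being
# edited into the previously generated one.
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ("integrations", "0001_add_amplitude_table"),  # assumed earlier migration
    ]

    operations = [
        migrations.AlterField(
            model_name="amplitudeconfiguration",  # illustrative model name
            name="api_key",
            field=models.CharField(max_length=100),
        ),
    ]
```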

resolved

We've identified an issue with a database migration that failed in production as part of an upgrade. We've remedied the issue and are monitoring the platform, and will provide further updates as soon as we can. We've identified the code that caused the outage and will continue monitoring the platform, but are confident that the issue has been fully resolved. We will provide a post mortem as soon as we have completed the root cause analysis.

Report: "Web dashboard loading with increased latency"

Last update
resolved

This latency issue has been fixed.

identified

We recently migrated from Gitlab CI to Github Actions. The build pipelines introduced a change which has degraded the asset loading performance of the dashboard. We're deploying a fix now. Please note this is only causing slower loading of the site.

Report: "Increased latency on subset of requests"

Last update
resolved

This incident has been resolved.

monitoring

We have identified the issue as a large number of transactional database wait locks. We have increased the size of our API server cluster which has reduced these wait locks, as well as reducing overall API latency. We are monitoring the situation.

investigating

Hi, We are seeing increased latency to our SDK endpoints for a small percentage of requests. We are investigating and will provide a further update as soon as we have more information. Apologies for the degradation in service.

Report: "Migration causing temporary table lock"

Last update
postmortem

## Root Cause

At 13:15 UTC on 03 Feb 2022, we began deploying a routine release of the Flagsmith application to our production SaaS environment. This release included a database migration which added a new unique index to one of our tables which holds information about multivariate values for features. When the migration was run in our other environments we noticed no ill effects from the addition of the index; however, in production, where we have substantially more data, this index took longer to add than anticipated and required a full table lock during that period.

## Downtime

Our monitoring shows that the application was unresponsive for a period of just under 2 minutes while the migration was running.

## Long Term Remediation

To improve on this in the future, we are planning to upgrade our version of Django to allow us to easily add indexes concurrently. We will also be monitoring more carefully for future index additions and checking whether they will require a table lock. Finally, we will be looking at making our staging environment more representative of production in terms of data so that we can catch issues such as this in the future.
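
As a hedged sketch of the remediation described above: on PostgreSQL a unique index can be added without holding a long table lock by using `CREATE UNIQUE INDEX CONCURRENTLY`, for example via `RunSQL` in a non-atomic Django migration (Django 3.0+ also provides `AddIndexConcurrently` for non-unique indexes). The table, column and index names below are illustrative, not Flagsmith's schema.

```python
# Hypothetical non-atomic migration sketch using a concurrent unique index.
from django.db import migrations


class Migration(migrations.Migration):

    atomic = False  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

    dependencies = [("features", "0042_previous_migration")]  # assumed

    operations = [
        migrations.RunSQL(
            sql=(
                'CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS "uniq_mv_value" '
                'ON "multivariate_feature_state_value" '
                '("feature_state_id", "multivariate_feature_option_id");'
            ),
            reverse_sql='DROP INDEX CONCURRENTLY IF EXISTS "uniq_mv_value";',
        ),
    ]
```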

resolved

We've resolved an issue with a database migration that caused a temporary full table lock when modifying an index. Total outage was around 70 seconds. We're going to investigate the root cause and provide an update when it's ready.

Report: "Elevated error rates to our API"

Last update
resolved

API latency has returned to normal range following monitoring.

monitoring

Our fix has deployed to production and all key infrastructure metrics are returning to normal range. We will continue to monitor.

investigating

We're seeing elevated error rates to our API on account of a large influx of spurious traffic. We are deploying a fix which should help alleviate this.

Report: "Increased 502 responses to our Core API"

Last update
resolved

This incident has been resolved.

monitoring

The errors have been resolved. We are continuing to monitor this.

investigating

We are still seeing elevated 502 errors and are continuing to investigate. These issues are isolated to our Core API and seem to be affecting only ~0.1% of Core API requests. Our Edge API continues to be fully operational.

monitoring

We have deployed a fix and are monitoring our container memory consumption.

investigating

Hi, we are experiencing slightly elevated 502 errors returning from our Core API. We believe this is due to a recently introduced piece of code that is consuming more memory than it should be, and we are working on a fix.

Report: "Core API: Increased error rates"

Last update
postmortem

As part of a new feature rollout, there was a large database migration that needed to take place. We knew that the migration would take some time, however, it should not have affected production traffic. Unfortunately, despite our health check returning unhealthy until all migrations are complete, AWS ECS promoted the new version of the API application before the migrations were complete. This meant that the code that was running was expecting certain columns / data to be available in the database which weren’t there yet. We are still investigating what caused ECS to promote the new version before the migrations were complete.
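
One common pattern for the health-check behaviour described above is to have the health endpoint report unhealthy while unapplied migrations remain, using Django's `MigrationExecutor`. This is a hedged sketch, not Flagsmith's actual implementation; the view name is illustrative.

```python
# Hypothetical health-check sketch: report 503 while migrations are
# outstanding, so an orchestrator does not promote tasks running newer code
# against an older schema.
from django.db import connection
from django.db.migrations.executor import MigrationExecutor
from django.http import JsonResponse


def health(request):
    executor = MigrationExecutor(connection)
    plan = executor.migration_plan(executor.loader.graph.leaf_nodes())
    if plan:  # unapplied migrations remain
        return JsonResponse({"status": "migrating"}, status=503)
    return JsonResponse({"status": "ok"})
```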

resolved

This incident has been resolved.

investigating

We are seeing increased 502 responses to our Core API. We are aware of the cause and working on a fix. The Edge API is unaffected.

Report: "Core API is not responding"

Last update
resolved

This incident has been resolved.

monitoring

We are still seeing issues with the ECS cluster scaling out correctly, again down to the recent AWS eu-west-2 outage. We are monitoring.

monitoring

The DB has recovered. API latency is high as our ECS cluster scales out.

identified

AWS are restoring services. We hope to have the API back up shortly.

identified

AWS have confirmed an outage in their EU-West-2 data center. We are monitoring and waiting on AWS to provide updates.

identified

This looks to be an issue with AWS where our Core API is located. We are investigating.

investigating

We are investigating an outage on our Core API.

Report: "Major Core API outage"

Last update
postmortem

At 12:46 UTC on Thursday 18th August, our monitoring picked up an increased number of HTTP 502s being served by our API. Upon investigation it became evident that an unexpected increase in load on the PostgreSQL database that serves our Core API was causing our application to struggle to serve some requests, and we saw increased latency on those that were being served.

In an attempt to resolve the issue, we adjusted the settings in our ECS cluster to reduce the number of connections to the database. Unfortunately, making this change via our IaC workflow meant that the ECS service tried to recreate all the tasks but couldn't do so, as the health reporting was unable to consistently report a healthy status. This meant that our Core API was essentially flapping up and down while it tried to reinstate all the tasks. During this period, our API was continuing to serve some requests, with increased latency; however, there would still have been a large proportion of HTTP 502s.

Following the above, our engineering team looked into the requests that were causing the increased load. From our investigation, it was apparent that the increased load was all to our environment document endpoint (which powers the local evaluation in our latest server side clients). This endpoint, although usable in our Core API, is very intensive as it generates the whole environment document from our PostgreSQL database to return to the client in JSON form. This involves a large number of queries.

The compounding factor was a bug in our Node client regarding request timeouts. The Node client takes an argument of requestTimeoutSeconds on instantiation; however, it passes this directly into the call to the Node Fetch library's fetch function, which expects the timeout to be passed in milliseconds. As such, if requestTimeoutSeconds was set to e.g. 3, the request would time out in 3ms and retry (3 times by default). So, every time a Node client polled for the environment, it would be making 3 requests in ~9ms (or as close to it as Node can manage).

We were able to block the traffic to this endpoint for the customer that was putting an unusual amount of load through it due to their configuration and the above bug in the Node client. Once we had blocked this traffic, the application began serving traffic as normal again. This occurred at 15:24 UTC. At this point, traffic to the Core API was back to normal and all requests were served successfully.

To remediate this issue, we are stepping up our efforts to encourage all of our clients to move over to our Edge API, which is immune to issues of this nature. We are also planning to make improvements to the existing Core API platform to help guard against these issues in the future:

1. The addition of caching to our environment document endpoint to improve performance / minimise database impact (sketched below)
2. The implementation of automated rate limiting to better protect the platform from issues of this nature

If you've read this and are unsure how to migrate to our Edge API, you can find out everything you need to know [here](https://docs.flagsmith.com/advanced-use/edge-api).
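
As a hedged sketch of remediation (1) in the list above: caching the environment document endpoint for a short TTL means repeated polls from local-evaluation SDKs do not each rebuild the document from PostgreSQL. The view, helper and TTL below are illustrative, not Flagsmith's actual code.

```python
# Hypothetical Django view sketch: per-URL caching of the environment document.
from django.http import JsonResponse
from django.views.decorators.cache import cache_page


def build_environment_document(environment_key: str) -> dict:
    # Placeholder: the real implementation assembles the document from the database.
    return {"environment_key": environment_key, "flags": []}


@cache_page(10)  # serve a cached copy for up to 10 seconds per URL
def environment_document(request, environment_key):
    return JsonResponse(build_environment_document(environment_key))
```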

resolved

Major outage of our Core API, affecting dashboard usage and flag retrieval for users that have not yet migrated to Edge.

Report: "Issue processing analytics data"

Last update
resolved

The issue with our downstream provider has been resolved. We will continue to monitor.

investigating

We're currently awaiting further information from the third party about the issues. As of yet, we don't have an ETA on resolution, however, the issue is still limited to analytics reads / writes.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating issues with a downstream provider impacting our ability to handle requests to write / read feature analytics. This is not affecting any critical services for managing or retrieving feature flags.

Report: "Subscription status is missing"

Last update
resolved

This incident has been resolved.

monitoring

We have restored the subscription status data.

investigating

We have an issue where we are not showing the correct subscription status on the Organisation settings page. We are investigating.

Report: "Delayed Flag updates"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

We are seeing increased task queue sizes which is delaying Flag updates propagating to our Edge API. We have isolated the issue and are working on a fix.

Report: "Environment based integrations not working for Edge API"

Last update
postmortem

On June 7th at 10:33 UTC we released a change to our Edge API as part of [this issue](https://github.com/Flagsmith/flagsmith/issues/430) that filters out server-side only features when a client API key is used. These changes also affected the logic responsible for triggering environment-level integrations, causing them to fail.

**Which integrations were affected?**

* Mixpanel
* Segment
* Heap
* Webhooks
* Amplitude

On June 15th at 22:25 UTC we were notified by a customer that they were not seeing data populated from their integration. First thing on June 16th the engineering team began investigating the issue. The team immediately identified that the change described above had changed the signature of a function that was also used by the integrations logic. Unfortunately, this change had not been picked up by our tests or code review, and the subsequent errors were not picked up by our monitoring.

**Why didn't our tests pick this up?**

The unit tests covering the integrations logic utilised mocking, meaning that the change to the method signature was not correctly identified as an issue, and our end to end test suite did not include the verification of successful integrations.

**Why didn't our monitoring pick this up?**

The monitoring in place to track the error rate on the function responsible was using an incorrect aggregation algorithm, meaning that the threshold for alert was never breached.

At 18:03 UTC on June 16th a fix was released and service was resumed to all environment-based identity integrations (listed above). Unfortunately, due to the implementation of the asynchronous call to the lambda function that handles the integrations, it is not possible to recover the data that was lost during this period.

### What are we doing to prevent this from happening in the future

* Improving our unit tests to rely less on mocks and, where they do rely on mocks, ensuring they utilise `spec` correctly (see unittest documentation [here](https://docs.python.org/3/library/unittest.mock.html) for further reading, and the sketch after this list)
* Extending our E2E testing suite on the Edge API to include tests for all integrations.
* Alerting & monitoring:
  * Immediately this morning, we are changing our alerting to use the correct aggregation algorithm
  * In the near future, we will be improving these alerts to use percentiles and anomaly detection to ensure that errors are picked up quicker and are more accurate
  * Introducing Sentry to better track error logs that are reported from the Edge API
* Moving asynchronous invocations of other lambda functions to use persistent queues and/or an event messaging system so that, after issues such as this, tasks can be re-run to ensure no data is lost.
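
To illustrate the `spec` point above (a hedged sketch, not Flagsmith's test suite): with `autospec=True`, a mock enforces the real function's signature, so a call site that was not updated after a signature change fails the test instead of passing silently. Function names below are made up.

```python
# Hypothetical test sketch showing why autospec catches signature drift.
from unittest import mock


def trigger_integration(identity, traits):
    """Imagine this signature recently gained the `traits` argument."""


def test_stale_call_site_is_caught():
    with mock.patch(f"{__name__}.trigger_integration", autospec=True) as mocked:
        try:
            mocked("user-123")  # old call site, missing `traits`
        except TypeError:
            pass  # autospec enforces the real signature and rejects the call
        else:
            raise AssertionError("a plain MagicMock would have accepted this call")
```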

resolved

Issue has been resolved. Integrations all tested as working. Post-Mortem report to come next week.

identified

We are aware of issues firing environment based integrations in our Edge API. We have identified a fix and are working on the testing now. Performance of the Edge API itself is not affected.

Report: "Erroneous flag values"

Last update
postmortem

## Summary of the issue

Following a release of the Core API at 10:35 UTC, a regression was introduced which meant that the generated environment document contained erroneous flag values for those flags which had recent change requests that had not been committed (and potentially deleted). Since the environment document is used to generate the flags for the Edge API and SDKs running in local evaluation, this meant that certain customers using these methods to evaluate their flags would have received erroneous flag values. In this situation, the flag values served were those that were included in the uncommitted change requests.

## Resolution steps

At 17:20 UTC we were notified of this issue by a customer that was affected by the erroneous values. At 17:59 UTC the issue was identified and a fix was being developed. This fix was fully developed and released by 19:10 UTC, and all affected environments were regenerated by 19:30 UTC. The PR for the fix can be found [here](https://github.com/Flagsmith/flagsmith/pull/2378) for those interested in reviewing further.

## Next steps / preventative measures

In order to prevent these issues in the future, we plan to expand our end-to-end testing suite to further cover our change requests workflows so that we can identify these issues earlier.

resolved

Customer reports of erroneous flag values being served in local evaluation mode.

Report: "Slow response times for Edge API requests"

Last update
postmortem

## Timeline

At 12:15pm UTC, we were notified of increased response times on a number of our Edge API endpoints. Investigation showed nothing immediately obvious, but we suspected that it could be caused by Sentry, our APM tool. We set about removing the Sentry initialisation from our code and deployed it as soon as we could.

At 12:48pm UTC, this change was deployed and we observed the response times decrease immediately. At 12:52pm UTC our monitoring confirmed that the average response time had returned to normal.

## Next Steps

* Look into improvements to reduce / remove the impact of Sentry issues on our Edge API.
* Decrease the shutdown timeout of the Sentry SDK (sketched below).
* Look at using [Sentry relay](https://docs.sentry.io/product/relay) to remove the impact on core Edge API services.
* Add integration tests to simulate performance degradation / outages from all downstream services.
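
As a hedged sketch of the "decrease the shutdown timeout" item above: the Python Sentry SDK exposes a `shutdown_timeout` option (in seconds) that bounds how long a struggling Sentry backend can delay flushing events on shutdown. The DSN and values below are placeholders, not Flagsmith's configuration.

```python
# Hypothetical Sentry initialisation sketch with a short shutdown timeout.
import sentry_sdk

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    shutdown_timeout=1,       # default is 2 seconds; keep final flushing brief
    traces_sample_rate=0.01,  # keep APM overhead low on hot request paths
)
```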

resolved

This incident has been resolved.

monitoring

The downstream service has been successfully removed. Response times have returned to normal. We are continuing to monitor the situation.

identified

We have identified an issue caused by a downstream service which is causing a knock on effect to our performance. We are currently deploying a change to remove the downstream service.

investigating

We are currently investigating this issue.

Report: "Issue affecting permissions for project admin users"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are investigating an issue where project admin users do not receive the inherited permissions on each environment in the project.

Report: "We are currently encountering difficulties with our task processing system."

Last update
postmortem

## Timeline

We were alerted at 23:39 UTC on 18/07/2023 that the queue for our asynchronous task processor was above the acceptable threshold. Once our team was online in India at 2:59am UTC, the status page was updated. By this time the task processor queue had backed up and the application was not able to write flag change events to the datastore which powers the Edge API.

We investigated multiple avenues to determine the cause of the issues, but there were multiple 'symptoms' that made determining the root cause very difficult. One specific issue, which turned out to be a red herring, related to the functionality to forward Core API requests to the Edge API. This process seemed to be taking much longer than expected. Much of the investigation was spent restricting the usage of this functionality.

At around 9:30am UTC, the cause was attributed to a particular set of tasks in the queue which were causing the processor units to run out of memory. Once it was determined to be safe to do so, these tasks were removed from the queue. At 10:19 UTC the issue had been resolved and the queue had returned to normal, meaning that flag change events were being written to the Edge API datastore again. Any changes that were not processed at the time were also re-run to ensure that the state was consistent with the expected changes that had been made in the database.

## Issue Details

The issue was caused by an environment in the Flagsmith platform that included 400 segments and nearly 5000 segment overrides. This meant that the environment document which is generated to power the Edge API was larger than the task processor instances could load into memory, and subsequently write to the Edge API datastore. To compound the issue, these changes were made via the Flagsmith API, which resulted in thousands of tasks being generated to update the document in the Edge API datastore in a short space of time. Each of these needed to load the offending environment, causing the task processor instances to fall into a cycle of running out of memory. These tasks were slowly being blocked from being picked up again by the processors, but the quantity meant that there were always new versions of the same (or very similar) tasks to pick up.

## Next Steps

* Implement limits on the size of the environment document (sketched below).
  * This will primarily consist of implementing limits on the number of segments and features in a given project, as well as limiting the total number of segment overrides in a given project.
* Deprecate the functionality to forward requests from the Core API to the Edge API. All projects using the Edge API will need to ensure that all connected SDKs are using the Edge API only.
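
As a hedged sketch of the first next step in the list above: one way to limit the size of the environment document is to reject a write when the serialised document exceeds a threshold, rather than letting worker processes exhaust memory. The function names and limit below are illustrative assumptions, not Flagsmith's implementation.

```python
# Hypothetical guard on the environment document size before syncing it.
import json

MAX_DOCUMENT_BYTES = 400 * 1024  # assumed limit for the example


def write_to_edge_datastore(payload: str) -> None:
    # Placeholder for the real write to the Edge API datastore.
    ...


def enqueue_environment_write(environment_document: dict) -> None:
    payload = json.dumps(environment_document)
    if len(payload.encode("utf-8")) > MAX_DOCUMENT_BYTES:
        raise ValueError(
            "environment document exceeds size limit; "
            "reduce segments / segment overrides before syncing"
        )
    write_to_edge_datastore(payload)
```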

resolved

This incident has been resolved. We will publish a full post-mortem imminently.

monitoring

We have deployed an update which has resumed consumption of the task queue. We are now processing the task queue and expect to be caught up in the next hour.

identified

We have identified a database lock that has caused this issue with the task processor. We are working on an interim fix as we identify the root cause.

investigating

We are continuing to investigate this issue with the utmost priority.

investigating

At the moment, we are conducting an investigation, which indicates that any flag changes made in approximately the last two hours may not be visible to clients.

Report: "Performance impacted to our Core API"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We were alerted to a slow DB query in newly released code at 13:09 BST. We are reverting the code and expect to be back to normal latency in the next 10 minutes. Edge API is not affected.

Report: "Core API is not responding"

Last update
postmortem

**Summary**

On September 5th at 09:45 UTC, we initiated a release that included a database migration aimed at introducing a new constraint to the table containing information related to flags. According to our pre-live tests, this task should not have taken more than 50 milliseconds. Unfortunately, during the release to production, due to the high throughput on a particular table that it needed to acquire a temporary lock on, the migration caused a backlog of blocked connections waiting on it to complete. This had a knock-on effect that exhausted the connections on the database, and a full restart was necessary. Once the restart was complete, the connections were restored and service was resumed. This happened at 10:20 UTC.

**Next Steps**

We have researched the cause of the issue, and we still have further research to do to understand certain aspects. Our current plan in the meantime is to implement certain safeguards, described in the following links to the Postgres documentation, which should help reduce any impact in the future.

[https://www.postgresql.org/docs/11/runtime-config-client.html](https://www.postgresql.org/docs/11/runtime-config-client.html)

[https://www.postgresql.org/docs/11/runtime-config-logging.html](https://www.postgresql.org/docs/11/runtime-config-logging.html) (`log_lock_waits`)
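
As a hedged sketch of the safeguards referenced in the linked Postgres documentation: a per-connection `lock_timeout` makes a migration that cannot promptly acquire its lock fail fast, instead of queueing other connections behind it (`log_lock_waits` is enabled server-side in postgresql.conf). The Django settings values below are illustrative, not Flagsmith's configuration.

```python
# Hypothetical Django settings sketch: libpq options applied to every
# connection the application opens.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "flagsmith",
        # -c lock_timeout: give up quickly if a lock cannot be acquired;
        # -c statement_timeout: cap how long any single statement may run.
        "OPTIONS": {"options": "-c lock_timeout=3s -c statement_timeout=30s"},
    }
}
```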

resolved

This incident has been resolved. A postmortem will follow.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We have identified a database migration that has failed as part of a new release. We are working to re-apply the migration.

investigating

We are currently investigating this issue.

Report: "Identity integrations are not being triggered"

Last update
postmortem

## **Summary**

On September 5th at 09:45 UTC, we initiated a release that included a database migration aimed at introducing a new constraint to the table containing information related to flags. According to our pre-live tests, this task should not have taken more than 50 milliseconds. Unfortunately, during the release to production, due to the high throughput on a particular table that it needed to acquire a temporary lock on, the migration caused a backlog of blocked connections waiting on it to complete. This had a knock-on effect that exhausted the connections on the database, and a full restart was necessary. Once the restart was complete, the connections were restored and service was resumed. This happened at 10:20 UTC.

## **Next Steps**

We have researched the cause of the issue, and we still have further research to do to understand certain aspects. Our current plan in the meantime is to implement certain safeguards, described in the following links to the Postgres documentation, which should help reduce any impact in the future.

[https://www.postgresql.org/docs/11/runtime-config-client.html](https://www.postgresql.org/docs/11/runtime-config-client.html)

[https://www.postgresql.org/docs/11/runtime-config-logging.html](https://www.postgresql.org/docs/11/runtime-config-logging.html) (`log_lock_waits`)

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Service Outage"

Last update
resolved

The incident was related to an erroneous DNS change. This has now been reverted and service should be back up and running. There may be a period where failures are still seen while we wait for the DNS change to propagate through caches.

investigating

We are currently investigating reports of a service outage on all of our infrastructure.

Report: "Increased error rates on the Edge API"

Last update
postmortem

### Timeline

At around 13:45 today, we deployed a change to resolve a validation issue that had been introduced in a release earlier today. This validation issue affected only requests which provided a numeric value for the identity identifier. The new validation which was added, however, caused an issue for certain integrations since it also added a requirement for the traits key to be provided (and not omitted), which is not the case in some of our clients (the Go client, for example, omits the traits key if the list is empty). This meant that valid requests from these clients for identities with no traits were being incorrectly rejected as invalid.

Once we received alerts for this from our monitoring and some of our affected customers, we began investigating. At 14:54 we deployed a change which resolved the validation issue for certain cases, however not all. As such, at 15:06 we made the decision to roll back the affected regions, and at 15:48 we deployed a permanent fix for this including additional test cases to cover this behaviour.

### Impact

Since the requests that were affected by this issue were those that had no traits, the impact was fairly limited and no trait data has been lost. Some identities will not have been created during this period; however, due to the nature of the Flagsmith integration, subsequent calls to identify those users will create them.

### Next Steps

We have been working hard already on improving our release process for the Edge API. The first step of this, which is due to be released next week, is to improve our automated releases to roll back based on a number of additional alerting factors, including more granular looks at our error rates. This will ensure that, in future, a small subset of errors like this will trigger an immediate automated rollback. The next step after this is to create a more comprehensive end to end testing suite which exercises each of our SDKs to verify that the integrations are all compatible with any new changes.
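
As a hedged illustration of the validation fix described above (serializer names are made up, not Flagsmith's code): the traits key needs to be optional at the request level, not merely allowed to be an empty list, since some SDKs omit the key entirely when there are no traits.

```python
# Hypothetical Django REST Framework serializer sketch.
from rest_framework import serializers


class TraitSerializer(serializers.Serializer):
    trait_key = serializers.CharField()
    trait_value = serializers.CharField(allow_blank=True)


class IdentifySerializer(serializers.Serializer):
    identifier = serializers.CharField()
    # required=False: a request body with no "traits" key at all is still valid,
    # matching clients that omit the key when the trait list is empty.
    traits = TraitSerializer(many=True, required=False)
```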

resolved

This incident has been resolved.

identified

We are deploying the permanent fix now.

identified

We are continuing to work on a fix for this issue.

identified

We have rolled back in certain affected regions and have completed the work for the permanent fix. This is in the final stages of testing now and will be rolled out imminently.

identified

We have identified the remaining issue and are implementing a fix. ETA for full resolution: 15 minutes.

investigating

We are continuing to investigate this issue.

investigating

Issues are still persisting for integrations using the Go client. We are investigating further.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Issues with flag updates"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We have rolled back a recent change which has cleared out the backlog of tasks. Updates to flags should be propagated as normal.

investigating

We have been alerted to an issue with our asynchronous task processor which handles replicating flag updates across our network. We are currently investigating.

Report: "Core API Outage"

Last update
resolved

Our Core API was overwhelmed by a massive traffic spike, causing the core SQL database to become extremely slow. This led to ECS tasks failing their health checks, prompting the load balancer to start and stop new tasks, which in turn added more load to the already maxed-out database. We tried several approaches to rate limiting the source of the traffic, but eventually had to temporarily stop traffic at the load balancer for 2 minutes in order to stabilise the system. We are working on implementing AWS API Gateway to include rate limiting at the gateway level to avoid this sort of incident in the future.
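
As a hedged sketch of the gateway-level rate limiting mentioned above: API Gateway usage plans support throttle settings, configurable for example via boto3. The API id, stage and limits below are placeholders, not Flagsmith's configuration.

```python
# Hypothetical sketch: create a usage plan with throttling for an API stage.
import boto3

apigw = boto3.client("apigateway")

apigw.create_usage_plan(
    name="core-api-default",
    apiStages=[{"apiId": "abc123", "stage": "production"}],  # placeholder id/stage
    throttle={"rateLimit": 500.0, "burstLimit": 1000},       # steady-state rps / burst
)
```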

Report: "Core API outage"

Last update
postmortem

At around 23:35 UTC, July 9th we received an alert that our Core API was not responding. This resulted in our SaaS customers not being able to use the Flagsmith dashboard ([app.flagsmith.com](http://app.flagsmith.com)). Customers' SDKs serving flags were not impacted for those using the Edge API. Please note, any customers still using our Core API to serve flags were also impacted. This number is limited, as we have advised customers to migrate to the Edge API, starting in June 2022. Our team resolved the issue at 3:06 UTC, July 10th and the Core API was fully responsive.

The root cause of the issue was a database running at maximum CPU, caused by requests to an endpoint that triggered an inefficient query. We also had our load balancer consistently recycling unhealthy API tasks, which also strained the system due to unnecessary database connections. These two items combined resulted in the Core API being unresponsive. We recovered the database by dropping all traffic and terminating all open connections. This allowed the database to recover and process traffic normally.

We are mitigating future issues like this by doing the following:

* Optimizing the query that was triggered, which used too much CPU capacity (note that this has been completed and deployed to our production SaaS environment)
* Adding better alerting when inefficient queries are identified in the application
* Improving our internal tools (e.g. PagerDuty) to improve the response time of issue identification when some team members are out of office

resolved

Core API and admin dashboard outage on 10th July 2024.