Historical record of incidents for Flatfile
Report: "Excel Files Not Extracting"
Last update: We are seeing an issue where Excel files are not being extracted. We have identified a cause and are working on a fix. CSV files are still working.
Report: "Uploads Are Unresponsive"
Last update: We are currently investigating this issue.
Report: "Intermittent Bad Gateway Errors"
Last update: A fix has been implemented and we are monitoring the results.
We are investigating an issue where some users are seeing bad gateway errors.
Report: "Files failing to upload"
Last update: The fix has been fully rolled out, and we are currently monitoring the platform.
The issue has been identified and a fix has been rolled out.
We are currently investigating reports of file upload failures in our UK platform region.
Report: "Unable to access platform.flatfile.com"
Last update:
# Incident Overview
**Nature of Incident:** Incorrect Application Frontend Container Deployed to Platform Frontend
**Services Affected:** Flatfile Platform Dashboard and Spaces
## Details of the Incident
At approximately 4:26pm MDT on June 3, a tag collision caused an image for an unreleased product to be deployed in place of the Platform frontend container. As a result, frontend assets for the wrong application were served in place of the Flatfile dashboard. Eight minutes later, the engineering team was alerted to the incident and manually deployed the correct image to resolve the issue.
## Impact Assessment
The incident affected all users of the Platform frontend applications for approximately 20 minutes. The API was unaffected.
## Root Cause
The root cause was a collision in tagging on images within the Flatfile image registry.
## Resolution
Flatfile engineering manually deployed a known good image to the frontend service to immediately restore service. Following this, the image registry was patched to prevent future incidents.
## Security and Data Integrity
Please be assured that this incident did not compromise the security or integrity of your data. Our commitment to data protection remains a top priority.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue preventing users from accessing platform.flatfile.com.
Report: "Platform Outage"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Spaces Failing to Load"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Slowness in Loading Spaces"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "File Uploads Failing"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue that is causing some file uploads to fail.
Report: "Identity Service Degradation"
Last update: Logins to the US, EU, and UK environments are operational again! All dashboards should now be accessible.
We're watching our service provider for updates.
We are continuing to work on a fix for this issue.
We're watching our service provider for updates.
Our identity provider is experiencing some instability, which is making the dashboard login inaccessible at the moment. We are monitoring and will update once this is back up again.
Report: "AU Region Frontend Application Outage"
Last update: A frontend application configuration variable change was deployed prior to deploying the code change. The configuration has been patched and the code change deployed.
Report: "Intermittent 504 Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified the issue and are rolling out a fix.
We are currently investigating customer reports of 504 errors on the platform.
Report: "Spaces failing to load"
Last update: This incident has been resolved. The root cause was a misconfiguration introduced when releasing an update to our spaces configuration, which resulted in spaces failing to load properly. The spaces configuration has now been fixed and rolled out.
A fix has been implemented and we are seeing reports of access to spaces being restored.
The issue has been identified and a fix is being rolled out
We are currently investigating reports where customers are facing issues with embedded spaces loading
Report: "Spaces Failing to Load"
Last update: This incident has been resolved. The root cause is still being determined.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
Spaces are not opening for some users.
Report: "Spaces failing to load"
Last update: Cache invalidation conflicts for our configuration service led to a degradation of the Spaces application in some edge locations.
Report: "Error Creating and Loading Spaces"
Last update: Earlier today we deployed a feature that relied on a new environment variable. Although we updated the relevant files with the correct variable, one of them failed to deploy properly. As a result, the variable was not found after deployment, which caused the errors that led to spaces not loading. This has been rolled back.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are seeing errors creating spaces and are investigating.
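The failure described in this report, a deployment that came up without an environment variable the new code relied on, is the kind of problem a startup check can surface before traffic is served. The TypeScript sketch below is illustrative only and is not Flatfile's implementation; the function names and the variable names in `REQUIRED` are hypothetical.

```typescript
// A minimal sketch, assuming configuration arrives as a plain key/value map.
type Env = Record<string, string | undefined>;

// Returns the names of required variables that are missing or blank.
function missingVariables(env: Env, required: readonly string[]): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Throws if anything required is missing, so a bad deploy fails fast
// instead of failing later on the first request that needs the variable.
function assertConfigured(env: Env, required: readonly string[]): void {
  const missing = missingVariables(env, required);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}

// Hypothetical variable names, for illustration only.
const REQUIRED = ["SPACES_CONFIG_URL", "NEW_FEATURE_KEY"] as const;
const exampleEnv: Env = { SPACES_CONFIG_URL: "https://example.invalid/config" };

// Reports the one variable that is absent from exampleEnv.
console.log(missingVariables(exampleEnv, REQUIRED));
```

Calling `assertConfigured` at process start (or in a deployment health check) turns a missing variable into a failed rollout rather than a runtime error for users.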
Report: "EU Application Routing Incident"
Last update: This incident has been resolved.
An update to our networking has been deployed and we are monitoring the solution.
The issue has been identified and a fix is being implemented.
Report: "Intermittent 503 Errors"
Last update:
# **Introduction**
On April 2, 2025, a service degradation caused intermittent requests for static assets to fail. These included requests for HTML, JS, CSS and other assets, resulting in failed delivery of frontend applications for several short bursts of time.
# **Incident Details**
* **Date Reported**: April 2, 2025
* **Issue Summary**: Delivery of frontend application assets degraded
# **Impact Assessment**
The incident resulted in degraded delivery of static assets used in the frontend applications, manifesting in the following:
1. Intermittent errors loading spaces
2. Missing assets in applications
3. NGINX error pages being viewed instead of Spaces
The incident did not affect usage of the API or browser clients that had cached the static asset files.
# **Root Cause**
Our cloud hosting provider terminated several EC2 instances in our Kubernetes fleet over several hours on the morning of April 2. The NGINX proxy that delivers static assets was forced to recreate on another node, resulting in several seconds of failed requests for assets. This occurred several times in succession.
# **Resolution & Fix**
1. **Immediate Remediation**
   * Flatfile infrastructure engineers scaled NGINX resources across the fleet to avoid downtime during disruptions
2. **Recovery Strategy**
   * We implemented a new routing and retry strategy combined with affinity rules to prevent scheduling on ephemeral resources
# **Follow-Up Actions**
* **Monitoring Enhancement**: While monitoring for this type of issue exists and alerts triggered correctly, enhancements could be made to escalate alerts and prompt faster response times.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating an issue where some users are seeing intermittent 503 errors
Report: "Caching Round-trip Discrepancy"
Last update:
# **Introduction**
A recent update introduced a change to how sheet counts are processed. As part of this update, we began including a timestamp (countsComputedAt) to indicate when the counts were calculated. Shortly after being deployed to production, several endpoints (such as get workbook, update records, patch job) started returning 400 errors with the message recordCount.updated_at.getTime is not a function.
# **Incident Details**
**Date Reported**: April 24, 2025
**What Went Wrong:** Data from workbooks is cached for performance. When this cached data is retrieved, it goes through a process that converts date objects into plain text (specifically, ISO8601-formatted strings). This conversion removes the special capabilities of a date object, such as calculating the time, causing our system to break when it tried to use those capabilities. This error did not appear in our testing or development environments because caching is disabled there, so the date information remained intact and behaved as expected.
**Why It Wasn’t Caught Earlier:**
* **Development and test environments differ from production**: Since caching is turned off during development and testing, we didn’t see the issue where the date value was converted into a plain string.
* **Incomplete type checking**: The part of our system responsible for managing cached data didn’t reflect that the cached date values had changed from their original form, so the issue wasn’t flagged by our automated checks.
**Next Steps:** We’re updating our development environment to more closely mirror production so that we can catch this type of issue earlier in the future. Additionally, we’re improving our type handling and tests around cached data to ensure these transformations are properly accounted for going forward.
# **Resolution & Fix**
1. Production was rolled back while the issue was investigated.
2. The workbook presenter logic was updated to handle the different types resulting from caching behavior.
3. The local and test environments were updated to perform type conversion consistently with production, even when caching is disabled.
A recent update introduced a change to how sheet counts are processed. As part of this update, we began including a timestamp (countsComputedAt) to indicate when the counts were calculated. Shortly after being deployed to production, several endpoints (such as get workbook, update records, patch job) started returning 400 errors with the message recordCount.updated_at.getTime is not a function.
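The round-trip behavior described in this report can be reproduced in a few lines. The TypeScript sketch below is a minimal illustration, assuming the cache serializes values as JSON; it is not Flatfile's presenter code. The `RecordCount` shape and the helpers `roundTripThroughCache` and `toDate` are hypothetical; only the `updated_at` field name and the `getTime` error come from the report.

```typescript
// Hypothetical shape of a cached value.
interface RecordCount {
  total: number;
  updated_at: Date;
}

// What a JSON-based cache hands back: the Date has become an ISO 8601 string.
type CachedRecordCount = Omit<RecordCount, "updated_at"> & { updated_at: string };

// Simulates a cache write followed by a cache read.
function roundTripThroughCache(value: RecordCount): CachedRecordCount {
  return JSON.parse(JSON.stringify(value)) as CachedRecordCount;
}

// Accepts either form and always returns a valid Date, so callers can
// safely use Date methods such as getTime().
function toDate(value: Date | string): Date {
  const date = value instanceof Date ? value : new Date(value);
  if (Number.isNaN(date.getTime())) {
    throw new TypeError(`Invalid date value: ${String(value)}`);
  }
  return date;
}

const original: RecordCount = {
  total: 42,
  updated_at: new Date("2025-04-24T12:00:00.000Z"),
};
const cached = roundTripThroughCache(original);

// After the round trip the field is a string, so calling getTime() on it
// directly would fail at runtime.
console.log(typeof cached.updated_at); // "string"

// Normalizing first restores the original instant.
console.log(toDate(cached.updated_at).getTime() === original.updated_at.getTime()); // true
```

Typing the cached value separately (`CachedRecordCount`) is one way to let the type checker flag code that treats a cached field as a `Date`, which relates to the "incomplete type checking" gap noted in the report.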
Report: "Spaces Not Loading"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are investigating an issue where some users are unable to load spaces from the dashboard.
Report: "Dashboard Inaccessible"
Last update: While updating the way we do releases, we encountered a drop in network traffic to our services due to a misconfiguration. We identified and reverted the change to resolve the issue.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are seeing a 503 Service Temporarily Unavailable error when trying to reach our dashboard. We are currently investigating this.
Report: "Intermittent 502s"
Last update: Flatfile experienced an issue maintaining database connections via a proxy. This degraded service: some connections from the app to the database were dropped unexpectedly, returning a 502 error to the client. The connection issue with the proxy was resolved at 11:47am MDT.
A fix has been implemented and we are monitoring the results.
We are seeing some intermittent 502 errors when loading Portals and performing some actions inside spaces. Our team is investigating this currently.
Report: "Login and space load errors on UK region"
Last update: An update deployed to the UK regional server broke some internal routing. Once the routing issue was identified, our team deployed an update to correct this behavior.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: ""Something went wrong" errors"
Last update:
# **Introduction**
On March 14, 2025, our team identified an issue where certain workbooks were failing to open and/or update. These failures were caused by a database incident involving one of our ephemeral database servers. This document outlines the incident details, the identified root cause, the steps taken to resolve the issue, and the long-term remediation plan.
# **Incident Details**
* **Date Reported**: March 14, 2025
* **Issue Summary**: One of Flatfile’s ephemeral database instances entered an abnormal state. Workbooks mounted to this database instance failed to open and/or be updated.
# **Impact Assessment**
The incident resulted in degraded service performance for users with workbooks on the Quickstore 3 database. Specifically, users experienced:
1. Intermittent unavailability of existing workbooks stored on the affected database
2. Issues loading sheets in newly created spaces that attempted to access data from the affected database
The incident did not affect the creation of new workbooks, as these would be directed to functioning database instances. Only workbooks that were already stored on the Quickstore 3 instance were impacted, leading to a compromised user experience for a subset of users.
# **Root Cause**
Initial investigations determined that the Quickstore 3 database had entered an abnormal state. The database writer node became unresponsive, preventing both read and write operations from completing successfully. While the exact trigger for this state is still under investigation, monitoring data suggests that the database instance may have experienced resource exhaustion or an internal failure that was not automatically resolved by the database management system.
# **Resolution & Fix**
1. **Immediate Remediation**
   * A backup of the affected database instance was completed to secure all data.
   * A new database instance was brought online to attempt to maintain service availability.
   * A new reader node was spun up while planning to remove the problematic node from service.
2. **Recovery Strategy**
   * After evaluating options, Flatfile launched a new database cluster using the backup at the same time that the reader node was coming online in case the additional reader node was unable to make the database healthy again.
# **Follow-Up Actions**
* **Monitoring Enhancement**: While monitoring for this type of issue exists and alerts triggered correctly, enhancements could be made to escalate alerts and prompt faster response times.
* **Root Cause Investigation**: Continue the investigation into database monitoring data to determine what initially caused the Quickstore 3 database to enter the problematic state.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are seeing some intermittent "something went wrong" errors when trying to load sheets for some users. We are currently investigating this.
Report: "Intermittent Jobs Failures"
Last update: As a result of a traffic spike, the database connection pool was overwhelmed between 10:30 and 11:30am Eastern. We will be updating our rate limiting algorithm to address this type of traffic.
Report: "Intermittent job failures"
Last update: As a result of a traffic spike, the database connection pool was overwhelmed. We will be updating our rate limiting algorithm to account for this particular type of traffic.
A fix has been implemented and we are monitoring the results.
We are seeing these errors start to recur and are investigating now.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are seeing some jobs intermittently fail, like file extractions. We are investigating now.
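The report above mentions updating a rate limiting algorithm so that traffic spikes do not overwhelm the database connection pool. The report does not say which algorithm is used, so the TypeScript sketch below shows one common option, a token bucket, purely as an illustration. The class name, parameters, and numbers are assumptions and do not describe Flatfile's limiter.

```typescript
// A minimal token bucket: allows short bursts up to `capacity` requests,
// then admits requests at a steady `refillPerSecond` rate.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be
  // rejected or queued by the caller.
  tryAcquire(nowMs: number = Date.now()): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Illustrative numbers: a burst of 3, refilling 1 token per second.
// Timestamps are passed explicitly so the example is deterministic.
const bucket = new TokenBucket(3, 1, 0);
const burst = [0, 0, 0, 0].map((t) => bucket.tryAcquire(t));
console.log(burst); // first three allowed, fourth rejected
console.log(bucket.tryAcquire(1000)); // true: one token refilled after a second
```

Placing a limiter like this in front of work that needs a database connection caps how many requests can compete for the pool at once; rejected requests can be retried later instead of piling up.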
Report: "Portal Instability"
Last update: We rolled back a change affecting jobs; this incident is resolved.
We're seeing errors related to jobs and are currently investigating this issue.
Report: "Portal Service Degradation"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Intermittent 400 errors on some operations"
Last update: This incident has been resolved. We were seeing intermittent connectivity issues to blob storage, and deployed a hotfix to restore a stable connection.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Intermittent Login Errors"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Some users are experiencing trouble logging into the Flatfile Dashboard. We are currently investigating this issue.
Report: "Database Connection Errors on EU Region"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Users are unable to Import Files"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are seeing an issue where users are unable to upload files to a space.
Report: "Login Issues"
Last update: We have rolled out a fix and systems are operational.
We have identified the issue and are rolling out a fix.
We are continuing to investigate this issue.
We are investigating an issue where some users are having issues logging into the platform and launching spaces.
Report: "Intermittent Errors on EU Regional Server"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Errors on Platform API"
Last update: This incident has been resolved. We tuned our database for this region to prevent long running queries.
A fix has been implemented and we are monitoring the results.
We are seeing errors when hitting the Platform API for our EU region. We are currently investigating.
Report: "Errors on Platform API"
Last update: Earlier today, we experienced issues with our app due to hitting memory constraints. We addressed the issue by scaling up our app, which resolved the memory constraints.
A fix has been implemented and we are monitoring the results.
We are seeing errors when hitting the Platform API for our EU region. We are currently investigating.
Report: "Slowness loading parts of the dashboard and spaces"
Last update: We recently observed increased network traffic that caused some areas of slowness. Our team identified the issue and implemented measures to optimize performance. We’re actively monitoring the system to ensure everything continues running smoothly.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Authentication Services Outage"
Last update: This incident has been resolved.
We are investigating an issue with our authentication services in the AUS region.
Report: "Intermittent Failure to load v2 Portal"
Last update: Earlier today, we experienced issues with our cache database due to hitting a memory limit. We have addressed the issue by freeing up additional memory on our database cluster.
A fix has been implemented and we are monitoring the results.
We are seeing some intermittent failures to load the v2 Portal, and are investigating this issue now.
Report: "Slowness, errors creating workbooks"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
A fix has been implemented and we are monitoring the results.
We are seeing some /counts requests and other API calls take a long time to resolve, and we are currently investigating this issue.
A fix has been implemented and we are monitoring the results.
We are seeing some slowness and/or errors when creating workbooks as part of new space creation. We are currently investigating this issue.
Report: "Slowness, errors creating workbooks"
Last update: Earlier today, we saw failed indexing on new workbooks and traced it to a recent change. We rolled back the change that caused these failures and will ensure it is addressed.
A fix has been implemented and we are monitoring the results.
We're seeing slowness when creating workbooks begin to return and are investigating why this is happening.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are seeing some slowness and/or errors when creating workbooks as part of new space creation. We are currently investigating this issue.
Report: "Flatfile Not Accessible"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring for any additional issues.
We are continuing to investigate this issue.
We are currently investigating an issue where some users may not be able to access the Flatfile app.
Report: "Slowness loading sheets, extracting files"
Last update: A testing scenario produced a dramatic traffic spike that temporarily degraded performance on API endpoints. Spaces that were created within this window may need to be recreated.
We are continuing to investigate this issue.
We are seeing slowness when loading and working in spaces. We're currently investigating the issue and will provide a status update once we have more information.
Report: "CSV Extraction Failure/Slow"
Last update: Queue backpressure caused by high network traffic degraded the performance of some asynchronous processes, such as file extraction.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Spaces not loading"
Last update: This incident has been resolved.
We are currently investigating an issue preventing spaces from loading. We have identified the issue and are working on a fix.
Report: "events.flatfile.com Errors"
Last update: We've restored the internal service, so you should no longer see these errors in your developer console.
You may see some errors in your network console indicating that the events.flatfile.com domain cannot be reached or is returning a bad gateway error. This is an internal service that we are working to restore, and should not affect your ability to use your Spaces or the Dashboard in the meantime.
Report: "Slow Performance in Spaces"
Last update: This is resolved now; we are seeing files extract and spaces create in a timely manner.
We are currently seeing an issue with some behavior in spaces taking a while to run, like file extraction and some actions. We are currently investigating this.
Report: "Some spaces taking a while to be created"
Last update: We saw a queuing issue with our events that impacted event-related behavior like space creation and file extraction. We've cleared out the affected queue and are seeing performance back to normal.
We are seeing some space:configure events taking a while to be read by server-side listeners. This slows down space creation and can make it look as if the space was not created at first. We are currently investigating this issue.
Report: "Authentication Errors on Legacy Platform"
Last update: We've identified the source of the errors, and have restored service to the dashboard and Portal services.
We are seeing errors when accessing the Legacy Dashboard, and some Portals. We are currently investigating this issue.
Report: "Degraded Performance"
Last update: This incident has been resolved.
The issue has been resolved and we are monitoring for any additional issues.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We're currently investigating an issue where some users are unable to access platform.flatfile.com. We're working to get this up and running as quickly as possible.
Report: "Degraded performance on custom jobs"
Last update: This incident has been resolved.
We are currently investigating an issue in Spaces where some types of custom jobs are delayed.
Report: "Spaces Loading Slower Than Expected"
Last update: This issue has now been resolved and spaces are loading as expected.
The issue has been identified and a patch is currently deploying.
We are currently investigating an issue that is causing spaces to load more slowly than they should for customers using the spaces UI or the latest Portal SDK.
Report: "Partial Degradation of Mapping Page"
Last update: For approximately 2 hours this afternoon, starting around 1:30 EDT, we saw a partial degradation of our mapping feature that caused the page to occasionally not load. We've rolled back the code that caused this degradation and are ensuring it's been permanently corrected.