Historical record of incidents for Factorial
Report: "Scheduled Job Disruption"
Last update: We are currently experiencing an unexpected event that has resulted in the loss of some scheduled jobs. We are actively investigating the issue and working to restore normal operations as quickly as possible. We have identified the following impact:
- Automatic Breaks: users may notice inconsistent information regarding automatic breaks.
- Time Off and Expenses Reports: users may need to retrigger their time off and expenses reports to ensure accurate processing.
- Other minor side-effects, such as missed notifications.
We apologize for any inconvenience this may cause and appreciate your understanding as we work to resolve the issue. We will provide updates as more information becomes available. Thank you for your patience. Next Update: before 17/06/2025 12:00 CEST
Report: "Major outage"
Last update: All systems are now fully operational. Thank you for your patience, and we apologize for any inconvenience caused.
All systems are now back online. We're continuing to actively monitor the platform to ensure ongoing stability.
The issue has been identified, and our Platform team is actively working on implementing a resolution.
We are currently experiencing a major outage affecting the Factorial application and its related services. As a result, the platform is temporarily unavailable. Our engineering team is actively investigating the issue and working to restore services as quickly as possible. We will provide updates as soon as more information becomes available. We sincerely apologize for the inconvenience this may cause and appreciate your patience and understanding.
Report: "Major outage"
Last update: All systems are now fully operational. Thank you for your patience, and we apologize for any inconvenience caused.
All systems are now back online. We're continuing to actively monitor the platform to ensure ongoing stability.
The issue has been identified, and our Platform team is actively working on implementing a resolution.
We are currently experiencing a major outage affecting the Factorial application and its related services. As a result, the platform is temporarily unavailable. Our engineering team is actively investigating the issue and working to restore services as quickly as possible. We will provide updates as soon as more information becomes available. We sincerely apologize for the inconvenience this may cause and appreciate your patience and understanding.
Report: "Small amount of errors during a planned test operation"
Last update: Today between 04:39 and 04:50 UTC there was an incident that resulted in a temporary increase in error rates. This happened during a planned test operation. Our team promptly identified and resolved the issue, and is investigating the root cause to ensure it is not repeated in upcoming tests. We appreciate your understanding.
Report: "Degraded latency and increased error rates - failing cache system"
Last update: We have identified the root of the problem and have deployed fixes to restore the service performance. We apologize for the inconvenience caused.
We are investigating an issue where one of our cache nodes has become unresponsive, leading to a global performance degradation. We are working on restoring the service as soon as possible.
Report: "Demo environment temporarily unavailable"
Last update: We experienced a temporary outage in our demo environment between 17:52 and 18:04 UTC today, following a maintenance operation. During this time, users may have encountered issues accessing the demo environment. Our team identified the root cause of the issue and quickly implemented a fix. The demo system is now fully operational. We are conducting a thorough review of the incident to prevent future occurrences.
Report: "Incorrect avatars displayed"
Last update: On February 6, 2025, between 12:20 and 15:09, some components of the Factorial application may have displayed incorrect avatars due to an issue in our last release. The issue was reported at 14:48, and our internal team immediately resolved it. If you’re still experiencing issues, please refresh the page in your browser. If the issue persists you can reach us via your Customer Support Portal.
Report: "Slower response times and timeouts - advanced reports"
Last update: The performance issue we started experiencing yesterday has been resolved. Please keep in mind that advanced reports are still unavailable; we will re-enable them in the next 24h. Apologies for the inconvenience. Update: Reports were re-enabled on February 5th at 8:15 UTC.
We have restored the service, although the Reports functionality is currently unavailable.
We have identified the source of the issue and we are applying a mitigation to restore service as soon as possible.
This seems to be an unrelated issue, but we are currently facing a major outage in the application. We are investigating and will update as soon as we have more details on the cause and the estimated time of recovery.
A fix has been implemented and we are monitoring the results.
Our application is experiencing slower response times than usual. This will be most noticeable at peak hours like 13:00 and 14:00 UTC, when some requests could even time out and require re-submission. We are currently working on several fixes at both the application and infrastructure layers to restore our usual quality of service. We will update this incident as soon as they are deployed. Apologies for the inconvenience.
Report: "Unusual web traffic volume altering performance of the app"
Last update: Traffic has been stable and back to usual levels for a while.
Our actions have stabilized the system and the application is now performing normally. While the number of requests has dropped significantly, we are still investigating to understand the origin of the spike.
As of 11:00 UTC / 12:00 CET, we have observed a significant increase in requests to our application, resulting in performance degradation. We are implementing mitigation measures to restore service quality while we keep investigating the root cause of the issue. We appreciate your patience and will provide further updates as we work to resolve this matter.
Report: "Service outage caused by unresponsive database"
Last update: The lock has been cleared and the service has been restored. We will perform a more thorough investigation next week and set up the mechanisms and processes required to prevent this from happening again.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Sporadic errors using the application"
Last update: The fix deployed in the previous hour has resolved the issue. The application should work normally now. We are still investigating why these errors started appearing today and how they relate to a change that was published last Thursday. Apologies for the inconvenience.
Some users are experiencing sporadic errors loading or moving around different features of the application. We have identified the potential root cause of the issue and have deployed a fix that should resolve these errors.
Report: "German domain factorialhr.de unavailable"
Last update: The issue has been resolved; factorialhr.de is now serving our website again.
The issue is still ongoing, but we expect more information today when the German DNS registry resolves our open request.
There is an issue with the German domain name (factorialhr.de). The public website and customer pages using it are currently unreachable. We have identified the problem and are working on a fix. Meanwhile, users can still access the application at https://app.factorialhr.com/. We apologize for the inconvenience caused. Next update: before Monday 25th Nov. 12:00 UTC
Report: "Increased latency, response times a lot higher than usual"
Last update: Due to an issue with under-provisioned capacity, the Factorial web application experienced significantly increased latency, resulting in very long loading times from 6:00 AM to 7:00 AM (UTC). Apologies for the inconvenience. - The Factorial Team
Report: "Service outage caused by an unresponsive database"
Last update: We experienced a brief service interruption from 03:30 PM to 03:36 PM (UTC). During this time, our database became unresponsive, which impacted the availability of our service. To mitigate the issue, our team promptly killed the suspicious process, restoring database functionality immediately. All systems are now operating normally, and we'll apply measures to prevent similar occurrences in the future. We appreciate your understanding and patience as we work through this process.
Report: "Sidebar visibility issue in Factorial web application"
Last update: Since around 11:15 CET, the sidebar in the Factorial app hasn’t been visible, making it unusable. We identified the root cause but chose to quickly reduce the impact by rolling back to a previous version at 12:00 CET. Thanks for your patience, and sorry for the inconvenience! - The Factorial Team
Report: "Increased latency and error rates"
Last update: For now the mitigation is holding and the app is working normally. We continue investigating to find a permanent fix. Update: as of 9/11 midday UTC, the root cause has been addressed. We don't expect any further side-effects from this incident. Apologies for the inconvenience. - Factorial team
We have identified the root cause of the issue and are currently working on a fix.
We are continuing to investigate this issue.
We are experiencing increased latency and higher error rates since 18:55 UTC. Our team is investigating the origin of the issue. We will provide an update as soon as we have more information.
Report: "Performance regression"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have been experiencing slow response times, leading to the app being unusable at times. We have identified the root of the issue and are now working on a solution. We expect it to be deployed during the next 2 hours.
Report: "Service outage caused by an unresponsive database"
Last update: The fix has now been deployed.
We experienced another service interruption today from 06:43 to 06:51 (UTC). The symptoms were consistent with those observed on Tuesday; however, this time we had access to additional information that enabled us to identify the root cause of the issue. We are currently working on a resolution and are committed to deploying a fix promptly. We appreciate your understanding and support during this time. If you have any questions or require further assistance, please do not hesitate to reach out through our support channels. Thank you for your continued partnership. Best regards, Factorial Infrastructure Team
Report: "Service outage caused by an unresponsive database"
Last update: We experienced a brief service interruption from 09:00 AM to 09:15 AM (UTC). During this time, our database experienced unresponsiveness, which impacted the availability of our service. To mitigate the issue, our team promptly initiated a failover to the secondary database instance. This action successfully restored service functionality, and we are pleased to report that all systems are now operating normally. We are currently investigating the root cause of the incident to ensure that we can prevent similar occurrences in the future. We appreciate your understanding and patience as we work through this process. If you have any questions or require further assistance, please do not hesitate to reach out through our channels. Thank you for your continued support. Best regards, Factorial Infrastructure team
Report: "Degraded performance on Factorial"
Last update: We are pleased to inform you that the fix has been successfully deployed. The application is now operating at its normal performance levels. Thank you for your patience during this incident.
A fix has been tested and is currently being deployed to all environments.
The application suffered a major performance degradation at 15:25 CEST; the cause has been identified and is being tackled by our Engineering team. A fix should be deployed in the next 1-2 hours. The fix to the earlier issue, which only happened under certain conditions, has produced the desired results.
We keep seeing intermittent service interruptions. Our team is hard at work to identify and fix the root cause of the issue. Apologies for the inconvenience.
We are currently investigating reports of laggy performance affecting some customers when using our application and API. Our teams have determined that this issue is unrelated to the earlier infrastructure problem encountered today. We have identified the specific conditions that lead to this performance degradation and are actively working on a resolution. We appreciate your patience as we address this matter and will provide updates as soon as more information becomes available. Thank you for your understanding.
Report: "Degraded performance on Factorial app"
Last update: We are pleased to inform you that the performance degradation issue has been successfully resolved. Our team has conducted a thorough investigation and identified enhancements to our monitoring systems. These improvements will enable us to detect and address similar situations more effectively in the future. We appreciate your patience and understanding during this incident. Thank you for your continued support.
The fail-over to the secondary database has improved the situation as expected. We are monitoring the recovery and looking at side-effects of the situation before marking the incident fully resolved.
We have decided to fail over the database to the secondary instance in another availability zone. This should resolve the issue in a matter of minutes.
The performance of the application is heavily degraded since 9:00 CEST. We are investigating the source of the issue to restore the service as soon as possible.
Report: "Delay in time tracking calculations"
Last update: The issue has been resolved. There may be some inconsistencies in the calculation that will eventually be reconciled with the shifts in the database.
The system in charge of computing the time tracking totals displayed in the application has been experiencing unusual delays since 12:00 UTC. While shifts are still being registered, today's totals may not include the most recent ones. Our team has submitted a fix and we hope to see a recovery in a matter of minutes.
Report: "Core component failure made Factorial application unavailable"
Last update: An incident occurred today, between 14:22 and 14:49 CEST, which affected the performance of our application and website. During this time, a failure in a core component of our infrastructure resulted in slower response times, increased error rates, and, ultimately, service unavailability. Our incident response team acted swiftly to identify the issue and successfully replaced the failing component, restoring full service shortly thereafter. Following the incident, our infrastructure team conducted an investigation to understand the root cause of the failure. We have since implemented improvements to our configuration to prevent similar issues from occurring in the future. We sincerely apologize for any inconvenience this may have caused and appreciate your understanding as we continue to enhance the reliability of our services. Thank you for your continued support.
Report: "Factorial application unavailable"
Last update: The service has been restored; app.factorialhr.com is fully operational again. We apologize for the inconvenience caused and will perform an investigation to ensure such errors don't happen again in the future.
A fix is underway - we expect the service to be restored in the next hour.
Due to an error introduced in our latest release, the Factorial application is currently unavailable or partially loading. Our teams have identified the source of the problem and are investigating a fix to be deployed as soon as possible.
Report: "Elevated error rates"
Last update: The affected system has been replaced. This incident is now resolved.
Our monitoring systems have detected higher error rates than usual. In most cases these are timeouts caused by a malfunctioning system. Our engineers have applied remediation and we are confirming that service levels have recovered back to normal.
Report: "Brief service interruption during Database migration"
Last update: As part of our continuous efforts to improve the application and its performance, a short, unanticipated service interruption occurred while we were upgrading our database services. We apologize for the inconvenience this event may have caused our customers and will improve our protocols to ensure this kind of interruption does not reoccur.
Report: "Full outage after routing misconfiguration"
Last update: Our team introduced a misconfiguration at 09:20 CEST with an automated deployment; immediate action was taken and the service was restored at 09:28. Despite our validation processes, this introduced an unwanted change that triggered a second downtime at 10:00 CEST. Our emergency procedure was launched and we restored our services at 10:27 CEST. We are committed to delivering exceptional services and are constantly reviewing all our processes to avoid similar inconveniences in the future.
Report: "Factorial Backend Error"
Last update: We’re back! The Backend Service should be up and running. Thanks for bearing with us.
We are investigating a Backend Service issue that might be affecting some users. We are making every effort to find a solution as soon as possible. We'll soon provide another update.
Report: "Factorial Backend Error"
Last update: We’re back! The Backend Service should be up and running. Thanks for bearing with us.
We are investigating a Backend Service issue that might be affecting some users. We are making every effort to find a solution as soon as possible. We'll soon provide another update.
Report: "Factorial Backend Error"
Last update: We’re back! The Backend Service should be up and running. Thanks for bearing with us.
We are investigating a Backend Service issue that might be affecting some users. We are making every effort to find a solution as soon as possible. We'll soon provide another update.
Report: "Factorial Backend Error"
Last update: We’re back! The Backend Service should be up and running. Thanks for bearing with us.
We are investigating a Backend Service issue that might be affecting some users. We are making every effort to find a solution as soon as possible. We'll soon provide another update.
Report: "Factorial Backend Error"
Last update: We’re back! The Backend Service should be up and running. Thanks for bearing with us.
We are investigating a Backend Service issue that might be affecting some users. We are making every effort to find a solution as soon as possible. We'll soon provide another update.
Report: "Factorial Backend Error"
Last update: We’re back! The Backend Service should be up and running. Thanks for bearing with us.
We are investigating a Backend Service issue that might be affecting some users. We are making every effort to find a solution as soon as possible. We'll soon provide another update.
Report: "Degraded performance and requests timing out"
Last update: Due to a configuration error in an instance of our server cluster, requests to Factorial servers that hit that machine were timing out or had very slow response times. The issue has since been resolved.
Report: "Major outage"
Last update:
# What happened?
At 14:39 new content for our public pages was deployed, causing our cache to hit its maximum capacity limit. This event triggered a fallback strategy: we started requesting a third-party service to serve us the content for our public pages. This third-party service quickly became overwhelmed with requests and started applying an exponential backoff strategy, forcing our backend services to wait long periods of time to get a response, and thus making our API unresponsive.
# How did we solve it?
Increasing the maximum capacity limit of our cache fixed the issue.
# How are we going to make sure it does not happen again?
We are going to review our cache strategy, so that our whole infrastructure does not depend on it in order to function properly.
Major outage of all our services except the blog from 14:39 to 15:43.
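The postmortem above describes backend workers left waiting on a third-party fallback that applied exponential backoff. As a rough sketch of the direction the cache-strategy review points at, the snippet below caps how long a cache miss can block a worker, using hard per-attempt timeouts and a small, fixed retry budget; all names, URLs, and numbers are hypothetical, not Factorial's implementation.

```python
# Hypothetical sketch (not Factorial's actual code): a cache read with a
# bounded fallback so a struggling upstream can't stall API workers.
import time
import requests  # assumed available; any HTTP client with timeouts works

CACHE = {}                                            # stand-in for the real cache layer
FALLBACK_URL = "https://example-cdn.invalid/pages/"   # placeholder upstream
HARD_TIMEOUT_S = 2.0                                  # never wait longer than this per attempt
MAX_ATTEMPTS = 2                                      # small, fixed retry budget

def get_public_page(slug: str) -> str | None:
    """Return page content, preferring cache; degrade gracefully on a miss."""
    if slug in CACHE:
        return CACHE[slug]

    delay = 0.1
    for attempt in range(MAX_ATTEMPTS):
        try:
            resp = requests.get(FALLBACK_URL + slug, timeout=HARD_TIMEOUT_S)
            resp.raise_for_status()
            CACHE[slug] = resp.text
            return resp.text
        except requests.RequestException:
            if attempt + 1 == MAX_ATTEMPTS:
                break
            time.sleep(delay)                 # short, capped pause instead of unbounded backoff
            delay = min(delay * 2, 0.5)

    return None  # caller serves a stale or minimal page rather than blocking the API
```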
Report: "Increased latency, response times a lot higher than usual"
Last update: A recent change in one of our API endpoints made it much slower than usual, which after some time created a snowball effect in our application servers: requests from all endpoints were queued up and not served in a timely manner.
We're seeing high response times across all our features. We're currently investigating the issue.
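As an illustration of the snowball effect described in this report (not Factorial's code), the toy simulation below shows how one regressed endpoint can tie up a small, fixed worker pool so that unrelated fast requests queue behind it; the pool size and timings are invented.

```python
# Hypothetical illustration: with a fixed pool of workers, a handful of slow
# requests delays every other request that shares the pool.
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4                # stand-in for a small application-server worker pool
FAST_S, SLOW_S = 0.01, 1.0

def handle(is_slow: bool) -> None:
    time.sleep(SLOW_S if is_slow else FAST_S)   # simulated endpoint work

if __name__ == "__main__":
    # 1 in 10 requests hits the regressed endpoint; the rest are cheap.
    workload = [(i % 10 == 0) for i in range(100)]
    t0 = time.monotonic()
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        list(pool.map(handle, workload))
    print(f"total wall time: {time.monotonic() - t0:.1f}s")
    # Ten 1-second requests consume about 2.5s of this 4-worker pool's time,
    # so the cheap requests end up waiting in the queue behind them -- the
    # "snowball" the update describes.
```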
Report: "Time Tracking service outage"
Last update:
# What happened?
Today we made an upgrade to the Time Tracking service that involves migrating each company's settings to the new implementation. This process took longer than expected, disabling the service for up to 1 hour in some companies.
# How are we going to prevent similar issues in the future?
When an upgrade like this requires big data migrations, we will aim to implement it in a backward-compatible way (if possible) and run it at low-impact times (nights/weekends).
The Time Tracking service was disabled for some companies, preventing employees from clocking in and out in the desktop and mobile apps.
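The prevention note in this postmortem points at backward-compatible data migrations. Below is a minimal sketch of one common shape for that (often called expand/contract), assuming a hypothetical companies table with legacy and new settings columns rather than Factorial's real schema.

```python
# Hypothetical sketch of a backward-compatible ("expand/contract") migration:
# keep serving the old settings while new ones are backfilled in small batches,
# then switch reads over once the backfill is complete. Names are illustrative.
import time
import sqlite3  # stand-in for the real database

BATCH_SIZE = 500
PAUSE_S = 0.2   # brief pause between batches to avoid saturating the database

def backfill_new_settings(conn: sqlite3.Connection) -> None:
    while True:
        rows = conn.execute(
            "SELECT id, legacy_settings FROM companies "
            "WHERE new_settings IS NULL LIMIT ?",
            (BATCH_SIZE,),
        ).fetchall()
        if not rows:
            break
        for company_id, legacy in rows:
            conn.execute(
                "UPDATE companies SET new_settings = ? WHERE id = ?",
                (convert(legacy), company_id),
            )
        conn.commit()
        time.sleep(PAUSE_S)

def read_settings(conn: sqlite3.Connection, company_id: int):
    # Reads fall back to the legacy column, so the service keeps working
    # even while the backfill is still in progress.
    new, legacy = conn.execute(
        "SELECT new_settings, legacy_settings FROM companies WHERE id = ?",
        (company_id,),
    ).fetchone()
    return new if new is not None else legacy

def convert(legacy_settings):
    return legacy_settings  # placeholder for the real conversion logic
```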
Report: "Performance degraded on our API request time"
Last update: We fixed the main performance issue and the system is now stable. The custom fields feature has been enabled again. We will keep monitoring our system to detect possible regressions.
The issue has been identified and a fix is being implemented.
We found the culprit of this issue. Yesterday we deployed a change to our custom fields system with a non-performant endpoint; this kept our Puma workers busy serving these requests for about 60 seconds, which caused other requests to be delayed and eventually time out. We partially disabled the custom fields feature in order to keep other parts of the app working. We're fixing the performance regression and will re-enable the full custom fields feature once we get decent performance on the affected endpoint. We'll keep updating this incident with further steps.
We are currently investigating the issue.
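The update above describes partially disabling the custom fields feature to protect the rest of the app. Below is a minimal sketch of that kind of runtime kill switch, with invented names and an environment variable standing in for a real feature-flag store.

```python
# Hypothetical sketch of a feature kill switch: an expensive feature is gated
# behind a flag that can be flipped off at runtime so the rest of the
# application keeps serving requests. Names are illustrative only.
import os

def custom_fields_enabled() -> bool:
    # In practice this would come from a feature-flag service or config store;
    # an environment variable keeps the sketch self-contained.
    return os.environ.get("CUSTOM_FIELDS_ENABLED", "true").lower() == "true"

def list_custom_fields(company_id: int) -> list[dict]:
    if not custom_fields_enabled():
        # Degrade gracefully: return an empty result instead of tying up a
        # worker on the slow query while a fix is being prepared.
        return []
    return expensive_custom_fields_query(company_id)

def expensive_custom_fields_query(company_id: int) -> list[dict]:
    # Placeholder for the slow endpoint's real work.
    return [{"company_id": company_id, "field": "example"}]
```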
Report: "DNS change produced downtime"
Last update:
# What happened?
Today we made a change in our DNS (Domain Name System) that produced downtime by making our main domains ([factorialhr.com](http://factorialhr.com), [factorialhr.es](http://factorialhr.es), [factorialhr.fr](http://factorialhr.fr), ...) unable to be resolved. Our infrastructure change first destroyed and then recreated an existing DNS record, while our SOA retry-time was too high. That produced a downtime of about 7 minutes for our public sites.
# How are we going to prevent similar issues in the future?
We will re-think the way we apply DNS changes in our infrastructure. We will lower the retry-time in our SOAs to more manageable values. We'll also apply some changes manually first and then port those changes to our infrastructure code.
This incident has been resolved.
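The prevention plan in this postmortem includes lowering the SOA retry-time. As a small helper for inspecting those timers, here is a sketch that assumes the third-party dnspython package; it is not part of Factorial's tooling.

```python
# Sketch for checking the SOA timers mentioned in the postmortem, assuming the
# third-party dnspython package is installed (pip install dnspython).
import dns.resolver

def print_soa_timers(domain: str) -> None:
    answer = dns.resolver.resolve(domain, "SOA")
    soa = answer[0]
    # retry: how long secondaries wait before retrying a failed zone refresh;
    # minimum: also used as the negative-caching TTL for NXDOMAIN answers.
    print(f"{domain}: refresh={soa.refresh}s retry={soa.retry}s "
          f"expire={soa.expire}s minimum={soa.minimum}s")

if __name__ == "__main__":
    print_soa_timers("factorialhr.com")
```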
Report: "Overall degraded service"
Last update:
# What happened?
On May 23rd, 2020, we added a new Scheduler service to our infrastructure. The Scheduler is in charge, among other things, of distributing load between all our service instances to ensure the best possible performance and make Factorial more resilient to potential failures. Unfortunately, the Scheduler was misconfigured in such a way that it started distributing work unevenly among our services. On May 24th, as traffic to our websites and applications started increasing, the uneven distribution of work overloaded some of Factorial's services, causing a partial outage of our home page, web, and mobile applications that lasted for approximately one hour.
# How are we going to prevent similar issues in the future?
We noticed the problem immediately but took longer than expected to fix it, in part because some key personnel were not available at the time. We acknowledge this is due, in part, to a poorly implemented outage procedure at Factorial, and we commit to improving this procedure.
A technical incident caused by changes to our infrastructure resulted in intermittent outages of our Home Page, web, and mobile applications. The issue has been resolved and we are currently working to provide a more detailed postmortem.
Report: "Domain Name resolution interrupted"
Last update:
# What happened?
Last week we introduced a regression in the way we manage DNS in our infrastructure. This change meant that for about an hour our main domains [factorialhr.com](http://factorialhr.com) and [factorialhr.es](http://factorialhr.es) did not respond and some users couldn't access our site.
# How are we going to prevent similar issues in the future?
Our infrastructure tools have a way to preview the changes that are going to be applied. In the future we'll double-check that we're not introducing unexpected changes.
This incident has been resolved.
A technical incident caused by DNS changes to our infrastructure resulted in outages of our Home Page, web, and mobile applications. Due to this change, the main domains factorialhr.com and factorialhr.es stopped responding. We identified the issue and applied a change to restore the previous working behaviour.
Report: "Employees were not able to clock in"
Last update: We had an issue with our queueing system on Sunday night and the system didn't generate the December periods. This meant that employees couldn't clock in until ~9:00 when the issue was resolved.
Report: "All customers experienced a downtime for an hour"
Last update:
# What happened?
We tried to issue the rollback through the CI, but this relies on production machines to do the builds. There is already a task to fix this issue, but in the meantime we learned that the way to go should have been to issue the rollback directly on the machines, with Capistrano. Also, we managed to do the rollback quite quickly, but at that point Redis was already down. We kept looking for the culprit even though we had already fixed it. This lack of visibility obscured the real problem. If we had checked our monitors/logs more attentively, we would have seen that we were dealing with a different problem (Redis being down). We acknowledge that the response was not quick enough. Next time, we are going to roll back more quickly and make sure to review all the monitors before taking next steps.
We launched a new version of the product that had a performance regression. This new version caused the machines to slow down until they could not handle any load. At that point the engineers rolled back the faulty release, but the machines were still unresponsive and one of our database services (Redis) went down with them. In the end we managed to bring all the machines and their services back to life and restore service to all customers.
Report: "Factorial was down due to slow migration"
Last update:
**What happened**
A slower-than-usual database migration resulted in the Factorial backend service getting stuck in an inconsistent state, trying to query nonexistent attributes from the database. A service restart quickly fixed the issue.
**What are we going to do to prevent it in the future?**
We implemented a system to enforce that migrations are coded in a safe manner. We are going to start enforcing the use of this system from now on.
Factorial was down for 7 minutes. This incident has been resolved.
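The postmortem mentions a system that enforces safe migrations. One common shape for such a guard is a pre-deploy check that flags risky operations in migration files; the sketch below is hypothetical and its patterns are only examples, not Factorial's actual rules.

```python
# Hypothetical pre-deploy check that flags Rails-style migration files
# containing operations that tend to be unsafe during rolling deploys.
# The patterns and paths are illustrative only.
import pathlib
import re
import sys

RISKY_PATTERNS = [
    r"\bchange_column\b",   # can rewrite the table on some databases
    r"\bremove_column\b",   # breaks code still reading the old column
    r"\brename_column\b",   # not backward compatible during deploys
]

def unsafe_migrations(migrations_dir: str) -> list[str]:
    findings = []
    for path in sorted(pathlib.Path(migrations_dir).glob("*.rb")):
        text = path.read_text()
        for pattern in RISKY_PATTERNS:
            if re.search(pattern, text):
                findings.append(f"{path.name}: matches {pattern}")
    return findings

if __name__ == "__main__":
    problems = unsafe_migrations("db/migrate")
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # fail the pipeline so the migration gets reviewed first
```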
Report: "Factorial was down due a database high load"
Last update: There was a high MySQL load due to a badly performing migration.
Report: "Factorial was down due to a broken reference"
Last update:
What happened?
There was a broken reference in the production environment that wasn't detected by our deployment process, which crashed the whole backend application.
What are we going to do to prevent it in the future?
We are going to add a step to our deployment process that will be able to detect this kind of broken reference.
We already fixed the issue. Service should become operational soon.
Report: "Factorial being very slow and not loading sometimes"
Last update:
# Factorial very slow and unresponsive
## What happened?
We released a new version of Factorial with the new “Upgrade button”. In order to show/hide the button we needed to request more information from our API, which for some reason was hitting a third party (Stripe). This made the application unbearably slow.
## What are we going to do to prevent it in the future?
This kind of regression is very difficult to catch during development. We are going to keep investing in monitoring and alerting so we can catch these issues earlier and roll back fast so as not to affect our customers.
We already fixed the issue. Service should become operational soon.
Our engineers have identified the issue and are working on a fix.
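The postmortem above traces the slowdown to a per-request call that ended up hitting Stripe. One standard way to take such a call off the hot path is to cache its result with a short TTL; the sketch below is illustrative, with invented names, and is not Factorial's implementation.

```python
# Hypothetical sketch: cache a slow third-party lookup with a TTL so most
# requests never touch the external API on the hot path. Names are invented.
import time

_CACHE: dict[int, tuple[float, bool]] = {}   # company_id -> (expires_at, value)
TTL_S = 300.0

def show_upgrade_button(company_id: int) -> bool:
    now = time.monotonic()
    cached = _CACHE.get(company_id)
    if cached and cached[0] > now:
        return cached[1]
    value = fetch_subscription_status(company_id)   # the slow external call
    _CACHE[company_id] = (now + TTL_S, value)
    return value

def fetch_subscription_status(company_id: int) -> bool:
    # Placeholder for the real call to the billing provider.
    return True
```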
Report: "Application unresponsive"
Last update:
## 🤷♂️ What happened?
The application became very slow and some users reported that they couldn't work for 5 minutes.
## 🕵️♀️ Why did it happen?
We released a new version of Factorial to improve how we resize and serve user avatars and company images. During the migration the machines were overloaded and couldn't handle the traffic.
## 👮♀️ What are we going to do to avoid it in the future?
The team has learned – the hard way – that our infrastructure can't handle this kind of migration during peak working hours. We will run these migrations more progressively and, if that's not possible, choose the migration times more wisely.
Factorial became very slow and some users couldn't use the service.
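The postmortem above commits to running migrations more progressively and outside peak hours. A minimal sketch of what that can look like, with invented batch sizes, pauses, and peak-hour windows; this is not Factorial's actual job code.

```python
# Hypothetical sketch of a "progressive" migration: process records in small
# batches, pause between batches, and only run outside peak hours.
import datetime
import time

BATCH_SIZE = 100
PAUSE_S = 5.0
PEAK_HOURS_UTC = range(8, 18)   # skip the working day

def off_peak() -> bool:
    return datetime.datetime.now(datetime.timezone.utc).hour not in PEAK_HOURS_UTC

def migrate_avatars(pending_ids: list[int]) -> None:
    for start in range(0, len(pending_ids), BATCH_SIZE):
        while not off_peak():
            time.sleep(60)            # wait for a low-traffic window
        batch = pending_ids[start:start + BATCH_SIZE]
        for avatar_id in batch:
            resize_avatar(avatar_id)  # placeholder for the real resize job
        time.sleep(PAUSE_S)           # give the machines room to serve traffic

def resize_avatar(avatar_id: int) -> None:
    pass  # placeholder
```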