Historical record of incidents for Fasterize
Report: "Acceleration is disabled."
Last updateWe currently have some issues on our European infrastructure. We're investigating. Acceleration is disabled; traffic should be redirected to origins.
Report: "Temporary Service Disruption : Error 502"
Last updateWhat happened: This morning at 8:52, we made planned adjustments to our infrastructure in order to improve the reliability of the connection between our platform and the CDN. During the transition, the service experienced a short disruption due to an unexpected load balancer restart. Impact: The site was briefly unavailable for approximately one minute, generating 502 errors. Resolution: The situation was quickly identified and resolved by 08:52 (GMT+2). Everything is now back to normal and fully operational. We’re sorry for the inconvenience and are taking steps to prevent this from happening again.
Report: "Platform has been unavailable"
Last update## **Summary** On May 14, 2025, Fasterize experienced a partial service disruption affecting a subset of customers. The issue was caused by a large-scale DDoS attack targeting a website accelerated by our platform. The incident lasted approximately 25 minutes, with service fully restored at 21:47. ## **Timeline (UTC+2)** * **21:22 – 21:30**: Our systems registered an abnormally high volume of requests: over 37 million in total, peaking at 350,000 requests per second. * **21:47**: Traffic stabilized and all services were back to normal. ## **What Happened** The DDoS attack overwhelmed several load balancers, leading to repeated restarts. Under normal circumstances, our failover system automatically routes traffic directly to the origin servers if a platform zone becomes unhealthy. However, the DNS health checks tied to certain zones were misconfigured. They continued to report the zone as healthy despite the outage, preventing failover from triggering correctly. ## **Impact** * **Severity Level:** 1 (Unplanned downtime affecting multiple production websites) * **Detection time:** 12 minutes * **Time to full recovery:** 25 minutes ## **What We're Doing** ### **Immediate fixes** * Corrected the failover configuration to ensure accurate health checks. ### **Short-term improvements** * Tuned load balancer settings for better resilience under high traffic. * Improved alerting on health check anomalies. ### **Medium-term improvements** * Increasing infrastructure redundancy to distribute traffic more effectively. * Evaluating native rate-limiting solutions to mitigate volumetric attacks.
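The failover mechanism described above only works if the health checks actually go red when a zone is down. As an illustration of the fail-closed behaviour that prevents this failure mode, here is a minimal sketch (the URL, thresholds, and names are assumptions for the example, not Fasterize's actual implementation): any timeout, connection error, or non-200 response counts as a failed probe, and several consecutive failures mark the zone unhealthy.

```python
import requests

# Hypothetical probe endpoint and thresholds, for illustration only.
HEALTHCHECK_URL = "https://eu-zone.example.com/__health"
TIMEOUT_SECONDS = 2
FAILURES_BEFORE_UNHEALTHY = 3


def probe_once() -> bool:
    """Return True only when the full stack answers 200 quickly.

    Any exception (DNS failure, TCP reset, timeout) or unexpected status
    code is treated as a failure, so the check fails closed.
    """
    try:
        response = requests.get(HEALTHCHECK_URL, timeout=TIMEOUT_SECONDS)
        return response.status_code == 200
    except requests.RequestException:
        return False


def zone_is_healthy(recent_results: list[bool]) -> bool:
    """Declare the zone unhealthy after N consecutive failed probes."""
    tail = recent_results[-FAILURES_BEFORE_UNHEALTHY:]
    return not (len(tail) == FAILURES_BEFORE_UNHEALTHY and not any(tail))
```

A check configured this way cannot keep reporting a zone as healthy during an outage, which is the behaviour that blocked failover in this incident.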
We experienced a service disruption caused by a Distributed Denial of Service (DDoS) attack. The issue has now been resolved. A full post-mortem will follow. Thank you for your patience and understanding.
One of our datacenters was unavailable between 21:21 and 21:47. We are investigating the incident.
Report: "Acceleration was disabled between 10:45 AM GMT+1 and 11 AM GMT+1"
Last update# Healthcheck in error Date of post mortem: 25/02/2025 Participants of the post mortem: Anthony Barré # Description of the incident Following a change in the Fasterize API, **the health checks of the platform turned red**, indicating an unavailability of the platform while it was actually healthy. These health checks were configured to query a URL associated with **a domain name that was no longer present** in the configuration database. ### **Immediate corrective actions** * **Temporary deactivation** of health checks. * **Rollback** of the API configuration change. # Facts & Timeline All times are UTC+1. # Analysis ## **Technical context** * The health checks validate the availability of the platform through several layers (proxy, load balancer, workers, etc.). * The failure of the health checks was due to errors at the proxy level, a symptom of the missing configuration for the domain used. * At the proxy level, the health check response is adapted by region to respond to the health checks associated with each region. * An outdated process, still running despite no longer being required, caused issues when it updated the configuration used by proxies with incorrect data. ## **Root cause analysis** The API configuration change caused an unstable state: the API update did not take effect everywhere at the same time, so some API pods were reading the health check configuration from one environment and others from another. This corrupted the database used by the proxies of our main environment. The proxies no longer responded to health check requests with a 200, but with an error code. ## **Impact of the incident** 🏠 **Affected customers**: **All customers** were impacted by the lack of acceleration. 🔴 **Specific issue**: **Two customers** experienced major disruptions because their origins did not allow them to receive traffic directly from the Internet. 📊 **Incident metrics** * **Severity**: **1** (service shutdown impacting a large number of users). * ⏱ **Detection time**: **4 minutes**. * 🛠 **Resolution time**: **20 minutes**. # Action Plan ## Short term (immediate - 1 week) * Review the platform's health checks so that they are more robust * Disable and delete the obsolete monitoring process ✅ * Fix the deployment of configmaps in the helm charts. ## Medium term * Review the API configuration to clarify the use of region-related fields and avoid side effects.
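Because the root cause was a health check bound to a domain that had disappeared from the configuration database, one simple guard is a pre-deployment assertion that every monitored domain still has a configuration. A minimal sketch, with an in-memory stand-in for the configuration database and hypothetical domain names:

```python
# Illustrative stand-in for the configuration database; in reality this
# would be a lookup against the real config store.
CONFIG_DB = {
    "www.customer-a.example": {"origin": "origin-a.example"},
    # the health-check domain is intentionally missing here
}

# Domains that the platform health checks are configured to query (assumed).
MONITORED_DOMAINS = ["healthcheck.eu.example.com"]


def assert_healthcheck_domains_configured() -> None:
    """Abort a deployment when a health-check domain has no configuration."""
    missing = [domain for domain in MONITORED_DOMAINS if domain not in CONFIG_DB]
    if missing:
        raise RuntimeError(
            "Refusing to deploy: health-check domains without a configuration: "
            + ", ".join(missing)
        )


if __name__ == "__main__":
    assert_healthcheck_domains_configured()
```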
Incident is now closed. Sorry for the inconvenience. A post-mortem will follow.
Fix has been deployed, everything is back to normal (the platform is properly monitored again). We're still monitoring.
Root cause has been identified. A post-mortem is being written. We're gonna set up more robust global platform health checks.
Acceleration was disabled between 10:45 AM GMT+1 and 11 AM GMT+1 due to a misconfiguration of the global platform health checks. These health checks are currently being reviewed (but the platform is actually healthy).
Report: "Minor Incident: Purging Image Metadata and CSS/JS Bundles"
Last updateWe encountered a minor issue affecting the purge of image metadata and CSS/JS bundles. While the purge process experienced some disruptions, this had no noticeable impact on website functionality. Pages and static files were purged correctly.
Report: "Pages were served without optimization"
Last updateTimeframe: 9:30 AM to 10:15 AM During this period, pages were served without optimization by the platform. There was no downtime, and traffic continued to be served seamlessly with original (non-optimized) pages. Cause: The issue was triggered by a recent release that unexpectedly introduced a high load on the platform. Resolution: We promptly identified the issue and rolled back the release to restore normal operations. We sincerely apologize for any inconvenience this may have caused and appreciate your understanding.
Report: "Platform unavailability caused by a DDOS attack"
Last updateThe platform experienced an outage from 10:13 AM to 10:20 AM (Paris timezone) caused by a DDoS attack. Traffic was automatically routed to origins from 10:17 AM to 10:20 AM. The attack volume was around 30x the normal traffic. Even though our anti-DDoS system blocked 97.6% of the attack, it was too heavy for our front servers.
Report: "Delay in CDN logs indexation"
Last updateThere is a delay in indexing CDN logs due to network maintenance. The CDN log indexing platform is behind, and the currently available logs date back to before Wednesday, June 5, 2024 at 1:45 PM. As a result, last night's log extractions for the June 5 logs are incomplete. This delay will be caught up in the next few hours. We will keep you informed.
Report: "Temporary Platform Unavailability"
Last update**Post Mortem**: Temporary Platform Unavailability **Event Date**: May 15, 2023 **Incident Duration**: 11:29 AM to 11:55 AM **Incident Description**: The platform experienced an outage from 11:29 AM to 11:55 AM. Traffic was automatically routed to origins. Customers therefore lost the benefit of the solution, but the sites remained available during the incident. The addition of a large number of configurations on the platform increased the memory consumption and the startup time of the front-layer services. Some services stopped and did not start correctly. **Event Timeline**: * 11:17 AM: Addition of new configurations. * 11:21 AM: Detection of a memory shortage on a service, leading to the shutdown of a critical process. * 11:34 AM: Additional services become unavailable. * 11:38 AM: Widespread detection of the incident; automatic traffic redirection. * 11:45 AM: Attempts to restart services, partially successful. * 12:00 PM - 12:15 PM: Assessment and decision-making on corrective actions. * 12:33 PM: Modification of startup configurations to improve tolerance to startup time. **Analysis**: Two main factors led to this incident: * Our HTTP server requires a reload to take new configurations into account. During this reload, the number of processes for this service is doubled, leading to a risk of memory exhaustion. * The start timeout for the HTTP service was set to the default value, and we did not have a monitor alerting us that the HTTP service start time was close to the limit. **Impact**: All users of the platform were affected by this incident. **Corrective and Preventive Measures**: * Short term: Review of alert systems and adjustment of service startup configurations. * Medium term: Improvement in configuration management to reduce their number and optimize service startup monitoring. * Long term: Researching alternative HTTP servers to improve update management without impacting performance or memory consumption. **Conclusion**: This incident highlights the importance of constant monitoring and proactive resource management to prevent outages. The measures taken should enhance the stability and reliability of the platform.
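Since the analysis points out that a configuration reload temporarily doubles the number of HTTP worker processes, a pre-reload guard can check that roughly one extra copy of the current workers fits in the available memory. The sketch below uses psutil; the process name and safety margin are assumptions for the example:

```python
import psutil

WORKER_PROCESS_NAME = "nginx"  # assumed name of the HTTP worker processes
SAFETY_MARGIN = 1.2            # keep 20% headroom on top of one extra copy


def reload_fits_in_memory() -> bool:
    """Check that duplicating the current workers would not exhaust memory."""
    workers_rss = sum(
        proc.info["memory_info"].rss
        for proc in psutil.process_iter(["name", "memory_info"])
        if proc.info["name"] == WORKER_PROCESS_NAME
        and proc.info["memory_info"] is not None
    )
    available = psutil.virtual_memory().available
    return workers_rss * SAFETY_MARGIN < available


if __name__ == "__main__":
    if not reload_fits_in_memory():
        raise SystemExit("Not enough memory headroom for a safe reload")
```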
The platform experienced an outage from 11:29 AM to 11:55 AM. Traffic was automatically routed to origins. Customers therefore lost the benefit of the solution, but the sites remained available during the incident.
Report: "Acceleration has been temporarily disabled."
Last updateWe had some issues on our European infrastructure. Now fixed. Acceleration was disabled, but traffic was OK.
Report: "Fasterize issue impacts website features"
Last updateThis incident has been resolved.
The rollback is done. We are still monitoring the platform.
Some customers notified us about broken features on their websites after 17:00 (Paris time). We correlated the incident with a release made this afternoon. We are rolling back the changes on our platform and expect the rollback to be completed around 19:15.
Report: "Performance degradation"
Last update# Description On Thursday, October 19th, between 4:55 PM UTC+2 and 6:25 PM UTC+2, the Fasterize European platform was unable to optimize web pages for all customers. The original version was delivered instead. We discovered that between 4:45 PM UTC+2 and 5:50 PM UTC+2, a specific request was made that caused a failure in the Fasterize engine during optimization and left the process in a non-functional state. The number of functional processes then decreased until it fell below a critical threshold. Our engine then automatically switched to a degraded mode where pages were no longer optimized and were served without delay. At 5:29 PM UTC+2, the on-call team manually added capacity to the platform to return to a stable state, but this did not fully resolve the situation. Starting from 6:15 PM UTC+2, the optimization processes gradually resumed traffic. The engine then returned to its normal mode of operation. To prevent any further incidents, the request has been excluded from optimizations and a fix for the optimization engine is being developed. ## Action plan **Short term:** * Fix the engine to optimize the responsible request without any crashes **Medium term:** * Review the health check system at the engine level to automatically restart non-functional processes
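The degraded mode described above can be pictured as a watchdog that counts healthy optimization workers, replaces the ones that crashed, and switches to pass-through delivery when the pool falls below a critical threshold. This is only an illustrative sketch with assumed pool sizes, not the Fasterize engine itself:

```python
import multiprocessing as mp
import time

WORKER_COUNT = 8   # assumed pool size
MIN_HEALTHY = 4    # assumed critical threshold


def optimize_forever() -> None:
    """Placeholder worker loop; a crash here removes one healthy worker."""
    while True:
        time.sleep(1)


def supervise(workers: list[mp.Process]) -> None:
    while True:
        healthy = sum(1 for worker in workers if worker.is_alive())
        if healthy < MIN_HEALTHY:
            # Degraded mode: deliver pages without optimization, no queuing.
            print("Degraded mode: serving original pages without optimization")
        # Replace dead workers so the pool can recover on its own.
        for i, worker in enumerate(workers):
            if not worker.is_alive():
                workers[i] = mp.Process(target=optimize_forever, daemon=True)
                workers[i].start()
        time.sleep(5)


if __name__ == "__main__":
    pool = [mp.Process(target=optimize_forever, daemon=True) for _ in range(WORKER_COUNT)]
    for worker in pool:
        worker.start()
    supervise(pool)
```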
This incident has been resolved at 18h25 (Paris time). A post mortem will follow.
We're monitoring the results and everything looks fine. The issue seems to be related to a schema change in a storage component (to be confirmed after the RCA).
We have mitigated the issue. Performance is back to normal. Still investigating for the root cause.
We currently have some issues on our European infrastructure. Being fixed. Slight impact on acceleration. Some pages may experience slowdowns. Some optimizations are disabled.
Report: "Traffic temporarily redirected to origin"
Last updateDuring the migration of the website configuration database, traffic was redirected to the customers' origin web servers for 32 minutes. For the majority of websites, the traffic was correctly served by the origin. However, for a few websites, the origin failed to do so due to its configuration. ## Facts and Timeline All times are UTC+2. * 3:30pm: Migration starts from the legacy database to the new database, after several days of testing on a fraction of the traffic in production and staging environments. * 4:02pm: The platform health check status gradually moves from 100% to 0%. Traffic is automatically routed to the customers' origin. * 4:03pm: An alert indicates that some health checks are red. The alert is immediately taken into account by our tech team, and a crisis team is set up. * 4:14pm: The root cause is detected in the new database: the record holding the information necessary for the health check response is incorrect. * 4:35pm: Health check configurations are changed to allow traffic to return to the platform. The incident ends for clients. * 5:45pm: The missing record is fixed in the database. * 10/04: Health checks are set back to their original settings. ## Analysis On October 3, 2023, a new database holding website configurations was deployed. During the deployment, the platform health checks switched to an unhealthy state. Platform health checks consist of multiple monitors sending requests to the platform at regular intervals to validate that all layers in the platform are functional. When these requests fail, traffic is automatically routed to the customers' origin. After the migration, the health checks received 521 errors (meaning that the relevant configuration for a given requested domain was not found). The issue occurred because the deployment introduced a change in the logic involved in config loading. In the previous release, a request from the health checks was satisfied even if no configuration matched. In the current version, this is not possible. To quickly fix the issue, we created a configuration for health checks. This issue was not detected in our testing phases for the following reasons: * no alert is configured for health checks in our staging environment; * the health checks are not correctly covered by automated testing. By design, redirecting browser traffic to the origin when the platform is considered down is correct. However, we are seeing more and more cases where the origin cannot accept the traffic sent by browsers, for reasons such as firewalls or incorrect certificates. We will improve our API to manage these edge cases. # Impacts * Number of customers impacted: all # Counter measures ## Short term 1. Set up alerting on our staging environment for health checks 2. Add a test covering the health check routes ## Medium term * Design a way to avoid origin failover when the origin doesn’t support it
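For the "origin failover when the origin doesn't support it" problem mentioned in the counter measures, a pre-flight check can simulate what browsers do in failover mode, i.e. open a TLS connection straight to the origin with the public hostname as SNI. A minimal sketch (hostnames are placeholders; not Fasterize's implementation):

```python
import socket
import ssl


def origin_can_take_direct_traffic(origin_host: str, site_host: str) -> bool:
    """Simulate a browser whose DNS now points straight at the origin.

    Returns False when the origin refuses the connection (firewall) or
    presents a certificate that does not match the public hostname,
    which are the cases where origin failover should not be enabled.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((origin_host, 443), timeout=5) as sock:
            with context.wrap_socket(sock, server_hostname=site_host) as tls:
                return tls.version() is not None  # handshake and cert check OK
    except OSError:  # covers ssl.SSLError (bad certificate) and network errors
        return False


if __name__ == "__main__":
    print(origin_can_take_direct_traffic("origin.example.net", "www.example.com"))
```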
This incident has been resolved as of 16:35. A postmortem is being written and will be available tomorrow.
The situation is restored; all traffic is now routing through Fasterize as expected. A postmortem of the incident will be provided shortly.
Traffic is temporarily redirected to the origin of customer websites; the situation will be restored very soon.
Report: "Unavailability of caches due to a DDOS attack"
Last updateA DDOS attack made our cache layer for pages unavailable between 11:38 and 11:44.
Report: "We are facing intermittent DDOS attacks"
Last updateThis incident has been resolved.
At 5:05pm (CEST) and 5:20pm (CEST), we saw performance degradation of our front servers due to multiple DDoS attacks. Our network protections are mitigating them, and we are adding several other rules in order to block them all. If our protection is not sufficient, we will reroute the impacted domains' traffic to their origin, bypassing Fasterize.
Report: "Fasterize API unavailable"
Last updateThis incident is now resolved.
The fix is now deployed for the purge API. Everything is back to normal but we're still monitoring the results.
We are continuing to work on a fix. Dashboards are now ok. Purge API still WIP.
The issue has been identified and is being fixed. ETA < 1h
We currently have some issues on our API servers. Being investigated. It might be impossible to access dashboards, update rules or flush the cache. No impact on acceleration.
Report: "Issue on the image display"
Last updateThe rollback completed at 5:11 pm. We will conduct an investigation and improve our tests suite to improve coverage on the impacted feature. We are sorry for the inconvenience.
We are investigating an issue with images not being displayed correctly since 2pm. The issue seems related to the feature “Improve CLS with img dimensions attributes”. A rollback of the changes introduced this afternoon on the engine is ongoing. As a temporary workaround, we disabled this feature on the impacted websites.
Report: "Configuration issues"
Last update# Error 521 for some customers Date: 04/05/2023 # Description of incident Some customers have experienced 521 errors on resources. The Fasterize error 521 corresponds to a configuration that is not found in the engine. After an engine update, some proxies failed to load some configurations in V2 format. # Facts and Timeline * **16h38**: Launch of the engine update after validation on the staging environment and then in canary mode * **17h10**: First alert: high proxy error ratio detected. * **17h12**: 521 errors start to appear. * **17h27**: The technical team turns off the problematic proxies. * **17h30**: Traffic is back to normal. * **17h36**: Publication of a message on StatusPage * **17h56**: Worker rollback is triggered * **18h25**: The technical team fixes the issue by returning to the previous version of the engine. # Analysis On February 15, 2023, the deployment of the website-config package (4.14.1) changed the JSON schema used for client configurations in order to introduce a new key. This change should not have been included in the package because the feature was not finished. The new version of the website-config package moved this new key to another location in the JSON schema. During deployment, the deletion of the key previously and incorrectly introduced in the validation schema had the effect of invalidating all V2 configs containing this key. However, this key was added automatically by the API if it was not present. A mechanism to load a configuration even if it is not valid was introduced during the update, but it did not work. When processing requests associated with unloaded configurations, the engine responded with a 521 error. The fallback mechanism at the front level mitigated the problem at the cache layer: a second attempt on another proxy is triggered in the event of a 521 error. However, the return-to-origin system is not in place for 521 errors (to prevent the discovery of configurations). The message for 521 errors is not clear enough and should render a page like the one used for 592 or 594 errors. Regarding the rollback, retrieving the commit corresponding to version N-1 was not easy. The rollback was not possible via the CI because it took too long to execute, and it was therefore executed on a developer workstation. # Metrics ## Error 521 * a first peak around 5:05 p.m. (which triggered the alert) * from 5:10 p.m. to 5:30 p.m., a large number of 521 requests/s is observed (charts not reproduced here: share of 521 errors in all traffic, over the duration of the incident, and for impacted customers only) # Impacts * Number of customers impacted: 12 sites (< 2%) * Percentage of requests impacted across all customers * Maximum: 1.5% * Over the duration of the incident: 0.32% * Percentage of requests impacted on impacted customers * Maximum: 7.3% * Over the duration of the incident: 1.54% # Counter measures ## Short term 1. Fix the engine and the faulty package to remove the breaking change 2. Secure V2 config schema validation changes 3. Enable fallback to origin on 521 errors ## Medium term 1. Set up a system for migrating V2 configs from one schema version to another. 2. Improve some documentation (rollback, release) 3. Improve internal crisis organization 4. Add an extra step to run a canary phase with actual production traffic before triggering the rest of the update
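The "load a configuration even if it is not valid" mechanism that failed here can be sketched as a validate-then-fall-back loader: when a schema change invalidates a stored configuration, keep serving the last known-good version instead of leaving the domain without a configuration (which is what produced the 521 errors). Illustrative only; the schema and config shapes below are assumptions:

```python
from jsonschema import ValidationError, validate

# Assumed, simplified schema for the example.
CONFIG_SCHEMA = {
    "type": "object",
    "properties": {"origin": {"type": "string"}},
    "required": ["origin"],
}


def load_config(candidate: dict, last_known_good: dict) -> dict:
    """Prefer the new config, but never leave the domain unconfigured."""
    try:
        validate(instance=candidate, schema=CONFIG_SCHEMA)
        return candidate
    except ValidationError as err:
        # Log and keep the previous version rather than answering 521.
        print(f"Invalid config, keeping previous version: {err.message}")
        return last_known_good


if __name__ == "__main__":
    good = {"origin": "https://origin.example.com"}
    broken = {"unexpected_key": True}  # e.g. invalidated by a schema change
    print(load_config(broken, good))
```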
The rollback has been completed. We will publish a postmortem tomorrow (5th of May) to clarify the root cause of the incident.
No errors have been generated since 5:42pm (UTC+2), and errors have dropped by 90% since 5:30pm. Impacts were limited to websites using config version 2.
The issue has been identified. A mitigation has been deployed to replace faulty proxies. The root cause is being diagnosed.
Some proxies have loaded invalid configurations and cannot process incoming requests. We are currently investigating this issue.
Report: "Purge issue on CDN via Dashboard and API"
Last updateBetween 10:00 AM (GMT+1) and 12:45 (GMT+1), the purge on the CDN was not working as expected. We found the issue and applied a fix on the API. This is now resolved. Sorry for the inconvenience.
Report: "Performance degradation"
Last updateStarting from approximately 7am (CEST) until 3pm (CEST), we saw performance degradation of our engine due to a defect in the autoscaling process. Some users/websites might have noticed a slight slowdown on non-cacheable objects, on the order of a couple hundred milliseconds. This is now resolved. Sorry for the inconvenience.
Report: "Security implementation causes 429 errors"
Last updateThis incident is now closed. Sorry for the inconvenience.
The fix has been rolled out. There are no more 429 errors.
Several requests returned a 429 status code during the rollout of a security measure (rate limiting) intended to protect the platform against DDoS attacks. We are currently fixing the issue.
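Rate limiting of this kind is often implemented as a token bucket; the sketch below (illustrative, not the actual protection) shows the mechanism and why a rate set too low for legitimate traffic results in spurious 429 responses:

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter: allow a request when a token is left,
    otherwise the caller should answer HTTP 429."""

    def __init__(self, rate_per_second: float, burst: int) -> None:
        self.rate = rate_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    # A limit far below real traffic (here 2 req/s) rejects legitimate requests.
    bucket = TokenBucket(rate_per_second=2, burst=2)
    print([bucket.allow() for _ in range(5)])  # the last calls are denied (429)
```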
Report: "Acceleration is disabled."
Last update# Incident description This incident relates to an infinite loop on our platform caused by an incorrect origin configuration. **Impacts** Severe slowdowns on customer sites due to platform saturation. **Timeline** Start of the incident: 17:08. End of the incident: 17:40. # Facts and Timeline **17:08-17:43**: Several disruptions occurred on the platform at roughly 10-minute intervals. Each disruption lasted about 180 seconds. **17:14**: First internal alert raised about the problem. **17:27**: The team identifies the origin of the incident. An emergency meeting is immediately set up with the technical team. **17:31**: First corrective actions to mitigate the problem. **17:35**: Update of the [public platform status](https://status.fasterize.com/incidents/lh51wzsdvs1n) ([statuspage.io](https://status.fasterize.com/incidents/lh51wzsdvs1n)). **17:40**: Resolution of the problem. # Metrics * Incident severity level: * Severity 2: site degradation, performance problem and/or broken feature that is difficult to work around, impacting a significant number of users * Detection time: **6 minutes** * Resolution time: **32 minutes** # Analysis A configuration was incorrectly set up at the origin level: the origin of the configuration pointed to Fasterize instead of pointing to the hosting provider. The protection against infinite loops present on the platform did not work. This saturated the platform and produced severely degraded response times. The automatic platform stability detection repeatedly detected the unavailability. However, these instabilities were triggered at regular intervals, so websites were routed to their origin and then routed back to Fasterize at the end of each loop. # Action plan **Short term:** * Fix the API to better validate the origin and prevent an origin from pointing to Fasterize * Fix the detection of infinite loops on the request path **Medium term:** * Improve the platform protection system with rate limiting.
This incident has been resolved at 17:42. We will provide a post mortem tomorrow (16/02/2023).
The issue is identified. It was related to an attack that has been mitigated.
We currently have some issues on our European infrastructure. Being fixed. Acceleration is disabled, but traffic is OK.
Report: "Performance degradation"
Last updateThis incident has been resolved.
Fix has been deployed, acceleration is back to normal. We're still monitoring. Sorry for the inconvenience
Issue has been identified. Fix is being deployed. ETA: 11:50 UTC+1
We currently have some issues on our European infrastructure. Being fixed. Slight impact on acceleration. Some pages may experience slowdowns.
Report: "Errors 502 from the platform"
Last updateThis incident has now been resolved by progressively deploying the updated module and monitoring the results over the last few days. Reminder: before that, a workaround had been deployed to mitigate the errors.
The fix has been deployed. We have identified the source of the problem, which is an external integration provided by a partner. We will work with our partner to update/fix the offending module.
A workaround is being deployed. However, while we have identified the cause of the 502 increase, we have yet to determine the root cause. We're actively investigating this topic.
We are continuing to work on a fix for this issue.
We are currently experiencing a platform issue. We have noticed an increase in the number of TCP resets over the last ~10 days on our load balancer. This issue has generated an increase in the level of 502 errors; within the last hours, it represents 0.03% of the total number of requests. Our technical team is working hard to resolve this critical issue and will update the status of the debugging steps until full resolution.
Report: "Slight slow down"
Last updateThis incident is now resolved. A post-mortem will follow in the next few days.
We are still seeing increased optimization wait times, but they will decrease until the patch is fully deployed.
From 9:54 GMT+2 until 10:25 GMT+2, we had a slight disruption in the optimization process, causing some pages to slow down. We apologize for the inconvenience. The problem is now under control and we are investigating to understand the root cause of the incident.
Report: "Issues on image optimization"
Last updateIncident is now closed. Sorry for the inconvenience.
Fix is being deployed. We're still monitoring.
We currently have some issues on our European infrastructure: some images are not resized or optimized anymore. Being fixed. Expected impact on image delivery (images delayed and/or larger than expected).
Report: "Degraded performance of our API"
Last updateThe issue has been resolved, the API and dashboard are now fully operational.
Issue has been identified and we are currently monitoring it
We are currently investigating a degradation of performance of our API that may affect the availability of our dashboard.
Report: "Error 504 from the platform"
Last updateBetween 2022/05/23 18:08:54 GMT+2 and 2022/05/23 18:12:54 GMT+2, the platform was unreachable due to a DDoS attack targeting the platform. Traffic was progressively transferred to the origins. Customers may have encountered 504 errors (unreachable origin) during this outage. We are sincerely sorry for this incident and are planning mitigations to limit the impact of such attacks.
Report: "JS errors following engine update"
Last update# Javascript errors issue Date of the post mortem: 27/01/2022 Participants in the post mortem: * the whole technical team * Yahia # Incident description Following an update of the optimization engine affecting the rule engine tied to configurations, the pages of some customers were broken. # Facts and Timeline **24/01 to 26/01**: Part of production traffic is moved to the new version (canary mode) **26/01 15:57**: Release of the engine update for the library responsible for handling the rules around customer configurations **26/01 18:21**: A support ticket reports a JS issue. **26/01 18:43**: End of the release **26/01 21:42**: New support ticket **27/01 08:09**: New support ticket **27/01 09:10**: The support tickets are picked up and linked to the previous day's release **27/01 09:15**: Reply to the support tickets to indicate that the investigation has started **27/01 09:23**: New support ticket **27/01 09:21**: Public declaration of an incident **27/01 09:37**: Opening of an incident-management video call **27/01 09:40**: Decision to roll back the workers **27/01 09:41**: Update of the ongoing incident **27/01 10:08**: End of the worker rollback **27/01 10:48**: Message on the status page indicating the return to normal # Analysis The update aimed to move the engine to a new major version of the configuration-management library in order to be able to read V2 configurations. The library had been entirely rewritten with a different interface to handle the execution-context system specific to V2 configurations, which differs from the exclusion system specific to V1 configurations. The update plan was to keep the same compatibility with the V1 configurations currently used in production by all customers. The bug introduced was in the configuration-management library: when evaluating a rule with a blacklist, the code had a side effect that disabled the rule for subsequent calls. This caused problems with deferjs because some scripts were no longer deferred. The update followed the usual validation process: unit and functional tests were green on the pre-production environments, and after the release all engine metrics looked good. During the release, no statistic surfaced the problem. The JS error count metric did not surface the problem; it was not reliable and stayed at its usual level. Although customers reported the problem via support tickets, no action was taken because the tickets did not trigger the on-call process outside business hours. The release was triggered too late in the day given the level of risk and finished at the end of business hours, so the Fasterize team was no longer available in case of an incident. There was no manual browsing after the release to detect a browser-side Javascript problem; this might have allowed the problem to be detected quickly. # Metrics Severity 1: unplanned site outage affecting a significant number of users * Detection time: 17 hours * Resolution time: 50 minutes # Countermeasures ## Actions during the incident * Rollback of the workers' AMIs * Flush of the page cache for top customers and for customers who had filed support tickets
# Action plan **Short term:** * Fix the library and add a functional test * Review the Javascript error metric and create an alert on it * Automatic message if an urgent ticket is sent to [support@fasterize.com](mailto:support@fasterize.com) outside business hours **Medium term:** * **Improve the release procedure according to the level of risk (normal or high). Releases with a high risk level will only be performed on Tuesday, Wednesday or Thursday mornings, with prior external communication.** **Long term:** * Study the feasibility of two-stage releases: first updating an environment serving customers' pre-production sites, then updating an environment serving customers' production sites.
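The bug class described in the analysis, a blacklist check with a side effect that disables the rule for subsequent calls, can be reconstructed with a small sketch (hypothetical classes, not the actual configuration library):

```python
class BuggyDeferJsRule:
    """Reproduces the bug class: checking the blacklist mutates the rule."""

    def __init__(self, blacklist: list[str]) -> None:
        self.blacklist = blacklist
        self.enabled = True

    def applies_to(self, script_url: str) -> bool:
        if any(pattern in script_url for pattern in self.blacklist):
            self.enabled = False  # side effect: the rule stays off afterwards
        return self.enabled


class FixedDeferJsRule:
    """Pure check: the blacklist only affects the current request."""

    def __init__(self, blacklist: list[str]) -> None:
        self.blacklist = blacklist

    def applies_to(self, script_url: str) -> bool:
        return not any(pattern in script_url for pattern in self.blacklist)


if __name__ == "__main__":
    buggy = BuggyDeferJsRule(["tracker.js"])
    buggy.applies_to("https://cdn.example/tracker.js")      # blacklisted request
    print(buggy.applies_to("https://cdn.example/app.js"))   # False: stuck off
    fixed = FixedDeferJsRule(["tracker.js"])
    print(fixed.applies_to("https://cdn.example/app.js"))   # True: still deferred
```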
Incident is now closed, sorry for the inconvenience. A post-mortem will follow.
A rollback on the engine has been deployed in order to fix the issue. If you still see any JS errors, please flush your cache on the dashboard : https://fasterizehelp.freshdesk.com/en/support/solutions/articles/43000456842-how-to-flush-the-fasterize-cache- .
Several customers have been experiencing errors from JS scripts since 26/01/2022 at 6pm, following an engine update. We are currently investigating the issue and applying a rollback.
Report: "Traffic was slightly slowed down during 1 minute"
Last updateAt 9:51 PM, some of our fronts were saturated by an unexpected spike in traffic. This was quickly resolved once new machines were ready to take some of the traffic. Some sites may have seen a slight decrease in pageviews for a few minutes.
Report: "Maintenance Apache Log4j2 Remote Code Execution (RCE) Vulnerability"
Last updateOur log processing stack has been patched to avoid the Apache Log4j2 Remote Code Execution (RCE) Vulnerability.
Report: "Logging / Monitoring incident"
Last updateThe incident is resolved.
Since 5pm, we have some issues on our logging / monitoring infrastructure. Being fixed. The impact is missing logs between 5pm and 12pm on 10/12/2021.
Report: "Cache purge timeout on dashboard"
Last updateThis incident has been resolved, sorry for the temporary inconvenience
We are continuing to investigate this issue.
We have an issue regarding the cache purge via the dashboard, which is currently under investigation. The cache purge via the API is still available. For example, to purge the whole cache, please refer to https://support.fasterize.com/fr/support/solutions/articles/43000620943-purger-tous-les-fichiers-du-cache
Report: "Logs delivery suspended"
Last updateThis incident is now resolved; we've pushed most of the missing logs to the configured locations. Some logs may still be missing. If you don't see your logs, please contact support.
The logs are finally back, and the missing logs for the beginning of the day will be delivered soon. Sorry for the inconvenience.
Due to logs cluster issues, logs have not been delivered since Saturday, Nov. 6th, 1 AM. We're working on a fix and expect everything to be back to normal in a few hours.
Report: "Increased processing times in the engine"
Last updateThe issue was resolved at 8:15 AM. The new services responsible for optimizations had a configuration anomaly that prevented them from starting normally. Traffic is now fully optimized again.
We are investigating an increase in processing times in the engine. Traffic is partially unoptimized or served with degraded response times.
Report: "Fasterize dashboard unavailable"
Last updateThis incident has been resolved.
A fix has been implemented and everything is now back to normal. Still monitoring.
We currently have some issues on our dashboard. Being fixed. The API is OK, so you can still update rules or flush the cache that way. No impact on acceleration.
Report: "Batch jobs not launched"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented, jobs are now resuming and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Batch jobs that usually launch at night were not launched as expected. This includes log exports to partners or customers and cache warming. CDN, acceleration, API, and dashboard are not affected. Some customers may have pages that are not cached as expected. The team is working to fix this.
Report: "Logs cluster failure"
Last updateDue to logs cluster issues, we have lost some logs from Friday, July 23 to Tuesday, July 27, 2021. Our logs cluster had been having performance issues since Sunday 16:30 UTC. Some log indices experienced a sharding issue that corrupted the indexes of July 23 and 24, 2021. Indexes from Sunday, July 25 to Tuesday, July 27, 9:00 UTC may be incomplete. We deleted the corrupted indexes, which allowed us to remove the unhealthy nodes and rebalance the cluster. Logs are now fully operational. We are sorry for the logs lost during this incident.
Report: "Performance degradation"
Last update# Description On 04/27/21 at 4:45 pm, our workers became unavailable following a maintenance operation on the database managing the engine configuration and the client configurations. The dashboard and the APIs became very slow or even unavailable, and generated errors. The database maintenance operation aimed at updating its configuration code. Some pages of the clients' sites were slowed down, especially the non-cached HTML pages and the SmartCache fragments. The SmartCache pages caused redirects to the non-cached versions. # Facts and timeline * At 16:33, end of maintenance, the worker update is triggered. The functional and API tests are OK (but not reliable, because the tests are still run against the old workers). * From 16:34, service discovery no longer finds the new instances of the configuration database. * From 16:35, no async workers are seen as available by the proxies. * From 16:45, no sync workers are seen as available by the proxies. * 16:55: alert on the sync workers; the proxies open the circuit breaker and bypass the workers; most pages are no longer optimized. * 17:10: restart of service discovery; the configuration database service is back in service discovery. * 17:13: communication of the incident on Statuspage. * 17:23: launch of a new update of the workers. * 17:30: end of the worker unavailability. * 17:35: communication of the end of the incident on Statuspage. * 17:38: restart of the API services to restore normal service on the dashboard. # Analysis From 16:35, following the migration of the configuration database, renaming the new machines with the names of the old ones generated a conflict in service discovery, which ejected the new machines. As a result, the internal DNS of service discovery for the configuration database service no longer responded with IP addresses. On the new workers discovering this service through service discovery, the worker service kept waiting for the configuration files to be retrieved and did not start. Restarting the service discovery agents was enough to restore the situation; this restart took place with the update of all workers. In addition, the unavailability of the configuration database also impacted the old API still used by the dashboard (retrieval of the connection status). This old API, directly connected to the configuration database, no longer had access to the configurations for the same reason and generated very long response times on the dashboard. A restart of this API was enough to force it to redo a DNS resolution via service discovery and to recover the connection to the configuration database.
# Metrics * Incident severity level: * Severity 2: degradation of the site, performance problem and/or broken feature that is difficult to work around, impacting a significant number of users * Detection time: 10 minutes (16:45 ⇢ 16:55) * Resolution time: 45 minutes (16:45 ⇢ 17:30) * Duration of the incident: 45 minutes # Impacts * ⅔ of pages were not optimized; ⅓ of pages were slowed down (500 ms timeout) * One customer ticket about redirects due to SmartCache errors # Countermeasures ## Actions during the incident * service restarts # Action plan ## Short term: * Review whether the circuit breaker settings in the proxy are correct (⅓ of tasks were sent to brokers while the proxies' circuit breakers were open) * Alert when a service, or a percentage of a service's nodes, is down in service discovery * Update the service discovery documentation * Statuspage: do not auto-close maintenances
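One of the short-term actions is alerting when a service has no healthy node left in service discovery. A minimal sketch of such a check (the service name is an assumption for the example): resolve the service's internal DNS name and alert when no addresses come back, which is exactly the state the workers were stuck in.

```python
import socket

CONFIG_DB_SERVICE = "config-db.service.consul"  # assumed internal service name


def resolved_instances(service_name: str) -> list[str]:
    """Return the addresses currently advertised for the service, if any."""
    try:
        infos = socket.getaddrinfo(service_name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    instances = resolved_instances(CONFIG_DB_SERVICE)
    if not instances:
        print(f"ALERT: {CONFIG_DB_SERVICE} has no instance in service discovery")
    else:
        print(f"{CONFIG_DB_SERVICE}: {len(instances)} instance(s) advertised")
```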
This incident has been resolved.
A fix has been deployed and we're monitoring the result.
Issue has been identified and a fix is being deployed. ETA 5-10 min
Starting from 4:45pm UTC+2, we currently have some issues on our european infrastructure. Being fixed. Some impact on acceleration. Some pages can have some slowdowns (<500ms)
Report: "Acceleration is disabled."
Last update# Description On 16/03/2021, between 18:45 and 19:20, the entire Fasterize platform experienced slowdowns, with possibly high response times regardless of the site. Between 18:49 and 18:59, the platform automatically switched the traffic to the customer origins to ensure the continuity of traffic.
From 18:58, new machines were added and started to take traffic to mitigate the impact until the root cause was found. At 19:00, traffic was again routed to the Fasterize platform and only a few requests were slowed down. At 19:18, the cause was identified, leading 10 minutes later to the blocking of an IP address whose requests were overloading the platform. # Facts and timeline * From 8:23 am, increase in the number of requests to large files (> 1 GB) on a client site * From 2 pm, second increase in the number of requests to these large files * 18:49, alert on a high number of requests cancelled by users (499) * 18:50, Route 53 alert on the availability of our fronts; traffic is automatically rerouted to the client origins * 18:55, addition of machines to the pool * 18:58, the new machines respond to traffic * 18:59, the platform is again seen as up; traffic is rerouted to Fasterize, with some slowdowns * 19:18, disconnection of the client site after identification of the traffic overload * 19:20, back to normal (no more slowdowns) * 19:27, blocking of the offending IP address # Analysis From 8:23 am, a server hosted on GCP started several hundred requests for large files transiting through our proxies (XML files > 1 GB). Until then, this server had made a few dozen requests per day. The bandwidth on the fronts and proxies increased progressively throughout the day (up to 2.5x compared to the previous day and the previous week). Starting at 6:45 pm, overall response times started to degrade without more bandwidth being used. This can be explained by the sudden increase in the load of the front-ends, which until then had been stable. The increase in load remains unexplained at this time. # Metrics * Incident severity level: * Severity 2: degradation of the site, performance problem and/or broken feature that is difficult to work around, impacting a significant number of users * Detection time: 5 minutes (18:45 ⇢ 18:49) * Resolution time: 35 minutes (18:45 ⇢ 19:20) * Duration of the incident: 35 minutes # Impacts * Automatic disconnection of all customers for 10 minutes. * No support tickets * Manual disconnection of a few sites by a customer # Countermeasures ## Actions during the incident * Addition of front-ends * Disconnection of the offending website * Blocking of the offending IP address # Action plan [ ] planned, [-] doing, [x] done ## Short term: * [x] Adjustment of the alerts on bandwidth * [-] Adjustment of the alerts on ping-fstrz-engine * [-] Detection of the largest objects in order to bypass them ## Medium term: * [ ] Rate limiting on large objects
Incident is now closed. Sorry for the inconvenience. A post-mortem will follow.
Fix has been deployed, acceleration has been enabled and everything is back to normal. We're still monitoring.
Issue has been identified. Mitigation is being deployed.
We currently have some issues on our European infrastructure. Being fixed. Acceleration is disabled, but traffic is OK.
Report: "Slight degradation of performance"
Last updateThis incident has been resolved.
Starting from 8:40pm UTC+2, we're seeing slightly degraded performance for optimized pages (~ +200ms on response time). Cached and optimized objects are not affected. We are currently working on a fix and starting to deploy it in the next couple of minutes.
Report: "Logging / Monitoring incident"
Last updateEverything is now ok. Some logs may still be missing from 8AM UTC to 11AM UTC, sorry for the inconvenience.
Log delivery is now OK. Some delays may occur, and some logs between 9:30 UTC and 14:30 UTC may be missing. We're monitoring until it's completely resolved.
We currently have some issues on our logging infrastructure. It's being actively fixed but there are impacts on log delivery. No impact on acceleration.
Report: "Intermittent errors on Europe platform"
Last update# Description Between 8:53 and 9:52 am (UTC+2), the front layer was overloaded following the failed automatic restart of some machines. During this period, only a limited number of machines handled the traffic. Traffic was rerouted for only a few minutes, despite availability probes reporting the unavailability of the platform. # Facts and timeline * 6:30 am: automatic renewal of Let's Encrypt certificates * 6:30 am: start of the machines for the day * 6:35 am: the fronts fail to start and are still not available on the load balancer * 8:27 am: first bandwidth alert, automatically resolved at 8:32 am * 8:45 am: second bandwidth alert, automatically resolved at 8:51 am; the team tries to start new machines * 8:51 am: third bandwidth alert, automatically resolved at 9 am * 8:53 am: the front layer begins to be overloaded at the network level * 9:01 am: first ticket to the support desk * 9:19 am: global availability alert => global disabling of Fasterize * 9:20 am: identification of the issue on the defective machines * 9:25 am: first communication on [status.fasterize.com](http://status.fasterize.com) * 9:35 am: first attempt to deploy a fix; it fails * 9:45 am: second start of the fix deployment * 9:52 am: end of deployment and restart of the machines # Analysis Every day, the Fasterize infrastructure is adjusted to the traffic, and machines are switched off or started up accordingly. At 6:30 am, the front machines were started normally but the HTTP service did not start correctly. Starting the HTTP service was made impossible by a configuration problem related to the automatic renewal of Let's Encrypt certificates: the HTTP service no longer had access to the private key of the certificates renewed during the night and refused to start.
This access was made impossible because of the new certificate renewal mechanism involving different rights on certificates and private keys and following the automatic renewal performed at 06:30. The load-balancer did see the machines started but saw them as unhealthy. So the remaining machines took all the traffic and started to be overloaded after reaching their maximum capacity. The CDN layer therefore had trouble reaching the origin and caused 50x errors. The availability of the optimization layer measured by the external probes shows unavailability from 8:52 am while the global availability probes show unavailability only for 3 minutes from 9:18 am. The customer origins are connected to the global probes \(including CDN and optimization layer\) and therefore were not rerouted from the beginning of the incident. The global availability probes were set up with a lower sensitivity and as some traffic continued to pass, they did not detect the same unavailability. The alerts raised on call concerned the excessive network traffic but not the 504 errors because the average error rate of 504 did not exceed the classic thresholds that we use. No alert was raised on the non-availability of the HTTP service of the fronts. # Metrics * Incident Severity Levels : * Severity 2: site degradation, performance problem and/or feature broken with difficulty to bypass impacting a significant number of users. * Detection time: 2h \(from the start of the edges\) * Resolution time: 3h * Duration of the incident: 60 minutes # Impacts Over the duration of the incident, 50x errors accounted for 10.77% of the HTML page traffic, 3.52% of the non-cached traffic and 1.15% of the total traffic. At the peak of the incident \(9:17 a.m.\), these rates rose to 38.7%, 16.3% and 5.5% respectively. Eleven customers reported errors via support. # Action plan \[ \] planned, \[-\] doing, \[x\] done **Short term :** * \[-\] Modification of the Let's Encrypt certificate synchronization mechanism * \[ \] Improve the feedback \(logs and alerts\) in case of problems during renewal and/or synchronization. * \[x\] Correct the sensitivity of the availability probe at the global level * \[x\] Organization: systematic disconnection of the platform in the event of an incident impacting all customers. * \[-\] Test manual disconnect in a staging environment * \[x\] Review the alert thresholds on the 504 seen by Cloudfront. * \[-\] Add an alert on the availability of HTTP service for the fronts. * \[ \] Organization: improve response time before publication of an incident **Medium term :** * Improve the resilience of the fronts against an invalid/absent SSL certificate. **Long term :** * Review of the SSL certificate management system.
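To illustrate the short-term items above (alerting on the fronts' HTTP service and on certificate renewal problems), here is a minimal monitoring sketch. The certificate path, service user and front host name are assumptions for the example; this is not Fasterize's actual tooling.

```python
# Minimal sketch, assuming a Let's Encrypt layout and a single front to probe.
# It checks the two failure modes described in the post-mortem: the renewed
# private key is no longer readable by the HTTP service's user, and the front
# no longer answers TLS requests.
import os
import pwd
import socket
import ssl
import sys

CERT_KEY = "/etc/letsencrypt/live/example.com/privkey.pem"  # hypothetical path
HTTP_SERVICE_USER = "www-data"                               # hypothetical user
FRONT_HOST, FRONT_PORT = "front-01.internal", 443            # hypothetical front

def key_readable_by(path, username):
    """Rough readability check based on ownership and mode bits."""
    st = os.stat(path)
    user = pwd.getpwnam(username)
    if st.st_uid == user.pw_uid:
        return bool(st.st_mode & 0o400)
    if st.st_gid == user.pw_gid:
        return bool(st.st_mode & 0o040)
    return bool(st.st_mode & 0o004)

def tls_alive(host, port, timeout=3.0):
    """Check that the front completes a TLS handshake (liveness only)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # liveness check, not certificate validation
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    ok = key_readable_by(CERT_KEY, HTTP_SERVICE_USER) and tls_alive(FRONT_HOST, FRONT_PORT)
    sys.exit(0 if ok else 1)  # a non-zero exit code can feed the alerting system
```

Run periodically on each front, a failing exit code would have surfaced the unhealthy HTTP service well before the bandwidth alerts did.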
This incident is now resolved. A post-mortem will follow. We are sorry for this incident and for the impact on your customers.
The fix is now deployed. Traffic is accelerated again.
The issue has been identified and the fix is being applied
We're currently investigating an issue causing intermittent 50x errors.
Report: "Optimisations temporarily disabled"
Last updateOptimizations are now enabled.
Following the previously scheduled maintenance, optimizations have been disabled starting at 3pm and will be restored within the next hour.
Report: "503 errors on statics"
Last update
# Description
Between 10:18 am and 11:20 am (UTC+2), the static resources of some clients responded with 503 errors. Internet users did not necessarily see these errors, but some sites may have displayed broken pages because of these missing objects, especially for users who did not already have them in their browser cache.
# Facts and Timeline
* 10:18: manual update of one of our components
* 10:28: first alert
* 10:36: start of the CDN layer bypass for the impacted domains
* 10:52: all impacted domains bypass the CDN layer; due to DNS propagation delays, errors occur until 11:20
* 13:42: start of the reconnection of the impacted domains to the CDN
* 14:04: impacted domains are reconnected to the CDN
# Analysis
The incident was caused by an update to one of our components that was not supposed to affect the production stack. An execution role needed by the edge processes on the CDN layer was removed as a side effect of this update.
# Metrics
* Severity: level 2 (site degradation, performance problem and/or broken feature that is difficult to work around, impacting a significant number of users)
* Time to detect: 10 min
* Time to resolve: 60 min
# Impacts
Only a few sites were impacted (<10).
# Countermeasures
* Short term
  * adjust alerting on edge processes to improve diagnosis
  * adjust the alert level on 5xx errors seen from the CDN layer
* Mid term
  * secure the execution role of the edge processes (a minimal sketch of such a check follows this report)
  * make it easier to unplug the CDN layer for a specific customer
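As a sketch of the mid-term countermeasure about securing the execution role of the edge processes, the check below assumes those processes are AWS Lambda@Edge functions with an IAM execution role, which the report does not state explicitly; the function names are placeholders.

```python
# Minimal sketch, assuming Lambda@Edge functions behind the CDN layer.
# It verifies that each function's execution role still exists and can be
# assumed by the Lambda / Lambda@Edge services, so an accidental role removal
# triggers an alert instead of 503 errors on static assets.
import boto3

EDGE_FUNCTIONS = ["edge-request-router", "edge-response-rewriter"]  # hypothetical names
REQUIRED_PRINCIPALS = {"lambda.amazonaws.com", "edgelambda.amazonaws.com"}

lambda_client = boto3.client("lambda", region_name="us-east-1")  # Lambda@Edge is managed from us-east-1
iam = boto3.client("iam")

def check_function(name):
    """Return a list of problems found for one edge function."""
    config = lambda_client.get_function_configuration(FunctionName=name)
    role_name = config["Role"].rsplit("/", 1)[-1]
    try:
        role = iam.get_role(RoleName=role_name)["Role"]
    except iam.exceptions.NoSuchEntityException:
        return ["%s: execution role %s no longer exists" % (name, role_name)]

    # Collect the services allowed to assume the role in its trust policy.
    principals = set()
    for stmt in role["AssumeRolePolicyDocument"].get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        principals.update(services)

    missing = REQUIRED_PRINCIPALS - principals
    if missing:
        return ["%s: role %s is missing trust for %s" % (name, role_name, sorted(missing))]
    return []

if __name__ == "__main__":
    for function_name in EDGE_FUNCTIONS:
        for problem in check_function(function_name):
            print("ALERT:", problem)  # hook this into the real alerting pipeline
```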
Everything is now back to normal. A post-mortem will follow in the next few hours. Sorry for the inconvenience :-(
The problem has been fixed for all impacted customers. We are monitoring errors to confirm everything is back to normal.
For some customers, static assets are no longer served by the CDN layer; we're actively working to fix this. In the meantime, websites are served normally.
Some 503 errors have occurred for static assets on the CDN layer. This was limited to some customers.
Report: "Inlined images may be corrupted"
Last updateThe fix has been deployed. The image inlining is enabled again. We are sorry for the inconvenience.
The issue has been identified in the engine code, where a conflict could occur. We are deploying a hotfix.
We are currently investigating an issue with the image inlining feature, which has been deactivated in the meantime. Some inlined images were not the right ones.
Report: "Fasterize.com website unavailable"
Last updateThis incident has been resolved.
Our website has been attacked, so our security systems have blocked traffic. We are investigating.
Report: "TLS Certificat issue on the dashboard and the API"
Last updateThis incident has been resolved.
We are currently investigating an issue with our TLS certificate for fasterize.com. This affects the dashboard and the API.
Report: "Intermittent errors 502 emitted by KeyCDN"
Last updateThis incident has been resolved by KeyCDN at 8:25.
All impacted customers have been unplugged from KeyCDN
We are currently investigating 502 errors emitted by KeyCDN since 04:34 (Paris time). We have already asked their support for more details. Customers using the new platform with Cloudfront are not affected. We are unplugging KeyCDN for the impacted customers.
Report: "Intermittent errors 502 emitted by KeyCDN"
Last update
## Incident description
On Tuesday, January 28, 2020, one of our CDN partners, KeyCDN, emitted 502 errors between 1:40 pm and 3:05 pm, following the unavailability of one of the servers of their Paris POP. The incident affected a small part of the traffic, since most of it is routed through another CDN and only one server was failing. KeyCDN confirmed the incident on their side by email. Customers who have migrated to the Amazon Cloudfront CDN were not impacted.
## Action plan
### Short term
* Switch the remaining customers to our new infrastructure as soon as possible
* Update the mailing list on [status.fasterize.com](http://status.fasterize.com) so that no customer is left unnotified
### Medium term
* Be able to simply disable a CDN globally, without having to act customer by customer (config by config); a minimal sketch follows this report
* Automatically update the support messages and signatures to report the incident
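To illustrate the medium-term item about disabling a CDN globally rather than config by config, here is a minimal sketch of a global kill-switch merged with per-customer CDN settings. The flag names and fallback order are assumptions; Fasterize's real configuration system is not shown here.

```python
# Minimal sketch of a global CDN kill-switch (illustrative names only).
# A single global set overrides the per-customer CDN choice, so on-call does
# not have to edit every customer config during a CDN incident.
GLOBAL_DISABLED_CDNS = {"keycdn"}            # set during an incident, empty otherwise
FALLBACK_ORDER = ["cloudfront", "keycdn"]    # hypothetical preference order

def effective_cdn(customer_config):
    """Return the CDN to use for a customer, honoring the global kill-switch."""
    preferred = customer_config.get("cdn")
    candidates = [preferred] + [c for c in FALLBACK_ORDER if c != preferred]
    for cdn in candidates:
        if cdn and cdn not in GLOBAL_DISABLED_CDNS:
            return cdn
    return None  # no CDN available: serve directly from the platform

# A customer configured on KeyCDN is transparently moved to Cloudfront
# while the kill-switch is active.
assert effective_cdn({"cdn": "keycdn"}) == "cloudfront"
```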
KeyCDN has taken the problematic edge server out of production for further analysis. We haven't seen any 502 errors since 3:05 pm. We have plugged the previously unplugged websites back onto KeyCDN.
We unplugged KeyCDN while waiting for a complete resolution.
We are currently investigating 502 errors emitted by KeyCDN. We have already asked their support for more details. The new platform (AWS euwest1) is not impacted by this incident.
Report: "DNS resolution error on some machines"
Last update
# **Incident description**
Note: the following incident relates to the datacenter hosted at AWS (euwest1).

On Wednesday, January 22, 2020, between 7:00 pm and 8:30 pm, a DNS resolution problem was identified on all the machines hosted in one of our three Paris availability zones. The monitoring system indicates that during this period 6.24% of the requests to the origins failed. Since all the machines in this zone use the internal AWS DNS (supposed to be highly available and redundant), they could no longer connect to the customers' origin servers. AWS reported a connectivity problem on the affected zone on its status page.

After identifying the incident, we changed the configuration of our machines to use another public DNS service in order to keep the service running while waiting for the incident to be resolved on the hosting provider's side.

The lack of DNS resolution also prevented us from having a reliable, real-time view of the state of the system (logs and metrics). From the moment we changed the DNS resolvers, the monitoring system started to catch up on the backlog of logs and metrics. Until that backlog was cleared, we did not have a real-time view of the errors emitted by the platform through our log and metric collection systems. In the absence of monitoring and logs, and based on our manual tests, we wrongly assumed that the fix was sufficient. After analysis, it turns out that the impacted HTTP servers did not correctly take the change into account, as they use their own resolver configuration. The end of the incident corresponds to the restoration of the DNS service by AWS.
# Action plan
**Short term:**
* Improve the availability of our DNS resolution system by avoiding dependence on a single provider (a minimal probing sketch follows this report)
* Review the metrics system to make the real-time view of the platform more reliable

**Medium term:**
* Improve the automatic handling of a failure in the layer responsible for communicating with the origins
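To illustrate the short-term action about not depending on a single DNS provider, here is a minimal probing sketch that queries several resolvers and flags the case where the primary one stops answering while a public fallback still does. The resolver addresses and test hostname are assumptions, and dnspython is used for the queries; this is not Fasterize's actual tooling.

```python
# Minimal sketch: probe several DNS resolvers so that an alert fires if the
# primary (e.g. the AWS VPC resolver) fails while a public fallback still
# answers. Resolver IPs and the test hostname are illustrative assumptions.
import dns.exception
import dns.resolver  # pip install dnspython

RESOLVERS = {
    "aws-vpc": "169.254.169.253",  # AWS-provided VPC resolver (assumption)
    "public-1": "1.1.1.1",
    "public-2": "8.8.8.8",
}
TEST_HOSTNAME = "origin.example-customer.com"  # hypothetical origin hostname

def probe(resolver_ip, hostname, timeout=2.0):
    """Return True if the resolver answers an A query within the timeout."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [resolver_ip]
    resolver.lifetime = timeout
    try:
        resolver.resolve(hostname, "A")
        return True
    except (dns.exception.DNSException, OSError):
        return False

if __name__ == "__main__":
    results = {name: probe(ip, TEST_HOSTNAME) for name, ip in RESOLVERS.items()}
    print(results)
    if not results["aws-vpc"] and any(results.values()):
        # Hook point for paging: the primary resolver is down but a fallback works.
        print("ALERT: primary resolver unhealthy, consider switching resolvers")
```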
This incident is now resolved. From AWS : "10:25 AM PST We are investigating an issue which is affecting internet connectivity to a single availability zone in EU-WEST-3 Region. 11:05 AM PST We have identified the root cause of the issue that is affecting connectivity to a single availability zone in EU-WEST-3 Region and continue to work towards resolution. 11:45 AM PST Between 10:00 AM and 11:28 AM PST we experienced an issue affecting network connectivity to AWS services in a single Availability Zone in EU-WEST-3 Region. The issue has been resolved and connectivity has been restored."
AWS is working on the DNS resolution issue. The platform has been behaving normally since the DNS resolver change.
We detected an issue impacting DNS resolution on some machines. We have already worked around it by changing the DNS resolver. We are investigating why the original DNS resolver is not working.