Historical record of incidents for SEKOIA.IO
Report: "FRA1 event storage cluster performance issues impacting search jobs"
Last update: We are currently experiencing performance issues with our event storage cluster in the FRA1 region. This is impacting search jobs, resulting in slower event searches. Our engineers are investigating this issue and we are working to restore normal functionality as soon as possible.
Report: "[MCO1] HTTP intake issues"
Last update: A sudden and significant increase in received HTTP events around 04:12 CEST this morning caused our HTTP events receiver instances to crash and restart in a loop, as they were filling their local queues faster than they could push to our internal message bus. The problem was almost entirely silent because the restarts are fast, and once restarted, instances operate normally until their queues fill again. At 08:23 CEST we applied a fix by drastically increasing the number of service instances, allowing them to push the higher event volume to the message bus faster than their queues fill. As a result of this incident, clients using HTTP intake lost a significant percentage of the events sent within this incident's timeframe. Our teams will implement early-warning alerts and improved auto-scaling to detect and mitigate similar issues sooner in the future. We apologize for the inconvenience.
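To make the failure mode above concrete, here is a minimal illustrative sketch (not SEKOIA.IO's actual receiver code; the rates and queue size are assumed values). It models receivers that buffer incoming events in a bounded local queue and forward them to a message bus: when the arrival rate exceeds the total forwarding rate, the queue overflows and events are dropped, while adding instances raises the drain rate above the arrival rate.

```python
# Minimal illustrative sketch only -- not SEKOIA.IO code. All numbers are
# assumptions chosen to show the failure mode: a bounded local queue overflows
# when events arrive faster than instances can forward them to the message bus.
QUEUE_CAPACITY = 10_000          # hypothetical total local queue capacity (events)
INGEST_RATE = 12_000             # hypothetical HTTP events received per second
PUSH_RATE_PER_INSTANCE = 3_000   # hypothetical events one instance pushes per second

def dropped_events(instances: int, seconds: int = 60) -> int:
    """Simulate one minute of traffic and return how many events are lost."""
    backlog, dropped = 0, 0
    drain_per_second = instances * PUSH_RATE_PER_INSTANCE
    for _ in range(seconds):
        backlog += INGEST_RATE                      # events arriving this second
        backlog -= min(drain_per_second, backlog)   # events pushed to the bus
        if backlog > QUEUE_CAPACITY:                # queue full: excess is lost
            dropped += backlog - QUEUE_CAPACITY
            backlog = QUEUE_CAPACITY
    return dropped

print(dropped_events(instances=3))   # drain < ingest -> sustained event loss
print(dropped_events(instances=5))   # drain > ingest -> no loss
```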
Report: "FRA2 region is out of service"
Last update: The region is currently down. We are investigating some issues on the cluster.
Report: "FRA2 region is unreachable"
Last updateOur team is in contact with our external storage support, which is in the process of adding two more hosts to our storage cluster to resolve this issue. Provisioning of these hosts is expected to take between 15 to 30 minutes. At present, some servers are coming back online, but full functionality will only be restored when the majority of servers are operational. We appreciate your patience as we work to resolve this issue.
We are currently experiencing an issue in our FRA2 region due to a running migration. A virtual machine reconfiguration has resulted in duplication of stored data, which has filled our datastore. Our engineers are working to resolve the issue. Your patience is appreciated.
Report: "FRA2 region is unreachable"
Last update: This incident has been resolved.
The previous issue impacting the FRA2 region has been fully resolved. The operational workflow is now running smoothly, with no errors detected on the APIs. We are actively working through the backlog and expect to be completely caught up shortly.
The issue with the overloaded datastore in the FRA2 region is being addressed. Additional hosts are being added to the storage cluster, with some servers already coming back online. Full restoration is expected once the majority of servers are operational. Thank you for your patience.
Our team is in contact with our external storage support, which is in the process of adding two more hosts to our storage cluster to resolve this issue. Provisioning of these hosts is expected to take between 15 and 30 minutes. At present, some servers are coming back online, but full functionality will only be restored when the majority of servers are operational. We appreciate your patience as we work to resolve this issue.
We are currently experiencing an issue in our FRA2 region due to a running migration. A virtual machine reconfiguration has resulted in duplication of stored data, which has filled our datastore. Our engineers are working to resolve the issue. Your patience is appreciated.
Report: "UAE1 indexation delays"
Last updateWe identified a component that needed some load balancing in order to provide better performance.This incident is now resolved.
We are currently investigating this issue.
Report: "UAE1 indexation delays"
Last update: We identified a component that needed some load balancing in order to provide better performance. This incident is now resolved.
We are currently investigating this issue.
Report: "UAE1 events not available in alerts"
Last update: We found the root cause and applied a fix. Events are now immediately available in new alerts.
We detected that the process responsible for associating events with alerts is experiencing congestion. We are actively looking into this issue.
Report: "UAE1 events not available in alerts"
Last updateWe found the root cause and performed a fix. Events are now immediately available in new alerts.
We detected that the process responsible for associating events to alerts is having congestion issues. We are actively looking into this issue.
Report: "[MCO1] Event indexation stopped"
Last update: The fix was effective; the incident is now considered resolved.
Event processing is back to normal. We're monitoring the platform.
We've fixed the issue and event processing is restarting.
We've identified the potential root cause and are working on a fix.
We are facing an incident causing delays in event indexing on the platform. It affects event processing, alert raising, and event indexing in our storage cluster.
Report: "[MCO1] Event indexation stopped"
Last updateThe fix was effective, the incident is now considered resolved.
Event processing is back to normal.We're monitoring the platform.
We've fixed the issue and the event processing is restarting
We've identified the potential root cause and working on a fix.
We are facing an incident causing delay to indexing events on the platform.It concerns event processing and alert raising.And also the event indexing in our storage cluster.
Report: "[FRA1] Playbooks issue"
Last updateEverything is back to normal and stable since 15:30 CEST
We have identified the root cause and applied some fixes.There is still some unusual load on playbook runs but the state is currently coming back to normal.
We are facing some delay on playbooks execution due to an issue that happened this morning at 11:00 CEST.It might seem like playbooks execution are loading and waiting infinitely on the client side.We are catching up on the delay and the situation will come back to normal soon.We are still investigating on the root cause.
Report: "[FRA1] Playbooks issue"
Last update: Everything has been back to normal and stable since 15:30 CEST.
We have identified the root cause and applied some fixes. There is still some unusual load on playbook runs, but the situation is returning to normal.
We are facing delays in playbook execution due to an issue that occurred this morning at 11:00 CEST. On the client side, playbook executions may appear to load and wait indefinitely. We are catching up on the delay and the situation will return to normal soon. We are still investigating the root cause.
Report: "[FRA2] TLS issue"
Last updateOur team observed a peak in HTTP event ingestion once the issue was fixed.It indicates that events blocked during the incident were received after retry.Everything is now up and stable.
The root cause was a proxy parameter that was unintentionally reset due to an automated process.This behavior was not intended and we will implement safeguards to avoid it in the future.The issue has been fixed and everything is back.The long-lived sessions like syslog ingestion were likely not impacted, but HTTP ingestion can have encountered errors.Events processing and our internal services were not impacted.We'll keep monitoring and watching for any persistent impact.
We are facing a TLS issue on the region.Our team is investigating on the root cause and determining the blast radius at the moment.
Report: "[FRA2] TLS issue"
Last update: Our team observed a peak in HTTP event ingestion once the issue was fixed, which indicates that events blocked during the incident were received after retries. Everything is now up and stable.
The root cause was a proxy parameter that was unintentionally reset by an automated process. This behavior was not intended, and we will implement safeguards to avoid it in the future. The issue has been fixed and everything is back to normal. Long-lived sessions such as syslog ingestion were likely not impacted, but HTTP ingestion may have encountered errors. Event processing and our internal services were not impacted. We will keep monitoring for any persistent impact.
We are facing a TLS issue in the region. Our team is investigating the root cause and determining the blast radius.
Report: "FRA1 event indexation delays"
Last update: This incident has been resolved.
This issue is still ongoing.
Due to a high volume of traffic on FRA1, you may experience some delay in event indexation. Alerts are raised in real time and other components are running nominally.
Report: "[FRA2] Events search degraded"
Last update: This incident has been resolved.
We identified multiple search jobs that were hindering the data cluster's performance. We killed these requests and the situation was resolved.
We identified a particularly heavy search query from a customer. We are revoking the search to resolve the situation.
We are having an issue with search capabilities on FRA2. We are currently investigating the root cause, as short-term fixes have not been sufficient.
Report: "FRA2 temporary outage"
Last update: We identified the root cause to be linked to a new flavor of nodes with smaller disks. The issue has been mitigated and a task to improve this flavor has been created.
We identified an issue on FRA2 which resulted in a partial outage from 23:15 to 23:57 CEST. The situation is now stable while we are looking into the root cause of this issue.
Report: "FRA1 events temporarily unavailable"
Last update: Due to side-effects of an ongoing investigation, we experienced a short outage on the events page from 10:36 to 10:49 CEST. This incident is now resolved and the investigation has been stopped. The situation is back to normal.
Report: "[FRA1] Event indexing delay"
Last update: The platform has been indexing in real time since 18:45 CEST.
Our event indexing stopped at 16:55. We identified the reason and applied a fix; indexing resumed at 17:22. It will progressively catch up on the accumulated delay. During this time, events will appear with a delay on the events page. We will keep monitoring until indexing is back to real time.
Report: "[FRA1] temporary failure"
Last update: This incident is over. There is still a small delay in event processing, but it is resolving progressively.
We found that the root cause of the incident is linked to a host incident on our cloud provider's side.
We had a temporary failure on a cache service that caused authentication failures from 14:41 CET to 14:55 CET. This also caused some delay in alert raising, event processing, and playbook starts. The service is back up and we are catching up on the delay. Our team is currently investigating the root cause.
Report: "Delay in events analysis"
Last update: We caught up with the backlog of events and the traffic is now being processed in real-time.
We had an issue with DNS resolution in a Kubernetes cluster. The issue is fixed, but we accumulated some delay in handling events. The lag is currently being processed.
Report: "Provider outage"
Last update: All delayed events were fully handled by 08:50 CEST.
We're continuing to monitor the processing closely and are seeing steady progress in reducing the backlog. While it may still take some time to fully catch up, we’re doing everything we can to maintain stability and ensure no data is missed.
The backlog of the queued events is still being processed at maximum capacity. Our team is dedicated to clearing this backlog as efficiently as possible, ensuring that all events are handled promptly.
Cluster recovery is done with no data loss. We are still processing the backlog of queued events, at maximum capacity.
We are making great progress on fixing the event storage cluster. All "cold" data is progressively becoming available. For all events ingested after 03:30 CEST, we are processing the delayed events. We found a way to speed up the recovery process. The event storage cluster is steadily recovering.
We are still making progress on the event storage cluster. All "cold" data is progressively becoming available. For all events ingested after 03:30 CEST, we are processing the delayed events. The situation is progressively recovering; however, we had to slow down the process for the moment due to a very high number of parallel tasks posing a risk to the cluster. We are trying to find ways to improve the situation faster.
We are still making progress on the event storage cluster. All "cold" data is progressively becoming available. For all events ingested after 03:30 CEST, we are processing the delayed events. The situation is progressively recovering.
We are still making progress on the event storage cluster. All "cold" data is progressively becoming available. For all events ingested after 03:30 CEST, we are currently fixing the situation.
We are still working on the event storage cluster. We are also making progress on alerts that are missing their events: events linked to alerts are progressively becoming available in the event storage cluster.
We are still working on the event storage cluster. So far, event searches are working, but the oldest events are not available for search. We are fixing this progressively, meaning more and more older events will become available over time. In addition, all events ingested since 03:30 CEST this morning are not available in the event storage cluster.
We are still working to stabilize the event storage cluster. So far, some event queries and searches are working; however, not all data is available for the moment.
We are still preparing a fix to roll out on our whole event storage cluster. In the meantime, we fixed the automation cluster.
Most services are up. There are still some issues with our event storage cluster, making events and event search unavailable. All events are still being received and properly processed. In addition, automation (playbooks) is also having issues. We are working quickly to fix these situations.
We had an outage on our main provider, and the network went down. We are currently recovering access to the platform and fixing the various issues.
Report: "[FRA1] events indexing delay"
Last update: This incident has been resolved.
We have identified and fixed the issue, and we are now indexing at a normal rate again. The delay will now slowly decrease; we will keep you updated once we are back to real-time.
Hello, we are currently facing a problem with event indexing, which causes a delay before events are available on the events page. Events and alerts are still processed in real-time, and there is no data loss. We are still investigating the root cause and will keep you updated.
Report: "FRA1 hardware network issues"
Last update: After swapping network cards on the faulty router, we decided to completely replace the router with another similar machine. We are not seeing any of the initial issues as of now.
Our provider is having hardware issues on a pair of servers that are the main network routers of FRA1. While we are investigating, you may see some sporadic timeouts and 50x errors (affecting less than 0.1% of requests); these requests will succeed after a retry. Event ingestion is also experiencing some delay, due to the nature of the underlying issue.
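Since the errors described above are transient and requests succeed on retry, intake clients can ride out this kind of incident with a small retry loop. The sketch below is illustrative only: the endpoint URL is a placeholder rather than SEKOIA.IO's documented intake API, and it assumes the Python `requests` library.

```python
# Illustrative client-side retry sketch only. The endpoint URL is a placeholder,
# not SEKOIA.IO's documented intake API; the `requests` library is assumed.
import time
import requests

INTAKE_URL = "https://intake.example.com/events"   # placeholder endpoint
MAX_ATTEMPTS = 5

def send_event(event: dict) -> None:
    """Deliver one event, retrying transient 50x errors and timeouts with backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = requests.post(INTAKE_URL, json=event, timeout=10)
            if resp.status_code < 500:
                resp.raise_for_status()   # 4xx is not retryable: raise to the caller
                return                    # 2xx: delivered
            # 50x: fall through and retry
        except requests.exceptions.HTTPError:
            raise                         # re-raise the 4xx error untouched
        except requests.exceptions.RequestException:
            pass                          # connection error or timeout: retry
        if attempt == MAX_ATTEMPTS:
            raise RuntimeError("event delivery failed after retries")
        time.sleep(2 ** attempt)          # exponential backoff: 2s, 4s, 8s, ...
```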
Report: "[FRA1] - Delay on event processing"
Last update: Event ingestion is back to real-time.
A fix has been implemented and we are gradually catching up on the delay. We will close this incident once event ingestion is back to real-time.
Hello, an unexpected behavior during a deployment is causing a slight delay in event processing. We identified the root cause and are currently working on a solution.
Report: "[MCO1] Events indexation delay"
Last update: Indexing has been back to real time since 13:13 CET and everything is stable.
A fix has been implemented and the platform is catching up on the delay. We will come back to you when we are back to real-time.
Good morning. You may be experiencing delays in event indexing due to exceptionally high traffic this morning. This affects the time before events are visible on the events page. Our team is currently working towards a solution. The delay is currently about 15 to 20 minutes.
Report: "Event ingestion issue over HTTP"
Last update: The root cause was an incident on our provider's side. We will communicate a postmortem as soon as our provider's investigation is finished.
The ingestion issue is currently resolved. We are still investigating the root cause.
We are currently having an issue with event ingestion over HTTP. We are investigating the issue.
Report: "FRA1 Web application and API issue"
Last update: A cloud provider issue impacted access to the web application and APIs of the FRA1 region between 17:21 and 17:28 CET. This incident did not affect the reception or processing of events. We are currently reaching out to our cloud providers to determine the root cause, as this does not originate from an issue in our scope. Update: this has been traced back to an issue with a cloud load balancer hosted by Scaleway. A router misconfiguration resulted in all public traffic being black-holed for the duration of this incident. More information is available on their status page here: https://status.scaleway.com/incidents/162zw5zd9x8r
Report: "OVH Object Storage unavailable"
Last update: Everything has been back to normal since 22:10 CET.
OVH is currently experiencing worldwide issues with their Object Storage offering. We are in contact with their support. This is affecting some parts of our application, such as notebooks and the update of anomaly detection rules models. We will keep you posted once we have more info.
Report: "[FRA1] 500 errors on some APIs"
Last update: This incident has been resolved.
We identified the underlying issue and applied a corrective fix. Error rates are now going down.
We are detecting an abnormal number of 50x HTTP errors on some API endpoints.
Report: "[FRA2] vmWare hosts update"
Last update: This incident has been resolved. We will be investigating with OVH to understand what went wrong in their automated VM management process.
We are now processing events nominally.
After investigating further, it seems that this incident was not directly caused by our routine operation, but by an automatic upgrade process carried out by OVH, our provider. We will be in touch with them to understand the root cause of this incident.
All services are now running correctly. Some databases are still initializing, so events are currently being buffered before resuming their normal processing. The UI and API are working as expected.
A routine reboot of a VMware host resulted in some of our VMs going down unexpectedly. We are currently stabilizing the situation.
Report: "[FRA1] - Events ingestion down"
Last update: This incident was resolved. Event ingestion is now in real time.
A fix has been implemented and the situation is under control. There is still some delay in event processing; it should recover soon.
A fix has been implemented and we are monitoring the results.
We identified an issue in our ingestion process. Ingestion is currently down. We are working on a fix.
Report: "[FRA2] Maintenance exceeding time slot."
Last update: This incident has been resolved. A post-mortem will come in the following days while our engineers gather all necessary data.
We are done with the rollback of this cluster upgrade, and the region is now up. We are monitoring the overall situation before closing this incident.
We have restored the backup and we are starting to bring the platform back up.
As part of our recovery procedure, we are currently stopping the whole region to restore the backup.
Our team tried multiple operations to fix the issue we encountered, without success. We decided to roll back the upgrade and restore the previous cluster state from a backup.
The maintenance time slot has been exceeded, but the maintenance is not complete. A message stating that the maintenance slot was complete automatically closed the status page when it should not have. We are still experiencing network errors and the whole team is working towards a solution. We will keep you updated.
Report: "[FRA1] - Playbooks errors"
Last update: Our team experienced an issue that impacted playbooks from 23:42 to 00:20 CET. Playbooks may have reported errors and/or could have been stopped. The issue has been fixed on our side, but we encourage you to check your playbooks. We are sorry for the inconvenience.
Report: "Playbook runs in error"
Last update: All previously missed playbook runs were replayed and the underlying issue has been fixed. This incident is now resolved. Thanks for your patience and understanding.
Playbooks are currently running as expected. Our team is investigating whether we can retry the previously missed runs.
Due to a recent deployment, playbooks are not starting in some regions. We identified the issue and we are currently rolling out a fix.
Report: "[FRA1] Alerts raising lag"
Last update: We are back to raising alerts in real-time. Thank you for your patience.
Hello, we have an issue causing delays in alert raising. Our team has identified the cause and is currently applying a fix.
Report: "[FRA1] Alerts raised without events"
Last update: All alerts have been processed correctly since 15:10 CET. The fix has been applied and past alerts lacking events have also been fixed. Thank you for your patience.
Our team is currently applying the fix on production.
The issue has been identified. Our teams are working on a fix to prevent this issue in the future and to ensure that events are correctly added to already raised alerts. We'll come back to you once the fix is applied.
Hello, we are aware of an issue causing alerts to be raised without their associated events being available. Our engineering and infrastructure teams are currently investigating the root cause.
Report: "MCO1 lag on events processing"
Last update: Event processing and alert raising have been back to real time since 21:30 CET. Event storage has been back to real-time since 23:00. This incident is now over.
The whole platform is up again since 19:05 CET. We are processing the backlog of events and raising alerts accordingly. The estimated time before recovering to real-time processing is just above 2 hours. Events storage is catching up a little bit slower, so events will not show instantly after being processed. We are still monitoring the behavior and will let you know when everything is back to real-time. Thank you for your patience.
Hello, during today's region upgrade, the upgrade of a critical service for event processing is taking an abnormally long time due to a restart failure, which restarted its upgrade procedure from the beginning. The impact is a substantial delay in event processing (more than an hour). A fix was immediately implemented to prevent the same failure, but the service still takes a long time to restart entirely. We are monitoring it closely and will keep you updated.
Report: "[UAE1] Platform instability"
Last update: This incident has been resolved. All events are processed in real time and alerts are raised in real time.
All nodes have been restarted and the platform is up and running. There is some delay in event processing and detection; this backlog should be processed within the next hour.
The platform is fully functional for now; some nodes are still being restarted in a controlled manner.
We identified the issue and are rolling out a fix. The platform is currently usable, but the rollout of our fix may create some sporadic issues in the coming minutes.
We are currently seeing some instability in the UAE1 region; our team is investigating.
Report: "Issue with alerts not being raised"
Last update: This incident has been handled. A post-mortem was produced and communicated to customers.
Event replay started at 17:55 CET; you should now see alerts being raised for the period of the incident (11:21 CET until 13:01 CET). We estimate this event replay to finish around 05:00 CET tomorrow.
We are currently mobilizing resources to perform a replay of events received during the duration of the incident. Our goal is to ensure alerts were eventually correctly raised.
A fix has been implemented and we are monitoring the results.
We have had an issue since 11:20 that impacted the alert raising process. We identified the issue and are currently deploying a fix.
Report: "UAE1 - delay on events processing."
Last update: Everything has been back to real-time since 15:40 CET.
The traffic has come back to normal and there is no more delay in event processing. However, there is still some delay in alert raising, which is currently resolving. We will keep monitoring until it is back to real-time.
Hello, our UAE1 platform has been facing exceptionally high traffic since 12:00 CET. This is causing delays in event processing and alert raising.
Report: "FRA1 - events search delay"
Last update: We have found the root cause of the incident and it is now resolved. Our team is now working on preventing this problem in the future. Thanks for your patience.
The peak load has passed and search jobs have been working in real-time since 10:40. We are still investigating the root cause of the load.
Hello, we are aware of slowness in event searches since 09:55 CET. Our team is investigating the issue.
Report: "[MCO1] General performance issue"
Last update: Cloud provider backups ended around 6am and IOPS performance was restored to its baseline.
Our cloud provider for the MCO1 region is currently performing block devices backups, which results in a global slowdown of the storage layer of our deployment. While the backups are ongoing, events are being processed with a significant delay and some API queries might fail.
Report: "Events analysis delay"
Last update: This incident has been resolved.
We are currently handling the delayed events.
We are encountering issues with the update of one of our services. The issue has been identified and the situation should resolve soon. We currently have some delay in analyzing events and raising alerts. As a consequence, the delayed events are also not available for search.
Report: "[FRA1] Search jobs temporary unavailability"
Last update: This incident has been resolved.
The cache has been resized successfully and no errors have been seen since 13:47. We are still monitoring the situation and investigating the root cause, but the service is up.
We identified an issue with an internal cache cluster used for search jobs on the events page. While we are resizing that cache, some search jobs may fail. Our team is currently performing the resize operation; the situation should stabilize soon.
Report: "[FRA1] Temporary outage"
Last update: A critical internal service on our platform was unreachable for 7 minutes, between 11:52 and 11:59. Because it sits at the center of many other services, many of them became unreachable as well. This has been fixed; the platform is reachable again and everything is working as intended.
Report: "[MCO1] delay on event processing and indexing."
Last update: The platform has been running in real-time since 00:34 UTC+1.
A fix has been implemented and the delay is now decreasing slowly. We will keep monitoring this incident and will close this status page once we are back in real-time.
We are aware of an incident causing delay to accumulate on the platform. It affects event processing and alert raising, which are currently happening about 10 minutes after an event is received, as well as event indexing in our storage cluster, which is currently happening around 30 minutes after an event is processed. We have identified the cause and are working towards its resolution.
Report: "[FRA1] Playbook runs incident"
Last update: The playbooks environment is stable and steady. This incident is now resolved.
The playbook environment is back up and we are processing tasks as usual. We are back in real-time; however, the environment is handling a lot of load at the moment. We are monitoring closely until everything is stable and back to normal.
We implemented a fix to the network issue. The cluster is coming back online on our side. We are currently stabilizing the cluster after the fix, and validating that everything is working.
We detected an incident concerning our playbook runs that impacts DNS resolution and run processing.
Report: "MCO1 Indexation performance issues"
Last update: The delay has been fully absorbed since 21:50 UTC; the incident is now completely resolved.
Some fixes were implemented to increase performance. We are now able to catch up on the delay, slowly but steadily. We will close this status page once event storage is back to real-time.
The incident is still ongoing. We are actively continuing to search for its root cause. Our team has added some resources to the storage cluster as a temporary workaround. At this stage, performance is still below our expectations, and we continue to experience delays. We estimate the delay to be around 1 hour and 45 minutes between the processing of an event and its entry into our storage cluster.
We are currently having issues indexing events in our storage cluster. This is generating delay before the events are available in the events and alerts pages. The detection is not affected.
Report: "[FRA2] events processing delay"
Last update: Events have been processed in real-time since 17:39 CET; everything is stable.
During a service update on the region, we encountered an issue with our event processing, which was stopped for around an hour, starting at 15:33. We have fixed the issue and processing has resumed; we are catching up on the delay. We expect about an hour before returning to real-time processing.
Report: "[MCO1] events processing lag"
Last update: This incident has been resolved.
A fix has been implemented and we are now catching up on the lag. We will keep monitoring closely and keep this status page open until we have no more delay on events processing.
We are investigating an issue with our events processing pipeline, which has been performing poorly since 09:27 CET. Event processing is accumulating lag, which impacts alert raising.
Report: "FRA1 Events processing stopped"
Last update: The platform has been consuming events in real-time since 19:09 CET. This incident has been resolved.
A good part of the backlog has already been processed. The platform's detection is estimated to be back in real-time at 19:05 CET. We will keep monitoring closely until that time.
Event processing is stable and we are slowly catching up on the lag. We expect to be back in real-time in a couple of hours due to the volume of the event backlog. We will keep you updated on this.
We were able to mitigate the issue; our events processing pipeline is back up. The platform is now working through the lag, and we are monitoring this closely.
We lost several servers within a few minutes due to a network issue on our cloud provider's side. Our event ingestion pipeline is not impacted and we are still receiving every event. However, our events processing pipeline has been stopped since 15:23 CET, and we are accumulating lag on event processing and alert raising. We are currently reaching out to our cloud provider's support to get more information in order to resolve this incident as fast as possible.
Report: "[UAE1] Syslog SSL issue"
Last update: From 12:45 CET to 18:42 CET, there was a conflict with our ingress configurations that caused syslog SSL connections to be rejected. This impacted rsyslog reception. As a result, events sent during this period may have been rejected and could potentially be lost if they were not buffered on your side. We sincerely apologize for any inconvenience this may have caused. This is a serious matter, and we are committed to implementing enhanced monitoring and safeguards to ensure this issue is identified more quickly and prevented from recurring in the future. The issue has been addressed, and we have restored the ability to send events via rsyslog using our event-amplifier.
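For clients who want to check syslog-over-TLS connectivity from their side during this kind of incident, one minimal test is to open a TLS connection to the intake endpoint the same way an rsyslog forwarder would; a rejected handshake like the one described above shows up immediately. This is an illustrative sketch only: the host and port are placeholders, not the documented intake endpoint.

```python
# Illustrative connectivity check only. The host and port below are placeholders,
# not the documented intake endpoint. It opens a TLS connection the way a
# syslog-over-TLS forwarder would, which quickly surfaces handshake rejections
# like the one described in this incident.
import socket
import ssl

HOST, PORT = "intake.example.com", 10514   # placeholder syslog-TLS endpoint

context = ssl.create_default_context()     # verifies the server certificate
with socket.create_connection((HOST, PORT), timeout=5) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        print("TLS session established:", tls_sock.version())
```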
Report: "[UAE1] Platform down"
Last update: This incident has been resolved.
The platform has been up again since 12:49 CET. We have not lost any events during the downtime. Alerts are being processed. We will keep monitoring closely for some time to ensure everything is stable.
The UAE1 region has been experiencing downtime since 11:20 CET due to an issue during a maintenance update. The problem has been identified; we are actively rolling back the changes and expect the issue to be resolved within the next half hour.
Report: "[FRA1] internal network issues"
Last update: We have fully caught up on the alert raising lag and everything has been running and stable since.
We also noticed some impact on tag enrichment, which caused a lot of alerts to be raised. Everything is stable now and we are working through the alert raising lag. ETA: ~1 hour.
A fix was deployed at 12:58 CET. Ingestion and event storage have not been impacted, and we have not lost any events. However, alert raising tasks are delayed. We are gradually working through the lag and will give you an ETA soon.
We are aware of an ongoing incident on our platform since 12:15 CET, related to internal load balancers. This is impacting our whole platform. Our team is currently implementing a fix. We'll keep you updated.
Report: "Event ingestion delays"
Last update: The entire backlog has been processed; this incident is now over.
We managed to identify the issue and process the backlog of pending tasks on the cluster responsible for event ingestion. We are now catching up on the backlog of enqueued events.
Investigation is still ongoing.
We are currently experiencing performance issues with event ingestion. As a result, events may show up late on the events page. Our team is looking into this issue.
Report: "Temporary Disruption in "Alert Created" Playbook Triggers"
Last update: On 17/10, at 16:54 CEST, a deployment introduced a bug into production which led to the "alert created" playbook triggers not being activated. All other triggers and playbooks continued to operate without any issues. Our team detected the issue and has already rolled back the affected deployment as of 10:09 today. We are actively working on replaying the missed triggers and are developing a permanent fix to prevent similar incidents in the future. We apologize for any inconvenience caused and appreciate your patience while we resolve this matter. Thank you for your understanding.
Report: "FRA1 detection is down"
Last update: This incident has been resolved. All alerts are being processed in real time.
We are pleased to inform you that the fix has been successfully deployed. No alerts were lost during this incident. However, please note that some alerts may experience a temporary delay. Our team is closely monitoring the situation to ensure everything returns to normal promptly. Thank you for your patience and support.
We have identified an issue with our detection engine and have temporarily paused it to prevent any false alerts. Rest assured, our team is actively working on a solution, which we expect to deploy shortly. Thank you for your patience and understanding.