SEKOIA.IO

Is SEKOIA.IO Down Right Now? Check whether there is an ongoing outage.

SEKOIA.IO is currently Operational

Last checked from SEKOIA.IO's official status page

Historical record of incidents for SEKOIA.IO

Report: "FRA1 event storage cluster performance issues impacting search jobs"

Last update
investigating

We are currently experiencing performance issues with our event storage cluster in the FRA1 region. This is impacting search jobs, resulting in slower event searches. Our engineers are investigating this issue and we are working to restore normal functionality as soon as possible.

Report: "[MCO1] HTTP intake issues"

Last update
resolved

A sudden and significant increase in received HTTP events around 04:12 CEST this morning caused our HTTP events receiver instances to crash and restart in a loop, as they were filling their local queues faster than they could push to our internal message bus. The problem was almost entirely silent because the restarts are fast, and once restarted, instances operate normally until their queues fill again. At 08:23 CEST we applied a fix by drastically increasing the number of service instances, allowing them to push the higher event volume to the message bus faster than their queues fill. As a result of this incident, clients using HTTP intake experienced a significant percentage of lost events within this incident timeframe. Our teams will implement early-warning alerts and improved auto-scaling to detect and mitigate similar issues sooner in the future. We apologize for the inconvenience.
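
For illustration only, this is not SEKOIA.IO's intake code: a minimal Python sketch of the failure mode described above, in which receiver instances with bounded local queues fall behind the inbound event rate, overflow and restart, and in which adding instances restores headroom. All rates and capacities below are assumed numbers, not measured values.

# Hypothetical model of the incident above, with made-up numbers.
QUEUE_CAPACITY = 10_000   # events one receiver instance can hold locally
PUSH_RATE = 2_000         # events/s one instance can forward to the message bus
INBOUND_RATE = 9_000      # events/s arriving during the traffic spike

def simulate(instances: int, seconds: int = 60) -> dict:
    """Track local queue depth and lost events for one receiver in a pool."""
    per_instance_inbound = INBOUND_RATE / instances
    backlog = 0.0
    lost = 0.0
    for _ in range(seconds):
        backlog += per_instance_inbound - PUSH_RATE
        if backlog > QUEUE_CAPACITY:
            # Queue full: in the incident this is where an instance crashed and
            # restarted, losing whatever could not be queued or forwarded.
            lost += backlog - QUEUE_CAPACITY
            backlog = QUEUE_CAPACITY
        backlog = max(backlog, 0.0)
    return {"instances": instances, "backlog": int(backlog), "lost_per_instance": int(lost)}

if __name__ == "__main__":
    print(simulate(instances=2))  # queues overflow within seconds -> restart loop
    print(simulate(instances=8))  # per-instance inbound < push rate -> no overflow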

Report: "FRA2 region is out of service"

Last update
investigating

The region is currently down. We are investigating some issues on the cluster.

Report: "FRA2 region is unreachable"

Last update
Update

Our team is in contact with our external storage support, which is in the process of adding two more hosts to our storage cluster to resolve this issue. Provisioning of these hosts is expected to take between 15 to 30 minutes. At present, some servers are coming back online, but full functionality will only be restored when the majority of servers are operational. We appreciate your patience as we work to resolve this issue.

Identified

We are currently experiencing an issue in our FRA2 region due to a running migration. A virtual machine reconfiguration has resulted in duplication of stored data, which has filled our datastore. Our engineers are working to resolve the issue. Your patience is appreciated.

Report: "FRA2 region is unreachable"

Last update
resolved

This incident has been resolved.

monitoring

The previous issue impacting the FRA2 region has been fully resolved. The operational workflow is now running smoothly with no errors detected on the APIs. We are actively mitigating the backlog and expect to be completely up-to-date shortly.

identified

The issue with the overloaded datastore in the FRA2 region is being addressed. Additional hosts are being added to the storage cluster, with some servers already coming back online. Full restoration is expected once the majority of servers are operational. Thank you for your patience.

identified

Our team is in contact with our external storage support, which is in the process of adding two more hosts to our storage cluster to resolve this issue. Provisioning of these hosts is expected to take between 15 to 30 minutes. At present, some servers are coming back online, but full functionality will only be restored when the majority of servers are operational. We appreciate your patience as we work to resolve this issue.

identified

We are currently experiencing an issue in our FRA2 region due to a running migration. A virtual machine reconfiguration has resulted in duplication of stored data, which has filled our datastore. Our engineers are working to resolve the issue. Your patience is appreciated.

Report: "UAE1 indexation delays"

Last update
Resolved

We identified a component that needed some load balancing in order to provide better performance.This incident is now resolved.

Investigating

We are currently investigating this issue.

Report: "UAE1 indexation delays"

Last update
resolved

We identified a component that needed some load balancing in order to provide better performance. This incident is now resolved.

investigating

We are currently investigating this issue.

Report: "UAE1 events not available in alerts"

Last update
Resolved

We found the root cause and performed a fix. Events are now immediately available in new alerts.

Investigating

We detected that the process responsible for associating events to alerts is having congestion issues. We are actively looking into this issue.

Report: "UAE1 events not available in alerts"

Last update
resolved

We found the root cause and performed a fix. Events are now immediately available in new alerts.

investigating

We detected that the process responsible for associating events to alerts is having congestion issues. We are actively looking into this issue.

Report: "[MCO1] Event indexation stopped"

Last update
resolved

The fix was effective, the incident is now considered resolved.

monitoring

Event processing is back to normal. We're monitoring the platform.

identified

We've fixed the issue and event processing is restarting.

investigating

We've identified the potential root cause and are working on a fix.

investigating

We are facing an incident causing delays in indexing events on the platform. It affects event processing, alert raising, and event indexing in our storage cluster.

Report: "[MCO1] Event indexation stopped"

Last update
Resolved

The fix was effective, the incident is now considered resolved.

Monitoring

Event processing is back to normal.We're monitoring the platform.

Identified

We've fixed the issue and the event processing is restarting

Update

We've identified the potential root cause and working on a fix.

Investigating

We are facing an incident causing delay to indexing events on the platform.It concerns event processing and alert raising.And also the event indexing in our storage cluster.

Report: "[FRA1] Playbooks issue"

Last update
Resolved

Everything is back to normal and stable since 15:30 CEST

Identified

We have identified the root cause and applied some fixes.There is still some unusual load on playbook runs but the state is currently coming back to normal.

Investigating

We are facing some delay on playbooks execution due to an issue that happened this morning at 11:00 CEST.It might seem like playbooks execution are loading and waiting infinitely on the client side.We are catching up on the delay and the situation will come back to normal soon.We are still investigating on the root cause.

Report: "[FRA1] Playbooks issue"

Last update
resolved

Everything has been back to normal and stable since 15:30 CEST.

identified

We have identified the root cause and applied some fixes. There is still some unusual load on playbook runs, but the situation is currently returning to normal.

investigating

We are facing some delay in playbook execution due to an issue that happened this morning at 11:00 CEST. Playbook executions may appear to load and wait indefinitely on the client side. We are catching up on the delay and the situation will return to normal soon. We are still investigating the root cause.

Report: "[FRA2] TLS issue"

Last update
Resolved

Our team observed a peak in HTTP event ingestion once the issue was fixed.It indicates that events blocked during the incident were received after retry.Everything is now up and stable.

Monitoring

The root cause was a proxy parameter that was unintentionally reset due to an automated process.This behavior was not intended and we will implement safeguards to avoid it in the future.The issue has been fixed and everything is back.The long-lived sessions like syslog ingestion were likely not impacted, but HTTP ingestion can have encountered errors.Events processing and our internal services were not impacted.We'll keep monitoring and watching for any persistent impact.

Investigating

We are facing a TLS issue on the region.Our team is investigating on the root cause and determining the blast radius at the moment.

Report: "[FRA2] TLS issue"

Last update
resolved

Our team observed a peak in HTTP event ingestion once the issue was fixed. It indicates that events blocked during the incident were received after retry. Everything is now up and stable.

monitoring

The root cause was a proxy parameter that was unintentionally reset by an automated process. This behavior was not intended and we will implement safeguards to avoid it in the future. The issue has been fixed and everything is back to normal. Long-lived sessions such as syslog ingestion were likely not impacted, but HTTP ingestion may have encountered errors. Event processing and our internal services were not impacted. We'll keep monitoring and watching for any persistent impact.

investigating

We are facing a TLS issue in the region. Our team is investigating the root cause and determining the blast radius at the moment.
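
The resolved update above notes that HTTP events blocked during the incident were received once clients retried. A minimal sketch of such client-side retry behaviour, assuming a hypothetical sender; INTAKE_URL, the payload format and the backoff values are placeholders, not an official SEKOIA.IO client or endpoint:

import json
import ssl
import time
import urllib.error
import urllib.request

INTAKE_URL = "https://intake.example.invalid/batch"  # placeholder endpoint
MAX_ATTEMPTS = 8

def send_with_retry(event: dict) -> bool:
    """POST one event, retrying on TLS or HTTP errors with exponential backoff."""
    body = json.dumps(event).encode()
    for attempt in range(MAX_ATTEMPTS):
        request = urllib.request.Request(
            INTAKE_URL, data=body, headers={"Content-Type": "application/json"}
        )
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                if response.status < 300:
                    return True
        except (urllib.error.URLError, ssl.SSLError, TimeoutError):
            pass  # endpoint unreachable or handshake rejected: back off and retry
        time.sleep(min(2 ** attempt, 60))  # 1s, 2s, 4s, ... capped at 60s
    return False  # caller should spool the event instead of dropping it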

Report: "FRA1 event indexation delays"

Last update
resolved

This incident has been resolved.

monitoring

This issue is still ongoing.

identified

As we are experiencing a high volume of traffic on FRA1, you may experience some delay in event indexation. Alerts are raised in real time and other components are running nominally.

Report: "[FRA2] Events search degraded"

Last update
resolved

This incident has been resolved.

monitoring

We identified multiple search jobs which were hindering data cluster performance. We killed these requests and the situation was resolved.

identified

We identified an issue with a particularly heavy search query from a customer. We are revoking the search to resolve the situation.

investigating

We are having an issue with the search capabilities on FRA2. We are currently investigating the root cause of the issue, as short-term fixes are not sufficient.

Report: "FRA2 temporary outage"

Last update
resolved

We identified the root cause to be linked to a new flavor of nodes with smaller disks. The issue has been mitigated and a task to improve this flavor has been created.

identified

We identified an issue on FRA2 which resulted in a partial outage from 23:15 to 23:57 CEST. The situation is now stable while we are looking into the root cause of this issue.

Report: "FRA1 events temporarily unavailable"

Last update
resolved

Due to side-effects of an ongoing investigation, we have experienced a short outage on the events page from 10:36 to 10:49 CEST. This incident is now resolved and the investigation has been stopped. The situation is back to normal.

Report: "[FRA1] Event indexing delay"

Last update
resolved

The platform has been indexing in real time since 18:45 CEST.

monitoring

Our event indexing stopped at 16:55. We identified the reason and applied a fix, and indexing was back at 17:22. It will progressively catch up on the accumulated delay. During this time, events will show up with a delay on the events page. We will keep monitoring until it is back to real time.

Report: "[FRA1] temporary failure"

Last update
resolved

This incident is over; we still have a small delay in event processing, but it is resolving progressively.

identified

We found that the root cause of the incident is linked to a host incident on our cloud provider's side.

identified

We had a temporary failure of a cache service that caused authentication failures from 14:41 CET to 14:55 CET. This also caused some delay in alert raising, event processing and playbook starts. The service is back up and we are catching up on the delay now. Our team is currently investigating the root cause.

Report: "Delay in events analysis"

Last update
resolved

We caught up with the backlog of events and the traffic is now being processed in real-time.

monitoring

We had an issue with DNS resolution in a Kubernetes cluster. The issue is fixed, but we accumulated some delay in handling events. The lag is currently being processed.

Report: "Provider outage"

Last update
resolved

All delayed events were fully handled by 08:50 CEST.

monitoring

We're continuing to monitor the processing closely and are seeing steady progress in reducing the backlog. While it may still take some time to fully catch up, we’re doing everything we can to maintain stability and ensure no data is missed.

monitoring

The backlog of the queued events is still being processed at maximum capacity. Our team is dedicated to clearing this backlog as efficiently as possible, ensuring that all events are handled promptly.

monitoring

Cluster recovery is done with no data loss. We are still processing the backlog of queued events, at maximum capacity.

monitoring

We are making great progress on fixing the event storage cluster. All "cold" data is progressively getting available. For all events ingested after 03:30 CEST, we are processing the delayed events. We found a way to speed up the recovery process. The event storage cluster is steadily recovering.

monitoring

We are still making progress on the event storage cluster. All "cold" data is progressively getting available. For all events ingested after 03:30 CEST, we are processing the delayed events. The situation is progressively recovering; however, we had to slow down the process for the moment due to a very high number of parallel tasks causing a risk for the cluster. We are trying to find ways to improve the situation faster.

monitoring

We are still making progress on the event storage cluster. All "cold" data is progressively getting available. For all events ingested after 03:30 CEST, we are processing the delayed events. The situation is progressively recovering; however, we had to slow down the process for the moment due to a very high number of parallel tasks causing a risk for the cluster. We are trying to find ways to improve the situation faster.

monitoring

We are still making progress on the event storage cluster. All "cold" data is progressively getting available. For all events ingested after 03:30 CEST, we are processing the delayed events. The situation is progressively recovering.

monitoring

We are still making progress on the event storage cluster. All "cold" data is progressively getting available. For all events ingested after 03:30 CEST, we are currently fixing the situation.

monitoring

We are still making progress on the event storage cluster. All "cold" data is progressively getting available. For all events ingested after 03:30 CEST, we are currently fixing the situation.

identified

We are still working on the event storage cluster. We are also making progress with alerts without events. Events linked to alerts are progressively available in the event storage cluster.

identified

We are still working on the event storage cluster. So far, event searches are working, but the oldest events are not available for search. We are fixing this progressively, meaning more and more older events will become available. On the other hand, all events ingested since 03:30 CEST this morning are not yet available in the event storage cluster.

identified

We are still working to stabilize the event storage cluster. So far, some event queries and searches are working; however, not all data is available at the moment.

identified

We are still preparing a fix to roll out on our whole event storage cluster. In the meantime, we fixed the automation cluster.

identified

Most services are up. There are still some issues with our event storage cluster, making events and event search unavailable. All events are still being received and properly processed. Automation (playbooks) is also having issues. We are working quickly to fix these situations.

investigating

We had an outage on our main provider, and the network went down. We are currently recovering access to the platform and fixing the different issues.

Report: "[FRA1] events indexing delay"

Last update
resolved

This incident has been resolved.

monitoring

We have identified and fixed the issue, and we are now indexing at a normal rate again. The delay will now slowly decrease; we will keep you updated once we are back to real time.

investigating

Hello, we are currently facing a problem with indexing events, which causes a delay before events are available on the events page. Events and alerts are still processed in real time, and there is no data loss. We are still investigating the root cause and will keep you updated.

Report: "FRA1 hardware network issues"

Last update
resolved

After swapping network cards on the faulty router, we decided to completely replace it with a similar machine. We are no longer seeing any of the initial issues.

investigating

Our provider is having hardware issues on a pair of servers that are the main network routers of FRA1. While we are investigating, you may see some sporadic timeouts and 50x errors (less than 0.1%), which will succeed after a retry. Event ingestion is also experiencing some delay, due to the nature of the underlying issue.

Report: "[FRA1] - Delay on event processing"

Last update
resolved

Event ingestion is back to real time.

monitoring

A fix has been implemented and we are gradually catching up on the delay. We will close this incident once event ingestion is back to real-time.

identified

Hello, an unexpected behavior during a deployment is causing a slight delay in event processing. We identified the root cause and are currently working on a solution.

Report: "[MCO1] Events indexation delay"

Last update
resolved

Indexing has been back in real time since 13:13 CET and everything is stable.

monitoring

A fix has been implemented and the platform is catching up on the delay. We will come back to you when we are back in real time.

identified

Good morning. You may be experiencing delays in event indexing due to exceptionally high traffic this morning. This affects the time before events are visible on the events page. Our team is currently working towards a solution. The delay is currently about 15 to 20 minutes.

Report: "Event ingestion issue over HTTP"

Last update
resolved

The root cause was an incident on our provider's side. We will communicate a postmortem as soon as our provider's investigation is finished.

monitoring

The ingestion issue is currently resolved. We are still investigating the root cause.

investigating

We are currently having an issue with event ingestion for HTTP requests. We are investigating the issue.

Report: "FRA1 Web application and API issue"

Last update
resolved

A cloud provider issue impacted access to the web application and APIs of the FRA1 region between 17:21 and 17:28 CET. This incident did not affect the reception or processing of events. We are currently reaching out to our cloud providers to determine the root cause, as this does not originate from an issue in our scope. Update: this has been traced back to an issue with a cloud load balancer hosted by Scaleway. A router misconfiguration resulted in all public traffic being black-holed for the duration of this incident. More information is available on their status page here: https://status.scaleway.com/incidents/162zw5zd9x8r

Report: "OVH Object Storage unavailable"

Last update
resolved

Everything has been back to normal since 22:10 CET

identified

OVH is currently experiencing worldwide issues with their Object Storage offering. We are in contact with their support. This is affecting some parts of our application, such as notebooks and the update of anomaly detection rules models. We will keep you posted once we have more info.

Report: "[FRA1] 500 errors on some APIs"

Last update
resolved

This incident has been resolved.

monitoring

We identified the underlying issue and applied a corrective action. Error rates are now going down.

investigating

We are detecting an abnormal number of 50x HTTP errors on some API endpoints.

Report: "[FRA2] vmWare hosts update"

Last update
resolved

This incident has been resolved. We will be investigating with OVH to understand what went wrong in their automated VM management process.

monitoring

We are now processing events nominally.

monitoring

After investigating further, it seems that this incident was not directly caused by our routine operation, but by an automatic upgrade process carried out by OVH, our provider. We will be in touch with them to understand the root cause of this incident.

monitoring

All services are now running correctly. Some databases are still initializing, so events are currently being buffered before resuming their normal processing. The UI and API are working as expected.

identified

A routine reboot of a VMware host resulted in some of our VMs going down unexpectedly. We are currently stabilizing the situation.

Report: "[FRA1] - Events ingestion down"

Last update
resolved

This incident was resolved. Event ingestion is now in real time.

monitoring

A fix has been implemented. The situation is under control. There is still some delay in event processing. It should recover soon.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We identified an issue in our ingestion process. Ingestion is currently down. We are working on a fix.

Report: "[FRA2] Maintenance exceeding time slot."

Last update
resolved

This incident has been resolved. A post-mortem will come in the following days while our engineers gather all necessary data.

monitoring

We are done with the rollback of this cluster upgrade, and the region is now up. We are monitoring the overall situation before closing this incident.

identified

We have restored the backup and we are starting to bring the platform back up.

identified

As part of our recovery procedure, we are currently stopping the whole region to restore the backup.

identified

Our team tried multiple operations to fix the issue we encountered, without success. We decided to roll back the upgrade and restore the previous cluster state from a backup.

identified

The maintenance has exceeded its time slot and is not yet completed. A message saying that the maintenance slot was completed automatically closed the status page when it should not have. We are still experiencing network errors and the whole team is working towards a solution. We will keep you updated.

Report: "[FRA1] - Playbooks errors"

Last update
resolved

Our team experienced an issue that impacted playbooks from 23:42 to 00:20 CET. Playbooks may have reported errors and/or could have been stopped. The issue has been fixed on our side, but we encourage you to check your playbooks. We are sorry for the inconvenience.

Report: "Playbook runs in error"

Last update
resolved

All previously missed playbook runs were replayed and the underlying issue has been fixed. This incident is now resolved. Thanks for your patience and understanding.

monitoring

Playbooks are currently running as expected. Our team is investigating whether we can retry the previously missed runs.

identified

Due to a recent deployment, playbooks are not starting on some regions. We identified the issue and we are currently rolling out a fix.

Report: "[FRA1] Alerts raising lag"

Last update
resolved

We are back to raising alerts on real-time. Thank you for your patience.

identified

Hello, we have an issue causing delay in alert raising. Our team has identified the cause and is currently applying a fix.

Report: "[FRA1] Alerts raised without events"

Last update
resolved

All alerts have been correctly processed since 15:10 CET. The fix has been applied and past alerts lacking events have also been fixed. Thank you for your patience.

identified

Our team is currently applying the fix on production.

identified

The issue has been identified. Our teams are working on a fix to prevent this issue in the future and ensure that events are correctly added to already raised alerts. We'll come back to you once the fix has been applied.

investigating

Hello, we are aware of an issue causing alerts to be raised without their associated events being available. Our engineering and infrastructure teams are currently investigating the root cause.

Report: "MCO1 lag on events processing"

Last update
resolved

Event processing and alert raising have been back to real time since 21:30 CET. Event storage has been back to real time since 23:00. This incident is now over.

monitoring

The whole platform has been up again since 19:05 CET. We are processing the backlog of events and raising alerts accordingly. The estimated time to return to real-time processing is just above 2 hours. Event storage is catching up a little more slowly, so events will not show instantly after being processed. We are still monitoring the behavior and will let you know when everything is back to real time. Thank you for your patience.

monitoring

Hello, during today's region upgrade, the upgrade of a critical service for event processing is taking an abnormally long time due to a restart failure, which restarted its upgrade procedure from the start. The impact is a substantial delay in event processing (more than an hour). A fix was immediately implemented to prevent the same failure, but the service still takes a long time to restart entirely. We are monitoring it closely and will keep you updated.

Report: "[UAE1] Platform instability"

Last update
resolved

This incident has been resolved. All events are processed in real time and alerts are raised in real time.

monitoring

All nodes have been restarted. The platform is up and running. We have some delay in event processing and detection. This delay should be absorbed within the next hour.

identified

The platform is fully functioning for now, some nodes are still being restarted, in a controlled manner.

identified

We identified the issue and we are rolling out a fix. The platform is currently usable, but the rollout of our fix may create some sporadic issues in the next few minutes.

investigating

We are currently seeing some instability in the UAE1 region; our team is investigating.

Report: "Issue with alerts not being raised"

Last update
resolved

This incident has been handled. A post-mortem was produced and communicated to customers.

monitoring

Event replay was started at 17:55 CET, you should now see alerts being raised for the period of the incident (11:21 CET until 13:01 CET). We estimate this event replay to finish around 05:00 CET tomorrow.

monitoring

We are currently mobilizing resources to perform a replay of events received during the duration of the incident. Our goal is to ensure alerts were eventually correctly raised.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We have had an issue since 11:20 which impacted the alert raising process. We identified the issue and are currently deploying a fix.

Report: "UAE1 - delay on events processing."

Last update
resolved

Everything has been back to real time since 15:40 CET.

monitoring

The traffic has come back to normal, and there is no more delay in event processing. However, there is still some delay in alert raising, which is currently resolving. We will keep monitoring until it is back to real time.

identified

Hello, our UAE1 platform has been facing exceptionally high traffic since 12:00 CET. This causes delays in event processing and alert raising.

Report: "FRA1 - events search delay"

Last update
resolved

We have found the root cause of the incident and it is now resolved. Our team is now working on preventing this problem in the future. Thanks for your patience.

identified

The peak load has passed and search jobs have been working in real time since 10:40. We are still investigating the root cause of the load.

investigating

Hello, we are aware of slowness in event searches since 09:55 CET. Our team is investigating the issue.

Report: "[MCO1] General performance issue"

Last update
resolved

Cloud provider backups ended around 6am and IOPS performance was restored to its baseline.

identified

Our cloud provider for the MCO1 region is currently performing block devices backups, which results in a global slowdown of the storage layer of our deployment. While the backups are ongoing, events are being processed with a significant delay and some API queries might fail.

Report: "Events analysis delay"

Last update
resolved

This incident has been resolved.

monitoring

We are currently handling the delayed events.

identified

We are encountering issues with the update of one of our services. The issue has been identified and the situation should resolve soon. We currently have some delay in analyzing events and raising alerts. As a consequence, the delayed events are also not available for search.

Report: "[FRA1] Search jobs temporary unavailability"

Last update
resolved

This incident has been resolved.

monitoring

The cache has been resized successfully and no errors have been seen since 13:47. We are still monitoring the situation and investigating the root cause, but the service is up.

identified

We identified an issue with an internal cache cluster used for search jobs on the events page. While we are resizing that cache, some search jobs may fail. Our team is currently performing the resize operation; the situation should stabilize soon.

Report: "[FRA1] Temporary outage"

Last update
resolved

A critical internal service was unreachable for 7 minutes, between 11:52 and 11:59, on our platform. It is central to many other services, which therefore also became unreachable. This has been fixed; the platform is reachable again and everything is working as intended.

Report: "[MCO1] delay on event processing and indexing."

Last update
resolved

The platform has been running in real time since 00:34 UTC+1.

monitoring

A fix has been implemented and the delay is now decreasing slowly. We will keep monitoring this incident and will close this status page once we are back in real-time.

identified

We are aware of an incident causing delays to accumulate on the platform. It affects event processing and alert raising, which are currently happening about 10 minutes after an event is received, as well as event indexing in our storage cluster, which is currently happening around 30 minutes after an event is processed. We have identified the cause and are working towards its resolution.

Report: "[FRA1] Playbook runs incident"

Last update
resolved

The playbooks environment is stable and steady. This incident is now resolved.

monitoring

The playbook environment is back up and we are processing tasks as usual. We are back to real time; however, the environment is handling a lot of load at the moment. We are monitoring closely until everything is stable and back to normal.

identified

We implemented a fix to the network issue. The cluster is coming back online on our side. We are currently stabilizing the cluster after the fix, and validating that everything is working.

investigating

We detected an incident concerning our playbook runs that impacts DNS resolution and runs processing.

Report: "MCO1 Indexation performance issues"

Last update
resolved

The delay has been fully absorbed since 21:50 UTC; the incident is now completely resolved.

monitoring

Some fixes were implemented to increase performance. We are now able to catch up on the delay, slowly but steadily. We will close this status page once event storage is back to real time.

investigating

The incident is still ongoing. We are actively continuing to search for its root cause. Our team has added some resources to the storage cluster as a temporary workaround. At this stage, performance is still below our expectations, and we continue to experience delays. We estimate the delay to be around 1 hour and 45 minutes between the processing of an event and its entry into our storage cluster.

investigating

We are currently having issues indexing events in our storage cluster. This is generating delay before the events are available in the events and alerts pages. The detection is not affected.

Report: "[FRA2] events processing delay"

Last update
resolved

Events have been processed in real time since 17:39 CET, and everything is stable.

monitoring

During a service update in the region, we encountered an issue with our event processing. It had been stopped for around an hour, since 15:33. We have fixed the issue and processing has resumed; we are catching up on the delay. We expect around one hour before coming back to real-time processing.

Report: "[MCO1] events processing lag"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are now catching up on the lag. We will keep monitoring closely and keep this status page open until we have no more delay on events processing.

investigating

We are investigating an issue with our event processing pipeline, which has been performing poorly since 09:27 CET. Event processing is lagging, which impacts alert raising.

Report: "FRA1 Events processing stopped"

Last update
resolved

The platform has been consuming events in real time since 19:09 CET. This incident has been resolved.

monitoring

A good part of the backlog has already been processed. The platform's detection is estimated to be back in real-time at 19:05 CET. We will keep monitoring closely until that time.

monitoring

Event processing is stable and we are slowly catching up on the lag. We expect to be back to real time in a couple of hours due to the volume of the event backlog. We will keep you updated on this.

identified

We were able to mitigate the issue; our event processing pipeline is back up. The platform is now catching up on the lag, and we are monitoring this closely.

identified

We lost several servers within a few minutes due to a network issue on our cloud provider's side. Our event ingestion pipeline is unaffected and we are still receiving every event. However, our event processing pipeline has been stopped since 15:23 CET, and we are accumulating lag in event processing and alert raising. We are currently reaching out to our cloud provider's support to get more information in order to resolve this incident as fast as possible.

Report: "[UAE1] Syslog SSL issue"

Last update
resolved

From 12:45 CET to 18:42 CET, there was a conflict with our ingress configurations that caused syslog SSL connections to be rejected. This impacted rsyslog reception. As a result, events sent during this period may have been rejected and could potentially be lost if they were not buffered on your side. We sincerely apologize for any inconvenience this may have caused. This is a serious matter, and we are committed to implementing enhanced monitoring and safeguards to ensure this issue is identified more quickly and prevented from recurring in the future. The issue has been addressed, and we have restored the ability to send events via rsyslog using our event-amplifier.
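
The update above notes that events sent over syslog SSL during the incident could be lost if they were not buffered on the sender's side. A minimal sketch of such client-side buffering, assuming a hypothetical TLS forwarder; the host, port and spool path are illustrative placeholders, not SEKOIA.IO parameters:

import socket
import ssl
from pathlib import Path

INTAKE_HOST = "intake.example.invalid"  # placeholder hostname
INTAKE_PORT = 6514                      # common port for syslog over TLS
SPOOL = Path("/var/spool/forwarder/pending.log")  # placeholder spool file

def _deliver(line: str) -> None:
    """Send one syslog line over TLS; raises OSError on connection or handshake failure."""
    context = ssl.create_default_context()
    with socket.create_connection((INTAKE_HOST, INTAKE_PORT), timeout=10) as raw:
        with context.wrap_socket(raw, server_hostname=INTAKE_HOST) as tls:
            tls.sendall(line.encode() + b"\n")

def forward(line: str) -> None:
    """Try to deliver; on failure, keep the event on disk instead of dropping it."""
    try:
        _deliver(line)
    except OSError:  # ssl.SSLError is a subclass of OSError
        SPOOL.parent.mkdir(parents=True, exist_ok=True)
        with SPOOL.open("a") as spool:
            spool.write(line + "\n")

def flush_spool() -> None:
    """Replay spooled events once the endpoint accepts connections again."""
    if not SPOOL.exists():
        return
    pending = SPOOL.read_text().splitlines()
    SPOOL.write_text("")  # events that still fail are re-spooled by forward()
    for line in pending:
        forward(line)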

Report: "[UAE1] Platform down"

Last update
resolved

This incident has been resolved.

monitoring

The platform has been up again since 12:49 CET. We have not lost any events during the downtime. Alerts are being processed. We will keep monitoring closely for some time to ensure everything is stable.

identified

The UAE1 region has been experiencing downtime since 11:20 CET due to an issue during a maintenance update. The problem has been identified; we are actively rolling back the changes and expect the issue to be resolved within the next half hour.

Report: "[FRA1] internal network issues"

Last update
resolved

We have fully caught up on the alert raising lag, and everything has been running stably since then.

monitoring

We noticed some impact on tag enrichment too, which caused a lot of alerts to be raised. Everything is stable now; we are catching up on the alert raising lag. ETA: ~1 hour.

monitoring

A fix was deployed at 12:58 CET. Ingestion and event storage have not been impacted, and we have not lost any events. However, alert raising tasks are delayed. We are gradually catching up on the lag and will give you an ETA soon.

identified

We have been aware of an ongoing incident on our platform since 12:15 CET, related to internal load balancers. This is impacting our whole platform. Our team is currently implementing a fix. We'll keep you updated.

Report: "Event ingestion delays"

Last update
resolved

All backlog has been processed, this incident is now over.

monitoring

We managed to identify the issue and process the backlog of pending tasks on the cluster responsible for event ingestion. We are now catching up on the backlog of enqueued events.

investigating

Investigation is still ongoing.

investigating

We are currently experiencing performance issues with event ingestion. As a result, events may show up late on the events page. Our team is looking into this issue.

Report: "Temporary Disruption in "Alert Created" Playbook Triggers"

Last update
resolved

On 17/10, at 16:54 CEST, a deployment introduced a bug into production which led to the "alert created" playbook triggers not being activated. All other triggers and playbooks continued to operate without any issues. Our team detected the issue and has already rolled back the affected deployment as of 10:09 today. We are actively working on replaying the missed triggers and are developing a permanent fix to prevent similar incidents in the future. We apologize for any inconvenience caused and appreciate your patience while we resolve this matter. Thank you for your understanding.

Report: "FRA1 detection is down"

Last update
resolved

This incident has been resolved. All alerts are being processed in real time.

monitoring

We are pleased to inform you that the fix has been successfully deployed. No alerts were lost during this incident. However please note that some alerts may experience some temporary delay. Our team is closely monitoring the situation to ensure everything returns to normal promptly. Thank you for your patience and support.

identified

We have identified an issue with our detection engine and have temporarily paused it to prevent any false alerts. Rest assured, our team is actively working on a solution, which we expect to deploy shortly. Thank you for your patience and understanding.