Historical record of incidents for SEKOIA.IO
Report: "FRA1 event storage cluster performance issues impacting search jobs"
Last update: We are currently experiencing performance issues with our event storage cluster in the FRA1 region. This is impacting search jobs, resulting in slower event searches. Our engineers are investigating this issue and we are working to restore normal functionality as soon as possible.
Report: "[MCO1] HTTP intake issues"
Last update: A sudden and significant increase in received HTTP events around 04:12 CEST this morning caused our HTTP events receiver instances to crash and restart in a loop, as they were filling their local queues faster than they could push to our internal message bus. The problem was almost entirely silent because the restarts are fast, and once restarted, instances operate normally until their queues fill again. At 08:23 CEST we applied a fix by drastically increasing the number of service instances, allowing them to push the higher event volume to the message bus faster than their queues fill. As a result of this incident, clients using HTTP intake lost a significant percentage of the events sent within this incident's timeframe. Our teams will implement early-warning alerts and improved auto-scaling to detect and mitigate similar issues sooner in the future. We apologize for the inconvenience.
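To make the failure mode above concrete, here is a minimal illustrative sketch (not SEKOIA.IO's actual receiver code; the rates and queue size are assumed values). It models receivers that buffer incoming events in a bounded local queue and forward them to a message bus: when the arrival rate exceeds the total forwarding rate, the queue overflows and events are dropped, while adding instances raises the drain rate above the arrival rate.

```python
# Minimal illustrative sketch only -- not SEKOIA.IO code. All numbers are
# assumptions chosen to show the failure mode: a bounded local queue overflows
# when events arrive faster than instances can forward them to the message bus.
QUEUE_CAPACITY = 10_000          # hypothetical total local queue capacity (events)
INGEST_RATE = 12_000             # hypothetical HTTP events received per second
PUSH_RATE_PER_INSTANCE = 3_000   # hypothetical events one instance pushes per second

def dropped_events(instances: int, seconds: int = 60) -> int:
    """Simulate one minute of traffic and return how many events are lost."""
    backlog, dropped = 0, 0
    drain_per_second = instances * PUSH_RATE_PER_INSTANCE
    for _ in range(seconds):
        backlog += INGEST_RATE                      # events arriving this second
        backlog -= min(drain_per_second, backlog)   # events pushed to the bus
        if backlog > QUEUE_CAPACITY:                # queue full: excess is lost
            dropped += backlog - QUEUE_CAPACITY
            backlog = QUEUE_CAPACITY
    return dropped

print(dropped_events(instances=3))   # drain < ingest -> sustained event loss
print(dropped_events(instances=5))   # drain > ingest -> no loss
```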
Report: "FRA2 region is out of service"
Last update: The region is currently down. We are investigating some issues on the cluster.
Report: "FRA2 region is unreachable"
Last updateOur team is in contact with our external storage support, which is in the process of adding two more hosts to our storage cluster to resolve this issue. Provisioning of these hosts is expected to take between 15 to 30 minutes. At present, some servers are coming back online, but full functionality will only be restored when the majority of servers are operational. We appreciate your patience as we work to resolve this issue.
We are currently experiencing an issue in our FRA2 region due to a running migration. A virtual machine reconfiguration has resulted in duplication of stored data, which has filled our datastore. Our engineers are working to resolve the issue. Your patience is appreciated.
Report: "FRA2 region is unreachable"
Last update: This incident has been resolved.
The previous issue impacting the FRA2 region has been fully resolved. The operational workflow is now running smoothly, with no errors detected on the APIs. We are actively working through the backlog and expect to be completely caught up shortly.
The issue with the overloaded datastore in the FRA2 region is being addressed. Additional hosts are being added to the storage cluster, with some servers already coming back online. Full restoration is expected once the majority of servers are operational. Thank you for your patience.
Our team is in contact with our external storage support, which is in the process of adding two more hosts to our storage cluster to resolve this issue. Provisioning of these hosts is expected to take between 15 and 30 minutes. At present, some servers are coming back online, but full functionality will only be restored when the majority of servers are operational. We appreciate your patience as we work to resolve this issue.
We are currently experiencing an issue in our FRA2 region due to a running migration. A virtual machine reconfiguration has resulted in duplication of stored data, which has filled our datastore. Our engineers are working to resolve the issue. Your patience is appreciated.
Report: "UAE1 indexation delays"
Last updateWe identified a component that needed some load balancing in order to provide better performance.This incident is now resolved.
We are currently investigating this issue.
Report: "UAE1 indexation delays"
Last update: We identified a component that needed some load balancing in order to provide better performance. This incident is now resolved.
We are currently investigating this issue.
Report: "UAE1 events not available in alerts"
Last update: We found the root cause and applied a fix. Events are now immediately available in new alerts.
We detected that the process responsible for associating events with alerts is experiencing congestion. We are actively looking into this issue.
Report: "UAE1 events not available in alerts"
Last updateWe found the root cause and performed a fix. Events are now immediately available in new alerts.
We detected that the process responsible for associating events to alerts is having congestion issues. We are actively looking into this issue.
Report: "[MCO1] Event indexation stopped"
Last update: The fix was effective; the incident is now considered resolved.
Event processing is back to normal. We're monitoring the platform.
We've fixed the issue and event processing is restarting.
We've identified the potential root cause and are working on a fix.
We are facing an incident causing delays in event indexing on the platform. It affects event processing, alert raising, and event indexing in our storage cluster.
Report: "[MCO1] Event indexation stopped"
Last updateThe fix was effective, the incident is now considered resolved.
Event processing is back to normal.We're monitoring the platform.
We've fixed the issue and the event processing is restarting
We've identified the potential root cause and working on a fix.
We are facing an incident causing delay to indexing events on the platform.It concerns event processing and alert raising.And also the event indexing in our storage cluster.
Report: "[FRA1] Playbooks issue"
Last updateEverything is back to normal and stable since 15:30 CEST
We have identified the root cause and applied some fixes.There is still some unusual load on playbook runs but the state is currently coming back to normal.
We are facing some delay on playbooks execution due to an issue that happened this morning at 11:00 CEST.It might seem like playbooks execution are loading and waiting infinitely on the client side.We are catching up on the delay and the situation will come back to normal soon.We are still investigating on the root cause.
Report: "[FRA1] Playbooks issue"
Last update: Everything has been back to normal and stable since 15:30 CEST.
We have identified the root cause and applied some fixes. There is still some unusual load on playbook runs, but the situation is returning to normal.
We are facing delays in playbook execution due to an issue that occurred this morning at 11:00 CEST. On the client side, playbook executions may appear to load and wait indefinitely. We are catching up on the delay and the situation will return to normal soon. We are still investigating the root cause.
Report: "[FRA2] TLS issue"
Last updateOur team observed a peak in HTTP event ingestion once the issue was fixed.It indicates that events blocked during the incident were received after retry.Everything is now up and stable.
The root cause was a proxy parameter that was unintentionally reset due to an automated process.This behavior was not intended and we will implement safeguards to avoid it in the future.The issue has been fixed and everything is back.The long-lived sessions like syslog ingestion were likely not impacted, but HTTP ingestion can have encountered errors.Events processing and our internal services were not impacted.We'll keep monitoring and watching for any persistent impact.
We are facing a TLS issue on the region.Our team is investigating on the root cause and determining the blast radius at the moment.
Report: "[FRA2] TLS issue"
Last update: Our team observed a peak in HTTP event ingestion once the issue was fixed, which indicates that events blocked during the incident were received after retries. Everything is now up and stable.
The root cause was a proxy parameter that was unintentionally reset by an automated process. This behavior was not intended, and we will implement safeguards to avoid it in the future. The issue has been fixed and everything is back to normal. Long-lived sessions such as syslog ingestion were likely not impacted, but HTTP ingestion may have encountered errors. Event processing and our internal services were not impacted. We will keep monitoring for any persistent impact.
We are facing a TLS issue in the region. Our team is investigating the root cause and determining the blast radius.
Report: "FRA1 event indexation delays"
Last update: This incident has been resolved.
This issue is still ongoing.
Due to a high volume of traffic on FRA1, you may experience some delay in event indexation. Alerts are raised in real time and other components are running nominally.
Report: "[FRA2] Events search degraded"
Last update: This incident has been resolved.
We identified multiple search jobs that were hindering the data cluster's performance. We killed these requests and the situation was resolved.
We identified a particularly heavy search query from a customer. We are revoking the search to resolve the situation.
We are having an issue with search capabilities on FRA2. We are currently investigating the root cause, as short-term fixes have not been sufficient.
Report: "FRA2 temporary outage"
Last update: We identified the root cause to be linked to a new flavor of nodes with smaller disks. The issue has been mitigated and a task to improve this flavor has been created.
We identified an issue on FRA2 which resulted in a partial outage from 23:15 to 23:57 CEST. The situation is now stable while we are looking into the root cause of this issue.
Report: "FRA1 events temporarily unavailable"
Last update: Due to side-effects of an ongoing investigation, we experienced a short outage on the events page from 10:36 to 10:49 CEST. This incident is now resolved and the investigation has been stopped. The situation is back to normal.
Report: "[FRA1] Event indexing delay"
Last update: The platform has been indexing in real time since 18:45 CEST.
Our event indexing stopped at 16:55. We identified the reason and applied a fix; indexing resumed at 17:22. It will progressively catch up on the accumulated delay. During this time, events will appear with a delay on the events page. We will keep monitoring until indexing is back to real time.
Report: "[FRA1] temporary failure"
Last update: This incident is over. There is still a small delay in event processing, but it is resolving progressively.
We found that the root cause of the incident is linked to a host incident on our cloud provider's side.
We had a temporary failure on a cache service that caused authentication failures from 14:41 CET to 14:55 CET. This also caused some delay in alert raising, event processing, and playbook starts. The service is back up and we are catching up on the delay. Our team is currently investigating the root cause.
Report: "Delay in events analysis"
Last update: We caught up with the backlog of events and the traffic is now being processed in real-time.
We had an issue with DNS resolution in a Kubernetes cluster. The issue is fixed, but we accumulated some delay in handling events. The lag is currently being processed.
Report: "Provider outage"
Last update: All delayed events were fully handled by 08:50 CEST.
We're continuing to monitor the processing closely and are seeing steady progress in reducing the backlog. While it may still take some time to fully catch up, we’re doing everything we can to maintain stability and ensure no data is missed.
The backlog of the queued events is still being processed at maximum capacity. Our team is dedicated to clearing this backlog as efficiently as possible, ensuring that all events are handled promptly.
Cluster recovery is done with no data loss. We are still processing the backlog of queued events, at maximum capacity.
We are making great progress on fixing the event storage cluster. All "cold" data is progressively becoming available. For all events ingested after 03:30 CEST, we are processing the delayed events. We found a way to speed up the recovery process. The event storage cluster is steadily recovering.
We are still making progress on the event storage cluster. All "cold" data is progressively becoming available. For all events ingested after 03:30 CEST, we are processing the delayed events. The situation is progressively recovering; however, we had to slow down the process for the moment due to a very high number of parallel tasks posing a risk to the cluster. We are trying to find ways to improve the situation faster.
We are still making progress on the event storage cluster. All "cold" data is progressively becoming available. For all events ingested after 03:30 CEST, we are processing the delayed events. The situation is progressively recovering.
We are still making progress on the event storage cluster. All "cold" data is progressively becoming available. For all events ingested after 03:30 CEST, we are currently fixing the situation.
We are still working on the event storage cluster. We are also making progress on alerts that are missing their events: events linked to alerts are progressively becoming available in the event storage cluster.
We are still working on the event storage cluster. So far, event searches are working, but the oldest events are not available for search. We are fixing this progressively, meaning more and more older events will become available over time. In addition, all events ingested since 03:30 CEST this morning are not available in the event storage cluster.
We are still working to stabilize the event storage cluster. So far, some event queries and searches are working; however, not all data is available for the moment.
We are still preparing a fix to roll out on our whole event storage cluster. In the meantime, we fixed the automation cluster.
Most services are up. There are still some issues with our event storage cluster, making events and event search unavailable. All events are still being received and properly processed. In addition, automation (playbooks) is also having issues. We are working quickly to fix these situations.
We had an outage on our main provider, and the network went down. We are currently recovering access to the platform and fixing the various issues.
Report: "[FRA1] events indexing delay"
Last update: This incident has been resolved.
We have identified and fixed the issue, and we are now indexing at a normal rate again. The delay will now slowly decrease; we will keep you updated once we are back to real-time.
Hello, we are currently facing a problem with event indexing, which causes a delay before events are available on the events page. Events and alerts are still processed in real-time, and there is no data loss. We are still investigating the root cause and will keep you updated.
Report: "FRA1 hardware network issues"
Last update: After swapping network cards on the faulty router, we decided to completely replace the router with another similar machine. We are not seeing any of the initial issues as of now.
Our provider is having hardware issues on a pair of servers that are the main network routers of FRA1. While we are investigating, you may see some sporadic timeouts and 50x errors (affecting less than 0.1% of requests); these requests will succeed after a retry. Event ingestion is also experiencing some delay, due to the nature of the underlying issue.
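Since the errors described above are transient and requests succeed on retry, intake clients can ride out this kind of incident with a small retry loop. The sketch below is illustrative only: the endpoint URL is a placeholder rather than SEKOIA.IO's documented intake API, and it assumes the Python `requests` library.

```python
# Illustrative client-side retry sketch only. The endpoint URL is a placeholder,
# not SEKOIA.IO's documented intake API; the `requests` library is assumed.
import time
import requests

INTAKE_URL = "https://intake.example.com/events"   # placeholder endpoint
MAX_ATTEMPTS = 5

def send_event(event: dict) -> None:
    """Deliver one event, retrying transient 50x errors and timeouts with backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            resp = requests.post(INTAKE_URL, json=event, timeout=10)
            if resp.status_code < 500:
                resp.raise_for_status()   # 4xx is not retryable: raise to the caller
                return                    # 2xx: delivered
            # 50x: fall through and retry
        except requests.exceptions.HTTPError:
            raise                         # re-raise the 4xx error untouched
        except requests.exceptions.RequestException:
            pass                          # connection error or timeout: retry
        if attempt == MAX_ATTEMPTS:
            raise RuntimeError("event delivery failed after retries")
        time.sleep(2 ** attempt)          # exponential backoff: 2s, 4s, 8s, ...
```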
Report: "[FRA1] - Delay on event processing"
Last update: Event ingestion is back to real-time.
A fix has been implemented and we are gradually catching up on the delay. We will close this incident once event ingestion is back to real-time.
Hello, an unexpected behavior during a deployment is causing a slight delay in event processing. We identified the root cause and are currently working on a solution.
Report: "[MCO1] Events indexation delay"
Last update: Indexing has been back to real time since 13:13 CET and everything is stable.
A fix has been implemented and the platform is catching up on the delay. We will come back to you when we are back to real-time.
Good morning. You may be experiencing delays in event indexing due to exceptionally high traffic this morning. This affects the time before events are visible on the events page. Our team is currently working towards a solution. The delay is currently about 15 to 20 minutes.
Report: "Event ingestion issue over HTTP"
Last update: The root cause was an incident on our provider's side. We will communicate a postmortem as soon as our provider's investigation is finished.
The ingestion issue is currently resolved. We are still investigating the root cause.
We are currently having an issue with event ingestion over HTTP. We are investigating the issue.
Report: "FRA1 Web application and API issue"
Last update: A cloud provider issue impacted access to the web application and APIs of the FRA1 region between 17:21 and 17:28 CET. This incident did not affect the reception or processing of events. We are currently reaching out to our cloud providers to determine the root cause, as this does not originate from an issue in our scope. Update: this has been traced back to an issue with a cloud load balancer hosted by Scaleway. A router misconfiguration resulted in all public traffic being black-holed for the duration of this incident. More information is available on their status page here: https://status.scaleway.com/incidents/162zw5zd9x8r
Report: "OVH Object Storage unavailable"
Last update: Everything has been back to normal since 22:10 CET.
OVH is currently experiencing worldwide issues with their Object Storage offering. We are in contact with their support. This is affecting some parts of our application, such as notebooks and the update of anomaly detection rules models. We will keep you posted once we have more info.
Report: "[FRA1] 500 errors on some APIs"
Last update: This incident has been resolved.
We identified the underlying issue and applied a corrective fix. Error rates are now going down.
We are detecting an abnormal number of 50x HTTP errors on some API endpoints.
Report: "[FRA2] vmWare hosts update"
Last update: This incident has been resolved. We will be investigating with OVH to understand what went wrong in their automated VM management process.
We are now processing events nominally.
After investigating further, it seems that this incident was not directly caused by our routine operation, but by an automatic upgrade process carried out by OVH, our provider. We will be in touch with them to understand the root cause of this incident.
All services are now running correctly. Some databases are still initializing, so events are currently being buffered before resuming their normal processing. The UI and API are working as expected.
A routine reboot of a VMware host resulted in some of our VMs going down unexpectedly. We are currently stabilizing the situation.
Report: "[FRA1] - Events ingestion down"
Last update: This incident was resolved. Event ingestion is now in real time.
A fix has been implemented and the situation is under control. There is still some delay in event processing; it should recover soon.
A fix has been implemented and we are monitoring the results.
We identified an issue in our ingestion process. Ingestion is currently down. We are working on a fix.
Report: "[FRA2] Maintenance exceeding time slot."
Last update: This incident has been resolved. A post-mortem will come in the following days while our engineers gather all necessary data.
We are done with the rollback of this cluster upgrade, and the region is now up. We are monitoring the overall situation before closing this incident.
We have restored the backup and we are starting to bring the platform back up.
As part of our recovery procedure, we are currently stopping the whole region to restore the backup.
Our team tried multiple operations to fix the issue we encountered, without success. We decided to roll back the upgrade and restore the previous cluster state from a backup.
The maintenance time slot has been exceeded, but the maintenance is not complete. A message stating that the maintenance slot was complete automatically closed the status page when it should not have. We are still experiencing network errors and the whole team is working towards a solution. We will keep you updated.
Report: "[FRA1] - Playbooks errors"
Last update: Our team experienced an issue that impacted playbooks from 23:42 to 00:20 CET. Playbooks may have reported errors and/or could have been stopped. The issue has been fixed on our side, but we encourage you to check your playbooks. We are sorry for the inconvenience.
Report: "Playbook runs in error"
Last update: All previously missed playbook runs were replayed and the underlying issue has been fixed. This incident is now resolved. Thanks for your patience and understanding.
Playbooks are currently running as expected. Our team is investigating whether we can retry the previously missed runs.
Due to a recent deployment, playbooks are not starting in some regions. We identified the issue and we are currently rolling out a fix.
Report: "[FRA1] Alerts raising lag"
Last update: We are back to raising alerts in real-time. Thank you for your patience.
Hello, we have an issue causing delays in alert raising. Our team has identified the cause and is currently applying a fix.
Report: "[FRA1] Alerts raised without events"
Last update: All alerts have been processed correctly since 15:10 CET. The fix has been applied and past alerts lacking events have also been fixed. Thank you for your patience.
Our team is currently applying the fix on production.
The issue has been identified. Our teams are working on a fix to prevent this issue in the future and to ensure that events are correctly added to already raised alerts. We'll come back to you once the fix is applied.
Hello, we are aware of an issue causing alerts to be raised without their associated events being available. Our engineering and infrastructure teams are currently investigating the root cause.
Report: "MCO1 lag on events processing"
Last update: Event processing and alert raising have been back to real time since 21:30 CET. Event storage has been back to real-time since 23:00. This incident is now over.
The whole platform is up again since 19:05 CET. We are processing the backlog of events and raising alerts accordingly. The estimated time before recovering to real-time processing is just above 2 hours. Events storage is catching up a little bit slower, so events will not show instantly after being processed. We are still monitoring the behavior and will let you know when everything is back to real-time. Thank you for your patience.
Hello, during today's region upgrade, the upgrade of a critical service for event processing is taking an abnormally long time due to a restart failure, which restarted its upgrade procedure from the beginning. The impact is a substantial delay in event processing (more than an hour). A fix was immediately implemented to prevent the same failure, but the service still takes a long time to restart entirely. We are monitoring it closely and will keep you updated.
Report: "[UAE1] Platform instability"
Last update: This incident has been resolved. All events are processed in real time and alerts are raised in real time.
All nodes have been restarted and the platform is up and running. There is some delay in event processing and detection; this backlog should be processed within the next hour.
The platform is fully functional for now; some nodes are still being restarted in a controlled manner.
We identified the issue and are rolling out a fix. The platform is currently usable, but the rollout of our fix may create some sporadic issues in the coming minutes.
We are currently seeing some instability in the UAE1 region; our team is investigating.
Report: "Issue with alerts not being raised"
Last update: This incident has been handled. A post-mortem was produced and communicated to customers.
Event replay started at 17:55 CET; you should now see alerts being raised for the period of the incident (11:21 CET until 13:01 CET). We estimate this event replay to finish around 05:00 CET tomorrow.
We are currently mobilizing resources to perform a replay of events received during the duration of the incident. Our goal is to ensure alerts were eventually correctly raised.
A fix has been implemented and we are monitoring the results.
We have had an issue since 11:20 that impacted the alert raising process. We identified the issue and are currently deploying a fix.
Report: "UAE1 - delay on events processing."
Last update: Everything has been back to real-time since 15:40 CET.
The traffic has come back to normal and there is no more delay in event processing. However, there is still some delay in alert raising, which is currently resolving. We will keep monitoring until it is back to real-time.
Hello, our UAE1 platform has been facing exceptionally high traffic since 12:00 CET. This is causing delays in event processing and alert raising.
Report: "FRA1 - events search delay"
Last update: We have found the root cause of the incident and it is now resolved. Our team is now working on preventing this problem in the future. Thanks for your patience.
The peak load has passed and search jobs have been working in real-time since 10:40. We are still investigating the root cause of the load.
Hello, we are aware of slowness in event searches since 09:55 CET. Our team is investigating the issue.
Report: "[MCO1] General performance issue"
Last update: Cloud provider backups ended around 6am and IOPS performance was restored to its baseline.
Our cloud provider for the MCO1 region is currently performing block devices backups, which results in a global slowdown of the storage layer of our deployment. While the backups are ongoing, events are being processed with a significant delay and some API queries might fail.
Report: "Events analysis delay"
Last update: This incident has been resolved.
We are currently handling the delayed events.
We are encountering issues with the update of one of our services. The issue has been identified and the situation should resolve soon. We currently have some delay in analyzing events and raising alerts. As a consequence, the delayed events are also not available for search.
Report: "[FRA1] Search jobs temporary unavailability"
Last update: This incident has been resolved.
The cache has been resized successfully and no errors have been seen since 13:47. We are still monitoring the situation and investigating the root cause, but the service is up.
We identified an issue with an internal cache cluster used for search jobs on the events page. While we are resizing that cache, some search jobs may fail. Our team is currently performing the resize operation; the situation should stabilize soon.
Report: "[FRA1] Temporary outage"
Last update: A critical internal service on our platform was unreachable for 7 minutes, between 11:52 and 11:59. Because it sits at the center of many other services, many of them became unreachable as well. This has been fixed; the platform is reachable again and everything is working as intended.
Report: "[MCO1] delay on event processing and indexing."
Last update: The platform has been running in real-time since 00:34 UTC+1.
A fix has been implemented and the delay is now decreasing slowly. We will keep monitoring this incident and will close this status page once we are back in real-time.
We are aware of an incident causing delay to accumulate on the platform. It affects event processing and alert raising, which are currently happening about 10 minutes after an event is received, as well as event indexing in our storage cluster, which is currently happening around 30 minutes after an event is processed. We have identified the cause and are working towards its resolution.
Report: "[FRA1] Playbook runs incident"
Last update: The playbooks environment is stable and steady. This incident is now resolved.
The playbook environment is back up and we are processing tasks as usual. We are back in real-time; however, the environment is handling a lot of load at the moment. We are monitoring closely until everything is stable and back to normal.
We implemented a fix to the network issue. The cluster is coming back online on our side. We are currently stabilizing the cluster after the fix, and validating that everything is working.
We detected an incident concerning our playbook runs that impacts DNS resolution and run processing.
Report: "MCO1 Indexation performance issues"
Last update: The delay has been fully absorbed since 21:50 UTC; the incident is now completely resolved.
Some fixes were implemented to increase performance. We are now able to catch up on the delay, slowly but steadily. We will close this status page once event storage is back to real-time.
The incident is still ongoing. We are actively continuing to search for its root cause. Our team has added some resources to the storage cluster as a temporary workaround. At this stage, performance is still below our expectations, and we continue to experience delays. We estimate the delay to be around 1 hour and 45 minutes between the processing of an event and its entry into our storage cluster.
We are currently having issues indexing events in our storage cluster. This is generating delay before the events are available in the events and alerts pages. The detection is not affected.
Report: "[FRA2] events processing delay"
Last update: Events have been processed in real-time since 17:39 CET; everything is stable.
During a service update on the region, we encountered an issue with our event processing, which was stopped for around an hour, starting at 15:33. We have fixed the issue and processing has resumed; we are catching up on the delay. We expect about an hour before returning to real-time processing.
Report: "[MCO1] events processing lag"
Last update: This incident has been resolved.
A fix has been implemented and we are now catching up on the lag. We will keep monitoring closely and keep this status page open until we have no more delay on events processing.
We are investigating an issue with our events processing pipeline, which has been performing poorly since 09:27 CET. Event processing is accumulating lag, which impacts alert raising.
Report: "FRA1 Events processing stopped"
Last update: The platform has been consuming events in real-time since 19:09 CET. This incident has been resolved.
A good part of the backlog has already been processed. The platform's detection is estimated to be back in real-time at 19:05 CET. We will keep monitoring closely until that time.
Event processing is stable and we are slowly catching up on the lag. We expect to be back in real-time in a couple of hours due to the volume of the event backlog. We will keep you updated on this.
We were able to mitigate the issue; our events processing pipeline is back up. The platform is now working through the lag, and we are monitoring this closely.
We lost several servers within a few minutes due to a network issue on our cloud provider's side. Our event ingestion pipeline is not impacted and we are still receiving every event. However, our events processing pipeline has been stopped since 15:23 CET, and we are accumulating lag on event processing and alert raising. We are currently reaching out to our cloud provider's support to get more information in order to resolve this incident as fast as possible.
Report: "[UAE1] Syslog SSL issue"
Last update: From 12:45 CET to 18:42 CET, there was a conflict with our ingress configurations that caused syslog SSL connections to be rejected. This impacted rsyslog reception. As a result, events sent during this period may have been rejected and could potentially be lost if they were not buffered on your side. We sincerely apologize for any inconvenience this may have caused. This is a serious matter, and we are committed to implementing enhanced monitoring and safeguards to ensure this issue is identified more quickly and prevented from recurring in the future. The issue has been addressed, and we have restored the ability to send events via rsyslog using our event-amplifier.
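For clients who want to check syslog-over-TLS connectivity from their side during this kind of incident, one minimal test is to open a TLS connection to the intake endpoint the same way an rsyslog forwarder would; a rejected handshake like the one described above shows up immediately. This is an illustrative sketch only: the host and port are placeholders, not the documented intake endpoint.

```python
# Illustrative connectivity check only. The host and port below are placeholders,
# not the documented intake endpoint. It opens a TLS connection the way a
# syslog-over-TLS forwarder would, which quickly surfaces handshake rejections
# like the one described in this incident.
import socket
import ssl

HOST, PORT = "intake.example.com", 10514   # placeholder syslog-TLS endpoint

context = ssl.create_default_context()     # verifies the server certificate
with socket.create_connection((HOST, PORT), timeout=5) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        print("TLS session established:", tls_sock.version())
```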
Report: "[UAE1] Platform down"
Last update: This incident has been resolved.
The platform has been up again since 12:49 CET. We have not lost any events during the downtime. Alerts are being processed. We will keep monitoring closely for some time to ensure everything is stable.
The UAE1 region has been experiencing downtime since 11:20 CET due to an issue during a maintenance update. The problem has been identified; we are actively rolling back the changes and expect the issue to be resolved within the next half hour.
Report: "[FRA1] internal network issues"
Last update: We have fully caught up on the alert raising lag and everything has been running and stable since.
We also noticed some impact on tag enrichment, which caused a lot of alerts to be raised. Everything is stable now and we are working through the alert raising lag. ETA: ~1 hour.
A fix was deployed at 12:58 CET. Ingestion and event storage have not been impacted, and we have not lost any events. However, alert raising tasks are delayed. We are gradually working through the lag and will give you an ETA soon.
We are aware of an ongoing incident on our platform since 12:15 CET, related to internal load balancers. This is impacting our whole platform. Our team is currently implementing a fix. We'll keep you updated.
Report: "Event ingestion delays"
Last update: The entire backlog has been processed; this incident is now over.
We managed to identify the issue and process the backlog of pending tasks on the cluster responsible for event ingestion. We are now catching up on the backlog of enqueued events.
Investigation is still ongoing.
We are currently experiencing performance issues with event ingestion. As a result, events may show up late on the events page. Our team is looking into this issue.
Report: "Temporary Disruption in "Alert Created" Playbook Triggers"
Last update: On 17/10, at 16:54 CEST, a deployment introduced a bug into production which led to the "alert created" playbook triggers not being activated. All other triggers and playbooks continued to operate without any issues. Our team detected the issue and has already rolled back the affected deployment as of 10:09 today. We are actively working on replaying the missed triggers and are developing a permanent fix to prevent similar incidents in the future. We apologize for any inconvenience caused and appreciate your patience while we resolve this matter. Thank you for your understanding.
Report: "FRA1 detection is down"
Last update: This incident has been resolved. All alerts are being processed in real time.
We are pleased to inform you that the fix has been successfully deployed. No alerts were lost during this incident. However, please note that some alerts may experience a temporary delay. Our team is closely monitoring the situation to ensure everything returns to normal promptly. Thank you for your patience and support.
We have identified an issue with our detection engine and have temporarily paused it to prevent any false alerts. Rest assured, our team is actively working on a solution, which we expect to deploy shortly. Thank you for your patience and understanding.