Dome9 Security

Is Dome9 Security Down Right Now? Check whether there is an ongoing outage.

Dome9 Security is currently Operational

Last checked from Dome9 Security's official status page

Historical record of incidents for Dome9 Security

Report: "Degraded performance in security events - EU, Sydney and Mumbai DCs"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Login issues - Investigating an Increase in error rate"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "Servers Replacement"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

Report: "Servers Replacement"

Last update
resolved

This incident has been resolved.

identified

We are replacing a few servers related to Security Groups and Identity protection.

Report: "Servers Replacement"

Last update
resolved

This incident has been resolved.

identified

We are replacing a few servers related to Security Groups and Identity protection.

Report: "Servers Replacement"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are replacing a few servers related to Security Groups and Identity protection.

Report: "CDR Account Activity delay in data and alerts."

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "CloudGuard - UI Pages increase on error rate on Infinity portal."

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "Canada DC - CDR - Logs analysis latency on Azure subscriptions"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "Canada DC | CDR - Log analysis Latency"

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

Report: "Latency to present new findings in US region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Latency to present new findings in EU region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Latency to present new findings in US region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "CloudGuard - US DC - Increase in error rate"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "CloudGuard - US DC - Increase in error rate"

Last update
postmortem

# Summary

On Monday, May 6, 2024, between 12:39 UTC and 13:55 UTC, all users of CloudGuard (US region) experienced degraded performance and login failures. The event was triggered by extreme load on our internal services, caused by internal activity and external API calls. Our internal alerting and client reports clearly pointed to a major issue. The high load caused the CloudGuard database to stop functioning. The incident was mitigated by recovering the database, after which the system became stable again.

# Incident Timeline

Monday, May 6, 2024, 12:39 UTC – An alert is triggered. A war room is created to diagnose the issue.

Monday, May 6, 2024, 12:54 UTC – It’s clear an incident has started. Database recovery begins.

Monday, May 6, 2024, 13:00 UTC – The status page is updated.

Monday, May 6, 2024, 13:45 UTC – The system shows signs of recovering.

Monday, May 6, 2024, 13:55 UTC – The system is up and running and is being monitored.

Monday, May 6, 2024, 14:45 UTC – It’s clear the system is back to being fully operational. No issues reported by clients for a meaningful period. Closing the incident.

# Root Cause Analysis

A rare combination of calls to the CloudGuard database resulted in extreme load that caused it to reach its limits. The database became degraded, which caused the major outage.

# Next Steps

We sincerely apologize for the recent outage of our system. We take our availability very seriously and we understand that this outage has caused you inconvenience. We appreciate your patience and understanding during this time.

Further steps we are planning to take:

1. Identify and improve connection management to the CloudGuard database
2. Add limitations to sources that access the CloudGuard database
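The second next step, limiting the sources that access the database, could take many forms, and the postmortem does not say which one was chosen. The following is only a minimal sketch of one option, a per-source cap on in-flight calls. The class name `SourceLimiter`, the source labels, and the limit value are all hypothetical and do not come from CloudGuard.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class SourceLimiter:
    """Caps how many database calls each source may have in flight at once.

    Hypothetical illustration; not CloudGuard's actual mechanism.
    """

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)

    @contextmanager
    def acquire(self, source: str):
        # Reject immediately instead of queueing, so a noisy source
        # cannot pile up work behind the database.
        with self._lock:
            if self._in_flight[source] >= self._max:
                raise RuntimeError(f"source {source!r} is over its limit")
            self._in_flight[source] += 1
        try:
            yield
        finally:
            # Always release the slot, even if the database call failed.
            with self._lock:
                self._in_flight[source] -= 1


limiter = SourceLimiter(max_in_flight=2)
with limiter.acquire("external-api"):
    pass  # the database call would run here
```

Rejecting over-limit calls up front, rather than blocking them, keeps a burst from one source from consuming capacity needed by the others.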

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "CloudGuard workload protection - Image scanning is unavailable in some accounts"

Last update
resolved

Resolved

Report: "CloudGuard workload protection - Image scanning is unavailable in some accounts"

Last update
resolved

Resolved

Report: "CDR (Intelligence) Azure Onboarding issue - Due to Microsoft Azure bug"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "CDR (Intelligence) Azure Onboarding issue - Due to Microsoft Azure bug"

Last update
resolved

This incident has been resolved.

identified

We have reproduced the issue with MS support and they are working on a fix.

Report: "CloudGuard - Increase in Assessments error rate"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Login issues preventing users from logging in to Infinity Portal"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "Issue with fetching data and continuous compliance in US region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "CloudGuard Intelligence - Delay in Data processing"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "There is a delay in sending notification on Alerts"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Home dashboard doesn't load in EU"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "main window doesn't load in infinity portal UI"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Issue with loading UI in EU region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Alerts materialization data not healthy - Impact on finding-orc/export API route - Returns partial data"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Sometimes get error about user when loading pages in infinity portal, need just to refresh the page"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Delays in sending notifications in US region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Compliance - US Region - Managed rulesets run issue"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Issues with login in US region"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "US DC - UI Latency"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "US DC - UI Latency and error rate increase"

Last update
postmortem

US Data center was sporadically not accessible through the portal for a period of 16 hours. Backend services were functioning (including CSPM, Containers etc).

## **What was the issue?**

One of our web server instances was malfunctioning and returned timeouts. We had stopped receiving alerts from Sumo, so the issue was detected late.

**TIMELINES:**

· **11:18 –** We received complaints
· **11:35 –** ‘War room’ started
· **12:04 –** Deployment was done and the system was back to normal

**Cause**

A specific instance was not working as expected and returned sporadic errors. The root cause was probably too many open sockets on that instance, which eventually resulted in timeout responses. The default number of allowed sockets was decreased in our latest deployed .NET library.

## **Lesson Learned**

1. We will enhance our monitoring to automatically fix the malfunctioning instance
2. Increase the default number of sockets allowed (related to specific .NET configuration)
3. Review why Sumo alerts were not received

## **Summary**

We would like to thank you for being a loyal customer, and again apologize for the inconvenience. We would like to assure you that we treat this matter with the utmost seriousness. CloudGuard conducted a postmortem to implement the process changes needed to minimize the likelihood of such incidents in the future.

Sincerely,
Eyal Fingold
VP Cloud Security Products
@Check Point
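The first lesson learned, monitoring that acts on a single malfunctioning instance, depends on being able to single that instance out. The postmortem does not describe how this is done, so the following is only a rough sketch: it flags instances whose share of timed-out requests exceeds a threshold. The function name, the thresholds, and the instance IDs are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def unhealthy_instances(
    samples: Iterable[Tuple[str, bool]],
    max_timeout_rate: float = 0.2,
    min_requests: int = 5,
) -> List[str]:
    """Return IDs of instances whose share of timed-out requests is too high.

    `samples` holds (instance_id, timed_out) pairs for recent requests.
    Instances with fewer than `min_requests` samples are skipped, because
    a rate computed from very few requests is too noisy to act on.
    """
    total = Counter()
    timeouts = Counter()
    for instance_id, timed_out in samples:
        total[instance_id] += 1
        if timed_out:
            timeouts[instance_id] += 1
    return sorted(
        instance_id
        for instance_id, count in total.items()
        if count >= min_requests
        and timeouts[instance_id] / count > max_timeout_rate
    )


recent = [("web-1", False)] * 10 + [("web-2", True)] * 4 + [("web-2", False)] * 6
print(unhealthy_instances(recent))
```

A check like this compares instances against a fixed threshold. It would not by itself catch the other gap named in the postmortem, the alerting pipeline going silent.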

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Login issues"

Last update
postmortem

Production DC in US was not accessible through the portal for a period of 1 hour 47 minutes, due to an attempt to onboard a large number of serverless accounts. Backend services were functioning (including CSPM, Containers etc). The onboarding attempt loaded our servers (Centrals), reaching 100% CPU on 30 machines. After the Central servers were restarted, the login problems were resolved, but it took more time to inject new details to the other services in the system. The main problem was that, at a specific customer's request, CloudGuard had removed the rate limit on that customer’s API, allowing 13,000 onboarding API requests to be issued in less than one hour. The problem was resolved after all relevant services of the system were restarted, picked up the updated connection string, and successfully connected to the DB.

## **What was the issue?**

A script to onboard 13,000 AWS Serverless accounts was run, causing a DDoS on our API servers. The APIs are protected with a rate limit; however, due to a past request from that specific customer, the protection had been removed.

**TIMELINES:**

· **13:19 –** An alert was received in our system that a number of API servers had crossed 80% CPU utilization
· A message was reported in our internal ‘Critical production issues’ channel in Teams
· Status page was updated accordingly.
· **13:40 –** ‘War room’ started. IIS CPU reached 100% on many machines. We knew there had been an increase in CPU over the past month, so we got distracted
· **15:00 –** Initially we suspected a deployment that occurred around the time the degradation started; however, we eventually found that a specific API, /v2/serverless/accounts, caused throttling on our API servers
· **15:12 –** Rate limit on the API was enabled and API servers were restarted.

## **Lesson Learned**

1. Rate limit should be added in WAF and not just on the application side.
2. The CPU utilization alerting threshold was reduced from 80% to 60%
3. Set a process where, in cases like that, servers should be immediately restarted.
4. Enable better monitoring and visualization on the following:
   a. Frontend API latency (roundtrip time)
   b. API servers error count (enable alert when threshold for specific time period reached)

**Summary**

We would like to thank you for being a loyal customer, and again apologize for the inconvenience. We would like to assure you that we treat this matter with the utmost seriousness. CloudGuard conducted a postmortem to implement the process changes needed to minimize the likelihood of such incidents in the future.

Sincerely,
Eyal Fingold
VP Cloud Security Products
@Check Point
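The postmortem does not say how the API rate limit is implemented. As an illustration of the general idea only, here is a minimal token-bucket sketch: each caller may burst up to a fixed capacity, after which requests are admitted at a steady refill rate. The `TokenBucket` class, its parameters, and the chosen values are hypothetical and are not CloudGuard's or any WAF's actual configuration.

```python
import time
from typing import Callable


class TokenBucket:
    """Admits bursts of up to `capacity` requests, then `rate` per second.

    Hypothetical illustration of rate limiting in general.
    """

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it is throttled."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# One bucket per customer: 5 requests per second, bursts of up to 20.
bucket = TokenBucket(rate=5, capacity=20)
if not bucket.allow():
    pass  # the API would reject the request here, e.g. with HTTP 429
```

Enforcing a limit like this at the edge, as the first lesson suggests for the WAF, means throttled requests never reach the API servers and so cannot consume their CPU.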

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "CDR AWS - Account Activity Latency"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Cloud Guard - Latency in fetching inventory and running rulesets"

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

Report: "Web Console - Increase in Error rate on Infinity portal only"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "US Region - Significant performance degradation"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Cloud-guard Intelligence"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "US DC - Intelligence - Delay in Log ingestion"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "US DC - Protected assets page - Data Latency"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Latency in fetching data"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "Due to AWS Outage in N. Virginia we are experiencing an outage"

Last update
resolved

This incident has been resolved.

investigating

AWS Status - https://health.aws.amazon.com/health/status

investigating

We are currently investigating this issue.

Report: "Latency in creating and receiving CSPM findings"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "US Region - Latency on Protected assets page"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "US Region | Intelligence | Account Activity Latency"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently checking reports of latency in retrieving event data in Account Activity.

Report: "US region - UI Latency"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "US data center - protected assets Latency on UI"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "US - Console latency and login issues"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.