42Crunch

Is 42Crunch Down Right Now? Check whether there is an ongoing outage.

42Crunch is currently Operational

Last checked from 42Crunch's official status page

Historical record of incidents for 42Crunch

Report: "Error accessing community platform"

Last update
resolved

Connection issue has been fixed

Report: "us.42crunch.cloud is down"

Last update
postmortem

**November 23rd, 2021**

We want to provide you with some additional information about the service disruption that occurred in the 42Crunch Enterprise platform (us.42crunch.cloud) on November 1st, 2021.

**Issue Summary**

On Monday, 1 November 2021, 42Crunch SaaS platform (_us.42crunch.cloud_) instances in all regions lost connectivity for approximately 90 minutes, from 21:20 to 22:50 Central European Time.

At the time of the outage we noticed many "_Back-off restarting failed container_" messages for the **kube-proxy** pods. Because of these **kube-proxy** errors we began to see many _connection refused_ errors in GKE components, with messages like _dial tcp 10.104.0.1:443: connect: connection refused_. For example, _kubedns_, _autoscaler_, _metrics-server_, _event-exporter_ and others could not connect to the default Kubernetes service. Our pods were **UP**, but because _kube-proxy_ was down it could not forward traffic to them.

During the outage we observed on our side that:

* all pods were up and running
* there were no restarts (of pods or nodes)
* there were no application errors in the logs
* our application did not return any 50x HTTP error codes (our stack was fully up the whole time)

At the node level we saw errors indicating that connections to our control plane were being refused. We opened a case with the Google support team, describing the incident and requesting more information on the reason for the platform's unavailability.

The control plane became unavailable at 12:20 PM PT and became available again at 1:56 PM PT on November 1st. These times line up exactly with a GKE outage that affected clusters in _us-west2_. Google Support informed us that the problem was on their side, in GSLB (Google Global Software Load Balancer). GSLB allows Google to balance live user traffic between clusters so that user demand can be matched to available service capacity and service failures can be handled in a way that is transparent to users.

The GSLB used by the hosted master service on GKE (Google Kubernetes Engine) was affected by a network configuration change made by Google, which broke hosted masters in the _us-west2_ region (where _us.42crunch.cloud_ is deployed). This change was pushed minutes before the outage, and end-user impact began at 12:20 PM PT. Customer services were not targeted directly; however, their masters run in a GKE-owned project that was affected, and those hosted masters temporarily lost network connectivity. The outage caused traffic loss to the control plane: while it was unavailable, _gcloud_, _kubectl_ and IAM service account authentication were unavailable, as were services such as master repairs and autoscaling. Existing workloads are believed to have been unaffected.

Nodes were failing to connect to the control plane VIP (virtual IP address), another symptom of the outage. Fortunately the GKE team rolled back the configuration change quickly, and at 1:54 PM PT the issue was deemed mitigated. This is exactly when our control plane regained availability, and it has remained healthy since. On Google's side, this outage led to many high-priority action items for the product team, including improving alerting on traffic drops to the hosted master service and implementing better testing to predict how such changes will roll out. Unfortunately our cluster was affected, and there was nothing we could have done to avoid this.

Fortunately the GKE team was able to identify the bad config push, roll back the change and mitigate the issue within 2 hours.

**In closing**

We want to apologize for the impact this event caused for our customers. While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.
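The postmortem describes application pods staying up while the kube-proxy pods crash-looped with "Back-off restarting failed container". As a minimal sketch of the kind of check that surfaces this symptom (not the actual tooling 42Crunch used), the following assumes a local kubeconfig and the official `kubernetes` Python client, and flags crash-looping containers in `kube-system`:

```python
# Hypothetical diagnostic sketch: list kube-system pods and flag containers
# that are crash-looping (e.g. kube-proxy stuck in CrashLoopBackOff).
# Assumes the official `kubernetes` Python client and local kubeconfig access;
# this is an illustrative example, not part of the original postmortem.
from kubernetes import client, config

def find_crashlooping_pods(namespace: str = "kube-system") -> None:
    config.load_kube_config()          # use local kubeconfig credentials
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in pod.status.container_statuses or []:
            waiting = status.state.waiting
            if waiting and waiting.reason == "CrashLoopBackOff":
                print(f"{pod.metadata.name}: container {status.name} is "
                      f"crash-looping ({status.restart_count} restarts)")

if __name__ == "__main__":
    find_crashlooping_pods()
```

In an outage like the one described, this kind of check would show kube-proxy restarting repeatedly even though the application pods themselves report as running, which matches the observation that the stack was up but traffic was not being forwarded.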

resolved

This incident has been resolved

monitoring

We are still monitoring the outage and are in contact with Google Cloud support to fix the issue

identified

We've identified that the problem is in GKE (Google Kubernetes Engine), where our platform is hosted

investigating

We are currently investigating an outage with our Enterprise platform (us.42crunch.cloud)