CloudRepo

Is CloudRepo Down Right Now? Check whether there is an ongoing outage.

CloudRepo is currently Operational

Last checked from CloudRepo's official status page

Historical record of incidents for CloudRepo

Report: "Elevated 502 Errors"

Last update
resolved

We have not observed a single 502 across our systems for the past two hours, even while running close to peak load. We consider this issue resolved, as there is no current customer impact. We will continue to monitor and evaluate internally to prevent any future disruption.

monitoring

We believe the issue was caused by an increase in load combined with a potential resource leak. We have scaled all of our resources to 2x their previous size to immediately reduce the impact on our partners while we investigate the resource leak. Since scaling up at 14:15 GMT, we have not seen a single 502 pass through our load balancers. We will continue monitoring closely while we search for the root cause.

investigating

While we identify the root cause of the issue, we have doubled the size of our clusters (CPU, memory, and network) to reduce the frequency of these errors.

investigating

We are continuing to investigate this issue.

investigating

We have received reports of 502 errors that are causing builds to break. Our internal metrics indicate that between 1% and 5% of all requests are affected. We have elevated this to a critical issue and are actively investigating.
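For partners whose builds break on intermittent 502s like these, a common client-side mitigation is to retry failed downloads with exponential backoff. The sketch below uses Python's requests and urllib3 libraries; the repository URL is a placeholder, not an actual CloudRepo endpoint.

```python
# Sketch: retry intermittent 502/503/504 responses with exponential backoff.
# The URL below is a placeholder, not a real CloudRepo endpoint.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_policy = Retry(
    total=5,                          # up to 5 retry attempts
    backoff_factor=1,                 # exponential backoff between attempts
    status_forcelist=[502, 503, 504], # retry only on these server errors
    allowed_methods=["GET", "HEAD"],  # safe, idempotent methods only
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry_policy))

response = session.get("https://repo.example.com/releases/app-1.0.jar", timeout=30)
response.raise_for_status()
```

With this policy, a request that hits an occasional 502 is retried transparently instead of failing the build, while persistent failures still surface as an error after the retry budget is exhausted.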

Report: "Performance Degradation"

Last update
resolved

The incident has been confirmed as resolved.

monitoring

We believe we have resolved the issue and are monitoring to verify.

identified

We are experiencing a higher-than-normal request rate on our servers. We are scaling up and expect performance to return to normal shortly.

Report: "502s and Slow Response Times"

Last update
resolved

At approximately 21:00 UTC on 5/2/22, customers began to see intermittent 502s as well as a slowdown in response times. These rates increased through the night and were resolved at approximately 13:39 on 5/3/22. The root cause was determined to be an influx of requests from an outside source. We have scaled the system accordingly and will continue to monitor the status closely. Going forward, we will 1) continue to investigate the root cause of the issue and 2) investigate why our alerting system failed to notify the support team. This is our first customer-impacting outage in over 3 years, and we'd like to sincerely apologize to any of our partners who were impacted.
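The follow-up item about alerting hints at the kind of check that would have caught this earlier: page someone as soon as the 5xx rate crosses a threshold. Below is a minimal, hypothetical sketch of such a check; the metric source, threshold, and notification hook are assumptions, not details of CloudRepo's monitoring stack.

```python
# Hypothetical sketch of a 5xx error-rate alert; the metric source and
# notification hook are placeholders, not CloudRepo's actual monitoring.
from dataclasses import dataclass

@dataclass
class WindowStats:
    total_requests: int
    server_errors: int   # responses with status >= 500

ERROR_RATE_THRESHOLD = 0.01  # alert once more than 1% of requests fail

def check_error_rate(stats: WindowStats, notify) -> None:
    """Compare the 5xx rate over a time window against a threshold."""
    if stats.total_requests == 0:
        return
    rate = stats.server_errors / stats.total_requests
    if rate > ERROR_RATE_THRESHOLD:
        notify(f"5xx error rate {rate:.2%} exceeds "
               f"{ERROR_RATE_THRESHOLD:.0%} over the last window")

# Example: 1,200 errors out of 60,000 requests (2%) would trigger a page.
check_error_rate(WindowStats(total_requests=60_000, server_errors=1_200), print)
```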

Report: "API Outage"

Last update
resolved

During a server upgrade, there was approximately 1-2 minutes of API downtime. This was unintentional, and we are looking into why it happened so that we can avoid similar outages in the future as we upgrade hardware.

Report: "Package Repository Outage"

Last update
resolved

Customer Impact: Access to our storage APIs (publishing/reading packages) was returning 500 errors for some partners.
Root Cause: Our servers exhausted their connections to the storage layer, and our monitoring system did not alert us to this degraded state; a partner alerted us instead.
Resolution: After we were alerted to this issue, we restored functionality to all partners.
Duration: Approximately two hours.
Future Mitigation: To prevent this from happening again, we will be implementing several changes:
1) Improve our monitoring to detect 500 errors as soon as they occur.
2) Increase the size of our cluster to give us more headroom in our connection pools.
3) Continue to investigate the root cause and fix anything that may be holding on to connections.
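As an illustration of the connection-exhaustion failure mode described above (not CloudRepo's actual implementation), a bounded connection pool that fails fast and exposes its utilization makes this state visible to monitoring instead of silently turning into 500s:

```python
# Illustrative sketch of a bounded connection pool that fails fast and
# reports utilization; the connection factory is a placeholder.
import queue

class ConnectionPoolExhausted(Exception):
    pass

class BoundedPool:
    def __init__(self, create_connection, size: int):
        self._pool = queue.Queue(maxsize=size)
        self._size = size
        for _ in range(size):
            self._pool.put(create_connection())

    def acquire(self, timeout: float = 2.0):
        """Fail fast (and visibly) instead of hanging when the pool is empty."""
        try:
            return self._pool.get(timeout=timeout)
        except queue.Empty:
            raise ConnectionPoolExhausted(
                f"all {self._size} storage connections in use"
            )

    def release(self, conn) -> None:
        self._pool.put(conn)

    def utilization(self) -> float:
        """Fraction of connections checked out; a gauge worth alerting on."""
        return 1.0 - (self._pool.qsize() / self._size)
```

Alerting on the utilization gauge (for example, above 90% for several minutes) would flag the degraded state before requests start failing, rather than relying on a partner report.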

Report: "Package Repository Outage"

Last update
resolved

Customer Impact: Access to our storage APIs (publishing/reading packages) was returning 500 errors for some partners. This is a repeat of the May 9th outage; please refer to that incident summary for more details.
Resolution: After we were alerted to this issue, we restored functionality to all partners.
Duration: Approximately 45 minutes around 11:00 CST and 20 minutes around 18:00 CST.
Future Mitigation: To prevent this from happening again, we will be implementing several changes:
1) Improve our monitoring to detect 500 errors as soon as they occur.
2) Increase the size of our cluster to give us more headroom in our connection pools.
3) Continue to investigate the root cause and fix anything that may be holding on to connections.