Historical record of incidents for CloudRepo
Report: "Elevated 502 Errors"
Last update: We have not observed a single 502 across our systems for the past two hours, even while operating close to peak load. We are considering this issue resolved, as there is no current customer impact. We will continue to monitor and evaluate internally to prevent any future disruption.
We believe the issue was caused by an increase in load combined with a possible resource leak. We have scaled all of our resources by 2x to immediately reduce the impact to our partners while we investigate the suspected leak. Since scaling up at 14:15 GMT, we have not seen a single 502 pass through our load balancers. We will continue monitoring closely while we search for the root cause.
While we identify the root cause of the issue, we have doubled the size of our clusters (CPU, memory, and network) to reduce the frequency of these errors.
We are continuing to investigate this issue.
We have received reports of 502 errors that are causing builds to break. Our internal metrics indicate that this is affecting between 1% and 5% of all requests. We have elevated this to a critical issue and are actively investigating.
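As a rough illustration of how an error-rate figure like the 1-5% above might be derived, the sketch below tallies 502s from a load balancer access log. The log path, format, and field position are assumptions for illustration only; the report does not describe CloudRepo's actual tooling.

from collections import Counter

# Assumed access-log layout: one request per line, HTTP status code as the
# 9th whitespace-separated field (adjust for the real log format).
LOG_PATH = "access.log"   # hypothetical path
STATUS_FIELD = 8

def report_502_rate(log_path: str) -> None:
    statuses = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) > STATUS_FIELD:
                statuses[fields[STATUS_FIELD]] += 1
    total = sum(statuses.values())
    bad = statuses.get("502", 0)
    if total:
        print(f"{bad} of {total} requests returned 502 ({100.0 * bad / total:.2f}%)")

if __name__ == "__main__":
    report_502_rate(LOG_PATH)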
Report: "Performance Degradation"
Last update: The incident has been confirmed as resolved.
We believe we have resolved the issue and are monitoring to verify.
We are experiencing a higher-than-normal request rate on our servers. We are scaling up and expect performance to return to normal shortly.
Report: "502s and Slow Response Times"
Last update: At approximately 21:00 UTC on 5/2/22, customers began to see intermittent 502s as well as a slowdown in response times. These error rates increased through the night, and the issue was resolved at approximately 13:39 on 5/3/22. The root cause was determined to be an influx of requests from an outside source. We have scaled the system accordingly and will continue to monitor its status closely. Going forward, we will 1) continue to investigate the root cause of the issue, and 2) investigate why our alerting system failed to notify the support team. This is our first customer-impacting outage in over 3 years, and we'd like to sincerely apologize to any of our partners who were impacted.
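The alerting gap mentioned in item 2 could be closed with a periodic check along the lines of the sketch below, which pages the on-call team when the recent 502 share crosses a threshold. The threshold, interval, and the fetch_status_counts/page_on_call hooks are hypothetical; the report does not say what monitoring stack CloudRepo uses.

import time

ERROR_THRESHOLD = 0.01   # alert above a 1% 502 share (assumed value)
CHECK_INTERVAL_S = 60    # evaluation interval in seconds (assumed value)

def fetch_status_counts() -> dict:
    """Hypothetical hook: return recent request counts, e.g. from load
    balancer metrics. Replace this stub with a real query."""
    return {"total": 0, "502": 0}

def page_on_call(message: str) -> None:
    """Hypothetical hook: deliver the alert to the support team."""
    print(f"ALERT: {message}")

def watch() -> None:
    while True:
        counts = fetch_status_counts()
        total, bad = counts.get("total", 0), counts.get("502", 0)
        if total and bad / total > ERROR_THRESHOLD:
            page_on_call(f"Elevated 502 rate: {bad}/{total} requests in the last window")
        time.sleep(CHECK_INTERVAL_S)

if __name__ == "__main__":
    watch()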
Report: "API Outage"
Last update: During a server upgrade, there was approximately 1-2 minutes of API downtime. This was unintentional, and we are looking into why it happened so that we can avoid similar outages during future hardware upgrades.
Report: "Package Repository Outage"
Last update:
Customer Impact: Access to our storage APIs (publishing/reading packages) was returning 500 errors for some partners.
Root Cause: Our servers exhausted their connections to the storage layer, and our monitoring system did not alert us to this degraded state; a partner alerted us instead.
Resolution: After we were alerted to this issue, we were able to restore functionality to all partners.
Duration: Approximately two hours.
Future Mitigation: To prevent this from happening again, we will be implementing several changes: 1) Improve our monitoring to detect 500 errors as soon as they occur. 2) Increase the size of our cluster to give us more headroom in our connection pools. 3) Continue to investigate the root cause and fix anything that may be holding on to connections.
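The report does not name the storage layer, but the connection-pool concerns behind items 2 and 3 can be illustrated generically. The sketch below uses SQLAlchemy purely as a stand-in; the URL, pool sizes, and timeouts are illustrative assumptions, not CloudRepo's actual configuration.

from sqlalchemy import create_engine, text

# Illustrative settings only; not CloudRepo's real configuration.
engine = create_engine(
    "postgresql://user:password@storage-host/packages",  # hypothetical storage backend
    pool_size=20,        # baseline connections kept open
    max_overflow=20,     # extra headroom before callers start queueing
    pool_timeout=5,      # fail fast instead of hanging when the pool is exhausted
    pool_recycle=1800,   # recycle connections so a slow leak cannot pin them forever
    pool_pre_ping=True,  # discard dead connections before handing them out
)

def fetch_package(name: str):
    # The "with" block returns the connection to the pool even if the query
    # raises, which is the usual guard against the kind of leak described above.
    with engine.connect() as conn:
        row = conn.execute(
            text("SELECT data FROM packages WHERE name = :name"), {"name": name}
        ).fetchone()
        return row[0] if row else None

The exact numbers matter less than the pattern: cap the pool and fail fast, recycle periodically, and always release connections through a context manager.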
Report: "Package Repository Outage"
Last update:
Customer Impact: Access to our storage APIs (publishing/reading packages) was returning 500 errors for some partners. This is a repeat of the May 9th outage; please refer to that incident summary for more details.
Resolution: After we were alerted to this issue, we were able to restore functionality to all partners.
Duration: Approximately 45 minutes around 11:00 CST and roughly 20 minutes around 18:00 CST.
Future Mitigation: To prevent this from happening again, we will be implementing several changes: 1) Improve our monitoring to detect 500 errors as soon as they occur. 2) Increase the size of our cluster to give us more headroom in our connection pools. 3) Continue to investigate the root cause and fix anything that may be holding on to connections.