Historical record of incidents for Gemfury
Report: "Timeouts for proxied npm registry"
Last update: We are seeing increased timeouts for proxied requests to the public npm registry. We are tracking the upstream issue with npmjs.org.
Report: "Platform issues causing upload failures"
Last update: We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Elevated errors after deployment"
Last update: The fix was effective. Service fully restored.
We deployed an update that caused elevated errors due to a caching bug that didn't surface during testing. We've rolled back the update, implemented a fix, and rolled out the corrected release. Monitoring.
Report: "Custom domains and uploads outage"
Last update: Request and error rates returning to normal levels. We will continue to monitor. Resolving.
A routine infrastructure update misconfigured our load balancers. We've reconfigured the load balancers and updated DNS settings. It may take time for changes to propagate to clients. Monitoring.
Investigating custom domains and uploads outage.
Report: "Build failures"
Last update: Back to normal. Resolving.
Shortage of resources prevented build jobs from being scheduled. Scaling up nodes seems to have fixed the issue. Monitoring.
Investigating build failures
Report: "Failing builds via Git"
Last update: Clock drift resulted in authentication failures between the Git server and builder nodes. Correcting the time resolved the sporadic failures. We will investigate why nodes don't automatically synchronize their clocks. Resolving the immediate issue. (A sketch of this kind of clock-offset check appears after this report.)
We are investigating an issue with failing builds via `git push`
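As a side note on the clock-drift diagnosis above: a minimal way to catch this kind of drift is to periodically compare a host's clock against an NTP server. The sketch below is an illustration only, not the check Gemfury runs; it assumes the third-party ntplib package and an arbitrary tolerance.

```python
# Hypothetical clock-drift check: report how far the local clock is from NTP time.
# Illustration only; assumes the third-party "ntplib" package is installed.
import ntplib

MAX_OFFSET_SECONDS = 5.0  # placeholder tolerance before time-based auth may fail

def check_clock_drift(server: str = "pool.ntp.org") -> float:
    response = ntplib.NTPClient().request(server, version=3)
    offset = response.offset  # seconds the local clock differs from NTP time
    if abs(offset) > MAX_OFFSET_SECONDS:
        print(f"WARNING: clock is off by {offset:.2f}s; time-based auth may fail")
    return offset

if __name__ == "__main__":
    check_clock_drift()
```

In practice a check like this would run from cron or a monitoring agent alongside an enforced NTP daemon (e.g. chrony), so drift is corrected automatically rather than just reported.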
Report: "Git-push build errors"
Last update: Cluster access restored. Git repo building issues have been resolved.
A payment issue with our cluster provider caused a temporary API suspension. We've fixed the payment and are waiting for cluster access to be restored.
Investigating "git push" build issues
Report: "Elevated error rates"
Last update: Everything has returned to normal operation. Resolving.
A cache instance failed, triggering automatic failover. Error rates are returning to normal.
We are investigating elevated error rates
Report: "Elevated error responses"
Last update: We've found a caching bug that was introduced by the latest deployment. It will not be present in future builds. Resolving.
We've rolled back the most recent deployment. Still debugging.
Investigating latency & error spike
Report: "Partial service failure due to non-volatile Redis saturation"
Last update: Earlier today, we deployed a bug that exposed a legacy system to excessive traffic, overwhelming our core Redis instance and causing cascading failures in certain functionality. In the past, we occasionally used a legacy internal system to capture debug information for rare data states, aiding in tracking and reproducing customer issues. This information was stored in our core non-volatile Redis instance. While this approach had worked for rare conditions and low-traffic code paths, this incident occurred because such tracking was mistakenly added to heavily-trafficked functionality.

May 29th, 12:43 UTC - Deployment of a release with the tracking bug
The release containing the tracking bug passed preflight and was deployed to production. Initially, everything appeared stable. However, the utilization of our non-volatile Redis, which usually hovers below 5%, slowly started to increase. Unfortunately, this increase went unnoticed.

May 29th, 15:55 UTC - Redis storage reached maximum utilization
When the non-volatile Redis storage reached 100% utilization, write operations began receiving "OOM command not allowed" error responses, resulting in 500 errors for certain user-facing APIs. Most read operations were successful, but not all. Regrettably, the tracking code was present in the API layer servicing the Dashboard and the CLI, causing errors for those read operations. Worse, the error rate remained low enough not to trigger any alarms.

May 29th, 23:04 UTC - Redis instance cleaned to restore service
Customer service noticed an increase in error reports and promptly notified engineering to investigate. Engineering quickly identified the issue and cleared excess data from Redis, restoring service.

May 29th, 23:40 UTC - Fix deployed to remove the tracking bug
We deployed a fix to remove the tracking bug.

Further steps
Later in the day, we removed the legacy tracking system and migrated that functionality to use standard error and metrics tracking. This step will prevent similar space issues in the future and consolidate our monitoring infrastructure. Moving forward, we will implement more alarms for storage utilization and introduce more fine-grained tracking of error rates.
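The postmortem above commits to adding alarms for storage utilization. As a rough illustration of that idea (not Gemfury's actual tooling), a periodic check of Redis memory usage against its configured maxmemory could look like the sketch below; the address, threshold, and alert hook are placeholders.

```python
# Hypothetical monitoring sketch: warn when a non-volatile Redis instance
# approaches its configured maxmemory limit. Not Gemfury's actual tooling.
import redis  # third-party "redis" (redis-py) package

REDIS_URL = "redis://redis.internal:6379/0"  # placeholder address
ALERT_THRESHOLD = 0.80                       # alert at 80% utilization

def check_redis_utilization(url: str = REDIS_URL) -> float:
    client = redis.Redis.from_url(url)
    info = client.info("memory")
    used, limit = info["used_memory"], info["maxmemory"]
    if not limit:  # maxmemory of 0 means no limit is configured
        return 0.0
    utilization = used / limit
    if utilization >= ALERT_THRESHOLD:
        # Replace with a real pager or metrics call in production.
        print(f"ALERT: Redis memory at {utilization:.0%} of maxmemory")
    return utilization

if __name__ == "__main__":
    check_redis_utilization()
```

Had a check like this been alerting at 80%, the slow climb from the usual sub-5% utilization would have paged someone well before writes started failing with "OOM command not allowed".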
Report: "Elevated errors"
Last update: Everything confirmed operational. We will resolve this incident, but will continue investigating the cause of the outage.
We've paused traffic and restarted services. Errors subsiding, requests being serviced normally. Looking into what happened.
We're seeing elevated errors on all components. Investigating.
Report: "Elevated errors for custom domains"
Last update: This incident has been resolved.
Fix applied. Error rates recovering. Monitoring.
Incorrect cache configuration is causing high latencies and timeouts. Deploying a fix.
We are currently investigating this issue.
Report: "Service connection failure"
Last update: Resolved.
Fix deployed. Monitoring.
A failed update left our upload & custom domains services in an erroneous state. We're deploying a fix now.
Report: "Downtime due to incorrect infrastructure upgrade"
Last update: Service restored. Resolving.
We've restored the cluster and resumed accepting traffic.
We are continuing to restore our cluster. In the meantime, we've restored uploads in a limited capacity.
We've rolled out an incorrect infrastructure configuration that has led to custom domains and uploads downtime. Updating with a fixed configuration.
Report: "Partial service outage"
Last update: The incident was caused by a routing bug that has been rolled back by our platform provider. Resolving.
Our metrics & errors are back to normal. We're still waiting on the upstream provider for a resolution and root cause.
Error rates have dropped. We're tracking the upstream resolution.
We are seeing a disruption due to an issue with our upstream platform provider.
Report: "Sporadic errors"
Last update: Fixed the erroneous ACME configuration. Issue resolved.
Service restored. Errors returning to normal levels. Monitoring resolution.
Issue identified: internal ACME SSL certificate failed to refresh. Refreshing manually.
We are investigating sporadic repository errors
Report: "Dashboard errors"
Last update: Confirmed resolved.
We've identified this as a build misconfiguration issue. Fix implemented & deployed. Monitoring error rates.
Investigating consistent failures to load the Dashboard app.
Report: "Elevated errors returned from service"
Last update: Certificates are fixed. Resolving.
We've updated all certificates except the one for the uploads server.
We are investigating elevated errors due to an SSL certificate error.
Report: "Elevated repo and API errors"
Last update: Upstream issue has been resolved.
Hosting availability is recovering. Waiting for upstream resolution.
We are continuing to work on a fix for this issue.
The problem appears to be caused by availability issues of our upstream cloud hosting provider.
We're investigating elevated error rates for our repository and API endpoints
Report: "Elevated timeouts from NPM-Proxy"
Last update: Dependency issues are resolved.
NPM-Proxy errors have subsided on our end. We are monitoring this in the context of the still-open npm issue. We're also looking into whether current AWS internet connectivity problems are affecting our service or its dependencies: https://status.aws.amazon.com
We are receiving sporadic 503 errors from the public npm registry. Status seems to reflect the issue: https://status.npmjs.org
We're seeing elevated timeout error rates from proxied requests to the public NPM registry via npm-proxy.fury.io.
Report: "Elevated errors"
Last update: Everything back to normal. Resolving.
Failover complete. Errors subsiding. Monitoring.
We have a database degradation issue triggering a failover.
We are currently investigating this issue.
Report: "Elevated errors"
Last update: We've restored all affected services.
Repo services and dashboard are back to normal. We are still working to fix static pages (landing pages and documentation). You can access your dashboard by going directly to https://manage.fury.io
We've fixed the incorrect DNS records. Error rates returning to normal.
We've identified this as a DNS misconfiguration.
We are investigating elevated repo errors
Report: "Platform issues causing instability"
Last update: Confirmed as resolved.
Dashboard live updates have been restored as well, subject to the same DNS caching delays. We're going to continue to monitor error rates as the clients transition to the new endpoint.
The issue was identified as a partial outage for our platform provider. We cannot address this issue directly, but we were able to implement a bypass and partially restore service to Version Badge and Content Explorer. It may take clients an hour or two to see the Version Badge return due to DNS caching. We are still working to restore Dashboard live updates.
The problem also seems to extend to Dashboard live updates and the public Version Badge service.
We are investigating consistent internal errors for our Content Explorer backend.
Report: "Elevated errors"
Last update: Database has auto-recovered. Errors back to normal. Resolving.
We've experienced a period of elevated error rates due to a database issue. Errors have subsided after switching to the failover. Looking into the database issue.
Report: "SSL Certificate Expired"
Last update: Everything looks good. Resolving.
All certificates replaced. Verifying availability; see the expiry-check sketch after this report.
Updated certificate for Repos, API, Git server, and Dashboard. Fixing certificate for push server.
We're renewing the certificate now.
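Since this incident was an expired certificate, it is the kind of failure an external expiry check catches before users do. The following is a generic sketch using Python's standard ssl module; the hostnames are simply endpoints mentioned elsewhere on this page and the warning window is arbitrary, so treat it as an illustration rather than Gemfury's actual monitoring.

```python
# Hypothetical certificate-expiry check using only the Python standard library.
# Hostnames and the warning window are placeholders for illustration.
import socket
import ssl
from datetime import datetime, timezone

HOSTS = ["manage.fury.io", "npm-proxy.fury.io"]  # example endpoints
WARN_DAYS = 14                                   # warn two weeks before expiry

def days_until_expiry(host: str, port: int = 443) -> int:
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires_at - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in HOSTS:
        remaining = days_until_expiry(host)
        status = "OK" if remaining > WARN_DAYS else "RENEW SOON"
        print(f"{host}: {remaining} days until certificate expiry [{status}]")
```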
Report: "Sporadic errors"
Last update: Error rates are holding low. We will resolve this incident while continuing to investigate the root cause with our platform provider. We will separately investigate and address the issue of subpar incident response.
A number of repo instances crashed. We've restarted the crashed instances and scaled up. This should have been handled automatically; we're looking into why it wasn't.
We are continuing to investigate this issue.
We are looking into reports of sporadic repo and dashboard errors
Report: "Error spike"
Last update: This incident has been resolved.
Outage was due to a bad configuration update. We've rolled it back to investigate.
We've identified our cache as the source of the errors.
We are investigating a 5xx error spike
Report: "Upload processing delays"
Last update: We were able to stabilize uploads, and we'll keep this as an area of focus. Resolving.
We're starting to see some progress. Workers are keeping up with incoming work. Still monitoring.
We've sped up one hot path by about 75%, and provisioned larger workers. Still seeing delays.
We've found code that performs very poorly for packages with many versions. Multiple background jobs running this code consume all the worker slots, preventing other jobs from being processed. (Edit: these are the jobs that bring your uploads to your dashboard and indexes.) A sketch of one way to isolate such heavy jobs appears after this report.
The issue is quite difficult to reproduce. We have implemented additional instrumentation to capture it when it happens in production. We will keep this issue open over the next period of high traffic.
We are investigating upload delays due to locked-up background job workers.
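The diagnosis above describes one slow class of jobs monopolizing a shared worker pool. A common mitigation, shown here only as a generic sketch and not as Gemfury's implementation, is to route known-heavy job types to their own small pool so fast jobs always have free slots; the pool sizes and job-type names below are invented for illustration.

```python
# Hypothetical sketch: isolate heavy jobs in their own bounded worker pool so
# they cannot occupy every slot and starve fast jobs. Names are invented.
from concurrent.futures import Future, ThreadPoolExecutor

FAST_POOL = ThreadPoolExecutor(max_workers=8, thread_name_prefix="fast")
HEAVY_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="heavy")

HEAVY_JOB_TYPES = {"reindex_package"}  # e.g. packages with many versions

def enqueue(job_type: str, handler, *args) -> Future:
    """Route known-heavy job types to the small, isolated pool."""
    pool = HEAVY_POOL if job_type in HEAVY_JOB_TYPES else FAST_POOL
    return pool.submit(handler, *args)

# Usage: heavy reindexing can back up without blocking dashboard updates.
#   enqueue("reindex_package", rebuild_index, "some-package")
#   enqueue("update_dashboard", refresh_dashboard, "some-account")
```

The same idea applies to dedicated queues in a background-job system; the point is that heavy work gets a bounded share of capacity instead of the whole pool.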
Report: "npm-proxy returns 404 for some proxied packages"
Last update: This incident has been resolved.
The public registry for npm is having an incident. We are following it here: https://status.npmjs.org/
Report: "Networking errors in upload hosts"
Last update: DNS issues have been resolved upstream. Everything is back to normal here. We will be looking into how to mitigate this better in the future.
Errors have dissipated. Services running normally. Monitoring for any changes.
Error rates are improving; service is returning to normal.
The outage appears to be caused by sporadic DNS failures when accessing the Gemfury API and other services. Also note that Cloudflare DNS is currently down, which may be related.
Investigating elevated networking errors on upload processing hosts
Report: "Delayed index builds"
Last update: Everything back to normal.
We are monitoring the results.
We've spun up more workers, and they are now catching up with the uploads.
Due to elevated activity, upload processing and index updates are delayed. We're working on scaling up capacity.
Report: "403 errors"
Last update: Everything looks better now. We're going to close the incident while we continue to investigate the root cause.
We were still seeing errors, so we're turning off the optimizations again.
403 errors are back below 0.5%, but as predicted, latencies are up. We're going to slowly bring back some optimizations.
Seeing 403 spikes of up to 9% of requests, from a baseline of 0.5-1%. Rolling back to an earlier build and disabling experimental optimizations. Bigger accounts may experience higher latencies.
Investigating reports of sporadic 403 errors for valid tokens, and possible slowness.
Report: "Elevated errors"
Last update: This incident has been resolved.
Platform provider incident resolved upstream at 17:31 UTC
The uploads queue has been processed and is back to normal now. Still seeing some elevated latencies on repo endpoints.
Queue has caught up. Indexing recent uploads now.
Update from our platform provider: platform issues starting around 16:22 UTC
Index rebuild queue has run away. Pausing new index rebuilds to allow workers to catch up.
We are continuing to investigate this issue.
Investigating elevated errors and reports of delayed upload processing.
Report: "Reemergence of platform issues"
Last update: This incident has been resolved.
We are seeing sporadic empty responses from our endpoints due to the recurrence of the platform issue. We are going to keep monitoring their team's progress and make sure we don't lose anything important.
Report: "Platform networking issues (continued)"
Last update: This incident has been resolved.
We are reopening an incident to continue following a networking event with our platform provider. Although things are better than yesterday, we are still getting customer reports of sporadic connection issues.