Gemfury

Is Gemfury Down Right Now? Check whether an outage is currently ongoing.

Gemfury is currently Operational

Last checked from Gemfury's official status page

Historical record of incidents for Gemfury

Report: "Timeouts for proxied npm registry"

Last update
identified

We are seeing increased timeouts for proxied requests to the public npm registry. We are tracking the upstream issue with npmjs.org.

Report: "Platform issues causing upload failures"

Last update
investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Elevated errors after deployment"

Last update
resolved

The fix was effective. Service fully restored.

monitoring

We deployed an update that caused elevated errors due to a caching bug that didn't surface during testing. We've rolled back the update, implemented a fix, and rolled out the corrected release. Monitoring.

Report: "Custom domains and uploads outage"

Last update
resolved

Request and error rates returning to normal levels. We will continue to monitor. Resolving.

monitoring

A routine infrastructure update misconfigured our load balancers. We've reconfigured the load balancers and updated DNS settings. It may take time for the changes to propagate to clients. Monitoring.
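For illustration, one way a client could check whether updated DNS settings have reached its resolver is to look at the records and TTL it currently receives. This is a generic sketch assuming the third-party dnspython package; the domain shown is a hypothetical placeholder, not an actual affected custom domain.

```python
# Illustrative only: see what a resolver currently returns for a custom
# domain, to gauge whether updated DNS settings have propagated.
# Assumes the third-party `dnspython` package; "docs.example.com" is a
# hypothetical placeholder, not a domain from this incident.
import dns.resolver

def current_records(hostname: str) -> None:
    answer = dns.resolver.resolve(hostname, "A")
    print(f"{hostname} (TTL {answer.rrset.ttl}s):")
    for record in answer:
        print(f"  {record.address}")

if __name__ == "__main__":
    current_records("docs.example.com")
```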

investigating

Investigating custom domains and uploads outage.

Report: "Build failures"

Last update
resolved

Back to normal. Resolving.

monitoring

A shortage of resources prevented build jobs from being scheduled. Scaling up nodes seems to have fixed the issue. Monitoring.

investigating

Investigating build failures

Report: "Failing builds via Git"

Last update
resolved

Clock drift resulted in authentication failures between the Git server and builder nodes. Correcting the time resolved the sporadic failures. We will investigate why nodes don't automatically synchronize their clocks. Resolving the immediate issue.
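As a rough illustration of how this kind of drift can be caught before it breaks time-sensitive authentication, a node could periodically compare its clock against an NTP server. A minimal sketch assuming the third-party ntplib package; the 1-second tolerance is an arbitrary example, not an actual Gemfury threshold.

```python
# Illustrative sketch: measure local clock drift against an NTP server so
# drift large enough to break time-based authentication can be flagged.
# Assumes the third-party `ntplib` package; the tolerance is hypothetical.
import ntplib

MAX_OFFSET_SECONDS = 1.0  # hypothetical tolerance

def check_clock_drift(server: str = "pool.ntp.org") -> float:
    response = ntplib.NTPClient().request(server, version=3)
    offset = response.offset  # local clock minus NTP time, in seconds
    if abs(offset) > MAX_OFFSET_SECONDS:
        print(f"WARNING: clock drift of {offset:+.3f}s exceeds tolerance")
    else:
        print(f"Clock drift of {offset:+.3f}s is within tolerance")
    return offset

if __name__ == "__main__":
    check_clock_drift()
```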

investigating

We are investigating an issue with failing builds via `git push`

Report: "Git-push build errors"

Last update
resolved

Cluster access restored. Git repo building issues have been resolved.

identified

A payment issue with our cluster provider caused a temporary API suspension. We've fixed the payment and are waiting for restoration of cluster access.

investigating

Investigating "git push" build issues

Report: "Elevated error rates"

Last update
resolved

Everything returned to operating normally. Resolving.

monitoring

A cache instance failed, triggering automatic failover. Error rates are returning to normal.

investigating

We are investigating elevated error rates

Report: "Elevated error responses"

Last update
resolved

We've found a caching bug that was introduced by the latest deployment. It will not be present in future builds. Resolving.

monitoring

We've rolled back the most recent deployment. Still debugging.

investigating

Investigating latency & error spike

Report: "Partial service failure due to non-volatile Redis saturation"

Last update
resolved

Earlier today, we deployed a bug that exposed a legacy system to excessive traffic, overwhelming our core Redis instance and causing cascading failures in certain functionality.

In the past, we occasionally used a legacy internal system to capture debug information for rare data states, aiding in tracking and reproducing customer issues. This information was stored in our core non-volatile Redis instance. While this approach had worked for rare conditions and low-traffic code paths, this incident occurred due to mistakenly adding such tracking to heavily-trafficked functionality.

May 29th, 12:43 UTC - Deployment of a release with the tracking bug

The release containing the tracking bug passed preflight and was deployed to production. Initially, everything appeared stable. However, the utilization of our non-volatile Redis, which usually hovers below 5%, slowly started to increase. Unfortunately, this increase went unnoticed.

May 29th, 15:55 UTC - Redis storage reached maximum utilization

When the non-volatile Redis storage reached 100% utilization, write operations began receiving "OOM command not allowed" error responses, resulting in 500 errors for certain user-facing APIs. Most read operations were successful, but not all. Regrettably, the tracking code was present in the API layer servicing the Dashboard and the CLI, causing errors for those read operations. Worse, the error rate remained low enough to not trigger any alarms.

May 29th, 23:04 UTC - Redis instance cleaned to restore service

Customer service noticed an increase in error reports and promptly notified engineering to investigate. Engineering quickly identified the issue and cleared excess data from Redis, restoring service.

May 29th, 23:40 UTC - Fix deployed to remove the tracking bug

We deployed a fix to remove the tracking bug.

Further steps

Later in the day, we removed the legacy tracking system and migrated that functionality to use standard error and metrics tracking. This step will prevent similar space issues in the future and consolidate our monitoring infrastructure. Moving forward, we will implement more alarms for storage utilization and introduce more fine-grained tracking of error rates.
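As a sketch of the kind of storage-utilization alarm described in the follow-up steps, a periodic check could compare used memory against maxmemory on the Redis instance and alert well before writes start failing with "OOM command not allowed". This is an illustrative example using the redis-py client; the connection details and the 80% threshold are assumptions, not Gemfury's actual configuration.

```python
# Illustrative sketch of a storage-utilization alarm: warn well before a
# non-volatile Redis instance hits its memory limit and rejects writes.
# Assumes the `redis` (redis-py) package; host, port, and threshold are
# placeholders, not Gemfury's actual configuration.
import redis

ALERT_THRESHOLD = 0.80  # hypothetical alert level

def check_redis_utilization(client: redis.Redis) -> float:
    info = client.info("memory")
    used = info["used_memory"]
    limit = info["maxmemory"]
    if not limit:  # maxmemory of 0 means "no limit" is configured
        print("maxmemory is not set; utilization check not applicable")
        return 0.0
    ratio = used / limit
    if ratio >= ALERT_THRESHOLD:
        print(f"ALERT: Redis memory at {ratio:.0%} of maxmemory")
    return ratio

if __name__ == "__main__":
    check_redis_utilization(redis.Redis(host="localhost", port=6379))
```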

Report: "Elevated errors"

Last update
resolved

Everything confirmed operational. We will resolve this incident, but will continue investigating the cause of the outage.

monitoring

We've paused traffic and restarted services. Errors subsiding, requests being serviced normally. Looking into what happened.

investigating

We're seeing elevated errors on all components. Investigating.

Report: "Elevated errors for custom domains"

Last update
resolved

This incident has been resolved.

monitoring

Fix applied. Error rates recovering. Monitoring.

identified

Incorrect cache configuration is causing high latencies and timeouts. Deploying a fix.

investigating

We are currently investigating this issue.

Report: "Service connection failure"

Last update
resolved

Resolved

monitoring

Fix deployed. Monitoring.

identified

A failed update left our upload & custom domains services in an erroneous state. We're deploying a fix now.

Report: "Downtime due to incorrect infrastructure upgrade"

Last update
resolved

Service restored. Resolving.

monitoring

We've restored the cluster and resumed accepting traffic.

identified

We are continuing to restore our cluster. In the meantime, we've restored uploads in a limited capacity.

identified

We've rolled out an incorrect infrastructure configuration that has led to custom domains and uploads downtime. Updating with a fixed configuration.

Report: "Partial service outage"

Last update
resolved

Incident caused by a routing bug that has been rolled back by our platform provider. Resolving.

monitoring

Our metrics & errors are back to normal. We're still waiting for upstream resolution and cause.

monitoring

Error rates have dropped. We're tracking the upstream resolution.

identified

We are seeing a disruption due to an issue with our upstream platform provider.

Report: "Sporadic errors"

Last update
resolved

Fixed erroneous configuration for ACME. Issue resolved.

monitoring

Service restored. Errors returning to normal levels. Monitoring resolution.

identified

Issue identified: internal ACME SSL certificate failed to refresh. Refreshing manually.

investigating

We are investigating sporadic repository errors

Report: "Dashboard errors"

Last update
resolved

Confirmed resolved.

monitoring

We've identified this as a build misconfiguration issue. Fix implemented & deployed. Monitoring error rates.

investigating

Investigating consistent failures to load the Dashboard app.

Report: "Elevated errors returned from service"

Last update
resolved

Certificates are fixed. Resolving.

monitoring

We've updated all certificates except for the uploads server.

investigating

We are investigating elevated errors due to an SSL certificate error

Report: "Elevated repo and API errors"

Last update
resolved

Upstream issue has been resolved.

monitoring

Hosting availability is recovering. Waiting for upstream resolution.

identified

We are continuing to work on a fix for this issue.

identified

The problem appears to be caused by availability issues of our upstream cloud hosting provider.

investigating

We're investigating elevated error rates for our repository and API endpoints

Report: "Elevated timeouts from NPM-Proxy"

Last update
resolved

Dependency issues are resolved.

monitoring

NPM-Proxy errors have subsided on our end. We are monitoring this in the context of the still-open npm issue. We're also looking into whether current AWS internet connectivity problems are affecting our service or its dependencies: https://status.aws.amazon.com

identified

We are receiving sporadic 503 errors from the public npm registry. Their status page seems to reflect the issue: https://status.npmjs.org

investigating

We're seeing elevated timeout error rates from proxied requests to the public NPM registry via npm-proxy.fury.io.
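For illustration, a proxy or client facing sporadic 503s from an upstream registry typically retries with a timeout and backoff rather than failing immediately. The sketch below is a generic example using the requests package; the URL, attempt count, and backoff values are placeholders and say nothing about how npm-proxy.fury.io is actually implemented.

```python
# Illustrative sketch: retry an upstream registry fetch when it returns
# sporadic 503s. Uses the `requests` package; URL, attempts, and backoff
# are example values only.
import time
import requests

def fetch_with_retry(url, attempts=3, backoff=1.0):
    last_error = None
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code != 503:
                return response
            last_error = RuntimeError(f"503 from upstream on attempt {attempt + 1}")
        except requests.RequestException as exc:  # timeouts, connection errors
            last_error = exc
        time.sleep(backoff * (2 ** attempt))  # exponential backoff between attempts
    raise RuntimeError(f"upstream registry unavailable: {last_error}")

if __name__ == "__main__":
    print(fetch_with_retry("https://registry.npmjs.org/left-pad").status_code)
```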

Report: "Elevated errors"

Last update
resolved

Everything back to normal. Resolving.

monitoring

Failover complete. Errors subsiding. Monitoring.

identified

We have a database degradation issue triggering a failover.

investigating

We are currently investigating this issue.

Report: "Elevated errors"

Last update
resolved

We've restored all affected services.

monitoring

Repo services and dashboard are back to normal. We are still working to fix static pages (landing pages and documentation). You can access your dashboard by going directly to https://manage.fury.io

monitoring

We've fixed the incorrect DNS records. Error rates returning to normal.

identified

We've identified this as a DNS misconfiguration.

investigating

We are investigating elevated repo errors

Report: "Platform issues causing instability"

Last update
resolved

Confirmed as resolved.

monitoring

Dashboard live updates have been restored as well, subject to the same DNS caching delays. We're going to continue to monitor error rates as the clients transition to the new endpoint.

identified

The issue was identified as a partial outage at our platform provider. We cannot address this issue directly, but we were able to implement a bypass and partially restore service to Version Badge and Content Explorer. It may take clients an hour or two to see the Version Badge return due to DNS caching. We are still working to restore Dashboard live updates.

investigating

The problem also seems to extend to Dashboard live updates, and Version Badge public service.

investigating

We are investigating consistent internal errors for our Content Explorer backend.

Report: "Elevated errors"

Last update
resolved

Database has auto-recovered. Errors back to normal. Resolving.

monitoring

We've experienced a period of elevated error rates due to a database issue. Errors have subsided after switching to the failover. Looking into the database issue.

Report: "SSL Certificate Expired"

Last update
resolved

Everything looks good. Resolving.

monitoring

All certificates replaced. Verifying availability.

identified

Updated certificate for Repos, API, Git server, and Dashboard. Fixing certificate for push server.

identified

We're renewing the certificate now.
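As an illustration of how expiring certificates can be caught ahead of time, a small standard-library check can report each endpoint's notAfter date and flag anything close to expiry. The hostnames and warning window below are examples only, not an authoritative list of Gemfury services.

```python
# Illustrative sketch: report certificate expiry dates for several endpoints
# so an expiring certificate is caught before it lapses. Standard library
# only; hostnames and the warning window are example placeholders.
import socket
import ssl
from datetime import datetime, timedelta, timezone

HOSTS = ["repo.fury.io", "api.fury.io", "git.fury.io", "manage.fury.io"]  # examples
WARN_WINDOW = timedelta(days=14)  # hypothetical warning window

def cert_expiry(hostname: str, port: int = 443) -> datetime:
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for host in HOSTS:
        expires = cert_expiry(host)
        flag = "EXPIRING SOON" if expires - now < WARN_WINDOW else "ok"
        print(f"{host}: expires {expires:%Y-%m-%d} ({flag})")
```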

Report: "Sporadic errors"

Last update
resolved

Error rates are holding low. We will resolve this incident while continuing to investigate the root cause with our platform provider. We will separately investigate and address the issue of subpar incident response.

monitoring

A number of repo instances have crashed. Restarted crashed instances, and scaled up. This should have been handled automatically - looking into why this did not happen.

investigating

We are continuing to investigate this issue.

investigating

We are looking into reports of sporadic repo and dashboard errors

Report: "Error spike"

Last update
resolved

This incident has been resolved.

monitoring

Outage was due to a bad configuration update. We've rolled it back to investigate.

identified

We've identified our cache as the source of the errors.

investigating

We are investigating a 5xx error spike

Report: "Upload processing delays"

Last update
resolved

We were able to stabilize uploads, and we'll keep this as an area of focus. Resolving.

monitoring

We're starting to see some progress. Workers keeping up with incoming work. Still monitoring

monitoring

We've sped up one hot path by about 75%, and provisioned larger workers. Still seeing delays.

identified

We've found code that has terrible performance for packages with many versions. Multiple background jobs containing this code consume all the worker slots, preventing other jobs from being processed. (Edit: these are the jobs that bring your uploads into your dashboard and indexes.)
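One generic mitigation for this kind of starvation is to route known-heavy jobs to a small dedicated pool so regular jobs always have free slots. The sketch below illustrates the pattern with the Python standard library; the job function, version-count threshold, and pool sizes are hypothetical and not Gemfury's actual job system.

```python
# Illustrative sketch: keep a few slow jobs from starving a shared worker
# pool by routing known-heavy work to a small dedicated pool. Generic
# standard-library pattern; all names and sizes are hypothetical.
import time
from concurrent.futures import ThreadPoolExecutor

fast_pool = ThreadPoolExecutor(max_workers=8)   # regular upload/index jobs
heavy_pool = ThreadPoolExecutor(max_workers=2)  # packages with many versions

def index_package(name: str, version_count: int) -> str:
    # Stand-in for real indexing work; heavier packages take longer.
    time.sleep(0.01 * version_count)
    return f"indexed {name} ({version_count} versions)"

def submit(name: str, version_count: int):
    pool = heavy_pool if version_count > 1000 else fast_pool
    return pool.submit(index_package, name, version_count)

if __name__ == "__main__":
    futures = [submit("big-package", 5000)] + [submit(f"pkg-{i}", 3) for i in range(20)]
    for future in futures:
        print(future.result())
```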

investigating

Issue is quite difficult to reproduce. We have implemented additional instrumentation to capture the issue when it happens in production. We will keep this issue open over the next period of high traffic.

investigating

We are investigating upload delays due to locked up background job workers.

Report: "npm-proxy returns 404 for some proxied packages"

Last update
resolved

This incident has been resolved

monitoring

The public registry for npm is having an incident. We are following it here: https://status.npmjs.org/

Report: "Networking errors in upload hosts"

Last update
resolved

DNS issues have been resolved upstream. Everything is back to normal here. We will be looking into how to mitigate this better in the future.

investigating

Errors have dissipated. Services running normally. Monitoring for any changes.

investigating

Error rates improving, service returning to normal

investigating

The outage appears to be caused by sporadic DNS failures when accessing the Gemfury API and other services. Note that Cloudflare DNS is currently down, which may be related.

investigating

Investigating elevated networking errors on upload processing hosts

Report: "Delayed index builds"

Last update
resolved

Everything back to normal.

monitoring

We are monitoring the results.

identified

We've spun up more workers, and they are now catching up with the uploads.

investigating

Due to elevated activity, upload processing and index updates are delayed. We're working on scaling up capacity.

Report: "403 errors"

Last update
resolved

Everything looks better now. We're going to close the incident while we continue to investigate the root cause.

monitoring

We were still seeing errors, so we're turning off the optimizations again.

monitoring

403 errors are back below 0.5%, but as predicted, latencies are up. We're going to slowly bring back some optimizations.

investigating

Seeing spikes up to 9% of requests, from a baseline of 0.5-1%. Rolling back to an earlier build, and disabling experimental optimizations. Bigger accounts may experience greater latencies.

investigating

Investigating reports of sporadic 403 errors for valid tokens, and possible slowness.

Report: "Elevated errors"

Last update
resolved

This incident has been resolved.

investigating

Platform provider incident resolved upstream at 17:31 UTC

investigating

The uploads queue has been processed and is back to normal now. Still seeing some elevated latencies on repo endpoints.

investigating

Queue has caught up. Indexing recent uploads now.

investigating

Update from our platform provider: platform issues starting around 16:22 UTC

investigating

Index rebuild queue has run away. Pausing new index rebuilds to allow workers to catch up.

investigating

We are continuing to investigate this issue.

investigating

Investigating elevated errors and reports of delayed upload processing.

Report: "Reemergence of platform issues"

Last update
resolved

This incident has been resolved.

monitoring

We are seeing sporadic empty responses from our endpoints due to the recurrence of the platform issue. We are going to keep monitoring their team's progress and make sure we don't lose anything important.

Report: "Platform networking issues (continued)"

Last update
resolved

This incident has been resolved.

identified

We are reopening an incident to continue following a networking event with our platform provider. Although better than yesterday, we are still getting customer reports of sporadic connection issues.