Uploadcare

Is Uploadcare Down Right Now? Check whether there is an ongoing outage.

Uploadcare is currently Operational

Last checked from Uploadcare's official status page

Historical record of incidents for Uploadcare

Report: "Unavailability of uploading from 3rd Party Services"

Last update
postmortem

## Incident Summary

On November 25, 2024, from 16:01 to 19:44 UTC, the “Uploading from 3rd Party Services” feature of our platform was unavailable. This feature allows users to upload files from social networks and storage services like Dropbox, Facebook, Google Drive, and Instagram.

## Timeline

* **15:50 UTC** – A server configuration update was initiated to streamline deployments in our Kubernetes environment.
* **16:01 UTC** – The outage began as requests to the “Uploading from 3rd Party Services” feature failed to be processed.
* **19:00 UTC** – Investigation revealed that the application networking was incorrectly configured.
* **19:15 UTC** – A fix was implemented to re-configure the application networking.
* **19:44 UTC** – The incident was fully resolved, and service was restored.

## Root Cause

The outage was caused by a synchronization issue between repositories during a server configuration update. This misalignment between repositories, compounded by a lack of explicit configuration, led to the service outage.

## Impact

For approximately 3 hours and 43 minutes, users were unable to upload files from third-party services.

## Challenges During Resolution

This type of issue is difficult to catch in a staging environment because the traffic profile does not reflect production-level demand. Although the alert system worked as intended, delays in responding to the alert contributed to the resolution time.

## Resolution

The application configuration was updated and service functionality was fully restored at 19:44 UTC.

## Action Items

### Short-term

* Improve feature status monitoring and alerting to detect outages faster.

### Long-term

* Improve synchronization processes between repositories to avoid dependency misalignments.

We sincerely apologize for the disruption this caused to our users. We take this incident seriously and are committed to implementing the above action items to prevent similar occurrences in the future. Thank you for your patience and continued trust in our platform. If you have any questions or need further details, please don’t hesitate to reach out.

resolved

Issue has been resolved

Report: "Elevated URL API Errors"

Last update
postmortem

## Incident Summary

On 26 November 2024, an issue with our URL API was identified, causing a partial outage of the service from 14:31 to 15:00 UTC.

## Timeline

* **14:20 UTC** – A CDN configuration change was made to improve the Auto Format feature in the URL API.
* **14:31 UTC** – Service degradation began as increased traffic overwhelmed our origin servers due to cache key changes.
* **14:37 UTC** – Alert notifications were received by our team.
* **14:58 UTC** – The CDN configuration was rolled back.
* **15:00 UTC** – The incident was fully resolved, and service was restored.

## Root Cause

The issue was caused by a CDN configuration change that altered cache key behavior, significantly increasing traffic to our origin servers. This overwhelmed our autoscaling groups, which hit their scaling limits, resulting in a service outage.

## Impact

URL API degradation lasted for approximately 29 minutes.

## Challenges During Resolution

This type of issue is difficult to catch in a staging environment because the traffic profile does not reflect production-level demand. Although the alert system worked as intended, delays in responding to the alert contributed to the resolution time.

## Resolution

The configuration change was identified as the root cause and rolled back. The service stabilized immediately after the rollback, and full functionality was restored by 15:00 UTC.

## Action Items

### Short-term

* Review and improve escalation processes for alert notifications to reduce resolution time.
* Adjust staging environment testing to better simulate production traffic patterns for similar changes.

### Long-term

* Assess autoscaling limits to ensure sufficient capacity for handling unexpected traffic spikes.

We sincerely apologize for the disruption and are committed to learning from this incident to improve the reliability of our services. Thank you for your understanding and continued support.
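For context on the root cause above: when a cache key gains an extra dimension (for example, a request header), previously shared cache entries fragment and a larger share of requests falls through to the origin. The following is a minimal, self-contained Python simulation of that effect; the URL counts, the Accept-header variants, and the request mix are illustrative assumptions, not Uploadcare's actual CDN configuration.

```python
# Toy model: how widening a CDN cache key fragments the cache and
# multiplies origin (cache-miss) traffic. All numbers are illustrative.
import random

def origin_requests(requests, key_fn):
    """Count requests that miss an initially empty cache keyed by key_fn."""
    cache, misses = set(), 0
    for req in requests:
        key = key_fn(req)
        if key not in cache:
            misses += 1          # cache miss -> request goes to the origin
            cache.add(key)
    return misses

random.seed(1)
urls = [f"/image-{i}.jpg" for i in range(1_000)]
accept_variants = ["image/avif", "image/webp", "image/*"]  # assumed client mix
requests = [(random.choice(urls), random.choice(accept_variants))
            for _ in range(100_000)]

before = origin_requests(requests, key_fn=lambda r: r[0])         # key = URL only
after = origin_requests(requests, key_fn=lambda r: (r[0], r[1]))  # key = URL + Accept

print(f"origin requests, URL-only cache key:   {before}")
print(f"origin requests, URL+Accept cache key: {after} "
      f"({after / before:.1f}x more load on the origin)")
```

With these assumed numbers the origin sees roughly three times more traffic after the change, which is the kind of spike that can exhaust autoscaling headroom.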

resolved

This incident has been resolved.

investigating

Systems are back to normal

investigating

We're experiencing an elevated level of URL API errors and are currently looking into the issue.

Report: "Webhook Service Degradation"

Last update
postmortem

## Incident Summary

On 25 September 2024, an issue with webhook delivery was identified, affecting clients between 10:07 and 12:19 UTC. The delay impacted webhook notifications, with no data loss but a significant delay in processing and delivery.

## Timeline

* 10:07 UTC – A system configuration change was made, which inadvertently disrupted webhook processing.
* 10:07 UTC – Webhook delivery issues began.
* 12:05 UTC – The problem was identified and resolved, with backlogged webhooks being processed.
* 12:12 UTC – The first webhook was successfully delivered after the fix.
* 12:19 UTC – All queued events were processed, with delivery confirmed for all affected users.

## Root Cause

The issue was caused by a configuration change that resulted in the webhook delivery system not processing events correctly. Despite initial signs of system health, the disruption went undetected due to gaps in the system’s monitoring tools.

## Impact

* Webhook delivery was delayed for approximately 2 hours.
* Customers experienced delays in receiving event notifications.
* No data was lost, but delivery delays were significant due to a backlog in event processing.

## Challenges During Resolution

* Monitoring systems indicated that components of the webhook system were healthy, which delayed identification of the underlying problem.

## Resolution

* Webhook processing was restarted, and we verified that all queued events were delivered without any data loss.
* The incident was fully resolved by 12:19 UTC, with all webhooks processed and delivered.

## Action Items

### Short-term

* Improve the system’s monitoring and alerting to better detect issues with webhook processing.

### Long-term

* Explore options to improve the resilience of our webhook delivery system, including scaling the infrastructure to better handle failures.
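One short-term action item is better detection of webhook processing stalls. Since component health checks alone missed this incident, a complementary signal is end-to-end delivery lag: the age of the oldest undelivered event. The sketch below is a hypothetical illustration of that idea; it assumes a `pending_events()` helper that returns enqueued-but-undelivered events with their enqueue timestamps, and it is not Uploadcare's actual monitoring code.

```python
# Hypothetical delivery-lag check: alert on the age of the oldest
# undelivered webhook event rather than on component health alone.
from datetime import datetime, timedelta, timezone

LAG_THRESHOLD = timedelta(minutes=5)  # assumed alerting threshold

def pending_events():
    """Stand-in for a queue query returning (event_id, enqueued_at) pairs."""
    return [("evt_1", datetime.now(timezone.utc) - timedelta(minutes=12))]

def check_webhook_delivery_lag() -> str:
    now = datetime.now(timezone.utc)
    pending = pending_events()
    if not pending:
        return "ok: no pending webhook events"
    oldest_id, enqueued_at = min(pending, key=lambda e: e[1])
    lag = now - enqueued_at
    if lag > LAG_THRESHOLD:
        # In production this would page the on-call engineer.
        return f"ALERT: event {oldest_id} has been pending for {lag}"
    return f"ok: max delivery lag is {lag}"

if __name__ == "__main__":
    print(check_webhook_delivery_lag())
```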

resolved

This incident has been resolved. We apologize for any inconvenience this may have caused.

investigating

We're experiencing a slowdown in our Webhooks service.

Report: "REST API and Upload API outage"

Last update
postmortem

## REST and Upload API service degradation (incident #wvjpwt1qtpkn)

**Date**: 2024-07-08

**Authors**: Arsenii Baranov

**Status**: Complete

**Summary**: From 09:01:53 to 09:33:12 UTC we experienced higher latencies and, eventually, a complete outage of the Upload and REST APIs.

**Root Causes**: PostgreSQL performance degradation due to human error.

**Trigger**: Misconfiguration.

**Resolution**: Changed our DBA-related processes and fixed a monitoring issue.

**Detection**: Our Customer Success team detected the issue and escalated it to the Engineering team.

**Action Items**:

| Action Item | Type | Status |
| --- | --- | --- |
| Fix monitoring misconfiguration | mitigate | **DONE** |
| Improve DBA maintenance approach | prevent | **DONE** |

## Lessons Learned

**What went well**

* Due to the distributed nature of Uploadcare, this incident had no effect on most of our services. The degradation didn’t affect storage, processing, or serving of files already stored by the Uploadcare CDN.
* Our incident mitigation strategy was right and worked immediately.

**What went wrong**

* The incident was detected manually rather than automatically, due to an alert misconfiguration.
* We failed to process API requests during the incident.

## Timeline

2024-07-08 _(all times UTC)_

* 08:00 Database maintenance started
* 09:01 **SERVICE DEGRADATION BEGINS**
* 09:23 Our Customer Success team escalates the issue to the Infrastructure team
* 09:29 Issue localised and fixed
* 09:33 **SERVICE DEGRADATION ENDS**

resolved

2024-07-08 09:01:53 UTC: the REST API and Upload API became unavailable due to a system misconfiguration. 2024-07-08 09:33:12 UTC: we identified the source of the problems and deployed fixes. Service has recovered. We are monitoring the situation.

Report: "Service degradation"

Last update
postmortem

## Upload API and video processing services degradation (incident #5r4zj8shr69c)

**Date**: 2023-10-02

**Authors**: Alyosha Gusev, Denis Bondar

**Status**: Complete

**Summary**: From 14:15 to 16:45 UTC we experienced higher latencies for the Upload API and video processing due to exceptionally high demand for these services.

**Root Causes**: Cascading failure caused by an exceptionally high volume of requests to the Upload API.

**Trigger**: Latent bug triggered by a sudden traffic spike.

**Resolution**: Changed our throttling policies and increased resources for processing.

**Detection**: Our Customer Success team detected the issue and escalated it to the Engineering team.

**Action Items**:

| Action Item | Type | Status |
| --- | --- | --- |
| Test corresponding alerts for correctness | mitigate | **DONE** |
| Improve our upload processing system to remove the bottleneck that we found | prevent | **DONE** |
| Fix service access issues for team members that form potential response teams | mitigate | **DONE** |

## Lessons Learned

**What went well**

* Due to the distributed nature of Uploadcare, this incident had no effect on most of our services. The degradation didn’t affect storage, processing, or serving of files already stored by the Uploadcare CDN.
* Our incident mitigation strategy was right and worked immediately.

**What went wrong**

* The incident was detected manually rather than automatically, due to an alert misconfiguration.
* Due to hardened security standards in our organisation, not all incident responders had access to Statuspage to update our customers in a timely manner.

## Timeline

2023-10-02 _(all times UTC)_

* 14:15 Our upload processing queue starts filling
* 14:20 **SERVICE DEGRADATION BEGINS**
* 15:23 Our Customer Success team escalates the issue to the Infrastructure team
* 15:31 Issue localised
* 15:41 Incident response team is formed
* 15:51:13 Adjusted our throttling policies
* 15:51:38 Increased the number of processing instances
* 16:40 **SERVICE DEGRADATION ENDS** Processing queues clear
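The resolution above mentions adjusting throttling policies so that a sudden spike from one source cannot fill the upload processing queue for everyone. As an illustration of that general approach (not Uploadcare's actual implementation), here is a minimal per-project token-bucket rate limiter in Python; the refill rate and burst capacity are assumed values.

```python
# Minimal per-project token-bucket rate limiter, as an illustration of
# request throttling. Rates and capacities below are assumptions.
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate: float              # tokens added per second
    capacity: float          # maximum burst size
    tokens: float = 0.0
    updated_at: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated_at) * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def allow_upload(project_id: str, rate: float = 50.0, burst: float = 100.0) -> bool:
    """Return True if this project may enqueue another upload right now."""
    bucket = buckets.setdefault(
        project_id, TokenBucket(rate=rate, capacity=burst, tokens=burst))
    return bucket.allow()

# Example: a burst of 250 requests from one project; only ~100 pass immediately,
# the rest are rejected until the bucket refills.
allowed = sum(allow_upload("project_a") for _ in range(250))
print(f"{allowed} of 250 burst requests accepted immediately")
```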

resolved

From 14:15 to 16:45 UTC we experienced higher latencies for from_url uploads and video processing. We’ve identified the source of the problem, eliminated it, and are monitoring the situation. These services are fully functional now.

Report: "Minor URL API degradation"

Last update
postmortem

# -/json/ and -/main_colors/ operations returning 400 status

**Date**: 2023-11-21

**Authors**: Alyosha Gusev

**Status**: Complete, action items in progress

**Summary**: From 2023-11-21 17:40 to 2023-11-22 18:40 UTC, the -/json/ and -/main_colors/ operations returned a 400 status.

**Root Causes**: HTTP rewrites misconfiguration.

**Trigger**: Deploy of new functionality that enables our customers to save more money on traffic and end users to experience lower load times (improvements to automatic image optimisation).

**Resolution**: Bugfix deploy.

**Detection**: Our automatic tests detected the issue.

**Action Items**:

| Action Item | Type | Status |
| --- | --- | --- |
| Fix rewrites | mitigate | **DONE** |
| Improve visibility of failing-test notifications | prevent | **DONE** |

## Lessons Learned

**What went well**

* The problem was detected by automatic tests.
* The bugfix was deployed immediately.

**What went wrong**

* We didn’t find out immediately that the tests were failing.

## Timeline

All times UTC

* 2023-11-21 17:40: **SERVICE DEGRADATION BEGINS**. Deploy of the new functionality
* 2023-11-22 16:18: Team notices failing tests
* 2023-11-22 16:19: Issue localised, incident response team is formed
* 2023-11-22 16:19: Team starts bugfix implementation
* 2023-11-22 17:40: Fix deployed to production
* 2023-11-22 18:40: Last cached error response expired
* 2023-11-22 18:40: **SERVICE DEGRADATION ENDS**
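The failing operations were caught by automated tests rather than by users, and one action item is making those failures more visible. A minimal external smoke test for this class of regression might look like the sketch below; it assumes the `requests` library, and `FILE_UUID` is a hypothetical placeholder rather than a real file. The check simply asserts that -/json/ and -/main_colors/ return a 2xx status and exits non-zero otherwise.

```python
# Minimal external smoke test for URL API operations such as -/json/ and
# -/main_colors/. FILE_UUID is a placeholder, not a real file.
import sys
import requests

CDN_BASE = "https://ucarecdn.com"
FILE_UUID = "00000000-0000-0000-0000-000000000000"  # hypothetical test file
OPERATIONS = ["-/json/", "-/main_colors/"]

def check_operation(op: str) -> bool:
    url = f"{CDN_BASE}/{FILE_UUID}/{op}"
    resp = requests.get(url, timeout=10)
    ok = 200 <= resp.status_code < 300
    print(f"{'OK ' if ok else 'FAIL'} {resp.status_code} {url}")
    return ok

if __name__ == "__main__":
    results = [check_operation(op) for op in OPERATIONS]
    # Exit non-zero so CI / monitoring notices the failure immediately.
    sys.exit(0 if all(results) else 1)
```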

resolved

2023-11-21 17:40 UTC: -/json/ and -/main_colors/ operations began returning a 400 status.

Report: "CDN"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

We observed heightened latency and error rates for uncached requests to our CDN and image processing API from 6:00 to 6:45 UTC. This was primarily due to a surge in server load and a deviation in our capacity scaling mechanism. We are currently still experiencing heightened load; the capacity scaling mechanism has been fixed.

Report: "Service degradation"

Last update
resolved

We experienced a service degradation from 05:08 to 05:10 UTC. The source of the problem was identified, and we are working on a fix to prevent recurrence.

Report: "Higher latencies of from_url uploads"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We’re experiencing higher latencies of from_url uploads. Direct uploads are fully functional. We’ve identified the source of the problem and are working on the fix.

Report: "Upload degradation from cloud services"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We’re experiencing service degradation of uploads from cloud services (e.g. Facebook, Google Drive). We're currently investigating the issue. Direct uploads are fully functional.

Report: "Higher latencies of from_url uploads"

Last update
resolved

From 12:29 to 13:31 UTC we’ve experienced higher latencies of from_url uploads. Direct uploads were fully functional. We’ve identified the source of the problem, eliminated it and are monitoring the situation. The service is fully functional now.

Report: "Elevated CDN 5xx Errors"

Last update
resolved

Between 14:30 and 14:51 UTC customers may have experienced elevated levels of errors.

Report: "Webhooks operational issue"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are experiencing elevated error rates for the Webhooks service. We have identified the root cause and are actively working towards recovery.

Report: "Issue with DNS resolving for *.ucr.io domains"

Last update
resolved

This incident has been resolved.

monitoring

DNS resolution has recovered. We are still monitoring.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating the issue with DNS resolving.

Report: "DNS issues"

Last update
resolved

No more issues with DNS resolving were detected.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are experiencing issues with DNS resolution for *.uploadcare.com.

Report: "Elevated CDN 5xx Errors"

Last update
resolved

Between 16:38 and 16:43 UTC customers may have experienced elevated levels of errors.

Report: "Webhook issues for SVG files"

Last update
resolved

This incident has been resolved. All webhooks have been delivered as expected.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We're working on the fix.

identified

We're experiencing issues with webhook delivery for SVG uploads. The main Upload API remains fully functional.

Report: "Partial degradation of from_url uploads"

Last update
resolved

From 15:44 to 15:55 UTC we had a bad configuration deployment that resulted in partial degradation of our uploads service. The service is fully functional now.

monitoring

Starting at 15:44 UTC we experienced partial degradation of our uploads. We’ve identified the source of the problems, implemented fixes, and are monitoring the situation.

investigating

We're experiencing issues with from_url uploads.

Report: "Partial degradation on uploads, api"

Last update
resolved

This incident has been resolved.

monitoring

We experienced issues with our uploads and API systems from 7:51 to 7:56 UTC. We have identified the source of the problems, implemented a fix, and are monitoring the situation.

Report: "Partial degradation of image processing and file delivery"

Last update
resolved

From 8:20 to 9:20 UTC we experienced a partial outage of our image processing and file delivery due to degradation of our storage subsystem. The issue is resolved. All systems are working properly now.

monitoring

The issue is resolved. We are monitoring the situation.

identified

We have identified the issue.

identified

We have identified the issue.

investigating

We're experiencing issues with image processing and file delivery.

Report: "Video Processing System Maintenance"

Last update
resolved

This incident has been resolved.

identified

Our video processing vendor is performing database maintenance between 6 am and 11 am UTC. They're trying to minimize the impact on their service.

Report: "Website partial outage."

Last update
resolved

This incident has been resolved.

identified

Still waiting for Netlify to fix things on their end. Core Uploadcare services are not impacted. Uploads, REST API, CDN, document/video conversion engines are working.

identified

Due to issues at our vendor, the website is partially down.

Report: "Higher latency and increased error rate on CDN and upload API"

Last update
resolved

This incident has been resolved.

monitoring

This incident has been resolved.

monitoring

From 07:25 to 07:45 UTC we experienced higher latency and an increased error rate on AWS S3, resulting in partial degradation of our content delivery service and Upload API. We are monitoring the situation.

Report: "Expired certificate on ucarecdn.com"

Last update
postmortem

# What happened

On 19th July 2019 at 15:30 UTC the certificate used to serve traffic from [ucarecdn.com](http://ucarecdn.com) and [www.ucarecdn.com](http://www.ucarecdn.com) expired. All HTTPS traffic to these domains effectively stopped. We were able to quickly (within minutes) redirect [www.ucarecdn.com](http://www.ucarecdn.com) traffic to an alternative CDN provider with a proper certificate installed. It took us approximately 105 minutes from the start of the incident to fix the issue with [ucarecdn.com](http://ucarecdn.com).

![](https://ucarecdn.com/d07b72c7-36e0-4328-b5b9-902108fff1e9/-/preview/400x300/traffic)

# Why that happened

1. Our CDN partner's certificate management system failed to renew the certificate in question and needed input from our team.
2. It did send an email 5 days prior to expiration requesting manual input.
3. This email wasn't noticed, because it was sent to a team member who was off the grid on vacation.

# What we should do to improve

1. Change notification system settings so any issues with certificates at our CDN partners are sent to a team, not a person [done]
2. Use a 3rd party service to monitor certificate expiration and other settings [done]
3. Use CDN partner's APIs to actively and automatically monitor certificates [in progress]
4. Create a service that could be used to quickly redirect [ucarecdn.com](http://ucarecdn.com) traffic to backup CDNs in case of any issues with the main one (not only due to certificates)
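Two of the follow-up items involve monitoring certificate expiration independently rather than relying on a single provider's reminder email. The sketch below shows one way to do that check from the outside using only Python's standard library; the seven-day alert threshold is an assumed value, and this is an illustration rather than the monitoring Uploadcare actually deployed.

```python
# Check how many days remain until a host's TLS certificate expires.
# Standard library only; the 7-day threshold is an assumed value.
import socket
import ssl
from datetime import datetime, timezone

ALERT_THRESHOLD_DAYS = 7

def days_until_expiry(hostname: str, port: int = 443) -> float:
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # notAfter looks like 'Jul 19 15:30:00 2019 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400

if __name__ == "__main__":
    for host in ("ucarecdn.com", "www.ucarecdn.com"):
        days = days_until_expiry(host)
        status = "ALERT" if days < ALERT_THRESHOLD_DAYS else "ok"
        print(f"{status}: {host} certificate expires in {days:.1f} days")
```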

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

Report: "REST API partial outage"

Last update
resolved

This incident has been resolved.

monitoring

Starting at 09:20 UTC we've experienced partial outage of our REST API. We've identified the source of problems, deployed fixes and are monitoring the situation.

Report: "Upload service degradation."

Last update
postmortem

On December 12, we had a degradation of our Upload API. Most users were unable to upload files for 3 hours 11 minutes, between Dec 12, 22:39 GMT and Dec 13, 01:50 GMT.

## What happened

Requests to the Upload API were either handled extremely slowly or (most of them) rejected by our web workers. Scaling up our Upload API fleet didn't help.

## What really happened

Further investigation revealed that:

* Slow requests were consuming all available web workers, and without available workers requests were rejected by nginx.
* The requests that were handled were slow due to constant database locks on one database table.
* The database locks were caused by a dramatic change in tracked usage statistics (a change of project settings by one of our largest customers).

We spent most of the time during the incident on investigating and figuring out what was happening. Actual DB load was average, and the DB was wrongly dismissed as the source of the issues at first. Once we found the root cause, the fix was trivial and took minutes to implement and deploy.

## What we have done

We turned off usage tracking for that particular customer.

## What we will do

* Refactor statistics tracking so it does not affect our core service.
* Add more specific monitors to our DB, so we can identify problems of a similar nature much faster.
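The planned refactoring is to move statistics tracking off the request path so that contention on a single statistics table can no longer exhaust web workers. A common shape for that change is to have request handlers only enqueue a counter increment and let a background worker apply increments in batches; the sketch below illustrates the pattern with an in-process queue and is an assumption about the general approach, not Uploadcare's actual code.

```python
# Pattern sketch: record usage by enqueueing increments instead of updating
# a hot database row inside the request handler. Illustrative only.
import queue
import threading
import time
from collections import Counter

usage_queue: "queue.Queue[tuple[str, int]]" = queue.Queue()

def handle_upload(project_id: str) -> None:
    """Request handler: enqueue the usage increment and return immediately."""
    usage_queue.put((project_id, 1))

def flush_usage_forever(interval: float = 1.0) -> None:
    """Background worker: batch increments and write them once per interval."""
    while True:
        time.sleep(interval)
        batch = Counter()
        while not usage_queue.empty():
            project_id, amount = usage_queue.get_nowait()
            batch[project_id] += amount
        if batch:
            # One aggregated write per project instead of one row lock per request.
            print(f"flushing usage counters: {dict(batch)}")

if __name__ == "__main__":
    threading.Thread(target=flush_usage_forever, daemon=True).start()
    for _ in range(5):
        handle_upload("project_a")
    handle_upload("project_b")
    time.sleep(1.5)   # give the background worker time to flush once
```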

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We're experiencing issues with our Upload API.

Report: "Partial Upload and CDN outage"

Last update
postmortem

We've experienced a partial outage due to issues at AWS, our network infrastructure provider. Starting from approximately 20:07 till 20:25 UTC there were increased error rates for the Upload API and on CDN origins.

Due to the nature of the issue, our error reporting systems didn't notice the Upload API problems:

* all reported metrics were OK
* there were no anomalies such as a drop in upload success rates or peaks in upload error rates

We were only aware of the CDN origin problems until some of our customers reported issues with uploading. The underlying AWS issue is quoted below:

> Instance launch errors and connectivity issues
>
> 01:49 PM PDT Between 1:06 PM and 1:33 PM PDT we experienced increased error rates for new instance launches and intermittent network connectivity issues for some running instances in a single Availability Zone in the US-EAST-1 Region. The issue has been resolved and the service is operating normally

resolved

We've experienced a partial outage. Starting from approximately 20:07 till 20:25 UTC there were increased error rates for the Upload API and on CDN origins.

Report: "Upload API downtime"

Last update
resolved

This incident has been resolved.

monitoring

There was an 8-minute downtime starting from 18:21 UTC. What happened: we're trying to mitigate Roscomnadzor's carpet bombing of our Russian clients and end users. As one of the measures, we changed our load balancing settings but missed one critical config setting, which caused the downtime during our regular code deployment. We've fixed the setting and are monitoring the situation.

Report: "Increased CDN error rates"

Last update
resolved

This incident has been resolved.

monitoring

The issues were caused by network connectivity problems in the AWS us-east-1 region.

investigating

Our CDN origin fleet is experiencing increased error rate.

Report: "Increased REST API error rates"

Last update
resolved

We've encountered increased error rates on our REST API endpoints. This resulted in reduced reported uptime. In fact, even though uptime suffered, it wasn't as bad as reported.

What happened:

- from February 25 22:00 UTC to February 26 05:50 UTC, error rates on REST API endpoints were increased

Why that happened:

- one of the machines in the REST API fleet ran out of memory
- due to the OOM, the machine was unable to handle any incoming requests
- a misconfigured health check prevented the load balancer from getting rid of the failing machine
- a portion of all requests, including the Pingdom requests that report our uptime, was sent to that failing machine

What we've done:

- tracked down and terminated the failing machine
- fixed the health check configuration
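The fix included correcting the health check so the load balancer actually removes a machine that can no longer serve requests. The sketch below illustrates the difference between a check that only proves the process is up and one that reflects the machine's ability to serve; the memory threshold and the Linux-only memory probe are assumptions for illustration, not Uploadcare's configuration.

```python
# Sketch of a "deep" health check: fail when the machine cannot actually
# serve requests (e.g. under memory pressure), so the load balancer drops it.
# The 5% free-memory threshold and /proc/meminfo probe are assumptions.
MIN_FREE_MEMORY_FRACTION = 0.05

def free_memory_fraction() -> float:
    """Read available vs. total memory from /proc/meminfo (Linux only)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])  # values are in kB
    return info["MemAvailable"] / info["MemTotal"]

def health_check() -> tuple[int, str]:
    """Return (status_code, body) for a load balancer health probe."""
    try:
        free = free_memory_fraction()
    except OSError as exc:
        return 500, f"unhealthy: cannot read memory stats ({exc})"
    if free < MIN_FREE_MEMORY_FRACTION:
        # A 5xx here tells the load balancer to stop routing to this machine.
        return 503, f"unhealthy: only {free:.1%} memory available"
    return 200, f"ok: {free:.1%} memory available"

if __name__ == "__main__":
    print(health_check())
```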

Report: "Increased error rate for Upload API"

Last update
resolved

This incident has been resolved.

monitoring

Our diagnostic tools show that all subsystems are back to normal. We continue monitoring the situation.

monitoring

Error rates are decreasing.

identified

It appears that the increased error rate is connected with an increased error rate for S3.

investigating

We are currently investigating this issue.

Report: "Upload API was down from 03:04 to 05:16"

Last update
resolved

Due to an error in infrastructure configuration, the queue responsible for uploading files was not working properly.

Report: "File copying is now working"

Last update
resolved

We have found and fixed the issue with backing up stored files to S3. All files that were not backed up during the incident will be backed up eventually. Automatic copying of uploaded files to S3 buckets (Custom storage) was not affected by the issue.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Automatic file copying is not working at the moment, we're investigating the cause of this.

Report: "Uploadcare services have degraded performance"

Last update
resolved

Currently, the Uploadcare service is fully operational. We continue to monitor the situation closely.

monitoring

Upload API is up again. We continue monitoring all components.

identified

Upload API went down.

monitoring

AWS Autoscaling is still down, we're deploying new instances manually.

monitoring

AWS has reported that everything is green. We're working on recovering the affected Uploadcare components. Currently, everything should work, but higher latency and error rates are expected.

identified

AWS has restored access to existing S3 objects, so the CDN and a subset of the REST API are working at the moment. More details on the AWS status: https://status.aws.amazon.com/

identified

AWS S3 is experiencing increased error rates, which results in all of our buckets going AWOL.

investigating

Our CDN is down. Investigating.

Report: "Increased error rate on CDN"

Last update
resolved

CDN provider switch resolved the issue and went smoothly.

monitoring

CDN provider switch is complete. We continue monitoring the situation.

identified

We're experiencing increased error rates on our CDN. Switching to another CDN provider.

Report: "REST API Increased latency."

Last update
resolved

The issue is resolved. REST API latency is back to normal.

identified

Since 09:30 UTC we've been experiencing increased latency for our REST API endpoints. We've found the reason and are working on resolving the issue.

Report: "Upload API is down."

Last update
resolved

This incident has been resolved.

identified

Upload API is not working. The problem is identified and the fix is being deployed.

Report: "Partial CDN service degradation for Russian users."

Last update
resolved

This incident has been resolved.

monitoring

We've experienced CDN service degradation for some of our end users that get content from edge servers in Moscow, Russia. We've identified and resolved the issue. Currently, we're monitoring the situation. The incident took place between 19:00 and 22:00 UTC.

Report: "Increased error rate on REST API"

Last update
resolved

This incident has been resolved.

monitoring

We've fixed the issue and continue monitoring the REST API servers.

investigating

We are currently investigating this issue.

Report: "Direct uploads to S3 buckets"

Last update
resolved

The fix is deployed to our upload servers.

identified

Currently, direct uploads to clients' buckets located in regions other than 'standard' don't work. We've identified the cause of the issue and are working on a fix.

Report: "Increased error rate during uploads."

Last update
resolved

We've found and fixed the issue that was causing upload latency.

investigating

We're seeing increased error rate and latency during file upload. Investigating.

Report: "Increased error rate for group files on CDN"

Last update
resolved

This incident has been resolved.

monitoring

Due to AWS service degradation, we're experiencing an increased error rate for "file in group" requests on the CDN.