Historical record of incidents for api.video
Report: "Livestream Issues"
Last update: This incident has been resolved.
We've identified the root cause of the issue. Our infrastructure team deployed a fix at 11:08 UTC, and the livestream delivery service is now operating as expected.
We’re currently investigating an issue affecting our livestream delivery service. Our team is actively working to identify the cause and implement a fix as quickly as possible. We appreciate your patience and will provide updates as soon as we have more information.
Report: "Issues with Delivery of videos"
Last update: A misconfiguration on our US-based origin servers was identified, which temporarily blocked access to some video files. The configuration was corrected at 14:50 UTC, and the system is now fully operational. We apologize for the inconvenience.
Our team has identified the root cause of the issue and implemented a fix. We're currently monitoring the situation to ensure stability. We'll share more details about the incident as soon as possible.
We are experiencing issues with the delivery of some videos in the US region. Our team is doing its best to find the root cause of the problem.
Report: "Partial outage"
Last update: Earlier today, api.video experienced a disruption in service due to a network issue affecting our infrastructure provider, Gcore. This began at approximately 10:00 GMT+2 and caused degraded performance and potential unavailability for some users. Gcore identified and resolved the root cause of the issue, and services were fully restored as of 11:09 GMT+2 (services were partially back to normal at 10:38 GMT+2). We have been monitoring the situation closely and can confirm that the issue is now resolved. We apologize for any inconvenience this may have caused and appreciate your patience. A Root Cause Analysis (RCA) from Gcore will follow in the coming days. If you continue to experience any issues, please don’t hesitate to reach out to our support team. More details about the issue are available at https://status.gcore.com. Thank you for your understanding.
We are currently experiencing a significant degradation of our API due to a third-party provider having issues in different locations. We will provide more details as soon as possible.
Report: "High API Errors rate (09:25–09:42 UTC)"
Last update: Incident Summary: Between 09:25 UTC and 09:42 UTC, our API returned 500 errors due to an issue with our messaging cluster.
Impact: During this period, API requests failed, potentially disrupting user workflows.
Root Cause: The issue was traced to a failure in our messaging cluster, which affected message processing.
Resolution: Our team identified the problem and restored messaging service, bringing the API back to normal operations by 09:42 UTC.
We apologize for the inconvenience.
Report: "Low performance some services"
Last update: This incident has been resolved. We have made some improvements and it should not occur again.
Low performance in video ingest from URL source and livestream recordings (generated after the livestream).
- Video creation from URL or migrations may experience delays.
- Livestream recordings may exhibit lower-than-usual performance.
We are implementing optimizations to enhance video processing speed and monitoring system behavior to ensure a swift resolution.
Report: "Dashboard issues"
Last update: Our engineering team has resolved the issue that was causing videos and livestreams not to display on the dashboard. All videos and livestreams are now correctly displayed on the dashboard.
We have identified the issue causing the problem. It is related to an external library we use. Our team is actively working on implementing a fix, and we anticipate deploying it shortly. Thank you for your patience and understanding as we work to resolve this issue and improve your experience.
We are currently experiencing challenges in displaying lists of videos, livestreams, players, and webhooks on our dashboard. Our team is diligently investigating the root cause of this issue. It's important to note that our API calls are functioning normally. We appreciate your understanding and will provide updates as we work to resolve this matter.
Report: "Slow ingestion (transcoding)"
Last update: We experienced an issue with slower-than-usual transcoding, which is now resolved. The issue was caused by one of our partners failing to scale in a timely fashion. We have already taken steps to prevent the issue from occurring in the future.
Report: "Dashboard video details return an error"
Last update: The issue has been resolved and you can now navigate to the video details in the dashboard.
We have identified the issue as related to our internal client. Our engineering team is preparing a fix for deployment to production. We will update shortly.
We are currently investigating an issue with the dashboard. At this point, it is not possible to view the video details. Please note that the video details are still available through the API. You can retrieve the video details from the video object.
Report: "Broken button the embedded player"
Last update: The issue has now been resolved. The embedded player control buttons are displaying correctly.
We are continuing to work on a fix for this issue.
We have identified an issue with our embedded player. The buttons on the player are shown as squares instead of the expected buttons. The issue has been identified by our engineering team and we are working to resolve it as soon as possible.
Report: "Video ingestion issue"
Last update: The issue with the ingestion and transcoding of videos is now resolved. All systems are back to a normal, operational state. We would like to apologize for any inconvenience caused during the incident.
We are seeing ingestion and video transcoding returning to normal. We are currently monitoring the situation and will update as soon as the issue is completely resolved.
We are continuing to investigate this issue.
We are currently experiencing an issue with our transcoding cluster that started at 15:10 UTC. The issue is impacting new video transcoding and ingestion; videos may take longer to transcode. All previously ingested videos are delivering normally. Our engineering team has identified the root cause and is working to resolve the issue as soon as possible. We will provide an update shortly.
Report: "Slow transcoding"
Last update: Please be advised that all pending videos have been transcoded and the incident has been resolved.
Our transcoding service is back to a normal state and is now transcoding at the expected speed. Please be advised that some videos may still be in pending status while the transcoding queue is being cleared. We apologize for the inconvenience.
We are currently experiencing an issue with our transcoding cluster. The issue is impacting video transcoding and upload; videos may take longer to transcode. Our engineering team has identified the root cause and is working to resolve the issue as soon as possible. We will provide an update shortly.
Report: "Delivery Instability"
Last update: After identifying the problem, we put a fix in place and all services should be back to normal. A low number of users were affected.
Due to CDN problems, we are experiencing instability in the delivery service.
Report: "Unstable transcoding"
Last update: We experienced instability with video transcoding on September 29th between 17:24 and 18:33 (GMT+2); however, all videos have been ingested. The service is working properly now and we are retranscoding the videos that were stuck.
Report: "Live Streaming instability"
Last update: This incident has been resolved.
After the deployment of important updates related to the delivery, we encountered instability with live streaming. The team fixed it and is currently monitoring the results.
Report: "[RESOLVED]Live stream unavailability"
Last update: On March 4th, at 10:03 p.m. GMT+1, api.video was confronted with an unforeseen outage of our Live Streaming services. Our engineering team managed to locate and fix the issue 3 hours and 22 minutes later, concluding the downtime on March 5th at 01:25 a.m. Once the issue was mitigated, our team began working on an incident analysis to ensure that such potential outages won't be an issue in the future. We understand the gravity of such events for our partners, and everything is being done to provide you with the highest level of product consistency and transparency in the process.
Report: "Dashboard issue"
Last update: This incident has been resolved.
We are currently investigating the issue. The API is still functioning; the impact is limited to the dashboard.
Report: "Internal DNS issue"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Usage data inside the dashboard is not accurate for all users.
Report: "Issue with our database"
Last update: This incident has been resolved.
Services are back and we are monitoring.
We are continuing to investigate this issue.
Major issue with our database. All services are affected. We are investigating it.
Report: "Access temporarily restricted"
Last update: This incident has been resolved.
We have found the root cause and are working on this now.
We are currently investigating this issue. We will add updates as soon as possible.
Report: "Major issue of one of our hosting providers"
Last update: The servers are now functional. Customers can use our API. No data was lost.
The issue is being resolved; our API services are getting back to normal.
We are continuing to investigate this issue.
Major connectivity issue affecting one of our hosting providers in the EU and CA.
Report: "Some API instances are down"
Last update: The servers are now functional.
Some servers lost their network. We are currently restarting them.
We are currently investigating why some servers don't respond.
Report: "VOD delivery degraded"
Last update: The majority of videos have now been migrated to our new cluster and the platform is operating well. We will continue migrating the remaining videos in the coming days, but degraded ones should be a minority. If you experience any issue, don't hesitate to reach us through a normal support request.
Videos uploaded on or after June 3rd are not impacted, and our service is fully operational for them. Videos uploaded prior to June 3rd may experience degraded service. We are working on it and preparing to migrate them to our new data storage.
Encoding problems have been resolved. All services are now linked to the new storage. New videos will be uploaded to the new storage. Videos uploaded before the switch will be retranscoded to the new storage.
The API is back to normal. We are still experiencing encoding issues due to the switch. This is under investigation.
The API read-only period is still ongoing.
The API is now in read-only mode during the switch.
The new storage cluster is ready. The API will temporarily be in read-only mode during the switch. Check this page regularly for updates.
The new storage cluster is currently under construction; the switch will start soon and will imply downtime on the API. As soon as the switch is done, all new videos will work normally. Right after, past videos will be migrated step by step until complete resolution.
A serious issue has been found in our distributed storage configuration, generating long response times on assets. We have to rebuild our storage cluster. We will mitigate the problem as we migrate to the new cluster.
A new issue has been identified with storage access. We are working on it.
The issue has been identified: it is related to the Redis cluster.
We are currently investigating this issue.
Report: "Live streams with api.video player not loading"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
A storage issue is preventing api.video players from loading their configuration. The issue does not affect integrations with external players.
Report: "Delivery degraded"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented. One storage node is partially down.
We are currently investigating the issue.
Report: "Delivery for VOD content suffers from slowing down"
Last update: A large peak in traffic occurred in North America, and part of this traffic was redirected to Europe to handle it, impacting performance in that region. Additional cache servers have been deployed in the North American region to support this traffic. The platform is stable again.
Delivery of VOD content is slowed down but remains functional. The problem has been identified and the solution is already being implemented. We expect a return to normal in approximately 1 hour.
Report: "Delivery degraded"
Last update: The delivery platform has been stable for 24 hours now. Data access conflicts on the central database, prior to caching in the local Redis, created a bottleneck and a stacking of TCP connections waiting for resource access responses. We have adapted the configuration to better manage these concurrent accesses. 2ndQuadrant/EDB (the world's largest provider of PostgreSQL products and solutions) also did a full PostgreSQL configuration review this afternoon to provide additional optimizations.
We are continuing to investigate this issue.
Following yesterday's incident, the delivery infrastructure is still unstable today. All of our technical teams are working to resolve the problem and stabilize it as quickly as possible.
Report: "Delivery issue"
Last update: This incident has been resolved.
The fix has been implemented and deployed to production, and we are monitoring the result and the stability of the whole platform. We apologize to all our customers for the inconvenience.
We are migrating private asset management from the legacy ETCD database (a distributed key-value store) to a PostgreSQL+Redis setup, due to an RPC connectivity issue between our different web services and the ETCD servers. The changes are almost ready to go into production.
We are having problems with the delivery of assets today. Our teams are actively working to resolve the problem.
Report: "2020-10-8 & 9 Major outage"
Last update:

# Incident summary

Since October 8, 2020 at approximately 21:00 UTC, and until the morning of October 9, 2020 at 8:20 UTC, api.video suffered a major failure impacting its entire service. Some intermittent issues with live streaming continued until 10:30 UTC. The event was triggered by a progressive major failure at one of our DNS providers, at 21:00 UTC, on October 8, 2020. The event was detected by our support team at 6:00 UTC. The tech team started working on the event by 6:30 UTC. The incident affected all users. The incident was cleared after workarounds were implemented, and the situation has been considered stable since 8:20 UTC.

# Leadup

Today, to deliver all our content (live streams, VOD files, player assets, ...) we rely on a CDN, in front of a platform split across continents (North America & France). To load-balance traffic from the CDN towards the closest platform, we rely on a GSLB, a load-balancer at the DNS level. We also benefit from this GSLB within our platform to geo-route traffic between geographic dependencies (databases, keystore, ...). We used to do this with our hosting provider's local load-balancers, but moved away from that setup on June 14, due to repetitive major issues with that service. We selected PerfOps, created by Dmitriy Akulov, the man behind jsdelivr, as our GSLB provider.

# Fault

On the evening of October 8, all of their domains, including [perfops.net](http://perfops.net) (the brand one) and [flexbalancers.net](http://flexbalancers.net) (the technical one), went offline: the first one no longer has any existing DNS records, whereas the latter is no longer declared by any registrar. At the time of this post-mortem, we have still had no communication with their executive team or their technical team. Due to this outage, all our load-balanced DNS records went unresponsive.

# Impact

Load-balanced DNS records include the VOD, LIVE, and player asset origins for the CDN. From then on, it was impossible to view videos. The impact was global.

# Detection

While we completely revised our monitoring in May 2020, we did not implement end-to-end probes. As a result, while each individual service was running, we had no indication of any service interruption for our users. Our support team noticed the issues at 6:00 UTC on October 9, raising the case to our CTO at 6:20 UTC, after a series of tests.

# Response

Our CTO started to diagnose the incident at 6:25 UTC, pointing to issues at the DNS level. Several providers could be concerned (GoDaddy, our current registrar; PerfOps, our current GSLB; NSOne, the replacement for PerfOps we are setting up; CDNetworks, our current CDN). Meanwhile, he reached out to the infrastructure team at 6:42 UTC to get assistance on the issue. The infrastructure team was available and working on the incident at 7:00 UTC.

# Recovery

As the issue was with PerfOps (unresponsive domains), and as the ongoing work with NSOne is not yet ready for production, we decided to go with a workaround, replacing all the CNAMEs pointing towards PerfOps' DNS records with aliases towards unique hostnames on our end. The change was performed at the level of GoDaddy, our registrar. Due to how DNS works, and as the minimum TTL at GoDaddy is 30 minutes, any change started at 7:00 UTC would be effective within the hour (twice the TTL minus 1 second) in the worst-case scenario. At 8:00 UTC, some internal services still encountered issues. We noticed errors in the workarounds (wrong records used) and implemented fixes at both the DNS level and the server level.

# Timeline

All times are UTC.

21:00 - PerfOps' domains went offline; DNS propagation slowed down the impact from an external point of view
6:00 - global incident noticed by our support team
6:20 - incident escalated to our CTO
6:25 - beginning of the diagnosis
6:42 - incident escalated to our Infrastructure Team
7:00 - DNS records are being updated
8:00 - a few internal errors remained and got fixed
8:20 - incident closed
10:00 - live customers complain about service flapping; the infrastructure team reopens the incident and starts another diagnosis
10:07 - the single ingest server we're using as a workaround to the initial incident suffers from consecutive loads and a recurrent error from a specific stream
10:10 - standard load-balancing is implemented between all our ingest nodes for live streaming and the suspicious live stream is killed
10:12 - as a quick fix, some customers are manually moved towards specific ingest nodes while the load balancing is being set up
10:25 - the load-balancing is up & running
10:30 - service is OK for our live customers

# Root cause

1. Video delivery was out due to errors at the CDN level.
2. Those errors were generated either by errors at the origin level (our geo-clustered platform) or by DNS errors.
3. The errors from our servers were DNS related.
4. All DNS errors were related to a provider outage.
5. Because we lack end-to-end testing, no alert was raised by the monitoring towards our tech team.
6. Because we felt confident with the GSLB provider, and due to the urgency with the former load-balancing provider, we didn't challenge them enough.
7. Because we chose workarounds to quickly solve the situation, some small errors appeared, which we solved on the fly.

# Backlog check

We have several items in our backlog, already in progress, that would have avoided this situation:

1. implement end-to-end tests, to monitor the service as our end-users consume it (see the sketch below)
2. replace PerfOps with a better performance-based GSLB service provider (NSOne)
3. avoid external GSLB for internal dependencies: this concerns both the way we address our systems from one service to another, and the various clusters we run as geo-replication instead of replication where applicable
4. set up a proper on-call schedule to avoid any time frame without a tech person available

# Recurrence

No previous incident was related to this root cause.

# Lesson learned

From this incident, we've noticed several aspects to be improved. DNS remains key to any internet service and to internal infrastructure communication, and every element should be redundant enough. As we build clusters for any service, we should benefit from multiple DNS providers at every level. A first step is to rely on a single first-class provider for each stage of the DNS resolution process, then to move to multiple first-class providers for each stage.

# Corrective actions

Although all these topics are already in our backlog, it is necessary to review their priorities and deadlines.

* [ ] End-to-end tests to verify how we actually deliver the service to our users: deadline set to October 30
* [ ] Replace PerfOps with NSOne to ensure the stability of DNS resolution: deadline set to October 13
* [ ] Use internal DNS instead of public DNS for internal system communications: deadline set to October 16
* [ ] Revamp internal clusters such as databases and storage: deadline set to December 25
* [ ] Proper on-call scheduling: deadline set to October 30
* [ ] Multiple DNS providers for GSLB & DNS resolution: 2021 Q1
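As an illustration of the end-to-end probes mentioned in the backlog and corrective actions above, here is a minimal sketch in Python of a probe that requests delivery URLs the way an end user would and exits non-zero when any of them fails, so a scheduler or monitoring agent can raise an alert. The URLs and script are hypothetical placeholders, not api.video's actual monitoring code.

```python
# Minimal sketch of an end-to-end delivery probe (hypothetical URLs, not production code).
import sys
import urllib.request

# Hypothetical endpoints to probe; real probes would target actual VOD/live manifests.
PROBE_URLS = [
    "https://vod.example.com/probe/hls/manifest.m3u8",
    "https://live.example.com/probe/hls/manifest.m3u8",
]

def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL resolves, answers with HTTP 200, and has a non-empty body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200 and len(resp.read(1)) > 0
    except Exception as exc:  # DNS failure, timeout, TLS or HTTP error, ...
        print(f"FAIL {url}: {exc}", file=sys.stderr)
        return False

if __name__ == "__main__":
    failures = [url for url in PROBE_URLS if not probe(url)]
    # A non-zero exit code lets cron or a monitoring agent raise an alert on any failure.
    sys.exit(1 if failures else 0)
```

Run periodically from several regions, a probe of this kind would have surfaced the DNS-level outage described above even while each individual backend service was still running.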
This incident has been resolved.