7digital

Is 7digital Down Right Now? Check whether there is an ongoing outage.

7digital is currently Operational

Last checked from 7digital's official status page

Historical record of incidents for 7digital

Report: "Platform Incident - Search Outage"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented to restore service to search and we're continuing to monitor the issue. Regards, 7digital Client Success Team

identified

We're currently experiencing an outage with our Catalogue Search. This impacts users of the search API and the 7digital playlist tool, with search currently unavailable within the playlist tool. The 7digital engineering team are working to resolve the issue and we'll provide further updates as soon as we have them. Regards, 7digital Client Success Team

Report: "Ingestion outage impacting Catalogue Events"

Last update
resolved

Catalogue ingestion has now been switched on and the platform has been restored to full health. The team will continue to monitor and be on hand to react to any further incidents.

identified

Our 3rd party supplier has updated us that they're seeing significant signs of recovery to their service. As a result some of the impacted services on 7digital's platform are also starting to restore. We'll continue to monitor progress and have taken the decision to leave catalogue ingestion switched off until we're confident the platform is restored to full health. At this stage we plan to switch on catalogue ingestion shortly after 13:00 UTC if things continue to progress as expected. We'll update this incident at that time.

identified

On further investigation we've identified that an issue with a 3rd party supplier is impacting our ingestion platform. Despite attempting to implement a resolution internally, this has resulted in an outage impacting our ability to ingest content from suppliers. All catalogue updates will be delayed until our ingestion platform is restored to full health. To clarify the initial status message: our Catalogue Events product is operational, but due to the underlying issue with our ingestion platform we won't be generating any catalogue updates through Events.

identified

As of ~08:00 BST we started to observe errors in our Catalogue Events product which meant we were unable to push out any event messages. We've identified the issue and are implementing a fix to resolve the impact on Events. Once the service is restored we'll follow up with further updates and information.

Report: "Data Warehouse Processing Delay Impacting Incremental Artist Feed"

Last update
resolved

This incident has been resolved.

identified

Dear Clients, Due to unforeseen circumstances with our data warehouse, the incremental artist feed for 20240304 has been generated; however, it will not contain any data. To ensure you have access to the most up-to-date catalogue, we advise continuing as normal by ingesting the incremental artist feed for 20240305, which will cover data for both 20240304 and 20240305. We apologise for any inconvenience this may cause and want to assure you that we are actively working to resolve this issue and prevent future occurrences. If you have any questions, please contact your Client Success Manager. If you are a client who does not ingest our feeds then please ignore this email. With best regards, 7digital Client Success Team

Report: "Platform Incident - Elevated Streaming Errors"

Last update
resolved

This incident has now been resolved. We'll continue to monitor the platform and we'll continue to have engineering support staff on hand should any further issue occur.

monitoring

A fix has been implemented to deal with the elevated error rate and we're now seeing error rates returning to a normal level. We're continuing to monitor the platform and will take any further action as necessary.

identified

We're observing an elevated error rate against stream requests to the platform. The 7digital engineering team are working to resolve the issue and we'll provide further updates as we have them.

Report: "Platform unavailable due to unprecedented demand"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We're currently observing an unprecedented volume of demand on the platform, and as a result multiple areas of the API are currently unavailable. We're in the process of scaling up the platform to support this increased load and restore availability. As soon as we have further updates they'll be posted here.

Report: "Partial Outage: Elevated Error Rates"

Last update
resolved

This incident has been resolved.

monitoring

Dear clients, The platform degradation/outage experienced today is now closed. The platform is stable and the error rate has dropped. Our Tech team will continue to investigate. If you continue to experience issues with the 7digital platform, create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

investigating

Dear Clients, We are currently seeing a higher than usual error rate on calls to various APIs, along with longer response times. Our team of on-call support engineers is currently investigating the issue and taking steps to improve the stability of the platform. We will provide you with additional details in a further notice once we have completed our analysis of the problem. If you have any questions, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

Report: "Platform Incident - Ingestion Outage"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented to solve the problem, allowing us to resume ingestion. The backlog will be cleared gradually as we ramp up processing capacity.

identified

We've identified the fault and are currently in the process of restoring our ingestion platform. Once the ingestion platform is functional again we can begin ingesting pending deliveries and will provide a further update at that time.

identified

We are currently experiencing an outage on our content ingestion systems. We last processed new content and updates on Saturday morning (UK time). We are working towards a resolution now and plan to provide an update shortly.

Report: "Platform Incident - Download API Outage"

Last update
resolved

We've rolled back our Download API to a previous state which has restored downloads successfully. We'll continue to monitor to ensure the issue is resolved and will follow up with further information once we've compiled a post incident report.

investigating

We're currently investigating an issue with the download API where most or all download attempts are failing. As soon as we have further information we'll provide additional updates.

Report: "Platform Degradation"

Last update
resolved

This incident has been resolved.

monitoring

Dear clients, The platform degradation experienced today is now closed. The platform is stable and the error rate has now returned to normal performance levels. Our Tech team will continue to investigate and monitor performance. If you continue to experience issues with the 7digital platform, create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

investigating

Dear Clients, We are currently seeing a degradation of service across most APIs. This issue is only affecting uncached media, as cache hits are served from our CDN and are therefore unaffected. Our on-call support engineers are currently investigating the issue and taking action to further stabilise the platform. Once we have completed our analysis of the issue we'll send out an additional notice with further details. With best regards, 7digital Client Success Team

Report: "Network failure incident"

Last update
resolved

At 13:46 BST we were alerted to an issue with one of our two network lines. After investigation, at 14:04 BST we removed the network at fault from service, directing all traffic to our secondary line which restored normal service to the platform. We are continuing to investigate what caused the failure and will restore network redundancy as soon as possible.

Report: "Playlist Tool Unavailable"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Dear clients, The 7digital Playlist Tool is currently unavailable. Once we have further updates we'll share them with additional announcements. You can subscribe to updates via email, webhooks and RSS feed on our statuspage (insert link). If you would like to receive SMS updates, please create a Service Desk ticket with the Client Success Team. If you have any questions, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

Report: "Platform Outage"

Last update
postmortem

The incident report for this outage is now available and can be found [here](https://drive.google.com/file/d/1MK4JVxFuKtHnnvekUzSk2aT61Z0zo43G/view).

resolved

Full service has been restored to all components of the platform and monitoring will continue. We're confident service has been restored and we will follow up with an incident report next week once we've been able to fully evaluate the issue and the actions taken to remedy it.

monitoring

As per the previous update a fix has been implemented and service has been restored to the platform, with the exception of ALC purchasing, locker and ALC/permanent download endpoints. We are continuing to monitor the fix and will also be working on restoring high-availability to the platform.

identified

We are continuing to work to resolve the ongoing outage. The cause has been isolated to our SQL Server cluster, which is currently not able to keep certain databases online, taking them down for an as-yet-unknown reason. Our efforts to force a single node to host the databases have so far not been effective at solving the current issue. We believe at this stage that our cluster configuration has a non-trivial problem, and we are moving to bring up databases separately outside of the high-availability cluster to restore service as soon as we can. Following the return of stability, we will look to restore high availability as soon as we can.

identified

We are continuing to work on a fix for the issue and will provide an update again shortly. We can also confirm that cached streams continue to be served throughout the incident. We're reviewing activity to identify other areas of the platform which are only partially unavailable and will communicate updates with further information shortly.

identified

We have been able to successfully bring back online the affected DB cluster, however we're still experiencing problems keeping the DB cluster online permanently, causing the platform to be unavailable. We are continuing to investigate and will provide further updates shortly.

identified

We've now isolated the cause of the outage to our DB cluster. We're observing an issue between the primary and replica node which is hindering our ability to bring the cluster back online. We're continuing to investigate the issue and will provide further updates as they're available.

investigating

We are currently investigating a suspected platform outage. Currently all endpoints on 7digital's API are unavailable, 7digital engineers are investigating the cause and we'll provide further information as soon as it's available.

Report: "Download API Outage"

Last update
resolved

This incident has been resolved.

monitoring

The download outage experienced today is now closed. The endpoint is back to normal. If you continue to experience issues with the 7digital platform, create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

investigating

Dear Clients, The Download APIs are currently down due to an outage. Streaming and Media Transfer endpoints are not affected. A new EC2 instance is being set up by the Tech team. Further updates will be announced as soon as they become available. If you have any questions, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

Report: "Platform Degradation"

Last update
resolved

As of 10:50 GMT full service has resumed but we're continuing to monitor the platform.

identified

Dear Clients, Since 9:20am GMT we have seen a degradation across all APIs. Engineers on call are currently investigating the issue and are taking action to further stabilise the platform. A further update will be communicated in due course. If you have any questions, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

Report: "Partial Ingestion Issue"

Last update
resolved

This incident has been resolved.

identified

This week we released a fix for an ingestion issue which had been causing some releases to fail ingestion. As a consequence of this fix some content is now being moved from our delivery platform too quickly, resulting in partial or no ingestion of affected releases. This is only affecting suppliers on automated ingestion. Our team are aware of the issue and are currently working on a resolution. Once the issue has been resolved we'll re-ingest all affected releases. Until this issue has been confirmed as resolved please raise tickets for any priority content released on Friday October 29th which is not already available and is delivered by a supplier on automated ingestion. Priority tickets can be raised here: https://7digitalops.atlassian.net/servicedesk/customer/portal/6/group/12/create/209

Report: "UMG Ingestion Delays"

Last update
resolved

Dear Clients, We are currently experiencing a delay with processing UMG content. A bug was detected in the UMG Teleporter application, causing a backlog of content to develop on their side. As a result, content will be delayed from entering the daily feeds. A fix has been deployed and UMG have been notified of the issue. We are currently processing the backlog, which is estimated to take 24 hours. If you have any questions, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

Report: "UMG Feed Delay"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been deployed and UMG have been notified of the issue. We are currently processing the backlog.

identified

We are currently experiencing a delay with processing UMG content. At 8pm 01/07 a bug was detected in the UMG Teleporter application, causing a backlog of content to develop on their side. As a result, content delivered between 8pm 01/07 and 11am 02/07 may be delayed from entering the daily feeds. A fix has been deployed and UMG have been notified of the issue. We are currently processing the backlog.

Report: "Outage - CDN provider"

Last update
resolved

Dear Clients, The CDN outage experienced today is now closed. The platform is back to normal and the error rate has completely dropped since 13:37. With best regards, 7digital Client Success Team

monitoring

The error rate has dropped since 11:53 GMT. Our Tech team will continue to monitor the platform before we close this incident. For the most current updates, you can follow Fastly Status page here: https://status.fastly.com/

identified

We are continuing to work on a fix for this issue.

identified

The CDN outage is affecting all 7digital API and Media Delivery endpoints. Our on-call support engineers are currently investigating the issue and in contact with our CDN provider. You can subscribe to updates via email, webhooks and RSS feed on our statuspage (https://status.7digital.com/). If you would like to receive SMS updates, please create a Service Desk ticket with the Client Success Team. Once we have further updates we'll share them with additional announcements. If you have any questions, please create a Service Desk ticket with the Client Success Team.

investigating

Dear Clients, We are currently seeing a degradation of service with our CDN provider. Further notices will be sent regarding this incident as they become available. With best regards, 7digital Client Success Team

Report: "Playlist Tool Unavailable"

Last update
resolved

This incident has been resolved. The Playlist Tool is now stable.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating the issue. The Playlist API is not affected.

Report: "Elevated Error Rates - CDN"

Last update
resolved

This incident has been resolved.

monitoring

Between 15:17 and 16:04 GMT, our CDN provider encountered an issue causing an increase in error rates. As of 16:04 GMT normal service has been resumed but we're continuing to monitor the platform.

Report: "UMG Global Outage"

Last update
resolved

This incident has been resolved.

monitoring

UMG have now fixed the issue and we are receiving content once again. Whilst the issue has been resolved, we are expecting a 4-day turnaround to clear the backlog that has developed. As such, any releases from UMG Global from 18/12 may not be in feeds until this coming Friday 25/12. However, while we clear this backlog, we can now look to action priority release requests from UMG where possible.

identified

We are currently experiencing an issue with UMG Global's content delivery platform impacting UMG content from 18/12/20 onwards. Currently, the backlog of deliveries includes updates, takedowns and inserts. As a result of this backlog on UMG's servers, we are unable to action priority release requests from UMG at this time. The issue sits with UMG Global's delivery system and is unrelated to 7digital's content ingestion process. We are currently working on a fix with UMG. We apologise for any inconvenience and will keep you updated as we progress with a fix.

Report: "Platform Degradation"

Last update
postmortem

Our CDN provider (Fastly) has now confirmed that they have identified an issue with their service which directly caused the degradation seen in this incident. A further explanation of the issue which occurred at our CDN provider is below: Starting at approximately 08:51 UTC, Fastly observed three global transit provider events affecting most Fastly data centers, leading to increased 5xx errors and latency. The first event occurred from 08:51 to 08:55 UTC, the second from 09:57 to 10:04 UTC, and a third event starting at 10:17 UTC. During the third event, at 10:21 UTC, a Fastly traffic engineering configuration change, designed to mitigate the provider issues, inadvertently led to additional customer impact in the form of errors, latency, and timeouts. Additionally, the Fastly portal and API were affected during this event. Fastly Engineering started mitigations at 10:25 UTC, gradually restoring services until the last repair was completed at 10:56 UTC.

resolved

Between 11:22 and 12:00 BST we saw a degradation of service on the platform. As of 12:00 BST full service has resumed but we're continuing to monitor the platform and will follow up to provide further details on the impact of the incident and the measures taken to resolve and prevent it reoccurring.

Report: "Platform Outage"

Last update
postmortem

**7digital Incident Report**

### **Incident Details:**

**Incident Summary**

A switch within the CTR data centre power cycled itself, causing the ILB high-availability cluster to fail over. Whilst the ILB failover completed, the automatic failback (after the switch recovered) left the ILB and XRP in a state of limbo, which was only resolved when keepalived was restarted on all nodes. This caused an almost complete API outage, since most critical APIs rely on the ILB to route API calls. In addition, the cloud catalogue API did not recover as quickly as the data centre services because it uses a DNS entry on which automatic failover had been disabled.

### **Timeline**

- 22:42 - On-call SRE receives multiple Pingdom down alerts across all APIs.
- 22:45 - SRE online; reports the VPN used to access the platform is online. Identifies the severity of the outage and calls Client Success OOH.
- 22:52 - API health dashboard shows a 100% error rate on most endpoints with large response times (> 2 seconds). Core Platform errors dashboard shows an initial load of API Router errors indicating they are timing out whilst connecting to the DB.
- 22:56 - Client Success indicate they are working on a notification to clients.
- 22:56 - SRE starts working through the "data centre failure modes" runbook.
- 22:59 - DC cross connect identified as being up.
- 23:02 - ILB IP announcements look OK according to "ip a". SRE notices that the backup ILB briefly received some traffic recently. SRE decides to restart keepalived to force re-announcement of the IPs anyway.
- 23:05 - Most services start to recover; Pingdom alerts clear; Core Platform application errors mostly clear, apart from webstore & comparison-reproxy.
- 23:07 - SRE asked by Client Success whether the prepared platform announcement should still go out, since the platform looks to be recovering. Decision taken to send it, as stability is not yet clear. Notification sent to clients.
- 23:07 - SRE notices that VHC has taken all API traffic and Pingdom is still reporting CTR XRP as down. ~/track/details is also reported to be down.
- 23:11 - API origin DNS (which the Pingdom check uses) is found to be pointing to CTR. DNS Made Easy shows the record's auto-failover mode has been disabled. SRE re-enables the auto-failover. Pingdom alerts recover for all but the CTR XRP check.
- 23:15 - SRE notices that the release details endpoint still has a high error rate (50%) and the 7digital D2C webstore is still erroring in Core Platform application errors. Client Success manually checks ~/release/details and finds that it looks stable.
- 23:31 - Whilst investigating the issue with ~/release/details, SRE notices that errors to that endpoint dropped off from 23:26.
- 23:36 - SRE tells Client Success that the platform looks fully up, but they are still monitoring and looking into the loss of DC redundancy (CTR not handling API traffic).
- 23:39 - SRE forces 99% of API traffic to VHC whilst investigating the CTR issues.
- 23:41 - Client Success update incident status to "monitoring".
- 23:43 - SRE restarts NGINX on CTR XLB 00 and 01, to no effect. Restarting keepalived on those hosts, however, restores CTR's XRP service. Pingdom alert clears.
- 23:59 - Incident closed.

**Duration of outage/incident (Time to Recovery):** 25 minutes

**Time taken to isolate/diagnose the issue (Time to Isolate):** 25 minutes

### **Impact**

**What applications or services were affected?** Any partner services and internal applications (inc. web store) which use our API.

**How might these services have been affected?** Indicators show a complete outage during this time: error rates of 100% and high response times.

### **Technical Details**

It appears that whilst the platform correctly failed over given a presumed network blip, once the blip had resolved itself the failback did not complete cleanly. This caused the almost-total outage of the API. A smaller, secondary problem with how DNS is updated on the loss of a DC's XRP caused the cloud catalogue service to fail. This mainly affected 7digital's webstore service and did not impact partners.

**Dashboards:**

Core Platform Application Errors: ![](https://lh6.googleusercontent.com/WeEGRD8mG6lATlOP0BNVdPQPyyzWjFi6z813qjbnDIcRZjkuepQzVIg3h92qZIOUEPvJZaA_-7kgWkDbqOsdA-PKtB9_YObUSKwtiMAziCcVj2fLFsp7YnN7JW3U8tu4qeUBY7pq)

API Health dashboard: ![](https://lh4.googleusercontent.com/Z9m1QzuSmZV0OLPN0vHk7vNoQFbJM9uRHV-fPK-0eoBSx0eYyUVxg-MinSo9Ec4Fl4k-pgksZWcLp9K1jE4LbKlHnDkFL_8Js1l-ZSn81R_Bh6eWbUl2aeeoFKzLbQB5UNJWS2L1)

Data Centre Usage: ![](https://lh5.googleusercontent.com/7rxr1vLpKpdcp9_zEDF4EdpZcJFSl_iqBs3CuFUebhJau7Pbx-zHheQiD1SZiUpD77S1u5OhaNKjAzHECmPUTBlwyMDKoITs4bRTtXkqolmUBj8kyetwaNmHTtZWCSqamom_9rt-)

**Analysis of our response to rectifying the incident**

As is the case with many networking-triggered incidents, the information available to SREs was at first confusing and did not immediately reveal a resolution. However, since the team had witnessed something similar happen in the past, we had a runbook at hand to help SREs diagnose networking & data centre issues. This proved a decisive factor in the relatively quick recovery of the service, given the complex nature of the fault. Process-wise, we were quick to identify the impact to customers, and Client Success was able to notify partners as quickly as possible. We have also identified that we could have better documentation on how the cloud catalogue service is architected, so that the SRE team can better understand and recover the service.

### **Analysis of the technical issue/s**

Ideally, switch power cycles/failures should be able to happen with our infrastructure recovering or failing over automatically. In this instance, the infrastructure did not recover on its own and required SRE intervention to force all load balancers to re-announce their floating IPs to the switches. Our investigation will focus on how we can automate recovery of the service in this scenario, as we've had similar occurrences in the past. We're also aware of CTR-TEN-AS1 being a single point of failure in relation to the dark fibre, so we will look into ways of increasing redundancy there.

With regards to the ongoing webstore issues, it is presumed that the reason it continued to fail was its reliance on a DNS entry that had its automatic failover disabled. Since DNS Made Easy provides no audit trail, we will look at ways of regularly snapshotting the configuration to source control so we can trace changes in future. It is presumed that once the TTL for the bad DNS record had expired, the cloud catalogue infrastructure recovered on its own, hence no intervention was required to fix that issue following the re-enabling of the automatic failover.

**Conclusions and Actions**

The resultant de-briefing identified the following issues with our process:

1. There are multiple locations which explain our incident response process. We should remove redundant copies of the process so that only one is accessible, to avoid confusion.

In general we were fairly happy with how quickly we responded. However, as this is the second time the load balancer has not recovered following a quick failback, we will prioritise automating this so that manual SRE intervention is not needed in future. We will also look at updating our documentation on how the new cloud catalogue flow works for the SRE team.
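The floating-IP behaviour described in the report (load balancers announcing a shared virtual IP, with keepalived restarts forcing re-announcement) corresponds to a VRRP setup of roughly the following shape. This is a generic, illustrative sketch only: the interface name, router ID, priority and address are placeholders, not 7digital's actual configuration.

```
# Illustrative keepalived VRRP instance (placeholder values, not
# 7digital's real config). The address in virtual_ipaddress is the
# floating service IP held by the MASTER node; on failover the BACKUP
# node takes it over and sends gratuitous ARPs so the switches learn
# the new path to the IP.
vrrp_instance ILB_VIP {
    state MASTER            # the peer node runs as BACKUP
    interface eth0          # NIC carrying the virtual IP
    virtual_router_id 51    # must match on both nodes
    priority 150            # higher priority wins the MASTER election
    advert_int 1            # VRRP advertisement interval, seconds
    virtual_ipaddress {
        192.0.2.10/24       # the floating service IP
    }
}
```

In a setup like this, restarting keepalived triggers a fresh MASTER election and new gratuitous ARP announcements, which is presumably why the restarts at 23:02 and 23:43 cleared the stuck post-failback state described in the timeline.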

resolved

Dear clients, The platform outage experienced today is now closed. The platform is back to normal and the error rate has completely dropped since 11:25. Our Tech team will continue to investigate and an incident report will be shared in due course. If you continue to experience issues with the 7digital platform, create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

monitoring

Dear clients, The platform has now been restored and the error rate has dropped since 23:17 GMT. Our Tech team will continue to monitor the platform before we close this incident. If you continue to experience issues with the 7digital platform, create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

investigating

Dear clients, From 22:45 GMT we have been experiencing a severe platform outage affecting all areas of the 7digital API. Our on-call support engineers are currently investigating the issue and taking action to further stabilise the platform. You can subscribe to updates via email, webhooks and RSS feed on our statuspage (https://status.7digital.com/). If you would like to receive SMS updates, please create a Service Desk ticket with the Client Success Team. Once we have further updates we'll share them with additional announcements. If you have any questions, please create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

Report: "Elevated Error Rates - Platform Wide"

Last update
resolved

Dear clients, The platform degradation/outage experienced today is now closed. The platform is stable and the error rate has dropped as of 19:23 GMT. Our Tech team will continue to investigate. If you continue to experience issues with the 7digital platform, create a Service Desk ticket with the Client Success Team. With best regards, 7digital Client Success Team

investigating

Dear Clients, We are currently observing an elevated error rate on calls to our API. Our on call support engineers are currently investigating the issue. Once we have completed analysis of the issue we'll send an additional notice out with further details. If you have any questions, please create a Service Desk ticket with the Client Success Team by raising the issue here: https://7digitalops.atlassian.net/servicedesk/customer/portal/6/group/12 Regards, 7digital Client Success Team