Productsup

Is Productsup Down Right Now? Check whether there is an ongoing outage.

Productsup is currently Operational

Last checked from Productsup's official status page

Historical record of incidents for Productsup

Report: "Network maintenance advisory"

Last update
investigating

Due to an unexpected interruption of the network connection between two of our datacenters, some services might encounter degraded performance or be intermittently unreachable. We are currently investigating the cause of the interruption and will update this thread once we know more. Best regards, Productsup Network Operations Center

Report: "Maintenance advisory"

Last update
Completed

The scheduled maintenance has been completed.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

During this period, one of our providers in Falkenstein will be conducting maintenance on our network components. This could result in temporary unavailability of certain direct connections or the platform. We expect any interruptions to be brief. Best regards, Productsup Network Operations Center

Report: "Data Processing queues"

Last update
Resolved

This incident has been resolved.

Monitoring

A fix has been implemented and we are monitoring the results.

Investigating

Due to an influx of jobs that is higher than usual, a queue has built up in our scheduler system. We are actively working to process the backlog and restore normal operations. Our team is monitoring the situation closely, and jobs are being processed as efficiently as possible. We ask for your understanding and patience as we work through this.

Report: "Data Processing queues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Due to an influx of jobs that is higher than usual, a queue has built up in our scheduler system. We are actively working to process the backlog and restore normal operations. Our team is monitoring the situation closely, and jobs are being processed as efficiently as possible. We ask for your understanding and patience as we work through this.

Report: "Destination "Productsup Server" Issues"

Last update
postmortem

**Incident Overview**

On May 12, we experienced a temporary service disruption affecting parts of our transport platform. This occurred during a configuration update aimed at improving our system’s routing capabilities.

**Root Cause**

The disruption was caused by a misconfiguration in our internal storage system. Although the change had passed staging tests, it behaved differently in production, leading to certain requests being misrouted.

**Impact**

During the incident, some clients were unable to access services hosted on specific subdomains of our transport platform and jobs couldn't write to destination "Productsup server". Normal operation resumed shortly after the change was rolled back.

**Preventive Measures**

To prevent this issue from recurring, we are:

* Improving our deployment validation processes.
* Enhancing staging to more closely reflect production conditions.
* Adding more automated checks to detect routing issues earlier.

**Conclusion**

The issue was resolved by reverting the configuration. Processes have been updated to prevent recurrence. All systems are operating normally.

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We're investigating an elevated number of errors when uploading data to "Productsup Server".

Report: "Data processing performance update"

Last update
resolved

Dear customers, our provider has replaced the affected hardware on schedule, and we are observing normal transfer speeds, therefore we are closing this incident. Thank you for your continued support, Your Productsup Technical Operations Team

identified

Dear Customers, Please be advised that we are currently experiencing degraded transfer and processing speeds due to a network equipment issue, specifically a failed router, in one of our data centers. This has been ongoing since 20:00 Berlin time yesterday. Our provider is scheduled to replace the affected hardware today. We will provide another update once the issue is resolved and performance returns to normal.

Report: "Router Hardware Failure"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently experiencing a hardware failure on one of our routers. As all our routers are fully redundant, there is no impact on our services. We will be performing a hardware replacement shortly to restore full redundancy. Further updates will be provided as soon as more information is available. Thank you for your understanding. Best regards, Productsup Network Operations Center

Report: "Export to Datasource incident"

Last update
resolved

Summary

On 17.01.2024, one of the processing storage cluster servers rebooted unexpectedly due to a configuration issue with the object storage gateway. This incident was compounded by previous changes made to accommodate a new replication zone in Frankfurt, leading to operational complications.

Incident Details

Background: Several months ago, modifications were made to include a new replication zone in Frankfurt. However, it appears the toolset used for this configuration did not execute correctly, resulting in the retention of a "default" zone and a "default" zone group alongside the active "fsn" and "fsn-1" zones and zone groups in Falkenstein.

Incident Timeline

* Reboot: The server rebooted unexpectedly.
* Configuration Issue: Upon investigation, it was found that the object storage gateway connected to the default zone instead of the intended configuration.
* Service Behavior: Despite the service starting, it encountered issues as there was no data in the default zone, leading to complaints about missing data for incoming requests.
* Monitoring Oversight: The monitoring system only tracked whether the service was "up," which did not alert us to the underlying issue until a user report was received.

Resolution

The issue was identified after user reports highlighted the data unavailability. Immediate actions were taken to correct the object storage gateway configuration, ensuring it points to the correct zone.

Lessons Learned

* Monitoring Improvements: To prevent similar incidents in the future, we will enhance our monitoring to include error rates on this cluster. This will allow us to detect unusual patterns that may indicate configuration mistakes or other issues.
* Configuration Checks: A review of the configuration process for replication zones will be conducted to ensure that toolsets function as expected and do not leave behind unintended configurations.
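As a rough illustration of the error-rate monitoring mentioned under Lessons Learned, a minimal check might look like the sketch below. The window size, alert threshold, and the way request outcomes are fed in are illustrative assumptions, not details of Productsup's actual monitoring stack.

```python
from collections import deque
import time

# Minimal sketch of an error-rate check for a storage gateway, assuming we can
# observe per-request outcomes. Window size and threshold are illustrative.
class ErrorRateMonitor:
    def __init__(self, window_seconds=300, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, is_error)

    def record(self, is_error, now=None):
        now = now if now is not None else time.time()
        self.events.append((now, is_error))
        # Drop events that fall outside the sliding window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def should_alert(self):
        if not self.events:
            return False
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events) > self.threshold

# Usage sketch: feed outcomes from the gateway's access log and page an
# operator when the error ratio exceeds the threshold.
monitor = ErrorRateMonitor()
for is_error in [False, False, True, True, True]:
    monitor.record(is_error)
print(monitor.should_alert())  # True for this illustrative sample (60% errors)
```

Tracking the ratio rather than raw uptime is what would have caught the "service up but returning errors" condition described in the report.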

Report: "Platform not Available"

Last update
resolved

All systems operating normally.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor for any further issues.

monitoring

Our main database cluster caused the applications to fail. The root cause has been identified and fixed and all applications resumed normal operations. We'll monitor the situation.

investigating

The platform is currently not available. We're investigating the root cause.

Report: "Data Processing is slow"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Due to a large influx of jobs being started, data processing might currently suffer some delays. We’re currently investigating and will have more information shortly. Thank you for your patience.

Report: "Update on Processing Performance"

Last update
resolved

Dear Customers, We are pleased to inform you that, according to our monitoring, we are satisfied that the issue has been resolved. We are now experiencing improved site run times and enhanced upload/download performance. Thank you for your patience during this process. We remain committed to providing you with the best possible service and will continue to monitor the situation closely. To prevent similar issues in the future, we are implementing the following measures:

* Increased Bandwidth Planning: We will reassess and adjust our bandwidth capacity regularly to ensure it meets the demands of our customer base, especially during peak usage times.
* Enhanced Monitoring Tools: We are upgrading our monitoring systems to provide real-time insights into network performance and quickly identify any anomalies.

identified

Dear customers, We have identified a potential issue affecting our service performance. A network equipment component was found to be possibly defective. With the assistance of our data center provider, we have replaced this equipment, which has resulted in significantly higher transfer rates. Additionally, we have upgraded the bandwidth on this equipment to better accommodate these improved speeds. Most customers should start experiencing enhanced times for site runs effective immediately. We are continuing to monitor the entire platform for any other factors that might cause slowness or delays, and we will keep you updated on any further developments. Thank you for your continued patience and support.

investigating

Dear Customers, We are currently experiencing unusually high demand for our software due to a combination of seasonal factors. As a result, some sites may experience delays or, in some cases, become stuck in the queue. Our team is actively investigating all possible root causes to address these challenges. Please rest assured that we are monitoring the situation closely and taking every step to enhance performance. Your experience is important to us, and we appreciate your patience as we work through this. Thank you for your understanding.

Report: "Image Service Incident (Image Designer)"

Last update
postmortem

### Incident Overview

On 13.12.2024, an upgrade from Chromium 124 to Chromium 130 was deployed on our servers, impacting the Image Service's ability to generate scalable vector images (SVGs). This incident was triggered by our automated security upgrade process, which is in place to maintain the security integrity of our software.

### Timeline

* **Upgrade Date:** 13.12.2024 07:00 - Chromium was upgraded from version 124 to version 130.
* **Incident Detection:** 13.12.2024 09:00 - Users reported issues with SVG generation.
* **Investigation:** 13.12.2024 09:30 - The team began investigating the root cause.
* **Rollback:** 13.12.2024 12:00 - Chromium was rolled back to version 124, restoring SVG generation functionality.
* **Current Status:** Fix for Chromium 130 is in progress and will be thoroughly tested before redeployment.

### Root Cause

The upgrade to Chromium 130 introduced incompatibilities that affected SVG generation. This was an unanticipated consequence of the automated security upgrades, which are designed to keep our systems secure by ensuring all software is up to date.

### Resolution

To mitigate the immediate issue, we rolled back Chromium to version 124, which restored the SVG generation functionality. The team is currently working on a fix for version 130 to ensure that it can be deployed without causing similar issues in the future.

### Action Items

1. **Testing Fix for Chromium 130:** Ensure that the upcoming fix is thoroughly tested before any deployment.
2. **Review Upgrade Process:** Assess our automated upgrade process to identify potential improvements that minimize risks associated with future upgrades.
3. **Monitoring Enhancements:** Implement enhanced monitoring to quickly identify and address similar issues in the future.

### Conclusion

We recognize the impact that this incident had on our users and are committed to ensuring that our systems remain secure while minimizing disruptions. We appreciate your understanding and patience as we work towards a permanent solution for the Chromium 130 upgrade.

resolved

Description: We have successfully rolled back Chromium to an earlier version, and SVG generation is now functioning as expected. Impact: Users can now generate SVG images without any issues. Thank you for your patience during this incident. If you experience any further issues, please reach out to our support team.

investigating

Description: We are currently experiencing issues with Image Designer (SVG) generation in our Image Service. This is due to a necessary security upgrade, which has temporarily affected our ability to generate images. Impact: Users are unable to generate SVG images at this time. Next Steps: We are aware of the issue and are actively working on a workaround. Updates will be provided as soon as more information is available. Thank you for your patience as we resolve this matter.

Report: "Download Proxy Outage"

Last update
postmortem

**Incident Overview**

The Download Proxy experienced a partial outage lasting a few hours due to issues with our firewall configuration. This incident resulted in the rejection of traffic directed to the Download Proxy, impacting user access and service availability.

**Root Cause**

The root cause of the outage was a misconfiguration in the firewall settings following recent changes in IP addressing for our proxy services. The firewall began rejecting legitimate traffic, leading to service unavailability. Since the outage was partial, only some connections were affected, making immediate detection difficult.

**Impact**

Users were unable to access the Download Proxy for several hours, but the impact was not uniform across all connections. Some users experienced delays in their workflows, while others were unaffected.

**Preventive Measures**

To prevent similar incidents in the future, the following measures will be implemented:

* Change Management Process: Establish a more robust change management process to review potential impacts of IP changes on firewall and proxy configurations.
* Automated Alerts: Set up automated alerts for unusual traffic rejections to allow for quicker response times.
* Documentation: Ensure thorough documentation of any changes made to network configurations, including IP address assignments.

**Conclusion**

The partial outage of the Download Proxy was a result of firewall misconfiguration following IP address changes. By implementing the aforementioned preventive measures, we aim to enhance our resilience to similar issues in the future and ensure better service reliability for our users.

resolved

The partial outage affecting the Download Proxy service has been successfully resolved. The issue was traced back to misconfigurations in the firewall settings following recent changes to IP addressing for our proxy services. As the service is going live again, we are committed to enhancing our systems to prevent similar incidents in the future. Thank you for your understanding and support.

identified

We are currently experiencing a partial outage of the Download Proxy service. Users have reported issues with accessing the service, with some connections being rejected. This appears to be related to recent changes in IP addressing for our proxy services. We've implemented a fix and the download proxy should function again as intended.

Report: "Data Processing queues"

Last update
resolved

This incident has been resolved.

identified

Due to an influx of jobs that is higher than usual, a queue has built up in our scheduler system. We are actively working to process the backlog and restore normal operations. Our team is monitoring the situation closely, and jobs are being processed as efficiently as possible. We ask for your understanding and patience as we work through this. Incident start: 11:00 CET Estimated resolution time: 15:00 CET

Report: "Processing issues"

Last update
resolved

This incident has been resolved.

identified

Summary: We have identified a fault in one of our job dispatcher servers. Upon stopping the problematic server, the jobs that were previously stalled have been freed and will restart according to their schedule.

Actions Taken: The faulty job dispatcher server has been stopped to release the stuck jobs. All jobs are now running on a functional server. The problematic server is currently under debugging to determine the root cause of the issue.

Next Steps: We will continue monitoring the situation and provide updates as we progress with the debugging process. Thank you for your patience as we work to resolve this issue.

investigating

We are currently experiencing ongoing issues with jobs getting stuck in the processing queue. This problem is impacting our job execution and overall system performance. Despite previous resolutions, some jobs are once again stalling. Initial investigations are being conducted and we expect to have a resolution shortly. We apologize for the inconvenience and appreciate your patience as we work to resolve this issue. Further updates will be provided regularly.

Report: "Processing issues"

Last update
resolved

Incident Summary

We experienced an unexpected disruption in job processing services due to scheduled maintenance side effects from our data center provider. Most jobs became stuck and required a restart to resume normal operations.

Timeline

* 06:05 UTC+1: Incident began; jobs were reported as stuck.
* 07:00 UTC+1: Main job scheduler server was restarted, temporarily stopping all executing jobs.
* 07:05 UTC+1: Jobs resumed execution normally following the restart.

Root Cause

The disruption was caused by unforeseen side effects from maintenance performed by our datacenter provider, which impacted the job processing system.

Resolution

* Restarted the main job scheduler server, which freed up resources and allowed jobs to be re-executed according to their original schedules.
* Monitoring tools were employed to ensure that job processing returned to normal.

Conclusion

The incident has been resolved. We appreciate the patience and understanding of all affected users during this disruption.

monitoring

We are pleased to inform you that the restart of our main job scheduler server was successful. All jobs are now executing normally, and we are actively monitoring the situation to ensure stability. Thank you for your patience during this maintenance period. We appreciate your understanding and will keep you updated if any further issues arise.

Current Status:

* Job processing has resumed successfully.
* All jobs are executing as scheduled.
* Monitoring is ongoing to ensure system stability.

Thank you for your support!

identified

We are currently experiencing an unplanned disruption to our job processing services. Due to unexpected side effects from maintenance performed by our datacenter provider, most jobs are currently stuck and will need to be restarted. To resolve this issue, we will be restarting our main job scheduler server. This will stop all currently executing jobs and free up resources. Once the restart is complete, jobs will be re-executed according to their scheduled timelines. We apologize for any inconvenience this may cause and appreciate your patience as we work to resolve this issue. We will provide further updates as soon as possible.

Current Status:

* Job processing is currently disrupted
* Most jobs are stuck and will need to be restarted
* Main job scheduler server will be restarted to resolve the issue
* Jobs will be re-executed according to their scheduled timelines after the restart

Estimated Resolution Time: 15 min

Next Update: We will provide another update once the restart is complete and job processing has resumed.

Report: "Network Provider issue"

Last update
resolved

The issue with one of our IX connections has been resolved, and normal traffic flow is fully restored. Our monitoring confirms that all systems are operating optimally, and there should no longer be any delays. Thank you for your patience and understanding.

investigating

Our monitoring has detected that there is currently an outage on one of our IX connections. This has been reported to the relevant IX for further investigation. We have diverted traffic via alternative routes, but it is possible that there may be (minimal) delays.

Report: "Data Processing is slow"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Due to a large influx of jobs being started, data processing might currently suffer some delays. We’re currently investigating and will have more information shortly. Thank you for your patience.

Report: "Platform stability issues"

Last update
resolved

The issue impacting the platform was identified as a network configuration error by our server provider. After rolling back the change, network conditions have returned to normal, and all systems are operational. We are closely monitoring the platform to ensure stability. Thank you for your patience, and we apologize for any inconvenience caused.

investigating

We are currently experiencing intermittent availability issues and degraded performance with the platform. Some users may find the platform temporarily unavailable or experience slower response times. This appears to be related to underlying network conditions, and our technical team is actively investigating the root cause. We appreciate your patience as we work to restore full service. Further updates will be provided as soon as more information is available. Thank you for your understanding.

Report: "Degraded performance"

Last update
resolved

The DDoS attack causing degraded performance has been successfully mitigated, and our services are now back to normal operation. We sincerely appreciate your patience and understanding throughout this incident. If you encounter any issues or have any questions, please don't hesitate to reach out to our support team.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently experiencing degraded platform performance due to a Distributed Denial of Service (DDoS) attack. Our Network Operations Center is actively working to mitigate the impact and restore full functionality as swiftly as possible. During this time, you may encounter intermittent service disruptions or delays in accessing our services. We sincerely apologize for any inconvenience this may cause and appreciate your understanding and patience as we resolve this issue.

Report: "Platform and APIs Access Disruption"

Last update
resolved

We have identified and resolved the issue. The problem was caused by a hardware failure on one of our routers. Our team has successfully redirected traffic to another router, restoring access to both the platform and API.

Description: We experienced an issue where platform.productsup.com and our API were inaccessible due to a network problem. Our team investigated and resolved the root cause, which was identified as a hardware failure on one of our routers.

Impact:

* Users were temporarily unable to access platform.productsup.com
* The API was temporarily unavailable
* Data Processing functionality remained unaffected throughout the incident

Resolution: Traffic has been redirected to another router, restoring full access to both the platform and API. All services should now be functioning normally. We apologize for any inconvenience this may have caused and appreciate your patience during the outage. If you continue to experience any issues, please contact our support team.

investigating

Cause has been identified and a workaround has been implemented. Services are gradually coming back online as we monitor the situation.

investigating

Description: We are currently experiencing an issue where platform.productsup.com and our API services are inaccessible. Our team is actively investigating the root cause, which appears to be related to a network issue. We apologize for any inconvenience this may cause.

Impact:

* Users are unable to access platform.productsup.com
* Data Processing functionality remains unaffected

Next Update: We will provide an update within the next 30 minutes or as soon as we have more information.

Report: "DNS Issue Affecting Data Processing System"

Last update
resolved

Incident Summary: We experienced a DNS issue that caused delayed access to our database. This resulted in some jobs experiencing delays ranging from 1 to 100 minutes. Resolution Details: Our technical team identified the root cause as a misconfiguration in the DNS settings. We worked closely with our DNS team to rectify the issue, and the DNS settings have now been restored to normal functionality. Impact Acknowledgment: We recognize that the delays may have affected your workflows and appreciate your understanding during this incident. Next Steps: We will continue to monitor the system to ensure stability and performance. A post-incident review will be conducted to prevent similar issues in the future. Thank you for your patience and support as we resolved this matter. If you have any further questions or concerns, please feel free to reach out.

investigating

We are currently experiencing a DNS issue that is causing delays in our data processing system. Our team is actively investigating the problem and working to resolve it as quickly as possible. Impact: Users may experience slower response times or difficulty accessing certain features of the data processing system. Current Status: Our technical team is diagnosing the DNS configuration to expedite resolution. Next Steps: We will provide updates every hour or as new information becomes available. We apologize for the inconvenience and appreciate your patience as we work to restore normal operations. Thank you for your understanding.

Report: "Platform not accessible"

Last update
resolved

Dear customers, We are pleased to inform you that the incident appears fully resolved and the situation is back to normal. If you experienced any residual issues or have additional questions, please do not hesitate to contact our support team. We are committed to ensuring a seamless experience and are here to assist you. Thank you again for your continued business and loyalty. We value your trust in our service, and we remain dedicated to providing a reliable and high-quality experience.

identified

The Productsup platform is now accessible and functioning without any issues. We are actively monitoring the system for any potential errors to ensure continued stability. We will provide further updates if any issues arise. Thank you for your patience during this incident.

investigating

We are currently experiencing slow response times for the Productsup platform user interface. Please note that data processing services are not affected. Our technical team is actively investigating the issue to identify the root cause and implement a solution as quickly as possible. We will provide updates as more information becomes available. Thank you for your patience, and we apologize for any inconvenience this may cause. For further updates, please check this status page regularly.

Report: "Performance Issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We're seeing slow responses for the Platform user interface. Data Processing is not affected. We are currently investigating the issue. Sorry for the inconvenience.

Report: "Platform stability issues due to hosting provider network disruptions"

Last update
resolved

Dear customers, Our monitoring indicates that the platform is stable and the hosting provider has confirmed that all works correctly on their side again. Therefore we are closing this incident as resolved. Thank you for your continued support.

monitoring

Our hosting provider reported that there was an issue with a network interface on their side, which has now stabilized. From our side, we also see that all systems are operational again. We are continuing to monitor closely to ensure that everything works correctly.

investigating

Dear Customers, We have noticed network problems originating from our hosting provider that are making our platform unstable at the moment. We are looking into ways to mitigate the impact of this issue, and we have notified the provider's network team about the problem. Thank you for your understanding; we will keep you posted on further developments.

Report: "Data Processing is slow"

Last update
resolved

Dear customers, We are pleased to inform you that the Processing job queue backlog has cleared and the situation is back to normal. If you experienced any residual issues or have additional questions, please do not hesitate to contact our support team. We are committed to ensuring a seamless experience and are here to assist you. Thank you again for your continued business and loyalty. We value your trust in our service, and we remain dedicated to providing a reliable and high-quality experience. Best regards, Your Productsup Team

monitoring

Dear Customers, We have rolled back the recent changes made to the Google Content API destination, which should resolve the issue you have been experiencing. However, there is a backlog of requests that built up during the incident, and it may take some time for this queue to clear. We apologize for any inconvenience this has caused. If you have any further questions or concerns, please reach out to your technical support team. We are committed to resolving this incident fully once the queue has been processed. Thank you for your patience and understanding. Sincerely, Your Technical Operations Team

identified

Dear customers, We have identified an issue affecting sites using the Google Content API destination. Our Engineering team is diligently looking for a fix. We will let you know as soon as it is implemented. Kind regards, Productsup Technical Operations Center

investigating

Due to a large influx of jobs being started, data processing might currently suffer some delays. We’re currently investigating and will have more information shortly. Thank you for your patience.

Report: "Network incident in our storage datacenter"

Last update
resolved

Dear Customers, We are pleased to inform you that the critical network switch hardware failure experienced by our hosting provider has been resolved. The replacement equipment has been successfully installed and tested, and normal service has been restored. Our hosting provider's technical team worked diligently over the past hours to expedite the repair and minimize the duration of the service interruption. We appreciate your patience and understanding throughout this incident. If you experienced any residual issues or have additional questions, please do not hesitate to contact our support team. We are committed to ensuring a seamless experience and are here to assist you. Thank you again for your continued business and loyalty. We value your trust in our service, and we remain dedicated to providing a reliable and high-quality experience. Best regards, Your Productsup Team

investigating

Dear Customers, We regret to inform you that a critical network switch component owned and maintained by our hosting provider has experienced a complete hardware failure. This equipment is essential for our service to function, and it needs to be replaced as soon as possible. Our hosting provider is currently working to resolve this issue and restore normal operations. However, as this incident is outside of our direct control, we are unable to provide an exact timeline for when the fix will be completed. Please be assured that we will provide updates as soon as we receive more information from our provider. We understand that this unplanned service interruption is an inconvenience, and we apologize for the disruption. We appreciate your patience and understanding as our hosting provider works to resolve this matter. If you have any further questions or concerns, please do not hesitate to contact our support team. Thank you for your continued business.

Report: "Transport storage cluster issue"

Last update
resolved

Dear customers, the incorrect configuration was reverted by our hosting provider's team and services are functioning normally again. We sincerely apologize for any inconvenience caused by this incident and thank you for your continued support.

identified

Dear customers, due to an unexpected network switch configuration failure at our datacenter provider, traffic to the Productsup Server destination, FTP Service, Platform API and Stream API may encounter degraded performance. The provider's network control team is currently fixing the issue. Thanks for your understanding and we will keep you posted about further developments.

Report: "Authentication issues in Productsup Platform"

Last update
resolved

Dear customers, We are pleased to inform you that the client outreach via email has started and will be concluded before the end of the day. We sincerely apologize for any inconvenience this incident may have caused you, and we greatly appreciate your patience and understanding during this time. If you encounter any further issues or have any questions, please do not hesitate to reach out to us. Kind regards, Your Productsup Team

monitoring

Dear clients, We have found more specifics about the login issues for clients who are using our SSO features. Clients using Azure or custom IDPs can be affected; clients using other SSO integrations are not affected. Later today we will contact the affected clients via email with information about the ongoing investigation and suggestions to resolve the current problems. Kind regards, Your Productsup Team

identified

We have moved all SSO configurations from "enforced" to "enabled". If you are unable to log in using your SSO, please use https://platform.productsup.com/ and reset your password to continue using the platform while we investigate the ongoing issue further.

identified

Dear clients, This incident is a follow-up to the planned release of the new authentication system in the Productsup Platform. More information is available at the following link: https://status.productsup.io/incidents/jyjljf5lkt26 The release is complete. At the moment, some users with SSO enforced/enabled on their account may face issues while logging in. We are currently investigating this issue. Please reach out to support@productsup.com if you are facing any difficulties while logging in. Kind regards, Your Productsup Team

Report: "Failure of Scheduled Database Job and Subsequent Job Processing Interruptions"

Last update
resolved

A critical failure occurred in the scheduled job designed to add partitions to several tables within our database. This scheduled maintenance task did not execute as planned, directly impacting all processing jobs that rely on these database tables. Consequently, all processing jobs scheduled between 00:00 and 06:00 UTC+2 were unable to run, leading to significant delays and disruption of normal operations.

Cause of Incident: The primary cause of the incident was the failure of the scheduled job tasked with adding partitions to the database tables. A deeper investigation revealed that the failure occurred due to an unforeseen software issue with our scheduler server, which also prevented the execution of our fail-safe mechanism designed to handle such failures.

Operational Impact: The processing of all jobs scheduled during the affected time frame was halted, causing a backlog of data processing tasks.

Immediate Actions Taken:

* Manual intervention was employed to run the necessary jobs once the issue was identified.
* Software diagnostics and immediate repairs were initiated on the scheduler server to restore its functionality.

Long-Term Corrective Actions:

* Hardware Redundancy: Implement additional hardware redundancy for our scheduler server to mitigate the risk of a single point of failure.
* Enhanced Monitoring: Upgrade our monitoring systems to detect software issues more promptly before they impact critical operations.
* Fail-Safe Enhancements: Review and enhance the robustness of our fail-safe mechanisms to ensure they can handle unexpected failures more effectively.
* Regular Audits: Schedule regular audits of scheduled tasks and associated hardware to ensure they are functioning as expected without any potential risks.

Conclusion: We sincerely apologize for the inconvenience caused by this incident and are committed to implementing the necessary measures to prevent such occurrences in the future. We appreciate the understanding and patience of all stakeholders during the resolution of this issue.
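For illustration only, a scheduled partition-maintenance job of the kind described above might be built around a helper like the sketch below. The table name, MySQL-style partition scheme, and look-ahead window are hypothetical and not taken from the incident report; the sketch only prints the DDL, whereas a real job would execute it and alert an operator on failure (the fail-safe path mentioned above).

```python
from datetime import date, timedelta

# Illustrative sketch: pre-create daily RANGE partitions a few days ahead.
# Table and partition naming are hypothetical placeholders.
def build_add_partition_ddl(table, days_ahead=3, today=None):
    today = today or date.today()
    statements = []
    for offset in range(1, days_ahead + 1):
        day = today + timedelta(days=offset)
        boundary = day + timedelta(days=1)
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(PARTITION p{day:%Y%m%d} "
            f"VALUES LESS THAN (TO_DAYS('{boundary:%Y-%m-%d}')))"
        )
    return statements

# A real job would run these against the database and page someone on error.
for ddl in build_add_partition_ddl("processing_jobs", today=date(2024, 4, 1)):
    print(ddl)
```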

Report: "Platform API & Stream API: Temporary Disruption Due to Data Overflow"

Last update
resolved

Dear valued customers, Over the weekend, we experienced a temporary disruption due to unexpected data overflow in our storage systems, which support our Platform API and Stream API. This issue began at 10:05 AM GMT+2 and was resolved by 11:08 AM GMT+2. Test data from a trial run filled our storage to its maximum capacity, leading to limited service availability during this time. To address this, we have suspended these trial imports and cleared excess data from our systems. We sincerely apologize for any inconvenience this may have caused and appreciate your understanding. Thank you for your continued support. We are committed to maintaining the highest levels of service availability for you.

Report: "Ongoing network incident at hosting provider"

Last update
resolved

Dear customers, The incident has been resolved on our provider's side and we have not experienced any disruptions in our infrastructure for the past 72 hours. Therefore we declare this incident closed. Thank you for your continued support.

monitoring

Dear customers, Our services appear to be fully functional again, according to our monitoring. We are still waiting for a communication from our provider to declare the incident fully resolved. Thank you for your patience.

monitoring

Dear clients, We are experiencing temporary disruptions again, as our hosting provider is still trying to patch the last router that was encountering the issue. Errors might occur while you are using the platform until the issue is fully resolved. We will update you on further developments as we monitor the incident on our side. We would like to renew our apologies for any inconvenience this might cause, and appreciate your patience and understanding as our provider works to resolve the situation promptly.

monitoring

Dear clients, our hosting provider has updated us with an incident notice indicating that routers at their datacenter were affected by a firmware bug. They have identified and isolated the routers causing issues, and they are currently applying updates. As a result, our services should be fully functional again, pending any issues caused by the previous disruption. On our side, we are monitoring the situation and will update this incident when we are assured that it is fully resolved.

identified

Dear valued customers, We regret to inform you that our platform is currently experiencing a disruption in service. A network incident is ongoing at the hosting provider where our services are hosted, causing some of them to experience communication failures. Because of this, errors could appear while you are using the platform. We sincerely apologize for any inconvenience this might cause, and appreciate your patience and understanding as our provider works to resolve the situation promptly.

Report: "Platform reachability issue"

Last update
resolved

Dear valued customers, We regret to inform you that between the hours of 9:29 AM and 9:42 AM CET, our platform experienced a temporary disruption in service. This interruption was due to a network incident occurring at the datacenter where our services are hosted. As a result, our main services were rendered inaccessible from the internet. Upon investigation, it was determined that the root cause of the issue was a misconfiguration on a network switch by our providers' network engineering center. We want to emphasize that this incident was beyond our control, but we are actively collaborating with our providers to implement measures to prevent such incidents from occurring in the future. We sincerely apologize for any inconvenience this may have caused and appreciate your patience and understanding as we work to resolve the situation promptly. Thank you for your continued support.

Report: "Platform Incident"

Last update
resolved

This incident has been resolved.

monitoring

Our network team has managed to fail over to another router and the services should now be operational. We are continuing to monitor the situation and are waiting for a post-mortem from the hosting provider. We will update this incident when we are fully satisfied that the issue is resolved on their side as well.

investigating

Our platform is experiencing issues again, still due to network equipment issues at our hosting provider. They are currently investigating the issue and we will update this incident as soon as we hear from them. Thanks for your understanding and we hope to update you shortly with better news.

Report: "Platform incident"

Last update
resolved

The incident was resolved when our network engineers performed a successful failover. We will reach out to our hosting provider for further details. Thanks for your continued patience and understanding.

identified

We have identified the issue with a core router at our hosting provider and are currently failing over manually to another router to resolve it. The services should come back up shortly.

investigating

Our platform is unavailable at this time due to a network incident. We’re currently investigating and will have more information shortly. Thank you for your patience.

Report: "Performance Degradation Notice"

Last update
resolved

We are happy to announce that our platform has fully restored normal operations. Following our quick resolution of the recent performance issues, all services are now functioning as expected. We appreciate your patience and understanding during this disruption.

monitoring

We have resolved the performance issues that were affecting our platform's speed over the past couple of hours, and we are now monitoring the situation closely to ensure continued stability.

investigating

We are experiencing performance issues on our platform, resulting in slower than usual response times over the past couple of hours. Our team is actively investigating the root cause and working diligently to resolve the issue.

Report: "FTP Service Degraded Performance"

Last update
postmortem

### Executive Summary

In this postmortem report, we will analyze the recent incident related to our FTP service. The incident was triggered by a migration of the FTP service storage to a new location, resulting in file upload errors and customer dissatisfaction. Our primary goal is to identify the root causes of the incident, outline the actions taken to mitigate the issue, and propose preventive measures to avoid similar problems in the future.

## Incident Timeline

* **October 25, 2023:** Migration of FTP storage to a larger but slower I/O location commenced.
* **October 28, 2023:** Increased I/O activity on the new storage led to timeouts and file upload failures.
* **October 30, 2023:** Further analysis revealed insufficient retry logic for FTP uploads.
* **October 31, 2023:** Initiation of the migration from slow storage to a faster alternative.
* **October 30, 2023:** Communication with affected customers to keep them informed.

## Root Causes

1. **Storage Migration Decision:** The incident was primarily triggered by the decision to migrate the FTP service storage to a location with more capacity but slower I/O. This decision was based on the assumption that the FTP service was used infrequently by most customers, and the migration was assessed as a low impact operation that would not cause any downtime.
2. **Inadequate Retry Logic:** The FTP upload process lacked a proper retry logic for handling timeouts. This issue was unforeseen and, had it been anticipated, could have been avoided with exponential backoff and a higher number of retries.
3. **Insufficient Monitoring:** Our monitoring systems provided monitoring for uptime of the FTP service, but they did not provide any insights on increased upload error ratios.

## Immediate Actions Taken

1. **Migration Reversal:** To mitigate the immediate impact, the migration process was reversed, and data was migrated back to faster storage.
2. **Retry Logic Enhancement:** The retry logic for FTP uploads was improved by implementing exponential backoff and increasing the number of retries.
3. **Customer Communication:** Affected customers were informed about the incident, the steps taken to address it, and the ongoing migration process.

## Preventive Measures

1. **Improved Monitoring:** Enhance monitoring systems to detect performance issues more quickly, ensuring timely responses to unexpected incidents.
2. **Proactive Communication:** Establish proactive communication protocols to inform affected customers promptly when service disruptions occur.
3. **Retry Logic Enhancement:** Continuously review and improve retry logic for all critical services to account for unforeseen issues and reduce the impact of timeouts.
4. **Risk Assessment:** Conduct thorough risk assessments before making significant changes to services or infrastructure to anticipate potential problems.
5. **Customer Feedback Integration:** Encourage customers to provide feedback on service performance, and actively integrate their insights into our ongoing improvement efforts.

## Conclusion

This incident has highlighted the importance of anticipating and preparing for unforeseen issues in our systems. We apologize for the inconvenience and frustration this incident may have caused you and other affected customers. We are committed to applying the lessons learned from this incident to ensure that our services continue to meet your expectations. Your trust in our services is invaluable to us, and we appreciate your patience and understanding during this challenging period.

If you have any further concerns or questions, please do not hesitate to reach out to us. We are committed to maintaining the highest standards of reliability and providing you with a smoother and more consistent experience in the future.
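To illustrate the retry-logic improvement described in the postmortem (exponential backoff with a higher number of retries), a client-side upload helper might look like the sketch below. The host, credentials, retry count, and delays are placeholder assumptions, not Productsup's actual implementation.

```python
import ftplib
import time

# Illustrative sketch of an FTP upload with exponential backoff.
# Host, credentials, and retry parameters are placeholders.
def upload_with_backoff(host, user, password, local_path, remote_name,
                        max_retries=5, base_delay=1.0):
    for attempt in range(max_retries):
        try:
            with ftplib.FTP(host, user, password, timeout=30) as ftp:
                with open(local_path, "rb") as f:
                    ftp.storbinary(f"STOR {remote_name}", f)
            return True
        except (ftplib.error_temp, OSError):
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential backoff: 1s, 2s, 4s, 8s, ...
            time.sleep(base_delay * (2 ** attempt))
    return False
```

The backoff spreads retries out over time, so a storage cluster that is briefly slow or timing out is not hammered with immediate re-uploads.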

resolved

Dear customers, We are pleased to inform you that the recent incident related to our FTP service has been successfully resolved. Our team has worked diligently to address the issues, and we wanted to provide you with this brief update. The migration of our FTP service storage to a faster and more stable location has been completed, and the improvements to the retry logic for FTP uploads have been implemented. We have thoroughly tested these changes to ensure their effectiveness. We sincerely apologize for any inconvenience this incident may have caused you, and we greatly appreciate your patience and understanding during this time. If you encounter any further issues or have any questions, please do not hesitate to reach out to us.

monitoring

Dear customers, for further clarification, although the service has been very stable for the past 12 hours, occasional failures may still happen while the FTP data is being migrated to a new cluster. Please bear with us in the meantime as the process is completing in the background. We are still actively monitoring the situation. Thanks for your understanding and we hope to update you shortly with better news.

monitoring

The FTP service is fully operational again and did not encounter any timeouts in the past 2 hours. We will be monitoring the situation for the next 24 hours.

identified

We have identified an issue with the storage cluster that backs up the Productsup FTP Service. To resolve this issue, we need to migrate the data to another cluster. Unfortunately, this will cause reduced performance on the service while the migration is happening. We kindly ask you to bear with us while our engineers work hard on solving the issue.

Report: "Platform Database Issues"

Last update
resolved

Our database team identified the root issue caused by a runaway database global lock, which required a full cluster restart in order to promptly restore the service to normal operation. The incident is fully resolved. Thanks again for your patience and your committed support as we strive to maintain a high level of service reliability.

monitoring

The cluster is operational again, we are monitoring its stability.

identified

Our database administrators have performed a full cluster restart. The services should come up again. We are bringing up more cluster nodes as we speak. Thanks for your patience and continued support.

investigating

We're currently encountering issues with the Platform Database Cluster. Our technicians are investigating the issue. Please accept our apologies for the inconvenience.

Report: "Login Functionality Partially Unavailable"

Last update
resolved

Our engineering team identified and addressed the root cause of the issue. After thoroughly investigating and troubleshooting, we implemented the necessary fixes to ensure a smooth login experience for all users. We understand the importance of a seamless user experience, and we are committed to preventing such issues from reoccurring in the future. We will continue to monitor our systems closely to maintain a high level of service reliability.

identified

We want to provide you with an important update regarding the login issue that some of our users have been experiencing. We have successfully identified the root cause of the problem, and our engineering team is actively working on implementing a fix.

investigating

We are aware of an ongoing issue affecting the login functionality of our platform. Some users may be experiencing difficulties when attempting to log in to their accounts. We understand the frustration this may cause and are actively investigating the root cause of the problem.

Report: "PUP FTP degraded"

Last update
resolved

The incident has been resolved and the service is operational again.

monitoring

The FTP upload service was only partially operational on 2023-09-06, between 8:55 and 9:30 CEST, due to a network outage. The service is operational again.

Report: "Stream API Downtime"

Last update
resolved

The Stream API suffered a 15-minute outage on 2023-09-01, between 13:45 and 14:00 CEST, due to a misconfiguration applied to an instance. The service is available again.

Report: "Platform Data View issue"

Last update
resolved

The Platform Data View component was degraded from 2:30PM CEST to 4:00PM CEST due to issues within a released version. A fix has been released and we are now monitoring the results. We are sorry for the inconvenience caused.

Report: "Performance Issues"

Last update
resolved

The issue has been identified as a database lock. The lock has been removed and the user interface should continue working as usual. Thanks for your patience and continued support.

investigating

We're seeing slow responses for the Platform user interface. Data Processing is not affected. We are currently investigating the issue. Sorry for the inconvenience.

Report: "Processing error log outage"

Last update
resolved

We resolved the issue, and all services are operating normally. The most recent error logs for Processing are available again in the UI now. Productsup is committed to continually and quickly improving our technology and operational processes to prevent incidents. We appreciate your patience and again apologize for impacting you, your users, and your organization. Thanks for your continued support.

monitoring

We have identified the root cause of the error log problem in Processing and have successfully implemented a solution to resolve it. It only impacted a few processing runs. We are pleased to report that Processing is back up and running as expected. Currently, we are focused on ensuring that all of the most recent error logs are once again accessible through our UI.

Report: "Platform API - Request timeouts"

Last update
postmortem

# SUMMARY

The Stream API experienced a major outage on Monday, July 10, between 11:00 CEST and 12:00 CEST. It affected most API requests; you could notice slow responses and timeouts on many requests. As soon as our systems reported consecutive health check failures, we investigated and found out that the caching cluster memory was full, which resulted in blocking all requests. The memory consumption reached its maximum due to a significant increase in traffic and the system's release on July 6, which contained additional caching-related functionalities.

# REMEDIAL ACTIONS PLAN & NEXT STEPS

We restarted all cache and API nodes to fix the issue. We replaced our caching cluster with a more powerful solution. Before the incident, we planned to introduce rate-limiting for the Platform API and started communicating it to our clients. If you haven’t received an email explaining our rate limits, you can expect it in the upcoming weeks. These actions will contribute to the stability of our cluster.

Productsup is committed to continually and quickly improving our technology and operational processes to prevent outages. We appreciate your patience and apologize for the impact on you, your users, and your organization. We thank you for your business and continued support.
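Rate limiting of the kind mentioned in the remedial plan is commonly implemented with a token bucket. The sketch below is a generic illustration under that assumption; the capacity and refill rate are arbitrary values, not Productsup's published limits.

```python
import time

# Generic token-bucket rate limiter sketch. Capacity and refill rate are
# arbitrary illustrative values, not Productsup's actual limits.
class TokenBucket:
    def __init__(self, capacity=10, refill_per_second=5.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill tokens based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a server would typically respond with HTTP 429 here

bucket = TokenBucket(capacity=3, refill_per_second=1.0)
print([bucket.allow() for _ in range(5)])  # first 3 allowed, then throttled
```

Bounding the request rate in this way protects a shared resource like a caching cluster from being saturated by sudden traffic spikes.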

resolved

Our new caching cluster has been operational since 14:35 CEST, and everything is stable so far. We will follow up with a post-mortem within 2 business days.

monitoring

The API is functional again. We will soon replace our caching cluster; this will be announced via an ad-hoc scheduled maintenance message on this status page.

identified

Gradually increasing traffic over the last 7 days, combined with a deployment on the 6th of July for an upcoming rate-limiting functionality, caused a slow decrease in availability of our caching cluster. At around 11:00 today it became stuck. We are solidifying our caching infrastructure at the moment.

investigating

Our caching cluster was experiencing issues and causing problems with the API. We have restarted all API nodes and we're receiving data again. We're still investigating the root cause of the issue.

investigating

We are currently investigating issues with our Platform API. We can see that the amount of traffic has dropped significantly since 11:00. Our API is returning errors in the 4xx range, and clients are experiencing request timeouts.

Report: "Stream API missing data"

Last update
postmortem

# **SUMMARY**

The Stream API experienced data loss on Monday, July 3. An unfortunate misconfiguration in our production environment resulted in removing production data from our redundant storage cluster. We have started looking into solutions to mitigate the impact. After a thorough investigation, we found that data loss didn't affect all streams. We analyzed the affected streams and started to recover data from the last imported message. The Stream API message queue storage has a 10-day retention period. For some streams, we needed to reimport data only for the last two days. For other streams, we needed to reimport data for the entire retention period. After the investigation, we immediately contacted our clients and advised them to send a full catalog update if they noticed significant product fluctuations. We did this to mitigate the impact on the running processes. Most sites have reimported data for the retention period and are back to running normally.

# **REMEDIAL ACTIONS PLAN & NEXT STEPS**

We plan to keep improving our backup strategy and precautions to prevent running any destructive logic in our production environment. Productsup is committed to continuously improving our technology and operational processes to prevent outages. We appreciate your patience and apologize for the impact on you, your users, and your organization. We thank you for your continued support.

resolved

The system is fully operational. All affected clients have been informed. We are working on importing the data that was queued in our system, which has a retention period of 10 days. A full post-mortem will be released by tomorrow at the latest.

monitoring

We are continuing to monitor for any further issues.

monitoring

We have reached out to all clients who are affected. The email contains instructions on determining the impact and advice on our clients' best course of action.

identified

We investigated the cause of the missing data and identified its source. Partial data loss occurred due to a misconfiguration. We are looking into the possibility of recovering the data. All affected clients will be informed via email as soon as possible.

investigating

Due to an unknown incident, some of the data that was previously uploaded using the Stream API is currently missing. We are investigating the causes and will update you as soon as we have findings. Thanks for your patience.

Report: "Data View is currently not accessible"

Last update
resolved

We resolved the issue, and the Data View is operating normally again. Productsup is committed to continually and quickly improving our technology and operational processes to prevent incidents. We appreciate your patience and again apologize for the impact on you, your users, and your organization. Thanks for your continued support.

monitoring

We have implemented a solution and are currently monitoring it.

investigating

We apologize for the inconvenience: the Data View in the platform UI is currently unavailable. Our team is investigating the issue.

Report: "Partially failures when generating export files"

Last update
postmortem

**SUMMARY**

On 13 April, between 13:50 CEST and 17:20 CEST, our processing pipeline experienced a partial outage following the release of a new feature. After a thorough investigation, we identified the root cause as a side effect of the newly released feature. This side effect unintentionally impacted less than 2% of our clients and resulted in an invalid data format in the export files, which prevented the export destinations from processing the data. We immediately rolled back to the previous release, which resolved the problem. Additionally, our customer success team proactively notified the affected customers about the incident, offering apologies and any necessary assistance.

**REMEDIAL ACTIONS PLAN & NEXT STEPS**

As a preventive measure, we have introduced additional checks in our development cycle to ensure that new features do not impact existing functionality in our software. We remain dedicated to preventing such incidents in the future and to continuously improving our processes to ensure high-quality services for our clients. We apologize for any inconvenience caused and thank our clients for their understanding and patience during this incident.

resolved

We experienced an issue in our processing pipeline that affected a few export channels with custom sorting. We resolved the issue at 5:30 pm, and all services are operating normally. Productsup is committed to continually and quickly improving our technology and operational processes to prevent incidents. Again, we appreciate your patience and apologize for the impact on you, your users, and your organization. Thanks for your continued support.

Report: "Data Processing is slow"

Last update
resolved

Dear customers, due to security-related changes, we encountered connection failures between the scheduling system and the job executors. To resolve this issue, we adjusted the settings and restored the full capacity of our scheduling system. All jobs should now start without delay. Thank you for your understanding and continued support as we remain committed to providing a service that is both highly secure and highly available.

investigating

Data processing is experiencing some delays. We’re currently investigating and will have more information shortly. Thank you for your patience.

Report: "Stream API degraded availability"

Last update
postmortem

**SUMMARY**

We experienced issues with the Kafka cluster used for our Stream API on Wednesday, the 29th of March, between 12:40 CEST and 15:40 CEST. From 12:40 CEST until 13:40 CEST, we had a partial outage, meaning only about 50% of all requests got through. We started investigating the instability of the cluster and tried several remedial actions to see if we could get it running again at full capacity. However, we noticed that a single node would become unavailable again whenever it was restarted. Between 13:40 CEST and 15:40 CEST, we rejected all requests at our load balancer to take the cluster offline and inspect each node individually. During this period, we had a complete outage of the service. We restarted each node and let it repair itself with an increased memory configuration. At 15:40 CEST, all nodes were stable again, and we were able to re-enable requests to our load balancer. From that moment on, the system was back up and running. A small backlog had built up during the downtime, but it was handled within 20 minutes and did not severely impact the functionality of the Stream API. Increased load on our Stream API, combined with nodes that were not powerful enough to carry the cluster's full load, was the main reason for the instability. To prevent this from happening again, we immediately enabled strict rate limiting for overusers of our API. Ever since, the Stream API has been running stably. No data was lost on our end. All requests that returned a successful status code (in the 2xx range) were received and processed. All requests that returned a 500 status code should be resent if needed.

**REMEDIAL ACTIONS PLAN & NEXT STEPS**

Next week, we will enable strict rate limiting for all Stream API users: clients will be allowed to make 30 requests per second per stream. Until now, this was only a recommendation, and the hard limit was higher. Additionally, we plan to extend our cluster next week; the servers have already been ordered and prepared. We have also planned to review our maintenance documentation and our guide on stability recommendations, and to make changes where needed. Productsup is committed to continually and quickly improving our technology and operational processes to prevent outages of our Stream API. Again, we appreciate your patience and apologize for the impact on you, your users, and your organization. We thank you for your business and continued support.
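
A few client-facing points from this post-mortem are worth restating: requests that returned a 2xx status were processed, requests that returned a 500 can safely be resent, and the upcoming hard limit is 30 requests per second per stream. The sketch below shows one way a sender could respect that limit and resend on 5xx; the endpoint, payload shape, and auth header are placeholders and not Productsup's documented API.

```python
import time
import requests

# Placeholder values for illustration; substitute your real stream endpoint and token.
STREAM_URL = "https://api.example.com/streams/<stream-id>/messages"
MAX_REQUESTS_PER_SECOND = 30  # per-stream hard limit mentioned in the post-mortem

def send_messages(messages, api_token, max_attempts=5):
    """Send messages one by one, staying under the per-stream rate limit and
    resending any request that comes back with a 5xx status."""
    min_interval = 1.0 / MAX_REQUESTS_PER_SECOND
    headers = {"Authorization": f"Bearer {api_token}"}

    for message in messages:
        for attempt in range(max_attempts):
            started = time.monotonic()
            response = requests.post(STREAM_URL, json=message, headers=headers, timeout=10)
            if response.status_code < 500:
                break  # 2xx was processed; other non-5xx codes need inspection, not blind retries
            time.sleep(1.0)  # 5xx: the request was not processed, so resending is safe

        # Spread requests out so the per-stream rate limit is never exceeded.
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
```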

resolved

Last night we made some configuration changes to provide more stability to the cluster. We have not seen any spikes of errors or failed connections over the past 12 hours. We will follow up with a post-mortem later today.

monitoring

The cluster is online again. We anticipate a queue build-up on our clients' side, so there could be slight performance degradation in the upcoming minutes, but we do not expect any failing connections to our system. We continue to monitor the behavior of the cluster and our API. Tomorrow we will follow up with a post-mortem and an action plan for improvements.

identified

We have isolated the issue to a single node. After the recovery, we modified its configuration to be more stable, and these updates have also been applied to the other nodes. We are running the final checks before the cluster can go back online.

investigating

One of the nodes is in recovery; we are waiting until this is done before enabling the cluster again.

investigating

We are restarting nodes in our cluster and inspecting the reported errors. In the meantime, all requests to our load balancer are blocked.

investigating

We will perform a complete cluster restart, expecting a brief full outage of the system. We will stop traffic to our load balancer, temporarily rejecting all requests.

investigating

We're experiencing issues with our Stream API cluster and are seeing an increased number of failed connections. Sorry for the inconvenience.

Report: "Redirect Tracking"

Last update
resolved

The incident has been resolved.

investigating

We're investigating an issue with our redirect tracking.

Report: "Data Processing is slow"

Last update
resolved

Data processing queue times have been stable for the past 24 hours, so we are closing this incident. Thanks for your patience.

identified

Data processing is currently affected by elevated queue times. We’re investigating and will have more information shortly. Thank you for your patience.

monitoring

After a lengthy investigation, we identified a race condition in our job dispatcher software. We have implemented a fix, and the job queue backlog began to shrink immediately. We will be monitoring the results of this fix over the weekend.

investigating

Data processing is currently affected by elevated queue times. We’re investigating and will have more information shortly. Thank you for your patience.

Report: "Transport server issue"

Last update
resolved

We reached out to our hosting provider, who identified the issue as a problem with one of their network switches. The service is now operational again. Thanks for your patience and continued support.

investigating

We are currently investigating an issue with our transport server. As a result, the affected services are unavailable.