Historical record of incidents for Xplenty
Report: "Intermittent Dashboard Errors"
Last update: The incident has been resolved. We apologize for the inconvenience caused.
A fix has been implemented and we are monitoring the results. We apologize for the inconvenience caused.
We are currently investigating the issue which is causing the dashboard to return intermittent errors.
Report: "Jobs Stuck on Pending"
Last update: The incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue which caused jobs to be stuck on pending.
Report: "Issues retrieving data from Bing Ads"
Last update: The incident has been resolved.
A fix has been implemented. Please reconnect your Bing Ads connection and rerun the jobs on a newly provisioned cluster.
Customers may experience issues with jobs that read data from Bing Ads. Our engineers are investigating the issues.
Report: "Partial Job Executions Failure"
Last update: Some jobs failed to execute on the clusters due to an issue with our upstream providers. The issue was resolved on April 13 at 12:30 UTC.
Report: "Some Clusters Stuck on Creating"
Last update: The incident has been resolved.
Clusters are now being created. We are continuing to monitor the issue.
We are seeing some clusters stuck on Creating in the Virginia region and have identified that this is due to an issue with our upstream provider. We will provide updates as soon as we have them.
Report: "Job Notification Issue Causing Job Failures"
Last update: From Feb 24th, 14:30 UTC to 19:41 UTC, an issue with our job notification service caused jobs to skip the Running state and display a blank runtime. While Dataflow packages may have run normally (pending further confirmation), Workflows were affected, leading to job failures. The issue has now been resolved. If needed, please rerun the jobs or use "Run One-Off" on the schedule page. A post-mortem will be released soon. We sincerely apologize for the inconvenience.
Report: "Job Failures on File Storage Destination"
Last update: From 10:25 UTC to 13:00 UTC, job failures occurred in packages using the file storage destination component due to a faulty deployment. The issue has been resolved. We apologize for any inconvenience this may have caused.
Report: "Dashboard Loading Issue"
Last update: This incident has been resolved.
We are aware of an intermittent issue affecting the dashboard. If you encounter this problem, please try refreshing the page. Our development team is actively investigating the root cause and will implement a fix as soon as possible. We apologize for the inconvenience and appreciate your patience.
Report: "API down due to an emergency maintenance of our upstream provider"
Last update: This incident has been resolved.
Our API is currently down due to emergency maintenance by our upstream provider. We will provide an update shortly.
Report: "Oregon SSH server down and experiencing job failures"
Last update: There was an issue with our SSH server in the Oregon region; the dev team applied a fix as of 21:40 UTC.
Report: "Dashboard and API Issues"
Last update: The issue was due to our upstream provider. The incident has been resolved.
We are currently investigating this issue.
Report: "Clusters Stuck on Creating State"
Last update: From Nov 12, 12:30 UTC until 14:00 UTC, we detected an issue in our cluster provisioning which caused clusters to be stuck on Creating due to a bad deployment. This should not have affected any scheduled jobs, but it may have delayed their start times. The issue has now been resolved and we apologize for the inconvenience caused.
Report: "Dashboard and API issues"
Last update: The incident has been resolved.
The dashboard is now up and we are monitoring.
The dashboard and API are currently not loading because of an issue with our upstream provider.
Report: "Intermittent Jobs and Clusters Stuck on Pending"
Last update:

### **Root Cause**
During our recent database maintenance on September 16th at 6 AM UTC, we encountered resource limitations from our upstream provider, which resulted in some worker tasks being missed.

### **Resolution and Mitigation**
* **Immediate Actions Taken:** We immediately stabilized the environment by restarting affected services and applications to minimize disruption.
* **Long-Term Measures:** To prevent this issue from happening again:
  * Implemented automatic termination of long-idle connections to free up resources.
  * Enhanced our monitoring for pending jobs, ensuring that any long-running tasks are promptly identified and addressed.

### **Preventive Actions**
* **Monitoring Improvements:** We have implemented monitoring for jobs stuck in a pending state, enabling us to remain proactive in addressing long-running tasks and responding before they impact operations.
* **Additional Measures:** We have increased the resources allocated to our database and are working closely with our upstream provider to ensure resource availability.

### **Next Steps**
We will continue to monitor the situation closely and make adjustments to workflows or settings as needed. Our team is committed to preventing future incidents of this nature, and we sincerely apologize for any inconvenience caused by this issue.
During our recent database maintenance, we encountered intermittent resource limitations from our upstream provider, which resulted in some worker tasks being missed. We have put measures in place and this issue is now resolved.
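One of the long-term measures above is the automatic termination of long-idle connections. A minimal sketch of what such a cleanup could look like is shown below; it assumes a PostgreSQL metadata database and the psycopg2 driver, and the 30-minute threshold is purely illustrative, since the report does not name the actual database or settings.

```python
# Hypothetical sketch: terminate database connections that have been idle
# longer than a threshold. Assumes PostgreSQL + psycopg2; the database,
# threshold, and credentials are illustrative, not details from the report.
import psycopg2

IDLE_LIMIT = "30 minutes"  # illustrative threshold

def terminate_long_idle_connections(dsn: str) -> int:
    """Terminate idle backends older than IDLE_LIMIT; return how many were killed."""
    # pg_terminate_backend requires a role allowed to signal other backends.
    query = """
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE state = 'idle'
          AND pid <> pg_backend_pid()
          AND now() - state_change > %s::interval
    """
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(query, (IDLE_LIMIT,))
            return cur.rowcount
    finally:
        conn.close()

if __name__ == "__main__":
    killed = terminate_long_idle_connections("dbname=app user=maintenance")
    print(f"terminated {killed} idle connections")
```

In practice a job like this would run on a schedule, with its counts feeding into the pending-job monitoring the post-mortem describes.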
Report: "Intermittent Dashboard and API Issues"
Last update: The updates appear to have resolved the issue and the incident is now resolved.
A mitigation has been implemented and we are monitoring the results. The issue may have been caused by a database resource constraint, but we have yet to confirm this. Thank you for your patience and we apologize for the inconvenience caused.
We are continuing to investigate the issue with our upstream provider.
We are currently investigating this issue, which may have been caused by the earlier database migration maintenance.
Report: "Increased Cluster Creation Times"
Last update: The incident has been resolved. We apologize for the inconvenience.
We have identified the issue and put a fix in place; cluster provisioning times should be back to normal now. We are monitoring the situation.
We are currently investigating an issue with increased cluster creation time.
Report: "Dashboard and API issues"
Last update: The downtime was caused by a bad deployment. We have rolled back the deployment for the time being. We apologize for the inconvenience caused.
We are currently investigating this issue.
Report: "Clusters and Jobs Stuck on Pending, Long Running Package Validation"
Last update: This incident has been resolved. Upon investigation, we identified that the root cause of this issue lies with our upstream provider.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Dashboard and API issues"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The dashboard and the API are currently not accessible due to an issue with our upstream provider. We are in touch with them to expedite a resolution.
Report: "Dashboard and API Issues"
Last update: The incident has been resolved.
Applications are now starting to go online and jobs should now run fine. We are monitoring further.
The dashboard and the API are currently not accessible due to an issue with our upstream provider. We are in touch with them to expedite a resolution.
Report: "Jobs with Redshift connections are failing"
Last update: This incident has been resolved.
The issue has been fixed and Redshift Source is fetching the schema properly.
We noticed that jobs with Redshift connections are unable to fetch the schema and we are working on it.
Report: "Failed Jobs due to Salesforce Destination Issue - 'No SLF4J providers were found.'"
Last update: This incident has been resolved. Thank you for your patience.
We are aware of failed jobs with our Salesforce destination and are currently investigating the issue. This issue started at around 12:00 PM UTC and we are working on getting this resolved as soon as possible.
Report: "Failed Jobs On Newly Created Cluster"
Last update: From 7:00 AM UTC to 7:51 AM UTC, an issue was detected which caused jobs run on a newly provisioned cluster to fail. A fix has been rolled out and job runs should now be back to normal. We apologize for the inconvenience caused.
Report: "Scheduled Jobs Were Not Triggered"
Last update: From 3:25 UTC until 6:31 UTC, an issue was detected which caused scheduled jobs not to run. We've issued a fix and scheduled jobs should now be back up and running.
Report: "Xplenty dashboard not accessible"
Last update: A fix has been rolled out and the dashboard should now be back up and running.
Currently, the Xplenty dashboard at https://app.xplenty.com/ is not accessible. We are looking into the issue.
Report: "Issue with Facebook API Connections"
Last update: Meta has approved and reinstated our OAuth application, and this issue has been resolved. Please reconnect your existing Facebook Ads Insights connections; jobs should run fine moving forward.
Unfortunately, we still have not received an update from Meta regarding the review of our app. They took it offline, saying they needed to review it, and we have responded to their requests three times in the last 10 days, but we are still not getting any response beyond automated emails each time we submit an appeal. We are doing all we can to speak with someone at Meta to get an update and a resolution. Apologies again for the impact caused here; it's incredibly frustrating for us too. We hope to be able to share a more positive update by the end of the week.
We are currently resolving this issue.
Report: "Jobs are in pending status"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Intermittent connection issue to REST API and Database connections"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
Our engineering team has identified that the connection issues are only impacting our Virginia region. We will have more updates as they become available.
We are continuing to investigate this issue.
We have received reports of intermittent connection issues affecting REST API and database connections. Our engineering team is investigating and will provide an update as soon as we have more information.
Report: "Dashboard and API Downtime"
Last update: From 8:46 AM UTC to 9:09 AM UTC, an issue was detected which caused the API and dashboard to be down. A fix has been rolled out and the components should now be back up and running. This should not have affected scheduled jobs.
Report: "Proxy Issue due to upstream provider on Sydney region"
Last update: From 11:54 PM UTC to 12:01 AM UTC, an issue was detected on our proxy server in the Sydney region which caused jobs with database connections to fail. The root cause was a hardware issue with our upstream provider. A fix has been implemented and jobs should now work as expected.
Report: "Intermittent Job Failures and Clusters Stuck on Pending"
Last update:

### **Issue Summary**
From 7:11 AM UTC to 8:58 AM UTC, an intermittent number of jobs and clusters were stuck on pending or returned errors.

### **Root Cause**
The root cause of this outage was our Redis component reaching 100% memory usage, which caused the intermittent issues. Redis is used as a caching mechanism for our application.

### **Resolution and recovery**
Here are the steps we are taking to ensure that the incident does not happen again:
* Vertically scaled up Redis for more memory.
* Improved monitoring so we can quickly detect Redis-related memory issues.

We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.

Sincerely,
[Integrate.io](http://integrate.io/) Engineering
Beginning at approximately 12:40 AM UTC until 2:30 AM UTC, there was an issue in one of our infrastructure components used for caching, which affected clusters and jobs provisioned during that time period. The issue has now been fixed by our engineers.
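The resolution above mentions improved monitoring to quickly detect Redis-related memory issues. A hedged sketch of such a check is below; it assumes the redis-py client, and the 90% threshold and print-based alert are illustrative placeholders rather than Integrate.io's actual tooling.

```python
# Hypothetical sketch: warn when Redis memory usage approaches its limit.
# Assumes the redis-py client; the threshold and alert mechanism are illustrative.
import redis

THRESHOLD = 0.90  # alert at 90% of maxmemory (illustrative)

def check_redis_memory(host: str = "localhost", port: int = 6379) -> None:
    client = redis.Redis(host=host, port=port)
    info = client.info("memory")
    used = info["used_memory"]
    limit = info.get("maxmemory", 0)
    if limit and used / limit >= THRESHOLD:
        # Replace with a real pager/alerting integration in practice.
        print(f"ALERT: Redis at {used / limit:.0%} of maxmemory ({used} / {limit} bytes)")
    else:
        print(f"Redis memory OK: {used} bytes used")

if __name__ == "__main__":
    check_redis_memory()
```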
Report: "Issues with upstream DNS provider"
Last update: This incident has been resolved.
We are experiencing issues with an upstream DNS provider impacting login access. Please contact our support team for immediate assistance.
Report: "Failing jobs on Salesforce destination"
Last update: The incident occurred between 7:11 AM UTC and 8:58 AM UTC and has already been resolved. Please rerun the package on a newly created cluster to overcome errors with the Salesforce destination.
Report: "Jobs failing due to JDBC connection issues."
Last update:

### **Issue Summary**
From 15:01 UTC to 17:19 UTC, our Virginia proxy server became inaccessible. Due to this, all jobs with database and SFTP connections on this proxy failed. The issue was caused by our upstream provider. Customers in our other regions were unaffected.

### **Root Cause**
The root cause of this outage was an issue with the upstream provider on the particular instance.

### **Resolution and recovery**
Here are the steps we are taking to ensure that the incident does not happen again:
* Improve fault tolerance with automated proxy server failover so that there will be minimal downtime if a proxy hardware issue recurs.

We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.

Sincerely,
[Integrate.io](http://Integrate.io) Engineering
This incident is now resolved. Jobs are running fine.
We are currently investigating this issue.
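The mitigation above calls for automated proxy server failover. One minimal way to approximate that is a health check that falls back to a standby endpoint when the primary is unreachable, as sketched below; the hostnames, port, and timeout are assumptions for illustration and not details from the report.

```python
# Hypothetical sketch: pick the first reachable proxy endpoint, so traffic can
# fail over to a standby if the primary proxy is unreachable. Hostnames, port,
# and timeout are illustrative; the report does not describe the real mechanism.
import socket

PROXY_CANDIDATES = [
    ("proxy-primary.example.internal", 1080),
    ("proxy-standby.example.internal", 1080),
]

def pick_healthy_proxy(timeout: float = 3.0) -> tuple[str, int]:
    for host, port in PROXY_CANDIDATES:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue  # endpoint unreachable; try the next candidate
    raise RuntimeError("no healthy proxy endpoint available")

if __name__ == "__main__":
    host, port = pick_healthy_proxy()
    print(f"routing database/SFTP tunnels through {host}:{port}")
```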
Report: "Jobs are in pending status"
Last update: Clusters are up and running. We have fixed the issue.
We are currently investigating the issue.
Report: "Jobs are failing and connections not working."
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Dashboard and API Offline"
Last update: The dashboard and APIs are working fine now. RCA: Our dashboard and API were offline due to an issue with our upstream provider. All looks good now.
The Dashboard and API are working fine.
Our upstream hosting provider (Heroku) still seems to be having issues. We are monitoring further.
The dashboard and API seem to be recovering. We are monitoring further.
Our upstream provider is currently investigating the issue.
Our dashboard and API are currently offline due to an issue with our upstream provider. We are currently waiting for updates from the said provider.
Report: "Apache Log4j2 issue"
Last update: This incident has been resolved.
Xplenty is aware of the recently disclosed security issue affecting the open-source Apache "Log4j" utility (CVE-2021-44228). At this time, we can confirm that Xplenty is NOT impacted by this CVE. We strongly encourage customers who manage environments containing Log4j to update to the latest version.
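For customers who do manage environments containing Log4j, a rough way to inventory copies on a host is to list log4j-core JARs and their versions so they can be compared against the Apache advisory for CVE-2021-44228. The sketch below is only illustrative (it detects versions embedded in file names and nothing else) and is not an official Xplenty tool.

```python
# Hypothetical helper: list log4j-core JARs found under a directory tree so
# their versions can be checked against the Log4j advisories for CVE-2021-44228.
# Only detects versions embedded in file names; not an exhaustive scanner and
# not an official Xplenty utility.
import re
from pathlib import Path

JAR_PATTERN = re.compile(r"log4j-core-(\d+\.\d+(?:\.\d+)?)\.jar$")

def find_log4j_jars(root: str) -> list[tuple[Path, str]]:
    hits = []
    for path in Path(root).rglob("log4j-core-*.jar"):
        match = JAR_PATTERN.search(path.name)
        if match:
            hits.append((path, match.group(1)))
    return hits

if __name__ == "__main__":
    for path, version in find_log4j_jars("/opt"):
        print(f"{path}: log4j-core {version} -- compare against the Apache advisory")
```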
Report: "Job failures in the Ireland region"
Last update: This incident has been resolved.
The incident has been resolved. Jobs are running normally in the Ireland region.
We are continuing to investigate this issue.
We are investigating an increase in job failures in the Ireland region.
Report: "Clusters Stuck on Creation"
Last update: From June 26th at 11:30 PM UTC until 07:14 AM UTC, an issue was identified which caused clusters to be stuck in the creating state. Clusters became available at around 07:14 AM UTC after a fix was put in place. The root cause was a rare faulty message that was stuck in our message queue mechanism. We have found the reason and implemented a fix to ensure this does not happen again. We have also added a mechanism to handle the faulty-message scenario in case something similar happens in the future (this incident revealed it to be a single point of failure). We apologize for the inconvenience caused; please do reach out if there's anything we can do to help.
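The fix above adds a mechanism to address the faulty-message scenario. A common pattern for that class of problem is a dead-letter guard that parks a repeatedly failing message instead of letting it block the queue; the sketch below illustrates the idea with hypothetical names and a made-up retry budget, since the report does not describe the actual queue technology.

```python
# Hypothetical sketch of a poison-message guard: if a queue message keeps
# failing, move it to a dead-letter store instead of letting it block the
# queue. The message shape, retry limit, and helpers are illustrative.
MAX_ATTEMPTS = 3  # illustrative retry budget per message

def requeue(message):
    """Placeholder: in a real system this would re-enqueue the message."""
    pass

def process_with_dead_letter(message, handler, dead_letter_store):
    attempts = message.get("attempts", 0)
    try:
        handler(message["body"])
    except Exception as exc:
        if attempts + 1 >= MAX_ATTEMPTS:
            # Park the faulty message so it cannot block further processing.
            dead_letter_store.append({"body": message["body"], "error": str(exc)})
        else:
            message["attempts"] = attempts + 1
            requeue(message)

if __name__ == "__main__":
    def always_fails(body):
        raise ValueError("bad payload")

    dlq: list = []
    process_with_dead_letter({"body": "provision-cluster-42", "attempts": 2}, always_fails, dlq)
    print("dead-lettered:", dlq)
```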
Report: "Dashboard Issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating issues with the dashboard not loading.
Report: "Job Failures on MySQL Destination"
Last update: This incident has been resolved.
From 4:40 UTC, jobs with a MySQL destination started to fail due to a bad deployment. We have rolled out a fix to remediate the issue and are currently monitoring. We apologize for the inconvenience caused.
Report: "Clusters Stuck on Creation"
Last update: From 7:55 AM UTC to 11:02 AM UTC, an issue was identified which caused clusters to be stuck in the creating state. A fix has now been implemented and jobs that were waiting for a cluster should now run and succeed. Apologies for the inconvenience caused.
Report: "Connectivity and related Job Failures"
Last update:

### **Issue Summary**
From 17:13 UTC to 19:15 UTC, a Virginia proxy server became overloaded. Due to this, all jobs with database and SFTP connections on this proxy failed. The proxy server had an out-of-memory issue caused by an increased spike in connections. Customers in our other regions were unaffected.

### **Root Cause**
The root cause of this outage was a spike in connections causing the proxy server to run out of memory.

### **Resolution and recovery**
Here are the steps we are taking to ensure that the incident does not happen again:
* Doubled the proxy server’s memory.
* Adjusted the proxy server’s memory swap settings.

We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.

Sincerely,
Xplenty Engineering
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We have noted connectivity issues and job failures showing a connection refused error. We are currently investigating this issue.
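The post-mortem above attributes the outage to a spike in connections that exhausted the proxy server's memory. A hedged sketch of a complementary check, counting established connections to an assumed proxy port with psutil, is shown below; the port and alert threshold are illustrative, not values from the report.

```python
# Hypothetical sketch: watch the number of established connections to a proxy
# port and warn when a spike could push the host toward its memory limit.
# The port, threshold, and psutil usage are illustrative assumptions; listing
# all sockets may require elevated privileges on some systems.
import psutil

PROXY_PORT = 1080              # illustrative
CONNECTION_ALERT_LIMIT = 5000  # illustrative

def count_proxy_connections(port: int = PROXY_PORT) -> int:
    return sum(
        1
        for conn in psutil.net_connections(kind="tcp")
        if conn.laddr and conn.laddr.port == port and conn.status == "ESTABLISHED"
    )

if __name__ == "__main__":
    active = count_proxy_connections()
    if active >= CONNECTION_ALERT_LIMIT:
        print(f"ALERT: {active} established proxy connections")
    else:
        print(f"proxy connections OK: {active}")
```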
Report: "Connectivity and related Job Failures"
Last update:

Yesterday we experienced an issue in the Virginia region causing jobs with database and SFTP connections to fail. Today we are providing an incident report that details the nature of the outage and our response. We understand this service issue has impacted our valued customers, and we apologize to everyone who was affected.

### **Issue Summary**
From 18:17 UTC to 23:00 UTC, the Xplenty proxy server in Virginia was down. Due to this, all jobs with database and SFTP connections failed. The Virginia proxy server had a memory issue caused by an increased spike in connections. Customers in other regions and their jobs were unaffected.

### **Root Cause**
The root cause of this outage was an increased spike in connections that overloaded our proxy server's memory.

### **Resolution and recovery**
Here are the steps we are taking to ensure that the incident does not happen again:
* Soft limits on connection tunnels per account to avoid proxy server congestion.
* Improved monitoring which shows active connections with better granularity.

Xplenty is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.

Sincerely,
Xplenty Engineering
The incident has been resolved.
We are continuing to monitor the system. No further issues so far.
A fix has been implemented and jobs should continue working. We are currently monitoring the system.
Our team has identified the issue and is working to roll out a fix
We are continuing to investigate this issue.
We have noted database connectivity issues and job failures showing a connection refused error. We are currently investigating this issue.
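The first mitigation above, soft limits on connection tunnels per account, can be pictured as a counter that is consulted before a new tunnel is opened. The sketch below uses hypothetical names, an in-memory store, and an arbitrary limit of 50; the report does not describe the real implementation.

```python
# Hypothetical sketch: enforce a soft per-account limit on open connection
# tunnels before accepting a new one. Names, the limit, and the in-memory
# store are illustrative assumptions.
from collections import defaultdict

SOFT_TUNNEL_LIMIT = 50  # illustrative per-account limit

class TunnelRegistry:
    def __init__(self, limit: int = SOFT_TUNNEL_LIMIT):
        self.limit = limit
        self.open_tunnels: dict[str, int] = defaultdict(int)

    def try_open(self, account_id: str) -> bool:
        """Return True if the account may open another tunnel."""
        if self.open_tunnels[account_id] >= self.limit:
            return False  # soft limit hit; queue or reject the new tunnel
        self.open_tunnels[account_id] += 1
        return True

    def close(self, account_id: str) -> None:
        self.open_tunnels[account_id] = max(0, self.open_tunnels[account_id] - 1)

if __name__ == "__main__":
    registry = TunnelRegistry()
    print("first tunnel allowed:", registry.try_open("acct-123"))
```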
Report: "Pending Jobs on Oregon Region"
Last update:

Last Friday we experienced an issue in the Oregon region causing jobs to be stuck on pending, though they eventually ran. Today we are providing an incident report that details the nature of the outage and our response. The following is an incident report for the Pending Jobs on Oregon Region incident that occurred on Friday, August 28th 2020. We understand this service issue has impacted our valued customers, and we apologize to everyone who was affected.

### Issue Summary
From 11:30 UTC to 16:06 UTC, the Xplenty job monitoring service in the Oregon region was down. Due to this, no new customer jobs could be started on Xplenty’s infrastructure and they were stuck in a pending state. The deployment in charge of the job monitoring service couldn’t scale up due to network interface limits relative to our instance count.

### Timeline (all times UTC)
* 11:30 UTC: Jobs stuck on pending in the Oregon region; downtime begins.
* 11:30 UTC: PagerDuty alerts the team and the investigation begins.
* 12:00 UTC: Xplenty contacts our cloud provider to check for any issues.
* 12:30 UTC: Job & cluster processing engines are operational.
* 13:45 UTC: Xplenty tweaks the autoscaler configuration.
* 13:50 UTC: 100% of service is restored and operational.

### Root Cause
The root cause of this outage was that our deployment couldn’t scale up due to an autoscaler misconfiguration.

### Resolution and recovery
The Xplenty development team has tweaked the configuration to ensure that the incident does not happen again. Xplenty is committed to continually improving our technology and operational processes to prevent outages. We appreciate your patience and again apologize for the impact to you, your users, and your organization. We thank you for your business and continued support.

Sincerely,
Xplenty Engineering
The incident has been resolved.
Jobs should now continue running fine. We are currently assessing the root cause and will be providing an update.
We currently have jobs pending in the Oregon region. We are investigating the issue.
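The root cause above (the deployment could not scale up due to network interface limits relative to the instance count) boils down to a capacity calculation. The sketch below works through an EKS-style pods-per-node bound purely as an illustration; the instance figures and the formula are assumptions about a typical AWS setup, not numbers from the report.

```python
# Hypothetical capacity check, illustrating the kind of limit described in the
# post-mortem: with ENI-based networking, the pods a node can host are bounded
# by its network interfaces and IPs per interface. The numbers below are
# illustrative, not Xplenty's actual settings.
ENIS_PER_NODE = 3   # network interfaces per node (illustrative)
IPS_PER_ENI = 10    # secondary IPs per interface (illustrative)
NODE_COUNT = 4      # current instance count (illustrative)

def max_pods_per_node(enis: int, ips_per_eni: int) -> int:
    # EKS-style formula: ENIs * (IPs per ENI - 1) + 2
    return enis * (ips_per_eni - 1) + 2

def can_scale_to(desired_pods: int) -> bool:
    capacity = NODE_COUNT * max_pods_per_node(ENIS_PER_NODE, IPS_PER_ENI)
    return desired_pods <= capacity

if __name__ == "__main__":
    print("cluster pod capacity:", NODE_COUNT * max_pods_per_node(ENIS_PER_NODE, IPS_PER_ENI))
    print("can scale job monitor to 150 pods?", can_scale_to(150))
```

If the desired replica count exceeds this bound, an autoscaler configured without regard to it will stall, which matches the behavior described in the report.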
Report: "Jobs Failure on Packages With File Storage Component"
Last update: Between 6:37 AM UTC and 8:32 AM UTC, the system had issues running jobs containing file storage components, which caused those jobs to fail without any logs. The root cause was a bad deployment and a fix has been put in place. We apologize for the inconvenience caused.
Report: "Xplenty API is slow and jobs are pending"
Last update: The issue has been resolved.
A fix has been implemented and we are monitoring the results.
The Xplenty API is slow and jobs are pending. We are investigating this.
Report: "Connectors Issue"
Last update: All the connections are working fine.
We have applied the fix and are currently monitoring the issue.
While we are still working on a fix, we recommend reconnecting the affected connections and rerunning the packages.
We are currently having issues with Salesforce, Google, and Bing Ads connections. We are investigating this.
Report: "Jobs Update Issue"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
The issue has been identified and a fix has been implemented. We are monitoring for further issues.
We are having issues with our job update handler due to the API server maintenance a few hours ago. We have put the system into maintenance mode while we investigate the issue.