Historical record of incidents for Thought Industries
Report: "Content Issues"
Last updateWe are continuing to work on a fix for this issue.
We have detected an issue in course content and test behavior - a fix is being worked on and will be deployed shortly.
Report: "Intermittent DNS Issue"
Last updateBetween 10:00 AM and 4:30 PM PST an intermittent, internal DNS issue resulted in elevated error rates and increased load times across the platform - this issue has been resolved. We will be closely monitoring the platform to ensure no further functionality is affected.
Report: "Service Disruption (US)"
Last updateBetween 11:19 and 11:31 AM EDT we experienced intermittent service disruption. We are investigating the cause of the incident.
Report: "Service Disruption / DDoS (US)"
Last updateBetween 6:02 and 6:12 AM EST an ongoing external vulnerability scan / DDoS attack impacted the US platform and resulted in intermittent availability issues during the stated 10-minute window. The internal team and our automated infrastructure resolved the issue quickly, and the infrastructure team will continue to review and improve our systems to minimize service disruptions.
Report: "Possible Login Issues with SSO"
Last updateA recent release included a fix for metadata not being reflected on custom domains. That fix caused cross-domain issues with Single Sign-On (SSO), resulting in login issues for a small subset of clients using SSO. Our team reverted the fix and the issue is now resolved.
Report: "Elevated load times"
Last updateBetween 8:15 AM EDT and 11:00 AM EDT the platform experienced significantly elevated response times in both the EU and US. The root cause of this outage was determined to be a routine security upgrade of an external dependency, which led to high CPU usage on our application servers. Although the platform auto-scaled in response to the increased load, response times did not improve satisfactorily. Reverting the dependency upgrade led to an immediate return to expected response times.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are investigating elevated load times in the US.
Report: "Rustici SCORM Outage (US)"
Last updateBetween 5:45 AM PDT and 7:45 AM PDT the SCORM Rustici service experienced a minor increase in error rates, followed by a more severe outage between 7:45 AM PDT and 8:45 AM PDT, after which point service was restored. The root cause of this outage was determined to be a misconfiguration in the internal load balancer, which resulted in general Rustici traffic routing to a single node and degraded performance when traffic exceeded a critical threshold. The infrastructure team has applied a fix as of the resolution of this outage and confirmed that traffic is correctly routing to all available nodes.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are investigating high loading times on Rustici SCORM launches in the US.
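The root cause described above (traffic routing to a single node instead of all available nodes) is the kind of imbalance that can be spotted from per-node request counts. The following is a minimal sketch, not the infrastructure team's actual check: the node names, the request counts, and the 0.8 threshold are all hypothetical values chosen for illustration.

```python
def max_traffic_share(requests_per_node: dict[str, int]) -> float:
    """Return the fraction of all requests handled by the busiest node."""
    total = sum(requests_per_node.values())
    if total == 0:
        return 0.0
    return max(requests_per_node.values()) / total


def is_skewed(requests_per_node: dict[str, int], threshold: float = 0.8) -> bool:
    """Flag a cluster where one node takes more than `threshold` of the traffic.

    The threshold is an assumed value; a single-node cluster is never flagged.
    """
    if len(requests_per_node) < 2:
        return False
    return max_traffic_share(requests_per_node) > threshold


# Hypothetical request counts for a three-node cluster.
balanced = {"node-a": 340, "node-b": 330, "node-c": 330}
misrouted = {"node-a": 990, "node-b": 5, "node-c": 5}

print(is_skewed(balanced))
print(is_skewed(misrouted))
```

With these made-up counts, the first cluster is not flagged and the second is, since one node carries 99% of the requests.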
Report: "Looker Instability (US)"
Last updateThis issue has been resolved.
A fix has been implemented and we are monitoring the results.
Looker functionality has been intermittently unavailable for short periods of time and we are investigating the cause.
Report: "Reporting Outage (US)"
Last updateThis incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Reporting Delay (EU)"
Last updateThe EU region experienced reporting delays due to a faulty pipeline worker. The infrastructure team has resolved the issue and is re-triggering reporting table builds to ensure all reporting is up-to-date.
Report: "Helium Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
A vendor has identified an internal bug with their workers and we are working with them to resolve the issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Reporting Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are investigating increased error rates in looker-based reporting.
Report: "Rustici SCORM Outage"
Last updateBetween 4:25 PM EDT and 5:15 PM EDT the Rustici SCORM service experienced increased error rates, disrupting service for certain customers. As of the resolution of this service interruption the underlying issue has been diagnosed and a fix has been applied. While no further issues are expected, the platform team is closely monitoring the Rustici application to ensure continued service availability.
Report: "AWS Firehose Outage"
Last updateBetween 2024-07-30 21:45 UTC and 2024-07-31 4:10 UTC AWS experienced an internal outage which impacted several delivery streams for activity-related reporting on the Thought Industries platform. Despite the failure, posted data was successfully stored in S3 for future ingestion and was successfully re-processed at the conclusion of the outage - there should be no lasting inaccuracies in reporting as a result of this incident.
AWS has resolved the outage. Accuracy of some activity-based reporting will be impacted for the active period of the outage.
AWS is experiencing increased error rates on their Firehose service which impacts some specific activity tracking & reporting on the platform. We will be carefully monitoring the situation until resolution. See AWS's status page for details: https://health.aws.amazon.com/health/status
Report: "US Rustici Outage"
Last updateBetween 6:30 AM PST and 8:00 AM PST the US Rustici cluster experienced increased error rates, impacting the launch and progress of Rustici-hosted SCORM courses for certain users on the learner platform. Due to the intermittent nature of the issue, the initial alarms were not validated right away. The issue persisted until it was confirmed, and service was restored shortly thereafter.
Report: "EU Redshift"
Last updateOn June 15, 2024, 4:00 AM UTC, following an automated AWS update on the production EU reporting database cluster, the ETL experienced intermittent errors resulting in delayed and/or inaccurate reporting results until the issue was fully resolved on June 17, 2024, at 9:00 AM UTC. No further impact should be present as of the resolution of the issue, and the infrastructure team is currently focused on implementing and deploying fixes to ensure this issue does not reoccur.
Report: "EU Reporting"
Last updateOn June 11, 2024, 11:35 AM UTC the EU reporting database experienced intermittent issues in data ingress, resulting in varied delays in reporting accuracy for EU customers until June 11, 2024, 3:25 PM UTC, after which data flow was confirmed as restored. The root cause of this incident was determined to be a faulty state in the internal jobs system which caused ETL jobs to hang without completing or exiting properly, blocking future job runs. No data inconsistencies should be present as of the resolution of this issue, and our infrastructure team is implementing several solutions to improve data availability and more accurately detect and escalate any future issues.
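The failure mode described above, a job that hangs without completing or exiting and so blocks future runs, is commonly guarded against with a hard timeout. The sketch below illustrates that general technique only. It is not the internal jobs system or the fix the infrastructure team implemented, and the `run_job` helper, its status strings, and the example commands are hypothetical.

```python
import subprocess
import sys


def run_job(command: list[str], timeout_seconds: float) -> str:
    """Run one job as a child process and never wait longer than the timeout.

    subprocess.run kills the child when the timeout expires and then raises
    TimeoutExpired, so a hung job cannot block the next scheduled run.
    """
    try:
        result = subprocess.run(command, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        return "timed_out"
    return "succeeded" if result.returncode == 0 else "failed"


# Hypothetical stand-ins for ETL jobs: one finishes, one hangs.
quick_job = [sys.executable, "-c", "pass"]
hung_job = [sys.executable, "-c", "import time; time.sleep(5)"]

print(run_job(quick_job, timeout_seconds=10))
print(run_job(hung_job, timeout_seconds=0.5))
```

A scheduler built this way can record the `timed_out` status and escalate it, rather than leaving the job slot occupied indefinitely.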
Report: "We are investigating platform instability in the US region."
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "US Instability"
Last updateThis incident has been resolved.
We are continuing to monitor as availability is restored.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are investigating platform instability in the US region.
Report: "Rustici Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an outage on the Rustici platform and are working on restoring service.
Report: "Rustici Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an outage on the Rustici platform and are working on restoring service.
Report: "Reporting Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Search Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We have identified an issue in our search functionality and are investigating the cause.
Report: "EU Region Read-Only Mode"
Last updateThis incident has been resolved.
The EU-region platform entered read-only mode during a scheduled zero-downtime maintenance - we have detected and resolved the issue and are now monitoring.
Report: "Looker Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've identified an outage on looker-based reporting and are working to resolve the issue.
Report: "Rustici Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
We have identified an outage on the Rustici platform and are working on restoring service.
Report: "US Outage"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "AWS Lambda Outage"
Last updateThis incident has been resolved.
Amazon Web Services is experiencing a region-wide outage which is impacting new certificate generation. AWS has already identified the issue and we are awaiting resolution. Note: Certificates may still be granted, but viewing certificate PDFs is impacted for newly granted certificates or sites with 'Always Regenerate Certificates' enabled.
Report: "Increased Latency"
Last updateThis incident has been resolved.
We have detected increased latency for a subset of customers and are investigating the cause.
Report: "Rustici Cloudfront Intermittent Outage"
Last updateBetween May 30, 2023 and May 31, 2023, we experienced intermittent errors because the application was unable to connect to third-party services, specifically Rustici. We were able to bypass the limitations and restore access. The services will continue to be monitored for any additional edge cases.
Report: "Identified issue with Assessment Engine [US]"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified an issue with our assessment engine which may cause learners to experience issues completing assessments in a single cluster within the US region. We are urgently working to resolve the issue.
Report: "System Responsiveness Being Investigated"
Last updateThe issue causing site slowness and poor application responsiveness has been resolved. An RCA will be made available upon request within 7 business days. Please submit a support ticket to request this RCA if necessary.
We have identified an issue which is causing slow application responsiveness. We are urgently working to resolve the issue.
Report: "504 Errors"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "503 Server Errors"
Last updateThis incident has been resolved. A full RCA will be available upon request within the next 7 days. Thank you all for your patience and sorry for any inconvenience caused.
We're investigating reports of 503 Server Errors across all sites in the US and will provide updates as soon as we have them.
Report: "[US] Issue with All Sites Being Investigated"
Last updateWe have resolved all issues with 503 Server Errors and can confirm full functionality has now been restored. A full RCA will be available upon request within the next 7 days. Thank you all for your patience and sorry for any inconvenience caused.
Another fix has been implemented around 503 Server Errors and we're monitoring the results.
We are still seeing issues with US sites and 503 errors. We are continuing to work on a fix for this issue.
A fix has been implemented for 503 Errors and we're monitoring the results. Full Resolution Expected Shortly.
We have identified the issue and are working to resolve this A.S.A.P.
We are continuing to investigate this issue.
We're investigating reports of 503 Server Errors across all sites in the US and will provide updates as soon as we have them.
Report: "SCORM Rustici Down for US Instances"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our engineering team has identified an issue with SCORM Rustici, which is currently not working in US instances. They are actively investigating and working to resolve this as soon as possible.
Report: "Salesforce & BI Connector Sync Issues & Ecommerce Purchase Reporting Issue"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
Our engineering and development team has confirmed that all syncs (Salesforce & BI) have stopped running and is actively investigating. Additionally, eCommerce reporting is not currently populating. This affects US-based instances only. We apologize for the inconvenience and will keep you informed as our team works to resolve this as soon as possible.
Report: "Intermittent 500 / Not Found Errors"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Our engineering and development team are seeing intermittent 500/not found errors within the Thought Industries Platform. They are investigating this and looking to resolve this as soon as possible.
Report: "Issues with Logging In to Sites on US Platform"
Last updateOn Monday, February 27th the infrastructure team identified an issue with logging into sites for US clients, which has been resolved. The issue was caused by a maintenance release, which was subsequently rolled back. A timeline is provided below:
US Platform Issues identified: 12:47 pm EST
US Platform Issues resolved: 2:07 pm EST
Report: "US Platform Outage Identified & Resolved"
Last updateThe infrastructure team identified a platform-wide outage for US clients on Tues 12/27 that was quickly resolved. Timeline provided below:
US Platform Outage identified: 7:35 pm EST
US Platform Outage resolved: 7:52 pm EST
Report: "Issue with Manual Login"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're investigating some reports around issues with manual logins for all US instances. We are currently investigating and will provide updates as soon as we have them.
Report: "SCORM Tenant Error"
Last updateThis incident has been resolved.
A fix has been implemented and we're currently monitoring the results.
We have identified the issue with errors related to SCORM files and are working to resolve this A.S.A.P. We apologize for any inconvenience this may cause.
Report: "Isolated Database Cluster Issue / System Outage"
Last updateSummary
Starting on February 5th and ending on February 14th, an isolated database cluster experienced periodically increased query times resulting in intermittent outages for relevant instances. A separate, half-hour platform-wide outage occurred during instance migration procedures on the 14th. Service was ultimately fully restored on February 14th, when the affected instance was isolated from other customers.

Timeline
Original Cluster
2022-02-05 - 99.97% availability, first related outage event.
2022-02-06 - 99.99% availability.
2022-02-07 - 99.98% availability, further outage events begin.
2022-02-08 - 99.86% availability.
2022-02-09 - 99.78% availability.
2022-02-10 - 99.94% availability.
2022-02-11 - 99.86% availability.
2022-02-12 - 99.92% availability.
2022-02-13 - 99.71% availability, instances are migrated to a new cluster.
2022-02-14 - 99.99% availability, brief outage, resolution achieved for this cluster.
New Cluster (created 2022-02-13)
2022-02-13 - 99.99% availability, cluster is created and populated.
2022-02-14 - 99.72% availability, outage follows a migrated instance & said instance is isolated.
2022-02-15 - 99.99% availability, resolution achieved for this cluster.

Root Cause
The root cause of the primary outage was determined to be the unique usage of the platform by an individual instance, which resulted in slow and blocking queries against the database cluster. These problematic queries triggered cascading slowdowns and resulted in outages for instances hosted on the isolated database. The secondary outage was caused by an incorrect migration configuration.

Action Items / Response
Thought Industries is dedicated to providing a reliable and available platform, and we are deeply aware of how an outage can affect our customers and their clients. We are determined to prevent the recurrence of this outage and have implemented the following action items:
- Affected database tables have been further optimized.
- The impacted instances have been distributed across several isolated clusters. The primarily impacted instance is isolated on its own cluster. Future distributions are planned to further improve the distribution of instances.
- Migration procedures have been adjusted to ensure misconfiguration does not occur.
- Several projects are being actively worked on to improve platform performance and stability.

We apologize for any inconvenience caused by this outage. We will continue to closely monitor affected schools to ensure that resolution has been achieved and will continue to work around the clock to ensure that the Thought Industries platform is reliable and available for everyone.
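To put the daily availability percentages in the timeline above into more concrete terms, they can be converted into approximate minutes of downtime. This is a rough sketch under one assumption that the report itself does not state: that each figure is a time-based percentage measured over a full 24-hour day. The `downtime_minutes` helper is illustrative only and is not how the figures were produced.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_minutes(availability_percent: float) -> float:
    """Approximate minutes of downtime in one day for a given availability.

    Assumes time-based availability over a full 24-hour day.
    """
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return round(unavailable_fraction * MINUTES_PER_DAY, 1)


# Daily figures for the original cluster, copied from the timeline.
original_cluster = {
    "2022-02-05": 99.97,
    "2022-02-06": 99.99,
    "2022-02-07": 99.98,
    "2022-02-08": 99.86,
    "2022-02-09": 99.78,
    "2022-02-10": 99.94,
    "2022-02-11": 99.86,
    "2022-02-12": 99.92,
    "2022-02-13": 99.71,
    "2022-02-14": 99.99,
}

for day, percent in original_cluster.items():
    print(f"{day}: {percent}% is about {downtime_minutes(percent)} min of downtime")
```

Under this assumption, the lowest figure in the timeline (99.71% on 2022-02-13) works out to roughly 4.2 minutes of downtime for that day. If availability was instead measured per request or averaged across instances, the conversion does not apply directly.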
Report: "503 Errors"
Last updateThought Industries experienced a brief period of 503 errors which impacted our customers' ability to access their sites. A fix for this issue was released by our Infrastructure team and the issue is now considered resolved. More information will be made available over the coming days.
Report: "Intermittent Site Slowness"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating slowness across one of our servers which may be affecting a small subset of customers. We apologize for any inconvenience this may cause, and will continue to provide updates as soon as we have them.
Report: "Rustici (SCORM)"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We're investigating an issue with Rustici (SCORM) that has impacted some of our customers' ability to view SCORM content. We will provide updates as soon as we have them.
Report: "Wistia outage impacting video uploads"
Last updateWe have resolved the issue that was impacting video uploads and can confirm full functionality has now been restored.
We've identified the source of this issue as an error with our video hosting provider Wistia. We are monitoring their status page (https://status.wistia.com/) and we hope to have restored functionality soon.
Report: "Reporting: Issues with loading Reports"
Last updateThis incident has been resolved.
Here is an update from Amazon Web Services (AWS) related to our previously reported issue with reporting data not loading. AWS Update: We continue to work towards full recovery of Redshift clusters in the USE1-AZ4 Availability Zone. Complete recovery will likely be reliant on the full recovery for the EC2 / EBS issue being tracked on the Service Health Dashboard located here: https://status.aws.amazon.com/ We will continue to provide updates as soon as we have them.
We have received reports of an issue with reporting data not loading. This is the result of a current issue with our upstream provider Amazon Web Services, which we are monitoring for updates. We will provide updates as soon as we have them.
Report: "Intermittent Site Slowness"
Last updateThis incident has been resolved.
A fix has been implemented for the issue regarding site slowness and we're monitoring the results.
We're investigating some reports around issues with site slowness. We are currently investigating and will provide updates as soon as we have them.
Report: "Intermittent Site Slowness"
Last updateWe have resolved the issue regarding site slowness and can confirm full functionality has now been restored.
We're investigating some reports around issues with site slowness for a small number of our customers. We are currently investigating and will provide updates as soon as we have them.
Report: "Intermittent Site Slowness"
Last updateWe have resolved the issue regarding site slowness and can confirm full functionality has now been restored.
A fix has been implemented and we are now monitoring the results.
We're investigating some reports around issues with site slowness for a small number of our customers. We are currently investigating and will provide updates as soon as we have them.