Historical record of incidents for Parade
Report: "Platform maintenance for Parade for Brokers"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We will be undergoing scheduled maintenance at this time. We expect around 30 minutes of downtime as we transition workloads. During this window, workloads running in the background, such as integrations, matching, and email automation, may not run in real-time. However, no data will be lost, and tasks will be processed after the maintenance window. We will be conducting this maintenance after business hours to minimize any disruptions. The window will start at 10 PM EST.
Report: "Platform Maintenance for Parade for Carriers"
Last update: The scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Parade for Carriers will be undergoing scheduled maintenance. We expect around 20 minutes of downtime, during which the Parade for Carriers portal will not be operational. Logging in, searching for loads, and viewing freight may be impacted. We will be conducting this maintenance late at night, outside business hours, to minimize business impact. The maintenance will begin at 10 PM EST.
Report: "Parade Help Center Down"
Last update: Our Help Center is back online and active.
We have identified the cause of the Help Center outage and are actively working with the responsible team to fix it. In the meantime, please reach out to support@parade.ai directly if you need assistance.
Report: "Load Processing Degradation"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Queues have stabilized, and the issue should now only intermittently affect load synchronization between Parade and TMSes. Be aware that there may be spikes where loads take longer to update in Parade.
Overview: Our load integration payload processing is currently experiencing performance degradation, resulting in slower-than-usual processing speeds at certain moments during the day. This issue appears to stem from inefficiencies in the processing pipeline, causing delays despite the overall queue size remaining within normal limits. Initial analysis suggests that a recent change in processing logic introduced unintended overhead, contributing to the degradation.
Report: "Load Processing Degradation"
Last updateOverview: Our load integration payload processing is currently experiencing performance degradation, resulting in slower-than-usual processing speeds at certain moments during the day. This issue appears to stem from inefficiencies in the processing pipeline, causing delays despite the overall queue size remaining within normal limits. Initial analysis suggests that a recent change in processing logic introduced unintended overhead, contributing to the degradation.
Report: "Delays in processing load integration payloads"
Last update:
## Overview
Our load integration payload processing queue experienced an unusually high influx of messages, leading to a significant backlog and delays in processing. This was caused by a large volume of payloads from a newly onboarded customer combined with a legacy reprocessing job that unintentionally re-published unprocessed messages, creating duplicate entries in the queue and quickly overwhelming the system.

## Incident Timeline
* 11/05 1:26 AM UTC: The load integration payload processing queue started receiving messages at an exceptionally high rate.
* 11/05 1:46 AM UTC: The queue accumulated over 5,000 messages, triggering an alert.
* 11/05 1:13 PM UTC: An investigation was underway to identify the root cause.
* 11/05 1:39 PM UTC: Additional processing capacity was added, but acknowledgment rates remained insufficient compared to publish rates.
* 11/05 1:49 PM UTC: The legacy reprocessing job was stopped, reducing the message duplication rate.
* 11/05 7:48 PM UTC: The backlog was fully processed.
* 11/05 11:02 PM UTC: A permanent fix was implemented, preventing the legacy reprocessing job from re-publishing messages.

## Root Cause
The issue resulted from two main factors:
1. A large volume of integration payloads was sent by a new customer, creating an unexpected load.
2. A legacy reprocessing job, configured to re-publish messages periodically, inadvertently duplicated messages faster than they could be acknowledged, overwhelming the queue.

## Resolution and Recovery Steps
To resolve the issue, the team scaled up the number of queue consumers and disabled the reprocessing job that was flooding the queue with duplicate unprocessed messages. The reprocessing job was deemed unnecessary due to the presence of a dead-letter queue, which ensures resilience against message processing failures. With this change, the system is now better equipped to handle large spikes in message volume without overloading consumers, improving overall stability during high-demand periods.
Service Disruption Report: Delays in processing load integration payloads sent to Parade
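The resolution above notes that a dead-letter queue makes the legacy reprocessing job unnecessary. As a purely illustrative sketch (not Parade's actual configuration), the following shows how a RabbitMQ work queue could be declared with a dead-letter exchange using the Python pika client, assuming a RabbitMQ-backed queue (CloudAMQP, a hosted RabbitMQ service, is referenced elsewhere in this history); the queue names, connection URL, and handler are hypothetical.

```python
import pika

# Hypothetical names and URL; adjust for a real broker.
AMQP_URL = "amqp://guest:guest@localhost:5672/%2F"

connection = pika.BlockingConnection(pika.URLParameters(AMQP_URL))
channel = connection.channel()

# Dead-letter exchange and queue: messages that fail processing are routed
# here for later inspection, instead of being re-published into the main
# queue by a separate reprocessing job.
channel.exchange_declare(exchange="load-payloads.dlx", exchange_type="fanout", durable=True)
channel.queue_declare(queue="load-payloads.dead-letter", durable=True)
channel.queue_bind(queue="load-payloads.dead-letter", exchange="load-payloads.dlx")

# Main work queue, configured to dead-letter rejected messages automatically.
channel.queue_declare(
    queue="load-payloads",
    durable=True,
    arguments={"x-dead-letter-exchange": "load-payloads.dlx"},
)

def process_payload(body: bytes) -> None:
    """Placeholder for real payload handling."""
    print("processing", body[:50])

def handle(ch, method, properties, body):
    try:
        process_payload(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=False sends the message to the dead-letter exchange once,
        # instead of looping it back into the main queue.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

channel.basic_qos(prefetch_count=50)  # bound unacknowledged messages per consumer
channel.basic_consume(queue="load-payloads", on_message_callback=handle)
channel.start_consuming()
```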
Report: "Errors when posting to DAT and Truckstop"
Last update: We have resolved the issues causing the errors in posting to DAT and Truckstop. Backfills have been started for the affected customers. All load postings should be fully operational going forward.
A fix has been deployed and we are monitoring the results. We are also going to run a backfill to fix unposted loads for affected customers over the next hour.
We are currently investigating an issue that is affecting all load postings to DAT and Truckstop. This incident has been ongoing all morning, and is preventing loads from being posted and updated to DAT and Truckstop loadboards. We have identified the root cause and are currently working on releasing a hotfix.
Report: "DAT Downtime"
Last update: DAT is now undergoing maintenance. They are still down but are expected to come back up after the maintenance window.
DAT is currently experiencing a total outage. As a result, all of our services that interact with DAT are not functioning. We are in touch with them to see when they can resume service.
Report: "Delay in Background Processing Tasks"
Last update: Service Disruption Report: Some Background Tasks Not Starting or Delayed

Overview: We encountered a service disruption that affected the processing of certain tasks in some of our background queues.

Incident Timeline: The disruption began shortly after the release of version 1.92.1 at 05:05 UTC on May 8th, affecting only a subset of queues. By 17:57 UTC, the customer support team reported issues with load updates not being processed. At 19:06 UTC, a rapid fix was deployed by removing a few queue restrictions, which restored functionality but led to high latency and database locking issues due to increased message throughput. Further optimizations were made, including scaling down low-priority tasks and implementing a new solution, which began showing significant improvement by 22:30 UTC.

Root Cause: The issue was linked to our transition to a new backend framework, which was incompatible with a library we use for queues. This incompatibility prevented tasks from being published, as they were erroneously locked in the system.

Resolution and Recovery Steps: Immediate action was taken to resolve the publishing issue. A simpler, custom-built interim mechanism was implemented as part of this resolution. A more sustainable solution was developed and validated in our staging environment, ensuring full compatibility and functionality going forward.

Moving Forward: We will be testing the sustainable solution in staging before moving it to production in the next 2 days. We are committed to maintaining stable and efficient operations. Measures are being taken to enhance our testing environments to better replicate production conditions and ensure compatibility for all updates. We apologize for any inconvenience caused and appreciate your patience and understanding. For further assistance or inquiries, please contact our support team.
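The report above mentions a simpler, custom-built interim publishing mechanism but does not describe it. As a purely illustrative sketch under the assumption of a RabbitMQ-style broker (this is not a description of Parade's actual fix), background tasks could be published directly to the broker with the Python pika client, bypassing a task library that is unable to publish; the queue name, URL, and payload shape are hypothetical.

```python
import json
import pika

# Hypothetical broker URL and queue name, for illustration only.
AMQP_URL = "amqp://guest:guest@localhost:5672/%2F"
QUEUE = "background-tasks"

def publish_task(task_name: str, payload: dict) -> None:
    """Publish a task message directly to the broker."""
    connection = pika.BlockingConnection(pika.URLParameters(AMQP_URL))
    try:
        channel = connection.channel()
        channel.queue_declare(queue=QUEUE, durable=True)
        channel.basic_publish(
            exchange="",  # default exchange routes by queue name
            routing_key=QUEUE,
            body=json.dumps({"task": task_name, "payload": payload}),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        )
    finally:
        connection.close()

if __name__ == "__main__":
    publish_task("sync_load_updates", {"load_id": "L-1"})
```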
Report: "Intermittent Downtime across Platform"
Last update: **Service Outage Report: Internal APIs**

**Overview:** We recently experienced a service disruption affecting our internal APIs immediately following the deployment of a recent update. This incident led to the API intermittently returning a 503 Service Unavailable error, impacting services reliant on this API, such as P4C and integration services.

**Incident Timeline:**
* The issue commenced with the deployment of a problematic query in version 1.92.1 at 05:05 UTC.
* At 12:06 UTC, the issue was identified due to carrier pages not loading, with the problem officially communicated by the customer experience team.
* Initial attempts to resolve the issue by rolling back to a previous version were unsuccessful.
* By 14:12 UTC, temporary traffic rerouting restored functionality for most services.
* A permanent solution was implemented and rolled out by 17:20 UTC with the deployment of version 1.92.4.

**Root Cause:** The root cause was identified as a performance issue with a SQL query that was part of an upgrade involving our backend framework and packages. The upgrade inadvertently introduced a change in how email recipients are managed, which significantly increased database query times in our production environment.

**Resolution and Recovery Steps:**
* Traffic was temporarily rerouted to alternative deployments to isolate the problematic component and restore service functionality.
* A permanent fix was developed, optimizing the SQL queries involved, and successfully deployed without further incident.

**Moving Forward:** We have taken steps to prevent similar incidents by enhancing our testing protocols to better simulate real-world loads in our staging environment. We appreciate your understanding and apologize for any inconvenience caused. If you have further questions or need assistance, please contact our support team.
This incident has been resolved.
We are currently investigating an issue that is affecting the whole platform. The incident started around 10 PM PST yesterday and caused downtime to internal and external integrations. We are still investigating slowdowns across the platform that are causing delays in integrations and degraded portal performance.
Report: "Broker Portal Authentication Issues"
Last update: We have resolved the incident, and the broker portal is fully functional again. Our team has made some infrastructure changes to avoid the downtimes caused by our service providers. We will be taking additional measures later this week to permanently switch infrastructure providers so that this does not happen again.
We are still actively resolving the issue, and a solution is being prepared to move our application deployment to another infrastructure provider. The original downtime with our service provider is also related to a larger outage across AWS (https://health.aws.amazon.com/health/status), which we have added to our monitoring.
We are currently investigating an issue affecting all customers. The broker portal is currently down for most users due to an outage at one of our service providers. We are carefully monitoring their status here: https://www.vercel-status.com/ and also working on alternate plans to bring the application back up.
Report: "Delay in API and CSV updates from the TMS"
Last update: We have monitored this for the last 2 hours and no further delays have occurred. This incident has been resolved.
We experienced a slowdown in processing both API and CSV updates from the TMS, starting around 7:08 AM PST. We have already implemented a fix for API integrations, which was put into production around 9:00 AM PST. We have been monitoring for over an hour, and we are processing updates in real-time. For CSV integrations, we have already implemented a series of optimizations that have brought the processing delay to under 20 minutes, but have not fully resolved the issue. We are implementing an additional fix at 10:30 AM PST that should bring updates back to a non-delayed state. No data was lost during this time period, and API integrations are already in a consistent state. We will continue to monitor CSV integrations and provide updates as needed. As part of the investigation, we have also identified further improvements to our processing pipeline for both API and CSV integrations. These improvements will be implemented in the near future.
Report: "Problems processing McLeod DFM carrier updates"
Last update: We have identified and resolved the issue with our McLeod DFM carrier sync. We had an old weekly carrier job that runs every Saturday night. This old job was not updated correctly in our most recent deployment and was causing elevated error reporting in our system. The impact of this change was not as significant as initially reported. Real-time carrier updates were not affected, and only this weekly job was not run correctly. This issue has now been resolved. Real-time and weekly carrier updates should all be operating as expected. We have restarted the weekly job to process any missing data from the last 24 hours.
We are currently investigating an issue where we are seeing errors processing real-time McLeod DFM carrier updates. Not all customers are affected. Real-time load updates also do not seem to be affected. The issue began this morning, and we will continue to provide updates as we work towards resolving it.
Report: "CSV Integration Processing Delay"
Last update: Issue Summary
We had a major delay in processing CSV files for Available Load integrations with some customers. This only affected customers on a CSV load integration, and not all customers using the integration were affected. Customers sending over larger files were more likely to be affected.

Timeline
We first detected a slowdown in CSV file processing with one of our customers on 1/13/2023. Over the weekend the issue got worse, and the majority of Available Load CSVs were not processing on Monday, 1/16/2023. We resolved the issue on the night of 1/19/2023 with a hotfix deployment.

Root Cause
We discovered that the root cause of the issue was a bugfix that was deployed on the night of 1/12/2023. This bugfix improved the consistency and timing of loadboard postings after a load was made re-available over our CSV load integration. However, what we failed to recognize was that the code change resulted in higher memory usage. This increase caused our application to exceed the memory threshold allocated to our provisioned computing resources. Out of Memory errors were more common for customers with larger files. This resulted in files being partially processed before getting interrupted due to memory constraints, and therefore customers saw a delay in load updates coming into Parade.

Resolution and recovery
On 1/13/2023, only one customer was affected and a support ticket was raised to our team. When more customers were affected on the morning of 1/16/2023, the ticket was immediately re-prioritized to P0. Some optimizations were deployed the night of 1/16/2023, but did not consistently solve the problem. From 1/17/2023 to 1/18/2023, our team continued to monitor processing times and noticed that larger CSV files were still seeing major delays in processing. Some small optimizations were implemented that benefited a few customers, but not all. The root cause was identified and tested on 1/19/2023, and deployed that night. This resulted in all customer data being updated successfully. Since CSV files are snapshots of customer load data, no load data was lost.

Corrective and Preventative Measures
We are working on better preventative measures and monitoring for resource-constraint issues. This includes re-evaluating CPU and memory thresholds for our integration pipeline. We have also increased the overall memory allocation for crucial parts of our platform as a preventative measure.
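The root cause above was memory exhaustion when large CSV files were processed in full. As a purely illustrative sketch (the report does not describe Parade's actual pipeline), a large Available Load CSV can be read in fixed-size chunks so memory stays bounded regardless of file size; the file path, chunk size, and processing stub below are hypothetical.

```python
import pandas as pd

# Hypothetical path and chunk size, for illustration only.
CSV_PATH = "available_loads.csv"
CHUNK_SIZE = 5_000  # rows held in memory at any one time

def process_chunk(chunk: pd.DataFrame) -> int:
    """Placeholder for per-row load update logic; returns rows handled."""
    return len(chunk)

def process_csv(path: str) -> int:
    total = 0
    # chunksize makes read_csv return an iterator of DataFrames instead of
    # loading the whole file at once, avoiding Out of Memory errors on
    # large customer files.
    for chunk in pd.read_csv(path, chunksize=CHUNK_SIZE):
        total += process_chunk(chunk)
    return total

if __name__ == "__main__":
    print(f"processed {process_csv(CSV_PATH)} rows")
```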
Report: "Network Latency"
Last update: This incident has been resolved as of 9:25 AM PST. Our cloud platform provider (GCP) has notified us that the network issues should no longer be occurring. We have verified that all products have recovered. There is a minimal data backlog, and we will be closely monitoring data pipelines in the next few hours. A full incident post-mortem will be published soon.
We are currently investigating a network latency issue. This is caused by downtime at one of our cloud providers, specifically Google Cloud Platform, which is affecting their global customers. Here is a summary of what is being investigated on their end:
https://status.cloud.google.com/incidents/FRpfyvgG3MUeSTTuX1Dx
https://status.cloudamqp.com/
During this time, the following services are affected:
- All Parade applications will see slowdowns in terms of network requests
- Parade data processing will be delayed (this includes load and carrier data processing from the TMS)
According to their site, this problem is intermittent, and seems to be causing intermittent timeouts and network packet loss. Pending their investigation, we are planning workarounds on our end to bring our cluster back to full health.
Report: "DAT Rateview credentials not used"
Last update: Issue Summary
We had a change in our system that caused customer DAT Rateview credentials to be incorrectly marked as invalid, and therefore not used when fetching updated Rateview data.

Timeline
We first noticed this issue when one of our automated alerts notified our support team that a majority of customer Rateview credentials were marked as invalid. This happened around 7 AM EST on 11/1/2022. Support took a dual-pronged approach, alerting both Parade's internal team and the DAT team to verify whether these Rateview credentials were actually failing. After internal investigation, we identified that a deployment the night before, on 10/31/2022, had added a check to mark DAT Rateview credentials as invalid.

Root Cause
This additional check, deployed the night before, caused most of our customers' credentials to be identified as "invalid" and therefore not usable for the integration with DAT Rateview. The check was too broad and was preventing valid Rateview credentials from being used.

Resolution and recovery
The Parade team reverted the breaking change around 10:09 AM PST on 11/1/2022. This resolved the issue, and we also re-enabled all Rateview credentials around the same time. Any new or updated loads after that time would have had their Rateview rates automatically re-fetched. All Rateview rates on prior available loads within Parade were refreshed as part of the daily sync around 4 PM PST. No customer data was lost during this time period.

Corrective and Preventative Measures
Parade's internal alerting mechanisms worked as intended in this scenario. Our internal support and engineering teams were notified promptly about the Rateview inconsistencies. We will continue monitoring our internal alerts around Rateview going forward and make this a standard part of our customer health checks. We have put in a ticket for the engineering team to implement a longer-term solution to identify invalid Rateview credentials, as we still do not believe we have a great way to identify these after initial onboarding. It is a common problem for customers to change their credentials or lose access to subscriptions. The Parade team will continue working with DAT to identify these bad credentials going forward so that customers do not experience any additional interruptions to their Parade - Rateview integration.
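The post-mortem above notes that the invalidation check was too broad and that a better long-term way to identify bad Rateview credentials is still needed. As a purely illustrative sketch (not Parade's actual logic), one narrower approach is to mark a credential invalid only after several consecutive explicit authentication failures while ignoring transient errors; the class, field names, and threshold are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold and status set, for illustration only.
AUTH_FAILURE_STATUSES = {401, 403}   # explicit authentication failures
INVALIDATION_THRESHOLD = 3           # consecutive failures before flagging

@dataclass
class CredentialState:
    consecutive_auth_failures: int = 0
    is_valid: bool = True

def record_response(state: CredentialState, status_code: int) -> CredentialState:
    """Update credential validity based on one API response."""
    if status_code in AUTH_FAILURE_STATUSES:
        state.consecutive_auth_failures += 1
        if state.consecutive_auth_failures >= INVALIDATION_THRESHOLD:
            state.is_valid = False  # flag for support follow-up with DAT
    elif 200 <= status_code < 300:
        # Any successful call resets the counter and keeps the credential usable.
        state.consecutive_auth_failures = 0
        state.is_valid = True
    # Timeouts and 5xx responses are deliberately ignored here: they say
    # nothing about whether the credential itself is valid.
    return state
```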
Report: "Parade for Carriers Login Issues"
Last update: Issue Summary
Carriers were unable to log in to Parade for Carriers due to a bad deployment of new database credentials. This issue did not affect any broker-related portals. Carriers were still able to log in to Parade when coming in from an email link. The only login portal affected was the one hosted on carriers.parade.ai.

Timeline
This issue was first identified at 7 AM EST on 10/31/2022. After investigation by the Parade team, we restored the database connection around 10:24 AM PST on 10/31/2022.

Root Cause
The issue was caused by a bad deployment that unhooked our Parade for Carriers app from its supporting database. This caused any APIs that relied on the database to fail for the carrier app.

Resolution and recovery
Resolution was achieved by redeploying our Parade for Carriers backend APIs with a new set of database connection credentials. The platform recovered soon after the redeploy, at 10:24 AM PST. No customer data loss occurred, but it did take our platform ~45 minutes to replay all events going to the carrier portal.

Corrective and Preventative Measures
We have added additional monitoring and alerting specifically for the Parade for Carriers app. We have also provisioned new database secrets for our deployments so that future deployments will no longer have a database connection mix-up. As a side effect, this also adds additional security to our platform.
Report: "Intermittent UI issues in broker portal"
Last update: Issue Summary
We were seeing some pages in the broker portal not loading intermittently for certain users. This primarily affected users who logged in after 4:30 AM PST on October 28, 2022. Some users with sessions cached before that time were still able to use the portal. Affected pages included every tab in the broker portal except for the Loads and Carriers tabs. Users would have experienced the tabs not loading at all, or not being clickable. This issue was intermittent, depending on whether or not third-party libraries were able to load at the time a user visited the portal, since the issue stemmed from a failed third-party CDN.

Timeline
The issue began when unpkg.com, a popular Javascript Content Delivery Network, started experiencing issues around 4:30 AM PST on October 28, 2022. The Parade team was able to reproduce the issue at 6:28 AM PST and identify the root cause. Full resolution was achieved at 8:16 AM PST on October 28, 2022. Users may need to do a hard refresh to get access to the new production fixes if they were already logged in and were part of the group that experienced these issues.

Root Cause
Parade was leveraging unpkg.com to manage a few external Javascript libraries. During the morning, unpkg.com experienced issues where certain Javascript packages would not resolve intermittently. This had unexpected consequences on our production site, which needs those libraries to operate certain pages. Those pages would therefore not load if we were unable to retrieve the libraries from unpkg.com. Technical details of the third-party CDN failure can be found here: https://github.com/mjackson/unpkg/issues/343 This issue affected production sites across many different companies that leveraged unpkg.com as a library management tool.

Resolution and recovery
The team identified the few libraries that we were using through unpkg.com. We have since switched these libraries to a different CDN network. After this replacement, and a new production hotfix deployment, the team tested and verified that all accounts were able to access the full portal.

Corrective and Preventative Measures
We have verified that we no longer use ANY libraries through unpkg.com, and we will continue to ensure that we do not use unpkg.com in the future. We have also verified that the new version of the broker portal does not contain any usages of unpkg.com. The long-term plan is for us to self-host certain libraries that are crucial for our application to run, and to use a different production-ready CDN when needed.
Report: "Web portal slowness"
Last update: Issue Summary
We had a brief slowdown of our broker web portal during a short period between 6:57 AM and 7:15 AM PST. End users of the portal would have seen a never-ending loading screen, or the inability to add or modify data during this time period. Our statuspage was also updated with the proper components affected during this time.

Timeline
This issue was first reported by our monitoring services at 6:57 AM PST, and automatically escalated to both our support and on-call engineering teams. The issue resolved itself, with no corrective action necessary, at 7:15 AM PST.

Root Cause
This was due to an abnormally large number of web requests taking more than 5 seconds to complete, which triggered our monitoring alarms. We have identified a few requests that were taking slightly longer than usual and blocking resources from handling other requests to our APIs.

Resolution and recovery
No corrective action was needed by our team; once the requests finished, the web portal APIs returned to normal functionality. Recovery occurred at 7:15 AM PST, with no data being lost.

Corrective and Preventative Measures
We are tackling preventative measures from two directions, as this issue appears to be a combination of:
1. Surges in traffic: we are working on better tuning of our API autoscaling to make sure it aligns with the traffic that we see throughout the day.
2. Slow APIs that affect the system as a whole: we have identified a few APIs to speed up, so that as we scale, these do not become a growing issue.
Report: "Portal and Load Processing slowdown"
Last update: Issue Summary
We experienced a system-wide slowdown due to a long-running database operation that ended up locking most of our production tables.

Timeline
The issue started on Aug 23 at 9:02 am PST, and we reached full recovery at 10:18 am PST.

Root Cause
The issue originated from a long-running database query that had lasted for more than 24 hours. The query resulted from a data cleanup job run on behalf of a customer. This cleanup job started running the day prior and did not finish. The last step of the query caused a database lock on many of the key tables that our application uses in our production database.

Resolution and recovery
Terminating the database query at 10:08 am PST freed up the database locks. This allowed our system to recover immediately: the broker portal issues were resolved at that time, and the backlog of load updates affected by the downtime was synced back to real-time within the hour.

Corrective and Preventative Measures
We are forbidding our team from running the same data deletion query on our database in the future. We have also designed an alternate approach to getting data out of our production database that no longer requires long-running database queries. As a result of the two measures above, the database operation in question should never be executed again. We have also implemented a company policy to no longer run long-running jobs that affect multiple tables overnight, as it is unpredictable when they may cause lockups.
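The measures above avoid one long-running deletion that holds locks for hours. As a purely illustrative sketch (the report does not detail the alternate approach Parade adopted), a large cleanup can be split into small batches so each transaction commits quickly and locks are released between batches; the connection string, table, and column names are hypothetical.

```python
import time
import psycopg2

# Hypothetical connection string, table, and column names.
DSN = "postgresql://user:pass@localhost:5432/app"
BATCH_SIZE = 5_000

def cleanup_in_batches(customer_id: str) -> int:
    """Delete rows in small committed batches instead of one giant DELETE."""
    conn = psycopg2.connect(DSN)
    deleted_total = 0
    try:
        while True:
            with conn.cursor() as cur:
                cur.execute(
                    """
                    DELETE FROM loads_archive
                    WHERE id IN (
                        SELECT id FROM loads_archive
                        WHERE customer_id = %s
                        LIMIT %s
                    )
                    """,
                    (customer_id, BATCH_SIZE),
                )
                deleted = cur.rowcount
            conn.commit()       # release row locks after every batch
            deleted_total += deleted
            if deleted < BATCH_SIZE:
                break           # nothing (or little) left to delete
            time.sleep(0.5)     # give other transactions room to run
    finally:
        conn.close()
    return deleted_total
```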
Report: "McLeod DFM Load Processing Delays"
Last update: Issue Summary
We experienced a major slowdown in how we process loads from McLeod DFM TMS integrations. No data loss occurred, but we were delayed in updating loads in the Parade system.

Timeline
The issue was first identified on Aug 25th at 4:54 AM PST. That morning, one of our standard integration health alerts notified the team that thousands of load updates were not being processed across all of our DFM customers. StatusPage was also updated with a Degraded Performance tag on the DFM Load Processing category at this time. The issue was identified and a fix was put into place at 10:05 AM PST.

Root Cause
The issue was an infrastructure issue related to our DFM middleware deployment. We saw our workers hitting CPU limits, as well as jobs reaching Out of Memory errors. This was a rather difficult issue to catch, as it was not due to any deployment or code change. Our systems had slowly been using more and more CPU and memory as we scaled, and on this day we hit our limits.

Resolution and recovery
Resolution was achieved by restarting the deployed infrastructure for our DFM middleware layer. This allowed us to start processing updates again in a timely manner. This was done at 10:05 AM PST. At that time, we still had thousands of unprocessed updates. Our system caught up with all pending updates at 10:54 AM PST. No data loss occurred, as all of the updates were stored in our database and could be replayed.

Corrective and Preventative Measures
We have implemented better logging and monitoring around CPU and memory usage of our DFM infrastructure to catch these infrastructure issues earlier in the future. We have also increased the resources our DFM middleware hardware is able to use: CPU usage limits have been doubled, and memory usage limits were increased by 50%. This should benefit all DFM integrations going forward, and we will continue to monitor whether these limits need to be adjusted on a monthly basis. Some lessons were also learned about internal downtime alerting. We have revisited our downtime alerting process to establish better communication between our routine health checks and the support/engineering teams.
Report: "McLeod DFM Load processing delays"
Last update: We have fully recovered from the DFM backlog. A post-mortem report will be shared shortly describing the incident, how resolution was achieved, and follow-up items to ensure this does not happen again.
We have identified the cause of the issue. We are replaying the load updates for all customers, and are almost fully caught up except for a handful of accounts. Engineering will keep monitoring to make sure the sync continues to operate successfully.
We are currently investigating issues with McLeod DFM Load processing delays that are happening this morning.
Report: "Parade for Carriers Intermittent Loading Issues"
Last update: This incident has been resolved.
We have implemented a few fixes for the API slowdowns and will be monitoring results over the next few business days.
We have identified an issue that is causing Parade for Carriers (carriers.parade.ai) to not load for some carriers. This is primarily related to a few internal backend APIs that are slow. Another issue that we are also actively resolving affects Parade for Carriers for customers whose Whitelabel feature was implemented in the last month (it does not affect older Whitelabel implementations). Because of a new URL routing method that we have implemented for the new Whitelabel implementations, the app can get stuck on a blank page. We expect resolution for both of these issues within the next business day. We will update affected customers as we reach resolution.
Report: "Load API Integration Downtime"
Last update: Issue Summary
Due to a major release on 7/12/2022, we introduced a breaking change that caused all Load API integrations to intermittently fail. This issue affected all customers on API integrations, which includes many of our direct TMS integrations:
- Aljex
- McLeod
- Tai
- Turvo
- Revenova
- FMS TMS
- EZLoader TMS
- 3PL Systems
- Any homegrown TMSs that are API integrated
CSV-integrated customers were not affected.

Timeline
The issue was identified at 6:00 AM PDT. Escalation and identification of the fix were handled promptly by our engineering team, and the eventual resolution was introduced into our production environment at 8:25 AM PDT.

Root Cause
API integrations failed because of a new data field we had introduced for a specific TMS integration. This field was intended to be a completely optional new field, with no backwards-breaking changes. However, the code deployed in the release made the field required, which broke many existing API integrations.

Resolution and recovery
The cause of the downtime was quickly identified because of our clear release logging. The eventual fix required a code change, which was introduced into our staging environment for testing before it was moved into our production environment at 8:25 AM PDT. We have daily syncs for most API-integrated customers/TMSs to make sure we are in a proper load state by the end of the day. Some API customers may need to trigger new updates on their loads from the TMS end to recover from any missing updates. Customer Success will reach out to customers on a case-by-case basis if additional updates are needed from the TMS.

Corrective and Preventative Measures
The main cause of this issue was a lack of QA for the integration-related change that caused the backwards incompatibility. We are making changes to our QA process and our staging environment monitoring to make sure that any critical integration changes are properly tested. We are also building regression tests to make sure we have no backwards-incompatible changes. Some of these improvements have already been implemented, but we expect full testing and QA pipeline improvements to come in the next few weeks.
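The root cause above was a new payload field that was meant to be optional but was made required in code. As a purely illustrative sketch (not Parade's actual code), the following shows the kind of regression test the preventative measures describe: a newly added field defaults to None so legacy payloads that never send it still parse; the field and payload names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoadPayload:
    load_id: str
    origin: str
    destination: str
    equipment_notes: Optional[str] = None  # new field: must stay optional

def parse_load_payload(data: dict) -> LoadPayload:
    # Required fields raise KeyError if missing; the new field is read with
    # .get() so legacy payloads without it still parse successfully.
    return LoadPayload(
        load_id=data["load_id"],
        origin=data["origin"],
        destination=data["destination"],
        equipment_notes=data.get("equipment_notes"),
    )

def test_legacy_payload_without_new_field_still_parses():
    legacy = {"load_id": "L-1", "origin": "Chicago, IL", "destination": "Dallas, TX"}
    payload = parse_load_payload(legacy)
    assert payload.equipment_notes is None
```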
Report: "Broker Portal Slowdown"
Last update: Issue Summary
We encountered a major system slowdown on July 2, 2019 which primarily affected the broker web portal.

Timeline
Initial slowdowns related to the incident started at 4:32 AM PDT, and our support team escalated the issue at 6:50 AM PDT. The issue was eventually fully resolved at 2:03 PM PDT.

Root Cause
The cause of this incident was an issue related to our primary database, which most of our microservices connect to. We encountered a surge of database queries related to a weekly reporting job. This job caused a substantial increase in the number of expensive queries happening in our database, which caused the initial slowdown that our team was alerted to. As a result of the spike in traffic, the database went into recovery mode at 7:34 AM PDT, which resulted in a more severe downtime, as the database had to do a full restart.

Resolution and recovery
All systems were brought back online at 2:03 PM PDT.

Corrective and Preventative Measures
We have added additional alerting around top-level database metrics. We are also reworking how our support and on-call engineering teams are notified about deeper technical issues, to make sure we are able to respond to system-level downtimes faster. The weekly reporting job that had generated some of these expensive queries was identified as a legacy feature and has been deprecated and removed. As a result of removing this job, our engineering team has also gone through all daily and weekly cronjobs and cleaned up many expensive recurring jobs, as a preventative measure to make sure these background jobs do not cause database issues in the future.
Report: "DAT Posting authentication errors"
Last update: We have resolved the authentication issues with DAT. All posting accounts should now be unblocked, and our support team is running a backfill to repost any posting updates that we may have missed during the incident.
We are currently investigating issues with DAT postings not happening. The DAT authentication service seems to be down on the DAT side, and we are actively working with their team to resolve this and unblock accounts.
Report: "Parade Internal API slowdown"
Last update: We are seeing a slowdown of internal APIs across our system. Integrations, as well as users of the broker portal and carrier portal, may be affected by intermittent slowdowns.
Report: "API Issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results
We are continuing to investigate this issue.
We are currently investigating API downtime affecting our integrations and the broker and carrier portals.