Historical record of incidents for BenchPrep
Report: "Degraded Perfromance"
Last updateThis incident has been resolved.
A fix has been implemented, we have confirmed the applications are operational and we are monitoring current performance.
We're continuing to investigate the issue and are currently testing a potential change to improve connectivity.
We are actively investigating an issue with degraded performance across our applications.
Report: "Degraded Perfromance"
Last updateThis incident has been resolved.
A fix has been implemented, we have confirmed the applications are operational and we are monitoring current performance.
We're continuing to investigate the issue and are currently testing a potential change to improve connectivity.
We are actively investigating an issue with degraded performance across our applications.
Report: "Release Maintenance"
Last update: The scheduled maintenance has been completed.
We are continuing to verify the maintenance items.
Verification is currently underway for the maintenance items.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We are planning a short maintenance window. While we expect the performance impact to be minimal for most of our customers, some customers may experience degraded performance or downtime for up to 15 minutes.
Report: "Degraded Performance - Boost Dashboard and Institution Admin"
Last update: This issue has been resolved.
A fix has been implemented and we are monitoring the results.
We are actively investigating an issue causing the Boost Dashboard and Institution Admin to display incorrectly. The Learning Application is not affected.
Report: "Downgraded Performance - Boost Dashboard"
Last update: This incident has been resolved.
The dashboards have been successfully updated, and we are actively monitoring the ongoing update.
We are continuing to monitor the ongoing update and investigating the root cause of the issue.
We are investigating the recurrence of this issue while closely monitoring an ongoing update. We will share further updates.
The dashboards have been successfully updated, and we are actively monitoring the ongoing update.
We are actively investigating this issue while closely monitoring an update that is currently in progress.
We are currently investigating an issue with our Boost Dashboard application. The issue does not impact any learner facing applications or other admin applications.
Report: "Downgraded Performance - Reporting Dashboards in Console"
Last update: This incident has been resolved.
The vendor issue was resolved at 19:50 CT. We are monitoring and working with the vendor to obtain an explanation of the issue and its resolution.
We are currently investigating an issue with Console Analytics dashboards not displaying any data. Our team is actively working with a third-party vendor to resolve. This issue is isolated to the Console Analytics section and does not impact the Learning Application.
Report: "Downgraded Performance - Boost Dashboard"
Last update: This incident has been resolved.
We have confirmed the Boost application was successfully updated, and we are monitoring the results.
We are currently investigating an issue with our Boost Dashboard application. The issue does not impact any learner facing applications or other admin applications.
Report: "Degraded Performance"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an intermittent degraded performance of the Learning Application and Admin Tools due to issues within one of our data centers. We are replacing the impacted node and expect services to be fully restored shortly.
Report: "Console - embedded dashboard reporting"
Last update: This issue has been resolved.
You might experience a slight disruption in Console's embedded dashboard reporting. We will provide an update on the issue. This does not affect learner applications.
Report: "Downgraded Performance - Learning Application - Interactions"
Last update: A fix has been implemented and the issue has been resolved.
The issue has been identified and we are actively working on a solution.
We are currently investigating an issue with Interactions not launching successfully within the Learning Application.
Report: "Downgraded Performance - Boost Dashboard"
Last update: This incident has been resolved.
We've applied a solution and are monitoring the ongoing progress; the completion of the process is expected to take a few hours. We will provide further updates.
We are actively implementing a solution to fix the issue. This process will take time, and we will provide further status updates.
We are continuing our investigation and are actively working on a solution to resolve the issue. We will provide further updates.
We are currently investigating an issue with our Boost Dashboard application. The issue does not impact any learner facing applications or other admin applications.
Report: "Downgraded Performance"
Last update: This incident has been resolved.
We are continuing to monitor the fix; load times within the learning and admin applications have returned to normal.
We have identified an issue with degraded database performance causing intermittent slowness in the learning and admin applications. A fix has been identified and implemented, and we are monitoring progress.
Report: "Degraded Performance - Console"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue with our Console Admin application which is affecting a subset of admin users. This does not impact the learner facing applications.
Report: "Degraded Performance - Console."
Last update: This issue has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue with our Console Admin application which is affecting a subset of admin users. This does not impact the learner facing applications.
Report: "Degraded Performance - Console"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The root cause has been identified and we are working on deploying a fix.
We are currently investigating an issue with our Console Admin application which is affecting a subset of admin users. This does not impact the learner facing applications.
Report: "Downgraded Performance - Boost Dashboard"
Last update: This issue has been resolved as of last night at 21:00:24 CDT.
The root cause has been identified; we have implemented a fix and are monitoring progress.
We are currently investigating an issue with our Boost Dashboard application. The issue does not impact any learner facing applications or other admin applications.
Report: "Downgraded Performance - Reporting Dashboards in Console"
Last update: The issue has been resolved and all of the Dashboards accessible via Console are performing as expected.
We are continuing to work on a resolution with our vendor. We will provide further status updates.
The root cause has been identified and we are working on a resolution with one of our vendors. We will provide further status updates.
We are currently investigating an issue with data displayed in User, Branch/Group and Branch Summary dashboards accessible via Console. Until the issue is identified and resolved, the dashboards will not display any information. This does not impact the learner-facing application, and Boost Dashboards remain available.
Report: "BenchPrep Admin Support Center - Intermittent issues"
Last update: The issue has been resolved. Our Ticketing System is fully functional.
We are currently experiencing intermittent issues with our Ticketing System & Knowledge Base (https://support.benchprep.com/home/). You can submit tickets directly via email: help@benchprep.com. This does NOT affect any Administration Tools or end-user Learning Applications.
Report: "Snowflake - Degraded Performance"
Last update: This incident has been resolved. Data has been fully resynced and is performing regular scheduled syncs as of 18:24:32 CST.
We continue monitoring the resynchronization process and are seeing progress. We will continue to provide further updates.
We have started a process of resynchronization of the data and we are seeing signs of progress. We will continue monitoring the process and will provide further updates.
We are continuing to investigate this issue with our 3rd party vendor. Additionally, we are actively seeking alternate means of refreshing the data. We will provide further updates.
We have confirmed data has not successfully synced since 2023-01-12 19:42 CST. We are working on implementing a solution to replicate the data. This process will take time and we will provide further status updates.
We are currently investigating an issue with Snowflake data replication with our 3rd party vendor. This does not affect any administration or end user applications and the impact is isolated to Snowflake raw data access.
Report: "Degraded Performance"
Last update: This incident has been resolved.
We have disabled BDR database extensions and have restored connectivity. We will continue to monitor for the time being.
We are currently investigating an issue with our backend database system and have put the site into maintenance mode for the time being.
Report: "Snowflake - Degraded Performance"
Last update: This incident has been resolved. Data has been fully resynced and is performing regular scheduled syncs as of 5:42 am CST.
The resynchronization process is continuing and we are monitoring progress. The majority of the data has been resynced.
A fix has been implemented. We are monitoring the results and will provide further updates.
The issue has been identified and we are working on implementing a solution to replicate the data. This process will take time and we will provide further status updates.
We are currently investigating an issue with Snowflake data replication with our 3rd party vendor. Data has not been successfully synced since 22:41 CST. This does not affect any administration or end user applications and the impact is isolated to Snowflake raw data access.
Report: "Degraded Performance"
Last update: The incident has been resolved; we will continue to monitor.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating reports of degraded performance.
Report: "Degraded Performance"
Last update: We found a regression in an API reporting request that was causing significant memory consumption, impacting the nodes and the entire system. The fix has been deployed and verified.
We have confirmed performance has stabilized and we continue monitoring affected applications. We are planning to introduce significant improvements to our progress cluster during the next database maintenance.
The rollback was completed and services have been restored. We are continuing to investigate this issue.
The restart did not alleviate the issues. We are restoring the services and rolling back the changes.
We are conducting a short restart of services which can cause non-learner applications to be unavailable for a few minutes.
We are experiencing an issue with high load times within our non-learner BenchPrep applications. Learner applications are not affected. We are currently provisioning more resources and will continue to investigate.
Report: "Elevated API Errors"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
We are continuing to monitor and verifying that all our data centers are healthy.
We are continuing to monitor for any further issues. The mitigation steps have been applied and the system is operational.
A fix has been implemented and we are monitoring the results.
We are working on deploying a fix shortly to mitigate the issue.
We are continuing to work on a fix for this issue.
We will be performing emergency maintenance.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Degraded Performance"
Last update: We are not seeing Ingress pod restarts after increasing the pod count in production.
We increased the number of pods on Ingress to support the current traffic.
We are currently investigating degraded performance issues with our ingress pods.
Report: "Degraded Performance with Redis Cluster"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
The upgraded Redis cluster has been rolled out; we will monitor the situation now.
We are rolling out upgrades to our Redis cluster and will put the site into maintenance for an estimated 5-10 minutes.
We are having issues with our Redis cluster; we have identified the problem and are rolling out a temporary solution.
Report: "Degraded Performance - login"
Last update: The Redis fix has been applied and the issue is now resolved. We continue to monitor the application.
We were able to successfully fail over the Redis nodes and everything looks good. Traffic is now being served from both data centers and we are monitoring the applications. We will keep monitoring and will post a final status update.
Redis replication restores have completed, and we are now failing over the Redis services to a different node.
We have been running on Dallas for the past few hours without any issues and are currently performing a full Redis replica restoration.
We restarted our Redis instance and traffic moved to the Dallas data center. We are still working on a fix.
The issue is identified and we are in the process of fixing it.
We are currently investigating login issues. We will keep you posted with updates.
Report: "Localization issue on Login pages"
Last update: We verified that the issue was caused by incorrect locales updated by the script.
A fix for the locale crash has been implemented and services should be back; we are still investigating the underlying localization issue for select tenants.
There is an issue with our localization impacting some tenants' login pages; we are still investigating.
We are currently investigating degraded performance of the login pages
Report: "Degraded Performance"
Last update: Nothing critical was identified and the system appears normal.
We are still monitoring and checking all possibilities for this issue.
Restarted Redis nodes in the production cluster. Currently monitoring the system.
We are currently investigating the issue.
Report: "SSO service is down"
Last update: The login and SSO services, once brought back up, have been stable.
The scaled resources came back up. We are monitoring and will actively work to identify the cause.
We are continuing to work on a fix for this issue.
We are scaling the resources up to ensure new images are brought up.
We are continuing to investigate this issue.
Following a rolling production deployment, the SSO service failed to come back up. We are actively investigating.
Report: "Degraded Performance in Dallas Cluster"
Last update: We are moving this issue to resolved; close monitoring did not reveal anything out of the ordinary.
We observed network performance degradation and cleared out old connections, which resolved the issue. We will be monitoring closely for the next hour.
We are continuing to investigate the issue; in the meantime, all traffic has been successfully redirected to the healthy data center, restoring normal operation.
We are investigating the issue impacting our Dallas cluster; in the meantime, we are redirecting traffic to the healthy data center.
Report: "Degraded Performance"
Last update: Attaching the IBM incident: https://cloud.ibm.com/status?item=INC4252245
We are continuing to monitor for any further issues.
We confirmed packet loss and networking issues with the cloud provider. Services are back, but we will keep the incident open until we receive official confirmation that the issue is resolved.
We are seeing network issues that we are actively working on with the cloud provider. In the meantime, one data center is back; we have routed traffic to it and are switching the incident status back to degraded performance.
We are updating this incident to an outage. Teams are looking into the issue.
We are investigating degraded performance reported by a number of our application pods.
Report: "Issue with Course Building"
Last update: The issue with course builds involving updates to questions has been identified and fixed. You may generate builds for courses with updates to questions. Additionally, our tech team is prioritizing work to optimize the course build process to ensure a consistent course build experience. We encourage you to monitor course build progress and contact support if you experience repeated failed builds.
If you have made any updates to questions, please refrain from building the course until a further update can be provided. If your changes do not involve updates to questions, you can generate the build.
BenchPrep is investigating issues related to course builds in BluePrint and advises all customers to hold off on building any courses until an update can be provided.
Report: "Degraded Performance"
Last update: We have restored traffic and removed the need to connect to the IBM registry when pods restart.
We have identified an issue with connecting to the IBM cloud registry from our San Jose Datacenter. We are routing all traffic to Dallas.
We are currently investigating reports of degraded performance.
Report: "Exam Results and BluePrint application"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
An issue was identified in how exam results are being displayed to users. This is originating from the BluePrint application, which will be turned off pending resolution. A fix has been identified and is being worked on. Expected resolution time is 1 hour.
Report: "Degraded Performance"
Last update: Several shared database connections were corrupted. Restarting our connection pooling software and connected services cleared out the connections. We will continue to monitor and provide updates if it recurs.
We have identified the problem and are currently monitoring.
We are currently investigating reports of degraded performance.
Report: "Degraded Performance"
Last update: This incident has been resolved.
We brought the data centers back up and identified an issue with database contention. We cleared the contention and are monitoring the situation.
We are currently seeing an issue with degraded performance impacting our San Jose cluster. We are redirecting traffic to another data center and investigating.
Report: "Degraded Performance"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
System performance has stabilized and we are continuing to monitor network connectivity.
We are working with our hosting service to investigate Dallas network issues. We will continue to monitor system performance while sending traffic to San Jose.
We are currently routing all traffic to San Jose while we continue to investigate potential network connectivity issues in Dallas.
We are experiencing slower than normal load times. We are currently investigating the issue and will post any relevant updates here.
Report: "Lagging reports in Boost"
Last update: The reports have been generated. We will continue to monitor.
We have restarted the report build process and will be monitoring until completion.
We have identified an issue with one step of the report generation process and are deploying a workaround.
We are currently investigating a delay in report generation in Boost.
Report: "Degraded Performance"
Last update: Connectivity to both data centers has been restored and we are sending traffic to both locations now. We will continue to monitor the situation.
We are continuing to monitor for any further issues.
We have identified an issue with our database connection pooling software. Restarting it has temporarily resolved the issue; we are monitoring and looking for a permanent solution.
We are currently investigating an issue related to memory contention on one of our databases.
We are currently routing all traffic to San Jose and are working with our infrastructure provider to address some network connectivity issues. All services are operational.
We have identified an issue in our Dallas datacenter and have routed all traffic to San Jose for the time being. We will continue to investigate.
We are experiencing slower than normal load times and reports of pages unable to load. We are currently investigating the issue and will post any relevant updates here.
Report: "Degraded Performance"
Last update: We have confirmed that there are no performance issues outside of some internal tools. We will be making some operational changes to address the performance regression and will continue to monitor the situation until it has been resolved.
We experienced an issue with database deadlocks at approximately 11am CST. That issue has been resolved, but we have received reports of lingering slowness in Blueprint. We are continuing to monitor performance of the platform.
We are experiencing slower than normal load times. We are currently investigating the issue and will post any relevant updates here.
Report: "Elevated API Errors"
Last update:
**Date:** December 11, 2020
**Date of Incident:** December 10, 2020
**Raised by:** Internal Monitoring
**Severity Level:** Critical
**Description**
* BenchPrep internal monitoring tools (Pingdom / NewRelic / Airbrake) alerted of site stability issues
**Root Cause**
* DOS (Denial of Service) like behavior was detected from a small range of IPs
**Resolution**
BenchPrep engineers blocked traffic from the offending IP addresses.
**Mitigation Strategies**
BenchPrep will be implementing automated DOS mitigation measures so that manual intervention will not have to occur (a rough sketch of this idea follows this report).
**Timeline**
2020-12-10 - 12:13 PM CST - First alerts triggered
2020-12-10 - 12:14 PM CST - BenchPrep begins investigating
2020-12-10 - 12:29 PM CST - BenchPrep blocks offending IP addresses
2020-12-10 - 12:30 PM CST - Site stability restored
Total resolution time: 17 minutes
This incident has been resolved.
Services are returning to normal; we will continue monitoring.
We are continuing to investigate this issue.
We're experiencing an elevated level of API errors and are currently looking into the issue.
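As a rough illustration of the automated DOS mitigation described in the postmortem above, a rate-based IP blocklist could look something like the sketch below. The window, threshold, and in-memory block list are assumptions for illustration only; the report does not describe BenchPrep's actual tooling.

```python
# Minimal sketch of rate-based IP blocking. The window, threshold, and the
# in-memory block list are illustrative assumptions only.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window length (assumed)
MAX_REQUESTS = 600    # requests allowed per window per IP (assumed)

_hits = defaultdict(deque)
blocked = set()

def should_block(ip, now=None):
    """Record a request from `ip`; return True once it exceeds the limit."""
    now = time.time() if now is None else now
    if ip in blocked:
        return True
    hits = _hits[ip]
    hits.append(now)
    # Drop requests that have fallen out of the sliding window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) > MAX_REQUESTS:
        blocked.add(ip)  # in practice this would push a firewall or CDN rule
        return True
    return False

if __name__ == "__main__":
    # A burst from one IP gets blocked; a quiet IP does not.
    for _ in range(700):
        should_block("203.0.113.7", now=0.0)
    print(should_block("203.0.113.7", now=1.0))   # True
    print(should_block("198.51.100.9", now=1.0))  # False
```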
Report: "Elevated Errors Loading Ascend"
Last update:
**Date:** December 4, 2020
**Date of Incident:** December 3, 2020
**Raised by:** Internal Monitoring
**Severity Level:** High
**Description**
* BenchPrep internal monitoring tools (Airbrake) alerted of network connectivity issues with IBM Cloud Object Storage
**Root Cause**
* An expired SSL certificate was deployed at IBM Cloud Object Storage, causing requests for cached content to fail.
**Resolution**
The issue was resolved before BenchPrep could deploy any temporary workarounds.
**Mitigation Strategies**
BenchPrep will build support for content to be pulled directly from our database in the event of connectivity issues in the future (a sketch of this fallback follows this report).
**Timeline**
2020-12-03 - 03:41 PM CST - First exception error reported
2020-12-03 - 03:51 PM CST - BenchPrep begins investigating
2020-12-03 - 04:12 PM CST - Connectivity restored
Total resolution time: 31 minutes
This incident has been resolved.
Service has been restored; we are currently monitoring.
We are continuing to investigate this issue.
We are experiencing elevated errors when loading BenchPrep Ascend and BenchPrep Engage. We are currently investigating.
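The mitigation above, pulling content directly from the database when object storage is unreachable, might be sketched roughly as below. The `fetch_from_database` helper, the storage URL, and the error handling are hypothetical; this illustrates the fallback idea, not BenchPrep's implementation.

```python
# Sketch of an object-storage fetch with a database fallback. The helper,
# URL, and error handling are illustrative assumptions only.
import requests

def fetch_from_database(content_id):
    # Placeholder for a direct database read of the cached content.
    return b"<content %s loaded from database>" % content_id.encode()

def fetch_content(content_id, base_url="https://example-object-storage.test"):
    """Try object storage first; on SSL or connection problems, read from the DB."""
    url = "%s/%s" % (base_url, content_id)
    try:
        resp = requests.get(url, timeout=5)
        resp.raise_for_status()
        return resp.content
    except (requests.exceptions.SSLError,
            requests.exceptions.ConnectionError,
            requests.exceptions.Timeout,
            requests.exceptions.HTTPError):
        return fetch_from_database(content_id)

if __name__ == "__main__":
    print(fetch_content("lesson-42"))
```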
Report: "Degraded performance with IBM Cloud"
Last update: While IBM is still working out other issues (https://cloud.ibm.com/status?selected=status), we are going to resolve this incident as all of our services continue to stay stable.
Since 19:02 CST we saw significant performance improvements of our services and response times. We are actively monitoring the situation at IBM (they have updated their status page https://cloud.ibm.com/status?selected=status ) and will keep you updated.
Here is what we know:
* We have reached out to IBM and our representatives are aware of the situation
* IBM is experiencing world-wide network outages (even impacting their status page https://cloud.ibm.com/status and support help desks) with various additional sources validating it (https://status.aspera.io/, https://downdetector.com/status/ibm-cloud, etc.)
* It appears at this moment that the issue is specific to IBM's public networking
* Our services are not heavily impacted due to our architecture as well as the fact that IBM's private network traffic is healthy at the moment; we are confirming that we are still serving traffic in our application.
We will keep updating this as we find out more.
We are seeing some issues with IBM Cloud networking. We are reaching out to the vendor for more information.
Report: "Elevated API Errors"
Last update:
**Date:** May 21, 2020
**Date of Incident:** May 2, 2020
**Raised by:** System Monitoring
**Severity Level:** High
**Description**
At 1:54pm CST BenchPrep staff was notified of site instability issues via system monitors. BenchPrep engineers soon began looking into container and cluster status.
**Root Cause**
A traffic spike caused memory usage of individual pods to creep up. This resulted in the host nodes running out of memory. Due to the lack of available memory, the Kubernetes cluster was unable to restart new healthy pods, which resulted in the backend API service becoming unresponsive.
**Resolution**
BenchPrep engineers restarted all API pods, which restored stability during the investigation.
**Mitigation Strategies**
We have added additional alerts on host nodes for when memory consumption is high (a sketch of such a check follows this report). We are reviewing our deployed resources and scaling down less frequently used ones in order to reduce the per-node memory footprint. We are also investigating adding an additional node to each cluster for additional stability.
**Timeline**
01:54 PM CST - Notification of the site instability from Pingdom
02:07 PM CST - BenchPrep engineers began troubleshooting
02:20 PM CST - BenchPrep platform put into maintenance mode during investigation
02:40 PM CST - BenchPrep platform stability restored and maintenance mode disabled
Total resolution time: 46 minutes
We have identified an issue with high system memory utilization and have corrected it. We will continue to monitor the situation.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of API errors and are currently looking into the issue.
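To illustrate the host-memory alerting mentioned in the mitigation strategies, a minimal check against each node's `MemoryPressure` condition could look like the sketch below, assuming the official Kubernetes Python client; the alerting hook is a placeholder, not BenchPrep's actual monitoring setup.

```python
# Rough sketch of a node-memory check using the official Kubernetes Python
# client. The alerting hook is a placeholder; thresholds and wiring are assumed.
from kubernetes import client, config

def nodes_under_memory_pressure():
    """Return names of nodes whose MemoryPressure condition is True."""
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    v1 = client.CoreV1Api()
    pressured = []
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            if cond.type == "MemoryPressure" and cond.status == "True":
                pressured.append(node.metadata.name)
    return pressured

if __name__ == "__main__":
    for name in nodes_under_memory_pressure():
        # Placeholder for paging/alerting; a real setup would notify on-call.
        print("ALERT: node under memory pressure:", name)
```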
Report: "Failed production database switch"
Last update:
**Description**
At 09:50 CDT, BenchPrep's database connection pooling software lost connection to the backend database during an attempt to switch to a new database server. This in turn caused an outage of all platform services. The loss of database connection was noticed immediately, and the original database server was back online by 09:53 CDT. BenchPrep's initial attempt to revert the switch back to the original server did not resolve the connection issue, however. In order to restore platform functionality, BenchPrep reconfigured applications to connect to the database directly, instead of via the connection pool. This configuration went into place at 10:15 CDT. While successful, it was not optimal, resulting in a minor intermittent outage. Final restoration of all configuration changes happened at 11:08 CDT, with all applications switched over at 11:30 CDT.
**Root Cause**
Investigation revealed that the initial switchover attempt was missing a necessary change to a connection pool configuration file that would have corresponded to the changes made to the database server endpoint address. Additionally, the initial configuration rollback was incomplete, missing the expected connection port.
**Resolution**
BenchPrep engineers prepared a more thorough reversion of all configuration changes associated with the switchover, which was carefully reviewed and manually tested before any more application changes were made. This configuration was in place at 11:08 CDT and all platform services fully switched to it at 11:30 CDT.
**Mitigation Strategies**
* Rather than replacing database connection parameters all at once, future connection pool configuration changes will be put in place in the environment alongside the existing configuration, and platform applications will only switch after verifying that the new connection works (a sketch of such a check follows this report).
* Prior to any such configuration change, a rollback branch will be prepared ahead of time in case it is needed.
This incident has been resolved.
At 9:50am CT, as part of routine database health improvements, we attempted a failover to a secondary cluster, at which point we lost the client-side connection pooling. We restored the service at 10:15am CT and are actively monitoring the situation. Because of this, the next attempt will happen during the next regularly scheduled downtime.
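The first mitigation item above, verifying that a new database endpoint works before switching applications over, could be sketched as below. The DSNs are hypothetical and the check uses psycopg2; this illustrates the verification step, not BenchPrep's actual rollout tooling.

```python
# Illustrative "verify the new endpoint before switching" check using psycopg2.
# Connection strings are hypothetical placeholders.
import psycopg2

def connection_works(dsn):
    """Return True if a trivial query succeeds against the given DSN."""
    try:
        conn = psycopg2.connect(dsn, connect_timeout=5)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() == (1,)
        finally:
            conn.close()
    except psycopg2.OperationalError:
        return False

if __name__ == "__main__":
    old_dsn = "host=db-old.internal port=6432 dbname=app user=app"  # hypothetical
    new_dsn = "host=db-new.internal port=6432 dbname=app user=app"  # hypothetical
    if connection_works(new_dsn):
        print("New endpoint verified; safe to switch applications over.")
    else:
        print("New endpoint failed verification; keep using the old endpoint:", old_dsn)
```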
Report: "Degraded Performance"
Last update: We received reports of slowness on the platform. We have identified and correlated those reports with the time when we reloaded our database connection pooling software, initiated at 2:22 PM CST. The slowness was due to the incremental rollout of new connection pods across multiple data centers.
We are currently seeing degraded performance across the platform and are investigating.
Report: "IBM Cloud Node Outages"
Last update: We are closing the issue; all BenchPrep services have been restored, albeit running through fewer data centers.
While IBM is dealing with the issue, we have decided to direct all traffic off the impacted cluster. That means all services should be back to normal; we are monitoring the situation so that we can bring the location back once IBM is done.
P3 Incident being investigated: We are aware of an incident that IBM Cloud (our server provider) is experiencing which is causing intermittent loading issues in Blueprint and the Tenant Admin Dashboard. We are in communication with IBM and closely monitoring the impact to our system. Once IBM resolves the incident, service should resume as normal. We will communicate updates as we get them from IBM, and you can also check the IBM Cloud status page directly (specifically the "Node Outages" incident). This issue is sporadic and a refresh may correct it.
Report: "Elevated API Errors"
Last update:
**Date:** 2019/08/02
**Date of Incident:** 2019/08/02
**Raised by:** BenchPrep
**Severity Level:** High
**Description**
BenchPrep monitoring triggered alerts that the API servers/containers had lost connectivity to our Redis database service provided by IBM.
**Root Cause**
IBM experienced network connectivity issues between their services. This prevented both of BenchPrep's Kubernetes clusters from connecting to the Redis backend service. For more information, see IBM Incident ID INC0999643 on their cloud status page ([https://cloud.ibm.com/status](https://cloud.ibm.com/status)) and search for the incident ID.
**Resolution**
IBM Redis database services had recovered before we were able to migrate to the alternative service.
**Mitigation Strategies**
We will keep the alternative Redis instance provisioned and plan on migrating over to our own internally managed Redis solution (a sketch of the fallback idea follows this report).
**Timeline**
10:04 AM CT - First email alerts came in
10:11 AM CT - BenchPrep engineers started provisioning alternative services
10:21 AM CT - Redis services were fully restored
A downloadable copy of this report can be found here: [https://drive.google.com/file/d/14p16oqEw_SAaFOz-cXc8CY6gIbmwJKzV/view?usp=sharing](https://drive.google.com/file/d/14p16oqEw_SAaFOz-cXc8CY6gIbmwJKzV/view?usp=sharing)
IBM has resolved database connectivity issues.
Connectivity has been restored to our provider's database service. We will continue to monitor the situation.
One of our backend database providers is experiencing an outage with connectivity. We are provisioning an alternative and will attempt to migrate services.
We're experiencing an elevated level of API errors and are currently looking into the issue.
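A minimal sketch of the "keep an alternative Redis instance ready" mitigation might look like the following, assuming redis-py and hypothetical hostnames; it illustrates the failover idea, not the production configuration.

```python
# Sketch of a primary/fallback Redis selection using redis-py.
# Hostnames and ports are hypothetical placeholders.
import redis

PRIMARY = {"host": "redis-primary.internal", "port": 6379}    # hypothetical
FALLBACK = {"host": "redis-fallback.internal", "port": 6379}  # hypothetical

def get_redis():
    """Return a client for the primary Redis, or the fallback if it is unreachable."""
    for params in (PRIMARY, FALLBACK):
        candidate = redis.Redis(socket_connect_timeout=2, **params)
        try:
            candidate.ping()
            return candidate
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            continue
    raise RuntimeError("Neither Redis endpoint is reachable")

if __name__ == "__main__":
    r = get_redis()
    r.set("healthcheck", "ok")
    print(r.get("healthcheck"))
```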
Report: "Elevated API Errors"
Last update:
**Date:** 2019/07/31
**Date of Incident:** 2019/07/31
**Raised by:** BenchPrep
**Severity Level:** High
**Description**
BenchPrep monitoring triggered alerts that the API servers/containers had lost connectivity to database services. BenchPrep began looking into the issue and discovered that PostgreSQL connection pooling services were unable to authenticate with the backend database. After ensuring the correct configuration was in place and restarting the impacted database pooling services, successful connections were established.
**Root Cause**
Our database connection pooling software (PgBouncer) ran into a known but rare bug. When new connections were opened, an extra connection parameter was sent. The version of Postgres we are running is not compatible with that parameter. This resulted in the "invalid server parameter" error, and insufficient valid connections were available. Reference: [https://pgbouncer-general.pgfoundry.narkive.com/lZPDYkqn/pgbouncer-1-1-released](https://pgbouncer-general.pgfoundry.narkive.com/lZPDYkqn/pgbouncer-1-1-released)
**Resolution**
Restarting our connection pooling software caused new connections to the backend database to be re-established. This returned stability to the site.
**Mitigation Strategies**
We will be adding additional log-level alerts to preemptively warn when connection issues start arising (a sketch of such a check follows this report). This should allow us to manually intervene before a catastrophic failure occurs.
**Timeline**
04:10 PM CT - First error notification came in
04:14 PM CT - API-based services reached a 100% error rate and the site went down
04:22 PM CT - Partial availability restored
04:25 PM CT - Services fully restored
A downloadable copy of this report can be found here: [https://drive.google.com/file/d/1m_5EJsnJ8udRuk00bgJbSTtqTpVXrvPW/view?usp=sharing](https://drive.google.com/file/d/1m_5EJsnJ8udRuk00bgJbSTtqTpVXrvPW/view?usp=sharing)
This incident has been resolved.
We have identified an issue with database connectivity and have corrected it. We will continue to monitor the situation.
We're experiencing an elevated level of API errors and are currently looking into the issue.
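The log-level alerting described in the mitigation strategies could be sketched as a simple scan of the PgBouncer log for the "invalid server parameter" error, as below. The log path and alert threshold are assumptions for illustration, not the actual monitoring configuration.

```python
# Sketch of log-level alerting: count "invalid server parameter" errors in a
# PgBouncer log and flag when they exceed a threshold. Path and threshold are
# assumptions for illustration only.
import sys

ERROR_MARKER = "invalid server parameter"
THRESHOLD = 5   # assumed number of errors that should trigger an alert

def count_errors(log_path):
    """Count log lines containing the error marker."""
    count = 0
    with open(log_path, "r", errors="replace") as handle:
        for line in handle:
            if ERROR_MARKER in line:
                count += 1
    return count

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/pgbouncer/pgbouncer.log"
    errors = count_errors(path)
    if errors >= THRESHOLD:
        # Placeholder for a page/notification to on-call engineers.
        print("ALERT: %d '%s' errors found in %s" % (errors, ERROR_MARKER, path))
    else:
        print("OK: %d matching errors" % errors)
```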
Report: "Outage - Redis"
Last update: This incident has been resolved.
We rolled out the Redis backup services. The platform should be operational now; we are continuing to monitor and verify it.
We have switched to a competing Redis offering and are deploying it shortly.
While we are waiting for IBM, we are provisioning our backup services. Once they are up, we should be able to move fairly fast and service should be restored shortly.
Both Compose connectors available to us are not working. We have escalated the issue with IBM to make sure a new connection is available as soon as possible.
The issue has been traced to missing connection strings related to the IBM Compose maintenance (https://status.compose.com/). We are updating them, which should bring services back shortly.
We have received an alert from our Redis cluster; this will cause degraded performance for backend services. We are investigating and working on remediation.
Report: "Degraded performance"
Last update: This incident has been resolved.
The issue was identified and fixed; we are currently verifying and monitoring.
We are currently seeing degraded performance in the database cluster.