BenchPrep

Is BenchPrep Down Right Now? Check whether an outage is currently ongoing.

BenchPrep is currently Operational

Last checked from BenchPrep's official status page
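
For readers who want to script this check: below is a minimal Python sketch, assuming BenchPrep's status page is hosted on Atlassian Statuspage (whose pages expose a standard JSON summary endpoint). The status.benchprep.com hostname is an assumption for illustration, not something confirmed by this page.

```python
import requests  # third-party: pip install requests

# Assumption: BenchPrep's status page runs on Atlassian Statuspage,
# which exposes a JSON summary at /api/v2/status.json. The hostname
# below is a guess for illustration.
STATUS_URL = "https://status.benchprep.com/api/v2/status.json"

def check_benchprep_status():
    """Return the overall status, e.g. 'All Systems Operational'."""
    resp = requests.get(STATUS_URL, timeout=10)
    resp.raise_for_status()
    status = resp.json()["status"]
    # Statuspage indicators: none, minor, major, or critical.
    return f"{status['description']} (indicator: {status['indicator']})"

if __name__ == "__main__":
    print(check_benchprep_status())
```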

Historical record of incidents for BenchPrep

Report: "Degraded Perfromance"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented; we have confirmed the applications are operational and we are monitoring current performance.

investigating

We're continuing to investigate the issue and are currently testing a potential change to improve connectivity.

investigating

We are actively investigating an issue with degraded performance across our applications.

Report: "Degraded Perfromance"

Last update
Resolved

This incident has been resolved.

Monitoring

A fix has been implemented, we have confirmed the applications are operational and we are monitoring current performance.

Update

We're continuing to investigate the issue and are currently testing a potential change to improve connectivity.

Investigating

We are actively investigating an issue with degraded performance across our applications.

Report: "Release Maintenance"

Last update
Completed

The scheduled maintenance has been completed.

Update

We are continuing to verify the maintenance items.

Verifying

Verification is currently underway for the maintenance items.

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled

We are planning a short maintenance window. While we expect the performance impact to be minimal for most of our customers, some customers may experience degraded performance or downtime for up to 15 minutes.

Report: "Degraded Performance - Boost Dashboard and Institution Admin"

Last update
resolved

This issue has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are actively investigating an issue causing the Boost Dashboard and Institution Admin to display incorrectly. The Learning Application is not affected.

Report: "Downgraded Performance - Boost Dashboard"

Last update
resolved

This incident has been resolved.

monitoring

The dashboards have been successfully updated, and we are actively monitoring the ongoing update.

investigating

We are continuing to monitor the ongoing update and are investigating the root cause of the issue.

investigating

We are investigating the recurrence of this issue while closely monitoring an ongoing update. We will share further updates.

monitoring

The dashboards have been successfully updated, and we are actively monitoring the ongoing update.

investigating

We are actively investigating this issue while closely monitoring an update that is currently in progress.

investigating

We are currently investigating an issue with our Boost Dashboard application. The issue does not impact any learner facing applications or other admin applications.

Report: "Downgraded Performance - Reporting Dashboards in Console"

Last update
resolved

This incident has been resolved.

monitoring

The vendor issue was resolved at 19:50 CT. We are monitoring and working with the vendor to get an explanation of the issue and its resolution.

investigating

We are currently investigating an issue with Console Analytics dashboards not displaying any data. Our team is actively working with a third-party vendor to resolve. This issue is isolated to the Console Analytics section and does not impact the Learning Application.

Report: "Downgraded Performance - Boost Dashboard"

Last update
resolved

This incident has been resolved.

monitoring

We have confirmed the Boost application was successfully updated, and we are monitoring the results.

investigating

We are currently investigating an issue with our Boost Dashboard application. The issue does not impact any learner facing applications or other admin applications.

Report: "Degraded Performance"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating an intermittent degraded performance of the Learning Application and Admin Tools due to issues within one of our data centers. We are replacing the impacted node and expect services to be fully restored shortly.

Report: "Console - embedded dashboard reporting"

Last update
resolved

This issue has been resolved.

investigating

You might experience a slight disruption in Console's embedded dashboard reporting. We will provide an update on the issue. This does not affect learner applications.

Report: "Downgraded Performance - Learning Application - Interactions"

Last update
resolved

A fix has been implemented and the issue has been resolved.

identified

The issue has been identified and we are actively working on a solution.

investigating

We are currently investigating an issue with Interactions not launching successfully within the Learning Application.

Report: "Downgraded Performance - Boost Dashboard"

Last update
resolved

This incident has been resolved.

monitoring

We've applied a solution and are monitoring the ongoing progress; the completion of the process is expected to take a few hours. We will provide further updates.

identified

We are actively implementing a solution to fix the issue. This process will take time, and we will provide further status updates.

investigating

We are continuing our investigation and are actively working on a solution to resolve the issue. We will provide further updates.

investigating

We are currently investigating an issue with our Boost Dashboard application. The issue does not impact any learner facing applications or other admin applications.

Report: "Downgraded Performance"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor the fix; load times within the learning and admin applications have returned to normal.

monitoring

We have identified an issue with downgraded database performance causing intermittent slowness in the learning application and admin applications. A fix has been identified and implemented, and we are monitoring progress.

Report: "Degraded Performance - Console"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating an issue with our Console Admin application which is affecting a subset of admin users. This does not impact the learner facing applications.

Report: "Degraded Performance - Console."

Last update
resolved

This issue has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating an issue with our Console Admin application which is affecting a subset of admin users. This does not impact the learner facing applications.

Report: "Degraded Performance - Console"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The root cause has been identified and we are working on deploying a fix.

investigating

We are currently investigating an issue with our Console Admin application which is affecting a subset of admin users. This does not impact the learner facing applications.

Report: "Downgraded Performance - Boost Dashboard"

Last update
resolved

This issue has been resolved as of last night at 21:00:24 CDT.

monitoring

The root cause has been identified, we have implemented a fix and are monitoring progress.

investigating

We are currently investigating an issue with our Boost Dashboard application. The issue does not impact any learner facing applications or other admin applications.

Report: "Downgraded Performance - Reporting Dashboards in Console"

Last update
resolved

The issue has been resolved and all of the Dashboards accessible via Console are performing as expected.

identified

We are continuing to work on a resolution with our vendor. We will provide further status updates.

identified

The root cause has been identified and we are working on a resolution with one of our vendors. We will provide further status updates.

investigating

We are currently investigating an issue with data displayed in the User, Branch/Group, and Branch Summary dashboards accessible via Console. Until the issue is identified and resolved, the dashboards will not display any information. This does not impact the learner-facing applications, and Boost Dashboards remain available.

Report: "BenchPrep Admin Support Center - Intermittent issues"

Last update
resolved

The issue has been resolved. Our Ticketing System is fully functional.

investigating

We are currently experiencing intermittent issues with our Ticketing System & Knowledge Base (https://support.benchprep.com/home/). You can submit tickets directly via email: help@benchprep.com. This does NOT affect any Administration Tools or end-user Learning Applications.

Report: "Snowflake - Degraded Performance"

Last update
resolved

This incident has been resolved. Data has been fully resynced and is performing regular scheduled syncs as of 18:24:32 CST.

investigating

We are continuing to monitor the resynchronization process and are seeing progress. We will continue to provide further updates.

investigating

We have started a process of resynchronization of the data and we are seeing signs of progress. We will continue monitoring the process and will provide further updates.

investigating

We are continuing to investigate this issue with our 3rd party vendor. Additionally, we are actively seeking alternate means of refreshing the data. We will provide further updates.

investigating

We have confirmed data has not successfully synced since 2023-01-12 19:42 CST. We are implementing a solution to replicate the data. This process will take time, and we will provide further status updates.

investigating

We are currently investigating an issue with Snowflake data replication with our 3rd party vendor. This does not affect any administration or end user applications and the impact is isolated to Snowflake raw data access.

Report: "Degraded Performance"

Last update
resolved

This incident has been resolved.

monitoring

We have disabled BDR database extensions and have restored connectivity. We will continue to monitor for the time being.

investigating

We are currently investigating an issue with our backend database system and have put the site into maintenance mode for the time being.

Report: "Snowflake - Degraded Performance"

Last update
resolved

This incident has been resolved. Data has been fully resynced and is performing regular scheduled syncs as of 5:42 am CST.

monitoring

The resynchronization process is continuing and we are monitoring progress. The majority of the data has been resynced.

monitoring

A fix has been implemented. We are monitoring the results and will provide further updates.

identified

The issue has been identified and we are working on implementing a solution to replicate the data. This process will take time and we will provide further status updates.

investigating

We are currently investigating an issue with Snowflake data replication with our 3rd party vendor. Data has not been successfully synced since 22:41 CST. This does not affect any administration or end user applications and the impact is isolated to Snowflake raw data access.

Report: "Degraded Performance"

Last update
resolved

The incident has been resolved; we will continue to monitor.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating reports of degraded performance.

Report: "Degraded Performance"

Last update
resolved

We found a regression in an API reporting request that was causing a significant amount of memory consumption, impacting the nodes and the entire system. The fix has been deployed and verified.

monitoring

We have confirmed performance has stabilized, and we are continuing to monitor the affected applications. We are planning to introduce significant improvements to our progress cluster during the next database maintenance.

investigating

The rollback was completed and services have been restored. We are continuing to investigate this issue.

investigating

The restart did not alleviate the issues. We are restoring the services and rolling back the changes.

investigating

We are conducting a short restart of services which can cause non-learner applications to be unavailable for a few minutes.

investigating

We are experiencing an issue with high load times within our non-learner BenchPrep applications. Learner applications are not affected. We are currently provisioning more resources and will continue to investigate.

Report: "Elevated API Errors"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor and to verify that all our data centers are healthy.

monitoring

We are continuing to monitor for any further issues. The mitigation steps have been applied and the system is operational.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We are working on deploying a fix shortly to mitigate the issue.

identified

We are continuing to work on a fix for this issue.

identified

We will be performing emergency maintenance.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Report: "Degraded Performance"

Last update
resolved

We are not seeing Ingress pod restarts after increasing pod count in production.

monitoring

We increased the number of pods on Ingress to support the current traffic.

investigating

We are currently investigating degraded performance issues with our ingress pods.

Report: "Degraded Performance with Redis Cluster"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

The upgraded Redis cluster has been rolled out; we will now monitor the situation.

identified

We are rolling out upgrades to our Redis cluster and will put the site in maintenance for an estimated 5-10 minutes.

identified

We are having issues with our Redis cluster; we have identified the problem and are rolling out a temporary solution.

Report: "Degraded Performance - login"

Last update
resolved

The Redis fix has been applied and the issue is now resolved. We will continue to monitor the application.

monitoring

We were able to successfully fail over the Redis nodes, and everything looks good. Traffic is now being served from both data centers, and we are monitoring the applications. We will keep monitoring and will post a final status update.

identified

Redis replication restores have completed, and we are now failing over the Redis services to a different node.

identified

We have been running on Dallas for the past few hours without any issues and are currently performing a full Redis replica restoration.

identified

We restarted our Redis instance and moved traffic to the Dallas data center. We are still working on a fix.

identified

The issue is identified and we are in the process of fixing it.

investigating

We are currently investigating login issues. We will keep you posted with updates.

Report: "Localization issue on Login pages"

Last update
resolved

We verified that the issue was caused by incorrect locales applied by the script.

monitoring

A fix for the locale crash has been implemented and services should be back; we are still looking into the underlying localization issue for select tenants.

investigating

There is an issue with our localization impacting some tenants' login pages; we are still investigating.

investigating

We are currently investigating degraded performance of the login pages.

Report: "Degraded Performance"

Last update
resolved

Nothing critical was identified, and the system seems normal.

monitoring

We are still monitoring and investigating all possible causes of this issue.

monitoring

Restarted Redis nodes in the production cluster. Currently monitoring the system.

investigating

We are currently investigating the issue.

Report: "SSO service is down"

Last update
resolved

The login and SSO services, once brought back up, have been stable.

monitoring

The scaled resources came back up. We are monitoring and will actively work to identify the cause.

identified

We are continuing to work on a fix for this issue.

identified

We are scaling the resources up to ensure the new images are brought up.

investigating

We are continuing to investigate this issue.

investigating

Following a rolling production deployment, the SSO service failed to come back up. We are actively investigating.

Report: "Degraded Performance in Dallas Cluster"

Last update
resolved

We are moving this issue to resolved; close monitoring didn't reveal anything out of the ordinary.

monitoring

We observed network performance degradation and cleared out old connections, which resolved the issue. We will be monitoring closely for the next hour.

investigating

We are continuing to investigate the issue; in the meantime, all traffic has been successfully redirected to the healthy datacenter, restoring normal operations.

investigating

We are investigating the issue impacting our Dallas cluster and are redirecting traffic to the healthy datacenter in the meantime.

Report: "Degraded Performance"

Last update
resolved

Attaching IBM incident https://cloud.ibm.com/status?item=INC4252245

monitoring

We are continuing to monitor for any further issues.

monitoring

We have confirmed packet loss and networking issues with the cloud provider. Services are back, but we will keep the incident open until we hear official confirmation that the issue is resolved.

identified

We are seeing network issues that we are actively working on with the cloud provider. In the meantime, one datacenter is back; we have routed traffic to it and are switching the status back to degraded performance.

investigating

Updating the status to outage. Teams are looking into the issue.

investigating

We are investigating degraded performance reported by a number of our application pods.

Report: "Issue with Course Building"

Last update
resolved

The issue with course builds involving updates to questions has been identified and fixed. You may generate builds for courses with updates to questions. Additionally, our tech team is prioritizing work to optimize the course build process to ensure a consistent course build experience. We encourage you to monitor course build progress and contact support if you experience repeated failed builds.

investigating

If you have made any updates to questions, please refrain from building the course until a further update can be provided. If your changes do not involve updates to questions, you can generate the build.

investigating

BenchPrep is investigating issues related to course builds in BluePrint and advises all customers to hold off on building any courses until an update can be provided.

Report: "Degraded Performance"

Last update
resolved

We have restored traffic and have removed the need to connect to the IBM registry when pods restart.

identified

We have identified an issue with connecting to the IBM cloud registry from our San Jose Datacenter. We are routing all traffic to Dallas.

investigating

We are currently investigating reports of degraded performance.

Report: "Exam Results and BluePrint application"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

An issue was identified in how exam results are being displayed to users. This is originating from the BluePrint application, which will be turned off pending resolution. A fix has been identified and is being worked on. Expected resolution time is 1 hour.

Report: "Degraded Performance"

Last update
resolved

Several shared database connections were corrupted. Restarting our connection pooling software and connected services cleared out the connections. We will continue to monitor and provide updates if the issue recurs.

monitoring

We have identified the problem and are currently monitoring.

investigating

We are currently investigating reports of degraded performance.

Report: "Degraded Performance"

Last update
resolved

This incident has been resolved.

monitoring

We brought the datacenters back up and identified an issue with database contention. We have cleared the contention and are monitoring the situation.

investigating

We are currently seeing an issue with degraded performance impacting our San Jose cluster. We are redirecting traffic to another datacenter and investigating.

Report: "Degraded Performance"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

System performance has stabilized and we are continuing to monitor network connectivity.

investigating

We are working with our hosting service to investigate Dallas network issues. We will continue to monitor system performance while sending traffic to San Jose.

investigating

We are currently routing all traffic to San Jose while we continue to investigate potential network connectivity issues in Dallas.

investigating

We are experiencing slower than normal load times. We are currently investigating the issue and will post any relevant updates here.

Report: "Lagging reports in Boost"

Last update
resolved

The reports have been generated. We will continue to monitor.

monitoring

We have restarted the report build process and will be monitoring until completion.

identified

We have identified an issue with one step of the report generation process and are deploying a workaround.

investigating

We are currently investigating a delay in report generation in Boost.

Report: "Degraded Performance"

Last update
resolved

Connectivity to both data centers has been restored and we are sending traffic to both locations now. We will continue to monitor the situation.

monitoring

We are continuing to monitor for any further issues.

monitoring

We have identified an issue with our database connection pooling software. Restarting it has temporarily resolved the issue; we are monitoring and looking for a permanent solution.

investigating

We are currently investigating an issue related to memory contention on one of our databases.

monitoring

We are currently routing all traffic to San Jose and are working with our infrastructure provider to address some network connectivity issues. All services are operational.

monitoring

We have identified an issue in our Dallas datacenter and have routed all traffic to San Jose for the time being. We will continue to investigate.

investigating

We are experiencing slower than normal load times and reports of pages unable to load. We are currently investigating the issue and will post any relevant updates here.

Report: "Degraded Performance"

Last update
resolved

We have confirmed that there are no performance issues outside of some internal tools. We will be making some operational changes to address the performance regression and will continue to monitor the situation until it has been resolved.

monitoring

We experienced an issue with database deadlocks at approximately 11 AM CST. That issue has been resolved, but we have received reports of lingering slowness in Blueprint. We are continuing to monitor the performance of the platform.

investigating

We are experiencing slower than normal load times. We are currently investigating the issue and will post any relevant updates here.

Report: "Elevated API Errors"

Last update
postmortem

**Date:** December 11, 2020
**Date of Incident:** December 10, 2020
**Raised by:** Internal Monitoring
**Severity Level:** Critical

**Description**
BenchPrep internal monitoring tools (Pingdom / NewRelic / Airbrake) alerted of site stability issues.

**Root Cause**
DOS (Denial of Service)-like behavior was detected from a small range of IPs.

**Resolution**
BenchPrep engineers blocked traffic from the offending IP addresses.

**Mitigation Strategies**
BenchPrep will be implementing automated DOS mitigation measures so that manual intervention will not have to occur.

**Timeline**
2020-12-10 12:13 PM CST - First alerts triggered
2020-12-10 12:14 PM CST - BenchPrep begins investigating
2020-12-10 12:29 PM CST - BenchPrep blocks offending IP addresses
2020-12-10 12:30 PM CST - Site stability restored

Total resolution time: 17 minutes
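
The mitigation strategy above, automating the IP blocking so manual intervention isn't needed, amounts to per-source-IP rate limiting. Here is a minimal sketch of that idea in Python; the window size, threshold, and iptables output are illustrative assumptions, not BenchPrep's actual implementation.

```python
import time
from collections import defaultdict, deque

# Illustrative policy: more than 300 requests from a single IP within
# 60 seconds is treated as DOS-like behavior. Both numbers are assumptions.
WINDOW_SECONDS = 60
MAX_REQUESTS = 300

recent_hits = defaultdict(deque)  # ip -> deque of request timestamps
blocked = set()

def record_request(ip, now=None):
    """Record one request; return True if this IP just got blocked."""
    now = time.time() if now is None else now
    hits = recent_hits[ip]
    hits.append(now)
    # Drop timestamps that have fallen out of the sliding window.
    while hits and hits[0] < now - WINDOW_SECONDS:
        hits.popleft()
    if len(hits) > MAX_REQUESTS and ip not in blocked:
        blocked.add(ip)
        # A real deployment would push a firewall/WAF rule; here we just
        # print the equivalent iptables command.
        print(f"iptables -A INPUT -s {ip} -j DROP")
        return True
    return False

# Simulated burst from one address trips the block:
for _ in range(301):
    record_request("203.0.113.7")
```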

resolved

This incident has been resolved.

monitoring

Services are returning to normal; we will continue monitoring.

investigating

We are continuing to investigate this issue.

investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Report: "Elevated Errors Loading Ascend"

Last update
postmortem

**Date:** December 4, 2020
**Date of Incident:** December 3, 2020
**Raised by:** Internal Monitoring
**Severity Level:** High

**Description**
BenchPrep internal monitoring tools (Airbrake) alerted of network connectivity issues with IBM Cloud Object Storage.

**Root Cause**
An expired SSL certificate was deployed at IBM Cloud Object Storage, which caused requests for cached content to fail.

**Resolution**
The issue was resolved before BenchPrep could deploy any temporary workarounds.

**Mitigation Strategies**
BenchPrep will build support for content to be pulled directly from our database in the event of connectivity issues in the future.

**Timeline**
2020-12-03 03:41 PM CST - First exception error reported
2020-12-03 03:51 PM CST - BenchPrep begins investigating
2020-12-03 04:12 PM CST - Connectivity restored

Total resolution time: 31 minutes
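
The mitigation strategy, pulling content directly from the database when object storage is unreachable, is a simple fallback pattern. A minimal, self-contained sketch follows; both storage helpers are hypothetical stand-ins, not BenchPrep's code.

```python
# Hypothetical stand-ins for the real storage clients; the fallback
# pattern, not these names, is the point.
class ObjectStoreUnavailable(Exception):
    pass

def get_from_object_store(key):
    # Placeholder for a real object-storage call (e.g. a COS GET);
    # here it always fails to demonstrate the fallback path.
    raise ObjectStoreUnavailable("simulated TLS/connectivity failure")

def get_from_database(key):
    # Placeholder for the authoritative copy kept in the database.
    return b"content for " + key.encode()

def fetch_content(key):
    """Prefer cached content from object storage; fall back to the
    database copy when the store is unreachable."""
    try:
        return get_from_object_store(key)
    except ObjectStoreUnavailable:
        return get_from_database(key)

print(fetch_content("lesson-123"))  # -> b'content for lesson-123'
```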

resolved

This incident has been resolved.

monitoring

Service has been restored; we are currently monitoring.

investigating

We are continuing to investigate this issue.

investigating

We are experiencing elevated errors when loading BenchPrep Ascend and BenchPrep Engage. We are currently investigating.

Report: "Degraded performance with IBM Cloud"

Last update
resolved

While IBM is still working out other issues (https://cloud.ibm.com/status?selected=status), we are going to resolve this incident as all our services continue to stay stable.

monitoring

Since 19:02 CST we have seen significant performance improvements in our services and response times. We are actively monitoring the situation at IBM (they have updated their status page: https://cloud.ibm.com/status?selected=status) and will keep you updated.

investigating

Here is what we know:
* We have reached out to IBM, and our representatives are aware of the situation.
* IBM is experiencing world-wide network outages (even impacting their status page https://cloud.ibm.com/status and support help desks), with various additional sources validating it (https://status.aspera.io/, https://downdetector.com/status/ibm-cloud, etc.).
* At this moment, the issue appears to be specific to IBM's public networking.
* Our services are not heavily impacted due to our architecture, as well as the fact that IBM's private network traffic is healthy at the moment; we have confirmed that we are still serving traffic in our application.

We will keep updating this as we find out more.

investigating

We are seeing some issues with IBM Cloud networking. We are reaching out to vendor for more information.

Report: "Elevated API Errors"

Last update
postmortem

**Date:** May 21, 2020
**Date of Incident:** May 2, 2020
**Raised by:** System Monitoring
**Severity Level:** High

**Description**
At 1:54 PM CST, BenchPrep staff was notified of site instability issues via system monitors. BenchPrep engineers soon began looking into container and cluster status.

**Root Cause**
A traffic spike caused the memory usage of individual pods to creep up. This resulted in the host nodes running out of memory. Due to the lack of available memory, the Kubernetes cluster was unable to restart new healthy pods, which resulted in the backend API service becoming unresponsive.

**Resolution**
BenchPrep engineers restarted all API pods, which restored stability during the investigation.

**Mitigation Strategies**
We have added additional alerts on host nodes for when memory consumption is high. We are reviewing our deployed resources and scaling down less frequently used ones in order to reduce the per-node memory footprint. We are also investigating adding an additional node to each cluster for additional stability.

**Timeline**
01:54 PM CST - Notification of site instability from Pingdom
02:07 PM CST - BenchPrep engineers began troubleshooting
02:20 PM CST - BenchPrep platform put into maintenance mode during investigation
02:40 PM CST - BenchPrep platform stability restored and maintenance mode disabled

Total resolution time: 46 minutes
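
The first mitigation, alerting when host-node memory runs high, could look like the following watchdog sketch; the 90% threshold and print-based alert hook are assumptions, not BenchPrep's actual monitoring stack.

```python
import time
import psutil  # third-party: pip install psutil

MEMORY_ALERT_PERCENT = 90.0    # illustrative threshold
CHECK_INTERVAL_SECONDS = 30

def alert(message):
    # Stand-in for a real pager/NewRelic/Slack hook.
    print(f"ALERT: {message}")

def watch_host_memory():
    """Poll host memory so operators hear about pressure before the
    node is too exhausted to schedule replacement pods."""
    while True:
        used = psutil.virtual_memory().percent
        if used >= MEMORY_ALERT_PERCENT:
            alert(f"host memory at {used:.1f}% (threshold {MEMORY_ALERT_PERCENT}%)")
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    watch_host_memory()
```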

resolved

We have identified an issue with high system memory utilization and have corrected it. We will continue to monitor the situation.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Report: "Failed production database switch"

Last update
postmortem

**Description**
At 09:50 CDT, BenchPrep's database connection pooling software lost its connection to the backend database during an attempt to switch to a new database server. This in turn caused an outage of all platform services. The loss of the database connection was noticed immediately, and the original database server was back online by 09:53 CDT. BenchPrep's initial attempt to revert the switch back to the original server did not resolve the connection issue, however. In order to restore platform functionality, BenchPrep reconfigured applications to connect to the database directly, instead of via the connection pool. This configuration went into place at 10:15 CDT. While successful, it was not optimal, resulting in a minor intermittent outage. Final restoration of all configuration changes happened at 11:08 CDT, with all applications switched over at 11:30 CDT.

**Root Cause**
Investigation revealed that the initial switchover attempt was missing a necessary change to a connection pool configuration file corresponding to the changes made to the database server endpoint address. Additionally, the initial configuration rollback was incomplete, missing the expected connection port.

**Resolution**
BenchPrep engineers prepared a more thorough reversion of all configuration changes associated with the switchover, which was carefully reviewed and manually tested before any more application changes were made. This configuration was in place at 11:08 CDT, and all platform services fully switched to it at 11:30 CDT.

**Mitigation Strategies**
* Rather than replacing database connection parameters all at once, future connection pool configuration changes will be put in place in the environment alongside the existing configuration, and platform applications will only switch after verifying that the new connection works.
* Prior to any such configuration change, a rollback branch will be prepared ahead of time in case it is needed.
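
The first mitigation strategy, verifying a new connection before any application switches to it, might be sketched as follows; the DSNs are placeholders, and a real cutover would reconfigure the pooler rather than an in-process handle.

```python
import psycopg2  # third-party: pip install psycopg2-binary

# Placeholder DSNs for the old and candidate database endpoints.
CURRENT_DSN = "host=db-old.internal port=5432 dbname=app user=app"
CANDIDATE_DSN = "host=db-new.internal port=5432 dbname=app user=app"

def connection_is_healthy(dsn):
    """Connect and run a trivial query; only a full round trip counts."""
    try:
        conn = psycopg2.connect(dsn, connect_timeout=5)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                return cur.fetchone() == (1,)
        finally:
            conn.close()
    except psycopg2.Error:
        return False

def choose_dsn():
    # Keep the known-good configuration unless the candidate verifies.
    return CANDIDATE_DSN if connection_is_healthy(CANDIDATE_DSN) else CURRENT_DSN
```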

resolved

This incident has been resolved.

monitoring

At 9:50 AM CT, as part of routine database health improvements, we attempted a failover to a secondary cluster, at which point we lost client-side connection pooling. We restored the service at 10:15 AM CT and are actively monitoring the situation. Because of this, the next attempt will happen during the next regularly scheduled downtime.

Report: "Degraded Performance"

Last update
resolved

We received reports of slowness on the platform. We identified and associated those reports with the reload of our database connection pooling software initiated at 2:22 PM CST. The slowness was due to the incremental rollout of new connection pods across multiple data centers.

investigating

We are currently seeing the degraded performance across the platform and are investigating.

Report: "IBM Cloud Node Outages"

Last update
resolved

We are closing the issue - all BenchPrep services are restored, albeit running through fewer datacenters.

monitoring

While IBM is dealing with the issue, we have decided to direct all traffic off the impacted cluster. That means all services should be back to normal; we are monitoring the situation so that we can bring the location back once IBM is done.

investigating

P3 incident being investigated: We are aware of an incident that IBM Cloud (our server provider) is experiencing, which is causing intermittent loading issues in Blueprint and the Tenant Admin Dashboard. We are in communication with IBM and closely monitoring the impact to our system. Once IBM resolves the incident, service should resume as normal. We will communicate updates as we get them from IBM, and you can also check the IBM Cloud status page directly (specifically the "Node Outages" incident). This issue is sporadic, and a refresh may correct it.

Report: "Elevated API Errors"

Last update
postmortem

**Date:** 2019/08/02
**Date of Incident:** 2019/08/02
**Raised by:** BenchPrep
**Severity Level:** High

**Description**
BenchPrep monitoring triggered alerts that the API servers/containers had lost connectivity to our Redis database service provided by IBM.

**Root Cause**
IBM experienced network connectivity issues between their services. This prevented both of BenchPrep's Kubernetes clusters from connecting to the Redis backend service. For more information, see IBM Incident ID INC0999643 on their cloud status page (https://cloud.ibm.com/status) and search for the incident ID.

**Resolution**
The IBM Redis database services recovered before we were able to migrate to the alternative service.

**Mitigation Strategies**
We will keep the alternative Redis instance provisioned and plan on migrating over to our own internally managed Redis solution.

**Timeline**
10:04 AM CT - First email alerts came in
10:11 AM CT - BenchPrep engineers started provisioning alternative services
10:21 AM CT - Redis services were fully restored

A downloadable copy of this report can be found here: https://drive.google.com/file/d/14p16oqEw_SAaFOz-cXc8CY6gIbmwJKzV/view?usp=sharing
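
The mitigation of keeping an alternative Redis instance provisioned implies client-side failover. Here is a minimal sketch with the redis-py client; the endpoints are placeholders, and production failover would normally live at the infrastructure layer rather than in application code.

```python
import redis  # third-party: pip install redis

# Placeholder endpoints: the managed primary and the internally
# managed standby the postmortem says will stay provisioned.
PRIMARY = {"host": "redis-primary.internal", "port": 6379}
STANDBY = {"host": "redis-standby.internal", "port": 6379}

def connect_with_failover():
    """Return a client for the first reachable Redis endpoint."""
    for endpoint in (PRIMARY, STANDBY):
        client = redis.Redis(socket_timeout=2, **endpoint)
        try:
            client.ping()  # cheap health check before handing it out
            return client
        except redis.exceptions.RedisError:
            continue
    raise RuntimeError("no Redis endpoint reachable")
```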

resolved

IBM has resolved database connectivity issues.

monitoring

Connectivity has been restored to our provider's database service. We will continue to monitor the situation.

identified

One of our backend database providers is experiencing a connectivity outage. We are provisioning an alternative and will attempt to migrate services.

investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Report: "Elevated API Errors"

Last update
postmortem

**Date:** 2019/07/31
**Date of Incident:** 2019/07/31
**Raised by:** BenchPrep
**Severity Level:** High

**Description**
BenchPrep monitoring triggered alerts that the API servers/containers had lost connectivity to database services. BenchPrep began looking into the issue and discovered that the PostgreSQL connection pooling services were unable to authenticate with the backend database. After ensuring the correct configuration was in place and restarting the impacted database pooling services, successful connections were established.

**Root Cause**
Our database connection pooling software (PgBouncer) ran into a known but rare bug. When new connections were opened, an extra connection parameter was sent. The version of Postgres we are running isn't compatible with that parameter. This resulted in the "invalid server parameter" error, and insufficient valid connections were available. Reference: https://pgbouncer-general.pgfoundry.narkive.com/lZPDYkqn/pgbouncer-1-1-released

**Resolution**
Restarting our connection pooling software caused new connections to the backend database to be re-established. This returned stability to the site.

**Mitigation Strategies**
We will be adding additional log-level alerts to preemptively warn when connection issues start arising. This should allow us to manually intervene before a catastrophic failure occurs.

**Timeline**
04:10 PM CT - First error notification came in
04:14 PM CT - API-based services reached a 100% error rate and the site went down
04:22 PM CT - Partial availability restored
04:25 PM CT - Services fully restored

A downloadable copy of this report can be found here: https://drive.google.com/file/d/1m_5EJsnJ8udRuk00bgJbSTtqTpVXrvPW/view?usp=sharing
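
The mitigation, log-level alerts that fire before connection errors become catastrophic, could be sketched as a log tailer. The log path and thresholds below are assumptions for illustration; only the "invalid server parameter" string comes from the root cause above.

```python
import time
from collections import deque

# Assumed log location and policy; the error string itself comes from
# the postmortem's root cause.
LOG_PATH = "/var/log/pgbouncer/pgbouncer.log"
ERROR_TEXT = "invalid server parameter"
ALERT_AFTER = 5        # errors ...
WINDOW_SECONDS = 300   # ... within five minutes

def tail_and_alert():
    """Tail the PgBouncer log and warn on repeated connection errors
    so operators can restart the pooler before connections run out."""
    timestamps = deque()
    with open(LOG_PATH) as log:
        log.seek(0, 2)  # jump to end of file, like `tail -f`
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            if ERROR_TEXT in line:
                now = time.time()
                timestamps.append(now)
                while timestamps and timestamps[0] < now - WINDOW_SECONDS:
                    timestamps.popleft()
                if len(timestamps) >= ALERT_AFTER:
                    print(f"ALERT: {len(timestamps)} '{ERROR_TEXT}' errors "
                          f"in the last {WINDOW_SECONDS}s")
```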

resolved

This incident has been resolved.

monitoring

We have identified an issue with database connectivity and have corrected it. We will continue to monitor the situation.

investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Report: "Outage - Redis"

Last update
resolved

This incident has been resolved.

monitoring

We rolled out Redis backup services. The platform should be operational now; we are continuing to monitor and verify it.

identified

We have switched to a competing Redis offering and will be deploying it shortly.

identified

While we are waiting for IBM, we are provisioning our backup services. Once they are up, we should be able to move fairly fast, and service should be restored shortly.

identified

Both Compose connectors available to us are not working. We have escalated the issue with IBM to make sure a new connection is available as soon as possible.

identified

The issue has been traced to missing connection strings related to the IBM Compose maintenance (https://status.compose.com/). We are updating them, which should bring services back shortly.

investigating

We have received an alert from our Redis cluster; this will cause degraded performance for backend services. We are investigating and working on remediation.

Report: "Degraded performance"

Last update
resolved

This incident has been resolved.

monitoring

The issue was identified and fixed, and we are currently verifying and monitoring.

investigating

We are currently seeing degraded performance in the database cluster.