Historical record of incidents for metapulse.com
Report: "Website envisage.io currently down. Metapulse.com is not affected."
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Report: "Website down"
Last update: The main site [metapulse.com](http://metapulse.com) was down for a couple of hours this morning due to a misconfiguration during a server upgrade. We will ensure better testing is done after server upgrades to help prevent this in the future.
The main site metapulse.com was down for a couple of hours this morning due to a misconfiguration during a server upgrade. Everything should be functioning normally now.
Report: "Email not being sent"
Last update: On July 13, 2023, [metapulse.com](http://metapulse.com) was unable to send email for several hours. This was due to the Amazon SES authentication keys being rotated while the servers were still using the original keys. The extended downtime was due to difficulty getting the servers to recognize the new keys. After the issue was resolved, we were able to resend most emails, including notifications and mailings. However, some emails were unrecoverable, including user registration and password reset emails. Because of this issue, we have taken steps to make MetaPulse less dependent on email. A new Data Exports page is now available, which lists all data exports and provides download links instead of relying on email.
This issue has been fixed. We will be sending out missed emails shortly.
We are continuing to work on fixing this issue. We will attempt to resend the unsent emails once this issue is fixed.
We are continuing to work on a fix for this issue.
Email is not being sent from our web server. We have identified the issue and are working to resolve it as soon as possible.
Report: "Degraded performance"
Last update: The job queue has caught up and everything is running smoothly.
A fix has been put in place. You may still experience some degraded performance while the job queue catches up.
The website is running smoothly now; however, the background job queue is still backed up. We are working on a fix for this.
We are continuing to investigate this issue.
We are currently investigating what is causing degraded performance in the app.
Report: "Temporarily Down"
Last update: The deploy has finished and everything is back up and running.
The site is temporarily down while we deploy a fix. It should be back up and running in a few minutes.
Report: "Partial Outage"
Last update: This issue appears to be resolved. If you are experiencing any further issues, please contact us. This issue was caused by an incompatibility between an upgraded library and a configuration unique to our production environment. Some changes will be made to bring our development and staging environments more in line with production to reduce the chance of this happening in the future.
A fix has been deployed. We will continue monitoring to see if there are any further issues.
The issue has been identified and we are currently deploying a fix.
Some features, such as data exports and setting graph values, aren't working. We are investigating the cause of this issue and will work on a fix shortly.
Report: "Partial outage"
Last update: This incident has been resolved.
Some areas of the application, such as file uploads and editing Knowledge items, are not working. The issue has been identified and we are working on a solution.
Report: "App is Down"
Last update: This incident has been resolved. It was due to an incompatibility between a minor framework update and legacy data.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Database Outage"
Last update: Today we had an unexpected performance outage of our main production database, which runs the envisage platform. This caused all of our services to go down, including the web interface and the API.

The issue was traced to a database query taking up to 90 seconds to complete. On investigation, this query had previously been taking less than 75ms to complete on average. We initially deployed some changes to the code to exclude this query from execution while we investigated; unfortunately, other queries then started taking 30+ seconds to execute, causing the web application to become unresponsive and time out. Because of this, we took all web servers offline while we investigated further.

One thing that immediately jumped out at us was that the production database was doing a parallel index scan, while running the query on our laptops against a copy of the production data used a sequential table scan. We thought this might be because the production database was still running on magnetic disk instead of SSD storage, which could have been causing the query planner to incorrectly estimate the cost of a sequential table scan as higher than that of an index scan. We took a backup and upgraded the production database to SSD storage. Unfortunately this did not solve the problem: the query sped up somewhat thanks to the SSD storage, but still took 80+ seconds to run, which was completely unacceptable.

After further investigation we found that if we temporarily disabled parallel index scan workers by setting `SET max_parallel_workers_per_gather = 0;`, the query time improved from 80 seconds to less than 75ms. This was the breakthrough that pointed us to a fix. Our database was running PostgreSQL version 10, which introduced parallel workers, and the PostgreSQL version 11 release notes mention improvements to the query planner with regard to using parallel workers. As the site was already down, we investigated upgrading PostgreSQL to version 11. We had been running our staging environments on version 11 for some months and were confident the upgrade would work as planned, so we opted to bring it forward and upgrade the database. Once the database was upgraded to version 11, the query went back to taking 75ms on average.

At this point we re-enabled the web, API and background workers for envisage, tested, and found service to be restored. We apologise for the outage; from what we can tell, we hit a query-planning threshold related to the size of the table. We will investigate further to see how this sort of outage can be prevented in the future, and in the meantime we have upgraded the instance power and IOPS allocation of our database server.
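For anyone wanting to reproduce this kind of diagnosis, here is a minimal sketch of the session-level checks described above. The table name, columns and query below are hypothetical placeholders; only `max_parallel_workers_per_gather` and the use of `EXPLAIN` reflect the incident itself.

```sql
-- Inspect the plan and timing of the slow query; a "Parallel Index Scan" node
-- in the output means the planner has chosen to use parallel workers.
-- (The table and WHERE clause below are hypothetical placeholders.)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE account_id = 42 ORDER BY created_at DESC LIMIT 100;

-- Check how many parallel workers the planner may use per Gather node.
SHOW max_parallel_workers_per_gather;

-- Disable parallel query for this session only, then re-run the EXPLAIN above
-- to compare timings. This matches the temporary workaround used during the
-- incident and does not persist beyond the current session.
SET max_parallel_workers_per_gather = 0;
```

Disabling parallelism in a session is only a diagnostic step; the lasting fix in this incident was the upgrade to PostgreSQL 11 and its improved query planning around parallel workers.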
Confirmed the database is now performing as expected. All systems operational.
We have deployed the update to the database server and all seems to be working now. We will continue to monitor.
We have finished the upgrade on the database instance and have found a further cause of the performance issue, related to parallel workers in the database query planner. We are upgrading from v10 to v11 of PostgreSQL to take advantage of parallel query planner improvements.
The issue has returned, and we have taken the servers offline to perform an upgrade on our database server. No loss of data has occurred.
We have pushed an optimisation fix to production. Service is restored, though performance remains degraded, and we are monitoring.
We have identified the problem and are working to upgrade the database instance to fix the issue.
We are experiencing an issue with our database server having exceptionally long query times.
Report: "Database Performance Issue"
Last update: This incident has been resolved by raising a memory limit in the database configuration.
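The update does not say which memory limit was raised, so the following is an illustrative sketch only, assuming PostgreSQL (as in the earlier database outage) and hypothetical values, of how such limits are typically changed:

```sql
-- Illustrative only: the incident does not specify which setting was raised.
-- work_mem caps the memory each sort or hash operation may use before
-- spilling to disk; shared_buffers sizes the shared page cache.
ALTER SYSTEM SET work_mem = '64MB';
ALTER SYSTEM SET shared_buffers = '4GB';
SELECT pg_reload_conf();  -- work_mem applies on reload; shared_buffers
                          -- requires a server restart to take effect.
```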
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Background workers not running"
Last update: The background job queue has caught up and is working normally.
We have finished upgrading Redis, and the job queue is processing again.
We are in the process of upgrading Redis to resolve this issue.
Our background workers are currently not processing jobs. No data has been lost.
Report: "Mail is not being sent"
Last update: Mail should be working again. Please let us know if you are experiencing any issues receiving email from envisage.
We are still working on a solution for this mail outage. Thank you for your patience.
The issue has been identified and a fix is being implemented.
There is an issue with our mail system that is preventing any email from being sent from envisage. We are currently investigating.
Report: "Slow or Unresponsive"
Last update: This incident has been resolved.
After increasing the memory of the database server the issue appears to be fixed.
The database server was running out of memory, causing queries to slow down and back up.
We are continuing to investigate this issue.
The web server is sometimes slow or unresponsive to requests. We are currently investigating this issue.
Report: "Server issues preventing User log in"
Last update: The deployment finished and everything appears to be functioning normally again. Thank you for your patience.
A bug in part of the code bypassed our extensive test suite and was only triggered in production. We are deploying a fix now.
Issue related to Background Workers and Database. Main Web Site unaffected. We are continuing to investigate.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.