metapulse.com

Is metapulse.com Down Right Now? Check whether there is an ongoing outage.

metapulse.com is currently Operational

Last checked from metapulse.com's official status page

Historical record of incidents for metapulse.com

Report: "Website envisage.io currently down. Metapulse.com is not affected."

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

Report: "Website down"

Last update
postmortem

The main site [metapulse.com](http://metapulse.com) was down for a couple of hours this morning due to a misconfiguration during a server upgrade. We will ensure better testing is done after server upgrades to help prevent this in the future.

resolved

The main site metapulse.com was down for a couple of hours this morning due to a misconfiguration during a server upgrade. Everything should be functioning normally now.

Report: "Email not being sent"

Last update
postmortem

On July 13, 2023, [metapulse.com](http://metapulse.com) was unable to send email for several hours. This was due to the Amazon SES authentication keys being rotated while the servers were still using the original keys. The extended downtime was due to difficulty getting the servers to recognize the new keys. After the issue was resolved, we were able to resend most emails, including notifications and mailings. However, some emails were unrecoverable, including user registration and password reset emails. Because of this issue, we have taken steps to make MetaPulse less dependent on email: a new Data Exports page is now available, which lists all data exports and provides download links instead of relying on email.
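For illustration only, here is a minimal sketch (not the actual metapulse.com code) of sending a notification through Amazon SES with boto3, reading credentials from the environment at send time so that rotated keys can be picked up without baking the old ones into a long-running process. The sender address and environment variable usage are assumptions.

```python
import os
import boto3


def send_notification(to_address: str, subject: str, body: str) -> None:
    """Send a plain-text notification email via Amazon SES.

    A fresh client is created per call so that rotated credentials
    (read from the environment) take effect without a redeploy.
    """
    ses = boto3.client(
        "ses",
        region_name=os.environ.get("AWS_REGION", "us-east-1"),
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
    ses.send_email(
        Source="noreply@example.com",  # hypothetical sender address
        Destination={"ToAddresses": [to_address]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
```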

resolved

This issue has been fixed. We will be sending out missed emails shortly.

identified

We are continuing to work on fixing this issue. We will attempt to resend the unsent emails once this issue is fixed.

identified

We are continuing to work on a fix for this issue.

identified

Email is not being sent from our web server. We have identified the issue and are working to resolve it as soon as possible.

Report: "Degraded performance"

Last update
resolved

The job queue has caught up and everything is running smoothly.

monitoring

A fix has been put in place. You may still experience some degraded performance while the job queue catches up.

identified

The web site is running smoothly now; however, the background job queue is still backed up. We are working on a fix for this.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating what is causing degraded performance in the app.

Report: "Temporarily Down"

Last update
resolved

The deploy has finished and everything is back up and running.

identified

The site is temporarily down while we deploy a fix. It should be back up and running in a few minutes.

Report: "Partial Outage"

Last update
resolved

This issue appears to be resolved. If you are experiencing any further issues, please contact us. The issue was caused by an incompatibility between an upgraded library and a configuration unique to our production environment. Some changes will be made to bring our development and staging environments more in line with production to reduce the chance of this happening in the future.

monitoring

A fix has been deployed. We will continue monitoring to see if there are any further issues.

identified

The issue has been identified and we are currently deploying a fix.

investigating

Some features, such as data exports and setting graph values, aren't working. We are investigating the cause of this issue and will work on a fix shortly.

Report: "Partial outage"

Last update
resolved

This incident has been resolved.

identified

Some areas of the application, such as file uploads and editing Knowledge items, are not working. The issue has been identified and we are working on a solution.

Report: "App is Down"

Last update
resolved

This incident has been resolved. It was due to an incompatibility between a minor framework update and legacy data.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Database Outage"

Last update
postmortem

Today we had an unexpected performance outage of our main production database, which runs the envisage platform. This caused all of our services to go down, including the web interface and API. The issue was traced to a database query taking up to 90 seconds to complete; on investigation, this query had previously taken less than 75ms on average. We initially deployed some changes to the code to exclude this query from execution while we investigated. Unfortunately, other queries then started taking 30+ seconds to execute, causing the web application to become unresponsive and start timing out. Due to this, we took all web servers offline while we investigated further.

One thing that immediately jumped out at us was that the production database was doing a parallel index scan, while the same query run on our laptops against a copy of the production data used a sequential table scan. We thought this might be because the production database was still running on magnetic disk instead of SSD storage, which could have been causing the query planner to incorrectly predict the cost of a sequential table scan to be higher than that of an index scan. We took a backup and upgraded the production database to SSD storage. Unfortunately, this did not solve the problem: the query improved somewhat thanks to the SSD storage, but still took 80+ seconds, which was completely unacceptable.

After further investigation we found that if we temporarily disabled parallel index scan workers with `SET max_parallel_workers_per_gather = 0;`, the query time improved from 80 seconds to less than 75ms. This was the breakthrough that pointed us toward a fix. Our existing database was running PostgreSQL version 10, which introduced parallel workers, and the PostgreSQL version 11 release notes mention query planner improvements around the use of parallel workers. As the site was already down, we investigated upgrading to version 11. We had been running our staging environments on version 11 for some months and were confident the upgrade would work as planned, so we opted to bring it forward and upgrade the production database. Once the database was upgraded to version 11, the query went back to taking 75ms on average. At this point we re-enabled the web, API, and background workers for envisage, tested, and found service to be restored.

We apologise for the outage. From what we can tell, we hit a query-planner threshold related to the size of the table. We will investigate further to see how this sort of outage can be prevented in the future, and in the meantime we have upgraded the instance capacity and IOPS allocation of our database server.
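As an illustration of the diagnostic step described above, the sketch below (assuming psycopg2; the connection string is hypothetical and the query is a stand-in, not the actual envisage query) compares a query's EXPLAIN ANALYZE output with and without parallel workers enabled.

```python
import psycopg2

# Hypothetical DSN and a stand-in query; substitute the real slow query and
# production connection details to reproduce the comparison from the postmortem.
DSN = "dbname=envisage_production"
SLOW_QUERY = "SELECT count(*) FROM pg_class"


def explain(cur, disable_parallel):
    """Return EXPLAIN ANALYZE output, optionally with parallel workers disabled."""
    if disable_parallel:
        # Session-local setting, as used during the incident above.
        cur.execute("SET max_parallel_workers_per_gather = 0;")
    else:
        cur.execute("RESET max_parallel_workers_per_gather;")
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + SLOW_QUERY)
    return "\n".join(row[0] for row in cur.fetchall())


with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur:
        print("-- with parallel workers --")
        print(explain(cur, disable_parallel=False))
        print("-- parallel workers disabled --")
        print(explain(cur, disable_parallel=True))
```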

resolved

Confirmed that the database is now performing as expected. All systems are operational.

monitoring

We have deployed the update to the database server and everything appears to be working now. We will continue to monitor.

identified

We have finished the upgrade on the database instance and have found a further cause of the performance problem: parallel workers in the database query planner. We are upgrading from PostgreSQL v10 to v11 to take advantage of its parallel query performance improvements.

identified

The issue has returned, and we have taken the servers offline to perform an upgrade on our database server. No data has been lost.

monitoring

We have pushed an optimisation fix to production. Service has been restored, though performance is still degraded, and we are monitoring.

identified

We have identified the problem and are working to upgrade the database instance to fix the issue.

investigating

We are experiencing an issue with our database server: query times are exceptionally long.

Report: "Database Performance Issue"

Last update
resolved

This incident has been resolved by raising a memory limit in the database configuration.
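The update does not say which memory limit was raised. Purely as a hedged sketch, this is how a per-query memory setting such as PostgreSQL's work_mem could be raised and reloaded without a restart; the parameter name and value are assumptions, and ALTER SYSTEM requires superuser privileges.

```python
import psycopg2

# Hypothetical: the incident report does not name the setting; work_mem and
# the 64MB value are assumptions used only for illustration.
DSN = "dbname=envisage_production"

conn = psycopg2.connect(DSN)
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("SHOW work_mem;")
    print("before:", cur.fetchone()[0])
    cur.execute("ALTER SYSTEM SET work_mem = '64MB';")
    cur.execute("SELECT pg_reload_conf();")  # apply the change without a restart
    cur.execute("SHOW work_mem;")
    print("after:", cur.fetchone()[0])
conn.close()
```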

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Background workers not running"

Last update
resolved

The background job queue has caught up and is working normally.

monitoring

We have finished upgrading Redis, and the job queue is processing again.
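For anyone watching a backlog drain after this kind of Redis maintenance, here is a small sketch using redis-py; the queue key names and Redis location are assumptions, not the actual MetaPulse configuration.

```python
import redis

# Hypothetical queue keys; many job systems (Sidekiq, RQ, etc.) keep
# pending jobs in per-queue Redis lists.
r = redis.Redis(host="localhost", port=6379, db=0)

for queue in ("queue:default", "queue:mailers"):
    backlog = r.llen(queue)
    print(f"{queue}: {backlog} jobs waiting")
```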

identified

We are in the process of upgrading Redis to resolve this issue.

investigating

Our background workers are currently not processing jobs. No data has been lost.

Report: "Mail is not being sent"

Last update
resolved

Mail should be working again. Please let us know if you are experiencing any issues receiving email from envisage.

identified

We are still working on a solution for this mail outage. Thank you for your patience.

identified

The issue has been identified and a fix is being implemented.

investigating

There is an issue with our mail system that is preventing all email from being sent from envisage. We are currently investigating.

Report: "Slow or Unresponsive"

Last update
resolved

This incident has been resolved.

monitoring

After increasing the memory of the database server, the issue appears to be fixed.

identified

The database server was running out of memory, causing queries to slow down and back up.

investigating

We are continuing to investigate this issue.

investigating

The web server is sometimes slow or unresponsive to requests. We are currently investigating this issue.

Report: "Server issues preventing User log in"

Last update
resolved

The deployment finished and everything appears to be functioning normally again. Thank you for your patience.

identified

There is a bug in part of the code that slipped past our extensive test suite and was only triggered in production. We are deploying a fix now.

investigating

The issue is related to the background workers and the database. The main web site is unaffected. We are continuing to investigate.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.