Historical record of incidents for Pronto
Report: "Delayed notifications"
Last update: We have resolved the issue and notification times are back to normal.
We are currently experiencing longer than normal delays delivering new message notifications. We are actively investigating the source of the issue and will update as we learn more.
Report: "Delayed notifications"
Last update: We are currently experiencing longer than normal delays delivering new message notifications. We are actively investigating the source of the issue and will update as we learn more.
Report: "Intermittent issues with web app"
Last update: One of our hosting providers had intermittent problems resulting in errors in the Pronto web app. They have since identified and fixed the problem.
Report: "Database Maintenance"
Last update: We've wrapped up the loose ends on this database maintenance. Everything looks good and the Pronto platform is fully functional. Thanks for your patience.
The database maintenance is taking longer than expected. Service may be unreliable until it is complete. We apologize for the disruption; our team is working as quickly as possible to restore full service.
Report: "Database issues"
Last update: The issue has been fully resolved. Thanks for your patience.
A fix has been implemented and we are monitoring for stability.
There is a connectivity issue with our primary database. We are working urgently with our database vendor to understand the issue and get it fixed as soon as possible. We apologize for the issues and will send out another update as soon as we know more.
Report: "Pronto System Outage"
Last update: This incident has been resolved.
The system has recovered and is currently operational. We will continue to monitor the database and work to identify the cause.
We have cleared out the hung queries and performance seems to be back to normal. We will continue monitoring for the next little while to ensure that the problem is resolved.
Database queries have begun to hang for an unknown reason and have resulted in system downtime. We are investigating this right now with our database vendor and will update here as soon as we know more. We apologize for the downtime.
Report: "Real-time web sockets issue"
Last update: This incident has been resolved.
Our provider has implemented a fix and we are monitoring the results.
We are again seeing increased error rates and problems connecting to our real-time messaging service provider. They are investigating the issue.
Report: "Real-time web sockets issue"
Last update: A fix has been implemented and we are now seeing normal connection rates. Thanks for your patience. As we gather more details about what happened we will share them in a post-mortem.
Our provider has made strides in fixing the issue but it is not completely resolved yet. We continue to see sporadic connection issues. These issues may be resolved temporarily by refreshing your web browser or by restarting your app (on mobile). We will continue to post updates as we know more.
Overall things are working better, but we are still seeing sporadic connection issues. We continue to work with our provider to identify the root cause. We apologize for the disruption today. We will continue to post updates as we get them.
We are again seeing increased error rates and problems connecting to our real-time messaging service provider. This issue seems to be identical to the one we experienced yesterday. Our service provider has acknowledged the issue and is working on a resolution. Thank you for your patience.
Report: "Pronto system outage"
Last update: Our service provider has marked this issue as resolved. Our systems look good and everything is functioning as normal. Thank you for your patience.
Our service provider seems to have resolved the issue and Pronto is currently functional. We will continue to monitor the situation and await further updates from them until they have confirmed the solution.
We are again experiencing the same issue as before, causing a service disruption to Pronto. We will continue to monitor the situation and update when we have further information.
Our service provider seems to have resolved the issue and Pronto is currently functional. We will continue to monitor the situation and await further updates from them until they have confirmed the solution.
Our real-time messaging service provider is currently experiencing a major system outage. This event is also affecting Pronto. They are aware of the issue and are currently working to resolve it as quickly as possible.
Report: "Pronto system outage"
Last update: We are confident that the system is now fully back to normal. We will be working with our database vendor further to understand how to avoid this situation during future migrations. Thank you for your patience.
The issue appears to be that when the migration was performed some unexpected database locks caused certain queries to hang, blocking other queries from happening. We have cleared out the hung queries and performance seems to be back to normal. We will continue monitoring for the next little while to ensure that the problem is resolved.
After the team performed a routine database migration, queries began to hang for an unknown reason, resulting in system downtime. We are investigating this right now with our database vendor and will update here as soon as we know more. We apologize for the downtime.
Report: "Slow performance and request failures"
Last update: We identified a rarely occurring slow database query that caused a cascading effect on the database. We have restored the production database performance and are in the process of testing a permanent fix.
We are currently investigating an issue where requests are failing. We will update this incident as we learn more.
Report: "Real-time web sockets issue"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
Performance seems to be improving. We are continuing to monitor and will post any further updates until the issue is resolved.
The vendor has acknowledged there's an error on their side and they're working to resolve it.
Our websocket provider is currently experiencing an outage that is affecting sending, receiving, and loading messages. Customers may see prolonged progress spinners and message send errors. We are in contact with our vendor and will provide updates here as we get them.
Report: "Message send errors"
Last update: The vendor has updated their SSL cert and all is back to normal.
The SSL certificate of one of our vendors expired. We are contacting them so they can update it.
Attempting to send a message is resulting in a server error. Meetings may also be affected. The team is investigating.
Report: "Failed web deployment"
Last update: The deployment has been successfully rolled back. We will be taking a look at what happened before re-deploying.
We are currently performing a rollback to the previous version which will take a few minutes.
During an attempt to deploy new changes to the Pronto web app, there was a failure resulting in blank screens and loading errors. We are currently investigating. The mobile app and APIs are not affected.
Report: "Real-time web sockets issue"
Last update: This incident has been resolved.
We are seeing some elevated error rates once again and are working with our vendor.
Our websockets vendor has implemented a fix and we are monitoring the results. Right now the system appears to be back to normal.
We are continuing to investigate this issue.
Our websocket provider is currently experiencing an outage that is affecting sending, receiving, and loading messages. Customers may see prolonged progress spinners and message send errors. We are in contact with our vendor and will provide updates here as we get them.
Report: "Notification delay"
Last update: Since our last update we have been carefully monitoring notification delays and the numbers look great and have stayed that way for about 4 hours now. We believe that this issue is now fully resolved and will close this incident. The root cause was that a database upgrade appears to have reset some of the optimizations that were used to ensure fast database access. After reconfiguring the database with the proper optimizations, performance returned to normal levels.
After making some more database optimizations all our numbers are looking good again. We will continue monitoring for the next few hours, but for now notifications are back to normal delivery times.
We are once again experiencing delayed notifications, due to some slow database queries. We are working again with our vendor to investigate and will update here as we learn more. We sincerely apologize for the ongoing issue. The team is working diligently to solve this issue for good.
Report: "Delayed notifications"
Last update: Notifications have returned to normal after the fixes made by our database vendor. We will continue to monitor notification delay over the next 24 hours to ensure no further issues. Thanks again for your patience.
Notification delivery times have returned to normal after our database vendor applied some optimizations. Initial findings indicate a positive impact on all affected database operations. We will continue to monitor the results. Thank you for your patience.
Overall notification delays have improved, but we are still seeing some periodic spikes. We continue to investigate and will update here again once we know more. Thank you for your patience.
We are currently experiencing issues with notifications being delayed. We believe this is due to a database being slower than normal after an upgrade. We are working with our database vendor to identify and solve the issue. The rest of the Pronto platform is working normally.
Report: "Platform issues"
Last update: After reviewing logs and metrics, all evidence points to a temporary, transient network issue within AWS that lasted from 16:53 - 17:00 UTC. We have opened a case with AWS support and will update with a post-mortem if there are any changes to our assessment.
We are continuing to investigate this issue.
The Pronto service is now back up. We are still investigating the cause.
Most requests are currently failing to the backend API servers. We are actively investigating and will update here once we know more.
Report: "Database issue"
Last update: The Pronto platform is back to normal. We will be analyzing data from this incident over the coming days to understand what went wrong and how to better prevent a similar event in the future.
Some of the affected components have been restored and we are starting to see some improvement. The Pronto service is back up, but may have degraded performance for a little while as the other components are restored.
As mentioned in the previous update, the issue is due to some underlying hardware failures. Our database vendor is working to move the affected components to new hardware.
Our database vendor has identified a hardware issue in the database cluster and is working to remediate it.
Our backend database provider has acknowledged an issue on their platform and is working to diagnose it. We will continue to update here as soon as we know more.
We have been alerted to an issue with our backend database that is causing a system outage. We are currently investigating.
Report: "Degraded performance on user queries"
Last update: Our workarounds have been deployed successfully and performance has returned to normal. Both User Search and User Count endpoints have been re-enabled and all Pronto functionality is restored. At this point we believe the root cause was a database-level bug. We are working with our database vendor to confirm the bug and get a permanent fix. In the meantime, now that we know how to work around it, we expect no further problems from this.
We are continuing to work on a fix for this issue. We have deployed some workarounds and seen some promising results. We are working to identify and address the remaining slow areas using a similar approach. We will provide another update after testing and deploying those additional workarounds.
We've identified the very slow queries and are attempting some workarounds to speed them up. It is still unclear what the root cause of the slow queries is, but we are hopeful that these workarounds will resolve the immediate issues, get performance back to normal, and allow us to re-enable all endpoints.
We have temporarily disabled two endpoints that are causing the issue in order to protect the rest of the application. These two endpoints are:
1. User search - Attempting to search for a user in any context will not work. This includes starting a new DM (existing DMs are unaffected), adding new users to a group, or searching for users in org management.
2. User counts - In org management the overall user count will not be displayed.
We apologize for this loss of functionality, but deemed it necessary in order to prevent further issues across the Pronto app. We are working with our database vendor directly to diagnose and solve this issue as quickly as possible. Thanks for your patience.
We are currently seeing degraded performance on user-related queries in the main Pronto database. User searches in org management and also in the client apps are currently taking a long time or timing out. User online status is also affected and may not be reflecting the correct state right now. We are actively investigating and will update as soon as we know more.
Report: "AWS outage"
Last update: All Pronto services are now back to normal. Push notifications are now being delivered in real-time and other async jobs such as URL previews are speedy once again. Canvas integration has also been re-enabled. Canvas course syncing will need some time to catch up, but should be up to date for all customers within the next 6 hours. Thank you for your patience today. We will spend some time analyzing this event to see what changes we can make to be more resilient to a similar failure in the future.
AWS has implemented their root cause mitigation plan and core Pronto services are once again working well. We are still experiencing some minor latency with push notifications as scaling on that service has not yet been restored by AWS engineers. Canvas integration is also still disabled for the same reason. We are hopeful that these issues will both be resolved quickly.
As AWS starts to see significant recovery, we also are seeing some Pronto services scaling up again. Push notifications are still delayed, but response times are improving on the core Pronto services. Canvas integration is still disabled. We will continue to provide updates as services recover.
We just saw a major increase in traffic from an integration platform, perhaps as it itself was recovering. This caused our small cluster to get overloaded. To mitigate we have temporarily disabled the Canvas integration platform until we are once again able to scale the Pronto services. This mitigation appears to have worked and Pronto core services are back up, albeit with slower response times than normal.
We are continuing to work on a fix for this issue.
As expected, traffic increases finally pushed Pronto over the edge and we are now experiencing a system-wide outage due to our inability to scale because of the AWS outage. We will continue to do whatever is within our power to bring Pronto back up. We sincerely apologize for the disruption we know this is causing you.
AWS says they are starting to see some signs of recovery, but do not have an ETA for full recovery at this time. We have tried various ways to scale Pronto servers, but because AWS internal APIs are failing this has not been successful. Thus, Pronto is currently running on less than half the capacity we normally would at this time of day. Push notifications continue to be delayed, and general response times are increasing. We expect that if AWS has not recovered their services in the next hour we will start to see much higher latency and an increase in error rates on Pronto core services. We will continue to explore alternatives in the meantime and will keep you up to date. Thanks for your patience.
AWS has identified the root cause and is working towards recovery. Pronto core services are still running smoothly for now (except for delays in push notifications and other async jobs as noted in the last update), but because of the outage we are unable to automatically or manually scale up our servers as we normally would. As traffic increases in the next couple of hours this could result in slower response times across Pronto services. We are investigating alternative ways to scale up our servers in the meantime and will continue to keep you updated.
There seem to be problems in the us-east-1 AWS region resulting in some services being slow or having increased error rates. Core Pronto services are not currently impacted, but push notifications and other async jobs such as URL previews may be delayed. We are monitoring the situation and will post updates as we learn more. AWS status is available here: https://status.aws.amazon.com/
Report: "Database node failures"
Last update: A postmortem
Just after 8:00pm MDT on Sep 26th, the Pronto database cluster had a simultaneous failure on multiple nodes. The Pronto database cluster is designed to automatically withstand the loss of individual nodes in succession, but not the simultaneous loss of multiple nodes, as happened in this case. The Operations team at Pronto immediately engaged multiple avenues of support at both our hosting provider and our database vendor. In the meantime they also prepared to perform a full database restore (something that is tested regularly, including one last week). After some time, our hosting provider alerted us to an underlying service failure on their part that resulted in the node failures. They worked to restore services, but this took several hours. After our hosting provider’s fix, Pronto services began to come back online at about 1:15am MDT on Sep 27th and were working normally by 1:30am. We are extremely sorry for the disruption to Pronto services. We will learn from this incident and work towards improving our services to be more resilient to underlying failures like this in the future.