Historical record of incidents for AskNicely
Report: "AskNicely returns Error 500"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Corporate website DNS outage"
Last update: This incident has been resolved. If you visited our corporate website during the incident and are still experiencing an issue, please try flushing your DNS cache: https://blog.hubspot.com/website/flush-dns
An issue with our corporate website (asknicely.com) has been identified and a fix has been initiated. This did not directly affect the AskNicely application but may impact users who start by clicking the Sign In link on our home page. As a workaround, those users may instead visit https://start.asknice.ly/findlogin/
Report: "Application Unavailable for some US customers"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating unavailability in our application for some US customers.
Report: "Some US hosted customers are currently getting 404 and 500 errors when trying to access AskNicely"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Investigating increased error rates for US clients"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Investigating increased error rates for US clients."
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Investigating increased error rates for US clients."
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Issue reported with the dashboard response widget and workflows page for some users."
Last update: This incident has been resolved. If you experience any issues, please clear your browser cache. If that does not resolve the issue, please contact AskNicely Support.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Surveys not loading for some users"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have identified a possible cause of surveys not displaying to respondents and are currently deploying a change for remediation.
We are investigating reports of surveys not loading.
Report: "Tenants hosted in Australia may be experiencing issues with loading their main dashboard page."
Last update: This incident has been resolved.
We are currently investigating unavailability in our application.
Report: "Elevated 404 error rates"
Last update: This incident has been resolved.
Report: "Brief period of unavailability"
Last update: We think we have a good understanding of what caused the incident, and do not anticipate any more problems. Sorry for any inconvenience.
We are currently monitoring a recent period of instability in our storage systems. Some AskNicely accounts were unavailable, but they seem to have now recovered. We continue to monitor the situation.
Report: "Site response performance problem"
Last update: The issue has been resolved.
We have identified an issue causing slow responses from the AskNicely application for our customers based in the US datacenter. We've taken steps to move load off the affected systems, and we expect response times to now be recovering to normal.
Report: "AskNicely Site 500 error"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
We have restarted Nginx and all metrics are now nominal. We will investigate further and continue to monitor. We have an error message that most likely requires a small Nginx configuration change to prevent this happening in the future. Engineers were alerted within 1 minute of the first 500 errors. The site was restored to health in approximately 10 to 15 minutes. Sorry for this outage. Your friendly engineering team.
We are continuing to monitor for any further issues.
We are seeing 500 server errors related to NGINX. We are investigating and monitoring.
Report: "502 Error US Datacenter"
Last update: We’ve identified endpoints that were not properly rate limited and that caused infrastructure issues when receiving a high volume of traffic. We’re working on rolling out better rate limiting coverage to prevent further outages.
We have now resolved this incident and identified the cause. The engineering team are now doing a postmortem of the event to prevent this happening in the future.
We are continuing to monitor for any further issues.
We are now monitoring the situation, and all our monitoring tools are reporting that the system is operating within expected parameters.
We have identified the source of the problem that has been causing an exceptionally high load.
We have seen some performance issues that are causing some 502 and 504 errors. We are working hard to see where these are occurring, and we will update this as we continue to look for the root cause. All alert systems are operating as expected, and we are now going through our platform monitoring tools.
We've rolled out changes to try to resolve issues accessing AskNicely, and are monitoring current status.
We are currently investigating a 502 error.
Report: "502 Error USA Data Center"
Last update:

## The 502 error

Today a number of customers may have experienced a 502 error and were not able to access the AskNicely platform. We are super proud of the platform we have built, and when we let our customers down, it really hurts; we know we need to do a better job. We are sorry you were not able to access our platform. Very sorry. We have a fantastic engineering team, and over the next week we will be focusing on our infrastructure to help minimise outages like the one you may have seen today.

## What went wrong

AskNicely is built on AWS (Amazon), an amazing platform that allows us to scale our solution very easily. Today we hit an issue with extremely heavy load on our USA database server (RDS). The symptoms we saw:

* 502 error rates.
* Load balancer errors: 'unhealthy web server in load balancer pool'.
* Database load in RDS going from under 5% to 100% in a matter of seconds. Very abnormal.
* Our 502 error page did not tell our customers what was happening, nor link to our status page. Bad.

## What went right

We have extensive monitoring on AskNicely, and some fantastic services that we love kicked in as soon as they detected something abnormal. The services we use today:

* [PagerDuty.com](http://PagerDuty.com): We love PagerDuty for alerting via the mobile app, email, SMS and automated phone calls, with auto escalation policies to other team members.
* [Datadog.com](http://Datadog.com): Provides us with detailed metrics around our application performance and servers. We send a massive amount of data back to Datadog, and it's a valuable asset that we use for real time monitoring and debugging.
* [Loggly.com](http://Loggly.com): All our log files and error logs are managed in Loggly. We can easily visualise and quantify requests from customers in seconds using their powerful log query tool.
* [NewRelic.com](http://NewRelic.com): Provides incredibly detailed analysis of which parts of our application are being used the most, how well that code is performing and which part of the code is the slowest. It also monitors how long our application is taking to load for our customers. We absolutely love NewRelic, and it is our litmus test to see whether our code changes have resolved our issues or not.
* [Slack.com](http://Slack.com): Makes it easy for our team to stay on the same page and communicate instantly, no matter where we are in the world.
* [Statuspage.io](http://Statuspage.io): You can find a link to our status page from the [www.asknicely.com](http://www.asknicely.com) homepage and our 404 pages.

## What we discovered

During this time, we came under a very heavy API load from one customer. Normally our API rate limiter would kick in and prevent any single customer from causing an outage. But due to the size of this customer's dataset, our API was too slow to respond to all their requests, causing massive congestion. Our API rate limiter is tuned for the number of requests, not the time taken to process a request.

## What we did

We have a number of strategies that we use to scale our platform. One strategy allows us to move a single customer from one database host (RDS instance) to another. Once we isolated the issue, this customer was moved to their own database instance. The AskNicely application instantly became responsive and all our server metrics returned to what we would consider normal parameters. We have also worked on several bottlenecks, including:

* Autoscaling our primary USA database server: we have tripled the capacity of this server, in size and dedicated IOPS.
* We have scaled up our Redis instance 6x. It provides us with a powerful and fast caching service for parts of the application.
* We have changed several variables on our RDS instance to allow higher loads.
* We have added another application server to the server pool.

## What we are planning to do

* Add detailed API monitoring: time, frequency, tenant and database.
* Improve our API rate limiter.
* Refactor the API code that caused us issues, and most likely refactor a particular query that caused the heavy load on our database.
* Provide a way to gracefully degrade AskNicely so that core/key services are not affected.
* Improve our 502 error page to link to our status page so we can get our customers more timely updates.

Again, we are sorry, and we are working hard to rectify these issues.

John // CTO and co-founder AskNicely
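To illustrate the gap described under "What we discovered" (a limiter tuned for request count rather than processing time), here is a minimal sketch of a per-tenant limiter that budgets seconds of processing time in a sliding window. It is not AskNicely's implementation; the class name `TimeBudgetLimiter`, its parameters, and the tenant identifiers are all hypothetical and chosen only for illustration.

```python
import time
from collections import defaultdict, deque


class TimeBudgetLimiter:
    """Hypothetical limiter: caps each tenant's processing time per window.

    A request-count limiter treats a 10 ms call and a 10 s call the same.
    This sketch instead charges each tenant for the seconds their requests
    actually took, so a few very slow requests can exhaust the budget.
    """

    def __init__(self, budget_seconds, window_seconds, clock=time.monotonic):
        self.budget = budget_seconds    # allowed processing seconds per window
        self.window = window_seconds    # sliding window length in seconds
        self.clock = clock              # injectable clock, useful for testing
        self._usage = defaultdict(deque)  # tenant -> deque of (finished_at, cost)

    def _spent(self, tenant):
        """Drop entries older than the window and sum what remains."""
        now = self.clock()
        entries = self._usage[tenant]
        while entries and now - entries[0][0] >= self.window:
            entries.popleft()
        return sum(cost for _, cost in entries)

    def allow(self, tenant):
        """Return True if the tenant still has processing time left."""
        return self._spent(tenant) < self.budget

    def record(self, tenant, cost_seconds):
        """Charge a finished request's processing time to the tenant."""
        self._usage[tenant].append((self.clock(), cost_seconds))
```

In use, a request handler would call `allow(tenant)` before doing the work and `record(tenant, elapsed)` afterwards. A tenant with a large dataset and slow queries would then be throttled after a handful of expensive calls, while other tenants are unaffected.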
This issue is now resolved. We have identified the root cause and made several changes to rectify these issues. We will continue to monitor over the next several days.
We are continuing to monitor. We have made a significant change that appears to rectify the issue. Again, we are monitoring this and we will do a debrief today.
We have identified an issue and are now monitoring.
We are investigating a 502 error on the US datacenter. We have several engineers looking into the issue.
Report: "Database Issues"
Last update: All AskNicely services are back to normal.
AskNicely is back to being fully operational. We are monitoring for any continued irregular activity.
We have noticed irregular database activity and are performing emergency maintenance to resolve the issues. Services will be partially offline.
Report: "Application Unavailability"
Last update: We experienced an application outage starting at 11:20 AM due to a database issue. This incident was resolved within an hour.
Report: "Application Unavailability"
Last update: We experienced a brief period of application unavailability due to malicious requests causing high server load. This issue has been resolved.
Report: "404 errors for tenants hosted in Australia and Europe"
Last update: The incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating the issue.