Historical record of incidents for loader.io
Report: "Web app and test running service unavailable"
Last update: On Jan 2, 2023, the loader.io website, API, and test running services became unavailable due to an expired TLS certificate in a backend service that manages service credentials. The certificate had been renewed, but was not distributed to all servers that needed it. When the certificate expired, Loader's orchestration system was unable to read credentials for internal connections to databases and other services, and several services failed as a result, including the web interface and the test running and scheduling jobs.
Our team was not notified due to a separate failure of alerting systems, and the team was not in the office because the New Year's Day holiday was observed on the Monday after New Year's Day. As soon as a team member noticed the outage, service was restored by distributing the renewed certificate to the credential service and restarting the other failed services.
- Test settings, results, and other account information in the loader.io web interface were inaccessible.
- Tests that had been scheduled for Jan 2, 2023 were not run at their scheduled time, and instead would have run after service was restored, early on Jan 3, 2023.
- Some scheduled tests may have been scheduled twice in error when service was restored, due to retries and delays in processing the backlog of tests.
We are reviewing our automation and monitoring systems to ensure that critical systems are better automated and that our team receives alerts promptly.
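For readers who want to guard against the same failure mode (a certificate that was renewed but not deployed everywhere), the following is a minimal sketch of a per-server expiry check. It is not Loader's actual monitoring. The function names and the 14-day threshold are assumptions chosen for illustration, and only Python standard-library calls are used.

```python
import socket
import ssl
from datetime import datetime, timezone

RENEWAL_THRESHOLD_DAYS = 14  # hypothetical alert threshold, not from the report


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and a certificate's notAfter time.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    for example "Jan  2 00:00:00 2023 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_attention(not_after: str, now: datetime,
                    threshold_days: float = RENEWAL_THRESHOLD_DAYS) -> bool:
    """True if the certificate is expired or expires within the threshold."""
    return days_until_expiry(not_after, now) < threshold_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from the certificate a given server actually presents.

    With the default context, an already-expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which a caller should
    treat as an alert in its own right.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Running `fetch_not_after` against every server that should hold the certificate, rather than only the one where it was renewed, is what would surface a host still serving the old certificate.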
Report: "Main database unreachable"
Last update: This incident has been resolved.
A replica has been promoted after Loader's primary database failed. Systems are functional and we continue to monitor closely.
We are investigating an issue connecting to Loader's primary database.
Report: "loader.io website down"
Last update: This incident has been resolved.
Systems are getting back to normal; we continue to monitor closely.
We are investigating a problem with our website load balancers.
Report: "load test data collection errors"
Last update: This incident has been resolved.
We implemented a temporary fix last night, so scheduled tests should have run as expected. A permanent fix is now in place and all systems are operational. We will continue to monitor all systems closely.
We are investigating errors from our load test data collection service; systems are in maintenance mode while we address the underlying problem.
Report: "Bad Gateway errors"
Last update: All web traffic is stable, so we are marking this resolved.
A recent configuration change caused some of our web servers to stop responding. You may have seen HTTP 502 "Bad Gateway" errors from our web interface and API. The problem has been fixed and we will continue to monitor closely.
Report: "test queue delays"
Last update: This incident has been resolved.
The cause of test delays has been identified and fixed. We are continuing to monitor as the test queues catch up.
We are looking into an issue where some tests are not running.
Report: "Test delays"
Last update: Tests should be running normally now.
Some tests are being queued for longer than usual; we are investigating the cause of the slowdown.
Report: "some load tests not running"
Last update: Tests are running normally now.
One of our load generation machines stopped responding this morning and has caused a few tests scheduled on it not to run. It has been removed from our fleet of load generators and affected tests should start running. We will be monitoring closely to make sure the issue is resolved.
Report: "tests are being delayed"
Last update: Tests are now running normally.
Delayed tests are starting to run, and new tests should run as expected. All systems operational :)
We are currently investigating this issue.
Report: "https tests not sending correct number of requests"
Last update: We rolled back a recent deploy; https tests should be behaving normally again.
Some tests against https endpoints are not sending the correct number of requests. We are currently working to resolve this issue.
Report: "test results not live-updating"
Last update: Test results are now coming through and updating live as the test runs.
We are investigating an issue where test results do not update as the test runs. Tests are running and results do appear on page refresh.
Report: "Service down"
Last update: We've restored service.
We are currently experiencing an unexpected service outage. We are working on resolving the issue.
Report: "Networking issue"
Last update: On one of our nodes inside EC2, DNS was resolving EC2 hostnames to public IP addresses instead of internal ones, which prevented some of our internal systems from communicating properly. DNS is resolving properly again.
We are investigating a networking issue that is preventing some tests from running correctly.
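The DNS symptom described in this incident (internal hostnames resolving to public addresses) can be detected with a small check like the one sketched here. This is an illustration only, not Loader's tooling. The function names are hypothetical, and it assumes that internal hosts are expected to resolve exclusively to private-range addresses.

```python
import ipaddress
import socket


def resolve(hostname: str) -> list[str]:
    """Return the distinct addresses the local resolver gives for a hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def non_private(addresses: list[str]) -> list[str]:
    """Return the addresses that fall outside private ranges.

    For a hostname that should only be reachable internally, a non-empty
    result suggests the resolver is handing out public addresses.
    """
    return [a for a in addresses if not ipaddress.ip_address(a).is_private]
```

A node-level health check could call `non_private(resolve(name))` for a few known internal hostnames and raise an alert whenever the result is not empty.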
Report: "Database server reboot"
Last update: This incident has been resolved.
We're back up, but our queues are a little backed up, so tests may take a little longer to start for a bit.
The server reboot is going to take longer than anticipated. We should be back around 8:15 AM EDT.
Report: "load generator issues"
Last update: This incident has been resolved.
Load generation is performing normally, but we will keep monitoring to make sure the issue is resolved.
We are investigating an issue with our load generators causing a few tests to lose some results.
Report: "unplanned maintenance"
Last update: And we're back. Tests should be verifying and running as usual now.
Our web and API are down right now because of a deploy gone wrong. We're working on getting operational again as soon as possible.
Report: "stalled tests"
Last update: Some network issues around 10:20 AM EST caused a few tests to stall. Those tests have been aborted and systems are operational now.
We are investigating some stalled tests from the past hour.
Report: "isolated stalled tests"
Last update: A small number of users may have experienced the "preparing screen of death" intermittently over the last 12 hours, where a test shows a preparing message and even the "abort test" button can't get you out of it. This was caused by EC2 capacity issues at Amazon, combined with a bug in our handling of that error. A fix for the bug in our code has been deployed, and if you had a test stuck at the preparing screen, we have aborted it for you. Instead of the preparing screen, you should now see a message indicating that your test has been aborted. You can run the test again from there.