Historical record of incidents for Sleuth
Report: "Website down"
Last update: We experienced minor issues during the deploy of a new version, which caused our website and services to be unavailable for a period of 4 minutes.
Report: "Website down"
Last updateWe experienced minor issues during the deploy of a new version, which caused our website and services to be unavailable for a period of 4 minutes.
Report: "Service unavailability"
Last update: We experienced service unavailability for a period of 7 minutes, caused by network configuration issues.
Report: "Reduced performance across the board"
Last update: Why, oh why is 'ANALYZE VERBOSE' not automatically run on a major Postgres upgrade :( (we are all green across the board, and even faster, if you can believe it)
A fix has been implemented and we are monitoring the results.
We identified a database issue related to the recent upgrade, and performance seems to be returning to normal. We will continue to monitor the situation.
We are currently investigating this issue.
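The update above refers to running ANALYZE after a major Postgres upgrade, since pg_upgrade does not carry planner statistics across versions. Below is a minimal sketch of what such a post-upgrade step could look like, assuming psycopg2 and a placeholder DSN; it is not Sleuth's actual tooling.

```python
# Hypothetical post-upgrade step: rebuild planner statistics after a major
# Postgres upgrade, since pg_upgrade does not migrate them automatically.
import psycopg2


def refresh_statistics(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # run ANALYZE outside an explicit transaction block
    try:
        with conn.cursor() as cur:
            # ANALYZE VERBOSE rebuilds table statistics for the whole database
            # and logs progress; without fresh statistics the planner can pick
            # very slow plans, which matches the degraded performance above.
            cur.execute("ANALYZE VERBOSE")
    finally:
        conn.close()


if __name__ == "__main__":
    refresh_statistics("dbname=app host=localhost")  # placeholder DSN
```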
Report: "Delayed deploy and impact processing"
Last update: This incident has been resolved.
The underlying infrastructure issue has been resolved and Sleuth is again fully operational. We're still actively monitoring the situation.
The underlying AWS infrastructure problem was identified. To ensure data consistency we will delay processing of deploy and impact data until the issues are resolved.
We are currently experiencing a degradation of service due to infrastructure network issues. Please stand by as we investigate a resolution.
Report: "Performance Degraded"
Last update: This incident has been resolved.
We are currently investigating degraded performance of the Sleuth application.
Report: "Degraded Application Performance"
Last update: This incident has been resolved.
The web application and deploy processing are no longer experiencing a performance degradation and we are actively monitoring them.
The Sleuth application is still experiencing delays in deploy processing and slower-than-normal load times. We have mitigated one root cause and are still investigating the continued performance issues.
We are currently investigating degraded performance of the Sleuth website and deploy processing.
Report: "Deploys incorrectly marked as rolled back"
Last update: We have corrected data for all impacted rollback deploys.
We are working to correct the incorrectly marked rollback deploys.
A bug in the system caused deploys processed between 2023-02-02 22:31 UTC and 2023-02-03 09:24 UTC to be incorrectly marked as rolled back. The team is working on correcting the affected deploys.
Report: "Immediate Session Expiration: All Users"
Last update: This morning, a security attack vector was discovered by a paid independent researcher. There is no evidence that this attack vector has been exploited. This has been addressed by our team and the vector has been closed at this time. Out of an abundance of caution, we have logged out all Sleuth users. The only action required from you is logging back in the next time you access Sleuth. Thank you for your understanding and continued trust. Please reach out to us if you have any questions regarding this matter.
Report: "We're experiencing a delay in actions processing"
Last update: Action execution has been running normally and has completely stabilized. The incident has been resolved.
We have stabilized the execution of affected actions. We are continuing to monitor the performance, but you should be seeing normal behavior with action execution and Slack messages.
We have identified the issue causing a slowdown in the following areas:
* Sleuth actions evaluations
* Slack message delivery
* PR locking
You will experience delayed Sleuth actions evaluations, Slack message delivery, and PR locking. All actions are still being registered and will be executed at a later time. No deploy data is being lost.
Report: "Impact collection is delayed"
Last update: Impact collection has been running normally and has completely stabilized. The incident has been resolved.
We have stabilized impact collection and it's running normally now.
We are currently investigating an issue causing us to collect impact at a delayed rate.
Report: "We're experiencing a delay in detecting deploys from CI/CD integrations"
Last update: Deploy detection via CI/CD integrations is now operating normally. We've implemented a workaround that allows us to mitigate this kind of issue going forward, and the provider has completed their maintenance.
We've identified an issue with processing deploys from CI/CD integrations. One of our supported CI/CD providers is undergoing maintenance, which revealed an issue with how we handle this situation. You will experience delayed deploy detection through CI/CD providers while we mitigate the issue. Webhook deploy processing is still functioning, but is also slightly delayed.
Report: "We're seeing site-wide slowdowns, we're investigating an increase in DB operations"
Last update: This incident has been resolved.
We've identified the problem and remediated it. A few very long-running queries got stuck in our database, which had negative follow-on effects. We've killed the offending queries and removed the code that triggered them. We'll be following up with a change to prevent this kind of issue in the near future.
We are seeing slowdowns related to increased queries against our database. We are investigating the cause.
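The remediation described above (killing stuck long-running queries) is typically done via pg_stat_activity. A minimal sketch, assuming psycopg2 and a hypothetical 10-minute threshold; this is not Sleuth's actual remediation script.

```python
# Hypothetical sketch: find queries that have been running longer than a
# threshold and terminate the backends holding them.
import psycopg2

LONG_RUNNING = """
    SELECT pid, now() - query_start AS runtime, query
      FROM pg_stat_activity
     WHERE state = 'active'
       AND now() - query_start > interval '10 minutes'
       AND pid <> pg_backend_pid()
"""


def kill_stuck_queries(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(LONG_RUNNING)
            for pid, runtime, query in cur.fetchall():
                print(f"terminating pid={pid} runtime={runtime}: {query[:80]}")
                # pg_terminate_backend ends the backend running the stuck query
                cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```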
Report: "Impact tracking has been suspended for a short time"
Last update: We have re-enabled impact collection for all.
We are seeing some issues related to collecting impact. We've temporarily suspended impact collection and will re-enable it within a few hours.
Report: "We are seeing some issues with our Redis instance"
Last update: We're back to fully operational.
We identified the issue. Our background tasks were creating keys that weren't being cleaned up and eventually chewed up most of our storage. We've cleared those keys and are putting in place a way to stop this from happening moving forward. The service is now restored to normal and we are monitoring.
Users may see some sporadic errors and impact processing may be delayed. We are investigating the issue.
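The root cause above (background-task keys that were never cleaned up) is commonly avoided by giving such keys an expiry. A minimal sketch with redis-py, using hypothetical key names and TTL values; not Sleuth's actual task code.

```python
# Hypothetical sketch: write background-task bookkeeping keys with a TTL so
# Redis reclaims them automatically instead of letting them accumulate.
import redis

r = redis.Redis(host="localhost", port=6379)


def record_task_result(task_id: str, payload: bytes) -> None:
    # ex=3600 expires the key after one hour; without an expiry, keys from
    # completed tasks pile up until they exhaust the instance's storage.
    r.set(f"task:result:{task_id}", payload, ex=3600)


def sweep_legacy_keys() -> int:
    # One-off cleanup for keys created before TTLs were introduced.
    removed = 0
    for key in r.scan_iter(match="task:result:*"):
        if r.ttl(key) == -1:  # -1 means the key exists but has no expiry
            r.delete(key)
            removed += 1
    return removed
```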
Report: "Website down"
Last update: Website unresponsive; investigating.
Report: "We're seeing a slowdown on all provided services"
Last update: We identified the issue and the site is back to fully functional. Our primary DB was running low on disk IOPS credits. We've increased our RDS instance size and storage size, which has significantly increased our available IOPS.
We've identified an issue with our DB such that we are seeing slower performance than usual. The site is still operational but is running at a reduced capacity.
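The fix above (resizing the RDS instance and its storage to raise available IOPS) maps to an RDS modification call. A minimal sketch with boto3, using placeholder identifier, instance class, and storage values; not the actual change Sleuth made.

```python
# Hypothetical sketch: scale up an RDS instance and its allocated storage,
# which raises the baseline IOPS available to gp2-backed volumes.
import boto3

rds = boto3.client("rds")

response = rds.modify_db_instance(
    DBInstanceIdentifier="primary-db",  # placeholder identifier
    DBInstanceClass="db.m5.xlarge",     # placeholder target instance class
    AllocatedStorage=500,               # gp2 baseline IOPS scales with size (3 IOPS/GiB)
    ApplyImmediately=True,              # apply now instead of the next maintenance window
)
print(response["DBInstance"]["DBInstanceStatus"])
```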
Report: "We are experiencing downtime related to a bad migration"
Last update: We've successfully re-run an updated version of the migration and all systems are back to normal.
We have cleared the problem. A migration locked our main deploys table and killing the initiating process did not clear the lock. We've cleared the lock and the site has resumed its normal functioning. We're monitoring to make sure everything is completely back to normal.
We are continuing to investigate this issue.
We're investigating the cause now.
Report: "Site unavailable due to a bad deploy"
Last update: We've fully resolved the issue.
We are continuing to monitor for any further issues.
We have rolled back the bad change and the site is available again. We are monitoring and will update this incident as we learn more.
We are investigating site issues that seem due to a bad code deploy. We will update as soon as we have more details.
Report: "Impact collection is delayed"
Last update: This incident has been resolved.
We have rolled out a fix and impact collection has returned to normal. We're just monitoring a bit before we resolve this incident.
We've identified an issue with our impact collection being delayed for new deploys. We're working on a fix and will update this incident as we progress.
Report: "We're having trouble with our background jobs"
Last update: The behavior of the application has returned to normal. The issue was a bad deploy where we changed the threading model of our background jobs. Some of the libraries we depend on were not supported in this new model. We have reverted to the old model for now.
We are having trouble with our background processing. We're working on a fix.
Report: "We are seeing an issue collecting data from sources that are authenticated via an API key"
Last update: We've restored the integrations and everything is working again. If you see any issues please contact support.
We have identified the issue and are working to restore normal operations. Integrations that are authenticated via API key are affected. This includes Jira, Sentry, Rollbar, Honeybadger, Datadog, and CircleCI. We will be able to restore full service once we've worked through the root cause.
Report: "Performing a major server upgrading"
Last update: Service is back to normal.
We've run into issues with the upgrade and are in the process of rolling back.
We're currently performing a major server upgrade. The service will be unavailable for about 15 minutes.