Historical record of incidents for Scout
Report: "Partial Outage - Delay in Metrics"
Last update: A subset of customers experienced a delay in metrics. All metrics have caught up at this point.
Report: "Partial Outage - Delay in Metrics"
Last updateA subset of customers experienced a delay in metrics. All metrics have caught up at this point.
Report: "Ingest delays"
Last update: This incident has been resolved.
We are seeing a spike in metrics, causing delays.
Report: "Ingest delays"
Last updateThis incident has been resolved.
We are seeing a spike in metrics, causing delays
Report: "Ingest delays"
Last update: This incident has been resolved.
Zookeeper corruption has been rooted out. Things appear healthier and are catching up in all cases.
We have not yet reached full resolution.
Throughput has improved, although the behavior of individual partitions remains a problem and is still causing delays in some cases.
It has been a long day with kafka. We continue to experience instability, causing lag and dropped payloads.
An alternate approach has been applied; we are watching.
The initial fix was unsuccessful; certain accounts are now substantially delayed.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Ingest delays"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Metric ingestion is being delayed for a set of users. We are working towards resolution.
Report: "Ingest delays and performance degradation"
Last update: This incident has been resolved.
Ingest is recovering. Some accounts will require additional backfill of data, which we are working on.
We have identified the issue and are working on fixes.
We are continuing to investigate this issue.
Ingested records are taking longer than usual to process. In some cases, this is affecting alerting.
Report: "Ingest Delays"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Report: "Ingest Delays"
Last update: This incident has been resolved.
Delays are recovering; we are monitoring.
The issue has been identified and a fix is being implemented.
Ingested data is suffering delays in processing. Our team is working on a fix.
Report: "Ingest delays and dashboard degredation"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Data Ingest Issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Ingest delays"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Ingest delays"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Ingest Issues"
Last update: This incident has been resolved.
We have implemented a fix and are monitoring the issue.
We are currently investigating this issue.
Report: "Ingest Delays and UI Degradation"
Last update: This incident has been resolved.
Ingested metrics are being processed and lag is recovering.
Data ingest is lagging. Some users are experiencing UI errors.
Report: "Web UI Unavailable"
Last update: This incident has been resolved.
The UI is operational and metric processing is recovering. Dashboards will be caught up shortly.
We are continuing to investigate this issue.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Page Load Errors"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring results.
We are investigating page load errors in the UI for some customers.
Report: "Delay in metrics ingestion"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating a delay in metrics ingestion.
Report: "Delay in Metrics Ingestion, Slow Page Loads"
Last update: We have resolved the underlying issue. Metric ingestion is caught up for all customers and the UI is operating as normal.
We are currently investigating a delay in metrics ingestion and slow UI page loads for some customers.
Report: "Agent payloads being rejected, missing dashboard data"
Last update: This incident has been resolved.
We have implemented the fix and the kafka cluster is operating normally. Agent checkin payloads are being ingested and processed again as of 3:41PM MT. Data from 2:40-3:30PM MT will not backfill to charts.
We have encountered an issue with our Kafka cluster preventing agent payloads from being recorded into kafka for storage and processing. You will not see current data in your dashboards until the issue is resolved. We are working on deploying the fix now.
Report: "Metrics Ingestion Lag"
Last update: Metric ingestion has caught up and all charts are now current.
We have identified the issue and the metrics will begin filling in on application charts within a few minutes.
We are currently experiencing metric ingestion lag. You may not see the most recent metrics in your charts. We are investigating the issue.
Report: "Metrics ingestion lag, some dashboards not loading"
Last update: All apps are reachable via the UI and the metrics backlog has been processed.
All apps should now be accessible. We are processing the metrics ingestion backlog and you will see your chart metrics fill in soon.
We are continuing to work on a fix for this issue.
We have identified an issue with our database which is causing time series metrics ingestion lag for all customers. In addition, some customers may not be able to load their app in the UI. We are working to fix the issue as soon as possible.
Report: "Load balancer change caused metric gap"
Last update: A change to our elastic load balancer, which accepts the metrics payloads from agents, left it in a nonfunctional state from 3:30PM MT to 3:35PM MT on 6-3-2021. The issue has been resolved. You may have missing metrics in charts for this time period.
Report: "Metric ingestion delay/dashboards unavailable"
Last update: All services are back to normal.
We are continuing to investigate this issue.
We are currently experiencing an issue with one of our time series databases. Dashboards may not be available or metrics delayed. Metrics are currently being buffered and will be filled in once the issue is resolved.
Report: "Delayed Metrics"
Last update: Ingestion is back to normal.
We are currently investigating this issue.
Report: "Database communication issue"
Last update: Everything is caught up, and looking good.
The database is back up, and the buffered data is being ingested. You'll see charts catch up over the next few minutes.
One of our database backends has disconnected from our frontend. Investigating. All incoming data is buffered and will be replayed once we're back up.
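Several incidents in this record describe the same buffer-and-replay pattern: while a database backend is unreachable, incoming checkin data is held in a buffer and then replayed into the database once it recovers. The sketch below is only an illustration of that general pattern, not Scout's actual pipeline; the backend client and its write() method are hypothetical placeholders, and a real system would buffer to durable storage rather than memory.

```python
import queue
import time


class BufferAndReplayWriter:
    """Buffer incoming points while the backend is down; replay them when it recovers."""

    def __init__(self, backend):
        self.backend = backend        # hypothetical client exposing write(point)
        self.buffer = queue.Queue()   # a real pipeline would use durable storage here

    def write(self, point):
        try:
            self.backend.write(point)
        except ConnectionError:
            # Backend unreachable: hold the point rather than dropping it.
            self.buffer.put(point)

    def replay(self):
        """Drain buffered points into the backend once it is healthy again."""
        while not self.buffer.empty():
            point = self.buffer.get()
            try:
                self.backend.write(point)
            except ConnectionError:
                # Still down: requeue and back off before retrying.
                self.buffer.put(point)
                time.sleep(1)
```

The "charts will catch up over the next few minutes" language in these updates corresponds to the replay step draining the buffer.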
Report: "Delay in ingestion"
Last update: All buffered data has been ingested and is available.
We have fixed a communication issue between our web servers and one of our database servers, and are catching up data now.
Currently investigating a delay in new data being ingested.
Report: "Ingestion Issue"
Last update: On 2020/05/31 we experienced a short network outage that prevented our zookeeper and kafka nodes from reaching each other. When connectivity was restored, there was a problem with stale zookeeper data which prevented the kafka brokers from initiating a proper leader election for topic partitions. This also prevented kafka producers from being able to produce to a majority of partitions. A manual leader election was attempted, but failed to correct the issue. We began a rolling restart of our entire kafka cluster, which ultimately resolved the issue. Later versions of Kafka have better handling around this particular failure, and we anticipate moving to a recent version to prevent entering this failure mode again.
Ingestion for all customers has been operating normally since 10:20AM MT. Some customers will be missing some or all data from 8:50AM to 10:20AM MT. We will follow up with more information about the cause of the outage.
We have restarted several of our Kafka servers in ingestion, and ingestion appears to be recovering. Data should begin appearing on your dashboard again.
We are investigating an issue in ingestion of agent data.
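The postmortem for this incident describes partitions being left without a leader after stale Zookeeper data blocked a proper leader election, which in turn kept producers from writing to a majority of partitions. That failure mode is visible in cluster metadata. The following is a minimal sketch, assuming the confluent-kafka Python client and a placeholder bootstrap address (this is not Scout's tooling), that flags partitions with no leader or a shrunken in-sync replica set:

```python
from confluent_kafka.admin import AdminClient

# Placeholder bootstrap address; point this at a real broker.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Cluster metadata includes per-partition leader and ISR information.
metadata = admin.list_topics(timeout=10)

for topic_name, topic in metadata.topics.items():
    for partition_id, partition in topic.partitions.items():
        if partition.leader == -1:
            # No leader elected: producers cannot write to this partition.
            print(f"{topic_name}[{partition_id}]: no leader")
        elif len(partition.isrs) < len(partition.replicas):
            # Under-replicated: some replicas have fallen out of sync.
            print(f"{topic_name}[{partition_id}]: ISR {len(partition.isrs)}/{len(partition.replicas)}")
```

Partitions reported with no leader correspond to the "unable to produce to a majority of partitions" symptom described in the postmortem.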
Report: "Delayed Ingestion for Some Customers"
Last update: This incident has been resolved.
We've isolated the issue to a single account. We're in contact with that customer and have restarted ingestion for all other accounts.
We've identified a handful of incoming messages that have slowed our ingestion processing, causing it to fall behind. This has tripped circuit breakers in other parts of our app. All data is stored, but ingestion as a whole is paused.
One of our ingestion servers is falling a little behind, so ingestion for any customers on that server will be delayed. All data is safe and is being processed.
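The updates above mention circuit breakers tripping when messages from a single account slowed ingestion, and the resolution was to isolate that account. As a generic, hedged sketch of that isolation pattern (class name and thresholds are made up, not Scout's implementation), a per-account circuit breaker can pause processing for an account whose payloads keep failing while every other account continues:

```python
import time


class AccountCircuitBreaker:
    """Open the circuit for an account after repeated failed or slow payloads."""

    def __init__(self, failure_threshold=5, reset_after_seconds=300):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.failures = {}    # account_id -> consecutive failure count
        self.opened_at = {}   # account_id -> time the circuit opened

    def allow(self, account_id):
        opened = self.opened_at.get(account_id)
        if opened is None:
            return True
        # Half-open after the cool-down: let one payload through to probe recovery.
        return time.monotonic() - opened >= self.reset_after_seconds

    def record_success(self, account_id):
        self.failures.pop(account_id, None)
        self.opened_at.pop(account_id, None)

    def record_failure(self, account_id):
        count = self.failures.get(account_id, 0) + 1
        self.failures[account_id] = count
        if count >= self.failure_threshold:
            self.opened_at[account_id] = time.monotonic()
```

Payloads for a tripped account can be parked rather than dropped (so all data is still stored) while ingestion for every other account keeps flowing.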
Report: "Database Connectivity Issues"
Last update: Replaying data is complete.
The connection issue has been resolved, data will begin backfilling and be fully up to date in a few minutes.
We're seeing timeouts for one of our time series databases.
Report: "Ingestion Lag"
Last update: Metric ingestion was paused at 20:13, restarted at 22:30 UTC, and all app metrics are caught up and stable as of 2019-09-18 00:00 UTC. Operations are back to normal.
We are recreating some database indexes, which has forced us to fully pause ingestion. Once the indexes are rebuilt, metrics will fill in to current while we continue to fix the root cause of the ingestion lag.
We are experiencing some ingestion lag. We have identified the issue and we are working on processing the backlog. Your charts will continue to catch up as we process the backlog.
Report: "Time series database requires restart"
Last update: All buffered checkin data has been ingested, and all components are back online.
The database has been restarted, and checkins are catching up now.
The server hosting one of our time series databases required a restart. It is currently booting and will be back in service in a few minutes.
Report: "Investigating Database Issues"
Last update: All buffered checkin data has been ingested, and all components are back online.
Everything is back up, and buffered checkins are flowing back into the system. Data should be caught up with current in a few minutes.
We've identified the cause of the issue and have fixed it. We are bringing the database back online.
We appear to have degraded write behavior on our main Postgres database. We are investigating.
Report: "UI Timeouts"
Last update: We've killed a rogue process that was tying up our database, and all pages are responding.
We are continuing to investigate this issue.
Some users are seeing the UI time out when it loads.
Report: "Time series database issues"
Last update: The backlog of ingestion data has cleared, and everything is up and running.
The database server has recovered and is ingesting again. Buffered data is being ingested, and will fill in as it catches up.
We are experiencing an issue with one of our time series databases. Some of our customers will experience UI timeouts and ingestion lag.
Report: "Database Connection Issues"
Last update: All chart metrics are now completely caught up. The root cause of the incident was attempted table partitioning during a database vacuum, which caused a lock on a critical table and cascaded to impact the rest of the application. We'll be adjusting our vacuum and partitioning schedules to avoid this lock again.
We've identified and fixed the database connection issue. We are currently loading the backlog of data that was held during the incident. Data will be appearing in the UI shortly.
We appear to be using more than the expected number of database connections, causing failures on our Web UI. Ingestion is backed up, but the incoming data is safe and collected.
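The root cause noted above was a lock taken on a critical table during partitioning work in a database vacuum, which then cascaded to the rest of the application. As an illustrative sketch only (assuming a Postgres database, as mentioned in an earlier incident, with psycopg2 and a placeholder connection string, not Scout's tooling), blocked and blocking sessions can be surfaced from pg_stat_activity and pg_blocking_pids():

```python
import psycopg2

# Placeholder connection string; point at the database under investigation.
conn = psycopg2.connect("dbname=app host=localhost user=postgres")

BLOCKED_SESSIONS = """
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"""

with conn, conn.cursor() as cur:
    cur.execute(BLOCKED_SESSIONS)
    for pid, blocked_by, wait_event_type, state, query in cur.fetchall():
        # Each row is a session waiting on a lock held by the pids in blocked_by.
        print(f"pid {pid} blocked by {blocked_by} ({wait_event_type}, {state}): {query}")
```

A long-lived lock from a partitioning or vacuum job shows up here as a small set of blocking pids with many sessions queued behind them.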
Report: "Server Monitoring install packages are temporarily unavailable"
Last update: The package repos are back and operating normally.
Installation of scoutd (`yum install scoutd` or `apt-get install scoutd`) will fail. We are working on restoring access. This only affects Server Monitoring, not APM.
Report: "Network connectivity issues"
Last update: Network connectivity is restored. There will be a 7-minute drop in charts corresponding to the outage.
http://status.railsmachine.com/incidents/31cpsbzq5p97
Report: "[Server Monitoring] Incorrect alert routing/Alerts not being sent out"
Last update:
## Server Monitoring 12/31/2016 Postmortem
At 5:35PM MDT, our database table storing alerts hit the auto-increment limit for its primary key datatype. As a result, new alerts were either not created as they should have been, or in some cases, created and associated with the wrong account. Since the alerts table is huge, modifying it in place was not an option. We began a sequence of altering the table on a MySQL read-only instance, switching multi-master to the secondary, and modifying the primary database. Shortly thereafter, we temporarily disabled notifications for all accounts to minimize the impact of the alterations. By 8:37PM MDT, alterations were complete. Unfortunately, a glitch in the multi-master switchover process resulted in a 7-minute outage from 8:58PM-09:07PM MDT. The glitch was the result of a duplicate `mmm_mond` process running, which repeatedly killed MySQL's replication thread and caused database instability.
### What We Have Done to Ensure This Does Not Happen Again
1. We have added monitoring and alerting on MySQL Multi-master's `mmm_mond` process, to ensure that only one process is running at a time.
2. We have audited all tables in our database to ensure that no other tables are close to exceeding their primary key auto-increment limit. While none are currently close, there are two tables at 50% of their limit, so we will be migrating these tables proactively during an upcoming scheduled maintenance window.
We have corrected the underlying database issue causing the incorrectly routed alerts. Alerts should be back to normal for all accounts.
Alerts are not being routed correctly. We have identified the problem, and while the fix is implemented, alerts have been disabled for all accounts.
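Item 2 of the postmortem above describes auditing every table's remaining auto-increment headroom. A hedged sketch of one way to run that kind of audit from MySQL's information_schema follows, assuming PyMySQL, INT/BIGINT auto-increment keys, and placeholder connection details (this is not Scout's actual audit):

```python
import pymysql

# Maximum values for common auto-increment column types.
LIMITS = {
    "int": 2**31 - 1,
    "int unsigned": 2**32 - 1,
    "bigint": 2**63 - 1,
    "bigint unsigned": 2**64 - 1,
}

QUERY = """
SELECT t.table_name, c.column_type, t.auto_increment
FROM information_schema.tables t
JOIN information_schema.columns c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
 AND c.extra LIKE '%%auto_increment%%'
WHERE t.table_schema = %s
  AND t.auto_increment IS NOT NULL;
"""

# Placeholder connection details.
conn = pymysql.connect(host="localhost", user="audit", password="secret", database="app")

with conn.cursor() as cur:
    cur.execute(QUERY, ("app",))
    for table_name, column_type, next_id in cur.fetchall():
        base = "bigint" if column_type.startswith("bigint") else "int"
        key = f"{base} unsigned" if "unsigned" in column_type else base
        used = next_id / LIMITS[key]
        # Flag tables that have consumed a large share of their key space.
        if used >= 0.5:
            print(f"{table_name} ({column_type}): {used:.0%} of auto-increment range used")
```

The 0.5 threshold mirrors the postmortem's note about two tables sitting at 50% of their limit.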
Report: "Brief downtime (database fix)"
Last update: From 8:58PM-09:07PM MDT 2016-12-31, scoutapp.com was unavailable during a database alteration. Data was not collected during this time.
Report: "Server Monitoring: brief metric ingestion outage while swapping database writer role"
Last update: Scout Server Monitoring had a brief ingestion outage from 4:27PM to 4:31PM MDT while swapping a database writer role.
Report: "Somewhat degraded performance while datacenter upgrades switches"
Last update: This incident has been resolved.
You may encounter: data occasionally delayed by ~2 minutes; an occasional error attempting to view a chart or alert. If you experience this, just refresh the page you are looking at.
Report: "AWS Networking Issue"
Last update: AWS resolved their network issue and we should be back to normal.
We are continuing to investigate the issue with our AWS servers, and will update when we have discovered a solution.
We are currently experiencing an issue with our us-west-1 AWS servers. This could cause some degradation in performance.
Report: "Network instability - Server Monitoring outage"
Last update: This incident has been resolved.
Server Monitoring is back online. We apologize for the outage, and will post a post-mortem tomorrow.
We are re-syncing a database that was corrupted during the power outage. Stay tuned ...
We've regained access to most of our machines via SSH and are working on bringing services back up.
From http://status.railsmachine.com/incidents/blqbh5wmfcrl: "At approximately 5:40 EST, we experienced a temporary utility interruption at the data center. This temporary utility interruption caused an unknown error in our UPS which resulted in a power outage to your environment. zColo Operations are diligently working to restore power to your environment. Additional updates will be provided when available."
"Preliminary reports indicate a power outage. We are continuing to investigate and are working to get things back online now." from http://status.railsmachine.com/incidents/blqbh5wmfcrl
http://status.railsmachine.com/incidents/blqbh5wmfcrl
Report: "Ingestion lag for metrics"
Last update: All charts are up to date. We will follow up with a post mortem.
The ingestion pipeline is catching up.
Our ingestion pipeline handling metrics from the agent is backed up and we are investigating the cause. Charts for your apps will not have up-to-date metrics until the issue is resolved.
Report: "UI unavailable"
Last update: Charts should be caught back up - all systems back to normal.
The UI is available again. There is a 20 minute lag in data. We've begun replaying data ingestion to fill in the gap.
InfluxDB hung while removing a significant amount of data from a timeseries database. We're restarting InfluxDB, which should take around 30 minutes. Data ingestion is continuing - charts will be a bit behind as we replay checkins to Influx after it comes back online.
The Scout UI is currently unavailable. We're investigating an issue with our backend timeseries storage.
Report: "Metric Ingestion Lag"
Last update: Metric ingestion for all customers is now caught up and operating normally.
An RDS instance failover triggered the lag. We've restarted ingestion and charts are filling in with data.
We're investigating a delay in the display of fresh data on charts.
Report: "Data Ingestion Delay"
Last update: We're back to normal. No data was lost during the ingestion delay.
The delay was triggered by a spike in Influx query times. The delay is decreasing rapidly. We're monitoring to ensure things return to normal.
We're seeing a delay in metric ingestion and are investigating.
Report: "Time Series Database Issue"
Last update: Metric ingestion has caught back up.
Our systems are replaying buffered data collected during the outage and ingesting these into our database.
The time-series database is restarting and should be operational in a few minutes, after which buffered data from the downtime will be replayed into it.
The backend time-series database appears to be having issues. All incoming data is being buffered and will be ingested into the system, but the site is currently inaccessible.
Report: "502 errors accessing scoutapp.com"
Last update: This incident has been resolved.
Data ingestion has caught back up.
The site is now available. We're replaying data that wasn't ingested over the downtime.
We are continuing to work on a fix for this issue.
We had a bad deploy and are investigating 502 errors. Data ingestion has not been impacted (data has not been lost).
Report: "Metrics ingestion lag"
Last update: This incident has been resolved.
Our relational database needed tuning. Most customers' charts are current - for those customers who still have some lag, it should resolve within the hour.
Some apps are having lag on their metrics charts. We are investigating.
Report: "504 errors accessing scoutapp.com"
Last update: This incident has been resolved.
The UI should be available again.
We identified a lock on a table and have cleared the lock. We're continuing to investigate.
We're seeing some 504 errors accessing scoutapp.com and investigating.