Scout

Is Scout Down Right Now? Check whether there is an ongoing outage.

Scout is currently Operational

Last checked from Scout's official status page

Historical record of incidents for Scout

Report: "Partial Outage - Delay in Metrics"

Last update
resolved

A subset of customers experienced a delay in metrics. All metrics have caught up at this point.

Report: "Partial Outage - Delay in Metrics"

Last update
resolved

A subset of customers experienced a delay in metrics. All metrics have caught up at this point.

Report: "Ingest delays"

Last update
resolved

This incident has been resolved.

investigating

We are seeing a spike in metrics, causing delays

Report: "Ingest delays"

Last update
resolved

This incident has been resolved.

investigating

We are seeing a spike in metrics, causing delays

Report: "Ingest delays"

Last update
resolved

This incident has been resolved.

monitoring

Zookeeper corruption has been rooted out. Things appear healthier and catching up in all cases.

identified

We have not reached full resolution yet.

identified

Throughput has improved, although the behavior of individual partitions remains a problem and is still causing delays in some cases.

identified

It has been a long day with Kafka. We continue to experience instability, causing lag and dropped payloads.

monitoring

An alternate approach has been applied, and we are watching.

identified

The initial fix was unsuccessful; certain accounts are now substantially delayed.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are currently investigating this issue.

Report: "Ingest delays"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Metric ingestion is being delayed for a set of users. We are working towards resolution.

Report: "Ingest delays and performance degradation"

Last update
resolved

This incident has been resolved.

identified

Ingest is recovering. Some accounts will require additional backfill of data, which we are working on.

identified

We have identified the issue and are working on fixes.

investigating

We are continuing to investigate this issue.

investigating

Ingested records are taking longer than usual to process. In some cases, this is affecting alerting.

Report: "Ingest Delays"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

Report: "Ingest Delays"

Last update
resolved

This incident has been resolved.

monitoring

Delays are recovering, and we are monitoring.

identified

The issue has been identified and a fix is being implemented.

investigating

Ingested data is suffering delays in processing. Our team is working on a fix.

Report: "Ingest delays and dashboard degredation"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Data Ingest Issues"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

Report: "Ingest delays"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "Ingest delays"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating this issue.

Report: "Ingest Issues"

Last update
resolved

This incident has been resolved.

monitoring

We have implemented a fix and are monitoring the issue.

investigating

We are currently investigating this issue.

Report: "Ingest Delays and UI Degradation"

Last update
resolved

This incident has been resolved.

monitoring

Ingested metrics are being processed, lag is recovering.

investigating

Data ingest is lagging. Some users are experiencing UI errors.

Report: "Web UI Unavailable"

Last update
resolved

This incident has been resolved.

monitoring

The UI is operational, metric processing is recovering. Dashboards will be caught up shortly.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Page Load Errors"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring results.

investigating

We are investigating page load errors in the UI for some customers.

Report: "Delay in metrics ingestion"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating a delay in metrics ingestion.

Report: "Delay in Metrics Ingestion, Slow Page Loads"

Last update
resolved

We have resolved the underlying issue. Metric ingestion is caught up for all customers and the UI is operating as normal.

investigating

We are currently investigating a delay in metrics ingestion and slow UI page loads for some customers.

Report: "Agent payloads being rejected, missing dashboard data"

Last update
resolved

This incident has been resolved.

monitoring

We have implemented the fix and the Kafka cluster is operating normally. Agent checkin payloads are being ingested and processed again as of 3:41PM MT. Data from 2:40-3:30PM MT will not backfill to charts.

identified

We have encountered an issue with our Kafka cluster preventing agent payloads from being recorded into Kafka for storage and processing. You will not see current data in your dashboards until the issue is resolved. We are working on deploying the fix now.

Report: "Metrics Ingestion Lag"

Last update
resolved

Metric ingestion has caught up and all charts are now current.

monitoring

We have identified the issue and the metrics will begin filling in on application charts within a few minutes.

investigating

We are currently experiencing metric ingestion lag. You may not see the most recent metrics in your charts. We are investigating the issue.

Report: "Metrics ingestion lag, some dashboards not loading"

Last update
resolved

All apps are reachable via the UI and the metrics backlog has been processed.

monitoring

All apps should now be accessible. We are processing the metrics ingestion backlog and you will see your chart metrics fill in soon.

identified

We are continuing to work on a fix for this issue.

identified

We have identified an issue with our database which is causing time series metrics ingestion lag for all customers. In addition, some customers may not be able to load their app in the UI. We are working to fix the issue as soon as possible.

Report: "Load balancer change caused metric gap"

Last update
resolved

A change to our elastic load balancer, which accepts metric payloads from agents, left it in a nonfunctional state from 3:30PM MT to 3:35PM MT on 6-3-2021. The issue has been resolved. You may have missing metrics in charts for this time period.

Report: "Metric ingestion delay/dashboards unavailable"

Last update
resolved

All services are back to normal.

investigating

We are continuing to investigate this issue.

investigating

We are currently experiencing an issue with one of our time series databases. Dashboards may not be available or metrics delayed. Metrics are currently being buffered and will be filled in once the issue is resolved.

Report: "Delayed Metrics"

Last update
resolved

Ingestion is back to normal.

investigating

We are currently investigating this issue.

Report: "Database communication issue"

Last update
resolved

Everything is caught up, and looking good.

monitoring

The database is back up, and the buffered data is being ingested. You'll see charts catch up over the next few minutes.

investigating

One of our database backends has disconnected from our frontend. Investigating. All incoming data is buffered and will be replayed once we're back up.
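
The buffer-and-replay approach mentioned here recurs throughout these incidents: payloads are spooled while a backend is unreachable and re-ingested once it recovers. A minimal sketch of the idea, not Scout's actual pipeline, assuming a local JSON-lines spool file at a placeholder path and caller-supplied `write_to_database` and `db_available` functions:

```python
import json
import os

SPOOL_PATH = "/var/tmp/ingest_spool.jsonl"  # placeholder spool location

def ingest(payload, write_to_database, db_available):
    """Write a payload to the database, spooling it to disk if the database is down."""
    if db_available():
        replay_spool(write_to_database)       # drain any backlog first, preserving order
        write_to_database(payload)
    else:
        with open(SPOOL_PATH, "a") as spool:  # buffer locally until the database recovers
            spool.write(json.dumps(payload) + "\n")

def replay_spool(write_to_database):
    """Replay buffered payloads in arrival order once the database is reachable again."""
    if not os.path.exists(SPOOL_PATH):
        return
    with open(SPOOL_PATH) as spool:
        for line in spool:
            write_to_database(json.loads(line))
    os.remove(SPOOL_PATH)
```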

Report: "Delay in ingestion"

Last update
resolved

All buffered data has been ingested and is available.

monitoring

We have fixed a communication issue between our web servers and one of our database servers, and are catching up data now.

investigating

Currently investigating a delay in new data being ingested.

Report: "Ingestion Issue"

Last update
postmortem

On 2020/05/31 we experienced a short network outage that prevented our Zookeeper and Kafka nodes from reaching each other. When connectivity was restored, there was a problem with stale Zookeeper data which prevented the Kafka brokers from initiating a proper leader election for topic partitions. This also prevented Kafka producers from being able to produce to a majority of partitions. A manual leader election was attempted, but failed to correct the issue. We began a rolling restart of our entire Kafka cluster, which ultimately resolved the issue. Later versions of Kafka have better handling around this particular failure, and we anticipate moving to a recent version to prevent entering this failure mode again.
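
The rolling restart described above is typically scripted so that each broker is bounced only after replication has settled. A minimal sketch, not Scout's tooling: it assumes SSH access to placeholder broker hosts, a systemd unit named `kafka`, and a Kafka release whose `kafka-topics.sh` accepts `--bootstrap-server` (older releases take `--zookeeper` instead):

```python
import subprocess
import time

BROKERS = ["kafka1", "kafka2", "kafka3"]  # placeholder broker hostnames
BOOTSTRAP = "kafka1:9092"

def under_replicated_partitions():
    """Return kafka-topics.sh output listing under-replicated partitions (empty when healthy)."""
    result = subprocess.run(
        ["kafka-topics.sh", "--bootstrap-server", BOOTSTRAP,
         "--describe", "--under-replicated-partitions"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

def rolling_restart():
    """Restart brokers one at a time, waiting for replication to settle between restarts."""
    for broker in BROKERS:
        subprocess.run(["ssh", broker, "sudo", "systemctl", "restart", "kafka"], check=True)
        while under_replicated_partitions():
            time.sleep(30)

if __name__ == "__main__":
    rolling_restart()
```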

resolved

Ingestion for all customers has been operating normally since 10:20AM MT. Some customers will have some or no data from 8:50AM to 10:20AM MT. We will follow up with more information about the cause of the outage.

monitoring

We have restarted several of our Kafka servers in ingestion, and ingestion appears to be recovering. Data should begin appearing on your dashboard again.

investigating

We are investigating an issue in ingestion of agent data.

Report: "Delayed Ingestion for Some Customers"

Last update
resolved

This incident has been resolved.

monitoring

We've isolated the issue to a single account. We're in contact with that customer and have restarted ingestion for all other accounts.

investigating

We've identified a handful of incoming messages that slowed our ingestion processing, causing it to fall behind. This has tripped circuit breakers in other parts of our app. All data is stored, but ingestion as a whole is paused.
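
The "tripped circuit breakers" mentioned here refer to the common pattern of failing fast once a dependency has produced enough consecutive errors, rather than letting slow calls pile up. A minimal, generic sketch of that pattern, not Scout's implementation:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors, then allow a retry once a cooldown has passed."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; skipping call")
            self.opened_at = None  # cooldown elapsed: allow a trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        return result
```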

investigating

One of our ingestion servers is falling a little behind, so ingestion for any customers on that server will be delayed. All data is safe and is being processed.

Report: "Database Connectivity Issues"

Last update
resolved

Replaying data is complete.

monitoring

The connection issue has been resolved, data will begin backfilling and be fully up to date in a few minutes.

investigating

We're seeing timeouts for one of our time series databases.

Report: "Ingestion Lag"

Last update
resolved

Metric ingestion was paused at 20:13, restarted at 22:30 UTC, and all app metrics are caught up and stable as of 2019-09-18 0:00 UTC. Operations are back to normal.

monitoring

We are recreating some database indexes, which has forced us to fully pause ingestion. Once the indexes are rebuilt, metrics will fill in to current while we continue to fix the root cause of the ingestion lag.

monitoring

We are experiencing some ingestion lag. We have identified the issue and we are working on processing the backlog. Your charts will continue to catch up as we process the backlog.

Report: "Time series database requires restart"

Last update
resolved

All buffered checkin data has been ingested, and all components are back online.

monitoring

The database has been restarted, and checkins are catching up now.

identified

The server hosting one of our time series databases required a restart. It is currently booting and will be back in service in a few minutes.

Report: "Investigating Database Issues"

Last update
resolved

All buffered checkin data has been ingested, and all components are back online.

monitoring

Everything is back up, and buffered checkins are flowing back into the system. Data should be caught up with current in a few minutes.

identified

We've identified and fixed the underlying issue. We are bringing the database back online.

investigating

We appear to have degraded write behavior on our main Postgres database. We are investigating.

Report: "UI Timeouts"

Last update
resolved

We've killed a rogue process that was tying up our database, and all pages are responding.

investigating

We are continuing to investigate this issue.

investigating

Some users are seeing the UI time out when it loads.

Report: "Time series database issues"

Last update
resolved

The backlog of ingestion data has cleared, and everything is up and running.

monitoring

The database server has recovered and is ingesting again. Buffered data is being ingested, and will fill in as it catches up.

investigating

We are experiencing an issue with one of our time series databases. Some of our customers will experience UI timeouts and ingestion lag.

Report: "Database Connection Issues"

Last update
resolved

All chart metrics are now completely caught up. The root cause of the incident was an attempted table partitioning during a database vacuum, which took a lock on a critical table and cascaded to impact the rest of the application. We'll be adjusting our vacuum and partitioning schedules to avoid this lock again.
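
Lock pile-ups like the one described above are typically diagnosed by asking Postgres which sessions are blocked and by whom. A minimal sketch using `pg_blocking_pids()` (PostgreSQL 9.6+); the `psycopg2` driver and the connection string are assumptions:

```python
import psycopg2

# Placeholder connection string; adjust for the real environment.
conn = psycopg2.connect("dbname=scout_production")

with conn, conn.cursor() as cur:
    # For every backend that is waiting on a lock, list the sessions blocking it
    # and the statement it is trying to run.
    cur.execute("""
        SELECT pid,
               pg_blocking_pids(pid) AS blocked_by,
               wait_event_type,
               query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
    """)
    for pid, blocked_by, wait_event_type, query in cur.fetchall():
        print(f"pid {pid} blocked by {blocked_by} ({wait_event_type}): {query}")
```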

investigating

We've identified and fixed the database connection issue. We are currently loading the backlog of data that was held during the incident. Data will be appearing in the UI shortly.

investigating

We appear to be using more than the expected number of database connections, causing failures on our Web UI. Ingestion is backed up, but the incoming data is safe and collected.

Report: "Server Monitoring install packages are temporarily unavailable"

Last update
resolved

The package repos are back and operating normally.

investigating

Installation of scoutd (yum install scoutd or apt-get install scoutd) will fail. We are working on restoring access. This only affects Server Monitoring, not APM.

Report: "Network connectivity issues"

Last update
resolved

Network connectivity is restored. There will be a 7-minute drop in charts corresponding to the outage.

investigating

http://status.railsmachine.com/incidents/31cpsbzq5p97

Report: "[Server Monitoring] Incorrect alert routing/Alerts not being sent out"

Last update
postmortem

## Server Monitoring 12/31/2016 Postmortem

At 5:35PM MDT, our database table storing alerts hit the auto-increment limit for its primary key datatype. As a result, new alerts were either not created as they were supposed to be, or in some cases, created and associated with the wrong account. Since the alerts table is huge, modifying it in place was not an option. We began a sequence of altering the table on a MySQL read-only instance, switching multi-master to the secondary, and modifying the primary database. Shortly thereafter, we temporarily disabled notifications for all accounts to minimize the impact of the alterations. By 8:37PM MDT, the alterations were complete. Unfortunately, a glitch in the multi-master switchover process resulted in a 7-minute outage from 8:58PM-09:07PM MDT. The glitch was the result of a duplicate `mmm_mond` process running, which repeatedly killed MySQL's replication thread and caused database instability.

### What We Have Done to Ensure This Does Not Happen Again

1. We have added monitoring and alerting on MySQL Multi-master's `mmm_mond` process, to ensure that only one process is running at a time.
2. We have audited all tables in our database to ensure that no other tables are close to exceeding their primary key auto-increment limit. While none are currently close, there are two tables at 50% of their limit, so we will be migrating these tables proactively during an upcoming scheduled maintenance window.
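
The audit described in point 2 can be approximated by comparing each table's current AUTO_INCREMENT counter in `information_schema` against the maximum of its key's integer type. A minimal sketch, assuming the `mysql-connector-python` driver and placeholder connection details; the 50% threshold mirrors the figure mentioned above:

```python
import mysql.connector

# Signed maxima for MySQL integer key types; unsigned keys roughly double these.
INT_MAX = {"tinyint": 127, "smallint": 32767, "mediumint": 8388607,
           "int": 2147483647, "bigint": 9223372036854775807}

# Placeholder connection details.
conn = mysql.connector.connect(host="localhost", user="audit", password="secret",
                               database="scout")
cur = conn.cursor()

# Find every auto-increment column and the table's current counter value.
cur.execute("""
    SELECT c.TABLE_NAME, c.DATA_TYPE, c.COLUMN_TYPE, t.AUTO_INCREMENT
    FROM information_schema.COLUMNS c
    JOIN information_schema.TABLES t
      ON t.TABLE_SCHEMA = c.TABLE_SCHEMA AND t.TABLE_NAME = c.TABLE_NAME
    WHERE c.TABLE_SCHEMA = DATABASE()
      AND c.EXTRA LIKE '%auto_increment%'
""")

for table, data_type, column_type, next_id in cur.fetchall():
    limit = INT_MAX.get(data_type, 0)
    if "unsigned" in column_type:
        limit = limit * 2 + 1
    if next_id and limit and next_id / limit > 0.5:  # flag tables past 50% of their limit
        print(f"{table}: {next_id} of {limit} ({next_id / limit:.0%})")

cur.close()
conn.close()
```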

resolved

We have corrected the underlying database issue causing the incorrectly routed alerts. Alerts should be back to normal for all accounts.

identified

Alerts are not being routed correctly. We have identified the problem, and while the fix is being implemented, alerts have been disabled for all accounts.

Report: "Brief downtime (database fix)"

Last update
resolved

From 8:58PM-09:07PM MDT 2016-12-31, scoutapp.com was unavailable during a database alteration. Data was not collected during this time.

Report: "Server Monitoring: brief metric ingestion outage while swapping database writer role"

Last update
resolved

Scout Server Monitoring had a brief ingestion outage from 4:27PM to 4:31PM MDT while swapping a database writer role.

Report: "Somewhat degraded performance while datacenter upgrades switches"

Last update
resolved

This incident has been resolved.

monitoring

You may encounter: data occasionally delayed by ~2 minutes; an occasional error attempting to view a chart or alert. If you experience this, just refresh the page you are looking at.

Report: "AWS Networking Issue"

Last update
resolved

AWS resolved their network issue and we should be back to normal.

investigating

We are continuing to investigate the issue with our AWS servers, and will update when we have discovered a solution.

investigating

We are currently experiencing an issue with our us-west-1 AWS servers. This could cause some degradation in performance.

Report: "Network instability - Server Monitoring outage"

Last update
resolved

This incident has been resolved.

monitoring

Server Monitoring is back online. We apologize for the outage, and will post a post-mortem tomorrow.

identified

We are re-syncing a database that was corrupted during the power outage. Stay tuned ...

identified

We've regained access to most of our machines via SSH and are working on bringing services back up.

identified

From http://status.railsmachine.com/incidents/blqbh5wmfcrl: "At approximately 5:40 EST, we experienced a temporary utility interruption at the data center. This temporary utility interruption caused an unknown error in our UPS which resulted in a power outage to your environment. zColo Operations are diligently working to restore power to your environment. Additional updates will be provided when available."

investigating

"Preliminary reports indicate a power outage. We are continuing to investigate and are working to get things back online now." from http://status.railsmachine.com/incidents/blqbh5wmfcrl

investigating

http://status.railsmachine.com/incidents/blqbh5wmfcrl

Report: "Ingestion lag for metrics"

Last update
resolved

All charts are up to date. We will follow up with a post mortem.

monitoring

The ingestion pipeline is catching up.

investigating

Our ingestion pipeline handling metrics from the agent is backed up and we are investigating the cause. Charts for your apps will not have up-to-date metrics until the issue is resolved.

Report: "UI unavailable"

Last update
resolved

Charts should be caught back up - all systems back to normal.

monitoring

The UI is available again. There is a 20 minute lag in data. We've begun replaying data ingestion to fill in the gap.

identified

InfluxDB hung while removing a significant amount of data from a timeseries database. We're restarting InfluxDB, which should take around 30 minutes. Data ingestion is continuing - charts will be a bit behind as we replay checkins to Influx after it comes back online.

investigating

The Scout UI is currently unavailable. We're investigating an issue with our backend timeseries storage.

Report: "Metric Ingestion Lag"

Last update
resolved

Metric ingestion for all customers is now caught up and operating normally.

identified

An RDS instance failover triggered the lag. We've restarted ingestion and charts are filling in with data.

investigating

We're investigating a delay in the display of fresh data on charts.

Report: "Data Ingestion Delay"

Last update
resolved

We're back to normal. No data was lost during the ingestion delay.

monitoring

The delay was triggered by a spike in Influx query times. The delay is decreasing rapidly. We're monitoring to ensure things return to normal.

investigating

We're seeing a delay in metric ingestion and are investigating.

Report: "Time Series Database Issue"

Last update
resolved

Metric ingestion has caught back up.

monitoring

Our systems are replaying buffered data collected during the outage and ingesting these into our database.

monitoring

The time-series database is restarting and should be operational in a few minutes, after which buffered data from the downtime will be replayed into it.

investigating

The backend time-series database appears to be having issues. All incoming data is being buffered and will be ingested into the system, but the site is currently inaccessible.

Report: "502 errors accessing scoutapp.com"

Last update
resolved

This incident has been resolved.

identified

Data ingestion has caught back up.

identified

The site is now available. We're replaying data that wasn't ingested over the downtime.

identified

We are continuing to work on a fix for this issue.

identified

We had a bad deploy and are investigating 502 errors. Data ingestion has not been impacted (data has not been lost).

Report: "Metrics ingestion lag"

Last update
resolved

This incident has been resolved.

monitoring

Our relational database needed tuning. Most customers' charts are current - for those who still have some lag, it should resolve within the hour.

investigating

Some apps are having lag on their metrics charts. We are investigating.

Report: "504 errors accessing scoutapp.com"

Last update
resolved

This incident has been resolved.

monitoring

The UI should be available again.

identified

We identified a lock on a table and have cleared the lock. We're continuing to investigate.

investigating

We're seeing some 504 errors accessing scoutapp.com and investigating.