Honeybadger.io

Is Honeybadger.io Down Right Now? Check whether there is an ongoing outage.

Honeybadger.io is currently Operational

Last checked from Honeybadger.io's official status page

Historical record of incidents for Honeybadger.io

Report: "Delay in stats and check-in processing"

Last update
resolved

We had an interruption in processing of check-in alerts and in-app stats from about 12:30 to about 13:30 UTC. This may have resulted in a failure or a delay in reporting check-ins that went missing during that period.

Report: "Data Processing Delays"

Last update
resolved

The backlog has been processed and all systems are running normally.

monitoring

We have identified and corrected an issue in our data processing pipeline that may have caused delays for some customers. No data has been lost and the system should be caught up shortly.

Report: "App slowness and timeouts"

Last update
resolved

This incident has been resolved.

monitoring

We have identified the cause of the slowdown and performance is back to normal.

investigating

We are looking into the cause of slowness and timeouts with the web app.

Report: "Insights backlog issue"

Last update
resolved

The backlog has been fully processed, so all Insights events should be ingested and available.

monitoring

We have pushed a fix and are waiting for the backlog to be processed. As a reminder, this only affects Insights events; error processing is unaffected.

identified

We have identified a backlog in Insights event ingestion. We are working on a fix.

Report: "Check-in monitoring"

Last update
resolved

We discovered and addressed a configuration issue that caused the delay in processing. Processing has returned to normal.

investigating

We are seeing an increased backlog in our check-in processing, which is causing false alarms to be reported. We are investigating the cause of the backlog.

Report: "Increased response times and timeouts"

Last update
resolved

This incident has been resolved.

monitoring

We've resolved an issue with query contention, and web application service has been restored. Our API pipeline was unaffected by this incident. Some charts on the reports tab may be outdated until we update project counts. We'll continue to monitor the situation. Sorry for the inconvenience!

identified

We've identified an issue with aggregate database queries and are working on a fix.

investigating

We are continuing to investigate this issue.

investigating

We are currently experiencing elevated response times and timeouts from our web application.

Report: "Web app is unavailable"

Last update
resolved

This incident has been resolved.

identified

We are working on restoring service.

Report: "Delay in Logplex processing"

Last update
resolved

We discovered this morning that our Logplex pipeline, which handles Heroku platform errors, had an interruption in processing, which caused platform errors not to be recorded. This has been corrected, the backlog is being processed, and additional monitoring has been added to avoid this issue in the future.

Report: "Uptime check reports are delayed"

Last update
resolved

Lambda is back in business, and the backlog of events triggered during the outage is being processed.

identified

Search indexing and notice timeline charts are also impacted by the failures with Lambda. We're continuing to queue updates for processing once Lambda resumes normal operations.

identified

Issues with AWS Lambda are causing our uptime check reports to be delayed. Reports are queued up and will be delivered as AWS services recover.

Report: "Unhealthy API server"

Last update
resolved

Starting around 3:30pm PST, we had an incident where an unhealthy API server was in rotation in our load balancer. During that time, requests routed to that server responded with a 502, which could have resulted in clients dropping notices. At around 5:30pm PST we removed the unhealthy server from rotation. 7:22pm PST edit: Reworded this message to clarify that notices are not queued on the client, but dropped when the server responds in an error state.

Report: "Issues processing backlog"

Last update
resolved

We identified that our workers had an issue deploying. We deployed a fix and upped our capacity to deal with the current backlog. Everything is back to normal now. It turns out this issue was related to GitHub rotating their public RSA SSH host key: https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/

identified

We have identified the issue and are working towards clearing the backlog.

investigating

We are currently having issues processing our notice backlog. We are looking into the issue.

Report: "Web app unavailable"

Last update
resolved

We had an incident (from 1:00 to 1:27 PST) where our apps were unresponsive. They have since recovered.

Report: "API issues"

Last update
resolved

Our primary Redis instance was having a bad time. This has been fixed, and now we're back to normal.

investigating

We are seeing increased error rates for our API.

Report: "Email delivery is being delayed"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

Our email provider is having problems delivering emails, so our outbound email deliveries are being delayed.

Report: "Web app unavailable"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "Uptime checks are delayed"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "Network disruption"

Last update
resolved

We had a network disruption that lasted approximately one minute. The cause has been identified and resolved.

Report: "Brief web app downtime"

Last update
resolved

We had some brief downtime while a deploy was completing—everything is back to normal now. 👍

Report: "Status pages down"

Last update
resolved

We had a faulty deploy at 11 am yesterday that caused us to stop serving customer status pages. We weren’t immediately aware of the problem due to insufficient monitoring of this recently-launched feature. We have restored the status pages, identified and resolved the problem that caused the faulty deploy, and added more monitoring.

Report: "Check-in failures"

Last update
resolved

We had a period of about 30 minutes where our check-in and sourcemap API endpoints were not accepting requests. This resulted in some check-in missing alerts that should not have been sent, and in failures to upload sourcemaps. This issue has been resolved.

Report: "Delayed search indexing"

Last update
resolved

It looks like we're back to normal. AWS finally updated their status page. ;) https://phd.aws.amazon.com/phd/home?region=us-east-1#/account/dashboard/open-issues?eventID=arn:aws:health:us-east-1::event/MULTIPLE_SERVICES/AWS_MULTIPLE_SERVICES_OPERATIONAL_ISSUE/AWS_MULTIPLE_SERVICES_OPERATIONAL_ISSUE_SVALH_1646859683&eventTab=details

investigating

We are continuing to investigate this issue.

investigating

We're investigating an issue related to slow search indexing—search results may not reflect reality until we get this resolved (or AWS does). Will keep y'all updated here. Sorry for the inconvenience!

Report: "Uptime checks delayed"

Last update
resolved

We're all good now, thanks folks!

monitoring

We've taken steps to resolve the immediate issues; uptime checks should be back to normal. We'll update this incident as we learn more, but hopefully will have this wrapped up soon. :) As always, please reach out to support@honeybadger.io if you're having trouble.

investigating

Uptime checks and some other periodic jobs are not running on time. Uptime/outage alerts will be delayed until this issue is resolved, as well as some stats in the UI.

Report: "Slack delivery errors"

Last update
resolved

This incident has been resolved.

monitoring

Slack inbound webhooks are failing, so some error notifications are unable to be delivered.

Report: "Delays in processing"

Last update
resolved

It looks like we're coming out of the woods. The backlogs are gone, EC2 instances are booting once again, and processing volume is returning to normal.

monitoring

We are continuing to be impacted by problems in us-east-1. We are working on our plan to bring up another region.

monitoring

Various services in AWS us-east-1 are having problems, which is causing delays in error ingestion.

Report: "Web app errors and check-in alerts"

Last update
resolved

We had a bad code deploy that caused a few instances of errors with the web app and erroneous alerts for check-ins. Sorry for the inconvenience!

Report: "Bug in error grouping"

Last update
resolved

We accidentally broke grouping for some errors for a period of about 8 hours today, ending right before this update was published. No errors were lost, but you may have seen what looked like duplicate errors, since we failed to group together some errors that should have been grouped. Sorry for the inconvenience!

Report: "Web app unavailable"

Last update
resolved

This incident has been resolved.

identified

The web app is unavailable. Error collection is not impacted.

Report: "Current-hour stats are missing"

Last update
resolved

This incident has been resolved.

investigating

Counts of error notifications received in the past hour are currently showing 0 in the UI. This does not impact error collection and notification. We are investigating the cause of the issue.

Report: "Problems with search"

Last update
resolved

Everything is back to normal, and the backfill is complete.

monitoring

Our cluster has resumed normal operations. We are now indexing data as it arrives, and we are starting the backfill process to index the data we didn't index since this incident began.

identified

The issue has been identified and a fix is being implemented.

investigating

Our search cluster is having problems — searches and some charts are not currently available.

Report: "Delays in data appearing in search results"

Last update
resolved

This incident has been resolved.

investigating

We are currently experiencing a delay in the data that is being returned in search results, which includes the list of notices displayed on the error detail page. Error ingestion and search indexing are not affected.

Report: "API response times"

Last update
resolved

Processing is back to normal, and we're keeping our fingers crossed that AWS autoscaling is working once again.

identified

We are continuing to work on dealing with the fallout from the issues in us-east-1. We are processing inbound error notifications, but with some delay.

identified

We are continuing to work on a fix for this issue.

identified

AWS is having issues (https://status.aws.amazon.com) which are impacting our API, causing slowness, retries, and failures.

Report: "Network timeouts"

Last update
resolved

This incident has been resolved.

investigating

We are seeing long response times due to apparent network issues.

Report: "Delays in data appearing in search results"

Last update
resolved

This incident has been resolved.

investigating

We are currently experiencing a delay in the data that is being returned in search results, which includes the list of notices displayed on the error detail page. Error ingestion and search indexing are not affected.

Report: "API downtime"

Last update
resolved

We had an automation failure that caused a few minutes of downtime for our ingestion API. We are reworking our automation to avoid this problem in the future.

Report: "Delays in data appearing in search results"

Last update
resolved

This incident has been resolved.

investigating

We are currently experiencing a delay in the data that is being returned in search results, which includes the list of notices displayed on the error detail page. Error ingestion and search indexing are not affected.

Report: "API and Web App May be Unavailable to Some Users"

Last update
resolved

We have switched all affected systems to new certificates and everything is back to normal.

monitoring

We are in the process of issuing new certificates from a different CA, to work around Comodo's issue.

monitoring

The unavailability is being caused by an upstream SSL problem. Comodo (our CA) is being flagged as "untrusted" by clients. See this tweet for an example: https://twitter.com/aitorpazos/status/1266703889786691584 We are currently monitoring the situation, as well as investigating other solutions.

investigating

We have experienced a drop in external traffic which leads us to believe that some customers may not currently be able to access our service. At this time we suspect that the issue is caused by network problems outside of our own systems, and we are currently investigating to confirm this.

Report: "Sales and docs sites are down"

Last update
resolved

Netlify fixed it :)

monitoring

Netlify is having redirect loop problems, causing our static sites to go down. Everything in our app and API is fine.

Report: "Partial API outage"

Last update
postmortem

This report details the impacts of our outage of August 26th, the cause of the outage, and steps we have taken and will be taking to prevent a similar kind of outage in the future. Before getting into the details, I want to apologize to everyone who was impacted by this outage – we have worked hard to build a resilient system, and it's really disappointing when we let you down. On to the deets…

### What happened?

A little after 7PM Pacific Time I received an alert from PagerDuty letting me know that one of our ingestion API endpoints was not responding to external monitoring. Reviewing our dashboards revealed that our primary Redis cluster was about to run out of available memory. This cluster stores the Sidekiq queue that we use for processing the payloads of the errors being reported to our API, and it typically rests at 3% memory utilization. Our internal dashboard did show some outliers among our customers for inbound traffic, but a few minutes of research down that path did not lead to a cause for the memory consumption. Running down our list of things to check in case of emergency led me to find that our database server was overloaded with slow-running queries. This caused our Sidekiq jobs to take much longer than usual (30-50x as long), which caused a backlog large enough to consume all the memory.

Having our main Redis cluster effectively unusable resulted in the following problems:

* Our API endpoints were unable to receive error reports, source map uploads, deployment notifications, and check-in reports
* Uptime checks were delayed
* Some check-ins were switched to the down state because the check-in reports couldn't be received
* Some error notifications were lost as we had to juggle Redis instances (more on that below)

### When was it fixed?

Not long after 9PM our API was responsive again. By 10PM our backlog of error payloads was fully processed, and we were back to normal. We would have been back in business sooner, but a few things tripped us up:

* As soon as it was clear we were going to run out of memory on our Redis cluster (which is hosted by AWS ElastiCache), and that I wouldn't be able to quickly free up some memory, I started a resize of the existing cluster. When it became clear that would not be quick (it ended up taking approximately two hours), I spun up a new, larger ElastiCache cluster. When it became clear _that_ would not be quick either, I spun up an EC2 instance in our VPC to host Redis temporarily.
* Unfortunately, though we use Ansible for automating all our EC2 provisioning, we did not have a playbook for quickly spinning up a Redis server. When I set up the first server manually, I didn't provision enough disk space on the instance to store the Redis snapshot as the backlog grew (I didn't have the fix in place for the slow queries yet).
* When I spotted that problem with the Redis instance, I spun up another one with a large-enough disk.
* We also didn't have an automated way to update the 4 locations where our app, api, and worker instances were configured with the location of the Redis server, so with each of the two changes to the Redis server location I had to run some Ansible commands to update configurations and bounce services.

Once that was all settled, though, and traffic was once again flowing into our new self-hosted Redis instance, I was able to turn my attention to the cause of the problem – the slow queries. It turns out that one query was the cause of the slowness. This query loaded previously-uploaded sourcemaps to be applied to JavaScript errors as they were being processed. Since the problem was so localized, I was able to get the database back to a good place by temporarily suspending sourcemap processing.

### What's the remediation plan?

As you can imagine, there are a number of things we can do to help avoid or minimize this kind of situation in the future:

1. Review that database table and/or the query to see how we can get past the tipping point we encountered that turned a 4-10ms query into a 400ms query for certain customers (in process)
2. Persist payloads to S3 sooner in our Sidekiq jobs so we can minimize memory pressure on Redis
3. Increase the size of our Redis cluster (already done)
4. Create an Ansible playbook to quickly provision a new Redis instance in case of emergency (done)
5. Centralize the 4 app configurations for the URL of the Redis cluster and create an Ansible playbook that can quickly update those configurations (in process)

We'll continue to improve our systems and processes to deliver the most reliable service we can. We truly appreciate that you've chosen us for your monitoring needs, and we are always eager to show our appreciation by working hard for you. As always, if you have any questions or comments, please reach out to us at [support@honeybadger.io](mailto:support@honeybadger.io).
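
Remediation item 2 above describes offloading full payloads to S3 so that only small references accumulate in the Redis-backed queue, sometimes called the claim-check pattern. The sketch below is a minimal, generic illustration of that idea in Python, not Honeybadger's actual implementation (their pipeline uses Sidekiq in Ruby); the bucket name, queue key, and helper functions are hypothetical.

```python
import json
import uuid

import boto3
import redis

s3 = boto3.client("s3")
queue = redis.Redis(host="localhost", port=6379)

BUCKET = "example-error-payloads"  # hypothetical bucket name
QUEUE_KEY = "error-payload-jobs"   # hypothetical Redis list used as a work queue


def enqueue_error_payload(payload: dict) -> str:
    """Store the full payload in S3 and push only a small reference onto Redis,
    so a processing backlog grows in S3 rather than in Redis memory."""
    object_key = f"payloads/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=object_key, Body=json.dumps(payload))
    queue.lpush(QUEUE_KEY, json.dumps({"s3_key": object_key}))
    return object_key


def process_next_payload() -> None:
    """Worker side: pop a reference, fetch the full payload from S3, process it,
    then delete the object."""
    raw = queue.rpop(QUEUE_KEY)
    if raw is None:
        return  # queue is empty
    job = json.loads(raw)
    obj = s3.get_object(Bucket=BUCKET, Key=job["s3_key"])
    payload = json.loads(obj["Body"].read())
    # ... apply sourcemaps, group the error, send notifications, etc. ...
    s3.delete_object(Bucket=BUCKET, Key=job["s3_key"])
```

Because each queue entry is only a short JSON reference, a downstream slowdown grows the number of S3 objects instead of exhausting the queue's memory.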

resolved

The backlog has been cleared, and our Redis cluster is happy once again. We'll be looking at ways we can better handle this scenario in the future.

identified

We have a temporary fix in place for the impacted Redis cluster, and now we are working on the backlog.

identified

Our main Redis cluster is having issues, and we are attempting to work around them.

investigating

We are currently investigating this issue.

Report: "Pipeline processing delays"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "Sourcemap upload issues"

Last update
resolved

We have determined that some of the sourcemap traffic was being incorrectly routed to a staging server, which was encountering errors. This routing problem has been fixed.

investigating

We have received reports of problems with uploads of sourcemaps to our API, and we are looking into the issue.

Report: "Bogus check-in alerts"

Last update
resolved

This incident has been resolved.

monitoring

Starting around 6:30 AM UTC on March 28th, we encountered a burst of Lambda invocation errors, which caused a backlog in one of our queues. As a result, some uptime checks were not recorded in time to avoid alert conditions, and bogus alerts were sent out. The backlog was cleared within a couple of hours. We're looking at changes we can make to our job queues to prevent this kind of scenario from occurring again.

Report: "Intermittent problem displaying errors in web app"

Last update
resolved

We just identified and resolved an issue where displaying some errors in the web UI was failing due to a problem with a 3rd party service. No other systems were impacted.

Report: "Intermittent sourcemap upload issue"

Last update
resolved

Our API was experiencing intermittent failures with sourcemap uploads from about 7:30 AM to 10:30 AM PDT. This was caused by internal DNS resolution failures. The problem has been resolved.

Report: "Intermittent sourcemap and deployment reporting problems"

Last update
resolved

This morning we had an issue with one of our web servers not being able to receive sourcemap uploads and deployment notifications, which resulted in intermittent failures from our API. This has been resolved.

Report: "Slowdown in sourcemap processing"

Last update
resolved

This incident has been resolved.

investigating

We are experiencing delays in sourcemap processing. This may result in some error backtraces not being enriched properly, but otherwise error report ingestion is not affected.

Report: "Heroku log drains experiencing intermittent connectivity problems"

Last update
resolved

This incident has been resolved.

investigating

We are currently investigating this issue.

Report: "Timeouts and Processing Delays"

Last update
resolved

It looks like the S3 problems are resolved, so we are back to normal.

monitoring

We are experiencing timeouts in the UI and processing delays due to increased error rates from S3.

Report: "Error processing delays"

Last update
resolved

We are now back to normal operations.

monitoring

We've identified and routed around the cause of the slowdown. We're working through our backlog of delayed notices.

investigating

At around 9:30 pm PST we started experiencing delays processing error notifications. We're investigating the cause of this.

Report: "Intermittent search issues"

Last update
resolved

This incident has been resolved.

investigating

We're experiencing intermittent search problems affecting some of our users.

Report: "Logplex failure"

Last update
resolved

From 07:23 UTC to 12:28 UTC our Heroku Logplex endpoint was down. This resulted in Heroku platform errors not being collected or alerted on during this window. The problem was caused by a misconfigured AMI that, when booted, would not pass the ELB health check. The outage lasted as long as it did because our monitoring was not configured correctly to wake us up for failures with this endpoint. The AMI problem has been corrected, and additional monitoring will be added.