Honeycomb

Is Honeycomb Down Right Now? Check whether there is an ongoing outage.

Honeycomb is currently Operational

Last checked from Honeycomb's official status page

Historical record of incidents for Honeycomb

Report: "Honeycomb UI down"

Last update
monitoring

A fix has been implemented and we are monitoring the results.

identified

We have identified the feature flag that triggered this and are rolling it back.

Report: "Query history is not available in classic environments"

Last update
resolved

This incident is resolved; query history is available in Classic environments again.

identified

Query history is not available in classic environments right now. The root cause has been identified and we're in the process of deploying a fix.

Report: "Query history is not available in classic environments"

Last update
Identified

Query history is not available in classic environments right now. The root cause has been identified and we're in the process of deploying a fix.

Report: "Notifications affected by Slack outage"

Last update
resolved

This is resolved.

monitoring

We are experiencing issues with Slack notifications while Slack is partially down.

Report: "Notifications affected by Slack outage"

Last update
Monitoring

We are experiencing issues with Slack notifications while Slack is partially down.

Report: "Degraded query performance"

Last update
postmortem

On April 16, we experienced 55 minutes of degraded query performance in interactive queries and board rendering for a dozen or so teams. During this time, queries that were usually fast started taking much longer than usual, going from less than 5 seconds to about a minute. More importantly, for about 25 minutes, the evaluation of triggers and SLOs in our US region was interrupted, meaning alerts may have been delayed or missed.

We mostly detected the slow queries through customers reaching out to us; on our end, the main performance SLOs never fell below their thresholds and we remained within our budget overall. We attributed the rising delays to increased usage of shared lambda resources, caused by background tasks queuing up, which in turn created contention for some queries. As we started an internal incident to handle this, we were paged about our alerting subsystem not reporting as healthy. We saw the contention in the underlying resources as the main contributor and tweaked some rate-limiting parameters to bring overall usage back to manageable levels. As we did so, the alerting system also recovered. We monitored the system and made sure it was functioning normally for a while before closing the incident.

Our investigation mostly focused on what exactly caused alerting to hang, a behavior that surprised every responder. A key observation was that the system worked fine under pressure until an automated deployment happened. We eventually found that while resource contention in our lambdas did lead to slowness for queries, it was coming back from the deployment while under pressure that caused the stalling. As it turns out, that application does gradual backfilling of recently changed SLOs in the background. However, in its initial iteration, it performs this task at boot time in the foreground and _then_ moves it to the background. Because the application restarted while the system was under heavy contention, it stalled on that first run, and did not recover while load remained high. When we solved the contention issue, the background jobs managed to finish, then moved to being asynchronous, and alerting came back.

Our two follow-up actions have been to tweak the alerting for our triggers and SLO components so they page roughly 3-5x faster next time, and to make sure the first evaluation of background tasks is done asynchronously, as we initially expected it to be. We do not plan on doing further in-depth reviews of this incident at this time.
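
The stalling behavior described above, a first backfill pass that runs in the foreground at boot before moving to the background, can be shown with a minimal sketch. This is not Honeycomb's code; the function names and the 60-second interval are hypothetical.

```python
import threading
import time

def backfill_recently_changed_slos():
    """Hypothetical stand-in for the gradual SLO backfill described above."""
    time.sleep(5)  # under heavy resource contention, this first pass can stall much longer

def periodic_backfill(interval_seconds: float = 60.0):
    while True:
        backfill_recently_changed_slos()
        time.sleep(interval_seconds)

# Behavior described in the postmortem: the first pass runs in the foreground at
# boot, so restarting while the system is under contention blocks startup here.
def start_with_blocking_first_pass():
    backfill_recently_changed_slos()  # boot stalls until this completes
    threading.Thread(target=periodic_backfill, daemon=True).start()

# Follow-up behavior: every pass, including the first, runs off the boot path.
def start_fully_async():
    threading.Thread(target=periodic_backfill, daemon=True).start()
```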

resolved

The system is stable and performance should be back to normal.

monitoring

Performance is now back to normal. We have added Triggers and SLOs to the list of impacted services, and upgraded the impact to Major given some triggers did not run.

identified

We are continuing to work on a fix for this issue.

identified

We have identified resource contention that currently leads to degraded query performance, which has slowed down most querying types for the last hour. The situation seems to be improving but we are keeping an eye on it.

Report: "Degraded query performance"

Last update
Identified

We have identified resource contention that currently leads to degraded query performance, which has slowed down most querying types for the last hour.The situation seems to be improving but we are keeping an eye on it.

Report: "Query Failures"

Last update
resolved

Between 18:10 and 18:15 UTC, we saw elevated levels of query failures, including ones made by users and ones made as part of trigger invocations. The system automatically recovered and began to serve queries again. After this, from 18:15 UTC to approximately 18:40 UTC, some user queries timed out after 5 minutes. No impact to triggers was seen during this time.

Report: "Query Failures"

Last update
Resolved

Between 18:10 and 18:15 UTC, we saw elevated levels of query failures, including ones made by users and ones made as part of trigger invocations. The system automatically recovered, and begun to serve queries again.After this, from 18:15 UTC to approximately 18:40 UTC, some user queries timed out after 5 minutes. No impact to triggers was seen during this time.

Report: "Query Builder WHERE clause rejecting overlapping column names with different casings"

Last update
resolved

The incident has been resolved.

monitoring

A fix has been deployed and we will continue to monitor.

identified

The Query Builder WHERE clause is currently rejecting lower-case selections where there are two columns with the same name that differ only in casing (e.g. if "name" vs "Name" are both present, it would not be possible to select "name"). We have identified the problem and are working on a fix.

Report: "Query Builder WHERE clause rejecting overlapping column names with different casings"

Last update
Identified

The Query Builder WHERE clause is currently rejecting lower-case selections where there are two columns with the same name that differ only in casing (e.g. if "name" vs "Name" are both present, it would not be possible to select "name"). We have identified the problem and are working on a fix.

Report: "Autocomplete was lowercasing all input in where clause"

Last update
resolved

Autocomplete in the where clause of Honeycomb UI's query builder was lowercasing all user input, preventing queries from being run correctly. A fix for this has been deployed and users should be able to run queries as expected again.

Report: "Autocomplete was lowercasing all input in where clause"

Last update
Resolved

Autocomplete in the where clause of Honeycomb UI's query builder was lowercasing all user input, preventing queries from being run correctly.A fix for this has been deployed and users should be able to run queries as expected again.

Report: "A font package is causing boards to fail"

Last update
resolved

The incident has been resolved.

investigating

Starting around 4pm PST on Mar 4, we began seeing errors where boards were not rendering. Some boards have this issue, but not all; a single customer might have a mix of boards that have this issue and boards that are fine. Board data doesn't appear to be affected.

Report: "Slack Trigger Issues"

Last update
resolved

This is resolved.

monitoring

Due to an ongoing Slack outage, trigger and SLO notifications might be failing to reach your workspaces. We are monitoring the situation and will post an update when it clears up.

Report: "Unusual database load"

Last update
resolved

This incident has been resolved.

monitoring

All systems go. Thank you for your patience.

monitoring

A fix has been implemented and we're monitoring the results. Things look good so far, but some symptoms may persist.

investigating

We are currently investigating unusual database load resulting in delays to certain noncritical features such as audit events.

Report: "Querying issues"

Last update
resolved

This incident has been resolved.

monitoring

We've pushed out a fix to relevant query subsystems. We're monitoring and will continue to do so, but we are fully functional.

identified

We have identified a fix for the querying issues observed yesterday as well as their underlying cause, and are currently executing it. The system is presently stable and usable.

investigating

We are still investigating a transient failure in our querying system; we continue to see no lost data, and querying should be back for all previously affected users.

investigating

We are still investigating a transient failure in our querying system; we have observed no lost data at this time, but querying continues to be impacted for a subset of users.

investigating

Some US customers are seeing queries slow to succeed and/or failing. We are investigating the cause of the slowness and the query failures.

Report: "UI Unavailable"

Last update
resolved

Between 14:00 and 14:20 PST (22:00 and 22:20 UTC), our UI became unavailable due to an issue with a deployment. The impact of this incident is that pages in the UI would not load. The API, and triggers/SLOs and their relevant alerting, were unaffected. We rolled back the impacted build while fixing the issue. A fix for the issue has been deployed, and we have confirmed its availability in production.

monitoring

Between 14:00 and 14:20 PST (22:00 and 22:20 UTC), our UI became unavailable due to an issue with a deployment. The impact of this incident is that pages in the UI would not load. The API, and triggers/SLOs and their relevant alerting, were unaffected. We have rolled back the relevant deployment and are deploying a fix. The UI should now be fully available. We will continue to monitor as we proceed with the fix.

identified

We are currently experiencing a UI outage due to a bad deployment. We are actively resolving the issue.

Report: "Transient query failures"

Last update
resolved

Between 15:51 and 15:55 PST (23:51 and 23:55 UTC), many queries that covered a long enough time range to touch "cold storage" data failed. Retrying those queries will succeed. Some triggers that would have run during this time and needed enough data to reach into cold storage would have also failed. This incident is resolved and all queries are once again functioning normally.

Report: "SLO and Trigger webhook notifications with irregular payloads"

Last update
resolved

We have confirmed the new build to have fixed the issue.

monitoring

The problematic code change has been spotted and corrected. Overall we believe the problem started for webhooks at roughly 21:53 UTC and stopped at 01:46 UTC.

identified

We have identified an issue with webhook targets on triggers and SLOs that results in irregular payloads. We are working on correcting the situation.

Report: "Data Ingestion lag"

Last update
resolved

All internal systems are healthy, as well as all public-facing elements.

monitoring

Our storage engine has caught up with ingest. All public impact should be resolved by now, but we're keeping an eye on the system load while everything internal stabilizes.

identified

A significant portion of our partitions are currently lagging behind on ingestion. The data is going to be available, but with a delay. Queries may not return recent data just yet, and triggers that run on short cycles are likely to not see the data they are monitoring. SLO alerts may be delayed but will account for the data when they catch up.

Report: "SampleRate processing impaired"

Last update
resolved

Processing of the SampleRate attribute with a capital S is restored. This incident is now resolved.

monitoring

We have deployed a fix and we are currently monitoring to ensure resolution. At this time we expect that SampleRate processing is working as usual.

investigating

Honeycomb has been ignoring the “SampleRate” attribute in OTLP telemetry for the past 19 hours. “sampleRate” with a lower-case S is being processed as normal. Telemetry sent using Beelines is also being processed as normal. We're currently deploying a fix and will post an update soon.

Report: "Ingest and processing delays"

Last update
resolved

We have seen full recovery of our querying in Production US.

investigating

We are working on a fix to address the increased latency. We are seeing slow but steady recovery. Triggers and SLOs appear to be operational and unimpacted but we are monitoring the situation closely.

investigating

We are investigating increased latency in our ingest pipeline. We are seeing slow but steady recovery. Triggers and SLOs appear to be operational but we are monitoring the situation closely.

Report: "Partial query outage"

Last update
resolved

This incident has been resolved.

identified

We have identified an issue causing certain query types (service maps, boards, API) to fail inconsistently for some users.

Report: "SLO evaluations delayed"

Last update
resolved

This incident has been resolved.

monitoring

We're implementing a fix; the services are healthy, but we are ensuring continued stability.

identified

We have identified an issue with our infrastructure that's affecting SLO evaluations and are rolling out a fix.

Report: "Querying degraded"

Last update
resolved

Affected services have resumed normal activity.

monitoring

The revert was successful, and we're seeing services recover. We will continue to monitor for abnormal behavior.

investigating

We're reverting the build that seems to be causing the issue.

investigating

We're currently investigating an issue with our storage engine, which affects querying, SLOs, and triggers.

Report: "Querying degraded due to backend crashes"

Last update
postmortem

On August 30, Honeycomb experienced about one hour during which 20% to 50% of queries failed in the US region, across all query types. During this time, queries created with the query builder, boards, trace views, and some trigger evaluations may have failed and returned an error instead of results.

The incident happened during the deployment of a routine upgrade of one of our gRPC libraries, which is used in a significant portion of our stack. As it rolled out in pre-production and non-public environments, a few minor transient errors were detected, but they could not be replicated in over an hour of observation after deployment. As we deployed the upgrade to production environments, the US Honeycomb instance started erroring out when communicating with our querying system, and this time it kept failing even after the deployment was completed. We ended up rolling back the change as the situation grew worse with more hosts involved; the failure wasn't transient and was triggered more consistently under the heavier load of our production systems.

A later investigation revealed that the issue had to do with a small change in the library (gated behind a compile flag) that still touched code _elsewhere_, such that an optimization operation clashed with our usage of the underlying protobuf library. The protobuf library is used for serialization of data, and while gRPC was functional, our usage of protobuf as part of our querying logic was impacted. Once the issue was understood, the code was modified to be safer with regards to the new library version, and was rolled out without further issue.

An internal incident review was conducted, and we do not plan on further external reports at this time.

resolved

This incident has been resolved.

monitoring

We have reverted the problematic build and are monitoring the status. Query availability and performance appears back to normal.

identified

We are reverting the problematic dependency that appears to have caused our query backend to sporadically crash.

Report: "Sandbox failing to render"

Last update
resolved

The sandbox environment is now fully functional.

monitoring

We have identified the issue and applied a fix that we are monitoring. In the meantime, you can refresh the page; if you are presented with the cookie acceptance dialog, accepting the cookies and refreshing the page will also work.

monitoring

We have identified the issue and applied a fix that we are monitoring. This impacts people who have declined cookies; temporarily accepting cookies will allow access to play.honeycomb.io in the meantime.

investigating

We are continuing to investigate this issue.

investigating

We are investigating reports that the sandbox environment is failing to render.

Report: "Queries and triggers looking back more than 2 hours may sometimes fail"

Last update
resolved

Our fix has been identified and deployed, and there is no indication of further customer impact.

monitoring

We've found the right set of fixes required to stabilize everything. Queries should work as usual but we're keeping an eye on service stability.

identified

We have rolled back our change as narrowly and quickly as we could, but we are finding it was connected to more elements of our infrastructure. We are still gradually rolling back more components until we stabilize querying of older data.

identified

We have identified an issue with a recent build that causes queries and triggers that look back further than 2 hours to sometimes fail. Retrying queries should work but may take multiple attempts. We are currently preparing a fix.

Report: "Activity log events delayed"

Last update
resolved

All backlogged events have finished processing and new events are being processed normally.

monitoring

To clarify the previous resolution: we are actually delayed by around half an hour, but no events have been lost. When the backlog finishes processing, we will resume normal posting of events.

Report: "Activity Log events delayed"

Last update
resolved

Ingestion of new events into the Activity Log is no longer delayed. No data has been lost.

investigating

Ingestion of new events into the Activity Log is delayed. Historical events are still queryable, but new events will not appear.

Report: "Activity Log Outage in the EU Region"

Last update
resolved

The Activity Log in the EU region is operational again.

monitoring

We have implemented a fix and are monitoring.

investigating

We are currently investigating an outage of the Activity Log in the EU region.

Report: "UI and API unavailable"

Last update
postmortem

On August 6, we experienced an outage impacting multiple components of our platform between 12:59:39 PDT and 13:20:05 PDT. Within that time range, and for 17 minutes, roughly 25% of incoming telemetry data was rejected; our API rejected 75% of requests (mostly to the `/1/auth` endpoint); the ui.honeycomb.io website was completely unusable for at least 19 minutes; triggers weren't evaluated for that time; and finally, SLO evaluations may have been delayed or issues may have happened in sending out notifications.

Our engineers noticed a degradation at roughly 13:00 PDT; alerts confirming a major issue went out at 13:04 PDT, and we spun up our internal incident response in parallel. As most components started suffering at the same time, right around a deployment, it took a few minutes to properly get situated and narrow down the issue to database performance, correlated with a table schema migration. We managed to identify a stuck query, but by the time we knew exactly which one was involved, the database was so overloaded that we could not log in with the elevated privileges required to terminate it, and we had to fail the database over. This resolved the issue, and we spent a few more minutes making sure all data was correct and that all subsystems recovered properly.

The schema migration was technically safe: a column addition to the `teams` table using an `INSTANT` algorithm that should cause no downtime nor interruption. Unbeknownst to us, merely a few seconds before the migration was applied, a read query doing a costly `SELECT` started running. This query had been mostly unchanged for the last 5 years and never caused issues, while being called roughly 10 times a day. The migration query modifying the same table got scheduled at the same time. It acquired a metadata lock that then prevented _any other query from running on this table_ while the `ALTER` statement itself waited for already-running queries and transactions using the table to terminate. This is usually a short wait, and as soon as the `ALTER` statement is scheduled, other operations can in turn be scheduled concurrently.

Our investigation revealed that this specific slow `SELECT` query could easily take more than 5 minutes to complete for some customer organizations. Generally, this isn't a problem, as these queries can run concurrently and do not block other operations; the client connection from our software times out and returns quickly while the query terminates later in MySQL. The end result is an unfortunate scheduling edge case within MySQL where a generally non-blocking query stalled a schema change that is also generally non-blocking. But because the query extended in time, everything having to do with `teams` (such as authentication) hung behind the slow query (which held back the `ALTER`, which held back all other queries until it could be scheduled), and many systems in turn became unresponsive. The same migration was re-applied without problems a few minutes later.

We are currently auditing the specific query that took long enough to contribute to the outage, to see if it can be optimized or to ensure it times out much faster on the database's side. Following this, we are hoping to better enforce database-side timeouts in general to align them with our client-side timeouts. This should ensure that schema migrations that should otherwise be safe actually are. We do not plan a more in-depth public review at this time, although we will continue investigating these events internally.
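
As a hedged illustration of the database-side timeout idea mentioned above (not Honeycomb's actual remediation), MySQL can be told to abort a `SELECT` that runs too long, either per statement via the `MAX_EXECUTION_TIME` optimizer hint or session-wide, so a slow read cannot hold up a metadata lock queue behind an otherwise-instant `ALTER`. The connection details, column names, and the 5-second value below are made up.

```python
import mysql.connector  # assumed driver; any MySQL client can issue the same statements

conn = mysql.connector.connect(host="db.example.internal", user="app",
                               password="change-me", database="app")
cur = conn.cursor()

# Per-statement: the server aborts this SELECT if it exceeds 5000 ms,
# aligning the database-side timeout with the client-side one.
cur.execute("SELECT /*+ MAX_EXECUTION_TIME(5000) */ id, name FROM teams LIMIT 10")
rows = cur.fetchall()

# Session-wide default for all SELECTs on this connection (milliseconds).
cur.execute("SET SESSION max_execution_time = 5000")
```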

resolved

We have confirmed all Honeycomb services are once again operational.

monitoring

Both the UI and API are once again functional and we are following up on the related changes.

investigating

The Honeycomb UI is unavailable to many customers and some traffic is being rejected at the API. We have identified an overloaded database table and are working to mitigate the issue.

Report: "AWS CloudWatch integration ingest delays"

Last update
resolved

Everything has remained stable for many hours so we are now closing this incident.

monitoring

The underlying AWS incident with Kinesis Data Streams has been resolved.

identified

Due to an ongoing incident with AWS Kinesis, ingest of CloudWatch metrics into Honeycomb via the Kinesis Streams integration is impacted and may be delayed.

Report: "Sandbox data missing"

Last update
resolved

After enough time has passed, all necessary data has been added back to the sandbox.

monitoring

Data generation for the Sandbox environment was offline for a while; the data is currently filling back up.

Report: "Board Query Errors"

Last update
resolved

This has been fixed and errors are no longer encountered.

identified

The source of the issue has been identified: unnamed Board queries will error when clicked, for all customers, not just Classic. A fix is being worked on now.

investigating

We are investigating an issue with Board Queries affecting Classic teams.

Report: "Trace loads may be missing columns"

Last update
resolved

We have rolled back the change, and queries appear to be functioning normally again. For trace loads specifically, you may need to click "reload trace" and then refresh the page in order for the issue to be fully fixed.

identified

We have identified an issue where trace loads and the events table are missing derived columns, in some cases causing those queries to fail.

Report: "www.honeycomb.io outage"

Last update
resolved

This incident has been resolved.

investigating

www.honeycomb.io is down. We are in contact with our hosting provider and will provide an update when this is resolved. The Honeycomb service itself is unaffected.

Report: "US Production site is down"

Last update
postmortem

On June 3rd, we experienced 20 minutes of querying outage in the US region and a small increase in ingest failures. During this time customers were unable to query their data and alerting was delayed, but less than 0.1% of the data sent to us was dropped.

We received an early alert about our ingest system appearing to be unreachable. This correlated strongly with a database schema migration we had just started running, which was quickly confirmed by engineers. The migration slowed down our biggest database in the US environment and caused operations to pile up. We opened the public incident, stating that all systems were down. In fact, because our ingest pathway has a robust caching mechanism, we were able to keep accepting the majority of data without issue. Other systems related to querying and triggers were still failing, however, and SLO alerting was delayed for multiple minutes.

Our first priority was to cancel the migration, but reaching the database proved difficult due to all connections being saturated. We debated failing the database over to another availability zone by rebooting it, but by the time we had managed to get a live connection to it, our automated systems had already detected the failure and performed the failover for us, after which the system became stable again.

The migration involved modifying an ENUM set on a database table, which unexpectedly caused a full table rewrite. It had previously run without issue on smaller databases, leading to a false sense of security. Additionally, two prior changes to the same ENUM field had not caused any performance issues. After the restart we made sure that data integrity was properly maintained, that all caches were properly aligned, and that the overall migration could safely complete.

We are currently looking at strengthening our ability to spot risky migrations ahead of time (regardless of how well they worked on _other_ databases in other environments).
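
One generic guardrail for the "unexpected full table rewrite" failure mode described above, offered as a hedged sketch rather than a description of Honeycomb's tooling: explicitly requesting an online algorithm makes MySQL refuse the `ALTER` with an error when it would need a table copy or a blocking lock, instead of silently falling back. Table, column, and connection names here are illustrative.

```python
import mysql.connector  # assumed driver

conn = mysql.connector.connect(host="db.example.internal", user="app",
                               password="change-me", database="app")
cur = conn.cursor()
try:
    # If MySQL cannot perform this ENUM change in place without blocking,
    # it raises an error here rather than quietly rewriting the whole table.
    cur.execute(
        "ALTER TABLE widgets "
        "MODIFY status ENUM('new', 'active', 'archived', 'deleted') NOT NULL, "
        "ALGORITHM=INPLACE, LOCK=NONE"
    )
except mysql.connector.Error as err:
    # e.g. ER_ALTER_OPERATION_NOT_SUPPORTED_REASON: treat the migration as risky
    # and schedule it as a managed online migration instead of running it directly.
    print(f"Migration is not safe to run in place, aborting: {err}")
```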

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the recovery.

identified

We have identified the cause of this incident and are working to remediate it.

Report: "Increased error query error rate."

Last update
resolved

The build has been rolled out to the query engine and performance is back to expected levels.

monitoring

We have detected a change made to our query engine that increased our error rate above baseline levels. This change is being reverted.

Report: "BubbleUp on SLOs intermittently not rendering heatmap data"

Last update
resolved

This incident has been resolved.

monitoring

The fix is live in production, and we will continue monitoring to make sure the issue doesn't re-occur.

identified

The fix has been merged and is on its way to production.

identified

Some clicks through to a query from an SLO BubbleUp are not rendering the graph showing duration_ms. The problem has been identified and we're working on a fix.

Report: "Ingestion errors"

Last update
postmortem

On May 2nd from 4:36 p.m. to 5:00 p.m. PDT, the ingest service for Honeycomb's US region had an incident that lasted 24 minutes. For the first 17 minutes, the service accepted varying amounts of traffic (roughly 30%) and for the last seven minutes, all incoming traffic was dropped. During this time, customers sending us telemetry (either directly from within the application or via a proxy such as Refinery or the OpenTelemetry Collector) would have seen slow or failed attempts to send telemetry to api.honeycomb.io. Additionally, while querying via the UI was mostly functional, for some of that time responsiveness slowed down and some queries failed.

This incident occurred during routine maintenance of one of our caching servers. In order to sustain the volume of traffic we receive from our customers, we leverage several layers of caches that store frequently accessed information: each process has an in-memory cache, there is a shared cache, and the database itself. The local and shared caches both expire information as it ages to manage memory use. Because of the layered aspect of these caches, either can be emptied for a short time and the system will continue to function. However, if one of the caches is unavailable for too long, the load will shift to the database.

During this maintenance, which adjusted the configuration of the shared cache (a change intended to improve the experience of our largest customers), the shared cache was unavailable for too long, and as the load shifted to the database it became overwhelmed. The remote cache must be filled from the database, so when the database was overwhelmed, the cache could not be filled. This was a reinforcing feedback loop: the more load the database had, the more it needed the cache, and the more difficult it was to fill the cache. At some point, the whole system tipped and the only way to recover was to block traffic entirely to refill the cache.

These phases of system degradation correspond to the two main phases of the incident. Of the 24 minutes our system was impacted, the first 17 were this increasing struggle to refill the cache as the database became more and more overloaded. The last seven minutes were when we shut off all incoming traffic in order for the database to recover and fill the cache. As soon as the cache was full, we allowed traffic back into the system.

![Chart of database load following a cache-clearing event](https://s3.amazonaws.com/www.honeycomb.io/cache_clear_db_load.jpg)

This chart shows some of the interactions described above. The addition of the remote cache to the system removes potential database load and allows the system to scale above what would have been the limit of the database (labeled Safety Limit). When the remote cache clears, load on the database gradually increases from caches expiring. However, there is a window between the time when the cache clears and when the increasing load from expiring caches hits the safety limit, and within that window the system still functions. If the process to refill the cache can succeed within this window, the system stays up. If it cannot, when the blue database line hits the red safety limit line, it becomes impossible to recover the system without taking it offline. So long as this window remains large enough, there are benefits to keeping the caching architecture simple. But when the window becomes too small, there are a few other paths forward. We can use this chart to help describe changes we can make to the system to make it harder to repeat this incident.

There are two things we can change about this chart: we can make the maintenance window larger, and we can reduce the chance we enter the window at all.

* By increasing the time a cache entry remains valid, we reduce the slope of the blue line after the cache-clearing event. In other words, with a fixed number of cache entries, spreading out expirations over more time means fewer expirations per second. This makes the maintenance window larger, giving us more time to complete the maintenance.
* By horizontally sharding the remote cache (spreading cache entries across multiple machines), each remote cache server represents only a portion of the total "baseline database load without the cache" volume. In other words, instead of the actual database load reaching what it would have been without the cache, it will plateau at some lower value. This also reduces the slope of the increase in database load, making the maintenance window larger.
* By using failover caches, the remote cache can remain available even as some servers are taken down for maintenance. This reduces the probability that the cache-clearing event happens at all, meaning we don't even enter the maintenance window.

In summary, caches make systems able to scale to great heights, but add complexity in operation and understanding to the overall system. Adding them in the right place opens a system to new opportunities, while at the same time making previously-simple behaviors more chaotic and difficult to understand. For this particular system impacting Honeycomb ingestion, we are both adding some failover to the cache and adjusting our cache timeouts, in order to ensure that we enter a maintenance window like this one less often, and that when we do, we have more time available to complete the needed maintenance.
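
The "maintenance window" reasoning above can be made concrete with some back-of-the-envelope arithmetic; the numbers below are invented for illustration and are not Honeycomb's real figures.

```python
# Invented figures, purely to illustrate the relationships described above.
cache_entries = 1_000_000   # entries held in the shared cache
ttl_seconds = 300           # how long an entry stays valid

# With a fixed number of entries and expirations spread evenly over the TTL,
# the refill load that shifts back to the database after a cache clear is
# roughly entries / TTL per second; doubling the TTL halves that slope.
refills_per_second = cache_entries / ttl_seconds
print(f"~{refills_per_second:,.0f} database refills/s after the shared cache clears")

# Horizontal sharding: with N cache shards, losing one shard for maintenance
# only exposes ~1/N of the entries, so the database load plateaus lower.
shards = 4
print(f"~{refills_per_second / shards:,.0f} refills/s if only one of {shards} shards clears")
```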

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue. EU ingestion is not impacted.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Delay in SLO processing"

Last update
resolved

SLO processing is now caught back up.

monitoring

We are continuing to monitor for any further issues.

monitoring

SLO processing is currently up to 20 minutes behind; the cause has been identified and we are in the process of catching up.

Report: "Elevated query error rates and delays"

Last update
resolved

No significant impact has been felt in a good while, and our provider has also resolved their own outage.

monitoring

We have seen our error rate remain within normal bounds for a while and have managed to stabilize backlogged work the incident caused. We are keeping an eye on the situation until we get the all-clear from our providers.

identified

We are seeing queries succeed and fail at a normal and expected rate. However our provider still has an ongoing incident and we are still seeing some resource contention we are keeping an eye on.

identified

Our service provider has confirmed they're having a regional degradation on operations we rely on for querying. We are seeing what we can do to minimize impact, but mostly we are waiting on things to clear up on their end.

investigating

We are currently monitoring error rates and delays to a small fraction of our queries, above our usual baseline. Most queries still succeed without a problem. At this time we are talking with our providers to make sure things don't degrade and we are keeping an active eye on it.

Report: "Some queries seeing incomplete data"

Last update
resolved

This incident has been resolved.

monitoring

Some of our query workers struggled to keep up with a sudden influx of events. The unhealthy workers have now recovered, and we are monitoring to make sure this doesn't recur.

investigating

We’re investigating an issue that may be causing a small subset of queries to return inconsistent data when the query is re-run.

Report: "eu1 ingest and UI down"

Last update
postmortem

On April 9, event ingestion and the Honeycomb UI were unavailable in EU1 from 22:00:10 to 22:11:30 UTC. We’d like to share a bit more about what went wrong and the next steps we plan to take.

We frequently deploy fixes and improvements to our systems as part of our regular work to improve Honeycomb. Our systems run on Kubernetes, and we deploy changes by terminating pods and replacing them with new ones, one at a time. Normally, this is a completely uneventful process, as new pods pick up where the old ones left off, processing customer requests. However, in this case, traffic was not properly forwarded to the new pods, even though they were working correctly. Once the last of the old pods was terminated, service was abruptly interrupted. This occurred both for our event ingestion service and for the service handling our web-based user interface, roughly at the same time.

When a new pod is started, a component called the AWS Load Balancer Controller registers it with the Application Load Balancer (ALB), and the ALB then begins forwarding traffic to the new pod. During this incident, the AWS Load Balancer Controller failed, and therefore new pods were not registered with the ALB. Once this happened, the next deployment caused a service failure.

Earlier on April 9, we began deploying a routine system update to our Kubernetes cluster. Despite extensive testing in our development and staging environments, the system update interacted poorly with other components of our cluster, causing the AWS Load Balancer Controller and other pods to fail sporadically. We restarted the AWS Load Balancer Controller to restore service, and we also rolled back the system update to prevent recurrence.

With service restored, we have now turned our attention to gaining a better understanding of the exact cause of the failure, both in order to prevent recurrence and to allow us to safely deploy the system update. Now that we know that the AWS Load Balancer Controller is such a critical component, we’ll add monitoring and alerting to ensure that an on-call engineer is made aware of a failure quickly. This will allow us to preemptively pause deployments and prevent the kind of service interruption we saw on April 9.
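
A hedged sketch of the kind of monitoring described above: periodically check that the AWS Load Balancer Controller deployment has all the replicas it should, and page if not, before the next deploy rolls pods without ALB registration. The namespace and deployment name below are the controller's common installation defaults, not details of Honeycomb's cluster, and the alerting hook is left as a placeholder.

```python
from kubernetes import client, config  # official Kubernetes Python client

def controller_is_healthy(namespace: str = "kube-system",
                          name: str = "aws-load-balancer-controller") -> bool:
    """Return True if the controller deployment has all desired replicas ready."""
    config.load_incluster_config()  # use config.load_kube_config() when running off-cluster
    deployment = client.AppsV1Api().read_namespaced_deployment(name, namespace)
    desired = deployment.spec.replicas or 0
    ready = deployment.status.ready_replicas or 0
    return desired > 0 and ready >= desired

if __name__ == "__main__":
    if not controller_is_healthy():
        # Replace with whatever actually pages the on-call engineer.
        print("ALERT: aws-load-balancer-controller is not fully ready")
```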

resolved

This incident has been resolved.

investigating

We have fully restored service. Ingest and the UI in EU1 (ui.eu1.honeycomb.io) were unavailable from 22:00:10 to 22:11:30 UTC.

investigating

We have implemented a fix and are seeing recovery. We are monitoring to ensure the service is stable.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

The EU1 instance of Honeycomb is down. We are currently investigating.

Report: "Raw Row Query Failures"

Last update
resolved

This incident has been resolved.

monitoring

We identified a recent change to querying that caused results in the events table, bubble-up and the trace view to be incomplete. We have rolled back the change and results appear to be back to normal.

investigating

We are currently investigating raw row query failures. This is impacting a number of areas, including BubbleUp and the Trace view.

Report: "Login issues on SAML teams"

Last update
resolved

A fix has been applied and SAML logins should now be working again.

investigating

We're investigating an issue where users see an error on login for a subset of teams that utilize SAML for authentication.

Report: "Logins are broken"

Last update
resolved

We believe the issue has been addressed and shouldn't recur in the short term.

monitoring

We have identified the cause of logins not working and it should be functional once again. We'll continue to monitor the situation.

investigating

We are investigating reports of not being able to log into Honeycomb.

Report: "Low Granularity Query Failures"

Last update
resolved

Rollback has been completed and no instances of the error have occurred since the rollback, so we're closing this out.

identified

Due to a bug, certain low granularity queries are failing. The issue has already been identified and a fix is being deployed. As a workaround you can increase the granularity of your query.

Report: "DNS Configuration Issue"

Last update
resolved

All systems now confirmed operational

monitoring

In an attempt to make our DNS mechanism better and safer, we deployed a change that instead appears to have drastically reduced our ability to do DNS lookups. While we don’t have a full understanding of how that happened, we have rolled back the change and everything is functional again. Impact of the incident:
- SLO processing was delayed by 2 minutes, but has since recovered
- Queries and triggers were significantly impacted for 12 minutes
- We had a 19-second period where 14% of ingest events were impacted

Report: "Partial Ingest Outage"

Last update
resolved

Confirmed; the issue has been mitigated.

monitoring

The old build was successfully deployed; we're monitoring to make sure everything is working as it should.

identified

We believe we've identified the issue as being due to a recent code change, and are rolling back to a previous version.

investigating

We're investigating issues with our ingest pipeline.

Report: "Cannot create new datasets by sending events for classic environments"

Last update
resolved

This incident has been resolved.

monitoring

We have rolled back to a known-good build, and have confirmed that datasets are being created properly.

identified

We have identified an issue that prevents new datasets from being created when events are sent for them. It only impacts Honeycomb Classic environments. We are working on remediating the issue.

Report: "SLO service degredation"

Last update
resolved

This incident has been resolved.

monitoring

SLO service has returned to normal operation.

monitoring

SLOs are continuing to catch up and will be restored to normal service levels soon.

identified

Based on the rate of recovery, SLO evaluation should be caught up in 10 minutes.

identified

We've identified and remediated the core issue; SLO evaluations are currently 15 minutes behind and catching up.

investigating

We've identified an issue with our SLO service and are working to restore the SLO alerting pipeline.

Report: "Query errors"

Last update
resolved

This incident has been resolved.

investigating

The incident is resolved. Queries have recovered and all systems are working as usual.

investigating

Queries are now recovering. New queries should now be succeeding. Monitoring the situation.

investigating

We are currently looking into query errors that we're seeing after applying an infrastructure change.

Report: "Ingest errors and delays"

Last update
postmortem

On Sunday, November 5, we experienced a bit over 1 hour and 10 minutes of partially available ingestion, along with roughly 5 minutes of complete ingestion outage. Starting at around 02:15 UTC, customers might have seen event processing at our API ingestion endpoint become slower, often failing in an on-and-off manner, until it stopped entirely for a few minutes. At 02:50 UTC, the system briefly recovered, although it took until 03:30 UTC for it to become fully stable again.

We detected the issue through our standard alerting mechanisms, which notified our on-call engineers of issues with both ingestion performance and stability. Additionally, automated load-shedding mechanisms aiming to maintain system stability were tripped and generated extra notifications. Despite the load-shedding being in place (with the objective of dropping traffic more aggressively to prevent a cascade of ingest host failures), we found our ingestion fleet in a series of restarts. Our engineers tried to manually and aggressively scale the fleet up to buy it more capacity. We then noticed an interplay with Kubernetes’ crash-loop back-off behavior, which took previously failing hosts and kept them offline, which meant our overall cluster capacity still wasn’t sufficient. We also saw aggressive retry behavior from some traffic sources, looking a bit like a thundering herd, so we cut off most ingest for a few minutes to let us build back the required capacity in ingest hosts to deal with all the incoming data. We then quickly recovered, and tweaked rate limiting for our most aggressive sources to stabilize the cluster.

After analysis, we’ve gathered a few clues indicating that this is a variation on previously seen incidents for which we generally had adequate mitigation mechanisms in place, but that happened in a manner that circumvented some of them this time around. Specifically, we found an abnormally large number of queries coming from spread-out connections within a few minutes. Before auto-scaling could kick in (which may have been slowed down by recent optimizations to our ingest code that shifted its workload a bit), these requests also managed to trigger a lot of database writes that inadvertently trampled each other and bogged down our connection pools. This happened faster than it took our cache (which would circumvent that work) to propagate the writes to all hosts. This, in turn, amplified and accelerated the memory use of our ingestion hosts, until they died. By the time the hosts all came back, the writes had managed to make it through and the caches automatically refreshed themselves, but we hadn’t yet managed to become stable again.

We identified some of our customers sending us more than 20x their usual traffic. We initially thought this would be a backfill, but then started suspecting a surprising retry behavior. Because the OTel protocol can only return a success or a failure for an entire batch of spans, but we might drop only _some of them_, we suspected they would re-send entire batches just to cover a small portion of failures. We temporarily applied lower rate limits to their traffic until it subsided, and the ingestion pipeline as a whole became stable again.

Because the failure mode is relatively well understood, our next step is to focus on determining the projects that will address these failure paths. We have scaled up the fleet to prevent a repeat of this incident while we work on the longer-term fixes.
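
On the client side, the thundering-herd retry pattern mentioned above is usually avoided with capped exponential backoff plus jitter. The sketch below is generic and hypothetical, not a Honeycomb SDK or OpenTelemetry exporter feature.

```python
import random
import time

def send_with_backoff(send_batch, batch, max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 30.0) -> bool:
    """send_batch(batch) returns True on success, False on a retryable failure."""
    for attempt in range(max_attempts):
        if send_batch(batch):
            return True
        # Full jitter: sleep a random amount up to the capped exponential delay so
        # that many clients recovering at once do not all retry in the same instant.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
    return False  # give up (or hand off to a durable queue) after max_attempts
```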

resolved

This incident has been resolved.

monitoring

Ingest is now healthy again. We are continuing to monitor the fleet.

identified

We've scaled up the fleet in an attempt to deal with what seems to be a sudden spike in traffic. The new instances appear to be healthy, but we're monitoring our SLOs to ensure ingest becomes healthy.

investigating

We are escalating the incident severity to critical as more ingestion traffic is getting dropped.

investigating

We're investigating alerts related to our ingestion service, which has a higher than normal error rate and response time.

Report: "transient issues due to database reboot"

Last update
resolved

We have finished confirming the system is once again functioning normally.

monitoring

Due to a required version upgrade, one of our databases rebooted at 17:42 PDT. Most services were recovered by 17:46 PDT. Ingestion was mostly unaffected but had an elevated error rate (roughly 0.1%). People using the UI also saw 500 errors during this time. Triggers scheduled to fire during the 4 minutes of DB unavailability did not run.