Honeycomb

Is Honeycomb Down Right Now? Check whether there is an ongoing outage.

Honeycomb is currently Operational

Last checked from Honeycomb's official status page

Historical record of incidents for Honeycomb

Report: "Honeycomb UI down"

Last update
monitoring

A fix has been implemented and we are monitoring the results.

identified

We have identified the feature flag that triggered this and are rolling it back.

Report: "Query history is not available in classic environments"

Last update
resolved

This incident is resolved; query history is available in Classic environments again.

identified

Query history is not available in classic environments right now. The root cause has been identified and we're in the process of deploying a fix.

Report: "Query history is not available in classic environments"

Last update
Identified

Query history is not available in classic environments right now. The root cause has been identified and we're in the process of deploying a fix.

Report: "Notifications affected by Slack outage"

Last update
resolved

This is resolved.

monitoring

We are experiencing issues with Slack notifications while Slack is partially down.

Report: "Notifications affected by Slack outage"

Last update
Monitoring

We are experiencing issues with Slack notifications while Slack is partially down.

Report: "Degraded query performance"

Last update
postmortem

On April 16, we experienced 55 minutes of degraded query performance in interactive queries and board rendering for a dozen or so teams. During this time, queries that were usually fast started taking much longer than usual, going from less than 5 seconds to about a minute. More importantly, for about 25 minutes, the evaluation of triggers and SLOs in our US region was interrupted, meaning alerts may have been delayed or missed.

We mostly detected the slow queries through customers reaching out to us; on our end, the main performance SLOs never fell below their thresholds and we remained within our budget overall. We attributed the rising delays to increased usage of shared lambda resources, caused by background tasks queuing up, which in turn created contention for some queries. As we started an internal incident to handle this, we were paged about our alerting subsystem not reporting as healthy. We saw the contention in the underlying resources as the main contributor and tweaked some rate-limiting parameters to bring overall usage back to manageable levels. As we did so, the alerting system also recovered. We monitored the system and made sure it was functioning normally for a while before closing the incident.

Our investigation mostly focused on what exactly caused alerting to hang, a behavior that surprised every responder. A key observation was that the system worked fine under pressure until an automated deployment happened. We eventually found that while resource contention in our lambdas did lead to slowness for queries, it was coming back from the deployment while under pressure that caused the stalling. As it turns out, that application does gradual backfilling of recently changed SLOs in the background. However, in its initial iteration, it performs this task at boot time in the foreground and _then_ moves it to the background. Because the application restarted while the system was under heavy contention, it stalled on that first run, and did not recover while load remained high. When we solved the contention issue, the background jobs managed to finish, then moved to being asynchronous, and alerting came back.

Our two follow-up actions have been to tweak the alerting for our triggers and SLO components so they page roughly 3-5x faster next time, and to make sure the first evaluation of background tasks is done asynchronously, as we initially expected it to be. We do not plan on doing further in-depth reviews of this incident at this time.
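
The stalling behavior described above, a first backfill pass that runs in the foreground at boot before moving to the background, can be shown with a minimal sketch. This is not Honeycomb's code; the function names and the 60-second interval are hypothetical.

```python
import threading
import time

def backfill_recently_changed_slos():
    """Hypothetical stand-in for the gradual SLO backfill described above."""
    time.sleep(5)  # under heavy resource contention, this first pass can stall much longer

def periodic_backfill(interval_seconds: float = 60.0):
    while True:
        backfill_recently_changed_slos()
        time.sleep(interval_seconds)

# Behavior described in the postmortem: the first pass runs in the foreground at
# boot, so restarting while the system is under contention blocks startup here.
def start_with_blocking_first_pass():
    backfill_recently_changed_slos()  # boot stalls until this completes
    threading.Thread(target=periodic_backfill, daemon=True).start()

# Follow-up behavior: every pass, including the first, runs off the boot path.
def start_fully_async():
    threading.Thread(target=periodic_backfill, daemon=True).start()
```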

resolved

The system is stable and performance should be back to normal.

monitoring

Performance is now back to normal. We have added Triggers and SLOs to the list of impacted services, and upgraded the impact to Major given some triggers did not run.

identified

We are continuing to work on a fix for this issue.

identified

We have identified resource contention that currently leads to degraded query performance, which has slowed down most querying types for the last hour. The situation seems to be improving but we are keeping an eye on it.

Report: "Degraded query performance"

Last update
Identified

We have identified resource contention that currently leads to degraded query performance, which has slowed down most querying types for the last hour.The situation seems to be improving but we are keeping an eye on it.

Report: "Query Failures"

Last update
resolved

Between 18:10 and 18:15 UTC, we saw elevated levels of query failures, including ones made by users and ones made as part of trigger invocations. The system automatically recovered and began to serve queries again. After this, from 18:15 UTC to approximately 18:40 UTC, some user queries timed out after 5 minutes. No impact to triggers was seen during this time.

Report: "Query Failures"

Last update
Resolved

Between 18:10 and 18:15 UTC, we saw elevated levels of query failures, including ones made by users and ones made as part of trigger invocations. The system automatically recovered, and begun to serve queries again.After this, from 18:15 UTC to approximately 18:40 UTC, some user queries timed out after 5 minutes. No impact to triggers was seen during this time.

Report: "Query Builder WHERE clause rejecting overlapping column names with different casings"

Last update
resolved

The incident has been resolved.

monitoring

A fix has been deployed and we will continue to monitor.

identified

The Query Builder WHERE clause is currently rejecting lower-case selections where there are two columns with the same name that differ only in casing (e.g. if "name" vs "Name" are both present, it would not be possible to select "name"). We have identified the problem and are working on a fix.

Report: "Query Builder WHERE clause rejecting overlapping column names with different casings"

Last update
Identified

The Query Builder WHERE clause is currently rejecting lower-case selections where there are two columns with the same name that differ only in casing (e.g. if "name" vs "Name" are both present, it would not be possible to select "name"). We have identified the problem and are working on a fix.

Report: "Autocomplete was lowercasing all input in where clause"

Last update
resolved

Autocomplete in the where clause of Honeycomb UI's query builder was lowercasing all user input, preventing queries from being run correctly. A fix for this has been deployed and users should be able to run queries as expected again.

Report: "Autocomplete was lowercasing all input in where clause"

Last update
Resolved

Autocomplete in the where clause of Honeycomb UI's query builder was lowercasing all user input, preventing queries from being run correctly.A fix for this has been deployed and users should be able to run queries as expected again.

Report: "A font package is causing boards to fail"

Last update
resolved

The incident has been resolved.

investigating

Starting around 4pm PST on Mar 4, we began seeing errors where boards were not rendering. Some boards have this issue, but not all; a single customer might have a mix of boards that have this issue and boards that are fine. Board data doesn't appear to be affected.

Report: "Slack Trigger Issues"

Last update
resolved

This is resolved.

monitoring

Due to an ongoing Slack outage, trigger and SLO notifications might be failing to reach your workspaces. We are monitoring the situation and will post an update when it clears up.

Report: "Unusual database load"

Last update
resolved

This incident has been resolved.

monitoring

All systems go. Thank you for your patience.

monitoring

A fix has been implemented and we're monitoring the results. Things look good so far, but some symptoms may persist.

investigating

We are currently investigating unusual database load resulting in delays to certain noncritical features such as audit events.

Report: "Querying issues"

Last update
resolved

This incident has been resolved.

monitoring

We've pushed out a fix to relevant query subsystems. We're monitoring and will continue to do so, but we are fully functional.

identified

We have identified a fix for the querying issues observed yesterday as well as their underlying cause, and are currently executing it. The system is presently stable and usable.

investigating

We are still investigating a transient failure in our querying system; we continue to see no lost data, and querying should be back for all previously affected users.

investigating

We are still investigating a transient failure in our querying system; we have observed no lost data at this time, but querying continues to be impacted for a subset of users.

investigating

Some US customers are seeing queries slow to succeed and/or failing. We are investigating the cause of the slowness and the query failures.

Report: "UI Unavailable"

Last update
resolved

Between 14:00 and 14:20 PST (22:00 and 22:20 UTC), our UI became unavailable due to an issue with a deployment. The impact of this incident is that pages in the UI would not load. The API, and triggers/SLOs and their relevant alerting, were unaffected. We rolled back the impacted build while fixing the issue. A fix for the issue has been deployed, and we have confirmed its availability in production.

monitoring

Between 14:00 and 14:20 PST (22:00 and 22:20 UTC), our UI became unavailable due to an issue with a deployment. The impact of this incident is that pages in the UI would not load. The API, and triggers/SLOs and their relevant alerting, were unaffected. We have rolled back the relevant deployment and are deploying a fix. The UI should now be fully available. We will continue to monitor as we proceed with the fix.

identified

We are currently experiencing a UI outage due to a bad deployment. We are actively resolving the issue.

Report: "Transient query failures"

Last update
resolved

Between 15:51 and 15:55 PST (23:51 and 23:55 UTC), many queries that covered a long enough time range to touch "cold storage" data failed. Retrying those queries will succeed. Some triggers that would have run during this time and needed enough data to reach into cold storage would have also failed. This incident is resolved and all queries are once again functioning normally.

Report: "SLO and Trigger webhook notifications with irregular payloads"

Last update
resolved

We have confirmed the new build to have fixed the issue.

monitoring

The problematic code change has been spotted and corrected. Overall we believe the problem started for webhooks at roughly 21:53 UTC and stopped at 01:46 UTC.

identified

We have identified an issue with webhook targets on triggers and SLOs that results in irregular payloads. We are working on correcting the situation.

Report: "Data Ingestion lag"

Last update
resolved

All internal systems are healthy, as well as all public-facing elements.

monitoring

Our storage engine has caught up with ingest. All public impact should be resolved by now, but we're keeping an eye on the system load while everything internal stabilizes.

identified

A significant portion of our partitions are currently lagging behind on ingestion. The data is going to be available, but with a delay. Queries may not return recent data just yet, and triggers that run on short cycles are likely to not see the data they are monitoring. SLO alerts may be delayed but will account for the data when they catch up.

Report: "SampleRate processing impaired"

Last update
resolved

Processing of the SampleRate attribute with a capital S is restored. This incident is now resolved.

monitoring

We have deployed a fix and we are currently monitoring to ensure resolution. At this time we expect that SampleRate processing is working as usual.

investigating

Honeycomb has been ignoring the “SampleRate” attribute in OTLP telemetry for the past 19 hours. “sampleRate” with a lower-case S is being processed as normal. Telemetry sent using Beelines is also being processed as normal. We're currently deploying a fix and will post an update soon.

Report: "Ingest and processing delays"

Last update
resolved

We have seen full recovery of our querying in Production US.

investigating

We are working on a fix to address the increased latency. We are seeing slow but steady recovery. Triggers and SLOs appear to be operational and unimpacted but we are monitoring the situation closely.

investigating

We are investigating increased latency in our ingest pipeline. We are seeing slow but steady recovery. Triggers and SLOs appear to be operational but we are monitoring the situation closely.

Report: "Partial query outage"

Last update
resolved

This incident has been resolved.

identified

We have identified an issue causing certain query types (service maps, boards, API) to fail inconsistently for some users.

Report: "SLO evaluations delayed"

Last update
resolved

This incident has been resolved.

monitoring

We're implementing a fix; the services are healthy, but we are ensuring continued stability.

identified

We have identified an issue with our infrastructure that's affecting SLO evaluations and are rolling out a fix.

Report: "Querying degraded"

Last update
resolved

Affected services have resumed normal activity.

monitoring

The revert was successful, and we're seeing services recover. We will continue to monitor for abnormal behavior.

investigating

We're reverting the build that seems to be causing the issue.

investigating

We're currently investigating an issue with our storage engine, which affects querying, SLOs, and triggers.

Report: "Querying degraded due to backend crashes"

Last update
postmortem

On August 30, Honeycomb experienced about one hour during which 20% to 50% of queries failed in the US region, across all query types. During this time, queries created with the query builder, boards, trace views, and some trigger evaluations may have failed and returned an error instead of results.

The incident happened during the deployment of a routine upgrade of one of our gRPC libraries, which is used in a significant portion of our stack. As it rolled out in pre-production and non-public environments, a few minor transient errors were detected, but they could not be replicated in over an hour of observation after deployment. As we deployed the upgrade to production environments, the US Honeycomb instance started erroring out when communicating with our querying system, and this time it kept failing even after the deployment was completed. We ended up rolling back the change as the situation grew worse with more hosts involved; the failure wasn't transient and was triggered more consistently under the heavier load of our production systems.

A later investigation revealed that the issue had to do with a small change in the library (gated behind a compile flag) that still touched code _elsewhere_, such that an optimization operation clashed with our usage of the underlying protobuf library. The protobuf library is used for serialization of data, and while gRPC was functional, our usage of protobuf as part of our querying logic was impacted. Once the issue was understood, the code was modified to be safer with regards to the new library version, and was rolled out without further issue.

An internal incident review was conducted, and we do not plan on further external reports at this time.

resolved

This incident has been resolved.

monitoring

We have reverted the problematic build and are monitoring the status. Query availability and performance appears back to normal.

identified

We are reverting the problematic dependency that appears to have caused our query backend to sporadically crash.

Report: "Sandbox failing to render"

Last update
resolved

The sandbox environment is now fully functional.

monitoring

We have identified the issue and applied a fix that we are monitoring. In the meantime, you can refresh the page; if you are presented with the cookie acceptance dialog, accepting the cookies and refreshing the page will also work.

monitoring

We have identified the issue and applied a fix that we are monitoring. This impacts people who have declined cookies; temporarily accepting cookies will allow access to play.honeycomb.io in the meantime.

investigating

We are continuing to investigate this issue.

investigating

We are investigating reports that the sandbox environment is failing to render.

Report: "Queries and triggers looking back more than 2 hours may sometimes fail"

Last update
resolved

Our fix has been identified and deployed, and there is no indication of further customer impact.

monitoring

We've found the right set of fixes required to stabilize everything. Queries should work as usual but we're keeping an eye on service stability.

identified

We have rolled back our change as narrowly and quickly as we could, but we are finding it was connected to more elements of our infrastructure. We are still gradually rolling back more components until we stabilize querying of older data.

identified

We have identified an issue with a recent build that causes queries and triggers that look back further than 2 hours to sometimes fail. Retrying queries should work but may take multiple attempts. We are currently preparing a fix.

Report: "Activity log events delayed"

Last update
resolved

All backlogged events have finished processing and new events are being processed normally.

monitoring

To clarify the previous resolution: we are actually delayed by around half an hour, but no events have been lost. When the backlog finishes processing, we will resume normal posting of events.

Report: "Activity Log events delayed"

Last update
resolved

Ingestion of new events into the Activity Log is no longer delayed. No data has been lost.

investigating

Ingestion of new events into the Activity Log is delayed. Historical events are still queryable, but new events will not appear.

Report: "Activity Log Outage in the EU Region"

Last update
resolved

The Activity Log in the EU region is operational again.

monitoring

We have implemented a fix and are monitoring.

investigating

We are currently investigating an outage of the Activity Log in the EU region.

Report: "UI and API unavailable"

Last update
postmortem

On August 6, we experienced an outage impacting multiple components of our platform between 12:59:39 PDT and 13:20:05 PDT. Within that time range, and for 17 minutes, roughly 25% of incoming telemetry data was rejected; our API rejected 75% of requests (mostly to the `/1/auth` endpoint); the ui.honeycomb.io website was completely unusable for at least 19 minutes; triggers weren't evaluated for that time; and finally, SLO evaluations may have been delayed or issues may have happened in sending out notifications.

Our engineers noticed a degradation at roughly 13:00 PDT; alerts confirming a major issue went out at 13:04 PDT, and we spun up our internal incident response in parallel. As most components started suffering at the same time, right around a deployment, it took a few minutes to properly get situated and narrow down the issue to database performance, correlated with a table schema migration. We managed to identify a stuck query, but by the time we knew exactly which one was involved, the database was so overloaded that we could not log in with the elevated privileges required to terminate it, and we had to fail the database over. This resolved the issue, and we spent a few more minutes making sure all data was correct and that all subsystems recovered properly.

The schema migration was technically safe: a column addition to the `teams` table using an `INSTANT` algorithm that should cause no downtime nor interruption. Unbeknownst to us, merely a few seconds before the migration was applied, a read query doing a costly `SELECT` started running. This query had been mostly unchanged for the last 5 years and never caused issues, while being called roughly 10 times a day. The migration query modifying the same table got scheduled at the same time. It acquired a metadata lock that then prevented _any other query from running on this table_ while the `ALTER` statement itself waited for already-running queries and transactions using the table to terminate. This is usually a short wait, and as soon as the `ALTER` statement is scheduled, other operations can in turn be scheduled concurrently.

Our investigation revealed that this specific slow `SELECT` query could easily take more than 5 minutes to complete for some customer organizations. Generally, this isn't a problem, as these queries can run concurrently and do not block other operations; the client connection from our software times out and returns quickly while the query terminates later in MySQL. The end result is an unfortunate scheduling edge case within MySQL where a generally non-blocking query stalled a schema change that is also generally non-blocking. But because the query extended in time, everything having to do with `teams` (such as authentication) hung behind the slow query (which held back the `ALTER`, which held back all other queries until it could be scheduled), and many systems in turn became unresponsive. The same migration was re-applied without problems a few minutes later.

We are currently auditing the specific query that took long enough to contribute to the outage, to see if it can be optimized or to ensure it times out much faster on the database's side. Following this, we are hoping to better enforce database-side timeouts in general to align them with our client-side timeouts. This should ensure that schema migrations that should otherwise be safe actually are. We do not plan a more in-depth public review at this time, although we will continue investigating these events internally.
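
As a hedged illustration of the database-side timeout idea mentioned above (not Honeycomb's actual remediation), MySQL can be told to abort a `SELECT` that runs too long, either per statement via the `MAX_EXECUTION_TIME` optimizer hint or session-wide, so a slow read cannot hold up a metadata lock queue behind an otherwise-instant `ALTER`. The connection details, column names, and the 5-second value below are made up.

```python
import mysql.connector  # assumed driver; any MySQL client can issue the same statements

conn = mysql.connector.connect(host="db.example.internal", user="app",
                               password="change-me", database="app")
cur = conn.cursor()

# Per-statement: the server aborts this SELECT if it exceeds 5000 ms,
# aligning the database-side timeout with the client-side one.
cur.execute("SELECT /*+ MAX_EXECUTION_TIME(5000) */ id, name FROM teams LIMIT 10")
rows = cur.fetchall()

# Session-wide default for all SELECTs on this connection (milliseconds).
cur.execute("SET SESSION max_execution_time = 5000")
```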

resolved

We have confirmed all Honeycomb services are once again operational.

monitoring

Both the UI and API are once again functional and we are following up on the related changes.

investigating

The Honeycomb UI is unavailable to many customers and some traffic is being rejected at the API. We have identified an overloaded database table and are working to mitigate the issue.

Report: "AWS CloudWatch integration ingest delays"

Last update
resolved

Everything has remained stable for many hours so we are now closing this incident.

monitoring

The underlying AWS incident with Kinesis Data Streams has been resolved.

identified

Due to an ongoing incident with AWS Kinesis, ingest of CloudWatch metrics into Honeycomb via the Kinesis Streams integration is impacted and may be delayed.

Report: "Sandbox data missing"

Last update
resolved

After enough time has passed, all necessary data has been added back to the sandbox.

monitoring

Data generation for the Sandbox environment was offline for a while; the data is currently filling back up.

Report: "Board Query Errors"

Last update
resolved

This has been fixed and errors are no longer encountered.

identified

The source of the issue has been identified: unnamed Board queries will error when clicked, for all customers, not just Classic. A fix is being worked on now.

investigating

We are investigating an issue with Board Queries affecting Classic teams.

Report: "Trace loads may be missing columns"

Last update
resolved

We have rolled back the change, and queries appear to be functioning normally again. For trace loads specifically, you may need to click "reload trace" and then refresh the page in order for the issue to be fully fixed.

identified

We have identified an issue where trace loads and the events table are missing derived columns, in some cases causing those queries to fail.

Report: "www.honeycomb.io outage"

Last update
resolved

This incident has been resolved.

investigating

www.honeycomb.io is down. We are in contact with our hosting provider and will provide an update when this is resolved. The Honeycomb service itself is unaffected.

Report: "US Production site is down"

Last update
postmortem

On June 3rd, we experienced 20 minutes of querying outage in the US region and a small increase in ingest failures. During this time customers were unable to query their data and alerting was delayed, but less than 0.1% of the data sent to us was dropped.

We received an early alert about our ingest system appearing to be unreachable. This correlated strongly with a database schema migration we had just started running, which was quickly confirmed by engineers. The migration slowed down our biggest database in the US environment and caused operations to pile up. We opened the public incident, stating that all systems were down. In fact, because our ingest pathway has a robust caching mechanism, we were able to keep accepting the majority of data without issue. Other systems related to querying and triggers were still failing, however, and SLO alerting was delayed for multiple minutes.

Our first priority was to cancel the migration, but reaching the database proved difficult due to all connections being saturated. We debated failing the database over to another availability zone by rebooting it, but by the time we had managed to get a live connection to it, our automated systems had already detected the failure and performed the failover for us, after which the system became stable again.

The migration involved modifying an ENUM set on a database table, which unexpectedly caused a full table rewrite. It had previously run without issue on smaller databases, leading to a false sense of security. Additionally, two prior changes to the same ENUM field had not caused any performance issues. After the restart we made sure that data integrity was properly maintained, that all caches were properly aligned, and that the overall migration could safely complete.

We are currently looking at strengthening our ability to spot risky migrations ahead of time (regardless of how well they worked on _other_ databases in other environments).
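
One generic guardrail for the "unexpected full table rewrite" failure mode described above, offered as a hedged sketch rather than a description of Honeycomb's tooling: explicitly requesting an online algorithm makes MySQL refuse the `ALTER` with an error when it would need a table copy or a blocking lock, instead of silently falling back. Table, column, and connection names here are illustrative.

```python
import mysql.connector  # assumed driver

conn = mysql.connector.connect(host="db.example.internal", user="app",
                               password="change-me", database="app")
cur = conn.cursor()
try:
    # If MySQL cannot perform this ENUM change in place without blocking,
    # it raises an error here rather than quietly rewriting the whole table.
    cur.execute(
        "ALTER TABLE widgets "
        "MODIFY status ENUM('new', 'active', 'archived', 'deleted') NOT NULL, "
        "ALGORITHM=INPLACE, LOCK=NONE"
    )
except mysql.connector.Error as err:
    # e.g. ER_ALTER_OPERATION_NOT_SUPPORTED_REASON: treat the migration as risky
    # and schedule it as a managed online migration instead of running it directly.
    print(f"Migration is not safe to run in place, aborting: {err}")
```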

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the recovery.

identified

We have identified the cause of this incident and are working to remediate it.

Report: "Increased error query error rate."

Last update
resolved

The build has been rolled out to the query engine and performance is back to expected levels.

monitoring

We have detected a change made to our query engine that increased our error rate above baseline levels. This change is being reverted.

Report: "BubbleUp on SLOs intermittently not rendering heatmap data"

Last update
resolved

This incident has been resolved.

monitoring

The fix is live in production, and we will continue monitoring to make sure the issue doesn't re-occur.

identified

The fix has been merged and is on its way to production.

identified

Some clicks through to a query from an SLO BubbleUp are not rendering the graph showing duration_ms. The problem has been identified and we're working on a fix.

Report: "Ingestion errors"

Last update
postmortem

On May 2nd from 4:36 p.m. to 5:00 p.m. PDT, the ingest service for Honeycomb's US region had an incident that lasted 24 minutes. For the first 17 minutes, the service accepted varying amounts of traffic (roughly 30%) and for the last seven minutes, all incoming traffic was dropped. During this time, customers sending us telemetry (either directly from within the application or via a proxy such as Refinery or the OpenTelemetry Collector) would have seen slow or failed attempts to send telemetry to api.honeycomb.io. Additionally, while querying via the UI was mostly functional, for some of that time responsiveness slowed down and some queries failed.

This incident occurred during routine maintenance of one of our caching servers. In order to sustain the volume of traffic we receive from our customers, we leverage several layers of caches that store frequently accessed information: each process has an in-memory cache, there is a shared cache, and the database itself. The local and shared caches both expire information as it ages to manage memory use. Because of the layered aspect of these caches, either can be emptied for a short time and the system will continue to function. However, if one of the caches is unavailable for too long, the load will shift to the database.

During this maintenance, which adjusted the configuration of the shared cache (a change intended to improve the experience of our largest customers), the shared cache was unavailable for too long, and as the load shifted to the database it became overwhelmed. The remote cache must be filled from the database, so when the database was overwhelmed, the cache could not be filled. This was a reinforcing feedback loop: the more load the database had, the more it needed the cache, and the more difficult it was to fill the cache. At some point, the whole system tipped and the only way to recover was to block traffic entirely to refill the cache.

These phases of system degradation correspond to the two main phases of the incident. Of the 24 minutes our system was impacted, the first 17 were this increasing struggle to refill the cache as the database became more and more overloaded. The last seven minutes were when we shut off all incoming traffic in order for the database to recover and fill the cache. As soon as the cache was full, we allowed traffic back into the system.

![Chart of database load following a cache-clearing event](https://s3.amazonaws.com/www.honeycomb.io/cache_clear_db_load.jpg)

This chart shows some of the interactions described above. The addition of the remote cache to the system removes potential database load and allows the system to scale above what would have been the limit of the database (labeled Safety Limit). When the remote cache clears, load on the database gradually increases from caches expiring. However, there is a window between the time when the cache clears and when the increasing load from expiring caches hits the safety limit, and within that window the system still functions. If the process to refill the cache can succeed within this window, the system stays up. If it cannot, when the blue database line hits the red safety limit line, it becomes impossible to recover the system without taking it offline. So long as this window remains large enough, there are benefits to keeping the caching architecture simple. But when the window becomes too small, there are a few other paths forward. We can use this chart to help describe changes we can make to the system to make it harder to repeat this incident.

There are two things we can change about this chart: we can make the maintenance window larger, and we can reduce the chance we enter the window at all.

* By increasing the time a cache entry remains valid, we reduce the slope of the blue line after the cache-clearing event. In other words, with a fixed number of cache entries, spreading out expirations over more time means fewer expirations per second. This makes the maintenance window larger, giving us more time to complete the maintenance.
* By horizontally sharding the remote cache (spreading cache entries across multiple machines), each remote cache server represents only a portion of the total "baseline database load without the cache" volume. In other words, instead of the actual database load reaching what it would have been without the cache, it will plateau at some lower value. This also reduces the slope of the increase in database load, making the maintenance window larger.
* By using failover caches, the remote cache can remain available even as some servers are taken down for maintenance. This reduces the probability that the cache-clearing event happens at all, meaning we don't even enter the maintenance window.

In summary, caches make systems able to scale to great heights, but add complexity in operation and understanding to the overall system. Adding them in the right place opens a system to new opportunities, while at the same time making previously-simple behaviors more chaotic and difficult to understand. For this particular system impacting Honeycomb ingestion, we are both adding some failover to the cache and adjusting our cache timeouts, in order to ensure that we enter a maintenance window like this one less often, and that when we do, we have more time available to complete the needed maintenance.
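
The "maintenance window" reasoning above can be made concrete with some back-of-the-envelope arithmetic; the numbers below are invented for illustration and are not Honeycomb's real figures.

```python
# Invented figures, purely to illustrate the relationships described above.
cache_entries = 1_000_000   # entries held in the shared cache
ttl_seconds = 300           # how long an entry stays valid

# With a fixed number of entries and expirations spread evenly over the TTL,
# the refill load that shifts back to the database after a cache clear is
# roughly entries / TTL per second; doubling the TTL halves that slope.
refills_per_second = cache_entries / ttl_seconds
print(f"~{refills_per_second:,.0f} database refills/s after the shared cache clears")

# Horizontal sharding: with N cache shards, losing one shard for maintenance
# only exposes ~1/N of the entries, so the database load plateaus lower.
shards = 4
print(f"~{refills_per_second / shards:,.0f} refills/s if only one of {shards} shards clears")
```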

resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue. EU ingestion is not impacted.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating this issue.

Report: "Delay in SLO processing"

Last update
resolved

SLO processing is now caught back up.

monitoring

We are continuing to monitor for any further issues.

monitoring

SLO processing is currently up to 20 minutes behind; the cause has been identified and we are in the process of catching up.

Report: "Elevated query error rates and delays"

Last update
resolved

No significant impact has been felt in a good while, and our provider has also resolved their own outage.

monitoring

We have seen our error rate remain within normal bounds for a while and have managed to stabilize backlogged work the incident caused. We are keeping an eye on the situation until we get the all-clear from our providers.

identified

We are seeing queries succeed and fail at a normal and expected rate. However our provider still has an ongoing incident and we are still seeing some resource contention we are keeping an eye on.

identified

Our service provider has confirmed they're having a regional degradation on operations we rely on for querying. We are seeing what we can do to minimize impact, but mostly we are waiting on things to clear up on their end.

investigating

We are currently monitoring error rates and delays to a small fraction of our queries, above our usual baseline. Most queries still succeed without a problem. At this time we are talking with our providers to make sure things don't degrade and we are keeping an active eye on it.

Report: "Some queries seeing incomplete data"

Last update
resolved

This incident has been resolved.

monitoring

Some of our query workers struggled to keep up with a sudden influx of events. The unhealthy workers have now recovered, and we are monitoring to make sure this doesn't recur.

investigating

We’re investigating an issue that may be causing a small subset of queries to return inconsistent data when the query is re-run.

Report: "eu1 ingest and UI down"

Last update
postmortem

On April 9, event ingestion and the Honeycomb UI were unavailable in EU1 from 22:00:10 to 22:11:30 UTC. We’d like to share a bit more about what went wrong and the next steps we plan to take.

We frequently deploy fixes and improvements to our systems as part of our regular work to improve Honeycomb. Our systems run on Kubernetes, and we deploy changes by terminating pods and replacing them with new ones, one at a time. Normally, this is a completely uneventful process, as new pods pick up where the old ones left off, processing customer requests. However, in this case, traffic was not properly forwarded to the new pods, even though they were working correctly. Once the last of the old pods was terminated, service was abruptly interrupted. This occurred both for our event ingestion service and for the service handling our web-based user interface, roughly at the same time.

When a new pod is started, a component called the AWS Load Balancer Controller registers it with the Application Load Balancer (ALB), and the ALB then begins forwarding traffic to the new pod. During this incident, the AWS Load Balancer Controller failed, and therefore new pods were not registered with the ALB. Once this happened, the next deployment caused a service failure.

Earlier on April 9, we began deploying a routine system update to our Kubernetes cluster. Despite extensive testing in our development and staging environments, the system update interacted poorly with other components of our cluster, causing the AWS Load Balancer Controller and other pods to fail sporadically. We restarted the AWS Load Balancer Controller to restore service, and we also rolled back the system update to prevent recurrence.

With service restored, we have now turned our attention to gaining a better understanding of the exact cause of the failure, both in order to prevent recurrence and to allow us to safely deploy the system update. Now that we know that the AWS Load Balancer Controller is such a critical component, we’ll add monitoring and alerting to ensure that an on-call engineer is made aware of a failure quickly. This will allow us to preemptively pause deployments and prevent the kind of service interruption we saw on April 9.
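
A hedged sketch of the kind of monitoring described above: periodically check that the AWS Load Balancer Controller deployment has all the replicas it should, and page if not, before the next deploy rolls pods without ALB registration. The namespace and deployment name below are the controller's common installation defaults, not details of Honeycomb's cluster, and the alerting hook is left as a placeholder.

```python
from kubernetes import client, config  # official Kubernetes Python client

def controller_is_healthy(namespace: str = "kube-system",
                          name: str = "aws-load-balancer-controller") -> bool:
    """Return True if the controller deployment has all desired replicas ready."""
    config.load_incluster_config()  # use config.load_kube_config() when running off-cluster
    deployment = client.AppsV1Api().read_namespaced_deployment(name, namespace)
    desired = deployment.spec.replicas or 0
    ready = deployment.status.ready_replicas or 0
    return desired > 0 and ready >= desired

if __name__ == "__main__":
    if not controller_is_healthy():
        # Replace with whatever actually pages the on-call engineer.
        print("ALERT: aws-load-balancer-controller is not fully ready")
```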

resolved

This incident has been resolved.

investigating

We have fully restored service. Ingest and the UI in EU1 (ui.eu1.honeycomb.io) were unavailable from 22:00:10 to 22:11:30 UTC.

investigating

We have implemented a fix and are seeing recovery. We are monitoring to ensure the service is stable.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

The EU1 instance of Honeycomb is down. We are currently investigating.

Report: "Raw Row Query Failures"

Last update
resolved

This incident has been resolved.

monitoring

We identified a recent change to querying that caused results in the events table, bubble-up and the trace view to be incomplete. We have rolled back the change and results appear to be back to normal.

investigating

We are currently investigating raw row query failures. This is impacting a number of areas, including BubbleUp and the Trace view.

Report: "Login issues on SAML teams"

Last update
resolved

A fix has been applied and SAML logins should now be working again.

investigating

We're investigating an issue where users see an error on login for a subset of teams that utilize SAML for authentication.

Report: "Logins are broken"

Last update
resolved

We believe the issue has been addressed and shouldn't recur in the short term.

monitoring

We have identified the cause of logins not working and it should be functional once again. We'll continue to monitor the situation.

investigating

We are investigating reports of not being able to log into Honeycomb.

Report: "Low Granularity Query Failures"

Last update
resolved

Rollback has been completed and no instances of the error have occurred since the rollback, so we're closing this out.

identified

Due to a bug, certain low granularity queries are failing. The issue has already been identified and a fix is being deployed. As a workaround you can increase the granularity of your query.

Report: "DNS Configuration Issue"

Last update
resolved

All systems now confirmed operational

monitoring

In an attempt to make our DNS mechanism better and safer, we deployed a change that instead appears to have drastically reduced our ability to do DNS lookups. While we don’t have a full understanding of how that happened, we have rolled back the change and everything is functional again. Impact of the incident:
- SLO processing was delayed by 2 minutes, but has since recovered
- Queries and triggers were significantly impacted for 12 minutes
- We had a 19-second period where 14% of ingest events were impacted

Report: "Partial Ingest Outage"

Last update
resolved

Confirmed; the issue has been mitigated.

monitoring

The old build was successfully deployed; we're monitoring to make sure everything is working as it should.

identified

We believe we've identified the issue as being due to a recent code change, and are rolling back to a previous version.

investigating

We're investigating issues with our ingest pipeline.

Report: "Cannot create new datasets by sending events for classic environments"

Last update
resolved

This incident has been resolved.

monitoring

We have rolled back to a known-good build, and have confirmed that datasets are being created properly.

identified

We have identified an issue that prevents new datasets from being created when events are sent for them. It only impacts Honeycomb Classic environments. We are working on remediating the issue.

Report: "SLO service degredation"

Last update
resolved

This incident has been resolved.

monitoring

SLO service has returned to normal operation.

monitoring

SLOs are continuing to catch up and will be restored to normal service levels soon.

identified

Based on the rate of recovery, SLO evaluation should be caught up in 10 minutes.

identified

We've identified and remediated the core issue; SLO evaluations are currently 15 minutes behind and catching up.

investigating

We've identified an issue with our SLO service and are working to restore the SLO alerting pipeline.

Report: "Query errors"

Last update
resolved

This incident has been resolved.

investigating

The incident is resolved. Queries have recovered and all systems are working as usual.

investigating

Queries are now recovering. New queries should now be succeeding. Monitoring the situation.

investigating

We are currently looking into query errors that we're seeing after applying an infrastructure change.

Report: "Ingest errors and delays"

Last update
postmortem

On Sunday, November 5, we experienced a bit over 1 hour and 10 minutes of partially available ingestion, along with roughly 5 minutes of complete ingestion outage. Starting at around 02:15 UTC, customers might have seen event processing at our API ingestion endpoint become slower, often failing in an on-and-off manner, until it stopped entirely for a few minutes. At 02:50 UTC, the system briefly recovered, although it took until 03:30 UTC for it to become fully stable again.

We detected the issue through our standard alerting mechanisms, which notified our on-call engineers of issues with both ingestion performance and stability. Additionally, automated load-shedding mechanisms aiming to maintain system stability were tripped and generated extra notifications. Despite the load-shedding being in place (with the objective of dropping traffic more aggressively to prevent a cascade of ingest host failures), we found our ingestion fleet in a series of restarts. Our engineers tried to manually and aggressively scale the fleet up to buy it more capacity. We then noticed an interplay with Kubernetes’ crash-loop back-off behavior, which took previously failing hosts and kept them offline, which meant our overall cluster capacity still wasn’t sufficient. We also saw aggressive retry behavior from some traffic sources, looking a bit like a thundering herd, so we cut off most ingest for a few minutes to let us build back the required capacity in ingest hosts to deal with all the incoming data. We then quickly recovered, and tweaked rate limiting for our most aggressive sources to stabilize the cluster.

After analysis, we’ve gathered a few clues indicating that this is a variation on previously seen incidents for which we generally had adequate mitigation mechanisms in place, but that happened in a manner that circumvented some of them this time around. Specifically, we found an abnormally large number of queries coming from spread-out connections within a few minutes. Before auto-scaling could kick in (which may have been slowed down by recent optimizations to our ingest code that shifted its workload a bit), these requests also managed to trigger a lot of database writes that inadvertently trampled each other and bogged down our connection pools. This happened faster than it took our cache (which would circumvent that work) to propagate the writes to all hosts. This, in turn, amplified and accelerated the memory use of our ingestion hosts, until they died. By the time the hosts all came back, the writes had managed to make it through and the caches automatically refreshed themselves, but we hadn’t yet managed to become stable again.

We identified some of our customers sending us more than 20x their usual traffic. We initially thought this would be a backfill, but then started suspecting a surprising retry behavior. Because the OTel protocol can only return a success or a failure for an entire batch of spans, but we might drop only _some of them_, we suspected they would re-send entire batches just to cover a small portion of failures. We temporarily applied lower rate limits to their traffic until it subsided, and the ingestion pipeline as a whole became stable again.

Because the failure mode is relatively well understood, our next step is to focus on determining the projects that will address these failure paths. We have scaled up the fleet to prevent a repeat of this incident while we work on the longer-term fixes.
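
On the client side, the thundering-herd retry pattern mentioned above is usually avoided with capped exponential backoff plus jitter. The sketch below is generic and hypothetical, not a Honeycomb SDK or OpenTelemetry exporter feature.

```python
import random
import time

def send_with_backoff(send_batch, batch, max_attempts: int = 5,
                      base_delay: float = 0.5, max_delay: float = 30.0) -> bool:
    """send_batch(batch) returns True on success, False on a retryable failure."""
    for attempt in range(max_attempts):
        if send_batch(batch):
            return True
        # Full jitter: sleep a random amount up to the capped exponential delay so
        # that many clients recovering at once do not all retry in the same instant.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
    return False  # give up (or hand off to a durable queue) after max_attempts
```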

resolved

This incident has been resolved.

monitoring

Ingest is now healthy again. We are continuing to monitor the fleet.

identified

We've scaled up the fleet in an attempt to deal with what seems to be a sudden spike in traffic. The new instances appear to be healthy, but we're monitoring our SLOs to ensure ingest becomes healthy.

investigating

We are escalating the incident severity to critical as more ingestion traffic is getting dropped.

investigating

We're investigating alerts related to our ingestion service, which has a higher than normal error rate and response time.

Report: "transient issues due to database reboot"

Last update
resolved

We have finished confirming the system is once again functioning normally.

monitoring

Due to a required version upgrade, one of our databases rebooted at 17:42 PDT. Most services were recovered by 17:46 PDT. Ingestion was mostly unaffected but had an elevated error rate (roughly 0.1%). People using the UI also saw 500 errors during this time. Triggers scheduled to fire during the 4 minutes of DB unavailability did not run.