Historical record of incidents for PostHog
Report: "Errors from an upstream provider outage"
Last update: We're experiencing elevated errors from an upstream provider. We're monitoring the issue and will post an update soon.
Report: "Cohort recalculations taking longer than expected"
Last update: We've spotted that a small number of cohorts are stuck in a recalculating state, and a larger number are taking longer than 24 hours to automatically recalculate as they should. We've identified the issue and have deployed a fix.
Report: "Queries are slow to run"
Last update: We've been alerted to an increase in query times. We're currently investigating the issue, and will provide an update once we identify the root cause.
Report: "Elevated errors on us.posthog.com"
Last update: We're seeing elevated errors loading the PostHog interface. We're investigating and we'll update you as we know more.
Report: "Elevated errors on us.posthog.com"
Last updateWe're seeing elevated errors loading the posthog interface. We're investigating and we'll update you as we know more.
Report: "Data Processing Delays - Reporting Tools Affected"
Last update: The ingestion delay incident has been resolved.
Due to delays in a maintenance process, our data processing infrastructure is running behind, which is causing inaccuracies in the reporting tools. No data has been lost and the system should be caught up shortly.
Report: "EU: elevated errors on web UI"
Last update: This incident has been resolved.
The situation is back to normal. We found the root cause to be in our networking stack and are preparing a long-term fix for it. Thanks for your patience!
The situation seemed to have calmed down; we're investigating the root cause.
We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
Report: "Data Processing Delays - Reporting Tools Affected"
Last updateThe ingestion delay incident has been resolved
Due to delays in a maintenance process, our data processing infrastructure is running behind which is causing inaccuracies in the reporting tools. No data has been lost and the system should be caught up shortly.
Report: "EU: elevated errors on web UI"
Last updateThis incident has been resolved.
Situation is back to normal. We found the root cause being in our networking stack. We're preparing a long term fix for it. Thanks for your patience!
The situation seemed to have calmed down, we're investigating the root cause.
We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
Report: "US: Delayed event ingestion"
Last update: The backlog has been fully processed and event ingestion is back to normal. Thank you for bearing with us and apologies for the disruption.
We are working through the lagged backlog and still monitoring progress.
We have increased consumer resources to speed up the resolution and are continuing to monitor the rate.
We identified another related issue and rolled out the appropriate fix. The lag should be coming down, and we are continuing to monitor it.
We identified the issue and rolled out a fix. The event lag is dropping, and we are continuing to monitor it.
We're currently falling behind on event ingestion. No data loss has occurred, and we're actively investigating the issue.
Report: "US: Delayed event ingestion"
Last updateThe backlog has been fully processed and event ingestion is back to normal. Thank you for bearing with us and apologies for the disruption.
We are consuming the lagged backlog and still monitoring the progress.
We have increased the consumer resources to speed up the resolution and keep monitoring the rate.
We identified another related issue and rolled the appropriate fix. The lag should be down and we keep monitoring it.
We identified the issue and rolled out a fix. The event lag is dropping, and we keep monitoring it.
We're currently falling behind on event ingestion. No data loss has occurred, and we're actively investigating the issue.
Report: "Data Processing Delays - Reporting Tools Affected"
Last update: This incident has been resolved.
We identified the issue and the ingestion pipeline is catching up.
Our data processing infrastructure is running behind, which is causing inaccuracies in the reporting tools. No data has been lost.
Report: "Posthog Cloud EU Database Maintenance"
Last updateThe scheduled maintenance has been completed.
Scheduled maintenance is currently in progress. We will provide updates as necessary.
We are performing scheduled maintenance on our EU Cloud ClickHouse database. We don't expect significant disruption, but there may be some slow queries or ingestion delays.
Report: "Increased parts count impacting performance"
Last update: Part counts are back to normal and the cluster is responding normally again. We'll keep monitoring it and let you know if we see misbehavior again.
We are currently investigating increased part counts on our datastore and why these parts are not being merged as they should. This will cause increased query times.
Report: "US: Delayed event ingestion"
Last update: We've caught up on our backlog of messages. Ingestion rates look optimal. Parts are being merged as they should. New nodes are fully online. Query latencies are looking great at 100ms avg. Should be smooth sailing from here on out. Enjoy your Friday!
There were some recurring errors in our infrastructure that led us to restart ClickHouse nodes. We are falling behind on event ingestion, as we are replacing some nodes in our ClickHouse cluster. This will increase lag in our ingestion pipeline. Performance may be impacted during this time too. We are still working on this and monitoring it.
Report: "Elevated API Errors"
Last update: We've resolved the incident. This just affected querying and no data was lost. We're still working on finding the root cause for this issue (our ClickHouse nodes were segfaulting without warning) and will continue to monitor.
We're experiencing failures to load data across the entire app at the moment. We've identified the root cause and are working to resolve this asap.
We're experiencing failures to load data across the entire app at the moment. We've identified the root cause and are working to resolve this asap.
Report: "US degraded performance"
Last update: Lag has recovered and the system is completely functional again. Sorry for any inconvenience caused by this incident.
The cluster is now responsive and data ingestion has resumed. The app is responding better now. We are still monitoring a couple of fixes we have pushed. We identified a query that was flooding the cluster, which may have been the root cause of this.
We have recovered a good part of the cluster, but we are still working to bring it back completely. Performance may still be degraded. We think some problematic queries may have been the root cause; we are still investigating.
We are trying to bring the cluster back. The app may be completely unresponsive, and lag is expected during this time. We'll provide an update as soon as possible.
We have detected a partial outage in our ClickHouse cluster, and it's impacting application responsiveness and the performance of loading insights. We are investigating the root cause.
Report: "Elevated API Errors"
Last update: We've not seen a recurrence of this issue, so we're closing this incident now.
We're still investigating high load for some offline functionality (i.e. exports), but the vast majority of the app should work fine now.
Our US app instance is down and pods are unhealthy. We're figuring out why and are working on resolving it. Data ingestion and feature flags are not affected.
Report: "US degraded performance"
Last update: We rolled out a change that increased load, causing queries to be slower. We rolled back that change, so performance should be back to normal.
We have spotted that our data infrastructure is under heavy load, which is increasing the time the app takes to load insights or causing errors when loading them. We are investigating the root cause.
Report: "Elevated API Errors"
Last update: The problem has been fixed.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Intermitting API erorrs - API endpoints & feature flags"
Last updateLooking good, resolved!
We scaled the infrastructure components and are monitoring this. First signs indicate recovery. We will come back with an update once this is verified.
We identified intermittent 500s on several API endpoints, including feature flags. The cause appears to be an underprovisioned infrastructure component. We're working on a fix. Apologies for any inconvenience.
Report: "Elevated API Errors"
Last update: We identified previously undetected underprovisioning in one of our network components. We have scaled this up and are working on a fix to mitigate this long-term. Thank you for your patience.
Performance and error rate are back to normal levels. We're still investigating the root cause of this issue.
We are continuing to investigate this issue. Notice about US: this incident never affected the US environment. The "partial outage" status was wrong for that region, and we will correct it later. Apologies for the inconvenience.
The error rate has gone down; we're still looking for the root cause.
Elevated error rates are coming up again; we're investigating.
We identified a surge in memory usage and workload eviction events. We scaled up the feature flag and web app services to mitigate. We're monitoring this.
The situation has calmed down after scaling up resources. We're still investigating the root cause. Notice: an earlier message reported that this was about the US region. That was wrong; this only affects the EU region. Apologies for the initial incorrect report.
We are continuing to investigate this issue.
We're experiencing an elevated level of API errors, including feature flags, and are currently looking into the issue.
Report: "Data Processing Delays - Reporting Tools Affected"
Last update: This incident has been resolved.
We're monitoring the ingestion pipeline, as it processes the delayed messages. We're estimating that the system will fully recover within an hour.
We are still investigating intermittent latency spikes in the event ingestion pipeline. Events are still being processed with a delay, which should decrease over time.
We are still investigating the root cause of the issue. Events are still delayed, but the delay is no longer increasing. We hope to have a resolution shortly.
Our data processing infrastructure is running behind, which is causing inaccuracies in the reporting tools. No data has been lost and the system should be caught up shortly.
Report: "API Query endpoint intermittently 500'ing"
Last update: This incident was resolved over the weekend.
We've shed load and haven't seen errors recur yet. We'll continue monitoring this over the weekend.
The API query endpoint is throwing intermittent 500 errors due to capacity limits on our end. We are working to fix this and to make the errors clearer. If known-valid queries are failing with 500s, we recommend retrying them with exponential backoff.
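As a rough illustration of the retry recommendation in the update above, here is a minimal Python sketch using the standard requests library. It is not an official PostHog example: the endpoint URL, project ID, API key, and query payload are placeholder assumptions you would replace with your own values.

import random
import time

import requests

# Placeholder values (assumptions for illustration, not taken from the incident report).
QUERY_URL = "https://us.posthog.com/api/projects/<project_id>/query/"
HEADERS = {"Authorization": "Bearer <personal_api_key>"}
PAYLOAD = {"query": {"kind": "HogQLQuery", "query": "select count() from events"}}

def query_with_backoff(max_attempts: int = 5, base_delay: float = 1.0) -> dict:
    """POST the query, retrying 5xx responses with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        response = requests.post(QUERY_URL, headers=HEADERS, json=PAYLOAD, timeout=30)
        if response.status_code < 500:
            response.raise_for_status()  # surface 4xx errors immediately
            return response.json()
        # On a 5xx, wait 1s, 2s, 4s, ... plus a little jitter before retrying.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"Query still failing with 5xx after {max_attempts} attempts")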
Report: "US Error Tracking Processing Delays"
Last update: Bug fixed, ingestion workers scaled back up and lag recovering rapidly. No data loss should be observable.
We've identified the root cause of the issue. We are reprocessing exception events and continuing to monitor to make sure the pipeline fully recovers.
We are currently experiencing downtime in our error tracking data pipeline while a bug is being resolved. No data loss has occurred.
Report: "Elevated API Errors Evaluating Feature Flags"
Last update: After adding more database capacity, feature flag evaluation has recovered to normal levels. We're closing this incident now but will keep monitoring, and we're working on a long-term fix. Apologies for the inconvenience.
We saw a surge in feature flag evaluations and increased backend and database capacity. We're seeing the first signs of recovery.
US: We're experiencing an elevated level of feature flags API errors and are currently investigating.
Report: "Processing Delays"
Last update: We've resolved the issue and ingestion has caught up to real time.
We're keeping a close eye on our ingestion delay. Events might take up to 35 minutes to show up inside PostHog in our EU Cloud. No data has been lost.
Our EU data processing infrastructure is running behind, which is causing inaccuracies in the reporting tools. No data has been lost, and the system should catch up shortly. We're monitoring it closely.
Report: "Data pipeline delivery delays in US Cloud"
Last update: We've identified and fixed the bottleneck, and we've improved our alerting to avoid the issue in the future.
Pipeline destinations are currently experiencing delays in US Cloud. This means deliveries may be sent significantly later than the events that trigger them. No data has been lost, and the deliveries will happen as we catch up on processing.
Report: "Elevated API Errors for Feature Flag evaluation"
Last update: The load issue has been resolved.
We're experiencing an elevated level of API errors when evaluating feature flags, due to unexpected load. We're currently investigating.
Report: "Elevated feature flag and local evaluation API Errors"
Last update: The load spike has been identified and resolved; error rates and API latency have returned to normal.
We're seeing unexpected database load causing query timeouts and elevated latency on these endpoints.
Report: "Elevated API Errors - Feature Flags and Local Evaluation"
Last update: Load has dropped and our error rate has returned to normal levels.
We're experiencing an elevated level of API errors when evaluating feature flags, due to unexpected load. We're currently investigating.
Report: "Elevated capture errors in the US region"
Last update: A patch was applied and we are no longer seeing errors.
We're experiencing elevated capture endpoint error rates due to unanticipated Kafka cluster patching. The vast, vast majority of requests are being retried successfully by our network edge routers, but some very large volume customers may see a very small number of terminally failed requests.
Report: "Batch exports not making progress in US Cloud"
Last update: This incident has been resolved.
We have narrowed down the problem to a very small set of Snowflake batch exports that we have manually cancelled. If you were affected we will be reaching out. All other batch exports are fully recovered or on the path to recovery. Performance of ongoing batch exports will soon be on pace with real time once again.
We were unable to make a full recovery, and the issue seems to persist. We are investigating new potential fixes. In the meantime, batch exports will be delayed.
We are monitoring the backfill process for Snowflake batch exports and any pending large batch exports for other destinations. All backfills are progressing normally. Ongoing batch exports are operating normally, but users with pending backfills may see us lag behind real time until all the backfilling is done.
We have deployed our fixes and have managed to resolve the concurrency issues with Snowflake batch exports. Most batch exports besides Snowflake should be fully recovered, with the exception of larger batch exports that will still need some time to work through the backlog. We will shortly begin backfilling any Snowflake batch exports that were cancelled due to this incident.
We are continuing to work on a fix for this issue.
We have reason to believe the cause of the problem is a deadlock happening while connecting to Snowflake. We are attempting to deploy a patch that would deal with the deadlock when it happens, leaving the investigation of what is causing the deadlock for later. Assuming the patch is successful in addressing the problem, we will begin backfilling any Snowflake batch exports that were cancelled.
We have reason to believe that the problem is related to Snowflake batch exports. As a consequence, most Snowflake batch exports are being cancelled to be retried at a later date. We are investigating how to remediate the problem with Snowflake. Users of other destinations should see batch exports recovering over time. Depending on the size of the data exported, this recovery could take more or less time.
We have been making slow progress on batch exports and a backlog has built up, particularly on larger batch exports. It is taking us some time to work through the backlog, so users may see batch exports be delayed in delivering data. No data loss has happened nor is it expected.
Report: "US ingestion lag"
Last update: After monitoring, we have seen that all systems are working normally.
We identified the root cause and pushed a fix; event ingestion and latency are now back to normal. We'll keep monitoring the infrastructure.
Our ingestion infrastructure is processing slowly, causing delays in event ingestion. We are investigating the root cause.
Report: "EU: elevated feature flag evaluation errors"
Last update: EU: We observed elevated error rates in feature flag evaluation that may have led to some requests timing out between 15:00 UTC and 17:00 UTC. We apologize for the inconvenience and have started improving our alerting to catch this earlier.
Report: "EU: feature flags and surveys with elevated error rates"
Last update: EU: We observed increased error rates for feature flags and surveys. While mitigating the initial feature flag issues, we restarted some internal components, which caused further issues. Surveys showed elevated error rates between ~15:22 UTC and 15:36 UTC; feature flags showed elevated error rates between ~15:00 UTC and 15:36 UTC. A large number of timeouts in our database in the EU region caused the high feature flag error rates and service disruptions. Apologies for this disruption.
Report: "US: Increased person processing load is causing locks on the replica DB"
Last update: This incident has been resolved.
We scaled up processing to handle the spike. We are monitoring the situation.
Performance on the PostHog read replica database is somewhat degraded due to a high load of person ingestion processing. This is occasionally affecting flag evaluation, since the feature flag service depends on the read replica database.
Report: "EU: Data Processing Delays - Reporting Tools Affected"
Last update: Issue resolved.
We have identified an issue with our service that builds the list of events and properties to search for when querying data. We are deploying a fix now and hope to see recovery in the coming hours. Until then, UI tools for querying your data may be missing information you would expect. No data loss has occurred, and event ingestion itself is unaffected.
Report: "Taxonomy updates delayed in EU"
Last update: Issue resolved.
The issue is resolved, and we are catching up on newly seen events and properties.
Our taxonomy generation system (for the event and property definitions you use in filters and elsewhere) is currently delayed as we fix a minor schema bug. This means new event names or properties you just sent to PostHog won't be available for use in places like filters or insights. We have identified the bug and expect to resolve it shortly.
Report: "PostHog Cloud UI in EU is down"
Last update: Systems have been stable over the last few hours, and metrics are showing normal behaviour over a longer period for both flag evaluation and the web UI. Thank you for your patience!
Flag evaluation is back to normal; we're monitoring the overall state.
The Cloud UI is back up; we're investigating the impact on flag evaluation.
We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
Report: "PostHog app in EU is not loading"
Last update: We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
PostHog Cloud is unavailable; the surveys API and local_evaluation API are also affected. A fix is being worked on at the moment.
We're experiencing issues loading the PostHog app in EU. Data ingress does not appear to be affected.
Report: "US app has intermittent errors"
Last update: We identified a migration that had an unintended impact on our database. We've cleared the lock and are watching database health stabilize. Everything is looking back to normal at this time.
We're seeing intermittent errors with loading us.posthog.com, and we're investigating why. This isn't impacting the ingestion of data.
Report: "Elevated API Errors"
Last update: We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Batch exports delayed in EU"
Last update: All pending batch export runs have completed and new batch export runs are progressing normally. The incident is resolved.
We have identified the root cause of the problem and are in the process of deploying a fix.
We have noticed batch exports experiencing a delay of several hours in PostHog Cloud EU. We are investigating the problem. Batch exports in PostHog Cloud US are not affected and operating normally.
Report: "[US] increased errors on feature flags, ingestion and app"
Last update: We briefly had a spike in errors on our US instance for various endpoints due to a rollout. We rolled back and error rates dropped.
Report: "Data Processing Delays - Reporting Tools Affected"
Last update: All processing is back to normal.
We've identified some bottlenecks slowing down processing. We should be back to real time shortly.
Our data processing infrastructure is running behind, which is causing inaccuracies in the reporting tools. No data has been lost and the system should be caught up shortly.
Report: "Event taxonomy processing delays"
Last update: We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
We've spotted that something has gone wrong. We're currently investigating the issue, and will provide an update soon.
System recovered. We are continuing to monitor.
We've spotted a problem in processing event taxonomy updates - property and event definitions. We're working to fix the problem, and while working on that fix, some updates to event taxonomy will be delayed. This impacts e.g. whether new events or properties are available for filtering.
Report: "EU: Elevated error rate on data capture"
Last update: We resolved the issue and everything is operational again. One of our reverse-proxy instances scaled in ungracefully, which caused routing errors. After manually terminating it, the services recovered. We saw elevated errors from 16:01 UTC to 16:48 UTC. A good portion of these were recovered by internal retries, but we can't be certain right now that we haven't lost some events. We will analyze this and provide a long-term fix so that it won't happen again. Our apologies, as we were not able to capture all data during this time.
We found an issue in our networking layer and it seems to be recovering now. We're monitoring the situation.
We've spotted that something has gone wrong. We're seeing elevated error rates on capture on the web app. We're currently investigating the issue, and will provide an update soon.
Report: "EU Maintenance - Data Processing Delays"
Last update: Ingestion has caught up. All is good, as expected.
The maintenance operations are done; we are monitoring and waiting for ingestion and data processing to catch up. Again, no data has been lost during this standard procedure. Thank you for your patience!
Due to a planned maintenance activity, we're expecting ingestion and data processing delays in EU. No data will be lost during this operation. Thank you for your patience!
Report: "EU ingestion lag"
Last update: We have fixed the underlying issue and ingestion latency is back to normal. All data is up to date now.
We have identified that there is lag in the events ingestion pipeline. We are investigating what could be the root cause. No data has been lost.
Report: "Web app down"
Last update: We rolled back and fixed the issue. After a period of monitoring, we're now closing this incident.
We've rolled back to a previous version and the web app has recovered. We're monitoring until the bug fix is merged and the latest web app version is deployed.
The PostHog web app is down in all regions due to a bug in our HTML rendering. All data pipeline components are still fully functional, and no data will be lost.
Report: "Maintenance - Data Processing Delays"
Last update: This incident has been resolved.
The main work of the maintenance operations is done. We're monitoring ingestion and data processing as they catch up. Thanks again for your patience!
Due to a planned maintenance activity, we're expecting ingestion and data processing delays. No data will be lost during this operation. Thank you for your patience!
Report: "Event processing delays on EU Cloud"
Last update: This incident has been resolved.
We're investigating ingestion delays on EU Cloud.
Report: "Web app unavailable"
Last update: We improved our monitoring so we can catch similar issues before they affect production.
The app is back up now; we're investigating the root cause.
We've seen that the web app is unavailable and we're investigating. Data ingestion is not affected.
Report: "JS static assets not loading"
Last update: The incident was resolved an hour ago. We're blocking new deployments until we root-cause the issue.
The issue was triggered again; we're rolling back quickly this time.
Report: "JS static assets not loading"
Last update: This incident is resolved. You may need to hard-refresh (CMD + Shift + R) in order for the page to load. For some reason, our GitHub workflows skipped the "upload static assets to s3" step but rolled out anyway. We're investigating why this happened.
We've again spotted that you can't load the PostHog app at the moment. We're investigating the cause. Data ingestion is not affected.
Report: "Issues loading the posthog site"
Last updateWe've spotted and fixed the issue with our static asset pipeline and all environments are back online and available!
We've spotted you can't load the PostHog app at the moment We're investigating the cause Data ingestion is not affected
Report: "Live Stream service unavailable"
Last update: We've spotted and addressed the root cause, and the service is back up and running. Sorry for the inconvenience, and enjoy those fresh, free-range live events streaming to your browser!
Something has gone wrong with our livestream service, which is responsible for reporting live events to the activity page. We are investigating now and will report back once we have found the root cause!