Historical record of incidents for Heron Data
Report: "GCP outage"
Last update: Read requests are succeeding, but POST requests are failing.
We are continuing to investigate this issue.
We are experiencing issues with one of our cloud providers, GCP. Namely, we are seeing degradation in Cloud Storage, Cloud Run, and AI services.
Report: "File Parsing Down"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Server slowness"
Last update: The incident has been resolved.
Systems are stable
We are currently investigating an issue where Heron async processing is experiencing slowness
Report: "Bank statement PDF parsing performance degraded"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "API instability"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Email server outage"
Last update: This incident has been resolved.
Our email server provider, Postmark, experienced an issue with processing queued messages. This meant that all submissions forwarded to Heron Data were not being received by the Heron system. No data was lost in this process, and the provider system has mostly recovered -- all incoming messages are being processed and most historical messages are now processed. We are monitoring to ensure complete recovery: https://status.postmarkapp.com/notices/zno1dlxjdjmblc0d-service-issue-outbound-sending-and-inbound-processing-messages-are-being-accepted-and-queued
Report: "High rate of timeouts"
Last update: The API has been stable and working since the fix. This has now been marked as resolved.
A fix was implemented on 30/11/2023 at 19:30 UTC and our monitoring systems have no reports of further timeouts. We will continue monitoring throughout the day.
We are experiencing an increase in timeouts and are investigating the issue. We have taken intermediate steps to mitigate this and will update you as and when we have more information.
Report: "API instability"
Last update: This incident has been resolved. We experienced a high number of CPU-intensive API requests, which overwhelmed our infrastructure. We shipped a change that reduced the CPU usage, and have seen normal CPU usage for the last 8 hours.
We have implemented mitigation measures to address the higher than usual error rate with our API. We will continue to closely monitor the situation. Thank you for your understanding.
We have identified the cause of the error rate issue and are actively working on resolving it. Thank you for your patience as we work towards a solution.
We are currently experiencing a higher than usual error rate with our API. Our team is actively investigating and working on resolving the issue. Updates to follow soon. We apologize for any inconvenience caused. Thank you for your patience.
Report: "Database outage"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We have identified that a related database table is impacted, and are implementing a fix
A fix has been implemented and we are monitoring the results.
The database migration is complete and async processing is back online
We are running a backfill on the table in question which we believe will take ~6 hours, so we will provide another update around then
We have remediated part of the issue. The outage is now limited to async processing and any route that involves fetching transaction categories (e.g., delete & get transactions, end user endpoints like /summary, /profit_and_loss, and various reports). A fix for the remainder is underway.
We are continuing to work on a fix for this issue.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue where we have reached the maximum number of IDs for a table in our database. We have to schedule some emergency downtime to resolve the issue
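Running out of IDs for a table typically means a 32-bit integer primary key has reached its maximum value, although the report above does not name the database involved. As a rough sketch only, assuming a PostgreSQL database and the psycopg2 driver (both assumptions, not confirmed by the report), a periodic check like the following could flag sequences approaching the int4 limit before emergency downtime becomes necessary; the connection string and threshold are placeholders.

```python
# Hypothetical headroom check for sequence-backed integer primary keys.
# Assumes PostgreSQL 10+ (for pg_sequences) and psycopg2; the DSN is a placeholder.
import psycopg2

INT4_MAX = 2**31 - 1  # upper bound of a 32-bit signed integer column

QUERY = """
SELECT schemaname, sequencename, last_value
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY last_value DESC;
"""

def check_sequence_headroom(dsn: str, warn_ratio: float = 0.8) -> None:
    """Print any sequence whose current value exceeds warn_ratio of the int4 limit."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        for schema, name, last_value in cur.fetchall():
            ratio = last_value / INT4_MAX
            if ratio >= warn_ratio:
                print(f"{schema}.{name}: {last_value:,} ({ratio:.0%} of int4 max)")

if __name__ == "__main__":
    check_sequence_headroom("postgresql://user:password@db.example.internal:5432/app")
```

One common fix, widening the affected column to BIGINT, requires a table rewrite under an exclusive lock, which is consistent with the scheduled emergency downtime and the ~6-hour backfill described in this report.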
Report: "Elevated API Errors"
Last update: We apologise for the increased 500 errors and timeouts experienced in the last 10 hours. We introduced a new index to our database which was only partially built and ended up in an “invalid” state. This resulted in long-running queries which blocked subsequent reads. We dropped the index, terminated the blocking database processes, and have now recovered. (A cleanup sketch follows this report.)
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're experiencing an elevated level of API errors and are currently looking into the issue.
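An "invalid" index state is typically what PostgreSQL leaves behind when a CREATE INDEX CONCURRENTLY fails partway through, though the report does not confirm which database was involved or how the index was built. As a sketch under those assumptions (PostgreSQL plus the psycopg2 driver, with a placeholder DSN), invalid indexes can be listed from the system catalogs so they can be dropped or rebuilt:

```python
# Hypothetical helper to find indexes left in an invalid state (for example after a
# failed CREATE INDEX CONCURRENTLY) so they can be dropped or rebuilt.
# Assumes PostgreSQL and psycopg2; the DSN is a placeholder.
import psycopg2

INVALID_INDEXES = """
SELECT n.nspname AS schema_name, c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
"""

def list_invalid_indexes(dsn: str) -> list[tuple[str, str]]:
    """Return (schema, index) pairs for every index the planner considers invalid."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(INVALID_INDEXES)
        return cur.fetchall()

if __name__ == "__main__":
    for schema_name, index_name in list_invalid_indexes("postgresql://user:password@db.example.internal:5432/app"):
        # Dropping the index (or rebuilding it with REINDEX CONCURRENTLY) clears the entry.
        print(f"invalid index: {schema_name}.{index_name}")
```

Dropping the half-built index and terminating the blocking backends, as the report describes, is the standard way to release the queued reads.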
Report: "PDF parsing issue"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are experiencing issues with our PDF parsing capabilities. The issue has been identified and the fix is in progress.
Report: "API outage"
Last update: We experienced an API outage (<5 mins) caused by a prolonged database migration. All systems are back to normal.
Report: "Increased API latency"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Report: "Increased API latency"
Last update: Response times have been restored.
We are currently investigating this issue.
Report: "Increased API response times"
Last update: We noticed intermittent increased latency yesterday afternoon due to unprecedented levels of traffic. While debugging the root cause, we increased our compute resources in multiple areas to handle the traffic. We have now identified the root cause and solved the issue. Response times are back to normal, but we will continue to monitor to ensure the fix worked.
Report: "Major outage: Load balancer 502s"
Last update: From 00:10-00:19 UTC (9 minutes), we had a major outage of our API: a high volume of 502 errors rendered our backend unusable. This was caused by issues in our load balancer, triggered by unusually high amounts of traffic and a lack of available pods to serve it. We are in the process of adding better scalability to our backend to handle such spikes in traffic moving forward, including better horizontal scaling metrics.
Report: "API downtime"
Last update: From 8:30-8:35 UTC we had brief API downtime from a planned database migration. We're fully operational again.
Report: "SSL Certificate Renewal Error"
Last update: Our automated SSL renewal process failed. This meant our API was down between 10:08 and 10:49 AM BST. We provisioned a new SSL certificate manually, restored automatic SSL certificate renewals, and added monitoring to prevent this from happening again.
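The report mentions that monitoring was added but not what form it takes. Purely as an illustration, not Heron's actual tooling, a check along these lines using only Python's standard library can alert when a domain's TLS certificate is close to expiry; the hostname is taken from a later report, and the 14-day threshold is an illustrative choice.

```python
# Illustrative certificate-expiry check (not Heron's actual monitoring): connect to a
# host, read the served leaf certificate, and report how many days of validity remain.
import socket
import ssl
import time

def days_until_expiry(hostname: str, port: int = 443) -> float:
    """Return the number of days until the certificate served by hostname expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400

if __name__ == "__main__":
    remaining = days_until_expiry("app.herondata.io")  # domain appears in a later report
    if remaining < 14:  # illustrative alerting threshold
        print(f"WARNING: certificate expires in {remaining:.1f} days")
    else:
        print(f"OK: {remaining:.1f} days of certificate validity remaining")
```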
Report: "Degraded API Performance"
Last update: Some users might have experienced intermittent 500 response codes between 7 PM and 8 PM BST. This was caused by a monitoring tool we added to better understand our usage and latency, which unfortunately interfered with external API calls that some of our endpoints and users rely on. Once we identified the issue we reverted the monitoring changes and all systems returned to normal. Moving forward we will deploy and babysit all monitoring changes (on top of code deploys and core infra changes) in staging before releasing to production.
Report: "Degraded API performance"
Last update: As part of releasing new ML models to improve our merchant extraction service, we hit scaling issues when exposed to higher-than-normal throughput. All the relevant checks were made in local, testing, and staging environments, but unfortunately none of them could reproduce production-level traffic. We have reverted the release, and all systems are fully functional again.
Report: "Increased API and webhook response times"
Last update: Our REST API responses experienced increased latency, peaking at 60s. Some customers using our async enrichment flow will have seen webhooks take longer to arrive than usual. All systems are back to normal now.
Report: "Database CPU upgrade"
Last update: We ran an upgrade of our production database to handle increased traffic, which resulted in about 1 minute of intermittent downtime for our API users. API performance is back to normal.
Report: "Google Cloud Load Balancers"
Last update: Google Cloud has resolved this issue. Our API is fully functional again.
Google Cloud Load Balancers are experiencing issues, so Google is unable to route any traffic to our API. Any calls to our API will currently result in a 404 error. This is also affecting other well-known sites such as spotify.com and etsy.com.
Report: "API downtime"
Last update: In an attempt to optimise our network traffic routing, we provisioned some config changes to our load balancer. The change caused our Kubernetes routing to become unhealthy, which blocked traffic coming into our app.herondata.io domain. The change was tested in staging environments and worked correctly, but it could not be applied properly under production-level traffic. We quickly reverted the config to stabilise the API, which is now operating normally.
Report: "API downtime"
Last update: During a database migration in production, an overly aggressive lock was taken on a table we use for authentication, so we could not serve any requests to customers between 16:20 and 16:50 UTC. We've now resolved the issue and are fully operational.
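The report does not say which database or migration tooling was involved. Assuming PostgreSQL and psycopg2 (placeholder DSN, and a made-up table and column), one common guard against this failure mode is to run schema changes with a short lock_timeout, so a migration that cannot acquire its lock quickly aborts instead of queueing behind readers and blocking every subsequent query on the table:

```python
# Illustrative guard for schema changes on a hot table (not Heron's actual process):
# a short lock_timeout makes the ALTER TABLE fail fast if it cannot get its lock,
# rather than waiting and blocking all later queries against the table.
# Assumes PostgreSQL and psycopg2; the DSN, table, and column are placeholders.
import psycopg2

def add_column_with_lock_timeout(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SET lock_timeout = '2s'")
            cur.execute("ALTER TABLE api_keys ADD COLUMN last_used_at timestamptz")
    except psycopg2.errors.LockNotAvailable:
        # The lock was not acquired within 2s; retry later rather than blocking traffic.
        print("lock_timeout reached; migration aborted without blocking the table")
    finally:
        conn.close()

if __name__ == "__main__":
    add_column_with_lock_timeout("postgresql://user:password@db.example.internal:5432/app")
```

The point of the sketch is only that lock acquisition, rather than the data change itself, is often what takes a table offline during a migration.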
Report: "API downtime"
Last update: From 9:11 to 9:20 UTC we experienced timeouts and failed requests on our `POST` and `GET` transactions endpoints. This was due to shipping product improvements, which included a database upgrade. We're fully operational again.