Treasure Data

Is Treasure Data Down Right Now? Check whether there is an ongoing outage.

Treasure Data is currently Operational

Last checked from Treasure Data's official status page

Historical record of incidents for Treasure Data

Report: "[US region] Ingest API - Performance Degradation"

Last update
resolved

This incident has been resolved. Duration: ~35 minutes of processing delay; data ingestion was delayed between June 1, 7:35 AM and June 1, 10:35 AM PST. Affected customers: all customers ingesting data to the AWS US region during the incident window. Impact: delayed data availability in Plazma (up to 40 mins); no data loss occurred.

monitoring

We are continuing to monitor for any further issues.

monitoring

Latest Update: Our team is still actively monitoring and assessing the processing of the data. We will leave this status page in a Monitoring state until we are certain everything has been processed. Thank you for your continued patience.

monitoring

Latest Update: Our internal graphs show that the overwhelming majority of the data has been processed. There is a small amount of residual data that is taking longer than expected to process. We will leave this status page in a Monitoring state until we are certain everything has been processed. Thank you for your continued patience.

investigating

Latest Update: Our team is still actively monitoring and assessing the processing of the data backlog. Thank you for your continued patience.

investigating

Latest Update: Our team has deployed a fix, and the system is now processing the backlog of data at a controlled rate. Estimated Time to Full Recovery: ~7 hours We’re actively monitoring the recovery and will provide further updates as progress continues. Thank you for your patience.

investigating

Current Status: A fix has been applied and is currently under observation. Impact: An issue was identified with data ingestion (delayed ingestion) between Sunday 8:00 AM and 8:30 AM PST. Remediation: A fix has been implemented, and new incoming data is now processing normally. Next Steps: We are actively working to resume sending the affected data, which will arrive out of order along with new incoming data. Updates to Follow: Further details will be provided as the situation progresses.

investigating

Our Ingest API is experiencing a performance issue. We are investigating the cause.

Report: "[US region] Ingest API - Performance Degradation"

Last update
Resolved

This incident has been resolved.Duration: ~35 minutes of processing delay. Data Ingestion delay between June 1 - 7:35 AM to June 1 - 10:35 AM PST.Affected Customers:All customers ingesting data to AWS US region during the incident windowImpact:Delayed data availability in Plazma (up to 40 mins)No data loss occurred

Update

We are continuing to monitor for any further issues.

Update

Latest Update: Our team is still actively monitoring and assessing the processing of the data. We will leave this status page in a Monitoring state until we are certain everything has been processed.Thank you for your continued patience.

Monitoring

Latest Update: Our internal graphs show that the overwhelming majority of the data has been processed. There is small amount of residual data that is taking longer than expected to process.We will leave this status page in a Monitoring state until we are certain everything has been processed.Thank you for your continued patience.

Update

Latest Update: Our team is still actively monitoring and assessing the processing of the data backlog.Thank you for your continued patience.

Update

Latest Update: Our team has deployed a fix, and the system is now processing the backlog of data at a controlled rate.Estimated Time to Full Recovery: ~7 hoursWe’re actively monitoring the recovery and will provide further updates as progress continues. Thank you for your patience.

Update

Current Status:A fix has been applied and is currently under observation.Impact:An issue was identified with data ingestion(delayed ingestion) between Sunday 8:00 AM PST - 8:30 AM PST.Remediation:A fix has been implemented, and new incoming data is now processing normally.Next Steps:We are actively working to resume sending the affected data, which will arrive out of order along with new incoming data.Updates to Follow:Further details will be provided as the situation progresses.

Investigating

Our Ingest API is experiencing a performance issue. We are investigating the cause.

Report: "[US region] Ingest API - Performance Degradation"

Last update
resolved

We have continued our recovery efforts. While the issue has not been fully resolved, the impact is now limited. We are directly contacting affected customers and will continue to provide updates as needed. As this issue is now affecting only a limited number of customers, we are closing this status page. We sincerely apologize for the inconvenience this has caused and appreciate your understanding as we continue to support those still impacted. Thank you for your patience and understanding.

monitoring

We are continuing to run our recovery operation, and it is progressing. We are monitoring closely and will update in six hours.

monitoring

Recovery is proceeding and we are continuing to closely monitor the progress. We will have another update in six hours.

monitoring

We have been monitoring the situation for the past 6 hours. At this time, we are unable to provide an updated estimate for full recovery. We are continuing to monitor closely and are also exploring additional measures to improve recovery performance. We will share another update in approximately 6 hours or sooner if there is any significant development. We apologize for the inconvenience.

monitoring

We have rolled out a fix and are observing our backlog recovering at a safe rate. We are currently processing a backlog of historical data, which is progressing at a controlled rate. Estimated time to full recovery of all data is approximately 32 hours, and we will continue monitoring and provide updates throughout this process.

identified

We are now attempting to restore the pipeline from a timestamp rather than a checkpoint. There is no risk of user data loss. Downstream systems may observe duplicate events from the affected time window, and this will be limited in scope.
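For downstream consumers that need to guard against the duplicate events mentioned above, one common pattern is idempotent processing keyed on a stable event identifier. The sketch below is illustrative only: the field names (`event_id`, `event_time`), the window dates, and the in-memory store are assumptions, not part of Treasure Data's tooling.

```python
from datetime import datetime

# Hypothetical affected window (naive UTC timestamps); the real window would
# come from the incident report for your region.
WINDOW_START = datetime(2025, 6, 1, 3, 0)
WINDOW_END = datetime(2025, 6, 1, 11, 0)

_seen_ids = set()  # in production this would be a persistent store (table, cache, etc.)

def process_once(event: dict, handler) -> bool:
    """Run handler(event) unless the event is a duplicate replay from the affected window."""
    event_time = datetime.fromisoformat(event["event_time"])  # assumed ISO-8601, UTC, no offset
    if WINDOW_START <= event_time < WINDOW_END:
        if event["event_id"] in _seen_ids:
            return False  # duplicate delivery of an already-processed event; skip it
        _seen_ids.add(event["event_id"])
    handler(event)
    return True

# Example: the second delivery of the same event is skipped.
process_once({"event_id": "abc", "event_time": "2025-06-01T05:00:00"}, print)
process_once({"event_id": "abc", "event_time": "2025-06-01T05:00:00"}, print)
```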

identified

We have applied a fix and are observing it. We have identified a window between Thursday 3 AM UTC and Monday 11 AM UTC where data was not being processed. We have implemented a remediation, and new incoming data is not impacted. We are actively working to resume sending this data, which will arrive out of order along with new incoming data.

investigating

We are continuing to investigate this issue.

investigating

Our Ingest API is experiencing a performance issue. We are investigating the cause.

Report: "[Tokyo Region] Performance Issue of Trino service"

Last update
resolved

This incident has been resolved.

monitoring

We are observing recovery. We continue to monitor for full recovery.

identified

The root cause has been identified and we are applying the fix.

investigating

Our Trino service is experiencing an issue. We are investigating the cause.

Report: "[US Region] Hive Jobs and the Result Export jobs triggered from Hive Jobs are not functioning properly"

Last update
resolved

We have confirmed that Hive jobs are now working properly, and the related Result Export jobs are functioning as expected. The impact occurred between April 8, 16:00 UTC and April 9, 01:35 UTC. If you had any jobs that failed during the affected time window, please re-run them as needed. We sincerely apologize for the inconvenience this may have caused.

monitoring

We have identified the root cause and applied a hotfix. The issue was that Hive Jobs were unable to start properly, which in turn caused some Result Export jobs triggered from those Hive Jobs to fail. We are currently monitoring the system to ensure that the situation continues to improve.

investigating

We are currently investigating an issue where Hive Jobs and the Result Export jobs triggered from Hive Jobs are not functioning properly. Our team is actively looking into the cause of the issue. We will provide updates as soon as more information becomes available.

Report: "[All Regions] Utilization Dashboards showing outdated information"

Last update
resolved

This issue is resolved and our utilization dashboards should be showing up-to-date information. If you observe anything unusual in your usage data, please contact our support team. Thank you for your patience while we worked through this issue.

monitoring

We have remediated the issue and are processing usage data from the last 4 days. Users should see the usage dashboards catching up. We expect this process to take about another hour to complete.

identified

We have identified the cause of this issue and are working to restore service. Once the service is restored, we expect it will take a few hours to catch up on usage data for the last few days.

investigating

Our utilization dashboards are not updating with up-to-date information. Customers accessing their Treasure Data usage dashboards will see a gap in usage details from early Saturday morning UTC. There is no impact on ongoing Treasure Data usage, and all usage information is correctly stored internally. However, the dashboard where customers can view their consumption is not up to date. Note this is a reporting problem only; there is no indication of any issues with regular Treasure Data usage. We are working to diagnose the issue and will provide an update in the next hour.

Report: "[All Regions] Treasure Insights is experiencing an outage"

Last update
resolved

We would like to inform you that the issue has been fully resolved. Incident Impact Details: Treasure Insights was returning 502 errors and was unreachable during the incident. Incident Impact Time: Start: January 20, 09:51 UTC; End: January 20, 15:35 UTC. We apologize for any inconvenience this may have caused and thank you for your patience and understanding.

monitoring

A fix has been implemented, and we are monitoring the service to ensure everything is functioning correctly.

identified

The problem has been identified, and we are currently working on a solution.

investigating

The situation remains the same as it was in the last update.

investigating

We are still investigating the issue.

investigating

We have observed that users are not able to access Treasure Insights. We are currently investigating the issue.

Report: "[EU Region] Elevated error/ performance degradation related to personalisation API"

Last update
resolved

Between Thursday, 23 Jan 2025, 07:20 UTC and 11:40 UTC, customers experienced elevated error rates and increased latency related to the Profiles API. A fix has been implemented, and the issue has been resolved. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective soon.

monitoring

The response team found problematic real-time segment configurations in one customer's Parent Segment that possibly contributed to consuming the concurrency capacity. The team updated the real-time event routing configuration to mitigate the high-latency issue. Combined with capacity addition operations, the team stabilized the Profiles API cache cluster. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will update the postmortem with a further remediation plan as promised.

identified

We successfully provisioned 2x capacity in 30 minutes. New resources improved the latency, but the error rate is still high. The response team is planning to implement another remediation instead of adding resources. We will update you in 30 minutes.

identified

We provisioned additional capacity at 10:00 am UTC to support the increasing workload. It improved the latency, but we still observed errors and long latency for a small number of requests. The response team has started provisioning additional concurrency capacity. Unlike the previous method, the new process should not take as long to provision. We will post the result in 30 minutes.

identified

The response team confirmed the symptom is from the same cause as the previous incidents. We are provisioning additional concurrency capacity to the environment. We will update you when it is completed.

investigating

We are currently observing errors or performance degradation for the personalization API. We are investigating the cause of the issue now.

Report: "[EU Region] Elevated error rate and performance degradation for personalization API"

Last update
resolved

We implemented fundamental isolation of a problematic configuration at 14:42 UTC. The remediation caused the cluster workload to drop from 60% to 1%. On Friday, we implemented write access isolation for the problematic configuration, which stopped the cluster workload from growing. Today, we implemented read access isolation, which restored the cluster workload to its previous level. The system is operating normally now, and we are closing the incident. We acknowledge that we need further actions to prevent the same incident from being triggered again by a similar configuration. We will post a further postmortem when it is ready.

monitoring

We are still monitoring the service. Between Thursday, 30 Jan 2025, 10:00 UTC and 11:05 UTC, customers experienced elevated error rates and longer latency for Profiles API lookups. Currently, the cluster workload has calmed down and is operating normally. Our response team is ready to provision additional processing capacity. However, we are closely monitoring the service status to avoid further downtime during peak times. In addition, we are working on isolating problematic access patterns from the service. We will keep the status page open and update you on the progress.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are currently observing that the performance degradation and error rate have improved. We continue to closely monitor the metrics.

investigating

We detected degraded performance of personalization API and an error rate increase. We are currently investigating this issue.

Report: "[EU Region] Elevated error rate and performance degradation for personalization API"

Last update
resolved

Between Wednesday, 29 Jan 2025, 15:47 UTC and 16:51 UTC, customers experienced elevated error rates and increased latency related to the Profiles API. The cause was that a monitor for a slight, non-visible elevation in the error rate kicked off a system recovery operation. The recovery operation then caused the same incident due to the configuration problem we had on Friday: https://status.treasuredata.com/incidents/jyqjpyscvjzh The response team re-deployed the safe version to recover the system. Also, as a short-term mitigation, we updated the recovery operation until we complete the root cause analysis and permanent fix described in "Further Actions" in the previous postmortem: https://status.treasuredata.com/incidents/jyqjpyscvjzh At the moment, if you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are currently seeing significant improvement in the health metrics we monitor. We continue to carefully monitor the health status.

monitoring

We have started to apply a remediation and are observing the service recovering. However, we continue to closely monitor the service's health status.

investigating

We detected degraded performance of personalization API and an error rate increase. We are currently investigating this issue.

Report: "[EU Region] Elevated error/ performance degradation related to personalisation API"

Last update
postmortem

The Profiles API enables browsers to retrieve personalized content based on detailed customer information. Between January 20 and January 22, we experienced elevated error rates and increased latency for a subset of requests to the Profiles API. We sincerely apologize for the inconvenience caused by this incident. We understand the critical role our API plays in delivering seamless user experiences, and we are committed to preventing such disruptions in the future.
# Timeline
* On January 20, from 7:45 to 11:15 UTC - 3% error rate during the period
* On January 21, from 7:35 to 10:25 UTC - 33% error rate during the period
* On January 22, from 9:15 to 16:40 UTC - 40% error rate during the period
During these periods, API calls to `https://cdp-eu01.in.treasuredata.com/` exhibited elevated error rates and latency. This issue did not impact RT 2.0, the newer version of our real-time system.
# Incident Analysis
This is the current analysis snapshot; updates will be provided as more information becomes available. We noticed a gradual increase in processing workloads on the Profiles API starting on January 6, driven by the complexity of real-time segmentation. By January 20, this workload exceeded the internal concurrency limit configured in our caching cluster. Key observations are:
* Symptoms consistently began to appear around 07:30 UTC each day.
* Internal system indicators flagged potential issues approximately two hours prior to the incidents.
The bottleneck was traced to the caching cluster's concurrency capacity, which was insufficient to handle the growing workload.
# Action Taken
Based on these observations, we implemented a mitigation to increase the concurrency capacity in the caching cluster. We will monitor the symptoms closely today and provide additional capacity when necessary.
# Further Actions
Our development team will conduct a capacity review of the Profiles API infrastructure to prepare for future workload growth. The remediation plan will include the following steps:
* Enhanced monitoring and alerting of the caching cluster's concurrency capacity
* Ensuring safe yet rapid capacity provisioning when required
We will provide a follow-up update by the end of Friday, summarizing any additional findings and actions taken.
Hiroshi (Nahi) Nakamura, CTO & VP Engineering, Treasure Data
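To illustrate the "enhanced monitoring and alerting of the caching cluster's concurrency capacity" remediation described above, here is a minimal sketch of a utilization check against a configured concurrency limit. The metric names, the 80% threshold, and the limit value are assumptions for illustration; this is not Treasure Data's actual monitoring configuration.

```python
def concurrency_alert(in_flight: int, concurrency_limit: int, warn_ratio: float = 0.8) -> str | None:
    """Return a warning when in-flight requests approach a configured concurrency limit."""
    utilization = in_flight / concurrency_limit
    if utilization >= warn_ratio:
        return (f"caching cluster concurrency at {utilization:.0%} "
                f"({in_flight}/{concurrency_limit}); consider provisioning more capacity")
    return None

# Example: 850 in-flight requests against an assumed limit of 1000 crosses the 80% threshold.
print(concurrency_alert(850, 1000))
```

An early-warning threshold like this reflects the observation in the postmortem that internal indicators flagged potential issues roughly two hours before each incident window.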

resolved

Between Wednesday, 22 Jan 2025, 09:15 UTC and 16:40 UTC, some customers experienced elevated error rates and increased latency related to the Profiles API. A fix has been implemented and the issue has been resolved. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective soon.

monitoring

We have fully deployed our fixes to the Personalization API and our monitors show systems operating normally. Our teams will continue to monitor the issue, and we will update this incident if we observe any unusual behavior. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective once it is available.

investigating

We have observed some intermittent errors as we roll out a fix to all of our systems, and users may see delays or errors as the change is applied to our systems. Our response team is working to minimize the impact to customers while we deploy this change, but we expect some slower performance while we gradually deploy the fix over the next 3-4 hours.

investigating

Our response team has identified a potential cause for this issue, and we will be deploying a fix shortly. At this time we have not observed any elevated error rates or delays since 16:40 UTC. We will provide an additional update once this fix has been deployed. If you are observing abnormal errors or long delays from our Personalization API, please reach out to our support team. We will continue to monitor for any issues, and will update once our fix is deployed.

investigating

From 09:00 to 17:00 UTC, we observed elevated 500s and high latency on the CDP KVS server. Customers may have observed elevated errors and timeouts during this period when sending requests to the Personalization API. Our team has been investigating this issue and has deployed a workaround to our systems while we work to identify the root cause of the problem. There should be no system impact at this time. Customers who continue to observe delays or elevated error rates should contact our support team, and we'll be happy to assist them further. We will continue to investigate and will provide another update by 11 PM UTC.

investigating

We have applied various mitigations on our infrastructure side; however, they have not decreased the error rate. We are continuing to investigate possible causes on our end.

investigating

We are currently observing errors or performance degradation for the personalization API. We are investigating the cause of the issue now.

Report: "[EU Region] Elevated error/ performance degradation related to personalisation API"

Last update
resolved

We observed that the error rate decreased and the issue was resolved.

investigating

We applied the mitigation, and we are still monitoring to see if it resolves the issue.

investigating

We are currently observing errors or performance degradation for the personalization API. We are investigating the cause of the issue now.

Report: "[EU Region] Elevated error rate for CDP KVS"

Last update
resolved

We would like to inform you that the issue has been fully resolved. Incident Impact Details: - Personalization API has experienced an outage leading to increased errors and timeouts. Incident Impact Time: - Start: January 20, 07:45 UTC - End: January 20, 11:15 UTC We apologize for any inconvenience this may have caused and thank you for your patience and understanding.

monitoring

We are observing fewer errors now. However, we are still monitoring and re-evaluating the remedial steps to confirm better performance.

monitoring

Through our investigation, we identified the cause of the issue and have applied some remediation. Our team is closely monitoring the system to ensure continued stability.

investigating

We are currently investigating this issue.

Report: "[EU Region] Trino/Presto - Degraded performance"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Some users may be experiencing degraded performance when running Presto or Trino jobs. We are investigating the incident. At present, all users in the EU central region may be affected.

Report: "[US Region] Trino/Presto performance degradation"

Last update
resolved

This incident has been resolved.

monitoring

We have applied remediation to the degraded infrastructure. We are currently monitoring the performance closely.

investigating

We are investigating a possible problem currently affecting Trino/Presto queries for the US region. Queries might have degraded performance. We will provide an update as soon as we know more details.

Report: "[All Regions] Elevated error rate for CDP KVS"

Last update
resolved

We would like to inform you that the issue has been fully resolved.
Incident Impact Details:
- Profiles API experienced an increased frequency of errors and timeouts.
- The latest logs were not reflected in real-time segments.
Incident Impact Time by Region:
- us: Start: November 13, 04:14 UTC - End: November 13, 08:55 UTC
- aws-tokyo: Start: November 13, 04:14 UTC - End: November 13, 08:54 UTC
- eu01: Start: November 13, 04:17 UTC - End: November 13, 08:51 UTC
- ap02: Start: November 13, 04:15 UTC - End: November 13, 09:01 UTC
- ap03: Start: November 13, 04:17 UTC - End: November 13, 08:52 UTC
We apologize for any inconvenience this may have caused and thank you for your patience and understanding.

monitoring

Through our investigation, we identified the cause of the issue as a recent release operation. We have reverted all changes from this release, and normal functionality has been restored. Our team is closely monitoring the system to ensure continued stability.

investigating

Since approximately 4:00 UTC, we have been experiencing an issue with requests to CDP KVS, which may be affecting Profiles API functionality, causing delays in KVS data synchronization and updates to real-time segment information. Our team is actively investigating and working to resolve the issue as quickly as possible. Please note that Realtime 2.0 is not affected.

Report: "[EU region] Presto - Partial Outage"

Last update
resolved

Between Nov 5, 17:15 UTC and Nov 5, 18:45 UTC, some customers experienced delays and errors related to Presto. The cause was insufficient capacity, which will be investigated further. A fix has been implemented and the issue has been resolved. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied a fix. The problem looks to be resolved, but we are continuing to monitor.

investigating

We are continuing to investigate this issue. We expect most queries will succeed after one or more automatic retries.

investigating

We are investigating a possible problem currently causing elevated error rates from Presto queries. We will provide an update as soon as we know more.

Report: "[US Region] Query Engine - Service Degraded Performance"

Last update
postmortem

We experienced a temporary overload on the storage layer. It started at 16:15 PDT and was fixed at 18:15 PDT. The major impact was performance degradation for data ingestion components (Streaming Import REST API, Mobile/Javascript REST API, Data Connector) and the Hive and Presto query engines. Some queries executed on Hive and Presto failed because of the storage performance degradation.

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Report: "[All Regions] Web Interface - Partial Outage to show Standard Audit Logs"

Last update
resolved

This incident has been resolved.

monitoring

We confirm the issue was resolved. We will continue to monitor the results.

identified

We observed a problem with web console access related to showing Standard Audit Logs. We have found the cause of the incident. We are working to resolve the incident.

Report: "[US region] Presto Query Engine - Degraded Performance"

Last update
resolved

The incident is now resolved. All affected components are back to normal. A subset of customers in the US region might have experienced degraded performance on Presto queries between 4:50 PM EDT and 1:40 AM EDT. Presto queries might also have been queued for longer than usual during the incident. Finally, some queries might have failed due to the remediations that were put in place.

monitoring

Systems should be back to normal but we continue to monitor the situation for a while.

monitoring

We applied the fix. We will continue to monitor the results.

investigating

Performance for some, though not all, queries has improved. We are continuing to investigate the issue.

investigating

This incident is still ongoing. We are investigating the root cause.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are investigating a possible problem currently affecting Presto. Queries could be delayed. We will provide an update as soon as we know more.

Report: "[Tokyo, AP03 Region] Custom Script Workflow error"

Last update
resolved

The issue at our infrastructure provider (AWS) is resolved, and we are not observing new errors at this time. Please rerun any failed workflows if needed.

monitoring

The error rate has decreased. Please rerun any failed workflows if needed. We observed errors with Custom Script between 8 am and 10 am UTC on August 29th. We are continuing to monitor the issue carefully.

investigating

Custom Script tasks launched from workflows are failing due to an ongoing incident with our infrastructure provider (AWS). Error example: Unable to execute HTTP request: Connect to sts.amazonaws.com:443 [sts.amazonaws.com/209.54.177.164] failed: connect timed out com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sts.amazonaws.com:443 [sts.amazonaws.com/209.54.177.164] failed: connect timed out We are actively working on the issue on our end.
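For custom scripts that call AWS endpoints such as sts.amazonaws.com themselves, transient connectivity blips like the connect timeout above can sometimes be absorbed with client-side retries and short timeouts. The sketch below uses boto3 and assumes credentials are already available in the script's environment; it is a general mitigation pattern, not a description of Treasure Data's internal fix.

```python
import boto3
from botocore.config import Config

# Keep timeouts short and retry transient connection failures so a single hung
# connection to sts.amazonaws.com does not stall the whole task.
aws_config = Config(
    connect_timeout=5,                                # seconds to establish the TCP connection
    read_timeout=10,                                  # seconds to wait for a response
    retries={"max_attempts": 5, "mode": "standard"},  # botocore's standard retry mode with backoff
)

sts = boto3.client("sts", config=aws_config)
print(sts.get_caller_identity()["Account"])
```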

Report: "[Tokyo, AP03 Region] Data Connector, Result Export, Hive job malfunction"

Last update
resolved

The issue at our infrastructure provider (AWS) is resolved, and we are not observing new errors at this time. Some jobs failed due to the incident, so please rerun the failed jobs if needed.

investigating

We have found that Data Connector, Result Export, and Hive jobs were unable to start or failed due to an incident with our infrastructure provider (AWS). Some Data Connector, Result Export, and Hive jobs might encounter delays or errors. The issue was observed on Aug 29th between 8:30 UTC and 9:45 UTC. We are still investigating the issue on our end.

Report: "[US Region] Ingest API - Performance Downgrade"

Last update
resolved

We confirmed that the catch-up was complete at 6:44 am PT. From 2024-08-06 03:20 am to 2024-08-07 06:44 am PT, events arriving at us01.records.in.treasuredata.com and c360-ingest-api.treasuredata.com experienced up to 8 hours of delay in batch data ingestion. There was no impact on the real-time system.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor the recovery. As of now, 99% of events become visible within 45 minutes. We will resolve the incident when the catch-up is complete.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor our systems' recovery as we work through the backlog of messages sent in the last 12 hours. We have added more resources to reduce the impact of this issue. At this time, we expect all messages to be processed, but customers may continue to see multi-hour delays as we continue to process messages to our Ingest API for the next few hours. We will continue to monitor this issue, and we appreciate your patience as we work through it.

monitoring

We have rolled out a fix and observed that processing delays are no longer increasing. Customers may continue to see delayed message processing over the next 3-4 hours as the backlog is processed. We continue exploring options to shorten this time and will monitor for any issues.

identified

We have identified the source of the problem and are applying a solution now. Customers may still see processing delays as we catch up on the request backlog. We will continue to explore options to accelerate our recovery, and we will continue to monitor the situation.

investigating

We are observing slower processing time for messages sent to our Ingest API. Users may see a delay up to two hours in message processing. We are continuing to investigate the root cause and exploring options to catch up on our backlog of messages, and will provide an update once we know more.

investigating

We are still investigating the cause of this issue.

investigating

Our Ingest API is experiencing a performance issue. We are investigating the cause.

Report: "[US Region] High Error rate at Custom Script and some DataConnector"

Last update
resolved

This incident has been resolved, all affected components (Custom Script and some DataConnector) are now back to normal.

monitoring

According to our infrastructure provider (AWS), this issue has already been resolved. We also see that the failure rate has been reduced, so we will update this incident to Monitoring status and the affected components to Operational status.

identified

Due to the degradation of the Amazon Ads system (https://status.ads.amazon.com), our connectors for the Amazon Ads platform are currently not working properly. If you are using any of the connectors below, your jobs may not be running correctly.
- Amazon Marketing Cloud export
- Amazon Marketing Cloud import
- Amazon Ads export
- Amazon DSP export
We will provide further updates as soon as more information becomes available.

identified

This issue is still ongoing; we are still seeing custom script tasks fail. Custom Script users may also encounter some errors related to AWS CloudWatch Logs. According to our infrastructure provider (AWS), they are working on recovery and some improvements are being seen internally, but they expect it to take 1-2 hours for full recovery. We will provide further updates as soon as more information becomes available.

identified

We are currently experiencing a high error rate in the Custom Script service on Treasure Workflow (US Region) due to an ongoing incident with our infrastructure provider (AWS). This issue manifests as increased error rates with error messages like: > Task failed with unexpected error: null (Service: AWSLogs; Status Code: 503; Error Code: null; Request ID: xxxxxx; Proxy: null) At this time, we do not have an estimated time for full resolution. We will provide further updates as soon as more information becomes available.

Report: "[US Region] Delays in Processing incoming events"

Last update
resolved

The issue is resolved at the provider and all components have completed catch-up.

monitoring

We are in constant communication with our service provider.

investigating

We are monitoring delays in the systems responsible for processing events ingested through our ingestion API. There are also increased errors in the ingestion API. The delay is caused by infrastructure issues at our provider, which are currently being addressed. We are monitoring the situation. During this time, writing to storage may be delayed, but there is no evidence of data loss.

Report: "[EU region] Profiles API - Degraded Performance"

Last update
resolved

Between 2:02 a.m. and 5:47 a.m. PDT, the CDP Personalization API experienced elevated API error rates. The engineering team identified the computing instance causing the issue and implemented a fix. The problem has already been resolved. Personalization API clients that implement error retry observed no issues. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

Report: "[All Region] All Hive jobs run on Hive4"

Last update
resolved

Verified that the issue is completely resolved. We apologize for the inconvenience.

monitoring

All Hive jobs, excluding CDP Workflow, ran on Hive4 (query engine 2023.1) during the following time periods:
- [US Region] 2024-06-19 07:45 +0000 - 2024-06-20 04:06 +0000
- [Tokyo Region] 2024-06-19 09:05 +0000 - 2024-06-20 04:08 +0000
- [EU Region] 2024-06-19 09:14 +0000 - 2024-06-20 04:09 +0000
- [Korea Region] 2024-06-19 09:21 +0000 - 2024-06-20 04:10 +0000
- [AP03 Region] 2024-06-19 09:30 +0000 - 2024-06-20 04:11 +0000
We have fixed it, and all Hive jobs are now properly executed on the query engine specified by the user. We apologize for the inconvenience.

Report: "[US Region] Ingestion API degraded performance and availability"

Last update
resolved

For the period between 7:00 AM and 8:00 AM PDT, users of the Mobile/Javascript API in the US region experienced periods of slow responses and unavailability. There was no data loss once the data was received by the API.

Report: "[US Region] Delay in Streaming Import"

Last update
resolved

We have been monitoring closely, and as there have been no further recurrences of the delay, we consider the issue resolved. We apologize for any inconvenience caused.

monitoring

We have confirmed that the delay was resolved around UTC 9:30. We are currently continuing to monitor the situation.

investigating

We have observed delays of up to 2 hours in Streaming Import (td-js-sdk, td-mobile-sdk, postback request, ingestion-api, fluentd, etc..) occurring from approximately UTC 6:30 onwards. We are currently investigating the cause and working to resolve the delay.

Report: "[US Region] Performance Issue of Presto service"

Last update
resolved

This incident has been resolved. The Presto service has returned to normal.

monitoring

The fix has been applied. We are monitoring the results.

investigating

Our Presto service is experiencing an issue. We are investigating the cause.

Report: "[US Region] Performance Issue of Presto service"

Last update
resolved

This incident has been resolved. The Presto service has now returned to normal.

monitoring

We are monitoring the results.

investigating

The fix has been applied. We will continue to monitor.

investigating

We are continuing to investigate the issue.

investigating

Our Presto service is experiencing an issue. We are investigating the cause.

Report: "[US/EU/Tokyo/Korea region] Treasure Workflow - Partial Outage on mail Operator"

Last update
resolved

Between 2023-12-14 04:55 UTC and 2023-12-14 06:04 UTC, some customers in the US/EU/Tokyo/Korea regions experienced failures sending email using the mail operator in the Workflow service. A fix has been implemented and the issue has been resolved. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We have observed a problem where sending email using the mail operator fails. We are currently investigating this issue.

Report: "[US Region] REST API to retrieve query results occasionally fails"

Last update
resolved

We confirmed that the issue has stopped occurring and has stabilized. We will resolve this status, but will continue investigation.

monitoring

The previous update stated the incident timeframe incorrectly. Correct: 7:00 to 8:31 UTC. Incorrect: 16:00 to 17:31 UTC. We are sorry for the inconvenience.

monitoring

REST API calls from Workflow and client libraries to retrieve query results occasionally failed due to the following error between 16:00 and 17:31 UTC. We have already confirmed the error is no longer occurring. We therefore continue to monitor the situation while we investigate the cause. ``` [CLIENT_ERROR] [400:Bad Request] API request to /v3/job/result/000000 has failed: <?xml version="1.0" encoding="UTF-8"?> <Error><Code>AuthorizationQueryParametersError</Code><Message>X-Amz-Expires must be non-negative</Message><RequestId>TTTTTTTTTT</RequestId><HostId>oI6e7Jfaub0mG/XXXXXXXXXXXXXXXXX+7w=</HostId></Error> (td client http) ```
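Clients that occasionally see transient errors like the one above can wrap the result fetch in a bounded retry. The sketch below calls the `/v3/job/result/<job_id>` path referenced in the error message directly with `requests`; the API host, the `TD1` authorization scheme, and the `format` query parameter are assumptions based on common Treasure Data client usage, so adjust them for your environment.

```python
import time
import requests

API_BASE = "https://api.treasuredata.com"  # assumed US endpoint; adjust for your region

def fetch_job_result(job_id: str, api_key: str, max_attempts: int = 5) -> bytes:
    """Fetch query results for a finished job, retrying transient failures with backoff."""
    url = f"{API_BASE}/v3/job/result/{job_id}"
    headers = {"Authorization": f"TD1 {api_key}"}  # assumed API-key auth scheme
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, headers=headers, params={"format": "json"}, timeout=60)
            if resp.status_code == 200:
                return resp.content
            resp.raise_for_status()  # turn 4xx/5xx into an exception handled below
        except requests.RequestException:
            if attempt == max_attempts:
                raise
        time.sleep(2 ** attempt)  # exponential backoff before the next attempt
    raise RuntimeError("exhausted retries without a successful response")
```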

Report: "[US Region] Ingest API - Performance Downgrade"

Last update
resolved

The performance downgrade on Import API has been resolved. We apologize for any inconvenience caused.

monitoring

Between 19:42 PST to 21:47 PST, there was a performance downgrade on our Ingest API. We already applied a fix and we are monitoring the result.

identified

We have identified the cause of the performance downgrade. We will provide an update as soon as we know more.

investigating

We have detected that Ingest API performance has been degraded since Sep 26, 21:00 PST.

Report: "[US Region] Treasure Workflow - Partial Outage in Workflow Service"

Last update
resolved

Our Workflow service had an outage starting at 10:50 am PST on 25th Sep. 2023. From that time, workflow requests went to pending status. We fixed the incident and deployed our fix at 2:40 pm PST on 25th Sep. 2023. During this outage window, customer workflows might have experienced some delays. After the fix was deployed, the Workflow service is working as normal and has started to resume pending workflows while handling new requests as well. The incident has been resolved.

monitoring

We still have 20% of pending workflows to catch up. The remaining pending workflows will be processed within 30 minutes. We are continuing to monitor for any further issues.

monitoring

Half of the pending workflows have been processed without any issues. The remaining pending workflows will be processed within an hour. We are continuing to monitor for any further issues.

monitoring

The pending workflows are resuming now, but it will take 1-2 hours to backfill all pending workflows. We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We have observed an issue where workflows are stuck in pending status due to a partial outage in the service. We are currently investigating the issue.

investigating

We have observed an issue where custom script executions are failing. We are currently investigating the issue.

Report: "[US Region] Presto partial performance degradation and potential job failure"

Last update
resolved

This incident has been resolved. There was a potential status inconsistency issue on one of our Presto clusters in the US region during 1:04 - 3:10 AM UTC on Sep 4. Queries issued during this period have potentially been affected. You may see failures of INSERT/DELETE jobs with error messages like the ones below. Please do not rerun these jobs, especially INSERT jobs, to avoid double insertion: even if the job failed, the write to your table might have completed.
- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.
- Failed to rewrite partition
- Killed by the system because this query stalled for more than 1.00h.
Also, some of your queries issued during this period might have gotten stuck or even failed with the following error. Those jobs were also affected by this incident.
- Query exceeded the maximum execution time limit of 6.00h
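Before re-running an INSERT job from the affected window, it can help to check the target table for double-inserted rows first. The sketch below only builds a Presto duplicate-check query over the incident window using TD_TIME_RANGE; the table name, the key column, and the year on the window dates are assumptions, and the printed query would be run through whatever Treasure Data client you normally use.

```python
def duplicate_check_sql(table: str, key_column: str) -> str:
    """Build a Presto query that lists keys inserted more than once during the incident window."""
    # Window from the report above: 01:04 - 03:10 UTC on Sep 4 (year assumed from surrounding reports).
    return f"""
    SELECT {key_column}, COUNT(*) AS copies
    FROM {table}
    WHERE TD_TIME_RANGE(time, '2023-09-04 01:04:00', '2023-09-04 03:10:00', 'UTC')
    GROUP BY {key_column}
    HAVING COUNT(*) > 1
    """

# Placeholder database/table and key names for illustration only.
print(duplicate_check_sql("my_db.my_table", "event_id"))
```

Whether a suitable key column exists depends on your schema; if there is no natural key, comparing row counts for the window against an upstream source is an alternative check.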

investigating

There was a potential status inconsistency issue on one of our Presto clusters in the US region during 1:04 - 3:10 AM UTC on Sep 4. Queries issued during this period have potentially been affected. You may see failures of INSERT/DELETE jobs with error messages like the ones below. Please do not rerun these jobs, especially INSERT jobs: even if the job failed, the write to your table might have completed.
- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.
CTAS jobs that failed with the following error might also have been affected by this incident:
- Query exceeded the maximum execution time limit of 6.00h
Also, some of your queries might have gotten stuck during this incident. We are sure that newly issued queries are not affected while we are still working to identify the impact of this incident.

investigating

We are continuing to investigate this issue.

investigating

There was a potential status inconsistency issue on one of our Presto clusters in the US region during 1:04 - 3:10 AM UTC on Sep 4. Queries issued during this period have potentially been affected. You may see failures of INSERT/DELETE jobs with error messages like the ones below. Please do not rerun these jobs, especially INSERT jobs: even if the job failed, the write to your table might have completed.
- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.
CTAS jobs that failed with the following error might also have been affected by this incident:
- Query exceeded the maximum execution time limit of 6.00h
Also, some of your queries might have gotten stuck during this incident. We are sure that newly issued queries are not affected while we are still working to identify the impact of this incident.

investigating

There might be a potential status inconsistency in your INSERT jobs if they failed with error messages like the ones below:
- Cannot complete uploading. This error is temporary and should be recovered by retrying
- cannot get transactionId for null transaction
Please do not rerun those jobs: even if the job failed, the write to your table might have succeeded. We are still working to identify the impact of this incident.

investigating

We are investigating the cause. Queries may be delayed.

Report: "[US Region] Data Connector - Partial outage"

Last update
resolved

From August 29, 2023 at 03:00 UTC until August 30, 2023 at 02:47 UTC, certain data connectors (specifically, Bulk Load jobs used for importing data into Treasure Data) within the US region were associated during their execution with the static IP addresses used for data export purposes instead of those for bulk load purposes. If you have an IP-based rule (whitelisting Treasure Data IPs only) at an external service, the following scenarios could have occurred:
1/ Your Bulk Load jobs fell back to the proper static IP addresses, resulting in successful job completion as usual.
2/ Some of your Bulk Load jobs may have failed without initiating retries or continuing the process. The resolution of these failures depended on the behavior of the external services. In such cases, the affected Bulk Load jobs would need to be manually re-run or re-triggered.
We have already fixed this issue, and static IP addresses for Data Connectors are now properly allocated to bulk load jobs.

Report: "[EU,Tokyo,Korea,AP03 regions] Workflow - Elevated error rate for Custom Script execution"

Last update
resolved

From 1:09 PM to 1:48 PM PDT, a small number of our customers may have encountered an increased error rate when executing Custom Scripts from workflows. This incident was due to an infrastructure issue. If you encountered Workflow errors featuring a 'task submission failed' message during this time period, we kindly recommend that you retry the workflow. Should you have any questions or require further assistance, please don't hesitate to contact us at support@treasure-data.com.

Report: "[US region] Workflow - Elevated error rate for Custom Script execution"

Last update
resolved

From 1:09 PM to 1:47 PM PDT, a small number of our customers may have encountered an increased error rate when executing Custom Scripts from workflows. This incident was due to an infrastructure issue. If you encountered Workflow errors featuring a 'task submission failed' message during this time period, we kindly recommend that you retry the workflow. Should you have any questions or require further assistance, please don't hesitate to contact us at support@treasure-data.com.

Report: "[Tokyo Region] Report about missing Premium Audit Log events"

Last update
resolved

On July 30, 2023 from 01:09 to 15:35 JST, we detected that one of our instances in the Tokyo region had connectivity issues that caused some premium audit log events to not be delivered to customer accounts. We have already identified and mitigated the issue; however, the missing logs could not be identified and recovered. Events for Treasure Data CDP operation itself were not affected. We sincerely apologize for any inconvenience this may have caused. If you have any questions, please contact support@treasure-data.com.

Report: "[US, Tokyo, EU and Korea Regions] Treasure Insights outage"

Last update
resolved

Between July 6, 2023 04:24 UTC to July 6, 2023 06:07 UTC, all customers experienced an outage related to Treasure Insights. The root cause was a network misconfiguration. A fix has been implemented and the issue has been resolved. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied remediation and Treasure Insights is now operational. We will keep monitoring for issues.

investigating

Since the 6th of July at 4:24 UTC, we observed an issue in Treasure Insights. Our engineering team is investigating the cause. We will post further updates here. If you have any questions or concerns about this, please feel free to reach out to our Support team at support@treasuredata.com.

Report: "[Tokyo Region] Treasure Insights - Datamodel creation and build"

Last update
resolved

The incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

As of 4:18pm JST on 16 May, we observe an issue in creating and building Treasure Insights datamodels. Our engineering team is investigating the cause.

Report: "[EU Region] Treasure Workflow - Outage on Custom Script"

Last update
resolved

A change in upstream provider updates was no longer compatible with our configuration. The Custom Script service in Treasure Workflow was failing to launch from 2023-04-17 21:52:37 UTC to 2023-04-17 23:13:55 UTC. We fixed the issue after working closely with the upstream provider. This incident has been resolved.

Report: "[US/Tokyo/Korea/AP03] Treasure Workflow - Partial Outage on Custom Script"

Last update
resolved

A change in upstream provider updates was no longer compatible with our configuration. We fixed the issue after working closely with the upstream provider. This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We identified that we are failing to call a resource from the upstream provider, and we are working with the upstream provider to resolve this issue. We are actively working on it.

investigating

We are still investigating this issue with upstream provider.

investigating

We have observed an issue where custom script executions fail or are delayed. We are currently investigating this issue.

Report: "[US, EU01, Tokyo, AP03] Streaming data ingested between 2023-03-13 21:00 and 22:00 (UTC) were not accessible"

Last update
postmortem

# Summary
There was an incident from 2023-03-13 22:30 to 2023-03-15 20:30 (UTC) in which streaming data ingested between 2023-03-13 21:00 and 22:00 (UTC) was not accessible. As a result, Presto and Hive queries and export jobs that ran during that time frame did not include the streaming data. A bug in a job scheduler caused one defragmentation job to run on the wrong cluster, which ran the job with an outdated codebase. We plan to implement remediations based on the root cause analysis.
# Impact to customers
Data ingested between 2023-03-13 21:00 and 22:00 (UTC) through stream ingestion (e.g. data from td-agent, td-js-sdk, mobile-sdk, ADL, Postback API) into the hourly partitioned data on the archive storage was not accessible to queries and other jobs during the time frame 2023-03-13 22:30 to 2023-03-15 20:30 (UTC). Due to this, Presto and Hive queries and export jobs executed during that time frame (2023-03-13 22:30 to 2023-03-15 20:30 UTC) did not include the streaming data from the ingestion time (2023-03-13 21:00 to 22:00 UTC).
# What happened
In Treasure Data, data from stream ingestion (e.g., data from td-agent, td-js-sdk, mobile-sdk, ADL, Postback API) is stored in a landing area first so that Presto, Hive queries, and export jobs can access it quickly. The data in the landing area is fragmented by nature, so Treasure Data has an internal mechanism that runs a job to defragment the data in the landing area. At 2023-03-13 22:05, our Engineering team rolled out a scheduler update for the defragmentation job in a limited part of our regions. However, the scheduler update had a bug that routed the defragmentation job to the wrong cluster, so the job ran with an outdated codebase. Forty minutes later, the team detected the wrong routing via a monitor and reverted the scheduler update. The team investigated the impact of the jobs executed with the outdated codebase and confirmed that the jobs ran and finished without errors. However, after a detailed investigation 2 days later, the team found that those jobs had silently failed to register storage metadata after completing the defragmentation. As a result, despite the storage data being there, Presto, Hive queries, and export jobs could not access it for 2 days due to the lack of metadata. At 2023-03-15 07:00 (UTC), the team identified the missing metadata based on further investigation after receiving a customer inquiry. We completed the metadata recovery at 20:30 after operationalizing the recovery process in staging environments.
# Details and Remediations
The root cause was a scheduler update bug that routed one defragmentation job to the wrong cluster, resulting in the job running with an outdated codebase. However, the cause analysis showed we need to address the following issues:
* The outdated codebase should not have been able to handle the job. We did not delete the codebase as part of a migration process over a year ago. We will disable the codebase before any update so it cannot take the job again.
* The scheduler should have a guard to prevent unexpected routing, even if it has a bug. We are in the process of replacing job routing with a centralized routing mechanism, and the scheduler update was part of that process. The centralized routing mechanism has a guard and monitoring of the routing; however, the scheduler we updated this time was the last service not yet using it. As a temporary measure until that process is complete, we will implement a monitor to detect wrong routing quickly.
* We should have detected the missing metadata in the QA phase. We do have a data consistency test framework, but we did not include it in the QA scope this time because the update was to the scheduler, not the core defragmentation logic. We plan to make it an automated framework and incorporate its status as a check for anything related to data defragmentation work.
We regret that this incident prevented you from fully leveraging the functionalities of the system, and in particular the query subsystems. Please feel free to reach out to our support team through [support@treasuredata.com](mailto:support@treasuredata.com) if you have any questions.

resolved

We identified at 2023-03-15 10am (UTC) that there was an incident in which streaming data ingested between 2023-03-13 21:00 and 22:00 (UTC) was not accessible during the time frame 2023-03-13 22:30 to 2023-03-15 20:30 (UTC). As a result, Presto and Hive queries and export jobs that ran during that time frame (Mar 13 22:30 to Mar 15 20:30 UTC) did not include the streaming data from the ingestion time (Mar 13 21:00 to 22:00 UTC).
= Overview
Our internal data merge job system merges fragmented data on the realtime storage that manages streaming import data (e.g. data from td-agent, td-js-sdk, mobile-sdk, ADL, Postback API) into the hourly partitioned data on the archive storage. Due to an issue in this system, the scheduled task that handled data between Mar 13 21:00 and 22:00 UTC did not copy the streaming data to the archive storage. Consequently, the data was not visible between 2023-03-13 22:30 and 2023-03-15 20:30 (UTC). Data ingested by Embulk, Bulk Import, Data Connector, INSERT INTO, or CREATE TABLE AS was not affected. We identified the issue at 2023-03-15 10am (UTC) and completed the recovery to make all data visible by 2023-03-15 20:30 (UTC). Presto and Hive queries and export jobs see the expected data after Mar 15 20:30 UTC.
= What you should do for recovery
For the Presto and Hive queries and export jobs you ran against data ingested between Mar 13 21:00 and 22:00 UTC, please re-run the jobs and confirm the results if necessary.
= What’s next
We apologize for any inconvenience this has caused. After the root cause analysis and further remediation planning, we will publish a detailed postmortem. In the meantime, if you have any questions, please don’t hesitate to contact support@treasure-data.com.

Report: "[US Region] Web Interface - Partial Outage"

Last update
resolved

Between Mar. 16, 2023 05:38 PDT and Mar. 16, 2023 05:50 PDT, all customers experienced access issues related to the Web Interface outage. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied remediation and the web console is now operational. We will keep monitoring for issues.

investigating

We detected that console access has been partially unavailable since Mar. 16 05:38 PDT.

Report: "[Tokyo region] Web Interface - Partial Outage"

Last update
resolved

Between Mar 9, 2023 10:16 JST and Mar 9, 2023 10:28 JST, all customers experienced access issues related to the Web Interface outage. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied remediation and the web console is now operational. We will keep monitoring for issues.

identified

We are investigating a possible problem currently affecting web console. We will provide an update as soon as we know more.

investigating

We detected that console access has been partially unavailable since Mar 9, 10:16 JST.

Report: "[Tokyo region] Web Interface - Partial Outage"

Last update
resolved

Between Mar 8, 2023 14:21 JST and Mar 8, 2023 14:34 JST, all customers experienced access issues related to the Web Interface outage. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied remediation and the web console is now operational. We will keep monitoring for issues.

identified

We are investigating a possible problem currently affecting web console. We will provide an update as soon as we know more.

investigating

We detected that console access has been partially unavailable since Mar 8, 14:21 JST.

Report: "[Tokyo region] Web Interface - Partial Outage"

Last update
resolved

We already resolved the issue. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We are investigating a possible problem affecting console access that occurred March 6, 2023 from 22:24 to 22:46 JST (13:24 - 13:46 UTC). We already applied remediation and the issue should be resolved. We are still monitoring the issue.

Report: "[EU01] API error rate increases"

Last update
resolved

We haven't seen any recurrence of the symptoms that caused the issue. The incident has been resolved. Between 8:20am UTC and 8:58am UTC, API responses may have intermittently returned 504 errors. This high error rate meant that associated services, such as the Web Console using the REST API (api.eu01.treasuredata.com), may also have experienced some degradation. We are sorry for the trouble this caused.

monitoring

Between 8:20am UTC and 8:58am UTC, we observed API response errors, with 504 errors returned intermittently. We have already applied a fix and are monitoring the result.

identified

We observed the API error rate decreased. In addition, we are working on implementing a mitigation for the issue.

investigating

We are currently observing intermittent API response errors. We are investigating the issue.

Report: "[ap03] Result output / Data Connector job outbound access issue"

Last update
resolved

The incident has been resolved.

monitoring

The following components in AP03 may have failed during 10:30-16:30 if the export/import 3rd-party system is configured with an IP whitelist:
- DataConnector
- ResultOutput
During this period, jobs of these services were assigned an unintended IP address. As a result, the 3rd-party system might have denied access to these jobs depending on its IP whitelist setting. We've fixed the routing setting and it now works correctly. If your jobs failed during this period, please re-run your workflows/jobs.

monitoring

A fix has been implemented and we are monitoring the result.

identified

We acknowledged a failure of Data Connector jobs due to an outbound network configuration issue affecting Data Connector jobs in the Private Connect environment. We are working on resolving the issue.