Treasure Data

Is Treasure Data Down Right Now? Check whether there is an ongoing outage.

Treasure Data is currently Operational

Last checked from Treasure Data's official status page

Historical record of incidents for Treasure Data

Report: "[US region] Ingest API - Performance Degradation"

Last update
resolved

This incident has been resolved. Duration: ~35 minutes of processing delay; data ingestion was delayed between June 1, 7:35 AM and June 1, 10:35 AM PST. Affected customers: all customers ingesting data to the AWS US region during the incident window. Impact: delayed data availability in Plazma (up to 40 mins); no data loss occurred.

monitoring

We are continuing to monitor for any further issues.

monitoring

Latest Update: Our team is still actively monitoring and assessing the processing of the data. We will leave this status page in a Monitoring state until we are certain everything has been processed. Thank you for your continued patience.

monitoring

Latest Update: Our internal graphs show that the overwhelming majority of the data has been processed. There is a small amount of residual data that is taking longer than expected to process. We will leave this status page in a Monitoring state until we are certain everything has been processed. Thank you for your continued patience.

investigating

Latest Update: Our team is still actively monitoring and assessing the processing of the data backlog. Thank you for your continued patience.

investigating

Latest Update: Our team has deployed a fix, and the system is now processing the backlog of data at a controlled rate. Estimated Time to Full Recovery: ~7 hours We’re actively monitoring the recovery and will provide further updates as progress continues. Thank you for your patience.

investigating

Current Status: A fix has been applied and is currently under observation. Impact: An issue was identified with data ingestion (delayed ingestion) between Sunday 8:00 AM and 8:30 AM PST. Remediation: A fix has been implemented, and new incoming data is now processing normally. Next Steps: We are actively working to resume sending the affected data, which will arrive out of order along with new incoming data. Updates to Follow: Further details will be provided as the situation progresses.

investigating

Our Ingest API is experiencing a performance issue. We are investigating the cause.

Report: "[US region] Ingest API - Performance Degradation"

Last update
Resolved

This incident has been resolved.Duration: ~35 minutes of processing delay. Data Ingestion delay between June 1 - 7:35 AM to June 1 - 10:35 AM PST.Affected Customers:All customers ingesting data to AWS US region during the incident windowImpact:Delayed data availability in Plazma (up to 40 mins)No data loss occurred

Update

We are continuing to monitor for any further issues.

Update

Latest Update: Our team is still actively monitoring and assessing the processing of the data. We will leave this status page in a Monitoring state until we are certain everything has been processed.Thank you for your continued patience.

Monitoring

Latest Update: Our internal graphs show that the overwhelming majority of the data has been processed. There is small amount of residual data that is taking longer than expected to process.We will leave this status page in a Monitoring state until we are certain everything has been processed.Thank you for your continued patience.

Update

Latest Update: Our team is still actively monitoring and assessing the processing of the data backlog.Thank you for your continued patience.

Update

Latest Update: Our team has deployed a fix, and the system is now processing the backlog of data at a controlled rate.Estimated Time to Full Recovery: ~7 hoursWe’re actively monitoring the recovery and will provide further updates as progress continues. Thank you for your patience.

Update

Current Status:A fix has been applied and is currently under observation.Impact:An issue was identified with data ingestion(delayed ingestion) between Sunday 8:00 AM PST - 8:30 AM PST.Remediation:A fix has been implemented, and new incoming data is now processing normally.Next Steps:We are actively working to resume sending the affected data, which will arrive out of order along with new incoming data.Updates to Follow:Further details will be provided as the situation progresses.

Investigating

Our Ingest API is experiencing a performance issue. We are investigating the cause.

Report: "[US region] Ingest API - Performance Degradation"

Last update
resolved

We have continued our recovery efforts. While the issue has not been fully resolved, the impact is now limited. We are directly contacting affected customers and will continue to provide updates as needed. As this issue is now affecting only a limited number of customers, we are closing this status page. We sincerely apologize for the inconvenience this has caused and appreciate your understanding as we continue to support those still impacted. Thank you for your patience and understanding.

monitoring

We are continuing to run our recovery operation, and it is progressing. We are monitoring closely and will update in six hours.

monitoring

Recovery is proceeding and we are continuing to closely monitor the progress. We will have another update in six hours.

monitoring

We have been monitoring the situation for the past 6 hours. At this time, we are unable to provide an updated estimate for full recovery. We are continuing to monitor closely and are also exploring additional measures to improve recovery performance. We will share another update in approximately 6 hours or sooner if there is any significant development. We apologize for the inconvenience.

monitoring

We have rolled out a fix and are observing our backlog recovering at a safe rate. We are currently processing a backlog of historical data, which is progressing at a controlled rate. Estimated time to full recovery of all data is approximately 32 hours, and we will continue monitoring and provide updates throughout this process.

identified

We are now attempting to restore the pipeline from a timestamp rather than a checkpoint. There is no risk of user data loss. Downstream systems may observe duplicate events from the affected time window, and this will be limited in scope.
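For downstream consumers that need to guard against the duplicate events mentioned above, one common pattern is idempotent processing keyed on a stable event identifier. The sketch below is illustrative only: the field names (`event_id`, `event_time`), the window dates, and the in-memory store are assumptions, not part of Treasure Data's tooling.

```python
from datetime import datetime

# Hypothetical affected window (naive UTC timestamps); the real window would
# come from the incident report for your region.
WINDOW_START = datetime(2025, 6, 1, 3, 0)
WINDOW_END = datetime(2025, 6, 1, 11, 0)

_seen_ids = set()  # in production this would be a persistent store (table, cache, etc.)

def process_once(event: dict, handler) -> bool:
    """Run handler(event) unless the event is a duplicate replay from the affected window."""
    event_time = datetime.fromisoformat(event["event_time"])  # assumed ISO-8601, UTC, no offset
    if WINDOW_START <= event_time < WINDOW_END:
        if event["event_id"] in _seen_ids:
            return False  # duplicate delivery of an already-processed event; skip it
        _seen_ids.add(event["event_id"])
    handler(event)
    return True

# Example: the second delivery of the same event is skipped.
process_once({"event_id": "abc", "event_time": "2025-06-01T05:00:00"}, print)
process_once({"event_id": "abc", "event_time": "2025-06-01T05:00:00"}, print)
```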

identified

We have applied a fix and are observing it. We have identified a window between Thursday 3 AM UTC and Monday 11 AM UTC where data was not being processed. We have implemented a remediation, and new incoming data is not impacted. We are actively working to resume sending this data, which will arrive out of order along with new incoming data.

investigating

We are continuing to investigate this issue.

investigating

Our Ingest API is experiencing a performance issue. We are investigating the cause.

Report: "[Tokyo Region] Performance Issue of Trino service"

Last update
resolved

This incident has been resolved.

monitoring

We are observing recovery. We continue to monitor for full recovery.

identified

The root cause has been identified and we are applying the fix.

investigating

Our Trino service is experiencing an issue. We are investigating the cause.

Report: "[US Region] Hive Jobs and the Result Export jobs triggered from Hive Jobs are not functioning properly"

Last update
resolved

We have confirmed that Hive jobs are now working properly, and the related Result Export jobs are functioning as expected. The impact occurred between April 8, 16:00 UTC and April 9, 01:35 UTC. If you had any jobs that failed during the affected time window, please re-run them as needed. We sincerely apologize for the inconvenience this may have caused.

monitoring

We have identified the root cause and applied a hotfix. The issue was that Hive Jobs were unable to start properly, which in turn caused some Result Export jobs triggered from those Hive Jobs to fail. We are currently monitoring the system to ensure that the situation continues to improve.

investigating

We are currently investigating an issue where Hive Jobs and the Result Export jobs triggered from Hive Jobs are not functioning properly. Our team is actively looking into the cause of the issue. We will provide updates as soon as more information becomes available.

Report: "[All Regions] Utilization Dashboards showing outdated information"

Last update
resolved

This issue is resolved and our utilization dashboards should be showing up-to-date information. If you observe anything unusual in your usage data, please contact our support team. Thank you for your patience while we worked through this issue.

monitoring

We have remediated the issue and are processing usage data from the last 4 days. Users should see the usage dashboards catching up. We expect this process to take about another hour to complete.

identified

We have identified the cause of this issue and are working to restore service. Once the service is restored, we expect it will take a few hours to catch up on usage data for the last few days.

investigating

Our utilization dashboards are not updating with up-to-date information. Customers accessing their Treasure Data usage dashboards will see a gap in usage details from early Saturday morning UTC. There is no impact on ongoing Treasure Data usage, and all usage information is correctly stored internally. However, the dashboard where customers can view their consumption is not up to date. Note this is a reporting problem only; there is no indication of any issues with regular Treasure Data usage. We are working to diagnose the issue and will provide an update in the next hour.

Report: "[All Regions] Treasure Insights is experiencing an outage"

Last update
resolved

We would like to inform you that the issue has been fully resolved. Incident Impact Details: Treasure Insights was returning 502 errors and was unreachable during the incident. Incident Impact Time: Start: January 20, 09:51 UTC; End: January 20, 15:35 UTC. We apologize for any inconvenience this may have caused and thank you for your patience and understanding.

monitoring

A fix has been implemented, and we are monitoring the service to ensure everything is functioning correctly.

identified

The problem has been identified, and we are currently working on a solution.

investigating

The situation remains the same as it was in the last update.

investigating

We are still investigating the issue.

investigating

We have observed that users are not able to access Treasure Insights. We are currently investigating the issue.

Report: "[EU Region] Elevated error/ performance degradation related to personalisation API"

Last update
resolved

Between Thursday, 23 Jan 2025, 07:20 UTC and 11:40 UTC, customers experienced elevated error rates and increased latency related to the Profiles API. A fix has been implemented, and the issue has been resolved. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective soon.

monitoring

The response team found problematic real-time segment configurations in one customer's Parent Segment that possibly contributed to consuming the concurrency capacity. The team updated the real-time event routing configuration to mitigate the high-latency issue. Combined with capacity addition operations, the team stabilized the Profiles API cache cluster. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will update the postmortem with a further remediation plan as promised.

identified

We successfully provisioned 2x capacity in 30 minutes. New resources improved the latency, but the error rate is still high. The response team is planning to implement another remediation instead of adding resources. We will update you in 30 minutes.

identified

We provisioned additional capacity at 10:00 am UTC to support the increasing workload. It improved the latency, but we still observed errors and long latency for a small number of requests. The response team has started provisioning additional concurrency capacity. Unlike the previous method, the new process should not take as long to provision. We will post the result in 30 minutes.

identified

The response team confirmed the symptom is from the same cause as the previous incidents. We are provisioning additional concurrency capacity to the environment. We will update you when it is completed.

investigating

We are currently observing errors or performance degradation for the personalization API. We are investigating the cause of the issue now.

Report: "[EU Region] Elevated error rate and performance degradation for personalization API"

Last update
resolved

We implemented fundamental isolation of a problematic configuration at 14:42 UTC. The remediation caused the cluster workload to drop from 60% to 1%. On Friday, we implemented write access isolation for the problematic configuration, which stopped the cluster workload from growing. Today, we implemented read access isolation, which restored the cluster workload to its previous level. The system is operating normally now, and we are closing the incident. We acknowledge that we need further actions to prevent the same incident from being triggered again by a similar configuration. We will post a further postmortem when it is ready.

monitoring

We are still monitoring the service. Between Thursday, 30 Jan 2025, 10:00 UTC and 11:05 UTC, customers experienced elevated error rates and longer latency for Profiles API lookups. Currently, the cluster workload has calmed down and is operating normally. Our response team is ready to provision additional processing capacity. However, we are closely monitoring the service status to avoid further downtime during peak times. In addition, we are working on isolating problematic access patterns from the service. We will keep the status page open and update you on the progress.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are currently observing that the performance degradation and error rate have improved. We continue to closely monitor the metrics.

investigating

We detected degraded performance of personalization API and an error rate increase. We are currently investigating this issue.

Report: "[EU Region] Elevated error rate and performance degradation for personalization API"

Last update
resolved

Between Wednesday, 29 Jan 2025, 15:47 UTC and 16:51 UTC, customers experienced elevated error rates and increased latency related to the Profiles API. The cause was that a monitor for a slight, non-visible elevation in the error rate kicked off a system recovery operation. The recovery operation then caused the same incident due to the configuration problem we had on Friday: https://status.treasuredata.com/incidents/jyqjpyscvjzh The response team re-deployed the safe version to recover the system. Also, as a short-term mitigation, we updated the recovery operation until we complete the root cause analysis and permanent fix described in "Further Actions" in the previous postmortem: https://status.treasuredata.com/incidents/jyqjpyscvjzh At the moment, if you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are currently seeing significant improvement in the health metrics we monitor. We continue to carefully monitor the health status.

monitoring

We have started to apply a remediation and are observing the service recovering. However, we continue to closely monitor the service's health status.

investigating

We detected degraded performance of personalization API and an error rate increase. We are currently investigating this issue.

Report: "[EU Region] Elevated error/ performance degradation related to personalisation API"

Last update
postmortem

The Profiles API enables browsers to retrieve personalized content based on detailed customer information. Between January 20 and January 22, we experienced elevated error rates and increased latency for a subset of requests to the Profiles API. We sincerely apologize for the inconvenience caused by this incident. We understand the critical role our API plays in delivering seamless user experiences, and we are committed to preventing such disruptions in the future.
# Timeline
* On January 20, from 7:45 to 11:15 UTC - 3% error rate during the period
* On January 21, from 7:35 to 10:25 UTC - 33% error rate during the period
* On January 22, from 9:15 to 16:40 UTC - 40% error rate during the period
During these periods, API calls to `https://cdp-eu01.in.treasuredata.com/` exhibited elevated error rates and latency. This issue did not impact RT 2.0, the newer version of our real-time system.
# Incident Analysis
This is the current analysis snapshot; updates will be provided as more information becomes available. We noticed a gradual increase in processing workloads on the Profiles API starting on January 6, driven by the complexity of real-time segmentation. By January 20, this workload exceeded the internal concurrency limit configured in our caching cluster. Key observations are:
* Symptoms consistently began to appear around 07:30 UTC each day.
* Internal system indicators flagged potential issues approximately two hours prior to the incidents.
The bottleneck was traced to the caching cluster's concurrency capacity, which was insufficient to handle the growing workload.
# Action Taken
Based on these observations, we implemented a mitigation to increase the concurrency capacity in the caching cluster. We will monitor the symptoms closely today and provide additional capacity when necessary.
# Further Actions
Our development team will conduct a capacity review of the Profiles API infrastructure to prepare for future workload growth. The remediation plan will include the following steps:
* Enhanced monitoring and alerting of the caching cluster's concurrency capacity
* Ensuring safe yet rapid capacity provisioning when required
We will provide a follow-up update by the end of Friday, summarizing any additional findings and actions taken.
Hiroshi (Nahi) Nakamura, CTO & VP Engineering, Treasure Data
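To illustrate the "enhanced monitoring and alerting of the caching cluster's concurrency capacity" remediation described above, here is a minimal sketch of a utilization check against a configured concurrency limit. The metric names, the 80% threshold, and the limit value are assumptions for illustration; this is not Treasure Data's actual monitoring configuration.

```python
def concurrency_alert(in_flight: int, concurrency_limit: int, warn_ratio: float = 0.8) -> str | None:
    """Return a warning when in-flight requests approach a configured concurrency limit."""
    utilization = in_flight / concurrency_limit
    if utilization >= warn_ratio:
        return (f"caching cluster concurrency at {utilization:.0%} "
                f"({in_flight}/{concurrency_limit}); consider provisioning more capacity")
    return None

# Example: 850 in-flight requests against an assumed limit of 1000 crosses the 80% threshold.
print(concurrency_alert(850, 1000))
```

An early-warning threshold like this reflects the observation in the postmortem that internal indicators flagged potential issues roughly two hours before each incident window.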

resolved

Between Wednesday, 22 Jan 2025, 09:15 UTC and 16:40 UTC, some customers experienced elevated error rates and increased latency related to the Profiles API. A fix has been implemented and the issue has been resolved. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective soon.

monitoring

We have fully deployed our fixes to the Personalization API and our monitors show systems operating normally. Our teams will continue to monitor the issue, and we will update this incident if we observe any unusual behavior. If you experience any delays or abnormal errors, please reach out to our support team. Thank you for your patience and understanding during this incident. We will share an incident retrospective once it is available.

investigating

We have observed some intermittent errors as we roll out a fix to all of our systems, and users may see delays or errors as the change is applied to our systems. Our response team is working to minimize the impact to customers while we deploy this change, but we expect some slower performance while we gradually deploy the fix over the next 3-4 hours.

investigating

Our response team has identified a potential cause for this issue, and we will be deploying a fix shortly. At this time we have not observed any elevated error rates or delays since 16:40 UTC. We will provide an additional update once this fix has been deployed. If you are observing abnormal errors or long delays from our Personalization API, please reach out to our support team. We will continue to monitor for any issues, and will update once our fix is deployed.

investigating

From 09:00 to 17:00 UTC, we observed elevated 500s and high latency on the CDP KVS server. Customers may have observed elevated errors and timeouts during this period when sending requests to the Personalization API. Our team has been investigating this issue and has deployed a workaround to our systems while we work to identify the root cause of the problem. There should be no system impact at this time. Customers who continue to observe delays or elevated error rates should contact our support team, and we'll be happy to assist them further. We will continue to investigate and will provide another update by 11 PM UTC.

investigating

We have applied various mitigations on our infrastructure side; however, they have not decreased the error rate. We are continuing to investigate possible causes on our end.

investigating

We are currently observing errors or performance degradation for the personalization API. We are investigating the cause of the issue now.

Report: "[EU Region] Elevated error/ performance degradation related to personalisation API"

Last update
resolved

We observed that the error rate decreased and the issue was resolved.

investigating

We applied the mitigation, and we are still monitoring to see if it resolves the issue.

investigating

We are currently observing errors or performance degradation for the personalization API. We are investigating the cause of the issue now.

Report: "[EU Region] Elevated error rate for CDP KVS"

Last update
resolved

We would like to inform you that the issue has been fully resolved. Incident Impact Details: - Personalization API has experienced an outage leading to increased errors and timeouts. Incident Impact Time: - Start: January 20, 07:45 UTC - End: January 20, 11:15 UTC We apologize for any inconvenience this may have caused and thank you for your patience and understanding.

monitoring

We are observing fewer errors now. However, we are still monitoring and re-evaluating the remedial steps to confirm better performance.

monitoring

Through our investigation, we identified the cause of the issue and have applied some remediation. Our team is closely monitoring the system to ensure continued stability.

investigating

We are currently investigating this issue.

Report: "[EU Region] Trino/Presto - Degraded performance"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

Some users may be experiencing degraded performance when running Presto or Trino jobs. We are investigating the incident. At present, all users in the EU central region may be affected.

Report: "[US Region] Trino/Presto performance degradation"

Last update
resolved

This incident has been resolved.

monitoring

We have applied remediation to the degraded infrastructure. We are currently monitoring the performance closely.

investigating

We are investigating a possible problem currently affecting Trino/Presto queries for the US region. Queries might have degraded performance. We will provide an update as soon as we know more details.

Report: "[All Regions] Elevated error rate for CDP KVS"

Last update
resolved

We would like to inform you that the issue has been fully resolved.
Incident Impact Details:
- Profiles API experienced an increased frequency of errors and timeouts.
- The latest logs were not reflected in real-time segments.
Incident Impact Time by Region:
- us: Start: November 13, 04:14 UTC - End: November 13, 08:55 UTC
- aws-tokyo: Start: November 13, 04:14 UTC - End: November 13, 08:54 UTC
- eu01: Start: November 13, 04:17 UTC - End: November 13, 08:51 UTC
- ap02: Start: November 13, 04:15 UTC - End: November 13, 09:01 UTC
- ap03: Start: November 13, 04:17 UTC - End: November 13, 08:52 UTC
We apologize for any inconvenience this may have caused and thank you for your patience and understanding.

monitoring

Through our investigation, we identified the cause of the issue as a recent release operation. We have reverted all changes from this release, and normal functionality has been restored. Our team is closely monitoring the system to ensure continued stability.

investigating

Since approximately 4:00 UTC, we have been experiencing an issue with requests to CDP KVS, which may be affecting Profiles API functionality, causing delays in KVS data synchronization and updates to real-time segment information. Our team is actively investigating and working to resolve the issue as quickly as possible. Please note that Realtime 2.0 is not affected.

Report: "[EU region] Presto - Partial Outage"

Last update
resolved

Between Nov 5, 17:15 UTC and Nov 5, 18:45 UTC, some customers experienced delays and errors related to Presto. The cause was insufficient capacity, which will be investigated further. A fix has been implemented and the issue has been resolved. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied a fix. The problem looks to be resolved, but we are continuing to monitor.

investigating

We are continuing to investigate this issue. We expect most queries will succeed after one or more automatic retries.

investigating

We are investigating a possible problem currently causing elevated error rates from Presto queries. We will provide an update as soon as we know more.

Report: "[US Region] Query Engine - Service Degraded Performance"

Last update
postmortem

We experienced a temporary overload on the storage layer. It started at 16:15 PDT and was fixed at 18:15 PDT. The major impact was performance degradation for data ingestion components (Streaming Import REST API, Mobile/Javascript REST API, Data Connector) and the Hive and Presto query engines. Some queries executed on Hive and Presto failed because of the storage performance degradation.

resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We are continuing to investigate this issue.

investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.

Report: "[All Regions] Web Interface - Partial Outage to show Standard Audit Logs"

Last update
resolved

This incident has been resolved.

monitoring

We confirm the issue was resolved. We will continue to monitor the results.

identified

We observed a problem with web console access related to showing Standard Audit Logs. We have found the cause of the incident. We are working to resolve the incident.

Report: "[US region] Presto Query Engine - Degraded Performance"

Last update
resolved

The incident is now resolved. All affected components are back to normal. A subset of customers in the US region might have experienced degraded performance on Presto queries between 4:50 PM EDT and 1:40 AM EDT. Presto queries might also have been queued for longer than usual during the incident. Finally, some queries might have failed due to the remediations that were put in place.

monitoring

Systems should be back to normal but we continue to monitor the situation for a while.

monitoring

We applied the fix. We will continue to monitor the results.

investigating

Performance for some, though not all, queries has improved. We are continuing to investigate the issue.

investigating

This incident is still ongoing. We are investigating the root cause.

monitoring

A fix has been implemented and we are monitoring the results.

investigating

We are investigating a possible problem currently affecting Presto. Queries could be delayed. We will provide an update as soon as we know more.

Report: "[Tokyo, AP03 Region] Custom Script Workflow error"

Last update
resolved

The issue at our infrastructure provider (AWS) is resolved, and we are not observing new errors at this time. Please rerun any failed workflows if needed.

monitoring

The error rate has decreased. Please rerun any failed workflows if needed. We observed errors with Custom Script between 8 am and 10 am UTC on August 29th. We are continuing to monitor the issue carefully.

investigating

Custom Script tasks launched from workflows are failing due to an ongoing incident with our infrastructure provider (AWS). Error example: Unable to execute HTTP request: Connect to sts.amazonaws.com:443 [sts.amazonaws.com/209.54.177.164] failed: connect timed out com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to sts.amazonaws.com:443 [sts.amazonaws.com/209.54.177.164] failed: connect timed out We are actively working on the issue on our end.
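For custom scripts that call AWS endpoints such as sts.amazonaws.com themselves, transient connectivity blips like the connect timeout above can sometimes be absorbed with client-side retries and short timeouts. The sketch below uses boto3 and assumes credentials are already available in the script's environment; it is a general mitigation pattern, not a description of Treasure Data's internal fix.

```python
import boto3
from botocore.config import Config

# Keep timeouts short and retry transient connection failures so a single hung
# connection to sts.amazonaws.com does not stall the whole task.
aws_config = Config(
    connect_timeout=5,                                # seconds to establish the TCP connection
    read_timeout=10,                                  # seconds to wait for a response
    retries={"max_attempts": 5, "mode": "standard"},  # botocore's standard retry mode with backoff
)

sts = boto3.client("sts", config=aws_config)
print(sts.get_caller_identity()["Account"])
```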

Report: "[Tokyo, AP03 Region] Data Connector, Result Export, Hive job malfunction"

Last update
resolved

The issue at our infrastructure provider (AWS) is resolved, and we are not observing new errors at this time. Some jobs failed due to the incident, so please rerun the failed jobs if needed.

investigating

We have found that Data Connector, Result Export, and Hive jobs were unable to start or failed due to an incident with our infrastructure provider (AWS). Some Data Connector, Result Export, and Hive jobs might encounter delays or errors. The issue was observed on Aug 29th between 8:30 UTC and 9:45 UTC. We are still investigating the issue on our end.

Report: "[US Region] Ingest API - Performance Downgrade"

Last update
resolved

We confirmed that the catch-up was complete at 6:44 am PT. From 2024-08-06 03:20 am to 2024-08-07 06:44 am PT, events arriving at us01.records.in.treasuredata.com and c360-ingest-api.treasuredata.com experienced up to 8 hours of delay in batch data ingestion. There was no impact on the real-time system.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor the recovery. As of now, 99% of events become visible within 45 minutes. We will resolve the incident when the catch-up is complete.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to monitor our systems' recovery as we work through the backlog of messages sent in the last 12 hours. We have added more resources to reduce the impact of this issue. At this time, we expect all messages to be processed, but customers may continue to see multi-hour delays as we continue to process messages to our Ingest API for the next few hours. We will continue to monitor this issue, and we appreciate your patience as we work through it.

monitoring

We have rolled out a fix and observed that processing delays are no longer increasing. Customers may continue to see delayed message processing over the next 3-4 hours as the backlog is processed. We continue exploring options to shorten this time and will monitor for any issues.

identified

We have identified the source of the problem and are applying a solution now. Customers may still see processing delays as we catch up on the request backlog. We will continue to explore options to accelerate our recovery, and we will continue to monitor the situation.

investigating

We are observing slower processing time for messages sent to our Ingest API. Users may see a delay up to two hours in message processing. We are continuing to investigate the root cause and exploring options to catch up on our backlog of messages, and will provide an update once we know more.

investigating

We are still investigating the cause of this issue.

investigating

Our Ingest API is experiencing a performance issue. We are investigating the cause.

Report: "[US Region] High Error rate at Custom Script and some DataConnector"

Last update
resolved

This incident has been resolved, all affected components (Custom Script and some DataConnector) are now back to normal.

monitoring

According to our infrastructure provider (AWS), this issue has already been resolved. We also see that the failure rate has been reduced, so we will update this incident to Monitoring status and the affected components to Operational status.

identified

Due to the degradation of the Amazon Ads system (https://status.ads.amazon.com), our connectors for the Amazon Ads platform are currently not working properly. If you are using any of the connectors below, your jobs may not be running correctly.
- Amazon Marketing Cloud export
- Amazon Marketing Cloud import
- Amazon Ads export
- Amazon DSP export
We will provide further updates as soon as more information becomes available.

identified

This issue is still ongoing; we are still seeing custom script tasks fail. Custom Script users may also encounter some errors related to AWS CloudWatch Logs. According to our infrastructure provider (AWS), they are working on recovery and some improvements are being seen internally, but they expect it to take 1-2 hours for full recovery. We will provide further updates as soon as more information becomes available.

identified

We are currently experiencing a high error rate in the Custom Script service on Treasure Workflow (US Region) due to an ongoing incident with our infrastructure provider (AWS). This issue manifests as increased error rates with error messages like: > Task failed with unexpected error: null (Service: AWSLogs; Status Code: 503; Error Code: null; Request ID: xxxxxx; Proxy: null) At this time, we do not have an estimated time for full resolution. We will provide further updates as soon as more information becomes available.

Report: "[US Region] Delays in Processing incoming events"

Last update
resolved

The issue is resolved at the provider and all components have completed catch-up.

monitoring

We are in constant communication with our service provider.

investigating

We are monitoring delays in the systems responsible for processing events ingested through our ingestion API. There are also increased errors in the ingestion API. The delay is caused by infrastructure issues at our provider, which are currently being addressed. We are monitoring the situation. During this time, writing to storage may be delayed, but there is no evidence of data loss.

Report: "[EU region] Profiles API - Degraded Performance"

Last update
resolved

Between 2:02 a.m. and 5:47 a.m. PDT, the CDP Personalization API experienced elevated API error rates. The engineering team identified the computing instance causing the issue and implemented a fix. The problem has already been resolved. Personalization API clients that implement error retry observed no issues. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

Report: "[All Region] All Hive jobs run on Hive4"

Last update
resolved

Verified that the issue is completely resolved. We apologize for the inconvenience.

monitoring

All Hive jobs, excluding CDP Workflow, ran on Hive4 (query engine 2023.1) during the following time periods:
- [US Region] 2024-06-19 07:45 +0000 - 2024-06-20 04:06 +0000
- [Tokyo Region] 2024-06-19 09:05 +0000 - 2024-06-20 04:08 +0000
- [EU Region] 2024-06-19 09:14 +0000 - 2024-06-20 04:09 +0000
- [Korea Region] 2024-06-19 09:21 +0000 - 2024-06-20 04:10 +0000
- [AP03 Region] 2024-06-19 09:30 +0000 - 2024-06-20 04:11 +0000
We have fixed it, and all Hive jobs are now properly executed on the query engine specified by the user. We apologize for the inconvenience.

Report: "[US Region] Ingestion API degraded performance and availability"

Last update
resolved

For the period between 7:00 AM and 8:00 AM PDT, users of the Mobile/Javascript API in the US region experienced periods of slow responses and unavailability. There was no data loss once the data was received by the API.

Report: "[US Region] Delay in Streaming Import"

Last update
resolved

We have been monitoring closely, and as there have been no further recurrences of the delay, we consider the issue resolved. We apologize for any inconvenience caused.

monitoring

We have confirmed that the delay was resolved around UTC 9:30. We are currently continuing to monitor the situation.

investigating

We have observed delays of up to 2 hours in Streaming Import (td-js-sdk, td-mobile-sdk, postback request, ingestion-api, fluentd, etc..) occurring from approximately UTC 6:30 onwards. We are currently investigating the cause and working to resolve the delay.

Report: "[US Region] Performance Issue of Presto service"

Last update
resolved

This incident has been resolved. The Presto service has returned to normal.

monitoring

The fix has been applied. We are monitoring the results.

investigating

Our Presto service is experiencing an issue. We are investigating the cause.

Report: "[US Region] Performance Issue of Presto service"

Last update
resolved

This incident has been resolved. The Presto service has now returned to normal.

monitoring

We are monitoring the results.

investigating

The fix has been applied. We will continue to monitor.

investigating

We are continuing to investigate the issue.

investigating

Our Presto service is experiencing an issue. We are investigating the cause.

Report: "[US/EU/Tokyo/Korea region] Treasure Workflow - Partial Outage on mail Operator"

Last update
resolved

Between 2023-12-14 04:55 UTC and 2023-12-14 06:04 UTC, some customers in the US/EU/Tokyo/Korea regions experienced failures sending email using the mail operator in the Workflow service. A fix has been implemented and the issue has been resolved. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We are continuing to investigate this issue.

investigating

We have observed a problem where sending email using the mail operator fails. We are currently investigating this issue.

Report: "[US Region] REST API to retrieve query results occasionally fails"

Last update
resolved

We confirmed that the issue has stopped occurring and has stabilized. We will resolve this status, but will continue investigation.

monitoring

The previous update stated the incident timeframe incorrectly. Correct: 7:00 to 8:31 UTC. Incorrect: 16:00 to 17:31 UTC. We are sorry for the inconvenience.

monitoring

REST API calls from Workflow and client libraries to retrieve query results occasionally failed due to the following error between 16:00 and 17:31 UTC. We have already confirmed the error is no longer occurring. We therefore continue to monitor the situation while we investigate the cause. ``` [CLIENT_ERROR] [400:Bad Request] API request to /v3/job/result/000000 has failed: <?xml version="1.0" encoding="UTF-8"?> <Error><Code>AuthorizationQueryParametersError</Code><Message>X-Amz-Expires must be non-negative</Message><RequestId>TTTTTTTTTT</RequestId><HostId>oI6e7Jfaub0mG/XXXXXXXXXXXXXXXXX+7w=</HostId></Error> (td client http) ```
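Clients that occasionally see transient errors like the one above can wrap the result fetch in a bounded retry. The sketch below calls the `/v3/job/result/<job_id>` path referenced in the error message directly with `requests`; the API host, the `TD1` authorization scheme, and the `format` query parameter are assumptions based on common Treasure Data client usage, so adjust them for your environment.

```python
import time
import requests

API_BASE = "https://api.treasuredata.com"  # assumed US endpoint; adjust for your region

def fetch_job_result(job_id: str, api_key: str, max_attempts: int = 5) -> bytes:
    """Fetch query results for a finished job, retrying transient failures with backoff."""
    url = f"{API_BASE}/v3/job/result/{job_id}"
    headers = {"Authorization": f"TD1 {api_key}"}  # assumed API-key auth scheme
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, headers=headers, params={"format": "json"}, timeout=60)
            if resp.status_code == 200:
                return resp.content
            resp.raise_for_status()  # turn 4xx/5xx into an exception handled below
        except requests.RequestException:
            if attempt == max_attempts:
                raise
        time.sleep(2 ** attempt)  # exponential backoff before the next attempt
    raise RuntimeError("exhausted retries without a successful response")
```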

Report: "[US Region] Ingest API - Performance Downgrade"

Last update
resolved

The performance downgrade on Import API has been resolved. We apologize for any inconvenience caused.

monitoring

Between 19:42 PST to 21:47 PST, there was a performance downgrade on our Ingest API. We already applied a fix and we are monitoring the result.

identified

We have identified the cause of the performance downgrade. We will provide an update as soon as we know more.

investigating

We have detected that Ingest API performance has been degraded since Sep 26, 21:00 PST.

Report: "[US Region] Treasure Workflow - Partial Outage in Workflow Service"

Last update
resolved

Our Workflow service had an outage starting at 10:50 am PST on 25th Sep. 2023. From that time, workflow requests went to pending status. We fixed the incident and deployed our fix at 2:40 pm PST on 25th Sep. 2023. During this outage window, customer workflows might have experienced some delays. After the fix was deployed, the Workflow service is working as normal and has started to resume pending workflows while handling new requests as well. The incident has been resolved.

monitoring

We still have 20% of pending workflows to catch up. The remaining pending workflows will be processed within 30 minutes. We are continuing to monitor for any further issues.

monitoring

Half of the pending workflows have been processed without any issues. The remaining pending workflows will be processed within an hour. We are continuing to monitor for any further issues.

monitoring

The pending workflows are resuming now, but it will take 1-2 hours to backfill all pending workflows. We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We have observed an issue where workflows are stuck in pending status due to a partial outage in the service. We are currently investigating the issue.

investigating

We have observed an issue where custom script executions are failing. We are currently investigating the issue.

Report: "[US Region] Presto partial performance degradation and potential job failure"

Last update
resolved

This incident has been resolved. There was a potential status inconsistency issue on one of our Presto clusters in the US region during 1:04 - 3:10 AM UTC on Sep 4. Queries issued during this period have potentially been affected. You may see failures of INSERT/DELETE jobs with error messages like the ones below. Please do not rerun these jobs, especially INSERT jobs, to avoid double insertion: even if the job failed, the write to your table might have completed.
- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.
- Failed to rewrite partition
- Killed by the system because this query stalled for more than 1.00h.
Also, some of your queries issued during this period might have gotten stuck or even failed with the following error. Those jobs were also affected by this incident.
- Query exceeded the maximum execution time limit of 6.00h
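Before re-running an INSERT job from the affected window, it can help to check the target table for double-inserted rows first. The sketch below only builds a Presto duplicate-check query over the incident window using TD_TIME_RANGE; the table name, the key column, and the year on the window dates are assumptions, and the printed query would be run through whatever Treasure Data client you normally use.

```python
def duplicate_check_sql(table: str, key_column: str) -> str:
    """Build a Presto query that lists keys inserted more than once during the incident window."""
    # Window from the report above: 01:04 - 03:10 UTC on Sep 4 (year assumed from surrounding reports).
    return f"""
    SELECT {key_column}, COUNT(*) AS copies
    FROM {table}
    WHERE TD_TIME_RANGE(time, '2023-09-04 01:04:00', '2023-09-04 03:10:00', 'UTC')
    GROUP BY {key_column}
    HAVING COUNT(*) > 1
    """

# Placeholder database/table and key names for illustration only.
print(duplicate_check_sql("my_db.my_table", "event_id"))
```

Whether a suitable key column exists depends on your schema; if there is no natural key, comparing row counts for the window against an upstream source is an alternative check.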

investigating

There was a potential status inconsistency issue on one of our Presto clusters in the US region during 1:04 - 3:10 AM UTC on Sep 4. Queries issued during this period have potentially been affected. You may see failures of INSERT/DELETE jobs with error messages like the ones below. Please do not rerun these jobs, especially INSERT jobs: even if the job failed, the write to your table might have completed.
- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.
CTAS jobs that failed with the following error might also have been affected by this incident:
- Query exceeded the maximum execution time limit of 6.00h
Also, some of your queries might have gotten stuck during this incident. We are sure that newly issued queries are not affected while we are still working to identify the impact of this incident.

investigating

We are continuing to investigate this issue.

investigating

There was a potential status inconsistency issue on one of our Presto clusters in the US region during 1:04 - 3:10 AM UTC on Sep 4. Queries issued during this period have potentially been affected. You may see failures of INSERT/DELETE jobs with error messages like the ones below. Please do not rerun these jobs, especially INSERT jobs: even if the job failed, the write to your table might have completed.
- cannot get transactionId for null transaction
- Cannot complete uploading. This error is temporary and should be recovered by retrying.
CTAS jobs that failed with the following error might also have been affected by this incident:
- Query exceeded the maximum execution time limit of 6.00h
Also, some of your queries might have gotten stuck during this incident. We are sure that newly issued queries are not affected while we are still working to identify the impact of this incident.

investigating

There might be a potential status inconsistency in your INSERT jobs if they failed with error messages like the ones below:
- Cannot complete uploading. This error is temporary and should be recovered by retrying
- cannot get transactionId for null transaction
Please do not rerun those jobs: even if the job failed, the write to your table might have succeeded. We are still working to identify the impact of this incident.

investigating

We are investigating the cause. Queries may be delayed.

Report: "[US Region] Data Connector - Partial outage"

Last update
resolved

From August 29, 2023 at 03:00 UTC until August 30, 2023 at 02:47 UTC, certain data connectors (specifically, Bulk Load jobs used for importing data into Treasure Data) within the US region were associated during their execution with the static IP addresses used for data export purposes instead of those for bulk load purposes. If you have an IP-based rule (whitelisting Treasure Data IPs only) at an external service, the following scenarios could have occurred:
1/ Your Bulk Load jobs fell back to the proper static IP addresses, resulting in successful job completion as usual.
2/ Some of your Bulk Load jobs may have failed without initiating retries or continuing the process. The resolution of these failures depended on the behavior of the external services. In such cases, the affected Bulk Load jobs would need to be manually re-run or re-triggered.
We have already fixed this issue, and static IP addresses for Data Connectors are now properly allocated to bulk load jobs.

Report: "[EU,Tokyo,Korea,AP03 regions] Workflow - Elevated error rate for Custom Script execution"

Last update
resolved

From 1:09 PM to 1:48 PM PDT, a small number of our customers may have encountered an increased error rate when executing Custom Scripts from workflows. This incident was due to an infrastructure issue. If you encountered Workflow errors featuring a 'task submission failed' message during this time period, we kindly recommend that you retry the workflow. Should you have any questions or require further assistance, please don't hesitate to contact us at support@treasure-data.com.

Report: "[US region] Workflow - Elevated error rate for Custom Script execution"

Last update
resolved

From 1:09 PM to 1:47 PM PDT, a small number of our customers may have encountered an increased error rate when executing Custom Scripts from workflows. This incident was due to an infrastructure issue. If you encountered Workflow errors featuring a 'task submission failed' message during this time period, we kindly recommend that you retry the workflow. Should you have any questions or require further assistance, please don't hesitate to contact us at support@treasure-data.com.

Report: "[Tokyo Region] Report about missing Premium Audit Log events"

Last update
resolved

On July 30, 2023 from 01:09 to 15:35 JST, we detected that one of our instances in the Tokyo region had connectivity issues that caused some premium audit log events to not be delivered to customer accounts. We have already identified and mitigated the issue; however, the missing logs could not be identified and recovered. Events for Treasure Data CDP operation itself were not affected. We sincerely apologize for any inconvenience this may have caused. If you have any questions, please contact support@treasure-data.com.

Report: "[US, Tokyo, EU and Korea Regions] Treasure Insights outage"

Last update
resolved

Between July 6, 2023 04:24 UTC to July 6, 2023 06:07 UTC, all customers experienced an outage related to Treasure Insights. The root cause was a network misconfiguration. A fix has been implemented and the issue has been resolved. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied remediation and Treasure Insights is now operational. We will keep monitoring for issues.

investigating

Since the 6th of July at 4:24 UTC, we observed an issue in Treasure Insights. Our engineering team is investigating the cause. We will post further updates here. If you have any questions or concerns about this, please feel free to reach out to our Support team at support@treasuredata.com.

Report: "[Tokyo Region] Treasure Insights - Datamodel creation and build"

Last update
resolved

The incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

investigating

As of 4:18pm JST on 16 May, we observe an issue in creating and building Treasure Insights datamodels. Our engineering team is investigating the cause.

Report: "[EU Region] Treasure Workflow - Outage on Custom Script"

Last update
resolved

A change in upstream provider updates was no longer compatible with our configuration. The Custom Script service in Treasure Workflow was failing to launch from 2023-04-17 21:52:37 UTC to 2023-04-17 23:13:55 UTC. We fixed the issue after working closely with the upstream provider. This incident has been resolved.

Report: "[US/Tokyo/Korea/AP03] Treasure Workflow - Partial Outage on Custom Script"

Last update
resolved

A change in upstream provider updates was no longer compatible with our configuration. We fixed the issue after working closely with the upstream provider. This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

The issue has been identified and a fix is being implemented.

investigating

We identified that we are failing to call a resource from the upstream provider, and we are working with the upstream provider to resolve this issue. We are actively working on it.

investigating

We are still investigating this issue with upstream provider.

investigating

We have observed an issue where custom script executions fail or are delayed. We are currently investigating this issue.

Report: "[US, EU01, Tokyo, AP03] Streaming data ingested between 2023-03-13 21:00 and 22:00 (UTC) were not accessible"

Last update
postmortem

# Summary
There was an incident from 2023-03-13 22:30 to 2023-03-15 20:30 (UTC) in which streaming data ingested between 2023-03-13 21:00 and 22:00 (UTC) was not accessible. As a result, Presto and Hive queries and export jobs that ran during that time frame did not include the streaming data. A bug in a job scheduler caused one defragmentation job to run on the wrong cluster, which ran the job with an outdated codebase. We plan to implement remediations based on the root cause analysis.
# Impact to customers
Data ingested between 2023-03-13 21:00 and 22:00 (UTC) through stream ingestion (e.g. data from td-agent, td-js-sdk, mobile-sdk, ADL, Postback API) into the hourly partitioned data on the archive storage was not accessible to queries and other jobs during the time frame 2023-03-13 22:30 to 2023-03-15 20:30 (UTC). Due to this, Presto and Hive queries and export jobs executed during that time frame (2023-03-13 22:30 to 2023-03-15 20:30 UTC) did not include the streaming data from the ingestion time (2023-03-13 21:00 to 22:00 UTC).
# What happened
In Treasure Data, data from stream ingestion (e.g., data from td-agent, td-js-sdk, mobile-sdk, ADL, Postback API) is stored in a landing area first so that Presto, Hive queries, and export jobs can access it quickly. The data in the landing area is fragmented by nature, so Treasure Data has an internal mechanism that runs a job to defragment the data in the landing area. At 2023-03-13 22:05, our Engineering team rolled out a scheduler update for the defragmentation job in a limited part of our regions. However, the scheduler update had a bug that routed the defragmentation job to the wrong cluster, so the job ran with an outdated codebase. Forty minutes later, the team detected the wrong routing via a monitor and reverted the scheduler update. The team investigated the impact of the jobs executed with the outdated codebase and confirmed that the jobs ran and finished without errors. However, after a detailed investigation 2 days later, the team found that those jobs had silently failed to register storage metadata after completing the defragmentation. As a result, despite the storage data being there, Presto, Hive queries, and export jobs could not access it for 2 days due to the lack of metadata. At 2023-03-15 07:00 (UTC), the team identified the missing metadata based on further investigation after receiving a customer inquiry. We completed the metadata recovery at 20:30 after operationalizing the recovery process in staging environments.
# Details and Remediations
The root cause was a scheduler update bug that routed one defragmentation job to the wrong cluster, resulting in the job running with an outdated codebase. However, the cause analysis showed we need to address the following issues:
* The outdated codebase should not have been able to handle the job. We did not delete the codebase as part of a migration process over a year ago. We will disable the codebase before any update so it cannot take the job again.
* The scheduler should have a guard to prevent unexpected routing, even if it has a bug. We are in the process of replacing job routing with a centralized routing mechanism, and the scheduler update was part of that process. The centralized routing mechanism has a guard and monitoring of the routing; however, the scheduler we updated this time was the last service not yet using it. As a temporary measure until that process is complete, we will implement a monitor to detect wrong routing quickly.
* We should have detected the missing metadata in the QA phase. We do have a data consistency test framework, but we did not include it in the QA scope this time because the update was to the scheduler, not the core defragmentation logic. We plan to make it an automated framework and incorporate its status as a check for anything related to data defragmentation work.
We regret that this incident prevented you from fully leveraging the functionalities of the system, and in particular the query subsystems. Please feel free to reach out to our support team through [support@treasuredata.com](mailto:support@treasuredata.com) if you have any questions.

resolved

We identified at 2023-03-15 10am (UTC) that there was an incident in which streaming data ingested between 2023-03-13 21:00 and 22:00 (UTC) was not accessible during the time frame 2023-03-13 22:30 to 2023-03-15 20:30 (UTC). As a result, Presto and Hive queries and export jobs that ran during that time frame (Mar 13 22:30 to Mar 15 20:30 UTC) did not include the streaming data from the ingestion time (Mar 13 21:00 to 22:00 UTC).
= Overview
Our internal data merge job system merges fragmented data on the realtime storage that manages streaming import data (e.g. data from td-agent, td-js-sdk, mobile-sdk, ADL, Postback API) into the hourly partitioned data on the archive storage. Due to an issue in this system, the scheduled task that handled data between Mar 13 21:00 and 22:00 UTC did not copy the streaming data to the archive storage. Consequently, the data was not visible between 2023-03-13 22:30 and 2023-03-15 20:30 (UTC). Data ingested by Embulk, Bulk Import, Data Connector, INSERT INTO, or CREATE TABLE AS was not affected. We identified the issue at 2023-03-15 10am (UTC) and completed the recovery to make all data visible by 2023-03-15 20:30 (UTC). Presto and Hive queries and export jobs see the expected data after Mar 15 20:30 UTC.
= What you should do for recovery
For the Presto and Hive queries and export jobs you ran against data ingested between Mar 13 21:00 and 22:00 UTC, please re-run the jobs and confirm the results if necessary.
= What’s next
We apologize for any inconvenience this has caused. After the root cause analysis and further remediation planning, we will publish a detailed postmortem. In the meantime, if you have any questions, please don’t hesitate to contact support@treasure-data.com.

Report: "[US Region] Web Interface - Partial Outage"

Last update
resolved

Between Mar. 16, 2023 05:38 PDT and Mar. 16, 2023 05:50 PDT, all customers experienced access issues related to the Web Interface outage. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied remediation and the web console is now operational. We will keep monitoring for issues.

investigating

We detected that console access has been partially unavailable since Mar. 16 05:38 PDT.

Report: "[Tokyo region] Web Interface - Partial Outage"

Last update
resolved

Between Mar 9, 2023 10:16 JST and Mar 9, 2023 10:28 JST, all customers experienced access issues related to the Web Interface outage. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied remediation and the web console is now operational. We will keep monitoring for issues.

identified

We are investigating a possible problem currently affecting web console. We will provide an update as soon as we know more.

investigating

We detected that console access has been partially unavailable since Mar 9, 10:16 JST.

Report: "[Tokyo region] Web Interface - Partial Outage"

Last update
resolved

Between Mar 8, 2023 14:21 JST and Mar 8, 2023 14:34 JST, all customers experienced access issues related to the Web Interface outage. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We have applied remediation and the web console is now operational. We will keep monitoring for issues.

identified

We are investigating a possible problem currently affecting web console. We will provide an update as soon as we know more.

investigating

We detected that console access has been partially unavailable since Mar 8, 14:21 JST.

Report: "[Tokyo region] Web Interface - Partial Outage"

Last update
resolved

We already resolved the issue. We apologize for any inconvenience caused. If you have any questions about it, please contact support@treasure-data.com

monitoring

We are investigating a possible problem affecting console access that occurred March 6, 2023 from 22:24 to 22:46 JST (13:24 - 13:46 UTC). We already applied remediation and the issue should be resolved. We are still monitoring the issue.

Report: "[EU01] API error rate increases"

Last update
resolved

We haven't seen any recurrence of the symptoms that caused the issue. The incident has been resolved. Between 8:20am UTC and 8:58am UTC, API responses may have intermittently returned 504 errors. This high error rate meant that associated services, such as the Web Console using the REST API (api.eu01.treasuredata.com), may also have experienced some degradation. We are sorry for the trouble this caused.

monitoring

Between 8:20am UTC and 8:58am UTC, we observed API response errors, with 504 errors returned intermittently. We have already applied a fix and are monitoring the result.

identified

We observed the API error rate decreased. In addition, we are working on implementing a mitigation for the issue.

investigating

We are currently observing intermittent API response errors. We are investigating the issue.

Report: "[ap03] Result output / Data Connector job outbound access issue"

Last update
resolved

The incident has been resolved.

monitoring

The following components in AP03 may have failed during 10:30-16:30 if the export/import 3rd-party system is configured with an IP whitelist:
- DataConnector
- ResultOutput
During this period, jobs of these services were assigned an unintended IP address. As a result, the 3rd-party system might have denied access to these jobs depending on its IP whitelist setting. We've fixed the routing setting and it now works correctly. If your jobs failed during this period, please re-run your workflows/jobs.

monitoring

A fix has been implemented and we are monitoring the result.

identified

We acknowledged a failure of Data Connector jobs due to an outbound network configuration issue affecting Data Connector jobs in the Private Connect environment. We are working on resolving the issue.