Historical record of incidents for Zonos
Report: "2025-06-10 Issue with BigCommerce Plugin"
Last update: We are continuing to work on a fix for this issue.
The issue has been identified and we are continuing to monitor the impact.
An issue with an upstream infrastructure provider is impacting our BigCommerce Plugin
Report: "2025-01-06 Magento Checkout Issue"
Last update:

**What products were affected and what was the impact?**

* All Magento Checkout merchants

Impact:

* Partial Outage

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Jan 6, 2025 | 2:27 PM MST |
| To: | Jan 7, 2025 | 09:11 AM MST |

### How was the issue detected?

Our team was notified that carts were not being created for some Magento shoppers, leading to an immediate investigation by the engineering team.

### What functionality was affected?

The checkout process for Magento merchants encountered validation failures due to incorrect UUID handling. This disrupted the checkout flow, impacting merchants using Magento Checkout.

### What problems did this cause?

* Magento merchants intermittently experienced disruptions when loading checkout.

### What was the resolution of the problem and steps that are being taken for continued follow-up?

* The issue was identified as stemming from changes made to TSID validation.
* A rollback of the problematic changes was performed promptly, restoring checkout functionality.
* A synthetic test was created to validate UUID behavior and detect future occurrences of this issue.
* Code annotations have been added to document known issues and improve future debugging.

### What mitigation solutions will we put in place to prevent this issue from occurring in the future?

* Establishing alerts that detect deviations in the ratio of carts to checkout sessions. This will help flag potential checkout issues early.
* Increasing test coverage to identify gaps in synthetic testing.
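As an illustration only, the carts-to-checkout alert described above could be approximated by comparing the observed ratio in a time window against a baseline. The function name, the baseline value, and the 20% tolerance below are assumptions made for this sketch and do not come from the incident report.

```python
def carts_to_checkout_deviation(
    carts_created: int,
    checkout_sessions: int,
    baseline_ratio: float,
    tolerance: float = 0.2,
) -> bool:
    """Return True when carts per checkout session drifts from the baseline.

    `baseline_ratio` and `tolerance` are hypothetical tuning values; the
    tolerance is relative (0.2 means a 20% deviation from the baseline).
    """
    if checkout_sessions == 0:
        # No traffic in this window, so there is nothing to compare.
        return False
    observed = carts_created / checkout_sessions
    return abs(observed - baseline_ratio) > tolerance * baseline_ratio
```

A monitor built on this idea would evaluate the function per time window and page the on-call engineer when it returns `True`.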
This incident has been resolved.
We are currently investigating this issue.
Report: "2025-01-24 Issue loading Checkout"
Last update:

**What products were affected and what was the impact?**

* Checkout (Custom Integrations)

Impact:

* Major outage for affected merchants

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Jan 24, 2025 | 12:00 PM MST |
| To: | Jan 27, 2025 | 07:45 AM MST |

**How was the issue detected?**

Our team was notified that Checkout domains were not being allowed as expected, leading to an immediate investigation by the engineering team.

**What functionality was affected?**

Online store settings (allowed store domain) were not loading properly for all stores using custom integrations. This prevented affected stores from loading checkout.

**What problems did this cause?**

Stores relying on custom integrations could not load their checkout correctly.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

* A rollback to a previous version immediately resolved the issue.
* A new synthetic test has been implemented to ensure store settings caching is store-specific and does not mix settings between stores.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

* **Improved Testing:** A new synthetic test has been introduced to verify caching behavior and prevent incorrect store settings from being cached.
* **Staging Environment:** Setting up a dedicated test environment for custom integrations to catch similar issues earlier.
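The store-specific caching check mentioned above might look roughly like the following sketch. It assumes settings are cached under a key that includes the store ID; `StoreSettingsCache`, `settings_are_isolated`, and the `allowed_domain` field are hypothetical names chosen for illustration, not the actual implementation.

```python
from typing import Callable, Dict

Settings = Dict[str, str]


class StoreSettingsCache:
    """Caches settings per store so one store never sees another's entry."""

    def __init__(self, load_settings: Callable[[str], Settings]) -> None:
        self._load_settings = load_settings
        self._by_store: Dict[str, Settings] = {}

    def get(self, store_id: str) -> Settings:
        # The store ID is the cache key; a shared key would mix stores.
        if store_id not in self._by_store:
            self._by_store[store_id] = self._load_settings(store_id)
        return self._by_store[store_id]


def settings_are_isolated(cache: StoreSettingsCache, expected: Dict[str, str]) -> bool:
    """Synthetic check: each store must get its own allowed domain.

    `expected` maps store ID to allowed domain. Every store is read twice
    so that both the cold and the warm cache paths are exercised.
    """
    for _ in range(2):
        for store_id, domain in expected.items():
            if cache.get(store_id).get("allowed_domain") != domain:
                return False
    return True
```

Reading each store twice matters here because a cache-mixing bug only shows up once an entry written for one store is served to another.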
This incident has been resolved.
We are currently investigating this issue.
Report: "2025-01-24 - Shopify tracking number and order import issue"
Last update:

### What products were affected and what was the impact?

Shopify Checkout

Partial outage:

* Affected some Shopify merchants

### What timeframe did this issue occur?

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Jan 23, 2025 | 05:30 PM MDT |
| To: | Jan 24, 2025 | 03:30 PM MDT |

### How was the issue detected?

* A discrepancy was identified when some orders in Shopify were missing tracking numbers.
* Additionally, merchants using older versions of the Shopify app experienced order import failures.

### What functionality was affected?

* Order fulfillment

### What problems did this cause?

* **Order Processing Issues:** Orders processed without tracking numbers caused confusion and delays for merchants and customers.
* **Order Import Failures:** Some merchants experienced missing orders due to import failures.

### What was the resolution of the problem and steps that are being taken for continued follow-up?

* We patched the root cause of the issue, resolving the tracking number and order import failures.
* We updated all the affected orders with tracking numbers and manually imported orders that had failed to import.
* API headers and logging enhancements were implemented to improve issue detection and response time.

### What mitigation solutions will we put in place to prevent this issue from occurring in the future?

* Expand QA coverage to address downstream impacts of refactors, with a focus on comprehensive testing of critical flows, including label creation.
* Expand synthetic test suite to validate end-to-end order processing, with particular emphasis on fulfillment workflows.
* Establish a consistent testing environment that simulates real-world Shopify store configurations, reducing the risk of silent failures.
This incident has been resolved.
We are currently investigating this issue.
Report: "2025-01-03 Issue creating orders for Checkout"
Last update:

**What products were affected and what was the impact?**

* Zonos Checkout

Impact:

* Major Outage

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Jan 3, 2025 | 01:30 PM MST |
| To: | Jan 3, 2025 | 07:06 PM MST |

### How was the issue detected?

Our team was notified that payments were being processed, but orders were not being created.

### What functionality was affected?

* Ability to create orders
* Order fulfillment

### What problems did this cause?

* Customer frustration due to failed order creation despite successful payment processing.
* Potential financial reconciliation issues for merchants and customers.

### What was the resolution of the problem and steps that are being taken for continued follow-up?

* The root cause was identified as a backend change, which introduced a bug related to the separation of payment intent and checkout session.
* A rollback was implemented to restore functionality, and error logging for Stripe intent failures has now been enabled.

### What mitigation solutions will we put in place to prevent this issue from occurring in the future?

* Synthetic tests have been updated to run every 10 minutes to narrow down problematic PRs and detect failures earlier.
This incident has been resolved.
We are experiencing an issue with orders not being created despite payments being processed successfully.
Report: "2024-11-12 Issue with Exchange Rate"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are currently investigating elevated error rates.
Report: "2024-11-08 - Landed Cost Quote Failures"
Last update:

### What products were affected and what was the impact?

Landed Cost API, Checkout

Impact: CRITICAL

### What timeframe did this issue occur?

| **Date** | **Time** |
| --- | --- |
| November 8, 2024 | 8:50am - 10:26am MST |

### How was the issue detected?

A spike in error logs triggered an alert to our Engineering team, who responded immediately to the issue.

### What functionality was affected?

Landed Cost quotes that use our automated item classification service failed.

### What problems did this cause?

If an HS Code was not provided in the API request to Landed Cost, and the automatic classification service was enabled, then the landed cost quote would fail. When the landed cost quote fails, shoppers may not be able to place their order.

### What was the resolution of the problem and steps that are being taken for continued follow-up?

The root cause of the issue was a deployment issue with the item service used for automatic classification. While the issue was detected immediately, resolution required rebuilding and redeploying services, which took longer than expected. After services were rebuilt and redeployed, the system health was validated and normal operations resumed.

### What mitigation solutions will we put in place to prevent this issue from occurring in the future?

We discovered that this issue was due, in part, to a deficiency in our deployment procedures. We are working to update the procedure to prevent any future issues. We are also creating a synthetic test in our lower environments that will catch similar issues before they can be deployed into production.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "2024-10-12 - Permission Issue with Shopify Landed Cost API"
Last update:

**What products were affected and what was the impact?**

On **Saturday, October 12, 2024, at 22:30 UTC**, we experienced an outage that impacted the ability of some Shopify merchants to retrieve landed cost quotes for international orders. This outage lasted until **Sunday, October 13, 2024, at 21:33 UTC**, during which time affected merchants may have been unable to display accurate landed cost calculations to their customers, potentially affecting international checkouts and purchase completions.

The root cause of the outage was an API permission issue that was introduced during a routine deployment. Unfortunately, the synthetic tests for the Landed Cost API flow didn’t immediately flag the issue, which caused an extremely uncharacteristic delay in the resolution.

It's important to note that this issue was caused by a unique set of circumstances. We have just finished an extremely complex and expansive migration project to move all of our merchants to a robust and secure new system. This issue was directly caused by processes related to this now-complete migration.

Impact: critical

**What timeframe did this issue occur?**

| **Date** | **Time** |
| --- | --- |
| October 12, 2024 | 22:30 UTC |
| October 13, 2024 | 21:33 UTC |

**How was the issue detected?**

The engineering team noticed a warning of failed quotes in the log files.

**What functionality was affected?**

All landed cost requests during the outage failed for affected merchants.

**What problems did this cause?**

Shoppers of affected merchants were unable to get landed cost quotes and, in many cases, complete their checkout.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

Once the issue was identified, our team acted swiftly to correct the permission issue. After thorough testing, we confirmed the API was fully restored at **21:33 UTC** on Sunday.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

* **Expanded test coverage**: We are enhancing our synthetic tests to include additional token and permission checks, ensuring that potential issues like this are caught earlier.
* **Improved log classifications**: We’re updating our logging processes to better distinguish between warnings and errors, which will enable faster detection and resolution of similar problems in the future.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Increased error rate on Dashboard authentication"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "2024-09-30 Elevated error rates on landed cost"
Last update:

### **What products were affected and what was the impact?**

* Landed Cost API

### Impact:

* DEGRADED SERVICE

### **What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Sep 30, 2024 | 16:44 MST |
| To: | Sep 30, 2024 | 18:12 MST |

### **How was the issue detected?**

Synthetic test failures alerted our team.

### **What functionality was affected?**

Increased latency on landed cost quotes, plus a period where landed cost quotes failed.

### **What problems did this cause?**

Landed cost quotes were slow to return and/or failed to return.

### **What was the resolution of the problem and steps that are being taken for continued follow-up?**

The problem was caused by a very large message added to our message queue that could not be consumed due to insufficient resources. The immediate resolution was to increase maximum allocated memory to allow the message queue to clear.

### **What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

To prevent this from happening again, we have decreased the maximum message size allowed by both the producers and consumers. We are also evaluating techniques to make the message queue more robust to this type of failure.
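A size cap enforced on both sides of a queue can be sketched as follows. This is a minimal illustration that assumes JSON-encoded messages; the 256 KiB limit and the function names are hypothetical and are not taken from the incident report.

```python
import json

# Hypothetical limit; the real value used in production is not stated.
MAX_MESSAGE_BYTES = 256 * 1024


class MessageTooLargeError(ValueError):
    """Raised when a message exceeds the configured size limit."""


def encode_message(payload: dict, max_bytes: int = MAX_MESSAGE_BYTES) -> bytes:
    """Producer side: refuse to enqueue a message that is too large."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > max_bytes:
        raise MessageTooLargeError(
            f"message is {len(body)} bytes; limit is {max_bytes}"
        )
    return body


def decode_message(body: bytes, max_bytes: int = MAX_MESSAGE_BYTES) -> dict:
    """Consumer side: reject oversized messages before parsing them."""
    if len(body) > max_bytes:
        raise MessageTooLargeError(
            f"message is {len(body)} bytes; limit is {max_bytes}"
        )
    return json.loads(body.decode("utf-8"))
```

Checking on the consumer as well as the producer means an oversized message from an older or misconfigured producer is rejected cheaply instead of exhausting memory during parsing.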
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "2024-08-30 Issue with Shopify Duty Tax Plugin"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "2024-08-07 Upstream provider issue affecting Zonos Dashboard"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
We are continuing to monitor for any further issues.
We are continuing to monitor for any further issues.
There continue to be intermittent issues. The provider is investigating and we will continue to monitor.
The issue appears to be resolved but we are monitoring for further issues.
The issue has been identified as an outage with an upstream provider.
We are currently investigating this issue.
Report: "2024-08-05 Investigating database issues"
Last update:

**What products were affected and what was the impact?**

* All non-legacy products

Impact: MAJOR OUTAGE

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Aug 5, 2024 | 14:31 MST |
| To: | Aug 5, 2024 | 14:48 MST |

**How was the issue detected?**

Synthetic tests began failing, which notified our DevOps team.

**What functionality was affected?**

Queries to the database were degraded and eventually became unsuccessful.

**What problems did this cause?**

All non-legacy services experienced degraded performance and a brief major outage where all database queries failed.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

The incident was caused by storage exhaustion on one of our production database clusters. We normally have autoscaling and alerting configured on our database clusters; however, in this case the cluster was created as part of a migration, and the proper alerting was not configured. The incident was resolved by allocating additional storage capacity to the affected database cluster.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

To prevent similar incidents in the future, we are conducting a thorough audit of our infrastructure inventory. This includes reviewing system health and monitoring configurations to ensure proper alerting and maximum visibility into critical infrastructure metrics.
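An inventory audit of the kind described above could, in spirit, be automated along these lines. The record fields, the 80% usage threshold, and the function name are assumptions made for this sketch; the actual inventory format is not described in the report.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ClusterRecord:
    """One database cluster in a hypothetical infrastructure inventory."""
    name: str
    storage_alerting: bool
    storage_autoscaling: bool
    used_gb: float
    allocated_gb: float


def audit_clusters(
    clusters: Iterable[ClusterRecord], max_used_fraction: float = 0.8
) -> Dict[str, List[str]]:
    """Return the problems found per cluster; healthy clusters are omitted."""
    findings: Dict[str, List[str]] = {}
    for cluster in clusters:
        problems: List[str] = []
        if not cluster.storage_alerting:
            problems.append("no storage alert configured")
        if not cluster.storage_autoscaling:
            problems.append("storage autoscaling disabled")
        if (
            cluster.allocated_gb > 0
            and cluster.used_gb / cluster.allocated_gb >= max_used_fraction
        ):
            problems.append("storage usage above threshold")
        if problems:
            findings[cluster.name] = problems
    return findings
```

Run on a schedule, a check like this would flag a cluster created during a migration without the usual alerting before its storage fills up.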
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating database issues.
Report: "2024-03-21 Investigating an issue with item catalog"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "2024-03-12 Issue with Dashboard"
Last update: This incident has been resolved.
We are monitoring closely for any further issues.
A third-party upstream provider issue caused a disruption of service.
Service has been restored and we are continuing to investigate the cause of the issue.
We are currently investigating this issue.
Report: "2024-01-10 Issue with Shipping Quote Service"
Last update:

**What products were affected and what was the impact?**

* Landed Cost

Impact:

* DEGRADED PERFORMANCE

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Jan 10, 2024 | 07:32 MST |
| To: | Jan 10, 2024 | 07:54 MST |

**How was the issue detected?**

We saw an error rate spike within one of our APIs, indicating there was an issue with an upstream 3rd party service.

**What functionality was affected?**

We were unable to generate some landed cost quotes.

**What problems did this cause?**

The issue caused requests to time out in one of our APIs. These timed-out requests exhausted the database connection pool until the service was cycled and database connections became available.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

We cycled our affected service, releasing database connections. We’re looking at ways to mitigate dependency on this particular 3rd party API to ensure smooth operation of our APIs in the event of a future outage.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

More resilient handling for upstream 500 errors from 3rd party APIs.
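One common way to make a service more resilient to a failing upstream is a circuit breaker, which fails fast after repeated errors so that slow calls stop holding resources such as database connections. The report does not say which technique was chosen; the sketch below is only one possible approach, and the class name, threshold, and cool-down values are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when calls are rejected without contacting the upstream."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("upstream circuit is open; failing fast")
            # Cool-down elapsed: allow one trial call. A single further
            # failure reopens the circuit immediately.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

In use, the upstream request (made with its own short timeout) would be wrapped in `breaker.call(...)`, and a `CircuitOpenError` would be translated into a fast, explicit failure for the caller.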
An issue with an upstream third-party API caused timeouts and exhausted database connections. This caused slow responses and some failures of Landed Cost quotes until the service cycled and database connections became available.
Report: "2024-02-13 Shipping Quote Service Partial Outage"
Last update: There was a partial outage of the shipping quote service from 10:49 AM MST to 11:56 AM MST.
Report: "2023-12-13 Issue with shipping quote service"
Last update:

**What products were affected and what was the impact?**

* Checkout
* Landed Cost (Legacy)
* Landed Cost API

Impact:

* Checkout MAJOR OUTAGE
* Landed Cost (Legacy) MAJOR OUTAGE
* Landed Cost API SERVICE DEGRADATION

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | December 13, 2023 | 3:02 MST |
| To: | December 13, 2023 | 5:35 MST |

**How was the issue detected?**

There was increased database load and request timeouts. This was detected by the monitoring system and the team was notified.

**What functionality was affected?**

Shipping Quotes for the checkout process and Landed Cost (Legacy) API were directly affected. Landed Cost API was indirectly affected by the increased load on the database server.

**What problems did this cause?**

In the process of removing invalid data from the database, a database table index became corrupted, causing increased latency and load on the database server. The affected table provided flat rate shipping rates to the Landed Cost (Legacy) service, which the checkout process uses to calculate shipping rates for international orders. This caused the Landed Cost (Legacy) and checkout processes to fail when attempting to calculate shipping. It also caused Landed Cost API to fail intermittently with timeouts because of the increased load on the database.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

A patch was deployed to disable the flat rate charts, which allowed partial recovery and made it possible to correct the affected database table. The affected database table was restored and the flat rate charts were re-enabled.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

This issue occurred in our legacy database system. The data removal was done to improve query performance, and the operation should have been safe. Though index corruption is a very rare and unexpected outcome, we are doing two things to mitigate this risk and prevent future failure cases:

1. Migrating legacy services to a more robust database technology.
2. Creating an additional review policy for potentially destructive database operations.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "2023-09-19 Issue with Landed Cost"
Last update:

**What products were affected and what was the impact?**

Our Landed Cost APIs (GraphQL) were impacted.

Impact: MAJOR OUTAGE

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | Sep 19, 2023 | 15:30 MST |
| To: | Sep 19, 2023 | 15:44 MST |

**How was the issue detected?**

Error rates triggered alarm monitors.

**What functionality was affected?**

Landed cost quotes for some stores could not be returned.

**What problems did this cause?**

Some merchants were unable to retrieve landed cost quotes.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

We discovered a configuration error in a previous release. Due to caching, scheduled synthetic test failures in the development environment lagged behind the deployment to the production environment. We fixed the configuration error, released the fix, and confirmed issue resolution via manual and automated testing.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

We will expire caches more quickly in the development environment for synthetic tests to identify potential problems prior to deployment to the production environment.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "2023-08-22 Issue with Duty-Tax Service"
Last update:

**What products were affected and what was the impact?**

Our Landed Cost APIs (REST and GraphQL) were impacted. During the outage, all API requests were unauthorized.

Impact: MAJOR OUTAGE

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | August 15, 2023 | 10:40 MST |
| To: | August 15, 2023 | 11:02 MST |

**How was the issue detected?**

We were immediately alerted by our monitoring system.

**What functionality was affected?**

No landed cost quotes could be returned due to authentication failures.

**What problems did this cause?**

The service was configured with the wrong authentication URL: the service discovery URL was used instead of the public URL.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

We discovered a configuration error in a previous release. We fixed the configuration error, released the fix, and confirmed issue resolution via manual and automated testing.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

We are improving documentation on environment configuration, and clarifying how to properly configure service discovery. We are also improving testing procedures to catch similar issues before they are introduced into production.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "2023-08-15 Issue with International Checkout"
Last update:

**What products were affected and what was the impact?**

International Checkout

Impact: PARTIAL OUTAGE

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | August 15, 2023 | 11:20 MST |
| To: | August 15, 2023 | 11:31 MST |

**How was the issue detected?**

Synthetic tests and elevated error rates triggered alerts.

**What functionality was affected?**

About 40% of shipment rating requests from Checkout failed.

**What problems did this cause?**

Our internal shipment rating service saw failures that caused some checkouts to be slow, and others to fail. We also saw minimal failures in our catalog service.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

We immediately reverted the change to our catalog service that caused the issue. We are investigating how to fix the issue for the next release.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

We are exploring opportunities to add more caching to the catalog service, in addition to making architecture and infrastructure improvements.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Investigating issues with Shopify Duty Tax"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "2023-06-13 Outage - Dashboard"
Last update:

**What products were affected and what was the impact?**

Zonos Dashboard

Impact: CRITICAL

**What timeframe did this issue occur?**

| **Date** | **Time** |
| --- | --- |
| Jun 13, 2023 | 12:54 to 13:46 MDT |

**How was the issue detected?**

Internal reports of authorization failures and Dashboard becoming inaccessible.

**What functionality was affected?**

Zonos Dashboard was not accessible.

**What problems did this cause?**

Users were unable to access Dashboard to complete tasks.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

The issue was identified as an AWS operational issue in the US-EAST-1 Region impacting an upstream service provider hosting our front-end services for Dashboard. We were able to redeploy those services to an unaffected region to restore functionality.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

We are continually assessing and improving business continuity solutions throughout every layer of our tech stack to minimize downtime and automate recovery where possible.
This incident has been resolved.
A fix has been implemented and we are monitoring the results.
An issue with upstream Lambda creation and execution has been identified, and we are waiting on a fix to be rolled out while investigating other mitigation strategies. For more information, see the AWS status at https://health.aws.amazon.com/health/status.
We are continuing to investigate this issue.
We are currently investigating reports of a potential service interruption with Dashboard. We apologize for any inconvenience and will post another update as soon as we learn more.
Report: "2023-05-20 Partial Outage - Shopify Duty Tax"
Last update:

**What products were affected and what was the impact?**

Shopify Duty & Tax

Impact: CRITICAL

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | May 20th, 2023 | 09:30 MST |
| To: | May 20th, 2023 | 10:30 MST |

**How was the issue detected?**

At 9:31 AM MST, we began receiving alerts of increased latency with some Shopify Duty & Tax quote requests.

**What functionality was affected?**

Shopify Duty & Tax quotes for some customers.

**What problems did this cause?**

From approximately 9:30 am to 10:30 am MST, latency was sufficiently elevated to cause some quote requests from our Shopify Duty & Tax plugin to fail.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

We identified that the issue was caused by a scheduled maintenance job that obtained a lock on the database. The database lock significantly increased latency.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

The database lock was not necessary, and the locking code has been removed. This will prevent this issue from happening again in the future. Also, we are educating our engineers on the proper usage of database locking strategies, and implementing protection measures during the code review process. Additionally, we are improving our monitoring and on-call coverage to ensure faster response times to issues that impact shoppers.
From approximately 9:30 am to 10:30 am MST, latency was sufficiently elevated to cause some quote requests from our Shopify Duty & Tax plugin to fail.
Report: "2023-05-03 Issue with Quoter"
Last update:

**What products were affected and what was the impact?**

Dashboard Quoter

Impact: CRITICAL

**What timeframe did this issue occur?**

| | **Date** | **Time** |
| --- | --- | --- |
| From: | May 2nd, 2023 | 15:30 MST |
| To: | May 3rd, 2023 | 08:35 MST |

**How was the issue detected?**

A developer was using [dashboard.zonos.com](http://dashboard.zonos.com/) to make a quote and discovered it was broken.

**What functionality was affected?**

100% of Quoter requests failed.

**What problems did this cause?**

Customers using Quoter were unable to get quotes.

**What was the resolution of the problem and steps that are being taken for continued follow-up?**

It was discovered that a deployment of Zonos Dashboard had a missing environment variable. The missing environment variable was added to the deployment, and Quoter functionality was restored. This was validated both via server logs and manual testing.

**What mitigation solutions will we put in place to prevent this issue from occurring in the future?**

* We have improved our testing procedures to catch similar deployment issues in QA before the deployment reaches production.
* We will no longer allow a build if an environment variable is missing.
* We have modified our release schedule to allow for greater support coverage around releases.
* We are working to improve alerting for issues related to Dashboard functionality and deployments.
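A build gate for missing environment variables, as described in the mitigation list above, can be sketched as a small pre-build check. The variable names below are hypothetical placeholders; the actual variables required by Dashboard are not named in the report.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical variable names, used only for illustration.
REQUIRED_ENV_VARS = ("QUOTER_API_URL", "AUTH_CLIENT_ID")


def missing_env_vars(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return required variables that are unset or set to an empty string."""
    return [name for name in required if not env.get(name)]


def assert_env_complete(
    required: Iterable[str], env: Mapping[str, str] = os.environ
) -> None:
    """Abort the build with a non-zero exit when any variable is missing."""
    missing = missing_env_vars(required, env)
    if missing:
        raise SystemExit(
            "Build aborted; missing environment variables: " + ", ".join(missing)
        )
```

Calling `assert_env_complete(REQUIRED_ENV_VARS)` as the first build step makes the pipeline fail before any artifact is produced, rather than letting the gap surface in production.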
This incident has been resolved.
We are continuing to monitor for any further issues.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We are experiencing a service outage with Quoter. Our team is currently working to restore service. We apologize for any inconvenience. All users may be affected. We will provide an additional update within an hour.
Report: "Partial Plugins - Shopify Duty Tax Outage"
Last update**What products were affected and what was the impact?** All Zonos GraphQL services. Impact: CRITICAL **What timeframe did this issue occur?** | **Date** | **Time** | | --- | --- | | Mar 31, 2023 | Starting at 18:00 MDT | | Apr 1, 2023 | Ending at 12:45 MDT | **How was the issue detected?** On the morning of April 1, Shopify GraphQL customers began noticing issues with landed cost quotes and notified CS, who then escalated the issue to the Engineering team. **What functionality was affected?** All GraphQL services in the Zonos Cloud were impacted. **What problems did this cause?** Merchants on GraphQL were unable to receive shipment ratings and landed cost quotes. **What was the resolution of the problem and steps that are being taken for continued follow-up?** After being notified of the issue, we worked quickly to switch GraphQL merchants over to our REST endpoints, which were not experiencing any issues. We then identified the root cause of the issue with GraphQL: a code deployment that caused broke event serialization and caused synchronous events to fail. A weakness with synchronous event handling then caused the event failure to cascade to the cluster-level. We immediately released a fix to prevent future occurrences. **What mitigation solutions will we put in place to prevent this issue from occurring in the future?** Our monitoring and notification channels for production server clusters were focused on unhealthy target groups and container failures. Due to the nature of the failure, we didn't receive notifications for either. This is a clear gap in monitoring coverage at a cluster-wide level. 
To make sure this never happens again, we are configuring task-based monitoring outside of the clusters where we will: * query each service in the cluster directly for the minimum number of tasks that should be running and the actual number of tasks that are running, * make mock requests to each service to make sure they are returning correct responses, and * direct these notifications to our alerting platform with "on-call" rotations to make sure there are no lapses in coverage. We have also improved the resiliency of our event system, such that even if there were a future issue with event serialization, it would have no effect upon our public GraphQL services.
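The task-based checks described above could look roughly like the following sketch. It is illustrative only and is not the monitoring Zonos deployed: `ServiceStatus`, `find_alerts`, and the `probe` callback are hypothetical names, and the task counts are assumed to come from whatever API the cluster exposes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ServiceStatus:
    """Snapshot of one service as reported by the cluster (hypothetical shape)."""
    name: str
    desired_tasks: int  # minimum number of tasks that should be running
    running_tasks: int  # number of tasks actually running


def find_alerts(statuses: List[ServiceStatus],
                probe: Callable[[str], bool]) -> List[str]:
    """Return one alert message per failed check.

    `probe` stands in for a mock request to the service and returns True
    when the response is correct.
    """
    alerts = []
    for status in statuses:
        if status.running_tasks < status.desired_tasks:
            alerts.append(
                f"{status.name}: {status.running_tasks}/"
                f"{status.desired_tasks} tasks running"
            )
        if not probe(status.name):
            alerts.append(f"{status.name}: mock request returned an incorrect response")
    return alerts
```

Running this from outside the cluster matters: a check that lives inside a failing cluster can fail silently along with it. The returned messages would then be forwarded to the on-call alerting platform.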
This incident has been resolved.
Investigating issues with quoting on Shopify.
Report: "Partial Landed Cost API Outage"
Last updateThere was a problem with the DNS on a few of the Landed Cost API servers causing a partial outage. The problem has been identified and resolved.
We are currently investigating this issue.
Report: "Partial Landed Cost API Outage"
Last updateThere was a problem with the DNS on a few of the Landed Cost API servers causing a partial outage. The problem has been identified and resolved.
We are currently investigating this issue.
Report: "Shopify CA cert change, breaking Shopify Checkout"
Last update**What products were affected and what was the impact?** Zonos checkout for Shopify plugin. API calls from the Shopify Checkout plugin to the Shopify API were being rejected with SSL connection issues. Impact: **MAJOR** **What timeframe did this issue occur?** | **Date** | **Time** | | --- | --- | | Jan 18, 2023 | `12:23:03 PM MDT` | **How was the issue detected?** Reports of international users unable to proceed past the cart page triggered the initial investigation of the issue. **What functionality was affected?** The incident was caused by an SSL certificate update released by Shopify, which also updated the Intermediate Certificate. Our Zonos checkout plugin did not have the Certificate in its keystore and could not validate the certificate for any calls being made to the Shopify API. **What was the resolution of the problem and steps that are being taken for continued follow-up?** The Node version of the Zonos Checkout plugin was updated to a more current version which contained the Intermediate Certificate. After the plugin was updated, we continued to monitor the incident to validate that the solution corrected the issue.
API calls from the Shopify Checkout plugin to the Shopify API were being rejected with SSL connection issues.
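A synthetic check for this class of failure can be sketched as below. This is an assumption-laden illustration, not the plugin's code: it uses Python's standard library and the system trust store, whereas the affected plugin ran on Node with its own certificate store, so a real check would need to run with the same trust store as the plugin. The function name `check_tls` is hypothetical.

```python
import socket
import ssl
from typing import Optional, Tuple


def check_tls(host: str, port: int = 443,
              timeout: float = 5.0) -> Tuple[bool, Optional[str]]:
    """Attempt a verified TLS handshake and report whether it succeeded."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True, None
    except ssl.SSLCertVerificationError as exc:
        # Raised when the chain cannot be validated, e.g. an untrusted issuer.
        return False, f"certificate verification failed: {exc.verify_message}"
    except (ssl.SSLError, OSError) as exc:
        # Other TLS errors, refused connections, and timeouts.
        return False, f"connection failed: {exc}"
```

Calling `check_tls` on a schedule against each upstream API host, and alerting on a `False` result, would surface a certificate chain problem before shoppers report it.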
Report: "Slow response times with Classify service"
Last update**What products were affected and what was the impact?** Slow responses and timeouts when calling the Classify API. Impact: **MINOR** **What timeframe did this issue occur?** | **Date** | **Time** | | --- | --- | | Jan 23, 2023 | `07:55 AM MDT` | **How was the issue detected?** We received reports from customers that response times from the Classify API were longer than usual. **What functionality was affected?** Speed of responses from the Classify API. **What was the resolution of the problem and steps that are being taken for continued follow-up?** We identified issues with batching large data sets and with the integration of new AI technology, as well as some inefficiencies in our databases. All of these issues have been corrected.
This incident has been resolved.
We are investigating reports of slow responses and timeouts with the Classify service.
Report: "Elevated Error Rate"
Last updateThis incident has been resolved.
A fix has been implemented and we are monitoring the results.
Elevated error rate in ZonosApi on the /v1/landed_cost endpoint.
Report: "Service outage"
Last update**What products were affected and what was the impact?** All Zonos Services Impact: **CRITICAL** **What timeframe did this issue occur?** | **Date** | **Time** | | --- | --- | | Feb. 24, 2022 | 16:39 UTC - 17:29 UTC | **How was the issue detected?** The issue was detected by the entire company noticing services not responding and not performing properly. **What functionality was affected?** All Zonos core products were affected, including Landed Cost APIs, Checkout, Extensions, and Dashboard. **What problems did this cause?** Outages across the board. **What was the resolution of the problem and steps that are being taken for continued follow-up?** The problem seemed to be related to Heroku outages and possibly other DNS issues. We have plans to move more critical entities off of Heroku at this time. **What mitigation solutions will we put in place to prevent this issue from occurring in the future?** Depend less on third-party software. Stop the flow of traffic when we find total outages in the future.
We are still investigating the cause of the issue, but all services appear to be up and running at this time.
A fix has been implemented and we are monitoring the results.
There are a variety of services down worldwide. We are investigating whether this is related.
We are continuing to investigate this issue.
We are experiencing a service outage. Our team is currently working to restore service. We apologize for any inconvenience. Checkout users may be affected. We will provide an additional update ASAP.
Report: "Cloud provider issues impacting label creation"
Last update**What products were affected and what was the impact?** The outage mainly impacted Dashboard label creation and retrieval. Impact: `Major` **What timeframe did this issue occur?** | **Date** | **Time** | | --- | --- | | Dec 8, 2021 | 08:35 - 14:20 MST | **How was the issue detected?** Our team was notified via customer support of a possible issue with shipment label creation. We verified that this was due to our upstream provider experiencing increased API error rates. **What problems did this cause?** Merchants were unable to create shipments for their orders and fulfill them. As it wasn’t a complete outage with the shipment API, some merchants were able to mitigate the issue by re-creating the shipment. **What was the resolution of the problem and steps that are being taken for continued follow-up?** Our team started moving towards use of an alternate cloud storage option but noticed decreased error rates at that time. We analyzed the failed labels and reported those affected to our Customer Success team to notify the affected merchants. **What mitigation solutions will we put in place to prevent this issue from occurring in the future?** Our team is looking into the option of implementing cross-region replication or a possible backup cloud service as a fallback so our services stay online in the event an outage like this occurs again.
Our team has been tracking errors that have occurred during the label creation process and has not identified any errors for more than 30 minutes. Our upstream cloud provider is still working on resolving the underlying issue, but the impact to our services appears to be resolved. We will continue to monitor to ensure that our services are running smoothly.
Our upstream cloud provider has identified the issue and is working on a fix. There is currently no ETA for resolution. We are looking into a temporary solution to improve the issues with the label creation process and will provide regular updates.
We are aware of issues with our upstream cloud provider that may be affecting our services, specifically the label creation process. The issues appear to be intermittent and our cloud provider is working on a fix for the problem. We are continuing to monitor the situation and will provide regular updates.
Report: "Dashboard API Error"
Last updateReviewed log data to verify the restart cleared any blocking instances. The data causing the backlog has been resolved.
The dashboard API was returning high error rates and reporting unhealthy. Restarted the instances affected. Monitored for resolution.
Report: "Shopify App Cache Server Error"
Last updateEngineers were notified at 12:08 AM MST of failed requests on the Shopify app. Responded to alarms. Identified the issue noted previously. Engineers reset the failed cache connections and monitored the resolution of the issue. Engineers analyzed the root cause of the issue and implemented a plan to accommodate fallback from cache failure in the future.
The Shopify app had issues connecting to the back-end cache server. This resulted in higher load and failed requests.
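Falling back from a cache failure generally means treating the cache as optional: if it cannot be reached, the request is served from the source of truth instead of failing. The sketch below shows that pattern under stated assumptions. `DictCache`, `BrokenCache`, `get_with_fallback`, and the `load` callback are hypothetical names for illustration and do not describe the Shopify app's actual code.

```python
from typing import Callable, Dict, Optional


class DictCache:
    """In-memory stand-in for a cache client (hypothetical interface)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class BrokenCache(DictCache):
    """Simulates a cache server that refuses every connection."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionError("cache unreachable")

    def set(self, key: str, value: str) -> None:
        raise ConnectionError("cache unreachable")


def get_with_fallback(cache: DictCache, key: str,
                      load: Callable[[str], str]) -> str:
    """Read through the cache, but never let a cache failure fail the request."""
    try:
        cached = cache.get(key)
    except ConnectionError:
        # Cache is down: serve directly from the source of truth.
        return load(key)
    if cached is not None:
        return cached
    value = load(key)
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # Failing to populate the cache is not fatal.
    return value
```

With this shape, a cache outage raises load on the backing store but requests still succeed. In practice it is usually paired with a short connection timeout so that a slow cache does not stall every request.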
Report: "Quotes not returning"
Last updateThis incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Landed cost not coming back for multiple systems."
Last updateThere was a typo in a field that caused countries to not be recognized by our tax calculation service. This affected multiple systems: Zonos Checkout, Zonos Legacy Landed Cost API, Shopify Duty Tax, Shopify Checkout, BigCommerce Duty Tax, BigCommerce Checkout, Salesforce Checkout, and Salesforce Duty Tax. The bad data in the database was found and corrected.
Landed cost results were not coming back for some of our systems, including Checkout, the legacy Landed Cost APIs, and all shopping cart extensions.
Report: "Order numbers index not populating"
Last updateThis incident has been resolved.
We have restored the dashboard to full functionality by utilizing a backup method of populating the required information.
New order numbers are not showing up in the Zonos Dashboard. You are still able to pull up individual orders, but you cannot look up newer orders.
Report: "Third party database outage."
Last updateThis incident has been resolved. Zonos changed to a backup third-party provider.
We are currently investigating this issue. A database provider of Zonos has unexpectedly dropped service. They have not indicated when service will be restored, and we are working to move to another source now.
Report: "Database Upgrade"
Last updateThis incident has been resolved.
We are finished updating the database.
Report: "Checkout outage"
Last updateTwo queries were run out of order, causing a lock on the Rules engine table. The queries were deleted and some minor maintenance was performed on those tables during the minute it was down.
A data error was located and fixed. All servers are running normally at this time.
Checkout seems to be slow or not allowing quotes.
Report: "Outage of legacy quoting services."
Last updateWe were receiving carts with over a billion items each, arriving at about one per second. Our packaging algorithm was backlogged attempting to pack all the items into boxes. Our algorithm limits have been changed, and fake carts like this won’t limit performance in the future.
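One common way to enforce a limit like this is to reject oversized carts before they reach the packing step. The sketch below is a generic illustration, not the change Zonos made: the limit value, `CartTooLargeError`, and `validate_cart` are all assumptions.

```python
from typing import Iterable

# Hypothetical ceiling; a real limit would be tuned to legitimate order sizes.
MAX_ITEMS_PER_CART = 10_000


class CartTooLargeError(ValueError):
    """Raised when a cart exceeds the item limit and should not be packed."""


def validate_cart(quantities: Iterable[int],
                  max_items: int = MAX_ITEMS_PER_CART) -> int:
    """Return the total item count, or raise before any packing work starts."""
    total = 0
    for quantity in quantities:
        if quantity < 0:
            raise ValueError("item quantity cannot be negative")
        total += quantity
        if total > max_items:
            # Stop as soon as the limit is crossed instead of summing everything.
            raise CartTooLargeError(f"cart exceeds the limit of {max_items} items")
    return total
```

Checking the total up front keeps the cost of rejecting an abusive cart proportional to the number of line items, not to the number of units that would otherwise be packed.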
This incident has been resolved.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "DNS Errors across all Zonos servers."
Last updateThis incident has been resolved.
This issue has resolved itself. It looks like GoDaddy DNS routing was not working, but it is back up and running now.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Stripe Payments API is partially down."
Last updateThis incident has been resolved.
The issue has been identified and we are waiting on a third party for a resolution.
Report: "Heroku partial outage"
Last updateThis caused periodic problems saving new merchant accounts. It also slowed down store lookups.