Is Voyado Down Right Now? Discover if there is an ongoing service outage.

Voyado is currently Operational

Last checked 2025-07-03T01:39:04.400Z from Voyado's official status page

Historical record of incidents for Voyado

Jul 2, 2025

Report: "[Engage] Contact lookup Performance Degradation"

Last update 2025-07-02T14:31:25.657Z

investigating2025-07-02T14:31:25.654Z

We are currently experiencing service disruptions due to an issue with an external provider. This may affect the availability or performance of contact lookups. We are monitoring the situation closely and will provide updates as more information becomes available.

Jun 26, 2025

Report: "[Engage] Tracking Pipeline Degradation"

Last update 2025-06-26T10:45:38.826Z

identified2025-06-26T10:45:38.823Z

We’re seeing that this issue is affecting the entire tracking pipeline, including abandoned cart and products of interest. A fix is currently being deployed and we are monitoring the progress. We sincerely apologize for any inconvenience this may cause.

identified2025-06-26T09:35:27.237Z

There's a current decrease in the amount of Abandoned cart automations being triggered. A fix is being deployed.

Jun 4, 2025

Report: "Engage: performance issues affecting tracking pipeline (abandoned cart/browse) and reports"

Last update 2025-06-04T16:19:00.000Z

Investigating2025-06-04T16:19:00.000Z

We are currently investigating the issue which seems to be related to scaling and could lead to delays in abandoned cart/browse automations to be triggered. Also affecting loading of dashboard/reports in the UI.

Report: "Engage: performance issues affecting tracking pipeline (abandoned cart/browse) and reports"

Last update 2025-06-04T15:55:13.439Z

investigating2025-06-04T15:55:13.436Z

The issue is still under active investigation. We appreciate your patience while we work to resolve it.

investigating2025-06-04T14:19:40.841Z

Jun 3, 2025

Report: "[Engage] System Performance Degradation"

Last update 2025-06-03T18:33:00.000Z

Resolved2025-06-03T18:33:00.000Z

After closely monitoring the situation, we can now confirm that the incident has been resolved. The steps we took to address the issue have held up, and the platform has been stable since.We know this disruption caused real headaches and we’re genuinely sorry for the impact it had. While we’re not in a position to point to an exact root cause just yet, our team is deep into the investigation. Once we have a full picture, we’ll share a detailed post mortem outlining what happened and what we’re doing to make sure it doesn’t happen again.Thanks for bearing with us, and we appreciate your patience and trust.

Monitoring2025-06-03T16:03:00.000Z

The fix has been implemented and we are seeing positive results since a few minutes back. We will continue to monitor the performance to make sure the positive effect of the fix is not temporary.

Identified2025-06-03T15:58:00.000Z

We have found the source of the degradation and are working on a fix.

Update2025-06-03T15:27:00.000Z

We are continuing our investigations. We have not been able to mediate the effects and still see service degradation across most parts of the platform.

Update2025-06-03T15:08:00.000Z

We are continuing to troubleshoot at full force. At the moment we can see service degradations for most customers and across a variety of functionality including the API:s and the user interface in the application.

Investigating2025-06-03T14:51:00.000Z

We are currently investigating indications of general slowness and degraded performance. Users may experience unusually long loading times, this may also affect processing times in automations and sendouts.

Report: "[Engage] System Performance Degradation"

Last update 2025-06-03T16:33:07.665Z

resolved2025-06-03T16:33:07.647Z

After closely monitoring the situation, we can now confirm that the incident has been resolved. The steps we took to address the issue have held up, and the platform has been stable since. We know this disruption caused real headaches and we’re genuinely sorry for the impact it had. While we’re not in a position to point to an exact root cause just yet, our team is deep into the investigation. Once we have a full picture, we’ll share a detailed post mortem outlining what happened and what we’re doing to make sure it doesn’t happen again. Thanks for bearing with us, and we appreciate your patience and trust.

monitoring2025-06-03T14:03:37.554Z

The fix has been implemented and we are seeing positive results since a few minutes back. We will continue to monitor the performance to make sure the positive effect of the fix is not temporary.

identified2025-06-03T13:58:09.616Z

We have found the source of the degradation and are working on a fix.

investigating2025-06-03T13:27:57.132Z

We are continuing our investigations. We have not been able to mediate the effects and still see service degradation across most parts of the platform.

investigating2025-06-03T13:08:10.404Z

investigating2025-06-03T12:51:39.544Z

May 28, 2025

Report: "[Elevate] Email recommendation degraded service"

Last update 2025-05-28T10:57:00.000Z

Postmortem2025-05-28T10:57:00.000Z

Resolved2025-05-27T23:05:00.000Z

Service is back to normal.

Investigating2025-05-27T21:00:00.000Z

We are current seeing degraded service in the Email Recommendation service.

Report: "[Elevate] Email recommendation degraded service"

Last update 2025-05-28T09:39:59.489Z

postmortem2025-05-28T08:47:40.432Z

## **Description and Impact** A recent update to the Email Recommendations service introduced a change intended to simplify configuration and improve caching. However, this inadvertently caused image files to be stored locally on individual servers rather than in shared storage. As a result, image requests frequently failed, triggering a surge in background jobs attempting to recreate missing images. These jobs launched in an uncontrolled manner, consuming excessive CPU resources. Even with full auto-scaling in effect, all available server capacity was quickly saturated, which led to degraded performance and service outages. Most requests during this period failed with error responses, and any successful responses were noticeably delayed. We understand the inconvenience this caused and acted swiftly to resolve the situation. ## **Affected Area** Email Recommendations ## **Timeline** * **2025-05-27 12:00 UTC** – A new version of Email Recommendations, that included the bug, was deployed * **2025-05-27 18:50 UTC** – Service degradation began * **2025-05-27 19:00 UTC** – Issue detected and investigation started * **2025-05-27 21:00 UTC** – Service fully restored ## **Actions Going Forward** * Configuration has been corrected to ensure proper handling of image storage * New alerts have been added to detect high CPU usage in fully scaled environments at an earlier stage * Additional automated testing will be introduced to better catch similar issues before deployment

resolved2025-05-27T21:05:19.000Z

Service is back to normal.

investigating2025-05-27T19:00:05.000Z

We are current seeing degraded service in the Email Recommendation service.

Report: "Voyado Engage- Service window"

Last update 2025-05-28T00:00:00.000Z

Completed2025-05-28T00:00:00.000Z

The scheduled maintenance has been completed.

In progress2025-05-27T21:00:00.000Z

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled2025-04-28T14:57:00.000Z

Every Tuesday at 21:00 CET/CEST we reserve the right to perform scheduled maintenance on Voyado Engage. Normally this won't affect availability at all, but on some rare occasions unresponsiveness or degraded performance may be experienced during the upgrade.We do apologize in advance for any impact on your work and/or availability, but hope that the things we release to production will make it all worth it in the end.

May 25, 2025

Report: "Voyado Engage - Service window"

Last update 2025-05-25T05:00:00.000Z

Completed2025-05-25T05:00:00.000Z

The scheduled maintenance has been completed.

In progress2025-05-25T03:00:00.000Z

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Scheduled2025-04-28T14:57:00.000Z

Every Sunday at 03:00 CET/CEST we reserve the right to perform scheduled maintenance on our databases. During this time we make sure that the databases are fine tuned and all your data is protected. Normally this won’t affect availability, but some unresponsiveness or degraded performance may be experienced for short periods during the maintenance.

May 22, 2025

Report: "[Engage] Issues With Order V3"

Last update 2025-05-22T10:50:44.217Z

postmortem2025-05-22T10:47:44.282Z

### Summary On May 20th 2025, an issue in the Engage platform led to delays in how orders were processed and messages were sent. While everything looked normal on the surface — meaning actions appeared to be accepted — the actual processing behind the scenes wasn’t happening as it should. The issue was fully resolved within a few hours and no orders were lost. ### Customer Impact A small number of customers using specific order features were affected. Orders submitted to Engage were **not processed** during the incident window. As a result, **automated transactional emails** linked to these orders were also delayed. Once the issue was fixed, **all delayed orders were processed successfully** and the related emails were sent. ### Root Cause and Mitigation A recent system changes introduced this problem. Once the issue was identified, our team: * Released a fix that allowed the system to handle orders as expected again. * Made sure all delayed orders were caught up and properly processed.. ### Next Steps We’re adjusting our internal testing setups to more closely match the real-world environment, helping us catch potential issues for this type of case before they can affect you. We appreciate your patience and understanding, and apologize for any inconvenience. We remain committed to providing a stable and reliable platform experience.

resolved2025-05-20T11:22:25.826Z

The incident has now been resolved. Orders affected during the incident have been successfully processed.

monitoring2025-05-20T09:46:45.874Z

The issue has been resolved, and order processing has returned to normal. Orders affected during the incident are being queued for handling shortly.

identified2025-05-20T07:20:34.610Z

We are currently investigating an issue with Order V3 where the orders is not processed. No order requests are lost, they are stuck in processing.

May 13, 2025

Report: "[Engage] API issues"

Last update 2025-05-13T10:07:38.341Z

postmortem2025-05-13T10:04:08.828Z

### **Summary** On April 30th 2025, between 15:49 and 16:38 CEST, Voyado Engage experienced an issue that caused delays and failures in API calls. The issue was traced to a problem during a system deployment. Our engineering team identified and resolved the issue quickly, with full service restored by 17:03. ### **Customer Impact** During the incident, a significant number of API calls were delayed or failed. Even successful calls had noticeably slower response times. ### **Root Cause and Mitigation** The issue was caused by a misconfiguration in a system update. This affected how the platform processed internal requests and resulted in service degradation. When the deployment was paused to limit impact, some servers were left running the old version of Engage and could not handle the full traffic load. Our team halted the rollout, developed a fix, and redeployed the updated version. Recovery began at 16:38, with full functionality confirmed by 17:03. ### **Next Steps** To prevent similar issues in the future, we are taking actions to implement improvements to our deployment and monitoring processes and to ensure faster recovery time. We appreciate your patience and understanding, and apologize for any inconvenience. We remain committed to providing a stable and reliable platform experience.

resolved2025-04-30T15:41:42.975Z

This incident has been resolved.

monitoring2025-04-30T15:08:23.184Z

We’re continuing to see improvements and are getting back to normal operations. We're now monitoring the fix closely.

identified2025-04-30T14:50:17.403Z

The fix is currently being rolled out, and we're starting to see signs of improvement. However, the issue is not fully resolved yet.

identified2025-04-30T14:20:26.710Z

We have identified the issue and are working on a fix.

investigating2025-04-30T14:09:59.810Z

We are currently investigating a possible incident affecting our APIs.

May 2, 2025

Report: "We are currently experiencing a delay in sending messages."

Last update 2025-05-02T08:24:15.528Z

postmortem2025-05-02T08:22:59.148Z

**Summary** On the morning of April 21st, Voyado Engage experienced an issue causing delays in the delivery of email messages. This primarily impacted messages sent through automation workflows. While no messages were lost, many were delivered later than intended. The situation was fully resolved the same day, and we are taking steps to ensure it does not recur. **Customer Impact** Approximately fifty percent of our customer base were affected by the incident. The majority of the delays impacted automated email workflows, though some manual send-outs were also affected. While all messages were eventually delivered, delays ranged from about 30 minutes to up to 3 hours for some customers. **Root Cause** The incident was mainly caused by inefficient memory management in the mail-processing application code. Over time, servers' memory usage steadily increased, peaking on April 21st. Combined with a few exceptionally large email campaigns, the system experienced severe resource pressure: * Memory Leaks: Memory was not properly released, causing sustained high usage that led to issues with Time-outs and Storage Delays as well as high CPU load: * Timeouts and Storage Delays: The platform struggled to write data to storage fast enough caused by the high memory usage, resulting in application slowdowns. * CPU Load: Some mail servers reached unnormal high CPU usage, worsening the delays. Importantly, no failures were detected in our cloud infrastructure, and no messages were lost. **Mitigation** Once the incident was identified: * A full application deploy was initiated to clear up memory usage and stabilize the system. Essentially performing a reboot of the application. * On-call engineers monitored the queues and gradually cleared all delayed messages. * Additional manual steps were taken to resend any stuck processes, ensuring no message was left behind. By 19:00 CEST on April 21st, all messages had been successfully sent and the system was back to a healthy operational state. **Next Steps** To prevent similar issues in the future, we are taking several actions to evaluate and potentially adjust memory utilization in the application, in addition to fine-tune monitoring of memory and storage health. We are also updating our incident management process to enable faster mitigation actions should similar symptoms appear. We appreciate your patience and understanding, and apologize for any inconvenience. We remain committed to providing a stable and reliable platform experience.

resolved2025-04-21T21:58:29.000Z

This incident has been resolved.

investigating2025-04-21T13:16:45.508Z

The degradation has been mitigated and we're currently working on addressing the aftermath (making sure all delayed messages are sent)

investigating2025-04-21T10:53:17.476Z

We are continuing to investigate this issue and working on a solution.

investigating2025-04-21T09:32:43.168Z

We are currently experiencing a delay in sending messages. We are investigating this and working on a solution.

Apr 7, 2025

Report: "[Engage] Messages not being sent"

Last update 2025-04-07T06:50:11.385Z

postmortem2025-04-06T21:22:12.000Z

## Summary On the morning of March 11, 2025, an issue occurred in the Engage platform that resulted in delays for message send-outs for a sub-set of our customers. The incident was triggered by an unexpected event in our in-memory database setup, which temporarily disrupted the platform’s ability to process and send messages. The issue was resolved rapidly and all affected send-puts were successfully delivered, either automatically or through manual resending. ### Customer Impact Approximately 54 customers experienced a temporary halt in their message send-outs for about one hour. Most messages were eventually sent out automatically once the issue resolved itself, but a smaller portion required manual resending by our team. No messages were lost. ### Root Cause The issue was caused by an unexpected failover in our in-memory database, which altered the primary-secondary configuration and triggered faulty callbacks in our system. This misconfiguration prevented messages from being processed as expected which led to the delay. ### Remediation & Mitigation * Our team identified the issue quickly through our monitoring and began troubleshooting. * A hotfix was implemented the same morning to remediate the faulty callbacks that prevented the message execution and to mitigate future occurrence of the unexpected behavior. * Messages stuck in the queue were either automatically processed or manually resent by our support team. ### Next Steps We recognize that similar in-memory database-related issues have occurred in the past. Based on recent events and ss part of our continuous improvement and reliability work, we are reviewing our in-memory database setup to improve its resilience and behavior during failovers. We appreciate your patience and understanding, and we remain committed to providing a stable and reliable platform experience.

resolved2025-03-11T11:57:54.499Z

This incident has been resolved.

identified2025-03-11T09:46:54.549Z

All the messages that got stuck have been resent. We are rolling out a fix to mediate the cause.

investigating2025-03-11T08:57:20.529Z

We have identified that the issue is only affecting a subset of customers. We are continuing to investigate the issue and are preparing to deploy a fix.

investigating2025-03-11T08:24:33.687Z

We are currently experiencing issues with sendouts and they are not being sent. We are investigating this and working on a solution.

Apr 1, 2025

Report: "[Engage] - Possible degradation due to ongoing incident for our cloud provider"

Last update 2025-04-01T12:29:19.464Z

resolved2025-04-01T12:27:35.000Z

Microsoft has declared their incident resolved, so we do the same. Microsoft's summary of Impact: Between 08:51 and 10:15 UTC on 01 April 2025, we identified customer impact resulting from a power event in the North Europe region which impacted Virtual Machines, Storage, CosmosDB, PostgreSQL, Azure ExpressRoute and Azure NetApp Files. We can confirm that all affected services have now recovered. A power maintenance event led to temporary power loss in a single datacenter, in Physical Availability Zone 2, in the North Europe region affecting multiple racks and devices. The power has been fully restored and services are seeing full recovery.

monitoring2025-04-01T10:29:54.358Z

Microsoft have extended their incident as it is affecting more resources and we are continuing to monitor the situation. Our monitoring platform currently indicates a fairly normal level of operations, but we do see intermittent failures on a per call basis, but for a very low percentage of calls. So there is some impact, ranging from slow responses to failures to perform certain actions. As Microsoft have extended the incident to include connectivity towards various forms of storage we expect to see failures reading from product feeds, and similar, possibly halting certain, but not all, messages.

investigating2025-04-01T09:31:27.912Z

We have received reports on an ongoing incident in Microsoft Azure that may result in degraded performance for Engage. The incident has not yet been published on Azure status pages so we don't have full visibility as of right now. At the moment we are assessing the situation through our monitoring platform to understand if and how we are impacted. We will be doing what we can to counter any effects.

Mar 20, 2025

Report: "[Engage] Report Register status and Automation statistics are not working"

Last update 2025-03-20T14:31:19.568Z

postmortem2025-03-20T14:29:34.343Z

**Summary** On March 11th, two different issues were identified that impacted report functionality within the Engage platform. The first issue affected the Register Status Report by preventing desired report results and the second issue prevented data publication for most tenants in the Automation Statistics Report. **Customer Impact** Customers experienced difficulties accessing accurate reporting data for about a \(1\) day before it was functioning as expected the following day. The respective report was impacted: * Register Status Report * Automation Statistics Report **Root Cause** * **Register Status Report Issue:** A change in how a data type were handled during refactoring led to an oversight in compatibility with an API regarding null values. * **Automation Statistics Report Issue:** Missing necessary tables to be utilized for the report in the production environment data layer. A scheduled job to create these tables was not executed, leading to a mismatch between Engage logic and the Data layer API. **Remediation** Immediate actions were taken to resolve the issues: * **Register Status Report:** A hotfix was deployed to fix the bug. * **Automation Statistics Report:** The necessary job was manually triggered to create the missing tables, restoring the report functionality. **Next Steps** To prevent similar incidents in the future, we have identified improvements to be implemented in areas of testing and validation and to continue to follow established processes. We appreciate your patience and remain committed to delivering a seamless and high-quality platform experience. If you have any further questions or concerns, please reach out to our support team.

resolved2025-03-11T12:25:53.131Z

Both the Automation statistics report and Register status is now working again.

identified2025-03-11T10:43:32.495Z

identified2025-03-11T09:18:05.553Z

We have identified issues with our reports Register status and Automation statistics. They are currently not loading any data. We are working on a fix for this.

Mar 12, 2025

Report: "[Engage] - Disturbance identified, affecting Messages and Automations"

Last update 2025-03-12T08:26:56.909Z

postmortem2025-03-12T08:26:44.321Z

## Summary Between the hour of 09:12 and 09:52 on February 2nd we encountered an issue affecting a central In-memory database used by many processes in the platform. The issue landed the service in a state which didn't trigger failover to backup services, causing various anomalies throughout the platform. Among those a large number of messages being delayed \(requiring manual resend\), automation events not triggering, reported login issues and more. ## Customer Impact The issue mainly affected customers who had messages and activity execution during the time frame 09:12 - 09:52 with a delay. ## Root Cause and Mitigation **Root Cause** The root cause of the issue was a central in-memory database ending up in a bad state. The database is used for storing data for quick access throughout the platform in a high load – low latency configuration. As this data is needed in various processes the effect was spread over multiple parts of the platform, but only specific use cases were greatly affected from a user perspective. The database has a redundant setup, with a primary to multiple replica configuration, where failover to a replica is automatic should the primary service run into issues. In this instance all servers in the setup ended up in a replica state, with no primary resource active, thus causing the issue. **Mitigation** Enforce primary: To mediate the issue we enforced a new primary resource in the configuration which returned Engage to a normal state where messages and execution of activities were functioning as expected. ## Next steps Unfortunately, this has happened before and although we did take actions to mitigate it from happening again it did. The root cause is still, after investigation, not clear. Our intention is now to look over our current in-memory database setup and take action to upgrade and update the setup.

resolved2025-03-02T11:24:19.500Z

All massages are now resent. The incident is now resolved.

monitoring2025-03-02T09:58:24.357Z

As previously stated, operations are back to normal. We are still working on the aftermath (i.e. resending email messages that were not sent due to the outage). Automations are fully synced and are running as normal. Next update will come when all messages are resent.

monitoring2025-03-02T09:25:22.347Z

The processing of Messages and Automations are still looking good after the implemented fix. There are still some delays from queued up messages that are being processed but Automations are back to normal.

monitoring2025-03-02T09:00:03.197Z

A fix has been implemented with promising results. We are seeing that Messages are being sent again and Automations are being processed. We will continue to monitor the situation.

investigating2025-03-02T08:55:11.835Z

We are continuing to investigate this issue.

investigating2025-03-02T08:36:35.119Z

We have identified a disturbance in Voyado which is affecting Message sendouts and Automations, which are currently not being processed. We are currently investigating the issue.

Feb 28, 2025

Report: "Engage - SMS not being sent due to issues with our SMS provider"

Last update 2025-02-28T10:48:55.134Z

resolved2025-02-28T10:48:55.113Z

No backlog left, all SMS has been sent and system is back to fully operational.

investigating2025-02-28T10:42:21.416Z

We have moved our queue over to our other SMS provider temporarily and SMS are being sent. We have some SMS that needs to be re-queued (by us), but all new SMS should be processing okay for now.

investigating2025-02-28T10:23:38.000Z

We have noticed issues with sending SMS via one of our SMS providers, hence SMS are not being sent as of now. This is being investigated and escalated accordingly. No SMS messages are lost but queued up for now.

Feb 25, 2025

Report: "Voyado Engage - Emails delayed"

Last update 2025-02-25T07:13:39.743Z

postmortem2025-02-25T07:12:32.344Z

**Summary** On the morning of February 9th, we detected system slowness through triggered warnings. Initially, it appeared to be linked to a single tenant's large-scale send-out, but further investigation revealed that multiple tenants were affected. An alert was later triggered indicating that a shared in-memory database, which helps process messages efficiently, was unavailable. This caused delays in message processing, impacting approximately 130 tenants. While most messages were eventually processed automatically, some required manual intervention. The maximum delay experienced was up to three hours, though this only affected a small number of messages for a few tenants. **Customer Impact** Customers experienced delays in their scheduled and automated message send-outs, including both SMS and emails. The disruption was due to a shared in-memory database becoming unavailable, which paused message processing. Once the system resumed, a backlog caused further delays. Our on-call team manually resent messages that got stuck, but a small portion of messages for a few tenants could not be resent. These customers were contacted directly. Delays ranged from as little as five minutes up to a maximum of 240 minutes. **Root Cause** The issue was caused by an unexpected data handover problem in a shared in-memory database, which temporarily lost track of some messages. This database is designed to handle a high volume of messages quickly and efficiently. Normally, if there’s an issue, the system switches to a backup automatically. However, in this case, when the switch happened, some data was lost. As a result, the system had trouble determining which messages had been sent and which were still in progress, leading to delays. Messages scheduled for processing after the disruption were handled as expected once the system recovered. **Mitigation** Since the issue was caused by an automatic switch to a backup system, the system recovered on its own. However, our team had to manually resend messages that had gotten stuck in the process. **Next Steps** We are currently evaluating improvements in the following areas: * Enhancing system robustness to minimize the risk of data loss during failovers. * Implementing automatic resending of delayed messages to quickly mitigate the effects of any disruption. We appreciate your patience and understanding. Our commitment remains to providing a reliable and seamless experience on the Engage platform. If you have any further questions or concerns, please reach out to our support team.

resolved2025-02-09T10:26:05.662Z

The issue has now been resolved, and we have processed the delayed messages.

identified2025-02-09T09:32:25.479Z

We recently experienced an issue that has now been resolved. As a result, some messages may be delayed. We are actively working to send them out as soon as possible.

Feb 19, 2025

Report: "[Engage] Limited availability in Staging"

Last update 2025-02-19T09:55:44.274Z

resolved2025-02-19T09:55:44.255Z

This incident has been resolved, and staging is now available again.

identified2025-02-19T09:02:36.085Z

We are continuing to work on a fix for this issue.

identified2025-02-19T08:41:54.116Z

We have issues with our Staging environment limiting availability at the moment. We apologize and are working to get it available again as soon as possible.

Feb 14, 2025

Report: "Increased error rates and latency for Stockholm customers"

Last update 2025-02-14T07:15:45.629Z

resolved2025-02-14T00:33:27.000Z

This incident has been resolved.

investigating2025-02-14T00:32:52.000Z

Our Cloud provider have resolved the connectivity issues and the API should be fully operational again.

investigating2025-02-13T23:17:49.000Z

Our API experienced elevated error rates and latency due to service disruptions and network connectivity problems for our Cloud Provider in the Stockholm region.

Feb 6, 2025

Report: "[Engage] Message sendout delays"

Last update 2025-02-06T12:31:02.804Z

postmortem2025-02-06T12:30:50.208Z

## Summary On January 29th, between approximately 09:30 and 20:00 CET, we experienced delays in message processing within the Engage platform. This resulted in messages such as emails and SMS, including automated communications, being sent later than expected. Some customers also encountered issues with promotions being assigned with a delay. ## Customer Impact Customers faced delays of 30-60 minutes for their messages to be delivered, affecting both scheduled and automated communications like welcome emails and order confirmations. Additionally, some customers had trouble assigning promotions. While all messages were eventually delivered, our incident response team manually resent a few messages that got stuck to ensure completion. ## Root Cause and Mitigation **Root cause** The issue was caused by a recent update to the system responsible for handling message distribution between available resources. The update, which upgraded a component to the latest Microsoft version, unintentionally slowed down the way messages were processed, creating a bottleneck that led to delays. The problem became more noticeable as more messages were sent throughout the day, compounding the issue. **Mitigation** Once our monitoring systems flagged the delay, our incident response team immediately began investigating. To resolve the issue, we: · Deployed a temporary fix to gain better insight into what was causing the slowdown. · Identified that some areas of the system were handling messages more efficiently than others and adjusted message distribution to relieve pressure on the slowest parts. · Made system adjustments to ensure messages could be processed at normal speed again, significantly reducing the delays. ## Next Steps After the incident, we applied permanent system updates to prevent this issue from happening again. These changes have been successfully implemented and are being closely monitored. Moving forward, we will: · Continue working with our system providers to ensure the platform remains stable. · Improve our ability to detect similar issues earlier in our testing environments to catch potential delays before they affect customers. We appreciate your patience during this incident and remain committed to providing a reliable and seamless experience on the Engage platform. If you have any further questions or concerns, please do not hesitate to reach out to our support team.

resolved2025-01-29T21:17:01.604Z

The incident has now been resolved after active monitoring of the system after our recent fix. We have managed the delayed messages, where the majority have been resent and in a few minor cases been rescheduled.

monitoring2025-01-29T19:41:59.879Z

The fix implemented in our recent deploy has show desired effects and we are experiencing new messages (email and sms) being sent as expected. However, we have an accumulated queue of messages that should have been sent throughout the day that will experience a delay. We are working on managing the delayed messages.

investigating2025-01-29T19:16:53.489Z

The status of the delays remains unchanged. We are actively investigating the issue through troubleshooting and are awaiting the resolution of our recent update.

investigating2025-01-29T17:12:06.289Z

We are rolling out an update that includes fixes to help address the issue and additional monitoring tools to better understand the root cause.

investigating2025-01-29T15:47:47.336Z

We are still in active troubleshooting and delays are still present. Currently, we have an average delay on 45 min on all messages going out.

investigating2025-01-29T14:50:47.250Z

We are still in active troubleshooting and delays are unchanged.

investigating2025-01-29T14:08:45.497Z

Our attempt on mitigating the issue with implemented fix did not have the desired effect and we are continuing our troubleshooting. Delays in messaging are unchanged.

investigating2025-01-29T13:27:34.076Z

We are still seeing a delay after the implementing an additional fix. Our efforts to mitigate and resolve the issue continues.

investigating2025-01-29T12:01:09.481Z

We have identified and implemented a fix to mitigate the issue, and are seeing some improvement to the delay, however there is still a delay for messages to be sent. Troubleshooting is still ongoing to ensure our mitigation efforts are correct and to identify the root cause.

investigating2025-01-29T11:26:56.684Z

We are continuing to investigate this issue.

investigating2025-01-29T10:50:58.751Z

We are continuing to investigate this issue.

investigating2025-01-29T10:24:02.000Z

We are currently seeing delays in sending both SMS and email messages. Troubleshooting is ongoing.

Report: "Engage - Problems with SMS messages in automations"

Last update 2025-02-06T12:30:38.406Z

postmortem2025-02-06T12:30:16.457Z

resolved2025-01-29T09:03:12.199Z

We have fixed the issue with sending messages and are back at normal state. Messages that were not sent due to the issue is being resent.

investigating2025-01-29T08:52:13.567Z

There are currently problems with SMS messages sent via automations, affecting all tenants. We are troubleshooting.

Jan 28, 2025

Report: "[Engage] FTP is unreachable for some customers"

Last update 2025-01-28T12:19:04.097Z

postmortem2025-01-28T12:14:50.632Z

**Summary** On January 15th we experienced an issue when our Engage FTP service became unavailable. The issue was quickly identified and engineers restarted the service, monitored and successfully verified its availability. **Customer Impact** For customers utilizing the FTP service the it was unavailable for about 30-40 minutes, leaving them with not being able to utilize the service. **Root Cause and Mitigation** **Root cause** A sudden high load on the FTP server contributed to the consumption of all available resources for the service. This condition led to a bottleneck where multiple processes were blocked or delayed due to the lack of available resources. The accumulation of blocked processes overwhelmed the server, eventually causing it to crash. While we cannot definitively confirm that the high load was the sole trigger, it is highly likely to have been a significant contributing factor. **Mitigation** Restarting the affected server remediated the issue. **Next Steps** We have identified suggested actions to mitigate the issue in the future and will take those into consideration going forward. The actions include, but are not limited to, adjusting available resources as well as trying to mitigate the risk for overuse of the service.

resolved2025-01-15T16:17:02.868Z

This incident has been resolved.

monitoring2025-01-15T15:18:24.677Z

A fix has been implemented and we are monitoring the results.

investigating2025-01-15T15:11:30.631Z

We are currently investigating this issue.

Jan 21, 2025

Report: "Engage - Automation statistics report - data stale"

Last update 2025-01-21T07:37:02.020Z

resolved2025-01-21T07:37:02.006Z

The data in the report is now up to date again.

identified2025-01-20T09:10:27.340Z

It has been reported that the automation statistics report data is stale, no updates after 13th of January. The root cause has been identified and should be fixed by tonight's refresh.

Jan 7, 2025

Report: "Email Recommendation Timeouts"

Last update 2025-01-07T13:24:05.338Z

resolved2025-01-07T12:05:22.000Z

This incident has been resolved.

investigating2025-01-07T11:19:44.000Z

We have identified the problem and is working to deploy a fix. We have also remediated a part of the degraded performance.

investigating2025-01-07T10:23:26.000Z

We are currently investigating the timeouts for Email Recommendations.

Dec 20, 2024

Report: "Engage: We are seeing connection errors/timeouts against our API and also affecting the user interface for some customers"

Last update 2024-12-20T14:10:19.895Z

resolved2024-12-20T14:10:19.877Z

System has been stable since fix. Closing incident as resolved.

monitoring2024-12-20T11:20:57.620Z

The root cause has been identified and we have implemented action that we anticipate will have resolved the issue. We will continue to monitor but all looks green right now.

investigating2024-12-20T10:58:09.771Z

We are currently investigating - seems one of our servers is struggling.

Dec 13, 2024

Report: "Engage- slowness in automations"

Last update 2024-12-13T10:02:21.248Z

postmortem2024-12-13T09:56:30.456Z

### **Summary** On the evening of December 3rd our Automations experienced a service degradation. This incident affected all tenants running workflows the same way - contacts moving through the workflows at a much slower pace than normal, in many cases with delays of 1-3 hours ### **Customer Impact** All workflows ran during the incident, but at a much slower pace, resulting in potential delays of 1-3 hours. Workflows normally executing “immediately” could take up to 3 hours to finish during the incident, possibly affecting time critical use cases of Automations, such assignment of promotions at checkout etc. ## **Root Cause and Mitigation** ### **Root Cause** We have been unable to find a singular root cause at this point, rather finding indications of several contributing factors leading up to the incident: * A certain part of the workflow initiation started experiencing minor delays, but the high load \(always\) experienced in Automations made this a significant total delay causing queues. * Functions present to ensure integrity and functionality in case of problems became inadvertent reasons for further problems as locks taken to prevent duplicates etc added further delays, retries designed to assure functionality added to the load after a certain point causing even more queues and throttling of traffic by underlying Azure components. ### **Mitigation** Upon redeploys of Automation services stale traffic and locks were reset, which lead to diminished queues as resources were forcibly made available again. This meant we could once again process workflows in a normal fashion, but it took some time to work through the queued up traffic caused by the incident, even at full throttle. ### **Actions Taken & Next Steps** Changes have been made to the functionality involved in the incident and we will monitor the preventive actions taken to see if we have mitigated the risk of landing in the same situation in the future. This will be done through further testing and live monitoring in combination.

resolved2024-12-03T19:51:50.687Z

We have resolved the issue and will continue to work on the root cause.

monitoring2024-12-03T17:37:39.115Z

We are continuing to monitor for any further issues.

monitoring2024-12-03T17:34:53.840Z

We are continuing to monitor for any further issues.

monitoring2024-12-03T17:30:23.935Z

We are back at normal performance for automations and will continue to monitor closely.

identified2024-12-03T16:44:18.942Z

We are still working to remediate the issue, status on performance is unchanged.

identified2024-12-03T16:06:40.645Z

We have identified the issue and are currently in progress of verifying if our actions have resolved the issue with degraded performance on automations.

investigating2024-12-03T15:31:13.352Z

We are currently experiencing larger than usual queues in our automations. We are investigating this.

Dec 6, 2024

Report: "[Engage] Service degradation"

Last update 2024-12-06T21:42:13.644Z

postmortem2024-12-06T21:22:56.938Z

### **Summary** During the morning of Black Friday, we were alerted about delays in the sending of SMS messages. Shortly thereafter API-alerts were also triggered, indicating a service disruption on several endpoints. Upon further investigation we identified that a component in the messaging chain, a database used to house the messages to be sent, was under heavy load and did not scale the way it was supposed to, causing degradation to SMS delivery. Manual scaling had an immediate effect and the platform started recovering. As a result of the service degradation many messages were only partially sent and manual efforts had to be taken to rectify these effects. As a result of the degradation some messages were partially duplicated due to automatic resends. ### **Customer Impact** The immediate effect of the service degradation was delays to SMS messaging and a partial outage on our APIs between 10:22 and 10:51 CET. During this period many requests towards the application could not be handled and received 5xx-errors, while the majority of traffic was handled but with sub-standard response times. The customer impact was failing or delayed actions when performing actions through API, such as looking up contact data, redeeming promotions etc. The customer impact of the SMS incident was that many messages were only partially sent. Measured on a platform level about 5% of the intended recipients did not receive their intended message, but individual messages had higher/lower failed delivery rates. As we were working to find a way to resend the affected messages, we were alerted to the fact that the service degradation had led to an unanticipated side effect. Automatic retries designed to make sure all messages are always delivered had led to batches of messages being sent multiple times, somehow bypassing the checks present to prevent such duplicates. Due to this, manual resending of failing messages was paused, and later cancelled, to avoid the risk of further duplicates being sent \(a risk further increased by the general delays of delivery receipts from operators during Black Friday\). All efforts were redirected towards finding the bug that had led to the duplicates being sent. As the day progressed with us unable to safely resend messages without risking duplicates, we reached a point where we deemed the failing SMS messages could no longer be sent. When this decision had been made, a manual effort was started to change the information shown in the application. From showing messages as ”sending” to ”sent”, to indicate that no further attempts on delivery were going to be made. Due to the nature of the incident, with duplicates and manual intervention, the stats for the affected messages will never be fully accurate. The actual deviation for the messages will vary, but all messages will see erroneous data in the delivery stats, making it very hard to follow up on the effect of the delivery. ### **Root Cause and Mitigation** * A sudden burst of requests and severe load on the SMS SQL server led to long response times and eventually timeouts in connections when a service designed to auto scale failed to do so sufficiently. Manual scaling mitigated the immediate problem. * Several steps were taken to prevent duplicate SMS messages in the event of service degradation, and the functionality is now more robust. This includes code changes to Engage and related services. * Monitoring of the affected SQL server has been increased. ### **Next Steps** * Further investigations will be made to fully understand why the scaling didn’t work as intended for the affected resources in the Messaging pipeline. * Investigations will be made to understand if, and why, this pipeline degradation influenced API performance and actions taken to mitigate the risks indicated by those findings. * We’re implementing further improvements to prevent unintended duplicates when in similar situations.

resolved2024-11-29T10:55:26.000Z

Our API:s are still stable and back at normal operations. Sending of SMS is also back to a more normal status and we are sending at full capacity. Teams are making sure that any delayed messages are delivered to their intended recipients, albeit somewhat late in some cases. We sincerely apologize for the inconvenience caused by this service degradation. We are fully staffed and working hard to monitor the platform and taking swift action to mediate any problems caused by the Black Friday load. This day is as important to us as it is to you.

investigating2024-11-29T09:58:16.869Z

We are seeing signs of improvement regarding the API:s since approximately 10:51 CET. Sending of SMS still degraded. Our efforts continue at full force.

investigating2024-11-29T09:50:45.464Z

We are continuing to investigate the issue, using all resources available to us. Monitoring shows that our APIs are affected as well as the sending of SMS messages. SMS are being sent, but at a lower rate than normal.

investigating2024-11-29T09:34:45.730Z

We are currently investigating service degradation on our APIs More information will follow

Dec 4, 2024

Report: "Voyado Engage - Emails delayed"

Last update 2024-12-04T08:07:38.840Z

postmortem2024-12-04T08:06:46.049Z

## Summary The issue began at around 3:40 PM CET, causing delays in processing, sending and storing messages. We discovered that resources used to send messages were experiencing unexpectedly high load, which overwhelmed their capacity to handle requests efficiently. This primarily led to delays in sending e-mails. A small number of SMS messages were also indirectly affected by the delays. Our first attempt at mediating the issue by scaling resources didn't work, as the problem quickly resurfaced. To address this, we made a targeted adjustment to the code used to better manage the workload. This update was tested on an affected resource and proved effective, so the fix was subsequently rolled out to all affected resources. With the system stabilized, we were able to resend all delayed messages, completing the process later that evening. These actions resolved the problem and restored normal functionality of sending e-mails from Engage. ## Customer Impact Delays, primarily in email messaging, impacted a large amount of tenants. Some messages were delayed by several hours. ## Root Cause and Mitigation The trigger for the issue was high load on the email messaging chain. In combination with certain code being activated which contained non ideal resource usage, CPU load and thread counts greatly exceeded anticipated and sustainable levels. This was mitigated by a code fix to reduce the number of threads used, which then reduced the load on the servers. ## Next Steps Implement further improvements to automatically resend delayed messages to quickly mediate the effects of any disturbance. We will look at architectural changes to reduce general load in the messaging chain to be better prepared for future high load scenarios.

resolved2024-11-28T22:51:50.980Z

All the delay messages has now been sent. The incident has now been resolved.

monitoring2024-11-28T19:45:15.381Z

We still see a good effect of the implemented fix and the sendouts are going out as planned now. We are still processing the queue of the previously delayed messages.

monitoring2024-11-28T18:44:37.106Z

We see good effect of a fix we implemented. Engage is coming back to normal performance. Sendouts are going out as planned now, and queues are being processed. We will monitor and update regarding send queue status.

investigating2024-11-28T18:06:12.810Z

We are still seeing delays because of high load. Email sendouts are being queued. We are working on different solutions to make sure emails are going out. At the moment no other parts of Engage is affected.

investigating2024-11-28T17:39:23.797Z

investigating2024-11-28T17:07:59.486Z

investigating2024-11-28T16:42:59.654Z

We are still working to find a solution. The current email sendout delay is approximately 1,5h.

investigating2024-11-28T16:33:14.422Z

We’re still seeing delays and are working to find a solution. At the moment no other parts of Engage is affected.

investigating2024-11-28T16:09:04.989Z

We’re still seeing delays and are working to find a solution. At the moment no other parts of Engage is affected.

investigating2024-11-28T15:34:24.864Z

We are currently experiencing delays in sending emails. We are working to resolve the issue.

Nov 30, 2024

Report: "[ENGAGE] INFO: SMS degradation from earlier today"

Last update 2024-11-30T10:19:11.881Z

resolved2024-11-30T10:19:11.866Z

This incident has been resolved.

identified2024-11-29T15:11:56.352Z

Earlier today, technical issues related to SMS send-outs affected our API connections. For 30 minutes, the Voyado system could not process membership commands, both in-store and for e-commerce. During this period, SMS delivery was slower than usual. The issue was resolved, and all services have been fully operational since 10:50. We closed the incident when the API's had been stable for some time and SMS was back at full capacity. We also advertised that we made sure that any delayed messages were to be corrected. Current Status: We still have a small portion of recipients that haven't received their SMS message (about 5%). We have decided not to send these messages to avoid the risk of sending duplicates. During the aftermath of the incident we discovered that a central part of our sending pipeline had been affected in a way we didn't anticipate which had led to some recipients receiving multiple copies of the messages intended for them. Our top priority is resolving the core issue to ensure no further duplicate messages will be sent. As a result of the previous disturbance with SMS messages, SMS statistics are currently not accurate and may still show some communications as sending. They will be changed to Sent or Stopped at a later point. We are taking all possible measures to prevent this from happening again. The problem was caused by a part of our system connected to SMS send-outs that did not auto-scale as it should have, resulting in a significant backlog that couldn’t be processed quickly enough. Over time, this issue began affecting other parts of the system. We identified the component causing the error, and once resolved, the rest of the system issues were cleared. We are now closely monitoring this to ensure stability moving forward. We apologize to all customers affected by these technical issues. We are fully staffed with on-call teams to ensure the platform operates smoothly, and errors are handled and resolved as quickly as possible. For further questions or concerns, please get in touch with our Support team.

Nov 27, 2024

Report: "Voyado Engage - Emails delayed"

Last update 2024-11-27T17:53:32.564Z

resolved2024-11-27T17:53:32.537Z

The incident has now been resolved, all the messages that was delayed has been resent.

investigating2024-11-27T17:01:45.959Z

We are currently experiencing delays in sending emails. We are working to resolve the issue.

Nov 13, 2024

Report: "[Elevate] Outage on WebAPI"

Last update 2024-11-13T15:52:45.968Z

resolved2024-11-13T15:18:00.000Z

This incident has been resolved.

Nov 8, 2024

Report: "[Engage] Service degradation for a subset of customers"

Last update 2024-11-08T07:21:52.453Z

postmortem2024-11-08T07:20:51.357Z

## Summary One of the SQL servers reported high CPU utilization during the evening/night of November 5th. This affected a number of tenants currently residing on that server, in the form of longer response times for different API endpoints. Checking DPA suggested a query most likely to be the culprit. The issue resolved itself after some time, most likely due a decrease in the number of API calls. The next day, a missing database index was identified as the probable root cause. ## Customer Impact All tenants on the affected SQL would in theory have experienced decreased performance for all operations toward the database, including API calls, messages, automations. However, since the incident happened when there was low activity in the system, this didn’t result in any major problems for the end users. ## Root Cause and Mitigation An increase in queries against the Contact table, using a `where`-clause which was not included in any index. A missing index on the affected table was identified and added. ## Next Steps We will look into how much the database performance has been improved by the added index as well as investigating if there is a need for creating additional indexes. The API usage for tenants on the affected server will be evaluated to see if there is an inefficient implementation causing unnecessary load.

resolved2024-11-05T22:36:48.450Z

We conclude that the incident has been resolved. Response times have been normal for the past 30 mins (or more).

monitoring2024-11-05T22:17:58.333Z

We are seeing signs of improvement and most of the customers affected are right now at more "normal" levels. We are continuing to monitor the situation closely.

identified2024-11-05T21:48:19.835Z

We have identified the issue and are working on a solution to mediate the issue. The service degradation is currently affecting a subset of customers, linked to a specific resource cluster, while most customers are unaffected at this point. The affected customers experience higher than normal response times in our APIs as well as the web applications.

investigating2024-11-05T21:20:45.400Z

We are currently investigating an issue causing service degradation (slowness, unresponsiveness) for a subset of customers. Work is ongoing and information will be provided as soon as possible.

Oct 29, 2024

Report: "[Engage] Degraded performance for sending email"

Last update 2024-10-29T11:00:51.038Z

resolved2024-10-29T11:00:51.026Z

This incident has been resolved.

monitoring2024-10-29T10:12:18.521Z

The degradation has been mitigated and we're currently working on addressing the aftermath (making sure all delayed messages are sent).

investigating2024-10-29T09:53:07.471Z

We are currently investigating an issue where a subset of customers are experiencing degraded service affecting the sending of emails. Emails are being delayed, or in some cases not sent. No action will have to be taken by the customers affected at this point. We will make sure all messages are sent, albeit somewhat late in some cases.

Oct 28, 2024

Report: "Elevate Apps is currently unavaliable"

Last update 2024-10-28T10:04:07.693Z

resolved2024-10-28T10:04:07.671Z

This incident have been resolved, if you experience any issues send a ticket to support@voyado.com

monitoring2024-10-28T09:54:00.369Z

Up and running and we are monitoring the situation

investigating2024-10-28T09:50:04.944Z

We are currently investigating the issue.

Oct 20, 2024

Report: "[Engage] Increased response times on person lookups"

Last update 2024-10-20T13:01:37.058Z

resolved2024-10-20T13:01:37.038Z

This incident has been resolved.

monitoring2024-10-20T10:54:28.116Z

Since 12:13 PM CEST, we have observed better response times and no failed requests caused by the maintenance performed by our supplier. Our supplier's maintenance window is still open until 3:00 PM CEST, but our logs no longer show any significant impact on response times from the APIs providing us with person lookup data.

identified2024-10-20T08:24:21.795Z

We also see personlookup requests made by customers that are failing due to this. We will continue to monitor the situation.

identified2024-10-20T07:34:53.057Z

One of our suppliers is performing maintenance and our monitoring indicates that this is resulting in degraded response times in the APIs providing us with person lookup data. Their maintenance is scheduled to be completed by 3:00 PM CEST.

Oct 16, 2024

Report: "[Engage] - Increased response times towards our API"

Last update 2024-10-16T17:36:12.057Z

resolved2024-10-16T17:36:12.043Z

This incident has been resolved.

investigating2024-10-16T16:45:15.708Z

We are currently seeing longer response times towards our API. We are investigating this issue.

Oct 15, 2024

Report: "Engage - increased response times towards API"

Last update 2024-10-15T14:17:18.611Z

resolved2024-10-15T14:17:18.594Z

This incident has been resolved by scaling up the resources for the affected customers and API has been stable since around 2pm.

investigating2024-10-15T11:05:49.371Z

We can see that queries are queued up in the sql database, impacting API response. Investigations are ongoing

Report: "[Engage] Issues logging in"

Last update 2024-10-15T08:28:55.077Z

resolved2024-10-15T08:28:55.061Z

This incident has been resolved.

identified2024-10-15T07:42:23.631Z

We are currently seeing issues with logging in to the platform and are working on a fix.

Oct 10, 2024

Report: "Engage - FTP unavailable"

Last update 2024-10-10T16:01:42.658Z

resolved2024-10-10T16:01:42.642Z

The FTP is now fully operational again. We will continue to monitor the server.

monitoring2024-10-10T15:48:32.809Z

A fix has been implemented and we are monitoring the results.

investigating2024-10-10T14:14:53.735Z

We are continuing working with our supplier on a resolution for this issue.

investigating2024-10-10T13:31:59.583Z

The FTP is experiencing issues since about 15:09 CEST. We are investigating this and trying to bring the service back up.

Sep 25, 2024

Report: "[Engage] - Disturbance identified"

Last update 2024-09-25T11:43:59.555Z

postmortem2024-09-25T11:42:47.453Z

**Summary** Between 07:30 and 08:20 on September 17th, we encountered an issue affecting a central in-memory database critical to multiple processes across the platform. The issue prevented failover to backup services, leading to various platform anomalies. These anomalies included significant delays in message processing \(requiring manual resends\), failure of automation events, login issues, and other disruptions. **Customer Impact** Customers who had messages or activities scheduled between 07:30 and 08:20 were primarily affected, experiencing delays in execution. Scheduled jobs were disrupted during this period and, while many were able to self-heal upon their next scheduled run, certain critical activities—such as birthday automation events—may not have triggered as expected, potentially impacting some customers. **Root Cause and Mitigation** **Root Cause** The root cause was a failure in a central in-memory database, which entered a degraded state. This database is designed to store data for rapid access under high-load, low-latency conditions. As the database is essential to numerous platform processes, the issue spread across multiple areas of the system, although only specific use cases were significantly impacted from a user perspective. The database operates in a redundant primary-to-replica configuration, with automatic failover to replicas when the primary encounters issues. In this instance, all servers shifted into a replica state with no primary resource active, leading to widespread disruptions. **Mitigation** Enforce primary: To resolve the issue, we manually enforced the designation of a new primary resource within the configuration. This restored normal platform operations, allowing messages and scheduled activities to resume as expected. **Next steps** To prevent future occurrences, we have enhanced the system's downtime handling for our in-memory database. This improvement ensures that the system will automatically return to the correct state once back online, mitigating the risk of similar disruptions going forward.

resolved2024-09-17T10:26:56.092Z

We are now in a stable state and incident is closed. We will continue to monitor the system during the day. Some data for statistics during the time period 07.30 - 08.20 CEST are not correct and needs to be handled manually, the team will continue working to mitigate these errors.

monitoring2024-09-17T07:48:41.477Z

Engage is up and running since approx 08.30 CEST. We keep on monitoring the applications since our mitigations. We are continuing to work for a solution for the activities scheduled between 07:30-08:30 CEST that was affected.

identified2024-09-17T07:43:55.519Z

Engage is in a stable state and working as normal for now since approx. 08:30 CEST. However, activities scheduled between 07:30-08:20 are affected and are evaluated how to be handled.

identified2024-09-17T06:55:48.844Z

We managed to mitigate some effects of the incident and have a more stable situation. Login is working, automations and messaging are fully functional since approx. 08:30 CEST. We are continuing our work on the incident and any lingering effects.

investigating2024-09-17T06:43:34.697Z

We are continuing to investigate this issue.

investigating2024-09-17T06:39:16.502Z

We are continuing to investigate the issue.

investigating2024-09-17T06:01:57.360Z

We have identified a disturbance in Voyado and are currently investigating the issue.

Sep 6, 2024

Report: "[Engage] Disruptions to person lookups"

Last update 2024-09-06T14:29:46.604Z

resolved2024-09-06T14:29:46.588Z

Our supplier indicate through their status page that services have been restored since approx. 1 hour. Our monitoring confirms this and we consider the incident closed

monitoring2024-09-06T13:12:16.281Z

One of our suppliers for person lookups is reporting disruptions to their service. Due to their problems Engage is currently unable to perform person lookups. Full functionality will be restored when our supplier has mediated their problem. We have no ETA at this time

Sep 4, 2024

Report: "Shopify app - 504 Gateway Time-out"

Last update 2024-09-04T05:18:37.430Z

postmortem2024-09-04T05:16:20.262Z

## 504 Error Incident Due to Database Indexing Issue ### Incident Summary * **Date of Incident:** 2024-09-03 * **Time of Incident:** 11:40 \(CET\) * **Duration:** 10 minutes * **Impact:** The application experienced a 504 Gateway Timeout error, affecting users' ability to access data from the `customers` table. ### Root Cause The 504 error encountered this morning was caused by a database indexing issue. Specifically, the `voyado_id` field in the `customers` table did not have an index. As a result, the application faced delays while trying to retrieve data, leading to the 504 Gateway Timeout error. ### Resolution The incident was resolved by rolling back the schema change applied the day before, which removed the issue of inefficient queries caused by the missing index.

resolved2024-09-03T10:00:41.354Z

This incident has now been resolved and more information to come.

investigating2024-09-03T09:45:32.659Z

We are experiencing issues with the app and are currently investigating and trying to resolve the issue as soon as possible.

Aug 20, 2024

Report: "[Engage] Service disruptions for a subset of customers"

Last update 2024-08-20T11:05:51.495Z

resolved2024-08-20T11:05:51.476Z

This incident has been resolved.

monitoring2024-08-20T09:06:24.974Z

The mediating action seems to have had the desired effect. We are continuing to monitor response times and performance.

identified2024-08-20T08:57:18.999Z

We have identified the cause of the disturbance and a mediating action has been taken. We are waiting to see if we achieved the desired effect.

investigating2024-08-20T08:30:22.591Z

We are since a little while back investigating problems mainly affecting a subset of customers. The customers are affected due to a problem with a shared resource cluster and we are looking into mitigations

Report: "[OTHER] Delays in ticketing through email (Support)"

Last update 2024-08-20T11:04:09.017Z

resolved2024-08-20T11:04:08.999Z

Zendesk reported the incident closed at 12:21 CEST

monitoring2024-08-20T09:51:51.086Z

We are continuing to monitor for any further issues.

monitoring2024-08-20T09:50:39.000Z

Our vendor Zendesk has flagged problems handling incoming mail during the morning, possibly affecting how quickly incoming mail to Support is turned into tickets and being available for us to act on. It's hard for us to tell if this is affecting us and if so, how much. (https://status.zendesk.com/) To be sure of swift handling of tickets, please register tickets through the web portal https://explore.voyado.com

Aug 19, 2024

Report: "[Engage] Image delivery issues"

Last update 2024-08-19T13:09:00.630Z

postmortem2024-08-19T13:08:03.149Z

## Summary On August 18, 2024, at 13, we encountered a critical issue involving the expiration of the SSL/TLS certificate for the custom domain `images.eclub.se`, which was hosted and managed by our cloud provider's CDN services. The certificate expired without prior notification from the cloud provider, leading to security warnings and the failure to load images correctly in customer emails. This resulted in images not being displayed properly, which negatively impacted user experience. ## Customer Impact Images did not load properly in emails. This effected all tenants. ## Root Cause and Mitigation The root cause of the issue was twofold: 1. **Failure to Renew Certificate**: The cloud provider did not automatically renew the SSL/TLS certificate for the custom domain `images.eclub.se` due to a domain validation issues. 2. **Lack of Notification**: The cloud provider failed to notify us about the impending expiration, leaving the issue unnoticed until the certificate had already expired. Mitigation: **Re-enabling Encryption**: We re-enabled encryption on the affected endpoint to force the initiation of a new certificate validation process. **Cloud Provider Communication**: A ticket was submitted to the cloud provider to escalate the issue and expedite the resolution process. Additionally, the third-party certificate provider was contacted to request that the Voyado team responsible for cloud resources is to be added to the notification list for any future. ## Next Steps **Enhanced Monitoring**: Implementing a more focused monitoring to track SSL/TLS certificate expirations independently of the cloud provider, ensuring that we receive timely alerts. **Proactive Communication with Providers**: Establish direct lines of communication with both the cloud provider and the certificate authority to ensure that we are promptly informed of any potential issues with certificate renewals.

resolved2024-08-19T07:49:44.062Z

This is incident has been resolved. A post mortem will be provided as soon as it's ready.

monitoring2024-08-19T06:06:39.098Z

The affected resources have been functional throughout the night and we are currently monitoring.

identified2024-08-18T15:10:26.557Z

Actions to mitigate the issue have been taken on multiple fronts. Given the nature of the problem we expect some delay before the actions taken come into effect. A ticket has also been raised with our cloud provider to help mitigate the issue.

identified2024-08-18T11:49:58.335Z

Investigations point toward an expired certificate managed by our cloud provider causing security warnings and/or failure to load images from our CDN. We are working to mitigate the issue.

investigating2024-08-18T11:30:29.478Z

We are currently investigating reported problems where images hosted on Engage CDN's are failing to load in emails. Reports indicate the issue only affects Engage-hosted images, not images from external hosts.

Aug 5, 2024

Report: "[Engage] Delays in messaging"

Last update 2024-08-05T15:10:42.537Z

resolved2024-08-05T15:10:42.522Z

This incident has been resolved.

monitoring2024-08-05T13:01:28.905Z

Our mitigating effects had the desired effect and messaging is back at normal speed. Messages are being delivered and any messages affected by the incident will reach their intended recipients, but a tad later than expected. We are still monitoring the system and delving deeper into the root cause.

identified2024-08-05T11:31:24.797Z

We have identified a possible cause and mitigating actions are on the way!

investigating2024-08-05T11:07:40.574Z

We are currently investigating delays in messaging. Messages may be delayed or seemingly fail to send at the moment.

Jul 26, 2024

Report: "Disturbance identified"

Last update 2024-07-26T11:31:54.889Z

postmortem2024-07-24T20:11:25.418Z

## Email Recommendation Out of Memory Due to a recent change by a customer, where they started using very large images, a memory leak in Email Recommendations was exposed and led to the application crashing. The memory leak has now been fixed, and a permanent solution for the customer has been applied to reduce the size of the images created by the application. * We will implement a feature in Email Recommendations that will give an error if the images are too big. * Documentation will be updated with more information about this. * Implemented a self-healing mechanism in case this issue should appear again. The incident also exposed an internal problem with a service providing a backend application with data that did not recover automatically after an incident. This issue has been resolved, and the affected service will now recover automatically. The business impact was that the service was unavailable for: * 1 hour 45 minutes for end users * 2 hours 15 minutes for backend users \+ 1 hour for the customer with large images. We apologize for any inconvenience this has caused and will work to ensure it won't happen again.

resolved2024-07-24T08:20:59.927Z

This incident has been resolved.

investigating2024-07-24T06:52:40.097Z

We are continuing to investigate this issue.

investigating2024-07-24T06:52:25.960Z

We have identified a disturbance in Email Recomendations and are working on restoring normal operations as soon as possible.

Jul 14, 2024

Report: "Latency and API reponse timeouts"

Last update 2024-07-14T10:49:29.665Z

resolved2024-07-14T10:49:29.649Z

This incident has been resolved.

investigating2024-07-14T10:12:37.623Z

We are continuing to investigate this issue.

investigating2024-07-14T08:08:07.000Z

We are currently investigating the latency and timeouts for nordic customers.

Jul 12, 2024

Report: "Voyado Engage - Problems With send-outs"

Last update 2024-07-12T19:43:43.186Z

resolved2024-07-12T19:43:43.169Z

The incident has been solved. We’re still sending messages that’s been delayed. All the delayed messages will be sent.

identified2024-07-12T14:05:20.230Z

We have identified the issue and we're working on a solution. When it's resolved we will resend the messages that has been delayed.

investigating2024-07-12T12:44:09.276Z

We have identified problems with send-out within Voyado Engage it's started 13:35 CEST and it's ongoing. Working to resolve and resend the messages.

Jun 14, 2024

Report: "[Engage] - Delays to some BI-exports"

Last update 2024-06-14T10:03:55.635Z

resolved2024-06-14T10:03:55.623Z

This incident has been resolved. The following BI-exports will run as scheduled and contain both new data as well as the missing data from the previously failed exports.

identified2024-06-14T08:25:39.977Z

The issue has been identified and a fix is being implemented.

investigating2024-06-14T07:39:41.746Z

We have noticed some failures in the BI-exports overnight. This is being investigated by the developers.

May 30, 2024

Report: "Engage: Slowness in API reported"

Last update 2024-05-30T09:04:52.304Z

postmortem2024-05-30T09:04:19.766Z

## Summary On 2024-05-27, we experienced two brief service interruptions due to an issue related to enabling BI exports for new tenants, which caused database overloads due to a flaw in the activation method. ## Customer Impact 15:11 - 15:14 Services were unavailable for 3 minutes due to this issue. 15:19 - 15:22 Services were unavailable again for 3 minutes due to this issue. ## Root Cause and Mitigation The interruptions were caused by a flaw in the BI export activation method for new tenants, leading to temporary database overloads. A fix has been deployed to resolve this issue and prevent it from recurring. ## Next Steps No further steps required.

resolved2024-05-27T15:04:47.703Z

Monitoring has shown no signs of deterioration and we deem it safe to declare this incident resolved. Investigations on cause and subsequent improvements are still ongoing. A post mortem will be provided as soon as possible.

monitoring2024-05-27T13:55:22.458Z

A fix has been implemented and we are monitoring the platform to make sure it had the intended effect.

investigating2024-05-27T13:29:25.371Z

We have been noticing slowness in the API response times and timeouts since around 15:20 CEST. We are investigating.

May 23, 2024

Report: "[Elevate] Problems loading Email section"

Last update 2024-05-23T10:54:39.015Z

postmortem2024-05-23T10:45:35.747Z

## Incident description Administrating Email Recommendations in esales.cloud was not possible. No other services/function were affected. An internal service that the esales.cloud apps are using to communicate with other internal services could not reach the Email Recommendations service due to network issues in our cloud environment. ### Incident record * 2024-05-22 07:29 First customer contacts support. * 2024-05-22 07:46 Investigation started. * 2024-05-22 09:45 Mitigating action taken. * 2024-05-22 09:52 Service restored. ### Incident review/actions As a result of the incident we will work to increase our monitoring in order to detect similar issues quicker.

resolved2024-05-22T10:11:29.897Z

We have been able to locate AND fix the problem, which means the Email section should be back to normal again. We apologize for the inconvenience caused by the error.

investigating2024-05-22T08:17:17.299Z

We have received reports on problems loading the Email section in the Elevate apps. We have been able to verify the problem and are actively working on finding the cause and subsequent solution.

Apr 22, 2024

Report: "[Engage] - FTP server is currently unavailable"

Last update 2024-04-22T09:22:39.100Z

resolved2024-04-22T09:22:39.086Z

FTP is back online.

investigating2024-04-22T09:03:51.526Z

The FTP server ftp.voyado.com is currently unavailable. We are actively investigating.

Apr 19, 2024

Report: "[Engage] Application unreachable"

Last update 2024-04-19T12:06:12.000Z

postmortem2024-04-19T11:55:20.711Z

**Summary of the incident** On Tuesday, we had our regular biweekly update to release new code to Engage. Unfortunately, an unexpected issue caused significant slowdowns and downtime in our user interface and APIs, which affected the functionality of our Engage product for all our customers. The issue started at 09:25 AM and was resolved by 12:22 PM, causing a downtime of approximately three hours. As a result, all endpoints related to API integrations with Engage were significantly affected, meaning POS and e-commerce platforms could not communicate with Engage, impacting consumer interactions. **Incident timeline** * 08:49 Release update started. * 09:25 The issue starts occurring. * 09:28 First alerts of malfunction - incident lead responds. * 09:37 Task force assigned, based on escalation from incident lead, working nonstop to solve the issue. * 11:47 Fix completed and deployed to production. * 12:10 Early indicators showed the issue was resolved. * 12:22 Issue verified as solved, normal operation levels confirmed, and case closed. The incident team followed set processes and provided regular updates on the Voyado Status page to communicate with affected customers. However, we acknowledge the significant impact that the loss of functionality has on our customers. As a result, we have formed a post-incident team to review our processes and procedures to prevent future incidents.

resolved2024-04-16T11:12:26.000Z

We are still seeing normal operations and will thanks to this declare this incident resolved. We regained normal operations approximately 12:10 CEST. We will keep working with the incident to make sure we minimize the risk of this happening again. A post mortem for this incident will be appended as soon as we have mapped not only the whats, but also the whys.

monitoring2024-04-16T10:23:01.076Z

The initial signs of improvement are still valid, at the moment we can see normal response times throughout the solution. We will keep working but change the status of the incident to monitoring. Any degradations will make us revert this status.

investigating2024-04-16T10:12:44.245Z

The before mentioned fix has been rolled out and we're seeing some initial signs of improvement. We will keep working on achieving a full resolution.

investigating2024-04-16T09:36:22.000Z

We are continuing to work hard on the mitigation of this disturbance. A new attempted fix is about to be rolled out. Expected ETA 30 min. We can see that all parts of the platform are affected by the disturbance, even though we are not completely down. Some requests are getting through, some are slow and some are not getting through the door so to speak. We are however approaching this as if we were completely unavailable. The slowness and unavailability affects all endpoints in our core APIs as well as the graphical user interface of Engage.

investigating2024-04-16T09:08:49.441Z

The applied fix was not enough to get us out of the woods unfortunately. We are still seeing the same symptoms come back after the fix had been applied. Work is still ongoing with top priority. Any information made available will be communicated as soon as possible.

identified2024-04-16T08:41:46.769Z

A first fix is currently being applied to try to mitigate some of the symptoms we can see through our monitoring. As the application and it's resources are updated we hope to see a change for the better. We are however looking into further actions.

identified2024-04-16T08:24:25.740Z

We have identified a possible cause for the disturbance and are implementing a fix.

investigating2024-04-16T08:17:12.108Z

We are continuing to investigate this issue.

investigating2024-04-16T08:16:15.000Z

Unfortunately we are still seeing major increases in response times from the application. This leads to traffic build up causing some requests to be faced with 503-responses. Mitigating the incident is our top priority and we have all hands on deck. No ETA at the moment. Information will be provided as soon as it's available.

investigating2024-04-16T07:31:15.989Z

We are experiencing some longer response times since around 09:18am.

Apr 15, 2024

Report: "Degraded performance in search results and recommendations"

Last update 2024-04-15T14:52:53.186Z

resolved2024-04-15T14:52:53.170Z

This incident has been resolved.

monitoring2024-04-15T13:21:01.424Z

Currently we experience a somewhat degraded performance in search results and recommendations. We are working on mitigating this issue. Estimated fixed: 15:00 UTC (17:00 CEST)