Historical record of incidents for Batch
Report: "Partial API and Analytics Disruption"
Last update: Between June 9th 16:30 GMT+2 and June 10th 12:50 GMT+2, a database issue caused partial service disruptions affecting specific API endpoints and our analytics data processing pipeline.
Impact:
- Transactional API: Intermittent 500 internal server errors were returned by the get stats endpoint.
- Export API: Export requests took multiple hours to be processed.
- Dashboard: Analytics data production was delayed, causing outdated information to be displayed on the dashboard.
The engineering team identified the root cause of the instability and deployed a correction. Service functionality was restored and confirmed stable at 12:50 GMT+2; queued exports finished processing around 15:50 GMT+2.
Report: "CEP Audience API Issues"
Last update: Between June 10th at 17:00 (French time) and June 11th at 17:00, intermittent 500 errors occurred on create and update API calls that included attributes, affecting approximately 1% of total API traffic during this period.
Report: "Filtering imported tokens in Push v2 is disabled"
Last update: This morning, we identified an issue affecting imported push tokens in Push v2, which may have led to duplicate notifications being sent. As a precaution, we have temporarily disabled creating campaigns with imported tokens while we investigate the root cause and work on a fix. Push v1 remains fully operational and is not impacted by this issue.
Report: "Filtering imported tokens in Push v2 is disabled"
Last updateThis morning, we identified an issue affecting imported push tokens in Push v2, which may have led to duplicate notifications being sent.As a precaution, we have temporarily disabled creating campaigns with imported tokens while we investigate the root cause and work on a fix. Push v1 remains fully operational and is not impacted by this issue.
Report: "Dashboard Performance Issue"
Last update: Situation is back to normal.
The performance issues previously affecting the dashboard have been resolved. A fix has been applied, resulting in a much faster experience for all users. We will continue to monitor the situation to ensure everything remains stable.
We are currently experiencing slowness on our dashboard. Our teams are investigating the cause of this issue. We will keep you updated on the situation.
Report: "Profile updates and custom events processing delays on CEP"
Last update: This incident has been resolved.
The entire indexation backlog has now been cleared, and we're back to nominal status.
A fix has been deployed and the backlog of SDK events is being caught up. The indexing lag is now 30 minutes. We will update this incident when everything is back to normal.
We are experiencing a delay in indexing messages from our Mobile SDKs. The indexing of these messages is currently 1 hour and 30 minutes late; the problem has been present since 12:15 (GMT+2). This delay affects only the updating of Profile events, meaning:
- The profile view will not be updated in real time.
- Some automations will be sent later than expected.
Report: "Edge Network Instability"
Last update: Between April 24th, 2025 at 20:30 UTC and May 2nd, 2025 at 08:00 UTC, some users experienced network instabilities across our services. The issue was traced to a faulty load balancer in our infrastructure, which caused intermittent disruptions and impacted service stability during that period. The defective component has since been replaced, and systems have returned to normal operation. All services are now stable and functioning as expected. As a result of this incident, anomaly detection tools have been enhanced to better detect network-related issues at the edge of our infrastructure. These improvements are designed to enable more proactive identification and resolution of potential problems, further reinforcing the reliability of our services.
Report: "Profile updates and custom events processing delays on CEP"
Last update: This incident has been resolved.
The indexing backlog has now been cleared; we are keeping the platform under close surveillance.
The incident is still ongoing, and the indexing lag has been reduced to 1 hour and 40 minutes.
We are experiencing a delay in indexing messages from our Mobile SDKs. The indexing of these messages is currently 7 hours and 30 minutes late; the problem has been present since 17:15 (GMT+2). This delay affects only the updating of Profile events, meaning:
- The profile view will not be updated in real time.
- Some automations will be sent later than expected.
Report: "Profile updates and custom events processing delays on CEP"
Last update: The indexing backlog has now been cleared, and the incident is resolved.
A fix has been deployed on our indexing system, and the backlog of SDK events is being caught up. We will update this incident when everything is back to normal.
We are continuing to investigate this issue.
We are experiencing a delay in indexing messages from our Mobile SDKs. The indexing of these messages is currently 1 hour and 50 minutes late; the problem has been present since 17:15 (GMT+2). This delay affects only the updating of Profile events, meaning:
- The profile view will not be updated in real time.
- Some automations will be sent later than expected.
Report: "Slow loading of native attributes for MEP Campaign Targeting"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Despite initial fixes being applied, we are still experiencing issues, and the situation remains under investigation. We will continue to keep you informed and provide further updates as soon as possible.
The issue has been identified and a fix is being implemented.
A fix has been implemented and we are monitoring the results.
Since Friday, April 18th, we have been experiencing delays in loading native attributes for targeting our MEP campaigns. The incident has been identified, and initial fixes are being applied. We will provide further updates as soon as possible.
Report: "CEP Automation delay"
Last update: This incident has been resolved.
The fix has been deployed and the system is catching up. The delayed messages should all be sent around 10:00 GMT+2.
Our Customer Engagement Platform is experiencing delays for some Push, SMS & Email automations. About 1.5% of the messages are delayed for at most 10 hours. We have identified the cause and are working on a fix. No messages have been lost: once the fix is deployed, the delayed messages will be sent.
Report: "MEP Push processing delays"
Last update: Our Mobile Engagement Platform experienced processing delays for some triggered push notifications between April 10th, 10:40 PM and April 11th, 10:05 AM. All messages were successfully processed, and none were lost.
Report: "CEP Push processing delays"
Last update: Our Customer Engagement Platform experienced processing delays for some triggered push notifications between April 5th, 3:30 PM and April 6th, 8:30 PM. All messages were successfully processed, and none were lost.
Report: "Dashboard connectivity issue"
Last update: Monitoring has confirmed that the issue is fully resolved and all systems are operating normally. This incident is now closed.
A permanent fix has been implemented, and the affected components remain stable. The incident will remain open as we continue to closely monitor the situation for the next 24 hours.
There are no changes at this time. The mitigation remains effective, the impacted components are stable, and we are continuing to monitor the situation closely.
A mitigation has been implemented and is functioning as expected. We are continuing to closely monitor the system and address any remaining issues.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We’re currently experiencing connectivity issues affecting the Dashboard. Our team is investigating and working to restore full access as soon as possible.
Report: "GDPR API issues"
Last update: We experienced an issue affecting our GDPR API, resulting in the latter being unavailable for a while. A fix was implemented at 5:00 A.M. UTC (6:00 A.M. CET) and we haven't noticed any new occurrence of this issue since then.
Report: "Audience API issues"
Last update: The issue has been fully resolved. A permanent fix was implemented at 21:00 UTC, and we've continued monitoring throughout the night to ensure stability.
A fix has been implemented and we are monitoring the results.
We’re currently experiencing an issue affecting our Audience API. Our engineering team is actively investigating and working to resolve it as quickly as possible.
Report: "Profile updates and custom events processing delays on CEP"
Last update: We experienced a delay in indexing messages from our Mobile SDKs. The indexing of these messages was up to 57 minutes late over the period from 20:00 to 21:15 (GMT+1). This delay affects only the updating of Profile events:
• The profile view will not be up to date.
• Some trigger automations will be sent later than expected.
The origin of this problem has not yet been precisely identified, but investigating it is now the top priority of our technical teams, to ensure that this does not happen again.
Report: "Profile updates and custom events processing delays"
Last update: The issue has now been resolved. Profile updates and custom events are processed without delays.
A fix has been implemented and the delays are currently decreasing. We're still monitoring the situation.
We are aware of an issue which is causing some delays in processing profile updates and custom events received from the SDK. Automations triggered by an SDK event may also experience a delay. We have identified the issue and are working on a fix.
Report: "Delivery issues"
Last update<b>Timeline</b> All times are in UTC and 24h time. On february 16 at around 9:45 we detected significant delays for a subset of push notifications, email and sms sent for campaigns or recurring automations. An investigation by our team revealed that one instance of the service responsible for processing these campaigns or automations was having trouble keeping up with the incoming data. The issue was found and a mitigation was put in place immediately. After this operation the service operated correctly again and started catching up its delays. At around 9:50 all delays were resolved and everything was back to normal. <b>Impact</b> A small subset of push notifications, emails and sms were sent with a delay. We estimate that around 4% of all messages were delayed up to 7h. <b>Root cause</b> There was an issue with our profile selection system at around 3:00 which caused a small part of the processing to halt. Once the problem was identified our team proceeded to mitigate it, after which the service was working correctly again and the delays were resolved. <b>Conclusion</b> Although the original problem was an easy fix, the main issue was that we lacked efficient monitoring for this particular service which resulted in much higher delivery delays than it should have. In the near future we will work on improving the monitoring for this service so that we can address any issues much more quickly; in addition we will also work on preventing these kind of issues altogether.
Report: "Network issue on our hosting provider"
Last update:
# Foreword
At 01:30 GMT+1 on 02/02/2024, our on-call team was alerted to a potential issue on the platform. At 01:58 GMT+1, it was identified that the network was down on 189 servers. The physical network interfaces were down due to a misconfiguration from our hosting provider following a hardware failure, rendering immediate action impossible. Approximately 20 minutes later, some servers began to come back online spontaneously, leading to further investigation and attempts to restore services.
# Fault
The network failure affected 189 servers, causing their physical network interfaces to go down. Once our hosting provider corrected the configuration on the network, our servers came back online. While some servers were restored by automated systems, others required manual intervention.
# Impact
For approximately 1 hour and 35 minutes, between 01:25 GMT+1 and 03:00 GMT+1 on 02/02/2024, various core and optional services experienced partial or complete downtime. The incident affected several components:
* **CEP Core services**: Push delivery, Data ingestion, and Dashboard were partially down.
* **MEP Core services**: Push delivery, In-app delivery, and Data analysis experienced partial downtime.
* **APIs**: Several APIs, including Audience API, Export API, and Campaign API, were partially down (flapping errors due to the time needed by the system to auto-heal).
* **Optional services**: Inbox, Webhook, and Custom exports were partially down.
Despite the downtime, there was no impact on email and SMS delivery.
Regarding potential data loss: Requests accepted by our APIs during the incident were ingested, though some may have experienced delays. All retries made after **03:00 GMT+1** were successfully processed, and the data was properly ingested. Campaigns scheduled to launch during the incident did not trigger at the expected time but were restored once the network came back online. However, a small portion of these campaigns expired after **two hours** and were not sent.
# Timeline
01:30 GMT+1 - Pager alert triggered; our on-call SRE begins investigation.
01:58 GMT+1 - Network failure identified on 189 servers; physical interfaces down.
02:18 GMT+1 - Some servers begin to recover automatically; further investigation initiated.
03:00 GMT+1 - Most services restored; incident reported to infrastructure provider.
19:00 GMT+1 - Manual intervention by our infrastructure provider to restore the few remaining servers; all services back online by 19:00 GMT+1.
After further investigation, we now consider this incident resolved. The root cause has been identified as a network issue. We are preparing a detailed analysis, which will be shared in the coming days to provide further insights into the incident.
All our services have returned to normal. We continue to closely monitor the platform.
A significant part of our infrastructure became unreachable at 01:30 GMT+2. Most of the affected servers are now accessible again, and services are gradually returning to normal. The status page will be updated progressively to reflect component availability. We will provide another update in 30 minutes, at 04:00 GMT+2.
We're currently experiencing network problems that may impact all services. The issue likely originates with our hosting provider. We will keep you posted as the investigation continues.
Report: "Email composer down"
Last update: We encountered an issue with our email composer between 15:00 GMT+1 and 15:20 GMT+1, which prevented users from opening the composer or saving emails. The issue has now been resolved, and we continue to monitor the situation.
Report: "Inconsistent Trigger Automation Reentry on CEP"
Last update: On January 15th, from 10:20 GMT+1 to 17:20 GMT+1, we identified an issue that affected the correct computation of certain trigger automation events. This may have impacted less than 5% of events triggered during this period, potentially preventing some users from reentering a trigger automation if it had already been completed previously. The issue has since been fully resolved, and no further inconsistencies have been observed.
Report: "Delay in indexing Profile events for Web and Mobile SDKs"
Last update: This incident has been resolved.
Indexing delays are fully resolved. We will continue to monitor the platform to ensure a smooth return to normal.
We have identified the issue and implemented a fix. The indexing delay is decreasing.
Since 01:25 GMT+1, we have been experiencing delays in processing Profile events from our Mobile and Web SDKs. The delay has now reached 5 hours. This delay affects only the updating of Profile events:
• The profile view will not be up to date.
• Some trigger automations will be sent later than expected.
We are actively investigating this issue and will keep you updated on its resolution.
Report: "Push delivery delays"
Last update: Between 1:15 PM and 2:20 PM, we experienced delays in processing push notification campaigns. The issue has been resolved, and operations have returned to normal.
Report: "Webhook delays"
Last update: This incident has been resolved. All delayed webhook events were retried once the fix was implemented, and new webhook events are sent without delays.
We are aware of some delays sending webhook events to customers' endpoints. A fix has been implemented and events are being sent correctly again. We are monitoring the situation.
Report: "Profile API ingestion delay"
Last update: All indexing delays are now eliminated. Requests sent to the Profile API are processed in real time:
• Targeting is performed on up-to-date data.
• The profile view is up to date.
• Trigger automations are sent on time.
Indexing times continue to improve, with Profile API events now running an hour behind schedule.
Since 12:45 GMT+1 we have been experiencing delays in processing requests from the Profile API. The delay is currently 1 hour 40 minutes, and the system is gradually catching up. This delay impacts the updating of profile attributes and events:
• Targeting may be performed on data that is older than anticipated.
• The profile view will not be up to date.
• Trigger automations will be sent later than expected.
Attributes and events from the SDK are not affected. We are monitoring the issue and expect a resolution by 17:40 GMT+1; we will keep you informed of the situation every 30 minutes.
Report: "Notifications, SMS, Email, API & SDK Web services down"
Last update:
On Thursday, November 7, 2024, an unexpected issue occurred while migrating one of our key message queue clusters. This issue resulted in a major outage of our delivery services and APIs.
## What Happened?
A mandatory migration was underway for one of our core message queue clusters. This migration was a prerequisite for expanding our infrastructure across multiple data centers. Unexpectedly, at the final step of the migration, a key sub-component of our message queue cluster (based on Kafka) encountered issues communicating with other nodes in the cluster. The cluster became unavailable, causing our applications to stop processing messages. We detected this incident immediately and determined that it was related to a bug triggered by an edge case in the migration process. However, we chose not to simply roll back the migration, as we needed to ensure data integrity and prevent any potential loss. Finishing this migration was what ultimately solved the issue and allowed us to resume normal operations.
## Impact on our platform
* **Push Campaigns:** Delayed by up to one hour.
* **Transactional Push Notifications:** Delayed by up to two hours.
* **APIs:** Experienced a 16% error rate across all services, except the Custom Audience API, which continued to encounter errors until Nov. 8th, 09:31 GMT+1.
  • Successful API calls (returning a success status code) were enqueued but not processed during the incident.
  • Processing of enqueued requests began around 23:00 GMT+1 and concluded by Nov. 8th, 00:40 GMT+1.
  • **Action Required:** Retry any important failed API calls, as they were not enqueued.
* **SDK Web Services:**
  • In-App Automations with “Re-evaluate targeting just before display” did not function as expected.
  • Events, attribute updates, and push opens from the mobile SDK and plugins will be retried when users reopen their apps.
  • Events, attribute updates, and push opens from the Web SDK have been partially lost.
## Timeline & mitigation actions
_For clarity, this timeline only lists the most important events. All times are GMT+1._
### 17:40
Our alerting system detected that part of our core services were not working properly due to the unavailability of one of our key message queue clusters. Since we were working on this cluster, we immediately identified the root cause and began our investigation.
### 18:08
After assessing the severity of the incident, we declared it publicly via our status page and started working on various plans to resume operations. We decided not to force the migration or roll back until we fully understood the root cause and could ensure no data would be lost.
### 18:20
In order to restore all message delivery services as fast as possible, we decided to implement a quick and temporary workaround. This change involved removing an internal feedback loop necessary for all post-delivery actions (analytics, marketing pressure, inbox).
### 18:48
The workaround was deployed and confirmed to be working. Messages are being sent again.
### 21:30
We decided to resume the migration procedure.
### 22:30
All nodes were successfully migrated, and the cluster started healing itself. We then reverted the workaround and restarted all services using this cluster.
### 23:30
All services seemed operational. The incident remained open and under monitoring.
### 09:26, the next day
Due to a flood of alerts caused by the incident, the monitoring for our Custom Audience API was broken. After an in-depth post-incident investigation, this monitoring issue was detected and the Custom Audience API was fixed.
### 14:26
We marked the incident as resolved after verifying that all services were functioning as expected.
## Forthcoming actions
As this migration was part of a long-term plan to build a more resilient infrastructure (preventing this very issue from happening again), we will continue deployment as planned.
The system has been functioning properly since our last communication, and we now consider this incident resolved.
Summary of Impact:
• Push Campaigns: Delayed by up to one hour.
• Transactional Push Notifications: Delayed by up to two hours.
• APIs: Experienced a 16% error rate across all services, except the Custom Audience API, which continued to encounter errors until Nov. 8th, 09:31 GMT+1.
  • Successful API calls (returning a success status code) were enqueued but not processed during the incident.
  • Processing of enqueued requests began around 23:00 GMT+1 and concluded by Nov. 8th, 00:40 GMT+1.
  • Action Required: Retry any important failed API calls, as they were not enqueued.
• SDK Web Services:
  • In-App Automations with “Re-evaluate targeting just before display” did not function as expected.
  • Events, attribute updates, and push opens from the mobile SDK and plugins will be retried when users reopen their apps.
  • Events, attribute updates, and push opens from the Web SDK have been partially lost.
Analytics and Tracking Limitations:
To restore campaign functionality as quickly as possible, we temporarily disabled internal tracking of push, email, and SMS deliveries between 18:40 GMT+1 and 23:20 GMT+1. As a result:
• Analytics for messages sent during this period are unavailable and cannot be recovered.
• Open rate percentages for this timeframe are unreliable.
• Marketing pressure features (Global Frequency, Label Frequency, and Recurring Automation Cappings) do not account for push, email, or SMS deliveries during this interval.
We are exploring ways to partially regenerate missing analytics data.
Next Steps: Our team is preparing a comprehensive postmortem, which we plan to publish next week. We apologize for the inconvenience caused and appreciate your understanding.
The Custom Audience API encountered errors until November 8th, 9:31 GMT+1. It is now working as expected. We will send an update later today with more information about the impacted components. A full post-mortem is planned for next week.
The previous operation was successfully completed. SDKs and APIs are functioning correctly. Data ingestion has been back online since 22:45 GMT+1. From 18:47 GMT+1 to 23:20 GMT+1, no analytics data was collected, and unfortunately, we will not be able to recover this data. As a result, you may notice abnormal open rates, as messages were sent during this period but acknowledgment information was not collected. Push notifications won't show up in the Inbox feature either. Our teams are continuing to monitor the situation. We will publish a post-mortem next week.
To prepare our platform for the upcoming operation, we will temporarily suspend all API and SDK web services. During this time, data ingestion will not be possible (you will receive HTTP 500 errors). We will inform you as soon as data ingestion is restored. Analytics are still unavailable. We will post another update in an hour.
Our teams are still working on a complete solution. We will post another update in an hour.
The workaround is now also implemented to resume Transactional Push. Transactional Push will now be sent again, and any queued pushes that were delayed are being delivered progressively. Our teams are still working on a complete solution.
The workaround is now also implemented to resume Email & SMS campaigns. Email & SMS will now be sent again, and any queued Email & SMS that were delayed are being delivered progressively. Due to this workaround, success and error analytics will not be available on the dashboard, APIs, or exports. Our teams are still working on a complete solution.
We are continuing to work on a fix for this issue.
We have located the root cause but are still determining exactly which components are affected. We have implemented a workaround to resume APNS, FCM, and Web Push notifications for campaigns. Notifications will now be sent again, and any queued notifications that were delayed are being delivered progressively. Due to this workaround, success and error analytics will not be available on the dashboard, APIs, or exports. Our teams are still actively working to fully restore the remaining affected services.
We are currently experiencing technical issues since 17:40 GMT+1. Notifications, Email, SMS, our API, and SDK web services are all down. Our team is actively investigating the situation to restore the service as quickly as possible. We will keep you updated as soon as we have more information.
Report: "Batch services unavailable"
Last update: We experienced 10 minutes of downtime across all of our services, due to network saturation on our hosting provider's side. This downtime lasted from 14:25 to 14:35 GMT+1. We are back in a nominal state with our entire platform available.
Report: "In-App Automation edition errors"
Last update: Create, delete and update operations of In-App Automations from the Dashboard and API suffered from high error rates from October 27 at 7:15 GMT+1 to October 27 at 14:32 GMT+1. The running automations were delivered as expected to the SDKs, meaning that they were displayed to end users.
Report: "Partial delay in the indexation of Profile data from Mobile SDKs"
Last update: No suspicious behavior was detected during the monitoring period; our corrective patch has brought the system back to its nominal status.
A fix has been applied and we're monitoring the results. All the delays have been cleared and we're back in real time mode.
Over the weekend, we experienced two periods of slowdown in the indexing of Profile events from our mobile SDKs. Following mitigation actions by our team this morning, the delay has been resolved for now. These periods occurred between October 12 at 8:00 and 23:45 (UTC+2), and between October 13 at 10:30 and October 14 at 9:20 (UTC+2). We are investigating the root cause of this incident. These episodes of delay may have had an impact on the triggering of automations set up with the events we were slow to record.
Report: "Delay in APNS Push Notifications Delivery"
Last update: We experienced a delay in the delivery of push notifications through APNS (Apple Push Notification Service) between 13:17 UTC+2 and 14:04 UTC+2. All delayed notifications were successfully sent by 14:06 UTC+2. The issue is now resolved.
Report: "Push campaign analytics delay"
Last update: Analytics of Push campaigns & recurring automations have been recomputed; the incident is resolved. We are continuing to work on the "Devices synced" metric. Update (2024-10-04): The "Devices synced" metric data has been restored.
The issue has been narrowed down to a database problem and fixed. This had the following impact on the system:
- Analytics of Push campaigns & recurring automations sent between September 30, 02:00 GMT+2 and 16:30 GMT+2 are unavailable. This is a temporary issue that will be resolved when the data gets recomputed on October 1 around 05:00 GMT+2. Your push analytics will be available then.
- The "Devices synced" metric of In-app automations doesn't take into account data prior to September 30. We are looking into ways to restore this data.
Our team is monitoring the situation. We will mark this incident as resolved once we confirm that the nightly recomputation has been performed as expected.
We are investigating an issue with push campaign & recurring automation analytics, where in some cases the send/open counts are showing very low values. Pushes were sent as expected. Campaigns older than September 30th 2:00 GMT+2 have accurate analytics. We are working on restoring those metrics.
Report: "Instability of API Custom Audience"
Last update: We encountered an elevated error rate between 17:18 GMT+2 and 17:42 GMT+2.
Report: "Partial delay in the indexation of native and custom installation data"
Last update: Indexation has been fully operational since September 30, 2024, at 16:35 GMT+2. This incident is resolved.
The fix has been released, and since 16:35 GMT+2, the delayed native and custom installation data have been indexed. No data was lost. Our team continues to monitor the situation.
We are experiencing a partial delay in the indexation of native & custom installation data on our push platform. This means that Push campaign and recurring automation targeting is performed on older data (from 14:25 GMT+2) and does not reflect the latest changes. No information is lost during this delay. Trigger automations and In-Apps are not affected. Our team is working on a resolution.
Report: "iOS push significant bounce rate"
Last update:
## What happened?
In an effort to improve the performance of our iOS push notification infrastructure, we started rolling out an upgrade of our backend applications. This phased rollout targeted 2% of our daily volume of iOS Campaigns & Recurring automations. The new version contained a bug that caused applications sharing the same APNs* p8** push configuration across multiple apps to experience a high bounce rate, leading to incorrect “apns2_device_token_not_for_topic” errors. Apps using p12-based push configurations were **not** affected. Once we were made aware of the problem, we stopped the rollout and reverted it.
_*APNs = Apple Push Notification service_
_**p8 = private key used to authenticate against Apple's servers_
## Impact on our platform
From September 26, 14:45 GMT+2 to September 27, 10:30 GMT+2, part of our iOS Campaign & Recurring automation traffic returned “apns2_device_token_not_for_topic” errors when it should not have. We have not tried to send the failed notifications a second time. The analytics of the affected Campaigns/Recurring automations are accurate.
All other components worked as expected:
* iOS Transactional & Trigger automation pushes
* Android & Webpush: all push notifications
* In-apps
* Email
* SMS
* APIs
* Dashboard
* Analytics
## Timeline
_For clarity, this timeline only lists the most important events. All times are GMT+2._
**September 26, 14:45** We begin the gradual deployment of our improvements to the iOS push notification architecture, starting with 2% of our daily volume of iOS Campaigns & Recurring automations.
**September 27, 09:57** We notice a significant number of bounces on campaigns since September 26 at 18:30. At this point we check whether it is related to our deployment from the day before.
**September 27, 10:05** We identify a campaign impacted by this problem to analyze the cause.
**September 27, 10:20** We open an incident and confirm that it is related to the previous day's deployment.
**September 27, 10:30** We roll back the changes to the previous iOS push notification architecture.
**September 27, 11:34** We open a status page incident. We are still analyzing the impacted applications.
**September 27, 14:50** The problem is reclassified as major: it concerns few applications, but the applications concerned were unable to send a major part of their campaign & recurring automation push notifications.
## Actions
The bug has been fixed, but we have postponed the rollout to a later time (we reverted the code to its original state before reintroducing it). We are looking into improving our alarming to catch this specific kind of error at low volumes.
We had an issue with some of our iOS push notifications that caused a significant bounce rate when the same push credentials are used across multiple apps. We were in the middle of a gradual deployment, so the incident did not impact all apps. The incident impacted notifications sent between September 26th at 14:45 and September 27th at 11:00 (GMT+2). The bounced notifications will not be retried.
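For readers unfamiliar with the p8/topic relationship described in the postmortem above, here is a minimal, hypothetical sketch of a token-based (p8) APNs request. It is not Batch's implementation; the team ID, key ID, bundle ID, device token and key file are placeholders. The point it illustrates is that a single p8 key can authenticate pushes for every app in a team, so the per-request apns-topic header is what ties a request to one specific app; if the device token belongs to a different app than the topic, APNs rejects the request with a DeviceTokenNotForTopic error, which is the kind of mismatch this incident surfaced.

```python
# Hypothetical sketch of a token-based (p8) APNs request.
# Requires: pyjwt + cryptography (for ES256) and httpx with the http2 extra.
import time
import jwt    # pyjwt
import httpx

TEAM_ID = "ABCDE12345"                          # placeholder Apple Developer team ID
KEY_ID = "XYZ9876543"                           # placeholder p8 key ID
BUNDLE_ID = "com.example.app"                   # must match the app the token belongs to
DEVICE_TOKEN = "hex-device-token-from-the-sdk"  # placeholder

with open("AuthKey_XYZ9876543.p8") as f:        # placeholder key file
    signing_key = f.read()

# One p8 key can sign requests for every app in the team; the apns-topic
# header below is what binds this particular request to a single app.
provider_token = jwt.encode(
    {"iss": TEAM_ID, "iat": int(time.time())},
    signing_key,
    algorithm="ES256",
    headers={"kid": KEY_ID},
)

with httpx.Client(http2=True) as client:
    resp = client.post(
        f"https://api.push.apple.com/3/device/{DEVICE_TOKEN}",
        headers={
            "authorization": f"bearer {provider_token}",
            "apns-topic": BUNDLE_ID,   # wrong app here => 400 DeviceTokenNotForTopic
            "apns-push-type": "alert",
        },
        json={"aps": {"alert": "Hello"}},
    )
    print(resp.status_code, resp.text)
```

This also lines up with the report that p12-based configurations were unaffected, since a standard p12 push certificate is issued for a single app rather than shared across a team.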
Report: "Dashboard is partially unavailable"
Last update: The dashboard is now fully operational. This incident has been resolved.
Due to a component experiencing a spike in usage, we experienced a partial service outage. During the incident:
- The Dashboard was partially unavailable. About 10% of requests failed; retrying worked in some cases.
- The Campaign Stats API was temporarily unavailable.
- All other components (delivery, SDK, APIs) functioned as expected.
The issue has been resolved by allocating additional hardware resources to this component.
We are encountering some connectivity issues with our dashboard web service. Our team is working on a resolution.
Report: "Hosting provider partial downtime"
Last update:
On Thursday, September 10, 2024, some key messaging components of the Batch platform suffered an exceptionally long outage of 9 hours following downtime at one of our hosting providers, resulting in a very disappointing and frustrating situation for our customers, our partners and ourselves. Since the inception of Batch, almost 10 years ago, we've invested tremendous engineering resources towards building a resilient, high-availability, high-throughput messaging infrastructure able to support some of the largest e-commerce, media, banking and mobility players across Europe and beyond. Over the years, we've attracted tremendous talent, drawn to the mission-critical nature of our platform, passionately engaged in building the definitive customer engagement platform for the enterprise. Today, we're well aware that we haven't lived up to the reliability and uptime standards upheld for almost a decade. Following protocol and our philosophy of transparency, we're publishing a detailed account of what happened that day, walking you through some core concepts of our architecture, detailing changes we've already made and outlining what we plan to implement in the future.
## What happened?
### Hosting Provider incident (root cause)
The provider incident was tracked here: https://network.status-ovhcloud.com/incidents/8mq79l7wcx8p
Our provider, OVHCloud, encountered an electrical disjunction that resulted in the loss of half of an entire datacenter room. This is a very rare occurrence: it is the first time something like this has happened in 10 years. The nature of the faulty element to replace made the recovery longer than initially expected.
## Impact on our platform
Among the functions managed by the servers we lost, only one component's resilience was affected. The component that failed is the distributed message queue cluster of the selection engine. While we did not lose all of the cluster's servers, we lost too many at once, which made the cluster unavailable. Only a very small number of messages were still coming to and from the targeting engine, and this only worked for a little while until the system noticed it had a very high failure rate and halted. While the input (campaign API & dashboard) and outputs (sending the notifications to Apple, Google, etc.) were healthy, no more push notifications were going through, as we had nothing telling us **who** to send messages to.
## Timeline
_For clarity, this timeline only lists the most important events. All times are UTC+2._
**13:23** Our monitoring systems start showing alerts.
**13:54** The incident response process is triggered. At this point, part of the push notification system is starting to look unhealthy, but it is not fully down. A ticket is opened on our provider's support channels; we find out a couple of minutes later that our hosting provider had opened a public incident at 13:47.
**14:02** We open a status page incident. We're still analyzing the situation.
**14:40** We requalify the incident on the public status page as a "major outage".
**15:57** Our hosting provider estimates that servers will be back online around 18:00 UTC+2.
**17:23** Our hosting provider updates us with a new estimated resolution time: 21:00 UTC+2.
**17:57** We publish a status page update with detailed information about what works and what does not, and an estimated time of recovery.
**18:05** We change our push notification system's configuration so that push notifications that should have been sent more than one hour ago will not be sent when the system comes back online, to avoid massively sending out-of-date notifications.
**18:00-21:30** The on-call SRE and a couple of engineers continue to actively monitor the situation, waiting for our servers to come back online.
**21:45** The servers come back online in a 10-minute span. Transactional and Automation push notifications are now working as expected. We notice that campaigns are still down: we page an engineer to help diagnose this issue.
**22:15-23:00** We identify two issues:
* A component is stuck in a failure state and has to be restarted.
* A recent change in the targeting engine's error handling code slowed it down. This was not an issue for usual operations, as this bug was only triggered when things went wrong elsewhere. We had to hotfix this to recover from the incident.
**23:12** The hotfix goes into production. The selection engine starts catching up with the enqueued operations.
**23:27** All enqueued operations have been processed; the system is back in a normal state. We run some manual tests and mark the status page incident as Monitoring.
**09:33, the next day** Everything is stable; we mark the incident as resolved and start working on a postmortem.
## About the resolution time
The main decision point of the incident was whether to re-create a new message queue cluster or wait for the servers to come back online. With the information we had at the beginning of the incident and the provider's initial estimated recovery time, we decided against re-creating the cluster, opting to wait for the servers to come back. Unfortunately, the issue was bigger than expected, and recovery on our provider's end was postponed. Had we known that the repairs would take such a long time, we might have acted differently.
When the servers came back online, part of the system self-healed: Transactional and Automation push notifications started working as expected. Campaigns required a bit more work on our part:
* Eight hours' worth of work piled up in the message queues, which overloaded some services and databases.
* The incident triggered a bug in a code path that we didn't hit under normal circumstances.
This required manual intervention to add capacity, restart apps stuck in a failure state and push a hotfix to production.
## Forthcoming actions
We improved our incident response process to put an emphasis on clear communication: for our clients, this means that we will communicate more frequently on our status page even if we're still investigating, fixing or monitoring. To enhance our resilience against this type of issue, we're performing an in-depth review of how our services and components are physically distributed in our data centers/availability zones/rooms and their replication factors. We also improved our alarming systems in an effort to catch these issues before they turn into an outage.
The system has been working properly since our last communication. We now consider this incident resolved. We are working on a postmortem to publish in the coming days.
Push Campaigns are now working as expected. Any Push Campaign message that should have been sent before 20:10 UTC has been canceled and will not be sent. Pushes scheduled since then have been sent up to one hour late. All our services are now working as expected. We will continue to monitor the services for the next few hours.
Our hosting provider brought the servers back up around 19:40 UTC. Due to the unprecedented nature of the incident, the system did not self-heal as expected and required manual intervention to come back online. So far, the Push Transactional API and Push Automations are working as expected, but Push Campaigns remain unavailable. We're working on this issue and expect a resolution in the next couple of hours.
As our hosting provider is taking more time than originally expected to fix the issue, we would like to give a full recap of the situation.
The platform is impacted in the following ways:
- Most push notifications (mobile & web) are impacted and have not been sent since 12:00 UTC. This covers Campaigns, Trigger Automations and Transactional. In some rare cases the pushes went through, but at this time we are not able to give any more precise information.
Other products are not impacted:
- Email and SMS work as expected
- Data Ingestion (APIs, SDKs, flat file imports) works as expected
- In-app automations are working as expected
During this time, please avoid creating new campaigns, as we cannot assure you when they will be executed. Deleting a campaign that should already have started sending and recreating it might not work as expected and could result in duplicate notifications when the situation returns to normal.
Our hosting provider now expects the servers to be back online around 20:00 UTC. When the system is back in a healthy state, Batch will try to send the pending notifications. We will update the status page when this happens. To help ensure that your users do not get late/duplicate/unwanted notifications, *any notification that has been pending (that is, should have been sent) for more than an hour will be dropped*.
According to our provider, we expect the servers to be back online by the end of the afternoon (16:00 UTC). Most push orchestrations are impacted. Email & SMS orchestrations are not impacted.
We are continuing to work on a fix for this issue.
We are currently experiencing issues on certain servers, leading to delays and interruptions in the push notification services.
Report: "Email & SMS Automations delay"
Last update: Due to a logic error, Email & SMS automations were interrupted from 10:19 AM to 10:46 AM (GMT+2). The messages were not lost but were sent up to 27 minutes late. The issue has been fixed.
Report: "In-App Automation"
Last update: Create, delete and update operations for In-App Automations were unavailable:
- From September 2 at 18:30 to September 3 at 10:20
- On September 3 between 18:30 and 23:10
Times are GMT+2.
The automations were delivered as expected to the SDKs, meaning that they were displayed to end users. This incident was due to an unhealthy database. Due to the nature of the failure, the automatic failover failed to trigger on September 2. We performed a root cause analysis and planned an improvement of the failover mechanism, but the issue happened again while we were still working on deploying said improvement.
Report: "Stale In-App Automation content"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have noticed an issue with In-App campaigns where there is a desynchronization between the In-App Automations served to apps and what has been set up on the dashboard. Changes made to In-App campaigns from the dashboard & our APIs since 16:30 GMT+2 might have taken a significant time to be reflected in apps. Cappings, personalization and planned start/stop dates are not affected. If you haven't made any changes to your In-App Automations since 16:30 GMT+2, you are not affected by this incident.
Report: "Push delivery delays"
Last update: Here are some details on the incident.
# Timeline
**All times are in UTC and 24h time.**
On July 29 at around 11:00 we detected significant delays for a small subset of push notifications sent for campaigns or recurring automations. An investigation by our team revealed that one instance of the service responsible for processing these campaigns and automations was having trouble keeping up with the incoming data; at this point we decided to open this incident (at around 11:15). The issue was found and a mitigation was put in place at around 11:30. After this operation the service operated correctly again and started catching up on its delays. At around 11:46 all delays were resolved and everything was back to normal.
# Impact
A small subset of push notifications were sent with a delay. We estimate that around 10% of all notifications were delayed by up to 1h20.
# Root cause
There was an issue with our message queuing system at around 10:00 which caused it to corrupt some internal state for our service; this state corruption meant that the service could not process some campaigns and automations. When this was detected we proceeded to restart the service in order to remove the corrupted state. This was effective; however, there were follow-up issues due to the number of delayed campaigns and automations to process at the same time: the service was resource-constrained and using an ineffective configuration to process so many campaigns and automations at once. Once these problems were identified our team proceeded to mitigate them, after which the service was working correctly again and the delays were resolved.
# Conclusion
Although the original problem was an easy fix, the main issue was that we lacked efficient monitoring for this particular service, which resulted in much higher delivery delays than there should have been. In the near future we will work on improving the monitoring for this service so that we can address any issues much more quickly; in addition, we will also work on preventing these kinds of issues altogether.
This incident has been resolved.
A fix has been implemented and deployed. All delays have been addressed and push notification campaigns are being sent correctly again. Our team is still monitoring the situation.
The issue has been identified and a fix is being implemented.
We're aware of some delays processing push notification campaigns. Our team is currently investigating.
Report: "Email composer unavailable on Batch dashboard"
Last update: The fix was released at 12:50 PM UTC and the email composer is now working as expected.
Our Email Composer partner is currently facing issues affecting the availability of the Email Composer feature on the Batch dashboard. Consequently, creating or modifying email templates is not possible at the moment. The issue has been identified, and a fix is being prepared. Our partner's team is working diligently to resolve the problem and restore full functionality. We will provide an update once the fix is deployed or if there are any significant developments. We apologize for the inconvenience and appreciate your patience.
Report: "Elevated response time and error rate on REST APIs"
Last update: Due to a database issue, we encountered increased response times and elevated error rates on our APIs. The issue started over the weekend, impacting a very low number of requests, but intensified on 2024-07-22 around 9 AM UTC. We rolled out some changes to bring the error rates and response times back to low values. Failed requests have not been processed and can be retried. We will keep monitoring the situation while we work on fixing the root cause.
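If you need to replay calls that failed during this window, a minimal retry sketch along the following lines can help. It is only an illustration under assumptions: the endpoint URL, payload and the authentication header name are placeholders, not Batch's documented API, and you should only replay calls that are safe to repeat.

```python
# Hypothetical retry sketch for replaying failed API calls after an incident.
import time
import requests

def post_with_retry(url: str, payload: dict, api_key: str, attempts: int = 5) -> requests.Response:
    """POST `payload`, retrying on 5xx responses and network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            resp = requests.post(
                url,
                json=payload,
                headers={"X-Authorization": api_key},  # illustrative header name
                timeout=10,
            )
            if resp.status_code < 500:
                return resp  # success, or a 4xx that retrying will not fix
        except requests.RequestException:
            pass  # network error: fall through and retry
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between attempts
    raise RuntimeError(f"request to {url} still failing after {attempts} attempts")

# Example (placeholder endpoint and payload):
# post_with_retry("https://api.example.com/transactional/send", {"recipients": "..."}, "YOUR_API_KEY")
```

Exponential backoff keeps the replay gentle if error rates climb again while the root cause is still being fixed.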
Report: "Web SDK webservice issue"
Last update: The fix behaves as expected and data collection is back to normal.
We have identified an issue on our Web SDK webservices. The issue resulted in the loss of a significant amount of data coming from the SDKs between 2024-07-15 09:20:00 UTC and 2024-07-17 15:25:00 UTC. As the data will not be replayed, you might notice the following impact:
- Analytics (DAUs, Starts, Installs) for this period will not be accurate and will stay that way.
- Automations might not have been sent as expected.
We identified the issue and pushed a fix that we are monitoring. Mobile SDKs and REST APIs are not affected.
Report: "Push delivery delays"
Last update: This incident has been resolved.
The issue has been resolved. We're still working on understanding what exactly happened and monitoring the situation.
We've identified an issue with one of our database systems that caused significant push delivery delays of up to 20 minutes.
Report: "Incorrect cache on audiences"
Last update: From June 26 to July 6, we noticed that our audience cache system was inconsistent. If a campaign (push or email) used an audience, it may not have targeted all the users it should have. This problem has been fixed and everything is back to normal. We are still investigating to understand the cause of the incident.
Report: "iOS campaign push notifications delivery delays"
Last update: We had an issue with iOS push notifications being delayed multiple times during the day on July 10th. Only push campaigns were affected; transactional and trigger automations were not. If you created a push campaign for iOS around these times, you may have observed notification delivery delays of up to 15 minutes: 10:00 UTC, 12:00 UTC, 13:30 UTC. The root cause was a recent update to the service responsible for sending iOS notifications, which caused an unforeseen performance regression. It took our team some time to diagnose and understand that the issue was related to this update; after our investigation we rolled back the update and everything went back to normal.
Report: "Intermittent issues accessing the dashboard and communicating with our APIs"
Last update: We had a networking issue on a subset of our servers causing intermittent inaccessibility of our dashboard and APIs; between 08:15 UTC and 08:30 UTC you may have seen errors accessing the dashboard or any API. This was due to an upgrade of an internal system that handles networking, which had unintended side effects. A quick rollback resolved the issue and accessibility was restored. No notifications or emails were lost during this incident; however, notifications or emails sent from a trigger automation may have been delayed: we observed up to 6 minutes of delay. Once the issue was resolved, everything was sent immediately.
Report: "API Campaign - Timeouts"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
A database issue is causing slowdowns on our Push Campaign API, which can result in timeouts. We are working on a fix.
Report: "Push send test inoperant"
Last update:
# Timeline
**All times are in UTC and 24h time.**
On the morning of June 12, we decided to make a configuration change in one of our monitoring systems. This configuration was erroneous, but we did not notice it at the time. This caused the system to become unavailable.
On June 12 at 16:00, we received a notification from a client telling us that sending a test push notification was not working. At first, we ran some tests and were unable to reproduce the problem in our testing environment. As we investigated, we discovered that one of our apps was crashing and was no longer processing messages. This app is responsible for indexing new tokens. That means that our push delivery was operational for tokens indexed before the start of the incident, but no new token could be contacted (all push notifications were affected: campaign push notifications, automation push notifications, and test push notifications).
Data processing errors can happen, but this specific app depended on the monitoring system when handling errors, which is a bad pattern. Due to the monitoring system being down, we were unable to process messages. We have changed the error handling in this app to be independent of the monitoring system.
Once the hotfix was released on June 13 at 12:15, the app was once again operational and new tokens were once again being indexed. All tokens were finally indexed around 16:20.
Situation is back to normal: campaigns, automations and send tests should work with any token.
We expect the situation to be back to normal between 4 PM and 5 PM (UTC).
A fix has been implemented and we are processing delayed messages. The situation will be back to normal in the afternoon.
We have been experiencing difficulties indexing new push tokens since yesterday morning. Therefore:
- Push tests are only impacted when using new tokens.
- All sending can be impacted, including campaigns and automations (but only for new tokens).
We are still investigating.
Push send tests lead to a "no token found" error. Campaign & automation sending is not impacted. We are currently investigating this issue.
Report: "display conditions issues on email"
Last update: An incident occurred with our email builder due to third-party software that we use. This affects all email orchestrations containing display conditions that were edited between 2024-06-09 11:00:00 and 2024-06-11 09:30:00. The emails edited during that period have lost their conditions, and running these orchestrations will result in all conditional blocks being shown to the user. This concerns 22 orchestrations; if you are impacted, we will reach out to you to guide you on how to resolve the issue.
Report: "Imported email image issues on GMail apps"
Last update: Images imported by uploading a ZIP template in Batch failed to display in Gmail's web and mobile apps. This was due to Google's proxy failing to retrieve the image from our CDN for an unknown reason. Emails composed with the Email Composer are not affected by this. We have resolved the issue by switching to another CDN provider.
Report: "FCM push delivery issue"
Last update: We had an issue with our system that handles retries specifically for FCM push notifications: starting from 14:15 UTC and ending at 16:42 UTC, retries were not processed due to the underlying database being unavailable. After our team fixed the issue, the retry system started processing and sending retry notifications again. No notifications were lost in this incident, but some notifications could have taken more than 2h 30min to be sent. All delays have now been resolved, and retries are processed correctly.