Historical record of incidents for Batch
Report: "Partial API and Analytics Disruption"
Last update: Between June 9th 16:30 GMT+2 and June 10th 12:50 GMT+2, a database issue caused partial service disruptions affecting specific API endpoints and our analytics data processing pipeline.
Impact:
- Transactional API: Intermittent 500 internal server errors were returned by the get stats endpoint.
- Export API: Export requests took multiple hours to be processed.
- Dashboard: Analytics data production was delayed, causing outdated information to be displayed on the dashboard.
The engineering team identified the root cause of the instability and deployed a correction. Service functionality was restored and confirmed stable at 12:50 GMT+2; queued exports finished processing around 15:50 GMT+2.
Report: "CEP Audience API Issues"
Last update: Between June 10th at 17:00 (French time) and June 11th at 17:00, intermittent 500 errors occurred on create and update API calls that included attributes, affecting approximately 1% of total API traffic during this period.
Report: "Filtering imported tokens in Push v2 is disabled"
Last update: This morning, we identified an issue affecting imported push tokens in Push v2, which may have led to duplicate notifications being sent. As a precaution, we have temporarily disabled creating campaigns with imported tokens while we investigate the root cause and work on a fix. Push v1 remains fully operational and is not impacted by this issue.
Report: "Filtering imported tokens in Push v2 is disabled"
Last updateThis morning, we identified an issue affecting imported push tokens in Push v2, which may have led to duplicate notifications being sent.As a precaution, we have temporarily disabled creating campaigns with imported tokens while we investigate the root cause and work on a fix. Push v1 remains fully operational and is not impacted by this issue.
Report: "Dashboard Performance Issue"
Last update: Situation is back to normal.
The performance issues previously affecting the dashboard have been resolved. A fix has been applied, resulting in a much faster experience for all users. We will continue to monitor the situation to ensure everything remains stable.
We are currently experiencing slowness on our dashboard. Our teams are investigating the cause of this issue. We will keep you updated on the situation.
Report: "Profile updates and custom events processing delays on CEP"
Last update: This incident has been resolved.
The entire indexation backlog has now been cleared, and we're back to nominal status.
A fix has been deployed and the backlog of SDK events is being caught up. The indexing lag is now 30 minutes. We will update this incident when everything is back to normal.
We are experiencing a delay in indexing messages from our Mobile SDKs. The indexing of these messages is currently 1 hour and 30 minutes late; the problem has been present since 12:15 (GMT+2). This delay affects only the updating of Profile events, meaning:
- The profile view will not be updated in real time.
- Some automations will be sent later than expected.
Report: "Edge Network Instability"
Last update: Between April 24th, 2025 at 20:30 UTC and May 2nd, 2025 at 08:00 UTC, some users experienced network instabilities across our services. The issue was traced to a faulty load balancer in our infrastructure, which caused intermittent disruptions and impacted service stability during that period. The defective component has since been replaced, and systems have returned to normal operation. All services are now stable and functioning as expected. As a result of this incident, anomaly detection tools have been enhanced to better detect network-related issues at the edge of our infrastructure. These improvements are designed to enable more proactive identification and resolution of potential problems, further reinforcing the reliability of our services.
Report: "Profile updates and custom events processing delays on CEP"
Last update: This incident has been resolved.
The indexing backlog has now been cleared; we are keeping the platform under close surveillance.
The incident is still ongoing, and the indexing lag has been reduced to 1 hour and 40 minutes.
We are experiencing a delay in indexing messages from our Mobile SDKs. The indexing of these messages is currently 7 hours and 30 minutes late; the problem has been present since 17:15 (GMT+2). This delay affects only the updating of Profile events, meaning:
- The profile view will not be updated in real time.
- Some automations will be sent later than expected.
Report: "Profile updates and custom events processing delays on CEP"
Last update: The indexing backlog has now been cleared, and the incident is resolved.
A fix has been deployed on our indexing system, and the backlog of SDK events is being caught up. We will update this incident when everything is back to normal.
We are continuing to investigate this issue.
We are experiencing a delay in indexing messages from our Mobile SDKs. The indexing of these messages is currently 1 hour and 50 minutes late; the problem has been present since 17:15 (GMT+2). This delay affects only the updating of Profile events, meaning:
- The profile view will not be updated in real time.
- Some automations will be sent later than expected.
Report: "Slow loading of native attributes for MEP Campaign Targeting"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
Despite initial fixes being applied, we are still experiencing issues, and the situation remains under investigation. We will continue to keep you informed and provide further updates as soon as possible.
The issue has been identified and a fix is being implemented.
A fix has been implemented and we are monitoring the results.
Since Friday, April 18th, we have been experiencing delays in loading native attributes for targeting our MEP campaigns. The incident has been identified, and initial fixes are being applied. We will provide further updates as soon as possible.
Report: "CEP Automation delay"
Last update: This incident has been resolved.
The fix has been deployed and the system is catching up. The delayed messages should all be sent around 10:00 GMT+2.
Our Customer Engagement Platform is experiencing delays for some Push, SMS & Email automations. About 1.5% of the messages are delayed for at most 10 hours. We have identified the cause and are working on a fix. No messages have been lost: once the fix is deployed, the delayed messages will be sent.
Report: "MEP Push processing delays"
Last update: Our Mobile Engagement Platform experienced processing delays for some triggered push notifications between April 10th, 10:40 PM and April 11th, 10:05 AM. All messages were successfully processed, and none were lost.
Report: "CEP Push processing delays"
Last update: Our Customer Engagement Platform experienced processing delays for some triggered push notifications between April 5th, 3:30 PM and April 6th, 8:30 PM. All messages were successfully processed, and none were lost.
Report: "Dashboard connectivity issue"
Last update: Monitoring has confirmed that the issue is fully resolved and all systems are operating normally. This incident is now closed.
A permanent fix has been implemented, and the affected components remain stable. The incident will remain open as we continue to closely monitor the situation for the next 24 hours.
There are no changes at this time. The mitigation remains effective, the impacted components are stable, and we are continuing to monitor the situation closely.
A mitigation has been implemented and is functioning as expected. We are continuing to closely monitor the system and address any remaining issues.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We’re currently experiencing connectivity issues affecting the Dashboard. Our team is investigating and working to restore full access as soon as possible.
Report: "GDPR API issues"
Last update: We experienced an issue affecting our GDPR API, resulting in the latter being unavailable for a while. A fix was implemented at 5:00 A.M. UTC (6:00 A.M. CET) and we haven't noticed any new occurrence of this issue since then.
Report: "Audience API issues"
Last update: The issue has been fully resolved. A permanent fix was implemented at 21:00 UTC, and we've continued monitoring throughout the night to ensure stability.
A fix has been implemented and we are monitoring the results.
We’re currently experiencing an issue affecting our Audience API. Our engineering team is actively investigating and working to resolve it as quickly as possible.
Report: "Profile updates and custom events processing delays on CEP"
Last update: We experienced a delay in indexing messages from our Mobile SDKs. The indexing of these messages was up to 57 minutes late over the period from 20:00 to 21:15 (GMT+1). This delay affects only the updating of Profile events:
• The profile view will not be up to date.
• Some trigger automations will be sent later than expected.
The origin of this problem has not yet been precisely identified, but investigating it is now the top priority of our technical teams, to ensure that this does not happen again.
Report: "Profile updates and custom events processing delays"
Last update: The issue has now been resolved. Profile updates and custom events are processed without delays.
A fix has been implemented and the delays are currently decreasing. We're still monitoring the situation.
We are aware of an issue which is causing some delays in processing profile updates and custom events received from the SDK. Automations triggered by an SDK event may also experience a delay. We have identified the issue and are working on a fix.
Report: "Delivery issues"
Last update<b>Timeline</b> All times are in UTC and 24h time. On february 16 at around 9:45 we detected significant delays for a subset of push notifications, email and sms sent for campaigns or recurring automations. An investigation by our team revealed that one instance of the service responsible for processing these campaigns or automations was having trouble keeping up with the incoming data. The issue was found and a mitigation was put in place immediately. After this operation the service operated correctly again and started catching up its delays. At around 9:50 all delays were resolved and everything was back to normal. <b>Impact</b> A small subset of push notifications, emails and sms were sent with a delay. We estimate that around 4% of all messages were delayed up to 7h. <b>Root cause</b> There was an issue with our profile selection system at around 3:00 which caused a small part of the processing to halt. Once the problem was identified our team proceeded to mitigate it, after which the service was working correctly again and the delays were resolved. <b>Conclusion</b> Although the original problem was an easy fix, the main issue was that we lacked efficient monitoring for this particular service which resulted in much higher delivery delays than it should have. In the near future we will work on improving the monitoring for this service so that we can address any issues much more quickly; in addition we will also work on preventing these kind of issues altogether.
Report: "Network issue on our hosting provider"
Last update:
# Foreword
At 01:30 GMT+1 on 02/02/2024, our on-call team was alerted to a potential issue on the platform. At 01:58 GMT+1, it was identified that the network was down on 189 servers. The physical network interfaces were down due to a misconfiguration from our hosting provider following a hardware failure, rendering immediate action impossible. Approximately 20 minutes later, some servers began to come back online spontaneously, leading to further investigation and attempts to restore services.
# Fault
The network failure affected 189 servers, causing their physical network interfaces to go down. Once our hosting provider corrected the configuration on the network, our servers came back online. While some servers were restored by automated systems, others required manual intervention.
# Impact
For approximately 1 hour and 35 minutes, between 01:25 GMT+1 and 03:00 GMT+1 on 02/02/2024, various core and optional services experienced partial or complete downtime. The incident affected several components:
* **CEP Core services**: Push delivery, Data ingestion, and Dashboard were partially down.
* **MEP Core services**: Push delivery, In-app delivery, and Data analysis experienced partial downtime.
* **APIs**: Several APIs, including Audience API, Export API, and Campaign API, were partially down (flapping errors due to the time needed by the system to auto-heal).
* **Optional services**: Inbox, Webhook, and Custom exports were partially down.
Despite the downtime, there was no impact on email and SMS delivery.
Regarding potential data loss: Requests accepted by our APIs during the incident were ingested, though some may have experienced delays. All retries made after **03:00 GMT+1** were successfully processed, and the data was properly ingested. Campaigns scheduled to launch during the incident did not trigger at the expected time but were restored once the network came back online. However, a small portion of these campaigns expired after **two hours** and were not sent.
# Timeline
01:30 GMT+1 - Pager alert triggered; our on-call SRE begins investigation.
01:58 GMT+1 - Network failure identified on 189 servers; physical interfaces down.
02:18 GMT+1 - Some servers begin to recover automatically; further investigation initiated.
03:00 GMT+1 - Most services restored; incident reported to infrastructure provider.
19:00 GMT+1 - Manual intervention by our infrastructure provider to restore the few remaining servers; all services back online by 19:00 GMT+1.
After further investigation, we now consider this incident resolved. The root cause has been identified as a network issue. We are preparing a detailed analysis, which will be shared in the coming days to provide further insights into the incident.
All our services have returned to normal. We continue to closely monitor the platform.
A significant part of our infrastructure became unreachable at 01:30 GMT+2. Most of the affected servers are now accessible again, and services are gradually returning to normal. The status page will be updated progressively to reflect component availability. We will provide another update in 30 minutes, at 04:00 GMT+2.
We're currently experiencing network problems that may impact all services. The issue likely originates with our hosting provider. We will keep you posted as the investigation continues.
Report: "Email composer down"
Last update: We encountered an issue with our email composer between 15:00 GMT+1 and 15:20 GMT+1, which prevented users from opening the composer or saving emails. The issue has now been resolved, and we continue to monitor the situation.
Report: "Inconsistent Trigger Automation Reentry on CEP"
Last update: On January 15th, from 10:20 GMT+1 to 17:20 GMT+1, we identified an issue that affected the correct computation of certain trigger automation events. This may have impacted less than 5% of events triggered during this period, potentially preventing some users from reentering a trigger automation if it had already been completed previously. The issue has since been fully resolved, and no further inconsistencies have been observed.
Report: "Delay in indexing Profile events for Web and Mobile SDKs"
Last update: This incident has been resolved.
Indexing delays are fully resolved. We will continue to monitor the platform to ensure a smooth return to normal.
We have identified the issue and implemented a fix. The indexing delay is decreasing.
Since 01:25 GMT+1, we have been experiencing delays in processing Profile events from our Mobile and Web SDKs. The delay has now reached 5 hours. This delay affects only the updating of Profile events:
• The profile view will not be up to date.
• Some trigger automations will be sent later than expected.
We are actively investigating this issue and will keep you updated on its resolution.
Report: "Push delivery delays"
Last update: Between 1:15 PM and 2:20 PM, we experienced delays in processing push notification campaigns. The issue has been resolved, and operations have returned to normal.
Report: "Webhook delays"
Last update: This incident has been resolved. All delayed webhook events were retried once the fix was implemented, and new webhook events are sent without delays.
We are aware of some delays sending webhook events to customers' endpoints. A fix has been implemented and events are being sent correctly again. We are monitoring the situation.
Report: "Profile API ingestion delay"
Last update: All indexing delays are now eliminated. Requests sent to the Profile API are processed in real time:
• Targeting is performed on up-to-date data.
• The profile view is up to date.
• Trigger automations are sent on time.
Indexing times continue to improve, with Profile API events now running an hour behind schedule.
Since 12:45 GMT+1 we have been experiencing delays in processing requests from the Profile API. The delay is currently 1 hour 40 minutes, and the system is gradually catching up. This delay impacts the updating of profile attributes and events:
• Targeting may be performed on data that is older than anticipated.
• The profile view will not be up to date.
• Trigger automations will be sent later than expected.
Attributes and events from the SDK are not affected. We are monitoring the issue and expect a resolution by 17:40 GMT+1; we will keep you informed of the situation every 30 minutes.
Report: "Notifications, SMS, Email, API & SDK Web services down"
Last update:
On Thursday, November 7, 2024, an unexpected issue occurred while migrating one of our key message queue clusters. This issue resulted in a major outage of our delivery services and APIs.
## What Happened?
A mandatory migration was underway for one of our core message queue clusters. This migration was a prerequisite for expanding our infrastructure across multiple data centers. Unexpectedly, at the final step of the migration, a key sub-component of our message queue cluster (based on Kafka) encountered issues communicating with other nodes in the cluster. The cluster became unavailable, causing our applications to stop processing messages. We detected this incident immediately and determined that it was related to a bug triggered by an edge case in the migration process. However, we chose not to simply roll back the migration, as we needed to ensure data integrity and prevent any potential loss. Finishing this migration was what ultimately solved the issue and allowed us to resume normal operations.
## Impact on our platform
* **Push Campaigns:** Delayed by up to one hour.
* **Transactional Push Notifications:** Delayed by up to two hours.
* **APIs:** Experienced a 16% error rate across all services, except the Custom Audience API, which continued to encounter errors until Nov. 8th, 09:31 GMT+1.
  • Successful API calls (returning a success status code) were enqueued but not processed during the incident.
  • Processing of enqueued requests began around 23:00 GMT+1 and concluded by Nov. 8th, 00:40 GMT+1.
  • **Action Required:** Retry any important failed API calls, as they were not enqueued.
* **SDK Web Services:**
  • In-App Automations with “Re-evaluate targeting just before display” did not function as expected.
  • Events, attribute updates, and push opens from the mobile SDK and plugins will be retried when users reopen their apps.
  • Events, attribute updates, and push opens from the Web SDK have been partially lost.
## Timeline & mitigation actions
_For clarity, this timeline only lists the most important events. All times are GMT+1._
### 17:40
Our alerting system detected that part of our core services were not working properly due to the unavailability of one of our key message queue clusters. Since we were working on this cluster, we immediately identified the root cause and began our investigation.
### 18:08
After assessing the severity of the incident, we declared it publicly via our status page and started working on various plans to resume operations. We decided not to force the migration or roll back until we fully understood the root cause and could ensure no data would be lost.
### 18:20
In order to restore all message delivery services as fast as possible, we decided to implement a quick and temporary workaround. This change involved removing an internal feedback loop necessary for all post-delivery actions (analytics, marketing pressure, inbox).
### 18:48
The workaround was deployed and confirmed to be working. Messages are being sent again.
### 21:30
We decided to resume the migration procedure.
### 22:30
All nodes were successfully migrated, and the cluster started healing itself. We then reverted the workaround and restarted all services using this cluster.
### 23:30
All services seemed operational. The incident remained open and under monitoring.
### 09:26, the next day
Due to a flood of alerts caused by the incident, the monitoring for our Custom Audience API was broken. After an in-depth post-incident investigation, this monitoring issue was detected and the Custom Audience API was fixed.
### 14:26
We marked the incident as resolved after verifying that all services were functioning as expected.
## Forthcoming actions
As this migration was part of a long-term plan to build a more resilient infrastructure (preventing this very issue from happening again), we will continue deployment as planned.
The system has been functioning properly since our last communication, and we now consider this incident resolved.
Summary of Impact:
• Push Campaigns: Delayed by up to one hour.
• Transactional Push Notifications: Delayed by up to two hours.
• APIs: Experienced a 16% error rate across all services, except the Custom Audience API, which continued to encounter errors until Nov. 8th, 09:31 GMT+1.
  • Successful API calls (returning a success status code) were enqueued but not processed during the incident.
  • Processing of enqueued requests began around 23:00 GMT+1 and concluded by Nov. 8th, 00:40 GMT+1.
  • Action Required: Retry any important failed API calls, as they were not enqueued.
• SDK Web Services:
  • In-App Automations with “Re-evaluate targeting just before display” did not function as expected.
  • Events, attribute updates, and push opens from the mobile SDK and plugins will be retried when users reopen their apps.
  • Events, attribute updates, and push opens from the Web SDK have been partially lost.
Analytics and Tracking Limitations:
To restore campaign functionality as quickly as possible, we temporarily disabled internal tracking of push, email, and SMS deliveries between 18:40 GMT+1 and 23:20 GMT+1. As a result:
• Analytics for messages sent during this period are unavailable and cannot be recovered.
• Open rate percentages for this timeframe are unreliable.
• Marketing pressure features (Global Frequency, Label Frequency, and Recurring Automation Cappings) do not account for push, email, or SMS deliveries during this interval.
We are exploring ways to partially regenerate missing analytics data.
Next Steps: Our team is preparing a comprehensive postmortem, which we plan to publish next week. We apologize for the inconvenience caused and appreciate your understanding.
The Custom Audience API encountered errors until November 8th, 9:31 GMT+1. It is now working as expected. We will send an update later today with more information about the impacted components. A full post-mortem is planned for next week.
The previous operation was successfully completed. SDKs and APIs are functioning correctly. Data ingestion has been back online since 22:45 GMT+1. From 18:47 GMT+1 to 23:20 GMT+1, no analytics data was collected, and unfortunately, we will not be able to recover this data. As a result, you may notice abnormal open rates, as messages were sent during this period but acknowledgment information was not collected. Push notifications won't show up in the Inbox feature either. Our teams are continuing to monitor the situation. We will publish a post-mortem next week.
To prepare our platform for the upcoming operation, we will temporarily suspend all API and SDK web services. During this time, data ingestion will not be possible (you will receive HTTP 500 errors). We will inform you as soon as data ingestion is restored. Analytics are still unavailable. We will post another update in an hour.
Our teams are still working on a complete solution. We will post another update in an hour.
The workaround is now also implemented to resume Transactional Push. Transactional Push will now be sent again, and any queued pushes that were delayed are being delivered progressively. Our teams are still working on a complete solution.
The workaround is now also implemented to resume Email & SMS campaigns. Email & SMS will now be sent again, and any queued Email & SMS that were delayed are being delivered progressively. Due to this workaround, success and error analytics will not be available on the dashboard, APIs, or exports. Our teams are still working on a complete solution.
We are continuing to work on a fix for this issue.
We have located the root cause but are still determining exactly which components are affected. We have implemented a workaround to resume APNS, FCM, and Web Push notifications for campaigns. Notifications will now be sent again, and any queued notifications that were delayed are being delivered progressively. Due to this workaround, success and error analytics will not be available on the dashboard, APIs, or exports. Our teams are still actively working to fully restore the remaining affected services.
We are currently experiencing technical issues since 17:40 GMT+1. Notifications, Email, SMS, our API, and SDK web services are all down. Our team is actively investigating the situation to restore the service as quickly as possible. We will keep you updated as soon as we have more information.
Report: "Batch services unavailable"
Last update: We experienced 10 minutes of downtime across all of our services, due to network saturation on our hosting provider's side. This downtime lasted from 14:25 to 14:35 GMT+1. We are back in a nominal state with our entire platform available.
Report: "In-App Automation edition errors"
Last update: Create, delete and update operations of In-App Automations from the Dashboard and API suffered from high error rates from October 27 at 7:15 GMT+1 to October 27 at 14:32 GMT+1. The running automations were delivered as expected to the SDKs, meaning that they were displayed to end users.
Report: "Partial delay in the indexation of Profile data from Mobile SDKs"
Last update: No suspicious behavior was detected during the monitoring period; our corrective patch has brought the system back to its nominal status.
A fix has been applied and we're monitoring the results. All the delays have been cleared and we're back in real time mode.
Over the weekend, we experienced two periods of slowdown in the indexing of Profile events from our mobile SDKs. Following mitigation actions by our team this morning, the delay has been resolved for now. These periods occurred between October 12 at 8:00 and 23:45 (UTC+2), and between October 13 at 10:30 and October 14 at 9:20 (UTC+2). We are investigating the root cause of this incident. These episodes of delay may have had an impact on the triggering of automations set up with the events we were slow to record.
Report: "Delay in APNS Push Notifications Delivery"
Last update: We experienced a delay in the delivery of push notifications through APNS (Apple Push Notification Service) between 13:17 UTC+2 and 14:04 UTC+2. All delayed notifications were successfully sent by 14:06 UTC+2. The issue is now resolved.
Report: "Push campaign analytics delay"
Last update: Analytics of Push campaigns & recurring automations have been recomputed; the incident is resolved. We are continuing to work on the "Devices synced" metric. Update (2024-10-04): The "Devices synced" metric data has been restored.
The issue has been narrowed down to a database problem and fixed. This had the following impact on the system:
- Analytics of Push campaigns & recurring automations sent between September 30, 02:00 GMT+2 and 16:30 GMT+2 are unavailable. This is a temporary issue that will be resolved when the data gets recomputed on October 1 around 05:00 GMT+2. Your push analytics will be available then.
- The "Devices synced" metric of In-app automations doesn't take into account data prior to September 30. We are looking into ways to restore this data.
Our team is monitoring the situation. We will mark this incident as resolved once we confirm that the nightly recomputation has been performed as expected.
We are investigating an issue with push campaign & recurring automation analytics, where in some cases the send/open counts are showing very low values. Pushes were sent as expected. Campaigns older than September 30th 2:00 GMT+2 have accurate analytics. We are working on restoring those metrics.
Report: "Instability of API Custom Audience"
Last update: We encountered an elevated error rate between 17:18 GMT+2 and 17:42 GMT+2.
Report: "Partial delay in the indexation of native and custom installation data"
Last update: Indexation has been fully operational since September 30, 2024, at 16:35 GMT+2. This incident is resolved.
The fix has been released, and since 16:35 GMT+2, the delayed native and custom installation data have been indexed. No data was lost. Our team continues to monitor the situation.
We are experiencing a partial delay in the indexation of native & custom installation data on our push platform. This means that Push campaign and recurring automation targeting is performed on older data (from 14:25 GMT+2) and does not reflect the latest changes. No information is lost during this delay. Trigger automations and In-Apps are not affected. Our team is working on a resolution.
Report: "iOS push significant bounce rate"
Last update:
## What happened?
In an effort to improve the performance of our iOS push notification infrastructure, we started rolling out an upgrade of our backend applications. This phased rollout targeted 2% of our daily volume of iOS Campaigns & Recurring automations. The new version contained a bug that caused applications sharing the same APNs* p8** push configuration across multiple apps to experience a high bounce rate, leading to incorrect “apns2_device_token_not_for_topic” errors. Apps using p12-based push configurations were **not** affected. Once we were made aware of the problem, we stopped the rollout and reverted it.
_*APNs = Apple Push Notification service_
_**p8 = private key used to authenticate against Apple's servers_
## Impact on our platform
From September 26, 14:45 GMT+2 to September 27, 10:30 GMT+2, part of our iOS Campaign & Recurring automation traffic returned “apns2_device_token_not_for_topic” errors when it should not have. We have not tried to send the failed notifications a second time. The analytics of the affected Campaigns/Recurring automations are accurate.
All other components worked as expected:
* iOS Transactional & Trigger automation pushes
* Android & Webpush: all push notifications
* In-apps
* Email
* SMS
* APIs
* Dashboard
* Analytics
## Timeline
_For clarity, this timeline only lists the most important events. All times are GMT+2._
**September 26, 14:45** We begin the gradual deployment of our improvements to the iOS push notification architecture, starting with 2% of our daily volume of iOS Campaigns & Recurring automations.
**September 27, 09:57** We notice a significant number of bounces on campaigns since September 26 at 18:30. At this point we check whether it is related to our deployment from the day before.
**September 27, 10:05** We identify a campaign impacted by this problem to analyze the cause.
**September 27, 10:20** We open an incident and confirm that it is related to the previous day's deployment.
**September 27, 10:30** We roll back the changes to the previous iOS push notification architecture.
**September 27, 11:34** We open a status page incident. We are still analyzing the impacted applications.
**September 27, 14:50** The problem is reclassified as major: it concerns few applications, but the applications concerned were unable to send a major part of their campaign & recurring automation push notifications.
## Actions
The bug has been fixed, but we have postponed the rollout to a later time (we reverted the code to its original state before reintroducing it). We are looking into improving our alarming to catch this specific kind of error at low volumes.
We had an issue with some of our iOS push notifications that caused a significant bounce rate when the same push credentials are used across multiple apps. We were in the middle of a gradual deployment, so the incident did not impact all apps. The incident impacted notifications sent between September 26th at 14:45 and September 27th at 11:00 (GMT+2). The bounced notifications will not be retried.
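For readers unfamiliar with the p8/topic relationship described in the postmortem above, here is a minimal, hypothetical sketch of a token-based (p8) APNs request. It is not Batch's implementation; the team ID, key ID, bundle ID, device token and key file are placeholders. The point it illustrates is that a single p8 key can authenticate pushes for every app in a team, so the per-request apns-topic header is what ties a request to one specific app; if the device token belongs to a different app than the topic, APNs rejects the request with a DeviceTokenNotForTopic error, which is the kind of mismatch this incident surfaced.

```python
# Hypothetical sketch of a token-based (p8) APNs request.
# Requires: pyjwt + cryptography (for ES256) and httpx with the http2 extra.
import time
import jwt    # pyjwt
import httpx

TEAM_ID = "ABCDE12345"                          # placeholder Apple Developer team ID
KEY_ID = "XYZ9876543"                           # placeholder p8 key ID
BUNDLE_ID = "com.example.app"                   # must match the app the token belongs to
DEVICE_TOKEN = "hex-device-token-from-the-sdk"  # placeholder

with open("AuthKey_XYZ9876543.p8") as f:        # placeholder key file
    signing_key = f.read()

# One p8 key can sign requests for every app in the team; the apns-topic
# header below is what binds this particular request to a single app.
provider_token = jwt.encode(
    {"iss": TEAM_ID, "iat": int(time.time())},
    signing_key,
    algorithm="ES256",
    headers={"kid": KEY_ID},
)

with httpx.Client(http2=True) as client:
    resp = client.post(
        f"https://api.push.apple.com/3/device/{DEVICE_TOKEN}",
        headers={
            "authorization": f"bearer {provider_token}",
            "apns-topic": BUNDLE_ID,   # wrong app here => 400 DeviceTokenNotForTopic
            "apns-push-type": "alert",
        },
        json={"aps": {"alert": "Hello"}},
    )
    print(resp.status_code, resp.text)
```

This also lines up with the report that p12-based configurations were unaffected, since a standard p12 push certificate is issued for a single app rather than shared across a team.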
Report: "Dashboard is partially unavailable"
Last update: The dashboard is now fully operational. This incident has been resolved.
Due to a component experiencing a spike in usage, we experienced a partial service outage. During the incident:
- The Dashboard was partially unavailable. About 10% of requests failed; retrying worked in some cases.
- The Campaign Stats API was temporarily unavailable.
- All other components (delivery, SDK, APIs) functioned as expected.
The issue has been resolved by allocating additional hardware resources to this component.
We are encountering some connectivity issues with our dashboard web service. Our team is working on a resolution.
Report: "Hosting provider partial downtime"
Last update:
On Thursday, September 10, 2024, some key messaging components of the Batch platform suffered an exceptionally long outage of 9 hours following downtime at one of our hosting providers, resulting in a very disappointing and frustrating situation for our customers, our partners and ourselves. Since the inception of Batch, almost 10 years ago, we've invested tremendous engineering resources towards building a resilient, high-availability, high-throughput messaging infrastructure able to support some of the largest e-commerce, media, banking and mobility players across Europe and beyond. Over the years, we've attracted tremendous talent, drawn to the mission-critical nature of our platform, passionately engaged in building the definitive customer engagement platform for the enterprise. Today, we're well aware that we haven't lived up to the reliability and uptime standards upheld for almost a decade. Following protocol and our philosophy of transparency, we're publishing a detailed account of what happened that day, walking you through some core concepts of our architecture, detailing changes we've already made and outlining what we plan to implement in the future.
## What happened?
### Hosting Provider incident (root cause)
The provider incident was tracked here: https://network.status-ovhcloud.com/incidents/8mq79l7wcx8p
Our provider, OVHCloud, encountered an electrical disjunction that resulted in the loss of half of an entire datacenter room. This is a very rare occurrence: it is the first time something like this has happened in 10 years. The nature of the faulty element to replace made the recovery longer than initially expected.
## Impact on our platform
Among the functions managed by the servers we lost, only one component's resilience was affected. The component that failed is the distributed message queue cluster of the selection engine. While we did not lose all of the cluster's servers, we lost too many at once, which made the cluster unavailable. Only a very small number of messages were still coming to and from the targeting engine, and this only worked for a little while until the system noticed it had a very high failure rate and halted. While the input (campaign API & dashboard) and outputs (sending the notifications to Apple, Google, etc.) were healthy, no more push notifications were going through, as we had nothing telling us **who** to send messages to.
## Timeline
_For clarity, this timeline only lists the most important events. All times are UTC+2._
**13:23** Our monitoring systems start showing alerts.
**13:54** The incident response process is triggered. At this point, part of the push notification system is starting to look unhealthy, but it is not fully down. A ticket is opened on our provider's support channels; we find out a couple of minutes later that our hosting provider had opened a public incident at 13:47.
**14:02** We open a status page incident. We're still analyzing the situation.
**14:40** We requalify the incident on the public status page as a "major outage".
**15:57** Our hosting provider estimates that servers will be back online around 18:00 UTC+2.
**17:23** Our hosting provider updates us with a new estimated resolution time: 21:00 UTC+2.
**17:57** We publish a status page update with detailed information about what works and what does not, and an estimated time of recovery.
**18:05** We change our push notification system's configuration so that push notifications that should have been sent more than one hour ago will not be sent when the system comes back online, to avoid massively sending out-of-date notifications.
**18:00-21:30** The on-call SRE and a couple of engineers continue to actively monitor the situation, waiting for our servers to come back online.
**21:45** The servers come back online in a 10-minute span. Transactional and Automation push notifications are now working as expected. We notice that campaigns are still down: we page an engineer to help diagnose this issue.
**22:15-23:00** We identify two issues:
* A component is stuck in a failure state and has to be restarted.
* A recent change in the targeting engine's error handling code slowed it down. This was not an issue for usual operations, as this bug was only triggered when things went wrong elsewhere. We had to hotfix this to recover from the incident.
**23:12** The hotfix goes into production. The selection engine starts catching up with the enqueued operations.
**23:27** All enqueued operations have been processed; the system is back in a normal state. We run some manual tests and mark the status page incident as Monitoring.
**09:33, the next day** Everything is stable; we mark the incident as resolved and start working on a postmortem.
## About the resolution time
The main decision point of the incident was whether to re-create a new message queue cluster or wait for the servers to come back online. With the information we had at the beginning of the incident and the provider's initial estimated recovery time, we decided against re-creating the cluster, opting to wait for the servers to come back. Unfortunately, the issue was bigger than expected, and recovery on our provider's end was postponed. Had we known that the repairs would take such a long time, we might have acted differently.
When the servers came back online, part of the system self-healed: Transactional and Automation push notifications started working as expected. Campaigns required a bit more work on our part:
* Eight hours' worth of work piled up in the message queues, which overloaded some services and databases.
* The incident triggered a bug in a code path that we didn't hit under normal circumstances.
This required manual intervention to add capacity, restart apps stuck in a failure state and push a hotfix to production.
## Forthcoming actions
We improved our incident response process to put an emphasis on clear communication: for our clients, this means that we will communicate more frequently on our status page even if we're still investigating, fixing or monitoring. To enhance our resilience against this type of issue, we're performing an in-depth review of how our services and components are physically distributed in our data centers/availability zones/rooms and their replication factors. We also improved our alarming systems in an effort to catch these issues before they turn into an outage.
The system has been working properly since our last communication. We now consider this incident resolved. We are working on a postmortem to publish in the coming days.
Push Campaigns are now working as expected. Any Push Campaign message that should have been sent before 20:10 UTC has been canceled and will not be sent. Pushes scheduled since then have been sent up to one hour late. All our services are now working as expected. We will continue to monitor the services for the next few hours.
Our hosting provider brought the servers back up around 19:40 UTC. Due to the unprecedented nature of the incident, the system did not self-heal as expected and required manual intervention to come back online. So far, the Push Transactional API and Push Automations are working as expected, but Push Campaigns remain unavailable. We're working on this issue and expect a resolution in the next couple of hours.
As our hosting provider is taking more time than originally expected to fix the issue, we would like to give a full recap of the situation.
The platform is impacted in the following ways:
- Most push notifications (mobile & web) are impacted and have not been sent since 12:00 UTC. This covers Campaigns, Trigger Automations and Transactional. In some rare cases the pushes went through, but at this time we are not able to give any more precise information.
Other products are not impacted:
- Email and SMS work as expected
- Data Ingestion (APIs, SDKs, flat file imports) works as expected
- In-app automations are working as expected
During this time, please avoid creating new campaigns, as we cannot assure you when they will be executed. Deleting a campaign that should already have started sending and recreating it might not work as expected and could result in duplicate notifications when the situation returns to normal.
Our hosting provider now expects the servers to be back online around 20:00 UTC. When the system is back in a healthy state, Batch will try to send the pending notifications. We will update the status page when this happens. To help ensure that your users do not get late/duplicate/unwanted notifications, *any notification that has been pending (that is, should have been sent) for more than an hour will be dropped*.
According to our provider, we expect the servers to be back online by the end of the afternoon (16:00 UTC). Most push orchestrations are impacted. Email & SMS orchestrations are not impacted.
We are continuing to work on a fix for this issue.
We are currently experiencing issues on certain servers, leading to delays and interruptions in the push notification services.
Report: "Email & SMS Automations delay"
Last update: Due to a logic error, Email & SMS automations were interrupted from 10:19 AM to 10:46 AM (GMT+2). The messages were not lost but were sent up to 27 minutes late. The issue has been fixed.
Report: "In-App Automation"
Last update: Create, delete and update operations for In-App Automations were unavailable:
- From September 2 at 18:30 to September 3 at 10:20
- On September 3 between 18:30 and 23:10
Times are GMT+2.
The automations were delivered as expected to the SDKs, meaning that they were displayed to end users. This incident was due to an unhealthy database. Due to the nature of the failure, the automatic failover failed to trigger on September 2. We performed a root cause analysis and planned an improvement of the failover mechanism, but the issue happened again while we were still working on deploying said improvement.
Report: "Stale In-App Automation content"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We have noticed an issue with In-App campaigns where there is a desynchronization between the In-App Automations served to apps and what has been set up on the dashboard. Changes made to In-App campaigns from the dashboard & our APIs since 16:30 GMT+2 might have taken a significant time to be reflected in apps. Cappings, personalization and planned start/stop dates are not affected. If you haven't made any changes to your In-App Automations since 16:30 GMT+2, you are not affected by this incident.
Report: "Push delivery delays"
Last update: Here are some details on the incident.
# Timeline
**All times are in UTC and 24h time.**
On July 29 at around 11:00 we detected significant delays for a small subset of push notifications sent for campaigns or recurring automations. An investigation by our team revealed that one instance of the service responsible for processing these campaigns and automations was having trouble keeping up with the incoming data; at this point we decided to open this incident (at around 11:15). The issue was found and a mitigation was put in place at around 11:30. After this operation the service operated correctly again and started catching up on its delays. At around 11:46 all delays were resolved and everything was back to normal.
# Impact
A small subset of push notifications were sent with a delay. We estimate that around 10% of all notifications were delayed by up to 1h20.
# Root cause
There was an issue with our message queuing system at around 10:00 which caused it to corrupt some internal state for our service; this state corruption meant that the service could not process some campaigns and automations. When this was detected we proceeded to restart the service in order to remove the corrupted state. This was effective; however, there were follow-up issues due to the number of delayed campaigns and automations to process at the same time: the service was resource-constrained and using an ineffective configuration to process so many campaigns and automations at once. Once these problems were identified our team proceeded to mitigate them, after which the service was working correctly again and the delays were resolved.
# Conclusion
Although the original problem was an easy fix, the main issue was that we lacked efficient monitoring for this particular service, which resulted in much higher delivery delays than there should have been. In the near future we will work on improving the monitoring for this service so that we can address any issues much more quickly; in addition, we will also work on preventing these kinds of issues altogether.
This incident has been resolved.
A fix has been implemented and deployed. All delays have been addressed and push notification campaigns are being sent correctly again. Our team is still monitoring the situation.
The issue has been identified and a fix is being implemented.
We're aware of some delays processing push notification campaigns. Our team is currently investigating.
Report: "Email composer unavailable on Batch dashboard"
Last update: The fix was released at 12:50 PM UTC and the email composer is now working as expected.
Our Email Composer partner is currently facing issues affecting the availability of the Email Composer feature on the Batch dashboard. Consequently, creating or modifying email templates is not possible at the moment. The issue has been identified, and a fix is being prepared. Our partner's team is working diligently to resolve the problem and restore full functionality. We will provide an update once the fix is deployed or if there are any significant developments. We apologize for the inconvenience and appreciate your patience.
Report: "Elevated response time and error rate on REST APIs"
Last update: Due to a database issue, we encountered increased response times and elevated error rates on our APIs. The issue started over the weekend, impacting a very low number of requests, but intensified on 2024-07-22 around 9 AM UTC. We rolled out some changes to bring the error rates and response times back to low values. Failed requests have not been processed and can be retried. We will keep monitoring the situation while we work on fixing the root cause.
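If you need to replay calls that failed during this window, a minimal retry sketch along the following lines can help. It is only an illustration under assumptions: the endpoint URL, payload and the authentication header name are placeholders, not Batch's documented API, and you should only replay calls that are safe to repeat.

```python
# Hypothetical retry sketch for replaying failed API calls after an incident.
import time
import requests

def post_with_retry(url: str, payload: dict, api_key: str, attempts: int = 5) -> requests.Response:
    """POST `payload`, retrying on 5xx responses and network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            resp = requests.post(
                url,
                json=payload,
                headers={"X-Authorization": api_key},  # illustrative header name
                timeout=10,
            )
            if resp.status_code < 500:
                return resp  # success, or a 4xx that retrying will not fix
        except requests.RequestException:
            pass  # network error: fall through and retry
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between attempts
    raise RuntimeError(f"request to {url} still failing after {attempts} attempts")

# Example (placeholder endpoint and payload):
# post_with_retry("https://api.example.com/transactional/send", {"recipients": "..."}, "YOUR_API_KEY")
```

Exponential backoff keeps the replay gentle if error rates climb again while the root cause is still being fixed.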
Report: "Web SDK webservice issue"
Last update: The fix behaves as expected and data collection is back to normal.
We have identified an issue on our Web SDK webservices. The issue resulted in the loss of a significant amount of data coming from the SDKs between 2024-07-15 09:20:00 UTC and 2024-07-17 15:25:00 UTC. As the data will not be replayed, you might notice the following impact:
- Analytics (DAUs, Starts, Installs) for this period will not be accurate and will stay that way.
- Automations might not have been sent as expected.
We identified the issue and pushed a fix that we are monitoring. Mobile SDKs and REST APIs are not affected.
Report: "Push delivery delays"
Last update: This incident has been resolved.
The issue has been resolved. We're still working on understanding what exactly happened and monitoring the situation.
We've identified an issue with one of our database systems that caused significant push delivery delays of up to 20 minutes.
Report: "Incorrect cache on audiences"
Last update: From June 26 to July 6, we noticed that our audience cache system was inconsistent. If a campaign (push or email) used an audience, it may not have targeted all the users it should have. This problem has been fixed and everything is back to normal. We are still investigating to understand the cause of the incident.
Report: "iOS campaign push notifications delivery delays"
Last update: We had an issue with iOS push notifications being delayed multiple times during the day on July 10th. Only push campaigns were affected; transactional and trigger automations were not. If you created a push campaign for iOS around these times, you may have observed notification delivery delays of up to 15 minutes: 10:00 UTC, 12:00 UTC, 13:30 UTC. The root cause was a recent update to the service responsible for sending iOS notifications, which caused an unforeseen performance regression. It took our team some time to diagnose and understand that the issue was related to this update; after our investigation we rolled back the update and everything went back to normal.
Report: "Intermittent issues accessing the dashboard and communicating with our APIs"
Last update: We had a networking issue on a subset of our servers causing intermittent inaccessibility of our dashboard and APIs; between 08:15 UTC and 08:30 UTC you may have seen errors accessing the dashboard or any API. This was due to an upgrade of an internal system that handles networking, which had unintended side effects. A quick rollback resolved the issue and accessibility was restored. No notifications or emails were lost during this incident; however, notifications or emails sent from a trigger automation may have been delayed: we observed up to 6 minutes of delay. Once the issue was resolved, everything was sent immediately.
Report: "API Campaign - Timeouts"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
A database issue is causing slowdowns on our Push Campaign API, which can result in timeouts. We are working on a fix.
Report: "Push send test inoperant"
Last update:
# Timeline
**All times are in UTC and 24h time.**
On the morning of June 12, we decided to make a configuration change in one of our monitoring systems. This configuration was erroneous, but we did not notice it at the time. This caused the system to become unavailable.
On June 12 at 16:00, we received a notification from a client telling us that sending a test push notification was not working. At first, we ran some tests and were unable to reproduce the problem in our testing environment. As we investigated, we discovered that one of our apps was crashing and was no longer processing messages. This app is responsible for indexing new tokens. That means that our push delivery was operational for tokens indexed before the start of the incident, but no new token could be contacted (all push notifications were affected: campaign push notifications, automation push notifications, and test push notifications).
Data processing errors can happen, but this specific app depended on the monitoring system when handling errors, which is a bad pattern. Due to the monitoring system being down, we were unable to process messages. We have changed the error handling in this app to be independent of the monitoring system.
Once the hotfix was released on June 13 at 12:15, the app was once again operational and new tokens were once again being indexed. All tokens were finally indexed around 16:20.
Situation is back to normal: campaigns, automations and send tests should work with any token.
We expect the situation to be back to normal between 4 PM and 5 PM (UTC).
A fix has been implemented and we are processing delayed messages. The situation will be back to normal in the afternoon.
We have been experiencing difficulties indexing new push tokens since yesterday morning. Therefore:
- Push tests are only impacted when using new tokens.
- All sending can be impacted, including campaigns and automations (but only for new tokens).
We are still investigating.
Push send tests lead to a "no token found" error. Campaign & automation sending is not impacted. We are currently investigating this issue.
Report: "display conditions issues on email"
Last update: An incident occurred with our email builder due to third-party software that we use. This affects all email orchestrations containing display conditions that were edited between 2024-06-09 11:00:00 and 2024-06-11 09:30:00. The emails edited during that period have lost their conditions, and running these orchestrations will result in all conditional blocks being shown to the user. This concerns 22 orchestrations; if you are impacted, we will reach out to you to guide you on how to resolve the issue.
Report: "Imported email image issues on GMail apps"
Last update: Images imported by uploading a ZIP template in Batch failed to display in Gmail's web and mobile apps. This was due to Google's proxy failing to retrieve the image from our CDN for an unknown reason. Emails composed with the Email Composer are not affected by this. We have resolved the issue by switching to another CDN provider.
Report: "FCM push delivery issue"
Last update: We had an issue with our system that handles retries specifically for FCM push notifications: starting from 14:15 UTC and ending at 16:42 UTC, retries were not processed due to the underlying database being unavailable. After our team fixed the issue, the retry system started processing and sending retry notifications again. No notifications were lost in this incident, but some notifications could have taken more than 2h 30min to be sent. All delays have now been resolved, and retries are processed correctly.