Historical record of incidents for Courier
Report: "Cloudflare Outage Possible Okta SSO Impact"
Last update: Cloudflare has reported an outage that will impact SSO features provided by WorkOS. Customers with Okta SSO might experience some degradation when signing in. Our team is monitoring the status of both Cloudflare and WorkOS. WorkOS status: https://status.workos.com/incidents/k9s870cktcsf Cloudflare status: https://www.cloudflarestatus.com/incidents/25r9t0vz99rp
Report: "Message Delays on Send Pipeline"
Last update: The Courier team has identified an issue causing significant message delays of up to 10 minutes. We are investigating the root cause so that we can mitigate it. Updates will follow.
Report: "EU Region – Notifications Stuck in “Queued” State"
Last update: The issue has been resolved. The root cause was purely a logging issue: notifications were still sent but appeared in the logs as queued. All logs have been rehydrated.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue where EU workspace notifications appear to remain in the “queued” state.
Report: "AWS SES Failures"
Last update: All impacted messages have been reprocessed.
The team is still reprocessing impacted messages. We will confirm here once all impacted messages have been reprocessed.
Fix has landed and the team is monitoring progress. Messages sent to AWS SES are sending normally. The team will reprocess impacted messages.
The reverted changes will land in production in around one hour. The team will investigate the impact and handle message retries accordingly.
The Courier team has identified an issue with a recent deployment that is impacting AWS SES integrations. We have found the root cause and have reverted the changes. The team is assessing the impact.
Report: "Internal Courier Error Issue"
Last update: All impacted messages have been reprocessed.
The team is still working on reprocessing dropped messages caused by this issue. Send pipeline is operational.
Messages with tags are no longer impacted, and the error has been resolved. The team is working on reprocessing dropped messages.
The fix has landed in production and we will be monitoring for any more "Internal Courier Errors". We will be working on reprocessing any dropped messages caused by this issue.
A fix has been merged and will fully deploy within the next hour. Updates to follow.
We have tested the fix for "Internal Courier Error" and will be releasing it to production soon.
The Courier team has identified an issue where messages fail due to an "Internal Courier Error" in the test environment. Our team has found the root cause and is working to mitigate the issue.
Report: "Segment Event Processing Slowdowns"
Last update: The Segment event processing stream has recovered.
The Courier team has identified a slowdown in Segment event processing. There will be a slight delay for messages triggered by incoming Segment events. We expect the queue to normalize soon and are monitoring.
Report: "Significant Notification Event Latency"
Last update: Message sends are flowing normally, and events are caught up without delays.
We're monitoring our processing pipeline as messages are flowing normally. There is a slight delay in event processing which will also impact outbound webhooks.
The Courier team identified a delay in message processing and event updating. The pipeline queue for message delivery has normalized; however, message events are still experiencing significant slowdowns.
Report: "Courier Send and Message Event Slowdown"
Last update: The team has traced the issue to network connections on AWS, which resulted in a backup of messages on our pipeline. Messages are flowing through as expected, and the stream has caught up.
Our team has identified a bottleneck in our system caused by timeouts on our sendworker. Our exponential backoff has caught up and messages are starting to go through.
The Courier team has identified a slowdown in sending notifications and is monitoring it closely. We do not expect any messages to drop. We will continue to monitor and update accordingly.
Report: "Delay in Message Sends and Processing"
Last update: The send stream has caught up, all messages have been sent, and the issue appears to be fully resolved.
The team has identified the cause of the message send delays and slowdowns to be related to an AWS service. Messages that were stuck in a queued state are slowly passing through. The team will continue to monitor and update.
The Courier team has identified a slowdown in our pipeline. The queue has backed up but no messages have been dropped at the moment. We will continue to update as we monitor our pipeline.
Report: "Automation Delay Processing Issues"
Last update: The service issue affecting automation workflows with delay steps has been partially resolved. New automation workflows started after 6:05 PM PT are executing successfully. However, we have determined that automations that failed during the incident period (approximately 3:20 PM PT - 6:05 PM PT) cannot be automatically retried at this time due to technical limitations. If your business was impacted by failed automation runs during this incident, please contact our support team and we will work directly with you to address your concerns. We sincerely apologize for this disruption to your workflows.
We have deployed a fix for the earlier issue affecting automation workflows with delay steps. New automation workflows are now executing successfully. However, automations that failed during the incident period (approximately 3:20 PM PT - 6:05 PM PT) have not yet been automatically retried. Our engineering team is monitoring the recovery and evaluating a plan to process these backlogged automations.
We are currently experiencing an issue affecting automation workflows that include delay steps. Some customers may encounter failures when attempting to execute automations with scheduled delays. Our engineering team has identified the root cause and is implementing a fix. We expect the service to be fully restored in a short while.
Report: "Delayed Message Processing"
Last update: This incident has been resolved.
A fix has been implemented and delivery times are beginning to return to expected levels. The team will continue to monitor.
The team identified an issue with the latest fix and reverted it. A new fix has been published to mitigate the message delay issue. ETA to land in production ~45 minutes.
Fix has been deployed and will be live in around 1 hour.
The team has identified the issue and is rolling out a fix for the message delays.
The team is still investigating and discussing the root cause internally. Messages are experiencing longer than normal processing times.
We are continuing to investigate the root cause of the delayed message processing.
Courier has identified a delay in message processing for the messages API. We are currently investigating the root cause and will update you periodically.
Report: "Message Logs Delays"
Last update: The data stream is unblocked, and the message logs queue is resolved and flowing normally.
The fix has landed in production and the team is monitoring the message log queue. Message event logs should be flowing normally.
The team encountered an issue with testing the fix and reverted the update. We are publishing a new update that should resolve the backlogged message logs.
The team is testing out a fix to reduce the bottlenecked log lines before releasing to production.
We are continuing to investigate this issue.
The Courier team is actively investigating a bottleneck in the event logger for message event logs. Messages are still sending.
Report: "Courier Inbox FetchMessages Connectivity Issues"
Last update: Inbox connectivity issues resolved.
The team has started backfilling the impacted Inbox messages.
The team has gathered impacted messages and is running tests before a release.
We are continuing to monitor for any further issues.
We are continuing to gather a list of impacted messages to backfill impacted Inbox components.
Inbox connectivity for fetching messages has been reestablished. The team is monitoring closely and working on retroactively processing impacted messages to Inbox.
The team has identified the issue and will be rolling out a fix. The team will reprocess the impacted messages that were not fetched.
The Courier Team is investigating an issue related to Inbox fetching messages in the component.
Report: "Send Pipeline and Event Status Slowdowns"
Last update: Bottleneck has cleared for messages and event statuses.
Corrective actions have cleared the bottleneck and messages and events should be flowing normally.
The team has scaled up the send pipeline workers, and messages are clearing the bottleneck.
The team has increased our processing to help with the bottleneck.
The Courier team has identified an issue with the send pipeline and event status updates that is causing queued messages and delayed webhook events. The team will monitor the bottleneck and adjust message batching as necessary so that messages flow normally.
Report: "Delay in Message Processing"
Last update: The general pipeline has recovered.
Fix has been deployed, and enqueued messages have started to go through slowly. Once the bottleneck clears, messages should start to flow normally.
Release is published and building to production. ETA ~45 minutes.
The release is live, and the team is monitoring it.
Our team has released a revert to address the regression and it's in the process of merging.
The Courier team identified an issue in our health monitoring involving our message event processing. The issue has been identified and a revert is in place.
Report: "Automation Service Degraded"
Last update: Automations have stabilized.
The team has identified the issue and is monitoring it closely. Failed steps will continue to be retried with exponential backoff. Automations should recover, and all affected automations should execute after a delay of up to 15 minutes once the problem is resolved.
The Courier team has identified an issue impacting Automation services, which has resulted in degraded performance. An underlying issue was identified at around 10:30 PST and a fix was released at 11:00 PST. We are monitoring the automation worker for any leads.
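The update above mentions that failed steps are retried with exponential backoff. As a rough illustration of that general technique (not Courier's actual implementation), here is a minimal Python sketch. The function names, the 1-second base delay, and the 900-second cap (chosen only to echo the "up to 15 minutes" figure) are all assumptions made for this example.

```python
import time


def backoff_delay(attempt, base_seconds=1.0, cap_seconds=900.0):
    """Wait before retry number `attempt` (0-based): base * 2**attempt, capped.

    The base and cap values are illustrative assumptions, not Courier settings.
    """
    return min(cap_seconds, base_seconds * (2 ** attempt))


def run_with_retries(step, max_attempts=5, base_seconds=1.0,
                     cap_seconds=900.0, sleep=time.sleep):
    """Call `step` until it succeeds or `max_attempts` calls have failed."""
    for attempt in range(max_attempts):
        try:
            return step()
        except Exception:
            # Out of attempts: surface the last failure to the caller.
            if attempt == max_attempts - 1:
                raise
            # Otherwise wait twice as long as last time (up to the cap).
            sleep(backoff_delay(attempt, base_seconds, cap_seconds))
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting. A production retry loop would usually also add random jitter so that many failed steps do not all retry at the same instant.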
Report: "Delay in Event Statuses Processing"
Last update: Issue was resolved and events are flowing normally.
The team has identified an issue where event statuses are delayed. The team has identified the root cause and is implementing a fix.
Report: "Automation Logs Delayed"
Last update: This was a symptom of a previous automation issue, where a snowball effect caused a massive backlog of events. It took a long time for the Kinesis stream to catch up. We resolved the issue by temporarily increasing the stream shard count. The issue is now fully resolved.
Courier Automation logs are currently experiencing a delay in showing up. The team is aware of this issue and is mitigating the root cause by relieving the bottleneck of incoming Automation requests.
Report: "Rendering Errors for Email Templates"
Last update: The team identified an issue with templates failing on the render step in the message lifecycle when URLs were present with click tracking. A fix that addresses this rendering error has been rolled out, and templates should now be rendering properly.
Report: "Automation Delays"
Last update: This incident has been resolved; automation throughput has returned to normal levels.
Automation delays are occurring due to a combination of data retrieval issues and system timeouts. The team is investigating.
Report: "AWS serviceUnavailable Outage"
Last update: The team has resolved the issue. Send pipeline operational.
Service has resumed normal operation. Courier Engineers are monitoring.
Service has re-entered a degraded state. Sends from Courier are impacted.
An incident where Courier messages resulted in "Internal Courier Error" was the result of AWS returning serviceUnavailable. The team has identified the issue, and messages that responded with 5xx errors will be retried by our pipeline's resilience mechanisms.
Report: "Delayed Event Status"
Last update: Events log stream has stabilized.
Events are slowly stabilizing and the team is monitoring.
The Courier team has identified the issue with the events table and has upped the capacity until events stabilize.
There is an increased number of events in Courier's events log table causing delayed queued events to display. The team has upped the write capacity for these events and is waiting for the stream to stabilize.
Report: "Delay in Message Status Updates"
Last update: On 2024-05-20 12:40 GMT-7, Courier experienced a sudden spike in outbound message volume. All messages were sent normally. However, the queue used to process message update events became overwhelmed and could not accept events at the rate they were produced. This caused a delay in message status updates as the queue backed up. Although the queue would have recovered eventually on its own, the engineering team chose to increase queue capacity to resolve the issue more quickly. This increase was implemented at 14:43, with full recovery of enqueued message update events by 14:50. Messages processed between 12:40 and 14:43 experienced a delay in status updates of up to 400 seconds, with a typical delay of about 100 seconds. There was no delay in message processing or delivery; all message update events were eventually processed. Outbound webhooks, which depend on the impacted queue, were similarly delayed, as were message statuses shown in the Logs UI and reported by the API.
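The summary above describes a queue that received events faster than it could accept them, and a capacity increase that let it drain within minutes. The following Python sketch models that behavior with a simple one-second-tick fluid queue. It is only an illustration of the reasoning: the function name and every rate in the usage note are invented for this example and are not taken from the incident.

```python
def simulate_backlog(arrivals_per_sec, capacity_per_sec):
    """Simple fluid model of a queue, advanced in one-second ticks.

    Each tick adds that second's arrivals and removes up to that second's
    capacity. Returns a list of (backlog, estimated_delay_seconds) pairs,
    where the delay estimate is backlog / capacity for that tick.
    All rates are hypothetical inputs supplied by the caller.
    """
    backlog = 0.0
    history = []
    for arrivals, capacity in zip(arrivals_per_sec, capacity_per_sec):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # Backlog never goes below zero: spare capacity is simply unused.
        backlog = max(0.0, backlog + arrivals - capacity)
        history.append((backlog, backlog / capacity))
    return history
```

For instance, with made-up rates of 120 arrivals per second against a capacity of 100 per second, the backlog grows by 20 events each second, reaching 2,000 events (an estimated 20-second delay) after 100 seconds. Raising the assumed capacity to 300 per second then drains that backlog in 12 seconds, which mirrors the shape of the recovery described above: slow growth during the spike, fast recovery once capacity is increased.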
Report: "Segment Track events not firing automation workflows"
Last update: This incident has been resolved.
Issue has been fixed and automations are now being invoked. Automation logs might not reflect the correct state; we're working on fixing it.
The issue has been identified and a fix is being implemented.
Report: "Delayed message delivery"
Last update: The incident has been resolved.
A fix has been implemented and the Courier Engineering team is monitoring system health.
The issue has been identified and a fix is being tested and implemented.
We are currently investigating an issue causing delayed message delivery for some small percentage of requests processed by Courier.
Report: "Message send delays"
Last update: System is back to a healthy state.
A fix has been deployed to Courier's production systems. The engineering team will continue to monitor as operations return to normal state.
We are continuing to investigate this issue.
We are currently investigating an issue that is affecting send times for some messages.
Report: "Degraded Segment Inbound request processing"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "Outbound notifications are not being sent"
Last update: The incident has been verified as resolved.
A fix has been implemented and the team is monitoring pipeline health. All unprocessed messages should be sent as platform health recovers.
The issue has been identified and a patch has been deployed.
We are currently investigating an issue where messages are not being sent to downstream providers.
Report: "Message status updates are delayed"
Last update: This incident has been resolved.
We've released a fix for the bug causing message status updates to be delayed. Note: There was no impact on sending messages; only the status reflected in logs was affected.
Report: "Inbox messages degradation for versions lower than v2"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We've identified the issue affecting inbox message rendering and a fix is on its way.
Report: "Delay in the messages sent out"
Last update: This incident has been resolved.
System looks healthy and we're monitoring closely.
The issue has been identified and a fix is on its way to production.
Courier has found an issue causing a delay in messages being sent. We've identified the root cause and a fix is on its way to production. We'll follow up shortly with an update when the fix is live.
Report: "An incident with Courier's hosting infrastructure has been identified and is impacting access to service."
Last update: Services have been fully restored.
Systems should be returning to a normal operating state. Our engineering team will continue to monitor.
The extent of the outage with our service provider continues to increase. The team is closely monitoring.
We are continuing to investigate this issue.
The engineering team is currently investigating and working with our infrastructure provider's support team.
Report: "Issue affecting sends"
Last update: The incident has been resolved.
A fix has been deployed. Any affected messages will automatically be reprocessed. The team will continue to monitor the health of the platform.
We are continuing to investigate the issue. The team believes that fewer than 8% of all messages are impacted, though further analysis is continuing. Out of an abundance of caution, the team has also ramped down rollout of any upgrades and features through our experimentation engine.
We are currently investigating an issue that appears to be affecting a percentage of messages being processed. In addition, customers may be receiving "NOT FOUND" errors for affected message IDs when attempting to access them via logs or the API.
Report: "Message status updates are delayed"
Last update: Message status updates are now being applied with no delay.
Courier's release has completed and message status updates appear to be recovering. Courier will continue to monitor until event updates are caught up.
Courier is rolling back a change that we believe is causing updates to not apply to sent messages. Next update in ~45 minutes.
Courier is currently investigating an issue with message status updates. We believe there is no impact to message delivery at this time.
Report: "Message Status Delays"
Last update: System is back to being healthy and fully operational.
A fix has been deployed and the engineering team is monitoring status.
We have identified an issue that is causing delays in message status updates. This will also result in delayed outbound webhook message delivery. A fix is being tested and deployed.
Report: "Delayed Automation Execution"
Last update: The incident has been resolved. All backlogged Automations have been processed.
This issue had a broader impact than originally indicated: instead of only automations in a "WAITING" state, all automation processing was adversely impacted. The applied fix appears to have stabilized automation execution. Backlogged automations are now being processed. The team will continue to monitor.
The issue has been identified and a fix is being applied to production systems. Once the fix has been applied, backlogged Automations will begin to clear out. We will post an update in the next 40 minutes.
We are currently investigating reports of a delay in resuming execution of Courier Automations while they are in a "waiting" state.
Report: "Timeouts impacting Courier Inbox and access to log data"
Last update: Systems have returned to normal.
A fix has been deployed to Courier's production systems and services are returning to a healthy state. The engineering team will continue to monitor.
We have identified the cause of the issue and are working towards resolution.
We are currently investigating an issue that is affecting Courier Inbox as well as the ability to access log data. Sends are not impacted.
Report: "Message Send Delays for Legacy Segment to Send Message Integration"
Last update: We have finished measuring the impact and will be reaching out to affected customers.
We have confirmed our deployed fix has resolved the issue for new incoming events. We are continuing to work on a resolution for historical Segment events that did not trigger message sends.
We have released a fix for new incoming Segment events. We are monitoring to confirm send volume for Segment to Send returns to normal.
We are continuing to work on a fix for this issue.
We have identified the issue and are preparing a release that will resolve it for new incoming Segment events. We are working to identify the impacted historical Segment events that should have triggered sends.
We are currently investigating an issue where Send messages triggered by the Segment Event track/ integration are not being sent. Customers using the Segment Event track/ integration with Automations appear not to be impacted.
Report: "Message send delays"
Last update:
### Impact
Courier experienced delayed message delivery in its send pipeline impacting 0.1% of messages from 12:50pm to 9:50pm PT on 7/14. No messages were dropped as a result of the incident. 99.9% of send calls experienced no delivery delay. The average message send delay was 3 hours and 20 minutes for impacted messages.
#### Root Cause
Courier uses feature flags to safely roll out new features. Due to a misconfiguration of a flag, a larger than expected volume of send requests was included in a validation experiment meant to verify that a refactor of the send pipeline was safe to roll out. These requests added significant additional load on key stages of the send pipeline, and caused non-validation related requests to queue.
#### Remediation
Courier incrementally scaled up processing capacity in the send pipeline to work through the large accumulated backlog of messages. Additionally, a hotfix release was pushed to production in order to drop validation messages that had already entered the send pipeline.
#### Follow up actions
* Courier has established a process to better validate flag configuration in the future, as well as made changes to its feature flag helper library to make use less error-prone.
* Courier has created an incident playbook to guide on-call engineers through options to quickly scale up message processing in the send pipeline.
The incident has been resolved.
A fix has been implemented and we are monitoring system health. All backlogged messages are being processed.
We are continuing to work towards resolution of the issue. We are currently seeing delays of approximately 2 hours for some message delivery.
The issue has been identified and a resolution is being deployed to our production services.
We are currently investigating an issue that is affecting send times for some messages.
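The post-mortem for this incident attributes the delay to a misconfigured feature flag that pulled too many send requests into a validation experiment, and lists better flag validation as a follow-up. The Python sketch below shows one generic way such a guardrail could look. It is an assumption-laden illustration, not Courier's helper library: the function names, the flag name in the usage note, and the 1% ceiling are all hypothetical.

```python
import hashlib

# Hypothetical ceiling for experiment traffic, in percent of requests.
MAX_EXPERIMENT_PERCENT = 1.0


def validate_rollout(flag_name, percent, max_percent=MAX_EXPERIMENT_PERCENT):
    """Reject a rollout percentage that is malformed or above the guardrail."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError(f"{flag_name}: percent must be between 0 and 100")
    if percent > max_percent:
        raise ValueError(
            f"{flag_name}: {percent}% exceeds the {max_percent}% guardrail"
        )
    return percent


def in_experiment(flag_name, request_id, percent):
    """Deterministically place a request in or out of the experiment.

    Hashes the flag name and request id into one of 10,000 buckets, so the
    same request always gets the same answer for a given flag.
    """
    digest = hashlib.sha256(f"{flag_name}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Under these assumptions, `validate_rollout("send-refactor-validation", 50.0)` raises a `ValueError` before the flag is ever evaluated, while a value of `0.5` passes. Checking the configuration at load time, rather than at each evaluation, means a mistyped percentage fails loudly before it can add load to the pipeline.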
Report: "Elevated provider timeouts"
Last update: This incident has been resolved.
A fix has been deployed and we are monitoring system health.
We have identified the issue and are working on a resolution.
We are currently investigating delays in message processing due to provider timeouts.
Report: "Increased API Error Rates and latencies due to multiple AWS service outages"
Last update: AWS issue has been resolved with services operating normally. Courier systems are operational and healthy.
AWS is seeing a reduction in error rates and latencies. We are continuing to monitor the issue and health of the system.
We're monitoring the AWS ongoing issue.
AWS is experiencing elevated error rates and latencies for services that affect Courier. As a result, users may see increased errors and delays within Courier APIs. So far the impact remains low and our reprocessing infrastructure is operational. We'll be monitoring this incident closely and continue to post updates.
Report: "Validation errors editing notification templates"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring to verify there are no further impacts.
The issue has been identified and a fix is being implemented.
We are currently investigating an incident preventing some notification templates from being updated via Studio.
Report: "Increased Error Rates due to AWS (SQS, S3 and Lambda) service issues in us-east-1"
Last update: As of 1:45 PM PST, S3 Event Notifications have delivered the backlog of events. This issue is resolved and all services are now operating normally.
AWS identified an issue with their API and is beginning to see recovery in their API error rates for all affected services. We will continue to monitor their status page and update here. https://health.aws.amazon.com/health/status
Report: "Traffic Routing Issue Affecting Courier Studio and Message APIs"
Last update: The incident has been resolved.
The core issue has been identified and a resolution has been implemented. We will continue to monitor the situation.
API operations are returning to normal. Studio operations have started to return to normal operating state. We will continue to monitor.
Send functionality has been restored. We will continue to monitor. Major outages still exist on Studio and a partial outage against Courier API GET endpoints.
We are continuing to investigate this issue.
The routing outage has expanded to include the /send endpoint affecting message delivery.
We are currently investigating a traffic routing issue that is affecting access to Courier Studio and Message API endpoints.
Report: "Ongoing AWS Outage"
Last update: Our API is processing messages again and appears to be healthy. We're continuing to monitor the AWS status page: https://status.aws.amazon.com/
Our services are recovering. We will continue to monitor until the AWS status page says it is fully recovered.
Our hosting provider, AWS, is currently experiencing an outage that is impacting our services. More information here: https://status.aws.amazon.com/
Report: "Partial outage - Segment destination processing"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Report: "Segment Event Ingest Outage"
Last update: This incident has been resolved.
Events flowing into Courier from Segment are currently experiencing a delay. We have identified the issue and rolled out a fix, which is making its way through our system. Direct API calls (e.g. to our /send endpoint) are not affected.
Report: "Amazon US-EAST-1 Partial Outage"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
Amazon has identified the root cause and is working on resolving the issue.
Amazon Web Services is continuing to experience increased errors with Kinesis and related services. This continues to impact the Courier UI, but not the API.
Amazon Web Services is currently experiencing increased error rates on US-EAST-1, which impacts Courier's hosting environment. Right now this is affecting access to the Courier user interface, but not to our API for sending notifications. We're continuing to monitor the situation. https://status.aws.amazon.com/
Report: "Sending Delays"
Last update: Messages sent through the Courier API were incorrectly marked as Undeliverable; these messages have since been sent.