Courier

Is Courier Down Right Now? Check if there is a current outage ongoing.

Courier is currently Operational

Last checked from Courier's official status page

Historical record of incidents for Courier

Report: "Cloudflare Outage Possible Okta SSO Impact"

Last update
monitoring

Cloudflare has reported an outage that will impact SSO features provided by WorkOS. Customers with Okta SSO might experience some degradation when signing in. Our team is monitoring the status for both Cloudflare and WorkOS. WorkOS status: https://status.workos.com/incidents/k9s870cktcsf Cloudflare Status: https://www.cloudflarestatus.com/incidents/25r9t0vz99rp

Report: "Message Delays on Send Pipeline"

Last update
investigating

The Courier team has identified an issue where there are significant delays of up to 10 minutes for messages. We are currently investigating and looking for a root cause to mitigate. Updates will follow.

Report: "EU Region – Notifications Stuck in “Queued” State"

Last update
resolved

The issue has been resolved. The root cause was purely a logging issue: notifications were still sent but appeared in the logs as queued. All logs have been rehydrated.

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating an issue where EU workspace notifications appear to remain in the “queued” state.

Report: "AWS SES Failures"

Last update
resolved

All impacted messages have been reprocessed.

monitoring

The team is still working through the reprocessing for impacted messages. Confirmation of all impacted messages reprocessed will be updated here.

monitoring

Fix has landed and the team is monitoring progress. Messages sent to AWS SES are sending normally. The team will reprocess impacted messages.

identified

The reverted changes will land in production in around one hour. The team will investigate the impact and handle message retries accordingly.

identified

The Courier team has identified an issue with a recent deployment that is impacting AWS SES integrations. We have found the root cause and have reverted the changes. The team is assessing the impact.

Report: "Internal Courier Error Issue"

Last update
resolved

All impacted messages have been reprocessed.

monitoring

The team is still working on reprocessing dropped messages caused by this issue. Send pipeline is operational.

monitoring

Messages with tags are no longer impacted, and error has been resolved. The team is working on reprocessing dropped messages.

monitoring

The fix has landed in production and we will be monitoring for any more "Internal Courier Errors". We will be working on reprocessing any dropped messages caused by this issue.

identified

A fix has been merged and will fully deploy within the next hour. Updates to follow.

identified

We have tested the fix for "Internal Courier Error" and will be releasing it to production soon.

identified

The Courier team has identified an issue where messages fail due to an "Internal Courier Error" in the test environment. Our team has found the root cause and is working to mitigate the issue.

Report: "Segment Event Processing Slowdowns"

Last update
resolved

The Segment event processing stream has recovered.

monitoring

The Courier team has identified a slowdown in Segment event processing. There will be a slight delay for messages triggered by incoming Segment triggers. We estimate the queue to normalize soon and are monitoring.

Report: "Significant Notification Event Latency"

Last update
resolved

Message sends are flowing normally, and events are caught up without delays.

monitoring

We're monitoring our processing pipeline as messages are flowing normally. There is a slight delay in event processing which will also impact outbound webhooks.

identified

The Courier team identified a delay in message processing and event updating. The pipeline queue for message delivery has normalized; however, message events are still being processed with significant slowdowns.

Report: "Courier Send and Message Event Slowdown"

Last update
resolved

The team has traced the issue to network connections on AWS, which resulted in a backup of messages on our pipeline. Messages are flowing through as expected, and the stream has caught up.

monitoring

Our team has identified a bottleneck in our system caused by timeouts on our sendworker. Our exponential backoff has caught up and messages are starting to go through.

identified

The Courier team has identified a slowdown in sending notifications and is closely monitoring it. We do not expect any messages to drop. We will continue to monitor and update accordingly.

Report: "Delay in Message Sends and Processing"

Last update
resolved

The send stream has caught up, all messages have been sent and the issue appears to be fully resolved.

monitoring

The team has identified the cause of the message send slowdowns to be related to an AWS service. Messages that were stuck in a queued state are slowly passing through. The team will continue to monitor and update.

identified

The Courier team has identified a slowdown in our pipeline. The queue has backed up but no messages have been dropped at the moment. We will continue to update as we monitor our pipeline.

Report: "Automation Delay Processing Issues"

Last update
resolved

The service issue affecting automation workflows with delay steps has been partially resolved. New automation workflows started after 6:05 PM PT are executing successfully. However, we have determined that automations that failed during the incident period (approximately 3:20 PM PT - 6:05 PM PT) cannot be automatically retried at this time due to technical limitations. If your business was impacted by failed automation runs during this incident, please contact our support team and we will work directly with you to address your concerns. We sincerely apologize for this disruption to your workflows.

monitoring

We have deployed a fix for the earlier issue affecting automation workflows with delay steps. New automation workflows are now executing successfully. However, automations that failed during the incident period (approximately 3:20 PM PT - 6:05 PM PT) have not yet been automatically retried. Our engineering team is monitoring the recovery and evaluating a plan to process these backlogged automations.

identified

We are currently experiencing an issue affecting automation workflows that include delay steps. Some customers may encounter failures when attempting to execute automations with scheduled delays. Our engineering team has identified the root cause and is implementing a fix. We expect the service to be fully restored in a short while.

Report: "Delayed Message Processing"

Last update
resolved

This incident has been resolved

monitoring

A fix has been implemented and delivery times are beginning to return to expected levels. The team will continue to monitor.

identified

The team identified an issue with the latest fix and reverted it. A new fix has been published to mitigate the message delay issue. ETA to land in production ~45 minutes.

identified

Fix has been deployed and will be live in around 1hr.

identified

The team has identified the issue and is rolling out a fix for the message delays.

investigating

The team is still investigating and discussing the root cause internally. Messages are experiencing longer than normal processing times.

investigating

We are continuing to investigate the root cause of the delayed message processing.

investigating

Courier has identified a delay in message processing for the messages API. We are currently investigating the root cause and will update you periodically.

Report: "Message Logs Delays"

Last update
resolved

The data stream is unblocked, and the message logs queue is resolved and flowing normally.

monitoring

The fix has landed in production and the team is monitoring the message log queue. Message event logs should be flowing normally.

identified

The team encountered an issue with testing the fix and reverted the update. We are publishing a new update that should resolve the backlogged message logs.

identified

The team is testing out a fix to reduce the bottlenecked log lines before releasing to production.

investigating

We are continuing to investigate this issue.

investigating

The Courier team is investigating an issue with the event logger for message event logs hitting a bottleneck. The team is actively investigating. Messages are still sending.

Report: "Courier Inbox FetchMessages Connectivity Issues"

Last update
resolved

Inbox connectivity issues resolved.

monitoring

The team has started backfilling the impacted Inbox messages.

monitoring

The team has gathered impacted messages and is running tests before a release.

monitoring

We are continuing to monitor for any further issues.

monitoring

We are continuing to gather a list of impacted messages to backfill impacted Inbox components.

monitoring

Inbox connectivity for fetching messages has been reestablished. The team is monitoring closely, and working on retroactively processing impacted messages to Inbox

identified

The team has identified the issue and will be rolling out a fix. The team will reprocess the impacted messages that were not fetched.

investigating

The Courier Team is investigating an issue related to Inbox fetching messages in the component.

Report: "Send Pipeline and Event Status Slowdowns"

Last update
resolved

Bottleneck has cleared for messages and event statuses.

monitoring

Corrective actions have cleared the bottleneck and messages and events should be flowing normally.

monitoring

The team has increased send pipeline worker capacity and messages are clearing the bottleneck.

identified

The team has increased our processing to help with the bottleneck.

identified

The Courier team has identified an issue with the send pipeline and event status updates that is causing queued messages and delayed webhook events. The Courier team will monitor the bottleneck and adjust message batching as necessary so that messages flow normally.

Report: "Delay in Message Processing"

Last update
resolved

The general pipeline has recovered.

monitoring

Fix has been deployed, and enqueued messages have started to go through slowly. Once the bottleneck clears, messages should start to flow normally.

monitoring

Release is published and building to production. ETA ~45 minutes.

monitoring

The release is live, and the team is monitoring it.

monitoring

Our team has released a revert to address the regression and it's in the process of merging.

identified

The Courier team identified an issue in our health monitoring involving our message event processing. The issue has been identified and a revert is in place.

Report: "Automation Service Degraded"

Last update
resolved

Automations have stabilized.

monitoring

The team has identified the issue and is monitoring it closely. Failed steps will continue to be retried with exponential backoff. Automations should recover, and all affected automations should execute after a delay of up to 15 minutes once the problem is resolved.
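
The retry behavior mentioned in this update can be illustrated with a minimal sketch of a capped exponential backoff schedule. The base delay, growth factor, and 900-second cap are assumptions chosen for illustration (the cap simply echoes the 15-minute figure in the update); Courier's actual retry parameters are not stated, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(base_seconds=1.0, factor=2.0, max_delay_seconds=900.0, attempts=12):
    """Return the wait before each retry: base * factor**n, capped at max_delay_seconds.

    All defaults are illustrative assumptions, not Courier's real settings.
    """
    delays = []
    for attempt in range(attempts):
        # Grow the delay geometrically, but never wait longer than the cap.
        delays.append(min(base_seconds * factor ** attempt, max_delay_seconds))
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(), start=1):
        print(f"retry {attempt}: wait {delay:.0f}s")
```

With a cap in place, a step that keeps failing is retried at most once per cap interval, which is one way a recovery delay bounded by a fixed number of minutes could arise.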

investigating

The Courier team has identified an issue impacting Automation services, which has resulted in degraded performance. An underlying issue was identified at around 10:30 PST and a fix was released at 11:00 PST. We are monitoring the automation worker for any leads.

Report: "Delay in Event Statuses Processing"

Last update
resolved

Issue was resolved and events are flowing normally.

identified

The team has identified an issue where event statuses are delayed. The team has identified the root cause and is implementing a fix.

Report: "Automation Logs Delayed"

Last update
resolved

This was a symptom of a previous automation issue, where a snowball effect caused a massive backlog of events. It took a long time for the Kinesis stream to catch up. We resolved the issue by temporarily increasing the stream shard count. The issue is now fully resolved.
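
To make the effect of raising the shard count concrete, the following is a rough sketch of how long a backlog takes to drain when total processing capacity scales with the number of shards. The function name `drain_seconds` and every number in the usage example are hypothetical; the update does not state the real backlog size, per-shard throughput, or shard counts.

```python
import math


def drain_seconds(backlog_events, shards, per_shard_rate, incoming_rate):
    """Estimate seconds needed to clear a backlog.

    Capacity is assumed to scale linearly with shard count (an
    illustrative simplification). Rates are in events per second.
    """
    capacity = shards * per_shard_rate
    surplus = capacity - incoming_rate
    if surplus <= 0:
        # Consumers cannot keep up with new events, so the backlog never drains.
        return math.inf
    return backlog_events / surplus


if __name__ == "__main__":
    # Hypothetical figures: 1,000,000 queued events, 3,500 new events per second.
    print(drain_seconds(1_000_000, shards=4, per_shard_rate=1000, incoming_rate=3500))
    print(drain_seconds(1_000_000, shards=8, per_shard_rate=1000, incoming_rate=3500))
```

Under these made-up numbers, doubling the shard count shortens the catch-up time by far more than half, because only the capacity left over after handling new traffic goes toward the backlog.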

investigating

Courier Automation logs are currently experiencing a delay in showing up. The team is aware of this issue and is mitigating the root cause by relieving the bottleneck of incoming Automation requests.

Report: "Rendering Errors for Email Templates"

Last update
resolved

When URLs were present with click tracking, the team identified an issue with templates failing on the render step in the message lifecycle. The team has since identified the issue and rolled out a fix that addresses this rendering error and templates should be rendering properly.

Report: "Automation Delays"

Last update
resolved

This incident has been resolved, automation throughput has returned to normal levels.

investigating

Automation delays are occurring due to a combination of data retrieval issues and system timeouts. The team is investigating.

Report: "AWS serviceUnavailable Outage"

Last update
resolved

The team has resolved the issue. Send pipeline operational.

monitoring

Service has resumed normal operation. Courier Engineers are monitoring.

investigating

Service has re-entered a degraded state. Sends from Courier are impacted.

monitoring

Courier messages failed with "Internal Courier Error" because AWS was returning serviceUnavailable. The team has identified the issue, and messages that received 5xx errors will be retried by our pipeline's resilience mechanisms.

Report: "Delayed Event Status"

Last update
resolved

Events log stream has stabilized.

monitoring

Events are slowly stabilizing and the team is monitoring.

identified

The Courier team has identified the issue with the events table and has upped the capacity until events stabilize.

investigating

An increased number of events in Courier's events log table is causing queued events to display with a delay. The team has upped the write capacity for these events and is waiting for the stream to stabilize.

Report: "Delay in Message Status Updates"

Last update
resolved

On 2024-05-20 12:40 GMT-7, Courier experienced a sudden spike in outbound message volume. All messages were sent normally. However, the queue used to process message update events became overwhelmed and could not accept events at the rate they were produced. This caused a delay in message status updates as the queue backed up. Although the queue would have recovered eventually on its own, the engineering team chose to increase queue capacity to resolve the issue more quickly. This increase was implemented at 14:43, with full recovery of enqueued message update events by 14:50. Messages processed between 12:40 and 14:43 experienced a delay in status updates of up to 400 seconds, with a typical delay of about 100 seconds. There was no delay in message processing or delivery; all message update events were eventually processed. Outbound webhooks, which depend on the impacted queue, were similarly delayed, as were message statuses shown in the Logs UI and reported by the API.

Report: "Segment Track events not firing automation workflows"

Last update
resolved

This incident has been resolved.

monitoring

Issue has been fixed and automations are now being invoked. Automation logs might not reflect correct state - we're working on fixing it.

identified

The issue has been identified and a fix is being implemented.

Report: "Delayed message delivery"

Last update
resolved

The incident has been resolved.

monitoring

A fix has been implemented and the Courier Engineering team is monitoring system health.

identified

The issue has been identified and a fix is being tested and implemented.

investigating

We are currently investigating an issue causing delayed message delivery for some small percentage of requests processed by Courier.

Report: "Message send delays"

Last update
resolved

System is back to a healthy state.

monitoring

A fix has been deployed to Courier's production systems. The engineering team will continue to monitor as operations return to normal state.

investigating

We are continuing to investigate this issue.

investigating

We are currently investigating an issue that is affecting send times for some messages.

Report: "Degraded Segment Inbound request processing"

Last update
resolved

This incident has been resolved.

identified

The issue has been identified and a fix is being implemented.

Report: "Outbound notifications are not being sent"

Last update
resolved

The incident has been verified as resolved

monitoring

A fix has been implemented and the team is monitoring pipeline health. All unprocessed messages should be sent as platform health recovers.

identified

The issue has been identified and a patch has been deployed.

investigating

We are currently investigating an issue where messages are not being sent to downstream providers.

Report: "Message status updates are delayed"

Last update
resolved

This incident has been resolved.

monitoring

We've released a fix for the bug causing message status updates to be delayed. Note: There's no impact sending out the messages, only the status reflected in logs.

Report: "Inbox messages degradation for versions lower than v2"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been implemented and we are monitoring the results.

identified

We've identified the issue affecting inbox message rendering and a fix is on its way.

Report: "Delay in the messages sent out"

Last update
resolved

This incident has been resolved.

monitoring

System looks healthy and we're monitoring closely.

identified

The issue has been identified and a fix is on its way to production

investigating

Courier has found an issue causing a delay in messages being sent. We've identified the root cause and a fix is on its way to production. We'll follow up shortly with an update when the fix is live.

Report: "An incident with Courier's hosting infrastructure has been identified and is impacting access to service."

Last update
resolved

Services have been fully restored

monitoring

Systems should be returning to a normal operating state. Our engineering team will continue to monitor.

investigating

The extent of the outage with our service provider continues to increase. The team is closely monitoring.

investigating

We are continuing to investigate this issue.

investigating

The engineering team is currently investigating and working with our infrastructure provider's support team.

Report: "Issue affecting sends"

Last update
resolved

The incident has been resolved

monitoring

A fix has been deployed. Any affected messages will automatically be reprocessed. The team will continue to monitor the health of the platform.

investigating

We are continuing to investigate the issue. The team believes that < 8% of all messages are impacted, though further analysis is continuing. Out of an abundance of caution, the team has also ramped down rollout of any upgrades and features through our experimentation engine.

investigating

We are currently investigating an issue that appears to be affecting a percentage of messages being processed. In addition, customers may receive "NOT FOUND" errors for affected message IDs when attempting to access them via logs or the API.

Report: "Message status updates are delayed"

Last update
resolved

Message status updates are now being applied with no delay.

monitoring

Courier's release has completed and message status updates appear to be recovering. Courier will continue to monitor until event updates are caught up.

identified

Courier is rolling back a change that we believe is causing updates to not apply to sent messages. Next update in ~45 minutes.

investigating

Courier is currently investigating an issue with message status updates. We believe there is no impact to message delivery at this time.

Report: "Message Status Delays"

Last update
resolved

System is back to being healthy and fully operational

monitoring

A fix has been deployed and the engineering team is monitoring status

identified

We have identified an issue that is causing delays in message statuses being reflected. This will also result in delayed outbound webhook message delivery. A fix is being tested and deployed.

Report: "Delayed Automation Execution"

Last update
resolved

The incident has been resolved. All backlogged Automations have been processed.

monitoring

This issue had a broader impact than originally indicated: instead of only automations in a "WAITING" state, all automation processing was adversely impacted. The applied fix appears to have stabilized automation execution. Backlogged automations are now being processed. The team will continue to monitor.

identified

The issue has been identified and a fix is being applied to production systems. Once the fix has been applied, backlogged Automations will begin to clear out. We will post an update in the next 40 minutes.

investigating

We are currently investigating reports of a delay in resuming execution of Courier Automations while they are in a "waiting" state.

Report: "Timeouts impacting Courier Inbox and access to log data"

Last update
resolved

Systems have returned to normal

monitoring

A fix has been deployed to Courier's production systems and services are returning to a healthy state. The engineering team will continue to monitor.

identified

We have identified the cause of the issue and are working towards resolution.

investigating

We are currently investigating an issue that is affecting Courier Inbox as well as the ability to access log data. Sends are not impacted.

Report: "Message Send Delays for Legacy Segment to Send Message Integration"

Last update
resolved

We have finished measuring the impact and will be reaching out to affected customers.

monitoring

We have confirmed our deployed fix has resolved the issue for new incoming events. We are continuing to work on a resolution for historical Segment events that did not trigger message sends.

monitoring

We have released a fix for new incoming Segment events. We are monitoring to confirm send volume for Segment to Send returns to normal.

identified

We are continuing to work on a fix for this issue.

identified

We have identified the issue and are preparing a release that will resolve it for new incoming Segment events. We are working to identify historical segment events that should have triggered sends that were impacted.

investigating

We are currently investigating an issue where Send messages triggered by the Segment Event track/ integration are not being sent. Customers using the Segment Event track/ integration with Automations appear not to be impacted.

Report: "Message send delays"

Last update
postmortem

### Impact

Courier experienced delayed message delivery in its send pipeline, impacting 0.1% of messages from 12:50pm to 9:50pm PT on 7/14. No messages were dropped as a result of the incident. 99.9% of send calls experienced no delivery delay. The average message send delay was 3 hours and 20 minutes for impacted messages.

#### Root Cause

Courier uses feature flags to safely roll out new features. Due to a misconfiguration of a flag, a larger than expected volume of send requests was included in a validation experiment meant to verify that a refactor of the send pipeline was safe to roll out. These requests added significant additional load on key stages of the send pipeline and caused non-validation-related requests to queue.

#### Remediation

Courier incrementally scaled up processing capacity in the send pipeline to work through the large accumulated backlog of messages. Additionally, a hotfix release was pushed to production in order to drop validation messages that had already entered the send pipeline.

#### Follow up actions

* Courier has established a process to better validate flag configuration in the future, and has made changes to its feature flag helper library to make its use less error-prone.
* Courier has created an incident playbook to guide on-call engineers through options to quickly scale up message processing in the send pipeline.
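
As an illustration of the kind of safeguard a feature flag helper might apply, here is a minimal sketch of percentage-based experiment sampling that rejects out-of-range rollout values. The postmortem does not say what the misconfiguration was or how Courier's helper library works, so the function `in_experiment`, the flag name, and the fraction-based configuration are all hypothetical.

```python
import hashlib


def in_experiment(request_id: str, flag_name: str, rollout_fraction: float) -> bool:
    """Deterministically decide whether a request is sampled into an experiment.

    rollout_fraction must be between 0.0 and 1.0 (for example 0.01 for 1%).
    Rejecting values outside that range guards against one plausible
    misconfiguration (writing 5 when 0.05 was meant). This is an assumed
    failure mode for illustration, not the documented root cause.
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError(f"rollout_fraction must be in [0, 1], got {rollout_fraction!r}")
    # Hash the flag name together with the request ID so that each flag
    # samples an independent, stable subset of requests.
    digest = hashlib.sha256(f"{flag_name}:{request_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < round(rollout_fraction * 10_000)


if __name__ == "__main__":
    sampled = sum(
        in_experiment(f"req-{i}", "send-refactor-validation", 0.01)
        for i in range(10_000)
    )
    print(f"{sampled} of 10000 requests sampled at a 1% rollout")
```

Failing loudly on an invalid value is a design choice: a misconfigured flag then stops at deploy or evaluation time instead of silently routing extra traffic into the experiment.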

resolved

The incident has been resolved.

monitoring

A fix has been implemented and we are monitoring system health. All backlogged messages are being processed.

identified

We are continuing to work towards resolution of the issue. We are currently seeing delays of approximately 2 hours for some message deliveries.

identified

The issue has been identified and a resolution is being deployed to our production services.

investigating

We are currently investigating an issue that is affecting send times for some messages.

Report: "Elevated provider timeouts"

Last update
resolved

This incident has been resolved.

monitoring

A fix has been deployed and we are monitoring system health

identified

We have identified the issue and are working on a resolution

investigating

We are currently investigating delays in message processing due to provider timeouts.

Report: "Increased API Error Rates and latencies due to multiple AWS service outages"

Last update
resolved

AWS issue has been resolved with services operating normally. Courier systems are operational and healthy.

monitoring

AWS is seeing a reduction in error rates and latencies. We are continuing to monitor the issue and health of the system.

monitoring

We're monitoring the AWS ongoing issue.

investigating

AWS is experiencing elevated error rates and latencies for services that affect Courier. As a result, users may see increased errors and delays within Courier APIs. So far the impact remains low and our reprocessing infrastructure is operational. We'll be monitoring this incident closely and continue to post updates.

Report: "Validation errors editing notification templates"

Last update
resolved

This incident has been resolved

monitoring

A fix has been implemented and we are monitoring to verify there are no further impacts

identified

The issue has been identified and a fix is being implemented.

investigating

We are currently investigating an incident preventing some notification templates from being updated via Studio

Report: "Increased Error Rates due to AWS (SQS, S3 and Lambda) services issues in us-east-1"

Last update
resolved

As of 1:45 PM PST S3 Event Notifications have delivered the backlog of events. This issue is resolved and all services are now operating normally.

monitoring

AWS identified an issue with their API and is beginning to see recovery in their API error rates for all affected services. We will continue to monitor their status page and update here. https://health.aws.amazon.com/health/status

Report: "Traffic Routing Issue Affecting Courier Studio and Message APIs"

Last update
resolved

The incident has been resolved

monitoring

The core issue has been identified and a resolution has been implemented. We will continue to monitor the situation.

investigating

API operations are returning to normal. Studio operations have started to return to normal operating state. We will continue to monitor.

investigating

Send functionality has been restored. We will continue to monitor. Major outages still exist on Studio and a partial outage against Courier API GET endpoints.

investigating

We are continuing to investigate this issue.

investigating

The routing outage has expanded to include the /send endpoint affecting message delivery.

investigating

We are currently investigating a traffic routing issue that is affecting access to Courier Studio and Message API endpoints.

Report: "Ongoing AWS Outage"

Last update
resolved

Our API is processing messages again and appears to be healthy. We're continuing to monitor the AWS status page: https://status.aws.amazon.com/

monitoring

Our services are recovering. We will continue to monitor until the AWS status page says it is fully recovered.

identified

Our hosting provider, AWS, is currently experiencing an outage that is impacting our services. More information here: https://status.aws.amazon.com/

Report: "Partial outage - Segment destination processing"

Last update
resolved

This incident has been resolved.

identified

We are continuing to work on a fix for this issue.

identified

The issue has been identified and a fix is being implemented.

Report: "Segment Event Ingest Outage"

Last update
resolved

This incident has been resolved.

identified

Events flowing into Courier from Segment are currently experiencing a delay. We have identified the issue and rolled out a fix, which is making its way through our system. Direct API calls (e.g. to our /send endpoint) are not affected.

Report: "Amazon US-EAST-1 Partial Outage"

Last update
resolved

This incident has been resolved.

monitoring

We are continuing to monitor for any further issues.

monitoring

A fix has been implemented and we are monitoring the results.

identified

Amazon has identified the root cause and is working on resolving the issue.

identified

Amazon Web Services is continuing to experience increased errors with Kinesis and related services. This continues to impact the Courier UI, but not the API.

investigating

Amazon Web Services is currently experiencing increased error rates on US-EAST-1, which impacts Courier's hosting environment. Right now this is affecting access to the Courier user interface, but not to our API for sending notifications. We're continuing to monitor the situation. https://status.aws.amazon.com/

Report: "Sending Delays"

Last update
resolved

Messages sent through the Courier API were incorrectly marked as Undeliverable; these messages have since been sent.