Historical record of incidents for Pusher
Report: "Interruption in Stats Collection and Degraded Datadog Integration"
Last update: The backlog has fully cleared and Stats Collection and Datadog integration are now operating normally. This incident is now resolved. We apologise for any inconvenience caused and appreciate your patience throughout.
Systems are now catching up on the backlog, and Realtime Stats and Datadog integration should gradually return to normal. We are continuing to monitor the recovery closely and will provide another update once the catch-up is fully complete.
The issue affecting Stats Collection and the Datadog integration has been identified and addressed. Systems are now catching up on the backlog, and Realtime Stats and Datadog integration should gradually return to normal. We are continuing to monitor the recovery closely and will provide another update once the catch-up is fully complete. Thank you for your continued patience.
We are currently experiencing an issue affecting our stats collection systems. As a result, Realtime Stats are currently unavailable. Additionally, our Datadog integration is experiencing degraded performance. Our engineering team is actively investigating the root cause and working to restore normal service as quickly as possible. We will provide updates as we learn more. Thank you for your patience and understanding.
Report: "Interruption in Stats Collection and Degraded Datadog Integration"
Last updateWe are currently experiencing an issue affecting our stats collection systems. As a result, Realtime Stats are currently unavailable. Additionally, our Datadog integration is experiencing degraded performance.Our engineering team is actively investigating the root cause and working to restore normal service as quickly as possible. We will provide updates as we learn more.Thank you for your patience and understanding.
Report: "Pusher dashboard inaccessible for logins with username/password"
Last update: This incident has been resolved.
We are currently investigating an issue where the Pusher dashboard is inaccessible to users attempting to log in with username/password. Users logging in with SSO are unaffected. Pusher products remain functional without any issues.
Report: "Users unable to Log In using Google Login"
Last update: Our engineering team has resolved this incident. We apologise for any inconvenience this may have caused.
Our engineering team is currently investigating an issue that is causing errors when users attempt to log in to the Pusher portal via Google. All other login methods are functioning as expected.
Report: "Some customers are unable to downgrade their account"
Last update: This incident has been resolved.
We are aware of an issue where customers can't downgrade their account plan. Everything else continues to function normally.
Report: "Beams: intermittent disruption loading accounts from the pusher dashboard"
Last update: From 05:02 UTC until 06:23 UTC, Beams suffered intermittent disruption loading accounts from the Pusher dashboard. This was due to internal services not picking up a new SSL certificate that was automatically issued during this time.
Report: "Increased latency on MT1"
Last update: This incident has been resolved. The team saw similarities with a previous incident that affected the US2 cluster and acted swiftly to mitigate and prevent further impact. Overall the impact was very small: a few connections saw small delays receiving messages for a very short time.
We are currently investigating this issue.
Report: "We are experiencing a major outage in our US2 cluster"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented, and we are monitoring the results. WebSocket connections should now be functional.
The team has been able to identify the issue and push a fix. We are seeing a full recovery of the services. We will continue to monitor.
Our team is still investigating the issue within our infrastructure that is preventing us from adding WebSocket capacity to the US2 cluster. This issue is specifically impacting our WebSocket deployments. The majority of clients are currently unable to connect using WebSocket. However, fallback mechanisms (SockJS, HTTP streaming, and HTTP polling) are available for connection if they are not disabled in your configuration; a client configuration sketch follows the updates below.
We are escalating the current issue from a partial outage to a major outage. The majority of traffic in the US2 cluster is currently affected. Our team is fully engaged and working to resolve the issue.
We are investigating what is causing the issue in our US2 cluster.
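For readers unsure whether fallbacks are enabled on their end, here is a minimal, hedged sketch of a pusher-js client configuration in TypeScript. The `disabledTransports` option and `connection.bind` call reflect the pusher-js API as we understand it; the app key and cluster values are placeholders, not taken from this incident.

```typescript
// Minimal sketch, assuming a pusher-js client. App key and cluster are placeholders.
import Pusher from "pusher-js";

// Leaving `disabledTransports` unset keeps SockJS and the HTTP streaming/polling
// fallbacks available when a direct WebSocket connection cannot be established.
const pusher = new Pusher("YOUR_APP_KEY", {
  cluster: "us2",
  // disabledTransports: ["sockjs", "xhr_streaming", "xhr_polling"], // listing them here would disable fallbacks
});

// Log connection state changes so a fallback during an incident is visible in the console.
pusher.connection.bind("state_change", (states: { previous: string; current: string }) => {
  console.log(`Pusher connection: ${states.previous} -> ${states.current}`);
});
```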
Report: "Inaccurate Account Usage Warnings Reported"
Last update: We have resolved the statistics computation issue. Connection counts should be correct again and remain accurate going forward. Daily message volume counters will still show inflated usage until the end of the day; customers that are concerned about potential throttling due to breaching daily limits can reach out to customer support to ensure access remains uninterrupted.
Some customers have received incorrect warnings about account usage or observed usage statistics that appear to be incorrect. We are investigating.
Report: "Statistics Aggregation delayed"
Last update: This incident has been resolved.
Aggregated statistics have been restored. We are monitoring the situation.
The issue has been identified and mitigated. Missing aggregated statistics are being recovered. No data is missing.
We are currently investigating an issue with aggregated statistics being delayed on the Dashboard.
Report: "Stats collection experiencing an interruption"
Last update: This incident has been resolved.
The problem has been identified and fixed. We are monitoring the results.
We are investigating why Stats collection has stopped.
Report: "Intermittent failures delivering webhooks on EU cluster"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Experiencing Degraded Performance in EU clusters for Pusher Channels"
Last update: This incident has been resolved.
Resolved. Monitoring the system.
Issue identified and being fixed.
Impact: We are currently experiencing degraded performance on our EU clusters. As a result, some users are unable to send messages, and delays may be experienced.
Affected Product: Pusher Channels
Current Status: Our team is actively investigating the issue and working to restore full functionality as quickly as possible. We appreciate your patience and understanding as we work to resolve this issue promptly. Thank you for your continued support.
Report: "Experiencing Delays in Webhook Processing for Pusher Channel"
Last update: The disturbance has been resolved.
Recovered. Actively monitoring the system.
Impact: We are currently experiencing delays of up to 15 minutes in processing webhooks across our EU, MT1, and US2 clusters. However, we want to assure you that despite these delays, you can still use our system and successfully process messages.
Affected Product: Pusher Channels
Action Taken: Our team is actively monitoring and adjusting the system to address this issue. We are progressively scaling up the pod count to improve processing efficiency, and we are conducting a thorough analysis of traffic patterns to better understand them. We appreciate your patience and understanding as we work to resolve this issue promptly. Thank you for your continued support.
Report: "SSL certificate renewal for our legacy telemetry endpoint"
Last update: The SSL certificate for our client stats endpoint expired at 00:00 UTC, but the issue was resolved at 08:49 UTC. During this period, clients who opted to send client stats using older versions of the JavaScript SDK for web (PusherJS) encountered some errors and were unable to send telemetry data to our servers. This does not impact any functionality of our SDK, even in older versions. However, it's worth mentioning that we are unable to verify this on much older versions of our SDKs, as our team no longer monitors them. We have replaced the certificate on this endpoint with an auto-renewable certificate to prevent this issue from happening again. The client stats endpoint is a legacy endpoint and is not used in newer versions of our SDKs.
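For illustration only, a hedged sketch of how the legacy client-stats telemetry could be switched off in an older pusher-js release. The `disableStats` option name is our assumption about those pre-v7 versions (newer releases leave stats off by default); the key and cluster are placeholders.

```typescript
// Sketch under assumptions: an older pusher-js release (pre-v7) where client stats
// were sent by default and `disableStats` was the documented opt-out.
import Pusher from "pusher-js";

const options = {
  cluster: "mt1",       // placeholder cluster
  disableStats: true,   // assumed legacy option; stops telemetry to the legacy client-stats endpoint
};

// Core Channels functionality (connections, subscriptions, messages) is unaffected either way.
const pusher = new Pusher("YOUR_APP_KEY", options);
```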
Report: "High Latency in US2 on Channels WebSocket Client API"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue. Impact is greatly reduced now. Customers are seeing lower latencies.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Customers are seeing high latency on the US2 cluster (WebSockets). The team is working on it.
Report: "High Latency in US2 on Channels WebSocket Client API"
Last update: This incident has been resolved.
A fix for this issue has been deployed and we are monitoring the system.
We are continuing to investigate issues with users unable to establish connections from the WebSocket client.
Issues with US2 have returned. We are seeing issues maintaining connections.
A fix has been implemented and the latency issues have resolved in US2. We will monitor the systems to ensure the issue does not reoccur.
We are currently investigating an issue with high latency in US2 on the Channels WebSocket Client API.
Report: "Service Disruption Notice: Pusher US2 Cluster"
Last update: On Saturday March 30th, from 04:15 UTC until 11:30 UTC, Pusher Channels experienced a partial outage and higher than usual latencies on our US2 cluster. The incident was triggered by a single, high-volume customer sending requests that were causing authentication errors. These errors are collated and stored in our error logging system, which can be viewed on the Pusher Dashboard. Due to the exceptionally high number of errors, the logging system was overwhelmed, which caused all writes to the log system to back up on our socket processes. This in turn resulted in the socket processes taking longer to process messages and starting to fail health checks. The failing health checks caused pods to restart, which resulted in connections being dropped. We identified this issue and took steps to mitigate, attempting to switch off non-critical components to alleviate stress on the system. Eventually it was clear that these steps were having little impact, and we deployed a hotfix to disable the error logging system. During the deployment of this hotfix the system hit a limit on the number of registered targets on our cloud provider's load balancer. This caused the rolling deployment to take much longer than expected, as it waited for old socket processes to be fully drained and terminated. After coordination with our cloud provider we were able to increase this limit, allowing the deployment to complete, and the incident was resolved. There were two main windows of impact: from 04:15 to 05:40 UTC and from 07:15 to 09:15 UTC. From 09:15 to 11:30 the service was operating normally and we saw a ramp-up in connections.
Report: "Functions dashboard returns error"
Last update: This incident has been resolved.
We are currently investigating an issue where an error is shown on the Dashboard when navigating to the Functions feature.
Report: "Pusher.com website down"
Last update: This incident has been resolved.
We are currently investigating this issue.
Report: "Pusher US2 webhooks delayed"
Last update: We have identified the issue and resolved it. Webhooks are behaving as expected again.
We are investigating a delay in Webhooks.
Report: "Pusher EU webhooks delayed"
Last update: All delayed EU webhooks completed processing at approximately 10:30 UTC. This incident is resolved.
We have deployed a fix for this issue. The delivery of the delayed webhooks continues.
We are continuing to see delays in webhooks. We are working on implementing an additional fix.
A fix has been implemented and we are monitoring the results as the backlog is processed.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Pusher US2 webhooks delayed"
Last update: All delayed US webhooks completed processing at approximately 07:30 UTC. This incident is resolved.
We have deployed a fix for this issue. The delivery of the delayed webhooks continues.
The issue has been identified and a fix is being implemented.
Report: "Pusher US3 cluster outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The team identified the problem. Most connections are stable now, but some of them may still reconnect. The team is working on a permanent solution.
Pusher US3 cluster is experiencing disruptions for client socket connections. Our team is looking into it.
Report: "Channels Stats Missing Data"
Last update: Between 15:37 and 16:01 UTC we encountered an issue with our stats pipeline that resulted in missing metrics. Customers may notice data missing for this period in the Pusher Dashboard metrics along with any configured integration dashboards (Datadog and Librato). The issue has been resolved and stats have been published correctly since 16:01.
Report: "Public EU cluster outage"
Last update: This incident has been resolved.
The public EU cluster has been restored and all services are now back online. We will continue to monitor the cluster health.
The issue has been identified and a fix is being implemented.
The public EU cluster is currently experiencing an outage which affects customers in this region. For customers in other regions, there is no impact. We are actively working on restoring the public EU cluster.
Report: "US2 Customers may be experiencing latency issues with or unable to utilize the Channels WebSocket and REST API"
Last update: This incident has been resolved.
The fix has been implemented and latency has improved.
The team has identified the cause of the latency and is actively working to resolve the high latency issues.
We are currently investigating an issue in US2 where some clients may be experiencing latency issues or unable to utilize the Channels WebSocket and REST API.
Report: "Customers using Datadog exporter feature at pusher channels might experience a small data loss or delay on the dashboards"
Last update: This incident is being resolved.
Our engineering team has detected an issue that is preventing all data from being published to Datadog. We are currently investigating the root cause.
Report: "High latency on channels US2 API"
Last update: Our engineering team has confirmed this issue has been mitigated. We apologise for any inconvenience this may have caused you.
Our engineering team is working on rolling out an improvement to resolve the high latency on the US2 cluster.
We are currently investigating this issue.
Report: "A subset of webhooks may not be successfully delivered - EU cluster only"
Last update: Our engineering team has rolled out a fix and has confirmed webhook delivery is fully operational on the affected EU cluster.
Our engineering team is actively investigating an issue that is impacting the delivery of a specific set of webhooks. This is only affecting our EU cluster.
Report: "intermittent issues with publishing messages in EU cluster"
Last update: This issue has been resolved.
The issue has been identified and a fix has been released.
We are currently investigating the issue.
Report: "Webhook delivery issue - Channels EU cluster"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We're currently investigating a webhook delivery issue in the EU cluster, impacting a small portion of traffic.
Report: "Channels EU cluster having intermittent issues"
Last update: The cluster is now operational following the implementation of a fix at 17:13 UTC. Our team is actively monitoring the results. The primary contributing factor was identified as a DNS-related issue. In order to determine the root cause, a thorough investigation will be conducted and a detailed post-mortem will be published.
We are continuing our investigation into the ongoing issue affecting the EU cluster. Our team is actively working to mitigate the problem. While the cluster is slowly recovering, it has not yet reached a stable state.
We are currently experiencing an outage affecting API, WebSocket, and Webhook delivery services. Our team is actively investigating the issue.
All systems are now operational. However, clients utilising fallback transports may still experience intermittent issues. We are actively working to address the remaining issues.
Webhook delivery was temporarily affected, resulting in delays for some customers. Delivery is returning to normal.
We saw 3 short bursts of errors related to service restarts with limited customer impact in the last 2 hours. We are continuing to investigate the issue.
We are currently investigating this issue.
Report: "Disruption in Channels stats integration"
Last update: Between 06:10 and 09:14 UTC, we encountered an issue that temporarily disrupted the export of data to our stats integrations as well as our own dashboard. As a result, customers using our stats integration to relay metrics to platforms like Datadog and Librato may observe a gap in their data, and the same gap may be visible in our own dashboard.
Report: "Marketing website is unreachable for some visitors"
Last update: The marketing website (pusher.com) should now be reachable for all visitors, and this incident is now resolved. Pusher services were available and unaffected the entire time.
Our marketing website (pusher.com) is unreachable for some visitors. All Pusher services are unaffected and working as expected. We are investigating the issue with the marketing website.
Report: "Partial outage in eu cluster"
Last update: Between 14:30 and 14:45 UTC, the Channels API experienced a partial outage. At its peak, approximately 32% of requests to publish a new message failed and most clients had to reconnect.
Report: "Insights and graphs not available in neither customer and admin view"
Last update: Most customers have had their graphs restored. Over the next few days, the team will persist in their efforts to ensure that all customers' graph issues are resolved, albeit outside the incident process.
We've observed an improvement in processing speed, and for numerous customers, the data is now readily available and accurate on the Dashboard. We are actively working on the issue.
Some data processing is taking longer than expected. We are actively working on the issue.
The data replay is ongoing. We expect it to be completed in the next 6h.
We are now replaying old data in order to process it. This may cause intermittent delays in usage stats across the dashboard.
The team has restored the service and is currently monitoring.
We are currently investigating an issue that is preventing telemetry from being pushed to the customer and admin views, making the graphs unavailable.
Report: "Elevated API Errors"
Last update: This incident has been resolved.
Our vendor continues to investigate and work towards resolution of this issue. We are continuing to see elevated error rates on our public cluster during this time.
AWS has confirmed an issue in the us-west region. We are actively assessing the impact on our customers, who may experience elevated error rates during this time.
Customers hosted on our us3 cluster are currently experiencing elevated error rates. Our team is currently looking into the issue.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "MT1 Cluster issues"
Last update: This issue is now resolved and the service is operational once more. An RCA will be published once we conclude internal investigations into the issue.
A fix has been implemented and we are monitoring the results.
We are investigating reports of issues with our MT1 cluster.
Report: "MT1 cluster outage"
Last update: This issue is now resolved, with connections being established successfully across both WebSocket and fallback protocols. We will share a post-mortem when the full investigation is complete.
We've noticed some improvements, although the cluster isn't completely stable yet. Our team is working to resolve the issue. We're currently able to accept more connections through WebSocket transport. However, webhook delivery and fallback transports continue to experience higher impact. Our customer support team has been assisting some of our customers in redirecting their traffic to alternative US-based clusters. (US2 and US3)
We have implemented a fix and are monitoring progress as we see WebSocket connection reliability improve. We anticipate that our fallback protocols, SockJS and XHR, will improve over the course of the next 30 minutes.
We are continuing to investigate this issue.
We are investigating an issue with clients unable to establish a connection to the mt1 cluster.
Report: "MT1 Elevated Errors"
Last update: This incident has been resolved.
We have implemented a fix and are monitoring the impact - so far the system looks operational.
We are seeing that error rates have increased once more and are continuing to investigate.
The cluster has recovered. We are monitoring the results.
We're experiencing an elevated level of API errors on MT1 and are currently looking into the issue.
Report: "Elevated error rates on mt1 cluster"
Last update: Between 12:33 and 13:05 UTC customers may have encountered an increase in error rates when interacting with the Channels HTTP API. The error rate has since subsided. We are investigating the platform to identify the underlying cause.
Report: "Partial disruption in DataDog metrics exporter"
Last update: An incident occurred yesterday during a maintenance operation, resulting in disruptions to our stats exporter. Some customers might have gaps in their DataDog metrics between 18:30 and 19:30 UTC.
Report: "ap3 cluster experience reconnections for transports: sockjs, xhr_streaming, and xhr_polling"
Last update: The issue has been identified and resolved.
On the ap3 cluster, clients using one of these transports (sockjs, xhr_streaming, or xhr_polling) will experience regular reconnections. We are investigating the issue.
Report: "Issues impacting clients connected to our fallback transports in the MT1 cluster"
Last update: Between 15:50 and 18:00 UTC, we experienced a DNS-related issue that affected a substantial portion of our SockJS traffic. Clients using WebSocket were unaffected by this incident.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Partial outage in MT1 cluster"
Last update: Between 19:22 and 19:51, approximately 1.2% of messages failed, and a considerable number of our clients experienced reconnections. Our monitoring system experienced partial downtime, leading to confusion in our autoscaling policy. This confusion triggered a scale-down event in the cluster, resulting in reduced available capacity. Our engineering team has taken steps to prevent this from happening again. This incident has been resolved.
Report: "Partial outage in MT1 cluster"
Last update: Between 14:55 and 15:07 UTC, the Presence and Channel Existence state features in MT1 were affected, particularly for new connections. From 15:15 to 16:23 UTC, a significant number of clients connected through our fallback transports (xhr_streaming and long-polling) continued to experience the same issue. This incident has been resolved.
Report: "MT1 API service degradation"
Last update: From 13:31 to 13:53, we were aware of an increase in error rates and elevated latencies when interacting with the Pusher Channels API on the mt1 cluster. The issue has since been resolved and investigations into the root cause are underway.
Report: "Customers are unable to access the Pusher Dashboard."
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We've received reports that customers are unable to access the dashboard. Pusher engineers are currently investigating.
Report: "Mt1 cluster outage - API and Connections unavailable"
Last update: This incident has been resolved. It started at 10:32 UTC and the fix was deployed at 10:35 UTC.
The issue was resolved.
Report: "EU cluster outage - API and Connections unavailable"
Last update: This incident has been resolved.
The issue is resolved and we'll keep monitoring.
We investigated the issue and deployed a solution.