Historical record of incidents for Daily
Report: "Issues with logging and telemetry"
Last update: Daily is being affected by widespread internet outages in authentication services. This is currently affecting our ability to collect or view some call metrics. We'll post more as soon as we have more information.
Report: "Elevated error rates for SIP/PSTN"
Last update: This incident has been resolved.
It appears that SignalWire rolled back the change, and the errors have stopped. We are monitoring for any further issues.
SignalWire has confirmed that they made an unannounced change that's causing the problem, and they're working to revert it. We're also working on a quick production update to work around the issue if they can't revert quickly enough.
We're investigating elevated error rates from SignalWire when provisioning SIP/PSTN resources. If you're creating rooms with dial-in or dial-out enabled and it isn't absolutely necessary, you can remove those params from your room creation request to successfully create rooms.
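If you're scripting room creation and hitting these errors, here's a rough TypeScript sketch of the workaround described above against the REST API. The dialout_enabled property name comes from these updates, and the DAILY_API_KEY environment variable is an assumption; check both against the current REST docs:

    // Rough sketch: create a Daily room without SIP/PSTN provisioning while
    // the incident is ongoing. Property names and env var are assumptions.
    async function createRoomWithoutDialInOut(
      properties: Record<string, unknown>,
    ): Promise<unknown> {
      // Drop dial-out provisioning (and any other dial-in/SIP properties you use).
      const { dialout_enabled, ...rest } = properties;
      const res = await fetch("https://api.daily.co/v1/rooms", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.DAILY_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ properties: rest }),
      });
      if (!res.ok) throw new Error(`Room creation failed: ${res.status}`);
      return res.json();
    }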
Report: "Elevated error rates for SIP/PSTN"
Last updateWe're investigating elevated error rates from SignalWire when provisioning SIP/PSTN resources. If you're creating rooms with dial-in or dial-out enabled and it isn't absolutely necessary, you can remove those params from your room creation request to successfully create rooms.
Report: "Issues with SIP/PSTN audio quality"
Last update: This incident has been resolved.
We've deployed an update to production servers to remediate the issue, and we're testing to ensure everything is fixed.
We've identified an issue with how Daily and SignalWire are negotiating audio codecs for incoming SIP/PSTN calls. We're working on an update that will change the default audio codec used between Daily and SignalWire to Opus. Once this is deployed, you may still experience audio issues if you've explicitly set your SIP audio codec to PCMU as described here: https://docs.daily.co/guides/products/dial-in-dial-out/sip#sip-dial-in-audio-and-video
We've identified an issue causing degraded ("broken up" or "choppy") audio between some Daily sessions and SignalWire SIP/PSTN endpoints. PSTN dialout and PIN dialin do not seem to be affected. SIP dialout seems to be moderately affected. PIN-less PSTN dialin and SIP dialin seem to be experiencing a much higher proportion of affected calls.
We're investigating reports of poor quality audio for SIP/PSTN participants in some calls.
Report: "Issues with SIP/PSTN audio quality"
Last updateWe're investigating reports of poor quality audio for SIP/PSTN participants in some calls.
Report: "Elevated SIP/PSTN error rates"
Last update: This incident has been resolved.
We're still waiting on some additional fixes from SignalWire. Room creation with dial-in/dial-out is working, but you may still experience problems updating dial-in/dial-out settings for existing rooms. Daily Bots and Pipecat Cloud users should be unaffected.
SignalWire has restored service, but we're still seeing some error responses from their API. We believe the majority of room creation failure issues have been resolved. We'll post here again when we've handled this last issue.
We've been told we should be back online after a hotfix at approximately 18:00 UTC, or 15 minutes from now. We'll post another update as soon as we have more info.
We're still waiting on SignalWire to restore service.
We're still monitoring. In addition to dialout_enabled, you'll need to remove other room properties related to SIP/PSTN, dialin, and/or dialout to create rooms.
We're continuing to monitor an issue from our SIP/PSTN provider. The rest of the Daily platform is unaffected. If you're getting an error when trying to create a room, you can remove the dialout_enabled property from the room creation request and try again.
Our SIP/PSTN provider has identified an issue and they're deploying a fix. In the meantime, you should still be able to create Daily rooms without provisioning SIP/PSTN.
We're seeing elevated error rates from our SIP/PSTN provider. If you're creating rooms with dial-in support and getting errors, you may want to retry creating those rooms without dial-in, and then add dial-in with an update REST request.
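A compact sketch of that two-step workaround follows. The endpoint shape follows Daily's REST docs, dialout_enabled is the property name used in these updates, and any other dial-in settings you use are left as a placeholder:

    // Step 1: create the room without dial-in/dial-out properties (see the
    // createRoomWithoutDialInOut sketch earlier in this document).
    // Step 2: once creation succeeds, add dial-in/dial-out back with an
    // update request to the room.
    async function addDialInLater(roomName: string): Promise<void> {
      const res = await fetch(`https://api.daily.co/v1/rooms/${roomName}`, {
        method: "POST", // room config updates are POSTed to /rooms/:name
        headers: {
          Authorization: `Bearer ${process.env.DAILY_API_KEY}`, // assumed env var
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          // Re-add dialout_enabled plus any other dial-in settings you removed.
          properties: { dialout_enabled: true },
        }),
      });
      if (!res.ok) throw new Error(`Failed to update ${roomName}: ${res.status}`);
    }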
Report: "Elevated SIP/PSTN error rates"
Last updateWe're seeing elevated error rates from our SIP/PSTN provider. If you're creating rooms with dial-in support and getting errors, you may want to retry creating those rooms without dial-in, and then add dial-in with an update REST request.
Report: "Delayed audio for some SIP/PSTN dial-in calls"
Last update: This incident has been resolved.
We've deployed a fix, and we're monitoring for any further issues.
We've identified the issue, and we're testing a fix in our staging infrastructure. We'll post another update when we've deployed the fix.
We're getting reports of delayed audio from some customers using SIP dialin and dialout. When the phone user joins the call, they can talk and others will hear them, but the phone user won't hear any audio from other call participants (bot or human) for the first 20-30 seconds. We're continuing to troubleshoot the issue, and we'll post here as soon as we have more info.
We're investigating an issue that's causing delays in audio connection for some SIP/PSTN calls.
Report: "Issues connecting to rooms"
Last update: This issue has been resolved.
We've resolved the issue and we're monitoring to ensure the platform is operating normally.
We're investigating an issue that may be preventing some users from joining meeting rooms.
Report: "Networking issues"
Last update: We've deployed an update that increases the throughput of the database that was the bottleneck in today's incident. We'll have more info about additional remediations and a postmortem for today's incident within the next few days.
Our metrics have stayed at normal levels since our remediating actions about 30 minutes ago. We're continuing to monitor the platform while we discuss longer-term solutions to make absolutely sure we've addressed the root cause here.
We've made some changes to the affected database, and our metrics and error rates have returned to normal. We've also re-enabled delivery of all webhooks, and we're monitoring for any further issues.
We're addressing an issue with an internal database that's causing problems with existing meetings, as well as starting new ones. Your users are likely seeing some failures when trying to join meeting sessions, and users in ongoing sessions are seeing occasional meeting moves. We'll post more information as soon as it's available.
We've temporarily disabled the component that sends webhooks.
We're continuing to investigate the source of meeting disruptions. Customers may be experiencing 'meeting moves' where a call session moves from one server to another, causing a 2-3 second disruption to the call. You may also see delays in receiving meeting.started and meeting.ended webhooks.
We're investigating issues that may be causing problems with network connections between regions.
Report: "Issues starting recordings"
Last update: We're still making a few small infrastructure changes, but our internal metrics have been back at normal levels for some time.
We've identified an issue causing some recordings to fail to start, specifically in the Oracle Cloud San Jose region. We've already made some infrastructure changes that should be routing new recording requests to other regions. If you've seen a recording fail to start, you can try starting it again using daily-js or the REST API. We'll keep you posted on our progress resolving the issue.
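If you'd rather automate the retry from the client, here's a rough daily-js sketch (not an official snippet); the retry count and delay are illustrative assumptions:

    // Illustrative sketch: retry starting a recording from daily-js if it
    // fails during this incident. Retry count and delay are assumptions.
    import DailyIframe from "@daily-co/daily-js";

    const callObject = DailyIframe.createCallObject();
    let recordingRetries = 0;

    callObject.on("recording-error", () => {
      if (recordingRetries < 3) {
        recordingRetries += 1;
        // Wait a few seconds, then ask the platform to start recording again.
        setTimeout(() => callObject.startRecording(), 5000);
      }
    });

    export async function joinAndRecord(roomUrl: string): Promise<void> {
      await callObject.join({ url: roomUrl });
      callObject.startRecording();
    }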
Report: "Delayed API calls"
Last update: On Tuesday, October 22, around 17:15 UTC (9:15 AM PDT), a Daily customer started running a series of load tests. Their tests involved rapidly creating and deleting a large number of rooms that used PSTN dial-out, cloud recording, and webhooks. This eventually caused several capacity threshold alerts to fire around 18:15 UTC (10:15 AM PDT) as our system scaled out to handle the load.

We noticed that their test was running a script that created a room and started dial-out, but almost every instance of the script exited the room uncleanly before the outgoing call even connected to anything. This exposed an edge case that caused a ‘zombie’ PSTN participant to stay in that session and continue trying to send presence updates indefinitely. We’re already working on fixing that bug. This has probably happened before, but in much smaller quantities, since it involves a very unusual combination of events. Because this was an automated load test, however, it caused too many of these ‘zombies’ to build up, all trying to write frequent presence updates to the database. Soon, the database response time began to slow under the increased load.

Around that same time (18:15 UTC, 10:15 AM PDT), we noticed an increase in API error rates, specifically for actions that required writing to the database. Our team started to work both problems at once: safely get rid of the ‘zombie’ sessions without affecting other customers, and alleviate the load on the database to improve API response times. API error rates for POST requests spiked as high as 8%, and error rates for all requests peaked at 2-3%.

We were able to return API error levels and latency back to normal by around 19:50 UTC (12:50 PM PDT) by refreshing several database instances. We contacted the customer and stopped the load tests, and then we were able to remove the ‘zombie’ sessions through our normal deploy process.

We’re sorry for the disruption this caused. We’re already working on several remediations, including fixing the bug that caused the ‘zombie’ sessions and adjusting platform rate limits to prevent this from happening again.
This issue has been resolved. We will post more information about this incident in the near future.
API latency and errors have stayed at normal levels for a while now, but we're continuing to monitor for any further impact.
API error levels have decreased considerably, but we're still working on full remediation. More updates to come.
We've identified an issue causing some slowdowns in one of our databases, leading to some delayed or failed API responses. We've solved the root cause of the issue, but we're being cautious about restoring the database to full functionality, so we expect the delays to continue for a short time.
We're investigating an issue that's causing delays with some API operations, such as creating rooms and starting recordings. We'll post more info as soon as we have it.
Report: "Missing meeting webhook deliveries"
Last update: This incident has been resolved. Customers needing assistance with missing webhook deliveries should contact support via help@daily.co.
Between 14:01 UTC and 17:12 UTC webhooks for meeting.started and meeting.ended events were not delivered. We have applied a mitigation and are continuing to monitor. The underlying cause for the missing deliveries is still under investigation.
Report: "dashboard.daily.co availability"
Last update: This incident has been resolved.
The upstream issue has been resolved, and we're monitoring for any more issues.
Some customers are seeing 400 BAD_REQUEST messages when trying to load dashboard.daily.co. This is likely related to a Vercel incident: https://www.vercel-status.com/incidents/f6b2blrl5f5f
Report: "Elevated latency on some API endpoints"
Last update: API latency has returned to normal levels.
We have applied a mitigation and are continuing to monitor the situation.
We are currently investigating increased latency affecting some of our APIs.
Report: "Degraded logging and metrics API performance"
Last update: The impaired database system has fully recovered and is operating normally. API performance has returned to normal levels.
The degraded logging and metrics API performance was the result of an impaired database system. The initial impact was resolved earlier today, but we continue to monitor the system as recovery completes.
We are currently investigating an issue with degraded performance with the logging and metrics API.
Report: "Elevated latency / intermittent failures on API endpoints"
Last update: Network-level issues are resolved and service is operating nominally.
Network-level mitigations have been applied and we are seeing latency back to normal levels.
We're currently investigating this issue.
Report: "Issue with sessions in ap-northeast-2"
Last update: This issue has been resolved, and call sessions in ap-northeast-2 are working normally.
We've confirmed an issue preventing some users from joining calls hosted in the ap-northeast-2 region. Sessions in other regions are unaffected. If you've set the 'geo' property on your domain or a specific room to 'ap-northeast-2', you may want to temporarily change it to 'ap-south-1' or another nearby region.
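As a rough sketch of that temporary change (the geo property and region values are as described in the update above; the endpoint shape should be checked against Daily's REST docs):

    // Illustrative request body for pointing a room at a nearby region while
    // ap-northeast-2 is affected. Revert to your usual value once resolved.
    const body = {
      properties: {
        geo: "ap-south-1", // temporary replacement for "ap-northeast-2"
      },
    };
    // POST this body (with your API key) to https://api.daily.co/v1/rooms/<room-name>,
    // or set the same property at the domain level (see Daily's domain configuration docs).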
We're investigating an issue preventing some users from joining call sessions in the ap-northeast-2 region.
Report: "Problems connecting to rooms"
Last update: This incident has been resolved.
We've identified an issue that was causing some users to receive an error when trying to join a call. Affected users would see an error in the console starting with "web socket connection failed". We've rolled back a platform update from earlier today, and the errors have stopped. We're still diagnosing the problem with the platform update, but operations are back to normal.
We're investigating reports of problems when trying to join calls.
Report: "Issues connecting to calls"
Last update: On Tuesday, February 7 at 9:47 AM Eastern time (14:47 UTC), our database reported a performance issue under normal operational load. We had upgraded the database server over the weekend, but it had been operating normally since Monday. The alerts indicated a high level of lock contention on the newly upgraded database, which was causing problems for our call servers (SFUs). The SFUs are designed to shut themselves down if they are not able to connect to our database. When an SFU shuts down, our autoscaling will start a new SFU to replace it. With several SFUs shutting down at the same time (and several new ones starting), we experienced a larger than normal volume of “meeting moves”, which added additional load to a database that was already struggling. A “meeting move” occurs when an old SFU is shutting down: our webapp automatically moves any ongoing call sessions on that SFU to a different SFU. During a meeting move, users will usually notice everyone else’s video drop out for a second or two before reappearing.

The next few paragraphs show the sequence of events between 09:51 and 10:47 that helped us identify the cause.

By 9:51 (T+4 minutes), engineers had found a potential culprit: a large volume of queries stuck in a deadlock. These were “meeting events” from the SFU, noting when participants joined or left meetings. This was causing the webapp API requests to time out and return 5xx errors, and ultimately causing the SFUs to drop their connections and restart.

By 10:13 (T+26 minutes), we had found one potential cause of the deadlocks. After our database migration from the previous weekend, we were still using MySQL binary log replication to keep our old database up to date. We disabled binlog replication and restarted the database to try to reduce the overall load on the database. This helped, but many of the SFUs retried the queries that were causing the deadlocks, so the problem persisted. We continued investigating, and also contacted AWS support to see if they had any insight on the issue.

At 10:47 AM (T+1 hour), engineers were working on a script that would terminate stuck queries when the database suddenly restarted itself. This restart took slightly longer than the one at 10:13, and it allowed the SFUs to discard the now-stale meeting updates without being disconnected long enough to cause them to restart. At this point, the SFUs and the platform returned to normal operation.

We were ultimately able to prove that the deadlocking behavior was caused by a low-level behavior change introduced in a point release of MySQL. Our database maintenance from the previous weekend had upgraded us to that version and introduced the change. Working around that behavior change involved updating an index on one affected table. We spent the rest of the week developing and testing a plan to update the production database, and we completed that work with no user impact on Saturday evening.

At 11:01 on Tuesday, we decided we could move into a monitoring state while continuing to investigate the root cause. We left the status incident in a “monitoring” state until Friday, because we wanted to make sure we fully understood the initial cause of the deadlocks and took any necessary action to avoid it in the future. One such action was the addition of rate limiting to the room creation API endpoint.

The overall impact of this incident was limited to almost exactly one hour, between 14:47 and 15:47 UTC. During that time, some users in Daily calls experienced the “meeting moves” described earlier. There may have been a small number of users who weren’t able to join a room if they happened to try in the middle of a “meeting move”, which lasts a few tens of seconds; they would have been able to join by retrying a few seconds later. Similarly, some REST API requests may have returned 5xx error codes.

We are continuing to work with AWS to make sure that the deadlock issue we saw in production with Aurora MySQL 2.11.0 is fully documented, understood, and fixed in a future release. A more conservative approach to deadlocks was a known change in MySQL 5.7 (which Aurora MySQL 2 is based on). However, the severity of the deadlocks that we experienced during this incident was a surprise to us and to the AWS Aurora team.

We try hard to test all infrastructure changes under production-like workloads. In this case, we failed to test with a synthetic workload that had the right “shape” to trigger these deadlocks. As a result of this incident, we have added additional API request patterns to our testing workload. We’ve also added some new production monitoring alarms that are targeted at more fine-grained database metrics.
We've identified the issue that caused the incident on Tuesday morning. While we've already deployed fixes that helped prevent the problem from reoccurring, we still need to perform one more database update that will require a short scheduled maintenance. That will likely happen this weekend. We will post a full retro after completing the final database maintenance operation.
We've deployed a platform update with a few improvements designed to mitigate the impact of the current database performance issue. The only thing you may notice is that you'll no longer see 429 rate limit responses in your Dashboard API logs. Our database metrics have remained normal today, but we'll continue to monitor the platform to verify these fixes and watch for further issues.
While we were able to restore platform functionality earlier today, we've continued to troubleshoot the underlying issue that caused the problem. As a precautionary measure, we've temporarily enabled rate limiting on the REST API endpoint used to create rooms. The limit for POST /rooms is now the same as the DELETE /rooms/:name endpoint (https://docs.daily.co/reference/rest-api#rate-limits). You can expect about 2 requests per second, or 50 over a 30-second window.
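If your integration creates rooms in bursts, here's a rough sketch of handling the temporary 429 responses with a simple backoff; the retry counts and delays are illustrative assumptions, not official guidance:

    // Minimal sketch: create a Daily room, backing off when the temporary
    // POST /rooms rate limit returns HTTP 429.
    async function createRoomWithBackoff(
      properties: Record<string, unknown>,
      maxRetries = 5,
    ): Promise<unknown> {
      for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const res = await fetch("https://api.daily.co/v1/rooms", {
          method: "POST",
          headers: {
            Authorization: `Bearer ${process.env.DAILY_API_KEY}`, // assumed env var
            "Content-Type": "application/json",
          },
          body: JSON.stringify({ properties }),
        });
        if (res.status !== 429) {
          if (!res.ok) throw new Error(`Room creation failed: ${res.status}`);
          return res.json();
        }
        // Rate limited: back off with an increasing delay before retrying.
        await new Promise((r) => setTimeout(r, 1000 * (attempt + 1)));
      }
      throw new Error("Rate limited: exceeded retry budget for POST /rooms");
    }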
We’ve addressed the issue with the database, and platform operations have returned to normal. We are monitoring alerts and metrics for any further issues.
We've identified an issue with one of our databases that coordinates activity between call servers. This is causing elevated rates of "meeting moves", which is when an ongoing call session has to move from one call server to a different one. If you're in a call when this happens, you'll notice everyone's video and audio drop out and come back within a few seconds. You may also need to restart recording or live streaming when this happens. You may also experience timeouts when making REST API requests. We'll post more information as soon as it's available.
We are investigating elevated platform error rates. Users may get websocket connection errors when trying to join calls.
Report: "Issues connecting to calls."
Last update: This incident has been resolved.
A fix has been applied and we are monitoring to be sure that all underlying issues are resolved.
We have identified the issue and are applying a fix.
The issue has been identified and a fix is being implemented.
We're investigating an issue preventing some users from connecting to calls.
Report: "Missing metrics in call participant logs"
Last update: We've confirmed the initial report that there are a small number of recent call sessions that didn't log any metrics data. This can happen if your app has multiple call object instances running on the same page. Your app may do this if you are calling createCallObject() more often than you think; for example, in a React effect hook. Multiple call objects usually cause a variety of other errors on the page, so if you aren't already troubleshooting app issues related to this problem, you don't need to worry about missing metrics. We are adding functionality to daily-js to help customers identify whether they have multiple call objects on the same page. If you need help resolving this issue in your app, please feel free to contact support.
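As a rough illustration of the pattern described above (not an official Daily snippet; the hook name and types are assumptions), a React effect that creates a call object should destroy it in its cleanup function so re-renders don't accumulate multiple instances on the page:

    // Illustrative sketch: create exactly one call object per mounted component
    // and tear it down on unmount, so React re-renders or StrictMode
    // double-invocations don't leave multiple call objects on the page.
    import { useEffect, useRef } from "react";
    import DailyIframe, { DailyCall } from "@daily-co/daily-js";

    export function useDailyCallObject(roomUrl: string) {
      const callRef = useRef<DailyCall | null>(null);

      useEffect(() => {
        const callObject = DailyIframe.createCallObject();
        callRef.current = callObject;
        callObject.join({ url: roomUrl });

        return () => {
          // Cleanup prevents a second (or third) call object from accumulating.
          callObject.leave().finally(() => callObject.destroy());
          callRef.current = null;
        };
      }, [roomUrl]);

      return callRef;
    }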
We're investigating reports of missing metrics data in participant logs from a small number of users. This may date back to some time around 2023-01-23 17:00 UTC (9:00 AM PST on Monday, Jan 23).
We're investigating reports of missing metrics data in participant logs.
Report: "Problems creating raw-tracks recordings"
Last update: This incident has been resolved.
We've deployed new call servers and recording infrastructure to resolve the issue. You should be able to start a raw-tracks recording from any call session that started on or after approximately 02:15 UTC. Existing long-running call sessions may still be running on older call server instances. Those sessions may still experience errors with raw-tracks recordings. Those sessions will automatically move to new call server instances within the next few hours as part of our normal deploy process. We'll resolve this incident when all of the old call server instances have been retired and operations are back to normal.
We're in the process of deploying updates to resolve this issue. We'll resolve the incident as soon as the fix is live in production.
We've confirmed an issue preventing the creation of raw-tracks recordings. Other recording types are unaffected, including "cloud" recordings to your own S3 bucket. If you need to record an important call during this incident, you can change the enable_recording property on your domain, room, or meeting token to 'cloud' to make a cloud recording.
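A rough sketch of making that temporary change on a single room via the REST API (endpoint shape per Daily's REST docs; DAILY_API_KEY is an assumed environment variable):

    // Rough sketch: temporarily switch a room's recording type to "cloud"
    // during the raw-tracks incident. Revert once the incident is resolved.
    async function useCloudRecordingFor(roomName: string): Promise<void> {
      const res = await fetch(`https://api.daily.co/v1/rooms/${roomName}`, {
        method: "POST", // room config updates are POSTed to /rooms/:name
        headers: {
          Authorization: `Bearer ${process.env.DAILY_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ properties: { enable_recording: "cloud" } }),
      });
      if (!res.ok) throw new Error(`Failed to update ${roomName}: ${res.status}`);
    }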
We're investigating reports of errors from customers trying to create "raw-tracks" recordings.
Report: "Intermittent issues with cloud recordings"
Last update: This incident has been resolved.
We are experiencing an issue where cloud recordings are intermittently returning all black frames. We have pushed a fix to production and are currently monitoring the situation.
Report: "Intermittent issues starting cloud recordings and livestreams"
Last update: This incident has been resolved.
Daily’s auto scaling system experienced a failure to communicate with some internal services, preventing it from adding capacity for cloud recording and live streaming quickly enough to keep up with demand. We resolved the issue, and we're monitoring platform operations to ensure that everything has returned to normal.
We have pushed a fix to production and are currently monitoring the situation.
We have identified the issue and are currently testing a fix.
We are experiencing an issue where customers attempting to start livestreams or cloud recordings are intermittently receiving a temporarily-unavailable error.
Report: "Problems joining rooms in us-east-1"
Last update: This incident has been resolved.
We identified a brief issue with DNS while deploying a call server (sigh, it's always DNS). This would have caused intermittent join problems for some users for several minutes. We resolved the issue, and we're monitoring platform operations to ensure that everything has returned to normal.
We're investigating an issue preventing some users from joining meetings hosted in the us-east-1 region.
Report: "Issues connecting to rooms"
Last update: AWS has resolved their issue, and our operations have returned to normal.
We're still watching the ongoing AWS status incident until it's resolved. We'll provide another update if anything changes in the meantime.
AWS has acknowledged an issue with API Gateway in the us-west-2 region. We're routing API requests to other regions for now, so everything should be operating normally for you and your users. We'll leave this issue open until AWS has resolved their underlying issue and our health checks return to normal.
We're routing around a possible networking issue to our API gateways in us-west-2. This should allow your users to connect to calls, but we're still watching for other networking problems or follow-on effects (see https://news.ycombinator.com/item?id=33010341).
We're investigating an issue preventing some users from connecting to rooms in the us-west-2 region.
Report: "Problems connecting to rooms"
Last update: We've re-enabled our API Gateways in us-west-2, and users are connecting to rooms successfully. This incident is resolved.
We've confirmed that the issues with joining calls were a result of an AWS incident posted on their status site. AWS has resolved that incident, and we're seeing successful responses from our us-west-2 resources in our staging environment. We should be re-enabling our us-west-2 API Gateways shortly.
We've temporarily removed our affected us-west-2 API Gateways while AWS works to resolve the underlying issues. That should solve the problem that was preventing users from joining calls. Existing call sessions should be unaffected. We're closely monitoring other parts of our infrastructure, and we'll provide updates here if further issues emerge.
We are investigating an AWS issue with API Gateways in the us-west-2 region that is preventing some users from joining Daily sessions.
Report: "Users may be unable to view meeting session data in dashboard"
Last update: This incident has been resolved.
A fix has been deployed, and dashboard users should now be able to access meeting session data. We're continuing to monitor the situation.
We've identified an error impacting the ability for some users to view meeting session data in the Daily Dashboard. A fix is being implemented.
Report: "Customers may experience difficulty downloading meeting recordings."
Last update: The issue impacting downloads of meeting recordings is resolved.
Daily has resolved the issue impacting downloads of meeting recordings, and continues to monitor the situation.
Daily has identified a problem that is impacting the ability to download meeting recordings via the dashboard and access-link APIs, and is implementing a fix.
Report: "Connectivity issues"
Last update: Error rates and network metrics have returned to normal levels, so we're considering this issue resolved.
We've seen overall error rates decrease as AWS has been working to resolve the networking issue. Things are improving, but you may still experience delays and errors until this issue is fully resolved.
We're continuing to see issues across our platform as a result of the ongoing AWS outage. You'll likely experience problems joining calls, accessing the Dashboard, or using the REST API. We'll continue to post more information here as we have it.
We're experiencing network delays and timeouts throughout our infrastructure as a result of a larger-scale AWS incident. You may experience problems connecting to calls, viewing your Dashboard, or making REST API requests. We'll update this incident as we know more.
We're investigating reports of problems connecting to calls.
Report: "Degraded audio and video call experience"
Last update: We have restored our service provider configuration to its nominal state after confirming that all providers are operating normally.
We have temporarily routed traffic through another service provider, which should resolve call connection issues for most users.
We are continuing to work on a fix for this issue.
We are currently experiencing an issue with one of our service providers that may be affecting connections to calls (slow connections or timeouts). We are implementing a fix.
Report: "Audio and video calls impacted by an ongoing incident in us-east-1"
Last update: AWS has resolved the regional issues in us-east-1. We have re-activated the us-east-1 region for new audio and video calls.
We have attempted to work around this AWS issue for users that would normally be routed to our us-east-1 resources by temporarily removing all of our us-east-1 DNS records from our AWS API Gateway configurations whilst we wait for AWS to recover the region. We'll continue to monitor the situation.
Users close to the us-east-1 (Northern Virginia) region of AWS may be unable to join video calls. We are investigating.
Report: "Internet connectivity - South America region"
Last update: This incident has been resolved.
Connectivity issues in the South America - Brazil region have been resolved. We are continuing to monitor the situation.
AWS is experiencing intermittent Internet connectivity issues in the South America - Brazil region. Users may experience a degraded experience connecting to audio and video calls in the region. Connecting to calls may take longer, and users in the region may connect to a server in another region until connectivity returns to normal.
Report: "Dashboard degraded: logging and telemetry impacted"
Last update: This incident has been resolved.
The impacted data repository has returned to normal operation. Audio and video call logs and metrics should now be available in the dashboard.
We have identified a problem retrieving audio and video call logging and telemetry from the dashboard. Customers may experience a 'Session not found' error message when attempting to view call logs and telemetry.
Report: "Call Telemetry degraded"
Last update: We have confirmed that call telemetry data is flowing nominally into the data repository.
A resolution has been implemented and telemetry data from ongoing calls should appear in the dashboard.
We have identified a problem with our call telemetry data repository and are working on a fix. Telemetry data from some calls may not appear in the dashboard. Audio and video call experiences are not impacted.
Report: "Degraded Call Experience"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
The fix for the degraded call experience has been deployed. Daily is monitoring in-call activity.
Intermittent degraded call experience or difficulty connecting to calls in some geographic regions. A fix is currently being deployed.
Report: "Degraded dashboard and REST API performance"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
We're experiencing higher than normal CPU utilization on our database. You may experience slower than normal Dashboard loading and REST API requests.
Report: "SSL Certificate Expiration"
Last update: On August 31, 2020, at 05:28:23 (all times UTC), the Secure Sockets Layer (SSL) certificate used to secure connections to Daily's Selective Forwarding Units (SFUs) expired. The expired certificate meant that client connections to these servers no longer worked, and had the following impact:
- Meetings using our WebSocket signaling option did not function.
- Regardless of the signaling option, meetings that use SFU rather than peer-to-peer media delivery did not function.