Historical record of incidents for getstream.io
Report: "Outage in Us-East Edge"
Last update: There was a significant spike in requests hitting one of our Edge shards in the US-East region, leading to resource exhaustion. The incident lasted 4 minutes.
Report: "High error rate on Chat API service in us-east"
Last update: This incident has been resolved.
The issue is being resolved, and our team is closely monitoring the database.
One of our shards in the us-east region is experiencing an issue with the underlying database storage. Our team is actively working to restore the service.
Report: "Video - high error rate on Join call API (Mumbai)"
Last update: Our team deployed a patch to the affected system and the incident has been resolved.
Between 12:35 and 12:51 UTC, our API layer experienced downtime in the Mumbai region due to increased system load. We are currently investigating the incident and implementing a patch. During this time, API requests to join calls returned 5xx responses.
Report: "Service degradation in the US-East region"
Last update: This incident has been resolved.
One of our shards in the US-East region is experiencing degraded performance due to an issue with the underlying data storage. Our team is currently working on resolving the incident.
The issue has been identified and a fix is being implemented.
Report: "Degraded performance in Dublin"
Last update: The incident has been resolved.
One of our shards in the Dublin region is experiencing degraded performance due to an issue with the underlying data storage. Our team is currently working on resolving the incident.
Report: "API reachability problems in United Arab Emirates (UAE)"
Last update: This incident has been resolved.
APIs are partially unreachable for users within the UAE on the DU Telecom and Virgin Telecom ISPs. Our team is currently working to remediate the problem with the ISPs and local authorities; a temporary remediation is also in progress.
Report: "High error rate on Chat API service in ohio"
Last update: The incident has been resolved, and traffic is now served without any issue.
Our monitoring system has detected an incident affecting one of our shards in Ohio. Customers located on that shard have experienced a high error rate.
Report: "High error rate on one of our Feed shards in the us-east region"
Last update: The shard has been recovered and the incident has been resolved. Our team is currently conducting an internal investigation to determine the root cause of the issue.
The issue has been identified and a fix is being implemented.
We are currently experiencing a high error rate on one of our Feed shards in the us-east region. Our team is actively working to resolve the situation.
Report: "High error rate on Chat API endpoints"
Last update: At 21:02 CET, a high error rate was recorded on the Chat API endpoints in one of our us-east shards. Our team resolved the issue at 21:18 CET, and the service is now operating normally.
Report: "High error rate for Chat Query Channels endpoint"
Last update: An issue with the QueryChannel endpoint led to some queries returning an HTTP 403 response code. This incident has been resolved.
Report: "Realtime connections outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue with our Feed Realtime service.
Report: "Elevated error rate in our edge network"
Last update: This incident has been resolved.
The issue has been identified and our team is working on a remediation.
Report: "Elevated error rate for Feed apps in dublin region"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "AWS connectivity issues"
Last update: This incident has been resolved.
The issue has been resolved and the service in the Ohio region is operating normally.
We are experiencing a partial outage due to AWS connectivity issues for selected apps in the Ohio region.
Report: "Elevated API Errors on us-east"
Last update: The incident has been resolved. A post-mortem will follow.
This morning's issue propagated to an additional component of our infrastructure that dispatches messages to end users via the websocket protocol. Our team worked to mitigate the issue and the problem now appears to be resolved. We are still monitoring the situation closely.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented. A temporary remediation has been put in place to mitigate the ongoing issue.
We're experiencing an elevated level of API errors and are currently looking into the issue. This issue affects only one shard in our us-east region.
Report: "Elevated API error rate in Dublin"
Last update: Traffic to our Dublin infrastructure experienced an elevated error rate due to an AWS outage. The incident started at 11:20PM, the error rate decreased at 11:38PM, and the incident was resolved by 11:59PM. We are still performing impact and root-cause analysis; a postmortem with more information will be posted here.
Report: "Increased error rate on Chat API"
Last update: We experienced higher than normal error rates on the Chat API during database maintenance. The increase in errors started at 5:24AM and was resolved by 5:42AM UTC.
Report: "Chat API"
Last update: High error rate on Chat HTTP APIs
Report: "High error rate on Feed Realtime endpoint"
Last update: This incident has been resolved.
Realtime updates for feeds are back to normal; we are still monitoring the traffic. The previous patch unfortunately did not resolve the problem and caused realtime clients to retry the connection after receiving a `Client not found, please reconnect` response.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Feed Realtime - SQS high error rate"
Last update: Millions of requests to the handshake endpoint of our feed realtime system broke the API. This issue has been resolved and a full post mortem will follow.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue with AWS SQS; we are receiving a 100% error rate from the SQS APIs. Our feeds realtime endpoint is currently unable to push notifications to SQS.
Report: "Elevated error rates on Chat API"
Last update: We have completed the post mortem for the December 9th incident.

As the founder and CEO of Stream I’d like to apologize to all of our customers impacted by this issue. Stream powers activity feeds and chat for a billion end users, and we recognize that our customers operating in important sectors, such as healthcare, education, finance, and social apps, rely on our technology. As such, we have a responsibility to ensure that these systems are always available. Stability and performance are the cornerstone of what makes a hosted API like Stream work. Over the last 5 years it’s been extremely rare for us to have stability issues. Our team spends a significant amount of time and resources to ensure that we keep up our good stability track record. On December 9th, however, we made some significant mistakes, and we need to learn from that, as a team, and do better in the future.

**The Outage**

A rolling deployment between 11:28 GMT and 14:38 GMT was made to chat shards in the US-east and Singapore regions. The code contained an issue with our Raft-based replication system, causing 66% of message events to not be delivered. Messages were still stored and retrievable via the API. The event replay endpoint also still returned messages. At 17:00 GMT the issue was identified and the code was rolled back, resolving the issue for all shards by 17:38 GMT. While the end-user impact on the chat experience depends on the SDK, the offline storage integration, and the API region, for most apps this meant a very significant disruption to the chat functionality.

**What Went Wrong**

As with any significant downtime event, it was a combination of problems that caused this outage.

1. The issue with the broken code should have been caught during our review process.
2. The QA process should have identified this issue. Unfortunately, tests were run on a single-node setup and did not capture the bug.
3. The issue with the reduced message events should have been visible during the rolling deploy.
4. Monitoring and alerting should have picked up the issue before our customers reported it.

**Resolution 1 - Monitoring**

The biggest and most glaring issue here is the monitoring. While we do have extensive monitoring and alerting in place, we did not have a check that captured message propagation. The team is introducing monitoring to track message delivery and adding alerting rules.

**Resolution 2 - QA**

The second issue is that our extensive QA test suite didn’t catch this issue, since it only occurred when running Stream in a multi-cluster environment. We are updating our QA process to run in a cluster environment, so that it more closely resembles production systems.

**Resolution 3 - Heartbeat Monitoring**

The previous two resolutions would have been enough to avoid this incident or reduce it to a very minor incident. With that being said, Chat API is a complex system and we think that more end-to-end testing will make issues easier to notice. For this reason we are also going to introduce canary-like testing so that we can detect failures at the client-side level as well.

**Non-Technical Factors**

Stream has been growing extremely rapidly over the last year. Our team grew from 31 to 93 in the last 12 months. The chat API usage has been growing even faster than that. Keeping up with this level of growth requires constant changes to processes and operations like monitoring and deployment. This is something we have to reflect on as a team and do better at.

**Conclusion**

Performance and stability are among our key focus areas and something we spend a significant part of our engineering efforts on. Yesterday we let our customers down. For that, Tommaso and I would like to apologize. The entire team at Stream will strive to do better in the future.
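To illustrate the kind of canary-like, client-level check described in Resolution 3 above, here is a minimal sketch. The endpoints, the auth token handling, and the use of `websocat` as a websocket client are assumptions for illustration, not Stream's actual canary implementation:

```bash
#!/usr/bin/env bash
# Hypothetical canary: send a chat message through the HTTP API and verify
# that the matching event is delivered over the websocket within a deadline.
set -euo pipefail

API_URL="https://chat.example.com"            # placeholder API host
WS_URL="wss://chat.example.com/connect"       # placeholder websocket URL
CANARY_TOKEN="${CANARY_TOKEN:?set a canary auth token}"
CANARY_ID="canary-$(date +%s)"
EVENTS_LOG="$(mktemp)"

# Listen on the websocket in the background for up to 30 seconds.
timeout 30 websocat "$WS_URL" > "$EVENTS_LOG" &
LISTENER=$!
sleep 2                                        # give the connection time to establish

# Send a message whose id we can look for in the event stream.
curl -fsS -X POST "$API_URL/messages" \
     -H "Authorization: Bearer $CANARY_TOKEN" \
     -d "{\"id\": \"$CANARY_ID\", \"text\": \"canary\"}" > /dev/null

wait "$LISTENER" || true                       # timeout exits non-zero; that is expected

if grep -q "$CANARY_ID" "$EVENTS_LOG"; then
  echo "canary OK: message event delivered"
else
  echo "canary FAILED: message event not delivered" >&2
  exit 1                                       # non-zero exit feeds the alerting pipeline
fi
```

Run on a schedule from several regions, a check like this fails loudly when events stop propagating even though the HTTP API itself still accepts writes, which was the failure mode in this incident.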
This incident has been resolved.
We identified an issue with Chat API that caused some messages to not be delivered via Websockets. The problem is already resolved for most applications, and the remediation should be completed for all apps shortly.
We are currently investigating an increase of errors on Chat API
Report: "Increased API latency"
Last update: The AWS networking issue is now resolved. We are now cleaning up our temporary remediations since they are no longer needed. Traffic has been back to normal for the last hour.
Due to a networking issue on AWS us-east region, we are experiencing increased latency for some of the traffic on our US region. We are mitigating the problem while waiting for a final remediation on AWS infrastructure.
Report: "High error rates and timeouts"
Last update: Between 4:05PM and 4:45PM UTC on January 28, 2020 we had an API outage caused by performance degradation. The event was triggered by a new release to our Chat API servers; shortly after the new release went live, load on our database infrastructure increased and caused HTTP response times to spike and, in some cases, time out. The event was detected by our latency and error monitoring. The team started working on the event by rolling back to the previous version at 4:20PM UTC. Unfortunately the rollback did not resolve the problem entirely. After another rollback attempt we realised there were still pending queries from the previous release running on our PostgreSQL database. We manually terminated all the pending queries at 4:40PM UTC; after that the error rate dropped to 0% again. The outage affected 5% of HTTP requests at its peak (4:20PM to 4:27PM UTC).
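For context on the manual termination step described above, this is a generic sketch of how long-running queries left over from an old release can be found and terminated in PostgreSQL; the five-minute threshold and the connection string are placeholders, not the exact procedure Stream ran:

```bash
# List queries that have been running for more than five minutes.
psql "$DATABASE_URL" -c "
  SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 minutes';"

# Terminate them. pg_terminate_backend() closes the backend connection and
# rolls back whatever the query had in progress.
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 minutes';"
```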
This incident has been resolved.
We are continuing to monitor for any further issues.
A recent release caused a load increase on part of the chat infrastructure, resulting in degraded performance and timeout errors. Remediation is in progress.
Report: "Timeout Errors"
Last update: This incident has been resolved.
Increased load on some API endpoints caused intermittent spikes of timeout errors. Adding more capacity remediated the problem.
We are experiencing spikes of timeout errors; the team is investigating the root cause and working on a remediation.
Report: "Emails from Dashboard are not sent"
Last update: This incident has been resolved.
Emails from the Dashboard (invites, password resets, and other notifications) are currently not being sent correctly. We are working with our SMTP provider (Mailgun) to resolve this issue as soon as possible.
Report: "Dashboard redirect issue"
Last update: This issue has been resolved.
The dashboard has a bug that's causing it to redirect some users to the homepage. Our team is investigating. APIs are fully operational; this only impacts the dashboard.
Report: "Elevated API Errors on US-EAST"
Last update: We experienced an elevated level of API errors in our us-east region. This incident lasted from 2:11PM to 2:18PM UTC.
Report: "Elevated API Errors on region EU-WEST"
Last update: This incident has been resolved.
We were experiencing an elevated level of API errors because of an unsuccessful Redis upgrade. We have resolved the issue and are monitoring for further problems.
Report: "EU-WEST API downtime"
Last update: Due to an operational mistake, the API service had a very high error rate in the Europe West region between 12:56PM and 12:58PM UTC. The problem has been mitigated and resolved. Detailed API error rate over time: 12:56PM 78%, 12:57PM 93%, 12:58PM 4%.
Report: "Partial API outage"
Last update: Between 03:59PM and 04:32PM UTC, some API traffic resulted in HTTP errors or timeouts. Only a subset of Stream applications hosted in the US was affected by this problem.
Report: "Realtime Redis Failover"
Last update: Our distributed realtime cluster uses Redis (on Elasticache) for state management. A failover of the Elasticache cluster caused realtime to be unavailable for 7 minutes. This issue has been resolved and we're investigating why the failover took 7 minutes. This impacted customers using Stream's websocket, SQS or Webhook firehose systems.
Report: "High latency spike and increased error rate"
Last update:

### January 23 and 24 outage postmortem report

Stream suffered two incidents of degraded performance in the past 24 hours. We take uptime very seriously and would like to be transparent about our operations with our customers. The spikes occurred on Jan 23 at 3:50PM UTC and on Jan 24 at 11:45AM UTC. Both spikes were caused by a sudden increase of pressure on one of our PostgreSQL databases. Because PostgreSQL was slow at serving queries, HTTP requests started to pile up and eventually saturated the API workers' connection backlogs. API clients using a very low timeout will have encountered timeout exceptions. Other users of Stream would see 5xx responses on part of their API calls.

I am going to add a little bit of background so that it is easier to elaborate on what went wrong. Some of our internal operations rely on moving data from one PostgreSQL database to another. Thanks to `psql`, such an operation is routinely performed by piping `COPY TO STDOUT` and `COPY FROM STDOUT` together. In order not to pressure the destination database with writes, we also use `pv` so that we are sure we never end up consuming all our IOPS capacity. The command looks more or less like this:

```
psql src_db -c '\copy (...) to stdout' | pv -p --rate-limit 5242880 | psql dst_db -c '\copy (...) from stdout'
```

By terminating the same copy command running on the source database we were able to remove **write** pressure on the disk. After that the high latency problem affecting the API service was automatically resolved. After researching other possible causes, we concluded that the pressure created by the copy command combined with increased traffic was behind this outage, and picked a different time to run the operation again. The same operation was then restarted during low traffic hours. To our surprise, write pressure increased again after a couple of hours on the source database and caused another, albeit shorter, outage.

After more digging we realized that on both occasions the command was running in the background and could not write to stdout, forcing the source database to store the query results on disk, which in turn caused very high disk I/O and slow response times to regular traffic. The remediation for this is very straightforward: never block stdout on the source database.
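As a general illustration of that remediation (the exact commands below are assumptions, not Stream's runbook), the copy can be run so its output is never left blocked, either by keeping it attached to a live terminal via tmux or by detaching it with all output redirected to a file:

```bash
# Option 1: run the pipeline inside tmux so there is always a terminal
# draining psql's and pv's output.
tmux new-session -d -s pgcopy \
  "psql src_db -c '\copy (...) to stdout' | pv -p --rate-limit 5242880 | psql dst_db -c '\copy (...) from stdout'"

# Option 2: detach with nohup and redirect stdout/stderr to a log file, so the
# backgrounded job never stalls trying to write to a terminal.
nohup bash -c \
  "psql src_db -c '\copy (...) to stdout' | pv --rate-limit 5242880 | psql dst_db -c '\copy (...) from stdout'" \
  > copy.log 2>&1 &
```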
Between 11:43 and 11:47 UTC, API traffic had a spike in HTTP 5xx errors and an increase in latency.
Report: "High API error rate"
Last update:

# The issue

From 19:16 to 19:24 UTC and from 19:30 to 19:46 UTC we had a high number of HTTP 502 errors when connecting to the API.

# The causes

A change was made to our servers' SSH configuration that was thought not to have any effect. However, on newly provisioned servers it caused a failure to start the server process. Normally this wouldn't have caused a big problem, because the load balancer should mark the host as unhealthy and thus no traffic should be sent there. Unfortunately, this was not the case because of a recent change in the health check logic. This change wrongly reported the server as healthy even though the server process was down.

# The fixes

First we removed the bad servers manually from the load balancer. After that we fixed the problem with the SSH configuration and added the servers back to the load balancer. Finally we changed the health check to not report healthy when the server process is down. Our apologies for the outage; our team is hard at work to further improve stability.
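As a generic sketch of that last fix (the unit name, port, and path are placeholders, not Stream's actual configuration), a health check script should only report healthy when the application process itself answers:

```bash
#!/usr/bin/env bash
# Health check invoked by the load balancer: it must fail when the API process
# is down, not merely when the host is reachable.
set -euo pipefail

# 1. The service must be running under the init system.
systemctl is-active --quiet api-server.service

# 2. The process must answer an application-level request on its own port.
curl -fsS --max-time 2 http://127.0.0.1:8080/healthcheck > /dev/null

# Any failure above exits non-zero, and the load balancer marks the host unhealthy.
echo "healthy"
```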
The issue has been resolved, more information about the outage will follow shortly.
We're currently investigating a high error rate on the APIs. A percentage of requests to the API are returning 502s; the cause has not yet been identified.
Report: "API downtime"
Last update: We've received additional information from AWS about this outage. To summarize, the RDS monitoring process and DB instance both failed, causing a delay in automated failover.

=====

Thank you for contacting AWS Premium Support. I understand that your RDS instance was not reachable from 1:23 to 1:41 UTC on 9th of February 2017 and you want to know the cause for it. I have investigated your RDS instance and following is my analysis:

--> 2017-02-09 01:25:26 External Monitoring process is unable to communicate with monitoring service on your instance
--> Due to the communication issues talking to the monitoring process on the instance, the failover was getting delayed until the hard limit was reached from the external monitoring process. Before External Monitoring process forces failover you did a manual reboot with failover at around 2017-02-09 01:40:42 UTC.
--> That was the reason CloudWatch metrics was not available during that time period but it started uploading after it failed over to standby DB instance.
--> After making sure new primary DB instance is up to date with the old primary DB instance, RDS issued replace DB instance.
--> Replace DB instance workflow has deleted the faulty instance (old primary) and replace it with new instance. Then it will sync up with the primary DB instance.
--> This process (Replace DB instance) completed successfully at 2017-02-09 1:57:45 UTC. However, during this process DB instance was available for reads and writes.

Normally the failover will be triggered shortly within few minutes and this time it's indeed abnormal. It rarely happens and we do apologize for any inconvenience that this issue might have caused on your environment. The RDS team always works hard on improving the stability and reliability of the RDS service but sometimes failure do occur. Our sincerest apologies for the operational pain that was caused you and please let me know if there is anything else I can assist with.
The problem was related to a hardware failure with one of our databases. The faulty server was replaced with a hot backup.
We are investigating an outage on the API.
Report: "Slow performance on feed API endpoints"
Last update:

## The problem

Due to what seems to be a bug with EC2 Security Groups, the connectivity between one API server and one Redis backend was impaired. This connectivity issue resulted in API requests waiting until a hard timeout occurred. At its peak, 1% of all API calls were affected and either returned a 502 error code or raised client-side timeout exceptions.

## Mitigation

Once the problem was clear, the EC2 server with the configuration problem was removed from the load balancer; this immediately resolved the problem.

## Solution

We are talking with AWS support to isolate and validate this problem; in the meantime we have instrumented all our API servers to proactively check for this specific issue and decommission servers experiencing the same problem.
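A rough sketch of that kind of self-check follows; the Redis host, target group ARN, and timeout are illustrative assumptions, not Stream's actual instrumentation. Each API server periodically verifies that it can reach its Redis backend and pulls itself out of the load balancer if it cannot:

```bash
#!/usr/bin/env bash
# Periodic self-check run on every API server: if the Redis backend is
# unreachable, deregister this instance from the load balancer so it stops
# receiving traffic.
set -euo pipefail

REDIS_HOST="redis.internal.example.com"                                   # placeholder
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:...:targetgroup/api/..."   # placeholder
INSTANCE_ID="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

if ! timeout 3 redis-cli -h "$REDIS_HOST" ping | grep -q PONG; then
  echo "Redis unreachable from $INSTANCE_ID; removing instance from the load balancer" >&2
  aws elbv2 deregister-targets \
    --target-group-arn "$TARGET_GROUP_ARN" \
    --targets "Id=$INSTANCE_ID"
  exit 1
fi
```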
The issue has been resolved. We're still investigating the root cause.
We're investigating a slowdown on our main feed API endpoint.
Report: "Realtime WS connections outage"
Last update: Connections are back; we are currently investigating the root cause.
Report: "Elevated API Errors"
Last update: Between 7:24AM and 7:26AM UTC we had an increased error rate due to a database failover procedure triggered by AWS.
Report: "Elevated API Errors"
Last update: We experienced an elevated rate of API errors between 13:00 and 13:01 UTC. The issue resolved itself and we are looking into the cause.
Report: "Dashboard issues"
Last update: We switched to our backup CDN a few minutes ago. The Dashboard is fully operational again.
The CDN (IMGIX) is not serving static files from our origin correctly (https://status.imgix.com/incidents/5plz9dqxxhns). We have temporarily mitigated the problem by rolling back to a previous release (the static files for that release are available at the CDN edge nodes and are returned correctly). We are now working on switching to the backup CDN (AWS CloudFront).
We're currently investigating problems with loading the Dashboard.
Report: "Elevated API Errors"
Last update: The outage started at 10:14 UTC and was over at 10:18 UTC.
The issue seems to be resolved.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Elevated API Errors"
Last update: A small part of Stream's customers was impacted by an issue that caused notification feeds to be temporarily unavailable. Earlier today at 15:30 GMT a maintenance database migration was initiated; unfortunately the procedure did not clear the cached state of feeds correctly. This led to reads not showing activities from before that time. At 16:00 UTC we remediated by flushing the stale cache, which immediately fixed the problem. The issue did not affect any write operations such as adding/removing activities or following/unfollowing feeds. After some investigation we found and amended the incorrect part of the maintenance operation. Apologies for the trouble and for bringing this up over the weekend.
Report: "Intermittent failures"
Last update: This incident has been resolved.
The API service has been back to normal since 3:57PM UTC. The outage was due to temporary performance degradation of a PostgreSQL server.
The failures seem to have stopped. We're still looking into the root cause.
Intermittent failures; the cause is not known yet.
Report: "Failure on main PG database"
Last update: The issue was mitigated and we are now working on a permanent solution.
Our main PG database (which holds the configs) is seeing high CPU usage. This is causing a percentage of API requests to fail. We're investigating this issue and will keep you posted.
Report: "DNS issues with getstream.io domain"
Last update: DNS resolution for getstream.io is back to normal. The two .io nameservers are now returning the correct results.
DNS resolution for getstream.io is randomly returning errors. The root cause seems to be related to two .io nameservers returning incorrect results.
Report: "API latency increase"
Last update: We identified and resolved a problem affecting API latency for several applications. The problem was pinned down to heavy pressure on a PostgreSQL server; once the issue was resolved, latency returned to normal levels.
Report: "API latency increase"
Last update: The problem was related to routine maintenance slowing down query response times for one PostgreSQL database. API latency was impacted between 3:50PM and 4:15PM UTC.
We are currently investigating this issue.
Report: "Latency spike"
Last update: There was a temporary spike in latency caused by our Cassandra cluster. The issue has been mitigated.
Report: "High API latency"
Last update: The high latency has now been resolved. The root cause isn't clear yet.
Stream's main API is experiencing high latency. Our team is investigating this. The cause is not known yet.
Report: "API Outage"
Last update: At 2:41 PM UTC we experienced approximately 60 seconds of downtime with the Stream API. This was caused by a database lock not being released in a timely manner. Regular service was restored immediately after the lock was released.
Report: "Random latency spikes"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
One of the Cassandra servers was very slow at serving queries due to a long stop-the-world GC.
We are investigating this issue.