Historical record of incidents for getstream.io
Report: "Outage in Us-East Edge"
Last update: There was a significant spike in requests hitting one of our Edge shards in the US-East region, leading to resource exhaustion. The incident lasted 4 minutes.
Report: "High error rate on Chat API service in us-east"
Last update: This incident has been resolved.
The issue is being resolved, and our team is closely monitoring the database.
One of our shards in the us-east region is experiencing an issue with the underlying database storage. Our team is actively working to restore the service.
Report: "Video - high error rate on Join call API (Mumbai)"
Last update: Our team deployed a patch to the affected system and the incident has been resolved.
Between 12:35 and 12:51 UTC, our API layer experienced downtime in the Mumbai region due to increased system load. We are currently investigating the incident and implementing a patch. During this time, API requests to join calls returned 5xx responses.
Report: "Service degradation in the US-East region"
Last update: This incident has been resolved.
One of our shards in the US-East region is experiencing degraded performance due to an issue with the underlying data storage. Our team is currently working on resolving the incident.
The issue has been identified and a fix is being implemented.
Report: "Degraded performance in Dublin"
Last update: The incident has been resolved.
One of our shards in the Dublin region is experiencing degraded performance due to an issue with the underlying data storage. Our team is currently working on resolving the incident.
Report: "API reachability problems in United Arab Emirates (UAE)"
Last update: This incident has been resolved.
APIs are partially unreachable for users within the UAE on the DU Telecom and Virgin Telecom ISPs. Our team is currently working to remediate the problem with the ISPs and local authorities; a temporary remediation is also in progress.
Report: "High error rate on Chat API service in ohio"
Last update: The incident has been resolved, and traffic is now served without any issue.
Our monitoring system has detected an incident affecting one of our shards in Ohio. Customers located on that shard have experienced a high error rate.
Report: "High error rate on one of our Feed shards in the us-east region"
Last update: The shard has been recovered and the incident has been resolved. Our team is currently conducting an internal investigation to determine the root cause of the issue.
The issue has been identified and a fix is being implemented.
We are currently experiencing a high error rate on one of our Feed shards in the us-east region. Our team is actively working to resolve the situation.
Report: "High error rate on Chat API endpoints"
Last update: At 21:02 CET, a high error rate was recorded on the Chat API endpoints in one of our us-east shards. Our team resolved the issue at 21:18 CET, and the service is now operating normally.
Report: "High error rate for Chat Query Channels endpoint"
Last update: An issue with the QueryChannel endpoint led to some queries returning an HTTP 403 response code. This incident has been resolved.
Report: "Realtime connections outage"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating an issue with our Feed Realtime service.
Report: "Elevated error rate in our edge network"
Last update: This incident has been resolved.
The issue has been identified and our team is working on a remediation.
Report: "Elevated error rate for Feed apps in dublin region"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
Report: "AWS connectivity issues"
Last update: This incident has been resolved.
The issue has been resolved and the service in the Ohio region is operating normally.
We are experiencing a partial outage due to AWS connectivity issues for selected apps in the Ohio region.
Report: "Elevated API Errors on us-east"
Last update: The incident has been resolved. A post-mortem will follow.
This morning's issue propagated to an additional component of our infrastructure that dispatches messages to end users via the websocket protocol. Our team worked to mitigate the issue and the problem now appears to be resolved. We are still monitoring the situation closely.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented. A temporary remediation has been put in place to mitigate the ongoing issue.
We're experiencing an elevated level of API errors and are currently looking into the issue. This issue affects only one shard in our us-east region.
Report: "Elevated API error rate in Dublin"
Last update: Traffic to our Dublin infrastructure experienced an elevated error rate due to an AWS outage. The incident started at 11:20PM, the error rate decreased at 11:38PM, and the incident was resolved by 11:59PM. We are still performing impact and root-cause analysis; a postmortem with more information will be posted here.
Report: "Increased error rate on Chat API"
Last update: We experienced higher than normal error rates on the Chat API during database maintenance. The increase in errors started at 5:24AM and was resolved by 5:42AM UTC.
Report: "Chat API"
Last update: High error rate on Chat HTTP APIs
Report: "High error rate on Feed Realtime endpoint"
Last update: This incident has been resolved.
Realtime updates for feeds are back to normal; we are still monitoring the traffic. The previous patch unfortunately did not resolve the problem and caused realtime clients to retry the connection after receiving a `Client not found, please reconnect` response.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
Report: "Feed Realtime - SQS high error rate"
Last update: Millions of requests to the handshake endpoint of our feed realtime system broke the API. This issue has been resolved and a full post mortem will follow.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating an issue with AWS SQS; we are receiving a 100% error rate from the SQS APIs. Our feeds realtime endpoint is currently unable to push notifications to SQS.
Report: "Elevated error rates on Chat API"
Last update: We have completed the post mortem for the December 9th incident.

As the founder and CEO of Stream I’d like to apologize to all of our customers impacted by this issue. Stream powers activity feeds and chat for a billion end users, and we recognize that our customers operating in important sectors, such as healthcare, education, finance, and social apps, rely on our technology. As such, we have a responsibility to ensure that these systems are always available. Stability and performance are the cornerstone of what makes a hosted API like Stream work. Over the last 5 years it’s been extremely rare for us to have stability issues. Our team spends a significant amount of time and resources to ensure that we keep up our good stability track record. On December 9th, however, we made some significant mistakes, and we need to learn from that, as a team, and do better in the future.

**The Outage**

A rolling deployment between 11:28 GMT and 14:38 GMT was made to chat shards in the US-east and Singapore regions. The code contained an issue with our Raft-based replication system, causing 66% of message events to not be delivered. Messages were still stored and retrievable via the API. The event replay endpoint also still returned messages. At 17:00 GMT the issue was identified and the code was rolled back, resolving the issue for all shards by 17:38 GMT. While the end-user impact on the chat experience depends on the SDK, the offline storage integration, and the API region, for most apps this meant a very significant disruption to the chat functionality.

**What Went Wrong**

As with any significant downtime event, it was a combination of problems that caused this outage.

1. The issue with the broken code should have been caught during our review process.
2. The QA process should have identified this issue. Unfortunately, tests were run on a single-node setup and did not capture the bug.
3. The issue with the reduced message events should have been visible during the rolling deploy.
4. Monitoring and alerting should have picked up the issue before our customers reported it.

**Resolution 1 - Monitoring**

The biggest and most glaring issue here is the monitoring. While we do have extensive monitoring and alerting in place, we did not have a check that captured message propagation. The team is introducing monitoring to track message delivery and adding alerting rules.

**Resolution 2 - QA**

The second issue is that our extensive QA test suite didn’t catch this issue, since it only occurred when running Stream in a multi-cluster environment. We are updating our QA process to run in a cluster environment, so that it more closely resembles production systems.

**Resolution 3 - Heartbeat Monitoring**

The previous two resolutions would have been enough to avoid this incident or reduce it to a very minor incident. With that being said, Chat API is a complex system and we think that more end-to-end testing will make issues easier to notice. For this reason we are also going to introduce canary-like testing so that we can detect failures at the client-side level as well.

**Non-Technical Factors**

Stream has been growing extremely rapidly over the last year. Our team grew from 31 to 93 in the last 12 months. The chat API usage has been growing even faster than that. Keeping up with this level of growth requires constant changes to processes and operations like monitoring and deployment. This is something we have to reflect on as a team and do better at.

**Conclusion**

Performance and stability are among our key focus areas and something we spend a significant part of our engineering efforts on. Yesterday we let our customers down. For that, Tommaso and I would like to apologize. The entire team at Stream will strive to do better in the future.
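To illustrate the kind of canary-like, client-level check described in Resolution 3 above, here is a minimal sketch. The endpoints, the auth token handling, and the use of `websocat` as a websocket client are assumptions for illustration, not Stream's actual canary implementation:

```bash
#!/usr/bin/env bash
# Hypothetical canary: send a chat message through the HTTP API and verify
# that the matching event is delivered over the websocket within a deadline.
set -euo pipefail

API_URL="https://chat.example.com"            # placeholder API host
WS_URL="wss://chat.example.com/connect"       # placeholder websocket URL
CANARY_TOKEN="${CANARY_TOKEN:?set a canary auth token}"
CANARY_ID="canary-$(date +%s)"
EVENTS_LOG="$(mktemp)"

# Listen on the websocket in the background for up to 30 seconds.
timeout 30 websocat "$WS_URL" > "$EVENTS_LOG" &
LISTENER=$!
sleep 2                                        # give the connection time to establish

# Send a message whose id we can look for in the event stream.
curl -fsS -X POST "$API_URL/messages" \
     -H "Authorization: Bearer $CANARY_TOKEN" \
     -d "{\"id\": \"$CANARY_ID\", \"text\": \"canary\"}" > /dev/null

wait "$LISTENER" || true                       # timeout exits non-zero; that is expected

if grep -q "$CANARY_ID" "$EVENTS_LOG"; then
  echo "canary OK: message event delivered"
else
  echo "canary FAILED: message event not delivered" >&2
  exit 1                                       # non-zero exit feeds the alerting pipeline
fi
```

Run on a schedule from several regions, a check like this fails loudly when events stop propagating even though the HTTP API itself still accepts writes, which was the failure mode in this incident.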
This incident has been resolved.
We identified an issue with Chat API that caused some messages to not be delivered via Websockets. The problem is already resolved for most applications, and the remediation should be completed for all apps shortly.
We are currently investigating an increase of errors on Chat API
Report: "Increased API latency"
Last update: The AWS networking issue is now resolved. We are now cleaning up our temporary remediations since they are no longer needed. Traffic has been back to normal for the last hour.
Due to a networking issue on AWS us-east region, we are experiencing increased latency for some of the traffic on our US region. We are mitigating the problem while waiting for a final remediation on AWS infrastructure.
Report: "High error rates and timeouts"
Last update: Between 4:05PM and 4:45PM UTC on January 28, 2020 we had an API outage caused by performance degradation. The event was triggered by a new release to our Chat API servers; shortly after the new release went live, load on our database infrastructure increased and caused HTTP response times to spike and, in some cases, time out. The event was detected by our latency and error monitoring. The team started working on the event by rolling back to the previous version at 4:20PM UTC. Unfortunately the rollback did not resolve the problem entirely. After another rollback attempt we realised there were still pending queries from the previous release running on our PostgreSQL database. We manually terminated all the pending queries at 4:40PM UTC; after that the error rate dropped to 0% again. The outage affected 5% of HTTP requests at its peak (4:20PM to 4:27PM UTC).
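For context on the manual termination step described above, this is a generic sketch of how long-running queries left over from an old release can be found and terminated in PostgreSQL; the five-minute threshold and the connection string are placeholders, not the exact procedure Stream ran:

```bash
# List queries that have been running for more than five minutes.
psql "$DATABASE_URL" -c "
  SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 minutes';"

# Terminate them. pg_terminate_backend() closes the backend connection and
# rolls back whatever the query had in progress.
psql "$DATABASE_URL" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE state = 'active'
    AND now() - query_start > interval '5 minutes';"
```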
This incident has been resolved.
We are continuing to monitor for any further issues.
A recent release caused a load increase on part of the chat infrastructure, resulting in degraded performance and timeout errors. Remediation is in progress.
Report: "Timeout Errors"
Last update: This incident has been resolved.
Increased load on some API endpoints caused intermittent spikes of timeout errors. Adding more capacity remediated the problem.
We are experiencing spikes of timeout errors; the team is investigating the root cause and working on a remediation.
Report: "Emails from Dashboard are not sent"
Last update: This incident has been resolved.
Emails from the Dashboard (invites, password resets, and other notifications) are currently not being sent correctly. We are working with our SMTP provider (Mailgun) to resolve this issue as soon as possible.
Report: "Dashboard redirect issue"
Last update: This issue has been resolved.
The dashboard has a bug that's causing it to redirect some users to the homepage. Our team is investigating. APIs are fully operational; this only impacts the dashboard.
Report: "Elevated API Errors on US-EAST"
Last update: We experienced an elevated level of API errors in our us-east region. This incident lasted from 2:11PM to 2:18PM UTC.
Report: "Elevated API Errors on region EU-WEST"
Last update: This incident has been resolved.
We were experiencing an elevated level of API errors because of an unsuccessful Redis upgrade. We have resolved the issue and are monitoring for further problems.
Report: "EU-WEST API downtime"
Last update: Due to an operational mistake, the API service had a very high error rate in the Europe West region between 12:56PM and 12:58PM UTC. The problem has been mitigated and resolved. Detailed API error rate over time: 12:56PM 78%, 12:57PM 93%, 12:58PM 4%.
Report: "Partial API outage"
Last update: Between 03:59PM and 04:32PM UTC, some API traffic resulted in HTTP errors or timeouts. Only a subset of Stream applications hosted in the US was affected by this problem.
Report: "Realtime Redis Failover"
Last update: Our distributed realtime cluster uses Redis (on Elasticache) for state management. A failover of the Elasticache cluster caused realtime to be unavailable for 7 minutes. This issue has been resolved and we're investigating why the failover took 7 minutes. This impacted customers using Stream's websocket, SQS or Webhook firehose systems.
Report: "High latency spike and increased error rate"
Last update:

### January 23 and 24 outage postmortem report

Stream suffered two incidents of degraded performance in the past 24 hours. We take uptime very seriously and would like to be transparent about our operations with our customers. The spikes occurred on Jan 23 at 3:50PM UTC and on Jan 24 at 11:45AM UTC. Both spikes were caused by a sudden increase of pressure on one of our PostgreSQL databases. Because PostgreSQL was slow at serving queries, HTTP requests started to pile up and eventually saturated the API workers' connection backlogs. API clients using a very low timeout will have encountered timeout exceptions. Other users of Stream would see 5xx responses on part of their API calls.

I am going to add a little bit of background so that it is easier to elaborate on what went wrong. Some of our internal operations rely on moving data from one PostgreSQL database to another. Thanks to `psql`, such an operation is routinely performed by piping `COPY TO STDOUT` and `COPY FROM STDOUT` together. In order not to pressure the destination database with writes, we also use `pv` so that we are sure we never end up consuming all our IOPS capacity. The command looks more or less like this:

```
psql src_db -c '\copy (...) to stdout' | pv -p --rate-limit 5242880 | psql dst_db -c '\copy (...) from stdout'
```

By terminating the same copy command running on the source database we were able to remove **write** pressure on the disk. After that the high latency problem affecting the API service was automatically resolved. After researching other possible causes, we concluded that the pressure created by the copy command combined with increased traffic was behind this outage, and picked a different time to run the operation again. The same operation was then restarted during low traffic hours. To our surprise, write pressure increased again after a couple of hours on the source database and caused another, albeit shorter, outage.

After more digging we realized that on both occasions the command was running in the background and could not write to stdout, forcing the source database to store the query results on disk, which in turn caused very high disk I/O and slow response times to regular traffic. The remediation for this is very straightforward: never block stdout on the source database.
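As a general illustration of that remediation (the exact commands below are assumptions, not Stream's runbook), the copy can be run so its output is never left blocked, either by keeping it attached to a live terminal via tmux or by detaching it with all output redirected to a file:

```bash
# Option 1: run the pipeline inside tmux so there is always a terminal
# draining psql's and pv's output.
tmux new-session -d -s pgcopy \
  "psql src_db -c '\copy (...) to stdout' | pv -p --rate-limit 5242880 | psql dst_db -c '\copy (...) from stdout'"

# Option 2: detach with nohup and redirect stdout/stderr to a log file, so the
# backgrounded job never stalls trying to write to a terminal.
nohup bash -c \
  "psql src_db -c '\copy (...) to stdout' | pv --rate-limit 5242880 | psql dst_db -c '\copy (...) from stdout'" \
  > copy.log 2>&1 &
```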
Between 11:43 and 11:47 UTC, API traffic had a spike in HTTP 5xx errors and an increase in latency.
Report: "High API error rate"
Last update:

# The issue

From 19:16 to 19:24 UTC and from 19:30 to 19:46 UTC we had a high number of HTTP 502 errors when connecting to the API.

# The causes

A change was made to our servers' SSH configuration that was thought not to have any effect. However, on newly provisioned servers it caused a failure to start the server process. Normally this wouldn't have caused a big problem, because the load balancer should mark the host as unhealthy and thus no traffic should be sent there. Unfortunately, this was not the case because of a recent change in the health check logic. This change wrongly reported the server as healthy even though the server process was down.

# The fixes

First we removed the bad servers manually from the load balancer. After that we fixed the problem with the SSH configuration and added the servers back to the load balancer. Finally we changed the health check to not report healthy when the server process is down. Our apologies for the outage; our team is hard at work to further improve stability.
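As a generic sketch of that last fix (the unit name, port, and path are placeholders, not Stream's actual configuration), a health check script should only report healthy when the application process itself answers:

```bash
#!/usr/bin/env bash
# Health check invoked by the load balancer: it must fail when the API process
# is down, not merely when the host is reachable.
set -euo pipefail

# 1. The service must be running under the init system.
systemctl is-active --quiet api-server.service

# 2. The process must answer an application-level request on its own port.
curl -fsS --max-time 2 http://127.0.0.1:8080/healthcheck > /dev/null

# Any failure above exits non-zero, and the load balancer marks the host unhealthy.
echo "healthy"
```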
The issue has been resolved, more information about the outage will follow shortly.
We're currently investigating a high error rate on the APIs. A percentage of requests to the API are returning 502s; the cause has not yet been identified.
Report: "API downtime"
Last update: We've received additional information from AWS about this outage. To summarize, the RDS monitoring process and DB instance both failed, causing a delay in automated failover.

=====

Thank you for contacting AWS Premium Support. I understand that your RDS instance was not reachable from 1:23 to 1:41 UTC on 9th of February 2017 and you want to know the cause for it. I have investigated your RDS instance and following is my analysis:

--> 2017-02-09 01:25:26 External Monitoring process is unable to communicate with monitoring service on your instance
--> Due to the communication issues talking to the monitoring process on the instance, the failover was getting delayed until the hard limit was reached from the external monitoring process. Before External Monitoring process forces failover you did a manual reboot with failover at around 2017-02-09 01:40:42 UTC.
--> That was the reason CloudWatch metrics was not available during that time period but it started uploading after it failed over to standby DB instance.
--> After making sure new primary DB instance is up to date with the old primary DB instance, RDS issued replace DB instance.
--> Replace DB instance workflow has deleted the faulty instance (old primary) and replace it with new instance. Then it will sync up with the primary DB instance.
--> This process (Replace DB instance) completed successfully at 2017-02-09 1:57:45 UTC. However, during this process DB instance was available for reads and writes.

Normally the failover will be triggered shortly within few minutes and this time it's indeed abnormal. It rarely happens and we do apologize for any inconvenience that this issue might have caused on your environment. The RDS team always works hard on improving the stability and reliability of the RDS service but sometimes failure do occur. Our sincerest apologies for the operational pain that was caused you and please let me know if there is anything else I can assist with.
The problem was related to a hardware failure with one of our databases. The faulty server was replaced with a hot backup.
We are investigating an outage on the API.
Report: "Slow performance on feed API endpoints"
Last update:

## The problem

Due to what seems to be a bug with EC2 Security Groups, the connectivity between one API server and one Redis backend was impaired. This connectivity issue resulted in API requests waiting until a hard timeout occurred. At its peak, 1% of all API calls were affected and either returned a 502 error code or raised client-side timeout exceptions.

## Mitigation

Once the problem was clear, the EC2 server with the configuration problem was removed from the load balancer; this immediately resolved the problem.

## Solution

We are talking with AWS support to isolate and validate this problem; in the meantime we have instrumented all our API servers to proactively check for this specific issue and decommission servers experiencing the same problem.
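A rough sketch of that kind of self-check follows; the Redis host, target group ARN, and timeout are illustrative assumptions, not Stream's actual instrumentation. Each API server periodically verifies that it can reach its Redis backend and pulls itself out of the load balancer if it cannot:

```bash
#!/usr/bin/env bash
# Periodic self-check run on every API server: if the Redis backend is
# unreachable, deregister this instance from the load balancer so it stops
# receiving traffic.
set -euo pipefail

REDIS_HOST="redis.internal.example.com"                                   # placeholder
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:...:targetgroup/api/..."   # placeholder
INSTANCE_ID="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

if ! timeout 3 redis-cli -h "$REDIS_HOST" ping | grep -q PONG; then
  echo "Redis unreachable from $INSTANCE_ID; removing instance from the load balancer" >&2
  aws elbv2 deregister-targets \
    --target-group-arn "$TARGET_GROUP_ARN" \
    --targets "Id=$INSTANCE_ID"
  exit 1
fi
```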
The issue has been resolved. We're still investigating the root cause.
We're investigating a slowdown on our main feed API endpoint.
Report: "Realtime WS connections outage"
Last update: Connections are back; we are currently investigating the root cause.
Report: "Elevated API Errors"
Last update: Between 7:24AM and 7:26AM UTC we had an increased error rate due to a database failover procedure triggered by AWS.
Report: "Elevated API Errors"
Last update: We experienced an elevated rate of API errors between 13:00 and 13:01 UTC. The issue resolved itself and we are looking into the cause.
Report: "Dashboard issues"
Last update: We switched to our backup CDN a few minutes ago. The Dashboard is fully operational again.
The CDN (IMGIX) is not serving static files from our origin correctly (https://status.imgix.com/incidents/5plz9dqxxhns). We have temporarily mitigated the problem by rolling back to a previous release (the static files for that release are available at the CDN edge nodes and are returned correctly). We are now working on switching to the backup CDN (AWS CloudFront).
We're currently investigating problems with loading the Dashboard.
Report: "Elevated API Errors"
Last update: The outage started at 10:14 UTC and was over at 10:18 UTC.
The issue seems to be resolved.
We're experiencing an elevated level of API errors and are currently looking into the issue.
Report: "Elevated API Errors"
Last update: A small part of Stream's customers was impacted by an issue that caused notification feeds to be temporarily unavailable. Earlier today at 15:30 GMT a maintenance database migration was initiated; unfortunately the procedure did not clear the cached state of feeds correctly. This led to reads not showing activities from before that time. At 16:00 UTC we remediated by flushing the stale cache, which immediately fixed the problem. The issue did not affect any write operations such as adding/removing activities or following/unfollowing feeds. After some investigation we found and amended the incorrect part of the maintenance operation. Apologies for the trouble and for bringing this up over the weekend.
Report: "Intermittent failures"
Last update: This incident has been resolved.
The API service has been back to normal since 3:57PM UTC. The outage was due to temporary performance degradation of a PostgreSQL server.
The failures seem to have stopped. We're still looking into the root cause.
Intermittent failures; the cause is not known yet.
Report: "Failure on main PG database"
Last update: The issue was mitigated and we are now working on a permanent solution.
Our main PG database (which holds the configs) is seeing high CPU usage. This is causing a percentage of API requests to fail. We're investigating this issue and will keep you posted.
Report: "DNS issues with getstream.io domain"
Last update: DNS resolution for getstream.io is back to normal. The two .io nameservers are now returning the correct results.
DNS resolution for getstream.io is randomly returning errors. The root cause seems to be related to two .io nameservers returning incorrect results.
Report: "API latency increase"
Last update: We identified and resolved a problem affecting API latency for several applications. The problem was pinned down to heavy pressure on a PostgreSQL server; once the issue was resolved, latency returned to normal levels.
Report: "API latency increase"
Last update: The problem was related to routine maintenance slowing down query response times for one PostgreSQL database. API latency was impacted between 3:50PM and 4:15PM UTC.
We are currently investigating this issue.
Report: "Latency spike"
Last update: There was a temporary spike in latency caused by our Cassandra cluster. The issue has been mitigated.
Report: "High API latency"
Last update: The high latency has now been resolved. The root cause isn't clear yet.
Stream's main API is experiencing high latency. Our team is investigating this. The cause is not known yet.
Report: "API Outage"
Last update: At 2:41 PM UTC we experienced approximately 60 seconds of downtime with the Stream API. This was caused by a database lock not being released in a timely manner. Regular service was restored immediately after the lock was released.
Report: "Random latency spikes"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
One of the Cassandra servers was very slow at serving queries due to a long stop-the-world GC.
We are investigating this issue.