Historical record of incidents for ReadMe
Report: "Widespread service provider outage"
Last update: While ReadMe services have not been affected, our team is monitoring the status of widespread outages with Cloudflare and GCS.
Report: "OAuth end-user authentication for Enterprise customers is down"
Last update: End-user authentication via OAuth is currently unavailable due to an availability incident with Heroku, our upstream provider. They are currently investigating a resolution for the issue.
Report: "ReadMe Hubs and Dashboard are down"
Last update: There was an issue with one of our upstream providers. Service remains stable and we’re investigating the root cause with our provider.
We're back up and continuing to monitor.
We are continuing to investigate this issue.
We are currently investigating.
Report: "Slightly elevated error rates"
Last update: We have eliminated the slightly elevated error rates.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Search results not showing up in "All" tab on hubs"
Last update: This issue has now been resolved. A database setting was changed which conflicted with the way search works when searching all sections. This setting has now been reverted and search is functioning as before.
There is an ongoing issue where the "All" tab of search results is coming up empty. Searching specifically inside of each section is still working. We are investigating and will update when we have a fix.
Report: "Some customers are experiencing pages that are not loading correctly"
Last update: There was an issue with our upstream provider which is now resolved.
We are currently investigating.
Report: "Slow performance in Refactored dashboard"
Last update: This incident has been resolved.
The issue lasted a few minutes and the admin dashboard is currently online. We’re continuing to monitor.
Report: "Slow performance across hubs"
Last update: ReadMe Refactored project performance has been restored.
Customers on the current ReadMe Refactored experience are experiencing performance issues.
Report: "ReadMe Hubs are down"
Last update: This incident has been resolved.
The hubs are currently operating as expected, but we are actively monitoring and investigating the root cause of this issue. If you are still experiencing any problems, please contact support@readme.io.
We are currently investigating this issue.
Report: "ReadMe Hubs are down"
Last update: We have added server capacity to deal with an increase in traffic and load. We will continue to monitor but all sites appear to be working as expected again. Please reach out to support@readme.io if you are still having problems.
A fix has been implemented and we are monitoring the results.
We are continuing to investigate this issue.
We are currently investigating the issue.
Report: "SSL Certs sometimes fail to generate"
Last update: This incident has been resolved.
We found the culprit on Friday! Our hosting provider (who we love!) was marking some domains in Cloudflare as theirs, and Cloudflare got confused. We're still waiting for a report from them to understand how that happened, but for now it seems to be an issue of the past.
We're currently seeing that for a very small percentage of our users, we're having trouble reissuing SSL certs with Cloudflare. We're working with Cloudflare to identify the issue and correct it. We're monitoring it, but please reach out to support@readme.io if you experience this! This is only happening for a small portion of customers.
Report: "Slow performance across all hubs and dashboard"
Last update: This incident has been resolved.
We are currently investigating this incident.
Report: "Intermittent dashboard issues"
Last update: The dashboard has remained up ever since. We're continuing to investigate to find a root cause.
The dashboard located at https://dash.readme.com/ seems to be going in and out of healthy status. It is currently up but we're monitoring and investigating to find a root cause. Everything else seems unaffected at the present time.
Report: "Slow performance on ReadMe hubs and dashboard"
Last update: After monitoring, performance has improved significantly.
We're investigating slow performance on both the dashboard and hub.
Report: "Slow performance across hubs"
Last update: We added server capacity to deal with an increase in load. We’ll continue to monitor but all sites appear to be working as expected. Please reach out to support@readme.io if you are still having problems.
A fix has been implemented and we are monitoring the results.
Customers on the current ReadMe Refactored experience are experiencing performance issues.
Report: "ReadMe Hubs and Dashboard are down"
Last update: This incident has been resolved.
Access to the ReadMe application has been restored and we are monitoring the results. Formal investigation and a post-mortem to follow. We apologize for the downtime.
We're currently working on restoring our dash and hubs.
We are currently investigating the issue.
Report: "We are seeing an increase in 503s and timeouts across our hubs"
Last update: We have added server capacity to deal with an increase in traffic and load. We will continue to monitor but all sites appear to be working as expected again. Please reach out to support@readme.io if you are still having problems.
We are currently investigating timeouts and 404s across hubs.
Report: "dash.readme.com is currently down"
Last update: Due to a deploy that went out, we had about 18 minutes of degraded performance.
We are continuing to monitor for any further issues.
We have rolled back a recent deploy which has resolved the issue. Currently monitoring.
We are currently investigating an admin dashboard outage.
Report: "dash.readme.com is unresponsive"
Last update: This incident has been resolved.
We've deployed changes successfully and our admin dash is stable. We're currently monitoring this to see if there are any issues.
We're continuing to investigate the admin dash, as there are periods of timeouts for some people on the platform.
We are continuing to investigate this issue.
We are currently investigating this issue.
We've rolled back our previous deploy and our admin dashboard is back up, but we're still investigating the cause of the issue.
After a deploy, our dashboard is currently down. We are working on bringing this back up shortly.
Report: "Slow API Explorer requests when using 'Try It'"
Last update: Requests should now be going through without issue. We'll continue to monitor requests to make sure no one is rate limited by accident.
We've found the issue and are monitoring the changes. Requests should be able to go through once again.
We are currently investigating CORS errors when making API requests from ReadMe.
Report: "Versions not appearing in hub"
Last update: We have resolved the issue of versions not appearing correctly in the public facing hubs. If you are still not seeing your list of versions, please contact support@readme.com and we will be happy to resolve this!
We have identified an issue where other project versions are not appearing in the hub and are actively working toward a resolution.
Report: "Owlbot is unavailable for some users"
Last update: This incident was created out of an abundance of caution. Since then our service provider has confirmed that their API was unaffected, so we can also now confirm that Owlbot is doing owl-right.
Owlbot appears to be functioning correctly despite reported issues with our service provider.
Our AI-powered search tool, Owlbot, is down for some users.
Report: "Very high request volume as well as high error rate"
Last update: On Monday, May 20, ReadMe users experienced issues accessing the ReadMe Dashboard ([dash.readme.com](http://dash.readme.com)) from 19:33 to 19:45 UTC. Users may have experienced errors or slow response times when accessing the Dash. Mitigations were applied by 19:45 UTC and the dashboard quickly resumed normal operation.
This incident has been resolved.
We've discovered the culprit and blocked traffic to the offending IP. Our dashboard is now back up, and we are currently monitoring the situation.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Developer Dashboard data insertion issues"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Search results are slow to reindex"
Last update: Search indexing performance for enterprise customers using the staging feature has returned to normal.
Search results for some enterprise production projects are slow to reindex. We are monitoring the situation as it's continuing to normalize.
Report: "Widespread outage"
Last update:

## **What Happened**

ReadMe experienced a significant outage on Tuesday, March 26 beginning at 16:06 UTC (9:06am Pacific). This outage affected all ReadMe services including our management dashboard and the developer hubs we host for our customers. We recovered the majority of our service by 16:42 UTC (9:42am Pacific) including most access to the Dash and the Hubs. The rest of the service fully recovered at 17:34 UTC (10:34am Pacific).

Although the outage began with one of ReadMe’s service providers, we take full responsibility and we’re truly sorry for the inconvenience to our customers. We’re working through ways to prevent the same issue from happening again and to reduce the impact from similar events in the future.

## **Root Cause**

ReadMe uses a number of third-party service providers to host our Internet-facing services, including our customer-facing dashboard ([dash.readme.com](https://dash.readme.com/)) and developer documentation hubs. One of our primary service providers is Render, a web application hosting platform. This outage began when Render experienced a broad range of outages. We’re still learning more about what happened and we will update this document when those details are available.

We have redundant systems running at Render and can handle a partial Render service outage. Further, it’s usually very fast to replace cloud services on Render in a partial outage. But our infrastructure is not resilient to a full outage of the entire Render service, which is what happened on the 26th.

_**Update (April 1, 2024):**_ Render has confirmed that the issue began with an unintended restart of all customer and system workloads on their platform, which was caused by a faulty code change. Render has provided a [Root Cause Analysis](https://render.com/blog/root-cause-analysis-extended-service-disruption-3-26-24) for their underlying incident.

Although the incident was triggered by our service provider, we’re ultimately responsible for our own uptime and we are working on remediations to reduce the scope and severity of this class of incidents.

## **Resolution**

We host many services on Render including our Node.js web application and our Redis data stores. Redis is an in-memory data store that we use for caches and queues. We don’t use Redis for long-term (persistent) data storage, but many other companies do. Because of the unique challenges of restoring persistent data stores, Render’s managed Redis services took significantly longer to recover.

We implemented two temporary workarounds to restore ReadMe service: we removed Redis from the critical path in areas of our service where this was possible, and we launched temporary replacement Redis services until our managed Redis instances were recovered. After the managed Redis service was available and stable, we resumed normal operations on our managed Redis instances.

## **Timeline**

* 2024-03-26 16:06 UTC: All traffic to ReadMe’s web services begins to fail with HTTP 503 server errors. The ReadMe team begins mobilizing at 16:08 and automated alerts fire at 16:10.
* 2024-03-26 16:12 UTC: Render confirms that they are experiencing a major outage. We begin troubleshooting and looking for paths forward. The [ReadMe Status](https://www.readmestatus.com/) site is updated at 16:13.
* 2024-03-26 16:35 UTC: Although Render reports that many services have already recovered, ReadMe’s applications are still unavailable. We consult with our service provider and determine that Redis caches and queues will take longer to recover. We immediately begin efforts to work around the Redis services that have not yet recovered.
* 2024-03-26 16:42 UTC: We deploy a change to remove Redis from the critical path of many application flows. This restores most ReadMe functionality; from this point forward 88% of requests to the Dash and the Hubs are successful. Some functionality that requires Redis is unavailable, like page view tracking and page quality voting. Further, our Developer Dashboard and its API are still offline. We continue attempting to restore remaining service by deploying alternate Redis servers outside the managed Redis infrastructure.
* 2024-03-26 17:34 UTC: With the temporary Redis servers in place, all remaining issues with our application are resolved, including the Developer Dashboard and its API. Error rates and response times immediately return to nominal levels. We note the full recovery on our status site at 17:53.

## **Path Forward**

ReadMe is committed to maintaining a high level of service availability; we sincerely apologize for letting our customers down. We will be holding an internal retrospective later this week to learn from this incident and improve our response to future incidents.

This incident identified a number of tightly coupled services in our infrastructure — failures in some internal services caused unforeseen problems in other related services. Among other improvements, we’ll look into ways to decouple those services.

This incident alone isn’t enough to reevaluate our relationship with Render, but we continually monitor our partners’ performance relative to our service targets. If we are unable to meet our service targets with our current providers, we will engage additional providers for redundancy, or look for replacements depending on the situation. Finally, our close relationship with Render allowed us to get accurate technical details during the incident. This information allowed us to move quickly and take corrective action.

## **Final Note**

ReadMe takes our mission to make the API developer experience magical very seriously. We deeply regret this service outage and are using it as an opportunity to strengthen our processes, provide transparency, and improve our level of service going forward. We apologize for this disruption and thank you for being a valued customer.
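The "remove Redis from the critical path" mitigation described in the postmortem above is a common resilience pattern: treat the cache as optional, so a cache outage degrades to slower responses rather than failed ones. The sketch below is purely illustrative and not ReadMe's actual code; the `DegradedCache` class, its parameters, and the `origin` callable are all invented names. It treats any cache error as a miss and, after repeated failures, stops contacting the cache for a cooldown period so a dead cache cannot add latency to every request.

```python
import time

class DegradedCache:
    """Wrap a cache client so that cache failures fall through to the
    origin instead of failing the request. Illustrative sketch only."""

    def __init__(self, client, origin, max_failures=3, cooldown=30.0):
        self.client = client          # cache client exposing get()/set()
        self.origin = origin          # callable: key -> value (source of truth)
        self.max_failures = max_failures
        self.cooldown = cooldown      # seconds to skip the cache once tripped
        self.failures = 0
        self.tripped_at = None

    def _cache_usable(self):
        # Skip the cache entirely while the breaker is tripped.
        if self.tripped_at is None:
            return True
        if time.monotonic() - self.tripped_at >= self.cooldown:
            self.tripped_at = None    # cooldown elapsed: try the cache again
            self.failures = 0
            return True
        return False

    def _record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.tripped_at = time.monotonic()  # stop calling a dead cache

    def get(self, key):
        if self._cache_usable():
            try:
                hit = self.client.get(key)
                if hit is not None:
                    return hit        # cache hit: fast path
            except Exception:
                self._record_failure()
        value = self.origin(key)      # cache miss or cache down: use origin
        if self._cache_usable():
            try:
                self.client.set(key, value)
            except Exception:
                self._record_failure()
        return value
```

With a healthy client this behaves like an ordinary read-through cache; with the client raising errors, every `get` still succeeds via the origin, and after `max_failures` errors the wrapper stops touching the cache until the cooldown passes.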
This incident has been resolved.
All metrics related products have returned to operational. We have also fixed the underlying issue blocking new edits from being saved. We will continue to monitor to ensure recovery is maintained.
While the administrative dashboard is reachable, users are presently unable to save edits to existing documents. We are working to recover this and the metrics-related products.
Documentation hubs have recovered and we will monitor recovery. The Developer Metrics API, dashboard metrics, and "Recent Requests" in documentation API reference pages are still unavailable. We will continue to update as we have more information.
We are beginning to see recovery on administrative dashboards, while our documentation hubs continue to experience very high latency. We are working with our partners to remedy this situation. We will provide an update as soon as we have more information.
We are aware of a widespread issue that has impacted all of ReadMe's products. We are working with our partners to identify and resolve this issue.
Report: "Hub navigation links direct to 404 pages"
Last update: We had a deploy earlier this morning that broke the navigation bar links, resulting in 404s when going to the 'Guides' and 'API Reference' sections. The deploy has been rolled back and things are now functioning as normal.
Report: "Some enterprise documentation hubs are displaying errors"
Last update: We have resolved the issue with errors displaying on some documentation hubs.
We are currently investigating alerts that some enterprise documentation hubs are displaying errors within the documentation.
Report: "Delay in API Metrics Processing"
Last update: This incident has been resolved.
The delay in API metrics processing has been resolved. We will continue to monitor and investigate for any longer-term effects.
We are experiencing a delay in processing API Metrics. Metrics displayed in the "My Developers" tab of the administrative dashboard will be delayed.
Report: ""Too Many Requests" error message on ReadMe Hubs"
Last update: Health checks started firing, alerting us to a DDoS across Hubs and the display of a "Too Many Requests" error message. We took measures to block malicious traffic and sites appear to be functioning as expected now.
Report: "Owlbot AI Offline"
Last update: This incident has been resolved.
Services that support Owlbot AI are currently unavailable.
Report: "Currently investigating downtime on our hub sites"
Last update: Response times have returned to normal and all health checks are returning as expected. We are considering this incident resolved but will continue to monitor.
We have taken actions to block malicious traffic and sites are starting to respond again. We will continue to monitor.
We are currently investigating downtime and failed health checks for all documentation hub sites.
Report: "We are currently experiencing an unusually high server load on our platform"
Last update: This incident has been resolved.
Performance has stabilized across ReadMe, but we are still monitoring.
We are continuing to investigate this issue.
We are currently investigating this issue.
Report: "Investigating outage across ReadMe Hubs"
Last update: Response times have remained stable since, so we're considering this issue closed. Please reach out to support@readme.io if you have any more problems!
We've taken measures to return the service to normal functionality. We will continue to monitor for the next 30 mins.
We are currently investigating an issue that is causing either a slowdown or a full outage on ReadMe hub sites. We will update this as we know more.
Report: "ReadMe Hard Down"
Last update: This incident has been resolved.
Access to ReadMe's administrative dashboard and documentation hubs has returned to normal levels. We will continue to monitor the situation for the immediate future.
We are currently investigating this issue.
Report: "500 errors and slow response times"
Last update: This incident has been resolved.
We are continuing to investigate this issue. Error rates and latencies have returned to normal but we continue to monitor the situation.
We are currently investigating intermittent degradation to our ReadMe hubs service.
Report: "500s on reference docs"
Last update: This incident has been resolved.
We've taken steps to stabilize our application and are monitoring for further degradation.
We're aware of reports of intermittent 500s on some reference documentation.
Report: "readme.io subdomains not redirecting properly"
Last update: This incident has been resolved.
Subdomains on the readme.io domain are currently not redirecting to the ReadMe service. All endpoints on the readme.com domain are unaffected. We are currently investigating and will post an update as soon as we have more information.
Report: "ReadMe documentation hubs and dashboards experiencing failures and large delays in load time"
Last update: This incident has been resolved.
The ReadMe dash and documentation hubs are currently recovering. We will continue to monitor the situation.
We are currently investigating this issue and will report when we have more information.
Report: "Investigating Downtime"
Last update: This incident has been resolved.
We have implemented a fix and are currently monitoring the situation carefully. We will provide an update when we have more information.
We are continuing to investigate this issue.
We're currently investigating downtime across our dashboard and hubs.
Report: "ReadMe API Metrics product experiencing major outage"
Last update: The backlog of API logs has been cleared and this incident has been resolved.
We have identified the cause for the API Metrics outage previously reported. We have issued a fix and the API Metrics product is available once again. We are currently processing queued logs and will be monitoring the situation. An update will be provided when the queues have finished processing.
The ReadMe API Metrics product is currently experiencing a major outage. Requests to send API logs to ReadMe are currently failing or have very high latency, and the viewing of API metrics in the hub and administrative dashboard is currently disabled. We are actively investigating this issue and will provide an update when we know more information.
Report: "Degraded Performance for Developer Dashboard"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Codeblocks not rendering correctly on hubs"
Last update: We deployed a change that caused various codeblocks to render as raw Markdown. We quickly reverted this change and things should be back to normal.
Report: "Issue with static assets not loading on initial page load (Google Chrome only)"
Last update: This incident has been resolved.
We've identified a mitigation and operation has returned to normal. We will continue monitoring to ensure proper remediation.
We are continuing to investigate this issue with our CDN provider. We are simultaneously working on a potential alternative to hopefully mitigate this issue in the meantime. We will post more information as it comes available.
When loading a Hub site for the first time in Google Chrome, the console is showing a lot of HTTP/2 errors which are resulting in an unstyled and unresponsive page. We're actively investigating this issue and suspect it may be something to do with our upstream CDN provider. We will update as we know more.
Report: "ReadMe API Metrics experiencing increased latencies and API failures"
Last update: This incident has been resolved.
We have identified the root cause of the issue and have issued a fix. Impacts are returning to normal levels; we will continue to monitor to confirm the impact has subsided.
We are continuing to investigate this issue and will provide updates as they become available.
We are currently experiencing a degradation in ReadMe's API metrics product, resulting in an increase in 500-level errors when submitting API logs through ReadMe's metrics API. Additionally, metrics dashboards are intermittently experiencing a delay in loading for some customers. We are actively investigating and will post an update when available.
Report: "Regressions executing some custom JavaScript"
Last update: This incident has been resolved.
We have identified an issue that went out today at 11:35am PT that regressed the ability for custom JavaScript to perform DOM queries.
Report: "Search re-indexes are not completing successfully for non-page types"
Last update: Our search re-index queues are processing jobs once again. Please e-mail support if you're still seeing any issues!
A fix has been implemented and we are now testing this on specific Enterprise projects with staging enabled.
We have identified a solution to reindex non-page search results.
We are continuing to investigate this issue.
We are having issues re-indexing non-page search results, such as custom pages and discussion posts.
Report: "Search re-indexing is currently down for certain projects"
Last update: Projects are currently able to re-index their search results successfully.
Our upstream provider has fixed their issue and our search queues are starting to re-index jobs again. We're monitoring this issue and will confirm shortly once jobs are successful.
We are still seeing issues re-indexing projects with staging enabled due to our upstream provider. We will provide updates once we have more available.
We are continuing to investigate this issue.
We are currently having issues with our search re-indexing on specific projects. Our team is investigating the issue and we will provide an update when available.
Report: "Multiple issues impacting customer hubs"
Last update: The issue preventing emails from being sent and causing staged changes to be visible immediately in production has been resolved. Emails that were attempted during the impacted time period have been discarded. Changes made to projects with staging enabled are now correctly not visible on production until they have been promoted.
In addition to staged changes being immediately published to prod, all emails sent from ReadMe's systems were failing to be sent. This impacted password reset emails, password-less logins, and other emails sent to end users.
We are continuing to investigate this issue.
Changes to enterprise projects that have staging enabled are currently being promoted to production immediately, skipping the normal staging step. This impacts any enterprise-plan customer with staging enabled. Our team is currently investigating and will provide updates as they become available.
Report: "ReadMe Micro Unavailable"
Last update: This incident has been resolved.
The ReadMe Micro service is currently unavailable for all ReadMe Micro subscribers. Any visits to ReadMe Micro documentation sites will fail and display an Internal Server Error. There is an issue with the upstream provider that hosts ReadMe Micro; we are working to have this resolved as soon as possible. Other ReadMe services are not affected — ReadMe documentation hubs and the ReadMe admin dashboard are both fully functional.
Report: "Degraded performance for ReadMe Hubs"
Last update: The previous issue with intermittent failures on ReadMe Hubs has been resolved.
ReadMe Hubs for all public projects have been experiencing small intermittent failures for guides, API references, recipes, and discussions. End users are either being presented with a 500 error or a blank white screen for about 5% of requests. Our engineering team is working to identify a fix.
Report: "Owlbot AI Unavailable"
Last update: This incident has been resolved.
We have resolved the issue with the Owlbot AI chatbot and all functionality has been restored.
We are continuing to work on a fix for this issue.
The Owlbot AI chatbot component is currently unavailable for all Owlbot AI subscribers. End users that attempt to issue a query receive no response or visual aid to identify that an error has occurred. We are aware of the cause and are working to have this resolved as soon as possible.