Historical record of incidents for Stellate
Report: "Increased Errors with Multiple Third Party Service Provider"
Last update: We are currently investigating this issue.
Report: "Elevated errors on Dashboard"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Dashboard API Experiencing Issues"
Last update: This incident has been resolved.
We have applied new database indexes that should improve performance. We are still working on rolling out a production deployment to include additional API improvements.
We have identified potential optimizations and are in the process of testing them to ensure their effectiveness. While the main degradation has calmed down, we’re taking precautionary steps to apply patches and ensure the system's stability and safety moving forward.
We are currently experiencing database performance issues and are actively working on updates to resolve the situation. Our team is implementing changes to restore normal service as quickly as possible. We will continue to provide updates as we progress.
We are currently investigating this issue.
Report: "High Error Rates on Dashboard"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Report: "High Error Rates Detected on Dashboard"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Issues with GraphQL Metrics ingesting new data"
Last update: This issue has been resolved. We'll continue monitoring all services, but we don't expect any further issues.
We are seeing signs of recovery, and new data is again being ingested into our GraphQL Metrics systems. We are monitoring all systems and are in touch with our infrastructure provider for further updates. Please note that the GraphQL Metrics system might lag behind near real-time for a while as queued updates get processed.
The issue has been traced to one of our infrastructure providers and their ClickPipes service. They have opened an investigation and updated their status page at https://status.clickhouse.com/incidents/f6j8dfnyy6dn
We are currently looking into an issue with our GraphQL Metrics systems, which prevents new data from showing up on the dashboard. All other systems are working as expected, and our GraphQL Edge Caching and GraphQL Rate Limiting systems are not affected.
Report: "Issues with Purging API"
Last update: During a routine employee offboarding, we revoked that employee’s access to Fastly. Revoking their access to Fastly also revoked all access tokens that engineer had created. Unfortunately, this included the central API token all our systems use to communicate with the Fastly API. This had two immediate impacts:
1. Purging started failing silently: Stellate’s purging API kept returning successful responses even though data was not evicted from the cache.
2. Service configuration updates started failing silently: configuration updates appeared to persist even though they were not updated in the CDN.
As part of the incident response, we switched the central Fastly API token to a new token owned by a shared engineering account. Further, we will work on gaining better visibility and alerting on failure conditions with the purging API, as well as audit all tokens in use by our services to ensure they are not owned by individual engineers.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and the Purging API is working as expected again. We are monitoring all systems to make sure they are working as expected.
The team has identified the issue and is currently implementing a fix.
We are currently looking into an issue with the Purging API.
Report: "Increased error rates with service configuration updates"
Last update: This incident has been resolved.
Fastly has identified the issue and implemented a fix. They are monitoring their systems to ensure this issue has been mitigated.
We are investigating increased timeouts with Fastly KV, which is a required dependency for configuration updates made either via the Stellate Dashboard or the API. Fastly is tracking this incident via their status page at https://www.fastlystatus.com/incident/376458. This does not impact our edge services.
Report: "Issues with the Purging API"
Last update: This incident has been resolved.
The fix has been rolled out to all nodes and the Fastly team is monitoring purges across their systems.
Fastly has begun rolling out a fix and is monitoring purges on their systems. The current ETA for the system-wide rollout is ~12 hours. As a (temporary) workaround, we recommend disabling `swr` for all cache rules. We will continue monitoring this incident on our end and update this status page as we get additional information from the team at Fastly.
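For context on the workaround above, here is a rough sketch of what disabling stale-while-revalidate on cache rules could look like. The field names below (`types`, `maxAge`, `swr`) follow the terminology used in these updates and are illustrative only, not necessarily Stellate's exact configuration schema; consult the Stellate documentation for the real format.

```ts
// Illustrative sketch only — not Stellate's actual config schema.
type CacheRule = {
  description?: string;
  types: string[]; // GraphQL types the rule applies to
  maxAge: number;  // seconds a response may be served from cache
  swr: number;     // seconds of stale-while-revalidate; 0 disables swr
};

// Temporary workaround during this incident: set swr to 0 on every rule so
// purged objects are not served stale while being revalidated.
const rules: CacheRule[] = [
  { description: 'Cache all queries', types: ['Query'], maxAge: 900, swr: 0 },
];
```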
Fastly has added an incident to their status page, available at https://www.fastlystatus.com/incident/376338. They are still investigating the issue and looking for a fix.
We continue to look into this with the Fastly team. We have narrowed this down to an issue with purging, in which all purges are considered soft purges if the cache configuration includes swr (stale-while-revalidate) values.
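For readers unfamiliar with the distinction referenced above: Fastly's purge-by-surrogate-key API performs a hard purge by default and a soft purge (mark content as stale rather than evict it) when the `fastly-soft-purge: 1` header is set. The sketch below only illustrates that difference; the service ID, surrogate key, and token are placeholders, and this is not Stellate's actual purging implementation.

```ts
// Placeholder values for illustration.
const FASTLY_API = 'https://api.fastly.com';
const serviceId = 'SERVICE_ID';
const surrogateKey = 'my-cache-key';
const token = process.env.FASTLY_API_TOKEN ?? '';

// Hard purge: evicts objects tagged with the surrogate key immediately.
async function hardPurge(): Promise<void> {
  await fetch(`${FASTLY_API}/service/${serviceId}/purge/${surrogateKey}`, {
    method: 'POST',
    headers: { 'Fastly-Key': token },
  });
}

// Soft purge: only marks objects as stale. With stale-while-revalidate (swr)
// configured, stale responses can still be served while revalidating — which
// is why purges behaving as soft purges looked like "purging not working".
async function softPurge(): Promise<void> {
  await fetch(`${FASTLY_API}/service/${serviceId}/purge/${surrogateKey}`, {
    method: 'POST',
    headers: { 'Fastly-Key': token, 'fastly-soft-purge': '1' },
  });
}
```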
We are currently looking into a potential issue with the Purging API not removing data from the cache. We are working with our infrastructure partners to investigate what is causing this.
Report: "Issues with configuration changes"
Last update: We have shipped a fix and it is once again safe to make configuration changes. We are very sorry for any issues this may have caused.
We are currently investigating an issue related to pushing configuration changes. Please refrain from making configuration changes at this time, as this could potentially make your service unavailable.
Report: "Increased error rates on configuration changes"
Last update: This incident has been resolved.
Fastly API error rates have recovered and we're not seeing any more issues. We will continue to monitor.
Fastly is looking into elevated error rates on their APIs. See https://www.fastlystatus.com/incident/376327 for more information.
We are working with our infrastructure partners to get this issue resolved.
We are currently looking into an issue with errors when applying configuration changes, either via the Stellate CLI or the dashboard. Edge services are not affected by these issues.
Report: "Service disruption for Automated Persisted Queries (APQs)"
Last update:
# Incident
* A bug was released on Jan 8th at 1.43 pm UTC while improving Persisted Operation support. The two areas of code overlap, and unfortunately, the change broke support for APQs.
* Our E2E test suite should have caught this bug.
* Unfortunately, we recently made many improvements to our E2E test suite and silently broke the validity of the APQ E2E tests. These tests were running and reporting successes, but under the hood, they were erroneously being run against a server that does not support APQ.
* The impact of this bug was not widespread enough to trigger alarms after release.
* At 11.45 pm UTC, a customer raised an issue with APQs, and our engineering team started investigating.
* On Jan 9th at 2.05 am UTC, a fix was deployed, and the issue was resolved.
# Improvements
* We’ve fixed the bug in our E2E test suite for APQ (a sketch of the kind of guard involved follows this post-mortem).
* We’ve agreed on a path forward to start monitoring GraphQL errors. The work has begun and is being tracked but has yet to be completed.
* We’ve scheduled a rollback dry run for our following incident dry run to improve our institutional knowledge of rollback procedures and find potential improvements.
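To illustrate the E2E gap described above: an APQ test suite can verify up front that its target actually supports APQs. In the standard APQ protocol, a hash-only request is answered with a `PersistedQueryNotFound` error by a server that supports APQs but does not know the hash, and with `PersistedQueryNotSupported` by one that does not support APQs at all. The endpoint and hash below are placeholders, and this is a sketch rather than Stellate's actual test suite.

```ts
// Illustrative precondition check for an APQ E2E suite (placeholder endpoint).
const ENDPOINT = 'https://my-service.stellate.sh';

async function assertServerSupportsApq(): Promise<void> {
  // Send only the persisted-query hash, no query text.
  const res = await fetch(ENDPOINT, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      extensions: {
        persistedQuery: { version: 1, sha256Hash: '0'.repeat(64) }, // dummy hash
      },
    }),
  });
  const body: any = await res.json();
  const messages: string[] = (body.errors ?? []).map((e: any) => e.message);

  // A server with APQ enabled answers "PersistedQueryNotFound" for an unknown
  // hash; "PersistedQueryNotSupported" means the suite is pointed at a server
  // that cannot exercise APQs at all — fail fast instead of reporting green.
  if (messages.includes('PersistedQueryNotSupported')) {
    throw new Error('E2E target does not support APQs; aborting APQ tests.');
  }
}
```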
Report: "Elevated response times due to an incident at our infrastructure provider"
Last update: This incident has been resolved.
A fix has been implemented and Fastly is monitoring performance.
Fastly has identified the issue and is implementing a fix.
We are currently looking into elevated response times due to an incident with Compute@Edge at Fastly. Please see https://www.fastlystatus.com/incident/376194 for more information.
Report: "Issues with Stellate Services behind a Cloudflare Proxy"
Last update: Fastly started forbidding domain fronting on October 24th. Customers that were using Cloudflare with the proxy enabled were affected because Fastly could not verify domain ownership for TLS certificates, which caused Fastly to throw a TLS validation error when these domains were accessed. We received communications from Fastly in September telling us some domains were going to be affected; however, they mentioned we had until the TLS certificates on current domains expired to take action. After the incident we reached out to Fastly, and they also mentioned the report they sent us was incomplete, as it did not include information about the HTTP method, and requests not using the POST method could be affected. This miscommunication from Fastly's side led us to believe we had more time before our application would be affected. Going forward, we are double-checking important dates with third-party providers to make sure there are no misunderstandings and we don’t cause downtime for our customers.
We have added additional information to the service settings on validating custom domains that do not point at Fastly directly. If you have Cloudflare Proxy, or another proxy, in front of Stellate, please make sure your custom domain is shown as _Verified_ in your service settings. If you have questions, do not hesitate to reach out to our support team.
If you are running your Stellate service behind a Cloudflare DNS record with proxy turned on and are running into issues with SAN (subject alternative names) errors, we recommend turning the proxy off and reaching out to our support team via support@stellate.co or the in-app messenger.
Report: "Cloudflare API Service Outage impacts Service Configuration Changes"
Last update: Cloudflare fixed the power grid issues they were experiencing and published a post-mortem on their page, which is available at https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/. They are closely monitoring their systems to ensure they are running as expected.
Cloudflare is experiencing an outage of its API services, which affects Stellate. Stellate edge services (caching, rate limiting, developer portal) are not affected. However, you might experience higher error rates on configuration changes. Please see https://www.cloudflarestatus.com/incidents/hm7491k53ppg for more information.
Report: "Cloudflare Worker issues impacting Stellate services"
Last update: Cloudflare marked this incident as resolved.
Cloudflare identified the issue and implemented a fix. (See https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc) They are still monitoring the systems, as are we.
We are seeing services recover. However, as Cloudflare didn't update their status page yet, we will keep this incident active.
We are continuing to monitor for any further issues.
Cloudflare Workers is experiencing an incident (please see https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc for more information), which impacts Stellate services that haven't switched to our new infrastructure and Fastly exit IPs yet. If you would like to switch, please see https://stellate.co/docs/graphql-edge-cache/switch-to-new-infrastructure. If you already switched, your service is not affected.
Report: "Elevated Error Rates in the Columbus, Ohio point of presence."
Last update: This incident has been resolved. We are working with the Fastly and Cloudflare teams to better understand what caused those elevated error rates and how to prevent them.
We are currently investigating elevated error rates in the Columbus, Ohio point of presence.
Report: "Issues with configuration updates propagating"
Last update:
* Stellate relies on Fastly infrastructure for our offerings.
* Fastly experienced a partial outage of their KV Store offering on June 17th and June 18th, which affected Stellate. They provide a summary of this incident on their status page at [https://www.fastlystatus.com/incident/376022](https://www.fastlystatus.com/incident/376022)
## Timeline
* August 17th 10:46 UTC - A customer reported their Stellate endpoint failing in the FRA (Frankfurt) point of presence (POP), as well as in several other edge locations. This was due to them pushing an update to their configuration, specifically the `originUrl`.
* 10:50 - We identified the issue as being a stale KV value in the FRA POP, as well as several others.
* 10:55 - We created an incident on our status page for degraded KV in the FRA POP and several others.
* 13:08 - We realized that Rate Limiting and Developer Portals were affected by this outage as well.
* 13:30 - We reported this incident to Fastly.
* August 18th 4:00 UTC - Fastly was not yet able to provide us with a satisfactory response on what was causing this and didn’t acknowledge the ongoing outage.
* 6:23 - A large e-commerce customer reported their website was unavailable. This was due to a KV key disappearing in the FRA POP, as well as several others.
* 7:09 - Additional reports started to come in via Intercom about services not responding properly.
* 7:15 - We escalated the incident with Fastly as, from our view, more regions seemed to be affected and becoming unavailable.
* 7:16 - We deployed a partial fix that disabled our new infrastructure. This fixed edge caching for users who didn’t recently push configuration changes (the majority of services). Rate Limiting, JWT-based scopes, and the Developer Portal were still affected by the KV outage.
* 8:01 - Fastly was able to reproduce the bug based on a reproduction that we provided earlier and started working on a fix.
* 9:02 - Fastly opened an [official incident](https://www.fastlystatus.com/incident/376022) on their status page.
* 10:04 - Fastly marked the incident as resolved.
* 10:19 - Fastly communicated to us that the cause was an issue with surrogate keys in their C@E caching layer.
* August 22nd - Fastly shared their confidential Fastly Service Advisory with us, providing additional information about this incident and how they want to prevent this from happening again.
## Next Steps
* We have had several calls with Fastly over the last couple of days, working with them to analyze what went wrong, why it took them so long to escalate this internally, and how we can improve communication and collaboration going forward.
* As a direct outcome of this, we have re-connected with our European contacts at Fastly and designated a direct contact to involve in conversations and escalations going forward.
* We are going to investigate a fallback option for Fastly KV.
* Additionally, we will review all possible failure points that could make Stellate core services inaccessible (in the event of a third-party outage) and investigate options for additional redundancies for those services.
This issue has been resolved. We have temporarily switched all services back to our "old infrastructure" and are running additional tests as well as working with Fastly before we reopen the "new infrastructure". We will also publish additional details once we conclude our internal post mortem process.
Fastly has implemented a fix for the issue, and all services are working as expected again. We have temporarily disabled switching over to the new infrastructure and are working with Fastly to better understand what happened on their end, why it took so long to identify and rectify this, and how we can better monitor and prevent this in the future. We will enable the new infrastructure again once we are confident in the services we rely on.
We continue working with Fastly to resolve this issue. Please see https://www.fastlystatus.com/incident/376022 for updates from their team as well.
We are continuing to work on a fix for this issue.
The incident with KV stores, which are used for service configuration, is now spreading to additional edge locations and affecting overall service availability for services on the new infrastructure. We have disabled the new infrastructure to provide our partner more time to identify and resolve the issue on their end.
Our infrastructure partner has identified the issue and is working on fixing it.
We are continuing to investigate this issue together with our infrastructure providers. If you haven't made configuration changes to your service recently, you are not affected by this issue.
We are investigating an issue with configuration updates propagating to the respective services. If you didn't make configuration changes recently, your services are not impacted by this incident.
Report: "Metrics System"
Last update: The issue has been resolved and the metrics systems are working as expected again. Some metrics data wasn't properly ingested and is currently saved in a backup system. We will ingest this data into our production systems after additional checks have been run.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently looking into an issue with our metrics system. Edge Caching and other services are unaffected and operate as expected.
Report: "Degraded service for Custom Domains Management, Developer Portal API Key Management and Purging Analytics"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are experiencing degraded service for the following parts of our service:
* Creating and removing custom domain names for services and developer portals
* Creating, revoking and removing API keys managed via the developer portal
* Purging Analytics
The edge cache itself, as well as the purging API and invalidation of cached responses, is not affected by this incident.
Report: "Stellate Services unavailable because of Cloudflare Worker KV outage"
Last update:
* Stellate currently relies on Cloudflare services for parts of our offerings.
* Cloudflare had a global outage of their KV store for ~10 minutes on June 7th, from 6.51 pm to 7.01 pm. They provide a summary of this incident on their own status page at [https://www.cloudflarestatus.com/incidents/1mj9jch1tqf9](https://www.cloudflarestatus.com/incidents/1mj9jch1tqf9).
* Any traffic that resulted in cache misses or cache passes triggered an HTTP/500 error page during that time frame. Traffic directly handled by the edge cache (i.e., cache hits) was not affected.
* ~30% of traffic resulted in cache hits and was served correctly.
* ~70% of traffic resulted in cache misses or passes; these requests returned an HTTP/500 error.
* We are currently working on a larger infrastructure improvement that will remove the dependency on Cloudflare Worker KV.
* Additionally, we will review all possible failure points that could make Stellate core services inaccessible (in the event of a third-party outage) and investigate options for additional redundancies for those services.
Cloudflare posted an update on their status page and marked the incident that caused this one as resolved. See https://www.cloudflarestatus.com/incidents/1mj9jch1tqf9 for their update.
All services are back up and running again. We are monitoring the status of our services as well as Cloudflare Worker KV store.
As far as we can tell, the Cloudflare Workers KV service, which we depend on, was having an outage of about 5 to 10 minutes. They seem to be back up and running again. We are monitoring the situation and will update our status page as needed.
We are looking into an issue with Stellate right now. We will update this incident as we have more data available.
Report: "Dashboard unacessible"
Last update: This incident has been resolved.
The dashboard is back online, we are monitoring performance.
We are looking into an issue with the Stellate Dashboard. Our edge services (caching, rate limiting) and backend service (metrics backend) are unaffected.
Report: "DNS issue related to the default `stellate.sh` domain."
Last update: This incident has been resolved. All services are available again.
We are looking into an issue regarding Stellate services hosted on the default `stellate.sh` domains. Services on custom domain names are not affected.
Report: "Increased error rates when loading the Stellate Dashboard"
Last update: Metrics queues have caught up and the cluster is operating as expected again.
Our metrics cluster is back online and accessible again. It will take a short while for it to catch up with queued metrics updates, and you might still see sporadic errors on the dashboard in the next couple of minutes. We are closely monitoring performance and will update this incident as required.
We identified the root cause, which is an issue with our metrics cluster. We have alerted the infrastructure partner operating that cluster for us and they are working on restoring access.
We are looking into an issue with a service backing our dashboard. This is not affecting CDN services (neither caching nor the private beta of rate limiting); however, you will see higher error rates when trying to load the Stellate dashboard.
Report: "Major Service Outage"
Last update:
# Leadup/Fault
At 3:39 am UTC, our engineering team was alerted about an elevated number of errors in our CDN. While looking into the increased error rate, we noticed SSL handshake errors between the caching layer and the workers forwarding requests to origins. Additional debugging surfaced that the SSL handshake failures were caused by the `graphcdn.app` domain expiring. Later investigations revealed that while the domain was set to renew automatically, the credit card payment for the renewal failed. Additionally, we could not immediately get ahold of the person required to restore access to the `graphcdn.app` domain. All requests sent to any `*.graphcdn.app` subdomain were served an error page by the domain registrar. Because the CDN workers were internally using a `graphcdn.app` subdomain (a leftover from our name change), this domain expiry caused all requests to fail, even if the external domain used was not a `*.graphcdn.app` subdomain but a `*.stellate.sh` subdomain or a custom domain.
# Timeline (all times in UTC)
* 3:39 am - The first alert was triggered, and the engineers on-call were paged and started investigating the issue.
* 4:01 am - While working through our incident runbook, the engineering team noticed an error message regarding a “moved domain.”
* 4:12 am - As we couldn’t immediately get a hold of the person required to renew the `graphcdn.app` domain, the engineering team started to move the CDN workers to a different domain.
* 4:28 am - We opened an issue on our status page, [https://status.stellate.co/incidents/m7v0bgflsg4c](https://status.stellate.co/incidents/m7v0bgflsg4c)
* 5:15 am - We deployed the fix that moved the CDN workers to a different domain and restored service for all requests sent to `*.stellate.sh` subdomains or custom domains.
* 5:50 am - Restored access to the `graphcdn.app` domain, which resolved the issue for any services using `graphcdn.app`. (The time those services were available again varied slightly depending on DNS propagation.)
* 6:34 am - Marked the incident resolved.
# How did we resolve it?
The on-call engineers did not have access to the domain registrar where we registered the `graphcdn.app` domain. We couldn’t get ahold of the person who had access to that registrar because they weren’t on-call. While trying to find another way to reach that person, we deployed the first fix at 5:15 am UTC that removed the internal dependency on the `graphcdn.app` domain to restore service for all custom and `*.stellate.sh` subdomains. We recovered access to and renewed the `graphcdn.app` domain, and restored service for the `*.graphcdn.app` subdomains at 5:50 am.
# Post mortem
After resolving the incident, we conducted an internal post-mortem, analyzed the incident, and derived immediate actions (already completed) as well as future actions:
## Immediate Actions
1. Validate that no other domains are expiring soon.
2. Audit and ensure all on-call engineers have access to all critical services.
3. Use a central email for authentication with any domain registrars.
4. Ensure there is an escalation policy to the founders and that the founders are permanently on-call.
## Future Actions
1. Set up monitoring & alerts for expiring certificates & domains and audit our current monitoring setup for holes (a sketch of such a check follows this post-mortem).
2. Audit and ensure all on-call engineers have access to all services (not just the critical ones).
3. Create a Customer Success on-call rotation, including guidelines on when and how to involve the Customer Success teams in ongoing incidents.
4. Set up monitoring and alerts for failed subscription payments.
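As a hedged sketch of the expiry monitoring mentioned under Future Actions: the check below connects to each host over TLS and warns when the presented certificate expires within a threshold. It covers certificate expiry only; domain registration expiry (the actual trigger of this incident) would additionally require a registrar- or WHOIS-based check. The host list and threshold are placeholders.

```ts
import * as tls from 'node:tls';

// Placeholder inputs for illustration.
const HOSTS = ['stellate.sh', 'graphcdn.app'];
const WARN_DAYS = 30;

// Connect over TLS and compute days until the served certificate expires.
function daysUntilCertExpiry(host: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host, port: 443, servername: host }, () => {
      const cert = socket.getPeerCertificate();
      socket.end();
      const expiresAt = new Date(cert.valid_to).getTime();
      resolve((expiresAt - Date.now()) / (1000 * 60 * 60 * 24));
    });
    socket.on('error', reject);
  });
}

async function checkAll(): Promise<void> {
  for (const host of HOSTS) {
    const days = await daysUntilCertExpiry(host);
    if (days < WARN_DAYS) {
      // In a real setup this would page the on-call rotation instead.
      console.warn(`Certificate for ${host} expires in ${Math.floor(days)} days`);
    }
  }
}

checkAll().catch(console.error);
```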
This incident has been resolved and all services are working as expected again. We will publish additional information on what triggered this incident, steps taken to fix it, as well as issues identified with our processes and how we plan to address them later today (European time zones).
Services on `graphcdn.app` domains are working again, though we still see some issues with DNS resolution for those domains from some providers.
We have deployed a fix for `graphcdn.app` domains and are waiting for the required DNS changes to propagate. All services should be working again shortly.
All services using the stellate.sh or custom domains are back up and running again. Services using the older graphcdn.app domains remain affected.
We are continuing to investigate this issue, which is causing all Stellate services to be unavailable at this time.
We are currently investigating an issue with our edge caching service.
Report: "Issues with newly created custom domain names"
Last update: This issue has now been resolved. All custom domain names created during this incident are working again, and newly created custom domain names are properly provisioned as well. We will, however, continue to monitor this closely.
Fastly identified the issue with newly created custom domain names for GraphCDN and deployed a fix. We checked all custom domains from users who checked in with our support team and they are all working. We will continue to monitor the situation until we are certain this issue has been fully resolved.
We are currently looking into an issue with adding new custom domain names to GraphCDN services. While the service continues to be accessible and working from the default `$servicename.graphcdn.app`, the newly added custom domain name responds with an *HTTP/2 500* error and an error message detailing that direct connections are not allowed. This does **not** impact existing custom domains that have been configured in the past. We are working with the Fastly support team to identify the root cause for this and will push a fix once we have identified what change on our end, or on theirs, is causing this issue. We apologize for the inconvenience and would advise using the `graphcdn.app` service URLs for the time being.
Report: "Retroactive - Elevated Error Rates for Services with Custom Domains"
Last update: We identified an issue with a Fastly backend GraphCDN relies on that caused elevated error rates for requests on services that use custom domain names between 15:23 and 15:25 UTC. This affected fewer than 1000 requests on services that have a custom domain defined. We are in touch with Fastly's support team and are monitoring error rates closely.
Report: "Increased latency for some edge locations"
Last update:
## Leadup/fault
We observed a customer being targeted by a DDoS attack which exhausted a maximum limit on concurrent requests per Point of Presence as imposed by our service provider. We observed traffic at a volume orders of magnitude higher than usual. Mitigation was delayed due to requests being dropped at this frequency.
## Impact
During this attack, the amount of traffic exceeded a location-wide limit on concurrent connections. This resulted in all traffic at these locations becoming degraded and the attack impacting more than the targeted customer. All of the services we host remained reachable and online during this period but were experiencing increased latency due to throttling.
## Timeline (all times in UTC)
* 2022/04/28, around 10 am we observed an increase in traffic.
* Around 10:18 am we observed this traffic impacting service performance. This was confirmed by customer reports.
* At 10:23 am we declared an incident and started our investigation and remediation process.
* Around 10:35 am we identified a DDoS attack targeting a specific customer as the root cause.
* At 11:05 am, we confirmed a remediation plan with the affected customer and blocked traffic to their service.
* At 11:07 am latency across our other services returned back to expected levels.
* At 11:32 am the affected customer reduced routing traffic to our CDN and continued to work with us to bring their service back up.

(A diagram in the original update showed the observed traffic pattern, with a horizontal marker indicating the mean number of requests we would typically expect to see.)

## Short-term solution
* We worked with the customer to temporarily stop routing traffic to our CDN, after informing them of the issue, to reduce the amount of traffic entering the affected Points of Presence.
* We have shipped a per-service kill switch which allows us to block traffic to customer services if a customer says they’re unable to cope with a sudden influx of requests, as observed in DoS attacks (see the sketch after this post-mortem).
* We have talked to our infrastructure provider and raised our concurrent connection limits.
* We learned from the specific DDoS latency and traffic patterns and are improving our monitoring to detect such patterns sooner.
## Future plans
* We are prioritizing allowing services to limit the kind of requests they’re accepting (e.g. non-GraphQL requests), which aims to block more traffic at the edge.
* We are prioritizing implementing configurable rate limiting.
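As a hedged illustration of the per-service kill switch mentioned in the short-term solutions above: an edge handler can consult a per-service flag and reject traffic early, before it counts against POP-wide concurrency limits. The names and flag store below are hypothetical, not Stellate's actual implementation.

```ts
// Hypothetical sketch — the interface and key names are illustrative only.
interface FlagStore {
  get(key: string): Promise<string | null>;
}

async function handleRequest(
  _req: Request,
  serviceId: string,
  flags: FlagStore,
): Promise<Response | null> {
  // If the kill switch is set for this service, reject traffic at the edge
  // before it can exhaust location-wide concurrency limits.
  const killed = await flags.get(`kill-switch:${serviceId}`);
  if (killed === 'on') {
    return new Response('Service temporarily disabled', {
      status: 503,
      headers: { 'retry-after': '120' },
    });
  }
  return null; // fall through to normal caching / proxying
}
```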
This incident has been resolved.
Service metrics are back to regular levels. We are monitoring our systems closely and will post an update with a proper post mortem later as well.
We have identified and deployed a fix and are monitoring performance.
We are looking into an issue with increased latency in some of our locations.
Report: "Cloudflare Outage"
Last update:
## Leadup/fault
Cloudflare deployed a change to its global network, taking the busiest 19 locations offline (accounting for about 50% of total traffic passing through Cloudflare). This outage propagated to the Stellate GraphQL Edge Cache, which uses Cloudflare Workers under the hood. Cloudflare posted an [elaborate explanation](https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/) about this incident on their blog.
## Impact
* Traffic passing through Stellate POPs (provided by Fastly) which routed to affected Cloudflare locations saw increased error rates and outages. This affected all Stellate services, no matter if GraphQL Edge Caching was enabled or not.
* Since we use our GraphQL Analytics service for internal APIs, our dashboard was affected by the outage as well.
* The Stellate Purging API also runs on Cloudflare Workers and was unavailable in affected locations.
* Lastly, we observed failed attempts for users trying to log in to the dashboard via email. Our endpoint errored due to the [WorkOS](https://workos.com/) API (used internally to power magic login links) returning an error. WorkOS also mentioned a “degraded service” incident on their [status page](https://status.workos.com/incidents/s5kl869ldj94) that aligns with the timing of the Cloudflare outage.
## Timeline (all times in UTC)
* On 2022-06-21, around 6:40 am we started getting customer reports about our CDN service being unavailable.
* Around 6:52 am we linked this to the Cloudflare incident.
* At 7:03 am an incident was opened at Stellate for a failing part of our internal system.
* Around 7:20 am Cloudflare implemented a fix; in the minutes after that we saw our services returning back to normal.
## Short-term solution
* We improved our internal monitoring to check more locations. This will help us spot partial outages of our CDN services quicker in the future.
* We made the email login endpoint more resilient to outages of WorkOS.
## Future plans
* Already before the incident today, we were planning on consolidating our CDN service and reducing the dependencies on third-party providers like Cloudflare.
Report: "Intermittent HTTP/500 errors on the public website and dashboard"
Last update: This incident has been resolved. See https://www.vercel-status.com/incidents/kx5rjhwyrwp7 for additional information from our hosting provider.
A fix has been implemented and we are monitoring the results.
We are looking into intermittent HTTP/500 errors on our marketing site and dashboard with our hosting provider. The edge caches are not affected by those issues.