Historical record of incidents for Stellate
Report: "Increased Errors with Multiple Third Party Service Provider"
Last update: We are currently investigating this issue.
Report: "Elevated errors on Dashboard"
Last update: This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and we are monitoring the results.
We are currently investigating this issue.
Report: "Dashboard API Experiencing Issues"
Last update: This incident has been resolved.
We have applied new database indexes that should improve performance. We are still working on rolling out a production deployment to include additional API improvements.
We have identified potential optimizations and are in the process of testing them to ensure their effectiveness. While the main degradation has calmed down, we’re taking precautionary steps to apply patches and ensure the system's stability and safety moving forward.
We are currently experiencing database performance issues and are actively working on updates to resolve the situation. Our team is implementing changes to restore normal service as quickly as possible. We will continue to provide updates as we progress.
We are currently investigating this issue.
Report: "High Error Rates on Dashboard"
Last update: This incident has been resolved.
We are continuing to work on a fix for this issue.
The issue has been identified and a fix is being implemented.
Report: "High Error Rates Detected on Dashboard"
Last update: This incident has been resolved.
The issue has been identified and a fix is being implemented.
We are currently investigating this issue.
Report: "Issues with GraphQL Metrics ingesting new data"
Last update: This issue has been resolved. We'll continue monitoring all services, but we don't expect any further issues.
We are seeing signs of recovery, and new data is again being ingested into our GraphQL Metrics systems. We are monitoring all systems and are in touch with our infrastructure provider for further updates. Please note that the GraphQL Metrics system might lag behind near real-time for a while as queued updates get processed.
The issue has been traced to one of our infrastructure providers and their ClickPipes service. They have opened an investigation and updated their status page at https://status.clickhouse.com/incidents/f6j8dfnyy6dn
We are currently looking into an issue with our GraphQL Metrics systems, which prevents new data from showing up on the dashboard. All other systems are working as expected, and our GraphQL Edge Caching and GraphQL Rate Limiting systems are not affected.
Report: "Issues with Purging API"
Last update: During a routine employee offboarding, we revoked that employee’s access to Fastly. Revoking their access to Fastly also revoked all access tokens that engineer had created. Unfortunately, this included the central API token all our systems use to communicate with the Fastly API. This had two immediate impacts:
1. Purging started failing silently: Stellate’s purging API kept returning successful responses even though data was not evicted from the cache.
2. Service configuration updates started failing silently: configuration updates appeared to persist even though they were not updated in the CDN.
As part of the incident response, we switched the central Fastly API token to a new token owned by a shared engineering account. Further, we will work on gaining better visibility and alerting on failure conditions with the purging API, as well as audit all tokens in use by our services to ensure they are not owned by individual engineers.
This incident has been resolved.
We are continuing to monitor for any further issues.
A fix has been implemented and the Purging API is working as expected again. We are monitoring all systems to make sure they are working as expected.
The team has identified the issue and is currently implementing a fix.
We are currently looking into an issue with the Purging API.
Report: "Increased error rates with service configuration updates"
Last update: This incident has been resolved.
Fastly has identified the issue and implemented a fix. They are monitoring their systems to ensure this issue has been mitigated.
We are investigating increased timeouts with Fastly KV, which is a required dependency for configuration updates made either via the Stellate Dashboard or the API. Fastly is tracking this incident via their status page at https://www.fastlystatus.com/incident/376458. This does not impact our edge services.
Report: "Issues with the Purging API"
Last update: This incident has been resolved.
The fix has been rolled out to all nodes and the Fastly team is monitoring purges across their systems.
Fastly has begun rolling out a fix and is monitoring purges on their systems. The current ETA for the system-wide rollout is ~12 hours. As a (temporary) workaround, we recommend disabling `swr` for all cache rules. We will continue monitoring this incident on our end and update this status page as we get additional information from the team at Fastly.
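For context on the workaround above, here is a rough sketch of what disabling stale-while-revalidate on cache rules could look like. The field names below (`types`, `maxAge`, `swr`) follow the terminology used in these updates and are illustrative only, not necessarily Stellate's exact configuration schema; consult the Stellate documentation for the real format.

```ts
// Illustrative sketch only — not Stellate's actual config schema.
type CacheRule = {
  description?: string;
  types: string[]; // GraphQL types the rule applies to
  maxAge: number;  // seconds a response may be served from cache
  swr: number;     // seconds of stale-while-revalidate; 0 disables swr
};

// Temporary workaround during this incident: set swr to 0 on every rule so
// purged objects are not served stale while being revalidated.
const rules: CacheRule[] = [
  { description: 'Cache all queries', types: ['Query'], maxAge: 900, swr: 0 },
];
```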
Fastly has added an incident to their status page, available at https://www.fastlystatus.com/incident/376338. They are still investigating the issue and looking for a fix.
We continue to look into this with the Fastly team. We have narrowed this down to an issue with purging, in which all purges are considered soft purges if the cache configuration includes swr (stale-while-revalidate) values.
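For readers unfamiliar with the distinction referenced above: Fastly's purge-by-surrogate-key API performs a hard purge by default and a soft purge (mark content as stale rather than evict it) when the `fastly-soft-purge: 1` header is set. The sketch below only illustrates that difference; the service ID, surrogate key, and token are placeholders, and this is not Stellate's actual purging implementation.

```ts
// Placeholder values for illustration.
const FASTLY_API = 'https://api.fastly.com';
const serviceId = 'SERVICE_ID';
const surrogateKey = 'my-cache-key';
const token = process.env.FASTLY_API_TOKEN ?? '';

// Hard purge: evicts objects tagged with the surrogate key immediately.
async function hardPurge(): Promise<void> {
  await fetch(`${FASTLY_API}/service/${serviceId}/purge/${surrogateKey}`, {
    method: 'POST',
    headers: { 'Fastly-Key': token },
  });
}

// Soft purge: only marks objects as stale. With stale-while-revalidate (swr)
// configured, stale responses can still be served while revalidating — which
// is why purges behaving as soft purges looked like "purging not working".
async function softPurge(): Promise<void> {
  await fetch(`${FASTLY_API}/service/${serviceId}/purge/${surrogateKey}`, {
    method: 'POST',
    headers: { 'Fastly-Key': token, 'fastly-soft-purge': '1' },
  });
}
```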
We are currently looking into a potential issue with the Purging API not removing data from the cache. We are working with our infrastructure partners to investigate what is causing this.
Report: "Issues with configuration changes"
Last update: We have shipped a fix and it is once again safe to make configuration changes. We are very sorry for any issues this may have caused.
We are currently investigating an issue related to pushing configuration changes. Please refrain from making configuration changes at this time, as this could potentially make your service unavailable.
Report: "Increased error rates on configuration changes"
Last update: This incident has been resolved.
Fastly API error rates have recovered and we're not seeing any more issues. We will continue to monitor.
Fastly is looking into elevated error rates on their APIs. See https://www.fastlystatus.com/incident/376327 for more information.
We are working with our infrastructure partners to get this issue resolved.
We are currently looking into an issue with errors when applying configuration changes, either via the Stellate CLI or the dashboard. Edge services are not affected by these issues.
Report: "Service disruption for Automated Persisted Queries (APQs)"
Last update:
# Incident
* A bug was released on Jan 8th at 1.43 pm UTC while improving Persisted Operation support. The two areas of code overlap, and unfortunately, the change broke support for APQs.
* Our E2E test suite should have caught this bug.
* Unfortunately, we recently made many improvements to our E2E test suite and silently broke the validity of the APQ E2E tests. These tests were running and reporting successes, but under the hood, they were erroneously being run against a server that does not support APQ.
* The impact of this bug was not widespread enough to trigger alarms after release.
* At 11.45 pm UTC, a customer raised an issue with APQs, and our engineering team started investigating.
* On Jan 9th at 2.05 am UTC, a fix was deployed, and the issue was resolved.
# Improvements
* We’ve fixed the bug in our E2E test suite for APQ (a sketch of the kind of guard involved follows this post-mortem).
* We’ve agreed on a path forward to start monitoring GraphQL errors. The work has begun and is being tracked but has yet to be completed.
* We’ve scheduled a rollback dry run for our following incident dry run to improve our institutional knowledge of rollback procedures and find potential improvements.
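To illustrate the E2E gap described above: an APQ test suite can verify up front that its target actually supports APQs. In the standard APQ protocol, a hash-only request is answered with a `PersistedQueryNotFound` error by a server that supports APQs but does not know the hash, and with `PersistedQueryNotSupported` by one that does not support APQs at all. The endpoint and hash below are placeholders, and this is a sketch rather than Stellate's actual test suite.

```ts
// Illustrative precondition check for an APQ E2E suite (placeholder endpoint).
const ENDPOINT = 'https://my-service.stellate.sh';

async function assertServerSupportsApq(): Promise<void> {
  // Send only the persisted-query hash, no query text.
  const res = await fetch(ENDPOINT, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      extensions: {
        persistedQuery: { version: 1, sha256Hash: '0'.repeat(64) }, // dummy hash
      },
    }),
  });
  const body: any = await res.json();
  const messages: string[] = (body.errors ?? []).map((e: any) => e.message);

  // A server with APQ enabled answers "PersistedQueryNotFound" for an unknown
  // hash; "PersistedQueryNotSupported" means the suite is pointed at a server
  // that cannot exercise APQs at all — fail fast instead of reporting green.
  if (messages.includes('PersistedQueryNotSupported')) {
    throw new Error('E2E target does not support APQs; aborting APQ tests.');
  }
}
```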
Report: "Elevated response times due to an incident at our infrastructure provider"
Last update: This incident has been resolved.
A fix has been implemented and Fastly is monitoring performance.
Fastly has identified the issue and is implementing a fix.
We are currently looking into elevated response times due to an incident with Compute@Edge at Fastly. Please see https://www.fastlystatus.com/incident/376194 for more information.
Report: "Issues with Stellate Services behind a Cloudflare Proxy"
Last update: Fastly started forbidding domain fronting on October 24th. Customers that were using Cloudflare with the proxy enabled were affected because Fastly could not verify domain ownership for TLS certificates, which caused Fastly to throw a TLS validation error when these domains were accessed. We received communications from Fastly in September telling us some domains were going to be affected; however, they mentioned we had until the TLS certificates on current domains expired to take action. After the incident we reached out to Fastly, and they also mentioned the report they sent us was incomplete, as it did not include information about the HTTP method, and requests not using the POST method could be affected. This miscommunication from Fastly's side led us to believe we had more time before our application would be affected. Going forward, we are double-checking important dates with third-party providers to make sure there are no misunderstandings and we don’t cause downtime for our customers.
We have added additional information to the service settings on validating custom domains that do not point at Fastly directly. If you have Cloudflare Proxy, or another proxy, in front of Stellate, please make sure your custom domain is shown as _Verified_ in your service settings. If you have questions, do not hesitate to reach out to our support team.
If you are running your Stellate service behind a Cloudflare DNS record with proxy turned on and are running into issues with SAN (subject alternative names) errors, we recommend turning the proxy off and reaching out to our support team via support@stellate.co or the in-app messenger.
Report: "Cloudflare API Service Outage impacts Service Configuration Changes"
Last update: Cloudflare fixed the power grid issues they were experiencing and published a post-mortem on their page, which is available at https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/. They are closely monitoring their systems to ensure they are running as expected.
Cloudflare is experiencing an outage of its API services, which affects Stellate. Stellate edge services (caching, rate limiting, developer portal) are not affected. However, you might experience higher error rates on configuration changes. Please see https://www.cloudflarestatus.com/incidents/hm7491k53ppg for more information.
Report: "Cloudflare Worker issues impacting Stellate services"
Last update: Cloudflare marked this incident as resolved.
Cloudflare identified the issue and implemented a fix. (See https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc) They are still monitoring the systems, as are we.
We are seeing services recover. However, as Cloudflare didn't update their status page yet, we will keep this incident active.
We are continuing to monitor for any further issues.
Cloudflare Workers is experiencing an incident (please see https://www.cloudflarestatus.com/incidents/l6x2h1zp69bc for more information), which impacts Stellate services that haven't switched to our new infrastructure and Fastly exit IPs yet. If you would like to switch, please see https://stellate.co/docs/graphql-edge-cache/switch-to-new-infrastructure. If you already switched, your service is not affected.
Report: "Elevated Error Rates in the Columbus, Ohio point of presence."
Last update: This incident has been resolved. We are working with the Fastly and Cloudflare teams to better understand what caused those elevated error rates and how to prevent them.
We are currently investigating elevated error rates in the Columbus, Ohio point of presence.
Report: "Issues with configuration updates propagating"
Last update:
* Stellate relies on Fastly infrastructure for our offerings.
* Fastly experienced a partial outage of their KV Store offering on June 17th and June 18th, which affected Stellate. They provide a summary of this incident on their status page at [https://www.fastlystatus.com/incident/376022](https://www.fastlystatus.com/incident/376022)
## Timeline
* August 17th 10:46 UTC - A customer reported their Stellate endpoint failing in the FRA (Frankfurt) point of presence (POP), as well as in several other edge locations. This was due to them pushing an update to their configuration, specifically the `originUrl`.
* 10:50 - We identified the issue as being a stale KV value in the FRA POP, as well as several others.
* 10:55 - We created an incident on our status page for degraded KV in the FRA POP and several others.
* 13:08 - We realized that Rate Limiting and Developer Portals were affected by this outage as well.
* 13:30 - We reported this incident to Fastly.
* August 18th 4:00 UTC - Fastly was not yet able to provide us with a satisfactory response on what was causing this and didn’t acknowledge the ongoing outage.
* 6:23 - A large e-commerce customer reported their website was unavailable. This was due to a KV key disappearing in the FRA POP, as well as several others.
* 7:09 - Additional reports started to come in via Intercom about services not responding properly.
* 7:15 - We escalated the incident with Fastly as, from our view, more regions seemed to be affected and becoming unavailable.
* 7:16 - We deployed a partial fix that disabled our new infrastructure. This fixed edge caching for users who didn’t recently push configuration changes (the majority of services). Rate Limiting, JWT-based scopes, and the Developer Portal were still affected by the KV outage.
* 8:01 - Fastly was able to reproduce the bug based on a reproduction that we provided earlier and started working on a fix.
* 9:02 - Fastly opened an [official incident](https://www.fastlystatus.com/incident/376022) on their status page.
* 10:04 - Fastly marked the incident as resolved.
* 10:19 - Fastly communicated to us that the cause was an issue with surrogate keys in their C@E caching layer.
* August 22nd - Fastly shared their confidential Fastly Service Advisory with us, providing additional information about this incident and how they want to prevent this from happening again.
## Next Steps
* We have had several calls with Fastly over the last couple of days, working with them to analyze what went wrong, why it took them so long to escalate this internally, and how we can improve communication and collaboration going forward.
* As a direct outcome of this, we have re-connected with our European contacts at Fastly and designated a direct contact to involve in conversations and escalations going forward.
* We are going to investigate a fallback option for Fastly KV.
* Additionally, we will review all possible failure points that could make Stellate core services inaccessible (in the event of a third-party outage) and investigate options for additional redundancies for those services.
This issue has been resolved. We have temporarily switched all services back to our "old infrastructure" and are running additional tests as well as working with Fastly before we reopen the "new infrastructure". We will also publish additional details once we conclude our internal post mortem process.
Fastly has implemented a fix for the issue, and all services are working as expected again. We have temporarily disabled switching over to the new infrastructure and are working with Fastly to better understand what happened on their end, why it took so long to identify and rectify this, and how we can better monitor and prevent this in the future. We will enable the new infrastructure again once we are confident in the services we rely on.
We continue working with Fastly to resolve this issue. Please see https://www.fastlystatus.com/incident/376022 for updates from their team as well.
We are continuing to work on a fix for this issue.
The incident with KV stores, which are used for service configuration, is now spreading to additional edge locations and affecting overall service availability for services on the new infrastructure. We have disabled the new infrastructure to provide our partner more time to identify and resolve the issue on their end.
Our infrastructure partner has identified the issue and is working on fixing it.
We are continuing to investigate this issue together with our infrastructure providers. If you haven't made configuration changes to your service recently, you are not affected by this issue.
We are investigating an issue with configuration updates propagating to the respective services. If you didn't make configuration changes recently, your services are not impacted by this incident.
Report: "Metrics System"
Last update: The issue has been resolved and the metrics systems are working as expected again. Some metrics data wasn't properly ingested and is currently saved in a backup system. We will ingest this data into our production systems after additional checks have been run.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are currently looking into an issue with our metrics system. Edge Caching and other services are unaffected and operate as expected.
Report: "Degraded service for Custom Domains Management, Developer Portal API Key Management and Purging Analytics"
Last update: This incident has been resolved.
A fix has been implemented and we are monitoring the results.
The issue has been identified and a fix is being implemented.
We are continuing to investigate this issue.
We are experiencing degraded service for the following parts of our service:
* Creating and removing custom domain names for services and developer portals
* Creating, revoking and removing API keys managed via the developer portal
* Purging Analytics
The edge cache itself, as well as the purging API and invalidation of cached responses, is not affected by this incident.
Report: "Stellate Services unavailable because of Cloudflare Worker KV outage"
Last update:
* Stellate currently relies on Cloudflare services for parts of our offerings.
* Cloudflare had a global outage of their KV store for ~10 minutes on June 7th, from 6.51 pm to 7.01 pm. They provide a summary of this incident on their own status page at [https://www.cloudflarestatus.com/incidents/1mj9jch1tqf9](https://www.cloudflarestatus.com/incidents/1mj9jch1tqf9).
* Any traffic that resulted in cache misses or cache passes triggered an HTTP/500 error page during that time frame. Traffic directly handled by the edge cache (i.e., cache hits) was not affected.
* ~30% of traffic resulted in cache hits and was served correctly.
* ~70% of traffic resulted in cache misses or passes; these requests returned an HTTP/500 error.
* We are currently working on a larger infrastructure improvement that will remove the dependency on Cloudflare Worker KV.
* Additionally, we will review all possible failure points that could make Stellate core services inaccessible (in the event of a third-party outage) and investigate options for additional redundancies for those services.
Cloudflare posted an update on their status page and marked the incident that caused this one as resolved. See https://www.cloudflarestatus.com/incidents/1mj9jch1tqf9 for their update.
All services are back up and running again. We are monitoring the status of our services as well as Cloudflare Worker KV store.
As far as we can tell, the Cloudflare Workers KV service, which we depend on, was having an outage of about 5 to 10 minutes. They seem to be back up and running again. We are monitoring the situation and will update our status page as needed.
We are looking into an issue with Stellate right now. We will update this incident as we have more data available.
Report: "Dashboard unacessible"
Last update: This incident has been resolved.
The dashboard is back online, we are monitoring performance.
We are looking into an issue with the Stellate Dashboard. Our edge services (caching, rate limiting) and backend service (metrics backend) are unaffected.
Report: "DNS issue related to the default `stellate.sh` domain."
Last update: This incident has been resolved. All services are available again.
We are looking into an issue regarding Stellate services hosted on the default `stellate.sh` domains. Services on custom domain names are not affected.
Report: "Increased error rates when loading the Stellate Dashboard"
Last update: Metrics queues have caught up and the cluster is operating as expected again.
Our metrics cluster is back online and accessible again. It will take a short while for it to catch up with queued metrics updates, and you might still see sporadic errors on the dashboard in the next couple of minutes. We are closely monitoring performance and will update this incident as required.
We identified the root cause, which is an issue with our metrics cluster. We have alerted the infrastructure partner operating that cluster for us and they are working on restoring access.
We are looking into an issue with a service backing our dashboard. This is not affecting CDN services (neither caching nor the private beta of rate limiting); however, you will see higher error rates when trying to load the Stellate dashboard.
Report: "Major Service Outage"
Last update:
# Leadup/Fault
At 3:39 am UTC, our engineering team was alerted about an elevated number of errors in our CDN. While looking into the increased error rate, we noticed SSL handshake errors between the caching layer and the workers forwarding requests to origins. Additional debugging surfaced that the SSL handshake failures were caused by the `graphcdn.app` domain expiring. Later investigations revealed that while the domain was set to renew automatically, the credit card payment for the renewal failed. Additionally, we could not immediately get ahold of the person required to restore access to the `graphcdn.app` domain. All requests sent to any `*.graphcdn.app` subdomain were served an error page by the domain registrar. Because the CDN workers were internally using a `graphcdn.app` subdomain (a leftover from our name change), this domain expiry caused all requests to fail, even if the external domain used was not a `*.graphcdn.app` subdomain but a `*.stellate.sh` subdomain or a custom domain.
# Timeline (all times in UTC)
* 3:39 am - The first alert was triggered, and the engineers on-call were paged and started investigating the issue.
* 4:01 am - While working through our incident runbook, the engineering team noticed an error message regarding a “moved domain.”
* 4:12 am - As we couldn’t immediately get a hold of the person required to renew the `graphcdn.app` domain, the engineering team started to move the CDN workers to a different domain.
* 4:28 am - We opened an issue on our status page, [https://status.stellate.co/incidents/m7v0bgflsg4c](https://status.stellate.co/incidents/m7v0bgflsg4c)
* 5:15 am - We deployed the fix that moved the CDN workers to a different domain and restored service for all requests sent to `*.stellate.sh` subdomains or custom domains.
* 5:50 am - Restored access to the `graphcdn.app` domain, which resolved the issue for any services using `graphcdn.app`. (The time those services were available again varied slightly depending on DNS propagation.)
* 6:34 am - Marked the incident resolved.
# How did we resolve it?
The on-call engineers did not have access to the domain registrar where we registered the `graphcdn.app` domain. We couldn’t get ahold of the person who had access to that registrar because they weren’t on-call. While trying to find another way to reach that person, we deployed the first fix at 5:15 am UTC that removed the internal dependency on the `graphcdn.app` domain to restore service for all custom and `*.stellate.sh` subdomains. We recovered access to and renewed the `graphcdn.app` domain, and restored service for the `*.graphcdn.app` subdomains at 5:50 am.
# Post mortem
After resolving the incident, we conducted an internal post-mortem, analyzed the incident, and derived immediate actions (already completed) as well as future actions:
## Immediate Actions
1. Validate that no other domains are expiring soon.
2. Audit and ensure all on-call engineers have access to all critical services.
3. Use a central email for authentication with any domain registrars.
4. Ensure there is an escalation policy to the founders and that the founders are permanently on-call.
## Future Actions
1. Set up monitoring & alerts for expiring certificates & domains and audit our current monitoring setup for holes (a sketch of such a check follows this post-mortem).
2. Audit and ensure all on-call engineers have access to all services (not just the critical ones).
3. Create a Customer Success on-call rotation, including guidelines on when and how to involve the Customer Success teams in ongoing incidents.
4. Set up monitoring and alerts for failed subscription payments.
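As a hedged sketch of the expiry monitoring mentioned under Future Actions: the check below connects to each host over TLS and warns when the presented certificate expires within a threshold. It covers certificate expiry only; domain registration expiry (the actual trigger of this incident) would additionally require a registrar- or WHOIS-based check. The host list and threshold are placeholders.

```ts
import * as tls from 'node:tls';

// Placeholder inputs for illustration.
const HOSTS = ['stellate.sh', 'graphcdn.app'];
const WARN_DAYS = 30;

// Connect over TLS and compute days until the served certificate expires.
function daysUntilCertExpiry(host: string): Promise<number> {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host, port: 443, servername: host }, () => {
      const cert = socket.getPeerCertificate();
      socket.end();
      const expiresAt = new Date(cert.valid_to).getTime();
      resolve((expiresAt - Date.now()) / (1000 * 60 * 60 * 24));
    });
    socket.on('error', reject);
  });
}

async function checkAll(): Promise<void> {
  for (const host of HOSTS) {
    const days = await daysUntilCertExpiry(host);
    if (days < WARN_DAYS) {
      // In a real setup this would page the on-call rotation instead.
      console.warn(`Certificate for ${host} expires in ${Math.floor(days)} days`);
    }
  }
}

checkAll().catch(console.error);
```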
This incident has been resolved and all services are working as expected again. We will publish additional information on what triggered this incident, steps taken to fix it, as well as issues identified with our processes and how we plan to address them later today (European time zones).
Services on `graphcdn.app` domains are working again, though we still see some issues with DNS resolution for those domains from some providers.
We have deployed a fix for `graphcdn.app` domains and are waiting for the required DNS changes to propagate. All services should be working again shortly.
All services using the stellate.sh or custom domains are back up and running again. Services using the older graphcdn.app domains remain affected.
We are continuing to investigate this issue, which is causing all Stellate services to be unavailable at this time.
We are currently investigating an issue with our edge caching service.
Report: "Issues with newly created custom domain names"
Last update: This issue has now been resolved. All custom domain names created during this incident are working again, and newly created custom domain names are properly provisioned as well. We will, however, continue to monitor this closely.
Fastly identified the issue with newly created custom domain names for GraphCDN and deployed a fix. We checked all custom domains from users who checked in with our support team and they are all working. We will continue to monitor the situation until we are certain this issue has been fully resolved.
We are currently looking into an issue with adding new custom domain names to GraphCDN services. While the service continues to be accessible and working from the default `$servicename.graphcdn.app`, the newly added custom domain name responds with an *HTTP/2 500* error and an error message detailing that direct connections are not allowed. This does **not** impact existing custom domains that have been configured in the past. We are working with the Fastly support team to identify the root cause for this and will push a fix once we have identified what change on our end, or on theirs, is causing this issue. We apologize for the inconvenience and would advise using the `graphcdn.app` service URLs for the time being.
Report: "Retroactive - Elevated Error Rates for Services with Custom Domains"
Last update: We identified an issue with a Fastly backend GraphCDN relies on that caused elevated error rates for requests on services that use custom domain names between 15:23 and 15:25 UTC. This affected fewer than 1000 requests on services that have a custom domain defined. We are in touch with Fastly's support team and are monitoring error rates closely.
Report: "Increased latency for some edge locations"
Last update:
## Leadup/fault
We observed a customer being targeted by a DDoS attack which exhausted a maximum limit on concurrent requests per Point of Presence as imposed by our service provider. We observed traffic at a volume orders of magnitude higher than usual. Mitigation was delayed due to requests being dropped at this frequency.
## Impact
During this attack, the amount of traffic exceeded a location-wide limit on concurrent connections. This resulted in all traffic at these locations becoming degraded and the attack impacting more than the targeted customer. All of the services we host remained reachable and online during this period but were experiencing increased latency due to throttling.
## Timeline (all times in UTC)
* 2022/04/28, around 10 am we observed an increase in traffic.
* Around 10:18 am we observed this traffic impacting service performance. This was confirmed by customer reports.
* At 10:23 am we declared an incident and started our investigation and remediation process.
* Around 10:35 am we identified a DDoS attack targeting a specific customer as the root cause.
* At 11:05 am, we confirmed a remediation plan with the affected customer and blocked traffic to their service.
* At 11:07 am latency across our other services returned back to expected levels.
* At 11:32 am the affected customer reduced routing traffic to our CDN and continued to work with us to bring their service back up.

(A diagram in the original update showed the observed traffic pattern, with a horizontal marker indicating the mean number of requests we would typically expect to see.)

## Short-term solution
* We worked with the customer to temporarily stop routing traffic to our CDN, after informing them of the issue, to reduce the amount of traffic entering the affected Points of Presence.
* We have shipped a per-service kill switch which allows us to block traffic to customer services if a customer says they’re unable to cope with a sudden influx of requests, as observed in DoS attacks (see the sketch after this post-mortem).
* We have talked to our infrastructure provider and raised our concurrent connection limits.
* We learned from the specific DDoS latency and traffic patterns and are improving our monitoring to detect such patterns sooner.
## Future plans
* We are prioritizing allowing services to limit the kind of requests they’re accepting (e.g. non-GraphQL requests), which aims to block more traffic at the edge.
* We are prioritizing implementing configurable rate limiting.
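As a hedged illustration of the per-service kill switch mentioned in the short-term solutions above: an edge handler can consult a per-service flag and reject traffic early, before it counts against POP-wide concurrency limits. The names and flag store below are hypothetical, not Stellate's actual implementation.

```ts
// Hypothetical sketch — the interface and key names are illustrative only.
interface FlagStore {
  get(key: string): Promise<string | null>;
}

async function handleRequest(
  _req: Request,
  serviceId: string,
  flags: FlagStore,
): Promise<Response | null> {
  // If the kill switch is set for this service, reject traffic at the edge
  // before it can exhaust location-wide concurrency limits.
  const killed = await flags.get(`kill-switch:${serviceId}`);
  if (killed === 'on') {
    return new Response('Service temporarily disabled', {
      status: 503,
      headers: { 'retry-after': '120' },
    });
  }
  return null; // fall through to normal caching / proxying
}
```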
This incident has been resolved.
Service metrics are back to regular levels. We are monitoring our systems closely and will post an update with a proper post mortem later as well.
We have identified and deployed a fix and are monitoring performance.
We are looking into an issue with increased latency in some of our locations.
Report: "Cloudflare Outage"
Last update:
## Leadup/fault
Cloudflare deployed a change to its global network, taking the busiest 19 locations offline (accounting for about 50% of total traffic passing through Cloudflare). This outage propagated to the Stellate GraphQL Edge Cache, which uses Cloudflare Workers under the hood. Cloudflare posted an [elaborate explanation](https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/) about this incident on their blog.
## Impact
* Traffic passing through Stellate POPs (provided by Fastly) which routed to affected Cloudflare locations saw increased error rates and outages. This affected all Stellate services, no matter if GraphQL Edge Caching was enabled or not.
* Since we use our GraphQL Analytics service for internal APIs, our dashboard was affected by the outage as well.
* The Stellate Purging API also runs on Cloudflare Workers and was unavailable in affected locations.
* Lastly, we observed failed attempts for users trying to log in to the dashboard via email. Our endpoint errored due to the [WorkOS](https://workos.com/) API (used internally to power magic login links) returning an error. WorkOS also mentioned a “degraded service” incident on their [status page](https://status.workos.com/incidents/s5kl869ldj94) that aligns with the timing of the Cloudflare outage.
## Timeline (all times in UTC)
* On 2022-06-21, around 6:40 am we started getting customer reports about our CDN service being unavailable.
* Around 6:52 am we linked this to the Cloudflare incident.
* At 7:03 am an incident was opened at Stellate for a failing part of our internal system.
* Around 7:20 am Cloudflare implemented a fix; in the minutes after that we saw our services returning back to normal.
## Short-term solution
* We improved our internal monitoring to check more locations. This will help us spot partial outages of our CDN services quicker in the future.
* We made the email login endpoint more resilient to outages of WorkOS.
## Future plans
* Already before the incident today, we were planning on consolidating our CDN service and reducing the dependencies on third-party providers like Cloudflare.
Report: "Intermittent HTTP/500 errors on the public website and dashboard"
Last update: This incident has been resolved. See https://www.vercel-status.com/incidents/kx5rjhwyrwp7 for additional information from our hosting provider.
A fix has been implemented and we are monitoring the results.
We are looking into intermittent HTTP/500 errors on our marketing site and dashboard with our hosting provider. The edge caches are not affected by those issues.