Is Doppler Down Right Now? Discover if there is an ongoing service outage.

Doppler is currently Operational

Last checked Jul 29, 2025 14:39 UTC from Doppler's official status page

Historical record of incidents for Doppler

Jun 12, 2025

Report: "System Outage"

Last update 2025-06-12T21:07:00.000Z

Resolved2025-06-12T21:07:00.000Z

This incident has been resolved.

Monitoring2025-06-12T20:19:00.000Z

GCP and Cloudflare have been gradually resolving their outages and Doppler services have been partially restored.

Investigating2025-06-12T18:29:00.000Z

We're currently investigating a system-wide outage related to disruptions with GCP and Cloudflare.

Report: "Degraded system performance"

Last update 2025-06-12T18:29:42.242Z

investigating2025-06-12T18:29:41.724Z

We're currently investigating degraded system performance related to possible disruptions with GCP and Cloudflare.

Apr 11, 2025

Report: "Extended Maintenance"

Last update 2025-04-11T02:37:36.480Z

resolved2025-04-11T02:33:31.000Z

This incident has been resolved.

investigating2025-04-11T02:15:36.000Z

We're currently experiencing downtime due to extended database maintenance.

Oct 31, 2024

Report: "Partial Personal Config Outage"

Last update 2024-10-31T00:05:41.527Z

resolved2024-10-31T00:05:41.506Z

This incident has been resolved.

investigating2024-10-30T23:53:29.843Z

Only users with access to personal configs via groups are affected. Users with direct access to personal configs are not affected.

investigating2024-10-30T23:41:59.718Z

A fix has been identified and should be deployed soon.

investigating2024-10-30T23:19:20.843Z

We are continuing to investigate this issue.

investigating2024-10-30T22:10:52.000Z

Certain users may be unable to fetch from personal configs.

May 31, 2024

Report: "Splunk Activity Logs Integration Outage"

Last update 2024-05-31T19:51:46.146Z

resolved2024-05-31T19:51:46.132Z

A fix has been deployed and we are monitoring reconnections. We will backfill missed Activity Logs as needed after workplaces reconnect Splunk.

identified2024-05-31T18:05:40.284Z

We are continuing to work on a fix for this issue.

identified2024-05-31T18:04:29.483Z

Activity Logs stopped publishing to Splunk as of May 30th at 19:55 UTC. The issue has been identified and a fix is in progress. After the issue has been resolved and the connections reconnected, missed activity logs will be republished.

Apr 16, 2024

Report: "Elevated error rates for API and dashboard"

Last update 2024-04-16T13:30:00.381Z

resolved2024-04-15T15:59:00.000Z

For a period of 18 minutes, secrets related API and dashboard actions saw increased error rates (503) and timeouts.

Feb 29, 2024

Report: "Syncs to Railway are intermittently failing"

Last update 2024-02-29T22:10:11.225Z

resolved2024-02-29T22:10:11.211Z

This incident has been resolved.

monitoring2024-02-29T22:00:46.505Z

A fix has been implemented and we are monitoring the results.

identified2024-02-29T21:38:42.192Z

We are continuing to work on a fix for this issue.

identified2024-02-29T21:29:38.304Z

We are working on a fix.

Aug 4, 2023

Report: "Unable to fetch secrets from the Doppler dashboard and API"

Last update 2023-08-04T23:20:17.207Z

postmortem2023-08-04T23:18:54.666Z

This incident was caused by a faulty Kubernetes NetworkPolicy change. We’ll be evaluating how we can adjust our deployment procedures to catch these changes in the future.

resolved2023-08-04T22:50:48.761Z

This incident has been resolved.

investigating2023-08-04T22:46:51.137Z

We are continuing to investigate this issue.

investigating2023-08-04T22:28:36.000Z

We are currently investigating this issue.

Mar 22, 2023

Report: "Doppler CLI Failing to Install From OS Package Registries"

Last update 2023-03-22T23:13:11.230Z

resolved2023-03-22T23:13:10.409Z

An incident was identified at our provider for OS package registry downloads. They've fixed the issue and installs from OS package registries are available again. Impact: The incident affected all Doppler users attempting to install the Doppler CLI via their OS package registeries. This resulted in disruptions to their workflows relying on the Doppler CLI. Resolution: Cloudsmith resolved the incident after around an hour, and users were once again able to install the Doppler CLI from their OS package registries.

investigating2023-03-22T22:30:57.220Z

We're currently investigating this issue. If you need immediate availability, you can install Doppler directly from our published release artifacts instead of via your OS's package registry by running: ``` (curl -Ls --tlsv1.2 --proto "=https" --retry 3 https://cli.doppler.com/install.sh || wget -t 3 -qO- https://cli.doppler.com/install.sh) | sh ```

Mar 2, 2023

Report: "Doppler CLI Failing to Install Due to GitHub Incident"

Last update 2023-03-02T18:57:49.121Z

resolved2023-02-27T20:45:00.000Z

For several hours on February 27, 2023, users intermittently encountered errors when attempting to install the Doppler CLI. This issue was due to a related incident on GitHub, where Doppler hosts all CLI binaries and signatures. According to GitHub's incident report, GitHub experienced degraded performance and increased error rates for their packages service. Impact: The incident affected all Doppler users attempting to install the Doppler CLI. This resulted in disruptions to their workflows and GitHub actions relying on the Doppler CLI. Resolution: GitHub resolved the incident after a few hours, and users were once again able to install the Doppler CLI.

Jan 19, 2023

Report: "Unable to sync secrets, webhooks, and activity log notifications"

Last update 2023-01-19T16:31:31.415Z

postmortem2023-01-17T01:31:40.182Z

# Summary From 2023-01-16 19:11 UTC to 2023-01-16 20:04 UTC, Doppler experienced a partial outage which prevented sync integrations, webhooks, and activity log notifications from executing. The outage also prevented an internal job from firing which recomputes the version hashes for Doppler configs. This resulted in API clients \(e.g. the Doppler CLI and Kubernetes Operator\) failing to receive secrets updates which were made during this window. A recovery migration was run at 2023-01-17 00:37 UTC, re-triggering all syncs, webhooks, and activity log notifications — as well as recomputing config version hashes to restore the functionality for all clients to fetch secret updates. # Incident Details Doppler uses RabbitMQ to queue jobs which need to be executed as a result of secret updates. On 2023-01-13, Doppler’s security team rotated a RabbitMQ password, mistakenly identifying the credential as unused in production. It took several days for the RabbitMQ sessions in Doppler’s production services to expire and once they did, queue jobs could no longer be published. Once the incident was identified, Doppler’s security team created new RabbitMQ users to be used by our production services. The change was deployed and the incident was resolved at 2023-01-16 20:04 UTC. At 2023-01-17 00:37 UTC, Doppler ran a recovery migration to re-fire queue events for sync integrations, webhooks, activity log notifications, and secret version hash recomputations that were meant to fire during the incident window. # Next Steps Doppler has switched from using a single RabbitMQ credential to using one user per service. RabbitMQ users are now clearly named to mitigate the risk of accidental rotation in the future. We’ve also identified that the ability for API clients to fetch secrets should not be dependent on our application’s ability to connect to RabbitMQ. Our engineering team will move the config version hash computation to our atomic secrets write operation to ensure that the latest secrets are always fetched by clients. Lastly, our engineering team is reconfiguring the way we queue asynchronous jobs to ensure that if secrets are modified during a partial infrastructure failure, all post-update jobs will eventually be executed — without the need for manual recovery migrations.

resolved2023-01-16T19:11:00.000Z

During this incident, changes to secrets could not be synced to integrations or trigger webhook updates. Additionally, activity log notifications could not be posted to Slack, Microsoft Teams, Sumo Logic, Splunk, or Datadog.

Jan 5, 2023

Report: "Heroku sync integrations are failing"

Last update 2023-01-05T21:40:01.166Z

postmortem2023-01-05T21:30:04.281Z

**Root Cause** Doppler syncs secrets to Heroku via a Heroku OAuth application. This application is created in the Heroku dashboard and must be owned by a single Heroku user account. Doppler’s previous Heroku OAuth application was owned by a specific Doppler employee’s Heroku account without access to any additional resources. During a routine external account audit, this account was mistakenly identified as unused and manually deleted by our security team. This irrecoverably deleted Doppler’s existing Heroku OAuth application, thereby breaking any existing syncs and requiring the creation of a new OAuth application in a new account. **Resolution** Because users had authorized our previous Heroku OAuth application to their Heroku account\(s\), users need to authorize the new Heroku OAuth application. This involves reconnecting the integration from the [Doppler dashboard](https://dashboard.doppler.com/workplace/settings). Once the integration is reconnected, Doppler will re-enable all associated syncs that have been disabled and perform a fresh sync. Note that the previous OAuth application was deleted and therefore no action is required to remove its access to your Heroku account. **Next Steps** Internally, we’re reorganizing how shared accounts used for critical functionality are stored in 1Password. This new 1Password organization should help prevent this kind of accidental deletion in the future. We avoid shared accounts whenever possible, but this isn’t always feasible given third party implementations. We'll also be adding our individual integrations to our status page. This will allow customers to more easily see which integrations, if any, are currently experiencing issues.

resolved2023-01-05T18:23:19.977Z

This issue is now resolved. Additional action is required to re-enable existing Heroku syncs. All workplaces will need to reconnect their Heroku sync integrations from the [workplace Settings page](https://dashboard.doppler.com/workplace/settings). Once integrations are reconnected, all syncs will automatically be triggered to sync any pending changes in Doppler. A postmortem will be available soon with more details regarding what happened.

identified2023-01-05T01:06:37.350Z

This incident may persist for up to 24 hours. In the meantime, if you have an urgent secret change that needs to be made, please update the secret in your Doppler dashboard and then manually update the secret in Heroku directly via `heroku config:set`. Updating the secret in Doppler will ensure that the value in Heroku is not overwritten once the incident is resolved.

identified2023-01-05T00:48:58.710Z

New and existing Heroku sync integrations are currently not functioning. We have identified the issue and are working on resolving it. Secrets previously synced to Heroku apps will remain available but new secret changes in Doppler will not be synced until this is resolved.

Aug 29, 2022

Report: "Secrets unable to be read"

Last update 2022-08-29T01:38:01.851Z

resolved2022-08-29T01:38:01.082Z

This issue was caused by a redis server running out of memory. We've increased the amount of memory allotted to this server. We'll also be setting up additional alerting on redis resource usage so that we can catch and prevent these issues in the future.

monitoring2022-08-29T01:32:22.482Z

A fix has been implemented and we are monitoring the results.

investigating2022-08-29T00:38:39.000Z

We are currently investigating this issue.

Aug 1, 2022

Report: "Updating secrets fails for a limited number of environments"

Last update 2022-08-01T16:45:20.041Z

resolved2022-07-28T20:00:00.000Z

An error was discovered while running a data backfill job that resulted in users being unable to update secrets for affected environments. The data that was generated triggered an unexpected assertion in the secrets validation code. Users that attempted to save secrets in these environments would have received a 500 response from the server with the message "An error has occurred but don't fret. Our team has been notified." The data has been fixed for all environments and secrets write operations are fully functional again.

Jun 21, 2022

Report: "Services responding with 500 Internal Server Error"

Last update 2022-06-21T07:22:58.681Z

resolved2022-06-21T07:22:57.481Z

This incident has been resolved.

monitoring2022-06-21T07:10:20.492Z

We are continuing to monitor for any further issues.

monitoring2022-06-21T07:09:51.967Z

A fix has been implemented and we are monitoring the results.

identified2022-06-21T06:52:05.681Z

This issue is due to a widespread Cloudflare outage. https://www.cloudflarestatus.com/incidents/xvs51y9qs9dj

investigating2022-06-21T06:33:26.175Z

We are currently investigating this issue.

Dec 22, 2021

Report: "Unable to connect to the Doppler dashboard and API"

Last update 2021-12-22T17:08:02.454Z

resolved2021-12-22T17:08:01.751Z

This incident has been resolved.

monitoring2021-12-22T16:54:56.126Z

A fix has been implemented and we are monitoring the results.

investigating2021-12-22T16:50:06.072Z

We are currently investigating this issue.

Dec 8, 2021

Report: "Reading/writing secrets intermittently failing due to an outage at our tokenization provider"

Last update 2021-12-08T17:13:43.045Z

resolved2021-12-08T10:20:01.000Z

This incident has been resolved.

investigating2021-12-07T22:05:53.864Z

We're intermittently experiencing outages with our tokenization provider, who is impacted by the AWS us-east-1 outage.

Dec 7, 2021

Report: "Reading/writing secrets failing due to an outage at our tokenization provider"

Last update 2021-12-07T19:01:26.259Z

resolved2021-12-07T19:01:25.689Z

This incident has been resolved.

monitoring2021-12-07T18:45:44.234Z

Our tokenization provider has recovered and all Doppler services are operational.

identified2021-12-07T18:32:51.643Z

The tokenization provider is impacted by the AWS us-east-1 outage.

Nov 20, 2021

Report: "Dashboard intermittently returning bad gateway errors"

Last update 2021-11-20T00:58:39.528Z

resolved2021-11-20T00:58:39.066Z

This incident has been resolved.

monitoring2021-11-20T00:38:07.701Z

We believe we've identified and resolved the issue. We're currently monitoring the fix to confirm.

investigating2021-11-19T23:48:58.796Z

We're currently seeing intermittent errors on dashboard.doppler.com. We believe this is a connectivity issue between Cloudflare and our hosting provider and are investigating. API and CLI usage are not affected.

Nov 16, 2021

Report: "All services returning 404"

Last update 2021-11-16T18:19:12.371Z

resolved2021-11-16T18:19:11.768Z

This incident has been resolved. https://status.cloud.google.com/incidents/6PM5mNd43NbMqjCZ5REh

monitoring2021-11-16T18:09:26.201Z

The services are now coming back online. We are continuing to monitor.

investigating2021-11-16T17:39:16.912Z

This appears to be due to a larger Google outage. We are in contact with Google support and are monitoring their status page.

Nov 15, 2021

Report: "Users getting rate limited when hitting API"

Last update 2021-11-15T22:41:45.231Z

resolved2021-11-15T22:41:42.340Z

This incident has been resolved.

monitoring2021-11-15T22:37:21.040Z

We've rolled back the code change that's causing this issue. Service Tokens should resume normal operation.

investigating2021-11-15T22:14:17.000Z

We are currently investigating an issue with Service Tokens getting rate limited.

Oct 1, 2021

Report: "API unavailable due to contention over database locking"

Last update 2021-10-01T21:48:05.909Z

resolved2021-10-01T09:00:00.000Z

Our API experienced an 11 minute outage due to database contention over locking our Workplace table. The issue resolved itself once the locks expired. We've refactored the affected component to achieve the desired result without locking. We'll be exploring a more holistic fix to prevent this issue from occurring in other components.

Sep 23, 2021

Report: "Workplace admins, viewers and collaborators unable to login via CLI"

Last update 2021-09-23T19:12:38.453Z

resolved2021-09-23T19:12:38.062Z

This issue has been resolved. We've fixed an erroneous permission check that disallowed lower privilege users from authorizing the CLI.

monitoring2021-09-23T19:10:18.537Z

A fix has been implemented and we are monitoring the results.

identified2021-09-23T17:26:24.428Z

We are continuing to work on a fix for this issue.

identified2021-09-23T17:08:09.492Z

The issue has been identified and a fix is being implemented.

Apr 28, 2021

Report: "Secrets storage is unavailable"

Last update 2021-04-28T03:26:48.067Z

resolved2021-04-28T03:26:47.436Z

This incident with our third party vendor has been resolved. All secrets access has been restored.

monitoring2021-04-28T03:24:24.693Z

A fix has been implemented and we are monitoring the results.

investigating2021-04-28T03:15:11.920Z

We are currently investigating this issue.

Apr 14, 2021

Report: "CLI Tokens Authentication Issues"

Last update 2021-04-14T05:44:33.221Z

resolved2021-04-14T05:44:32.214Z

The issue causing CLI authentication to fail has been resolved. CLI tokens should now be functioning as expected. If your CLI token continues not to work, please reach out to support.

identified2021-04-14T04:38:53.984Z

We are currently experiencing an issue with authenticating CLI tokens. When connecting with a CLI token, users are receiving the message "Invalid CLI Token". We are investigating this and will have a new update within 1 hour. If you are experiencing this issue, you can circumvent it by re-logging in to the Doppler CLI using 'doppler login'.

Jan 6, 2021

Report: "SAML SSO disabled on all workplaces"

Last update 2021-01-06T05:53:57.311Z

resolved2021-01-06T05:53:56.691Z

SAML SSO has been restored for all workplaces.

investigating2021-01-06T05:21:30.462Z

SAML SSO has been disabled on all workplaces due to a faulty migration script. We are currently in the process of recovering SAML and will update this incident accordingly.

Jul 27, 2020

Report: "Scheduled Maintenance"

Last update 2020-07-27T22:06:16.103Z

resolved2018-12-19T13:14:13.315Z

This incident has been resolved.

monitoring2018-12-19T13:13:57.230Z

Upgraded servers and postgres databases for elastic scaling. During transition, a few API requests timed out. All systems are now fully operational.

Report: "Users are unable to connect to Doppler servers"

Last update 2020-07-27T22:06:16.054Z

postmortem2019-03-07T05:28:58.266Z

Our first critical outage occurred today, and it led to almost 2 hours of downtime but thankfully no data loss. This was an unacceptable amount of downtime and could have been completely prevented. We take your trust very seriously as Doppler is a critical path in your devops and productivity workflows. We have learned greatly from this experience, from fixing the root cause and adding checks to prevent this kind of outage in the future. Here is what happened: # Timeline **March 6th, 2019 - 3:04 PM \(PST\)** [Heroku](https://heroku.com) starts an automated maintenance on our primary postgres database. This process includes creating a follower of our primary database on the newest postgres version, then hard forking the database and setting it as the new primary database. After the primary database is in use, the old database is removed. We were warned about this migration ahead of time and assumed the maintenance/migration would be quick and all of the credentials would automatically be updated across our environments. **March 6th, 2019 - 3:20 PM \(PST\)** As our servers start crashing and we receive a flood of [Bugsnag](https://bugsnag.com) error reports. Digging through the stack traces and logging, we realize a failure occurred during the migration where our primary database url was revoked but the new primary database url had not been automatically set as an environment variable. This made our servers attempt connection to an invalid database url which caused our servers to crash. This same failure affected our failover servers which relied on a follower database of our primary database. Since Doppler relies on Doppler for environment variables, it created a circular loop where we need to be up to boot up. Removing Doppler as a source for environment variables is tricky, as we only stored our environment variables in Doppler. Those environment variables are tokenized with our security vendor so it was impossible for us to copy them from our database into Heroku manually. Instead, we had to go through every service we used to grab or create new credentials through each of their individual dashboards. **March 6th, 2019 - 3:30 PM \(PST\)** We initiated our recovery plan: 1\. Report the incident on StatusPage 2\. Put a list together of all the credentials needed 3\. Find correct database url 4\. Modify our environment variables with credentials 5\. Deploy new code with our up-to-date environment variables **March 6th, 2019 - 3:50 PM \(PST\)** After finding all of our credentials / environment variables we needed, we started inputting them manually into Heroku environments. This had to be done across all of our services for Doppler to work properly. After setting most of our environment variables, we ran into a critical problem. Heroku would not let us set the new database url because it was a managed variable controlled by Heroku Postgres. In addition, our servers started scaling up rapidly as they were constantly crashing and rebooting, each time hitting our API that was already down. This endless cycle created timeouts and produced a massive number of logs, making it very hard to debug. To counteract this, we tried to disable autoscaling and scaled our servers down to 1 dyno. This process kept returning an error as Heroku would mark it as an invalid request. After 10 minutes, we were able to successfully scale our servers down to 1 dyno. Now that logging is easier to comb through, we refocused on fixing our postgres invalid credentials problem. We found a buried option in a submenu of the Managed Postgres Dashboard to force a rotation of our database credentials. Next, Heroku immediately propagated those new credentials to our environment variables as we expected. **March 6th, 2019 - 4:15 PM \(PST\)** Now that all of our correct environment variables are on Heroku, we started step 5. Though the Doppler API was down, our SDK was able to fallback to the environment variables set on Heroku. Quickly after deploying the fix and our website is back up, we realize our security vendor's API url is incorrect from the one on the dashboard, as they had created a dedicated one for us. **March 6th, 2019 - 4:30 PM \(PST\)** After digging through a hundred old slack messages and talking with support, we found the dedicated API url to use. Now our servers are fully back up and running. **March 6th, 2019 - 4:50 PM \(PST\)** After 20 minutes of additional monitoring and stress testing, the StatusPage is updated with the indication that the incident is resolved. # What We Learned ### **Circular Loop** Doppler being a critical path for Doppler is a linchpin waiting to be pulled. We needed to have a fallback option built-into our systems so that Doppler can boot up after a full outage. This will be done by ensuring Heroku environment variables are always up to date. ### **Stronger Prechecks** Previous to the outage, our precheck script strictly checked for the presence of all our required environment variables. This is not good enough. Our new precheck script also initializes all libraries and clients \(including our database client\) to ensure all the credentials are correct. During the outage, when our database url was revoked, the old prechecks still passed as a url was present. It now ensures authorization ahead of time and done successfully. ### **DevOps Expertise** A wake up call that we will not be able to run on Heroku for too long. Heroku provides a lot of value for the price tag but comes at a deep expense: lack of devops expertise. Overtime we will migrate to AWS and build out our own devops workflows that can support our extremely high SLA and fault tolerant requirements.

resolved2019-03-07T00:50:57.097Z

This incident has been resolved.

monitoring2019-03-07T00:42:46.963Z

A fix has been implemented and we are monitoring the results.

identified2019-03-07T00:29:08.671Z

We are still working on a fix for this issue on our API front. A fix has been implemented for the affected website and we are monitoring the results.

identified2019-03-06T23:58:58.929Z

We are continuing to work on a fix for this issue.

identified2019-03-06T23:46:57.588Z

We are continuing to work on a fix for this issue.

identified2019-03-06T23:45:48.474Z

We have investigated reports of elevated error rates. We've identified the issue causing an outage in updates and are rolling out a fix.

Report: "StatusPage Metrics Migration"

Last update 2020-07-27T22:06:16.010Z

resolved2019-04-15T01:08:04.089Z

We are modifying our StatusPage to add new metrics which may briefly result in metrics showing false outages.

Report: "Restricted API Key authentication is down"

Last update 2020-07-27T22:06:15.969Z

postmortem2019-12-09T09:23:50.073Z

A critical outage around Restricted API Key authentication occurred Sunday evening at exactly 9:00 PM PST leading to over 1 hour of downtime. This outage did not result in any data loss but did render Restricted API Keys inoperable with our API endpoints for that period of time. This was an unacceptable amount of downtime and could have been completely prevented. We take your trust very seriously as Doppler is a critical path in your devops and productivity workflows. We have learned from this experience while fixing the root cause and adding checks to prevent this kind of outage in the future. Here is what happened: # Timeline **December 9th, 2019 - 9:00 PM \(PST\)** Our engineering team rolls out new authentication logic for our API endpoints. The update is designed to increase our defense in depth by adding additional layers of authentication around every layer in our core stack. **December 9th, 2019 - 9:40 PM \(PST\)** While testing the API endpoints to verify the new logic, we find that all requests utilizing a Restricted API Key are being rejected. **December 9th, 2019 - 10:00 PM \(PST\)** We identify the culprit bug in our codebase and start working on a patch. The bug is linked to how we handle Restricted API Keys. Doppler’s API offers 3 methods of authentication: Personal Keys, Restricted API Keys \(now called Service Tokens\), and CLI Tokens. Personal Keys and CLI Tokens are tied to a user identity, while Service Tokens are not. Our investigation finds that our authentication logic was requiring a user identity and did not gracefully handle a case where one would not be present. **December 9th, 2019 - 10:48 PM \(PST\)** The patch is released to production and the engineering team starts monitoring the fix. **December 9th, 2019 - 11:38 PM \(PST\)** After stress testing the patched authentication logic in production, we mark the incident resolved. # Moving Forward **End to End Testing \(e2e\)** From unit tests to e2e testing, we strive to test _every_ part of our stack. As it turns out, we had not added e2e tests for Service Tokens. This outage has led us to reassess our test coverage and focus on testing all remaining user flows. **StatusPage Accuracy** Doppler uses Pingdom to test for uptime and display those results on our status page. Pingdom was testing our health check endpoints which do not require authentication. This led Pingdom to not report our outage. We have now changed Pingdom to test our secrets endpoint, which requires multiple layers of authentication and is a better indicator of our API being serviceable to customers. **Announcing Maintenance Windows** Starting today, we are going to announce major maintenance schedules one week in advance. Our goal with maintenance is to _never_ have any downtime, but when downtime does strike, we want you to be prepared. Knowing a time window in advance can help your engineering teams navigate and prepare for the off chance appropriately.

resolved2019-12-09T07:38:15.192Z

This issue has been resolved. During a revamp of our authentication methods, we introduced a logic bug. This bug prevented restricted API Keys from gaining access to their projects, causing any restricted API Key to receive the error "User must be provided." A fix has been deployed and this issue is now resolved.

monitoring2019-12-09T06:48:44.464Z

A fix has been implemented and we are monitoring the results.

identified2019-12-09T05:00:02.849Z

The issue has been identified and a fix is being implemented.

Report: "Migration Script Race Condition"

Last update 2020-07-27T22:06:15.928Z

resolved2020-02-06T23:43:00.000Z

We had a migration script that failed to run before the rest of our services deployed. This caused the newly deployed services to crash because the database was missing a column in a critical table. Our deployment orchestrator quickly detected the issue and reverted to the previous build. Total downtime was approximately 2 minutes. We plan to mitigate future failures by requiring our migrations to complete before rolling out our remaining services.

Report: "Downtime due to migration script"

Last update 2020-07-27T22:06:15.887Z

postmortem2020-07-10T04:14:17.770Z

Today at 8:38pm Pacific we experienced a critical outage that brought down all of our servers on both our primary and failover production environments. This outage did not result in any data loss but did disable all dashboard and API access. ## Why Did it Happen? At Doppler, we typically do batched rollouts to production at night \(lowest risk time\). To help prevent outages we use staged releases, where the migration scripts are run first, then our new code is released to our clusters. The problem we faced today with this approach is it creates an edge case when we are deleting a column in the database that is still being used in active code. In our latest rollout, we had a migration script that deleted a column in one of our tables. This column would not be used in the soon to be released code but was used by our active deployments. When the migration script ran, all our active deployments immediately crashed as the ORM \(object-relational mapper\) expected a column that did not exist anymore. ## Moving Forward **Migration Scripts** We are enacting a new policy where columns can not be deleted when there is an active deployment that relies on them. The new model will require that the column can only be deleted once 100% of our deployments are not using it. We expect this to come in the form of 2 rollouts, the first transitioning our deployments off the column with the second removing the column. **Hardening Our Deployments** One of the coolest parts of building tools for developers is that we get to dog food our own product. But it does come with its own struggles, such as Doppler relying on Doppler. Circular loops can be dangerous. To help prevent that, we create encrypted snapshots of our secrets in our Docker images during the build phase. This is done so that in the event an outage occurs, we can bring ourselves back up. The problem with the current approach is the timing of when those images are built. Currently, those images are built after the migration scripts have run. We will be shifting to a new model where all images are built before running migrations so that we have a guarantee we can access our secrets. The longest deployment step during a rollout is building our images. By building our images before running migrations, we get the added benefit of dramatically reducing the time between when the migration is run and our updated code is released. ## Wrapping Up We want to apologize to all of our customers for this unacceptable outage. We take your trust in our uptime very seriously. We have learned from this incident and will showcase it by continuing to do everything in our power to prevent future outages.

resolved2020-07-10T04:07:41.757Z

This incident has been resolved.

monitoring2020-07-10T04:05:24.114Z

A fix has been implemented and we are monitoring the results.

investigating2020-07-10T03:38:44.000Z

We are currently investigating this issue.

Report: "DNS resolution failure"

Last update 2020-07-27T22:06:15.845Z

postmortem2020-07-17T22:17:59.726Z

Today at 2:15pm Pacific we experienced our second outage in the last two weeks. Sadly the timing is not great but it did give us the opportunity to re-evaluate our failure points as we continue to harden our infrastructure. This outage affected all of our endpoints, from our production and failover infrastructure to the documentation hub and status page but did not result in any data loss. ## What Happened? Doppler uses [Cloudflare](https://cloudflare.com) as our DNS provider which provides a suite of powerful features including DDOS protection, a CDN for assets, firewall rules, edge workers, and plenty of others. They are one of the most popular and trusted DNS providers, which supports nearly 20% of all internet traffic. Today they went down, which brought down a portion of the global internet with them. Cloudflare recommends using their DNS proxy so you can benefit from their suite of features. As we were reminded today, using their proxy changes the landscape of the default protections DNS provides, which results in a nonobvious cost. DNS by its very nature is decentralized, which creates a layer of resilience against being a single point of failure. But this assumption of protection breaks down when you use a proxy at the DNS layer, as now you have a new single point of failure. Today we all paid that nonobvious cost. ## Moving Forward **Hardening Our DNS Reliability** Internally we are tracking the best path forward towards hardening our DNS’s reliability. This can come in a couple of different forms, such as disabling proxy mode for our DNS records. This would remove our DNS layer as a single point of failure but comes at the cost of losing DDOS protection, our records not being masked, and also some other behind the scenes magic. Another possible option would be to add an additional DNS provider \(that supports DDOS protection\) to our stack. Then in the case one goes down, our traffic will automatically failover the other. This would add a fair amount of complexity to our stack. Sadly all solutions thought of so far have tradeoffs and could have nonobvious consequences. We deeply care about finding the right answer, not the fastest to implement. As we continue to explore and implement solutions, we expect to write about our findings and decisions on our [engineering blog](https://doppler.com/blog). **Customer Observability** Being transparent is core to the DNA of the company and we strive to provide our customer's observability during outages in real-time. We do this through our [@DopplerHelp](https://twitter.com/DopplerHelp) Twitter and [status page](https://status.doppler.com). Being that our status page’s DNS is hosted by Cloudflare, it was also affected by the outage. To prevent this in the future, we are moving our status page’s DNS to another provider and will use a new dedicated domain. This domain is still being configured and will be announced soon. **Doppler CLI** The Doppler CLI has a nifty command called `doppler run` which downloads your secrets from our API and then injects them into your application. After each successful run, we automatically create and store an encrypted snapshot of the secrets for you. On the off chance the CLI is unable to connect to our API, we smartly fallback to this encrypted snapshot after 5 retries. During the outage, our `doppler run` users were unaffected as they had an existing snapshot to fallback to. One area we found that could use a little love is in showcasing of a retry event. If request hangs it can create a visible delay to the user. In the next release, the Doppler CLI will print a message stating a retry event is happening so you always stay informed. ## Wrapping Up Providing a seamless experience that provides near perfect uptime is an incredibly difficult task that requires deep thought about every layer in the stack. Today we are reminded that our DNS is a single point of failure and that even with the most trusted of services, like Cloudflare, can bring us down if we don’t have multiple layers of redundancies. As we continue to harden our infrastructure, we plan to share our learnings with you through our [engineering blog](https://doppler.com/blog).

resolved2020-07-17T22:05:35.006Z

DNS resolution has resumed normal operation. A proper postmortem will follow.

monitoring2020-07-17T21:41:33.553Z

We are continuing to monitor for any further issues.

monitoring2020-07-17T21:37:09.899Z

DNS resolution appears to be operating normally again. We are still monitoring.

investigating2020-07-17T21:29:05.782Z

We are continuing to investigate this issue.

investigating2020-07-17T21:20:19.496Z

DNS resolution is currently failing for doppler.com and all subdomains. We believe this is a Cloudflare outage but are investigating.