How to Build an Incident Runbook for Third-Party Service Failures

When Stripe or AWS goes down at 2 AM, your on-call engineer shouldn't be Googling what to do. A well-written third-party outage runbook turns a scramble into a 5-minute response. Here's how to build one.


At 2:47 AM, your on-call engineer gets paged. Users can't check out. The engineer opens their laptop, confirms it's not your own infrastructure, and then starts doing exactly the wrong thing: opening a browser, searching "Stripe outage June," scanning Twitter, pulling up status.stripe.com, and eventually asking in Slack whether anyone else knows what to do.

Twelve minutes of chaos that should have been thirty seconds of execution.

The difference between a scramble and a clean response is a runbook — a pre-written document that answers every question your on-call engineer needs answered before they even know the question.

What a Third-Party Outage Runbook Must Answer

A runbook for a vendor dependency failure has a specific job: eliminate decision-making during an incident. Every second spent thinking "what should I do next?" is a second users spend hitting broken experiences.

Your runbook must answer five questions:

1. How do we confirm it's really them and not us? The first five minutes of any incident are wasted if your team can't quickly rule themselves out. The runbook should include the vendor's status page URL, a direct link to your monitoring dashboard for that service, and a threshold — for example, "if error rate on /api/checkout exceeds 5% for two consecutive minutes and Stripe's status page shows an incident, it's them."

2. What specifically breaks in our app? A Stripe outage doesn't simply mean "payments don't work." It might mean payment intents fail while existing subscriptions are unaffected. An AWS us-east-1 incident might only affect your file upload service, not your core API. Your runbook should state, for each vendor component, exactly which user flows break when that component fails.

3. What's our fallback? This is the highest-leverage part of the runbook. Does your app have a graceful degradation path? Can you show a maintenance page for just the checkout flow while the rest of the app works? Can you queue notifications and retry later? Can you route traffic to a backup region? Write it down before you need it.

4. Who do we notify, and in what order? The engineering channel needs to know. Does customer support? Do your largest accounts get a proactive email? Does the CEO need to be looped in at 3 AM or only at 7 AM? Make the decision tree explicit. When the answer is already written, there's no debate.

5. What do we tell users? Have a template ready. Blank-page paralysis during an incident is real. A pre-written template that just needs the service name and ETA filled in is the difference between a confident status update and silence.
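
One way to keep any of those five answers from going missing is to store each runbook as structured data and check it for gaps. The sketch below is only an illustration: the `Runbook` class, its field names, and the `missing_sections` helper are hypothetical and not part of any tool mentioned in this article.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """One field per question the runbook must answer (names are illustrative)."""
    service: str
    confirmation: list[str] = field(default_factory=list)   # 1. is it them or us?
    impact: dict[str, str] = field(default_factory=dict)    # 2. what breaks in our app?
    fallback: list[str] = field(default_factory=list)       # 3. degradation steps
    notifications: list[str] = field(default_factory=list)  # 4. who to notify, in order
    user_message: str = ""                                  # 5. customer-facing template


def missing_sections(runbook: Runbook) -> list[str]:
    """Return the names of the sections that are still empty."""
    return [
        f.name
        for f in fields(runbook)
        if f.name != "service" and not getattr(runbook, f.name)
    ]


draft = Runbook(
    service="Stripe",
    confirmation=["Check status.stripe.com", "Check /api/checkout error rate"],
    impact={"Payment Intents API": "Checkout fails"},
)
print(missing_sections(draft))  # ['fallback', 'notifications', 'user_message']
```

A check like this could run in CI so that a half-written runbook is caught at review time rather than at 2 AM.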

A Complete Example: Stripe Outage Runbook

Here is a full runbook for a Stripe payment processing failure. Use it as a template and adapt it for your own services.


Service: Stripe
Owner: Backend team (first responder: on-call engineer)
Last reviewed: [date]

Detection
Alert fires when: Statusfield detects a Stripe incident, OR error rate on /api/checkout exceeds 5% for 2+ minutes.
Confirm at: status.stripe.com
Internal dashboard: [link to your error rate chart]
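
The detection rule above can be sketched as a small function. This is a minimal illustration that assumes you already collect one error-rate sample per minute (as a fraction, so 5% is 0.05) and have a boolean saying whether the vendor has an open incident. The function names are hypothetical.

```python
def sustained_breach(error_rates: list[float], threshold: float = 0.05,
                     minutes: int = 2) -> bool:
    """True if each of the last `minutes` per-minute samples exceeds `threshold`."""
    if len(error_rates) < minutes:
        return False  # not enough data to call the breach sustained
    return all(rate > threshold for rate in error_rates[-minutes:])


def should_alert(error_rates: list[float], vendor_incident_open: bool) -> bool:
    """Alert on a vendor-reported incident OR a sustained checkout error rate."""
    return vendor_incident_open or sustained_breach(error_rates)


print(should_alert([0.01, 0.07, 0.09], vendor_incident_open=False))  # True
print(should_alert([0.01, 0.09, 0.02], vendor_incident_open=False))  # False
print(should_alert([0.00], vendor_incident_open=True))               # True
```

Requiring consecutive samples keeps a single noisy minute from paging anyone.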

Impact assessment

| Stripe component | What breaks for users |
| --- | --- |
| Payment Intents API | Checkout fails; users cannot purchase |
| Subscriptions API | Renewal attempts fail; no immediate user impact |
| Dashboard | No user impact; internal only |
| Radar (fraud detection) | Payments may bypass fraud scoring; monitor for unusual volume |

Immediate actions (first 5 minutes)

  1. Confirm the outage on the Stripe status page and Statusfield dashboard
  2. Post in #incidents: "Stripe experiencing [outage type]. Payments affected. Investigating."
  3. Enable the DISABLE_CHECKOUT feature flag to show a maintenance message at /checkout
  4. Set your status page to "Investigating — payment processing disruption"
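
Step 3 assumes the flag already exists in your code. As a rough sketch of what that check might look like, the example below reads DISABLE_CHECKOUT from an environment variable. That is an assumption made for illustration, since many teams use a feature-flag service instead, and `handle_checkout` is a hypothetical stand-in for your real request handler.

```python
import os
from collections.abc import Mapping

MAINTENANCE_MESSAGE = (
    "Checkout is temporarily unavailable. Your account and data are unaffected."
)


def checkout_disabled(env: Mapping[str, str] = os.environ) -> bool:
    """Treat 1/true/on (any case) as 'flag enabled'; anything else as off."""
    return env.get("DISABLE_CHECKOUT", "").strip().lower() in {"1", "true", "on"}


def handle_checkout(env: Mapping[str, str] = os.environ) -> tuple[int, str]:
    """Return (HTTP status, body) for a request to /checkout."""
    if checkout_disabled(env):
        # Degrade only this flow; the rest of the app keeps serving normally.
        return 503, MAINTENANCE_MESSAGE
    return 200, "checkout form"


print(handle_checkout({"DISABLE_CHECKOUT": "true"})[0])  # 503
print(handle_checkout({})[0])                            # 200
```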

If the outage extends past 30 minutes

  1. Notify the customer success team — they should have responses ready for inbound support tickets
  2. Post a customer-facing update with an ETA if one is available from Stripe's status page
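
The time-based escalation can also be written down as data, so a bot or checklist tool can tell the engineer which steps are due. This is a hedged sketch: the `ESCALATION_STEPS` list and `steps_due` function are hypothetical, and the timings simply mirror the runbook above.

```python
from datetime import timedelta

# (time since the incident started, action) pairs taken from the runbook text.
ESCALATION_STEPS = [
    (timedelta(minutes=0), "Post in #incidents and enable DISABLE_CHECKOUT"),
    (timedelta(minutes=30), "Notify the customer success team"),
    (timedelta(minutes=30), "Post a customer-facing update with an ETA if available"),
]


def steps_due(elapsed: timedelta) -> list[str]:
    """Return every action whose trigger time has been reached."""
    return [action for after, action in ESCALATION_STEPS if elapsed >= after]


print(len(steps_due(timedelta(minutes=10))))  # 1
print(len(steps_due(timedelta(minutes=45))))  # 3
```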

Customer communication template

We're currently experiencing a disruption with our payment processor. Your existing account and data are unaffected. New purchases and plan upgrades are temporarily unavailable. We're monitoring the situation and will update this message as soon as service is restored. We apologize for the inconvenience.

Recovery checklist

  • Stripe status page shows all components operational
  • Internal error rate on /api/checkout back below 0.5% for 5 consecutive minutes
  • Turn off the DISABLE_CHECKOUT feature flag to restore checkout
  • Update status page to resolved
  • Check for any failed payment attempts that should be retried
  • Post incident summary in #incidents within 24 hours
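
The first two recovery conditions are mechanical enough to check in code before anyone touches the feature flag. The sketch below assumes per-minute error-rate samples expressed as fractions (0.5% is 0.005) and a component-to-status mapping in which a healthy component is reported as the string "operational". Both the data shapes and the `recovered` function are assumptions for illustration.

```python
def recovered(component_status: dict[str, str], error_rates: list[float],
              threshold: float = 0.005, minutes: int = 5) -> bool:
    """True when every component is operational AND the error rate has
    stayed below `threshold` for the last `minutes` consecutive samples."""
    all_operational = bool(component_status) and all(
        status == "operational" for status in component_status.values()
    )
    recent = error_rates[-minutes:]
    stable = len(recent) == minutes and all(rate < threshold for rate in recent)
    return all_operational and stable


status = {"Payment Intents API": "operational", "Subscriptions API": "operational"}
print(recovered(status, [0.08, 0.001, 0.002, 0.001, 0.0, 0.001]))  # True
print(recovered(status, [0.001, 0.002, 0.02, 0.001, 0.001]))       # False
```

The second call returns False because one of the last five samples is above the threshold, so the five-minute clock has to start again.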

Scaling to All P0 Dependencies

Once you've written one runbook, the process becomes repeatable. Triage your full dependency list by blast radius:

| Priority | Definition | Services (examples) | Runbook owner |
| --- | --- | --- | --- |
| P0 | App broken or revenue blocked | Auth provider, payment processor, primary DB | Backend lead |
| P1 | Core features degraded | Email service, CDN, CI/CD platform | Platform team |
| P2 | Secondary features affected | Analytics, CRM sync, notification service | Feature team |
| P3 | Non-critical integrations | Marketing tools, A/B testing | No runbook needed |

P0 and P1 services need runbooks before the next incident. P2 services warrant at minimum an impact assessment and a team assignment. P3 services need only a note in your dependency inventory.
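
To turn that triage into a work list, you can tag each dependency with its priority and filter for the ones that need a full runbook. This is a minimal sketch. The `dependencies` inventory is made-up sample data and the names are hypothetical.

```python
from enum import Enum


class Priority(Enum):
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3


# What each tier requires, following the paragraph above.
REQUIREMENT = {
    Priority.P0: "Full runbook before the next incident",
    Priority.P1: "Full runbook before the next incident",
    Priority.P2: "Impact assessment and team assignment",
    Priority.P3: "Note in the dependency inventory",
}

# Sample inventory; replace with your own dependency list.
dependencies = {
    "payment processor": Priority.P0,
    "email service": Priority.P1,
    "analytics": Priority.P2,
    "A/B testing": Priority.P3,
}


def runbook_backlog(deps: dict[str, Priority]) -> list[str]:
    """Dependencies that need a full runbook, in alphabetical order."""
    return sorted(
        name for name, prio in deps.items() if prio in (Priority.P0, Priority.P1)
    )


print(runbook_backlog(dependencies))  # ['email service', 'payment processor']
```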

Where to Store Runbooks and How Often to Update Them

The worst place for runbooks is a wiki that nobody opens during incidents. Your runbooks need to be findable in under thirty seconds under stress. The best pattern: link directly from your alerting system. When the Stripe alert fires in PagerDuty or your Slack channel, the runbook URL is in the alert body. One click, immediate context.
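
As a rough sketch of that pattern, the function below builds an alert body that always carries the runbook link. The `RUNBOOK_URLS` mapping and its example.com URL are placeholders. The payload is a JSON object with a single `text` field, which is the shape Slack incoming webhooks accept; other alerting tools will need a different payload.

```python
import json

# Hypothetical mapping from service name to runbook location.
RUNBOOK_URLS = {
    "stripe": "https://runbooks.example.com/stripe",
}


def build_alert(service: str, summary: str) -> str:
    """Return a JSON alert body with the runbook link in the message text."""
    url = RUNBOOK_URLS.get(service.lower())
    runbook_line = (
        f"Runbook: {url}" if url
        else "Runbook: none on file, write one after this incident"
    )
    return json.dumps({"text": f"{service} incident: {summary}\n{runbook_line}"})


print(build_alert("Stripe", "elevated errors on Payment Intents API"))
```

Making the missing-runbook case visible in the alert itself turns every unplanned outage into a reminder to close the gap.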

Review runbooks quarterly and after every real incident that triggered them. An incident that exposes a gap in your runbook — a fallback that didn't work, a notification step that was missing — is the best possible time to fix it.
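
The review rule is simple enough to automate as a reminder. The sketch below treats "quarterly" as 90 days, which is an approximation chosen for illustration, and the `review_due` function is hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # approximation of "quarterly"


def review_due(last_reviewed: date, today: date,
               triggered_since_review: bool = False) -> bool:
    """Due if an incident used the runbook since the last review,
    or if the review interval has elapsed."""
    return triggered_since_review or (today - last_reviewed) >= REVIEW_INTERVAL


print(review_due(date(2024, 1, 1), date(2024, 3, 1)))         # False
print(review_due(date(2024, 1, 1), date(2024, 4, 15)))        # True
print(review_due(date(2024, 1, 1), date(2024, 1, 10), True))  # True
```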

Statusfield monitors official vendor status pages and delivers the signal the moment it matters — and the alert can carry a direct link to your runbook. When your on-call engineer sees a Stripe incident notification at 2:47 AM, the runbook is one tap away. That's the gap this whole system closes.

FAQ

What should every third-party runbook include? At minimum: confirmation steps (how to verify it's really them), impact mapping (which user flows break), immediate actions in order, customer communication template, and a recovery checklist. Anything that requires a decision during an incident should already have the answer written in the runbook.

How often should runbooks be updated? Review them quarterly and update them immediately after any incident that exposed a gap. If a step in the runbook was wrong or missing during a real incident, fix it the same day while the details are fresh.

Where should runbooks live? Link directly from your alerting system. When a PagerDuty or Slack alert fires, the runbook URL should be in the alert body. A runbook that requires navigating to a wiki under stress is a runbook that won't get used.

Who is responsible for writing runbooks? The team that owns the integration. If your backend team owns the Stripe integration, they own the Stripe runbook. Assign ownership explicitly: "the on-call engineer" is a rotating role, not an owner. The owner is a team.

How does Statusfield connect to runbooks? Statusfield detects vendor incidents the moment they're posted and fires an alert to your channel of choice. That alert can include a direct link to your runbook for that service, turning a 2 AM notification into immediate context and action — no searching required.

Should runbooks cover partial outages, not just full outages? Yes. Partial outages (one Stripe component degraded, one AWS region affected) are more common than full outages and harder to diagnose. Your runbook should map each component to its user impact so engineers know immediately what's affected even if the vendor's status page shows only a partial incident.
