How to Write a Postmortem When a Third-Party Service Causes an Outage

Third-party outages are easy to dismiss in a postmortem. The vendor had a problem. You waited. They fixed it. You recovered. What's there to write about?

More than you'd think. The fact that a vendor caused the incident doesn't mean you had no control over the outcome — over how quickly you detected it, how you communicated, how your system degraded, or how fast you recovered. A good postmortem on a third-party incident asks: what could we have done better given that we don't control the vendor?

Why Third-Party Postmortems Are Worth Writing

The argument against writing them: "The cause was external. Nothing to fix in our code."

The argument for: almost every third-party outage exposes one of these gaps in your system:

Detection gap — how long between the incident starting and your team knowing?
Attribution gap — how long before you knew the cause was the vendor and not your code?
Degradation gap — did your system fail hard or fail gracefully?
Communication gap — when did your users find out, and what did you tell them?
Recovery gap — how long after the vendor resolved did your system return to normal?

Each of these is in your control, even when the root cause isn't.

The Anatomy of a Third-Party Postmortem

1. Incident Timeline

Reconstruct what happened in chronological order, combining your logs with the vendor's incident timeline:

Timeline (all times UTC):

14:03 — Vendor infrastructure event begins (confirmed from vendor postmortem)
14:08 — First error responses observed in production logs (Stripe payment intent API)
14:17 — Error rate on /api/checkout exceeds 5% — first alert fires
14:19 — On-call acknowledges alert, begins investigation
14:24 — Team rules out recent deployments as cause (last deploy: 3 days ago)
14:31 — Vendor posts incident on status page: "Degraded performance on Payment Intents"
14:33 — Root cause attributed to Stripe (16 minutes after alert, 30 minutes after incident start)
14:40 — Internal Slack status update sent; customer-facing status banner added to app
14:47 — Decision made not to fail over to alternate payment processor (low traffic period)
15:12 — Vendor posts "Investigating" → "Monitoring" update
15:28 — Vendor posts resolution
15:31 — Error rate returns to baseline
15:35 — Customer-facing status banner removed; team stand-down

Duration: 83 minutes from first error (14:08) to error rate back at baseline (15:31)
Detection lag: 14 minutes (14:03 → 14:17)
Attribution lag: 30 minutes (14:03 → 14:33)
User communication lag: 37 minutes (14:03 → 14:40)
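
The lag figures can be derived from the timestamps rather than worked out by hand, which avoids arithmetic slips. A minimal Python sketch, assuming the times are copied from the example timeline above; the helper names `utc` and `minutes_between` and the placeholder date are illustrative only:

```python
from datetime import datetime, timezone


def utc(hhmm: str, day: str = "2000-01-01") -> datetime:
    """Parse an HH:MM timeline entry as a UTC datetime (the date is a placeholder)."""
    parsed = datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps taken from the example timeline.
events = {
    "vendor_start": utc("14:03"),
    "first_alert": utc("14:17"),
    "attributed": utc("14:33"),
    "users_informed": utc("14:40"),
}

# Every lag is measured from the start of the vendor incident.
lags = {
    "detection": minutes_between(events["vendor_start"], events["first_alert"]),
    "attribution": minutes_between(events["vendor_start"], events["attributed"]),
    "user_communication": minutes_between(events["vendor_start"], events["users_informed"]),
}
print(lags)
```

Measuring every lag from the same starting point keeps the numbers comparable from one incident to the next.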

Build this timeline from:

Your application error logs (with UTC timestamps — always log in UTC)
Your monitoring alert history
The vendor's incident timeline (usually posted in their postmortem)
Your internal communication history (Slack, incident channels)
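
Logging in UTC is a small configuration change in most stacks. As one possible sketch, Python's standard `logging` module renders timestamps in local time by default, and pointing the formatter's `converter` at `time.gmtime` switches it to UTC. The logger name `checkout` and the function name `build_utc_logger` are assumptions for illustration:

```python
import logging
import time


def build_utc_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger whose timestamps are rendered in UTC with a trailing Z."""
    formatter = logging.Formatter(
        fmt="%(asctime)sZ %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    # By default asctime uses local time; gmtime makes it UTC.
    formatter.converter = time.gmtime

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


log = build_utc_logger()
log.info("payment intent creation failed")
```

With every service logging this way, vendor timestamps and your own can be lined up without time zone conversion.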

2. Impact Assessment

Be specific about what users experienced:

Metric	Value
Duration	83 minutes (14:08–15:31 UTC)
Users affected	~340 users attempted checkout during the window
Transactions failed	127 payment intent failures (37% of checkout attempts)
Revenue impact	~$3,200 in delayed transactions (all subsequently recovered)
Support tickets	14 tickets opened during incident
SLA breach	No — 99.9% monthly SLA still met

Report the most precise numbers you have. "Hundreds of users affected" is less useful than "340 users." The precision matters for prioritizing future work.
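
Precise numbers are easier to produce when they come from a query over your records instead of an estimate. A rough sketch, assuming each checkout attempt in the incident window can be exported as a record with a user ID and a success flag; the `CheckoutAttempt` shape and `summarize_impact` name are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckoutAttempt:
    user_id: str
    succeeded: bool


def summarize_impact(attempts):
    """Count attempts, failures, and distinct users for the incident window."""
    total = len(attempts)
    failed = sum(1 for a in attempts if not a.succeeded)
    users = len({a.user_id for a in attempts})
    rate = round(100 * failed / total, 1) if total else 0.0
    return {
        "attempts": total,
        "failed": failed,
        "users": users,
        "failure_rate_pct": rate,
    }


# Tiny made-up sample, only to show the shape of the output.
sample = [
    CheckoutAttempt("u1", True),
    CheckoutAttempt("u1", False),
    CheckoutAttempt("u2", True),
    CheckoutAttempt("u3", True),
]
print(summarize_impact(sample))
```

Keeping attempts and distinct users as separate counts matters because one user can retry several times.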

3. Root Cause Analysis

For third-party incidents, there are two layers of cause to document: the vendor's root cause, and the internal factors that shaped how it hit your system.

External root cause (vendor-controlled):

Stripe experienced infrastructure issues in their us-east-1 region affecting their Payment Intents API. Their postmortem identified a database failover that caused elevated error rates and latency from 14:03–15:28 UTC.

Internal contributing factors (in your control):

No synthetic monitoring on the Stripe checkout path. Detection relied on real user traffic hitting errors, adding 9 minutes of detection lag.

No fallback behavior for payment failures. Failed payment intents returned a generic 500 error to users with no explanation or retry option.

Status page update was manual and delayed. The customer-facing banner was added 37 minutes after incident start, by which point users had been seeing unexplained errors for 32 minutes (since 14:08).

This framing is honest and actionable. The vendor caused the incident; your system could have handled it better.
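
The second contributing factor, a generic 500 for every payment failure, can be addressed by handling vendor errors on their own code path. A minimal sketch, assuming the vendor call is wrapped so that upstream failures raise a dedicated exception; `PaymentProviderError`, `handle_checkout`, and the response fields are hypothetical names, not part of any SDK:

```python
class PaymentProviderError(Exception):
    """Hypothetical exception raised by our wrapper when the vendor API fails."""


def handle_checkout(create_payment_intent):
    """Return (http_status, body) for a checkout request.

    `create_payment_intent` is a hypothetical callable wrapping the vendor call.
    """
    try:
        intent = create_payment_intent()
    except PaymentProviderError:
        # Upstream failure: say so, and tell the client a retry is reasonable.
        return 503, {
            "error": "payment_provider_unavailable",
            "message": "Our payment provider is having problems. "
                       "Please try again in a few minutes.",
            "retryable": True,
        }
    except Exception:
        # Anything else is our own bug and stays a generic server error.
        return 500, {
            "error": "internal_error",
            "message": "Something went wrong.",
            "retryable": False,
        }
    return 200, {"payment_intent": intent}
```

Separating the two paths also gives monitoring a clean signal: a spike in 503s from this handler points at the vendor, not at your code.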

4. What Went Well

Include this section. It's not filler — it tells your team what processes to preserve:

On-call responded within 2 minutes of the first alert
No one wasted time reverting a recent deployment (last deploy was 3 days old)
Team correctly chose to wait for vendor resolution rather than attempt a workaround that would have delayed work on the support ticket backlog

Postmortems that only list failures create a culture where engineers are afraid to be named in them. That defeats the purpose.

5. Action Items

This is the section that prevents the same outcome next time. Each action item needs:

A specific, testable outcome
An owner
A deadline
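
If action items live in a tracker or a script, the three requirements can be enforced instead of merely requested. A small sketch under that assumption; the `ActionItem` type, the `overdue` helper, and the dates used are illustrative only:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ActionItem:
    action: str
    owner: str
    deadline: date
    outcome: str  # the specific, testable result

    def __post_init__(self):
        # Reject items that are missing any of the required text fields.
        for field_name in ("action", "owner", "outcome"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"{field_name} must not be empty")


def overdue(items, today):
    """Return the items whose deadline has already passed."""
    return [item for item in items if item.deadline < today]


item = ActionItem(
    action="Add synthetic probe to checkout path",
    owner="@infra-team",
    deadline=date(2000, 1, 15),  # placeholder date
    outcome="Detection lag < 2 min",
)
print(overdue([item], today=date(2000, 2, 1)))
```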

Example action items from a Stripe incident:

Action	Owner	Deadline	Outcome
Add synthetic probe to Stripe checkout API, alerting at first failure	@infra-team	2 weeks	Detection lag < 2 min
Implement retry with exponential backoff on payment intent creation	@backend-team	1 week	User retries automatically, not forced to reload
Add customer-facing status banner that auto-triggers when checkout error rate > 10%	@frontend-team	3 weeks	Users informed within 5 min of incident start
Add Stripe to service monitoring watchlist for proactive status alerts	@infra-team	This week	Attribution lag reduced by having vendor status in alert workflow
Document failover procedure: when to use secondary payment processor	@backend-team	1 month	Runbook available if incident duration > 30 min

Notice: the fourth action item ("add Stripe to service monitoring watchlist") doesn't require building anything. It's a configuration change that delivers value immediately.
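
The retry action item in the table is a well-known pattern. The sketch below shows one common variant, exponential backoff with full jitter, and is not a prescribed implementation. `TransientPaymentError` and `retry_with_backoff` are hypothetical names, and the delay values are placeholders. Retrying a call that creates a payment is only safe if the call is idempotent; Stripe's API supports idempotency keys for this purpose.

```python
import random
import time


class TransientPaymentError(Exception):
    """Hypothetical marker for vendor errors that are safe to retry."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call `operation`, retrying transient failures with capped, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientPaymentError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay ceiling doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the ceiling so that
            # many clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so tests can pass a stub instead of actually waiting.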

The 5-Why Framework for Third-Party Incidents

Applying 5-why to vendor outages surfaces internal gaps that pure attribution ("Stripe was down") misses:

Why did 127 checkout attempts fail? → Stripe's Payment Intents API returned errors.

Why did users see a generic error instead of a clear message? → Our checkout flow didn't handle payment provider errors separately from other failures.

Why didn't we know to communicate with users until 37 minutes in? → We didn't have an automated trigger for customer communication during payment degradation.

Why did we rely on manual processes for communication? → We've never established an SLA for user communication during incidents.

Why have we never established a communication SLA? → We've had few significant incidents. This one exposed the gap.

Each layer reveals something actionable.
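
The third "why" points at a missing automated trigger. One way to build one is a sliding-window error rate check, sketched below. The class name, the 5-minute window, and the minimum request count are assumptions; the 10% threshold mirrors the example action item:

```python
from collections import deque


class ErrorRateTrigger:
    """Decide whether to show a status banner from recent checkout results."""

    def __init__(self, threshold=0.10, window_seconds=300, min_requests=20):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.min_requests = min_requests  # avoid firing on a handful of requests
        self._events = deque()  # (timestamp_seconds, is_error)

    def record(self, timestamp, is_error):
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop results that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def should_show_banner(self, now):
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold
```

The minimum request count is a design choice for low traffic periods, where two failures out of three requests should not flip a customer-facing banner.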

What Not to Write in a Third-Party Postmortem

Blame: "Stripe let us down." This accomplishes nothing. Write about what you'll do to be more resilient to future vendor failures — regardless of which vendor it is.

Vendor-specific promises: Don't write "we will switch payment processors." Unless you've already decided to, that's not an action item — it's a threat. Write "we will evaluate secondary payment processor options and define when we'd activate failover."

Vague commitments: "We will improve monitoring." Write "We will add a synthetic health check on the Stripe checkout API path that fires within 30 seconds of first failure."

Post-hoc timeline fabrication: Don't reconstruct timelines from memory. Use actual log timestamps. If you don't have them, add log coverage as an action item — but don't fill gaps with estimates and present them as fact.
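
The specific commitment in the "Vague commitments" example, a check that fires within 30 seconds of first failure, implies a probe that runs at least every 30 seconds and alerts on the first failed run. A minimal sketch of that loop, where `check` and `alert` are hypothetical callables you would supply (for example, a test-mode checkout request and a pager integration):

```python
import time


def run_probe(check, alert, interval_seconds=30, max_runs=None, sleep=time.sleep):
    """Run `check` on a fixed interval and call `alert` when it starts failing.

    `check` returns True when the path is healthy; an exception counts as a failure.
    `max_runs` bounds the loop for testing; None means run forever.
    """
    healthy = True
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            ok = bool(check())
        except Exception:
            ok = False
        # Alert only on the healthy -> failing transition, not on every failed run.
        if healthy and not ok:
            alert("synthetic checkout probe failed")
        healthy = ok
        runs += 1
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
```

Alerting on the transition keeps the pager quiet during a long outage while still firing on the first failure.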

Turning Postmortems Into Resilience

A postmortem is a design document in disguise. Every gap you identify is a missing piece of your reliability architecture:

Detection gap → synthetic monitoring, alerting thresholds
Attribution gap → vendor status monitoring, correlation tooling
Degradation gap → fallback behavior, circuit breakers, graceful failure states
Communication gap → automated status triggers, user-facing messaging templates
Recovery gap → health check automation, automated stand-down procedures

The goal isn't to prevent third-party outages — you can't. The goal is to build a system where a third-party outage has a predictable, manageable impact.
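
Of the mechanisms listed for the degradation gap, the circuit breaker is the easiest to misread, so here is a simplified sketch of the pattern. It is not a production implementation: the class names, the failure threshold, and the reset timeout are all assumptions, and real services often use an existing library instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls to the vendor are being skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a struggling vendor.
                raise CircuitOpenError("vendor calls are paused")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers can catch `CircuitOpenError` and show the graceful failure state immediately, which is what makes the impact of a vendor outage predictable.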

Where Statusfield Fits

The attribution gap — the time between incident start and your team knowing the vendor is the cause — is where most debugging time is wasted. During a Stripe incident, engineers check deploys, database connections, and application code before concluding the problem is upstream.

Statusfield monitors official vendor status pages continuously. When Stripe posts a degradation incident, Statusfield delivers the alert before your on-call engineer has finished pulling recent deploys. The attribution gap shrinks from 20–30 minutes to under 2 minutes.

For postmortem action items, this is a day-one change: add the relevant vendor to Statusfield and eliminate manual status page monitoring from your incident workflow.

Statusfield's free plan monitors up to 3 services. The Pro plan ($29/month) covers up to 20 services with unlimited email and Slack notifications.

FAQ

Should we write a postmortem for every third-party incident, or only significant ones?

Set a threshold: write a postmortem for any incident that affected more than X% of users, lasted more than Y minutes, or generated more than Z support tickets. Below that threshold, a brief incident log entry is sufficient. The goal is to reserve postmortem effort for incidents where the action items will change something.

Who should own a third-party postmortem?

The on-call engineer who responded to the incident typically drives it, but the postmortem itself should be a collaborative document. The engineer provides the timeline and technical details; the product and customer-facing teams contribute the user impact and communication review. Don't assign ownership to the vendor.

How do we handle a vendor who doesn't publish a postmortem?

Write yours anyway, using your own data. Note in the postmortem that the vendor's root cause analysis wasn't available. This is a signal worth tracking: vendors who routinely don't publish postmortems give you less information to calibrate your resilience planning.

What's the right tone for a postmortem?

Blameless and factual. Write "the alert fired at 14:17" not "we didn't notice until 14:17." Write "the checkout error was not handled separately from other failures" not "the developer who wrote the checkout code didn't think about error states." The goal is to fix systems, not assign fault.

How long should a postmortem take to write?

Write a draft within 48 hours while the incident is fresh. Set a deadline for the final version, typically 5 business days. A postmortem that takes 3 weeks to finalize loses most of its value. Prioritize completeness of the timeline and clarity of action items over prose quality.