Third-party outages are easy to dismiss in a postmortem. The vendor had a problem. You waited. They fixed it. You recovered. What's there to write about?
More than you'd think. The fact that a vendor caused the incident doesn't mean you had no control over the outcome — over how quickly you detected it, how you communicated, how your system degraded, or how fast you recovered. A good postmortem on a third-party incident asks: what could we have done better given that we don't control the vendor?
Why Third-Party Postmortems Are Worth Writing
The argument against writing them: "The cause was external. Nothing to fix in our code."
The argument for: almost every third-party outage exposes one of these gaps in your system:
- Detection gap — how long between the incident starting and your team knowing?
- Attribution gap — how long before you knew the cause was the vendor and not your code?
- Degradation gap — did your system fail hard or fail gracefully?
- Communication gap — when did your users find out, and what did you tell them?
- Recovery gap — how long after the vendor resolved did your system return to normal?
Each of these is in your control, even when the root cause isn't.
The Anatomy of a Third-Party Postmortem
1. Incident Timeline
Reconstruct what happened in chronological order, combining your logs with the vendor's incident timeline:
Timeline (all times UTC):
14:03 — Vendor infrastructure event begins (confirmed from vendor postmortem)
14:08 — First error responses observed in production logs (Stripe payment intent API)
14:17 — Error rate on /api/checkout exceeds 5% — first alert fires
14:19 — On-call acknowledges alert, begins investigation
14:24 — Team rules out recent deployments as cause (last deploy: 3 days ago)
14:31 — Vendor posts incident on status page: "Degraded performance on Payment Intents"
14:33 — Root cause attributed to Stripe (16 minutes after alert, 30 minutes after incident start)
14:40 — Internal Slack status update sent; customer-facing status banner added to app
14:47 — Decision made not to fail over to alternate payment processor (low traffic period)
15:12 — Vendor posts "Investigating" → "Monitoring" update
15:28 — Vendor posts resolution
15:31 — Error rate returns to baseline
15:35 — Customer-facing status banner removed; team stand-down
Duration: 83 minutes from first error to full recovery (14:08 → 15:31)
Detection lag: 14 minutes (14:03 → 14:17)
Attribution lag: 30 minutes (14:03 → 14:33)
User communication lag: 37 minutes (14:03 → 14:40)
Build this timeline from:
- Your application error logs (with UTC timestamps — always log in UTC)
- Your monitoring alert history
- The vendor's incident timeline (usually posted in their postmortem)
- Your internal communication history (Slack, incident channels)
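Once every event has a UTC timestamp, the lag figures can be derived mechanically rather than by hand. A minimal Python sketch, using the times from the example timeline; the event names and the placeholder date are assumptions made for illustration:

```python
from datetime import datetime, timezone

# Placeholder date: the example timeline gives only times of day.
INCIDENT_DATE = (2024, 1, 1)

def utc(hhmm: str) -> datetime:
    """Parse 'HH:MM' as a timezone-aware UTC datetime on INCIDENT_DATE."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(*INCIDENT_DATE, hour, minute, tzinfo=timezone.utc)

def lag_minutes(start: str, end: str) -> int:
    """Whole minutes elapsed between two same-day 'HH:MM' timestamps."""
    return int((utc(end) - utc(start)).total_seconds() // 60)

# Timestamps copied from the example timeline; the keys are illustrative names.
events = {
    "vendor_event_start": "14:03",
    "first_alert": "14:17",
    "attributed_to_vendor": "14:33",
    "users_informed": "14:40",
}

# Every lag is measured from the start of the vendor event.
lags = {
    name: lag_minutes(events["vendor_event_start"], events[name])
    for name in ("first_alert", "attributed_to_vendor", "users_informed")
}
print(lags)
# {'first_alert': 14, 'attributed_to_vendor': 30, 'users_informed': 37}
```

Computing the lags from the raw timestamps keeps the summary lines honest: if a timestamp is corrected later, the derived numbers change with it.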
2. Impact Assessment
Be specific about what users experienced:
| Metric | Value |
|---|---|
| Duration | 83 minutes (14:08–15:31 UTC) |
| Users affected | ~340 users attempted checkout during the window |
| Transactions failed | 127 payment intent failures (37% of checkout attempts) |
| Revenue impact | ~$3,200 in delayed transactions (all subsequently recovered) |
| Support tickets | 14 tickets opened during incident |
| SLA breach | No — 99.9% monthly SLA still met |
Use exact counts wherever your data supports them. "Hundreds of users affected" is less useful than "340 users." The precision matters for prioritizing future work.
3. Root Cause Analysis
For third-party incidents, the analysis always has two layers: the external root cause and the internal factors that shaped its impact.
External root cause (vendor-controlled):
Stripe experienced infrastructure issues in their us-east-1 region affecting their Payment Intents API. Their postmortem identified a database failover that caused elevated error rates and latency from 14:03–15:28 UTC.
Internal contributing factors (in your control):
- No synthetic monitoring on the Stripe checkout path. Detection relied on real user traffic hitting errors, adding 9 minutes of detection lag.
- No fallback behavior for payment failures. Failed payment intents returned a generic 500 error to users with no explanation or retry option.
- Status page update was manual and delayed. The customer-facing banner was added 37 minutes after incident start, so users who hit payment failures before then saw errors without explanation.
This framing is honest and actionable. The vendor caused the incident; your system could have handled it better.
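The third contributing factor, a manual and delayed banner, is the kind of gap an automated trigger can close. As a rough sketch, a sliding-window error-rate check might look like the following. The class name, the 5-minute window, and the minimum request count are assumptions for illustration; the 10% threshold in the usage lines mirrors the example action item later in this article.

```python
from collections import deque

class ErrorRateTrigger:
    """Tracks request outcomes in a sliding time window and reports
    when the error rate in that window exceeds a threshold."""

    def __init__(self, threshold: float, window_seconds: float, min_requests: int):
        self.threshold = threshold            # e.g. 0.10 for 10%
        self.window_seconds = window_seconds  # how far back to look
        self.min_requests = min_requests      # avoid firing on tiny samples
        self._events = deque()                # (timestamp, is_error) pairs

    def record(self, timestamp: float, is_error: bool) -> None:
        """Store one request outcome and drop anything older than the window."""
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def should_show_banner(self, now: float) -> bool:
        """True when enough requests were seen and too many of them failed."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold

# Usage: feed in outcomes as checkout requests complete.
trigger = ErrorRateTrigger(threshold=0.10, window_seconds=300, min_requests=20)
for second in range(30):
    trigger.record(timestamp=float(second), is_error=(second % 3 == 0))
print(trigger.should_show_banner(now=30.0))  # True: 10 of 30 requests failed
```

The minimum request count matters during low-traffic periods, where a single failure out of three requests would otherwise look like a 33% error rate.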
4. What Went Well
Include this section. It's not filler — it tells your team what processes to preserve:
- On-call responded within 2 minutes of the first alert
- No one wasted time reverting a recent deployment (last deploy was 3 days old)
- Team correctly chose to wait for vendor resolution rather than attempt a workaround that would have delayed work on the support ticket backlog
Postmortems that only list failures create a culture where engineers are afraid to be named in them. That defeats the purpose.
5. Action Items
This is the section that prevents the same outcome next time. Each action item needs:
- A specific, testable outcome
- An owner
- A deadline
Example action items from a Stripe incident:
| Action | Owner | Deadline | Outcome |
|---|---|---|---|
| Add synthetic probe to Stripe checkout API, alerting at first failure | @infra-team | 2 weeks | Detection lag < 2 min |
| Implement retry with exponential backoff on payment intent creation | @backend-team | 1 week | User retries automatically, not forced to reload |
| Add customer-facing status banner that auto-triggers when checkout error rate > 10% | @frontend-team | 3 weeks | Users informed within 5 min of incident start |
| Add Stripe to service monitoring watchlist for proactive status alerts | @infra-team | This week | Attribution lag reduced by having vendor status in alert workflow |
| Document failover procedure: when to use secondary payment processor | @backend-team | 1 month | Runbook available if incident duration > 30 min |
Notice: the fourth action item ("add Stripe to service monitoring watchlist") doesn't require building anything. It's a configuration change that delivers value immediately.
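To make the retry action item in the table concrete, here is one possible shape for retry with exponential backoff. This is a sketch, not a prescribed implementation: `TransientPaymentError`, the function name, and the default delays are hypothetical, and the wrapped operation is assumed to be safe to repeat.

```python
import random
import time

class TransientPaymentError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, 5xx)."""

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait before each retry is a random fraction ("full jitter") of an
    exponentially growing delay, capped at max_delay seconds.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientPaymentError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller show an error
            delay = min(max_delay, base_delay * 2 ** attempt) * rng()
            sleep(delay)

# Usage with a stand-in operation that fails twice, then succeeds.
attempts = []

def create_payment():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientPaymentError("provider returned an error")
    return "payment created"

print(retry_with_backoff(create_payment, sleep=lambda seconds: None))
# payment created
```

Retrying a payment request is only safe if a repeated request cannot charge the customer twice, for example by reusing one idempotency key across all attempts. The jitter spreads retries out so that many clients do not hit a recovering vendor at the same moment.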
The 5-Why Framework for Third-Party Incidents
Applying 5-why to vendor outages surfaces internal gaps that pure attribution ("Stripe was down") misses:
Why did 127 checkout attempts fail? → Stripe's Payment Intents API returned errors.
Why did users see a generic error instead of a clear message? → Our checkout flow didn't handle payment provider errors separately from other failures.
Why didn't we know to communicate with users until 37 minutes in? → We didn't have an automated trigger for customer communication during payment degradation.
Why did we rely on manual processes for communication? → We've never established an SLA for user communication during incidents.
Why have we never established a communication SLA? → We've had few significant incidents. This one exposed the gap.
Each layer reveals something actionable.
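The second "why" points at a concrete code-level gap: payment provider failures were not distinguished from other errors. A minimal sketch of separating the two, assuming a hypothetical `PaymentProviderError` raised by your own payment client wrapper (it is not part of any vendor SDK) and illustrative response fields:

```python
from dataclasses import dataclass

class PaymentProviderError(Exception):
    """Hypothetical: raised by our payment wrapper when the upstream
    provider fails or times out."""

@dataclass
class CheckoutResponse:
    status: int
    message: str
    retryable: bool

def handle_checkout(create_payment) -> CheckoutResponse:
    """Run the payment step and map each failure class to its own response."""
    try:
        create_payment()
    except PaymentProviderError:
        # Upstream problem: tell the user what is happening and offer a retry.
        return CheckoutResponse(
            status=503,
            message="Our payment provider is having trouble right now. "
                    "Please try again in a few minutes.",
            retryable=True,
        )
    except Exception:
        # Anything else is our own bug: keep the generic error.
        return CheckoutResponse(500, "Something went wrong on our side.", False)
    return CheckoutResponse(200, "Payment received.", False)

# Usage with a stand-in that simulates a provider outage.
def provider_down():
    raise PaymentProviderError("upstream returned an error")

print(handle_checkout(provider_down).status)  # 503
```

Returning a distinct status for upstream failures also makes the vendor's share of the error rate visible in your own metrics, which helps with attribution.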
What Not to Write in a Third-Party Postmortem
Blame: "Stripe let us down." This accomplishes nothing. Write about what you'll do to be more resilient to future vendor failures — regardless of which vendor it is.
Vendor-specific promises: Don't write "we will switch payment processors." Unless you've already decided to, that's not an action item — it's a threat. Write "we will evaluate secondary payment processor options and define when we'd activate failover."
Vague commitments: "We will improve monitoring." Write "We will add a synthetic health check on the Stripe checkout API path that fires within 30 seconds of first failure."
Post-hoc timeline fabrication: Don't reconstruct timelines from memory. Use actual log timestamps. If you don't have them, add log coverage as an action item — but don't fill gaps with estimates and present them as fact.
Turning Postmortems Into Resilience
A postmortem is a design document in disguise. Every gap you identify is a missing piece of your reliability architecture:
- Detection gap → synthetic monitoring, alerting thresholds
- Attribution gap → vendor status monitoring, correlation tooling
- Degradation gap → fallback behavior, circuit breakers, graceful failure states
- Communication gap → automated status triggers, user-facing messaging templates
- Recovery gap → health check automation, automated stand-down procedures
The goal isn't to prevent third-party outages — you can't. The goal is to build a system where a third-party outage has a predictable, manageable impact.
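Of the mechanisms listed above, the circuit breaker is the one most often described but least often shown. The sketch below is a simplified, single-threaded version under stated assumptions: the class and exception names, the failure threshold, and the reset timeout are all illustrative, and a production version would need locking and per-vendor tuning.

```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""

class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("vendor call skipped: circuit is open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately and can go straight to a degraded response instead of waiting on a vendor timeout for every request.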
Where Statusfield Fits
The attribution gap — the time between incident start and your team knowing the vendor is the cause — is where most debugging time is wasted. During a Stripe incident, engineers check deploys, database connections, and application code before concluding the problem is upstream.
Statusfield monitors official vendor status pages continuously. When Stripe posts a degradation incident, Statusfield delivers the alert before your on-call engineer has finished pulling recent deploys. The attribution gap shrinks from 20–30 minutes to under 2 minutes.
For postmortem action items, this is a day-one change: add the relevant vendor to Statusfield and eliminate manual status page monitoring from your incident workflow.
Statusfield's free plan monitors up to 3 services. The Pro plan ($29/month) covers up to 20 services with unlimited email and Slack notifications.
FAQ
Should we write a postmortem for every third-party incident, or only significant ones? Set a threshold: write a postmortem for any incident that affected more than X% of users, lasted more than Y minutes, or generated more than Z support tickets. Below that threshold, a brief incident log entry is sufficient. The goal is to reserve postmortem effort for incidents where the action items will change something.
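One way to keep that threshold from drifting is to write it down as code or configuration. A small sketch, where the field names are hypothetical and the numbers in the usage lines are placeholders to replace with your own X, Y, and Z:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PostmortemThreshold:
    max_users_affected_pct: float  # X
    max_duration_minutes: float    # Y
    max_support_tickets: int       # Z

def needs_postmortem(users_affected_pct: float, duration_minutes: float,
                     support_tickets: int, threshold: PostmortemThreshold) -> bool:
    """True if the incident exceeds any one of the three limits."""
    return (
        users_affected_pct > threshold.max_users_affected_pct
        or duration_minutes > threshold.max_duration_minutes
        or support_tickets > threshold.max_support_tickets
    )

# Placeholder limits, chosen only to make the example runnable.
threshold = PostmortemThreshold(
    max_users_affected_pct=5.0,
    max_duration_minutes=30,
    max_support_tickets=10,
)
print(needs_postmortem(2.0, 20, 14, threshold))  # True: 14 tickets exceeds 10
print(needs_postmortem(2.0, 20, 3, threshold))   # False: below every limit
```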
Who should own a third-party postmortem? The on-call engineer who responded to the incident typically drives it, but the postmortem itself should be a collaborative document. The engineer provides the timeline and technical details; the product and customer-facing teams contribute the user impact and communication review. Don't assign ownership to the vendor.
How do we handle a vendor who doesn't publish a postmortem? Write yours anyway, using your own data. Note in the postmortem that the vendor's root cause analysis wasn't available. This is a signal worth tracking — vendors who routinely don't publish postmortems give you less information to calibrate your resilience planning.
What's the right tone for a postmortem? Blameless and factual. Write "the alert fired at 14:17" not "we didn't notice until 14:17." Write "the checkout error was not handled separately from other failures" not "the developer who wrote the checkout code didn't think about error states." The goal is to fix systems, not assign fault.
How long should a postmortem take to write? Write a draft within 48 hours while the incident is fresh. Set a deadline for the final version — typically 5 business days. A postmortem that takes 3 weeks to finalize loses most of its value. Prioritize completeness of the timeline and clarity of action items over prose quality.