How to Detect Third-Party Outages Before Your Users Do

Your users are not your monitoring system. Here's how to get detection coverage that surfaces third-party incidents in time to act — before the support tickets arrive.

The worst time to learn about a vendor outage is from a user report. At that point, the incident has already been running for however long it took a user to notice, get frustrated, and write to you. You're starting your investigation behind the curve.

The better outcome is a different sequence: your monitoring fires, your on-call engineer activates a fallback and posts a status update, and when users do encounter issues, there's already a message explaining the situation. This requires detecting the incident before it affects a significant number of users.

Why Users Find Out Before Your Monitoring Does

Internal monitoring — APM, error rate dashboards, uptime checks on your own endpoints — catches the effects of vendor incidents, not the incidents themselves. There are two problems with this:

Detection lag. Your error rate threshold might be set at 5%. A vendor incident has to affect enough of your traffic to cross that threshold before your alert fires. Depending on your traffic volume, this can take 5–15 minutes.

Noise floor. Transient errors happen constantly. Requests fail, connections reset, timeouts occur. Your monitoring ignores these because they're normal. But a vendor incident starts small, with a few more 503s than usual, before it grows. By the time it's an unambiguous signal, it has been running long enough to affect users.

The gap between incident start and internal alert firing is where user-reported issues come from.
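
To see where that lag comes from, here is a rough back-of-the-envelope sketch. It assumes steady traffic, an error rate measured over a rolling window, and a vendor failure that hits a fixed share of requests. The function name and every number in the example (0.5% baseline, 15-minute window, 8% and 4% failure shares) are illustrative assumptions, not measurements.

```python
from typing import Optional


def minutes_until_alert(
    threshold: float,
    baseline_rate: float,
    incident_rate: float,
    window_min: float,
) -> Optional[float]:
    """Minutes after incident start until a rolling-window error rate reaches threshold.

    Assumes steady traffic, so t minutes into the incident the window holds
    t minutes of incident traffic and (window_min - t) minutes of baseline traffic.
    Returns None if the alert can never fire.
    """
    if baseline_rate >= threshold:
        return 0.0  # already alerting before the incident starts
    if incident_rate <= threshold:
        return None  # even a full window of incident traffic stays at or under threshold
    # Solve (t * incident + (window - t) * baseline) / window = threshold for t.
    return window_min * (threshold - baseline_rate) / (incident_rate - baseline_rate)


# 5% threshold, 0.5% normal error rate, vendor failure affecting 8% of requests.
lag = minutes_until_alert(0.05, 0.005, 0.08, 15)
print(round(lag, 1))  # 9.0

# A failure affecting 4% of requests never trips a 5% threshold at all.
print(minutes_until_alert(0.05, 0.005, 0.04, 15))  # None
```

Under these assumptions the alert fires about nine minutes in, and a smaller incident stays below the threshold indefinitely, which is the noise-floor problem in numeric form.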

What Proactive Detection Looks Like

Proactive detection means watching the vendors directly, in parallel with monitoring your own systems:

Vendor incident starts
  └─ Vendor posts to status page (5–15 min lag)
       └─ Status monitor detects → alerts (within 1 min of posting)
            └─ Engineer notified (within 2 min of detection)
                 └─ Fallbacks activated, status page updated (within 5 min)
                      └─ Users see status message, not errors

Compare to reactive detection:

Vendor incident starts
  └─ Error rate climbs
       └─ Threshold crossed → APM alert (8–15 min after incident start)
            └─ Engineer investigates assuming internal issue
                 └─ Discovers vendor incident (15–30 min after start)
                      └─ Users have been hitting errors for 20+ minutes

The difference is where you start: with context (the vendor is degraded) or without it (something is wrong, and you don't know why).

What Proactive Detection Actually Requires

Closing that gap is less about a clever script and more about coverage that holds up over time. Effective vendor detection needs all of the following, continuously:

  • Continuous coverage of every vendor in your critical path — checking once a minute, not "when an engineer remembers to look."
  • Component-level granularity. "Stripe is up" is useless if Stripe's Payment Intents component is degraded. You need the specific component your feature depends on.
  • Change detection, not state polling. Alert when a component worsens (operational → partial outage), not on every poll while it sits degraded.
  • A map from vendor components to your features. The alert that helps says "Stripe Payment Intents degraded → checkout and subscription upgrades affected," not "Stripe is degraded."
  • Severity-aware thresholds and routing so a minor CDN blip doesn't page someone at 3 AM while a checkout outage does.
  • Recovery detection to close the loop — restore fallbacks and clear the status banner the moment the vendor actually recovers, not when an engineer notices at 9 AM.
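
As a minimal sketch of the change-detection and feature-mapping items above: the snippet below remembers the last status seen per component and only produces an alert when the status gets worse. The status names are the ones used in this article; `FEATURE_MAP`, `process_poll`, and the message format are hypothetical names chosen for illustration, and how the statuses are fetched is deliberately left out.

```python
from typing import Dict, List, Optional, Tuple

# Severity order, least to most severe, using the status names from this article.
SEVERITY = ["operational", "degraded_performance", "partial_outage", "major_outage"]

# Hypothetical map from (vendor, component) to the features that depend on it.
FEATURE_MAP: Dict[Tuple[str, str], List[str]] = {
    ("Stripe", "Payment Intents"): ["checkout", "subscription upgrades"],
}


def process_poll(
    last_seen: Dict[Tuple[str, str], str],
    vendor: str,
    component: str,
    current: str,
) -> Optional[str]:
    """Record the latest status; return an alert only if the status worsened."""
    key = (vendor, component)
    previous = last_seen.get(key, "operational")  # first sighting is compared to healthy
    last_seen[key] = current
    if SEVERITY.index(current) <= SEVERITY.index(previous):
        return None  # unchanged or improving: no repeat alert
    features = ", ".join(FEATURE_MAP.get(key, [])) or "unmapped"
    return f"{vendor} {component}: {previous} -> {current}; affected: {features}"


state: Dict[Tuple[str, str], str] = {}
polls = ["operational", "partial_outage", "partial_outage", "major_outage"]
alerts = [process_poll(state, "Stripe", "Payment Intents", s) for s in polls]
for alert in alerts:
    print(alert)
```

Of the four polls, only the second and fourth produce an alert; the repeated partial_outage poll stays silent. A status outside the `SEVERITY` list raises `ValueError` in this sketch, so a real version would need to decide how to treat values such as maintenance windows.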

Why Building This Yourself Doesn't Scale

Any one of those pieces is a weekend project. Keeping all of them working across every vendor you depend on is a standing maintenance commitment most teams underestimate:

  • Status page URLs and JSON formats change without notice, and they differ across vendors — what works for one breaks on a vendor using a different status provider.
  • Synthetic checks need per-vendor credentials, rate-limit handling, and someone to rotate them.
  • Polling infrastructure has to run somewhere reliable, with its own alerting (a monitor that silently dies is worse than none).
  • The component-to-feature map and on-call routing drift every time your product or team changes.

You end up maintaining a monitoring product as a side effect of shipping your actual product. That's the trade-off to weigh before writing the first line of polling code.

Tune Your Alert Thresholds

Not every vendor degradation needs to page your on-call engineer. A degraded_performance status on a CDN's edge network might mean slightly elevated latency on some routes — important to know, not worth waking someone at 3 AM.

Set alert thresholds per vendor and component:

Vendor / Component        | Alert threshold      | Routing
Stripe / Payment Intents  | degraded_performance | Page on-call + Slack
Auth0 / Authentication    | partial_outage       | Page on-call + Slack
SendGrid / Mail Send      | major_outage         | Slack only
Cloudflare / CDN          | major_outage         | Page on-call
GitHub / Actions          | partial_outage       | Slack (DevOps channel)

The threshold should reflect user impact. If degraded_performance on a vendor means your checkout conversion drops 30%, page someone. If it means 50ms extra latency on API docs, log it.
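
One way to express the table above is as plain data plus a small lookup, sketched below. The channel labels (`page_oncall`, `slack`, `slack_devops`) and the `route` function are hypothetical placeholders rather than a real integration; the thresholds are copied from the table.

```python
from typing import Dict, List, Tuple

# Rank of each status; a higher number means more severe.
RANK: Dict[str, int] = {
    "operational": 0,
    "degraded_performance": 1,
    "partial_outage": 2,
    "major_outage": 3,
}

# (vendor, component) -> (minimum status that alerts, channels to notify).
# Channel names are hypothetical labels for illustration only.
ROUTING: Dict[Tuple[str, str], Tuple[str, List[str]]] = {
    ("Stripe", "Payment Intents"): ("degraded_performance", ["page_oncall", "slack"]),
    ("Auth0", "Authentication"): ("partial_outage", ["page_oncall", "slack"]),
    ("SendGrid", "Mail Send"): ("major_outage", ["slack"]),
    ("Cloudflare", "CDN"): ("major_outage", ["page_oncall"]),
    ("GitHub", "Actions"): ("partial_outage", ["slack_devops"]),
}


def route(vendor: str, component: str, status: str) -> List[str]:
    """Return the channels to notify, or an empty list if below threshold."""
    rule = ROUTING.get((vendor, component))
    if rule is None:
        return []  # unmonitored component: nothing to notify
    threshold, channels = rule
    return list(channels) if RANK[status] >= RANK[threshold] else []


print(route("Cloudflare", "CDN", "degraded_performance"))          # []
print(route("Stripe", "Payment Intents", "degraded_performance"))  # ['page_oncall', 'slack']
```

Keeping the rules as data rather than branching logic makes it easier to review them against user impact when the product changes.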

The Recovery Signal Matters Too

Detecting incident start is only half the equation. Recovery detection closes the loop:

  • Confirms the incident is resolved (vs. the vendor posting "monitoring" with issues still present)
  • Triggers fallback restoration (re-enable checkout, clear the status banner)
  • Marks the incident end time for SLA calculations and postmortem data

An incident that resolves at 3 AM should clear the degraded feature flags at 3 AM — not when an engineer notices at 9 AM.
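
A minimal sketch of that loop, under stated assumptions: recovery is only confirmed after several consecutive operational polls, so a brief flap does not restore fallbacks too early. The `Incident` class, the `observe` function, the flag name, the example timestamps, and the choice of three consecutive polls are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass
class Incident:
    started_at: datetime
    ended_at: Optional[datetime] = None
    healthy_polls: int = 0  # consecutive operational polls seen so far


def observe(
    incident: Incident,
    status: str,
    now: datetime,
    flags: Dict[str, bool],
    flag_name: str,
    required_healthy: int = 3,  # assumption: three clean polls in a row
) -> bool:
    """Feed one poll result; return True on the poll that confirms recovery."""
    if incident.ended_at is not None:
        return False  # already closed
    if status != "operational":
        incident.healthy_polls = 0  # any relapse resets the streak
        return False
    incident.healthy_polls += 1
    if incident.healthy_polls < required_healthy:
        return False
    incident.ended_at = now   # end time for SLA and postmortem data
    flags[flag_name] = False  # restore the feature: clear the fallback flag
    return True


flags = {"checkout_fallback": True}
incident = Incident(started_at=datetime(2024, 1, 1, 2, 40, tzinfo=timezone.utc))
statuses = ["partial_outage", "operational", "degraded_performance",
            "operational", "operational", "operational"]
for minute, status in enumerate(statuses, start=1):
    now = incident.started_at + timedelta(minutes=minute)
    if observe(incident, status, now, flags, "checkout_fallback"):
        print("recovered at", now.isoformat())
```

The single operational poll in the middle does not count as recovery because the next poll is degraded again. Requiring a streak trades a couple of minutes of delay for fewer false recoveries; the right number depends on how often a given vendor flaps.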

Statusfield Detects It So You Don't Have To

Statusfield monitors 400+ services continuously and routes alerts to Slack, Discord, Telegram, email, or webhooks — at the component level, on status change, with recovery detection built in. You add the services you depend on, choose which components matter and where alerts go, and the polling, parsing, format-change handling, and delivery are handled for you.

That's the whole point: you get detection-before-your-users without standing up and maintaining the monitoring stack yourself. You configure what matters; Statusfield watches it 24/7 and tells you the moment it changes.

Start monitoring your vendors free →
