How to Detect When a Third-Party API Is Degraded (Not Just Down)

When a third-party service goes fully down, your monitoring fires immediately. Error rates spike, health checks fail, and the incident is obvious. But most third-party incidents aren't full outages — they're degradation. The service responds. Sometimes it's slow. Some requests succeed and some fail. Error rates tick up from 0.1% to 2%.

Degradation is where production systems quietly break. It's also where most monitoring falls short.

What Degradation Looks Like

A degraded API shows patterns that don't trigger simple "is it up?" checks:

Signal	What it indicates
Latency at p99 doubles, p50 unchanged	Specific server-side condition affecting a subset of requests
Error rate rises from 0% to 1-5%	Partial outage — some requests succeeding, some failing
Specific endpoints failing, others healthy	Component-level degradation
Failures concentrated in one region	Infrastructure issue in a specific data center
Intermittent timeouts, not consistent	Overloaded upstream, connection pool exhaustion
Webhook delivery delayed but not failed	Backend queue backlog, not an API failure

The challenge: all of these look like "your app is a bit broken" if you're not tracking the right signals at the right granularity.

The Four Signals of Third-Party Degradation

1. Latency Distribution Shift

A single average latency number hides degradation. Watch p50, p95, and p99 separately, because the failure often shows up at the tail first: p99 doubles while p50 stays flat, meaning a subset of requests are hitting a degraded path. Catching this requires a rolling window of latency samples per dependency and a threshold tuned to that dependency's normal tail — not a static "slow" number that fits every vendor.
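A rolling window with separate percentiles could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `LatencyWindow` class, the `tail_shifted` helper, the window size, and the 2x factor are all assumptions chosen for the example.

```python
from collections import deque
from statistics import quantiles


class LatencyWindow:
    """Keeps the most recent latency samples for one dependency."""

    def __init__(self, max_samples=1000):
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=max_samples)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
        cuts = quantiles(self.samples, n=100, method="inclusive")
        return cuts[p - 1]


def tail_shifted(window, baseline_p99_ms, factor=2.0, min_samples=100):
    """True when p99 exceeds `factor` times this dependency's own baseline p99."""
    if len(window.samples) < min_samples:
        return False  # too little data to judge the tail
    return window.percentile(99) > factor * baseline_p99_ms
```

With 97 samples at 100 ms and 3 at 900 ms, the p50 stays at 100 ms while the p99 reaches 900 ms, which is the tail-only shift described above.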

2. Error Rate by Endpoint

Global error rate masks endpoint-specific problems. One degraded endpoint at 8% can sit invisible behind a fleet-wide rate of 0.4%. Meaningful detection tracks failures per endpoint, with enough samples to avoid false alarms on low-traffic paths, and a baseline for what "normal" error rate looks like on each one. The granularity is the whole point — and the reason naive aggregate monitoring misses partial outages.
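The masking effect is easy to reproduce. The sketch below is illustrative only: the `EndpointErrors` class and the `MIN_SAMPLES` value of 30 are assumptions, and the endpoint names in the usage note are hypothetical.

```python
from collections import defaultdict

MIN_SAMPLES = 30  # assumption: minimum volume before a rate is reported


class EndpointErrors:
    """Counts requests and failures per endpoint."""

    def __init__(self):
        self.total = defaultdict(int)
        self.failed = defaultdict(int)

    def record(self, endpoint, ok):
        self.total[endpoint] += 1
        if not ok:
            self.failed[endpoint] += 1

    def error_rate(self, endpoint):
        """Per-endpoint error rate, or None when there are too few samples."""
        n = self.total.get(endpoint, 0)
        if n < MIN_SAMPLES:
            return None
        return self.failed.get(endpoint, 0) / n

    def global_rate(self):
        """Error rate across all endpoints combined, or None with no traffic."""
        n = sum(self.total.values())
        return sum(self.failed.values()) / n if n else None
```

If a hypothetical `/refunds` endpoint fails 8 of 100 requests while 1,900 requests to other endpoints all succeed, `error_rate("/refunds")` reports 8% while `global_rate()` reports 0.4%.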

3. Success Rate by Response Time Bucket

Some degraded responses succeed but take 10× longer. Bucketing success rate by latency exposes this:

Bucket	Requests	Success	Rate
< 200ms	850	850	100%
200-500ms	120	118	98.3%
500ms-2s	25	19	76%
> 2s	5	2	40%

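A healthy API shows consistent success rates across latency buckets. Degradation usually shows success rates declining as latency increases: the slow requests are also the ones that fail. Read the slowest bucket with care, since 5 requests is too few to treat its 40% rate as precise.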
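A table like the one above can be produced from raw request records with a few lines of code. This is a sketch under stated assumptions: the `success_by_bucket` function and the input shape (pairs of latency in milliseconds and a success flag) are hypothetical, and a request at exactly a bucket edge is counted in the slower bucket.

```python
from bisect import bisect_right

# Bucket edges in milliseconds, matching the table's four buckets.
EDGES_MS = [200, 500, 2000]
LABELS = ["< 200ms", "200-500ms", "500ms-2s", "> 2s"]


def success_by_bucket(requests):
    """requests: iterable of (latency_ms, ok) pairs.

    Returns {label: (total, successes, rate)}; rate is None for an empty bucket.
    """
    totals = [0] * len(LABELS)
    successes = [0] * len(LABELS)
    for latency_ms, ok in requests:
        i = bisect_right(EDGES_MS, latency_ms)  # index of the bucket this latency falls in
        totals[i] += 1
        successes[i] += 1 if ok else 0
    return {
        label: (t, s, s / t if t else None)
        for label, t, s in zip(LABELS, totals, successes)
    }
```

Computing the rate per bucket, rather than one overall rate, is what makes the slow-and-failing pattern visible.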

4. Correlation With Vendor Status

The most reliable way to attribute third-party degradation is to correlate your observed errors with the vendor's own status timeline. When your error rate climbs at 14:23 UTC and the vendor posts a degradation incident at 14:37 UTC, you have strong evidence that the vendor is the root cause, and you have measured a 14-minute lag between the start of the incident in your own metrics and the vendor's public communication.

That correlation only works if you actually have the vendor's status as a parallel, continuously updated signal. Without it, you're left guessing whether the spike is your code, your infrastructure, or theirs.
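The correlation itself is simple once both timelines exist. The sketch below assumes a per-minute error-rate series and a single vendor post time; the function names, the 2% breach threshold, and the calendar date in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def first_breach(series, threshold):
    """series: list of (timestamp, error_rate), oldest first.

    Returns the first timestamp whose rate exceeds the threshold, or None.
    """
    for ts, rate in series:
        if rate > threshold:
            return ts
    return None


def vendor_lag(series, vendor_posted_at, threshold=0.02):
    """Time between your first breach and the vendor's post.

    Returns None when there was no breach, or when the breach began after
    the vendor's post (so there is no lag to measure).
    """
    started = first_breach(series, threshold)
    if started is None or started > vendor_posted_at:
        return None
    return vendor_posted_at - started
```

For a series that first exceeds 2% at 14:23 UTC and a vendor post at 14:37 UTC on the same day, `vendor_lag` returns `timedelta(minutes=14)`.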

Why Degradation Is Harder to Catch Than Hard-Down

A full outage trips every threshold at once. Degradation does the opposite: it lives just under your alert thresholds, drifts in and out, and concentrates in places aggregate metrics smooth over. It looks like noise until it doesn't. By the time a degraded vendor produces an unambiguous signal in your own metrics, it has usually been affecting users for a while — which is exactly the window where support tickets get written.

What Good Degradation Detection Must Do

Catching the four signals above, continuously and across every vendor you depend on, requires all of the following:

Per-vendor, per-endpoint baselines. "Normal" latency and error rate differ by vendor and by endpoint. Detection needs a baseline for each, kept current as traffic patterns change.
Tail-aware latency tracking. Watching p95/p99, not averages, with windows long enough to be stable and short enough to be timely.
Sample-size awareness. Enough volume before treating an error rate as real, so a single failure on a low-traffic endpoint doesn't page anyone.
Severity-aware thresholds. A 5% error rate that drops checkout conversion is not the same as 50ms of extra latency on a docs endpoint. Thresholds and routing have to reflect user impact.
Continuous vendor-status correlation. A live feed of each vendor's component-level status, aligned to your own metrics, so attribution takes seconds instead of a debugging session.
Recovery detection. Knowing when the degradation actually clears, so fallbacks restore and banners come down without waiting for someone to notice.
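As one illustration of keeping a baseline current, an exponentially weighted moving average is a common choice. The article does not prescribe this technique; the `EwmaBaseline` class, the `alpha` value, and the `anomalous` flag are assumptions made for the sketch.

```python
class EwmaBaseline:
    """Baseline for one metric on one vendor endpoint, e.g. p99 latency."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # higher alpha adapts faster but is noisier
        self.value = None

    def update(self, observation, anomalous=False):
        """Fold a new observation into the baseline and return the baseline.

        Observations flagged as anomalous are skipped so that an ongoing
        incident does not drag the baseline toward the degraded level.
        """
        if anomalous:
            return self.value
        if self.value is None:
            self.value = float(observation)
        else:
            self.value = self.alpha * observation + (1 - self.alpha) * self.value
        return self.value
```

One such object per vendor and endpoint is the unit that has to be created, stored, and re-tuned, which is where the maintenance cost accumulates.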

Why Building This Yourself Doesn't Scale

Any one of those pieces is a weekend project. Keeping all of them working across every vendor, indefinitely, is a standing maintenance commitment most teams underestimate:

You'd have to instrument every vendor call to capture latency distributions and per-endpoint error rates — and keep that instrumentation correct as your code changes.
You'd have to maintain per-vendor, per-endpoint baselines and re-tune them as traffic and vendor behavior drift.
You'd have to run synthetic probes against vendor APIs on a schedule, with credentials, rate-limit handling, and timeout logic — then monitor the probes themselves, because a monitor that silently dies is worse than none.
You'd have to track each vendor's status feed, absorb the format and URL changes vendors ship without notice, and correlate it all back to your metrics.

You end up maintaining a monitoring product as a side effect of shipping your actual product. That's the trade-off to weigh before instrumenting the first vendor call.

Alerting on Degradation Signals

Degradation alerts need different thresholds than outage alerts:

Alert type	Threshold	Action
Outage	Error rate > 50% for 1 min	Page immediately
Degradation	Error rate > 5% for 5 min	Alert on-call
Latency spike	p99 > 3× baseline for 5 min	Alert on-call
Warning	Error rate > 1% for 10 min	Notify team channel
Recovery	Error rate < 0.5% for 5 min	Auto-resolve

Set warning thresholds low enough to catch degradation before it becomes an outage, and windows long enough to avoid false positives from transient spikes. The hard part isn't the table — it's keeping these thresholds calibrated per vendor as everything underneath them changes.
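The table above translates fairly directly into a rule evaluator. This sketch assumes one error-rate value and one p99-to-baseline ratio per minute, most recent last; the function names and the returned action strings are hypothetical.

```python
def sustained(series, minutes, predicate):
    """True if each of the last `minutes` per-minute values satisfies predicate."""
    recent = series[-minutes:]
    return len(recent) == minutes and all(predicate(v) for v in recent)


def classify(error_rates, p99_ratios):
    """Checks rules from most to least severe and returns the first match.

    error_rates: per-minute error rates as fractions (0.05 means 5%).
    p99_ratios: per-minute p99 latency divided by the baseline p99.
    """
    if sustained(error_rates, 1, lambda r: r > 0.50):
        return "page"
    if sustained(error_rates, 5, lambda r: r > 0.05):
        return "alert-on-call"
    if sustained(p99_ratios, 5, lambda x: x > 3.0):
        return "alert-on-call"
    if sustained(error_rates, 10, lambda r: r > 0.01):
        return "notify-channel"
    if sustained(error_rates, 5, lambda r: r < 0.005):
        return "auto-resolve"
    return "no-change"
```

Requiring every minute in the window to breach the threshold is the strict reading of "for 5 min"; a looser variant would allow one or two clean minutes inside the window.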

How Statusfield Handles This

The hardest part of detecting third-party degradation is attribution. Your error rate went up — is it your code, your infrastructure, or the vendor? The fastest way to answer is to have the vendor's status as a parallel signal, ready before you start debugging.

Statusfield monitors 400+ services continuously and catches component-level degradation — not just full outages. Most vendors post a "degraded performance" notice before an incident becomes a full outage; Statusfield surfaces that the moment it changes and routes the alert to Slack, Discord, Telegram, email, or webhooks. You pick the services and components that matter and where alerts go; the polling, parsing, format-change handling, and delivery are handled for you.

That's the point: you get degradation attribution without instrumenting every vendor call, maintaining per-vendor baselines, or running and babysitting your own probes. You configure what matters; Statusfield watches it 24/7 and tells you the moment a dependency starts to slip.

FAQ

What's the difference between degradation and a partial outage? Vendors often use the terms interchangeably. In practice, degradation typically means reduced performance (higher latency, lower throughput), while a partial outage means a subset of requests fails outright. Both are distinct from a full outage. The practical impact on your system depends on which endpoints are affected and what your fallback behavior is.

How many samples do I need before I can trust an error rate measurement? At least 20–30 samples before treating an error rate as meaningful. With 5 requests, a single failure shows up as a 20% error rate — which is misleading. For low-traffic endpoints where you might not get 30 samples in a few minutes, detection is inherently slower. Maintaining this sample-size discipline per endpoint, per vendor, is one of the reasons teams hand degradation detection to a dedicated service rather than building it in-house.
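One way to see why small samples mislead is to put a confidence interval around the measured rate. The sketch below uses the Wilson score interval, a standard statistical technique that this article does not itself name; the function name and the 95% level (z = 1.96) are assumptions.

```python
from math import sqrt


def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for a failure rate."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low_5, high_5 = wilson_interval(1, 5)      # 1 failure in 5 requests
low_30, high_30 = wilson_interval(6, 30)   # 6 failures in 30 requests
```

Both samples show a 20% measured error rate, but the plausible range is roughly 4% to 62% for 5 requests and roughly 10% to 37% for 30 requests. Only the larger sample supports a decision.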

Should I alert on every 429 from a vendor? A single 429 isn't worth an alert — it's expected behavior when you occasionally exceed burst limits. What matters is the rate: a sustained climb in 429s, or a 429 on an endpoint you've sized to stay within limits, signals real degradation. Distinguishing normal burst handling from systematically hitting the rate limit requires per-endpoint baselines, which is exactly the maintenance burden a managed monitor absorbs for you.
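Alerting on the rate rather than on individual responses could look like the sketch below. The `RateLimitMonitor` class, the 5-minute window, the 1% baseline, and the 3x factor are assumptions chosen for illustration, not recommended values.

```python
from collections import deque


class RateLimitMonitor:
    """Tracks the share of 429 responses for one endpoint over a sliding window."""

    def __init__(self, window_s=300, baseline_rate=0.01, factor=3.0, min_samples=30):
        self.window_s = window_s
        self.baseline_rate = baseline_rate  # normal share of 429s for this endpoint
        self.factor = factor
        self.min_samples = min_samples
        self.events = deque()  # (timestamp_s, was_429)

    def record(self, timestamp_s, status_code):
        self.events.append((timestamp_s, status_code == 429))
        # Drop events that have aged out of the window.
        cutoff = timestamp_s - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def should_alert(self):
        n = len(self.events)
        if n < self.min_samples:
            return False  # a lone 429 on thin traffic is not a signal
        rate = sum(1 for _, hit in self.events if hit) / n
        return rate > self.factor * self.baseline_rate
```

One 429 in 100 requests stays at the assumed baseline and raises nothing, while 10 in 100 is ten times the baseline and trips the alert.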

Why is detecting degradation harder than detecting a full outage? A full outage trips every threshold at once. Degradation sits just under those thresholds, drifts in and out, and hides inside aggregate metrics — so it reads as noise until users feel it. Catching it reliably means tracking tail latency and per-endpoint error rates against current baselines across every vendor, continuously. That's a standing commitment most teams underestimate, which is why a dedicated service is usually the better trade-off.

How do I know if a vendor's status page is trustworthy? Track the lag between when incidents start (as measured by your own metrics) and when the vendor posts them. Over several incidents you build a calibration — some vendors post within 5 minutes, others take 30 or more. Knowing this lag tells you how much lead time you give up by relying on their status page, and a continuous monitor that detects the change the moment it posts removes the need to check manually.