When a third-party service goes fully down, your monitoring fires immediately. Error rates spike, health checks fail, and the incident is obvious. But most third-party incidents aren't full outages — they're degradation. The service responds. Sometimes it's slow. Some requests succeed and some fail. Error rates tick up from 0.1% to 2%.
Degradation is where production systems quietly break. It's also where most monitoring falls short.
What Degradation Looks Like
A degraded API shows patterns that don't trigger simple "is it up?" checks:
| Signal | What it indicates |
|---|---|
| Latency at p99 doubles, p50 unchanged | Specific server-side condition affecting a subset of requests |
| Error rate rises from 0% to 1-5% | Partial outage — some requests succeeding, some failing |
| Specific endpoints failing, others healthy | Component-level degradation |
| Failures concentrated in one region | Infrastructure issue in a specific data center |
| Intermittent timeouts, not consistent | Overloaded upstream, connection pool exhaustion |
| Webhook delivery delayed but not failed | Backend queue backlog, not an API failure |
The challenge: all of these look like "your app is a bit broken" if you're not tracking the right signals at the right granularity.
The Four Signals of Third-Party Degradation
1. Latency Distribution Shift
A single average latency number hides degradation. Watch p50, p95, and p99 separately, because the failure often shows up at the tail first: p99 doubles while p50 stays flat, which means a subset of requests is hitting a degraded path. Catching this requires a rolling window of latency samples per dependency and a threshold tuned to that dependency's normal tail, not a single static "slow" number applied to every vendor.
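A minimal sketch of that rolling-window check, assuming latencies are recorded in milliseconds and you already know the dependency's normal p99. The names `LatencyWindow` and `tail_shifted`, the 2× factor, and the 100-sample minimum are illustrative choices, not part of any library:

```python
from collections import deque
from statistics import quantiles


class LatencyWindow:
    """Rolling window of latency samples (ms) for a single dependency."""

    def __init__(self, max_samples=1000):
        # A deque with maxlen drops the oldest sample once the window is full.
        self.samples = deque(maxlen=max_samples)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def percentile(self, p):
        # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
        return quantiles(self.samples, n=100, method="inclusive")[p - 1]


def tail_shifted(window, baseline_p99_ms, factor=2.0, min_samples=100):
    """True when the window's p99 exceeds `factor` times the baseline p99."""
    if len(window.samples) < min_samples:
        return False  # too few samples to say anything about the tail
    return window.percentile(99) > factor * baseline_p99_ms


# Hypothetical data: 95 fast requests and 5 slow ones.
window = LatencyWindow()
for latency in [100.0] * 95 + [900.0] * 5:
    window.record(latency)

print(window.percentile(50), window.percentile(99))
print(tail_shifted(window, baseline_p99_ms=300.0))
```

In this made-up window the median stays at the fast value while the p99 lands on the slow requests, which is the pattern an average would blur.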
2. Error Rate by Endpoint
Global error rate masks endpoint-specific problems. One degraded endpoint at 8% can sit invisible behind a fleet-wide rate of 0.4%. Meaningful detection tracks failures per endpoint, with enough samples to avoid false alarms on low-traffic paths, and a baseline for what "normal" error rate looks like on each one. The granularity is the whole point — and the reason naive aggregate monitoring misses partial outages.
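As an illustration, the sketch below keeps separate counters per endpoint and refuses to report a rate until an endpoint has enough samples. The class name, the endpoint paths, the 30-sample minimum, and the 3× baseline factor are all assumptions for the example, and a production version would use a time window rather than cumulative counters:

```python
from collections import defaultdict


class EndpointErrorTracker:
    """Counts requests and failures per endpoint for one vendor."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.totals = defaultdict(int)
        self.failures = defaultdict(int)

    def record(self, endpoint, ok):
        self.totals[endpoint] += 1
        if not ok:
            self.failures[endpoint] += 1

    def error_rate(self, endpoint):
        total = self.totals.get(endpoint, 0)
        if total < self.min_samples:
            return None  # not enough traffic to trust the number
        return self.failures.get(endpoint, 0) / total

    def fleet_error_rate(self):
        total = sum(self.totals.values())
        return sum(self.failures.values()) / total if total else None

    def degraded(self, baselines, factor=3.0):
        """Endpoints whose error rate exceeds `factor` times their own baseline."""
        flagged = []
        for endpoint, baseline in baselines.items():
            rate = self.error_rate(endpoint)
            if rate is not None and rate > factor * baseline:
                flagged.append(endpoint)
        return flagged


# Hypothetical traffic: one busy healthy endpoint, one quiet degraded one.
tracker = EndpointErrorTracker()
for _ in range(950):
    tracker.record("/charges", ok=True)
for i in range(50):
    tracker.record("/refunds", ok=(i >= 4))  # 4 failures out of 50

print(tracker.fleet_error_rate())
print(tracker.error_rate("/refunds"))
print(tracker.degraded({"/charges": 0.005, "/refunds": 0.005}))
```

With this traffic the fleet-wide rate is 4 failures in 1,000 requests (0.4%), while `/refunds` alone sits at 4 in 50 (8%), so only the per-endpoint view flags it.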
3. Success Rate by Response Time Bucket
Some degraded responses succeed but take 10× longer. Bucketing success rate by latency exposes this:
| Bucket | Requests | Successes | Success rate |
|---|---|---|---|
| < 200ms | 850 | 850 | 100% |
| 200-500ms | 120 | 118 | 98.3% |
| 500ms-2s | 25 | 19 | 76% |
| > 2s | 5 | 2 | 40% |
A healthy API shows consistent success rates across latency buckets. Degradation usually shows declining success rates as latency increases — slow requests are also the ones that fail.
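The table above can be produced from raw request records with a few lines of bucketing. This is a sketch under assumptions: each request is a `(latency_ms, ok)` pair, the bucket edges mirror the table, and the helper names are hypothetical:

```python
from bisect import bisect_right

BUCKET_EDGES_MS = [200, 500, 2000]
BUCKET_LABELS = ["< 200ms", "200-500ms", "500ms-2s", "> 2s"]


def success_by_bucket(requests):
    """requests: iterable of (latency_ms, ok). Returns {label: (total, successes, rate)}."""
    counts = [[0, 0] for _ in BUCKET_LABELS]
    for latency_ms, ok in requests:
        # bisect_right puts a latency equal to an edge into the slower bucket.
        idx = bisect_right(BUCKET_EDGES_MS, latency_ms)
        counts[idx][0] += 1
        counts[idx][1] += int(ok)
    return {
        label: (total, ok_count, ok_count / total if total else None)
        for label, (total, ok_count) in zip(BUCKET_LABELS, counts)
    }


def sample(latency_ms, total, successes):
    """Build `total` synthetic requests, the first `successes` of which succeed."""
    return [(latency_ms, i < successes) for i in range(total)]


# Synthetic requests matching the counts in the table above.
requests = (
    sample(100, 850, 850)
    + sample(300, 120, 118)
    + sample(1000, 25, 19)
    + sample(3000, 5, 2)
)
report = success_by_bucket(requests)
for label, (total, ok_count, rate) in report.items():
    rate_text = "n/a" if rate is None else f"{rate:.1%}"
    print(f"{label:>10}  {total:4d}  {ok_count:4d}  {rate_text}")
```

Note how few samples land in the slowest bucket: a 40% success rate over 5 requests is a hint, not a measurement, so this view works best alongside a minimum sample count.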
4. Correlation With Vendor Status
The most reliable way to attribute third-party degradation is to correlate your observed errors with the vendor's own status timeline. When your error rate climbs at 14:23 UTC and the vendor posts a degradation incident at 14:37 UTC, you have strong evidence that the vendor is the root cause, and you have also learned that their public communication lagged your own detection by 14 minutes.
That correlation only works if you actually have the vendor's status as a parallel, continuously updated signal. Without it, you're left guessing whether the spike is your code, your infrastructure, or theirs.
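The arithmetic behind that correlation is simple once both timestamps exist. A rough sketch, assuming you record when your own metrics first crossed a threshold and when the vendor's incident was posted; the function name, the one-hour matching window, and the calendar date are made up for illustration:

```python
from datetime import datetime, timedelta, timezone


def disclosure_lag(observed_start, vendor_posted, max_gap=timedelta(hours=1)):
    """Signed lag (vendor post minus your first bad sample), or None if too far apart to link."""
    lag = vendor_posted - observed_start
    return lag if abs(lag) <= max_gap else None


# Times from the example above; the date itself is arbitrary.
observed = datetime(2024, 1, 15, 14, 23, tzinfo=timezone.utc)
posted = datetime(2024, 1, 15, 14, 37, tzinfo=timezone.utc)

lag = disclosure_lag(observed, posted)
print(lag)  # 0:14:00
```

A negative lag means the vendor posted before your metrics moved, which is also useful to know. A `None` result means the two events are far enough apart that linking them would be a guess.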
Why Degradation Is Harder to Catch Than Hard-Down
A full outage trips every threshold at once. Degradation does the opposite: it lives just under your alert thresholds, drifts in and out, and concentrates in places aggregate metrics smooth over. It looks like noise until it doesn't. By the time a degraded vendor produces an unambiguous signal in your own metrics, it has usually been affecting users for a while — which is exactly the window where support tickets get written.
What Good Degradation Detection Must Do
Catching the four signals above, continuously and across every vendor you depend on, requires all of the following:
- Per-vendor, per-endpoint baselines. "Normal" latency and error rate differ by vendor and by endpoint. Detection needs a baseline for each, kept current as traffic patterns change.
- Tail-aware latency tracking. Watching p95/p99, not averages, with windows long enough to be stable and short enough to be timely.
- Sample-size awareness. Enough volume before treating an error rate as real, so a single failure on a low-traffic endpoint doesn't page anyone.
- Severity-aware thresholds. A 5% error rate that drops checkout conversion is not the same as 50ms of extra latency on a docs endpoint. Thresholds and routing have to reflect user impact.
- Continuous vendor-status correlation. A live feed of each vendor's component-level status, aligned to your own metrics, so attribution takes seconds instead of a debugging session.
- Recovery detection. Knowing when the degradation actually clears, so fallbacks restore and banners come down without waiting for someone to notice.
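To make the first requirement concrete, here is one possible way to keep a baseline per vendor and endpoint current: an exponentially weighted moving average. This is a swapped-in technique chosen for brevity, not something the requirements above prescribe, and the vendor and endpoint names are placeholders:

```python
class Baselines:
    """Exponentially weighted moving average per (vendor, endpoint)."""

    def __init__(self, alpha=0.1):
        # Higher alpha adapts faster but also absorbs incidents faster.
        self.alpha = alpha
        self.values = {}

    def update(self, vendor, endpoint, observed):
        key = (vendor, endpoint)
        previous = self.values.get(key)
        if previous is None:
            current = observed  # first observation seeds the baseline
        else:
            current = self.alpha * observed + (1 - self.alpha) * previous
        self.values[key] = current
        return current

    def get(self, vendor, endpoint):
        return self.values.get((vendor, endpoint))


# Hypothetical p99 latency observations in milliseconds.
baselines = Baselines(alpha=0.1)
baselines.update("vendor-a", "/v1/charges", 100.0)
baselines.update("vendor-a", "/v1/charges", 200.0)
baselines.update("vendor-a", "/v1/refunds", 400.0)

print(baselines.get("vendor-a", "/v1/charges"))
print(baselines.get("vendor-a", "/v1/refunds"))
```

One design caveat: if you keep feeding the baseline during an incident, it drifts toward the degraded values. A real implementation would likely pause updates while an alert is open.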
Why Building This Yourself Doesn't Scale
Any one of those pieces is a weekend project. Keeping all of them working across every vendor, indefinitely, is a standing maintenance commitment most teams underestimate:
- You'd have to instrument every vendor call to capture latency distributions and per-endpoint error rates — and keep that instrumentation correct as your code changes.
- You'd have to maintain per-vendor, per-endpoint baselines and re-tune them as traffic and vendor behavior drift.
- You'd have to run synthetic probes against vendor APIs on a schedule, with credentials, rate-limit handling, and timeout logic — then monitor the probes themselves, because a monitor that silently dies is worse than none.
- You'd have to track each vendor's status feed, absorb the format and URL changes vendors ship without notice, and correlate it all back to your metrics.
You end up maintaining a monitoring product as a side effect of shipping your actual product. That's the trade-off to weigh before instrumenting the first vendor call.
Alerting on Degradation Signals
Degradation alerts need different thresholds than outage alerts:
| Alert type | Threshold | Action |
|---|---|---|
| Outage | Error rate > 50% for 1 min | Page immediately |
| Degradation | Error rate > 5% for 5 min | Alert on-call |
| Latency spike | p99 > 3× baseline for 5 min | Alert on-call |
| Warning | Error rate > 1% for 10 min | Notify team channel |
| Recovery | Error rate < 0.5% for 5 min | Auto-resolve |
Set warning thresholds low enough to catch degradation before it becomes an outage, and windows long enough to avoid false positives from transient spikes. The hard part isn't the table — it's keeping these thresholds calibrated per vendor as everything underneath them changes.
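The table translates fairly directly into code. The sketch below assumes you have one error-rate sample per minute (as a fraction, oldest first) and one p99-to-baseline ratio per minute; the function names and return labels are hypothetical:

```python
def sustained(values, threshold, minutes, above=True):
    """True if each of the last `minutes` samples is beyond `threshold`."""
    recent = values[-minutes:]
    if len(recent) < minutes:
        return False  # not enough history to cover the window
    if above:
        return all(v > threshold for v in recent)
    return all(v < threshold for v in recent)


def classify(error_rates, p99_ratios):
    """Map per-minute samples to the most severe matching row of the table."""
    if sustained(error_rates, 0.50, 1):
        return "outage"  # page immediately
    if sustained(error_rates, 0.05, 5):
        return "degradation"  # alert on-call
    if sustained(p99_ratios, 3.0, 5):
        return "latency_spike"  # alert on-call
    if sustained(error_rates, 0.01, 10):
        return "warning"  # notify team channel
    if sustained(error_rates, 0.005, 5, above=False):
        return "recovered"  # auto-resolve, if an alert is currently open
    return None


print(classify([0.60], [1.0]))
print(classify([0.06] * 5, [1.0] * 5))
print(classify([0.001] * 5, [3.5] * 5))
print(classify([0.02] * 10, [1.0] * 10))
print(classify([0.001] * 5, [1.0] * 5))
```

Checks run from most to least severe so that a full outage is never reported as a mere warning. The gap between the 1% warning threshold and the 0.5% recovery threshold keeps an alert from flapping when the rate hovers near one value.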
How Statusfield Handles This
The hardest part of detecting third-party degradation is attribution. Your error rate went up — is it your code, your infrastructure, or the vendor? The fastest way to answer is to have the vendor's status as a parallel signal, ready before you start debugging.
Statusfield monitors 400+ services continuously and catches component-level degradation — not just full outages. Most vendors post a "degraded performance" notice before an incident becomes a full outage; Statusfield surfaces that the moment it changes and routes the alert to Slack, Discord, Telegram, email, or webhooks. You pick the services and components that matter and where alerts go; the polling, parsing, format-change handling, and delivery are handled for you.
That's the point: you get degradation attribution without instrumenting every vendor call, maintaining per-vendor baselines, or running and babysitting your own probes. You configure what matters; Statusfield watches it 24/7 and tells you the moment a dependency starts to slip.
FAQ
What's the difference between degradation and a partial outage? The terms are often used interchangeably by vendors. In practice, degradation typically means reduced performance (higher latency, lower throughput), while a partial outage means a subset of requests fails entirely. Both are distinct from a full outage. The practical impact on your system depends on which endpoints are affected and what your fallback behavior is.
How many samples do I need before I can trust an error rate measurement? As a rule of thumb, collect at least 20–30 samples before treating an error rate as meaningful. With 5 requests, a single failure shows up as a 20% error rate, which is misleading. For low-traffic endpoints where you might not get 30 samples in a few minutes, detection is inherently slower. Maintaining this sample-size discipline per endpoint, per vendor, is one of the reasons teams hand degradation detection to a dedicated service rather than building it in-house.
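One way to see why small samples mislead is to put a confidence interval around the observed rate. The sketch below uses the Wilson score interval, which is a standard statistical technique and an addition here, not something the rule of thumb above depends on. The 5% comparison point is an arbitrary illustration:

```python
from math import sqrt


def wilson_interval(failures, total, z=1.96):
    """Approximate 95% Wilson score interval for the true error rate."""
    if total == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Same observed 20% error rate, very different certainty.
low_5, high_5 = wilson_interval(1, 5)
low_30, high_30 = wilson_interval(6, 30)

print(f"1 of 5:  {low_5:.3f} to {high_5:.3f}")
print(f"6 of 30: {low_30:.3f} to {high_30:.3f}")
```

With 1 failure in 5 requests the interval's lower bound falls below 5%, so the data cannot rule out a nearly healthy endpoint. With 6 failures in 30 the lower bound is above 5%, and the interval is much narrower.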
Should I alert on every 429 from a vendor? A single 429 isn't worth an alert — it's expected behavior when you occasionally exceed burst limits. What matters is the rate: a sustained climb in 429s, or a 429 on an endpoint you've sized to stay within limits, signals real degradation. Distinguishing normal burst handling from systematically hitting the rate limit requires per-endpoint baselines, which is exactly the maintenance burden a managed monitor absorbs for you.
Why is detecting degradation harder than detecting a full outage? A full outage trips every threshold at once. Degradation sits just under those thresholds, drifts in and out, and hides inside aggregate metrics — so it reads as noise until users feel it. Catching it reliably means tracking tail latency and per-endpoint error rates against current baselines across every vendor, continuously. That's a standing commitment most teams underestimate, which is why a dedicated service is usually the better trade-off.
How do I know if a vendor's status page is trustworthy? Track the lag between when incidents start (as measured by your own metrics) and when the vendor posts them. Over several incidents you build a calibration — some vendors post within 5 minutes, others take 30 or more. Knowing this lag tells you how much lead time you give up by relying on their status page, and a continuous monitor that detects the change the moment it posts removes the need to check manually.