How to Monitor Third-Party APIs and SaaS Dependencies (Developer Guide)
Every app depends on third-party APIs. Here is how to monitor them properly — with alerts, runbooks, and the tools that actually work — so outages stop surprising your team.
Your app is only as reliable as its most unreliable dependency.
You've built solid internal systems. You have APM. You have error tracking. You have alerts for your own infrastructure. But when Stripe goes down, your payment flow breaks. When Auth0 is having issues, users can't log in. When Twilio has an outage, your notifications queue up silently.
And nine times out of ten, your team finds out from a customer complaint.
This guide covers how to properly monitor third-party APIs and SaaS dependencies — so you can stop being blindsided.
Why Third-Party Monitoring Is Different
Monitoring your own infrastructure is about metrics, logs, and traces. Third-party monitoring is a different problem:
- You have no access to their internals — you can't query their metrics
- Their status pages often lag reality by 15–30 minutes
- Outages can be partial — affecting some regions, some API endpoints, some customer tiers
- Silent failures are common — APIs return 200 OK but produce wrong results
- SLAs don't prevent incidents: 99.9% uptime still allows about 8.76 hours of downtime per year
The goal isn't to predict outages. It's to know about them as fast as possible, understand what's affected, and respond intelligently.
The Three-Layer Monitoring Strategy
Layer 1: Real-Time External Status Tracking
The fastest signal for third-party outages comes from monitoring their status pages and public health endpoints — at scale.
Tools like Statusfield monitor 2,000+ services in real time. Instead of you polling status.stripe.com every minute, Statusfield polls all your dependencies simultaneously and alerts you the moment something changes.
This gives you:
- 60-second detection instead of 15-30 minute lag
- Component-level specificity — is it their API, or just their dashboard?
- Historical patterns — which of your dependencies fails most often?
- A single alert channel instead of email subscriptions from 20 different vendors
Set this up first. It takes minutes and is the fastest path to awareness.
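To make the idea concrete, here is a minimal sketch of the change-detection step a status tracker performs between two polls. It assumes the vendor publishes a Statuspage-style `status.json` payload with a `status.indicator` field (`none`, `minor`, `major`, `critical`). Not every vendor does, and `detectStatusChange` is a hypothetical helper name, not part of any tool mentioned here.

```javascript
// Sketch: compare two snapshots of a Statuspage-style status.json payload.
// Assumption: each payload looks like { status: { indicator, description } }.
const SEVERITY = { none: 0, minor: 1, major: 2, critical: 3 };

function detectStatusChange(previous, current) {
  const before = previous.status.indicator;
  const after = current.status.indicator;
  if (before === after) return null; // nothing changed, stay quiet

  return {
    from: before,
    to: after,
    // "worse" means the severity rank went up
    direction: SEVERITY[after] > SEVERITY[before] ? 'worse' : 'better',
    description: current.status.description,
  };
}

// Illustrative payloads, not real vendor data
const lastPoll = { status: { indicator: 'none', description: 'All Systems Operational' } };
const thisPoll = { status: { indicator: 'major', description: 'Partial System Outage' } };
console.log(detectStatusChange(lastPoll, thisPoll));
```

Alerting only on a change, rather than on every non-green poll, is what keeps a long incident from producing one alert per minute.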
Layer 2: Synthetic API Monitoring
External status pages only tell you what the vendor reports. Synthetic monitoring actually calls their API from your perspective and measures:
- Response time — is it slow before it's officially "degraded"?
- Success rate — are you getting errors even when their status page is green?
- Correctness — does the response contain what you expect?
Implement this with:
// Example: health check for a critical third-party API.
// `stripe` is an initialized Stripe client; `alert` stands in for your alerting hook.
async function checkStripeHealth() {
  const start = Date.now();
  try {
    // Use a lightweight, read-only API call
    const response = await stripe.balance.retrieve();
    const latency = Date.now() - start;
    if (latency > 2000) {
      alert('Stripe response time degraded: ' + latency + 'ms');
    }
    return { status: 'ok', latency };
  } catch (error) {
    alert('Stripe API check failed: ' + error.message);
    return { status: 'error', error: error.message };
  }
}

Run this from a cron job every minute. Send failures to your alerting system (PagerDuty, Opsgenie, or even just a Slack webhook).
Key insight: Use the lightest possible API call that still exercises the auth path. For Stripe, balance.retrieve() is perfect — it's cheap, fast, and covers authentication + basic API availability.
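A latency-and-availability check does not cover correctness, the third measurement listed for this layer. A small validator like the sketch below can fill that gap. It assumes the response follows Stripe's documented balance object shape (`object: 'balance'` plus an `available` array of amount/currency entries); `validateBalance` is a hypothetical helper name.

```javascript
// Sketch: return a list of problems found in a balance response.
// An empty list means the payload looks like what we expect.
function validateBalance(balance) {
  const problems = [];
  if (!balance || balance.object !== 'balance') {
    problems.push('unexpected object type');
    return problems;
  }
  if (!Array.isArray(balance.available)) {
    problems.push('missing available funds array');
    return problems;
  }
  for (const entry of balance.available) {
    if (!Number.isInteger(entry.amount)) problems.push('non-integer amount');
    if (typeof entry.currency !== 'string') problems.push('missing currency');
  }
  return problems;
}

// Illustrative payloads, not real account data
console.log(validateBalance({ object: 'balance', available: [{ amount: 1200, currency: 'usd' }] }));
console.log(validateBalance({ object: 'error' }));
```

If the list is non-empty, treat the check as failed even though the HTTP call succeeded. That is how you catch the "200 OK but wrong results" case.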
Layer 3: Error Rate Monitoring in Your Own App
Your app already knows when third-party calls are failing — you're probably just not alerting on it specifically.
Add a tag/dimension to your error tracking for third-party errors:
# Pseudocode: adapt to your stack
try:
    result = third_party_api.call()
except ThirdPartyAPIError as e:
    # Tag this error with the vendor name
    sentry.capture_exception(e, tags={
        'vendor': 'stripe',
        'vendor_component': 'payment_processing'
    })
    raise

Then create a dashboard (in Datadog, Grafana, or whatever you use) that shows error rates per vendor. When Stripe errors spike from 0.1% to 15%, that's your early warning sign, often before their status page updates.
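If your APM cannot compute a per-vendor error rate for you, the calculation is simple enough to sketch. The class below keeps call outcomes per vendor in a sliding time window. The class name, the 2-minute default window, and the in-memory storage are all assumptions made for illustration; a production setup would more likely emit metrics and let the dashboard do the math.

```javascript
// Sketch: track success/failure per vendor over a sliding time window.
class VendorErrorRate {
  constructor(windowMs = 120000) {
    this.windowMs = windowMs;
    this.events = new Map(); // vendor -> [{ at, ok }]
  }

  record(vendor, ok, at = Date.now()) {
    if (!this.events.has(vendor)) this.events.set(vendor, []);
    this.events.get(vendor).push({ at, ok });
  }

  // Fraction of failed calls inside the window, between 0 and 1.
  rate(vendor, now = Date.now()) {
    const recent = (this.events.get(vendor) || []).filter(
      (e) => now - e.at <= this.windowMs
    );
    this.events.set(vendor, recent); // drop expired events
    if (recent.length === 0) return 0;
    return recent.filter((e) => !e.ok).length / recent.length;
  }
}

// 17 successful and 3 failed Stripe calls, all recorded at t = 1000 ms
const tracker = new VendorErrorRate();
for (let i = 0; i < 17; i++) tracker.record('stripe', true, 1000);
for (let i = 0; i < 3; i++) tracker.record('stripe', false, 1000);
console.log(tracker.rate('stripe', 2000)); // 3 failures out of 20 calls
```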
Which Dependencies to Monitor (And How to Prioritize)
Not all third-party services deserve the same monitoring investment. Prioritize by blast radius:
| Priority | Criteria | Examples | Monitoring Approach |
|---|---|---|---|
| P0 — Critical | App completely broken if this fails | Auth provider, primary database, payment processor | All three layers + PagerDuty |
| P1 — Important | Core features degraded, revenue impact | Email provider, CDN, primary API | Layers 1 + 2, Slack alert |
| P2 — Significant | Secondary features affected | Analytics, CRM sync, notifications | Layer 1 + error tracking |
| P3 — Minor | Nice-to-have features | Marketing integrations, non-critical APIs | Layer 1 only |
Create this list explicitly. Put it in a doc. Make sure your on-call rotation knows it cold.
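One way to keep that list honest is to store it as data next to your code, so monitoring config can be derived from it. The sketch below encodes the priority table above. The vendor entries, the layer labels, and the `slack-no-ping` channel name are illustrative assumptions, so swap in your own inventory.

```javascript
// Sketch: the priority table expressed as data.
const MONITORING_BY_PRIORITY = {
  P0: { layers: ['status', 'synthetic', 'error-rate'], alert: 'pagerduty' },
  P1: { layers: ['status', 'synthetic'], alert: 'slack' },
  P2: { layers: ['status', 'error-tracking'], alert: 'slack-no-ping' },
  P3: { layers: ['status'], alert: 'slack-no-ping' },
};

// Hypothetical inventory: replace with your real dependencies
const dependencies = [
  { vendor: 'stripe', role: 'payment processor', priority: 'P0' },
  { vendor: 'auth0', role: 'auth provider', priority: 'P0' },
  { vendor: 'twilio', role: 'notifications', priority: 'P2' },
];

function monitoringPlan(dep) {
  const plan = MONITORING_BY_PRIORITY[dep.priority];
  if (!plan) throw new Error(`Unknown priority for ${dep.vendor}: ${dep.priority}`);
  return { vendor: dep.vendor, ...plan };
}

for (const dep of dependencies) console.log(monitoringPlan(dep));
```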
Building Runbooks for Each Critical Dependency
When a P0 dependency goes down at 3 AM, your on-call engineer needs to respond in seconds — not spend 10 minutes figuring out what to do.
A runbook for each critical dependency should answer:
- How do we know it's actually down? (Status page URL, health check URL)
- What specifically breaks in our app? (Login, payments, notifications?)
- What's our graceful degradation strategy? (Cache, queue, disable feature flag?)
- Who do we notify? (Internal Slack channel, customer success, customers directly?)
- How do we communicate to users? (Status page message template)
- How do we recover when it's back? (Retry queued jobs, clear cache, etc.)
A minimal runbook template:
## Stripe Outage Runbook
**Detection:** statusfield.com/services/stripe OR error rate > 5% on /api/checkout
**Impact:** Payments fail. Users cannot purchase. Subscriptions cannot renew.
**Immediate actions:**
1. Confirm outage at status.stripe.com
2. Post in #incidents: "Stripe experiencing [outage type]. Payments affected. Monitoring."
3. Enable maintenance mode on /checkout (feature flag: DISABLE_PAYMENTS)
4. If > 30 min: notify customer success team
**Communication template:**
"We're currently experiencing issues with our payment processor (Stripe).
Existing subscriptions remain active, but new payments and renewals are temporarily
unavailable. We'll update this message when resolved. ETA: [ETA from status page]"
**Recovery:**
1. Confirm Stripe shows operational on status page
2. Disable maintenance mode
3. Process any queued/failed payments
4. Monitor error rate for 15 minutes post-recovery

This takes 30 minutes to write per critical service. It saves hours during incidents.
Graceful Degradation Patterns
The best monitoring strategy includes designing your app to survive dependency failures:
Pattern 1: Feature Flags / Kill Switches
if (!(await isStripeHealthy()) || featureFlag('DISABLE_PAYMENTS')) {
  return showMaintenanceMessage('Payments temporarily unavailable');
}

Implement a kill switch for each P0 dependency. When it's down, flip the flag in your feature flag system. Users see a clear message instead of a cryptic error.
Pattern 2: Circuit Breakers
A circuit breaker detects when a downstream service is failing and stops calling it — preventing cascading failures:
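// Assumes the opossum library, whose option names match the ones below
const CircuitBreaker = require('opossum');
// Wrap the call in a function so `this` stays bound to stripeApi
const breaker = new CircuitBreaker((params) => stripeApi.charge(params), {
  timeout: 3000, // 3 second timeout
  errorThresholdPercentage: 50, // Open if >50% of calls fail
  resetTimeout: 30000 // Try again after 30 seconds
});
breaker.on('open', () => {
  alert('Stripe circuit breaker opened: too many failures');
});

Libraries: opossum (Node.js), Resilience4j (Java), Polly (.NET).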
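If the state machine behind a circuit breaker is unfamiliar, this stripped-down sketch shows the three states in action. It is not how any of the libraries above are implemented. It is synchronous for brevity, it opens after a count of consecutive failures rather than a failure percentage, and `SimpleBreaker` and its options are hypothetical names.

```javascript
// Sketch: a minimal circuit breaker with three states.
// closed    -> calls pass through
// open      -> calls are rejected immediately
// half-open -> one trial call decides whether to close or re-open
class SimpleBreaker {
  constructor({ failureThreshold = 3, resetTimeout = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now;
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  call(fn) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error('circuit open');
      }
      this.state = 'half-open'; // reset timeout elapsed, allow one trial
    }
    try {
      const result = fn();
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}

// Demo with a fake clock so the behavior is deterministic
let clock = 0;
const demoBreaker = new SimpleBreaker({ failureThreshold: 2, now: () => clock });
const failing = () => { throw new Error('vendor down'); };
for (let i = 0; i < 2; i++) {
  try { demoBreaker.call(failing); } catch (err) { /* expected in this demo */ }
}
console.log(demoBreaker.state); // 'open'
```

While the breaker is open, callers fail fast instead of waiting on timeouts, which is what stops a slow vendor from tying up your own request threads.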
Pattern 3: Queue and Retry
For non-realtime operations, queue failed calls and retry when the service recovers:
// Instead of failing immediately when Twilio is down
async function sendNotification(userId, message) {
  try {
    await twilio.messages.create({...});
  } catch (error) {
    // Queue for later instead of throwing
    await notificationQueue.add({ userId, message }, {
      attempts: 5,
      backoff: { type: 'exponential', delay: 1000 }
    });
  }
}

This works well for: email notifications, SMS, webhooks, CRM syncs, analytics events.
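It helps to see what retry settings like `attempts: 5` with a 1000 ms exponential backoff mean in wall-clock time. The sketch below assumes plain doubling, where retry n waits `baseDelay * 2^(n - 1)`. The exact formula varies by queue library, and some add jitter, so treat the numbers as approximate. `backoffSchedule` is a hypothetical helper.

```javascript
// Sketch: delay before each retry under plain exponential backoff.
// Retry n (1-based) waits baseDelayMs * 2^(n - 1) milliseconds.
function backoffSchedule(attempts, baseDelayMs) {
  const delays = [];
  // The first attempt runs immediately, so there are attempts - 1 retries.
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * 2 ** (retry - 1));
  }
  return delays;
}

console.log(backoffSchedule(5, 1000)); // [1000, 2000, 4000, 8000]
```

Under this assumption the retries are spread over only about 15 seconds. For outages measured in minutes, you would likely want a larger base delay or more attempts.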
Alerting Configuration That Doesn't Create Noise
The goal is to know about outages immediately without alert fatigue.
Don't alert on:
- Single failed health checks (transient network blip)
- Scheduled maintenance windows
- Outages for P2/P3 services during off-hours
Do alert on:
- 2+ consecutive failed health checks for P0/P1 services
- Error rate above your chosen threshold (for example, 5%) for more than 2 minutes on critical paths
- Any status change for P0 services (even to degraded)
Configure severity correctly:
- P0 outages → PagerDuty / wake someone up
- P1 outages → Slack alert during business hours
- P2/P3 changes → Slack channel, no ping, no off-hours
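These rules are small enough to encode directly, which makes them easy to review and test. The sketch below covers failed health checks only, not status-page changes. The function name and channel names are hypothetical. The guide does not say what happens to P1 alerts off-hours, so posting them without a ping is an assumption.

```javascript
// Sketch: decide whether a failed health check should alert, and where.
// Returns a channel name, or null when no alert should be sent.
function decideAlert({ priority, consecutiveFailures, inMaintenance, offHours }) {
  if (inMaintenance) return null;            // scheduled maintenance window
  if (consecutiveFailures < 2) return null;  // single failure: likely a transient blip
  if (priority === 'P0') return 'pagerduty'; // wake someone up
  if (priority === 'P1') {
    // Assumption: off-hours P1 failures are posted without a ping
    return offHours ? 'slack-channel' : 'slack-alert';
  }
  return offHours ? null : 'slack-channel';  // P2/P3: no ping, nothing off-hours
}

console.log(decideAlert({ priority: 'P0', consecutiveFailures: 2, inMaintenance: false, offHours: true }));
console.log(decideAlert({ priority: 'P3', consecutiveFailures: 4, inMaintenance: false, offHours: true }));
```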
Tools Comparison
| Tool | Best For | Pricing |
|---|---|---|
| Statusfield | Monitoring known SaaS/API vendors in real-time | See plans |
| Datadog Synthetics | Synthetic API monitoring, teams already on Datadog | See plans |
| Checkly | Code-based API monitoring, developer-friendly | See plans |
| Better Uptime | Simple HTTP uptime monitoring | See plans |
| PagerDuty | Alert routing and on-call management | See plans |
For most teams: Statusfield for external status tracking + Checkly for synthetic monitoring + PagerDuty for alerting is a solid, cost-effective stack.
Getting Started Today
If you do nothing else from this guide, do these three things this week:
- Inventory your critical dependencies: list every third-party API your app calls. Be honest about what breaks without it.
- Set up external status monitoring: add your top 5 dependencies to Statusfield. Takes 10 minutes. You'll know about outages before your users do.
- Write one runbook: pick your most critical dependency (probably your auth provider or payment processor) and write the runbook. Just one. Then do one per week until you have coverage for all P0 services.
The third time a vendor outage causes you to scramble instead of execute, you'll wish you'd done this earlier.
Frequently Asked Questions
How is third-party API monitoring different from regular uptime monitoring?
Standard uptime monitoring checks if your app is up. Third-party monitoring checks if the services your app depends on are up. Both matter, but third-party monitoring is often overlooked — and it's often the reason your app is broken even when your own infrastructure is healthy.
Should I trust vendor status pages?
Partially. They're reliable eventually, but they lag real incidents by 15–30 minutes. Use them as a confirmation tool, not a first-alert system. External monitoring tools detect incidents faster.
What's the minimum monitoring setup for a small startup?
Statusfield for external status tracking + basic error rate alerts in whatever APM you already use. That covers most third-party outage scenarios with minimal setup time.
How do I know which of my dependencies is most unreliable?
Track it. Statusfield's historical data shows incident frequency per service. After 30 days, you'll know which vendors have the worst track record — and you can prioritize your mitigation efforts accordingly.
Do SLAs protect me from vendor outages?
SLAs provide credits, not prevention. A 99.9% SLA allows about 8.76 hours of downtime per year. Service credits don't compensate for lost revenue, customer churn, or engineering time spent fighting fires. SLAs are a business backstop, not an operational guarantee.
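The downtime figure is simple arithmetic: the fraction of time not covered by the SLA, multiplied by the hours in a year. A quick sketch, assuming a 365-day year (the function name is illustrative):

```javascript
// Sketch: downtime per year that an uptime percentage still permits.
function allowedDowntimeHours(uptimePercent, hoursPerYear = 365 * 24) {
  return (1 - uptimePercent / 100) * hoursPerYear;
}

for (const sla of [99, 99.9, 99.99]) {
  console.log(`${sla}% -> ${allowedDowntimeHours(sla).toFixed(2)} h/year`);
}
```

Running it for your own vendors' SLA tiers shows how much each extra "nine" actually buys you.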