Why Your App Goes Down Even When Your Own Infrastructure Is Fine

Your servers are healthy. Your database is responding. Your own metrics look clean. But your users are getting errors. The culprit is almost always a silent failure upstream. Here's what to look for.


Your deployment is clean. Your database is healthy. CPU is flat. Memory is nominal. Error rate is — wait, actually the error rate is spiking. But that's impossible, because nothing on your side changed.

This is one of the most disorienting production incidents a team can face: your own systems are fine, but users are reporting failures. And because everything looks green in your internal dashboards, the first twenty minutes get spent debugging code that isn't the problem.

The culprit is almost always upstream.

The Silent Failure Pattern

Many production incidents at SaaS companies trace back to a dependency rather than to the company's own infrastructure or code. The pattern is remarkably consistent:

  1. Users start experiencing failures
  2. First support ticket arrives — "your site is broken"
  3. On-call engineer checks their own systems — everything looks healthy
  4. Engineers start debugging: recent deploys, database queries, connection pools
  5. Fifteen to twenty minutes in, someone finally checks the status page of a third-party service
  6. There it is — an incident posted eight minutes ago

Those fifteen to twenty minutes of wasted investigation time are the cost of not monitoring your dependencies with the same rigor as your own infrastructure. Multiply that by the number of incidents per year, and it's a significant engineering drain. Multiply it by users who gave up during that window, and it's a revenue and retention problem.
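To put a rough number on that drain, the multiplication can be written out. This is only a back-of-the-envelope sketch: the function name and the inputs (12 incidents a year, 20 minutes lost each, 3 engineers pulled in) are hypothetical and should be replaced with figures from your own incident history.

```python
def wasted_engineer_hours(incidents_per_year: int,
                          minutes_lost_per_incident: float,
                          engineers_involved: int) -> float:
    """Engineer-hours per year spent debugging the wrong system."""
    total_minutes = (incidents_per_year
                     * minutes_lost_per_incident
                     * engineers_involved)
    return total_minutes / 60


# Hypothetical inputs, not measured data.
hours = wasted_engineer_hours(incidents_per_year=12,
                              minutes_lost_per_incident=20,
                              engineers_involved=3)
print(f"{hours:.1f} engineer-hours per year")
```

The figure covers engineering time only. Lost conversions during the same window would need a separate estimate.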

Why Your Own Monitoring Misses This

APM tools, error tracking, and infrastructure monitoring are all designed to look inward. They're excellent at what they do — and completely blind to what happens upstream.

When Auth0 is down, your APM shows elevated latency on your login endpoint. It does not tell you why. Your error tracking shows authentication failures. It does not attribute them to Auth0. Your logs show 401 Unauthorized responses going out to users. They do not tell you that your calls to Auth0's API are timing out.

Your internal monitoring captures the symptom but not the cause. The diagnostic chain stays internal: you check your own login code, your own database, your own token validation logic — none of which are the problem. The cause lives outside your observability stack entirely, in a vendor's infrastructure you have no direct visibility into.
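One way to narrow that gap from inside your own code is to label every outbound call with the dependency it talks to, so a failure shows up already attributed to a vendor. The sketch below is a minimal illustration, not a feature of any particular APM product: `upstream`, `UpstreamError`, `upstream_failures`, and the `verify_token` callable are all hypothetical names, and the in-memory counter stands in for whatever metrics client you actually use.

```python
from collections import Counter
from contextlib import contextmanager

# In-memory counter standing in for a real metrics client (assumption).
upstream_failures: Counter = Counter()


class UpstreamError(Exception):
    """An outbound call failed; carries the name of the dependency."""

    def __init__(self, dependency: str, cause: BaseException):
        super().__init__(f"{dependency} call failed: {cause!r}")
        self.dependency = dependency


@contextmanager
def upstream(dependency: str):
    """Wrap an outbound call so any failure is attributed to `dependency`."""
    try:
        yield
    except Exception as exc:
        upstream_failures[dependency] += 1
        # Chain the original exception so the real cause stays visible.
        raise UpstreamError(dependency, exc) from exc


def login(verify_token):
    # `verify_token` is a placeholder for whatever calls the auth provider.
    with upstream("auth-provider"):
        return verify_token()
```

With this in place, an error tracker groups the failures under "auth-provider" rather than under the login handler. It still cannot say whether the vendor is having an incident, which is why the external status signal remains necessary.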

This is why third-party monitoring is a different discipline from internal observability. You need a system that watches external sources — specifically, official vendor status pages — and alerts you the moment something changes there.

The Common Patterns of Silent Upstream Failures

Understanding which upstream failures produce which symptoms helps you shortcut the investigation when they happen.

Auth provider down → all logins fail, your code looks fine

When Clerk, Auth0, Okta, or any similar provider has an incident, your app's login flow fails entirely. But from an internal monitoring perspective, your code is doing exactly what it's supposed to do — calling the auth API. The API is just returning errors. Your own error rate spikes, but the root cause is invisible unless you're watching the auth provider's status page.

The trap: engineers check their session handling code, their JWT validation, their cookie settings. None of it helps. The problem resolves on its own when the auth provider recovers — which looks, to the inexperienced observer, like the debugging "worked."

CDN down → your app loads fine, static assets are broken

A CDN incident is particularly deceptive because it tends to be partial. Your server-side rendering works. Your API responds normally. But JavaScript bundles, CSS, or image assets served from Cloudflare or Fastly are returning 503s or timing out.

From your server monitoring: completely clean. From your users: the page loads but looks broken, or interactive elements don't work because the JavaScript never arrived. Support tickets say "your site looks weird" or "buttons don't work" — symptoms that don't obviously point to a CDN issue.
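Because server-side monitoring never sees these requests, a small probe that fetches the assets the way a browser would can surface the problem. The following is a sketch using only the Python standard library. The function names are hypothetical, you would supply your own asset URLs, and it assumes a HEAD request is enough to tell whether an asset is being served.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a HEAD request, or 0 on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions
    except OSError:
        return 0  # DNS failure, refused connection, or timeout


def broken_assets(urls: Iterable[str],
                  fetch_status: Callable[[str], int] = http_status
                  ) -> Dict[str, int]:
    """Map each asset URL that did not return 200 to the status observed."""
    results = {url: fetch_status(url) for url in urls}
    return {url: status for url, status in results.items() if status != 200}
```

Run from a scheduler against your main JavaScript and CSS bundle URLs, a non-empty result is a strong hint that "buttons don't work" tickets are a delivery problem rather than a code problem. The `fetch_status` parameter exists so the check can be exercised without network access.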

Payment API down → checkout fails, your error tracking shows nothing relevant

Stripe, Paddle, and similar payment processors have their own SDKs with their own error handling. When the Stripe API is degraded, your checkout code catches the Stripe error and — if you've built it well — shows a graceful error message. Your own error tracking may show a spike in PaymentError events, but nothing that obviously says "Stripe is having an incident."

Meanwhile, every user who hits checkout during that window bounces. No purchase. No subscription. Each one is a conversion you'll never recover.

Email or SMS provider down → notifications silently queue or fail

SendGrid, Postmark, and Twilio incidents are the quietest of all. Your code attempts to send. The API call fails. If you have a retry queue, messages pile up silently. If you don't, they're dropped. Users don't receive password reset emails, confirmation messages, or alerts.

The first sign is often a customer support ticket hours later: "I never got my confirmation email." Your own error tracking may show nothing if the failure happened in an async job that wasn't instrumented properly.
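The fix for the "wasn't instrumented properly" case is to count every outcome of a send attempt and to expose the size of the backlog, so a growing queue becomes an alert rather than a surprise. Below is a minimal in-memory sketch. `NotificationSender` and its fields are hypothetical, the `send` callable stands in for your provider's client, and a production queue would be durable rather than a `deque`.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict


@dataclass
class NotificationSender:
    """Sends messages via `send`, queueing failures and counting outcomes."""
    send: Callable[[str], None]  # placeholder for the provider call
    max_queue: int = 1000
    pending: Deque[str] = field(default_factory=deque)
    stats: Dict[str, int] = field(
        default_factory=lambda: {"sent": 0, "queued": 0, "dropped": 0})

    def deliver(self, message: str) -> bool:
        """Try to send; on failure, queue the message or count it dropped."""
        try:
            self.send(message)
        except Exception:
            if len(self.pending) < self.max_queue:
                self.pending.append(message)
                self.stats["queued"] += 1
            else:
                self.stats["dropped"] += 1
            return False
        self.stats["sent"] += 1
        return True

    def retry_pending(self) -> int:
        """Retry each queued message once; return how many went through."""
        delivered = 0
        for _ in range(len(self.pending)):
            message = self.pending.popleft()
            try:
                self.send(message)
            except Exception:
                self.pending.append(message)  # still failing, keep it queued
            else:
                delivered += 1
                self.stats["sent"] += 1
        return delivered

    def backlog(self) -> int:
        """Current queue depth; the number worth alerting on."""
        return len(self.pending)
```

Exporting `backlog()` and the `dropped` count to your metrics system turns a quiet provider outage into a visible one long before the first "I never got my email" ticket.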

Building a Dependency Health Check Dashboard

The operational answer to this problem is a dependency status dashboard that runs alongside your internal monitoring — a single view that shows the health of every service your app depends on, updated in real time.

This dashboard has two inputs: official vendor status pages (the authoritative source for vendor-confirmed incidents) and, optionally, synthetic API health checks (your own probes that call vendor APIs from your infrastructure). Together they show both what the vendor has confirmed and what you are actually experiencing.
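The way the two inputs combine can be sketched as a small decision function. This is an illustration under stated assumptions, not a description of how any specific tool works: it assumes the vendor's status page is hosted on Atlassian Statuspage, whose public `status.json` endpoint reports a `status.indicator` of `none`, `minor`, `major`, or `critical`. Vendors on other platforms need a different parser. The `Verdict` names and the sample payload are made up for the example.

```python
import json
from enum import Enum


class Verdict(Enum):
    OPERATIONAL = "no vendor incident, probe passing"
    CONFIRMED_INCIDENT = "vendor has posted an incident"
    UNCONFIRMED_FAILURE = "probe failing, vendor status still green"


def vendor_indicator(status_json: str) -> str:
    """Extract the indicator from a Statuspage-style status.json body."""
    return json.loads(status_json)["status"]["indicator"]


def dependency_verdict(indicator: str, probe_ok: bool) -> Verdict:
    """Combine the vendor's own status with your synthetic probe result."""
    if indicator != "none":
        return Verdict.CONFIRMED_INCIDENT
    if not probe_ok:
        # You see failures the vendor has not acknowledged yet.
        return Verdict.UNCONFIRMED_FAILURE
    return Verdict.OPERATIONAL


# Illustrative payload in the assumed format, not real vendor data.
sample = ('{"page": {"name": "Example Vendor"}, '
          '"status": {"indicator": "major", '
          '"description": "Partial System Outage"}}')
print(dependency_verdict(vendor_indicator(sample), probe_ok=True).value)
```

The middle case is the reason to run both inputs. A failing probe with a green status page often means the vendor has not posted yet, though it can also point to a network problem on your side.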

Statusfield monitors official vendor status pages for you. Rather than building and maintaining a custom integration for every status page in your dependency stack, you add the services you depend on, and Statusfield watches them continuously and alerts you the moment a vendor posts an incident.

The practical result: when your error rate spikes, the first thing your on-call engineer checks is the Statusfield dashboard. If a dependency is showing an incident, root cause is established in thirty seconds instead of twenty minutes. If everything shows operational, you know with confidence the problem is internal and can debug accordingly.

Why APM Doesn't Capture This

Application Performance Monitoring is designed to trace execution through your own code. When an outbound API call fails, APM captures the failure — the latency spike, the error response code, the timeout — but it has no mechanism to determine whether that failure is caused by your code or by the vendor's infrastructure.

APM tells you what failed. Third-party status monitoring tells you why. Both are necessary. Neither replaces the other.

The engineers who reach root cause fastest during upstream incidents are the ones who have already answered the question "is anything upstream showing an incident?" before they start debugging their own code. That question takes thirty seconds to answer if you have a dependency dashboard. It takes fifteen minutes if you're doing it manually by opening status pages one by one.

FAQ

Why does traditional monitoring miss third-party failures? Traditional APM and infrastructure monitoring look inward: at your own code, your own servers, your own database. When a vendor's API fails, your internal monitoring captures the symptoms (elevated latency, error responses) but has no mechanism to attribute them to an upstream cause. You need a separate system watching external status sources.

What are the most common sources of silent upstream failures? Auth providers (all logins fail), CDNs (assets silently unavailable), payment processors (checkout fails with confusing errors), and email or SMS providers (notifications silently dropped or queued). These are the four categories most likely to send engineers debugging the wrong system.

How do you add third-party status to your internal dashboard? The fastest path is a tool like Statusfield that monitors vendor status pages and exposes the data via webhook or API. You can feed that data into your existing dashboard — Datadog, Grafana, a custom status page — so upstream status sits alongside your internal metrics in one view.

What's the fastest way to check if a dependency is down during an incident? Before the incident, configure a monitoring tool to watch your dependencies continuously. During an incident, your first question should be "is anything in my dependency stack showing an incident?" — answered with a single dashboard check, not manual status page browsing.

How does Statusfield surface upstream failures? Statusfield monitors official vendor status pages continuously and alerts you the moment a vendor posts an incident — before your support queue fills up, before your engineers spend twenty minutes debugging the wrong thing. It covers hundreds of services including Stripe, AWS, GitHub, Twilio, Cloudflare, and many more.

Should synthetic monitoring replace status page monitoring? No — they're complementary. Synthetic monitoring (your own probes calling vendor APIs) detects failures from your perspective before they're officially acknowledged. Status page monitoring tells you what the vendor has confirmed. Run both for critical dependencies: synthetic for early detection, status page for authoritative confirmation.
