What Vendor SLAs Don't Tell You About Actual Reliability

99.9% uptime sounds like a near-guarantee. It's the kind of number that gets cited in sales calls and procurement reviews with confidence. But 99.9% uptime means your vendor is contractually permitted to be unavailable for about 8.76 hours per year, and that downtime can happen in a single continuous incident on the worst possible day.

SLAs are contracts, not operational reality. Understanding the gap between the two is one of the most undervalued skills in infrastructure planning.

The SLA Math You Should Know

Most vendors publish their SLA as a percentage uptime figure. Here's what those percentages actually mean in practice:

SLA	Downtime per year	Downtime per month	Downtime per week
99.0%	87.6 hours	7.3 hours	1.7 hours
99.5%	43.8 hours	3.6 hours	50 minutes
99.9%	8.76 hours	43.8 minutes	10 minutes
99.95%	4.4 hours	21.9 minutes	5 minutes
99.99%	52.6 minutes	4.4 minutes	1 minute
99.999%	5.3 minutes	26 seconds	6 seconds

Most SaaS infrastructure vendors (cloud providers, payment processors, database-as-a-service platforms) offer between 99.9% and 99.99%. The numbers look close, but 99.9% permits ten times the downtime of 99.99%: 8.76 hours a year against 52.6 minutes.
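
The figures in the table come from one multiplication: the length of the period times the fraction of time the SLA leaves uncovered. Here is a minimal sketch of that arithmetic. The function name `downtime_budget` is ours, and it assumes a 365-day year with months treated as one twelfth of a year.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored for simplicity


def downtime_budget(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> timedelta:
    """Return the downtime an SLA percentage permits over a period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    allowed_fraction = 1 - sla_percent / 100
    return timedelta(hours=period_hours * allowed_fraction)


for sla in (99.0, 99.9, 99.99):
    yearly = downtime_budget(sla)
    monthly = downtime_budget(sla, HOURS_PER_YEAR / 12)
    print(f"{sla}%: {yearly.total_seconds() / 3600:.2f} h/year, "
          f"{monthly.total_seconds() / 60:.1f} min/month")
```

Real contracts often measure uptime per calendar month or billing cycle rather than per year, so check the measurement window before reusing these numbers.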

The more important distinction is how that downtime is distributed. An SLA doesn't specify whether your 8.76 hours of allowable downtime happens as roughly 525 one-minute blips spread through the year, or as a single 8.76-hour outage on a Tuesday when you're running a promotion. The contract allows either.

What SLA Credits Actually Buy You

When a vendor misses their SLA, they typically offer service credits — a percentage of your monthly bill returned as account credit. On the surface, this looks like accountability. In practice, the economics rarely work in your favor.

Consider: you pay $200/month for a database service with a 99.9% SLA. The service goes down for 6 hours, you lose $10,000 in sales, and support volume spikes. The vendor misses their SLA. Your credit: $20 — 10% of your monthly bill.

SLA credits are not indemnification. They're a gesture. The fine print usually includes caps (maximum credit per month is often 30% of the monthly fee), exclusions (scheduled maintenance doesn't count, events outside the vendor's control don't count), and you typically have to file a claim within a tight window.

This isn't a criticism of the vendors who offer them — it's a realistic description of what SLA credits are. They're a pricing tool, not a reliability guarantee.
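
To make the gap concrete, here is a rough sketch of the credit calculation using the numbers from the example above. The 10% credit rate and the 30% cap are taken from this article's illustration, not from any particular vendor's contract, and `sla_credit` is a hypothetical helper.

```python
def sla_credit(monthly_fee: float, credit_percent: float, cap_percent: float = 30.0) -> float:
    """Service credit in dollars, limited by the contract's monthly cap."""
    effective_percent = min(credit_percent, cap_percent)
    return monthly_fee * effective_percent / 100


monthly_fee = 200.0    # monthly bill from the example above
lost_sales = 10_000.0  # revenue lost during the 6-hour outage
credit = sla_credit(monthly_fee, credit_percent=10.0)

print(f"Credit: ${credit:.2f}")
print(f"Share of lost sales recovered: {credit / lost_sales:.2%}")
```

Even at the cap, the credit on a $200 bill would be $60, which is why the credit schedule is a poor proxy for the cost of an outage.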

Cluster vs. Spread: Why Distribution Matters More Than the Headline Number

A vendor with 99.9% uptime and a perfectly uniform distribution of incidents would have about 10 minutes of downtime per week. No single incident would be long enough to materially affect your operations.

Real incidents don't work that way. Major cloud and SaaS providers have historical incidents that cluster — a significant architecture failure might cause a 4-hour outage, which alone consumes nearly half the annual SLA budget in a single event. The following eleven months might be perfect.

From an architectural planning perspective, what matters is:

How long are typical incidents? A vendor that has 24 five-minute incidents per year is very different from one that has 2 one-hour incidents, even though the total downtime is the same two hours.
Which components fail? Core payment processing going down for 30 minutes is worse than the admin dashboard being unavailable for 4 hours.
What time do incidents tend to occur? Some platforms have incident patterns correlated with deployment windows or traffic peaks.

None of this information is in the SLA. It's in the historical incident record.
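
The three questions above can be answered from a plain list of incidents. Below is a minimal sketch, assuming you already have each incident's component, start time, and duration. The `Incident` type, the `summarize` function, and the sample data are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Any, Dict, List


@dataclass(frozen=True)
class Incident:
    component: str
    started: datetime  # assumed to be UTC
    minutes: float


def summarize(incidents: List[Incident]) -> Dict[str, Any]:
    """Describe how downtime is distributed, not just how much there was."""
    durations = [i.minutes for i in incidents]
    total = sum(durations)
    longest = max(durations, default=0)
    return {
        "count": len(incidents),
        "total_minutes": total,
        "median_minutes": median(durations) if durations else 0,
        "longest_minutes": longest,
        # Share of all downtime caused by the single worst incident.
        "longest_share": longest / total if total else 0.0,
        "by_component": dict(Counter(i.component for i in incidents)),
        "by_hour_utc": dict(Counter(i.started.hour for i in incidents)),
    }


# Hypothetical history: three short blips and one long outage.
history = [
    Incident("api", datetime(2024, 1, 9, 14, 5), 5),
    Incident("webhooks", datetime(2024, 3, 2, 14, 40), 7),
    Incident("webhooks", datetime(2024, 6, 18, 3, 15), 240),
    Incident("api", datetime(2024, 9, 30, 14, 20), 4),
]
for key, value in summarize(history).items():
    print(f"{key}: {value}")
```

A high `longest_share` is the signature of clustered downtime: most of the budget went to one event, which an uptime percentage alone would not reveal.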

Tracking Empirical Reliability

The alternative to trusting SLA percentages is tracking actual incident history. This means:

Monitoring incident frequency, not just uptime. A vendor that has frequent short incidents may have a better uptime percentage than one with rare long incidents, but the frequent-incident vendor may be harder to build reliably on top of.

Watching component-level reliability. Overall uptime hides component volatility. A payment processor might have 99.99% overall uptime while their webhook delivery component has three incidents per quarter. If your system depends on webhooks, that's the number that matters.

Building your own historical record. Vendor status pages are authoritative, but their historical archives vary in quality. Some vendors purge old incidents from their status pages. Some underreport the scope. Having your own timestamped log of when incidents occurred, which components were affected, and how long they lasted gives you data that the vendor doesn't control.
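
One low-effort way to keep such a log is an append-only file with one JSON record per incident. The sketch below is an assumption about what that could look like, not a description of any tool's format: the `IncidentRecord` fields, the helper names, and the `ExampleVendor` entry are all hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from datetime import datetime
from pathlib import Path
from typing import List


@dataclass
class IncidentRecord:
    vendor: str
    component: str
    started_at: str   # ISO 8601 with UTC offset, e.g. 2024-06-18T03:15:00+00:00
    resolved_at: str  # same format as started_at
    note: str = ""

    @property
    def duration_minutes(self) -> float:
        start = datetime.fromisoformat(self.started_at)
        end = datetime.fromisoformat(self.resolved_at)
        return (end - start).total_seconds() / 60


def append_record(path: Path, record: IncidentRecord) -> None:
    """Append one incident as a single JSON line; earlier lines are never rewritten."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def load_records(path: Path) -> List[IncidentRecord]:
    with path.open(encoding="utf-8") as f:
        return [IncidentRecord(**json.loads(line)) for line in f if line.strip()]


# Demo against a temporary file so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "incidents.jsonl"
    append_record(log_path, IncidentRecord(
        vendor="ExampleVendor",
        component="webhooks",
        started_at="2024-06-18T03:15:00+00:00",
        resolved_at="2024-06-18T07:15:00+00:00",
        note="delayed webhook delivery",
    ))
    for record in load_records(log_path):
        print(record.vendor, record.component, record.duration_minutes)
```

Because the file only grows, a vendor editing or purging its status page later does not change what you recorded at the time.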

This is where operational monitoring becomes a strategic input. Statusfield monitors official vendor status pages continuously and logs incidents as they occur. Over time, this builds an empirical picture of how reliable your dependencies actually are — not what the SLA says, but what the historical record shows.

Using Reliability Data to Make Architectural Decisions

Once you have historical incident data on your dependencies, it changes how you architect:

High frequency, short duration incidents → Add circuit breakers and retry logic. The service is generally reliable but occasionally blips. Your code should handle transient failures gracefully without surfacing errors to users.

Low frequency, long duration incidents → Add a fallback or degraded mode. If a vendor goes down for hours at a time a few times a year, you need a plan for what your application does during those hours. Can you serve cached data? Can you queue writes and process them on recovery?

Component-specific reliability issues → Architect around the weak component. If webhook delivery is unreliable but the API is solid, don't build core flows that depend on webhooks being delivered immediately. Use polling or a hybrid approach.

Incident clustering around deployments → Add deployment-aware circuit breakers. If a vendor tends to have incidents immediately after their own releases, you can monitor their deployment announcements and apply extra caution in the 30 minutes following a vendor deploy.
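
For the first pattern, a circuit breaker combined with bounded retries might look like the sketch below. This is a simplified illustration rather than a production library: the class and function names, the thresholds, and the backoff schedule are all assumptions to tune against your own incident data.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and vendor calls are being skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("vendor calls suspended")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the breaker.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def call_with_retry(breaker: CircuitBreaker, fn: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.2,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures with exponential backoff; fail fast if the breaker is open."""
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The open breaker is deliberately not retried: during a long outage you want calls to fail immediately so a fallback or degraded mode can take over, instead of stacking up backoff delays. For the deployment-aware variant, one option is to lower `failure_threshold` temporarily after a vendor announces a release.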

The SLA number doesn't tell you any of this. Empirical data does.
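
As a rough illustration of turning that data into a decision, the mapping above can be written as a small rule set. Every threshold here is an arbitrary assumption chosen for the example (what counts as "short", "long", or "frequent" depends on your business), and `suggest_patterns` is a hypothetical helper, not an established heuristic.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Incident:
    component: str
    minutes: float


def suggest_patterns(incidents: List[Incident],
                     short_max_minutes: float = 15,   # assumed cutoff for a "blip"
                     long_min_minutes: float = 60,    # assumed cutoff for a "long" outage
                     frequent_min_count: int = 6,     # assumed "high frequency" per period
                     component_share: float = 0.5) -> List[str]:
    """Map an incident history to the mitigation patterns described above."""
    suggestions = []
    short = [i for i in incidents if i.minutes <= short_max_minutes]
    long_running = [i for i in incidents if i.minutes >= long_min_minutes]

    if len(short) >= frequent_min_count:
        suggestions.append("circuit breakers and retry logic")
    if long_running:
        suggestions.append("fallback or degraded mode")

    counts = Counter(i.component for i in incidents)
    if counts:
        component, n = counts.most_common(1)[0]
        # One component responsible for most incidents is a weak spot to design around.
        if n >= 2 and n / len(incidents) > component_share:
            suggestions.append(f"architect around {component}")
    return suggestions


# Hypothetical year: six short webhook blips and one long API outage.
history = [Incident("webhooks", 4)] * 6 + [Incident("api", 180)]
print(suggest_patterns(history))
```

The value of a sketch like this is less the output than the exercise of writing the thresholds down, which forces a conversation about what your architecture can and cannot absorb.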

FAQ

What does a 99.9% SLA actually mean in hours of downtime? A 99.9% SLA permits 8.76 hours of downtime per year, or about 43.8 minutes per month. The key point most teams miss: that downtime can be concentrated in a single incident rather than spread evenly. A vendor can have 99.9% annual uptime with one 8-hour outage on a bad day, and still be within their contractual obligation.

Do SLA credits compensate for lost revenue during an outage? Generally no. SLA credits are calculated as a percentage of your monthly bill — typically 10–30% depending on the severity and duration of the outage. If a payment processor outage causes you to lose sales, the credit you receive is almost always a fraction of the actual business impact. SLA credits are a contractual courtesy, not business indemnification.

How do you track actual vendor uptime rather than relying on their SLA? Monitor their official status page over time and log incidents as they occur. Statusfield does this automatically — it monitors vendor status pages continuously and records incidents with timestamps, affected components, and duration. Over weeks and months, this builds an empirical reliability record that reflects actual operational behavior rather than contractual guarantees.

What should you look at besides the overall SLA percentage? Component-level reliability is often more relevant than the overall number. A payment processor with 99.99% overall uptime might have a specific component — checkout, webhooks, or a regional API endpoint — that has incidents quarterly. Identify which components your application depends on most critically and track their specific reliability, not just the aggregate.

How do you use historical monitoring data to make architectural decisions? Pattern the incident history into categories: high-frequency short incidents suggest you need circuit breakers and retry logic; low-frequency long incidents suggest you need a fallback or degraded mode; component-specific patterns suggest architecting around the weak component. The goal is to make your system's failure modes match the actual failure patterns of your dependencies, rather than planning for a generic "outage."

When should a reliability pattern trigger a vendor switch? When the empirical data shows a reliability pattern that your architecture can't compensate for, and the business impact of incidents is material. A vendor with occasional blips is manageable with good retry logic. A vendor with recurring multi-hour outages on core components, where no reasonable fallback exists, warrants evaluating alternatives — regardless of what their SLA says.
