SLA Reality Check: What Those Uptime Numbers Actually Mean for Your Business
I had a board member tell me last year that our cloud provider guaranteed “five nines of uptime” and therefore our systems should never go down. I spent the next twenty minutes explaining why that statement was wrong in at least three different ways.
SLAs are one of the most misunderstood concepts in enterprise IT. Business stakeholders treat them as guarantees. Vendors use them as marketing tools. And IT leaders are stuck in the middle trying to explain that the number on the contract doesn’t mean what anyone thinks it means.
The Mathematics Nobody Does
Here’s what those uptime percentages translate to in actual downtime per year:
- 99% uptime: 3.65 days of downtime
- 99.9% (three nines): 8 hours and 46 minutes
- 99.99% (four nines): 52 minutes and 36 seconds
- 99.999% (five nines): 5 minutes and 15 seconds
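The conversion above is easy to script so you can check any target yourself. A minimal sketch in Python, assuming a 365.25-day year (tables that assume exactly 365 days will differ by a few seconds):

```python
# Convert an uptime percentage into downtime per year.
# Assumes a 365.25-day year; figures may differ slightly from
# tables that use exactly 365 days.

HOURS_PER_YEAR = 365.25 * 24

def downtime_per_year(uptime_pct: float) -> str:
    """Return yearly downtime for a given uptime percentage."""
    hours = (1 - uptime_pct / 100) * HOURS_PER_YEAR
    h, rem = divmod(hours * 3600, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h)}h {int(m)}m {int(s)}s"

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_per_year(pct)} downtime/year")
```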
Major cloud providers typically offer somewhere between three and four nines for core compute services, depending on how the workload is deployed. Five nines is exceptionally rare and usually applies to very specific services, not your entire application stack.
Here’s what nobody tells the board: your application’s uptime can only be as good as the weakest link in your dependency chain. If your application depends on a database, a cache layer, a CDN, an authentication service, and a payment gateway, the compound availability is the product of all those individual availabilities. Five services at 99.9% each gives you roughly 99.5%---about 44 hours of potential downtime per year.
That’s a full working week where something in your stack could be broken.
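The dependency-chain arithmetic is worth running against your own stack. A sketch of the calculation, using the five-service example above:

```python
# Compound availability of a serial dependency chain is the
# product of each dependency's individual availability.

import math

HOURS_PER_YEAR = 365.25 * 24

def compound_availability(availabilities: list[float]) -> float:
    """Overall availability when every dependency must be up."""
    return math.prod(availabilities)

# Five services, each promising three nines (99.9%)
chain = [0.999] * 5
overall = compound_availability(chain)
downtime_hours = (1 - overall) * HOURS_PER_YEAR

print(f"Compound availability: {overall:.4%}")                    # ~99.50%
print(f"Potential downtime:    {downtime_hours:.1f} hours/year")  # ~43.7
```

Swap in your own numbers: one weak 99% dependency drags the whole chain down far more than four strong ones can compensate for.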
What SLAs Actually Guarantee
Most SLAs don’t guarantee uptime at all. What they guarantee is a service credit if uptime falls below the stated threshold.
Read the fine print on your AWS, Azure, or Google Cloud SLA. If they miss their target, you typically get a 10-25% service credit for that billing period. So if you’re paying $50,000 per month and they have a four-hour outage, you might get $12,500 back. Meanwhile, that outage cost your business $500,000 in lost revenue.
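The gap between credit and impact is stark when you put the two side by side. A back-of-the-envelope sketch using the illustrative figures from the example above:

```python
# Service credit vs. business impact for a hypothetical outage.
# All figures are illustrative, matching the example above.

monthly_bill = 50_000    # what you pay the provider per month
credit_rate = 0.25       # top-tier SLA credit (often 10-25%)
revenue_loss = 500_000   # what the outage cost the business

credit = monthly_bill * credit_rate
uncovered = revenue_loss - credit

print(f"Service credit: ${credit:,.0f}")     # $12,500
print(f"Uncovered loss: ${uncovered:,.0f}")  # $487,500
```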
SLAs are incentive structures, not insurance policies. They push the provider to maintain high availability, but they don’t protect you when things go wrong. That protection has to come from your own architecture and operational practices.
How IT Leaders Should Think About SLAs
Stop treating vendor SLAs as your availability strategy. Here’s a more productive framework:
Define your own internal SLOs first. Not every application requires four nines. Your marketing website can tolerate a few hours of downtime. Your payment processing system cannot. Map each application to a business impact category and set internal service level objectives accordingly.
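In practice this mapping can live as a small catalogue that both IT and the business can read. A sketch of what that might look like; the tier names, targets, and example applications are illustrative, not a standard:

```python
# Illustrative internal SLO catalogue: map each application to a
# business-impact tier rather than defaulting everything to four nines.
# Tier names, targets, and examples here are assumptions for illustration.

SLO_TIERS = {
    "critical":  {"target": 0.9999, "example": "payment processing"},
    "important": {"target": 0.999,  "example": "customer portal"},
    "standard":  {"target": 0.99,   "example": "marketing website"},
}

def allowed_downtime_hours(target: float, hours: float = 365.25 * 24) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1 - target) * hours

for tier, spec in SLO_TIERS.items():
    budget = allowed_downtime_hours(spec["target"])
    print(f"{tier:>9}: {spec['target']:.2%} -> {budget:.1f} h/year ({spec['example']})")
```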
Design for failure. Assume every service will have outages. Build redundancy, failover, and graceful degradation into your architecture. If your application falls over because a single third-party API is unavailable, that’s an architecture problem, not a vendor problem.
Monitor from the customer’s perspective. Your cloud provider’s status page might show green while your customers are experiencing errors. Implement synthetic monitoring that tests actual user journeys and measures availability from the outside in.
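The core of outside-in monitoring is running real user journeys and counting any failure as unavailability, whatever the infrastructure dashboards say. A sketch of that shape, with hypothetical stub probes in place of real browser-driven checks:

```python
# Synthetic-monitoring sketch: availability measured from user
# journeys, not infrastructure health. The probe functions are
# hypothetical stubs standing in for real end-to-end checks.

from typing import Callable

def check_login() -> bool:
    # Stub: in practice, drive a headless browser through the login flow.
    return True

def check_checkout() -> bool:
    # Stub simulating a failing journey (e.g. payment gateway timeout).
    raise TimeoutError("payment gateway timed out")

def measure(journeys: dict[str, Callable[[], bool]]) -> dict[str, bool]:
    """Run each synthetic journey; any exception counts as unavailable."""
    results: dict[str, bool] = {}
    for name, probe in journeys.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results

status = measure({"login": check_login, "checkout": check_checkout})
print(status)  # {'login': True, 'checkout': False}
```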
I’ve worked through this exact exercise with the Team400 team for clients who had serious gaps between their vendor SLAs and their actual business requirements. The result is usually architectural changes combined with renegotiated vendor terms that dramatically improve real-world resilience.
The Internal SLA Problem
Vendor SLAs get all the attention, but the bigger problem is internal SLAs---the commitments between IT and the business units you serve.
I’ve seen IT departments with no internal SLAs at all. Without them, there’s no shared understanding of what IT is delivering and no basis for prioritising investments.
Setting internal SLAs forces necessary conversations. The cost curve is exponential---each additional nine roughly doubles your infrastructure investment. When the business understands that going from 99.9% to 99.99% might cost $200,000 per year, they become more realistic about requirements.
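If each additional nine roughly doubles the spend, the trade-off is easy to put in front of the business. A sketch with an illustrative baseline, chosen so the 99.9% to 99.99% step adds roughly the $200,000 from the example above:

```python
# Cost vs. availability under the rule of thumb that each additional
# nine roughly doubles infrastructure investment. The $100k baseline
# at 99% is an illustrative assumption, not a benchmark.

base_cost = 100_000  # yearly spend to hit 99% (illustrative)
for nines in range(2, 6):
    target = 1 - 10 ** -nines
    cost = base_cost * 2 ** (nines - 2)
    downtime_h = (1 - target) * 365.25 * 24
    print(f"{target:.3%}: ~${cost:,}/yr, {downtime_h:.2f} h downtime/yr")
```

Seeing the last row, paying eight times the baseline to claw back the final handful of hours, is usually what makes requirement conversations realistic.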
Measuring What Matters
Stop reporting uptime as a single number. Instead, report on:
Error budgets. How much of your allowed downtime for the quarter have you consumed? This gives the business a forward-looking view and helps prioritise reliability investments versus feature development.
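Error budgets reduce to simple arithmetic against the SLO. A sketch of a quarterly budget check, with illustrative incident durations:

```python
# Error-budget tracking: how much of this quarter's allowed downtime
# have we already spent? SLO and incident figures are illustrative.

QUARTER_HOURS = 91.25 * 24  # one quarter of a 365.25-day year

def error_budget(slo: float, incident_minutes: list[float]) -> dict:
    """Budget, spend, and remainder in minutes for the quarter."""
    budget_min = (1 - slo) * QUARTER_HOURS * 60
    spent = sum(incident_minutes)
    return {
        "budget_min": round(budget_min, 1),
        "spent_min": spent,
        "remaining_min": round(budget_min - spent, 1),
    }

# 99.9% SLO, two incidents this quarter: 45 and 30 minutes
print(error_budget(0.999, [45, 30]))  # ~131 min budget, 75 spent
```

A team that has burned most of its budget by week six has a concrete, non-emotional reason to prioritise reliability work over new features.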
Mean time to detect (MTTD). How quickly do you know something is broken? I’ve seen outages go undetected for hours because monitoring only checked infrastructure health, not actual service availability.
Mean time to recover (MTTR). How fast do you fix things? This is more within your control than preventing outages entirely and has a direct relationship to your actual uptime numbers.
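Both metrics fall out of three timestamps per incident: when it started, when you detected it, and when you resolved it. A sketch with illustrative records; in practice these come from your incident tracker:

```python
# MTTD (start -> detected) and MTTR (start -> resolved) from incident
# timestamps. The incident records below are illustrative.

from datetime import datetime as dt

incidents = [
    # (started, detected, resolved)
    (dt(2025, 1, 3, 9, 0),  dt(2025, 1, 3, 9, 25), dt(2025, 1, 3, 10, 40)),
    (dt(2025, 2, 7, 14, 0), dt(2025, 2, 7, 14, 5), dt(2025, 2, 7, 14, 50)),
]

def mean_minutes(deltas) -> float:
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes([det - start for start, det, _ in incidents])
mttr = mean_minutes([res - start for start, _, res in incidents])

print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.0f} min")  # MTTD: 15 min, MTTR: 75 min
```

Note how much of the first incident's MTTR is detection delay: cutting MTTD is often the cheapest way to improve your real uptime numbers.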
Getting Honest About Risk
The most productive SLA conversation I’ve ever had with a board started with this statement: “Our systems will go down. The question is how often, for how long, and how we respond.”
That honesty makes people uncomfortable, but it leads to better outcomes than pretending vendor SLAs will protect you. Do the maths. Run the compound availability calculations. Present the real numbers alongside the real costs of improving them. And make sure everyone understands the difference between a vendor’s contractual commitment and your actual business resilience.