
Uptime Metrics Explained: SLA, SLO, MTTR, Error Budgets


Teams love dashboards. Stakeholders love single numbers. And that’s exactly how reliability metrics go wrong.

Metrics should change behavior, not decorate dashboards.

This guide explains the uptime metrics that actually matter—SLA, SLO, MTTR, and error budgets—with plain-language examples, simple calculators, and reporting templates you can use immediately.

(If you need an operational response process to improve these metrics, start with the incident playbook.)


The key definitions (with plain-language examples)

SLI (Service Level Indicator)

An SLI is the measured thing. Examples:

  • “% of requests to /checkout that return 2xx within 2 seconds”
  • “Availability of the login endpoint”
  • “API error rate”

Think: the raw measurement.

SLO (Service Level Objective)

An SLO is your internal target for an SLI. Examples:

  • “Checkout availability is 99.95% monthly”
  • “p95 API latency is under 800ms”

Think: the goal you’re trying to hit.

SLA (Service Level Agreement)

An SLA is an external promise with consequences (credits, refunds, contract terms). Example:

  • “We guarantee 99.9% uptime monthly, or you receive a credit.”

Think: a contractual commitment you should be confident you can meet.

MTTR (Mean Time To Recovery / Restore)

MTTR measures how quickly you restore service after an incident starts.

Depending on your org, “R” might mean:

  • Recovery: service fully back to normal
  • Restore: service good enough for users again (even if degraded)

Think: how fast you get users back to “working.”

Error budget

An error budget is how much unreliability you can “spend” while still meeting your SLO.

If your SLO is 99.9% availability, your monthly error budget is 0.1% downtime (more below).

Think: permission to ship changes—until you spend it.


Why “99.9% uptime” can mislead

“99.9%” sounds excellent, but it hides important realities:

  1. The time window matters
    99.9% per month is not the same as 99.9% per year: a yearly window can absorb one multi-hour outage that would blow any monthly target.
  2. Where you measure matters
    “Homepage up” can be 99.99% while “checkout works” is 99.5%.
  3. Short outages can still be painful
    A few minutes during a launch or sale can cost more than an hour at 3 a.m.
  4. Monitoring frequency affects what you observe
    If you check every 5 minutes, your visibility into short incidents is limited. See check frequency.

Better framing: define SLOs around the user-critical journeys (login/checkout/API) and measure MTTR so you actually get faster.


Allowed downtime per month calculator (the one stakeholders ask for)

Formula:
Allowed downtime = (1 − uptime %) × total time in the period

A 30-day month has:

  • 30 × 24 × 60 = 43,200 minutes

Allowed downtime per 30-day month (quick table)

Target uptime | Allowed downtime/month
99%           | 432 minutes (7h 12m)
99.5%         | 216 minutes (3h 36m)
99.9%         | 43.2 minutes (43m 12s)
99.95%        | 21.6 minutes (21m 36s)
99.99%        | 4.32 minutes (4m 19s)

Important: this is total downtime across the month. A single 45-minute outage can blow a 99.9% target.
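
If you'd rather compute this than copy the table, here's a minimal Python sketch of the same formula. The function name is ours, and a 30-day month is assumed:

```python
# Allowed-downtime calculator: (1 - uptime %) x total time in the period.
# Assumes a 30-day month (43,200 minutes) unless you pass another period.
def allowed_downtime_minutes(uptime_target_pct: float,
                             period_minutes: float = 30 * 24 * 60) -> float:
    """Minutes of downtime you can have and still hit the uptime target."""
    return (1 - uptime_target_pct / 100) * period_minutes

for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/month")
```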


MTTR: how to measure it (and why it’s your best lever)

The simplest MTTR definition

MTTR = average time from incident start → service restored

To measure consistently, define:

  • Start time: first confirmed user-impacting failure (or first alert after confirmation)
  • Restore time: when critical checks are passing and users are unblocked

MTTR is really a chain of smaller times

If you want MTTR to improve, break it into parts:

  1. Detection time (incident starts → alert fires)
  2. Acknowledgment time (alert → human response)
  3. Diagnosis time (response → cause/hypothesis)
  4. Mitigation time (hypothesis → rollback/fix applied)
  5. Verification time (fix → confirmed stable)

Small improvements in each stage compound.
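
To see where the minutes actually go, here's a minimal Python sketch that computes MTTR and the average time spent in each stage from incident timestamps. The field names and sample timestamps are ours, purely for illustration:

```python
from datetime import datetime
from statistics import mean

# Each incident records a timestamp for the stages described above.
# The schema and sample data are illustrative, not a standard format.
incidents = [
    {
        "start":        datetime(2025, 12, 12, 9, 0),   # first user-impacting failure
        "alerted":      datetime(2025, 12, 12, 9, 4),   # detection
        "acknowledged": datetime(2025, 12, 12, 9, 7),   # human responds
        "diagnosed":    datetime(2025, 12, 12, 9, 15),  # cause/hypothesis identified
        "mitigated":    datetime(2025, 12, 12, 9, 21),  # rollback/fix applied
        "restored":     datetime(2025, 12, 12, 9, 27),  # confirmed stable, users unblocked
    },
    # ...more incidents from the same period
]

def minutes_between(a: datetime, b: datetime) -> float:
    return (b - a).total_seconds() / 60

# MTTR = average of (restored - start) across incidents
mttr = mean(minutes_between(i["start"], i["restored"]) for i in incidents)
print(f"MTTR: {mttr:.1f} minutes")

# Average duration of each stage, so you know which lever to pull
stages = [
    ("detection",      "start",        "alerted"),
    ("acknowledgment", "alerted",      "acknowledged"),
    ("diagnosis",      "acknowledged", "diagnosed"),
    ("mitigation",     "diagnosed",    "mitigated"),
    ("verification",   "mitigated",    "restored"),
]
for name, begin, end in stages:
    avg = mean(minutes_between(i[begin], i[end]) for i in incidents)
    print(f"  {name}: {avg:.1f} minutes")
```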

MTTR improvement levers (high ROI)

  • Better detection: tighter monitoring on critical pages + confirmation logic (less noise, more trust)
  • Clear ownership: one primary responder, one comms owner
  • Fast rollback path: make rollback boring and fast
  • Runbooks: “first 5 minutes” checklist (printable)
  • Dependency visibility: know if it’s DNS, hosting, app, or third-party quickly
  • Post-incident fixes: address the top recurring causes

Use the operational playbook during real incidents: incident playbook.


Choosing SLOs by site type (practical guidance)

SLOs should match business impact and operational maturity—not ego.

Personal site / brochure site

  • Availability SLO: 99.5%–99.9%
  • Emphasis: low noise, simple monitoring

Content + lead gen site

  • Availability SLO: 99.9%
  • Add a journey SLO: “contact/booking page availability”
  • Tighten during campaigns

Ecommerce

  • Availability SLO (storefront): 99.9%+
  • Checkout SLO: often higher than the homepage
  • Consider latency SLOs (slow checkout is “down” in practice)

SaaS

  • App availability SLO: 99.9%–99.95%
  • API SLO: match customer expectations and plan tiers
  • Add “critical journey” SLO: login → dashboard success

Agency (managing many client sites)

  • Tiered SLOs by client package:
    • Standard: 99.9%
    • Premium: 99.95% + faster MTTR targets
  • Report transparently; avoid promising an SLA you can’t operationally support

Tip: Keep the number of SLOs small at first (1–3). Too many and nobody uses them.
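
If you run tiers across many sites, it also helps to write the targets down somewhere machine-readable instead of keeping them in people's heads. A minimal Python sketch using the agency tiers above (the MTTR targets are placeholders we invented for illustration):

```python
# Tiered reliability targets by client package: a minimal sketch.
# Availability figures match the examples above; the MTTR targets are
# illustrative placeholders, not recommendations.
SLO_TIERS = {
    "standard": {"availability_slo_pct": 99.90, "mttr_target_minutes": 60},
    "premium":  {"availability_slo_pct": 99.95, "mttr_target_minutes": 30},
}

def targets_for(client_package: str) -> dict:
    """Look up the reliability targets promised for a client package."""
    return SLO_TIERS[client_package]

print(targets_for("premium"))
```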


Error budgets: how to use them without becoming bureaucratic

If your SLO is 99.9% monthly, you have ~43 minutes of downtime “budget” in a 30-day month.

What to do with that budget

  • If you’re within budget: ship changes normally
  • If you’re burning budget fast: slow down risky releases, invest in stability
  • If you exceed budget: pause non-critical changes until reliability improves

The point: error budgets turn reliability into a shared decision, not an ops complaint.
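
Here's a minimal sketch of mid-month budget tracking in Python, assuming a 99.9% monthly SLO and a 30-day month (the downtime figure is invented):

```python
# Error-budget tracking: a minimal sketch. All numbers are illustrative.
SLO_PCT = 99.9
MONTH_MINUTES = 30 * 24 * 60                          # 43,200 minutes
budget_minutes = (1 - SLO_PCT / 100) * MONTH_MINUTES  # ~43.2 minutes

downtime_minutes = 25.0  # user-impacting downtime so far this month
day_of_month = 12

consumed = downtime_minutes / budget_minutes  # fraction of budget spent
elapsed = day_of_month / 30                   # fraction of the month gone
burn_rate = consumed / elapsed                # >1 means spending faster than time passes

print(f"Budget: {budget_minutes:.1f} min | consumed: {consumed:.0%} | burn rate: {burn_rate:.1f}x")

if consumed >= 1:
    print("Budget exhausted: pause non-critical changes until reliability improves.")
elif burn_rate > 1:
    print("Burning fast: slow down risky releases, invest in stability.")
else:
    print("Within budget: ship changes normally.")
```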


Reporting templates that stakeholders actually read

One-page monthly reliability report (template)

Period: {Month YYYY}
Scope: {Homepage / App / API / Checkout}

Headline metrics

  • Availability: {X%} (SLO: {Y%})
  • MTTR: {X minutes} (Target: {Y})
  • # of incidents: {N}
  • # of customer-impacting incidents: {N}

Top incidents (3 max)

  1. {Date} – {Impact summary} – {Duration} – {Root cause category} – {Fix/next step}

What changed this month

  • Monitoring improvements: {keyword check, new region, confirmation logic}
  • Response improvements: {runbook update, escalation changes}
  • Prevention work: {caching fix, DB tuning, DNS monitoring}

Next month priorities

  • {1–3 concrete actions tied to metrics}

If you publish incident updates publicly, align them with your reporting and transparency practices: status pages.
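
And if the numbers come from your monitoring data anyway, even a tiny script keeps the format identical month to month. A sketch using plain Python string formatting, filled with the sample values from the report that follows:

```python
# One-page reliability report: a minimal templating sketch.
# Field names are ours; swap in values from your monitoring/incident tooling.
TEMPLATE = """\
Period: {period}
Scope: {scope}

Headline metrics
  - Availability: {availability:.2f}% (SLO: {slo}%)
  - MTTR: {mttr} minutes (Target: {mttr_target} minutes)
  - # of incidents: {incidents}
  - # of customer-impacting incidents: {impacting}
"""

print(TEMPLATE.format(
    period="December 2025",
    scope="SaaS App + API",
    availability=99.93,
    slo=99.95,
    mttr=18,
    mttr_target=20,
    incidents=4,
    impacting=2,
))
```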


Sample monthly report (filled example)

Period: December 2025
Scope: SaaS App + API

Headline metrics

  • Availability: 99.93% (SLO: 99.95%) → missed
  • MTTR: 18 minutes (Target: 20 minutes) → met
  • # of incidents: 4
  • Customer-impacting: 2

Top incidents

  1. Dec 12 – Login failures (EU) – 27 minutes – CDN POP degradation – Added multi-region confirmation + provider escalation runbook
  2. Dec 21 – API 503s – 19 minutes – DB connection pool exhaustion – Increased pool + added alert on saturation
  3. Dec 28 – Checkout latency – 35 minutes – Third-party dependency slowdown – Added dependency monitor + fallback mode

Next month priorities

  • Raise app SLO monitoring precision (keyword checks on login/dashboard)
  • Add DNS monitoring and expiration alerts
  • Reduce diagnosis time with a “symptom → layer” runbook update

Pick one SLO + one MTTR target for next quarter

If you do one thing after reading this:

  1. Pick one SLO that reflects user success (not vanity uptime).
  2. Pick one MTTR target that forces operational improvement.

Then use them to drive monitoring, incident response, and release decisions for the quarter.