
Uptime Metrics Explained: SLA, SLO, MTTR, Error Budgets


Teams love dashboards. Stakeholders love single numbers. And that’s exactly how reliability metrics go wrong.

Metrics should change behavior, not decorate dashboards.

This guide explains the uptime metrics that actually matter—SLA, SLO, MTTR, and error budgets—with plain-language examples, simple calculators, and reporting templates you can use immediately.

(If you need an operational response process to improve these metrics, start with the incident playbook.)


The key definitions (with plain-language examples)

SLI (Service Level Indicator)

An SLI is the measured thing. Examples:

  • “% of requests to /checkout that return 2xx within 2 seconds”
  • “Availability of the login endpoint”
  • “API error rate”

Think: the raw measurement.

SLO (Service Level Objective)

An SLO is your internal target for an SLI. Examples:

  • “Checkout availability is 99.95% monthly”
  • “p95 API latency is under 800ms”

Think: the goal you’re trying to hit.

SLA (Service Level Agreement)

An SLA is an external promise with consequences (credits, refunds, contract terms). Example:

  • “We guarantee 99.9% uptime monthly, or you receive a credit.”

Think: a contractual commitment you should be confident you can meet.

MTTR (Mean Time To Recovery / Restore)

MTTR measures how quickly you restore service after an incident starts.

Depending on your org, “R” might mean:

  • Recovery: service fully back to normal
  • Restore: service good enough for users again (even if degraded)

Think: how fast you get users back to “working.”

Error budget

An error budget is how much unreliability you can “spend” while still meeting your SLO.

If your SLO is 99.9% availability, your monthly error budget is 0.1% downtime (more below).

Think: permission to ship changes—until you spend it.


Why “99.9% uptime” can mislead

“99.9%” sounds excellent, but it hides important realities:

  1. The time window matters
    99.9% per month is not the same as 99.9% per year: a yearly window can absorb one multi-hour outage that would blow any monthly target.
  2. Where you measure matters
    “Homepage up” can be 99.99% while “checkout works” is 99.5%.
  3. Short outages can still be painful
    A few minutes during a launch or sale can cost more than an hour at 3 a.m.
  4. Monitoring frequency affects what you observe
    If you check every 5 minutes, your visibility into short incidents is limited. See check frequency.

Better framing: define SLOs around the user-critical journeys (login/checkout/API) and measure MTTR so you actually get faster.


Allowed downtime per month calculator (the one stakeholders ask for)

Formula:
Allowed downtime = (1 − uptime %) × total time in the period

A 30-day month has:

  • 30 × 24 × 60 = 43,200 minutes

Allowed downtime per 30-day month (quick table)

Target uptime | Allowed downtime/month
99%           | 432 minutes (7h 12m)
99.5%         | 216 minutes (3h 36m)
99.9%         | 43.2 minutes (43m 12s)
99.95%        | 21.6 minutes (21m 36s)
99.99%        | 4.32 minutes (4m 19s)

Important: this is total downtime across the month. A single 45-minute outage can blow a 99.9% target.
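
If you'd rather compute this than copy the table, here's a minimal Python sketch of the same formula. The function name is ours, and a 30-day month is assumed:

```python
# Allowed-downtime calculator: (1 - uptime %) x total time in the period.
# Assumes a 30-day month (43,200 minutes) unless you pass another period.
def allowed_downtime_minutes(uptime_target_pct: float,
                             period_minutes: float = 30 * 24 * 60) -> float:
    """Minutes of downtime you can have and still hit the uptime target."""
    return (1 - uptime_target_pct / 100) * period_minutes

for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/month")
```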


MTTR: how to measure it (and why it’s your best lever)

The simplest MTTR definition

MTTR = average time from incident start → service restored

To measure consistently, define:

  • Start time: first confirmed user-impacting failure (or first alert after confirmation)
  • Restore time: when critical checks are passing and users are unblocked

MTTR is really a chain of smaller times

If you want MTTR to improve, break it into parts:

  1. Detection time (incident starts → alert fires)
  2. Acknowledgment time (alert → human response)
  3. Diagnosis time (response → cause/hypothesis)
  4. Mitigation time (hypothesis → rollback/fix applied)
  5. Verification time (fix → confirmed stable)

Small improvements in each stage compound.
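
To see where the minutes actually go, here's a minimal Python sketch that computes MTTR and the average time spent in each stage from incident timestamps. The field names and sample timestamps are ours, purely for illustration:

```python
from datetime import datetime
from statistics import mean

# Each incident records a timestamp for the stages described above.
# The schema and sample data are illustrative, not a standard format.
incidents = [
    {
        "start":        datetime(2025, 12, 12, 9, 0),   # first user-impacting failure
        "alerted":      datetime(2025, 12, 12, 9, 4),   # detection
        "acknowledged": datetime(2025, 12, 12, 9, 7),   # human responds
        "diagnosed":    datetime(2025, 12, 12, 9, 15),  # cause/hypothesis identified
        "mitigated":    datetime(2025, 12, 12, 9, 21),  # rollback/fix applied
        "restored":     datetime(2025, 12, 12, 9, 27),  # confirmed stable, users unblocked
    },
    # ...more incidents from the same period
]

def minutes_between(a: datetime, b: datetime) -> float:
    return (b - a).total_seconds() / 60

# MTTR = average of (restored - start) across incidents
mttr = mean(minutes_between(i["start"], i["restored"]) for i in incidents)
print(f"MTTR: {mttr:.1f} minutes")

# Average duration of each stage, so you know which lever to pull
stages = [
    ("detection",      "start",        "alerted"),
    ("acknowledgment", "alerted",      "acknowledged"),
    ("diagnosis",      "acknowledged", "diagnosed"),
    ("mitigation",     "diagnosed",    "mitigated"),
    ("verification",   "mitigated",    "restored"),
]
for name, begin, end in stages:
    avg = mean(minutes_between(i[begin], i[end]) for i in incidents)
    print(f"  {name}: {avg:.1f} minutes")
```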

MTTR improvement levers (high ROI)

  • Better detection: tighter monitoring on critical pages + confirmation logic (less noise, more trust)
  • Clear ownership: one primary responder, one comms owner
  • Fast rollback path: make rollback boring and fast
  • Runbooks: “first 5 minutes” checklist (printable)
  • Dependency visibility: know if it’s DNS, hosting, app, or third-party quickly
  • Post-incident fixes: address the top recurring causes

Use the operational playbook during real incidents: incident playbook.


Choosing SLOs by site type (practical guidance)

SLOs should match business impact and operational maturity—not ego.

Personal site / brochure site

  • Availability SLO: 99.5%–99.9%
  • Emphasis: low noise, simple monitoring

Content + lead gen site

  • Availability SLO: 99.9%
  • Add a journey SLO: “contact/booking page availability”
  • Tighten during campaigns

Ecommerce

  • Availability SLO (storefront): 99.9%+
  • Checkout SLO: often higher than the homepage
  • Consider latency SLOs (slow checkout is “down” in practice)

SaaS

  • App availability SLO: 99.9%–99.95%
  • API SLO: match customer expectations and plan tiers
  • Add “critical journey” SLO: login → dashboard success

Agency (managing many client sites)

  • Tiered SLOs by client package:
    • Standard: 99.9%
    • Premium: 99.95% + faster MTTR targets
  • Report transparently; avoid promising an SLA you can’t operationally support

Tip: Keep the number of SLOs small at first (1–3). Too many and nobody uses them.
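
If you run tiers across many sites, it also helps to write the targets down somewhere machine-readable instead of keeping them in people's heads. A minimal Python sketch using the agency tiers above (the MTTR targets are placeholders we invented for illustration):

```python
# Tiered reliability targets by client package: a minimal sketch.
# Availability figures match the examples above; the MTTR targets are
# illustrative placeholders, not recommendations.
SLO_TIERS = {
    "standard": {"availability_slo_pct": 99.90, "mttr_target_minutes": 60},
    "premium":  {"availability_slo_pct": 99.95, "mttr_target_minutes": 30},
}

def targets_for(client_package: str) -> dict:
    """Look up the reliability targets promised for a client package."""
    return SLO_TIERS[client_package]

print(targets_for("premium"))
```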


Error budgets: how to use them without becoming bureaucratic

If your SLO is 99.9% monthly, you have ~43 minutes of downtime “budget” in a 30-day month.

What to do with that budget

  • If you’re within budget: ship changes normally
  • If you’re burning budget fast: slow down risky releases, invest in stability
  • If you exceed budget: pause non-critical changes until reliability improves

The point: error budgets turn reliability into a shared decision, not an ops complaint.
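
Here's a minimal sketch of mid-month budget tracking in Python, assuming a 99.9% monthly SLO and a 30-day month (the downtime figure is invented):

```python
# Error-budget tracking: a minimal sketch. All numbers are illustrative.
SLO_PCT = 99.9
MONTH_MINUTES = 30 * 24 * 60                          # 43,200 minutes
budget_minutes = (1 - SLO_PCT / 100) * MONTH_MINUTES  # ~43.2 minutes

downtime_minutes = 25.0  # user-impacting downtime so far this month
day_of_month = 12

consumed = downtime_minutes / budget_minutes  # fraction of budget spent
elapsed = day_of_month / 30                   # fraction of the month gone
burn_rate = consumed / elapsed                # >1 means spending faster than time passes

print(f"Budget: {budget_minutes:.1f} min | consumed: {consumed:.0%} | burn rate: {burn_rate:.1f}x")

if consumed >= 1:
    print("Budget exhausted: pause non-critical changes until reliability improves.")
elif burn_rate > 1:
    print("Burning fast: slow down risky releases, invest in stability.")
else:
    print("Within budget: ship changes normally.")
```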


Reporting templates that stakeholders actually read

One-page monthly reliability report (template)

Period: {Month YYYY}
Scope: {Homepage / App / API / Checkout}

Headline metrics

  • Availability: {X%} (SLO: {Y%})
  • MTTR: {X minutes} (Target: {Y})
  • # of incidents: {N}
  • # of customer-impacting incidents: {N}

Top incidents (3 max)

  1. {Date} – {Impact summary} – {Duration} – {Root cause category} – {Fix/next step}

What changed this month

  • Monitoring improvements: {keyword check, new region, confirmation logic}
  • Response improvements: {runbook update, escalation changes}
  • Prevention work: {caching fix, DB tuning, DNS monitoring}

Next month priorities

  • {1–3 concrete actions tied to metrics}

If you publish incident updates publicly, align them with your reporting and transparency practices: status pages.
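
And if the numbers come from your monitoring data anyway, even a tiny script keeps the format identical month to month. A sketch using plain Python string formatting, filled with the sample values from the report that follows:

```python
# One-page reliability report: a minimal templating sketch.
# Field names are ours; swap in values from your monitoring/incident tooling.
TEMPLATE = """\
Period: {period}
Scope: {scope}

Headline metrics
  - Availability: {availability:.2f}% (SLO: {slo}%)
  - MTTR: {mttr} minutes (Target: {mttr_target} minutes)
  - # of incidents: {incidents}
  - # of customer-impacting incidents: {impacting}
"""

print(TEMPLATE.format(
    period="December 2025",
    scope="SaaS App + API",
    availability=99.93,
    slo=99.95,
    mttr=18,
    mttr_target=20,
    incidents=4,
    impacting=2,
))
```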


Sample monthly report (filled example)

Period: December 2025
Scope: SaaS App + API

Headline metrics

  • Availability: 99.93% (SLO: 99.95%) → missed
  • MTTR: 18 minutes (Target: 20 minutes) → met
  • # of incidents: 4
  • Customer-impacting: 2

Top incidents

  1. Dec 12 – Login failures (EU) – 27 minutes – CDN POP degradation – Added multi-region confirmation + provider escalation runbook
  2. Dec 21 – API 503s – 19 minutes – DB connection pool exhaustion – Increased pool + added alert on saturation
  3. Dec 28 – Checkout latency – 35 minutes – Third-party dependency slowdown – Added dependency monitor + fallback mode

Next month priorities

  • Raise app SLO monitoring precision (keyword checks on login/dashboard)
  • Add DNS monitoring and expiration alerts
  • Reduce diagnosis time with a “symptom → layer” runbook update

Pick one SLO + one MTTR target for next quarter

If you do one thing after reading this:

  1. Pick one SLO that reflects user success (not vanity uptime).
  2. Pick one MTTR target that forces operational improvement.

Then use them to drive monitoring, incident response, and release decisions for the quarter.