
Website Down? What to Do in the First 30 Minutes


A “site down” alert triggers adrenaline for a reason: downtime threatens revenue, trust, and your sanity. But the fastest way to make an outage worse is to jump straight into random debugging.

Triage first, diagnose second, fix third.

This guide is a practical incident response playbook for small teams, on-call rotations, and solo owners. It tells you exactly what to do in the first 30 minutes, how to identify where the failure lives (DNS vs hosting vs app vs third-party), how to communicate, and how to run a simple post-incident review.

If you’re building the full alerting + response system, this is part of the downtime alerts hub.


The first 5 minutes: confirm + scope (don’t skip this)

Your job in the first five minutes is not to solve the outage. It’s to answer two questions:

  1. Is it real?
  2. How big is it?

First 5 minutes checklist (copy/paste)

Confirm

  • Check your monitoring dashboard: did the alert confirm after retries, or was it a single blip?
  • Verify independently:
    • load the site from your browser and a second network (phone hotspot is perfect)
    • check from another location/tool if available
  • Look at the error type: timeout, 5xx, DNS failure, SSL error, 403/429, keyword mismatch (see the probe sketch after this checklist)

Scope

  • What’s affected?
    • homepage only, login, checkout, API, everything?
  • Who’s affected?
    • one region or global?
    • all users or only logged-in users?
  • When did it start?
    • note the start time and whether there was a recent deploy/config change

Declare

  • Open an incident thread (Slack/Teams/text) and assign:
    • primary responder
    • comms owner (even if that’s you)

Start a simple incident log

  • Timestamped notes:
    • “15:02 alert fired”
    • “15:04 confirmed from hotspot”
    • “15:06 suspect DNS”

This prevents confusion later and makes your postmortem easy.
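
If you want a faster, repeatable version of the "verify independently" and "error type" steps, a tiny script can do the classification for you. Below is a minimal sketch in Python, assuming the requests package is installed; the URL and the error buckets are placeholders to adapt to whatever your monitor actually checks.

```python
# Minimal independent probe (sketch). Assumes the requests package is
# installed; the URL is a placeholder for the page your monitor checks.
import requests

URL = "https://example.com/"  # replace with your monitored URL

def classify(url: str, timeout: float = 10.0) -> str:
    """Return a rough failure bucket matching the checklist above."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.exceptions.SSLError:
        return "ssl"                       # expired cert / chain problem
    except requests.exceptions.Timeout:
        return "timeout"                   # overload, networking, DB stalls
    except requests.exceptions.ConnectionError:
        return "dns-or-connection"         # NXDOMAIN, refused, reset
    if resp.status_code >= 500:
        return f"http-{resp.status_code}"  # app or upstream failure
    if resp.status_code in (403, 429):
        return "blocked"                   # WAF / rate limiting hitting the probe
    return f"ok-{resp.status_code}"

if __name__ == "__main__":
    print(classify(URL))
```

Run it from a second network (a phone hotspot works) and compare the result with what your monitoring tool reports.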


Identify the layer: DNS vs hosting vs app vs third-party

Once the incident is confirmed and scoped, you want to identify the layer that’s failing. Most outages map cleanly to one of these buckets.

Symptom → likely cause (quick table)

| Symptom | What it usually means | Where to look first |
| --- | --- | --- |
| Domain won’t resolve / “site can’t be reached” | DNS issue, domain expired, resolver problem | DNS provider, registrar, DNS monitoring |
| TLS/SSL warning in browser | Expired cert, chain mismatch, TLS config issue | Certificate renewal, CDN/WAF TLS settings |
| 500 errors | Application bug or misconfig | App logs, recent deploys, env vars |
| 502/504 gateway errors | Upstream (app server) failing behind proxy/load balancer | Load balancer, origin health, app servers |
| 503 errors | Overload, maintenance mode, dependency failure | Capacity, maintenance toggles, upstream services |
| Timeouts | Overload, networking issue, DB stalls, deadlocks | Host metrics, DB, upstream dependencies |
| Only one region affected | CDN POP issue, routing/ISP, regional DNS | CDN status, multi-location checks, DNS |
| Checkout/login broken but homepage loads | Third-party (payments/auth) or app flow bug | Payment/auth provider, flow tests, recent changes |
| 403/429 in monitors but site “works” | WAF/bot protection blocking probes | WAF rules, allowlisting, monitor config |

DNS layer (often the sneakiest)

DNS issues create the classic “works for me” problem because different resolvers and regions can behave differently.

Check:

  • DNS provider status
  • recent DNS changes
  • domain expiration and nameserver correctness

If you want proactive prevention here, start with DNS monitoring.
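
To see the "works for me" effect directly, query a few public resolvers side by side and compare their answers. Here's a minimal sketch assuming the dnspython package is installed; the domain and resolver IPs are placeholders.

```python
# DNS spot-check (sketch). Assumes the dnspython package is installed;
# the domain and resolver IPs below are placeholders.
import dns.resolver

DOMAIN = "example.com"
RESOLVERS = {"google": "8.8.8.8", "cloudflare": "1.1.1.1", "quad9": "9.9.9.9"}

for name, ip in RESOLVERS.items():
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    r.lifetime = 5  # seconds before giving up on this resolver
    try:
        answers = r.resolve(DOMAIN, "A")
        print(name, [a.address for a in answers])
    except Exception as exc:  # NXDOMAIN, timeout, SERVFAIL, ...
        print(name, "FAILED:", type(exc).__name__)
```

If one resolver returns records while another fails or returns stale values, you're likely looking at a DNS/propagation problem rather than a hosting problem.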

Hosting / infrastructure layer

If DNS is fine but the site times out or returns 502/503:

  • check hosting provider status page
  • check server health (CPU/RAM/disk)
  • check whether your load balancer sees healthy upstreams
  • consider whether you hit capacity (traffic spike, bot attack, background job storm)
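
One quick way to check upstream health is to probe each app server directly, bypassing the load balancer. The hosts, port, and /healthz path in this sketch are assumptions, not something your stack necessarily exposes; substitute whatever health endpoint you actually have.

```python
# Origin health sweep (sketch). Assumes requests is installed and that each
# app server exposes a health endpoint; hosts, port, and path are placeholders.
import requests

ORIGINS = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]  # hypothetical app servers
HEALTH_PATH = "/healthz"                            # hypothetical health route

for host in ORIGINS:
    url = f"http://{host}:8080{HEALTH_PATH}"
    try:
        resp = requests.get(url, timeout=3)
        print(host, resp.status_code)
    except requests.exceptions.RequestException as exc:
        print(host, "UNREACHABLE:", type(exc).__name__)
```

If the origins answer but the load balancer still returns 502/504, the problem is more likely in the proxy layer than in the app servers themselves.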

Application layer

If you see 500s or a specific flow fails:

  • correlate with recent deploy time
  • roll back quickly if the timing matches
  • check logs for errors/exceptions
  • check DB connectivity and migrations
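
To make the deploy correlation concrete, here is a rough sketch that counts 5xx responses per minute in an access log, so you can see whether the error spike lines up with the release. The log path, timestamp format, and deploy time are assumptions about a typical nginx-style setup; adjust them to your stack.

```python
# Deploy-correlation sketch: count 5xx responses per minute in an
# nginx/Apache-style access log. Path, format, and deploy time are assumptions.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path
DEPLOY_AT = "15:00"                     # HH:MM of the suspect deploy

# Matches e.g. [05/Mar/2025:15:02:41 +0000] "GET / HTTP/1.1" 502
LINE = re.compile(r'\[\d+/\w+/\d+:(\d\d:\d\d):\d\d [^\]]+\] "[^"]*" (\d{3})')

errors_per_minute = Counter()
with open(LOG_PATH) as log:
    for line in log:
        m = LINE.search(line)
        if m and m.group(2).startswith("5"):
            errors_per_minute[m.group(1)] += 1

for minute in sorted(errors_per_minute):
    marker = " <- deploy" if minute == DEPLOY_AT else ""
    print(minute, errors_per_minute[minute], marker)
```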

Third-party dependencies

Common external causes:

  • payment gateway outage
  • auth/OAuth provider issues
  • critical API dependency latency
  • CDN degradation
  • email provider failures (for login/verification flows)

If the site is technically “up” but users can’t complete critical actions, dependencies are a prime suspect.
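
A quick way to rule dependencies in or out is to probe them directly and time the responses. The endpoints in this sketch are illustrative placeholders for the kinds of providers listed above; swap in the health or status URLs your product actually relies on (requests assumed installed).

```python
# Dependency probe (sketch). Endpoints are illustrative placeholders;
# assumes the requests package is installed.
import time
import requests

DEPENDENCIES = {
    "payments": "https://api.payments.example/health",                 # hypothetical
    "auth": "https://auth.example/.well-known/openid-configuration",   # hypothetical
    "cdn-asset": "https://cdn.example/app.js",                         # hypothetical
}

for name, url in DEPENDENCIES.items():
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=5)
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{name}: HTTP {resp.status_code} in {elapsed_ms:.0f} ms")
    except requests.exceptions.RequestException as exc:
        print(f"{name}: FAILED ({type(exc).__name__})")
```

Anything that fails outright, or answers in seconds instead of milliseconds, goes to the top of the suspect list.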


The next 10 minutes: stabilize (stop the bleeding)

Once you have a working hypothesis, prioritize mitigation over perfect diagnosis.

Fast stabilization moves (choose what fits)

  • Roll back the last deploy (if the incident aligns with a change window)
  • Disable a feature flag or revert a config toggle
  • Scale up resources temporarily (compute/database) if overloaded
  • Bypass or degrade gracefully around a slow dependency (see the sketch below)
  • Turn on a maintenance page only if you need to protect data integrity

Guiding principle: restore service first, then investigate deeply.
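
As an example of "degrade gracefully," here is a minimal Python sketch that wraps a slow dependency call in a tight timeout and serves the last known-good value instead of failing the whole request. The endpoint and in-memory cache are illustrative, not a specific library's API.

```python
# Graceful-degradation sketch: tight timeout + stale-but-usable fallback.
# The endpoint is hypothetical; assumes the requests package is installed.
import requests

_last_good_rates = {"USD": 1.0}  # last known-good value, served when the call fails

def get_exchange_rates() -> dict:
    global _last_good_rates
    try:
        # Hypothetical dependency endpoint; keep the timeout tight during an incident.
        resp = requests.get("https://rates.example/api/latest", timeout=2)
        resp.raise_for_status()
        _last_good_rates = resp.json()
    except requests.exceptions.RequestException:
        pass  # serve stale data rather than erroring the whole page
    return _last_good_rates
```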


Rollback strategy basics (how to do it safely)

Rollbacks are one of the most effective “small team” outage tools—when done calmly.

When to roll back

  • The outage began right after a deploy/config change
  • Error rates spiked immediately after release
  • A single flow (login/checkout) broke after a change

How to roll back (simple version)

  1. Identify the last known good version (release tag/commit/build)
  2. Roll back one step (don’t stack changes)
  3. Confirm recovery using monitors + real user checks
  4. Pause further deployments until stable

Two rollback tips that save pain

  • Don’t debug in production first. Roll back to restore users, then debug with breathing room.
  • Write down what you changed. It helps you avoid reintroducing the issue later.

Communication steps (internal + status page)

Communication isn’t a nice-to-have. It prevents chaos and reduces support load.

Internal communication (within 10 minutes)

Post a quick note:

  • what’s happening (symptom, not speculation)
  • what’s impacted
  • who’s owning the fix
  • when the next update will be

Example:

“Investigating: users seeing 503s on checkout. Confirmed in US-East and EU. @Sam owning. Next update in 15 minutes.”

Status page: when and how

If customers are affected and the incident isn’t resolved quickly, use a status page. Start here: status pages.

A good rule:

  • If it impacts customer success and lasts longer than ~10–15 minutes, post a status update.

Status update template

  • Status: Investigating / Identified / Monitoring / Resolved
  • Impact: who/what is affected
  • Current state: brief and factual
  • Next update: time-bound commitment

Avoid speculation. Be honest and short.


Minutes 20–30: confirm recovery and prevent immediate relapse

Once you’ve applied a fix or rollback:

Recovery checklist

  • Confirm monitors show UP across regions
  • Confirm critical flows:
    • login (if SaaS)
    • checkout (if ecommerce)
    • key API endpoint (if product relies on API)
  • Watch for flapping (up/down)
  • Keep comms cadence until stable

If it’s “up” but slow, treat it as a performance incident (often a precursor to another outage).
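
To confirm recovery (and catch flapping) without staring at a dashboard, you can poll the critical flows for a few minutes and flag anything that alternates between up and down. A minimal sketch, with placeholder URLs and assuming the requests package is installed:

```python
# Recovery watch (sketch): poll critical flows every 30 seconds for ~3 minutes
# and flag flapping. URLs are placeholders; assumes requests is installed.
import time
import requests

FLOWS = {
    "homepage": "https://example.com/",
    "login": "https://example.com/login",
    "checkout": "https://example.com/checkout",
}
CHECKS, INTERVAL = 6, 30  # six passes, 30 seconds apart

history = {name: [] for name in FLOWS}
for _ in range(CHECKS):
    for name, url in FLOWS.items():
        try:
            up = requests.get(url, timeout=10).status_code < 500
        except requests.exceptions.RequestException:
            up = False
        history[name].append(up)
    time.sleep(INTERVAL)

for name, results in history.items():
    flaps = sum(1 for a, b in zip(results, results[1:]) if a != b)
    status = "stable UP" if all(results) else ("FLAPPING" if flaps else "still DOWN")
    print(name, results, "->", status)
```

Anything marked FLAPPING deserves continued attention even though the monitors may currently show green.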


Copy/paste runbook (printable checklist)

Paste this into a doc and keep it somewhere obvious.

Website Outage Runbook (30 minutes)

Roles

  • Primary responder: ________
  • Comms owner: ________
  • Backup/escalation: ________

Links

  • Monitoring dashboard: ________
  • Hosting provider status: ________
  • DNS provider/registrar: ________
  • Deploy/CI pipeline: ________
  • Logs/APM: ________
  • Status page: ________

0–5 minutes (Confirm + Scope)

  • Confirm alert (retries/regions)
  • Verify from second network/location
  • Identify affected services/pages
  • Open incident channel + assign roles
  • Start incident log (timestamps)

5–15 minutes (Identify layer)

  • DNS vs hosting vs app vs third-party
  • Check provider status pages
  • Check recent changes (deploy/config/DNS)

15–30 minutes (Mitigate + Communicate)

  • Roll back if change-correlated
  • Scale or fail over if overloaded
  • Post internal update + next update time
  • Post status page update if customer impact persists
  • Confirm recovery across regions + critical flow test

After recovery

  • Capture timeline and root cause
  • Create action items (prevention + detection + docs)

Post-incident review template (keep it lightweight)

A good postmortem isn’t about blame—it’s about preventing repeat incidents.

Template (copy/paste)

Incident summary:

  • What happened (plain English)

Timeline:

  • Start time:
  • Detection time:
  • Mitigation time:
  • Resolution time:

Impact:

  • Who was affected and how (regions, pages, users)
  • Revenue/support impact (if known)

Root cause:

  • Primary cause:
  • Contributing factors:

What went well:

  • (e.g., fast rollback, good comms)

What didn’t go well:

  • (e.g., unclear ownership, missing alerts, false positives)

Action items:

  • Prevention (fix underlying issue)
  • Detection (add/adjust monitors, keyword checks, regions)
  • Response (update runbook, escalation)
  • Owner + due date for each item

Print or save the checklist

You don’t want to invent a process during an outage.

Print or save the first 30 minutes checklist and the runbook template somewhere your team can find them in seconds. Then you’ll triage first, diagnose second, and fix third.