
Website Down? What to Do in the First 30 Minutes


A “site down” alert triggers adrenaline for a reason: downtime threatens revenue, trust, and your sanity. But the fastest way to make an outage worse is to jump straight into random debugging.

Triage first, diagnose second, fix third.

This guide is a practical incident response playbook for small teams, on-call rotations, and solo owners. It tells you exactly what to do in the first 30 minutes, how to identify where the failure lives (DNS vs hosting vs app vs third-party), how to communicate, and how to run a simple post-incident review.

If you’re building the full alerting + response system, this is part of the downtime alerts hub.


The first 5 minutes: confirm + scope (don’t skip this)

Your job in the first five minutes is not to solve the outage. It’s to answer two questions:

  1. Is it real?
  2. How big is it?

First 5 minutes checklist (copy/paste)

Confirm

  • Check your monitoring dashboard: did the alert confirm after retries, or was it a single blip?
  • Verify independently:
    • load the site from your browser and a second network (phone hotspot is perfect)
    • check from another location/tool if available
  • Look at the error type: timeout, 5xx, DNS failure, SSL error, 403/429, keyword mismatch (see the probe sketch after this checklist)

Scope

  • What’s affected?
    • homepage only, login, checkout, API, everything?
  • Who’s affected?
    • one region or global?
    • all users or only logged-in users?
  • When did it start?
    • note the start time and whether there was a recent deploy/config change

Declare

  • Open an incident thread (Slack/Teams/text) and assign:
    • primary responder
    • comms owner (even if that’s you)

Start a simple incident log

  • Timestamped notes:
    • “15:02 alert fired”
    • “15:04 confirmed from hotspot”
    • “15:06 suspect DNS”

This prevents confusion later and makes your postmortem easy.
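
If you want a faster, repeatable version of the "verify independently" and "error type" steps, a tiny script can do the classification for you. Below is a minimal sketch in Python, assuming the requests package is installed; the URL and the error buckets are placeholders to adapt to whatever your monitor actually checks.

```python
# Minimal independent probe (sketch). Assumes the requests package is
# installed; the URL is a placeholder for the page your monitor checks.
import requests

URL = "https://example.com/"  # replace with your monitored URL

def classify(url: str, timeout: float = 10.0) -> str:
    """Return a rough failure bucket matching the checklist above."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.exceptions.SSLError:
        return "ssl"                       # expired cert / chain problem
    except requests.exceptions.Timeout:
        return "timeout"                   # overload, networking, DB stalls
    except requests.exceptions.ConnectionError:
        return "dns-or-connection"         # NXDOMAIN, refused, reset
    if resp.status_code >= 500:
        return f"http-{resp.status_code}"  # app or upstream failure
    if resp.status_code in (403, 429):
        return "blocked"                   # WAF / rate limiting hitting the probe
    return f"ok-{resp.status_code}"

if __name__ == "__main__":
    print(classify(URL))
```

Run it from a second network (a phone hotspot works) and compare the result with what your monitoring tool reports.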


Identify the layer: DNS vs hosting vs app vs third-party

Once the incident is confirmed and scoped, you want to identify the layer that’s failing. Most outages map cleanly to one of these buckets.

Symptom → likely cause (quick table)

| Symptom | What it usually means | Where to look first |
| --- | --- | --- |
| Domain won’t resolve / “site can’t be reached” | DNS issue, domain expired, resolver problem | DNS provider, registrar, DNS monitoring |
| TLS/SSL warning in browser | Expired cert, chain mismatch, TLS config issue | Certificate renewal, CDN/WAF TLS settings |
| 500 errors | Application bug or misconfig | App logs, recent deploys, env vars |
| 502/504 gateway errors | Upstream (app server) failing behind proxy/load balancer | Load balancer, origin health, app servers |
| 503 errors | Overload, maintenance mode, dependency failure | Capacity, maintenance toggles, upstream services |
| Timeouts | Overload, networking issue, DB stalls, deadlocks | Host metrics, DB, upstream dependencies |
| Only one region affected | CDN POP issue, routing/ISP, regional DNS | CDN status, multi-location checks, DNS |
| Checkout/login broken but homepage loads | Third-party (payments/auth) or app flow bug | Payment/auth provider, flow tests, recent changes |
| 403/429 in monitors but site “works” | WAF/bot protection blocking probes | WAF rules, allowlisting, monitor config |

DNS layer (often the sneakiest)

DNS issues create the classic “works for me” problem because different resolvers and regions can behave differently.

Check:

  • DNS provider status
  • recent DNS changes
  • domain expiration and nameserver correctness

If you want proactive prevention here, start with DNS monitoring.
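
To see the "works for me" effect directly, query a few public resolvers side by side and compare their answers. Here's a minimal sketch assuming the dnspython package is installed; the domain and resolver IPs are placeholders.

```python
# DNS spot-check (sketch). Assumes the dnspython package is installed;
# the domain and resolver IPs below are placeholders.
import dns.resolver

DOMAIN = "example.com"
RESOLVERS = {"google": "8.8.8.8", "cloudflare": "1.1.1.1", "quad9": "9.9.9.9"}

for name, ip in RESOLVERS.items():
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    r.lifetime = 5  # seconds before giving up on this resolver
    try:
        answers = r.resolve(DOMAIN, "A")
        print(name, [a.address for a in answers])
    except Exception as exc:  # NXDOMAIN, timeout, SERVFAIL, ...
        print(name, "FAILED:", type(exc).__name__)
```

If one resolver returns records while another fails or returns stale values, you're likely looking at a DNS/propagation problem rather than a hosting problem.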

Hosting / infrastructure layer

If DNS is fine but the site times out or returns 502/503:

  • check hosting provider status page
  • check server health (CPU/RAM/disk)
  • check whether your load balancer sees healthy upstreams
  • consider whether you hit capacity (traffic spike, bot attack, background job storm)
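
One quick way to check upstream health is to probe each app server directly, bypassing the load balancer. The hosts, port, and /healthz path in this sketch are assumptions, not something your stack necessarily exposes; substitute whatever health endpoint you actually have.

```python
# Origin health sweep (sketch). Assumes requests is installed and that each
# app server exposes a health endpoint; hosts, port, and path are placeholders.
import requests

ORIGINS = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]  # hypothetical app servers
HEALTH_PATH = "/healthz"                            # hypothetical health route

for host in ORIGINS:
    url = f"http://{host}:8080{HEALTH_PATH}"
    try:
        resp = requests.get(url, timeout=3)
        print(host, resp.status_code)
    except requests.exceptions.RequestException as exc:
        print(host, "UNREACHABLE:", type(exc).__name__)
```

If the origins answer but the load balancer still returns 502/504, the problem is more likely in the proxy layer than in the app servers themselves.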

Application layer

If you see 500s or a specific flow fails:

  • correlate with recent deploy time
  • roll back quickly if the timing matches
  • check logs for errors/exceptions
  • check DB connectivity and migrations
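
To make the deploy correlation concrete, here is a rough sketch that counts 5xx responses per minute in an access log, so you can see whether the error spike lines up with the release. The log path, timestamp format, and deploy time are assumptions about a typical nginx-style setup; adjust them to your stack.

```python
# Deploy-correlation sketch: count 5xx responses per minute in an
# nginx/Apache-style access log. Path, format, and deploy time are assumptions.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path
DEPLOY_AT = "15:00"                     # HH:MM of the suspect deploy

# Matches e.g. [05/Mar/2025:15:02:41 +0000] "GET / HTTP/1.1" 502
LINE = re.compile(r'\[\d+/\w+/\d+:(\d\d:\d\d):\d\d [^\]]+\] "[^"]*" (\d{3})')

errors_per_minute = Counter()
with open(LOG_PATH) as log:
    for line in log:
        m = LINE.search(line)
        if m and m.group(2).startswith("5"):
            errors_per_minute[m.group(1)] += 1

for minute in sorted(errors_per_minute):
    marker = " <- deploy" if minute == DEPLOY_AT else ""
    print(minute, errors_per_minute[minute], marker)
```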

Third-party dependencies

Common external causes:

  • payment gateway outage
  • auth/OAuth provider issues
  • critical API dependency latency
  • CDN degradation
  • email provider failures (for login/verification flows)

If the site is technically “up” but users can’t complete critical actions, dependencies are a prime suspect.
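
A quick way to rule dependencies in or out is to probe them directly and time the responses. The endpoints in this sketch are illustrative placeholders for the kinds of providers listed above; swap in the health or status URLs your product actually relies on (requests assumed installed).

```python
# Dependency probe (sketch). Endpoints are illustrative placeholders;
# assumes the requests package is installed.
import time
import requests

DEPENDENCIES = {
    "payments": "https://api.payments.example/health",                 # hypothetical
    "auth": "https://auth.example/.well-known/openid-configuration",   # hypothetical
    "cdn-asset": "https://cdn.example/app.js",                         # hypothetical
}

for name, url in DEPENDENCIES.items():
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=5)
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{name}: HTTP {resp.status_code} in {elapsed_ms:.0f} ms")
    except requests.exceptions.RequestException as exc:
        print(f"{name}: FAILED ({type(exc).__name__})")
```

Anything that fails outright, or answers in seconds instead of milliseconds, goes to the top of the suspect list.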


The next 10 minutes: stabilize (stop the bleeding)

Once you have a working hypothesis, prioritize mitigation over perfect diagnosis.

Fast stabilization moves (choose what fits)

  • Roll back the last deploy (if the incident aligns with a change window)
  • Disable a feature flag or revert a config toggle
  • Scale up resources temporarily (compute/database) if overloaded
  • Bypass or degrade gracefully around a slow dependency (see the sketch below)
  • Turn on a maintenance page only if you need to protect data integrity

Guiding principle: restore service first, then investigate deeply.
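
As an example of "degrade gracefully," here is a minimal Python sketch that wraps a slow dependency call in a tight timeout and serves the last known-good value instead of failing the whole request. The endpoint and in-memory cache are illustrative, not a specific library's API.

```python
# Graceful-degradation sketch: tight timeout + stale-but-usable fallback.
# The endpoint is hypothetical; assumes the requests package is installed.
import requests

_last_good_rates = {"USD": 1.0}  # last known-good value, served when the call fails

def get_exchange_rates() -> dict:
    global _last_good_rates
    try:
        # Hypothetical dependency endpoint; keep the timeout tight during an incident.
        resp = requests.get("https://rates.example/api/latest", timeout=2)
        resp.raise_for_status()
        _last_good_rates = resp.json()
    except requests.exceptions.RequestException:
        pass  # serve stale data rather than erroring the whole page
    return _last_good_rates
```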


Rollback strategy basics (how to do it safely)

Rollbacks are one of the most effective “small team” outage tools—when done calmly.

When to roll back

  • The outage began right after a deploy/config change
  • Error rates spiked immediately after release
  • A single flow (login/checkout) broke after a change

How to roll back (simple version)

  1. Identify the last known good version (release tag/commit/build)
  2. Roll back one step (don’t stack changes)
  3. Confirm recovery using monitors + real user checks
  4. Pause further deployments until stable

Two rollback tips that save pain

  • Don’t debug in production first. Roll back to restore users, then debug with breathing room.
  • Write down what you changed. It helps you avoid reintroducing the issue later.

Communication steps (internal + status page)

Communication isn’t a nice-to-have. It prevents chaos and reduces support load.

Internal communication (within 10 minutes)

Post a quick note:

  • what’s happening (symptom, not speculation)
  • what’s impacted
  • who’s owning the fix
  • when the next update will be

Example:

“Investigating: users seeing 503s on checkout. Confirmed in US-East and EU. @Sam owning. Next update in 15 minutes.”

Status page: when and how

If customers are affected and the incident isn’t resolved quickly, use a status page. Start here: status pages.

A good rule:

  • If it impacts customer success and lasts longer than ~10–15 minutes, post a status update.

Status update template

  • Status: Investigating / Identified / Monitoring / Resolved
  • Impact: who/what is affected
  • Current state: brief and factual
  • Next update: time-bound commitment

Avoid speculation. Be honest and short.


Minutes 20–30: confirm recovery and prevent immediate relapse

Once you’ve applied a fix or rollback:

Recovery checklist

  • Confirm monitors show UP across regions
  • Confirm critical flows:
    • login (if SaaS)
    • checkout (if ecommerce)
    • key API endpoint (if product relies on API)
  • Watch for flapping (up/down)
  • Keep comms cadence until stable

If it’s “up” but slow, treat it as a performance incident (often a precursor to another outage).
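
To confirm recovery (and catch flapping) without staring at a dashboard, you can poll the critical flows for a few minutes and flag anything that alternates between up and down. A minimal sketch, with placeholder URLs and assuming the requests package is installed:

```python
# Recovery watch (sketch): poll critical flows every 30 seconds for ~3 minutes
# and flag flapping. URLs are placeholders; assumes requests is installed.
import time
import requests

FLOWS = {
    "homepage": "https://example.com/",
    "login": "https://example.com/login",
    "checkout": "https://example.com/checkout",
}
CHECKS, INTERVAL = 6, 30  # six passes, 30 seconds apart

history = {name: [] for name in FLOWS}
for _ in range(CHECKS):
    for name, url in FLOWS.items():
        try:
            up = requests.get(url, timeout=10).status_code < 500
        except requests.exceptions.RequestException:
            up = False
        history[name].append(up)
    time.sleep(INTERVAL)

for name, results in history.items():
    flaps = sum(1 for a, b in zip(results, results[1:]) if a != b)
    status = "stable UP" if all(results) else ("FLAPPING" if flaps else "still DOWN")
    print(name, results, "->", status)
```

Anything marked FLAPPING deserves continued attention even though the monitors may currently show green.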


Copy/paste runbook (printable checklist)

Paste this into a doc and keep it somewhere obvious.

Website Outage Runbook (30 minutes)

Roles

  • Primary responder: ________
  • Comms owner: ________
  • Backup/escalation: ________

Links

  • Monitoring dashboard: ________
  • Hosting provider status: ________
  • DNS provider/registrar: ________
  • Deploy/CI pipeline: ________
  • Logs/APM: ________
  • Status page: ________

0–5 minutes (Confirm + Scope)

  • Confirm alert (retries/regions)
  • Verify from second network/location
  • Identify affected services/pages
  • Open incident channel + assign roles
  • Start incident log (timestamps)

5–15 minutes (Identify layer)

  • DNS vs hosting vs app vs third-party
  • Check provider status pages
  • Check recent changes (deploy/config/DNS)

15–30 minutes (Mitigate + Communicate)

  • Roll back if change-correlated
  • Scale or fail over if overloaded
  • Post internal update + next update time
  • Post status page update if customer impact persists
  • Confirm recovery across regions + critical flow test

After recovery

  • Capture timeline and root cause
  • Create action items (prevention + detection + docs)

Post-incident review template (keep it lightweight)

A good postmortem isn’t about blame—it’s about preventing repeat incidents.

Template (copy/paste)

Incident summary:

  • What happened (plain English)

Timeline:

  • Start time:
  • Detection time:
  • Mitigation time:
  • Resolution time:

Impact:

  • Who was affected and how (regions, pages, users)
  • Revenue/support impact (if known)

Root cause:

  • Primary cause:
  • Contributing factors:

What went well:

  • (e.g., fast rollback, good comms)

What didn’t go well:

  • (e.g., unclear ownership, missing alerts, false positives)

Action items:

  • Prevention (fix underlying issue)
  • Detection (add/adjust monitors, keyword checks, regions)
  • Response (update runbook, escalation)
  • Owner + due date for each item

Print or save the checklist

You don’t want to invent a process during an outage.

Print or save the first 30 minutes checklist and the runbook template somewhere your team can find them in seconds. Then you’ll triage first, diagnose second, and fix third.