Incident response

Detection, triage, communication, and post-mortem process.

Detection

Severity classification

SevDefinitionTarget initial response
S1API returning 5xx for >1% of traffic, OR auth failing globally15 minutes, on-call paged
S2Significant latency regression, OR partial endpoint outage60 minutes
S3Non-blocking bug, degraded analytics, single-customer issue1 business day

Communication

Post-mortem

Runbooks

Internal decision trees for API-down, key-leak, D1-outage, and Cloudflare-outage live in scripts/incident-response-runbook.md (private). Summaries are shared with enterprise customers under NDA.

Next step