Incident response

Detection, triage, communication, and post-mortem process.

Detection

Synthetic monitors run every 60 seconds from three regions.
Error-rate and p95-latency alerts come from Sentry (if the customer has opted in) and Cloudflare Workers analytics.
Customer reports arrive at support@cohesionauth.com.

Severity classification

Sev	Definition	Target initial response
S1	API returning 5xx for `>1%` of traffic, OR auth failing globally	15 minutes, on-call paged
S2	Significant latency regression, OR partial endpoint outage	60 minutes
S3	Non-blocking bug, degraded analytics, single-customer issue	1 business day
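
The table above can be read as a small decision function. The following is a minimal sketch, not the production alerting logic: the `Signals` fields and the `classify` name are hypothetical, and because the table gives no numeric threshold for a "significant" latency regression, that judgment is passed in as a boolean. It also assumes that anything not meeting S1 or S2 falls through to S3.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of monitoring signals (field names are illustrative)."""
    error_5xx_ratio: float                # fraction of API traffic returning 5xx, 0.0 to 1.0
    auth_failing_globally: bool
    significant_latency_regression: bool  # decided upstream; the table sets no numeric threshold
    partial_endpoint_outage: bool


# Target initial response per severity, copied from the table.
RESPONSE_TARGET = {
    "S1": "15 minutes, on-call paged",
    "S2": "60 minutes",
    "S3": "1 business day",
}


def classify(s: Signals) -> str:
    """Map signals to a severity, checking the most severe conditions first."""
    if s.error_5xx_ratio > 0.01 or s.auth_failing_globally:
        return "S1"
    if s.significant_latency_regression or s.partial_endpoint_outage:
        return "S2"
    # Assumption: everything else (non-blocking bugs, single-customer issues) is S3.
    return "S3"
```

Checking S1 before S2 means an incident that matches both rows is always handled at the higher severity.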

Communication

Status page: status.cohesionauth.com is updated within 15 minutes for S1 and within 30 minutes for S2.
Customer email: sent within 30 minutes for S1.
Incident channel: customers on enterprise contracts receive a Slack Connect channel or email thread for the duration of the incident.
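
As an illustration of the timing targets above, the sketch below turns them into concrete deadlines. The `COMMS_TARGETS_MIN` table and the `comms_deadlines` function are hypothetical names, and the sketch assumes the clock starts at the moment of detection, which this page does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Minutes allowed per channel and severity, taken from the targets above.
# Severities with no stated target (such as S3) are simply absent.
COMMS_TARGETS_MIN = {
    "S1": {"status_page": 15, "customer_email": 30},
    "S2": {"status_page": 30},
}


def comms_deadlines(severity: str, detected_at: datetime) -> Dict[str, datetime]:
    """Return the latest allowed time for each communication channel."""
    targets = COMMS_TARGETS_MIN.get(severity, {})
    return {
        channel: detected_at + timedelta(minutes=minutes)
        for channel, minutes in targets.items()
    }


# Example: an S1 detected at 12:00 UTC needs a status page update by 12:15 UTC.
detected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deadlines = comms_deadlines("S1", detected)
```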

Post-mortem

Published within 5 business days for all S1 incidents.
Includes a timeline, the root cause, what we missed, and what will change.
Blameless: the focus is on systems, not people.
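
The 5-business-day publication window can be computed mechanically. This is a minimal sketch under stated assumptions: `postmortem_due` is a hypothetical helper, counting starts the day after the incident date, working weeks run Monday to Friday, and public holidays are ignored.

```python
from datetime import date, timedelta


def postmortem_due(incident_date: date, business_days: int = 5) -> date:
    """Return the date that falls `business_days` working days after `incident_date`.

    Assumes Monday-Friday working weeks and ignores public holidays.
    """
    due = incident_date
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # weekday() is 0-4 for Monday-Friday
            remaining -= 1
    return due


# Example: an S1 on Monday 2024-01-01 is due the following Monday, 2024-01-08.
due = postmortem_due(date(2024, 1, 1))
```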

Runbooks

Internal decision trees for API-down, key-leak, D1-outage, and Cloudflare-outage live in scripts/incident-response-runbook.md (private). Summaries are shared with enterprise customers under NDA.
