Incident response
Detection, triage, communication, and post-mortem process.
Detection
- Synthetic monitors every 60 seconds from three regions.
- Error-rate and p95-latency alerts via Sentry (if the customer has opted in) and Cloudflare Workers analytics.
- Customer reports to [email protected].
Severity classification
| Sev | Definition | Target initial response |
|---|---|---|
| S1 | API returning 5xx for >1% of traffic, OR auth failing globally | 15 minutes, on-call paged |
| S2 | Significant latency regression, OR partial endpoint outage | 60 minutes |
| S3 | Non-blocking bug, degraded analytics, single-customer issue | 1 business day |
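
The table above can be read as an ordered set of rules: check S1 first, then S2, and fall through to S3. A minimal sketch of that ordering follows. The `Signals` type and its field names are hypothetical, and because the table does not define "significant" for latency regressions, that judgment is passed in as a boolean rather than computed. Treating S3 as the fallthrough case is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical snapshot of monitoring signals; field names are illustrative."""
    error_5xx_ratio: float                # fraction of traffic returning 5xx, 0.0 to 1.0
    auth_failing_globally: bool
    significant_latency_regression: bool  # judged by the responder, not computed here
    partial_endpoint_outage: bool


def classify(signals: Signals) -> str:
    """Return "S1", "S2", or "S3", checking the most severe rule first."""
    # S1: 5xx for more than 1% of traffic, OR auth failing globally.
    if signals.error_5xx_ratio > 0.01 or signals.auth_failing_globally:
        return "S1"
    # S2: significant latency regression, OR partial endpoint outage.
    if signals.significant_latency_regression or signals.partial_endpoint_outage:
        return "S2"
    # Assumption: anything that matches neither rule is handled as S3.
    return "S3"
```

Checking S1 before S2 matters: an incident with both a partial outage and a global auth failure should page on-call, not wait 60 minutes.
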
Communication
- Status page: status.cohesionauth.com is updated within 15 minutes for S1 and within 30 minutes for S2.
- Customer email: within 30 minutes for S1.
- Incident channel: customers on enterprise contracts receive a Slack Connect channel or email thread for the duration of the incident.
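
The communication targets above translate into concrete deadlines once an incident has a start time. The sketch below shows one way to compute them. It assumes the clock starts when the incident is declared, which this document does not specify, and the function and dictionary names are hypothetical. Only the targets stated above are encoded; severities with no stated target get no deadline.

```python
from datetime import datetime, timedelta, timezone

# Minutes allowed per channel, taken from the targets listed above.
STATUS_PAGE_MINUTES = {"S1": 15, "S2": 30}
CUSTOMER_EMAIL_MINUTES = {"S1": 30}


def communication_deadlines(severity: str, declared_at: datetime) -> dict[str, datetime]:
    """Return the deadline for each channel that has a stated target.

    Assumption: the timer starts at `declared_at` (incident declaration).
    """
    deadlines: dict[str, datetime] = {}
    if severity in STATUS_PAGE_MINUTES:
        deadlines["status_page"] = declared_at + timedelta(
            minutes=STATUS_PAGE_MINUTES[severity]
        )
    if severity in CUSTOMER_EMAIL_MINUTES:
        deadlines["customer_email"] = declared_at + timedelta(
            minutes=CUSTOMER_EMAIL_MINUTES[severity]
        )
    return deadlines


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for channel, due in communication_deadlines("S1", declared).items():
        print(f"{channel}: due {due.isoformat()}")
```

Using timezone-aware timestamps avoids ambiguity when responders are in different regions.
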
Post-mortem
- Published within 5 business days for all S1 incidents.
- Includes a timeline, the root cause, what we missed, and what changes as a result.
- Blameless. Systems, not people.
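
The 5-business-day publication target can be turned into a calendar date. The sketch below is illustrative only: it assumes business days are Monday through Friday with no holiday calendar, and that counting starts on the day after the incident. Neither assumption is stated in this document, and `add_business_days` is a hypothetical helper.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumptions: Monday to Friday are business days, holidays are ignored,
    and `start` itself is not counted.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def postmortem_due(incident_date: date) -> date:
    """Publication deadline for an S1 post-mortem under the assumptions above."""
    return add_business_days(incident_date, 5)
```

For an incident on Monday 2024-01-01, `postmortem_due` returns Monday 2024-01-08 under these assumptions.
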
Runbooks
Internal decision trees for API-down, key-leak, D1-outage, and Cloudflare-outage live in scripts/incident-response-runbook.md (private). Summaries are shared with enterprise customers under NDA.