DET-OPS-001 — Sentry error rate spike
Rule ID: DET-OPS-001
Title: Sentry events-per-minute per route exceeding 3σ above 7-day baseline
Category: ops
Last validated: 2026-06-04 (initial catalog, dormant)
State: dormant. Requires the sentry_backend flag ON in prod and SENTRY_DSN_BACKEND set per app (see project_apm_vendor_sentry).
Telemetry source
- Sentry SaaS: events captured via backend_v2/api/observability/sentry_init.py (FlaskIntegration). PII-scrubbed via _before_send (sensitive headers + body keys filtered).
- Source for the detection: Sentry API GET /api/0/projects/{org}/{project}/events/ filtered by tag route:<route> with 1-min buckets.
- Companion Console-side Sentry init similarly via the same flag posture (verify path; not directly cited here).
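The 1-min bucketing by route tag could look roughly like the sketch below. It assumes the events have already been fetched, that each event carries its tags as a list of key/value pairs, and that the timestamp sits under dateCreated in ISO-8601 form; verify those field names against a real response before relying on them. The function name bucket_events_per_minute is hypothetical.

```python
from collections import Counter
from datetime import datetime


def bucket_events_per_minute(events):
    """Count events per (route, minute) from a list of Sentry-style event dicts."""
    counts = Counter()
    for event in events:
        # Assumed shape: tags is a list of {"key": ..., "value": ...} pairs.
        tags = {t["key"]: t["value"] for t in event.get("tags", [])}
        route = tags.get("route")
        if route is None:
            continue  # an untagged event cannot be attributed to a route
        # Assumed field: ISO-8601 timestamp under "dateCreated".
        ts = datetime.fromisoformat(event["dateCreated"].replace("Z", "+00:00"))
        minute = ts.replace(second=0, microsecond=0)  # truncate to the 1-min bucket
        counts[(route, minute)] += 1
    return counts
```

The resulting counter maps each (route, minute) pair to its event count, which is the quantity the fire condition compares against the per-route baseline.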
Statistical method + baseline window
- Method: Poisson tail test on events-per-route-per-minute. Per-route baselines because route mix varies (high-volume /api/status vs low-volume /api/admin/customers).
- Baseline window: rolling 7 days, time-of-day-aware (hourly buckets).
- Fire condition: observed events for a 1-min window on a route > μ + 3σ of that route's baseline AND observed count >= 5. Floor of 5 prevents single-error FPs.
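A minimal sketch of the fire condition, assuming the baseline is stored as per-minute counts grouped by hour of day for one route over the trailing 7 days. The function name should_fire and the baseline_by_hour layout are illustrative, not taken from the rule's implementation.

```python
from statistics import mean, pstdev


def should_fire(observed, hour, baseline_by_hour, sigma_mult=3.0, floor=5):
    """Return True when a 1-min count exceeds mu + sigma_mult * sigma and meets the floor.

    baseline_by_hour maps hour-of-day (0-23) to the per-minute counts seen
    for this route in that hour over the baseline window (assumed layout).
    """
    samples = baseline_by_hour[hour]
    mu = mean(samples)
    sigma = pstdev(samples)  # population std dev over the baseline samples
    # Both parts must hold: the statistical breach AND the absolute floor.
    return observed > mu + sigma_mult * sigma and observed >= floor
```

The floor matters most on quiet routes: with a baseline near zero, a single error already clears μ + 3σ, and the count >= 5 check is what keeps it from firing.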
Threshold + expected FP rate
- Pre-launch placeholder: absolute threshold of >= 20 events/min on any route. Replaced by dynamic 3σ post-baseline.
- Expected FP rate (post-launch): ~3 per week. Deploy-aligned error bursts (a new endpoint with bad init, etc.) will fire. That's the design — those should be caught.
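The switch from the pre-launch placeholder to the dynamic threshold can be sketched as below. It assumes a missing baseline is signalled by passing None for μ or σ; the function name breaches_threshold is hypothetical.

```python
def breaches_threshold(observed, mu=None, sigma=None):
    """Apply the placeholder threshold until a per-route baseline exists."""
    if mu is None or sigma is None:
        # Pre-launch placeholder: absolute threshold, inclusive (>= 20 events/min).
        return observed >= 20
    # Post-baseline: strict 3-sigma breach plus the floor of 5 events.
    return observed > mu + 3 * sigma and observed >= 5
```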
Alert route
- HIGH (single route, single 1-min window, >= μ + 4σ): #raxx-ops-alert-sev2-5 (ET) / #raxx-ops-alert-sev2 (off-hours). Per-event because deploy-induced errors can cascade fast.
- MEDIUM (3σ band, sustained for 3+ consecutive minutes): ops@ daily digest with route-level summary.
- LOW (single-route single-window 3σ, no sustained pattern): silent log.
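One way to express the three tiers is sketched below. The precedence (HIGH checked before MEDIUM, MEDIUM before LOW) is an assumption, since the tiers above do not say which wins when a window qualifies for more than one. The function name classify is hypothetical.

```python
def classify(window_counts, mu, sigma, floor=5):
    """Classify the latest 1-min window for one route.

    window_counts holds recent per-minute counts, oldest first.
    Returns "HIGH", "MEDIUM", "LOW", or None when the rule does not fire.
    """
    three_sigma = mu + 3 * sigma
    four_sigma = mu + 4 * sigma

    def breached(count):
        return count > three_sigma and count >= floor

    latest = window_counts[-1]
    if not breached(latest):
        return None
    if latest >= four_sigma:
        return "HIGH"  # single window at or above the 4-sigma band
    # Count how many consecutive trailing minutes breached the 3-sigma band.
    run = 0
    for count in reversed(window_counts):
        if not breached(count):
            break
        run += 1
    return "MEDIUM" if run >= 3 else "LOW"
```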
Escalation owner
- sre-agent primary — most fires are post-deploy regressions.
- security-agent if the route is auth-adjacent (/api/auth/*, /api/admin/*) AND the error type clusters on permission_denied or unauthorized.
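The owner selection can be sketched as a small routing function. How the dominant error type of a cluster is determined is not specified here, so the sketch takes it as an input; the function and constant names are hypothetical.

```python
from fnmatch import fnmatchcase

AUTH_ADJACENT_PATTERNS = ("/api/auth/*", "/api/admin/*")
SECURITY_ERROR_TYPES = {"permission_denied", "unauthorized"}


def escalation_owner(route, dominant_error_type):
    """Pick the escalation owner for a fire on the given route."""
    # fnmatchcase treats "*" as matching any characters, including "/".
    auth_adjacent = any(fnmatchcase(route, p) for p in AUTH_ADJACENT_PATTERNS)
    if auth_adjacent and dominant_error_type in SECURITY_ERROR_TYPES:
        return "security-agent"
    return "sre-agent"  # default owner: most fires are post-deploy regressions
```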
Test fixture / synthetic positive
See _fixtures/sentry_error_rate_spike_positive.json for a synthetic Sentry-API-shaped response showing 32 events/min on /api/auth/login/verify against a baseline of μ=2.1, σ=0.8.
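As a quick arithmetic check of the fixture values (not of the fixture file itself, whose exact shape is not reproduced here), the snippet below works out the 3σ and 4σ thresholds for μ=2.1, σ=0.8 and confirms that 32 events/min lands in the HIGH band.

```python
mu, sigma, observed = 2.1, 0.8, 32

three_sigma = mu + 3 * sigma       # fire threshold: 2.1 + 2.4 = 4.5
four_sigma = mu + 4 * sigma        # HIGH band threshold: 2.1 + 3.2 = 5.3
z_score = (observed - mu) / sigma  # distance from baseline in sigmas

fires = observed > three_sigma and observed >= 5  # 3-sigma breach plus floor
is_high = fires and observed >= four_sigma

print(f"3-sigma threshold: {three_sigma:.1f}")
print(f"4-sigma threshold: {four_sigma:.1f}")
print(f"fires={fires}, high={is_high}, z={z_score:.1f}")
```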
What to do when this fires
- Pull the top error message + stack-trace cluster for the route in the fire window. One root cause or many?
- Correlate with the most recent deploy SHA. If the fire-window starts within 5 minutes of a deploy: likely a regression. Dispatch sre-agent for rollback evaluation.
- If no recent deploy: check upstream-dependency status (Heroku Postgres, Cloudflare, broker API).
- For auth-route fires: cross-reference DET-AUTH-001 / DET-AUTH-002 / DET-AUTH-003 — if any are also firing, escalate to security-agent.
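The deploy-correlation step can be sketched as a timestamp comparison. It assumes "within 5 minutes of a deploy" means the fire window starts at or after the deploy and no more than 5 minutes later; the function name likely_regression is hypothetical, and where the deploy timestamp comes from is left open.

```python
from datetime import datetime, timedelta


def likely_regression(fire_start, last_deploy_at, window=timedelta(minutes=5)):
    """Return True when the fire window starts within `window` after the last deploy."""
    if last_deploy_at is None:
        return False  # no recent deploy: check upstream dependencies instead
    delta = fire_start - last_deploy_at
    # Assumption: only fires at or after the deploy count as deploy-aligned.
    return timedelta(0) <= delta <= window
```

A True result is a prompt to dispatch sre-agent for rollback evaluation, not a trigger for automatic rollback.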
What NOT to do
- Do not increase the baseline window beyond 7 days; recent regressions hide in long baselines.
- Do not exclude routes from this rule because they're "noisy" — high-baseline noisy routes self-tune via their own per-route baseline.
- Do not auto-rollback from this rule. Rollback decisions go to sre-agent.