DET-BETA-003 — Sentry error-rate spike on beta preview endpoints
Rule ID: DET-BETA-003
Title: Sentry error rate > 3σ above 7-day baseline on /api/beta/preview/* routes
Category: beta
Last validated: 2026-06-12 (beta-launch campaign)
State: dormant — sentry_backend flag is OFF in prod (prerequisite P2 from June 4 catalog). Once the flag is on and SENTRY_DSN_BACKEND is set, this rule activates immediately. The beta-launch window makes P2 urgent — see escalation note below.
Why this detection exists
The beta preview routes are new and complex, and they make a cross-service HTTP call (Raptor → Console via beta_token_verifier.py) on every request. Error-rate spikes on these routes can signal:
- Operational failure: Console's internal verify endpoint is down, or INTERNAL_API_SECRET is mismatched between the two apps, which causes a RuntimeError in every token verification attempt and produces 500s for all beta testers.
- Adversarial probing: unusual error patterns (405 Method Not Allowed bursts, 400 bad-request bursts on well-formed endpoints) can indicate boundary-probing by a scanner that found the endpoint.
- Beta-launch regression: a bug in the just-deployed preview route that only manifests under real-user load (not caught by pre-launch tests).
The existing ops_sentry_error_rate_spike rule (DET-OPS-002 from June 4 catalog) applies across all routes but its baseline is the entire app. The beta routes need a route-specific detection that fires on a surge in those paths before the global rule catches it in aggregate.
Telemetry source
- Sentry (sentry_backend flag, SENTRY_DSN_BACKEND env var): any unhandled exception or explicit sentry_sdk.capture_exception call from routes prefixed /api/beta/preview/. The beta_preview.py route uses logger.error() for DB errors; these do not surface in Sentry unless Sentry is wired. Once wired, unhandled 500s from the route handler surface automatically.
- Route-specific Sentry tags: filter on the transaction or url field matching */beta/preview/* in Sentry's performance monitoring.
- Heroku router 5xx count (as a fallback when Sentry is off): heroku logs --app raxx-api-prod | grep -E 'status=5[0-9]{2}' | grep 'beta/preview'. The pattern matches every 5xx status, so router timeouts reported as 503 are counted alongside application 500s.
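The Heroku router fallback can also be counted programmatically instead of by eye. The following is a minimal sketch, assuming the standard Heroku router log line layout (path="..." followed later by status=NNN). The function name count_beta_preview_5xx and the sample lines are hypothetical and exist only for illustration; they are not taken from the Raptor codebase.

```python
import re
from collections import Counter

# Assumed Heroku router layout: ... path="/some/path" ... status=NNN ...
ROUTER_LINE = re.compile(r'path="(?P<path>[^"]*)".*?\bstatus=(?P<status>\d{3})\b')


def count_beta_preview_5xx(lines, prefix="/api/beta/preview/"):
    """Count router-reported 5xx responses per status code for one route prefix."""
    counts = Counter()
    for line in lines:
        if "heroku[router]" not in line:
            continue  # skip app and dyno log lines
        match = ROUTER_LINE.search(line)
        if match is None:
            continue
        if match["path"].startswith(prefix) and match["status"].startswith("5"):
            counts[match["status"]] += 1
    return counts


# Hypothetical sample lines, shaped like router output, for illustration only.
SAMPLE_LINES = [
    '2026-06-12T14:00:01+00:00 heroku[router]: at=info method=GET '
    'path="/api/beta/preview/tok/screen/1" host=app.example.invalid request_id=r1 '
    'fwd="203.0.113.7" dyno=web.1 connect=0ms service=12ms status=500 bytes=0 protocol=https',
    '2026-06-12T14:00:31+00:00 heroku[router]: at=error code=H12 desc="Request timeout" '
    'method=GET path="/api/beta/preview/tok/screen/2" host=app.example.invalid request_id=r2 '
    'fwd="203.0.113.7" dyno=web.1 connect=0ms service=30000ms status=503 bytes=0 protocol=https',
    '2026-06-12T14:00:40+00:00 heroku[router]: at=info method=GET '
    'path="/api/beta/preview/tok/screen/3" host=app.example.invalid request_id=r3 '
    'fwd="203.0.113.7" dyno=web.1 connect=0ms service=9ms status=200 bytes=512 protocol=https',
    '2026-06-12T14:00:45+00:00 heroku[router]: at=info method=GET '
    'path="/api/health" host=app.example.invalid request_id=r4 '
    'fwd="203.0.113.7" dyno=web.1 connect=0ms service=3ms status=500 bytes=0 protocol=https',
    '2026-06-12T14:00:50+00:00 app[web.1]: beta_preview.screen1_get db_error',
]

if __name__ == "__main__":
    print(count_beta_preview_5xx(SAMPLE_LINES))
```

Piping heroku logs output into a script like this gives a per-status breakdown, which helps separate application 500s from router-level 503s while Sentry is off.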
Escalation note — P2 urgency on beta-launch day
sentry_backend is currently OFF in prod. With real testers hitting these routes today, flying blind on server errors is a HIGH operational risk — bugs manifest at first-real-user volume that synthetic tests miss. This is an immediate operator-action recommendation:
Recommendation to operator (not an auto-action): flip sentry_backend ON and set SENTRY_DSN_BACKEND today before invite emails go out. Command (from feedback_bootstrap_via_heroku):
heroku config:set sentry_backend=1 SENTRY_DSN_BACKEND=<dsn from Infisical> \
--app raxx-api-prod >/dev/null 2>&1
Routing: sre-agent to confirm DSN is present in vault and flag is promotion-ready.
Statistical method + baseline window
- Method: Poisson tail on error-events-per-minute on the /api/beta/preview/* transaction grouping in Sentry. Compare to the rolling 7-day baseline per route prefix.
- Baseline window: 7 days. On day 1 (today), the baseline is zero, so any error fires immediately at MEDIUM. The statistical threshold activates after 7 days of observed data.
- Fire condition (post-baseline): observed errors-per-minute on /api/beta/preview/* > μ + 3σ of the 7-day baseline for the same route prefix.
- Fire condition (pre-baseline / day 1–7): >= 3 errors in any 5-minute window on these routes. This is a hard floor with no statistical requirement: the surface is new and any error cluster during beta launch is signal worth surfacing.
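The two fire conditions above can be sketched as a single decision function. This is a minimal illustration, not the deployed rule: the function names, the argument shapes, and the choice of population standard deviation are assumptions. The poisson_tail helper shows one way a Poisson tail probability could be computed if a p-value were preferred over the μ + 3σ comparison.

```python
import math
from statistics import mean, pstdev

PRE_BASELINE_MIN_ERRORS = 3  # hard floor from the rule text
BASELINE_DAYS = 7            # baseline window from the rule text


def dynamic_threshold(per_minute_counts):
    """Return mu + 3*sigma over the baseline per-minute error counts."""
    mu = mean(per_minute_counts)
    sigma = pstdev(per_minute_counts)  # assumption: population std dev
    return mu + 3 * sigma


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - CDF(k - 1)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def should_fire(days_observed, errors_last_5_min, errors_last_minute, per_minute_counts):
    """Apply the pre-baseline floor first, then the post-baseline threshold.

    per_minute_counts must be non-empty once the baseline window is complete.
    """
    if days_observed < BASELINE_DAYS:
        return errors_last_5_min >= PRE_BASELINE_MIN_ERRORS
    return errors_last_minute > dynamic_threshold(per_minute_counts)


if __name__ == "__main__":
    quiet_baseline = [0] * 99 + [1]  # hypothetical baseline: 1 error in 100 minutes
    print(should_fire(1, 3, 0, []))              # pre-baseline floor
    print(should_fire(7, 0, 1, quiet_baseline))  # post-baseline comparison
    print(poisson_tail(1, mean(quiet_baseline)))
```

With a near-zero baseline, σ is also near zero, so a single error in one minute exceeds μ + 3σ. That is consistent with the rule's intent that errors on this surface are rare enough to be signal.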
Threshold + expected FP rate
- Day 1 absolute threshold: >= 3 server-side errors (5xx) on /api/beta/preview/* in any 5-minute window.
- Post-baseline dynamic threshold: μ + 3σ on the 7-day error rate per route.
- Expected FP rate (day 1): low. The expected error rate from legitimate testers is near zero if the Console cross-service call is stable. The day-1 alert most likely to be mistaken for an FP is the INTERNAL_API_SECRET mismatch error, which surfaces as a RuntimeError in Raptor and a 500 for the tester. That is not an FP; it is an operational incident, and the rule correctly fires.
Alert route
- CRITICAL (>= 10 errors in 5 minutes OR any RuntimeError from beta_token_verifier.py, which indicates HMAC key mismatch causing ALL verifications to fail): #raxx-ops-alert-sev1. This is an active tester-facing outage.
- HIGH (>= 3 errors in 5 minutes or sustained 3σ exceedance): #raxx-ops-alert-sev2-5 (ET) / #raxx-ops-alert-sev2 (off-hours). Per-event: beta-launch day errors are SEV-2 by posture.
- MEDIUM (1–2 errors in 5 minutes, no pattern): ops@ daily digest.
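The severity tiers above reduce to an ordered set of checks, most severe first. The sketch below is illustrative only: route_alert and its parameters are hypothetical names, and the business_hours_et flag assumes that "(ET)" refers to the channel used during Eastern-time working hours, which the rule text does not spell out.

```python
from typing import Optional, Tuple


def route_alert(
    errors_in_5_min: int,
    verifier_runtime_error: bool = False,
    sustained_3sigma: bool = False,
    business_hours_et: bool = True,  # assumption: selects the "(ET)" channel
) -> Optional[Tuple[str, str]]:
    """Return (severity, destination), or None when nothing should be sent."""
    # CRITICAL: volume threshold, or any RuntimeError from beta_token_verifier.py.
    if errors_in_5_min >= 10 or verifier_runtime_error:
        return ("CRITICAL", "#raxx-ops-alert-sev1")
    # HIGH: smaller cluster, or a sustained 3-sigma exceedance.
    if errors_in_5_min >= 3 or sustained_3sigma:
        channel = "#raxx-ops-alert-sev2-5" if business_hours_et else "#raxx-ops-alert-sev2"
        return ("HIGH", channel)
    # MEDIUM: isolated errors with no pattern go to the daily digest.
    if errors_in_5_min >= 1:
        return ("MEDIUM", "ops@ daily digest")
    return None


if __name__ == "__main__":
    print(route_alert(1, verifier_runtime_error=True))
    print(route_alert(4, business_hours_et=False))
    print(route_alert(2))
```

Checking CRITICAL first matters: a single RuntimeError from the verifier has an error count of 1, which would otherwise fall through to MEDIUM.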
Escalation owner
- sre-agent: first response. Determine whether the error is operational (Console down, HMAC mismatch, DB connection issue) vs. code regression.
- security-agent: if the error pattern suggests adversarial input (e.g., 400-burst from one IP suggesting injection probing, or unexpected 403 patterns indicating auth bypass attempts).
- feature-developer: if the error is a code regression from the beta-preview routes (missing field, schema mismatch after the backend_v2/alembic/versions/0031_beta_walkthrough_tables.py migration).
Specific error signatures to watch
| Sentry exception class / log pattern | Likely cause | Who to escalate to |
|---|---|---|
| RuntimeError: beta_token_verifier: Console returned 403 | INTERNAL_API_SECRET mismatch between raxx-api-prod and raxx-console-prod | sre-agent (config var sync) |
| beta_token_verifier: Console request timed out x3+ in 5 min | Console app overloaded or down | sre-agent |
| beta_preview.*_get db_error | Postgres connection issue on beta_preview_progress table | sre-agent |
| beta_preview.screen5_post db_error | Schema mismatch (column rename in 0031 migration not applied cleanly) | feature-developer |
| 5xx with no exception in Sentry | Heroku router timeout or dyno OOM | sre-agent |
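The message-based rows of the table can be expressed as an ordered pattern list for triage tooling. This is a hedged sketch: the regular expressions are one reading of the patterns in the table (for example, beta_preview.*_get is treated as a wildcard over handler names), and escalation_owner is a hypothetical helper. The "x3+ in 5 min" condition and the "5xx with no exception in Sentry" row depend on counts and on the absence of an event, so they are not covered by message matching.

```python
import re

# Ordered list: the first matching pattern decides the owner.
SIGNATURES = [
    (re.compile(r"RuntimeError: beta_token_verifier: Console returned 403"), "sre-agent"),
    (re.compile(r"beta_token_verifier: Console request timed out"), "sre-agent"),
    (re.compile(r"beta_preview\.screen5_post db_error"), "feature-developer"),
    (re.compile(r"beta_preview\.\w+_get db_error"), "sre-agent"),
]


def escalation_owner(message):
    """Return the owner for a known signature, or None if nothing matches."""
    for pattern, owner in SIGNATURES:
        if pattern.search(message):
            return owner
    return None


if __name__ == "__main__":
    print(escalation_owner("RuntimeError: beta_token_verifier: Console returned 403"))
    print(escalation_owner("beta_preview.screen5_post db_error: column missing"))
    print(escalation_owner("something unrelated"))
```

Returning None for unmatched messages keeps unknown errors visible rather than silently assigning them to an owner.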
Test fixture / synthetic positive
See _fixtures/preview_sentry_error_spike_positive.json — 5 synthetic Sentry events for transaction=/api/beta/preview/synth-scraper-token-001/screen/1 with exception class RuntimeError occurring within a 3-minute window.
What NOT to do
- Do not tune this rule to reduce noise before the baseline establishes. Any error during beta-launch day is a real signal.
- Do not treat a RuntimeError from beta_token_verifier as a token enumeration event. It is a configuration failure, not an adversarial token, and it affects ALL testers simultaneously.
- Do not wait for the automated Sentry alert to fire if testers are reporting errors in real time. Escalate immediately on operator-reported issues per the normal incident flow.