RCA — Session-auth middleware missing exemptions: waitlist signup + Postmark delivery webhook
- Incident ID: 2026-06-05-session-auth-waitlist-postmark-exemptions
- Date: 2026-06-05
- Severity: SEV-1 (production outage; getraxx.com user-facing signup broken; delivery event pipeline dead)
- Duration: 8 days (2026-05-28 UTC → 2026-06-05 17:06 UTC)
- Blast radius: All getraxx.com waitlist submissions; all Postmark delivery/bounce/spam events
- Author: sre-agent
Summary
`FLAG_SESSION_AUTH_MIDDLEWARE` was enabled on `raxx-api-prod` on 2026-05-28 as part of a launch-day security audit. The middleware's exemption list was written when the flag was first designed; several endpoints added later, including `/api/waitlist/signup` and `/webhooks/postmark/delivery`, were never added to it. Every request to these paths returned HTTP 401 with `{"error":"Authentication required","reason":"missing"}`. For 8 days the getraxx.com waitlist form kept accepting submissions in the browser while the API rejected them with 401, and the Postmark delivery/bounce/spam webhook pipeline was completely dead. Fix: 6 missing path prefixes added to `_EXEMPT_PREFIXES` in `session_auth.py`. Deployed to production 2026-06-05 17:06 UTC.
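The failure mode can be illustrated with a minimal sketch of a prefix-based exemption check. Only `_EXEMPT_PREFIXES`, the two incident paths, and the 401 body come from this report; the function names `is_exempt` and `check_session_auth`, the `/healthz` entry, and the overall shape of the check are assumptions, not the actual contents of `session_auth.py`.

```python
import json

# Hypothetical contents: only the two incident paths are named in this report.
_EXEMPT_PREFIXES = (
    "/healthz",                     # assumed pre-existing exemption
    "/api/waitlist/signup",         # added by the fix
    "/webhooks/postmark/delivery",  # added by the fix
)

# Response body observed during the incident.
_DENIED_BODY = {"error": "Authentication required", "reason": "missing"}


def is_exempt(path, prefixes=_EXEMPT_PREFIXES):
    """Return True when the request path starts with an exempt prefix."""
    return any(path.startswith(prefix) for prefix in prefixes)


def check_session_auth(path, has_session_cookie, prefixes=_EXEMPT_PREFIXES):
    """Return (status, body) when the request is rejected, else None."""
    if is_exempt(path, prefixes) or has_session_cookie:
        return None  # fall through to the route's own handler
    return 401, json.dumps(_DENIED_BODY)


# Pre-fix list (assumed): the waitlist path is missing, so a cookieless POST is rejected.
print(check_session_auth("/api/waitlist/signup", False, prefixes=("/healthz",)))
# Post-fix list: the same request passes through the middleware.
print(check_session_auth("/api/waitlist/signup", False))
```

Because public signup forms and third-party webhooks never carry a session cookie, any such path that is absent from the list is rejected unconditionally.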
Timeline (all times UTC)
- 2026-05-28 ~00:00 — `FLAG_SESSION_AUTH_MIDDLEWARE=1` set on `raxx-api-prod` via `heroku config:set` (PR #2997 hotfix wave; migration 0112 documents this date).
- 2026-05-28 ~00:00 — `/api/waitlist/signup` and `/webhooks/postmark/delivery` begin returning 401. Silent; no alert fires because neither path had a synthetic probe or Postmark event-counter alarm.
- 2026-06-05 ~12:00 — Email-workflow E2E verification (workflow `wf_f3757591-222`) surfaces the failures as two CRITICALs in PR #3305.
- 2026-06-05 16:50 — sre-agent reads root file; confirms both probe paths return 401 pre-fix.
- 2026-06-05 16:56 — PR #3306 merged (CI: 17 passed, 14 skipped, 0 failed).
- 2026-06-05 17:00 — Staging deploy complete.
- 2026-06-05 17:07 — Production deploy complete.
- 2026-06-05 17:07 — Post-deploy probes confirm: waitlist signup → 202 `check_your_email`; Postmark delivery → 404 `not_found` (flag-off, which is correct; no longer 401 from session-auth).
- Resolved.
Impact
- Users affected: all getraxx.com visitors who submitted the waitlist form from 2026-05-28 through 2026-06-05 (~8 days). Submissions reached the browser form-submit handler but the API returned 401; leads were not captured in the DB.
- User-visible symptoms: waitlist form appeared to work (no error surfaced to user if the frontend silently swallowed non-2xx), but no confirmation email was sent and no row was written to `waitlist_signups`.
- Data integrity: waitlist signup rows for this 8-day window are absent. No corruption; data was never written.
- Postmark delivery events: Delivery/Bounce/SpamComplaint webhook calls from Postmark returned 401. Postmark does not retry 4xx. These events are permanently lost for the 8-day window. The delivery monitor was non-functional; no bounce-rate alerts could have fired even if thresholds were crossed.
- Revenue / billing: waitlist is pre-revenue; no direct billing impact. Lost leads are a business impact but not quantifiable without traffic data.
What went well
- The email-workflow E2E verification caught both failures in a single run rather than waiting for a customer report.
- The middleware code structure made root cause unambiguous: `_EXEMPT_PREFIXES` is the single authoritative list; the fix is a targeted append.
- CI ran in under 3 minutes and caught nothing new (expected, since this is an additive config change with no logic change).
- Pre-deploy and post-deploy probes provided clear before/after confirmation.
What didn't go well
- No synthetic probe existed for `/api/waitlist/signup`. The getraxx.com waitlist is the primary pre-launch lead-capture mechanism; it had zero production monitoring.
- No Postmark dashboard alarm on delivery-event count. A zero-event alert (or alert on N hours with no events when the flag is on) would have caught the dead webhook within hours.
- The exemption list in `session_auth.py` had no "companion checklist" for the routes audit. When new routes are added to `api/__init__.py`, there is no CI gate that validates them against the exemption list.
- The migration 0112 documentation listed the original exempt paths but did not trigger a review of routes added after the middleware was written.
Root cause analysis
- Contributing factor 1: Exemption list written at middleware design time, not routes-registration time. `_EXEMPT_PREFIXES` was written when the middleware was first designed (pre-launch sprint). Routes that were added later (`waitlist_signups.bp`, `marketing.waitlist_bp`, `postmark_delivery.bp_webhook`, `freescout_support_webhook.bp`, `freescout_audit_webhook.bp`, `internal_waf_events.bp`) each have their own authentication mechanism and no session cookie. None was added to the exemption list at route-addition time because there was no process to trigger that review.
- Contributing factor 2: No monitoring on public signup or webhook receipt rate. If either a synthetic probe of `/api/waitlist/signup` or a Postmark event-count alarm had existed, the failure would have been detected within 1–24h of the flag enable rather than 8 days later.
- Contributing factor 3: Squash merge of security flag conflated timing with route coverage. The flag was enabled as a security hardening step; the exemption list was "intentionally conservative" (per the module docstring). The conservatism was correct in principle, but the scope wasn't re-audited at flag-enable time against the full set of registered routes.
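Contributing factors 1 and 3 both come down to a missing cross-check between the registered routes and the exemption list. A minimal sketch of such an audit follows. The function name `unclassified_routes`, the idea of a separate "session-protected" prefix list, and all sample paths other than the two incident paths are hypothetical; in a Flask app the registered paths could be collected from `app.url_map.iter_rules()`.

```python
def unclassified_routes(registered_paths, exempt_prefixes, session_protected_prefixes):
    """Return registered paths covered by neither prefix list, sorted for stable output."""

    def covered(path, prefixes):
        return any(path.startswith(prefix) for prefix in prefixes)

    return sorted(
        path
        for path in registered_paths
        if not covered(path, exempt_prefixes)
        and not covered(path, session_protected_prefixes)
    )


# Hypothetical route table at flag-enable time.
registered = [
    "/api/account/profile",
    "/api/waitlist/signup",
    "/webhooks/postmark/delivery",
    "/healthz",
]
exempt = ("/healthz",)             # assumed pre-fix exemption list
protected = ("/api/account",)      # assumed list of routes that do require a session

missing = unclassified_routes(registered, exempt, protected)
for path in missing:
    print(f"unclassified route: {path}")
```

Requiring every route to appear in exactly one of the two lists turns a forgotten exemption into a visible audit failure instead of a silent 401 in production.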
Detection
- What alerted us: email-workflow E2E verification report, workflow `wf_f3757591-222`, PR #3305, 2026-06-05.
- Time between cause and detection: ~8 days.
- How to detect faster next time: see Action Items 1 and 2.
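A zero-event check of the kind proposed in Action Item 2 could look roughly like the sketch below. The 6h window and the `FLAG_POSTMARK_DELIVERY_MONITOR` flag come from this report; the function name `silent_too_long`, the way the last event time is obtained, and the sample timestamps are assumptions.

```python
from datetime import datetime, timedelta, timezone


def silent_too_long(last_event_at, now, flag_enabled, max_silence=timedelta(hours=6)):
    """Return True when the monitor flag is on and no event arrived within max_silence.

    last_event_at is the timestamp of the most recent webhook event, or None
    if no event has ever been recorded.
    """
    if not flag_enabled:
        return False  # monitor is off; silence is expected
    if last_event_at is None:
        return True   # flag is on but nothing has ever arrived
    return now - last_event_at > max_silence


# Hypothetical timestamps: last event shortly before the flag was enabled.
last_event = datetime(2026, 5, 27, 23, 50, tzinfo=timezone.utc)
check_time = datetime(2026, 5, 28, 6, 30, tzinfo=timezone.utc)
print(silent_too_long(last_event, check_time, flag_enabled=True))
```

Gating on the flag avoids paging when the webhook is intentionally disabled, which matches the post-deploy state where the endpoint answers 404 from its own flag gate.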
Resolution
- What was changed: 6 path prefixes added to `_EXEMPT_PREFIXES` in `backend_v2/api/middleware/session_auth.py`. 10 regression tests added in `backend_v2/tests/test_session_auth_public_exemptions_3305.py`.
- PR: #3306
- Validation: post-deploy probes confirmed `POST /api/waitlist/signup` → 202 (was 401) and `POST /webhooks/postmark/delivery` → 404/`not_found` from its own flag gate (was 401 from session-auth).
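The validation hinges on telling a session-auth rejection apart from a response produced by the route itself. A minimal sketch of that distinction is shown below, keyed on the 401 body quoted in the Summary. The function name `rejected_by_session_auth` and the shape of the 404 body are assumptions; the report only states that the status was 404 with `not_found`.

```python
import json


def rejected_by_session_auth(status, body_text):
    """Return True when a response matches the session-auth middleware's 401."""
    if status != 401:
        return False
    try:
        body = json.loads(body_text)
    except (ValueError, TypeError):
        return False  # not JSON, so not the middleware's response
    return isinstance(body, dict) and body.get("error") == "Authentication required"


# Pre-fix response observed on both paths.
pre_fix = (401, '{"error":"Authentication required","reason":"missing"}')
# Post-fix webhook response; the body shape here is assumed for illustration.
post_fix = (404, '{"error":"not_found"}')

print(rejected_by_session_auth(*pre_fix))
print(rejected_by_session_auth(*post_fix))
```

A 404 from the route's own flag gate still counts as a pass for this check, because the request reached the route rather than being stopped by the middleware.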
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Add synthetic probe for `POST /api/waitlist/signup`; alert on non-2xx | sre-agent | 2026-06-12 | (file separately) |
| 2 | Add Postmark delivery-event-count alarm: alert ops@ if zero events received in any 6h window when `FLAG_POSTMARK_DELIVERY_MONITOR=1` | sre-agent | 2026-06-12 | (file separately) |
| 3 | Add CI lint: any blueprint registered in `api/__init__.py` that is NOT under the `/api` prefix (i.e., registered with `url_prefix=bp.url_prefix` directly) must appear in `_EXEMPT_PREFIXES` or be explicitly listed as "session-protected" in a companion config | feature-developer | 2026-06-19 | (file separately) |
| 4 | Update `session_auth.py` module docstring checklist: add "run routes audit" as a required step when enabling `FLAG_SESSION_AUTH_MIDDLEWARE` in a new environment | sre-agent | 2026-06-12 | n/a |
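Action Item 1 could be prototyped along the lines of the sketch below, which uses only the standard library. The helper names `http_post_status` and `probe`, the placeholder host, and the empty JSON payload are assumptions; the real probe would need the actual API host and a payload the signup endpoint accepts without creating junk leads.

```python
import urllib.error
import urllib.request


def http_post_status(url, payload=b"{}", timeout=10):
    """POST the payload and return the HTTP status, or None on a network failure."""
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status code
    except urllib.error.URLError:
        return None      # DNS failure, refused connection, timeout, etc.


def probe(url, send=http_post_status):
    """Run one probe and flag anything that is not a 2xx response."""
    status = send(url)
    ok = status is not None and 200 <= status < 300
    return {"url": url, "status": status, "alert": not ok}


# Placeholder host; a fake sender stands in for the network call in this example.
result = probe("https://api.example.invalid/api/waitlist/signup", send=lambda url: 401)
print(result)
```

Passing the sender as a parameter keeps the alerting rule testable without network access; a scheduler would call `probe` with the default sender and page when `alert` is true.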
References
- PR: https://github.com/raxx-app/TradeMasterAPI/pull/3306
- Workflow report: PR #3305 (email-workflow E2E verification)
- Migration: `console/migrations/versions/0112_promote_session_auth_middleware.py`
- Runbook: `docs/ops/runbooks/session-auth-middleware.md` (create; see Action Item 4)
- Related incidents: none (first occurrence of this failure class)