RCA — Postmark bounce/spam alert misfire (low-denominator repeat paging)
Incident ID: 2026-05-13-postmark-bounce-alert-misfire
Date: 2026-05-13
Severity: SEV-3
Duration: Ongoing (first alert observed post-sandbox-exit 2026-05-09; escalated 2026-05-13)
Blast radius: Operator (Kristerpher) repeatedly Slack-pinged; no user-facing impact; no data loss
Author: sre-agent
Summary
Postmark's native notification system (configured in the Postmark dashboard as a Slack incoming webhook) is firing per-event bounce and spam-complaint alerts to TradeMasterAPI-Notify every time raxx.app sends a new email, because one persistent hard-bounce event stays in the trailing 1h/24h window. With only 1-3 total emails in the window, the bounce rate trivially exceeds the 1% threshold (100%, 50%, 33%). Each new successful delivery raises the denominator and lowers the rate, but it does not clear the bad event, so the rate stays above the threshold and a new alert fires on every send. The Raptor in-process delivery monitor (postmark_delivery.py) is completely dormant (flag off, token empty); it plays no role in the current alert storm.
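The percentages above can be checked with a short sketch. It assumes the rate is simply bad events divided by total events in the window and that the threshold is 1%, as stated in this summary; the function name is hypothetical and is not taken from Postmark or from postmark_delivery.py.

```python
# Illustrative only: rate = bad events / total events in the trailing window.
THRESHOLD = 0.01  # 1% threshold, as stated in the summary


def bounce_rate(bad_events: int, total_events: int) -> float:
    """Return the fraction of events in the window that were bad."""
    if total_events == 0:
        return 0.0
    return bad_events / total_events


# One persistent hard bounce while the window grows from 1 to 3 emails.
for total in (1, 2, 3):
    rate = bounce_rate(1, total)
    print(f"{total} email(s): {rate:.1%}, over threshold: {rate > THRESHOLD}")
```

Under these assumptions, a single bad event keeps the rate above 1% until the window holds at least 100 emails, which is far beyond pre-launch volume.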
Timeline (all times UTC)
- 2026-05-07 — First documented investigation (sre-report 2026-05-07-postmark-bounce-alerts.md). Root cause identified: Postmark sandbox + ops@raxx.app hard-bounce before Google Groups were provisioned. Postmark sandbox exit pending.
- 2026-05-09 — Postmark approved out of sandbox. Sandbox-driven bounces cease. Hard-bounce events for ops@raxx.app / billing@raxx.app recorded before the Google Groups were provisioned (pre-2026-05-06) remain in the Postmark suppression list (not cleared).
- 2026-05-09 onwards — Each send attempt to a suppressed address (or any new send triggering Postmark's rate check with the old bad event in window) fires a fresh notification. Alert format: "Bounce rate over last 1h: 100.0% (1/1)".
- 2026-05-13 ~20:00 — Operator escalates repeated paging as noise. Incident opened.
- 2026-05-13 20:21 — SRE agent investigating. Confirmed: FLAG_POSTMARK_DELIVERY_MONITOR is OFF; POSTMARK_SERVER_TOKEN is empty on raxx-api-prod; all Raptor alert paths are dormant.
- 2026-05-13 20:30 — Root cause confirmed: Postmark dashboard Slack webhook + unflushed suppression list entries + low-volume denominator.
- 2026-05-13 — Minimum-denominator floor added to postmark_delivery.py. Diagnostic script scripts/ops/postmark_bounce_check.py created. Operator action items filed.
Impact
- Users affected: 0
- User-visible symptoms: none
- Data integrity: ok
- Revenue / billing: ok
- Operator: repeated Slack pings creating alert fatigue pre-launch
What went well
- The 2026-05-07 SRE report correctly identified the alert relay path (Postmark dashboard native webhook, not Raptor code). That analysis held up fully.
- The Raptor in-process delivery monitor (postmark_delivery.py) already had a 60-minute in-memory suppression window, the correct design choice for post-launch.
- Postmark sandbox exit (2026-05-09) successfully completed, removing that source of bounces.
- Feature flags correctly kept the unfinished delivery webhook path dormant.
What didn't go well
- The 2026-05-07 action item R2 ("clear hard bounce suppressions for ops@, billing@, no-reply@") was documented but not yet completed. Those suppression entries are still present in Postmark, generating repeat notifications on every send attempt.
- The Postmark dashboard Slack webhook has no minimum-denominator floor or dedup window. It fires on every threshold crossing regardless of statistical significance.
- The in-process delivery monitor (postmark_delivery.py) lacked a minimum-denominator floor. At pre-launch volumes, even the in-process 60-minute suppression window would only reduce frequency to one ping per hour, not eliminate the noise.
- No operator runbook for Postmark bounce/suppression management existed (added in this incident).
Root cause analysis
- Contributing factor 1: Postmark suppression list not flushed post-sandbox. When ops@raxx.app, billing@raxx.app, and no-reply@raxx.app were first emailed before the Google Groups existed (pre-2026-05-06), Postmark recorded hard-bounce events and suppressed those addresses. After the Google Groups were provisioned and Postmark was approved out of sandbox, these suppression entries were not cleared (action item R2 from the 2026-05-07 report, not yet completed). Every attempted send to a suppressed address generates a new Postmark notification event.
- Contributing factor 2: Low-denominator rate calculation is statistically meaningless. Postmark's notification system computes a rate over a trailing 1h (bounce) or 24h (spam) window. Pre-launch, with 1-3 total emails in the window, a single bad event yields a 33-100% rate, far above the 1% threshold. The alert fires legitimately according to its own logic, but the signal has no actionable meaning at this volume. The same problem existed in postmark_delivery.py (now fixed with a minimum-denominator floor).
- Contributing factor 3: No dedup window in Postmark's native notification system. Postmark sends a notification (an outbound webhook POST to the configured Slack/email destination) on every threshold-crossing event. There is no native cooldown. Each new email delivery changes the denominator, so the rate is recomputed with the same numerator (1 bad event) and crosses the threshold again.
- Contributing factor 4: Pre-launch posture not applied to Postmark dashboard notifications. Per feedback_pre_launch_digest_notifications.md, routine CI/cron Slack pings should be in a daily digest, not per-event. The Postmark dashboard Slack webhook was set up pre-launch without this posture in mind.
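To make the missing dedup window concrete, here is a minimal sketch of a per-key cooldown of the kind this report attributes to the Raptor in-process monitor (a 60-minute in-memory suppression window). It is not the code in postmark_delivery.py; the class name, method name, and injectable clock are all hypothetical.

```python
import time
from typing import Callable, Dict


class AlertCooldown:
    """Allow at most one alert per key within `window_seconds` (in memory only)."""

    def __init__(self, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        """Return True and record the send time if the key is outside its cooldown."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._window:
            return False  # still inside the cooldown window: suppress
        self._last_sent[key] = now
        return True
```

A cooldown like this caps the noise at one ping per key per window, but it does not judge whether the rate is meaningful, which is why a minimum-denominator floor is still needed at low volume.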
Detection
- What alerted us: Operator escalated via Slack
- How long between cause and detection: ~4 days (sandbox exit 2026-05-09 → escalation 2026-05-13)
- How to detect faster next time: The 2026-05-07 report identified the risk clearly. An action item tracking the suppression-list clearance should have been filed as a GitHub issue with a due date.
Resolution
Immediate (operator-runnable, not requiring code deploy)
R1 — Clear Postmark suppression list entries (MOST URGENT):
# 1. Fetch POSTMARK_SERVER_TOKEN from vault:
# vault.raxx.app -> /MooseQuest/postmark/POSTMARK_SERVER_API_KEY
# OR (only once the token is set on the app; it is currently empty there): heroku config --app raxx-api-prod | grep POSTMARK_SERVER_TOKEN
# 2. Run diagnostic to see what's suppressed:
export POSTMARK_SERVER_TOKEN=<token>
python3 scripts/ops/postmark_bounce_check.py --suppressions-only
# 3. Reactivate known-good addresses:
python3 scripts/ops/postmark_bounce_check.py --reactivate ops@raxx.app
python3 scripts/ops/postmark_bounce_check.py --reactivate billing@raxx.app
python3 scripts/ops/postmark_bounce_check.py --reactivate no-reply@raxx.app
# 4. Verify in Postmark dashboard:
# https://account.postmarkapp.com/servers -> Suppressions
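For readers who want to see roughly what the diagnostic and reactivation steps involve, the sketch below builds the two HTTP requests without sending them. The implementation of postmark_bounce_check.py is not shown in this report, so this is an assumption-laden illustration: the helper names are hypothetical, the endpoints reflect our understanding of Postmark's Suppressions API and should be checked against the current Postmark documentation, and the stream ID "outbound" is assumed to be the default transactional stream.

```python
import json
import urllib.request

API_BASE = "https://api.postmarkapp.com"
STREAM_ID = "outbound"  # assumption: default transactional message stream


def _headers(token: str) -> dict:
    """Headers for a server-level Postmark API call."""
    return {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "X-Postmark-Server-Token": token,
    }


def build_dump_request(token: str, stream_id: str = STREAM_ID) -> urllib.request.Request:
    """Build a GET request that lists suppressions on one message stream."""
    url = f"{API_BASE}/message-streams/{stream_id}/suppressions/dump"
    return urllib.request.Request(url, headers=_headers(token), method="GET")


def build_reactivate_request(token: str, email: str,
                             stream_id: str = STREAM_ID) -> urllib.request.Request:
    """Build a POST request that removes one address from the suppression list."""
    url = f"{API_BASE}/message-streams/{stream_id}/suppressions/delete"
    body = json.dumps({"Suppressions": [{"EmailAddress": email}]}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=_headers(token), method="POST")


def list_suppressions(token: str) -> list:
    """Send the dump request and return the parsed suppression entries."""
    with urllib.request.urlopen(build_dump_request(token), timeout=10) as resp:
        return json.load(resp).get("Suppressions", [])
```

Keeping request construction separate from sending lets the URL, headers, and body be inspected before anything touches the production Postmark server.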
R2 — Remove or pause the Postmark → Slack notification (STOPS PAGING IMMEDIATELY):
In the Postmark dashboard:
1. Navigate to the raxx.app server → Settings → Notifications
2. Find the Slack webhook entry pointing to TradeMasterAPI-Notify
3. Remove it (or temporarily disable it)
This stops the per-event pinging. The Raptor in-process delivery monitor (when enabled post-launch) is the intended replacement, and it already has per-hour suppression + the new minimum-denominator floor from this incident.
R3 — Verify the spam complaint is not from a real external customer:
Run python3 scripts/ops/postmark_bounce_check.py and check the spam complaint entry. If the recipient is an internal address (ops@, billing@, etc.) it is safe to clear. If it is an external address, escalate — a real customer flagging Raxx email as spam is a sender-reputation signal.
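The internal-versus-external decision in R3 can be expressed as a small check. This is a sketch only: the helper name and the INTERNAL_DOMAINS set are hypothetical, and the set would need to list every operator-owned domain before the result is trusted.

```python
# Assumption: raxx.app is the only internal domain; extend as needed.
INTERNAL_DOMAINS = {"raxx.app"}


def classify_recipient(email: str) -> str:
    """Return 'internal' or 'external' based on the address domain."""
    _, _, domain = email.strip().lower().rpartition("@")
    return "internal" if domain in INTERNAL_DOMAINS else "external"


print(classify_recipient("ops@raxx.app"))
print(classify_recipient("someone@example.com"))
```

Anything classified as external should follow the escalation path described above rather than being cleared.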
Code fix (deployed in this PR)
C1 — Minimum-denominator floor in postmark_delivery.py:
Added _BOUNCE_MIN_DENOMINATOR = 10 and _SPAM_MIN_DENOMINATOR = 25 constants. _check_alert_thresholds() now skips the alert when the total event count in the window is below the floor. Configurable via POSTMARK_ALERT_MIN_DENOMINATOR_BOUNCE and POSTMARK_ALERT_MIN_DENOMINATOR_SPAM env vars.
This fix protects the Raptor in-process delivery monitor against the same low-denominator misfire pattern when it is eventually enabled.
Validation
After R1 (suppression list cleared):
- Send a test email from no-reply@raxx.app to kris@moosequest.net via Postmark
- No bounce notification should fire
- Postmark Activity tab should show "Delivered"
After R2 (Postmark webhook removed):
- Send another test email
- No Slack ping in TradeMasterAPI-Notify
After C1 (code deployed):
- Unit tests pass: python3 -m pytest backend_v2/tests/test_postmark_delivery_webhook.py -q
- Specifically tests 26-29 (TestMinDenominatorFloor class)
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Clear hard-bounce suppressions for ops@, billing@, no-reply@ in Postmark dashboard OR via postmark_bounce_check.py --reactivate | Kristerpher | 2026-05-14 | (filed) |
| 2 | Identify + classify spam complaint recipient — is it internal or external? Escalate if external. | Kristerpher | 2026-05-14 | (filed) |
| 3 | Remove (or redirect to daily digest) the Postmark dashboard → Slack webhook for TradeMasterAPI-Notify | Kristerpher | 2026-05-14 | (filed) |
| 4 | Set POSTMARK_SERVER_TOKEN on raxx-api-prod from vault so postmark_bounce_check.py can be run without manual vault fetch | Kristerpher | 2026-05-16 | (filed) |
| 5 | Enable FLAG_POSTMARK_DELIVERY_MONITOR + POSTMARK_DELIVERY_WEBHOOK_SECRET after CF Access bypass for /webhooks/postmark/delivery is in place (action item R5 from 2026-05-07 report) | Kristerpher | Post-launch | (existing #669) |
References
- Prior SRE report: docs/ops/sre-reports/2026-05-07-postmark-bounce-alerts.md
- Postmark runbook: docs/ops/runbooks/postmark.md (created in this incident)
- Diagnostic script: scripts/ops/postmark_bounce_check.py
- Code fix: backend_v2/api/routes/postmark_delivery.py (minimum-denominator floor)
- Postmark dashboard: https://account.postmarkapp.com/servers
- Vault path: /MooseQuest/postmark/POSTMARK_SERVER_API_KEY