RCA — Postmark bounce/spam alert misfire (low-denominator repeat paging)

Incident ID: 2026-05-13-postmark-bounce-alert-misfire
Date: 2026-05-13
Severity: SEV-3
Duration: Ongoing (first alert observed post-sandbox-exit 2026-05-09; escalated 2026-05-13)
Blast radius: Operator (Kristerpher) repeatedly Slack-pinged; no user-facing impact; no data loss
Author: sre-agent

Summary

Postmark's native notification system (configured in the Postmark dashboard as a Slack incoming webhook) fires per-event bounce and spam-complaint alerts to TradeMasterAPI-Notify every time raxx.app sends a new email, because one persistent hard-bounce event stays in the trailing 1h/24h window. With only 1-3 total emails in the window, the bounce rate trivially exceeds the 1% threshold (1/1 = 100%, 1/2 = 50%, 1/3 = 33%). Each new successful delivery raises the denominator, and so lowers the rate, but does not clear the bad event. The rate therefore stays above the threshold and a new alert fires on every send. The Raptor in-process delivery monitor (postmark_delivery.py) is completely dormant (flag off, token empty) and plays no role in the current alert storm.

Timeline (all times UTC)

2026-05-07 — First documented investigation (sre-report 2026-05-07-postmark-bounce-alerts.md). Root cause identified: Postmark sandbox + ops@raxx.app hard-bounce before Google Groups were provisioned. Postmark sandbox exit pending.
2026-05-09 — Postmark approved out of sandbox. Sandbox-driven bounces cease. Hard-bounce events for ops@raxx.app / billing@raxx.app recorded before 2026-05-06 remain in the Postmark suppression list (not cleared).
2026-05-09 onwards — Each send attempt to a suppressed address (or any new send triggering Postmark's rate check with the old bad event in window) fires a fresh notification. Alert format: "Bounce rate over last 1h: 100.0% (1/1)".
2026-05-13 ~20:00 — Operator escalates repeated paging as noise. Incident opened.
2026-05-13 20:21 — SRE agent investigating. Confirmed: FLAG_POSTMARK_DELIVERY_MONITOR is OFF; POSTMARK_SERVER_TOKEN is empty on raxx-api-prod; all Raptor alert paths are dormant.
2026-05-13 20:30 — Root cause confirmed: Postmark dashboard Slack webhook + unflushed suppression list entries + low-volume denominator.
2026-05-13 — Minimum-denominator floor added to postmark_delivery.py. Diagnostic script scripts/ops/postmark_bounce_check.py created. Operator action items filed.

Impact

Users affected: 0
User-visible symptoms: none
Data integrity: ok
Revenue / billing: ok
Operator: repeated Slack pings creating alert fatigue pre-launch

What went well

The 2026-05-07 SRE report correctly identified the alert relay path (Postmark dashboard native webhook, not Raptor code). That analysis held up fully.
The Raptor in-process delivery monitor (postmark_delivery.py) already had a 60-minute in-memory suppression window — correct design choice for post-launch.
Postmark sandbox exit (2026-05-09) successfully completed, removing that source of bounces.
Feature flags correctly kept the unfinished delivery webhook path dormant.

What didn't go well

The 2026-05-07 action item R2 ("clear hard bounce suppressions for ops@, billing@, no-reply@") was documented but not yet completed. Those suppression entries are still present in Postmark, generating repeat notifications on every send attempt.
The Postmark dashboard Slack webhook has no minimum-denominator floor or dedup window. It fires on every threshold crossing regardless of statistical significance.
The in-process delivery monitor (postmark_delivery.py) lacked a minimum-denominator floor. At pre-launch volumes, even the in-process 60-minute suppression window would only reduce frequency to one ping per hour, not eliminate the noise.
No operator runbook for Postmark bounce/suppression management existed (added in this incident).

Root cause analysis

Contributing factor 1: Postmark suppression list not flushed post-sandbox — When ops@raxx.app, billing@raxx.app, and no-reply@raxx.app were first emailed before the Google Groups existed (pre-2026-05-06), Postmark recorded hard-bounce events and suppressed those addresses. After the Google Groups were provisioned and Postmark was approved out of sandbox, these suppression entries were not cleared (action item R2 from 2026-05-07 report, not yet completed). Every attempted send to a suppressed address generates a new Postmark notification event.
Contributing factor 2: Low-denominator rate calculation is statistically meaningless — Postmark's notification system computes a rate over a trailing 1h (bounce) or 24h (spam) window. Pre-launch, with 1-3 total emails in the window, a single bad event yields a 33-100% rate, far above the 1% threshold. The alert fires legitimately according to its own logic, but the signal has no actionable meaning at this volume. The same problem existed in postmark_delivery.py (now fixed with a minimum-denominator floor).
Contributing factor 3: No dedup window in Postmark's native notification system — Postmark sends a notification (an outbound webhook POST to each configured Slack/email destination) on every threshold-crossing event. There is no native cooldown. Each new email delivery changes the denominator, so the rate is recomputed with the same numerator (1 bad event) and crosses the threshold again.
Contributing factor 4: Pre-launch posture not applied to Postmark dashboard notifications — Per feedback_pre_launch_digest_notifications.md, routine CI/cron Slack pings should be in a daily digest, not per-event. The Postmark dashboard Slack webhook was set up pre-launch without this posture in mind.
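
The low-denominator effect in contributing factors 2 and 3 can be checked with a few lines of arithmetic. This is an illustrative sketch only: the function names are hypothetical, and it assumes "exceeds the threshold" means strictly greater than 1%.

```python
# Illustrative arithmetic: one stuck bad event, growing send volume.
THRESHOLD = 0.01  # 1% alert threshold quoted in this report


def bounce_rate(bad_events: int, total_sends: int) -> float:
    """Return bad_events / total_sends, or 0.0 when nothing was sent."""
    return bad_events / total_sends if total_sends else 0.0


def sends_needed_to_clear(bad_events: int, threshold: float = THRESHOLD) -> int:
    """Smallest send count at which the rate no longer exceeds the threshold."""
    total = max(bad_events, 1)
    while bounce_rate(bad_events, total) > threshold:
        total += 1
    return total


if __name__ == "__main__":
    for total in (1, 2, 3):
        rate = bounce_rate(1, total)
        print(f"{total} send(s): {rate:.1%} alert={rate > THRESHOLD}")
    # Number of sends in the window before a single bad event stops alerting.
    print(sends_needed_to_clear(1))
```

Under these assumptions a single bad event keeps the rate above 1% until there are 100 sends in the window, which is far beyond pre-launch volume.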

Detection

What alerted us: Operator escalated via Slack
How long between cause and detection: ~4 days (sandbox exit 2026-05-09 → escalation 2026-05-13)
How to detect faster next time: The 2026-05-07 report identified the risk clearly. An action item tracking the suppression-list clearance should have been filed as a GitHub issue with a due date.

Resolution

Immediate (operator-runnable, not requiring code deploy)

R1 — Clear Postmark suppression list entries (MOST URGENT):

# 1. Fetch POSTMARK_SERVER_TOKEN from vault:
#    vault.raxx.app -> /MooseQuest/postmark/POSTMARK_SERVER_API_KEY
#    OR, once action item 4 has set it (it is currently empty on raxx-api-prod):
#    heroku config --app raxx-api-prod | grep POSTMARK_SERVER_TOKEN

# 2. Run diagnostic to see what's suppressed:
export POSTMARK_SERVER_TOKEN=<token>
python3 scripts/ops/postmark_bounce_check.py --suppressions-only

# 3. Reactivate known-good addresses:
python3 scripts/ops/postmark_bounce_check.py --reactivate ops@raxx.app
python3 scripts/ops/postmark_bounce_check.py --reactivate billing@raxx.app
python3 scripts/ops/postmark_bounce_check.py --reactivate no-reply@raxx.app

# 4. Verify in Postmark dashboard:
#    https://account.postmarkapp.com/servers -> Suppressions
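
For orientation, the sketch below shows roughly what a reactivation request against Postmark's Suppressions API looks like. It is not the implementation of postmark_bounce_check.py, which this report does not reproduce. The helper name is hypothetical, and the default "outbound" message stream is an assumption about this server's configuration. The function only builds the request and does not send it.

```python
import json
import urllib.request
from typing import List

POSTMARK_API = "https://api.postmarkapp.com"


def build_reactivate_request(
    token: str, addresses: List[str], stream: str = "outbound"
) -> urllib.request.Request:
    """Build (but do not send) a request that removes suppressions.

    Assumption: the addresses were suppressed on the `stream` message stream.
    """
    body = json.dumps(
        {"Suppressions": [{"EmailAddress": addr} for addr in addresses]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{POSTMARK_API}/message-streams/{stream}/suppressions/delete",
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "X-Postmark-Server-Token": token,
        },
    )


if __name__ == "__main__":
    req = build_reactivate_request("<token>", ["ops@raxx.app"])
    print(req.get_method(), req.get_full_url())
    # To actually send: urllib.request.urlopen(req, timeout=10)
```

Building the request separately from sending it makes it easy to inspect the payload before touching the live suppression list.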

R2 — Remove or pause the Postmark → Slack notification (STOPS PAGING IMMEDIATELY):

In the Postmark dashboard:
1. Navigate to the raxx.app server → Settings → Notifications
2. Find the Slack webhook entry pointing to TradeMasterAPI-Notify
3. Remove it (or temporarily disable)

This stops the per-event pinging. The Raptor in-process delivery monitor (when enabled post-launch) is the intended replacement, and it already has per-hour suppression + the new minimum-denominator floor from this incident.
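
A per-hour suppression window of the kind described here can be sketched as follows. This is a minimal illustration, not the code in postmark_delivery.py: the class name, method name, and injectable clock are all hypothetical.

```python
import time
from typing import Callable, Dict


class AlertCooldown:
    """Allow at most one alert per key within `window_seconds` (in-memory only)."""

    def __init__(
        self,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._clock = clock
        self._last_fired: Dict[str, float] = {}

    def allow(self, key: str) -> bool:
        """Return True and record the time if `key` is outside its cooldown."""
        now = self._clock()
        last = self._last_fired.get(key)
        if last is not None and now - last < self._window:
            return False  # still inside the suppression window
        self._last_fired[key] = now
        return True


if __name__ == "__main__":
    cooldown = AlertCooldown()
    print(cooldown.allow("bounce"))  # first alert for this key passes
    print(cooldown.allow("bounce"))  # immediate repeat is suppressed
```

Because the state lives in process memory, a restart resets the window. A cooldown on its own also only caps noise at one ping per hour, which is why the minimum-denominator floor is still needed.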

R3 — Verify the spam complaint is not from a real external customer:

Run python3 scripts/ops/postmark_bounce_check.py and check the spam complaint entry. If the recipient is an internal address (ops@, billing@, etc.) it is safe to clear. If it is an external address, escalate — a real customer flagging Raxx email as spam is a sender-reputation signal.
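
The internal-versus-external check can be made mechanical with a small helper like the one below. It is a hypothetical sketch: treating every @raxx.app address as internal is an assumption, and the function is not part of postmark_bounce_check.py.

```python
from typing import FrozenSet

# Assumption: addresses on these domains count as internal.
INTERNAL_DOMAINS: FrozenSet[str] = frozenset({"raxx.app"})


def classify_recipient(
    address: str, internal_domains: FrozenSet[str] = INTERNAL_DOMAINS
) -> str:
    """Return "internal", "external", or "invalid" for a recipient address."""
    local, sep, domain = address.strip().lower().rpartition("@")
    if not sep or not local or not domain:
        return "invalid"
    return "internal" if domain in internal_domains else "external"


if __name__ == "__main__":
    for addr in ("ops@raxx.app", "someone@example.com"):
        print(addr, classify_recipient(addr))
```

An "external" result is the case that should be escalated rather than cleared.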

Code fix (deployed in this PR)

C1 — Minimum-denominator floor in postmark_delivery.py:

Added _BOUNCE_MIN_DENOMINATOR = 10 and _SPAM_MIN_DENOMINATOR = 25 constants. _check_alert_thresholds() now skips the alert when the total event count in the window is below the floor. Configurable via POSTMARK_ALERT_MIN_DENOMINATOR_BOUNCE and POSTMARK_ALERT_MIN_DENOMINATOR_SPAM env vars.

This fix protects the Raptor in-process delivery monitor against the same low-denominator misfire pattern when it is eventually enabled.
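
The floor logic can be sketched as below. The constant names, defaults, and env var names come from this report. Everything else is an assumption: the helper functions are hypothetical, the real _check_alert_thresholds() signature is not shown here, and the 1% threshold in the usage line is the bounce threshold quoted earlier.

```python
import os
from typing import Mapping, Optional


def floor_from_env(
    name: str, default: int, env: Optional[Mapping[str, str]] = None
) -> int:
    """Read a positive integer floor from the environment, else use default."""
    raw = (os.environ if env is None else env).get(name, "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


_BOUNCE_MIN_DENOMINATOR = floor_from_env("POSTMARK_ALERT_MIN_DENOMINATOR_BOUNCE", 10)
_SPAM_MIN_DENOMINATOR = floor_from_env("POSTMARK_ALERT_MIN_DENOMINATOR_SPAM", 25)


def should_alert(bad: int, total: int, threshold: float, min_denominator: int) -> bool:
    """Alert only when the sample is large enough AND the rate exceeds threshold."""
    if total <= 0 or total < min_denominator:
        return False  # too few events for the rate to mean anything
    return bad / total > threshold


if __name__ == "__main__":
    # The pattern from this incident: 1 bad event out of 1 send.
    print(should_alert(1, 1, 0.01, _BOUNCE_MIN_DENOMINATOR))
```

With the default floor of 10, the 1/1, 1/2, and 1/3 cases from this incident are all skipped, while 1 bounce in 10 sends would still alert.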

Validation

After R1 (suppression list cleared):
- Send a test email from no-reply@raxx.app to kris@moosequest.net via Postmark
- No bounce notification should fire
- Postmark Activity tab should show "Delivered"

After R2 (Postmark webhook removed):
- Send another test email
- No Slack ping in TradeMasterAPI-Notify

After C1 (code deployed):
- Unit tests pass: python3 -m pytest backend_v2/tests/test_postmark_delivery_webhook.py -q
- Specifically tests 26-29 (TestMinDenominatorFloor class)

Action items

#	Action	Owner	Due	Issue
1	Clear hard-bounce suppressions for ops@, billing@, no-reply@ in Postmark dashboard OR via `postmark_bounce_check.py --reactivate`	Kristerpher	2026-05-14	(filed)
2	Identify + classify spam complaint recipient — is it internal or external? Escalate if external.	Kristerpher	2026-05-14	(filed)
3	Remove (or redirect to daily digest) the Postmark dashboard → Slack webhook for TradeMasterAPI-Notify	Kristerpher	2026-05-14	(filed)
4	Set `POSTMARK_SERVER_TOKEN` on `raxx-api-prod` from vault so `postmark_bounce_check.py` can be run without manual vault fetch	Kristerpher	2026-05-16	(filed)
5	Enable `FLAG_POSTMARK_DELIVERY_MONITOR` + `POSTMARK_DELIVERY_WEBHOOK_SECRET` after CF Access bypass for `/webhooks/postmark/delivery` is in place (action item R5 from 2026-05-07 report)	Kristerpher	Post-launch	(existing #669)

References

Prior SRE report: docs/ops/sre-reports/2026-05-07-postmark-bounce-alerts.md
Postmark runbook: docs/ops/runbooks/postmark.md (created in this incident)
Diagnostic script: scripts/ops/postmark_bounce_check.py
Code fix: backend_v2/api/routes/postmark_delivery.py (minimum-denominator floor)
Postmark dashboard: https://account.postmarkapp.com/servers
Vault path: /MooseQuest/postmark/POSTMARK_SERVER_API_KEY