Raxx · internal docs


RCA — Postmark bounce/spam alert misfire (low-denominator repeat paging)

Incident ID: 2026-05-13-postmark-bounce-alert-misfire
Date: 2026-05-13
Severity: SEV-3
Duration: Ongoing (first alert observed post-sandbox-exit 2026-05-09; escalated 2026-05-13)
Blast radius: Operator (Kristerpher) repeatedly Slack-pinged; no user-facing impact; no data loss
Author: sre-agent


Summary

Postmark's native notification system (configured in the Postmark dashboard as a Slack incoming webhook) fires a per-event bounce or spam-complaint alert to TradeMasterAPI-Notify every time raxx.app sends a new email, because one persistent hard-bounce event stays in the trailing 1h/24h window. With only 1-3 total emails in the window, the bounce rate trivially exceeds the 1% threshold (100%, 50%, 33%). Each new successful delivery grows the denominator and lowers the rate, but it does not clear the bad event, and the rate stays far above the threshold, so a new alert fires on every send. The Raptor in-process delivery monitor (postmark_delivery.py) is completely dormant (flag off, token empty) and plays no role in the current alert storm.
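The low-denominator arithmetic can be checked with a short sketch. It uses only the figures quoted above (one stuck hard bounce, a 1% threshold). The helper names are hypothetical, and the assumption that an alert fires only when the rate strictly exceeds the threshold is ours, not something confirmed from Postmark's side.

```python
from fractions import Fraction
import math

BOUNCE_THRESHOLD = Fraction(1, 100)  # the 1% threshold quoted in the summary


def bounce_rate(bounces: int, total: int) -> Fraction:
    """Bounce rate over a window; an empty window counts as 0."""
    return Fraction(bounces, total) if total else Fraction(0)


def min_total_to_go_quiet(bounces: int, threshold: Fraction = BOUNCE_THRESHOLD) -> int:
    """Smallest window size at which the rate no longer exceeds the threshold.

    Assumes the alert condition is a strict 'rate > threshold' comparison.
    """
    return math.ceil(Fraction(bounces) / threshold)


# One persistent hard bounce against 1, 2, and 3 total emails in the window.
for total in (1, 2, 3):
    rate = bounce_rate(1, total)
    print(f"{total} email(s), 1 bounce: {float(rate):.0%}, exceeds 1%: {rate > BOUNCE_THRESHOLD}")

print(min_total_to_go_quiet(1))  # 100
```

Under these assumptions, a single stuck bounce keeps the rate above 1% until the window holds at least 100 emails, which is why ordinary low-volume sending cannot clear the condition on its own.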


Timeline (all times UTC)


Impact


What went well

What didn't go well


Root cause analysis


Detection


Resolution

Immediate (operator-runnable, not requiring code deploy)

R1 — Clear Postmark suppression list entries (MOST URGENT):

# 1. Fetch POSTMARK_SERVER_TOKEN from vault:
#    vault.raxx.app -> /MooseQuest/postmark/POSTMARK_SERVER_API_KEY
#    OR: heroku config --app raxx-api-prod | grep POSTMARK_SERVER_TOKEN

# 2. Run diagnostic to see what's suppressed:
export POSTMARK_SERVER_TOKEN='<token>'   # replace <token> with the real value; unquoted < > would be parsed as shell redirects
python3 scripts/ops/postmark_bounce_check.py --suppressions-only

# 3. Reactivate known-good addresses:
python3 scripts/ops/postmark_bounce_check.py --reactivate ops@raxx.app
python3 scripts/ops/postmark_bounce_check.py --reactivate billing@raxx.app
python3 scripts/ops/postmark_bounce_check.py --reactivate no-reply@raxx.app

# 4. Verify in Postmark dashboard:
#    https://account.postmarkapp.com/servers -> Suppressions
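For operators who want to see what steps 2 and 3 correspond to at the API level, here is a minimal sketch against Postmark's Suppressions API. It is not the contents of postmark_bounce_check.py, whose internals are not shown in this report. The stream ID "outbound" (Postmark's default transactional stream) and all function names are assumptions. The sketch only builds requests; nothing is sent until `send()` is called.

```python
import json
import urllib.request

API_BASE = "https://api.postmarkapp.com"
STREAM_ID = "outbound"  # assumption: raxx.app sends on the default transactional stream


def _request(method, path, token, payload=None):
    """Build (but do not send) an authenticated Postmark API request."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(f"{API_BASE}{path}", data=data, method=method)
    req.add_header("Accept", "application/json")
    req.add_header("Content-Type", "application/json")
    req.add_header("X-Postmark-Server-Token", token)
    return req


def list_suppressions_request(token, stream_id=STREAM_ID):
    """Request that dumps every suppressed address on the stream."""
    return _request("GET", f"/message-streams/{stream_id}/suppressions/dump", token)


def reactivate_request(token, addresses, stream_id=STREAM_ID):
    """Request that removes the given addresses from the suppression list."""
    body = {"Suppressions": [{"EmailAddress": addr} for addr in addresses]}
    return _request("POST", f"/message-streams/{stream_id}/suppressions/delete", token, body)


def send(req, timeout=10):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires a real token and network access):
#   token = os.environ["POSTMARK_SERVER_TOKEN"]
#   print(send(list_suppressions_request(token)))
#   print(send(reactivate_request(token, ["ops@raxx.app"])))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before touching production suppression state.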

R2 — Remove or pause the Postmark → Slack notification (STOPS PAGING IMMEDIATELY):

In the Postmark dashboard:
1. Navigate to the raxx.app server → Settings → Notifications
2. Find the Slack webhook entry pointing to TradeMasterAPI-Notify
3. Remove it (or temporarily disable it)

This stops the per-event pinging. The Raptor in-process delivery monitor (when enabled post-launch) is the intended replacement, and it already has per-hour suppression + the new minimum-denominator floor from this incident.
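As an illustration of what per-hour suppression means in practice, the following sketch allows at most one alert per key per hour. It is a generic cooldown pattern, not the code in postmark_delivery.py. The class name, the alert keys, and the choice to start the cooldown at the last sent alert are all assumptions.

```python
import time


class AlertCooldown:
    """Allow at most one alert per key within a cooldown period."""

    def __init__(self, cooldown_seconds=3600.0, clock=time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock          # injectable so tests can control time
        self._last_sent = {}         # alert key -> timestamp of last sent alert

    def should_send(self, alert_key):
        """Return True and record the send if the key is outside its cooldown."""
        now = self._clock()
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self._cooldown:
            return False             # still cooling down: swallow the alert
        self._last_sent[alert_key] = now
        return True


# Usage: the first bounce alert is sent, repeats within the hour are dropped.
cooldown = AlertCooldown()
print(cooldown.should_send("bounce_rate"))  # True
print(cooldown.should_send("bounce_rate"))  # False
```

A cooldown of this kind caps paging at one message per hour per alert type, whereas the native Postmark notification fires once per event.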

R3 — Verify the spam complaint is not from a real external customer:

Run python3 scripts/ops/postmark_bounce_check.py and check the spam complaint entry. If the recipient is an internal address (ops@, billing@, etc.) it is safe to clear. If it is an external address, escalate — a real customer flagging Raxx email as spam is a sender-reputation signal.
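The internal-versus-external decision can be made mechanical with a small helper. This is a sketch only: the function name is hypothetical, and treating raxx.app as the sole internal domain is an assumption that should be adjusted to whatever the team considers internal.

```python
INTERNAL_DOMAINS = frozenset({"raxx.app"})  # assumption: only raxx.app counts as internal


def classify_recipient(address, internal_domains=INTERNAL_DOMAINS):
    """Return 'internal' or 'external' based on the address's domain.

    Malformed addresses (no '@') fall through to 'external' so that
    anything unexpected is escalated rather than silently cleared.
    """
    _, _, domain = address.strip().rpartition("@")
    return "internal" if domain.lower() in internal_domains else "external"


# Usage: internal complaints are safe to clear, external ones get escalated.
for addr in ("ops@raxx.app", "customer@example.com"):
    print(addr, "->", classify_recipient(addr))
```

Defaulting to "external" on anything unrecognized errs on the side of escalation, which matches the sender-reputation concern in R3.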

Code fix (deployed in this PR)

C1 — Minimum-denominator floor in postmark_delivery.py:

Added _BOUNCE_MIN_DENOMINATOR = 10 and _SPAM_MIN_DENOMINATOR = 25 constants. _check_alert_thresholds() now skips the alert when the total event count in the window is below the floor. Configurable via POSTMARK_ALERT_MIN_DENOMINATOR_BOUNCE and POSTMARK_ALERT_MIN_DENOMINATOR_SPAM env vars.

This fix protects the Raptor in-process delivery monitor against the same low-denominator misfire pattern when it is eventually enabled.

Validation

After R1 (suppression list cleared):
- Send a test email from no-reply@raxx.app to kris@moosequest.net via Postmark
- No bounce notification should fire
- Postmark Activity tab should show "Delivered"

After R2 (Postmark webhook removed):
- Send another test email
- No Slack ping in TradeMasterAPI-Notify

After C1 (code deployed):
- Unit tests pass: python3 -m pytest backend_v2/tests/test_postmark_delivery_webhook.py -q
- Specifically tests 26-29 (TestMinDenominatorFloor class)


Action items

# | Action | Owner | Due | Issue
1 | Clear hard-bounce suppressions for ops@, billing@, no-reply@ in Postmark dashboard OR via postmark_bounce_check.py --reactivate | Kristerpher | 2026-05-14 | (filed)
2 | Identify and classify the spam complaint recipient: internal or external? Escalate if external. | Kristerpher | 2026-05-14 | (filed)
3 | Remove (or redirect to daily digest) the Postmark dashboard → Slack webhook for TradeMasterAPI-Notify | Kristerpher | 2026-05-14 | (filed)
4 | Set POSTMARK_SERVER_TOKEN on raxx-api-prod from vault so postmark_bounce_check.py can be run without manual vault fetch | Kristerpher | 2026-05-16 | (filed)
5 | Enable FLAG_POSTMARK_DELIVERY_MONITOR + POSTMARK_DELIVERY_WEBHOOK_SECRET after CF Access bypass for /webhooks/postmark/delivery is in place (action item R5 from 2026-05-07 report) | Kristerpher | Post-launch | (existing #669)

References