Raxx · internal docs

internal · gated

Postmark delivery monitor runbook

System: Postmark delivery-event silence monitor (issue #3310)
Owner: sre-agent
Last incident: 2026-06-05 (see docs/ops/incidents/2026-06-05-session-auth-waitlist-postmark-exemptions.md)
Last reviewed: 2026-06-17 UTC


Why this monitor exists

The Postmark delivery webhook at /webhooks/postmark/delivery silently returned 401 for 8 days. The bounce-rate alerts, which depend on receiving webhook events, were also non-functional: the detection layer had the same gap it was supposed to catch.

This monitor closes that blind spot by polling Postmark's Messages API directly, independently of the Raptor webhook receiving path.

See docs/ops/runbooks/postmark.md for the full Postmark architecture and credential reference.


How it works

Workflow: .github/workflows/postmark-delivery-monitor.yml
Script: scripts/ci/check_postmark_delivery_silence.py

Runs every hour at :00 UTC. On each run:

  1. Polls GET /messages/outbound (Postmark Messages API) for the past 6 hours.
  2. If outbound send count > 0: checks GET /api/_internal/postmark/recent-deliveries on Raptor (protected by CF Access) to see how many delivery webhook events arrived.
  3. If sends > 0 AND delivery events == 0 (or Raptor is unreachable): fires a SEV-2 alert to ops@raxx.app via Postmark email.
  4. Idempotent: writes a state JSON file on the runner to suppress re-alerts within 4 hours.
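
The decision in steps 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in scripts/ci/check_postmark_delivery_silence.py: the function names, the argument names, and the state-file key `last_alert_epoch` are all hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def silence_detected(sends: int, delivery_events: Optional[int],
                     send_threshold: int = 1) -> bool:
    """Return True when Postmark sent mail but Raptor saw no delivery events.

    delivery_events is None when Raptor was unreachable, which the runbook
    treats the same as zero events.
    """
    if sends < send_threshold:
        return False  # too few sends for silence to be an anomaly
    return delivery_events is None or delivery_events == 0


def is_suppressed(state_path: Path, now: float,
                  suppress_hours: float = 4.0) -> bool:
    """Return True if an alert was already recorded inside the suppression window."""
    try:
        state = json.loads(state_path.read_text())
        last_alert = float(state["last_alert_epoch"])  # hypothetical key name
    except (OSError, ValueError, KeyError, TypeError):
        return False  # no usable state file: do not suppress
    return (now - last_alert) < suppress_hours * 3600


def record_alert(state_path: Path, now: float) -> None:
    """Persist the alert time so later runs can suppress re-alerts."""
    state_path.write_text(json.dumps({"last_alert_epoch": now}))
```

A run would alert only when `silence_detected(...)` is true and `is_suppressed(...)` is false, then call `record_alert(...)`. Because the state file lives on the runner, the suppression only holds if that file survives between runs.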

Flag gate: FLAG_POSTMARK_DELIVERY_MONITOR must be set to '1' or 'true' as a GitHub Actions variable (repo- or environment-level) for the check to run. When off, the workflow logs "skipped" and exits 0.
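
A flag gate of this kind might look like the sketch below. The helper name `flag_enabled` is hypothetical, and treating the value case-insensitively and ignoring surrounding whitespace are assumptions; the runbook only states that '1' or 'true' enables the check.

```python
import os
from typing import Mapping, Optional


def flag_enabled(name: str = "FLAG_POSTMARK_DELIVERY_MONITOR",
                 env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the flag value is '1' or 'true'."""
    env = os.environ if env is None else env
    # Assumption: comparison is case-insensitive and ignores whitespace.
    return env.get(name, "").strip().lower() in {"1", "true"}
```

A caller would log "skipped" and exit 0 when `flag_enabled()` returns False, so a disabled monitor never fails the workflow.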


How to tell it's broken

Monitor firing (silence detected):

Subject line: [Raxx ops] Postmark delivery-event SILENCE detected — SEV-2

This means:

  - Postmark processed ≥1 outbound send in the last 6 hours, AND
  - Raptor's delivery webhook received 0 events in the last 24h (or Raptor was unreachable).

Monitor not running (workflow dark):

No runs appear in the workflow history for 2+ hours. See Failure mode F below.


How to diagnose (in order)

  1. Check Postmark dashboard: https://account.postmarkapp.com/servers → Activity tab
     Are sends appearing in the Activity log? What delivery status do they show?

  2. Check if the Raptor delivery webhook is receiving events:

       curl -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
            -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
            "https://api.raxx.app/api/_internal/postmark/recent-deliveries"

     Expected: total_24h > 0 if emails have been sent in the last 24h.
     If {"error": "not_found"}: FLAG_POSTMARK_DELIVERY_MONITOR is off on Raptor.

  3. Test Postmark webhook directly:

       curl -X POST \
            -H "X-Postmark-Webhook-Token: $POSTMARK_DELIVERY_WEBHOOK_SECRET" \
            -H "Content-Type: application/json" \
            -d '{"RecordType":"Delivery","MessageID":"test-diag-001","Recipient":"ops@raxx.app"}' \
            https://api.raxx.app/webhooks/postmark/delivery

     Expected: {"ok": true, "event_type": "Delivery", ...}
     If 401: webhook token mismatch; see Failure mode B below.
     If 302 or 403: Cloudflare Access is blocking Postmark's IPs; see Failure mode C.

  4. Run the monitor script manually (dry-run):

       export FLAG_POSTMARK_DELIVERY_MONITOR=1
       export POSTMARK_SERVER_TOKEN=<token-from-vault>
       export CF_ACCESS_CLIENT_ID=<id>
       export CF_ACCESS_CLIENT_SECRET=<secret>
       python3 scripts/ci/check_postmark_delivery_silence.py --dry-run

  5. Check Heroku logs for 401 errors on the webhook endpoint:

       heroku logs --tail --app raxx-api-prod | grep "postmark_delivery"

     Look for: rejected -- invalid or missing webhook token
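
Steps 2 and 3 each come down to reading one HTTP response. The sketch below shows one way to turn those responses into a diagnosis. It makes no network calls, the function names are hypothetical, and it assumes `total_24h` is a top-level integer field in the JSON body of the recent-deliveries endpoint.

```python
import json
from typing import Optional


def interpret_recent_deliveries(status: int, body: str) -> Optional[int]:
    """Map the recent-deliveries response to a delivery-event count.

    Returns total_24h on a 200 response, or None when the state is unknown
    (for example a 404 not_found, which points at the Raptor flag being off).
    """
    if status != 200:
        return None
    try:
        return int(json.loads(body)["total_24h"])  # assumed field location
    except (ValueError, KeyError, TypeError):
        return None


def classify_webhook_probe(status: int) -> str:
    """Map the status of the test POST in step 3 to the runbook's failure modes."""
    if status == 200:
        return "ok"
    if status == 401:
        return "failure mode B: webhook token mismatch"
    if status in (302, 403):
        return "failure mode C: Cloudflare Access is blocking the request"
    return "unexpected status: check Heroku logs"
```

Returning None rather than 0 for an unreadable response keeps "Raptor saw nothing" distinct from "Raptor could not be asked", even though both lead to an alert.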


Known failure modes

Failure mode A: FLAG_POSTMARK_DELIVERY_MONITOR off

Symptom: Monitor workflow logs "skipped — flag off" and exits 0; silence is not checked.
Cause: FLAG_POSTMARK_DELIVERY_MONITOR is not set to '1' or 'true' as a GitHub Actions variable.
Fix:
  - Set it as a repo-level Actions variable: https://github.com/raxx-app/TradeMasterAPI/settings/variables/actions
    Variable name: FLAG_POSTMARK_DELIVERY_MONITOR, value: 1
  - Or verify it's set for the environment the workflow runs in.
Verification: Trigger a manual workflow_dispatch run and confirm the check step executes (not "skipped").

Failure mode B: Webhook token mismatch (POSTMARK_DELIVERY_WEBHOOK_SECRET)

Symptom: Monitor fires silence alert; curl test returns 401.
Cause: POSTMARK_DELIVERY_WEBHOOK_SECRET on Heroku doesn't match the token configured in Postmark dashboard → Settings → Webhooks → Delivery.
Fix:
  1. Fetch the correct value from vault:

       # Vault path: /MooseQuest/postmark/POSTMARK_DELIVERY_WEBHOOK_SECRET

  2. Reset on Heroku (silence stdout, per feedback_heroku_config_set_echoes_secrets.md):

       heroku config:set POSTMARK_DELIVERY_WEBHOOK_SECRET=<value> \
         --app raxx-api-prod >/dev/null 2>&1

  3. Verify:

       curl -X POST \
            -H "X-Postmark-Webhook-Token: <value>" \
            -H "Content-Type: application/json" \
            -d '{"RecordType":"Delivery","MessageID":"test-fix-001","Recipient":"ops@raxx.app"}' \
            https://api.raxx.app/webhooks/postmark/delivery
       # Expect: {"ok": true, ...}

Reference: docs/ops/runbooks/postmark.md, Failure mode E

Failure mode C: Cloudflare Access blocks Postmark webhook IPs

Symptom: Monitor fires silence alert; curl test returns a 302 redirect or 403.
Cause: /webhooks/postmark/delivery is behind CF Access, and Postmark's webhook requests cannot carry CF Access credentials.
Fix: Add a CF Access bypass rule for Postmark's IP ranges:
  1. CF Zero Trust → Access → Applications → api.raxx.app
  2. Add Policy: Bypass for the IP ranges in Postmark's published list: https://postmarkapp.com/support/article/800-ips-for-postmark-servers
  3. Scope the bypass to the /webhooks/postmark/delivery path only.
Verification: POST /webhooks/postmark/delivery with a valid token returns 200.
Reference: docs/ops/runbooks/postmark.md, Failure mode E; feedback_cf_access_does_not_bypass_bot_fight_mode.md
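
When reviewing a bypass policy, it can help to check whether a source IP seen in the logs falls inside the configured ranges. The sketch below does this with the standard-library ipaddress module. The helper name is hypothetical, and the CIDR values are placeholders from the reserved documentation ranges, not Postmark's real addresses; substitute the ranges from Postmark's published list.

```python
import ipaddress
from typing import Iterable

# Placeholder ranges (RFC 5737 documentation blocks), NOT Postmark's real IPs.
EXAMPLE_BYPASS_RANGES = ["203.0.113.0/24", "198.51.100.7/32"]


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    # strict=False tolerates CIDRs written with host bits set.
    return any(addr in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)
```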

Failure mode D: FLAG_POSTMARK_DELIVERY_MONITOR off on Raptor (not set in Heroku env)

Symptom: Monitor fires; GET /api/_internal/postmark/recent-deliveries returns 404.
Cause: The in-process Raptor delivery monitor flag is off.
Fix:

heroku config:set FLAG_POSTMARK_DELIVERY_MONITOR=1 --app raxx-api-prod >/dev/null 2>&1

Note: The silence monitor workflow flag and the Raptor in-process flag are independent. The workflow can check Postmark API silence (outbound count) even if the Raptor flag is off; it treats 404 from Raptor as an unknown state and alerts.
Verification: GET /api/_internal/postmark/recent-deliveries returns 200 with event data.

Failure mode E: POSTMARK_SERVER_TOKEN missing on GH Actions runner

Symptom: Workflow step exits 2; log shows "POSTMARK_SERVER_TOKEN is not set".
Cause: The POSTMARK_SERVER_TOKEN secret is not configured in the GitHub repository secrets (https://github.com/raxx-app/TradeMasterAPI/settings/secrets/actions).
Fix: Add the secret from vault path /MooseQuest/postmark/POSTMARK_SERVER_API_KEY.
Verification: Re-run the workflow; the check step should proceed past the token check.

Failure mode F: Monitor workflow dark (not running)

Symptom: No runs visible in the workflow history for 2+ hours during business hours.
Cause: GH Actions cron schedule drift (known GH issue: scheduled runs can lag up to 15 minutes) or the workflow was accidentally disabled.
Fix:
  1. Check the workflow is enabled: https://github.com/raxx-app/TradeMasterAPI/actions/workflows/postmark-delivery-monitor.yml → click "..." → "Enable workflow" if it shows as disabled.
  2. Trigger a workflow_dispatch manually to confirm the workflow runs.
  3. If runs are delayed by >30 minutes, this is a GH Actions infrastructure issue; check https://githubstatus.com for scheduled job delays.
Note: A workflow dark for 2+ hours is a detection gap. Per feedback_nightly_scan_dark_is_high.md, treat this as a SEV-3.
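
The "dark for 2+ hours" condition can be expressed as a check on the most recent run time. This is a sketch only: the function name is hypothetical, it assumes the caller has already collected run timestamps (for example from the workflow history), and it ignores the business-hours qualifier.

```python
from datetime import datetime, timedelta
from typing import Iterable


def workflow_is_dark(run_times: Iterable[datetime], now: datetime,
                     max_gap_hours: float = 2.0) -> bool:
    """Return True when the latest run is at least max_gap_hours old, or no runs exist."""
    latest = max(run_times, default=None)
    if latest is None:
        return True  # no runs at all counts as dark
    return (now - latest) >= timedelta(hours=max_gap_hours)
```

All timestamps should share one convention (all timezone-aware UTC is the safest choice), since Python raises TypeError when naive and aware datetimes are mixed.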


Enable the monitor

Prerequisites:

  1. POSTMARK_SERVER_TOKEN set as a GitHub Actions secret (read-only Postmark API access).
  2. OPS_ALERT_EMAIL and OPS_ALERT_FROM set as GitHub Actions secrets.
  3. CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET set as GitHub Actions secrets (for the Raptor internal endpoint call).
  4. FLAG_POSTMARK_DELIVERY_MONITOR set to '1' as a GitHub Actions variable.
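
A preflight check over these prerequisites could look like the sketch below. It assumes the workflow exposes each secret and the flag to the script as an environment variable of the same name, which this runbook does not confirm for every entry; the function name is hypothetical.

```python
from typing import List, Mapping

REQUIRED_SECRETS = (
    "POSTMARK_SERVER_TOKEN",
    "OPS_ALERT_EMAIL",
    "OPS_ALERT_FROM",
    "CF_ACCESS_CLIENT_ID",
    "CF_ACCESS_CLIENT_SECRET",
)


def missing_prerequisites(env: Mapping[str, str]) -> List[str]:
    """Return the names of prerequisites that are unset, empty, or disabled."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    flag = env.get("FLAG_POSTMARK_DELIVERY_MONITOR", "").strip().lower()
    if flag not in {"1", "true"}:
        missing.append("FLAG_POSTMARK_DELIVERY_MONITOR")
    return missing
```

Calling `missing_prerequisites(os.environ)` at the start of a run gives one list to report, instead of failing on the first missing value.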

Optionally, also set FLAG_POSTMARK_DELIVERY_MONITOR=1 on Heroku to enable the Raptor in-process webhook monitor (see docs/ops/runbooks/postmark.md).

To test without alerting:

# Trigger a dry-run via workflow_dispatch:
gh workflow run postmark-delivery-monitor.yml \
  --field dry_run=true \
  --repo raxx-app/TradeMasterAPI

Alert thresholds reference

| Parameter | Default | Override |
|---|---|---|
| Look-back window | 6 hours | --window-hours CLI arg |
| Send threshold (minimum sends before silence = anomaly) | 1 | --send-threshold CLI arg |
| Alert re-fire suppression | 4 hours | --suppress-hours CLI arg |
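
For orientation, the CLI overrides listed here could be declared with argparse as sketched below. The flag names and defaults come from this runbook; the argument types, help text, and the `build_parser` name are assumptions, and the real script may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented thresholds (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Postmark delivery-event silence check (sketch)")
    parser.add_argument("--window-hours", type=float, default=6.0,
                        help="look-back window for outbound sends")
    parser.add_argument("--send-threshold", type=int, default=1,
                        help="minimum sends before silence counts as an anomaly")
    parser.add_argument("--suppress-hours", type=float, default=4.0,
                        help="suppress re-alerts for this many hours")
    parser.add_argument("--dry-run", action="store_true",
                        help="run the check without sending an alert")
    return parser
```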

Emergency stop

To stop the monitor from running, use either option below. Option A leaves the workflow itself enabled; Option B disables it entirely.

Option A — Remove the GH Actions variable: https://github.com/raxx-app/TradeMasterAPI/settings/variables/actions → Delete or set FLAG_POSTMARK_DELIVERY_MONITOR to 0.

Option B — Disable the workflow: https://github.com/raxx-app/TradeMasterAPI/actions/workflows/postmark-delivery-monitor.yml → Click "..." → "Disable workflow".


Escalation

Wake the operator when:

  - The silence alert fires AND the Postmark dashboard shows >50 emails sent with no delivery records.
  - The monitor fails with exit 2 for 2+ consecutive runs (broken monitor).
  - Postmark account status shows the server is inactive or suspended.
  - Hard-bounce rate reported by Postmark dashboard exceeds 5%.
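
The escalation criteria above combine as a simple "any of" rule, sketched below. The `MonitorStatus` type and its field names are hypothetical, and the values are assumed to be gathered by hand from the alert, the workflow history, and the Postmark dashboard.

```python
from dataclasses import dataclass


@dataclass
class MonitorStatus:
    silence_alert_fired: bool
    undelivered_sends: int        # sends with no delivery record (dashboard)
    consecutive_exit2_runs: int   # back-to-back monitor runs that exited 2
    server_active: bool           # False if inactive or suspended
    hard_bounce_rate: float       # fraction, so 0.05 means 5%


def should_wake_operator(s: MonitorStatus) -> bool:
    """Return True when any escalation criterion in this runbook is met."""
    return (
        (s.silence_alert_fired and s.undelivered_sends > 50)
        or s.consecutive_exit2_runs >= 2
        or not s.server_active
        or s.hard_bounce_rate > 0.05
    )
```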