Postmark delivery monitor runbook
System: Postmark delivery-event silence monitor (issue #3310)
Owner: sre-agent
Last incident: 2026-06-05 (see docs/ops/incidents/2026-06-05-session-auth-waitlist-postmark-exemptions.md)
Last reviewed: 2026-06-17 UTC
Why this monitor exists
The Postmark delivery webhook at /webhooks/postmark/delivery silently 401'd for 8 days.
The bounce-rate alerts that depend on receiving webhook events were also non-functional — the
detection layer itself had the same gap it was supposed to catch.
This monitor closes that blind spot by polling Postmark's Messages API directly, independently of the Raptor webhook receiving path.
See docs/ops/runbooks/postmark.md for the full Postmark architecture and credential reference.
How it works
Workflow: .github/workflows/postmark-delivery-monitor.yml
Script: scripts/ci/check_postmark_delivery_silence.py
Runs every hour at :00 UTC. On each run:
- Polls GET /messages/outbound (Postmark Messages API) for the past 6 hours.
- If outbound send count > 0: checks GET /api/_internal/postmark/recent-deliveries on Raptor (protected by CF Access) to see how many delivery webhook events arrived.
- If sends > 0 AND delivery events == 0 (or Raptor is unreachable): fires a SEV-2 alert to ops@raxx.app via Postmark email.
- Idempotent: writes a state JSON file on the runner to suppress re-alerts within 4 hours.
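The decision the run makes can be sketched as a pure function. This is a minimal illustration, not the actual contents of check_postmark_delivery_silence.py: the function names, the use of None for "Raptor unreachable", and the in-memory last-alert timestamp (standing in for the state JSON file) are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SEND_THRESHOLD = 1   # minimum sends before silence counts as an anomaly
SUPPRESS_HOURS = 4   # re-alert suppression window


def silence_detected(sends: int, delivery_events: Optional[int]) -> bool:
    """True when enough sends exist but no delivery events are confirmed.

    delivery_events is None when Raptor could not be reached (assumed
    representation); that case is treated the same as zero events.
    """
    if sends < SEND_THRESHOLD:
        return False
    return delivery_events is None or delivery_events == 0


def should_alert(
    sends: int,
    delivery_events: Optional[int],
    last_alert_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Apply the silence rule, then the re-alert suppression window."""
    if not silence_detected(sends, delivery_events):
        return False
    if last_alert_at is None:
        return True  # no earlier alert recorded, so nothing to suppress
    return now - last_alert_at >= timedelta(hours=SUPPRESS_HOURS)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Sends exist, no delivery events, no earlier alert: alert fires.
    print(should_alert(12, 0, None, now))
    # Same silence, but an alert went out one hour ago: suppressed.
    print(should_alert(12, 0, now - timedelta(hours=1), now))
```

Keeping the rule free of network calls makes it straightforward to unit-test separately from the Postmark and Raptor requests.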
Flag gate: FLAG_POSTMARK_DELIVERY_MONITOR must be set to '1' or 'true' as a
GitHub Actions variable (repo- or environment-level) for the check to run. When off,
the workflow logs "skipped" and exits 0.
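A flag gate of this kind might look like the sketch below. The accepted values '1' and 'true' come from this runbook; the helper names, the whitespace stripping, and the exact-case comparison are assumptions about how the real script behaves.

```python
import os
from typing import Mapping, Optional

FLAG_NAME = "FLAG_POSTMARK_DELIVERY_MONITOR"


def flag_enabled(value: Optional[str]) -> bool:
    """Treat only '1' or 'true' as on; anything else, or unset, is off."""
    if value is None:
        return False
    return value.strip() in ("1", "true")


def run(env: Mapping[str, str]) -> int:
    """Return the process exit code for one monitor invocation."""
    if not flag_enabled(env.get(FLAG_NAME)):
        print("skipped")
        return 0  # flag off is not an error
    print("flag on: the silence check would run here")
    return 0


if __name__ == "__main__":
    raise SystemExit(run(os.environ))
```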
How to tell it's broken
Monitor firing (silence detected):
Subject line: [Raxx ops] Postmark delivery-event SILENCE detected — SEV-2
This means:
- Postmark processed ≥1 outbound sends in the last 6 hours, AND
- Raptor's delivery webhook received 0 events in the last 24h (or Raptor was unreachable).
Monitor not running (workflow dark):
- No successful run in the last 2 hours in https://github.com/raxx-app/TradeMasterAPI/actions/workflows/postmark-delivery-monitor.yml
- Treat a workflow that has not run in 2 hours as a SEV-3 detection gap.
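The "workflow dark" rule reduces to comparing the newest successful run time against a 2-hour limit. The sketch below assumes the run timestamps have already been collected from the Actions run history; how they are fetched is out of scope, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

MAX_GAP = timedelta(hours=2)  # threshold from this runbook


def workflow_is_dark(
    successful_runs: Iterable[datetime],
    now: datetime,
    max_gap: timedelta = MAX_GAP,
) -> bool:
    """True when no successful run finished within max_gap of now."""
    latest = max(successful_runs, default=None)
    if latest is None:
        return True  # no successful run on record at all
    return now - latest > max_gap


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    recent = [now - timedelta(minutes=50), now - timedelta(hours=3)]
    print(workflow_is_dark(recent, now))                        # newest run is 50 minutes old
    print(workflow_is_dark([now - timedelta(hours=3)], now))    # newest run is 3 hours old
```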
How to diagnose (in order)
1. Check Postmark dashboard: https://account.postmarkapp.com/servers → Activity tab
   Are sends appearing in the Activity log? What delivery status do they show?
2. Check if the Raptor delivery webhook is receiving events:

       curl -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
         -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
         "https://api.raxx.app/api/_internal/postmark/recent-deliveries"

   Expected: total_24h > 0 if emails have been sent in the last 24h.
   If {"error": "not_found"}: FLAG_POSTMARK_DELIVERY_MONITOR is off on Raptor.
3. Test Postmark webhook directly:

       curl -X POST \
         -H "X-Postmark-Webhook-Token: $POSTMARK_DELIVERY_WEBHOOK_SECRET" \
         -H "Content-Type: application/json" \
         -d '{"RecordType":"Delivery","MessageID":"test-diag-001","Recipient":"ops@raxx.app"}' \
         https://api.raxx.app/webhooks/postmark/delivery

   Expected: {"ok": true, "event_type": "Delivery", ...}
   If 401: webhook token mismatch; see Failure mode B below.
   If 302 or 403: Cloudflare Access is blocking Postmark's IPs; see Failure mode C.
4. Run the monitor script manually (dry-run):

       export FLAG_POSTMARK_DELIVERY_MONITOR=1
       export POSTMARK_SERVER_TOKEN=<token-from-vault>
       export CF_ACCESS_CLIENT_ID=<id>
       export CF_ACCESS_CLIENT_SECRET=<secret>
       python3 scripts/ci/check_postmark_delivery_silence.py --dry-run

5. Check Heroku logs for 401 errors on the webhook endpoint:

       heroku logs --tail --app raxx-api-prod | grep "postmark_delivery"

   Look for: rejected -- invalid or missing webhook token
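The status codes seen in steps 2 and 3 point at specific failure modes, and the mapping can be written down as a small lookup. This is an illustrative sketch only: the function names and return strings are hypothetical, and the status-to-mode mapping is taken from this runbook rather than from the monitor's code.

```python
def classify_webhook_status(status: int) -> str:
    """Interpret the HTTP status of a test POST to /webhooks/postmark/delivery."""
    if status == 200:
        return "ok"
    if status == 401:
        return "failure mode B: webhook token mismatch"
    if status in (302, 403):
        return "failure mode C: Cloudflare Access is blocking the request"
    return "unclassified: check Heroku logs"


def classify_recent_deliveries_status(status: int) -> str:
    """Interpret the HTTP status of GET /api/_internal/postmark/recent-deliveries."""
    if status == 200:
        return "ok"
    if status == 404:
        return "failure mode D: flag off on Raptor"
    return "unclassified: Raptor may be unreachable"


if __name__ == "__main__":
    for code in (200, 401, 302, 403, 500):
        print(code, "->", classify_webhook_status(code))
```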
Known failure modes
Failure mode A: FLAG_POSTMARK_DELIVERY_MONITOR off
Symptom: Monitor workflow logs "skipped — flag off" and exits 0; silence is not checked.
Cause: FLAG_POSTMARK_DELIVERY_MONITOR is not set to '1' or 'true' as a GitHub Actions variable.
Fix:
- Set it as a repo-level Actions variable:
https://github.com/raxx-app/TradeMasterAPI/settings/variables/actions
Variable name: FLAG_POSTMARK_DELIVERY_MONITOR, value: 1
- Or verify it's set for the environment the workflow runs in.
Verification: Trigger a manual workflow_dispatch run and confirm the check step
executes (not "skipped").
Failure mode B: Webhook token mismatch (POSTMARK_DELIVERY_WEBHOOK_SECRET)
Symptom: Monitor fires silence alert; curl test returns 401.
Cause: POSTMARK_DELIVERY_WEBHOOK_SECRET on Heroku doesn't match the token configured
in Postmark dashboard → Settings → Webhooks → Delivery.
Fix:
1. Fetch the correct value from vault:

       # Vault path: /MooseQuest/postmark/POSTMARK_DELIVERY_WEBHOOK_SECRET

2. Reset on Heroku (silence stdout, per feedback_heroku_config_set_echoes_secrets.md):

       heroku config:set POSTMARK_DELIVERY_WEBHOOK_SECRET=<value> \
         --app raxx-api-prod >/dev/null 2>&1

3. Verify:

       curl -X POST \
         -H "X-Postmark-Webhook-Token: <value>" \
         -H "Content-Type: application/json" \
         -d '{"RecordType":"Delivery","MessageID":"test-fix-001","Recipient":"ops@raxx.app"}' \
         https://api.raxx.app/webhooks/postmark/delivery
       # Expect: {"ok": true, ...}

Reference: docs/ops/runbooks/postmark.md — Failure mode E
Failure mode C: Cloudflare Access blocks Postmark webhook IPs
Symptom: Monitor fires silence alert; curl test returns 302 redirect or 403.
Cause: /webhooks/postmark/delivery is behind CF Access. Postmark's delivery
IP ranges cannot carry CF Access credentials.
Fix: Add CF Access bypass rule for Postmark IP ranges:
1. CF Zero Trust → Access → Applications → api.raxx.app
2. Add Policy: Bypass for IP ranges from Postmark's published list:
https://postmarkapp.com/support/article/800-ips-for-postmark-servers
3. Scope the bypass to the /webhooks/postmark/delivery path only.
Verification: POST /webhooks/postmark/delivery with valid token returns 200.
Reference: docs/ops/runbooks/postmark.md — Failure mode E;
feedback_cf_access_does_not_bypass_bot_fight_mode.md
Failure mode D: FLAG_POSTMARK_DELIVERY_MONITOR off on Raptor (not set in the Heroku env)
Symptom: Monitor fires; GET /api/_internal/postmark/recent-deliveries returns 404.
Cause: The in-process Raptor delivery monitor flag is off.
Fix:
heroku config:set FLAG_POSTMARK_DELIVERY_MONITOR=1 --app raxx-api-prod >/dev/null 2>&1
Note: The silence monitor workflow flag and the Raptor in-process flag are independent.
The workflow can check Postmark API silence (outbound count) even if the Raptor flag is off;
it treats 404 from Raptor as an unknown state and alerts.
Verification: GET /api/_internal/postmark/recent-deliveries returns 200 with event data.
Failure mode E: POSTMARK_SERVER_TOKEN missing on GH Actions runner
Symptom: Workflow step exits 2; log shows "POSTMARK_SERVER_TOKEN is not set".
Cause: The POSTMARK_SERVER_TOKEN secret is not configured in the GitHub repository
secrets (https://github.com/raxx-app/TradeMasterAPI/settings/secrets/actions).
Fix: Add the secret from vault path /MooseQuest/postmark/POSTMARK_SERVER_API_KEY.
Verification: Re-run the workflow; the check step should proceed past the token check.
Failure mode F: Monitor workflow dark (not running)
Symptom: No runs visible in the workflow history for 2+ hours during business hours.
Cause: GH Actions cron schedule drift (known GH issue — schedule runs can lag up to
15 minutes) or the workflow was accidentally disabled.
Fix:
1. Check the workflow is enabled:
https://github.com/raxx-app/TradeMasterAPI/actions/workflows/postmark-delivery-monitor.yml
→ click "..." → "Enable workflow" if it shows as disabled.
2. Trigger a workflow_dispatch manually to confirm the workflow runs.
3. If runs are delayed by >30 minutes, this is a GH Actions infrastructure issue — check
https://githubstatus.com for scheduled job delays.
Note: A workflow dark for 2+ hours is a detection gap. Per
feedback_nightly_scan_dark_is_high.md, treat this as a SEV-3.
Enable the monitor
Prerequisites:
1. POSTMARK_SERVER_TOKEN set as a GitHub Actions secret (read-only Postmark API access).
2. OPS_ALERT_EMAIL and OPS_ALERT_FROM set as GitHub Actions secrets.
3. CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET set as GitHub Actions secrets
(for the Raptor internal endpoint call).
4. FLAG_POSTMARK_DELIVERY_MONITOR set to '1' as a GitHub Actions variable.
Optionally, also set FLAG_POSTMARK_DELIVERY_MONITOR=1 on Heroku to enable the
Raptor in-process webhook monitor (see docs/ops/runbooks/postmark.md).
To test without alerting:
# Trigger a dry-run via workflow_dispatch:
gh workflow run postmark-delivery-monitor.yml \
--field dry_run=true \
--repo raxx-app/TradeMasterAPI
Alert thresholds reference
| Parameter | Default | Override |
|---|---|---|
| Look-back window | 6 hours | --window-hours CLI arg |
| Send threshold (minimum sends before silence = anomaly) | 1 | --send-threshold CLI arg |
| Alert re-fire suppression | 4 hours | --suppress-hours CLI arg |
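The defaults in the table could be wired into a command-line parser roughly as follows. This is a sketch using Python's argparse, not the script's actual parser: only the flag names and defaults come from this runbook (including --dry-run from the diagnosis steps), while the integer types and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose defaults mirror the thresholds table."""
    parser = argparse.ArgumentParser(
        description="Postmark delivery-event silence check (illustrative sketch)"
    )
    parser.add_argument("--window-hours", type=int, default=6,
                        help="look-back window for outbound sends")
    parser.add_argument("--send-threshold", type=int, default=1,
                        help="minimum sends before silence counts as an anomaly")
    parser.add_argument("--suppress-hours", type=int, default=4,
                        help="hours to suppress repeat alerts")
    parser.add_argument("--dry-run", action="store_true",
                        help="evaluate the check without sending an alert")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.window_hours, args.send_threshold, args.suppress_hours, args.dry_run)
```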
Emergency stop
To stop the monitor, either turn the flag off (the workflow still runs but logs "skipped") or disable the workflow entirely:
Option A — Remove the GH Actions variable:
https://github.com/raxx-app/TradeMasterAPI/settings/variables/actions
→ Delete or set FLAG_POSTMARK_DELIVERY_MONITOR to 0.
Option B — Disable the workflow:
https://github.com/raxx-app/TradeMasterAPI/actions/workflows/postmark-delivery-monitor.yml
→ Click "..." → "Disable workflow".
Escalation
Wake the operator when:
- The silence alert fires AND the Postmark dashboard shows >50 emails sent with no delivery records.
- The monitor fails with exit 2 for 2+ consecutive runs (broken monitor).
- Postmark account status shows the server is inactive or suspended.
- Hard-bounce rate reported by Postmark dashboard exceeds 5%.
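The escalation criteria above can be expressed as a checklist function, which may help when reviewing an incident after the fact. The MonitorStatus fields and the function name are hypothetical; the thresholds (more than 50 sends, 2 consecutive exit-2 runs, 5% hard-bounce rate) are the ones listed in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonitorStatus:
    """Hypothetical snapshot of the signals named in the escalation list."""
    silence_alert_fired: bool
    sends_without_delivery: int        # sends the dashboard shows with no delivery record
    consecutive_exit_2_runs: int       # back-to-back workflow runs that exited 2
    server_inactive_or_suspended: bool
    hard_bounce_rate: float            # fraction, so 0.05 means 5%


def wake_reasons(status: MonitorStatus) -> List[str]:
    """Return every escalation criterion the status meets (empty means do not wake)."""
    reasons = []
    if status.silence_alert_fired and status.sends_without_delivery > 50:
        reasons.append("silence alert with >50 undelivered sends")
    if status.consecutive_exit_2_runs >= 2:
        reasons.append("monitor exited 2 on 2+ consecutive runs")
    if status.server_inactive_or_suspended:
        reasons.append("Postmark server inactive or suspended")
    if status.hard_bounce_rate > 0.05:
        reasons.append("hard-bounce rate above 5%")
    return reasons


if __name__ == "__main__":
    status = MonitorStatus(True, 120, 0, False, 0.01)
    print(wake_reasons(status))
```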