Raxx · internal docs

ADR-0074: Email Delivery v1 — Hybrid Architecture (Postmark + SNS/SQS/Lambda)

Status: Accepted
Date: 2026-05-11 UTC
Supersedes / amends: ADR-0072 (v1 implementation strategy only; ADR-0072's target-state topology is unchanged)
Refs: #1657, docs/architecture/durable-email-delivery.md


Context

ADR-0072 (merged 2026-05-11 UTC, PR #1663) adopted SNS FIFO + SQS FIFO + DLQ at both layers + Lambda as the durable email primitive. It specified SES as both the inbound parser (SES receipt rules) and the outbound delivery agent (ses:SendEmail). Sub-cards #1664–#1672 were filed to execute the migration.

The operator raised a follow-up question (2026-05-11 UTC):

"If we wanted to keep Postmark in the loop — how would that work. Would that save us some work here?"

This ADR evaluates two architectures:

Architecture A (Hybrid): Keep Postmark as the delivery agent (inbound webhook parser + outbound API sender). Add the SNS/SQS/DLQ durability primitive as originally designed. Postmark is the leaf node; the durability layer wraps it.

Architecture B (Full SES): The ADR-0072 target state. SES handles receipt rules (inbound) and ses:SendEmail (outbound). Postmark is fully retired.

The honest frame: SNS/SQS/DLQ/Lambda is the durability answer regardless. SES vs Postmark is a separate question about which delivery agent sits at the leaf. Both are valid.


Decision

Adopt Architecture A (Hybrid) for v1.

Keep Postmark as the delivery leaf. Build the SNS/SQS/DLQ/Lambda durability layer around it. Defer the SES migration to post-v1. Reserve SES as the future upgrade path; do not close the door.

This is not a reversal of ADR-0072's durability architecture — it is a v1 implementation scope decision. ADR-0072's topology (SNS FIFO → SQS FIFO → DLQ at both layers → Lambda) ships in v1 exactly as specified. Only the leaf node changes.


Architecture A (Hybrid) — Topology

Inbound

Customer email
  → MX record → Google Workspace (moosequest.net / raxx.app MX)
    → forward rule → Postmark Inbound Processing
      → Postmark POSTs webhook → API Gateway endpoint (HTTPS)
        → API Gateway: validate Postmark-Token signature header
          → publish to SNS FIFO: raxx-email-inbound.fifo
            ├── SNS DLQ: raxx-email-inbound-sns-dlq.fifo
            └── SQS FIFO: raxx-email-inbound-bridge.fifo
                  ├── SQS DLQ: raxx-email-inbound-bridge-dlq.fifo
                  └── Lambda: raxx-email-inbound-bridge
                        └── FreeScout API POST /api/conversations

Outbound

Raptor / Console / FreeScout publisher
  → sns:Publish → SNS FIFO: raxx-email-outbound.fifo
    ├── SNS DLQ: raxx-email-outbound-sns-dlq.fifo
    └── SQS FIFO: raxx-email-outbound-send.fifo
          ├── SQS DLQ: raxx-email-outbound-send-dlq.fifo
          └── Lambda: raxx-email-outbound-sender
                → Postmark API: POST /email (SendEmail)
                    → customer

Sub-Card Delta — What Changes, What Survives, What Gets Cut

SC-E1 (#1664): SES domain verification + DKIM + sandbox-out request

Status: CUT

Rationale: Postmark's domain verification (raxx.app), DKIM records, and sender signatures are already done (2026-05-09 UTC, project_postmark_approved.md). Postmark is already out of sandbox. The 2026-05-16 UTC deadline to initiate SES sandbox-out is no longer on the critical path. SC-E1 can be closed.

SC-E2 (#1665): Terraform email-delivery stack — SNS/SQS/DLQs/alarms/IAM

Status: SAME (minor scope adjustment)

The entire Terraform stack ships as designed: SNS FIFO topics, SQS FIFO queues, DLQs, CloudWatch alarms, IAM roles, DynamoDB dedup table, SSM parameters. Three adjustments:

  1. raxx-email-lambda-outbound IAM role: remove ses:SendEmail and ses:SendRawEmail. Add ssm:GetParameter on /raxx/email/postmark_server_token.
  2. raxx-email-lambda-inbound IAM role: execute-api:Invoke is not needed (API Gateway is the entry point, not Lambda). Add whatever permission Postmark webhook signature verification requires (implementation detail for feature-developer: HMAC with the Postmark-Token header).
  3. New SSM parameter: /raxx/email/postmark_server_token (SecureString). Replaces /raxx/email/ses_from_domain. The existing POSTMARK_SERVER_TOKEN on Heroku apps is not the same token; Lambda needs its own server token scoped to a single Postmark server.

Everything else (SNS topics, SQS queues, DLQ topology, alarms, DynamoDB dedup table, CloudWatch alarm counts) is identical.
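
A minimal sketch of how the outbound Lambda might read the new SecureString, assuming boto3's SSM client and the parameter path above. The function names and the per-container caching are illustrative choices, not part of the ADR-0072 spec.

```python
from functools import lru_cache

PARAM_NAME = "/raxx/email/postmark_server_token"  # path from SC-E2


def fetch_postmark_token(ssm_client, name=PARAM_NAME):
    """Read the SecureString and return its decrypted value."""
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


@lru_cache(maxsize=1)
def cached_postmark_token():
    """Fetch once per warm Lambda container (hypothetical caching choice)."""
    import boto3  # imported lazily so the module loads without AWS deps

    return fetch_postmark_token(boto3.client("ssm"))
```

Passing the client in as an argument keeps the lookup testable without AWS credentials. Caching means a rotated token is only picked up on a cold start, which may or may not be acceptable.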

SC-E3 (#1666): Lambda — inbound email bridge

Status: MODIFIED (Postmark webhook instead of SES event)

Core change: the Lambda no longer parses a raw SES receipt event from S3. Instead, it receives a structured Postmark inbound webhook JSON payload (already parsed: subject, from, to, text body, HTML body, attachments as metadata). This is simpler — Postmark does the MIME parsing; SES would give raw RFC 2822 bytes that the Lambda would need to parse itself.

Changes to the Lambda:

  - Input schema: Postmark inbound JSON (not SES receipt notification).
  - Signature verification: validate the Postmark-Token header on the API Gateway request before SNS publish (reject unsigned requests at the API Gateway layer, not in the Lambda).
  - Attachment handling: Postmark provides attachment content and metadata inline in the webhook payload (up to 10 MB total); SES stores to S3 and provides a key. No S3 read access needed on the inbound Lambda.
  - Idempotency key: use Postmark's MessageID field (standing in for the RFC 2822 Message-ID).
  - Routing logic (FreeScout mailbox dispatch by To address): unchanged.

Everything else — DynamoDB dedup check, FreeScout API POST, structured logging, visibility timeout handling — is identical.
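
The transformation the inbound Lambda performs can be sketched as three pure functions. The Postmark field names (MessageID, FromFull, ToFull, Subject, TextBody, HtmlBody) follow Postmark's inbound webhook JSON. The mailbox table and the FreeScout request body shape are assumptions for illustration and should be checked against the FreeScout API module before use.

```python
# Hypothetical routing table; real mailbox IDs live in FreeScout.
MAILBOX_BY_ADDRESS = {"support@raxx.app": 1}


def idempotency_key(postmark_payload):
    """Postmark's MessageID is the dedup key named in SC-E3."""
    return postmark_payload["MessageID"]


def route_mailbox(postmark_payload, table=MAILBOX_BY_ADDRESS):
    """Return the mailbox ID for the first recognised To address, else None."""
    for recipient in postmark_payload.get("ToFull", []):
        mailbox_id = table.get(recipient["Email"].lower())
        if mailbox_id is not None:
            return mailbox_id
    return None


def to_freescout_conversation(postmark_payload, mailbox_id):
    """Build a body for POST /api/conversations (assumed shape)."""
    sender = postmark_payload["FromFull"]["Email"]
    # Prefer the HTML body, fall back to plain text.
    body = postmark_payload.get("HtmlBody") or postmark_payload.get("TextBody", "")
    return {
        "type": "email",
        "mailboxId": mailbox_id,
        "subject": postmark_payload.get("Subject", ""),
        "customer": {"email": sender},
        "threads": [
            {"type": "customer", "text": body, "customer": {"email": sender}}
        ],
    }
```

Keeping these steps free of network calls means the dedup check and the FreeScout POST can wrap them without changing the parsing logic.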

Note on F18 (SES inbound rule misconfigured): this failure mode is eliminated. Replace in the failure matrix with "Postmark webhook endpoint unreachable / API Gateway misconfigured" — same detection approach (synthetic probe, F1 unchanged).

SC-E4 (#1667): Lambda — outbound email sender

Status: MODIFIED (Postmark API call instead of ses:SendEmail)

Core change: the Lambda calls Postmark's POST /email API instead of ses:SendEmail. The structural pattern (SQS event → dedup check → send → delete message) is identical.
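
The structural pattern can be sketched as below. This assumes the SQS subscription uses raw message delivery (so each record body is the publisher's JSON) and that the event source mapping has ReportBatchItemFailures enabled. The injected callables are hypothetical stand-ins for the DynamoDB dedup table and the send step.

```python
import json


def handle_batch(event, already_sent, send, mark_sent):
    """Process an SQS batch and return a partial-batch-failure response.

    already_sent(correlation_id) -> bool, send(message) (raises on failure)
    and mark_sent(correlation_id) are injected so the loop stays testable.
    """
    records = event["Records"]
    for index, record in enumerate(records):
        try:
            message = json.loads(record["body"])
            correlation_id = message["correlation_id"]
            if not already_sent(correlation_id):
                send(message)
                mark_sent(correlation_id)
        except Exception:
            # FIFO ordering: stop at the first failure and report it plus
            # every record after it, so nothing is processed out of order.
            return {
                "batchItemFailures": [
                    {"itemIdentifier": r["messageId"]} for r in records[index:]
                ]
            }
    # Records not reported as failures are deleted from the queue by Lambda.
    return {"batchItemFailures": []}
```

Nothing in this loop refers to SES or Postmark, which is why the pattern survives the change of leaf node.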

Changes to the Lambda:

  - Postmark API call: use server token from SSM /raxx/email/postmark_server_token. The boto3 SES client call is replaced with an httpx or requests POST to https://api.postmarkapp.com/email.
  - Error handling: Postmark API returns HTTP 422 for validation errors (bad From address, suppressed recipient), 429 for rate limiting, 5xx for server errors. Map these to the same retry/DLQ logic as ADR-0072 specified for SES throttling.
  - Postmark rate limit: 1 000 messages/second published limit (compared to SES new-account 14/s). No reserved concurrency adjustment needed; Postmark headroom is much larger.
  - correlation_id dedup logic: unchanged.

IAM: remove ses:SendEmail. Add ssm:GetParameter for Postmark token path.
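
A hedged sketch of the send step and the status mapping. It uses the standard library's urllib in place of httpx or requests only to stay dependency-free. The three outcome labels and the choice to treat 422 as non-retryable are assumptions; the actual retry/DLQ policy is whatever ADR-0072 specifies.

```python
import json
import urllib.request

POSTMARK_URL = "https://api.postmarkapp.com/email"


def build_postmark_request(token, sender, recipient, subject, text_body):
    """Build (but do not send) a POST /email request for Postmark."""
    payload = {
        "From": sender,
        "To": recipient,
        "Subject": subject,
        "TextBody": text_body,
    }
    return urllib.request.Request(
        POSTMARK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "X-Postmark-Server-Token": token,
        },
        method="POST",
    )


def classify_status(status):
    """Map an HTTP status to a hypothetical outcome label."""
    if 200 <= status < 300:
        return "sent"
    if status == 429 or status >= 500:
        return "retry"  # leave on the queue; SQS redelivers, then DLQ
    return "reject"  # e.g. 422: retrying the same payload will not help
```

A caller would pass the request to urllib.request.urlopen with a timeout, catch urllib.error.HTTPError, and feed its code attribute to classify_status.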

SC-E5 (#1668): Migrate Raptor postmark_client.py → sns_publisher.py

Status: MODIFIED (scope reduced)

The migration now means: Raptor stops calling Postmark API directly and instead publishes to SNS. Raptor does not swap from Postmark to SES — it swaps from direct Postmark call to queued Postmark call. The Lambda (SC-E4) still calls Postmark at the end.

This is actually simpler than the full SES migration: the operator mental model stays "we use Postmark for email," only the call path changes from synchronous to async-via-queue. The sns_publisher.py wrapper and the feature flag logic are unchanged from the ADR-0072 spec.
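
The call-path change could look roughly like this. The sketch assumes boto3's SNS publish call with FIFO parameters. The function names, the single message group, and the use_queue flag are hypothetical; the real wrapper and flag follow the ADR-0072 spec.

```python
import json


def publish_outbound(sns_client, topic_arn, message, correlation_id, group_id="outbound"):
    """Publish one email request to the outbound SNS FIFO topic."""
    return sns_client.publish(
        TopicArn=topic_arn,
        Message=json.dumps({**message, "correlation_id": correlation_id}),
        MessageGroupId=group_id,  # FIFO topics require a group ID
        MessageDeduplicationId=correlation_id,  # dedup on the correlation ID
    )


def send_email(message, correlation_id, *, use_queue, sns_client, topic_arn, direct_send):
    """Route through the queue when the flag is on, else call Postmark directly."""
    if use_queue:
        publish_outbound(sns_client, topic_arn, message, correlation_id)
        return "queued"
    direct_send(message)
    return "direct"
```

With the flag off, Raptor behaves exactly as it does today, so the migration can be rolled back without a deploy.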

SC-E6 (#1669): FreeScout SMTP cutover from Postmark to SES

Status: CUT

FreeScout continues using Postmark SMTP for outbound replies. No config change to FreeScout. SC-E6 can be closed.

SC-E7 (#1670): Synthetic probe + CloudWatch alarm

Status: SAME

The probe design (external sender → support@raxx.app → FreeScout conversation check) is completely delivery-agent-agnostic. No changes.

SC-E8 (#1671): DLQ redrive runbook

Status: SAME

Runbook covers SQS DLQ mechanics — independent of whether the leaf node is SES or Postmark. No changes.

SC-E9 (#1672): Postmark retirement

Status: CUT

Postmark stays. SC-E9 can be closed. If Postmark is ever retired post-v1, a new card handles it then.


New Card Required

SC-E10: API Gateway endpoint for Postmark inbound webhook → SNS publish

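This work was implicit in ADR-0072 (SES receipt rules handled the inbound entry point natively). With Postmark, a new entry point is needed: an API Gateway HTTPS endpoint that receives the Postmark inbound webhook, validates the Postmark-Token header, and publishes to the SNS FIFO topic raxx-email-inbound.fifo, as shown in the inbound topology above.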

This is a size:S card — API Gateway → SNS direct integration is a well-documented AWS pattern. Estimate: 0.5–1 day for feature-developer or sre-agent.
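
One way the header check could be done at the API Gateway layer is a REQUEST-type Lambda authorizer on a REST API. The sketch below makes that assumption and treats Postmark-Token as a shared secret compared in constant time. If the real scheme is an HMAC over the request body, the comparison step would differ. The function names and the principalId value are illustrative.

```python
import hmac


def token_matches(presented, expected):
    """Constant-time comparison; empty or missing values never match."""
    if not presented or not expected:
        return False
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


def authorizer_handler(event, context=None, expected_token=None):
    """Return an IAM policy allowing or denying the webhook invocation."""
    # Header names are case-insensitive, so normalise before the lookup.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    allowed = token_matches(headers.get("postmark-token"), expected_token)
    return {
        "principalId": "postmark-inbound",
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": "Allow" if allowed else "Deny",
                    "Resource": event["methodArn"],
                }
            ],
        },
    }
```

In a deployed authorizer the expected token would be loaded from SSM rather than passed as an argument; the parameter exists here only to keep the sketch testable.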


Engineering-Day Delta

| Card | Full SES estimate | Hybrid estimate | Days saved |
| --- | --- | --- | --- |
| SC-E1: SES domain verify + sandbox-out | 1 day (operator) | 0 (CUT) | 1 |
| SC-E2: Terraform stack | 2 days | 1.5 days (minor IAM adjustment) | 0.5 |
| SC-E3: Inbound Lambda | 3 days | 2 days (no S3/MIME; simpler Postmark JSON) | 1 |
| SC-E4: Outbound Lambda | 2.5 days | 2 days (Postmark API call vs ses:SendEmail; similar effort) | 0.5 |
| SC-E5: Raptor migration | 2 days | 1.5 days (SNS publish instead of SES client; simpler) | 0.5 |
| SC-E6: FreeScout SMTP cutover | 0.5 days | 0 (CUT) | 0.5 |
| SC-E7: Synthetic probe | 1 day | 1 day (SAME) | 0 |
| SC-E8: DLQ runbook | 0.5 days | 0.5 days (SAME) | 0 |
| SC-E9: Postmark retirement | 1 day | 0 (CUT) | 1 |
| SC-E10: API Gateway webhook entry | 0 (SES handles inbound) | 1 day (NEW) | -1 |
| Total | ~13.5 days | ~9.5 days | ~4 days (30%) |

The operator's estimate of 50–60% was optimistic; summing the rows, the honest delta is approximately 30% (4 of 13.5 days). The SC-E10 API Gateway work partially offsets the savings. The dominant savings are in SC-E1 (no sandbox-out wait), SC-E3 (no MIME parsing), and SC-E9 (no retirement work).

Critical path impact: The 2026-05-16 UTC deadline to initiate SES sandbox-out is eliminated. There is no longer a hard deadline gating the outbound path. The remaining critical path is SC-E2 (Terraform) → SC-E3 + SC-E10 (inbound) before 2026-05-23 UTC launch.


Failure Mode Delta

Changes to the 20-row failure matrix in docs/architecture/durable-email-delivery.md:

| Row | Change |
| --- | --- |
| F1 (domain/MX not verified) | SOFTENED: detection unchanged; "SES domain not verified" becomes "Postmark inbound webhook not configured"; already configured, so risk is near-zero at v1 |
| F2 (SES sandbox not lifted) | ELIMINATED: Postmark is already out of sandbox |
| F8 (SES throttles outbound) | REPLACED: "Postmark API rate limit hit (429)". Postmark published limit 1 000/s vs SES new-account 14/s. Risk is lower, not higher. Lambda error handling for 429 + exponential backoff is unchanged |
| F11 (SES signature verification) | REPLACED: "Postmark webhook token validation fails". Same detection approach: security log alert if postmark_signature_invalid count > 0. Recovery: rotate Postmark inbound webhook token in Postmark admin + update SSM |
| F15 (AWS region outage) | SOFTENED: Postmark is independent of AWS for outbound delivery. If us-east-1 goes down, SQS/Lambda fails and mail queues until recovery (same as before), but the Postmark API itself is unaffected. The outbound path depends on AWS for queueing only, not as the delivery rail. A partial improvement |
| F18 (SES inbound rule misconfigured) | REPLACED: "Postmark inbound webhook URL wrong or API Gateway misconfigured". Same detection: synthetic probe (F1 covers). Recovery: correct Postmark dashboard webhook URL |
| F19 (Lambda IAM missing ses:SendEmail) | REPLACED: "Lambda IAM missing ssm:GetParameter for Postmark token". Same detection: AccessDeniedException in Lambda logs |
| All other rows | UNCHANGED |
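
For the F8 row, exponential backoff on a 429 could be derived from the SQS receive count, as in the sketch below. The base and cap values are placeholders, and the function names are hypothetical; ADR-0072's actual backoff parameters are not restated here.

```python
def backoff_seconds(receive_count, base=30, cap=900):
    """Delay before the next attempt: base * 2^(attempt - 1), capped."""
    attempt = max(int(receive_count), 1)
    return min(base * 2 ** (attempt - 1), cap)


def next_visibility_timeout(record):
    """Derive the delay from an SQS record as delivered to Lambda.

    ApproximateReceiveCount arrives as a string in the record attributes.
    """
    return backoff_seconds(record["attributes"]["ApproximateReceiveCount"])
```

The returned value would be passed as VisibilityTimeout to the SQS change_message_visibility call for the throttled message, so the retry is delayed without blocking the Lambda.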

What We Keep vs Lose

Keep (Hybrid wins these)

Lose (Full SES wins these)

The architecture explicitly preserves the SES escape hatch. The outbound Lambda consumer is the only thing that changes in a future SES migration — everything upstream (SNS, SQS, publishers) is delivery-agent-agnostic. Switching from Postmark to SES post-v1 is a one-Lambda change + DNS/domain verification, not an architectural rebuild.
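
The escape hatch can be illustrated with a small delivery-agent interface. This is a sketch under assumptions: the class names and the message keys (from, to, subject, text) are hypothetical, and the SES variant uses boto3's classic SES send_email call only to show that the swap stays inside the consumer.

```python
from typing import Protocol


class DeliveryAgent(Protocol):
    def send(self, message: dict) -> str: ...


class PostmarkAgent:
    """v1 leaf: wraps a callable that performs POST /email."""

    def __init__(self, post_email):
        self._post_email = post_email

    def send(self, message: dict) -> str:
        self._post_email(message)
        return "postmark"


class SesAgent:
    """Possible post-v1 leaf: wraps a boto3 SES client."""

    def __init__(self, ses_client):
        self._ses = ses_client

    def send(self, message: dict) -> str:
        self._ses.send_email(
            Source=message["from"],
            Destination={"ToAddresses": [message["to"]]},
            Message={
                "Subject": {"Data": message["subject"]},
                "Body": {"Text": {"Data": message["text"]}},
            },
        )
        return "ses"


def deliver(agent: DeliveryAgent, message: dict) -> str:
    """The consumer depends only on the interface, not on the vendor."""
    return agent.send(message)
```

Publishers, SNS topics, and SQS queues never see which agent is configured, which is the property the paragraph above relies on.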


Migration Plan for Architecture A (Hybrid)

Phase 0: None needed

No SES sandbox-out, no domain re-verification. Postmark is already production-ready.

Phase 1: Terraform + inbound durability (target: before 2026-05-23 UTC)

  1. SC-E2: Terraform SNS/SQS/DLQ/DynamoDB/alarms/IAM + SSM parameters.
  2. SC-E10 (new): API Gateway endpoint, Postmark webhook → SNS publish. Update Postmark inbound webhook URL.
  3. SC-E3 (modified): inbound Lambda (Postmark JSON → FreeScout API).
  4. SC-E8: DLQ runbook.
  5. SC-E7: Synthetic probe Lambda + alarm.
  6. Cutover: update Postmark inbound server to POST to API Gateway URL instead of Raptor's /webhooks/postmark/inbound. Feature-flag postmark_inbound_to_freescout OFF in Raptor.

Phase 2: Outbound queue (post-v1 acceptable)

  1. SC-E4 (modified): outbound Lambda (Postmark API call).
  2. SC-E5 (modified): Raptor postmark_client.py → sns_publisher.py (feature-flagged).
  3. 7-day soak period.

Phase 3 (optional, post-launch): Switch to SES

If deliverability, cost, or vendor concerns justify it:

  1. SC-E1 equivalent: SES domain verify + DKIM + sandbox-out request.
  2. Swap outbound Lambda consumer from Postmark API → ses:SendEmail. No publisher changes.
  3. FreeScout SMTP cutover (equivalent to original SC-E6).
  4. Postmark retirement.

This is now an optional, undated future card — not a v1 commitment.


Consequences

Positive

Negative / Trade-offs


Alternatives Considered

Architecture B: Full SES (ADR-0072 target state)

Rejected for v1 because:

  1. SES sandbox-out requires an AWS Support case with 24–48 h turnaround, a hard deadline constraint at 12 days to launch.
  2. SES inbound delivers raw MIME to S3; the Lambda must parse RFC 2822, which adds complexity and bug surface compared with Postmark's structured JSON.
  3. Postmark's IP reputation at launch is stronger than a fresh SES account.
  4. All prior Postmark setup work (domain verification, DKIM, sandbox-out, sender signatures) was done and paid for.

Accepted as the future post-v1 migration path if cost or vendor risk justifies it.

Architecture A with SNS-direct (no API Gateway)

Alternative: Postmark POSTs directly to a Raptor endpoint, Raptor publishes to SNS. This avoids API Gateway but keeps the Raptor dyno in the inbound critical path — if Raptor restarts or is overloaded, the inbound webhook call fails. API Gateway is a more reliable entry point with retries and independent scaling.

Rejected: API Gateway cost at v1 volume is negligible (< $0.01/mo). Keeping Raptor out of the synchronous webhook path is worth it.


Notes