Raxx · internal docs


Durable Email Delivery — SNS/SQS/SES with DLQ at Both Layers

Status: Accepted
Date: 2026-05-11 UTC
Operator-locked: yes — architecture spec locked by operator 2026-05-11 UTC
Refs: #1657, docs/architecture/adr/0072-durable-email-sns-sqs-ses.md


V1 Implementation: Hybrid (Postmark + SNS/SQS/Lambda)

Updated 2026-05-11 UTC — See docs/architecture/adr/0074-email-delivery-hybrid-postmark-v1.md.

What builds in v1: The SNS/SQS/DLQ/Lambda durability topology below ships exactly as designed. The delivery leaf node is Postmark (not SES) for v1.

Topology delta from full-SES design:

| Path | Full SES (ADR-0072 target) | V1 Hybrid (ADR-0074 accepted) |
|---|---|---|
| Inbound entry | SES receipt rule → S3 → SNS | Postmark inbound webhook → API Gateway → SNS |
| Inbound Lambda input | Raw MIME from S3 | Structured Postmark JSON |
| Outbound Lambda leaf | ses:SendEmail | Postmark API POST /email |
| FreeScout SMTP | SES SMTP (Phase 3) | Postmark SMTP (unchanged) |

Card delta:

  - SC-E1 (#1664): CLOSED (Postmark already verified + sandbox-out complete)
  - SC-E2 (#1665): SAME (minor IAM adjustment: no ses:SendEmail, add Postmark token SSM path)
  - SC-E3 (#1666): MODIFIED (Postmark JSON input instead of raw MIME)
  - SC-E4 (#1667): MODIFIED (Postmark API call instead of ses:SendEmail)
  - SC-E5 (#1668): MODIFIED, scope reduced (Raptor publishes to SNS; Lambda calls Postmark)
  - SC-E6 (#1669): CLOSED (FreeScout stays on Postmark SMTP)
  - SC-E7 (#1670): SAME
  - SC-E8 (#1671): SAME
  - SC-E9 (#1672): CLOSED (Postmark stays in v1)
  - SC-E10: NEW (API Gateway endpoint for Postmark webhook → SNS publish)

Critical path impact: SC-E1 deadline (2026-05-16 UTC) is eliminated. No SES sandbox-out needed before launch. New critical path: SC-E2 → SC-E10 + SC-E3 before 2026-05-23 UTC.

The SES migration path remains open post-v1. Switching from hybrid to full SES requires only a Lambda consumer swap — all upstream topology is delivery-agent-agnostic.


Context

Raxx's current email path runs through Postmark: outbound transactional mail fires directly from Raptor via postmark_client.py; inbound support mail arrives via Postmark's inbound webhook, hits POST /webhooks/postmark/inbound in Raptor, and is forwarded to FreeScout's API. This is a synchronous, single-hop architecture with no durability layer and no replay capability.

The current path has four structural weaknesses:

  1. If Postmark is degraded or rejects the call, the mail is gone: no retry, no DLQ.
  2. If FreeScout's API is down when the Postmark webhook fires, the ticket is lost.
  3. Raptor's inbound dedup table (SQLite, 1 000-row cap) is load-bearing for idempotency.
  4. There is no mechanism to detect that the entire email path has silently failed end-to-end.

Operator direction (2026-05-11 UTC): design durable email delivery using SNS FIFO + SQS FIFO + DLQs at both layers + Lambda + SES. Start with failure modes, not features.


Invariants

| # | Invariant | Source |
|---|---|---|
| I1 | No stored credentials in application code or repos | 0002-no-stored-credentials.md |
| I2 | All secrets live in AWS SSM Parameter Store for workloads | feedback_aws_workloads_use_ssm_not_vault.md |
| I3 | Email is the single contact channel, only after verification | Platform invariant |
| I4 | GDPR by default: retention limits, DSR support, DPA-ready logging | 0003-gdpr-by-default.md |
| I5 | Audit trail for every state change affecting money, permissions, or data access | Platform invariant |
| I6 | Vendor branding never surfaces in customer-facing copy | feedback_no_backend_branding.md |
| I7 | Secrets never inline in repo files even in private repos | feedback_no_inline_secrets_in_repo.md |

Section 1 — Failure Mode Matrix

This section precedes all architecture. Every row names a failure, its customer impact, how we detect it (with specific metric and threshold), and the recovery path.

Design principle (operator verbatim): "We shouldn't discover that it has 'also' failed." If a failure mode has no detection, that is the bug.

Note on v1 hybrid changes: Rows F2, F8, F11, F18, and F19 are updated below for the hybrid architecture. Original SES-specific text is preserved with a [Full-SES target] note for reference.

| # | Failure | Customer impact | Detection (metric + alarm threshold) | Recovery path | Owner |
|---|---|---|---|---|---|
| F1 | Inbound mail path broken (MX/webhook misconfigured) | All inbound mail silently discarded | Synthetic probe: probe email sent every 5 min; alert if FreeScout ticket absent within 5 min of send | Fix webhook config + replay from DLQ | Operator |
| F2 | [V1 hybrid] Postmark API unavailable (outbound). [Full-SES target] SES sandbox not lifted | Outbound email delayed; messages queue in SQS | Lambda errors with ConnectionError or 5xx; SQS ApproximateNumberOfMessagesVisible grows; Postmark status page webhook → ops@ | SQS retries until Postmark recovers; messages durable in SQS | Automated + Operator |
| F3 | SNS publish fails (Raptor SDK call rejected: auth, throttle, or network) | Caller gets 5xx; mail never enters the queue | Sentry exception on sns.publish() calls; CloudWatch metric SNS:NumberOfNotificationsFailed > 0 for 2 consecutive periods | Retry with exponential backoff (caller-side); if persistent, raise P1 | Automated + Operator |
| F4 | SNS → SQS subscription broken (subscription deleted, policy revoked) | Mail lands in SNS but never in SQS; SNS DLQ fills | CloudWatch: SNS DLQ ApproximateNumberOfMessagesVisible > 0 within 15 min | Re-subscribe SQS to SNS topic; SNS DLQ messages replay via CLI | Operator |
| F5 | SQS DLQ depth > 0 (poison message or code crash) | Some messages permanently stuck; customer never gets email or ticket | CloudWatch alarm: SQS DLQ ApproximateNumberOfMessagesVisible > 0 (zero-tolerance) | Examine DLQ messages per Replay SOP (Section 9); fix code; redrive | Operator |
| F6 | Lambda fails to invoke: concurrency exhausted or runtime error | SQS backlog grows; mail delayed | CloudWatch: Lambda Errors > 5 per 5 min, SQS ApproximateNumberOfMessagesVisible > 20 for 10 min | Increase reserved concurrency; fix code; SQS drains once Lambda healthy | Automated alarm + Operator |
| F7 | FreeScout API down (inbound Lambda cannot POST /api/conversations) | Mail buffered in SQS; no ticket created | SQS ApproximateNumberOfMessagesVisible baseline comparison; FreeScout synthetic health probe GET /api/users 200 expected every 2 min | FreeScout restart; SQS drains automatically once FS recovers | Operator (FreeScout on Lightsail) |
| F8 | [V1 hybrid] Postmark API rate limit hit (429). [Full-SES target] SES throttles outbound | Outbound email delayed, eventually DLQ | Lambda errors with 429 response; SQS ApproximateNumberOfMessagesVisible grows | Exponential backoff in Lambda (already implemented); SQS retries automatically | Automated + Operator |
| F9 | SQS visibility timeout < Lambda execution time (message redelivered while in-flight) | Duplicate ticket or duplicate outbound email | Lambda idempotency check on Message-ID header / correlation_id; alarm if duplicate conversation created in FreeScout | Fix visibility timeout to 20 min (Section 6); add dedup guard | Feature-developer |
| F10 | Poison message: malformed UTF-8, payload > 256 KB limit, schema violation | One message stuck in DLQ; pipeline for all others continues | SQS DLQ alarm (F5 covers this); Lambda structured log "event_type": "poison_message" | DLQ Replay SOP (Section 9); repair or discard; root-cause fix | Operator |
| F11 | [V1 hybrid] Postmark webhook HMAC signature invalid, missing, or replayed (Date header > 60 s old). [Full-SES target] SES signature verification failure | Legitimate inbound mail rejected at boundary OR attacker spoofs/replays events | CloudWatch alarm raxx-email-postmark-signature-invalid fires on Raxx/Email / postmark_signature_invalid >= 1 in 5 min; emitted by Lambda Authorizer on every Deny (#1731) | Investigate authorizer logs (event_type field); if sustained, rotate Postmark inbound webhook token in Postmark admin + update SSM /raxx/email/postmark_inbound_webhook_token | Automated alarm + Operator |
| F12 | FIFO ordering violated: caller uses wrong MessageGroupId | Out-of-order processing within a group (e.g., reply processed before original) | Audit log: sequence gap detection on correlation_id ordering | Code fix on publisher side; FIFO group key conventions in ADR | Feature-developer |
| F13 | Malicious attachment in inbound email | Support staff opens malware; compromised host | Attachment scan Lambda (S3 upload + ClamAV or GuardDuty Malware Protection) intercepts before FreeScout delivery | Quarantine to S3 bucket; alert ops@; do not forward to FreeScout | Automated |
| F14 | Reply-to spam trap triggers domain blacklist | Raxx domain reputation degrades; all future mail blocked | DMARC aggregate report weekly digest; Postmark bounce/spam rates in dashboard | Suppression list update; alert ops@; investigate source address | Operator |
| F15 | AWS us-east-1 region outage | Inbound stall; outbound queues but Postmark delivery unaffected if SQS recovers | AWS Service Health Dashboard webhook → ops@ alert within 15 min of incident declaration | Accept degradation for inbound; Postmark handles outbound delivery independently of AWS; no cross-region failover at v1 | Operator |
| F16 | DLQ alarm fires but operator has no SOP | Extended mean-time-to-recovery | DLQ alarm links to runbook URL in AlarmDescription field; wiki link surfaced in alert body | Runbook authored up-front (Section 9) and linked from alarm description | Operator (pre-ship) |
| F17 | Lambda cold-start latency causes visibility timeout breach during burst | Redeliveries during spike; potential duplicates | Lambda InitDuration P99 > 10 s for 3 consecutive periods; SQS NumberOfMessagesMoved | Provision concurrency for inbound Lambda; idempotency guard handles duplicates | Feature-developer |
| F18 | [V1 hybrid] Postmark inbound webhook URL wrong or API Gateway misconfigured. [Full-SES target] SES inbound rule misconfigured | Inbound mail not captured; never reaches SNS topic | Synthetic probe (F1 covers detection) | Verify Postmark inbound server webhook URL points to API Gateway endpoint; re-test probe | Operator |
| F19 | [V1 hybrid] Lambda IAM missing ssm:GetParameter for Postmark token. [Full-SES target] Lambda IAM role missing ses:SendEmail permission | Outbound Lambda crashes on permission error | Lambda CloudWatch Logs: AccessDeniedException; Lambda Errors alarm (F6 covers) | IAM policy update; redeploy Lambda (no code change needed if Terraform-managed) | Operator / sre-agent |
| F20 | SQS message expiry: message deleted at MaximumMessageRetention before processed | Silent drop if Lambda backlogged for > 4 days | SQS NumberOfMessagesMoved from main queue (unusual drop); ApproximateAgeOfOldestMessage > 86400 (1 day) alarm | Investigate why Lambda is not draining; increase retention only after root-cause fix | Operator |

Note on F15 (AWS region SPOF): Cross-region failover for SNS/SQS is out of scope for v1. In the hybrid architecture, Postmark handles outbound delivery independently of AWS — an AWS region outage stalls inbound queuing but does not prevent messages already in transit via Postmark from delivering. This is an improvement over full SES. Multi-region email durability is a post-v1 improvement path.
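F3's recovery path calls for caller-side retry with exponential backoff around the SNS publish. A minimal sketch of that pattern follows; the function names, the attempt count, and the base/cap values are illustrative assumptions, not values fixed by this spec, and the full-jitter strategy is one common choice rather than a stated requirement.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_caps(retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Upper bound for each retry delay: base * 2**n, clamped to cap."""
    return [min(cap, base * 2**n) for n in range(retries)]


def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,  # assumption: not specified in this document
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with full-jitter exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    caps = backoff_caps(attempts - 1, base, cap)
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                # Persistent failure: let it surface (Sentry, P1 escalation).
                raise
            # Full jitter: sleep a random amount up to the current cap.
            sleep(random.uniform(0, caps[attempt]))
    raise AssertionError("unreachable")
```

A caller would wrap the publish in a zero-argument callable, for example `call_with_backoff(lambda: sns.publish(**kwargs))`, where `sns` is assumed to be a boto3 SNS client. A production version would likely retry only throttling and network errors rather than every exception.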


Section 2 — Component Layout

SNS FIFO Topics

| Resource | Name | IAM role | Terraform path | SSM secrets | Cost @ v1 |
|---|---|---|---|---|---|
| SNS FIFO topic (inbound) | raxx-email-inbound.fifo | raxx-email-sns-publisher (least: sns:Publish) | terraform/email-delivery/sns.tf | none | < $0.01/mo (< 1 M publishes) |
| SNS FIFO topic (outbound) | raxx-email-outbound.fifo | raxx-email-sns-publisher | terraform/email-delivery/sns.tf | none | < $0.01/mo |
| SNS DLQ (inbound) | raxx-email-inbound-sns-dlq.fifo | n/a (CloudWatch alarm only) | terraform/email-delivery/dlq.tf | none | < $0.01/mo |
| SNS DLQ (outbound) | raxx-email-outbound-sns-dlq.fifo | n/a | terraform/email-delivery/dlq.tf | none | < $0.01/mo |

SQS FIFO Queues

| Resource | Name | IAM role | Terraform path | SSM secrets | Cost @ v1 |
|---|---|---|---|---|---|
| SQS FIFO (inbound → FreeScout bridge) | raxx-email-inbound-bridge.fifo | raxx-email-lambda-inbound (least: sqs:ReceiveMessage, sqs:DeleteMessage, sqs:GetQueueAttributes) | terraform/email-delivery/sqs.tf | none | < $0.01/mo |
| SQS DLQ (inbound bridge) | raxx-email-inbound-bridge-dlq.fifo | n/a | terraform/email-delivery/dlq.tf | none | < $0.01/mo |
| SQS FIFO (outbound → Postmark sender) | raxx-email-outbound-send.fifo | raxx-email-lambda-outbound | terraform/email-delivery/sqs.tf | none | < $0.01/mo |
| SQS DLQ (outbound) | raxx-email-outbound-send-dlq.fifo | n/a | terraform/email-delivery/dlq.tf | none | < $0.01/mo |

Lambda Functions

V1 hybrid (ADR-0074): inbound Lambda receives Postmark JSON (not SES raw MIME); outbound Lambda calls Postmark API (not ses:SendEmail).

| Resource | Name | IAM role | Runtime | SSM secrets (path) | Terraform path | Cost @ v1 |
|---|---|---|---|---|---|---|
| Inbound bridge | raxx-email-inbound-bridge | raxx-email-lambda-inbound (sqs:Receive, sqs:Delete, ssm:GetParameter on /raxx/email/*) | Python 3.12 | /raxx/email/freescout_api_key, /raxx/email/freescout_api_url, /raxx/email/mailbox_routing_map | terraform/email-delivery/lambda.tf | < $0.01/mo |
| Outbound sender | raxx-email-outbound-sender | raxx-email-lambda-outbound (sqs:Receive, sqs:Delete, ssm:GetParameter on /raxx/email/*) | Python 3.12 | /raxx/email/postmark_server_token, /raxx/email/dedup_table_name | terraform/email-delivery/lambda.tf | < $0.01/mo |

Full-SES target (post-v1): outbound Lambda role adds ses:SendEmail, ses:SendRawEmail; removes Postmark token SSM path. Inbound Lambda role unchanged.

API Gateway (v1 hybrid addition)

| Resource | Purpose | Terraform path |
|---|---|---|
| HTTP API endpoint POST /webhooks/postmark/inbound | Postmark inbound webhook entry point → SNS publish | terraform/email-delivery/api-gateway.tf |
| SSM parameter /raxx/email/postmark_inbound_webhook_token | Webhook token for request validation | terraform/email-delivery/ssm.tf |

CloudWatch Alarms

All alarms created in terraform/email-delivery/alarms.tf. SNS notification target: ops@raxx.app (pre-launch: digest mode per feedback_pre_launch_digest_notifications.md; post-launch: per-event).
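For orientation, the zero-tolerance DLQ depth alarm (F5) with its runbook link (F16) maps onto CloudWatch alarm parameters roughly as below. The Terraform in alarms.tf remains the source of truth; this helper only builds the equivalent keyword arguments for boto3's `put_metric_alarm`. The period, statistic, missing-data handling, and the runbook URL argument are assumptions for illustration.

```python
def dlq_depth_alarm(
    direction: str,     # "inbound" or "outbound"
    layer: str,         # the <type> segment of the alarm name
    queue_name: str,
    runbook_url: str,   # hypothetical: wherever the Section 9 SOP is published
    topic_arn: str,     # notification topic that reaches ops@
) -> dict:
    """Build put_metric_alarm kwargs for a zero-tolerance DLQ depth alarm."""
    return {
        "AlarmName": f"raxx-email-{direction}-{layer}-dlq-depth",
        # F16: the runbook link travels with the alert body.
        "AlarmDescription": f"DLQ depth > 0. Runbook: {runbook_url}",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,             # assumption: 5-minute evaluation
        "EvaluationPeriods": 1,
        "Threshold": 0,            # zero-tolerance: any message trips it
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }
```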

IAM Roles (least-privilege summary)

raxx-email-sns-publisher: sns:Publish on raxx-email-inbound.fifo and raxx-email-outbound.fifo only.
raxx-email-lambda-inbound: sqs:ReceiveMessage, sqs:DeleteMessage, sqs:GetQueueAttributes, sqs:ChangeMessageVisibility on inbound queue; ssm:GetParameter on /raxx/email/*; logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents.
raxx-email-lambda-outbound (v1 hybrid): same SQS permissions on outbound queue; ssm:GetParameter on /raxx/email/*; logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents. No ses:* permissions in v1.

No role has * actions. No role has cross-account permissions at v1.


Section 3 — Topic and Queue Topology

Decision: two separate SNS topics (one inbound, one outbound). See ADR-0072 for full rationale. Summary:

One SQS queue per consumer. At v1, two consumers: FreeScout bridge (inbound) and Postmark sender (outbound). A third consumer for attachment AV scanning can subscribe to the inbound topic independently when that card lands. This pattern allows adding consumers without modifying existing queues.

Postmark inbound webhook
  └─► API Gateway POST /webhooks/postmark/inbound
        └─► (signature validation)
              └─► SNS FIFO: raxx-email-inbound.fifo
                    ├─► SNS DLQ: raxx-email-inbound-sns-dlq.fifo
                    └─► SQS FIFO: raxx-email-inbound-bridge.fifo
                          ├─► SQS DLQ: raxx-email-inbound-bridge-dlq.fifo
                          └─► Lambda: raxx-email-inbound-bridge
                                └─► FreeScout API POST /api/conversations

Raptor / Console / FreeScout (publisher)
  └─► SNS FIFO: raxx-email-outbound.fifo
        ├─► SNS DLQ: raxx-email-outbound-sns-dlq.fifo
        └─► SQS FIFO: raxx-email-outbound-send.fifo
              ├─► SQS DLQ: raxx-email-outbound-send-dlq.fifo
              └─► Lambda: raxx-email-outbound-sender
                    └─► Postmark API POST /email

Full-SES target topology (post-v1): Replace "Postmark inbound webhook → API Gateway" with "SES receipt rule". Replace "Postmark API POST /email" with "ses:SendEmail". All SNS/SQS/DLQ/Lambda layers are unchanged.


Section 4 — Multi-Mailbox Routing Logic

Day 1 mailboxes: support@raxx.app, ops@raxx.app, account-management@raxx.app.

The inbound bridge Lambda dispatches by destination To address to the correct FreeScout mailbox ID.

Routing map storage: SSM Parameter Store at /raxx/email/mailbox_routing_map as a JSON string:

{
  "support@raxx.app":            101,
  "ops@raxx.app":                102,
  "account-management@raxx.app": 103
}

Reasons for SSM over Lambda env vars:

  - Rotatable without Lambda redeploy (update SSM value; Lambda reads at invocation time or via cached refresh on cold start).
  - Consistent with feedback_aws_workloads_use_ssm_not_vault.md.
  - Auditable: SSM parameter updates generate CloudTrail events.

Lambda routing logic (pseudocode — implementation in sub-card):

to_address = normalize(postmark_event["To"])
routing_map = get_cached_ssm_param("/raxx/email/mailbox_routing_map")
mailbox_id = routing_map.get(to_address) or routing_map.get("support@raxx.app")
freescout_client.create_conversation(mailbox_id=mailbox_id, ...)

If no match, fall back to support@raxx.app (mailbox 101) and log routing_fallback event for audit.
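The routing rule and its fallback can be sketched in runnable form as follows. This is a minimal sketch, not the sub-card implementation: `resolve_mailbox`, the logger name, and the log payload shape are illustrative assumptions, and it handles a single To address only (a multi-recipient To header would need a policy this document does not define).

```python
import json
import logging
from email.utils import parseaddr

logger = logging.getLogger("raxx.email.inbound")  # hypothetical logger name
DEFAULT_MAILBOX = "support@raxx.app"


def normalize(address: str) -> str:
    """Strip any display name and lowercase the bare address."""
    return parseaddr(address)[1].strip().lower()


def resolve_mailbox(to_header: str, routing_map: dict[str, int]) -> int:
    """Map a To header to a FreeScout mailbox ID, falling back to support@."""
    to_address = normalize(to_header)
    mailbox_id = routing_map.get(to_address)
    if mailbox_id is None:
        # Audit trail for unmatched destinations.
        logger.info(json.dumps({"event_type": "routing_fallback", "to": to_address}))
        mailbox_id = routing_map[DEFAULT_MAILBOX]
    return mailbox_id
```

The routing map passed in is the parsed JSON from /raxx/email/mailbox_routing_map; fetching and caching that parameter is left out of the sketch.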


Section 5 — Idempotency Contract

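SQS FIFO deduplication (content-based, or via an explicit MessageDeduplicationId) suppresses duplicate sends within the 5-minute deduplication window. It does not prevent redelivery: a message is received again if its visibility timeout expires before it is deleted, and a duplicate published outside the window is accepted as new. Both Lambdas therefore carry their own idempotency check.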

Inbound bridge Lambda:

  - Dedup key: MessageID field from the Postmark inbound webhook payload (equivalent to RFC 2822 Message-ID).
  - Check: query a DynamoDB table raxx-email-inbound-dedup (partition key: message_id) before creating a FreeScout conversation.
  - On hit: log "event_type": "idempotency_skip" and delete from SQS (return success to avoid re-delivery).
  - On miss: create conversation, then write message_id to dedup table with TTL = 90 days.
  - Why DynamoDB (not Postgres / Raptor DB): Lambda is a separate compute boundary; avoiding a Raptor DB dependency prevents tight coupling and DB connection exhaustion under Lambda burst.

Outbound sender Lambda:

  - Dedup key: correlation_id provided by the caller in the SNS message attribute.
  - Check: same DynamoDB table with separate prefix outbound#<correlation_id>.
  - Callers (Raptor, Console) generate correlation_id as a UUID4 on first publish. Retry of the same logical send reuses the same correlation_id.
  - SQS FIFO MessageDeduplicationId = correlation_id (5-min window hardening).
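The check-then-record flow described above can be sketched against a pluggable store. The in-memory store here is a stand-in for the DynamoDB table so the example runs without AWS; `DedupStore`, `process_once`, and the return strings are hypothetical names, while the key formats and the 90-day TTL come from this section.

```python
import time
from typing import Callable, Protocol

DEDUP_TTL_SECONDS = 90 * 24 * 3600  # 90 days, as specified for the inbound table


class DedupStore(Protocol):
    def seen(self, key: str) -> bool: ...
    def record(self, key: str, expires_at: int) -> None: ...


class InMemoryDedupStore:
    """Stand-in for the DynamoDB dedup table (illustration only)."""

    def __init__(self) -> None:
        self._items: dict[str, int] = {}

    def seen(self, key: str) -> bool:
        return key in self._items

    def record(self, key: str, expires_at: int) -> None:
        self._items[key] = expires_at


def inbound_key(message_id: str) -> str:
    return message_id


def outbound_key(correlation_id: str) -> str:
    return f"outbound#{correlation_id}"


def process_once(
    key: str,
    store: DedupStore,
    action: Callable[[], None],
    now: Callable[[], float] = time.time,
) -> str:
    """Run action at most once per key; return the event type to log."""
    if store.seen(key):
        return "idempotency_skip"  # caller deletes the SQS message as a success
    action()  # create the conversation / send the email
    store.record(key, int(now()) + DEDUP_TTL_SECONDS)
    return "processed"
```

Note that check-then-write leaves a small race if two invocations handle the same key concurrently. A conditional write against the real table would close it, at the cost of deciding what to do when the action fails after the key is claimed; that trade-off is outside what this document specifies.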


Section 6 — Visibility Timeout Sizing

Lambda maximum execution timeout: 15 minutes (900 s).

Recommended SQS visibility timeout: 20 minutes (1200 s).

Math: Lambda max execution (900 s) + safety margin (300 s) = 1200 s = 20 min.

The safety margin covers Lambda cold-start latency (typically < 5 s; provisioned concurrency removes it), Postmark/FreeScout API call latency, and the DynamoDB dedup read/write.

Implications:

  - If a Lambda invocation crashes hard (OOM, unhandled exception), the message becomes visible again after 20 minutes.
  - With maxReceiveCount = 3 on the SQS queue, a poison message is attempted 3 times (3 × 20 min = 60 min total delay) before moving to the DLQ.
  - A 60-minute delay is acceptable for v1 support volume, and it reduces noise from transient FreeScout/Postmark hiccups.
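The sizing arithmetic in this section can be checked with a few lines. The constants are the values stated above; the function names are illustrative.

```python
LAMBDA_MAX_EXECUTION_S = 900   # Lambda maximum timeout (15 min)
SAFETY_MARGIN_S = 300          # margin chosen in this section
MAX_RECEIVE_COUNT = 3          # redrive policy on the SQS queue


def visibility_timeout_s(max_execution_s: int, margin_s: int) -> int:
    """Visibility timeout must exceed the longest possible invocation."""
    return max_execution_s + margin_s


def worst_case_dlq_delay_s(timeout_s: int, max_receive_count: int) -> int:
    """Time for a message that fails every attempt to reach the DLQ."""
    return timeout_s * max_receive_count


timeout = visibility_timeout_s(LAMBDA_MAX_EXECUTION_S, SAFETY_MARGIN_S)
delay = worst_case_dlq_delay_s(timeout, MAX_RECEIVE_COUNT)
print(timeout // 60, "min visibility timeout;", delay // 60, "min to DLQ")
```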


Section 7 — Reserved Concurrency

Outbound sender Lambda:

  - Postmark API published rate limit: 1 000 messages/second, significantly higher headroom than SES new-account 14/s.
  - At v1 outbound volume (10 000/mo ≈ 333/day ≈ 14/hour), burst is negligible.
  - Reserved concurrency: 5. Conservative guard against code bugs or requeue storms.
  - If Postmark rate limits surface, reduce to 1 (single-threaded processing).

Inbound bridge Lambda:

  - FreeScout API rate limit: unknown. Conservative assumption: 10 req/s.
  - Inbound v1 volume: ~100/mo. Spikes unlikely.
  - Reserved concurrency: 3. Prevents runaway invocations from hitting FS API.
  - If FreeScout API rate limits surface in production, reduce to 1 (single-threaded processing).

Note: Reserved concurrency also acts as a kill-switch. Setting reserved concurrency to 0 halts all processing (messages remain durably in SQS). This satisfies the platform invariant for kill-switches on critical execution paths.


Section 8 — Migration Plan

V1 Hybrid Plan (ADR-0074 — active)

Phase 0: None needed. Postmark is already production-ready, domain-verified, DKIM configured, sandbox-out complete. No AWS Support case required before v1 launch.

Phase 1: Inbound durability before v1 launch (target: before 2026-05-23 UTC)

  1. SC-E2: Terraform stack (SNS/SQS/DLQ/DynamoDB/alarms/IAM/API Gateway).
  2. SC-E10: API Gateway endpoint POST /webhooks/postmark/inbound → SNS publish.
  3. SC-E3 (modified): inbound Lambda (Postmark JSON → FreeScout API).
  4. SC-E8: DLQ runbook.
  5. SC-E7: Synthetic probe Lambda + alarm.
  6. Cutover: update Postmark inbound server webhook URL to API Gateway endpoint. Feature-flag postmark_inbound_to_freescout OFF in Raptor.

Phase 2: Outbound queue (post-v1 acceptable)

  1. SC-E4 (modified): outbound Lambda (Postmark API call).
  2. SC-E5 (modified): Raptor postmark_client.py → sns_publisher.py (feature-flagged).
  3. 7-day soak period.

FreeScout continues using Postmark SMTP for outbound replies throughout v1. Unchanged.

Phase 3 (optional, post-v1): Full SES migration

If deliverability, cost, or vendor concerns justify it after launch:

  1. Verify raxx.app domain in SES: add DKIM CNAME records to Cloudflare DNS.
  2. Open AWS Support case: "Production Access Request" for SES.
  3. Swap outbound Lambda consumer: Postmark API → ses:SendEmail. No publisher code changes.
  4. FreeScout SMTP cutover: change from Postmark SMTP to SES SMTP endpoint.
  5. Postmark retirement (cancel subscription, archive logs, remove credentials from Heroku apps, clean up dead code in Raptor).

This is an optional, undated future card — not a v1 commitment. The Lambda consumer is the only component that changes; all upstream topology (SNS/SQS/DLQ/publishers) is delivery-agent-agnostic.


Original SES Plan (ADR-0072 target state — reference only, superseded for v1 by ADR-0074)

Phase 0: SES Domain Verification + Sandbox-Out (Operator action, size:s, ~1 day)

  - Verify raxx.app domain in SES: add DKIM CNAME records to DNS (Cloudflare, zone raxx.app).
  - Add SPF record if not present.
  - Open AWS Support case: "Production Access Request" for SES.
  - Do NOT proceed to Phase 1 for outbound until sandbox-out is confirmed.

Phase 1: Inbound Durability (size:l)

  - Terraform: provision SNS inbound topic + SQS inbound queue + DLQs + alarms.
  - SES: configure receipt rule set for raxx.app → S3 action → SNS publish action to raxx-email-inbound.fifo.
  - Deploy raxx-email-inbound-bridge Lambda (SES raw MIME from S3 input).
  - Cut DNS MX record from Postmark to SES inbound.

Phase 2: Outbound Queue Validation (size:m)

  - Terraform: provision SNS outbound topic + SQS outbound queue + DLQs.
  - Deploy raxx-email-outbound-sender Lambda (SES SendEmail).
  - Migrate postmark_client.py → sns_publisher.py in Raptor (feature-flagged).
  - Monitor for a 7-day soak period.

Phase 3: FreeScout SMTP Cutover (size:s)

  - Change FreeScout SMTP settings from Postmark to SES SMTP endpoint.
  - SES SMTP credentials stored in SSM at /raxx/email/ses_smtp_password.

Phase 4: Postmark Retirement (size:s, ~30 days after Phase 3 soak)

  - Cancel Postmark subscription.
  - Archive Postmark delivery logs (export to S3 for GDPR retention period).
  - Remove POSTMARK_SERVER_TOKEN from all Heroku apps.
  - Remove postmark_client.py, postmark_inbound.py, postmark_collector.py in a cleanup PR.


Section 9 — Replay SOP for DLQ

When a CloudWatch alarm fires for SQS DLQ depth > 0, follow this procedure.

Step 1 — Identify the affected DLQ

Alarm name format: raxx-email-<direction>-<type>-dlq-depth. Determine which DLQ.

Step 2 — Examine DLQ messages

# Receive up to 10 messages (does not delete them)
aws sqs receive-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/<account-id>/raxx-email-inbound-bridge-dlq.fifo \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All \
  --visibility-timeout 300 \
  --region us-east-1

Examine the Body field (JSON-encoded SNS notification wrapping the original email event). In the message's Attributes map, look for:

  - ApproximateReceiveCount: if > 3, it hit max-receive-count.
  - SentTimestamp: when the message first arrived (epoch milliseconds).

Step 3 — Correlate with Lambda logs

# Find Lambda invocation errors around the message's SentTimestamp
aws logs filter-log-events \
  --log-group-name /aws/lambda/raxx-email-inbound-bridge \
  --start-time <epoch-ms-from-SentTimestamp> \
  --filter-pattern '"ERROR"' \
  --region us-east-1

Check for: FreeScout API 5xx, connection timeout, malformed payload, IAM permission error.
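Steps 2 and 3 can be tied together with a small helper that reads one message from the receive-message JSON output and derives the values needed for the log search. This is a sketch; `triage` and the 60-second lookback are illustrative assumptions. It relies on SQS reporting SentTimestamp as epoch milliseconds in string form.

```python
from datetime import datetime, timezone


def triage(message: dict, lookback_ms: int = 60_000) -> dict:
    """Summarise one DLQ message for the disposition decision in Step 4."""
    attrs = message.get("Attributes", {})
    sent_ms = int(attrs["SentTimestamp"])  # epoch milliseconds, as a string
    return {
        "receive_count": int(attrs.get("ApproximateReceiveCount", "0")),
        "sent_at_utc": datetime.fromtimestamp(
            sent_ms / 1000, tz=timezone.utc
        ).isoformat(),
        # Value for --start-time in filter-log-events; starts slightly early
        # so the first failing invocation is not cut off.
        "logs_start_time_ms": sent_ms - lookback_ms,
    }
```

Feed it each element of the `Messages` array from the Step 2 command output, then pass `logs_start_time_ms` to the Step 3 command.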

Step 4 — Determine disposition

| Root cause | Disposition |
|---|---|
| FreeScout was transiently down (now recovered) | Redrive: message should process successfully |
| Code bug (now fixed and Lambda redeployed) | Redrive |
| Malformed message (no Message-ID, corrupt UTF-8) | Discard: log reason; root-cause at source |
| Duplicate of already-processed message | Discard: idempotency guard will skip it anyway, but discarding avoids noise |
| Suspected spoofed/malicious message | Quarantine to S3; do not redrive; open security incident |

Step 5 — Redrive (AWS Console preferred for audit trail)

AWS Console method:

  1. SQS Console → queue → "Dead-letter queue" tab → "Start DLQ redrive".
  2. Set destination to the source queue.
  3. Confirm. AWS generates a CloudTrail event.

CLI method (for scripting):

aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:us-east-1:<account-id>:raxx-email-inbound-bridge-dlq.fifo \
  --destination-arn arn:aws:sqs:us-east-1:<account-id>:raxx-email-inbound-bridge.fifo \
  --region us-east-1

Step 6 — Post-redrive audit

After redrive, verify:

  1. DLQ depth returns to 0 (CloudWatch metric ApproximateNumberOfMessagesVisible).
  2. Lambda successfully processes the message (check CloudWatch Logs for successful invocations).
  3. FreeScout conversation created (or Postmark delivery confirmed).
  4. Write an ops note in the FreeScout ops@ ticket if the failure affected a customer-visible email path. Note: timestamp (UTC), root cause, resolution, message count affected.


Section 10 — Synthetic Probe

Design

Why external sender

The probe must originate outside AWS. If API Gateway, SNS, SQS, Lambda, and the FreeScout API all fail together (e.g., F15, a region outage), a probe sent from AWS Lambda would fail with them and the alarm would never fire. Using the operator's personal email as the sender keeps the signal on a completely independent path.

Probe Lambda IAM

raxx-email-probe-role: ssm:GetParameter on /raxx/email/freescout_api_key only. No SNS, no SQS permissions — read-only health check.

Failure response

Probe alarm fires → ops@ alert with probe ID → operator checks:

  1. Did the email arrive at Postmark inbound? (Postmark dashboard: inbound activity log)
  2. Did API Gateway receive the webhook? (API Gateway access logs)
  3. Did SNS publish? (SNS console: delivery status)
  4. Did SQS receive? (SQS console: messages received metric)
  5. Did Lambda invoke? (CloudWatch Logs)
  6. Did FreeScout receive the API call? (FS admin → activity log)

This six-layer walkthrough isolates the failing component within 10 minutes.


Section 11 — Cost Analysis

At v1 volume: 10 000 outbound/mo + 100 inbound/mo.

V1 Hybrid (Postmark delivery leaf)

| Service | Usage | Unit cost | Monthly cost |
|---|---|---|---|
| Postmark outbound | 10 000 emails | $1.25 / 1 000 | $12.50 |
| Postmark inbound | 100 emails | Included in plan | $0.00 |
| SNS | ~10 200 publishes | First 1 M free | $0.00 |
| SQS | ~10 200 requests | First 1 M free | $0.00 |
| Lambda | ~10 200 invocations, avg 2 s each | First 1 M invocations + 400 000 GB-s free | $0.00 |
| API Gateway | ~100 inbound webhook calls | First 1 M free (HTTP API) | $0.00 |
| DynamoDB (dedup table) | ~10 300 reads + writes | First 25 GB free (on-demand) | $0.00 |
| CloudWatch alarms | ~12 alarms | $0.10/alarm/mo | $1.20 |
| CloudWatch Logs | ~500 MB/mo Lambda logs | $0.50/GB ingest + $0.03/GB storage | ~$0.27 |
| Probe Lambda | 8 640 invocations/mo (every 5 min) | Within free tier | $0.00 |
| Total | | | ~$13.97/mo |

Full SES (post-v1 target)

| Service | Usage | Unit cost | Monthly cost |
|---|---|---|---|
| SES outbound | 10 000 emails | $0.10 / 1 000 | $1.00 |
| SES inbound | 100 emails | $0.10 / 1 000 | $0.01 |
| SNS/SQS/Lambda/DynamoDB | same as above | same as above | ~$1.47 |
| Total | | | ~$2.48/mo |

Cost delta at v1: $13.97/mo (hybrid) vs $2.48/mo (full SES). Difference: ~$11.50/mo.

When does the SES migration pay for itself? The migration is approximately 3–4 engineering-days of work. At a conservative $500/day, that is $1 500–2 000 of effort against a saving of ~$11.50/mo, so break-even at v1 volume is roughly 130–175 months (more than a decade). The migration is a cost decision at scale, not at v1. At 100k/mo outbound volume, the difference is ~$112/mo, meaningful enough to revisit.
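The arithmetic in this section can be reproduced from the table figures. The line items and rates below are copied from the two cost tables; the helper names are illustrative, and the $500/day rate and 3–4 day estimate are the assumptions stated in the paragraph above.

```python
def monthly_cost(line_items: dict[str, float]) -> float:
    """Sum monthly line items, rounded to cents."""
    return round(sum(line_items.values()), 2)


def break_even_months(migration_cost: float, monthly_saving: float) -> float:
    """Months of savings needed to repay a one-off migration cost."""
    return migration_cost / monthly_saving


hybrid = {
    "postmark_outbound": 10_000 / 1_000 * 1.25,
    "cloudwatch_alarms": 12 * 0.10,
    "cloudwatch_logs": 0.27,
}
full_ses = {
    "ses_outbound": 10_000 / 1_000 * 0.10,
    "ses_inbound": 0.01,
    "shared_aws": 1.47,  # alarms + logs, same as the hybrid rows
}

saving = round(monthly_cost(hybrid) - monthly_cost(full_ses), 2)
for days in (3, 4):
    months = break_even_months(days * 500, saving)
    print(f"{days} engineering-days: break-even after {months:.0f} months")
```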


Section 12 — ADRs


Section 13 — Card Breakdown for PM

V1 Hybrid Card Slate

| # | Title | Status | Size | Agent / Owner | Phase | Blocking |
|---|---|---|---|---|---|---|
| SC-E1 (#1664) | SES domain verification + DKIM + sandbox-out | CLOSE | | | Eliminated | |
| SC-E2 (#1665) | Terraform: email-delivery stack (SNS/SQS/DLQ/alarms/IAM) | SAME (minor IAM delta) | M | sre-agent | Phase 1 | SC-E3, SC-E4 |
| SC-E3 (#1666) | Lambda: inbound email bridge | MODIFIED (Postmark JSON input) | M | feature-developer | Phase 1 | v1 launch |
| SC-E4 (#1667) | Lambda: outbound email sender | MODIFIED (Postmark API call) | M | feature-developer | Phase 2 | |
| SC-E5 (#1668) | Migrate Raptor postmark_client.py → sns_publisher.py | MODIFIED (scope reduced) | S | feature-developer | Phase 2 | SC-E4 |
| SC-E6 (#1669) | FreeScout SMTP cutover from Postmark to SES | CLOSE | | | Eliminated | |
| SC-E7 (#1670) | Synthetic probe + CloudWatch alarm | SAME | M | sre-agent | Phase 1 | |
| SC-E8 (#1671) | DLQ redrive runbook | SAME | S | sre-agent | Phase 1 | SC-E2 |
| SC-E9 (#1672) | Postmark retirement | CLOSE | | | Eliminated | |
| SC-E10 (new) | API Gateway endpoint: Postmark webhook → SNS publish | NEW | S | sre-agent | Phase 1 | SC-E3 |

Estimated total engineering (v1 hybrid): ~8.5 days across feature-developer + sre-agent. SC-E2 (Terraform) is the critical path. SC-E2 → SC-E10 → SC-E3 must land before 2026-05-23 UTC.


Migrations

No Raptor database schema migrations are introduced by this design. The dedup table moves from Raptor SQLite (postmark_inbound_dedup) to a standalone DynamoDB table in the email-delivery stack. This DynamoDB table is provisioned by the email-delivery Terraform stack (SC-E2) and populated by the Lambda (SC-E3).

The postmark_inbound_dedup SQLite table in Raptor remains for a 30-day observation window after Phase 1 cutover, then is dropped in a schema migration in a follow-up cleanup card.


GDPR and Security Checklist


Open Questions

These must be resolved before sub-cards SC-E3 and SC-E10 can be claimed:

  1. FreeScout mailbox IDs for support@, ops@, account-management@: operator must confirm the numeric mailbox IDs from FreeScout admin so the SSM routing map can be populated.
  2. DynamoDB provisioning mode: on-demand (pay-per-request) vs provisioned (fixed capacity). Recommendation is on-demand at v1 volume, but operator should confirm cost tolerance.
  3. Probe sender address: confirm which external email address (e.g., kris@moosequest.net) will be the synthetic probe sender. It must be a real inbox the operator controls to validate end-to-end delivery — and it must be outside AWS.
  4. Postmark inbound webhook token: confirm the token value to populate SSM /raxx/email/postmark_inbound_webhook_token. This is the static token Postmark sends in the webhook request header for validation.
  5. API Gateway auth method: API Gateway → SNS direct integration (no Lambda hop) requires careful IAM role configuration. Confirm whether sre-agent should use a direct integration (mapping templates) or a thin Lambda shim for the signature validation step. Direct integration is more efficient; Lambda shim is more debuggable.

Note: Open questions 2 (SES receipt rule set region) and 5 (S3 attachment handling) from the original list are eliminated by the hybrid architecture — Postmark handles MIME parsing and attachment metadata inline in the webhook payload; no S3 involvement for inbound.