Status: Accepted
Date: 2026-05-11 UTC
Operator-locked: yes — architecture spec locked by operator 2026-05-11 UTC
Refs: #1657, docs/architecture/adr/0072-durable-email-sns-sqs-ses.md
Updated 2026-05-11 UTC — See docs/architecture/adr/0074-email-delivery-hybrid-postmark-v1.md.
What builds in v1: The SNS/SQS/DLQ/Lambda durability topology below ships exactly as designed. The delivery leaf node is Postmark (not SES) for v1.
Topology delta from full-SES design:
| Path | Full SES (ADR-0072 target) | V1 Hybrid (ADR-0074 accepted) |
|---|---|---|
| Inbound entry | SES receipt rule → S3 → SNS | Postmark inbound webhook → API Gateway → SNS |
| Inbound Lambda input | Raw MIME from S3 | Structured Postmark JSON |
| Outbound Lambda leaf | ses:SendEmail | Postmark API POST /email |
| FreeScout SMTP | SES SMTP (Phase 3) | Postmark SMTP (unchanged) |
Card delta:
- SC-E1 (#1664): CLOSED — Postmark already verified + sandbox-out complete
- SC-E2 (#1665): SAME (minor IAM adjustment: no ses:SendEmail, add Postmark token SSM path)
- SC-E3 (#1666): MODIFIED — Postmark JSON input instead of raw MIME
- SC-E4 (#1667): MODIFIED — Postmark API call instead of ses:SendEmail
- SC-E5 (#1668): MODIFIED (scope reduced) — Raptor publishes to SNS; Lambda calls Postmark
- SC-E6 (#1669): CLOSED — FreeScout stays on Postmark SMTP
- SC-E7 (#1670): SAME
- SC-E8 (#1671): SAME
- SC-E9 (#1672): CLOSED — Postmark stays in v1
- SC-E10: NEW — API Gateway endpoint for Postmark webhook → SNS publish
Critical path impact: SC-E1 deadline (2026-05-16 UTC) is eliminated. No SES sandbox-out needed before launch. New critical path: SC-E2 → SC-E10 + SC-E3 before 2026-05-23 UTC.
The SES migration path remains open post-v1. Switching from hybrid to full SES requires only a Lambda consumer swap — all upstream topology is delivery-agent-agnostic.
Raxx's current email path runs through Postmark: outbound transactional mail fires directly from Raptor via postmark_client.py; inbound support mail arrives via Postmark's inbound webhook, hits POST /webhooks/postmark/inbound in Raptor, and is forwarded to FreeScout's API. This is a synchronous, single-hop architecture with no durability layer and no replay capability.
The current path has four structural weaknesses:
1. If Postmark is degraded or rejects the call, the mail is gone: no retry, no DLQ.
2. If FreeScout's API is down when the Postmark webhook fires, the ticket is lost.
3. Raptor's inbound dedup table (SQLite, 1 000-row cap) is load-bearing for idempotency.
4. There is no mechanism to detect that the entire email path has silently failed end-to-end.
Operator direction (2026-05-11 UTC): design durable email delivery using SNS FIFO + SQS FIFO + DLQs at both layers + Lambda + SES. Start with failure modes, not features.
| # | Invariant | Source |
|---|---|---|
| I1 | No stored credentials in application code or repos | 0002-no-stored-credentials.md |
| I2 | All secrets live in AWS SSM Parameter Store for workloads | feedback_aws_workloads_use_ssm_not_vault.md |
| I3 | Email is the single contact channel, only after verification | Platform invariant |
| I4 | GDPR by default: retention limits, DSR support, DPA-ready logging | 0003-gdpr-by-default.md |
| I5 | Audit trail for every state change affecting money, permissions, or data access | Platform invariant |
| I6 | Vendor branding never surfaces in customer-facing copy | feedback_no_backend_branding.md |
| I7 | Secrets never inline in repo files even in private repos | feedback_no_inline_secrets_in_repo.md |
This section precedes all architecture. Every row names a failure, its customer impact, how we detect it (with specific metric and threshold), and the recovery path.
Design principle (operator verbatim): "We shouldn't discover that it has 'also' failed." If a failure mode has no detection, that is the bug.
Note on v1 hybrid changes: Rows F2, F8, F11, F18, and F19 are updated below for the hybrid architecture. Original SES-specific text is preserved with a [Full-SES target] note for reference.
| # | Failure | Customer impact | Detection (metric + alarm threshold) | Recovery path | Owner |
|---|---|---|---|---|---|
| F1 | Inbound mail path broken (MX/webhook misconfigured) | All inbound mail silently discarded | Synthetic probe: probe email sent every 5 min; alert if FreeScout ticket absent within 5 min of send | Fix webhook config + replay from DLQ | Operator |
| F2 | [V1 hybrid] Postmark API unavailable (outbound) — [Full-SES target] SES sandbox not lifted | Outbound email delayed; messages queue in SQS | Lambda errors with ConnectionError or 5xx; SQS ApproximateNumberOfMessagesVisible grows; Postmark status page webhook → ops@ | SQS retries until Postmark recovers; messages durable in SQS | Automated + Operator |
| F3 | SNS publish fails (Raptor SDK call rejected — auth, throttle, or network) | Caller gets 5xx; mail never enters the queue | Sentry exception on sns.publish() calls; CloudWatch metric SNS:NumberOfNotificationsFailed > 0 for 2 consecutive periods | Retry with exponential backoff (caller-side); if persistent, raise P1 | Automated + Operator |
| F4 | SNS → SQS subscription broken (subscription deleted, policy revoked) | Mail lands in SNS but never in SQS; SNS DLQ fills | CloudWatch: SNS DLQ ApproximateNumberOfMessagesVisible > 0 within 15 min | Re-subscribe SQS to SNS topic; SNS DLQ messages replay via CLI | Operator |
| F5 | SQS DLQ depth > 0 (poison message or code crash) | Some messages permanently stuck; customer never gets email or ticket | CloudWatch alarm: SQS DLQ ApproximateNumberOfMessagesVisible > 0 (zero-tolerance) | Examine DLQ messages per Replay SOP (Section 9); fix code; redrive | Operator |
| F6 | Lambda fails to invoke — concurrency exhausted or runtime error | SQS backlog grows; mail delayed | CloudWatch: Lambda Errors > 5 per 5 min, SQS ApproximateNumberOfMessagesVisible > 20 for 10 min | Increase reserved concurrency; fix code; SQS drains once Lambda healthy | Automated alarm + Operator |
| F7 | FreeScout API down (inbound Lambda cannot POST /api/conversations) | Mail buffered in SQS; no ticket created | SQS ApproximateNumberOfMessagesVisible baseline comparison; FreeScout synthetic health probe GET /api/users 200 expected every 2 min | FreeScout restart; SQS drains automatically once FS recovers | Operator (FreeScout on Lightsail) |
| F8 | [V1 hybrid] Postmark API rate limit hit (429) — [Full-SES target] SES throttles outbound | Outbound email delayed, eventually DLQ | Lambda errors with 429 response; SQS ApproximateNumberOfMessagesVisible grows | Exponential backoff in Lambda (already implemented); SQS retries automatically | Automated + Operator |
| F9 | SQS visibility timeout < Lambda execution time (message redelivered while in-flight) | Duplicate ticket or duplicate outbound email | Lambda idempotency check on Message-ID header / correlation_id; alarm if duplicate conversation created in FreeScout | Fix visibility timeout to 20 min (Section 6); add dedup guard | Feature-developer |
| F10 | Poison message — malformed UTF-8, payload > 256 KB limit, schema violation | One message stuck in DLQ; pipeline for all others continues | SQS DLQ alarm (F5 covers this); Lambda structured log "event_type": "poison_message" | DLQ Replay SOP (Section 9); repair or discard; root-cause fix | Operator |
| F11 | [V1 hybrid] Postmark webhook HMAC signature invalid, missing, or replayed (Date header > 60 s old) — [Full-SES target] SES signature verification failure | Legitimate inbound mail rejected at boundary OR attacker spoofs/replays events | CloudWatch alarm raxx-email-postmark-signature-invalid fires on Raxx/Email / postmark_signature_invalid >= 1 in 5 min; emitted by Lambda Authorizer on every Deny (#1731) | Investigate authorizer logs (event_type field); if sustained, rotate Postmark inbound webhook token in Postmark admin + update SSM /raxx/email/postmark_inbound_webhook_token | Automated alarm + Operator |
| F12 | FIFO ordering violated — caller uses wrong MessageGroupId | Out-of-order processing within a group (e.g., reply processed before original) | Audit log: sequence gap detection on correlation_id ordering | Code fix on publisher side; FIFO group key conventions in ADR | Feature-developer |
| F13 | Malicious attachment in inbound email | Support staff opens malware; compromised host | Attachment scan Lambda (S3 upload + ClamAV or GuardDuty Malware Protection) intercepts before FreeScout delivery | Quarantine to S3 bucket; alert ops@; do not forward to FreeScout | Automated |
| F14 | Reply-to spam trap triggers domain blacklist | Raxx domain reputation degrades; all future mail blocked | DMARC aggregate report weekly digest; Postmark bounce/spam rates in dashboard | Suppression list update; alert ops@; investigate source address | Operator |
| F15 | AWS us-east-1 region outage | Inbound stall; outbound queues but Postmark delivery unaffected if SQS recovers | AWS Service Health Dashboard webhook → ops@ alert within 15 min of incident declaration | Accept degradation for inbound; Postmark handles outbound delivery independently of AWS; no cross-region failover at v1 | Operator |
| F16 | DLQ alarm fires but operator has no SOP | Extended mean-time-to-recovery | DLQ alarm links to runbook URL in AlarmDescription field; wiki link surfaced in alert body | Runbook authored up-front (Section 9) and linked from alarm description | Operator (pre-ship) |
| F17 | Lambda cold-start latency causes visibility timeout breach during burst | Redeliveries during spike; potential duplicates | Lambda InitDuration P99 > 10 s for 3 consecutive periods; SQS NumberOfMessagesMoved | Provision concurrency for inbound Lambda; idempotency guard handles duplicates | Feature-developer |
| F18 | [V1 hybrid] Postmark inbound webhook URL wrong or API Gateway misconfigured — [Full-SES target] SES inbound rule misconfigured | Inbound mail not captured; never reaches SNS topic | Synthetic probe (F1 covers detection) | Verify Postmark inbound server webhook URL points to API Gateway endpoint; re-test probe | Operator |
| F19 | [V1 hybrid] Lambda IAM missing ssm:GetParameter for Postmark token — [Full-SES target] Lambda IAM role missing ses:SendEmail permission | Outbound Lambda crashes on permission error | Lambda CloudWatch Logs: AccessDeniedException; Lambda Errors alarm (F6 covers) | IAM policy update; redeploy Lambda (no code change needed if Terraform-managed) | Operator / sre-agent |
| F20 | SQS message expiry — message deleted at MaximumMessageRetention before processed | Silent drop if Lambda backlogged for > 4 days | SQS NumberOfMessagesMoved from main queue (unusual drop); ApproximateAgeOfOldestMessage > 86400 (1 day) alarm | Investigate why Lambda is not draining; increase retention only after root-cause fix | Operator |
Note on F15 (AWS region SPOF): Cross-region failover for SNS/SQS is out of scope for v1. In the hybrid architecture, Postmark handles outbound delivery independently of AWS — an AWS region outage stalls inbound queuing but does not prevent messages already in transit via Postmark from delivering. This is an improvement over full SES. Multi-region email durability is a post-v1 improvement path.
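Row F10 names three poison-message causes (malformed UTF-8, payload over the 256 KB limit, schema violation) and a structured log marker. The sketch below shows one way a Lambda could classify a raw body before processing it. The function names, the reason strings, and the two required fields are assumptions made for illustration; the real schema is defined in the sub-card, not here.

```python
import json
from typing import Optional

MAX_PAYLOAD_BYTES = 256 * 1024          # size limit cited in F10
REQUIRED_FIELDS = ("MessageID", "To")   # assumed minimal schema, illustration only


def classify_payload(raw: bytes) -> str:
    """Return "ok" or a poison reason for a raw inbound event body."""
    if len(raw) > MAX_PAYLOAD_BYTES:
        return "payload_too_large"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "malformed_utf8"
    try:
        event = json.loads(text)
    except json.JSONDecodeError:
        return "schema_violation"
    if not isinstance(event, dict):
        return "schema_violation"
    if any(field not in event for field in REQUIRED_FIELDS):
        return "schema_violation"
    return "ok"


def poison_log_line(reason: str, message_id_hash: Optional[str]) -> str:
    """Build the structured log record that F10 uses for detection."""
    return json.dumps(
        {
            "event_type": "poison_message",
            "reason": reason,                      # hypothetical field name
            "message_id_hash": message_id_hash,
        }
    )
```

A handler could call `classify_payload` first and, on any result other than `"ok"`, emit `poison_log_line(...)` and raise so that SQS eventually moves the message to the DLQ.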
| Resource | Name | IAM role | Terraform path | SSM secrets | Cost @ v1 |
|---|---|---|---|---|---|
| SNS FIFO topic (inbound) | raxx-email-inbound.fifo | raxx-email-sns-publisher (least: sns:Publish) | terraform/email-delivery/sns.tf | none | < $0.01/mo (< 1 M publishes) |
| SNS FIFO topic (outbound) | raxx-email-outbound.fifo | raxx-email-sns-publisher | terraform/email-delivery/sns.tf | none | < $0.01/mo |
| SNS DLQ (inbound) | raxx-email-inbound-sns-dlq.fifo | n/a (CloudWatch alarm only) | terraform/email-delivery/dlq.tf | none | < $0.01/mo |
| SNS DLQ (outbound) | raxx-email-outbound-sns-dlq.fifo | n/a | terraform/email-delivery/dlq.tf | none | < $0.01/mo |
| Resource | Name | IAM role | Terraform path | SSM secrets | Cost @ v1 |
|---|---|---|---|---|---|
| SQS FIFO (inbound → FreeScout bridge) | raxx-email-inbound-bridge.fifo | raxx-email-lambda-inbound (least: sqs:ReceiveMessage, sqs:DeleteMessage, sqs:GetQueueAttributes) | terraform/email-delivery/sqs.tf | none | < $0.01/mo |
| SQS DLQ (inbound bridge) | raxx-email-inbound-bridge-dlq.fifo | n/a | terraform/email-delivery/dlq.tf | none | < $0.01/mo |
| SQS FIFO (outbound → Postmark sender) | raxx-email-outbound-send.fifo | raxx-email-lambda-outbound | terraform/email-delivery/sqs.tf | none | < $0.01/mo |
| SQS DLQ (outbound) | raxx-email-outbound-send-dlq.fifo | n/a | terraform/email-delivery/dlq.tf | none | < $0.01/mo |
V1 hybrid (ADR-0074): inbound Lambda receives Postmark JSON (not SES raw MIME); outbound Lambda calls Postmark API (not ses:SendEmail).
| Resource | Name | IAM role | Runtime | SSM secrets (path) | Terraform path | Cost @ v1 |
|---|---|---|---|---|---|---|
| Inbound bridge | raxx-email-inbound-bridge | raxx-email-lambda-inbound (sqs:Receive, sqs:Delete, ssm:GetParameter on /raxx/email/*) | Python 3.12 | /raxx/email/freescout_api_key, /raxx/email/freescout_api_url, /raxx/email/mailbox_routing_map | terraform/email-delivery/lambda.tf | < $0.01/mo |
| Outbound sender | raxx-email-outbound-sender | raxx-email-lambda-outbound (sqs:Receive, sqs:Delete, ssm:GetParameter on /raxx/email/*) | Python 3.12 | /raxx/email/postmark_server_token, /raxx/email/dedup_table_name | terraform/email-delivery/lambda.tf | < $0.01/mo |
Full-SES target (post-v1): outbound Lambda role adds ses:SendEmail, ses:SendRawEmail; removes Postmark token SSM path. Inbound Lambda role unchanged.
| Resource | Purpose | Terraform path |
|---|---|---|
| HTTP API endpoint POST /webhooks/postmark/inbound | Postmark inbound webhook entry point → SNS publish | terraform/email-delivery/api-gateway.tf |
| SSM parameter /raxx/email/postmark_inbound_webhook_token | Webhook token for request validation | terraform/email-delivery/ssm.tf |
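The request validation at this endpoint (and the replay window in F11) could be sketched as below. This is an assumption-heavy illustration, not the authorizer from #1731: the header name `x-webhook-token` is hypothetical, the token is compared as a static shared secret, and the 60-second freshness check reads the standard HTTP `Date` header.

```python
import hmac
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_AGE_SECONDS = 60                 # replay window stated in F11
TOKEN_HEADER = "x-webhook-token"     # hypothetical header name


def authorize(headers: dict, expected_token: str, now: datetime) -> str:
    """Return "Allow" or "Deny" for a single webhook request."""
    lowered = {key.lower(): value for key, value in headers.items()}

    presented = lowered.get(TOKEN_HEADER, "")
    # compare_digest runs in constant time, so timing does not leak the token.
    if not presented or not hmac.compare_digest(
        presented.encode(), expected_token.encode()
    ):
        return "Deny"

    try:
        sent_at = parsedate_to_datetime(lowered.get("date", ""))
    except (TypeError, ValueError):
        return "Deny"  # missing or unparseable Date header
    if sent_at.tzinfo is None:
        sent_at = sent_at.replace(tzinfo=timezone.utc)

    age = (now - sent_at).total_seconds()
    return "Allow" if abs(age) <= MAX_AGE_SECONDS else "Deny"
```

Passing `now` in explicitly keeps the freshness check testable; a real authorizer would supply `datetime.now(timezone.utc)` and emit the `postmark_signature_invalid` metric on every `"Deny"`.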
All alarms created in terraform/email-delivery/alarms.tf. SNS notification target: ops@raxx.app (pre-launch: digest mode per feedback_pre_launch_digest_notifications.md; post-launch: per-event).
raxx-email-sns-publisher: sns:Publish on raxx-email-inbound.fifo and raxx-email-outbound.fifo only.
raxx-email-lambda-inbound: sqs:ReceiveMessage, sqs:DeleteMessage, sqs:GetQueueAttributes, sqs:ChangeMessageVisibility on inbound queue; ssm:GetParameter on /raxx/email/*; logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents.
raxx-email-lambda-outbound (v1 hybrid): same SQS permissions on outbound queue; ssm:GetParameter on /raxx/email/*; logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents. No ses:* permissions in v1.
No role has * actions. No role has cross-account permissions at v1.
Decision: two separate SNS topics (one inbound, one outbound). See ADR-0072 for full rationale. Summary:
sns:Publish on outbound only; API Gateway integration publishes to inbound topic only.

One SQS queue per consumer. At v1, two consumers: FreeScout bridge (inbound) and Postmark sender (outbound). A third consumer for attachment AV scanning can subscribe to the inbound topic independently when that card lands. This pattern allows adding consumers without modifying existing queues.
    Postmark inbound webhook
      └─► API Gateway POST /webhooks/postmark/inbound
            └─► (signature validation)
                  └─► SNS FIFO: raxx-email-inbound.fifo
                        ├─► SNS DLQ: raxx-email-inbound-sns-dlq.fifo
                        └─► SQS FIFO: raxx-email-inbound-bridge.fifo
                              ├─► SQS DLQ: raxx-email-inbound-bridge-dlq.fifo
                              └─► Lambda: raxx-email-inbound-bridge
                                    └─► FreeScout API POST /api/conversations

    Raptor / Console / FreeScout (publisher)
      └─► SNS FIFO: raxx-email-outbound.fifo
            ├─► SNS DLQ: raxx-email-outbound-sns-dlq.fifo
            └─► SQS FIFO: raxx-email-outbound-send.fifo
                  ├─► SQS DLQ: raxx-email-outbound-send-dlq.fifo
                  └─► Lambda: raxx-email-outbound-sender
                        └─► Postmark API POST /email
Full-SES target topology (post-v1): Replace "Postmark inbound webhook → API Gateway" with "SES receipt rule". Replace "Postmark API POST /email" with "ses:SendEmail". All SNS/SQS/DLQ/Lambda layers are unchanged.
Day 1 mailboxes: support@raxx.app, ops@raxx.app, account-management@raxx.app.
The inbound bridge Lambda dispatches by destination To address to the correct FreeScout mailbox ID.
Routing map storage: SSM Parameter Store at /raxx/email/mailbox_routing_map as a JSON string:
{
"support@raxx.app": 101,
"ops@raxx.app": 102,
"account-management@raxx.app": 103
}
Reasons for SSM over Lambda env vars:
- Rotatable without Lambda redeploy (update SSM value; Lambda reads at invocation time or via cached refresh on cold start).
- Consistent with feedback_aws_workloads_use_ssm_not_vault.md.
- Auditable: SSM parameter updates generate CloudTrail events.
Lambda routing logic (pseudocode — implementation in sub-card):
to_address = normalize(postmark_event["To"])
routing_map = get_cached_ssm_param("/raxx/email/mailbox_routing_map")
mailbox_id = routing_map.get(to_address) or routing_map.get("support@raxx.app")
freescout_client.create_conversation(mailbox_id=mailbox_id, ...)
If no match, fall back to support@raxx.app (mailbox 101) and log routing_fallback event for audit.
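A runnable version of the routing pseudocode might look like the sketch below. It assumes the `To` field can carry a display name (so it is parsed with `email.utils.parseaddr`) and that the routing map has already been loaded from SSM and decoded; `resolve_mailbox` is a hypothetical helper name. It uses an explicit membership test rather than `or`, so a mailbox ID of 0 would not be mistaken for a miss.

```python
import json
from email.utils import parseaddr
from typing import Tuple

FALLBACK_ADDRESS = "support@raxx.app"


def normalize(to_header: str) -> str:
    """Extract the bare address from a To value and lowercase it."""
    _display_name, address = parseaddr(to_header)
    return address.strip().lower()


def resolve_mailbox(to_header: str, routing_map: dict) -> Tuple[int, bool]:
    """Return (mailbox_id, used_fallback) for one inbound message."""
    to_address = normalize(to_header)
    if to_address in routing_map:
        return routing_map[to_address], False
    # No match: route to the support mailbox; the caller logs routing_fallback.
    return routing_map[FALLBACK_ADDRESS], True


# The JSON string as it would be stored at /raxx/email/mailbox_routing_map.
routing_map = json.loads(
    '{"support@raxx.app": 101, "ops@raxx.app": 102, '
    '"account-management@raxx.app": 103}'
)

mailbox_id, used_fallback = resolve_mailbox("Raxx Ops <OPS@raxx.app>", routing_map)
```

The second tuple element lets the handler emit the `routing_fallback` audit event without re-checking the map.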
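FIFO SQS with content-based deduplication suppresses duplicate sends within the 5-minute deduplication window; it does not guarantee at-most-once processing. Outside that window, or when a visibility timeout expires while a message is still in flight (F9), redelivery is possible, so each Lambda applies its own idempotency check.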
Inbound bridge Lambda:
- Dedup key: MessageID field from the Postmark inbound webhook payload (equivalent to RFC 2822 Message-ID).
- Check: query a DynamoDB table raxx-email-inbound-dedup (partition key: message_id) before creating a FreeScout conversation.
- On hit: log "event_type": "idempotency_skip" and delete from SQS (return success to avoid re-delivery).
- On miss: create conversation, then write message_id to dedup table with TTL = 90 days.
- Why DynamoDB (not Postgres / Raptor DB): Lambda is a separate compute boundary; avoiding a Raptor DB dependency prevents tight coupling and DB connection exhaustion under Lambda burst.
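The inbound check-then-write flow above can be sketched with an in-memory stand-in for the DynamoDB table. Everything named here (`InMemoryDedupStore`, `handle_inbound`, the injected `create_conversation` callable) is hypothetical; a real implementation would back the store with the `raxx-email-inbound-dedup` table and its TTL attribute.

```python
import time

DEDUP_TTL_SECONDS = 90 * 24 * 60 * 60  # 90 days, per the list above


class InMemoryDedupStore:
    """Stand-in for the dedup table: maps message_id to an expiry epoch."""

    def __init__(self):
        self._items = {}

    def seen(self, message_id: str) -> bool:
        return message_id in self._items

    def record(self, message_id: str, now: float) -> None:
        self._items[message_id] = int(now) + DEDUP_TTL_SECONDS


def handle_inbound(event: dict, store, create_conversation, now=None) -> str:
    """Process one Postmark inbound event; return the outcome for logging."""
    now = time.time() if now is None else now
    message_id = event["MessageID"]

    if store.seen(message_id):
        # Hit: return normally so the message is deleted from SQS.
        return "idempotency_skip"

    # Miss: create the conversation first, then record the key, as specified.
    create_conversation(event)
    store.record(message_id, now)
    return "created"
```

Because the write happens after the FreeScout call, a crash between the two steps leads to a retry that creates a second conversation. That is the trade-off of the ordering described above; a conditional write before the call would invert it (possible lost ticket instead of a duplicate).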
Outbound sender Lambda:
- Dedup key: correlation_id provided by the caller in the SNS message attribute.
- Check: same DynamoDB table with separate prefix outbound#<correlation_id>.
- Callers (Raptor, Console) generate correlation_id as a UUID4 on first publish. Retry of the same logical send reuses the same correlation_id.
- SQS FIFO MessageDeduplicationId = correlation_id (5-min window hardening).
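On the publisher side, the outbound rules above translate into a small set of publish parameters. The sketch below only builds the keyword arguments that would be passed to an SNS `publish` call, so it runs without AWS access. The topic ARN account ID, the group ID value, and the payload shape are placeholders chosen for illustration.

```python
import json
import uuid


def new_correlation_id() -> str:
    """Generate the correlation_id once, on the first publish attempt."""
    return str(uuid.uuid4())


def build_publish_kwargs(topic_arn: str, group_id: str,
                         correlation_id: str, payload: dict) -> dict:
    """Build keyword arguments for an SNS FIFO publish call."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(payload, sort_keys=True),
        "MessageGroupId": group_id,
        # Reusing correlation_id here gives the 5-minute FIFO dedup hardening.
        "MessageDeduplicationId": correlation_id,
        "MessageAttributes": {
            "correlation_id": {
                "DataType": "String",
                "StringValue": correlation_id,
            }
        },
    }


def dedup_key(correlation_id: str) -> str:
    """Key used by the outbound Lambda in the shared dedup table."""
    return f"outbound#{correlation_id}"


correlation_id = new_correlation_id()
kwargs = build_publish_kwargs(
    "arn:aws:sns:us-east-1:000000000000:raxx-email-outbound.fifo",  # placeholder
    "transactional",                                                # hypothetical
    correlation_id,
    {"to": "user@example.com", "template": "welcome"},
)
```

A retry of the same logical send calls `build_publish_kwargs` again with the stored `correlation_id`, which yields identical arguments and therefore the same deduplication ID.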
Lambda maximum execution timeout: 15 minutes (900 s).
Recommended SQS visibility timeout: 20 minutes (1200 s).
Math: Lambda max execution (900 s) + safety margin (300 s) = 1200 s = 20 min.
The safety margin accounts for Lambda cold-start latency (typically < 5 s, and removed entirely by provisioned concurrency), Postmark/FreeScout API call latency, and DynamoDB dedup read/write.
Implications:
- If a Lambda invocation crashes hard (OOM, unhandled exception), the message becomes visible again after 20 minutes.
- With maxReceiveCount = 3 on the SQS queue, a poison message is attempted 3 times (60 min total delay) before moving to the DLQ.
- 60-minute delay is acceptable for v1 support volume; reduces noise from transient FreeScout/Postmark hiccups.
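The timeout arithmetic can be checked with a few lines. The constants are the figures stated above; the function names are illustrative only, and the DLQ delay is the worst case in which every attempt holds the message for the full visibility timeout.

```python
LAMBDA_MAX_TIMEOUT_S = 900   # Lambda maximum execution timeout
SAFETY_MARGIN_S = 300        # margin chosen in this section
MAX_RECEIVE_COUNT = 3        # maxReceiveCount on the SQS queue


def visibility_timeout_s(lambda_timeout_s: int, margin_s: int) -> int:
    """Visibility timeout = Lambda timeout + safety margin."""
    return lambda_timeout_s + margin_s


def worst_case_dlq_delay_min(timeout_s: int, max_receive_count: int) -> int:
    """Minutes until a hard-crashing message reaches the DLQ."""
    return timeout_s * max_receive_count // 60


timeout = visibility_timeout_s(LAMBDA_MAX_TIMEOUT_S, SAFETY_MARGIN_S)
delay = worst_case_dlq_delay_min(timeout, MAX_RECEIVE_COUNT)
print(f"visibility timeout: {timeout} s, worst-case DLQ delay: {delay} min")
```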
Outbound sender Lambda:
- Postmark API published rate limit: 1 000 messages/second, significantly higher headroom than the SES new-account 14/s.
- At v1 outbound volume (10 000/mo ≈ 333/day ≈ 14/hour), burst is negligible.
- Reserved concurrency: 5. Conservative guard against code bugs or requeue storms.
- If Postmark rate limits surface, reduce to 1 (single-threaded processing).

Inbound bridge Lambda:
- FreeScout API rate limit: unknown. Conservative assumption: 10 req/s.
- Inbound v1 volume: ~100/mo. Spikes unlikely.
- Reserved concurrency: 3. Prevents runaway invocations from hitting FS API.
- If FreeScout API rate limits surface in production, reduce to 1 (single-threaded processing).
Note: Reserved concurrency also acts as a kill-switch. Setting reserved concurrency to 0 halts all processing (messages remain durably in SQS). This satisfies the platform invariant for kill-switches on critical execution paths.
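As an illustration of the kill-switch, the helper below sets reserved concurrency through an injected Lambda client. `set_kill_switch` is a hypothetical name; the client is assumed to expose `put_function_concurrency`, as the boto3 Lambda client does, and is passed in so the sketch can be exercised without AWS credentials.

```python
def set_kill_switch(lambda_client, function_name: str,
                    engaged: bool, normal_concurrency: int) -> int:
    """Engage (0) or release (normal value) the reserved-concurrency switch.

    Returns the concurrency limit that was applied.
    """
    limit = 0 if engaged else normal_concurrency
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
    return limit


# Example wiring (assumes boto3 is installed and credentials are configured):
#   import boto3
#   client = boto3.client("lambda", region_name="us-east-1")
#   set_kill_switch(client, "raxx-email-outbound-sender", True, 5)   # halt
#   set_kill_switch(client, "raxx-email-outbound-sender", False, 5)  # resume
```

While the switch is engaged, messages stay in SQS, so the retention period (see F20) bounds how long it can safely remain on.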
Phase 0: None needed. Postmark is already production-ready, domain-verified, DKIM configured, sandbox-out complete. No AWS Support case required before v1 launch.
Phase 1 — Inbound durability before v1 launch (target: before 2026-05-23 UTC)
1. SC-E2: Terraform stack (SNS/SQS/DLQ/DynamoDB/alarms/IAM/API Gateway).
2. SC-E10: API Gateway endpoint POST /webhooks/postmark/inbound → SNS publish.
3. SC-E3 (modified): inbound Lambda (Postmark JSON → FreeScout API).
4. SC-E8: DLQ runbook.
5. SC-E7: Synthetic probe Lambda + alarm.
6. Cutover: update Postmark inbound server webhook URL to API Gateway endpoint. Feature-flag postmark_inbound_to_freescout OFF in Raptor.
Phase 2 — Outbound queue (post-v1 acceptable)
1. SC-E4 (modified): outbound Lambda (Postmark API call).
2. SC-E5 (modified): Raptor postmark_client.py → sns_publisher.py (feature-flagged).
3. 7-day soak period.
FreeScout continues using Postmark SMTP for outbound replies throughout v1. Unchanged.
Phase 3 (optional, post-v1): Full SES migration
If deliverability, cost, or vendor concerns justify it after launch:
- Verify raxx.app domain in SES: add DKIM CNAME records to Cloudflare DNS.
- Swap the outbound Lambda's delivery call to ses:SendEmail. No publisher code changes.

This is an optional, undated future card, not a v1 commitment. The Lambda consumer is the only component that changes; all upstream topology (SNS/SQS/DLQ/publishers) is delivery-agent-agnostic.
Phase 0 — SES Domain Verification + Sandbox-Out (Operator action, size:s, ~1 day)
- Verify raxx.app domain in SES: add DKIM CNAME records to DNS (Cloudflare, zone raxx.app).
- Add SPF record if not present.
- Open AWS Support case: "Production Access Request" for SES.
- Do NOT proceed to Phase 1 for outbound until sandbox-out is confirmed.
Phase 1 — Inbound Durability (size:l)
- Terraform: provision SNS inbound topic + SQS inbound queue + DLQs + alarms.
- SES: configure receipt rule set for raxx.app → S3 action → SNS publish action to raxx-email-inbound.fifo.
- Deploy raxx-email-inbound-bridge Lambda (SES raw MIME from S3 input).
- Cut DNS MX record from Postmark to SES inbound.
Phase 2 — Outbound Queue Validation (size:m)
- Terraform: provision SNS outbound topic + SQS outbound queue + DLQs.
- Deploy raxx-email-outbound-sender Lambda (SES SendEmail).
- Migrate postmark_client.py → sns_publisher.py in Raptor (feature-flagged).
- Monitor for a 7-day soak period.
Phase 3 — FreeScout SMTP Cutover (size:s)
- Change FreeScout SMTP settings from Postmark to SES SMTP endpoint.
- SES SMTP credentials stored in SSM at /raxx/email/ses_smtp_password.
Phase 4 — Postmark Retirement (size:s, ~30 days after Phase 3 soak)
- Cancel Postmark subscription.
- Archive Postmark delivery logs (export to S3 for GDPR retention period).
- Remove POSTMARK_SERVER_TOKEN from all Heroku apps.
- Remove postmark_client.py, postmark_inbound.py, postmark_collector.py in a cleanup PR.
When a CloudWatch alarm fires for SQS DLQ depth > 0, follow this procedure.
Alarm name format: raxx-email-<direction>-<type>-dlq-depth. Determine which DLQ.
# Receive up to 10 messages (does not delete them)
aws sqs receive-message \
--queue-url https://sqs.us-east-1.amazonaws.com/<account-id>/raxx-email-inbound-bridge-dlq.fifo \
--max-number-of-messages 10 \
--attribute-names All \
--message-attribute-names All \
--visibility-timeout 300 \
--region us-east-1
Examine the Body field (JSON-encoded SNS notification wrapping the original email event). Look for:
- ApproximateReceiveCount: if > 3, it hit max-receive-count.
- SentTimestamp: when the message first arrived.
# Find Lambda invocation errors around the message's SentTimestamp
aws logs filter-log-events \
--log-group-name /aws/lambda/raxx-email-inbound-bridge \
--start-time <epoch-ms-from-SentTimestamp> \
--filter-pattern '"ERROR"' \
--region us-east-1
Check for: FreeScout API 5xx, connection timeout, malformed payload, IAM permission error.
| Root cause | Disposition |
|---|---|
| FreeScout was transiently down (now recovered) | Redrive — message should process successfully |
| Code bug (now fixed and Lambda redeployed) | Redrive |
| Malformed message (no Message-ID, corrupt UTF-8) | Discard — log reason; root-cause at source |
| Duplicate of already-processed message | Discard — idempotency guard will skip it anyway, but discarding avoids noise |
| Suspected spoofed/malicious message | Quarantine to S3; do not redrive; open security incident |
AWS Console method:
1. SQS Console → queue → "Dead-letter queue" tab → "Start DLQ redrive".
2. Set destination to the source queue.
3. Confirm. AWS generates a CloudTrail event.
CLI method (for scripting):
aws sqs start-message-move-task \
--source-arn arn:aws:sqs:us-east-1:<account-id>:raxx-email-inbound-bridge-dlq.fifo \
--destination-arn arn:aws:sqs:us-east-1:<account-id>:raxx-email-inbound-bridge.fifo \
--region us-east-1
After redrive, verify:
1. DLQ depth returns to 0 (CloudWatch metric ApproximateNumberOfMessagesVisible).
2. Lambda successfully processes the message (check CloudWatch Logs for successful invocations).
3. FreeScout conversation created (or Postmark delivery confirmed).
4. Write an ops note in the FreeScout ops@ ticket if the failure affected a customer-visible email path. Note: timestamp (UTC), root cause, resolution, message count affected.
- Sender: an address outside AWS (the operator's personal email, kris@moosequest.net).
- Recipient: support@raxx.app.
- Subject: PROBE-<UUID4> (probe ID embedded in subject, not body, for easy FS query).
- Check: the probe Lambda (raxx-email-probe) queries FreeScout API GET /api/conversations?subject=PROBE-<UUID4> after 90 seconds.
- If the conversation is absent: ops@raxx.app alert.

Probe must originate outside AWS. If API Gateway, SNS, SQS, Lambda, and FreeScout API all fail together (e.g., F15, a region outage), a probe sent via AWS Lambda would also be affected and never trigger. Using the operator's personal email as sender ensures the signal originates from a completely independent path.
raxx-email-probe-role: ssm:GetParameter on /raxx/email/freescout_api_key only. No SNS, no SQS permissions — read-only health check.
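The check half of the probe could be sketched as follows. The FreeScout call is injected as a function so the logic runs without network access; the response shape (a list of dicts with a `subject` key) and the helper names are assumptions, not the FreeScout API contract.

```python
import uuid
from urllib.parse import urlencode


def new_probe_subject() -> str:
    """Subject line carrying the probe ID, e.g. PROBE-<UUID4>."""
    return f"PROBE-{uuid.uuid4()}"


def probe_query_url(api_base: str, subject: str) -> str:
    """URL the probe Lambda would request to look for the conversation."""
    query = urlencode({"subject": subject})
    return f"{api_base.rstrip('/')}/api/conversations?{query}"


def evaluate_probe(subject: str, fetch_conversations) -> dict:
    """Return the probe outcome; "alert" means no matching ticket was found.

    fetch_conversations(subject) is a hypothetical callable that returns
    a list of conversation dicts from FreeScout.
    """
    conversations = fetch_conversations(subject)
    found = any(c.get("subject") == subject for c in conversations)
    return {"probe_id": subject, "outcome": "delivered" if found else "alert"}


subject = new_probe_subject()
result = evaluate_probe(subject, lambda s: [{"subject": s}])
```

Keeping the HTTP call behind `fetch_conversations` matches the read-only role above: the probe needs only the FreeScout API key, never SNS or SQS access.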
Probe alarm fires → ops@ alert with probe ID → operator checks:
1. Did the email arrive at Postmark inbound? (Postmark dashboard: inbound activity log)
2. Did API Gateway receive the webhook? (API Gateway access logs)
3. Did SNS publish? (SNS console: delivery status)
4. Did SQS receive? (SQS console: messages received metric)
5. Did Lambda invoke? (CloudWatch Logs)
6. Did FreeScout receive the API call? (FS admin → activity log)
This six-layer walkthrough isolates the failing component within 10 minutes.
At v1 volume: 10 000 outbound/mo + 100 inbound/mo.
| Service | Usage | Unit cost | Monthly cost |
|---|---|---|---|
| Postmark outbound | 10 000 emails | $1.25 / 1 000 | $12.50 |
| Postmark inbound | 100 emails | Included in plan | $0.00 |
| SNS | ~10 200 publishes | First 1 M free | $0.00 |
| SQS | ~10 200 requests | First 1 M free | $0.00 |
| Lambda | ~10 200 invocations, avg 2 s each | First 1 M invocations + 400 000 GB-s free | $0.00 |
| API Gateway | ~100 inbound webhook calls | First 1 M free (HTTP API) | $0.00 |
| DynamoDB (dedup table) | ~10 300 reads + writes | First 25 GB free (on-demand) | $0.00 |
| CloudWatch alarms | ~12 alarms | $0.10/alarm/mo | $1.20 |
| CloudWatch Logs | ~500 MB/mo Lambda logs | $0.50/GB ingest + $0.03/GB storage | ~$0.27 |
| Probe Lambda | 8 640 invocations/mo (every 5 min) | Within free tier | $0.00 |
| Total | | | ~$13.97/mo |
| Service | Usage | Unit cost | Monthly cost |
|---|---|---|---|
| SES outbound | 10 000 emails | $0.10 / 1 000 | $1.00 |
| SES inbound | 100 emails | $0.10 / 1 000 | $0.01 |
| SNS/SQS/Lambda/DynamoDB | same as above | same as above | ~$1.47 |
| Total | | | ~$2.48/mo |
Cost delta at v1: $13.97/mo (hybrid) vs $2.48/mo (full SES). Difference: ~$11.50/mo.
When does the SES migration pay for itself? The migration is approximately 3–4 engineering-days of work. At a conservative $500/day cost ($1 500–$2 000 total) and a saving of ~$11.50/mo, break-even at v1 volume is roughly 130–175 months, i.e. more than a decade. The migration is a cost decision at scale, not at v1. At 100k/mo outbound volume, the difference is ~$112/mo, meaningful enough to revisit.
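The break-even question is simple enough to compute directly from the figures in this section: the two monthly totals, the 3 to 4 engineering-day estimate, and the $500/day rate. The function name is illustrative.

```python
HYBRID_MONTHLY_USD = 13.97     # v1 hybrid total from the first cost table
FULL_SES_MONTHLY_USD = 2.48    # full-SES total from the second cost table
DAY_RATE_USD = 500             # conservative engineering day rate


def break_even_months(engineering_days: float, day_rate_usd: float,
                      monthly_saving_usd: float) -> float:
    """Months of savings needed to repay a one-off migration cost."""
    return engineering_days * day_rate_usd / monthly_saving_usd


monthly_saving = HYBRID_MONTHLY_USD - FULL_SES_MONTHLY_USD
low = break_even_months(3, DAY_RATE_USD, monthly_saving)
high = break_even_months(4, DAY_RATE_USD, monthly_saving)
print(f"saving ${monthly_saving:.2f}/mo; break-even {low:.0f} to {high:.0f} months")
```

With these inputs the result falls between roughly 130 and 175 months, which supports treating the migration as a decision for higher volumes rather than for v1.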
- docs/architecture/adr/0072-durable-email-sns-sqs-ses.md: durability topology (Accepted, target state)
- docs/architecture/adr/0074-email-delivery-hybrid-postmark-v1.md: v1 implementation scope (Accepted, amends ADR-0072 for v1)

| # | Title | Status | Size | Agent / Owner | Phase | Blocking |
|---|---|---|---|---|---|---|
| SC-E1 (#1664) | SES domain verification + DKIM + sandbox-out | CLOSE | — | — | — | Eliminated |
| SC-E2 (#1665) | Terraform: email-delivery stack (SNS/SQS/DLQ/alarms/IAM) | SAME (minor IAM delta) | M | sre-agent | Phase 1 | SC-E3, SC-E4 |
| SC-E3 (#1666) | Lambda: inbound email bridge | MODIFIED (Postmark JSON input) | M | feature-developer | Phase 1 | v1 launch |
| SC-E4 (#1667) | Lambda: outbound email sender | MODIFIED (Postmark API call) | M | feature-developer | Phase 2 | — |
| SC-E5 (#1668) | Migrate Raptor postmark_client.py → sns_publisher.py | MODIFIED (scope reduced) | S | feature-developer | Phase 2 | SC-E4 |
| SC-E6 (#1669) | FreeScout SMTP cutover from Postmark to SES | CLOSE | — | — | — | Eliminated |
| SC-E7 (#1670) | Synthetic probe + CloudWatch alarm | SAME | M | sre-agent | Phase 1 | — |
| SC-E8 (#1671) | DLQ redrive runbook | SAME | S | sre-agent | Phase 1 | SC-E2 |
| SC-E9 (#1672) | Postmark retirement | CLOSE | — | — | — | Eliminated |
| SC-E10 (new) | API Gateway endpoint: Postmark webhook → SNS publish | NEW | S | sre-agent | Phase 1 | SC-E3 |
Estimated total engineering (v1 hybrid): ~8.5 days across feature-developer + sre-agent. SC-E2 (Terraform) is the critical path. SC-E2 → SC-E10 → SC-E3 must land before 2026-05-23 UTC.
No Raptor database schema migrations are introduced by this design. The dedup table moves from Raptor SQLite (postmark_inbound_dedup) to a standalone DynamoDB table in the email-delivery stack. This DynamoDB table is provisioned by the email-delivery Terraform stack (SC-E2) and populated by the Lambda (SC-E3).
The postmark_inbound_dedup SQLite table in Raptor remains for a 30-day observation window after Phase 1 cutover, then is dropped in a schema migration in a follow-up cleanup card.
- Message-ID headers (no email addresses, no content).
- to_hash (SHA-256 truncated) not raw addresses.
- event_type, message_id_hash, mailbox_id, outcome, timestamp_utc. All DLQ redrives produce CloudTrail events. SNS publish calls produce CloudTrail events.
- aws_ssm_parameter data source + Lambda environment via Terraform, marked sensitive = true. FreeScout API key: same pattern.
- sqs:ReceiveMessage on the DLQ. IAM policy restricts DLQ access to raxx-ops-break-glass role. If unauthorized access detected: notify operator within 72 hours per GDPR Art. 33 requirement. Breach notification automation tracked in docs/architecture/adr/0003-gdpr-by-default.md.
- /raxx/email/*, SecureString type, KMS-encrypted. Rotatable without Lambda redeploy (Lambda reads SSM at invocation time with a 5-minute in-memory cache). Never in Lambda environment variables in Terraform state.

These must be resolved before sub-cards SC-E3 and SC-E10 can be claimed:
- support@, ops@, account-management@: operator must confirm the numeric mailbox IDs from FreeScout admin so the SSM routing map can be populated.
- The operator's personal email (kris@moosequest.net) will be the synthetic probe sender. It must be a real inbox the operator controls to validate end-to-end delivery, and it must be outside AWS.
- SSM parameter /raxx/email/postmark_inbound_webhook_token: this is the static token Postmark sends in the webhook request header for validation.

Note: Open questions 2 (SES receipt rule set region) and 5 (S3 attachment handling) from the original list are eliminated by the hybrid architecture. Postmark handles MIME parsing and attachment metadata inline in the webhook payload; no S3 involvement for inbound.