SES Durable Email Delivery with DLQ at Both Layers

Status: Accepted
Date: 2026-05-11 UTC
Operator-locked: yes — architecture spec locked by operator 2026-05-11 UTC
Refs: #1657, docs/architecture/durable-email-delivery.md

Context

Raxx's v1 email path (Raptor → Postmark direct HTTP call, or Postmark webhook → Raptor → FreeScout API) is synchronous and has no durability layer. A Postmark outage, a FreeScout downtime window, or a transient network error silently drops the email with no replay path and no detection.

The operator direction (2026-05-11 UTC) is to design durable email delivery starting from failure modes, not features. The operator-specified architecture: SNS FIFO for fan-out/pub-sub, SQS FIFO for persistent buffering, DLQs at both the SNS layer and the SQS layer, Lambda as the consumer, and SES as the delivery agent.

Decision

Adopt the following durable email delivery architecture as the Raxx email primitive, replacing the direct Postmark path:

SNS FIFO topics as the publish boundary. Publishers (Raptor, Console) call sns:Publish and are done. The topic is durable; if SQS is momentarily unavailable, SNS retries delivery for up to 23 days before moving the message to the SNS DLQ.
SNS DLQ at the subscription layer (RedrivePolicy on the SNS subscription to SQS). Captures messages that exhausted SNS delivery retries. This is distinct from the SQS DLQ and requires a separate alarm.
SQS FIFO queues (one per consumer) as the durable buffer between SNS and Lambda. Multi-AZ storage. Messages survive Lambda cold-starts, crashes, and FreeScout/SES downtime.
SQS DLQ per queue. Captures messages that have exceeded maxReceiveCount (set to 3). This catches poison messages and code errors without blocking the pipeline for healthy messages.
Lambda consumers: one for inbound (SES event → FreeScout API) and one for outbound (SNS event → SES SendEmail). Lambda deletes the SQS message only on successful processing; on exception, visibility timeout expires and SQS redelivers.
Two separate SNS topics: raxx-email-inbound.fifo and raxx-email-outbound.fifo. Failure isolation is the primary reason (see Consequences).
SES for inbound receipt rules (capture email at DNS level → S3 → SNS) and outbound sending (ses:SendEmail).
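
The consumer contract described above (delete only on success, redeliver on exception) can be sketched roughly as follows. This is a minimal illustration, not the project's actual Lambda code: `make_handler` and `deliver` are hypothetical names, and the sketch assumes the SQS message body is already the payload the delivery function expects (for example, raw message delivery enabled on the SNS subscription).

```python
from typing import Any, Callable, Dict


def make_handler(deliver: Callable[[str], None]) -> Callable[[Dict[str, Any], Any], None]:
    """Build a Lambda handler around a delivery function.

    `deliver` is a hypothetical stand-in for the real side effect
    (ses:SendEmail for outbound, a FreeScout API call for inbound).
    """

    def handler(event: Dict[str, Any], context: Any) -> None:
        # An SQS-triggered Lambda receives a batch under event["Records"];
        # each record carries the message payload in "body".
        for record in event["Records"]:
            # Any exception propagates out of the handler. The batch is then
            # not deleted, the visibility timeout expires, and SQS redelivers
            # until maxReceiveCount (3) moves the message to the SQS DLQ.
            deliver(record["body"])
        # Returning normally signals success, so the batch is deleted.

    return handler
```

Because a failure anywhere in a batch causes the whole batch to be redelivered, records that already succeeded may be processed again. That is the at-least-once behaviour this ADR accepts, and it is why the consumer side needs to be idempotent.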

Consequences

Positive

At-least-once delivery with replay: if any downstream (FreeScout API, SES) is down, the message stays in SQS until it recovers. No email is dropped silently.
Two DLQ layers = two alarm surfaces: SNS DLQ catches subscription failures (topology broken); SQS DLQ catches processing failures (code/API error). Both are alarmed independently. An operator receiving either alarm has a different diagnosis path.
Failure isolation by topic: an SES outbound throttle does not block FreeScout inbound ticket creation. Separate topics mean separate DLQs, separate alarms, separate Lambda errors, separate investigation.
Kill-switch: setting Lambda reserved concurrency to 0 stops processing without message loss. Messages accumulate in SQS until the Lambda is re-enabled.
Cost: SES is roughly 5x cheaper than Postmark at v1 volume (~$2.48/mo vs ~$12.50/mo).
No credentials in code or environment variables: SES access is IAM role-based and the FreeScout API key is held in SSM. Lambda never receives credentials in environment variables or code.
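
As an illustration of the "two DLQ layers = two alarm surfaces" point, the sketch below builds one CloudWatch alarm definition per DLQ. Both the SNS-layer and SQS-layer DLQs are SQS queues, so the same metric applies to each. The queue names, the `dlq_alarm_params` helper, and the threshold and period values are assumptions made for the example; the real values live in the Terraform stack.

```python
from typing import Dict, List

# Hypothetical DLQ names: one SNS-layer and one SQS-layer DLQ per direction.
DLQ_NAMES: List[str] = [
    "raxx-email-inbound-sns-dlq.fifo",
    "raxx-email-inbound-sqs-dlq.fifo",
    "raxx-email-outbound-sns-dlq.fifo",
    "raxx-email-outbound-sqs-dlq.fifo",
]


def dlq_alarm_params(queue_name: str, alarm_topic_arn: str) -> Dict:
    """Return keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires as soon as the DLQ holds at least one visible message.
    """
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,                 # seconds; assumed value
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # An empty DLQ may report no datapoints; treat that as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }
```

Each dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`. Keeping one alarm per DLQ, rather than one aggregate alarm, preserves the distinct diagnosis paths described above.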

Negative / Trade-offs

SES sandbox is the critical-path constraint: SES production access requires an AWS Support case with 24–48 h turnaround. This must start before v1 launch. Inbound does not require sandbox-out and can proceed immediately.
Additional operational surface: 4 CloudWatch alarms + DynamoDB dedup table + 2 Lambda functions + Terraform stack to maintain. Operator training on DLQ redrive SOP required before go-live.
FreeScout SMTP cutover complexity: FreeScout uses Postmark SMTP for outbound replies. Cutting over to SES SMTP requires FreeScout admin changes and SMTP credential provisioning, not a code deploy.
us-east-1 SPOF: all SES, SNS, SQS, and Lambda resources are in us-east-1. AWS region failure = full email outage. Cross-region failover is explicitly out of scope for v1 and is a documented accepted risk.

Alternatives Considered

Alternative A: Keep Postmark, add retry logic in Raptor

Raptor's postmark_client.py could be enhanced with exponential backoff and a local retry queue (Redis or Postgres-backed). This is cheaper to build but does not solve the fundamental problem: if Raptor itself is down, mail is lost. Postmark's webhook is also a single synchronous hop.

Rejected because: Raptor is a Heroku dyno — it restarts, idles, and is not a durable message store. The operator's explicit direction is SNS/SQS/SES.

Alternative B: Single SNS topic for inbound and outbound

One topic with message attributes to distinguish direction, with two SQS subscriptions filtered by attribute.

Rejected because: failure scoping is worse. An outbound-side failure (SES throttle) causes the single topic's DLQ alarm to fire, making it ambiguous whether the failure is inbound or outbound. Separate topics give unambiguous alarm attribution. The cost difference is negligible (both under the SNS free tier at v1 volume).

Alternative C: EventBridge instead of SNS

EventBridge has richer routing rules and a schema registry. However, EventBridge event buses do not offer FIFO (ordered) delivery, so ordering would still have to come from FIFO queues downstream. EventBridge also has no native DLQ at the publish layer equivalent to the SNS redrive policy.

Rejected because: the operator spec calls for SNS FIFO. EventBridge would require a custom ordering solution. SNS + SQS FIFO is the established AWS pattern for this use case.

Alternative D: Direct SQS publish without SNS

Raptor publishes directly to SQS FIFO queues. Simpler topology, fewer components.

Rejected because: removes the fan-out capability needed for future additional consumers (e.g., AV scan Lambda subscribing to the inbound topic). Adding a second consumer to a direct SQS publish path requires code changes in every publisher. With SNS, adding a consumer is an SNS subscription — no publisher code change.
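
To make the "adding a consumer is an SNS subscription" point concrete, the sketch below builds the arguments for an `sns.subscribe` call that attaches a new SQS FIFO queue to an existing topic, including the subscription-level RedrivePolicy that points at the SNS DLQ. The helper name and all ARNs are hypothetical, and enabling raw message delivery is an assumption of the example rather than something this ADR specifies.

```python
import json
from typing import Dict


def subscription_kwargs(topic_arn: str, queue_arn: str, sns_dlq_arn: str) -> Dict:
    """Return keyword arguments for the SNS Subscribe API (boto3 sns.subscribe)."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",
        "Endpoint": queue_arn,
        "Attributes": {
            # Assumed: deliver the published body as-is, without the SNS envelope.
            "RawMessageDelivery": "true",
            # Subscription-level DLQ: receives messages that exhaust SNS
            # delivery retries. Separate from the queue's own SQS DLQ.
            "RedrivePolicy": json.dumps({"deadLetterTargetArn": sns_dlq_arn}),
        },
        "ReturnSubscriptionArn": True,
    }


# Hypothetical example: a future AV scan consumer on the inbound topic.
av_scan = subscription_kwargs(
    topic_arn="arn:aws:sns:us-east-1:123456789012:raxx-email-inbound.fifo",
    queue_arn="arn:aws:sqs:us-east-1:123456789012:raxx-email-av-scan.fifo",
    sns_dlq_arn="arn:aws:sqs:us-east-1:123456789012:raxx-email-av-scan-sns-dlq.fifo",
)
```

The result could be passed to `boto3.client("sns").subscribe(**av_scan)`. The new queue and its DLQ would also need access policies that allow the topic to send to them. No publisher is touched in this flow.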

Notes

This ADR supersedes the implicit decision to use Postmark as the email delivery primitive.
The Postmark migration path is phased (Phase 0–4 in the design doc). This ADR records the target state; the migration ADR (if needed for the phased details) may be filed as a sub-ADR once Phase 1 lands.
Reserved concurrency values (5 for outbound, 3 for inbound) are set conservatively at v1 and are Terraform variables — adjustable without code changes.
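
For reference, the reserved concurrency values and the kill-switch from the Consequences section both map onto the same Lambda API call. The sketch below assumes a boto3 Lambda client is passed in; the function names and the `set_reserved_concurrency` helper are hypothetical. In normal operation these values are managed through the Terraform variables, so an out-of-band change like this one would show up as drift on the next plan.

```python
from typing import Any, Dict

# v1 values from this ADR; 0 acts as the kill-switch.
RESERVED_CONCURRENCY: Dict[str, int] = {"outbound": 5, "inbound": 3}
KILL_SWITCH = 0


def set_reserved_concurrency(lambda_client: Any, function_name: str, value: int) -> None:
    """Set reserved concurrency on one consumer Lambda.

    `lambda_client` is expected to behave like boto3.client("lambda").
    """
    if value < 0:
        raise ValueError("reserved concurrency must be >= 0")
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=value,
    )
```

For example, `set_reserved_concurrency(client, "raxx-email-outbound-consumer", KILL_SWITCH)` would pause outbound processing, and calling it again with `RESERVED_CONCURRENCY["outbound"]` would resume it.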