ADR-0072: SNS/SQS/SES Durable Email Delivery with DLQ at Both Layers

Status: Accepted
Date: 2026-05-11 UTC
Operator-locked: yes — architecture spec locked by operator 2026-05-11 UTC
Refs: #1657, docs/architecture/durable-email-delivery.md


Context

Raxx's v1 email path (Raptor → Postmark direct HTTP call, or Postmark webhook → Raptor → FreeScout API) is synchronous and has no durability layer. A Postmark outage, FreeScout downtime, or a transient network error silently drops the email: there is no replay path and nothing detects the loss.

The operator direction (2026-05-11 UTC) is to design durable email delivery starting from failure modes, not features. The operator-specified architecture: SNS FIFO for fan-out/pub-sub, SQS FIFO for persistent buffering, DLQs at both the SNS layer and the SQS layer, Lambda as the consumer, and SES as the delivery agent.


Decision

Adopt the following durable email delivery architecture as the Raxx email primitive, replacing the direct Postmark path:

  1. SNS FIFO topics as the publish boundary. Publishers (Raptor, Console) call sns:Publish and are done. The topic is durable; if SQS is momentarily unavailable, SNS retries delivery for up to 23 days before moving the message to the SNS DLQ.

  2. SNS DLQ at the subscription layer (RedrivePolicy on the SNS subscription to SQS). Captures messages that exhausted SNS delivery retries. Because the topics are FIFO, this DLQ must itself be an SQS FIFO queue. It is distinct from the SQS DLQ and requires a separate alarm.

  3. SQS FIFO queues (one per consumer) as the durable buffer between SNS and Lambda. Multi-AZ storage. Messages survive Lambda cold-starts, crashes, and FreeScout/SES downtime.

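  4. SQS DLQ per queue. Captures messages that have exceeded maxReceiveCount (set to 3). In a FIFO queue a failing message holds back later messages in the same message group, so the DLQ bounds that stall to three receive attempts: poison messages and code errors are set aside instead of blocking healthy messages indefinitely.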

  5. Lambda consumers: one for inbound (SES event → FreeScout API) and one for outbound (SNS event → SES SendEmail). The SQS message is deleted only after successful processing; on an exception, the visibility timeout expires and SQS redelivers the message, which counts toward maxReceiveCount.

  6. Two separate SNS topics: raxx-email-inbound.fifo and raxx-email-outbound.fifo. Failure isolation is the primary reason (see Consequences).

  7. SES for inbound receipt rules (the domain's MX record routes mail to SES; a receipt rule stores the message in S3 and notifies SNS) and outbound sending (ses:SendEmail).
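
A minimal sketch of items 2, 4, and 5, offered as an illustration rather than the deployed code. It builds the two redrive policies (one per DLQ layer) and shows an outbound consumer that reports failures per record. Several things here are assumptions not found in the spec: the payload fields (from, to, subject, text), the helper names, raw message delivery being disabled on the subscription (so the SQS body is an SNS envelope), and ReportBatchItemFailures being enabled on the Lambda event source mapping. Because the queues are FIFO, the sketch stops at the first failure and reports that record plus every record after it, so later messages are not processed ahead of a failed one.

```python
import json

MAX_RECEIVE_COUNT = 3  # value from item 4


def sns_subscription_redrive_policy(sns_dlq_arn: str) -> str:
    """RedrivePolicy attribute for the SNS -> SQS subscription (item 2)."""
    return json.dumps({"deadLetterTargetArn": sns_dlq_arn})


def sqs_redrive_policy(sqs_dlq_arn: str, max_receive_count: int = MAX_RECEIVE_COUNT) -> str:
    """RedrivePolicy attribute for the SQS FIFO queue (item 4)."""
    return json.dumps(
        {"deadLetterTargetArn": sqs_dlq_arn, "maxReceiveCount": max_receive_count}
    )


def send_via_ses(email: dict) -> None:
    """Hypothetical outbound sender. boto3 is imported lazily so the batch
    logic below can be exercised without AWS access."""
    import boto3

    boto3.client("ses").send_email(
        Source=email["from"],
        Destination={"ToAddresses": [email["to"]]},
        Message={
            "Subject": {"Data": email["subject"]},
            "Body": {"Text": {"Data": email["text"]}},
        },
    )


def handle_batch(event: dict, send=send_via_ses) -> dict:
    """Process SQS records in order; report the first failure and all later records."""
    records = event["Records"]
    for index, record in enumerate(records):
        try:
            envelope = json.loads(record["body"])  # SNS envelope (assumes raw delivery is off)
            send(json.loads(envelope["Message"]))  # assumes the publisher sent a JSON payload
        except Exception:
            # Records listed here are not deleted; they become visible again after
            # the visibility timeout and count toward maxReceiveCount.
            failed = [{"itemIdentifier": r["messageId"]} for r in records[index:]]
            return {"batchItemFailures": failed}
    return {"batchItemFailures": []}


def handler(event, context):
    """Lambda entry point for the outbound consumer."""
    return handle_batch(event)
```

Records that succeed are deleted by the Lambda service when the handler returns; records named in batchItemFailures stay in the queue and, after three receives, move to the SQS DLQ.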


Consequences

Positive

Negative / Trade-offs


Alternatives Considered

Alternative A: Keep Postmark, add retry logic in Raptor

Raptor's postmark_client.py could be enhanced with exponential backoff and a local retry queue (Redis or Postgres-backed). This is cheaper to build but does not solve the fundamental problem: if Raptor itself is down, mail is lost. Postmark's webhook is also a single synchronous hop.
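
To make the limitation concrete, here is a rough sketch of what in-process retry with exponential backoff could look like. The function names, delay values, and the send callable are hypothetical; nothing here is taken from postmark_client.py. Note that the retry state lives only in the running process, so a dyno restart between attempts loses the message.

```python
import time


def backoff_delays(base: float = 0.5, cap: float = 30.0, attempts: int = 5) -> list[float]:
    """Delays that double each attempt and are capped at `cap` seconds."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def send_with_retry(send, payload, delays=None, sleep=time.sleep):
    """Call send(payload) once, then retry after each delay until it succeeds.

    Re-raises the last error when every attempt fails. All state is in memory:
    if the process dies mid-loop, the payload is gone.
    """
    delays = backoff_delays() if delays is None else delays
    last_error = None
    for delay in [0.0] + list(delays):
        if delay:
            sleep(delay)
        try:
            return send(payload)
        except Exception as error:
            last_error = error
    raise last_error
```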

Rejected because: Raptor is a Heroku dyno — it restarts, idles, and is not a durable message store. The operator's explicit direction is SNS/SQS/SES.

Alternative B: Single unified SNS topic for inbound and outbound

One topic with message attributes to distinguish direction, with two SQS subscriptions filtered by attribute.

Rejected because: failure scoping is worse. An outbound-side failure (SES throttle) causes the single topic's DLQ alarm to fire, making it ambiguous whether the failure is inbound or outbound. Separate topics give unambiguous alarm attribution. The cost difference is negligible (both under the SNS free tier at v1 volume).
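
For reference, the rejected single-topic design would have relied on subscription filter policies along these lines. This is a simplified sketch: the attribute name "direction" is hypothetical (the ADR does not name one), and the matcher models only exact string matching, not the other operators real SNS filter policies support.

```python
import json

# Hypothetical attribute name and values.
INBOUND_FILTER_POLICY = {"direction": ["inbound"]}
OUTBOUND_FILTER_POLICY = {"direction": ["outbound"]}


def filter_policy_attribute(policy: dict) -> str:
    """The FilterPolicy subscription attribute is passed to SNS as a JSON string."""
    return json.dumps(policy)


def matches(filter_policy: dict, message_attributes: dict) -> bool:
    """Simplified model of exact-string matching on message attributes.

    Every key in the policy must be present on the message, and the
    attribute's string value must equal one of the listed values.
    """
    for name, allowed in filter_policy.items():
        attribute = message_attributes.get(name)
        if attribute is None or attribute.get("StringValue") not in allowed:
            return False
    return True
```

Both subscriptions would hang off the same topic, which is why a failure on either side would surface through the same topic-level alarm.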

Alternative C: EventBridge instead of SNS

EventBridge has richer routing rules and a schema registry. However, EventBridge event buses do not offer FIFO (ordered) delivery; in this design, ordering comes from the SNS FIFO and SQS FIFO pair. EventBridge does support a retry policy and a dead-letter queue per rule target, so DLQ coverage alone does not rule it out; the lack of ordering does.

Rejected because: the operator spec calls for SNS FIFO. EventBridge would require a custom ordering solution. SNS + SQS FIFO is the established AWS pattern for this use case.

Alternative D: SQS only (no SNS fan-out layer)

Raptor publishes directly to SQS FIFO queues. Simpler topology, fewer components.

Rejected because: removes the fan-out capability needed for future additional consumers (e.g., AV scan Lambda subscribing to the inbound topic). Adding a second consumer to a direct SQS publish path requires code changes in every publisher. With SNS, adding a consumer is an SNS subscription — no publisher code change.
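
The fan-out argument can be sketched as follows: attaching one more consumer is a single subscribe call against the existing topic, with no publisher involved. The function names are hypothetical and the ARNs are supplied by the caller. The sketch assumes boto3's SNS client and assumes that the new queue's access policy already allows the topic to send to it.

```python
import json


def subscription_kwargs(topic_arn: str, queue_arn: str, sns_dlq_arn: str) -> dict:
    """Arguments for boto3's sns.subscribe() that attach one more SQS consumer."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "sqs",
        "Endpoint": queue_arn,
        "Attributes": {
            # Each subscription carries its own SNS-layer DLQ.
            "RedrivePolicy": json.dumps({"deadLetterTargetArn": sns_dlq_arn}),
        },
        "ReturnSubscriptionArn": True,
    }


def add_consumer(sns_client, topic_arn: str, queue_arn: str, sns_dlq_arn: str) -> str:
    """Subscribe a new queue to an existing topic and return the subscription ARN."""
    response = sns_client.subscribe(
        **subscription_kwargs(topic_arn, queue_arn, sns_dlq_arn)
    )
    return response["SubscriptionArn"]
```

A future AV scan consumer would call add_consumer with a boto3 SNS client, the inbound topic ARN, and its own queue and DLQ ARNs; Raptor and Console keep publishing exactly as before.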


Notes