Email DLQ Redrive Runbook

System: email-delivery (SNS/SQS/Lambda, Postmark hybrid)
Owner: operator / sre-agent
Last incident: (none; initial provisioning 2026-05-11 UTC)
Last reviewed: 2026-05-11 UTC
ADR refs: ADR-0072, ADR-0074
Architecture doc: docs/architecture/durable-email-delivery.md Sections 1, 5, 9
Depends on: SC-E2 (#1665) Terraform stack applied

ARN Reference (from `terraform output` after SC-E2 apply)

To get live ARNs:

cd terraform/modules/email-delivery-stack
AWS_DEFAULT_REGION=us-east-1 terraform output

Queue and topic ARNs for account 521228113048, region us-east-1 (queue URLs take the form https://sqs.us-east-1.amazonaws.com/521228113048/<queue-name>):

Resource	Name	ARN pattern
Inbound main queue	`inbound-email-freescout-bridge.fifo`	`arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge.fifo`
Outbound main queue	`outbound-email-postmark-sender.fifo`	`arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender.fifo`
Inbound SQS DLQ	`inbound-email-freescout-bridge-dlq.fifo`	`arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo`
Outbound SQS DLQ	`outbound-email-postmark-sender-dlq.fifo`	`arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender-dlq.fifo`
Inbound SNS topic	`raxx-inbound-email.fifo`	`arn:aws:sns:us-east-1:521228113048:raxx-inbound-email.fifo`
Outbound SNS topic	`raxx-outbound-email.fifo`	`arn:aws:sns:us-east-1:521228113048:raxx-outbound-email.fifo`
Inbound SNS DLQ	`raxx-email-inbound-sns-dlq.fifo`	`arn:aws:sqs:us-east-1:521228113048:raxx-email-inbound-sns-dlq.fifo`
Outbound SNS DLQ	`raxx-email-outbound-sns-dlq.fifo`	`arn:aws:sqs:us-east-1:521228113048:raxx-email-outbound-sns-dlq.fifo`

Update this table if `terraform output` reports different ARNs after an apply.

DLQ Architecture (two layers)

Layer 1 — SNS DLQs (topology failures, F4):
  raxx-inbound-email.fifo  → [SNS retries 23 days] → raxx-email-inbound-sns-dlq.fifo
  raxx-outbound-email.fifo → [SNS retries 23 days] → raxx-email-outbound-sns-dlq.fifo

Layer 2 — SQS DLQs (Lambda processing failures, F5/F10):
  inbound-email-freescout-bridge.fifo  → [5 attempts] → inbound-email-freescout-bridge-dlq.fifo
  outbound-email-postmark-sender.fifo  → [5 attempts] → outbound-email-postmark-sender-dlq.fifo

SNS DLQ alarm = topology failure (subscription broken, policy revoked). SQS DLQ alarm = Lambda processing failure (code error, API down, poison message). These are different root causes with different remediation paths.

How to tell it's broken

CloudWatch alarm raxx-email-inbound-sns-dlq-depth firing: SNS → SQS subscription broken (F4)
CloudWatch alarm raxx-email-outbound-sns-dlq-depth firing: same, outbound side
CloudWatch alarm raxx-email-inbound-bridge-dlq-depth firing: inbound Lambda processing failure (F5)
CloudWatch alarm raxx-email-outbound-sender-dlq-depth firing: outbound Lambda processing failure (F5)
CloudWatch alarm raxx-email-probe-failure firing: end-to-end inbound path broken (F1)
Both raxx-email-probe-failure + a DLQ alarm: F1 + F4 detection layer triggered simultaneously

The alarm description field on each DLQ alarm names the failure mode and links to this runbook.

How to diagnose (in order)

Step 1 — Identify which DLQ(s) have messages

Check alarm state for all four DLQs:

aws cloudwatch describe-alarms \
  --alarm-names \
    "raxx-email-inbound-sns-dlq-depth" \
    "raxx-email-outbound-sns-dlq-depth" \
    "raxx-email-inbound-bridge-dlq-depth" \
    "raxx-email-outbound-sender-dlq-depth" \
  --query 'MetricAlarms[*].[AlarmName,StateValue,StateReason]' \
  --output table \
  --region us-east-1

Expected value: all OK. Any ALARM state requires investigation.
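
Alarm state may lag the queue itself, so it can help to read each DLQ's depth directly. The following is a minimal sketch, assuming the queue URL pattern used elsewhere in this runbook; the helper names (dlq_url, dlq_depth, all_dlq_depths) are illustrative and not part of the Terraform stack.

```shell
# Hypothetical helpers (names are illustrative, not part of the stack).
ACCOUNT_ID=521228113048
REGION=us-east-1

# Build a queue URL from a queue name, using the URL pattern shown in this runbook.
dlq_url() {
  printf 'https://sqs.%s.amazonaws.com/%s/%s\n' "$REGION" "$ACCOUNT_ID" "$1"
}

# Print the approximate number of visible messages in one queue.
dlq_depth() {
  aws sqs get-queue-attributes \
    --queue-url "$(dlq_url "$1")" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' \
    --output text \
    --region "$REGION"
}

# Print "<queue name> <depth>" for all four DLQs.
all_dlq_depths() {
  for q in \
    inbound-email-freescout-bridge-dlq.fifo \
    outbound-email-postmark-sender-dlq.fifo \
    raxx-email-inbound-sns-dlq.fifo \
    raxx-email-outbound-sns-dlq.fifo
  do
    printf '%s %s\n' "$q" "$(dlq_depth "$q")"
  done
}
```

Calling all_dlq_depths prints one line per DLQ; any non-zero depth identifies the queue to inspect in Step 2.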

Step 2 — Inspect DLQ messages (does not delete them)

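The commands below use the full queue URL of each DLQ (https://sqs.us-east-1.amazonaws.com/521228113048/<queue-name>). To inspect a DLQ that is not shown, such as the outbound SNS DLQ, substitute its queue name from the ARN table above.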

# Inbound SQS DLQ (Lambda processing failure)
aws sqs receive-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All \
  --visibility-timeout 300 \
  --region us-east-1

# Outbound SQS DLQ (Lambda processing failure)
aws sqs receive-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/outbound-email-postmark-sender-dlq.fifo" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All \
  --visibility-timeout 300 \
  --region us-east-1

# Inbound SNS DLQ (topology failure — subscription broken)
aws sqs receive-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/raxx-email-inbound-sns-dlq.fifo" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All \
  --visibility-timeout 300 \
  --region us-east-1

In the response, check:
- ApproximateReceiveCount: if >= 5, the message hit max-receive-count.
- SentTimestamp: when the message first arrived (use this for log correlation).
- Body: the raw email event payload (Postmark JSON for inbound; Raptor-published JSON for outbound).
- For inbound: look for the MessageID field (Postmark's unique identifier). This is the idempotency key.
- For outbound: look for correlation_id in message attributes.
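
When only the triage fields are needed, the same receive-message call can be narrowed with a JMESPath --query. This is a sketch under the assumption that one tab-separated line per message is convenient for the operator; dlq_summary is a hypothetical helper name.

```shell
# Hypothetical helper: print one tab-separated line per message:
#   MessageId  ApproximateReceiveCount  SentTimestamp
# Prints "None" when the queue returns no messages.
# Usage: dlq_summary <queue-url>
dlq_summary() {
  aws sqs receive-message \
    --queue-url "$1" \
    --max-number-of-messages 10 \
    --attribute-names All \
    --visibility-timeout 300 \
    --query 'Messages[].[MessageId, Attributes.ApproximateReceiveCount, Attributes.SentTimestamp]' \
    --output text \
    --region us-east-1
}
```

As with the full commands above, received messages stay hidden for the 300-second visibility timeout and are not deleted.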

Step 3 — Correlate with Lambda CloudWatch Logs

Convert the SentTimestamp from epoch milliseconds to a human-readable time, then query Lambda logs:

# Inbound Lambda logs — filter for ERROR around message timestamp
aws logs filter-log-events \
  --log-group-name /aws/lambda/raxx-email-inbound-bridge \
  --start-time <epoch_ms_from_SentTimestamp> \
  --end-time <epoch_ms_plus_60min> \
  --filter-pattern '"ERROR"' \
  --region us-east-1

# Outbound Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/raxx-email-outbound-sender \
  --start-time <epoch_ms_from_SentTimestamp> \
  --end-time <epoch_ms_plus_60min> \
  --filter-pattern '"ERROR"' \
  --region us-east-1

# Check for IAM permission errors specifically (F19)
aws logs filter-log-events \
  --log-group-name /aws/lambda/raxx-email-inbound-bridge \
  --start-time <epoch_ms_from_SentTimestamp> \
  --filter-pattern '"AccessDeniedException"' \
  --region us-east-1
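
The <epoch_ms_from_SentTimestamp> and <epoch_ms_plus_60min> placeholders above can be computed in the shell. A minimal sketch follows; the function names and the sample timestamp are illustrative only, and the date fallback assumes either GNU or BSD/macOS date is installed.

```shell
# Hypothetical helpers for building the --start-time / --end-time values.

# Convert epoch milliseconds to an ISO-8601 UTC string.
# Tries GNU date first, then falls back to BSD/macOS date.
epoch_ms_to_utc() {
  secs=$(( $1 / 1000 ))
  date -u -d "@$secs" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null ||
    date -u -r "$secs" '+%Y-%m-%dT%H:%M:%SZ'
}

# Add 60 minutes (3,600,000 ms) to an epoch-millisecond timestamp.
epoch_ms_plus_60min() {
  echo $(( $1 + 60 * 60 * 1000 ))
}

SENT=1700000000000   # sample SentTimestamp, for illustration only
echo "start: $SENT ($(epoch_ms_to_utc "$SENT"))"   # 2023-11-14T22:13:20Z
echo "end:   $(epoch_ms_plus_60min "$SENT")"        # 1700003600000
```

The two numeric values go straight into --start-time and --end-time; the UTC string is only for reading the logs alongside other timestamps.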

Common error signatures:
- AccessDeniedException → IAM issue (F19). Check that the Lambda role policy matches iam.tf.
- FreeScout API 5xx / ConnectionError → FreeScout down (F7). Check FreeScout health.
- Postmark API 429 → Rate limit (F8). Exponential backoff should handle it; check whether the 429s are sustained.
- Postmark API 422 → Malformed payload or suppressed recipient. Investigate sender/recipient.
- JSONDecodeError / KeyError → Malformed message body (F10/poison message).
- ReadTimeoutError → FreeScout or Postmark API slow; visibility timeout may have been exceeded.

If the SNS DLQ (not SQS DLQ) has messages, the SQS subscription is broken:

# List all subscriptions for the inbound topic
aws sns list-subscriptions-by-topic \
  --topic-arn "arn:aws:sns:us-east-1:521228113048:raxx-inbound-email.fifo" \
  --region us-east-1

# Expected: one subscription with Protocol=sqs, Endpoint=<inbound bridge queue ARN>
# If subscription is missing or SubscriptionArn shows "PendingConfirmation": broken

Check SQS queue policy allows SNS to send:

aws sqs get-queue-attributes \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge.fifo" \
  --attribute-names Policy \
  --region us-east-1 | python3 -c "import json,sys; p=json.load(sys.stdin); print(json.dumps(json.loads(p['Attributes']['Policy']), indent=2))"

Expected: Statement with Principal.Service = sns.amazonaws.com and Action = sqs:SendMessage.
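
The policy check can also be automated rather than read by eye. The sketch below assumes python3 is available (the command above already relies on it) and that the policy is a standard IAM-style JSON document; policy_allows_sns is a hypothetical helper name, and it checks only Effect, Principal.Service, and Action, not any Condition block.

```shell
# Hypothetical helper: reads get-queue-attributes JSON on stdin and exits 0
# if some Allow statement grants sqs:SendMessage to sns.amazonaws.com.
policy_allows_sns() {
  python3 -c '
import json, sys

def as_list(value):
    # Policy fields may be a single value or a list of values.
    return value if isinstance(value, list) else [value]

attrs = json.load(sys.stdin).get("Attributes", {})
policy = json.loads(attrs.get("Policy", "{}"))
for stmt in as_list(policy.get("Statement", [])):
    principal = stmt.get("Principal", {})
    services = as_list(principal.get("Service", [])) if isinstance(principal, dict) else []
    actions = [a.lower() for a in as_list(stmt.get("Action", []))]
    if (stmt.get("Effect") == "Allow"
            and "sns.amazonaws.com" in services
            and ("sqs:sendmessage" in actions or "sqs:*" in actions)):
        sys.exit(0)
sys.exit(1)
'
}
```

Pipe the get-queue-attributes output into it in place of the pretty-printer, for example `... --attribute-names Policy --region us-east-1 | policy_allows_sns && echo "SNS allowed" || echo "SNS NOT allowed"`.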

Known failure modes

Failure mode A: SQS DLQ depth > 0 — FreeScout API was transiently down

Symptom: raxx-email-inbound-bridge-dlq-depth alarm fires. Lambda logs show FreeScout 503 or connection timeout around the message timestamp.
Cause: FreeScout on Lightsail was restarting or overloaded when the Lambda invocation ran.
Fix: Verify FreeScout is now healthy (curl -s -o /dev/null -w '%{http_code}\n' https://tickets.raxx.app/api/users prints 200). Then redrive.
Verification: DLQ depth returns to 0. Check FreeScout: a new conversation has been created for the email.

Failure mode B: SQS DLQ depth > 0 — Postmark API down or rate limited

Symptom: raxx-email-outbound-sender-dlq-depth alarm fires. Lambda logs show Postmark 5xx or 429.
Cause: Postmark service degraded or rate limit hit.
Fix: Check the Postmark status page (https://status.postmarkapp.com). If Postmark has recovered, redrive. If 429s are sustained, reduce lambda_outbound_reserved_concurrency to 1 in terraform.tfvars and re-apply.
Verification: DLQ depth returns to 0. Check the Postmark activity log for delivery confirmation.

Failure mode C: SQS DLQ depth > 0 — IAM permission error

Symptom: Lambda logs show AccessDeniedException. DLQ message has ApproximateReceiveCount >= 5.
Cause: Lambda execution role is missing a permission (F19). May be Terraform state drift or a manual IAM change.
Fix: Run terraform plan from terraform/modules/email-delivery-stack/ to detect drift. If drift is found, run terraform apply to restore it. No Lambda code change is needed.
Verification: Lambda processes subsequent messages without AccessDeniedException.

Failure mode D: SNS DLQ depth > 0 (SNS → SQS subscription broken)

Symptom: raxx-email-inbound-sns-dlq-depth alarm fires. The SNS subscription list shows the subscription as missing or PendingConfirmation.
Cause: The SNS → SQS subscription was deleted (manual action or Terraform destroy) or the SQS queue policy was revoked (F4).
Fix: Run terraform apply to restore the subscription. Check CloudTrail for who deleted the subscription.
Verification: SNS DLQ depth stops growing (no new messages). Test: publish a test message to the SNS topic and verify it lands in SQS.

Failure mode E: Poison message — malformed payload

Symptom: DLQ message body contains invalid JSON, missing required fields, or non-UTF-8 content.
Cause: Upstream publisher sent a malformed message (Postmark webhook schema change, Raptor bug).
Fix: Discard the message (do not redrive). File a type:reliability issue with the malformed payload example. Fix the upstream publisher or the Lambda schema validation.
Verification: No new poison messages appear after the fix.

Failure mode F: Suspected spoofed or malicious message

Symptom: DLQ message appears to be a crafted/spoofed email payload (wrong X-Postmark-Signature, unexpected From domain, payload structure doesn't match the Postmark JSON schema).
Cause: External attacker posting to the API Gateway endpoint with a forged payload.
Fix: DO NOT redrive. Quarantine to S3 (copy the raw message body to an S3 object for forensics). Open a security incident. Rotate the Postmark inbound webhook token in Postmark admin and update the SSM parameter /raxx/email/postmark_inbound_webhook_token.
Verification: Token rotated, Postmark admin updated, new test probe succeeds.

Disposition table

Root cause	Disposition	Action
FreeScout was transiently down (now recovered)	Redrive	`start-message-move-task`
Postmark API was down (now recovered)	Redrive	`start-message-move-task`
Code bug (Lambda fixed, redeployed)	Redrive	`start-message-move-task`
IAM issue fixed via Terraform	Redrive	`start-message-move-task`
Malformed message (no Message-ID, corrupt UTF-8)	Discard	Delete from DLQ; root-cause at source
Duplicate of already-processed message	Discard	Idempotency guard skips it; discard to avoid noise
Suspected spoofed/malicious	Quarantine	S3 copy → security incident → rotate token

Redrive procedure

Preferred: AWS Console (creates CloudTrail audit event)

Open AWS Console → SQS → select the DLQ queue.
Click "Dead-letter queue" tab → "Start DLQ redrive".
Set destination to the source queue (remove -dlq from the name).
Set message move throughput to match Lambda reserved concurrency (e.g., 3 messages/second for inbound).
Click "Start". AWS generates a CloudTrail event: StartMessageMoveTask.
Monitor: watch DLQ ApproximateNumberOfMessagesVisible drop to 0.

CLI method (for scripting or when console is unavailable)

# Inbound SQS DLQ → inbound main queue
aws sqs start-message-move-task \
  --source-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo" \
  --destination-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge.fifo" \
  --max-number-of-messages-per-second 3 \
  --region us-east-1

# Outbound SQS DLQ → outbound main queue
aws sqs start-message-move-task \
  --source-arn "arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender-dlq.fifo" \
  --destination-arn "arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender.fifo" \
  --max-number-of-messages-per-second 5 \
  --region us-east-1

# Check redrive task status
aws sqs list-message-move-tasks \
  --source-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo" \
  --region us-east-1
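
A redrive runs asynchronously, so the status check usually has to be repeated until the task finishes. The following is a sketch of a polling loop; wait_for_redrive is a hypothetical helper name and the 15-second default interval is an assumption, not a tuned value.

```shell
# Hypothetical helper: poll the most recent move task for a DLQ until it finishes.
# Returns 0 on COMPLETED, 1 on FAILED, CANCELLED, or when no task is found.
# Usage: wait_for_redrive <source-dlq-arn> [poll-seconds]
wait_for_redrive() {
  src_arn=$1
  interval=${2:-15}
  while :; do
    status=$(aws sqs list-message-move-tasks \
      --source-arn "$src_arn" \
      --max-results 1 \
      --query 'Results[0].Status' \
      --output text \
      --region us-east-1)
    echo "move task status: $status"
    case "$status" in
      COMPLETED) return 0 ;;
      RUNNING|CANCELLING) sleep "$interval" ;;
      *) return 1 ;;   # FAILED, CANCELLED, "None", or an empty result
    esac
  done
}
```

For example, `wait_for_redrive "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo" && echo "redrive complete"`.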

Manual discard (single message)

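If you've examined a message and determined it must be discarded, receive it again and delete it by receipt handle. Messages received during Step 2 stay hidden until their 300-second visibility timeout expires, and receive-message returns whichever message is currently available, so when the DLQ holds more than one message confirm that the one returned is the one you examined before deleting it: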

# 1. Receive the message to get the ReceiptHandle
RECEIPT=$(aws sqs receive-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
  --max-number-of-messages 1 \
  --visibility-timeout 300 \
  --region us-east-1 \
  --query 'Messages[0].ReceiptHandle' \
  --output text)

# 2. Delete the message (irreversible — confirm root cause first)
aws sqs delete-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
  --receipt-handle "$RECEIPT" \
  --region us-east-1
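
Because the two-step discard above deletes whatever message is returned, a guard on MessageId can reduce the risk of deleting the wrong one. This is a sketch only; discard_if_matches is a hypothetical helper name and the 60-second visibility timeout is an assumption chosen so that a mismatched message reappears quickly.

```shell
# Hypothetical helper: delete the received message only if its MessageId
# matches the one already examined.
# Usage: discard_if_matches <queue-url> <expected-message-id>
discard_if_matches() {
  queue_url=$1
  expected_id=$2
  out=$(aws sqs receive-message \
    --queue-url "$queue_url" \
    --max-number-of-messages 1 \
    --visibility-timeout 60 \
    --query 'Messages[0].[MessageId, ReceiptHandle]' \
    --output text \
    --region us-east-1)
  # Text output is "<MessageId><TAB><ReceiptHandle>", or "None" if nothing was received.
  read -r got_id receipt <<EOF
$out
EOF
  if [ -z "$receipt" ]; then
    echo "no message received" >&2
    return 2
  fi
  if [ "$got_id" != "$expected_id" ]; then
    echo "received $got_id, expected $expected_id; not deleting" >&2
    return 1
  fi
  # Irreversible: only reached when the MessageId matches.
  aws sqs delete-message \
    --queue-url "$queue_url" \
    --receipt-handle "$receipt" \
    --region us-east-1
}
```

On a mismatch the received message is left in the queue and becomes visible again once the visibility timeout expires.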

Post-redrive verification checklist

After a redrive:

1. DLQ ApproximateNumberOfMessagesVisible returns to 0 (CloudWatch console or CLI).
2. Lambda successfully processes the redriven messages: check CloudWatch Logs for successful invocations (no ERROR lines; look for the "outcome": "success" structured log event).
3. Downstream verified:
   - Inbound: FreeScout conversation created. Search FreeScout for the message's MessageID value or probe ID.
   - Outbound: Postmark delivery confirmed. Check Postmark dashboard → Activity → Message detail shows Delivered.
4. Write an ops note in FreeScout ops@ mailbox ticket if the failure affected a customer-visible email path. Required fields: timestamp (UTC), root cause, resolution, message count affected, redrive time (UTC).
5. If the DLQ had > 1 message: verify all messages were redriven (check task status with list-message-move-tasks).

Audit trail expectations

Every redrive produces:
- CloudTrail event: StartMessageMoveTask (IAM principal, timestamp, source/destination ARNs).
- Lambda CloudWatch Log: structured JSON for each processed message: event_type, message_id_hash (SHA-256, truncated; no raw PII), mailbox_id (inbound) or correlation_id (outbound), outcome, timestamp_utc.
- FreeScout ops@ note: written manually by operator per checklist item 4 above.
- DynamoDB dedup record: dedup_key written with TTL after successful processing.

If a message was discarded instead of redriven, write a manual audit entry in the FreeScout ops@ mailbox with: reason for discard, message body hash (SHA-256), timestamp (UTC).
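
The body hash and UTC timestamp for a discard entry can be produced in the shell. A minimal sketch follows; body_sha256 and discard_audit_line are hypothetical helper names, and the one-line key=value layout is an assumption rather than a required format.

```shell
# Hypothetical helper: SHA-256 hex digest of a message body passed as $1.
# Uses sha256sum (GNU coreutils) when present, otherwise shasum (macOS/Perl).
body_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
  fi
}

# Hypothetical helper: one-line audit entry to paste into the ops@ note.
# Usage: discard_audit_line <message-body> <reason>
discard_audit_line() {
  printf 'discarded_at_utc=%s body_sha256=%s reason=%s\n' \
    "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$(body_sha256 "$1")" "$2"
}
```

Shell variables drop trailing newlines and cannot hold NUL bytes, so for a byte-exact hash of an unusual payload it is safer to save the body to a file and hash the file instead.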

Slack escalation thresholds

Condition	Action
Any DLQ alarm fires	Investigate immediately using this runbook. Post status to ops@ FreeScout.
DLQ not cleared within 12h of alarm fire	Page operator (SEV-2 escalation).
DLQ not cleared within 24h	SEV-1 escalation: customer email delivery failure.
SNS DLQ fires	Always escalate to operator same-day — topology failure requires Terraform review.
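
The time-based rows of the table can be expressed as a small function, which may be useful in a status script. This is a sketch; escalation_level is a hypothetical helper name, and it covers only the 12h and 24h thresholds, not the SNS DLQ same-day rule.

```shell
# Hypothetical helper: map time since the alarm fired to the thresholds in the table.
# Usage: escalation_level <alarm-fired-epoch-seconds> [now-epoch-seconds]
escalation_level() {
  fired=$1
  now=${2:-$(date -u +%s)}
  hours=$(( (now - fired) / 3600 ))
  if [ "$hours" -ge 24 ]; then
    echo "SEV-1"          # DLQ not cleared within 24h
  elif [ "$hours" -ge 12 ]; then
    echo "SEV-2"          # DLQ not cleared within 12h
  else
    echo "investigate"    # alarm fired; work this runbook
  fi
}
```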

Detection sources

This runbook is reached via two detection paths:

F1 + F4 detection layer (end-to-end probe + topology):
- raxx-email-probe-failure alarm: probe email to support@raxx.app not seen in FreeScout within 5 min. Wired up by SC-E7 (#1670).
- Any DLQ alarm: confirms where in the pipeline the failure occurred.

F5 detection (Lambda processing failure):
- raxx-email-inbound-bridge-dlq-depth and raxx-email-outbound-sender-dlq-depth alarms.

When both raxx-email-probe-failure AND a DLQ alarm fire simultaneously, treat as high-confidence end-to-end delivery failure. Both signals together = F1 + F4 layer. Start diagnosis at Step 1 of this runbook.

Cross-references

Architecture doc: docs/architecture/durable-email-delivery.md Section 9 (original SOP)
ADR-0072: docs/architecture/adr/0072-durable-email-sns-sqs-ses.md
ADR-0074: docs/architecture/adr/0074-email-delivery-hybrid-postmark-v1.md
Terraform stack: terraform/modules/email-delivery-stack/
SC-E2 (#1665): Terraform stack provisioning
SC-E3 (#1666): Inbound Lambda (blocked on this runbook)
SC-E7 (#1670): Synthetic probe (probe alarm + DLQ alarm = F1 + F4 detection)
FreeScout runbook: docs/ops/runbooks/freescout.md
Postmark status page: https://status.postmarkapp.com

Emergency stop

To halt all email processing without message loss (SQS retains messages for 14 days):

# Set inbound Lambda concurrency to 0 (kill-switch)
aws lambda put-function-concurrency \
  --function-name raxx-email-inbound-bridge \
  --reserved-concurrent-executions 0 \
  --region us-east-1

# Set outbound Lambda concurrency to 0
aws lambda put-function-concurrency \
  --function-name raxx-email-outbound-sender \
  --reserved-concurrent-executions 0 \
  --region us-east-1

Messages accumulate in SQS. To resume:

aws lambda put-function-concurrency \
  --function-name raxx-email-inbound-bridge \
  --reserved-concurrent-executions 3 \
  --region us-east-1

aws lambda put-function-concurrency \
  --function-name raxx-email-outbound-sender \
  --reserved-concurrent-executions 5 \
  --region us-east-1

Or use Terraform: update lambda_inbound_reserved_concurrency / lambda_outbound_reserved_concurrency in terraform.tfvars and terraform apply.
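
Before and after using the kill-switch it can be useful to confirm the current setting. The sketch below reads the reserved concurrency with get-function-concurrency; kill_switch_state is a hypothetical helper name.

```shell
# Hypothetical helper: report whether a function is halted by the kill-switch.
# Usage: kill_switch_state <function-name>
kill_switch_state() {
  reserved=$(aws lambda get-function-concurrency \
    --function-name "$1" \
    --query 'ReservedConcurrentExecutions' \
    --output text \
    --region us-east-1)
  case "$reserved" in
    0)    echo "halted" ;;
    None) echo "running (no reserved concurrency set)" ;;
    '')   echo "unknown (CLI call returned nothing)" ;;
    *)    echo "running (reserved=$reserved)" ;;
  esac
}
```

For example, `kill_switch_state raxx-email-inbound-bridge` should print "halted" after the emergency stop and "running (reserved=3)" after the resume command above.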

Escalation

Wake the operator when:
- SNS DLQ has messages (topology failure; requires Terraform intervention).
- SQS DLQ not cleared after 12h of redrive attempts.
- Root cause cannot be determined from Lambda logs (novel failure class).
- Message body appears to contain a malicious or spoofed payload.
- Any redrive causes new unexpected behavior downstream (FreeScout duplicates, Postmark bounces).

Contact: ops@raxx.app or Kristerpher's Slack DM (D0AJ7K184TV in MooseQuest workspace).