Email DLQ Redrive Runbook
System: email-delivery (SNS/SQS/Lambda — Postmark hybrid)
Owner: operator / sre-agent
Last incident: (none — initial provisioning 2026-05-11 UTC)
Last reviewed: 2026-05-11 UTC
ADR refs: ADR-0072, ADR-0074
Architecture doc: docs/architecture/durable-email-delivery.md Sections 1, 5, 9
Depends on: SC-E2 (#1665) Terraform stack applied
ARN Reference (from terraform output after SC-E2 apply)
To get live ARNs:
cd terraform/modules/email-delivery-stack
AWS_DEFAULT_REGION=us-east-1 terraform output
Queue URLs and ARNs — account 521228113048, region us-east-1:
| Resource | Name | ARN pattern |
|---|---|---|
| Inbound main queue | inbound-email-freescout-bridge.fifo | arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge.fifo |
| Outbound main queue | outbound-email-postmark-sender.fifo | arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender.fifo |
| Inbound SQS DLQ | inbound-email-freescout-bridge-dlq.fifo | arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo |
| Outbound SQS DLQ | outbound-email-postmark-sender-dlq.fifo | arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender-dlq.fifo |
| Inbound SNS topic | raxx-inbound-email.fifo | arn:aws:sns:us-east-1:521228113048:raxx-inbound-email.fifo |
| Outbound SNS topic | raxx-outbound-email.fifo | arn:aws:sns:us-east-1:521228113048:raxx-outbound-email.fifo |
| Inbound SNS DLQ | raxx-email-inbound-sns-dlq.fifo | arn:aws:sqs:us-east-1:521228113048:raxx-email-inbound-sns-dlq.fifo |
| Outbound SNS DLQ | raxx-email-outbound-sns-dlq.fifo | arn:aws:sqs:us-east-1:521228113048:raxx-email-outbound-sns-dlq.fifo |
Update this table after terraform apply outputs actual ARNs.
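The table lists ARNs, while the receive and delete commands later in this runbook take queue URLs. A minimal sketch of the conversion follows, assuming the standard aws partition and the regional endpoint format already used in this runbook. The helper name sqs_queue_url is hypothetical; aws sqs get-queue-url remains the authoritative source.

```python
def sqs_queue_url(arn):
    """Turn an SQS queue ARN into the regional queue URL (assumes the 'aws' partition)."""
    parts = arn.split(":")
    # Expected shape: arn:aws:sqs:<region>:<account-id>:<queue-name>
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "sqs"]:
        raise ValueError(f"not an SQS queue ARN: {arn!r}")
    region, account_id, queue_name = parts[3], parts[4], parts[5]
    return f"https://sqs.{region}.amazonaws.com/{account_id}/{queue_name}"


if __name__ == "__main__":
    print(sqs_queue_url(
        "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo"
    ))
```

SNS topic ARNs are rejected on purpose: topics have no queue URL.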
DLQ Architecture (two layers)
Layer 1 — SNS DLQs (topology failures, F4):
raxx-inbound-email.fifo → [SNS retries 23 days] → raxx-email-inbound-sns-dlq.fifo
raxx-outbound-email.fifo → [SNS retries 23 days] → raxx-email-outbound-sns-dlq.fifo
Layer 2 — SQS DLQs (Lambda processing failures, F5/F10):
inbound-email-freescout-bridge.fifo → [5 attempts] → inbound-email-freescout-bridge-dlq.fifo
outbound-email-postmark-sender.fifo → [5 attempts] → outbound-email-postmark-sender-dlq.fifo
SNS DLQ alarm = topology failure (subscription broken, policy revoked). SQS DLQ alarm = Lambda processing failure (code error, API down, poison message). These are different root causes with different remediation paths.
How to tell it's broken
- CloudWatch alarm raxx-email-inbound-sns-dlq-depth firing: SNS → SQS subscription broken (F4)
- CloudWatch alarm raxx-email-outbound-sns-dlq-depth firing: same, outbound side
- CloudWatch alarm raxx-email-inbound-bridge-dlq-depth firing: inbound Lambda processing failure (F5)
- CloudWatch alarm raxx-email-outbound-sender-dlq-depth firing: outbound Lambda processing failure (F5)
- CloudWatch alarm raxx-email-probe-failure firing: end-to-end inbound path broken (F1)
- Both raxx-email-probe-failure + a DLQ alarm: F1 + F4 detection layer triggered simultaneously
The alarm description field on each DLQ alarm names the failure mode and links to this runbook.
How to diagnose (in order)
Step 1 — Identify which DLQ(s) have messages
Check alarm state for all four DLQs:
aws cloudwatch describe-alarms \
--alarm-names \
"raxx-email-inbound-sns-dlq-depth" \
"raxx-email-outbound-sns-dlq-depth" \
"raxx-email-inbound-bridge-dlq-depth" \
"raxx-email-outbound-sender-dlq-depth" \
--query 'MetricAlarms[*].[AlarmName,StateValue,StateReason]' \
--output table \
--region us-east-1
Expected value: all OK. Any ALARM state requires investigation.
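For scripted checks, the same call can be run with --output json and without --query, then filtered. The sketch below assumes that JSON shape (a top-level MetricAlarms list with AlarmName and StateValue); the function name is hypothetical.

```python
import json


def alarms_needing_attention(describe_alarms_output):
    """Return (name, state) for every alarm whose state is not OK.

    Input is the JSON text printed by `aws cloudwatch describe-alarms --output json`.
    INSUFFICIENT_DATA is reported too, since it is not the expected OK state.
    """
    doc = json.loads(describe_alarms_output)
    return [
        (alarm["AlarmName"], alarm["StateValue"])
        for alarm in doc.get("MetricAlarms", [])
        if alarm["StateValue"] != "OK"
    ]


if __name__ == "__main__":
    sample = json.dumps({"MetricAlarms": [
        {"AlarmName": "raxx-email-inbound-sns-dlq-depth", "StateValue": "OK"},
        {"AlarmName": "raxx-email-inbound-bridge-dlq-depth", "StateValue": "ALARM"},
    ]})
    for name, state in alarms_needing_attention(sample):
        print(f"{name}: {state}")
```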
Step 2 — Inspect DLQ messages (does not delete them)
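The commands below use the queue URLs for the DLQs named in the ARN table above. No command is listed for the outbound SNS DLQ; reuse the inbound SNS DLQ command with raxx-email-outbound-sns-dlq.fifo in the URL. Receiving does not delete anything, but each received message stays hidden from other consumers until the 300-second visibility timeout expires.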
# Inbound SQS DLQ (Lambda processing failure)
aws sqs receive-message \
--queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
--max-number-of-messages 10 \
--attribute-names All \
--message-attribute-names All \
--visibility-timeout 300 \
--region us-east-1
# Outbound SQS DLQ (Lambda processing failure)
aws sqs receive-message \
--queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/outbound-email-postmark-sender-dlq.fifo" \
--max-number-of-messages 10 \
--attribute-names All \
--message-attribute-names All \
--visibility-timeout 300 \
--region us-east-1
# Inbound SNS DLQ (topology failure — subscription broken)
aws sqs receive-message \
--queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/raxx-email-inbound-sns-dlq.fifo" \
--max-number-of-messages 10 \
--attribute-names All \
--message-attribute-names All \
--visibility-timeout 300 \
--region us-east-1
In the response, check:
- ApproximateReceiveCount: if >= 5, the message hit max-receive-count.
- SentTimestamp: when the message first arrived (use this for log correlation).
- Body: the raw email event payload (Postmark JSON for inbound; Raptor-published JSON for outbound).
- For inbound: look for MessageID field (Postmark's unique identifier). This is the idempotency key.
- For outbound: look for correlation_id in message attributes.
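The checks above can be sketched as a small parser over the JSON that receive-message prints. This is an illustration under assumptions: the function name is hypothetical, the body is assumed to be the raw Postmark JSON (not wrapped in an SNS envelope), and the threshold of 5 mirrors the max-receive-count stated in this runbook.

```python
import json
from datetime import datetime, timezone


def summarize_dlq_messages(receive_output, max_receive_count=5):
    """Summarize the JSON printed by `aws sqs receive-message --attribute-names All`."""
    # An empty queue may produce no output at all, so treat blank input as no messages.
    doc = json.loads(receive_output) if receive_output.strip() else {}
    rows = []
    for msg in doc.get("Messages", []):
        attrs = msg.get("Attributes", {})
        receive_count = int(attrs.get("ApproximateReceiveCount", 0))
        sent_ms = int(attrs["SentTimestamp"])  # epoch milliseconds
        try:
            body = json.loads(msg["Body"])
            body_is_json = True
        except json.JSONDecodeError:
            body, body_is_json = None, False  # candidate poison message
        rows.append({
            "sqs_message_id": msg.get("MessageId"),
            "receive_count": receive_count,
            "hit_max_receive_count": receive_count >= max_receive_count,
            "sent_ms": sent_ms,
            "sent_utc": datetime.fromtimestamp(sent_ms / 1000, tz=timezone.utc).isoformat(),
            "body_is_json": body_is_json,
            # Postmark's identifier for inbound events; None when absent.
            "postmark_message_id": body.get("MessageID") if isinstance(body, dict) else None,
        })
    return rows
```

A row with body_is_json set to False is a candidate for the poison-message path rather than a redrive.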
Step 3 — Correlate with Lambda CloudWatch Logs
SentTimestamp is in epoch milliseconds, which --start-time and --end-time accept as-is. Convert it to a human-readable UTC time for your notes, then query Lambda logs:
# Inbound Lambda logs — filter for ERROR around message timestamp
aws logs filter-log-events \
--log-group-name /aws/lambda/raxx-email-inbound-bridge \
--start-time <epoch_ms_from_SentTimestamp> \
--end-time <epoch_ms_plus_60min> \
--filter-pattern '"ERROR"' \
--region us-east-1
# Outbound Lambda logs
aws logs filter-log-events \
--log-group-name /aws/lambda/raxx-email-outbound-sender \
--start-time <epoch_ms_from_SentTimestamp> \
--end-time <epoch_ms_plus_60min> \
--filter-pattern '"ERROR"' \
--region us-east-1
# Check for IAM permission errors specifically (F19)
aws logs filter-log-events \
--log-group-name /aws/lambda/raxx-email-inbound-bridge \
--start-time <epoch_ms_from_SentTimestamp> \
--filter-pattern '"AccessDeniedException"' \
--region us-east-1
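The start and end values for the queries above can be derived from SentTimestamp directly. A minimal sketch, with hypothetical helper names and the 60-minute window taken from the placeholders in the commands:

```python
from datetime import datetime, timezone


def log_window(sent_timestamp_ms, minutes=60):
    """Return (start_ms, end_ms) for filter-log-events, both in epoch milliseconds."""
    start_ms = int(sent_timestamp_ms)  # the CLI returns the attribute as a string
    return start_ms, start_ms + minutes * 60 * 1000


def to_utc(epoch_ms):
    """Render epoch milliseconds as a UTC timestamp for ops notes."""
    moment = datetime.fromtimestamp(int(epoch_ms) / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


if __name__ == "__main__":
    start, end = log_window("1778500000000")
    print(f"--start-time {start} --end-time {end}")
    print(to_utc(start), "to", to_utc(end))
```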
Common error signatures:
- AccessDeniedException → IAM issue (F19). Check the Lambda role policy matches iam.tf.
- FreeScout API 5xx / ConnectionError → FreeScout down (F7). Check FreeScout health.
- Postmark API 429 → Rate limit (F8). Exponential backoff should handle; check if sustained.
- Postmark API 422 → Malformed payload or suppressed recipient. Investigate sender/recipient.
- JSONDecodeError / KeyError → Malformed message body (F10/poison message).
- ReadTimeoutError → FreeScout or Postmark API slow; visibility timeout may have been exceeded.
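As a rough triage aid, the signatures above can be matched against a log line. The sketch assumes log lines contain these literal strings (for example "Postmark" next to the status code); the real log format is not specified here, so treat the patterns and the function name as hypothetical.

```python
import re


def classify_error(log_line):
    """Map a Lambda log line to the failure class listed in this runbook."""
    if "AccessDeniedException" in log_line:
        return "F19: IAM issue"
    if "JSONDecodeError" in log_line or "KeyError" in log_line:
        return "F10: malformed message body (poison message)"
    if "ReadTimeoutError" in log_line:
        return "slow FreeScout or Postmark API"
    if "Postmark" in log_line and re.search(r"\b429\b", log_line):
        return "F8: Postmark rate limit"
    if "Postmark" in log_line and re.search(r"\b422\b", log_line):
        return "malformed payload or suppressed recipient"
    if "ConnectionError" in log_line or (
        "FreeScout" in log_line and re.search(r"\b5\d\d\b", log_line)
    ):
        return "F7: FreeScout down"
    return "unknown: escalate as a novel failure class"


if __name__ == "__main__":
    print(classify_error("ERROR FreeScout API 503 Service Unavailable"))
```

Anything classified as unknown falls under the escalation rule for root causes that cannot be determined from Lambda logs.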
Step 4 — Check SNS subscription health (if SNS DLQ is firing)
If the SNS DLQ (not SQS DLQ) has messages, the SQS subscription is broken:
# List all subscriptions for the inbound topic
aws sns list-subscriptions-by-topic \
--topic-arn "arn:aws:sns:us-east-1:521228113048:raxx-inbound-email.fifo" \
--region us-east-1
# Expected: one subscription with Protocol=sqs, Endpoint=<inbound bridge queue ARN>
# If subscription is missing or SubscriptionArn shows "PendingConfirmation": broken
Check SQS queue policy allows SNS to send:
aws sqs get-queue-attributes \
--queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge.fifo" \
--attribute-names Policy \
--region us-east-1 | python3 -c "import json,sys; p=json.load(sys.stdin); print(json.dumps(json.loads(p['Attributes']['Policy']), indent=2))"
Expected: Statement with Principal.Service = sns.amazonaws.com and Action = sqs:SendMessage.
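The expected statement can also be checked mechanically. The sketch below takes the decoded policy JSON (the string stored under Attributes.Policy) and looks for an Allow statement with the sns.amazonaws.com service principal and sqs:SendMessage. It is deliberately shallow: it ignores Condition blocks and explicit Deny statements, and the function names are hypothetical.

```python
import json


def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def allows_sns_send(policy_json):
    """True if any Allow statement lets sns.amazonaws.com call sqs:SendMessage."""
    policy = json.loads(policy_json)
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        services = _as_list(principal.get("Service", [])) if isinstance(principal, dict) else []
        # Action names are case-insensitive in IAM policies.
        actions = [action.lower() for action in _as_list(stmt.get("Action", []))]
        if "sns.amazonaws.com" in services and (
            "sqs:sendmessage" in actions or "sqs:*" in actions
        ):
            return True
    return False
```

A False result points at failure mode D (policy revoked); a True result does not prove delivery works, since a Condition on the source topic ARN may still not match.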
Known failure modes
Failure mode A: SQS DLQ depth > 0 — FreeScout API was transiently down
Symptom: raxx-email-inbound-bridge-dlq-depth alarm fires. Lambda logs show FreeScout 503 or connection timeout around the message timestamp.
Cause: FreeScout on Lightsail was restarting or overloaded when the Lambda invocation ran.
Fix: Verify FreeScout is now healthy (curl -s -o /dev/null -w '%{http_code}' https://tickets.raxx.app/api/users prints 200). Then redrive.
Verification: DLQ depth returns to 0. Check FreeScout — new conversation created for the email.
Failure mode B: SQS DLQ depth > 0 — Postmark API down or rate limited
Symptom: raxx-email-outbound-sender-dlq-depth alarm fires. Lambda logs show Postmark 5xx or 429.
Cause: Postmark service degraded or rate limit hit.
Fix: Check Postmark status page (https://status.postmarkapp.com). If recovered, redrive. If sustained 429, reduce lambda_outbound_reserved_concurrency to 1 in terraform.tfvars and re-apply.
Verification: DLQ depth returns to 0. Check Postmark activity log for delivery confirmation.
Failure mode C: SQS DLQ depth > 0 — IAM permission error
Symptom: Lambda logs show AccessDeniedException. DLQ message has ApproximateReceiveCount >= 5.
Cause: Lambda execution role is missing a permission (F19). May be a Terraform state drift or a manual IAM change.
Fix: terraform plan from terraform/modules/email-delivery-stack/ to detect drift. If drift found, terraform apply to restore. No Lambda code change needed.
Verification: Lambda processes subsequent messages without AccessDeniedException.
Failure mode D: SNS DLQ depth > 0 — subscription deleted or policy revoked
Symptom: raxx-email-inbound-sns-dlq-depth fires. SNS subscription list shows missing or PendingConfirmation.
Cause: SNS → SQS subscription was deleted (manual action, Terraform destroy, or IAM policy revoke on SQS queue). F4.
Fix: terraform apply to restore subscription. Check CloudTrail for who deleted the subscription.
Verification: SNS DLQ drains (no new messages). Test: publish a test message to SNS topic and verify it lands in SQS.
Failure mode E: Poison message — malformed payload
Symptom: DLQ message body contains invalid JSON, missing required fields, or non-UTF-8 content.
Cause: Upstream publisher sent a malformed message (Postmark webhook schema change, Raptor bug).
Fix: Discard the message (do not redrive). File a type:reliability issue with the malformed payload example. Fix the upstream publisher or Lambda schema validation.
Verification: No new poison messages appear after fix.
Failure mode F: Suspected spoofed or malicious message
Symptom: DLQ message appears to be a crafted/spoofed email payload (wrong X-Postmark-Signature, unexpected From domain, payload structure doesn't match Postmark JSON schema).
Cause: External attacker posting to the API Gateway endpoint with a forged payload.
Fix: DO NOT redrive. Quarantine to S3 (copy the raw message body to an S3 object for forensics). Open a security incident. Rotate the Postmark inbound webhook token in Postmark admin + update SSM /raxx/email/postmark_inbound_webhook_token.
Verification: Token rotated, Postmark admin updated, new test probe succeeds.
Disposition table
| Root cause | Disposition | Action |
|---|---|---|
| FreeScout was transiently down (now recovered) | Redrive | start-message-move-task |
| Postmark API was down (now recovered) | Redrive | start-message-move-task |
| Code bug (Lambda fixed, redeployed) | Redrive | start-message-move-task |
| IAM issue fixed via Terraform | Redrive | start-message-move-task |
| Malformed message (no Message-ID, corrupt UTF-8) | Discard | Delete from DLQ; root-cause at source |
| Duplicate of already-processed message | Discard | Idempotency guard skips it; discard to avoid noise |
| Suspected spoofed/malicious | Quarantine | S3 copy → security incident → rotate token |
Redrive procedure
Preferred: AWS Console (creates CloudTrail audit event)
- Open AWS Console → SQS → select the DLQ queue.
- Click "Dead-letter queue" tab → "Start DLQ redrive".
- Set destination to the source queue (remove -dlq from the name).
- Set message move throughput to match Lambda reserved concurrency (e.g., 3 messages/second for inbound).
- Click "Start". AWS generates a CloudTrail event: StartMessageMoveTask.
- Monitor: watch DLQ ApproximateNumberOfMessagesVisible drop to 0.
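The "remove -dlq from the name" rule applies only to the two SQS DLQs in the ARN table. A small sketch of that rule follows; the function name is hypothetical. It refuses the SNS DLQs, whose names do not map to a source queue this way.

```python
def redrive_destination_arn(dlq_arn):
    """Derive the source-queue ARN for an SQS DLQ by dropping '-dlq' from the queue name."""
    prefix, _, queue_name = dlq_arn.rpartition(":")
    if queue_name.endswith("-sns-dlq.fifo"):
        # SNS DLQs hold undelivered topic messages; there is no '<name>.fifo' source queue.
        raise ValueError(f"SNS DLQ has no source queue by naming rule: {queue_name}")
    if not queue_name.endswith("-dlq.fifo"):
        raise ValueError(f"not a DLQ name: {queue_name}")
    source_name = queue_name[: -len("-dlq.fifo")] + ".fifo"
    return f"{prefix}:{source_name}"


if __name__ == "__main__":
    print(redrive_destination_arn(
        "arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender-dlq.fifo"
    ))
```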
CLI method (for scripting or when console is unavailable)
# Inbound SQS DLQ → inbound main queue
aws sqs start-message-move-task \
--source-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo" \
--destination-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge.fifo" \
--max-number-of-messages-per-second 3 \
--region us-east-1
# Outbound SQS DLQ → outbound main queue
aws sqs start-message-move-task \
--source-arn "arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender-dlq.fifo" \
--destination-arn "arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender.fifo" \
--max-number-of-messages-per-second 5 \
--region us-east-1
# Check redrive task status
aws sqs list-message-move-tasks \
--source-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo" \
--region us-east-1
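The status output can be reduced to the few fields that matter during a redrive. The sketch reads the JSON printed by list-message-move-tasks and picks the task with the latest StartedTimestamp; the function name is hypothetical, and the field names follow the SQS ListMessageMoveTasks response.

```python
import json


def latest_redrive_task(list_tasks_output):
    """Summarize the most recently started move task, or None if there is none."""
    results = json.loads(list_tasks_output).get("Results", [])
    if not results:
        return None
    task = max(results, key=lambda t: t.get("StartedTimestamp", 0))
    return {
        "status": task.get("Status"),  # RUNNING, COMPLETED, CANCELLING, CANCELLED, FAILED
        "moved": task.get("ApproximateNumberOfMessagesMoved", 0),
        "to_move": task.get("ApproximateNumberOfMessagesToMove"),
        "failure_reason": task.get("FailureReason"),
        "done": task.get("Status") == "COMPLETED",
    }


if __name__ == "__main__":
    sample = json.dumps({"Results": [{
        "Status": "RUNNING",
        "ApproximateNumberOfMessagesMoved": 2,
        "ApproximateNumberOfMessagesToMove": 7,
        "StartedTimestamp": 1778500000000,
    }]})
    print(latest_redrive_task(sample))
```

A FAILED status with a failure_reason is a signal to stop and escalate rather than start another task.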
Manual discard (single message)
If you've examined a message and determined it must be discarded:
# 1. Receive the message to get the ReceiptHandle
RECEIPT=$(aws sqs receive-message \
--queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
--max-number-of-messages 1 \
--visibility-timeout 300 \
--region us-east-1 \
--query 'Messages[0].ReceiptHandle' \
--output text)
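# 2. Delete the message (irreversible; confirm root cause first)
# With --output text an empty receive yields the literal string "None", so guard against it.
if [ -z "$RECEIPT" ] || [ "$RECEIPT" = "None" ]; then
  echo "No message received; nothing deleted" >&2
else
  aws sqs delete-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
  --receipt-handle "$RECEIPT" \
  --region us-east-1
fi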
Post-redrive verification checklist
After a redrive:
- DLQ ApproximateNumberOfMessagesVisible returns to 0 (CloudWatch console or CLI).
- Lambda successfully processes the redriven messages: check CloudWatch Logs for successful invocations (no ERROR lines, look for "outcome": "success" structured log event).
- Downstream verified:
  - Inbound: FreeScout conversation created. Search FreeScout for the message's MessageID value or probe ID.
  - Outbound: Postmark delivery confirmed. Check Postmark dashboard → Activity → Message detail shows Delivered.
- Write an ops note in FreeScout ops@ mailbox ticket if the failure affected a customer-visible email path. Required fields: timestamp (UTC), root cause, resolution, message count affected, redrive time (UTC).
- If the DLQ had > 1 message: verify all messages were redriven (check task status with list-message-move-tasks).
Audit trail expectations
Every redrive produces:
- CloudTrail event: StartMessageMoveTask (IAM principal, timestamp, source/destination ARNs).
- Lambda CloudWatch Log: structured JSON for each processed message: event_type, message_id_hash (SHA-256, truncated — no raw PII), mailbox_id (inbound) or correlation_id (outbound), outcome, timestamp_utc.
- FreeScout ops@ note: written manually by operator per checklist item 4 above.
- DynamoDB dedup record: dedup_key written with TTL after successful processing.
If a message was discarded instead of redriven, write a manual audit entry in the FreeScout ops@ mailbox with: reason for discard, message body hash (SHA-256), timestamp (UTC).
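The discard entry can be assembled so that the raw body never reaches the ticket. A minimal sketch follows, assuming the body is the string returned by receive-message and is hashed as UTF-8; the function name and the exact entry layout are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def discard_audit_entry(body, reason, when=None):
    """Build the manual audit text for a discarded DLQ message (hash only, no raw body)."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return (
        f"Discarded DLQ message\n"
        f"Reason: {reason}\n"
        f"Body SHA-256: {digest}\n"
        f"Timestamp: {stamp}"
    )


if __name__ == "__main__":
    print(discard_audit_entry('{"MessageID": null}', "malformed payload, no MessageID"))
```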
Slack escalation thresholds
| Condition | Action |
|---|---|
| Any DLQ alarm fires | Investigate immediately using this runbook. Post status to ops@ FreeScout. |
| DLQ not cleared within 12h of alarm fire | Page operator (SEV-2 escalation). |
| DLQ not cleared within 24h | SEV-1 escalation: customer email delivery failure. |
| SNS DLQ fires | Always escalate to operator same-day — topology failure requires Terraform review. |
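The thresholds in this table can be expressed as a small function for use in an alerting script. This is a sketch of the table only; the function name and the returned strings are hypothetical, and both timestamps are assumed to be timezone-aware UTC datetimes.

```python
from datetime import datetime, timedelta, timezone


def escalation_level(alarm_fired_at, now, is_sns_dlq=False):
    """Map time since a DLQ alarm fired to the action in the thresholds table."""
    age = now - alarm_fired_at
    if age >= timedelta(hours=24):
        return "SEV-1"
    if age >= timedelta(hours=12):
        return "SEV-2"
    if is_sns_dlq:
        # Topology failures always go to the operator the same day.
        return "escalate to operator same-day"
    return "investigate"


if __name__ == "__main__":
    fired = datetime(2026, 5, 11, 8, 0, tzinfo=timezone.utc)
    print(escalation_level(fired, fired + timedelta(hours=13)))
```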
Detection sources
This runbook is reached via two detection paths:
F1 + F4 detection layer (end-to-end probe + topology):
- raxx-email-probe-failure alarm: probe email to support@raxx.app not seen in FreeScout within 5 min. Wired up by SC-E7 (#1670).
- Any DLQ alarm: confirms where in the pipeline the failure occurred.
F5 detection (Lambda processing failure):
- raxx-email-inbound-bridge-dlq-depth and raxx-email-outbound-sender-dlq-depth alarms.
When both raxx-email-probe-failure AND a DLQ alarm fire simultaneously, treat as high-confidence end-to-end delivery failure. Both signals together = F1 + F4 layer. Start diagnosis at Step 1 of this runbook.
Cross-references
- Architecture doc: docs/architecture/durable-email-delivery.md Section 9 (original SOP)
- ADR-0072: docs/architecture/adr/0072-durable-email-sns-sqs-ses.md
- ADR-0074: docs/architecture/adr/0074-email-delivery-hybrid-postmark-v1.md
- Terraform stack: terraform/modules/email-delivery-stack/
- SC-E2 (#1665): Terraform stack provisioning
- SC-E3 (#1666): Inbound Lambda (blocked on this runbook)
- SC-E7 (#1670): Synthetic probe (probe alarm + DLQ alarm = F1 + F4 detection)
- FreeScout runbook: docs/ops/runbooks/freescout.md
- Postmark status page: https://status.postmarkapp.com
Emergency stop
To halt all email processing without message loss (SQS retains messages for 14 days):
# Set inbound Lambda concurrency to 0 (kill-switch)
aws lambda put-function-concurrency \
--function-name raxx-email-inbound-bridge \
--reserved-concurrent-executions 0 \
--region us-east-1
# Set outbound Lambda concurrency to 0
aws lambda put-function-concurrency \
--function-name raxx-email-outbound-sender \
--reserved-concurrent-executions 0 \
--region us-east-1
Messages accumulate in SQS. To resume:
aws lambda put-function-concurrency \
--function-name raxx-email-inbound-bridge \
--reserved-concurrent-executions 3 \
--region us-east-1
aws lambda put-function-concurrency \
--function-name raxx-email-outbound-sender \
--reserved-concurrent-executions 5 \
--region us-east-1
Or use Terraform: update lambda_inbound_reserved_concurrency / lambda_outbound_reserved_concurrency in terraform.tfvars and terraform apply.
Escalation
Wake the operator when:
- SNS DLQ has messages (topology failure; requires Terraform intervention).
- SQS DLQ not cleared after 12h of redrive attempts.
- Root cause cannot be determined from Lambda logs (novel failure class).
- Message body appears to contain a malicious or spoofed payload.
- Any redrive causes new unexpected behavior downstream (FreeScout duplicates, Postmark bounces).
Contact: ops@raxx.app or Kristerpher's Slack DM (D0AJ7K184TV in MooseQuest workspace).