Email DLQ Redrive Runbook

System: email-delivery (SNS/SQS/Lambda, Postmark hybrid)
Owner: operator / sre-agent
Last incident: (none; initial provisioning 2026-05-11 UTC)
Last reviewed: 2026-05-11 UTC
ADR refs: ADR-0072, ADR-0074
Architecture doc: docs/architecture/durable-email-delivery.md (Sections 1, 5, 9)
Depends on: SC-E2 (#1665) Terraform stack applied


ARN Reference (from terraform output after SC-E2 apply)

To get live ARNs:

cd terraform/modules/email-delivery-stack
AWS_DEFAULT_REGION=us-east-1 terraform output

Queue URLs and ARNs — account 521228113048, region us-east-1:

Resource             Name                                     ARN pattern
Inbound main queue   inbound-email-freescout-bridge.fifo      arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge.fifo
Outbound main queue  outbound-email-postmark-sender.fifo      arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender.fifo
Inbound SQS DLQ      inbound-email-freescout-bridge-dlq.fifo  arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo
Outbound SQS DLQ     outbound-email-postmark-sender-dlq.fifo  arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender-dlq.fifo
Inbound SNS topic    raxx-inbound-email.fifo                  arn:aws:sns:us-east-1:521228113048:raxx-inbound-email.fifo
Outbound SNS topic   raxx-outbound-email.fifo                 arn:aws:sns:us-east-1:521228113048:raxx-outbound-email.fifo
Inbound SNS DLQ      raxx-email-inbound-sns-dlq.fifo          arn:aws:sqs:us-east-1:521228113048:raxx-email-inbound-sns-dlq.fifo
Outbound SNS DLQ     raxx-email-outbound-sns-dlq.fifo         arn:aws:sqs:us-east-1:521228113048:raxx-email-outbound-sns-dlq.fifo

Update this table after terraform apply outputs actual ARNs.


DLQ Architecture (two layers)

Layer 1 — SNS DLQs (topology failures, F4):
  raxx-inbound-email.fifo  → [SNS retries 23 days] → raxx-email-inbound-sns-dlq.fifo
  raxx-outbound-email.fifo → [SNS retries 23 days] → raxx-email-outbound-sns-dlq.fifo

Layer 2 — SQS DLQs (Lambda processing failures, F5/F10):
  inbound-email-freescout-bridge.fifo  → [5 attempts] → inbound-email-freescout-bridge-dlq.fifo
  outbound-email-postmark-sender.fifo  → [5 attempts] → outbound-email-postmark-sender-dlq.fifo

SNS DLQ alarm = topology failure (subscription broken, policy revoked). SQS DLQ alarm = Lambda processing failure (code error, API down, poison message). These are different root causes with different remediation paths.
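As an illustration of the two-layer split, the sketch below maps a DLQ queue name to its layer. The function name dlq_layer and the returned labels are hypothetical and not part of the stack; the matching relies only on the naming pattern in the ARN table (-sns-dlq.fifo for SNS DLQs, -dlq.fifo for SQS DLQs).

```shell
# Hypothetical helper: classify a DLQ by queue name.
# Assumes the naming pattern from the ARN table; labels are illustrative.
dlq_layer() {
  case "$1" in
    *-sns-dlq.fifo) echo "layer1-sns-topology" ;;    # subscription or policy problem
    *-dlq.fifo)     echo "layer2-sqs-processing" ;;  # Lambda processing problem
    *)              echo "not-a-dlq" ;;
  esac
}
```

For example, dlq_layer raxx-email-outbound-sns-dlq.fifo prints layer1-sns-topology, which points to Step 4 rather than the Lambda logs.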


How to tell it's broken

The alarm description field on each DLQ alarm names the failure mode and links to this runbook.


How to diagnose (in order)

Step 1 — Identify which DLQ(s) have messages

Check alarm state for all four DLQs:

aws cloudwatch describe-alarms \
  --alarm-names \
    "raxx-email-inbound-sns-dlq-depth" \
    "raxx-email-outbound-sns-dlq-depth" \
    "raxx-email-inbound-bridge-dlq-depth" \
    "raxx-email-outbound-sender-dlq-depth" \
  --query 'MetricAlarms[*].[AlarmName,StateValue,StateReason]' \
  --output table \
  --region us-east-1

Expected value: all OK. Any ALARM state requires investigation.
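When scripting this check, a small filter can reduce the output to the alarms that need attention. This is a minimal sketch: alarms_needing_attention is a hypothetical name, and it assumes the input is the tab-separated text that the AWS CLI prints for --output text with a two-column query. It lists every non-OK state, so INSUFFICIENT_DATA shows up alongside ALARM.

```shell
# Hypothetical filter: read "AlarmName<TAB>StateValue" rows on stdin and
# print "AlarmName StateValue" for every alarm whose state is not OK.
alarms_needing_attention() {
  awk -F'\t' '$2 != "OK" { print $1, $2 }'
}
```

One way to feed it is to run the describe-alarms command above with --query 'MetricAlarms[*].[AlarmName,StateValue]' --output text and pipe the result into alarms_needing_attention. Empty output means all four alarms are OK.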

Step 2 — Inspect DLQ messages (does not delete them)

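The commands below use full queue URLs, which follow the pattern https://sqs.us-east-1.amazonaws.com/521228113048/<queue name>. To inspect the outbound SNS DLQ, use raxx-email-outbound-sns-dlq.fifo as the queue name. receive-message does not delete anything, but each received message stays hidden from other consumers for the 300-second visibility timeout.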

# Inbound SQS DLQ (Lambda processing failure)
aws sqs receive-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All \
  --visibility-timeout 300 \
  --region us-east-1

# Outbound SQS DLQ (Lambda processing failure)
aws sqs receive-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/outbound-email-postmark-sender-dlq.fifo" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All \
  --visibility-timeout 300 \
  --region us-east-1

# Inbound SNS DLQ (topology failure — subscription broken)
aws sqs receive-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/raxx-email-inbound-sns-dlq.fifo" \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All \
  --visibility-timeout 300 \
  --region us-east-1

In the response, check:
- ApproximateReceiveCount: if >= 5, the message hit max-receive-count.
- SentTimestamp: when the message first arrived (use this for log correlation).
- Body: the raw email event payload (Postmark JSON for inbound; Raptor-published JSON for outbound).
- For inbound: look for the MessageID field (Postmark's unique identifier). This is the idempotency key.
- For outbound: look for correlation_id in the message attributes.

Step 3 — Correlate with Lambda CloudWatch Logs

Convert the SentTimestamp from epoch milliseconds to a human-readable time, then query Lambda logs:

# Inbound Lambda logs — filter for ERROR around message timestamp
aws logs filter-log-events \
  --log-group-name /aws/lambda/raxx-email-inbound-bridge \
  --start-time <epoch_ms_from_SentTimestamp> \
  --end-time <epoch_ms_plus_60min> \
  --filter-pattern '"ERROR"' \
  --region us-east-1

# Outbound Lambda logs
aws logs filter-log-events \
  --log-group-name /aws/lambda/raxx-email-outbound-sender \
  --start-time <epoch_ms_from_SentTimestamp> \
  --end-time <epoch_ms_plus_60min> \
  --filter-pattern '"ERROR"' \
  --region us-east-1

# Check for IAM permission errors specifically (F19)
aws logs filter-log-events \
  --log-group-name /aws/lambda/raxx-email-inbound-bridge \
  --start-time <epoch_ms_from_SentTimestamp> \
  --filter-pattern '"AccessDeniedException"' \
  --region us-east-1
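The timestamp arithmetic for the placeholders above can be sketched as follows. Both function names are hypothetical. epoch_ms_to_utc tries the GNU date syntax first and falls back to the BSD/macOS form, and log_window_end_ms adds the 60-minute window used for --end-time.

```shell
# Hypothetical helpers for Step 3 (not part of the stack).

# Convert epoch milliseconds (SentTimestamp) to an ISO-8601 UTC string.
epoch_ms_to_utc() {
  local secs=$(( $1 / 1000 ))
  date -u -d "@${secs}" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null \
    || date -u -r "${secs}" '+%Y-%m-%dT%H:%M:%SZ'
}

# Compute the --end-time value: start plus 60 minutes, in milliseconds.
log_window_end_ms() {
  echo $(( $1 + 60 * 60 * 1000 ))
}
```

For a SentTimestamp of 1747000000000, epoch_ms_to_utc prints 2025-05-11T21:46:40Z and log_window_end_ms prints 1747003600000, which would go into --start-time and --end-time respectively.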

Common error signatures:
- AccessDeniedException → IAM issue (F19). Check the Lambda role policy matches iam.tf.
- FreeScout API 5xx / ConnectionError → FreeScout down (F7). Check FreeScout health.
- Postmark API 429 → Rate limit (F8). Exponential backoff should handle; check if sustained.
- Postmark API 422 → Malformed payload or suppressed recipient. Investigate sender/recipient.
- JSONDecodeError / KeyError → Malformed message body (F10/poison message).
- ReadTimeoutError → FreeScout or Postmark API slow; visibility timeout may have been exceeded.
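The signature list can be turned into a rough first-pass classifier for a single log line. This is only a sketch: classify_log_line is a hypothetical name, the substring patterns are assumptions about how the Lambdas word their errors, and the output labels are illustrative. Anything it does not recognize is reported as unknown and needs manual reading.

```shell
# Hypothetical first-pass triage of one Lambda log line.
# Patterns are assumed substrings; adjust them to the real log format.
classify_log_line() {
  case "$1" in
    *AccessDeniedException*)      echo "F19" ;;           # IAM issue
    *FreeScout*5[0-9][0-9]*)      echo "F7" ;;            # FreeScout 5xx
    *ConnectionError*)            echo "F7" ;;            # FreeScout unreachable
    *Postmark*429*)               echo "F8" ;;            # rate limit
    *Postmark*422*)               echo "postmark-422" ;;  # malformed or suppressed recipient
    *JSONDecodeError*|*KeyError*) echo "F10" ;;           # poison message
    *ReadTimeoutError*)           echo "timeout" ;;       # slow upstream API
    *)                            echo "unknown" ;;
  esac
}
```

A line such as "ERROR Postmark API 429 Too Many Requests" would be labeled F8, which corresponds to failure mode B below.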

Step 4 — Check SNS subscription health (if SNS DLQ is firing)

If the SNS DLQ (not SQS DLQ) has messages, the SQS subscription is broken:

# List all subscriptions for the inbound topic
aws sns list-subscriptions-by-topic \
  --topic-arn "arn:aws:sns:us-east-1:521228113048:raxx-inbound-email.fifo" \
  --region us-east-1

# Expected: one subscription with Protocol=sqs, Endpoint=<inbound bridge queue ARN>
# If subscription is missing or SubscriptionArn shows "PendingConfirmation": broken

Check SQS queue policy allows SNS to send:

aws sqs get-queue-attributes \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge.fifo" \
  --attribute-names Policy \
  --region us-east-1 | python3 -c "import json,sys; p=json.load(sys.stdin); print(json.dumps(json.loads(p['Attributes']['Policy']), indent=2))"

Expected: Statement with Principal.Service = sns.amazonaws.com and Action = sqs:SendMessage.


Known failure modes

Failure mode A: SQS DLQ depth > 0 — FreeScout API was transiently down

Symptom: raxx-email-inbound-bridge-dlq-depth alarm fires. Lambda logs show FreeScout 503 or connection timeout around the message timestamp.
Cause: FreeScout on Lightsail was restarting or overloaded when the Lambda invocation ran.
Fix: Verify FreeScout is now healthy (curl -s -o /dev/null -w '%{http_code}' https://tickets.raxx.app/api/users prints 200). Then redrive.
Verification: DLQ depth returns to 0. Check FreeScout: a new conversation was created for the email.

Failure mode B: SQS DLQ depth > 0 — Postmark API down or rate limited

Symptom: raxx-email-outbound-sender-dlq-depth alarm fires. Lambda logs show Postmark 5xx or 429.
Cause: Postmark service degraded or rate limit hit.
Fix: Check the Postmark status page (https://status.postmarkapp.com). If recovered, redrive. If the 429s are sustained, reduce lambda_outbound_reserved_concurrency to 1 in terraform.tfvars and re-apply.
Verification: DLQ depth returns to 0. Check the Postmark activity log for delivery confirmation.

Failure mode C: SQS DLQ depth > 0 — IAM permission error

Symptom: Lambda logs show AccessDeniedException. DLQ message has ApproximateReceiveCount >= 5.
Cause: Lambda execution role is missing a permission (F19). May be Terraform state drift or a manual IAM change.
Fix: Run terraform plan from terraform/modules/email-delivery-stack/ to detect drift. If drift is found, run terraform apply to restore. No Lambda code change needed.
Verification: Lambda processes subsequent messages without AccessDeniedException.

Failure mode D: SNS DLQ depth > 0 — subscription deleted or policy revoked

Symptom: raxx-email-inbound-sns-dlq-depth fires. The SNS subscription list shows the subscription as missing or PendingConfirmation.
Cause: SNS → SQS subscription was deleted (manual action, Terraform destroy, or IAM policy revoke on SQS queue). F4.
Fix: Run terraform apply to restore the subscription. Check CloudTrail for who deleted the subscription.
Verification: SNS DLQ stops receiving new messages. Test: publish a test message to the SNS topic and verify it lands in SQS.

Failure mode E: Poison message — malformed payload

Symptom: DLQ message body contains invalid JSON, missing required fields, or non-UTF-8 content.
Cause: Upstream publisher sent a malformed message (Postmark webhook schema change, Raptor bug).
Fix: Discard the message (do not redrive). File a type:reliability issue with the malformed payload example. Fix the upstream publisher or Lambda schema validation.
Verification: No new poison messages appear after fix.

Failure mode F: Suspected spoofed or malicious message

Symptom: DLQ message appears to be a crafted/spoofed email payload (wrong X-Postmark-Signature, unexpected From domain, payload structure doesn't match Postmark JSON schema).
Cause: External attacker posting to the API Gateway endpoint with a forged payload.
Fix: DO NOT redrive. Quarantine to S3 (copy the raw message body to an S3 object for forensics). Open a security incident. Rotate the Postmark inbound webhook token in Postmark admin + update SSM /raxx/email/postmark_inbound_webhook_token.
Verification: Token rotated, Postmark admin updated, new test probe succeeds.


Disposition table

Root cause                                        Disposition  Action
FreeScout was transiently down (now recovered)    Redrive      start-message-move-task
Postmark API was down (now recovered)             Redrive      start-message-move-task
Code bug (Lambda fixed, redeployed)               Redrive      start-message-move-task
IAM issue fixed via Terraform                     Redrive      start-message-move-task
Malformed message (no Message-ID, corrupt UTF-8)  Discard      Delete from DLQ; root-cause at source
Duplicate of already-processed message            Discard      Idempotency guard skips it; discard to avoid noise
Suspected spoofed/malicious                       Quarantine   S3 copy → security incident → rotate token

Redrive procedure

Preferred: AWS Console (creates CloudTrail audit event)

  1. Open AWS Console → SQS → select the DLQ queue.
  2. Click "Dead-letter queue" tab → "Start DLQ redrive".
  3. Set destination to the source queue (remove -dlq from the name).
  4. Set message move throughput to match Lambda reserved concurrency (e.g., 3 messages/second for inbound).
  5. Click "Start". AWS generates a CloudTrail event: StartMessageMoveTask.
  6. Monitor: watch DLQ ApproximateNumberOfMessagesVisible drop to 0.
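The "remove -dlq from the name" rule in step 3 can be sketched as a small helper for scripted redrives. source_arn_for_dlq is a hypothetical name. The helper refuses SNS DLQ ARNs on the assumption, based on the SQS API documentation, that message move tasks only support DLQs whose source is another SQS queue; an SNS DLQ has no source queue to derive in any case.

```shell
# Hypothetical helper: derive the redrive destination ARN from an SQS DLQ ARN
# by replacing the "-dlq.fifo" suffix with ".fifo".
source_arn_for_dlq() {
  case "$1" in
    arn:aws:sqs:*-sns-dlq.fifo)
      echo "SNS DLQ: no source queue to redrive to" >&2
      return 1 ;;
    arn:aws:sqs:*-dlq.fifo)
      echo "${1%-dlq.fifo}.fifo" ;;
    *)
      echo "not an SQS DLQ ARN: $1" >&2
      return 1 ;;
  esac
}
```

The output can be passed as the destination ARN of a redrive, so the source and destination cannot drift apart through a copy-paste error.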

CLI method (for scripting or when console is unavailable)

# Inbound SQS DLQ → inbound main queue
aws sqs start-message-move-task \
  --source-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo" \
  --destination-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge.fifo" \
  --max-number-of-messages-per-second 3 \
  --region us-east-1

# Outbound SQS DLQ → outbound main queue
aws sqs start-message-move-task \
  --source-arn "arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender-dlq.fifo" \
  --destination-arn "arn:aws:sqs:us-east-1:521228113048:outbound-email-postmark-sender.fifo" \
  --max-number-of-messages-per-second 5 \
  --region us-east-1

# Check redrive task status
aws sqs list-message-move-tasks \
  --source-arn "arn:aws:sqs:us-east-1:521228113048:inbound-email-freescout-bridge-dlq.fifo" \
  --region us-east-1

Manual discard (single message)

If you've examined a message and determined it must be discarded:

# 1. Receive the message to get the ReceiptHandle
RECEIPT=$(aws sqs receive-message \
  --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
  --max-number-of-messages 1 \
  --visibility-timeout 300 \
  --region us-east-1 \
  --query 'Messages[0].ReceiptHandle' \
  --output text)

# Stop if nothing was received (empty queue, or the message is still hidden
# by an earlier receive). The CLI prints "None" in that case.
if [ -z "$RECEIPT" ] || [ "$RECEIPT" = "None" ]; then
  echo "No message received; nothing to delete." >&2
else
  # 2. Delete the message (irreversible; confirm root cause first)
  aws sqs delete-message \
    --queue-url "https://sqs.us-east-1.amazonaws.com/521228113048/inbound-email-freescout-bridge-dlq.fifo" \
    --receipt-handle "$RECEIPT" \
    --region us-east-1
fi

Post-redrive verification checklist

After a redrive:

  1. DLQ ApproximateNumberOfMessagesVisible returns to 0 (CloudWatch console or CLI).
  2. Lambda successfully processes the redriven messages — check CloudWatch Logs for successful invocations (no ERROR lines, look for "outcome": "success" structured log event).
  3. Downstream verified:
     - Inbound: FreeScout conversation created. Search FreeScout for the message's MessageID value or probe ID.
     - Outbound: Postmark delivery confirmed. Check Postmark dashboard → Activity → Message detail shows Delivered.
  4. Write an ops note in FreeScout ops@ mailbox ticket if the failure affected a customer-visible email path. Required fields: timestamp (UTC), root cause, resolution, message count affected, redrive time (UTC).
  5. If the DLQ had > 1 message: verify all messages were redriven (check task status with list-message-move-tasks).

Audit trail expectations

Every redrive produces:
- CloudTrail event: StartMessageMoveTask (IAM principal, timestamp, source/destination ARNs).
- Lambda CloudWatch Log: structured JSON for each processed message: event_type, message_id_hash (SHA-256, truncated; no raw PII), mailbox_id (inbound) or correlation_id (outbound), outcome, timestamp_utc.
- FreeScout ops@ note: written manually by operator per checklist item 4 above.
- DynamoDB dedup record: dedup_key written with TTL after successful processing.

If a message was discarded instead of redriven, write a manual audit entry in the FreeScout ops@ mailbox with: reason for discard, message body hash (SHA-256), timestamp (UTC).
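A minimal sketch for producing the discard audit entry follows. The function names and the one-line entry format are hypothetical. body_sha256 hashes the exact body string with sha256sum, falling back to shasum -a 256 where sha256sum is not installed.

```shell
# Hypothetical helpers for the manual discard audit entry.

# Print the SHA-256 hex digest of the message body passed as $1.
body_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -d' ' -f1
  else
    printf '%s' "$1" | shasum -a 256 | cut -d' ' -f1
  fi
}

# Print one audit line: $1 = reason, $2 = body, $3 = UTC timestamp
# (optional; defaults to the current time).
discard_audit_entry() {
  local ts="${3:-$(date -u '+%Y-%m-%dT%H:%M:%SZ')}"
  printf 'discard reason=%s body_sha256=%s timestamp_utc=%s\n' \
    "$1" "$(body_sha256 "$2")" "$ts"
}
```

The resulting line can be pasted into the ops@ ticket. Hash the Body value exactly as returned by receive-message, since any whitespace change produces a different digest.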


Slack escalation thresholds

Condition                                 Action
Any DLQ alarm fires                       Investigate immediately using this runbook. Post status to ops@ FreeScout.
DLQ not cleared within 12h of alarm fire  Page operator (SEV-2 escalation).
DLQ not cleared within 24h                SEV-1 escalation: customer email delivery failure.
SNS DLQ fires                             Always escalate to operator same-day; topology failure requires Terraform review.
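The thresholds above can be expressed as a small decision function, for example inside a reminder script. This is a sketch only: escalation_for is a hypothetical name, it takes whole hours since the alarm fired plus the DLQ layer (sns or sqs), and it treats exactly 12h and 24h as already past the threshold, which is an assumption.

```shell
# Hypothetical helper: map alarm age and DLQ layer to an escalation label.
# $1 = whole hours since the alarm fired, $2 = "sns" or "sqs".
escalation_for() {
  local hours="$1" layer="$2"
  if [ "$hours" -ge 24 ]; then
    echo "SEV-1"
  elif [ "$hours" -ge 12 ]; then
    echo "SEV-2"
  elif [ "$layer" = "sns" ]; then
    echo "escalate-same-day"   # topology failure needs Terraform review
  else
    echo "investigate"
  fi
}
```

For instance, escalation_for 13 sqs prints SEV-2, and escalation_for 1 sns prints escalate-same-day.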

Detection sources

This runbook is reached via two detection paths:

F1 + F4 detection layer (end-to-end probe + topology):
- raxx-email-probe-failure alarm: probe email to support@raxx.app not seen in FreeScout within 5 min. Wired up by SC-E7 (#1670).
- Any DLQ alarm: confirms where in the pipeline the failure occurred.

F5 detection (Lambda processing failure):
- raxx-email-inbound-bridge-dlq-depth and raxx-email-outbound-sender-dlq-depth alarms.

When both raxx-email-probe-failure AND a DLQ alarm fire simultaneously, treat as high-confidence end-to-end delivery failure. Both signals together = F1 + F4 layer. Start diagnosis at Step 1 of this runbook.


Cross-references


Emergency stop

To halt all email processing without message loss (SQS retains messages for 14 days):

# Set inbound Lambda concurrency to 0 (kill-switch)
aws lambda put-function-concurrency \
  --function-name raxx-email-inbound-bridge \
  --reserved-concurrent-executions 0 \
  --region us-east-1

# Set outbound Lambda concurrency to 0
aws lambda put-function-concurrency \
  --function-name raxx-email-outbound-sender \
  --reserved-concurrent-executions 0 \
  --region us-east-1

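Messages accumulate in SQS. Throttled delivery attempts can still count toward the 5-attempt limit while concurrency is 0, so check the SQS DLQs after resuming. To resume: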

aws lambda put-function-concurrency \
  --function-name raxx-email-inbound-bridge \
  --reserved-concurrent-executions 3 \
  --region us-east-1

aws lambda put-function-concurrency \
  --function-name raxx-email-outbound-sender \
  --reserved-concurrent-executions 5 \
  --region us-east-1

Or use Terraform: update lambda_inbound_reserved_concurrency / lambda_outbound_reserved_concurrency in terraform.tfvars and terraform apply.


Escalation

Wake the operator when:
- SNS DLQ has messages (topology failure; requires Terraform intervention).
- SQS DLQ not cleared after 12h of redrive attempts.
- Root cause cannot be determined from Lambda logs (novel failure class).
- Message body appears to contain a malicious or spoofed payload.
- Any redrive causes new unexpected behavior downstream (FreeScout duplicates, Postmark bounces).

Contact: ops@raxx.app or Kristerpher's Slack DM (D0AJ7K184TV in MooseQuest workspace).