ADR-0082 — Terraform deployment pipeline pattern (Option D: GH Actions + AWS OIDC)

Status: Accepted
Date: 2026-05-12 UTC
Author: architect-agent
Parent card: #1837
Epic: #1834
Related ADRs: 0022, 0072, 0077, [ADR-0125](https://internal-docs.raxx.app/architecture/adr/0125-queue-phase1-cpp-billing-v1.html)
Supersedes: nothing (establishes new policy)

Context

Incident: 2026-05-12 UTC — 30-minute mailbox routing gap

PR #1832 merged a bridge Lambda code change (fix(email-bridge): parse ToFull for mailbox routing). The Lambda ARN and environment variables are managed in terraform/modules/email-delivery-stack. After the merge, the change needed a manual terraform apply from a developer machine to take effect. Roughly 30 minutes passed between merge and apply, and during that window inbound emails were routed to the wrong FreeScout mailboxes.

The root cause was the absence of an automated apply path: no workflow ran plan on PR open or apply on merge to main.

The nine terraform roots

terraform/
  _bootstrap/               # README only — no .tf files
  cf-access/                # Cloudflare Access service tokens + application rules
  cf-pages-docs-customer/   # CF Pages project for docs.raxx.app
  cf-waf-probes/            # Synthetic WAF probe configs — sparse (no provider.tf yet)
  freescout/                # FreeScout Lightsail, DNS, SSM service token
  modules/                  # Shared modules (email-delivery-stack, etc.) — not a root
  queue/                    # Queue service infra + waf.tf (Cloudflare WAF for queue.raxx.app)
  support-attachments/      # S3 + IAM for FreeScout attachment storage
  waf/                      # AWS-side CloudFront + CF WAF for raxx.app + getraxx.com

Seven of the nine directories are deployable roots (_bootstrap has no .tf files; modules/ is a shared library, not a root).

Options considered

Option A — Atlantis (self-hosted plan/apply server). Atlantis provides PR-comment-driven plan and apply with locking. Rejected: it requires a persistent, highly available server, a separate RBAC layer inside Atlantis alongside GitHub's, and more runbook overhead than the operator's current team size can absorb. Deferred to post-v1.

Option B — Terraform Cloud / HCP Terraform. Managed plan/apply with VCS-driven runs. Rejected: it adds a third party to the IAM trust chain for AWS credentials, costs $20+/month at the team tier, and requires a VCS connection to a private repo. Lock-in is also a concern given ADR-0076's Rust-phase trajectory.

Option C — Single matrix workflow across all roots. One workflow file that fans out a matrix over all roots on any terraform/** change. Simpler to maintain, but the per-root IAM boundary becomes harder to enforce (one role would need permissions across all roots), and a plan failure in one root blocks unrelated roots.

Option D — One workflow per root, AWS OIDC for short-lived creds. (CHOSEN) Each root has its own .github/workflows/terraform-<root>.yml and a dedicated IAM role scoped to only the resources that root manages. AWS OIDC issues a short-lived credential to the workflow run; no static AWS keys are stored anywhere.

Why Option D

IAM blast radius is bounded per root: a misconfiguration in the waf/ role cannot touch freescout/ resources.
Workflow files are independently reviewable: a PR touching terraform/waf/** only runs the waf workflow, not all roots.
Short-lived OIDC credentials satisfy the "no stored credentials" invariant. No AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in secrets or environment.
Matches the existing terraform-validate.yml pattern (per-module matrix already exists; per-root workflows extend it naturally).
Operator confirmed Option D on 2026-05-12 in #1834.

Decision

Adopt Option D as the canonical Terraform deployment pipeline pattern for all roots under terraform/.

Pattern — canonical contract every per-root workflow must follow

Every deployable root under terraform/ gets one workflow file and one IAM role. The following contract is binding for all implementation sub-cards (#1838–#1846).

Workflow file location

.github/workflows/terraform-<root-name>.yml

Where <root-name> matches the directory name under terraform/ (e.g., terraform-waf.yml, terraform-freescout.yml).

Triggers

on:
  pull_request:
    types: [opened, synchronize, reopened]
    paths:
      - 'terraform/<root-name>/**'
      - '.github/workflows/terraform-<root-name>.yml'
  push:
    branches:
      - main
    paths:
      - 'terraform/<root-name>/**'
      - '.github/workflows/terraform-<root-name>.yml'
  schedule:
    # Nightly drift detection — cards #1847 (post-launch)
    - cron: '0 6 * * *'   # 06:00 UTC

The schedule trigger runs plan only (never apply). It fires regardless of path changes. Drift detection cards (#1847) land post-launch; the cron block is present in all workflow files from the start but gated behind a workflow_dispatch or nightly-only job condition to avoid noisy plan output during active development.
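The gating described above could take the shape of a job-level condition. The sketch below is a minimal illustration, not the adopted implementation: the repository variable TF_DRIFT_DETECTION_ENABLED and the job name drift-plan are hypothetical, and it assumes workflow_dispatch is added to the on: block so the job can also be started by hand.

```yaml
# Sketch only. TF_DRIFT_DETECTION_ENABLED is a hypothetical repository variable;
# <root-name> and <ACCOUNT_ID> are placeholders, as elsewhere in this ADR.
permissions:
  id-token: write   # lets the job request a GitHub OIDC token
  contents: read

jobs:
  drift-plan:
    # Run on manual dispatch, or on the nightly cron once the variable is 'true'.
    if: >-
      github.event_name == 'workflow_dispatch' ||
      (github.event_name == 'schedule' && vars.TF_DRIFT_DETECTION_ENABLED == 'true')
    runs-on: ubuntu-latest
    defaults:
      run:
        shell: bash   # explicit bash enables pipefail, so a failed plan is not masked by tee
        working-directory: terraform/<root-name>
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/raxx-tf-<root-name>-plan
          aws-region: us-east-1
      - run: terraform init -input=false
      # Plan only, never apply. Copy the one-line result into the step summary.
      - run: |
          terraform plan -input=false -no-color | tee plan.txt
          grep -E '^(Plan:|No changes)' plan.txt >> "$GITHUB_STEP_SUMMARY" || true
```

Because the condition sits on the job rather than on the trigger, the cron block can ship in every workflow file from day one and stay inert until the variable is flipped.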

Job structure

PR-open path:    checkout → setup-terraform → OIDC assume role → init → plan → sticky PR comment → required status check
push-to-main:    checkout → setup-terraform → OIDC assume role → init → plan → apply → audit log emit
nightly-cron:    checkout → setup-terraform → OIDC assume role → init → plan (plan-only, post to step summary)

Plan on PR produces a sticky comment on the PR using the same pattern as deploy-customer-docs.yml (find-or-create comment by marker, replace on re-run). The comment includes the plan summary, add/change/destroy counts, and a link to the workflow run for the full diff.
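A find-or-create comment of this kind might look like the fragment below. It is a sketch under assumptions: the marker string, the step id plan, and the summary output are hypothetical, and deploy-customer-docs.yml may implement the pattern differently. The job needs pull-requests: write in its permissions.

```yaml
# Fragment of the PR plan job (checkout, setup-terraform, OIDC and init come first).
steps:
  - id: plan
    shell: bash
    working-directory: terraform/<root-name>
    run: |
      terraform plan -input=false -no-color | tee plan.txt
      # Keep the single summary line, e.g. the "Plan: ..." counts line.
      summary="$(grep -E '^(Plan:|No changes)' plan.txt | head -n 1 || true)"
      echo "summary=${summary}" >> "$GITHUB_OUTPUT"

  - name: Upsert sticky plan comment
    uses: actions/github-script@v7
    env:
      PLAN_SUMMARY: ${{ steps.plan.outputs.summary }}
    with:
      script: |
        // Hidden HTML marker used to find this workflow's earlier comment.
        const marker = '<!-- terraform-plan:<root-name> -->';
        const { owner, repo } = context.repo;
        const issue_number = context.issue.number;
        const runUrl = `${process.env.GITHUB_SERVER_URL}/${process.env.GITHUB_REPOSITORY}/actions/runs/${process.env.GITHUB_RUN_ID}`;
        const body = [
          marker,
          'Terraform plan',
          '',
          process.env.PLAN_SUMMARY || '(no summary line found)',
          '',
          `Full diff: ${runUrl}`,
        ].join('\n');
        const comments = await github.paginate(github.rest.issues.listComments, {
          owner, repo, issue_number,
        });
        const existing = comments.find((c) => c.body && c.body.includes(marker));
        if (existing) {
          // Replace the earlier comment so the PR shows one plan comment per root.
          await github.rest.issues.updateComment({ owner, repo, comment_id: existing.id, body });
        } else {
          await github.rest.issues.createComment({ owner, repo, issue_number, body });
        }
```

Using a per-root marker keeps comments from different root workflows separate when one PR touches several roots.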

Apply on main runs only after the merge commit is available (post-merge push trigger). It does NOT apply on the PR branch. This preserves the invariant that only code merged to main is applied to infrastructure.

The apply job must:
1. Re-run plan (not re-use the PR plan artifact) to catch drift between plan time and apply time.
2. Emit an audit log event to the unified audit table (ADR-0058) via a Raptor or Queue internal endpoint. The event payload includes: root name, run ID, commit SHA, actor (GitHub Actions), resource change summary, outcome.
3. On failure, post a Slack DM to D0AJ7K184TV.
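The PR and push-to-main paths above can be sketched as two jobs in one workflow file. This is a minimal illustration under stated assumptions, not the canonical file: it assumes the two-role naming from this ADR, uses us-east-1 as it appears in the sample policies, omits the on: block (see Triggers), and leaves the audit-emit and Slack steps as comments because their endpoints are not specified here.

```yaml
# Sketch only. <root-name> and <ACCOUNT_ID> are placeholders.
name: terraform-<root-name>

permissions:
  id-token: write   # required to request the GitHub OIDC token
  contents: read

jobs:
  plan:
    name: terraform-<root-name>-plan   # must match the required status check name
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/<root-name>
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/raxx-tf-<root-name>-plan
          aws-region: us-east-1
      - run: terraform init -input=false
      - run: terraform plan -input=false -no-color
      # Sticky PR comment step goes here.

  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/<root-name>
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/raxx-tf-<root-name>-apply
          aws-region: us-east-1
      - run: terraform init -input=false
      # Fresh plan at apply time; the PR plan artifact is never reused.
      - run: terraform plan -input=false -no-color -out=tfplan
      - run: terraform apply -input=false -no-color tfplan
      # Audit log emit step goes here (endpoint not specified in this ADR).
      # Slack DM step goes here, guarded with `if: failure()`.
```

Keeping plan and apply as separate jobs lets each assume its own role, so the PR path never holds apply permissions.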

Required status check name

terraform-<root-name>-plan

This check must be added to branch protection on main for every root before the workflow is considered production-ready. The implementation sub-card for each root includes adding this check as a gating step.

OIDC trust: per-root IAM role

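Each root has a dedicated IAM role. The OIDC trust policy limits assumption to this repository. The sample below pins the sub claim to the main branch; it does not by itself restrict assumption to a single workflow file, because the default sub claim carries no workflow identifier: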

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:raxx-app/TradeMasterAPI:ref:refs/heads/main"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:raxx-app/TradeMasterAPI:*"
      }
    }
  }]
}

For PR-time plan runs, the sub condition must also allow PR-ref subjects (repo:raxx-app/TradeMasterAPI:pull_request). Apply runs must be restricted to ref:refs/heads/main only. Implementers must set separate conditions on plan vs apply jobs within the same role, or use two roles per root (plan-role + apply-role). Two roles per root is the recommended approach for roots managing sensitive resources (Lambda code, SNS topics, SES identity).

Role naming convention

raxx-tf-<root-name>-plan    (read + plan only)
raxx-tf-<root-name>-apply   (plan + apply)

IAM role pattern and minimum permissions

Sample: `email-delivery-stack` root (the incident root)

The email-delivery-stack module manages: SNS FIFO topics, SQS FIFO queues + DLQs, Lambda functions, Lambda event source mappings, CloudWatch alarms, DynamoDB dedup table, IAM roles for Lambda execution.

Plan role minimum permissions:

{
  "Effect": "Allow",
  "Action": [
    "sns:GetTopicAttributes", "sns:ListSubscriptionsByTopic",
    "sqs:GetQueueAttributes", "sqs:GetQueueUrl",
    "lambda:GetFunction", "lambda:GetFunctionConfiguration",
    "lambda:ListEventSourceMappings", "lambda:GetEventSourceMapping",
    "dynamodb:DescribeTable",
    "cloudwatch:DescribeAlarms",
    "iam:GetRole", "iam:GetRolePolicy", "iam:ListAttachedRolePolicies",
    "s3:GetBucketVersioning"
  ],
  "Resource": "arn:aws:*:us-east-1:<ACCOUNT_ID>:*raxx-email*"
}

Apply role additional permissions (beyond plan):

{
  "Effect": "Allow",
  "Action": [
    "sns:CreateTopic", "sns:SetTopicAttributes", "sns:Subscribe", "sns:Unsubscribe", "sns:DeleteTopic",
    "sqs:CreateQueue", "sqs:SetQueueAttributes", "sqs:DeleteQueue",
    "lambda:UpdateFunctionCode", "lambda:UpdateFunctionConfiguration",
    "lambda:CreateEventSourceMapping", "lambda:UpdateEventSourceMapping", "lambda:DeleteEventSourceMapping",
    "lambda:PublishLayerVersion", "lambda:DeleteLayerVersion",
    "dynamodb:CreateTable", "dynamodb:UpdateTable",
    "cloudwatch:PutMetricAlarm", "cloudwatch:DeleteAlarms",
    "iam:PassRole"
  ],
  "Resource": "arn:aws:*:us-east-1:<ACCOUNT_ID>:*raxx-email*"
}

The iam:PassRole action is scoped to Lambda execution roles named raxx-email-*. The apply role does NOT have iam:CreateRole or iam:AttachRolePolicy — IAM role creation for Lambda execution roles is a separate Terraform root operation or a one-time bootstrap step.

Permission pattern by resource type (all roots)

Resource type	Plan permissions	Apply adds
Lambda function	`GetFunction`, `GetFunctionConfiguration`	`UpdateFunctionCode`, `UpdateFunctionConfiguration`
SNS topic	`GetTopicAttributes`, `ListSubscriptions`	`CreateTopic`, `SetTopicAttributes`, `Subscribe`, `DeleteTopic`
SQS queue	`GetQueueAttributes`, `GetQueueUrl`	`CreateQueue`, `SetQueueAttributes`, `DeleteQueue`
CloudWatch alarm	`DescribeAlarms`	`PutMetricAlarm`, `DeleteAlarms`
IAM role (read)	`GetRole`, `GetRolePolicy`, `ListAttachedRolePolicies`	`PassRole` (scoped)
Cloudflare resources	Provider uses `CLOUDFLARE_API_TOKEN` from SSM — not IAM	same

Cloudflare-managed resources (WAF rulesets in queue/waf.tf, waf/main.tf, cf-access/) use a CLOUDFLARE_API_TOKEN retrieved from AWS SSM Parameter Store at plan/apply time. The OIDC role must include ssm:GetParameter on the specific SSM parameter paths. The token is never stored in the workflow YAML or GitHub secrets.
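Fetching the token at run time might look like the step below. It is a sketch, assuming the AWS CLI is available on the runner and the OIDC role has already been assumed in the same job; the parameter path is hypothetical. If the parameter is a SecureString encrypted with a customer-managed KMS key, the role would also need kms:Decrypt on that key.

```yaml
# Fragment: place after the configure-aws-credentials step of the plan or apply job.
- name: Load Cloudflare API token from SSM
  shell: bash
  env:
    # Hypothetical parameter path; the real path is defined per root.
    CF_TOKEN_PARAM: /raxx/terraform/<root-name>/cloudflare-api-token
  run: |
    token="$(aws ssm get-parameter \
      --name "$CF_TOKEN_PARAM" \
      --with-decryption \
      --query 'Parameter.Value' \
      --output text)"
    # Mask the value so it is redacted if anything later prints it.
    echo "::add-mask::${token}"
    # Export for later steps in this job; the Cloudflare provider reads this variable.
    echo "CLOUDFLARE_API_TOKEN=${token}" >> "$GITHUB_ENV"
```

The token exists only for the lifetime of the job, which keeps it out of the workflow YAML and GitHub secrets as required above.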

Sub-root anomalies

`_bootstrap` — no `.tf` files

The _bootstrap/ directory contains only a README.md describing manual one-time steps (OIDC provider registration, initial IAM role seeding). No Terraform resources are managed here.

Decision: A terraform-_bootstrap.yml workflow is still created as a tracking shell. It runs terraform validate on push and produces a status check, but has no plan or apply jobs. The workflow documents that this root is intentionally static. The IAM role for _bootstrap is the OIDC-bootstrapper role itself — it need not be provisioned by Terraform.

`cf-waf-probes` — sparse, no `provider.tf` / `versions.tf`

The cf-waf-probes/ root currently contains only dns.tf. It is missing provider.tf and versions.tf, which are required for terraform init to succeed.

Decision: The implementation sub-card for cf-waf-probes (#1842 or equivalent) must scaffold provider.tf + versions.tf before wiring the workflow. The terraform-validate.yml already flags this root as invalid on PR. The sub-card implementer resolves this before adding the plan/apply workflow.

`queue/waf.tf` vs `waf/main.tf` — WAF boundary

Two roots manage Cloudflare WAF resources:

terraform/queue/waf.tf: Cloudflare WAF ruleset for queue.raxx.app — rate limits, BFM skip rules, Bot Fight Mode bypass for service-to-service calls. Cloudflare-only resources. No AWS resources. IAM role requires only ssm:GetParameter for the Cloudflare API token.
terraform/waf/main.tf: CloudFront distribution WAF ACLs for raxx.app and getraxx.com — AWS WAF Web ACLs, CloudFront associations. Also instantiates the cf-waf module for Cloudflare-side rules on the customer zones. AWS + Cloudflare resources. IAM role requires wafv2:*, cloudfront:GetDistribution, and ssm:GetParameter.

Boundary rule: The queue/ apply role must NOT have wafv2 permissions. The waf/ apply role must NOT be able to modify queue.raxx.app WAF rules (scoped by Cloudflare zone ID at the provider level). Each root's Cloudflare provider is configured with a zone-scoped API token stored in a separate SSM parameter path.

Terraform state backend

All roots use S3 + DynamoDB state locking (not local state). The S3 bucket and DynamoDB table are managed by the _bootstrap manual process. Each root's backend.tf (or versions.tf backend block) must reference the shared state bucket with a per-root prefix:

s3://<STATE_BUCKET>/terraform/<root-name>/terraform.tfstate

The apply workflow does NOT store any Terraform state locally between runs. State is always remote.
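As an illustration of the per-root key layout, the init step below passes the backend settings on the command line. This is one possible variant, not the mandated one: it assumes the root declares an empty backend "s3" {} block (Terraform's partial backend configuration), and the repository variables TF_STATE_BUCKET and TF_LOCK_TABLE are hypothetical. Roots that hard-code these values in backend.tf need only a plain terraform init.

```yaml
# Fragment: init step for one root, run after the OIDC role has been assumed.
- name: terraform init against the shared remote state
  shell: bash
  working-directory: terraform/<root-name>
  env:
    ROOT_NAME: '<root-name>'                     # placeholder
    STATE_BUCKET: ${{ vars.TF_STATE_BUCKET }}    # hypothetical repository variable
    LOCK_TABLE: ${{ vars.TF_LOCK_TABLE }}        # hypothetical repository variable
  run: |
    terraform init -input=false \
      -backend-config="bucket=${STATE_BUCKET}" \
      -backend-config="key=terraform/${ROOT_NAME}/terraform.tfstate" \
      -backend-config="dynamodb_table=${LOCK_TABLE}" \
      -backend-config="region=us-east-1"
```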

Rollback

If the pipeline itself is broken (e.g., a workflow YAML bug causes plan to fail on every PR):

Revert the workflow PR on main. This restores the previous workflow file.
If a terraform apply produced incorrect infrastructure before the pipeline was reverted, the operator re-runs terraform apply from a developer machine using the same credentials pattern (OIDC assume via aws sts assume-role-with-web-identity locally, or a one-time IAM user key for emergency recovery).
The human-fallback apply procedure is documented at docs/runbooks/terraform-pipeline.md (filed as sub-card #1848). This runbook must exist before any root's pipeline goes live.

There is no automated rollback of a terraform apply that completes successfully. A destructive change that was applied by the pipeline requires a corrective apply (a new commit that reverts the Terraform resource change, merged to main, applied by the pipeline).

Rollout plan

Phase	Roots	Gate
1 — Incident root first	`modules/email-delivery-stack` (sub-card #1838)	Plan check passes on a test PR; apply verified on staging-equivalent
2 — High-change roots	`freescout`, `cf-access`, `waf`	After #1838 is stable for 3 business days
3 — Remaining roots	`queue`, `support-attachments`, `cf-pages-docs-customer`	After Phase 2
4 — Anomalous roots	`cf-waf-probes` (after scaffold), `_bootstrap` (tracking shell)	After Phase 3
5 — Drift detection	Nightly cron on all roots	Cards #1847, post-launch

Phase 1 is launch-blocking because the incident root (email-delivery-stack) must not require manual apply again before v1 launch.

Security considerations

No stored credentials. AWS access via OIDC short-lived tokens only. Cloudflare API tokens in SSM, fetched at runtime. No static secrets in workflow YAML, GitHub secrets, or .tfvars files committed to the repo.
Audit trail. Every apply emits an audit event (see pattern above). The event is append-only per ADR-0022.
Plan-only on PRs. No infrastructure changes happen on PR branches. The plan output is informational. Only merge to main triggers an apply.
Two-role model. Plan and apply roles are separate IAM identities. A compromised PR-build environment cannot apply infrastructure changes.
State file access. The S3 state bucket has a bucket policy restricting access to the OIDC roles and the claude-infisical-bootstrap IAM user (break-glass). No other IAM principals can read or write state.
Kill-switch. Setting a Lambda's ReservedConcurrentExecutions to 0 (via a Terraform variable committed to main) is the kill-switch for email processing. The pipeline applies this within the normal merge→apply window.
Breach notification. A failed apply that leaves infrastructure in an inconsistent state triggers the Slack DM to D0AJ7K184TV and creates a FreeScout incident ticket via the Console "Investigate" flow. The docs/runbooks/terraform-pipeline.md runbook (#1848) defines the investigation SOP.

Consequences

Positive

Eliminates the class of incidents caused by "merged but not applied" drift. After a root's workflow is live, every merge to main that touches that root produces an apply within minutes.
IAM blast radius is bounded: a misconfigured role for one root cannot affect another root's resources.
Short-lived OIDC credentials satisfy the no-stored-credentials invariant with no operational overhead (no key rotation schedule needed).
Plan comments on PRs make infrastructure changes visible during code review, the same way tests make logic changes visible.
The pattern is consistent with the existing terraform-validate.yml convention already enforced in CI.

Negative / trade-offs

Nine workflow files + nine (or eighteen) IAM roles to create and maintain. Initial setup cost is front-loaded.
Apply happens asynchronously after merge. There is still a small window (seconds to low minutes) between merge and completed apply. This window is much smaller than the manual-apply gap but is not zero.
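Concurrent merges to main touching the same root contend for the state lock. Terraform does not queue on the DynamoDB lock by default: the second apply exits with a lock-acquisition error unless it is run with -lock-timeout or the workflow serializes runs. The operator should expect this contention under concurrent merges; it does not indicate broken infrastructure.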
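No apply-blocking on destructive plan changes (e.g., a destroy on a production SNS topic). Implementers must add a check for destroy actions in the plan and a manual approval gate for plans that contain them. Note that plan -detailed-exitcode alone only reports whether any change is pending (exit code 2), not whether that change is a destroy. This is flagged as an open question below.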
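For the lock-contention trade-off above, one option is to serialize applies per root at the workflow level and give Terraform a lock timeout as a second line of defense. The sketch below assumes the apply role naming from this ADR; the group name and the 5-minute timeout are illustrative choices, not decided values.

```yaml
# Sketch only. <root-name> and <ACCOUNT_ID> are placeholders.
permissions:
  id-token: write
  contents: read

jobs:
  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    # One apply at a time per root; a running apply is never cancelled.
    concurrency:
      group: terraform-<root-name>-apply
      cancel-in-progress: false
    defaults:
      run:
        working-directory: terraform/<root-name>
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/raxx-tf-<root-name>-apply
          aws-region: us-east-1
      - run: terraform init -input=false
      # Wait up to 5 minutes for a held state lock instead of failing immediately.
      - run: terraform plan -input=false -no-color -lock-timeout=5m -out=tfplan
      - run: terraform apply -input=false -no-color -lock-timeout=5m tfplan
```

One caveat worth checking against the GitHub documentation before relying on this: a concurrency group holds at most one pending run, so with three quick merges the middle run may be cancelled while pending. Since each apply re-plans from its own commit on main, the last run should still converge the infrastructure.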

Open questions

Destructive-change gate. Should applies that include at least one destroy action require a manual approval step (GitHub environment protection with required reviewer) before proceeding? This prevents accidental deletion of stateful resources (SNS topics, SQS queues, DynamoDB tables). Recommended: yes, but deferred to Phase 2 unless the email-delivery-stack root warrants it immediately. Needs operator decision before #1838 is claimed.
Two IAM roles vs one per root. The two-role model (plan + apply) is cleaner for security but doubles the IAM bootstrapping work. Is the operator comfortable with the apply role being able to both plan and apply (one role per root, with the ref condition on the OIDC trust restricting apply to main)? The single-role approach is simpler but relies entirely on the OIDC ref condition for the apply boundary. Needs operator decision before sub-cards are dispatched.
Cloudflare API token per root vs shared. Currently a single Cloudflare automation token is documented in reference_cloudflare_tokens.md. WAF rule management requires Zone:WAF:Edit. If the queue/ Cloudflare provider and the waf/ Cloudflare provider share the same token, a compromise of either workflow's SSM read path exposes the other root's write capability. Recommend: separate scoped tokens per root. Needs operator decision on whether to create additional Cloudflare tokens before Phase 2.
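If the operator opts for the destructive-change gate in the first question, one possible shape is to count destroy actions in the plan JSON and route the apply job through a protected environment when the count is non-zero. This is a sketch under assumptions: the environment names terraform-destructive and terraform-auto are hypothetical, jq is assumed to be present on the runner, and the apply job body is reduced to a placeholder.

```yaml
# Sketch only. <root-name> and <ACCOUNT_ID> are placeholders.
permissions:
  id-token: write
  contents: read

jobs:
  inspect:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    outputs:
      destroys: ${{ steps.count.outputs.destroys }}
    defaults:
      run:
        shell: bash
        working-directory: terraform/<root-name>
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # keep `terraform show -json` output clean for jq
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/raxx-tf-<root-name>-plan
          aws-region: us-east-1
      - run: terraform init -input=false
      - run: terraform plan -input=false -no-color -out=tfplan
      - id: count
        run: |
          # Count resources whose planned actions include "delete" (destroys and replacements).
          n="$(terraform show -json tfplan \
            | jq '[.resource_changes[]? | select(any(.change.actions[]; . == "delete"))] | length')"
          echo "destroys=${n}" >> "$GITHUB_OUTPUT"

  apply:
    needs: inspect
    runs-on: ubuntu-latest
    # Only terraform-destructive is assumed to have required reviewers configured.
    environment: ${{ needs.inspect.outputs.destroys != '0' && 'terraform-destructive' || 'terraform-auto' }}
    steps:
      - run: echo "fresh plan and apply steps go here, per the apply job contract"
```

Two points to weigh before adopting this. First, the apply job still re-plans, so the destroy count it acts on can differ from the inspected one. Second, when a job references an environment, the GitHub OIDC sub claim takes the form repo:<owner>/<repo>:environment:<name> instead of the ref form, so the apply role's trust policy would need to allow those environment subjects.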

Cross-references

Incident: PR #1832 merge + 30-minute mailbox routing gap (2026-05-12 UTC)
Epic: #1834 (Terraform deployment automation)
Parent card: #1837 (this ADR)
ADR-0022: append-only hash-chain audit log (audit emit on apply)
ADR-0072: SNS/SQS/SES durable email delivery (email-delivery-stack root)
ADR-0076: Queue phase 1 C++ / billing v1 (related infrastructure context)
ADR-0077: Cloudflare WAF layered defense (waf/ + queue/waf.tf context)
feedback_aws_workloads_use_ssm_not_vault.md: workload secrets in SSM, not Infisical
docs/runbooks/terraform-pipeline.md: human-fallback runbook (sub-card #1848, not yet created)