Raxx · internal docs

ADR-0082 — Terraform deployment pipeline pattern (Option D: GH Actions + AWS OIDC)

Status: Accepted
Date: 2026-05-12 UTC
Author: architect-agent
Parent card: #1837
Epic: #1834
Related ADRs: 0022, 0072, 0077, [ADR-0076](https://internal-docs.raxx.app/architecture/adr/0076-queue-phase1-billing-v1-aggressive-12day.html)
Supersedes: nothing (establishes new policy)


Context

Incident: 2026-05-12 UTC — 30-minute mailbox routing gap

PR #1832 merged a bridge Lambda code change (fix(email-bridge): parse ToFull for mailbox routing). The Lambda ARN and environment variables are managed in the terraform/modules/email-delivery-stack stack. After merge, the stack required a manual terraform apply from a developer machine to propagate the change. The gap between code merge and apply was approximately 30 minutes. During that window, inbound emails were routed to the wrong FreeScout mailboxes.

The root cause was the absence of an automated apply path: no workflow ran plan when a PR was opened, and none ran apply on merge to main.

The nine terraform roots

terraform/
  _bootstrap/               # README only — no .tf files
  cf-access/                # Cloudflare Access service tokens + application rules
  cf-pages-docs-customer/   # CF Pages project for docs.raxx.app
  cf-waf-probes/            # Synthetic WAF probe configs — sparse (no provider.tf yet)
  freescout/                # FreeScout Lightsail, DNS, SSM service token
  modules/                  # Shared modules (email-delivery-stack, etc.) — not a root
  queue/                    # Queue service infra + waf.tf (Cloudflare WAF for queue.raxx.app)
  support-attachments/      # S3 + IAM for FreeScout attachment storage
  waf/                      # AWS-side CloudFront + CF WAF for raxx.app + getraxx.com

Of the nine directories listed, seven are deployable roots: _bootstrap has no .tf files, and modules/ is a shared library, not a root.

Options considered

Option A — Atlantis (self-hosted plan/apply server). Atlantis provides PR-comment-driven plan and apply with locking. Rejected: requires a persistent server with high-availability requirements, Atlantis-level RBAC separate from GitHub, and operational runbook complexity above the operator's current team size. Deferred to post-v1.

Option B — Terraform Cloud / HCP Terraform. Managed plan/apply with VCS-driven runs. Rejected: adds a third-party in the IAM trust chain for AWS credentials, $20+/month at the team tier, and a VCS connection to a private repo. Lock-in concern given ADR-0076's Rust-phase trajectory.

Option C — Single matrix workflow across all roots. One workflow file that fans out a matrix over all roots on any terraform/** change. Simpler to maintain, but the IAM role per root boundary becomes harder to enforce (one role would need permissions across all roots), and a plan failure in one root blocks unrelated roots.

Option D — One workflow per root, AWS OIDC for short-lived creds. (CHOSEN) Each root has its own .github/workflows/terraform-<root>.yml and a dedicated IAM role scoped to only the resources that root manages. AWS OIDC issues a short-lived credential to the workflow run; no static AWS keys are stored anywhere.

Why Option D


Decision

Adopt Option D as the canonical Terraform deployment pipeline pattern for all roots under terraform/.


Pattern — canonical contract every per-root workflow must follow

Every deployable root under terraform/ gets one workflow file and one IAM role. The following contract is binding for all implementation sub-cards (#1838–#1846).

Workflow file location

.github/workflows/terraform-<root-name>.yml

Where <root-name> matches the directory name under terraform/ (e.g., terraform-waf.yml, terraform-freescout.yml).

Triggers

on:
  pull_request:
    types: [opened, synchronize, reopened]
    paths:
      - 'terraform/<root-name>/**'
      - '.github/workflows/terraform-<root-name>.yml'
  push:
    branches:
      - main
    paths:
      - 'terraform/<root-name>/**'
      - '.github/workflows/terraform-<root-name>.yml'
  schedule:
    # Nightly drift detection — cards #1847 (post-launch)
    - cron: '0 6 * * *'   # 06:00 UTC

The schedule trigger runs plan only (never apply). It fires regardless of path changes. Drift detection cards (#1847) land post-launch; the cron block is present in all workflow files from the start but gated behind a workflow_dispatch or nightly-only job condition to avoid noisy plan output during active development.
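The plan-only gating described above could look roughly like the following. This is a minimal sketch for the waf root, not the adopted implementation: the job id drift, the repository variable AWS_ACCOUNT_ID, and the choice to copy the last 40 lines of plan output into the step summary are all assumptions made for illustration.

```yaml
# Sketch only: the nightly plan-only path for the waf root.
# Assumed names: the job id "drift" and the repository variable AWS_ACCOUNT_ID.
name: terraform-waf
on:
  schedule:
    - cron: '0 6 * * *'   # 06:00 UTC
  workflow_dispatch: {}

permissions:
  id-token: write   # lets the run request an OIDC token
  contents: read

jobs:
  drift:
    # Runs for the cron or a manual dispatch only; there is no apply step here.
    if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/waf
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/raxx-tf-waf-plan
          aws-region: us-east-1
      - run: terraform init -input=false
      - name: Plan and post to the step summary
        run: |
          set +e
          terraform plan -input=false -no-color -detailed-exitcode > plan.txt 2>&1
          code=$?
          set -e
          echo '### terraform-waf drift check' >> "$GITHUB_STEP_SUMMARY"
          tail -n 40 plan.txt >> "$GITHUB_STEP_SUMMARY"
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
          if [ "$code" -eq 1 ]; then exit 1; fi
```

In the real workflow file this job would sit next to the PR and push jobs, each carrying its own event condition. Exit code 2 is treated as a successful run here, so drift shows up in the summary without failing the job; whether drift should fail the run is a separate choice.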

Job structure

PR-open path:    checkout → setup-terraform → OIDC assume role → init → plan → sticky PR comment → required status check
push-to-main:    checkout → setup-terraform → OIDC assume role → init → plan → apply → audit log emit
nightly-cron:    checkout → setup-terraform → OIDC assume role → init → plan (plan-only, post to step summary)

Plan on PR produces a sticky comment on the PR using the same pattern as deploy-customer-docs.yml (find-or-create comment by marker, replace on re-run). The comment includes the plan summary, add/change/destroy counts, and a link to the workflow run for the full diff.
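A find-or-create comment step might be sketched as below. It is not copied from deploy-customer-docs.yml, whose contents are not shown here; the marker string, the PLAN_SUMMARY value, and the workflow name are hypothetical, and a real workflow would fill the summary from the plan step.

```yaml
# Sketch only: find-or-create a sticky plan comment by marker.
name: terraform-waf-comment-sketch
on:
  pull_request:
    paths:
      - 'terraform/waf/**'

permissions:
  contents: read
  pull-requests: write   # needed to create or update the PR comment

jobs:
  plan-comment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v7
        env:
          # Hypothetical value; an earlier plan step would normally set this.
          PLAN_SUMMARY: 'Plan: 1 to add, 0 to change, 0 to destroy.'
        with:
          script: |
            const marker = '<!-- terraform-waf-plan -->';  // hypothetical marker
            const runUrl = `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`;
            const body = [
              marker,
              '### terraform-waf plan',
              process.env.PLAN_SUMMARY,
              `[Full diff in the workflow run](${runUrl})`,
            ].join('\n\n');
            // Look for a previous comment carrying the marker.
            const comments = await github.paginate(github.rest.issues.listComments, {
              ...context.repo,
              issue_number: context.issue.number,
            });
            const existing = comments.find(c => c.body && c.body.includes(marker));
            if (existing) {
              // Replace on re-run so the PR keeps a single plan comment.
              await github.rest.issues.updateComment({ ...context.repo, comment_id: existing.id, body });
            } else {
              await github.rest.issues.createComment({ ...context.repo, issue_number: context.issue.number, body });
            }
```

Using one marker per root keeps comments from different root workflows from overwriting each other on a PR that touches several roots.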

Apply on main runs only after the merge commit is available (post-merge push trigger). It does NOT apply on the PR branch. This preserves the invariant that only code merged to main is applied to infrastructure.

The apply job must:

  1. Re-run plan (not re-use the PR plan artifact) to catch drift between plan time and apply time.
  2. Emit an audit log event to the unified audit table (ADR-0058) via a Raptor or Queue internal endpoint. The event payload includes: root name, run ID, commit SHA, actor (GitHub Actions), resource change summary, outcome.
  3. On failure, post a Slack DM to D0AJ7K184TV.
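The three apply-job requirements above could be wired together roughly as follows. This is a sketch under stated assumptions rather than the pipeline itself: the audit endpoint URL and token (AUDIT_ENDPOINT_URL, AUDIT_ENDPOINT_TOKEN), the Slack bot token secret, the AWS_ACCOUNT_ID variable, and the JSON field names in the audit payload are all hypothetical, since the ADR does not define them.

```yaml
# Sketch only: push-to-main apply path for the waf root.
name: terraform-waf-apply-sketch
on:
  push:
    branches: [main]
    paths:
      - 'terraform/waf/**'

permissions:
  id-token: write
  contents: read

concurrency:
  group: terraform-waf-apply   # one apply at a time for this root
  cancel-in-progress: false

jobs:
  apply:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: terraform/waf
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/raxx-tf-waf-apply
          aws-region: us-east-1
      - run: terraform init -input=false
      # 1. Fresh plan at apply time; the PR plan artifact is not reused.
      - run: terraform plan -input=false -no-color -out=tfplan
      - id: apply
        run: terraform apply -input=false -no-color tfplan
      # 2. Audit event (endpoint, token, and field names are hypothetical).
      - name: Emit audit event
        if: always()
        env:
          AUDIT_URL: ${{ vars.AUDIT_ENDPOINT_URL }}
          AUDIT_TOKEN: ${{ secrets.AUDIT_ENDPOINT_TOKEN }}
          OUTCOME: ${{ steps.apply.outcome }}
        run: |
          SUMMARY=$(terraform show -no-color tfplan | grep -E '^(Plan:|No changes)' || true)
          jq -n --arg root waf --arg run_id "$GITHUB_RUN_ID" --arg sha "$GITHUB_SHA" \
                --arg actor "GitHub Actions" --arg summary "$SUMMARY" --arg outcome "$OUTCOME" \
                '{root: $root, run_id: $run_id, commit_sha: $sha, actor: $actor, change_summary: $summary, outcome: $outcome}' \
            | curl --fail -sS -X POST "$AUDIT_URL" \
                -H "Authorization: Bearer $AUDIT_TOKEN" \
                -H 'Content-Type: application/json' --data @-
      # 3. Slack DM on failure (the secret name is hypothetical).
      - name: Notify Slack on failure
        if: failure()
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
        run: |
          jq -n --arg text "terraform-waf apply failed: $GITHUB_SERVER_URL/$GITHUB_REPOSITORY/actions/runs/$GITHUB_RUN_ID" \
                '{channel: "D0AJ7K184TV", text: $text}' \
            | curl -sS -X POST https://slack.com/api/chat.postMessage \
                -H "Authorization: Bearer $SLACK_BOT_TOKEN" \
                -H 'Content-Type: application/json; charset=utf-8' --data @-
```

The audit step runs with always() so that failed applies are recorded too. The concurrency group is an addition of this sketch; it serializes applies for one root in case two merges land close together.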

Required status check name

terraform-<root-name>-plan

This check must be added to branch protection on main for every root before the workflow is considered production-ready. The implementation sub-card for each root includes adding this check as a gating step.

OIDC trust: per-root IAM role

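Each root has a dedicated IAM role. The OIDC trust policy limits assumption to this repository. As written, the StringEquals condition pins sub to the main branch, and because IAM requires every condition operator in a statement to match, the broader StringLike pattern adds nothing on its own. The default sub claim does not name the workflow file, so limiting assumption to a single workflow file would need an extra mechanism that this policy does not show: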

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:raxx-app/TradeMasterAPI:ref:refs/heads/main"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:raxx-app/TradeMasterAPI:*"
      }
    }
  }]
}

For PR-time plan runs, the sub condition must also allow PR-ref subjects (repo:raxx-app/TradeMasterAPI:pull_request). Apply runs must be restricted to ref:refs/heads/main only. Implementers must set separate conditions on plan vs apply jobs within the same role, or use two roles per root (plan-role + apply-role). Two roles per root is the recommended approach for roots managing sensitive resources (Lambda code, SNS topics, SES identity).

Role naming convention

raxx-tf-<root-name>-plan    (read + plan only)
raxx-tf-<root-name>-apply   (plan + apply)
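With two roles per root, each job assumes only the role that matches its event. The sketch below illustrates this for the waf root; AWS_ACCOUNT_ID is an assumed repository variable, and the sts get-caller-identity steps stand in for the real Terraform steps.

```yaml
# Sketch only: plan and apply jobs assuming different per-root roles.
name: terraform-waf-roles-sketch
on:
  pull_request:
    paths:
      - 'terraform/waf/**'
  push:
    branches: [main]
    paths:
      - 'terraform/waf/**'

permissions:
  id-token: write
  contents: read

jobs:
  plan:
    # PR runs present the subject repo:raxx-app/TradeMasterAPI:pull_request,
    # so only the plan role's trust policy should accept it.
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/raxx-tf-waf-plan
          aws-region: us-east-1
      - run: aws sts get-caller-identity   # shows which role was assumed

  apply:
    # Push runs on main present ref:refs/heads/main, the only subject
    # the apply role's trust policy should accept.
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/raxx-tf-waf-apply
          aws-region: us-east-1
      - run: aws sts get-caller-identity
```

The job conditions are a convenience only. The enforcement lives in each role's trust policy, because a modified workflow on a PR branch could change the if expressions but not the subject in its token.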

IAM role pattern and minimum permissions

Sample: email-delivery-stack root (the incident root)

The email-delivery-stack module manages: SNS FIFO topics, SQS FIFO queues + DLQs, Lambda functions, Lambda event source mappings, CloudWatch alarms, DynamoDB dedup table, IAM roles for Lambda execution.

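Plan role minimum permissions (illustrative: IAM role ARNs carry no region and S3 bucket ARNs carry neither region nor account ID, so the iam:* and s3:* actions listed here would not match the single Resource pattern and need their own statements in a real policy):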

{
  "Effect": "Allow",
  "Action": [
    "sns:GetTopicAttributes", "sns:ListSubscriptionsByTopic",
    "sqs:GetQueueAttributes", "sqs:GetQueueUrl",
    "lambda:GetFunction", "lambda:GetFunctionConfiguration",
    "lambda:ListEventSourceMappings", "lambda:GetEventSourceMapping",
    "dynamodb:DescribeTable",
    "cloudwatch:DescribeAlarms",
    "iam:GetRole", "iam:GetRolePolicy", "iam:ListAttachedRolePolicies",
    "s3:GetBucketVersioning"
  ],
  "Resource": "arn:aws:*:us-east-1:<ACCOUNT_ID>:*raxx-email*"
}

Apply role additional permissions (beyond plan):

{
  "Effect": "Allow",
  "Action": [
    "sns:CreateTopic", "sns:SetTopicAttributes", "sns:Subscribe", "sns:Unsubscribe", "sns:DeleteTopic",
    "sqs:CreateQueue", "sqs:SetQueueAttributes", "sqs:DeleteQueue",
    "lambda:UpdateFunctionCode", "lambda:UpdateFunctionConfiguration",
    "lambda:CreateEventSourceMapping", "lambda:UpdateEventSourceMapping", "lambda:DeleteEventSourceMapping",
    "lambda:PublishLayerVersion", "lambda:DeleteLayerVersion",
    "dynamodb:CreateTable", "dynamodb:UpdateTable",
    "cloudwatch:PutMetricAlarm", "cloudwatch:DeleteAlarms",
    "iam:PassRole"
  ],
  "Resource": "arn:aws:*:us-east-1:<ACCOUNT_ID>:*raxx-email*"
}

The iam:PassRole action is scoped to Lambda execution roles named raxx-email-*. The apply role does NOT have iam:CreateRole or iam:AttachRolePolicy — IAM role creation for Lambda execution roles is a separate Terraform root operation or a one-time bootstrap step.

Permission pattern by resource type (all roots)

| Resource type | Plan permissions | Apply adds |
| --- | --- | --- |
| Lambda function | GetFunction, GetFunctionConfiguration | UpdateFunctionCode, UpdateFunctionConfiguration |
| SNS topic | GetTopicAttributes, ListSubscriptions | CreateTopic, SetTopicAttributes, Subscribe, DeleteTopic |
| SQS queue | GetQueueAttributes, GetQueueUrl | CreateQueue, SetQueueAttributes, DeleteQueue |
| CloudWatch alarm | DescribeAlarms | PutMetricAlarm, DeleteAlarms |
| IAM role (read) | GetRole, GetRolePolicy, ListAttachedRolePolicies | PassRole (scoped) |
| Cloudflare resources | Provider uses CLOUDFLARE_API_TOKEN from SSM (not IAM) | same |

Cloudflare-managed resources (WAF rulesets in queue/waf.tf, waf/main.tf, cf-access/) use a CLOUDFLARE_API_TOKEN retrieved from AWS SSM Parameter Store at plan/apply time. The OIDC role must include ssm:GetParameter on the specific SSM parameter paths. The token is never stored in the workflow YAML or GitHub secrets.
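Fetching the token at run time might look like the step below. It is a sketch, assuming the AWS CLI on the runner and a hypothetical SSM parameter path; the actual parameter paths are not given in this ADR. The Cloudflare provider reads CLOUDFLARE_API_TOKEN from the environment, so later Terraform steps need no extra wiring.

```yaml
# Sketch only: load the Cloudflare token from SSM after the OIDC role is assumed.
name: terraform-waf-cf-token-sketch
on:
  workflow_dispatch: {}

permissions:
  id-token: write
  contents: read

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/raxx-tf-waf-plan
          aws-region: us-east-1
      - name: Load Cloudflare token from SSM
        env:
          # Hypothetical parameter path; substitute the real per-root path.
          CF_TOKEN_PARAM: /raxx/terraform/waf/cloudflare-api-token
        run: |
          TOKEN=$(aws ssm get-parameter --name "$CF_TOKEN_PARAM" \
            --with-decryption --query 'Parameter.Value' --output text)
          # Mask first so the value is redacted from any later log output.
          echo "::add-mask::$TOKEN"
          echo "CLOUDFLARE_API_TOKEN=$TOKEN" >> "$GITHUB_ENV"
```

Writing to GITHUB_ENV exposes the token to every later step in the job. Passing it through the env block of only the Terraform steps would narrow that exposure, at the cost of a little more YAML.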


Sub-root anomalies

_bootstrap — no .tf files

The _bootstrap/ directory contains only a README.md describing manual one-time steps (OIDC provider registration, initial IAM role seeding). No Terraform resources are managed here.

Decision: A terraform-_bootstrap.yml workflow is still created as a tracking shell. It runs terraform validate on push and produces a status check, but has no plan or apply jobs. The workflow documents that this root is intentionally static. The IAM role for _bootstrap is the OIDC-bootstrapper role itself — it need not be provisioned by Terraform.

cf-waf-probes — sparse, no provider.tf / versions.tf

The cf-waf-probes/ root currently contains only dns.tf. It is missing provider.tf and versions.tf, which are required for terraform init to succeed.

Decision: The implementation sub-card for cf-waf-probes (#1842 or equivalent) must scaffold provider.tf + versions.tf before wiring the workflow. The terraform-validate.yml already flags this root as invalid on PR. The sub-card implementer resolves this before adding the plan/apply workflow.

queue/waf.tf vs waf/main.tf — WAF boundary

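Two roots manage Cloudflare WAF resources: queue/ (queue/waf.tf, covering queue.raxx.app) and waf/ (waf/main.tf, covering raxx.app and getraxx.com).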

Boundary rule: The queue/ apply role must NOT have wafv2 permissions. The waf/ apply role must NOT be able to modify queue.raxx.app WAF rules (scoped by Cloudflare zone ID at the provider level). Each root's Cloudflare provider is configured with a zone-scoped API token stored in a separate SSM parameter path.


Terraform state backend

All roots use S3 + DynamoDB state locking (not local state). The S3 bucket and DynamoDB table are managed by the _bootstrap manual process. Each root's backend.tf (or versions.tf backend block) must reference the shared state bucket with a per-root prefix:

s3://<STATE_BUCKET>/terraform/<root-name>/terraform.tfstate

The apply workflow does NOT store any Terraform state locally between runs. State is always remote.


Rollback

If the pipeline itself is broken (e.g., a workflow YAML bug causes plan to fail on every PR):

  1. Revert the workflow PR on main. This restores the previous workflow file.
  2. If a terraform apply produced incorrect infrastructure before the pipeline was reverted, the operator re-runs terraform apply from a developer machine using an emergency credentials path. GitHub issues its OIDC tokens only inside a workflow run, so a local aws sts assume-role-with-web-identity call needs a web identity token from a separately trusted provider; the alternative is a one-time IAM user key for emergency recovery.
  3. The human-fallback apply procedure is documented at docs/runbooks/terraform-pipeline.md (filed as sub-card #1848). This runbook must exist before any root's pipeline goes live.

There is no automated rollback of a terraform apply that completes successfully. A destructive change that was applied by the pipeline requires a corrective apply (a new commit that reverts the Terraform resource change, merged to main, applied by the pipeline).


Rollout plan

| Phase | Roots | Gate |
| --- | --- | --- |
| 1 - Incident root first | modules/email-delivery-stack (sub-card #1838) | Plan check passes on a test PR; apply verified on staging-equivalent |
| 2 - High-change roots | freescout, cf-access, waf | After #1838 is stable for 3 business days |
| 3 - Remaining roots | queue, support-attachments, cf-pages-docs-customer | After Phase 2 |
| 4 - Anomalous roots | cf-waf-probes (after scaffold), _bootstrap (tracking shell) | After Phase 3 |
| 5 - Drift detection | Nightly cron on all roots | Cards #1847, post-launch |

Phase 1 is launch-blocking because the incident root (email-delivery-stack) must not require manual apply again before v1 launch.


Security considerations


Consequences

Positive

Negative / trade-offs


Open questions

  1. Destructive-change gate. Should applies that include at least one destroy action require a manual approval step (GitHub environment protection with required reviewer) before proceeding? This prevents accidental deletion of stateful resources (SNS topics, SQS queues, DynamoDB tables). Recommended: yes, but deferred to Phase 2 unless the email-delivery-stack root warrants it immediately. Needs operator decision before #1838 is claimed.

  2. Two IAM roles vs one per root. The two-role model (plan + apply) is cleaner for security but doubles the IAM bootstrapping work. Is the operator comfortable with the apply role being able to both plan and apply (one role per root, with the ref condition on the OIDC trust restricting apply to main)? The single-role approach is simpler but relies entirely on the OIDC ref condition for the apply boundary. Needs operator decision before sub-cards are dispatched.

  3. Cloudflare API token per root vs shared. Currently a single Cloudflare automation token is documented in reference_cloudflare_tokens.md. WAF rule management requires Zone:WAF:Edit. If the queue/ Cloudflare provider and the waf/ Cloudflare provider share the same token, a compromise of either workflow's SSM read path exposes the other root's write capability. Recommend: separate scoped tokens per root. Needs operator decision on whether to create additional Cloudflare tokens before Phase 2.
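If the answer to open question 1 is yes, one possible shape for the gate is to count delete actions in the JSON plan and send destructive runs through a protected environment. The sketch below is only one option: the environment name terraform-destructive, the job ids, and the AWS_ACCOUNT_ID variable are assumptions, and the apply jobs are reduced to placeholders.

```yaml
# Sketch only: route plans that contain a delete action through a protected environment.
name: terraform-waf-destroy-gate-sketch
on:
  push:
    branches: [main]
    paths:
      - 'terraform/waf/**'

permissions:
  id-token: write
  contents: read

jobs:
  plan:
    runs-on: ubuntu-latest
    outputs:
      destroys: ${{ steps.count.outputs.destroys }}
    defaults:
      run:
        working-directory: terraform/waf
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # keep stdout clean for jq
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ vars.AWS_ACCOUNT_ID }}:role/raxx-tf-waf-plan
          aws-region: us-east-1
      - run: terraform init -input=false
      - run: terraform plan -input=false -no-color -out=tfplan
      - id: count
        run: |
          # Counts resources whose planned actions include "delete" (replacements included).
          n=$(terraform show -json tfplan \
            | jq '[.resource_changes[]? | select(.change.actions | index("delete"))] | length')
          echo "destroys=$n" >> "$GITHUB_OUTPUT"

  apply:
    needs: plan
    if: needs.plan.outputs.destroys == '0'
    runs-on: ubuntu-latest
    steps:
      - run: echo "no destroys; re-plan and apply here"

  apply-destructive:
    needs: plan
    if: needs.plan.outputs.destroys != '0'
    environment: terraform-destructive   # hypothetical; configured with a required reviewer
    runs-on: ubuntu-latest
    steps:
      - run: echo "reviewer approved; re-plan and apply here"
```

Two caveats apply to this sketch. A job that references an environment presents an OIDC subject of the form repo:raxx-app/TradeMasterAPI:environment:terraform-destructive rather than the branch ref, so the apply role's trust policy would need to accept that subject. And because the apply job re-plans, the destroy count it acts on can differ from the one the reviewer approved.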


Cross-references