ADR-0082 — Terraform deployment pipeline pattern (Option D: GH Actions + AWS OIDC)
Status: Accepted
Date: 2026-05-12 UTC
Author: architect-agent
Parent card: #1837
Epic: #1834
Related ADRs: 0022, 0072, 0077, [ADR-0076](https://internal-docs.raxx.app/architecture/adr/0076-queue-phase1-billing-v1-aggressive-12day.html)
Supersedes: nothing (establishes new policy)
Context
Incident: 2026-05-12 UTC — 30-minute mailbox routing gap
PR #1832 merged a bridge Lambda code change (fix(email-bridge): parse ToFull for mailbox routing). The Lambda ARN and environment variables are managed in the terraform/modules/email-delivery-stack stack. After merge, the stack required a manual terraform apply from a developer machine to propagate the change. The gap between code merge and apply was approximately 30 minutes. During that window, inbound emails were routed to the wrong FreeScout mailboxes.
The root cause was the absence of an automated apply path: no workflow ran plan on PR open, and none ran apply on merge to main.
The nine terraform roots
terraform/
  _bootstrap/               # README only; no .tf files
  cf-access/                # Cloudflare Access service tokens + application rules
  cf-pages-docs-customer/   # CF Pages project for docs.raxx.app
  cf-waf-probes/            # Synthetic WAF probe configs; sparse (no provider.tf yet)
  freescout/                # FreeScout Lightsail, DNS, SSM service token
  modules/                  # Shared modules (email-delivery-stack, etc.); not a root
  queue/                    # Queue service infra + waf.tf (Cloudflare WAF for queue.raxx.app)
  support-attachments/      # S3 + IAM for FreeScout attachment storage
  waf/                      # AWS-side CloudFront + CF WAF for raxx.app + getraxx.com
Eight of the nine are deployable roots (_bootstrap has no .tf files; modules/ is a shared library, not a root).
Options considered
Option A: Atlantis (self-hosted plan/apply server).
Atlantis provides PR-comment-driven plan and apply with locking. Rejected: requires a persistent server with high-availability requirements, Atlantis-level RBAC separate from GitHub, and more operational runbook complexity than the operator's current team size can support. Deferred to post-v1.
Option B: Terraform Cloud / HCP Terraform.
Managed plan/apply with VCS-driven runs. Rejected: adds a third party to the IAM trust chain for AWS credentials, costs $20+/month at the team tier, and requires a VCS connection to a private repo. Lock-in is also a concern given ADR-0076's Rust-phase trajectory.
Option C: Single matrix workflow across all roots.
One workflow file that fans out a matrix over all roots on any terraform/** change. Simpler to maintain, but the per-root IAM role boundary becomes harder to enforce (one role would need permissions across all roots), and a plan failure in one root blocks unrelated roots.
Option D: One workflow per root, AWS OIDC for short-lived creds. (CHOSEN)
Each root has its own .github/workflows/terraform-<root>.yml and a dedicated IAM role scoped to only the resources that root manages. AWS OIDC issues a short-lived credential to the workflow run; no static AWS keys are stored anywhere.
Why Option D
- IAM blast radius is bounded per root: a misconfiguration in the waf/ role cannot touch freescout/ resources.
- Workflow files are independently reviewable: a PR touching terraform/waf/** only runs the waf workflow, not all roots.
- Short-lived OIDC credentials satisfy the "no stored credentials" invariant. No AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in secrets or environment.
- Matches the existing terraform-validate.yml pattern (per-module matrix already exists; per-root workflows extend it naturally).
- Operator confirmed Option D on 2026-05-12 in #1834.
Decision
Adopt Option D as the canonical Terraform deployment pipeline pattern for all roots under terraform/.
Pattern — canonical contract every per-root workflow must follow
Every deployable root under terraform/ gets one workflow file and one IAM role. The following contract is binding for all implementation sub-cards (#1838–#1846).
Workflow file location
.github/workflows/terraform-<root-name>.yml
Where <root-name> matches the directory name under terraform/ (e.g., terraform-waf.yml, terraform-freescout.yml).
Triggers
on:
  pull_request:
    types: [opened, synchronize, reopened]
    paths:
      - 'terraform/<root-name>/**'
      - '.github/workflows/terraform-<root-name>.yml'
  push:
    branches:
      - main
    paths:
      - 'terraform/<root-name>/**'
      - '.github/workflows/terraform-<root-name>.yml'
  schedule:
    # Nightly drift detection (cards #1847, post-launch)
    - cron: '0 6 * * *'  # 06:00 UTC
The schedule trigger runs plan only (never apply). It fires regardless of path changes. Drift detection cards (#1847) land post-launch; the cron block is present in all workflow files from the start but gated behind a workflow_dispatch or nightly-only job condition to avoid noisy plan output during active development.
Job structure
PR-open path: checkout → setup-terraform → OIDC assume role → init → plan → sticky PR comment → required status check
push-to-main: checkout → setup-terraform → OIDC assume role → init → plan → apply → audit log emit
nightly-cron: checkout → setup-terraform → OIDC assume role → init → plan (plan-only, post to step summary)
Plan on PR produces a sticky comment on the PR using the same pattern as deploy-customer-docs.yml (find-or-create comment by marker, replace on re-run). The comment includes the plan summary, add/change/destroy counts, and a link to the workflow run for the full diff.
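The find-or-create step for the sticky comment can be sketched as a pure function over the PR's existing comments. This is a minimal sketch, not the logic in deploy-customer-docs.yml: the marker string, the function names, and the comment dictionary shape (`id`, `body`) are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical marker format; the marker actually used by
# deploy-customer-docs.yml is not shown in this ADR.
MARKER = "<!-- terraform-plan:{root} -->"


def render_comment(root: str, add: int, change: int, destroy: int,
                   summary: str, run_url: str) -> str:
    """Build the comment body: marker, counts, plan summary, link to the run."""
    return "\n".join([
        MARKER.format(root=root),
        f"Terraform plan for {root}",
        f"Plan: {add} to add, {change} to change, {destroy} to destroy.",
        summary,
        f"Full diff: {run_url}",
    ])


def choose_action(existing_comments: list[dict], root: str) -> tuple[str, Optional[int]]:
    """Return ("update", comment_id) if a marked comment exists, else ("create", None)."""
    marker = MARKER.format(root=root)
    for comment in existing_comments:
        if marker in comment.get("body", ""):
            return ("update", comment["id"])
    return ("create", None)
```

Keying the marker on the root name keeps comments from different per-root workflows from overwriting each other when one PR touches several roots.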
Apply on main runs only after the merge commit is available (post-merge push trigger). It does NOT apply on the PR branch. This preserves the invariant that only code merged to main is applied to infrastructure.
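The mapping from trigger to job path can be written down as a small decision function. This is a sketch under assumptions: the mode labels (`pr-plan`, `apply`, `drift-plan`) are hypothetical names, while the event names and the `refs/heads/main` ref are the values GitHub Actions reports for these triggers.

```python
def pipeline_mode(event_name: str, ref: str) -> str:
    """Map a workflow trigger to the job path described in this ADR."""
    if event_name == "pull_request":
        return "pr-plan"      # plan, sticky comment, status check; never apply
    if event_name == "schedule":
        return "drift-plan"   # plan only, posted to the step summary
    if event_name == "push" and ref == "refs/heads/main":
        return "apply"        # fresh plan, then apply, then audit emit
    raise ValueError(f"no pipeline path for event={event_name!r} ref={ref!r}")
```

Raising on anything else makes the "only main is applied" invariant explicit: a push to any other branch has no path to apply.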
The apply job must:
1. Re-run plan (not re-use the PR plan artifact) to catch drift between plan time and apply time.
2. Emit an audit log event to the unified audit table (ADR-0058) via a Raptor or Queue internal endpoint. The event payload includes: root name, run ID, commit SHA, actor (GitHub Actions), resource change summary, outcome.
3. On failure, post a Slack DM to D0AJ7K184TV.
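The audit event from step 2 might be assembled as follows. This is a sketch only: the field names and the `build_apply_event` helper are assumptions, since this ADR lists the payload contents but not a schema, and the internal endpoint that receives the event is not modeled. `GITHUB_RUN_ID` and `GITHUB_SHA` are default environment variables in GitHub Actions runs.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ApplyAuditEvent:
    root: str         # root name, e.g. "waf"
    run_id: str       # workflow run ID
    commit_sha: str   # merge commit that was applied
    actor: str        # always the pipeline, not a human
    changes: dict     # resource change summary: add/change/destroy counts
    outcome: str      # "success" or "failure"


def build_apply_event(env: dict, root: str, changes: dict, outcome: str) -> str:
    """Serialize one apply outcome as a JSON audit payload (hypothetical schema)."""
    if outcome not in ("success", "failure"):
        raise ValueError(f"unknown outcome: {outcome!r}")
    event = ApplyAuditEvent(
        root=root,
        run_id=env["GITHUB_RUN_ID"],
        commit_sha=env["GITHUB_SHA"],
        actor="github-actions",
        changes=changes,
        outcome=outcome,
    )
    return json.dumps(asdict(event), sort_keys=True)
```

In a workflow step this would be called with `os.environ` as `env`, and the resulting string posted to whichever internal endpoint the sub-card selects.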
Required status check name
terraform-<root-name>-plan
This check must be added to branch protection on main for every root before the workflow is considered production-ready. The implementation sub-card for each root includes adding this check as a gating step.
OIDC trust: per-root IAM role
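Each root has a dedicated IAM role. The OIDC trust policy limits assumption to this repository and, in the sample below, to runs on the main branch. The sub claim shown does not identify a workflow file, so this sample alone does not pin the role to a single workflow: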
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
        "token.actions.githubusercontent.com:sub": "repo:raxx-app/TradeMasterAPI:ref:refs/heads/main"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:raxx-app/TradeMasterAPI:*"
      }
    }
  }]
}
In the sample above, the StringEquals and StringLike conditions must both match, so the policy as written admits only runs on main; that is the apply-role shape. For PR-time plan runs, the sub condition must also allow the PR subject (repo:raxx-app/TradeMasterAPI:pull_request). Apply runs must be restricted to ref:refs/heads/main only. Implementers must set separate conditions on plan vs apply jobs within the same role, or use two roles per root (plan-role + apply-role). Two roles per root is the recommended approach for roots managing sensitive resources (Lambda code, SNS topics, SES identity).
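Under the two-role approach, the two trust policies differ only in the allowed sub values. The sketch below generates both; the `trust_policy` helper is hypothetical, and letting the plan role also accept the main ref (so the nightly plan can assume it) is an assumption, not something this ADR specifies.

```python
OIDC_HOST = "token.actions.githubusercontent.com"
REPO = "raxx-app/TradeMasterAPI"


def trust_policy(account_id: str, kind: str) -> dict:
    """Build the OIDC trust policy for a per-root "plan" or "apply" role."""
    main_subject = f"repo:{REPO}:ref:refs/heads/main"
    if kind == "apply":
        subjects = [main_subject]
    elif kind == "plan":
        # Assumption: plan also runs on main (nightly drift), so both are allowed.
        subjects = [f"repo:{REPO}:pull_request", main_subject]
    else:
        raise ValueError(f"unknown role kind: {kind!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{OIDC_HOST}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # A list under one condition key matches if any value matches.
                "StringEquals": {
                    f"{OIDC_HOST}:aud": "sts.amazonaws.com",
                    f"{OIDC_HOST}:sub": subjects,
                }
            },
        }],
    }
```

Using exact-match subjects in both roles avoids the wildcard `repo:...:*` pattern, which would admit any branch or tag in the repository.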
Role naming convention
raxx-tf-<root-name>-plan (read + plan only)
raxx-tf-<root-name>-apply (plan + apply)
IAM role pattern and minimum permissions
Sample: email-delivery-stack root (the incident root)
The email-delivery-stack module manages: SNS FIFO topics, SQS FIFO queues + DLQs, Lambda functions, Lambda event source mappings, CloudWatch alarms, DynamoDB dedup table, IAM roles for Lambda execution.
Plan role minimum permissions:
{
  "Effect": "Allow",
  "Action": [
    "sns:GetTopicAttributes", "sns:ListSubscriptionsByTopic",
    "sqs:GetQueueAttributes", "sqs:GetQueueUrl",
    "lambda:GetFunction", "lambda:GetFunctionConfiguration",
    "lambda:ListEventSourceMappings", "lambda:GetEventSourceMapping",
    "dynamodb:DescribeTable",
    "cloudwatch:DescribeAlarms",
    "iam:GetRole", "iam:GetRolePolicy", "iam:ListAttachedRolePolicies",
    "s3:GetBucketVersioning"
  ],
  "Resource": "arn:aws:*:us-east-1:<ACCOUNT_ID>:*raxx-email*"
}
Apply role additional permissions (beyond plan):
{
  "Effect": "Allow",
  "Action": [
    "sns:CreateTopic", "sns:SetTopicAttributes", "sns:Subscribe", "sns:Unsubscribe", "sns:DeleteTopic",
    "sqs:CreateQueue", "sqs:SetQueueAttributes", "sqs:DeleteQueue",
    "lambda:UpdateFunctionCode", "lambda:UpdateFunctionConfiguration",
    "lambda:CreateEventSourceMapping", "lambda:UpdateEventSourceMapping", "lambda:DeleteEventSourceMapping",
    "lambda:PublishLayerVersion", "lambda:DeleteLayerVersion",
    "dynamodb:CreateTable", "dynamodb:UpdateTable",
    "cloudwatch:PutMetricAlarm", "cloudwatch:DeleteAlarms",
    "iam:PassRole"
  ],
  "Resource": "arn:aws:*:us-east-1:<ACCOUNT_ID>:*raxx-email*"
}
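The iam:PassRole action is scoped to Lambda execution roles named raxx-email-*. Because IAM role ARNs carry no region (arn:aws:iam::<ACCOUNT_ID>:role/...), the iam:* actions need their own statement with a role-ARN resource; the us-east-1 pattern in the samples above does not match them. The same holds for s3:GetBucketVersioning, since bucket ARNs have neither region nor account. The apply role does NOT have iam:CreateRole or iam:AttachRolePolicy: IAM role creation for Lambda execution roles is a separate Terraform root operation or a one-time bootstrap step.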
Permission pattern by resource type (all roots)
| Resource type | Plan permissions | Apply adds |
|---|---|---|
| Lambda function | GetFunction, GetFunctionConfiguration | UpdateFunctionCode, UpdateFunctionConfiguration |
| SNS topic | GetTopicAttributes, ListSubscriptions | CreateTopic, SetTopicAttributes, Subscribe, DeleteTopic |
| SQS queue | GetQueueAttributes, GetQueueUrl | CreateQueue, SetQueueAttributes, DeleteQueue |
| CloudWatch alarm | DescribeAlarms | PutMetricAlarm, DeleteAlarms |
| IAM role (read) | GetRole, GetRolePolicy, ListAttachedRolePolicies | PassRole (scoped) |
| Cloudflare resources | Provider uses CLOUDFLARE_API_TOKEN from SSM, not IAM | same |
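The table can be read as data: a plan role gets the plan column, and an apply role gets the plan column plus the apply additions. The sketch below transcribes the AWS rows and derives action lists per role kind. The `PERMISSIONS` layout and the `actions_for` helper are hypothetical, and the service prefixes are added here rather than taken from the table.

```python
from __future__ import annotations

# AWS rows of the table above, keyed by IAM service prefix.
PERMISSIONS: dict[str, dict[str, list[str]]] = {
    "lambda": {
        "plan": ["GetFunction", "GetFunctionConfiguration"],
        "apply": ["UpdateFunctionCode", "UpdateFunctionConfiguration"],
    },
    "sns": {
        "plan": ["GetTopicAttributes", "ListSubscriptions"],
        "apply": ["CreateTopic", "SetTopicAttributes", "Subscribe", "DeleteTopic"],
    },
    "sqs": {
        "plan": ["GetQueueAttributes", "GetQueueUrl"],
        "apply": ["CreateQueue", "SetQueueAttributes", "DeleteQueue"],
    },
    "cloudwatch": {
        "plan": ["DescribeAlarms"],
        "apply": ["PutMetricAlarm", "DeleteAlarms"],
    },
    "iam": {
        "plan": ["GetRole", "GetRolePolicy", "ListAttachedRolePolicies"],
        "apply": ["PassRole"],
    },
}


def actions_for(services: list[str], role_kind: str) -> list[str]:
    """List IAM actions for a role: plan column, plus apply additions for apply roles."""
    if role_kind not in ("plan", "apply"):
        raise ValueError(f"unknown role kind: {role_kind!r}")
    actions: list[str] = []
    for service in services:
        entry = PERMISSIONS[service]
        actions += [f"{service}:{name}" for name in entry["plan"]]
        if role_kind == "apply":
            actions += [f"{service}:{name}" for name in entry["apply"]]
    return actions
```

A root that manages only Lambda and SQS would call `actions_for(["lambda", "sqs"], "plan")` for its plan role, which keeps unrelated services out of the policy.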
Cloudflare-managed resources (WAF rulesets in queue/waf.tf, waf/main.tf, cf-access/) use a CLOUDFLARE_API_TOKEN retrieved from AWS SSM Parameter Store at plan/apply time. The OIDC role must include ssm:GetParameter on the specific SSM parameter paths. The token is never stored in the workflow YAML or GitHub secrets.
Sub-root anomalies
_bootstrap — no .tf files
The _bootstrap/ directory contains only a README.md describing manual one-time steps (OIDC provider registration, initial IAM role seeding). No Terraform resources are managed here.
Decision: A terraform-_bootstrap.yml workflow is still created as a tracking shell. It runs terraform validate on push and produces a status check, but has no plan or apply jobs. The workflow documents that this root is intentionally static. The IAM role for _bootstrap is the OIDC-bootstrapper role itself — it need not be provisioned by Terraform.
cf-waf-probes — sparse, no provider.tf / versions.tf
The cf-waf-probes/ root currently contains only dns.tf. It is missing provider.tf and versions.tf, which are required for terraform init to succeed.
Decision: The implementation sub-card for cf-waf-probes (#1842 or equivalent) must scaffold provider.tf + versions.tf before wiring the workflow. The terraform-validate.yml already flags this root as invalid on PR. The sub-card implementer resolves this before adding the plan/apply workflow.
queue/waf.tf vs waf/main.tf — WAF boundary
Two roots manage Cloudflare WAF resources:
- terraform/queue/waf.tf: Cloudflare WAF ruleset for queue.raxx.app (rate limits, BFM skip rules, Bot Fight Mode bypass for service-to-service calls). Cloudflare-only resources. No AWS resources. IAM role requires only ssm:GetParameter for the Cloudflare API token.
- terraform/waf/main.tf: CloudFront distribution WAF ACLs for raxx.app and getraxx.com (AWS WAF Web ACLs, CloudFront associations). Also instantiates the cf-waf module for Cloudflare-side rules on the customer zones. AWS + Cloudflare resources. IAM role requires wafv2:*, cloudfront:GetDistribution, and ssm:GetParameter.
Boundary rule: The queue/ apply role must NOT have wafv2 permissions. The waf/ apply role must NOT be able to modify queue.raxx.app WAF rules (scoped by Cloudflare zone ID at the provider level). Each root's Cloudflare provider is configured with a zone-scoped API token stored in a separate SSM parameter path.
Terraform state backend
All roots use S3 + DynamoDB state locking (not local state). The S3 bucket and DynamoDB table are managed by the _bootstrap manual process. Each root's backend.tf (or versions.tf backend block) must reference the shared state bucket with a per-root prefix:
s3://<STATE_BUCKET>/terraform/<root-name>/terraform.tfstate
The apply workflow does NOT store any Terraform state locally between runs. State is always remote.
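Every name this ADR assigns to a root (workflow file, status check, role names, state key) follows from the directory name alone. A small helper can derive them all, which is useful for linting that a new root follows the contract. The `RootNames` type and `names_for` function are hypothetical; the patterns are the ones stated in the sections above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootNames:
    workflow_file: str
    status_check: str
    plan_role: str
    apply_role: str
    state_key: str  # key inside the shared state bucket


def names_for(root: str) -> RootNames:
    """Derive the per-root names defined by this ADR from the directory name."""
    return RootNames(
        workflow_file=f".github/workflows/terraform-{root}.yml",
        status_check=f"terraform-{root}-plan",
        plan_role=f"raxx-tf-{root}-plan",
        apply_role=f"raxx-tf-{root}-apply",
        state_key=f"terraform/{root}/terraform.tfstate",
    )
```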
Rollback
If the pipeline itself is broken (e.g., a workflow YAML bug causes plan to fail on every PR):
- Revert the workflow PR on main. This restores the previous workflow file.
- If a terraform apply produced incorrect infrastructure before the pipeline was reverted, the operator re-runs terraform apply from a developer machine using the same credentials pattern (OIDC assume via aws sts assume-role-with-web-identity locally, or a one-time IAM user key for emergency recovery).
- The human-fallback apply procedure is documented at docs/runbooks/terraform-pipeline.md (filed as sub-card #1848). This runbook must exist before any root's pipeline goes live.
There is no automated rollback of a terraform apply that completes successfully. A destructive change that was applied by the pipeline requires a corrective apply (a new commit that reverts the Terraform resource change, merged to main, applied by the pipeline).
Rollout plan
| Phase | Roots | Gate |
|---|---|---|
| 1: Incident root first | modules/email-delivery-stack (sub-card #1838) | Plan check passes on a test PR; apply verified on staging-equivalent |
| 2: High-change roots | freescout, cf-access, waf | After #1838 is stable for 3 business days |
| 3: Remaining roots | queue, support-attachments, cf-pages-docs-customer | After Phase 2 |
| 4: Anomalous roots | cf-waf-probes (after scaffold), _bootstrap (tracking shell) | After Phase 3 |
| 5: Drift detection | Nightly cron on all roots | Cards #1847, post-launch |
Phase 1 is launch-blocking because the incident root (email-delivery-stack) must not require manual apply again before v1 launch.
Security considerations
- No stored credentials. AWS access via OIDC short-lived tokens only. Cloudflare API tokens in SSM, fetched at runtime. No static secrets in workflow YAML, GitHub secrets, or .tfvars files committed to the repo.
- Audit trail. Every apply emits an audit event (see pattern above). The event is append-only per ADR-0022.
- Plan-only on PRs. No infrastructure changes happen on PR branches. The plan output is informational. Only merge to main triggers an apply.
- Two-role model. Plan and apply roles are separate IAM identities. A compromised PR-build environment cannot apply infrastructure changes.
- State file access. The S3 state bucket has a bucket policy restricting access to the OIDC roles and the claude-infisical-bootstrap IAM user (break-glass). No other IAM principals can read or write state.
- Kill-switch. Setting a Lambda's ReservedConcurrentExecutions to 0 (via a Terraform variable committed to main) is the kill-switch for email processing. The pipeline applies this within the normal merge→apply window.
- Breach notification. A failed apply that leaves infrastructure in an inconsistent state triggers the Slack DM to D0AJ7K184TV and creates a FreeScout incident ticket via the Console "Investigate" flow. The docs/runbooks/terraform-pipeline.md runbook (#1848) defines the investigation SOP.
Consequences
Positive
- Eliminates the class of incidents caused by "merged but not applied" drift. After a root's workflow is live, every merge to main that touches that root produces an apply within minutes.
- IAM blast radius is bounded: a misconfigured role for one root cannot affect another root's resources.
- Short-lived OIDC credentials satisfy the no-stored-credentials invariant with no operational overhead (no key rotation schedule needed).
- Plan comments on PRs make infrastructure changes visible during code review, the same way tests make logic changes visible.
- The pattern is consistent with the existing terraform-validate.yml convention already enforced in CI.
Negative / trade-offs
- Nine workflow files + nine (or eighteen) IAM roles to create and maintain. Initial setup cost is front-loaded.
- Apply happens asynchronously after merge. There is still a small window (seconds to low minutes) between merge and completed apply. This window is much smaller than the manual-apply gap but is not zero.
- Concurrent merges to main touching the same root can cause state lock contention. Terraform does not queue on a held DynamoDB lock by default: the second apply fails with a lock error unless it runs with -lock-timeout, in which case it retries until the lock is released or the timeout expires. Workflows should set a lock timeout so that the operator sees contention as normal behavior and not a failure.
- No apply-blocking on destructive plan changes (e.g., a destroy on a production SNS topic). Implementers must add a check for destroy actions in the plan and a manual approval gate for plans that include destructions. Note that plan -detailed-exitcode on its own only reports whether the plan contains any change (exit code 2), not whether a change is destructive. This is flagged as an open question below.
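One way to detect destructive plans is to inspect the machine-readable plan that terraform show -json emits for a saved plan file, where each entry under resource_changes carries a change.actions list. The sketch below flags any resource whose actions include delete, which also covers replacements (delete plus create). The function names and the resource addresses in the usage note are hypothetical.

```python
import json


def destroyed_addresses(plan_json: str) -> list:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]


def needs_manual_approval(plan_json: str) -> bool:
    """True when at least one resource would be destroyed."""
    return bool(destroyed_addresses(plan_json))
```

For a plan that deletes `aws_sns_topic.inbound` and updates `aws_lambda_function.bridge` in place, `destroyed_addresses` returns only the SNS topic address, and the workflow could route that run through an approval step.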
Open questions
- Destructive-change gate. Should applies that include at least one destroy action require a manual approval step (GitHub environment protection with required reviewer) before proceeding? This prevents accidental deletion of stateful resources (SNS topics, SQS queues, DynamoDB tables). Recommended: yes, but deferred to Phase 2 unless the email-delivery-stack root warrants it immediately. Needs operator decision before #1838 is claimed.
- Two IAM roles vs one per root. The two-role model (plan + apply) is cleaner for security but doubles the IAM bootstrapping work. Is the operator comfortable with the apply role being able to both plan and apply (one role per root, with the ref condition on the OIDC trust restricting apply to main)? The single-role approach is simpler but relies entirely on the OIDC ref condition for the apply boundary. Needs operator decision before sub-cards are dispatched.
- Cloudflare API token per root vs shared. Currently a single Cloudflare automation token is documented in reference_cloudflare_tokens.md. WAF rule management requires Zone:WAF:Edit. If the queue/ Cloudflare provider and the waf/ Cloudflare provider share the same token, a compromise of either workflow's SSM read path exposes the other root's write capability. Recommend: separate scoped tokens per root. Needs operator decision on whether to create additional Cloudflare tokens before Phase 2.
Cross-references
- Incident: PR #1832 merge + 30-minute mailbox routing gap (2026-05-12 UTC)
- Epic: #1834 (Terraform deployment automation)
- Parent card: #1837 (this ADR)
- ADR-0022: append-only hash-chain audit log (audit emit on apply)
- ADR-0072: SNS/SQS/SES durable email delivery (email-delivery-stack root)
- ADR-0076: Queue phase 1 C++ / billing v1 (related infrastructure context)
- ADR-0077: Cloudflare WAF layered defense (waf/ + queue/waf.tf context)
- feedback_aws_workloads_use_ssm_not_vault.md: workload secrets in SSM, not Infisical
- docs/runbooks/terraform-pipeline.md: human-fallback runbook (sub-card #1848, not yet created)