Raxx · internal docs

internal · gated

Terraform state recovery runbook

System:         raxx-iac-state-prod (S3 + DynamoDB)
Owner:          operator (Kristerpher)
Last reviewed:  2026-05-21
RTO target:     <30 min for state restore

State bucket facts

Bucket:         raxx-iac-state-prod
Region:         us-east-1
Account:        521228113048
DynamoDB lock:  raxx-iac-state-locks
Encryption:     AES256 (SSE-S3)
Versioning:     Enabled (as of 2026-04-28; lifecycle applied 2026-05-21)
MFA Delete:     Not enabled — requires root-account MFA (see future action below)

Lifecycle policy expire-noncurrent-tfstate-versions: noncurrent versions expire after 90 days, so state snapshots older than that cannot be restored. Incomplete multipart uploads are purged after 7 days.

TF roots and their state keys

Root module                               State key
terraform/cf-access                       cf-access/terraform.tfstate
terraform/cf-pages-docs-customer          cf-pages-docs-customer/terraform.tfstate
terraform/cf-pages-support                cf-pages-support/terraform.tfstate
terraform/freescout                       freescout/terraform.tfstate
terraform/queue                           queue/terraform.tfstate
terraform/support-attachments             support-attachments/terraform.tfstate
terraform/waf                             waf/terraform.tfstate
terraform/modules/cf-access-getraxx       (see that module's versions.tf)
terraform/modules/email-delivery-stack    (see that module's versions.tf)

How to tell it's broken

How to diagnose (in order)

  1. Verify the bucket is reachable and versioning is on:

     aws s3api get-bucket-versioning --bucket raxx-iac-state-prod --region us-east-1

     Expected: {"Status": "Enabled"}

  2. Confirm the state key exists:

     aws s3 ls s3://raxx-iac-state-prod/<root>/terraform.tfstate

  3. List all versions of the state key:

     aws s3api list-object-versions \
       --bucket raxx-iac-state-prod \
       --prefix <root>/terraform.tfstate \
       --region us-east-1

     Look at Versions[]: the top entry (IsLatest: true) is the current state. Previous entries are recovery candidates.

  4. Inspect a specific version:

     aws s3api get-object \
       --bucket raxx-iac-state-prod \
       --key <root>/terraform.tfstate \
       --version-id <VersionId> \
       /tmp/tfstate-inspect.json
     python3 -m json.tool /tmp/tfstate-inspect.json | head -40

  5. Check for a stale DynamoDB lock:

     aws dynamodb scan --table-name raxx-iac-state-locks --region us-east-1

     A stale lock has a LockID matching raxx-iac-state-prod/<root>/terraform.tfstate. Items whose LockID ends in -md5 are state checksum entries written by the S3 backend, not locks.
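The read-only checks in steps 1, 3, and 5 can be wrapped in one function per root. This is a minimal sketch, assuming the AWS CLI is installed and configured for the account above; the function names state_key, lock_id, and diagnose_root are hypothetical helpers, not existing tooling.

```shell
# Hypothetical helpers; bucket, table, and region come from "State bucket facts".
BUCKET="raxx-iac-state-prod"
LOCK_TABLE="raxx-iac-state-locks"
REGION="us-east-1"

# Print the S3 key for a root, e.g. "queue" -> "queue/terraform.tfstate".
state_key() {
  printf '%s/terraform.tfstate\n' "$1"
}

# Print the DynamoDB LockID that corresponds to a root's state key.
lock_id() {
  printf '%s/%s\n' "$BUCKET" "$(state_key "$1")"
}

# Run the read-only checks from steps 1, 3, and 5 for one root.
diagnose_root() {
  root="$1"
  echo "== bucket versioning =="
  aws s3api get-bucket-versioning --bucket "$BUCKET" --region "$REGION"
  echo "== versions of $(state_key "$root") =="
  aws s3api list-object-versions --bucket "$BUCKET" \
    --prefix "$(state_key "$root")" --region "$REGION" \
    --query 'Versions[].[VersionId,IsLatest,LastModified]' --output text
  echo "== lock item (empty output means no lock) =="
  aws dynamodb get-item --table-name "$LOCK_TABLE" --region "$REGION" \
    --key "{\"LockID\": {\"S\": \"$(lock_id "$root")\"}}"
}
```

Usage: diagnose_root queue. Nothing in the function writes to S3 or DynamoDB, so it should be safe to run during an incident.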

Known failure modes

Failure mode A: stale DynamoDB lock

Symptom: terraform plan or terraform apply fails with "Error acquiring the state lock" (after waiting first, if -lock-timeout is set).
Cause: A prior apply was interrupted before it could release the lock.
Fix: Confirm no apply is still running, then delete the specific lock item; do NOT delete the entire table.

aws dynamodb delete-item \
  --table-name raxx-iac-state-locks \
  --key '{"LockID": {"S": "raxx-iac-state-prod/<root>/terraform.tfstate"}}' \
  --region us-east-1

Alternatively, use Terraform's built-in force-unlock (requires the lock ID from the error message):

terraform force-unlock <lock-id>

Verification: Re-run terraform plan. It should proceed past the lock check.
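Before deleting a lock, it helps to see who took it and when, so that a lock held by an apply that is still running is not removed by mistake. The sketch below reads the lock item's Info attribute, in which Terraform's S3 backend stores a JSON description of the lock (ID, Who, Operation, Created). The function name show_lock is hypothetical.

```shell
# Hypothetical helper; bucket, table, and region come from "State bucket facts".
BUCKET="raxx-iac-state-prod"
LOCK_TABLE="raxx-iac-state-locks"
REGION="us-east-1"

# Print the Info attribute of a root's lock item.
# Prints "None" when no lock item exists for that root.
show_lock() {
  aws dynamodb get-item \
    --table-name "$LOCK_TABLE" --region "$REGION" \
    --key "{\"LockID\": {\"S\": \"${BUCKET}/$1/terraform.tfstate\"}}" \
    --query 'Item.Info.S' --output text
}
```

Usage: show_lock queue. The ID field in the output is the value terraform force-unlock expects.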

Failure mode B: state file accidentally deleted

Symptom: terraform init / terraform plan returns NoSuchKey for the state object.
Cause: Manual aws s3 rm or a misconfigured apply deleted the state key.
Fix: Restore the most recent version using S3 versioning.

Step 1 — list versions:

aws s3api list-object-versions \
  --bucket raxx-iac-state-prod \
  --prefix <root>/terraform.tfstate \
  --region us-east-1

Step 2 — identify the latest non-delete version. DeleteMarkers[] entries are the deletions; Versions[] entries are real state snapshots.

Step 3 — copy that version back as the current object:

aws s3api copy-object \
  --bucket raxx-iac-state-prod \
  --copy-source "raxx-iac-state-prod/<root>/terraform.tfstate?versionId=<VersionId>" \
  --key <root>/terraform.tfstate \
  --server-side-encryption AES256 \
  --region us-east-1

Verification:

aws s3 ls s3://raxx-iac-state-prod/<root>/terraform.tfstate
terraform plan   # should show no unexpected drift
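Steps 1 and 2 can be collapsed into a single query. This sketch assumes S3 returns the versions of a key newest first, which is its documented ordering; the function name latest_state_version is hypothetical.

```shell
# Hypothetical helper; bucket and region come from "State bucket facts".
BUCKET="raxx-iac-state-prod"
REGION="us-east-1"

# Print the VersionId of the newest real (non-delete-marker) version of a
# root's state. Delete markers live in DeleteMarkers[], so Versions[] holds
# only real snapshots. The Key filter guards against prefix matches on other
# objects. Prints "None" when no version survives.
latest_state_version() {
  key="$1/terraform.tfstate"
  aws s3api list-object-versions \
    --bucket "$BUCKET" --prefix "$key" --region "$REGION" \
    --query "Versions[?Key=='${key}'] | [0].VersionId" \
    --output text
}
```

Usage: latest_state_version queue. Pass the printed VersionId to the copy-object command in Step 3.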

Failure mode C: state file corrupted (invalid JSON)

Symptom: terraform init fails with a JSON parse error, or terraform plan aborts with "Error reading state."
Cause: Partial write, bit-rot, or a bad manual edit of the state file.
Fix: Same as Failure mode B: restore the previous good version.

Use list-object-versions to find the version immediately before the corruption event (correlate the timestamp with when the error was first noticed), then copy-object to restore it.
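A candidate version can be checked before it is restored, which avoids copying back a snapshot that is also corrupted. The sketch below downloads one version and tests that it parses as JSON and carries a top-level serial field, as Terraform state files do. The function names is_valid_state and check_version are hypothetical, and the check is a sanity test, not a full state validation.

```shell
# Hypothetical helpers; bucket and region come from "State bucket facts".
BUCKET="raxx-iac-state-prod"
REGION="us-east-1"

# Return 0 if the file parses as JSON and has a top-level "serial" field.
is_valid_state() {
  python3 -c 'import json,sys; d=json.load(open(sys.argv[1])); sys.exit(0 if "serial" in d else 1)' "$1" 2>/dev/null
}

# Download one version of a root's state to a temp file and report
# "OK <VersionId>" or "BAD <VersionId>". Returns 2 if the download fails.
check_version() {
  root="$1"
  version_id="$2"
  tmp="$(mktemp)"
  aws s3api get-object --bucket "$BUCKET" --key "$root/terraform.tfstate" \
    --version-id "$version_id" --region "$REGION" "$tmp" >/dev/null \
    || { rm -f "$tmp"; return 2; }
  if is_valid_state "$tmp"; then
    status=0; echo "OK $version_id"
  else
    status=1; echo "BAD $version_id"
  fi
  rm -f "$tmp"
  return "$status"
}
```

Usage: check_version queue <VersionId>. Work backwards through the version list until the first OK result, then restore that version with copy-object.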

Failure mode D: encryption header rejected (AccessDenied on PutObject)

Symptom: terraform apply fails with AccessDenied only on the state write step.
Cause: The bucket policy requires x-amz-server-side-encryption on every PutObject. The TF backend omits it if encrypt = true is missing from versions.tf.
Fix: Confirm all versions.tf backend blocks have encrypt = true. If a manual upload was attempted without it, use the --server-side-encryption AES256 flag shown in Failure mode B's copy-object step.
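The backend check can be scripted across all roots. This sketch assumes backend blocks live in files named versions.tf, as this runbook describes, and uses a plain line match instead of an HCL parser, so treat a hit as a prompt to inspect the file by hand. The function name find_unencrypted_backends is hypothetical.

```shell
# Hypothetical helper: list versions.tf files that declare an S3 backend
# but contain no "encrypt = true" line.
find_unencrypted_backends() {
  find "$1" -name versions.tf -type f | sort | while IFS= read -r f; do
    # Skip files without an S3 backend block (e.g. provider-only modules).
    grep -q 'backend "s3"' "$f" || continue
    grep -Eq '^[[:space:]]*encrypt[[:space:]]*=[[:space:]]*true' "$f" || echo "$f"
  done
}
```

Usage: find_unencrypted_backends terraform/. Empty output means every S3 backend block found has the setting.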

Failure mode E: lifecycle policy missing (cost growth)

Symptom: aws s3api get-bucket-lifecycle-configuration returns NoSuchLifecycleConfiguration.
Cause: Lifecycle was not applied or was accidentally removed.
Fix:

aws s3api put-bucket-lifecycle-configuration \
  --bucket raxx-iac-state-prod \
  --region us-east-1 \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-noncurrent-tfstate-versions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
    }]
  }'

Verification: aws s3api get-bucket-lifecycle-configuration --bucket raxx-iac-state-prod --region us-east-1 returns the rule with Status: Enabled.

Emergency stop

Terraform has no running service to stop. To prevent any concurrent applies from touching state:

  1. Manually insert a lock item for each root you need to freeze. Terraform locks per state key, so the LockID must be raxx-iac-state-prod/<root>/terraform.tfstate; a generic key such as raxx-iac-state-prod/EMERGENCY-FREEZE does not block anything.

     aws dynamodb put-item \
       --table-name raxx-iac-state-locks \
       --item '{"LockID": {"S": "raxx-iac-state-prod/<root>/terraform.tfstate"}, "Info": {"S": "Manual freeze - remove to unlock"}}' \
       --region us-east-1

     To unfreeze, remove the item with the delete-item command from Failure mode A.

  2. Revoke the terraform-runner IAM role's s3:PutObject permission in the AWS console until the incident is resolved.
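The sentinel lock in step 1 can be made safer with conditional writes, so that it never overwrites a lock held by an apply that is still running and the unfreeze never removes a real lock. This is a sketch under that assumption; the function names freeze_root and unfreeze_root and the FREEZE_NOTE marker are hypothetical. Because the Info value is plain text and not the JSON Terraform normally writes, Terraform may report the lock details as unreadable while still refusing to proceed.

```shell
# Hypothetical helpers; bucket, table, and region come from "State bucket facts".
BUCKET="raxx-iac-state-prod"
LOCK_TABLE="raxx-iac-state-locks"
REGION="us-east-1"
FREEZE_NOTE="Manual freeze - remove to unlock"

# Insert a freeze lock for one root. The condition makes the write fail
# if any lock item already exists for that state key.
freeze_root() {
  aws dynamodb put-item \
    --table-name "$LOCK_TABLE" --region "$REGION" \
    --condition-expression "attribute_not_exists(LockID)" \
    --item "{\"LockID\": {\"S\": \"${BUCKET}/$1/terraform.tfstate\"}, \"Info\": {\"S\": \"${FREEZE_NOTE}\"}}"
}

# Remove the freeze lock for one root. The condition makes the delete fail
# unless the item is the manual freeze marker.
unfreeze_root() {
  aws dynamodb delete-item \
    --table-name "$LOCK_TABLE" --region "$REGION" \
    --key "{\"LockID\": {\"S\": \"${BUCKET}/$1/terraform.tfstate\"}}" \
    --condition-expression "#i = :m" \
    --expression-attribute-names '{"#i": "Info"}' \
    --expression-attribute-values "{\":m\": {\"S\": \"${FREEZE_NOTE}\"}}"
}
```

Usage: freeze_root queue during the incident, unfreeze_root queue afterwards. A ConditionalCheckFailedException from freeze_root means a lock already exists for that root and should be investigated first.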

Future action: MFA Delete

MFA Delete prevents any principal (including compromised IAM keys) from permanently deleting object versions. Enabling it requires root-account MFA credentials. This is not automated.

When the operator is ready:

  1. Log in to the AWS console as root with MFA.

  2. Run the following with the root user's credentials (S3 rejects this request from IAM principals):

     aws s3api put-bucket-versioning \
       --bucket raxx-iac-state-prod \
       --versioning-configuration Status=Enabled,MFADelete=Enabled \
       --mfa "arn:aws:iam::521228113048:mfa/root-account-mfa <TOTP-code>" \
       --region us-east-1

  3. Verify:

     aws s3api get-bucket-versioning --bucket raxx-iac-state-prod --region us-east-1

     The output should include "MFADelete": "Enabled".

Track as type:reliability issue #2654 (to be filed).

Escalation

Wake the operator when:

  - State is corrupted AND no clean version exists within the last 24h.
  - More than one TF root simultaneously shows unexpected drift after a state restore.
  - Any IAM credential used for TF applies is suspected compromised.

Contact: operator (Kristerpher) via Slack DM D0AJ7K184TV.