Terraform state recovery runbook
System: raxx-iac-state-prod (S3 + DynamoDB)
Owner: operator (Kristerpher)
Last reviewed: 2026-05-21
RTO target: <30 min for state restore
State bucket facts
Bucket: raxx-iac-state-prod
Region: us-east-1
Account: 521228113048
DynamoDB lock: raxx-iac-state-locks
Encryption: AES256 (SSE-S3)
Versioning: Enabled (as of 2026-04-28; lifecycle applied 2026-05-21)
MFA Delete: Not enabled — requires root-account MFA (see future action below)
Lifecycle policy expire-noncurrent-tfstate-versions: noncurrent versions expire after 90 days. Incomplete multipart uploads purged after 7 days.
TF roots and their state keys
| Root module | State key |
|---|---|
| terraform/cf-access | cf-access/terraform.tfstate |
| terraform/cf-pages-docs-customer | cf-pages-docs-customer/terraform.tfstate |
| terraform/cf-pages-support | cf-pages-support/terraform.tfstate |
| terraform/freescout | freescout/terraform.tfstate |
| terraform/queue | queue/terraform.tfstate |
| terraform/support-attachments | support-attachments/terraform.tfstate |
| terraform/waf | waf/terraform.tfstate |
| terraform/modules/cf-access-getraxx | (see that module's versions.tf) |
| terraform/modules/email-delivery-stack | (see that module's versions.tf) |
How to tell it's broken
- terraform init fails with "Error loading state: AccessDenied": IAM issue, not state corruption.
- terraform init fails with "Error loading state: NoSuchKey": state file missing or key typo.
- terraform plan shows resources that were just created as "will be created": state file is stale or points to the wrong key.
- terraform apply fails mid-run and leaves its lock item in DynamoDB: stale lock.
How to diagnose (in order)
1. Verify the bucket is reachable and versioning is on:
aws s3api get-bucket-versioning --bucket raxx-iac-state-prod --region us-east-1
Expected: {"Status": "Enabled"}
2. Confirm the state key exists:
aws s3 ls s3://raxx-iac-state-prod/<root>/terraform.tfstate
3. List all versions of the state key:
aws s3api list-object-versions \
  --bucket raxx-iac-state-prod \
  --prefix <root>/terraform.tfstate \
  --region us-east-1
Look at Versions[]: the top entry (IsLatest: true) is the current state. Previous entries are recovery candidates.
4. Inspect a specific version:
aws s3api get-object \
  --bucket raxx-iac-state-prod \
  --key <root>/terraform.tfstate \
  --version-id <VersionId> \
  /tmp/tfstate-inspect.json
python3 -m json.tool /tmp/tfstate-inspect.json | head -40
5. Check for a stale DynamoDB lock:
aws dynamodb scan --table-name raxx-iac-state-locks --region us-east-1
A stale lock has a LockID matching raxx-iac-state-prod/<root>/terraform.tfstate. Items whose LockID ends in -md5 are state digests that Terraform keeps permanently; they are not locks.
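The version listing in step 3 can be narrowed to the fields that matter for recovery by using the CLI's built-in --query filter. The sketch below assumes a helper named list_state_versions, which is hypothetical and not part of any existing tooling for this bucket. It prints one tab-separated row per stored version in the order S3 returns them, which is normally newest first.

```shell
# Hypothetical helper: print VersionId, LastModified, IsLatest and Size for
# every stored version of one root's state key.
# Usage: list_state_versions <root>    e.g. list_state_versions waf
list_state_versions() {
  key="$1/terraform.tfstate"
  aws s3api list-object-versions \
    --bucket raxx-iac-state-prod \
    --prefix "${key}" \
    --region us-east-1 \
    --query "Versions[?Key=='${key}'].[VersionId, LastModified, IsLatest, Size]" \
    --output text
}
```

The Key filter is there because --prefix would also match any other object whose name merely starts with the same string. Delete markers are not shown; they live under DeleteMarkers[] in the full output.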
Known failure modes
Failure mode A: stale DynamoDB lock
Symptom: terraform apply fails with "Error acquiring the state lock" (immediately by default, or after waiting out -lock-timeout if one is set).
Cause: A prior apply was interrupted before it could release the lock.
Fix: Delete the specific lock item — do NOT delete the entire table.
aws dynamodb delete-item \
--table-name raxx-iac-state-locks \
--key '{"LockID": {"S": "raxx-iac-state-prod/<root>/terraform.tfstate"}}' \
--region us-east-1
Alternatively, use Terraform's built-in force-unlock (requires the lock ID from the error message):
terraform force-unlock <lock-id>
Verification: Re-run terraform plan. It should proceed past the lock check.
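Before removing a lock it is worth confirming that nobody is still holding it legitimately. Terraform's S3 backend normally writes an Info attribute on the lock item containing a JSON document with fields such as ID, Operation, Who and Created. The sketch below assumes that layout; the helper name show_state_lock is hypothetical.

```shell
# Hypothetical helper: print the Info attribute of the lock item for one root,
# so the operator can see who took the lock and when before deleting it.
# Usage: show_state_lock <root>
show_state_lock() {
  aws dynamodb get-item \
    --table-name raxx-iac-state-locks \
    --key "{\"LockID\": {\"S\": \"raxx-iac-state-prod/$1/terraform.tfstate\"}}" \
    --region us-east-1 \
    --query 'Item.Info.S' \
    --output text
}
```

If the command prints None, there is no lock item for that root and the failure has a different cause. The ID field in the output is the value terraform force-unlock expects.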
Failure mode B: state file accidentally deleted
Symptom: terraform init / terraform plan returns NoSuchKey for the state object.
Cause: Manual aws s3 rm or a misconfigured apply deleted the state key.
Fix: Restore the most recent version using S3 versioning.
Step 1 — list versions:
aws s3api list-object-versions \
--bucket raxx-iac-state-prod \
--prefix <root>/terraform.tfstate \
--region us-east-1
Step 2 — identify the latest non-delete version. DeleteMarkers[] entries are the deletions; Versions[] entries are real state snapshots.
Step 3 — copy that version back as the current object:
aws s3api copy-object \
--bucket raxx-iac-state-prod \
--copy-source "raxx-iac-state-prod/<root>/terraform.tfstate?versionId=<VersionId>" \
--key <root>/terraform.tfstate \
--server-side-encryption AES256 \
--region us-east-1
Verification:
aws s3 ls s3://raxx-iac-state-prod/<root>/terraform.tfstate
terraform plan # should show no unexpected drift
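Steps 1 to 3 can be chained so the version ID does not have to be copied by hand under time pressure. The following is a minimal sketch, not a vetted tool: restore_latest_state_version is a hypothetical name, and the helper always picks the newest real snapshot, which is only appropriate when the state was deleted rather than corrupted.

```shell
# Hypothetical helper combining Steps 1-3 for a single root.
# Usage: restore_latest_state_version <root>
restore_latest_state_version() {
  bucket="raxx-iac-state-prod"
  key="$1/terraform.tfstate"

  # S3 lists the versions of a key newest first; delete markers are reported
  # separately under DeleteMarkers[], so [0] is the newest real snapshot.
  version_id="$(aws s3api list-object-versions \
    --bucket "${bucket}" --prefix "${key}" --region us-east-1 \
    --query "Versions[?Key=='${key}'] | [0].VersionId" --output text)"

  # The CLI prints "None" when the query matches nothing.
  if [ -z "${version_id}" ] || [ "${version_id}" = "None" ]; then
    echo "no restorable version found for ${key}" >&2
    return 1
  fi

  echo "restoring ${key} from version ${version_id}" >&2
  aws s3api copy-object \
    --bucket "${bucket}" \
    --copy-source "${bucket}/${key}?versionId=${version_id}" \
    --key "${key}" \
    --server-side-encryption AES256 \
    --region us-east-1
}
```

The copy creates a new current version on top of the delete marker, so the marker and all older versions stay in place as an audit trail.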
Failure mode C: state file corrupted (invalid JSON)
Symptom: terraform init fails with JSON parse error or terraform plan aborts with "Error reading state."
Cause: Partial write, bit-rot, or a bad manual edit of the state file.
Fix: Same as Failure mode B — restore the previous good version.
Use list-object-versions to find the version immediately before the corruption event (correlate the timestamp with when the error was first noticed), then copy-object to restore it.
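Because the timestamp correlation is a judgment call, it helps to check a candidate version mechanically before restoring it. The sketch below assumes the candidate has already been downloaded with get-object, and that a usable snapshot is a JSON object carrying the version, serial and lineage keys that Terraform writes to its state files. The helper name validate_state_file is hypothetical.

```shell
# Hypothetical helper: succeed only if the file parses as JSON and carries the
# top-level keys Terraform writes to a state snapshot.
# Usage: validate_state_file /tmp/tfstate-inspect.json
validate_state_file() {
  python3 - "$1" <<'PY'
import json
import sys

try:
    with open(sys.argv[1]) as fh:
        state = json.load(fh)
except (OSError, ValueError) as exc:
    sys.exit(f"invalid state file: {exc}")

if not isinstance(state, dict):
    sys.exit("invalid state file: top level is not a JSON object")

missing = [k for k in ("version", "serial", "lineage") if k not in state]
if missing:
    sys.exit("missing keys: " + ", ".join(missing))

print(f"ok: serial={state['serial']} lineage={state['lineage']}")
PY
}
```

A candidate that passes is structurally sound, not necessarily up to date: compare its serial against neighbouring versions, since a higher serial means a later write.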
Failure mode D: encryption header rejected (AccessDenied on PutObject)
Symptom: terraform apply fails with AccessDenied only on the state write step.
Cause: The bucket policy requires x-amz-server-side-encryption on every PutObject. The TF backend omits it if encrypt = true is missing from versions.tf.
Fix: Confirm all versions.tf backend blocks have encrypt = true. If a manual upload was attempted without it, use the --server-side-encryption AES256 flag shown in Failure mode B's copy-object step.
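The check across all versions.tf files can be done in one pass with find and grep. This is a rough sketch under the assumption that each backend block sets encrypt on its own line; find_unencrypted_backends is a hypothetical helper name.

```shell
# Hypothetical helper: print every versions.tf under the given directory that
# has no "encrypt = true" line. Prints nothing when all files pass.
# Usage: find_unencrypted_backends terraform
find_unencrypted_backends() {
  find "${1:-terraform}" -name versions.tf -type f \
    -exec grep -L -E '^[[:space:]]*encrypt[[:space:]]*=[[:space:]]*true' {} +
}
```

The match is textual, so a versions.tf with no backend block at all is also reported and needs a manual look.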
Failure mode E: lifecycle policy missing (cost growth)
Symptom: aws s3api get-bucket-lifecycle-configuration returns NoSuchLifecycleConfiguration.
Cause: Lifecycle was not applied or was accidentally removed.
Fix:
aws s3api put-bucket-lifecycle-configuration \
--bucket raxx-iac-state-prod \
--region us-east-1 \
--lifecycle-configuration '{
"Rules": [{
"ID": "expire-noncurrent-tfstate-versions",
"Status": "Enabled",
"Filter": {"Prefix": ""},
"NoncurrentVersionExpiration": {"NoncurrentDays": 90},
"AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
}]
}'
Verification: aws s3api get-bucket-lifecycle-configuration --bucket raxx-iac-state-prod --region us-east-1 returns the rule with Status: Enabled.
Emergency stop
Terraform has no running service to stop. To prevent any concurrent applies from touching state:
1. Manually insert a sentinel lock into DynamoDB for each root that must be frozen:
aws dynamodb put-item \
  --table-name raxx-iac-state-locks \
  --item '{"LockID": {"S": "raxx-iac-state-prod/<root>/terraform.tfstate"}, "Info": {"S": "Manual freeze - remove to unlock"}}' \
  --region us-east-1
Note: Terraform locks per state key, so the LockID must match the root's state key exactly. A generic key such as raxx-iac-state-prod/EMERGENCY-FREEZE does not block anything.
2. Revoke the terraform-runner IAM role's s3:PutObject permission in the AWS console until the incident is resolved.
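Freezing several roots by hand is error-prone, and a plain put-item silently overwrites a lock that a running apply already holds. The sketch below adds a condition expression so the write fails if a lock item already exists, and loops over the roots given on the command line. The helper names freeze_root and freeze_roots are hypothetical.

```shell
# Hypothetical helper: place a sentinel lock on one root's state key. The
# condition expression makes the call fail instead of overwriting a lock
# item that already exists.
freeze_root() {
  aws dynamodb put-item \
    --table-name raxx-iac-state-locks \
    --item "{\"LockID\": {\"S\": \"raxx-iac-state-prod/$1/terraform.tfstate\"}, \"Info\": {\"S\": \"Manual freeze - remove to unlock\"}}" \
    --condition-expression 'attribute_not_exists(LockID)' \
    --region us-east-1
}

# Freeze several roots in one pass, e.g.: freeze_roots cf-access waf queue
freeze_roots() {
  for r in "$@"; do
    freeze_root "$r" || echo "could not freeze $r (already locked?)" >&2
  done
}
```

To lift the freeze, delete each sentinel with the delete-item command from Failure mode A. A root reported as already locked may have an apply in flight, so investigate it before assuming the freeze is complete.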
Future action: MFA Delete
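MFA Delete prevents any principal (including compromised IAM keys) from permanently deleting object versions or changing the bucket's versioning state without a valid MFA code. Enabling it requires root-account MFA credentials. This is not automated. S3 does not support lifecycle configurations on a bucket with MFA Delete enabled, so decide how the expire-noncurrent-tfstate-versions rule will be handled before enabling it.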
When the operator is ready:
1. Log in to AWS console as root with MFA.
2. Run:
aws s3api put-bucket-versioning \
--bucket raxx-iac-state-prod \
--versioning-configuration Status=Enabled,MFADelete=Enabled \
--mfa "arn:aws:iam::521228113048:mfa/root-account-mfa <TOTP-code>" \
--region us-east-1
3. Verify: aws s3api get-bucket-versioning --bucket raxx-iac-state-prod --region us-east-1 should return MFADelete: Enabled.
Track as type:reliability issue #2654 (to be filed).
Escalation
Wake the operator when:
- State is corrupted AND no clean version exists within the last 24h.
- More than one TF root simultaneously shows unexpected drift after a state restore.
- Any IAM credential used for TF applies is suspected compromised.
Contact: operator (Kristerpher) via Slack DM D0AJ7K184TV.