ADR-0103 — BCP Backup Posture: vault snapshots, TF state versioning, and restore drills
Status: Accepted
Date: 2026-05-21 UTC
Author: raxx-software-architect
Refs: docs/architecture/business-continuity-plan-2026-05-21.md, SEV-4 vault Lightsail timeout (2026-05-21)
Context
As of 2026-05-21 UTC, two days before v1 launch, a SEV-4 vault Lightsail timeout incident revealed that the Infisical vault, which runs on a single Lightsail instance, has no automated snapshot schedule and no scheduled secret export. Three related gaps were identified in the same sweep:
- Terraform state bucket (raxx-iac-state-prod) S3 Versioning not confirmed enabled.
- No automated daily snapshot for the vault instance (unlike FreeScout, which has freescout-backup.yml).
- No verified restore drill on record for either Heroku Postgres or FreeScout.
This ADR records the decisions made to close these gaps in the v1 window.
Decision
D1 — Enable Lightsail auto-snapshot for the vault instance at 06:30 UTC daily
The vault instance must have daily Lightsail snapshots, retaining 7 (matching FreeScout). The snapshot runs 30 minutes after the FreeScout backup completes to stagger AWS API calls.
Rationale: a vault host failure without a snapshot requires full Infisical CE re-install and manual re-entry of all secrets. This is unacceptable at launch. The auto-snapshot is the minimum viable posture; a Terraform-managed snapshot automation is the v2 target.
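A minimal sketch of the D1 change, assuming it is applied through the Lightsail EnableAddOn API via boto3. The instance name `raxx-vault` is hypothetical (this ADR does not name the Lightsail resource), and the client is passed in so the request can be inspected without AWS credentials. The Lightsail API reference appears to restrict `snapshotTimeOfDay` to hourly `HH:00` values, so the 06:30 slot should be confirmed before relying on it; the sketch passes the value through unchanged.

```python
import re

# Hypothetical instance name; the ADR does not state the Lightsail resource name.
VAULT_INSTANCE = "raxx-vault"


def auto_snapshot_request(resource_name: str, time_utc: str) -> dict:
    """Build keyword arguments for the Lightsail EnableAddOn call."""
    # Only checks the HH:MM shape; it does not enforce any hourly-increment rule.
    if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", time_utc):
        raise ValueError(f"expected HH:MM in UTC, got {time_utc!r}")
    return {
        "resourceName": resource_name,
        "addOnRequest": {
            "addOnType": "AutoSnapshot",
            "autoSnapshotAddOnRequest": {"snapshotTimeOfDay": time_utc},
        },
    }


def enable_auto_snapshot(client, resource_name: str, time_utc: str) -> None:
    """Send the request; `client` is expected to be boto3.client("lightsail")."""
    client.enable_add_on(**auto_snapshot_request(resource_name, time_utc))
```

In use, `enable_auto_snapshot(boto3.client("lightsail"), VAULT_INSTANCE, "06:30")` would issue the call, which is the API equivalent of the CLI `enable-add-on` approach.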
D2 — Enable S3 Versioning on raxx-iac-state-prod before launch
S3 Versioning must be enabled on the Terraform state bucket before 2026-05-23 UTC. Without it, an accidental terraform state rm or state-bucket object deletion is unrecoverable. MFA Delete is recommended once root credentials can be applied from a secure session.
Rationale: the cost of enabling versioning is negligible (S3 stores prior state versions; cost is a few KB per apply). The cost of a corrupted-state recovery is 4–8 hours of terraform import work against live infrastructure.
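A minimal sketch of the D2 change, assuming boto3 and credentials with permission to read and set bucket versioning. It enables versioning only when needed and then reads the status back, so the "not confirmed enabled" gap is closed by an explicit check. The function name `ensure_versioning` is illustrative, not part of any existing tooling.

```python
STATE_BUCKET = "raxx-iac-state-prod"


def ensure_versioning(s3_client, bucket: str = STATE_BUCKET) -> bool:
    """Enable S3 Versioning if needed; return True when a change was made.

    `s3_client` is expected to be boto3.client("s3").
    """
    # The "Status" key is absent when versioning has never been configured.
    current = s3_client.get_bucket_versioning(Bucket=bucket).get("Status")
    if current == "Enabled":
        return False

    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Read back rather than trusting the write.
    confirmed = s3_client.get_bucket_versioning(Bucket=bucket).get("Status")
    if confirmed != "Enabled":
        raise RuntimeError(
            f"versioning on {bucket} is {confirmed!r}, expected 'Enabled'"
        )
    return True
```

The function is safe to re-run: a second call against an already versioned bucket makes no write.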
D3 — Schedule daily Infisical admin export to S3 (vault-exports prefix)
A GH Actions workflow must produce a GPG-encrypted Infisical admin export daily and upload it to s3://raxx-iac-state-prod/vault-exports/YYYY-MM-DD.json.gpg. The export is encrypted with the operator's public key before leaving the runner. Failures trigger an alert via Slack DM D0AJ7K184TV.
Rationale: a Lightsail snapshot restores the full instance disk but requires the snapshot to be current. The admin export provides a logical-only fallback that survives even if all snapshots are lost — analogous to the FreeScout Tier 2 mysqldump strategy.
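A minimal sketch of two pieces of the D3 workflow that do not depend on the still-unconfirmed Infisical export API: the dated object key and the GPG encryption command. The helper names and the recipient value in the usage note are hypothetical. The `--trust-model always` flag is an assumption about the runner, where the operator's public key is imported fresh and carries no local trust.

```python
import datetime as dt

EXPORT_BUCKET = "raxx-iac-state-prod"
EXPORT_PREFIX = "vault-exports"


def export_key(day: dt.date) -> str:
    """Return the S3 object key for a given day's encrypted export."""
    return f"{EXPORT_PREFIX}/{day.isoformat()}.json.gpg"


def gpg_encrypt_command(
    recipient: str, plaintext_path: str, output_path: str
) -> list[str]:
    """Build the gpg argv that encrypts the export to the operator's public key."""
    return [
        "gpg",
        "--batch",                    # never prompt on a CI runner
        "--yes",                      # overwrite an existing output file
        "--trust-model", "always",    # assumption: key is imported without local trust
        "--encrypt",
        "--recipient", recipient,
        "--output", output_path,
        plaintext_path,
    ]
```

A workflow step could run `subprocess.run(gpg_encrypt_command("operator@example.com", "export.json", "export.json.gpg"), check=True)`, delete the plaintext, and then upload the result to `export_key(...)` in `EXPORT_BUCKET`, so only ciphertext leaves the runner.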
D4 — Verified restore drill required before marking BCP cards as "done"
Each of the following must have a documented restore result row before the BCP implementation cards can be closed:
- Heroku Postgres: restore to throwaway app, verify Alembic revision matches
- FreeScout: Tier 1 snapshot restore (or Tier 2 logical restore) to raxx-tickets-restored instance
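One possible shape for the restore result row and the card-closing gate, as a sketch. The field names, the system identifiers, and the `missing_drills` helper are hypothetical; the ADR only requires that a documented result row exists for each system.

```python
from dataclasses import dataclass

# Systems that D4 requires a drill for (identifiers are illustrative).
REQUIRED_DRILLS = ("heroku-postgres", "freescout")


@dataclass(frozen=True)
class DrillResult:
    system: str        # one of REQUIRED_DRILLS
    date_utc: str      # when the drill ran, e.g. "2026-05-22"
    restored_to: str   # throwaway app or instance name
    passed: bool       # True only if the restored data was verified
    evidence: str      # what was checked, e.g. the Alembic revision compared


def alembic_revisions_match(production_rev: str, restored_rev: str) -> bool:
    """Heroku Postgres check: the restored database reports the same revision."""
    return bool(production_rev) and production_rev.strip() == restored_rev.strip()


def missing_drills(results: list[DrillResult]) -> list[str]:
    """Return the required systems that have no passing result row yet."""
    passed = {r.system for r in results if r.passed}
    return [system for system in REQUIRED_DRILLS if system not in passed]
```

Under this sketch, the BCP cards can close only when `missing_drills(...)` returns an empty list; a failed drill is recorded but does not satisfy the gate.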
"Backup exists" and "backup restores cleanly" are different claims. D4 is the verification gate.
Consequences
Positive:
- Vault SPOF risk is materially reduced from "single instance with no backup" to "instance + daily snapshot + daily encrypted export".
- Terraform state is versioned; accidental deletions are recoverable from S3 version history.
- Verified restore drills create organizational muscle memory before a real incident.
Negative:
- Lightsail auto-snapshots for the vault instance cost ~$0.05/GB/mo × 40 GB × 7 snapshots ≈ $14/mo (estimate). Accepted.
- S3 versioning increases state bucket storage by a small amount for each terraform apply. Negligible.
- Daily export workflow requires Infisical admin export API to work from a machine identity. This needs confirmation (see Open Question 3 in the BCP document).
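The snapshot cost estimate in the first Negative item can be reproduced with a one-line calculation. This sketch bills every retained snapshot at the full disk size, which is the assumption behind the estimate; if Lightsail stores later snapshots incrementally, the real figure would be lower.

```python
def snapshot_cost_per_month(price_per_gb: float, disk_gb: int, retained: int) -> float:
    """Monthly cost if each retained snapshot is billed at the full disk size."""
    return price_per_gb * disk_gb * retained


# Inputs taken from the estimate above: $0.05/GB/mo, 40 GB disk, 7 snapshots.
estimate = snapshot_cost_per_month(0.05, 40, 7)
print(f"${estimate:.2f}/mo")
```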
Alternatives Considered
Alt A — Use Lightsail managed auto-snapshot only (no S3 export)
Rejected as the sole mechanism. A snapshot is recoverable only if the Lightsail snapshot service itself is available and the instance can be restored in the same AWS account. A logical export to S3 survives AWS account compromise, region-level failure, and accidental instance/snapshot deletion.
Alt B — Migrate vault to a managed secret store (AWS Secrets Manager or Infisical Cloud SaaS)
Deferred to v2. AWS Secrets Manager at v1 scale would require pricing evaluation and a migration path. Infisical Cloud SaaS paywalls OIDC SSO (per project_infisical_sso_not_pursued). The self-hosted instance plus the D1+D2+D3 backup posture is the v1 path.
Alt C — Use Terraform to manage the snapshot schedule
Preferred for v2. The aws_lightsail_instance_snapshot resource in the AWS Terraform provider supports automated snapshots but with more complex lifecycle management. For the v1 window (T-2 days to launch), the AWS CLI enable-add-on approach is faster to implement without requiring a TF plan/apply cycle and the associated vault credential sourcing.