Vault disaster recovery runbook

System: vault.raxx.app, self-hosted Infisical CE on AWS Lightsail (raxx-vault, small_3_0, 2 vCPU / 2 GB RAM, us-east-1a)
Owner: operator (Kristerpher)
Last incident: #2650 (2026-05-20 UTC, SEV-4 vault diagnostic confirming instance name + bundle)
Last reviewed: 2026-05-21 UTC

Snapshot schedule and retention

Setting	Value
Method	Lightsail AutoSnapshot add-on
Schedule	Daily at 05:00 UTC (off-peak)
Retention	7 days (Lightsail platform maximum for AutoSnapshot)
First snapshot	2026-05-22 at 05:00 UTC
Region	us-east-1
Availability zone	us-east-1a

AutoSnapshot was enabled on 2026-05-21 via:

aws lightsail enable-add-on \
  --resource-name raxx-vault \
  --add-on-request 'addOnType=AutoSnapshot,autoSnapshotAddOnRequest={snapshotTimeOfDay=05:00}' \
  --region us-east-1

Future hardening (post-launch): migrate snapshot management to the Terraform root that will manage raxx-vault (no Lightsail TF root exists as of 2026-05-21). See #2652 action item.

BCP backup failure — pageable event

A failure of the daily vault snapshot workflow (bcp-vault-snapshot-daily.yml, Win 3) is a pageable event. A failed backup means no encrypted off-Infisical copy of the secrets was created for that day.

Alert channels (both fire on failure):

- Slack: ops webhook channel (via SLACK_WEBHOOK_URL)
- Email: ops@raxx.app via Postmark (via POSTMARK_OPS_ALERT_TOKEN)

How to investigate a backup failure:

1. Open the failing run from the GH Actions link in the alert and note which step failed.
2. Most common cause: the GPG_BACKUP_PUBLIC_KEY repo secret is missing or empty. Fix: the operator exports their GPG public key and sets the secret with the command below, then re-runs the workflow via workflow_dispatch.

   gpg --armor --export YOUR_KEY_FINGERPRINT | base64 | gh secret set GPG_BACKUP_PUBLIC_KEY --repo raxx-app/TradeMasterAPI

3. Second most common: the CF Access service token has expired (CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET). Fix: rotate per docs/ops/runbooks/rotation/cloudflare-access-service-token.md.
4. The Infisical universal auth credentials have expired (INFISICAL_CLIENT_ID / INFISICAL_CLIENT_SECRET). Fix: rotate in the Infisical dashboard and update the GH Actions secrets.
5. AWS IAM key permissions issue (AWS_BACKUP_ACCESS_KEY_ID). Fix: verify the key has s3:PutObject and s3:GetObject on the raxx-iac-state-prod bucket. HeadObject has no IAM action of its own; it is authorized by s3:GetObject.

After fixing: dispatch bcp-vault-snapshot-daily.yml manually and confirm a .json.gpg object appears in s3://raxx-iac-state-prod/vault-exports/$(date -u +%Y-%m-%d).json.gpg.
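The confirmation step above can be scripted so it is repeatable during an incident. This is a minimal sketch, assuming the AWS CLI is configured with credentials that can read the bucket. The function names backup_key and verify_backup are illustrative and not part of the existing workflow.

```shell
# Build the S3 key the daily workflow is expected to write for a UTC date.
# With no argument, today's UTC date is used.
backup_key() {
  printf 'vault-exports/%s.json.gpg\n' "${1:-$(date -u +%Y-%m-%d)}"
}

# Print the object size in bytes and return 0 if the export exists;
# return non-zero otherwise. head-object fetches metadata only, so the
# encrypted export itself is never downloaded.
verify_backup() {
  aws s3api head-object \
    --bucket raxx-iac-state-prod \
    --key "$(backup_key "${1:-}")" \
    --query 'ContentLength' \
    --output text
}
```

Usage: `verify_backup` checks today's export, and `verify_backup 2026-05-22` checks a specific day. A size of only a few bytes is worth a second look even when the command succeeds.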

How to tell the vault needs recovery

Symptom 1: vault.raxx.app is unreachable (CF Access login page never loads; curl returns connection refused or timeout).
Symptom 2: Lightsail console shows raxx-vault in Stopped or Error state.
Symptom 3: Agent sessions fail with vault-read errors even after confirming the CF WAF skip rule is live (see docs/ops/runbooks/vault-access.md).
Symptom 4: AWS reports a Lightsail service event in the us-east-1a AZ that affects the instance.

How to diagnose (in order)

Check Lightsail instance state:

aws lightsail get-instance \
  --instance-name raxx-vault \
  --region us-east-1 \
  --query 'instance.state'

Expected: {"code": 16, "name": "running"}.

Attempt a direct health probe (if instance is running but vault is not serving):

# Requires CF Access service-token headers or operator browser session
# From an agent session:
curl -sf \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  -H "User-Agent: raxx-sre/1.0" \
  "https://vault.raxx.app/api/v1/health" \
  | python3 -m json.tool

Expected: {"status": "ok"} or equivalent Infisical health response. A 502/504 with the instance running usually means the Infisical Docker container has crashed — SSH in and check.
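Because curl -sf hides the status code on failure, it can help to capture the code and map it to the next diagnostic step. The sketch below is one way to do that. The function names are hypothetical, and the 401/403 interpretation is an assumption about how CF Access rejects a bad service token, not something confirmed by this runbook.

```shell
# Map an HTTP status code from the health probe to a likely next action.
# 000 is what curl reports via %{http_code} when no HTTP response arrived.
classify_health() {
  case "$1" in
    200)     echo "healthy" ;;
    502|504) echo "container-down: instance is up, SSH in and check Docker" ;;
    401|403) echo "access-denied: check the CF Access service token" ;;
    000)     echo "unreachable: check the Lightsail instance state" ;;
    *)       echo "unexpected: HTTP $1" ;;
  esac
}

# Probe the vault through CF Access and print only the status code.
probe_vault() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
    -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
    -H "User-Agent: raxx-sre/1.0" \
    "https://vault.raxx.app/api/v1/health"
}
```

Usage: `classify_health "$(probe_vault)"`.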

SSH into the instance to check container state:

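# privateKeyBase64 is returned as PEM text (the base64 part is the PEM body),
# so write it out as-is; piping it through base64 --decode corrupts the key.
aws lightsail download-default-key-pair --region us-east-1 \
  --output text --query 'privateKeyBase64' > /tmp/ls-key.pem
chmod 600 /tmp/ls-key.pem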

# Get the public IP
aws lightsail get-instance \
  --instance-name raxx-vault \
  --region us-east-1 \
  --query 'instance.publicIpAddress' \
  --output text

ssh -i /tmp/ls-key.pem bitnami@<PUBLIC_IP> "sudo docker ps -a; sudo docker logs infisical --tail 50"
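The commands above leave the default private key in /tmp after the session. If that is a concern, a small wrapper can keep the key in a throwaway file for the duration of one command. This is only a sketch: with_temp_key and the LS_KEY_FILE variable are hypothetical names, and the cleanup does not cover the case where the shell is killed by a signal mid-command.

```shell
# Run a command with a private key stored in a throwaway file.
# The key material is read from stdin, written to a file only the owner
# can read, exposed to the command as $LS_KEY_FILE, and removed when the
# command finishes, whether or not it succeeded.
with_temp_key() {
  key_file=$(mktemp) || return 1
  chmod 600 "$key_file"
  cat > "$key_file"
  rc=0
  LS_KEY_FILE="$key_file" "$@" || rc=$?
  rm -f "$key_file"
  return "$rc"
}
```

Usage: pipe the key material from the download step into the wrapper, for example `... | with_temp_key sh -c 'ssh -i "$LS_KEY_FILE" bitnami@<PUBLIC_IP> "sudo docker ps -a"'`.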

Restore procedure

Step 1 — List available snapshots

aws lightsail get-auto-snapshots \
  --resource-name raxx-vault \
  --region us-east-1

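Automatic snapshots are identified by date (YYYY-MM-DD), not by a snapshot name; each entry in the output carries a date and a status. Pick the most recent date whose status is Success. Restoring from an automatic snapshot uses --source-instance-name raxx-vault with --restore-date <YYYY-MM-DD> rather than --instance-snapshot-name.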
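get-auto-snapshots reports a date and a status for every automatic snapshot, so the most recent successful one can be selected without reading the JSON by eye. A minimal sketch follows. The function name latest_restore_date is illustrative, and "most recent successful" is only a proxy for "healthy": a snapshot taken after the vault was already corrupted would still report Success.

```shell
# Print the most recent automatic snapshot date (YYYY-MM-DD) whose status
# is Success, or return non-zero if there is none.
latest_restore_date() {
  dates=$(aws lightsail get-auto-snapshots \
    --resource-name "${1:-raxx-vault}" \
    --region us-east-1 \
    --query "autoSnapshots[?status=='Success'].date" \
    --output text) || return 1
  # --output text separates list items with tabs; ISO dates sort lexically.
  latest=$(printf '%s\n' "$dates" | tr '\t' '\n' | sort | tail -n 1)
  [ -n "$latest" ] && [ "$latest" != "None" ] && printf '%s\n' "$latest"
}
```

Usage: `RESTORE_DATE=$(latest_restore_date) || echo "no usable auto-snapshot" >&2`.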

For any manually-created snapshots:

aws lightsail get-instance-snapshots \
  --region us-east-1 \
  --query 'instanceSnapshots[?fromInstanceName==`raxx-vault`].[name,state,createdAt]'

Step 2 — Create replacement instance from snapshot

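# From an automatic snapshot (selected by source instance and date):
aws lightsail create-instance-from-snapshot \
  --instance-name raxx-vault-restore \
  --source-instance-name raxx-vault \
  --restore-date <YYYY-MM-DD> \
  --availability-zone us-east-1a \
  --bundle-id small_3_0 \
  --region us-east-1

# From a manual snapshot, replace --source-instance-name and --restore-date
# with: --instance-snapshot-name <snapshot-name>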

Wait for the instance to reach running state:

aws lightsail get-instance \
  --instance-name raxx-vault-restore \
  --region us-east-1 \
  --query 'instance.state'

Typical boot time: 3–5 minutes.
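Rather than re-running the state query by hand, the wait can be expressed as a bounded polling loop. The sketch below assumes a hypothetical helper named wait_for_instance_state. Its defaults (40 attempts, 15 seconds apart, about 10 minutes) are an arbitrary choice of roughly twice the typical boot time, not a documented limit.

```shell
# Poll until the instance reports the wanted state, or give up.
# Args: instance name, wanted state (default: running),
#       max attempts (default: 40), seconds between attempts (default: 15).
wait_for_instance_state() {
  name=$1; want=${2:-running}; attempts=${3:-40}; delay=${4:-15}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(aws lightsail get-instance \
      --instance-name "$name" \
      --region us-east-1 \
      --query 'instance.state.name' \
      --output text 2>/dev/null)
    if [ "$state" = "$want" ]; then
      echo "$name is $want"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $name to reach $want (last state: ${state:-unknown})" >&2
  return 1
}
```

Usage: `wait_for_instance_state raxx-vault-restore`. The same function can confirm a shutdown with `wait_for_instance_state raxx-vault stopped`.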

Step 3 — Verify the restored instance is healthy

Get the new public IP:

aws lightsail get-instance \
  --instance-name raxx-vault-restore \
  --region us-east-1 \
  --query 'instance.publicIpAddress' \
  --output text

Probe directly (bypassing CF Access) to confirm Infisical is serving:

curl -sf "http://<NEW_IP>:80/api/v1/health"

If healthy, proceed to the DNS swap. If unhealthy, SSH in and check the container logs (see "SSH into the instance to check container state" under "How to diagnose" above).

Step 4 — DNS swap

The vault hostname (vault.raxx.app) is a Cloudflare-proxied DNS record pointing to the Lightsail static IP.

Assign a new static IP to the restored instance (recommended: the default public IP changes whenever the instance is stopped and started, while a static IP does not):

# Allocate a new static IP
aws lightsail allocate-static-ip \
  --static-ip-name raxx-vault-restore-static \
  --region us-east-1

# Attach it
aws lightsail attach-static-ip \
  --static-ip-name raxx-vault-restore-static \
  --instance-name raxx-vault-restore \
  --region us-east-1

# Get the address
aws lightsail get-static-ip \
  --static-ip-name raxx-vault-restore-static \
  --region us-east-1 \
  --query 'staticIp.ipAddress' \
  --output text

Update the Cloudflare DNS A record for vault.raxx.app to point at the new static IP. Use the CLOUDFLARE_EDIT_DNS token (not CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN — wrong scope; see docs/ops/runbooks/cloudflare-tokens.md and memory reference_cloudflare_tokens.md):

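# NOTE: these reads go through the vault itself. While vault.raxx.app still
# points at the failed instance they are likely to fail; if so, obtain both
# values out-of-band before continuing.

# Zone ID for raxx.app
ZONE_ID=$(infisical secrets get CF_ZONE_ID_RAXX_APP \
  --path /MooseQuest/cloudflare/ --plain)

CF_DNS_TOKEN=$(infisical secrets get CLOUDFLARE_EDIT_DNS \
  --path /MooseQuest/cloudflare/ --plain)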

# Get the record ID for vault.raxx.app
RECORD_ID=$(curl -sS \
  -H "Authorization: Bearer $CF_DNS_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?name=vault.raxx.app&type=A" \
  | python3 -c "import sys,json; r=json.load(sys.stdin)['result']; print(r[0]['id'])")

# Update to new IP
NEW_IP=<new static IP from above>

curl -sS -X PUT \
  -H "Authorization: Bearer $CF_DNS_TOKEN" \
  -H "Content-Type: application/json" \
  "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  --data "{\"type\":\"A\",\"name\":\"vault.raxx.app\",\"content\":\"$NEW_IP\",\"ttl\":60,\"proxied\":true}"

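Propagation with the Cloudflare proxy enabled is near-instant: clients keep resolving vault.raxx.app to the same Cloudflare edge addresses, and only the origin behind the proxy changes.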
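Since NEW_IP is pasted by hand, a guard run before the PUT above can catch a leftover placeholder or a malformed address before it reaches the production DNS record. This is a sketch only; is_ipv4 is a hypothetical helper, and it checks format, not whether the address belongs to the restored instance.

```shell
# Return 0 only if the argument is a dotted-quad IPv4 address with each
# octet in 0-255. Rejects placeholders, hostnames, and empty strings.
is_ipv4() {
  case "$1" in
    *[!0-9.]*|''|.*|*.|*..*) return 1 ;;
  esac
  old_ifs=$IFS; IFS=.
  set -- $1            # split on dots; safe because only digits and dots remain
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  return 0
}
```

Usage: `is_ipv4 "$NEW_IP" || { echo "NEW_IP is not a valid IPv4 address" >&2; exit 1; }`.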

Step 5 — Verify end-to-end

curl -sf \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  -H "User-Agent: raxx-sre/1.0" \
  "https://vault.raxx.app/api/v1/health"

Then confirm a secret read works (agent session):

infisical secrets get SENTRY_INTERNAL_INTEGRATION_TOKEN \
  --env=prod \
  --path=/MooseQuest/sentry \
  --plain | head -c 8

Expected: exactly 8 characters of output (the start of the token). Empty output means the read failed.
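If printing even a prefix of the token to the terminal is undesirable, the same check can be done on the byte count alone. The sketch below assumes a hypothetical helper named secret_is_readable. It only proves that something non-empty came back, so an error message printed on stdout would also pass.

```shell
# Read a secret value on stdin and report only whether it is non-empty,
# so no part of the secret is echoed to the terminal or scrollback.
secret_is_readable() {
  bytes=$(wc -c | tr -d ' ')
  if [ "$bytes" -gt 0 ]; then
    echo "ok: read $bytes bytes"
  else
    echo "FAIL: empty secret" >&2
    return 1
  fi
}
```

Usage: pipe the output of the infisical secrets get command above into `secret_is_readable` in place of `head -c 8`.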

Step 6 — Rename + clean up

Once raxx-vault-restore is confirmed healthy and DNS is pointing at it:

# There is no Lightsail rename-instance command — stop the old instance
# and detach its static IP before re-allocating if needed.
# Retain the old instance for 24h before deletion as a rollback option.
aws lightsail stop-instance \
  --instance-name raxx-vault \
  --region us-east-1

# After 24h clean-up (operator decision):
aws lightsail delete-instance \
  --instance-name raxx-vault \
  --region us-east-1

# Re-enable AutoSnapshot on the restored instance:
aws lightsail enable-add-on \
  --resource-name raxx-vault-restore \
  --add-on-request 'addOnType=AutoSnapshot,autoSnapshotAddOnRequest={snapshotTimeOfDay=05:00}' \
  --region us-east-1

RTO target

1–2 hours from first alert to vault serving traffic, assuming:

- A healthy auto-snapshot is available (7-day window).
- The operator has AWS CLI access and CF DNS edit scope.
- No AZ-level AWS outage (if us-east-1a is affected, restore to us-east-1b by changing --availability-zone in Step 2).

Manual snapshot (on-demand, before risky ops)

Take a manual snapshot before any maintenance that modifies the vault host (OS updates, Docker upgrades, config changes):

aws lightsail create-instance-snapshot \
  --instance-name raxx-vault \
  --instance-snapshot-name "raxx-vault-pre-maintenance-$(date -u +%Y%m%d-%H%M)" \
  --region us-east-1

Manual snapshots are NOT subject to the 7-day auto-rotation policy — they persist until explicitly deleted.
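create-instance-snapshot returns before the snapshot is finished, so it is worth confirming the snapshot has reached the available state before starting the risky operation. A minimal sketch, with hypothetical function names and an arbitrary default budget of 60 attempts at 30-second intervals:

```shell
# Print the current state of a manual snapshot (pending, available, or error).
snapshot_state() {
  aws lightsail get-instance-snapshot \
    --instance-snapshot-name "$1" \
    --region us-east-1 \
    --query 'instanceSnapshot.state' \
    --output text
}

# Block until the snapshot is available. Fails fast on the error state
# and gives up after the attempt budget is spent.
# Args: snapshot name, max attempts (default: 60), delay seconds (default: 30).
wait_for_snapshot() {
  name=$1; attempts=${2:-60}; delay=${3:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    case "$(snapshot_state "$name" 2>/dev/null)" in
      available) echo "$name is available"; return 0 ;;
      error)     echo "$name failed" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $name" >&2
  return 1
}
```

Usage: store the generated snapshot name in a variable when creating it, then run `wait_for_snapshot "$SNAPSHOT_NAME"` and begin maintenance only if it returns 0.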

Verify auto-snapshot is still enabled

Include this in the weekly SRE sweep:

aws lightsail get-instance \
  --instance-name raxx-vault \
  --region us-east-1 \
  --query 'instance.addOns'

Expected:

[
    {
        "name": "AutoSnapshot",
        "status": "Enabled",
        "snapshotTimeOfDay": "05:00"
    }
]

If status is Disabled or the add-on is missing, re-enable using the command in §Snapshot schedule and retention above.
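For the weekly sweep, the check can be reduced to an exit status so it can run unattended. The sketch below assumes a hypothetical function named check_autosnapshot; the exit codes (1 for not enabled, 2 for an AWS CLI failure) are a convention chosen here, not an existing standard in this repo.

```shell
# Return 0 only when the AutoSnapshot add-on is reported as Enabled.
# Returns 1 if it is disabled or missing, 2 if the AWS call itself failed.
check_autosnapshot() {
  status=$(aws lightsail get-instance \
    --instance-name "${1:-raxx-vault}" \
    --region us-east-1 \
    --query "instance.addOns[?name=='AutoSnapshot'].status | [0]" \
    --output text) || return 2
  if [ "$status" = "Enabled" ]; then
    echo "AutoSnapshot: Enabled"
  else
    # A missing add-on shows up as "None" in text output.
    echo "AutoSnapshot: ${status:-missing} (expected Enabled)" >&2
    return 1
  fi
}
```

Usage: `check_autosnapshot || echo "re-enable AutoSnapshot on raxx-vault" >&2`. After a restore, pass the new instance name, for example `check_autosnapshot raxx-vault-restore`.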

Emergency stop (clean shutdown)

To stop the vault instance without data loss (e.g., if the host is compromised and needs isolation):

aws lightsail stop-instance \
  --instance-name raxx-vault \
  --region us-east-1

Note: stopping the instance will break all agent sessions, CI pipelines, Velvet rotation, and any service that reads from Infisical. Coordinate with the operator before executing outside an active security incident.

Escalation

Wake the operator when:

- The restore procedure fails at any step and no healthy snapshot is available.
- The AZ us-east-1a is degraded per AWS status (cross-AZ restore changes the --availability-zone value but requires operator decision).
- The Infisical master encryption key or database is corrupt (a snapshot may not help; escalate to Infisical support).
- Estimated recovery time exceeds the 2-hour RTO target.

AWS status page: https://health.aws.amazon.com/health/status
Lightsail service health: https://health.aws.amazon.com/health/status (filter: Lightsail, us-east-1)

Cross-references

Vault access + WAF runbook: docs/ops/runbooks/vault-access.md
Secrets source of truth: all vault credentials live in Infisical — never inline in repo files (memory: feedback_no_inline_secrets_in_repo.md, feedback_secrets_in_vault_sop.md)
DNS token scope: memory reference_cloudflare_tokens.md — use CLOUDFLARE_EDIT_DNS, not the automation token
Burr SSO gateway pattern (CF Access as OIDC provider): memory project_burr_sso_gateway.md — vault is the first Burr consumer; CF Access policy for vault.raxx.app must be preserved during DNS swap
AWS workload secrets in SSM (not Infisical): memory feedback_aws_workloads_use_ssm_not_vault.md
Issue #2652 (this work): Lightsail AutoSnapshot enabling + TF hardening backlog
Issue #2650 (SEV-4 diagnostic): instance name + bundle confirmation