Vault disaster recovery runbook
System: vault.raxx.app — self-hosted Infisical CE on AWS Lightsail (raxx-vault, small_3_0, 2 vCPU / 2 GB RAM, us-east-1a)
Owner: operator (Kristerpher)
Last incident: #2650 (2026-05-20 UTC — SEV-4 vault diagnostic confirming instance name + bundle)
Last reviewed: 2026-05-21 UTC
Snapshot schedule and retention
| Setting | Value |
|---|---|
| Method | Lightsail AutoSnapshot add-on |
| Schedule | Daily at 05:00 UTC (off-peak) |
| Retention | 7 days (Lightsail platform maximum for AutoSnapshot) |
| First snapshot | 2026-05-22 at 05:00 UTC |
| Region | us-east-1 |
| Availability zone | us-east-1a |
AutoSnapshot was enabled on 2026-05-21 via:
aws lightsail enable-add-on \
--resource-name raxx-vault \
--add-on-request 'addOnType=AutoSnapshot,autoSnapshotAddOnRequest={snapshotTimeOfDay=05:00}' \
--region us-east-1
Future hardening (post-launch): migrate snapshot management to the Terraform root that will manage raxx-vault (no Lightsail TF root exists as of 2026-05-21). See #2652 action item.
How to tell the vault needs recovery
- Symptom 1: vault.raxx.app is unreachable (CF Access login page never loads; curl returns connection refused or timeout).
- Symptom 2: Lightsail console shows raxx-vault in Stopped or Error state.
- Symptom 3: Agent sessions fail with vault-read errors even after confirming the CF WAF skip rule is live (see docs/ops/runbooks/vault-access.md).
- Symptom 4: AWS Lightsail support event for the us-east-1a AZ affecting the instance.
How to diagnose (in order)
- Check Lightsail instance state:
aws lightsail get-instance \
--instance-name raxx-vault \
--region us-east-1 \
--query 'instance.state'
Expected: {"code": 16, "name": "running"}.
- Attempt a direct health probe (if instance is running but vault is not serving):
# Requires CF Access service-token headers or operator browser session
# From an agent session:
curl -sf \
-H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
-H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
-H "User-Agent: raxx-sre/1.0" \
"https://vault.raxx.app/api/v1/health" \
| python3 -m json.tool
Expected: {"status": "ok"} or equivalent Infisical health response. A 502/504 with the instance running usually means the Infisical Docker container has crashed — SSH in and check.
- SSH into the instance to check container state:
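# The privateKeyBase64 field is returned as PEM text, so it is written out as-is
# (check: the first line of the file should read -----BEGIN RSA PRIVATE KEY-----).
aws lightsail download-default-key-pair --region us-east-1 \
--output text --query 'privateKeyBase64' \
> /tmp/ls-key.pem
chmod 600 /tmp/ls-key.pem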
# Get the public IP
aws lightsail get-instance \
--instance-name raxx-vault \
--region us-east-1 \
--query 'instance.publicIpAddress' \
--output text
ssh -i /tmp/ls-key.pem bitnami@<PUBLIC_IP> "sudo docker ps -a; sudo docker logs infisical --tail 50"
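The status-code reasoning in the health-probe step above can be scripted. The following is a minimal sketch, not part of any existing tooling: classify_health_status and probe_vault are hypothetical helper names, and the mapping covers only the cases this runbook describes. It relies on curl reporting 000 for %{http_code} when no HTTP response was received.

```shell
# Hypothetical helper: map an HTTP status code from the health probe
# to the next action suggested by this runbook.
classify_health_status() {
  case "$1" in
    200)     echo "healthy" ;;
    502|504) echo "check-container" ;;  # instance up, Infisical container likely down
    000)     echo "unreachable" ;;      # curl received no HTTP response at all
    *)       echo "unexpected" ;;       # anything else: inspect manually
  esac
}

# Hypothetical wrapper: probe through CF Access and classify the result.
# Assumes CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET are already exported.
probe_vault() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
    -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
    -H "User-Agent: raxx-sre/1.0" \
    "https://vault.raxx.app/api/v1/health")
  classify_health_status "$code"
}
```

Calling probe_vault prints one of the four labels; check-container corresponds to the SSH step above.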
Restore procedure
Step 1 — List available snapshots
aws lightsail get-auto-snapshots \
--resource-name raxx-vault \
--region us-east-1
Auto snapshots are identified by date (YYYY-MM-DD) rather than by a snapshot name. Note the date of the most recent healthy snapshot (status Success).
For any manually-created snapshots:
aws lightsail get-instance-snapshots \
--region us-east-1 \
--query 'instanceSnapshots[?fromInstanceName==`raxx-vault`].[name,state,createdAt]'
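Picking the most recent usable auto snapshot can be scripted. This is a sketch under the assumption that get-auto-snapshots returns an autoSnapshots list whose entries carry a date (YYYY-MM-DD) and a status; latest_auto_snapshot_date is a hypothetical helper name. ISO dates sort correctly as plain text, so sort and tail are enough.

```shell
# Hypothetical helper: print the newest auto-snapshot date with status Success.
# Usage: latest_auto_snapshot_date <instance-name> [region]
latest_auto_snapshot_date() {
  local instance="$1" region="${2:-us-east-1}"
  aws lightsail get-auto-snapshots \
    --resource-name "$instance" \
    --region "$region" \
    --query 'autoSnapshots[?status==`Success`].date' \
    --output text \
  | tr '\t' '\n' \
  | sort \
  | tail -n 1
}
```

For example, latest_auto_snapshot_date raxx-vault prints a single date, or nothing if no successful snapshot exists.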
Step 2 — Create replacement instance from snapshot
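# Auto snapshots are restored by source instance and date, not by snapshot name.
aws lightsail create-instance-from-snapshot \
--instance-name raxx-vault-restore \
--source-instance-name raxx-vault \
--restore-date <YYYY-MM-DD> \
--availability-zone us-east-1a \
--bundle-id small_3_0 \
--region us-east-1
# For a manual snapshot, replace --source-instance-name and --restore-date
# with --instance-snapshot-name <snapshot-name>.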
Wait for the instance to reach running state:
aws lightsail get-instance \
--instance-name raxx-vault-restore \
--region us-east-1 \
--query 'instance.state'
Typical boot time: 3–5 minutes.
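Rather than re-running the state query by hand, the wait can be wrapped in a polling loop. This is a minimal sketch: wait_for_running is a hypothetical helper, and the defaults (36 attempts, 10 seconds apart, about 6 minutes) are an assumption chosen to cover the typical boot time quoted above.

```shell
# Hypothetical helper: poll Lightsail until the instance reports "running".
# Usage: wait_for_running <instance-name> [attempts] [delay-seconds]
wait_for_running() {
  local instance="$1" attempts="${2:-36}" delay="${3:-10}"
  local i=0 state=""
  while [ "$i" -lt "$attempts" ]; do
    state=$(aws lightsail get-instance \
      --instance-name "$instance" \
      --region us-east-1 \
      --query 'instance.state.name' \
      --output text)
    if [ "$state" = "running" ]; then
      echo "$instance is running"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$instance did not reach running state (last state: $state)" >&2
  return 1
}
```

For example, wait_for_running raxx-vault-restore returns 0 once the instance is up and 1 if the attempts run out.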
Step 3 — Verify the restored instance is healthy
Get the new public IP:
aws lightsail get-instance \
--instance-name raxx-vault-restore \
--region us-east-1 \
--query 'instance.publicIpAddress' \
--output text
Probe directly (bypassing CF Access) to confirm Infisical is serving:
curl -sf "http://<NEW_IP>:80/api/v1/health"
If healthy, proceed to the DNS swap. If unhealthy, SSH in and check container logs (third item under "How to diagnose" above).
Step 4 — DNS swap
The vault hostname (vault.raxx.app) is a Cloudflare-proxied DNS record pointing to the Lightsail static IP.
Assign a new static IP to the restored instance (recommended: the default public IP changes whenever the instance is stopped and restarted):
# Allocate a new static IP
aws lightsail allocate-static-ip \
--static-ip-name raxx-vault-restore-static \
--region us-east-1
# Attach it
aws lightsail attach-static-ip \
--static-ip-name raxx-vault-restore-static \
--instance-name raxx-vault-restore \
--region us-east-1
# Get the address
aws lightsail get-static-ip \
--static-ip-name raxx-vault-restore-static \
--region us-east-1 \
--query 'staticIp.ipAddress' \
--output text
Update the Cloudflare DNS A record for vault.raxx.app to point at the new static IP. Use the CLOUDFLARE_EDIT_DNS token (not CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN — wrong scope; see docs/ops/runbooks/cloudflare-tokens.md and memory reference_cloudflare_tokens.md):
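# Zone ID for raxx.app
# NOTE: the two reads below go through the vault itself. If the Infisical CLI
# cannot reach vault.raxx.app yet, they fail and the operator must supply both values.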
ZONE_ID=$(infisical secrets get CF_ZONE_ID_RAXX_APP \
--path /MooseQuest/cloudflare/ --plain)
CF_DNS_TOKEN=$(infisical secrets get CLOUDFLARE_EDIT_DNS \
--path /MooseQuest/cloudflare/ --plain)
# Get the record ID for vault.raxx.app
RECORD_ID=$(curl -sS \
-H "Authorization: Bearer $CF_DNS_TOKEN" \
"https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?name=vault.raxx.app&type=A" \
| python3 -c "import sys,json; r=json.load(sys.stdin)['result']; print(r[0]['id'])")
# Update to new IP. ttl=1 is Cloudflare's "automatic" value, which proxied records use.
NEW_IP="<NEW_STATIC_IP>"  # replace with the static IP printed above
curl -sS -X PUT \
-H "Authorization: Bearer $CF_DNS_TOKEN" \
-H "Content-Type: application/json" \
"https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
--data "{\"type\":\"A\",\"name\":\"vault.raxx.app\",\"content\":\"$NEW_IP\",\"ttl\":1,\"proxied\":true}"
CF propagation with proxy enabled: near-instant. Clients keep resolving vault.raxx.app to Cloudflare edge IPs; only the origin address behind the proxy changes.
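Because the DNS update sends whatever NEW_IP contains, a guard against an unfilled placeholder or a mistyped address may be worth adding before the PUT. This is a minimal sketch in plain shell; is_ipv4 is a hypothetical helper and checks syntax only (four dot-separated octets, each 0 to 255), not that the address belongs to the restored instance.

```shell
# Hypothetical helper: succeed only if the argument looks like a dotted-quad IPv4 address.
is_ipv4() {
  local ip="$1" octet n=0
  # Reject empty input, any character other than digits and dots,
  # and leading, trailing, or doubled dots.
  case "$ip" in
    ""|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  local IFS=.
  for octet in $ip; do
    n=$((n + 1))
    # At most three digits and a value no larger than 255.
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  [ "$n" -eq 4 ]
}
```

A possible use is is_ipv4 "$NEW_IP" || { echo "NEW_IP is not a valid IPv4 address" >&2; exit 1; } immediately before the update call.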
Step 5 — Verify end-to-end
curl -sf \
-H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
-H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
-H "User-Agent: raxx-sre/1.0" \
"https://vault.raxx.app/api/v1/health"
Then confirm a secret read works (agent session):
infisical secrets get SENTRY_INTERNAL_INTEGRATION_TOKEN \
--env=prod \
--path=/MooseQuest/sentry \
--plain | head -c 8
Expected: 8 non-empty characters.
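The check above prints the first 8 characters of the secret to the terminal. If that is undesirable (for example in a logged session), a variant that only counts bytes is sketched below. has_min_bytes is a hypothetical helper; it reads stdin, prints nothing, and reports the result through its exit status.

```shell
# Hypothetical helper: succeed if stdin holds at least <min> bytes (default 8).
# Nothing from stdin is echoed, so the secret never reaches the terminal.
has_min_bytes() {
  local min="${1:-8}" count
  # head stops after <min> bytes; the arithmetic expansion strips any
  # padding that wc adds on some platforms.
  count=$(( $(head -c "$min" | wc -c) ))
  [ "$count" -ge "$min" ]
}
```

For example: infisical secrets get SENTRY_INTERNAL_INTEGRATION_TOKEN --env=prod --path=/MooseQuest/sentry --plain | has_min_bytes 8 && echo "secret read OK".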
Step 6 — Rename + clean up
Once raxx-vault-restore is confirmed healthy and DNS is pointing at it:
# There is no Lightsail rename-instance command — stop the old instance
# and detach its static IP before re-allocating if needed.
# Retain the old instance for 24h before deletion as a rollback option.
aws lightsail stop-instance \
--instance-name raxx-vault \
--region us-east-1
# After 24h clean-up (operator decision):
aws lightsail delete-instance \
--instance-name raxx-vault \
--region us-east-1
# Re-enable AutoSnapshot on the restored instance:
aws lightsail enable-add-on \
--resource-name raxx-vault-restore \
--add-on-request 'addOnType=AutoSnapshot,autoSnapshotAddOnRequest={snapshotTimeOfDay=05:00}' \
--region us-east-1
RTO target
1–2 hours from first alert to vault serving traffic, assuming:
- A healthy auto-snapshot is available (7-day window).
- The operator has AWS CLI access and CF DNS edit scope.
- No AZ-level AWS outage (if us-east-1a is affected, restore to us-east-1b by changing --availability-zone in Step 2).
Manual snapshot (on-demand, before risky ops)
Take a manual snapshot before any maintenance that modifies the vault host (OS updates, Docker upgrades, config changes):
aws lightsail create-instance-snapshot \
--instance-name raxx-vault \
--instance-snapshot-name "raxx-vault-pre-maintenance-$(date -u +%Y%m%d-%H%M)" \
--region us-east-1
Manual snapshots are NOT subject to the 7-day auto-rotation policy — they persist until explicitly deleted.
Verify auto-snapshot is still enabled
Include this in the weekly SRE sweep:
aws lightsail get-instance \
--instance-name raxx-vault \
--region us-east-1 \
--query 'instance.addOns'
Expected:
[
{
"name": "AutoSnapshot",
"status": "Enabled",
"snapshotTimeOfDay": "05:00"
}
]
If status is Disabled or the add-on is missing, re-enable using the command in §Snapshot schedule and retention above.
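For the weekly sweep, the same check can be reduced to an exit status so it can run unattended. This is a sketch: autosnapshot_enabled is a hypothetical helper, and it assumes the addOns structure shown in the expected output above. When the add-on is missing, the query yields None, which the comparison treats as not enabled.

```shell
# Hypothetical helper: succeed only if the AutoSnapshot add-on reports "Enabled".
# Usage: autosnapshot_enabled <instance-name> [region]
autosnapshot_enabled() {
  local instance="$1" region="${2:-us-east-1}" status
  status=$(aws lightsail get-instance \
    --instance-name "$instance" \
    --region "$region" \
    --query 'instance.addOns[?name==`AutoSnapshot`].status | [0]' \
    --output text)
  [ "$status" = "Enabled" ]
}
```

For example: autosnapshot_enabled raxx-vault || echo "AutoSnapshot is not enabled on raxx-vault" >&2.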
Emergency stop (clean shutdown)
To stop the vault instance without data loss (e.g., if the host is compromised and needs isolation):
aws lightsail stop-instance \
--instance-name raxx-vault \
--region us-east-1
Note: stopping the instance will break all agent sessions, CI pipelines, Velvet rotation, and any service that reads from Infisical. Coordinate with the operator before executing outside an active security incident.
Escalation
Wake the operator when:
- The restore procedure fails at any step and no healthy snapshot is available.
- The AZ us-east-1a is degraded per AWS status (cross-AZ restore changes the --availability-zone value but requires operator decision).
- The Infisical master encryption key or database is corrupt (snapshot may not help — escalate to Infisical support).
- Estimated recovery time exceeds the 2-hour RTO target.
AWS status page (for Lightsail service health, filter: Lightsail, us-east-1): https://health.aws.amazon.com/health/status
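The RTO-based escalation trigger comes down to simple arithmetic on timestamps. The sketch below is illustrative only: rto_minutes_left is a hypothetical helper, and the 120-minute default reflects the 2-hour upper bound of the RTO target. A zero or negative result means the target has been exceeded.

```shell
# Hypothetical helper: minutes remaining in the RTO budget.
# Usage: rto_minutes_left <alert-epoch-seconds> [now-epoch-seconds] [budget-minutes]
rto_minutes_left() {
  local alert_epoch="$1"
  local now_epoch="${2:-$(date -u +%s)}"
  local budget_min="${3:-120}"
  # Elapsed seconds are converted to whole minutes (rounded down).
  echo $(( budget_min - (now_epoch - alert_epoch) / 60 ))
}
```

For example, with the first alert recorded as ALERT_EPOCH=$(date -u +%s), running rto_minutes_left "$ALERT_EPOCH" later in the incident prints the remaining budget in minutes.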
Cross-references
- Vault access + WAF runbook: docs/ops/runbooks/vault-access.md
- Secrets source of truth: all vault credentials live in Infisical; never inline in repo files (memory: feedback_no_inline_secrets_in_repo.md, feedback_secrets_in_vault_sop.md)
- DNS token scope: memory reference_cloudflare_tokens.md; use CLOUDFLARE_EDIT_DNS, not the automation token
- Burr SSO gateway pattern (CF Access as OIDC provider): memory project_burr_sso_gateway.md; vault is the first Burr consumer, and the CF Access policy for vault.raxx.app must be preserved during DNS swap
- AWS workload secrets in SSM (not Infisical): memory feedback_aws_workloads_use_ssm_not_vault.md
- Issue #2652 (this work): Lightsail AutoSnapshot enabling + TF hardening backlog
- Issue #2650 (SEV-4 diagnostic): instance name + bundle confirmation