Business Continuity Plan — Raxx v1

Version: 1.0
Effective date: 2026-05-21 UTC
Author: raxx-software-architect
Owner: Operator (Kristerpher)
Next scheduled review: 2026-08-21 UTC (90 days post-launch)
Related ADR: docs/architecture/adr/0103-bcp-backup-posture.md
Implementation cards: See §4 for sub-card issue numbers.

Preamble — Invariants This Plan Must Not Violate

Every recovery procedure in this document is constrained by the following non-negotiable platform invariants. Any BCP action that would require violating them is a sign the procedure is wrong — not a reason to override the invariant.

No stored credentials. Recovery actions that involve credential rotation must never write credentials to files, shell history, terminal logs, or any surface that persists. Use >/dev/null 2>&1 on all heroku config:set calls (per feedback_heroku_config_set_echoes_secrets).
Passkeys / WebAuthn only. Recovery of customer authentication infrastructure must not fall back to passwords or SMS OTP.
GDPR by default. A breach event triggers the notification obligation; recovery timelines do not exempt this. If a restore involves customer PII, the DSR-erasure posture must be preserved in the restored state.
Audit trail for every state change. Recovery actions that touch money, permissions, or data access must be logged. Document every action taken during an incident in the FreeScout ticket opened at ops@raxx.app.
Paper-first gating. No recovery path re-enables live-trading code paths unless the paper-profitable gate (or explicit per-flow operator override) is satisfied. Restoring a prod environment does not automatically re-enable live trading.
Secrets into infra, not into code. All secret re-provisioning during recovery reads from vault or SSM. Nothing is committed to the repo.
Credentials into Infisical, not on disk. Operator break-glass credentials for AWS, Heroku, Cloudflare are stored in Google Drive (private) per project_aws_iam_state. They are not in the repo.
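
The "no stored credentials" rule for heroku config:set can be sketched as a small wrapper. This is a minimal illustration, not an existing script in the repo: the function name set_config_quietly is hypothetical, and it assumes the secret is piped in on stdin by whatever reads it from vault or SSM. The value is still passed to the Heroku CLI as an argument, so it may be briefly visible in the local process list.

```shell
# Hypothetical helper: set one Heroku config var without echoing its value.
# The value is read from stdin, so it is not typed on the command line
# and does not land in shell history.
set_config_quietly() {
  local app="$1" key="$2" value
  # Accept input with or without a trailing newline; reject empty input.
  IFS= read -r value || [ -n "$value" ] || return 1
  # config:set prints the new value on success, so discard all output.
  if heroku config:set "${key}=${value}" -a "$app" >/dev/null 2>&1; then
    echo "set ${key} on ${app}"
  else
    echo "failed to set ${key} on ${app}" >&2
    return 1
  fi
}
```

A caller would pipe the secret in, for example from a vault read, and only the key name and app name reach the terminal or the incident ticket.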

§1 Risk Inventory

Legend: RTO = Recovery Time Objective (max tolerable downtime), RPO = Recovery Point Objective (max tolerable data loss). Both are targets, not current capabilities. Current posture gaps are noted explicitly.

1.1 Heroku — single-region compute

Dimension	Detail
Components	`raxx-api-prod`, `raxx-api-staging`, `raxx-console-prod`, `raxx-console-staging`, `raxx-velvet-prod`
Impact	P0 — total platform outage for all Raxx customers
RTO target	15 min (dyno restart / rollback); 4 h (redeploy from GitHub)
RPO target	0 data loss — all persistent state is in Postgres, not dyno RAM
Current posture	Single Heroku region (us-east). No multi-region standby. GitHub Actions deploys to Heroku on `main` push (staging) or `workflow_dispatch` (prod) with human-approval gate (ADR-0020). Heroku maintains rolling release history; rollback is `heroku rollback -a <app>`.
Gap	No automated failover to a second region. A Heroku us-east regional incident means a full customer-visible outage. No SLA or SLO is defined for v1 pre-team.

1.2 Heroku Postgres — Raptor + Velvet Standard-0

Dimension	Detail
Components	Raptor: `DATABASE_URL` (Standard-0 on `raxx-api-prod`), `RAPTOR_APP_DATABASE_URL` (restricted `raptor_app` role). Velvet: Velvet-internal Postgres.
Impact	P0 — trading, session, and billing functions fail without DB connectivity
RTO target	15 min (connection restart / credential refresh); 2 h (PITR restore)
RPO target	5 min (Heroku continuous WAL shipping)
Current posture	Heroku manages continuous WAL archiving and daily snapshots for Standard-0. Automated daily backup at 02:00 UTC. Retention: 7 days (Standard-0 default). Logical backups are restored with `heroku pg:backups:restore`; point-in-time recovery within the retention window goes through Heroku Postgres Rollback, which provisions a new database at the chosen timestamp. Manual restore tested: NOT YET (no verified restore drill on record).
Gap	No verified restore drill. Velvet Postgres backup retention and PITR posture not confirmed as distinct from default. No cross-region Postgres replica.

1.3 Infisical vault — single Lightsail instance (SPOF — today's SEV-4)

Dimension	Detail
Component	`vault.raxx.app` — self-hosted Infisical CE on `aws_lightsail_instance.infisical` (us-east-1a)
Impact	P1 — agent sessions cannot read secrets; Velvet rotation fails; any Heroku config:set that sources from vault fails; Terraform applies that source from vault fail
RTO target	30 min (Lightsail instance restart); 2 h (rebuild from snapshot)
RPO target	24 h (daily snapshot); 0 data loss if instance is restored (not rebuilt)
Current posture	Single Lightsail micro instance in us-east-1a. No HA. No daily snapshot automation confirmed. Agent sessions cache vault reads in-process; a brief vault outage does not restart Heroku dynos mid-flight. CF Access gate protects the surface. Today's SEV-4 (2026-05-21) confirms this SPOF is real and active.
Gap	No automated daily Lightsail snapshot for vault instance. No Infisical admin export scheduled to S3. No secondary vault instance. A vault outage that lasts longer than a dyno-restart cycle could leave services unable to re-read secrets on boot.

1.4 AWS Lightsail — FreeScout ticketing

Dimension	Detail
Component	`raxx-tickets` Lightsail instance (us-east-1a), FreeScout + MariaDB
Impact	P2 — operator loses incident ticketing; customer support queue goes dark; EyeTok cannot create tickets
RTO target	30 min (Tier 1 snapshot restore; static IP re-attach stays DNS-stable)
RPO target	24 h (daily backup at 06:00 UTC); 0 for in-flight conversations if snapshot was taken recently
Current posture	Two-tier backup: Tier 1 daily Lightsail snapshot (7 retained); Tier 2 daily mysqldump to `s3://raxx-support-attachments/db-backups/freescout/` (30-day retention). GH Actions workflow `.github/workflows/freescout-backup.yml` runs at 06:00 UTC. Post-modules-installed snapshot `raxx-tickets-modules-installed-20260503231628` is the canonical fast-restore target. Restore SOP at `docs/ops/runbooks/freescout-backup-restore.md`. Verified restore: PENDING — no row in the verified restore record table.
Gap	No verified restore drill performed. Paid-module license re-activation depends on the canonical post-modules snapshot being current.

1.5 GitHub — source of truth for all code + Terraform + runbooks

Dimension	Detail
Component	`raxx-app/TradeMasterAPI` repo, GitHub Actions CI
Impact	P1 — loss of code source prevents rebuilds; loss of CI prevents deploys
RTO target	GitHub SLA is 99.9% uptime. Local clones exist on operator laptops.
RPO target	0 for pushed commits: every up-to-date clone holds the full history
Current posture	GitHub provides geo-redundant repository storage. Multiple agent worktree clones exist locally. No explicit off-GitHub archive configured.
Gap	No automated off-GitHub mirror (e.g., daily push to a private S3 archive). Low-priority SPOF given GitHub's own redundancy, but worth noting.

1.6 Cloudflare — DNS + CDN + Pages + Access + Workers

Dimension	Detail
Components	DNS for `raxx.app` and `getraxx.com`; CF Pages (Antlers, getraxx, internal-docs); CF Access (vault, console, getraxx pre-launch); CF Workers (status page D1 backend)
Impact	P1 — CF outage takes down all frontend surfaces and DNS resolution for the platform
RTO target	CF SLA 99.99%+. Operator action limited to TTL tuning or CF API calls; no self-hosted fallback.
RPO target	CF Pages deploys are re-deployable from GitHub any time.
Current posture	CF Pages projects are rebuild-on-demand from GitHub. DNS records are IaC-managed (Terraform). CF Access policies are Terraform-managed. If Cloudflare goes down, Antlers and getraxx are unreachable, but Raptor on Heroku continues accepting direct-URL traffic.
Gap	No fallback DNS zone. Pages cannot be served elsewhere without a rebuild to an alternate host. Acceptable for v1.

1.7 Terraform state — remote S3 backend

Dimension	Detail
Component	`raxx-iac-state-prod` S3 bucket (us-east-1), `raxx-iac-state-locks` DynamoDB table
Impact	P1 — loss of Terraform state prevents IaC-managed infrastructure changes; manual Terraform requires import from scratch
RTO target	S3 99.999999999% durability. DynamoDB 99.99% availability.
RPO target	S3 replication gives near-zero RPO. S3 Versioning on the state bucket: status unknown — not confirmed.
Current posture	At least one Terraform root (`cf-pages-docs-customer`) has an S3 backend (`raxx-iac-state-prod`, us-east-1, SSE-S3 encrypt=true, DynamoDB locks). Other roots (freescout, waf, cf-access, etc.) use the same bucket pattern based on convention, but individual `backend.tf` confirmation for all roots was not fully verified.
Gap	S3 Versioning for the state bucket is not confirmed enabled. No S3 cross-region replication confirmed. Loss of state bucket means full `terraform import` from scratch for all managed resources.

1.8 AWS SSM Parameter Store

Dimension	Detail
Component	SSM parameters for workload secrets (`/raxx/freescout/*`, Velvet passwords, rotated credentials) per `feedback_aws_workloads_use_ssm_not_vault`
Impact	P1 during recovery — SSM reads are required by GH Actions backup workflow and by Velvet rotation scripts
RTO target	SSM 99.99% availability (AWS SLA). Regional.
RPO target	SSM is not a database; parameter values are durable. No data loss scenario absent deliberate deletion.
Current posture	AWS us-east-1 only. No multi-region replication of SSM parameters.
Gap	SSM parameter values are not independently backed up. If an SSM parameter is accidentally deleted, values must be recovered from Infisical vault (for vendor API tokens) or re-generated via rotation. Document SSM paths in the operator break-glass doc.

1.9 Operator (single human owner — pre-team)

Dimension	Detail
Impact	P0 if the operator is unreachable for more than 48 h: the prod-approval gate (ADR-0020) requires a human reviewer, and no secondary reviewer exists pre-team
RTO target	N/A for planned absences; escalation path for emergencies: see §6
Current posture	Single operator. No documented deputy. Break-glass AWS root credentials in private Google Drive per `project_aws_iam_state`. No secondary GitHub Admin configured.
Gap	No named deputy. No secondary GitHub org admin. No Heroku team member with prod-deploy approval rights. This is the most significant non-technical SPOF and is currently accepted as a pre-team constraint.

1.10 External service SPOFs

Service	Impact	Current posture	Gap
Stripe	P1 — billing fails	Stripe's own redundancy. Test-mode pre-launch. No fallback.	Acceptable for v1; Stripe SLA is strong
Postmark	P2 — transactional email queues	Postmark queues hold for 48h on outages	No secondary sender configured
Alpaca (paper trading, pre-live)	P2 — paper trading unavailable	Flag-gated. Live falls back to paper flag.	No fallback broker pre-BYOB
Sentry	P2 — error capture dark	Sentry SDK buffers locally; events replayed on reconnect	Acceptable for v1
Oracle Dyn (moosequest.net)	P3 — home-network DNS	Operator personal dependency; Raxx platform not affected	Per `feedback_dyndns_stays` — do not migrate

§2 Backup Posture — Current State

2.1 Heroku Postgres (Raptor + Velvet)

What is backed up: Full Postgres database via continuous WAL archiving + daily snapshot.

Schedule and retention:

Tier	Method	Schedule	Retention	Storage
Continuous WAL	Heroku-managed streaming replication	Continuous	7 days (Standard-0)	Heroku-managed S3 (us-east-1)
Daily snapshot	Heroku `pg:backups` automated	Daily (Heroku-scheduled, typically ~02:00 UTC)	7 days (Standard-0)	Heroku-managed S3
Manual snapshot	`heroku pg:backups:capture`	On-demand	Heroku retention applies; download to persist longer	Heroku-managed (local copy if downloaded)

Command surface:

# List available backups
heroku pg:backups -a raxx-api-prod

# Capture a manual snapshot before a risky migration
heroku pg:backups:capture -a raxx-api-prod

# Download a backup locally
heroku pg:backups:download -a raxx-api-prod --output /tmp/raptor-prod-backup.dump

# Restore a logical backup by backup ID (for example b101).
# Timestamp-based PITR uses Heroku Postgres Rollback, not this command.
heroku pg:backups:restore '<backup-id>' DATABASE_URL -a raxx-api-prod

# Inspect PITR capabilities
heroku pg:info -a raxx-api-prod

Restore SOP:
1. Capture a manual backup of the current state first: heroku pg:backups:capture -a raxx-api-prod
2. Put the app in maintenance mode: heroku maintenance:on -a raxx-api-prod
3. Restore: heroku pg:backups:restore <target> DATABASE_URL -a raxx-api-prod
4. If the restore target predates a schema change, re-apply Alembic migrations: heroku run "python -m alembic upgrade head" -a raxx-api-prod
5. Verify: heroku run "python -m alembic current" -a raxx-api-prod
6. Re-provision the raptor_app role if it was lost in the restore: see docs/ops/runbooks/raptor-db-credentials.md §4 Scenario B
7. Turn maintenance mode off: heroku maintenance:off -a raxx-api-prod
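
The automatable part of this SOP (steps 1 to 5) can be sketched as a single function. This is an illustration under stated assumptions rather than an existing script: the name restore_raptor_db is hypothetical, the target is assumed to be a backup ID, and the function deliberately stops at the first failure and leaves maintenance mode on so the operator can inspect the state.

```shell
# Sketch of the Raptor restore sequence (SOP steps 1-5).
# Stops at the first failing step; maintenance mode stays on until
# the operator has completed step 6 and runs step 7 by hand.
restore_raptor_db() {
  local app="$1" target="$2"
  if [ -z "$app" ] || [ -z "$target" ]; then
    echo "usage: restore_raptor_db <app> <backup-id>" >&2
    return 2
  fi

  # 1. Safety copy of the current state.
  heroku pg:backups:capture -a "$app" || return 1
  # 2. Stop customer traffic.
  heroku maintenance:on -a "$app" || return 1
  # 3. Restore (the CLI prompts for confirmation).
  heroku pg:backups:restore "$target" DATABASE_URL -a "$app" || return 1
  # 4. Re-apply migrations in case the target predates a schema change.
  heroku run "python -m alembic upgrade head" -a "$app" || return 1
  # 5. Show the current revision for the incident ticket.
  heroku run "python -m alembic current" -a "$app" || return 1

  echo "Next: check the raptor_app role (step 6), then run: heroku maintenance:off -a ${app}"
}
```

Running it against a scratch or staging app first, for example restore_raptor_db raxx-api-staging <backup-id>, is a reasonable way to rehearse the sequence before it is ever needed on prod.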

Estimated restore time: 15–45 min depending on DB size.

Tested? No verified restore drill on record. Required before launch or first post-launch week.

2.2 Infisical Vault (vault.raxx.app)

What is backed up: All vendor API tokens, service credentials, Infisical machine identity secrets. This is the source of truth for secrets that must survive a full infrastructure rebuild.

Current posture: CRITICAL GAP. No automated backup is configured. The Lightsail instance disk holds the Postgres database backing Infisical CE. Lightsail automatic snapshots are NOT confirmed enabled for the vault instance. A vault instance loss (hardware failure, accidental termination) would require manual re-entry of all secrets.

Infisical admin export (manual, one-time procedure):

# Log into Infisical UI at vault.raxx.app
# Navigate to: Project Settings → Export → Download JSON
# Store export at: <Google Drive>/Raxx/Break-glass/infisical-export-YYYY-MM-DD.json
# Encrypt before storing: GPG encrypt to operator's key
gpg --recipient kris@moosequest.net --encrypt infisical-export.json

Required action (§4 v1 win): Schedule a daily Lightsail snapshot for the vault instance and a scheduled Infisical export to S3.

Restore SOP (if vault instance is lost):
1. Provision a new Lightsail instance from the most recent snapshot (per FreeScout rebuild pattern)
2. Re-attach static IP for vault.raxx.app
3. Verify Infisical CE is running: https://vault.raxx.app
4. If snapshot unavailable: provision fresh Infisical CE, restore from admin export JSON via Infisical CLI
5. Rotate all machine identity tokens and re-seed Heroku config vars

2.3 FreeScout (AWS Lightsail)

Status: BACKED UP. See docs/ops/runbooks/freescout-backup-restore.md for full SOP.

Tier	Type	Schedule	Retention	Location
Tier 1	Full Lightsail instance snapshot	Daily 06:00 UTC (GH Actions)	7 snapshots	Lightsail
Tier 2	MySQL logical dump (gzip)	Daily 06:00 UTC (GH Actions)	30 days	`s3://raxx-support-attachments/db-backups/freescout/`

Estimated restore time: Tier 1 (full instance): ~30 min (snapshot restore + static IP re-attach). Tier 2 (data-only): ~15 min.

Tested? Verified restore record in freescout-backup-restore.md shows PENDING. A verified restore drill is required.

Verify latest backup ran:

TODAY=$(date -u '+%Y-%m-%d')
aws s3api head-object \
  --bucket raxx-support-attachments \
  --key "db-backups/freescout/${TODAY}.sql.gz" \
  --region us-east-1

2.4 GitHub Repository

Status: Redundant. GitHub provides geo-redundant storage. Git's distributed model means every git clone is a full backup. Multiple clones exist on the operator's workstation and in CI runner ephemeral environments.

Additional exposure: GitHub Actions workflow artifacts (build artifacts, coverage reports) are ephemeral and not backed up — not a recovery concern since they are regenerated from source.

Off-GitHub archive: Not configured. Low priority given GitHub's reliability, but worth automating post-launch.

2.5 Terraform State

Location: S3 bucket raxx-iac-state-prod (us-east-1), DynamoDB table raxx-iac-state-locks.

S3 durability: 99.999999999% (eleven nines). S3 Versioning status: NOT CONFIRMED. Without versioning, an accidental deletion or overwrite of the state object is unrecoverable.

Current backup posture: Relies entirely on S3 durability. No additional snapshot or off-bucket copy.

Required action (§4 v1 win): Enable S3 Versioning on raxx-iac-state-prod + enable MFA Delete to prevent accidental state deletion.

# Enable versioning (one-time)
aws s3api put-bucket-versioning \
  --bucket raxx-iac-state-prod \
  --versioning-configuration Status=Enabled \
  --region us-east-1

# Verify
aws s3api get-bucket-versioning --bucket raxx-iac-state-prod --region us-east-1

Restore SOP (if state is corrupted or deleted):
1. Retrieve prior version from S3 (if versioning enabled): aws s3api list-object-versions --bucket raxx-iac-state-prod
2. If state is unrecoverable: run terraform import for each resource using the docs/ops/runbooks/terraform-cf-access-state-imports.md pattern
3. Timeline for full state reconstruction from import: estimated 4–8 h for all roots
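
Step 1 of this SOP can be sketched as two small helpers, assuming versioning has already been enabled. The function names are hypothetical, and the object key of each root's state file is an assumption the operator must supply from that root's backend.tf; it is not recorded in this plan.

```shell
BUCKET="raxx-iac-state-prod"
REGION="us-east-1"

# List the stored versions of one state object:
# version ID, last-modified time, and whether it is the latest.
list_state_versions() {
  local key="$1"
  aws s3api list-object-versions \
    --bucket "$BUCKET" --prefix "$key" --region "$REGION" \
    --query 'Versions[].[VersionId,LastModified,IsLatest]' --output text
}

# Download one specific version to a local file for inspection.
# Nothing is written back to the bucket by this function.
fetch_state_version() {
  local key="$1" version_id="$2" out="$3"
  aws s3api get-object \
    --bucket "$BUCKET" --key "$key" --version-id "$version_id" \
    --region "$REGION" "$out" >/dev/null
}
```

Fetching to a local file first keeps the recovery read-only until the operator has confirmed the older version is the one to restore.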

2.6 AWS SSM Parameter Store

Current posture: No backup. SSM parameter values are managed via Terraform (IaC) for infrastructure secrets, and via Velvet/manual rotation for workload secrets. Loss of an SSM parameter requires re-retrieval from the corresponding Infisical vault path or from the Velvet rotation log.

SSM paths to document in break-glass: /raxx/freescout/*, /raxx/cf-access/*, and any Velvet-managed rotation paths.
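
One way to produce the path inventory for the break-glass doc is to list parameter names only, never values. The sketch below is an assumption about how that could be done, with a hypothetical function name; describe-parameters returns metadata and does not return parameter values.

```shell
# Print the names of all SSM parameters under a path, one per line.
# Uses describe-parameters, which returns metadata only (no values).
list_ssm_names() {
  local path="$1"
  aws ssm describe-parameters \
    --parameter-filters "Key=Path,Option=Recursive,Values=${path}" \
    --region us-east-1 \
    --query 'Parameters[].Name' --output text | tr '\t' '\n' | sort
}
```

For example, list_ssm_names /raxx/freescout prints the names under that prefix, which can be pasted into the break-glass doc without exposing any secret.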

§3 Golden Image — Full-Rebuild Recovery Path

This section answers: "If the Heroku account is compromised, vault.raxx.app goes down, and the operator's laptop is lost — how does Raxx get back up?"

3.1 Prerequisites (must exist before you need this)

Asset	Where it lives	Who controls it
GitHub repo (`raxx-app/TradeMasterAPI`)	GitHub — survives Heroku + Lightsail loss	GitHub account (linked to `kris@moosequest.net` Google account)
AWS root account credentials	Private Google Drive (`/Raxx/Break-glass/aws-root-credentials.md`) per `project_aws_iam_state`	Google account with hardware MFA
Heroku account credentials	Passkey (WebAuthn) — no password	`kris@moosequest.net` email for recovery
Cloudflare account credentials	Passkey / 2FA	`kris@moosequest.net` email for recovery
Infisical admin export (last snapshot)	Private Google Drive (`/Raxx/Break-glass/infisical-export-YYYY-MM-DD.json.gpg`)	Google account
Stripe account credentials	Passkey / email MFA	`kris@moosequest.net`
GPG private key (for encrypted Drive exports)	Operator's hardware key or backup passphrase in Drive	Operator only

The single non-automatable dependency: the operator's Google account (hardware MFA protected) is the root of all credential recovery paths. If that account is compromised or inaccessible, no programmatic recovery path exists. This is the operator-only break-glass scenario (§6).

3.2 Rebuild SOP (green-field recovery sequence)

Estimated total RTO: 6–12 hours for full service restoration. Data RPO depends on most recent backup (up to 24 h data loss at current posture; up to 5 min with Heroku PITR within retention window).

Phase 1 — Reestablish AWS access (0–30 min)

Log into AWS console using root credentials from Google Drive break-glass doc.
Verify claude-infisical-bootstrap IAM user exists. If lost: create new IAM user with Lightsail + SSM + S3 permissions per the bootstrap policy.
Retrieve or rotate access keys.

Phase 2 — Restore vault.raxx.app (30 min–2 h)

Check if Lightsail instance exists: aws lightsail get-instance --instance-name <vault-instance-name> --region us-east-1
If instance exists but unresponsive: restart. If missing: restore from most recent Lightsail snapshot.
If no snapshot: provision fresh Lightsail instance, install Infisical CE from the Terraform root.
If fresh install: restore secrets from the Infisical admin export JSON on Google Drive.
Re-attach static IP. Verify https://vault.raxx.app resolves.
Re-provision machine identity tokens for CI + Velvet + agents.
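
The Phase 2 branching can be written down as a small decision helper so the choice is made the same way under pressure. This is a sketch: both function names and the action labels are hypothetical, and the existence probe treats any failure of get-instance (including an authentication problem) as "missing", so Phase 1 should be complete before relying on it.

```shell
# Map the observed situation to the Phase 2 action. No AWS calls here.
#   $1: yes/no, does the vault Lightsail instance exist?
#   $2: yes/no, is a Lightsail snapshot available?
vault_recovery_action() {
  local instance_exists="$1" has_snapshot="$2"
  if [ "$instance_exists" = "yes" ]; then
    echo "restart-instance"
  elif [ "$has_snapshot" = "yes" ]; then
    echo "restore-from-snapshot"
  else
    echo "fresh-install-then-import-export"
  fi
}

# Probe for the first argument above; prints yes or no.
vault_instance_exists() {
  if aws lightsail get-instance --instance-name "$1" \
       --region us-east-1 >/dev/null 2>&1; then
    echo yes
  else
    echo no
  fi
}
```

Whatever action is chosen, the remaining Phase 2 steps (static IP re-attach, HTTPS check, machine identity re-provisioning) still apply afterwards.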

Phase 3 — Restore SSM parameters (2–3 h)

Re-seed SSM /raxx/freescout/* and other workload secrets from Infisical vault (now restored).

Phase 4 — Restore Heroku applications (3–5 h)

If Heroku account is compromised and requires a full new account:
- Create new Heroku account.
- Create apps: raxx-api-prod, raxx-console-prod, raxx-velvet-prod (per project_heroku_app_names).
- Provision Standard-0 Postgres add-on for Raptor and Velvet.
- Restore Postgres from most recent Heroku backup (download .dump, restore to new DB): heroku pg:backups:restore
- Re-provision raptor_app role (per docs/ops/runbooks/raptor-postgres-roles.md).
Seed all Heroku config vars from Infisical vault (restored in Phase 2). Use heroku config:set ... >/dev/null 2>&1 for all secrets.
Re-deploy code: trigger GitHub Actions deploy-heroku.yml workflow_dispatch targeting main for each app.
Restore FLAG_RAPTOR_APP_ROLE_SEPARATION=1 and other production flags.
Do not re-enable live trading flags until paper gate has been re-satisfied.

Phase 5 — Restore Cloudflare configuration (4–5 h)

If CF account is recoverable: Terraform state (from S3) contains all CF resources. terraform apply from each root restores DNS, Pages, Access, and WAF config.
If CF account is lost: re-register new account; rebuild DNS manually from known record inventory; re-deploy Pages from GitHub; re-establish CF Access.
Antlers / getraxx re-deploy from GitHub via deploy-antlers.yml workflow dispatch.
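
Before applying anything in Phase 5, it may help to run a plan across every root and see which ones have drifted. The helper below is a sketch with a hypothetical name; the list of root directories is an assumption the operator passes in, since this plan does not enumerate them.

```shell
# Run init + plan for each Terraform root passed as an argument.
# Prints one status line per root; returns non-zero if any root fails.
# The plan is saved as tfplan inside each root for a later reviewed apply.
plan_roots() {
  local failed=0 root
  for root in "$@"; do
    if terraform -chdir="$root" init -input=false >/dev/null &&
       terraform -chdir="$root" plan -input=false -out=tfplan >/dev/null; then
      echo "ok   $root"
    else
      echo "FAIL $root"
      failed=1
    fi
  done
  return "$failed"
}
```

Applying from the saved plan file, one root at a time, keeps each change reviewable and leaves a clear trail for the incident ticket.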

Phase 6 — Restore FreeScout ticketing (5–6 h)

Restore from Tier 1 Lightsail snapshot (see docs/ops/runbooks/freescout-backup-restore.md §Tier 1 restore).
Re-attach static IP. Verify https://tickets.raxx.app resolves.
Update SSM /raxx/freescout/ssh_key if new key pair is generated.

Phase 7 — Verify and soft-open (6–12 h)

Run smoke suite: scripts/ci/run_smoke.sh --env=prod
Verify CF Access gates are up for pre-launch surfaces.
Verify Sentry is receiving errors.
Verify Postmark can send test email.
Open EyeTok on-call agent if deployed.
Notify customers via status.raxx.app if there was any customer-visible downtime.

3.3 Estimated RTO/RPO by current posture (honest assessment)

Recovery scenario	Current RTO	Current RPO	Notes
Heroku dyno crash (single app)	5–10 min	0	Heroku auto-restarts; Postgres is unaffected
Heroku dyno rollback (bad deploy)	10–15 min	0	`heroku rollback`
Heroku Postgres PITR restore	1–3 h	5 min (within 7-day window)	Requires maintenance window
Vault Lightsail instance loss	1–4 h	24 h (unconfirmed — no snapshot automation)	Today's SEV-4 root cause
FreeScout Lightsail loss	30–60 min	24 h (daily backup)	Tier 1 restore procedure exists and is documented
Full Heroku account compromise	6–12 h	Up to 24 h (Postgres backup age)	Requires DB download + restore to new account
Terraform state loss (no S3 versioning)	4–8 h	N/A — IaC must be reconstructed via import	Current gap: S3 versioning unconfirmed
Operator unreachable for 48 h	Blocking for prod deploys	N/A	ADR-0020 prod-approval gate has no deputy

§4 Low-Effort v1 Wins (ship pre-launch or first post-launch week)

Pre-launch = T-0 to T+7 days (target by 2026-05-30 UTC)
Post-launch week = T+7 to T+30 days
Each item is scoped to operator or agent execution; no new feature-dev needed unless noted.

Win 1 — Enable daily Lightsail snapshots for vault instance (XS)

Size: XS — one AWS CLI command or one Terraform addition.
Why: Today's SEV-4 (vault Lightsail timeouts) confirmed this SPOF is active. Without a snapshot, vault instance loss = manual re-entry of all secrets.
Action:

# Enable Lightsail auto-snapshot at 07:00 UTC (after the 06:00 UTC FreeScout backup).
# snapshotTimeOfDay must be on the hour (HH:00), so 06:30 is not accepted.
aws lightsail enable-add-on \
  --resource-name <vault-instance-name> \
  --region us-east-1 \
  --add-on-request 'addOnType=AutoSnapshot,autoSnapshotAddOnRequest={snapshotTimeOfDay=07:00}'

Acceptance criteria: When checked 24 h after enabling, aws lightsail get-auto-snapshots --resource-name <vault-instance-name> --region us-east-1 lists at least one snapshot created within the preceding 25 hours.
Ship window: Pre-launch (this week).
Issue: Filed as bcp-win-1 — see GH issue created from this card.
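
The acceptance check for Win 1 can be sketched as a script-friendly test. The function names are hypothetical. The sketch compares the snapshot's calendar date against today and yesterday in UTC, which approximates the 25-hour window, and the default for "yesterday" assumes GNU date.

```shell
# Print the date (YYYY-MM-DD) of the newest auto snapshot for an instance.
latest_auto_snapshot_date() {
  local instance="$1"
  aws lightsail get-auto-snapshots \
    --resource-name "$instance" --region us-east-1 \
    --query 'autoSnapshots[].date' --output text | tr '\t' '\n' | sort | tail -n 1
}

# Succeed if the newest snapshot is dated today or yesterday (UTC).
# $2 and $3 override today/yesterday, which is useful for testing.
vault_snapshot_is_recent() {
  local latest today yesterday
  latest=$(latest_auto_snapshot_date "$1")
  today=${2:-$(date -u '+%Y-%m-%d')}
  yesterday=${3:-$(date -u -d 'yesterday' '+%Y-%m-%d')}
  [ "$latest" = "$today" ] || [ "$latest" = "$yesterday" ]
}
```

An instance with no snapshots at all yields an empty or placeholder date, so the check fails in that case as intended.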

Win 2 — Enable S3 Versioning on Terraform state bucket (XS)

Size: XS — two AWS CLI commands.
Why: Without versioning, an accidental deletion or overwrite of the state object (for example a mistaken terraform state rm, or a stray delete on the bucket) is unrecoverable. Versioning adds a safety net at no meaningful cost.
Action:

aws s3api put-bucket-versioning \
  --bucket raxx-iac-state-prod \
  --versioning-configuration Status=Enabled \
  --region us-east-1

# Optional but recommended: MFA delete to prevent accidental state deletion
# (requires root credentials — do separately)

Acceptance criteria: aws s3api get-bucket-versioning --bucket raxx-iac-state-prod --region us-east-1 returns {"Status": "Enabled"}.
Ship window: Pre-launch.
Issue: Filed as bcp-win-2.
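
The Win 2 acceptance criterion can be reduced to an exit code, which makes it reusable in a scheduled check. A minimal sketch with a hypothetical function name:

```shell
# Succeed only if versioning on the state bucket reports Enabled.
# Returns 2 if the AWS call itself fails (for example, no credentials).
state_bucket_versioned() {
  local status
  status=$(aws s3api get-bucket-versioning \
    --bucket raxx-iac-state-prod --region us-east-1 \
    --query 'Status' --output text) || return 2
  [ "$status" = "Enabled" ]
}
```

A bucket that has never had versioning configured returns no Status field, so the comparison fails and the check reports the gap.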

Win 3 — Schedule daily Infisical admin export to S3 (S)

Size: S — a GH Actions workflow (similar pattern to freescout-backup.yml).
Why: Vault is the crown jewel of the secret posture. No automated export means a vault rebuild requires full manual re-entry. Today's SEV-4 makes this urgent.
Design: GH Actions cron at 07:00 UTC. Uses Infisical export API, GPG-encrypts with operator public key, uploads to s3://raxx-iac-state-prod/vault-exports/YYYY-MM-DD.json.gpg. Failure alerts via Slack DM D0AJ7K184TV.
Acceptance criteria: S3 object exists at vault-exports/YYYY-MM-DD.json.gpg with size > 1 KB within 25 h of first workflow run; failed run posts Slack DM alert.
Ship window: Pre-launch or first post-launch week.
Issue: Filed as bcp-win-3.
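
The encrypt-and-upload half of the Win 3 design could look like the sketch below. The Infisical export call itself is left out because its exact API is not specified in this plan; the function assumes a plaintext export file already exists on the runner. Function names are hypothetical, and --trust-model always is an assumption made so that gpg accepts an imported public key non-interactively in CI.

```shell
# S3 key for a given export date (YYYY-MM-DD).
vault_export_key() {
  echo "vault-exports/${1}.json.gpg"
}

# Encrypt a plaintext export to the operator's public key and upload it.
#   $1: plaintext export path  $2: GPG recipient  $3: date (default: today UTC)
encrypt_and_upload_export() {
  local plaintext="$1" recipient="$2" day="${3:-$(date -u '+%Y-%m-%d')}"
  local encrypted="${plaintext}.gpg"

  gpg --batch --yes --trust-model always \
    --recipient "$recipient" --output "$encrypted" --encrypt "$plaintext" || return 1
  # Remove the plaintext as soon as the encrypted copy exists.
  rm -f "$plaintext"

  aws s3 cp "$encrypted" "s3://raxx-iac-state-prod/$(vault_export_key "$day")" \
    --region us-east-1 --only-show-errors || return 1
  rm -f "$encrypted"
}
```

Only the operator's public key needs to be present on the runner; the private key stays with the operator, so a compromised workflow cannot decrypt earlier exports.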

Win 4 — Document and verify a Heroku Postgres restore drill (XS)

Size: XS (operator-executed, no code) — run heroku pg:backups:restore against a scratch Heroku app (not prod).
Why: No restore drill is on record for Raptor Postgres. "Backup exists" and "backup restores" are different claims. The FreeScout restore record also shows PENDING — complete both in the same session.
Action: Operator creates a throwaway Heroku app, provisions a Postgres add-on, downloads the most recent prod backup, restores it, verifies Alembic current revision matches. Documents the result in docs/ops/runbooks/raptor-db-credentials.md §2 restore record.
Acceptance criteria: A documented restore result row (date, backup used, duration, outcome) exists in the runbook. Verified restore for FreeScout backup is also completed in freescout-backup-restore.md.
Ship window: First post-launch week.
Issue: Filed as bcp-win-4.
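
A scripted variant of the drill is sketched below. It differs slightly from the action described above: instead of downloading the dump, it asks Heroku for a signed URL to the most recent prod backup and restores from that URL into the scratch app. The function name, the scratch app name, and the add-on plan are all hypothetical inputs, and the final query assumes Alembic's default alembic_version table.

```shell
# Restore the latest prod backup into a throwaway app and print its
# Alembic revision. Never pass a real environment as the scratch app.
#   $1: scratch app name  $2: Postgres plan for the scratch add-on
restore_drill() {
  local scratch="$1" plan="$2" url
  heroku apps:create "$scratch" || return 1
  heroku addons:create "heroku-postgresql:${plan}" -a "$scratch" --wait || return 1
  # Signed, short-lived URL for the most recent backup; do not log it.
  url=$(heroku pg:backups:url -a raxx-api-prod) || return 1
  heroku pg:backups:restore "$url" DATABASE_URL -a "$scratch" --confirm "$scratch" || return 1
  # Compare this revision with the output of "alembic current" on prod.
  heroku pg:psql -a "$scratch" -c "SELECT version_num FROM alembic_version;"
}
```

After the result row is recorded in the runbook, the scratch app should be destroyed so that a copy of prod data does not linger outside the normal retention and DSR-erasure posture.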

Win 5 — Add `bcp-smoke` GH Actions workflow (monthly verify) (S)

Size: S — one new GH Actions workflow file.
Why: The BCP's usefulness decays if the restore procedures are never exercised. A monthly automated smoke confirms: FreeScout S3 backup exists, vault snapshot exists, TF state bucket versioning is enabled. Does not run a full restore — just validates the existence and integrity of backup artifacts.
Schedule: 0 8 1 * * (08:00 UTC on the 1st of each month).
Checks:
- FreeScout: aws s3api head-object for yesterday's dump
- Vault: aws lightsail get-auto-snapshots, most recent < 25 h
- TF state: aws s3api get-bucket-versioning returns Enabled
- Heroku: heroku pg:backups -a raxx-api-prod lists a backup < 48 h
- Failure: post alert to ops@raxx.app + Slack DM D0AJ7K184TV

Acceptance criteria: Workflow exists and passes on its first manual workflow_dispatch run. The first monthly cron run fires on 2026-06-01.
Ship window: First post-launch week.
Issue: Filed as bcp-win-5.
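
The body of the bcp-smoke workflow could be built from two small helpers: a freshness test for timestamps and a runner that tallies failures instead of stopping at the first one. Both names are hypothetical, and is_fresh assumes GNU date, which is what GitHub-hosted Ubuntu runners provide.

```shell
# Succeed if an ISO-8601 timestamp is newer than max_hours.
#   $1: timestamp  $2: max age in hours  $3: "now" in epoch seconds (optional)
is_fresh() {
  local stamp="$1" max_hours="$2" now="${3:-$(date -u '+%s')}" ts
  ts=$(date -u -d "$stamp" '+%s') || return 2
  [ $(( now - ts )) -lt $(( max_hours * 3600 )) ]
}

FAILURES=0

# Run one named check; print PASS/FAIL and count failures.
# The check's own output is discarded so secrets cannot leak into logs.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS ${name}"
  else
    echo "FAIL ${name}"
    FAILURES=$((FAILURES + 1))
  fi
}
```

The workflow would call run_check once per item in the checks list and exit non-zero when FAILURES is greater than zero, which is the point at which the alert step fires.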

§5 Comprehensive v2 Future Investments (NOT IMPLEMENTED — roadmap)

The following are documented for future investment. None are in scope for v1 launch.

5.1 Multi-region Heroku (or migrate to Fly.io / Railway with multi-region support)

Heroku does not natively support multi-region failover for a single app. Options: (a) Heroku Private Spaces with geo-routing, (b) migrate Raptor to a provider with native multi-region (Fly.io), (c) add a secondary Heroku app in eu-west with read-replica Postgres and CF Load Balancer routing.
Trigger: First customer in a region where us-east latency is unacceptable, or first Heroku us-east regional incident causing > 30 min customer outage.

5.2 Vault HA — multi-AZ Infisical

Replace the single Lightsail micro instance with an HA Infisical CE deployment: two instances across us-east-1a and us-east-1b behind an AWS ALB, Postgres on RDS Multi-AZ.
Trigger: Post-launch if vault-related SEV-3 or higher incidents recur; or when team size > 1 operator.

5.3 AWS Backup consolidation

Enable AWS Backup across all AWS resources (Lightsail instances, SSM parameters, S3 buckets) with a unified backup vault, cross-region replication to us-west-2, and 30-day retention. Replace the ad-hoc GH Actions snapshot scripts.
Trigger: SOC2 Type II preparation or first enterprise customer.

5.4 RDS read replica for Raptor Postgres

Heroku Standard-0 does not expose RDS read-replica controls. Migrating to Heroku Premium-0 or to a self-managed RDS instance would enable a read replica in a second AZ, reducing Postgres RPO to seconds.
Trigger: Customer-facing analytics queries cause replication lag, or revenue per day > cost of RDS Premium tier.

5.5 Off-GitHub repo mirror

Daily git push to a private S3 archive or a GitLab CE instance for cold-storage backup of all commits and LFS objects.
Trigger: GitHub has a sustained multi-day incident (extremely low probability) or a security concern about repo access.

5.6 PagerDuty integration for EyeTok

As documented in docs/architecture/eyetok-oncall-agent-2026-05-15.md §6.4, PagerDuty Solo is deferred to v1.1. Add after a real after-hours SEV1 is missed via Slack push.

5.7 GDPR breach notification automation

Currently the GDPR breach 72-hour notification is a manual operator action. A v2 enhancement would automate the initial DPA notification workflow from an EyeTok-triggered breach classification, with templated notification email to relevant DPAs prepared within 24 h of breach detection.

5.8 Automated retention enforcement job (#1631)

Per docs/ops/policies/data-retention.md §3, the automated retention job is deferred to 2026-Q3. This is acceptable given first data hits the 7-year mark in 2033. No BCP action needed before 2031-03-01.

§6 Operator-Only Break-Glass Paths

This section addresses the scenario: Kristerpher is unreachable for 48 hours or longer.

6.1 Current state (honest assessment)

There is no designated deputy. As a solo pre-team operator, all privileged access is gated on a single person. This is the accepted pre-team posture. The following documents what breaks and what doesn't:

Function	Breaks without operator?	Notes
Prod deploys (new code)	Yes — ADR-0020 requires human reviewer approval	No code ships to prod without operator
Staging deploys (CI)	No — auto-deploys on `main` push without approval	Staging continues working
Heroku config:set (secrets rotation)	Yes — requires Heroku account access	Velvet rotation stalls if operator unreachable
FreeScout ticket queue	No — EyeTok creates tickets; queue is readable	Support backlog accumulates; no human responses
Customer-facing platform	No — existing prod deployment keeps running	No new features; existing features work
Paper trading	No — paper gate is code-enforced	Paper strategies continue running
Live trading	No — live trading flags off pre-launch	Not a concern for v1
AWS infra changes	Yes — requires IAM credentials	No new infrastructure
GDPR DSR responses	Yes — manual process	30-day clock keeps ticking

6.2 Break-glass credential inventory

Credential	Location	Access method
AWS root credentials	Private Google Drive `/Raxx/Break-glass/aws-root-credentials.md`	Google account (hardware MFA on `kris@moosequest.net`)
Heroku account	Passkey-bound to `kris@moosequest.net` email	Email account recovery
Cloudflare account	Passkey / backup codes	Email account recovery
GitHub account (`kris@moosequest.net`)	Passkey	Email account recovery
Infisical admin export	Private Google Drive `/Raxx/Break-glass/infisical-export-YYYY-MM-DD.json.gpg`	GPG decrypt with operator's key
Stripe account	Passkey	Email account recovery
Google Workspace account (`kris@moosequest.net`)	Hardware MFA (YubiKey)	Recovery code (stored with break-glass)

Critical chain: All recovery paths ultimately flow through kris@moosequest.net Google Workspace account. If that account is inaccessible, break-glass recovery requires Google account recovery codes, which must be stored in a physical location (fireproof safe or safety deposit box) off-device.

6.3 Pre-team deputy recommendation (deferred)

Not implemented — documented for future action:

When the first team member is added, the following should be provisioned within their first week:
1. Add them as GitHub Organization Admin on the raxx-app org
2. Add them as reviewer on the production GitHub Environment
3. Add them to the Heroku team with member role on prod apps
4. Provision a Cloudflare sub-account with DNS read + emergency write
5. Record their contact details in docs/ops/emergency-contacts.md

6.4 Extended operator absence SOP (current posture)

If Kristerpher will be unreachable for more than 24 hours:
1. Verify no pre-launch-blocker issues are open requiring operator action
2. Verify Sentry and EyeTok (when live) are alerting to Slack
3. Set Slack #raxx-ops-alert-sev1 mobile push to maximum priority
4. Verify ops@raxx.app is routing incident notifications
5. Platform continues to operate autonomously for existing customers; no new prod deployments until return

§7 Communication Plan During an Incident

7.1 Internal routing (operator + EyeTok agent)

Severity	Channel	Who posts	Response SLA
SEV1 (critical)	`#raxx-ops-alert-sev1` (C0B423M38H4)	EyeTok, then operator	Immediate — Slack mobile push
SEV2 (high, after-hours)	`#raxx-ops-alert-sev2` (C0B445RU95Y)	EyeTok	Mobile push + call-out
SEV2.5 (high, business hours 12:00–22:30 UTC weekdays)	`#raxx-ops-alert-sev2-5` (C0B4611UC2V)	EyeTok	Normal notification
SEV3 (medium/low)	`#raxx-ops-alert-sev3` (C0B4615LQ49)	EyeTok autonomous	Agent audit log
Ops alerts (non-EyeTok)	`ops@raxx.app` / FreeScout	Automated inbound	Operator reviews at next session
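
The routing table above can be read as a small lookup. The sketch below is illustrative only: the function names and the severity tokens (`critical`, `high`, `medium`, `low`) are hypothetical, it assumes GNU `date`, and it treats 12:00 as inside and 22:30 as outside the business-hours window, which the table does not specify.

```shell
# Hypothetical helpers; channel names are taken from the routing table.
# Assumes GNU date (for the -d option).

# Succeeds if the timestamp falls in the 12:00 to 22:30 UTC weekday window.
in_business_hours() {
  local ts="$1" dow hhmm
  dow=$(date -u -d "$ts" +%u) || return 2     # 1 = Monday ... 7 = Sunday
  hhmm=$(date -u -d "$ts" +%H%M) || return 2  # e.g. 0930, 2229
  [ "$dow" -le 5 ] && [ "$hhmm" -ge 1200 ] && [ "$hhmm" -lt 2230 ]
}

# Prints the Slack channel for a severity token and a UTC timestamp.
route_alert() {
  local severity="$1" ts="$2"
  case "$severity" in
    critical) echo "#raxx-ops-alert-sev1" ;;
    high)
      # An unparseable timestamp falls through to the louder after-hours channel.
      if in_business_hours "$ts"; then
        echo "#raxx-ops-alert-sev2-5"
      else
        echo "#raxx-ops-alert-sev2"
      fi ;;
    medium|low) echo "#raxx-ops-alert-sev3" ;;
    *) echo "unknown severity: $severity" >&2; return 1 ;;
  esac
}
```

For example, `route_alert high 2026-05-21T13:00:00Z` (a Thursday) prints `#raxx-ops-alert-sev2-5`, while the same call with a Saturday timestamp prints `#raxx-ops-alert-sev2`.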

For the EyeTok on-call agent design, see docs/architecture/eyetok-oncall-agent-2026-05-15.md.

7.2 Customer-facing communication

Channel	Current state	Notes
`status.raxx.app`	Cloudflare Worker + D1 (per ADR-0028, ADR-0030)	Public status page; FreeScout webhook receiver drives state changes
`support@raxx.app`	Google MX routing → FreeScout via Postmark bridge	Customer inbound support queue
Transactional email (Postmark)	Active and approved out of sandbox (2026-05-09)	Used for account notifications, not incident alerts

Incident communication SOP:
1. EyeTok opens a FreeScout ticket at ops@raxx.app
2. If customer-visible impact: operator posts a public incident note to status.raxx.app via the Console status-management UI
3. Public note must follow the invariant: retrospective only, no forward-looking estimates unless a firm maintenance window end time is known
4. Customer inbound tickets at support@raxx.app receive a canned initial response: "We're aware of an issue affecting [service]. We'll provide an update at [time] UTC or when resolved."
5. On resolution: update status.raxx.app with incident resolved timestamp; post resolution note to customer ticket

If an incident involves suspected unauthorized access to customer PII:
1. EyeTok must classify the incident as a potential data breach at triage
2. Operator must make a breach determination within 4 hours of detection
3. If a breach is confirmed: DPA notification must be filed within 72 hours of discovery (GDPR Art. 33)
4. Affected customers must be notified without undue delay if high risk to rights/freedoms (GDPR Art. 34)
5. All breach notification actions are logged in the FreeScout ticket with the gdpr-breach tag
6. Current status: no automated breach notification pipeline exists (deferred to §5.7)
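
The two clocks in the steps above are simple offsets from a timestamp, so they can be worked out when the ticket is opened. This is a minimal sketch, not part of any existing tooling: `breach_deadlines` is a hypothetical name, it assumes GNU `date`, and it treats "detection" and "discovery" as the same instant, which the operator may need to adjust.

```shell
# Hypothetical helper: print the breach-response deadlines for a detection time.
# Assumes GNU date. Goes through epoch seconds to avoid ambiguous relative-date parsing.

breach_deadlines() {
  local detected="$1" epoch
  epoch=$(date -u -d "$detected" +%s) || return 1
  # 4 hours for the operator's breach determination
  echo "determination_due=$(date -u -d "@$((epoch + 4 * 3600))" +%Y-%m-%dT%H:%M:%SZ)"
  # 72 hours for the DPA notification (GDPR Art. 33)
  echo "dpa_notification_due=$(date -u -d "@$((epoch + 72 * 3600))" +%Y-%m-%dT%H:%M:%SZ)"
}
```

Calling `breach_deadlines 2026-05-21T10:00:00Z` prints `determination_due=2026-05-21T14:00:00Z` and `dpa_notification_due=2026-05-24T10:00:00Z`; both lines can be pasted into the FreeScout ticket so the deadlines sit in the audit trail.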

Security Considerations

Question	Answer
What PII does this BCP collect?	None. This document describes procedures; it contains no PII.
What is the retention period for incident tickets?	Per `docs/ops/policies/data-retention.md`: DSR request records retained 7 years; general ops tickets 90 days
Does any part of this store a credential in a form that can be replayed?	No. Command snippets in this doc use environment variable placeholders. Break-glass credentials are referenced by location (Google Drive), not value.
What is logged for audit?	Every recovery action that touches money, permissions, or data access must be logged in the FreeScout incident ticket with UTC timestamps.
Where are secrets?	In Infisical vault (vendor API tokens) and AWS SSM (workload secrets). Break-glass copies in Google Drive (operator-only, hardware MFA protected).
Are secrets rotatable without redeploy?	Yes. All Heroku config vars are sourced from vault; rotation is vault update → `heroku config:set ... >/dev/null 2>&1` → dyno restart (triggered automatically by the config change). No code redeploy required.
Is there a kill-switch for live execution paths?	Yes. `FLAG_RAPTOR_APP_ROLE_SEPARATION`, `FLAG_WEBAUTHN_REGISTRATION`, and the paper-first gating flag can each be disabled with a single `heroku config:unset`, without a code deploy (per `feedback_heroku_config_set_echoes_secrets` pattern).
What happens on breach?	See §7.3. 72-hour DPA notification obligation. Affected customers notified per Art. 34. No automated pipeline for v1 — manual SOP applies.
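
The rotation and kill-switch rows above both come down to one Heroku CLI call with its output discarded. The wrappers below are a sketch under stated assumptions: the function names and argument order are hypothetical, and how the new value is fetched from vault is left to the operator. Only `heroku config:set`, `heroku config:unset`, and `--app` are real CLI surface.

```shell
# Hypothetical wrappers; names and argument order are illustrative.
# The secret is read from stdin so it is never typed as a literal on the
# command line or saved in shell history.

rotate_config_var() {
  local app="$1" key="$2" value=""
  IFS= read -r value || true   # accept input with or without a trailing newline
  if [ -z "$value" ]; then
    echo "refusing to set $key: no value on stdin" >&2
    return 1
  fi
  # config:set echoes the new value on success, so discard all output.
  heroku config:set "$key=$value" --app "$app" >/dev/null 2>&1
}

# Kill-switch: remove a flag, assuming an unset flag means the gated path is off.
disable_flag() {
  local app="$1" flag="$2"
  heroku config:unset "$flag" --app "$app" >/dev/null 2>&1
}
```

The value would be piped in from whatever vault read the operator uses, with the app name and config key as arguments. One caveat: the value still appears in the argument list of the `heroku` process while it runs, so this protects terminal output and shell history, not the process table.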

Open Questions (requiring operator decision before sub-cards can be claimed)

Vault instance name. The vault Lightsail instance name is referenced in Win 1 as <vault-instance-name>. Confirm the Lightsail instance name before Win 1 can be executed by an agent. (Likely raxx-vault or similar — check terraform/ infisical root or aws lightsail get-instances.)
TF state bucket scope. Confirmed that cf-pages-docs-customer uses raxx-iac-state-prod. Other roots (freescout, waf, cf-access) need a one-time verification that they also use this bucket before Win 2 (S3 versioning) fully protects all state. Unblocking action: grep -r "raxx-iac-state-prod" terraform/ across all roots.
Infisical export API availability. Win 3 requires the Infisical CE admin export endpoint. Confirm the export endpoint URL and auth method for Infisical CE (self-hosted) vs Infisical Cloud SaaS. The endpoint may differ from the Cloud SaaS admin export path.
Verified restore drill timing. Win 4 requires operator-executed restore against a throwaway Heroku app. Is this feasible before the 2026-05-23 launch date, or should it be scheduled for the first post-launch week (2026-05-23 to 2026-05-30 UTC)?
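
The TF state bucket check above can be made per-root rather than a single recursive grep, so that roots that do not reference the bucket are listed by name. This is a sketch only: `roots_missing_bucket` is a hypothetical name, it assumes each immediate subdirectory of `terraform/` is one root, and it only scans `*.tf` files, so a backend configured elsewhere would be reported as missing.

```shell
# Hypothetical check: print the Terraform roots that never mention the state bucket.
# Assumes one root per immediate subdirectory and a grep that supports --include.

roots_missing_bucket() {
  local tf_dir="$1" bucket="$2" root
  for root in "$tf_dir"/*/; do
    [ -d "$root" ] || continue
    # -r: recurse, -q: quiet, -F: match the bucket name literally
    if ! grep -rqF --include='*.tf' -- "$bucket" "$root"; then
      basename "$root"
    fi
  done
}
```

Running `roots_missing_bucket terraform raxx-iac-state-prod` from the repo root prints one root name per line, and prints nothing if every root references the bucket. For the vault instance name question, `aws lightsail get-instances --query 'instances[].name' --output text` lists the instance names in the currently configured region.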