Business Continuity Plan — Raxx v1
Version: 1.0
Effective date: 2026-05-21 UTC
Author: raxx-software-architect
Owner: Operator (Kristerpher)
Next scheduled review: 2026-08-21 UTC (90 days post-launch)
Related ADR: docs/architecture/adr/0103-bcp-backup-posture.md
Implementation cards: See §4 for sub-card issue numbers.
Preamble — Invariants This Plan Must Not Violate
Every recovery procedure in this document is constrained by the following non-negotiable platform invariants. Any BCP action that would require violating them is a sign the procedure is wrong — not a reason to override the invariant.
- No stored credentials. Recovery actions that involve credential rotation must never write credentials to files, shell history, terminal logs, or any surface that persists. Use >/dev/null 2>&1 on all heroku config:set calls (per feedback_heroku_config_set_echoes_secrets).
- Passkeys / WebAuthn only. Recovery of customer authentication infrastructure must not fall back to passwords or SMS OTP.
- GDPR by default. A breach event triggers the notification obligation; recovery timelines do not exempt this. If a restore involves customer PII, the DSR-erasure posture must be preserved in the restored state.
- Audit trail for every state change. Recovery actions that touch money, permissions, or data access must be logged. Document every action taken during an incident in the FreeScout ticket opened at ops@raxx.app.
- Paper-first gating. No recovery path re-enables live-trading code paths unless the paper-profitable gate (or explicit per-flow operator override) is satisfied. Restoring a prod environment does not automatically re-enable live trading.
- Secrets into infra, not into code. All secret re-provisioning during recovery reads from vault or SSM. Nothing is committed to the repo.
- Credentials into Infisical, not on disk. Operator break-glass credentials for AWS, Heroku, Cloudflare are stored in Google Drive (private) per project_aws_iam_state. They are not in the repo.
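The output-redirect rule in the first invariant can be wrapped in a small helper so it is applied the same way every time. The sketch below is an assumption, not an existing script in the repo: set_config_quiet is a hypothetical name, and reading the value from stdin is one possible way to keep the secret out of shell history.

```shell
# Hypothetical helper (not part of the repo): set one Heroku config var
# while discarding the CLI output that would otherwise echo the secret.
set_config_quiet() {
  local app="$1" key="$2" value
  # Read the secret from stdin; tolerate input without a trailing newline.
  IFS= read -r value || [ -n "$value" ] || return 1
  if heroku config:set "${key}=${value}" -a "$app" >/dev/null 2>&1; then
    echo "set ${key} on ${app}"
  else
    echo "failed to set ${key} on ${app}" >&2
    return 1
  fi
}
```

Usage would pipe the value in from whatever vault or SSM read command applies, for example: <vault-read-command> | set_config_quiet raxx-api-staging EXAMPLE_KEY. The value is still passed to the heroku CLI as an argument, so it is briefly visible in the local process list; the helper only addresses echoed output and shell history.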
§1 Risk Inventory
Legend: RTO = Recovery Time Objective (max tolerable downtime), RPO = Recovery Point Objective (max tolerable data loss). Both are targets, not current capabilities. Current posture gaps are noted explicitly.
1.1 Heroku — single-region compute
| Dimension | Detail |
|---|---|
| Components | raxx-api-prod, raxx-api-staging, raxx-console-prod, raxx-console-staging, raxx-velvet-prod |
| Impact | P0 — total platform outage for all Raxx customers |
| RTO target | 15 min (dyno restart / rollback); 4 h (redeploy from GitHub) |
| RPO target | 0 data loss — all persistent state is in Postgres, not dyno RAM |
| Current posture | Single Heroku region (us-east). No multi-region standby. GitHub Actions deploys to Heroku on main push (staging) or workflow_dispatch (prod) with human-approval gate (ADR-0020). Heroku maintains rolling release history; rollback is heroku rollback -a <app>. |
| Gap | No automated failover to a second region. Heroku us-east regional incident = full customer-visible outage. No SLA or SLO is defined for v1 pre-team. |
1.2 Heroku Postgres — Raptor + Velvet Standard-0
| Dimension | Detail |
|---|---|
| Components | Raptor: DATABASE_URL (Standard-0 on raxx-api-prod), RAPTOR_APP_DATABASE_URL (restricted raptor_app role). Velvet: Velvet-internal Postgres. |
| Impact | P0 — trading, session, and billing functions fail without DB connectivity |
| RTO target | 15 min (connection restart / credential refresh); 2 h (PITR restore) |
| RPO target | 5 min (Heroku continuous WAL shipping) |
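| Current posture | Heroku manages continuous WAL archiving and daily snapshots for Standard-0. Automated daily backup at 02:00 UTC. Retention: 7 days (Standard-0 default). Logical backups restore via heroku pg:backups:restore; point-in-time recovery within the retention window uses Heroku's rollback flow (heroku addons:create heroku-postgresql:standard-0 --rollback DATABASE_URL --to '<timestamp>'), which provisions a new database rather than restoring in place. Manual restore tested: NOT YET (no verified restore drill on record). |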
| Gap | No verified restore drill. Velvet Postgres backup retention and PITR posture not confirmed as distinct from default. No cross-region Postgres replica. |
1.3 Infisical vault — single Lightsail instance (SPOF — today's SEV-4)
| Dimension | Detail |
|---|---|
| Component | vault.raxx.app — self-hosted Infisical CE on aws_lightsail_instance.infisical (us-east-1a) |
| Impact | P1 — agent sessions cannot read secrets; Velvet rotation fails; any Heroku config:set that sources from vault fails; Terraform applies that source from vault fail |
| RTO target | 30 min (Lightsail instance restart); 2 h (rebuild from snapshot) |
| RPO target | 24 h (daily snapshot); 0 data loss if instance is restored (not rebuilt) |
| Current posture | Single Lightsail micro instance in us-east-1a. No HA. No daily snapshot automation confirmed. Agent sessions cache vault reads in-process; a brief vault outage does not restart Heroku dynos mid-flight. CF Access gate protects the surface. Today's SEV-4 (2026-05-21) confirms this SPOF is real and active. |
| Gap | No automated daily Lightsail snapshot for vault instance. No Infisical admin export scheduled to S3. No secondary vault instance. Vault outage > dyno-restart cycle could leave services unable to re-read secrets on boot. |
1.4 AWS Lightsail — FreeScout ticketing
| Dimension | Detail |
|---|---|
| Component | raxx-tickets Lightsail instance (us-east-1a), FreeScout + MariaDB |
| Impact | P2 — operator loses incident ticketing; customer support queue goes dark; EyeTok cannot create tickets |
| RTO target | 30 min (Tier 1 snapshot restore; static IP re-attach stays DNS-stable) |
| RPO target | 24 h (daily backup at 06:00 UTC); 0 for in-flight conversations if snapshot was taken recently |
| Current posture | Two-tier backup: Tier 1 daily Lightsail snapshot (7 retained); Tier 2 daily mysqldump to s3://raxx-support-attachments/db-backups/freescout/ (30-day retention). GH Actions workflow .github/workflows/freescout-backup.yml runs at 06:00 UTC. Post-modules-installed snapshot raxx-tickets-modules-installed-20260503231628 is the canonical fast-restore target. Restore SOP at docs/ops/runbooks/freescout-backup-restore.md. Verified restore: PENDING — no row in the verified restore record table. |
| Gap | No verified restore drill performed. Paid-module license re-activation depends on the canonical post-modules snapshot being current. |
1.5 GitHub — source of truth for all code + Terraform + runbooks
| Dimension | Detail |
|---|---|
| Component | raxx-app/TradeMasterAPI repo, GitHub Actions CI |
| Impact | P1 — loss of code source prevents rebuilds; loss of CI prevents deploys |
| RTO target | GitHub SLA is 99.9% uptime. Local clones exist on operator laptops. |
| RPO target | 0 — git is append-only; every commit is replicated to every clone |
| Current posture | GitHub provides geo-redundant repository storage. Multiple agent worktree clones exist locally. No explicit off-GitHub archive configured. |
| Gap | No automated off-GitHub mirror (e.g., daily push to a private S3 archive). Low-priority SPOF given GitHub's own redundancy, but worth noting. |
1.6 Cloudflare — DNS + CDN + Pages + Access + Workers
| Dimension | Detail |
|---|---|
| Components | DNS for raxx.app and getraxx.com; CF Pages (Antlers, getraxx, internal-docs); CF Access (vault, console, getraxx pre-launch); CF Workers (status page D1 backend) |
| Impact | P1 — CF outage takes down all frontend surfaces and DNS resolution for the platform |
| RTO target | CF SLA 99.99%+. Operator action limited to TTL tuning or CF API calls; no self-hosted fallback. |
| RPO target | CF Pages deploys are re-deployable from GitHub any time. |
| Current posture | CF Pages projects are rebuild-on-demand from GitHub. DNS records are IaC-managed (Terraform). CF Access policies are Terraform-managed. If Cloudflare goes down, Antlers and getraxx are unreachable, but Raptor on Heroku continues accepting direct-URL traffic. |
| Gap | No fallback DNS zone. Pages cannot be served elsewhere without a rebuild to an alternate host. Acceptable for v1. |
1.7 Terraform state — remote S3 backend
| Dimension | Detail |
|---|---|
| Component | raxx-iac-state-prod S3 bucket (us-east-1), raxx-iac-state-locks DynamoDB table |
| Impact | P1 — loss of Terraform state prevents IaC-managed infrastructure changes; manual Terraform requires import from scratch |
| RTO target | S3 99.999999999% durability. DynamoDB 99.99% availability. |
| RPO target | S3 replication gives near-zero RPO. S3 Versioning on the state bucket: status unknown — not confirmed. |
| Current posture | At least one Terraform root (cf-pages-docs-customer) has an S3 backend (raxx-iac-state-prod, us-east-1, SSE-S3 encrypt=true, DynamoDB locks). Other roots (freescout, waf, cf-access, etc.) use the same bucket pattern based on convention, but individual backend.tf confirmation for all roots was not fully verified. |
| Gap | S3 Versioning for the state bucket is not confirmed enabled. No S3 cross-region replication confirmed. Loss of state bucket means full terraform import from scratch for all managed resources. |
1.8 AWS SSM Parameter Store
| Dimension | Detail |
|---|---|
| Component | SSM parameters for workload secrets (/raxx/freescout/*, Velvet passwords, rotated credentials) per feedback_aws_workloads_use_ssm_not_vault |
| Impact | P1 during recovery — SSM reads are required by GH Actions backup workflow and by Velvet rotation scripts |
| RTO target | SSM 99.99% availability (AWS SLA). Regional. |
| RPO target | SSM is not a database; parameter values are durable. No data loss scenario absent deliberate deletion. |
| Current posture | AWS us-east-1 only. No multi-region replication of SSM parameters. |
| Gap | SSM parameter values are not independently backed up. If an SSM parameter is accidentally deleted, values must be recovered from Infisical vault (for vendor API tokens) or re-generated via rotation. Document SSM paths in the operator break-glass doc. |
1.9 Operator (single human owner — pre-team)
| Dimension | Detail |
|---|---|
| Impact | P0 over 48h — prod-approval gate (ADR-0020) requires human reviewer; no secondary reviewer exists pre-team |
| RTO target | N/A for planned absences; escalation path for emergencies: see §6 |
| Current posture | Single operator. No documented deputy. Break-glass AWS root credentials in private Google Drive per project_aws_iam_state. No secondary GitHub Admin configured. |
| Gap | No named deputy. No secondary GitHub org admin. No Heroku team member with prod-deploy approval rights. This is the most significant non-technical SPOF and is currently accepted as a pre-team constraint. |
1.10 External service SPOFs
| Service | Impact | Current posture | Gap |
|---|---|---|---|
| Stripe | P1 — billing fails | Stripe's own redundancy. Test-mode pre-launch. No fallback. | Acceptable for v1; Stripe SLA is strong |
| Postmark | P2 — transactional email queues | Postmark queues hold for 48h on outages | No secondary sender configured |
| Alpaca (paper trading, pre-live) | P2 — paper trading unavailable | Flag-gated. Live falls back to paper flag. | No fallback broker pre-BYOB |
| Sentry | P2 — error capture dark | Sentry SDK buffers locally; events replayed on reconnect | Acceptable for v1 |
| Oracle Dyn (moosequest.net) | P3 — home-network DNS | Operator personal dependency; Raxx platform not affected | Per feedback_dyndns_stays — do not migrate |
§2 Backup Posture — Current State
2.1 Heroku Postgres (Raptor + Velvet)
What is backed up: Full Postgres database via continuous WAL archiving + daily snapshot.
Schedule and retention:
| Tier | Method | Schedule | Retention | Storage |
|---|---|---|---|---|
| Continuous WAL | Heroku-managed streaming replication | Continuous | 7 days (Standard-0) | Heroku-managed S3 (us-east-1) |
| Daily snapshot | Heroku pg:backups automated | Daily (Heroku-scheduled, typically ~02:00 UTC) | 7 days (Standard-0) | Heroku-managed S3 |
| Manual snapshot | heroku pg:backups:capture | On-demand | Must be downloaded to persist beyond Heroku retention |
Command surface:
# List available backups
heroku pg:backups -a raxx-api-prod
# Capture a manual snapshot before a risky migration
heroku pg:backups:capture -a raxx-api-prod
# Download a backup locally
heroku pg:backups:download -a raxx-api-prod --output /tmp/raptor-prod-backup.dump
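# Restore a logical backup (takes a backup ID or a dump URL, not a timestamp)
heroku pg:backups:restore '<backup-id>' DATABASE_URL -a raxx-api-prod
# Point-in-time recovery uses Heroku's rollback flow, which provisions a NEW database
# at the chosen timestamp instead of overwriting the existing one
heroku addons:create heroku-postgresql:standard-0 --rollback DATABASE_URL --to '<timestamp>' -a raxx-api-prod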
# Inspect PITR capabilities
heroku pg:info -a raxx-api-prod
Restore SOP:
1. Capture a manual backup of the current state first: heroku pg:backups:capture -a raxx-api-prod
2. Put the app in maintenance mode: heroku maintenance:on -a raxx-api-prod
3. Restore: heroku pg:backups:restore <target> DATABASE_URL -a raxx-api-prod
4. After restore, Alembic migrations may need to be re-applied if the target is before a schema change: heroku run "python -m alembic upgrade head" -a raxx-api-prod
5. Verify: heroku run "python -m alembic current" -a raxx-api-prod
6. Re-provision raptor_app role if role was lost in the restore: see docs/ops/runbooks/raptor-db-credentials.md §4 Scenario B
7. Turn maintenance mode off: heroku maintenance:off -a raxx-api-prod
Estimated restore time: 15–45 min depending on DB size.
Tested? No verified restore drill on record. Required before launch or first post-launch week.
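For the restore drill, the SOP steps above could be chained so that a failed step stops the sequence and leaves maintenance mode on. This is a minimal sketch under stated assumptions: restore_logical_backup is a hypothetical function name, the heroku CLI is assumed to be authenticated, and the target should be a scratch app rather than prod.

```shell
# Sketch of the restore SOP as one function. Assumes the heroku CLI is
# authenticated; run it against a scratch app first, not prod.
restore_logical_backup() {
  local app="$1" backup_id="$2"
  # Step 1: safety backup of the current state.
  heroku pg:backups:capture -a "$app" || return 1
  # Step 2: stop traffic before touching the database.
  heroku maintenance:on -a "$app" || return 1
  # Step 3: restore; --confirm skips the interactive app-name prompt.
  heroku pg:backups:restore "$backup_id" DATABASE_URL -a "$app" --confirm "$app" || return 1
  # Steps 4-5: bring the schema to head, then print the current revision.
  heroku run "python -m alembic upgrade head" -a "$app" || return 1
  heroku run "python -m alembic current" -a "$app" || return 1
  # Step 6 (raptor_app role re-provisioning) stays manual; see the runbook.
  # Step 7: reached only if each earlier heroku command exited zero.
  heroku maintenance:off -a "$app"
}
```

A drill invocation would look like restore_logical_backup <scratch-app> <backup-id>. Stopping before maintenance:off on failure is a deliberate choice here: a half-restored database should not start taking traffic.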
2.2 Infisical Vault (vault.raxx.app)
What is backed up: All vendor API tokens, service credentials, Infisical machine identity secrets. This is the source of truth for secrets that must survive a full infrastructure rebuild.
Current posture: CRITICAL GAP. No automated backup is configured. The Lightsail instance disk holds the Postgres database backing Infisical CE. Lightsail automatic snapshots are NOT confirmed enabled for the vault instance. A vault instance loss (hardware failure, accidental termination) would require manual re-entry of all secrets.
Infisical admin export (manual, one-time procedure):
# Log into Infisical UI at vault.raxx.app
# Navigate to: Project Settings → Export → Download JSON
# Store export at: <Google Drive>/Raxx/Break-glass/infisical-export-YYYY-MM-DD.json
# Encrypt before storing: GPG encrypt to operator's key
gpg --recipient kris@moosequest.net --encrypt infisical-export.json
Required action (§4 v1 win): Schedule a daily Lightsail snapshot for the vault instance and a scheduled Infisical export to S3.
Restore SOP (if vault instance is lost):
1. Provision new Lightsail instance from most recent snapshot (per FreeScout rebuild pattern)
2. Re-attach static IP for vault.raxx.app
3. Verify Infisical CE is running: https://vault.raxx.app
4. If snapshot unavailable: provision fresh Infisical CE, restore from admin export JSON via Infisical CLI
5. Rotate all machine identity tokens and re-seed Heroku config vars
2.3 FreeScout (AWS Lightsail)
Status: BACKED UP. See docs/ops/runbooks/freescout-backup-restore.md for full SOP.
| Tier | Type | Schedule | Retention | Location |
|---|---|---|---|---|
| Tier 1 | Full Lightsail instance snapshot | Daily 06:00 UTC (GH Actions) | 7 snapshots | Lightsail |
| Tier 2 | MySQL logical dump (gzip) | Daily 06:00 UTC (GH Actions) | 30 days | s3://raxx-support-attachments/db-backups/freescout/ |
Estimated restore time: Tier 1 (full instance): ~30 min (snapshot restore + static IP re-attach). Tier 2 (data-only): ~15 min.
Tested? Verified restore record in freescout-backup-restore.md shows PENDING. A verified restore drill is required.
Verify latest backup ran:
TODAY=$(date -u '+%Y-%m-%d')
aws s3api head-object \
--bucket raxx-support-attachments \
--key "db-backups/freescout/${TODAY}.sql.gz" \
--region us-east-1
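Before the 06:00 UTC workflow has finished, today's object does not exist yet, so the check above can fail even when backups are healthy. A minimal sketch that falls back to yesterday's key follows; freescout_backup_exists is a hypothetical name, and the date arithmetic assumes GNU date.

```shell
# Sketch: print the newest expected FreeScout dump key that exists,
# trying today's date first and then yesterday's (UTC).
# Requires GNU date for the -d 'yesterday' form.
freescout_backup_exists() {
  local day key
  for day in "$(date -u '+%Y-%m-%d')" "$(date -u -d 'yesterday' '+%Y-%m-%d')"; do
    key="db-backups/freescout/${day}.sql.gz"
    if aws s3api head-object \
        --bucket raxx-support-attachments \
        --key "$key" \
        --region us-east-1 >/dev/null 2>&1; then
      echo "$key"
      return 0
    fi
  done
  return 1
}
```

The function prints the key it found and returns non-zero when neither object exists, so it can be used directly in an if statement or a CI step.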
2.4 GitHub Repository
Status: Redundant. GitHub provides geo-redundant storage. Git's distributed model means every git clone is a full backup. Multiple clones exist on the operator's workstation and in CI runner ephemeral environments.
Additional exposure: GitHub Actions workflow artifacts (build artifacts, coverage reports) are ephemeral and not backed up — not a recovery concern since they are regenerated from source.
Off-GitHub archive: Not configured. Low priority given GitHub's reliability, but worth automating post-launch.
2.5 Terraform State
Location: S3 bucket raxx-iac-state-prod (us-east-1), DynamoDB table raxx-iac-state-locks.
S3 durability: 99.999999999% (eleven nines). S3 Versioning status: NOT CONFIRMED. Without versioning, an accidental deletion or overwrite of the state file is unrecoverable.
Current backup posture: Relies entirely on S3 durability. No additional snapshot or off-bucket copy.
Required action (§4 v1 win): Enable S3 Versioning on raxx-iac-state-prod + enable MFA Delete to prevent accidental state deletion.
# Enable versioning (one-time)
aws s3api put-bucket-versioning \
--bucket raxx-iac-state-prod \
--versioning-configuration Status=Enabled \
--region us-east-1
# Verify
aws s3api get-bucket-versioning --bucket raxx-iac-state-prod --region us-east-1
Restore SOP (if state is corrupted or deleted):
1. Retrieve prior version from S3 (if versioning enabled): aws s3api list-object-versions --bucket raxx-iac-state-prod
2. If state is unrecoverable: run terraform import for each resource using the docs/ops/runbooks/terraform-cf-access-state-imports.md pattern
3. Timeline for full state reconstruction from import: estimated 4–8 h for all roots
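Step 1 of the SOP above only lists versions; the prior version still has to be downloaded by its version ID. The sketch below shows one way to do that once versioning is enabled. fetch_previous_state is a hypothetical name and the object key is a placeholder, since each Terraform root has its own state key. It also assumes no other object shares the key as a prefix.

```shell
# Sketch: download the newest non-current version of one state object.
# The object key is a placeholder; each Terraform root has its own key.
fetch_previous_state() {
  local bucket="$1" key="$2" out="$3" version_id
  version_id=$(aws s3api list-object-versions \
    --bucket "$bucket" --prefix "$key" --region us-east-1 \
    --query 'Versions[?IsLatest==`false`] | [0].VersionId' \
    --output text) || return 1
  # "None" is what --output text prints when the query matches nothing.
  if [ -z "$version_id" ] || [ "$version_id" = "None" ]; then
    echo "no prior version found for ${key}" >&2
    return 1
  fi
  aws s3api get-object --bucket "$bucket" --key "$key" \
    --version-id "$version_id" --region us-east-1 "$out" >/dev/null || return 1
  echo "$version_id"
}
```

The downloaded file should be inspected before it is copied back over the live state object; the function deliberately does not overwrite anything in the bucket.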
2.6 AWS SSM Parameter Store
Current posture: No backup. SSM parameter values are managed via Terraform (IaC) for infrastructure secrets, and via Velvet/manual rotation for workload secrets. Loss of an SSM parameter requires re-retrieval from the corresponding Infisical vault path or from the Velvet rotation log.
SSM paths to document in break-glass: /raxx/freescout/*, /raxx/cf-access/*, and any Velvet-managed rotation paths.
§3 Golden Image — Full-Rebuild Recovery Path
This section answers: "If the Heroku account is compromised, vault.raxx.app goes down, and the operator's laptop is lost — how does Raxx get back up?"
3.1 Prerequisites (must exist before you need this)
| Asset | Where it lives | Who controls it |
|---|---|---|
| GitHub repo (raxx-app/TradeMasterAPI) | GitHub (survives Heroku + Lightsail loss) | GitHub account (linked to kris@moosequest.net Google account) |
| AWS root account credentials | Private Google Drive (/Raxx/Break-glass/aws-root-credentials.md) per project_aws_iam_state | Google account with hardware MFA |
| Heroku account credentials | Passkey (WebAuthn) — no password | kris@moosequest.net email for recovery |
| Cloudflare account credentials | Passkey / 2FA | kris@moosequest.net email for recovery |
| Infisical admin export (last snapshot) | Private Google Drive (/Raxx/Break-glass/infisical-export-YYYY-MM-DD.json.gpg) | Google account |
| Stripe account credentials | Passkey / email MFA | kris@moosequest.net |
| GPG private key (for encrypted Drive exports) | Operator's hardware key or backup passphrase in Drive | Operator only |
The single non-automatable dependency: the operator's Google account (hardware MFA protected) is the root of all credential recovery paths. If that account is compromised or inaccessible, no programmatic recovery path exists. This is the operator-only break-glass scenario (§6).
3.2 Rebuild SOP (green-field recovery sequence)
Estimated total RTO: 6–12 hours for full service restoration. Data RPO depends on most recent backup (up to 24 h data loss at current posture; up to 5 min with Heroku PITR within retention window).
Phase 1 — Reestablish AWS access (0–30 min)
- Log into AWS console using root credentials from Google Drive break-glass doc.
- Verify claude-infisical-bootstrap IAM user exists. If lost: create new IAM user with Lightsail + SSM + S3 permissions per the bootstrap policy.
- Retrieve or rotate access keys.
Phase 2 — Restore vault.raxx.app (30 min–2 h)
- Check if Lightsail instance exists: aws lightsail get-instance --instance-name <vault-instance-name> --region us-east-1
- If instance exists but unresponsive: restart. If missing: restore from most recent Lightsail snapshot.
- If no snapshot: provision fresh Lightsail instance, install Infisical CE from the Terraform root.
- If fresh install: restore secrets from the Infisical admin export JSON on Google Drive.
- Re-attach static IP. Verify https://vault.raxx.app resolves.
- Re-provision machine identity tokens for CI + Velvet + agents.
Phase 3 — Restore SSM parameters (2–3 h)
- Re-seed SSM /raxx/freescout/* and other workload secrets from Infisical vault (now restored).
Phase 4 — Restore Heroku applications (3–5 h)
- If Heroku account is compromised and requires a full new account:
  - Create new Heroku account.
  - Create apps: raxx-api-prod, raxx-console-prod, raxx-velvet-prod (per project_heroku_app_names).
  - Provision Standard-0 Postgres add-on for Raptor and Velvet.
  - Restore Postgres from most recent Heroku backup (download .dump, restore to new DB): heroku pg:backups:restore
  - Re-provision raptor_app role (per docs/ops/runbooks/raptor-postgres-roles.md).
- Seed all Heroku config vars from Infisical vault (restored in Phase 2). Use heroku config:set ... >/dev/null 2>&1 for all secrets.
- Re-deploy code: trigger GitHub Actions deploy-heroku.yml workflow_dispatch targeting main for each app.
- Restore FLAG_RAPTOR_APP_ROLE_SEPARATION=1 and other production flags.
- Do not re-enable live trading flags until paper gate has been re-satisfied.
Phase 5 — Restore Cloudflare configuration (4–5 h)
- If CF account is recoverable: Terraform state (from S3) contains all CF resources. terraform apply from each root restores DNS, Pages, Access, and WAF config.
- If CF account is lost: re-register new account; rebuild DNS manually from known record inventory; re-deploy Pages from GitHub; re-establish CF Access.
- Antlers / getraxx re-deploy from GitHub via deploy-antlers.yml workflow dispatch.
Phase 6 — Restore FreeScout ticketing (5–6 h)
- Restore from Tier 1 Lightsail snapshot (see docs/ops/runbooks/freescout-backup-restore.md §Tier 1 restore).
- Re-attach static IP. Verify https://tickets.raxx.app resolves.
- Update SSM /raxx/freescout/ssh_key if new key pair is generated.
Phase 7 — Verify and soft-open (6–12 h)
- Run smoke suite: scripts/ci/run_smoke.sh --env=prod
- Verify CF Access gates are up for pre-launch surfaces.
- Verify Sentry is receiving errors.
- Verify Postmark can send test email.
- Open EyeTok on-call agent if deployed.
- Notify customers via status.raxx.app if there was any customer-visible downtime.
3.3 Estimated RTO/RPO by current posture (honest assessment)
| Recovery scenario | Current RTO | Current RPO | Notes |
|---|---|---|---|
| Heroku dyno crash (single app) | 5–10 min | 0 | Heroku auto-restarts; Postgres is unaffected |
| Heroku dyno rollback (bad deploy) | 10–15 min | 0 | heroku rollback |
| Heroku Postgres PITR restore | 1–3 h | 5 min (within 7-day window) | Requires maintenance window |
| Vault Lightsail instance loss | 1–4 h | 24 h (unconfirmed — no snapshot automation) | Today's SEV-4 root cause |
| FreeScout Lightsail loss | 30–60 min | 24 h (daily backup) | Tier 1 restore procedure exists and is documented |
| Full Heroku account compromise | 6–12 h | Up to 24 h (Postgres backup age) | Requires DB download + restore to new account |
| Terraform state loss (no S3 versioning) | 4–8 h | N/A — IaC must be reconstructed via import | Current gap: S3 versioning unconfirmed |
| Operator unreachable for 48 h | Blocking for prod deploys | N/A | ADR-0020 prod-approval gate has no deputy |
§4 Low-Effort v1 Wins (ship pre-launch or first post-launch week)
Pre-launch = T-0 to T+7 days (target by 2026-05-30 UTC)
Post-launch week = T+7 to T+30 days
Each item is scoped to operator or agent execution; no new feature-dev needed unless noted.
Win 1 — Enable daily Lightsail snapshots for vault instance (XS)
Size: XS — one AWS CLI command or one Terraform addition.
Why: Today's SEV-4 (vault Lightsail timeouts) confirmed this SPOF is active. Without a snapshot, vault instance loss = manual re-entry of all secrets.
Action:
# Enable Lightsail auto-snapshot at 07:00 UTC (after the 06:00 UTC FreeScout backup).
# snapshotTimeOfDay must be on the hour (HH:00), so 06:30 is not accepted.
aws lightsail enable-add-on \
--resource-name <vault-instance-name> \
--region us-east-1 \
--add-on-request 'addOnType=AutoSnapshot,autoSnapshotAddOnRequest={snapshotTimeOfDay=07:00}'
Acceptance criteria: Within 24 h of enabling, aws lightsail get-auto-snapshots --resource-name <vault-instance-name> --region us-east-1 shows at least one snapshot dated within the last 25 hours.
Ship window: Pre-launch (this week).
Issue: Filed as bcp-win-1 — see GH issue created from this card.
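The acceptance check for this win could be scripted roughly as follows. This is a sketch, not the filed implementation: vault_snapshot_recent is a hypothetical name, the today-or-yesterday test is a date-level approximation of the 25-hour criterion, and GNU date is assumed.

```shell
# Sketch: succeed when a successful auto snapshot is dated today or
# yesterday (UTC). Requires GNU date for the "yesterday" arithmetic.
vault_snapshot_recent() {
  local instance="$1" dates today yesterday
  dates=$(aws lightsail get-auto-snapshots --resource-name "$instance" \
    --region us-east-1 \
    --query 'autoSnapshots[?status==`Success`].date' --output text) || return 1
  today=$(date -u '+%Y-%m-%d')
  yesterday=$(date -u -d 'yesterday' '+%Y-%m-%d')
  # Normalise tab/newline separated dates to a space-delimited list.
  case " $(echo "$dates" | tr '\t\n' '  ') " in
    *" $today "*|*" $yesterday "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Because the comparison is by calendar date, a snapshot taken early yesterday still passes; a stricter check would compare the createdAt timestamp instead.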
Win 2 — Enable S3 Versioning on Terraform state bucket (XS)
Size: XS — two AWS CLI commands.
Why: Without versioning, a state object damaged by an accidental terraform state rm, a bad write, or a deletion in S3 is unrecoverable. Versioning adds a safety net at no meaningful cost.
Action:
aws s3api put-bucket-versioning \
--bucket raxx-iac-state-prod \
--versioning-configuration Status=Enabled \
--region us-east-1
# Optional but recommended: MFA delete to prevent accidental state deletion
# (requires root credentials — do separately)
Acceptance criteria: aws s3api get-bucket-versioning --bucket raxx-iac-state-prod --region us-east-1 returns {"Status": "Enabled"}.
Ship window: Pre-launch.
Issue: Filed as bcp-win-2.
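The acceptance criterion can be turned into a pass/fail check for CI. A minimal sketch, with versioning_enabled as a hypothetical function name:

```shell
# Sketch: exit 0 only when bucket versioning reports "Enabled".
versioning_enabled() {
  local bucket="$1" status
  status=$(aws s3api get-bucket-versioning --bucket "$bucket" \
    --region us-east-1 --query 'Status' --output text) || return 1
  # A bucket that never had versioning configured prints "None";
  # a bucket with versioning paused prints "Suspended".
  [ "$status" = "Enabled" ]
}
```

Example: versioning_enabled raxx-iac-state-prod && echo "state bucket versioning OK".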
Win 3 — Schedule daily Infisical admin export to S3 (S)
Size: S — a GH Actions workflow (similar pattern to freescout-backup.yml).
Why: Vault is the crown jewel of the secret posture. No automated export means a vault rebuild requires full manual re-entry. Today's SEV-4 makes this urgent.
Design: GH Actions cron at 07:00 UTC. Uses Infisical export API, GPG-encrypts with operator public key, uploads to s3://raxx-iac-state-prod/vault-exports/YYYY-MM-DD.json.gpg. Failure alerts via Slack DM D0AJ7K184TV.
Acceptance criteria: S3 object exists at vault-exports/YYYY-MM-DD.json.gpg with size > 1 KB within 25 h of first workflow run; failed run posts Slack DM alert.
Ship window: Pre-launch or first post-launch week.
Issue: Filed as bcp-win-3.
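The encrypt-and-upload half of this design can be sketched independently of how the export is produced. The function below is an assumption for illustration: encrypt_and_upload is a hypothetical name, the export JSON is expected on stdin, and the command that produces it is deliberately left out because the Infisical export call is not specified here. It assumes bash (for pipefail) and that the operator public key is already imported on the runner.

```shell
# Sketch: encrypt an export read from stdin and stream it to S3.
# Nothing is written to local disk. The recipient key must already be
# imported into the runner's GPG keyring.
encrypt_and_upload() (
  # Subshell body, so pipefail does not leak into the caller.
  set -o pipefail
  recipient="$1"
  bucket="$2"
  key="vault-exports/$(date -u '+%Y-%m-%d').json.gpg"
  gpg --batch --trust-model always --encrypt --recipient "$recipient" --output - \
    | aws s3 cp - "s3://${bucket}/${key}" --region us-east-1 >/dev/null || exit 1
  echo "$key"
)
```

Usage: <export-command> | encrypt_and_upload <operator-key-id> raxx-iac-state-prod. Streaming through a pipe keeps the plaintext export off the runner's disk, which fits the no-stored-credentials invariant.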
Win 4 — Document and verify a Heroku Postgres restore drill (XS)
Size: XS (operator-executed, no code) — run heroku pg:backups:restore against a scratch Heroku app (not prod).
Why: No restore drill is on record for Raptor Postgres. "Backup exists" and "backup restores" are different claims. The FreeScout restore record also shows PENDING — complete both in the same session.
Action: Operator creates a throwaway Heroku app, provisions a Postgres add-on, downloads the most recent prod backup, restores it, verifies Alembic current revision matches. Documents the result in docs/ops/runbooks/raptor-db-credentials.md §2 restore record.
Acceptance criteria: A documented restore result row (date, backup used, duration, outcome) exists in the runbook. Verified restore for FreeScout backup is also completed in freescout-backup-restore.md.
Ship window: First post-launch week.
Issue: Filed as bcp-win-4.
Win 5 — Add bcp-smoke GH Actions workflow (monthly verify) (S)
Size: S — one new GH Actions workflow file.
Why: The BCP's usefulness decays if the restore procedures are never exercised. A monthly automated smoke confirms: FreeScout S3 backup exists, vault snapshot exists, TF state bucket versioning is enabled. Does not run a full restore — just validates the existence and integrity of backup artifacts.
Schedule: 0 8 1 * * (08:00 UTC on the 1st of each month).
Checks:
- FreeScout: aws s3api head-object for yesterday's dump
- Vault: aws lightsail get-auto-snapshots — most recent < 25 h
- TF state: aws s3api get-bucket-versioning returns Enabled
- Heroku: heroku pg:backups -a raxx-api-prod lists a backup < 48 h
- Failure: post alert to ops@raxx.app + Slack DM D0AJ7K184TV
Acceptance criteria: Workflow exists and passes on its first manual workflow_dispatch run. The monthly cron first fires on 2026-06-01.
Ship window: First post-launch week.
Issue: Filed as bcp-win-5.
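Two pieces of the smoke workflow are generic enough to sketch ahead of the real implementation: a freshness test for the "< 25 h" and "< 48 h" checks, and a runner that executes every check even when an earlier one fails. Both function names are hypothetical, and GNU date is assumed (as on Ubuntu-based GitHub Actions runners).

```shell
# Sketch: succeed when an ISO-8601 UTC timestamp is newer than max_hours.
# Requires GNU date (-d). Returns 2 when the timestamp cannot be parsed.
is_fresh() {
  local stamp="$1" max_hours="$2" then_s now_s
  then_s=$(date -u -d "$stamp" '+%s') || return 2
  now_s=$(date -u '+%s')
  [ $(( now_s - then_s )) -le $(( max_hours * 3600 )) ]
}

# Run each named check function, keep going on failure, and return
# non-zero if any check failed so the workflow step is marked red.
run_checks() {
  local failed=0 name
  for name in "$@"; do
    if "$name"; then
      echo "PASS ${name}"
    else
      echo "FAIL ${name}"
      failed=1
    fi
  done
  return "$failed"
}
```

Running all checks before failing means a single alert can list every missing artifact, rather than surfacing them one month at a time.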
§5 Comprehensive v2 Future Investments (NOT IMPLEMENTED — roadmap)
The following are documented for future investment. None are in scope for v1 launch.
5.1 Multi-region Heroku (or migrate to Fly.io / Railway with multi-region support)
Heroku does not natively support multi-region failover for a single app. Options: (a) Heroku Private Spaces with geo-routing, (b) migrate Raptor to a provider with native multi-region (Fly.io), (c) add a secondary Heroku app in eu-west with read-replica Postgres and CF Load Balancer routing.
Trigger: First customer in a region where us-east latency is unacceptable, or first Heroku us-east regional incident causing > 30 min customer outage.
5.2 Vault HA — multi-AZ Infisical
Replace the single Lightsail micro instance with an HA Infisical CE deployment: two instances across us-east-1a and us-east-1b behind an AWS ALB, Postgres on RDS Multi-AZ.
Trigger: Post-launch if vault-related SEV-3 or higher incidents recur; or when team size > 1 operator.
5.3 AWS Backup consolidation
Enable AWS Backup across all AWS resources (Lightsail instances, SSM parameters, S3 buckets) with a unified backup vault, cross-region replication to us-west-2, and 30-day retention. Replace the ad-hoc GH Actions snapshot scripts.
Trigger: SOC2 Type II preparation or first enterprise customer.
5.4 RDS read replica for Raptor Postgres
Heroku Standard-0 does not expose RDS read-replica controls. Migrating to Heroku Premium-0 or to a self-managed RDS instance would enable a read replica in a second AZ, reducing Postgres RPO to seconds.
Trigger: Customer-facing analytics queries cause replication lag, or revenue per day > cost of RDS Premium tier.
5.5 Off-GitHub repo mirror
Daily git push to a private S3 archive or a GitLab CE instance for cold-storage backup of all commits and LFS objects.
Trigger: GitHub has a sustained multi-day incident (extremely low probability) or a security concern about repo access.
5.6 PagerDuty integration for EyeTok
As documented in docs/architecture/eyetok-oncall-agent-2026-05-15.md §6.4, PagerDuty Solo is deferred to v1.1. Add after a real after-hours SEV1 is missed via Slack push.
5.7 Automated GDPR breach notification pipeline
Currently the GDPR breach 72-hour notification is a manual operator action. A v2 enhancement would automate the initial DPA notification workflow from an EyeTok-triggered breach classification, with templated notification email to relevant DPAs prepared within 24 h of breach detection.
5.8 Automated retention enforcement job (#1631)
Per docs/ops/policies/data-retention.md §3, the automated retention job is deferred to 2026-Q3. This is acceptable given first data hits the 7-year mark in 2033. No BCP action needed before 2031-03-01.
§6 Operator-Only Break-Glass Paths
This section addresses the scenario: Kristerpher is unreachable for 48 hours or longer.
6.1 Current state (honest assessment)
There is no designated deputy. As a solo pre-team operator, all privileged access is gated on a single person. This is the accepted pre-team posture. The following documents what breaks and what doesn't:
| Function | Breaks without operator? | Notes |
|---|---|---|
| Prod deploys (new code) | Yes — ADR-0020 requires human reviewer approval | No code ships to prod without operator |
| Staging deploys (CI) | No: auto-deploys on main push without approval | Staging continues working |
| Heroku config:set (secrets rotation) | Yes — requires Heroku account access | Velvet rotation stalls if operator unreachable |
| FreeScout ticket queue | No — EyeTok creates tickets; queue is readable | Support backlog accumulates; no human responses |
| Customer-facing platform | No — existing prod deployment keeps running | No new features; existing features work |
| Paper trading | No — paper gate is code-enforced | Paper strategies continue running |
| Live trading | No — live trading flags off pre-launch | Not a concern for v1 |
| AWS infra changes | Yes — requires IAM credentials | No new infrastructure |
| GDPR DSR responses | Yes — manual process | 30-day clock keeps ticking |
6.2 Break-glass credential inventory
| Credential | Location | Access method |
|---|---|---|
| AWS root credentials | Private Google Drive /Raxx/Break-glass/aws-root-credentials.md | Google account (hardware MFA on kris@moosequest.net) |
| Heroku account | Passkey-bound to kris@moosequest.net email | Email account recovery |
| Cloudflare account | Passkey / backup codes | Email account recovery |
| GitHub account (kris@moosequest.net) | Passkey | Email account recovery |
| Infisical admin export | Private Google Drive /Raxx/Break-glass/infisical-export-YYYY-MM-DD.json.gpg | GPG decrypt with operator's key |
| Stripe account | Passkey | Email account recovery |
| Google Workspace account (kris@moosequest.net) | Hardware MFA (YubiKey) | Recovery code (stored with break-glass) |
Critical chain: All recovery paths ultimately flow through the kris@moosequest.net Google Workspace account. If that account is inaccessible, break-glass recovery requires Google account recovery codes, which must be stored off-device in a physical location (fireproof safe or safety deposit box).
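
The Infisical admin export row in the table above uses a dated filename, so a recovery step has to pick the most recent export before decrypting it. A minimal sketch of that selection, assuming the filenames have already been listed from the break-glass folder (the listing step and the function name `latest_export` are hypothetical, not part of this plan):

```python
import re
from datetime import date
from typing import Optional

# Filename pattern taken from the break-glass inventory table.
EXPORT_RE = re.compile(r"^infisical-export-(\d{4})-(\d{2})-(\d{2})\.json\.gpg$")


def latest_export(filenames: list[str]) -> Optional[tuple[str, date]]:
    """Return (filename, export_date) for the newest dated export, or None."""
    dated = []
    for name in filenames:
        match = EXPORT_RE.match(name)
        if match is None:
            continue  # not an export file
        try:
            export_date = date(*(int(part) for part in match.groups()))
        except ValueError:
            continue  # digits matched but do not form a real calendar date
        dated.append((export_date, name))
    if not dated:
        return None
    export_date, name = max(dated)
    return name, export_date
```

Comparing the returned date against the date of the last secrets rotation gives a quick sense of how stale the export is; the plan does not define a staleness threshold, so none is enforced here.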
6.3 Pre-team deputy recommendation (deferred)
Not implemented — documented for future action:
When the first team member is added, the following should be provisioned within their first week:
1. Add as GitHub Organization Admin on raxx-app org
2. Add as reviewer on the production GitHub Environment
3. Add to Heroku team with member role on prod apps
4. Provision Cloudflare sub-account with DNS read + emergency write
5. Document in a docs/ops/emergency-contacts.md file with contact details
6.4 Extended operator absence SOP (current posture)
If Kristerpher will be unreachable for more than 24 hours:
1. Verify no pre-launch-blocker issues are open requiring operator action
2. Verify Sentry and EyeTok (when live) are alerting to Slack
3. Set Slack #raxx-ops-alert-sev1 mobile push to maximum priority
4. Verify ops@raxx.app is routing incident notifications
5. Platform continues to operate autonomously for existing customers; no new prod deployments until return
§7 Communication Plan During an Incident
7.1 Internal routing (operator + EyeTok agent)
| Severity | Channel | Who posts | Response SLA |
|---|---|---|---|
| SEV1 (critical) | #raxx-ops-alert-sev1 (C0B423M38H4) | EyeTok, then operator | Immediate (Slack mobile push) |
| SEV2 (high, after-hours) | #raxx-ops-alert-sev2 (C0B445RU95Y) | EyeTok | Mobile push + call-out |
| SEV2.5 (high, business hours 12:00–22:30 UTC weekdays) | #raxx-ops-alert-sev2-5 (C0B4611UC2V) | EyeTok | Normal notification |
| SEV3 (medium/low) | #raxx-ops-alert-sev3 (C0B4615LQ49) | EyeTok autonomous | Agent audit log |
| Ops alerts (non-EyeTok) | ops@raxx.app / FreeScout | Automated inbound | Operator reviews at next session |
For the EyeTok on-call agent design, see docs/architecture/eyetok-oncall-agent-2026-05-15.md.
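
The routing table in 7.1 splits high-priority incidents by time of day, which is easy to get wrong at the boundaries. The sketch below shows one way the rule could be expressed. The priority labels (`critical`, `high`, `medium`, `low`), the function names, and the choice to treat 22:30 UTC as an exclusive end are assumptions for illustration; the actual EyeTok classification logic is described in the design document referenced above, not here.

```python
from datetime import datetime, time, timezone

# Slack channel IDs copied from the 7.1 routing table.
CHANNELS = {
    "SEV1": "C0B423M38H4",
    "SEV2": "C0B445RU95Y",
    "SEV2.5": "C0B4611UC2V",
    "SEV3": "C0B4615LQ49",
}

BUSINESS_START = time(12, 0)   # 12:00 UTC
BUSINESS_END = time(22, 30)    # 22:30 UTC, treated as exclusive (assumption)


def is_business_hours(now: datetime) -> bool:
    """True on Monday to Friday between 12:00 and 22:30 UTC."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    utc_now = now.astimezone(timezone.utc)
    return utc_now.weekday() < 5 and BUSINESS_START <= utc_now.time() < BUSINESS_END


def route(priority: str, now: datetime) -> tuple[str, str]:
    """Map a hypothetical priority label to (severity, Slack channel ID)."""
    if priority == "critical":
        severity = "SEV1"
    elif priority == "high":
        severity = "SEV2.5" if is_business_hours(now) else "SEV2"
    elif priority in ("medium", "low"):
        severity = "SEV3"
    else:
        raise ValueError(f"unknown priority: {priority!r}")
    return severity, CHANNELS[severity]
```

Converting to UTC before checking the weekday matters: an incident late on a Friday evening in a western timezone is already Saturday in UTC and should route to SEV2.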
7.2 Customer-facing communication
| Channel | Current state | Notes |
|---|---|---|
| status.raxx.app | Cloudflare Worker + D1 (per ADR-0028, ADR-0030) | Public status page; FreeScout webhook receiver drives state changes |
| support@raxx.app | Google MX routing → FreeScout via Postmark bridge | Customer inbound support queue |
| Transactional email (Postmark) | Active and approved out of sandbox (2026-05-09) | Used for account notifications, not incident alerts |
Incident communication SOP:
1. EyeTok opens a FreeScout ticket at ops@raxx.app
2. If customer-visible impact: operator posts a public incident note to status.raxx.app via the Console status-management UI
3. Public note must follow the invariant: retrospective only, no forward-looking estimates unless a firm maintenance window end time is known
4. Customer inbound tickets at support@raxx.app receive a canned initial response: "We're aware of an issue affecting [service]. We'll provide an update at [time] UTC or when resolved."
5. On resolution: update status.raxx.app with incident resolved timestamp; post resolution note to customer ticket
7.3 GDPR breach notification trigger
If an incident involves suspected unauthorized access to customer PII:
1. EyeTok must classify the incident as a potential data breach at triage
2. Operator must make a breach determination within 4 hours of detection
3. If a breach is confirmed: DPA notification must be filed within 72 hours of discovery (GDPR Art. 33)
4. Affected customers must be notified without undue delay if high risk to rights/freedoms (GDPR Art. 34)
5. All breach notification actions are logged in the FreeScout ticket with the gdpr-breach tag
6. Current status: no automated breach notification pipeline exists (deferred to §5.7)
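
Because the process is manual for v1, it helps to write the two deadlines into the FreeScout ticket as soon as the incident is classified. A minimal sketch of that arithmetic follows. It assumes both clocks start at the detection timestamp, which is the conservative reading of steps 2 and 3; the function and key names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

DETERMINATION_WINDOW = timedelta(hours=4)    # step 2: operator breach determination
DPA_WINDOW = timedelta(hours=72)             # step 3: GDPR Art. 33 notification


def breach_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Return the UTC deadlines that follow from a detection timestamp."""
    if detected_at.tzinfo is None:
        raise ValueError("detected_at must be timezone-aware")
    detected_utc = detected_at.astimezone(timezone.utc)
    return {
        "determination_due": detected_utc + DETERMINATION_WINDOW,
        "dpa_notification_due": detected_utc + DPA_WINDOW,
    }


if __name__ == "__main__":
    detected = datetime(2026, 5, 21, 10, 0, tzinfo=timezone.utc)
    for name, due in breach_deadlines(detected).items():
        print(f"{name}: {due.isoformat()}")
```

The Art. 34 customer notification in step 4 has no fixed hour count ("without undue delay"), so it is deliberately left out of the calculation.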
Security Considerations
| Question | Answer |
|---|---|
| What PII does this BCP collect? | None. This document describes procedures; it contains no PII. |
| What is the retention period for incident tickets? | Per docs/ops/policies/data-retention.md: DSR request records retained 7 years; general ops tickets 90 days |
| Does any part of this store a credential in a form that can be replayed? | No. Command snippets in this doc use environment variable placeholders. Break-glass credentials are referenced by location (Google Drive), not value. |
| What is logged for audit? | Every recovery action that touches money, permissions, or data access must be logged in the FreeScout incident ticket with UTC timestamps. |
| Where are secrets? | In Infisical vault (vendor API tokens) and AWS SSM (workload secrets). Break-glass copies in Google Drive (operator-only, hardware MFA protected). |
| Are secrets rotatable without redeploy? | Yes. All Heroku config vars are sourced from vault; rotation is vault-update → heroku config:set >/dev/null 2>&1 → dyno restart. No code redeploy required. |
| Is there a kill-switch for live execution paths? | Yes. FLAG_RAPTOR_APP_ROLE_SEPARATION, FLAG_WEBAUTHN_REGISTRATION, and the paper-first gating flag are all disableable via a single heroku config:unset without code deploy (per feedback_heroku_config_set_echoes_secrets pattern). |
| What happens on breach? | See §7.3. 72-hour DPA notification obligation. Affected customers notified per Art. 34. No automated pipeline for v1 — manual SOP applies. |
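
The rotation row above depends on the Heroku CLI output never reaching a terminal or log. The sketch below shows the same discard-all-output behaviour when the call is made from a script instead of an interactive shell. The function names and the injectable `run` parameter are assumptions for illustration, and the secret value is expected to be read from the vault by the caller and held only in memory.

```python
import subprocess
from typing import Callable


def config_set_argv(app: str, key: str, value: str) -> list[str]:
    """Build the argument list for `heroku config:set KEY=VALUE --app APP`."""
    return ["heroku", "config:set", f"{key}={value}", "--app", app]


def rotate_config_var(
    app: str,
    key: str,
    value: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Set one config var with stdout and stderr discarded; return the exit code."""
    result = run(
        config_set_argv(app, key, value),
        stdout=subprocess.DEVNULL,   # equivalent of >/dev/null
        stderr=subprocess.DEVNULL,   # equivalent of 2>&1
        check=False,
    )
    # Only the exit code is reported, so the secret never appears in output.
    return result.returncode
```

Passing an argument list rather than a shell string keeps the value out of shell history. It is still visible in the local process list while the command runs, so this should only be done on an operator-controlled machine. Heroku restarts the app's dynos when a config var changes, so no separate restart call is shown.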
Open Questions (requiring operator decision before sub-cards can be claimed)
- Vault instance name. The vault Lightsail instance name is referenced in Win 1 as <vault-instance-name>. Confirm the Lightsail instance name before Win 1 can be executed by an agent. (Likely raxx-vault or similar; check the terraform/infisical root or aws lightsail get-instances.)
- TF state bucket scope. Confirmed that cf-pages-docs-customer uses raxx-iac-state-prod. Other roots (freescout, waf, cf-access) need a one-time verification that they also use this bucket before Win 2 (S3 versioning) fully protects all state. Unblocking action: grep -r "raxx-iac-state-prod" terraform/ across all roots.
- Infisical export API availability. Win 3 requires the Infisical CE admin export endpoint. Confirm the export endpoint URL and auth method for Infisical CE (self-hosted) vs Infisical Cloud SaaS. The endpoint may differ from the Cloud SaaS admin export path.
- Verified restore drill timing. Win 4 requires operator-executed restore against a throwaway Heroku app. Is this feasible before the 2026-05-23 launch date, or should it be scheduled for the first post-launch week (2026-05-23 to 2026-05-30 UTC)?
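
For the TF state bucket question, the grep shows which files mention the bucket, but the useful answer is the inverse: which roots do not. A small sketch of that check follows. It assumes each Terraform root is an immediate subdirectory of terraform/ and that the backend is declared in a .tf file; a root configured through a separate backend config file or CLI flags would be reported as missing and need a manual look.

```python
from pathlib import Path


def roots_missing_bucket(terraform_dir: Path, bucket: str = "raxx-iac-state-prod") -> list[str]:
    """Return the names of root directories with no .tf file mentioning the bucket."""
    missing = []
    roots = sorted(p for p in terraform_dir.iterdir() if p.is_dir())
    for root in roots:
        mentions_bucket = any(
            bucket in tf_file.read_text(encoding="utf-8", errors="ignore")
            for tf_file in root.rglob("*.tf")
        )
        if not mentions_bucket:
            missing.append(root.name)
    return missing


if __name__ == "__main__":
    # Run from the repository root; prints one root name per line.
    for name in roots_missing_bucket(Path("terraform")):
        print(name)
```

An empty result would suggest every root references the bucket, though a string match does not prove the reference sits inside a backend block.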