Raxx · internal docs

internal · gated

Raptor Prod Postgres Cutover SOP

Runbook ID: RM-10 Card: #1568 Epic: #1556 Staging SOP (RM-9): docs/ops/runbooks/raptor-postgres-staging-cutover.md Roles runbook: docs/ops/runbooks/raptor-postgres-roles.md Design doc: docs/architecture/raptor-postgres-migration/design.md Migration plan: docs/architecture/raptor-postgres-migration/migration-plan.md Target cutover deadline: 2026-05-16 23:59 UTC Estimated wall-clock: 1–2 hours (including smoke validation)


Operator pre-cutover briefing

This SOP cuts raxx-api-prod from SQLite to Heroku Postgres Standard-0. The 48-hour safety buffer before the 2026-05-23 UTC v1 launch is the hard floor. Prod cutover must land by 2026-05-16 23:59 UTC to leave the full buffer intact.

This is a pre-launch, zero-customer environment. No customer data exists in SQLite and no customer data will be migrated. The rollback path (reverting to SQLite) is valid only while the Postgres users table shows count(*) = 0. Once any customer row exists, rollback is closed — you must fix forward.

Do NOT execute this runbook concurrently with any other prod maintenance. Execute from a single terminal session. Keep a second terminal open for log tailing.

Downstream dependency: RM-11 (#1569) (SC-A1 role separation) gates on this cutover being complete. Do not start RM-11 until this runbook is signed off at the bottom.


Section 1 — Pre-flight checklist

Complete every item before entering the maintenance window. Each command output must be captured in the cutover record (see Section 8).

1.1 RM-9 staging soak verified clean

The RM-9 staging soak window closed 2026-05-13 22:55 UTC. Confirm no Sentry errors have fired for raxx-api-staging since 2026-05-10 22:55 UTC (the RM-9 staging cutover timestamp).

# Open Sentry → project: raxx-backend → filter by environment=staging
# Date range: 2026-05-10 22:55 UTC to now
# Expected: zero unresolved issues tagged environment=staging

Also run the staging log check:

heroku logs -a raxx-api-staging --since 72h | grep -i "sqlite\|OperationalError\|database is locked"
# Expected: no output

If Sentry shows errors or the grep returns hits: STOP. Do not proceed to prod. Investigate on staging first; file a blocking comment on #1568.

1.2 Heroku CLI authenticated

heroku auth:whoami
# Expected: your Heroku account email
# If not: heroku login

Confirm access to the raxx-app team and the raxx-api-prod app:

heroku apps --team raxx-app | grep raxx-api-prod
# Expected: raxx-api-prod

1.3 Prod Postgres Standard-0 provisioned and healthy

heroku addons -a raxx-api-prod
# Expected: one addon of type heroku-postgresql, plan standard-0
# Capture the addon name (e.g. postgresql-<adjective>-NNNNN) — you need it for the upgrade step if it is still Essential-tier

heroku pg:info -a raxx-api-prod
# Expected: Plan: Standard 0, Status: Available
# If Plan: Essential-0 or Essential-1 or Essential-2 — see Section 2.1 (upgrade step)
# If Status: anything other than Available — STOP, wait for Heroku support

1.4 DATABASE_URL is already a Postgres URL

Heroku sets DATABASE_URL automatically when the Postgres addon is attached as the primary. Verify it points to Postgres, not SQLite:

heroku config:get DATABASE_URL -a raxx-api-prod
# Expected: postgresql://... (amazonaws.com:5432 in the hostname)
# If sqlite+pysqlite:///./raptor.db — STOP, addon is not attached as primary

1.5 Infisical credentials valid

Credential staleness has caused silent failures before. Verify before you start (see project memory note on session env staleness):

# Confirm any Infisical-sourced env vars you rely on during this runbook are fresh.
# At minimum, confirm you can reach the vault:
curl -fsS https://vault.raxx.app/api/status
# Expected: 200 with {"status":"ok"}

If the vault token has expired, refresh it via infisical login before proceeding. Do not skip this — a stale token mid-runbook will cause a confusing partial state.

1.6 Alembic migrations current on main

git fetch origin main
git log --oneline origin/main -5
# Confirm the latest RM-phase migrations are in the tip of main

1.7 Snapshot confirmation (rollback anchor)

Because the Heroku dyno filesystem is ephemeral, the SQLite file at ./raptor.db on the dyno is not accessible for backup via the CLI. The rollback procedure (Section 6) reverts DATABASE_URL to the SQLite path, which causes the dyno to recreate an empty database from schema on next restart. This is safe only because we are pre-launch with zero customers.

Confirm this invariant before proceeding:

# Run this AFTER Postgres is confirmed available (Section 1.3):
heroku pg:psql -a raxx-api-prod -c "SELECT count(*) FROM users;" 2>/dev/null || echo "TABLE_NOT_FOUND"
# Expected: 0 rows (table may not exist yet if alembic has not run — TABLE_NOT_FOUND is also acceptable)
# If count > 0: STOP. Customer data exists. Rollback is NOT safe. Escalate immediately.

Record the output in the cutover log.


Section 2 — Maintenance window plan

2.1 Expected downtime

2.2 Enable maintenance mode

heroku maintenance:on -a raxx-api-prod
# Expected: Enabling maintenance mode for raxx-api-prod... done

heroku maintenance -a raxx-api-prod
# Expected: on

2.3 Rollback trigger

If any of the following occur, execute Section 6 (Rollback) immediately:


Section 3 — Step-by-step Postgres upgrade (if still Essential-tier)

Skip this section if heroku pg:info (Section 1.3) already shows Plan: Standard 0. If it shows Essential-tier, complete this section first.

Standard-0 is required before pg:credentials:create will work. Essential-tier databases reject the command with: Error: You can't create a custom credential on Essential-tier databases.

# Step 3.1 — Identify the addon name
heroku addons -a raxx-api-prod
# Note the addon name, e.g. postgresql-adjacent-27271

# Step 3.2 — Upgrade to Standard-0
heroku addons:upgrade <addon-name> heroku-postgresql:standard-0 -a raxx-api-prod
# Expected: Starting upgrade of postgresql-<name> to heroku-postgresql:standard-0... (free)
# Heroku will display a maintenance window warning — this is expected and acceptable pre-launch.

# Step 3.3 — Poll until upgrade completes
heroku pg:info -a raxx-api-prod
# Repeat every 60 seconds until: Plan: Standard 0, Status: Available
# Typical upgrade time: 2–5 minutes

Do not proceed to Section 4 until the addon shows Standard 0 and Available.


Section 4 — Step-by-step cutover commands

Run these in order. Capture every command and its output in the cutover record.

4.1 Run Alembic migrations on prod Postgres

This creates the full schema on the prod Postgres database. It is safe to run multiple times (Alembic is idempotent via revision tracking).

heroku run alembic upgrade head -a raxx-api-prod
# Expected output ends with:
#   Running upgrade ... -> 0001_raptor_baseline, raptor baseline
# (or the current head revision if later RM phases added migrations)
#
# If error: "Can't locate revision identified by '...'"
#   → Run: heroku run alembic stamp 0001_raptor_baseline -a raxx-api-prod
#   → Then retry: heroku run alembic upgrade head -a raxx-api-prod

4.2 Verify schema is present

heroku pg:psql -a raxx-api-prod -c "\dt"
# Expected: ~35 tables listed (users, sessions, paper_orders, historical_bars,
#           webauthn_credentials, backup_codes, customer_audit_events, ...)
# If fewer than 35 tables: migration did not fully apply — STOP, re-run step 4.1

4.3 Confirm zero customer rows

heroku pg:psql -a raxx-api-prod -c "SELECT count(*) FROM users;"
# Expected: 0
# If not 0: STOP. Do not proceed. Escalate.

4.4 Create the raptor_app credential

IMPORTANT — RDS password workaround. Heroku Postgres runs on AWS RDS. The DATABASE_URL owner credential is NOT a member of the RDS rds_password group. Attempting to create a role via CREATE ROLE ... WITH LOGIN PASSWORD directly in psql will fail with:

ERROR: must be a member of rds_password to alter passwords

Do NOT use raw SQL to create the role. Use the Heroku CLI path, which goes through Heroku's RDS-aware credential management:

heroku pg:credentials:create DATABASE --name raptor_app -a raxx-api-prod
# Expected: Adding credential raptor_app to postgresql-<name>... done
#
# Argument order matters:
#   - "DATABASE" is the positional addon identifier (literal string)
#   - "--name raptor_app" is the credential/role name
# DO NOT swap them — the CLI returns "Unknown database: raptor_app" if the
# role name goes in the positional slot.

Reference: feedback_heroku_pg_rds_password_gotcha.md and docs/ops/runbooks/raptor-postgres-roles.md (Failure mode A).

4.5 Retrieve and store the raptor_app connection URL

heroku pg:credentials:url DATABASE --name raptor_app -a raxx-api-prod
# Copy the "Connection URL:" value (postgres://raptor_app:...) shown.
# Handle this URL as a secret — it contains the credential password.

Set it in Heroku config. Stdout is silenced because the URL contains the password — echoing it to the terminal is a secret leak (feedback_heroku_config_set_echoes_secrets.md):

heroku config:set RAPTOR_APP_DATABASE_URL="<URL from above>" -a raxx-api-prod >/dev/null 2>&1

Verify the var is set without exposing the value:

heroku config:get RAPTOR_APP_DATABASE_URL -a raxx-api-prod | cut -c1-40
# Expected: postgres://raptor_app:... (first 40 chars only)

4.6 Apply DML grants and revoke migration-table access

Open a psql session as the DATABASE owner:

heroku pg:psql -a raxx-api-prod

Run this block as a single copy-paste unit:

-- Full DML on all application tables
-- (SELECT already granted by pg:credentials:create)
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO raptor_app;

-- Sequence access (required for INSERT into SERIAL/BIGSERIAL columns)
GRANT USAGE ON ALL SEQUENCES IN SCHEMA public TO raptor_app;

-- Deny access to migration-management tables —
-- raptor_app must never be able to stamp a migration as applied
REVOKE ALL ON TABLE alembic_version FROM raptor_app;
REVOKE ALL ON TABLE schema_migrations FROM raptor_app;

Still in psql, verify the grants landed:

SELECT rolname, rolcanlogin FROM pg_catalog.pg_roles WHERE rolname = 'raptor_app';
-- Expected: raptor_app | t

\dp historical_bars
-- Expected: raptor_app=arwd/... in the Access Privileges column
-- a=INSERT, r=SELECT, w=UPDATE, d=DELETE

\dp alembic_version
-- Expected: NO raptor_app entry (or an empty privilege string)

Exit psql (\q).

4.7 Tail logs in a second terminal (open before step 4.8)

Open a second terminal window and start tailing before the restart:

heroku logs -a raxx-api-prod --tail

Keep this window visible throughout steps 4.8–4.10.

4.8 Restart the dyno

heroku restart -a raxx-api-prod
# Wait ~30 seconds for the dyno to come up

In the log tail (second terminal), watch for:

Database engine initialized: dialect=postgresql

Watch against:

sqlite3.OperationalError
database is locked
role "raptor_app" does not exist
permission denied for table

If any of the "watch against" patterns appear within 2 minutes of restart: execute the rollback (Section 6) immediately.

4.9 Disable maintenance mode

Only after the dyno is up and the health endpoint responds (verify in step 5.1 below before running this):

heroku maintenance:off -a raxx-api-prod
# Expected: Disabling maintenance mode for raxx-api-prod... done

4.10 Capture current alembic head revision

heroku run alembic current -a raxx-api-prod
# Capture the revision hash shown. Record it in the cutover log.
# Example: INFO  [alembic.runtime.migration] Running on postgresql
#          Current revision for postgresql://...: 0001_raptor_baseline (head)

Section 5 — Post-cutover verification

Complete all items before declaring the cutover successful.

5.1 Health endpoint

curl -fsS https://api.raxx.app/healthz
# Expected: {"status":"ok","db":"postgresql","version":"..."}
# If db is "sqlite": DATABASE_URL is still pointing at SQLite — see rollback
# If HTTP error: check heroku logs for the reason
# Historical data fetch (uses HistoricalBarRepository → Postgres)
curl -fsS "https://api.raxx.app/api/historical-data/AAPL?range=1d" \
  -H "Cookie: <session-cookie-from-auth-smoke>"
# Expected: 200 with OHLCV bar data

# Billing snapshot (reads vendor_billing_snapshots table)
curl -fsS "https://api.raxx.app/api/billing/snapshot" \
  -H "Cookie: <session-cookie-from-auth-smoke>"
# Expected: 200 (may return empty array pre-launch)

5.3 Auth flow smoke

# Register a test passkey
# POST /api/auth/register with a throwaway test user

# Login with that passkey
# POST /api/auth/login — expect 200 + session cookie

# Verify session: GET /api/system/whoami or equivalent authed endpoint

# Logout: POST /api/auth/logout — expect 200

# Verify in Postgres the test user row was created:
heroku pg:psql -a raxx-api-prod -c "SELECT id, email, created_at FROM users ORDER BY created_at DESC LIMIT 3;"

Or run the full smoke suite if available:

scripts/ci/run_smoke.sh --env=prod

5.4 Schema migration state

heroku run alembic current -a raxx-api-prod
# Expected revision must match what was captured in step 4.10.
# If it shows "(head)" that is correct.
# If it shows a revision hash below the expected head: a migration did not apply — re-run step 4.1.

Verify the table count:

heroku pg:psql -a raxx-api-prod -c "SELECT count(*) FROM information_schema.tables WHERE table_schema='public';"
# Expected: approximately 35 (exact number depends on RM-phase migrations merged to main)

5.5 Sentry: no init errors

Open Sentry → project: raxx-backend → environment: production. Filter to the last 30 minutes. Confirm:

If any of these appear: do NOT proceed to monitoring phase. Investigate and either fix forward or execute rollback (Section 6) depending on severity.

5.6 Log scan for SQLite references

heroku logs -a raxx-api-prod --since 30m | grep -i "sqlite\|database is locked\|OperationalError"
# Expected: no output

Section 6 — Rollback procedure

This rollback is valid only pre-launch (zero customer rows on Postgres). Once any real customer data exists in Postgres, this procedure destroys it. Check SELECT count(*) FROM users on Postgres before running. If count > 0, escalate to the operator — you must fix forward, not roll back.

The Heroku dyno filesystem is ephemeral. Reverting DATABASE_URL to the SQLite path causes the dyno to start a new empty SQLite database from scratch on next boot. This is acceptable pre-launch because no production customer data existed in SQLite either.

Rollback sequence

# Step R1 — Enable maintenance mode (if not already on)
heroku maintenance:on -a raxx-api-prod

# Step R2 — Revert DATABASE_URL to SQLite path
heroku config:set DATABASE_URL="sqlite+pysqlite:///./raptor.db" -a raxx-api-prod >/dev/null 2>&1

# Step R3 — Restart dyno to pick up the reverted URL
heroku restart -a raxx-api-prod

# Step R4 — Verify
heroku config:get DATABASE_URL -a raxx-api-prod
# Expected: sqlite+pysqlite:///./raptor.db

curl -fsS https://api.raxx.app/healthz
# Expected: {"status":"ok","db":"sqlite"}

# Step R5 — Disable maintenance mode
heroku maintenance:off -a raxx-api-prod

Post-rollback actions


Section 7 — Post-cutover monitoring

7.1 Active watch (first 4 hours)

Keep a terminal with heroku logs -a raxx-api-prod --tail running. Check every 15–30 minutes for the first 4 hours:

# Quick 4-hour check (run each time):
heroku logs -a raxx-api-prod --since 30m | grep -i "error\|sqlite\|permission denied\|OperationalError" | tail -20
curl -fsS https://api.raxx.app/healthz

7.2 Dashboard watch (first 24 hours)

At the 8-hour and 24-hour marks:

# Postgres connection health
heroku pg:info -a raxx-api-prod
# Watch: Connection Count (should not be near max)
# Watch: Status (must be Available)

# Log scan for SQLite remnants
heroku logs -a raxx-api-prod --since 24h | grep -i "sqlite\|OperationalError" | wc -l
# Expected: 0

# Table count stable
heroku pg:psql -a raxx-api-prod -c "SELECT count(*) FROM information_schema.tables WHERE table_schema='public';"

7.3 Escalation triggers

Escalate immediately (Slack #ops-alerts and direct ping to Kristerpher) if:

7.4 RM-11 gate

Do not start RM-11 (#1569) until:


Section 8 — Cutover record and sign-off

Fill this in during execution. Copy-paste actual command output into each field. This record is the artifact for the RM-10 card close.

Cutover ID:        RM-10-<YYYYMMDD-HHMM>UTC   (e.g. RM-10-20260516-1430UTC)

Operator:          ____________________________

Maintenance mode ON at (UTC):   ____________________________

Step 4.1 alembic upgrade head output (last line):
  ___________________________________________________________________

Step 4.2 \dt row count:         ____________________________

Step 4.3 users count:           ____________________________

Step 4.4 pg:credentials:create output:
  ___________________________________________________________________

Step 4.5 RAPTOR_APP_DATABASE_URL prefix (first 40 chars):
  ___________________________________________________________________

Step 4.6 raptor_app rolcanlogin:  ____________________________

Step 4.8 restart timestamp (UTC): ____________________________

"dialect=postgresql" seen in logs: [ ] yes  [ ] no (if no — rollback)

Step 4.9 maintenance mode OFF at (UTC): ____________________________

Step 4.10 alembic current revision:
  ___________________________________________________________________

Section 5.1 healthz db field:      ____________________________

Section 5.5 Sentry clear (y/n):    ____________________________

4-hour watch ended at (UTC):       ____________________________
4-hour watch result:               [ ] clean  [ ] issues found (see note)

Notes:
  ___________________________________________________________________
  ___________________________________________________________________

Operator sign-off (pre-cutover dry-read):

I have read this runbook end-to-end. Pre-flight checks are complete. RM-9 staging soak is verified clean.

Operator: ____ Date (UTC): _____


Ops sign-off (post-cutover, 4-hour watch complete):

Cutover is stable. Sentry clean. Health endpoint returns db: postgresql. RM-11 gate is open.

Signed: ____ Date (UTC): _______


Connects to