Raptor — Operations Runbook

System: Raptor (raxx-api-prod, raxx-api-staging)
Owner: operator / sre-agent
Last reviewed: 2026-06-12 UTC
Parent issues: #2599 (env-bootstrap checklist), #81 (SDLC hardening epic)

Maintenance rule: Every PR that adds a new flag-gated subsystem to create_app() MUST update the checklist in this file in the same PR. Mirrors the policy in feedback_new_flag_needs_b1_migration_same_pr.

Env-bootstrap checklist

Use this checklist when provisioning a new environment (staging or prod). Work through it linearly. Check off each item as you complete it. Every heroku config:set call must have >/dev/null 2>&1 appended — the CLI echoes the value otherwise (#feedback_heroku_config_set_echoes_secrets).

Command to set a var (safe pattern):

heroku config:set VAR_NAME="<value>" --app <app-name> >/dev/null 2>&1
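When vars are provisioned from a script rather than an interactive shell, the same no-echo rule can be kept by discarding the CLI's output streams. The following is a minimal sketch, not an existing Raxx tool: the helper names config_set_command and set_config_var are hypothetical, and it assumes the heroku CLI is on PATH.

```python
import subprocess


def config_set_command(var: str, value: str, app: str) -> list[str]:
    """Build the argv for `heroku config:set VAR=value --app <app>`."""
    return ["heroku", "config:set", f"{var}={value}", "--app", app]


def set_config_var(var: str, value: str, app: str) -> None:
    """Run config:set with stdout and stderr discarded.

    Equivalent to appending `>/dev/null 2>&1` in the shell, so the CLI
    cannot echo the value into a terminal or CI log.
    """
    subprocess.run(
        config_set_command(var, value, app),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,  # a non-zero exit raises CalledProcessError instead of passing silently
    )
```

Because the output is discarded, a failure shows up only as the raised exception; confirm the result with a names-only listing rather than by printing values.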

Core — always required

Every environment needs these regardless of feature flags.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
`SECRET_KEY`	`SECRET_KEY`	Always (fails fast if `FLASK_ENV=production` and unset)	64-byte URL-safe token — generate with `python3 -c "import secrets; print(secrets.token_urlsafe(64))"`	none	`/Raxx/Backend/SECRET_KEY`
`FLASK_ENV`	n/a (read directly by Flask)	Always	`production` or `development`	`development`	set inline
`DATABASE_URL`	resolved via `resolve_runtime_database_url()`	Always (Postgres path)	`postgres://...` (Heroku-managed)	n/a	managed by Heroku
`FRONTEND_ORIGIN`	used in CORS allowlist	Always	comma-separated origins, e.g. `https://raxx.app,https://www.raxx.app`	`http://localhost:3000`	`/Raxx/Backend/FRONTEND_ORIGIN`
`ADMIN_SERVICE_TOKEN`	read at call time by billing/audit/shadow-GDPR routes	Always	32+ char random token	none (routes return 501 if unset)	`/Raxx/Backend/ADMIN_SERVICE_TOKEN`

[ ] SECRET_KEY provisioned
[ ] FLASK_ENV=production set
[ ] DATABASE_URL present (Heroku adds this automatically on Postgres add-on attach)
[ ] FRONTEND_ORIGIN set to production origin(s)
[ ] ADMIN_SERVICE_TOKEN provisioned
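The core checklist can also be verified mechanically before first boot. This is a sketch of a preflight check under stated assumptions: missing_core_vars is a hypothetical helper, not part of Raptor, and it checks presence only, not format.

```python
from typing import Mapping

# Names taken from the "Core" table above, in table order.
CORE_VARS = (
    "SECRET_KEY",
    "FLASK_ENV",
    "DATABASE_URL",
    "FRONTEND_ORIGIN",
    "ADMIN_SERVICE_TOKEN",
)


def missing_core_vars(env: Mapping[str, str]) -> list[str]:
    """Return the core vars that are unset or blank.

    Only names are returned, never values, so the result is safe to log.
    """
    return [name for name in CORE_VARS if not env.get(name, "").strip()]
```

Pass os.environ on a one-off dyno, or a dict built from a names-and-values export, and treat a non-empty result as a blocker for the remaining sections.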

WebAuthn (`FLAG_WEBAUTHN_REGISTRATION`)

These three vars are validated at startup by validate_webauthn_config() when FLAG_WEBAUTHN_REGISTRATION=1. Raptor refuses to start if any is empty.

WEBAUTHN_ORIGIN was never provisioned on staging; this was the root cause of the 2026-05-20 staging boot failure (contributing factor 3 in docs/incidents/2026-05-20-staging-webauthn-boot-fail.md). To prevent instance config from silently overwriting them, these vars are re-pinned from os.environ after from_pyfile() runs.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
`WEBAUTHN_RP_ID`	`WEBAUTHN_RP_ID`	`FLAG_WEBAUTHN_REGISTRATION=1`	Relying Party ID — must be the registrable domain, e.g. `raxx.app`	`""` (fails validation if empty)	`/Raxx/Backend/WEBAUTHN_RP_ID`
`WEBAUTHN_RP_NAME`	`WEBAUTHN_RP_NAME`	`FLAG_WEBAUTHN_REGISTRATION=1`	Human-readable RP name shown in browser passkey dialog, e.g. `Raxx`	`Raxx`	`/Raxx/Backend/WEBAUTHN_RP_NAME`
`WEBAUTHN_ORIGIN`	`WEBAUTHN_ORIGIN`	`FLAG_WEBAUTHN_REGISTRATION=1`	Full origin of the Antlers frontend, e.g. `https://raxx.app`	`""` (fails validation if empty)	`/Raxx/Backend/WEBAUTHN_ORIGIN`

[ ] WEBAUTHN_RP_ID=raxx.app set (prod) or raxx-staging.pages.dev / staging domain (staging)
[ ] WEBAUTHN_RP_NAME=Raxx set
[ ] WEBAUTHN_ORIGIN=https://raxx.app set (prod) or staging origin (staging)
[ ] FLAG_WEBAUTHN_REGISTRATION=1 set after the above are confirmed
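The re-pin and fail-fast behaviour described above can be sketched as follows. This is an illustration only: the real validate_webauthn_config() is not reproduced here, the function names repin_from_env and check_webauthn_config are hypothetical, and the choice to re-pin only vars that are present in the environment is an assumption.

```python
from typing import Mapping

WEBAUTHN_VARS = ("WEBAUTHN_RP_ID", "WEBAUTHN_RP_NAME", "WEBAUTHN_ORIGIN")


def repin_from_env(config: dict, env: Mapping[str, str]) -> None:
    """Copy env values over config so a later-loaded instance file cannot shadow them.

    Assumption: only vars actually present in the environment are re-pinned.
    """
    for name in WEBAUTHN_VARS:
        if name in env:
            config[name] = env[name]


def check_webauthn_config(config: Mapping[str, str], flag_on: bool) -> None:
    """Raise at startup if the flag is on and any WebAuthn var is empty."""
    if not flag_on:
        return
    empty = [name for name in WEBAUTHN_VARS if not str(config.get(name, "")).strip()]
    if empty:
        raise RuntimeError("WebAuthn enabled but empty: " + ", ".join(empty))
```

The order matters: the re-pin must run after the instance config is loaded and before validation, otherwise validation sees whatever the instance file left behind.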

Sentry (`FLAG_SENTRY_BACKEND`)

Raptor's Sentry integration is gated by the sentry_backend flag (FLAG_SENTRY_BACKEND). If the flag is on but SENTRY_DSN_BACKEND is absent, Sentry initialisation is skipped silently; this is non-fatal, but no errors are reported until the DSN is set.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
`SENTRY_DSN_BACKEND`	n/a (read at init time by `sentry_init.py`)	`FLAG_SENTRY_BACKEND=1`	DSN URL from Sentry project settings, e.g. `https://<key>@o<org>.ingest.sentry.io/<project>`	none (sentry init skipped if absent)	`/Raxx/Backend/SENTRY_DSN_BACKEND`

[ ] SENTRY_DSN_BACKEND provisioned from Sentry project dashboard
[ ] FLAG_SENTRY_BACKEND=1 set after DSN is confirmed

Postmark (`FLAG_POSTMARK_INBOUND_TO_FREESCOUT`, `FLAG_POSTMARK_DELIVERY_MONITOR`, and transactional email)

POSTMARK_SERVER_TOKEN is read at call time by multiple services (waitlist email, transactional email, trace integrity alerts, admin notify). Absence means those paths silently skip sending — no fatal error at startup, but emails will not deliver.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
`POSTMARK_SERVER_TOKEN`	n/a (read from `os.environ` at call time)	Any email sending path is live	Postmark Server API token from Postmark dashboard	none (email calls silently no-op)	`/Raxx/Backend/POSTMARK_SERVER_TOKEN`
`POSTMARK_INBOUND_WEBHOOK_TOKEN`	n/a	`FLAG_POSTMARK_INBOUND_TO_FREESCOUT=1`	Inbound webhook secret from Postmark	none	`/Raxx/Backend/POSTMARK_INBOUND_WEBHOOK_TOKEN`

[ ] POSTMARK_SERVER_TOKEN provisioned (Postmark account → Servers → API Tokens)
[ ] Postmark account out of sandbox (confirmed 2026-05-09 — docs/ops/postmark.md)
[ ] POSTMARK_INBOUND_WEBHOOK_TOKEN provisioned if FLAG_POSTMARK_INBOUND_TO_FREESCOUT=1

Database (Postgres role separation — `FLAG_RAPTOR_APP_ROLE_SEPARATION`)

When the flag is on, Raptor uses RAPTOR_APP_DATABASE_URL (restricted raptor_app role) for all request handlers. DATABASE_URL (owner credential) is then used only by Alembic migrations in the release dyno. See docs/ops/runbooks/raptor-db-credentials.md for the full two-URL model.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
`RAPTOR_APP_DATABASE_URL`	`RUNTIME_DATABASE_URL` (after resolution)	`FLAG_RAPTOR_APP_ROLE_SEPARATION=1`	`postgres://raptor_app:<password>@...` — restricted role URL	falls back to `DATABASE_URL` if absent	`/Raxx/Backend/RAPTOR_APP_DATABASE_URL`

[ ] raptor_app role provisioned via pg:credentials:create (see raptor-postgres-roles.md)
[ ] RAPTOR_APP_DATABASE_URL provisioned and set
[ ] FLAG_RAPTOR_APP_ROLE_SEPARATION=1 set
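The selection rule between the two URLs can be sketched as below. This is not the body of resolve_runtime_database_url(); pick_runtime_database_url is a hypothetical stand-in that only illustrates the flag check and the documented fallback to DATABASE_URL.

```python
from typing import Mapping, Optional


def pick_runtime_database_url(env: Mapping[str, str]) -> Optional[str]:
    """Choose the URL that request handlers should connect with.

    Flag on and restricted URL present: use the raptor_app role.
    Otherwise: fall back to DATABASE_URL (the owner credential).
    """
    flag_on = env.get("FLAG_RAPTOR_APP_ROLE_SEPARATION") == "1"
    app_url = env.get("RAPTOR_APP_DATABASE_URL", "").strip()
    if flag_on and app_url:
        return app_url
    return env.get("DATABASE_URL")
```

Note the consequence of the fallback: with the flag on but RAPTOR_APP_DATABASE_URL missing, handlers quietly run with owner privileges, which is why the checklist sets the URL before the flag.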

Demo session (`FLAG_DEMO_SESSION`)

When enabled, Raptor initialises a Redis client and the demo blueprint registers. Fail-closed: if REDIS_URL is absent, demo endpoints degrade gracefully and an error is logged, but startup continues.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
`REDIS_URL`	n/a (`app.extensions["redis_client"]`)	`FLAG_DEMO_SESSION=1`	`redis://...` or `rediss://...` (Heroku Redis add-on URL)	none (demo degrades gracefully)	managed by Heroku Redis add-on

[ ] Heroku Redis add-on attached (adds REDIS_URL automatically)
[ ] FLAG_DEMO_SESSION=1 set after Redis is confirmed reachable

Staging posture (2026-06-14): raxx-api-staging does not have a Redis add-on attached and REDIS_URL is absent. This is intentional: FLAG_DEMO_SESSION and FLAG_RATE_LIMIT_STORAGE_REDIS are both OFF on staging. The WebAuthn challenge store falls back to the in-process dict (safe for a single-worker staging dyno). The rate limiter falls back to in-memory. No consumer hard-fails at startup. The nightly config-health check treats staging REDIS_URL as severity: warning rather than critical to reflect this posture. Activate Redis on staging by running:

heroku addons:create heroku-redis:mini --app raxx-api-staging

Attach the add-on only when FLAG_DEMO_SESSION=1 needs to be exercised on staging. Set the flag only AFTER confirming REDIS_URL is populated (add-on provisioning can take ~30s).

Broker / market data (`FLAG_ALPACA_TRADING`, `FLAG_ALPACA_MARKET_DATA`)

Alpaca keys are read at call time by alpaca_integration.py, alpaca_market_data_service.py, and options.py. Absence causes those routes to return a 503 with a missing_env payload.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
`ALPACA_API_KEY`	`ALPACA_API_KEY` (via `_read_config_value`)	Live trading or market data enabled	Alpaca live-account key ID	none	`/Raxx/Backend/ALPACA_API_KEY`
`ALPACA_API_SECRET`	`ALPACA_API_SECRET`	Live trading or market data enabled	Alpaca live-account secret key	none	`/Raxx/Backend/ALPACA_API_SECRET`
`ALPACA_BASE_URL`	`ALPACA_BASE_URL`	Live trading	`https://api.alpaca.markets` (live)	`https://api.alpaca.markets`	`/Raxx/Backend/ALPACA_BASE_URL`
`ALPACA_PAPER_API_KEY`	`ALPACA_PAPER_API_KEY`	Paper trading mode	Alpaca paper-account key ID	falls back to `ALPACA_API_KEY`	`/Raxx/Backend/ALPACA_PAPER_API_KEY`
`ALPACA_PAPER_API_SECRET`	`ALPACA_PAPER_API_SECRET`	Paper trading mode	Alpaca paper-account secret key	falls back to `ALPACA_API_SECRET`	`/Raxx/Backend/ALPACA_PAPER_API_SECRET`
`ALPACA_PAPER_BASE_URL`	`ALPACA_PAPER_BASE_URL`	Paper trading mode	`https://paper-api.alpaca.markets`	`https://paper-api.alpaca.markets`	`/Raxx/Backend/ALPACA_PAPER_BASE_URL`

[ ] Live keys provisioned (paper keys also cover market-data reads)
[ ] Paper keys provisioned separately (users start in paper mode)
[ ] Base URLs confirmed — do not swap live/paper URLs
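The fallback chain in the table (paper keys fall back to live keys, base URLs fall back to their defaults) can be sketched as below. The function names and the exact shape of the 503 body are assumptions; only the 503 status and the presence of a missing_env payload come from the description above.

```python
from typing import Mapping

LIVE_URL = "https://api.alpaca.markets"
PAPER_URL = "https://paper-api.alpaca.markets"


def select_alpaca_credentials(env: Mapping[str, str], paper: bool) -> tuple[dict, list]:
    """Resolve key, secret and base URL for the requested mode.

    Returns (credentials, missing), where missing lists the var names
    that could not be resolved even after applying the fallbacks.
    """
    if paper:
        key = env.get("ALPACA_PAPER_API_KEY") or env.get("ALPACA_API_KEY")
        secret = env.get("ALPACA_PAPER_API_SECRET") or env.get("ALPACA_API_SECRET")
        base_url = env.get("ALPACA_PAPER_BASE_URL") or PAPER_URL
        names = ("ALPACA_PAPER_API_KEY", "ALPACA_PAPER_API_SECRET")
    else:
        key = env.get("ALPACA_API_KEY")
        secret = env.get("ALPACA_API_SECRET")
        base_url = env.get("ALPACA_BASE_URL") or LIVE_URL
        names = ("ALPACA_API_KEY", "ALPACA_API_SECRET")
    missing = [name for name, value in zip(names, (key, secret)) if not value]
    return {"key": key, "secret": secret, "base_url": base_url}, missing


def missing_env_response(missing: list) -> tuple[dict, int]:
    """Assumed payload shape for the 503 returned when keys are absent."""
    return {"error": "missing_env", "missing_env": missing}, 503
```

The sketch also shows why swapping base URLs is dangerous: nothing in the resolution step ties a key to the URL it belongs to, so a live key paired with the paper URL (or the reverse) is only caught by the broker.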

Session auth middleware (`FLAG_SESSION_AUTH_MIDDLEWARE`)

Warning: Do NOT enable this flag on api.raxx.app while Cloudflare Access is still in front of the API origin. Flipping the flag before CF Access is removed causes double-auth and blocks all authenticated requests. CF Access MUST otherwise remain active right up to the cutover, so remove it and enable the flag in the same change window, in that order.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
n/a (cookie-based, no additional env vars)	—	—	—	—	—

[ ] CF Access removed from api.raxx.app first
[ ] FLAG_SESSION_AUTH_MIDDLEWARE=1 set after CF Access removal confirmed

Status 3P poller (`FLAG_STATUS_3P_POLLER`)

When enabled, a background daemon thread polls upstream partner status pages and writes results to the D1 status DB via STATUS_WORKER_URL.

Var	`app.config` key	Required when	Accepted values / format	Default	Infisical path
`STATUS_WORKER_URL`	n/a (read from `os.environ` at call time)	`FLAG_STATUS_3P_POLLER=1`	Base URL of the CF Worker, e.g. `https://status.raxx.app`	none (poller logs error and no-ops if absent)	`/Raxx/Backend/STATUS_WORKER_URL`

[ ] STATUS_WORKER_URL set to CF Worker URL
[ ] FLAG_STATUS_3P_POLLER=1 set

WebAuthn challenge-miss alerting (`FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT`)

Requires Sentry to already be initialised (i.e., FLAG_SENTRY_BACKEND=1 and SENTRY_DSN_BACKEND set). Installs a log handler that captures Redis challenge-miss events and forwards them to Sentry.

[ ] Sentry checklist above completed first
[ ] FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT=1 set

gunicorn `--preload` + module-level mutable state — Raxx policy

Policy added 2026-05-29 UTC. Closes action item #3 from RCA docs/incidents/2026-05-25-signup-challenge-store-miss.md.

Why `--preload` is on

Raptor runs gunicorn with --preload (Heroku Procfile: web: gunicorn --preload ...). Preload imports the entire WSGI application once in the master process before forking worker processes. This produces:

Faster cold-start on dyno wake: bytecode is compiled and in-memory before any worker forks.
Lower per-worker memory footprint on Linux: copy-on-write (CoW) pages stay shared between workers until a worker first writes to them.
Faster dyno restart on deploys: workers can begin accepting traffic sooner.

Preload is the correct default for a Python web app on Heroku with a non-trivial import tree. Do not remove --preload to "fix" a mutable-state bug — fix the state instead.

What goes wrong with module-level mutable state

When gunicorn forks, each worker process gets a copy-on-write clone of the master's memory. The first write to any object in a worker creates a private copy for that worker only. Other workers and the master process do not see that write.

Concretely:

# webauthn_service.py (BEFORE the 2026-05-25 fix)
_challenge_store: dict[int, str] = {}   # module-level, initialized once in master

def _store_challenge(user_id: int, challenge: str) -> None:
    _challenge_store[user_id] = challenge   # writes to *this worker's* private copy

def _pop_challenge(user_id: int) -> str | None:
    return _challenge_store.pop(user_id, None)  # looks up in *this worker's* private copy

If the HTTP request that calls _store_challenge is handled by worker A, and the next HTTP request that calls _pop_challenge is handled by worker B (or by any other process alive during a brief overlap in dyno lifecycle events), _pop_challenge returns None. The challenge was written to worker A's private copy of the dict; worker B's copy is still the empty dict inherited from the master.

This was the root cause of the 2026-05-25 SEV-1 (_challenge_store miss on every signup attempt, PR #2728). The same failure mode also explains the trading_mode_override regression fixed in PR #3030: a process-global dict keyed on user ID was mutated per request; under preload, concurrent requests in different workers each had a divergent view of that dict.
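The divergence is easy to reproduce outside gunicorn. The sketch below uses the standard library's multiprocessing with the fork start method (POSIX only) as a stand-in for a gunicorn worker; the names _store, _child_writes and demo are illustrative and do not exist in Raptor.

```python
import multiprocessing as mp

# Module-level mutable state, created once in the parent ("master") process.
_store: dict[int, str] = {}


def _child_writes(user_id: int, challenge: str, conn) -> None:
    """Runs in the forked child: write to the dict, then report the child's view."""
    _store[user_id] = challenge
    conn.send(dict(_store))
    conn.close()


def demo() -> tuple[dict, dict]:
    """Fork a child that writes to _store; return (child_view, parent_view)."""
    ctx = mp.get_context("fork")  # same process model as gunicorn workers
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_child_writes, args=(1, "abc", child_conn))
    proc.start()
    child_view = parent_conn.recv()
    proc.join()
    return child_view, dict(_store)
```

The child sees its own write, while the parent's dict stays empty: the same mismatch as a challenge stored by worker A and looked up by worker B.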

The rule

Module-level mutable state that must be consistent across processes is forbidden in Raptor.

State type	Acceptable storage	Not acceptable
Per-request ephemeral data (auth token from header, parsed body)	Local variable or `flask.g`	Module-level dict
Per-session transient data (WebAuthn challenge, CSRF nonce)	Redis (TTL-keyed)	Module-level dict
Per-user persistent data (trading mode, profile, preferences)	Postgres row	Module-level dict, `app.extensions["..."]` dict
Read-mostly app config (RP ID, feature flags read at startup)	`app.config` or `app.extensions["..."]` — read-only after `create_app()`	Any post-fork mutation
Background worker state	Redis or Postgres	Module-level variable mutated per task

Immutable is fine. Module-level constants, compiled regex patterns, read-only lookup tables, and frozen dataclasses are safe under --preload because they are never written after the fork.

app.extensions is safe for read-only objects. A Redis client, a Sentry client, or a database engine stored in app.extensions["redis_client"] is initialized once in create_app() (before fork) and used read-only in workers. This is the correct pattern. Do not store mutable per-user or per-request state there.

Required pattern for transient state

Use Redis with a TTL. The post-fix webauthn_service.py is the canonical example:

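# webauthn_service.py (AFTER PR #2728, https://github.com/raxx-app/TradeMasterAPI/pull/2728)
# The imports, logger and fallback dict are not part of the original excerpt;
# they are assumed here so the snippet is self-contained.
import logging

from flask import Flask

logger = logging.getLogger(__name__)

_CHALLENGE_TTL = 60  # seconds

# Used only when Redis is absent (local dev / CI), where nothing forks.
_challenge_store_fallback: dict[int, str] = {}

def _store_challenge(app: Flask, user_id: int, challenge: str) -> None:
    redis = app.extensions.get("redis_client")
    if redis:
        # SET with ex= stores the value and its TTL in one command.
        redis.set(f"webauthn:reg:challenge:{user_id}", challenge, ex=_CHALLENGE_TTL)
        logger.info("webauthn.challenge_stored store=redis user_id=%s", user_id)
    else:
        # Local dev / CI: Redis absent, fall back to in-process dict
        _challenge_store_fallback[user_id] = challenge
        logger.info("webauthn.challenge_stored store=local user_id=%s", user_id)

def _pop_challenge(app: Flask, user_id: int) -> str | None:
    redis = app.extensions.get("redis_client")
    if redis:
        # GETDEL reads and deletes atomically; returns None if missing or expired.
        val = redis.getdel(f"webauthn:reg:challenge:{user_id}")
        if val is None:
            logger.warning("webauthn.challenge_miss store=redis user_id=%s", user_id)
        else:
            logger.info("webauthn.challenge_popped store=redis user_id=%s", user_id)
        # redis-py returns bytes unless the client was created with decode_responses=True.
        return val.decode() if val else None
    else:
        return _challenge_store_fallback.pop(user_id, None)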

Key properties of this pattern:

Atomic pop. GETDEL (available since Redis 6.2) reads and deletes in a single Redis command. No race between read and delete.
TTL-bounded. A challenge that is never popped (e.g., user abandons signup) expires in 60 seconds. No accumulation of stale state.
Positive diagnostic signal. webauthn.challenge_miss is a named log event, not a silent None return. Sentry (via FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT) captures it.
Local-dev fallback. When REDIS_URL is absent (local dev, CI), the code falls back to an in-process dict. This is acceptable because a single-process dev server does not fork.

Code-review checklist

When reviewing any PR that adds new state to a Raptor service:

[ ] Is this state written after create_app() returns? If yes, it cannot be module-level.
[ ] Is this state per-user, per-session, or per-request? If yes, it belongs in Redis or Postgres, not app.extensions.
[ ] Is this state read-only after startup? If yes, app.config or app.extensions is fine.
[ ] If Redis is used: does it have a TTL? Does pop use GETDEL (atomic)? Does a miss emit a named log event?
[ ] Is there a local-dev fallback for when REDIS_URL is absent?
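The first checklist item can be given a mechanical first pass. The following is a rough sketch of a static check built on the standard library's ast module; find_module_level_mutables is a hypothetical helper, not an existing Raxx lint rule. It only flags candidates for a human to judge: a module-level dict that is never written after startup is still fine.

```python
import ast

# Constructor names treated as "creates a mutable container".
_MUTABLE_CALLS = {"dict", "list", "set", "defaultdict", "OrderedDict", "deque"}
_MUTABLE_NODES = (ast.Dict, ast.List, ast.Set, ast.DictComp, ast.ListComp, ast.SetComp)


def _is_mutable_value(value: ast.expr) -> bool:
    """True for {} / [] / set literals, comprehensions, and dict()-style calls."""
    if isinstance(value, _MUTABLE_NODES):
        return True
    if isinstance(value, ast.Call):
        func = value.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        return name in _MUTABLE_CALLS
    return False


def find_module_level_mutables(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for top-level names bound to mutable containers."""
    findings = []
    for node in ast.parse(source).body:  # top-level statements only
        if isinstance(node, ast.Assign):
            targets, value = node.targets, node.value
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets, value = [node.target], node.value
        else:
            continue
        if _is_mutable_value(value):
            findings.extend(
                (node.lineno, t.id) for t in targets if isinstance(t, ast.Name)
            )
    return findings
```

Run against the pre-fix webauthn_service.py shown earlier, a check like this would have flagged _challenge_store on its definition line.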

Incidents that established this policy

Incident	Date	PR	Bug class
WebAuthn challenge-store miss — every signup fails	2026-05-25	PR #2728	Module-level dict, challenge written to one worker's CoW page, verify called on different worker
Per-user `trading_mode_override` process-global dict	(see PR #3030)	PR #3030	Module-level dict keyed on user_id, mutations diverged across worker processes under preload

SQLite-vs-Postgres test divergence (known failure class)

Raptor's unit tests run against in-memory SQLite via _make_engine() fixtures that patch _get_conn() directly. This is fast and self-contained, but SQLite and Postgres each accept SQL constructs that the other rejects. Whenever a route's write path passes tests but returns a 500 in prod, check this list first.

Known divergence points

Construct	SQLite behaviour	Postgres behaviour	Safe alternative
`MAX(col, val)` in a `SET` clause	Accepted: SQLite's multi-argument `max()` is a scalar function that returns the largest argument	Rejected: `function max(integer, integer) does not exist` (Postgres `max()` is aggregate-only)	Read current value → Python `max(current, new)` → pass as bound param
`GREATEST(a, b)`	Rejected (no `GREATEST` function)	Accepted	Same: Python `max()` with bound param
`DO $$ ... $$` PL/pgSQL blocks	Rejected	Accepted	Mark with `-- POSTGRES-ONLY` sentinel; use `if bind.dialect.name == 'postgresql':` guard
`RENAME COLUMN` (DDL)	Accepted in SQLite 3.25+	Accepted	Safe on both
`now()` as `server_default`	Rejected	Accepted	Use `sa.text("now()")` gated on `if bind.dialect.name == 'postgresql':`
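The first row of the table is the one that has bitten Raptor, and it can be reproduced with the standard library's sqlite3 module. A minimal sketch follows; the table and column names (progress, step) are made up for illustration.

```python
import sqlite3


def sqlite_accepts_scalar_max() -> int:
    """Run an UPDATE that uses MAX(col, val) in SET against in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE progress (pk INTEGER PRIMARY KEY, step INTEGER NOT NULL)")
    conn.execute("INSERT INTO progress (pk, step) VALUES (1, 2)")
    # SQLite treats two-argument MAX as a scalar function, so this succeeds.
    # The same statement fails on Postgres, where max() is aggregate-only.
    conn.execute("UPDATE progress SET step = MAX(step, :new) WHERE pk = 1", {"new": 5})
    (step,) = conn.execute("SELECT step FROM progress WHERE pk = 1").fetchone()
    conn.close()
    return step
```

A unit test built on this statement passes on SQLite, which is exactly why the failure only surfaced in prod.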

Rule for advance-progress patterns

Any UPDATE that increments or ceiling-caps a counter column must follow this pattern (from beta_walkthrough.py, established 2026-04-xx):

from sqlalchemy import text

# "progress_table", "col" and "pk" are placeholders for the real table and columns.
# Read current value
row = conn.execute(text("SELECT col FROM progress_table WHERE pk = :pk"), {"pk": pk}).fetchone()
current = row[0] if row else 0
# Compute in Python; never use MAX(col, val) in the SET clause
conn.execute(
    text("UPDATE progress_table SET col = :new_val WHERE pk = :pk"),
    {"new_val": max(current, new_value), "pk": pk},
)
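The same read-then-write pattern can be exercised end to end with the standard library's sqlite3 module. This is a sketch under stated assumptions: advance_progress and the progress/step names are hypothetical, and it uses sqlite3 directly rather than the SQLAlchemy connection shown above.

```python
import sqlite3


def advance_progress(conn: sqlite3.Connection, pk: int, new_value: int) -> int:
    """Raise the stored step to new_value, never lowering it. Returns the value written."""
    row = conn.execute("SELECT step FROM progress WHERE pk = ?", (pk,)).fetchone()
    current = row[0] if row else 0
    target = max(current, new_value)  # computed in Python, portable across dialects
    conn.execute("UPDATE progress SET step = ? WHERE pk = ?", (target, pk))
    return target
```

The read and the write are two statements, so two concurrent requests for the same row can interleave. Whether that matters depends on the route; where it does, running both statements inside one transaction with a row lock on Postgres is the usual remedy.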

Incidents from this class

Incident	Date	PR	Failure
beta_preview POST /screen/1 500	2026-06-12	PR #3540	`MAX(col, val)` in SET clause — SQLite silent, Postgres hard error
beta_nda_ack table missing in Raptor	2026-06-10	PR #3454	Table created in Console migration only — no Raptor alembic migration
beta_walkthrough tables missing in Raptor	2026-06-12	PR #3517	Same pattern as 2026-06-10

Blueprint authoring convention — url_prefix MUST NOT include /api

This is a common footgun. The _core_blueprints registration loop in api/__init__.py (~line 370) prepends /api to every blueprint's prefix:

for bp in _core_blueprints:
    prefixed_url = f"/api{bp.url_prefix}" if bp.url_prefix else "/api"
    app.register_blueprint(bp, url_prefix=prefixed_url)

All blueprints in _core_blueprints MUST define their url_prefix without the leading /api. Examples:

Blueprint	Correct prefix	Final route
`internal_waf_events`	`/internal/waf-events`	`/api/internal/waf-events`
`freescout_audit_webhook`	`/internal/freescout-webhook`	`/api/internal/freescout-webhook`
`internal_beta_survey`	`/internal/beta-survey-data`	`/api/internal/beta-survey-data`

If a blueprint accidentally includes /api in its prefix (e.g., /api/internal/foo), the route is registered at /api/api/internal/foo and every request falls through to the serve_static catch-all → 404 {"error":"React build not found..."}.

Symptom: raptor_survey_client: Raptor returned status=404 error=React build not found in Console logs, and matching 404s in Raptor access logs for a route that should exist.

Fix: Remove the /api prefix from the Blueprint definition's url_prefix argument.
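A guard in the registration loop, or in a unit test that iterates over _core_blueprints, would catch this before deploy. The following sketch mirrors the prefix arithmetic of the loop above without importing Flask; final_prefix and assert_prefix_has_no_api are hypothetical helpers.

```python
from typing import Optional


def final_prefix(url_prefix: Optional[str]) -> str:
    """Mirror the registration loop: prepend /api to the blueprint's own prefix."""
    return f"/api{url_prefix}" if url_prefix else "/api"


def assert_prefix_has_no_api(name: str, url_prefix: Optional[str]) -> None:
    """Raise if a blueprint prefix already starts with /api."""
    if url_prefix and (url_prefix == "/api" or url_prefix.startswith("/api/")):
        raise ValueError(
            f"blueprint {name!r}: url_prefix {url_prefix!r} would register at "
            f"{final_prefix(url_prefix)!r}; drop the leading /api"
        )
```

Matching on "/api/" rather than a bare "/api" prefix avoids false positives for legitimate prefixes such as /apikeys.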

Incident reference: docs/incidents/2026-06-13-beta-digest-survey-404.md (PR #3556 introduced, PR #3558 fixed).

Cross-references

Two-URL Postgres model: docs/ops/runbooks/raptor-db-credentials.md
Postgres role provisioning: docs/ops/runbooks/raptor-postgres-roles.md
Staging Postgres cutover: docs/ops/runbooks/raptor-postgres-staging-cutover.md
Prod Postgres cutover: docs/ops/runbooks/raptor-postgres-prod-cutover.md
Feature flag flip procedure: docs/ops/feature-flags-runbook.md
Required config vars manifest: docs/ops/required-config-vars.yaml
Incident (boot fail): docs/incidents/2026-05-20-staging-webauthn-boot-fail.md
Incident (challenge-store miss): docs/incidents/2026-05-25-signup-challenge-store-miss.md
