
Raptor — Operations Runbook

System: Raptor (raxx-api-prod, raxx-api-staging)
Owner: operator / sre-agent
Last reviewed: 2026-05-29 UTC
Parent issues: #2599 (env-bootstrap checklist), #81 (SDLC hardening epic)

Maintenance rule: Every PR that adds a new flag-gated subsystem to create_app() MUST update the checklist in this file in the same PR. Mirrors the policy in feedback_new_flag_needs_b1_migration_same_pr.


Env-bootstrap checklist

Use this checklist when provisioning a new environment (staging or prod). Work through it linearly. Check off each item as you complete it. Every heroku config:set call must have >/dev/null 2>&1 appended — the CLI echoes the value otherwise (#feedback_heroku_config_set_echoes_secrets).

Command to set a var (safe pattern):

heroku config:set VAR_NAME="<value>" --app <app-name> >/dev/null 2>&1
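When the call is scripted rather than typed, the same rule applies: discard the CLI's stdout and stderr. The following is a minimal Python sketch, assuming the heroku CLI is on PATH. The helper names build_config_set_argv and set_config_var are hypothetical and are not part of any existing Raxx tooling.

```python
import subprocess


def build_config_set_argv(app_name: str, var_name: str, value: str) -> list:
    """Build the argv for `heroku config:set` (hypothetical helper).

    Passing a list avoids the shell, so the value needs no quoting.
    """
    return ["heroku", "config:set", f"{var_name}={value}", "--app", app_name]


def set_config_var(app_name: str, var_name: str, value: str) -> bool:
    """Set one config var and discard all CLI output so the value is not echoed.

    Sending stdout and stderr to DEVNULL is the equivalent of appending
    `>/dev/null 2>&1` in a shell. Returns True when the CLI exits with code 0.
    """
    completed = subprocess.run(
        build_config_set_argv(app_name, var_name, value),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return completed.returncode == 0
```

Because the output is discarded, the return value is the only success signal; a caller would typically stop provisioning as soon as set_config_var returns False.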

Core — always required

Every environment needs these regardless of feature flags.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| SECRET_KEY | SECRET_KEY | Always (fails fast if FLASK_ENV=production and unset) | 64-byte URL-safe token; generate with python3 -c "import secrets; print(secrets.token_urlsafe(64))" | none | /Raxx/Backend/SECRET_KEY |
| FLASK_ENV | n/a (read directly by Flask) | Always | production or development | development | set inline |
| DATABASE_URL | resolved via resolve_runtime_database_url() | Always (Postgres path) | postgres://... (Heroku-managed) | n/a | managed by Heroku |
| FRONTEND_ORIGIN | used in CORS allowlist | Always | comma-separated origins, e.g. https://raxx.app,https://www.raxx.app | http://localhost:3000 | /Raxx/Backend/FRONTEND_ORIGIN |
| ADMIN_SERVICE_TOKEN | read at call time by billing/audit/shadow-GDPR routes | Always | 32+ char random token | none (routes return 501 if unset) | /Raxx/Backend/ADMIN_SERVICE_TOKEN |
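A preflight check can catch an unset core var before the first deploy. The following is a minimal sketch, not Raptor code: missing_core_vars and new_secret_key are hypothetical helpers, and treating a blank value as missing is an assumption.

```python
import secrets

# The five core vars from the table above, in table order.
CORE_VARS = (
    "SECRET_KEY",
    "FLASK_ENV",
    "DATABASE_URL",
    "FRONTEND_ORIGIN",
    "ADMIN_SERVICE_TOKEN",
)


def missing_core_vars(env: dict) -> list:
    """Return the core vars that are unset or blank in `env` (assumption: blank == missing)."""
    return [name for name in CORE_VARS if not str(env.get(name, "")).strip()]


def new_secret_key() -> str:
    """Generate a SECRET_KEY with the same call the table recommends."""
    return secrets.token_urlsafe(64)
```

Passing dict(os.environ) on a one-off dyno, or the parsed output of the config listing, gives a list that should be empty before moving on to the flag-gated sections.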

WebAuthn (FLAG_WEBAUTHN_REGISTRATION)

These three vars are validated at startup by validate_webauthn_config() when FLAG_WEBAUTHN_REGISTRATION=1. Raptor refuses to start if any is empty.

Root cause of the 2026-05-20 staging boot failure (contributing factor 3, see docs/incidents/2026-05-20-staging-webauthn-boot-fail.md) was WEBAUTHN_ORIGIN never being provisioned. These vars are re-pinned from os.environ AFTER from_pyfile() to prevent instance config from silently overwriting them.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| WEBAUTHN_RP_ID | WEBAUTHN_RP_ID | FLAG_WEBAUTHN_REGISTRATION=1 | Relying Party ID; must be the registrable domain, e.g. raxx.app | "" (fails validation if empty) | /Raxx/Backend/WEBAUTHN_RP_ID |
| WEBAUTHN_RP_NAME | WEBAUTHN_RP_NAME | FLAG_WEBAUTHN_REGISTRATION=1 | Human-readable RP name shown in browser passkey dialog, e.g. Raxx | Raxx | /Raxx/Backend/WEBAUTHN_RP_NAME |
| WEBAUTHN_ORIGIN | WEBAUTHN_ORIGIN | FLAG_WEBAUTHN_REGISTRATION=1 | Full origin of the Antlers frontend, e.g. https://raxx.app | "" (fails validation if empty) | /Raxx/Backend/WEBAUTHN_ORIGIN |
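The re-pin-then-validate order described above can be sketched as follows. This is an illustration under stated assumptions, not the actual validate_webauthn_config(): the function names, the plain-dict stand-ins for app.config and os.environ, the use of the table defaults when an env var is unset, and the RuntimeError are all hypothetical.

```python
WEBAUTHN_KEYS = ("WEBAUTHN_RP_ID", "WEBAUTHN_RP_NAME", "WEBAUTHN_ORIGIN")

# Defaults as listed in the table above.
WEBAUTHN_DEFAULTS = {
    "WEBAUTHN_RP_ID": "",
    "WEBAUTHN_RP_NAME": "Raxx",
    "WEBAUTHN_ORIGIN": "",
}


def repin_webauthn_from_env(config: dict, environ: dict) -> None:
    """Overwrite the three keys from the environment.

    Intended to run AFTER from_pyfile(), so a value in the instance config
    cannot silently win over the environment.
    """
    for key in WEBAUTHN_KEYS:
        config[key] = environ.get(key, WEBAUTHN_DEFAULTS[key])


def validate_webauthn_config_sketch(config: dict, environ: dict) -> None:
    """Raise when the flag is on and any of the three values is empty."""
    if environ.get("FLAG_WEBAUTHN_REGISTRATION") != "1":
        return  # flag off: nothing to validate
    empty = [key for key in WEBAUTHN_KEYS if not str(config.get(key, "")).strip()]
    if empty:
        raise RuntimeError("WebAuthn config incomplete: " + ", ".join(empty))
```

With this ordering, an environment that never provisioned WEBAUTHN_ORIGIN fails at startup with the var named in the error, even if the instance config happens to contain a stale value for it.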

Sentry (FLAG_SENTRY_BACKEND)

Raptor's Sentry integration is gated by the sentry_backend flag (FLAG_SENTRY_BACKEND). If the flag is on but SENTRY_DSN_BACKEND is absent, Sentry initialisation is skipped without raising an error (non-fatal).

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| SENTRY_DSN_BACKEND | n/a (read at init time by sentry_init.py) | FLAG_SENTRY_BACKEND=1 | DSN URL from Sentry project settings, e.g. https://<key>@o<org>.ingest.sentry.io/<project> | none (sentry init skipped if absent) | /Raxx/Backend/SENTRY_DSN_BACKEND |

Postmark (FLAG_POSTMARK_INBOUND_TO_FREESCOUT, FLAG_POSTMARK_DELIVERY_MONITOR, and transactional email)

POSTMARK_SERVER_TOKEN is read at call time by multiple services (waitlist email, transactional email, trace integrity alerts, admin notify). Absence means those paths silently skip sending — no fatal error at startup, but emails will not deliver.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| POSTMARK_SERVER_TOKEN | n/a (read from os.environ at call time) | Any email sending path is live | Postmark Server API token from Postmark dashboard | none (email calls silently no-op) | /Raxx/Backend/POSTMARK_SERVER_TOKEN |
| POSTMARK_INBOUND_WEBHOOK_TOKEN | n/a | FLAG_POSTMARK_INBOUND_TO_FREESCOUT=1 | Inbound webhook secret from Postmark | none | /Raxx/Backend/POSTMARK_INBOUND_WEBHOOK_TOKEN |

Database (Postgres role separation — FLAG_RAPTOR_APP_ROLE_SEPARATION)

When the flag is on, Raptor uses RAPTOR_APP_DATABASE_URL (restricted raptor_app role) for all request handlers. DATABASE_URL (owner credential) is then used only by Alembic migrations in the release dyno. See docs/ops/runbooks/raptor-db-credentials.md for the full two-URL model.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| RAPTOR_APP_DATABASE_URL | RUNTIME_DATABASE_URL (after resolution) | FLAG_RAPTOR_APP_ROLE_SEPARATION=1 | postgres://raptor_app:<password>@... (restricted role URL) | falls back to DATABASE_URL if absent | /Raxx/Backend/RAPTOR_APP_DATABASE_URL |
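The two-URL resolution with its fallback can be sketched as below. This is an assumption-laden illustration, not the body of resolve_runtime_database_url(): the function name carries a _sketch suffix, it takes a plain dict instead of reading os.environ, and comparing the flag to the string "1" is assumed.

```python
from typing import Optional


def resolve_runtime_database_url_sketch(environ: dict) -> Optional[str]:
    """Pick the URL that request handlers should use.

    Flag on and restricted URL present -> restricted raptor_app URL.
    Otherwise -> owner DATABASE_URL (the documented fallback).
    """
    flag_on = environ.get("FLAG_RAPTOR_APP_ROLE_SEPARATION") == "1"
    restricted = environ.get("RAPTOR_APP_DATABASE_URL", "").strip()
    if flag_on and restricted:
        return restricted
    return environ.get("DATABASE_URL") or None
```

The fallback keeps the app booting, but it also means a missing RAPTOR_APP_DATABASE_URL leaves request handlers on the owner credential without any error, so it is worth confirming the resolved role after enabling the flag.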

Demo session (FLAG_DEMO_SESSION)

When enabled, Raptor initialises a Redis client and the demo blueprint registers. Fail-closed: if REDIS_URL is absent, demo endpoints degrade gracefully and an error is logged, but startup continues.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| REDIS_URL | n/a (app.extensions["redis_client"]) | FLAG_DEMO_SESSION=1 | redis://... or rediss://... (Heroku Redis add-on URL) | none (demo degrades gracefully) | managed by Heroku Redis add-on |

Broker / market data (FLAG_ALPACA_TRADING, FLAG_ALPACA_MARKET_DATA)

Alpaca keys are read at call time by alpaca_integration.py, alpaca_market_data_service.py, and options.py. Absence causes those routes to return a 503 with a missing_env payload.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| ALPACA_API_KEY | ALPACA_API_KEY (via _read_config_value) | Live trading or market data enabled | Alpaca live-account key ID | none | /Raxx/Backend/ALPACA_API_KEY |
| ALPACA_API_SECRET | ALPACA_API_SECRET | Live trading or market data enabled | Alpaca live-account secret key | none | /Raxx/Backend/ALPACA_API_SECRET |
| ALPACA_BASE_URL | ALPACA_BASE_URL | Live trading | https://api.alpaca.markets (live) | https://api.alpaca.markets | /Raxx/Backend/ALPACA_BASE_URL |
| ALPACA_PAPER_API_KEY | ALPACA_PAPER_API_KEY | Paper trading mode | Alpaca paper-account key ID | falls back to ALPACA_API_KEY | /Raxx/Backend/ALPACA_PAPER_API_KEY |
| ALPACA_PAPER_API_SECRET | ALPACA_PAPER_API_SECRET | Paper trading mode | Alpaca paper-account secret key | falls back to ALPACA_API_SECRET | /Raxx/Backend/ALPACA_PAPER_API_SECRET |
| ALPACA_PAPER_BASE_URL | ALPACA_PAPER_BASE_URL | Paper trading mode | https://paper-api.alpaca.markets | https://paper-api.alpaca.markets | /Raxx/Backend/ALPACA_PAPER_BASE_URL |
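The paper-to-live fallbacks and the 503 behaviour can be sketched as below. Everything here is illustrative: resolve_alpaca_settings and missing_env_response are hypothetical names, the environment is passed as a plain dict, and the exact shape of the missing_env payload is an assumption (the runbook only states that such a payload exists).

```python
from typing import Optional

LIVE_BASE_URL = "https://api.alpaca.markets"
PAPER_BASE_URL = "https://paper-api.alpaca.markets"


def resolve_alpaca_settings(environ: dict, paper: bool) -> dict:
    """Resolve key, secret and base URL using the defaults listed in the table."""
    key = environ.get("ALPACA_API_KEY", "")
    secret = environ.get("ALPACA_API_SECRET", "")
    base_url = environ.get("ALPACA_BASE_URL") or LIVE_BASE_URL
    if paper:
        # Paper vars win when present; otherwise fall back to the live credentials.
        key = environ.get("ALPACA_PAPER_API_KEY") or key
        secret = environ.get("ALPACA_PAPER_API_SECRET") or secret
        base_url = environ.get("ALPACA_PAPER_BASE_URL") or PAPER_BASE_URL
    return {"ALPACA_API_KEY": key, "ALPACA_API_SECRET": secret, "ALPACA_BASE_URL": base_url}


def missing_env_response(settings: dict) -> Optional[tuple]:
    """Return (payload, 503) when a credential is empty, else None.

    The payload shape {"missing_env": [...]} is assumed for illustration.
    """
    missing = sorted(name for name, value in settings.items() if not value)
    if not missing:
        return None
    return {"missing_env": missing}, 503
```

One consequence of the fallback is that paper mode with only live keys provisioned sends live credentials to the paper base URL, so provisioning the paper pair explicitly is the safer setup.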

Session auth middleware (FLAG_SESSION_AUTH_MIDDLEWARE)

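Warning: Do NOT enable this flag on api.raxx.app while Cloudflare Access is still in front of the API origin. Enabling it with CF Access in place causes double-auth and blocks all authenticated requests. CF Access MUST nevertheless remain active for as long as the flag is off. Because each condition depends on the other, treat this as a single cutover: remove CF Access from the API origin and turn the flag on in the same change.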

Env vars: n/a (cookie-based, no additional env vars).

Status 3P poller (FLAG_STATUS_3P_POLLER)

When enabled, a background daemon thread polls upstream partner status pages and writes results to the D1 status DB via STATUS_WORKER_URL.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| STATUS_WORKER_URL | n/a (read from os.environ at call time) | FLAG_STATUS_3P_POLLER=1 | Base URL of the CF Worker, e.g. https://status.raxx.app | none (poller logs error and no-ops if absent) | /Raxx/Backend/STATUS_WORKER_URL |

WebAuthn challenge-miss alerting (FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT)

Requires Sentry to already be initialised (i.e., FLAG_SENTRY_BACKEND=1 and SENTRY_DSN_BACKEND set). Installs a log handler that captures Redis challenge-miss events and forwards them to Sentry.


gunicorn --preload + module-level mutable state — Raxx policy

Policy added 2026-05-29 UTC. Closes action item #3 from RCA docs/incidents/2026-05-25-signup-challenge-store-miss.md.

Why --preload is on

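Raptor runs gunicorn with --preload (Heroku Procfile: web: gunicorn --preload ...). Preload imports the entire WSGI application once in the master process before forking worker processes. The import cost is therefore paid once rather than once per worker, and workers share the already-imported memory pages copy-on-write, which shortens boot time and lowers memory use.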

Preload is the correct default for a Python web app on Heroku with a non-trivial import tree. Do not remove --preload to "fix" a mutable-state bug — fix the state instead.

What goes wrong with module-level mutable state

When gunicorn forks, each worker process gets a copy-on-write clone of the master's memory. The first write to any object in a worker creates a private copy for that worker only. Other workers and the master process do not see that write.

Concretely:

# webauthn_service.py (BEFORE the 2026-05-25 fix)
_challenge_store: dict[int, str] = {}   # module-level, initialized once in master

def _store_challenge(user_id: int, challenge: str) -> None:
    _challenge_store[user_id] = challenge   # writes to *this worker's* private copy

def _pop_challenge(user_id: int) -> str | None:
    return _challenge_store.pop(user_id, None)  # looks up in *this worker's* private copy

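If the HTTP request that calls _store_challenge is handled by worker A, and the next HTTP request that calls _pop_challenge is handled by worker B, _pop_challenge returns None. The challenge was written to worker A's private copy of the dict; worker B's copy is still the empty dict inherited from the master. The same miss occurs when the second request lands on a worker that belongs to a different gunicorn master, for example during the brief overlap around dyno lifecycle events. The master itself never serves requests.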

This was the root cause of the 2026-05-25 SEV-1 (_challenge_store miss on every signup attempt, PR #2728). The same failure mode also explains the trading_mode_override regression fixed in PR #3030: a process-global dict keyed on user ID was mutated per request; under preload, concurrent requests in different workers each had a divergent view of that dict.

The rule

Module-level mutable state that must be consistent across processes is forbidden in Raptor.

| State type | Acceptable storage | Not acceptable |
| --- | --- | --- |
| Per-request ephemeral data (auth token from header, parsed body) | Local variable or flask.g | Module-level dict |
| Per-session transient data (WebAuthn challenge, CSRF nonce) | Redis (TTL-keyed) | Module-level dict |
| Per-user persistent data (trading mode, profile, preferences) | Postgres row | Module-level dict, app.extensions["..."] dict |
| Read-mostly app config (RP ID, feature flags read at startup) | app.config or app.extensions["..."], read-only after create_app() | Any post-fork mutation |
| Background worker state | Redis or Postgres | Module-level variable mutated per task |

Immutable is fine. Module-level constants, compiled regex patterns, read-only lookup tables, and frozen dataclasses are safe under --preload because they are never written after the fork.

app.extensions is safe for read-only objects. A Redis client, a Sentry client, or a database engine stored in app.extensions["redis_client"] is initialized once in create_app() (before fork) and used read-only in workers. This is the correct pattern. Do not store mutable per-user or per-request state there.
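One optional way to make the read-only expectation enforceable for lookup tables and startup-time settings is to hand out a read-only view. This is a sketch of a general Python technique using types.MappingProxyType, not something the Raptor codebase is known to do; freeze_settings is a hypothetical helper.

```python
from types import MappingProxyType


def freeze_settings(settings: dict) -> MappingProxyType:
    """Return a read-only view over a private copy of `settings`.

    Copying first means later changes to the caller's dict cannot leak in;
    the proxy itself rejects item assignment with TypeError.
    """
    return MappingProxyType(dict(settings))


# Example: built once in create_app(), before the fork, and only read afterwards.
STARTUP_FLAGS = freeze_settings({"webauthn_registration": True, "sentry_backend": False})
```

A worker that tries STARTUP_FLAGS["sentry_backend"] = True then fails loudly with TypeError instead of silently diverging from its siblings.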

Required pattern for transient state

Use Redis with a TTL. The post-fix webauthn_service.py is the canonical example:

# webauthn_service.py (AFTER PR #2728, https://github.com/raxx-app/TradeMasterAPI/pull/2728)
import logging

from flask import Flask

logger = logging.getLogger(__name__)

_CHALLENGE_TTL = 60  # seconds

# Fallback store for local dev / CI only, where a single process serves every request.
_challenge_store_fallback: dict[int, str] = {}


def _store_challenge(app: Flask, user_id: int, challenge: str) -> None:
    redis = app.extensions.get("redis_client")
    if redis:
        # SET with EX writes the value and its TTL in one command.
        redis.set(f"webauthn:reg:challenge:{user_id}", challenge, ex=_CHALLENGE_TTL)
        logger.info("webauthn.challenge_stored store=redis user_id=%s", user_id)
    else:
        # Local dev / CI: Redis absent, fall back to in-process dict
        _challenge_store_fallback[user_id] = challenge
        logger.info("webauthn.challenge_stored store=local user_id=%s", user_id)


def _pop_challenge(app: Flask, user_id: int) -> str | None:
    redis = app.extensions.get("redis_client")
    if not redis:
        return _challenge_store_fallback.pop(user_id, None)
    # GETDEL reads and deletes atomically, so a challenge can be consumed only once.
    val = redis.getdel(f"webauthn:reg:challenge:{user_id}")
    if val is None:
        logger.warning("webauthn.challenge_miss store=redis user_id=%s", user_id)
        return None
    logger.info("webauthn.challenge_popped store=redis user_id=%s", user_id)
    # redis-py returns bytes unless the client was created with decode_responses=True.
    return val.decode() if isinstance(val, bytes) else val

Key properties of this pattern:

  1. Atomic pop. GETDEL reads and deletes in a single Redis command. No race between read and delete.
  2. TTL-bounded. A challenge that is never popped (e.g., user abandons signup) expires in 60 seconds. No accumulation of stale state.
  3. Positive diagnostic signal. webauthn.challenge_miss is a named log event, not a silent None return. Sentry (via FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT) captures it.
  4. Local-dev fallback. When REDIS_URL is absent (local dev, CI), the code falls back to an in-process dict. This is acceptable because a single-process dev server does not fork.
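The first two properties can be exercised in a unit test without a Redis server. The following is a minimal sketch of an in-memory stand-in that implements only the two calls the pattern uses, set(..., ex=...) and getdel(...). FakeRedis and its injectable clock are hypothetical test helpers, not part of redis-py or of the Raptor test suite.

```python
import time
from typing import Callable, Optional


class FakeRedis:
    """Test-only stand-in for the two Redis calls used by the challenge store."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data = {}  # key -> (value as bytes, expiry timestamp)

    def set(self, key: str, value: str, ex: int) -> bool:
        """Store `value` with a TTL of `ex` seconds, like SET key value EX ex."""
        self._data[key] = (value.encode(), self._clock() + ex)
        return True

    def getdel(self, key: str) -> Optional[bytes]:
        """Return and remove the value, or None if absent or expired."""
        value, expires_at = self._data.pop(key, (None, 0.0))
        if value is None or self._clock() >= expires_at:
            return None
        return value


if __name__ == "__main__":
    now = [0.0]
    fake = FakeRedis(clock=lambda: now[0])
    fake.set("webauthn:reg:challenge:7", "abc", ex=60)
    print(fake.getdel("webauthn:reg:challenge:7"))  # b'abc'
    print(fake.getdel("webauthn:reg:challenge:7"))  # None, already consumed
```

Injecting the clock lets a test advance time past the TTL instead of sleeping for 60 seconds. The stand-in returns bytes to mirror a redis-py client created without decode_responses=True.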

Code-review checklist

When reviewing any PR that adds new state to a Raptor service:

Incidents that established this policy

| Incident | Date | PR | Bug class |
| --- | --- | --- | --- |
| WebAuthn challenge-store miss (every signup fails) | 2026-05-25 | PR #2728 | Module-level dict, challenge written to one worker's CoW page, verify called on different worker |
| Per-user trading_mode_override process-global dict | (see PR #3030) | PR #3030 | Module-level dict keyed on user_id, mutations diverged across worker processes under preload |

Cross-references