Raptor — Operations Runbook

System: Raptor (raxx-api-prod, raxx-api-staging)
Owner: operator / sre-agent
Last reviewed: 2026-06-12 UTC
Parent issues: #2599 (env-bootstrap checklist), #81 (SDLC hardening epic)

Maintenance rule: Every PR that adds a new flag-gated subsystem to create_app() MUST update the checklist in this file in the same PR. Mirrors the policy in feedback_new_flag_needs_b1_migration_same_pr.


Env-bootstrap checklist

Use this checklist when provisioning a new environment (staging or prod). Work through it linearly. Check off each item as you complete it. Every heroku config:set call must have >/dev/null 2>&1 appended — the CLI echoes the value otherwise (#feedback_heroku_config_set_echoes_secrets).

Command to set a var (safe pattern):

heroku config:set VAR_NAME="<value>" --app <app-name> >/dev/null 2>&1

Core — always required

Every environment needs these regardless of feature flags.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| SECRET_KEY | SECRET_KEY | Always (fails fast if FLASK_ENV=production and unset) | 64-byte URL-safe token; generate with python3 -c "import secrets; print(secrets.token_urlsafe(64))" | none | /Raxx/Backend/SECRET_KEY |
| FLASK_ENV | n/a (read directly by Flask) | Always | production or development | development | set inline |
| DATABASE_URL | resolved via resolve_runtime_database_url() | Always (Postgres path) | postgres://... (Heroku-managed) | n/a | managed by Heroku |
| FRONTEND_ORIGIN | used in CORS allowlist | Always | comma-separated origins, e.g. https://raxx.app,https://www.raxx.app | http://localhost:3000 | /Raxx/Backend/FRONTEND_ORIGIN |
| ADMIN_SERVICE_TOKEN | read at call time by billing/audit/shadow-GDPR routes | Always | 32+ char random token | none (routes return 501 if unset) | /Raxx/Backend/ADMIN_SERVICE_TOKEN |
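
The core checklist can be checked mechanically before first boot. The following is a minimal sketch, not part of Raptor: the helper name missing_core_vars and the CORE_VARS tuple are hypothetical, and the only assumption is that the five vars listed above must be non-blank. It prints names only, never values.

```python
import os
from typing import Mapping

# Core vars from the checklist; every environment needs all of them.
CORE_VARS = (
    "SECRET_KEY",
    "FLASK_ENV",
    "DATABASE_URL",
    "FRONTEND_ORIGIN",
    "ADMIN_SERVICE_TOKEN",
)


def missing_core_vars(env: Mapping[str, str]) -> list[str]:
    """Return the core var names that are unset or blank in `env`."""
    return [name for name in CORE_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_core_vars(os.environ)
    # Report names only so that secret values never reach the terminal.
    print("missing core vars:", ", ".join(missing) if missing else "none")
```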

WebAuthn (FLAG_WEBAUTHN_REGISTRATION)

These three vars are validated at startup by validate_webauthn_config() when FLAG_WEBAUTHN_REGISTRATION=1. Raptor refuses to start if any is empty.

Root cause of the 2026-05-20 staging boot failure (contributing factor 3, see docs/incidents/2026-05-20-staging-webauthn-boot-fail.md) was WEBAUTHN_ORIGIN never being provisioned. These vars are re-pinned from os.environ AFTER from_pyfile() to prevent instance config from silently overwriting them.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| WEBAUTHN_RP_ID | WEBAUTHN_RP_ID | FLAG_WEBAUTHN_REGISTRATION=1 | Relying Party ID; must be the registrable domain, e.g. raxx.app | "" (fails validation if empty) | /Raxx/Backend/WEBAUTHN_RP_ID |
| WEBAUTHN_RP_NAME | WEBAUTHN_RP_NAME | FLAG_WEBAUTHN_REGISTRATION=1 | Human-readable RP name shown in browser passkey dialog, e.g. Raxx | Raxx | /Raxx/Backend/WEBAUTHN_RP_NAME |
| WEBAUTHN_ORIGIN | WEBAUTHN_ORIGIN | FLAG_WEBAUTHN_REGISTRATION=1 | Full origin of the Antlers frontend, e.g. https://raxx.app | "" (fails validation if empty) | /Raxx/Backend/WEBAUTHN_ORIGIN |
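
The re-pin-then-validate order described above can be sketched as follows. This is an illustration under assumptions, not the real code: repin_from_env and validate_webauthn are hypothetical stand-ins (the latter for validate_webauthn_config()), and a plain dict stands in for app.config.

```python
from typing import Mapping, MutableMapping

WEBAUTHN_VARS = ("WEBAUTHN_RP_ID", "WEBAUTHN_RP_NAME", "WEBAUTHN_ORIGIN")


def repin_from_env(config: MutableMapping[str, str], env: Mapping[str, str]) -> None:
    """Copy WebAuthn vars from the environment over whatever instance config set."""
    for name in WEBAUTHN_VARS:
        if name in env:
            config[name] = env[name]


def validate_webauthn(config: Mapping[str, str]) -> None:
    """Raise if any WebAuthn var is missing or empty, so startup fails fast."""
    empty = [name for name in WEBAUTHN_VARS if not config.get(name)]
    if empty:
        raise RuntimeError("WebAuthn config incomplete: " + ", ".join(empty))


# Instance config carried a stale RP ID and an empty origin; the environment wins.
cfg = {"WEBAUTHN_RP_ID": "stale.example", "WEBAUTHN_RP_NAME": "Raxx", "WEBAUTHN_ORIGIN": ""}
repin_from_env(cfg, {"WEBAUTHN_RP_ID": "raxx.app", "WEBAUTHN_ORIGIN": "https://raxx.app"})
validate_webauthn(cfg)  # passes only because the re-pin ran first
```

Running the validation before the re-pin would have raised on the empty origin, which is why the order matters.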

Sentry (FLAG_SENTRY_BACKEND)

Raptor's Sentry integration is gated by the sentry_backend flag. If the flag is on but SENTRY_DSN_BACKEND is absent, Sentry initialisation is skipped silently (non-fatal).

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| SENTRY_DSN_BACKEND | n/a (read at init time by sentry_init.py) | FLAG_SENTRY_BACKEND=1 | DSN URL from Sentry project settings, e.g. https://<key>@o<org>.ingest.sentry.io/<project> | none (sentry init skipped if absent) | /Raxx/Backend/SENTRY_DSN_BACKEND |

Postmark (FLAG_POSTMARK_INBOUND_TO_FREESCOUT, FLAG_POSTMARK_DELIVERY_MONITOR, and transactional email)

POSTMARK_SERVER_TOKEN is read at call time by multiple services (waitlist email, transactional email, trace integrity alerts, admin notify). Absence means those paths silently skip sending — no fatal error at startup, but emails will not deliver.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| POSTMARK_SERVER_TOKEN | n/a (read from os.environ at call time) | Any email sending path is live | Postmark Server API token from Postmark dashboard | none (email calls silently no-op) | /Raxx/Backend/POSTMARK_SERVER_TOKEN |
| POSTMARK_INBOUND_WEBHOOK_TOKEN | n/a | FLAG_POSTMARK_INBOUND_TO_FREESCOUT=1 | Inbound webhook secret from Postmark | none | /Raxx/Backend/POSTMARK_INBOUND_WEBHOOK_TOKEN |

Database (Postgres role separation — FLAG_RAPTOR_APP_ROLE_SEPARATION)

When the flag is on, Raptor uses RAPTOR_APP_DATABASE_URL (restricted raptor_app role) for all request handlers. DATABASE_URL (owner credential) is then used only by Alembic migrations in the release dyno. See docs/ops/runbooks/raptor-db-credentials.md for the full two-URL model.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| RAPTOR_APP_DATABASE_URL | RUNTIME_DATABASE_URL (after resolution) | FLAG_RAPTOR_APP_ROLE_SEPARATION=1 | postgres://raptor_app:<password>@... (restricted role URL) | falls back to DATABASE_URL if absent | /Raxx/Backend/RAPTOR_APP_DATABASE_URL |
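
The two-URL fallback can be pictured with a short sketch. The function resolve_runtime_url below is a hypothetical stand-in for resolve_runtime_database_url(), whose real signature and behaviour may differ; it only encodes the rule stated above (restricted URL when the flag is on, owner URL otherwise or as fallback).

```python
from typing import Mapping


def resolve_runtime_url(env: Mapping[str, str], role_separation_on: bool) -> str:
    """Pick the database URL that request handlers should use."""
    owner_url = env.get("DATABASE_URL", "")
    if role_separation_on:
        # Prefer the restricted raptor_app role; fall back to the owner URL if absent.
        return env.get("RAPTOR_APP_DATABASE_URL") or owner_url
    return owner_url


env = {
    "DATABASE_URL": "postgres://owner@db/raxx",
    "RAPTOR_APP_DATABASE_URL": "postgres://raptor_app@db/raxx",
}
runtime_url = resolve_runtime_url(env, role_separation_on=True)
```

Note that the fallback means a missing RAPTOR_APP_DATABASE_URL does not fail startup; request handlers would simply keep running with the owner credential, so it is worth checking the var explicitly when turning the flag on.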

Demo session (FLAG_DEMO_SESSION)

When enabled, Raptor initialises a Redis client and the demo blueprint registers. Fail-closed: if REDIS_URL is absent, demo endpoints degrade gracefully and an error is logged, but startup continues.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| REDIS_URL | n/a (app.extensions["redis_client"]) | FLAG_DEMO_SESSION=1 | redis://... or rediss://... (Heroku Redis add-on URL) | none (demo degrades gracefully) | managed by Heroku Redis add-on |

Staging posture (2026-06-14): raxx-api-staging does not have a Redis add-on attached and REDIS_URL is absent. This is intentional: FLAG_DEMO_SESSION and FLAG_RATE_LIMIT_STORAGE_REDIS are both OFF on staging. The WebAuthn challenge store falls back to the in-process dict (safe for a single-worker staging dyno). The rate limiter falls back to in-memory. No consumer hard-fails at startup. The nightly config-health check treats staging REDIS_URL as severity: warning rather than critical to reflect this posture. Activate Redis on staging by running:

heroku addons:create heroku-redis:mini --app raxx-api-staging

Only needed when FLAG_DEMO_SESSION=1 is to be exercised on staging. Set the flag only AFTER confirming REDIS_URL is populated (add-on provisioning can take ~30s).


Broker / market data (FLAG_ALPACA_TRADING, FLAG_ALPACA_MARKET_DATA)

Alpaca keys are read at call time by alpaca_integration.py, alpaca_market_data_service.py, and options.py. Absence causes those routes to return a 503 with a missing_env payload.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| ALPACA_API_KEY | ALPACA_API_KEY (via _read_config_value) | Live trading or market data enabled | Alpaca live-account key ID | none | /Raxx/Backend/ALPACA_API_KEY |
| ALPACA_API_SECRET | ALPACA_API_SECRET | Live trading or market data enabled | Alpaca live-account secret key | none | /Raxx/Backend/ALPACA_API_SECRET |
| ALPACA_BASE_URL | ALPACA_BASE_URL | Live trading | https://api.alpaca.markets (live) | https://api.alpaca.markets | /Raxx/Backend/ALPACA_BASE_URL |
| ALPACA_PAPER_API_KEY | ALPACA_PAPER_API_KEY | Paper trading mode | Alpaca paper-account key ID | falls back to ALPACA_API_KEY | /Raxx/Backend/ALPACA_PAPER_API_KEY |
| ALPACA_PAPER_API_SECRET | ALPACA_PAPER_API_SECRET | Paper trading mode | Alpaca paper-account secret key | falls back to ALPACA_API_SECRET | /Raxx/Backend/ALPACA_PAPER_API_SECRET |
| ALPACA_PAPER_BASE_URL | ALPACA_PAPER_BASE_URL | Paper trading mode | https://paper-api.alpaca.markets | https://paper-api.alpaca.markets | /Raxx/Backend/ALPACA_PAPER_BASE_URL |
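
The fallback rules in the table, and the missing_env list that feeds the 503 payload, can be sketched as below. The function resolve_alpaca and its return shape are hypothetical; the real lookup goes through _read_config_value and may behave differently.

```python
from typing import Mapping

LIVE_URL = "https://api.alpaca.markets"
PAPER_URL = "https://paper-api.alpaca.markets"


def resolve_alpaca(env: Mapping[str, str], paper: bool) -> tuple[dict[str, str], list[str]]:
    """Return (credentials, missing_env) following the table's fallback rules."""
    if paper:
        # Paper vars fall back to the live-account vars when unset.
        key = env.get("ALPACA_PAPER_API_KEY") or env.get("ALPACA_API_KEY", "")
        secret = env.get("ALPACA_PAPER_API_SECRET") or env.get("ALPACA_API_SECRET", "")
        base_url = env.get("ALPACA_PAPER_BASE_URL") or PAPER_URL
        names = ("ALPACA_PAPER_API_KEY", "ALPACA_PAPER_API_SECRET")
    else:
        key = env.get("ALPACA_API_KEY", "")
        secret = env.get("ALPACA_API_SECRET", "")
        base_url = env.get("ALPACA_BASE_URL") or LIVE_URL
        names = ("ALPACA_API_KEY", "ALPACA_API_SECRET")
    missing = [name for name, value in zip(names, (key, secret)) if not value]
    return {"key": key, "secret": secret, "base_url": base_url}, missing


creds, missing = resolve_alpaca({"ALPACA_API_KEY": "k", "ALPACA_API_SECRET": "s"}, paper=True)
```

A route using this sketch would return 503 with {"missing_env": missing} whenever the list is non-empty.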

Session auth middleware (FLAG_SESSION_AUTH_MIDDLEWARE)

Warning: Do NOT enable this flag on api.raxx.app while Cloudflare Access is still active on the API origin. Running both at once causes double-auth and blocks all authenticated requests. CF Access must otherwise remain active for as long as the flag is off, so perform the CF Access removal and the flag flip back-to-back in a single change window.

Env vars: none. Session auth is cookie-based and needs no additional env vars.

Status 3P poller (FLAG_STATUS_3P_POLLER)

When enabled, a background daemon thread polls upstream partner status pages and writes results to the D1 status DB via STATUS_WORKER_URL.

| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
| --- | --- | --- | --- | --- | --- |
| STATUS_WORKER_URL | n/a (read from os.environ at call time) | FLAG_STATUS_3P_POLLER=1 | Base URL of the CF Worker, e.g. https://status.raxx.app | none (poller logs error and no-ops if absent) | /Raxx/Backend/STATUS_WORKER_URL |

WebAuthn challenge-miss alerting (FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT)

Requires Sentry to already be initialised (i.e., FLAG_SENTRY_BACKEND=1 and SENTRY_DSN_BACKEND set). Installs a log handler that captures Redis challenge-miss events and forwards them to Sentry.
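
A log handler of this kind might look like the sketch below. The class name ChallengeMissHandler and the injected capture callable are assumptions for illustration; in production the callable would plausibly be sentry_sdk.capture_message, but the actual Raptor handler is not shown in this runbook.

```python
import logging
from typing import Callable


class ChallengeMissHandler(logging.Handler):
    """Forward webauthn.challenge_miss log records to a capture callable."""

    def __init__(self, capture: Callable[[str], None]) -> None:
        super().__init__(level=logging.WARNING)
        self._capture = capture

    def emit(self, record: logging.LogRecord) -> None:
        message = record.getMessage()
        # Only the named miss event is forwarded; other warnings pass through untouched.
        if message.startswith("webauthn.challenge_miss"):
            self._capture(message)


captured: list[str] = []
demo_logger = logging.getLogger("webauthn_demo")
demo_logger.addHandler(ChallengeMissHandler(captured.append))
demo_logger.warning("webauthn.challenge_miss store=redis user_id=%s", 42)
demo_logger.warning("unrelated warning")
```

Injecting the capture callable keeps the handler testable without a Sentry DSN.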


gunicorn --preload + module-level mutable state — Raxx policy

Policy added 2026-05-29 UTC. Closes action item #3 from RCA docs/incidents/2026-05-25-signup-challenge-store-miss.md.

Why --preload is on

Raptor runs gunicorn with --preload (Heroku Procfile: web: gunicorn --preload ...). Preload imports the entire WSGI application once in the master process before forking worker processes. This produces:

Preload is the correct default for a Python web app on Heroku with a non-trivial import tree. Do not remove --preload to "fix" a mutable-state bug — fix the state instead.

What goes wrong with module-level mutable state

When gunicorn forks, each worker process gets a copy-on-write clone of the master's memory. The first write to any object in a worker creates a private copy for that worker only. Other workers and the master process do not see that write.

Concretely:

# webauthn_service.py (BEFORE the 2026-05-25 fix)
_challenge_store: dict[int, str] = {}   # module-level, initialized once in master

def _store_challenge(user_id: int, challenge: str) -> None:
    _challenge_store[user_id] = challenge   # writes to *this worker's* private copy

def _pop_challenge(user_id: int) -> str | None:
    return _challenge_store.pop(user_id, None)  # looks up in *this worker's* private copy

If the HTTP request that calls _store_challenge is handled by worker A, and the next HTTP request that calls _pop_challenge is handled by worker B (or by a worker in a different process tree during a brief overlap in dyno lifecycle events), _pop_challenge returns None. The challenge was written to worker A's private copy; worker B's copy still holds the empty initial dict.

This was the root cause of the 2026-05-25 SEV-1 (_challenge_store miss on every signup attempt, PR #2728). The same failure mode also explains the trading_mode_override regression fixed in PR #3030: a process-global dict keyed on user ID was mutated per request; under preload, concurrent requests in different workers each had a divergent view of that dict.
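
The divergence is easy to reproduce outside gunicorn. The sketch below is a standalone demonstration, not Raptor code: it uses multiprocessing's fork start method (POSIX only) as a stand-in for gunicorn's pre-fork model, and the names _store and demo are hypothetical.

```python
import multiprocessing as mp

_store: dict[int, str] = {}  # module-level, like the pre-fix _challenge_store


def _worker_write(conn) -> None:
    """Run in the forked child: write to the dict and report what the child sees."""
    _store[1] = "challenge-abc"  # lands in the child's private copy only
    conn.send(dict(_store))
    conn.close()


def demo() -> tuple[dict[int, str], dict[int, str]]:
    """Return (child's view, parent's view) of _store after the child writes."""
    ctx = mp.get_context("fork")  # POSIX only; mirrors a pre-fork server
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=_worker_write, args=(child_conn,))
    proc.start()
    child_view = parent_conn.recv()
    proc.join()
    return child_view, dict(_store)


if __name__ == "__main__":
    child_view, parent_view = demo()
    print("child sees:", child_view)
    print("parent sees:", parent_view)
```

The child reports the challenge it just stored, while the parent's dict stays empty. A sibling worker forked from the same parent would see the same empty dict.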

The rule

Module-level mutable state that must be consistent across processes is forbidden in Raptor.

| State type | Acceptable storage | Not acceptable |
| --- | --- | --- |
| Per-request ephemeral data (auth token from header, parsed body) | Local variable or flask.g | Module-level dict |
| Per-session transient data (WebAuthn challenge, CSRF nonce) | Redis (TTL-keyed) | Module-level dict |
| Per-user persistent data (trading mode, profile, preferences) | Postgres row | Module-level dict, app.extensions["..."] dict |
| Read-mostly app config (RP ID, feature flags read at startup) | app.config or app.extensions["..."] (read-only after create_app()) | Any post-fork mutation |
| Background worker state | Redis or Postgres | Module-level variable mutated per task |

Immutable is fine. Module-level constants, compiled regex patterns, read-only lookup tables, and frozen dataclasses are safe under --preload because they are never written after the fork.

app.extensions is safe for read-only objects. A Redis client, a Sentry client, or a database engine stored in app.extensions["redis_client"] is initialized once in create_app() (before fork) and used read-only in workers. This is the correct pattern. Do not store mutable per-user or per-request state there.

Required pattern for transient state

Use Redis with a TTL. The post-fix webauthn_service.py is the canonical example:

# webauthn_service.py (AFTER [PR #2728](https://github.com/raxx-app/TradeMasterAPI/pull/2728))
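import logging

from flask import Flask

logger = logging.getLogger(__name__)

_CHALLENGE_TTL = 60  # seconds
# In-process fallback, used only when Redis is absent (local dev / CI)
_challenge_store_fallback: dict[int, str] = {}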

def _store_challenge(app: Flask, user_id: int, challenge: str) -> None:
    redis = app.extensions.get("redis_client")
    if redis:
        redis.set(f"webauthn:reg:challenge:{user_id}", challenge, ex=_CHALLENGE_TTL)
        logger.info("webauthn.challenge_stored store=redis user_id=%s", user_id)
    else:
        # Local dev / CI: Redis absent, fall back to in-process dict
        _challenge_store_fallback[user_id] = challenge
        logger.info("webauthn.challenge_stored store=local user_id=%s", user_id)

def _pop_challenge(app: Flask, user_id: int) -> str | None:
    redis = app.extensions.get("redis_client")
    if redis:
        val = redis.getdel(f"webauthn:reg:challenge:{user_id}")
        if val is None:
            logger.warning("webauthn.challenge_miss store=redis user_id=%s", user_id)
        else:
            logger.info("webauthn.challenge_popped store=redis user_id=%s", user_id)
        return val.decode() if val else None
    else:
        return _challenge_store_fallback.pop(user_id, None)

Key properties of this pattern:

  1. Atomic pop. GETDEL reads and deletes in a single Redis command. No race between read and delete.
  2. TTL-bounded. A challenge that is never popped (e.g., user abandons signup) expires in 60 seconds. No accumulation of stale state.
  3. Positive diagnostic signal. webauthn.challenge_miss is a named log event, not a silent None return. Sentry (via FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT) captures it.
  4. Local-dev fallback. When REDIS_URL is absent (local dev, CI), the code falls back to an in-process dict. This is acceptable because a single-process dev server does not fork.
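
To unit-test the miss and expiry paths without a Redis server, an in-memory stand-in that exposes only the two calls used above is usually enough. The FakeRedis class below is a hypothetical test double, not part of Raptor or of redis-py; it takes an injectable clock so that TTL expiry can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class FakeRedis:
    """In-memory stand-in exposing only set(..., ex=) and getdel(); for tests."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: dict[str, tuple[bytes, Optional[float]]] = {}

    def set(self, key: str, value: str, ex: Optional[int] = None) -> None:
        expires_at = self._clock() + ex if ex is not None else None
        self._data[key] = (value.encode(), expires_at)

    def getdel(self, key: str) -> Optional[bytes]:
        # Pop first so the key is gone whether or not it had already expired.
        entry = self._data.pop(key, None)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            return None
        return value


now = [0.0]
fake = FakeRedis(clock=lambda: now[0])
fake.set("webauthn:reg:challenge:7", "abc", ex=60)
first_pop = fake.getdel("webauthn:reg:challenge:7")   # value is returned once
second_pop = fake.getdel("webauthn:reg:challenge:7")  # second pop is a miss
```

Like a real client without decode_responses, the stand-in returns bytes, which matches the val.decode() call in _pop_challenge.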

Code-review checklist

When reviewing any PR that adds new state to a Raptor service:

Incidents that established this policy

| Incident | Date | PR | Bug class |
| --- | --- | --- | --- |
| WebAuthn challenge-store miss (every signup fails) | 2026-05-25 | PR #2728 | Module-level dict, challenge written to one worker's CoW page, verify called on different worker |
| Per-user trading_mode_override process-global dict | (see PR #3030) | PR #3030 | Module-level dict keyed on user_id, mutations diverged across worker processes under preload |


SQLite-vs-Postgres test divergence (known failure class)

Raptor's unit tests run against in-memory SQLite via _make_engine() fixtures that patch _get_conn() directly. This is fast and self-contained, but SQLite and Postgres each accept SQL constructs that the other rejects. Whenever a write path passes tests but returns a 500 in prod, check this list first.

Known divergence points

| Construct | SQLite behaviour | Postgres behaviour | Safe alternative |
| --- | --- | --- | --- |
| MAX(col, val) in a SET clause | Accepted: multi-argument MAX is a scalar function in SQLite and returns the larger value | Rejected: function max(integer, integer) does not exist | Read current value → Python max(current, new) → pass as bound param |
| GREATEST(a, b) | Rejected (no GREATEST function) | Accepted | Same: Python max() with bound param |
| DO $$ ... $$ PL/pgSQL blocks | Rejected | Accepted | Mark with -- POSTGRES-ONLY sentinel; use if bind.dialect.name == 'postgresql': guard |
| RENAME COLUMN (DDL) | Accepted in SQLite 3.25+ | Accepted | Safe on both |
| now() as server_default | Rejected | Accepted | Use sa.text("now()") gated on if bind.dialect.name == 'postgresql': |
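
The first row is the one that bites most often, and it can be reproduced with the standard library alone. The snippet below is a standalone illustration; the progress table and step column are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE progress (pk INTEGER PRIMARY KEY, step INTEGER)")
conn.execute("INSERT INTO progress VALUES (1, 3)")

# SQLite treats two-argument MAX as a scalar function, so this UPDATE succeeds
# and the unit test passes. Postgres only has the aggregate max(), so it
# rejects the same statement at execution time.
conn.execute("UPDATE progress SET step = MAX(step, ?) WHERE pk = 1", (5,))
step_after = conn.execute("SELECT step FROM progress WHERE pk = 1").fetchone()[0]
print(step_after)
```

Because the SQLite run is green and returns the expected value, nothing in the unit-test suite signals the problem; only a Postgres-backed run would.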

Rule for advance-progress patterns

Any UPDATE that increments or ceiling-caps a counter column must follow this pattern (from beta_walkthrough.py, established 2026-04-xx):

from sqlalchemy import text

# Read current value
row = conn.execute(text("SELECT col FROM table WHERE pk = :pk"), {"pk": pk}).fetchone()
current = row[0] if row else 0
# Compute in Python — never use MAX(col, val) in the SET clause
conn.execute(
    text("UPDATE table SET col = :new_val WHERE pk = :pk"),
    {"new_val": max(current, new_value), "pk": pk},
)

Incidents from this class

| Incident | Date | PR | Failure |
| --- | --- | --- | --- |
| beta_preview POST /screen/1 500 | 2026-06-12 | PR #3540 | MAX(col, val) in SET clause: SQLite silent, Postgres hard error |
| beta_nda_ack table missing in Raptor | 2026-06-10 | PR #3454 | Table created in Console migration only; no Raptor alembic migration |
| beta_walkthrough tables missing in Raptor | 2026-06-12 | PR #3517 | Same pattern as 2026-06-10 |

Blueprint authoring convention — url_prefix MUST NOT include /api

This is a common footgun. The _core_blueprints registration loop in api/__init__.py (~line 370) prepends /api to every blueprint's prefix:

for bp in _core_blueprints:
    prefixed_url = f"/api{bp.url_prefix}" if bp.url_prefix else "/api"
    app.register_blueprint(bp, url_prefix=prefixed_url)

All blueprints in _core_blueprints MUST define their url_prefix without the leading /api. Examples:

| Blueprint | Correct prefix | Final route |
| --- | --- | --- |
| internal_waf_events | /internal/waf-events | /api/internal/waf-events |
| freescout_audit_webhook | /internal/freescout-webhook | /api/internal/freescout-webhook |
| internal_beta_survey | /internal/beta-survey-data | /api/internal/beta-survey-data |

If a blueprint accidentally includes /api in its prefix (e.g., /api/internal/foo), the route is registered at /api/api/internal/foo and every request falls through to the serve_static catch-all → 404 {"error":"React build not found..."}.

Symptom: raptor_survey_client: Raptor returned status=404 error=React build not found in Console logs, and matching 404s in Raptor access logs for a route that should exist.

Fix: Remove the /api prefix from the Blueprint definition's url_prefix argument.
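
A cheap guard against this footgun is a unit test over the blueprint prefixes. The sketch below is framework-free and hypothetical: final_prefix mirrors the registration loop shown earlier, and doubled_prefixes takes a plain name-to-prefix mapping that a real test would build from _core_blueprints.

```python
from typing import Optional


def final_prefix(url_prefix: Optional[str]) -> str:
    """Mirror the registration loop: prepend /api to the blueprint's own prefix."""
    return f"/api{url_prefix}" if url_prefix else "/api"


def doubled_prefixes(prefixes: dict[str, Optional[str]]) -> list[str]:
    """Return blueprint names whose own prefix already starts with /api."""
    return [
        name
        for name, prefix in prefixes.items()
        if prefix and (prefix == "/api" or prefix.startswith("/api/"))
    ]


offenders = doubled_prefixes(
    {
        "internal_waf_events": "/internal/waf-events",
        "bad_blueprint": "/api/internal/foo",  # would register at /api/api/internal/foo
        "no_prefix": None,
    }
)
```

Asserting that the offenders list is empty in CI would catch the mistake before it reaches the serve_static catch-all.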

Incident reference: docs/incidents/2026-06-13-beta-digest-survey-404.md (PR #3556 introduced, PR #3558 fixed).


Cross-references