Raptor — Operations Runbook
System: Raptor (raxx-api-prod, raxx-api-staging)
Owner: operator / sre-agent
Last reviewed: 2026-05-29 UTC
Parent issues: #2599 (env-bootstrap checklist), #81 (SDLC hardening epic)
Maintenance rule: Every PR that adds a new flag-gated subsystem to
create_app() MUST update the checklist in this file in the same PR. Mirrors the policy in feedback_new_flag_needs_b1_migration_same_pr.
Env-bootstrap checklist
Use this checklist when provisioning a new environment (staging or prod). Work through it
linearly. Check off each item as you complete it. Every heroku config:set call must have
>/dev/null 2>&1 appended — the CLI echoes the value otherwise (#feedback_heroku_config_set_echoes_secrets).
Command to set a var (safe pattern):

    heroku config:set VAR_NAME="<value>" --app <app-name> >/dev/null 2>&1

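When config vars are set from a script rather than by hand, the same "never echo the value" rule can be kept by discarding the CLI's output and by keeping the value out of any error message. The following is a minimal sketch, not part of the Raptor tooling: the helper names build_config_set_argv and set_config_var are hypothetical, and it assumes the heroku CLI is on PATH.

```python
import subprocess


def build_config_set_argv(app: str, name: str, value: str) -> list[str]:
    """Build the argv for `heroku config:set NAME=value --app APP` (hypothetical helper)."""
    if not name or "=" in name:
        raise ValueError("config var name must be non-empty and must not contain '='")
    return ["heroku", "config:set", f"{name}={value}", "--app", app]


def set_config_var(app: str, name: str, value: str) -> None:
    """Run config:set with stdout and stderr discarded, like `>/dev/null 2>&1`."""
    result = subprocess.run(
        build_config_set_argv(app, name, value),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # check=True would put the full argv, including the secret, in the exception
    )
    if result.returncode != 0:
        # Report only the var name and app, never the value.
        raise RuntimeError(f"heroku config:set failed for {name} on {app}")
```

Passing the command as an argv list avoids shell quoting problems with secrets that contain special characters.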
Core — always required
Every environment needs these regardless of feature flags.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| SECRET_KEY | SECRET_KEY | Always (fails fast if FLASK_ENV=production and unset) | 64-byte URL-safe token; generate with python3 -c "import secrets; print(secrets.token_urlsafe(64))" | none | /Raxx/Backend/SECRET_KEY |
| FLASK_ENV | n/a (read directly by Flask) | Always | production or development | development | set inline |
| DATABASE_URL | resolved via resolve_runtime_database_url() | Always (Postgres path) | postgres://... (Heroku-managed) | n/a | managed by Heroku |
| FRONTEND_ORIGIN | used in CORS allowlist | Always | comma-separated origins, e.g. https://raxx.app,https://www.raxx.app | http://localhost:3000 | /Raxx/Backend/FRONTEND_ORIGIN |
| ADMIN_SERVICE_TOKEN | read at call time by billing/audit/shadow-GDPR routes | Always | 32+ char random token | none (routes return 501 if unset) | /Raxx/Backend/ADMIN_SERVICE_TOKEN |
- [ ] SECRET_KEY provisioned
- [ ] FLASK_ENV=production set
- [ ] DATABASE_URL present (Heroku adds this automatically on Postgres add-on attach)
- [ ] FRONTEND_ORIGIN set to production origin(s)
- [ ] ADMIN_SERVICE_TOKEN provisioned
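The core checklist can also be verified mechanically before the first deploy. This is a minimal sketch under stated assumptions: missing_core_vars is a hypothetical helper (not an existing Raptor function), and it checks only presence, not format.

```python
from typing import Mapping

# The five vars from the "Core" table above.
CORE_VARS = (
    "SECRET_KEY",
    "FLASK_ENV",
    "DATABASE_URL",
    "FRONTEND_ORIGIN",
    "ADMIN_SERVICE_TOKEN",
)


def missing_core_vars(env: Mapping[str, str]) -> list[str]:
    """Return the core var names that are unset or blank in `env`."""
    return [name for name in CORE_VARS if not env.get(name, "").strip()]
```

A one-off dyno could call missing_core_vars(os.environ) and exit non-zero when the list is non-empty.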
WebAuthn (FLAG_WEBAUTHN_REGISTRATION)
These three vars are validated at startup by validate_webauthn_config() when
FLAG_WEBAUTHN_REGISTRATION=1. Raptor refuses to start if any is empty.
Root cause of the 2026-05-20 staging boot failure (contributing factor 3, see
docs/incidents/2026-05-20-staging-webauthn-boot-fail.md) was WEBAUTHN_ORIGIN
never being provisioned. These vars are re-pinned from os.environ AFTER
from_pyfile() to prevent instance config from silently overwriting them.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| WEBAUTHN_RP_ID | WEBAUTHN_RP_ID | FLAG_WEBAUTHN_REGISTRATION=1 | Relying Party ID; must be the registrable domain, e.g. raxx.app | "" (fails validation if empty) | /Raxx/Backend/WEBAUTHN_RP_ID |
| WEBAUTHN_RP_NAME | WEBAUTHN_RP_NAME | FLAG_WEBAUTHN_REGISTRATION=1 | Human-readable RP name shown in browser passkey dialog, e.g. Raxx | Raxx | /Raxx/Backend/WEBAUTHN_RP_NAME |
| WEBAUTHN_ORIGIN | WEBAUTHN_ORIGIN | FLAG_WEBAUTHN_REGISTRATION=1 | Full origin of the Antlers frontend, e.g. https://raxx.app | "" (fails validation if empty) | /Raxx/Backend/WEBAUTHN_ORIGIN |
- [ ] WEBAUTHN_RP_ID=raxx.app set (prod) or raxx-staging.pages.dev / staging domain (staging)
- [ ] WEBAUTHN_RP_NAME=Raxx set
- [ ] WEBAUTHN_ORIGIN=https://raxx.app set (prod) or staging origin (staging)
- [ ] FLAG_WEBAUTHN_REGISTRATION=1 set after the above are confirmed
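The "re-pin after from_pyfile(), then validate" ordering described above can be sketched as follows. The real validate_webauthn_config() is not shown in this runbook, so the signature and body here are assumptions, and repin_webauthn_from_env is a hypothetical name for the re-pinning step.

```python
from typing import Mapping, MutableMapping

WEBAUTHN_VARS = ("WEBAUTHN_RP_ID", "WEBAUTHN_RP_NAME", "WEBAUTHN_ORIGIN")


def repin_webauthn_from_env(
    config: MutableMapping[str, object], env: Mapping[str, str]
) -> None:
    """Copy WebAuthn vars from the environment over whatever instance config set."""
    for name in WEBAUTHN_VARS:
        if name in env:
            config[name] = env[name]


def validate_webauthn_config(config: Mapping[str, object], flag_on: bool) -> None:
    """Refuse to continue if the flag is on and any WebAuthn var is empty (assumed behaviour)."""
    if not flag_on:
        return
    empty = [n for n in WEBAUTHN_VARS if not str(config.get(n) or "").strip()]
    if empty:
        raise RuntimeError("WebAuthn enabled but empty config: " + ", ".join(empty))
```

Calling the re-pin step after the instance config is loaded means a blank value in that file cannot mask a correctly provisioned environment variable.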
Sentry (FLAG_SENTRY_BACKEND)
Raptor's Sentry integration is gated by the sentry_backend flag (FLAG_SENTRY_BACKEND). If the flag is on
but SENTRY_DSN_BACKEND is absent, Sentry is silently skipped (non-fatal).
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| SENTRY_DSN_BACKEND | n/a (read at init time by sentry_init.py) | FLAG_SENTRY_BACKEND=1 | DSN URL from Sentry project settings, e.g. https://<key>@o<org>.ingest.sentry.io/<project> | none (sentry init skipped if absent) | /Raxx/Backend/SENTRY_DSN_BACKEND |
- [ ] SENTRY_DSN_BACKEND provisioned from Sentry project dashboard
- [ ] FLAG_SENTRY_BACKEND=1 set after DSN is confirmed
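The "flag on but DSN absent means skip, not crash" behaviour can be sketched as below. This is not the contents of sentry_init.py: maybe_init_sentry is a hypothetical name, the flag is assumed to be read as the string "1", and the init callable is injected so the sketch runs without the Sentry SDK installed (in production it might be sentry_sdk.init).

```python
from typing import Callable, Mapping


def maybe_init_sentry(env: Mapping[str, str], init: Callable[..., object]) -> bool:
    """Initialise Sentry only when the flag is on AND a DSN is present.

    Returns True if `init` was called, False if initialisation was skipped.
    """
    if env.get("FLAG_SENTRY_BACKEND") != "1":
        return False
    dsn = env.get("SENTRY_DSN_BACKEND", "").strip()
    if not dsn:
        # Non-fatal: startup continues without Sentry.
        return False
    init(dsn=dsn)
    return True
```

Returning a boolean gives startup code something to log, which helps when verifying the checklist after a deploy.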
Postmark (FLAG_POSTMARK_INBOUND_TO_FREESCOUT, FLAG_POSTMARK_DELIVERY_MONITOR, and transactional email)
POSTMARK_SERVER_TOKEN is read at call time by multiple services (waitlist email,
transactional email, trace integrity alerts, admin notify). Absence means those
paths silently skip sending — no fatal error at startup, but emails will not deliver.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| POSTMARK_SERVER_TOKEN | n/a (read from os.environ at call time) | Any email sending path is live | Postmark Server API token from Postmark dashboard | none (email calls silently no-op) | /Raxx/Backend/POSTMARK_SERVER_TOKEN |
| POSTMARK_INBOUND_WEBHOOK_TOKEN | n/a | FLAG_POSTMARK_INBOUND_TO_FREESCOUT=1 | Inbound webhook secret from Postmark | none | /Raxx/Backend/POSTMARK_INBOUND_WEBHOOK_TOKEN |
- [ ] POSTMARK_SERVER_TOKEN provisioned (Postmark account → Servers → API Tokens)
- [ ] Postmark account out of sandbox (confirmed 2026-05-09; see docs/ops/postmark.md)
- [ ] POSTMARK_INBOUND_WEBHOOK_TOKEN provisioned if FLAG_POSTMARK_INBOUND_TO_FREESCOUT=1
Database (Postgres role separation — FLAG_RAPTOR_APP_ROLE_SEPARATION)
When the flag is on, Raptor uses RAPTOR_APP_DATABASE_URL (restricted raptor_app
role) for all request handlers. DATABASE_URL (owner credential) is then used only
by Alembic migrations in the release dyno. See docs/ops/runbooks/raptor-db-credentials.md
for the full two-URL model.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| RAPTOR_APP_DATABASE_URL | RUNTIME_DATABASE_URL (after resolution) | FLAG_RAPTOR_APP_ROLE_SEPARATION=1 | postgres://raptor_app:<password>@... (restricted role URL) | falls back to DATABASE_URL if absent | /Raxx/Backend/RAPTOR_APP_DATABASE_URL |
- [ ] raptor_app role provisioned via pg:credentials:create (see raptor-postgres-roles.md)
- [ ] RAPTOR_APP_DATABASE_URL provisioned and set
- [ ] FLAG_RAPTOR_APP_ROLE_SEPARATION=1 set
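The two-URL resolution can be pictured with a short sketch. The function name resolve_runtime_database_url() appears in the Core table, but its real signature and body are not shown in this runbook; the version below is an assumption that encodes only the rules stated here (restricted URL when the flag is on, fallback to DATABASE_URL when it is absent).

```python
from typing import Mapping


def resolve_runtime_database_url(env: Mapping[str, str]) -> str:
    """Pick the URL request handlers should use (assumed logic, see lead-in)."""
    owner_url = env.get("DATABASE_URL", "")
    app_url = env.get("RAPTOR_APP_DATABASE_URL", "")
    flag_on = env.get("FLAG_RAPTOR_APP_ROLE_SEPARATION") == "1"  # assumed flag encoding

    if flag_on and app_url:
        return app_url  # restricted raptor_app role
    if not owner_url:
        raise RuntimeError("DATABASE_URL is not set")
    return owner_url  # flag off, or restricted URL absent: fall back to the owner URL
```

Note that the fallback means a flag flipped without RAPTOR_APP_DATABASE_URL set would keep running on the owner credential, which is why the checklist provisions the URL before the flag.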
Demo session (FLAG_DEMO_SESSION)
When enabled, Raptor initialises a Redis client and the demo blueprint registers.
Fail-closed: if REDIS_URL is absent, demo endpoints degrade gracefully and an
error is logged, but startup continues.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| REDIS_URL | n/a (app.extensions["redis_client"]) | FLAG_DEMO_SESSION=1 | redis://... or rediss://... (Heroku Redis add-on URL) | none (demo degrades gracefully) | managed by Heroku Redis add-on |
- [ ] Heroku Redis add-on attached (adds REDIS_URL automatically)
- [ ] FLAG_DEMO_SESSION=1 set after Redis is confirmed reachable
Broker / market data (FLAG_ALPACA_TRADING, FLAG_ALPACA_MARKET_DATA)
Alpaca keys are read at call time by alpaca_integration.py, alpaca_market_data_service.py,
and options.py. Absence causes those routes to return a 503 with a missing_env payload.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| ALPACA_API_KEY | ALPACA_API_KEY (via _read_config_value) | Live trading or market data enabled | Alpaca live-account key ID | none | /Raxx/Backend/ALPACA_API_KEY |
| ALPACA_API_SECRET | ALPACA_API_SECRET | Live trading or market data enabled | Alpaca live-account secret key | none | /Raxx/Backend/ALPACA_API_SECRET |
| ALPACA_BASE_URL | ALPACA_BASE_URL | Live trading | https://api.alpaca.markets (live) | https://api.alpaca.markets | /Raxx/Backend/ALPACA_BASE_URL |
| ALPACA_PAPER_API_KEY | ALPACA_PAPER_API_KEY | Paper trading mode | Alpaca paper-account key ID | falls back to ALPACA_API_KEY | /Raxx/Backend/ALPACA_PAPER_API_KEY |
| ALPACA_PAPER_API_SECRET | ALPACA_PAPER_API_SECRET | Paper trading mode | Alpaca paper-account secret key | falls back to ALPACA_API_SECRET | /Raxx/Backend/ALPACA_PAPER_API_SECRET |
| ALPACA_PAPER_BASE_URL | ALPACA_PAPER_BASE_URL | Paper trading mode | https://paper-api.alpaca.markets | https://paper-api.alpaca.markets | /Raxx/Backend/ALPACA_PAPER_BASE_URL |
- [ ] Live keys provisioned (paper keys also cover market-data reads)
- [ ] Paper keys provisioned separately (users start in paper mode)
- [ ] Base URLs confirmed — do not swap live/paper URLs
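The fallback and default rules in the table, plus the 503 behaviour, can be combined into one resolver. This is a sketch only: resolve_alpaca_credentials is a hypothetical name, and the exact shape of the missing_env payload is an assumption (the runbook states only that such a payload is returned with a 503).

```python
from typing import Mapping

LIVE_BASE_URL = "https://api.alpaca.markets"
PAPER_BASE_URL = "https://paper-api.alpaca.markets"


def resolve_alpaca_credentials(env: Mapping[str, str], paper: bool) -> tuple[dict, int]:
    """Return (credentials, 200) or ({"missing_env": [...]}, 503)."""
    live_key = env.get("ALPACA_API_KEY", "")
    live_secret = env.get("ALPACA_API_SECRET", "")
    if paper:
        # Paper vars fall back to the live vars, per the Default column.
        key = env.get("ALPACA_PAPER_API_KEY") or live_key
        secret = env.get("ALPACA_PAPER_API_SECRET") or live_secret
        base_url = env.get("ALPACA_PAPER_BASE_URL") or PAPER_BASE_URL
        names = ("ALPACA_PAPER_API_KEY", "ALPACA_PAPER_API_SECRET")
    else:
        key, secret = live_key, live_secret
        base_url = env.get("ALPACA_BASE_URL") or LIVE_BASE_URL
        names = ("ALPACA_API_KEY", "ALPACA_API_SECRET")

    missing = [name for name, value in zip(names, (key, secret)) if not value]
    if missing:
        return {"missing_env": missing}, 503
    return {"key": key, "secret": secret, "base_url": base_url}, 200
```

Keeping the base URL selection next to the key selection makes it harder to pair live keys with the paper URL by accident, which is the swap the checklist warns about.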
Session auth middleware (FLAG_SESSION_AUTH_MIDDLEWARE)
Warning: Do NOT enable this flag on api.raxx.app until Cloudflare Access has been
removed from the API origin. Until that cutover, CF Access MUST remain active.
Flipping the flag before removing CF Access causes double-auth and blocks all
authenticated requests.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| n/a (cookie-based, no additional env vars) | — | — | — | — | — |
- [ ] CF Access removed from api.raxx.app first
- [ ] FLAG_SESSION_AUTH_MIDDLEWARE=1 set after CF Access removal confirmed
Status 3P poller (FLAG_STATUS_3P_POLLER)
When enabled, a background daemon thread polls upstream partner status pages and writes
results to the D1 status DB via STATUS_WORKER_URL.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| STATUS_WORKER_URL | n/a (read from os.environ at call time) | FLAG_STATUS_3P_POLLER=1 | Base URL of the CF Worker, e.g. https://status.raxx.app | none (poller logs error and no-ops if absent) | /Raxx/Backend/STATUS_WORKER_URL |
- [ ] STATUS_WORKER_URL set to CF Worker URL
- [ ] FLAG_STATUS_3P_POLLER=1 set
WebAuthn challenge-miss alerting (FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT)
Requires Sentry to already be initialised (i.e., FLAG_SENTRY_BACKEND=1 and
SENTRY_DSN_BACKEND set). Installs a log handler that captures Redis challenge-miss
events and forwards them to Sentry.
- [ ] Sentry checklist above completed first
- [ ] FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT=1 set
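A log handler of the kind described above might look like the following sketch. ChallengeMissHandler is a hypothetical class, not the handler Raptor installs; the forward callable is injected so the example runs without the Sentry SDK (in production it could be something like sentry_sdk.capture_message), and matching on the event name webauthn.challenge_miss is an assumption based on the log lines shown later in this runbook.

```python
import logging
from typing import Callable


class ChallengeMissHandler(logging.Handler):
    """Forward WebAuthn challenge-miss log events to an alerting callable."""

    EVENT_NAME = "webauthn.challenge_miss"

    def __init__(self, forward: Callable[[str], object]) -> None:
        super().__init__(level=logging.WARNING)
        self._forward = forward

    def emit(self, record: logging.LogRecord) -> None:
        try:
            message = record.getMessage()
            if self.EVENT_NAME in message:
                self._forward(message)
        except Exception:
            # Never let alerting break the request that logged the event.
            self.handleError(record)
```

The handler would be attached once at startup, for example logging.getLogger("webauthn_service").addHandler(ChallengeMissHandler(forward)), where the logger name is also an assumption.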
gunicorn --preload + module-level mutable state — Raxx policy
Policy added 2026-05-29 UTC. Closes action item #3 from RCA
docs/incidents/2026-05-25-signup-challenge-store-miss.md.
Why --preload is on
Raptor runs gunicorn with --preload (Heroku Procfile: web: gunicorn --preload ...).
Preload imports the entire WSGI application once in the master process before forking
worker processes. This produces:
- Faster cold-start on dyno wake: bytecode is compiled and in-memory before any worker forks.
- Lower per-worker memory footprint on Linux: CoW (copy-on-write) pages shared by workers until the first write.
- Faster dyno restart on deploys: workers can begin accepting traffic sooner.
Preload is the correct default for a Python web app on Heroku with a non-trivial import tree.
Do not remove --preload to "fix" a mutable-state bug — fix the state instead.
What goes wrong with module-level mutable state
When gunicorn forks, each worker process gets a copy-on-write clone of the master's memory. The first write to any object in a worker creates a private copy for that worker only. Other workers and the master process do not see that write.
Concretely:

    # webauthn_service.py (BEFORE the 2026-05-25 fix)
    _challenge_store: dict[int, str] = {}  # module-level, initialized once in master

    def _store_challenge(user_id: int, challenge: str) -> None:
        _challenge_store[user_id] = challenge  # writes to *this worker's* private copy

    def _pop_challenge(user_id: int) -> str | None:
        return _challenge_store.pop(user_id, None)  # looks up in *this worker's* private copy

If the HTTP request that calls _store_challenge is handled by worker A, and the next HTTP
request that calls _pop_challenge is handled by worker B (or the master under a brief overlap
during dyno lifecycle events), _pop_challenge returns None. The challenge was written to
worker A's private page; worker B's page still has the empty initial dict.
This was the root cause of the 2026-05-25 SEV-1 (_challenge_store miss on every signup
attempt, PR #2728). The same failure mode also explains the trading_mode_override regression
fixed in PR #3030: a process-global dict keyed on user ID was mutated per request; under
preload, concurrent requests in different workers each had a divergent view of that dict.
The rule
Module-level mutable state that must be consistent across processes is forbidden in Raptor.
| State type | Acceptable storage | Not acceptable |
|---|---|---|
| Per-request ephemeral data (auth token from header, parsed body) | Local variable or flask.g | Module-level dict |
| Per-session transient data (WebAuthn challenge, CSRF nonce) | Redis (TTL-keyed) | Module-level dict |
| Per-user persistent data (trading mode, profile, preferences) | Postgres row | Module-level dict, app.extensions["..."] dict |
| Read-mostly app config (RP ID, feature flags read at startup) | app.config or app.extensions["..."] (read-only after create_app()) | Any post-fork mutation |
| Background worker state | Redis or Postgres | Module-level variable mutated per task |
Immutable is fine. Module-level constants, compiled regex patterns, read-only lookup tables,
and frozen dataclasses are safe under --preload because they are never written after the fork.
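As an illustration of module-level objects that are safe under --preload, the sketch below shows a compiled pattern, a read-only mapping, and a frozen dataclass. All names (USERNAME_RE, ORDER_SIDE_SIGN, RateLimit, DEFAULT_LIMIT) are hypothetical examples, not Raptor identifiers.

```python
import re
from dataclasses import FrozenInstanceError, dataclass
from types import MappingProxyType

# Compiled once in the master; workers only read it.
USERNAME_RE = re.compile(r"^[a-z0-9_]{3,32}$")

# Read-only view: item assignment raises TypeError.
ORDER_SIDE_SIGN = MappingProxyType({"buy": 1, "sell": -1})


@dataclass(frozen=True)
class RateLimit:
    requests: int
    per_seconds: int


# Attribute assignment on a frozen dataclass raises FrozenInstanceError.
DEFAULT_LIMIT = RateLimit(requests=60, per_seconds=60)
```

Making the objects reject writes turns an accidental post-fork mutation into an immediate exception instead of silent per-worker divergence.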
app.extensions is safe for read-only objects. A Redis client, a Sentry client, or a
database engine stored in app.extensions (for example app.extensions["redis_client"]) is
initialized once in create_app() (before fork) and used read-only in workers. This is the
correct pattern. Do not store mutable per-user or per-request state there.
Required pattern for transient state
Use Redis with a TTL. The post-fix webauthn_service.py is the canonical example:

    # webauthn_service.py (AFTER PR #2728: https://github.com/raxx-app/TradeMasterAPI/pull/2728)
    import logging

    from flask import Flask

    logger = logging.getLogger(__name__)

    _CHALLENGE_TTL = 60  # seconds
    # Assumed definition (not shown in the original excerpt): used only when
    # Redis is absent, i.e. local dev / CI with a single process and no fork.
    _challenge_store_fallback: dict[int, str] = {}

    def _store_challenge(app: Flask, user_id: int, challenge: str) -> None:
        redis = app.extensions.get("redis_client")
        if redis:
            redis.set(f"webauthn:reg:challenge:{user_id}", challenge, ex=_CHALLENGE_TTL)
            logger.info("webauthn.challenge_stored store=redis user_id=%s", user_id)
        else:
            # Local dev / CI: Redis absent, fall back to in-process dict
            _challenge_store_fallback[user_id] = challenge
            logger.info("webauthn.challenge_stored store=local user_id=%s", user_id)

    def _pop_challenge(app: Flask, user_id: int) -> str | None:
        redis = app.extensions.get("redis_client")
        if redis:
            # GETDEL reads and deletes atomically; returns None on a miss.
            val = redis.getdel(f"webauthn:reg:challenge:{user_id}")
            if val is None:
                logger.warning("webauthn.challenge_miss store=redis user_id=%s", user_id)
            else:
                logger.info("webauthn.challenge_popped store=redis user_id=%s", user_id)
            # .decode() assumes the client returns bytes (decode_responses not enabled).
            return val.decode() if val else None
        else:
            return _challenge_store_fallback.pop(user_id, None)

Key properties of this pattern:
- Atomic pop. GETDEL reads and deletes in a single Redis command. No race between read and delete.
- TTL-bounded. A challenge that is never popped (e.g., user abandons signup) expires in 60 seconds. No accumulation of stale state.
- Positive diagnostic signal. webauthn.challenge_miss is a named log event, not a silent None return. Sentry (via FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT) captures it.
- Local-dev fallback. When REDIS_URL is absent (local dev, CI), the code falls back to an in-process dict. This is acceptable because a single-process dev server does not fork.
Code-review checklist
When reviewing any PR that adds new state to a Raptor service:
- [ ] Is this state written after create_app() returns? If yes, it cannot be module-level.
- [ ] Is this state per-user, per-session, or per-request? If yes, it belongs in flask.g, Redis, or Postgres (per the table above), not app.extensions.
- [ ] Is this state read-only after startup? If yes, app.config or app.extensions is fine.
- [ ] If Redis is used: does it have a TTL? Does pop use GETDEL (atomic)? Does a miss emit a named log event?
- [ ] Is there a local-dev fallback for when REDIS_URL is absent?
Incidents that established this policy
| Incident | Date | PR | Bug class |
|---|---|---|---|
| WebAuthn challenge-store miss — every signup fails | 2026-05-25 | PR #2728 | Module-level dict, challenge written to one worker's CoW page, verify called on different worker |
| Per-user trading_mode_override process-global dict | (see PR #3030) | PR #3030 | Module-level dict keyed on user_id, mutations diverged across worker processes under preload |
Cross-references
- Two-URL Postgres model: docs/ops/runbooks/raptor-db-credentials.md
- Postgres role provisioning: docs/ops/runbooks/raptor-postgres-roles.md
- Staging Postgres cutover: docs/ops/runbooks/raptor-postgres-staging-cutover.md
- Prod Postgres cutover: docs/ops/runbooks/raptor-postgres-prod-cutover.md
- Feature flag flip procedure: docs/ops/feature-flags-runbook.md
- Required config vars manifest: docs/ops/required-config-vars.yaml
- Incident (boot fail): docs/incidents/2026-05-20-staging-webauthn-boot-fail.md
- Incident (challenge-store miss): docs/incidents/2026-05-25-signup-challenge-store-miss.md