Raptor — Operations Runbook
System: Raptor (raxx-api-prod, raxx-api-staging)
Owner: operator / sre-agent
Last reviewed: 2026-06-12 UTC
Parent issues: #2599 (env-bootstrap checklist), #81 (SDLC hardening epic)
Maintenance rule: Every PR that adds a new flag-gated subsystem to
create_app() MUST update the checklist in this file in the same PR. Mirrors the policy in feedback_new_flag_needs_b1_migration_same_pr.
Env-bootstrap checklist
Use this checklist when provisioning a new environment (staging or prod). Work through it
linearly. Check off each item as you complete it. Every heroku config:set call must have
>/dev/null 2>&1 appended — the CLI echoes the value otherwise (#feedback_heroku_config_set_echoes_secrets).
Command to set a var (safe pattern):
heroku config:set VAR_NAME="<value>" --app <app-name> >/dev/null 2>&1
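When the same step is scripted, the output must be suppressed just as in the shell pattern above. The following is a minimal sketch, assuming the Heroku CLI is on PATH; the helper names build_config_set_argv and set_config_var_quietly are hypothetical and not part of Raptor.

```python
import subprocess


def build_config_set_argv(app: str, name: str, value: str) -> list[str]:
    """Build the argv for `heroku config:set`; a list avoids shell quoting issues."""
    return ["heroku", "config:set", f"{name}={value}", "--app", app]


def set_config_var_quietly(app: str, name: str, value: str) -> None:
    """Run config:set with stdout and stderr discarded so the value is never echoed.

    Hypothetical helper: the Python equivalent of appending >/dev/null 2>&1.
    """
    subprocess.run(
        build_config_set_argv(app, name, value),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
```

Because the output is discarded, a failure surfaces only as an exception from check=True, so callers should not catch it silently.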
Core — always required
Every environment needs these regardless of feature flags.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| SECRET_KEY | SECRET_KEY | Always (fails fast if FLASK_ENV=production and unset) | 64-byte URL-safe token; generate with python3 -c "import secrets; print(secrets.token_urlsafe(64))" | none | /Raxx/Backend/SECRET_KEY |
| FLASK_ENV | n/a (read directly by Flask) | Always | production or development | development | set inline |
| DATABASE_URL | resolved via resolve_runtime_database_url() | Always (Postgres path) | postgres://... (Heroku-managed) | n/a | managed by Heroku |
| FRONTEND_ORIGIN | used in CORS allowlist | Always | comma-separated origins, e.g. https://raxx.app,https://www.raxx.app | http://localhost:3000 | /Raxx/Backend/FRONTEND_ORIGIN |
| ADMIN_SERVICE_TOKEN | read at call time by billing/audit/shadow-GDPR routes | Always | 32+ char random token | none (routes return 501 if unset) | /Raxx/Backend/ADMIN_SERVICE_TOKEN |
- [ ] SECRET_KEY provisioned
- [ ] FLASK_ENV=production set
- [ ] DATABASE_URL present (Heroku adds this automatically on Postgres add-on attach)
- [ ] FRONTEND_ORIGIN set to production origin(s)
- [ ] ADMIN_SERVICE_TOKEN provisioned
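The core checklist can also be pre-checked mechanically before a deploy. This is a minimal sketch rather than Raptor's actual code: the helper names missing_core_vars and core_var_problems are hypothetical, and the checks only restate the table above. It reports variable names only and never prints values.

```python
import os
from collections.abc import Mapping

# Core vars from the table above; every environment needs all of them.
CORE_VARS = (
    "SECRET_KEY",
    "FLASK_ENV",
    "DATABASE_URL",
    "FRONTEND_ORIGIN",
    "ADMIN_SERVICE_TOKEN",
)


def missing_core_vars(env: Mapping[str, str]) -> list[str]:
    """Return the core vars that are unset or blank (hypothetical helper)."""
    return [name for name in CORE_VARS if not env.get(name, "").strip()]


def core_var_problems(env: Mapping[str, str]) -> list[str]:
    """Collect every problem instead of stopping at the first one."""
    problems = [f"{name} is unset" for name in missing_core_vars(env)]
    token = env.get("ADMIN_SERVICE_TOKEN", "")
    if token and len(token) < 32:
        problems.append("ADMIN_SERVICE_TOKEN is shorter than 32 characters")
    if env.get("FLASK_ENV") not in (None, "", "production", "development"):
        problems.append("FLASK_ENV must be production or development")
    return problems


if __name__ == "__main__":
    # Print names and reasons only; secret values are never echoed.
    for problem in core_var_problems(os.environ):
        print(problem)
```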
WebAuthn (FLAG_WEBAUTHN_REGISTRATION)
These three vars are validated at startup by validate_webauthn_config() when
FLAG_WEBAUTHN_REGISTRATION=1. Raptor refuses to start if any is empty.
Root cause of the 2026-05-20 staging boot failure (contributing factor 3, see
docs/incidents/2026-05-20-staging-webauthn-boot-fail.md) was WEBAUTHN_ORIGIN
never being provisioned. These vars are re-pinned from os.environ AFTER
from_pyfile() to prevent instance config from silently overwriting them.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| WEBAUTHN_RP_ID | WEBAUTHN_RP_ID | FLAG_WEBAUTHN_REGISTRATION=1 | Relying Party ID; must be the registrable domain, e.g. raxx.app | "" (fails validation if empty) | /Raxx/Backend/WEBAUTHN_RP_ID |
| WEBAUTHN_RP_NAME | WEBAUTHN_RP_NAME | FLAG_WEBAUTHN_REGISTRATION=1 | Human-readable RP name shown in browser passkey dialog, e.g. Raxx | Raxx | /Raxx/Backend/WEBAUTHN_RP_NAME |
| WEBAUTHN_ORIGIN | WEBAUTHN_ORIGIN | FLAG_WEBAUTHN_REGISTRATION=1 | Full origin of the Antlers frontend, e.g. https://raxx.app | "" (fails validation if empty) | /Raxx/Backend/WEBAUTHN_ORIGIN |
- [ ] WEBAUTHN_RP_ID=raxx.app set (prod) or raxx-staging.pages.dev / staging domain (staging)
- [ ] WEBAUTHN_RP_NAME=Raxx set
- [ ] WEBAUTHN_ORIGIN=https://raxx.app set (prod) or staging origin (staging)
- [ ] FLAG_WEBAUTHN_REGISTRATION=1 set after the above are confirmed
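The three values can be sanity-checked before the flag is flipped. The sketch below is an illustration, not the body of validate_webauthn_config(): the function name webauthn_config_errors is hypothetical, and the extra host check simply applies the WebAuthn requirement that the RP ID equal the origin's host or be a parent domain of it.

```python
from collections.abc import Mapping
from urllib.parse import urlsplit

WEBAUTHN_VARS = ("WEBAUTHN_RP_ID", "WEBAUTHN_RP_NAME", "WEBAUTHN_ORIGIN")


def webauthn_config_errors(env: Mapping[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    if env.get("FLAG_WEBAUTHN_REGISTRATION") != "1":
        return []  # flag off: nothing to validate

    errors = [
        f"{name} is empty" for name in WEBAUTHN_VARS if not env.get(name, "").strip()
    ]
    if errors:
        return errors

    rp_id = env["WEBAUTHN_RP_ID"].strip().lower()
    host = (urlsplit(env["WEBAUTHN_ORIGIN"].strip()).hostname or "").lower()
    # The origin host must be the RP ID itself or a subdomain of it.
    if host != rp_id and not host.endswith("." + rp_id):
        errors.append(f"WEBAUTHN_ORIGIN host {host!r} does not match RP ID {rp_id!r}")
    return errors
```

Running a check like this against the target app's config before setting FLAG_WEBAUTHN_REGISTRATION=1 would have caught an unprovisioned WEBAUTHN_ORIGIN ahead of the boot.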
Sentry (FLAG_SENTRY_BACKEND)
Raptor's Sentry integration is gated by the sentry_backend flag. If the flag is on
but SENTRY_DSN_BACKEND is absent, Sentry is silently skipped (non-fatal).
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| SENTRY_DSN_BACKEND | n/a (read at init time by sentry_init.py) | FLAG_SENTRY_BACKEND=1 | DSN URL from Sentry project settings, e.g. https://<key>@o<org>.ingest.sentry.io/<project> | none (sentry init skipped if absent) | /Raxx/Backend/SENTRY_DSN_BACKEND |
- [ ] SENTRY_DSN_BACKEND provisioned from Sentry project dashboard
- [ ] FLAG_SENTRY_BACKEND=1 set after DSN is confirmed
Postmark (FLAG_POSTMARK_INBOUND_TO_FREESCOUT, FLAG_POSTMARK_DELIVERY_MONITOR, and transactional email)
POSTMARK_SERVER_TOKEN is read at call time by multiple services (waitlist email,
transactional email, trace integrity alerts, admin notify). Absence means those
paths silently skip sending — no fatal error at startup, but emails will not deliver.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| POSTMARK_SERVER_TOKEN | n/a (read from os.environ at call time) | Any email sending path is live | Postmark Server API token from Postmark dashboard | none (email calls silently no-op) | /Raxx/Backend/POSTMARK_SERVER_TOKEN |
| POSTMARK_INBOUND_WEBHOOK_TOKEN | n/a | FLAG_POSTMARK_INBOUND_TO_FREESCOUT=1 | Inbound webhook secret from Postmark | none | /Raxx/Backend/POSTMARK_INBOUND_WEBHOOK_TOKEN |
- [ ] POSTMARK_SERVER_TOKEN provisioned (Postmark account → Servers → API Tokens)
- [ ] Postmark account out of sandbox (confirmed 2026-05-09; docs/ops/postmark.md)
- [ ] POSTMARK_INBOUND_WEBHOOK_TOKEN provisioned if FLAG_POSTMARK_INBOUND_TO_FREESCOUT=1
Database (Postgres role separation — FLAG_RAPTOR_APP_ROLE_SEPARATION)
When the flag is on, Raptor uses RAPTOR_APP_DATABASE_URL (restricted raptor_app
role) for all request handlers. DATABASE_URL (owner credential) is then used only
by Alembic migrations in the release dyno. See docs/ops/runbooks/raptor-db-credentials.md
for the full two-URL model.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| RAPTOR_APP_DATABASE_URL | RUNTIME_DATABASE_URL (after resolution) | FLAG_RAPTOR_APP_ROLE_SEPARATION=1 | postgres://raptor_app:<password>@... (restricted role URL) | falls back to DATABASE_URL if absent | /Raxx/Backend/RAPTOR_APP_DATABASE_URL |
- [ ] raptor_app role provisioned via pg:credentials:create (see raptor-postgres-roles.md)
- [ ] RAPTOR_APP_DATABASE_URL provisioned and set
- [ ] FLAG_RAPTOR_APP_ROLE_SEPARATION=1 set
Demo session (FLAG_DEMO_SESSION)
When enabled, Raptor initialises a Redis client and the demo blueprint registers.
Fail-closed: if REDIS_URL is absent, demo endpoints degrade gracefully and an
error is logged, but startup continues.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| REDIS_URL | n/a (app.extensions["redis_client"]) | FLAG_DEMO_SESSION=1 | redis://... or rediss://... (Heroku Redis add-on URL) | none (demo degrades gracefully) | managed by Heroku Redis add-on |
- [ ] Heroku Redis add-on attached (adds REDIS_URL automatically)
- [ ] FLAG_DEMO_SESSION=1 set after Redis is confirmed reachable
Staging posture (2026-06-14): raxx-api-staging does not have a Redis add-on
attached and REDIS_URL is absent. This is intentional: FLAG_DEMO_SESSION and
FLAG_RATE_LIMIT_STORAGE_REDIS are both OFF on staging. The WebAuthn challenge store
falls back to the in-process dict (safe for a single-worker staging dyno). The rate
limiter falls back to in-memory. No consumer hard-fails at startup. The nightly
config-health check treats staging REDIS_URL as severity: warning rather than
critical to reflect this posture. Activate Redis on staging by running:
heroku addons:create heroku-redis:mini --app raxx-api-staging
Only needed when FLAG_DEMO_SESSION=1 is to be exercised on staging. Set the flag
only AFTER confirming REDIS_URL is populated (add-on provisioning can take ~30s).
Broker / market data (FLAG_ALPACA_TRADING, FLAG_ALPACA_MARKET_DATA)
Alpaca keys are read at call time by alpaca_integration.py, alpaca_market_data_service.py,
and options.py. Absence causes those routes to return a 503 with a missing_env payload.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| ALPACA_API_KEY | ALPACA_API_KEY (via _read_config_value) | Live trading or market data enabled | Alpaca live-account key ID | none | /Raxx/Backend/ALPACA_API_KEY |
| ALPACA_API_SECRET | ALPACA_API_SECRET | Live trading or market data enabled | Alpaca live-account secret key | none | /Raxx/Backend/ALPACA_API_SECRET |
| ALPACA_BASE_URL | ALPACA_BASE_URL | Live trading | https://api.alpaca.markets (live) | https://api.alpaca.markets | /Raxx/Backend/ALPACA_BASE_URL |
| ALPACA_PAPER_API_KEY | ALPACA_PAPER_API_KEY | Paper trading mode | Alpaca paper-account key ID | falls back to ALPACA_API_KEY | /Raxx/Backend/ALPACA_PAPER_API_KEY |
| ALPACA_PAPER_API_SECRET | ALPACA_PAPER_API_SECRET | Paper trading mode | Alpaca paper-account secret key | falls back to ALPACA_API_SECRET | /Raxx/Backend/ALPACA_PAPER_API_SECRET |
| ALPACA_PAPER_BASE_URL | ALPACA_PAPER_BASE_URL | Paper trading mode | https://paper-api.alpaca.markets | https://paper-api.alpaca.markets | /Raxx/Backend/ALPACA_PAPER_BASE_URL |
- [ ] Live keys provisioned (paper keys also cover market-data reads)
- [ ] Paper keys provisioned separately (users start in paper mode)
- [ ] Base URLs confirmed — do not swap live/paper URLs
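The fallback rules in the table can be read as a small resolution function. This is a sketch under stated assumptions, not the code in alpaca_integration.py: the name resolve_alpaca_credentials and the return shape are hypothetical, and the real missing_env payload format is not documented here.

```python
from collections.abc import Mapping

LIVE_BASE_URL = "https://api.alpaca.markets"
PAPER_BASE_URL = "https://paper-api.alpaca.markets"


def resolve_alpaca_credentials(
    env: Mapping[str, str], paper: bool
) -> tuple[dict[str, str], list[str]]:
    """Return (credentials, missing_env) following the defaults in the table."""
    live_key = env.get("ALPACA_API_KEY", "")
    live_secret = env.get("ALPACA_API_SECRET", "")
    if paper:
        # Paper vars fall back to the live key pair when unset.
        key = env.get("ALPACA_PAPER_API_KEY") or live_key
        secret = env.get("ALPACA_PAPER_API_SECRET") or live_secret
        base_url = env.get("ALPACA_PAPER_BASE_URL") or PAPER_BASE_URL
    else:
        key, secret = live_key, live_secret
        base_url = env.get("ALPACA_BASE_URL") or LIVE_BASE_URL

    # Report the last-resort var names; a caller would turn a non-empty
    # list into a 503 response carrying these names.
    missing_env = []
    if not key:
        missing_env.append("ALPACA_API_KEY")
    if not secret:
        missing_env.append("ALPACA_API_SECRET")
    return {"key": key, "secret": secret, "base_url": base_url}, missing_env
```

Note that the paper and live base URLs are resolved from separate variables, which is what makes an accidental live/paper swap a configuration error rather than a code path.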
Session auth middleware (FLAG_SESSION_AUTH_MIDDLEWARE)
Warning: Do NOT enable this flag on api.raxx.app until Cloudflare Access has been
removed from the API origin. Flipping the flag before removing CF Access causes
double-auth and blocks all authenticated requests. Conversely, CF Access MUST remain
active for as long as the flag is off, so remove CF Access and turn the flag on as one
back-to-back cutover.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| n/a (cookie-based, no additional env vars) | — | — | — | — | — |
- [ ] CF Access removed from api.raxx.app first
- [ ] FLAG_SESSION_AUTH_MIDDLEWARE=1 set after CF Access removal confirmed
Status 3P poller (FLAG_STATUS_3P_POLLER)
When enabled, a background daemon thread polls upstream partner status pages and writes
results to the D1 status DB via STATUS_WORKER_URL.
| Var | app.config key | Required when | Accepted values / format | Default | Infisical path |
|---|---|---|---|---|---|
| STATUS_WORKER_URL | n/a (read from os.environ at call time) | FLAG_STATUS_3P_POLLER=1 | Base URL of the CF Worker, e.g. https://status.raxx.app | none (poller logs error and no-ops if absent) | /Raxx/Backend/STATUS_WORKER_URL |
- [ ] STATUS_WORKER_URL set to CF Worker URL
- [ ] FLAG_STATUS_3P_POLLER=1 set
WebAuthn challenge-miss alerting (FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT)
Requires Sentry to already be initialised (i.e., FLAG_SENTRY_BACKEND=1 and
SENTRY_DSN_BACKEND set). Installs a log handler that captures Redis challenge-miss
events and forwards them to Sentry.
- [ ] Sentry checklist above completed first
- [ ] FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT=1 set
gunicorn --preload + module-level mutable state — Raxx policy
Policy added 2026-05-29 UTC. Closes action item #3 from RCA
docs/incidents/2026-05-25-signup-challenge-store-miss.md.
Why --preload is on
Raptor runs gunicorn with --preload (Heroku Procfile: web: gunicorn --preload ...).
Preload imports the entire WSGI application once in the master process before forking
worker processes. This produces:
- Faster cold-start on dyno wake: bytecode is compiled and in-memory before any worker forks.
- Lower per-worker memory footprint on Linux: CoW (copy-on-write) pages shared by workers until the first write.
- Faster dyno restart on deploys: workers can begin accepting traffic sooner.
Preload is the correct default for a Python web app on Heroku with a non-trivial import tree.
Do not remove --preload to "fix" a mutable-state bug — fix the state instead.
What goes wrong with module-level mutable state
When gunicorn forks, each worker process gets a copy-on-write clone of the master's memory. The first write to any object in a worker creates a private copy for that worker only. Other workers and the master process do not see that write.
Concretely:

    # webauthn_service.py (BEFORE the 2026-05-25 fix)
    _challenge_store: dict[int, str] = {}  # module-level, initialized once in master

    def _store_challenge(user_id: int, challenge: str) -> None:
        _challenge_store[user_id] = challenge  # writes to *this worker's* private copy

    def _pop_challenge(user_id: int) -> str | None:
        return _challenge_store.pop(user_id, None)  # looks up in *this worker's* private copy

If the HTTP request that calls _store_challenge is handled by worker A, and the next HTTP
request that calls _pop_challenge is handled by worker B (or by a worker forked from a
different master during the brief overlap in dyno lifecycle events), _pop_challenge returns
None. The gunicorn master itself never serves requests. The challenge was written to
worker A's private page; worker B's page still has the empty initial dict.
This was the root cause of the 2026-05-25 SEV-1 (_challenge_store miss on every signup
attempt, PR #2728). The same failure mode also explains the trading_mode_override regression
fixed in PR #3030: a process-global dict keyed on user ID was mutated per request; under
preload, concurrent requests in different workers each had a divergent view of that dict.
The rule
Module-level mutable state that must be consistent across processes is forbidden in Raptor.
| State type | Acceptable storage | Not acceptable |
|---|---|---|
| Per-request ephemeral data (auth token from header, parsed body) | Local variable or flask.g | Module-level dict |
| Per-session transient data (WebAuthn challenge, CSRF nonce) | Redis (TTL-keyed) | Module-level dict |
| Per-user persistent data (trading mode, profile, preferences) | Postgres row | Module-level dict, app.extensions["..."] dict |
| Read-mostly app config (RP ID, feature flags read at startup) | app.config or app.extensions["..."] (read-only after create_app()) | Any post-fork mutation |
| Background worker state | Redis or Postgres | Module-level variable mutated per task |
Immutable is fine. Module-level constants, compiled regex patterns, read-only lookup tables,
and frozen dataclasses are safe under --preload because they are never written after the fork.
app.extensions is safe for read-only objects. A Redis client, a Sentry client, or a
database engine stored in app.extensions["redis_client"] is initialized once in create_app()
(before fork) and used read-only in workers. This is the correct pattern. Do not store mutable
per-user or per-request state there.
Required pattern for transient state
Use Redis with a TTL. The post-fix webauthn_service.py is the canonical example:

    # webauthn_service.py (AFTER [PR #2728](https://github.com/raxx-app/TradeMasterAPI/pull/2728))
    # Imports, logger, and the fallback dict are shown so the excerpt is self-contained.
    import logging

    from flask import Flask

    logger = logging.getLogger(__name__)

    _CHALLENGE_TTL = 60  # seconds
    _challenge_store_fallback: dict[int, str] = {}  # local dev / CI only

    def _store_challenge(app: Flask, user_id: int, challenge: str) -> None:
        redis = app.extensions.get("redis_client")
        if redis:
            redis.set(f"webauthn:reg:challenge:{user_id}", challenge, ex=_CHALLENGE_TTL)
            logger.info("webauthn.challenge_stored store=redis user_id=%s", user_id)
        else:
            # Local dev / CI: Redis absent, fall back to in-process dict
            _challenge_store_fallback[user_id] = challenge
            logger.info("webauthn.challenge_stored store=local user_id=%s", user_id)

    def _pop_challenge(app: Flask, user_id: int) -> str | None:
        redis = app.extensions.get("redis_client")
        if redis:
            val = redis.getdel(f"webauthn:reg:challenge:{user_id}")
            if val is None:
                logger.warning("webauthn.challenge_miss store=redis user_id=%s", user_id)
            else:
                logger.info("webauthn.challenge_popped store=redis user_id=%s", user_id)
            return val.decode() if val else None
        else:
            return _challenge_store_fallback.pop(user_id, None)

Key properties of this pattern:
- Atomic pop. GETDEL reads and deletes in a single Redis command. No race between read and delete.
- TTL-bounded. A challenge that is never popped (e.g., user abandons signup) expires in 60 seconds. No accumulation of stale state.
- Positive diagnostic signal. webauthn.challenge_miss is a named log event, not a silent None return. Sentry (via FLAG_WEBAUTHN_CHALLENGE_MISS_ALERT) captures it.
- Local-dev fallback. When REDIS_URL is absent (local dev, CI), the code falls back to an in-process dict. This is acceptable because a single-process dev server does not fork.
Code-review checklist
When reviewing any PR that adds new state to a Raptor service:
- [ ] Is this state written after create_app() returns? If yes, it cannot be module-level.
- [ ] Is this state per-user, per-session, or per-request? If yes, it belongs in Redis or Postgres, not app.extensions.
- [ ] Is this state read-only after startup? If yes, app.config or app.extensions is fine.
- [ ] If Redis is used: does it have a TTL? Does pop use GETDEL (atomic)? Does a miss emit a named log event?
- [ ] Is there a local-dev fallback for when REDIS_URL is absent?
Incidents that established this policy
| Incident | Date | PR | Bug class |
|---|---|---|---|
| WebAuthn challenge-store miss — every signup fails | 2026-05-25 | PR #2728 | Module-level dict, challenge written to one worker's CoW page, verify called on different worker |
| Per-user trading_mode_override process-global dict | (see PR #3030) | PR #3030 | Module-level dict keyed on user_id, mutations diverged across worker processes under preload |
SQLite-vs-Postgres test divergence (known failure class)
Raptor's unit tests run against in-memory SQLite via _make_engine() fixtures that
patch _get_conn() directly. This is fast and self-contained, but SQLite accepts
several SQL constructs that Postgres rejects. Whenever a write path passes its tests
but returns a 500 in prod, check this list first.
Known divergence points
| Construct | SQLite behaviour | Postgres behaviour | Safe alternative |
|---|---|---|---|
| MAX(col, val) in a SET clause | Accepted: with two or more arguments SQLite's max() is a scalar function and returns the larger value | Rejected: function max(integer, integer) does not exist | Read current value → Python max(current, new) → pass as bound param |
| GREATEST(a, b) | Rejected (no GREATEST function) | Accepted | Same: Python max() with bound param |
| DO $$ ... $$ PL/pgSQL blocks | Rejected | Accepted | Mark with -- POSTGRES-ONLY sentinel; use if bind.dialect.name == 'postgresql': guard |
| RENAME COLUMN (DDL) | Accepted in SQLite 3.25+ | Accepted | Safe on both |
| now() as server_default | Rejected | Accepted | Use sa.text("now()") gated on if bind.dialect.name == 'postgresql': |
Rule for advance-progress patterns
Any UPDATE that increments or ceiling-caps a counter column must follow this pattern
(from beta_walkthrough.py, established 2026-04-xx):

    from sqlalchemy import text

    # Read current value
    row = conn.execute(text("SELECT col FROM table WHERE pk = :pk"), {"pk": pk}).fetchone()
    current = row[0] if row else 0
    # Compute in Python; never use MAX(col, val) in the SET clause
    conn.execute(
        text("UPDATE table SET col = :new_val WHERE pk = :pk"),
        {"new_val": max(current, new_value), "pk": pk},
    )

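A self-contained way to see both halves of this rule is to run it against in-memory SQLite with the standard library. This is an illustrative sketch: the progress table, step column, and advance_progress helper are hypothetical names, and plain sqlite3 is used in place of SQLAlchemy so the example runs with no dependencies.

```python
import sqlite3


def advance_progress(conn: sqlite3.Connection, pk: int, new_value: int) -> int:
    """Portable forward-only update: read, compute max() in Python, write back."""
    row = conn.execute("SELECT step FROM progress WHERE pk = ?", (pk,)).fetchone()
    current = row[0] if row else 0
    target = max(current, new_value)
    # If the row does not exist this UPDATE simply matches nothing.
    conn.execute("UPDATE progress SET step = ? WHERE pk = ?", (target, pk))
    return target


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE progress (pk INTEGER PRIMARY KEY, step INTEGER NOT NULL)")
conn.execute("INSERT INTO progress (pk, step) VALUES (1, 3)")

# SQLite accepts the two-argument form, which is why unit tests stay green
# even though the same statement is rejected by Postgres.
conn.execute("UPDATE progress SET step = MAX(step, 4) WHERE pk = 1")

advance_progress(conn, 1, 5)  # moves forward
advance_progress(conn, 1, 2)  # a lower value never moves the counter back
final_step = conn.execute("SELECT step FROM progress WHERE pk = 1").fetchone()[0]
print(final_step)
```

The MAX(step, 4) statement is included only to show the trap; the advance_progress form is the one that behaves identically on both databases.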
Incidents from this class
| Incident | Date | PR | Failure |
|---|---|---|---|
| beta_preview POST /screen/1 500 | 2026-06-12 | PR #3540 | MAX(col, val) in SET clause — SQLite silent, Postgres hard error |
| beta_nda_ack table missing in Raptor | 2026-06-10 | PR #3454 | Table created in Console migration only — no Raptor alembic migration |
| beta_walkthrough tables missing in Raptor | 2026-06-12 | PR #3517 | Same pattern as 2026-06-10 |
Blueprint authoring convention — url_prefix MUST NOT include /api
This is a common footgun. The _core_blueprints registration loop in
api/__init__.py (~line 370) prepends /api to every blueprint's prefix:

    for bp in _core_blueprints:
        prefixed_url = f"/api{bp.url_prefix}" if bp.url_prefix else "/api"
        app.register_blueprint(bp, url_prefix=prefixed_url)

All blueprints in _core_blueprints MUST define their url_prefix without the
leading /api. Examples:
| Blueprint | Correct prefix | Final route |
|---|---|---|
| internal_waf_events | /internal/waf-events | /api/internal/waf-events |
| freescout_audit_webhook | /internal/freescout-webhook | /api/internal/freescout-webhook |
| internal_beta_survey | /internal/beta-survey-data | /api/internal/beta-survey-data |
If a blueprint accidentally includes /api in its prefix (e.g., /api/internal/foo),
the route is registered at /api/api/internal/foo and every request falls through to
the serve_static catch-all → 404 {"error":"React build not found..."}.
Symptom: raptor_survey_client: Raptor returned status=404 error=React build not found
in Console logs, and matching 404s in Raptor access logs for a route that should exist.
Fix: Remove the /api prefix from the Blueprint definition's url_prefix argument.
Incident reference: docs/incidents/2026-06-13-beta-digest-survey-404.md (PR #3556 introduced, PR #3558 fixed).
Cross-references
- Two-URL Postgres model: docs/ops/runbooks/raptor-db-credentials.md
- Postgres role provisioning: docs/ops/runbooks/raptor-postgres-roles.md
- Staging Postgres cutover: docs/ops/runbooks/raptor-postgres-staging-cutover.md
- Prod Postgres cutover: docs/ops/runbooks/raptor-postgres-prod-cutover.md
- Feature flag flip procedure: docs/ops/feature-flags-runbook.md
- Required config vars manifest: docs/ops/required-config-vars.yaml
- Incident (boot fail): docs/incidents/2026-05-20-staging-webauthn-boot-fail.md
- Incident (challenge-store miss): docs/incidents/2026-05-25-signup-challenge-store-miss.md