RCA — WebAuthn validator boot regression on raxx-api-prod
- Incident ID: 2026-05-21-prod-webauthn-boot-fail
- Date: 2026-05-21
- Severity: SEV-1
- Duration: 6m 18s (19:00:01 UTC first crash → 19:06:09 UTC rollback complete)
- Blast radius: raxx-api-prod web dyno; operator-only (CF Access perimeter active; no customer traffic)
- Author: sre-agent
Summary
Flipping `FLAG_WEBAUTHN_REGISTRATION=1` and `FLAG_AUTH_WEBAUTHN_LOGIN=1` on raxx-api-prod caused the web dyno to crash-loop at boot. The validator in `create_app()` raised `RuntimeError: WebAuthn config validation failed — refusing to start: WEBAUTHN_RP_ID is not set` even though `WEBAUTHN_RP_ID=raxx.app` was present in the Heroku config and visible to one-off dynos. The crash was attributed to `app.config.from_pyfile('config.py', silent=True)` overwriting the correctly-mapped `WEBAUTHN_RP_ID` value with `''` after `from_mapping()` had set it (Root cause analysis notes a second explanation that was not ruled out). The PR #2590 regression tests did not exercise this code path because they used the `test_config` path, not the production `from_pyfile` path. The dyno was restored in 6 minutes via `heroku config:unset`.
Timeline (all times UTC)
- 07:24 — Release v32: `WEBAUTHN_RP_ID=raxx.app`, `WEBAUTHN_ORIGIN=https://raxx.app`, `WEBAUTHN_RP_NAME=Raxx` set on `raxx-api-prod` (no flags flipped; dyno stable).
- 18:59:51 — `heroku config:set BOOTSTRAP_TOKEN_SIGNING_KEY=<64-hex> FLAG_WEBAUTHN_REGISTRATION=1 FLAG_AUTH_WEBAUTHN_LOGIN=1 -a raxx-api-prod`
- 18:59:53 — Old web.1 received SIGTERM, exited cleanly.
- 19:00:01 — New web.1 boot attempt. `create_app()` raised: `RuntimeError: WebAuthn config validation failed — refusing to start: WEBAUTHN_RP_ID is not set. Set WEBAUTHN_RP_ID=raxx.app in your environment.`
- 19:00:11 — Second restart attempt, identical crash.
- 19:02:00 — SRE investigation started. `heroku run python -c "import os; print(repr(os.environ.get('WEBAUTHN_RP_ID')))"` returned `'raxx.app'`, confirming the env var is present on the dyno.
- 19:04:00 — Root cause identified: `from_pyfile('config.py', silent=True)` overwrites `app.config['WEBAUTHN_RP_ID']` with `''` after `from_mapping()` sets it correctly. Confirmed via local reproduction.
- 19:06:09 — `heroku config:unset FLAG_WEBAUTHN_REGISTRATION FLAG_AUTH_WEBAUTHN_LOGIN -a raxx-api-prod`. Web dyno recovered.
Impact
- Users affected: 0 (CF Access perimeter; launch posture is operator-testing-only)
- User-visible symptoms: none
- Data integrity: ok
- Revenue / billing: ok
What went well
- Rollback was clean and fast (config:unset is instant; no code deploy needed).
- CF Access perimeter contained blast radius to operator only.
- The error message from the validator was precise and pointed directly at the right config key.
- `heroku run` one-off dyno check confirmed `os.environ` state quickly.
What didn't go well
- The same failure class (WebAuthn boot crash) recurred 24 hours after PR #2590 was merged to fix the staging variant. The fix existed in main but its test coverage had a gap.
- PR #2590's regression tests exercised `create_app(test_config={...})` (the test code path), which calls `app.config.from_mapping(test_config)` and skips `from_pyfile('config.py', silent=True)`. The production path was never tested.
- No pre-flip checklist existed to verify: (a) the fix is deployed to prod, and (b) the production boot path is smoke-tested.
- `BOOTSTRAP_TOKEN_SIGNING_KEY` was set in the same `config:set` call as the flag flip. That key had not yet been written to Infisical vault (operator note: still outstanding at rollback time).
Root cause analysis
- Contributing factor 1: `from_pyfile` silently overwrites critical config keys. `create_app()` calls `app.config.from_mapping(...)` at line 66 (correctly setting `WEBAUTHN_RP_ID=os.environ.get(...)`) and then calls `app.config.from_pyfile('config.py', silent=True)` at line 121. `from_pyfile` loads `{instance_path}/config.py` if it exists. If that file contains `WEBAUTHN_RP_ID = ''`, Flask's config system silently overwrites the correctly-set value with `''`. The `silent=True` parameter suppresses the FileNotFoundError when the file is absent, but does NOT suppress loading when the file IS present. Any `backend_v2/instance/config.py` that sets `WEBAUTHN_RP_ID` to an empty value (e.g., from a prior local dev session that was accidentally included in the slug, or from a migration tool creating a partial config) would have triggered this overwrite.
- Contributing factor 2: PR #2590's test suite only covered the test code path. The tests in `test_app_factory.py::TestWebAuthnConfigWithFlagEnabled::test_boot_succeeds_when_valid_config_and_flag_on` called `create_app(test_config={...})`. With a non-None `test_config`, the factory skips `from_pyfile` and calls `from_mapping(test_config)` instead. This meant the `from_pyfile` → overwrite → crash scenario was never exercised in CI. The test passed because there was no `from_pyfile` call on that code path.
- Contributing factor 3: No prod deployment gate before flag flip. PR #2590 fixed the staging incident (2026-05-20) and was merged to main, which auto-deploys to staging. Prod requires `workflow_dispatch + human approval` per ADR-0020. It is possible that at the time of the flag flip (18:59:51 UTC 2026-05-21), the prod slug was still running a build that pre-dated the #2590 fix (the `from_mapping` fix was absent entirely). If so, `WEBAUTHN_RP_ID` was never mapped into `app.config` at all, which is the original failure mode from 2026-05-20. Either the prod slug lacked the fix, or `instance/config.py` triggered the overwrite. The evidence is consistent with both; the fix addresses both.
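
The ordering problem in contributing factors 1 and 2 can be reproduced without the application. The sketch below is not the project's `create_app()`: `load_pyfile` and `build_config` are hypothetical stand-ins that use only the standard library to mimic the behavior described above (a config file is executed and its uppercase names are copied over whatever was set earlier; `silent=True` only ignores a missing file). It assumes the real factory follows the same two branches.

```python
import os
import tempfile


def load_pyfile(config, path, silent=False):
    """Stand-in for a from_pyfile-style loader: exec the file, copy UPPERCASE names."""
    try:
        with open(path, "rb") as handle:
            source = handle.read()
    except FileNotFoundError:
        if silent:
            return False  # a missing file is ignored; a present file is always loaded
        raise
    namespace = {}
    exec(compile(source, path, "exec"), namespace)
    # Later values win: anything the file defines replaces what was mapped before.
    config.update({k: v for k, v in namespace.items() if k.isupper()})
    return True


def build_config(environ, instance_dir, test_config=None):
    """Hypothetical reduction of the config steps described for create_app()."""
    config = {"WEBAUTHN_RP_ID": environ.get("WEBAUTHN_RP_ID", "")}  # from_mapping step
    if test_config is None:
        # Production path: the instance config file is consulted.
        load_pyfile(config, os.path.join(instance_dir, "config.py"), silent=True)
    else:
        # Test path: the file is never read.
        config.update(test_config)
    return config


def demo():
    environ = {"WEBAUTHN_RP_ID": "raxx.app"}
    with tempfile.TemporaryDirectory() as instance_dir:
        clean = build_config(environ, instance_dir)  # no config.py present
        with open(os.path.join(instance_dir, "config.py"), "w") as handle:
            handle.write("WEBAUTHN_RP_ID = ''\n")
        prod = build_config(environ, instance_dir)  # stale config.py present
        test = build_config(environ, instance_dir,
                            test_config={"WEBAUTHN_RP_ID": "raxx.app"})
    return clean, prod, test
```

With the stale file in place, only the production branch ends up with an empty `WEBAUTHN_RP_ID`; the `test_config` branch still looks healthy, which matches the CI gap described in contributing factor 2.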
Detection
- What alerted us: Heroku dyno crash log surfaced immediately after config:set.
- Time between cause and detection: ~10 seconds (dyno boot attempt at 19:00:01, crash visible instantly in log stream).
- How to detect faster next time: a boot smoke test (e.g., `heroku run python -c "from api import create_app; create_app()"`) before the flag flip would have surfaced this before dyno traffic was affected. Add this as a pre-flip runbook step.
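
One caveat worth checking when writing the runbook step: if the validator only runs when the flags are on, a smoke run made before the flip needs the flags set for that one process. The sketch below shows one possible shape for such a check. `boot_smoke` is a hypothetical helper, not part of the codebase, and it assumes the factory reads its flags from the process environment at call time.

```python
import os
import sys


def boot_smoke(factory, flags, environ=None):
    """Call an app factory with flags applied temporarily; return 0 on success, 1 on failure."""
    environ = os.environ if environ is None else environ
    saved = {name: environ.get(name) for name in flags}
    environ.update(flags)
    try:
        factory()
        return 0
    except Exception as exc:  # any boot-time failure counts as a failed smoke run
        print(f"boot smoke failed: {exc}", file=sys.stderr)
        return 1
    finally:
        # Restore the environment so the check leaves no trace in the process.
        for name, old in saved.items():
            if old is None:
                environ.pop(name, None)
            else:
                environ[name] = old


# Possible use in a one-off dyno (assumes `api.create_app` as in the runbook step):
#   from api import create_app
#   sys.exit(boot_smoke(create_app, {"FLAG_WEBAUTHN_REGISTRATION": "1",
#                                    "FLAG_AUTH_WEBAUTHN_LOGIN": "1"}))
```

A non-zero exit code makes the step easy to gate on in a script; the flags never reach the Heroku config, so a failed run does not restart the web dyno.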
Resolution
- What was changed: `heroku config:unset FLAG_WEBAUTHN_REGISTRATION FLAG_AUTH_WEBAUTHN_LOGIN -a raxx-api-prod` (rollback). No code change at incident time.
- Validation: web dyno came up cleanly. `heroku ps -a raxx-api-prod` showed `web.1: up`.
- Follow-on fix (this PR): added `if test_config is None:` re-pin block in `create_app()` that re-reads `WEBAUTHN_RP_ID`, `WEBAUTHN_RP_NAME`, `WEBAUTHN_ORIGIN` from `os.environ` after the `from_pyfile` call, so no `instance/config.py` can overwrite them. Added regression tests that exercise the production code path and the override scenario.
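
As an illustration of the re-pin idea, here is a minimal sketch. It is not the code from the PR: `repin_from_env` and `PINNED_KEYS` are hypothetical names, and leaving a key untouched when the environment does not define it is an assumption that the actual fix may not share.

```python
import os

# Keys the follow-on fix is described as re-reading from the environment.
PINNED_KEYS = ("WEBAUTHN_RP_ID", "WEBAUTHN_RP_NAME", "WEBAUTHN_ORIGIN")


def repin_from_env(config, environ=None, keys=PINNED_KEYS):
    """Re-apply environment values for pinned keys after a config file was loaded."""
    environ = os.environ if environ is None else environ
    for key in keys:
        if key in environ:  # assumption: keys absent from the environment are left as-is
            config[key] = environ[key]
    return config


# Where it would sit in a factory (sketch only):
#   if test_config is None:
#       app.config.from_pyfile("config.py", silent=True)
#       repin_from_env(app.config)
```

Because the re-pin runs after the file load, the environment has the last word for these three keys, while every other key in `instance/config.py` keeps working as before.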
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Write `BOOTSTRAP_TOKEN_SIGNING_KEY` to Infisical vault (was set on prod but not in vault) | operator | 2026-05-22 | — |
| 2 | Add pre-flip smoke step to WebAuthn runbook: `heroku run -a raxx-api-prod -- python -c "from api import create_app; create_app()"` before any flag flip that touches boot-time validators | sre-agent | 2026-05-23 | — |
| 3 | Prod deploy after this fix merges (workflow_dispatch + approval per ADR-0020) | operator | 2026-05-22 | — |
| 4 | Update `.gitignore` to explicitly ignore `backend_v2/instance/config.py` so a dev machine never accidentally commits it into a slug | developer | 2026-05-28 | — |
References
- Runbook: `docs/ops/runbooks/raptor.md` (to be updated with pre-flip smoke step)
- Prior incident: `docs/incidents/2026-05-20-staging-webauthn-boot-fail.md`
- Fix PR: `fix(api): WebAuthn validator regression on flag-gated boot — RCA + smoke gate`
- ADR-0005: WebAuthn RP ID policy
- ADR-0012: RP ID isolation (raxx.app vs console.raxx.app)
- ADR-0020: Prod deploy gate (workflow_dispatch + human approval)