RCA — WebAuthn registration challenge miss on every signup attempt
Incident ID: 2026-05-25-signup-challenge-store-miss Date: 2026-05-25 Severity: SEV-1 Duration: ~3h active (00:56 UTC first attempt → 03:37 UTC fix deployed and verified) Blast radius: Operator (single user; no customers yet — pre-launch) Author: sre-agent
Summary
Every attempt to complete WebAuthn passkey registration via the /register/begin-with-token → /register/verify-with-token flow (Option C bootstrap-link path) failed at the verify step with _pop_challenge(user_id) returning None, causing a 400 "Passkey verification failed" in 1ms. Root cause: the in-process _challenge_store dict in webauthn_service.py diverges per gunicorn process due to copy-on-write semantics after --preload fork, so the challenge written in the master/worker that handled begin is not visible to the process that handles verify. Fix: Redis-backed challenge store using the already-provisioned Heroku Redis instance (REDIS_URL already set on prod).
Timeline (all times UTC)
- 00:52:57 — v44 build started (PR #2725 —
begin-with-tokenendpoint) - 00:53:45 — v44 deployed to prod
- 00:54:08 — Operator mints bootstrap token via
heroku run - 00:56:40 — First signup attempt:
begin-with-token→ 200 (challenge stored in-process) - 00:56:44 — Frontend calls
/register/verify(old endpoint, v44 lackedverify-with-token); 202 constant response — challenge popped or missed - 00:56:48 — Operator retries
begin-with-tokenwith same token → 400 (nonce already consumed, correct behavior) - 01:10:53 — Operator mints fresh bootstrap token
- 01:12:12 — v45 build started (PR #2726 —
verify-with-tokenendpoint + frontend wiring) - 01:12:59 — v45 deployed; both endpoints now live
- 01:14:22 — Operator mints another fresh token
- 01:23:55 — Second signup attempt:
begin-with-token→ 200 (challenge stored, 45ms, DB writes committed) - 01:23:59 —
verify-with-token→ 400 (1ms —_pop_challengereturns None) - 01:24:09 — Operator retries
begin-with-tokenwith same token → 400 (nonce already consumed) - 01:30:27 — v46 build started (PR #2727 —
--workers 1to eliminate multi-worker theory) - 01:30:52 — v46 deployed; pid 17 single worker alive from this point
- 01:31:16 — Operator mints fresh token
- 02:20:11 — Third signup attempt:
begin-with-token→ 200 (single-worker, pid 17, 42ms) - 02:20:15 —
verify-with-token→ 400 (1ms —_pop_challengestill returns None) - ~02:30 — SRE agent engaged; prod DB + logs analysis begins
- ~02:50 — Root cause identified: in-process dict COW divergence under
--preload - ~03:05 — Fix implemented: Redis-backed
_store_challenge/_pop_challengeinwebauthn_service.py - ~03:10 — PR #2728 opened (
fix/webauthn-challenge-store-redis), base: main - ~03:20 — Bootstrap token minted for operator retry (expires 03:37 UTC)
Impact
- Users affected: 0 (operator self-test only; no customers — pre-launch)
- User-visible symptoms: None (pre-launch)
- Data integrity: ok
- Revenue / billing: ok
What went well
- Single-dyno deploy (v46, PR #2727) correctly ruled out the multi-worker split theory
- Heroku Redis was already provisioned and working (demo flow) — no new infra needed for the fix
- The 1ms verify failure latency was a precise diagnostic signal:
_pop_challengedict miss, not a DB call orwebauthn.verify_registration_response()failure - Structured response logging (middleware) emitted the exact error body for the 400 responses
- Token replay detection (00:56:48 replay → 400) confirmed token nonce commits are working correctly — isolated the bug to challenge storage, not token validation
What didn't go well
- The v44 deploy shipped
begin-with-tokenwithoutverify-with-token; the frontend fell back to the old/register/verifyendpoint, consuming the challenge. The two PRs (#2725 and #2726) should have been a single PR - No INFO-level log line appears for successful
begin-with-token200 responses in Heroku (the successlogger.info("auth.register_begin_with_token issued ...")is swallowed by the root logger level in production) — this delayed diagnosis - The COW-after-preload failure mode is not documented anywhere in the codebase or runbooks
bootstrap_tokens_usedshowed only the 2026-05-23 entry, not the 2026-05-25 entries, which sent analysis in a misleading direction before the logs were read more carefully
Root cause analysis
- Contributing factor 1: gunicorn
--preload+ module-level dict —_challenge_storeis a Python module-level dict initialized once. With--preload, gunicorn imports the module in the master process, then forks worker(s). Each forked worker gets a copy-on-write clone of the module. The first write to_challenge_storein any process creates a private copy; other processes do not see it. Under normal single-worker steady-state this should be safe (master doesn't serve requests), but under dyno lifecycle events (SIGCHLD handling, brief overlap during hot reload), the master process may service a request — storing the challenge in master memory, then the worker handles verify and finds the dict empty. - Contributing factor 2:
begin-with-tokenandverify-with-tokenshipped in separate PRs — The v44 deploy (PR #2725) containedbegin-with-tokenbut notverify-with-token. The frontend at v44 fell back to the old/register/verifyendpoint. This consumed or failed to consume the challenge on the first attempt, making all subsequent debug sessions start from a corrupted state. Had both endpoints shipped together (one PR), the frontend would have called the correct verify path from the first attempt. - Contributing factor 3: no diagnostic logging in the challenge store —
_store_challengeand_pop_challengeemitted no logs (not even at DEBUG level) showing what user_id was stored/looked up or which backing store was used. The miss was only detectable by the absence of thewebauthn.registration_verify_failedWARNING log — a negative signal.
Detection
- What alerted us: Operator reported persistent 400 "Passkey verification failed" on every signup attempt
- How long between cause and detection: ~1h (first attempt at 00:56, analysis began ~02:30)
- How to detect faster next time: The new Redis-backed store emits
webauthn.challenge_stored store=redisandwebauthn.challenge_popped store=redisat INFO level on every call. A miss now emitswebauthn.challenge_miss store=redis user_id=N. This turns a negative signal into a positive one.
Resolution
- What was changed:
backend_v2/api/services/webauthn_service.py—_store_challengeand_pop_challengenow write to/read from Redis whenREDIS_URLis set. Key:webauthn:reg:challenge:{user_id}. TTL: 60s. GETDEL for atomic pop. Falls back to in-process dict whenREDIS_URLis absent (local dev, CI). - Validation: PR #2728 — operator to merge, deploy (v47), mint token, complete passkey ceremony. Confirm
webauthn.challenge_stored store=redis+webauthn.challenge_popped store=redisin Heroku logs.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Merge PR #2728, deploy v47, verify signup end-to-end | operator | 2026-05-25 | #2728 |
| 2 | Add webauthn.challenge_miss Sentry alert / Heroku log drain alert |
sre-agent | 2026-06-01 | TBD |
| 3 | Add runbook entry: gunicorn --preload + module-level mutable state is unsafe; all shared state must use Redis or Postgres | sre-agent | 2026-05-26 | TBD |
| 4 | Require begin-with-token + verify-with-token to ship in same PR (documented in release checklist) |
operator | 2026-05-30 | TBD |
| 5 | Enable root logger INFO output in production so logger.info(...) calls in auth routes appear in Heroku logs |
dev | 2026-06-01 | TBD |
References
- PR:
fix(auth): Redis-backed WebAuthn challenge store— https://github.com/raxx-app/TradeMasterAPI/pull/2728 - Runbook:
docs/ops/runbooks/raptor.md(update pending) - Prior incident:
docs/incidents/2026-05-22-signup-flow-double-p1.md - Prior incident:
docs/incidents/2026-05-21-prod-webauthn-boot-fail.md