RCA — WebAuthn registration challenge miss on every signup attempt

Incident ID: 2026-05-25-signup-challenge-store-miss Date: 2026-05-25 Severity: SEV-1 Duration: ~3h active (00:56 UTC first attempt → 03:37 UTC fix deployed and verified) Blast radius: Operator (single user; no customers yet — pre-launch) Author: sre-agent

Summary

Every attempt to complete WebAuthn passkey registration via the /register/begin-with-token → /register/verify-with-token flow (Option C bootstrap-link path) failed at the verify step with _pop_challenge(user_id) returning None, causing a 400 "Passkey verification failed" in 1ms. Root cause: the in-process _challenge_store dict in webauthn_service.py diverges per gunicorn process due to copy-on-write semantics after --preload fork, so the challenge written in the master/worker that handled begin is not visible to the process that handles verify. Fix: Redis-backed challenge store using the already-provisioned Heroku Redis instance (REDIS_URL already set on prod).

Timeline (all times UTC)

00:52:57 — v44 build started (PR #2725 — begin-with-token endpoint)
00:53:45 — v44 deployed to prod
00:54:08 — Operator mints bootstrap token via heroku run
00:56:40 — First signup attempt: begin-with-token → 200 (challenge stored in-process)
00:56:44 — Frontend calls /register/verify (old endpoint, v44 lacked verify-with-token); 202 constant response — challenge popped or missed
00:56:48 — Operator retries begin-with-token with same token → 400 (nonce already consumed, correct behavior)
01:10:53 — Operator mints fresh bootstrap token
01:12:12 — v45 build started (PR #2726 — verify-with-token endpoint + frontend wiring)
01:12:59 — v45 deployed; both endpoints now live
01:14:22 — Operator mints another fresh token
01:23:55 — Second signup attempt: begin-with-token → 200 (challenge stored, 45ms, DB writes committed)
01:23:59 — verify-with-token → 400 (1ms — _pop_challenge returns None)
01:24:09 — Operator retries begin-with-token with same token → 400 (nonce already consumed)
01:30:27 — v46 build started (PR #2727 — --workers 1 to eliminate multi-worker theory)
01:30:52 — v46 deployed; pid 17 single worker alive from this point
01:31:16 — Operator mints fresh token
02:20:11 — Third signup attempt: begin-with-token → 200 (single-worker, pid 17, 42ms)
02:20:15 — verify-with-token → 400 (1ms — _pop_challenge still returns None)
~02:30 — SRE agent engaged; prod DB + logs analysis begins
~02:50 — Root cause identified: in-process dict COW divergence under --preload
~03:05 — Fix implemented: Redis-backed _store_challenge/_pop_challenge in webauthn_service.py
~03:10 — PR #2728 opened (fix/webauthn-challenge-store-redis), base: main
~03:20 — Bootstrap token minted for operator retry (expires 03:37 UTC)

Impact

Users affected: 0 (operator self-test only; no customers — pre-launch)
User-visible symptoms: None (pre-launch)
Data integrity: ok
Revenue / billing: ok

What went well

Single-dyno deploy (v46, PR #2727) correctly ruled out the multi-worker split theory
Heroku Redis was already provisioned and working (demo flow) — no new infra needed for the fix
The 1ms verify failure latency was a precise diagnostic signal: _pop_challenge dict miss, not a DB call or webauthn.verify_registration_response() failure
Structured response logging (middleware) emitted the exact error body for the 400 responses
Token replay detection (00:56:48 replay → 400) confirmed token nonce commits are working correctly — isolated the bug to challenge storage, not token validation

What didn't go well

The v44 deploy shipped begin-with-token without verify-with-token; the frontend fell back to the old /register/verify endpoint, consuming the challenge. The two PRs (#2725 and #2726) should have been a single PR
No INFO-level log line appears for successful begin-with-token 200 responses in Heroku (the success logger.info("auth.register_begin_with_token issued ...") is swallowed by the root logger level in production) — this delayed diagnosis
The COW-after-preload failure mode is not documented anywhere in the codebase or runbooks
bootstrap_tokens_used showed only the 2026-05-23 entry, not the 2026-05-25 entries, which sent analysis in a misleading direction before the logs were read more carefully

Root cause analysis

Contributing factor 1: gunicorn --preload + module-level dict — _challenge_store is a Python module-level dict initialized once. With --preload, gunicorn imports the module in the master process, then forks worker(s). Each forked worker gets a copy-on-write clone of the module. The first write to _challenge_store in any process creates a private copy; other processes do not see it. Under normal single-worker steady-state this should be safe (master doesn't serve requests), but under dyno lifecycle events (SIGCHLD handling, brief overlap during hot reload), the master process may service a request — storing the challenge in master memory, then the worker handles verify and finds the dict empty.
Contributing factor 2: begin-with-token and verify-with-token shipped in separate PRs — The v44 deploy (PR #2725) contained begin-with-token but not verify-with-token. The frontend at v44 fell back to the old /register/verify endpoint. This consumed or failed to consume the challenge on the first attempt, making all subsequent debug sessions start from a corrupted state. Had both endpoints shipped together (one PR), the frontend would have called the correct verify path from the first attempt.
Contributing factor 3: no diagnostic logging in the challenge store — _store_challenge and _pop_challenge emitted no logs (not even at DEBUG level) showing what user_id was stored/looked up or which backing store was used. The miss was only detectable by the absence of the webauthn.registration_verify_failed WARNING log — a negative signal.

Detection

What alerted us: Operator reported persistent 400 "Passkey verification failed" on every signup attempt
How long between cause and detection: ~1h (first attempt at 00:56, analysis began ~02:30)
How to detect faster next time: The new Redis-backed store emits webauthn.challenge_stored store=redis and webauthn.challenge_popped store=redis at INFO level on every call. A miss now emits webauthn.challenge_miss store=redis user_id=N. This turns a negative signal into a positive one.

Resolution

What was changed: backend_v2/api/services/webauthn_service.py — _store_challenge and _pop_challenge now write to/read from Redis when REDIS_URL is set. Key: webauthn:reg:challenge:{user_id}. TTL: 60s. GETDEL for atomic pop. Falls back to in-process dict when REDIS_URL is absent (local dev, CI).
Validation: PR #2728 — operator to merge, deploy (v47), mint token, complete passkey ceremony. Confirm webauthn.challenge_stored store=redis + webauthn.challenge_popped store=redis in Heroku logs.

Action items

#	Action	Owner	Due	Issue
1	Merge PR #2728, deploy v47, verify signup end-to-end	operator	2026-05-25	#2728
2	Add `webauthn.challenge_miss` Sentry alert / Heroku log drain alert	sre-agent	2026-06-01	TBD
3	Add runbook entry: gunicorn --preload + module-level mutable state is unsafe; all shared state must use Redis or Postgres	sre-agent	2026-05-26	TBD
4	Require `begin-with-token` + `verify-with-token` to ship in same PR (documented in release checklist)	operator	2026-05-30	TBD
5	Enable root logger INFO output in production so `logger.info(...)` calls in auth routes appear in Heroku logs	dev	2026-06-01	TBD

References

PR: fix(auth): Redis-backed WebAuthn challenge store — https://github.com/raxx-app/TradeMasterAPI/pull/2728
Runbook: docs/ops/runbooks/raptor.md (update pending)
Prior incident: docs/incidents/2026-05-22-signup-flow-double-p1.md
Prior incident: docs/incidents/2026-05-21-prod-webauthn-boot-fail.md