RCA — Antlers Phase 3 Cutover: WebAuthn smoke false-negative (wrong host + path)
- Incident ID: 2026-05-28-cutover-webauthn-smoke-false-negative
- Date: 2026-05-28
- Severity: SEV-3
- Duration: ~0m service impact (cutover succeeded; smoke failure was a false negative, no user-visible outage)
- Blast radius: CI/CD pipeline reported failure on a successful production cutover. No users affected.
- Author: sre-agent
Summary
The Antlers Phase 3 cutover workflow (run 26557690397) successfully re-pointed raxx.app from the CRA Pages project to the Next.js Pages project (raxx-prod-next). All 10 preceding steps passed, including the health check (HTTP 200, 729-byte body). The final WebAuthn smoke step (Step 11) failed with HTTP 405 because it sent a POST to https://raxx.app/api/auth/passkey/options — a path that does not exist on the Next.js frontend. The workflow reported FAILURE, but the cutover state was already correct. The smoke was a false negative. No rollback was needed or taken.
Timeline (all times UTC)
- 06:01:09 — Cutover workflow run 26557690397 started (live mode, dry_run=false)
- 06:01:10 — Vault load: CF tokens loaded from Infisical prod
- 06:01:14 — Pre-flight 1/3: raxx-prod-next has a successful deployment — PASS
- 06:01:xx — Pre-flight 2/3: nodejs_compat flag set — PASS
- 06:01:xx — Pre-flight 3/3: CF Access policy on raxx.app — PASS
- 06:01:xx — Snapshot: CRA deployment ID `7292b7c4-7605-46a8-9547-812e87619398` recorded
- 06:01:xx — Step 7: raxx.app attached to raxx-prod-next — PASS (HTTP 200/201 or 409)
- 06:01:xx — Step 8: CNAME raxx.app → raxx-prod-next.pages.dev updated — PASS
- 06:01:xx — Step 9: raxx.app detached from raxx-app (CRA) — PASS
- 06:01:49 — Step 10: Health check raxx.app — PASS (attempt 4/8, HTTP 200, body 729 bytes)
- 06:01:49 — Step 11: WebAuthn smoke POST `https://raxx.app/api/auth/passkey/options` → HTTP 405 — FAIL
- 06:01:49 — Workflow exits non-zero; Step 12 (rollback alias doc) skipped
- 06:01:49 — Job summary and Slack notification emitted (failure state)
- 2026-05-28 (post-run) — SRE agent investigation confirms cutover state is correct; smoke was false negative
- 2026-05-28 — Fix-forward PR filed correcting smoke endpoint
Impact
- Users affected: 0
- User-visible symptoms: none — raxx.app serves Next.js correctly post-cutover
- Data integrity: ok
- Revenue / billing: ok
- Cutover state: raxx.app attached to raxx-prod-next (verified via CF Pages API); raxx-app has no custom domains
What went well
- All 10 pre-cutover and cutover steps executed correctly
- The health check retry loop (8 attempts, 10s interval) correctly detected HTTP 200 on attempt 4, giving CF edge propagation time to settle
- Workflow emitted the job summary and Slack notification even on failure, providing rollback documentation
- The CRA snapshot deployment ID was recorded before the smoke step ran, preserving the rollback reference
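
The health check retry behaviour described above (8 attempts, 10s interval, pass on the first HTTP 200) can be sketched roughly as follows. This is an illustration, not the workflow's actual implementation: the function name `wait_for_healthy` and the injectable `probe` and `sleep` parameters are hypothetical, chosen so the loop can be exercised without network access.

```python
import time
from typing import Callable, Optional


def wait_for_healthy(
    probe: Callable[[], int],
    attempts: int = 8,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Return the 1-based attempt on which probe() first returned 200, else None.

    probe is a hypothetical callable that performs one health request and
    returns the HTTP status code.
    """
    for attempt in range(1, attempts + 1):
        if probe() == 200:
            return attempt
        if attempt < attempts:
            # Wait before retrying so edge propagation has time to settle.
            sleep(interval_s)
    return None
```

With a probe that returns a non-200 status three times and then 200, the function returns 4, which corresponds to the "attempt 4/8" recorded in the timeline.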
What didn't go well
- The WebAuthn smoke comment in the workflow file (# per [PR #2714](https://github.com/raxx-app/TradeMasterAPI/pull/2714)) referenced the wrong endpoint path
- The smoke targeted `raxx.app` (the Next.js frontend) instead of `api.raxx.app` (Raptor backend); there is no API proxy from raxx.app to api.raxx.app
- The path `/api/auth/passkey/options` does not exist in Raptor; the correct registration options path is `/api/auth/register/options` and the correct login options path is `/api/auth/login/options`
- A false-negative CI failure on a successful production cutover creates unnecessary alarm and blocks automated post-cutover steps (Step 12 was skipped)
Root cause analysis
- Contributing factor 1: Wrong target host in smoke. The smoke POSTed to `raxx.app`, assuming it proxies `/api/*` requests to Raptor. The Next.js frontend (raxx-prod-next) has no such proxy; `NEXT_PUBLIC_API_URL` points the browser directly to `api.raxx.app`. A POST to `raxx.app/api/auth/passkey/options` reaches the Next.js CF Pages runtime, which has no matching route handler and returns 405.
- Contributing factor 2: Wrong path. The path `/api/auth/passkey/options` was never a valid Raptor route. Raptor's registration challenge endpoint is `/api/auth/register/options`; the login challenge endpoint is `/api/auth/login/options`. The path likely originated from a comment referencing early design drafts before the final route naming was settled.
- Contributing factor 3: No pre-flight validation of the smoke endpoint. The workflow had a thorough pre-flight battery for infrastructure state (deployment, compatibility flags, CF Access policy) but no pre-flight verification that the smoke target URL was reachable. A pre-flight curl against the smoke URL in dry-run mode would have surfaced the wrong endpoint before a live cutover.
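
Contributing factors 1 and 2 are both properties of the smoke URL itself, so they could in principle be caught by a static check before any request is sent. The sketch below is one possible shape for such a check. The constant and function names are hypothetical, and the set of valid paths contains only the two Raptor routes named in this RCA, not the full route index.

```python
from urllib.parse import urlsplit

# Values taken from this RCA; the names are hypothetical.
API_HOST = "api.raxx.app"
KNOWN_OPTIONS_PATHS = {
    "/api/auth/register/options",
    "/api/auth/login/options",
}


def check_smoke_target(url: str) -> list[str]:
    """Return a list of problems with a smoke URL; an empty list means none found."""
    parts = urlsplit(url)
    problems = []
    if parts.hostname != API_HOST:
        problems.append(
            f"host {parts.hostname!r} is not the Raptor backend {API_HOST!r}"
        )
    if parts.path not in KNOWN_OPTIONS_PATHS:
        problems.append(f"path {parts.path!r} is not a known Raptor options route")
    return problems
```

Under these assumptions, the original smoke URL `https://raxx.app/api/auth/passkey/options` would produce two problems (host and path), while `https://api.raxx.app/api/auth/login/options` would produce none.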
Detection
- What alerted us: Workflow reported FAILURE on step "WebAuthn smoke — POST /api/auth/passkey/options returns 2xx"
- How long between cause and detection: 0m (immediate — CI reported on same run)
- How to detect faster next time: A dry run as currently implemented would not have caught this, because dry-run mode skips the actual curl. A dedicated "smoke endpoint pre-flight" step that runs in dry-run mode could curl the endpoint and report whether the path is valid, surfacing the problem before a live cutover.
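
A smoke-endpoint pre-flight of the kind proposed here might look like the following sketch. It treats 404 and 405 as evidence that the smoke itself is misconfigured, in line with action item 2. The function names, the verdict strings, and the empty JSON request body are assumptions made for illustration; the real request payload expected by Raptor is not stated in this RCA.

```python
import urllib.error
import urllib.request


def classify_preflight(status: int) -> str:
    """Map an HTTP status from the smoke endpoint to a dry-run verdict."""
    if status in (404, 405):
        # Route or method does not exist: the smoke target is wrong.
        return "endpoint-invalid"
    if 200 <= status < 300:
        return "ok"
    # Any other status: the endpoint answered, but the result is not conclusive.
    return "inconclusive"


def probe_status(url: str, timeout_s: float = 10.0) -> int:
    """POST an empty JSON body (an assumption) to url and return the status code."""
    req = urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still the result.
        return err.code
```

A dry run could then call `classify_preflight(probe_status(smoke_url))` and fail when the verdict is `"endpoint-invalid"`. Connection-level failures (`urllib.error.URLError`) are deliberately left to propagate, since they indicate a different class of problem than a wrong path.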
Resolution
- What was changed: `.github/workflows/deploy-antlers-cutover.yml` Step 11 corrected:
  - Host changed: `raxx.app` → `api.raxx.app`
  - Path changed: `/api/auth/passkey/options` → `/api/auth/login/options`
  - Added rpId validation: smoke now checks that `rpId:"raxx.app"` is present in the 200 response body, directly confirming the RP ID invariant
  - Updated workflow comment to explain the host distinction (Next.js frontend vs Raptor backend)
- Validation: Manual probe of `https://api.raxx.app/api/auth/login/options` returned HTTP 200 with `{"rpId":"raxx.app","challenge":"...","timeout":60000,"userVerification":"required"}`, confirming Raptor is reachable and RP ID is correct post-cutover
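
The rpId validation added to Step 11 can be illustrated with a small sketch. The workflow's actual check is not reproduced in this RCA, so the version below is an assumption: it parses the body as JSON and compares the `rpId` field, rather than matching a substring. The function name `smoke_passes` is hypothetical.

```python
import json

EXPECTED_RP_ID = "raxx.app"


def smoke_passes(status: int, body: str, expected_rp_id: str = EXPECTED_RP_ID) -> bool:
    """True only for a 200 response whose JSON body carries the expected rpId."""
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A non-JSON body (for example an HTML page) cannot confirm the RP ID.
        return False
    return isinstance(payload, dict) and payload.get("rpId") == expected_rp_id
```

Checking the parsed field rather than the status code alone means a 200 served by the wrong application, such as an HTML page from the frontend, would not pass the smoke.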
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Fix-forward PR for workflow Step 11 (this RCA) | sre-agent | 2026-05-28 | (this PR) |
| 2 | Add smoke-endpoint pre-flight step to dry-run mode — curl smoke URL before live cutover, fail dry-run if 404/405 | sre-agent | 2026-06-04 | TBD |
| 3 | Update workflow comment "per PR #2714" to reference correct endpoint source (auth.py route definition) | sre-agent | 2026-05-28 | (this PR — done) |
References
- Workflow: `.github/workflows/deploy-antlers-cutover.yml`
- GH Actions run: https://github.com/raxx-app/TradeMasterAPI/actions/runs/26557690397
- Raptor auth routes: `backend_v2/api/routes/auth.py` lines 5-13 (route index)
- Next.js API client: `frontend/raxx-next/lib/apiClient.ts` (confirms `NEXT_PUBLIC_API_URL=api.raxx.app`)
- CF Pages domain verification: CF Pages API confirmed `raxx.app` is `active` on `raxx-prod-next`, no domains on `raxx-app`
- ADR: `docs/architecture/adr/0106-antlers-nextjs-cutover-strategy.md`
- Related incident: `docs/incidents/2026-05-27-nodejs-compat-flag-missing.md`