Runbook — Antlers Phase 3 post-cutover smoke + rollback
System: raxx.app — CF Pages CNAME cutover from CRA (raxx-app) to Next.js (raxx-prod-next)
Owner: operator / sre-agent
Scope: Phase 3 production cutover per ADR-0106 Strategy A
Related issues: #2883 (cutover execution), #2884 (this runbook), #2885 (retirement after 14-day soak)
ADR: docs/architecture/adr/0106-antlers-nextjs-cutover-strategy.md
Cutover workflow: .github/workflows/deploy-antlers-cutover.yml
§1 — Pre-conditions before running this runbook
All of the following must be true before starting the post-cutover smoke:
- [ ] Pre-flight smoke passed: POST api.raxx.app/api/auth/login/options returned 200 + rpId: raxx.app (validated by the pre-flight-smoke job in deploy-antlers-cutover.yml)
- [ ] The cutover workflow ran to completion with dry_run=false (check the GitHub Actions run log for the cutover job)
- [ ] CF Pages raxx-prod-next project has raxx.app attached as a custom domain
- [ ] The CRA (raxx-app) snapshot deployment ID was recorded by Step 6 of the workflow (visible in the Actions run summary; see the "cra_deploy_id" output)
- [ ] The raxx-prod-next CF Pages project is healthy (no build failures in the last deployment)
If any pre-condition is unmet, do NOT proceed — investigate the cutover workflow run first.
§2 — Post-cutover smoke checklist (10-15 min)
Run the automated smoke script immediately after the cutover workflow completes:
# Load CF Access service token from vault
CF_ACCESS_CLIENT_ID=$(infisical secrets get CF_ACCESS_CLIENT_ID \
--path /MooseQuest/cloudflare/ --env prod --plain)
CF_ACCESS_CLIENT_SECRET=$(infisical secrets get CF_ACCESS_CLIENT_SECRET \
--path /MooseQuest/cloudflare/ --env prod --plain)
python3 scripts/ops/phase3_post_cutover_smoke.py \
--client-id "${CF_ACCESS_CLIENT_ID}" \
--client-secret "${CF_ACCESS_CLIENT_SECRET}" \
--report-file "/tmp/phase3-smoke-$(date -u +%Y%m%dT%H%M%SZ).md"
The script runs the following 6 checks in order:
| # | Check | Pass criterion |
|---|---|---|
| 1 | GET https://raxx.app/ | HTTP 200 + response body contains _next/static/ (Next.js marker) |
| 2 | GET https://raxx.app/login | HTTP 200 |
| 3 | POST https://api.raxx.app/api/auth/login/options | HTTP 200 + "rpId":"raxx.app" in response JSON |
| 4 | GET https://raxx.app/signup | HTTP 200 + no SSR error markers in body |
| 5 | GET https://raxx.app/dashboard | HTTP 200 or 3xx (not 5xx) |
| 6 | GET https://getraxx.com/ | HTTP 200 (marketing site unaffected) |
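The pass criteria in the table can be read as a small pure function over (check number, HTTP status, body). The sketch below is illustrative only and is not the contents of scripts/ops/phase3_post_cutover_smoke.py; the function name `check_passes` and the SSR error marker strings are assumptions, since this runbook does not list the markers the real script looks for.

```python
import json

# Hypothetical SSR error markers; the real script's list is not shown in this runbook.
SSR_ERROR_MARKERS = ("Application error", "Internal Server Error")


def check_passes(check_no: int, status: int, body: str = "") -> bool:
    """Return True if a response satisfies the pass criterion for checks 1-6."""
    if check_no == 1:  # GET https://raxx.app/
        return status == 200 and "_next/static/" in body
    if check_no in (2, 6):  # GET /login and GET https://getraxx.com/
        return status == 200
    if check_no == 3:  # POST /api/auth/login/options
        if status != 200:
            return False
        try:
            payload = json.loads(body)
        except ValueError:  # body was not valid JSON
            return False
        return isinstance(payload, dict) and payload.get("rpId") == "raxx.app"
    if check_no == 4:  # GET /signup
        return status == 200 and not any(m in body for m in SSR_ERROR_MARKERS)
    if check_no == 5:  # GET /dashboard: 200 or any 3xx, never 5xx
        return status == 200 or 300 <= status < 400
    raise ValueError(f"unknown check number: {check_no}")
```

Keeping the criteria separate from the HTTP calls makes them easy to unit-test without touching production.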
Manual supplemental checks (run alongside the automated script, ~5 min):
- Open https://raxx.app in a browser and confirm no console errors (F12 → Console), no hydration warnings, no visible error boundary content.
- Navigate to https://raxx.app/signup and confirm the form fields render and no white screen.
- Open https://getraxx.com and confirm it still loads the marketing site (separate project).
§3 — Smoke pass criteria
All 6 automated checks pass AND:
- No Sentry antlers-nextjs project error count spikes above 3× the 7-day baseline in the first 15 minutes post-cutover. Check: https://sentry.io → project antlers-nextjs.
- No operator-reported functional regression in the first 30 minutes.
If both conditions are met, proceed to §5 (Soak period).
If any check fails or Sentry spikes, proceed to §4 (Rollback procedure).
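The go/no-go rule above can be sketched as a single decision function. This is a minimal illustration, not part of the smoke script: the name `next_step` is hypothetical, and it assumes the Sentry baseline has already been scaled to a 15-minute window so that it is directly comparable with the post-cutover count.

```python
from typing import Sequence


def next_step(
    checks_passed: Sequence[bool],
    sentry_errors_15m: float,
    sentry_baseline_15m: float,
    regression_reported: bool,
) -> str:
    """Return "soak" (proceed to §5) or "rollback" (proceed to §4)."""
    # All 6 automated checks must be present and passing.
    all_checks_ok = len(checks_passed) == 6 and all(checks_passed)
    # A spike is an error count strictly above 3x the baseline.
    sentry_ok = sentry_errors_15m <= 3 * sentry_baseline_15m
    if all_checks_ok and sentry_ok and not regression_reported:
        return "soak"
    return "rollback"
```

Note that §6 carves out one exception to this rule: an rpId mismatch on check 3 is escalated as a backend configuration problem rather than rolled back.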
§4 — Rollback procedure
Target restore time: under 5 minutes. The CF Pages domain re-attach is under 2 minutes; allow 3 minutes for CF edge propagation and verification.
Option A — Workflow rollback trigger (recommended, ~2 min)
Run the cutover workflow in rollback mode. This re-attaches raxx.app to the CRA project
(raxx-app) and removes it from raxx-prod-next. No DNS record change is required because
the CNAME still points at raxx-prod-next.pages.dev; the domain attachment change is what
routes traffic.
gh workflow run deploy-antlers-cutover.yml \
--repo raxx-app/TradeMasterAPI \
-f confirm_cutover="ROLLBACK raxx.app TO CRA" \
-f rollback=true \
-f cra_project_name=raxx-app
The workflow requires an operator-approval gate (production-nextjs GH environment).
Approve via https://github.com/raxx-app/TradeMasterAPI/actions → pending approval.
Verify rollback:
CF_ACCESS_CLIENT_ID=$(infisical secrets get CF_ACCESS_CLIENT_ID \
--path /MooseQuest/cloudflare/ --env prod --plain)
CF_ACCESS_CLIENT_SECRET=$(infisical secrets get CF_ACCESS_CLIENT_SECRET \
--path /MooseQuest/cloudflare/ --env prod --plain)
# Should return 200 — CRA app is back serving raxx.app
HTTP_CODE=$(curl -sS -o /dev/null -w "%{http_code}" \
-H "User-Agent: raxx-sre-rollback-verify/1.0" \
-H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
-H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}" \
"https://raxx.app/")
echo "raxx.app HTTP: ${HTTP_CODE}"
# Body should NOT contain _next/static/ (CRA build, not Next.js)
BODY=$(curl -sS \
-H "User-Agent: raxx-sre-rollback-verify/1.0" \
-H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
-H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}" \
"https://raxx.app/")
if echo "${BODY}" | grep -q "_next/static/"; then
echo "WARNING: _next/static/ still present — Next.js still serving. Rollback may not have propagated yet."
else
echo "OK: CRA markup confirmed (no _next/static/ marker)."
fi
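Because the rollback may take a short while to propagate, the one-shot check above can be wrapped in a polling loop. The following is a sketch under stated assumptions: `wait_for_cra` and its parameters are hypothetical names, the `fetch` callable is expected to perform the authenticated GET against https://raxx.app/ and return `(status, body)`, and the default of 6 attempts 15 seconds apart is chosen only to roughly match the 60-90 second propagation window mentioned in this runbook.

```python
import time
from typing import Callable, Tuple


def wait_for_cra(
    fetch: Callable[[], Tuple[int, str]],
    attempts: int = 6,
    delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the response is HTTP 200 without the Next.js marker.

    Returns True as soon as the CRA build is confirmed, or False if every
    attempt still shows Next.js markup or a non-200 status.
    """
    for attempt in range(1, attempts + 1):
        status, body = fetch()
        if status == 200 and "_next/static/" not in body:
            return True
        if attempt < attempts:  # no need to wait after the final attempt
            sleep(delay_s)
    return False
```

Injecting `fetch` and `sleep` keeps the loop testable offline; in real use `fetch` would send the same CF Access headers as the curl commands.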
Option B — CF Pages API curl (break-glass, no workflow required)
Use when the workflow cannot be triggered (GH Actions outage, approval unavailable, etc.).
# Load tokens from vault
CF_TOKEN=$(infisical secrets get CF_PAGES_DEPLOY_TOKEN \
--path /MooseQuest/cloudflare/ --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID \
--path /MooseQuest/cloudflare/ --env prod --plain)
# Step 1: Re-attach raxx.app to the CRA project (raxx-app)
curl -sS -X POST \
-w '\nHTTP %{http_code}\n' \
-H "Authorization: Bearer ${CF_TOKEN}" \
-H "Content-Type: application/json" \
-H "User-Agent: raxx-sre-rollback/1.0" \
"https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/raxx-app/domains" \
-d '{"name":"raxx.app"}'
# Step 2: Remove raxx.app from the Next.js project (raxx-prod-next)
curl -sS -X DELETE \
-w '\nHTTP %{http_code}\n' \
-H "Authorization: Bearer ${CF_TOKEN}" \
-H "User-Agent: raxx-sre-rollback/1.0" \
"https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/raxx-prod-next/domains/raxx.app"
Expected responses:
- Step 1: HTTP 200/201 (attached) or 409 (already attached; idempotent, continue)
- Step 2: HTTP 200/204 (detached) or 404 (not attached; idempotent, continue)
Wait 60-90 seconds for CF edge propagation, then verify (see Option A verification commands above).
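The expected-response rules for the two break-glass calls amount to a lookup of acceptable status codes per step. A minimal sketch, with hypothetical names (`ACCEPTABLE`, `step_ok`) and step labels chosen for illustration:

```python
# Status codes this runbook treats as success for each Option B step.
ACCEPTABLE = {
    "attach": {200, 201, 409},  # 409: raxx.app already attached to raxx-app
    "detach": {200, 204, 404},  # 404: raxx.app already absent from raxx-prod-next
}


def step_ok(step: str, status: int) -> bool:
    """Return True if the HTTP status means the step can be treated as done."""
    if step not in ACCEPTABLE:
        raise ValueError(f"unknown step: {step}")
    return status in ACCEPTABLE[step]
```

Treating 409 and 404 as success is what makes the procedure safe to re-run after a partial failure.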
Option C — CF Pages dashboard (last resort, no CLI)
- Go to https://dash.cloudflare.com → Pages → project raxx-prod-next
- Settings → Custom domains → remove raxx.app
- Go to Pages → project raxx-app
- Settings → Custom domains → add raxx.app
- Wait for domain verification to complete (~2 min)
- Verify via browser: https://raxx.app should serve the CRA app
CRA rollback reference deployment
The cutover workflow Step 6 records the CRA snapshot deployment ID in the Actions run summary.
This is the specific raxx-app project deployment that was serving raxx.app immediately
before cutover. It is retained in the raxx-app CF Pages project history and is NOT
automatically pruned during the 14-day soak window.
To find it after the fact:
CF_TOKEN=$(infisical secrets get CF_PAGES_DEPLOY_TOKEN \
--path /MooseQuest/cloudflare/ --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID \
--path /MooseQuest/cloudflare/ --env prod --plain)
curl -sS \
-H "Authorization: Bearer ${CF_TOKEN}" \
-H "User-Agent: raxx-sre/1.0" \
"https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/raxx-app/deployments?per_page=5" \
| python3 -m json.tool | grep -E '"id"|"created_on"|"latest_stage"'
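Because latest_stage is a nested object, its name and status land on separate lines of the pretty-printed JSON and the grep above only shows its opening line. As an alternative, the response can be summarised with a few lines of Python. This sketch assumes the usual Cloudflare v4 envelope with the deployments under a top-level `result` array; the function name `summarize_deployments` is hypothetical.

```python
import json
from typing import List, Optional, Tuple

Row = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


def summarize_deployments(api_response_text: str) -> List[Row]:
    """Extract (id, created_on, stage name, stage status) per deployment."""
    payload = json.loads(api_response_text)
    rows: List[Row] = []
    for dep in payload.get("result") or []:
        stage = dep.get("latest_stage") or {}
        rows.append(
            (dep.get("id"), dep.get("created_on"), stage.get("name"), stage.get("status"))
        )
    return rows
```

The output of the curl command can be piped into a short script that calls this function on `sys.stdin.read()` and prints one row per deployment.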
Post-rollback data to capture
If rolling back, capture the following before ending the incident session (for the post-rollback investigation — operator decides whether to file a ticket):
- The specific smoke check(s) that failed (copy the script output)
- The Sentry antlers-nextjs error count at time of rollback decision
- The GitHub Actions run URL for the failed cutover
- Browser console errors if any were observed
- Time of cutover and time of rollback decision (UTC)
Escalation path if rollback fails
If both Option A and Option B fail:
- Email ops@raxx.app with subject URGENT: raxx.app rollback failure — Phase 3
- Include the CF API responses from Option B
- Fallback: CF dashboard Option C above
§5 — Soak period
Minimum viable soak: 1 hour
Check the following at T+1h:
- curl -sS -o /dev/null -w "%{http_code}" https://raxx.app/ returns 200
- No Sentry antlers-nextjs error spike (check dashboard)
- getraxx.com still returns 200
Optimal soak: 24 hours
At T+24h, per ADR-0106 milestone table:
- Sentry error rate should be within 2× baseline
- Run the automated smoke script one more time as a re-confirmation:
python3 scripts/ops/phase3_post_cutover_smoke.py \
--client-id "${CF_ACCESS_CLIENT_ID}" \
--client-secret "${CF_ACCESS_CLIENT_SECRET}" \
--markdown
Victory declaration: 72 hours
At T+72h:
- If no rollback was needed and Sentry error rate is stable, the cutover is declared successful.
- The CRA raxx-app project remains dormant (do NOT delete) until the 14-day soak expires.
- After 14 days, dispatch the retirement sub-card (#2885) to retire frontend/trademaster_ui/.
CRA retention schedule
| Time after cutover | Action |
|---|---|
| T+0 to T+72h | Rollback alias retained; auto-rollback decision window |
| T+72h | Soak milestone declared (no retirement yet) |
| T+14d | Operator approves #2885 — retire CRA source files |
| T+14d | raxx-app CF Pages project moved to dormant (project not deleted) |
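To avoid arithmetic slips when scheduling the soak checks, the milestone times can be derived from the recorded cutover time. A minimal sketch; `OFFSETS` and `soak_schedule` are hypothetical names, and the offsets are the ones stated in §5 and the retention table.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Milestone offsets from the cutover time, as listed in this runbook.
OFFSETS = {
    "T+1h": timedelta(hours=1),    # minimum viable soak check
    "T+24h": timedelta(hours=24),  # optimal soak re-confirmation
    "T+72h": timedelta(hours=72),  # victory declaration
    "T+14d": timedelta(days=14),   # CRA retirement (#2885) may be approved
}


def soak_schedule(cutover: datetime) -> Dict[str, datetime]:
    """Map each milestone label to its timestamp in UTC."""
    if cutover.tzinfo is None:
        raise ValueError("cutover must be timezone-aware")
    start = cutover.astimezone(timezone.utc)
    return {label: start + offset for label, offset in OFFSETS.items()}
```

Requiring a timezone-aware input matches the runbook's practice of recording cutover and rollback times in UTC.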
§6 — WebAuthn invariant
The WebAuthn RP ID (raxx.app) is unchanged throughout this entire procedure. Passkeys
enrolled before the cutover remain valid because:
- The RP ID is determined by the origin the browser sees (raxx.app)
- The CNAME change only re-points which CF Pages project serves requests for raxx.app
- Raptor (api.raxx.app) is completely unchanged; the passkey challenge origin check in Raptor is configured with WEBAUTHN_RP_ID=raxx.app and WEBAUTHN_ORIGIN=https://raxx.app
If POST api.raxx.app/api/auth/login/options returns rpId != raxx.app, that is a Raptor
configuration problem, not a DNS/CF Pages problem. Do not roll back for this; escalate
to the backend config investigation path.
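The triage rule above can be sketched as a check on the login-options response body. This is illustrative only: `triage_rp_id` and its return labels are hypothetical, a missing rpId field is treated the same as a mismatch (an assumption), and a body that is not valid JSON raises `ValueError` rather than being classified.

```python
import json

EXPECTED_RP_ID = "raxx.app"


def triage_rp_id(response_body: str) -> str:
    """Classify the rpId in a login-options response.

    Returns "ok" when rpId matches, otherwise "escalate-backend-config".
    A mismatch never maps to a CF Pages rollback.
    """
    payload = json.loads(response_body)
    rp_id = payload.get("rpId") if isinstance(payload, dict) else None
    if rp_id == EXPECTED_RP_ID:
        return "ok"
    return "escalate-backend-config"
```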
References
- Cutover workflow: .github/workflows/deploy-antlers-cutover.yml
- Phase 3 cutover execution: #2883
- This runbook issue: #2884
- CRA retirement: #2885
- CF Pages runbook: docs/ops/runbooks/cf-pages-antlers-next.md
- ADR-0106: docs/architecture/adr/0106-antlers-nextjs-cutover-strategy.md
- Automated smoke script: scripts/ops/phase3_post_cutover_smoke.py
- Sentry project: antlers-nextjs (not trademaster-ui)