Raxx · internal docs

internal · gated

Runbook — Antlers Phase 3 post-cutover smoke + rollback

System: raxx.app, CF Pages CNAME cutover from CRA (raxx-app) to Next.js (raxx-prod-next)
Owner: operator / sre-agent
Scope: Phase 3 production cutover per ADR-0106 Strategy A
Related issues: #2883 (cutover execution), #2884 (this runbook), #2885 (retirement after 14-day soak)
ADR: docs/architecture/adr/0106-antlers-nextjs-cutover-strategy.md
Cutover workflow: .github/workflows/deploy-antlers-cutover.yml


§1 — Pre-conditions before running this runbook

All of the following must be true before declaring the cutover at the post-smoke stage:

If any pre-condition is unmet, do NOT proceed — investigate the cutover workflow run first.


§2 — Post-cutover smoke checklist (10-15 min)

Run the automated smoke script immediately after the cutover workflow completes:

# Load CF Access service token from vault
CF_ACCESS_CLIENT_ID=$(infisical secrets get CF_ACCESS_CLIENT_ID \
    --path /MooseQuest/cloudflare/ --env prod --plain)
CF_ACCESS_CLIENT_SECRET=$(infisical secrets get CF_ACCESS_CLIENT_SECRET \
    --path /MooseQuest/cloudflare/ --env prod --plain)

python3 scripts/ops/phase3_post_cutover_smoke.py \
    --client-id "${CF_ACCESS_CLIENT_ID}" \
    --client-secret "${CF_ACCESS_CLIENT_SECRET}" \
    --report-file "/tmp/phase3-smoke-$(date -u +%Y%m%dT%H%M%SZ).md"  # -u: the Z suffix claims UTC

The script runs the following 6 checks in order:

| # | Check | Pass criterion |
|---|-------|----------------|
| 1 | GET https://raxx.app/ | HTTP 200 + response body contains _next/static/ (Next.js marker) |
| 2 | GET https://raxx.app/login | HTTP 200 |
| 3 | POST https://api.raxx.app/api/auth/login/options | HTTP 200 + "rpId":"raxx.app" in response JSON |
| 4 | GET https://raxx.app/signup | HTTP 200 + no SSR error markers in body |
| 5 | GET https://raxx.app/dashboard | HTTP 200 or 3xx (not 5xx) |
| 6 | GET https://getraxx.com/ | HTTP 200 (marketing site unaffected) |
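
As an illustration of how the pass criteria for checks 1, 3 and 5 could be expressed, here is a minimal Python sketch. It is not the code of scripts/ops/phase3_post_cutover_smoke.py; the function names are hypothetical, the sketch assumes rpId sits at the top level of the response JSON, and it assumes a 4xx on /dashboard counts as a failure (the table only rules out 5xx explicitly).

```python
import json

NEXT_MARKER = "_next/static/"  # Next.js marker named in check 1


def check_root(status: int, body: str) -> bool:
    """Check 1: HTTP 200 and the body contains the Next.js marker."""
    return status == 200 and NEXT_MARKER in body


def check_login_options(status: int, body: str) -> bool:
    """Check 3: HTTP 200 and rpId equals raxx.app.

    Assumption: rpId is a top-level key of the response JSON.
    """
    if status != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("rpId") == "raxx.app"


def check_dashboard(status: int) -> bool:
    """Check 5: HTTP 200 or any 3xx.

    Assumption: everything else, including 4xx, is treated as a failure.
    """
    return status == 200 or 300 <= status <= 399
```

Keeping each criterion as a pure function of status and body makes it possible to test the criteria without touching production.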

Manual supplemental checks (run alongside the automated script, ~5 min):


§3 — Smoke pass criteria

All 6 automated checks pass AND:

If both conditions are met, proceed to §5 (Soak period).

If any check fails or Sentry spikes, proceed to §4 (Rollback procedure).


§4 — Rollback procedure

Target restore time: under 5 minutes. The CF Pages domain re-attach itself takes under 2 minutes; allow up to 3 more minutes for CF edge propagation and verification.

Option A (preferred): run the cutover workflow in rollback mode. This re-attaches raxx.app to the CRA project (raxx-app) and removes it from raxx-prod-next. No DNS record change is required because the CNAME still points at raxx-prod-next.pages.dev; the domain attachment change is what routes traffic.

gh workflow run deploy-antlers-cutover.yml \
    --repo raxx-app/TradeMasterAPI \
    -f confirm_cutover="ROLLBACK raxx.app TO CRA" \
    -f rollback=true \
    -f cra_project_name=raxx-app

The workflow requires an operator-approval gate (production-nextjs GH environment). Approve via https://github.com/raxx-app/TradeMasterAPI/actions → pending approval.

Verify rollback:

CF_ACCESS_CLIENT_ID=$(infisical secrets get CF_ACCESS_CLIENT_ID \
    --path /MooseQuest/cloudflare/ --env prod --plain)
CF_ACCESS_CLIENT_SECRET=$(infisical secrets get CF_ACCESS_CLIENT_SECRET \
    --path /MooseQuest/cloudflare/ --env prod --plain)

# Should return 200 — CRA app is back serving raxx.app
HTTP_CODE=$(curl -sS -o /dev/null -w "%{http_code}" \
    -H "User-Agent: raxx-sre-rollback-verify/1.0" \
    -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
    -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}" \
    "https://raxx.app/")
echo "raxx.app HTTP: ${HTTP_CODE}"

# Body should NOT contain _next/static/ (CRA build, not Next.js)
BODY=$(curl -sS \
    -H "User-Agent: raxx-sre-rollback-verify/1.0" \
    -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
    -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}" \
    "https://raxx.app/")
if echo "${BODY}" | grep -q "_next/static/"; then
    echo "WARNING: _next/static/ still present — Next.js still serving. Rollback may not have propagated yet."
else
    echo "OK: CRA markup confirmed (no _next/static/ marker)."
fi
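
Because edge propagation is not instantaneous, a single verification request may still see the Next.js build. The following Python sketch polls until the CRA markup appears. It is an assumption-laden illustration, not part of the repository: wait_for_cra, its default attempt count and delay are hypothetical, and the fetch callable is expected to perform the authenticated GET shown above and return (status, body).

```python
import time
from typing import Callable, Tuple

NEXT_MARKER = "_next/static/"  # present while Next.js is still serving


def wait_for_cra(
    fetch: Callable[[], Tuple[int, str]],
    attempts: int = 6,
    delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until raxx.app answers 200 without the Next.js marker.

    Returns True once the CRA markup is seen, False if every attempt
    still showed Next.js or a non-200 status.
    """
    for attempt in range(1, attempts + 1):
        status, body = fetch()
        if status == 200 and NEXT_MARKER not in body:
            return True
        if attempt < attempts:
            sleep(delay_s)  # give the CF edge time to propagate
    return False
```

With the hypothetical defaults the loop waits at most 75 seconds between the first and last request, which stays inside the restore-time target above.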

Option B — CF Pages API curl (break-glass, no workflow required)

Use when the workflow cannot be triggered (GH Actions outage, approval unavailable, etc.).

# Load tokens from vault
CF_TOKEN=$(infisical secrets get CF_PAGES_DEPLOY_TOKEN \
    --path /MooseQuest/cloudflare/ --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID \
    --path /MooseQuest/cloudflare/ --env prod --plain)

# Step 1: Re-attach raxx.app to the CRA project (raxx-app)
curl -sS -X POST \
    -H "Authorization: Bearer ${CF_TOKEN}" \
    -H "Content-Type: application/json" \
    -H "User-Agent: raxx-sre-rollback/1.0" \
    "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/raxx-app/domains" \
    -d '{"name":"raxx.app"}'

# Step 2: Remove raxx.app from the Next.js project (raxx-prod-next)
curl -sS -X DELETE \
    -H "Authorization: Bearer ${CF_TOKEN}" \
    -H "User-Agent: raxx-sre-rollback/1.0" \
    "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/raxx-prod-next/domains/raxx.app"

Expected responses:

  - Step 1: HTTP 200/201 (attached) or 409 (already attached; idempotent, continue)
  - Step 2: HTTP 200/204 (detached) or 404 (not attached; idempotent, continue)
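
The expected-response rules above can be written down as a small decision helper. This is a sketch only; the function names and return labels are hypothetical, and any status code not listed above is treated as a failure.

```python
def classify_attach(status: int) -> str:
    """Step 1 (POST .../raxx-app/domains)."""
    if status in (200, 201):
        return "attached"
    if status == 409:
        return "already-attached"  # idempotent, continue
    return "failed"


def classify_detach(status: int) -> str:
    """Step 2 (DELETE .../raxx-prod-next/domains/raxx.app)."""
    if status in (200, 204):
        return "detached"
    if status == 404:
        return "not-attached"  # idempotent, continue
    return "failed"


def rollback_can_continue(attach_status: int, detach_status: int) -> bool:
    """True when neither step produced an unexpected status code."""
    outcomes = (classify_attach(attach_status), classify_detach(detach_status))
    return "failed" not in outcomes
```

To feed these helpers from the curl commands, the HTTP status would need to be captured, for example with curl's -w "%{http_code}" option as in the rollback verification above.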

Wait 60-90 seconds for CF edge propagation, then verify (see Option A verification commands above).

Option C — CF Pages dashboard (last resort, no CLI)

  1. Go to https://dash.cloudflare.com → Pages → project raxx-prod-next
  2. Settings → Custom domains → remove raxx.app
  3. Go to Pages → project raxx-app
  4. Settings → Custom domains → add raxx.app
  5. Wait for domain verification to complete (~2 min)
  6. Verify via browser: https://raxx.app should serve the CRA app

CRA rollback reference deployment

The cutover workflow Step 6 records the CRA snapshot deployment ID in the Actions run summary. This is the specific raxx-app project deployment that was serving raxx.app immediately before cutover. It is retained in the raxx-app CF Pages project history and is NOT automatically pruned during the 14-day soak window.

To find it after the fact:

CF_TOKEN=$(infisical secrets get CF_PAGES_DEPLOY_TOKEN \
    --path /MooseQuest/cloudflare/ --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID \
    --path /MooseQuest/cloudflare/ --env prod --plain)

curl -sS \
    -H "Authorization: Bearer ${CF_TOKEN}" \
    -H "User-Agent: raxx-sre/1.0" \
    "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/raxx-app/deployments?per_page=5" \
    | python3 -m json.tool | grep -E '"id"|"created_on"|"latest_stage"'
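
The grep above prints only the opening line of each latest_stage object. As an alternative, the response can be summarised in Python. This sketch assumes the usual Cloudflare API envelope with a top-level result list whose entries carry id, created_on and a latest_stage object with name and status; summarize_deployments is a hypothetical helper.

```python
import json
from typing import Dict, List


def summarize_deployments(response_text: str) -> List[Dict[str, str]]:
    """Reduce a deployments listing to id, creation time and latest stage.

    Assumption: the payload looks like {"result": [{"id": ...,
    "created_on": ..., "latest_stage": {"name": ..., "status": ...}}]}.
    """
    payload = json.loads(response_text)
    rows = []
    for dep in payload.get("result") or []:
        stage = dep.get("latest_stage") or {}
        rows.append({
            "id": dep.get("id", ""),
            "created_on": dep.get("created_on", ""),
            "stage": f'{stage.get("name", "?")}/{stage.get("status", "?")}',
        })
    return rows
```

The deployment ID recorded in the Actions run summary can then be matched against the id column of the result.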

Post-rollback data to capture

If rolling back, capture the following before ending the incident session (for the post-rollback investigation — operator decides whether to file a ticket):

  1. The specific smoke check(s) that failed (copy the script output)
  2. The Sentry antlers-nextjs error count at time of rollback decision
  3. The GitHub Actions run URL for the failed cutover
  4. Browser console errors if any were observed
  5. Time of cutover and time of rollback decision (UTC)

Escalation path if rollback fails

If both Option A and Option B fail:

  - Email ops@raxx.app with subject URGENT: raxx.app rollback failure — Phase 3
  - Include the CF API responses from Option B
  - Fallback: CF dashboard Option C above


§5 — Soak period

Minimum viable soak: 1 hour

Check the following at T+1h:

  - curl -sS -o /dev/null -w "%{http_code}" https://raxx.app/ returns 200
  - No Sentry antlers-nextjs error spike (check dashboard)
  - getraxx.com still returns 200

Optimal soak: 24 hours

At T+24h, per ADR-0106 milestone table:

  - Sentry error rate should be within 2× baseline
  - Run the automated smoke script one more time as a re-confirmation:

python3 scripts/ops/phase3_post_cutover_smoke.py \
    --client-id "${CF_ACCESS_CLIENT_ID}" \
    --client-secret "${CF_ACCESS_CLIENT_SECRET}" \
    --markdown

Victory declaration: 72 hours

At T+72h:

  - If no rollback was needed and Sentry error rate is stable, the cutover is declared successful.
  - The CRA raxx-app project remains dormant (do NOT delete) until the 14-day soak expires.
  - After 14 days, dispatch the retirement sub-card (#2885) to retire frontend/trademaster_ui/.

CRA retention schedule

| Time after cutover | Action |
|--------------------|--------|
| T+0 to T+72h | Rollback alias retained; auto-rollback decision window |
| T+72h | Soak milestone declared (no retirement yet) |
| T+14d | Operator approves #2885: retire CRA source files |
| T+14d | raxx-app CF Pages project moved to dormant (project not deleted) |
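
To turn the relative milestones in this section into concrete UTC timestamps for an incident log, a small helper along these lines may be useful. It is a sketch; soak_milestones and the label strings are hypothetical, and the example cutover time is made up.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Offsets taken from the soak milestones and retention schedule above.
SOAK_OFFSETS = {
    "minimum viable soak (T+1h)": timedelta(hours=1),
    "optimal soak (T+24h)": timedelta(hours=24),
    "victory declaration (T+72h)": timedelta(hours=72),
    "CRA retirement eligible (T+14d)": timedelta(days=14),
}


def soak_milestones(cutover: datetime) -> Dict[str, datetime]:
    """Return each milestone as an absolute UTC timestamp."""
    if cutover.tzinfo is None:
        raise ValueError("cutover time must be timezone-aware")
    base = cutover.astimezone(timezone.utc)
    return {label: base + offset for label, offset in SOAK_OFFSETS.items()}


if __name__ == "__main__":
    # Hypothetical cutover time, for illustration only.
    example = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    for label, when in soak_milestones(example).items():
        print(f"{label}: {when.isoformat()}")
```

Rejecting naive datetimes avoids silently mixing local time with the UTC times recorded during the cutover.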

§6 — WebAuthn invariant

The WebAuthn RP ID (raxx.app) is unchanged throughout this entire procedure. Passkeys enrolled before the cutover remain valid because:

If POST api.raxx.app/api/auth/login/options returns an rpId other than raxx.app, that is a Raptor configuration problem, not a DNS/CF Pages problem. Do not roll back for this; escalate to the backend config investigation path.
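
One way to read the §3 decision together with the rpId exception above is sketched below in Python. This is an interpretation, not an official decision procedure: next_action and its return labels are hypothetical, and it assumes that a check 3 failure caused only by an rpId mismatch is excluded from the rollback triggers while any other failure or a Sentry spike still leads to §4.

```python
from typing import Iterable, Optional

EXPECTED_RP_ID = "raxx.app"


def next_action(
    failed_checks: Iterable[int],
    sentry_spike: bool,
    rp_id: Optional[str] = None,
) -> str:
    """Return 'soak', 'rollback' or 'escalate-backend-config'.

    failed_checks holds the numbers (1-6) of failed automated checks;
    rp_id is the value returned by the login options endpoint, if known.
    """
    failed = set(failed_checks)
    rp_mismatch = rp_id is not None and rp_id != EXPECTED_RP_ID
    if rp_mismatch:
        # Assumption: an rpId mismatch alone is not a rollback trigger.
        failed.discard(3)
    if failed or sentry_spike:
        return "rollback"
    if rp_mismatch:
        return "escalate-backend-config"
    return "soak"
```

If an rpId mismatch coincides with other failures, the sketch returns rollback; the backend escalation would still be needed afterwards.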


References