Incident RCA — getraxx.com unserved (2026-05-08 UTC)
Status: Fix in PR (see #1368 comment; pending merge + re-deploy)
Severity: SEV-2 (marketing front door unserved; no customer data at risk)
Reported: #1368
Fix PRs: #1369 (landing + workflow) · this PR (grep bug in workflow)
Timeline
| Time (UTC) | Event |
|---|---|
| 2026-04-22 | getraxx.com zone created on Cloudflare. DNS CNAME getraxx.com → getraxx.pages.dev added. No CF Pages project created. 403 begins. |
| 2026-04-23 | Commit a837db0 (feature/brand-tokens-getraxx-landing) builds landing page components inside frontend/trademaster_ui/src/pages/getraxx/. Branch never merged; no deploy workflow authored. |
| 2026-05-08 | Operator notices https://getraxx.com/ returns HTTP 403 with Cloudflare server header. Filed as #1368. |
| 2026-05-08 | SRE investigation confirms: CNAME points to getraxx.pages.dev, but no CF Pages project named getraxx exists on the account. Cloudflare returns 403 for unresolvable custom hostname. |
| 2026-05-08 | Feature-developer PR #1369 opens: standalone frontend/getraxx-landing/ project + deploy-getraxx.yml workflow. |
| 2026-05-09 01:13 UTC | PR #1369 merges to main (commit dfda2b6). deploy-getraxx.yml triggers. |
| 2026-05-09 01:14 UTC | deploy-getraxx.yml run 25587414670 fails at "Ensure CF Pages custom domain (getraxx.com)" step. CF Pages project getraxx is created successfully. Custom domain attach API call returns "success": true (pretty-printed JSON with space after colon). Shell grep -q '"success":true' (no space) does not match. success_flag stays "false". Script exits 1. Pages deploy never executes. |
| 2026-05-10 | SRE re-investigation. curl -I https://getraxx.com returns HTTP 522 (connection timeout) — changed from 403 because the CF Pages project now exists but has zero deployments. Root cause of current state: grep pattern bug in the workflow. |
| 2026-05-10 | Fix PR opened (this PR): grep -qE '"success":[[:space:]]*true' in both custom-domain steps. |
Root cause
Three independent failures compounded — original two from the 2026-05-08 investigation plus a third introduced by the fix workflow:
- No CF Pages project created. The DNS CNAME for getraxx.com pointed to getraxx.pages.dev but no CF Pages project with that name was ever provisioned. Cloudflare returns HTTP 403 for an unresolvable custom hostname.
- Landing source not standalone. The React landing page (built in commit a837db0) was embedded inside the Antlers CRA app with no standalone deployable.
- Grep pattern assumed compact JSON. deploy-getraxx.yml checked for '"success":true' (no whitespace) but the Cloudflare v4 API returns pretty-printed JSON: "success": true (space after colon). The shell script false-negated a successful API response and aborted. The CF Pages project and custom domain attach both completed successfully; only the guard check failed.
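The third failure can be reproduced offline. The sketch below compares the original and patched grep patterns against two sample bodies that differ only in whitespace. The sample bodies and the helper names old_check, new_check and report are illustrative assumptions, not copied from the workflow or from a captured API response.

```shell
#!/bin/sh
# Two sample response bodies that differ only in whitespace after the colon.
compact='{"success":true,"errors":[]}'
pretty='{
  "success": true,
  "errors": []
}'

# Original guard: literal match, no whitespace allowed after the colon.
old_check() { printf '%s' "$1" | grep -q '"success":true'; }

# Patched guard: optional whitespace between the colon and the value.
new_check() { printf '%s' "$1" | grep -qE '"success":[[:space:]]*true'; }

# Print how each guard treats the body passed as $2, labelled with $1.
report() {
  if old_check "$2"; then old=match; else old=miss; fi
  if new_check "$2"; then new=match; else new=miss; fi
  echo "$1: old=$old new=$new"
}

report compact "$compact"   # compact: old=match new=match
report pretty "$pretty"     # pretty: old=miss new=match
```

Only the pretty-printed body trips the original guard, which matches the failure seen in run 25587414670. Parsing the body with a JSON tool (for example jq -e '.success == true', if jq is available on the runner) would avoid whitespace sensitivity altogether; this PR keeps grep.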
Current state (2026-05-10)
- CF Pages project getraxx exists on the account.
- Custom domain getraxx.com was attached during the failed run (the attach succeeded before the grep check failed).
- Zero deployments exist in the project, so the domain serves HTTP 522 instead of content.
- DNS resolves correctly: getraxx.com → 104.21.6.99 / 172.67.134.179 (Cloudflare anycast, proxied).
On merge of this fix PR, deploy-getraxx.yml will re-trigger. The idempotent
steps (project create, domain attach) will no-op correctly. The deploy step will
upload the first artifact and the site will serve HTTP 200.
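For post-merge verification, the status codes seen during this incident map to distinct states. The helper below is a hypothetical sketch (classify_status is not part of the workflow) that encodes only the mapping described in this RCA; the live curl call is left as a comment because it needs network access.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code to the states described in this RCA.
classify_status() {
  case "$1" in
    200) echo "serving: a deployment exists and the custom domain is attached" ;;
    403) echo "no CF Pages project behind the CNAME (state before PR #1369)" ;;
    522) echo "project exists but has zero deployments (state on 2026-05-10)" ;;
    *)   echo "unexpected status $1: investigate" ;;
  esac
}

# Live check (requires network access):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://getraxx.com/)
#   classify_status "$code"
classify_status 522
```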
Fix
This PR patches .github/workflows/deploy-getraxx.yml:
- grep -q '"success":true' → grep -qE '"success":[[:space:]]*true' in both the getraxx.com and www.getraxx.com custom-domain-attach steps.
- Adds CF error code 8000000 (CF Pages "domain already attached") to the already_flag grep pattern, so idempotent re-runs on already-attached domains also pass.
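A minimal sketch of how the patched guard could behave, assuming the step holds the raw response body in a shell variable. The function name domain_attach_ok and the "code": 8000000 shape of the error entry are assumptions made for illustration; the actual workflow inlines its own logic and its exact already_flag pattern is not reproduced here.

```shell
#!/bin/sh
# Hypothetical helper. $1 is the raw JSON body returned by the custom-domain attach call.
# Returns 0 when the attach succeeded or the domain was already attached.
domain_attach_ok() {
  body="$1"
  success_flag=false
  already_flag=false

  # Whitespace-tolerant success check (the pattern introduced by this PR).
  if printf '%s' "$body" | grep -qE '"success":[[:space:]]*true'; then
    success_flag=true
  fi

  # Assumed shape: the error code appears as "code": 8000000 in the body.
  # The trailing group stops longer numbers such as 80000001 from matching.
  if printf '%s' "$body" | grep -qE '"code":[[:space:]]*8000000([^0-9]|$)'; then
    already_flag=true
  fi

  [ "$success_flag" = true ] || [ "$already_flag" = true ]
}

if domain_attach_ok '{"success": true, "errors": []}'; then
  echo "attach ok"
fi
```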
Operator prerequisite (still outstanding)
The CLOUDFLARE_EDIT_DNS token covers the raxx.app zone only. The DNS bootstrap
steps in deploy-getraxx.yml require DNS:Edit scope on the getraxx.com zone
(Zone ID 0bdcee38d1da2d021eb6166f0bd6204f). Since the DNS CNAME already exists,
those steps will no-op on the CNAME check and not attempt creation. The Pages deploy
does not require the DNS token — the site will serve after merge regardless.
Action: Extend CLOUDFLARE_EDIT_DNS to cover the getraxx.com zone when
convenient (low urgency; DNS CNAME is already correct).
Contributing factors
- No "new surface" checklist was followed when the getraxx.com zone was created.
- config/status-surfaces.yaml probe_url for getraxx-com never generated an alert on 403/522; no proactive monitoring caught this in the 18 days since 2026-04-22.
- The fix workflow was written and reviewed without a test against actual CF API response format. The CF v4 API always returns pretty-printed JSON from its domain attach endpoint; this was not verified before authoring the grep.
Action items
| Action | Owner | Due | Issue |
|---|---|---|---|
| Merge this fix PR | Kristerpher | 2026-05-10 | #1368 |
| Verify curl -I https://getraxx.com returns 200 after deploy | SRE | 2026-05-10 | #1368 |
| Extend CLOUDFLARE_EDIT_DNS to cover getraxx.com zone | Operator | 2026-05-17 | — |
| Add probe alert: non-200 from getraxx.com fires Slack alert | SRE | 2026-05-17 | — |
| Audit other workflows for '"success":true' compact-JSON grep anti-pattern | SRE | 2026-05-17 | — |
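
The audit item could start from a fixed-string search over the workflow directory. This is a sketch only: find_compact_json_greps is a hypothetical helper, it assumes a grep that supports -r (GNU and BSD grep both do), and it only catches the exact "success" literals rather than every compact-JSON assumption.

```shell
#!/bin/sh
# Hypothetical helper: list workflow lines that grep for a compact JSON literal.
# $1 is the directory to scan (defaults to .github/workflows).
find_compact_json_greps() {
  dir="${1:-.github/workflows}"
  # -F treats the patterns as fixed strings, so only the whitespace-free
  # forms "success":true and "success":false are reported.
  grep -rnF -e '"success":true' -e '"success":false' "$dir"
}

# Usage from the repository root:
#   find_compact_json_greps .github/workflows
```

Lines already using the patched pattern '"success":[[:space:]]*true' are not reported, because the literal text after the colon is a bracket expression rather than true or false.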