New-surface deploy/preview convention — runbook
System: raxx platform surfaces
Owner: sre-agent / operator
Last incident: n/a (convention doc; not incident-driven)
Last reviewed: 2026-05-12 UTC
ADRs: 0052 (hosting tiers) · 0053 (workflow template) · 0081 (convention as standard)
Architecture doc: new-surface-convention.md
Scaffold script: scripts/scaffold/new-surface.sh
Parent card: #649
Purpose
This runbook answers one question: "how do I stand up a new Raxx surface correctly in about 30 minutes?" It also documents each existing surface so that when something breaks on a surface you've never touched, you know where its knobs are.
Before this document existed, each surface was bootstrapped ad hoc over one to two days. The scaffold script and this runbook eliminate the per-surface re-discovery.
Existing surfaces (parity baseline)
Use these as reference when something breaks or when scaffolding a new surface in the same tier.
Antlers — app.raxx.app
| Field | Value |
|---|---|
| URL | https://app.raxx.app |
| Tier | A (Cloudflare Pages) |
| CF Pages project | raxx-app |
| Deploy workflow | .github/workflows/deploy-antlers.yml |
| Source path | frontend/trademaster_ui/ |
| Status tile id | app-raxx-app |
| Access | Public (post-launch) |
| Sentry project | n/a (static SPA; errors tracked client-side) |
| Vault paths | /MooseQuest/cloudflare/ (CF deploy token) |
| Parity note | Predates ADR-0052; audit emit and freeze gate are present; no separate staging deploy — branch previews via CF Pages. |
getraxx.com — marketing landing page
| Field | Value |
|---|---|
| URL | https://getraxx.com |
| Tier | A (Cloudflare Pages) |
| CF Pages project | getraxx |
| Deploy workflow | .github/workflows/deploy-getraxx.yml |
| Source path | frontend/getraxx-landing/ |
| Status tile id | getraxx-com |
| Access | Public (post 2026-05-23 launch); CF Access gated pre-launch |
| Sentry project | n/a |
| Vault paths | /MooseQuest/cloudflare/ |
| Parity note | Full parity. www.getraxx.com is a DNS CNAME (not a CF Pages custom domain) to preserve zone-level redirect rule. See RCA docs/incidents/2026-05-11-www-getraxx-access-bypass.md for why. Pre-launch CF Access removal: docs/ops/runbooks/getraxx-launch-day-cf-access-removal.md. |
raxx-mockups — design preview
| Field | Value |
|---|---|
| URL | https://raxx-mockups.pages.dev (no raxx.app subdomain) |
| Tier | A (Cloudflare Pages) |
| CF Pages project | raxx-mockups |
| Deploy workflow | .github/workflows/deploy-mockups.yml |
| Source path | docs/design/, mockups-site/ |
| Status tile id | raxx-mockups |
| Access | Internal (public: false — not enumerated on customer status page) |
| Sentry project | n/a |
| Vault paths | /MooseQuest/cloudflare/ |
| Parity note | Full parity. Split from deploy-internal-docs.yml on 2026-05-08 (#1367). No custom raxx.app subdomain — pages.dev URL is intentional (design previews need no branding gate). |
console.raxx.app — operator console
| Field | Value |
|---|---|
| URL | https://console.raxx.app |
| Tier | B (Heroku) |
| Heroku apps | raxx-console-prod-* (prod), raxx-console-staging-* (staging) |
| Deploy workflow | .github/workflows/deploy-console.yml |
| Source path | console/ |
| Status tile id | console-raxx-app |
| Access | Internal (CF Access gated; public: false) |
| Sentry project | tbd (sentry_backend flag gates it) |
| Vault paths | /MooseQuest/cloudflare/, /MooseQuest/console/ |
| Parity note | Full parity. Health check uses Heroku origin URL (not CF-fronted domain) to bypass CF Access 302. Env is hostname-derived — no in-app switcher. Staging is a runtime dup on read-only prod Queue data. |
tickets.raxx.app — support portal (FreeScout)
| Field | Value |
|---|---|
| URL | https://tickets.raxx.app |
| Tier | C (Lightsail + Terraform) |
| Lightsail instance | see terraform/lightsail-freescout/ |
| Deploy workflow | none (Terraform-managed; no CI deploy) |
| Status tile id | tickets-raxx-app |
| Access | Public (customer-facing support queue) |
| Sentry project | n/a |
| Vault paths | /MooseQuest/freescout/, /MooseQuest/cloudflare/ |
| Parity note | Tier C exemption — no CI deploy step. Operator runs terraform apply from terraform/lightsail-freescout/. Cert renewal: docs/ops/runbooks/freescout-cert-renewal.md. Full runbook: docs/ops/runbooks/freescout.md. Backup/restore: docs/ops/runbooks/freescout-backup-restore.md. |
docs.raxx.app — customer documentation
| Field | Value |
|---|---|
| URL | https://docs.raxx.app |
| Tier | A (Cloudflare Pages) |
| CF Pages project | raxx-customer-docs |
| Deploy workflow | .github/workflows/deploy-customer-docs.yml |
| Source path | docs/customer/ |
| Status tile id | docs-raxx-app |
| Access | Public |
| Sentry project | n/a |
| Vault paths | /MooseQuest/cloudflare/ |
| Parity note | Full parity. Bootstrap step in workflow creates the CF Pages project idempotently and sets the DNS CNAME. Pattern reference for Tier A scaffolding. |
How to add a new surface (30-minute path)
Step 1 — Run the scaffold (5 min)
bash scripts/scaffold/new-surface.sh
The script prompts for:
- SURFACE_NAME — slug (e.g., learn)
- DOMAIN — FQDN (e.g., learn.raxx.app)
- TIER — cf-pages | cf-worker | heroku | lightsail
- ACCESS — public | internal
Output (all under scripts/scaffold/output/<surface>/):
- deploy-<surface>.yml — move to .github/workflows/ after review
- terraform-cf-pages-<surface>.tf.stub — starting point for the TF stub
- branch-protection.md — documents expected branch protection settings
- README.md — surface ops notes template
Step 2 — Fill in TODOs (10 min)
Open each generated file and resolve the TODO markers:
- deploy-<surface>.yml: replace TODO-surface-source-path with the actual directory (e.g., frontend/learn/). Add the build step.
- README.md: add vault paths specific to this surface.
- branch-protection.md: confirm the branch protection rules match the surface's risk profile.
Step 3 — Register the status tile (5 min)
Move the status tile entry from scripts/scaffold/output/<surface>/status-stub.yaml
into config/status-surfaces.yaml (append to the owned block). Confirm the
probe_url is correct for the surface's /health endpoint.
Step 4 — Create Sentry project (if Tier B) (5 min)
Tier B (Heroku) surfaces running server code need a Sentry project:
1. Create project named <surface> in the Sentry org.
2. Vault the DSN: infisical secrets set SENTRY_DSN=<dsn> --path /MooseQuest/sentry/<surface>/
3. Add heroku config:set SENTRY_DSN="$SENTRY_DSN" --app <heroku-app> >/dev/null 2>&1
to the deploy workflow.
Step 5 — Open the PR (5 min)
Move generated files to their final locations, open a PR. The PR description
must include:
- Surface name, domain, tier, access type
- Whether CF Access gate was created (internal surfaces)
- Whether status tile was registered
- For Tier A: confirm the CF Pages project name matches the entry in status-surfaces.yaml
- Live HTTP-200 verification output. Run the command below against the custom domain
(not a preview URL) and paste the output into the PR body. A PR without this output
has not satisfied the surface launch acceptance criteria (see
docs/architecture/new-surface-convention.md §9.1):
curl -o /dev/null -sSL -w "%{http_code}" https://<custom-domain>/
# expected: 200
Internal surfaces: include the CF-Access-Client-Id / CF-Access-Client-Secret
headers in the curl call. The task is not done until this returns 200.
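The Step 5 check can be wrapped so that one call covers both public and internal surfaces. This is a minimal sketch, not part of the scaffold output. The function name verify_surface_200 and the environment variable names CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET are assumptions made for illustration; only the curl flags and the header names come from this runbook.

```shell
# Sketch: print the final HTTP status for a custom domain and succeed only on 200.
# CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET are hypothetical env var names;
# when both are set, the CF Access service-token headers are added to the request.
verify_surface_200() (
  domain="$1"
  # Same flags as the manual command above; -L follows redirects to the final response.
  set -- -o /dev/null -sSL -w '%{http_code}'
  if [ -n "${CF_ACCESS_CLIENT_ID:-}" ] && [ -n "${CF_ACCESS_CLIENT_SECRET:-}" ]; then
    set -- "$@" -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
                -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}"
  fi
  # If curl itself fails (DNS, TLS, timeout), report 000 instead of a partial value.
  code="$(curl "$@" "https://${domain}/")" || code="000"
  echo "${domain}: HTTP ${code}"
  [ "$code" = "200" ]
)
```

Calling verify_surface_200 learn.raxx.app prints one status line suitable for the PR body and exits non-zero unless the final response is 200, so it could also gate a CI step.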
How to tell a surface is broken
Tier A (CF Pages)
- Status tile shows degraded or down for the surface.
- CF Pages project dashboard shows failed deployment.
- curl -I https://<surface>/health returns non-200 or times out.
- Cloudflare status page shows a Pages incident: https://www.cloudflarestatus.com
Tier B (Heroku)
- Status tile shows degraded or down.
- Heroku dashboard shows dyno restarts or H10/H12 errors.
- heroku logs --app <app> --tail shows crash loop or OOM.
- Sentry (if configured) shows a spike in error rate.
- curl https://<heroku-origin>/health (not CF-fronted URL) returns non-200.
Tier C (Lightsail)
- Status tile shows degraded or down.
- SSH connectivity to the Lightsail instance fails.
- Application logs (e.g., FreeScout: sudo journalctl -u apache2 -f) show errors.
How to diagnose (in order)
1. Check the status tile in the operator console (https://console.raxx.app).
2. Check the most recent deploy run in GitHub Actions for the surface's workflow.
3. For Tier A: check the CF Pages project dashboard for the surface. For Tier B: heroku logs --app <app> --tail --num 200. For Tier C: SSH to the Lightsail instance and check application logs.
4. Check Cloudflare status (https://www.cloudflarestatus.com) and Heroku status (https://status.heroku.com) for upstream incidents.
5. For internal surfaces: confirm the CF Access application exists and the service token has not expired. Token check: curl -I https://<surface>/health with and without the CF-Access-Client-Id header.
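The token check above produces two status codes, and reading them together is what identifies the fault. The sketch below encodes one plausible reading. It assumes that an anonymous request to a gated surface receives a 302 or 403; the function names and the result labels are hypothetical, not taken from any existing script.

```shell
# Print only the HTTP status of a HEAD request; extra arguments pass through to curl.
head_status() { curl -o /dev/null -sS -I -w '%{http_code}' "$@"; }

# Classify the CF Access gate from two status codes:
#   $1 = status without service-token headers, $2 = status with them.
classify_access_gate() {
  case "$1:$2" in
    200:*)           echo "gate-missing" ;;    # anonymous request got through
    302:200|403:200) echo "ok" ;;              # gated, and the token is accepted
    302:*|403:*)     echo "token-rejected" ;;  # gated, but the token did not yield 200
    *)               echo "unknown" ;;         # anything else needs manual inspection
  esac
}

# Usage (ID and SECRET are placeholders for the service-token values):
#   anon="$(head_status "https://<surface>/health")"
#   auth="$(head_status -H "CF-Access-Client-Id: $ID" \
#                       -H "CF-Access-Client-Secret: $SECRET" "https://<surface>/health")"
#   classify_access_gate "$anon" "$auth"
```

A token-rejected result does not distinguish an expired token from a request blocked before the Access layer, so it is a prompt to look at the failure modes below rather than a diagnosis.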
Known failure modes
CF Pages deploy failed — build error
Symptom: Workflow fails at the build step. CF Pages project shows no new deployment.
Cause: Build script failure (missing dependency, compile error, OOM in runner).
Fix: Read the build step logs in the workflow run. Fix the source error and push.
Verification: Re-trigger workflow. Confirm new deployment appears in CF Pages dashboard and health check passes.
CF Pages deploy failed — wrangler auth error
Symptom: Workflow fails at the wrangler publish step with "Authentication error" or "10000".
Cause: CLOUDFLARE_API_TOKEN secret is expired or scoped to wrong permissions.
Fix: Rotate the CF Pages deploy token in Cloudflare. Update vault at /MooseQuest/cloudflare/CF_PAGES_DEPLOY_TOKEN. See docs/ops/runbooks/cloudflare-tokens.md.
Verification: Re-trigger workflow. Confirm wrangler step succeeds.
DNS CNAME missing — surface returns NXDOMAIN or 404
Symptom: curl https://<surface>/health fails with a DNS resolution error (NXDOMAIN, reported by curl as "Could not resolve host") or connection refused.
Cause: DNS CNAME record was not created (or was deleted). Workflow bootstrap step may have failed silently.
Fix:
# Confirm CNAME is absent
dig CNAME <surface>.raxx.app
# Re-run the workflow (the DNS bootstrap step is idempotent)
# Or create manually via CF API:
curl -sS -X POST -H "Authorization: Bearer $CLOUDFLARE_EDIT_DNS" \
-H "Content-Type: application/json" \
"https://api.cloudflare.com/client/v4/zones/f12dbb5cac57d5591a5058874498a6d1/dns_records" \
-d '{"type":"CNAME","name":"<surface>.raxx.app","content":"<project>.pages.dev","proxied":true,"ttl":1}'
Verification: dig CNAME <surface>.raxx.app returns the target. curl -I https://<surface>/health returns 200.
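When creating the record by hand, generating the JSON body from the surface slug and the CF Pages project name avoids editing the payload inline. A small sketch under that assumption; the function name cname_payload is hypothetical, and the fields mirror the manual curl call above.

```shell
# Build the DNS record payload for a *.raxx.app surface.
#   $1 = surface slug (e.g. learn), $2 = CF Pages project name (e.g. raxx-learn)
cname_payload() {
  printf '{"type":"CNAME","name":"%s.raxx.app","content":"%s.pages.dev","proxied":true,"ttl":1}\n' \
    "$1" "$2"
}
```

The output can be passed to the manual call with -d "$(cname_payload learn raxx-learn)". The slug and project name are inserted verbatim, so this only suits values that follow the naming conventions (no quotes or backslashes).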
Internal surface returns 403 — CF Access gate missing or misconfigured
Symptom: curl https://<surface> returns 403 with an "Access denied" body, even with valid credentials.
Cause: CF Access application does not exist, or the email allowlist policy is missing the operator's email.
Fix: Check the CF Access application list in the Cloudflare dashboard. If the app is missing, re-run the workflow (the CF Access bootstrap step is idempotent). If the policy is wrong, update via Cloudflare Access UI.
Verification: curl -H "CF-Access-Client-Id: <id>" -H "CF-Access-Client-Secret: <secret>" https://<surface>/health returns 200.
CF Access service token blocked by Bot Fight Mode
Symptom: Service token requests return 403 with "Error 1010" or bot-fight challenge page.
Cause: CF Bot Fight Mode fires before the Access layer. Service tokens do not bypass Bot Fight Mode.
Fix: Add a WAF skip rule keyed on CF-Access-Client-Id header. See docs/ops/runbooks/waf.md and feedback memory feedback_cf_access_does_not_bypass_bot_fight_mode.md.
Verification: curl -H "CF-Access-Client-Id: <id>" -H "CF-Access-Client-Secret: <secret>" https://<surface>/health returns 200 without a challenge page.
Deploy workflow does not trigger after push
Symptom: Push to main does not start the surface's deploy workflow.
Cause: The workflow's paths: filter does not match any changed file. This includes the case where only the workflow file changed and the workflow file is not listed in paths:.
Fix: Confirm the workflow's paths: block includes the surface source directory. Also list the workflow YAML file itself as a paths: entry so that workflow-only changes still trigger a deploy.
Verification: gh workflow run deploy-<surface>.yml (manual trigger; the workflow must declare workflow_dispatch) starts a run, which confirms the workflow file is valid.
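A quick local check can catch the "workflow file missing from its own paths filter" case before a push. The sketch below is a plain text search, not a YAML parse: it only confirms that the workflow file mentions its own repository path somewhere, which is assumed here to be inside paths:. The function name workflow_lists_itself is hypothetical.

```shell
# Succeed if the workflow file contains its own repository path as a literal string.
#   $1 = path to the workflow file, e.g. .github/workflows/deploy-learn.yml
workflow_lists_itself() (
  wf="$1"
  # Workflows live under .github/workflows/, so rebuild that path from the file name.
  self=".github/workflows/$(basename "$wf")"
  grep -qF -- "$self" "$wf"
)
```

Running workflow_lists_itself .github/workflows/deploy-learn.yml exits 0 when the path is present and non-zero otherwise. A match in a comment would also pass, so treat a success as a hint and read the paths: block when in doubt.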
Health check fails post-deploy (Tier B)
Symptom: Workflow fails at the health-check step. Surface is deployed but returning non-200.
Cause: Dyno crash on boot (check heroku logs), missing environment variable, or DB migration failure.
Fix: See docs/ops/runbooks/heroku.md §"H10/H12 errors" and §"Dyno crash on boot".
Verification: curl https://<heroku-origin>/health returns 200. Then confirm CF-fronted URL also returns 200.
Naming conventions
| Artifact | Pattern | Example |
|---|---|---|
| Surface slug | <name> (lowercase, hyphens) | learn |
| CF Pages project | raxx-<name> | raxx-learn |
| Heroku app (prod) | raxx-<name>-production | raxx-learn-production |
| Heroku app (staging) | raxx-<name>-staging | raxx-learn-staging |
| Deploy workflow | deploy-<name>.yml | deploy-learn.yml |
| Status tile id | <name>-raxx-app (for *.raxx.app) or <name>-<tld> for external | learn-raxx-app |
| Vault path prefix | /MooseQuest/<name>/ | /MooseQuest/learn/ |
| Sentry project | <name> | learn |
| Terraform directory | terraform/cf-pages-<name>/ (Tier A) or terraform/<name>/ (Tier C) | terraform/cf-pages-learn/ |
| Surface docs | docs/surfaces/<name>/README.md | docs/surfaces/learn/README.md |
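The patterns in the table can be expanded mechanically from a slug, which is handy when checking a PR against the convention. This is an illustrative sketch only: the function name surface_names and the key=value output format are assumptions, and it covers the *.raxx.app case of the status tile id.

```shell
# Print the conventional artifact names for a *.raxx.app surface slug.
#   $1 = surface slug, e.g. learn
surface_names() {
  printf 'cf_pages_project=raxx-%s\n' "$1"
  printf 'heroku_prod=raxx-%s-production\n' "$1"
  printf 'heroku_staging=raxx-%s-staging\n' "$1"
  printf 'deploy_workflow=deploy-%s.yml\n' "$1"
  printf 'status_tile_id=%s-raxx-app\n' "$1"
  printf 'vault_prefix=/MooseQuest/%s/\n' "$1"
  printf 'sentry_project=%s\n' "$1"
  printf 'surface_docs=docs/surfaces/%s/README.md\n' "$1"
}
```

For example, surface_names learn | grep '^status_tile_id=' yields the tile id to compare against the entry in config/status-surfaces.yaml.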
Vault paths (shared across surfaces)
These paths are used by every surface and loaded via .github/actions/load-vault-secrets:
| Secret | Vault path | Used for |
|---|---|---|
| CF_PAGES_DEPLOY_TOKEN | /MooseQuest/cloudflare/CF_PAGES_DEPLOY_TOKEN | Wrangler publish auth |
| CLOUDFLARE_ACCOUNT_ID | /MooseQuest/cloudflare/CLOUDFLARE_ACCOUNT_ID | CF API calls |
| CLOUDFLARE_EDIT_DNS | /MooseQuest/cloudflare/CLOUDFLARE_EDIT_DNS | DNS CNAME creation |
| CLOUDFLARE_ACCESS_MGMT_TOKEN | /MooseQuest/cloudflare/CLOUDFLARE_ACCESS_MGMT_TOKEN | CF Access app creation (internal surfaces) |
Emergency stop
To take a surface offline immediately:
Tier A (CF Pages):
# Disable the CF Pages custom domain via the Cloudflare dashboard, or
# set a WAF rule to block all traffic to the surface:
# Cloudflare dashboard → Security → WAF → Create rule → Block → hostname = <surface>
Tier B (Heroku):
heroku maintenance:on --app <heroku-app>
Tier C (Lightsail):
# Stop the application service on the Lightsail instance:
ssh ubuntu@<lightsail-ip> 'sudo systemctl stop apache2' # example for FreeScout
Escalation
Wake the operator when:
- A surface's CF Access gate is missing and the surface was expected to be internal.
- A deploy fails and the health check is not recovering after 3 manual re-triggers.
- A new surface's Terraform stack requires provisioning a new Lightsail instance (Tier C).
- A cost implication from a new surface exceeds $50/mo or $500/year.
References
- Architecture: docs/architecture/new-surface-convention.md
- ADR-0052: docs/architecture/adr/0052-new-surface-hosting-tiers.md
- ADR-0053: docs/architecture/adr/0053-new-surface-deploy-workflow-template.md
- ADR-0081: docs/architecture/adr/0081-surface-convention.md
- Scaffold script: scripts/scaffold/new-surface.sh
- Deploy freeze runbook: docs/ops/runbooks/deploy-freeze.md
- Heroku runbook: docs/ops/runbooks/heroku.md
- Cloudflare tokens: docs/ops/runbooks/cloudflare-tokens.md
- CF WAF: docs/ops/runbooks/waf.md
- FreeScout (Tier C example): docs/ops/runbooks/freescout.md