Raxx · internal docs

internal · gated

New-surface deploy/preview convention — runbook

System: raxx platform surfaces
Owner: sre-agent / operator
Last incident: n/a (convention doc; not incident-driven)
Last reviewed: 2026-05-12 UTC
ADRs: 0052 (hosting tiers) · 0053 (workflow template) · 0081 (convention as standard)
Architecture doc: new-surface-convention.md
Scaffold script: scripts/scaffold/new-surface.sh
Parent card: #649


Purpose

This runbook answers one question: "how do I stand up a new Raxx surface correctly in about 30 minutes?" It also documents each existing surface so that when something breaks on a surface you've never touched, you know where its knobs are.

Before this document existed, each surface was bootstrapped ad hoc over one to two days. The scaffold script and this runbook eliminate the per-surface re-discovery.


Existing surfaces (parity baseline)

Use these as reference when something breaks or when scaffolding a new surface in the same tier.

Antlers — app.raxx.app

| Field | Value |
| --- | --- |
| URL | https://app.raxx.app |
| Tier | A (Cloudflare Pages) |
| CF Pages project | raxx-app |
| Deploy workflow | .github/workflows/deploy-antlers.yml |
| Source path | frontend/trademaster_ui/ |
| Status tile id | app-raxx-app |
| Access | Public (post-launch) |
| Sentry project | n/a (static SPA; errors tracked client-side) |
| Vault paths | /MooseQuest/cloudflare/ (CF deploy token) |
| Parity note | Predates ADR-0052; audit emit and freeze gate are present; no separate staging deploy (branch previews via CF Pages). |

getraxx.com — marketing landing page

| Field | Value |
| --- | --- |
| URL | https://getraxx.com |
| Tier | A (Cloudflare Pages) |
| CF Pages project | getraxx |
| Deploy workflow | .github/workflows/deploy-getraxx.yml |
| Source path | frontend/getraxx-landing/ |
| Status tile id | getraxx-com |
| Access | Public (post 2026-05-23 launch); CF Access gated pre-launch |
| Sentry project | n/a |
| Vault paths | /MooseQuest/cloudflare/ |
| Parity note | Full parity. www.getraxx.com is a DNS CNAME (not a CF Pages custom domain) to preserve the zone-level redirect rule. See RCA docs/incidents/2026-05-11-www-getraxx-access-bypass.md for why. Pre-launch CF Access removal: docs/ops/runbooks/getraxx-launch-day-cf-access-removal.md. |

raxx-mockups — design preview

| Field | Value |
| --- | --- |
| URL | https://raxx-mockups.pages.dev (no raxx.app subdomain) |
| Tier | A (Cloudflare Pages) |
| CF Pages project | raxx-mockups |
| Deploy workflow | .github/workflows/deploy-mockups.yml |
| Source path | docs/design/, mockups-site/ |
| Status tile id | raxx-mockups |
| Access | Internal (public: false; not enumerated on customer status page) |
| Sentry project | n/a |
| Vault paths | /MooseQuest/cloudflare/ |
| Parity note | Full parity. Split from deploy-internal-docs.yml on 2026-05-08 (#1367). No custom raxx.app subdomain; the pages.dev URL is intentional (design previews need no branding gate). |

console.raxx.app — operator console

| Field | Value |
| --- | --- |
| URL | https://console.raxx.app |
| Tier | B (Heroku) |
| Heroku apps | raxx-console-prod-* (prod), raxx-console-staging-* (staging) |
| Deploy workflow | .github/workflows/deploy-console.yml |
| Source path | console/ |
| Status tile id | console-raxx-app |
| Access | Internal (CF Access gated; public: false) |
| Sentry project | tbd (sentry_backend flag gates it) |
| Vault paths | /MooseQuest/cloudflare/, /MooseQuest/console/ |
| Parity note | Full parity. Health check uses Heroku origin URL (not CF-fronted domain) to bypass CF Access 302. Env is hostname-derived; no in-app switcher. Staging is a runtime dup on read-only prod Queue data. |

tickets.raxx.app — support portal (FreeScout)

| Field | Value |
| --- | --- |
| URL | https://tickets.raxx.app |
| Tier | C (Lightsail + Terraform) |
| Lightsail instance | see terraform/lightsail-freescout/ |
| Deploy workflow | none (Terraform-managed; no CI deploy) |
| Status tile id | tickets-raxx-app |
| Access | Public (customer-facing support queue) |
| Sentry project | n/a |
| Vault paths | /MooseQuest/freescout/, /MooseQuest/cloudflare/ |
| Parity note | Tier C exemption: no CI deploy step. Operator runs terraform apply from terraform/lightsail-freescout/. Cert renewal: docs/ops/runbooks/freescout-cert-renewal.md. Full runbook: docs/ops/runbooks/freescout.md. Backup/restore: docs/ops/runbooks/freescout-backup-restore.md. |

docs.raxx.app — customer documentation

| Field | Value |
| --- | --- |
| URL | https://docs.raxx.app |
| Tier | A (Cloudflare Pages) |
| CF Pages project | raxx-customer-docs |
| Deploy workflow | .github/workflows/deploy-customer-docs.yml |
| Source path | docs/customer/ |
| Status tile id | docs-raxx-app |
| Access | Public |
| Sentry project | n/a |
| Vault paths | /MooseQuest/cloudflare/ |
| Parity note | Full parity. Bootstrap step in workflow creates the CF Pages project idempotently and sets the DNS CNAME. Pattern reference for Tier A scaffolding. |

How to add a new surface (30-minute path)

Step 1 — Run the scaffold (5 min)

bash scripts/scaffold/new-surface.sh

The script prompts for:
- SURFACE_NAME: slug (e.g., learn)
- DOMAIN: FQDN (e.g., learn.raxx.app)
- TIER: cf-pages | cf-worker | heroku | lightsail
- ACCESS: public | internal

Output (all under scripts/scaffold/output/<surface>/):
- deploy-<surface>.yml: move to .github/workflows/ after review
- terraform-cf-pages-<surface>.tf.stub: starting point for the TF stub
- branch-protection.md: documents expected branch protection settings
- README.md: surface ops notes template

Step 2 — Fill in TODOs (10 min)

Open each generated file and resolve the TODO markers.

Step 3 — Register the status tile (5 min)

Move the status tile entry from scripts/scaffold/output/<surface>/status-stub.yaml into config/status-surfaces.yaml (append to the owned block). Confirm the probe_url is correct for the surface's /health endpoint.

Step 4 — Create Sentry project (if Tier B) (5 min)

Tier B (Heroku) surfaces running server code need a Sentry project:
  1. Create a project named <surface> in the Sentry org.
  2. Vault the DSN: infisical secrets set SENTRY_DSN=<dsn> --path /MooseQuest/sentry/<surface>/
  3. Add heroku config:set SENTRY_DSN="$SENTRY_DSN" --app <heroku-app> >/dev/null 2>&1 to the deploy workflow.

Step 5 — Open the PR (5 min)

Move generated files to their final locations, then open a PR. The PR description must include:
- Surface name, domain, tier, access type
- Whether CF Access gate was created (internal surfaces)
- Whether status tile was registered
- For Tier A: confirm the CF Pages project name matches the entry in status-surfaces.yaml
- Live HTTP-200 verification output. Run the command below against the custom domain (not a preview URL) and paste the output into the PR body. A PR without this output has not satisfied the surface launch acceptance criteria (see docs/architecture/new-surface-convention.md §9.1):

curl -o /dev/null -sSL -w "%{http_code}" https://<custom-domain>/
# expected: 200

Internal surfaces: include the CF-Access-Client-Id / CF-Access-Client-Secret headers in the curl call. The task is not done until this returns 200.
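The verification command can be wrapped in a small helper so the same call covers public and internal surfaces. This is a minimal sketch, not part of the scaffold output: the function name verify_surface and the CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET environment variable names are assumptions chosen for illustration.

```sh
# Sketch: print the final HTTP status for a surface and fail unless it is 200.
# Assumption: for internal surfaces the service token is available in
# CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET (hypothetical variable names).
verify_surface() (
  domain="$1"
  if [ -n "${CF_ACCESS_CLIENT_ID:-}" ] && [ -n "${CF_ACCESS_CLIENT_SECRET:-}" ]; then
    # Internal surface: send the CF Access service-token headers.
    code=$(curl -o /dev/null -sSL -w "%{http_code}" \
      -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
      -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
      "https://$domain/") || true
  else
    # Public surface: plain request, following redirects.
    code=$(curl -o /dev/null -sSL -w "%{http_code}" "https://$domain/") || true
  fi
  echo "$domain -> $code"
  [ "$code" = "200" ]
)
```

Calling verify_surface <custom-domain> prints one line suitable for pasting into the PR body and exits non-zero for any status other than 200.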


How to tell a surface is broken

Tier A (CF Pages)
- Status tile shows degraded or down for the surface.
- CF Pages project dashboard shows failed deployment.
- curl -I https://<surface>/health returns non-200 or times out.
- Cloudflare status page shows a Pages incident: https://www.cloudflarestatus.com

Tier B (Heroku)
- Status tile shows degraded or down.
- Heroku dashboard shows dyno restarts or H10/H12 errors.
- heroku logs --app <app> --tail shows crash loop or OOM.
- Sentry (if configured) shows a spike in error rate.
- curl https://<heroku-origin>/health (not CF-fronted URL) returns non-200.

Tier C (Lightsail)
- Status tile shows degraded or down.
- SSH connectivity to the Lightsail instance fails.
- Application logs (e.g., FreeScout: sudo journalctl -u apache2 -f) show errors.


How to diagnose (in order)

  1. Check the status tile in the operator console (https://console.raxx.app).
  2. Check the most recent deploy run in GitHub Actions for the surface's workflow.
  3. For Tier A: check the CF Pages project dashboard for the surface. For Tier B: heroku logs --app <app> --tail --num 200. For Tier C: SSH to the Lightsail instance and check application logs.
  4. Check Cloudflare status (https://www.cloudflarestatus.com) and Heroku status (https://status.heroku.com) for upstream incidents.
  5. For internal surfaces: confirm the CF Access application exists and the service token has not expired. Token check: curl -I https://<surface>/health with and without the CF-Access-Client-Id header.
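The token check in step 5 can be sketched as a pair of probes whose status codes are compared. The mapping from codes to diagnoses below is a heuristic assumed for illustration (for example, treating an unauthenticated 200 as a missing gate), and the function names are hypothetical.

```sh
# Print only the HTTP status of a HEAD request; extra arguments go to curl.
probe_code() {
  curl -sS -o /dev/null -I -w "%{http_code}" "$@" || true
}

# classify_access_gate <code_without_token> <code_with_token>
# Heuristic mapping (an assumption, not an official Cloudflare contract).
classify_access_gate() (
  without="$1"
  with="$2"
  if [ "$without" = "200" ]; then
    echo "gate-missing"     # surface answers unauthenticated requests
  elif [ "$with" = "200" ]; then
    echo "ok"               # gate blocks anonymous traffic, token accepted
  elif [ "$with" = "403" ] || [ "$with" = "302" ]; then
    echo "token-rejected"   # token expired, wrong, or blocked upstream
  else
    echo "origin-problem"   # neither request reached a healthy origin
  fi
)

# check_access_gate <url> <client_id> <client_secret>
check_access_gate() {
  classify_access_gate \
    "$(probe_code "$1")" \
    "$(probe_code -H "CF-Access-Client-Id: $2" -H "CF-Access-Client-Secret: $3" "$1")"
}
```

A token-rejected result with a 403 can also come from Bot Fight Mode rather than an expired token, so check the response body for "Error 1010" before rotating anything.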

Known failure modes

CF Pages deploy failed — build error

Symptom: Workflow fails at the build step. CF Pages project shows no new deployment.
Cause: Build script failure (missing dependency, compile error, OOM in runner).
Fix: Read the build step logs in the workflow run. Fix the source error and push.
Verification: Re-trigger workflow. Confirm new deployment appears in CF Pages dashboard and health check passes.

CF Pages deploy failed — wrangler auth error

Symptom: Workflow fails at the wrangler publish step with "Authentication error" or "10000".
Cause: CLOUDFLARE_API_TOKEN secret is expired or scoped to wrong permissions.
Fix: Rotate the CF Pages deploy token in Cloudflare. Update vault at /MooseQuest/cloudflare/CF_PAGES_DEPLOY_TOKEN. See docs/ops/runbooks/cloudflare-tokens.md.
Verification: Re-trigger workflow. Confirm wrangler step succeeds.

DNS CNAME missing — surface returns NXDOMAIN or 404

Symptom: curl https://<surface>/health returns NXDOMAIN or connection refused.
Cause: DNS CNAME record was not created (or was deleted). Workflow bootstrap step may have failed silently.
Fix:

# Confirm CNAME is absent
dig CNAME <surface>.raxx.app

# Re-run the workflow (the DNS bootstrap step is idempotent)
# Or create manually via CF API:
curl -sS -X POST -H "Authorization: Bearer $CLOUDFLARE_EDIT_DNS" \
  -H "Content-Type: application/json" \
  "https://api.cloudflare.com/client/v4/zones/f12dbb5cac57d5591a5058874498a6d1/dns_records" \
  -d '{"type":"CNAME","name":"<surface>.raxx.app","content":"<project>.pages.dev","proxied":true,"ttl":1}'

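Verification: dig CNAME <surface>.raxx.app returns the target (a proxied record may answer only with Cloudflare edge addresses, in which case confirm that dig <surface>.raxx.app resolves). curl -I https://<surface>/health returns 200.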
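The manual API call above can be parameterised so the surface and project names are not hand-edited inside the JSON. A minimal sketch follows; the function names and the CF_ZONE_ID variable are hypothetical, and the request is the same POST shown in the fix.

```sh
# Build the JSON body for a proxied CNAME pointing
# <surface>.raxx.app at <project>.pages.dev.
# $1 = surface slug, $2 = CF Pages project name
cname_payload() {
  printf '{"type":"CNAME","name":"%s.raxx.app","content":"%s.pages.dev","proxied":true,"ttl":1}\n' "$1" "$2"
}

# Create the record via the Cloudflare API.
# Assumptions: CLOUDFLARE_EDIT_DNS holds the DNS edit token and
# CF_ZONE_ID (hypothetical name) holds the zone id.
create_cname() {
  curl -sS -X POST \
    -H "Authorization: Bearer $CLOUDFLARE_EDIT_DNS" \
    -H "Content-Type: application/json" \
    "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records" \
    -d "$(cname_payload "$1" "$2")"
}
```

For example, cname_payload learn raxx-learn prints the body for learn.raxx.app, which can be inspected before create_cname learn raxx-learn sends it.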

Internal surface returns 403 — CF Access gate missing or misconfigured

Symptom: curl https://<surface> returns 403 with an "Access denied" body, even with valid credentials.
Cause: CF Access application does not exist, or the email allowlist policy is missing the operator's email.
Fix: Check the CF Access application list in the Cloudflare dashboard. If the app is missing, re-run the workflow (the CF Access bootstrap step is idempotent). If the policy is wrong, update via Cloudflare Access UI.
Verification: curl -H "CF-Access-Client-Id: <id>" -H "CF-Access-Client-Secret: <secret>" https://<surface>/health returns 200.

CF Access service token blocked by Bot Fight Mode

Symptom: Service token requests return 403 with "Error 1010" or bot-fight challenge page.
Cause: CF Bot Fight Mode fires before the Access layer. Service tokens do not bypass Bot Fight Mode.
Fix: Add a WAF skip rule keyed on CF-Access-Client-Id header. See docs/ops/runbooks/waf.md and feedback memory feedback_cf_access_does_not_bypass_bot_fight_mode.md.
Verification: curl -H "CF-Access-Client-Id: <id>" -H "CF-Access-Client-Secret: <secret>" https://<surface>/health returns 200 without a challenge page.

Deploy workflow does not trigger after push

Symptom: Push to main does not start the surface's deploy workflow.
Cause: The paths: filter in the workflow does not match any changed file, or the workflow file itself is outside the paths filter.
Fix: Confirm the workflow's paths: block includes the surface source directory. Note: the workflow YAML file itself should always be listed in the paths: block so that workflow-only changes still trigger a deploy.
Verification: gh workflow run deploy-<surface>.yml (manual trigger; requires a workflow_dispatch trigger in the workflow) confirms the workflow file is valid.
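Before pushing, it can help to check locally whether a change set touches anything the paths: filter covers. The sketch below uses plain prefix matching, which only approximates GitHub's glob-based filter; the function name and the frontend/learn/ source path in the usage line are hypothetical.

```sh
# would_trigger <prefix>...
# Reads changed file paths on stdin; exits 0 if any path starts with one of
# the given prefixes, 1 otherwise. Prefix matching is an approximation of
# the glob patterns that a workflow paths: filter accepts.
would_trigger() (
  while IFS= read -r path; do
    for prefix in "$@"; do
      case "$path" in
        "$prefix"*) exit 0 ;;
      esac
    done
  done
  exit 1
)
```

Usage: git diff --name-only HEAD~1 HEAD | would_trigger frontend/learn/ .github/workflows/deploy-learn.yml. A non-zero exit suggests the push would not start the deploy workflow.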

Health check fails post-deploy (Tier B)

Symptom: Workflow fails at the health-check step. Surface is deployed but returning non-200.
Cause: Dyno crash on boot (check heroku logs), missing environment variable, or DB migration failure.
Fix: See docs/ops/runbooks/heroku.md §"H10/H12 errors" and §"Dyno crash on boot".
Verification: curl https://<heroku-origin>/health returns 200. Then confirm CF-fronted URL also returns 200.


Naming conventions

| Artifact | Pattern | Example |
| --- | --- | --- |
| Surface slug | <name> (lowercase, hyphens) | learn |
| CF Pages project | raxx-<name> | raxx-learn |
| Heroku app (prod) | raxx-<name>-production | raxx-learn-production |
| Heroku app (staging) | raxx-<name>-staging | raxx-learn-staging |
| Deploy workflow | deploy-<name>.yml | deploy-learn.yml |
| Status tile id | <name>-raxx-app (for *.raxx.app) or <name>-<tld> for external | learn-raxx-app |
| Vault path prefix | /MooseQuest/<name>/ | /MooseQuest/learn/ |
| Sentry project | <name> | learn |
| Terraform directory | terraform/cf-pages-<name>/ (Tier A) or terraform/<name>/ (Tier C) | terraform/cf-pages-learn/ |
| Surface docs | docs/surfaces/<name>/README.md | docs/surfaces/learn/README.md |
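The patterns above can be derived mechanically from the slug. The sketch below prints them as key=value lines; the function name and the output keys are hypothetical, and allowing digits in the slug is an assumption beyond the "lowercase, hyphens" rule.

```sh
# surface_names <slug>
# Prints the artifact names that follow from the naming conventions.
surface_names() (
  LC_ALL=C   # keep the a-z range strictly ASCII regardless of locale
  name="$1"
  # Assumption: slugs are lowercase letters, digits, and hyphens only.
  case "$name" in
    ''|*[!a-z0-9-]*) echo "invalid slug: $name" >&2; exit 1 ;;
  esac
  echo "cf_pages_project=raxx-$name"
  echo "heroku_prod=raxx-$name-production"
  echo "heroku_staging=raxx-$name-staging"
  echo "workflow=deploy-$name.yml"
  echo "status_tile=$name-raxx-app"   # only valid for *.raxx.app surfaces
  echo "vault_prefix=/MooseQuest/$name/"
  echo "sentry_project=$name"
  echo "surface_docs=docs/surfaces/$name/README.md"
)
```

Running surface_names learn reproduces the Example column; a slug with uppercase letters or underscores is rejected with a non-zero exit.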

Vault paths (shared across surfaces)

These paths are used by every surface and loaded via .github/actions/load-vault-secrets:

| Secret | Vault path | Used for |
| --- | --- | --- |
| CF_PAGES_DEPLOY_TOKEN | /MooseQuest/cloudflare/CF_PAGES_DEPLOY_TOKEN | Wrangler publish auth |
| CLOUDFLARE_ACCOUNT_ID | /MooseQuest/cloudflare/CLOUDFLARE_ACCOUNT_ID | CF API calls |
| CLOUDFLARE_EDIT_DNS | /MooseQuest/cloudflare/CLOUDFLARE_EDIT_DNS | DNS CNAME creation |
| CLOUDFLARE_ACCESS_MGMT_TOKEN | /MooseQuest/cloudflare/CLOUDFLARE_ACCESS_MGMT_TOKEN | CF Access app creation (internal surfaces) |
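A workflow step or local script can fail fast when one of these secrets did not load. This is a minimal sketch assuming the secrets are exposed as environment variables with the same names as in the table; the function name require_secrets is hypothetical and is not part of load-vault-secrets.

```sh
# require_secrets <VAR_NAME>...
# Reports every named variable that is unset or empty, without printing
# any secret values, and exits non-zero if at least one is missing.
require_secrets() (
  missing=0
  for var in "$@"; do
    # Indirect lookup of the variable named by $var (names must be trusted).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing secret: $var" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)
```

For a public Tier A surface the call might be require_secrets CF_PAGES_DEPLOY_TOKEN CLOUDFLARE_ACCOUNT_ID CLOUDFLARE_EDIT_DNS, with CLOUDFLARE_ACCESS_MGMT_TOKEN added for internal surfaces.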

Emergency stop

To take a surface offline immediately:

Tier A (CF Pages):

# Disable the CF Pages custom domain via the Cloudflare dashboard, or
# set a WAF rule to block all traffic to the surface:
# Cloudflare dashboard → Security → WAF → Create rule → Block → hostname = <surface>

Tier B (Heroku):

heroku maintenance:on --app <heroku-app>

Tier C (Lightsail):

# Stop the application service on the Lightsail instance:
ssh ubuntu@<lightsail-ip> 'sudo systemctl stop apache2'  # example for FreeScout

Escalation

Wake the operator when:
- A surface's CF Access gate is missing and the surface was expected to be internal.
- A deploy fails and the health check is not recovering after 3 manual re-triggers.
- A new surface's Terraform stack requires provisioning a new Lightsail instance (Tier C).
- A cost implication from a new surface exceeds $50/mo or $500/year.
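For a steady recurring cost, the two thresholds in the last item are not equivalent: $500/year is reached before $50/mo. The sketch below works through that arithmetic, assuming a flat monthly cost in whole dollars and an annual cost of 12 times the monthly figure; the function name is hypothetical.

```sh
# cost_needs_escalation <monthly_cost_usd>
# Exits 0 when the cost exceeds $50/mo or, annualised over 12 months,
# exceeds $500/year. Whole-dollar input is an assumption of this sketch.
cost_needs_escalation() (
  monthly="$1"
  [ "$monthly" -gt 50 ] || [ $((monthly * 12)) -gt 500 ]
)
```

Under these assumptions a $45/mo surface stays under the monthly threshold but costs $540 over a year, so it still warrants escalation; the effective cutoff for a flat recurring cost is $42/mo.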


References