RCA — console.raxx.app dashboard: multiple surfaces showing DEGRADED/FAILED
Incident ID: 2026-04-30-console-degraded-surfaces Date: 2026-04-30 Severity: SEV-2 (deploy pipeline surface and internal monitoring degraded; no user-facing production outage) Duration: Detection 09:12 UTC — Investigation complete 10:45 UTC (93 min) Blast radius: Operator dashboard only — surface tile states were incorrect for most monitored surfaces. No user-visible impact; no data loss. Author: sre-agent
Summary
The console.raxx.app dashboard was showing DEGRADED or FAILED states on the majority of its infrastructure surface tiles. Two root causes were found: (1) the probe_heroku_app function was sending the health check request to CF Access-gated custom domain URLs rather than the direct Heroku .herokuapp.com URL, causing every CF-Access-protected Heroku surface to appear degraded even when the underlying app was healthy; (2) the surface registry referenced a CF Pages project named "getraxx" that does not exist, and the getraxx.com custom domain has its CF Pages binding broken (error 1014). Additionally, raxx-api-prod was found to have zero deployed dynos (genuinely DOWN), a pre-existing infrastructure gap uncovered by this triage.
Timeline (all times UTC)
- 09:12 — Kristerpher reports console.raxx.app dashboard showing degraded status tiles
- 09:15 — SRE agent acknowledges; begins reading
site_probes.pyandstatus_poller.py - 09:22 — CF Access creds pulled from Keychain (
raxx-cf-access-client-id/raxx-cf-access-client-secret) - 09:25 — Probe
console.raxx.app/healthwith service token: 302 redirect to CF Access login (service_token_status: false in JWT meta) - 09:30 — Probe
api.raxx.app/healthwith service token: same pattern (302 to login) - 09:33 —
vault.raxx.appprobed: 200 OK with same service token — confirms vault policy has service token enrolled; others do not - 09:36 — CF Access application policies pulled via API:
console.raxx.appandapi.raxx.apppolicies contain only email rule for kris@moosequest.net; no service token rule - 09:40 — Infisical vault queried for fresh CF tokens; both
CLOUDFLARE_RAXX_AUTOMATION_API_TOKENandCLOUDFLAREROLLEDin env are stale/revoked; vault tokens are active - 09:45 — CF Pages project list pulled with vault
CLOUDFLARE_PAGES_READ_TOKEN: no project namedgetraxxexists;getraxx.comreturns HTTP 403 with CF error 1014 (CNAME cross-user banned) - 09:52 — Heroku
raxx-api-prodchecked: 0 dynos, empty formation, 502 on direct URL — app is DOWN (no slug deployed) - 09:55 —
raxx-api-staging,raxx-console-staging,raxx-console-prod: all healthy on direct.herokuapp.comURLs (200) - 10:05 — CF Access policy check for all 9 Access applications: only
vault.raxx.apphas theraxx-agent-vault-accessservice token policy. All other apps (console, api, staging variants, raxx-mockups, raxx-app-previews) are email-only - 10:15 — Root cause confirmed: probe uses
public_url(CF Access-gated) for health check portion ofprobe_heroku_app - 10:22 — Code fix written:
site_probes.pymodified to useweb_urlfrom Heroku Platform API (direct.herokuapp.com) for health check;status_poller.pygetraxxentry removed from registry with comment - 10:45 — RCA drafted; operator escalation prepared for
raxx-api-prodDOWN +getraxx.com+ CF Access policy gap
Impact
- Users affected: 0 (dashboard is internal operator tooling)
- User-visible symptoms: None
- Data integrity: OK
- Revenue / billing: OK
What went well
- CF Access service token credentials in Keychain matched vault values exactly — no credential confusion
- Infisical vault was reachable and machine identity auth worked; fresh CF API tokens were obtainable
- The CF Access JWT meta payload included
service_token_status: false, making it unambiguous that the service token was present in the request but not matched by any policy rule raxx-api-stagingandraxx-console-stagingwere genuinely healthy; the degraded tiles were probe false-positives for those surfaces
What didn't go well
- The
probe_heroku_appfunction accepted apublic_urlparameter and used it for the health check, routing traffic through CF Access without any mechanism to supply credentials. There was no test for the "public_url behind CF Access" failure mode. raxx-api-prodreached production with 0 dynos and no deployed slug. Releases were config-var-only. The release pipeline did not gate on "is a web dyno running?"- The
getraxxsurface was added to the registry referencing a CF Pages project that does not exist. No validation prevents a registry entry from referencing a non-existent project. - Both
CLOUDFLARE_RAXX_AUTOMATION_API_TOKENandCLOUDFLAREROLLEDin the console app's env were revoked/stale. The poller would have silently degraded on any Cloudflare Pages probe attempt. - CF Access service token
raxx-agent-vault-accessis not enrolled in the policies for any application exceptvault.raxx.app. Any probe that needs to bypass CF Access on other apps will fail silently.
Root cause analysis
Contributing factor 1: probe_heroku_app routes health check through CF Access — The function used public_url for the liveness HTTP call. public_url was set to the CF Access-gated custom domain (e.g. https://api.raxx.app). The probe sent no CF Access headers, causing a 302 redirect to the login page which was read as a non-200, marking the surface DEGRADED. The direct Heroku URL (app_data["web_url"]) was available from the Platform API response and should have been used.
Contributing factor 2: Missing service token policies on CF Access applications — The service token raxx-agent-vault-access (used by probe_console_self) was enrolled only on vault.raxx.app. The remaining 7 CF Access-protected applications had email-only policies. Any probe needing to bypass CF Access on those apps had no path to success.
Contributing factor 3: Phantom CF Pages project in registry — status_poller.py referenced project_name="getraxx" but no such CF Pages project exists in the account. The probe returned 404 → degraded on every cycle. Separately, the getraxx.com DNS record resolves via Cloudflare proxy but the CF Pages custom domain binding is absent, returning error 1014.
Contributing factor 4: raxx-api-prod has no deployed dyno — The app has only config-var releases (v3–v7). No code/slug has been deployed. Formation is empty. This is a genuine DOWN state independent of any probe configuration issue.
Detection
- What alerted us: Operator (Kristerpher) observed degraded tiles on console dashboard
- How long between cause and detection: Unknown; likely hours (poller runs every 30s so every tile was red from first poll)
- How to detect faster next time: Add a Sentry alert for
probe_error: skipped:*or non-200 health on known-healthy apps; add a Heroku dyno-count check in the health probe path
Resolution
Fixed in this PR (fix/console-degraded-surfaces-2026-04-30):
- console/app/services/site_probes.py: probe_heroku_app now uses app_data["web_url"] (direct .herokuapp.com) for the health check; public_url parameter retained for future end-to-end probes but removed from health check path
- console/app/services/status_poller.py: getraxx surface entry removed from SURFACE_REGISTRY_BY_ENV["prod"] with explanatory comment; will be re-added when CF Pages project is created and domain is re-attached
Requires operator action (escalation below):
- raxx-api-prod: deploy a slug with a web dyno; scale web formation to ≥1
- getraxx.com: create or re-attach CF Pages project for the marketing site
- CF Access policies: add raxx-agent-vault-access service token to non_identity policy on console.raxx.app, api.raxx.app, and the staging equivalents
- Rotate stale CF tokens in console app env: CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN and CLOUDFLAREROLLED are revoked
Validation: After PR merge and re-deploy of console, direct-URL probes for raxx-api-staging, raxx-console-staging, raxx-console-prod, and vault should all show HEALTHY. api-prod will remain FAILED/DOWN until a dyno is deployed.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Add raxx-agent-vault-access service token to non_identity policy on console.raxx.app, api.raxx.app, console-staging.raxx.app, api-staging.raxx.app, raxx-mockups.pages.dev, raxx-app.pages.dev CF Access applications |
operator | 2026-05-02 | TBD |
| 2 | Deploy raxx-api-prod: push a slug with web dyno; scale formation to 1 |
operator | 2026-05-02 | TBD |
| 3 | Resolve getraxx.com error 1014: create a CF Pages project for the marketing site and re-attach the custom domain |
operator | 2026-05-02 | TBD |
| 4 | Rotate stale CF tokens in console app env (CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN, CLOUDFLAREROLLED) — pull from vault |
operator | 2026-05-07 | TBD |
| 5 | Add probe-level test for "public_url behind CF Access returns 302" failure mode in test_site_probes.py |
sre-agent | 2026-05-07 | TBD |
| 6 | Add registry validation: on startup, warn if a cloudflare_pages surface references a non-existent project |
sre-agent | 2026-05-14 | TBD |
| 7 | Add Heroku dyno-count check to probe_heroku_app: return DEGRADED (not DOWN) if formation exists but 0 dynos running, with explicit error dyno_count_zero |
sre-agent | 2026-05-14 | TBD |
References
- Runbook:
docs/ops/runbooks/status-d1.md(updated as part of this incident) - CF error 1014: https://developers.cloudflare.com/support/troubleshooting/cloudflare-errors/troubleshooting-cloudflare-1xxx-errors/#error-1014-cname-cross-user-banned
- CF Access service tokens: https://developers.cloudflare.com/cloudflare-one/identity/service-tokens/
- ADR-0030:
docs/architecture/adr/0030-status-state-machine.md