RCA — console.raxx.app dashboard: multiple surfaces showing DEGRADED/FAILED

Incident ID: 2026-04-30-console-degraded-surfaces
Date: 2026-04-30
Severity: SEV-2 (deploy pipeline surface and internal monitoring degraded; no user-facing production outage)
Duration: Detection 09:12 UTC to investigation complete 10:45 UTC (93 min)
Blast radius: Operator dashboard only. Surface tile states were incorrect for most monitored surfaces. No user-visible impact; no data loss.
Author: sre-agent

Summary

The console.raxx.app dashboard showed DEGRADED or FAILED states on most of its infrastructure surface tiles. Two root causes were found: (1) probe_heroku_app sent its health check request to the CF Access-gated custom domain URL rather than the direct Heroku .herokuapp.com URL, so every CF Access-protected Heroku surface appeared degraded even when the underlying app was healthy; (2) the surface registry referenced a CF Pages project named "getraxx" that does not exist, and the getraxx.com custom domain has a broken CF Pages binding (error 1014). In addition, triage uncovered a pre-existing infrastructure gap: raxx-api-prod has zero deployed dynos and is genuinely DOWN.

Timeline (all times UTC)

09:12 — Kristerpher reports console.raxx.app dashboard showing degraded status tiles
09:15 — SRE agent acknowledges; begins reading site_probes.py and status_poller.py
09:22 — CF Access creds pulled from Keychain (raxx-cf-access-client-id / raxx-cf-access-client-secret)
09:25 — Probe console.raxx.app/health with service token: 302 redirect to CF Access login (service_token_status: false in JWT meta)
09:30 — Probe api.raxx.app/health with service token: same pattern (302 to login)
09:33 — vault.raxx.app probed: 200 OK with same service token — confirms vault policy has service token enrolled; others do not
09:36 — CF Access application policies pulled via API: console.raxx.app and api.raxx.app policies contain only email rule for kris@moosequest.net; no service token rule
09:40 — Infisical vault queried for fresh CF tokens; both CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN and CLOUDFLAREROLLED in env are stale/revoked; vault tokens are active
09:45 — CF Pages project list pulled with vault CLOUDFLARE_PAGES_READ_TOKEN: no project named getraxx exists; getraxx.com returns HTTP 403 with CF error 1014 (CNAME cross-user banned)
09:52 — Heroku raxx-api-prod checked: 0 dynos, empty formation, 502 on direct URL — app is DOWN (no slug deployed)
09:55 — raxx-api-staging, raxx-console-staging, raxx-console-prod: all healthy on direct .herokuapp.com URLs (200)
10:05 — CF Access policy check for all 9 Access applications: only vault.raxx.app has the raxx-agent-vault-access service token policy. All other apps (console, api, staging variants, raxx-mockups, raxx-app-previews) are email-only
10:15 — Root cause confirmed: probe uses public_url (CF Access-gated) for health check portion of probe_heroku_app
10:22 — Code fix written: site_probes.py modified to use web_url from Heroku Platform API (direct .herokuapp.com) for health check; status_poller.py getraxx entry removed from registry with comment
10:45 — RCA drafted; operator escalation prepared for raxx-api-prod DOWN + getraxx.com + CF Access policy gap

Impact

Users affected: 0 (dashboard is internal operator tooling)
User-visible symptoms: None
Data integrity: OK
Revenue / billing: OK

What went well

CF Access service token credentials in Keychain matched vault values exactly — no credential confusion
Infisical vault was reachable and machine identity auth worked; fresh CF API tokens were obtainable
The CF Access JWT meta payload included service_token_status: false, making it unambiguous that the service token was present in the request but not matched by any policy rule
raxx-api-staging and raxx-console-staging were genuinely healthy; the degraded tiles were probe false-positives for those surfaces

What didn't go well

The probe_heroku_app function accepted a public_url parameter and used it for the health check, routing traffic through CF Access without any mechanism to supply credentials. There was no test for the "public_url behind CF Access" failure mode.
raxx-api-prod reached production with 0 dynos and no deployed slug. Releases were config-var-only. The release pipeline did not gate on "is a web dyno running?"
The getraxx surface was added to the registry referencing a CF Pages project that does not exist. No validation prevents a registry entry from referencing a non-existent project.
Both CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN and CLOUDFLAREROLLED in the console app's env were revoked/stale. The poller would have silently degraded on any Cloudflare Pages probe attempt.
CF Access service token raxx-agent-vault-access is not enrolled in the policies for any application except vault.raxx.app. Any probe that needs to bypass CF Access on other apps will fail silently.

Root cause analysis

Contributing factor 1: probe_heroku_app routes health check through CF Access — The function used public_url for the liveness HTTP call. public_url was set to the CF Access-gated custom domain (e.g. https://api.raxx.app). The probe sent no CF Access headers, causing a 302 redirect to the login page which was read as a non-200, marking the surface DEGRADED. The direct Heroku URL (app_data["web_url"]) was available from the Platform API response and should have been used.
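The corrected health-check path can be sketched roughly as follows. This is a minimal illustration, not the code in site_probes.py: it assumes the Platform API app record is a dict carrying a web_url key (as described above), and the names health_url_for, probe_heroku_app_sketch, the /health path default, and the injected fetch_status callable are all hypothetical.

```python
from typing import Callable, Optional
from urllib.parse import urljoin


def health_url_for(app_data: dict, health_path: str = "/health") -> str:
    """Build the health URL from the direct Heroku URL, never the custom domain."""
    # web_url comes from the Heroku Platform API app record and points at
    # the .herokuapp.com host, which is not behind CF Access.
    return urljoin(app_data["web_url"], health_path)


def probe_heroku_app_sketch(
    app_data: dict,
    fetch_status: Callable[[str], int],
    public_url: Optional[str] = None,  # kept for future end-to-end probes; unused here
) -> dict:
    """Hypothetical probe: the liveness call only ever hits the direct URL."""
    url = health_url_for(app_data)
    status = fetch_status(url)  # injected so the HTTP client stays swappable
    state = "HEALTHY" if status == 200 else "DEGRADED"
    return {"state": state, "checked_url": url, "http_status": status}


if __name__ == "__main__":
    # Placeholder app record; the host name is illustrative only.
    app = {"web_url": "https://example-app.herokuapp.com/"}
    print(probe_heroku_app_sketch(app, lambda url: 200, public_url="https://api.raxx.app"))
```

Injecting fetch_status keeps the URL-selection logic testable without network access, which is the part that regressed in this incident.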

Contributing factor 2: Missing service token policies on CF Access applications — The service token raxx-agent-vault-access (used by probe_console_self) was enrolled only on vault.raxx.app. The remaining 7 CF Access-protected applications had email-only policies. Any probe needing to bypass CF Access on those apps had no path to success.
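For probes that do need to traverse CF Access, a sketch of how the service token could be attached and how a login redirect could be told apart from a real failure is shown below. The CF-Access-Client-Id and CF-Access-Client-Secret header names are the ones Cloudflare documents for service tokens; everything else is an assumption for illustration, including the function names, the state labels, and the heuristic that a login redirect targets a host under cloudflareaccess.com or a /cdn-cgi/access/login path.

```python
from typing import Dict, Optional
from urllib.parse import urlparse


def cf_access_headers(client_id: str, client_secret: str) -> Dict[str, str]:
    """Headers Cloudflare Access reads to authenticate a service token."""
    return {
        "CF-Access-Client-Id": client_id,
        "CF-Access-Client-Secret": client_secret,
    }


def classify_probe_response(status: int, location: Optional[str] = None) -> str:
    """Hypothetical classifier: separate 'blocked by CF Access' from 'app unhealthy'."""
    if status == 200:
        return "healthy"
    if status in (301, 302, 303, 307) and location:
        host = urlparse(location).hostname or ""
        # Assumed heuristic for recognising the CF Access login redirect.
        if host.endswith(".cloudflareaccess.com") or "/cdn-cgi/access/login" in location:
            return "blocked_by_cf_access"
    return "degraded"


if __name__ == "__main__":
    login = "https://team.cloudflareaccess.com/cdn-cgi/access/login/api.raxx.app"
    print(classify_probe_response(302, login))
    print(classify_probe_response(502))
```

Reporting a distinct state for the redirect case would make a missing policy rule visible on the tile instead of looking like an unhealthy app. Note that the headers only help once the token is enrolled in the application's policy.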

Contributing factor 3: Phantom CF Pages project in registry — status_poller.py referenced project_name="getraxx" but no such CF Pages project exists in the account. The probe returned 404 → degraded on every cycle. Separately, the getraxx.com DNS record resolves via Cloudflare proxy but the CF Pages custom domain binding is absent, returning error 1014.
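A startup check for this class of error might look like the sketch below. The registry shape (a list of dicts with kind and project_name keys) and the function name are assumptions made for illustration; the real SURFACE_REGISTRY_BY_ENV structure may differ, and the list of existing projects is expected to come from a separate CF Pages project listing call that is not shown here.

```python
import logging
from typing import Iterable, List, Mapping

logger = logging.getLogger("status_poller.registry")


def find_missing_pages_projects(
    registry: Iterable[Mapping[str, str]],
    existing_projects: Iterable[str],
) -> List[str]:
    """Return project names referenced by cloudflare_pages surfaces but absent from the account."""
    existing = set(existing_projects)
    missing: List[str] = []
    for surface in registry:
        if surface.get("kind") != "cloudflare_pages":
            continue  # only CF Pages surfaces carry a project_name to validate
        name = surface.get("project_name", "")
        if name not in existing:
            logger.warning("registry references unknown CF Pages project: %r", name)
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical registry entries shaped after the names in this RCA.
    registry = [
        {"kind": "cloudflare_pages", "project_name": "getraxx"},
        {"kind": "cloudflare_pages", "project_name": "raxx-mockups"},
        {"kind": "heroku", "app_name": "raxx-api-prod"},
    ]
    print(find_missing_pages_projects(registry, ["raxx-mockups", "raxx-app"]))
```

The sketch warns rather than raises so that a stale registry entry cannot stop the poller from starting.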

Contributing factor 4: raxx-api-prod has no deployed dyno — The app has only config-var releases (v3–v7). No code/slug has been deployed. Formation is empty. This is a genuine DOWN state independent of any probe configuration issue.
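A dyno-aware state check could be sketched as below. It assumes the formation and dyno lists have the shape returned by the Heroku Platform API (entries with a type field, and dynos with a state field where "up" means running). The function name and the no_web_formation label for an empty formation are hypothetical; dyno_count_zero is the error string proposed in the action items.

```python
from typing import List, Mapping, Optional, Tuple


def classify_dynos(
    formation: List[Mapping[str, object]],
    dynos: List[Mapping[str, object]],
) -> Tuple[str, Optional[str]]:
    """Map formation and dyno lists to a (state, error) pair for the web process."""
    has_web_formation = any(entry.get("type") == "web" for entry in formation)
    if not has_web_formation:
        # Nothing has ever been deployed with a web process: genuinely DOWN.
        return ("DOWN", "no_web_formation")  # hypothetical error label

    running = sum(
        1 for dyno in dynos
        if dyno.get("type") == "web" and dyno.get("state") == "up"
    )
    if running == 0:
        # Formation exists but no web dyno is running.
        return ("DEGRADED", "dyno_count_zero")
    return ("HEALTHY", None)


if __name__ == "__main__":
    print(classify_dynos([], []))
    print(classify_dynos([{"type": "web", "quantity": 1}], [{"type": "web", "state": "up"}]))
```

Keeping this as a pure function over already-fetched API data lets the probe report why a surface is red without issuing an HTTP health check against an app that cannot answer.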

Detection

What alerted us: Operator (Kristerpher) observed degraded tiles on console dashboard
How long between cause and detection: Unknown; likely hours (poller runs every 30s so every tile was red from first poll)
How to detect faster next time: Add a Sentry alert for probe_error: skipped:* or non-200 health on known-healthy apps; add a Heroku dyno-count check in the health probe path

Resolution

Fixed in this PR (fix/console-degraded-surfaces-2026-04-30):
- console/app/services/site_probes.py: probe_heroku_app now uses app_data["web_url"] (direct .herokuapp.com) for the health check; the public_url parameter is retained for future end-to-end probes but removed from the health check path
- console/app/services/status_poller.py: getraxx surface entry removed from SURFACE_REGISTRY_BY_ENV["prod"] with an explanatory comment; it will be re-added when the CF Pages project is created and the domain is re-attached

Requires operator action (escalation below):
- raxx-api-prod: deploy a slug with a web dyno; scale web formation to ≥1
- getraxx.com: create or re-attach the CF Pages project for the marketing site
- CF Access policies: add the raxx-agent-vault-access service token to the non_identity policy on console.raxx.app, api.raxx.app, and the staging equivalents
- Rotate stale CF tokens in the console app env: CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN and CLOUDFLAREROLLED are revoked

Validation: After PR merge and re-deploy of console, direct-URL probes for raxx-api-staging, raxx-console-staging, raxx-console-prod, and vault should all show HEALTHY. api-prod will remain FAILED/DOWN until a dyno is deployed.

Action items

| # | Action | Owner | Due | Issue |
|---|--------|-------|-----|-------|
| 1 | Add `raxx-agent-vault-access` service token to `non_identity` policy on console.raxx.app, api.raxx.app, console-staging.raxx.app, api-staging.raxx.app, raxx-mockups.pages.dev, raxx-app.pages.dev CF Access applications | operator | 2026-05-02 | TBD |
| 2 | Deploy `raxx-api-prod`: push a slug with web dyno; scale formation to 1 | operator | 2026-05-02 | TBD |
| 3 | Resolve `getraxx.com` error 1014: create a CF Pages project for the marketing site and re-attach the custom domain | operator | 2026-05-02 | TBD |
| 4 | Rotate stale CF tokens in console app env (`CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN`, `CLOUDFLAREROLLED`); pull from vault | operator | 2026-05-07 | TBD |
| 5 | Add probe-level test for "public_url behind CF Access returns 302" failure mode in `test_site_probes.py` | sre-agent | 2026-05-07 | TBD |
| 6 | Add registry validation: on startup, warn if a `cloudflare_pages` surface references a non-existent project | sre-agent | 2026-05-14 | TBD |
| 7 | Add Heroku dyno-count check to `probe_heroku_app`: return DEGRADED (not DOWN) if formation exists but 0 dynos running, with explicit error `dyno_count_zero` | sre-agent | 2026-05-14 | TBD |

References

Runbook: docs/ops/runbooks/status-d1.md (updated as part of this incident)
CF error 1014: https://developers.cloudflare.com/support/troubleshooting/cloudflare-errors/troubleshooting-cloudflare-1xxx-errors/#error-1014-cname-cross-user-banned
CF Access service tokens: https://developers.cloudflare.com/cloudflare-one/identity/service-tokens/
ADR-0030: docs/architecture/adr/0030-status-state-machine.md