Raxx · internal docs

internal · gated ↑ index

RCA — console.raxx.app dashboard: multiple surfaces showing DEGRADED/FAILED

Incident ID: 2026-04-30-console-degraded-surfaces Date: 2026-04-30 Severity: SEV-2 (deploy pipeline surface and internal monitoring degraded; no user-facing production outage) Duration: Detection 09:12 UTC — Investigation complete 10:45 UTC (93 min) Blast radius: Operator dashboard only — surface tile states were incorrect for most monitored surfaces. No user-visible impact; no data loss. Author: sre-agent


Summary

The console.raxx.app dashboard was showing DEGRADED or FAILED states on the majority of its infrastructure surface tiles. Two root causes were found: (1) the probe_heroku_app function was sending the health check request to CF Access-gated custom domain URLs rather than the direct Heroku .herokuapp.com URL, causing every CF-Access-protected Heroku surface to appear degraded even when the underlying app was healthy; (2) the surface registry referenced a CF Pages project named "getraxx" that does not exist, and the getraxx.com custom domain has its CF Pages binding broken (error 1014). Additionally, raxx-api-prod was found to have zero deployed dynos (genuinely DOWN), a pre-existing infrastructure gap uncovered by this triage.


Timeline (all times UTC)


Impact


What went well

What didn't go well


Root cause analysis

Contributing factor 1: probe_heroku_app routes health check through CF Access — The function used public_url for the liveness HTTP call. public_url was set to the CF Access-gated custom domain (e.g. https://api.raxx.app). The probe sent no CF Access headers, causing a 302 redirect to the login page which was read as a non-200, marking the surface DEGRADED. The direct Heroku URL (app_data["web_url"]) was available from the Platform API response and should have been used.

Contributing factor 2: Missing service token policies on CF Access applications — The service token raxx-agent-vault-access (used by probe_console_self) was enrolled only on vault.raxx.app. The remaining 7 CF Access-protected applications had email-only policies. Any probe needing to bypass CF Access on those apps had no path to success.

Contributing factor 3: Phantom CF Pages project in registrystatus_poller.py referenced project_name="getraxx" but no such CF Pages project exists in the account. The probe returned 404 → degraded on every cycle. Separately, the getraxx.com DNS record resolves via Cloudflare proxy but the CF Pages custom domain binding is absent, returning error 1014.

Contributing factor 4: raxx-api-prod has no deployed dyno — The app has only config-var releases (v3–v7). No code/slug has been deployed. Formation is empty. This is a genuine DOWN state independent of any probe configuration issue.


Detection


Resolution

Fixed in this PR (fix/console-degraded-surfaces-2026-04-30): - console/app/services/site_probes.py: probe_heroku_app now uses app_data["web_url"] (direct .herokuapp.com) for the health check; public_url parameter retained for future end-to-end probes but removed from health check path - console/app/services/status_poller.py: getraxx surface entry removed from SURFACE_REGISTRY_BY_ENV["prod"] with explanatory comment; will be re-added when CF Pages project is created and domain is re-attached

Requires operator action (escalation below): - raxx-api-prod: deploy a slug with a web dyno; scale web formation to ≥1 - getraxx.com: create or re-attach CF Pages project for the marketing site - CF Access policies: add raxx-agent-vault-access service token to non_identity policy on console.raxx.app, api.raxx.app, and the staging equivalents - Rotate stale CF tokens in console app env: CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN and CLOUDFLAREROLLED are revoked

Validation: After PR merge and re-deploy of console, direct-URL probes for raxx-api-staging, raxx-console-staging, raxx-console-prod, and vault should all show HEALTHY. api-prod will remain FAILED/DOWN until a dyno is deployed.


Action items

# Action Owner Due Issue
1 Add raxx-agent-vault-access service token to non_identity policy on console.raxx.app, api.raxx.app, console-staging.raxx.app, api-staging.raxx.app, raxx-mockups.pages.dev, raxx-app.pages.dev CF Access applications operator 2026-05-02 TBD
2 Deploy raxx-api-prod: push a slug with web dyno; scale formation to 1 operator 2026-05-02 TBD
3 Resolve getraxx.com error 1014: create a CF Pages project for the marketing site and re-attach the custom domain operator 2026-05-02 TBD
4 Rotate stale CF tokens in console app env (CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN, CLOUDFLAREROLLED) — pull from vault operator 2026-05-07 TBD
5 Add probe-level test for "public_url behind CF Access returns 302" failure mode in test_site_probes.py sre-agent 2026-05-07 TBD
6 Add registry validation: on startup, warn if a cloudflare_pages surface references a non-existent project sre-agent 2026-05-14 TBD
7 Add Heroku dyno-count check to probe_heroku_app: return DEGRADED (not DOWN) if formation exists but 0 dynos running, with explicit error dyno_count_zero sre-agent 2026-05-14 TBD

References