- Incident ID: 2026-04-30-console-degraded-surfaces
- Date: 2026-04-30
- Severity: SEV-2 (deploy pipeline surface and internal monitoring degraded; no user-facing production outage)
- Duration: Detection 09:12 UTC to investigation complete 10:45 UTC (93 min)
- Blast radius: Operator dashboard only; surface tile states were incorrect for most monitored surfaces. No user-visible impact; no data loss.
- Author: sre-agent
The console.raxx.app dashboard was showing DEGRADED or FAILED states on the majority of its infrastructure surface tiles. Two root causes were found: (1) the probe_heroku_app function was sending the health check request to CF Access-gated custom domain URLs rather than the direct Heroku .herokuapp.com URL, causing every CF-Access-protected Heroku surface to appear degraded even when the underlying app was healthy; (2) the surface registry referenced a CF Pages project named "getraxx" that does not exist, and the getraxx.com custom domain has its CF Pages binding broken (error 1014). Additionally, raxx-api-prod was found to have zero deployed dynos (genuinely DOWN), a pre-existing infrastructure gap uncovered by this triage.
- site_probes.py and status_poller.py
- (raxx-cf-access-client-id / raxx-cf-access-client-secret)
- console.raxx.app/health with service token: 302 redirect to CF Access login (service_token_status: false in JWT meta)
- api.raxx.app/health with service token: same pattern (302 to login)
- vault.raxx.app probed: 200 OK with same service token; confirms vault policy has service token enrolled; others do not
- console.raxx.app and api.raxx.app policies contain only email rule for kris@moosequest.net; no service token rule
- CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN and CLOUDFLAREROLLED in env are stale/revoked; vault tokens are active
- CLOUDFLARE_PAGES_READ_TOKEN: no project named getraxx exists; getraxx.com returns HTTP 403 with CF error 1014 (CNAME cross-user banned)
- raxx-api-prod checked: 0 dynos, empty formation, 502 on direct URL; app is DOWN (no slug deployed)
- raxx-api-staging, raxx-console-staging, raxx-console-prod: all healthy on direct .herokuapp.com URLs (200)
- vault.raxx.app has the raxx-agent-vault-access service token policy. All other apps (console, api, staging variants, raxx-mockups, raxx-app-previews) are email-only
- public_url (CF Access-gated) for health check portion of probe_heroku_app
- site_probes.py modified to use web_url from Heroku Platform API (direct .herokuapp.com) for health check; status_poller.py getraxx entry removed from registry with comment
- raxx-api-prod DOWN + getraxx.com + CF Access policy gap

- service_token_status: false, making it unambiguous that the service token was present in the request but not matched by any policy rule
- raxx-api-staging and raxx-console-staging were genuinely healthy; the degraded tiles were probe false-positives for those surfaces

- The probe_heroku_app function accepted a public_url parameter and used it for the health check, routing traffic through CF Access without any mechanism to supply credentials. There was no test for the "public_url behind CF Access" failure mode.
- raxx-api-prod reached production with 0 dynos and no deployed slug. Releases were config-var-only. The release pipeline did not gate on "is a web dyno running?"
- The getraxx surface was added to the registry referencing a CF Pages project that does not exist. No validation prevents a registry entry from referencing a non-existent project.
- CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN and CLOUDFLAREROLLED in the console app's env were revoked/stale. The poller would have silently degraded on any Cloudflare Pages probe attempt.
- The service token raxx-agent-vault-access is not enrolled in the policies for any application except vault.raxx.app. Any probe that needs to bypass CF Access on other apps will fail silently.

Contributing factor 1: probe_heroku_app routes health check through CF Access. The function used public_url for the liveness HTTP call. public_url was set to the CF Access-gated custom domain (e.g. https://api.raxx.app). The probe sent no CF Access headers, so CF Access answered with a 302 redirect to the login page; the probe read this as a non-200 and marked the surface DEGRADED. The direct Heroku URL (app_data["web_url"]) was available from the Platform API response and should have been used.
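The URL selection described in contributing factor 1 can be sketched roughly as below. This is an illustration under assumptions, not the actual site_probes.py code: the helper name pick_health_url, the "health" path, and the example hostname are hypothetical. Only the web_url field (from the Heroku Platform API app response) and the public_url parameter come from the incident notes.

```python
from typing import Optional
from urllib.parse import urljoin


def pick_health_url(app_data: dict, public_url: Optional[str] = None,
                    health_path: str = "health") -> str:
    """Return the URL the liveness check should hit (hypothetical helper).

    Uses the direct Heroku URL (app_data["web_url"]) so the request never
    passes through CF Access. public_url is accepted but deliberately not
    used here; it is kept only for possible end-to-end probes.
    """
    web_url = app_data.get("web_url")
    if not web_url:
        raise ValueError("app_data has no web_url; cannot build a direct health URL")
    # A trailing slash makes urljoin append the path instead of replacing a segment.
    base = web_url if web_url.endswith("/") else web_url + "/"
    return urljoin(base, health_path.lstrip("/"))


# Example with a made-up app URL: the gated custom domain is ignored.
example = pick_health_url(
    {"web_url": "https://example-app.herokuapp.com/"},
    public_url="https://api.raxx.app",
)
```

Keeping public_url in the signature mirrors the decision recorded in this postmortem to retain it for future end-to-end probes while taking it out of the health check path.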
Contributing factor 2: Missing service token policies on CF Access applications — The service token raxx-agent-vault-access (used by probe_console_self) was enrolled only on vault.raxx.app. The remaining 7 CF Access-protected applications had email-only policies. Any probe needing to bypass CF Access on those apps had no path to success.
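For probes that do need to pass through CF Access, the following sketch shows how a service token is normally presented and how a login redirect could be told apart from a genuinely unhealthy app. The header names CF-Access-Client-Id and CF-Access-Client-Secret are the ones Cloudflare documents for service tokens. The function names, the ACCESS_BLOCKED state, and the rule that a login redirect points at a *.cloudflareaccess.com host are assumptions made for illustration; they are not taken from the console codebase.

```python
from urllib.parse import urlparse

REDIRECT_CODES = {301, 302, 303, 307, 308}


def access_headers(client_id: str, client_secret: str) -> dict:
    """Build the request headers that carry a CF Access service token."""
    return {
        "CF-Access-Client-Id": client_id,
        "CF-Access-Client-Secret": client_secret,
    }


def classify_gated_response(status_code: int, location: str = "") -> str:
    """Classify a response from a CF Access-gated URL (hypothetical states).

    Assumes the probe did NOT follow redirects, so a login redirect is
    visible as a 3xx status plus a Location header.
    """
    if status_code == 200:
        return "HEALTHY"
    host = urlparse(location).hostname or ""
    if status_code in REDIRECT_CODES and host.endswith(".cloudflareaccess.com"):
        # The token was missing or not matched by any policy rule;
        # this says nothing about the health of the app behind the gate.
        return "ACCESS_BLOCKED"
    return "DEGRADED"
```

A distinct state for the policy gap would keep a missing service token rule from showing up as a degraded tile, which is the false positive seen in this incident.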
Contributing factor 3: Phantom CF Pages project in registry — status_poller.py referenced project_name="getraxx" but no such CF Pages project exists in the account. The probe returned 404 → degraded on every cycle. Separately, the getraxx.com DNS record resolves via Cloudflare proxy but the CF Pages custom domain binding is absent, returning error 1014.
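A startup check along the following lines could catch a phantom project before it reaches the dashboard. It is only a sketch: the registry entry shape (a dict with "kind" and "project_name" keys), the logger name, and the function name are assumptions, and the set of existing project names is expected to be fetched separately (for example from the Cloudflare Pages API) and passed in.

```python
import logging

logger = logging.getLogger("status_poller.registry")


def find_missing_pages_projects(registry: list, existing_projects: set) -> list:
    """Return project names referenced by cloudflare_pages surfaces that do not exist.

    Logs a warning per missing project instead of raising, so a bad entry
    does not stop the poller from starting.
    """
    missing = []
    for surface in registry:
        if surface.get("kind") != "cloudflare_pages":
            continue  # Heroku and other surface kinds are not checked here
        name = surface.get("project_name")
        if name not in existing_projects:
            logger.warning("registry references unknown CF Pages project: %r", name)
            missing.append(name)
    return missing


# Example with made-up registry entries.
sample_registry = [
    {"kind": "cloudflare_pages", "project_name": "getraxx"},
    {"kind": "cloudflare_pages", "project_name": "raxx-mockups"},
    {"kind": "heroku", "app": "raxx-api-staging"},
]
missing = find_missing_pages_projects(sample_registry, {"raxx-mockups"})
```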
Contributing factor 4: raxx-api-prod has no deployed dyno — The app has only config-var releases (v3–v7). No code/slug has been deployed. Formation is empty. This is a genuine DOWN state independent of any probe configuration issue.
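The dyno-count check proposed in the action items might look roughly like this. It assumes the probe already holds the formation list returned by the Heroku Platform API (a list of objects with "type" and "quantity" fields). The function name and the formation_empty error string are hypothetical; dyno_count_zero is the error name given in the action item. Note that quantity is the scaled dyno count, which only approximates "dynos running".

```python
from typing import Optional, Tuple


def check_dyno_count(formation: list) -> Tuple[str, Optional[str]]:
    """Map a Heroku formation list to a (state, error) pair.

    - empty formation (nothing deployed)        -> DOWN, formation_empty
    - formation exists but web quantity is zero -> DEGRADED, dyno_count_zero
    - at least one web dyno scaled              -> HEALTHY, no error
    """
    if not formation:
        return ("DOWN", "formation_empty")
    web_quantity = sum(
        proc.get("quantity", 0) for proc in formation if proc.get("type") == "web"
    )
    if web_quantity == 0:
        return ("DEGRADED", "dyno_count_zero")
    return ("HEALTHY", None)
```

Under these assumptions, the raxx-api-prod state found during triage (empty formation, no slug) would map to DOWN, while an app that was deployed and then scaled to zero would map to DEGRADED.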
probe_error: skipped:* or non-200 health on known-healthy apps; add a Heroku dyno-count check in the health probe path

Fixed in this PR (fix/console-degraded-surfaces-2026-04-30):
- console/app/services/site_probes.py: probe_heroku_app now uses app_data["web_url"] (direct .herokuapp.com) for the health check; public_url parameter retained for future end-to-end probes but removed from health check path
- console/app/services/status_poller.py: getraxx surface entry removed from SURFACE_REGISTRY_BY_ENV["prod"] with explanatory comment; will be re-added when CF Pages project is created and domain is re-attached
Requires operator action (escalation below):
- raxx-api-prod: deploy a slug with a web dyno; scale web formation to ≥1
- getraxx.com: create or re-attach CF Pages project for the marketing site
- CF Access policies: add raxx-agent-vault-access service token to non_identity policy on console.raxx.app, api.raxx.app, and the staging equivalents
- Rotate stale CF tokens in console app env: CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN and CLOUDFLAREROLLED are revoked
Validation: After PR merge and re-deploy of console, direct-URL probes for raxx-api-staging, raxx-console-staging, raxx-console-prod, and vault should all show HEALTHY. api-prod will remain FAILED/DOWN until a dyno is deployed.
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Add raxx-agent-vault-access service token to non_identity policy on console.raxx.app, api.raxx.app, console-staging.raxx.app, api-staging.raxx.app, raxx-mockups.pages.dev, raxx-app.pages.dev CF Access applications | operator | 2026-05-02 | TBD |
| 2 | Deploy raxx-api-prod: push a slug with web dyno; scale formation to 1 | operator | 2026-05-02 | TBD |
| 3 | Resolve getraxx.com error 1014: create a CF Pages project for the marketing site and re-attach the custom domain | operator | 2026-05-02 | TBD |
| 4 | Rotate stale CF tokens in console app env (CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN, CLOUDFLAREROLLED); pull from vault | operator | 2026-05-07 | TBD |
| 5 | Add probe-level test for "public_url behind CF Access returns 302" failure mode in test_site_probes.py | sre-agent | 2026-05-07 | TBD |
| 6 | Add registry validation: on startup, warn if a cloudflare_pages surface references a non-existent project | sre-agent | 2026-05-14 | TBD |
| 7 | Add Heroku dyno-count check to probe_heroku_app: return DEGRADED (not DOWN) if formation exists but 0 dynos running, with explicit error dyno_count_zero | sre-agent | 2026-05-14 | TBD |
- docs/ops/runbooks/status-d1.md (updated as part of this incident)
- docs/architecture/adr/0030-status-state-machine.md