System: console dashboard — background surface poller (console/app/services/status_poller.py + site_probes.py)
Owner: operator / sre-agent
Last incident: 2026-04-30 (see docs/incidents/2026-04-30-console-degraded-surfaces.md)
Last reviewed: 2026-04-30
Quick signals that the poller is unhealthy:
- GET /dashboard/_force_poll returns {"polled": 0} or takes > 10s
- poll_once errors every 30s in: heroku logs --tail --app raxx-console-prod | grep poll
- GET /api/_internal/poller-status returns thread_alive: false for one or more surfaces
Check the /health endpoint — expected: {"status": "ok", "poller_last_poll_at": "<recent ISO>"}. If poller_last_poll_at is > 120s old, the poller thread is stalled.
Check Heroku logs — heroku logs --tail --app raxx-console-prod | grep -E "poll_once|probe_|StatusPoller". Expected: poll_once entries every ~30s. Look for probe_crashed or thread_crashed errors.
Check the surface registry — console/app/services/status_poller.py:SURFACE_REGISTRY_BY_ENV. Every entry must have a valid provider and matching kwargs. CF Pages entries need an existing project in the account. Heroku entries need an app that has a deployed web dyno.
Check probe credentials — run heroku config --app raxx-console-prod | grep -E "HEROKU_API_KEY|CLOUDFLARE_PAGES_READ_TOKEN|CLOUDFLARE_ACCOUNT_ID|CF_ACCESS_SERVICE_TOKEN". If any are missing or blank, the corresponding probes will degrade with skipped: error messages.
Test a specific probe manually — see "Manual probe commands" below.
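The /health staleness check above can be scripted. A minimal sketch, assuming the /health response shape shown above and the 120s threshold from this runbook (the helper name is illustrative, not part of the service):

```python
from datetime import datetime, timezone
from typing import Optional

STALE_AFTER_S = 120  # staleness threshold used in the check above

def poller_is_stalled(health: dict, now: Optional[datetime] = None) -> bool:
    """Return True if poller_last_poll_at is missing, null, or older than 120s."""
    ts = health.get("poller_last_poll_at")
    if not ts:
        return True
    last = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - last).total_seconds() > STALE_AFTER_S

# Example: a poll 5 minutes ago is stalled; one 30s ago is not.
now = datetime(2026, 4, 30, 12, 5, 0, tzinfo=timezone.utc)
print(poller_is_stalled({"poller_last_poll_at": "2026-04-30T12:00:00+00:00"}, now))  # True
print(poller_is_stalled({"poller_last_poll_at": "2026-04-30T12:04:30+00:00"}, now))  # False
```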
Symptom: api-prod, console-prod, api-staging, console-staging all show DEGRADED. Direct Heroku URLs return 200. CF Access redirect present in probe response.
Cause: probe_heroku_app was (before 2026-04-30 fix) using public_url (CF Access-gated custom domain) for the health check. After the fix, it uses the direct .herokuapp.com URL. If this re-appears, the fix was reverted or a new surface with public_url was added to the registry expecting CF Access bypass.
Fix (post-2026-04-30): Verify that probe_heroku_app in site_probes.py does NOT use public_url for the health check call. The health check URL should be built from app_data["web_url"] (the direct Heroku URL).
Verification: curl -sS https://<app>-<hash>.herokuapp.com/health → expect 200.
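The post-fix URL choice can be sketched in a few lines. This is illustrative, not the actual probe_heroku_app signature; web_url is the *.herokuapp.com URL from the Heroku Platform API app payload, and public_url stands in for the CF Access-gated custom domain:

```python
def health_check_url(app_data: dict) -> str:
    """Pick the direct Heroku URL for the health probe.

    Using public_url (the CF Access-gated custom domain) here would make
    every probe come back as a 302 to the CF Access login — the exact
    failure mode described above.
    """
    return app_data["web_url"].rstrip("/") + "/health"

print(health_check_url({
    "web_url": "https://raxx-api-prod-abc123.herokuapp.com/",
    "public_url": "https://api.raxx.app/",   # must NOT be used for the probe
}))
# https://raxx-api-prod-abc123.herokuapp.com/health
```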
Symptom: console-prod tile shows DEGRADED with error skipped: CF_ACCESS_SERVICE_TOKEN_CONSOLE unset.
Cause: The CF_ACCESS_SERVICE_TOKEN_CONSOLE env var is not set on the console Heroku app.
Fix:
# Format: client_id:client_secret
heroku config:set CF_ACCESS_SERVICE_TOKEN_CONSOLE="<client_id>:<client_secret>" --app raxx-console-prod
Pull values from Infisical: /MooseQuest/cloudflare/CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET.
Prerequisite: The raxx-agent-vault-access service token must be enrolled in a non_identity policy on the console.raxx.app CF Access application. If not, the token will be rejected (302 to CF Access login). See action item #1 in docs/incidents/2026-04-30-console-degraded-surfaces.md.
Verification: curl -sS -H "CF-Access-Client-Id: <id>" -H "CF-Access-Client-Secret: <secret>" https://console.raxx.app/health → expect 200.
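The token value is a single client_id:client_secret string (per the format comment above). A small sketch of splitting it into the two CF Access headers used in the verification curl — the helper name is hypothetical:

```python
def cf_access_headers(token: str) -> dict:
    """Split a 'client_id:client_secret' value into CF Access service-token headers."""
    client_id, _, client_secret = token.partition(":")
    if not client_id or not client_secret:
        raise ValueError("expected 'client_id:client_secret'")
    return {
        "CF-Access-Client-Id": client_id,
        "CF-Access-Client-Secret": client_secret,
    }

print(cf_access_headers("abc.access:s3cret"))
# {'CF-Access-Client-Id': 'abc.access', 'CF-Access-Client-Secret': 's3cret'}
```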
Symptom: getraxx, raxx-mockups, or raxx-app-previews tiles show DEGRADED with skipped: CLOUDFLARE_PAGES_READ_TOKEN unset.
Cause: CLOUDFLARE_PAGES_READ_TOKEN or CLOUDFLARE_ACCOUNT_ID env var is missing on the console app.
Fix:
PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
heroku config:set CLOUDFLARE_PAGES_READ_TOKEN="$PAGES_TOKEN" CLOUDFLARE_ACCOUNT_ID="$ACCOUNT_ID" --app raxx-console-prod
Verification: heroku config --app raxx-console-prod | grep CLOUDFLARE_PAGES_READ_TOKEN → non-empty value.
Symptom: A CF Pages tile shows DEGRADED with error deploy_failure or deploy_building.
Cause: The latest CF Pages deployment for that project failed or is still building.
Fix:
1. Go to Cloudflare dashboard → Pages → <project> → check the latest deployment status and logs.
2. If the deployment failed, investigate the build logs. Re-trigger a deployment from the Cloudflare dashboard or by pushing a commit.
3. If deploy_building persists > 10 min, cancel and retrigger the deployment.
Verification: curl -sS "https://api.cloudflare.com/client/v4/accounts/<id>/pages/projects/<project>/deployments" -H "Authorization: Bearer <CLOUDFLARE_PAGES_READ_TOKEN>" → latest_stage.status == "success".
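The verification call above can also be checked in code. A sketch that classifies the newest deployment from the CF Pages deployments payload — the response shape follows the Cloudflare API call above, but the "active"/"idle" in-progress stage statuses and the fallback string names are assumptions, not read from site_probes.py:

```python
def latest_deploy_status(payload: dict) -> str:
    """Map the newest CF Pages deployment to an ok/deploy_* status string."""
    deployments = payload.get("result") or []
    if not deployments:
        return "deploy_missing"          # no deployments at all (assumed name)
    stage = deployments[0].get("latest_stage") or {}
    status = stage.get("status")
    if status == "success":
        return "ok"
    if status in ("active", "idle"):     # build still running or queued
        return "deploy_building"
    return "deploy_failure"              # failure, canceled, etc.

print(latest_deploy_status({"result": [{"latest_stage": {"status": "success"}}]}))  # ok
print(latest_deploy_status({"result": [{"latest_stage": {"status": "failure"}}]}))  # deploy_failure
```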
Symptom: An api-prod or api-staging tile shows DEGRADED with health_http_502 or health_no_response.
Cause: The Heroku app's web dyno is not running (0 dynos) or crashed on startup.
Diagnose:
heroku ps --app raxx-api-prod
heroku logs --tail --app raxx-api-prod
Fix: If there are 0 dynos, the app has never been deployed or was scaled down.
# Check formation
heroku ps:type --app raxx-api-prod
# If web dyno is missing, deploy the app first:
git push heroku main
# Or scale if formation exists:
heroku ps:scale web=1 --app raxx-api-prod
Verification: curl -sS https://<app>-<hash>.herokuapp.com/health → expect 200.
Symptom: A CF Pages tile shows DEGRADED with http_404 on every poll.
Cause: SURFACE_REGISTRY_BY_ENV contains a cloudflare_pages entry with a project_name that doesn't exist in the CF account.
Diagnose:
PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects" \
-H "Authorization: Bearer $PAGES_TOKEN" | python3 -m json.tool | grep '"name"'
Compare the output to entries in status_poller.py. If a project_name in the registry is absent from the project list, remove the registry entry (or correct the name) and open a PR.
Verification: The removed surface no longer appears on the dashboard.
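The registry-vs-account comparison in the last step can be done in a few lines. A sketch that assumes registry entries carry a provider and a project_name, consistent with the SURFACE_REGISTRY_BY_ENV description above (the exact entry shape may differ):

```python
def missing_pages_projects(registry: list, cf_project_names: set) -> list:
    """Registry project_names with no matching CF Pages project (http_404 candidates)."""
    return sorted(
        entry["project_name"]
        for entry in registry
        if entry.get("provider") == "cloudflare_pages"
        and entry["project_name"] not in cf_project_names
    )

registry = [
    {"provider": "cloudflare_pages", "project_name": "getraxx"},
    {"provider": "cloudflare_pages", "project_name": "old-renamed-site"},  # stale entry
    {"provider": "heroku", "project_name": "raxx-api-prod"},               # not a Pages surface
]
print(missing_pages_projects(registry, {"getraxx", "raxx-mockups"}))
# ['old-renamed-site']
```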
Symptom: Dashboard shows all tiles with checked_at > 120s old. /health shows poller_last_poll_at stale or null.
Cause: The background poller thread crashed and hit max backoff (300s), or the Heroku dyno restarted and the poller hasn't recovered.
Fix: Trigger a dyno restart (the poller starts automatically on app startup):
heroku restart --app raxx-console-prod
If restarts don't help, check for Python exceptions in logs:
heroku logs --tail --app raxx-console-prod | grep -E "Exception|Traceback|StatusPoller"
Verification: /health shows poller_last_poll_at updated within 60s.
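The 300s figure above is the poller's max backoff. A sketch of the capped doubling-backoff pattern it implies — the base interval comes from the 30s cadence in this runbook, but the doubling factor is an assumption, not read from status_poller.py:

```python
BASE_INTERVAL_S = 30   # normal poll cadence from this runbook
MAX_BACKOFF_S = 300    # max backoff mentioned above

def next_backoff(consecutive_failures: int) -> int:
    """Doubling backoff from the 30s cadence, capped at 300s."""
    return min(BASE_INTERVAL_S * (2 ** consecutive_failures), MAX_BACKOFF_S)

# After ~4 consecutive failures the poller sits at the 300s ceiling,
# which is why stale tiles can take up to 5 minutes to recover on their own.
print([next_backoff(n) for n in range(6)])  # [30, 60, 120, 240, 300, 300]
```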
Symptom: vault tile shows DEGRADED with connection_error or an http_* error.
Cause: The Lightsail instance hosting Infisical is unreachable, or Infisical is returning a non-200 on /api/status.
Diagnose:
CF_ID=$(security find-generic-password -s "raxx-cf-access-client-id" -a "claude-bootstrap" -w)
CF_SEC=$(security find-generic-password -s "raxx-cf-access-client-secret" -a "claude-bootstrap" -w)
curl -sS https://vault.raxx.app/api/status \
-H "CF-Access-Client-Id: $CF_ID" \
-H "CF-Access-Client-Secret: $CF_SEC"
If unreachable: check AWS Lightsail instance state in AWS console. If Infisical is the issue: check Lightsail instance logs. If CF Access is the issue: the raxx-agent-vault-access service token policy on vault.raxx.app may have been removed.
Escalation: If the Lightsail instance is down and cannot be restarted remotely, escalate to operator.
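The three failure classes above can be told apart from the probe response alone. A sketch of the mapping, inferred from the symptoms in this section (the function name and the cf_access_rejected label are illustrative, not the probe's actual error strings):

```python
from typing import Optional

def classify_vault_probe(status: Optional[int], location: str = "") -> str:
    """Rough mapping from a /api/status probe result to a likely cause."""
    if status is None:
        return "connection_error"      # Lightsail instance unreachable
    if status == 302 and "cloudflareaccess.com" in location:
        return "cf_access_rejected"    # service token not enrolled or revoked
    if status == 200:
        return "ok"
    return f"http_{status}"            # Infisical itself returned a non-200

print(classify_vault_probe(None))                                               # connection_error
print(classify_vault_probe(302, "https://example.cloudflareaccess.com/login"))  # cf_access_rejected
print(classify_vault_probe(503))                                                # http_503
```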
Manual probe commands
# Probe a Heroku app directly (bypasses CF Access)
curl -sS https://<app>-<hash>.herokuapp.com/health
# Probe vault (CF Access gated — needs service token)
CF_ID=$(security find-generic-password -s "raxx-cf-access-client-id" -a "claude-bootstrap" -w)
CF_SEC=$(security find-generic-password -s "raxx-cf-access-client-secret" -a "claude-bootstrap" -w)
curl -sS https://vault.raxx.app/api/status \
-H "CF-Access-Client-Id: $CF_ID" \
-H "CF-Access-Client-Secret: $CF_SEC"
# Query CF Pages latest deployment
PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/<project>/deployments?per_page=1" \
-H "Authorization: Bearer $PAGES_TOKEN" | python3 -m json.tool
# List CF Access service tokens (check enrollment)
CF_MGMT=$(infisical secrets get CLOUDFLARE_ACCESS_MGMT_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/access/service_tokens" \
-H "Authorization: Bearer $CF_MGMT" | python3 -m json.tool
# Check CF Access application policies
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/access/apps/<app-id>/policies" \
-H "Authorization: Bearer $CF_MGMT" | python3 -m json.tool
# Force-poll all surfaces via dashboard API (ops+ role required)
curl -sS -X POST https://console.raxx.app/dashboard/_force_poll \
-H "CF-Access-Client-Id: $CF_ID" \
-H "CF-Access-Client-Secret: $CF_SEC" \
-H "Cookie: session=<your-session>"
To stop the background poller without taking the console app fully offline:
heroku config:set TESTING=true --app raxx-console-prod
heroku restart --app raxx-console-prod
Setting TESTING=true prevents start_background_poller from starting. The dashboard will show stale/empty tiles, but the rest of the console remains functional.
To re-enable:
heroku config:unset TESTING --app raxx-console-prod
heroku restart --app raxx-console-prod
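A minimal sketch of the TESTING gate described above — the real start_background_poller wiring in status_poller.py may differ, and the set of accepted truthy values here is an assumption:

```python
import os

def should_start_poller(env: dict = os.environ) -> bool:
    """Skip the background poller when TESTING is truthy, per the kill switch above."""
    return env.get("TESTING", "").lower() not in ("true", "1", "yes")

print(should_start_poller({"TESTING": "true"}))  # False — poller stays off
print(should_start_poller({}))                   # True  — poller starts on boot
```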
Escalate to operator when:
- Poller thread crashes with Python exception that isn't probe-related (e.g. DB corruption, Flask context error)
- CF Access service token raxx-agent-vault-access is revoked or expires
- A surface shows DEGRADED after probe config fix and direct-URL probe is also returning non-200
- raxx-api-prod is DOWN (0 dynos) — requires operator to deploy a slug
CF Access policy changes (adding service token to an application policy) require Cloudflare dashboard access — operator-only action.
Who to contact: Kristerpher (operator)