Raxx · internal docs


Console Status Poller Runbook

System: console dashboard background surface poller (console/app/services/status_poller.py + site_probes.py)
Owner: operator / sre-agent
Last incident: 2026-04-30 (see docs/incidents/2026-04-30-console-degraded-surfaces.md)
Last reviewed: 2026-04-30


How to tell it's broken

- One or more dashboard tiles show DEGRADED (specific error strings are covered under "Known failure modes" below).
- All tiles show checked_at timestamps > 120s old (poller stalled — failure mode G).
- /health shows poller_last_poll_at stale (> 120s old) or null.

How to diagnose (in order)

  1. Check the /health endpoint — expected: {"status": "ok", "poller_last_poll_at": "<recent ISO>"}. If poller_last_poll_at is > 120s old, the poller thread is stalled.

  2. Check Heroku logs — heroku logs --tail --app raxx-console-prod | grep -E "poll_once|probe_|StatusPoller". Expected: poll_once entries every ~30s. Look for probe_crashed or thread_crashed errors.

  3. Check the surface registry — console/app/services/status_poller.py:SURFACE_REGISTRY_BY_ENV. Every entry must have a valid provider and matching kwargs. CF Pages entries need an existing project in the account. Heroku entries need an app that has a deployed web dyno.

  4. Check probe credentials — run heroku config --app raxx-console-prod | grep -E "HEROKU_API_KEY|CLOUDFLARE_PAGES_READ_TOKEN|CLOUDFLARE_ACCOUNT_ID|CF_ACCESS_SERVICE_TOKEN". If any are missing or blank, the corresponding probes will degrade with skipped: error messages.

  5. Test a specific probe manually — see "Manual probe commands" below.


Known failure modes

Failure mode A: All CF-Access-gated Heroku surfaces show DEGRADED

Symptom: api-prod, console-prod, api-staging, console-staging all show DEGRADED. Direct Heroku URLs return 200. CF Access redirect present in probe response.

Cause: before the 2026-04-30 fix, probe_heroku_app used public_url (the CF Access-gated custom domain) for the health check; the fix switched it to the direct .herokuapp.com URL. If this symptom reappears, either the fix was reverted or a new surface was added to the registry whose public_url was wrongly expected to bypass CF Access.

Fix (post-2026-04-30): Verify site_probes.py probe_heroku_app does NOT use public_url for the health check call. The health check URL should be app_data["web_url"] (direct Heroku URL).

Verification: curl -sS https://<app>-<hash>.herokuapp.com/health → expect 200.
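The URL-selection rule from the fix can be expressed in one helper. This is a sketch, not the actual site_probes.py code; the web_url and public_url field names follow this runbook's description of app_data.

```python
# Sketch of the post-2026-04-30 URL choice in probe_heroku_app: the health
# check must hit the direct .herokuapp.com web_url, never the CF Access-gated
# public_url (custom domain), or every gated surface flips to DEGRADED.
def health_check_url(app_data: dict) -> str:
    """Pick the direct Heroku URL for the /health call."""
    # web_url is the https://<app>-<hash>.herokuapp.com/ base URL from the
    # Heroku Platform API; public_url is intentionally ignored here.
    return app_data["web_url"].rstrip("/") + "/health"
```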


Failure mode B: Console self-probe DEGRADED (skipped: CF_ACCESS_SERVICE_TOKEN_CONSOLE unset)

Symptom: console-prod tile shows DEGRADED with error skipped: CF_ACCESS_SERVICE_TOKEN_CONSOLE unset.

Cause: The CF_ACCESS_SERVICE_TOKEN_CONSOLE env var is not set on the console Heroku app.

Fix:

# Format: client_id:client_secret
heroku config:set CF_ACCESS_SERVICE_TOKEN_CONSOLE="<client_id>:<client_secret>" --app raxx-console-prod

Pull values from Infisical: /MooseQuest/cloudflare/CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET.

Prerequisite: The raxx-agent-vault-access service token must be enrolled in a non_identity policy on the console.raxx.app CF Access application. If not, the token will be rejected (302 to CF Access login). See action item #1 in docs/incidents/2026-04-30-console-degraded-surfaces.md.

Verification: curl -sS -H "CF-Access-Client-Id: <id>" -H "CF-Access-Client-Secret: <secret>" https://console.raxx.app/health → expect 200.
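The enrollment failure described in the prerequisite (302 to CF Access login) can be distinguished from a genuine outage programmatically. A sketch, assuming stdlib urllib only; the classify helper and its return strings are illustrative, not code from site_probes.py.

```python
# Sketch: probe a CF Access-gated URL with a service token and classify the
# result. A 3xx redirect to *.cloudflareaccess.com means the token was
# rejected (not enrolled in a policy), which is the failure mode above.
import urllib.error
import urllib.request

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # surface 3xx as HTTPError instead of following it

def classify(status: int, location: str = "") -> str:
    """Map an HTTP response to a human-readable probe outcome."""
    if status == 200:
        return "ok"
    if status in (301, 302) and "cloudflareaccess.com" in location:
        return "cf_access_rejected"  # token not enrolled in a policy
    return f"http_{status}"

def probe_gated(url: str, client_id: str, client_secret: str) -> str:
    req = urllib.request.Request(url, headers={
        "CF-Access-Client-Id": client_id,
        "CF-Access-Client-Secret": client_secret,
    })
    opener = urllib.request.build_opener(_NoRedirect())
    try:
        with opener.open(req, timeout=10) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as e:
        return classify(e.code, e.headers.get("Location", ""))
    except OSError:
        return "no_response"
```

If probe_gated returns cf_access_rejected, fix the policy enrollment (operator-only, see Escalation) rather than rotating the token.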


Failure mode C: CF Pages surface DEGRADED (skipped: CLOUDFLARE_PAGES_READ_TOKEN unset)

Symptom: getraxx, raxx-mockups, or raxx-app-previews tiles show DEGRADED with skipped: CLOUDFLARE_PAGES_READ_TOKEN unset.

Cause: CLOUDFLARE_PAGES_READ_TOKEN or CLOUDFLARE_ACCOUNT_ID env var is missing on the console app.

Fix:

PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
heroku config:set CLOUDFLARE_PAGES_READ_TOKEN="$PAGES_TOKEN" CLOUDFLARE_ACCOUNT_ID="$ACCOUNT_ID" --app raxx-console-prod

Verification: heroku config --app raxx-console-prod | grep CLOUDFLARE_PAGES_READ_TOKEN → non-empty value.


Failure mode D: CF Pages surface DEGRADED (deploy_* error)

Symptom: A CF Pages tile shows DEGRADED with error deploy_failure or deploy_building.

Cause: The latest CF Pages deployment for that project failed or is still building.

Fix:

  1. Go to Cloudflare dashboard → Pages → <project> → check the latest deployment status and logs.
  2. If the deployment failed, investigate the build logs. Re-trigger a deployment from the Cloudflare dashboard or by pushing a commit.
  3. If deploy_building persists > 10 min, cancel and retrigger the deployment.

Verification: curl -sS "https://api.cloudflare.com/client/v4/accounts/<id>/pages/projects/<project>/deployments" -H "Authorization: Bearer <CLOUDFLARE_PAGES_READ_TOKEN>" → expect latest_stage.status == "success" on the latest deployment.
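The mapping from the CF API response to the dashboard's deploy_* errors can be sketched as below. The latest_stage.status field matches the verification step above; the status values other than "success" and "failure" are assumptions about the CF Pages API, and the helper is illustrative rather than the poller's actual code.

```python
# Sketch: classify a CF Pages deployment record (one element of the
# deployments list from the CF API) into the dashboard's deploy_* errors.
def deploy_status(deployment: dict) -> str:
    """Map latest_stage.status to ok / deploy_failure / deploy_building."""
    status = deployment.get("latest_stage", {}).get("status")
    if status == "success":
        return "ok"
    if status == "failure":
        return "deploy_failure"   # investigate build logs, then retrigger
    if status in ("active", "building"):
        return "deploy_building"  # cancel + retrigger if stuck > 10 min
    return f"deploy_{status}"
```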


Failure mode E: Heroku surface DEGRADED (health_http_502 or health_no_response)

Symptom: An api-prod or api-staging tile shows DEGRADED with health_http_502 or health_no_response.

Cause: The Heroku app's web dyno is not running (0 dynos) or crashed on startup.

Diagnose:

heroku ps --app raxx-api-prod
heroku logs --tail --app raxx-api-prod

Fix: if the app shows 0 dynos, it has never been deployed or was scaled down.

# Check formation
heroku ps:type --app raxx-api-prod
# If web dyno is missing, deploy the app first:
git push heroku main
# Or scale if formation exists:
heroku ps:scale web=1 --app raxx-api-prod

Verification: curl -sS https://<app>-<hash>.herokuapp.com/health → expect 200.
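How a health probe might arrive at the health_http_502 / health_no_response errors named above can be sketched with stdlib urllib. A sketch only, not the actual site_probes.py implementation; the error strings come from this runbook.

```python
# Sketch: hit a Heroku app's /health and classify the outcome the way the
# dashboard errors above are named. 502 from Heroku's router typically means
# no running web dyno; a connection/DNS failure maps to health_no_response.
import urllib.error
import urllib.request

def heroku_health(url: str, timeout: float = 10.0) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else f"health_http_{resp.status}"
    except urllib.error.HTTPError as e:
        return f"health_http_{e.code}"   # e.g. health_http_502: dyno down/crashed
    except (urllib.error.URLError, OSError):
        return "health_no_response"      # DNS failure, timeout, connection refused
```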


Failure mode F: Registry entry for non-existent CF Pages project

Symptom: A CF Pages tile shows DEGRADED with http_404 on every poll.

Cause: SURFACE_REGISTRY_BY_ENV contains a cloudflare_pages entry with a project_name that doesn't exist in the CF account.

Diagnose:

PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects" \
  -H "Authorization: Bearer $PAGES_TOKEN" | python3 -m json.tool | grep '"name"'

Compare the output to entries in status_poller.py. If a project_name in the registry is absent from the project list, remove the registry entry (or correct the name) and open a PR.

Verification: The removed surface no longer appears on the dashboard.
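The diagnose step's comparison can be automated. The registry shape here (a list of dicts with provider and kwargs keys) is an assumption based on this runbook's description of SURFACE_REGISTRY_BY_ENV; adjust the accessors to the real structure in status_poller.py.

```python
# Sketch: flag cloudflare_pages registry entries whose project_name has no
# matching project in the CF account (the http_404-on-every-poll case above).
def stale_pages_entries(registry: list[dict], cf_projects: set[str]) -> list[str]:
    """Return registered project_names absent from the CF Pages project list."""
    return [
        entry["kwargs"]["project_name"]
        for entry in registry
        if entry.get("provider") == "cloudflare_pages"
        and entry["kwargs"]["project_name"] not in cf_projects
    ]
```

Build cf_projects from the "name" fields in the curl output above; any name this returns should be removed or corrected in the registry.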


Failure mode G: All tiles stale (poller thread stalled)

Symptom: Dashboard shows all tiles with checked_at > 120s old. /health shows poller_last_poll_at stale or null.

Cause: The background poller thread crashed and hit max backoff (300s), or the Heroku dyno restarted and the poller hasn't recovered.

Fix: Trigger a dyno restart (the poller starts automatically on app startup):

heroku restart --app raxx-console-prod

If restarts don't help, check for Python exceptions in logs:

heroku logs --tail --app raxx-console-prod | grep -E "Exception|Traceback|StatusPoller"

Verification: /health shows poller_last_poll_at updated within 60s.
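The timing behavior implied by this failure mode (~30s normal interval, exponential backoff capped at 300s after crashes) can be sketched as a loop. Names and structure are illustrative, not the actual StatusPoller code.

```python
# Sketch of the poll loop shape described above: 30s interval when healthy,
# doubling toward a 300s cap after consecutive crashes. This is why a stalled
# poller can take up to 5 minutes between retries.
import time

POLL_INTERVAL = 30   # seconds between normal polls
MAX_BACKOFF = 300    # cap reached after repeated crashes

def next_delay(consecutive_failures: int) -> int:
    """30s when healthy, doubling toward the 300s cap on repeated failures."""
    if consecutive_failures == 0:
        return POLL_INTERVAL
    return min(POLL_INTERVAL * 2 ** consecutive_failures, MAX_BACKOFF)

def run(poll_once, sleep=time.sleep):
    failures = 0
    while True:
        try:
            poll_once()
            failures = 0
        except Exception:
            failures += 1  # a crashed poll backs off instead of killing the thread
        sleep(next_delay(failures))
```

A dyno restart resets the loop, which is why heroku restart is the first-line fix.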


Failure mode H: Vault surface DEGRADED (connection_error or http_*)

Symptom: vault tile shows DEGRADED.

Cause: The Lightsail instance hosting Infisical is unreachable, or Infisical is returning a non-200 on /api/status.

Diagnose:

CF_ID=$(security find-generic-password -s "raxx-cf-access-client-id" -a "claude-bootstrap" -w)
CF_SEC=$(security find-generic-password -s "raxx-cf-access-client-secret" -a "claude-bootstrap" -w)
curl -sS https://vault.raxx.app/api/status \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SEC"

- If unreachable: check the AWS Lightsail instance state in the AWS console.
- If Infisical is the issue: check the Lightsail instance logs.
- If CF Access is the issue: the raxx-agent-vault-access service token policy on vault.raxx.app may have been removed.

Escalation: If the Lightsail instance is down and cannot be restarted remotely, escalate to operator.


Manual probe commands

# Probe a Heroku app directly (bypasses CF Access)
curl -sS https://<app>-<hash>.herokuapp.com/health

# Probe vault (CF Access gated — needs service token)
CF_ID=$(security find-generic-password -s "raxx-cf-access-client-id" -a "claude-bootstrap" -w)
CF_SEC=$(security find-generic-password -s "raxx-cf-access-client-secret" -a "claude-bootstrap" -w)
curl -sS https://vault.raxx.app/api/status \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SEC"

# Query CF Pages latest deployment
PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/<project>/deployments?per_page=1" \
  -H "Authorization: Bearer $PAGES_TOKEN" | python3 -m json.tool

# List CF Access service tokens (check enrollment)
CF_MGMT=$(infisical secrets get CLOUDFLARE_ACCESS_MGMT_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/access/service_tokens" \
  -H "Authorization: Bearer $CF_MGMT" | python3 -m json.tool

# Check CF Access application policies
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/access/apps/<app-id>/policies" \
  -H "Authorization: Bearer $CF_MGMT" | python3 -m json.tool

# Force-poll all surfaces via dashboard API (ops+ role required)
curl -sS -X POST https://console.raxx.app/dashboard/_force_poll \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SEC" \
  -H "Cookie: session=<your-session>"

Emergency stop

To stop the background poller without taking the console app fully offline:

heroku config:set TESTING=true --app raxx-console-prod
heroku restart --app raxx-console-prod

Setting TESTING=true prevents start_background_poller from starting. Dashboard will show stale/empty tiles but the rest of the console remains functional.
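The guard this relies on can be sketched as below. The TESTING env var and its effect come from this runbook; the function body is illustrative, not the actual start_background_poller implementation.

```python
# Sketch: start_background_poller as a no-op under TESTING=true, which is
# what makes the emergency stop above work without taking the console down.
import os
import threading

def start_background_poller(poll_loop) -> "threading.Thread | None":
    """Start the poller thread unless TESTING=true is set."""
    if os.environ.get("TESTING", "").lower() == "true":
        return None  # emergency stop: tiles go stale, console stays functional
    t = threading.Thread(target=poll_loop, daemon=True, name="StatusPoller")
    t.start()
    return t
```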

To re-enable:

heroku config:unset TESTING --app raxx-console-prod
heroku restart --app raxx-console-prod

Escalation

Escalate to operator when:

- Poller thread crashes with a Python exception that isn't probe-related (e.g. DB corruption, Flask context error)
- CF Access service token raxx-agent-vault-access is revoked or expires
- A surface shows DEGRADED after a probe config fix and the direct-URL probe also returns non-200
- raxx-api-prod is DOWN (0 dynos) — requires operator to deploy a slug

CF Access policy changes (adding service token to an application policy) require Cloudflare dashboard access — operator-only action.

Who to contact: Kristerpher (operator)