Raxx · internal docs

internal · gated

Console Status Poller Runbook

System: console dashboard — background surface poller (console/app/services/status_poller.py + site_probes.py) Owner: operator / sre-agent Last incident: 2026-05-29 — Velvet prod health_http_404 for 463h; auto-ticket workflow added (see failure mode I below) Last reviewed: 2026-05-29


How to tell it's broken


How to diagnose (in order)

  1. Check the /health endpoint — expected: {"status": "ok", "poller_last_poll_at": "<recent ISO>"}. If poller_last_poll_at is > 120s old, the poller thread is stalled.

  2. Check Heroku logsheroku logs --tail --app raxx-console-prod | grep -E "poll_once|probe_|StatusPoller". Expected: poll_once entries every ~30s. Look for probe_crashed or thread_crashed errors.

  3. Check the surface registryconsole/app/services/status_poller.py:SURFACE_REGISTRY_BY_ENV. Every entry must have a valid provider and matching kwargs. CF Pages entries need an existing project in the account. Heroku entries need an app that has a deployed web dyno.

  4. Check probe credentials — run heroku config --app raxx-console-prod | grep -E "HEROKU_API_KEY|CLOUDFLARE_PAGES_READ_TOKEN|CLOUDFLARE_ACCOUNT_ID|CF_ACCESS_SERVICE_TOKEN". If any are missing or blank, the corresponding probes will degrade with skipped: error messages.

  5. Test a specific probe manually — see "Manual probe commands" below.


Known failure modes

Failure mode A-0: Sentry probe_false_positive warning fires (#1702)

Symptom: Sentry receives a warning-level event tagged event_type=probe_false_positive for a Heroku surface.

Meaning: The surface has been marked DEGRADED by the probe for >30 min, but a direct HTTP GET to its .herokuapp.com/health URL returned 200. The probe is likely reporting a false positive — the surface is actually serving traffic.

Common causes: - Heroku config-var-only release set slug=null on the latest release and the FLAG_CONSOLE_HEROKU_DYNO_MONITOR fix (#1700) was not effective (e.g. flag was off). - A transient Heroku Platform API failure caused a false zero_dynos or no_slug_ever_deployed error. - The probe's error threshold logic changed and produces a false degraded state.

Triage: 1. Check the Sentry event tags: surface_id tells you which surface, probe_error tells you what the probe reported. 2. Check last_probe_results in the Sentry event extras — look for the pattern of errors over the last 5 polls. 3. Manually verify the surface: curl -sS https://<app>-<hash>.herokuapp.com/health → expect 200. 4. If the surface is genuinely healthy, identify which probe check produced the false positive and open a fix issue.

Resolution: If the surface is healthy, the warning itself is informational — no customer-visible outage occurred. The underlying probe false-positive should be investigated and fixed.


Failure mode A: All CF-Access-gated Heroku surfaces show DEGRADED

Symptom: api-prod, console-prod, api-staging, console-staging all show DEGRADED. Direct Heroku URLs return 200. CF Access redirect present in probe response.

Cause: probe_heroku_app was (before 2026-04-30 fix) using public_url (CF Access-gated custom domain) for the health check. After the fix, it uses the direct .herokuapp.com URL. If this re-appears, the fix was reverted or a new surface with public_url was added to the registry expecting CF Access bypass.

Fix (post-2026-04-30): Verify site_probes.py probe_heroku_app does NOT use public_url for the health check call. The health check URL should be app_data["web_url"] (direct Heroku URL).

Verification: curl -sS https://<app>-<hash>.herokuapp.com/health → expect 200.


Failure mode B: Console self-probe DEGRADED (skipped: CF_ACCESS_SERVICE_TOKEN_CONSOLE unset)

Symptom: console-prod tile shows DEGRADED with error skipped: CF_ACCESS_SERVICE_TOKEN_CONSOLE unset.

Cause: The CF_ACCESS_SERVICE_TOKEN_CONSOLE env var is not set on the console Heroku app.

Fix:

# Format: client_id:client_secret
heroku config:set CF_ACCESS_SERVICE_TOKEN_CONSOLE="<client_id>:<client_secret>" --app raxx-console-prod

Pull values from Infisical: /MooseQuest/cloudflare/CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET.

Prerequisite: The raxx-agent-vault-access service token must be enrolled in a non_identity policy on the console.raxx.app CF Access application. If not, the token will be rejected (302 to CF Access login). See action item #1 in docs/ops/incidents/2026-04-30-console-degraded-surfaces.md.

Verification: curl -sS -H "CF-Access-Client-Id: <id>" -H "CF-Access-Client-Secret: <secret>" https://console.raxx.app/health → expect 200.


Failure mode C: CF Pages surface DEGRADED (skipped: CLOUDFLARE_PAGES_READ_TOKEN unset)

Symptom: getraxx, raxx-mockups, or raxx-app-previews tiles show DEGRADED with skipped: CLOUDFLARE_PAGES_READ_TOKEN unset.

Cause: CLOUDFLARE_PAGES_READ_TOKEN or CLOUDFLARE_ACCOUNT_ID env var is missing on the console app.

Fix:

PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
heroku config:set CLOUDFLARE_PAGES_READ_TOKEN="$PAGES_TOKEN" CLOUDFLARE_ACCOUNT_ID="$ACCOUNT_ID" --app raxx-console-prod

Verification: heroku config --app raxx-console-prod | grep CLOUDFLARE_PAGES_READ_TOKEN → non-empty value.


Failure mode D: CF Pages surface DEGRADED (deploy_* error)

Symptom: A CF Pages tile shows DEGRADED with error deploy_failure or deploy_building.

Cause: The latest CF Pages deployment for that project failed or is still building.

Fix: 1. Go to Cloudflare dashboard → Pages → <project> → check the latest deployment status and logs. 2. If the deployment failed, investigate the build logs. Re-trigger a deployment from the Cloudflare dashboard or by pushing a commit. 3. If deploy_building persists > 10 min, cancel and retrigger the deployment.

Verification: curl -sS "https://api.cloudflare.com/client/v4/accounts/<id>/pages/projects/<project>/deployments" -H "Authorization: Bearer <CLOUDFLARE_PAGES_READ_TOKEN>"latest_stage.status == "success".


Failure mode E: Heroku surface DEGRADED (health_http_502 or health_no_response)

Symptom: An api-prod or api-staging tile shows DEGRADED with health_http_502 or health_no_response.

Cause: The Heroku app's web dyno is not running (0 dynos) or crashed on startup.

Diagnose:

heroku ps --app raxx-api-prod
heroku logs --tail --app raxx-api-prod

Fix: If 0 dynos: the app has never been deployed or was scaled down.

# Check formation
heroku ps:type --app raxx-api-prod
# If web dyno is missing, deploy the app first:
git push heroku main
# Or scale if formation exists:
heroku ps:scale web=1 --app raxx-api-prod

Verification: curl -sS https://<app>-<hash>.herokuapp.com/health → expect 200.


Failure mode F: Registry entry for non-existent CF Pages project

Symptom: A CF Pages tile shows DEGRADED with http_404 on every poll.

Cause: SURFACE_REGISTRY_BY_ENV contains a cloudflare_pages entry with a project_name that doesn't exist in the CF account.

Diagnose:

PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects" \
  -H "Authorization: Bearer $PAGES_TOKEN" | python3 -m json.tool | grep '"name"'

Compare the output to entries in status_poller.py. If a project_name in the registry is absent from the project list, remove the registry entry (or correct the name) and open a PR.

Verification: The removed surface no longer appears on the dashboard.


Failure mode G: All tiles stale (poller thread stalled)

Symptom: Dashboard shows all tiles with checked_at > 120s old. /health shows poller_last_poll_at stale or null.

Cause: The background poller thread crashed and hit max backoff (300s), or the Heroku dyno restarted and the poller hasn't recovered.

Fix: Trigger a dyno restart (the poller starts automatically on app startup):

heroku restart --app raxx-console-prod

If restarts don't help, check for Python exceptions in logs:

heroku logs --tail --app raxx-console-prod | grep -E "Exception|Traceback|StatusPoller"

Verification: /health shows poller_last_poll_at updated within 60s.


Failure mode H: Vault surface DEGRADED (connection_error or http_*)

Symptom: vault tile shows DEGRADED.

Cause: The Lightsail instance hosting Infisical is unreachable, or Infisical is returning a non-200 on /api/status.

Diagnose:

CF_ID=$(security find-generic-password -s "raxx-cf-access-client-id" -a "claude-bootstrap" -w)
CF_SEC=$(security find-generic-password -s "raxx-cf-access-client-secret" -a "claude-bootstrap" -w)
curl -sS https://vault.raxx.app/api/status \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SEC"

If unreachable: check AWS Lightsail instance state in AWS console. If Infisical is the issue: check Lightsail instance logs. If CF Access is the issue: the raxx-agent-vault-access service token policy on vault.raxx.app may have been removed.

Escalation: If the Lightsail instance is down and cannot be restarted remotely, escalate to operator.


Manual probe commands

# Probe a Heroku app directly (bypasses CF Access)
curl -sS https://<app>-<hash>.herokuapp.com/health

# Probe vault (CF Access gated — needs service token)
CF_ID=$(security find-generic-password -s "raxx-cf-access-client-id" -a "claude-bootstrap" -w)
CF_SEC=$(security find-generic-password -s "raxx-cf-access-client-secret" -a "claude-bootstrap" -w)
curl -sS https://vault.raxx.app/api/status \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SEC"

# Query CF Pages latest deployment
PAGES_TOKEN=$(infisical secrets get CLOUDFLARE_PAGES_READ_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
ACCOUNT_ID=$(infisical secrets get CLOUDFLARE_ACCOUNT_ID --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/pages/projects/<project>/deployments?per_page=1" \
  -H "Authorization: Bearer $PAGES_TOKEN" | python3 -m json.tool

# List CF Access service tokens (check enrollment)
CF_MGMT=$(infisical secrets get CLOUDFLARE_ACCESS_MGMT_TOKEN --path /MooseQuest/cloudflare --env prod --plain)
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/access/service_tokens" \
  -H "Authorization: Bearer $CF_MGMT" | python3 -m json.tool

# Check CF Access application policies
curl -sS "https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/access/apps/<app-id>/policies" \
  -H "Authorization: Bearer $CF_MGMT" | python3 -m json.tool

# Force-poll all surfaces via dashboard API (ops+ role required)
curl -sS -X POST https://console.raxx.app/dashboard/_force_poll \
  -H "CF-Access-Client-Id: $CF_ID" \
  -H "CF-Access-Client-Secret: $CF_SEC" \
  -H "Cookie: session=<your-session>"

Failure mode I: DEGRADED or long-STALE surface with no GH issue

Symptom: Console shows a surface as DEGRADED or STALE (>24h) but no GH issue exists for it.

Cause: Either the auto-file workflow (console-degraded-auto-file.yml) is not running, the flag FLAG_SURFACE_DEGRADED_AUTO_FILE is off, or the workflow itself failed (detection gap).

Diagnose: 1. Check GH Actions for recent runs of "Surface degraded auto-file". 2. If runs are absent: the flag is likely off. Set FLAG_SURFACE_DEGRADED_AUTO_FILE = 1 as a repo variable. 3. If runs are failing: check the workflow logs. Common causes: console unreachable, CF Access token expired, GH_TOKEN_OPS_BOT expired. 4. If runs are green but no issue: the console /api/internal/status/surfaces may not be returning DEGRADED status for this surface. Check the endpoint directly (see manual probe commands below).

Manual trigger:

# Trigger the auto-file workflow manually via GitHub UI or CLI:
gh workflow run console-degraded-auto-file.yml -f dry_run=false

Verification: A GH issue appears with title [degraded] surface=<id> or [stale-checker] surface=<id>.


Emergency stop

To stop the background poller without taking the console app fully offline:

heroku config:set TESTING=true --app raxx-console-prod
heroku restart --app raxx-console-prod

Setting TESTING=true prevents start_background_poller from starting. Dashboard will show stale/empty tiles but the rest of the console remains functional.

To re-enable:

heroku config:unset TESTING --app raxx-console-prod
heroku restart --app raxx-console-prod

Escalation

Escalate to operator when: - Poller thread crashes with Python exception that isn't probe-related (e.g. DB corruption, Flask context error) - CF Access service token raxx-agent-vault-access is revoked or expires - A surface shows DEGRADED after probe config fix and direct-URL probe is also returning non-200 - raxx-api-prod is DOWN (0 dynos) — requires operator to deploy a slug

CF Access policy changes (adding service token to an application policy) require Cloudflare dashboard access — operator-only action.

Who to contact: Kristerpher (operator)