CI hygiene runbook

System: GitHub Actions (.github/workflows/) Owner: sre-agent Last incident: 2026-05-13 (ci-hygiene-pass — YAML parse error, 429 retry, deploy modal stall, nightly scan PR failure) Last reviewed: 2026-05-13

How to tell it's broken

ci.yml jobs show "0m 0s / 0 jobs" in the run list (YAML parse error — no jobs were parsed)
Branch protection is supposedly enforcing a CI check but every PR shows it as pending/unknown
Deploy modal in the console sticks at QUEUED indefinitely after a workflow failure
Nightly security scan is creating a branch but no PR (gh pr create failing)
A lint workflow's own unit tests fail with "No module named pytest"

How to diagnose (in order)

Check the failing workflow's raw YAML: gh run list --workflow=<name>.yml --limit=1 then inspect the most recent failure
Validate YAML locally: python3 -c "import yaml; yaml.safe_load(open('.github/workflows/<name>.yml'))" — a YAML error will print the line number
For branch protection: gh api repos/raxx-app/TradeMasterAPI/branches/main/protection — "Branch not protected" (404) means no required checks are enforced
For deploy modal stall: check whether the Notify — building step in the failing deploy job ran. If it didn't, the modal never left QUEUED. Look for a needs.smoke.result == 'failure' in the run.
For nightly scan PR failures: gh run view <run-id> --log-failed | grep -E "pull request create|permitted|rejected" — "GitHub Actions is not permitted to create or approve pull requests" means the bot token mint fell back to GITHUB_TOKEN.

Known failure modes

Failure mode A: YAML parse error — colon in `run:` scalar

Symptom: Workflow shows 0 jobs, 0 seconds run time. GitHub UI shows a YAML error. gh run list shows failure at "Set up job" or no steps at all.

Cause: A bare run: scalar containing a Python print("thing: value") or any key: value pattern is legal Python but can confuse some YAML parsers. GitHub Actions uses a strict YAML parser. The colon after a word at the start of a value is interpreted as a YAML mapping key.

Fix: Remove the colon from any string in a run: block. Use print("boot OK") instead of print("boot: OK"). If you need a colon (e.g. URL in a print statement), wrap the entire run: value in a YAML block scalar (run: |) and single-quote the Python string, or use a variable.

Verification: python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))" should print nothing (no error).

Failure mode B: Two workflows with the same `name:` value

Symptom: Branch protection cannot distinguish between ci.yml and ci-pr.yml because both are named "CI". Required status checks in branch protection reference workflow names; if two workflows share a name, GitHub cannot route the check to the correct one.

Cause: Copy-paste during workflow creation.

Fix: Every workflow must have a unique name: value. Convention: - ci.yml → name: CI — main - ci-pr.yml → name: PR Gates - ci-console.yml → name: CI — console - deploy-heroku.yml → name: Deploy to Heroku - deploy-console.yml → name: Deploy console - etc.

Verification: grep '^name:' .github/workflows/*.yml | sort -k2 | uniq -d -f1 should return no output (no duplicate names).

Symptom: The console deploy modal shows QUEUED indefinitely after a workflow run that failed. No FAILED or ERROR state is surfaced.

Cause: The Notify — building / Notify — failed steps inside the deploy job are not reachable when the workflow fails before the deploy job starts (e.g. smoke gate failure, runner quota failure). The notify job runs if: always() but historically didn't post a failed callback to the deploy tracking endpoint.

Fix (post-2026-05-13): The notify job in deploy-heroku.yml now includes a Notify — failed (smoke gate or deploy skipped) step that fires when needs.deploy.result is skipped or cancelled, or when needs.smoke.result is not success. This covers the gap.

If the modal is still stuck after this fix, check: 1. Is console_deploy_id being passed? The action is a no-op when the deploy ID is empty. 2. Is DEPLOY_CALLBACK_HMAC_SECRET set in the environment's secrets? A missing HMAC secret causes the curl to be sent with an invalid signature, which the console rejects silently (the action uses || true). 3. Is CONSOLE_INTERNAL_URL (or the hardcoded fallback) reachable from the runner?

Verification: After a forced smoke gate failure (e.g. manually break run_smoke.sh), confirm the modal transitions from QUEUED to FAILED within 60 seconds.

Failure mode D: Nightly security scan creates a branch but no PR

Symptom: security/scan-YYYY-MM-DD branch exists in the repo. No PR was created. gh run view <run-id> --log-failed shows: pull request create failed: GraphQL: GitHub Actions is not permitted to create or approve pull requests (createPullRequest).

Cause: The Mint raxx-ops-bot token step in nightly-security-scan.yml fell back to GITHUB_TOKEN when the bot secrets are missing or expired. GITHUB_TOKEN cannot create PRs due to the org-level "Actions cannot create PRs" toggle. The git push succeeds (using the checkout credential helper, which runs with a different auth context), but gh pr create uses GH_TOKEN which is the fallback GITHUB_TOKEN.

Fix (post-2026-05-13): The workflow now checks BOT_IDENTITY == "raxx-ops-bot" before proceeding to git push and gh pr create. If the mint fell back, the step fails loudly with a diagnostic message pointing to the missing secrets.

Operator action if mint fails: Verify these secrets are set on the repo: RAXX_OPS_BOT_APP_ID, RAXX_OPS_BOT_PRIVATE_KEY, RAXX_OPS_BOT_INSTALL_ID. See docs/ops/runbooks/agent-bot-tokens-setup.md for provisioning.

Verification: gh run watch the next scheduled run (08:07 UTC). Confirm bot_identity=raxx-ops-bot in the mint step output. Confirm a PR is created with type:security + needs-grooming labels.

Failure mode E: Lint workflow's own unit tests fail ("No module named pytest")

Symptom: lint-cf-tokens.yml → "Run lint unit tests" step fails with ModuleNotFoundError: No module named pytest. The actual lint step passes.

Cause: The workflow installs the linter's dependencies but not pytest. setup-python installs the interpreter; it does not install test runners.

Fix (post-2026-05-13): A pip install pytest step was added before the meta-test step in lint-cf-tokens.yml.

Verification: gh workflow run lint-cf-tokens.yml from a branch touching .github/workflows/. Confirm both "Run CF token lint" and "Run lint unit tests" pass.

Failure mode F: CF Access headers missing on Raptor cron workflow

Symptom: A cron or manual workflow that calls api.raxx.app returns HTTP 403 with body {"error":"direct_origin_blocked"}. The lint gate lint-cf-access-headers fails on the PR that introduced or modified the workflow.

Cause: The workflow references RAPTOR_INTERNAL_API_URL but did not declare CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET in its env: block, OR did not pass -H "CF-Access-Client-Id: ..." and -H "CF-Access-Client-Secret: ..." to the curl call. Cloudflare Access sits in front of Raptor's origin; any request that bypasses the CF proxy or lacks valid service-token headers gets a 403 before reaching the Heroku dyno.

RCA: docs/incidents/2026-05-23-ci-gitleaks-dark-crons-2703-2717.md — both billing-retention-cron.yml and trace-integrity-cron.yml shipped without these headers and failed silently until #2724.

Fix: Add the following to the job's env: block and to every curl call in the job:

env:
  RAPTOR_INTERNAL_API_URL: ${{ secrets.RAPTOR_INTERNAL_API_URL }}
  CF_ACCESS_CLIENT_ID:     ${{ secrets.CF_ACCESS_CLIENT_ID }}
  CF_ACCESS_CLIENT_SECRET: ${{ secrets.CF_ACCESS_CLIENT_SECRET }}

curl -X POST \
  -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
  -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}" \
  -H "User-Agent: raxx-<workflow-name>/1.0" \
  "${RAPTOR_INTERNAL_API_URL}/api/internal/jobs/<job>"

Also: if the workflow uses inline Python urllib.request to call a CF-gated origin, add "User-Agent": "raxx-<workflow-name>/1.0" to the headers dict — the default Python-urllib/3.x UA trips CF error 1010 on CF-gated origins.

The RAPTOR_INTERNAL_API_URL secret value MUST be https://api.raxx.app, NOT the .herokuapp.com origin URL. The Heroku direct-origin URL returns 403 direct_origin_blocked for requests that bypass the Cloudflare proxy.

Verification: Locally run python scripts/ci/lint_cf_access_headers.py --workflow-dir .github/workflows. It should exit 0 with PASSED. Then trigger a manual workflow_dispatch run of the affected workflow and confirm HTTP 200 from the endpoint.

Failure mode G: Heroku API 429 (rate limit) on deploy

Symptom: Workflow run "Deploy to Heroku #NNN" fails. Heroku CLI log shows HTTP 429 Too Many Requests. The runner was briefly queued or multiple deploys fired in rapid succession.

Cause: Heroku's API rate-limits requests from the same token. The git push heroku call internally hits the Heroku API to register the build. Under burst conditions (multiple deploys queued) the first caller to the API gets 429.

Fix (post-2026-05-13): deploy-heroku.yml now retries the git push up to 5 times with quadratic backoff (10s, 40s, 90s, 160s). The concurrency group heroku-deploy-staging prevents queued-up staging deploys from pile-driving the same API key.

Verification: After the fix, confirm gh run list --workflow=deploy-heroku.yml shows no 429-related failures. For a manual test: trigger two rapid staging deploys and confirm the second one either waits (concurrency queue) or retries successfully.

Naming convention for workflows

Rules: 1. name: must be unique across all .github/workflows/*.yml files. No two workflows may share the same name. 2. Workflow file names use kebab-case: deploy-heroku.yml, ci-pr.yml. 3. Scheduled workflows that produce Slack digest output should be named with a cron suffix: ci-digest-cron.yml, billing-collector-cron.yml. 4. Security workflows prefix with security- or nightly-: nightly-security-scan.yml, security-zap.yml. 5. Deploy workflows prefix with deploy-: deploy-heroku.yml, deploy-console.yml, deploy-velvet.yml.

Check for name collisions: grep '^name:' .github/workflows/*.yml | sed 's/.*: //' | sort | uniq -d — output should be empty.

Path filters — when to add them

Add paths: to a push: trigger when: - The workflow only makes sense for a specific subdirectory (e.g. deploy-console.yml only deploys when console/** changes) - The workflow does not touch security controls (gitleaks MUST run on every push)

Do NOT add paths: to: - security-secrets (gitleaks) — secrets can appear in any file - ci.yml push-to-main triggers — the push-to-main run is the final gate; path-skipping it creates a gap

Asset-manifest checksum guard

The asset-manifest guard (asset-manifest-check job in ci-pr.yml) fails when a PR touches static assets (console/app/static/**, frontend/trademaster_ui/public/**, frontend/status-page/public/**) and the docs/asset-manifest.json manifest is stale.

Regen procedure:

bash scripts/ci/regen-asset-manifest.sh
git add docs/asset-manifest.json
git commit -m "chore(assets): regen asset manifest"

Why this exists: The favicon revert incident (2026-04) silently overwrote a merged commit during an agent rebase. The manifest guard catches that class of drift at PR time.

Deploy callback URL pattern

Deploy workflows send lifecycle callbacks to the console's internal deploy tracking endpoint:

POST <CONSOLE_INTERNAL_URL>/api/internal/deploys/<console_deploy_id>/status
Body: {"status":"building|deploying|succeeded|failed","log_line":"...","failure_reason":null|"...","run_id":"...","run_url":"..."}
Header: X-Raxx-Deploy-Signature: sha256=<hmac>

The action that wraps this: .github/actions/notify-deploy-status/action.yml

When adding a new deploy workflow, include Notify — building, Notify — succeeded (if: success()), and Notify — failed (if: failure()) steps. Also add an outer Notify — failed step in the always-running summary job to cover the case where the deploy job is skipped.

Adding actionlint to PR CI

actionlint (https://github.com/rhysd/actionlint) catches most YAML/expression errors before they reach a runner. It is a single Go binary with no runtime dependencies.

To add to ci-pr.yml:

- name: Install actionlint
  run: |
    ACTIONLINT_VER="1.7.7"
    curl -sSL "https://github.com/rhysd/actionlint/releases/download/v${ACTIONLINT_VER}/actionlint_${ACTIONLINT_VER}_linux_amd64.tar.gz" \
      | tar -xz -C /usr/local/bin actionlint
- name: Lint workflow files
  run: actionlint

Add this as a job that runs on every PR touching .github/workflows/**. This would have caught the print("boot: OK") YAML parse error that caused 7 days of ci.yml failures starting 2026-05-06.

Status: Proposed in Phase 4 but not yet implemented — operator decision pending on whether to add actionlint as a required check.

Emergency stop

To temporarily disable a broken workflow without deleting it:

# Disable
gh workflow disable <workflow-name>.yml

# Re-enable
gh workflow enable <workflow-name>.yml

Disabling does not delete run history. The workflow can be re-enabled immediately.

Escalation

Wake the operator when: - A security scan workflow (nightly-security-scan, security-zap) fails for 2 consecutive days - Branch protection is not enforcing any required checks (currently: NONE configured — this is an operator decision, see Phase 4 action items) - A deploy workflow is failing AND the Heroku app is returning non-200 on health checks - The deploy modal sticks at QUEUED for >10 minutes on a production deploy with a console_deploy_id

Contact: kris@moosequest.net

References

Incident: docs/ops/incidents/2026-05-13-ci-cleanup-pass.md
Related runbook: docs/ops/runbooks/ci-runner-posture.md
Deploy callback action: .github/actions/notify-deploy-status/
Feedback: memory/feedback_gh_actions_transitive_skip.md, memory/feedback_gh_actions_netrc_broken.md