CI hygiene runbook
System: GitHub Actions (.github/workflows/) Owner: sre-agent Last incident: 2026-05-13 (ci-hygiene-pass — YAML parse error, 429 retry, deploy modal stall, nightly scan PR failure) Last reviewed: 2026-05-13
How to tell it's broken
ci.ymljobs show "0m 0s / 0 jobs" in the run list (YAML parse error — no jobs were parsed)- Branch protection is supposedly enforcing a CI check but every PR shows it as pending/unknown
- Deploy modal in the console sticks at QUEUED indefinitely after a workflow failure
- Nightly security scan is creating a branch but no PR (gh pr create failing)
- A lint workflow's own unit tests fail with "No module named pytest"
How to diagnose (in order)
- Check the failing workflow's raw YAML:
gh run list --workflow=<name>.yml --limit=1then inspect the most recent failure - Validate YAML locally:
python3 -c "import yaml; yaml.safe_load(open('.github/workflows/<name>.yml'))"— a YAML error will print the line number - For branch protection:
gh api repos/raxx-app/TradeMasterAPI/branches/main/protection— "Branch not protected" (404) means no required checks are enforced - For deploy modal stall: check whether the
Notify — buildingstep in the failing deploy job ran. If it didn't, the modal never left QUEUED. Look for aneeds.smoke.result == 'failure'in the run. - For nightly scan PR failures:
gh run view <run-id> --log-failed | grep -E "pull request create|permitted|rejected"— "GitHub Actions is not permitted to create or approve pull requests" means the bot token mint fell back to GITHUB_TOKEN.
Known failure modes
Failure mode A: YAML parse error — colon in run: scalar
Symptom: Workflow shows 0 jobs, 0 seconds run time. GitHub UI shows a YAML error. gh run list shows failure at "Set up job" or no steps at all.
Cause: A bare run: scalar containing a Python print("thing: value") or any key: value pattern is legal Python but can confuse some YAML parsers. GitHub Actions uses a strict YAML parser. The colon after a word at the start of a value is interpreted as a YAML mapping key.
Fix: Remove the colon from any string in a run: block. Use print("boot OK") instead of print("boot: OK"). If you need a colon (e.g. URL in a print statement), wrap the entire run: value in a YAML block scalar (run: |) and single-quote the Python string, or use a variable.
Verification: python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))" should print nothing (no error).
Failure mode B: Two workflows with the same name: value
Symptom: Branch protection cannot distinguish between ci.yml and ci-pr.yml because both are named "CI". Required status checks in branch protection reference workflow names; if two workflows share a name, GitHub cannot route the check to the correct one.
Cause: Copy-paste during workflow creation.
Fix: Every workflow must have a unique name: value. Convention:
- ci.yml → name: CI — main
- ci-pr.yml → name: PR Gates
- ci-console.yml → name: CI — console
- deploy-heroku.yml → name: Deploy to Heroku
- deploy-console.yml → name: Deploy console
- etc.
Verification: grep '^name:' .github/workflows/*.yml | sort -k2 | uniq -d -f1 should return no output (no duplicate names).
Failure mode C: Deploy modal sticks at QUEUED after workflow failure
Symptom: The console deploy modal shows QUEUED indefinitely after a workflow run that failed. No FAILED or ERROR state is surfaced.
Cause: The Notify — building / Notify — failed steps inside the deploy job are not reachable when the workflow fails before the deploy job starts (e.g. smoke gate failure, runner quota failure). The notify job runs if: always() but historically didn't post a failed callback to the deploy tracking endpoint.
Fix (post-2026-05-13): The notify job in deploy-heroku.yml now includes a Notify — failed (smoke gate or deploy skipped) step that fires when needs.deploy.result is skipped or cancelled, or when needs.smoke.result is not success. This covers the gap.
If the modal is still stuck after this fix, check:
1. Is console_deploy_id being passed? The action is a no-op when the deploy ID is empty.
2. Is DEPLOY_CALLBACK_HMAC_SECRET set in the environment's secrets? A missing HMAC secret causes the curl to be sent with an invalid signature, which the console rejects silently (the action uses || true).
3. Is CONSOLE_INTERNAL_URL (or the hardcoded fallback) reachable from the runner?
Verification: After a forced smoke gate failure (e.g. manually break run_smoke.sh), confirm the modal transitions from QUEUED to FAILED within 60 seconds.
Failure mode D: Nightly security scan creates a branch but no PR
Symptom: security/scan-YYYY-MM-DD branch exists in the repo. No PR was created. gh run view <run-id> --log-failed shows: pull request create failed: GraphQL: GitHub Actions is not permitted to create or approve pull requests (createPullRequest).
Cause: The Mint raxx-ops-bot token step in nightly-security-scan.yml fell back to GITHUB_TOKEN when the bot secrets are missing or expired. GITHUB_TOKEN cannot create PRs due to the org-level "Actions cannot create PRs" toggle. The git push succeeds (using the checkout credential helper, which runs with a different auth context), but gh pr create uses GH_TOKEN which is the fallback GITHUB_TOKEN.
Fix (post-2026-05-13): The workflow now checks BOT_IDENTITY == "raxx-ops-bot" before proceeding to git push and gh pr create. If the mint fell back, the step fails loudly with a diagnostic message pointing to the missing secrets.
Operator action if mint fails: Verify these secrets are set on the repo: RAXX_OPS_BOT_APP_ID, RAXX_OPS_BOT_PRIVATE_KEY, RAXX_OPS_BOT_INSTALL_ID. See docs/ops/runbooks/agent-bot-tokens-setup.md for provisioning.
Verification: gh run watch the next scheduled run (08:07 UTC). Confirm bot_identity=raxx-ops-bot in the mint step output. Confirm a PR is created with type:security + needs-grooming labels.
Failure mode E: Lint workflow's own unit tests fail ("No module named pytest")
Symptom: lint-cf-tokens.yml → "Run lint unit tests" step fails with ModuleNotFoundError: No module named pytest. The actual lint step passes.
Cause: The workflow installs the linter's dependencies but not pytest. setup-python installs the interpreter; it does not install test runners.
Fix (post-2026-05-13): A pip install pytest step was added before the meta-test step in lint-cf-tokens.yml.
Verification: gh workflow run lint-cf-tokens.yml from a branch touching .github/workflows/. Confirm both "Run CF token lint" and "Run lint unit tests" pass.
Failure mode F: CF Access headers missing on Raptor cron workflow
Symptom: A cron or manual workflow that calls api.raxx.app returns HTTP 403
with body {"error":"direct_origin_blocked"}. The lint gate lint-cf-access-headers
fails on the PR that introduced or modified the workflow.
Cause: The workflow references RAPTOR_INTERNAL_API_URL but did not declare
CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET in its env: block, OR did not
pass -H "CF-Access-Client-Id: ..." and -H "CF-Access-Client-Secret: ..." to the
curl call. Cloudflare Access sits in front of Raptor's origin; any request that
bypasses the CF proxy or lacks valid service-token headers gets a 403 before reaching
the Heroku dyno.
RCA: docs/incidents/2026-05-23-ci-gitleaks-dark-crons-2703-2717.md — both
billing-retention-cron.yml and trace-integrity-cron.yml shipped without these
headers and failed silently until #2724.
Fix: Add the following to the job's env: block and to every curl call in
the job:
env:
RAPTOR_INTERNAL_API_URL: ${{ secrets.RAPTOR_INTERNAL_API_URL }}
CF_ACCESS_CLIENT_ID: ${{ secrets.CF_ACCESS_CLIENT_ID }}
CF_ACCESS_CLIENT_SECRET: ${{ secrets.CF_ACCESS_CLIENT_SECRET }}
curl -X POST \
-H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID}" \
-H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET}" \
-H "User-Agent: raxx-<workflow-name>/1.0" \
"${RAPTOR_INTERNAL_API_URL}/api/internal/jobs/<job>"
Also: if the workflow uses inline Python urllib.request to call a CF-gated origin,
add "User-Agent": "raxx-<workflow-name>/1.0" to the headers dict — the default
Python-urllib/3.x UA trips CF error 1010 on CF-gated origins.
The RAPTOR_INTERNAL_API_URL secret value MUST be https://api.raxx.app, NOT the
.herokuapp.com origin URL. The Heroku direct-origin URL returns 403
direct_origin_blocked for requests that bypass the Cloudflare proxy.
Verification: Locally run
python scripts/ci/lint_cf_access_headers.py --workflow-dir .github/workflows.
It should exit 0 with PASSED. Then trigger a manual workflow_dispatch run of
the affected workflow and confirm HTTP 200 from the endpoint.
Failure mode G: Heroku API 429 (rate limit) on deploy
Symptom: Workflow run "Deploy to Heroku #NNN" fails. Heroku CLI log shows HTTP 429 Too Many Requests. The runner was briefly queued or multiple deploys fired in rapid succession.
Cause: Heroku's API rate-limits requests from the same token. The git push heroku call internally hits the Heroku API to register the build. Under burst conditions (multiple deploys queued) the first caller to the API gets 429.
Fix (post-2026-05-13): deploy-heroku.yml now retries the git push up to 5 times with quadratic backoff (10s, 40s, 90s, 160s). The concurrency group heroku-deploy-staging prevents queued-up staging deploys from pile-driving the same API key.
Verification: After the fix, confirm gh run list --workflow=deploy-heroku.yml shows no 429-related failures. For a manual test: trigger two rapid staging deploys and confirm the second one either waits (concurrency queue) or retries successfully.
Naming convention for workflows
Rules:
1. name: must be unique across all .github/workflows/*.yml files. No two workflows may share the same name.
2. Workflow file names use kebab-case: deploy-heroku.yml, ci-pr.yml.
3. Scheduled workflows that produce Slack digest output should be named with a cron suffix: ci-digest-cron.yml, billing-collector-cron.yml.
4. Security workflows prefix with security- or nightly-: nightly-security-scan.yml, security-zap.yml.
5. Deploy workflows prefix with deploy-: deploy-heroku.yml, deploy-console.yml, deploy-velvet.yml.
Check for name collisions: grep '^name:' .github/workflows/*.yml | sed 's/.*: //' | sort | uniq -d — output should be empty.
Path filters — when to add them
Add paths: to a push: trigger when:
- The workflow only makes sense for a specific subdirectory (e.g. deploy-console.yml only deploys when console/** changes)
- The workflow does not touch security controls (gitleaks MUST run on every push)
Do NOT add paths: to:
- security-secrets (gitleaks) — secrets can appear in any file
- ci.yml push-to-main triggers — the push-to-main run is the final gate; path-skipping it creates a gap
Asset-manifest checksum guard
The asset-manifest guard (asset-manifest-check job in ci-pr.yml) fails when a PR touches static assets (console/app/static/**, frontend/trademaster_ui/public/**, frontend/status-page/public/**) and the docs/asset-manifest.json manifest is stale.
Regen procedure:
bash scripts/ci/regen-asset-manifest.sh
git add docs/asset-manifest.json
git commit -m "chore(assets): regen asset manifest"
Why this exists: The favicon revert incident (2026-04) silently overwrote a merged commit during an agent rebase. The manifest guard catches that class of drift at PR time.
Deploy callback URL pattern
Deploy workflows send lifecycle callbacks to the console's internal deploy tracking endpoint:
POST <CONSOLE_INTERNAL_URL>/api/internal/deploys/<console_deploy_id>/status
Body: {"status":"building|deploying|succeeded|failed","log_line":"...","failure_reason":null|"...","run_id":"...","run_url":"..."}
Header: X-Raxx-Deploy-Signature: sha256=<hmac>
The action that wraps this: .github/actions/notify-deploy-status/action.yml
When adding a new deploy workflow, include Notify — building, Notify — succeeded (if: success()), and Notify — failed (if: failure()) steps. Also add an outer Notify — failed step in the always-running summary job to cover the case where the deploy job is skipped.
Adding actionlint to PR CI
actionlint (https://github.com/rhysd/actionlint) catches most YAML/expression errors before they reach a runner. It is a single Go binary with no runtime dependencies.
To add to ci-pr.yml:
- name: Install actionlint
run: |
ACTIONLINT_VER="1.7.7"
curl -sSL "https://github.com/rhysd/actionlint/releases/download/v${ACTIONLINT_VER}/actionlint_${ACTIONLINT_VER}_linux_amd64.tar.gz" \
| tar -xz -C /usr/local/bin actionlint
- name: Lint workflow files
run: actionlint
Add this as a job that runs on every PR touching .github/workflows/**. This would have caught the print("boot: OK") YAML parse error that caused 7 days of ci.yml failures starting 2026-05-06.
Status: Proposed in Phase 4 but not yet implemented — operator decision pending on whether to add actionlint as a required check.
Emergency stop
To temporarily disable a broken workflow without deleting it:
# Disable
gh workflow disable <workflow-name>.yml
# Re-enable
gh workflow enable <workflow-name>.yml
Disabling does not delete run history. The workflow can be re-enabled immediately.
Escalation
Wake the operator when:
- A security scan workflow (nightly-security-scan, security-zap) fails for 2 consecutive days
- Branch protection is not enforcing any required checks (currently: NONE configured — this is an operator decision, see Phase 4 action items)
- A deploy workflow is failing AND the Heroku app is returning non-200 on health checks
- The deploy modal sticks at QUEUED for >10 minutes on a production deploy with a console_deploy_id
Contact: kris@moosequest.net
References
- Incident:
docs/ops/incidents/2026-05-13-ci-cleanup-pass.md - Related runbook:
docs/ops/runbooks/ci-runner-posture.md - Deploy callback action:
.github/actions/notify-deploy-status/ - Feedback:
memory/feedback_gh_actions_transitive_skip.md,memory/feedback_gh_actions_netrc_broken.md