RCA — CI cleanup pass: YAML parse error, name collision, Heroku 429, deploy modal stall, nightly scan recurrence

Incident ID: 2026-05-13-ci-cleanup-pass Date: 2026-05-13 Severity: SEV-2 (staging outage / deploy pipeline blocked) Duration: ~7 days for ci.yml parse error (detected 2026-05-06, fixed 2026-05-13); same-day for others Blast radius: All PRs and push-to-main since 2026-05-06; Heroku deploy #734 prod; nightly security scan 4/5 recent runs Author: sre-agent

Summary

Five distinct CI failure classes were found during the 2026-05-13 cleanup sweep. The highest-impact was a YAML parse error in ci.yml that caused every push and PR since 2026-05-06 to produce a 0-job, 0-second CI run — consuming a runner allocation for nothing and providing zero signal. Four additional findings were diagnosed and fixed in the same pass: a workflow name collision blocking branch-protection disambiguation, Heroku API 429 on deploy triggering a deploy failure, a deploy modal stuck at QUEUED after workflow failures, and the nightly security scan PR-creation failing repeatedly despite a prior fix (#1820).

Timeline (all times UTC)

2026-05-06 ~08:00 — ci.yml begins failing silently. YAML parse error at what was then line 222: run: python -c 'from velvet.app import app; print("boot: OK")'. The colon in "boot: OK" inside a bare YAML scalar causes the parser to treat OK" as an unexpected mapping value. No alert fires; the workflow just produces no jobs.
2026-05-06 to 2026-05-12 — 35+ runner allocations per day wasted. No CI signal on any push to main or PR.
2026-05-13 ~09:00 — Operator reports deploy modal stuck at QUEUED 465s after deploy #734 triggered from console. Root cause: Heroku 429 + modal callback gap.
2026-05-13 ~09:15 — Nightly security scan run 25793827748 (08:07 UTC scheduled) confirmed failing again with gh pr create: GitHub Actions is not permitted to create or approve pull requests. Issue #1975 filed.
2026-05-13 (this pass) — All five issues diagnosed and fixed in fix/ci-hygiene-pass-2026-05-13.

Impact

Users affected: 0 (pre-launch, no customers)
User-visible symptoms: none (internal CI only)
Data integrity: ok
Revenue / billing: ok
Developer impact: 7 days of no CI signal on push/PR; deploy #734 prod stalled; 4/5 nightly scans failed to create PRs

Per-finding breakdown

Finding 1: ci.yml YAML parse error (highest impact)

File: .github/workflows/ci.yml Line (pre-fix): velvet-smoke job, Velvet import smoke test — app factory step Root cause: run: python -c 'from velvet.app import app; print("boot: OK")'

The bare YAML scalar value python -c 'from velvet.app import app; print("boot: OK")' contains OK" after a colon following whitespace. GitHub Actions' YAML parser (which uses Go's gopkg.in/yaml.v3) is stricter than PyYAML. It treated this as a mapping, producing a parse error. Because the parse error is at the job level, NONE of the jobs in ci.yml parse — the workflow appears to start but immediately produces 0 jobs, 0 seconds.

Evidence: 35+ wasted runner allocations from 2026-05-06 through 2026-05-13 (architect #1937 audit noted ~56% failure rate from this alone being ~31% of all failures).

Fix: Changed print("boot: OK") to print("boot OK") (removed colon from the Python string). Added a comment explaining why colons must be avoided in run: scalar print statements.

Also renamed: name: CI → name: CI — main to resolve the name collision with ci-pr.yml (Finding 2).

Time saved: ~35 wasted runner allocations/day × 7 days = 245 wasted runs stopped. Every future push/PR now gets real CI output.

Finding 2: ci-pr.yml name collision with ci.yml

File: .github/workflows/ci-pr.yml Root cause: Both ci.yml and ci-pr.yml had name: CI. GitHub's branch protection required-checks config references workflow names. With two workflows sharing the same name, branch protection cannot distinguish which workflow's check to require. Additionally, the github.workflow concurrency group key produced collisions.

Fix: ci-pr.yml renamed to name: PR Gates. ci.yml renamed to name: CI — main. Concurrency group keys now differ naturally since they embed github.workflow.

Finding: main branch has NO branch protection configured at all (gh api repos/raxx-app/TradeMasterAPI/branches/main/protection returns 404 "Branch not protected"). This is a separate operator decision (see Phase 4 action items).

Finding 3: Heroku 429 on deploy (deploy-heroku.yml)

File: .github/workflows/deploy-heroku.yml Root cause: git push --force heroku "HEAD:refs/heads/main" makes a single attempt. The Heroku API rate-limits by token; burst conditions (multiple deploys firing in sequence or from concurrent workflows) produce HTTP 429 on the first API call. The deploy fails immediately with no retry.

Evidence: Run "Deploy to Heroku #734" failed with Heroku 429 on 2026-05-13.

Fix: Wrapped the git push in a retry loop: 5 attempts, quadratic backoff (i²×10s → 10s, 40s, 90s, 160s). The existing concurrency group heroku-deploy-staging already prevents multiple concurrent staging deploys, which limits burst frequency. The retry loop absorbs transient 429s within a single run.

Note: The 429 was on git push, which triggers Heroku's build API internally. The retry pattern is correct for this failure mode. For very sustained 429s (quota exhausted), escalation is needed.

Files: .github/workflows/deploy-heroku.yml, .github/actions/notify-deploy-status/action.yml Root cause: The Notify — building and Notify — failed steps live inside the deploy job. When the smoke gate job fails, the deploy job is skipped entirely (its if: always() && needs.smoke.result == 'success' condition evaluates false). No failed callback is ever posted to the console's deploy tracking endpoint. The modal polls for state but never sees a terminal state — it sits at QUEUED indefinitely.

Why the notify job didn't help: The notify job (which uses if: always()) was posting a PR comment but was NOT posting a callback to the deploy endpoint. The two notification paths (PR comment and deploy-endpoint callback) were not unified.

Fix: Added a Notify — failed (smoke gate or deploy skipped) step inside the notify job. This step: - Uses if: to fire only when needs.smoke.result != 'success' OR needs.deploy.result is skipped or cancelled - Requires a Checkout step first (composite actions need the repo to be available) - Posts status: failed with a descriptive log_line and failure_reason that distinguishes "smoke failed" from "deploy skipped" - Is a no-op when console_deploy_id is empty (manual runs, the action handles this internally)

File:line of bug: deploy-heroku.yml — the gap was the absence of a failed callback in the notify job when deploy was skipped.

Finding 5: Nightly security scan PR creation failure (#1975)

File: .github/workflows/nightly-security-scan.yml Root cause: The Mint raxx-ops-bot token step falls back to GITHUB_TOKEN when any of RAXX_OPS_BOT_APP_ID, RAXX_OPS_BOT_PRIVATE_KEY, RAXX_OPS_BOT_INSTALL_ID are missing or when the GitHub API call fails. In fallback mode, steps.mint.outputs.gh_token = GITHUB_TOKEN. The GH_TOKEN env var is then set to GITHUB_TOKEN, which the org-level "Actions cannot create PRs" toggle blocks.

Why git push succeeds but gh pr create fails: git push origin uses the credential helper from actions/checkout@v4 (which stores the checkout token, different from GITHUB_TOKEN behavior). gh pr create uses GH_TOKEN env var directly. When GH_TOKEN is GITHUB_TOKEN, the push (via checkout credentials) succeeds but the PR creation (via GH_TOKEN) fails.

Prior fix (#1820) and why it recurred: #1820 added the GH_TOKEN wiring. But it didn't add a guard that aborts the step when the mint fell back. If the secrets rotate or expire, the step silently fell back again and the PR creation failed the same way. The dangling branch (security/scan-YYYY-MM-DD) was created but no PR existed.

Fix: 1. Added a guard that checks BOT_IDENTITY == "raxx-ops-bot" and exits 1 with a diagnostic message if it's the fallback. This produces a visible failure rather than a silent dangling branch. 2. Changed git push to use URL-embedded token (https://x-access-token:${GH_TOKEN}@github.com/...) as the remote URL, ensuring git also uses the installation token rather than the checkout credential helper. This is consistent with memory/feedback_gh_actions_netrc_broken.md.

To verify fix: Check next scheduled run (08:07 UTC) for bot_identity=raxx-ops-bot in the mint step output.

What went well

The concurrency group on deploy-heroku.yml was already correct — it prevented multiple concurrent staging deploys which would have made the 429 worse
The notify-deploy-status composite action's no-op design (empty console_deploy_id = skip) meant the deploy-modal fix didn't break manual workflow runs
The nightly scan's continue-on-error on all scanner steps meant the scan still ran and produced artifacts even when the PR creation failed

What didn't go well

The ci.yml YAML parse error went undetected for 7 days. No alert fires on "workflow produced 0 jobs." The system silently allocated runners and produced nothing useful.
The nightly-security-scan PR creation failure had a prior fix (#1820) that didn't fully close the gap. The fix patched the happy path but didn't guard against fallback state.
The deploy modal silent-QUEUED bug was not caught during the original deploy callback implementation. The "always() notify job covers failures" assumption was wrong — the job notified PR comments, not the deploy endpoint.

Root cause analysis

Contributing factor 1: No YAML linting in CI — actionlint would have caught the print("boot: OK") parse error at PR time. It was proposed but never wired in.
Contributing factor 2: No alert on 0-job workflow runs — GitHub Actions does not natively alert when a workflow produces 0 jobs (the normal success path is "all jobs ran and passed"; a 0-job run is a different failure mode that looks like success to naive monitors).
Contributing factor 3: Deploy callback gap in smoke-gate failure path — the notify-deploy-status composite action was designed with the happy-path and in-job failure paths in mind, but the "job skipped because upstream failed" path was not considered.
Contributing factor 4: Fragile bot token mint without guard — the mint step has a fallback that produces a different token silently. Downstream steps assumed mint success.

Detection

ci.yml parse error: detected by operator/architect audit (#1937) — 7 days after onset
Heroku 429: detected immediately when deploy modal stalled (operator reported 2026-05-13)
Nightly scan: detected by issue filing (#1975) — recurring pattern
Others: detected during this cleanup pass

Resolution

ci.yml YAML parse error: removed colon from Python print string; renamed workflow
ci-pr.yml name collision: renamed to "PR Gates"
Heroku 429: added retry-with-backoff in deploy-heroku.yml
Deploy modal stall: added outer Notify — failed step in notify job
Nightly scan: added bot_identity guard + URL-embedded token push

Action items

#	Action	Owner	Due	Issue
1	Add `actionlint` step to `ci-pr.yml` to lint all workflow files on every PR	sre-agent	2026-05-20	file below
2	Configure branch protection on main with required checks: `CI — main / Backend tests`, `PR Gates / Sprint readiness gate`, `PR Gates / Secret scan (gitleaks)`	operator	2026-05-20	file below
3	Add alert/monitor for 0-job workflow runs (or wire Sentry CI integration)	sre-agent	2026-05-27	file below
4	Audit top-5 slowest workflows for caching gaps + matrix overruns (Phase 3)	sre-agent	2026-05-27	file below
5	Verify RAXX_OPS_BOT secrets are current; add rotation reminder to Velvet rotation schedule	operator	2026-05-16	file below

Phase 3 time-saving proposals

See per-workflow time estimates below. All are filed as separate type:reliability GitHub issues.

ci.yml — estimated 4-5 min savings per run

backend-tests-postgres job has no cache: pip on setup-python. Add cache: 'pip'. Estimated: ~1-2 min saved per run.
security-secrets (gitleaks) fetches full history (fetch-depth: 0) which is correct and cannot be optimized. Accept this cost.
velvet-smoke job installs Velvet deps but does not cache pip. Add cache: 'pip' to its setup-python step. Estimated: ~30s saved.
The detect-changes job runs on every push/PR, even docs-only changes. This is intentional (the gate IS the optimization). No change.
security-deps installs pip-audit then separately installs node — these could run in parallel if split into two jobs. Low-risk consolidation. Estimated: ~1-2 min saved.

deploy-heroku.yml — estimated 2-3 min savings per run

Smoke gate (smoke job) installs both Python and Node even when only the backend changed. These are already parallelized via the smoke script — cannot eliminate either. Accept.
fetch-depth: 0 on the checkout in the deploy job is required for git push --force. Cannot optimize.
Heroku CLI install via curl ... | sh runs on every deploy. The binary is ~50MB. Consider caching the heroku CLI binary in the GitHub Actions cache (key: heroku-cli-version). Estimated: ~30s saved per deploy.
Post-deploy health check has a fixed 10s sleep between retries with no upper bound. First-attempt hit rate is >90%; reducing initial sleep to 5s and keeping retries saves ~5-10s on the happy path.

deploy-console.yml — estimated 3-4 min savings per run

git subtree split runs on every console deploy and is the slowest step (~60-90s on large histories). No optimization available without changing the deploy strategy. Accept for now; track under the post-launch Heroku-Container migration consideration.
Tailwind CSS build runs npm ci before each deploy. Add npm cache via actions/setup-node@v4 with cache-dependency-path: console/package-lock.json. Estimated: ~1-2 min saved.
The smoke gate duplicates actions/setup-node@v4 + npm ci for the frontend. This is shared with deploy-heroku.yml smoke. If these two deploy workflows ever run in parallel (different surfaces), runner cost doubles. Acceptable for now.

ci-pr.yml — estimated 3-5 min savings per run

smoke_suite job installs both Python and Node sequentially. Both have caches configured (cache: 'pip' and cache: 'npm'). Cache hits should already be fast; verify cache hit rate in run logs before optimizing.
migration-gate job checks out the full history (fetch-depth: 0) for an alembic diff. Alembic only needs the current HEAD plus the PR base SHA. Reducing to fetch-depth: 1 and explicitly fetching the base SHA could save ~30s on large repos.
stale-branch-guard and base_branch_lint both do fetch-depth: 0 + fetch --all. These are sequential within each job but could share the checkout if merged. Low-priority.
commitlint installs @commitlint/cli@19 + @commitlint/config-conventional@19 on every run with --no-save. No caching. Add cache: 'npm' or move to a composite action with a cached install. Estimated: ~1 min saved.

nightly-security-scan.yml — estimated 5-8 min savings per run

The scan job installs pip-audit + bandit then downloads gitleaks + trivy. The apt-get for trivy hits the internet every run. No caching is possible for apt packages, but the gitleaks binary download could be cached (key: gitleaks-v8.30.1-linux-x64). Estimated: ~30s saved.
trivy runs with --scanners vuln,misconfig,secret. The vuln scanner downloads the vulnerability database on every run (~200MB). Trivy supports a TRIVY_CACHE_DIR env var. Use actions/cache@v4 with key: trivy-db-<date> to cache the DB for 24h. Estimated: 2-3 min saved per nightly run.
npm audit step under frontend/trademaster_ui runs without an npm ci first. If package-lock.json is cached, this is fast. If not, npm audit downloads metadata. Confirm cache is set up.

References

PR: fix/ci-hygiene-pass-2026-05-13
Runbook: docs/ops/runbooks/ci-hygiene.md
Prior incident (nightly scan): #1975
Prior fix: #1820
Architect audit: #1937
Feedback: memory/feedback_gh_actions_transitive_skip.md, memory/feedback_gh_actions_netrc_broken.md