RCA — CI cleanup pass: YAML parse error, name collision, Heroku 429, deploy modal stall, nightly scan recurrence
Incident ID: 2026-05-13-ci-cleanup-pass
Date: 2026-05-13
Severity: SEV-2 (staging outage / deploy pipeline blocked)
Duration: ~7 days for ci.yml parse error (began 2026-05-06, detected and fixed 2026-05-13); same-day for others
Blast radius: All PRs and push-to-main since 2026-05-06; Heroku deploy #734 prod; nightly security scan 4/5 recent runs
Author: sre-agent
Summary
Five distinct CI failure classes were found during the 2026-05-13 cleanup sweep. The highest-impact was a YAML parse error in ci.yml that caused every push and PR since 2026-05-06 to produce a 0-job, 0-second CI run — consuming a runner allocation for nothing and providing zero signal. Four additional findings were diagnosed and fixed in the same pass: a workflow name collision blocking branch-protection disambiguation, Heroku API 429 on deploy triggering a deploy failure, a deploy modal stuck at QUEUED after workflow failures, and the nightly security scan PR-creation failing repeatedly despite a prior fix (#1820).
Timeline (all times UTC)
- 2026-05-06 ~08:00: ci.yml begins failing silently. YAML parse error at what was then line 222: run: python -c 'from velvet.app import app; print("boot: OK")'. The colon in "boot: OK" inside a bare YAML scalar causes the parser to treat OK" as an unexpected mapping value. No alert fires; the workflow just produces no jobs.
- 2026-05-06 to 2026-05-12: 35+ runner allocations per day wasted. No CI signal on any push to main or PR.
- 2026-05-13 ~09:00: Operator reports deploy modal stuck at QUEUED 465s after deploy #734 triggered from console. Root cause: Heroku 429 + modal callback gap.
- 2026-05-13 ~09:15: Nightly security scan run 25793827748 (08:07 UTC scheduled) confirmed failing again with gh pr create: GitHub Actions is not permitted to create or approve pull requests. Issue #1975 filed.
- 2026-05-13 (this pass): All five issues diagnosed and fixed in fix/ci-hygiene-pass-2026-05-13.
Impact
- Users affected: 0 (pre-launch, no customers)
- User-visible symptoms: none (internal CI only)
- Data integrity: ok
- Revenue / billing: ok
- Developer impact: 7 days of no CI signal on push/PR; deploy #734 prod stalled; 4/5 nightly scans failed to create PRs
Per-finding breakdown
Finding 1: ci.yml YAML parse error (highest impact)
File: .github/workflows/ci.yml
Line (pre-fix): velvet-smoke job, Velvet import smoke test — app factory step
Root cause: run: python -c 'from velvet.app import app; print("boot: OK")'
The plain (unquoted) YAML scalar python -c 'from velvet.app import app; print("boot: OK")' contains a colon followed by a space (boot: OK). In YAML, colon-plus-space is the mapping indicator and is not allowed inside a plain scalar, so GitHub Actions' YAML parser (which uses Go's gopkg.in/yaml.v3) treated the text as a nested mapping and raised a parse error. Because a parse error invalidates the whole file, NONE of the jobs in ci.yml load: the workflow appears to start but immediately produces 0 jobs, 0 seconds.
Evidence: 35+ wasted runner allocations per day from 2026-05-06 through 2026-05-13. The architect audit (#1937) noted a ~56% failure rate, with this error alone accounting for ~31% of all failures.
Fix: Changed print("boot: OK") to print("boot OK") (removed colon from the Python string). Added a comment explaining why colons must be avoided in run: scalar print statements.
Also renamed: name: CI → name: CI — main to resolve the name collision with ci-pr.yml (Finding 2).
Time saved: ~35 wasted runner allocations/day × 7 days ≈ 245 wasted runs over the outage window; that waste has now stopped. Every future push/PR gets real CI output.
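Until actionlint is wired in, a crude text check can flag the specific pattern behind this finding. The sketch below is a heuristic, not a YAML parser: the function name find_risky_run_lines and the regular expression are illustrative assumptions, and it only looks at single-line run: values that are unquoted and contain a colon followed by whitespace.

```shell
# Heuristic sketch (hypothetical helper, not part of the repo).
# Flags single-line `run:` values that are plain scalars (not starting with a
# quote, `|`, or `>`) and contain a colon followed by whitespace, which is the
# pattern that broke ci.yml. Text after a `#` is ignored.
find_risky_run_lines() {
  pattern="^[[:space:]]*(-[[:space:]]+)?run:[[:space:]]+[^|>\"'[:space:]][^#]*:[[:space:]]"
  grep -nE "$pattern" "$@"
}

# Example (assumed path): prints matching lines and exits 0 when something
# looks risky, exits 1 when nothing matches.
# find_risky_run_lines .github/workflows/*.yml
```

Switching the step to a block scalar (run: |) or quoting the whole value also avoids the problem, because the colon then sits inside a literal or quoted scalar rather than a plain one.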
Finding 2: ci-pr.yml name collision with ci.yml
File: .github/workflows/ci-pr.yml
Root cause: Both ci.yml and ci-pr.yml had name: CI. GitHub's branch protection required-checks config references workflow names. With two workflows sharing the same name, branch protection cannot distinguish which workflow's check to require. Additionally, the github.workflow concurrency group key produced collisions.
Fix: ci-pr.yml renamed to name: PR Gates. ci.yml renamed to name: CI — main. Concurrency group keys now differ naturally since they embed github.workflow.
Finding: main branch has NO branch protection configured at all (gh api repos/raxx-app/TradeMasterAPI/branches/main/protection returns 404 "Branch not protected"). This is a separate operator decision (see Phase 4 action items).
Finding 3: Heroku 429 on deploy (deploy-heroku.yml)
File: .github/workflows/deploy-heroku.yml
Root cause: git push --force heroku "HEAD:refs/heads/main" makes a single attempt. The Heroku API rate-limits by token; burst conditions (multiple deploys firing in sequence or from concurrent workflows) produce HTTP 429 on the first API call. The deploy fails immediately with no retry.
Evidence: Run "Deploy to Heroku #734" failed with Heroku 429 on 2026-05-13.
Fix: Wrapped the git push in a retry loop: 5 attempts with quadratic backoff between them (i² × 10s after failed attempt i: 10s, 40s, 90s, 160s, so at most 300s of waiting per run). The existing concurrency group heroku-deploy-staging already prevents multiple concurrent staging deploys, which limits burst frequency. The retry loop absorbs transient 429s within a single run.
Note: The 429 was on git push, which triggers Heroku's build API internally. The retry pattern is correct for this failure mode. For very sustained 429s (quota exhausted), escalation is needed.
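A minimal shell sketch of the retry described above. The function names retry_with_backoff and backoff_seconds, and the SLEEP_CMD override (present only so the loop can be exercised without real waiting), are illustrative and not taken from deploy-heroku.yml.

```shell
# backoff_seconds ATTEMPT -> prints ATTEMPT^2 * 10 (10, 40, 90, 160 for 1-4).
backoff_seconds() {
  echo $(( $1 * $1 * 10 ))
}

# retry_with_backoff MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, sleeping with
# quadratic backoff between attempts. Returns 0 on success, 1 otherwise.
retry_with_backoff() {
  max_attempts=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      delay=$(backoff_seconds "$attempt")
      echo "attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s" >&2
      "${SLEEP_CMD:-sleep}" "$delay"
    fi
    attempt=$(( attempt + 1 ))
  done
  echo "all ${max_attempts} attempts failed" >&2
  return 1
}

# Usage in the deploy step (command taken from the finding above):
# retry_with_backoff 5 git push --force heroku "HEAD:refs/heads/main"
```

This sketch retries on any non-zero exit, not only on a 429; telling the two apart would require inspecting the output of git push.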
Finding 4: Deploy modal sticks at QUEUED after smoke gate failure
Files: .github/workflows/deploy-heroku.yml, .github/actions/notify-deploy-status/action.yml
Root cause: The Notify — building and Notify — failed steps live inside the deploy job. When the smoke gate job fails, the deploy job is skipped entirely (its if: always() && needs.smoke.result == 'success' condition evaluates false). No failed callback is ever posted to the console's deploy tracking endpoint. The modal polls for state but never sees a terminal state — it sits at QUEUED indefinitely.
Why the notify job didn't help: The notify job (which uses if: always()) was posting a PR comment but was NOT posting a callback to the deploy endpoint. The two notification paths (PR comment and deploy-endpoint callback) were not unified.
Fix: Added a Notify — failed (smoke gate or deploy skipped) step inside the notify job. This step:
- Uses if: to fire only when needs.smoke.result != 'success' OR needs.deploy.result is skipped or cancelled
- Requires a Checkout step first (composite actions need the repo to be available)
- Posts status: failed with a descriptive log_line and failure_reason that distinguishes "smoke failed" from "deploy skipped"
- Is a no-op when console_deploy_id is empty (manual runs, the action handles this internally)
File:line of bug: deploy-heroku.yml — the gap was the absence of a failed callback in the notify job when deploy was skipped.
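The firing condition can be restated as a small truth table. In the workflow it is a step-level if: expression; the shell below only mirrors that logic for illustration, and the function names and the exact reason strings are assumptions.

```shell
# should_post_failed SMOKE_RESULT DEPLOY_RESULT
# Exit 0 when the notify job should post the terminal "failed" callback:
# smoke did not succeed, or deploy was skipped or cancelled.
should_post_failed() {
  if [ "$1" != "success" ]; then
    return 0
  fi
  case "$2" in
    skipped|cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# failure_reason SMOKE_RESULT DEPLOY_RESULT
# Prints a reason that separates "smoke failed" from "deploy skipped".
failure_reason() {
  if [ "$1" != "success" ]; then
    echo "smoke gate ${1}: deploy job did not run"
  else
    echo "deploy ${2} after smoke success"
  fi
}
```

A deploy that ran and failed is intentionally not covered here, since the in-job Notify — failed step already reports that case.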
Finding 5: Nightly security scan PR creation failure (#1975)
File: .github/workflows/nightly-security-scan.yml
Root cause: The Mint raxx-ops-bot token step falls back to GITHUB_TOKEN when any of RAXX_OPS_BOT_APP_ID, RAXX_OPS_BOT_PRIVATE_KEY, RAXX_OPS_BOT_INSTALL_ID are missing or when the GitHub API call fails. In fallback mode, steps.mint.outputs.gh_token = GITHUB_TOKEN. The GH_TOKEN env var is then set to GITHUB_TOKEN, which the org-level "Actions cannot create PRs" toggle blocks.
Why git push succeeds but gh pr create fails: git push origin uses the credential helper from actions/checkout@v4 (which stores the checkout token, different from GITHUB_TOKEN behavior). gh pr create uses GH_TOKEN env var directly. When GH_TOKEN is GITHUB_TOKEN, the push (via checkout credentials) succeeds but the PR creation (via GH_TOKEN) fails.
Prior fix (#1820) and why it recurred: #1820 added the GH_TOKEN wiring but did not add a guard that aborts the step when the mint falls back. When the secrets rotate or expire, the step silently falls back again and PR creation fails the same way, leaving a dangling branch (security/scan-YYYY-MM-DD) with no PR.
Fix:
1. Added a guard that checks BOT_IDENTITY == "raxx-ops-bot" and exits 1 with a diagnostic message if it's the fallback. This produces a visible failure rather than a silent dangling branch.
2. Changed git push to use URL-embedded token (https://x-access-token:${GH_TOKEN}@github.com/...) as the remote URL, ensuring git also uses the installation token rather than the checkout credential helper. This is consistent with memory/feedback_gh_actions_netrc_broken.md.
To verify fix: Check next scheduled run (08:07 UTC) for bot_identity=raxx-ops-bot in the mint step output.
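A sketch of the two changes in the fix, written as shell that could sit in the push/PR step. BOT_IDENTITY and GH_TOKEN come from the description above; the function names, the error text, and the .git suffix on the URL are assumptions, and GITHUB_REPOSITORY is the standard owner/repo variable that GitHub Actions sets.

```shell
# require_bot_identity: fail loudly when the mint step fell back to
# GITHUB_TOKEN instead of producing a raxx-ops-bot installation token.
require_bot_identity() {
  if [ "${BOT_IDENTITY:-}" != "raxx-ops-bot" ]; then
    echo "::error::bot token mint fell back (BOT_IDENTITY='${BOT_IDENTITY:-}'); refusing to push or open a PR"
    return 1
  fi
}

# token_remote_url OWNER/REPO -> HTTPS remote URL that embeds GH_TOKEN, intended
# to make git push use the installation token rather than checkout credentials.
token_remote_url() {
  printf 'https://x-access-token:%s@github.com/%s.git\n' "$GH_TOKEN" "$1"
}

# Usage inside the step (BRANCH is a placeholder for the scan branch name):
# require_bot_identity || exit 1
# git push "$(token_remote_url "$GITHUB_REPOSITORY")" "HEAD:refs/heads/$BRANCH"
```

Returning non-zero from the guard, rather than continuing, is what turns the silent dangling-branch outcome into a visible step failure.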
What went well
- The concurrency group on deploy-heroku.yml was already correct — it prevented multiple concurrent staging deploys which would have made the 429 worse
- The notify-deploy-status composite action's no-op design (empty console_deploy_id = skip) meant the deploy-modal fix didn't break manual workflow runs
- The nightly scan's continue-on-error on all scanner steps meant the scan still ran and produced artifacts even when the PR creation failed
What didn't go well
- The ci.yml YAML parse error went undetected for 7 days. No alert fires on "workflow produced 0 jobs." The system silently allocated runners and produced nothing useful.
- The nightly-security-scan PR creation failure had a prior fix (#1820) that didn't fully close the gap. The fix patched the happy path but didn't guard against fallback state.
- The deploy modal silent-QUEUED bug was not caught during the original deploy callback implementation. The "always() notify job covers failures" assumption was wrong — the job notified PR comments, not the deploy endpoint.
Root cause analysis
- Contributing factor 1: No YAML linting in CI. actionlint would have caught the print("boot: OK") parse error at PR time. It was proposed but never wired in.
- Contributing factor 2: No alert on 0-job workflow runs. GitHub Actions does not natively alert when a workflow produces 0 jobs (the normal success path is "all jobs ran and passed"; a 0-job run is a different failure mode that looks like success to naive monitors).
- Contributing factor 3: Deploy callback gap in smoke-gate failure path. The notify-deploy-status composite action was designed with the happy-path and in-job failure paths in mind, but the "job skipped because upstream failed" path was not considered.
- Contributing factor 4: Fragile bot token mint without guard. The mint step has a fallback that produces a different token silently. Downstream steps assumed mint success.
Detection
- ci.yml parse error: detected by operator/architect audit (#1937) — 7 days after onset
- Heroku 429: detected immediately when deploy modal stalled (operator reported 2026-05-13)
- Nightly scan: detected by issue filing (#1975) — recurring pattern
- Others: detected during this cleanup pass
Resolution
- ci.yml YAML parse error: removed colon from Python print string; renamed workflow
- ci-pr.yml name collision: renamed to "PR Gates"
- Heroku 429: added retry-with-backoff in deploy-heroku.yml
- Deploy modal stall: added outer Notify — failed step in notify job
- Nightly scan: added bot_identity guard + URL-embedded token push
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Add actionlint step to ci-pr.yml to lint all workflow files on every PR | sre-agent | 2026-05-20 | file below |
| 2 | Configure branch protection on main with required checks: CI — main / Backend tests, PR Gates / Sprint readiness gate, PR Gates / Secret scan (gitleaks) | operator | 2026-05-20 | file below |
| 3 | Add alert/monitor for 0-job workflow runs (or wire Sentry CI integration) | sre-agent | 2026-05-27 | file below |
| 4 | Audit top-5 slowest workflows for caching gaps + matrix overruns (Phase 3) | sre-agent | 2026-05-27 | file below |
| 5 | Verify RAXX_OPS_BOT secrets are current; add rotation reminder to Velvet rotation schedule | operator | 2026-05-16 | file below |
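
For action item 3, one possible shape for the 0-job monitor is a scheduled script that walks recent runs and counts their jobs. This is a sketch under assumptions: it relies on an authenticated gh CLI, uses gh run list plus the REST endpoint that lists a run's jobs (reading its total_count field), and the function names and output format are made up for illustration.

```shell
# run_looks_empty JOB_COUNT -> exit 0 only when the count is exactly 0.
# An empty string (for example, a failed API call) is deliberately not a match.
run_looks_empty() {
  [ "$1" = "0" ]
}

# count_jobs OWNER/REPO RUN_ID -> prints the number of jobs in one run.
count_jobs() {
  gh api "repos/$1/actions/runs/$2/jobs" --jq '.total_count'
}

# report_empty_runs OWNER/REPO WORKFLOW_FILE [LIMIT]
# Prints one line per recent run of the workflow that produced no jobs.
report_empty_runs() {
  repo=$1
  workflow=$2
  limit=${3:-20}
  gh run list --repo "$repo" --workflow "$workflow" --limit "$limit" \
    --json databaseId --jq '.[].databaseId' |
  while read -r run_id; do
    if run_looks_empty "$(count_jobs "$repo" "$run_id")"; then
      echo "zero-job run: $repo run $run_id"
    fi
  done
}

# Example: report_empty_runs raxx-app/TradeMasterAPI ci.yml 50
```

A run that is still queued may also report zero jobs, so a real monitor would probably restrict itself to completed runs before alerting.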
Phase 3 time-saving proposals
See per-workflow time estimates below. All are filed as separate type:reliability GitHub issues.
ci.yml — estimated 4-5 min savings per run
- backend-tests-postgres job has no cache: pip on setup-python. Add cache: 'pip'. Estimated: ~1-2 min saved per run.
- security-secrets (gitleaks) fetches full history (fetch-depth: 0) which is correct and cannot be optimized. Accept this cost.
- velvet-smoke job installs Velvet deps but does not cache pip. Add cache: 'pip' to its setup-python step. Estimated: ~30s saved.
- The detect-changes job runs on every push/PR, even docs-only changes. This is intentional (the gate IS the optimization). No change.
- security-deps installs pip-audit then separately installs node. These could run in parallel if split into two jobs. Low-risk consolidation. Estimated: ~1-2 min saved.
deploy-heroku.yml — estimated 2-3 min savings per run
- Smoke gate (smoke job) installs both Python and Node even when only the backend changed. These are already parallelized via the smoke script, so neither can be eliminated. Accept.
- fetch-depth: 0 on the checkout in the deploy job is required for git push --force. Cannot optimize.
- Heroku CLI install via curl ... | sh runs on every deploy. The binary is ~50MB. Consider caching the heroku CLI binary in the GitHub Actions cache (key: heroku-cli-version). Estimated: ~30s saved per deploy.
- Post-deploy health check has a fixed 10s sleep between retries with no upper bound. First-attempt hit rate is >90%; reducing initial sleep to 5s and keeping retries saves ~5-10s on the happy path.
deploy-console.yml — estimated 3-4 min savings per run
- git subtree split runs on every console deploy and is the slowest step (~60-90s on large histories). No optimization available without changing the deploy strategy. Accept for now; track under the post-launch Heroku-Container migration consideration.
- Tailwind CSS build runs npm ci before each deploy. Add npm cache via actions/setup-node@v4 with cache: 'npm' and cache-dependency-path: console/package-lock.json. Estimated: ~1-2 min saved.
- The smoke gate duplicates actions/setup-node@v4 + npm ci for the frontend. This is shared with deploy-heroku.yml smoke. If these two deploy workflows ever run in parallel (different surfaces), runner cost doubles. Acceptable for now.
ci-pr.yml — estimated 3-5 min savings per run
- smoke_suite job installs both Python and Node sequentially. Both have caches configured (cache: 'pip' and cache: 'npm'). Cache hits should already be fast; verify cache hit rate in run logs before optimizing.
- migration-gate job checks out the full history (fetch-depth: 0) for an alembic diff. Alembic only needs the current HEAD plus the PR base SHA. Reducing to fetch-depth: 1 and explicitly fetching the base SHA could save ~30s on large repos.
- stale-branch-guard and base_branch_lint both do fetch-depth: 0 + fetch --all. These are sequential within each job but could share the checkout if merged. Low-priority.
- commitlint installs @commitlint/cli@19 + @commitlint/config-conventional@19 on every run with --no-save. No caching. Add cache: 'npm' or move to a composite action with a cached install. Estimated: ~1 min saved.
nightly-security-scan.yml — estimated 5-8 min savings per run
- The scan job installs pip-audit + bandit then downloads gitleaks + trivy. The apt-get for trivy hits the internet every run. No caching is possible for apt packages, but the gitleaks binary download could be cached (key: gitleaks-v8.30.1-linux-x64). Estimated: ~30s saved.
- trivy runs with --scanners vuln,misconfig,secret. The vuln scanner downloads the vulnerability database on every run (~200MB). Trivy supports a TRIVY_CACHE_DIR env var. Use actions/cache@v4 with key: trivy-db-<date> to cache the DB for 24h. Estimated: 2-3 min saved per nightly run.
- npm audit step under frontend/trademaster_ui runs without an npm ci first. If package-lock.json is cached, this is fast. If not, npm audit downloads metadata. Confirm cache is set up.
References
- PR: fix/ci-hygiene-pass-2026-05-13
- Runbook: docs/ops/runbooks/ci-hygiene.md
- Prior incident (nightly scan): #1975
- Prior fix: #1820
- Architect audit: #1937
- Feedback: memory/feedback_gh_actions_transitive_skip.md, memory/feedback_gh_actions_netrc_broken.md