Raxx · internal docs

internal · gated

RCA — CI cleanup pass: YAML parse error, name collision, Heroku 429, deploy modal stall, nightly scan recurrence

Incident ID: 2026-05-13-ci-cleanup-pass Date: 2026-05-13 Severity: SEV-2 (staging outage / deploy pipeline blocked) Duration: ~7 days for ci.yml parse error (detected 2026-05-06, fixed 2026-05-13); same-day for others Blast radius: All PRs and push-to-main since 2026-05-06; Heroku deploy #734 prod; nightly security scan 4/5 recent runs Author: sre-agent


Summary

Five distinct CI failure classes were found during the 2026-05-13 cleanup sweep. The highest-impact was a YAML parse error in ci.yml that caused every push and PR since 2026-05-06 to produce a 0-job, 0-second CI run — consuming a runner allocation for nothing and providing zero signal. Four additional findings were diagnosed and fixed in the same pass: a workflow name collision blocking branch-protection disambiguation, Heroku API 429 on deploy triggering a deploy failure, a deploy modal stuck at QUEUED after workflow failures, and the nightly security scan PR-creation failing repeatedly despite a prior fix (#1820).


Timeline (all times UTC)


Impact


Per-finding breakdown

Finding 1: ci.yml YAML parse error (highest impact)

File: .github/workflows/ci.yml Line (pre-fix): velvet-smoke job, Velvet import smoke test — app factory step Root cause: run: python -c 'from velvet.app import app; print("boot: OK")'

The bare YAML scalar value python -c 'from velvet.app import app; print("boot: OK")' contains OK" after a colon following whitespace. GitHub Actions' YAML parser (which uses Go's gopkg.in/yaml.v3) is stricter than PyYAML. It treated this as a mapping, producing a parse error. Because the parse error is at the job level, NONE of the jobs in ci.yml parse — the workflow appears to start but immediately produces 0 jobs, 0 seconds.

Evidence: 35+ wasted runner allocations from 2026-05-06 through 2026-05-13 (architect #1937 audit noted ~56% failure rate from this alone being ~31% of all failures).

Fix: Changed print("boot: OK") to print("boot OK") (removed colon from the Python string). Added a comment explaining why colons must be avoided in run: scalar print statements.

Also renamed: name: CIname: CI — main to resolve the name collision with ci-pr.yml (Finding 2).

Time saved: ~35 wasted runner allocations/day × 7 days = 245 wasted runs stopped. Every future push/PR now gets real CI output.


Finding 2: ci-pr.yml name collision with ci.yml

File: .github/workflows/ci-pr.yml Root cause: Both ci.yml and ci-pr.yml had name: CI. GitHub's branch protection required-checks config references workflow names. With two workflows sharing the same name, branch protection cannot distinguish which workflow's check to require. Additionally, the github.workflow concurrency group key produced collisions.

Fix: ci-pr.yml renamed to name: PR Gates. ci.yml renamed to name: CI — main. Concurrency group keys now differ naturally since they embed github.workflow.

Finding: main branch has NO branch protection configured at all (gh api repos/raxx-app/TradeMasterAPI/branches/main/protection returns 404 "Branch not protected"). This is a separate operator decision (see Phase 4 action items).


Finding 3: Heroku 429 on deploy (deploy-heroku.yml)

File: .github/workflows/deploy-heroku.yml Root cause: git push --force heroku "HEAD:refs/heads/main" makes a single attempt. The Heroku API rate-limits by token; burst conditions (multiple deploys firing in sequence or from concurrent workflows) produce HTTP 429 on the first API call. The deploy fails immediately with no retry.

Evidence: Run "Deploy to Heroku #734" failed with Heroku 429 on 2026-05-13.

Fix: Wrapped the git push in a retry loop: 5 attempts, quadratic backoff (i²×10s → 10s, 40s, 90s, 160s). The existing concurrency group heroku-deploy-staging already prevents multiple concurrent staging deploys, which limits burst frequency. The retry loop absorbs transient 429s within a single run.

Note: The 429 was on git push, which triggers Heroku's build API internally. The retry pattern is correct for this failure mode. For very sustained 429s (quota exhausted), escalation is needed.


Finding 4: Deploy modal sticks at QUEUED after smoke gate failure

Files: .github/workflows/deploy-heroku.yml, .github/actions/notify-deploy-status/action.yml Root cause: The Notify — building and Notify — failed steps live inside the deploy job. When the smoke gate job fails, the deploy job is skipped entirely (its if: always() && needs.smoke.result == 'success' condition evaluates false). No failed callback is ever posted to the console's deploy tracking endpoint. The modal polls for state but never sees a terminal state — it sits at QUEUED indefinitely.

Why the notify job didn't help: The notify job (which uses if: always()) was posting a PR comment but was NOT posting a callback to the deploy endpoint. The two notification paths (PR comment and deploy-endpoint callback) were not unified.

Fix: Added a Notify — failed (smoke gate or deploy skipped) step inside the notify job. This step: - Uses if: to fire only when needs.smoke.result != 'success' OR needs.deploy.result is skipped or cancelled - Requires a Checkout step first (composite actions need the repo to be available) - Posts status: failed with a descriptive log_line and failure_reason that distinguishes "smoke failed" from "deploy skipped" - Is a no-op when console_deploy_id is empty (manual runs, the action handles this internally)

File:line of bug: deploy-heroku.yml — the gap was the absence of a failed callback in the notify job when deploy was skipped.


Finding 5: Nightly security scan PR creation failure (#1975)

File: .github/workflows/nightly-security-scan.yml Root cause: The Mint raxx-ops-bot token step falls back to GITHUB_TOKEN when any of RAXX_OPS_BOT_APP_ID, RAXX_OPS_BOT_PRIVATE_KEY, RAXX_OPS_BOT_INSTALL_ID are missing or when the GitHub API call fails. In fallback mode, steps.mint.outputs.gh_token = GITHUB_TOKEN. The GH_TOKEN env var is then set to GITHUB_TOKEN, which the org-level "Actions cannot create PRs" toggle blocks.

Why git push succeeds but gh pr create fails: git push origin uses the credential helper from actions/checkout@v4 (which stores the checkout token, different from GITHUB_TOKEN behavior). gh pr create uses GH_TOKEN env var directly. When GH_TOKEN is GITHUB_TOKEN, the push (via checkout credentials) succeeds but the PR creation (via GH_TOKEN) fails.

Prior fix (#1820) and why it recurred: #1820 added the GH_TOKEN wiring. But it didn't add a guard that aborts the step when the mint fell back. If the secrets rotate or expire, the step silently fell back again and the PR creation failed the same way. The dangling branch (security/scan-YYYY-MM-DD) was created but no PR existed.

Fix: 1. Added a guard that checks BOT_IDENTITY == "raxx-ops-bot" and exits 1 with a diagnostic message if it's the fallback. This produces a visible failure rather than a silent dangling branch. 2. Changed git push to use URL-embedded token (https://x-access-token:${GH_TOKEN}@github.com/...) as the remote URL, ensuring git also uses the installation token rather than the checkout credential helper. This is consistent with memory/feedback_gh_actions_netrc_broken.md.

To verify fix: Check next scheduled run (08:07 UTC) for bot_identity=raxx-ops-bot in the mint step output.


What went well

What didn't go well

Root cause analysis

Detection

Resolution

Action items

# Action Owner Due Issue
1 Add actionlint step to ci-pr.yml to lint all workflow files on every PR sre-agent 2026-05-20 file below
2 Configure branch protection on main with required checks: CI — main / Backend tests, PR Gates / Sprint readiness gate, PR Gates / Secret scan (gitleaks) operator 2026-05-20 file below
3 Add alert/monitor for 0-job workflow runs (or wire Sentry CI integration) sre-agent 2026-05-27 file below
4 Audit top-5 slowest workflows for caching gaps + matrix overruns (Phase 3) sre-agent 2026-05-27 file below
5 Verify RAXX_OPS_BOT secrets are current; add rotation reminder to Velvet rotation schedule operator 2026-05-16 file below

Phase 3 time-saving proposals

See per-workflow time estimates below. All are filed as separate type:reliability GitHub issues.

ci.yml — estimated 4-5 min savings per run

  1. backend-tests-postgres job has no cache: pip on setup-python. Add cache: 'pip'. Estimated: ~1-2 min saved per run.
  2. security-secrets (gitleaks) fetches full history (fetch-depth: 0) which is correct and cannot be optimized. Accept this cost.
  3. velvet-smoke job installs Velvet deps but does not cache pip. Add cache: 'pip' to its setup-python step. Estimated: ~30s saved.
  4. The detect-changes job runs on every push/PR, even docs-only changes. This is intentional (the gate IS the optimization). No change.
  5. security-deps installs pip-audit then separately installs node — these could run in parallel if split into two jobs. Low-risk consolidation. Estimated: ~1-2 min saved.

deploy-heroku.yml — estimated 2-3 min savings per run

  1. Smoke gate (smoke job) installs both Python and Node even when only the backend changed. These are already parallelized via the smoke script — cannot eliminate either. Accept.
  2. fetch-depth: 0 on the checkout in the deploy job is required for git push --force. Cannot optimize.
  3. Heroku CLI install via curl ... | sh runs on every deploy. The binary is ~50MB. Consider caching the heroku CLI binary in the GitHub Actions cache (key: heroku-cli-version). Estimated: ~30s saved per deploy.
  4. Post-deploy health check has a fixed 10s sleep between retries with no upper bound. First-attempt hit rate is >90%; reducing initial sleep to 5s and keeping retries saves ~5-10s on the happy path.

deploy-console.yml — estimated 3-4 min savings per run

  1. git subtree split runs on every console deploy and is the slowest step (~60-90s on large histories). No optimization available without changing the deploy strategy. Accept for now; track under the post-launch Heroku-Container migration consideration.
  2. Tailwind CSS build runs npm ci before each deploy. Add npm cache via actions/setup-node@v4 with cache-dependency-path: console/package-lock.json. Estimated: ~1-2 min saved.
  3. The smoke gate duplicates actions/setup-node@v4 + npm ci for the frontend. This is shared with deploy-heroku.yml smoke. If these two deploy workflows ever run in parallel (different surfaces), runner cost doubles. Acceptable for now.

ci-pr.yml — estimated 3-5 min savings per run

  1. smoke_suite job installs both Python and Node sequentially. Both have caches configured (cache: 'pip' and cache: 'npm'). Cache hits should already be fast; verify cache hit rate in run logs before optimizing.
  2. migration-gate job checks out the full history (fetch-depth: 0) for an alembic diff. Alembic only needs the current HEAD plus the PR base SHA. Reducing to fetch-depth: 1 and explicitly fetching the base SHA could save ~30s on large repos.
  3. stale-branch-guard and base_branch_lint both do fetch-depth: 0 + fetch --all. These are sequential within each job but could share the checkout if merged. Low-priority.
  4. commitlint installs @commitlint/cli@19 + @commitlint/config-conventional@19 on every run with --no-save. No caching. Add cache: 'npm' or move to a composite action with a cached install. Estimated: ~1 min saved.

nightly-security-scan.yml — estimated 5-8 min savings per run

  1. The scan job installs pip-audit + bandit then downloads gitleaks + trivy. The apt-get for trivy hits the internet every run. No caching is possible for apt packages, but the gitleaks binary download could be cached (key: gitleaks-v8.30.1-linux-x64). Estimated: ~30s saved.
  2. trivy runs with --scanners vuln,misconfig,secret. The vuln scanner downloads the vulnerability database on every run (~200MB). Trivy supports a TRIVY_CACHE_DIR env var. Use actions/cache@v4 with key: trivy-db-<date> to cache the DB for 24h. Estimated: 2-3 min saved per nightly run.
  3. npm audit step under frontend/trademaster_ui runs without an npm ci first. If package-lock.json is cached, this is fast. If not, npm audit downloads metadata. Confirm cache is set up.

References