RCA — Recurring GitHub Actions workflow failures (5 workflows)
Incident ID: 2026-06-30-recurring-workflow-failures
Date: 2026-06-30
Severity: SEV-2 (gatekeeper blocked; multiple detection gaps)
Duration: ~16 days (failures first observed ~2026-06-14; resolved 2026-06-30 via PR #3908)
Blast radius: Internal CI/CD pipeline — operator email flood; develop→release promotions blocked; degraded-surface detection gap; nightly security-scan label failures
Author: sre-agent
Summary
Five distinct GitHub Actions failure classes accumulated across seven
workflows, generating a sustained flood of failure notification emails to
the operator. The most critical failure was a YAML syntax error in
gatekeeper-develop-to-release.yml that prevented every develop→release
promotion (SEV-2). Three additional SEV-3 failures produced detection
gaps: the console-degraded monitor could not mint a bot token (base64 vs
raw PEM mismatch), the internal-docs deploy failed on an already-attached
CF custom domain, and the nightly security scan failed to apply issue
labels (wrong label name). All four were resolved via code fixes in
PR #3908. release.yml has a fifth failure (GitHub App missing permissions)
that requires an operator action documented below.
Timeline (all times UTC)
- ~2026-06-05 — nightly-security-scan begins failing on `severity:medium` label not found (first confirmed failure 2026-06-05 through 2026-06-07)
- ~2026-06-14 — gatekeeper-develop-to-release fails on every push to develop (YAML syntax error present from initial commit)
- ~2026-06-14 — console-degraded-auto-file fails on every run (base64 PEM issue present from workflow creation)
- ~2026-06-14 — deploy-internal-docs fails on every run (CF 8000018 not in grep pattern)
- 2026-06-30 — Operator reports workflow failure email flood; sre-agent triages all 7 workflows
- 2026-06-30 — Root causes identified for all 5 failure classes
- 2026-06-30 — Fixes 1-4 committed and PR #3908 opened targeting develop
- 2026-06-30 — RCA filed; operator action documented for release.yml
Impact
- Users affected: 0 (pre-launch; no customer traffic)
- User-visible symptoms: none
- Data integrity: ok
- Revenue / billing: ok
- Internal impact: develop→release promotions blocked for ~16 days; degraded-surface detection gap for same period; security scan label failures suppressed 3 days of findings; operator inbox flooded with failure emails
What went well
- All four code-fixable failures had clear, actionable root causes identifiable from logs alone
- The console-degraded-auto-file `notify-on-failure` job correctly surfaced the detection gap in step summary
- The nightly-security-scan `commit-summary` step had `continue-on-error: true` and did not mask the real failure (label add step)
- The gatekeeper had a clearly structured YAML block that made the indentation fault locatable via `python3 yaml.safe_load()`
- vault-loading pattern was already established in gatekeeper + release.yml — reuse was straightforward
What didn't go well
- Four distinct workflow failures accumulated over ~16 days with no automated triage; the system treated all failures as equivalent noise
- The YAML syntax error in the gatekeeper (highest-impact failure) was not caught at PR review time — no actionlint check in CI
- `RAXX_OPS_BOT_PRIVATE_KEY` GH secret was provisioned as base64 when the action requires raw PEM — inconsistency with vault storage format was not documented
- The CF Pages API error code 8000018 was not covered in the idempotency guard despite being the most common "already attached" code returned by the API
- `severity:medium` label was referenced in code without verifying the label existed in the repo
- No runbook existed for the gatekeeper workflow at the time of failure
Root cause analysis
- Contributing factor 1: Zero-indented commit message body in YAML run block — `gatekeeper-develop-to-release.yml` lines 287-297 used a `git commit -m "title\n\nbody line"` multi-line string at column 0 inside a `run: |` literal block. YAML ends a literal block at the first non-blank line indented less than the block's content, so the column-0 lines terminated the block early; lines that then contain `:` are parsed as YAML keys, producing a `ScannerError`. Every gatekeeper run failed at YAML parse time before any code executed.
- Contributing factor 2: GH secret stores base64 PEM; action requires raw PEM — `RAXX_OPS_BOT_PRIVATE_KEY` was set as a base64-encoded PEM in GH Actions secrets. `actions/create-github-app-token@v1` calls the JS `crypto.createPrivateKey()` API, which requires a raw PEM string starting with `-----BEGIN RSA PRIVATE KEY-----`. The mismatch produced "Invalid keyData" on every console-degraded-auto-file run. The vault at `/MooseQuest/raxx-ops-bot/PRIVATE_KEY_PEM` stores raw PEM (already the correct format used by gatekeeper/release.yml) — the issue was that the console workflow was reading from the wrong source.
- Contributing factor 3: CF API error code 8000018 absent from idempotency grep — `deploy-internal-docs.yml` wrapped the CF Pages custom-domain attach call in an idempotency guard that grepped for `already exists|already attached|already added|duplicate|8000000`. The CF API returns `8000018` (not `8000000`) for "You have already added this custom domain." The string `8000018` does not match the pattern, so the guard never fired and the step failed after 3 retries.
- Contributing factor 4: Non-existent severity label in scan script — `scan_file_issues.py` mapped `MEDIUM` bandit severity to GH label `severity:medium`. The repo uses `severity:medium-low`; `severity:medium` was never created. The `gh api issues/{n}/labels` call returned a 422 ("could not add label"), failing the "File findings as GitHub issues" step with no `continue-on-error` guard.
- Contributing factor 5 (operator action required): GitHub App installation missing write permissions — `release.yml` uses raxx-ops-bot to open release-please PRs. The App installation on TradeMasterAPI lacks `contents: write` and `pull-requests: write`. The API returns 403 "Resource not accessible by integration." This cannot be resolved via workflow code; it requires a GitHub App settings update by the operator.
Detection
- What alerted us: Operator reported email flood from GitHub Actions failure notifications
- How long between cause and detection: ~16 days for gatekeeper/console/deploy failures; ~3 days for label failures
- How to detect faster next time: add actionlint to CI (catches YAML syntax errors in workflow files before merge); add a weekly "workflow health sweep" step that counts failures per workflow and surfaces any with >3 consecutive failures as a SEV-3
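
The streak check behind the proposed workflow health sweep could look roughly like the sketch below. It is an illustration under stated assumptions, not the planned implementation: the function names, the input shape (run conclusions per workflow, newest first), and the sample data are all hypothetical, and fetching run history from GitHub is left out.

```python
from typing import Dict, List

FAILURE = "failure"  # conclusion string GitHub reports for a failed run


def consecutive_failures(conclusions: List[str]) -> int:
    """Count failed runs at the head of a newest-first list of conclusions."""
    count = 0
    for conclusion in conclusions:
        if conclusion != FAILURE:
            break  # the streak ends at the first non-failed run
        count += 1
    return count


def unhealthy_workflows(history: Dict[str, List[str]], threshold: int = 3) -> Dict[str, int]:
    """Return workflows whose current failure streak exceeds `threshold`."""
    streaks = {name: consecutive_failures(runs) for name, runs in history.items()}
    return {name: streak for name, streak in streaks.items() if streak > threshold}


# Hypothetical sample data, newest run first.
sample_history = {
    "gatekeeper-develop-to-release": ["failure"] * 5,
    "nightly-security-scan": ["success", "failure", "failure", "failure"],
    "release": ["failure", "failure", "success"],
}

flagged = unhealthy_workflows(sample_history)
print(flagged)  # {'gatekeeper-develop-to-release': 5}
```

Counting only the current streak, rather than total failures, keeps a workflow that has already recovered from being flagged again.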
Resolution
Fix 1 — gatekeeper YAML syntax
Replaced multiline git commit -m "title\n\nbody" with a printf-based variable so all lines remain within the YAML run: block indentation.
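
To make the indentation rule concrete, here is a small Python sketch that flags lines falling outside a `run: |` block. The commit message text, the 10-space content indentation, and the exact shape of the printf line are assumptions for illustration; the real lines in the workflow are not reproduced here.

```python
def underindented_lines(block: str, content_indent: int) -> list:
    """Return non-blank lines indented less than the block's content indentation."""
    flagged = []
    for line in block.splitlines():
        if not line.strip():
            continue  # blank lines never terminate a block scalar
        indent = len(line) - len(line.lstrip(" "))
        if indent < content_indent:
            flagged.append(line)
    return flagged


# Hypothetical "before": the message body sits at column 0.
BROKEN = r'''
          git commit -m "chore: promote develop

Promoted-by: gatekeeper"
          git push
'''

# Hypothetical "after": printf assembles the message on one source line.
FIXED = r'''
          MSG="$(printf '%s\n\n%s' 'chore: promote develop' 'Promoted-by: gatekeeper')"
          git commit -m "$MSG"
          git push
'''

print(underindented_lines(BROKEN, 10))  # ['Promoted-by: gatekeeper"']
print(underindented_lines(FIXED, 10))   # []
```

The check only covers the body of one known block; it is not a substitute for a real workflow linter.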
Fix 2 — deploy-internal-docs CF error code
Added 8000018 to the grep alternation in the custom-domain idempotency guard.
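
The effect of the added alternative can be checked with a short Python sketch. The sample response body is hypothetical and deliberately contains only the numeric error code; the helper name is also an assumption, and the sketch assumes the fix kept the existing alternatives and appended `8000018`.

```python
import re

# Alternation quoted in the root cause analysis.
OLD_PATTERN = re.compile(r"already exists|already attached|already added|duplicate|8000000")
# Same alternation with 8000018 appended (assumed shape of the fix).
NEW_PATTERN = re.compile(r"already exists|already attached|already added|duplicate|8000000|8000018")


def is_already_attached(response_body: str, pattern) -> bool:
    """Return True when the guard pattern matches anywhere in the response."""
    return pattern.search(response_body) is not None


# Hypothetical response body; only the error code comes from the RCA.
sample = '{"success": false, "errors": [{"code": 8000018}]}'

print(is_already_attached(sample, OLD_PATTERN))  # False
print(is_already_attached(sample, NEW_PATTERN))  # True
```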
Fix 3 — console-degraded-auto-file PEM format
Replaced GH-secret-based token mint with the load-vault-secrets composite action (same pattern as gatekeeper + release.yml). Both vault-load and mint steps have continue-on-error: true; the run script already falls back to github.token if steps.mint_bot_token.outputs.token is empty.
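
For workflows that still read the GH secret, the format mismatch could be absorbed by a normalization step. The sketch below is one hypothetical way to accept either form; PR #3908 did not take this route (it switched to the vault), and the function name and placeholder key text are assumptions.

```python
import base64
import binascii

PEM_PREFIX = "-----BEGIN"


def normalize_private_key(secret: str) -> str:
    """Return raw PEM text whether `secret` is raw PEM or base64-encoded PEM."""
    candidate = secret.strip()
    if candidate.startswith(PEM_PREFIX):
        return candidate  # already raw PEM
    compact = "".join(candidate.split())  # tolerate line-wrapped base64
    try:
        decoded = base64.b64decode(compact, validate=True).decode("ascii").strip()
    except (binascii.Error, UnicodeDecodeError) as exc:
        raise ValueError("secret is neither raw PEM nor base64-encoded PEM") from exc
    if not decoded.startswith(PEM_PREFIX):
        raise ValueError("decoded secret does not look like PEM")
    return decoded


# Placeholder text, not a real key.
fake_pem = "-----BEGIN RSA PRIVATE KEY-----\nnot-a-real-key\n-----END RSA PRIVATE KEY-----"
encoded = base64.b64encode(fake_pem.encode("ascii")).decode("ascii")

print(normalize_private_key(encoded) == fake_pem)   # True
print(normalize_private_key(fake_pem) == fake_pem)  # True
```

Raising on anything that is not recognizably PEM surfaces a bad secret at the decoding step instead of as an opaque "Invalid keyData" error later.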
Fix 4 — nightly-security-scan label
Changed severity:medium → severity:medium-low in _SEV_LABEL_MAP and the fallback default in scan_file_issues.py.
Validation
All three modified workflow YAML files validated via python3 yaml.safe_load(). PR #3908 targets develop; merges when CI passes.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Update raxx-ops-bot GitHub App installation: set Contents + Pull Requests to Read & Write | operator | 2026-07-03 | (no issue — operator action) |
| 2 | Add actionlint to CI (workflow YAML linting on PR) to catch YAML syntax errors before merge | sre-agent / feature-dev | 2026-07-07 | file after PR #3908 merges |
| 3 | Write runbook for gatekeeper-develop-to-release.yml | sre-agent | 2026-07-07 | file after PR #3908 merges |
| 4 | Document RAXX_OPS_BOT_PRIVATE_KEY format inconsistency (vault=raw PEM; GH secret=base64) — either sync formats or add a decoding step for workflows that still read the GH secret | sre-agent | 2026-07-07 | file after PR #3908 merges |
| 5 | Add a weekly workflow-health sweep that alerts if any workflow has >3 consecutive failures | sre-agent | 2026-07-14 | file after PR #3908 merges |
References
- PR: https://github.com/raxx-app/TradeMasterAPI/pull/3908
- Runbook (to be created): `docs/ops/runbooks/gatekeeper-develop-to-release.md`
- Related memory: `feedback_nightly_scan_dark_is_high.md`, `feedback_gh_actions_transitive_skip.md`