Raxx · internal docs

RCA — Recurring GitHub Actions workflow failures (5 workflows)

Incident ID: 2026-06-30-recurring-workflow-failures
Date: 2026-06-30
Severity: SEV-2 (gatekeeper blocked; multiple detection gaps)
Duration: ~16 days (failures first observed ~2026-06-14; resolved 2026-06-30 via PR #3908)
Blast radius: Internal CI/CD pipeline: operator email flood; develop→release promotions blocked; degraded-surface detection gap; nightly security-scan label failures
Author: sre-agent

Summary

Five distinct GitHub Actions failure classes accumulated across seven workflows, generating a sustained flood of failure notification emails to the operator. The most critical failure was a YAML syntax error in gatekeeper-develop-to-release.yml that prevented every develop→release promotion (SEV-2). Three additional SEV-3 failures produced detection gaps: the console-degraded monitor could not mint a bot token (base64 vs raw PEM mismatch), the internal-docs deploy failed on an already-attached CF custom domain, and the nightly security scan failed to apply issue labels (wrong label name). All four were resolved via code fixes in PR #3908. release.yml has a fifth failure (GitHub App missing permissions) that requires an operator action documented below.

Timeline (all times UTC)

Impact

What went well

What didn't go well

Root cause analysis

Detection

Resolution

Fix 1 — gatekeeper YAML syntax

Replaced the multiline git commit -m "title\n\nbody" invocation with a commit message built by printf into a shell variable, so that every line stays within the indentation of the YAML run: block.

Fix 2 — deploy-internal-docs CF error code

Added 8000018 to the grep alternation in the custom-domain idempotency guard.
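The guard's matching logic can be illustrated in Python, though the real fix is a shell grep in the workflow. This is a minimal sketch: the function name, the sample payloads, and the placeholder code 1234567 are hypothetical; only 8000018 comes from this incident.

```python
import re


def is_already_attached(api_output, known_codes):
    """Return True if api_output mentions any known 'already attached' error code.

    known_codes is the alternation list; in this incident 8000018 was the
    code missing from it. Any other code passed in is the caller's assumption.
    """
    if not known_codes:
        # An empty alternation would match everywhere, so bail out early.
        return False
    alternation = "|".join(re.escape(code) for code in known_codes)
    # \b stops 8000018 from matching inside a longer digit run.
    return re.search(r"\b(?:" + alternation + r")\b", api_output) is not None


# Hypothetical payload shape, for illustration only.
print(is_already_attached('{"errors":[{"code":8000018}]}', ("1234567", "8000018")))
```

Treating a known "already attached" code as success is what makes the step idempotent: a rerun after a successful attach no longer fails the deploy.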

Fix 3 — console-degraded-auto-file PEM format

Replaced GH-secret-based token mint with the load-vault-secrets composite action (same pattern as gatekeeper + release.yml). Both vault-load and mint steps have continue-on-error: true; the run script already falls back to github.token if steps.mint_bot_token.outputs.token is empty.
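For workflows that still read the GH secret, a decoding step that accepts either format might look like the following. This is a sketch under assumptions: the function name is hypothetical, and it assumes the base64 form is simply the raw PEM text encoded once.

```python
import base64
import binascii

PEM_HEADER = "-----BEGIN"


def normalize_private_key(secret):
    """Return raw PEM text whether secret is raw PEM or base64-encoded PEM."""
    text = secret.strip()
    if text.startswith(PEM_HEADER):
        # Already raw PEM (the vault format).
        return text + "\n"
    try:
        # Drop line wraps, then decode strictly so garbage is rejected.
        raw = base64.b64decode("".join(text.split()), validate=True)
        decoded = raw.decode("utf-8").strip()
    except (binascii.Error, UnicodeDecodeError) as exc:
        raise ValueError("secret is neither raw PEM nor base64-encoded PEM") from exc
    if not decoded.startswith(PEM_HEADER):
        raise ValueError("decoded secret does not look like a PEM block")
    return decoded + "\n"
```

Failing loudly on an unrecognized format surfaces the mismatch at the decoding step, rather than later as an opaque token-mint failure.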

Fix 4 — nightly-security-scan label

Changed severity:medium to severity:medium-low in _SEV_LABEL_MAP and in the fallback default in scan_file_issues.py.
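A pre-flight check could catch this class of failure before the scan tries to apply a label. The sketch below is illustrative only: the helper name, the map key "medium", and the way the repository's label set is obtained are assumptions, not the contents of scan_file_issues.py.

```python
def missing_labels(label_map, default_label, repo_labels):
    """Return the labels the scanner would apply that the repo does not define."""
    wanted = set(label_map.values()) | {default_label}
    return sorted(wanted - set(repo_labels))


# Hypothetical inputs: the repo defines severity:medium-low but not severity:medium.
repo_labels = {"severity:medium-low"}
before_fix = {"medium": "severity:medium"}
after_fix = {"medium": "severity:medium-low"}

print(missing_labels(before_fix, "severity:medium", repo_labels))
print(missing_labels(after_fix, "severity:medium-low", repo_labels))
```

Checking the fallback default alongside the map matters here because the fix had to change both.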

Validation

All three modified workflow YAML files were validated with python3 yaml.safe_load(). PR #3908 targets develop and merges when CI passes.
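The validation step, and the failure it guards against, can be sketched as follows. This assumes PyYAML is installed; both snippets are illustrative reconstructions of the failure class from Fix 1, not the actual workflow contents.

```python
import yaml  # PyYAML, a third-party package


def yaml_error(text):
    """Return None if text parses as YAML, else the first line of the parser error."""
    try:
        yaml.safe_load(text)
    except yaml.YAMLError as exc:
        return str(exc).splitlines()[0]
    return None


# A multiline commit message whose body falls back to column 0 ends the run: block early.
BROKEN = (
    "steps:\n"
    "  - run: |\n"
    '      git commit -m "title\n'
    "\n"
    'body"\n'
)

# Assumed shape of the printf-based fix: every line stays inside the block.
FIXED = (
    "steps:\n"
    "  - run: |\n"
    "      msg=$(printf 'title\\n\\nbody')\n"
    '      git commit -m "$msg"\n'
)

print(yaml_error(BROKEN))
print(yaml_error(FIXED))
```

A check like this confirms only that the file parses as YAML; it does not validate GitHub Actions semantics.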

Action items

| # | Action | Owner | Due | Issue |
|---|--------|-------|-----|-------|
| 1 | Update raxx-ops-bot GitHub App installation: set Contents + Pull Requests to Read & Write | operator | 2026-07-03 | no issue (operator action) |
| 2 | Add actionlint to CI (workflow YAML linting on PR) to catch YAML syntax errors before merge | sre-agent / feature-dev | 2026-07-07 | file after PR #3908 merges |
| 3 | Write runbook for gatekeeper-develop-to-release.yml | sre-agent | 2026-07-07 | file after PR #3908 merges |
| 4 | Document RAXX_OPS_BOT_PRIVATE_KEY format inconsistency (vault=raw PEM; GH secret=base64); either sync formats or add a decoding step for workflows that still read the GH secret | sre-agent | 2026-07-07 | file after PR #3908 merges |
| 5 | Add a weekly workflow-health sweep that alerts if any workflow has >3 consecutive failures | sre-agent | 2026-07-14 | file after PR #3908 merges |
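The alert rule in action item 5 could be sketched as below. The function names and input shape are assumptions: each workflow's run conclusions are taken as a list of strings, newest first, and fetching them from the GitHub API is left out.

```python
def consecutive_failures(conclusions):
    """Count leading 'failure' entries in a newest-first list of run conclusions."""
    count = 0
    for conclusion in conclusions:
        if conclusion != "failure":
            break
        count += 1
    return count


def needs_alert(conclusions, threshold=3):
    """Alert only when the current failure streak is strictly greater than threshold."""
    return consecutive_failures(conclusions) > threshold


# Four failures in a row since the last success: the sweep should alert.
print(needs_alert(["failure", "failure", "failure", "failure", "success"]))
```

Counting only the current streak keeps a workflow that has since recovered from alerting, so older failures behind a recent success are ignored.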

References