CI Failure Triage — 2026-05-18
Date: 2026-05-18 Author: sre-agent Scope: All 6 open PRs (#2383, #2384, #2385, #2387, #2388, #2390) + last 10 main-branch runs per workflow
Failure table
| PR / branch | Workflow | Job | Test or step | Root cause | Fix owner | Already covered by | Estimated size |
|---|---|---|---|---|---|---|---|
| #2383 fix/resolve-ref-href-uri-schemes | CI — console | Console unit tests (pytest) | test_audit_selected_env.py::TestOpsDispatchMatrix::test_dispatch_noop_stub_returns_200 | card_groomer_pass action removed in PR #1961; route returns 404 | feature-developer | #2390 | XS — route stub or test update |
| #2383 | CI — console | Console unit tests (pytest) | test_audit_selected_env.py::TestOpsDispatchMatrix::test_dispatch_audit_row_has_selected_env | Same — missing card_groomer_pass action stub | feature-developer | #2390 | XS |
| #2383 | CI — console | Console unit tests (pytest) | test_vendor_name_cleanup_2067_2068.py::TestNavV1FreeScout::test_issues_link_title_no_freescout | "FreeScout" string surviving in rendered HTML when FLAG_CONSOLE_DASHBOARD_HOME is on (test ordering sensitivity) | feature-developer | #2390 | S |
| #2383 | CI — console | Console unit tests (pytest) | test_vendor_name_cleanup_2067_2068.py::TestNavV2FreeScout::test_nav_v2_no_freescout_anywhere | Same — nav v2 tooltips / aria labels / footer links contain "FreeScout" | feature-developer | #2390 | S |
| #2383 | CI — console | Console unit tests (pytest) | test_vendor_name_cleanup_2067_2068.py::TestRenderedHtmlNoVendorNames::test_dashboard_nav_v1_rendered_no_freescout | Same — full rendered dashboard (nav v1) contains "FreeScout" | feature-developer | #2390 | S |
| #2383 | CI — console | Console unit tests (pytest) | test_vendor_name_cleanup_2067_2068.py::TestRenderedHtmlNoVendorNames::test_dashboard_nav_v2_rendered_no_freescout | Same — full rendered dashboard (nav v2) contains "FreeScout" | feature-developer | #2390 | S |
| #2384 global-audit-has-detail-guard-2240 | CI — console | Console unit tests (pytest) | Same 6 tests as #2383 | Identical pre-existing main regressions | feature-developer | #2390 | same |
| #2384 | CI — console | Console unit tests (pytest) | Application context error (RuntimeError: Working outside of application context) | Separate test isolation bug surfacing alongside the primary failures | feature-developer | needs own fix if #2390 doesn't cover it | S |
| #2385 heroku-gh-secrets-token-925-fallback | CI — console | Console unit tests (pytest) | Same 6 tests as #2383 | Identical pre-existing main regressions | feature-developer | #2390 | same |
| #2387 sc-db-5-console-dashboard-v2-flag | CI — console | Console unit tests (pytest) | Same 6 tests as #2383 | Identical pre-existing main regressions | feature-developer | #2390 | same |
| #2388 console-b1-queue-billing | CI — console | Console unit tests (pytest) | Same 6 tests as #2383 | Identical pre-existing main regressions | feature-developer | #2390 | same |
| #2390 fix/vendor-name-ops-dispatch-regressions | CI — console | Console lint (ruff) | api_rbac_grants.py:115 C901 (complexity > 10), api_billing.py:53 I001, alerts.py:105 I001, admins_online.py:71 I001 | Ruff import sort + complexity violations in files touched by the fix PR itself | feature-developer | not yet covered | S — ruff check --fix for I001, manual refactor for C901 |
| main (post-#2389 merge, run 26044051515) | CI — main | OpenAPI drift check | Check OpenAPI spec vs runtime routes | /internal/trace/sys-event [POST] registered in Flask (PR #2017) but absent from backend_v2/api/openapi.yaml — 135 runtime routes vs 136 spec paths | feature-developer | not covered | XS — add path entry to openapi.yaml |
| main (all pushes today, runs 26016820991 26016818910 26016114205 26016112336 et al.) | Release | release-please | Run release-please | illegal pathing characters in path: frontend/trademaster_ui/../../VERSION — release-please v17 rejects ../../ traversal in extra-files; config specifies "../../VERSION" and "../../backend_v2/version.txt" relative to the frontend/trademaster_ui package path | feature-developer | not covered | S — change extra-files to repo-root-relative paths or use root package strategy |
| main (run 26044052359, triggered by #2389 merge) | Deploy to Heroku | Synthetic gate (post-deploy) / File GitHub issue on failure | Checkout step | Reusable synthetic-gate workflow tries actions/checkout@v4 with a token that lacks repo read access (raxx-app/TradeMasterAPI — "Repository not found") — the job fires only when synthetic health checks fail, so this is a failure-path-only defect in the on-failure issue-filing job | sre-agent | not covered | S — pass GITHUB_TOKEN or raxx-ops-bot token with contents: read to reusable workflow |
| main (run 26016821022 + earlier) | Deploy internal docs | build-and-deploy | CF Access smoke verify | CF Access policy for internal-docs.raxx.app is set to decision=allow instead of decision=non_identity; service token carries no IdP identity → Access issues 302 to login page → curl sees HTTP 200 (login page) → smoke verify fails. Documented in feedback_cf_access_service_token_needs_non_identity.md | operator-decision | not covered | XS operator action — set policy to non_identity in CF dashboard |
| main (schedule, runs 26027433724 26025757785 et al. — 8/8 failing since 2026-05-12) | Terraform — email-delivery-stack | terraform-email-delivery-stack-plan | configure-aws-credentials@v4 | AWS credentials not loaded — OIDC or AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY secrets absent or expired for the TF runner context | operator-decision | not covered | M — provision AWS OIDC role or refresh static creds in GH secrets |
| main (schedule, 5/5 failing since 2026-05-14) | Billing retention cron | Run billing retention sweep | curl POST /api/internal/jobs/billing-retention | RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN repo secrets are empty strings; curl exits code 3 (URL malformed) before any HTTP request is made | operator-decision | not covered | XS operator action — set two repo secrets |
| main (schedule, 3/10 failing since 2026-05-16, 7/10 succeeding before) | Nightly Security Scan | Run scanners | gh pr create for findings report | raxx-ops-bot installation token mint returns 404 (installation token exchange failed: Not Found) — GitHub App installation ID is stale or the App was reinstalled; falls back to github-actions[bot] which cannot create PRs | sre-agent | not covered | S — verify RAXX_OPS_BOT_INSTALL_ID in GH secrets; re-mint via App settings |
| feature/shadow-analytics-dpia-runbook (run 26016042725) | Queue Docker smoke | Detect Queue path changes | permissions check | Resource not accessible by integration — GITHUB_TOKEN lacked contents: read for the path-detection step on that branch | feature-developer | branch is closed/merged | N/A |
Workflow-level health summary (main branch, last 10 runs each)
| Workflow | Pass rate (last 10) | Verdict | Notes |
|---|---|---|---|
| CI — main | 8/10 | flaky | OpenAPI drift check fails when backend changes land without openapi.yaml update. BFM (06:11–15:30 UTC) caused 1 additional failure (vault auth 403). |
| CI — console | 8/10 | flaky | FreeScout + OpsDispatch regressions in main since ~PR #2352 area; failing on all wave PRs. #2390 should resolve. |
| Release | 0/7 | broken | release-please ../../VERSION pathing error — every push to main since at least 2026-05-18T05:48 UTC. No successful Release run in the window sampled. |
| Deploy to Heroku | 4/7 (non-cancelled) | degraded | Earlier failures: heroku CLI not on PATH (exit 127) — fixed by PR adding binary guard. Latest failure (#2389 merge): staging deploy succeeded but "File GitHub issue on failure" job fails when synthetic checks fail because the checkout token lacks repo access. |
| Deploy internal docs | 0/8 | broken | CF Access decision=allow vs decision=non_identity — blocking all runs since first observed 2026-05-18T05:26 UTC. |
| Terraform — email-delivery-stack | 0/8 | broken (chronic) | AWS credentials missing/expired. Failing since 2026-05-12. No evidence this workflow has ever successfully planned in the sampled window. |
| Billing retention cron | 0/5 | broken (chronic) | RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN secrets not set. Failing since 2026-05-14. |
| Nightly Security Scan | 7/10 | flaky | raxx-ops-bot installation token mint 404 introduced ~2026-05-16; 3 of the last 10 runs failed. Scan itself runs; only the PR-creation step fails. |
| Flag drift check (B2) | 2/2 success | healthy | — |
| Daily card groomer | 1/1 success | healthy | — |
| Queue Docker smoke | 1/1 failure on one branch | isolated | Only feature/shadow-analytics-dpia-runbook failed; all other branches pass. Permissions issue on that branch. |
| PR Gates (ci-pr.yml) | 10/10 success | healthy | Sprint Readiness gate itself is passing on all PRs. BFM 403 vault auth issue resolved at 15:30 UTC. |
Flakiness assessment
OpenAPI drift check (CI — main): Not flaky. It fails deterministically on every push that includes backend route changes without a matching openapi.yaml update. The /internal/trace/sys-event route was registered in PR #2017 and the spec was never updated, so every CI — main run will fail until the spec entry is added.
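
The comparison behind a drift check like this can be sketched in a few lines. This is a minimal illustration, not the repo's actual check: it assumes the runtime routes are already available as (method, path) pairs and the spec as a parsed dict, and the `/api/health` entry is a hypothetical placeholder. It also ignores the difference between Flask's `<id>` and OpenAPI's `{id}` path-parameter syntax.

```python
from typing import Iterable

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def spec_operations(spec: dict) -> set[tuple[str, str]]:
    """Collect (METHOD, path) pairs from the `paths` object of an OpenAPI document."""
    ops = set()
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items can also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                ops.add((key.upper(), path))
    return ops


def drift(runtime: Iterable[tuple[str, str]], spec: dict) -> dict[str, list[tuple[str, str]]]:
    """Report operations present on one side but not the other."""
    runtime_ops = {(method.upper(), path) for method, path in runtime}
    documented = spec_operations(spec)
    return {
        "missing_from_spec": sorted(runtime_ops - documented),
        "missing_from_runtime": sorted(documented - runtime_ops),
    }


# Hypothetical inputs: one documented route, one route registered only at runtime.
spec = {"paths": {"/api/health": {"get": {}}}}
runtime = [("GET", "/api/health"), ("POST", "/internal/trace/sys-event")]
report = drift(runtime, spec)
print(report["missing_from_spec"])  # [('POST', '/internal/trace/sys-event')]
```

Reporting both directions matters here because a bare count comparison cannot tell a missing spec entry from a stale one.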
CI — console vendor-name / ops-dispatch failures: Not flaky. The failures reproduce 100% across all 5 wave PRs and earlier main runs. They depend on test ordering: once FLAG_CONSOLE_DASHBOARD_HOME has been enabled in the run, templates that were never cleaned up render "FreeScout" strings. For a given test order the failure is deterministic.
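
One common way to remove this kind of ordering sensitivity is to scope the flag to the test that needs it and restore the previous state afterwards. The sketch below assumes the flag is read from an environment variable, which this report does not confirm; `flag_set` and `nav_label` are hypothetical names used only for illustration.

```python
import os
from contextlib import contextmanager
from unittest import mock

FLAG = "FLAG_CONSOLE_DASHBOARD_HOME"


@contextmanager
def flag_set(value: str):
    """Set the flag inside the block; patch.dict restores os.environ on exit."""
    with mock.patch.dict(os.environ, {FLAG: value}):
        yield


def nav_label() -> str:
    # Hypothetical stand-in for template rendering that branches on the flag.
    return "Issues (dashboard home)" if os.environ.get(FLAG) == "1" else "Issues"


before = os.environ.get(FLAG)
with flag_set("1"):
    inside = nav_label()           # rendered with the flag on
restored = os.environ.get(FLAG)    # same value as before the block
print(inside, restored == before)
```

If the flag lives in application config rather than the environment, the same pattern applies with `mock.patch.dict` pointed at that mapping instead.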
Nightly Security Scan: Flaky only from the raxx-ops-bot token mint perspective. The scan itself completes; the PR-creation step failed in 3 of the last 10 runs, all since 2026-05-16. The 404 from the installations endpoint suggests the App installation ID rotated or the App was temporarily suspended.
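
A quick way to tell a stale installation ID from a suspended App is to compare the configured ID against the installation list. The sketch below assumes the list has already been fetched from the GitHub REST endpoint `GET /app/installations` (authenticated as the App) and parsed into dicts; it relies only on the `id` and `suspended_at` fields of each installation object. The IDs and the function name are hypothetical.

```python
def check_installation(installations: list[dict], configured_id: int) -> str:
    """Classify the configured installation ID as 'ok', 'suspended', or 'stale'."""
    for inst in installations:
        if inst.get("id") == configured_id:
            # suspended_at is null (None) for an active installation.
            return "suspended" if inst.get("suspended_at") else "ok"
    # No installation carries this ID, e.g. after the App was reinstalled.
    return "stale"


# Hypothetical payload, trimmed to the fields used above.
installations = [
    {"id": 1001, "suspended_at": None},
    {"id": 1002, "suspended_at": "2026-05-16T00:00:00Z"},
]
for candidate in (1001, 1002, 9999):
    print(candidate, check_installation(installations, candidate))
```

A `stale` result points at updating `RAXX_OPS_BOT_INSTALL_ID`; a `suspended` result points at the App's installation status instead.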
Release workflow: Consistently failing — not flaky, structurally broken since the ../../VERSION path was introduced in the release-please config.
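
To see why the configured entries are rejected even though they point at valid files, it helps to separate "contains a `..` segment" from "where the path ends up". The sketch below is an illustration using Python's `posixpath`, not release-please's own validation logic; the helper names are hypothetical.

```python
import posixpath


def resolve_extra_file(package_path: str, extra_file: str) -> str:
    """Join an extra-files entry onto its package path and normalize the result."""
    return posixpath.normpath(posixpath.join(package_path, extra_file))


def has_traversal(extra_file: str) -> bool:
    """True if any path segment is '..', which is what gets rejected."""
    return ".." in extra_file.split("/")


package = "frontend/trademaster_ui"
for entry in ("../../VERSION", "../../backend_v2/version.txt"):
    print(entry, "->", resolve_extra_file(package, entry), "| traversal:", has_traversal(entry))
# ../../VERSION -> VERSION | traversal: True
# ../../backend_v2/version.txt -> backend_v2/version.txt | traversal: True
```

Both entries resolve to repo-root files, so declaring them as `VERSION` and `backend_v2/version.txt` under a root package names the same targets without any `..` segment.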
CI infrastructure issues
- Node.js 20 deprecation warning on every run. All workflows using `actions/checkout@v4`, `actions/setup-node@v4`, `actions/setup-python@v5`, and `dorny/paths-filter@v3` will be forced to Node.js 24 on 2026-06-02, with Node.js 20 removed on 2026-09-16. Every CI run today carries this warning. No failures yet, but forced breakage is scheduled in 15 days.
- `RAXX_OPS_BOT_PRIVATE_KEY` printed to Release workflow logs in plaintext. Run 26044051583 (`fix(ci): smoke fixture crash on clean runner (#2361) (#2389)`) has the full RSA private key printed in the release-please step output because `release-please-action` echoes its inputs. This is a SEV-1 secret exposure: the key was visible in run logs as of 15:43 UTC. Treat the key as compromised and rotate it immediately. This is a separate, urgent issue from the pathing bug.
- Synthetic-gate reusable workflow token scope. The `File GitHub issue on failure` job inside `synthetic-gate.yml` runs `actions/checkout@v4` with the caller's `GITHUB_TOKEN`, which has `issues: write`, but the log shows `remote: Repository not found`. This suggests the token is valid for the workflow's installation but cannot clone the repo; most likely `contents: read` is missing from the `permissions` block in `deploy-heroku.yml` for this reusable workflow call.
- `billing-retention-cron.yml` secrets not provisioned. `RAPTOR_INTERNAL_API_URL` and `ADMIN_SERVICE_TOKEN` are blank in the runner environment. The workflow has been defined but the secrets were never added to the repository. This is a pre-launch gap, not a regression.
- Terraform AWS credentials absent since 2026-05-12. The `configure-aws-credentials@v4` action cannot load credentials. The workflow was first introduced around that date, and the OIDC trust policy or static key pair was never configured in GH secrets.
- `release-please` `extra-files` path traversal rejected in v17. The failing path `frontend/trademaster_ui/../../VERSION` is the package path `frontend/trademaster_ui` joined with the relative file path `../../VERSION`, which resolves to `VERSION` at the repo root. Release-please v17.3.0 explicitly rejects path traversal. The fix is either to use repo-root-relative paths (e.g., `VERSION` and `backend_v2/version.txt` keyed under a root package) or to have the `frontend/trademaster_ui` package reference `extra-files` with absolute-from-root paths using the `root` extra-file strategy.
- PR #2390 ruff lint violations in files touched by the fix. The fix PR itself introduces or exposes four ruff violations (`I001` import sort in `api_billing.py`, `alerts.py`, `admins_online.py`; `C901` complexity in `api_rbac_grants.py`). The `I001` violations are auto-fixable via `ruff check --fix`; `C901` requires a brief refactor. These are in files that pre-existed but were imported by the fix.
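
For the blank-secrets case in the list above, a preflight step could fail with a named cause instead of letting curl exit with code 3. The sketch below is one possible shape, assuming the two secrets reach the job as environment variables; `preflight` is a hypothetical helper and `https://example.invalid` is a placeholder URL.

```python
from urllib.parse import urlsplit

REQUIRED = ("RAPTOR_INTERNAL_API_URL", "ADMIN_SERVICE_TOKEN")


def preflight(env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the job may proceed."""
    problems = []
    for name in REQUIRED:
        # An unprovisioned secret arrives as an empty string, not as a missing key.
        if not env.get(name, "").strip():
            problems.append(f"{name} is unset or empty")
    url = env.get("RAPTOR_INTERNAL_API_URL", "").strip()
    if url:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append("RAPTOR_INTERNAL_API_URL is not an absolute http(s) URL")
    return problems


# Today's state as described above: both secrets are empty strings.
print(preflight({"RAPTOR_INTERNAL_API_URL": "", "ADMIN_SERVICE_TOKEN": ""}))
# Placeholder values that would pass the check.
print(preflight({"RAPTOR_INTERNAL_API_URL": "https://example.invalid", "ADMIN_SERVICE_TOKEN": "placeholder"}))
```

In a workflow, the step would pass `os.environ` to `preflight` and exit non-zero when the list is non-empty, so the failure names the missing secret.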
Action items (proposed)
| # | Action | Owner | Priority | Notes |
|---|---|---|---|---|
| 1 | URGENT: Rotate RAXX_OPS_BOT_PRIVATE_KEY — plaintext in run 26044051583 logs | operator-decision | P0 — do now | Key was exposed ~15:43 UTC; treat as compromised; regenerate GitHub App private key, update GH secret |
| 2 | Merge #2390 to unblock 5 wave PRs (vendor-name + ops-dispatch regressions) | feature-developer | P1 | #2390 unit tests are passing; the ruff lint violations must be fixed before merge (see #3) |
| 3 | Fix ruff violations in #2390 (I001 x3 + C901 x1) | feature-developer | P1 — blocks #2390 merge | ruff check --fix for I001; manual refactor for api_rbac_grants.py:115 |
| 4 | Add /internal/trace/sys-event [POST] to backend_v2/api/openapi.yaml | feature-developer | P1 — CI — main broken | Route registered in #2017, spec never updated |
| 5 | Fix release-please extra-files path traversal in .release-please-config.json | feature-developer | P1 — Release broken since 05:48 UTC | Change relative ../../VERSION paths to root-relative under a root package or use component strategy |
| 6 | Set CF Access policy for internal-docs.raxx.app to decision=non_identity | operator-decision | P1 — Deploy internal docs broken | Per feedback_cf_access_service_token_needs_non_identity.md |
| 7 | Set RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN in GH repo secrets | operator-decision | P2 — billing cron broken | Pre-launch; no user impact yet |
| 8 | Provision AWS OIDC role or static creds for Terraform email-delivery-stack | operator-decision | P2 — TF plan broken since 2026-05-12 | Check if TF stack is pre-launch gated; if so, can defer to infra sprint |
| 9 | Fix raxx-ops-bot installation token mint for nightly security scan | sre-agent | P2 — scan runs but PR creation fails | Verify RAXX_OPS_BOT_INSTALL_ID in GH secrets; check App installation status |
| 10 | Fix synthetic-gate.yml "File GitHub issue on failure" checkout token scope | sre-agent | P3 — only fires on synthetic health check failures | Add contents: read to deploy-heroku.yml reusable workflow permissions or pass explicit token |
| 11 | Bump pinned actions to Node.js 24 compatible versions before 2026-06-02 | sre-agent | P3 — forced breakage on 2026-06-02 | actions/checkout@v4, setup-node@v4, setup-python@v5, dorny/paths-filter@v3 |
| 12 | Investigate and fix application context isolation error in test_global_audit_has_detail_guard tests on #2384 | feature-developer | P3 — secondary failure, may resolve with #2390 merge | RuntimeError: Working outside of application context in PR #2384 only |
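
For action item 11, a small audit can list where the affected pins are used before the deadline. The sketch below scans workflow text for `uses:` lines with a regular expression; the watched list is copied from this report, the sample workflow is hypothetical, and nothing here claims which newer versions are Node.js 24 compatible.

```python
import re

# Action pins named in this report as still running on Node.js 20.
WATCHED = {
    "actions/checkout": "v4",
    "actions/setup-node": "v4",
    "actions/setup-python": "v5",
    "dorny/paths-filter": "v3",
}

# Matches "uses: owner/repo@ref", with or without a leading list dash.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([\w.-]+/[\w./-]+)@(\S+)", re.MULTILINE)


def find_pins(workflow_text: str) -> list[tuple[str, str]]:
    """Return (action, ref) pairs for every `uses:` line that pins a ref."""
    return [(m.group(1), m.group(2)) for m in USES_RE.finditer(workflow_text)]


def flagged(workflow_text: str, watched: dict[str, str]) -> list[str]:
    """Return the pins in the text that match the watched action and ref."""
    return [f"{action}@{ref}" for action, ref in find_pins(workflow_text) if watched.get(action) == ref]


# Hypothetical workflow fragment for illustration.
SAMPLE = """\
jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5  # pinned
      - name: hypothetical third-party step
        uses: some-org/other-action@v1
"""
print(flagged(SAMPLE, WATCHED))  # ['actions/checkout@v4', 'actions/setup-python@v5']
```

Running `flagged` over each file under `.github/workflows/` gives the list of places to bump; a regex is used instead of a YAML parser only to keep the sketch dependency-free.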
Context carried forward
- BFM (Cloudflare Bot Fight Mode) was disabled at ~15:30 UTC today to unblock vault auth (Infisical `universal-auth/login` returning 403 to GH Actions runner IPs). All PR Gate runs after 15:30 UTC pass Sprint Readiness. A WAF skip rule keyed on the `CF-Access-Client-Id` header is needed before BFM can be re-enabled (separate workstream).
- PR #2390 (`fix/vendor-name-ops-dispatch-regressions`) has the CI — console unit tests and PR Gates passing as of 16:15 UTC. Merge is blocked only by the ruff violations noted above (action item #3).
- Cancelled runs on older wave PR reruns are concurrency-cancellation artifacts (force-push rebases), not failures, per `feedback_pr_cancelled_checks_are_duplicates.md`.
References
- Runbook: `docs/ops/runbooks/ci-cd.md` (create if absent)
- Related incident: `docs/incidents/2026-05-18-p0-waf-chain-token-scope-gaps.md`
- BFM context: `docs/architecture/cf-provider-v5-upgrade-2026-05-18.md`
- Feedback: `docs/memory/feedback_cf_access_service_token_needs_non_identity.md`
- Feedback: `docs/memory/feedback_gh_actions_transitive_skip.md`
- Feedback: `docs/memory/feedback_heroku_config_set_echoes_secrets.md`