Raxx · internal docs

CI Failure Triage — 2026-05-18

Date: 2026-05-18
Author: sre-agent
Scope: All 6 open PRs (#2383, #2384, #2385, #2387, #2388, #2390) + last 10 main-branch runs per workflow


Failure table

| PR / branch | Workflow | Job | Test or step | Root cause | Fix owner | Already covered by | Estimated size |
|---|---|---|---|---|---|---|---|
| #2383 fix/resolve-ref-href-uri-schemes | CI — console | Console unit tests (pytest) | test_audit_selected_env.py::TestOpsDispatchMatrix::test_dispatch_noop_stub_returns_200 | card_groomer_pass action removed in PR #1961; route returns 404 | feature-developer | #2390 | XS: route stub or test update |
| #2383 | CI — console | Console unit tests (pytest) | test_audit_selected_env.py::TestOpsDispatchMatrix::test_dispatch_audit_row_has_selected_env | Same: missing card_groomer_pass action stub | feature-developer | #2390 | XS |
| #2383 | CI — console | Console unit tests (pytest) | test_vendor_name_cleanup_2067_2068.py::TestNavV1FreeScout::test_issues_link_title_no_freescout | "FreeScout" string surviving in rendered HTML when FLAG_CONSOLE_DASHBOARD_HOME is on (test ordering sensitivity) | feature-developer | #2390 | S |
| #2383 | CI — console | Console unit tests (pytest) | test_vendor_name_cleanup_2067_2068.py::TestNavV2FreeScout::test_nav_v2_no_freescout_anywhere | Same: nav v2 tooltips / aria labels / footer links contain "FreeScout" | feature-developer | #2390 | S |
| #2383 | CI — console | Console unit tests (pytest) | test_vendor_name_cleanup_2067_2068.py::TestRenderedHtmlNoVendorNames::test_dashboard_nav_v1_rendered_no_freescout | Same: full rendered dashboard (nav v1) contains "FreeScout" | feature-developer | #2390 | S |
| #2383 | CI — console | Console unit tests (pytest) | test_vendor_name_cleanup_2067_2068.py::TestRenderedHtmlNoVendorNames::test_dashboard_nav_v2_rendered_no_freescout | Same: full rendered dashboard (nav v2) contains "FreeScout" | feature-developer | #2390 | S |
| #2384 global-audit-has-detail-guard-2240 | CI — console | Console unit tests (pytest) | Same 6 tests as #2383 | Identical pre-existing main regressions | feature-developer | #2390 | same |
| #2384 | CI — console | Console unit tests (pytest) | Application context error (RuntimeError: Working outside of application context) | Separate test isolation bug surfacing alongside the primary failures | feature-developer | needs own fix if #2390 doesn't cover it | S |
| #2385 heroku-gh-secrets-token-925-fallback | CI — console | Console unit tests (pytest) | Same 6 tests as #2383 | Identical pre-existing main regressions | feature-developer | #2390 | same |
| #2387 sc-db-5-console-dashboard-v2-flag | CI — console | Console unit tests (pytest) | Same 6 tests as #2383 | Identical pre-existing main regressions | feature-developer | #2390 | same |
| #2388 console-b1-queue-billing | CI — console | Console unit tests (pytest) | Same 6 tests as #2383 | Identical pre-existing main regressions | feature-developer | #2390 | same |
| #2390 fix/vendor-name-ops-dispatch-regressions | CI — console | Console lint (ruff) | api_rbac_grants.py:115 C901 (complexity > 10), api_billing.py:53 I001, alerts.py:105 I001, admins_online.py:71 I001 | Ruff import sort + complexity violations in files touched by the fix PR itself | feature-developer | not yet covered | S: ruff --fix for I001, manual refactor for C901 |
| main (post-#2389 merge, run 26044051515) | CI — main | OpenAPI drift check | Check OpenAPI spec vs runtime routes | /internal/trace/sys-event [POST] registered in Flask (PR #2017) but absent from backend_v2/api/openapi.yaml; 135 runtime routes vs 136 spec paths | feature-developer | not covered | XS: add path entry to openapi.yaml |
| main (all pushes today, runs 26016820991 26016818910 26016114205 26016112336 et al.) | Release | release-please | Run release-please | illegal pathing characters in path: frontend/trademaster_ui/../../VERSION. release-please v17 rejects ../../ traversal in extra-files; config specifies "../../VERSION" and "../../backend_v2/version.txt" relative to the frontend/trademaster_ui package path | feature-developer | not covered | S: change extra-files to repo-root-relative paths or use root package strategy |
| main (run 26044052359, triggered by #2389 merge) | Deploy to Heroku | Synthetic gate (post-deploy) / File GitHub issue on failure | Checkout step | Reusable synthetic-gate workflow tries actions/checkout@v4 with a token that lacks repo read access (raxx-app/TradeMasterAPI, "Repository not found"). The job fires only when synthetic health checks fail, so this is a failure-path-only defect in the on-failure issue-filing job | sre-agent | not covered | S: pass GITHUB_TOKEN or raxx-ops-bot token with contents: read to reusable workflow |
| main (run 26016821022 + earlier) | Deploy internal docs | build-and-deploy | CF Access smoke verify | CF Access policy for internal-docs.raxx.app is set to decision=allow instead of decision=non_identity; service token carries no IdP identity → Access issues 302 to login page → curl sees HTTP 200 (login page) → smoke verify fails. Documented in feedback_cf_access_service_token_needs_non_identity.md | operator-decision | not covered | XS operator action: set policy to non_identity in CF dashboard |
| main (schedule, runs 26027433724 26025757785 et al.; 8/8 failing since 2026-05-12) | Terraform — email-delivery-stack | terraform-email-delivery-stack-plan | configure-aws-credentials@v4 | AWS credentials not loaded: OIDC or AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY secrets absent or expired for the TF runner context | operator-decision | not covered | M: provision AWS OIDC role or refresh static creds in GH secrets |
| main (schedule, 5/5 failing since 2026-05-14) | Billing retention cron | Run billing retention sweep | curl POST /api/internal/jobs/billing-retention | RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN repo secrets are empty strings; curl exits code 3 (URL malformed) before any HTTP request is made | operator-decision | not covered | XS operator action: set two repo secrets |
| main (schedule, 3/10 failing since 2026-05-16, 7/10 succeeding before) | Nightly Security Scan | Run scanners | gh pr create for findings report | raxx-ops-bot installation token mint returns 404 (installation token exchange failed: Not Found). GitHub App installation ID is stale or the App was reinstalled; falls back to github-actions[bot], which cannot create PRs | sre-agent | not covered | S: verify RAXX_OPS_BOT_INSTALL_ID in GH secrets; re-mint via App settings |
| feature/shadow-analytics-dpia-runbook (run 26016042725) | Queue Docker smoke | Detect Queue path changes | permissions check | Resource not accessible by integration: GITHUB_TOKEN lacked contents: read for the path-detection step on that branch | feature-developer | branch is closed/merged | N/A |
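
The Billing retention cron row describes curl failing with exit code 3 because both secrets arrive as empty strings. A pre-flight check run before the curl step would surface that cause directly. The following is a minimal sketch, not the workflow's actual code: the function name `missing_secrets` and the demo values are hypothetical, and only the two secret names come from this report.

```python
from collections.abc import Mapping

# Secret names taken from the failure table; everything else is illustrative.
REQUIRED_SECRETS = ("RAPTOR_INTERNAL_API_URL", "ADMIN_SERVICE_TOKEN")


def missing_secrets(env: Mapping[str, str],
                    names: tuple[str, ...] = REQUIRED_SECRETS) -> list[str]:
    """Return the required names that are unset or blank in env.

    An unset GitHub Actions secret reaches the runner as an empty string,
    so blank values are treated the same as missing keys.
    """
    return [name for name in names if not env.get(name, "").strip()]


# Hypothetical environment mirroring the failure: both values are blank.
demo_env = {"RAPTOR_INTERNAL_API_URL": "", "ADMIN_SERVICE_TOKEN": "   "}
problems = missing_secrets(demo_env)
if problems:
    print("missing or empty secrets: " + ", ".join(problems))
```

In a real job the same function could be called with `os.environ` and the step failed explicitly when the list is non-empty, so the log names the missing secret instead of reporting a malformed URL.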

Workflow-level health summary (main branch, last 10 runs each)

| Workflow | Pass rate (last 10) | Verdict | Notes |
|---|---|---|---|
| CI — main | 8/10 | flaky | OpenAPI drift check fails when backend changes land without openapi.yaml update. BFM (06:11–15:30 UTC) caused 1 additional failure (vault auth 403). |
| CI — console | 8/10 | flaky | FreeScout + OpsDispatch regressions in main since ~PR #2352 area; failing on all wave PRs. #2390 should resolve. |
| Release | 0/7 | broken | release-please ../../VERSION pathing error on every push to main since at least 2026-05-18T05:48 UTC. No successful Release run in the window sampled. |
| Deploy to Heroku | 4/7 (non-cancelled) | degraded | Earlier failures: heroku CLI not on PATH (exit 127), fixed by PR adding binary guard. Latest failure (#2389 merge): staging deploy succeeded but "File GitHub issue on failure" job fails when synthetic checks fail because the checkout token lacks repo access. |
| Deploy internal docs | 0/8 | broken | CF Access decision=allow vs decision=non_identity; blocking all runs since first observed 2026-05-18T05:26 UTC. |
| Terraform — email-delivery-stack | 0/8 | broken (chronic) | AWS credentials missing/expired. Failing since 2026-05-12. No evidence this workflow has ever successfully planned in the sampled window. |
| Billing retention cron | 0/5 | broken (chronic) | RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN secrets not set. Failing since 2026-05-14. |
| Nightly Security Scan | 3/10 failures | flaky | raxx-ops-bot installation token mint 404 introduced ~2026-05-16. Scan itself runs; only the PR-creation step fails. |
| Flag drift check (B2) | 2/2 success | healthy | |
| Daily card groomer | 1/1 success | healthy | |
| Queue Docker smoke | 1/1 failure on one branch | isolated | Only feature/shadow-analytics-dpia-runbook failed; all other branches pass. Permissions issue on that branch. |
| PR Gates (ci-pr.yml) | 10/10 success | healthy | Sprint Readiness gate itself is passing on all PRs. BFM 403 vault auth issue resolved at 15:30 UTC. |

Flakiness assessment

OpenAPI drift check (CI — main): Not flaky. The check fails deterministically on every push that includes backend route changes without a matching openapi.yaml update. The /internal/trace/sys-event route was registered in PR #2017 and the spec was never updated, so the check will keep failing until the path is added to the spec.
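
The comparison behind a drift check of this kind can be sketched as a set difference between the operations the app registers and the operations the spec documents. This is an illustration under stated assumptions, not the repository's actual check: the function names and the `/health` route are hypothetical, and the sketch does not translate Flask's `<param>` path syntax into OpenAPI's `{param}` form, which a real check would need to do.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def spec_operations(spec: dict) -> set[tuple[str, str]]:
    """Collect (path, METHOD) pairs from a parsed OpenAPI document."""
    operations = set()
    for path, item in spec.get("paths", {}).items():
        for key in item:
            # Path items may also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.add((path, key.upper()))
    return operations


def drift(runtime_ops, spec: dict) -> dict[str, list[tuple[str, str]]]:
    """Report operations present on one side but not the other."""
    runtime = {(path, method.upper()) for path, method in runtime_ops}
    documented = spec_operations(spec)
    return {
        "missing_from_spec": sorted(runtime - documented),
        "missing_from_runtime": sorted(documented - runtime),
    }


# Hypothetical inputs: one documented route plus the undocumented one
# named in this report.
runtime_ops = [("/health", "GET"), ("/internal/trace/sys-event", "POST")]
spec = {"paths": {"/health": {"get": {}}}}

report = drift(runtime_ops, spec)
print(report["missing_from_spec"])
```

Reporting both directions separately makes it clear whether the fix belongs in the spec (an undocumented route) or in the code (a documented route that no longer exists).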

CI — console vendor-name / ops-dispatch failures: Not flaky. The failures reproduce 100% of the time across all 5 wave PRs and earlier main runs. The vendor-name tests are order-sensitive: when FLAG_CONSOLE_DASHBOARD_HOME is on, the rendered templates still contain FreeScout strings that were never cleaned up. Within the CI test order the failure is deterministic.

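Nightly Security Scan: Failing only at the raxx-ops-bot token mint. The scan itself completes; the PR-creation step failed in 3 of the last 10 runs, all since 2026-05-16 (the 7 earlier runs succeeded), so this looks like a regression from that date rather than intermittent flakiness. The 404 from the installations endpoint suggests the App installation ID is stale, or that the App was reinstalled or temporarily suspended.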

Release workflow: Consistently failing — not flaky, structurally broken since the ../../VERSION path was introduced in the release-please config.
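
The distinction this section draws between flaky and deterministic failures can be expressed as a rule over a workflow's run history. The sketch below is one possible rule, offered as an assumption rather than the method used to produce this report; the label names ("healthy", "broken", "regressed", "flaky") and the function name are hypothetical.

```python
def classify(runs: list[bool]) -> str:
    """Classify a run history given oldest-first results (True = success).

    - "healthy":   every run passed
    - "broken":    every run failed
    - "regressed": runs passed, then failed on every run since the first failure
    - "flaky":     passes and failures are interleaved
    """
    if not runs:
        raise ValueError("need at least one run to classify")
    if all(runs):
        return "healthy"
    if not any(runs):
        return "broken"
    first_failure = runs.index(False)
    if not any(runs[first_failure:]):
        return "regressed"
    return "flaky"


# Histories shaped like two rows of the failure table: 7 passes followed by
# 3 failures (Nightly Security Scan) and 8 of 8 failing (Terraform).
print(classify([True] * 7 + [False] * 3))
print(classify([False] * 8))
```

Under this rule a workflow counts as flaky only when a pass follows a failure without any fix landing in between, which is why an unbroken run of failures after a known date reads as a regression instead.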


CI infrastructure issues

  1. Node.js 20 deprecation warning on every run. All workflows using actions/checkout@v4, actions/setup-node@v4, actions/setup-python@v5, or dorny/paths-filter@v3 will be forced onto Node.js 24 on 2026-06-02, with Node.js 20 removed on 2026-09-16. Nothing has failed yet, but the forced switch is 15 days away.

  2. RAXX_OPS_BOT_PRIVATE_KEY printed to Release workflow logs in plaintext. Run 26044051583 (fix(ci): smoke fixture crash on clean runner (#2361) (#2389)) has the full RSA private key printed in the release-please step output because release-please-action echoes its inputs. This is a SEV-1 secret exposure — the key was visible in run logs as of 15:43 UTC. The key should be treated as compromised and rotated immediately. This is a separate, urgent issue from the pathing bug.

  3. Synthetic-gate reusable workflow token scope. The File GitHub issue on failure job inside synthetic-gate.yml runs actions/checkout@v4 with the caller's GITHUB_TOKEN. That token has issues: write, yet the log shows remote: Repository not found, which suggests it cannot read repository contents. The likely cause is a missing contents: read permission in the deploy-heroku.yml permissions block for this reusable workflow call.

  4. billing-retention-cron.yml secrets not provisioned. RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN are blank in the runner environment. The workflow has been defined but the secrets were never added to the repository. This is a pre-launch gap, not a regression.

  5. Terraform AWS credentials absent since 2026-05-12. The configure-aws-credentials@v4 action cannot load credentials. The workflow was first introduced around that date and the OIDC trust policy or static key pair was never configured in GH secrets.

  6. release-please extra-files path traversal rejected in v17. The config combines the package path frontend/trademaster_ui with the relative extra-files entry ../../VERSION, giving frontend/trademaster_ui/../../VERSION, which resolves to VERSION at the repo root. Release-please v17.3.0 explicitly rejects path traversal. The fix is either to list repo-root-relative paths (e.g., VERSION, backend_v2/version.txt) under a root package, or to have the frontend/trademaster_ui package reference those files by absolute-from-root paths using the root extra-file strategy.

  7. PR #2390 ruff lint violations in files touched by the fix. The fix PR itself introduces or exposes four ruff violations (I001 import sort in api_billing.py, alerts.py, admins_online.py; C901 complexity in api_rbac_grants.py). The I001 violations are auto-fixable via ruff --fix; C901 requires a brief refactor. These are in files that pre-existed but were imported by the fix.
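
For item 6, the path arithmetic can be checked in a few lines. This sketch only illustrates how the configured entries resolve and how a traversal segment can be detected; it does not reproduce release-please's own validation, and the helper names are hypothetical.

```python
import posixpath


def resolve_extra_file(package_path: str, extra_file: str) -> str:
    """Join an extra-files entry onto its package path and normalise it."""
    return posixpath.normpath(posixpath.join(package_path, extra_file))


def has_traversal(extra_file: str) -> bool:
    """True when any path segment is '..', i.e. the entry leaves its package."""
    return ".." in extra_file.split("/")


package = "frontend/trademaster_ui"
for entry in ("../../VERSION", "../../backend_v2/version.txt"):
    # Each configured entry contains traversal but lands at a repo-root path.
    print(entry, has_traversal(entry), resolve_extra_file(package, entry))
```

Both entries normalise to paths at the repo root (VERSION and backend_v2/version.txt), which is consistent with the proposed fix of declaring them as root-relative paths instead.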


Action items (proposed)

| # | Action | Owner | Priority | Notes |
|---|---|---|---|---|
| 1 | URGENT: Rotate RAXX_OPS_BOT_PRIVATE_KEY (plaintext in run 26044051583 logs) | operator-decision | P0: do now | Key was exposed ~15:43 UTC; treat as compromised; regenerate GitHub App private key, update GH secret |
| 2 | Merge #2390 to unblock 5 wave PRs (vendor-name + ops-dispatch regressions) | feature-developer | P1 | #2390's only failing job is Console lint (ruff); the violations need a fix before merge (item 3) |
| 3 | Fix ruff violations in #2390 (I001 x3 + C901 x1) | feature-developer | P1: blocks #2390 merge | ruff --fix for I001; manual refactor for api_rbac_grants.py:115 |
| 4 | Add /internal/trace/sys-event [POST] to backend_v2/api/openapi.yaml | feature-developer | P1: CI — main broken | Route registered in #2017, spec never updated |
| 5 | Fix release-please extra-files path traversal in .release-please-config.json | feature-developer | P1: Release broken since 05:48 UTC | Change relative ../../VERSION paths to root-relative under a root package or use component strategy |
| 6 | Set CF Access policy for internal-docs.raxx.app to decision=non_identity | operator-decision | P1: Deploy internal docs broken | Per feedback_cf_access_service_token_needs_non_identity.md |
| 7 | Set RAPTOR_INTERNAL_API_URL and ADMIN_SERVICE_TOKEN in GH repo secrets | operator-decision | P2: billing cron broken | Pre-launch; no user impact yet |
| 8 | Provision AWS OIDC role or static creds for Terraform email-delivery-stack | operator-decision | P2: TF plan broken since 2026-05-12 | Check if TF stack is pre-launch gated; if so, can defer to infra sprint |
| 9 | Fix raxx-ops-bot installation token mint for nightly security scan | sre-agent | P2: scan runs but PR creation fails | Verify RAXX_OPS_BOT_INSTALL_ID in GH secrets; check App installation status |
| 10 | Fix synthetic-gate.yml "File GitHub issue on failure" checkout token scope | sre-agent | P3: only fires on synthetic health check failures | Add contents: read to deploy-heroku.yml reusable workflow permissions or pass explicit token |
| 11 | Bump pinned actions to Node.js 24 compatible versions before 2026-06-02 | sre-agent | P3: forced breakage on 2026-06-02 | actions/checkout@v4, setup-node@v4, setup-python@v5, dorny/paths-filter@v3 |
| 12 | Investigate and fix application context isolation error in test_global_audit_has_detail_guard tests on #2384 | feature-developer | P3: secondary failure, may resolve with #2390 merge | RuntimeError: Working outside of application context in PR #2384 only |

Context carried forward


References