GitHub Actions Audit — 2026-06-05

Author: sre-agent
Date: 2026-06-05
Status: Recommendation-only. No workflow files modified. Operator approval required before any Phase execution.


§1 Inventory Table

Total workflows on disk: 72 (including __tests__/ subdirectory with one shell test — not a workflow file). Active registered workflows: 71. All are in active state per the GitHub API. Run data sampled from the most recent 500 completed runs as of 2026-06-05T14:00 UTC.

Duration estimates use the createdAt → updatedAt delta on successful runs (the best available via gh run list; it does not include queue wait time, which is typically 5–15s on the hosted runners).

Runs-per-month for cron workflows derived from cron expression parsing. PR/push trigger run-counts are derived from observed activity in the sample window. The sample window covers roughly 7–10 days of activity, so "runs last 30 days" for frequently-triggered workflows is extrapolated (noted where extrapolated).
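The cron-derived run counts can be reproduced with a small counter. The following is a minimal sketch, not the parser used for this audit: it assumes a 30-day window starting 2026-06-01, supports only `*`, `*/n`, `a-b`, and comma lists, and treats day-of-month and day-of-week as both required (standard cron ORs them when both are restricted). The expression `0 13 * * 1-5` is an assumed spelling of the "Mon–Fri 13:00 UTC" schedule.

```python
from datetime import date, timedelta


def expand(field, lo, hi):
    """Expand one cron field ('*', '*/n', 'a-b', 'a,b', 'a/n') into a set of ints."""
    values = set()
    for part in field.split(","):
        step = 1
        has_step = "/" in part
        if has_step:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            a, b = part.split("-")
            start, end = int(a), int(b)
        else:
            start = int(part)
            # 'a/n' means "from a, every n, up to the field maximum"
            end = hi if has_step else start
        values.update(range(start, end + 1, step))
    return values


def fires_in_window(expr, start=date(2026, 6, 1), days=30):
    """Count scheduled fires of a 5-field cron expression over a window of days."""
    minute, hour, dom, month, dow = expr.split()
    minutes = expand(minute, 0, 59)
    hours = expand(hour, 0, 23)
    doms = expand(dom, 1, 31)
    months = expand(month, 1, 12)
    dows = {d % 7 for d in expand(dow, 0, 7)}  # cron: 0 and 7 both mean Sunday
    total = 0
    for offset in range(days):
        day = start + timedelta(days=offset)
        cron_dow = (day.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
        if day.day in doms and day.month in months and cron_dow in dows:
            total += len(minutes) * len(hours)
    return total


print(fires_in_window("*/15 * * * *"))  # every 15 minutes
print(fires_in_window("0 */4 * * *"))   # every 4 hours
print(fires_in_window("0 13 * * 1-5"))  # weekdays at 13:00 UTC (assumed expression)
```

Scheduled runs on hosted runners can be delayed or dropped under load, so a count like this is an upper bound on observed fires.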

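Minute-cost rounding: GH Actions rounds every job up to the next whole minute, with a 1-minute minimum. A 9-second job that runs 96 times/month still costs 96 billed minutes. The Est. Min/Month column below is computed as average duration × runs ÷ 60 (for example, 13s × 2,880 ÷ 60 = 624), so it understates billed minutes for short, high-frequency jobs.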
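To make the difference between billed minutes and raw runtime concrete, here is a minimal sketch of both calculations. The function names are illustrative, and the single-job assumption for each workflow is ours; a workflow with several jobs is rounded up once per job.

```python
import math


def billed_minutes(job_seconds, runs):
    """Billed minutes: each job in a run is rounded up to a whole minute (minimum 1)."""
    per_run = sum(max(1, math.ceil(seconds / 60)) for seconds in job_seconds)
    return per_run * runs


def wall_clock_minutes(avg_run_seconds, runs):
    """Unrounded runtime in minutes: average run duration times run count."""
    return avg_run_seconds * runs / 60


# A single 9-second job, 96 runs per month
print(billed_minutes([9], 96), wall_clock_minutes(9, 96))

# A single 13-second job on an every-15-minute schedule (2,880 runs per month)
print(billed_minutes([13], 2880), wall_clock_minutes(13, 2880))
```

For short, frequent jobs the two figures diverge sharply, which is why schedule frequency matters more than per-run duration when estimating billed usage.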

# Workflow Name File Trigger Types Path Filter Jobs Avg Dur (s) Runs/Month Est. Est. Min/Month Last Conclusion
1 Alembic version drift check alembic-version-cron.yml schedule (06:00 UTC daily), dispatch none 3 51 30 27 success
2 Antlers Next.js CI antlers-next-ci.yml push+PR on frontend/raxx-next/** yes 6 164 ~60 165 failure
3 BCP vault snapshot daily bcp-vault-snapshot-daily.yml schedule (07:00 UTC daily), dispatch none 15 30 8 failure (all)
4 BCP smoke monthly bcp-smoke-monthly.yml schedule (1st of month 08:00), dispatch none unable to measure 1 1 never fired
5 Billing collector cron billing-collector-cron.yml dispatch only (PAUSED) none n/a 0 0 never fired
6 Billing retention cron billing-retention-cron.yml schedule (03:00 UTC daily), dispatch none 11 30 6 failure (all)
7 Bot token smoke daily-bot-token-smoke.yml schedule (06:30 UTC daily), dispatch none 9 30 5 success
8 CF Access header lint lint-cf-access-headers.yml PR + push/main on .github/workflows/**, scripts/ci/lint_cf_access_headers.py yes 16 ~30 8 success
9 CF Pages deploy uniqueness lint lint-cf-pages-deploy-uniqueness.yml PR + push/main on .github/workflows/** yes 16 ~30 8 success
10 CF token usage lint lint-cf-tokens.yml PR + push/main on .github/workflows/** yes 17 ~30 8 success
11 CI Digest daily ci-digest-cron.yml schedule (07:00 UTC daily), dispatch none 22 30 11 success
12 CI — console ci-console.yml PR + push/main on console/** yes 66 ~80 88 failure
13 CI — main ci.yml push/main (no path filter) none 183 ~45 138 failure
14 Console review app review-app-console.yml PR on console/** yes 154 ~25 64 success
15 Cron heartbeat monitor cron-heartbeat-monitor.yml schedule (06:30 UTC daily), dispatch none 14 30 7 success
16 Daily card groomer daily-card-groomer.yml schedule (09:00 UTC daily), dispatch none 67 30 34 success
17 Deploy Antlers LEGACY CRA deploy-antlers.yml push/main on frontend/trademaster_ui/**, dispatch yes n/a 0 (PAUSED 2026-06-01) 0 never (paused)
18 Deploy Antlers Next — prod deploy-antlers-next-prod.yml push/main on frontend/raxx-next/**, dispatch yes 161 ~20 54 success
19 Deploy Antlers Next — staging deploy-antlers-next-staging.yml push/main on frontend/raxx-next/**, dispatch yes 161 ~20 54 success
20 Deploy Antlers cutover deploy-antlers-cutover.yml dispatch only none unable to measure 0 0 never fired
21 Deploy Queue deploy-queue.yml push/main on queue/**, dispatch yes unable to measure ~2 14 never (no recent queue changes)
22 Deploy Queue failure streak monitor deploy-queue-failure-monitor.yml workflow_run on Deploy Queue none unable to measure ~2 2 never
23 Deploy Velvet deploy-velvet.yml push/main on velvet/**, dispatch yes 98 ~3 5 success
24 Deploy console deploy-console.yml push/main on console/**, dispatch yes 198 ~8 27 success
25 Deploy console shim Worker deploy-console-shim.yml push/main on infra/cf-workers/console-deploy-shim/**, dispatch yes unable to measure ~1 2 never fired recently
26 Deploy customer docs deploy-customer-docs.yml push/main on docs/customer/**, dispatch yes unable to measure ~3 3 never fired recently
27 Deploy design mockups deploy-mockups.yml push/main on docs/design/**, mockups-site/**, dispatch yes 46 ~8 7 success
28 Deploy failure streak alert deploy-failure-streak-alert.yml workflow_run on 14 deploy workflows none 7 ~47 6 success
29 Deploy flag docs deploy-flag-docs.yml push+PR on backend_v2/api/feature_flags.yaml yes 28 ~10 5 success
30 Deploy getraxx landing page deploy-getraxx.yml push/main on frontend/getraxx-landing/**, dispatch yes 98 ~10 17 failure (all)
31 Deploy internal docs deploy-internal-docs.yml push/main on docs/**/*.md, backend_v2/api/feature_flags.yaml, dispatch yes 56 ~12 11 failure
32 Deploy status Worker deploy-status-worker.yml push/main on frontend/status-worker/**, dispatch yes unable to measure ~1 1 never fired recently
33 Deploy status page deploy-status-page.yml push/main on frontend/status-page/**, dispatch yes 62 ~1 1 success
34 Deploy support portal deploy-support.yml push/main on frontend/support/**, dispatch yes unable to measure ~1 1 never fired recently
35 Deploy to Heroku deploy-heroku.yml push/main (no path filter), dispatch none 159 ~45 120 success
36 Deploy waf-log-shipper Worker deploy-worker-waf-log-shipper.yml push/main on workers/waf-log-shipper/**, dispatch yes unable to measure ~1 1 never fired recently
37 Drift orchestrator cron drift-orchestrator-cron.yml schedule (06:00 UTC daily), dispatch none 25 30 13 success
38 E2E smoke (signup→onboarding) e2e-smoke.yml PR + schedule, dispatch none 142 ~35 (PR) + 30 (schedule) = 65 154 failure (all)
39 Flag drift check flag-drift-check.yml schedule (every 4h), dispatch none 32 180 96 success
40 FreeScout daily backup freescout-backup.yml schedule (06:00 UTC daily), dispatch none 25 30 13 success
41 FreeScout Terraform apply freescout-apply.yml dispatch only none unable to measure 0 0 never fired
42 Heroku config-var health heroku-config-health-nightly.yml schedule (05:00 UTC daily), dispatch none 22 30 11 failure (all — real findings)
43 Launch Readiness Check launch-readiness-check.yml dispatch only none unable to measure 0 0 never fired
44 Nightly Security Scan nightly-security-scan.yml schedule (08:07 UTC daily), dispatch none 136 30 68 failure
45 Nightly 1-min bar cache warm historical-bars-1min-warm-nightly.yml schedule (22:00 UTC daily + 23:00 Sun), dispatch none 11 34 7 failure (all)
46 Nightly second-factor reminders reminders-second-factor-nightly.yml schedule (14:00 UTC daily), dispatch none 15 30 8 failure (all)
47 PII scan pii-scan.yml PR + push/main (no path filter) none 17 ~75 22 success
48 PR Gates ci-pr.yml PR, dispatch none 74 ~50 62 success
49 PR Preview pr-preview.yml PR (detect-then-run pattern) none (internal detect) 25 ~50 21 success
50 Queue Docker smoke queue-docker-smoke.yml PR (no path filter on outer trigger, detect-job inside) none (detect inside) 13 ~50 11 success
51 Queue vcpkg manifest check vcpkg-manifest-check.yml PR on queue/vcpkg.json, queue/Dockerfile yes unable to measure ~2 1 never fired recently
52 Queue zero-dyno monitor queue-zero-dyno-monitor.yml schedule (every 15 min), dispatch none 13 2,880 624 success
53 Release release.yml push/main (no path filter), dispatch none 72 ~45 54 failure (all — release-please config bug)
54 SC-IDENT-6 Bot identity smoke sc-ident-6-smoke.yml dispatch only none unable to measure 0 0 never fired
55 Security OWASP ZAP security-zap.yml PR on frontend/raxx-next/** + backend_v2/api/**; schedule Mon 09:07 UTC yes 15 ~35 PR + 4 sched = 39 11 success
56 Slack Notify slack-notify.yml workflow_run on 'CI' (mismatch — see §3) none unable to measure 0 (never fires) 0 never fired
57 Smoke test cloudflare-retry test-cloudflare-retry-action.yml dispatch only none unable to measure 0 0 never fired
58 Smoke test emit-audit-event test-emit-audit-event-action.yml dispatch only none unable to measure 0 0 never fired
59 Surface degraded auto-file console-degraded-auto-file.yml schedule (every 15 min), dispatch none 25 2,880 1,200 failure (all — broken private key)
60 Synthetic Gate synthetic-gate.yml workflow_call, schedule (Mon–Fri 13:00 UTC), dispatch none 43 ~22 16 success
61 Terraform validate terraform-validate.yml PR + push/main on terraform/** yes unable to measure ~5 3 never fired recently
62 Terraform email-delivery-stack terraform-email-delivery-stack.yml PR + push/main + schedule (06:00 UTC daily) on terraform/modules/email-delivery-stack/** yes + schedule 9 30 sched + ~2 push = 32 5 success
63 Test notify-deploy-status action test-notify-deploy-status.yml PR + push/main on .github/actions/notify-deploy-status/** yes unable to measure ~2 1 never fired recently
64 Trace integrity cron trace-integrity-cron.yml schedule (02:00 UTC daily), dispatch none 11 30 6 failure (all)
65 WAF Synthetic Probes waf-synthetic-probe.yml schedule (Mon–Fri 14:00 UTC), dispatch none 27 22 10 failure (all)
66 Workflow secret names lint lint-workflow-secret-names.yml PR + push/main on .github/workflows/** yes 16 ~30 8 success
67 iOS CI ios-ci.yml push + PR on ios/** yes 24 ~5 2 success
68 mbt-drift-daily mbt-drift-daily.yml schedule (23:00 UTC daily), dispatch none 28 30 14 success
69 mbt-drift-per-symbol-weekly mbt-drift-per-symbol-weekly.yml schedule (Sun 23:30 UTC), dispatch none unable to measure 4 2 never fired recently
70 mbt-resting-orders-cron mbt-resting-orders-cron.yml schedule (every 5 min, Mon–Fri 13:00–20:00 UTC) none 43 ~480 344 success
71 tickets E2E smoke tickets-e2e-smoke.yml schedule, dispatch none 38 30 19 failure

Estimated total minutes/month (current state): ~3,400–3,700 min/month

The three highest single contributors:
1. surface-degraded-auto-file: ~1,200 min/month (every-15-min schedule, 25s per run × 2,880 fires ÷ 60; currently failing 100%)
2. queue-zero-dyno-monitor: ~624 min/month (every-15-min schedule, always runs, 13s per run × 2,880 fires ÷ 60)
3. mbt-resting-orders-cron: ~344 min/month (every 5 min during weekday trading hours)


§2 Categorization

Critical path — must run on every PR or push to main

Workflow | Verdict
PR Gates (ci-pr.yml) | Keep: comprehensive gate, path-aware detect-changes
PII scan (pii-scan.yml) | Keep: operator instruction, never remove
CI — main (ci.yml) | Keep: push-to-main integration tests; add path filter (see §3)
CI — console (ci-console.yml) | Keep: already path-filtered to console/**
Antlers Next.js CI (antlers-next-ci.yml) | Keep: already path-filtered to frontend/raxx-next/**
iOS CI (ios-ci.yml) | Keep: already path-filtered to ios/**
Security OWASP ZAP (security-zap.yml) | Keep: path-filtered to frontend/raxx-next/** + backend_v2/api/**. The path was previously frontend/trademaster_ui/** and has already been updated to raxx-next, which is correct
Queue Docker smoke (queue-docker-smoke.yml) | Keep: no outer path filter, but the internal detect-job pattern handles it correctly
Terraform validate (terraform-validate.yml) | Keep: path-filtered to terraform/**
Queue vcpkg manifest check (vcpkg-manifest-check.yml) | Keep: path-filtered to queue/vcpkg.json and queue/Dockerfile

Path-filtered — correct, keep

These workflows trigger only when relevant paths change and behave correctly:

deploy-antlers-next-prod.yml, deploy-antlers-next-staging.yml, deploy-console.yml, deploy-velvet.yml, deploy-queue.yml, deploy-getraxx.yml, deploy-internal-docs.yml, deploy-customer-docs.yml, deploy-flag-docs.yml, deploy-mockups.yml, deploy-status-page.yml, deploy-status-worker.yml, deploy-support.yml, deploy-console-shim.yml, deploy-worker-waf-log-shipper.yml, lint-cf-access-headers.yml, lint-cf-tokens.yml, lint-cf-pages-deploy-uniqueness.yml, lint-workflow-secret-names.yml, review-app-console.yml, test-notify-deploy-status.yml

Scheduled — keep (firing meaningful output)

Workflow | Cron | Output quality
Alembic version drift check | daily 06:00 | Functional: checks staging+prod migration heads
BCP vault snapshot daily | daily 07:00 | Currently broken: operator-action, see §3
Bot token smoke | daily 06:30 | Functional: 100% success
CI Digest daily | daily 07:00 | Functional: digest posture per feedback_pre_launch_digest_notifications
Cron heartbeat monitor | daily 06:30 | Functional: monitors other crons
Daily card groomer | daily 09:00 | Functional: 100% success
Drift orchestrator cron | daily 06:00 | Functional: 100% success
Flag drift check | every 4h | Functional: 100% success; consider relaxing to daily
FreeScout daily backup | daily 06:00 | Functional: 100% success
Heroku config-var health | daily 05:00 | Failing: real finding (staging REDIS_URL absent), see §3
Nightly Security Scan | daily 08:07 | Partially functional: failing, see §3
Nightly 1-min bar cache warm | daily 22:00 | Broken 100%: see §3
Nightly second-factor reminders | daily 14:00 | Broken 100%: see §3
mbt-drift-daily | daily 23:00 | Functional: 100% success
mbt-drift-per-symbol-weekly | Sun 23:30 | Functional: no recent runs (correct, runs weekly)
mbt-resting-orders-cron | every 5 min weekday | Functional: 100% success; high cost (~344 min/month)
Queue zero-dyno monitor | every 15 min | Functional: 100% success; highest cost among passing workflows (~624 min/month)
Billing retention cron | daily 03:00 | Broken 100%: related to Postgres migration, see §3
Trace integrity cron | daily 02:00 | Broken 100%: see §3
WAF Synthetic Probes | weekday 14:00 | Broken 100%: see §3
Tickets E2E smoke | scheduled | Broken 100%: see §3
E2E smoke (signup→onboarding) | scheduled + PR | Broken 100%: see §3
Synthetic Gate | weekday 13:00 | Functional: 100% success
Terraform email-delivery-stack | daily 06:00 + path | Functional: 100% success

Dispatch-only — runs on demand, zero recurring cost

These fire zero recurring cycles. Keep all; they are on-demand tools.

billing-collector-cron.yml (explicitly paused), deploy-antlers-cutover.yml, freescout-apply.yml, launch-readiness-check.yml, sc-ident-6-smoke.yml, test-cloudflare-retry-action.yml, test-emit-audit-event-action.yml

Always-skip / dead triggers

Workflow | Issue
Slack Notify (slack-notify.yml) | workflow_run watches ['CI'] but the CI workflow name is 'CI — main'. Because of this name mismatch, the workflow has never fired.
Deploy Antlers LEGACY CRA (deploy-antlers.yml) | Paused 2026-06-01. Path is frontend/trademaster_ui/**; the CRA app is superseded. Fires zero times.

Actively broken — 100% failure rate

These have produced zero successes in the entire observed sample window and are consuming minutes without output:

Workflow | Observed Failures | Root Cause (from log analysis)
Surface degraded auto-file | 16/16 | Failed to read private key: GitHub App private key secret is invalid/expired
Release | 18/18 | release-please failed: illegal pathing characters in path: frontend/trademaster_ui/../../VERSION (release-please config references stale CRA path)
Deploy getraxx landing page | 4/4 | Cannot find module 'playwright': smoke job missing npx playwright install step
BCP vault snapshot daily | 2/2 | Unable to determine root cause from log tail; vault auth or S3 write failure
Billing retention cron | 2/2 | Postgres migration in flight (project_raptor_postgres_migration_decision)
Nightly 1-min bar cache warm | 2/2 | Unable to determine from log; likely Postgres/Heroku auth
Nightly second-factor reminders | 2/2 | Unable to determine from log
Trace integrity cron | 2/2 | Unable to determine from log
WAF Synthetic Probes | 2/2 | Unable to determine from log
Tickets E2E smoke | 1/1 | Unable to determine from log
E2E smoke (signup→onboarding) | 4/4 | Unable to determine from log
Deploy internal docs | 4 failures in 12 | emit-audit-event/action.yml: Unrecognized named-value: 'github' in composite action template
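A zero-success tally like the one above can be derived from sampled run records. This is a minimal sketch, assuming each record is a dict with the `workflowName` and `conclusion` fields that `gh run list --json` can emit; the sample data is invented for illustration and is not taken from the audit.

```python
from collections import Counter


def zero_success_workflows(runs):
    """Return {workflow: 'non-success/total'} for workflows with no successful run."""
    totals = Counter()
    successes = Counter()
    for run in runs:
        name = run["workflowName"]
        totals[name] += 1
        if run["conclusion"] == "success":
            successes[name] += 1
    return {
        name: f"{totals[name]}/{totals[name]}"
        for name in totals
        if successes[name] == 0
    }


# Hypothetical sample, for illustration only
sample = [
    {"workflowName": "Release", "conclusion": "failure"},
    {"workflowName": "Release", "conclusion": "failure"},
    {"workflowName": "PR Gates", "conclusion": "success"},
    {"workflowName": "PR Gates", "conclusion": "failure"},
]
print(zero_success_workflows(sample))
```

Workflows with mixed results, such as Deploy internal docs, need a separate failure-rate check because this filter only reports workflows that never succeeded in the sample.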

§3 Cycle-Waster Findings

CW-1: surface-degraded-auto-file running every 15 minutes 100% broken

File: console-degraded-auto-file.yml
Cron: */15 * * * * (2,880 fires/month)
Cost: ~1,200 min/month (25s avg × 2,880 fires ÷ 60)
Failure cause: DOMException [DataError]: Invalid keyData / Failed to read private key. The GitHub App private key stored as a secret is malformed or expired.
Impact: Zero monitoring output for the entire observable window. The workflow is consuming ~1,200 minutes per month for no operational value.
Operator action (prerequisite for the Phase 1 fix): Rotate or re-provision the GitHub App private key secret. Once the workflow is functional, consider reducing frequency to every 30 minutes, which halves the cost with no meaningful change in detection latency.

CW-2: release.yml firing on every push to main, 0% success rate

File: release.yml
Trigger: push to main, no path filter
Cost: ~54 min/month (72s avg × ~45 pushes/month)
Failure cause: release-please failed: illegal pathing characters in path: frontend/trademaster_ui/../../VERSION. The release-please configuration references a path component from the legacy CRA layout that has since been superseded. The release-please-config.json almost certainly contains a reference to frontend/trademaster_ui with a relative ../../VERSION extra-component path.
Impact: Every push to main produces a failed workflow run. This is visible noise that trains reviewers to ignore red workflow icons, which is a detection-erosion risk.
Fix: Update release-please-config.json to remove or correct the frontend/trademaster_ui component path. This is a one-line config change, not a workflow change.

CW-3: queue-zero-dyno-monitor running every 15 minutes, healthy but high-frequency

File: queue-zero-dyno-monitor.yml
Cron: */15 * * * * (2,880 fires/month)
Cost: ~624 min/month (13s avg × 2,880 fires ÷ 60)
Output: 100% success, legitimately monitoring Queue dyno count.
Issue: A 15-minute frequency is fine for production SLO coverage, but this is pre-launch. The Queue service is not yet customer-facing. Detecting zero dynos within 15 minutes rather than within 30 minutes makes no difference at the current stage.
Proposed change: Relax to */30 * * * *, which saves ~312 min/month with zero customer-facing impact.

CW-4: flag-drift-check running every 4 hours

File: flag-drift-check.yml
Cron: 0 */4 * * * (180 fires/month)
Cost: ~96 min/month
Output: 100% success, checks flag YAML drift.
Issue: Flag changes happen via PRs (a human action). A flag drift state that exists at 04:00 UTC will still exist at 08:00 UTC. Daily detection is sufficient.
Proposed change: Change to 0 6 * * * (daily), which saves ~66 min/month.

CW-5: ci.yml (CI — main) has no path filter, runs full suite on every push

File: ci.yml
Trigger: push to main, no path filter
Cost: ~138 min/month (183s avg × ~45 pushes/month)
Issue: Runs backend tests, frontend tests, security scans, and OpenAPI drift on every push to main regardless of what changed. The detect-changes job gates individual test jobs internally, but the workflow still starts a runner and the detect-changes job runs on every push. For pushes that only touch docs/** or mockups-site/**, the entire workflow spins up only to skip everything after detect.
Note: The detect-changes pattern is correct for what it does. The improvement is adding a top-level paths filter to skip workflow startup entirely for doc-only or design-only changes. The detect-changes internal logic already handles the rest.
Proposed change: Add a paths filter excluding docs/**, mockups-site/**, and *.md changes.

CW-6: deploy-heroku.yml fires on every push to main with no path filter

File: deploy-heroku.yml
Trigger: push to main, no path filter
Cost: ~120 min/month (159s avg × ~45 pushes/month)
Issue: This workflow deploys Raptor (the backend) to Heroku, but it fires on every push to main, including pushes that only touch frontend/raxx-next/**, docs/**, mockups-site/**, etc. A frontend-only push should not trigger a backend deploy.
Proposed change: Add a path filter for backend_v2/**, requirements*.txt, Procfile, runtime.txt, and .github/workflows/deploy-heroku.yml. This is a careful change: verify it against the freeze-check and smoke job gates.
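Before landing the path filters proposed in CW-5 and CW-6, it can help to dry-run them against recent commit file lists. The sketch below approximates the trigger decision with Python's `fnmatch`, which is an assumption rather than GitHub's matcher: in `fnmatch`, `*` also matches `/`, so a pattern such as `*.md` matches Markdown files at any depth here, whereas GitHub's filter treats `*` as stopping at path separators. The pattern lists are taken from the proposals above; the changed-file lists are invented.

```python
from fnmatch import fnmatchcase

# Proposed include filter for deploy-heroku.yml (CW-6)
HEROKU_PATHS = [
    "backend_v2/**",
    "requirements*.txt",
    "Procfile",
    "runtime.txt",
    ".github/workflows/deploy-heroku.yml",
]

# Proposed exclusions for ci.yml (CW-5)
CI_IGNORE = ["docs/**", "mockups-site/**", "*.md"]


def matches_any(path, patterns):
    """True if the path matches at least one glob pattern (fnmatch approximation)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def runs_with_paths(changed_files, patterns):
    """Include filter: the workflow runs if any changed file matches."""
    return any(matches_any(path, patterns) for path in changed_files)


def runs_with_paths_ignore(changed_files, ignore_patterns):
    """Exclude filter: the workflow runs unless every changed file is ignored."""
    return any(not matches_any(path, ignore_patterns) for path in changed_files)


# Hypothetical pushes, for illustration only
frontend_only = ["frontend/raxx-next/app/page.tsx"]
docs_only = ["docs/guide.md", "mockups-site/index.html"]
mixed = ["docs/guide.md", "backend_v2/api/routes.py"]

print(runs_with_paths(frontend_only, HEROKU_PATHS))  # backend deploy on frontend push?
print(runs_with_paths_ignore(docs_only, CI_IGNORE))  # full CI on docs-only push?
print(runs_with_paths_ignore(mixed, CI_IGNORE))      # full CI on mixed push?
```

Replaying the last few weeks of pushes through a check like this would show which legitimate backend changes the include list might miss before the filter goes live.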

CW-7: slack-notify.yml workflow_run trigger name mismatch — never fires

File: slack-notify.yml
Trigger: workflow_run on workflows ['CI']
Issue: The CI workflow is named 'CI — main', not 'CI'. GitHub's workflow_run trigger matches on exact workflow name. This workflow has never fired in the observable sample. The CI failure notification posture (per feedback_pre_launch_digest_notifications) relies on this workflow for failure alerts, but it is silently dead.
Fix: Change workflows: ['CI'] to workflows: ['CI — main'] in slack-notify.yml.
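Because workflow_run matches on the exact workflow name, this class of bug can be caught by a lint that compares each watcher's target list against the set of registered names. A minimal sketch follows, assuming the names and targets have already been extracted from the workflow files into plain Python structures; the extraction step and the function name are hypothetical.

```python
def dead_workflow_run_targets(workflow_names, watchers):
    """Return {watcher: [targets that match no registered workflow name]}."""
    registered = set(workflow_names)
    dead = {}
    for watcher, targets in watchers.items():
        missing = sorted(set(targets) - registered)
        if missing:
            dead[watcher] = missing
    return dead


# Registered names as they appear in each workflow's `name:` field
names = ["CI — main", "CI — console", "PR Gates"]

# Current state: Slack Notify watches a name that does not exist
print(dead_workflow_run_targets(names, {"Slack Notify": ["CI"]}))

# After the proposed fix the lint reports nothing
print(dead_workflow_run_targets(names, {"Slack Notify": ["CI — main"]}))
```

The comparison is deliberately exact and case-sensitive, mirroring the trigger's behavior, so a near-miss such as 'CI' against 'CI — main' is reported rather than fuzzily accepted.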

CW-8: deploy-antlers.yml (Legacy CRA) active but paths point to superseded code

File: deploy-antlers.yml
Name: Deploy Antlers (LEGACY CRA — PAUSED 2026-06-01)
State: active (the GH Actions API reports active despite the name containing PAUSED)
Trigger: push/main on frontend/trademaster_ui/**
Issue: The CRA has been superseded by Antlers Next.js. The path frontend/trademaster_ui/** still exists (it is the current Antlers app directory per project_antlers_directory). If any commit touches files in that path, this legacy workflow will fire and attempt a CRA deploy. The name says PAUSED, but no mechanism enforces the pause except that the path has low traffic.
Risk: Not zero-cost in perpetuity. Any touch of frontend/trademaster_ui/** fires this. Recommend disabling (state=disabled_manually) rather than deleting, as the deployment script may be referenced.

CW-9: Multiple monitor/infra workflows broken 100% with no operator visibility

The monitor and infra workflows listed under "Actively broken" in §2 have been silently failing for the entire observable window (2–16 consecutive failures, no successes).

Because slack-notify.yml is broken (CW-7), none of these failures have generated Slack alerts. The cron-heartbeat-monitor (cron-heartbeat-monitor.yml) may be detecting some of these — but its output is not visible from this audit.


§4 Phased Reduction Plan

Phase 1: Safe wins (single-file or settings-only changes, low blast radius)

These can be approved individually. None require coordinated changes across multiple files.

# | Action | File | Rationale | Effort | Risk
P1-A | Fix slack-notify.yml workflow_run trigger: ['CI'] → ['CI — main'] | slack-notify.yml | One-line fix. Restores CI failure alerting to Slack (currently completely dead; see CW-7). | 5 min | Very low
P1-B | Disable (not delete) Legacy CRA deploy workflow | deploy-antlers.yml | Set state to disabled_manually via GH API. Prevents accidental CRA redeploy if trademaster_ui is touched. | 2 min | Very low
P1-C | Fix release-please-config.json to remove stale frontend/trademaster_ui path component | release-please-config.json | Stops 18-for-18 failure rate on Release workflow. Saves ~54 min/month of failed runner time. | 15 min | Low (release-please config only)
P1-D | Rotate GitHub App private key for console-degraded-auto-file | Secret in repo settings | Fixes the broken-private-key failure. Restores 1,200 min/month of monitoring value. Operator-action for the key rotation itself. | 10 min (key rotation) | Low
P1-E | Relax queue-zero-dyno-monitor from */15 to */30 | queue-zero-dyno-monitor.yml | Saves ~312 min/month. Detection window goes from 15 min to 30 min, which is acceptable pre-launch. | 5 min | Very low
P1-F | Relax flag-drift-check from every 4h to daily (0 6 * * *) | flag-drift-check.yml | Saves ~66 min/month. Drift is human-driven via PRs; daily detection is sufficient. | 5 min | Very low

Phase 1 total estimated savings if all items land: ~430 min/month (P1-E ~312 + P1-F ~66 + P1-C ~54). Fixing the surface-degraded-auto-file key (P1-D) restores monitoring value but does not reduce minutes while the workflow stays at the same frequency.

Phase 1 total if surface-degraded-auto-file also moves to 30-min after key fix: ~630 min/month.


Phase 2 — Consolidation (workflow file changes, needs testing)

These require changes to workflow YAML and should be done in a single PR with a one-day soak on staging.

# | Action | Files | Rationale | Effort | Risk
P2-A | Add path filter to ci.yml to skip on doc/design-only pushes | ci.yml | Reduces spurious full-test runs on doc-only commits. Est. saves ~30 min/month. | 30 min | Medium: must test that detect-changes still fires correctly
P2-B | Add path filter to deploy-heroku.yml for backend paths only | deploy-heroku.yml | Stops backend deploy from triggering on frontend-only pushes. Est. saves ~40 min/month. | 45 min | Medium: test that all legitimate backend-change paths are covered
P2-C | Fix deploy-getraxx.yml smoke job: add npx playwright install step | deploy-getraxx.yml | Fixes 4/4 failure rate. No new behavior; just installs the missing Playwright dependency. | 20 min | Low
P2-D | Fix deploy-internal-docs.yml emit-audit-event composite action usage | .github/actions/emit-audit-event/action.yml | github.run_id context unavailable in composite action template at parse time. Fix: pass values as inputs from the calling workflow. Shared failure also affecting other workflows using this action. | 30 min | Medium: touches shared composite action
P2-E | Consolidate the 3 CF lint workflows into a single workflow with 3 jobs | New file replacing lint-cf-access-headers.yml, lint-cf-pages-deploy-uniqueness.yml, lint-cf-tokens.yml | All 3 have identical triggers, identical path filters (.github/workflows/**), and the same ~16s runtime. Running as 3 separate workflows means 3 runner allocations per PR touching a workflow file. Consolidating to 1 workflow with 3 jobs drops from ~24 min/month to ~8 min/month for those triggers. | 45 min | Low: identical triggers, independent jobs
P2-F | Fix billing-retention-cron failure: confirm Postgres migration status, re-enable or pause schedule | billing-retention-cron.yml | Stops 100% failure waste. Either update the DB path after the Postgres migration lands, or make it workflow_dispatch-only and comment out the schedule as billing-collector-cron.yml did. | 20 min (after Postgres migration) | Low once migration is in

Phase 2 estimated additional savings: ~90–120 min/month (path filters + getraxx fix + CF lint consolidation).


Phase 3 — Production-lean shape (architectural changes, operator sign-off required on design)

# | Action | Rationale
P3-A | Introduce a per-surface confirmation that the staging deploy succeeded before prod can run | Currently deploy-heroku.yml deploys staging and production in sequence within the same workflow run. The staging gate (synthetic-gate) runs after staging deploy, but production deploy is a separate workflow_dispatch step. Formalizing the staging→soak→prod promotion model means prod deploy only fires from workflow_dispatch (never from push), and requires a staging smoke pass as a prerequisite.
P3-B | Convert all production deploy triggers to workflow_dispatch-only | deploy-antlers-next-prod.yml, deploy-console.yml (prod jobs), deploy-velvet.yml, deploy-queue.yml currently trigger on push to main. Under the operator's lean-prod model, these should only trigger manually or via a promote action, not automatically on every push. Staging deploys remain on push-to-main (fast, automatic).
P3-C | Add a single prod-promote workflow that gates on: (1) staging smoke passing, (2) operator confirms | One promote-to-prod.yml workflow with inputs: surface (dropdown: raptor / console / antlers / velvet / queue). It checks the latest staging deploy was green, runs a pre-promote smoke, then fires the surface-specific production deploy as a workflow_call. Eliminates per-surface prod deploy triggers.
P3-D | Consolidate monitoring crons that fire at the same time | Currently 6 separate workflows fire between 06:00 and 07:00 UTC: alembic-version-cron, bcp-vault-snapshot-daily (07:00), bot-token-smoke (06:30), ci-digest-cron (07:00), cron-heartbeat-monitor (06:30), drift-orchestrator-cron. These could share a single "nightly-ops-sweep" workflow with sequential jobs, reducing runner-startup overhead.
P3-E | Remove E2E smoke from PR trigger until it passes | e2e-smoke.yml triggers on PR and has never succeeded. It is consuming ~84 min/month on PRs that will never benefit from it. Gate it behind workflow_dispatch-only until the E2E suite is green.

Phase 3 estimated additional savings: ~150–200 min/month (primarily from removing push-triggered prod deploys that become dispatch-only, and fixing the E2E PR trigger).


§5 Staging Autodeploy + Production-Lean Architecture

Current state per surface

Surface          Staging autodeploy?    Prod autodeploy?    Manual gate?
---------------------------------------------------------------------------
Raptor (API)     YES (push to main,     YES (same workflow,  freeze-check
                 no path filter)        no manual confirm)   on dispatch=prod
Antlers Next.js  YES (push to main,     YES (separate        none
                 raxx-next/** paths)    prod workflow,
                                        same push trigger)
Console          YES (push to main,     YES (same workflow,  freeze-check
                 console/** paths)      no manual confirm)   on dispatch=prod
Velvet           YES (push to main,     YES (same workflow,  freeze-check
                 velvet/** paths)       no manual confirm)
Queue            YES (push to main,     YES (same workflow,  confirm-gate job
                 queue/** paths)        confirm-gate-rejected with manual input)
getraxx          YES (push to main)     (CF Pages — same     none
                                        CF Pages deploy)
Antlers Next     YES (staging           SEPARATE prod        none
                 workflow auto)         workflow auto)

Proposed lean-prod shape (after Phase 3)

Push to main
     |
     v
[Path filters route to correct deploy workflow]
     |
     +---> deploy-raptor-staging   (automatic, on push)
     +---> deploy-antlers-staging  (automatic, on push)
     +---> deploy-console-staging  (automatic, on push)
     +---> deploy-velvet-staging   (automatic, on push)
     +---> deploy-queue-staging    (automatic, on push)
     |
     v
[Staging smoke: synthetic-gate + surface health check]
     |
     v
[Soak period — default: next business day (no automated timer)]
     |
     v
[Operator: dispatch promote-to-prod with surface input]
     |
     v
promote-to-prod.yml
     |
     +---> Verify: latest staging deploy = green
     +---> Run: pre-promote smoke (30s)
     +---> Deploy to production
     +---> Post-deploy: synthetic-gate on prod
     +---> Notify: ops@ digest
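The promote-to-prod sequence in the diagram above can be read as a short-circuiting pipeline. The following is a sketch of that control flow only, assuming each step is supplied as a callable; none of these function names or signatures exist in the repo, and a real implementation would live in workflow YAML rather than Python.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PromoteResult:
    ok: bool
    steps: List[str] = field(default_factory=list)


def promote_to_prod(
    surface: str,
    staging_is_green: Callable[[str], bool],
    pre_promote_smoke: Callable[[str], bool],
    deploy: Callable[[str], None],
    prod_gate: Callable[[str], bool],
    notify: Callable[[str, bool], None],
) -> PromoteResult:
    """Run the promotion steps in order, stopping before deploy if a check fails."""
    steps: List[str] = []
    if not staging_is_green(surface):
        steps.append("verify-staging: failed")
        return PromoteResult(False, steps)
    steps.append("verify-staging: ok")

    if not pre_promote_smoke(surface):
        steps.append("pre-promote-smoke: failed")
        return PromoteResult(False, steps)
    steps.append("pre-promote-smoke: ok")

    deploy(surface)
    steps.append("deploy: done")

    # The post-deploy gate cannot undo the deploy; it only decides the final status
    gate_ok = prod_gate(surface)
    steps.append("prod-gate: ok" if gate_ok else "prod-gate: failed")

    notify(surface, gate_ok)
    steps.append("notify: sent")
    return PromoteResult(gate_ok, steps)


# Illustrative run with stub callables
deployed: List[str] = []
result = promote_to_prod(
    "console",
    staging_is_green=lambda s: True,
    pre_promote_smoke=lambda s: True,
    deploy=deployed.append,
    prod_gate=lambda s: True,
    notify=lambda s, ok: None,
)
print(result.ok, result.steps)
```

One design point the sketch makes visible: the two pre-deploy checks block the deploy, while a failed post-deploy gate is reported through the notification but needs a separate rollback decision.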

What stays the same

What changes in Phase 3


§6 Estimated Minutes/Month Saved Per Phase

Phase | Actions | Est. Min/Month Saved | Notes
Phase 1 (safe wins) | P1-A through P1-F | ~430 min/month | Dominated by reducing zero-dyno-monitor frequency (P1-E); fixing the surface-degraded-auto-file key restores monitoring value rather than saving minutes
Phase 1 + surface-degraded-auto-file at 30-min | Above + relax to */30 after key fix | ~630 min/month |
Phase 2 (consolidation) | P2-A through P2-F | +90–120 min/month | Path filter on ci.yml + deploy-heroku.yml + CF lint consolidation
Phase 3 (prod-lean shape) | P3-A through P3-E | +150–200 min/month | E2E PR trigger removal, push→dispatch prod deploys
Total (all phases) | | ~770–950 min/month | From current ~3,400–3,700 to ~2,500–2,900

At GH Actions Pro rates ($0.008/min for Ubuntu runners), the current estimated burn is $27–30/month in compute alone, before the base Pro plan cost. All three phases landing brings that to roughly $20–23/month — a ~$7–10/month reduction in compute.

Ubicloud trigger threshold is 5,000 min/month per project_ci_billing. Current estimated usage is 3,400–3,700 min/month, which is 68–74% of that threshold. Even in the worst case (no phase reduces usage), the repo is not at imminent risk of hitting the Ubicloud threshold. If surface-degraded-auto-file is fixed and remains at */15 frequency, that single workflow alone accounts for ~1,200 min/month. Fixing its root cause is the single highest-leverage action.
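The dollar and threshold figures in this section follow from two one-line formulas. The sketch below reproduces them using the rate and threshold quoted above; the function names are illustrative.

```python
RATE_PER_MINUTE = 0.008      # USD per minute, Ubuntu runners, as quoted above
UBICLOUD_THRESHOLD = 5000    # minutes per month, per project_ci_billing


def monthly_cost(minutes):
    """Compute cost in USD for a month's worth of runner minutes."""
    return round(minutes * RATE_PER_MINUTE, 2)


def threshold_share(minutes):
    """Fraction of the Ubicloud trigger threshold consumed."""
    return minutes / UBICLOUD_THRESHOLD


for label, minutes in [
    ("current, low", 3400),
    ("current, high", 3700),
    ("after all phases, low", 2500),
    ("after all phases, high", 2900),
]:
    print(f"{label}: ${monthly_cost(minutes):.2f}, {threshold_share(minutes):.0%} of threshold")
```

These inputs are the estimated minutes from §1, so the outputs inherit the same caveat about per-job rounding.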


§7 Actively Broken Workflows — Surfacing Separately

The following workflows have been producing zero successes for the entire observable window and should be investigated as separate SRE dispatches (not in this audit's scope to fix):

BROKEN-1: release.yml — 18/18 failures

Cause: release-please-config.json contains stale path reference frontend/trademaster_ui/../../VERSION. Fix: update release-please config. Separately: Node.js 20 deprecation warning on googleapis/release-please-action@5c625bfb5d1ff62eadeeb3772007f7f66fdcf071 — this action is pinned by SHA to a version that runs on Node.js 20; GitHub Actions will force Node.js 24 starting 2026-06-16. This is an imminent breaking change, 11 days away.

BROKEN-2: console-degraded-auto-file.yml — 16/16 failures

Cause: Failed to read private key / Invalid keyData. The GitHub App private key secret is either malformed or was rotated without the repo secret being updated. Until this is fixed, console degradation monitoring is completely blind.
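One well-known way a key secret ends up malformed is being pasted with escaped \n sequences instead of real line breaks. Whether that is the cause here is not confirmed. As a rough pre-check before re-provisioning, a sketch like the following could classify a candidate value. It inspects only the PEM envelope and says nothing about whether the key is valid or belongs to the GitHub App. The function name and its return labels are hypothetical.

```python
import re

# Matches a PEM private-key envelope: header line, base64 body lines, footer line.
PEM_RE = re.compile(
    r"\A-----BEGIN (?P<label>[A-Z ]*PRIVATE KEY)-----\n"
    r"[A-Za-z0-9+/=\n]+\n"
    r"-----END (?P=label)-----\n?\Z"
)


def diagnose_pem(value: str) -> str:
    """Classify a candidate secret value by its PEM envelope only."""
    stripped = value.strip()
    # Literal backslash-n sequences and no real newlines: pasted as one line.
    if "\\n" in stripped and "\n" not in stripped:
        return "escaped-newlines"
    if not stripped.startswith("-----BEGIN"):
        return "missing-header"
    if PEM_RE.match(stripped + "\n") is None:
        return "malformed"
    return "ok"


# Placeholder body, not real key material.
fake_key = "-----BEGIN RSA PRIVATE KEY-----\nQUJD\nREVG\n-----END RSA PRIVATE KEY-----\n"
print(diagnose_pem(fake_key))
print(diagnose_pem(fake_key.replace("\n", "\\n")))
```

An "ok" result only rules out envelope damage; a key that was rotated without updating the secret would still pass this check and fail at runtime.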

BROKEN-3: heroku-config-health-nightly.yml — real findings (not broken, but alerting correctly)

Note: This workflow is functioning correctly — it is producing failures because it found real issues: raxx-api-staging is missing REDIS_URL (severity=critical) and has an empty POSTMARK_SERVER_TOKEN (severity=warning). These are legitimate infra gaps, not workflow bugs. Operator action needed: provision REDIS_URL on raxx-api-staging.

BROKEN-4: deploy-internal-docs.yml — recurring failures

Cause: emit-audit-event/action.yml uses github.run_id and github.server_url in a composite action template context where the github context is not available. This is a shared composite action bug that may affect other workflows using emit-audit-event.

BROKEN-5: Multiple nightly crons failing (likely Postgres migration dependency)

billing-retention-cron, historical-bars-1min-warm-nightly, reminders-second-factor-nightly, trace-integrity-cron — likely all failing due to the SQLite→Postgres migration in progress (project_raptor_postgres_migration_decision). These should be paused (schedule commented out, dispatch-only) until the migration lands, matching the pattern already established by billing-collector-cron.yml.

BROKEN-6: Node.js 20 deprecation — deadline 2026-06-16 (11 days)

Affects: actions/checkout@v4, actions/setup-python@v5, actions/create-github-app-token@v1, googleapis/release-please-action (pinned SHA). GitHub will force Node.js 24 after 2026-06-16. Actions that haven't published Node.js 24-compatible versions will break. The most urgent is googleapis/release-please-action pinned to a specific SHA — check if a newer SHA that supports Node 24 is available.
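To size the blast radius before the deadline, the affected references can be located by scanning `uses:` lines in workflow text. The following is a minimal sketch, assuming the four actions listed above are the complete affected set; the sample workflow is hypothetical and the helper name is illustrative.

```python
import re

# Action refs this audit lists as affected by the Node.js 20 deprecation.
AFFECTED_REFS = {
    "actions/checkout@v4",
    "actions/setup-python@v5",
    "actions/create-github-app-token@v1",
}
# Pinned by SHA in the audit, so match on the action name regardless of ref.
AFFECTED_ANY_REF = {"googleapis/release-please-action"}

USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")


def find_affected(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line_number, ref) for each `uses:` line naming an affected action."""
    hits = []
    for line_no, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match is None:
            continue
        ref = match.group(1).strip("'\"")
        action = ref.split("@", 1)[0]
        if ref in AFFECTED_REFS or action in AFFECTED_ANY_REF:
            hits.append((line_no, ref))
    return hits


# Hypothetical workflow text for illustration.
SAMPLE = """\
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - uses: googleapis/release-please-action@5c625bfb5d1ff62eadeeb3772007f7f66fdcf071
"""

for line_no, ref in find_affected(SAMPLE):
    print(f"line {line_no}: {ref}")
```

Run against each file under .github/workflows/, this yields a per-file list of lines to review. It does not determine whether a Node.js 24-compatible release exists for a given action; that still needs a manual check.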


§8 Operator-Approval Checklist

For each proposed action, check the box to greenlight:

Phase 1 (safe wins — low risk)

[ ] P1-A  slack-notify.yml: fix workflow_run name 'CI' → 'CI — main'
          Restores Slack CI-failure alerting. Zero behavioral change otherwise.

[ ] P1-B  deploy-antlers.yml: disable (not delete) via GH API
          Prevents accidental CRA deploy. File stays on disk for reference.

[ ] P1-C  release-please-config.json: remove stale frontend/trademaster_ui component
          Stops 100% Release failure rate. Requires review of config to confirm
          which components are still active.

[ ] P1-D  Rotate GitHub App private key for surface-degraded-auto-file
          (OPERATOR ACTION: requires secret provisioning in repo settings)
          Restores console degradation monitoring.

[ ] P1-E  queue-zero-dyno-monitor: relax cron */15 → */30
          Saves ~312 min/month. Detection window: 15 min → 30 min.

[ ] P1-F  flag-drift-check: relax cron every 4h → daily 06:00 UTC
          Saves ~66 min/month. Flag changes are human-driven via PRs.

Phase 2 (consolidation — moderate risk, test before merge)

[ ] P2-A  ci.yml: add paths filter to skip on doc/design-only pushes
          Needs verification that detect-changes still fires for all code paths.

[ ] P2-B  deploy-heroku.yml: add backend-only path filter
          Needs verification all legitimate backend paths are covered.
          Confirm with operator: are there any infra-file paths outside
          backend_v2/ that should trigger a Raptor deploy?

[ ] P2-C  deploy-getraxx.yml: add npx playwright install to smoke job
          Fixes 4/4 failure rate. Low risk — additive step only.

[ ] P2-D  Fix emit-audit-event composite action github context usage
          Review .github/actions/emit-audit-event/action.yml; pass run_id
          and server_url as explicit inputs from calling workflows.

[ ] P2-E  Consolidate 3 CF lint workflows into 1
          New file replaces 3 files. Identical trigger/path logic, 3 jobs.
          Confirm: are any external processes monitoring these specific
          workflow names?

[ ] P2-F  billing-retention-cron: pause schedule (comment out), dispatch-only
          Mirror billing-collector-cron pattern. Re-enable post-Postgres migration.

Phase 3 (architectural — requires separate operator sign-off on design)

[ ] P3-DESIGN  Operator approves staging-autodeploy / prod-dispatch model
               described in §5 before any Phase 3 workflow changes begin.
               Specifically: confirm all 5 Heroku surfaces (Raptor/Console/
               Antlers/Velvet/Queue) should move to dispatch-only prod deploys.

[ ] P3-A  Introduce per-surface staging smoke prerequisite for prod
[ ] P3-B  Convert prod deploy triggers to workflow_dispatch-only
[ ] P3-C  Create promote-to-prod.yml consolidation workflow
[ ] P3-D  Consolidate 06:00 UTC cron cluster into nightly-ops-sweep
[ ] P3-E  Remove e2e-smoke.yml from PR trigger until suite is green


§9 Ubicloud Assessment

Current estimated burn: ~3,400–3,700 min/month. Ubicloud threshold: 5,000 min/month (per project_ci_billing).

Current usage is at 68–74% of the threshold, so the repo is not at imminent risk. Of the three highest-cost workflows (surface-degraded-auto-file at ~1,200 min/month when broken, queue-zero-dyno-monitor at ~624 min/month, mbt-resting-orders-cron at ~344 min/month), all are running correctly except surface-degraded-auto-file. Landing Phase 1 alone brings the estimate down to ~2,800–3,100 min/month (56–62% of threshold).

Recommendation: Ubicloud is not needed in the near term. If a significant new service launches (iOS CI running full Xcode builds, for instance) or if the E2E suite starts running on every PR and takes 10+ minutes, revisit. Do not queue Ubicloud migration until burn sustains above 4,500 min/month for two consecutive months.
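The threshold percentages and the revisit rule can be expressed as two small checks. This is a sketch only: it assumes "two consecutive months" means the two most recent monthly totals, and both function names are illustrative.

```python
UBICLOUD_THRESHOLD_MIN = 5000  # per project_ci_billing
REVISIT_FLOOR_MIN = 4500       # sustained-burn floor from the recommendation


def share_of_threshold(minutes: float, threshold: float = UBICLOUD_THRESHOLD_MIN) -> float:
    """Return usage as a percentage of the Ubicloud threshold."""
    return round(100 * minutes / threshold, 1)


def should_queue_migration(monthly_minutes, floor=REVISIT_FLOOR_MIN, months=2) -> bool:
    """True only if each of the last `months` monthly totals is above `floor`."""
    recent = list(monthly_minutes)[-months:]
    return len(recent) == months and all(m > floor for m in recent)


# Current range, then the post-Phase-1 range from this section.
for minutes in (3400, 3700, 2800, 3100):
    print(f"{minutes} min/month = {share_of_threshold(minutes)}% of threshold")

# Oldest month first; the last two entries are the ones that count.
print(should_queue_migration([3600, 4600, 4700]))
print(should_queue_migration([4600, 4400]))
```

A single spike month does not trip the rule, which matches the intent of waiting for sustained burn before queueing a migration.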


References