GitHub Actions Audit — 2026-06-05
Author: sre-agent
Date: 2026-06-05
Status: Recommendation-only; no workflow files modified. Operator approval required before any Phase execution.
§1 Inventory Table
Total workflows on disk: 72 (including __tests__/ subdirectory with one shell test — not a workflow file).
Active registered workflows: 71. All are in active state per the GitHub API.
Run data sampled from the most recent 500 completed runs as of 2026-06-05T14:00 UTC.
Duration estimates use createdAt → updatedAt delta on successful runs (best available via gh run list — does not include queue wait time, which is typically 5–15s on the hosted runners).
Runs-per-month for cron workflows derived from cron expression parsing. PR/push trigger run-counts are derived from observed activity in the sample window. The sample window covers roughly 7–10 days of activity, so "runs last 30 days" for frequently-triggered workflows is extrapolated (noted where extrapolated).
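The cron-derived run counts can be reproduced with a short calculation. The sketch below is illustrative only and is not the parser used for this audit: it handles just the minute and hour fields (`*`, `*/n`, single values, ranges, comma lists), assumes the day-of-month, month, and day-of-week fields are all `*`, and assumes a 30-day month. The function names are hypothetical.

```python
def expand_field(field: str, low: int, high: int) -> set[int]:
    """Expand one cron field into the set of values it matches."""
    values: set[int] = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def fires_per_month(cron: str, days: int = 30) -> int:
    """Count fires in a `days`-day month; day/month/weekday fields must be '*'."""
    minute, hour, dom, month, dow = cron.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("this sketch only handles every-day schedules")
    per_day = len(expand_field(minute, 0, 59)) * len(expand_field(hour, 0, 23))
    return per_day * days


print(fires_per_month("*/15 * * * *"))  # 2880
print(fires_per_month("0 */4 * * *"))   # 180
print(fires_per_month("0 6 * * *"))     # 30
```

Weekday-restricted schedules (for example the Mon–Fri crons in the table) are outside what this sketch handles and would need a real cron library or a calendar walk.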
Minute-cost rounding: GH Actions rounds every job up to the next whole minute, so the minimum charge per job is 1 minute. A 9-second job that runs 96 times/month still costs 96 minutes.
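To make the rounding rule concrete, the sketch below compares raw runtime with billed minutes. It is a simplification under stated assumptions: `avg_seconds` is treated as the duration of every job, and `jobs_per_run` defaults to 1. Both function names are hypothetical.

```python
import math


def raw_minutes(avg_seconds: float, runs: int) -> float:
    """Total wall-clock runtime in minutes, with no per-job rounding."""
    return avg_seconds * runs / 60


def billed_minutes(avg_seconds: float, runs: int, jobs_per_run: int = 1) -> int:
    """Billed minutes when every job is rounded up to a whole minute.

    Simplification: treats avg_seconds as the duration of each job.
    """
    per_job = max(1, math.ceil(avg_seconds / 60))
    return per_job * jobs_per_run * runs


print(raw_minutes(9, 96))     # 14.4
print(billed_minutes(9, 96))  # 96
```

Applied to a 13-second job on an every-15-minute schedule (2,880 fires), the same functions give 624 raw minutes and 2,880 billed minutes, so it matters which basis a given estimate uses.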
| # | Workflow Name | File | Trigger Types | Path Filter | Jobs | Avg Dur (s) | Runs/Month Est. | Est. Min/Month | Last Conclusion |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Alembic version drift check | alembic-version-cron.yml | schedule (06:00 UTC daily), dispatch | none | 3 | 51 | 30 | 27 | success |
| 2 | Antlers Next.js CI | antlers-next-ci.yml | push+PR on frontend/raxx-next/** | yes | 6 | 164 | ~60 | 165 | failure |
| 3 | BCP vault snapshot daily | bcp-vault-snapshot-daily.yml | schedule (07:00 UTC daily), dispatch | none | | 15 | 30 | 8 | failure (all) |
| 4 | BCP smoke monthly | bcp-smoke-monthly.yml | schedule (1st of month 08:00), dispatch | none | | unable to measure | 1 | 1 | never fired |
| 5 | Billing collector cron | billing-collector-cron.yml | dispatch only (PAUSED) | none | | n/a | 0 | 0 | never fired |
| 6 | Billing retention cron | billing-retention-cron.yml | schedule (03:00 UTC daily), dispatch | none | | 11 | 30 | 6 | failure (all) |
| 7 | Bot token smoke | daily-bot-token-smoke.yml | schedule (06:30 UTC daily), dispatch | none | | 9 | 30 | 5 | success |
| 8 | CF Access header lint | lint-cf-access-headers.yml | PR + push/main on .github/workflows/**, scripts/ci/lint_cf_access_headers.py | yes | | 16 | ~30 | 8 | success |
| 9 | CF Pages deploy uniqueness lint | lint-cf-pages-deploy-uniqueness.yml | PR + push/main on .github/workflows/** | yes | | 16 | ~30 | 8 | success |
| 10 | CF token usage lint | lint-cf-tokens.yml | PR + push/main on .github/workflows/** | yes | | 17 | ~30 | 8 | success |
| 11 | CI Digest daily | ci-digest-cron.yml | schedule (07:00 UTC daily), dispatch | none | | 22 | 30 | 11 | success |
| 12 | CI — console | ci-console.yml | PR + push/main on console/** | yes | | 66 | ~80 | 88 | failure |
| 13 | CI — main | ci.yml | push/main (no path filter) | none | | 183 | ~45 | 138 | failure |
| 14 | Console review app | review-app-console.yml | PR on console/** | yes | | 154 | ~25 | 64 | success |
| 15 | Cron heartbeat monitor | cron-heartbeat-monitor.yml | schedule (06:30 UTC daily), dispatch | none | | 14 | 30 | 7 | success |
| 16 | Daily card groomer | daily-card-groomer.yml | schedule (09:00 UTC daily), dispatch | none | | 67 | 30 | 34 | success |
| 17 | Deploy Antlers LEGACY CRA | deploy-antlers.yml | push/main on frontend/trademaster_ui/**, dispatch | yes | | n/a | 0 (PAUSED 2026-06-01) | 0 | never (paused) |
| 18 | Deploy Antlers Next — prod | deploy-antlers-next-prod.yml | push/main on frontend/raxx-next/**, dispatch | yes | | 161 | ~20 | 54 | success |
| 19 | Deploy Antlers Next — staging | deploy-antlers-next-staging.yml | push/main on frontend/raxx-next/**, dispatch | yes | | 161 | ~20 | 54 | success |
| 20 | Deploy Antlers cutover | deploy-antlers-cutover.yml | dispatch only | none | | unable to measure | 0 | 0 | never fired |
| 21 | Deploy Queue | deploy-queue.yml | push/main on queue/**, dispatch | yes | | unable to measure | ~2 | 14 | never (no recent queue changes) |
| 22 | Deploy Queue failure streak monitor | deploy-queue-failure-monitor.yml | workflow_run on Deploy Queue | none | | unable to measure | ~2 | 2 | never |
| 23 | Deploy Velvet | deploy-velvet.yml | push/main on velvet/**, dispatch | yes | | 98 | ~3 | 5 | success |
| 24 | Deploy console | deploy-console.yml | push/main on console/**, dispatch | yes | | 198 | ~8 | 27 | success |
| 25 | Deploy console shim Worker | deploy-console-shim.yml | push/main on infra/cf-workers/console-deploy-shim/**, dispatch | yes | | unable to measure | ~1 | 2 | never fired recently |
| 26 | Deploy customer docs | deploy-customer-docs.yml | push/main on docs/customer/**, dispatch | yes | | unable to measure | ~3 | 3 | never fired recently |
| 27 | Deploy design mockups | deploy-mockups.yml | push/main on docs/design/**, mockups-site/**, dispatch | yes | | 46 | ~8 | 7 | success |
| 28 | Deploy failure streak alert | deploy-failure-streak-alert.yml | workflow_run on 14 deploy workflows | none | | 7 | ~47 | 6 | success |
| 29 | Deploy flag docs | deploy-flag-docs.yml | push+PR on backend_v2/api/feature_flags.yaml | yes | | 28 | ~10 | 5 | success |
| 30 | Deploy getraxx landing page | deploy-getraxx.yml | push/main on frontend/getraxx-landing/**, dispatch | yes | | 98 | ~10 | 17 | failure (all) |
| 31 | Deploy internal docs | deploy-internal-docs.yml | push/main on docs/**/*.md, backend_v2/api/feature_flags.yaml, dispatch | yes | | 56 | ~12 | 11 | failure |
| 32 | Deploy status Worker | deploy-status-worker.yml | push/main on frontend/status-worker/**, dispatch | yes | | unable to measure | ~1 | 1 | never fired recently |
| 33 | Deploy status page | deploy-status-page.yml | push/main on frontend/status-page/**, dispatch | yes | | 62 | ~1 | 1 | success |
| 34 | Deploy support portal | deploy-support.yml | push/main on frontend/support/**, dispatch | yes | | unable to measure | ~1 | 1 | never fired recently |
| 35 | Deploy to Heroku | deploy-heroku.yml | push/main (no path filter), dispatch | none | | 159 | ~45 | 120 | success |
| 36 | Deploy waf-log-shipper Worker | deploy-worker-waf-log-shipper.yml | push/main on workers/waf-log-shipper/**, dispatch | yes | | unable to measure | ~1 | 1 | never fired recently |
| 37 | Drift orchestrator cron | drift-orchestrator-cron.yml | schedule (06:00 UTC daily), dispatch | none | | 25 | 30 | 13 | success |
| 38 | E2E smoke (signup→onboarding) | e2e-smoke.yml | PR + schedule, dispatch | none | | 142 | ~35 (PR) + 30 (schedule) = 65 | 154 | failure (all) |
| 39 | Flag drift check | flag-drift-check.yml | schedule (every 4h), dispatch | none | | 32 | 180 | 96 | success |
| 40 | FreeScout daily backup | freescout-backup.yml | schedule (06:00 UTC daily), dispatch | none | | 25 | 30 | 13 | success |
| 41 | FreeScout Terraform apply | freescout-apply.yml | dispatch only | none | | unable to measure | 0 | 0 | never fired |
| 42 | Heroku config-var health | heroku-config-health-nightly.yml | schedule (05:00 UTC daily), dispatch | none | | 22 | 30 | 11 | failure (all; real findings) |
| 43 | Launch Readiness Check | launch-readiness-check.yml | dispatch only | none | | unable to measure | 0 | 0 | never fired |
| 44 | Nightly Security Scan | nightly-security-scan.yml | schedule (08:07 UTC daily), dispatch | none | | 136 | 30 | 68 | failure |
| 45 | Nightly 1-min bar cache warm | historical-bars-1min-warm-nightly.yml | schedule (22:00 UTC daily + 23:00 Sun), dispatch | none | | 11 | 34 | 7 | failure (all) |
| 46 | Nightly second-factor reminders | reminders-second-factor-nightly.yml | schedule (14:00 UTC daily), dispatch | none | | 15 | 30 | 8 | failure (all) |
| 47 | PII scan | pii-scan.yml | PR + push/main (no path filter) | none | | 17 | ~75 | 22 | success |
| 48 | PR Gates | ci-pr.yml | PR, dispatch | none | | 74 | ~50 | 62 | success |
| 49 | PR Preview | pr-preview.yml | PR (detect-then-run pattern) | none (internal detect) | | 25 | ~50 | 21 | success |
| 50 | Queue Docker smoke | queue-docker-smoke.yml | PR (no path filter on outer trigger, detect-job inside) | none (detect inside) | | 13 | ~50 | 11 | success |
| 51 | Queue vcpkg manifest check | vcpkg-manifest-check.yml | PR on queue/vcpkg.json, queue/Dockerfile | yes | | unable to measure | ~2 | 1 | never fired recently |
| 52 | Queue zero-dyno monitor | queue-zero-dyno-monitor.yml | schedule (every 15 min), dispatch | none | | 13 | 2,880 | 624 | success |
| 53 | Release | release.yml | push/main (no path filter), dispatch | none | | 72 | ~45 | 54 | failure (all; release-please config bug) |
| 54 | SC-IDENT-6 Bot identity smoke | sc-ident-6-smoke.yml | dispatch only | none | | unable to measure | 0 | 0 | never fired |
| 55 | Security OWASP ZAP | security-zap.yml | PR on frontend/raxx-next/** + backend_v2/api/**; schedule Mon 09:07 UTC | yes | | 15 | ~35 PR + 4 sched = 39 | 11 | success |
| 56 | Slack Notify | slack-notify.yml | workflow_run on 'CI' (mismatch; see §3) | none | | unable to measure | 0 (never fires) | 0 | never fired |
| 57 | Smoke test cloudflare-retry | test-cloudflare-retry-action.yml | dispatch only | none | | unable to measure | 0 | 0 | never fired |
| 58 | Smoke test emit-audit-event | test-emit-audit-event-action.yml | dispatch only | none | | unable to measure | 0 | 0 | never fired |
| 59 | Surface degraded auto-file | console-degraded-auto-file.yml | schedule (every 15 min), dispatch | none | | 25 | 2,880 | 1,200 | failure (all; broken private key) |
| 60 | Synthetic Gate | synthetic-gate.yml | workflow_call, schedule (Mon–Fri 13:00 UTC), dispatch | none | | 43 | ~22 | 16 | success |
| 61 | Terraform validate | terraform-validate.yml | PR + push/main on terraform/** | yes | | unable to measure | ~5 | 3 | never fired recently |
| 62 | Terraform email-delivery-stack | terraform-email-delivery-stack.yml | PR + push/main + schedule (06:00 UTC daily) on terraform/modules/email-delivery-stack/** | yes + schedule | | 9 | 30 sched + ~2 push = 32 | 5 | success |
| 63 | Test notify-deploy-status action | test-notify-deploy-status.yml | PR + push/main on .github/actions/notify-deploy-status/** | yes | | unable to measure | ~2 | 1 | never fired recently |
| 64 | Trace integrity cron | trace-integrity-cron.yml | schedule (02:00 UTC daily), dispatch | none | | 11 | 30 | 6 | failure (all) |
| 65 | WAF Synthetic Probes | waf-synthetic-probe.yml | schedule (Mon–Fri 14:00 UTC), dispatch | none | | 27 | 22 | 10 | failure (all) |
| 66 | Workflow secret names lint | lint-workflow-secret-names.yml | PR + push/main on .github/workflows/** | yes | | 16 | ~30 | 8 | success |
| 67 | iOS CI | ios-ci.yml | push + PR on ios/** | yes | | 24 | ~5 | 2 | success |
| 68 | mbt-drift-daily | mbt-drift-daily.yml | schedule (23:00 UTC daily), dispatch | none | | 28 | 30 | 14 | success |
| 69 | mbt-drift-per-symbol-weekly | mbt-drift-per-symbol-weekly.yml | schedule (Sun 23:30 UTC), dispatch | none | | unable to measure | 4 | 2 | never fired recently |
| 70 | mbt-resting-orders-cron | mbt-resting-orders-cron.yml | schedule (every 5 min, Mon–Fri 13:00–20:00 UTC) | none | | 43 | ~480 | 344 | success |
| 71 | tickets E2E smoke | tickets-e2e-smoke.yml | schedule, dispatch | none | | 38 | 30 | 19 | failure |
Estimated total minutes/month (current state): ~3,400–3,700 min/month
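The three highest single contributors:
1. surface-degraded-auto-file: ~1,200 min/month (every-15-min schedule, 25 s per run × 2,880 fires ÷ 60; currently failing 100%)
2. queue-zero-dyno-monitor: ~624 min/month (every-15-min schedule, always runs, 13 s per run × 2,880 fires ÷ 60)
3. mbt-resting-orders-cron: ~344 min/month (every-5-min weekday trading hours)
These figures are raw runtime. Under the per-job round-up rule above, each fire bills at least one full minute, so each of the two every-15-min workflows would bill at least 2,880 min/month.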
§2 Categorization
Critical path — must run on every PR or push to main
| Workflow | Verdict |
|---|---|
| PR Gates (ci-pr.yml) | Keep: comprehensive gate, path-aware detect-changes |
| PII scan (pii-scan.yml) | Keep: operator instruction, never remove |
| CI — main (ci.yml) | Keep: push-to-main integration tests; add path filter (see §3) |
| CI — console (ci-console.yml) | Keep: already path-filtered to console/** |
| Antlers Next.js CI (antlers-next-ci.yml) | Keep: already path-filtered to frontend/raxx-next/** |
| iOS CI (ios-ci.yml) | Keep: already path-filtered to ios/** |
| Security OWASP ZAP (security-zap.yml) | Keep: path-filtered to frontend/raxx-next/** + backend_v2/api/**. The path was previously frontend/trademaster_ui/** and has already been updated to raxx-next, which is correct |
| Queue Docker smoke (queue-docker-smoke.yml) | Keep: no outer path filter, but the internal detect-job pattern handles it correctly |
| Terraform validate (terraform-validate.yml) | Keep: path-filtered to terraform/** |
| Queue vcpkg manifest check (vcpkg-manifest-check.yml) | Keep: path-filtered to queue/vcpkg.json and queue/Dockerfile |
Path-filtered — correct, keep
These workflows trigger only when relevant paths change, and their trigger configuration is correct (run failures in some of them are covered separately below):
deploy-antlers-next-prod.yml, deploy-antlers-next-staging.yml, deploy-console.yml, deploy-velvet.yml, deploy-queue.yml, deploy-getraxx.yml, deploy-internal-docs.yml, deploy-customer-docs.yml, deploy-flag-docs.yml, deploy-mockups.yml, deploy-status-page.yml, deploy-status-worker.yml, deploy-support.yml, deploy-console-shim.yml, deploy-worker-waf-log-shipper.yml, lint-cf-access-headers.yml, lint-cf-tokens.yml, lint-cf-pages-deploy-uniqueness.yml, lint-workflow-secret-names.yml, review-app-console.yml, test-notify-deploy-status.yml
Scheduled — keep (firing meaningful output)
| Workflow | Cron | Output quality |
|---|---|---|
| Alembic version drift check | daily 06:00 | Functional — checks staging+prod migration heads |
| BCP vault snapshot daily | daily 07:00 | Currently broken — operator-action, see §3 |
| Bot token smoke | daily 06:30 | Functional — 100% success |
| CI Digest daily | daily 07:00 | Functional — digest posture per feedback_pre_launch_digest_notifications |
| Cron heartbeat monitor | daily 06:30 | Functional — monitors other crons |
| Daily card groomer | daily 09:00 | Functional — 100% success |
| Drift orchestrator cron | daily 06:00 | Functional — 100% success |
| Flag drift check | every 4h | Functional — 100% success; consider relaxing to daily |
| FreeScout daily backup | daily 06:00 | Functional — 100% success |
| Heroku config-var health | daily 05:00 | Failing — real finding (staging REDIS_URL absent), see §3 |
| Nightly Security Scan | daily 08:07 | Partially functional — failing, see §3 |
| Nightly 1-min bar cache warm | daily 22:00 | Broken 100% — see §3 |
| Nightly second-factor reminders | daily 14:00 | Broken 100% — see §3 |
| mbt-drift-daily | daily 23:00 | Functional — 100% success |
| mbt-drift-per-symbol-weekly | Sun 23:30 | Functional — no recent runs (correct, runs weekly) |
| mbt-resting-orders-cron | every 5 min weekday | Functional — 100% success; high cost (~344 min/month) |
| Queue zero-dyno monitor | every 15 min | Functional — 100% success; highest cost among passing workflows (~624 min/month) |
| Billing retention cron | daily 03:00 | Broken 100% — related to Postgres migration, see §3 |
| Trace integrity cron | daily 02:00 | Broken 100% — see §3 |
| WAF Synthetic Probes | weekday 14:00 | Broken 100% — see §3 |
| Tickets E2E smoke | scheduled | Broken 100% — see §3 |
| E2E smoke (signup→onboarding) | scheduled + PR | Broken 100% — see §3 |
| Synthetic Gate | weekday 13:00 | Functional — 100% success |
| Terraform email-delivery-stack | daily 06:00 + path | Functional — 100% success |
Dispatch-only — runs on demand, zero recurring cost
These fire zero recurring cycles. Keep all; they are on-demand tools.
billing-collector-cron.yml (explicitly paused), deploy-antlers-cutover.yml, freescout-apply.yml, launch-readiness-check.yml, sc-ident-6-smoke.yml, test-cloudflare-retry-action.yml, test-emit-audit-event-action.yml
Always-skip / dead triggers
| Workflow | Issue |
|---|---|
| Slack Notify (slack-notify.yml) | workflow_run watches ['CI'] but the CI workflow name is 'CI — main'. Because of this name mismatch, the workflow has never fired. |
| Deploy Antlers LEGACY CRA (deploy-antlers.yml) | Paused 2026-06-01. Path frontend/trademaster_ui/**; the CRA app is superseded. Fires zero times. |
Actively broken — 100% failure rate
These have produced zero successes in the entire observed sample window and are consuming minutes without output. The one exception is Deploy internal docs (last row), which fails intermittently (4 of 12 runs):
| Workflow | Observed Failures | Root Cause (from log analysis) |
|---|---|---|
| Surface degraded auto-file | 16/16 | Failed to read private key — GitHub App private key secret is invalid/expired |
| Release | 18/18 | release-please failed: illegal pathing characters in path: frontend/trademaster_ui/../../VERSION — release-please config references stale CRA path |
| Deploy getraxx landing page | 4/4 | Cannot find module 'playwright' — smoke job missing npx playwright install step |
| BCP vault snapshot daily | 2/2 | Root cause not determined from log tail; suspected vault auth or S3 write failure |
| Billing retention cron | 2/2 | Postgres migration in flight (project_raptor_postgres_migration_decision) |
| Nightly 1-min bar cache warm | 2/2 | Not determined from log; likely Postgres/Heroku auth |
| Nightly second-factor reminders | 2/2 | Not determined from log |
| Trace integrity cron | 2/2 | Not determined from log |
| WAF Synthetic Probes | 2/2 | Not determined from log |
| Tickets E2E smoke | 1/1 | Not determined from log |
| E2E smoke (signup→onboarding) | 4/4 | Not determined from log |
| Deploy internal docs | 4 failures in 12 | emit-audit-event/action.yml — Unrecognized named-value: 'github' in composite action template |
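A zero-success check like the one behind this table can be scripted against run records. The sketch below assumes records shaped like the output of `gh run list --json workflowName,conclusion`; the sample data is made up for illustration and is not the audit's data set, and the function name is hypothetical.

```python
from collections import defaultdict


def zero_success_workflows(runs: list[dict]) -> dict[str, int]:
    """Map each workflow with no successful run to its failure count."""
    failures: dict[str, int] = defaultdict(int)
    succeeded: set[str] = set()
    for run in runs:
        name = run["workflowName"]
        if run["conclusion"] == "success":
            succeeded.add(name)
        elif run["conclusion"] == "failure":
            failures[name] += 1
    return {name: count for name, count in failures.items() if name not in succeeded}


# Hypothetical sample records (not real audit data).
sample = [
    {"workflowName": "Release", "conclusion": "failure"},
    {"workflowName": "Release", "conclusion": "failure"},
    {"workflowName": "Deploy internal docs", "conclusion": "failure"},
    {"workflowName": "Deploy internal docs", "conclusion": "success"},
    {"workflowName": "PR Gates", "conclusion": "success"},
]
print(zero_success_workflows(sample))  # {'Release': 2}
```

Workflows with mixed results, such as the intermittent failure in the last table row, are deliberately excluded by this filter and would need a separate failure-rate check.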
§3 Cycle-Waster Findings
CW-1: surface-degraded-auto-file running every 15 minutes 100% broken
File: console-degraded-auto-file.yml
Cron: */15 * * * * — 2,880 fires/month
Cost: ~1,200 min/month of raw runtime (25 s avg × 2,880 fires ÷ 60)
Failure cause: DOMException [DataError]: Invalid keyData / Failed to read private key — the GitHub App private key stored as a secret is malformed or expired.
Impact: Zero monitoring output for the entire observable window. The workflow is consuming ~1,200 minutes per month for no operational value.
Operator action before Phase 1 can fix: Rotate/re-provision the GitHub App private key secret. Once functional, consider reducing frequency to every 30 minutes (halves cost with no meaningful detection latency change).
CW-2: release.yml firing on every push to main, 0% success rate
File: release.yml
Trigger: push to main, no path filter
Cost: ~54 min/month (72s avg × ~45 pushes/month)
Failure cause: release-please failed: illegal pathing characters in path: frontend/trademaster_ui/../../VERSION — the release-please configuration references a path component from the legacy CRA layout that has since been superseded. The release-please-config.json almost certainly contains a reference to frontend/trademaster_ui with a relative ../../VERSION extra-component path.
Impact: Every push to main produces a failed workflow run. This is visible noise that trains reviewers to ignore red workflow icons, which is a detection-erosion risk.
Fix: Update release-please-config.json to remove or correct the frontend/trademaster_ui component path. This is a one-line config change, not a workflow change.
CW-3: queue-zero-dyno-monitor running every 15 minutes, healthy but high-frequency
File: queue-zero-dyno-monitor.yml
Cron: */15 * * * * — 2,880 fires/month
Cost: ~624 min/month of raw runtime (13 s avg × 2,880 fires ÷ 60)
Output: 100% success, legitimately monitoring Queue dyno count.
Issue: 15-minute frequency is fine for production SLO coverage, but this is pre-launch. The Queue service is not yet customer-facing. Detecting zero dynos within 15 minutes vs within 30 minutes makes no difference at current stage.
Proposed change: Relax to */30 * * * * — saves ~312 min/month with zero customer-facing impact.
CW-4: flag-drift-check running every 4 hours
File: flag-drift-check.yml
Cron: 0 */4 * * * — 180 fires/month
Cost: ~96 min/month
Output: 100% success, checks flag YAML drift.
Issue: Flag changes happen via PRs (a human action). A flag drift state that exists at 04:00 UTC will still exist at 08:00 UTC. Daily detection is sufficient.
Proposed change: Change to 0 6 * * * (daily) — saves ~66 min/month.
CW-5: ci.yml (CI — main) has no path filter, runs full suite on every push
File: ci.yml
Trigger: push to main, no path filter
Cost: ~138 min/month (183s avg × ~45 pushes/month)
Issue: Runs backend tests, frontend tests, security scans, OpenAPI drift on every push to main regardless of what changed. The detect-changes job gates individual test jobs internally, but the workflow still starts a runner and the detect-changes job runs on every push. For pushes that only touch docs/** or mockups-site/**, the entire workflow spins up only to skip everything after detect.
Note: The detect-changes pattern is correct for what it does. The improvement is adding a top-level paths filter to skip workflow startup entirely for doc-only or design-only changes. The detect-changes internal logic already handles the rest.
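Proposed change: Add a top-level paths-ignore filter for docs/**, mockups-site/**, and Markdown files. Note that *.md matches only files at the repository root; **/*.md matches Markdown files at any depth.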
CW-6: deploy-heroku.yml fires on every push to main with no path filter
File: deploy-heroku.yml
Trigger: push to main, no path filter
Cost: ~120 min/month (159s avg × ~45 pushes/month)
Issue: This workflow deploys Raptor (the backend) to Heroku, but it fires on every push to main — including pushes that only touch frontend/raxx-next/**, docs/**, mockups-site/**, etc. A frontend-only push should not trigger a backend deploy.
Proposed change: Add path filter for backend_v2/**, requirements*.txt, Procfile, runtime.txt, .github/workflows/deploy-heroku.yml. This is a careful change — verify against the freeze-check and smoke job gates.
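Before changing the trigger, the proposed filter can be dry-run against sample change sets. The sketch below is a rough approximation of path-filter matching, not GitHub's implementation: it assumes `**` may cross `/` while `*` may not, and it ignores `!` negation, `?`, and character classes. The changed-file paths and function names are hypothetical; the pattern list is the one proposed above.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified path glob: '**' crosses '/', '*' does not."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def would_trigger(changed_files: list[str], patterns: list[str]) -> bool:
    """True if at least one changed file matches at least one pattern."""
    regexes = [glob_to_regex(p) for p in patterns]
    return any(r.fullmatch(f) for f in changed_files for r in regexes)


BACKEND_PATHS = [
    "backend_v2/**",
    "requirements*.txt",
    "Procfile",
    "runtime.txt",
    ".github/workflows/deploy-heroku.yml",
]

# Hypothetical change sets: a frontend-only push, then a mixed push.
print(would_trigger(["frontend/raxx-next/app/page.tsx"], BACKEND_PATHS))        # False
print(would_trigger(["backend_v2/api/routes.py", "docs/a.md"], BACKEND_PATHS))  # True
```

Running recent merge commits' file lists through a check like this is one way to confirm that every legitimate backend change would still deploy.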
CW-7: slack-notify.yml workflow_run trigger name mismatch — never fires
File: slack-notify.yml
Trigger: workflow_run on workflows ['CI']
Issue: The CI workflow is named 'CI — main', not 'CI'. GitHub's workflow_run trigger matches on exact workflow name. This workflow has never fired in the observable sample. The CI failure notification posture (per feedback_pre_launch_digest_notifications) relies on this workflow for failure alerts — but it is silently dead.
Fix: Change workflows: ['CI'] to workflows: ['CI — main'] in slack-notify.yml.
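Because the match is on the exact workflow name, this class of dead trigger can be caught mechanically. The sketch below is a minimal illustration with a hypothetical function name; the workflow names are a subset taken from the §1 inventory.

```python
def dead_workflow_run_targets(watched: list[str], defined_names: set[str]) -> list[str]:
    """Return watched names that match no defined workflow name exactly."""
    return [name for name in watched if name not in defined_names]


# Workflow names taken from the inventory table in §1 (subset).
defined = {"CI — main", "CI — console", "PR Gates", "Release"}

print(dead_workflow_run_targets(["CI"], defined))         # ['CI']
print(dead_workflow_run_targets(["CI — main"], defined))  # []
```

A lint of this shape could run alongside the existing workflow lints so that a renamed workflow does not silently orphan its listeners.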
CW-8: deploy-antlers.yml (Legacy CRA) active but paths point to superseded code
File: deploy-antlers.yml
Name: Deploy Antlers (LEGACY CRA — PAUSED 2026-06-01)
State: active (GH Actions API reports active despite the name containing PAUSED)
Trigger: push/main on frontend/trademaster_ui/**
Issue: The CRA has been superseded by Antlers Next.js. The path frontend/trademaster_ui/** still exists (it is the current Antlers app directory per project_antlers_directory). If any commit touches files in that path, this legacy workflow will fire and attempt a CRA deploy. The name says PAUSED but no mechanism enforces the pause except that the path has low traffic.
Risk: Not zero-cost in perpetuity. Any touch of frontend/trademaster_ui/** fires this. Recommend disabling (state=disabled_manually) rather than deleting, as the deployment script may be referenced.
CW-9: Multiple monitor/infra workflows broken 100% with no operator visibility
The following workflows have been silently failing for the entire observable window (1–4 consecutive failures each in the sample, no successes):
- billing-retention-cron.yml: Postgres migration dependency
- trace-integrity-cron.yml: root cause unclear
- waf-synthetic-probe.yml: root cause unclear
- historical-bars-1min-warm-nightly.yml: likely Postgres/Heroku auth
- reminders-second-factor-nightly.yml: likely Postgres/Heroku auth
- e2e-smoke.yml: new workflow, never succeeded
- tickets-e2e-smoke.yml: never succeeded
Because slack-notify.yml is broken (CW-7), none of these failures have generated Slack alerts. The cron-heartbeat-monitor (cron-heartbeat-monitor.yml) may be detecting some of these — but its output is not visible from this audit.
§4 Phased Reduction Plan
Phase 1 — Safe wins (single-file changes, low blast radius)
These can be approved individually. None require coordinated changes across multiple files.
| # | Action | File | Rationale | Effort | Risk |
|---|---|---|---|---|---|
| P1-A | Fix slack-notify.yml workflow_run trigger: ['CI'] → ['CI — main'] | slack-notify.yml | One-line fix. Restores CI failure alerting to Slack (currently completely dead; see CW-7). | 5 min | Very low |
| P1-B | Disable (not delete) Legacy CRA deploy workflow | deploy-antlers.yml | Set state to disabled_manually via GH API. Prevents accidental CRA redeploy if trademaster_ui is touched. | 2 min | Very low |
| P1-C | Fix release-please-config.json to remove stale frontend/trademaster_ui path component | release-please-config.json | Stops 18-for-18 failure rate on Release workflow. Saves ~54 min/month of failed runner time. | 15 min | Low (release-please config only) |
| P1-D | Rotate GitHub App private key for console-degraded-auto-file | Secret in repo settings | Fixes the broken-private-key failure. Restores monitoring value for the ~1,200 min/month the workflow already consumes. The key rotation itself is an operator action. | 10 min (key rotation) | Low |
| P1-E | Relax queue-zero-dyno-monitor from */15 to */30 | queue-zero-dyno-monitor.yml | Saves ~312 min/month. Detection window goes from 15 min to 30 min, which is acceptable pre-launch. | 5 min | Very low |
| P1-F | Relax flag-drift-check from every 4h to daily (0 6 * * *) | flag-drift-check.yml | Saves ~66 min/month. Drift is human-driven via PRs; daily detection is sufficient. | 5 min | Very low |
Phase 1 total estimated savings if all items land: ~430 min/month (P1-C ~54 + P1-E ~312 + P1-F ~66). Fixing the surface-degraded-auto-file key (P1-D) restores monitoring value but does not reduce minutes while the workflow stays at the same frequency.
Phase 1 total if surface-degraded-auto-file also moves to 30-min after key fix: ~630 min/month.
Phase 2 — Consolidation (workflow file changes, needs testing)
These require changes to workflow YAML and should be done in a single PR with a one-day soak on staging.
| # | Action | Files | Rationale | Effort | Risk |
|---|---|---|---|---|---|
| P2-A | Add path filter to ci.yml to skip on doc/design-only pushes | ci.yml | Reduces spurious full-test runs on doc-only commits. Est. saves ~30 min/month. | 30 min | Medium: must test detect-changes still fires correctly |
| P2-B | Add path filter to deploy-heroku.yml for backend paths only | deploy-heroku.yml | Stops backend deploy from triggering on frontend-only pushes. Est. saves ~40 min/month. | 45 min | Medium: test that all legitimate backend-change paths are covered |
| P2-C | Fix deploy-getraxx.yml smoke job: add npx playwright install step | deploy-getraxx.yml | Fixes 4/4 failure rate. No new behavior; it only installs the missing Playwright dependency. | 20 min | Low |
| P2-D | Fix deploy-internal-docs.yml emit-audit-event composite action usage | .github/actions/emit-audit-event/action.yml | github.run_id context unavailable in composite action template at parse time. Fix: pass values as inputs from the calling workflow. The same failure also affects other workflows using this action. | 30 min | Medium: touches shared composite action |
| P2-E | Consolidate the 3 CF lint workflows into a single workflow with 3 jobs | New file replacing lint-cf-access-headers.yml, lint-cf-pages-deploy-uniqueness.yml, lint-cf-tokens.yml | All 3 share the same triggers, near-identical path filters (.github/workflows/**, plus scripts/ci/lint_cf_access_headers.py for the header lint), and a 16–17 s runtime. Running as 3 separate workflows means 3 runner allocations per PR touching a workflow file. Consolidating drops from ~24 min/month to ~8 min/month for those triggers. Since minutes are billed per job (§1), reaching ~8 min/month would require running the three lints as steps of one job rather than as three jobs. | 45 min | Low: identical triggers, independent checks |
| P2-F | Fix billing-retention-cron failure: confirm Postgres migration status, re-enable or pause schedule | billing-retention-cron.yml | Stops 100% failure waste. Either update the DB path after the Postgres migration lands, or make the workflow workflow_dispatch-only and comment out the schedule, as billing-collector-cron.yml did. | 20 min (after Postgres migration) | Low once migration is in |
Phase 2 estimated additional savings: ~90–120 min/month (path filters + getraxx fix + CF lint consolidation).
Phase 3 — Production-lean shape (architectural changes, operator sign-off required on design)
| # | Action | Rationale |
|---|---|---|
| P3-A | Require per-surface confirmation that the staging deploy succeeded before the prod deploy can run | Currently deploy-heroku.yml deploys staging and production in sequence within the same workflow run. The staging gate (synthetic-gate) runs after staging deploy, but production deploy is a separate workflow_dispatch step. Formalizing the staging→soak→prod promotion model means prod deploy only fires from workflow_dispatch (never from push), and requires a staging smoke pass as a prerequisite. |
| P3-B | Convert all production deploy triggers to workflow_dispatch-only | deploy-antlers-next-prod.yml, deploy-console.yml (prod jobs), deploy-velvet.yml, deploy-queue.yml currently trigger on push to main. Under the operator's lean-prod model, these should only trigger manually or via a promote action, not automatically on every push. Staging deploys remain on push-to-main (fast, automatic). |
| P3-C | Add a single prod-promote workflow that gates on: (1) staging smoke passing, (2) operator confirms | One promote-to-prod.yml workflow with inputs: surface (dropdown: raptor / console / antlers / velvet / queue). It checks the latest staging deploy was green, runs a pre-promote smoke, then fires the surface-specific production deploy as a workflow_call. Eliminates per-surface prod deploy triggers. |
| P3-D | Consolidate monitoring crons that fire at the same time | Six separate workflows currently fire between 06:00 and 07:00 UTC: alembic-version-cron (06:00), drift-orchestrator-cron (06:00), bot-token-smoke (06:30), cron-heartbeat-monitor (06:30), bcp-vault-snapshot-daily (07:00), ci-digest-cron (07:00). These could share a single "nightly-ops-sweep" workflow with sequential jobs, reducing runner-startup overhead. |
| P3-E | Remove E2E smoke from PR trigger until it passes | e2e-smoke.yml triggers on PR and has never succeeded. It is consuming ~84 min/month on PRs that will never benefit from it. Gate it behind workflow_dispatch-only until the E2E suite is green. |
Phase 3 estimated additional savings: ~150–200 min/month (primarily from removing push-triggered prod deploys that become dispatch-only, and fixing the E2E PR trigger).
§5 Staging Autodeploy + Production-Lean Architecture
Current state per surface
| Surface | Staging autodeploy? | Prod autodeploy? | Manual gate? |
|---|---|---|---|
| Raptor (API) | YES (push to main, no path filter) | YES (same workflow, no manual confirm) | freeze-check on dispatch=prod |
| Antlers Next.js | YES (push to main, raxx-next/** paths) | YES (separate prod workflow, same push trigger) | none |
| Console | YES (push to main, console/** paths) | YES (same workflow, no manual confirm) | freeze-check on dispatch=prod |
| Velvet | YES (push to main, velvet/** paths) | YES (same workflow, no manual confirm) | freeze-check |
| Queue | YES (push to main, queue/** paths) | YES (same workflow, confirm-gate-rejected) | confirm-gate job with manual input |
| getraxx | YES (push to main) | (CF Pages: same CF Pages deploy) | none |
| Antlers Next | YES (staging workflow auto) | SEPARATE (prod workflow auto) | none |
Proposed lean-prod shape (after Phase 3)
Push to main
|
v
[Path filters route to correct deploy workflow]
|
+---> deploy-raptor-staging (automatic, on push)
+---> deploy-antlers-staging (automatic, on push)
+---> deploy-console-staging (automatic, on push)
+---> deploy-velvet-staging (automatic, on push)
+---> deploy-queue-staging (automatic, on push)
|
v
[Staging smoke: synthetic-gate + surface health check]
|
v
[Soak period — default: next business day (no automated timer)]
|
v
[Operator: dispatch promote-to-prod with surface input]
|
v
promote-to-prod.yml
|
+---> Verify: latest staging deploy = green
+---> Run: pre-promote smoke (30s)
+---> Deploy to production
+---> Post-deploy: synthetic-gate on prod
+---> Notify: ops@ digest
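The flow above could map onto a single dispatch-only workflow. The following is a sketch under stated assumptions, not the final design: the input name surface comes from the diagram, while the option values, the concurrency group, and every script path under ./scripts/ are hypothetical placeholders for logic that does not exist yet.

```yaml
# promote-to-prod.yml (sketch; script paths are hypothetical placeholders)
name: Promote to prod

on:
  workflow_dispatch:
    inputs:
      surface:
        description: Surface to promote to production
        required: true
        type: choice
        options:
          - raptor
          - antlers
          - console
          - velvet
          - queue

# Serialize promotions per surface; never cancel a deploy that is mid-flight.
concurrency:
  group: promote-to-prod-${{ inputs.surface }}
  cancel-in-progress: false

jobs:
  promote:
    runs-on: ubuntu-latest
    env:
      SURFACE: ${{ inputs.surface }}   # passed via env rather than inlined into shell
    steps:
      - uses: actions/checkout@v4      # tag as listed in BROKEN-6; subject to the Node.js 24 bump

      - name: Verify latest staging deploy is green
        run: ./scripts/verify-staging-green.sh "$SURFACE"

      - name: Pre-promote smoke
        run: ./scripts/pre-promote-smoke.sh "$SURFACE"

      - name: Deploy to production
        run: ./scripts/deploy-prod.sh "$SURFACE"

      - name: Post-deploy synthetic-gate on prod
        run: ./scripts/synthetic-gate.sh "$SURFACE" production

      - name: Notify ops digest
        if: always()                   # assumed: report failures as well as successes
        run: ./scripts/notify-ops-digest.sh "$SURFACE" "${{ job.status }}"
```

Because steps run in order and a failing step stops the job, a red staging check or a failed pre-promote smoke would prevent the deploy step from running at all.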
What stays the same
- Staging: fast, automatic, push-triggered. No friction.
- PR gate: unchanged — comprehensive, path-aware.
- Scheduled crons: unchanged except frequency adjustments in Phase 1.
- CF Pages surfaces (getraxx, mockups, status-page, customer-docs): autodeploy is correct for these; no customer-critical data at risk.
What changes in Phase 3
- Raptor, Console, Antlers, Velvet, and Queue production deploys become workflow_dispatch-only.
- A new promote-to-prod.yml consolidates the confirmation UI.
- deploy-heroku.yml no longer has a push trigger for production.
- The existing freeze-check and synthetic-gate jobs remain as the backbone.
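One way to express "push deploys staging, dispatch deploys production" inside a single workflow is a job-level if guard. This sketch assumes a hypothetical target dispatch input, hypothetical job names, and a hypothetical deploy script; the actual layout of deploy-heroku.yml was not quoted in this audit.

```yaml
# deploy-heroku.yml (sketch; input name, job names, and script path are hypothetical)
on:
  push:
    branches: [main]           # reaches staging only
  workflow_dispatch:
    inputs:
      target:
        description: Deploy target
        required: true
        type: choice
        options: [staging, production]

jobs:
  deploy-staging:
    if: github.event_name == 'push' || github.event.inputs.target == 'staging'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh staging

  deploy-production:
    # A push can never satisfy this condition, so production needs an explicit dispatch.
    if: github.event_name == 'workflow_dispatch' && github.event.inputs.target == 'production'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh production
```

The existing freeze-check and synthetic-gate jobs would sit in front of deploy-production via needs; they are omitted here for brevity.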
§6 Estimated Minutes/Month Saved Per Phase
| Phase | Actions | Est. Min/Month Saved | Notes |
|---|---|---|---|
| Phase 1 (safe wins) | P1-A through P1-F | ~430 min/month | Dominated by fixing surface-degraded-auto-file key + reducing zero-dyno-monitor frequency |
| Phase 1 + surface-degraded-auto-file at 30-min | Above + relax to */30 after key fix | ~630 min/month | |
| Phase 2 (consolidation) | P2-A through P2-F | +90–120 min/month | Path filter on ci.yml + deploy-heroku.yml + CF lint consolidation |
| Phase 3 (prod-lean shape) | P3-A through P3-E | +150–200 min/month | E2E PR trigger removal, push→dispatch prod deploys |
| Total (all phases) | P1-A through P3-E | ~770–950 min/month | From current ~3,400–3,700 min/month to ~2,500–2,900 min/month |
At GH Actions Pro rates ($0.008/min for Ubuntu runners), the current estimated burn is $27–30/month in compute alone, before the base Pro plan cost. All three phases landing brings that to roughly $20–23/month, a reduction of about $6–8/month in compute (770–950 min at $0.008/min).
Ubicloud trigger threshold is 5,000 min/month per project_ci_billing. Current estimated usage is 3,400–3,700 min/month, which is 68–74% of that threshold. Even in the worst case (no phase reduces usage), the repo is not at imminent risk of hitting the Ubicloud threshold. If surface-degraded-auto-file is fixed and remains at */15 frequency, that single workflow alone accounts for ~1,200 min/month. Fixing its root cause is the single highest-leverage action.
§7 Actively Broken Workflows — Surfacing Separately
The following workflows have been producing zero successes for the entire observable window and should be investigated as separate SRE dispatches (not in this audit's scope to fix):
BROKEN-1: release.yml — 18/18 failures
Cause: release-please-config.json contains stale path reference frontend/trademaster_ui/../../VERSION. Fix: update release-please config. Separately: Node.js 20 deprecation warning on googleapis/release-please-action@5c625bfb5d1ff62eadeeb3772007f7f66fdcf071 — this action is pinned by SHA to a version that runs on Node.js 20; GitHub Actions will force Node.js 24 starting 2026-06-16. This is an imminent breaking change, 11 days away.
BROKEN-2: console-degraded-auto-file.yml — 16/16 failures
Cause: Failed to read private key / Invalid keyData. The GitHub App private key secret is malformed or has been rotated without updating the secret. Monitoring is completely blind.
BROKEN-3: heroku-config-health-nightly.yml — real findings (not broken, but alerting correctly)
Note: This workflow is functioning correctly — it is producing failures because it found real issues: raxx-api-staging is missing REDIS_URL (severity=critical) and has an empty POSTMARK_SERVER_TOKEN (severity=warning). These are legitimate infra gaps, not workflow bugs. Operator action needed: provision REDIS_URL on raxx-api-staging.
BROKEN-4: deploy-internal-docs.yml — recurring failures
Cause: emit-audit-event/action.yml uses github.run_id and github.server_url in a composite action template context where the github context is not available. This is a shared composite action bug that may affect other workflows using emit-audit-event.
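The fix proposed as P2-D in §8 is to pass these values in explicitly. A minimal sketch of that shape, assuming the input names run_id and server_url and a placeholder echo step in place of the real emit logic (the actual action.yml was not quoted in this audit):

```yaml
# .github/actions/emit-audit-event/action.yml (sketch)
name: Emit audit event
description: Emit an audit event for the calling workflow run
inputs:
  run_id:
    description: Run ID of the calling workflow
    required: true
  server_url:
    description: Server URL of the calling workflow
    required: true
runs:
  using: composite
  steps:
    - name: Emit event
      shell: bash
      env:
        RUN_ID: ${{ inputs.run_id }}
        SERVER_URL: ${{ inputs.server_url }}
      run: echo "audit event for run $RUN_ID on $SERVER_URL"   # placeholder for the real emit logic
```

A calling workflow would then supply the values from its own context:

```yaml
# step inside a calling workflow job (sketch)
- uses: ./.github/actions/emit-audit-event
  with:
    run_id: ${{ github.run_id }}
    server_url: ${{ github.server_url }}
```

Making both inputs required means a caller that forgets to pass them is flagged at the call site rather than producing an empty value inside the action.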
BROKEN-5: Multiple nightly crons failing (likely Postgres migration dependency)
billing-retention-cron, historical-bars-1min-warm-nightly, reminders-second-factor-nightly, trace-integrity-cron — likely all failing due to the SQLite→Postgres migration in progress (project_raptor_postgres_migration_decision). These should be paused (schedule commented out, dispatch-only) until the migration lands, matching the pattern already established by billing-collector-cron.yml.
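Pausing a cron this way leaves the workflow registered and manually runnable. A sketch of the trigger block, assuming the pattern in billing-collector-cron.yml is a commented-out schedule next to workflow_dispatch; the cron expression shown is a hypothetical stand-in for whatever each file currently uses:

```yaml
# billing-retention-cron.yml (sketch: trigger block only)
on:
  # Paused until the Postgres migration lands; uncomment to re-enable.
  # schedule:
  #   - cron: "0 6 * * *"
  workflow_dispatch:
```

The same edit would apply to the other three crons listed above.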
BROKEN-6: Node.js 20 deprecation — deadline 2026-06-16 (11 days)
Affects: actions/checkout@v4, actions/setup-python@v5, actions/create-github-app-token@v1, googleapis/release-please-action (pinned SHA). GitHub will force Node.js 24 after 2026-06-16. Actions that haven't published Node.js 24-compatible versions will break. The most urgent is googleapis/release-please-action pinned to a specific SHA — check if a newer SHA that supports Node 24 is available.
§8 Operator-Approval Checklist
For each proposed action, check the box to greenlight:
Phase 1 (safe wins — low risk)
[ ] P1-A slack-notify.yml: fix workflow_run name 'CI' → 'CI — main'
Restores Slack CI-failure alerting. Zero behavioral change otherwise.
[ ] P1-B deploy-antlers.yml: disable (not delete) via GH API
Prevents accidental CRA deploy. File stays on disk for reference.
[ ] P1-C release-please-config.json: remove stale frontend/trademaster_ui component
Stops 100% Release failure rate. Requires review of config to confirm
which components are still active.
[ ] P1-D Rotate GitHub App private key for surface-degraded-auto-file
(OPERATOR ACTION: requires secret provisioning in repo settings)
Restores console degradation monitoring.
[ ] P1-E queue-zero-dyno-monitor: relax cron */15 → */30
Saves ~312 min/month. Detection window: 15 min → 30 min.
[ ] P1-F flag-drift-check: relax cron every 4h → daily 06:00 UTC
Saves ~66 min/month. Flag changes are human-driven via PRs.
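The three config edits in Phase 1 (P1-A, P1-E, P1-F) are each a one-line change to a trigger block. The fragments below are a sketch: the "was" values for the crons are inferred from the descriptions above, the exact current flag-drift expression is an assumption, and the workflow_run types list is hypothetical.

```yaml
# slack-notify.yml (P1-A): match the CI workflow's current name
on:
  workflow_run:
    workflows: ["CI — main"]     # was ["CI"]
    types: [completed]           # assumed; keep whatever the file lists today
---
# queue-zero-dyno-monitor (P1-E): every 30 minutes
on:
  schedule:
    - cron: "*/30 * * * *"       # was "*/15 * * * *"
---
# flag-drift-check (P1-F): once daily at 06:00 UTC
on:
  schedule:
    - cron: "0 6 * * *"          # was every 4h, e.g. "0 */4 * * *" (assumed)
```

Each fragment belongs to a different file; the --- separators only keep the three fragments distinct in one listing.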
Phase 2 (consolidation — moderate risk, test before merge)
[ ] P2-A ci.yml: add paths filter to skip on doc/design-only pushes
Needs verification that detect-changes still fires for all code paths.
[ ] P2-B deploy-heroku.yml: add backend-only path filter
Needs verification all legitimate backend paths are covered.
Confirm with operator: are there any infra-file paths outside
backend_v2/ that should trigger a Raptor deploy?
[ ] P2-C deploy-getraxx.yml: add npx playwright install to smoke job
Fixes 4/4 failure rate. Low risk — additive step only.
[ ] P2-D Fix emit-audit-event composite action github context usage
Review .github/actions/emit-audit-event/action.yml; pass run_id
and server_url as explicit inputs from calling workflows.
[ ] P2-E Consolidate 3 CF lint workflows into 1
New file replaces 3 files. Identical trigger/path logic, 3 jobs.
Confirm: are any external processes monitoring these specific
workflow names?
[ ] P2-F billing-retention-cron: pause schedule (comment out), dispatch-only
Mirror billing-collector-cron pattern. Re-enable post-Postgres migration.
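For P2-A, a paths-ignore filter is one way to skip doc-only and design-only pushes. This is a sketch: the ignored globs are hypothetical and need to be checked against the repo layout, and it assumes ci.yml triggers on push to main and on pull requests.

```yaml
# ci.yml (sketch: trigger block only; globs are hypothetical)
on:
  push:
    branches: [main]
    paths-ignore:
      - "docs/**"
      - "design/**"
      - "**/*.md"
  pull_request:
    paths-ignore:
      - "docs/**"
      - "design/**"
      - "**/*.md"
```

A run is skipped only when every changed file matches an ignored glob. If any ci.yml job is a required status check, a skipped run leaves that check pending and blocks the merge, so required checks need to be reviewed before this lands.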
Phase 3 (architectural — requires separate operator sign-off on design)
[ ] P3-DESIGN Operator approves staging-autodeploy / prod-dispatch model
described in §5 before any Phase 3 workflow changes begin.
Specifically: confirm all 5 Heroku surfaces (Raptor/Console/
Antlers/Velvet/Queue) should move to dispatch-only prod deploys.
[ ] P3-A Introduce per-surface staging smoke prerequisite for prod
[ ] P3-B Convert prod deploy triggers to workflow_dispatch-only
[ ] P3-C Create promote-to-prod.yml consolidation workflow
[ ] P3-D Consolidate 06:00 UTC cron cluster into nightly-ops-sweep
[ ] P3-E Remove e2e-smoke.yml from PR trigger until suite is green
§9 Ubicloud Assessment
Current estimated burn: ~3,400–3,700 min/month.
Ubicloud threshold: 5,000 min/month (per project_ci_billing).
Current usage is at ~68–74% of the threshold. Not at imminent risk. The three highest-cost workflows are surface-degraded-auto-file (~1,200 min/month when broken), queue-zero-dyno-monitor (~624 min/month), and mbt-resting-orders-cron (~344 min/month); all are running correctly except surface-degraded-auto-file. Fixing Phase 1 alone brings the estimate down to ~2,800–3,100 min/month (56–62% of threshold).
Recommendation: Ubicloud is not needed in the near term. If a significant new service launches (iOS CI running full Xcode builds, for instance) or if the E2E suite starts running on every PR and takes 10+ minutes, revisit. Do not queue Ubicloud migration until burn sustains above 4,500 min/month for two consecutive months.
References
- docs/ops/runbooks/ci-hygiene.md: existing CI hygiene runbook
- docs/ops/runbooks/ci-runner-posture.md: runner posture notes
- feedback_gh_actions_transitive_skip: needs chain skip propagation
- feedback_pre_launch_digest_notifications: pre-launch digest posture
- feedback_asset_manifest_layer_a: asset manifest guard (keep!)
- project_ci_billing: GH Actions Pro plan + Ubicloud threshold
- project_raptor_postgres_migration_decision: explains several cron failures