GitHub Actions Audit — 2026-06-05

Author: sre-agent
Date: 2026-06-05
Status: Recommendation-only. No workflow files modified. Operator approval required before any Phase execution.

§1 Inventory Table

Total workflows on disk: 72 (including __tests__/ subdirectory with one shell test — not a workflow file). Active registered workflows: 71. All are in active state per the GitHub API. Run data sampled from the most recent 500 completed runs as of 2026-06-05T14:00 UTC.

Duration estimates use the createdAt → updatedAt delta on successful runs (the best available via gh run list). This delta spans the whole run, so it also counts queue wait time, which is typically 5–15s on the hosted runners.
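The duration estimate described above can be sketched as follows. This is a minimal illustration, assuming the input is JSON shaped like the output of `gh run list --json createdAt,updatedAt,conclusion`; the sample records and the function names are hypothetical and are not taken from the audit tooling.

```python
import json
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    # gh prints RFC 3339 timestamps such as "2026-06-05T13:58:07Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def average_success_duration(runs_json: str) -> Optional[float]:
    """Mean createdAt -> updatedAt delta, in seconds, over successful runs."""
    runs = json.loads(runs_json)
    deltas = [
        (parse_ts(r["updatedAt"]) - parse_ts(r["createdAt"])).total_seconds()
        for r in runs
        if r.get("conclusion") == "success"
    ]
    return sum(deltas) / len(deltas) if deltas else None


# Hypothetical sample records (not real run data).
SAMPLE = json.dumps([
    {"createdAt": "2026-06-05T13:00:00Z", "updatedAt": "2026-06-05T13:00:13Z", "conclusion": "success"},
    {"createdAt": "2026-06-05T13:15:00Z", "updatedAt": "2026-06-05T13:15:15Z", "conclusion": "success"},
    {"createdAt": "2026-06-05T13:30:00Z", "updatedAt": "2026-06-05T13:30:40Z", "conclusion": "failure"},
])

print(average_success_duration(SAMPLE))
```

Failed runs are excluded so that early aborts do not drag the average down; a workflow with no successful runs yields None, which corresponds to the "unable to measure" entries in the table.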

Runs-per-month for cron workflows derived from cron expression parsing. PR/push trigger run-counts are derived from observed activity in the sample window. The sample window covers roughly 7–10 days of activity, so "runs last 30 days" for frequently-triggered workflows is extrapolated (noted where extrapolated).
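The cron-derived run counts can be reproduced with a small sketch like the one below. It is an illustration only: it assumes a 30-day month, handles only the minute and hour fields (`*`, `*/n`, single values, ranges, lists), and rejects schedules restricted by day or month. The helper names are hypothetical.

```python
def expand_field(field: str, lo: int, hi: int) -> set:
    """Expand one cron field (e.g. '*/15', '0', '13-20') into its matching values."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            a, b = base.split("-")
            start, end = int(a), int(b)
        else:
            start = int(base)
            # 'n/step' means "from n to the field maximum"; a bare 'n' is just n.
            end = hi if step else start
        values.update(range(start, end + 1, step_n))
    return values


def fires_per_month(expr: str, days: int = 30) -> int:
    """Count scheduled fires for an every-day cron expression over `days` days."""
    minute, hour, dom, month, dow = expr.split()
    if (dom, month, dow) != ("*", "*", "*"):
        raise ValueError("sketch only handles schedules that run every day")
    return len(expand_field(minute, 0, 59)) * len(expand_field(hour, 0, 23)) * days


print(fires_per_month("*/15 * * * *"))  # every 15 minutes
print(fires_per_month("0 */4 * * *"))   # every 4 hours
print(fires_per_month("0 6 * * *"))     # daily at 06:00
```

These three expressions give 2,880, 180, and 30 fires respectively, matching the every-15-min, every-4h, and daily rows in the inventory.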

Minute-cost rounding: GH Actions rounds each job's runtime up to the next whole minute, so every job costs at least 1 minute. A 9-second job that runs 96 times/month still costs 96 minutes.
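The difference between raw runtime and rounded minutes can be shown with a short sketch. It assumes a single-job workflow whose job time is rounded up to a whole minute; the function names are illustrative.

```python
import math


def raw_minutes(avg_seconds: float, runs: int) -> float:
    """Unrounded runtime: seconds per run times runs, expressed in minutes."""
    return avg_seconds * runs / 60


def billed_minutes(job_seconds: float, runs: int) -> int:
    """Per-job rounding: each run's job is rounded up to a whole minute (minimum 1)."""
    return max(1, math.ceil(job_seconds / 60)) * runs


# The 9-second job from the paragraph above, run 96 times a month.
print(raw_minutes(9, 96))
print(billed_minutes(9, 96))
```

For the 9-second example this gives 14.4 raw minutes against 96 rounded minutes, which is why short, frequent jobs cost more than their runtime suggests.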

#	Workflow Name	File	Trigger Types	Path Filter	Avg Dur (s)	Runs/Month Est.	Est. Min/Month	Last Conclusion
1	Alembic version drift check	`alembic-version-cron.yml`	schedule (06:00 UTC daily), dispatch	none	51 (3 jobs)	30	27	success
2	Antlers Next.js CI	`antlers-next-ci.yml`	push+PR on `frontend/raxx-next/**`	yes	164 (6 jobs)	~60	165	failure
3	BCP vault snapshot daily	`bcp-vault-snapshot-daily.yml`	schedule (07:00 UTC daily), dispatch	none	15	30	8	failure (all)
4	BCP smoke monthly	`bcp-smoke-monthly.yml`	schedule (1st of month 08:00), dispatch	none	unable to measure	1	1	never fired
5	Billing collector cron	`billing-collector-cron.yml`	dispatch only (PAUSED)	none	n/a	0	0	never fired
6	Billing retention cron	`billing-retention-cron.yml`	schedule (03:00 UTC daily), dispatch	none	11	30	6	failure (all)
7	Bot token smoke	`daily-bot-token-smoke.yml`	schedule (06:30 UTC daily), dispatch	none	9	30	5	success
8	CF Access header lint	`lint-cf-access-headers.yml`	PR + push/main on `.github/workflows/**`, `scripts/ci/lint_cf_access_headers.py`	yes	16	~30	8	success
9	CF Pages deploy uniqueness lint	`lint-cf-pages-deploy-uniqueness.yml`	PR + push/main on `.github/workflows/**`	yes	16	~30	8	success
10	CF token usage lint	`lint-cf-tokens.yml`	PR + push/main on `.github/workflows/**`	yes	17	~30	8	success
11	CI Digest daily	`ci-digest-cron.yml`	schedule (07:00 UTC daily), dispatch	none	22	30	11	success
12	CI — console	`ci-console.yml`	PR + push/main on `console/**`	yes	66	~80	88	failure
13	CI — main	`ci.yml`	push/main (no path filter)	none	183	~45	138	failure
14	Console review app	`review-app-console.yml`	PR on `console/**`	yes	154	~25	64	success
15	Cron heartbeat monitor	`cron-heartbeat-monitor.yml`	schedule (06:30 UTC daily), dispatch	none	14	30	7	success
16	Daily card groomer	`daily-card-groomer.yml`	schedule (09:00 UTC daily), dispatch	none	67	30	34	success
17	Deploy Antlers LEGACY CRA	`deploy-antlers.yml`	push/main on `frontend/trademaster_ui/**`, dispatch	yes	n/a	0 (PAUSED 2026-06-01)	0	never (paused)
18	Deploy Antlers Next — prod	`deploy-antlers-next-prod.yml`	push/main on `frontend/raxx-next/**`, dispatch	yes	161	~20	54	success
19	Deploy Antlers Next — staging	`deploy-antlers-next-staging.yml`	push/main on `frontend/raxx-next/**`, dispatch	yes	161	~20	54	success
20	Deploy Antlers cutover	`deploy-antlers-cutover.yml`	dispatch only	none	unable to measure	0	0	never fired
21	Deploy Queue	`deploy-queue.yml`	push/main on `queue/**`, dispatch	yes	unable to measure	~2	14	never (no recent queue changes)
22	Deploy Queue failure streak monitor	`deploy-queue-failure-monitor.yml`	workflow_run on Deploy Queue	none	unable to measure	~2	2	never
23	Deploy Velvet	`deploy-velvet.yml`	push/main on `velvet/**`, dispatch	yes	98	~3	5	success
24	Deploy console	`deploy-console.yml`	push/main on `console/**`, dispatch	yes	198	~8	27	success
25	Deploy console shim Worker	`deploy-console-shim.yml`	push/main on `infra/cf-workers/console-deploy-shim/**`, dispatch	yes	unable to measure	~1	2	never fired recently
26	Deploy customer docs	`deploy-customer-docs.yml`	push/main on `docs/customer/**`, dispatch	yes	unable to measure	~3	3	never fired recently
27	Deploy design mockups	`deploy-mockups.yml`	push/main on `docs/design/`, `mockups-site/`, dispatch	yes	46	~8	7	success
28	Deploy failure streak alert	`deploy-failure-streak-alert.yml`	workflow_run on 14 deploy workflows	none	7	~47	6	success
29	Deploy flag docs	`deploy-flag-docs.yml`	push+PR on `backend_v2/api/feature_flags.yaml`	yes	28	~10	5	success
30	Deploy getraxx landing page	`deploy-getraxx.yml`	push/main on `frontend/getraxx-landing/**`, dispatch	yes	98	~10	17	failure (all)
31	Deploy internal docs	`deploy-internal-docs.yml`	push/main on `docs/*/.md`, `backend_v2/api/feature_flags.yaml`, dispatch	yes	56	~12	11	failure
32	Deploy status Worker	`deploy-status-worker.yml`	push/main on `frontend/status-worker/**`, dispatch	yes	unable to measure	~1	1	never fired recently
33	Deploy status page	`deploy-status-page.yml`	push/main on `frontend/status-page/**`, dispatch	yes	62	~1	1	success
34	Deploy support portal	`deploy-support.yml`	push/main on `frontend/support/**`, dispatch	yes	unable to measure	~1	1	never fired recently
35	Deploy to Heroku	`deploy-heroku.yml`	push/main (no path filter), dispatch	none	159	~45	120	success
36	Deploy waf-log-shipper Worker	`deploy-worker-waf-log-shipper.yml`	push/main on `workers/waf-log-shipper/**`, dispatch	yes	unable to measure	~1	1	never fired recently
37	Drift orchestrator cron	`drift-orchestrator-cron.yml`	schedule (06:00 UTC daily), dispatch	none	25	30	13	success
38	E2E smoke (signup→onboarding)	`e2e-smoke.yml`	PR + schedule, dispatch	none	142	~35 (PR) + 30 (schedule) = 65	154	failure (all)
39	Flag drift check	`flag-drift-check.yml`	schedule (every 4h), dispatch	none	32	180	96	success
40	FreeScout daily backup	`freescout-backup.yml`	schedule (06:00 UTC daily), dispatch	none	25	30	13	success
41	FreeScout Terraform apply	`freescout-apply.yml`	dispatch only	none	unable to measure	0	0	never fired
42	Heroku config-var health	`heroku-config-health-nightly.yml`	schedule (05:00 UTC daily), dispatch	none	22	30	11	failure (all — real findings)
43	Launch Readiness Check	`launch-readiness-check.yml`	dispatch only	none	unable to measure	0	0	never fired
44	Nightly Security Scan	`nightly-security-scan.yml`	schedule (08:07 UTC daily), dispatch	none	136	30	68	failure
45	Nightly 1-min bar cache warm	`historical-bars-1min-warm-nightly.yml`	schedule (22:00 UTC daily + 23:00 Sun), dispatch	none	11	34	7	failure (all)
46	Nightly second-factor reminders	`reminders-second-factor-nightly.yml`	schedule (14:00 UTC daily), dispatch	none	15	30	8	failure (all)
47	PII scan	`pii-scan.yml`	PR + push/main (no path filter)	none	17	~75	22	success
48	PR Gates	`ci-pr.yml`	PR, dispatch	none	74	~50	62	success
49	PR Preview	`pr-preview.yml`	PR (detect-then-run pattern)	none (internal detect)	25	~50	21	success
50	Queue Docker smoke	`queue-docker-smoke.yml`	PR (no path filter on outer trigger, detect-job inside)	none (detect inside)	13	~50	11	success
51	Queue vcpkg manifest check	`vcpkg-manifest-check.yml`	PR on `queue/vcpkg.json`, `queue/Dockerfile`	yes	unable to measure	~2	1	never fired recently
52	Queue zero-dyno monitor	`queue-zero-dyno-monitor.yml`	schedule (every 15 min), dispatch	none	13	2,880	624	success
53	Release	`release.yml`	push/main (no path filter), dispatch	none	72	~45	54	failure (all — release-please config bug)
54	SC-IDENT-6 Bot identity smoke	`sc-ident-6-smoke.yml`	dispatch only	none	unable to measure	0	0	never fired
55	Security OWASP ZAP	`security-zap.yml`	PR on `frontend/raxx-next/` + `backend_v2/api/`; schedule Mon 09:07 UTC	yes	15	~35 PR + 4 sched = 39	11	success
56	Slack Notify	`slack-notify.yml`	workflow_run on `'CI'` (mismatch — see §3)	none	unable to measure	0 (never fires)	0	never fired
57	Smoke test cloudflare-retry	`test-cloudflare-retry-action.yml`	dispatch only	none	unable to measure	0	0	never fired
58	Smoke test emit-audit-event	`test-emit-audit-event-action.yml`	dispatch only	none	unable to measure	0	0	never fired
59	Surface degraded auto-file	`console-degraded-auto-file.yml`	schedule (every 15 min), dispatch	none	25	2,880	1,200	failure (all — broken private key)
60	Synthetic Gate	`synthetic-gate.yml`	workflow_call, schedule (Mon–Fri 13:00 UTC), dispatch	none	43	~22	16	success
61	Terraform validate	`terraform-validate.yml`	PR + push/main on `terraform/**`	yes	unable to measure	~5	3	never fired recently
62	Terraform email-delivery-stack	`terraform-email-delivery-stack.yml`	PR + push/main + schedule (06:00 UTC daily) on `terraform/modules/email-delivery-stack/**`	yes + schedule	9	30 sched + ~2 push = 32	5	success
63	Test notify-deploy-status action	`test-notify-deploy-status.yml`	PR + push/main on `.github/actions/notify-deploy-status/**`	yes	unable to measure	~2	1	never fired recently
64	Trace integrity cron	`trace-integrity-cron.yml`	schedule (02:00 UTC daily), dispatch	none	11	30	6	failure (all)
65	WAF Synthetic Probes	`waf-synthetic-probe.yml`	schedule (Mon–Fri 14:00 UTC), dispatch	none	27	22	10	failure (all)
66	Workflow secret names lint	`lint-workflow-secret-names.yml`	PR + push/main on `.github/workflows/**`	yes	16	~30	8	success
67	iOS CI	`ios-ci.yml`	push + PR on `ios/**`	yes	24	~5	2	success
68	mbt-drift-daily	`mbt-drift-daily.yml`	schedule (23:00 UTC daily), dispatch	none	28	30	14	success
69	mbt-drift-per-symbol-weekly	`mbt-drift-per-symbol-weekly.yml`	schedule (Sun 23:30 UTC), dispatch	none	unable to measure	4	2	never fired recently
70	mbt-resting-orders-cron	`mbt-resting-orders-cron.yml`	schedule (every 5 min, Mon–Fri 13:00–20:00 UTC)	none	43	~480	344	success
71	tickets E2E smoke	`tickets-e2e-smoke.yml`	schedule, dispatch	none	38	30	19	failure

Estimated total minutes/month (current state): ~3,400–3,700 min/month

The three highest single contributors (raw runtime, before per-job rounding):
1. surface-degraded-auto-file: ~1,200 min/month (every-15-min schedule, 25s per run × 2,880 fires ÷ 60, currently failing 100%)
2. queue-zero-dyno-monitor: ~624 min/month (every-15-min schedule, always runs, 13s per run × 2,880 fires ÷ 60)
3. mbt-resting-orders-cron: ~344 min/month (every-5-min weekday trading hours)

§2 Categorization

Critical path — must run on every PR or push to main

Workflow	Verdict
PR Gates (`ci-pr.yml`)	Keep — comprehensive gate, path-aware detect-changes
PII scan (`pii-scan.yml`)	Keep — operator instruction: never remove
CI — main (`ci.yml`)	Keep — push-to-main integration tests; add path filter (see §3)
CI — console (`ci-console.yml`)	Keep — already path-filtered to `console/**`
Antlers Next.js CI (`antlers-next-ci.yml`)	Keep — already path-filtered to `frontend/raxx-next/**`
iOS CI (`ios-ci.yml`)	Keep — already path-filtered to `ios/**`
Security OWASP ZAP (`security-zap.yml`)	Keep — path-filtered to `frontend/raxx-next/` + `backend_v2/api/`; the path was previously `frontend/trademaster_ui/` and has already been updated to `frontend/raxx-next/`, which is correct
Queue Docker smoke (`queue-docker-smoke.yml`)	Keep — no outer path filter but internal detect-job pattern handles it correctly
Terraform validate (`terraform-validate.yml`)	Keep — path-filtered to `terraform/**`
Queue vcpkg manifest check (`vcpkg-manifest-check.yml`)	Keep — path-filtered to `queue/vcpkg.json` and `queue/Dockerfile`

Path-filtered — correct, keep

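These workflows trigger only when relevant paths change, and their trigger scoping needs no change. Job-level failures in deploy-getraxx.yml and deploy-internal-docs.yml are a separate issue, listed under the actively broken workflows.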

deploy-antlers-next-prod.yml, deploy-antlers-next-staging.yml, deploy-console.yml, deploy-velvet.yml, deploy-queue.yml, deploy-getraxx.yml, deploy-internal-docs.yml, deploy-customer-docs.yml, deploy-flag-docs.yml, deploy-mockups.yml, deploy-status-page.yml, deploy-status-worker.yml, deploy-support.yml, deploy-console-shim.yml, deploy-worker-waf-log-shipper.yml, lint-cf-access-headers.yml, lint-cf-tokens.yml, lint-cf-pages-deploy-uniqueness.yml, lint-workflow-secret-names.yml, review-app-console.yml, test-notify-deploy-status.yml

Scheduled — keep (firing meaningful output)

Workflow	Cron	Output quality
Alembic version drift check	daily 06:00	Functional — checks staging+prod migration heads
BCP vault snapshot daily	daily 07:00	Currently broken — operator-action, see the actively broken table below
Bot token smoke	daily 06:30	Functional — 100% success
CI Digest daily	daily 07:00	Functional — digest posture per `feedback_pre_launch_digest_notifications`
Cron heartbeat monitor	daily 06:30	Functional — monitors other crons
Daily card groomer	daily 09:00	Functional — 100% success
Drift orchestrator cron	daily 06:00	Functional — 100% success
Flag drift check	every 4h	Functional — 100% success; consider relaxing to daily
FreeScout daily backup	daily 06:00	Functional — 100% success
Heroku config-var health	daily 05:00	Failing — real finding (staging REDIS_URL absent), see §7
Nightly Security Scan	daily 08:07	Partially functional — failing, see §3
Nightly 1-min bar cache warm	daily 22:00	Broken 100% — see §3
Nightly second-factor reminders	daily 14:00	Broken 100% — see §3
mbt-drift-daily	daily 23:00	Functional — 100% success
mbt-drift-per-symbol-weekly	Sun 23:30	Functional — no recent runs (correct, runs weekly)
mbt-resting-orders-cron	every 5 min weekday	Functional — 100% success; high cost (~344 min/month)
Queue zero-dyno monitor	every 15 min	Functional — 100% success; highest cost among functional workflows (~624 min/month)
Billing retention cron	daily 03:00	Broken 100% — related to Postgres migration, see §3
Trace integrity cron	daily 02:00	Broken 100% — see §3
WAF Synthetic Probes	weekday 14:00	Broken 100% — see §3
Tickets E2E smoke	scheduled	Broken 100% — see §3
E2E smoke (signup→onboarding)	scheduled + PR	Broken 100% — see §3
Synthetic Gate	weekday 13:00	Functional — 100% success
Terraform email-delivery-stack	daily 06:00 + path	Functional — 100% success

Dispatch-only — runs on demand, zero recurring cost

These fire zero recurring cycles. Keep all; they are on-demand tools.

billing-collector-cron.yml (explicitly paused), deploy-antlers-cutover.yml, freescout-apply.yml, launch-readiness-check.yml, sc-ident-6-smoke.yml, test-cloudflare-retry-action.yml, test-emit-audit-event-action.yml

Always-skip / dead triggers

Workflow	Issue
Slack Notify (`slack-notify.yml`)	`workflow_run` watches `['CI']` but CI workflow name is `'CI — main'`. Name mismatch — this workflow has never fired.
Deploy Antlers LEGACY CRA (`deploy-antlers.yml`)	Paused 2026-06-01. Path `frontend/trademaster_ui/**` — the CRA app is superseded. Fires zero times.

Actively broken — 100% failure rate

These are consuming minutes without useful output. All except Deploy internal docs (4 failures in 12) have produced zero successes in the entire observed sample window:

Workflow	Observed Failures	Root Cause (from log analysis)
Surface degraded auto-file	16/16	`Failed to read private key` — GitHub App private key secret is invalid/expired
Release	18/18	`release-please failed: illegal pathing characters in path: frontend/trademaster_ui/../../VERSION` — release-please config references stale CRA path
Deploy getraxx landing page	4/4	`Cannot find module 'playwright'` — smoke job missing `npx playwright install` step
BCP vault snapshot daily	2/2	Unable to measure root cause from log tail; vault auth or S3 write failure
Billing retention cron	2/2	Postgres migration in flight (`project_raptor_postgres_migration_decision`)
Nightly 1-min bar cache warm	2/2	Unable to measure from log — likely Postgres/Heroku auth
Nightly second-factor reminders	2/2	Unable to measure from log
Trace integrity cron	2/2	Unable to measure from log
WAF Synthetic Probes	2/2	Unable to measure from log
Tickets E2E smoke	1/1	Unable to measure from log
E2E smoke (signup→onboarding)	4/4	Unable to measure from log
Deploy internal docs	4 failures in 12	`emit-audit-event/action.yml` — `Unrecognized named-value: 'github'` in composite action template
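The zero-success classification used in the table above can be sketched as a simple tally over sampled runs. This is an illustration under stated assumptions: the input is a list of (workflow name, conclusion) pairs, and the sample data below is hypothetical rather than the audit's actual run sample.

```python
from collections import Counter


def zero_success_workflows(runs):
    """Return {workflow name: run count} for workflows with no successful run."""
    totals, successes = Counter(), Counter()
    for name, conclusion in runs:
        totals[name] += 1
        if conclusion == "success":
            successes[name] += 1
    return {name: total for name, total in totals.items() if successes[name] == 0}


# Hypothetical sample: one fully failing workflow, one partially failing, one healthy.
SAMPLE_RUNS = [
    ("Release", "failure"),
    ("Release", "failure"),
    ("Release", "failure"),
    ("Deploy internal docs", "failure"),
    ("Deploy internal docs", "success"),
    ("Bot token smoke", "success"),
]

print(zero_success_workflows(SAMPLE_RUNS))
```

A partially failing workflow such as Deploy internal docs does not appear in the result, so it needs a separate failure-rate threshold if it is to be flagged automatically.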

§3 Cycle-Waster Findings

CW-1: `surface-degraded-auto-file` running every 15 minutes, 100% broken

File: console-degraded-auto-file.yml
Cron: */15 * * * * (2,880 fires/month)
Cost: ~1,200 min/month of raw runtime (25s avg × 2,880 fires ÷ 60)
Failure cause: DOMException [DataError]: Invalid keyData / Failed to read private key. The GitHub App private key stored as a secret is malformed or expired.
Impact: Zero monitoring output for the entire observable window. The workflow is consuming ~1,200 minutes per month for no operational value.
Operator action before Phase 1 can fix: Rotate/re-provision the GitHub App private key secret. Once functional, consider reducing frequency to every 30 minutes (halves cost with no meaningful detection latency change).

CW-2: `release.yml` firing on every push to main, 0% success rate

File: release.yml
Trigger: push to main, no path filter
Cost: ~54 min/month (72s avg × ~45 pushes/month)
Failure cause: release-please failed: illegal pathing characters in path: frontend/trademaster_ui/../../VERSION. The release-please configuration references a path component from the legacy CRA layout that has since been superseded. The release-please-config.json almost certainly contains a reference to frontend/trademaster_ui with a relative ../../VERSION extra-component path.
Impact: Every push to main produces a failed workflow run. This is visible noise that trains reviewers to ignore red workflow icons, which is a detection-erosion risk.
Fix: Update release-please-config.json to remove or correct the frontend/trademaster_ui component path. This is a one-line config change, not a workflow change.

CW-3: `queue-zero-dyno-monitor` running every 15 minutes, healthy but high-frequency

File: queue-zero-dyno-monitor.yml
Cron: */15 * * * * (2,880 fires/month)
Cost: ~624 min/month of raw runtime (13s avg × 2,880 fires ÷ 60)
Output: 100% success, legitimately monitoring Queue dyno count.
Issue: 15-minute frequency is fine for production SLO coverage, but this is pre-launch. The Queue service is not yet customer-facing. Detecting zero dynos within 15 minutes vs within 30 minutes makes no difference at current stage.
Proposed change: Relax to */30 * * * *, which saves ~312 min/month with zero customer-facing impact.
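The savings figure for CW-3 follows from halving the number of fires. A minimal sketch of that arithmetic, assuming a 30-day month and the 13s average run time from the inventory (the function name is illustrative):

```python
def monthly_runtime_minutes(avg_seconds: float, interval_minutes: int, days: int = 30) -> float:
    """Raw monthly runtime for a workflow that fires every `interval_minutes`."""
    fires = (24 * 60 // interval_minutes) * days
    return fires * avg_seconds / 60


current = monthly_runtime_minutes(13, 15)   # every 15 minutes
proposed = monthly_runtime_minutes(13, 30)  # every 30 minutes
print(current, proposed, current - proposed)
```

With these inputs the current schedule comes to 624 minutes, the relaxed schedule to 312, and the difference to 312 minutes per month.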

CW-4: `flag-drift-check` running every 4 hours

File: flag-drift-check.yml
Cron: 0 */4 * * * (180 fires/month)
Cost: ~96 min/month
Output: 100% success, checks flag YAML drift.
Issue: Flag changes happen via PRs (a human action). A flag drift state that exists at 04:00 UTC will still exist at 08:00 UTC. Daily detection is sufficient.
Proposed change: Change to 0 6 * * * (daily), which saves ~66 min/month.

CW-5: `ci.yml` (CI — main) has no path filter, runs full suite on every push

File: ci.yml
Trigger: push to main, no path filter
Cost: ~138 min/month (183s avg × ~45 pushes/month)
Issue: Runs backend tests, frontend tests, security scans, OpenAPI drift on every push to main regardless of what changed. The detect-changes job gates individual test jobs internally, but the workflow still starts a runner and the detect-changes job runs on every push. For pushes that only touch docs/** or mockups-site/**, the entire workflow spins up only to skip everything after detect.
Note: The detect-changes pattern is correct for what it does. The improvement is adding a top-level paths filter to skip workflow startup entirely for doc-only or design-only changes. The detect-changes internal logic already handles the rest.
Proposed change: Add paths filter excluding docs/**, mockups-site/**, *.md changes.
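The intent of the proposed exclusion can be expressed as a "doc-only push" check: the workflow is skipped only when every changed file falls under an excluded path. The sketch below is an approximation using prefix and suffix tests rather than GitHub's own pattern matcher, and the names are hypothetical.

```python
# Directories whose changes alone should not start the full CI workflow.
DOC_PREFIXES = ("docs/", "mockups-site/")


def is_doc_only_push(changed_files) -> bool:
    """True when every changed file is documentation or design material."""
    def is_doc(path: str) -> bool:
        return path.startswith(DOC_PREFIXES) or path.endswith(".md")

    # An empty change list is treated as "not doc-only" so nothing is skipped by accident.
    return bool(changed_files) and all(is_doc(p) for p in changed_files)


print(is_doc_only_push(["docs/runbook.md", "mockups-site/index.html"]))
print(is_doc_only_push(["docs/runbook.md", "backend_v2/api/app.py"]))
```

One detail to verify when writing the real filter: in GitHub's filter syntax `*` does not match `/`, so a bare `*.md` pattern only covers Markdown files at the repository root, whereas this sketch treats any `.md` file as documentation.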

CW-6: `deploy-heroku.yml` fires on every push to main with no path filter

File: deploy-heroku.yml
Trigger: push to main, no path filter
Cost: ~120 min/month (159s avg × ~45 pushes/month)
Issue: This workflow deploys Raptor (the backend) to Heroku, but it fires on every push to main, including pushes that only touch frontend/raxx-next/**, docs/**, mockups-site/**, etc. A frontend-only push should not trigger a backend deploy.
Proposed change: Add path filter for backend_v2/**, requirements*.txt, Procfile, runtime.txt, .github/workflows/deploy-heroku.yml. This is a careful change: verify against the freeze-check and smoke job gates.
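Before landing the path filter, the proposed pattern list can be dry-run against recent commits' changed files. The sketch below approximates GitHub's glob rules (`**` matches across directories, `*` stops at `/`) with a hand-written translation to regular expressions; it is not GitHub's matcher, and the helper names are hypothetical.

```python
import re

# Pattern list taken from the proposed change above.
BACKEND_PATHS = [
    "backend_v2/**",
    "requirements*.txt",
    "Procfile",
    "runtime.txt",
    ".github/workflows/deploy-heroku.yml",
]


def glob_to_regex(pattern: str):
    """Translate a simple path glob: '**' crosses '/', a single '*' does not."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def should_deploy(changed_files, patterns=BACKEND_PATHS) -> bool:
    """True when at least one changed file matches at least one pattern."""
    regexes = [glob_to_regex(p) for p in patterns]
    return any(rx.fullmatch(path) for path in changed_files for rx in regexes)


print(should_deploy(["frontend/raxx-next/app/page.tsx"]))
print(should_deploy(["backend_v2/api/routes.py", "docs/notes.md"]))
```

Running a check like this over the last few weeks of pushes is one way to confirm that no legitimate backend change would be filtered out.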

CW-7: `slack-notify.yml` workflow_run trigger name mismatch — never fires

File: slack-notify.yml
Trigger: workflow_run on workflows ['CI']
Issue: The CI workflow is named 'CI — main', not 'CI'. GitHub's workflow_run trigger matches on exact workflow name. This workflow has never fired in the observable sample. The CI failure notification posture (per feedback_pre_launch_digest_notifications) relies on this workflow for failure alerts, but it is silently dead.
Fix: Change workflows: ['CI'] to workflows: ['CI — main'] in slack-notify.yml.
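A mismatch of this kind can be caught mechanically by comparing every workflow_run target against the set of defined workflow names. The sketch below assumes both lists have already been extracted from the workflow files; the function name and the shortened name set are illustrative.

```python
def dead_workflow_run_targets(watched_names, defined_names):
    """Return watched workflow names that match no defined workflow exactly."""
    defined = set(defined_names)
    return [name for name in watched_names if name not in defined]


# A subset of the workflow names from the inventory table.
DEFINED = ["CI — main", "CI — console", "PR Gates", "Slack Notify"]

print(dead_workflow_run_targets(["CI"], DEFINED))         # current trigger
print(dead_workflow_run_targets(["CI — main"], DEFINED))  # after the fix
```

The comparison is deliberately exact, with no trimming or case folding, because the trigger itself matches on the exact name.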

CW-8: `deploy-antlers.yml` (Legacy CRA) active but paths point to superseded code

File: deploy-antlers.yml
Name: Deploy Antlers (LEGACY CRA — PAUSED 2026-06-01)
State: active (GH Actions API reports active despite the name containing PAUSED)
Trigger: push/main on frontend/trademaster_ui/**
Issue: The CRA has been superseded by Antlers Next.js. The path frontend/trademaster_ui/** still exists (it is the current Antlers app directory per project_antlers_directory). If any commit touches files in that path, this legacy workflow will fire and attempt a CRA deploy. The name says PAUSED but no mechanism enforces the pause except that the path has low traffic.
Risk: Not zero-cost in perpetuity. Any touch of frontend/trademaster_ui/** fires this. Recommend disabling (state=disabled_manually) rather than deleting, as the deployment script may be referenced.
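Disabling can be done through the GitHub REST API's "disable a workflow" endpoint, which accepts the workflow file name in place of a numeric ID. The sketch below only builds the request and does not send it; the owner, repository, and token values are placeholders, not details from this audit.

```python
from urllib.request import Request


def build_disable_request(owner: str, repo: str, workflow_file: str, token: str) -> Request:
    """Build (but do not send) the PUT request that disables a workflow."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/disable"
    )
    return Request(
        url,
        method="PUT",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder owner, repo, and token for illustration only.
req = build_disable_request("example-org", "example-repo", "deploy-antlers.yml", "TOKEN")
print(req.get_method(), req.full_url)
```

Sending the request would be done with urllib.request.urlopen(req) under operator approval. A disabled workflow keeps its file and history, which fits the recommendation to disable rather than delete.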

CW-9: Multiple monitor/infra workflows broken 100% with no operator visibility

The following workflows have been silently failing for the entire observable window (1–4 consecutive failures each, no successes):

billing-retention-cron.yml — Postgres migration dependency
trace-integrity-cron.yml — root cause unclear
waf-synthetic-probe.yml — root cause unclear
historical-bars-1min-warm-nightly.yml — likely Postgres/Heroku auth
reminders-second-factor-nightly.yml — likely Postgres/Heroku auth
e2e-smoke.yml — new workflow, never succeeded
tickets-e2e-smoke.yml — never succeeded

Because slack-notify.yml is broken (CW-7), none of these failures have generated Slack alerts. The cron-heartbeat-monitor (cron-heartbeat-monitor.yml) may be detecting some of these — but its output is not visible from this audit.

§4 Phased Reduction Plan

Phase 1 — Safe wins (small isolated changes, low blast radius)

These can be approved individually. None require coordinated changes across multiple files.

#	Action	File	Rationale	Effort	Risk
P1-A	Fix `slack-notify.yml` workflow_run trigger: `['CI']` → `['CI — main']`	`slack-notify.yml`	One-line fix. Restores CI failure alerting to Slack (currently completely dead, see CW-7).	5 min	Very low
P1-B	Disable (not delete) Legacy CRA deploy workflow	`deploy-antlers.yml`	Set state to `disabled_manually` via GH API. Prevents accidental CRA redeploy if trademaster_ui is touched.	2 min	Very low
P1-C	Fix `release-please-config.json` to remove stale `frontend/trademaster_ui` path component	`release-please-config.json`	Stops 18-for-18 failure rate on Release workflow. Saves ~54 min/month of failed runner time.	15 min	Low (release-please config only)
P1-D	Rotate GitHub App private key for `console-degraded-auto-file`	Secret in repo settings	Fixes the broken-private-key failure. Restores monitoring value for the ~1,200 min/month this workflow already consumes. Operator-action for the key rotation itself.	10 min (key rotation)	Low
P1-E	Relax `queue-zero-dyno-monitor` from `*/15` to `*/30`	`queue-zero-dyno-monitor.yml`	Saves ~312 min/month. Detection window goes from 15 min to 30 min, acceptable pre-launch.	5 min	Very low
P1-F	Relax `flag-drift-check` from every 4h to daily (`0 6 * * *`)	`flag-drift-check.yml`	Saves ~66 min/month. Drift is human-driven via PRs; daily detection is sufficient.	5 min	Very low

Phase 1 total estimated savings if all items land: ~430 min/month (~312 from P1-E, ~66 from P1-F, ~54 from P1-C). P1-A, P1-B, and P1-D restore function or remove risk rather than save minutes.

Phase 1 total if surface-degraded-auto-file also moves to 30-min after key fix: ~630 min/month.

Phase 2 — Consolidation (workflow file changes, needs testing)

These require changes to workflow YAML and should be done in a single PR with a one-day soak on staging.

#	Action	Files	Rationale	Effort	Risk
P2-A	Add path filter to `ci.yml` to skip on doc/design-only pushes	`ci.yml`	Reduces spurious full-test runs on doc-only commits. Est. saves ~30 min/month.	30 min	Medium — must test detect-changes still fires correctly
P2-B	Add path filter to `deploy-heroku.yml` for backend paths only	`deploy-heroku.yml`	Stop backend deploy from triggering on frontend-only pushes. Est. saves ~40 min/month.	45 min	Medium — test that all legitimate backend-change paths are covered
P2-C	Fix `deploy-getraxx.yml` smoke job: add `npx playwright install` step	`deploy-getraxx.yml`	Fixes 4/4 failure rate. No new behavior — just installs the missing Playwright dependency.	20 min	Low
P2-D	Fix `deploy-internal-docs.yml` emit-audit-event composite action usage	`.github/actions/emit-audit-event/action.yml`	`github.run_id` context unavailable in composite action template at parse time. Fix: pass values as inputs from calling workflow. Shared failure also affecting other workflows using this action.	30 min	Medium — touches shared composite action
P2-E	Consolidate the 3 CF lint workflows into a single workflow with 3 jobs	New file replacing `lint-cf-access-headers.yml`, `lint-cf-pages-deploy-uniqueness.yml`, `lint-cf-tokens.yml`	All 3 trigger on PR + push/main for `.github/workflows/**` and have similar runtimes (16–17s). Running as 3 separate workflows means 3 runner allocations per PR touching a workflow file. Consolidating to 1 workflow with 3 jobs drops from ~24 min/month to ~8 min/month for those triggers.	45 min	Low — identical triggers, independent jobs
P2-F	Fix `billing-retention-cron` failure: confirm Postgres migration status, re-enable or pause schedule	`billing-retention-cron.yml`	Stops 100% failure waste. Either update DB path after Postgres migration lands, or add `workflow_dispatch`-only + comment-out schedule like `billing-collector-cron.yml` did.	20 min (after Postgres migration)	Low once migration is in

Phase 2 estimated additional savings: ~90–120 min/month (path filters + getraxx fix + CF lint consolidation).

Phase 3 — Production-lean shape (architectural changes, operator sign-off required on design)

#	Action	Rationale
P3-A	Require confirmation, per surface, that the staging autodeploy succeeded before prod can run	Currently `deploy-heroku.yml` deploys staging and production in sequence within the same workflow run. The staging gate (`synthetic-gate`) runs after staging deploy, but production deploy is a separate `workflow_dispatch` step. Formalizing the staging→soak→prod promotion model means prod deploy only fires from `workflow_dispatch` (never from push), and requires a staging smoke pass as a prerequisite.
P3-B	Convert all production deploy triggers to `workflow_dispatch`-only	`deploy-antlers-next-prod.yml`, `deploy-console.yml` (prod jobs), `deploy-velvet.yml`, `deploy-queue.yml` currently trigger on `push` to main. Under the operator's lean-prod model, these should only trigger manually or via a promote action, not automatically on every push. Staging deploys remain on push-to-main (fast, automatic).
P3-C	Add a single prod-promote workflow that gates on: (1) staging smoke passing, (2) operator confirms	One `promote-to-prod.yml` workflow with inputs: `surface` (dropdown: raptor / console / antlers / velvet / queue). It checks the latest staging deploy was green, runs a pre-promote smoke, then fires the surface-specific production deploy as a `workflow_call`. Eliminates per-surface prod deploy triggers.
P3-D	Consolidate monitoring crons that fire at the same time	Currently 6 separate workflows fire between 06:00 and 07:00 UTC: alembic-version-cron (06:00), bcp-vault-snapshot-daily (07:00), bot-token-smoke (06:30), ci-digest-cron (07:00), cron-heartbeat-monitor (06:30), drift-orchestrator-cron (06:00). These could share a single "nightly-ops-sweep" workflow with sequential jobs, reducing runner-startup overhead.
P3-E	Remove `E2E smoke` from PR trigger until it passes	`e2e-smoke.yml` triggers on PR and has never succeeded. It is consuming ~84 min/month on PRs that will never benefit from it. Gate it behind `workflow_dispatch`-only until the E2E suite is green.

Phase 3 estimated additional savings: ~150–200 min/month (primarily from removing push-triggered prod deploys that become dispatch-only, and fixing the E2E PR trigger).

§5 Staging Autodeploy + Production-Lean Architecture

Current state per surface

Surface          Staging autodeploy?    Prod autodeploy?    Manual gate?
---------------------------------------------------------------------------
Raptor (API)     YES (push to main,     YES (same workflow,  freeze-check
                 no path filter)        no manual confirm)   on dispatch=prod
Antlers Next.js  YES (push to main,     YES (separate        none
                 raxx-next/** paths)    prod workflow,
                                        same push trigger)
Console          YES (push to main,     YES (same workflow,  freeze-check
                 console/** paths)      no manual confirm)   on dispatch=prod
Velvet           YES (push to main,     YES (same workflow,  freeze-check
                 velvet/** paths)       no manual confirm)
Queue            YES (push to main,     YES (same workflow,  confirm-gate job
                 queue/** paths)        confirm-gate-rejected with manual input)
getraxx          YES (push to main)     (CF Pages — same     none
                                        CF Pages deploy)
Antlers Next     YES (staging           SEPARATE prod        none
                 workflow auto)         workflow auto)

Proposed lean-prod shape (after Phase 3)

Push to main
     |
     v
[Path filters route to correct deploy workflow]
     |
     +---> deploy-raptor-staging   (automatic, on push)
     +---> deploy-antlers-staging  (automatic, on push)
     +---> deploy-console-staging  (automatic, on push)
     +---> deploy-velvet-staging   (automatic, on push)
     +---> deploy-queue-staging    (automatic, on push)
     |
     v
[Staging smoke: synthetic-gate + surface health check]
     |
     v
[Soak period — default: next business day (no automated timer)]
     |
     v
[Operator: dispatch promote-to-prod with surface input]
     |
     v
promote-to-prod.yml
     |
     +---> Verify: latest staging deploy = green
     +---> Run: pre-promote smoke (30s)
     +---> Deploy to production
     +---> Post-deploy: synthetic-gate on prod
     +---> Notify: ops@ digest
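The gating order in the diagram above can be sketched as a pure function that lists what still blocks a promotion. This is an illustration of the proposed flow only, not an existing workflow; the class, field, and message names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PromoteCheck:
    surface: str                    # e.g. "raptor", "console", "antlers", "velvet", "queue"
    staging_conclusion: str         # conclusion of the latest staging deploy run
    pre_promote_smoke_passed: bool  # result of the pre-promote smoke
    operator_confirmed: bool        # operator dispatched the promotion


def promote_blockers(check: PromoteCheck) -> list:
    """Return the reasons a promotion must not proceed; empty means it may deploy."""
    blockers = []
    if not check.operator_confirmed:
        blockers.append("operator has not dispatched promote-to-prod")
    if check.staging_conclusion != "success":
        blockers.append("latest staging deploy is not green")
    if not check.pre_promote_smoke_passed:
        blockers.append("pre-promote smoke failed")
    return blockers


ready = PromoteCheck("console", "success", True, True)
stale = PromoteCheck("raptor", "failure", True, True)
print(promote_blockers(ready))
print(promote_blockers(stale))
```

Returning every blocker, rather than stopping at the first, gives the operator the full picture in a single dispatch.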

What stays the same

Staging: fast, automatic, push-triggered. No friction.
PR gate: unchanged — comprehensive, path-aware.
Scheduled crons: unchanged except frequency adjustments in Phase 1.
CF Pages surfaces (getraxx, mockups, status-page, customer-docs): autodeploy is correct for these; no customer-critical data at risk.

What changes in Phase 3

Raptor, Console, Antlers, Velvet, Queue production deploys become workflow_dispatch-only.
A new promote-to-prod.yml consolidates the confirmation UI.
The deploy-heroku.yml no longer has a push trigger for production.
The existing freeze-check and synthetic-gate jobs remain as the backbone.

§6 Estimated Minutes/Month Saved Per Phase

Phase	Actions	Est. Min/Month Saved	Notes
Phase 1 (safe wins)	P1-A through P1-F	~430 min/month	Dominated by reducing zero-dyno-monitor frequency (~312), plus flag-drift-check (~66) and the Release fix (~54)
Phase 1 + surface-degraded-auto-file at 30-min	Above + relax to `*/30` after key fix	~630 min/month
Phase 2 (consolidation)	P2-A through P2-F	+90–120 min/month	Path filter on ci.yml + deploy-heroku.yml + CF lint consolidation
Phase 3 (prod-lean shape)	P3-A through P3-E	+150–200 min/month	E2E PR trigger removal, push→dispatch prod deploys
Total (all phases)		~770–950 min/month	From current ~3,400–3,700 to ~2,500–2,900

At GH Actions Pro rates ($0.008/min for Ubuntu runners), the current estimated burn is $27–30/month in compute alone, before the base Pro plan cost. All three phases landing brings that to roughly $20–23/month — a ~$7–10/month reduction in compute.

The Ubicloud trigger threshold is 5,000 min/month per project_ci_billing. Current estimated usage is 3,400–3,700 min/month, or 68–74% of that threshold. Even in the worst case (no phase reduces usage), the repo is not at imminent risk of reaching it. If surface-degraded-auto-file is fixed and remains at */15 frequency, that single workflow alone accounts for 1,200 min/month. Fixing its root cause is the single highest-leverage action.
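
The threshold share quoted above is a plain ratio against the 5,000 min/month trigger. A minimal sketch, with an illustrative function name:

```python
UBICLOUD_THRESHOLD_MIN = 5000  # per project_ci_billing


def threshold_share_pct(minutes: int) -> int:
    """Return usage as a whole-number percentage of the Ubicloud threshold."""
    return round(100 * minutes / UBICLOUD_THRESHOLD_MIN)


low, high = threshold_share_pct(3400), threshold_share_pct(3700)
```

For the current estimate this gives 68 and 74, matching the 68–74% range stated above.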

§7 Actively Broken Workflows — Surfacing Separately

The following workflows have produced zero successes across the entire observable window. Each should be investigated as a separate SRE dispatch; fixing them is outside this audit's scope.

BROKEN-1: `release.yml` — 18/18 failures

Cause: release-please-config.json contains a stale path reference, frontend/trademaster_ui/../../VERSION. Fix: update the release-please config. Separately, googleapis/release-please-action@5c625bfb5d1ff62eadeeb3772007f7f66fdcf071 raises a Node.js 20 deprecation warning: the action is pinned by SHA to a version that runs on Node.js 20, and GitHub Actions will force Node.js 24 starting 2026-06-16. That breaking change is imminent, 11 days away.

BROKEN-2: `console-degraded-auto-file.yml` — 16/16 failures

Cause: Failed to read private key / Invalid keyData. The GitHub App private key secret is malformed, or the key was rotated without updating the secret. Until the secret is fixed, this monitoring is completely blind.

BROKEN-3: `heroku-config-health-nightly.yml` — real findings (not broken, but alerting correctly)

Note: This workflow is functioning correctly — it is producing failures because it found real issues: raxx-api-staging is missing REDIS_URL (severity=critical) and has an empty POSTMARK_SERVER_TOKEN (severity=warning). These are legitimate infra gaps, not workflow bugs. Operator action needed: provision REDIS_URL on raxx-api-staging.

BROKEN-4: `deploy-internal-docs.yml` — recurring failures

Cause: emit-audit-event/action.yml uses github.run_id and github.server_url in a composite action template context where the github context is not available. This is a shared composite action bug that may affect other workflows using emit-audit-event.

BROKEN-5: Multiple nightly crons failing (likely Postgres migration dependency)

billing-retention-cron, historical-bars-1min-warm-nightly, reminders-second-factor-nightly, trace-integrity-cron — likely all failing due to the SQLite→Postgres migration in progress (project_raptor_postgres_migration_decision). These should be paused (schedule commented out, dispatch-only) until the migration lands, matching the pattern already established by billing-collector-cron.yml.

BROKEN-6: Node.js 20 deprecation — deadline 2026-06-16 (11 days)

Affects: actions/checkout@v4, actions/setup-python@v5, actions/create-github-app-token@v1, googleapis/release-please-action (pinned SHA). GitHub will force Node.js 24 starting 2026-06-16. Actions that have not published Node.js 24-compatible versions will break. The most urgent is googleapis/release-please-action, which is pinned to a specific SHA; check whether a newer SHA that supports Node.js 24 is available.
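
To see which workflow files reference the affected actions, one option is a small scan of `uses:` lines. This is a sketch, not an existing repo tool: the watchlist is copied from the list above, while the regex and the function name are assumptions. It matches on action name only and does not check whether a given version already supports Node.js 24.

```python
import re

# Action names from the BROKEN-6 list; any version or SHA is reported.
WATCHLIST = {
    "actions/checkout",
    "actions/setup-python",
    "actions/create-github-app-token",
    "googleapis/release-please-action",
}

# Matches "uses: owner/repo@ref" with or without a leading list dash.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^\s'\"#]+)", re.MULTILINE)


def find_watched_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` reference whose action name is on the watchlist."""
    hits: list[str] = []
    for ref in USES_RE.findall(workflow_text):
        name = ref.split("@", 1)[0]
        if name in WATCHLIST:
            hits.append(ref)
    return hits


# Hypothetical workflow snippet, used only to illustrate the scan.
sample = """
jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - name: Python
        uses: actions/setup-python@v5
      - uses: docker/login-action@v3
"""
```

Running `find_watched_actions` over the text of each file under .github/workflows/ (and .github/actions/ for composite actions) would give a per-file list to triage before the deadline.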

§8 Operator-Approval Checklist

For each proposed action, check the box to greenlight:

Phase 1 (safe wins — low risk)

[ ] P1-A  slack-notify.yml: fix workflow_run name 'CI' → 'CI — main'
          Restores Slack CI-failure alerting. Zero behavioral change otherwise.

[ ] P1-B  deploy-antlers.yml: disable (not delete) via GH API
          Prevents accidental CRA deploy. File stays on disk for reference.

[ ] P1-C  release-please-config.json: remove stale frontend/trademaster_ui component
          Stops 100% Release failure rate. Requires review of config to confirm
          which components are still active.

[ ] P1-D  Rotate GitHub App private key for surface-degraded-auto-file
          (OPERATOR ACTION: requires secret provisioning in repo settings)
          Restores console degradation monitoring.

[ ] P1-E  queue-zero-dyno-monitor: relax cron */15 → */30
          Saves ~312 min/month. Detection window: 15 min → 30 min.

[ ] P1-F  flag-drift-check: relax cron every 4h → daily 06:00 UTC
          Saves ~66 min/month. Flag changes are human-driven via PRs.

Phase 2 (consolidation — moderate risk, test before merge)

[ ] P2-A  ci.yml: add paths filter to skip on doc/design-only pushes
          Needs verification that detect-changes still fires for all code paths.

[ ] P2-B  deploy-heroku.yml: add backend-only path filter
          Needs verification all legitimate backend paths are covered.
          Confirm with operator: are there any infra-file paths outside
          backend_v2/ that should trigger a Raptor deploy?

[ ] P2-C  deploy-getraxx.yml: add npx playwright install to smoke job
          Fixes 4/4 failure rate. Low risk — additive step only.

[ ] P2-D  Fix emit-audit-event composite action github context usage
          Review .github/actions/emit-audit-event/action.yml; pass run_id
          and server_url as explicit inputs from calling workflows.

[ ] P2-E  Consolidate 3 CF lint workflows into 1
          New file replaces 3 files. Identical trigger/path logic, 3 jobs.
          Confirm: are any external processes monitoring these specific
          workflow names?

[ ] P2-F  billing-retention-cron: pause schedule (comment out), dispatch-only
          Mirror billing-collector-cron pattern. Re-enable post-Postgres migration.

Phase 3 (architectural — requires separate operator sign-off on design)

[ ] P3-DESIGN  Operator approves staging-autodeploy / prod-dispatch model
               described in §5 before any Phase 3 workflow changes begin.
               Specifically: confirm all 5 Heroku surfaces (Raptor/Console/
               Antlers/Velvet/Queue) should move to dispatch-only prod deploys.

[ ] P3-A  Introduce per-surface staging smoke prerequisite for prod
[ ] P3-B  Convert prod deploy triggers to workflow_dispatch-only
[ ] P3-C  Create promote-to-prod.yml consolidation workflow
[ ] P3-D  Consolidate 06:00 UTC cron cluster into nightly-ops-sweep
[ ] P3-E  Remove e2e-smoke.yml from PR trigger until suite is green

§9 Ubicloud Assessment

Current estimated burn: ~3,400–3,700 min/month. Ubicloud threshold: 5,000 min/month (per project_ci_billing).

Current usage is at ~68–74% of the threshold. Not at imminent risk. The three highest-cost workflows (surface-degraded-auto-file at ~1,200 min/month when broken, queue-zero-dyno-monitor at ~624 min/month, mbt-resting-orders-cron at ~344 min/month) are all running correctly except surface-degraded-auto-file. Landing Phase 1 alone brings the estimate down to ~2,800–3,100 min/month (56–62% of threshold).

Recommendation: Ubicloud is not needed in the near term. If a significant new service launches (iOS CI running full Xcode builds, for instance) or if the E2E suite starts running on every PR and takes 10+ minutes, revisit. Do not queue Ubicloud migration until burn sustains above 4,500 min/month for two consecutive months.
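
The revisit rule above can be expressed as a check over monthly burn totals. A minimal sketch, assuming "two consecutive months" means the two most recent months in the series; the function name and the sample figures in the usage note are hypothetical.

```python
REVISIT_TRIGGER_MIN = 4500  # sustained-burn level named in the recommendation
REQUIRED_MONTHS = 2


def should_revisit_ubicloud(monthly_minutes: list[int]) -> bool:
    """Return True when the most recent REQUIRED_MONTHS all exceed the trigger.

    monthly_minutes is ordered oldest to newest.
    """
    if len(monthly_minutes) < REQUIRED_MONTHS:
        return False
    recent = monthly_minutes[-REQUIRED_MONTHS:]
    return all(m > REVISIT_TRIGGER_MIN for m in recent)
```

For example, a made-up series of `[3600, 4600, 4700]` would trigger a revisit, while `[4600, 3600, 4700]` would not, because the two high months are not consecutive.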

References

docs/ops/runbooks/ci-hygiene.md — existing CI hygiene runbook
docs/ops/runbooks/ci-runner-posture.md — runner posture notes
feedback_gh_actions_transitive_skip — needs chain skip propagation
feedback_pre_launch_digest_notifications — pre-launch digest posture
feedback_asset_manifest_layer_a — asset manifest guard (keep!)
project_ci_billing — GH Actions Pro plan + Ubicloud threshold
project_raptor_postgres_migration_decision — explains several cron failures