CI Health Gate Runbook
This runbook covers triage for Sprint readiness CI failures in:
- .github/workflows/ci-pr.yml
- .github/workflows/slack-notify.yml
- scripts/ci/run_health_gate.sh
What the gate currently checks
- scripts/ci/run_smoke.sh
  - backend_v2 integration smoke tests
  - frontend integration smoke test
- backend_v2/tests/integration/cli_hooks_smoke.py
  - placeholder synthetic checks for upcoming CLI doctor and data commands
- scripts/ci/validate_chart_exports.py (non-blocking in current wave)
  - CSV: header/schema + row quality + fixture content diff checks
  - PDF: header/EOF/page-marker integrity checks when artifact exists
  - JPEG: SOI/EOI/dimension integrity checks when artifact exists
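
For readers unfamiliar with the PDF and JPEG integrity checks listed above, the following is a minimal sketch of what header/EOF and SOI/EOI checks can look like. It is not the code in scripts/ci/validate_chart_exports.py; the function names are hypothetical, and the page-marker and dimension checks are left out.

```python
from pathlib import Path

PDF_HEADER = b"%PDF-"
PDF_EOF = b"%%EOF"
JPEG_SOI = b"\xff\xd8"  # Start Of Image marker
JPEG_EOI = b"\xff\xd9"  # End Of Image marker


def check_pdf_bytes(data: bytes) -> list[str]:
    """Return the problems found in a PDF byte string (empty list if none)."""
    problems = []
    if not data.startswith(PDF_HEADER):
        problems.append("missing %PDF- header")
    # %%EOF may be followed by trailing whitespace, so search the file tail.
    if PDF_EOF not in data[-1024:]:
        problems.append("missing %%EOF marker near end of file")
    return problems


def check_jpeg_bytes(data: bytes) -> list[str]:
    """Return the problems found in a JPEG byte string (empty list if none)."""
    problems = []
    if not data.startswith(JPEG_SOI):
        problems.append("missing SOI marker (FF D8)")
    # Simplification: assumes no trailing bytes after the EOI marker.
    if not data.endswith(JPEG_EOI):
        problems.append("missing EOI marker (FF D9)")
    return problems


def check_artifact(path: Path) -> list[str]:
    """Check an artifact only when it exists; an absent file yields no findings."""
    if not path.exists():
        return []
    data = path.read_bytes()
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return check_pdf_bytes(data)
    if suffix in {".jpg", ".jpeg"}:
        return check_jpeg_bytes(data)
    return [f"unsupported artifact type: {suffix}"]
```

Returning an empty list for a missing artifact mirrors the "when artifact exists" wording: an absent file is skipped, not reported as a failure.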
Fast triage flow
- Open the failed GitHub Actions run URL from the PR comment.
- Identify which step failed:
  - Run health gate checks
  - Slack Notify
- Reproduce locally from repo root:
    bash scripts/ci/run_health_gate.sh
- Fix and rerun only the failing scope first, then rerun the full gate.
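
The "failing scope first, then full gate" order can be captured in a small helper. This is a sketch under assumptions: the scope labels "smoke" and "frontend" are hypothetical, and calling scripts/ci/run_smoke.sh directly with bash is assumed to work from the repo root. The other commands are the ones quoted in this runbook.

```python
import subprocess

# Full gate command as given in this runbook.
FULL_GATE = ["bash", "scripts/ci/run_health_gate.sh"]

# Hypothetical scope labels mapped to narrower commands.
SCOPES = {
    # Assumption: the smoke script can be run on its own with bash.
    "smoke": ["bash", "scripts/ci/run_smoke.sh"],
    "frontend": [
        "npm", "--prefix", "frontend/trademaster_ui",
        "run", "test:integration-smoke",
    ],
}


def run(cmd):
    """Run one command in the current directory and return its exit code."""
    print("+", " ".join(cmd))
    return subprocess.run(cmd).returncode


def rerun(scope, runner=run):
    """Rerun the failing scope; run the full gate only if the scope passes."""
    code = runner(SCOPES[scope])
    if code != 0:
        return code  # still failing, so skip the slower full gate
    return runner(FULL_GATE)
```

Calling rerun("frontend") from the repo root would run the frontend smoke test and, only if it exits zero, the full health gate.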
Common failure patterns
- Python dependency/test bootstrap failures
  - Symptom: pytest import/module errors in backend_v2 tests
  - Action: verify Python deps in scripts/ci/run_smoke.sh are still sufficient
- Frontend smoke failures
  - Symptom: integration smoke test exits non-zero
  - Action: run locally:
      npm --prefix frontend/trademaster_ui run test:integration-smoke
- Chart export QA warning/failure signals
  - Symptom: chart export report contains warn/fail findings
  - Action: inspect artifacts/chart-export-validation.json:
    - summary.semver_signal
    - summary.actionable_failures[]
    - summary.actionable_warnings[]
  - Use check_code, message, and remediation to triage quickly.
- Slack message not sent
  - Symptom: Slack workflow succeeds but no channel message
  - Action: check SLACK_WEBHOOK_URL secret exists; if not, this is expected safe behavior
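
For the chart export triage step above, a short script can flatten the report into one line per finding. This is a sketch only: it assumes each entry in summary.actionable_failures[] and summary.actionable_warnings[] is an object carrying check_code, message, and remediation keys, which this runbook implies but does not spell out. The function names are hypothetical.

```python
import json
from pathlib import Path

REPORT_PATH = Path("artifacts/chart-export-validation.json")


def triage_lines(report: dict) -> list[str]:
    """Flatten the report summary into one line per actionable finding."""
    summary = report.get("summary", {})
    lines = [f"semver_signal: {summary.get('semver_signal', 'unknown')}"]
    groups = (("FAIL", "actionable_failures"), ("WARN", "actionable_warnings"))
    for level, key in groups:
        for finding in summary.get(key, []):
            # Assumed finding shape: check_code, message, remediation.
            code = finding.get("check_code", "?")
            message = finding.get("message", "")
            fix = finding.get("remediation", "no remediation listed")
            lines.append(f"{level} [{code}] {message} -> {fix}")
    return lines


def load_report(path: Path = REPORT_PATH) -> dict:
    """Read the validation report produced by the chart export check."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Printing "\n".join(triage_lines(load_report())) from the repo root lists failures before warnings, so the blocking items are read first.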
Secret-safe behavior
- slack-notify.yml never prints webhook values.
- If SLACK_WEBHOOK_URL is absent, the workflow records a summary message and exits without failure.
- The CI gate does not require external Alpaca secrets yet; external synthetics remain an incremental follow-up.
- Chart export QA remains non-blocking by default unless the script is run with --strict.
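
The secret-safe notification behavior described above can be illustrated as follows. This is a sketch, not the contents of slack-notify.yml: the notify and _post names are hypothetical, and it assumes a standard Slack incoming webhook that accepts a JSON body with a text field. GITHUB_STEP_SUMMARY is the file path GitHub Actions provides for job summaries.

```python
import json
import os
import urllib.request


def _post(webhook: str, message: str) -> None:
    """Send the message to a Slack incoming webhook as a JSON payload."""
    body = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def notify(message: str, env=os.environ, send=None) -> str:
    """Post to Slack if the secret is set; otherwise record a summary and succeed."""
    webhook = env.get("SLACK_WEBHOOK_URL", "").strip()
    if not webhook:
        note = "SLACK_WEBHOOK_URL not set; skipping Slack notification."
        summary_path = env.get("GITHUB_STEP_SUMMARY")  # set by GitHub Actions
        if summary_path:
            with open(summary_path, "a", encoding="utf-8") as handle:
                handle.write(note + "\n")
        print(note)
        return "skipped"  # absent secret is not treated as a failure
    (send or _post)(webhook, message)
    # Log only that a message went out, never the webhook value itself.
    print("Slack notification sent.")
    return "sent"
```

The send parameter exists so the function can be exercised without network access; a real run would leave it unset.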
Incremental hardening backlog (next step to production-grade)
- Replace placeholder CLI hooks with real doctor command checks.
- Add explicit synthetic probes for critical data/trading contracts with retry policy.
- Add actionlint or equivalent workflow lint enforcement in CI.
- Add alert routing/escalation policy for repeated gate failures.
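
As one possible shape for the retry policy mentioned in the backlog, the sketch below wraps a probe in exponential backoff. Nothing here exists in the repo yet; run_with_retry and its parameters are hypothetical, and the attempt count and delays are placeholder values.

```python
import time


def run_with_retry(probe, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call probe() until it succeeds or attempts run out, doubling the delay each time."""
    last_error = None
    for attempt in range(attempts):
        try:
            return probe()
        except Exception as exc:  # a real probe should catch narrower errors
            last_error = exc
            if attempt < attempts - 1:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"probe failed after {attempts} attempts") from last_error
```

Raising after the final attempt keeps a persistently failing probe visible as a gate failure instead of silently passing.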