Runbook: Cron Heartbeat Monitor

Issue #2723. RCA: docs/incidents/2026-05-23-ci-gitleaks-dark-crons-2703-2717.md

What this monitor does

The cron-heartbeat-monitor workflow runs daily at 06:30 UTC. It queries the GitHub Actions API to confirm that billing-retention-cron and trace-integrity-cron have each completed a successful run within the last 48 hours.
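The lookup described above can be sketched roughly as follows. This is an illustration, not the code in the monitor script: the helper names and the sample payload are hypothetical, and it assumes the API lists the newest run first. The endpoint path and the `workflow_runs` / `created_at` fields follow the public GitHub REST API.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import quote, urlencode

API_ROOT = "https://api.github.com"


def latest_success_url(repo: str, workflow_file: str) -> str:
    """Build the URL that lists one successful run of a workflow (hypothetical helper)."""
    query = urlencode({"status": "success", "per_page": 1})
    return f"{API_ROOT}/repos/{repo}/actions/workflows/{quote(workflow_file)}/runs?{query}"


def last_success_time(payload: dict) -> Optional[datetime]:
    """Return the creation time of the first listed run, or None when no run is listed."""
    runs = payload.get("workflow_runs", [])
    if not runs:
        return None
    # GitHub timestamps look like "2026-05-22T03:00:12Z".
    stamp = runs[0]["created_at"]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    print(latest_success_url("raxx-app/TradeMasterAPI", "billing-retention-cron.yml"))
```

A `None` result corresponds to the "no successful run found" case, which a monitor would normally treat as stale.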

If either workflow is stale, it:

1. Sends a Postmark alert to ops@raxx.app with the workflow name, last success timestamp, and age.
2. Exits non-zero so the daily CI digest (ci-digest-cron) registers a cron failure.
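A minimal sketch of the staleness decision and the resulting exit code, assuming the 48-hour window from this runbook. The function names are hypothetical, and treating "no successful run at all" as stale is an assumption rather than confirmed behaviour of the monitor script.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SLA_HOURS = 48  # SLA window stated in this runbook


def age_hours(last_success: datetime, now: datetime) -> float:
    """Hours elapsed since the last successful run."""
    return (now - last_success) / timedelta(hours=1)


def is_stale(last_success: Optional[datetime], now: datetime,
             sla_hours: float = SLA_HOURS) -> bool:
    """True when the last success is older than the SLA window."""
    if last_success is None:
        # Assumption: a workflow with no successful run counts as stale.
        return True
    return age_hours(last_success, now) > sla_hours


def exit_code(stale_workflows: list) -> int:
    """Non-zero when anything is stale, so the run is recorded as a failure."""
    return 1 if stale_workflows else 0


if __name__ == "__main__":
    now = datetime(2026, 5, 23, 6, 30, tzinfo=timezone.utc)
    last = datetime(2026, 5, 20, 3, 0, tzinfo=timezone.utc)
    print(is_stale(last, now), round(age_hours(last, now), 1))
```

With a nightly schedule, a 48-hour window tolerates one missed run before the alert fires.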

Enabling the monitor

The check is gated behind FLAG_CRON_HEARTBEAT_MONITOR. Enable it by setting the GH Actions repo variable (not a Heroku env var):

gh variable set FLAG_CRON_HEARTBEAT_MONITOR --body "1" --repo raxx-app/TradeMasterAPI

To also enable it for the staging environment:

gh variable set FLAG_CRON_HEARTBEAT_MONITOR --body "1" --repo raxx-app/TradeMasterAPI --env staging

Alert: `[Raxx ops] cron heartbeat STALE`

Subject: [Raxx ops] cron heartbeat STALE: <workflow-name>
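For orientation, the alert could be assembled along these lines. The subject format comes from this runbook; the body layout, the function name, and the sender address are hypothetical. The `From`, `To`, `Subject`, and `TextBody` keys are the field names Postmark's email API accepts, but this sketch only builds the payload and sends nothing.

```python
from datetime import datetime, timezone


def stale_alert(workflow: str, last_success: datetime, now: datetime,
                to_addr: str, from_addr: str) -> dict:
    """Build a Postmark-style email payload for one stale workflow (hypothetical layout)."""
    age = (now - last_success).total_seconds() / 3600
    return {
        "From": from_addr,
        "To": to_addr,
        "Subject": f"[Raxx ops] cron heartbeat STALE: {workflow}",
        "TextBody": (
            f"Workflow: {workflow}\n"
            f"Last success: {last_success.isoformat()}\n"
            f"Age: {age:.1f}h\n"
        ),
    }


if __name__ == "__main__":
    payload = stale_alert(
        "trace-integrity-cron",
        datetime(2026, 5, 20, 2, 0, tzinfo=timezone.utc),
        datetime(2026, 5, 23, 6, 30, tzinfo=timezone.utc),
        to_addr="ops@raxx.app",
        from_addr="alerts@example.invalid",  # placeholder sender, not a real address
    )
    print(payload["Subject"])
```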

Immediate steps

1. Open the linked GH Actions run URL in the alert email.
2. Check the workflow's own run history:
   - billing-retention-cron: .github/workflows/billing-retention-cron.yml
   - trace-integrity-cron: .github/workflows/trace-integrity-cron.yml
3. Look for:
   - Failed runs with a non-zero exit (sweeper endpoint returned non-200, or integrity check returned fail_count > 0).
   - Skipped runs (flag gate off on Heroku: FLAG_BILLING_RETENTION_JOB or FLAG_TRACE_INTEGRITY_CHECK may be unset).
   - No runs at all (cron schedule may have been disabled or workflow deleted).
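The three cases above can be told apart from a workflow's run history. The sketch below is one hedged way to do that from run records shaped like GitHub's `workflow_runs` entries, listed newest first. The `triage` function and its labels are hypothetical, and it is an assumption that a flag-gated run surfaces with a `skipped` conclusion; check how the workflow actually reports a gated run before relying on it.

```python
from typing import Sequence

# Conclusions treated as a hard failure in this sketch.
FAILING = {"failure", "timed_out"}


def triage(runs: Sequence[dict]) -> str:
    """Classify run history (newest first) into the cases from this runbook."""
    if not runs:
        return "no-runs"  # schedule disabled or workflow deleted
    latest = runs[0].get("conclusion")
    if latest in FAILING:
        return "failing"
    if latest == "skipped":
        return "skipped"  # assumption: gated runs report as skipped
    if latest == "success":
        return "healthy"
    return "unknown"  # e.g. cancelled, or still in progress (conclusion is None)


if __name__ == "__main__":
    print(triage([{"conclusion": "failure"}, {"conclusion": "success"}]))
```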

If the workflow is failing

Check the endpoint health:

# billing-retention-cron
curl -s -X POST \
  -H "X-Admin-Service-Token: $ADMIN_SERVICE_TOKEN" \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  "https://api.raxx.app/api/internal/jobs/billing-retention?dry_run=true"

# trace-integrity-cron
curl -s -X POST \
  -H "X-Admin-Service-Token: $ADMIN_SERVICE_TOKEN" \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  "https://api.raxx.app/api/internal/jobs/trace-integrity-check?dry_run=true&window_hours=24"
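When reading the responses, the two failure conditions named earlier (a non-200 status, or fail_count > 0) can be checked as in the sketch below. It assumes the endpoint returns a JSON object with a top-level `fail_count` field. Only the field name appears in this runbook, so the response shape and the function name are assumptions.

```python
import json
from typing import Optional


def endpoint_problem(status_code: int, body: str) -> Optional[str]:
    """Return a short reason if the response indicates failure, else None."""
    if status_code != 200:
        return f"HTTP {status_code}"
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return "response is not JSON"
    # Assumption: fail_count, when present, is a top-level integer.
    fail_count = data.get("fail_count", 0) if isinstance(data, dict) else 0
    if fail_count > 0:
        return f"fail_count={fail_count}"
    return None


if __name__ == "__main__":
    print(endpoint_problem(200, '{"fail_count": 2}'))
```

Adding `-w '\n%{http_code}\n'` to the curl commands prints the status code after the body, which makes the first check easy to do by eye.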

If the workflow is silently skipped

The flag gate on the Heroku app may be off. Verify and re-enable:

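# billing-retention-cron flag gate (empty output suggests the flag is unset)
heroku config:get FLAG_BILLING_RETENTION_JOB --app raxx-api-prod

# trace-integrity-cron flag gate (empty output suggests the flag is unset)
heroku config:get FLAG_TRACE_INTEGRITY_CHECK --app raxx-api-prod

# Re-enable with: heroku config:set <FLAG_NAME>=<value> --app raxx-api-prod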

After restoring the workflow

Trigger a manual run to confirm the workflow succeeds:

gh workflow run billing-retention-cron.yml --repo raxx-app/TradeMasterAPI
gh workflow run trace-integrity-cron.yml --repo raxx-app/TradeMasterAPI

Then trigger the heartbeat monitor to confirm the stale alert clears:

gh workflow run cron-heartbeat-monitor.yml --repo raxx-app/TradeMasterAPI \
  -f dry_run=true

Alert: `[Raxx ops] cron heartbeat MONITOR FAILED`

The monitor job itself crashed. This is a detection gap: the health of billing-retention-cron and trace-integrity-cron is UNKNOWN.

Check the cron-heartbeat-monitor run logs.
Verify GITHUB_TOKEN has the actions: read permission (set in the workflow-level permissions block).
Verify POSTMARK_SERVER_TOKEN, OPS_ALERT_EMAIL, OPS_ALERT_FROM secrets are present in the repo.
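A quick pre-flight check for the settings listed above might look like the following. It assumes the secrets are exposed to the job as environment variables under the same names, which this runbook does not confirm; the `missing_settings` helper is hypothetical.

```python
import os
from typing import List, Mapping, Sequence

REQUIRED = (
    "GITHUB_TOKEN",
    "POSTMARK_SERVER_TOKEN",
    "OPS_ALERT_EMAIL",
    "OPS_ALERT_FROM",
)


def missing_settings(env: Mapping[str, str],
                     required: Sequence[str] = REQUIRED) -> List[str]:
    """Return the names of required settings that are absent or blank."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("missing settings: " + ", ".join(missing))
```

An unset secret typically reaches a GitHub Actions job as an empty string, which is why blank values are reported as missing here.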

SLA reference

Workflow	Schedule	SLA window
`billing-retention-cron`	03:00 UTC nightly	48h
`trace-integrity-cron`	02:00 UTC nightly	48h

A missed window on trace-integrity-cron is a compliance gap for SC-6 (hash-chain verification). Five consecutive missed windows (2026-05-18 to 2026-05-23) went undetected and prompted this monitor (#2723).

Testing the monitor locally

# Simulate a stale-workflow scenario (set a short SLA)
FLAG_CRON_HEARTBEAT_MONITOR=1 \
  python3 scripts/ci/check_cron_heartbeats.py \
  --repo raxx-app/TradeMasterAPI \
  --gh-token "$GITHUB_TOKEN" \
  --sla-hours 1 \
  --output-file /tmp/heartbeat_result.json

cat /tmp/heartbeat_result.json