Runbook: Cron Heartbeat Monitor
Issue #2723. RCA: docs/incidents/2026-05-23-ci-gitleaks-dark-crons-2703-2717.md
What this monitor does
The cron-heartbeat-monitor workflow runs at 06:30 UTC daily. It queries the
GitHub Actions API and checks that billing-retention-cron and
trace-integrity-cron each have a successful run within the last 48 hours.
If either workflow is stale, it:
1. Sends a Postmark alert to ops@raxx.app with the workflow name, last success
timestamp, and age.
2. Exits non-zero so the daily CI digest (ci-digest-cron) registers a cron
failure.
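The staleness rule described above can be sketched as follows. This is a minimal illustration, not the code in scripts/ci/check_cron_heartbeats.py: the function names, the treatment of "never succeeded" as stale, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timezone
from typing import List, Optional

SLA_HOURS = 48  # SLA window stated in this runbook


def age_hours(last_success: datetime, now: datetime) -> float:
    """Hours elapsed since the last successful run."""
    return (now - last_success).total_seconds() / 3600


def is_stale(last_success: Optional[datetime], now: datetime,
             sla_hours: float = SLA_HOURS) -> bool:
    """Stale means: no success on record, or last success older than the SLA.

    Treating "no success on record" as stale is an assumption of this sketch.
    """
    if last_success is None:
        return True
    return age_hours(last_success, now) > sla_hours


def exit_code(stale_workflows: List[str]) -> int:
    """Non-zero when anything is stale, so ci-digest-cron sees a failure."""
    return 1 if stale_workflows else 0


# Illustrative values only, not real run data.
now = datetime(2026, 5, 23, 6, 30, tzinfo=timezone.utc)
last_success = {
    "billing-retention-cron": datetime(2026, 5, 23, 3, 0, tzinfo=timezone.utc),
    "trace-integrity-cron": datetime(2026, 5, 18, 2, 0, tzinfo=timezone.utc),
}
stale = [name for name, ts in last_success.items() if is_stale(ts, now)]
```

With these sample values only trace-integrity-cron exceeds the 48-hour window, so `exit_code(stale)` is non-zero.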
Enabling the monitor
The check is gated behind FLAG_CRON_HEARTBEAT_MONITOR. Enable it by setting
the GH Actions repo variable (not a Heroku env var):
gh variable set FLAG_CRON_HEARTBEAT_MONITOR --body "1" --repo raxx-app/TradeMasterAPI
To also enable it for the staging environment:
gh variable set FLAG_CRON_HEARTBEAT_MONITOR --body "1" --repo raxx-app/TradeMasterAPI --env staging
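For orientation, a flag gate of this kind might look like the sketch below. It assumes the workflow maps the repo variable into the job's environment under the same name, and that only the value "1" (the value set by the commands above) enables the check. The real script may accept other values.

```python
import os
from typing import Mapping

FLAG_NAME = "FLAG_CRON_HEARTBEAT_MONITOR"


def monitor_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the flag is set to "1".

    Assumption: any other value, or a missing variable, leaves the check off.
    """
    return env.get(FLAG_NAME, "").strip() == "1"
```

Passing the environment as a parameter keeps the gate easy to exercise locally without touching real variables.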
Alert: [Raxx ops] cron heartbeat STALE
Subject: [Raxx ops] cron heartbeat STALE: <workflow-name>
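To help recognise the alert, here is a sketch of how such a message could be assembled for Postmark's email API (POST to https://api.postmarkapp.com/email with an X-Postmark-Server-Token header). The subject format comes from this runbook. The function name, the body layout, and the sample addresses in the usage lines are assumptions.

```python
import json


def build_stale_alert(workflow: str, last_success: str, age_hours: float,
                      run_url: str, to_addr: str, from_addr: str) -> dict:
    """Build the JSON body for a stale-heartbeat alert (body layout is hypothetical)."""
    return {
        "From": from_addr,
        "To": to_addr,
        "Subject": f"[Raxx ops] cron heartbeat STALE: {workflow}",
        "TextBody": (
            f"Workflow: {workflow}\n"
            f"Last success: {last_success}\n"
            f"Age: {age_hours:.1f}h\n"
            f"Monitor run: {run_url}\n"
        ),
    }


# Illustrative values only.
payload = build_stale_alert(
    workflow="trace-integrity-cron",
    last_success="2026-05-18T02:00:00Z",
    age_hours=124.5,
    run_url="https://example.invalid/actions/runs/0",
    to_addr="ops@raxx.app",
    from_addr="alerts@example.invalid",
)
body = json.dumps(payload)  # what would be sent as the request body
```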
Immediate steps
- Open the linked GH Actions run URL in the alert email.
- Check the workflow's own run history:
  - billing-retention-cron: .github/workflows/billing-retention-cron.yml
  - trace-integrity-cron: .github/workflows/trace-integrity-cron.yml
- Look for:
  - Failed runs with a non-zero exit (sweeper endpoint returned non-200, or
    integrity check returned fail_count > 0).
  - Skipped runs (flag gate off on Heroku:
    FLAG_BILLING_RETENTION_JOB or FLAG_TRACE_INTEGRITY_CHECK may be unset).
  - No runs at all (cron schedule may have been disabled or workflow deleted).
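The three cases above can be told apart from the `conclusion` field that the GitHub Actions API reports for each workflow run. The sketch below is a triage aid, not part of the monitor. It assumes skipped runs surface as the conclusion "skipped", which may not hold if the workflow reports a skip some other way (for example, by exiting non-zero). The function name and return labels are hypothetical.

```python
from typing import Iterable, Optional


def diagnose(conclusions: Iterable[Optional[str]]) -> str:
    """Classify recent run conclusions into the triage cases of this runbook.

    A conclusion of None means the run has not finished yet and is ignored.
    """
    finished = [c for c in conclusions if c is not None]
    if not finished:
        return "no-completed-runs"   # schedule disabled or workflow deleted?
    if "success" in finished:
        return "has-success"         # at least one good run in the sample
    if all(c == "skipped" for c in finished):
        return "skipped"             # check the Heroku flag gates
    return "failing"                 # check the endpoint health
```

For example, `diagnose(["failure", "failure"])` returns "failing", while an empty list returns "no-completed-runs".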
If the workflow is failing
Check the endpoint health:
# billing-retention-cron
curl -s -X POST \
-H "X-Admin-Service-Token: $ADMIN_SERVICE_TOKEN" \
-H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
-H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
"https://api.raxx.app/api/internal/jobs/billing-retention?dry_run=true"
# trace-integrity-cron
curl -s -X POST \
-H "X-Admin-Service-Token: $ADMIN_SERVICE_TOKEN" \
-H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
-H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
"https://api.raxx.app/api/internal/jobs/trace-integrity-check?dry_run=true&window_hours=24"
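The same dry-run requests can be built from Python, which avoids shell quoting of `?` and `&` in the URL. This is a sketch under assumptions: the helper name is hypothetical, and the credentials are read from environment variables with the same names the curl commands use. It only constructs the request. Sending it is shown in the trailing comment.

```python
import os
import urllib.parse
import urllib.request
from typing import Mapping

BASE_URL = "https://api.raxx.app/api/internal/jobs"


def build_job_request(job: str, params: Mapping[str, str],
                      env: Mapping[str, str] = os.environ) -> urllib.request.Request:
    """Build (but do not send) the POST request for one internal job endpoint."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{BASE_URL}/{job}?{query}",
        data=b"",          # empty body, as in the curl commands
        method="POST",
        headers={
            "X-Admin-Service-Token": env["ADMIN_SERVICE_TOKEN"],
            "CF-Access-Client-Id": env["CF_ACCESS_CLIENT_ID"],
            "CF-Access-Client-Secret": env["CF_ACCESS_CLIENT_SECRET"],
        },
    )


# To send it (requires the three variables to be exported):
#   req = build_job_request("trace-integrity-check",
#                           {"dry_run": "true", "window_hours": "24"})
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(resp.status, resp.read().decode())
```

Note that `urlopen` raises `urllib.error.HTTPError` on a non-2xx response, which is itself a useful signal when the endpoint is unhealthy.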
If the workflow is silently skipped
The flag gate on the Heroku app may be off. Verify and re-enable:
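# billing-retention-cron flag gate (prints the current value; empty output means unset)
heroku config:get FLAG_BILLING_RETENTION_JOB --app raxx-api-prod
# trace-integrity-cron flag gate
heroku config:get FLAG_TRACE_INTEGRITY_CHECK --app raxx-api-prod
# To re-enable, set the flag to the value the job expects, e.g.:
#   heroku config:set FLAG_BILLING_RETENTION_JOB=<enabled-value> --app raxx-api-prod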
After restoring the workflow
Trigger a manual run to confirm the workflow succeeds:
gh workflow run billing-retention-cron.yml --repo raxx-app/TradeMasterAPI
gh workflow run trace-integrity-cron.yml --repo raxx-app/TradeMasterAPI
Then trigger the heartbeat monitor to confirm the stale alert clears:
gh workflow run cron-heartbeat-monitor.yml --repo raxx-app/TradeMasterAPI \
-f dry_run=true
Alert: [Raxx ops] cron heartbeat MONITOR FAILED
The monitor job itself crashed. This is a detection gap — the health of billing-retention-cron and trace-integrity-cron is UNKNOWN.
- Check the cron-heartbeat-monitor run logs.
- Verify GITHUB_TOKEN has actions: read scope (set at workflow level).
- Verify POSTMARK_SERVER_TOKEN, OPS_ALERT_EMAIL, OPS_ALERT_FROM secrets are present in the repo.
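A preflight check along these lines can make a missing secret obvious in the run logs before the monitor does any work. It is a sketch, assuming the workflow exposes each secret to the job as an environment variable of the same name. The function name is hypothetical, and the check cannot confirm the token's actions: read scope, only that a value is present.

```python
import os
from typing import List, Mapping

REQUIRED = (
    "GITHUB_TOKEN",
    "POSTMARK_SERVER_TOKEN",
    "OPS_ALERT_EMAIL",
    "OPS_ALERT_FROM",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required names that are absent or blank in the environment."""
    return [name for name in REQUIRED if not env.get(name, "").strip()]
```

GitHub Actions passes an unset secret as an empty string, so checking for blank values rather than missing keys is the more useful test here.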
SLA reference
| Workflow | Schedule | SLA window |
|---|---|---|
| billing-retention-cron | 03:00 UTC nightly | 48h |
| trace-integrity-cron | 02:00 UTC nightly | 48h |
A missed window on trace-integrity-cron is a compliance gap for SC-6
(hash-chain verification). Five consecutive missed windows (2026-05-18 to
2026-05-23) were the incident that prompted this monitor (#2723).
Testing the monitor locally
# Simulate a stale-workflow scenario (set a short SLA)
FLAG_CRON_HEARTBEAT_MONITOR=1 \
python3 scripts/ci/check_cron_heartbeats.py \
--repo raxx-app/TradeMasterAPI \
--gh-token "$GITHUB_TOKEN" \
--sla-hours 1 \
--output-file /tmp/heartbeat_result.json
cat /tmp/heartbeat_result.json