BCP smoke — monthly workflow review
System: .github/workflows/bcp-smoke-monthly.yml (Win 5) + .github/workflows/bcp-vault-snapshot-daily.yml (Win 3)
Owner: sre-agent / operator
Last reviewed: 2026-05-29 UTC
Review context: PR #3049 shipped both workflows; this review was triggered by operator directive 2026-05-29.
Review findings
1. Workflow configuration
Cron expression: 0 8 1 * * — correct. Fires at 08:00 UTC on the 1st of each month.
For reference: 08:00 UTC is 04:00 EDT / 01:00 PDT during daylight saving time and
03:00 EST / 00:00 PST in winter. UTC has no DST shift, so the workflow fires at a fixed
UTC time while the operator's local clock time moves by one hour across daylight saving transitions.
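The seasonal shift can be made concrete with a minimal sketch using Python's standard `zoneinfo` module. It assumes the operator's zones are `America/New_York` and `America/Los_Angeles`, which this review does not state; the helper name `local_fire_time` is illustrative only.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_fire_time(year, month, zone):
    """Return the local wall-clock time of the 08:00 UTC run on the 1st of the month."""
    fire_utc = datetime(year, month, 1, 8, 0, tzinfo=timezone.utc)
    return fire_utc.astimezone(ZoneInfo(zone))


# Compare a winter month and a summer month in both (assumed) operator zones.
for month in (1, 6):
    for zone in ("America/New_York", "America/Los_Angeles"):
        local = local_fire_time(2026, month, zone)
        print(month, zone, local.strftime("%H:%M %Z"))
```

The UTC instant never changes; only the rendered local time differs between the two months.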
workflow_dispatch present: Yes. Accepts two boolean inputs:
- skip_vault_export_check — skips Check 5 if Win 3 not yet deployed
- dry_run — skips all AWS/Heroku/Slack calls; prints step labels only
Concurrency: cancel-in-progress: false — correct for a backup-verification workflow.
A cron run should not be cancelled by a concurrent manual dispatch or vice versa.
Permissions: contents: read — correct. No write permissions required (read-only checks).
Transitive-skip guard: notify-failure carries if: always() && needs.bcp-smoke.result == 'failure'.
Without always(), a job whose needs dependency failed is skipped by default; with it, notify-failure
runs exactly when bcp-smoke fails (per feedback_gh_actions_transitive_skip).
2. Bug found and fixed (PR #3086)
Root cause: bcp-smoke-monthly.yml contained a python3 -c "..." multi-line string in
Check 4 where the closing " appeared at column 0. The GH Actions YAML parser rejected the
workflow file, causing every push-triggered run to fail at 0s before any job started.
Evidence: 10+ runs in gh run list showing failure + 0s duration, all triggered by
push, across multiple branches. The "workflow file issue" error message confirmed the YAML
parse rejection.
Fix applied:
- Rewrote both python3 -c "..." multi-line blocks in Check 4 as single-line -c invocations.
- Fixed POSTMARK_API_TOKEN → POSTMARK_OPS_ALERT_TOKEN (secret name mismatch).
- Fixed SLACK_BOT_TOKEN → SLACK_WEBHOOK_URL in both bcp workflows; switched notification
method from chat.postMessage API to incoming-webhook POST (matches what is provisioned).
Test dispatch result: workflow_dispatch with dry_run=true + skip_vault_export_check=true
completed successfully in ~10s on branch fix/bcp-smoke-monthly-yaml-parse. All 5 checks
reported skip as expected. notify-failure job skipped (no failure to alert). YAML parse
is confirmed resolved.
3. Secret and auth audit
| Secret name (workflow) | Repo secret present | Status |
|---|---|---|
| AWS_BACKUP_ACCESS_KEY_ID | Yes (2026-05-18) | OK |
| AWS_BACKUP_SECRET_ACCESS_KEY | Yes (2026-05-18) | OK |
| HEROKU_API_KEY | Yes (2026-05-06) | OK |
| POSTMARK_OPS_ALERT_TOKEN | Yes (2026-05-20) | OK (was misnamed POSTMARK_API_TOKEN before PR #3086) |
| SLACK_WEBHOOK_URL | Yes (2026-03-04) | OK (was misnamed SLACK_BOT_TOKEN before PR #3086) |
| GPG_BACKUP_PUBLIC_KEY | MISSING | Operator action required (see below) |
| INFISICAL_CLIENT_ID | Yes (2026-04-27) | OK (used by daily snapshot, not monthly smoke) |
| INFISICAL_CLIENT_SECRET | Yes (2026-04-27) | OK (used by daily snapshot, not monthly smoke) |
Operator action — GPG_BACKUP_PUBLIC_KEY:
The daily vault snapshot (bcp-vault-snapshot-daily.yml) requires GPG_BACKUP_PUBLIC_KEY
(base64-encoded ASCII-armored operator public key). This secret is not present in the repo.
Until it is minted, the daily snapshot will fail at the Import GPG public key step.
Dry-run mode skips this step, so dry-run dispatch succeeds.
To mint:
# Export your GPG public key and base64-encode it
gpg --armor --export YOUR_KEY_FINGERPRINT | base64 | gh secret set GPG_BACKUP_PUBLIC_KEY
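Before setting the secret, it may help to sanity-check that the value has the shape described above (base64 wrapping an ASCII-armored public key block). The following is a minimal sketch, not part of either workflow; the function name is hypothetical, and it only checks the armor header and footer, not whether the key itself is valid.

```python
import base64

ARMOR_HEADER = "-----BEGIN PGP PUBLIC KEY BLOCK-----"
ARMOR_FOOTER = "-----END PGP PUBLIC KEY BLOCK-----"


def looks_like_encoded_public_key(secret_value):
    """Return True if the value base64-decodes to an ASCII-armored public key block."""
    try:
        # Drop whitespace first: GNU base64 wraps its output at 76 columns.
        compact = "".join(secret_value.split())
        decoded = base64.b64decode(compact, validate=True).decode("ascii")
    except ValueError:
        # Covers invalid base64 (binascii.Error) and non-ASCII payloads.
        return False
    text = decoded.strip()
    return text.startswith(ARMOR_HEADER) and text.endswith(ARMOR_FOOTER)
```

An empty secret, which is what the failing daily snapshot run currently sees, returns False here.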
4. Coverage assessment
| Check | What it validates | Threshold | Assessment |
|---|---|---|---|
| 1 | FreeScout S3 mysqldump exists (yesterday) | Object exists | Reasonable. Does not verify file size or content integrity. A 0-byte dump would pass. Consider adding size check (>100KB) post-launch. |
| 2 | Vault Lightsail auto-snapshot age < 25h | 25 hours | Correct. Lightsail AutoSnapshot fires daily at 05:00 UTC; 25h gives 1h of slack before alerting. |
| 3 | TF state S3 bucket versioning = Enabled | Status == "Enabled" | Correct. Versioning is a one-time infrastructure property; this check catches inadvertent disablement. |
| 4 | Heroku Postgres backup age < 48h | 48 hours | Reasonable. Heroku Postgres daily backups give 24h of slack before this threshold triggers. Tighter (< 28h) would be better post-launch. |
| 5 | Vault S3 export exists (yesterday's .json.gpg) | Object exists | Dependent on daily snapshot (Win 3) having run. Will always fail until GPG_BACKUP_PUBLIC_KEY is minted and the daily snapshot completes at least once. |
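
To illustrate how the age thresholds in Checks 2 and 4 behave, here is a minimal sketch of the comparison. The function name and the sample timestamps are illustrative assumptions; the real checks obtain the backup timestamp from the AWS and Heroku APIs.

```python
from datetime import datetime, timedelta, timezone


def backup_is_fresh(created_at, now, max_age_hours):
    """Return True if the backup is younger than the threshold."""
    return now - created_at < timedelta(hours=max_age_hours)


# Hypothetical scenario: the smoke run starts at 08:00 UTC and the most
# recent backup is 30h old, i.e. one daily backup was missed.
now = datetime(2026, 6, 1, 8, 0, tzinfo=timezone.utc)
one_missed = now - timedelta(hours=30)

print(backup_is_fresh(one_missed, now, 48))  # current Check 4 threshold
print(backup_is_fresh(one_missed, now, 28))  # tighter threshold suggested above
```

Under the 48h threshold a single missed daily backup still passes; under 28h it would be reported.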
Coverage gaps identified:
- Check 1 does not verify file size. A failed mysqldump that writes an empty or partial file would not be caught.
- Check 4 threshold of 48h is generous for a daily backup schedule. Failure at Heroku's end would not alert until the second missed backup.
- Check 5 is an existence check only — it confirms the object is present but does not verify GPG integrity (cannot decrypt on runner without private key, so this is acceptable by design).
- No check covers FreeScout itself being up (separate concern for freescout.md runbook).
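The size gap in Check 1 could be closed with a comparison along these lines. This is a sketch under assumptions: it takes a dict shaped like an S3 HeadObject response (which reports the object size as `ContentLength`), and the 100KB floor is the figure suggested in the table, not a measured baseline for the FreeScout dump.

```python
MIN_DUMP_BYTES = 100 * 1024  # suggested floor from the coverage table


def dump_looks_valid(head_response, min_bytes=MIN_DUMP_BYTES):
    """Return True if the object metadata reports a size above the floor."""
    # A missing ContentLength is treated as size 0, so it fails the check.
    return head_response.get("ContentLength", 0) > min_bytes


print(dump_looks_valid({"ContentLength": 0}))        # empty dump
print(dump_looks_valid({"ContentLength": 5_000_000}))  # plausible dump
```

A size floor catches empty and badly truncated dumps but not a dump that is large yet corrupt; that would need a restore test.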
5. Failure routing
Pass (no failures): silent. Workflow exits 0. No alert sent. Correct per feedback_pre_launch_digest_notifications.
Failure: per-event alert to two channels:
- Postmark email → ops@raxx.app with subject [BCP ALERT] Monthly smoke failed — checks: <list> (<date>)
- Slack webhook → ops alert channel with a structured block message including the run URL
Both are non-fatal: if POSTMARK_OPS_ALERT_TOKEN or SLACK_WEBHOOK_URL is empty, the step
prints ::warning:: and exits 0. This means a credential gap in notifications does NOT cause
the workflow to report a false failure — the actual check results are the authoritative signal.
No GH issue is auto-filed on failure. This differs from the console auto-ticketer pattern
(#3081). Assessment: acceptable for a monthly frequency. A failed BCP check warrants
operator action the same day; the Slack + email channel is sufficient.
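The non-fatal notification behaviour described above can be sketched as follows. This is not the workflow's actual step: the function name is hypothetical, and it sends a plain `text` payload rather than the structured block message the workflow uses. Slack incoming webhooks accept a JSON body POSTed to the webhook URL.

```python
import json
import urllib.request


def notify_slack(webhook_url, text, timeout=10):
    """Post a message to a Slack incoming webhook; never raise on failure."""
    if not webhook_url:
        # Mirrors the workflow's behaviour: warn and carry on.
        print("::warning::SLACK_WEBHOOK_URL not configured")
        return False
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError as error:  # URLError and HTTPError are OSError subclasses
        print(f"::warning::Slack notification failed: {error}")
        return False
```

Because the function returns a boolean instead of raising, a notification problem cannot change the exit status that the check results determine.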
6. Daily vault snapshot dependency
The monthly smoke Check 5 depends on bcp-vault-snapshot-daily.yml having run successfully.
That workflow runs at 07:00 UTC daily. As of this review (2026-05-29), the daily snapshot
has attempted one run (2026-05-29T10:51 UTC) and failed — root cause: SLACK_BOT_TOKEN
not found AND GPG_BACKUP_PUBLIC_KEY not present (secret lookup returns empty; the GPG step
fails on empty key parse).
Run ID 26633102612 confirms the workflow started (AWS credentials configured, Infisical CLI
installation began) before the GPG step was reached.
Impact on monthly smoke: Check 5 will report FAIL on the 1st of June 2026 unless:
1. GPG_BACKUP_PUBLIC_KEY is minted (operator action), AND
2. The daily snapshot runs and produces at least one successful export before 2026-06-01.
Recommendation: skip_vault_export_check should be set manually for the June 1 run if
GPG_BACKUP_PUBLIC_KEY is not minted before then.
7. Node.js 20 deprecation notice
aws-actions/configure-aws-credentials@v4 runs on Node.js 20. GH Actions will force
Node.js 24 starting 2026-06-02 (4 days after the review date). Before that date, pin to an
exact release tag (@v4.x.x) that supports Node.js 24.
Check the release notes:
https://github.com/aws-actions/configure-aws-credentials/releases
If the action is not updated before 2026-06-02, it will be forced onto Node.js 24 automatically. It will likely still work, but confirm with a test dispatch after the cutover.
Operator action items
| # | Action | Due | Notes |
|---|---|---|---|
| 1 | Mint GPG_BACKUP_PUBLIC_KEY repo secret | Before 2026-06-01 | Unblocks daily vault snapshot + Check 5 |
| 2 | Verify daily vault snapshot succeeds after secret mint | Before 2026-06-01 | Dispatch bcp-vault-snapshot-daily.yml with dry_run=false and confirm S3 object appears |
| 3 | Merge PR #3086 | Immediate | Unblocks monthly smoke from 0s YAML failures on every push |
| 4 | Update aws-actions/configure-aws-credentials@v4 | 2026-06-01 | GH Actions forces Node.js 24 on 2026-06-02 |
Known failure modes
Workflow fails at 0s on every push
Symptom: All push-triggered runs show failure + 0s duration with "workflow file issue" message.
Cause: YAML parse error in the workflow file — typically multi-line shell content at column 0 inside a run: | block.
Fix: Check for python3 -c "..." or heredoc blocks with content at column 0. Rewrite as single-line -c calls or ensure heredoc body is indented.
Verification: python3 -c "import yaml; yaml.safe_load(open('.github/workflows/bcp-smoke-monthly.yml'))" — must not raise.
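As a complement to the parse check, a rough heuristic can point at the offending lines. The sketch below is an assumption-laden helper, not a YAML parser: it flags any non-blank line at column 0 that does not look like a top-level key, comment, or document marker. The sample workflow text is invented for illustration.

```python
import re

# Things that may legitimately start at column 0 in a workflow file.
TOP_LEVEL_OK = re.compile(r"^(#|---|\.\.\.|[A-Za-z_][\w-]*\s*:)")


def suspicious_column0_lines(text):
    """Return (line_number, line) pairs for column-0 lines that are not top-level keys."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line[0] in " \t":
            continue  # blank or indented lines are not of interest here
        if TOP_LEVEL_OK.match(line):
            continue
        hits.append((number, line))
    return hits


# Hypothetical workflow fragment with a python3 -c body that fell to column 0.
sample = "\n".join([
    "name: demo",
    "on: push",
    "jobs:",
    "  check:",
    "    runs-on: ubuntu-latest",
    "    steps:",
    "      - run: |",
    '          python3 -c "',
    "import json",
    "print(json.dumps({}))",
    '"',
])

for number, line in suspicious_column0_lines(sample):
    print(f"line {number}: {line!r}")
```

The heuristic can miss cases (for example a column-0 line that happens to look like `key:`), so the parse check remains the authoritative test.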
Check 5 always fails
Symptom: Check 5 reports FAIL every month.
Cause: GPG_BACKUP_PUBLIC_KEY not minted — daily snapshot cannot encrypt → no S3 object → Check 5 always misses.
Fix: Mint GPG_BACKUP_PUBLIC_KEY per operator action item #1 above. Verify daily snapshot runs successfully.
Workaround: Dispatch monthly smoke with skip_vault_export_check=true while secret is pending.
Slack notification fails silently
Symptom: Workflow completes with a ::warning::SLACK_WEBHOOK_URL not configured line but no Slack alert.
Cause: SLACK_WEBHOOK_URL secret empty or expired.
Fix: Re-set SLACK_WEBHOOK_URL in repo secrets from the Slack app webhook configuration.
Heroku Postgres backup age check is stale
Symptom: Check 4 reports FAIL — backup is older than 48h.
Cause: Heroku Postgres scheduled backup missed or failed. Most common cause: Heroku maintenance window or temporary Postgres unavailability.
Runbook: See docs/ops/runbooks/heroku.md for Heroku Postgres backup commands and manual backup trigger (heroku pg:backups:capture -a raxx-api-prod).
References
- Workflow: .github/workflows/bcp-smoke-monthly.yml
- Daily snapshot: .github/workflows/bcp-vault-snapshot-daily.yml
- BCP document: docs/architecture/business-continuity-plan-2026-05-21.md
- Fix PR: #3086
- Issue tracking Win 5: #2657
- Issue tracking Win 3: #2654
- Vault snapshot runbook: docs/ops/runbooks/vault-disaster-recovery.md
- FreeScout backup runbook: docs/ops/runbooks/freescout-backup-restore.md