RCA — BCP vault snapshot (Win 3) silent failure: GPG_BACKUP_PUBLIC_KEY never minted
Incident ID: 2026-06-15-bcp-vault-snapshot-gpg-key-missing
Date: 2026-06-15
Severity: SEV-3
Duration: 5 days open (2026-06-11 first failure → 2026-06-15 operator escalation; fix pending operator action)
Blast radius: Internal BCP only. No user-facing service impact. Zero vault secrets were lost or exposed. The vault itself is healthy; the off-Infisical encrypted backup copy was simply not produced for 5 consecutive days.
Author: sre-agent
Summary
The bcp-vault-snapshot-daily.yml workflow (BCP Win 3) has failed on every daily run since 2026-06-11, the first run captured in recent gh run list history. The root cause is that the GPG_BACKUP_PUBLIC_KEY GitHub Actions repo secret was never set. Without a valid GPG public key the workflow cannot encrypt the vault export before uploading to S3, so it exits at the Import GPG public key step in under 10 seconds. The Slack webhook notification fired on each failure (the notify-failure job ran successfully), but ops@ email alerting was not wired into that job, and no escalation path reached a channel the operator treats as a pager. The Slack alerts went unnoticed for 5 days.
The fix has two parts: (a) an operator action to mint GPG_BACKUP_PUBLIC_KEY, which is a break-glass item only the operator can perform; (b) a CI/alerting change (this PR) adding Postmark ops@ email to the notify-failure job of both the vault snapshot and FreeScout backup workflows so future failures reach a pageable channel immediately rather than depending on someone checking Slack.
Timeline (all times UTC)
- 2026-05-29 10:51 — First attempt of bcp-vault-snapshot-daily.yml after initial deploy. Fails. bcp-smoke-monthly-review.md documents GPG_BACKUP_PUBLIC_KEY as MISSING and files it as operator action item #1, due before 2026-06-01.
- 2026-06-01 08:00 — Monthly BCP smoke fires. Check 5 fails (no vault export object in S3). Slack + Postmark alert sent to ops channel. Operator action item #1 is still open.
- 2026-06-11 11:25 — First run captured in recent gh run list history. Result: failure in 11 seconds.
- 2026-06-12 11:05 — Second consecutive failure.
- 2026-06-13 10:05 — Third consecutive failure.
- 2026-06-14 10:28 — Fourth consecutive failure.
- 2026-06-15 13:21 — Fifth consecutive failure. Operator notices and escalates to sre-agent.
- 2026-06-15 ~14:00 — SRE investigation begins. Root cause confirmed via gh run view 27549207437 --log-failed in under 5 minutes. Workflow change (pageable alerting) drafted and submitted as PR.
Impact
- Users affected: 0
- User-visible symptoms: none
- Data integrity: ok — the vault itself is intact; Lightsail AutoSnapshot running daily at 05:00 UTC provides a separate backup of the vault host
- Revenue / billing: ok
- BCP gap: 5 days (2026-06-11 through 2026-06-15) with no encrypted off-Infisical vault export on S3. Recovery window is covered by the 7-day Lightsail AutoSnapshot retention, but the S3 export gap is real.
What went well
- The notify-failure Slack webhook job was already in the workflow and did fire on every failure, so the alerting infrastructure was partially there.
- bcp-smoke-monthly-review.md documented the missing secret and the operator action item explicitly on 2026-05-29. The failure was anticipated and documented.
- Lightsail AutoSnapshot (Win 2) provides a parallel recovery path; the vault data was never unrecoverable.
- Root cause identification from first log pull took under 2 minutes. The workflow's "::error::Failed to parse GPG fingerprint from public key" message is unambiguous.
- FreeScout backup (the sibling BCP backup cron) has been passing cleanly throughout, with no cross-contamination.
What didn't go well
- The operator action item to mint GPG_BACKUP_PUBLIC_KEY was filed on 2026-05-29 with a due date of 2026-06-01 and was not completed. There was no automated escalation when it slipped the deadline.
- The Slack webhook alert fired 5 times without producing operator action. Slack alone is insufficient for a pageable event class; the system lacked an ops@ email path.
- The FreeScout backup notify-failure job referenced SLACK_BOT_TOKEN (a bot API token), which was never provisioned as a GH Actions secret. That failure notification path was silently dead. This is a latent gap exposed by this investigation even though FreeScout backup itself is green.
- bcp-smoke-monthly.yml Check 5 (which would have caught the missing export) was set to fire on 2026-06-01, but the skip_vault_export_check flag was not used and the operator action item deadline was 2026-06-01; no one verified the outcome.
Root cause analysis
- Contributing factor 1: GPG_BACKUP_PUBLIC_KEY secret never minted. The workflow requires the operator's GPG public key (base64-encoded) as a GitHub Actions secret. This secret was never set. The workflow was deployed without verifying all required secrets were present. The check was documentation-only (operator action item in a runbook review doc) with no automated gate.
- Contributing factor 2: Alerting on a single channel (Slack webhook) is insufficient for a pageable event class. The system had one alert channel (Slack incoming webhook). A BCP backup failure is an incident-class event that warrants ops@ email so it reaches the operator's inbox as a durable, high-signal artifact rather than a scrollable Slack message. The workflow comment block said "Slack DM on failure", but SLACK_WEBHOOK_URL posts to an ops channel, not a DM, which gives lower visibility than originally intended.
- Contributing factor 3: Operator action item without an automated escalation deadline. The runbook review filed an operator action item on 2026-05-29 with due date 2026-06-01. There is no mechanism in the system that detects a missed operator action item deadline and re-escalates. The item sat open silently for 17 days (2026-05-29 to 2026-06-15).
- Contributing factor 4: FreeScout backup notify-failure path silently broken. The FreeScout backup notify-failure job uses SLACK_BOT_TOKEN (a Slack bot API token), which is not present in the repo's GH Actions secrets. If FreeScout backup had failed, no alert would have been sent. This is a latent gap unrelated to the triggering incident but exposed by the same investigation.
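To make contributing factor 1 concrete, the following is a minimal sketch of what an import step of this kind might look like. It is not copied from bcp-vault-snapshot-daily.yml: the function names import_backup_key and parse_fingerprint are hypothetical, and it assumes bash, GNU base64, and GnuPG 2.x on the runner. With an empty secret, a step shaped like this has nothing to decode and fails before any encryption is attempted.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not taken from the workflow.

# Print the first fingerprint found in `gpg --with-colons` output on stdin.
# In GnuPG's colon format, "fpr" records carry the fingerprint in field 10.
parse_fingerprint() {
  awk -F: '$1 == "fpr" { print $10; exit }'
}

# Decode the base64-encoded public key, import it into a throwaway keyring,
# and print its fingerprint. Returns non-zero on an empty or unusable secret.
import_backup_key() {
  local encoded="${1:-}" fpr
  if [ -z "$encoded" ]; then
    echo "::error::GPG_BACKUP_PUBLIC_KEY is empty or unset" >&2
    return 1
  fi
  # A fresh keyring keeps the import isolated; it stays exported so later
  # steps in the same shell can encrypt with the imported key.
  GNUPGHOME="$(mktemp -d)"
  export GNUPGHOME
  if ! printf '%s' "$encoded" | base64 -d | gpg --batch --quiet --import; then
    echo "::error::Failed to import GPG public key" >&2
    return 1
  fi
  fpr="$(gpg --batch --with-colons --list-keys | parse_fingerprint)"
  if [ -z "$fpr" ]; then
    echo "::error::Failed to parse GPG fingerprint from public key" >&2
    return 1
  fi
  echo "$fpr"
}
```

In a workflow step this might be called as import_backup_key "$GPG_BACKUP_PUBLIC_KEY", with the printed fingerprint passed to gpg --encrypt --recipient.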
Detection
- What alerted us: operator noticed directly (manual observation of GH Actions run list)
- How long between cause and detection: 5 days (2026-06-11 first failure → 2026-06-15 operator escalation)
- How to detect faster next time: ops@ email on BCP backup failure (now wired via this PR); the Slack webhook alone is not a pager
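As an illustration of how the gap between cause and detection could be measured automatically, here is a small sketch that counts how many of the most recent runs failed in a row. It is not part of this PR; count_failure_streak is a hypothetical helper, and the gh invocation in the comment assumes an authenticated GitHub CLI.

```shell
#!/usr/bin/env bash
# Sketch only: count_failure_streak is a hypothetical helper.

# Read one run conclusion per line from stdin (newest first) and print how
# many consecutive "failure" entries appear before the first other value.
count_failure_streak() {
  local streak=0 conclusion
  while IFS= read -r conclusion; do
    [ "$conclusion" = "failure" ] || break
    streak=$((streak + 1))
  done
  echo "$streak"
}

# Possible use, given that `gh run list` prints the most recent run first:
#   gh run list --workflow bcp-vault-snapshot-daily.yml --limit 10 \
#     --json conclusion --jq '.[].conclusion' | count_failure_streak
```

A streak of 2 or more could be treated as grounds for re-escalation, though the right threshold is a judgment call for the operator.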
Resolution
Completed (this PR)
- Added a "Send ops@ email alert via Postmark" step to the notify-failure job in bcp-vault-snapshot-daily.yml. Uses POSTMARK_OPS_ALERT_TOKEN + OPS_ALERT_EMAIL + OPS_ALERT_FROM (all present in GH Actions secrets since 2026-05-20). Both Slack and email fire independently.
- Replaced the dead SLACK_BOT_TOKEN / chat.postMessage path in the freescout-backup.yml notify-failure job with SLACK_WEBHOOK_URL (incoming webhook, same as vault snapshot and bcp-smoke-monthly) + added Postmark ops@ email step.
- Updated docs/ops/runbooks/vault-disaster-recovery.md with a BCP backup failure section documenting the pageable posture, the alert channels, and the root-cause checklist.
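For readers unfamiliar with the Postmark path, the email step can be sketched roughly as follows. This is an approximation, not the step as merged: the helper names are hypothetical, it assumes POSTMARK_OPS_ALERT_TOKEN is a Postmark server token and OPS_ALERT_FROM is an address Postmark accepts as a sender, and the hand-rolled escaping only covers single-line strings (jq would be more robust where available).

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Escape backslashes and double quotes so a single-line string is JSON-safe.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON body for Postmark's send-email endpoint.
build_postmark_payload() {
  local from="$1" to="$2" subject="$3" body="$4"
  printf '{"From":"%s","To":"%s","Subject":"%s","TextBody":"%s"}' \
    "$(json_escape "$from")" "$(json_escape "$to")" \
    "$(json_escape "$subject")" "$(json_escape "$body")"
}

# POST the alert; --fail makes curl exit non-zero on an HTTP error so the
# step itself goes red if the email could not be handed off.
send_ops_alert() {
  local subject="$1" body="$2"
  build_postmark_payload "$OPS_ALERT_FROM" "$OPS_ALERT_EMAIL" "$subject" "$body" \
    | curl --fail --silent --show-error \
        -X POST "https://api.postmarkapp.com/email" \
        -H "Accept: application/json" \
        -H "Content-Type: application/json" \
        -H "X-Postmark-Server-Token: ${POSTMARK_OPS_ALERT_TOKEN}" \
        --data @-
}
```

Keeping the Slack and email sends in separate steps means a failure in one does not suppress the other, which matches the "fire independently" behaviour described above.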
Pending operator action (blocker for backup resumption)
The backup itself cannot run green until GPG_BACKUP_PUBLIC_KEY is set. This is a break-glass credential operation only the operator can perform:
```shell
# 1. Identify your key fingerprint
gpg --list-secret-keys --keyid-format LONG
# 2. Export the public key, base64-encode it, and set as GH Actions secret
gpg --armor --export YOUR_KEY_FINGERPRINT \
  | base64 \
  | gh secret set GPG_BACKUP_PUBLIC_KEY --repo raxx-app/TradeMasterAPI
# 3. Verify the secret is now listed
gh secret list --repo raxx-app/TradeMasterAPI | grep GPG_BACKUP_PUBLIC_KEY
# 4. Dispatch the workflow manually to confirm green
gh workflow run bcp-vault-snapshot-daily.yml --repo raxx-app/TradeMasterAPI
# 5. Confirm the S3 object landed
aws s3 ls s3://raxx-iac-state-prod/vault-exports/ --region us-east-1 | tail -3
```
Expected: a .json.gpg object for today's date, size > 1 KB.
Validation (after operator action)
- gh run list --workflow bcp-vault-snapshot-daily.yml --limit 1 shows success
- aws s3 ls s3://raxx-iac-state-prod/vault-exports/ --region us-east-1 includes today's YYYY-MM-DD.json.gpg
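The S3 check can also be scripted rather than eyeballed. The sketch below assumes the usual aws s3 ls column layout (date, time, size in bytes, key) and the YYYY-MM-DD.json.gpg naming noted above; export_present is a hypothetical helper, and the 1024-byte default mirrors the "size > 1 KB" expectation.

```shell
#!/usr/bin/env bash
# Sketch only: export_present is a hypothetical helper.

# Read `aws s3 ls` output from stdin and succeed only if it lists
# <day>.json.gpg with a size strictly greater than min_bytes.
export_present() {
  local day="$1" min_bytes="${2:-1024}"
  awk -v name="${day}.json.gpg" -v min="$min_bytes" '
    $4 == name && ($3 + 0) > (min + 0) { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Possible use:
#   aws s3 ls s3://raxx-iac-state-prod/vault-exports/ --region us-east-1 \
#     | export_present "$(date -u +%F)"
```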
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Mint GPG_BACKUP_PUBLIC_KEY GH Actions secret + confirm backup green | Kristerpher (operator) | 2026-06-15 | operator-action |
| 2 | Merge this PR (pageable alerting for BCP backup failures) | sre-agent / operator | 2026-06-15 | see PR |
| 3 | Add a workflow-level pre-flight step to bcp-vault-snapshot-daily.yml that validates all required secrets are non-empty before attempting the backup, so a missing secret fails fast with a clear error on day 1, not day N | sre-agent | 2026-06-22 | type:reliability |
| 4 | Consider adding a monthly "secrets audit" job to bcp-smoke-monthly.yml that checks all BCP-critical secrets are non-empty | sre-agent | 2026-06-30 | type:reliability |
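Action item 3 could take roughly the following shape. This is a sketch under assumptions, not the planned implementation: require_secrets is a hypothetical function, it relies on bash indirect expansion, and it assumes each secret has already been mapped into the step's environment via the workflow's env block.

```shell
#!/usr/bin/env bash
# Sketch only: require_secrets is a hypothetical pre-flight helper.

# Check that every named environment variable is set and non-empty.
# Emits one GitHub Actions error annotation per missing name and returns
# non-zero if any are missing, so the job stops before the backup starts.
require_secrets() {
  local missing=0 name
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if it is unset.
    if [ -z "${!name:-}" ]; then
      echo "::error::Required secret ${name} is empty or unset"
      missing=1
    fi
  done
  return "$missing"
}

# Possible use as the first step of the backup job:
#   require_secrets GPG_BACKUP_PUBLIC_KEY SLACK_WEBHOOK_URL || exit 1
```

Reporting every missing name before failing, rather than stopping at the first, lets the operator fix all gaps in one pass.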
References
- Runbook: docs/ops/runbooks/vault-disaster-recovery.md
- BCP smoke review (filed original operator action item): docs/ops/runbooks/bcp-smoke-monthly-review.md
- Workflow: .github/workflows/bcp-vault-snapshot-daily.yml
- Failing run (most recent): https://github.com/raxx-app/TradeMasterAPI/actions/runs/27549207437
- First confirmed failure (captured in history): run 27343443722 (2026-06-11T11:25:40Z)