RCA — BCP vault snapshot (Win 3) silent failure: GPG_BACKUP_PUBLIC_KEY never minted

Incident ID: 2026-06-15-bcp-vault-snapshot-gpg-key-missing
Date: 2026-06-15
Severity: SEV-3
Duration: 5 days open (2026-06-11 first failure → 2026-06-15 operator escalation; fix pending operator action)
Blast radius: Internal BCP only. No user-facing service impact. Zero vault secrets were lost or exposed. The vault itself is healthy; the off-Infisical encrypted backup copy was simply not produced for 5 consecutive days.
Author: sre-agent

Summary

The bcp-vault-snapshot-daily.yml workflow (BCP Win 3) has failed on every run since at least 2026-06-11, the earliest run in the recent gh run list history; its first attempt on 2026-05-29 also failed. The root cause is that the GPG_BACKUP_PUBLIC_KEY GitHub Actions repo secret was never set. Without a valid GPG public key the workflow cannot encrypt the vault export before uploading it to S3, so it exits at the Import GPG public key step in under 10 seconds. Each failure did trigger the Slack webhook notification (the notify-failure job ran successfully), but ops@ email alerting was not wired into that job, so nothing escalated the failure to a channel the operator treats as a pager. The Slack alerts went unnoticed for 5 days.

The fix has two parts: (a) an operator action to mint GPG_BACKUP_PUBLIC_KEY, which is a break-glass item only the operator can perform; (b) a CI/alerting change (this PR) adding Postmark ops@ email to the notify-failure job of both the vault snapshot and FreeScout backup workflows so future failures reach a pageable channel immediately rather than depending on someone checking Slack.

Timeline (all times UTC)

2026-05-29 10:51 — First attempt of bcp-vault-snapshot-daily.yml after initial deploy. Fails. bcp-smoke-monthly-review.md documents GPG_BACKUP_PUBLIC_KEY as MISSING and files it as operator action item #1, due before 2026-06-01.
2026-06-01 08:00 — Monthly BCP smoke fires. Check 5 fails (no vault export object in S3). Slack + Postmark alert sent to ops channel. Operator action item #1 is still open.
2026-06-11 11:25 — Earliest run captured in recent gh run list history. Fails in 11 seconds.
2026-06-12 11:05 — Second consecutive failure.
2026-06-13 10:05 — Third consecutive failure.
2026-06-14 10:28 — Fourth consecutive failure.
2026-06-15 13:21 — Fifth consecutive failure. Operator notices and escalates to sre-agent.
2026-06-15 ~14:00 — SRE investigation begins. Root cause confirmed via gh run view 27549207437 --log-failed in under 5 minutes. Workflow change (pageable alerting) drafted and submitted as PR.

Impact

Users affected: 0
User-visible symptoms: none
Data integrity: ok — the vault itself is intact; Lightsail AutoSnapshot running daily at 05:00 UTC provides a separate backup of the vault host
Revenue / billing: ok
BCP gap: 5 days (2026-06-11 through 2026-06-15) with no encrypted off-Infisical vault export on S3. Recovery window is covered by the 7-day Lightsail AutoSnapshot retention, but the S3 export gap is real.

What went well

The notify-failure Slack webhook job was already in the workflow and did fire on every failure — the alerting infrastructure was partially there.
The bcp-smoke-monthly-review.md documented the missing secret and the operator action item explicitly on 2026-05-29. The failure was anticipated and documented.
Lightsail AutoSnapshot (Win 2) provides a parallel recovery path; the vault data was never unrecoverable.
Root cause identification from the first log pull took under 2 minutes. The workflow's error message, "::error::Failed to parse GPG fingerprint from public key", is unambiguous.
FreeScout backup (the sibling BCP backup cron) has been passing cleanly throughout — no cross-contamination.

What didn't go well

The operator action item to mint GPG_BACKUP_PUBLIC_KEY was filed on 2026-05-29 with a due date of 2026-06-01 and was not completed. There was no automated escalation when it slipped the deadline.
The Slack webhook alert fired 5 times without producing operator action. Slack alone is insufficient for a pageable event class — the system lacked an ops@ email path.
The FreeScout backup notify-failure job referenced SLACK_BOT_TOKEN (a bot API token), which was never provisioned as a GH Actions secret. That failure notification path was silently dead. This is a latent gap exposed by this investigation even though FreeScout backup itself is green.
bcp-smoke-monthly.yml Check 5 did catch the missing export when it fired on 2026-06-01 (the skip_vault_export_check flag was not set, so the check ran and failed), and the operator action item was due that same day. No one followed up on the failed check or verified that the action item had been completed.

Root cause analysis

Contributing factor 1: GPG_BACKUP_PUBLIC_KEY secret never minted — The workflow requires the operator's GPG public key (base64-encoded) as a GitHub Actions secret. This secret was never set. The workflow was deployed without verifying all required secrets were present. The check was documentation-only (operator action item in a runbook review doc) with no automated gate.
Contributing factor 2: Alerting on a single channel (Slack webhook) is insufficient for a pageable event class — The system had one alert channel (Slack incoming webhook). A BCP backup failure is an incident-class event that warrants ops@ email so it reaches the operator's inbox as a durable, high-signal artifact rather than a scrollable Slack message. The workflow comment block said "Slack DM on failure" but the SLACK_WEBHOOK_URL posts to an ops channel, not a DM — lower visibility than originally intended.
Contributing factor 3: Operator action item without an automated escalation deadline — The runbook review filed an operator action item on 2026-05-29 with a due date of 2026-06-01. No mechanism in the system detects a missed operator action item deadline and re-escalates. The item sat open silently for 17 days (2026-05-29 → 2026-06-15).
Contributing factor 4: FreeScout backup notify-failure path silently broken — The FreeScout backup notify-failure job uses SLACK_BOT_TOKEN (a Slack bot API token), which is not present in the repo's GH Actions secrets. If FreeScout backup had failed, no alert would have been sent. This is a latent gap unrelated to the triggering incident but exposed by the same investigation.
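Contributing factor 3 describes a deadline that slipped with nothing to notice it. As a rough illustration only, the check such an escalation could hang off is a simple date difference. The function name days_overdue is hypothetical, and the sketch assumes GNU date (the -d flag), as found on typical Linux runners.

```shell
# days_overdue DUE TODAY
# Both arguments are YYYY-MM-DD dates, interpreted as UTC.
# Prints the number of whole days TODAY is past DUE, or 0 if not overdue.
# Assumption: GNU date is available (the -d option is not portable to BSD/macOS).
days_overdue() {
  due=$(date -u -d "$1" +%s) || return 2
  now=$(date -u -d "$2" +%s) || return 2
  diff=$(( (now - due) / 86400 ))
  if [ "$diff" -gt 0 ]; then
    echo "$diff"
  else
    echo 0
  fi
}
```

For the item in this incident, `days_overdue 2026-06-01 2026-06-15` prints 14, the number of days past the due date at the time of escalation. A scheduled job could re-alert whenever the result is greater than zero.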

Detection

What alerted us: operator noticed directly (manual observation of GH Actions run list)
How long between cause and detection: 5 days (2026-06-11 first failure → 2026-06-15 operator escalation)
How to detect faster next time: ops@ email on BCP backup failure (now wired via this PR); the Slack webhook alone is not a pager
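One way to make a run of failures visible at a glance is to count the current streak from the run history. The sketch below is an assumption-laden helper, not part of this PR: failure_streak is a hypothetical name, and it expects run conclusions on stdin, newest first, one per line.

```shell
# failure_streak
# Reads workflow run conclusions from stdin (newest first, one per line)
# and prints how many consecutive "failure" entries lead the list.
failure_streak() {
  awk '
    $0 != "failure" { exit }   # stop at the first non-failure; END still runs
    { n++ }
    END { print n + 0 }        # n + 0 prints 0 when no failure was seen
  '
}
```

It could be fed from the GitHub CLI, for example `gh run list --workflow bcp-vault-snapshot-daily.yml --limit 10 --json conclusion --jq '.[].conclusion' | failure_streak`. A streak above 1 would be a reasonable trigger for a louder alert.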

Resolution

Completed (this PR)

Added a "Send ops@ email alert via Postmark" step to the notify-failure job in bcp-vault-snapshot-daily.yml. It uses POSTMARK_OPS_ALERT_TOKEN, OPS_ALERT_EMAIL, and OPS_ALERT_FROM (all present in GH Actions secrets since 2026-05-20). The Slack and email steps fire independently.
Replaced the dead SLACK_BOT_TOKEN / chat.postMessage path in the freescout-backup.yml notify-failure job with SLACK_WEBHOOK_URL (incoming webhook, same as vault snapshot and bcp-smoke-monthly) and added a Postmark ops@ email step.
Updated docs/ops/runbooks/vault-disaster-recovery.md with a BCP backup failure section documenting the pageable posture, the alert channels, and the root-cause checklist.
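For readers unfamiliar with Postmark, the following is a minimal sketch of what an ops@ email step might run. It is not the step from this PR. It assumes the three secrets named above are exposed as environment variables of the same names; the helper names (json_escape, build_postmark_payload, send_ops_email) are hypothetical. The endpoint and the X-Postmark-Server-Token header are Postmark's standard single-email API.

```shell
# json_escape STRING
# Escapes backslashes and double quotes for use inside a JSON string.
# Assumption: inputs are single-line; newlines and control characters
# are not handled by this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_postmark_payload FROM TO SUBJECT BODY
# Prints the JSON body for Postmark's POST /email endpoint.
build_postmark_payload() {
  printf '{"From":"%s","To":"%s","Subject":"%s","TextBody":"%s"}' \
    "$(json_escape "$1")" "$(json_escape "$2")" \
    "$(json_escape "$3")" "$(json_escape "$4")"
}

# send_ops_email SUBJECT BODY
# Sends one plain-text email; --fail makes curl exit non-zero on an
# HTTP error so the workflow step is marked failed.
send_ops_email() {
  curl --fail --silent --show-error \
    -X POST "https://api.postmarkapp.com/email" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -H "X-Postmark-Server-Token: ${POSTMARK_OPS_ALERT_TOKEN}" \
    -d "$(build_postmark_payload "$OPS_ALERT_FROM" "$OPS_ALERT_EMAIL" "$1" "$2")"
}
```

A call such as `send_ops_email "BCP vault snapshot failed" "See the failing run in GitHub Actions"` would then be the whole step body. Keeping it in a step separate from the Slack post means an error in one notification does not prevent the other from being attempted.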

Pending operator action (blocker for backup resumption)

The backup itself cannot run green until GPG_BACKUP_PUBLIC_KEY is set. This is a break-glass credential operation only the operator can perform:

# 1. Identify your key fingerprint
gpg --list-secret-keys --keyid-format LONG

# 2. Export the public key, base64-encode it, and set as GH Actions secret
gpg --armor --export YOUR_KEY_FINGERPRINT \
  | base64 \
  | gh secret set GPG_BACKUP_PUBLIC_KEY --repo raxx-app/TradeMasterAPI

# 3. Verify the secret is now listed
gh secret list --repo raxx-app/TradeMasterAPI | grep GPG_BACKUP_PUBLIC_KEY

# 4. Dispatch the workflow manually to confirm green
gh workflow run bcp-vault-snapshot-daily.yml --repo raxx-app/TradeMasterAPI

# 5. Confirm the S3 object landed
aws s3 ls s3://raxx-iac-state-prod/vault-exports/ --region us-east-1 | tail -3

Expected: a .json.gpg object for today's date, size > 1 KB.

Validation (after operator action)

gh run list --workflow bcp-vault-snapshot-daily.yml --limit 1 shows success
aws s3 ls s3://raxx-iac-state-prod/vault-exports/ --region us-east-1 includes today's YYYY-MM-DD.json.gpg
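The S3 check above can be scripted rather than eyeballed. The sketch below parses `aws s3 ls` output (columns: date, time, size in bytes, key) and succeeds only if the dated object exists and is larger than 1 KB. The name check_export is hypothetical, and treating 1 KB as 1024 bytes is an assumption.

```shell
# check_export DATE
# Reads `aws s3 ls` output on stdin. Exits 0 if an object named
# DATE.json.gpg is listed with a size above 1024 bytes, 1 otherwise.
check_export() {
  awk -v key="$1.json.gpg" '
    $4 == key && $3 > 1024 { found = 1 }
    END { exit found ? 0 : 1 }
  '
}
```

Usage after the manual dispatch might look like `aws s3 ls s3://raxx-iac-state-prod/vault-exports/ --region us-east-1 | check_export "$(date -u +%F)"`, with a non-zero exit status signalling that today's export is missing or suspiciously small.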

Action items

| # | Action | Owner | Due | Issue |
|---|--------|-------|-----|-------|
| 1 | Mint `GPG_BACKUP_PUBLIC_KEY` GH Actions secret + confirm backup green | Kristerpher (operator) | 2026-06-15 | operator-action |
| 2 | Merge this PR (pageable alerting for BCP backup failures) | sre-agent / operator | 2026-06-15 | see PR |
| 3 | Add a workflow-level pre-flight step to `bcp-vault-snapshot-daily.yml` that validates all required secrets are non-empty before attempting the backup, so a missing secret fails fast with a clear error on day 1, not day N | sre-agent | 2026-06-22 | type:reliability |
| 4 | Consider adding a monthly "secrets audit" job to `bcp-smoke-monthly.yml` that checks all BCP-critical secrets are non-empty | sre-agent | 2026-06-30 | type:reliability |
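Action item 3 could be sketched roughly as follows. This is one possible shape under stated assumptions, not the planned implementation: require_secrets is a hypothetical helper, each secret is assumed to be mapped to an environment variable of the same name in the step's env block, and the list of names passed in is illustrative.

```shell
# require_secrets NAME...
# Prints one GitHub Actions ::error:: annotation per variable that is
# empty or unset. Returns 1 if any are missing, 0 otherwise.
require_secrets() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    # Names come from the workflow author, not from untrusted input.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "::error::Required secret $name is empty or not set"
      missing=1
    fi
  done
  return "$missing"
}
```

As the first step of the backup job, `require_secrets GPG_BACKUP_PUBLIC_KEY SLACK_WEBHOOK_URL || exit 1` would stop the run with a named error before any export or encryption is attempted. GitHub Actions passes an unset secret to a step as an empty string, which is why the check is for emptiness rather than existence.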

References

Runbook: docs/ops/runbooks/vault-disaster-recovery.md
BCP smoke review (filed original operator action item): docs/ops/runbooks/bcp-smoke-monthly-review.md
Workflow: .github/workflows/bcp-vault-snapshot-daily.yml
Failing run (most recent): https://github.com/raxx-app/TradeMasterAPI/actions/runs/27549207437
First confirmed failure (captured in history): run 27343443722 (2026-06-11T11:25:40Z)