RCA — BCP vault snapshot (Win 3) silent failure: GPG_BACKUP_PUBLIC_KEY never minted

Incident ID: 2026-06-15-bcp-vault-snapshot-gpg-key-missing
Date: 2026-06-15
Severity: SEV-3
Duration: 5 days open (2026-06-11 first failure → 2026-06-15 operator escalation; fix pending operator action)
Blast radius: Internal BCP only. No user-facing service impact. Zero vault secrets were lost or exposed. The vault itself is healthy; the off-Infisical encrypted backup copy was simply not produced for 5 consecutive days.
Author: sre-agent


Summary

The bcp-vault-snapshot-daily.yml workflow (BCP Win 3) has failed on every daily run since 2026-06-11. The root cause is that the GPG_BACKUP_PUBLIC_KEY GitHub Actions repo secret was never set. Without a GPG public key the workflow cannot encrypt the vault export before uploading it to S3, so it exits at the Import GPG public key step in under 10 seconds. The notify-failure job ran successfully and posted to the Slack webhook on every failure. However, ops@ email alerting was not wired into that job, and no escalation path reached a channel the operator treats as a pager. The Slack alerts went unnoticed for 5 days.
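To make the failure mode concrete, here is a minimal sketch of how a key-import step could reject a missing or malformed secret with a clear message instead of failing inside gpg. It assumes the secret holds a base64-encoded, ASCII-armored public key (which is what the operator commands in this document produce). The helper name decode_backup_key is hypothetical and is not taken from the existing workflow.

```shell
# Hypothetical helper, not part of the existing workflow.
# $1 is the base64-encoded secret value. On success the decoded key is
# printed to stdout; on failure a reason is printed to stderr.
decode_backup_key() {
  encoded="$1"
  if [ -z "$encoded" ]; then
    echo "GPG_BACKUP_PUBLIC_KEY is empty or unset" >&2
    return 1
  fi
  # The assignment's exit status is that of base64, so bad input is caught here.
  if ! decoded="$(printf '%s' "$encoded" | base64 --decode 2>/dev/null)"; then
    echo "GPG_BACKUP_PUBLIC_KEY is not valid base64" >&2
    return 1
  fi
  case "$decoded" in
    *"-----BEGIN PGP PUBLIC KEY BLOCK-----"*)
      printf '%s\n' "$decoded"
      ;;
    *)
      echo "decoded value is not an armored PGP public key" >&2
      return 1
      ;;
  esac
}
```

In a workflow step this could feed gpg directly, for example: decode_backup_key "$GPG_BACKUP_PUBLIC_KEY" | gpg --batch --import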

The fix has two parts: (a) an operator action to mint GPG_BACKUP_PUBLIC_KEY, which is a break-glass item only the operator can perform; (b) a CI/alerting change (this PR) adding Postmark ops@ email to the notify-failure job of both the vault snapshot and FreeScout backup workflows so future failures reach a pageable channel immediately rather than depending on someone checking Slack.
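As an illustration of part (b), the following sketch shows one way a notify-failure job could send an email through Postmark's single-email endpoint (POST https://api.postmarkapp.com/email, authenticated with the X-Postmark-Server-Token header). This is not the PR's actual implementation. The function names, the POSTMARK_SERVER_TOKEN variable name, and the example addresses are assumptions made for the example.

```shell
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Assumption: inputs are single-line; newlines and control characters are
# not handled by this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the JSON body: $1 = From, $2 = To, $3 = Subject, $4 = TextBody.
build_alert_payload() {
  printf '{"From":"%s","To":"%s","Subject":"%s","TextBody":"%s"}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" \
    "$(json_escape "$3")" "$(json_escape "$4")"
}

# Send the alert. POSTMARK_SERVER_TOKEN is an assumed secret name; the
# ${VAR:?} expansion aborts with a message if it is unset or empty.
send_alert() {
  build_alert_payload "$@" | curl --fail --silent --show-error \
    -X POST "https://api.postmarkapp.com/email" \
    -H "Accept: application/json" \
    -H "Content-Type: application/json" \
    -H "X-Postmark-Server-Token: ${POSTMARK_SERVER_TOKEN:?must be set}" \
    --data-binary @-
}
```

A failure job might call it as: send_alert "alerts@example.com" "ops@example.com" "BCP vault snapshot failed" "See the workflow run for details". Using curl --fail means a rejected request fails the step, so a broken alert path is itself visible.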


Timeline (all times UTC)


Impact


What went well


What didn't go well


Root cause analysis


Detection


Resolution

Completed (this PR)

Pending operator action (blocker for backup resumption)

The backup itself cannot run green until GPG_BACKUP_PUBLIC_KEY is set. This is a break-glass credential operation only the operator can perform:

# 1. Identify your key fingerprint
gpg --list-secret-keys --keyid-format LONG

# 2. Export the public key, base64-encode it, and set as GH Actions secret
gpg --armor --export YOUR_KEY_FINGERPRINT \
  | base64 \
  | gh secret set GPG_BACKUP_PUBLIC_KEY --repo raxx-app/TradeMasterAPI

# 3. Verify the secret is now listed
gh secret list --repo raxx-app/TradeMasterAPI | grep GPG_BACKUP_PUBLIC_KEY

# 4. Dispatch the workflow manually to confirm green
gh workflow run bcp-vault-snapshot-daily.yml --repo raxx-app/TradeMasterAPI

# 5. Confirm the S3 object landed
aws s3 ls s3://raxx-iac-state-prod/vault-exports/ --region us-east-1 | tail -3

Expected: a .json.gpg object for today's date, size > 1 KB.
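The expectation above can be checked mechanically rather than by eye. The sketch below parses aws s3 ls output (date, time, size in bytes, key) from stdin and succeeds only if a .json.gpg object with the given last-modified date is larger than 1024 bytes. The helper name snapshot_landed is hypothetical. It matches on the listing's date column rather than on the object name, since the naming scheme is not stated here, and it does not handle time zone differences between the listing and the date you pass in.

```shell
# Hypothetical helper: reads `aws s3 ls` output on stdin.
# $1 is the expected last-modified date in YYYY-MM-DD form.
# Exits 0 if some .json.gpg object from that date exceeds 1024 bytes,
# and 1 otherwise.
snapshot_landed() {
  awk -v day="$1" '
    $1 == day && $4 ~ /\.json\.gpg$/ && $3 + 0 > 1024 { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}
```

For example: aws s3 ls s3://raxx-iac-state-prod/vault-exports/ --region us-east-1 | snapshot_landed "$(date -u +%F)" && echo "snapshot present"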

Validation (after operator action)


Action items

| # | Action | Owner | Due | Issue |
|---|--------|-------|-----|-------|
| 1 | Mint GPG_BACKUP_PUBLIC_KEY GH Actions secret + confirm backup green | Kristerpher (operator) | 2026-06-15 | operator-action |
| 2 | Merge this PR (pageable alerting for BCP backup failures) | sre-agent / operator | 2026-06-15 | see PR |
| 3 | Add a workflow-level pre-flight step to bcp-vault-snapshot-daily.yml that validates all required secrets are non-empty before attempting the backup, so a missing secret fails fast with a clear error on day 1, not day N | sre-agent | 2026-06-22 | type:reliability |
| 4 | Consider adding a monthly "secrets audit" job to bcp-smoke-monthly.yml that checks all BCP-critical secrets are non-empty | sre-agent | 2026-06-30 | type:reliability |
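The pre-flight check in action item 3 could look roughly like the sketch below. It takes a list of environment variable names and fails if any is unset or empty, reporting every missing name rather than stopping at the first. The function name require_secrets is hypothetical, and the sketch assumes the workflow maps each secret into an environment variable of the same name.

```shell
# Hypothetical pre-flight helper. Arguments are environment variable names.
# Returns 0 if all are non-empty, 1 if any is missing, 2 on an invalid name.
require_secrets() {
  missing=0
  for name in "$@"; do
    # Reject anything that is not a plain identifier before using eval.
    case "$name" in
      ''|[0-9]*|*[!A-Za-z0-9_]*)
        echo "ERROR: invalid variable name: $name" >&2
        return 2
        ;;
    esac
    # Portable indirect lookup: reads the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "ERROR: required secret $name is empty or unset" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Run as the first step of the job, for example: require_secrets GPG_BACKUP_PUBLIC_KEY || exit 1. Any other secrets the workflow needs would be added to the same call.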

References