Vault coverage audit runbook
System: Infisical vault — per-secret env coverage (prod / staging / dev)
Owner: sre-agent / operator
Script: scripts/vault/audit_coverage.py
Last incident: — (no incident; proactive audit)
Last reviewed: 2026-05-12 UTC
Related issues: #596 (Phase 1 — audit script + snapshot)
Related docs: docs/ops/vault-token-taxonomy.md, docs/ops/2026-05-12-vault-coverage-snapshot.md
Purpose
The Infisical vault holds secrets across three environments: prod, staging,
and dev. Drift occurs when a secret is provisioned in one environment but
not another — a common source of "works in prod, breaks in staging" incidents.
This audit is read-only. It lists secrets, compares presence and version across environments, and reports drift. It does not modify vault contents.
SSM out of scope. AWS-resident workload secrets live in AWS Parameter Store
(SSM), not Infisical. Per feedback_aws_workloads_use_ssm_not_vault.md, those
secrets are managed separately via Terraform + SSM and are never written to
Infisical. Do not attempt to audit SSM via this script.
How to tell it's broken (signs that vault coverage has drifted)
- A staging deploy succeeds but a service fails to start with a missing-env-var error (KeyError, os.environ raise, or getenv() returning None where a value is required).
- Rotation pipeline marks a secret as not_found in staging but healthy in prod (Velvet rotation logs or console /secrets page).
- CI staging smoke test passes at the HTTP layer but returns 401/403 because a vendor key is absent from staging vault.
- Console vault browser shows a secret in the prod env dropdown but no entry for staging.
How to diagnose (in order)
1. Run the audit script: produces the coverage matrix with all presence gaps in one view. This is always step one.
2. Cross-reference the taxonomy: docs/ops/vault-token-taxonomy.md Section 3 documents which secrets are intentionally prod-only vs. expected in both prod and staging. Presence gaps for account-wide tokens (Cloudflare, GitHub, Anthropic) are by design, not drift.
3. Check Infisical directly: if a secret shows as missing in the audit but you believe it exists, confirm via the Infisical UI or the Infisical CLI:
   infisical secrets get <SECRET_NAME> --path /MooseQuest/<vendor>/ \
     --env staging --plain
4. Check the secret path: Infisical returns 404 for a valid secret name if the folder path does not exist in that environment. See feedback_vault_folder_must_exist.md: folders must be created via POST /api/v1/folders before secrets can be written to a new path.
Running the audit script
Prerequisites
| Env var | Source |
|---|---|
| INFISICAL_CLIENT_ID | Infisical Universal Auth machine identity; read from vault at /MooseQuest/infisical/ |
| INFISICAL_CLIENT_SECRET | Paired with above |
| INFISICAL_PROJECT_ID | Infisical project / workspace ID; visible in project settings |
| INFISICAL_HOST | Optional. Default: https://app.infisical.com |
| INFISICAL_PATH_PREFIX | Optional. Default: /MooseQuest/. Set to a sub-path to limit scope. |
| CF_ACCESS_CLIENT_ID | Required only if vault host is behind Cloudflare Access (e.g., self-hosted vault.raxx.app) |
| CF_ACCESS_CLIENT_SECRET | Paired with above |
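The prerequisites table can be read as a small validation step: three required variables plus two optional ones with defaults. The sketch below is only an illustration of that reading; `load_config` is a hypothetical helper, and the way audit_coverage.py actually loads its settings may differ.

```python
import os

# Names and defaults are taken from the prerequisites table.
REQUIRED = ("INFISICAL_CLIENT_ID", "INFISICAL_CLIENT_SECRET", "INFISICAL_PROJECT_ID")
DEFAULTS = {
    "INFISICAL_HOST": "https://app.infisical.com",
    "INFISICAL_PATH_PREFIX": "/MooseQuest/",
}


def load_config(env=None):
    """Return (config, missing) for an environment mapping.

    Hypothetical helper for illustration; not the script's real code.
    """
    env = os.environ if env is None else env
    # A variable that is unset or empty counts as missing.
    missing = [name for name in REQUIRED if not env.get(name)]
    config = {name: env.get(name, "") for name in REQUIRED}
    # Optional variables fall back to the documented defaults.
    for name, default in DEFAULTS.items():
        config[name] = env.get(name) or default
    return config, missing
```

Calling `load_config()` with no argument reads the current shell environment; a non-empty `missing` list corresponds to the exit-1 case described under Failure mode A.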
Run — markdown output to stdout
python3 scripts/vault/audit_coverage.py
Run — write to the canonical snapshot file
python3 scripts/vault/audit_coverage.py \
--output docs/ops/2026-05-12-vault-coverage-snapshot.md
Replace the date in the filename with today's date for a new snapshot. Commit the updated file so the coverage history is version-controlled.
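To avoid hand-editing the date, the snapshot filename can be derived from the current date. This is a minimal sketch; `snapshot_path` is a hypothetical helper, and using UTC is an assumption based on the "Last reviewed" header of this runbook.

```python
from datetime import datetime, timezone


def snapshot_path(now=None):
    """Build the dated snapshot path used with --output.

    Assumes the date in the filename is the UTC date of the run.
    """
    now = now or datetime.now(timezone.utc)
    return f"docs/ops/{now:%Y-%m-%d}-vault-coverage-snapshot.md"
```

The returned string can be passed as the value of `--output` when invoking the script.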
Run — CSV output (for spreadsheet import)
python3 scripts/vault/audit_coverage.py --format csv > /tmp/vault-coverage.csv
Run — narrower path scope
INFISICAL_PATH_PREFIX=/MooseQuest/heroku/ \
python3 scripts/vault/audit_coverage.py
Exit codes
| Code | Meaning |
|---|---|
| 0 | Audit completed. Drift may exist — check the report. |
| 1 | Missing required env vars or vault unreachable. |
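One consequence of the exit-code table is that exit 0 does not mean "no drift"; a caller still has to read the report. The wrapper below is a sketch of that handling, with `run_audit` as a hypothetical name and the command passed in by the caller.

```python
import subprocess
import sys

# Meanings copied from the exit-code table.
EXIT_MEANINGS = {
    0: "Audit completed. Drift may exist; check the report.",
    1: "Missing required env vars or vault unreachable.",
}


def run_audit(cmd):
    """Run the audit command and return (exit_code, meaning, stdout)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    meaning = EXIT_MEANINGS.get(result.returncode, "Undocumented exit code.")
    return result.returncode, meaning, result.stdout
```

A typical call would be `run_audit([sys.executable, "scripts/vault/audit_coverage.py"])`, after which the returned stdout holds the markdown report to inspect for drift.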
Interpreting the output
Coverage matrix cells
| Cell value | Meaning |
|---|---|
| vN (e.g., v3) | Secret present; Infisical version number N |
| — | Secret absent from this environment |
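The two cell values can be parsed mechanically, which also gives a direct test for drift in the sense used throughout this runbook (present in at least one environment, absent from another). The helper names below are hypothetical and the row format is an assumption; the script's real report layout may differ.

```python
import re

ABSENT = "—"  # cell value for a secret that is absent from an environment


def parse_cell(cell):
    """Return the version number for a 'vN' cell, or None if the secret is absent."""
    cell = cell.strip()
    if cell == ABSENT:
        return None
    match = re.fullmatch(r"v(\d+)", cell)
    if match is None:
        raise ValueError(f"unrecognized cell value: {cell!r}")
    return int(match.group(1))


def has_drift(row):
    """row maps env name to cell value, e.g. {'prod': 'v3', 'staging': '—'}."""
    present = [parse_cell(cell) is not None for cell in row.values()]
    # Drift: present somewhere, but not everywhere.
    return any(present) and not all(present)
```

Note that `has_drift` only reports presence gaps; whether a gap is a defect still depends on the taxonomy.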
Drift column
"Drift" means the secret is present in at least one environment but absent from another. Drift is not always wrong — account-wide tokens (Cloudflare, AWS, GitHub) are intentionally prod-only. Always cross-reference the taxonomy before treating a drift entry as a defect.
Decision tree for each drift row
1. Account-wide token (one vendor account regardless of Raxx environment): prod only is correct. Examples: CF_PAGES_DEPLOY, ANTHROPIC_API_KEY, AWS_ACCESS_KEY_ID, GITHUB_API_READONLY_TOKEN. No action.
2. Env-specific token (vendor has separate accounts/servers per env): must be present in every env where the Raxx service runs. Examples: HEROKU_API_KEY, ALPACA_PAPER_API_KEY_ID, POSTMARK_SERVER_TOKEN, CF_ACCESS_SVC_CONSOLE. If staging is missing → provision.
3. Dev-only token (vendor test-mode): dev env only. Example: STRIPE_RESTRICTED_KEY (Stripe test-mode key). If missing from dev → provision the test-mode key.
4. CF Access service tokens: each Raxx environment has its own CF Access application (different app_id). Separate service tokens must exist for prod and staging. Both envs should be populated.
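The decision tree above can be sketched as a lookup from token class to expected environments. The class names, the `envs_to_provision` helper, and the default assumption that Raxx services run in prod and staging are all illustrative; the authoritative classification is docs/ops/vault-token-taxonomy.md Section 3.

```python
# Expected environments per token class, following the decision tree.
# Class names are hypothetical labels, not identifiers from the taxonomy doc.
EXPECTED_ENVS = {
    "account_wide": {"prod"},
    "dev_only": {"dev"},
    "cf_access": {"prod", "staging"},
}


def envs_to_provision(token_class, present_envs, running_envs=frozenset({"prod", "staging"})):
    """Return the sorted list of environments where the secret still needs provisioning.

    running_envs is an assumption: the envs where the Raxx service runs.
    """
    if token_class == "env_specific":
        expected = set(running_envs)
    else:
        expected = EXPECTED_ENVS[token_class]
    return sorted(expected - set(present_envs))
```

For example, an env-specific token present only in prod yields `["staging"]`, while an account-wide token present only in prod yields an empty list, meaning no action.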
Known failure modes
Failure mode A: Script exits 1 — "Missing required env vars"
Symptom: Script prints [error] Missing required env vars: INFISICAL_CLIENT_ID ...
Cause: The Universal Auth credentials are not in the shell environment.
Fix:
# Option 1: export directly (for one-off use; do not persist to shell history)
export INFISICAL_CLIENT_ID="<value>"
export INFISICAL_CLIENT_SECRET="<value>"
export INFISICAL_PROJECT_ID="<value>"
# Option 2: read from vault via Infisical CLI (bootstrap token required)
# set -a exports the variables that eval assigns, so the Python script can see them
set -a
eval "$(infisical export --env prod --path /MooseQuest/infisical/ \
  | grep -E '^(INFISICAL_CLIENT_ID|INFISICAL_CLIENT_SECRET|INFISICAL_PROJECT_ID)=')"
set +a
Verification: Re-run the script. If the [info] Vault host: line appears, the credentials were accepted.
Failure mode B: Script exits 1 — "Failed to obtain Infisical auth token"
Symptom: Script prints [error] Failed to obtain Infisical auth token.
Cause: Either the credentials are set but invalid (wrong client ID / secret), or the Infisical host is unreachable.
Fix:
1. Verify the vault host is reachable:
   curl -sS -o /dev/null -w "%{http_code}" "${INFISICAL_HOST:-https://app.infisical.com}/api/status"
   Expect 200. Any other result means the host is unreachable.
2. If behind Cloudflare Access (vault.raxx.app), verify the CF Access service-token credentials are set:
   echo "CF_ACCESS_CLIENT_ID=${CF_ACCESS_CLIENT_ID:-(not set)}"
   echo "CF_ACCESS_CLIENT_SECRET=${CF_ACCESS_CLIENT_SECRET:-(not set)}"
   If not set, retrieve from vault at /MooseQuest/cloudflare/ and export. See docs/ops/runbooks/cf-access-service-token-provisioning.md.
3. Check Infisical status: https://status.infisical.com
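When the host is reachable but the token exchange still fails, it can help to reproduce the login request by hand. The sketch below only builds the request and does not send it. It assumes the Universal Auth login endpoint is POST /api/v1/auth/universal-auth/login with a `clientId` / `clientSecret` JSON body, which should be confirmed against the Infisical API reference; `build_login_request` is a hypothetical helper, not part of audit_coverage.py.

```python
import json
import urllib.request


def build_login_request(host, client_id, client_secret, cf_id=None, cf_secret=None):
    """Build (but do not send) a Universal Auth login request.

    The endpoint path and body field names are assumptions to verify.
    """
    headers = {"Content-Type": "application/json"}
    # CF Access headers are only needed when the vault sits behind Cloudflare Access.
    if cf_id and cf_secret:
        headers["CF-Access-Client-Id"] = cf_id
        headers["CF-Access-Client-Secret"] = cf_secret
    body = json.dumps({"clientId": client_id, "clientSecret": client_secret}).encode()
    return urllib.request.Request(
        host.rstrip("/") + "/api/v1/auth/universal-auth/login",
        data=body,
        headers=headers,
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` performs the exchange; an HTTP 401 there points at invalid credentials, while a connection error points at the host or the Cloudflare Access layer.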
Failure mode C: Secrets list is empty for one env
Symptom: Script reports 0 secrets found in 'staging' (or dev) but prod
has secrets.
Cause (most likely): The path prefix does not exist as a folder in that environment. Infisical returns an empty list (not 404) for a valid env with no secrets at the given path.
Diagnosis:
# Confirm the /MooseQuest/ folder exists in staging
curl -s \
-H "Authorization: Bearer $INFISICAL_TOKEN" \
-H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
-H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
"$INFISICAL_HOST/api/v1/folders?workspaceId=$INFISICAL_PROJECT_ID&environment=staging&path=/"
If /MooseQuest/ is absent from the staging folder list, no secrets have ever
been provisioned to staging under this prefix. This is a genuine coverage gap —
all expected staging secrets are missing.
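Checking the folder list by eye is error-prone when the response is long. A small parser can answer the question directly. This sketch assumes the folder-list response is JSON of the form `{"folders": [{"name": ...}, ...]}`; that shape is an assumption to verify against a real response, and `folder_exists` is a hypothetical helper.

```python
import json


def folder_exists(response_text, name):
    """Return True if a folder called `name` appears in the folder-list response.

    Assumes a top-level "folders" array whose items carry a "name" field.
    """
    payload = json.loads(response_text)
    return any(folder.get("name") == name for folder in payload.get("folders", []))
```

Feeding it the body returned by the curl command above, with `name="MooseQuest"`, gives a yes/no answer for the staging environment.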
Fix: Provision the folder and required secrets per the classification table
in docs/ops/vault-token-taxonomy.md Section 3. Create the folder first:
curl -s -X POST \
-H "Authorization: Bearer $INFISICAL_TOKEN" \
-H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
-H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
-H "Content-Type: application/json" \
-d '{"workspaceId":"'"$INFISICAL_PROJECT_ID"'","environment":"staging","name":"MooseQuest","path":"/"}' \
"$INFISICAL_HOST/api/v1/folders"
Then create sub-folders per vendor as needed, then provision the secrets.
Warning: Per feedback_vault_folder_must_exist.md, Infisical returns 404 if a secret is written to a path whose folder does not exist. Always create the folder before writing the secret. The POST /api/v1/folders call is idempotent; re-running it on an existing folder is safe.
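Because folders have to exist before secrets are written, nested paths need to be created parent first. The sketch below splits a path into creation steps and builds the JSON body with the same four fields as the curl example. `folder_steps` and `folder_payload` are hypothetical helpers, and whether the API expects the parent path with or without a trailing slash is an assumption to check.

```python
import json


def folder_steps(full_path):
    """Yield (parent_path, name) pairs, parent first, for a path like '/MooseQuest/heroku/'."""
    parent = "/"
    for name in [part for part in full_path.split("/") if part]:
        yield parent, name
        parent = parent.rstrip("/") + "/" + name


def folder_payload(project_id, environment, name, path="/"):
    """Build the JSON body for POST /api/v1/folders, mirroring the curl example."""
    return json.dumps(
        {"workspaceId": project_id, "environment": environment, "name": name, "path": path}
    )
```

Building the body with `json.dumps` avoids the shell quoting in the curl one-liner, which is easy to get wrong when the project ID comes from a variable.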
Failure mode D: Version skew between prod and staging
Symptom: A secret is present in both prod and staging but the version
numbers differ significantly (e.g., prod is v8, staging is v2).
Cause: The secret was rotated in prod but the staging equivalent was never updated. Staging is running a stale credential.
Fix: This is a rotation gap, not a coverage gap. Route to Velvet rotation
pipeline or manual rotation per the vendor-specific SOP in
docs/ops/runbooks/rotation/.
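Version skew can be flagged with a simple gap check. The runbook only says the numbers differ "significantly", so the `min_gap` threshold below is an assumption, and `skewed` is a hypothetical helper. Version counters may also advance independently per environment, so a flagged row is a prompt to investigate rather than proof of a stale credential.

```python
def skewed(versions, min_gap=3):
    """Return True when present versions differ by min_gap or more.

    versions maps env name to a version number, or None when the secret is absent.
    min_gap is an assumed threshold, not a value taken from the audit script.
    """
    present = [v for v in versions.values() if v is not None]
    # Skew needs at least two environments that actually hold the secret.
    return len(present) >= 2 and max(present) - min(present) >= min_gap
```

With the example from the symptom (prod at v8, staging at v2), `skewed({"prod": 8, "staging": 2})` returns True.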
Remediation workflow (step-by-step)
For each drift row that is not intentionally prod-only:
1. Confirm the secret is missing (not just at a different path):
   infisical secrets get <SECRET_NAME> --path /MooseQuest/<vendor>/ \
     --env staging --plain
   # Expect: value printed. If error: secret is genuinely absent.
2. Check the vendor-specific folder exists in staging (see Failure mode C above).
3. Provision the missing secret in the Infisical UI or via API. Use the staging credential for that vendor (not the prod value; staging must have its own isolated credential where the vendor supports it).
4. Add the __EXPIRES_AT companion secret for the new entry.
5. Re-run the audit to confirm the gap is resolved:
   python3 scripts/vault/audit_coverage.py | grep "<SECRET_NAME>"
6. Update the snapshot file and commit:
   python3 scripts/vault/audit_coverage.py \
     --output docs/ops/2026-05-12-vault-coverage-snapshot.md
   git add docs/ops/2026-05-12-vault-coverage-snapshot.md
   git commit -m "ops(vault): update coverage snapshot after remediation"
Scheduled re-run
The audit should be run monthly and after any vault provisioning change. No automated scheduler is wired for this yet — tracked in #596 action items.
To add it to the nightly GH Actions digest (future):
1. Add a vault-coverage-audit step to .github/workflows/nightly-ops.yml
that runs audit_coverage.py --format md and posts the drift section to
Slack if drift_count > 0.
2. Wire INFISICAL_CLIENT_ID, INFISICAL_CLIENT_SECRET, and
INFISICAL_PROJECT_ID as GitHub Actions secrets sourced from vault.
Emergency stop
This script is read-only. There is no emergency stop — it cannot modify vault
contents. If the script is running and you want to stop it, Ctrl-C is
sufficient.
Escalation
Escalate to operator (Kristerpher) when:
- The live script finds a credential with no prod entry that is expected to have one (e.g., HEROKU_API_KEY missing from prod): this is a SEV-2, because a prod service may be running without rotation coverage.
- A secret is present in staging but missing from prod: investigate whether a staging-only credential was accidentally provisioned instead of a prod credential.
- The Infisical host is unreachable and the circuit breaker has been open for 15 minutes (see docs/ops/runbooks/infisical-cloud-config.md).
- Any secret with sensitivity:critical (live trading keys, Heroku platform key) appears in an unexpected environment.
References
- Script: scripts/vault/audit_coverage.py
- Snapshot: docs/ops/2026-05-12-vault-coverage-snapshot.md
- Vault taxonomy: docs/ops/vault-token-taxonomy.md
- Infisical config runbook: docs/ops/runbooks/infisical-cloud-config.md
- CF Access service token provisioning: docs/ops/runbooks/cf-access-service-token-provisioning.md
- Vault folder creation requirement: feedback_vault_folder_must_exist.md
- SSM vs. Infisical boundary: feedback_aws_workloads_use_ssm_not_vault.md
- Related issue: #596