Raxx · internal docs


Vault coverage audit runbook

System: Infisical vault, per-secret env coverage (prod / staging / dev)
Owner: sre-agent / operator
Script: scripts/vault/audit_coverage.py
Last incident: none (no incident; proactive audit)
Last reviewed: 2026-05-12 UTC
Related issues: #596 (Phase 1: audit script + snapshot)
Related docs: docs/ops/vault-token-taxonomy.md, docs/ops/2026-05-12-vault-coverage-snapshot.md


Purpose

The Infisical vault holds secrets across three environments: prod, staging, and dev. Drift occurs when a secret is provisioned in one environment but not another — a common source of "works in prod, breaks in staging" incidents.

This audit is read-only. It lists secrets, compares presence and version across environments, and reports drift. It does not modify vault contents.
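
The comparison described above can be sketched as follows. This is a minimal illustration, assuming each environment's listing has already been reduced to a mapping of secret name to version number; the function names, data shape, and sample versions are hypothetical and not taken from audit_coverage.py.

```python
from typing import Dict, List, Optional

ENVS = ("prod", "staging", "dev")

def coverage_matrix(
    listings: Dict[str, Dict[str, int]],
) -> Dict[str, Dict[str, Optional[int]]]:
    """Map each secret name to its version per environment (None = absent)."""
    names = set()
    for env in ENVS:
        names.update(listings.get(env, {}))
    return {
        name: {env: listings.get(env, {}).get(name) for env in ENVS}
        for name in sorted(names)
    }

def drifted(matrix: Dict[str, Dict[str, Optional[int]]]) -> List[str]:
    """Names present in at least one environment but absent from another."""
    # Every name in the matrix exists somewhere, so any None cell is drift.
    return [
        name for name, row in matrix.items()
        if any(version is None for version in row.values())
    ]

# Made-up sample data using secret names that appear in this runbook.
listings = {
    "prod": {"HEROKU_API_KEY": 3, "CF_PAGES_DEPLOY": 1},
    "staging": {"HEROKU_API_KEY": 2},
    "dev": {"HEROKU_API_KEY": 1},
}
matrix = coverage_matrix(listings)
print(drifted(matrix))  # ['CF_PAGES_DEPLOY']
```

Nothing in the sketch writes anything, which mirrors the read-only guarantee: it only compares listings that were already fetched.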

SSM out of scope. AWS-resident workload secrets live in AWS Parameter Store (SSM), not Infisical. Per feedback_aws_workloads_use_ssm_not_vault.md, those secrets are managed separately via Terraform + SSM and are never written to Infisical. Do not attempt to audit SSM via this script.


How to tell it's broken (signs that vault coverage has drifted)


How to diagnose (in order)

  1. Run the audit script — produces the coverage matrix with all presence gaps in one view. This is always step one.

  2. Cross-reference the taxonomy: docs/ops/vault-token-taxonomy.md Section 3 documents which secrets are intentionally prod-only and which are expected in both prod and staging. Presence gaps for account-wide tokens (Cloudflare, GitHub, Anthropic) are by design, not drift.

  3. Check Infisical directly: if a secret shows as missing in the audit but you believe it exists, confirm via the Infisical UI or the Infisical CLI:

     infisical secrets get <SECRET_NAME> --path /MooseQuest/<vendor>/ \
       --env staging --plain

  4. Check the secret path — Infisical returns 404 for a valid secret name if the folder path does not exist in that environment. See feedback_vault_folder_must_exist.md: folders must be created via POST /api/v1/folders before secrets can be written to a new path.


Running the audit script

Prerequisites

| Env var | Source |
|---|---|
| INFISICAL_CLIENT_ID | Infisical Universal Auth machine identity; read from vault at /MooseQuest/infisical/ |
| INFISICAL_CLIENT_SECRET | Paired with above |
| INFISICAL_PROJECT_ID | Infisical project / workspace ID; visible in project settings |
| INFISICAL_HOST | Optional. Default: https://app.infisical.com |
| INFISICAL_PATH_PREFIX | Optional. Default: /MooseQuest/. Set to a sub-path to limit scope. |
| CF_ACCESS_CLIENT_ID | Required only if vault host is behind Cloudflare Access (e.g., self-hosted vault.raxx.app) |
| CF_ACCESS_CLIENT_SECRET | Paired with above |
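
A pre-flight check of these variables might look like the sketch below. It assumes the required/optional split and the defaults listed above; the helper names are hypothetical and the real script may validate its environment differently.

```python
import os
from typing import Dict, List, Mapping

REQUIRED = (
    "INFISICAL_CLIENT_ID",
    "INFISICAL_CLIENT_SECRET",
    "INFISICAL_PROJECT_ID",
)
# Defaults as documented in the prerequisites table.
DEFAULTS = {
    "INFISICAL_HOST": "https://app.infisical.com",
    "INFISICAL_PATH_PREFIX": "/MooseQuest/",
}

def missing_required(env: Mapping[str, str]) -> List[str]:
    """Return required variable names that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

def resolve_optional(env: Mapping[str, str]) -> Dict[str, str]:
    """Return optional settings, falling back to the documented defaults."""
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}

if __name__ == "__main__":
    missing = missing_required(os.environ)
    if missing:
        print("Missing required env vars:", ", ".join(missing))
    else:
        print("Resolved:", resolve_optional(os.environ))
```

Running it before the audit gives the same answer as Failure mode A without contacting the vault.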

Run — markdown output to stdout

python3 scripts/vault/audit_coverage.py

Run — write to the canonical snapshot file

python3 scripts/vault/audit_coverage.py \
  --output docs/ops/2026-05-12-vault-coverage-snapshot.md

Replace the date in the filename with today's date for a new snapshot. Commit the updated file so the coverage history is version-controlled.
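
To avoid hand-editing the date, the snapshot path can be derived. A small sketch, assuming the date should be taken in UTC (the header of this runbook records its review date in UTC); the helper name is hypothetical.

```python
from datetime import date, datetime, timezone
from typing import Optional

def snapshot_path(day: Optional[date] = None) -> str:
    """Build the snapshot filename for a given day (default: today in UTC)."""
    day = day or datetime.now(timezone.utc).date()
    return f"docs/ops/{day.isoformat()}-vault-coverage-snapshot.md"

print(snapshot_path(date(2026, 5, 12)))
# docs/ops/2026-05-12-vault-coverage-snapshot.md
```

The result can be passed to the script's --output flag.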

Run — CSV output (for spreadsheet import)

python3 scripts/vault/audit_coverage.py --format csv > /tmp/vault-coverage.csv

Run — narrower path scope

INFISICAL_PATH_PREFIX=/MooseQuest/heroku/ \
  python3 scripts/vault/audit_coverage.py

Exit codes

| Code | Meaning |
|---|---|
| 0 | Audit completed. Drift may exist; check the report. |
| 1 | Missing required env vars, failed authentication, or vault unreachable. |

Interpreting the output

Coverage matrix cells

| Cell value | Meaning |
|---|---|
| vN (e.g., v3) | Secret present; Infisical version number N |
|  | Secret absent from this environment |

Drift column

"Drift" means the secret is present in at least one environment but absent from another. Drift is not always wrong — account-wide tokens (Cloudflare, AWS, GitHub) are intentionally prod-only. Always cross-reference the taxonomy before treating a drift entry as a defect.

Decision tree for each drift row

  1. Account-wide token (one vendor account regardless of Raxx environment): prod only is correct. Examples: CF_PAGES_DEPLOY, ANTHROPIC_API_KEY, AWS_ACCESS_KEY_ID, GITHUB_API_READONLY_TOKEN. No action.

  2. Env-specific token (vendor has separate accounts/servers per env): must be present in every env where the Raxx service runs. Examples: HEROKU_API_KEY, ALPACA_PAPER_API_KEY_ID, POSTMARK_SERVER_TOKEN, CF_ACCESS_SVC_CONSOLE. If staging is missing → provision.

  3. Dev-only token (vendor test-mode): dev env only. Example: STRIPE_RESTRICTED_KEY (Stripe test-mode key). If missing from dev → provision the test-mode key.

  4. CF Access service tokens: each Raxx environment has its own CF Access application (different app_id). Separate service tokens must exist for prod and staging. Both envs should be populated.
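
The decision tree above can be expressed as a small triage function. This is a sketch only: the sets contain just the example names from the list, not the full taxonomy, and the function name, return strings, and the choice to check prod and staging for env-specific tokens are assumptions. Anything unrecognized falls back to the taxonomy document.

```python
from typing import Dict, Optional

# Example names copied from the decision tree; NOT the complete taxonomy.
ACCOUNT_WIDE = {
    "CF_PAGES_DEPLOY", "ANTHROPIC_API_KEY",
    "AWS_ACCESS_KEY_ID", "GITHUB_API_READONLY_TOKEN",
}
ENV_SPECIFIC = {
    "HEROKU_API_KEY", "ALPACA_PAPER_API_KEY_ID",
    "POSTMARK_SERVER_TOKEN", "CF_ACCESS_SVC_CONSOLE",
}
DEV_ONLY = {"STRIPE_RESTRICTED_KEY"}

def triage(name: str, row: Dict[str, Optional[int]]) -> str:
    """Suggest an action for one drift row (None in row = secret absent)."""
    present = {env for env, version in row.items() if version is not None}
    if name in ACCOUNT_WIDE:
        # Prod-only is the expected state for account-wide tokens.
        return "no action (prod-only by design)" if "prod" in present else "check taxonomy"
    if name in ENV_SPECIFIC:
        gaps = {"prod", "staging"} - present
        return f"provision in {', '.join(sorted(gaps))}" if gaps else "no action"
    if name in DEV_ONLY:
        return "no action" if "dev" in present else "provision test-mode key in dev"
    return "check taxonomy"

print(triage("HEROKU_API_KEY", {"prod": 3, "staging": None, "dev": None}))
# provision in staging
```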


Known failure modes

Failure mode A: Script exits 1 — "Missing required env vars"

Symptom: Script prints [error] Missing required env vars: INFISICAL_CLIENT_ID ...

Cause: The Universal Auth credentials are not in the shell environment.

Fix:

# Option 1 — export directly (for one-off use; do not persist to shell history)
export INFISICAL_CLIENT_ID="<value>"
export INFISICAL_CLIENT_SECRET="<value>"
export INFISICAL_PROJECT_ID="<value>"

# Option 2 — read from vault via Infisical CLI (bootstrap token required)
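# set -a marks the assigned variables for export so the Python child process sees them
set -a
eval "$(infisical export --env prod --path /MooseQuest/infisical/ \
  | grep -E '^(INFISICAL_CLIENT_ID|INFISICAL_CLIENT_SECRET|INFISICAL_PROJECT_ID)=')"
set +a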

Verification: Re-run the script. If a line beginning with [info] Vault host: appears, the credentials were accepted.


Failure mode B: Script exits 1 — "Failed to obtain Infisical auth token"

Symptom: Script prints [error] Failed to obtain Infisical auth token.

Cause: Either the credentials are set but invalid (wrong client ID / secret), or the Infisical host is unreachable.

Fix:

  1. Verify the vault host is reachable:

     curl -sS -o /dev/null -w "%{http_code}" "${INFISICAL_HOST:-https://app.infisical.com}/api/status"

     Expect 200. Anything else means the host is unreachable.

  2. If behind Cloudflare Access (vault.raxx.app), verify the CF Access service-token credentials are set:

     echo "CF_ACCESS_CLIENT_ID=${CF_ACCESS_CLIENT_ID:-(not set)}"
     echo "CF_ACCESS_CLIENT_SECRET=${CF_ACCESS_CLIENT_SECRET:-(not set)}"

     If not set, retrieve from vault at /MooseQuest/cloudflare/ and export. See docs/ops/runbooks/cf-access-service-token-provisioning.md.

  3. Check Infisical status: https://status.infisical.com
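
To separate a credential problem from a reachability problem, the token exchange can be reproduced by hand. The sketch below assumes Infisical's Universal Auth login endpoint (POST /api/v1/auth/universal-auth/login with clientId and clientSecret) and an accessToken field in the response; confirm both against the API reference for the Infisical version in use. The function names are hypothetical and this is not how audit_coverage.py is known to authenticate.

```python
import json
import urllib.request
from typing import Dict, Mapping

def cf_access_headers(env: Mapping[str, str]) -> Dict[str, str]:
    """Return Cloudflare Access headers only when both credentials are set."""
    client_id = env.get("CF_ACCESS_CLIENT_ID")
    client_secret = env.get("CF_ACCESS_CLIENT_SECRET")
    if client_id and client_secret:
        return {
            "CF-Access-Client-Id": client_id,
            "CF-Access-Client-Secret": client_secret,
        }
    return {}

def login(env: Mapping[str, str]) -> str:
    """Exchange machine-identity credentials for an access token (assumed API)."""
    host = env.get("INFISICAL_HOST") or "https://app.infisical.com"
    body = json.dumps({
        "clientId": env["INFISICAL_CLIENT_ID"],
        "clientSecret": env["INFISICAL_CLIENT_SECRET"],
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/v1/auth/universal-auth/login",
        data=body,
        headers={"Content-Type": "application/json", **cf_access_headers(env)},
        method="POST",
    )
    # urlopen raises HTTPError on 4xx/5xx and URLError when the host is unreachable.
    with urllib.request.urlopen(request, timeout=15) as response:
        return json.load(response)["accessToken"]
```

Calling login(os.environ) from an interactive session distinguishes the two causes: an HTTP error status points at the credentials or the Cloudflare Access layer, while a connection error points at the host.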


Failure mode C: Secrets list is empty for one env

Symptom: Script reports 0 secrets found in 'staging' (or dev) but prod has secrets.

Cause (most likely): The path prefix does not exist as a folder in that environment. Infisical returns an empty list (not 404) for a valid env with no secrets at the given path.

Diagnosis:

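# Confirm the /MooseQuest/ folder exists in staging.
# INFISICAL_TOKEN must hold a valid access token for the machine identity;
# the CF-Access headers only matter when the host is behind Cloudflare Access.
curl -s \
  -H "Authorization: Bearer $INFISICAL_TOKEN" \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  "${INFISICAL_HOST:-https://app.infisical.com}/api/v1/folders?workspaceId=$INFISICAL_PROJECT_ID&environment=staging&path=/"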

If /MooseQuest/ is absent from the staging folder list, no secrets have ever been provisioned to staging under this prefix. This is a genuine coverage gap — all expected staging secrets are missing.
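
Checking the folder list by eye is error-prone when the response is long. A minimal sketch of the check, assuming the endpoint returns a JSON object with a folders array whose entries carry a name field; that response shape is an assumption and should be confirmed against the actual output. The helper name and sample payload are hypothetical.

```python
from typing import Any, Dict

def has_folder(response: Dict[str, Any], name: str) -> bool:
    """True if a folder called `name` appears in the (assumed) folders array."""
    return any(folder.get("name") == name for folder in response.get("folders", []))

# Hypothetical payload for illustration only.
sample = {"folders": [{"id": "hypothetical-id", "name": "Other"}]}
print(has_folder(sample, "MooseQuest"))  # False
```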

Fix: Provision the folder and required secrets per the classification table in docs/ops/vault-token-taxonomy.md Section 3. Create the folder first:

curl -s -X POST \
  -H "Authorization: Bearer $INFISICAL_TOKEN" \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"workspaceId":"'"$INFISICAL_PROJECT_ID"'","environment":"staging","name":"MooseQuest","path":"/"}' \
  "${INFISICAL_HOST:-https://app.infisical.com}/api/v1/folders"

Then create sub-folders per vendor as needed, then provision the secrets.

Warning: Per feedback_vault_folder_must_exist.md, Infisical returns 404 if a secret is written to a path whose folder does not exist. Always create the folder before writing the secret. The POST /api/v1/folders call is idempotent — re-running it on an existing folder is safe.
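
Because parent folders must exist before their children, the creation order matters when provisioning a vendor sub-path. The sketch below derives one request body per path segment, parents first, reusing the field names from the curl call above (workspaceId, environment, name, path). The helper name and the placeholder project ID are hypothetical.

```python
from typing import Dict, List

def folder_payloads(project_id: str, environment: str, secret_path: str) -> List[Dict[str, str]]:
    """Return one folder-creation body per path segment, parents first."""
    payloads = []
    parent = "/"
    for segment in [part for part in secret_path.split("/") if part]:
        payloads.append({
            "workspaceId": project_id,
            "environment": environment,
            "name": segment,
            "path": parent,
        })
        # The folder just described becomes the parent of the next segment.
        parent = parent.rstrip("/") + "/" + segment
    return payloads

for payload in folder_payloads("<project-id>", "staging", "/MooseQuest/heroku/"):
    print(payload["path"], "->", payload["name"])
# / -> MooseQuest
# /MooseQuest -> heroku
```

Each body would be sent as a separate POST in the order returned.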


Failure mode D: Version skew between prod and staging

Symptom: A secret is present in both prod and staging but the version numbers differ significantly (e.g., prod is v8, staging is v2).

Cause: The secret was rotated in prod but the staging equivalent was never updated. Staging is running a stale credential.

Fix: This is a rotation gap, not a coverage gap. Route it to the Velvet rotation pipeline, or rotate manually per the vendor-specific SOP in docs/ops/runbooks/rotation/.
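
Rows with version skew can be pulled out of the coverage matrix mechanically. This sketch assumes a matrix of secret name to per-environment version (None = absent) and an arbitrary threshold of 3, since the runbook does not define "significantly"; both the helper and the threshold are hypothetical. Version counters are tracked independently in each environment, so a large gap is a prompt to check rotation history, not proof of a stale credential.

```python
from typing import Dict, List, Optional, Tuple

def version_skew(
    matrix: Dict[str, Dict[str, Optional[int]]],
    threshold: int = 3,
    env_a: str = "prod",
    env_b: str = "staging",
) -> List[Tuple[str, int, int]]:
    """List secrets present in both envs whose versions differ by >= threshold."""
    skewed = []
    for name, row in matrix.items():
        a, b = row.get(env_a), row.get(env_b)
        if a is not None and b is not None and abs(a - b) >= threshold:
            skewed.append((name, a, b))
    return skewed

# Made-up sample mirroring the v8 / v2 example in the symptom above.
matrix = {
    "HEROKU_API_KEY": {"prod": 8, "staging": 2, "dev": None},
    "POSTMARK_SERVER_TOKEN": {"prod": 2, "staging": 2, "dev": None},
}
print(version_skew(matrix))  # [('HEROKU_API_KEY', 8, 2)]
```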


Remediation workflow (step-by-step)

For each drift row that is not intentionally prod-only:

  1. Confirm the secret is missing (not just at a different path):

     infisical secrets get <SECRET_NAME> --path /MooseQuest/<vendor>/ \
       --env staging --plain
     # Expect: value printed. If error: secret is genuinely absent.

  2. Check the vendor-specific folder exists in staging (see Failure mode C above).

  3. Provision the missing secret in the Infisical UI or via API. Use the staging credential for that vendor (not the prod value — staging must have its own isolated credential where the vendor supports it).

  4. Add the __EXPIRES_AT companion secret for the new entry.

  5. Re-run the audit to confirm the gap is resolved:

     python3 scripts/vault/audit_coverage.py | grep "<SECRET_NAME>"

  6. Update the snapshot file and commit:

     python3 scripts/vault/audit_coverage.py \
       --output docs/ops/2026-05-12-vault-coverage-snapshot.md
     git add docs/ops/2026-05-12-vault-coverage-snapshot.md
     git commit -m "ops(vault): update coverage snapshot after remediation"


Scheduled re-run

The audit should be run monthly and after any vault provisioning change. No automated scheduler is wired for this yet — tracked in #596 action items.

To add it to the nightly GH Actions digest (future):

  1. Add a vault-coverage-audit step to .github/workflows/nightly-ops.yml that runs audit_coverage.py --format md and posts the drift section to Slack if drift_count > 0.

  2. Wire INFISICAL_CLIENT_ID, INFISICAL_CLIENT_SECRET, and INFISICAL_PROJECT_ID as GitHub Actions secrets sourced from vault.


Emergency stop

This script is read-only. There is no emergency stop — it cannot modify vault contents. If the script is running and you want to stop it, Ctrl-C is sufficient.


Escalation

Escalate to operator (Kristerpher) when:


References