SOP — HEROKU_API_KEY Drift Recovery
Owner: Operator (Kristerpher) + agent
Last updated: 2026-05-03
First incident: 2026-05-03 03:48 UTC (during MBT v1 polish-sprint deploys)
Related issues: #925, #891, #943
If the rotation handler raised OldTokenInvalidError("Heroku rejected the rolling token before mint"), that is the typed signal added in #943. The dyno's HEROKU_API_KEY is the drifted copy; re-sync from vault per Path A or B below.
What "drift" means here
There are three copies of HEROKU_API_KEY that must agree:
- Vault: /MooseQuest/heroku/HEROKU_API_KEY in Infisical. Source of truth.
- GitHub Actions secret: HEROKU_API_KEY at https://github.com/raxx-app/TradeMasterAPI/settings/secrets/actions. Read by every Heroku-deploy workflow.
- Heroku config var: HEROKU_API_KEY on each of the four Heroku apps (raxx-console-prod, raxx-console-staging, raxx-api-prod, raxx-api-staging). Read by the running app for vendor calls.
Drift means one of those values stops matching the others. The most painful failure mode is a stale GH Actions secret: every Heroku deploy then fails with Error: The token provided to HEROKU_API_KEY is invalid.
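The comparison behind "must agree" can be sketched without ever printing a token. This is a minimal illustration, not the validator's actual code: the helper names fingerprint and find_drift are hypothetical, and it covers only vault versus the Heroku config vars, because the GitHub REST API returns secret metadata but never secret values.

```python
import hashlib


def fingerprint(token: str) -> str:
    """Return a short SHA-256 fingerprint so tokens can be compared without logging them."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]


def find_drift(vault_token: str, heroku_tokens: dict[str, str]) -> list[str]:
    """Return the Heroku app names whose config var differs from the vault value."""
    expected = fingerprint(vault_token)
    return sorted(
        app for app, tok in heroku_tokens.items() if fingerprint(tok) != expected
    )


# Hypothetical values for illustration only.
apps = {"raxx-console-prod": "tok-new", "raxx-api-staging": "tok-old"}
print(find_drift("tok-new", apps))  # ['raxx-api-staging']
```

An empty list means the Heroku copies match vault; the GH Actions secret still has to be checked by running a deploy or by looking at its last-updated time.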
How to detect drift
Symptom 1 — git push heroku fails in CI
Deploy to Heroku via heroku CLI + credential helper (subtree split):
Error: The token provided to HEROKU_API_KEY is invalid. Please
double-check that you have the correct token, or run `heroku login`
without HEROKU_API_KEY set.
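If you want to triage failed deploy logs automatically, the error above can be matched as text. A minimal sketch, assuming the log is available as a string; the function name is_key_drift_failure is hypothetical, and whitespace is normalized first because the CLI wraps the message across lines.

```python
import re

# Core phrase of the Heroku CLI error quoted above.
DRIFT_PATTERN = re.compile(r"token provided to HEROKU_API_KEY is invalid")


def is_key_drift_failure(log_text: str) -> bool:
    """True if the log contains the invalid-token error, tolerating line wraps."""
    normalized = " ".join(log_text.split())  # collapse newlines and repeated spaces
    return DRIFT_PATTERN.search(normalized) is not None
```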
Symptom 2 — heroku run fails locally (different problem; usually means your local CLI auth is stale, not vault drift)
Confirm the source of truth is valid
heroku run --app raxx-console-prod --no-tty 'python -c "
import requests
from app.services import vault
t = vault.get_secret_value(\"HEROKU_API_KEY\")
r = requests.get(\"https://api.heroku.com/account\",
                 headers={\"Authorization\": f\"Bearer {t}\",
                          \"Accept\": \"application/vnd.heroku+json; version=3\"},
                 timeout=10)
print(\"vault token validates:\", r.status_code, \"OK\" if r.ok else r.json())
"'
If this returns 200 OK → vault is valid; the drift is in GH or Heroku config. Continue below.
If this returns 401 → vault itself is stale; you need to mint a new token via the rotate-from-console UI (#885). Stop here; that's a different runbook.
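The branch above can be written down as a small decision function. This is a sketch only: the name classify_vault_check and the return labels are hypothetical, and the "inconclusive" branch for any other status code is an assumption added here, since the runbook only defines the 200 and 401 cases.

```python
def classify_vault_check(status_code: int) -> str:
    """Map the /account status code onto the runbook's next step."""
    if status_code == 200:
        # Vault is valid; the drift is in GH or Heroku config.
        return "vault-valid"
    if status_code == 401:
        # Vault itself is stale; mint a new token via the rotate-from-console UI (#885).
        return "vault-stale"
    # Assumption: anything else (rate limit, server error) says nothing about the token.
    return "inconclusive"
```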
Recovery — if vault is valid, GH is stale
Path A — Manual paste (60 seconds)
# Read vault value (safe; runs in dyno, value goes only to your terminal)
heroku run --app raxx-console-prod --no-tty 'python -c "
from app.services import vault
print(vault.get_secret_value(\"HEROKU_API_KEY\"))
"' | tail -1
Copy the value, then:
- Open https://github.com/raxx-app/TradeMasterAPI/settings/secrets/actions
- Click HEROKU_API_KEY → "Update secret"
- Paste the value
- Click "Update secret"
Re-run the failed deploy:
gh workflow run deploy-console.yml -f environment=production -f ref=main
Path B — Automated, requires GITHUB_API_SECRETS_TOKEN (#925)
Once GITHUB_API_SECRETS_TOKEN is in vault with secrets:write scope:
heroku run --app raxx-console-prod --no-tty 'python /dev/stdin' <<'PY'
import base64, sys, requests
from nacl.public import PublicKey, SealedBox
from app.services import vault

token = vault.get_secret_value("HEROKU_API_KEY")
gh = vault.get_secret_value("GITHUB_API_SECRETS_TOKEN")
H = {"Authorization": f"Bearer {gh}",
     "Accept": "application/vnd.github+json",
     "X-GitHub-Api-Version": "2022-11-28"}

# Fetch the repo public key; GitHub only accepts secrets sealed with it.
r = requests.get("https://api.github.com/repos/raxx-app/TradeMasterAPI/actions/secrets/public-key", headers=H, timeout=15)
r.raise_for_status()
pk = r.json()

# Seal the vault value with a libsodium sealed box, then base64-encode it.
sealed = SealedBox(PublicKey(base64.b64decode(pk["key"])))
encrypted = base64.b64encode(sealed.encrypt(token.encode("utf-8"))).decode("ascii")

# Overwrite the Actions secret (GitHub answers 201 on create, 204 on update).
r = requests.put("https://api.github.com/repos/raxx-app/TradeMasterAPI/actions/secrets/HEROKU_API_KEY",
                 headers=H, json={"encrypted_value": encrypted, "key_id": pk["key_id"]}, timeout=15)
print(f"GH secret PUT: HTTP {r.status_code}")
sys.exit(0 if r.ok else 1)
PY
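After the PUT, one way to confirm the write landed is to read the secret's metadata (GET on the same /actions/secrets/HEROKU_API_KEY URL), which returns name, created_at and updated_at but never the value. The check on that response could look like the sketch below; the helper name updated_recently and the five-minute window are assumptions, not part of the handler.

```python
from datetime import datetime, timedelta, timezone


def updated_recently(
    secret_meta: dict,
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),
) -> bool:
    """True if the secret's updated_at timestamp is within max_age of now."""
    # GitHub returns ISO 8601 with a trailing Z, which fromisoformat
    # rejects before Python 3.11, so rewrite it as an explicit offset.
    updated = datetime.fromisoformat(secret_meta["updated_at"].replace("Z", "+00:00"))
    # abs() tolerates small clock skew between this host and GitHub.
    return abs(now - updated) <= max_age


# Hypothetical metadata for illustration only.
meta = {"name": "HEROKU_API_KEY", "updated_at": "2026-05-03T04:10:00Z"}
print(updated_recently(meta, datetime(2026, 5, 3, 4, 12, tzinfo=timezone.utc)))  # True
```

A recent updated_at only shows that something was written; re-running the failed deploy remains the real test that the value is correct.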
Recovery — if vault is stale, Heroku/GH have valid tokens
This is the inverse drift: vault rotted but the live apps still work. Less common.
- Read the live token from one of the Heroku apps:
    heroku config:get HEROKU_API_KEY -a raxx-console-prod
- Validate it (same requests.get /account test as above).
- Write back to vault:
    heroku run --app raxx-console-prod --no-tty 'python -c "
    from app.services import vault
    vault.store_secret_version(\"HEROKU_API_KEY\", \"<paste here>\")
    "'
Prevention
- The Mode A rotation handler (#885 / PR #906 / PR #887) keeps all three destinations in lockstep on every rotation. Use it rather than manual rotation.
- Once #925 lands (GITHUB_API_SECRETS_TOKEN), the handler can fully self-heal. Until then, the GH-secret destination is operator-only on each rotation.
- Audit the three values weekly via the validator scheduler (console/app/services/handler_validator.py).
Postmortem template (use after every drift incident)
### HEROKU_API_KEY drift — <UTC timestamp>
- Detected: <how — log line / failed deploy>
- Vault state: <valid/stale>
- GH secret state: <valid/stale>
- Heroku config state: <valid/stale on each of the 4 apps>
- Root cause: <manual rotation? failed Mode A handler? UI dashboard rotation?>
- Recovery: <Path A / Path B / inverse>
- Time to recover: <minutes>
- Followup: <issue number>
Save postmortems at docs/ops/postmortems/heroku-key-drift-<YYYY-MM-DD>.md.
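The filename convention can be generated rather than typed by hand. A minimal sketch, assuming the date in the filename is the UTC date of detection (the template only says YYYY-MM-DD); the helper name postmortem_path is hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def postmortem_path(detected_at: datetime) -> PurePosixPath:
    """Build the postmortem file path from the incident's UTC date."""
    # Assumption: the filename uses the UTC date, matching the template's UTC timestamp.
    day = detected_at.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return PurePosixPath("docs/ops/postmortems") / f"heroku-key-drift-{day}.md"


print(postmortem_path(datetime(2026, 5, 3, 3, 48, tzinfo=timezone.utc)))
# docs/ops/postmortems/heroku-key-drift-2026-05-03.md
```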
Refs
- 2026-05-03 03:48 UTC incident — first observed; recovery via Path A. Cross-PR collateral: #925 filed for the durable Path B fix.
- PR #906 — Heroku Mode A handler HTTP rewrite (closes the rotation-time drift gap).
- #891 — original handler bug (CLI shell-out) that motivated #906.