RCA — Bot token mint returning 404 after MooseQuest → raxx-app org migration
- Incident ID: 2026-05-09-bot-token-mint-404
- Date: 2026-05-09
- Severity: SEV-3
- Duration: ~1 session (detection → root cause identified; full resolution pending operator vault update)
- Blast radius: All dispatched agents (sre-agent, feature-developer, product-manager, security-agent, card-groomer, raxx-blr-bot); all fell back to operator PAT for the entire session
- Author: sre-agent
Summary
After the GitHub organization was renamed from MooseQuest to raxx-app, all three GitHub App bot tokens (raxx-dev-bot, raxx-ops-bot, raxx-pm-bot) stopped minting. Every agent dispatch fell back to the operator PAT. Two root causes were found: (1) the GitHub App installation IDs stored in Infisical still reference the pre-migration MooseQuest org installations, which no longer exist on raxx-app and return HTTP 404 from the GitHub token-exchange endpoint; (2) a latent code defect in mint_github_token.py caused the default Infisical path prefix to resolve to /MooseQuest/github/<bot> instead of the documented and correct /MooseQuest/<bot>. The code fix ships in this commit; the vault update (new installation IDs) requires operator action in the GitHub UI.
Timeline (all times UTC)
- 2026-05-09 ~00:00 — GitHub org renamed from MooseQuest to raxx-app (approximate; exact time unknown)
- 2026-05-09 (session start) — First signal: every dispatched agent logs "mint failed; fell back to operator PAT"
- 2026-05-09 (session) — Operator confirms the failure started after the org migration
- 2026-05-09 (this task) — SRE agent investigates; root causes identified
- 2026-05-09: Code fix applied (DEFAULT_PATH_PREFIX corrected); provisioning and architecture docs updated
- Pending: Operator re-installs Apps on the raxx-app org and updates Infisical INSTALLATION_ID values
Impact
- Users affected: 0 (internal tooling only)
- User-visible symptoms: none
- Data integrity: ok
- Revenue / billing: ok
- Operator impact: all agent GitHub API calls (PR creation, issue filing, comments) attributed to operator PAT instead of bot identities for the full session; audit trail ambiguous; operator notification feed polluted with bot activity
What went well
- The fallback path worked correctly — no agent was blocked; tasks completed with PAT
- The warning message ("mint failed; fell back to operator PAT") surfaced the failure rather than silently degrading
- The Infisical path prefix is overridable via the INFISICAL_PATH_PREFIX env var, so the operator could have worked around the code bug at runtime
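
The runtime workaround mentioned in the last bullet can be sketched as follows. This is a minimal illustration, assuming the script reads the override from the environment and treats an unset or empty value as "use the default"; the helper name resolve_path_prefix is hypothetical and not taken from mint_github_token.py.

```python
import os


def resolve_path_prefix(default, environ=None):
    """Return the Infisical path prefix, preferring the env override.

    Assumption: an unset or empty INFISICAL_PATH_PREFIX falls back to
    the built-in default. The real script may behave differently.
    """
    env = os.environ if environ is None else environ
    return env.get("INFISICAL_PATH_PREFIX") or default


# The pre-fix default described in this RCA.
buggy_default = "/MooseQuest/github/"

# Without the override, the buggy default is used.
print(resolve_path_prefix(buggy_default, {}))

# With the override set, the documented prefix wins.
print(resolve_path_prefix(buggy_default, {"INFISICAL_PATH_PREFIX": "/MooseQuest/"}))
```

Passing the environment as a parameter keeps the lookup testable without mutating os.environ.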
What didn't go well
- No runbook existed for the org-migration scenario; the system had no documented checklist for "things to do after a GitHub org rename"
- The mint script's DEFAULT_PATH_PREFIX (/MooseQuest/github/) diverged from the docs (/MooseQuest/), so the test at test_mint_github_token.py:216 was asserting the right expected path while the code was producing a wrong one; that signal was never caught
- The provisioning runbook (github-app-provisioning.md) used lowercase secret key names (app-id, installation-id, private-key-pem) instead of the uppercase names the mint script expects (APP_ID, INSTALLATION_ID, PRIVATE_KEY_PEM); this would cause exit 4 failures on fresh provisioning from the runbook
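
The key-name mismatch in the last bullet is easy to reproduce in isolation. The sketch below assumes the fetched secrets arrive as a plain dict and that lookups are case-sensitive; missing_keys is a hypothetical helper, not a function from the mint script.

```python
# Key names the mint script expects, as listed in this RCA.
REQUIRED_KEYS = ("APP_ID", "INSTALLATION_ID", "PRIVATE_KEY_PEM")


def missing_keys(secrets):
    """Return the required keys that are absent or empty (case-sensitive)."""
    return [key for key in REQUIRED_KEYS if not secrets.get(key)]


# Secrets stored with the lowercase names from the old runbook
# (placeholder values only).
from_old_runbook = {
    "app-id": "123",
    "installation-id": "456",
    "private-key-pem": "placeholder",
}

# Every required key is reported missing, which is the condition the
# RCA associates with exit 4.
print(missing_keys(from_old_runbook))
```

A check like this could run before the token exchange so that a naming problem is reported as such rather than as an empty fetch.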
Root cause analysis
- Contributing factor 1: GitHub App installation IDs are org-scoped and do not migrate. When MooseQuest was renamed to raxx-app, the App installations on MooseQuest became orphaned. GitHub creates new installations when an App is re-installed on the new org, with new installation IDs. The vault still held the old IDs. Every call to POST /app/installations/{old_id}/access_tokens returned HTTP 404. The mint script exits with code 5 on this response and the wrapper falls back to PAT.
- Contributing factor 2: DEFAULT_PATH_PREFIX hardcoded as /MooseQuest/github/ instead of /MooseQuest/. Line 137 of mint_github_token.py set DEFAULT_PATH_PREFIX = "/MooseQuest/github/". Combined with the path construction f"{path_prefix.rstrip('/')}/{bot}", the effective secret path was /MooseQuest/github/raxx-dev-bot instead of /MooseQuest/raxx-dev-bot. If the operator stored secrets at /MooseQuest/raxx-dev-bot/ (per the docs), the Infisical fetch would return empty secrets and the script would exit 4. If the operator happened to store them under /MooseQuest/github/raxx-dev-bot/ (matching the buggy default), the Infisical fetch would succeed, but the GitHub token exchange would still fail with 404 because the installation ID was stale. Either way, no token was minted.
- Contributing factor 3: No post-migration checklist existed. The system had no documented procedure for "GitHub org migration." The architecture doc, provisioning runbook, and agent-identity doc all assumed the org was static.
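
To make factors 1 and 2 concrete, the sketch below reproduces the path construction quoted above and builds the token-exchange URL for a given installation ID. The function names, the api.github.com base URL default, and the sample installation ID are illustrative assumptions; only the f-string expression and the endpoint path come from this RCA.

```python
def secret_path(bot, path_prefix):
    """Build the Infisical secret path using the expression quoted in the RCA."""
    return f"{path_prefix.rstrip('/')}/{bot}"


def token_exchange_url(installation_id, api_base="https://api.github.com"):
    """Build the installation access-token URL (api_base is an assumption)."""
    return f"{api_base}/app/installations/{installation_id}/access_tokens"


# Pre-fix default: an extra "github" segment ends up in the path.
print(secret_path("raxx-dev-bot", "/MooseQuest/github/"))

# Post-fix default: matches the documented vault layout.
print(secret_path("raxx-dev-bot", "/MooseQuest/"))

# 12345678 is a made-up installation ID. A stale ID in this position is
# what produced the HTTP 404 described in factor 1.
print(token_exchange_url(12345678))
```

Because the installation ID is embedded in the URL itself, a stale ID fails at the endpoint no matter which secret path supplied it, which is why fixing the prefix alone does not restore minting.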
Detection
- What alerted us: agent operator reports — every dispatched agent logged "mint failed; fell back to operator PAT"
- How long between cause and detection: up to 1 full session (hours)
- How to detect faster next time: add a scheduled smoke-test CI job that runs scripts/agents/with_bot_token.sh raxx-ops-bot gh api /user and alerts when the token does not have the ghs_ prefix or when the fallback warning appears in the output. Tracked as action item 1 below.
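
The decision logic of such a smoke test might look like the sketch below. It assumes the CI job has already captured the minted token and the wrapper's log output as strings; how the job obtains them is left open here, and smoke_check is a hypothetical name. The warning text is the one quoted in this RCA.

```python
FALLBACK_WARNING = "mint failed; fell back to operator PAT"


def smoke_check(token, log_output):
    """Return a list of problems; an empty list means the mint looks healthy."""
    problems = []
    # Tokens minted for a GitHub App installation start with "ghs_".
    if not token.startswith("ghs_"):
        problems.append("token does not start with ghs_")
    # The wrapper logs this warning when it falls back to the operator PAT.
    if FALLBACK_WARNING in log_output:
        problems.append("fallback warning present in output")
    return problems


# A PAT-shaped token plus the fallback warning yields two problems.
for problem in smoke_check("ghp_example", "warning: " + FALLBACK_WARNING):
    print(problem)
```

A CI job would exit non-zero whenever the returned list is non-empty, and should avoid echoing the token itself into the job log.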
Resolution
Code fix (this commit)
scripts/agents/mint_github_token.py line 137: changed DEFAULT_PATH_PREFIX from "/MooseQuest/github/" to "/MooseQuest/". This aligns the default with the documented vault layout and with the INFISICAL_PATH_PREFIX default shown in agent-bot-tokens-setup.md.
Validation: The existing test at test_mint_github_token.py:216 asserts "/MooseQuest/raxx-dev-bot" in captured.err. That expectation was already correct for the intended path. Before the fix, the default prefix produced /MooseQuest/github/raxx-dev-bot, which does not contain the expected string; after the fix, the assertion passes for the default case.
Operator vault + installation update (pending — operator action required)
See action items 2 and 3.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Add a scheduled CI smoke-test that mints a bot token (raxx-ops-bot), checks that the output starts with ghs_, and pages/alerts if it falls back to PAT | ops | 2026-05-16 | file new |
| 2 | Re-install the three GitHub Apps on the raxx-app org and capture new installation IDs | Kristerpher (operator) | 2026-05-09 | - |
| 3 | Update INSTALLATION_ID in Infisical at /MooseQuest/raxx-dev-bot/, /MooseQuest/raxx-ops-bot/, /MooseQuest/raxx-pm-bot/ with the new IDs from action item 2 | Kristerpher (operator) | 2026-05-09 | - |
| 4 | Verify token mint after vault update: scripts/agents/with_bot_token.sh raxx-ops-bot gh api /user, then confirm "login": "raxx-ops-bot[bot]" in response | Kristerpher (operator) | 2026-05-09 | - |
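
The confirmation step in action item 4 could be automated with a small check like the one below. It assumes the command's JSON response body is available as a string; is_bot_identity is a hypothetical helper, and the sample bodies are illustrative, not captured output.

```python
import json


def is_bot_identity(response_body, bot="raxx-ops-bot"):
    """Return True if the JSON body reports the expected bot login."""
    data = json.loads(response_body)
    # Action item 4 expects the login to be the bot name with a "[bot]" suffix.
    return data.get("login") == f"{bot}[bot]"


# Illustrative bodies only; "some-operator" is a made-up login.
print(is_bot_identity('{"login": "raxx-ops-bot[bot]"}'))
print(is_bot_identity('{"login": "some-operator"}'))
```

A body without a login field, such as an error payload, also evaluates to False, so the check fails closed.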
References
- Runbook: docs/ops/runbooks/github-app-provisioning.md (updated this incident)
- Architecture doc: docs/architecture/agent-github-identity.md (updated this incident; migration checklist added)
- Setup runbook: docs/ops/runbooks/agent-bot-tokens-setup.md
- Mint script: scripts/agents/mint_github_token.py
- Wrapper: scripts/agents/with_bot_token.sh
- Tracking issue: #335