Raxx · internal docs

Handoff Agent Provisioning Runbook

System: handoff-agent-identity / SC-HANDOFF-GITHUB-01
Owner: ops
ADR: ADR-0131 (docs/architecture/adr/0131-handoff-agent-least-privilege-identity.md)
Design doc: docs/architecture/handoff-agent-least-privilege-identity.md
Last reviewed: 2026-06-21


What this runbook covers

Every agent handoff dispatch must begin with Step 0: provision a scoped, time-bound, independently revocable GitHub identity for the specific task. This runbook documents how to use provision_handoff_identity.py and what to do when it fails.

"Handoff agent" means any Claude Cloud agent, scheduled CCR agent, new-repo agent, or unattended autonomous task. It does not cover live, operator-present sessions; those use the shared bot App tokens from mint_github_token.py.


Prerequisite

The three Infisical env vars must be set in your shell:

INFISICAL_CLIENT_ID
INFISICAL_CLIENT_SECRET
INFISICAL_PROJECT_ID

These are the same credentials used by mint_github_token.py. See docs/ops/runbooks/agent-bot-tokens-setup.md for one-time setup.

Optionally, if your vault sits behind Cloudflare Access:

CF_ACCESS_CLIENT_ID
CF_ACCESS_CLIENT_SECRET
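
A quick preflight check can confirm the required variables before the provisioner runs. The sketch below is illustrative and not part of the repo: the helper names missing_infisical_vars and masked are hypothetical. It reports which of the three required variables are unset and shows how to display only the first four characters of a value.

```python
import os

# The three required names come from this runbook; everything else is illustrative.
REQUIRED_VARS = (
    "INFISICAL_CLIENT_ID",
    "INFISICAL_CLIENT_SECRET",
    "INFISICAL_PROJECT_ID",
)


def missing_infisical_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def masked(value, visible=4):
    """Show only the first few characters of a value, never the whole secret."""
    return value[:visible] + "..."
```

Calling missing_infisical_vars() with no arguments inspects the current shell environment; an empty list means the prerequisite is met.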

How to tell it's broken


How to diagnose (in order)

  1. Check that the Infisical env vars are set:

     echo "${INFISICAL_CLIENT_ID:0:4}..."

     Expected: shows the first 4 chars of your client ID. If empty, source your shell config.

  2. Run provision with verbose output:

     python3 scripts/agents/provision_handoff_identity.py \
       --task-slug test-$(date +%Y%m%d) \
       --repo TradeMasterAPI \
       --owner raxx-app

     The script emits step-by-step progress to stderr. All five steps (1/5 through 5/5) must complete. On failure, the message pinpoints whether the problem is Infisical, GitHub, or vault.

  3. If GitHub returns 403/422 on PAT creation, confirm raxx-dev-bot App has the Fine-grained personal access tokens organization permission in GitHub Settings → Developer Settings → GitHub Apps → raxx-dev-bot → Permissions.

  4. If Infisical returns 404 on secret write, the folder creation step (step 3/5) failed silently. Run manually:

     # List folders at the handoff path to confirm existence.
     curl -s -H "Authorization: Bearer <infisical-token>" \
       "https://app.infisical.com/api/v1/folders?workspaceId=<id>&environment=prod&path=/MooseQuest/handoffs/"


Known failure modes

Failure mode A: GitHub 422 on fine-grained PAT creation

Symptom: step 2/5 fails with HTTP 422 from api.github.com/user/personal-access-tokens.

Cause: The GitHub App (raxx-dev-bot) does not have the Fine-grained personal access tokens organization-level permission, which is required for a GitHub App to create PATs on behalf of a bot account.

Fix: Operator action required.

  1. Go to GitHub.com → Organization Settings → Developer Settings → GitHub Apps → raxx-dev-bot.
  2. Under Permissions, enable Organization permissions → Personal access tokens → Read and write.
  3. Re-run provision.

Verification: Re-run provision_handoff_identity.py; step 2/5 succeeds.

Note: This is the most common Phase 1 blocker. Phase 2 (GitHub App install scoping) avoids this limitation entirely.


Failure mode B: Infisical 404 on handoff folder

Symptom: step 4/5 fails with vault write returning HTTP 404.

Cause: The /MooseQuest/handoffs/ folder doesn't exist in Infisical (per feedback_vault_folder_must_exist). The script's step 3/5 creates it, but if step 3 itself fails (e.g., Infisical auth timeout), step 4 will 404.

Fix:

  1. Re-run the provision script; step 3/5 will retry folder creation.
  2. If Infisical is degraded, wait for vault health to recover before re-provisioning (check https://status.infisical.com).

Verification: Run --verify after re-running provision.


Failure mode C: PAT already exists with same name

Symptom: GitHub returns 422 "token name already taken".

Cause: A previous provision attempt created a PAT that was not revoked. The PAT name includes a date suffix to avoid collisions, but if two provisions run on the same day for the same slug, the second will collide.

Fix:

# Revoke the existing one first.
python3 scripts/agents/provision_handoff_identity.py \
  --revoke --task-slug <slug>

# Then re-provision.
python3 scripts/agents/provision_handoff_identity.py \
  --task-slug <slug> --repo TradeMasterAPI --owner raxx-app

Verification: --verify --task-slug <slug> returns a clean manifest.
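
To see why a same-day re-run collides, consider how the name appears to be built. The sketch below assumes the format visible in the sample output (gh-pat-handoff-<task-slug>-<YYYYMMDD>); the helper pat_name is hypothetical, and the real script may construct the name differently.

```python
from datetime import date


def pat_name(task_slug, day):
    """Build a PAT name in the shape seen in the sample output (assumed format)."""
    return f"gh-pat-handoff-{task_slug}-{day:%Y%m%d}"


slug = "iap-notif-handler-20260621"
first = pat_name(slug, date(2026, 6, 21))
second = pat_name(slug, date(2026, 6, 21))
next_day = pat_name(slug, date(2026, 6, 22))

# Same slug on the same day gives the same name, so the second create is rejected.
same_day_collision = first == second
# A different day gives a different name, so no collision occurs.
next_day_collision = first == next_day
```

Under this assumption, the only ways out of a same-day collision are revoking the earlier PAT or choosing a different slug.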


Failure mode D: Vault PROVISION_MANIFEST missing at dispatch time

Symptom: Orchestrator blocks dispatch — cannot read PROVISION_MANIFEST for task slug.

Cause: Either provisioning was skipped (Step 0 violated) or provision failed and was not retried.

Fix: Run the full provision sequence:

python3 scripts/agents/provision_handoff_identity.py \
  --task-slug <slug> --repo TradeMasterAPI --owner raxx-app

Then verify:

python3 scripts/agents/provision_handoff_identity.py --verify --task-slug <slug>

Provision a handoff identity

python3 scripts/agents/provision_handoff_identity.py \
  --task-slug <issue-slug>-<YYYYMMDD> \
  --repo TradeMasterAPI \
  --owner raxx-app \
  --ttl-hours 24

Successful output (all to stderr — nothing on stdout):

Provisioning handoff GitHub identity:
  task-slug  : iap-notif-handler-20260621
  PAT name   : gh-pat-handoff-iap-notif-handler-20260621-20260621
  repo       : raxx-app/TradeMasterAPI
  bot        : raxx-dev-bot
  TTL        : 24h (max 24h)
  vault path : /MooseQuest/handoffs/iap-notif-handler-20260621/
  permissions: {'contents': 'write', 'pull_requests': 'write', 'issues': 'write'}

step 1/5: minting bot App token...
  App token minted.
step 2/5: creating fine-grained PAT on GitHub...
  PAT created: id=12345678 name='...' expires=2026-06-22T00:00:00Z
step 3/5: connecting to vault and ensuring folder exists...
  Vault folder confirmed: /MooseQuest/handoffs/iap-notif-handler-20260621/
step 4/5: writing GH_HANDOFF_TOKEN to vault (value not logged)...
  GH_HANDOFF_TOKEN written to /MooseQuest/handoffs/iap-notif-handler-20260621/
step 5/5: writing PROVISION_MANIFEST to vault...
  PROVISION_MANIFEST written to /MooseQuest/handoffs/iap-notif-handler-20260621/

Provisioning complete.
  PAT id     : 12345678
  PAT name   : gh-pat-handoff-iap-notif-handler-20260621-20260621
  expires    : 2026-06-22T00:00:00Z
  vault path : /MooseQuest/handoffs/iap-notif-handler-20260621/GH_HANDOFF_TOKEN
  manifest   : /MooseQuest/handoffs/iap-notif-handler-20260621/PROVISION_MANIFEST
  revoke via : python3 scripts/agents/provision_handoff_identity.py --revoke --task-slug iap-notif-handler-20260621

Verify a provisioned identity

python3 scripts/agents/provision_handoff_identity.py \
  --verify --task-slug <slug>

This reads the PROVISION_MANIFEST from vault, asserts permissions are within the allowed set, then queries GitHub for live PAT metadata. It never prints the token value.
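
The permission assertion can be pictured as a comparison against the three allowed scopes. This is a minimal sketch, assuming the manifest stores permissions as a name-to-level mapping like the one printed during provisioning; the function name and the ranking of access levels are illustrative, not the script's actual code.

```python
# Allowed set taken from this runbook (contents, pull_requests, issues: write).
ALLOWED_PERMISSIONS = {
    "contents": "write",
    "pull_requests": "write",
    "issues": "write",
}
# Assumed ordering of access levels; anything unrecognized is treated as too broad.
ACCESS_RANK = {"read": 1, "write": 2}


def permissions_out_of_scope(granted):
    """Return the granted permissions that are not allowed or exceed the allowed level."""
    violations = {}
    for name, level in granted.items():
        allowed_level = ALLOWED_PERMISSIONS.get(name)
        too_broad = (
            allowed_level is None
            or ACCESS_RANK.get(level, 99) > ACCESS_RANK[allowed_level]
        )
        if too_broad:
            violations[name] = level
    return violations
```

An empty result means the manifest is within scope; any entry in the result corresponds to the scope violation described under Escalation.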


Revoke a handoff identity

On task completion, or immediately if the handoff is compromised:

python3 scripts/agents/provision_handoff_identity.py \
  --revoke --task-slug <slug>

This:

  1. Reads the PAT id from the PROVISION_MANIFEST in vault.
  2. Deletes the PAT from GitHub (immediate effect).
  3. Removes GH_HANDOFF_TOKEN from vault.
  4. Removes PROVISION_MANIFEST from vault.

If the PAT was already expired or manually deleted from GitHub, the script logs a warning and continues to clean up vault. It exits with code 8 if nothing at all was found.


Emergency revocation (PAT compromised)

If a handoff token may be compromised, revoke immediately:

# Revoke via script (preferred — cleans vault too):
python3 scripts/agents/provision_handoff_identity.py --revoke --task-slug <slug>

# Fallback if script is unavailable — revoke directly on GitHub:
# GitHub → Settings → Developer Settings → Personal access tokens →
# Fine-grained tokens → find the token by name → Delete.

# Then clean vault manually:
# Infisical dashboard → MooseQuest project → handoffs/<slug>/ → delete GH_HANDOFF_TOKEN and PROVISION_MANIFEST

Blast radius of a compromised handoff PAT (ADR-0131 §6.1):

  - Scoped to ONE repo (TradeMasterAPI).
  - Permissions: contents:write, pull_requests:write, issues:write only.
  - No admin, no secrets, no workflows, no branch-protection bypass.
  - Cannot merge to main without a PR review (branch protection enforced server-side).
  - Expires within 24h regardless.


Handoff provisioning gate (checklist)

Before dispatching a handoff agent:

HANDOFF PROVISIONING GATE (GitHub surface)

[ ] task-slug defined and matches GitHub issue slug
[ ] provision_handoff_identity.py --task-slug <slug> completed successfully
[ ] PROVISION_MANIFEST confirmed in vault (run --verify to check)
[ ] PAT expiry < 24h from now
[ ] Revocation plan documented (--revoke command ready)
[ ] After task completion: --revoke will be run
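
The expiry item in the checklist can be checked mechanically. The sketch below is an illustration, assuming the expiry is available as an ISO 8601 UTC string like the `expires` value in the sample output; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_TTL = timedelta(hours=24)  # 24h ceiling stated in this runbook


def expiry_within_limit(expires_at, now=None):
    """True if the timestamp is still in the future and less than 24h away."""
    now = now or datetime.now(timezone.utc)
    # fromisoformat() on older Python versions does not accept a trailing "Z".
    expires = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    return now < expires < now + MAX_TTL
```

Passing an explicit `now` makes the check reproducible; omitting it compares against the current UTC time.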

Phase 2 (target state): GitHub App install scoping

Phase 1 (this runbook) uses fine-grained PATs under raxx-dev-bot. Phase 2 (SC-HANDOFF-GITHUB-01 target) moves to per-handoff GitHub App install scoping — the App is installed on only the target repo, with only the three required permissions, and an installation token (1-hour TTL) is generated at dispatch time.

Phase 2 upgrade path:

  1. Create a dedicated GitHub App for handoff identity.
  2. Install it on TradeMasterAPI with contents:write, pull_requests:write, issues:write.
  3. Generate installation tokens at dispatch time instead of fine-grained PATs.
  4. Update provision_handoff_identity.py to use the App install flow.
  5. Retire fine-grained PAT path.

The security invariants (HANDOFF-INVARIANT-1) are identical in both phases.


Escalation (GitHub surface)

Wake the operator when:

  - GitHub API returns 5xx for > 5 minutes during provision/revoke.
  - A PROVISION_MANIFEST shows permissions outside the allowed set (scope violation).
  - A compromised handoff PAT may have had write access to the repo for > 1 hour before detection (potential git history audit required).
  - The Infisical vault is unreachable and a task is awaiting dispatch.

Contact: ops@raxx.app


Full multi-surface provisioning (SC-HANDOFF-GATE-01)

The provision_handoff_full.py script provisions ALL surfaces listed in a handoff-spec.json file. It wraps and extends provision_handoff_identity.py.

Supported surfaces

Surface     What is provisioned                      Vault secret(s)
infisical   Machine identity scoped to task path     INFISICAL_CLIENT_ID, INFISICAL_CLIENT_SECRET
github      Fine-grained PAT (24h max)               GH_HANDOFF_TOKEN
stripe      Restricted Key (test mode default)       STRIPE_RESTRICTED_KEY
cloudflare  Scoped API token (DNS Read, expiry set)  CF_HANDOFF_TOKEN
aws         STS AssumeRole session (max 1h)          AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
heroku      Operator-action notice (not automated)   HEROKU_HANDOFF_TOKEN (manual)
apple_iap   Sandbox StoreKit key (default)           APPLE_IAP_KEY_ID, APPLE_IAP_ISSUER_ID, APPLE_IAP_PRIVATE_KEY

Step 0: write handoff-spec.json

{
    "task_slug":          "iap-notif-handler-20260621",
    "surfaces":           ["infisical", "github", "apple_iap"],
    "expected_runtime":   "PT4H",
    "ttl_expires_at":     "2026-06-22T04:00:00Z",
    "operator_approval":  "Kristerpher 2026-06-21",
    "linked_card":        3629,
    "repo":               "TradeMasterAPI",
    "owner":              "raxx-app",
    "ttl_hours":          4
}

Required fields: task_slug, surfaces, operator_approval, linked_card.
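
Before handing the spec to the provisioner, the required fields can be checked locally. This is a minimal sketch; the helper name is hypothetical, and the provisioner may perform stricter validation of its own.

```python
import json

# Required field names as listed in this runbook.
REQUIRED_FIELDS = ("task_slug", "surfaces", "operator_approval", "linked_card")


def missing_spec_fields(spec_text):
    """Parse a handoff-spec.json document and list any required fields it lacks."""
    spec = json.loads(spec_text)
    return [field for field in REQUIRED_FIELDS if field not in spec]
```

For a spec on disk, pass the file contents, for example `missing_spec_fields(open("handoff-spec.json").read())`.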

Step 1: provision all surfaces

export INFISICAL_CLIENT_ID="..."
export INFISICAL_CLIENT_SECRET="..."
export INFISICAL_PROJECT_ID="..."

# For Stripe test key:
export STRIPE_API_KEY="sk_test_..."

# For Cloudflare:
export CF_API_TOKEN="<admin-cf-token>"
export CF_ACCOUNT_ID="<account-id>"

# For AWS:
export AWS_ROLE_ARN="arn:aws:iam::<account>:role/<role>"
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."

# For Apple IAP (sandbox default):
export APPLE_IAP_KEY_ID_SANDBOX="..."
export APPLE_IAP_ISSUER_ID_SANDBOX="..."
export APPLE_IAP_PRIVATE_KEY_SANDBOX="$(cat /path/to/AuthKey.p8)"

python3 scripts/agents/provision_handoff_full.py --spec handoff-spec.json

With live financial credentials (operator must pass flags explicitly):

python3 scripts/agents/provision_handoff_full.py \
    --spec handoff-spec.json \
    --approve-live-stripe \
    --approve-prod-iam

Exit codes: 0 = all surfaces ok, 11 = partial failure (re-run after fixing).

Step 2: run the gate check

python3 scripts/agents/check_handoff_gate.py \
    --task-slug iap-notif-handler-20260621 \
    --require-surfaces infisical github apple_iap

Gate must exit 0 before dispatch is allowed.

Gate exit codes:

Code  Meaning                  Fix
0     PASS                     Dispatch allowed
20    Manifest absent          Re-run provisioner
21    Manifest invalid JSON    Re-run provisioner
22    Slug mismatch            Verify --task-slug matches spec
23    Failed surfaces          Fix failing surface, re-provision
24    No ok surfaces           Check provisioner output, re-run
25    TTL expired              Re-provision with new TTL
26    Over-broad permission    Audit manifest, re-provision without admin/secrets/workflows
27    Required surface absent  Add surface to spec, re-provision
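
An orchestrator wrapper could translate the gate's exit code into an actionable message. The sketch below copies the codes from the table above; the helper names are hypothetical and are not part of check_handoff_gate.py.

```python
# Exit codes and fixes copied from the gate table; helper names are illustrative.
GATE_EXIT_CODES = {
    0: ("PASS", "Dispatch allowed"),
    20: ("Manifest absent", "Re-run provisioner"),
    21: ("Manifest invalid JSON", "Re-run provisioner"),
    22: ("Slug mismatch", "Verify --task-slug matches spec"),
    23: ("Failed surfaces", "Fix failing surface, re-provision"),
    24: ("No ok surfaces", "Check provisioner output, re-run"),
    25: ("TTL expired", "Re-provision with new TTL"),
    26: ("Over-broad permission",
         "Audit manifest, re-provision without admin/secrets/workflows"),
    27: ("Required surface absent", "Add surface to spec, re-provision"),
}


def explain_gate_exit(code):
    """Return a one-line explanation for a gate exit code."""
    meaning, fix = GATE_EXIT_CODES.get(
        code, ("Unknown exit code", "Inspect gate output")
    )
    return f"{code}: {meaning} (fix: {fix})"


def dispatch_allowed(code):
    """Dispatch is allowed only when the gate exits 0."""
    return code == 0
```

The exit code itself could come from `subprocess.run([...]).returncode` when the gate is invoked from Python.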

Step 3: inject env block

The provisioner emits a JSON env block to stdout. The orchestrator injects this into the handoff agent's spawn context:

{
  "env": {
    "HANDOFF_TASK_SLUG": "iap-notif-handler-20260621",
    "HANDOFF_VAULT_PATH": "/MooseQuest/handoffs/iap-notif-handler-20260621/",
    "HANDOFF_EXPIRES_AT": "2026-06-22T04:00:00Z",
    "HANDOFF_LINKED_CARD": "3629",
    "INFISICAL_SECRET_PATH": "/MooseQuest/handoffs/iap-notif-handler-20260621/",
    "GH_TOKEN": "(read from vault: GH_HANDOFF_TOKEN)"
  },
  "manifest_location": "/MooseQuest/handoffs/iap-notif-handler-20260621/PROVISION_MANIFEST"
}

The handoff agent reads its actual secrets from vault at /MooseQuest/handoffs/<task-slug>/. Secret values are NOT in the env block (values are never printed — see H6).
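
One way an orchestrator might merge the env block into a spawn environment is sketched below. It assumes placeholder values of the form "(read from vault: ...)" should be skipped so the agent resolves them from vault itself; that handling, and the function name, are assumptions rather than documented orchestrator behavior.

```python
import json
import os

# Assumed marker for values the agent must fetch from vault instead of the env block.
VAULT_PLACEHOLDER_PREFIX = "(read from vault:"


def spawn_env_from_block(block_text, base_env=None):
    """Merge the provisioner's env block into a copy of the base environment."""
    block = json.loads(block_text)
    env = dict(os.environ if base_env is None else base_env)
    for key, value in block["env"].items():
        if value.startswith(VAULT_PLACEHOLDER_PREFIX):
            continue  # placeholder only; no secret value is carried in the block
        env[key] = value
    return env
```

The returned dictionary could then be passed as the `env` argument when the agent process is spawned.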

Step 4: Heroku surface (operator action)

If heroku is in the spec surfaces, the provisioner prints:

[OPERATOR ACTION REQUIRED — Heroku surface]
Steps:
  1. heroku login
  2. heroku authorizations:create \
         --description='handoff-<task-slug>' \
         --scope=read,deploy
  3. infisical secrets set HEROKU_HANDOFF_TOKEN='<token>' \
           --path=/MooseQuest/handoffs/<task-slug>/ \
           --env=prod
  4. On task completion: heroku authorizations:revoke <token-id>

The gate does not require heroku in surfaces_ok unless --require-surfaces heroku is explicitly passed. Status will be operator_action_required in the manifest.

Step 5: revocation

On task completion:

python3 scripts/agents/provision_handoff_full.py \
    --task-slug iap-notif-handler-20260621 --revoke

Then perform per-surface external revocation (the script reminds you):

Surface     External revocation command
Stripe      Stripe dashboard → Developers → API keys → delete restricted key
Cloudflare  CF dashboard → My Profile → API Tokens → delete token
AWS         STS session auto-expires (max 1h); delete task IAM role if applicable
Heroku      heroku authorizations:revoke <authorization-id>
Apple IAP   Apple Developer portal → Keys → revoke (only if prod key was used)
Infisical   Infisical admin → Identities → delete infisical-handoff-<slug>-<date>

Security invariants enforced

Invariant                                     Mechanism
H7: no live financial creds without approval  --approve-live-stripe / --approve-prod-iam required
H3: time-bound (24h max)                      MAX_TTL_HOURS = 24; gate checks ttl_expires_at
H4: step 0 not optional                       gate exits non-zero if manifest absent
H1: no admin/blanket scope                    gate exits 26 if forbidden permissions in manifest
H6: no secret values in logs                  credentials overwritten in local scope after vault write