Raxx · internal docs


Velvet Operator Runbook

Last verified against: Velvet v2 (2026-05-04 UTC)
Parent epic: #907
Design doc: docs/architecture/velvet/v2-rotation-flows.md
Handler-author guide: docs/architecture/velvet-handler-author-guide.md


Reading time: ~15 min


1. When to use Velvet vs. manual rotation

Situation | Use Velvet | Use manual procedure
Scheduled rotation of a credential with registered subscribers | Yes | No
Emergency revocation after a suspected leak | Yes: revocation flow | No
Credential with active: false in the subscription manifest | No: fix the manifest first | Proceed manually per vendor SOP
Velvet itself is down or unreachable | No: use vendor SOP directly | Yes
Velvet's own bootstrap credentials (INFISICAL_CLIENT_SECRET, HK_VELVET_BOOTSTRAP) | No: circular dependency | Yes: section 8 below
Vendor does not support programmatic token creation (e.g. CF User API tokens) | Operator-assisted Velvet (OPERATOR_MANUAL flow) | Parallel manual path
Feature flag velvet_v2_rotation is off | No: Velvet returns 503 | Yes: use vendor SOP
Credential has no subscribers registered in the manifest | No | Yes: use per-credential SOP in docs/ops/runbooks/rotation/
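The table above can be read as a small decision function. The sketch below is illustrative only: the Situation fields, the return strings, and the order in which the checks are applied are assumptions made for this example, not part of Velvet.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Facts an operator knows before rotating (field names are hypothetical)."""
    velvet_reachable: bool = True
    flag_enabled: bool = True            # velvet_v2_rotation
    bootstrap_credential: bool = False   # e.g. HK_VELVET_BOOTSTRAP
    has_subscribers: bool = True
    manifest_active: bool = True         # active: true in the manifest
    vendor_can_mint: bool = True         # programmatic token creation


def choose_path(s: Situation) -> str:
    # Assumed precedence: platform-level blockers first, then credential-level ones.
    if not s.velvet_reachable or not s.flag_enabled:
        return "manual: vendor SOP"
    if s.bootstrap_credential:
        return "manual: section 8"
    if not s.has_subscribers:
        return "manual: per-credential SOP"
    if not s.manifest_active:
        return "manual: fix the manifest first"
    if not s.vendor_can_mint:
        return "velvet: OPERATOR_MANUAL"
    return "velvet"
```

For example, choose_path(Situation(vendor_can_mint=False)) selects the operator-assisted flow, while the default Situation() selects a normal Velvet rotation.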

2. Pre-flight checklist

Complete every item before triggering a rotation. A stalled pre-flight is cheaper than a stalled distribute.

curl -sf https://raxx-velvet-prod.herokuapp.com/healthz
curl -sf https://raxx-velvet-staging.herokuapp.com/healthz

If either returns non-200 or times out, stop. Do not rotate against a degraded Velvet.

curl -sf https://raxx-velvet-prod.herokuapp.com/flags | python3 -m json.tool | grep velvet_v2_rotation

Expected: "velvet_v2_rotation": true. If false, rotation endpoints return 503 and you must use the manual vendor SOP.

curl -sf https://raxx-velvet-prod.herokuapp.com/tokens/HK_PLATFORM_FULL/subscribers

You should see the expected consumer list. If the list is empty or the endpoint returns 404, the credential is not registered.

curl -sf "https://raxx-velvet-prod.herokuapp.com/tokens/HK_PLATFORM_FULL/rotations?limit=5"

If the most recent job is in distribute_partial or revoke_failed, resolve it before starting a new job. Two overlapping rotation jobs for the same credential are not supported.
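The four checks above can be folded into one go/no-go evaluation once the responses are in hand. This is a minimal sketch that takes already-fetched values rather than calling the endpoints. It assumes the subscribers endpoint yields a list, that the rotations list is ordered newest first, and that each job carries a "status" field as in the job response shown in section 4; the function and parameter names are hypothetical.

```python
BLOCKING_STATUSES = {"distribute_partial", "revoke_failed"}


def preflight_problems(health_codes, flags, subscribers, recent_jobs):
    """Return a list of reasons not to rotate; an empty list means all checks passed."""
    problems = []
    # Both prod and staging must answer 200 on /healthz.
    for app, code in health_codes.items():
        if code != 200:
            problems.append(f"{app}: /healthz returned {code}")
    # The flag must be exactly true, otherwise rotation endpoints return 503.
    if flags.get("velvet_v2_rotation") is not True:
        problems.append("velvet_v2_rotation is off; use the manual vendor SOP")
    # An empty subscriber list (or a 404, passed in as None) means not registered.
    if not subscribers:
        problems.append("no subscribers; credential is not registered")
    # Assumption: recent_jobs is ordered newest first.
    if recent_jobs and recent_jobs[0].get("status") in BLOCKING_STATUSES:
        problems.append(f"unresolved job in {recent_jobs[0]['status']}")
    return problems
```

An operator script would refuse to call the rotate endpoint whenever the returned list is non-empty.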


3. Triggering a rotation

3a. Console UI (preferred)

  1. Navigate to the console: https://raxx-console-prod.herokuapp.com/security/secrets
  2. Locate the credential row in the Secrets table.
  3. Click Rotate — this opens the Stage Wizard modal.
  4. Follow the three-panel flow: Stage 1 (Verify) → Stage 2 (Mint + Distribute) → Stage 3 (Validate + Revoke).
  5. Each stage requires explicit operator action before advancing. You can abort at any stage.

3b. API (for scripted or emergency use)

All endpoints require a rotate-scoped service token in the Authorization: Bearer <token> header.

Step 1 — Create the job:

POST https://raxx-velvet-prod.herokuapp.com/tokens/HK_PLATFORM_FULL/rotate
Content-Type: application/json

{
  "flow_type": "operational",
  "idempotency_key": "<uuid-v4>",
  "force_revoke": false
}

Response (202):

{ "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6", "status": "init" }

Step 2 — Run Stage 1 (Verify):

POST https://raxx-velvet-prod.herokuapp.com/tokens/HK_PLATFORM_FULL/rotations/<job_id>/stage
Content-Type: application/json

{ "action": "verify" }

Step 3 — Proceed to Mint + Distribute:

POST .../rotations/<job_id>/stage
{ "action": "proceed_mint" }

Step 4 — Proceed to Revoke (after validating all consumers):

POST .../rotations/<job_id>/stage
{ "action": "proceed_revoke" }

Polling for status:

GET https://raxx-velvet-prod.herokuapp.com/tokens/HK_PLATFORM_FULL/rotations/<job_id>

SSE stream (real-time status):

GET https://raxx-velvet-prod.herokuapp.com/tokens/HK_PLATFORM_FULL/rotations/<job_id>/stream
Accept: text/event-stream
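For scripted use, the requests above can be assembled programmatically before being handed to whatever HTTP client the operator prefers. The sketch below only builds the request descriptions and sends nothing. The helper names and the dictionary layout are assumptions for illustration; the URLs, body fields, and action names are taken from the steps above.

```python
import json
import uuid

BASE_URL = "https://raxx-velvet-prod.herokuapp.com"
STAGE_ACTIONS = ("verify", "proceed_mint", "proceed_revoke")


def _headers(token):
    # Every endpoint expects a rotate-scoped bearer token.
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}


def create_job_request(credential, token, idempotency_key=None):
    """Describe the Step 1 call that creates an operational rotation job."""
    body = {
        "flow_type": "operational",
        # A fresh UUID v4 is generated unless the caller supplies a key for a retry.
        "idempotency_key": idempotency_key or str(uuid.uuid4()),
        "force_revoke": False,
    }
    return {"method": "POST", "url": f"{BASE_URL}/tokens/{credential}/rotate",
            "headers": _headers(token), "body": json.dumps(body)}


def stage_request(credential, job_id, action, token):
    """Describe a Step 2-4 call; rejects actions not listed in this runbook."""
    if action not in STAGE_ACTIONS:
        raise ValueError(f"unknown stage action: {action}")
    return {"method": "POST",
            "url": f"{BASE_URL}/tokens/{credential}/rotations/{job_id}/stage",
            "headers": _headers(token), "body": json.dumps({"action": action})}
```

Reusing the same idempotency_key when retrying the create call is the usual reason to pass it explicitly; how Velvet treats a repeated key is not described here, so verify that behaviour before relying on it.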

4. Monitoring in-flight rotations

Console UI

The Stage Wizard shows live status via SSE. Each consumer row updates in real time. A spinner or amber marker means in progress; a green check means succeeded; a red X means failed.

API polling

Poll GET /tokens/{name}/rotations/{job_id} every 5 seconds. The response includes:

{
  "job_id": "...",
  "status": "distributing",
  "consumers": [
    { "consumer_id": "heroku-config-console-prod", "distribute_status": "succeeded" },
    { "consumer_id": "heroku-config-api-prod",     "distribute_status": "in_progress" },
    { "consumer_id": "github-actions-heroku-api-key", "distribute_status": "pending" }
  ],
  "created_at": "2026-05-04T06:00:00Z",
  "updated_at": "2026-05-04T06:00:42Z"
}
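When polling from a script, it helps to reduce the response to a per-status count and a list of consumers that still need attention. A minimal sketch, assuming the response has exactly the shape shown above; the function name is hypothetical.

```python
from collections import Counter


def summarize_distribution(job):
    """Return (counts per distribute_status, consumer_ids not yet succeeded)."""
    consumers = job.get("consumers", [])
    counts = Counter(c["distribute_status"] for c in consumers)
    outstanding = [c["consumer_id"] for c in consumers
                   if c["distribute_status"] != "succeeded"]
    return dict(counts), outstanding


sample = {
    "job_id": "...",
    "status": "distributing",
    "consumers": [
        {"consumer_id": "heroku-config-console-prod", "distribute_status": "succeeded"},
        {"consumer_id": "heroku-config-api-prod", "distribute_status": "in_progress"},
        {"consumer_id": "github-actions-heroku-api-key", "distribute_status": "pending"},
    ],
}
counts, outstanding = summarize_distribution(sample)
# counts has one entry each for succeeded, in_progress and pending;
# outstanding lists the two consumers that have not succeeded yet.
```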

Heroku logs

heroku logs --tail --app raxx-velvet-prod | grep job_id=<your-job-id>

5. Job states reference

Every rotation_jobs row progresses through this state machine. The status column tells you exactly where a job is.

Status | Meaning | Can advance | Operator action required
init | Job created; nothing touched | Yes | Click "Verify"
verifying | Velvet probing the vendor with the current token | Automatic | Wait
verify_failed | Auth probe failed; current token may be invalid | Retry or abort | See section 6
verified | Probe confirmed; operator gate before mint | Yes | Click "Proceed to mint"
minting | Velvet calling vendor to mint new token | Automatic | Wait
mint_failed | Vendor mint API returned error | Abort only | Old token still valid
minted | New token in hand; not yet distributed | Yes (automatic fan-out) | Wait
distributing | Fan-out to registered consumers in progress | Automatic | Wait
distribute_partial | Some consumers failed; others succeeded | Retry or abort | Retry failed rows or section 6
distribute_failed | All consumers failed | Abort | New token minted but not distributed; see abort table
distributed | All consumers received new token | Automatic validation | Wait
validating | Healthchecks running on all consumers | Automatic | Manual-confirm rows if needed
validate_partial | Some healthchecks failed | Retry or abort | Retry failed rows
validate_failed | All healthchecks failed | Abort | Investigate consumer reachability
validated | All consumers healthy with new token | Yes | Type-to-confirm + FreeScout ID, then click Revoke
revoking | Velvet calling vendor to revoke old token | Automatic | Wait
revoke_failed | Vendor revoke API returned error | Retry or mark manual | See section 6
done | Rotation complete; old token revoked | Terminal | None
aborted | Operator or system aborted | Terminal | Check residual state (section 6)
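For tooling that reacts to the status column, the operational statuses above can be grouped by who has to act next. The grouping below is one reading of the table, not something Velvet exposes; the set names and the returned strings are assumptions.

```python
# Velvet advances these on its own; the operator waits.
AUTOMATIC = {"verifying", "minting", "minted", "distributing",
             "distributed", "validating", "revoking"}
# The job is healthy but waits for an explicit operator click.
OPERATOR_GATE = {"init", "verified", "validated"}
# Something failed; retry, abort, or follow section 6.
NEEDS_RECOVERY = {"verify_failed", "mint_failed", "distribute_partial",
                  "distribute_failed", "validate_partial", "validate_failed",
                  "revoke_failed"}
TERMINAL = {"done", "aborted"}


def who_acts_next(status):
    """Classify an operational rotation status from the table above."""
    if status in AUTOMATIC:
        return "velvet"
    if status in OPERATOR_GATE:
        return "operator: advance"
    if status in NEEDS_RECOVERY:
        return "operator: recover"
    if status in TERMINAL:
        return "nobody"
    raise ValueError(f"unknown status: {status}")
```

Raising on an unknown status is a deliberate choice in this sketch, so that a newly introduced state is noticed instead of being silently treated as automatic.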

Revocation flow statuses:

Status | Meaning
rev_init | Revocation job created
rev_revoking | Vendor revoke call in flight
rev_revoke_failed | Vendor rejected the revoke call
rev_revoked | Vendor confirmed revocation; validating consumers
rev_validating | Healthchecks running (expecting 401 from each consumer)
rev_leaked | One or more consumers returned non-401 after revocation (SEV1)
rev_done | All consumers confirmed locked out
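The rule that separates rev_done from rev_leaked is simple enough to state as code: after revocation every consumer healthcheck must return 401. A minimal sketch, assuming the healthcheck results are available as a mapping from consumer_id to HTTP status code; the function name and input shape are hypothetical.

```python
def revocation_outcome(healthcheck_codes):
    """Return (status, leaking_consumer_ids) for a finished rev_validating pass."""
    # Any response other than 401 means the consumer may still accept the old token.
    leaking = sorted(cid for cid, code in healthcheck_codes.items() if code != 401)
    return ("rev_leaked" if leaking else "rev_done"), leaking
```

For instance, a result of {"heroku-config-api-prod": 200, "heroku-config-console-prod": 401} would be classified as rev_leaked with heroku-config-api-prod as the leaking consumer.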

6. Stuck job diagnosis and recovery

Definition of stuck: A job has been in the same status for more than 5 minutes without a state change, OR a job is in a state that cannot advance without operator action (distribute_partial, revoke_failed, aborted).
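The definition can be applied mechanically to the status and updated_at fields of a job response. The sketch below is an assumption-laden reading of it: done and rev_done are treated as finished and never stuck, which the definition does not say explicitly, and timestamps are assumed to use the UTC "Z" form shown in section 4.

```python
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(minutes=5)
NEEDS_OPERATOR = {"distribute_partial", "revoke_failed", "aborted"}
FINISHED = {"done", "rev_done"}  # assumption: completed jobs are never "stuck"


def parse_ts(value):
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalise it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stuck(status, updated_at, now):
    """Apply the stuck definition to one job; both timestamps are ISO 8601 strings."""
    if status in NEEDS_OPERATOR:
        return True
    if status in FINISHED:
        return False
    return parse_ts(now) - parse_ts(updated_at) > STUCK_AFTER
```

A job that has been in distributing since 06:00:42Z would be reported as stuck when checked at 06:06:00Z, but not when checked at 06:03:00Z.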

6a. Job stuck in verifying

The auth probe is timing out or being rate-limited.

  1. Check Velvet logs: heroku logs --app raxx-velvet-prod | grep job_id=<id>
  2. Look for ConnectionError, Timeout, or HTTP status code.
  3. If the vendor API is rate-limiting: wait 2 minutes and click "Retry" in the console.
  4. If the vendor API is returning 401: the current token is already invalid. Stop and use the vendor's manual revocation + re-issue process. File a FreeScout incident ticket.

6b. Job stuck in minting

  1. Check Velvet logs for mint failed entries.
  2. If the vendor returned 401 on the mint call, the old token drifted between Verify and Mint. This is rare but possible if two rotation jobs ran simultaneously. Abort this job; the old token is still valid.
  3. If the vendor returned 5xx: retry once. If it fails again, wait 10 minutes (vendor-side transient issue) and retry.

6c. Job in distribute_partial

Some consumers received the new token; others did not. The old token is still valid.

Recovery options:

Option A (preferred): Retry failed rows in the console. Click "Retry failed" for each red row. Velvet will re-attempt the PATCH/INFISICAL_WRITE for those consumers only.

Option B (if retries keep failing): Identify the failing consumer(s) by their consumer_id in the job status. Manually push the new token to that consumer using the vendor's own interface. Once done, click "Manual confirm" in the console to mark that row as succeeded. After all rows are green, advance to Stage 3.

Option C (if you need to abort): Click Abort. The new token is now minted but distributed only to some consumers. You must manually delete the new token from vault and re-sync the affected consumers to the old token. The console shows the residual consumer list. File a FreeScout ticket and follow per-vendor SOP in docs/ops/runbooks/rotation/.

6d. Job in validate_partial or validate_failed

Healthchecks failed on one or more consumers after the new token was distributed.

  1. Check which consumers are showing validate_status: failed in the job status response.
  2. Confirm the consumer application has restarted and loaded the new config var. For Heroku apps: heroku ps --app <app-name> — if dyno is crashed, that's your answer.
  3. Wait 60 seconds for dyno restart to complete, then click "Retry validation" in the console.
  4. If a consumer has healthcheck_endpoint: null in the manifest, a "Manual confirm" button appears. Verify the consumer manually, then click confirm to mark it as validated.
  5. If validation keeps failing: check whether the distribute step actually wrote the new token. Use the per-vendor SOP to verify the config var value was updated.

6e. Job in revoke_failed

The new token is distributed and validated. The old token has not been revoked.

  1. Note the old_auth_id or equivalent from the job metadata (visible in the console audit summary and in Velvet logs).
  2. Retry the revoke in the console ("Retry revoke" button).
  3. If the vendor returns 404 on the revoke call, the old token was already deleted outside Velvet. Click "Mark manually revoked" and enter the FreeScout ticket ID. Velvet will advance to done.
  4. If the vendor keeps returning errors: revoke the old token manually via the vendor dashboard or CLI. Then click "Mark manually revoked" with the FreeScout ticket ID.

6f. Job aborted from minted, distribute_partial, or validated

The new token exists in vault but the old token is still valid. Both tokens are now live simultaneously.

Cleanup required:

Aborted from | Action
minted (new token in vault, not distributed) | Delete the new token from the vault path. The old token remains the active credential. File a FreeScout ticket documenting the orphaned token.
distribute_partial | Document which consumers have the new token and which have the old (check the rotation_job_consumers rows). Manually sync all consumers back to the old token. Then delete the new token from vault.
validated | Distribution and validation succeeded; only revocation is pending. You may manually revoke the old token via the vendor dashboard, then use "Mark manually revoked" in the console.

7. Rollback

Velvet does not support one-click rollback. Once the new token has been distributed and the old token revoked, there is no automated path back.

What is reversible:

What is NOT reversible:


8. Rotating Velvet's own bootstrap credentials (Invariant I7)

Velvet's own credentials (INFISICAL_CLIENT_SECRET, INFISICAL_CLIENT_ID, HK_VELVET_BOOTSTRAP) are stored as Heroku config vars, not in vault, to break the bootstrap circularity. Velvet cannot rotate them itself.

Rotating INFISICAL_CLIENT_SECRET

  1. In the Infisical dashboard, generate a new client secret for the Velvet machine identity.
  2. Set the new value on both Heroku apps:

heroku config:set INFISICAL_CLIENT_SECRET=<new_value> --app raxx-velvet-prod >/dev/null 2>&1
heroku config:set INFISICAL_CLIENT_SECRET=<new_value> --app raxx-velvet-staging >/dev/null 2>&1

Note: always redirect with >/dev/null 2>&1, because heroku config:set echoes config vars to stdout by default (feedback: heroku_config_set_echoes_secrets).

  3. Verify Velvet restarts and /healthz returns 200 on both apps.
  4. Revoke the old client secret in the Infisical dashboard.
  5. Record the rotation in a FreeScout ticket.

Rotating HK_VELVET_BOOTSTRAP

This token is used by Velvet to authenticate its PATCH calls to Heroku config vars on behalf of consumer updates.

  1. Use the Heroku Platform API or dashboard to create a new OAuth authorization for the Velvet machine user.
  2. Set the new token:

heroku config:set HK_VELVET_BOOTSTRAP=<new_token> --app raxx-velvet-prod >/dev/null 2>&1
heroku config:set HK_VELVET_BOOTSTRAP=<new_token> --app raxx-velvet-staging >/dev/null 2>&1

  3. Verify /healthz on both apps.
  4. Revoke the old authorization via the Heroku dashboard or:

heroku authorizations:revoke <old-auth-id>

  5. Update the companion secret HK_VELVET_BOOTSTRAP__AUTH_ID in Infisical with the new authorization UUID, so the next rotation can find it.
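If the config:set step in this section is scripted rather than typed, the same output suppression can be done from Python. This is a sketch under stated assumptions: the helper names are hypothetical, the heroku CLI is assumed to be installed and logged in, and the secret still appears in the process argument list, so this only addresses the stdout echo described above.

```python
import subprocess

VELVET_APPS = ("raxx-velvet-prod", "raxx-velvet-staging")


def config_set_command(app, var, value):
    """Build the argv for one `heroku config:set` call (no shell involved)."""
    return ["heroku", "config:set", f"{var}={value}", "--app", app]


def set_on_both_apps(var, value):
    """Set a bootstrap credential on prod and staging, discarding CLI output."""
    for app in VELVET_APPS:
        subprocess.run(
            config_set_command(app, var, value),
            stdout=subprocess.DEVNULL,  # equivalent of >/dev/null
            stderr=subprocess.DEVNULL,  # equivalent of 2>&1
            check=True,                 # raise if the CLI exits non-zero
        )
```

Because check=True raises on failure while the output is discarded, a failed call surfaces as an exception without the secret value being printed.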

9. Failure modes by adapter

Postmark (PM_SERVER_MAIL)

Failure | Meaning | Action
verify_failed with HTTP 401 | Postmark token already invalid | Rotate manually: generate new token in Postmark dashboard, enter new value via Velvet OPERATOR_MANUAL or directly in vault
distribute_partial on infisical-postmark-prod | Infisical write failed | Check INFISICAL_CLIENT_ID/INFISICAL_CLIENT_SECRET Heroku config vars; retry
validate_failed HTTP 401 | New token not yet active at Postmark (rare propagation delay) | Wait 30 seconds; retry validation
Revoke not automated | Postmark does not expose a token-delete API | Operator must manually delete the old server token in the Postmark dashboard; click "Mark manually revoked"

Heroku (HK_PLATFORM_FULL)

Failure | Meaning | Action
verify_failed with "old token invalid" | HEROKU_PLATFORM_API_TOKEN in Velvet config vars has drifted | Follow docs/ops/runbooks/heroku-api-key-drift-recovery.md
distribute_partial: one Heroku app | PATCH to that app returned non-200 | Check if the app exists: heroku apps:info --app <app-name>. If the app was deleted, remove it from the manifest; mark consumer row as skipped
revoke_failed with "revoke_pending" | Old auth DELETE failed after distribute succeeded | Note old_auth_id from logs; manually revoke via heroku authorizations:revoke <id>; mark manually revoked
distribute_partial: github-actions-heroku-api-key | GitHub Actions secret PUT failed | Check GH_APP_OPS_BOT token in vault; verify repo name is correct in manifest

Cloudflare (CF_DNS_EDIT_RAXX_APP, others)

Failure | Meaning | Action
Consumer active: false in manifest | CF adapter pending OQ7 resolution | Rotate manually per docs/ops/runbooks/rotation/cloudflare-user-api-token.md
OPERATOR_MANUAL flow: operator entered wrong value | New token does not validate at CF | Re-enter the correct token value; Velvet will re-attempt vault write

Note: scripts/ops/probe_cf_token_perms.py reads Cloudflare token permissions directly from Infisical. It does not go through Velvet. This is intentional — it is a read-only diagnostic tool and has not been migrated to the Velvet bus.

AWS SSM (AWS_ACCESS_KEY_ID, password-class credentials)

Failure | Meaning | Action
distribute_partial: SSM write 403 | Velvet's IAM role lacks ssm:PutParameter on the target path | Verify the IAM policy attached to the Velvet Heroku dyno's assumed role covers /raxx/{env}/{vendor}/{name}
SSM path not found (404 on read) | SSM path does not exist yet | First rotation creates the path; if the path was deleted externally, it will be re-created by the adapter

10. SEV1 — rev_leaked response

If a revocation job reaches rev_leaked, one or more consumers returned a non-401 response after the old token was confirmed revoked at the vendor. This means at least one consumer still has a copy of the revoked token and may be accepting it.

Immediate steps:

  1. You will have received a Slack DM on channel SL_BOT_NOTIFY within 30 seconds of the flag being set. The message includes the job_id, credential_name, and the list of leaking consumer_ids.
  2. Open the Velvet console: the leaked consumers are highlighted in red with an "Investigate" button.
  3. Click "Investigate" — this auto-creates a FreeScout ticket pre-filled with the consumer list.
  4. For each leaking consumer:
     a. Determine whether the consumer is still actively serving traffic.
     b. If yes: immediately disable or restart the consumer to force it to reload config.
     c. Verify the consumer is no longer accepting the revoked token by re-running the healthcheck manually.
  5. Once all consumers return 401, click "Confirm leak resolved" in the Velvet console. The job advances from rev_leaked to rev_done.
  6. If any consumer cannot be forced to reject the token (e.g., a caching layer with a long TTL), escalate to a security incident per the security response runbook.

Root causes of rev_leaked:


11. Staging vs. production

APP_ENV on each Heroku dyno controls which Infisical environment and SSM path prefix is used.

App | APP_ENV | Infisical env slug | SSM path prefix
raxx-velvet-prod | prod | prod | /raxx/prod/
raxx-velvet-staging | staging | staging | /raxx/staging/

The subscription manifest uses env: prod and env: staging per consumer row. A rotation job on raxx-velvet-prod only fans out to consumers with env: prod.

The Heroku app consumer rows for staging config vars (raxx-console-staging, raxx-api-staging) are registered with env: prod in the manifest — this is intentional. The staging apps' config vars hold the same credential (the Heroku platform key), which is a single credential shared across environments.
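The environment rules in this section can be sketched as a lookup plus a filter. The manifest rows below are illustrative: the consumer_id values for the staging apps and the staging-only consumer are hypothetical, and the dictionary layout is an assumption, not the real manifest schema.

```python
# Per-app settings from the table in this section.
ENV_BY_APP = {
    "raxx-velvet-prod": {"app_env": "prod", "infisical_env": "prod", "ssm_prefix": "/raxx/prod/"},
    "raxx-velvet-staging": {"app_env": "staging", "infisical_env": "staging", "ssm_prefix": "/raxx/staging/"},
}


def fan_out_targets(velvet_app, manifest_rows):
    """Return the consumer_ids a rotation on velvet_app would fan out to."""
    env = ENV_BY_APP[velvet_app]["app_env"]
    return [row["consumer_id"] for row in manifest_rows if row["env"] == env]


rows = [
    {"consumer_id": "heroku-config-console-prod", "env": "prod"},
    # Staging app config vars holding the shared Heroku platform key are
    # registered with env: prod on purpose (hypothetical consumer_id).
    {"consumer_id": "heroku-config-console-staging", "env": "prod"},
    # Hypothetical consumer that exists only in the staging environment.
    {"consumer_id": "example-staging-only", "env": "staging"},
]
prod_targets = fan_out_targets("raxx-velvet-prod", rows)
staging_targets = fan_out_targets("raxx-velvet-staging", rows)
```

With these rows, a rotation on raxx-velvet-prod reaches both Heroku config consumers, including the staging app, while a rotation on raxx-velvet-staging reaches only the staging-only consumer.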


12. Common operator mistakes and fixes

Mistake | Symptom | Fix
Starting a prod rotation against raxx-velvet-staging | Job fans out to staging consumers only; prod consumers never receive the new token | Abort the job. Re-run against raxx-velvet-prod.
Forgetting to open a FreeScout ticket before rotating | Cannot enter ticket ID at Stage 3 revoke gate | Open the ticket now. The gate enforces non-empty input but does not validate the ticket exists.
Clicking "Abort" from validated thinking it rolls everything back | New token stays in vault and distributed; old token stays live | See section 6f: abort from validated requires manual revocation of the old token only.
Retrying a revoke_failed job with a different auth token | Second revoke attempt uses stale auth | Ensure HK_VELVET_BOOTSTRAP or the relevant auth token in vault is current before retrying.
Two operators starting rotations for the same credential simultaneously | Second job's verify step returns "active rotation already in progress" | Only one operational rotation per credential can be in flight at a time. The first job must reach done or aborted before the second can start.
Running heroku config:set without redirecting stdout | Secret value printed to terminal and shell history | Always use heroku config:set VAR=value >/dev/null 2>&1
Checking Velvet for the new token value after rotation completes | Token value is not available via Velvet after the job reaches done | Read from Infisical directly using the machine identity; Velvet does not store the token value after rotation.

13. Slack DM notifications

Terminal events (job done, aborted, rev_leaked) trigger a Slack DM to the operator's channel.

Bot channel for automated alerts: SL_BOT_NOTIFY (configured in Velvet Heroku config vars).
Operator DM for walk-away pings: D0AJ7K184TV (Kristerpher's DM channel).

rev_leaked alerts are additionally sent within 30 seconds to the ops alert inbox (ops@raxx.app).