Status: Design locked 2026-05-03
Author: architect-agent
Parent epic: #907
Refs: Kristerpher directive 2026-05-03 ~06:00 UTC; incidents rot_2b1e (2026-05-02 20:50 UTC), PR #906 handler gap, drift recovery runbook
ADRs: 0037-velvet-service-bus-subscription-model, 0038-velvet-three-stage-operational-flow, 0039-velvet-revocation-401-criterion, 0040-velvet-consumer-registration-static-manifest
Operator runbook: docs/ops/runbooks/velvet-operator.md
Handler-author guide: docs/architecture/velvet-handler-author-guide.md
The v1 handler design (PR #906, cards #914–#916) modelled rotation as a single handler function with four sequential steps: mint → validate → distribute → revoke. The 2026-05-02 incident (rot_2b1e) exposed the critical flaw in this model:
mint call and to authenticate the distribute calls.

Independently: the v1 model hard-codes distribution destinations inside each handler. Adding a new consumer (GitHub Actions secret, Heroku config var, or a new service) requires a handler code change rather than a consumer registration.
Three changes address root causes, not symptoms:
These constraints are non-negotiable. Any implementation that routes around them requires explicit operator approval.
| # | Invariant |
|---|---|
| I1 | No credential value is stored in application code, in logs, or in any audit record. Token hash (SHA-256) is the only form of a credential value that appears in audit or error context. |
| I2 | Stage 1 (Verify) must succeed before any new token is minted. A failed Stage 1 leaves the current token untouched. |
| I3 | The old token is not revoked until every registered subscriber has confirmed success on the new token (Stage 3 validation must reach 100%). Operator can override with explicit force-revoke after acknowledging that some consumers may have stale tokens. |
| I4 | Every state transition that affects token lifecycle is appended to the audit log with: job_id, operator_id, timestamp (UTC), from_state, to_state, consumer_id (if per-consumer), error detail (no credential values). |
| I5 | Revocation always succeeds at the vendor side first. Consumer healthchecks come after. If vendor revocation fails, the revocation job fails — the token remains valid. |
| I6 | The service bus subscription manifest is versioned in the repository under docs/architecture/velvet/subscription-manifest.yml. Runtime registrations are not accepted (see ADR-0040). |
| I7 | Paper-first gate does not apply to rotation flows (rotation is operator-infrastructure, not trading execution). |
| I8 | The Velvet service's own credentials (Infisical client_id/secret, bootstrap key) are stored in Heroku config vars, not in vault, to break the bootstrap circularity. They must be rotatable via operator-manual process and that process must be documented. |
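
The hash-only rule in I1 and the audit fields listed in I4 can be illustrated with a short Python sketch. The function names `token_fingerprint` and `audit_entry` and the dictionary layout are assumptions made for illustration; they are not taken from the Velvet codebase.

```python
import hashlib
from datetime import datetime, timezone
from typing import Optional


def token_fingerprint(token_value: str) -> str:
    """Return the SHA-256 hex digest of a token value (the only form I1 allows in audit context)."""
    return hashlib.sha256(token_value.encode("utf-8")).hexdigest()


def audit_entry(job_id: str, operator_id: str, from_state: str, to_state: str,
                token_value: str, consumer_id: Optional[str] = None,
                now: Optional[datetime] = None) -> dict:
    """Build an audit record with the I4 fields; the raw token is hashed and then dropped."""
    ts = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "job_id": job_id,
        "operator_id": operator_id,
        "timestamp": ts.isoformat(),   # always UTC
        "from_state": from_state,
        "to_state": to_state,
        "consumer_id": consumer_id,    # None for job-level transitions
        "token_hash": token_fingerprint(token_value),
    }
```

The record can be serialized and appended to the audit log as-is, because no field holds the credential value itself.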
rotation_jobs table (revised from V3 #910)

The existing schema is extended to support the three-stage state machine and per-subscriber distribution tracking.
CREATE TABLE rotation_jobs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    credential_name TEXT NOT NULL,
    env TEXT NOT NULL CHECK (env IN ('prod', 'staging')),
    flow_type TEXT NOT NULL CHECK (flow_type IN ('testing', 'operational', 'revocation')),
    status TEXT NOT NULL CHECK (status IN (
        'init',
        'verifying', 'verify_failed',
        'verified',
        'minting', 'mint_failed',
        'minted',
        'distributing', 'distribute_partial', 'distribute_failed',
        'distributed',
        'validating', 'validate_partial', 'validate_failed',
        'validated',
        'revoking', 'revoke_failed',
        'done',
        'aborted'
    )),
    idempotency_key TEXT UNIQUE,
    operator_id TEXT NOT NULL,
    operator_abort BOOLEAN NOT NULL DEFAULT false,
    force_revoke BOOLEAN NOT NULL DEFAULT false,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    verified_at TIMESTAMPTZ,
    minted_at TIMESTAMPTZ,
    distributed_at TIMESTAMPTZ,
    validated_at TIMESTAMPTZ,
    revoked_at TIMESTAMPTZ,
    completed_at TIMESTAMPTZ,
    error_stage TEXT,
    error_message TEXT,
    new_token_hash TEXT,    -- SHA-256 of new token, never the value
    old_token_hash TEXT,    -- SHA-256 of old token at job creation time
    vendor_revoke_ref TEXT, -- vendor-returned revocation ID/txn, if any
    metadata JSONB          -- handler-specific: vendor auth IDs, scope list, etc.
);
CREATE INDEX idx_rotjobs_credential ON rotation_jobs (credential_name, env);
CREATE INDEX idx_rotjobs_status ON rotation_jobs (status);
CREATE INDEX idx_rotjobs_created ON rotation_jobs (created_at DESC);
CREATE INDEX idx_rotjobs_operator ON rotation_jobs (operator_id);
rotation_job_consumers table (new)

One row per registered subscriber per rotation job. Tracks per-consumer distribute and validate state independently.
CREATE TABLE rotation_job_consumers (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES rotation_jobs(id) ON DELETE CASCADE,
    consumer_id TEXT NOT NULL, -- matches subscription manifest consumer_id
    env TEXT NOT NULL,
    distribute_status TEXT NOT NULL DEFAULT 'pending'
        CHECK (distribute_status IN ('pending','in_progress','succeeded','failed','skipped')),
    validate_status TEXT NOT NULL DEFAULT 'pending'
        CHECK (validate_status IN ('pending','in_progress','succeeded','failed','skipped')),
    distribute_attempt_count INT NOT NULL DEFAULT 0,
    validate_attempt_count INT NOT NULL DEFAULT 0,
    distribute_error TEXT,
    validate_error TEXT,
    distribute_completed_at TIMESTAMPTZ,
    validate_completed_at TIMESTAMPTZ,
    healthcheck_http_status INT, -- last HTTP status from healthcheck endpoint
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_rotjob_consumers_job ON rotation_job_consumers (job_id);
CREATE INDEX idx_rotjob_consumers_cid ON rotation_job_consumers (consumer_id);
subscription-manifest.yml (versioned config — not a database table)

Location: docs/architecture/velvet/subscription-manifest.yml
Loaded at Velvet app startup; re-loaded on SIGHUP without restart.
# Velvet service-bus subscription manifest
# Format version: 2
# Every consumer that holds a copy of a rotatable token must register here.
subscriptions:
  - token_name: HK_PLATFORM_FULL
    consumer_id: heroku-config-console-prod
    env: prod
    update_endpoint: "https://api.heroku.com/apps/raxx-console-prod/config-vars"
    update_method: PATCH
    update_auth_token_name: HK_PLATFORM_FULL # Velvet reads THIS from vault to auth the update
    healthcheck_endpoint: "https://api.heroku.com/account"
    healthcheck_method: GET
    healthcheck_auth_header: "Authorization: Bearer {token}"
    healthcheck_success_status: 200
    capabilities:
      - update
      - healthcheck
    description: "Heroku config var on raxx-console-prod"
  - token_name: HK_PLATFORM_FULL
    consumer_id: github-actions-secret
    env: prod
    update_endpoint: "https://api.github.com/repos/raxx-app/TradeMasterAPI/actions/secrets/HEROKU_API_KEY"
    update_method: PUT_ENCRYPTED # Velvet handles NaCl encryption for GH Secrets API
    update_auth_token_name: GH_APP_OPS_BOT
    healthcheck_endpoint: null # GH secrets have no direct healthcheck; use a special verify flow
    capabilities:
      - update
    description: "GitHub Actions secret HEROKU_API_KEY"
  - token_name: PM_SERVER_MAIL
    consumer_id: infisical-postmark-server
    env: prod
    update_endpoint: "velvet://infisical/postmark/PM_SERVER_MAIL" # internal; Velvet handles directly
    update_method: INFISICAL_WRITE
    healthcheck_endpoint: "https://api.postmarkapp.com/server"
    healthcheck_method: GET
    healthcheck_auth_header: "X-Postmark-Server-Token: {token}"
    healthcheck_success_status: 200
    capabilities:
      - update
      - healthcheck
    description: "Infisical vault entry for Postmark server token"
Schema fields:
| Field | Type | Required | Description |
|---|---|---|---|
| token_name | string | yes | Token taxonomy name (e.g., HK_PLATFORM_FULL) |
| consumer_id | string | yes | Unique ID for this consumer entry; stable across rotations |
| env | prod / staging | yes | Which environment this subscription is for |
| update_endpoint | URI | yes | Where Velvet POSTs/PATCHes the new token value |
| update_method | string | yes | PATCH, PUT, POST, PUT_ENCRYPTED, INFISICAL_WRITE |
| update_auth_token_name | string | yes | Token Velvet uses to authenticate the update call |
| healthcheck_endpoint | URI or null | no | GET endpoint; null means healthcheck not supported |
| healthcheck_method | string | no | HTTP verb for healthcheck |
| healthcheck_auth_header | string | no | Header template with {token} placeholder |
| healthcheck_success_status | int | no | Expected HTTP status on success (200 / 204) |
| capabilities | list | yes | update and/or healthcheck |
| description | string | yes | Human-readable label for the console UI table |
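
To make the field rules above concrete, here is a minimal Python sketch of a per-entry manifest check. It operates on an already-parsed entry (a plain dict), so YAML loading is out of scope. The function name `validate_entry` and the error strings are hypothetical. Two rules are assumptions inferred from the sample manifest rather than stated requirements: `INFISICAL_WRITE` entries use a `velvet://` endpoint and may omit `update_auth_token_name`, while every other method needs an `https://` endpoint and an auth token name.

```python
from urllib.parse import urlparse

ALLOWED_METHODS = {"PATCH", "PUT", "POST", "PUT_ENCRYPTED", "INFISICAL_WRITE"}
ALLOWED_ENVS = {"prod", "staging"}
ALLOWED_CAPABILITIES = {"update", "healthcheck"}
REQUIRED = ("token_name", "consumer_id", "env", "update_endpoint",
            "update_method", "capabilities", "description")


def validate_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means the entry passed."""
    errors = [f"missing required field: {name}" for name in REQUIRED if not entry.get(name)]
    if errors:
        return errors  # later checks assume the required fields exist

    if entry["env"] not in ALLOWED_ENVS:
        errors.append(f"env must be one of {sorted(ALLOWED_ENVS)}")

    method = entry["update_method"]
    if method not in ALLOWED_METHODS:
        errors.append(f"unknown update_method: {method}")

    scheme = urlparse(entry["update_endpoint"]).scheme
    if method == "INFISICAL_WRITE":
        # Assumption: internal writes are addressed with the velvet:// scheme.
        if scheme != "velvet":
            errors.append("INFISICAL_WRITE update_endpoint must use velvet://")
    else:
        if scheme != "https":
            errors.append("update_endpoint must use https")
        if not entry.get("update_auth_token_name"):
            errors.append("missing required field: update_auth_token_name")

    caps = set(entry["capabilities"])
    unknown = caps - ALLOWED_CAPABILITIES
    if unknown:
        errors.append(f"unknown capabilities: {sorted(unknown)}")
    if "healthcheck" in caps and not entry.get("healthcheck_endpoint"):
        errors.append("healthcheck capability requires healthcheck_endpoint")
    return errors
```

Returning a list of problems instead of raising on the first one lets a boot-time loader report every broken entry in a single pass.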
All existing v1 endpoints remain. New endpoints:
POST /tokens/{name}/rotate
    Body: { "flow_type": "operational" | "testing" | "revocation",
            "idempotency_key": "<uuid>",
            "force_revoke": false }
    Returns 202: { "job_id": "<uuid>", "status": "init" }

GET /tokens/{name}/rotations/{job_id}
    Returns rotation_jobs row + consumers array

POST /tokens/{name}/rotations/{job_id}/stage
    Body: { "action": "verify" | "proceed_mint" | "proceed_revoke" |
            "abort" | "force_revoke" }
    Returns 200: { "job_id", "status", "consumers": [...] }

GET /tokens/{name}/rotations/{job_id}/stream
    SSE endpoint; emits { "event": "state_change", "data": {...} }
    on every status transition
When Velvet PATCHes/POSTs a consumer's update_endpoint, the body is:
{
  "velvet_job_id": "<uuid>",
  "token_name": "HK_PLATFORM_FULL",
  "token_value": "<new_value>",
  "rotate_timestamp": "2026-05-03T06:00:00Z"
}
For HEROKU_CONFIG_PATCH: Velvet wraps this into the Heroku Platform API shape ({"HEROKU_API_KEY": "<new_value>"}) before sending. The manifest update_method determines the adapter.
For PUT_ENCRYPTED (GitHub Secrets): Velvet fetches the repo public key, NaCl-seals the value, and constructs the GitHub Secrets API payload — no raw value leaves Velvet's process.
For INFISICAL_WRITE: Velvet calls Infisical directly using its own machine identity. No outbound HTTP to a third party; the update is internal.
Expected consumer response (for HTTPS consumers with their own endpoints):
{ "accepted": true, "consumer_id": "my-service" }
HTTP 200 or 204 is treated as success. Any 4xx/5xx marks that consumer's distribute_status = 'failed'.
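
The update contract can be sketched in Python as three small pure functions: one builds the generic request body, one builds the Heroku-shaped wrapper, and one maps the HTTP status to `distribute_status`. All function names are hypothetical. The text only specifies 200/204 as success and 4xx/5xx as failure, so treating every other status as failed is an assumption of this sketch.

```python
from datetime import datetime, timezone
from typing import Optional


def build_update_body(job_id: str, token_name: str, token_value: str,
                      now: Optional[datetime] = None) -> dict:
    """Generic body sent to a consumer's update_endpoint."""
    ts = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "velvet_job_id": job_id,
        "token_name": token_name,
        "token_value": token_value,  # must never be logged
        "rotate_timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def heroku_config_var_body(config_var_name: str, token_value: str) -> dict:
    """Heroku Platform API shape: a flat map of config var name to value."""
    return {config_var_name: token_value}


def distribute_status_for(http_status: int) -> str:
    """Map a consumer's HTTP response status to a distribute_status value."""
    # Assumption: anything other than 200/204 (including 3xx) counts as failed.
    return "succeeded" if http_status in (200, 204) else "failed"
```

Keeping these functions free of I/O makes the adapter logic easy to unit-test without calling a vendor.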
Velvet GETs healthcheck_endpoint with the new token substituted into healthcheck_auth_header. It checks only the HTTP status code against healthcheck_success_status.
For revocation validation: Velvet GETs with the old (now-revoked) token. Success criterion is HTTP 401.
Consumers with healthcheck_endpoint: null must be manually confirmed by the operator via the UI before the validation stage can advance. The console surfaces a "Manual confirm" button for these rows.
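
The healthcheck rules above reduce to a header substitution and a status comparison. The following Python sketch shows one way to express them; the function names are hypothetical, and defaulting the expected status to 200 when the manifest omits `healthcheck_success_status` is an assumption.

```python
from typing import Optional, Tuple


def render_auth_header(template: str, token: str) -> Tuple[str, str]:
    """Split a 'Name: value {token}' template into (header name, header value)."""
    name, _, value = template.partition(":")
    return name.strip(), value.strip().replace("{token}", token)


def validate_status_for(http_status: int, flow_type: str,
                        expected_status: Optional[int] = None) -> str:
    """Rotation flows expect the manifest's success status; revocation expects 401."""
    if flow_type == "revocation":
        expected = 401
    else:
        expected = expected_status if expected_status is not None else 200  # assumed default
    return "succeeded" if http_status == expected else "failed"


def needs_manual_confirm(entry: dict) -> bool:
    """Consumers with no healthcheck endpoint must be confirmed by the operator."""
    return entry.get("healthcheck_endpoint") is None
```

For example, `render_auth_header("Authorization: Bearer {token}", new_token)` yields the header pair that an HTTP client would attach to the GET request.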
stateDiagram-v2
[*] --> init : POST /rotate (flow_type=operational)
init --> verifying : operator clicks "Verify"
verifying --> verify_failed : auth probe fails / no permissions
verifying --> verified : auth probe OK + permission confirmed
verify_failed --> aborted : operator acknowledges
verify_failed --> verifying : operator retries
verified --> minting : operator clicks "Proceed"
minting --> mint_failed : vendor mint API error
minting --> minted : new token value in hand
mint_failed --> aborted : operator aborts (old token still valid)
minted --> distributing : fan-out to all registered consumers
distributing --> distribute_partial : some consumers failed
distributing --> distribute_failed : all consumers failed
distributing --> distributed : all consumers succeeded
distribute_partial --> distributing : operator retries failed rows
distribute_partial --> aborted : operator aborts (new token minted but not fully distributed — see cleanup)
distribute_failed --> aborted : operator aborts
distributed --> validating : Velvet calls all consumer healthchecks with new token
validating --> validate_partial : some healthchecks failed
validating --> validate_failed : all healthchecks failed
validating --> validated : all consumers return healthcheck_success_status
validate_partial --> validating : operator retries failed rows
validate_partial --> aborted : operator force-aborts
validate_failed --> aborted : operator aborts
validated --> revoking : operator clicks "Revoke old token" (type-to-confirm)
revoking --> revoke_failed : vendor revoke API error
revoking --> done : vendor confirms revocation
revoke_failed --> revoking : operator retries
revoke_failed --> done : operator marks "manually revoked" (with required ticket ID)
done --> [*]
aborted --> [*]
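
The diagram above can be encoded as a transition table that rejects any edge it does not list. This Python sketch copies only the edges drawn in the diagram; the names `TRANSITIONS`, `can_transition`, and `require_transition` are illustrative, and abort paths that the diagram does not draw are deliberately left out.

```python
# Allowed operational-flow transitions, one entry per source state.
TRANSITIONS = {
    "init": {"verifying"},
    "verifying": {"verify_failed", "verified"},
    "verify_failed": {"aborted", "verifying"},
    "verified": {"minting"},
    "minting": {"mint_failed", "minted"},
    "mint_failed": {"aborted"},
    "minted": {"distributing"},
    "distributing": {"distribute_partial", "distribute_failed", "distributed"},
    "distribute_partial": {"distributing", "aborted"},
    "distribute_failed": {"aborted"},
    "distributed": {"validating"},
    "validating": {"validate_partial", "validate_failed", "validated"},
    "validate_partial": {"validating", "aborted"},
    "validate_failed": {"aborted"},
    "validated": {"revoking"},
    "revoking": {"revoke_failed", "done"},
    "revoke_failed": {"revoking", "done"},
    "done": set(),      # terminal
    "aborted": set(),   # terminal
}


def can_transition(from_state: str, to_state: str) -> bool:
    """True only if the diagram contains an edge from from_state to to_state."""
    return to_state in TRANSITIONS.get(from_state, set())


def require_transition(from_state: str, to_state: str) -> None:
    """Raise ValueError for any transition the diagram does not allow."""
    if not can_transition(from_state, to_state):
        raise ValueError(f"illegal transition: {from_state} -> {to_state}")
```

Calling `require_transition` before every status UPDATE would enforce I2 structurally, since `minting` is reachable only from `verified`.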
Abort semantics by state:
| Aborted from state | Cleanup required | Risk |
|---|---|---|
| init / verifying | None — nothing touched | None |
| minted | New token in vault; old still valid. Operator must manually delete new token from vault. | New token is live but unused — orphan credential. |
| distribute_partial | Per-consumer: some have new token, some have old. Old token still valid. | Consumer split-brain window; resolve by completing distribution or rolling back new token manually. |
| validated | New token distributed and validated. Old token still valid. | Low risk — rotation is effectively complete; revocation is the only remaining step. |
Abort does NOT automatically revoke the new token. The operator is shown the residual state and must take manual action per the runbook.
sequenceDiagram
actor Op as Operator (console)
participant C as Console UI
participant V as Velvet API
participant DB as Postgres (rotation_jobs)
participant Vault as Infisical Vault
participant Vendor as Vendor API (e.g. Heroku)
participant Bus as Service Bus (consumer adapters)
Op->>C: Click "Rotate HK_PLATFORM_FULL"
C->>V: POST /tokens/HK_PLATFORM_FULL/rotate {flow_type: operational}
V->>DB: INSERT rotation_jobs {status: init}
V-->>C: 202 {job_id}
C-->>Op: Show Stage 1 modal
Op->>C: Click "Verify"
C->>V: POST /rotations/{job_id}/stage {action: verify}
V->>Vendor: GET /account (old token auth probe)
V->>Vendor: GET /account/authorizations (check rotate permission)
alt Verify OK
V->>DB: UPDATE status=verified
V-->>C: {status: verified}
C-->>Op: Show "Verified — proceed?" gate
else Verify failed
V->>DB: UPDATE status=verify_failed
V-->>C: {status: verify_failed, error: ...}
C-->>Op: Show failure — nothing changed
end
Op->>C: Click "Proceed to mint"
C->>V: POST /rotations/{job_id}/stage {action: proceed_mint}
V->>Vendor: POST /oauth/authorizations (mint new token)
V->>DB: UPDATE status=minted, new_token_hash=sha256(new)
V->>Vault: Write new_token to vault (non-active slot)
V-->>C: {status: minted}
Note over V,Bus: Fan-out to all registered consumers for HK_PLATFORM_FULL
V->>DB: INSERT rotation_job_consumers rows {status: pending}
V->>DB: UPDATE rotation_jobs status=distributing
loop Each consumer (parallel)
V->>Bus: Adapter: PATCH /apps/{app}/config-vars {HEROKU_API_KEY: new}
Bus-->>V: 200 OK
V->>DB: UPDATE rotation_job_consumers {distribute_status: succeeded}
end
V->>DB: UPDATE rotation_jobs status=distributed
V-->>C: SSE event: state_change {status: distributed, consumers: [...]}
C-->>Op: Show Stage 3 panel: all green / partial
Note over V,Bus: Validate — healthcheck each consumer with new token
loop Each consumer (parallel)
V->>Bus: GET healthcheck_endpoint (new token in auth header)
Bus-->>V: 200 OK (new token valid at vendor)
V->>DB: UPDATE rotation_job_consumers {validate_status: succeeded}
end
V->>DB: UPDATE rotation_jobs status=validated
V-->>C: SSE {status: validated}
C-->>Op: Show "Revoke old token?" gate (type-to-confirm)
Op->>C: Type confirm phrase + click Revoke
C->>V: POST /rotations/{job_id}/stage {action: proceed_revoke}
V->>Vendor: DELETE /oauth/authorizations/{old_token_id}
V->>Vault: Mark old token slot inactive
V->>DB: UPDATE rotation_jobs status=done
V-->>C: SSE {status: done}
C-->>Op: Show audit summary
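
The fan-out loops in the sequence above leave Velvet with one status per consumer, which must be folded into a single job status. A minimal Python sketch of that fold follows. The function name `aggregate_stage` is hypothetical, and ignoring `skipped` rows when counting is an assumption, because the document does not define how skipped consumers affect the job status.

```python
# (running, all-succeeded, some-failed, all-failed) job statuses per stage.
STAGE_STATES = {
    "distribute": ("distributing", "distributed", "distribute_partial", "distribute_failed"),
    "validate": ("validating", "validated", "validate_partial", "validate_failed"),
}


def aggregate_stage(stage: str, consumer_statuses: list) -> str:
    """Fold per-consumer statuses into the rotation_jobs status for one stage."""
    running, ok, partial, failed = STAGE_STATES[stage]
    # Assumption: skipped rows neither block nor fail the stage.
    considered = [s for s in consumer_statuses if s != "skipped"]
    if any(s in ("pending", "in_progress") for s in considered):
        return running  # fan-out has not finished yet
    failures = sum(1 for s in considered if s == "failed")
    if failures == 0:
        return ok
    if failures == len(considered):
        return failed
    return partial
```

With four consumers where one PATCH returned 401, `aggregate_stage("distribute", [...])` yields `distribute_partial`, which is the state from which the operator retries the failed rows.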
stateDiagram-v2
[*] --> rev_init : POST /rotate {flow_type: revocation}
rev_init --> rev_revoking : operator clicks "Revoke" (type-to-confirm)
rev_revoking --> rev_revoke_failed : vendor returns non-2xx
rev_revoking --> rev_revoked : vendor confirms revocation
rev_revoke_failed --> rev_revoking : operator retries
rev_revoke_failed --> [*] : operator aborts (token still valid)
rev_revoked --> rev_validating : Velvet enumerates subscribers; healthchecks with old (revoked) token
rev_validating --> rev_leaked : any consumer returns non-401
rev_validating --> rev_done : all consumers return 401
rev_leaked --> rev_done : operator acknowledges leak + files incident ticket
rev_done --> [*]
sequenceDiagram
actor Op as Operator
participant C as Console UI
participant V as Velvet API
participant DB as Postgres
participant Vendor as Vendor API
participant Bus as Consumer healthchecks
Op->>C: Click "Revoke HK_PLATFORM_FULL" (no replacement)
C->>V: POST /tokens/HK_PLATFORM_FULL/rotate {flow_type: revocation}
V->>DB: INSERT rotation_jobs {status: rev_init, flow_type: revocation}
V-->>C: 202 {job_id}
C-->>Op: Show revocation modal: type-to-confirm gate
Op->>C: Type confirm phrase + click "Revoke now"
C->>V: POST /rotations/{job_id}/stage {action: proceed_revoke}
V->>Vendor: DELETE /oauth/authorizations/{token_id}
alt Vendor confirms revocation
V->>DB: UPDATE status=rev_revoked
Note over V,Bus: Walk registered subscribers; healthcheck with revoked token
loop Each subscriber
V->>Bus: GET healthcheck_endpoint (revoked token in auth header)
alt Returns 401
V->>DB: validate_status=succeeded (401 IS the success criterion)
else Returns non-401
V->>DB: validate_status=failed, healthcheck_http_status=<code>
Note over V: Flag as rotation_leaked
end
end
alt All 401
V->>DB: UPDATE rotation_jobs status=rev_done
C-->>Op: "Revocation confirmed — all consumers locked out"
else Any non-401
V->>DB: UPDATE rotation_jobs status=rev_leaked
C-->>Op: Show leaked consumer list + "Investigate" button
end
else Vendor revocation fails
V->>DB: UPDATE status=rev_revoke_failed
V-->>C: {error: vendor_error_detail}
C-->>Op: Retry or abort
end
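
The revocation check inverts the usual success criterion: a 401 from every subscriber is the desired outcome. The Python sketch below classifies a set of healthcheck results accordingly; the function name `classify_revocation` and the input shape (consumer_id mapped to HTTP status) are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def classify_revocation(healthcheck_statuses: Dict[str, int]) -> Tuple[str, List[str]]:
    """Return the job status and the sorted list of consumers that did not return 401."""
    leaked = sorted(cid for cid, code in healthcheck_statuses.items() if code != 401)
    status = "rev_leaked" if leaked else "rev_done"
    return status, leaked
```

A result such as `("rev_leaked", ["github-actions-secret"])` gives the console the list it needs for the leaked-consumer view.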
The testing flow runs the full operational state machine against a throwaway token provisioned for the test. It never touches live credentials. Key differences:
- flow_type = testing in rotation_jobs
- flow_type=testing; it does not appear in operational rotation history by default.

Testing flow state machine is identical to operational but transitions automatically (no operator gates) and the terminal state is always revoking → done (auto-revoke of the test token).
The rotation modal is replaced with a multi-panel stage wizard. Each stage is a distinct panel within the same modal container. The operator cannot jump forward; they can abort at any point.
+------------------------------------------+
| Rotate: HK_PLATFORM_FULL [Abort] |
+------------------------------------------+
| [1. Verify] [2. Mint+Dist] [3. Val+Rev] | <- stage progress bar (current highlighted)
+------------------------------------------+
| <stage-specific content> |
+------------------------------------------+
Stage 1 — Verify panel:
- Token name, last-rotated date, rotation cadence, registered consumer count
- "Verify credentials" button (triggers the auth probe)
- On success: green check + "Credentials verified — 4 consumers registered. Proceed?"
- On failure: red banner with the exact probe error (e.g., "Auth probe returned 401 — current token invalid")
- Abort button available throughout

Stage 2 — Mint + Distribute panel:
- Shows "Minting new token..." spinner while Velvet calls vendor mint
- After mint: consumer table
Consumer | Environment | Status
--------------------------+-------------+---------
heroku-config-console-prod| prod | [green] Succeeded
heroku-config-api-prod | prod | [green] Succeeded
github-actions-secret | prod | [amber] In progress...
heroku-config-staging | staging | [red] Failed — 401 on PATCH
Stage 3 — Validate + Revoke panel:
- Same consumer table, now showing validation (healthcheck) status
- "Manual confirm" button for consumers with healthcheck_endpoint: null
- After all validation green: "Revoke old token?" confirmation gate
- Type-to-confirm field: revoke HK_PLATFORM_FULL
- FreeScout ticket ID field (required)
- Revoke button triggers vendor DELETE + vault mark-inactive
- Final audit summary panel after done:
- Job ID, duration, consumers updated, operator, UTC timestamp
Revocation modal (simpler, separate entry point):
- Single panel: "This permanently revokes HK_PLATFORM_FULL. No replacement will be minted."
- Type-to-confirm: revoke HK_PLATFORM_FULL permanently
- FreeScout ticket ID
- After revocation: consumer table showing 401 status per row (green = confirmed locked out, red = investigation needed)
Visual parity notes: The modal shell established in _rotate_modal_v2.html (Slate/rose palette, type-to-confirm, TOTP gate, FreeScout ticket field) is retained. The TOTP gate is preserved on the Stage 3 revoke gate. The danger banner ("This will invalidate the current value") becomes per-stage contextual copy.
The existing rotation_jobs schema (card #910, not yet applied to any live database — app not yet shipped) is replaced in its entirety by the v2 schema. Because raxx-velvet-prod and raxx-velvet-staging do not yet exist, this is a fresh schema, not a live migration.
Migration order:
1. 001_create_rotation_jobs_v2.sql — full table with v2 status enum
2. 002_create_rotation_job_consumers.sql — join table
3. 003_indexes.sql — all indexes
All migrations run via the Velvet release Procfile command (Alembic). Idempotent (CREATE TABLE IF NOT EXISTS + DO $$ BEGIN ... EXCEPTION WHEN duplicate_object THEN NULL; END $$).
Cards #914 (Postmark handler), #915 (Heroku handler), #916 (Cloudflare handler) are replaced by per-vendor adapter cards in the v2 slate. They are not closed yet — they stay open with a comment pointing to their replacement cards. They are removed from the sprint backlog.
The existing PR #906 handler (Heroku Mode A rotation) remains operational in the console app until the Velvet v2 Heroku adapter is live and smoke-tested. After that, PR #906's handler is deprecated and gated behind a feature flag set to off.
docs/architecture/velvet/subscription-manifest.yml is checked in as part of the design PR. It is initially populated with the three highest-priority consumers:
1. Postmark server token → Infisical write (lowest risk, proves the bus)
2. Heroku platform key → four Heroku config-var PATCH destinations + one GitHub Actions secret
3. Cloudflare DNS token → one Infisical write (CF tokens cannot be rotated via API — Velvet delegates to operator with clear instructions)
- dark: Velvet v2 app deployed; subscription manifest loaded; no rotation traffic
  - Smoke: GET /tokens/{name} works; subscription manifest parses cleanly
  - Card: B1 (scaffold + manifest loader)
- flag: feature flag velvet_v2_rotation gates POST /rotate endpoint
  - Postmark adapter live; test rotation run against throwaway PM token
  - Cards: B2 (schema), B3 (service bus core), B4 (Postmark adapter)
- beta: Heroku adapter live; one operational rotation of HK_PLATFORM_FULL in staging
  - Console UI v2 modal behind flag velvet_v2_rotation_ui
  - Cards: B5 (Heroku adapter), B6 (GH Actions adapter), B7 (console UI)
  - B5 Status: implemented in PR #947 — velvet/adapters/heroku_config_var.py. HerokuConfigVarAdapter: PATCH /apps/{app}/config-vars via HTTP (no CLI), auth via HK_PLATFORM_FULL env var, flag FLAG_VELVET_HEROKU_CONFIG_VAR_ADAPTER, response body always <REDACTED> in logs, 429 retry once after 5s.
  - B6 Status: implemented in PR #947 — velvet/adapters/github_actions_secret.py. GithubActionsSecretAdapter: GET public-key + NaCl SealedBox + PUT secret, auth via GH_APP_OPS_BOT env var (depends on #925 — degrades gracefully), flag FLAG_VELVET_GH_ACTIONS_SECRET_ADAPTER, raw value never leaves process unencrypted.
- ga: All v1 handler cards deprecated; PR #906 handler behind off-flag
  - Rotation history visible in console for all token types
  - Cards: B8 (CF adapter), B9 (revocation flow), B10 (runbook), B11 (v1 retirement)
PII collected: None. The audit log stores operator_id (a console account identifier — an opaque UUID, not an email address), job IDs, and token name. Token values never appear.
Retention: Rotation job rows retained for 2 years (per audit requirements for credential-change events). Purge job runs monthly; deletes rotation_jobs rows older than 730 days; cascades to rotation_job_consumers via FK.
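
The retention rule amounts to a cutoff computation. The Python sketch below shows it, assuming the purge job compares `created_at` against the cutoff; the text does not say which timestamp column drives the purge, so that choice and the function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 730  # two years, per the retention policy


def purge_cutoff(now: datetime) -> datetime:
    """Rows created before this instant are eligible for deletion."""
    return now - timedelta(days=RETENTION_DAYS)


def is_purgeable(created_at: datetime, now: datetime) -> bool:
    """True when a rotation_jobs row is older than the retention window."""
    return created_at < purge_cutoff(now)
```

Because `rotation_job_consumers.job_id` is declared `ON DELETE CASCADE`, deleting the parent rows is enough; the child rows do not need a separate purge.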
Credential values at rest: New token value is held in Velvet process memory only during distribute. It is written to Infisical via the app's own machine identity. The value is never written to the rotation_jobs table. new_token_hash (SHA-256, hex) is the only stored form.
Credential values in transit: All Velvet → vendor calls are HTTPS. Consumer update_endpoint URIs must be HTTPS (manifest validator rejects http://). The token_value field in the update request body is transmitted once per consumer over TLS. It is never logged by Velvet; the consumer's logging posture is the consumer's responsibility.
Credential replay risk: Velvet holds no credential in a form that can be replayed beyond the current rotation job's lifetime. Once a job reaches done, the new token value is no longer accessible via Velvet (only via vault, with a fresh read). This satisfies invariant I1 and the system-wide no-stored-credentials constraint.
Breach notification: If the rotation_leaked status is reached (revocation confirmed at vendor but one or more consumers still respond non-401), an ops alert is emitted to the operator's Slack channel (SL_BOT_NOTIFY) within 30 seconds. The alert message includes: job_id, credential_name, list of consumer_ids that did not return 401, and a link to the console investigation panel.
Service-bus auth: When Velvet calls a consumer's update_endpoint, it signs the request with an HMAC-SHA256 request signature using a shared secret (VELVET_CONSUMER_SHARED_SECRET) stored in vault. The consumer verifies the signature before applying the new token value. Consumers without HMAC verification are classified as capabilities: [update_no_verify] in the manifest and flagged in the console as degraded trust.
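
A minimal Python sketch of the HMAC-SHA256 signing and verification step, using only the standard library. The document does not specify how the signature is transported or which bytes are signed, so signing the raw request body and passing a hex digest are assumptions of this sketch, as are the function names.

```python
import hashlib
import hmac


def sign_body(shared_secret: bytes, body: bytes) -> str:
    """Velvet side: compute the hex HMAC-SHA256 of the exact bytes being sent."""
    return hmac.new(shared_secret, body, hashlib.sha256).hexdigest()


def verify_signature(shared_secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Consumer side: recompute the signature and compare in constant time."""
    expected = sign_body(shared_secret, body)
    return hmac.compare_digest(expected, signature_hex)
```

The consumer should verify against the raw bytes it received, before parsing the JSON, and apply the new token only when `verify_signature` returns `True`. `hmac.compare_digest` avoids leaking how many leading characters matched.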
Secrets rotation of Velvet's own credentials: Velvet's Infisical machine identity secret and Heroku bootstrap config vars are rotated manually per a dedicated runbook (card B10). The subscription manifest itself is rotated by re-deploying the app with an updated manifest file.
Kill-switch: Feature flag velvet_v2_rotation set to off disables all POST /rotate endpoints and returns HTTP 503 with body {"error": "rotation_disabled"}. The console surfaces this as "Rotation temporarily disabled" in the UI.
These block specific sub-cards and require Kristerpher's explicit decision before implementation begins.
| # | Question | Blocks | Options |
|---|---|---|---|
| OQ1 | Consumer registration: static manifest only, or also runtime API? The design proposes static manifest only (ADR-0040). Runtime API allows new consumers to register at boot without a deploy, but adds a registration-endpoint attack surface and drift risk. | B3 (service bus core) | A: static only (current design). B: static + runtime API (runtime registrations stored in DB, require rotate-scoped service token). |
| OQ2 | Healthcheck timeout per consumer. How long does Velvet wait for a consumer's healthcheck GET before marking it failed? Short timeout causes false failures on slow consumers; long timeout blocks the stage. Proposed default: 15 seconds. | B3 | A: 15s uniform. B: per-consumer configurable in manifest (healthcheck_timeout_s field). |
| OQ3 | Distribute fan-out: parallel or sequential? Parallel is faster but if one consumer update causes downstream effects (e.g., a config-var PATCH triggers a dyno restart), parallel may cause a thundering herd. Proposed: parallel with configurable max_concurrency (default: 4). | B3, B5 | A: parallel (current design). B: sequential with configurable delay. |
| OQ4 | Partial distribute threshold: what fraction of consumers failing blocks Stage 3? Current design: any failure blocks. But if a low-importance consumer (e.g., an archived Heroku app) fails distribute, it may not be worth blocking the entire rotation. A criticality tier on the manifest entry would allow "warn on failure" vs "block on failure". | B3, B5 | A: any failure blocks (current design). B: per-consumer required: true/false flag in manifest. |
| OQ5 | Force-revoke authorization. Invariant I3 says operator can override the all-consumers-green requirement. Should force-revoke require a second operator (4-eyes pattern) or is single-operator confirm sufficient? | B9 (revocation), B7 (UI) | A: single operator with type-to-confirm (current design). B: two-operator confirm (requires RBAC changes). |
| OQ6 | Testing flow: should it produce audit log entries? If yes, test rotations appear in the audit trail (good for proving compliance). If no, testing is invisible to auditors but cleaner in the ops view. Proposed: yes, with flow_type=testing label that can be filtered. | B2 (schema), B7 (UI) | A: log with label (current design). B: no audit log for test flows. |
| OQ7 | Cloudflare token rotation: Velvet-managed or operator-prompted? CF User API tokens cannot be rotated via CF's API (CF only exposes token read + delete, not create-with-same-scopes). Options: Velvet opens the CF dashboard URL in a side panel and walks the operator through manual steps, or Velvet fully delegates to operator with a "I manually rotated this — enter new value" form. | B8 (CF adapter) | A: Velvet-guided manual flow (open CF dashboard, operator enters new value). B: Velvet treats CF as an update_method: OPERATOR_MANUAL entry with a standard confirm form. |
| OQ8 | PR #906 retirement timeline. The existing Heroku handler in the console app must be deprecated after the Velvet Heroku adapter is live. Should the console PR that removes it be filed as part of the v2 slate (B11), or is it a separate track? | B11 (v1 retirement) | A: include in v2 slate as B11. B: file as separate epic under #81. |
| OQ9 | rotation_leaked escalation path. When Velvet flags a consumer as leaked after revocation, what is the response SLA? Currently: Slack alert within 30 seconds. Should there also be a FreeScout ticket auto-created (like the Status page's "Investigate" button pattern)? | B9, B7 | A: Slack alert only (current design). B: Slack alert + auto-FreeScout ticket with consumer list as body. |
See section-level heading "Sub-card slate" in the accompanying briefing.
| Slot | Title | Scope | Size | Depends on |
|---|---|---|---|---|
| B1 | feat(velvet/scaffold): Heroku app pair + Postgres + subscription manifest loader | New Velvet app pair, Postgres add-on, manifest YAML parser + validator at boot | M | — |
| B2 | feat(velvet/db): rotation_jobs v2 schema + rotation_job_consumers table | Fresh v2 schema, Alembic migrations, seed test | S | B1 |
| B3 | feat(velvet/bus): service-bus core — fan-out, per-consumer state, SSE stream | Bus engine: enumerate subscribers, dispatch adapters in parallel, persist per-consumer state, SSE endpoint | L | B2 |
| B4 | feat(velvet/adapter): Postmark adapter (infisical-write + healthcheck) | INFISICAL_WRITE adapter; GET healthcheck; first end-to-end test rotation | S | B3 |
| B5 | feat(velvet/adapter): Heroku config-var adapter (PATCH + healthcheck) | PATCH /config-vars for each registered Heroku app; GET /account healthcheck | M | B3 |
| B6 | feat(velvet/adapter): GitHub Actions secret adapter (PUT_ENCRYPTED) | NaCl encrypt + PUT /actions/secrets/{name}; no healthcheck | S | B3 |
| B7 | feat(velvet/ui): stage-wizard modal (3-panel + revocation panel) | Console UI: stage progress, consumer table, type-to-confirm, SSE-driven updates, abort affordance | L | B3, B5 |
| B8 | feat(velvet/adapter): Cloudflare DNS adapter (operator-manual flow) | OPERATOR_MANUAL update_method; console walks operator through CF dashboard; enters new value form | M | B3 |
| B9 | feat(velvet/api): revocation flow endpoint + rev_leaked alert | POST /rotate {flow_type: revocation}; vendor DELETE; healthcheck 401 validation; Slack alert on rotation_leaked | M | B3, B4 |
| B10 | feat(velvet/auth): stage-endpoint auth middleware + rotation-authz matrix | Service-token auth on all stage endpoints; rotate/revoke permission scoping | S | B1 |
| B11 | ops(velvet): operator runbook v2 + handler-author guide (adapter contract) | Runbook covering all three flows, abort recovery, force-revoke, manifest maintenance | S | B4, B5, B9 |
| B12 | feat(velvet/test): testing flow automation + throwaway token lifecycle | flow_type=testing path; throwaway token mint/use/auto-revoke; no operator gates; CI smoke test | M | B3, B4 |
| B13 | feat(velvet/db): GET /tokens/{name}/rotations history endpoint | List endpoint with pagination; flow_type filter; operator-scoped; links to console history page | S | B2 |
| B14 | ops(velvet/retire): deprecate PR #906 Heroku handler behind off-flag | Feature flag velvet_v2_rotation=on gates new path; old handler returns 503 with migration note | S | B5, B7 |
| B15 | feat(velvet/adapter): SSM Parameter Store integration (passwords) | GET/PUT to AWS SSM /raxx/{env}/{vendor}/{name}; used by password-class credentials | M | B1 (D2 path shape must be confirmed — OQ on existing card) |
Dropped from v1 slate:
- V7 #914 (Postmark handler) — replaced by B4
- V8 #915 (Heroku handler) — replaced by B5 + B6 + B14
- V9 #916 (CF handler) — replaced by B8

Retained from v1 slate (with revision comments filed):
- V1 #908 (scaffold) — merged into B1; close #908 when B1 ships
- V2 #909 (GET proxy) — still valid, no change needed
- V3 #910 (rotation_jobs schema) — superseded by B2; comment filed on #910
- V4 #911 (POST /rotate) — superseded by B3/B9; comment filed on #911
- V5 #912 (auth middleware) — superseded by B10; comment filed on #912
- V6 #913 (SSM integration) — becomes B15; retained pending OQ resolution
- V10 #917 (first console callsite) — still valid; now depends on B7
- V11 #918 (runbook) — superseded by B11; comment filed on #918