Raxx · internal docs


ADR-0038: Velvet — Three-Stage Operational Rotation Flow

Date: 2026-05-03
Status: Accepted
Deciders: Kristerpher (operator), architect-agent
Refs: #907 (Velvet epic), v2-rotation-flows design doc, Kristerpher directive 2026-05-03 06:00 UTC


Context

The v1 rotation model was a single atomic handler: mint → validate → distribute → revoke executed sequentially. This model has two structural flaws demonstrated in production:

  1. The old token is used to authenticate the mint call. If the old token is already invalid at rotation time (expired, drifted, or manually revoked), the very first step fails, yet the error surfaced as "distribute incomplete" with all destinations listed, suggesting that distribution had been attempted.

  2. Revocation is tightly coupled to distribution. In the v1 model, if distribute partially failed (some consumers updated, some not), the handler still proceeded to revoke the old token. The consumers that were not updated were left holding only the revoked token, with no valid token at all. This is the worst possible state.

Kristerpher's directive explicitly describes three stages: "Stage 1 — Verify you can auth, see the token, and have permissions to manipulate it. Stage 2 — Create a new token and begin the process of updating everything. Stage 3 — Revocation and validation... validate everything on the new token, and then finally execute the revocation."


Decision

Operational rotation is split into three explicit, operator-gated stages with durable state persisted in Postgres after each transition:
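The staged flow can be read as a small state machine in which every transition is persisted before the job is considered to have moved. The following is a minimal sketch under stated assumptions, not Velvet's actual schema: only the state names verified and done appear in this ADR; created, distributed, validated, and aborted are hypothetical, and persist is a stand-in for the Postgres write.

```python
from enum import Enum
from typing import Callable


class JobState(Enum):
    CREATED = "created"          # hypothetical name
    VERIFIED = "verified"        # named in this ADR
    DISTRIBUTED = "distributed"  # hypothetical name
    VALIDATED = "validated"      # hypothetical name
    DONE = "done"                # named in this ADR
    ABORTED = "aborted"          # hypothetical name


# Each state may only move to the next stage or to ABORTED.
ALLOWED = {
    JobState.CREATED: {JobState.VERIFIED, JobState.ABORTED},
    JobState.VERIFIED: {JobState.DISTRIBUTED, JobState.ABORTED},
    JobState.DISTRIBUTED: {JobState.VALIDATED, JobState.ABORTED},
    JobState.VALIDATED: {JobState.DONE, JobState.ABORTED},
    JobState.DONE: set(),
    JobState.ABORTED: set(),
}


def advance(current: JobState, target: JobState,
            persist: Callable[[JobState], None]) -> JobState:
    """Move a job to `target`, persisting the new state first.

    `persist` is a placeholder for the durable write (assumed, not specified).
    """
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    persist(target)
    return target
```

Because stages cannot be skipped, a job can never reach done without passing through every gate, and abort stays reachable from every non-terminal state.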

Stage 1 — Verify: Using the current (old) token, probe the vendor API to confirm (a) the token authenticates, (b) the token metadata is accessible, and (c) the rotating identity has permission to create a new token. This stage makes zero state changes. If it fails, the operator sees the exact probe error; the current token is untouched.
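A sketch of how the three read-only probes might be sequenced so that the operator sees the exact failing probe. The probe names and the VerifyResult type are hypothetical; the real vendor calls are not specified in this ADR, so each probe is modelled as a callable that raises on failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class VerifyResult:
    ok: bool
    failed_probe: Optional[str] = None  # name of the first probe that failed
    error: Optional[str] = None         # the probe's own error text, unmodified


def verify(probes: dict[str, Callable[[], None]]) -> VerifyResult:
    """Run read-only probes in order and stop at the first failure.

    Probes must not mutate vendor or local state; this function writes nothing.
    """
    for name, probe in probes.items():
        try:
            probe()
        except Exception as exc:
            return VerifyResult(ok=False, failed_probe=name, error=str(exc))
    return VerifyResult(ok=True)
```

A caller would pass the probes in the order (a), (b), (c), for example under the hypothetical keys "authenticate", "read_metadata", and "can_create_token".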

Stage 2 — Mint and Distribute: A new token is minted at the vendor. The old token remains valid. Velvet fans out to all registered subscribers in parallel via the service bus. The operator sees per-consumer status in real time. Stage 2 is not complete until all required consumers (those marked required: true in the manifest, pending OQ4 resolution) report distribute_status = succeeded.
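The Stage 2 completion rule can be sketched as a predicate over per-consumer status. The field names required and distribute_status and the value succeeded come from this ADR; the Consumer type and the other status values ("pending", "failed") are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Consumer:
    name: str
    required: bool
    distribute_status: str  # "succeeded" is from the ADR; other values are assumed


def stage2_complete(consumers: list[Consumer]) -> bool:
    """True only when every required consumer reports succeeded."""
    return all(c.distribute_status == "succeeded"
               for c in consumers if c.required)


def retry_candidates(consumers: list[Consumer]) -> list[str]:
    """Names of consumers the operator could retry (assumed 'failed' status)."""
    return [c.name for c in consumers if c.distribute_status == "failed"]
```

Note that with this predicate a manifest with no required consumers completes trivially; whether that is acceptable depends on how OQ4 is resolved.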

Stage 3 — Validate and Revoke: Each consumer's healthcheck endpoint is probed with the new token. Only after all consumers confirm the new token is active does the operator see the "Revoke old token" gate. The revoke action is gated behind type-to-confirm + FreeScout ticket ID. After vendor revocation confirms, the job reaches done.
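The revoke gate combines three conditions: every consumer validated on the new token, a matching type-to-confirm string, and a ticket ID. A minimal sketch, assuming the confirmation phrase is some expected string chosen by the UI and that the ticket ID only needs to be non-empty (the real FreeScout format check is not specified here):

```python
def may_revoke(validated: dict[str, bool],
               typed_confirmation: str,
               expected_confirmation: str,
               ticket_id: str) -> bool:
    """Decide whether the 'Revoke old token' action may run.

    `validated` maps consumer name -> healthcheck passed with the new token.
    An empty map is treated as not validated (an assumption of this sketch).
    """
    all_valid = bool(validated) and all(validated.values())
    confirmed = typed_confirmation == expected_confirmation
    has_ticket = ticket_id.strip() != ""
    return all_valid and confirmed and has_ticket
```

Keeping the check as a single pure function makes it straightforward to evaluate on the server at revoke time, rather than trusting that the UI hid the button.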

Operator abort is available at every stage. Abort semantics are defined per state in the design doc.


Consequences

Positive:
- Stage 1 failure is unambiguous: it means "I cannot even prove I have the right credentials to start this rotation." The operator gets a clean signal before any mutation occurs.
- The old token remains valid throughout Stages 1 and 2. No consumer outage until Stage 3 revocation, which only fires after validation confirms the new token is fully propagated.
- The per-stage operator gate means a human reviews each transition. This satisfies the audit trail requirement: every state change that "affects... credentials" has an operator_id, timestamp, and reason.
- Partial distribute is surfaced to the operator immediately. They can retry failed consumers or abort; they are never surprised after the fact.
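The audit trail requirement suggests that each transition carries a small record. A sketch of one possible shape follows: operator_id, timestamp, and reason come from this ADR, while the Transition type, job_id, and the from/to state fields are hypothetical. How automated transitions are attributed is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Transition:
    job_id: str        # hypothetical field
    from_state: str    # hypothetical field
    to_state: str      # hypothetical field
    operator_id: str
    reason: str
    timestamp: datetime


def record_transition(job_id: str, from_state: str, to_state: str,
                      operator_id: str, reason: str,
                      now: Optional[datetime] = None) -> Transition:
    """Build an audit record, refusing transitions without an operator or reason."""
    if not operator_id.strip() or not reason.strip():
        raise ValueError("operator_id and reason are required for every transition")
    stamp = now if now is not None else datetime.now(timezone.utc)
    return Transition(job_id, from_state, to_state,
                      operator_id.strip(), reason.strip(), stamp)
```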

Negative:
- Operational rotation requires operator attention across three interactive stages. Fully automated (unattended) rotation is not possible under this design. For expiry-driven rotations, the operator must be available to click through the stages.
- The three-stage flow takes longer wall-clock time than an atomic handler. A fast rotation might take 2-5 minutes of operator attention.

Mitigation for automation gap: Expiry-driven rotations create the job and advance to verified automatically if Stage 1 succeeds (automated probe runs on a schedule). The operator then arrives to an already-verified job and clicks through Stages 2 and 3. This preserves the safety gates while reducing operator time-on-task.
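The scheduled pre-verification could look roughly like the sketch below. The lead_time window, the run_stage1 callable, and the "created" state for jobs whose probe failed are all assumptions; this ADR only states that a job is created and advanced to verified when Stage 1 succeeds.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def create_expiry_jobs(expiries: dict[str, datetime],
                       now: datetime,
                       lead_time: timedelta,
                       run_stage1: Callable[[str], bool]) -> list[dict]:
    """Create a job for each token expiring within `lead_time`.

    Jobs whose automated Stage 1 probe passes start in "verified"; the rest
    stay in a hypothetical "created" state for the operator to investigate.
    No minting, distribution, or revocation happens here.
    """
    jobs = []
    for token_id, expires_at in expiries.items():
        if expires_at - now > lead_time:
            continue  # not close enough to expiry yet
        state = "verified" if run_stage1(token_id) else "created"
        jobs.append({"token_id": token_id, "state": state})
    return jobs


# Example tick: one token is inside a 7-day window, one is not.
now = datetime(2026, 5, 3, 6, 0, tzinfo=timezone.utc)
expiries = {
    "heroku": now + timedelta(days=2),
    "stripe": now + timedelta(days=30),
}
jobs = create_expiry_jobs(expiries, now, timedelta(days=7), lambda _tid: True)
```

Only the read-only stage runs unattended, so the operator gates on Stages 2 and 3 are unchanged.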


Alternatives Considered

Option A — Fully automated rotation (no operator gates): Velvet runs all three stages unattended when triggered by an expiry schedule. Operator is notified at completion or failure.

Rejected because: this is exactly the pattern behind the 2026-05-02 incident, in which the handler ran unattended, failed at the first step, and left the operator with a misleading partial-distribution error. Operator gates provide a safety break. Given the sensitivity of these credentials (sensitivity:critical for Heroku and Stripe), the extra 2-5 minutes of operator attention is the right tradeoff.

Option B — Keep v1 atomic handler, fix the old-token-used-for-mint bug: Patch the handler to use a fresh auth identity for the mint call rather than the old token.

Rejected because: this fixes only one of the two structural flaws. Validation would still run before distribution (mint → validate → distribute → revoke), so nothing confirms that each consumer actually holds a working new token, and revocation would still be coupled to distribution in the same atomic run. A partial distribute would still result in revocation of the old token before all consumers have the new one if the operator isn't watching carefully.