Date: 2026-05-03
Status: Accepted
Deciders: Kristerpher (operator), architect-agent
Refs: #907 (Velvet epic), v2-rotation-flows design doc, Kristerpher directive 2026-05-03 06:00 UTC
In the revocation flow, Velvet revokes a token at the vendor and then needs to confirm that all registered consumers are no longer able to use that token. The question is: what is the success criterion for the post-revocation healthcheck?
Kristerpher stated this directly (2026-05-03): "we should validate everything on the new token, and then finally execute the revocation... you're essentially looking for an Unauthorized at the end of the workflow. The nice part is revocation is easy — you can see how clever that is."
The same logic applies to the revocation-only flow: after the vendor confirms deletion, Velvet calls each consumer's healthcheck endpoint using the now-revoked token. A 401 from the vendor (passed through the consumer's healthcheck response) proves the token no longer works.
For the revocation flow, the consumer healthcheck is called with the old (revoked) token. The success criterion is HTTP 401 from the consumer's healthcheck_endpoint.
For Stage 3 of the operational flow, the consumer healthcheck is called with the new token. The success criterion is healthcheck_success_status (typically 200 or 204).
These are the inverse of each other by design:
- Operational: prove the new token works (200 expected)
- Revocation: prove the old token is dead (401 expected)
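The inverted criteria can be sketched as a single predicate. This is a minimal illustration, not Velvet's actual code: the names `Flow`, `REJECTED_STATUSES`, and `healthcheck_passed` are hypothetical, and the default of 200 for `healthcheck_success_status` is an assumption. Treating 403 like 401 follows the equivalence this record states for revocation validation.

```python
from enum import Enum


class Flow(Enum):
    OPERATIONAL = "operational"  # Stage 3: probe with the new token
    REVOCATION = "revocation"    # probe with the old (revoked) token


# Statuses that count as "the vendor did not accept the request".
# 403 is grouped with 401, as this record does for revocation validation.
REJECTED_STATUSES = frozenset({401, 403})


def healthcheck_passed(flow: Flow, status: int,
                       healthcheck_success_status: int = 200) -> bool:
    """Return True when the healthcheck status meets the flow's criterion."""
    if flow is Flow.REVOCATION:
        # Success means the revoked token is rejected.
        return status in REJECTED_STATUSES
    # Success means the new token is accepted with the configured status.
    return status == healthcheck_success_status
```

With this shape, the same status code yields opposite verdicts depending on the flow: a 200 passes the operational check and fails the revocation check.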
Any consumer that returns anything other than 401 (or 403, which is treated equivalently) after revocation is flagged as rotation_leaked. Leaked consumers are surfaced immediately in the console UI, an alert is sent to SL_BOT_NOTIFY, and optionally a FreeScout ticket is auto-created (pending OQ9 resolution).
The term rotation_leaked captures the semantics: the revoked credential is still being honored by the vendor when called from that consumer, which means either (a) the consumer has a cached or different copy of the token, or (b) the vendor did not fully propagate the revocation. Either condition requires human investigation.
Positive:
- The 401 criterion is vendor-agnostic and does not require the consumer to implement custom callback logic. Any consumer with a healthcheck_endpoint that performs a vendor API call will naturally pass through whatever status the vendor returns.
- The criterion is binary and deterministic. There is no ambiguity about "partial success" — either the vendor rejects the token (401) or it doesn't.
- The criterion aligns with Kristerpher's mental model, making the code and the runbook speak the same language.
Negative:
- Some vendors may cache token validity briefly (eventual consistency on revocation), so a consumer might return 200 immediately after revocation and 401 ten seconds later. If Velvet ran the healthcheck only once, immediately after the vendor revoke API returns, this caching window could produce false rotation_leaked flags.
- Mitigation: Velvet retries the healthcheck up to 3 times with a 10-second interval before marking a consumer as rotation_leaked. The manifest can specify a revocation_propagation_delay_s field to extend the retry window for known-slow vendors.
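
The retry mitigation could look roughly like the sketch below. Several details are assumptions rather than anything this record specifies: `confirm_revoked` and its parameters are hypothetical names, "up to 3 times" is read as three probes in total, and `revocation_propagation_delay_s` is modeled as an extra wait before the first probe, which is only one possible interpretation of "extend the retry window". The `probe` callable stands in for the HTTP call to the consumer's healthcheck_endpoint and returns its status code.

```python
import time
from typing import Callable

REJECTED_STATUSES = frozenset({401, 403})


def confirm_revoked(probe: Callable[[], int],
                    attempts: int = 3,
                    interval_s: float = 10.0,
                    revocation_propagation_delay_s: float = 0.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Probe until the revoked token is rejected or attempts run out."""
    # Assumption: the manifest delay is applied once, before the first probe.
    if revocation_propagation_delay_s > 0:
        sleep(revocation_propagation_delay_s)
    for attempt in range(attempts):
        if probe() in REJECTED_STATUSES:
            return "revoked"
        # Wait between probes, but not after the final one.
        if attempt < attempts - 1:
            sleep(interval_s)
    return "rotation_leaked"
```

Injecting `sleep` keeps the function testable without real waiting; a caller in production would leave the default `time.sleep` in place.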
Non-issue:
- 403 (Forbidden) is treated the same as 401 for revocation validation purposes. Both mean "the vendor did not accept the request." The distinction (unauthenticated vs. unauthorized) is not meaningful for this purpose.
Option A — Consumer confirms revocation via signed callback:
After Velvet revokes, each consumer receives a POST /internal/rotate with {"action": "invalidate", "token_name": "..."}, clears its local token, and responds with {"status": "cleared"}.
Rejected because: this requires every consumer to implement a callback endpoint, which is a significant integration burden for existing consumers (Heroku config vars, GitHub Actions secrets, AWS SSM). The 401-probe approach works without any consumer code change.
Option B — Velvet calls the vendor API directly to confirm revocation:
After the vendor DELETE call, Velvet calls the vendor's token-status endpoint to confirm the token is no longer listed as active.
This is done regardless (invariant I5: vendor revocation must succeed before consumer healthchecks). But it is not sufficient: vendor confirmation that the token is deleted does not prove that consumers have stopped using it. A consumer with a cached copy may still make successful API calls on a token the vendor believes is deleted (vendor-side propagation lag). The 401-probe tests the consumer's actual behavior, not the vendor's metadata.
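
The ordering described here, vendor confirmation first and consumer probes second, can be sketched as follows. This is an illustrative outline under assumptions: `run_revocation`, `revoke_at_vendor`, and `consumer_probes` are hypothetical names, the vendor step is reduced to a callable returning True on confirmed deletion, and retries are omitted for brevity.

```python
from typing import Callable, Dict

REJECTED_STATUSES = frozenset({401, 403})


def run_revocation(revoke_at_vendor: Callable[[], bool],
                   consumer_probes: Dict[str, Callable[[], int]]) -> Dict[str, str]:
    """Revoke at the vendor, then probe each consumer with the old token."""
    # Invariant I5: vendor revocation must succeed before consumer healthchecks.
    if not revoke_at_vendor():
        raise RuntimeError("vendor revocation not confirmed; "
                           "consumer healthchecks skipped")
    outcomes: Dict[str, str] = {}
    for name, probe in consumer_probes.items():
        # Vendor confirmation alone is not enough; each consumer is tested.
        status = probe()
        outcomes[name] = ("revoked" if status in REJECTED_STATUSES
                          else "rotation_leaked")
    return outcomes
```

The per-consumer outcome map is what distinguishes this from Option B alone: a confirmed vendor deletion can still coexist with one or more rotation_leaked consumers.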