ADR 0129 — RBAC V2 Blueprint Cutover Rollout Strategy

Status: Accepted
Date: 2026-06-18 UTC
Deciders: Kristerpher (operator), software-architect
Scope: Console service (all 17 blueprint files in console/app/blueprints/)

Context

The Console has 141 legacy @require_role(...) call sites that resolve against the flat four-level admin_roles table (superadmin / ops / support / readonly). RBAC V2 tables and fine-grained decorators exist in console/app/middleware/rbac.py. Issue #1473 (operator-authorized for pre-launch, 2026-06-18) asks: how do we cut over 141 sites safely without a single all-or-nothing mega-PR that cannot be rolled back at the route level?

Two questions drove this ADR:

Should the cutover use a runtime flag-check inside each decorator (live-flip) or a deployment-time switch (redeploy)?
Should the cutover happen in a single PR or a phased cluster-by-cluster sequence?

Decision

The cutover uses direct decorator replacement in phased blueprint clusters, with FLAG_RBAC_V2 as a deployment gate (not a runtime gate). Shadow dual-mode code is removed in the first sub-card. Each cluster ships independently to staging and soaks before promotion to prod. Rollback is via tagged-SHA redeploy, not flag flip.

The per-route permission mapping in docs/architecture/rbac-blueprint-cutover.md §3 is the authoritative correctness artifact. Every sub-card must produce integration tests proving 200/403 behaviour before merging.
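
A table-driven check is one way to tie the 200/403 tests to the mapping table. The sketch below is illustrative only: it uses the three status routes resolved later in this ADR as sample rows, and fake_get is a hypothetical stand-in for a real test-client request made as an admin holding a given set of roles.

```python
from typing import Callable

# Route -> required role. Sample rows only: the three dual-auth status routes
# resolved in the OQ-6 section of this ADR.
ROUTE_ROLES = {
    "/api/status/sites": "console-user",
    "/api/status/builds": "console-audit-user",
    "/api/status/secrets": "console-secrets-admin",
}

# A getter takes (path, roles held by the caller) and returns an HTTP status.
Getter = Callable[[str, frozenset], int]


def fake_get(path: str, roles: frozenset) -> int:
    """Hypothetical stand-in for a test-client GET; not the real Console app."""
    return 200 if ROUTE_ROLES[path] in roles else 403


def check_mapping(get: Getter, mapping: dict[str, str]) -> list[str]:
    """Return human-readable failures; an empty list means the mapping holds."""
    failures = []
    for path, role in mapping.items():
        # An admin holding exactly the mapped role must be allowed.
        if get(path, frozenset({role})) != 200:
            failures.append(f"{path}: holder of {role} was not allowed")
        # An admin holding no roles must be denied.
        if get(path, frozenset()) != 403:
            failures.append(f"{path}: admin without {role} was not denied")
    return failures
```

In a real sub-card the getter would wrap the application's test client, so the same loop exercises every row that the sub-card ports.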

Language choice rationale

Skipped. This ADR governs an operational/authz rollout decision, not a new service.

Consequences

Positive

Each sub-card PR is independently reviewable and has a small diff.
Rollback is unambiguous: tagged SHA, not a flag state that may be stale across dynos.
No runtime overhead of a per-request flag check inside every decorator invocation.
TOTP elevation chains are unchanged — the new decorator replaces only the session/role gate, not the TOTP layer.
Audit trail is preserved: _write_access_denied() fires in V2 decorators identically to legacy.

Negative / risks

FLAG_RBAC_V2 cannot live-flip between old and new decorator behaviour on a running dyno. This is a known limitation: decorators are applied when the blueprint modules are imported, so the gate on each route is fixed for the life of the process.
The per-route mapping table is a new artifact that can drift from code if not maintained. The S10 CI lint gate (grep '@require_role' finding zero matches) is the enforcement mechanism.
4 AMBIGUOUS mappings require operator decisions before 4 of the 10 sub-cards can be claimed (S1, S3, S4, S7). These are explicit blocking dependencies, not implicit ones.
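
The S10 lint gate named in the risks above could be as small as a script that reports any remaining legacy decorator. This is a minimal sketch, assuming the gate scans Python files under a blueprint directory; the function names and the default path handling are hypothetical and do not describe the project's actual CI step.

```python
import re
from pathlib import Path

# Matches the legacy decorator only. "@require_rbac_role(" does not match,
# because the pattern requires "role(" directly after "@require_".
LEGACY_DECORATOR = re.compile(r"^\s*@require_role\(")


def find_legacy_call_sites(root: Path) -> list[tuple[Path, int, str]]:
    """Return (file, line number, stripped line) for each legacy decorator under root."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if LEGACY_DECORATOR.match(line):
                hits.append((path, lineno, line.strip()))
    return hits


def main(root: Path) -> int:
    """Print every hit and return a process exit code (0 = clean, 1 = legacy sites remain)."""
    hits = find_legacy_call_sites(root)
    for path, lineno, text in hits:
        print(f"{path}:{lineno}: {text}")
    return 1 if hits else 0
```

A CI job would call main(Path("console/app/blueprints")) and exit with the returned code, so the build fails until the count reaches zero.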

Neutral

The dual-mode shadow report code (rbac_dual_mode.py import chain) is deleted in S1. This removes observability data about legacy/V2 divergence. The shadow data served its purpose during the parallel-run phase; removing it is correct per #1473 AC.
console-ops role (migration 0012) has no group assignment. It is not used by any mapping in rbac-blueprint-cutover.md. This is an open question (#10/OQ-5) but does not block the cutover.

Alternatives considered

Alternative A: Runtime flag-check inside each decorator

Each route registers a wrapper that checks FLAG_RBAC_V2 at request time and branches to the legacy or V2 check. This enables live flip without redeploy.

Rejected because: Flask decorator stacks are evaluated at import time. A per-request flag check would require every route to be wrapped in an additional callable, significantly increasing code complexity and introducing a new surface for decorator-ordering bugs. The existing shadow-check mechanism already demonstrated the fragility of this pattern (it had to be lazy-imported to avoid circular imports). The operational benefit — live flip — is low: flag flips on Heroku trigger a dyno restart anyway, so the latency difference between flag-flip-restart and SHA-redeploy is seconds.
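
For reference, the rejected pattern looks roughly like the sketch below: one extra callable per route that reads the flag on every request and dispatches to one of two fully decorated views. The two gate functions, the flag_switched helper, and the accepted flag values are hypothetical stand-ins, not code from console/app/middleware/rbac.py.

```python
import os
from functools import wraps
from typing import Any, Callable

View = Callable[..., Any]
Gate = Callable[[View], View]


def legacy_gate(fn: View) -> View:
    """Stand-in for the legacy role decorator; tags the result so the path is visible."""
    @wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        return ("legacy", fn(*args, **kwargs))
    return wrapper


def v2_gate(fn: View) -> View:
    """Stand-in for the V2 role decorator; tags the result so the path is visible."""
    @wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        return ("v2", fn(*args, **kwargs))
    return wrapper


def flag_switched(legacy: Gate, v2: Gate) -> Gate:
    """Build both decorated views once, then choose between them per call."""
    def decorator(fn: View) -> View:
        legacy_view = legacy(fn)  # built at import time
        v2_view = v2(fn)          # built at import time

        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Read on every request. The accepted truthy values are an assumption.
            enabled = os.environ.get("FLAG_RBAC_V2", "").lower() in {"1", "true", "on"}
            return (v2_view if enabled else legacy_view)(*args, **kwargs)
        return wrapper
    return decorator


@flag_switched(legacy_gate, v2_gate)
def list_sites() -> str:
    return "sites"
```

Each route ends up carrying two decorated views plus a dispatcher, which is the extra wrapping and ordering surface that the rejection above refers to.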

Alternative B: Single mega-PR, all 141 sites at once

All blueprints are ported in one PR. Reviewed once, merged once.

Rejected because: A 141-site change with no per-blueprint granularity is unreviewable, cannot be rolled back at the route level, and creates a single point of failure. A wrong mapping in secrets.py would require reverting all blueprints. The phased approach allows secrets (the highest-privilege blueprint) to ship last, after all other clusters have soaked.

PII collected: None. Decorator reads admin_id (UUID) from session, not email or PII.
Retention period: Audit rows written by _write_access_denied follow existing audit_log retention policy (unchanged).
Deletion on DSR: No new tables. Existing admin_id foreign key in audit_log is already covered by the DSR deletion path.
Audit trail: Every 403 writes an access_denied audit row. Both legacy and V2 decorators call _write_access_denied. No reduction in audit coverage.
Stored credentials: None. V2 reads live from rbac_* tables per request; no caching of role assertions.
Breach notification path: Unchanged from existing console breach path.
Secrets location + rotation: No secrets introduced. The FLAG_RBAC_V2 env var is a boolean toggle in Heroku config, not a secret.
Kill-switch: Tagged SHA redeploy (rbac-legacy-baseline tag) restores legacy decorator behaviour within one Heroku deploy cycle.

Revisit when

A second operator is added to the team, triggering the first real group-membership change and surfacing any gaps in the group→role assignment seed data.
The admin_roles table is dropped (post-30d soak card following this cutover), at which point the legacy path ceases to exist and this ADR is fully executed.
Queue Phase 1 ships and identity ownership moves to Queue service, potentially requiring the RBAC resolution path to cross service boundaries.

OQ-6 Resolution (2026-06-19)

Question: Machine tokens have no rbac_user_groups rows. Will they pass _admin_has_rbac_role()? If not, how do we handle the 3 dual-auth routes (/api/status/sites, /api/status/builds, /api/status/secrets) so S5 can complete and the admin_roles table can be dropped in #973?

Ground truth (verified by code trace): Machine tokens do NOT pass _admin_has_rbac_role(). Tracing the decorator stack on these 3 routes today:

@machine_auth_or_session   # outer — wraps the require_role wrapper
@require_role(...)          # inner — the callable that machine_auth_or_session sees as `func`

When FLAG_CONSOLE_MACHINE_AUTH is ON and a valid CF-JWT or Bearer token arrives, machine_auth_or_session sets g.machine_auth = True and calls func(*args, **kwargs) — but func is the require_role wrapper, not the bare view. require_role immediately reads request.cookies.get(COOKIE_NAME), finds no cookie, and returns redirect(url_for("auth.login")) (302). Machine-token access to these 3 routes has always been broken in this way. The routes have never served machine clients successfully.

Resolution chosen: Option A — g.machine_auth short-circuit at the top of the require_rbac_role inner wrapper.

# console/app/middleware/rbac.py — require_rbac_role inner wrapper
def wrapper(*args: Any, **kwargs: Any) -> Any:
    # Machine-authenticated infra requests bypass the human RBAC role check.
    # @machine_auth_or_session has already validated the CF-JWT or Bearer token
    # cryptographically. Machine tokens carry no session cookie and must not
    # be redirected to /auth/login. g.admin_id is intentionally not set here;
    # views must tolerate None (the audit write at /api/status/secrets already
    # uses getattr(g, "admin_id", None)).
    if getattr(g, "machine_auth", False):
        return fn(*args, **kwargs)

    token = request.cookies.get(COOKIE_NAME)
    ...  # remainder unchanged
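
The effect of the short-circuit can be reproduced without Flask. The sketch below models g and request as plain namespaces and uses simplified stand-ins for all three decorators; token validation and the role lookup are elided, and the cookie name and legacy role argument are placeholders. Only the call order and the g.machine_auth check mirror the description in this section.

```python
from functools import wraps
from types import SimpleNamespace
from typing import Any, Callable

# Stand-ins for flask.g and flask.request so the sketch runs on its own.
g = SimpleNamespace()
request = SimpleNamespace(cookies={}, headers={})
COOKIE_NAME = "session"  # placeholder value, not the real cookie name
LOGIN_REDIRECT = (302, "/auth/login")


def machine_auth_or_session(func: Callable) -> Callable:
    @wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        # Real CF-JWT / Bearer validation is elided; any Bearer header counts here.
        if request.headers.get("Authorization", "").startswith("Bearer "):
            g.machine_auth = True
        return func(*args, **kwargs)  # func is the inner role wrapper, not the view
    return wrapper


def require_role(role: str) -> Callable:
    """Legacy shape: goes straight to the cookie check."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if request.cookies.get(COOKIE_NAME) is None:
                return LOGIN_REDIRECT
            return fn(*args, **kwargs)  # role lookup elided
        return wrapper
    return decorator


def require_rbac_role(role: str) -> Callable:
    """V2 shape with the Option A short-circuit ahead of the cookie check."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if getattr(g, "machine_auth", False):
                return fn(*args, **kwargs)
            if request.cookies.get(COOKIE_NAME) is None:
                return LOGIN_REDIRECT
            return fn(*args, **kwargs)  # role lookup elided
        return wrapper
    return decorator


@machine_auth_or_session
@require_role("legacy-role-placeholder")
def status_legacy() -> tuple:
    return (200, "ok")


@machine_auth_or_session
@require_rbac_role("console-user")
def status_v2() -> tuple:
    return (200, "ok")


def call(view: Callable, *, bearer: bool) -> Any:
    """Reset per-request state, optionally attach a Bearer header, then invoke the view."""
    g.__dict__.clear()
    request.cookies = {}
    request.headers = {"Authorization": "Bearer token"} if bearer else {}
    return view()
```

With these stand-ins, call(status_legacy, bearer=True) yields the login redirect, call(status_v2, bearer=True) reaches the view, and call(status_v2, bearer=False) is still redirected because neither a machine token nor a session cookie is present.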

The 3 routes convert to @require_rbac_role(...) per the mapping table in rbac-blueprint-cutover.md §3.6:

# GET /api/status/sites
@machine_auth_or_session
@require_rbac_role("console-user")

# GET /api/status/builds
@machine_auth_or_session
@require_rbac_role("console-audit-user")

# GET /api/status/secrets
@machine_auth_or_session
@require_rbac_role("console-secrets-admin")

Why not Option B (enroll machine identity in RBAC group): Pollutes the user-group model with a non-human principal, creates a permanent maintenance burden across migrations, and introduces a credential-adjacent record that audit trails treat as if it were an admin. Option A is structurally cleaner and aligns with the layered-auth design intent.

Security posture of /api/status/secrets: This endpoint returns Infisical secret metadata (names, paths, last-rotation timestamps) — not secret values. Machine-token access to secret metadata is operationally legitimate for drift-monitoring agents. The human path (@require_rbac_role("console-secrets-admin")) is unchanged and enforces the tightest RBAC gate. The audit write (line 511) already uses getattr(g, "admin_id", None) — no change required there.

Note for all future routes using @require_rbac_role: Any route that places @machine_auth_or_session above @require_rbac_role (that is, as the outer decorator) will automatically inherit the machine-auth bypass. This is correct behavior; document it in the decorator docstring when implementing S5.

No operator decision required. This is a safe mechanical resolution within the documented cutover intent. Machine auth requires a provisioned CF-JWT (bound to the CF Access application) or an HMAC Bearer token already controlled by the operator.

Additional #973 scope identified by OQ-6 analysis

These call sites read admin_roles (the table or the Admin.has_role() method which reads the AdminRole relationship) and are NOT covered by any existing S1–S10 sub-card.

#973 must handle all of them before dropping the table:

Location	What it reads	Required change
`services/admins_online.py:108–114`	Joins `AdminRole` to filter active sessions by role	Rewrite query against `rbac_user_groups → rbac_group_roles → rbac_roles`
`commands/bootstrap.py:83,102`	Seeds an `AdminRole` row at bootstrap	Rewrite to seed `rbac_user_groups` membership instead
`middleware/env_guard.py:70`	`admin.has_role("superadmin")` belt-and-suspenders check	Rewrite to `_admin_has_rbac_role(admin.id, "console-manager")`
`blueprints/flags.py:619,650,1080`	Inline `admin.has_role()` for within-view superadmin/ops branching	Rewrite to `_admin_has_rbac_role(g.admin_id, ...)` — part of S6
`blueprints/customers.py:719`	`admin.has_role("superadmin")` for PII visibility gating	Part of S3/customers sub-card
`blueprints/dashboard.py:513`	`admin.has_role("superadmin")` for secrets-alert inclusion	Part of S5
`blueprints/deploy_freeze.py:189`	`admin.has_role(...)` inline call	Part of S4
`models/admin.py`	`AdminRole` model, `Admin.roles` relationship, `has_role()`, `primary_role()`	Delete in #973 after all callers ported
`models/__init__.py`	Re-exports `AdminRole`, `RoleEnum`	Remove exports in #973
`migrations/env.py:14`	Imports `AdminRole` for autogenerate	Remove in #973

The admins_online.py rewrite is the most consequential: it is the only non-blueprint service that directly queries admin_roles with a JOIN, and it is not assigned to any existing sub-card. Create a dedicated task within #973 or as a prerequisite sub-card.
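
The join path named in the table (rbac_user_groups → rbac_group_roles → rbac_roles) can be sketched against an in-memory SQLite database. Every column name below (admin_id, group_id, role_id, id, name) is an assumption, since this ADR names only the tables; the helper names are hypothetical and the real _admin_has_rbac_role may differ.

```python
import sqlite3

# Assumed minimal schema; only the table names come from the ADR.
SCHEMA = """
CREATE TABLE rbac_roles       (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE rbac_group_roles (group_id INTEGER NOT NULL, role_id INTEGER NOT NULL);
CREATE TABLE rbac_user_groups (admin_id TEXT NOT NULL, group_id INTEGER NOT NULL);
"""

JOIN_PATH = """
FROM rbac_user_groups AS ug
JOIN rbac_group_roles AS gr ON gr.group_id = ug.group_id
JOIN rbac_roles       AS r  ON r.id = gr.role_id
"""


def admin_has_role(conn: sqlite3.Connection, admin_id: str, role_name: str) -> bool:
    """True if the admin belongs to any group that grants the named role."""
    sql = f"SELECT 1 {JOIN_PATH} WHERE ug.admin_id = ? AND r.name = ? LIMIT 1"
    return conn.execute(sql, (admin_id, role_name)).fetchone() is not None


def admins_with_role(conn: sqlite3.Connection, role_name: str) -> list[str]:
    """All admin ids holding the role via group membership, sorted and de-duplicated."""
    sql = f"SELECT DISTINCT ug.admin_id {JOIN_PATH} WHERE r.name = ? ORDER BY ug.admin_id"
    return [row[0] for row in conn.execute(sql, (role_name,))]


def demo_db() -> sqlite3.Connection:
    """Seed two admins: a1 holds both roles through group 10, a2 only console-user."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO rbac_roles VALUES (?, ?)",
                     [(1, "console-manager"), (2, "console-user")])
    conn.executemany("INSERT INTO rbac_group_roles VALUES (?, ?)",
                     [(10, 1), (10, 2), (20, 2)])
    conn.executemany("INSERT INTO rbac_user_groups VALUES (?, ?)",
                     [("a1", 10), ("a2", 20)])
    return conn
```

A principal with no rbac_user_groups rows, such as a machine token, matches nothing along this path, which is consistent with the OQ-6 finding above. An admins_online.py rewrite would presumably add its session filter on top of the same three-table join.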

The admin_roles table-drop migration must come last, after all items above are merged and have soaked in staging for 24 hours.