Raxx · internal docs

ADR-0035 — Staging-to-prod flag promotion: explicit promotion queue over ambient drift

Status: Proposed
Date: 2026-05-01 UTC
Refs: docs/architecture/console-flag-promotion-flow.md, #551, #552, #554, #353, ADR-0028


Context

The feature flag system (ADR-0026, ADR-0027) stores prod and staging flag values as fully independent per-env rows in console_feature_flags. This solves the drift problem by giving operators explicit control per env, but it introduces a coordination problem: an operator who has verified a flag on staging must remember to flip it on prod. There is no queue, no record of intent, and no gate enforcing the soak discipline.

The result is the same drift in the other direction: staging validates the flag but prod never gets it, or prod gets it immediately without any soak evidence.

Three approaches were considered:

A — Mirror on flip: when an operator flips a flag on staging, automatically flip prod as well (or offer a one-click "also flip prod" option in the same action).

B — Explicit promotion queue: flipping a staging flag is a separate act from promoting it to prod. An operator explicitly marks a flag "active for prod," which creates a promotion record with a soak clock. After soak plus an approval action, the flag flips on prod.

C — Ambient promotion scheduler: a background job that watches for staging/prod divergence and auto-promotes flags that have been on staging for more than N hours, with no per-flag operator action required.


Decision

Adopt Option B: an explicit promotion queue (console_flag_promotions table) with per-flag soak periods, state-machine lifecycle, manual approval by default, and opt-in auto-promote only for flags classified risk: low.
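
The gating rule in this decision can be sketched as a small state check. This is a minimal illustration, not the schema of console_flag_promotions: the state names, field names, and the `may_flip_prod` helper are all hypothetical, and the only behavior taken from the decision is that a flip needs an elapsed soak plus either an explicit approval or opt-in auto-promote on a flag classified risk: low.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class PromotionState(Enum):
    # Hypothetical state names; the ADR only says "state-machine lifecycle".
    SOAKING = "soaking"      # marked for prod, soak clock running
    PROMOTED = "promoted"    # prod flag has been flipped
    CANCELLED = "cancelled"  # withdrawn before promotion


@dataclass
class Promotion:
    flag_key: str
    marked_at: datetime
    soak: timedelta                    # per-flag soak period (may be zero)
    risk: str = "high"                 # assumed values: "low" / "high"
    auto_promote: bool = False         # opt-in, honored only for risk == "low"
    approved_by: Optional[str] = None  # set by the manual approval action
    state: PromotionState = PromotionState.SOAKING

    def soak_elapsed(self, now: datetime) -> bool:
        return now - self.marked_at >= self.soak

    def may_flip_prod(self, now: datetime) -> bool:
        """True only when the soak has elapsed and the promotion is approved."""
        if self.state is not PromotionState.SOAKING:
            return False
        if not self.soak_elapsed(now):
            return False
        if self.approved_by is not None:
            return True  # manual approval is the default path
        return self.auto_promote and self.risk == "low"


if __name__ == "__main__":
    t0 = datetime(2026, 5, 1, tzinfo=timezone.utc)
    p = Promotion("example_flag", marked_at=t0, soak=timedelta(hours=24))
    print(p.may_flip_prod(t0 + timedelta(hours=25)))  # soaked but not approved
    p.approved_by = "operator@example"
    print(p.may_flip_prod(t0 + timedelta(hours=25)))  # soaked and approved
```

Keeping the check as a pure function of the record and the current time means the same rule can back both the approval button and any later auto-promote job.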

Key design points:


Consequences

Positive:
- Drift between staging and prod is explicit and visible. The /flags/promotions page is a queue; an empty queue means no pending promotions.
- Every promotion generates an audit row (console.flag.promoted) with full context: who marked, who approved, soak elapsed, staging value at mark time.
- ADR-0028 friction is preserved. The human checkpoint exists before the prod flip fires.
- The promotion model is independent of the deploy flow (#731). Flag promotions and code deploys are separate patterns with separate audit trails. Coupling them in v1 would create an ordering constraint that makes each harder to reason about.
- Risk classification in YAML is groundwork that does not impose runtime cost until auto-promote is wired.
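
As an illustration of the audit context listed above, the following sketch builds one console.flag.promoted row. Only the event name and the four pieces of context (who marked, who approved, soak elapsed, staging value at mark time) come from this ADR; the function name, the key names, and the choice of seconds as the unit are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def build_promotion_audit_row(
    flag_key: str,
    marked_by: str,
    approved_by: str,
    marked_at: datetime,
    approved_at: datetime,
    staging_value_at_mark: Any,
) -> Dict[str, Any]:
    """Assemble a hypothetical audit payload for one promotion."""
    return {
        "event": "console.flag.promoted",  # event name from the ADR
        "flag_key": flag_key,
        "marked_by": marked_by,
        "approved_by": approved_by,
        # Soak actually observed, not the configured minimum.
        "soak_elapsed_seconds": int((approved_at - marked_at).total_seconds()),
        # Captured at mark time so later staging flips do not rewrite history.
        "staging_value_at_mark": staging_value_at_mark,
    }


if __name__ == "__main__":
    t0 = datetime(2026, 5, 1, tzinfo=timezone.utc)
    row = build_promotion_audit_row(
        "example_flag", "alice", "bob", t0, t0 + timedelta(hours=26), True
    )
    print(row)
```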

Negative / tradeoffs:
- Two-step flow adds ceremony for the operator. A staging flag that is verified and clearly low-risk still requires a mark + soak + approve cycle. This is a deliberate cost: the ceremony is the checkpoint.
- Soak period creates latency between "staging verified" and "prod live." For urgent changes, the operator can reduce the per-flag soak to 0 or use the existing manual prod flip. The promotion flow is not the only prod-flip path.
- The console_flag_promotions table is a new migration dependency downstream of #552.
- Slack notification on every promotion requires a bot token in vault. If the Slack call fails, the promotion must not be blocked: Slack is observability, not a gate.
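
The "Slack is observability, not a gate" constraint can be sketched as a best-effort notification wrapped around the flip. The function and logger names are hypothetical, and the two callables stand in for the real prod flip and Slack call, neither of which this ADR specifies.

```python
import logging
from typing import Callable

log = logging.getLogger("flag_promotions")  # hypothetical logger name


def promote_with_notification(
    flip_prod: Callable[[], None],
    notify: Callable[[], None],
) -> bool:
    """Flip prod, then notify on a best-effort basis.

    Returns True if the notification was delivered, False if it failed.
    A failure in flip_prod still propagates: only the notification is optional.
    """
    flip_prod()
    try:
        notify()
    except Exception:
        # The promotion has already happened; record the failure and move on.
        log.warning("promotion notification failed", exc_info=True)
        return False
    return True


if __name__ == "__main__":
    def failing_notify() -> None:
        raise RuntimeError("simulated Slack outage")

    delivered = promote_with_notification(lambda: None, failing_notify)
    print("notification delivered:", delivered)
```

Ordering the flip before the notification keeps a Slack outage from delaying the promotion; returning the delivery status lets the caller surface a missed notification without treating it as a promotion failure.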


Alternatives Considered

Option A — Mirror on flip

Immediately mirroring a staging flip to prod collapses the soak discipline entirely. The operator has not observed the flag running in staging; the flip and the mirror are simultaneous. The only remaining protection is a confirm dialog — which operators click through. Rejected: this is Option C with extra clicks.

Option C — Ambient promotion scheduler

An ambient scheduler that promotes flags silently based on age violates ADR-0028's intentional-friction principle. The operator's human checkpoint disappears. A staging flag that has been on for 25 hours could flip prod without any explicit operator decision. Rejected: no silent prod mutations.

Option B with deploy-flow coupling (#731)

An alternative version of Option B queues approved promotions as deploy intents (riding on the console_deploys table from the deploy-flow design). This would give promotions a unified lifecycle with code deploys. Rejected for v1: the blast radius of a flag flip is far smaller than that of a code deploy, the approval semantics are different (soak + click vs. type-to-confirm phrase), and tying them together means neither can land independently. Coupling is a v2 consideration once both systems are stable.