Raxx · internal docs

Console Flag Promotion Flow — Staging to Prod

Status: Proposed
Owner: software-architect
Date: 2026-05-01 UTC
Parent epic: #353 (env-switcher), #551 (in-console feature flag management)
Related ADRs: ADR-0035, ADR-0028, ADR-0026, ADR-0027
Related docs: console-feature-flags.md, console-env-switcher.md


1. Context

The env-switcher (#353) and in-console flag management (#551) give operators per-env flag control. A flag can be on in staging and off in prod — or vice versa. The missing piece is an explicit workflow for moving a verified staging flag to prod: the operator has soaked a feature in staging, confirmed it behaves correctly, and wants prod to match.

Without a promotion flow, the operator either:

  1. Manually flips the prod flag (no record of soak, no link to staging verification), or
  2. Forgets to flip prod at all (silent divergence).

This design adds a promotion queue: a structured workflow where operators mark a flag "active for prod," a soak clock runs, and an explicit approval action performs the prod flip. Every step is audited. No prod flag changes silently.

What this is not:

  - Not a code deploy trigger: flag promotions and code deploys are separate (see ADR-0035).
  - Not an auto-promotion system: v1 is manual-approval-only for all flags (ADR-0035, Consequences).
  - Not a paper-first gate replacement: paper-first gating is a hard invariant and is not a feature flag. This system must never manage paper-first gating.


2. Invariants

All platform invariants apply. Promotion-specific:

  1. No stored credentials. Promotion records store flag keys and boolean values. No secrets, no tokens, no replayable values.
  2. Every promotion action is audited. Mark, approve, promote, reject, and expire all write immutable console_audit_log rows. Action names: console.flag.mark_promote, console.flag.approved, console.flag.promoted, console.flag.rejected, console.flag.expired.
  3. Staging value is captured at mark time. staging_value_at_mark is written when the operator marks active-for-prod. The promotion carries that snapshot — not whatever staging happens to be at promote time. If staging reverts between mark and promote, the promotion reflects the verified state, not the reverted state. This is a design invariant: the operator is promoting what they observed, not what staging currently says.
  4. No auto-revert from prod. A promoted flag stays on in prod until an operator manually flips it off. Staging changes never reach back to undo a prod promotion.
  5. Promotion is a forward-only gate. rejected and expired states are terminal. A flag in a terminal state requires a fresh mark-active-for-prod action to re-enter the queue. No "undo rejection" action.
  6. High-risk flags require typed confirmation. Flags classified risk: high in feature_flags.yaml require the operator to type the confirmation phrase before the promote action fires. Same friction as ADR-0028 deploy gating.
  7. Slack notification is observability, not a gate. If the Slack call to D0AJ7K184TV fails, the promotion proceeds and completes. The failure is logged in console_audit_log as a non-blocking warning. Promotion must never be blocked by a notification dependency.
  8. Paper-first gating is a hard invariant and must not be a feature flag. If a flag key in feature_flags.yaml attempts to gate the paper-first check, feature-developer must stop and escalate. The promotion system does not create an exception to this rule.

3. Data Model

3.1 console_flag_promotions table

New table, downstream of console_feature_flags (#552 migration 0005).

CREATE TABLE console_flag_promotions (
    id                      TEXT     PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
    flag_key                TEXT     NOT NULL,
    marked_active_for_prod_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    marked_by               TEXT     NOT NULL REFERENCES admins(id) ON DELETE RESTRICT,
    staging_value_at_mark   INTEGER  NOT NULL CHECK (staging_value_at_mark IN (0, 1)),
    prod_target_value       INTEGER  NOT NULL CHECK (prod_target_value IN (0, 1)),
    soak_until_at           DATETIME NOT NULL,
    state                   TEXT     NOT NULL DEFAULT 'pending'
                            CHECK (state IN ('pending', 'approved', 'promoted', 'rejected', 'expired')),
    approved_at             DATETIME,
    approved_by             TEXT     REFERENCES admins(id) ON DELETE SET NULL,
    promoted_at             DATETIME,
    rejection_reason        TEXT,
    audit_correlation_id    TEXT     REFERENCES console_audit_log(id) ON DELETE SET NULL
    -- no UNIQUE (flag_key, state): it would also block repeated rejected/expired rows (see Notes)
);

CREATE INDEX idx_cfp_state       ON console_flag_promotions (state);
CREATE INDEX idx_cfp_flag_key    ON console_flag_promotions (flag_key);
CREATE INDEX idx_cfp_soak_until  ON console_flag_promotions (soak_until_at)
    WHERE state = 'pending';

Notes:

  - Uniqueness: the table carries no UNIQUE (flag_key, state) constraint, matching migration 0006 in §3.3. That constraint would prevent a flag from holding two rejected or two expired rows, which happens as soon as a flag is re-marked after a terminal state (invariant 5). SQLite table constraints cannot be partial, so feature-developer should enforce "only one pending or approved promotion per flag_key" at the service layer. SQLite does support partial unique indexes (CREATE UNIQUE INDEX ... WHERE), which could act as a database-level backstop, but no such index is part of this design.
  - marked_by uses ON DELETE RESTRICT, so a pending promotion blocks deletion of the marking admin. This is intentional: orphaned pending promotions with no accountable actor are dangerous.
  - approved_by uses ON DELETE SET NULL. If the approver is erased under DSR, the promotion record is preserved (audit compliance) but the approver FK is nulled.
  - audit_correlation_id links the promotion record to the first audit log entry written at mark time, enabling a single JOIN to retrieve the full audit trail.
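
The "one live promotion per flag" rule can be illustrated with a partial unique index, which SQLite supports even though table constraints cannot be partial. The following is a minimal sketch, not part of migration 0006: the index name idx_cfp_one_live and the trimmed-down table are assumptions made for illustration only.

```python
import sqlite3

# Trimmed-down table: only the columns needed to show the uniqueness rule.
SCHEMA = """
CREATE TABLE console_flag_promotions (
    id       INTEGER PRIMARY KEY,
    flag_key TEXT NOT NULL,
    state    TEXT NOT NULL DEFAULT 'pending'
             CHECK (state IN ('pending','approved','promoted','rejected','expired'))
);
-- Hypothetical index (not in migration 0006): at most one live row per flag.
CREATE UNIQUE INDEX idx_cfp_one_live
    ON console_flag_promotions (flag_key)
    WHERE state IN ('pending', 'approved');
"""


def try_insert(conn, flag_key, state):
    """Return True if the row was inserted, False if the index rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO console_flag_promotions (flag_key, state) VALUES (?, ?)",
                (flag_key, state),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

results = [
    try_insert(conn, "console_billing", "rejected"),  # terminal rows may repeat
    try_insert(conn, "console_billing", "rejected"),
    try_insert(conn, "console_billing", "pending"),   # first live row
    try_insert(conn, "console_billing", "approved"),  # second live row
]
```

Only the last insert is refused: two terminal rows coexist, while a second live row for the same flag violates the index. A service-layer check is still useful because it can return the promotion_already_pending error cleanly instead of surfacing a constraint failure.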

3.2 feature_flags.yaml additions

Two new per-flag fields are added alongside the existing risk: field (already scaffolded in #552):

flags:
  console_billing:
    default: false
    description: "Gates billing-specific permission checks"
    risk: high                # low | medium | high
    soak_period_hours: 48     # default: 24; overrides global default for this flag
    env_override: true
  console_dashboard_home:
    default: false
    description: "Dashboard home grid redesign"
    risk: low
    soak_period_hours: 4
    env_override: true

If soak_period_hours is absent, the system uses the global default (24h). The risk: field governs which approval path applies (see §4).
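
The fallback rule can be sketched as a small resolver. This is an illustration under assumptions: the function name soak_until, the constant name, and the flag example_no_override are hypothetical, and the dict stands in for the parsed feature_flags.yaml.

```python
from datetime import datetime, timedelta, timezone

GLOBAL_DEFAULT_SOAK_HOURS = 24  # global fallback stated in this design


def soak_until(flags_config, flag_key, marked_at):
    """Return the UTC instant at which the soak for flag_key ends."""
    flag = flags_config["flags"][flag_key]
    hours = flag.get("soak_period_hours", GLOBAL_DEFAULT_SOAK_HOURS)
    return marked_at + timedelta(hours=hours)


# Parsed form of the YAML above, plus a hypothetical flag with no override.
config = {
    "flags": {
        "console_billing": {"risk": "high", "soak_period_hours": 48},
        "console_dashboard_home": {"risk": "low", "soak_period_hours": 4},
        "example_no_override": {"risk": "medium"},
    }
}

marked_at = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
billing_until = soak_until(config, "console_billing", marked_at)
home_until = soak_until(config, "console_dashboard_home", marked_at)
default_until = soak_until(config, "example_no_override", marked_at)
```

With a mark at 12:00 UTC on 2026-05-01, console_billing soaks for 48 hours, console_dashboard_home for 4 hours, and the flag without an override falls back to 24 hours.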

3.3 Schema migration

Migration file: console/db/migrations/0006_flag_promotions.sql

-- up
CREATE TABLE console_flag_promotions (
    id                        TEXT     PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
    flag_key                  TEXT     NOT NULL,
    marked_active_for_prod_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    marked_by                 TEXT     NOT NULL REFERENCES admins(id) ON DELETE RESTRICT,
    staging_value_at_mark     INTEGER  NOT NULL CHECK (staging_value_at_mark IN (0, 1)),
    prod_target_value         INTEGER  NOT NULL CHECK (prod_target_value IN (0, 1)),
    soak_until_at             DATETIME NOT NULL,
    state                     TEXT     NOT NULL DEFAULT 'pending'
                              CHECK (state IN ('pending','approved','promoted','rejected','expired')),
    approved_at               DATETIME,
    approved_by               TEXT     REFERENCES admins(id) ON DELETE SET NULL,
    promoted_at               DATETIME,
    rejection_reason          TEXT,
    audit_correlation_id      TEXT     REFERENCES console_audit_log(id) ON DELETE SET NULL
);

CREATE INDEX idx_cfp_state       ON console_flag_promotions (state);
CREATE INDEX idx_cfp_flag_key    ON console_flag_promotions (flag_key);
CREATE INDEX idx_cfp_soak_until  ON console_flag_promotions (soak_until_at)
    WHERE state = 'pending';

-- down
DROP INDEX IF EXISTS idx_cfp_soak_until;
DROP INDEX IF EXISTS idx_cfp_flag_key;
DROP INDEX IF EXISTS idx_cfp_state;
DROP TABLE IF EXISTS console_flag_promotions;

Zero-downtime: additive table creation. Depends on migration 0005 (console_feature_flags) having been applied.


4. APIs / Contracts

4.1 POST /console/flags/<flag_key>/mark-promote

Mark a staging flag as "active for prod" — creates a console_flag_promotions row.

Auth: @require_role("superadmin").
Session constraint: session.selected_env must be staging. Marking from a prod context returns 409 {"error": "must_be_in_staging_context"}.
Pre-condition: no existing pending or approved promotion for this flag_key. If one exists, returns 409 {"error": "promotion_already_pending"}.
Body: none (flag_key comes from the path; the staging value is read from console_feature_flags at mark time).
Response: 201 Created {"promotion_id": "...", "soak_until_at": "...UTC..."}.
Side effects:

  - Reads the current staging value from console_feature_flags and stores it as staging_value_at_mark.
  - Reads soak_period_hours from feature_flags.yaml for this key (default 24).
  - Writes a console_flag_promotions row (state: pending, soak_until_at: now + soak_period_hours).
  - Writes console_audit_log: action="console.flag.mark_promote", payload {flag_key, staging_value, soak_until_at, marked_by}.
  - Stores audit_correlation_id on the new promotion row.
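
The checks and the insert for this endpoint can be sketched as a plain function. This is a minimal illustration, not the handler itself: MarkError, mark_promote, and the trimmed-down table (integer id, no foreign keys) are assumptions, and the audit log write is left out for brevity.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


class MarkError(Exception):
    """Carries the HTTP status and error code the handler would return."""

    def __init__(self, status, code):
        super().__init__(code)
        self.status = status
        self.code = code


def mark_promote(conn, *, flag_key, selected_env, staging_value, soak_hours, admin_id, now):
    """Create a pending promotion row, or raise MarkError on a failed check."""
    if selected_env != "staging":
        raise MarkError(409, "must_be_in_staging_context")
    live = conn.execute(
        "SELECT 1 FROM console_flag_promotions "
        "WHERE flag_key = ? AND state IN ('pending', 'approved')",
        (flag_key,),
    ).fetchone()
    if live is not None:
        raise MarkError(409, "promotion_already_pending")
    soak_until_at = (now + timedelta(hours=soak_hours)).isoformat()
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO console_flag_promotions "
            "(flag_key, marked_by, staging_value_at_mark, soak_until_at, state) "
            "VALUES (?, ?, ?, ?, 'pending')",
            (flag_key, admin_id, int(staging_value), soak_until_at),
        )
    return {"promotion_id": cur.lastrowid, "soak_until_at": soak_until_at}


# Demo against a trimmed-down table (assumption: real schema is in section 3.1).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE console_flag_promotions ("
    " id INTEGER PRIMARY KEY, flag_key TEXT NOT NULL, marked_by TEXT NOT NULL,"
    " staging_value_at_mark INTEGER NOT NULL, soak_until_at TEXT NOT NULL,"
    " state TEXT NOT NULL)"
)
now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
args = dict(flag_key="console_billing", staging_value=True, soak_hours=48,
            admin_id="admin-1", now=now)

created = mark_promote(conn, selected_env="staging", **args)
try:
    mark_promote(conn, selected_env="staging", **args)
    second_attempt = None
except MarkError as exc:
    second_attempt = (exc.status, exc.code)
```

The second call fails with promotion_already_pending because the first row is still pending. A check followed by an insert is not race-free on its own; two concurrent requests could both pass the check unless the database also enforces the rule.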

4.2 POST /console/flags/<flag_key>/promote

Approve and execute the promotion — flips the prod flag.

Auth: @require_role("superadmin").
Session constraint: session.selected_env must be prod. Promoting from a staging context returns 409.
Pre-condition: a promotion exists for this flag_key in state approved, or in state pending with soak_until_at already in the past. If the state is pending and the soak has not elapsed, returns 409 {"error": "soak_not_elapsed", "soak_until_at": "..."}.
Body (high-risk flags only): {"confirmation_phrase": "promote <flag_key> to prod"}. Mismatch returns 422.
Body (low/medium-risk flags): no body required; a ?confirm=1 query param signals intent.
Response: 200 {"promoted_at": "...UTC...", "prod_value": true}.
Side effects:

  1. Reads the current prod value from console_feature_flags (recorded as the "from" value in the audit entry).
  2. Calls flags.flip(flag_key, env="prod", value=staging_value_at_mark, admin_id=...) from the existing service (#552).
  3. Updates the console_flag_promotions row: sets state to promoted and fills promoted_at, approved_by, and approved_at.
  4. Writes console_audit_log: action="console.flag.promoted", payload {flag_key, from_value, to_value, promotion_id, soak_elapsed_hours, marked_by, approved_by}.
  5. Posts a Slack DM to D0AJ7K184TV: "Flag <flag_key> promoted to prod: <old> → <new> by <actor> after <soak_h> h soak." Non-blocking (fire-and-forget, with any error logged).
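
The pre-flip checks can be sketched as one function that returns the status the handler would send. This is an illustration under assumptions: check_promote and the error codes no_live_promotion and confirmation_mismatch are hypothetical names; only soak_not_elapsed and the 409/422 statuses come from the contract above.

```python
import hmac
from datetime import datetime, timezone


def check_promote(promotion, *, risk, body, now):
    """Return (status, payload) for the pre-flip checks; (200, {}) means proceed."""
    if promotion is None or promotion["state"] not in ("pending", "approved"):
        return 409, {"error": "no_live_promotion"}  # hypothetical error code
    if promotion["state"] == "pending" and now < promotion["soak_until_at"]:
        return 409, {
            "error": "soak_not_elapsed",
            "soak_until_at": promotion["soak_until_at"].isoformat(),
        }
    if risk == "high":
        expected = f"promote {promotion['flag_key']} to prod"
        supplied = str((body or {}).get("confirmation_phrase", ""))
        # Constant-time comparison; bytes are used so non-ASCII input is accepted.
        if not hmac.compare_digest(expected.encode(), supplied.encode()):
            return 422, {"error": "confirmation_mismatch"}  # hypothetical error code
    return 200, {}


promotion = {
    "flag_key": "console_billing",
    "state": "pending",
    "soak_until_at": datetime(2026, 5, 3, 12, 0, tzinfo=timezone.utc),
}
phrase = {"confirmation_phrase": "promote console_billing to prod"}
before_soak = datetime(2026, 5, 2, 12, 0, tzinfo=timezone.utc)
after_soak = datetime(2026, 5, 4, 12, 0, tzinfo=timezone.utc)

too_early = check_promote(promotion, risk="high", body=phrase, now=before_soak)
wrong_phrase = check_promote(promotion, risk="high",
                             body={"confirmation_phrase": "promote billing"}, now=after_soak)
allowed = check_promote(promotion, risk="high", body=phrase, now=after_soak)
```

Ordering the soak check before the phrase check means an early attempt gets the soak_not_elapsed response whether or not the phrase was typed correctly.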

4.3 POST /console/flags/<flag_key>/reject-promote

Manually reject a pending promotion.

Auth: @require_role("superadmin").
Body: {"reason": "string"} (optional but encouraged).
Response: 204 No Content.
Side effects: Updates state to rejected; writes console_audit_log (action="console.flag.rejected").

4.4 GET /console/flags/promotions

The promotions queue page.

Auth: @require_role("ops") for read; promote/reject buttons render only for superadmin.
Response: HTML page listing pending and approved promotions; historical promotions (promoted/rejected/expired) are shown in a collapsed section.

4.5 Expiry job

A scheduled background task (e.g. APScheduler or Celery beat, consistent with existing console scheduler pattern) runs every hour:

  1. Finds all pending promotions where marked_active_for_prod_at < NOW() - 7 days.
  2. Transitions state to expired.
  3. Writes console_audit_log for each expiry: action="console.flag.expired".

Expiry does not flip any flag. It is purely a cleanup of stale intent records.
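
The job body can be sketched independently of the scheduler that triggers it. This is a minimal illustration: expire_stale, the simplified two-column console_audit_log, and the trimmed-down promotions table are assumptions, and the audit payload omits age_hours for brevity.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone

EXPIRY_AGE = timedelta(days=7)
SQLITE_TS = "%Y-%m-%d %H:%M:%S"  # format written by SQLite's CURRENT_TIMESTAMP (UTC)


def expire_stale(conn, now):
    """Expire pending promotions older than EXPIRY_AGE; return their flag keys."""
    cutoff = (now - EXPIRY_AGE).strftime(SQLITE_TS)
    with conn:  # one transaction for the state changes and their audit rows
        stale = conn.execute(
            "SELECT id, flag_key FROM console_flag_promotions "
            "WHERE state = 'pending' AND marked_active_for_prod_at < ?",
            (cutoff,),
        ).fetchall()
        for promotion_id, flag_key in stale:
            conn.execute(
                "UPDATE console_flag_promotions SET state = 'expired' "
                "WHERE id = ? AND state = 'pending'",
                (promotion_id,),
            )
            conn.execute(
                "INSERT INTO console_audit_log (action, payload) VALUES (?, ?)",
                ("console.flag.expired",
                 json.dumps({"flag_key": flag_key, "promotion_id": promotion_id})),
            )
    return [flag_key for _, flag_key in stale]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE console_flag_promotions (
    id INTEGER PRIMARY KEY, flag_key TEXT NOT NULL, state TEXT NOT NULL,
    marked_active_for_prod_at TEXT NOT NULL
);
CREATE TABLE console_audit_log (action TEXT NOT NULL, payload TEXT NOT NULL);
INSERT INTO console_flag_promotions (flag_key, state, marked_active_for_prod_at) VALUES
    ('stale_flag', 'pending',  '2026-04-20 09:00:00'),
    ('fresh_flag', 'pending',  '2026-04-30 09:00:00'),
    ('done_flag',  'promoted', '2026-04-01 09:00:00');
""")
expired = expire_stale(conn, datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc))
```

Only stale_flag is expired: fresh_flag is inside the seven-day window and done_flag is not pending. The function touches console_flag_promotions and the audit log only, in line with the rule that expiry never flips a flag.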


5. State Machine + Sequences

5.1 Promotion state machine

                   [operator marks]
                         │
                         ▼
                      pending  ───[operator rejects]──────────────► rejected
                         │
                    soak elapsed?
                     Yes │  No (only soak-elapsed promotions
                         │       can be approved in v1)
                         ▼
                      approved ◄──[operator clicks approve]
                         │
                    high-risk?
                     Yes │  No
                         ▼    ▼
                     type     click
                    phrase   confirm
                         │    │
                         ▼    ▼
                      promoted

         pending ──[7 days without action]──► expired
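
The diagram can be read as a transition table. The sketch below encodes the edges exactly as drawn; ALLOWED, InvalidTransition, and transition are hypothetical names, not part of the existing service.

```python
# Edges as drawn in the diagram: only pending can be rejected or expire.
ALLOWED = {
    "pending": {"approved", "rejected", "expired"},
    "approved": {"promoted"},
    "promoted": set(),
    "rejected": set(),  # terminal: re-entry needs a fresh mark, i.e. a new row
    "expired": set(),   # terminal: same as rejected
}


class InvalidTransition(ValueError):
    """Raised when a state change is not an edge in the diagram."""


def transition(current, target):
    """Return target if current -> target is allowed, else raise."""
    if target not in ALLOWED.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


state = transition("pending", "approved")
state = transition(state, "promoted")

try:
    transition("rejected", "pending")
    undo_rejection_allowed = True
except InvalidTransition:
    undo_rejection_allowed = False
```

Keeping the table in one place makes invariant 5 easy to test: there is no edge out of rejected or expired, so "undo rejection" cannot be expressed.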

5.2 Mark-active-for-prod sequence

sequenceDiagram
    participant Op as Operator (staging context)
    participant Page as /console/flags (HTMX)
    participant Handler as POST /flags/<key>/mark-promote
    participant FlagSvc as flags.py service
    participant DB as console DB
    participant Audit as console_audit_log

    Op->>Page: Click "Mark active for prod" on console_billing row
    Page->>Handler: POST /console/flags/console_billing/mark-promote
    Handler->>Handler: Validate session.selected_env == "staging"
    Handler->>DB: SELECT state FROM console_flag_promotions WHERE flag_key=? AND state IN ('pending','approved')
    alt Promotion already pending
        DB-->>Handler: row exists
        Handler-->>Op: 409 promotion_already_pending
    else No active promotion
        Handler->>FlagSvc: flags.get("console_billing", env="staging")
        FlagSvc-->>Handler: staging_value = true
        Handler->>DB: INSERT console_flag_promotions (state=pending, soak_until_at=now+24h, staging_value_at_mark=1)
        Handler->>Audit: write console.flag.mark_promote
        Handler-->>Op: 201 {promotion_id, soak_until_at}
        Page->>Page: HTMX swaps row — shows yellow "Pending promotion" badge
    end

5.3 Approve-and-promote sequence

sequenceDiagram
    participant Op as Operator (prod context)
    participant Page as /flags/promotions
    participant Handler as POST /flags/<key>/promote
    participant FlagSvc as flags.py service
    participant DB as console DB
    participant Audit as console_audit_log
    participant Slack as Slack API

    Op->>Page: Click "Promote now" for console_billing
    Page->>Page: Render type-to-confirm modal (high-risk flag)
    Op->>Page: Types "promote console_billing to prod"
    Page->>Handler: POST /console/flags/console_billing/promote {confirmation_phrase: "..."}
    Handler->>Handler: Validate session.selected_env == "prod"
    Handler->>DB: SELECT promotion WHERE flag_key=? AND state IN ('pending','approved')
    Handler->>Handler: Validate soak_until_at <= NOW()
    Handler->>Handler: Validate confirmation_phrase (high-risk gate)
    Handler->>FlagSvc: flags.flip("console_billing", env="prod", value=staging_value_at_mark)
    FlagSvc->>DB: UPSERT console_feature_flags (flag_key="console_billing", env="prod", value=1)
    FlagSvc->>Audit: write console.flag.flip (existing audit path)
    Handler->>DB: UPDATE console_flag_promotions SET state=promoted, promoted_at=NOW(), approved_by=...
    Handler->>Audit: write console.flag.promoted {full context}
    Handler->>Slack: POST DM D0AJ7K184TV "Flag console_billing promoted..."
    Note over Handler,Slack: Slack failure is non-blocking; logged as warning
    Handler-->>Op: 200 {promoted_at, prod_value: true}
    Page->>Page: Badge turns green "Promoted"
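
The non-blocking Slack step in the sequence above can be sketched as a wrapper that never raises. This is an illustration under assumptions: notify_promoted and the send callable (taking a channel and a text) are hypothetical stand-ins for whatever Slack client the console already uses, and the sketch logs with the standard logging module where the design calls for a console_audit_log warning.

```python
import logging

logger = logging.getLogger("console.flag_promotions")

SLACK_DM_CHANNEL = "D0AJ7K184TV"


def notify_promoted(send, *, flag_key, old, new, actor, soak_hours):
    """Try to send the promotion DM; never raise. Return True if it was sent."""
    text = (
        f"Flag {flag_key} promoted to prod: {old} -> {new} "
        f"by {actor} after {soak_hours} h soak."
    )
    try:
        send(SLACK_DM_CHANNEL, text)
        return True
    except Exception:
        # Observability only: record the failure and let the promotion complete.
        logger.warning("slack notify failed for %s", flag_key, exc_info=True)
        return False


sent = []


def working_send(channel, text):
    sent.append((channel, text))


def broken_send(channel, text):
    raise ConnectionError("slack unreachable")


details = dict(flag_key="console_billing", old=False, new=True,
               actor="admin-1", soak_hours=48)
delivered = notify_promoted(working_send, **details)
survived_failure = notify_promoted(broken_send, **details)
```

Because the wrapper is called after the flag flip and the promotion row update, a Slack outage changes only the return value, never the outcome of the promotion.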

5.4 Expiry sequence

sequenceDiagram
    participant Scheduler as Background job (hourly)
    participant DB as console DB
    participant Audit as console_audit_log

    Scheduler->>DB: SELECT id, flag_key FROM console_flag_promotions WHERE state='pending' AND marked_active_for_prod_at < NOW() - 7d
    loop For each stale promotion
        Scheduler->>DB: UPDATE state='expired'
        Scheduler->>Audit: write console.flag.expired {flag_key, promotion_id, age_hours}
    end

6. Migrations

Migration 0006 (additive)

File: console/db/migrations/0006_flag_promotions.sql (up + down, see §3.3).

Dependency: Migration 0005 (console_feature_flags) must be applied first.

Deployment order:

  1. Apply migration 0005 on staging console DB (already covered by #552).
  2. Apply migration 0006 on staging console DB.
  3. Deploy console to staging. Promotion endpoints and UI are available behind the FLAG_CONSOLE_FLAG_PROMOTIONS=1 env var.
  4. After staging soak ≥ 24h: apply migration 0006 on prod console DB.
  5. Deploy console to prod with FLAG_CONSOLE_FLAG_PROMOTIONS=1 on prod.

Rollback: Drop the table (down migration). In-flight promotions are lost. The actual flag values in console_feature_flags are unaffected — the promotion queue is metadata, not authority. Rollback is safe.


7. Rollout Plan

| Phase | What lands | Gate |
| --- | --- | --- |
| Dark | Migration 0006 applied; promotion endpoints exist but not reachable from UI | Zero-downtime additive migration |
| Staging preview | FLAG_CONSOLE_FLAG_PROMOTIONS=1 on staging; /flags/promotions page live; mark + approve flow testable | Manual smoke test by operator on staging |
| Beta | Staging soak ≥ 24h; audit log reviewed; Slack notification confirmed; no expiry regressions | Staging sign-off |
| GA | FLAG_CONSOLE_FLAG_PROMOTIONS=1 on prod; expiry job confirmed running; ops runbook updated | Prod sign-off |

The feature is gated by FLAG_CONSOLE_FLAG_PROMOTIONS env var (Infisical-sourced), not a self-referential DB flag, to avoid a bootstrapping dependency on the very system being introduced.


8. Security Considerations

PII collected: None. marked_by and approved_by reference admins.id (UUID, not PII directly). Names are resolved at read time from admins table. rejection_reason is free text — operators should not enter PII here; the UI should include a note.

Retention: console_flag_promotions rows are operational records. Promoted/rejected/expired rows are kept for 90 days then eligible for archival (not deletion — the audit log entries reference them). The console_audit_log rows for promotion events follow the 2-year audit retention policy.

DSR erasure: marked_by uses ON DELETE RESTRICT, so deletion of an admin with an active (pending/approved) promotion is blocked. Operator must resolve pending promotions before the account can be erased. Because a foreign key constraint is not state-aware, the database also rejects the delete while any terminal row still references that admin in marked_by. approved_by uses ON DELETE SET NULL: terminal promotions are preserved, and the approver FK is nulled if the admin is erased. Both behaviors are consistent with the erasure policy in rbac-design.md §10.
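
The two foreign key behaviors can be observed directly in SQLite. The sketch below uses a trimmed-down schema and made-up admin ids; it assumes foreign key enforcement is switched on for the connection, which SQLite requires per connection via PRAGMA foreign_keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.executescript("""
CREATE TABLE admins (id TEXT PRIMARY KEY);
CREATE TABLE console_flag_promotions (
    id          INTEGER PRIMARY KEY,
    flag_key    TEXT NOT NULL,
    state       TEXT NOT NULL,
    marked_by   TEXT NOT NULL REFERENCES admins(id) ON DELETE RESTRICT,
    approved_by TEXT REFERENCES admins(id) ON DELETE SET NULL
);
INSERT INTO admins VALUES ('marker'), ('approver');
INSERT INTO console_flag_promotions (flag_key, state, marked_by, approved_by)
VALUES ('console_billing', 'promoted', 'marker', 'approver');
""")


def delete_admin(conn, admin_id):
    """Return True if the admin row was deleted, False if a FK blocked it."""
    try:
        with conn:
            conn.execute("DELETE FROM admins WHERE id = ?", (admin_id,))
        return True
    except sqlite3.IntegrityError:
        return False


approver_deleted = delete_admin(conn, "approver")  # SET NULL: delete goes through
marker_deleted = delete_admin(conn, "marker")      # RESTRICT: blocked by the promoted row
approved_by_after = conn.execute(
    "SELECT approved_by FROM console_flag_promotions"
).fetchone()[0]
```

Deleting the approver succeeds and nulls approved_by, while deleting the marker fails even though the only referencing row is already promoted.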

Audit trail: Five distinct audit actions cover every state transition. Audit rows are immutable; no delete path exists. The audit_correlation_id FK on console_flag_promotions links the row to its first audit entry for efficient forensics.

No credential storage: Promotion records store flag keys (strings) and values (0/1 integers). This system must not be extended to store anything else. The rejection_reason field is free text; the service layer must enforce a max-length (e.g. 500 chars) and no HTML/code injection.
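
One way to meet the rejection_reason requirement is to bound the length at write time and escape at render time. This is a sketch under assumptions: both function names are hypothetical, the 500-character limit is the example figure from the paragraph above, and collapsing whitespace is an added convenience rather than a stated requirement.

```python
import html

MAX_REASON_CHARS = 500  # example limit from this section


def clean_rejection_reason(raw):
    """Normalize a rejection reason for storage; None means no reason given."""
    if raw is None:
        return None
    reason = " ".join(str(raw).split())  # trim and collapse whitespace
    if not reason:
        return None
    if len(reason) > MAX_REASON_CHARS:
        raise ValueError(f"rejection_reason exceeds {MAX_REASON_CHARS} characters")
    return reason


def render_rejection_reason(reason):
    """Escape a stored reason for inclusion in an HTML page."""
    return html.escape(reason or "")


stored = clean_rejection_reason("  <b>late</b> & noisy   in staging ")
rendered = render_rejection_reason(stored)
```

Storing the raw text and escaping on output keeps the audit record faithful to what the operator typed while preventing markup from reaching the promotions page.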

Confirmation phrase gate for high-risk flags: risk: high flags require {"confirmation_phrase": "promote <flag_key> to prod"} in the POST body. The phrase is validated with a constant-time string comparison to avoid a timing oracle. A mismatched phrase returns 422 without indicating which part was wrong.

Slack notification posture: The Slack DM to D0AJ7K184TV is outbound only. The bot token lives in Infisical. If the token is compromised, an attacker can post to the DM channel — they cannot read console data. The token is rotatable without redeploy (Infisical env var).

Kill-switch: FLAG_CONSOLE_FLAG_PROMOTIONS=0 disables all promotion endpoints. The console_flag_promotions table is unaffected (data preserved). Any approved-but-not-yet-promoted promotions resume when the flag is re-enabled.
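
The kill-switch check can be sketched as a single predicate over the environment. The function name promotions_enabled is hypothetical, and treating every value other than "1" as disabled is an assumption; the design only specifies the values 1 and 0.

```python
import os


def promotions_enabled(environ=None):
    """Return True only when FLAG_CONSOLE_FLAG_PROMOTIONS is exactly "1"."""
    env = os.environ if environ is None else environ
    return env.get("FLAG_CONSOLE_FLAG_PROMOTIONS", "0") == "1"


enabled = promotions_enabled({"FLAG_CONSOLE_FLAG_PROMOTIONS": "1"})
disabled = promotions_enabled({"FLAG_CONSOLE_FLAG_PROMOTIONS": "0"})
unset = promotions_enabled({})
```

Defaulting to disabled when the variable is missing keeps the endpoints dark in any environment where the gate was never configured.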

Breach posture: console_flag_promotions contains flag keys, boolean values, and admin UUIDs. Exfiltration reveals which features were promoted and by whom — operational sensitivity, not user PII. Existing breach-notification automation applies; no new notification surface is created.

Secrets location: FLAG_CONSOLE_FLAG_PROMOTIONS (rollout gate) and SLACK_BOT_TOKEN (DM notify) live in Infisical. Neither appears in code. Both are rotatable without redeploy.


9. Open Questions

These require a decision from Kristerpher before sub-cards can be claimed for GA:

  1. Default soak period. The design uses 24h as default. Is this correct, or should the default be longer (e.g. 48h) for an initial conservative posture? Per-flag override exists regardless, but the fallback default matters for flags that have no soak_period_hours: set in YAML.

  2. Auto-promote in v2: opt-in per-flag or opt-out? The risk: low classification is scaffolded for v2 auto-promote. When v2 lands, should auto-promote default to OFF (operator must explicitly set auto_promote: true in YAML for each flag) or default to ON for all risk: low flags (operator must set auto_promote: false to suppress)? The design doc defaults to opt-in-per-flag (OFF unless explicitly enabled), but confirmation is needed before v2 sub-cards are written.

  3. High-risk flag classification: manual in YAML or inferred? Current design: risk: is set manually in feature_flags.yaml by the operator. Alternatively, risk could be inferred from the flag key (e.g., any flag containing billing, vault, or deploy in its name is automatically high). Manual is explicit but requires discipline. Inference is automatic but may mislabel. Which posture should v1 implement?

  4. Expired/rejected promotion history: visible forever or auto-archived? The design retains terminal promotion records for 90 days. Should expired and rejected promotions be visible on /flags/promotions by default (under a collapsed section), or should they be hidden and only accessible via the audit log page? Default UX treatment affects the promotions page design.

  5. Mark-from-prod context. Current design: marking active-for-prod requires session.selected_env == staging. This forces the operator to be in staging context when they mark. Is this the right friction, or should marking be possible from any context (the staging value is read from DB regardless of session env)? Forcing the staging context means the operator has to be "in staging" to initiate a promotion — a reasonable gate but adds a step.