- Status: Proposed
- Owner: software-architect
- Date: 2026-05-01 UTC
- Parent epic: #353 (env-switcher), #551 (in-console feature flag management)
- Related ADRs: ADR-0035, ADR-0028, ADR-0026, ADR-0027
- Related docs: console-feature-flags.md, console-env-switcher.md
The env-switcher (#353) and in-console flag management (#551) give operators per-env flag control. A flag can be on in staging and off in prod — or vice versa. The missing piece is an explicit workflow for moving a verified staging flag to prod: the operator has soaked a feature in staging, confirmed it behaves correctly, and wants prod to match.
Without a promotion flow, the operator either:
1. Manually flips the prod flag (no record of soak, no link to staging verification), or
2. Forgets to flip prod at all (silent divergence).
This design adds a promotion queue: a structured workflow where operators mark a flag "active for prod," a soak clock runs, and an explicit approval action performs the prod flip. Every step is audited. No prod flag changes silently.
What this is not:
- Not a code deploy trigger: flag promotions and code deploys are separate (see ADR-0035).
- Not an auto-promotion system: v1 is manual-approval-only for all flags (ADR-0035, Consequences).
- Not a paper-first gate replacement: paper-first gating is a hard invariant and is not a feature flag. This system must never manage paper-first gating.
All platform invariants apply. Promotion-specific:
- Every state transition writes console_audit_log rows. Action names: console.flag.mark_promote, console.flag.approved, console.flag.promoted, console.flag.rejected, console.flag.expired.
- staging_value_at_mark is written when the operator marks active-for-prod. The promotion carries that snapshot, not whatever staging happens to be at promote time. If staging reverts between mark and promote, the promotion reflects the verified state, not the reverted state. This is a design invariant: the operator is promoting what they observed, not what staging currently says.
- Flags marked risk: high in feature_flags.yaml require the operator to type the confirmation phrase before the promote action fires. Same friction as ADR-0028 deploy gating.
- Slack notification failures are logged to console_audit_log as a non-blocking warning. Promotion must never be blocked by a notification dependency.
- If any flag in feature_flags.yaml attempts to gate the paper-first check, feature-developer must stop and escalate. The promotion system does not create an exception to this rule.

console_flag_promotions table

New table, downstream of console_feature_flags (#552 migration 0005).
CREATE TABLE console_flag_promotions (
id TEXT PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
flag_key TEXT NOT NULL,
marked_active_for_prod_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
marked_by TEXT NOT NULL REFERENCES admins(id) ON DELETE RESTRICT,
staging_value_at_mark INTEGER NOT NULL CHECK (staging_value_at_mark IN (0, 1)),
prod_target_value INTEGER NOT NULL CHECK (prod_target_value IN (0, 1)),
soak_until_at DATETIME NOT NULL,
state TEXT NOT NULL DEFAULT 'pending'
CHECK (state IN ('pending', 'approved', 'promoted', 'rejected', 'expired')),
approved_at DATETIME,
approved_by TEXT REFERENCES admins(id) ON DELETE SET NULL,
promoted_at DATETIME,
rejection_reason TEXT,
audit_correlation_id TEXT REFERENCES console_audit_log(id) ON DELETE SET NULL
);
CREATE INDEX idx_cfp_state ON console_flag_promotions (state);
CREATE INDEX idx_cfp_flag_key ON console_flag_promotions (flag_key);
CREATE INDEX idx_cfp_soak_until ON console_flag_promotions (soak_until_at)
WHERE state = 'pending';
-- only one live (pending or approved) promotion per flag at a time; terminal states may repeat
CREATE UNIQUE INDEX idx_cfp_one_live ON console_flag_promotions (flag_key)
WHERE state IN ('pending', 'approved');
Notes:
- A table-level UNIQUE (flag_key, state) constraint does not fit here: it would forbid a second promoted, rejected, or expired row for the same flag (so a flag could be promoted only once), while still allowing one pending and one approved row to coexist. SQLite has no partial UNIQUE table constraint, but it does support partial unique indexes, so "only one pending or approved promotion per flag_key" is expressed as the idx_cfp_one_live index. Feature-developer should still check the rule at the service layer so the API returns 409 promotion_already_pending rather than surfacing a constraint error.
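- marked_by uses ON DELETE RESTRICT: a promotion row blocks deletion of the marking admin. This is intentional for pending promotions, because orphaned pending promotions with no accountable actor are dangerous. The foreign key is not state-aware, so terminal rows also block deletion for as long as they remain in the table.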
- approved_by uses ON DELETE SET NULL — if the approver is erased under DSR, the promotion record is preserved (audit compliance) but the approver FK is nulled.
- audit_correlation_id links the promotion record to the first audit log entry written at mark time, enabling a single JOIN to retrieve the full audit trail.
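
A partial unique index is one way SQLite can express "only one pending or approved promotion per flag_key" at the database level. The sketch below is a minimal illustration, not the migration itself: the `promotions` table is a trimmed stand-in for console_flag_promotions, and the index name and helper function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE promotions (          -- trimmed stand-in for console_flag_promotions
    id INTEGER PRIMARY KEY,
    flag_key TEXT NOT NULL,
    state TEXT NOT NULL DEFAULT 'pending'
);
-- Uniqueness applies only to rows that are still live.
CREATE UNIQUE INDEX idx_one_live ON promotions (flag_key)
    WHERE state IN ('pending', 'approved');
""")


def try_insert(flag_key: str, state: str) -> bool:
    """Return True if the row was accepted, False on a uniqueness violation."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO promotions (flag_key, state) VALUES (?, ?)",
                (flag_key, state),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_pending = try_insert("console_billing", "pending")
second_live = try_insert("console_billing", "approved")  # second live row for the flag
first_done = try_insert("console_billing", "promoted")
second_done = try_insert("console_billing", "promoted")  # terminal rows are not indexed
```

Only the second live row is refused; terminal rows fall outside the index predicate and can accumulate as history.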
feature_flags.yaml additions

Two new per-flag fields are added alongside the existing risk: field (already scaffolded in #552):
flags:
  console_billing:
    default: false
    description: "Gates billing-specific permission checks"
    risk: high # low | medium | high
    soak_period_hours: 48 # default: 24; overrides global default for this flag
    env_override: true
  console_dashboard_home:
    default: false
    description: "Dashboard home grid redesign"
    risk: low
    soak_period_hours: 4
    env_override: true
If soak_period_hours is absent, the system uses the global default (24h). The risk: field governs which approval path applies (see §4).
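
The fallback rule can be sketched as a small lookup. This is an illustration under stated assumptions: `FLAGS` is a hypothetical in-memory form of the parsed feature_flags.yaml, `console_no_override` is an invented flag used only to show the fallback, and `soak_until` is not an existing helper.

```python
from datetime import datetime, timedelta, timezone

GLOBAL_DEFAULT_SOAK_HOURS = 24  # global default stated in this design

# Hypothetical parsed form of feature_flags.yaml (only the fields used here).
FLAGS = {
    "console_billing": {"risk": "high", "soak_period_hours": 48},
    "console_dashboard_home": {"risk": "low", "soak_period_hours": 4},
    "console_no_override": {"risk": "medium"},  # illustrative: no per-flag override
}


def soak_until(flag_key: str, marked_at: datetime) -> datetime:
    """Return the UTC instant at which the soak period for flag_key ends."""
    hours = FLAGS[flag_key].get("soak_period_hours", GLOBAL_DEFAULT_SOAK_HOURS)
    return marked_at + timedelta(hours=hours)


marked = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
billing_deadline = soak_until("console_billing", marked)
fallback_deadline = soak_until("console_no_override", marked)
```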
Migration file: console/db/migrations/0006_flag_promotions.sql
-- up
CREATE TABLE console_flag_promotions (
id TEXT PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
flag_key TEXT NOT NULL,
marked_active_for_prod_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
marked_by TEXT NOT NULL REFERENCES admins(id) ON DELETE RESTRICT,
staging_value_at_mark INTEGER NOT NULL CHECK (staging_value_at_mark IN (0, 1)),
prod_target_value INTEGER NOT NULL CHECK (prod_target_value IN (0, 1)),
soak_until_at DATETIME NOT NULL,
state TEXT NOT NULL DEFAULT 'pending'
CHECK (state IN ('pending','approved','promoted','rejected','expired')),
approved_at DATETIME,
approved_by TEXT REFERENCES admins(id) ON DELETE SET NULL,
promoted_at DATETIME,
rejection_reason TEXT,
audit_correlation_id TEXT REFERENCES console_audit_log(id) ON DELETE SET NULL
);
CREATE INDEX idx_cfp_state ON console_flag_promotions (state);
CREATE INDEX idx_cfp_flag_key ON console_flag_promotions (flag_key);
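CREATE INDEX idx_cfp_soak_until ON console_flag_promotions (soak_until_at)
WHERE state = 'pending';
-- one live (pending or approved) promotion per flag; terminal states may repeat
CREATE UNIQUE INDEX idx_cfp_one_live ON console_flag_promotions (flag_key)
WHERE state IN ('pending', 'approved');
-- down
DROP INDEX IF EXISTS idx_cfp_one_live;
DROP INDEX IF EXISTS idx_cfp_soak_until;
DROP INDEX IF EXISTS idx_cfp_flag_key;
DROP INDEX IF EXISTS idx_cfp_state;
DROP TABLE IF EXISTS console_flag_promotions;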
Zero-downtime: additive table creation. Depends on migration 0005 (console_feature_flags) having been applied.
POST /console/flags/<flag_key>/mark-promote

Mark a staging flag as "active for prod". Creates a console_flag_promotions row.
Auth: @require_role("superadmin").
Session constraint: session.selected_env must be staging. Marking from a prod context returns 409 {"error": "must_be_in_staging_context"}.
Pre-condition: no existing pending or approved promotion for this flag_key. If one exists, returns 409 {"error": "promotion_already_pending"}.
Body: none (flag_key from path; staging value read from console_feature_flags at mark time).
Response: 201 Created {"promotion_id": "...", "soak_until_at": "...UTC..."}.
Side effects:
- Reads current staging value from console_feature_flags → stored as staging_value_at_mark.
- Reads soak_period_hours from feature_flags.yaml for this key (default 24).
- Writes console_flag_promotions row (state: pending, soak_until_at: now + soak_period_hours).
- Writes console_audit_log: action="console.flag.mark_promote", payload {flag_key, staging_value, soak_until_at, marked_by}.
- Stores audit_correlation_id on the new promotion row.
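
The checks and the insert can be sketched as one function. This is a minimal sketch, not the handler: it assumes a trimmed `promotions` table in place of console_flag_promotions, takes the session env and staging value as plain arguments, stores timestamps as ISO-8601 strings, and leaves out the audit write. `mark_promote` and its parameters are hypothetical names.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta, timezone


def mark_promote(conn, flag_key, admin_id, selected_env, staging_value,
                 soak_hours=24, now=None):
    """Return (http_status, body) for a mark-promote request."""
    now = now or datetime.now(timezone.utc)
    if selected_env != "staging":
        return 409, {"error": "must_be_in_staging_context"}
    live = conn.execute(
        "SELECT 1 FROM promotions WHERE flag_key = ? "
        "AND state IN ('pending', 'approved')",
        (flag_key,),
    ).fetchone()
    if live:
        return 409, {"error": "promotion_already_pending"}
    promotion_id = uuid.uuid4().hex
    soak_until_at = now + timedelta(hours=soak_hours)
    with conn:  # one transaction for the insert
        conn.execute(
            "INSERT INTO promotions (id, flag_key, marked_by, "
            "staging_value_at_mark, soak_until_at, state) "
            "VALUES (?, ?, ?, ?, ?, 'pending')",
            (promotion_id, flag_key, admin_id, int(staging_value),
             soak_until_at.isoformat()),
        )
    return 201, {"promotion_id": promotion_id,
                 "soak_until_at": soak_until_at.isoformat()}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE promotions (id TEXT PRIMARY KEY, flag_key TEXT, marked_by TEXT, "
    "staging_value_at_mark INTEGER, soak_until_at TEXT, state TEXT)"
)
first = mark_promote(conn, "console_billing", "admin-1", "staging", True, soak_hours=48)
duplicate = mark_promote(conn, "console_billing", "admin-1", "staging", True)
wrong_env = mark_promote(conn, "console_dashboard_home", "admin-1", "prod", True)
```

A real handler would also write the console.flag.mark_promote audit row and store its id as audit_correlation_id in the same transaction.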
POST /console/flags/<flag_key>/promote

Approve and execute the promotion. Flips the prod flag.
Auth: @require_role("superadmin").
Session constraint: session.selected_env must be prod. Promoting from a staging context returns 409.
Pre-condition: a promotion exists for this flag_key and is either approved, or pending with its soak elapsed (soak_until_at <= now). If the state is pending and the soak has not elapsed, returns 409 {"error": "soak_not_elapsed", "soak_until_at": "..."}.
Body (high-risk flags only): {"confirmation_phrase": "promote <flag_key> to prod"}. Mismatch returns 422.
Body (low/medium-risk flags): no body required; a ?confirm=1 query param signals intent.
Response: 200 {"promoted_at": "...UTC...", "prod_value": true}.
Side effects:
1. Reads current prod value from console_feature_flags (for audit — "from" value).
2. Calls flags.flip(flag_key, env="prod", value=staging_value_at_mark, admin_id=...) from the existing service (#552).
3. Updates console_flag_promotions row: state → promoted, promoted_at, approved_by, approved_at.
4. Writes console_audit_log: action="console.flag.promoted", payload {flag_key, from_value, to_value, promotion_id, soak_elapsed_hours, marked_by, approved_by}.
5. Posts Slack DM to D0AJ7K184TV: "Flag <flag_key> promoted to prod: <old> → <new> by <actor> after <soak_h> h soak." Non-blocking (fire-and-forget with error logged).
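
The pre-conditions that run before step 1 can be sketched as a pure function. This is an illustration under assumptions: `check_promote` is a hypothetical helper, and the error codes other than soak_not_elapsed are invented for the example. The phrase check uses `hmac.compare_digest` from the standard library for the constant-time comparison.

```python
import hmac
from datetime import datetime, timezone


def check_promote(state, soak_until_at, risk, flag_key,
                  confirmation_phrase=None, now=None):
    """Return (http_status, error_or_None) for a promote request."""
    now = now or datetime.now(timezone.utc)
    if state not in ("pending", "approved"):
        return 409, "no_live_promotion"  # hypothetical error code
    if state == "pending" and now < soak_until_at:
        return 409, "soak_not_elapsed"
    if risk == "high":
        expected = f"promote {flag_key} to prod"
        supplied = confirmation_phrase or ""
        # compare_digest does not short-circuit on the first differing byte.
        if not hmac.compare_digest(supplied.encode(), expected.encode()):
            return 422, "confirmation_phrase_mismatch"  # hypothetical error code
    return 200, None


now = datetime(2026, 5, 3, tzinfo=timezone.utc)
deadline_future = datetime(2026, 5, 4, tzinfo=timezone.utc)
deadline_past = datetime(2026, 5, 2, tzinfo=timezone.utc)

too_early = check_promote("pending", deadline_future, "low", "console_dashboard_home", now=now)
bad_phrase = check_promote("pending", deadline_past, "high", "console_billing",
                           confirmation_phrase="promote it", now=now)
ok = check_promote("pending", deadline_past, "high", "console_billing",
                   confirmation_phrase="promote console_billing to prod", now=now)
```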
POST /console/flags/<flag_key>/reject-promote

Manually reject a pending promotion.
Auth: @require_role("superadmin").
Body: {"reason": "string"} (optional but encouraged).
Response: 204 No Content.
Side effects: Updates state to rejected; writes console_audit_log (action="console.flag.rejected").
GET /console/flags/promotions

The promotions queue page.
Auth: @require_role("ops") for read; promote/reject buttons only render for superadmin.
Response: HTML page listing pending and approved promotions; historical (promoted/rejected/expired) shown in a collapsed section.
A scheduled background task (e.g. APScheduler or Celery beat, consistent with the existing console scheduler pattern) runs every hour:
1. Selects pending promotions where marked_active_for_prod_at < NOW() - 7 days.
2. Sets their state to expired.
3. Writes console_audit_log for each expiry: action="console.flag.expired".

Expiry does not flip any flag. It is purely a cleanup of stale intent records.
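
The body of such a job might look like the sketch below. It is not tied to APScheduler or Celery: `expire_stale` is a hypothetical function the scheduler would call, the `promotions` table is a trimmed stand-in, timestamps are assumed to be stored as ISO-8601 UTC strings (so string comparison matches time order), and audit events are returned as dicts instead of being written to console_audit_log.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

EXPIRY_AGE = timedelta(days=7)


def expire_stale(conn, now=None):
    """Mark pending promotions older than 7 days as expired; return audit events."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - EXPIRY_AGE).isoformat()
    rows = conn.execute(
        "SELECT id, flag_key FROM promotions "
        "WHERE state = 'pending' AND marked_active_for_prod_at < ?",
        (cutoff,),
    ).fetchall()
    events = []
    with conn:  # single transaction for all updates
        for promotion_id, flag_key in rows:
            # Re-check state in the UPDATE so a concurrent approve is not overwritten.
            conn.execute(
                "UPDATE promotions SET state = 'expired' "
                "WHERE id = ? AND state = 'pending'",
                (promotion_id,),
            )
            events.append({"action": "console.flag.expired",
                           "flag_key": flag_key,
                           "promotion_id": promotion_id})
    return events


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE promotions (id TEXT, flag_key TEXT, state TEXT, "
    "marked_active_for_prod_at TEXT)"
)
now = datetime(2026, 5, 10, tzinfo=timezone.utc)
conn.executemany("INSERT INTO promotions VALUES (?, ?, ?, ?)", [
    ("p1", "console_billing", "pending", (now - timedelta(days=8)).isoformat()),
    ("p2", "console_dashboard_home", "pending", (now - timedelta(days=1)).isoformat()),
])
expired = expire_stale(conn, now=now)
states = dict(conn.execute("SELECT id, state FROM promotions").fetchall())
```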
              [operator marks]
                     │
                     ▼
                  pending ───[operator rejects]──────────────► rejected
                     │
               soak elapsed?
                Yes  │  No (only soak-elapsed promotions
                     │      can be approved in v1)
                     ▼
                 approved ◄──[operator clicks approve]
                     │
                high-risk?
               Yes   │   No
                ▼         ▼
              type      click
             phrase    confirm
                │         │
                ▼         ▼
                 promoted

    pending ──[7 days without action]──► expired
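
The state diagram can be encoded as a transition table so that illegal moves fail loudly. This is a sketch: the event names (`approve`, `reject`, `expire`, `promote`) and the `transition` helper are hypothetical, and the promote endpoint, which approves and promotes in one request, would apply `approve` then `promote` in sequence.

```python
# (current_state, event) -> next_state, following the diagram above.
ALLOWED = {
    ("pending", "approve"): "approved",   # only after the soak has elapsed
    ("pending", "reject"): "rejected",
    ("pending", "expire"): "expired",     # 7 days without action
    ("approved", "promote"): "promoted",
}
TERMINAL = {"promoted", "rejected", "expired"}


def transition(state: str, event: str) -> str:
    """Return the next state, or raise ValueError for a move the diagram forbids."""
    try:
        return ALLOWED[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} from {state!r}") from None


# The promote endpoint as two steps: pending -> approved -> promoted.
final_state = transition(transition("pending", "approve"), "promote")
```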
sequenceDiagram
participant Op as Operator (staging context)
participant Page as /console/flags (HTMX)
participant Handler as POST /flags/<key>/mark-promote
participant FlagSvc as flags.py service
participant DB as console DB
participant Audit as console_audit_log
Op->>Page: Click "Mark active for prod" on console_billing row
Page->>Handler: POST /console/flags/console_billing/mark-promote
Handler->>Handler: Validate session.selected_env == "staging"
Handler->>DB: SELECT state FROM console_flag_promotions WHERE flag_key=? AND state IN ('pending','approved')
alt Promotion already pending
DB-->>Handler: row exists
Handler-->>Op: 409 promotion_already_pending
else No active promotion
Handler->>FlagSvc: flags.get("console_billing", env="staging")
FlagSvc-->>Handler: staging_value = true
Handler->>DB: INSERT console_flag_promotions (state=pending, soak_until_at=now+48h, staging_value_at_mark=1)
Handler->>Audit: write console.flag.mark_promote
Handler-->>Op: 201 {promotion_id, soak_until_at}
Page->>Page: HTMX swaps row — shows yellow "Pending promotion" badge
end
sequenceDiagram
participant Op as Operator (prod context)
participant Page as /flags/promotions
participant Handler as POST /flags/<key>/promote
participant FlagSvc as flags.py service
participant DB as console DB
participant Audit as console_audit_log
participant Slack as Slack API
Op->>Page: Click "Promote now" for console_billing
Page->>Page: Render type-to-confirm modal (high-risk flag)
Op->>Page: Types "promote console_billing to prod"
Page->>Handler: POST /console/flags/console_billing/promote {confirmation_phrase: "..."}
Handler->>Handler: Validate session.selected_env == "prod"
Handler->>DB: SELECT promotion WHERE flag_key=? AND state IN ('pending','approved')
Handler->>Handler: Validate soak_until_at <= NOW()
Handler->>Handler: Validate confirmation_phrase (high-risk gate)
Handler->>FlagSvc: flags.flip("console_billing", env="prod", value=staging_value_at_mark)
FlagSvc->>DB: UPSERT console_feature_flags (flag_key="console_billing", env="prod", value=1)
FlagSvc->>Audit: write console.flag.flip (existing audit path)
Handler->>DB: UPDATE console_flag_promotions SET state=promoted, promoted_at=NOW(), approved_by=...
Handler->>Audit: write console.flag.promoted {full context}
Handler->>Slack: POST DM D0AJ7K184TV "Flag console_billing promoted..."
Note over Handler,Slack: Slack failure is non-blocking; logged as warning
Handler-->>Op: 200 {promoted_at, prod_value: true}
Page->>Page: Badge turns green "Promoted"
sequenceDiagram
participant Scheduler as Background job (hourly)
participant DB as console DB
participant Audit as console_audit_log
Scheduler->>DB: SELECT id, flag_key FROM console_flag_promotions WHERE state='pending' AND marked_active_for_prod_at < NOW() - 7d
loop For each stale promotion
Scheduler->>DB: UPDATE state='expired'
Scheduler->>Audit: write console.flag.expired {flag_key, promotion_id, age_hours}
end
File: console/db/migrations/0006_flag_promotions.sql (up + down, see §3.3).
Dependency: Migration 0005 (console_feature_flags) must be applied first.
Deployment order:
1. Apply migration 0005 on staging console DB (already covered by #552).
2. Apply migration 0006 on staging console DB.
3. Deploy console to staging — promotion endpoints and UI are available behind FLAG_CONSOLE_FLAG_PROMOTIONS=1 env var.
4. After staging soak ≥ 24h: apply migration 0006 on prod console DB.
5. Deploy console to prod with FLAG_CONSOLE_FLAG_PROMOTIONS=1 on prod.
Rollback: Drop the table (down migration). In-flight promotions are lost. The actual flag values in console_feature_flags are unaffected — the promotion queue is metadata, not authority. Rollback is safe.
| Phase | What lands | Gate |
|---|---|---|
| Dark | Migration 0006 applied; promotion endpoints exist but not reachable from UI | Zero-downtime additive migration |
| Staging preview | FLAG_CONSOLE_FLAG_PROMOTIONS=1 on staging; /flags/promotions page live; mark + approve flow testable | Manual smoke test by operator on staging |
| Beta | Staging soak ≥ 24h; audit log reviewed; Slack notification confirmed; no expiry regressions | Staging sign-off |
| GA | FLAG_CONSOLE_FLAG_PROMOTIONS=1 on prod; expiry job confirmed running; ops runbook updated | Prod sign-off |
The feature is gated by FLAG_CONSOLE_FLAG_PROMOTIONS env var (Infisical-sourced), not a self-referential DB flag, to avoid a bootstrapping dependency on the very system being introduced.
PII collected: None. marked_by and approved_by reference admins.id (UUID, not PII directly). Names are resolved at read time from admins table. rejection_reason is free text — operators should not enter PII here; the UI should include a note.
Retention: console_flag_promotions rows are operational records. Promoted/rejected/expired rows are kept for 90 days then eligible for archival (not deletion — the audit log entries reference them). The console_audit_log rows for promotion events follow the 2-year audit retention policy.
DSR erasure: marked_by uses ON DELETE RESTRICT, so deletion of an admin is blocked while any promotion row they marked exists. The foreign key is not state-aware: it blocks on terminal rows as well as active (pending/approved) ones. Operator must resolve pending promotions before the account can be erased. approved_by uses ON DELETE SET NULL: terminal promotions are preserved, and the approver FK is nulled if the admin is erased. Both behaviors are consistent with the erasure policy in rbac-design.md §10.
Audit trail: Five distinct audit actions cover every state transition. Audit rows are immutable; no delete path exists. The audit_correlation_id FK on console_flag_promotions links the row to its first audit entry for efficient forensics.
No credential storage: Promotion records store flag keys (strings) and values (0/1 integers). This system must not be extended to store anything else. The rejection_reason field is free text; the service layer must enforce a max-length (e.g. 500 chars) and no HTML/code injection.
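
One way the service layer might enforce the rejection_reason rules is sketched below. `clean_rejection_reason` is a hypothetical helper, and the 500-character limit is the example figure from the text, not a confirmed value. Escaping at write time is shown for brevity; escaping at render time is an equally valid placement.

```python
import html

MAX_REASON_LENGTH = 500  # example limit from the text above


def clean_rejection_reason(raw):
    """Validate and neutralise a free-text rejection reason before storing it."""
    if raw is None:
        return None  # the reason is optional
    reason = raw.strip()
    if not reason:
        return None
    if len(reason) > MAX_REASON_LENGTH:
        raise ValueError(f"rejection_reason exceeds {MAX_REASON_LENGTH} characters")
    # html.escape converts &, <, > and quotes to entities, so markup is stored inert.
    return html.escape(reason)


cleaned = clean_rejection_reason("  <b>not ready</b>  ")
```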
Confirmation phrase gate for high-risk flags: risk: high flags require {"confirmation_phrase": "promote <flag_key> to prod"} in the POST body. The phrase is validated with a constant-time string comparison to avoid a timing oracle. A mismatched phrase returns 422 without indicating which part was wrong.
Slack notification posture: The Slack DM to D0AJ7K184TV is outbound only. The bot token lives in Infisical. If the token is compromised, an attacker can post to the DM channel — they cannot read console data. The token is rotatable without redeploy (Infisical env var).
Kill-switch: FLAG_CONSOLE_FLAG_PROMOTIONS=0 disables all promotion endpoints. The console_flag_promotions table is unaffected (data preserved). Any approved-but-not-yet-promoted promotions resume when the flag is re-enabled.
Breach posture: console_flag_promotions contains flag keys, boolean values, and admin UUIDs. Exfiltration reveals which features were promoted and by whom — operational sensitivity, not user PII. Existing breach-notification automation applies; no new notification surface is created.
Secrets location: FLAG_CONSOLE_FLAG_PROMOTIONS (rollout gate), SLACK_BOT_TOKEN (DM notify) live in Infisical. Neither appears in code. Both are rotatable without redeploy.
These require a decision from Kristerpher before sub-cards can be claimed for GA:
Default soak period. The design uses 24h as default. Is this correct, or should the default be longer (e.g. 48h) for an initial conservative posture? Per-flag override exists regardless, but the fallback default matters for flags that have no soak_period_hours: set in YAML.
Auto-promote in v2: opt-in per-flag or opt-out? The risk: low classification is scaffolded for v2 auto-promote. When v2 lands, should auto-promote default to OFF (operator must explicitly set auto_promote: true in YAML for each flag) or default to ON for all risk: low flags (operator must set auto_promote: false to suppress)? The design doc defaults to opt-in-per-flag (OFF unless explicitly enabled), but confirmation is needed before v2 sub-cards are written.
High-risk flag classification: manual in YAML or inferred? Current design: risk: is set manually in feature_flags.yaml by the operator. Alternatively, risk could be inferred from the flag key (e.g., any flag containing billing, vault, or deploy in its name is automatically high). Manual is explicit but requires discipline. Inference is automatic but may mislabel. Which posture should v1 implement?
Expired/rejected promotion history: visible forever or auto-archived? The design retains terminal promotion records for 90 days. Should expired and rejected promotions be visible on /flags/promotions by default (under a collapsed section), or should they be hidden and only accessible via the audit log page? Default UX treatment affects the promotions page design.
Mark-from-prod context. Current design: marking active-for-prod requires session.selected_env == staging. This forces the operator to be in staging context when they mark. Is this the right friction, or should marking be possible from any context (the staging value is read from DB regardless of session env)? Forcing the staging context means the operator has to be "in staging" to initiate a promotion — a reasonable gate but adds a step.