Raxx · internal docs

Console Feature Flag Management — Design

Status: Draft
Owner: software-architect
Created: 2026-04-28 UTC
Parent epic: #146 (raxx-console)
Related ADRs: 0026, 0027
Related docs: console-env-switcher.md, rbac-design.md, console.md


1. Context

Feature flags in the console are currently read exclusively from Heroku config vars (os.environ.get("FLAG_CONSOLE_X", "false")). Every flip requires a heroku config:set command, which restarts the dynos. Most operators do not have the Heroku CLI, and the operation bypasses audit, env-scoping, and RBAC entirely.

The directive is: flags must be flippable from within the console UI. Heroku CLI should never be the only flip mechanism.

Current state:
  - backend_v2/api/feature_flags.yaml: canonical flag declarations (14 flags across epics)
  - console/app/blueprints/*.py and console/app/__init__.py: ~12 distinct os.environ.get("FLAG_*", "false") call sites
  - console/app/middleware/env_guard.py, cloudflare_origin_guard.py: 2 additional middleware-level flag reads

The new system must:
  1. Keep Heroku env vars as a valid bootstrap mechanism (existing configs keep working)
  2. Add a DB-backed override layer that wins over the env var when present
  3. Expose a UI page for toggling flags with full audit trail
  4. Scope flag values per environment (prod / staging can diverge)
  5. Gate mutations behind RBAC


2. Invariants

All platform invariants apply. Feature-flag-specific:

  1. No stored credentials. Flag values are booleans (true/false). This system must never be extended to store secrets, tokens, or any replayable value.
  2. Every flip writes an audit row. Action console.flag.flip, payload {flag, env, from, to, operator_id}. Immutable. No deletes from console_audit_log.
  3. Fail-closed on missing row. If a flag has no DB row, resolution falls back to the env var and then to the YAML default (§3.2), never to true. Unknown flags are off.
  4. Env-scoped by default. A flag value for prod is independent from its value for staging. Operators cannot flip both envs in a single action.
  5. Heroku env var is bootstrap, not authority. An env var sets the initial value if no DB row exists for that flag+env pair. Once a DB row exists, the env var is ignored for that flag+env. This is a one-way migration: the DB row wins.
  6. Audit trail for every state change. This is a money-adjacent control surface (flags gate live trading paths, RBAC paths, and billing). Every mutation is audited.
  7. Paper-first gating is a hard invariant. The paper-first gate (live-trading code paths require paper-profitable-for-N-cycles gate or explicit override) is not a feature flag and must not be managed through this system. If a product card attempts this, stop and escalate.

3. Data Model

3.1 console_feature_flags table

CREATE TABLE console_feature_flags (
    id          TEXT        PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
    flag_key    TEXT        NOT NULL,
    env         TEXT        NOT NULL CHECK (env IN ('prod', 'staging')),
    value       INTEGER     NOT NULL DEFAULT 0 CHECK (value IN (0, 1)),
    description TEXT,
    updated_at  DATETIME    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_by  TEXT        REFERENCES admins(id) ON DELETE SET NULL,
    UNIQUE (flag_key, env)
);

CREATE INDEX idx_cff_key_env ON console_feature_flags (flag_key, env);

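Notes:
  - value is INTEGER 0/1 rather than BOOLEAN or TEXT, for SQLite portability and unambiguous comparison.
  - The (flag_key, env) unique constraint makes upsert semantics clean (INSERT OR REPLACE). Note that in SQLite, INSERT OR REPLACE deletes and re-inserts the conflicting row, so id is regenerated and any column not supplied (e.g. description) resets to its default; INSERT ... ON CONFLICT (flag_key, env) DO UPDATE keeps the existing row.
  - The unique constraint already creates an implicit index on (flag_key, env) in SQLite, so the explicit CREATE INDEX is redundant for lookups.
  - updated_by is nullable to accommodate automated seed writes (no operator session) and bootstrap inserts.
  - No created_at: the audit log is the authoritative history. updated_at is a convenience for the UI ("last changed").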
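
The upsert on (flag_key, env) can be sketched as follows. This is a minimal illustration, assuming Python's built-in sqlite3 module and SQLite 3.24 or later (for ON CONFLICT ... DO UPDATE). The helper name upsert_flag is hypothetical, and the admins foreign key is omitted to keep the sketch self-contained.

```python
import sqlite3

SCHEMA = """
CREATE TABLE console_feature_flags (
    id          TEXT     PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
    flag_key    TEXT     NOT NULL,
    env         TEXT     NOT NULL CHECK (env IN ('prod', 'staging')),
    value       INTEGER  NOT NULL DEFAULT 0 CHECK (value IN (0, 1)),
    description TEXT,
    updated_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_by  TEXT,  -- FK to admins(id) omitted in this sketch
    UNIQUE (flag_key, env)
);
"""


def upsert_flag(conn, flag_key, env, value, updated_by=None):
    """Insert the (flag_key, env) row, or update it in place if it exists."""
    conn.execute(
        """
        INSERT INTO console_feature_flags (flag_key, env, value, updated_by)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (flag_key, env) DO UPDATE SET
            value      = excluded.value,
            updated_by = excluded.updated_by,
            updated_at = CURRENT_TIMESTAMP
        """,
        (flag_key, env, int(value), updated_by),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

upsert_flag(conn, "console_billing", "prod", True)
first_id = conn.execute("SELECT id FROM console_feature_flags").fetchone()[0]

# Second write hits the UNIQUE constraint and updates the same row.
upsert_flag(conn, "console_billing", "prod", False, updated_by="admin-1")
rows = conn.execute(
    "SELECT id, value, updated_by FROM console_feature_flags"
).fetchall()
```

Because the conflicting row is updated rather than replaced, its id is stable across flips, which matters if anything outside this table ever references it.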

3.2 Resolution order (per flag, per env)

1. console_feature_flags row for (flag_key, env)  — wins if present
2. Heroku env var  FLAG_<UPPER_FLAG_KEY>=1|true     — bootstrap fallback
3. feature_flags.yaml default value               — final fallback

If no DB row exists and no env var is set, the YAML default applies. If the YAML has no entry for the key, the flag is false.
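
The three-layer resolution can be sketched as a pure function. This is an illustrative sketch, not the flags.py implementation: resolve_flag and its parameters are hypothetical names, the DB layer is stood in for by a dict keyed on (flag_key, env), and the truthy strings follow the existing call-site pattern ("true", "1", "yes").

```python
from typing import Mapping, Tuple

TRUTHY = frozenset({"1", "true", "yes"})


def resolve_flag(
    flag_key: str,
    env: str,
    db_rows: Mapping[Tuple[str, str], int],
    environ: Mapping[str, str],
    yaml_defaults: Mapping[str, bool],
) -> Tuple[bool, str]:
    """Return (value, source) following the DB > env var > YAML order."""
    db_value = db_rows.get((flag_key, env))
    if db_value is not None:
        return bool(db_value), "db"          # 1. DB row wins if present
    raw = environ.get("FLAG_" + flag_key.upper())
    if raw is not None:
        return raw.strip().lower() in TRUTHY, "env"  # 2. bootstrap fallback
    # 3. YAML default; a key missing from YAML resolves to False.
    return bool(yaml_defaults.get(flag_key, False)), "yaml"


db_rows = {("console_billing", "prod"): 0}
environ = {"FLAG_CONSOLE_BILLING": "1", "FLAG_CONSOLE_ENV_GATE": "true"}
yaml_defaults = {
    "console_billing": False,
    "console_env_gate": False,
    "broker_fidelity": False,
}

prod_billing = resolve_flag("console_billing", "prod", db_rows, environ, yaml_defaults)
staging_billing = resolve_flag("console_billing", "staging", db_rows, environ, yaml_defaults)
fidelity = resolve_flag("broker_fidelity", "prod", db_rows, environ, yaml_defaults)
unknown = resolve_flag("not_declared", "prod", db_rows, environ, yaml_defaults)
```

In this example the prod row for console_billing is off in the DB, so the env var FLAG_CONSOLE_BILLING=1 is ignored for prod but still applies to staging, which has no row.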

On first flip via UI: a DB row is written for that (flag_key, env). The env var continues to exist on Heroku but is now dormant for that pair. Operators should be informed (ops runbook) to remove the Heroku config var to avoid confusion, but its presence is harmless.

3.3 Schema migration

Migration file: console/db/migrations/0005_feature_flags.sql

-- up
CREATE TABLE console_feature_flags (
    id          TEXT     PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
    flag_key    TEXT     NOT NULL,
    env         TEXT     NOT NULL CHECK (env IN ('prod', 'staging')),
    value       INTEGER  NOT NULL DEFAULT 0 CHECK (value IN (0, 1)),
    description TEXT,
    updated_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_by  TEXT     REFERENCES admins(id) ON DELETE SET NULL,
    UNIQUE (flag_key, env)
);

CREATE INDEX idx_cff_key_env ON console_feature_flags (flag_key, env);

-- down
DROP INDEX IF EXISTS idx_cff_key_env;
DROP TABLE IF EXISTS console_feature_flags;

Zero-downtime: additive table creation. Existing behavior is unchanged until the new read path is activated by the new flags.py service module.
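
As a quick check of the migration, the sketch below applies the up and down scripts to an in-memory SQLite database and exercises the CHECK and UNIQUE constraints. It assumes Python's built-in sqlite3 module; the one-column admins table is a hypothetical stand-in for the real one, and is_rejected is an illustrative helper.

```python
import sqlite3

UP = """
CREATE TABLE console_feature_flags (
    id          TEXT     PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
    flag_key    TEXT     NOT NULL,
    env         TEXT     NOT NULL CHECK (env IN ('prod', 'staging')),
    value       INTEGER  NOT NULL DEFAULT 0 CHECK (value IN (0, 1)),
    description TEXT,
    updated_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_by  TEXT     REFERENCES admins(id) ON DELETE SET NULL,
    UNIQUE (flag_key, env)
);
CREATE INDEX idx_cff_key_env ON console_feature_flags (flag_key, env);
"""

DOWN = """
DROP INDEX IF EXISTS idx_cff_key_env;
DROP TABLE IF EXISTS console_feature_flags;
"""

INSERT = "INSERT INTO console_feature_flags (flag_key, env, value) VALUES (?, ?, ?)"


def is_rejected(conn, params):
    """Return True if the insert violates a table constraint."""
    try:
        conn.execute(INSERT, params)
    except sqlite3.IntegrityError:
        return True
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admins (id TEXT PRIMARY KEY)")  # hypothetical stand-in
conn.executescript(UP)

valid_rejected = is_rejected(conn, ("console_billing", "prod", 1))
bad_env_rejected = is_rejected(conn, ("console_billing", "dev", 1))
bad_value_rejected = is_rejected(conn, ("console_billing", "staging", 2))
duplicate_rejected = is_rejected(conn, ("console_billing", "prod", 0))

conn.executescript(DOWN)
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE name = 'console_feature_flags'"
).fetchall()
```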


4. APIs / Contracts

4.1 flags.is_on(flag_key: str, env: str | None = None) -> bool

The drop-in replacement for os.environ.get("FLAG_X", "false") in ("true", "1", "yes").

# Before (current pattern in every blueprint):
val = os.environ.get("FLAG_CONSOLE_BILLING", "false").lower()
return val in ("true", "1", "yes")

# After (drop-in replacement):
from app.services.flags import flags
return flags.is_on("console_billing")

Key behaviors:
  - env defaults to g.selected_env (the session's active env) when called during a request context, and to prod outside a request context.
  - Cache: per-process LRU with a 30-second TTL, keyed on (flag_key, env). A flip via the UI explicitly invalidates that pair in the process that handled the flip; every other process and dyno picks up the change when its own entry expires. A flip therefore takes effect within 30 seconds on all dynos: not instant, but safe and predictable.
  - Not per-request cached. Hitting the DB on every flag check in a hot path (e.g., middleware that runs on every request) would be expensive, and a per-request cache would still query the DB at least once per request for each flag it checks. A per-process cache with a 30s TTL avoids both and is near-realtime for operators flipping flags.
  - Falls back gracefully: if the DB is unavailable, resolution falls back to the env var, then YAML. Never raises.
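
The caching behavior above can be sketched with a small TTL cache. This is a simplified illustration under stated assumptions: FlagCache and its methods are hypothetical names, the clock is injectable so expiry can be tested, and LRU eviction is left out because the key space (flags times envs) is small.

```python
import threading
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]  # (flag_key, env)


class FlagCache:
    """Per-process cache with a fixed TTL and explicit invalidation."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: Dict[Key, Tuple[bool, float]] = {}

    def get(self, flag_key: str, env: str) -> Optional[bool]:
        """Return the cached value, or None on a miss or expired entry."""
        key = (flag_key, env)
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return None
            value, expires_at = entry
            if self._clock() >= expires_at:
                del self._entries[key]
                return None
            return value

    def set(self, flag_key: str, env: str, value: bool) -> None:
        with self._lock:
            self._entries[(flag_key, env)] = (value, self._clock() + self._ttl)

    def invalidate(self, flag_key: str, env: str) -> None:
        """Drop one pair; only affects the process that calls it."""
        with self._lock:
            self._entries.pop((flag_key, env), None)


# Fake clock so the example does not need to sleep.
now = [0.0]
cache = FlagCache(ttl_seconds=30.0, clock=lambda: now[0])

cache.set("console_billing", "prod", True)
fresh = cache.get("console_billing", "prod")

now[0] = 31.0
expired = cache.get("console_billing", "prod")

cache.set("console_billing", "prod", False)
cache.invalidate("console_billing", "prod")
after_invalidate = cache.get("console_billing", "prod")
```

A miss (None) is distinct from a cached False, so the caller can tell "not cached" apart from "flag is off" before falling through to the DB.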

4.2 flags.get_all(env: str) -> list[FlagStatus]

Returns all known flags (from YAML) merged with any DB overrides for the given env. Used by the UI page.

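from dataclasses import dataclass
from datetime import datetime
from typing import Literal

@dataclass
class FlagStatus:
    key: str                  # e.g. "console_billing"
    description: str          # from YAML
    value: bool               # resolved value (DB override, env var, or YAML default)
    source: Literal["db", "env", "yaml"]  # which layer won
    updated_at: datetime | None           # None unless a DB row exists
    updated_by_name: str | None           # admin display name, resolved at read time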
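
One way get_all could merge the YAML declarations with DB overrides is sketched below, assuming Python's built-in sqlite3 module. Several details are assumptions made for illustration: the function takes the connection, YAML mapping, and environ as explicit arguments (the signature in this section takes only env), the admins table has a name column, and the schema is trimmed to the columns the query touches.

```python
import sqlite3
from dataclasses import dataclass
from datetime import datetime
from typing import List, Literal, Mapping, Optional


@dataclass
class FlagStatus:
    key: str
    description: str
    value: bool
    source: Literal["db", "env", "yaml"]
    updated_at: Optional[datetime]
    updated_by_name: Optional[str]


# LEFT JOIN so rows with updated_by = NULL (seed writes) are still returned.
OVERRIDES_SQL = """
SELECT f.flag_key, f.value, f.updated_at, a.name
FROM console_feature_flags AS f
LEFT JOIN admins AS a ON a.id = f.updated_by
WHERE f.env = ?
"""


def get_all(conn, env: str, yaml_flags: Mapping[str, dict],
            environ: Mapping[str, str]) -> List[FlagStatus]:
    overrides = {row[0]: row[1:] for row in conn.execute(OVERRIDES_SQL, (env,))}
    statuses = []
    for key, meta in sorted(yaml_flags.items()):
        updated_at, updated_by_name = None, None
        raw = environ.get("FLAG_" + key.upper())
        if key in overrides:
            db_value, ts, updated_by_name = overrides[key]
            value, source = bool(db_value), "db"
            updated_at = datetime.fromisoformat(ts) if ts else None
        elif raw is not None:
            value, source = raw.strip().lower() in ("1", "true", "yes"), "env"
        else:
            value, source = bool(meta.get("default", False)), "yaml"
        statuses.append(FlagStatus(key, meta.get("description", ""), value,
                                   source, updated_at, updated_by_name))
    return statuses


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE admins (id TEXT PRIMARY KEY, name TEXT);  -- name column is assumed
CREATE TABLE console_feature_flags (
    flag_key TEXT, env TEXT, value INTEGER, updated_at DATETIME, updated_by TEXT
);
INSERT INTO admins VALUES ('a1', 'Dana');
INSERT INTO console_feature_flags VALUES
    ('console_billing', 'prod', 1, '2026-04-28 12:00:00', 'a1');
""")
yaml_flags = {
    "console_billing": {"default": False, "description": "Gates billing checks"},
    "broker_fidelity": {"default": False},
    "ai_proposer": {"default": False},
}
statuses = get_all(conn, "prod", yaml_flags, {"FLAG_AI_PROPOSER": "true"})
```

Only flags declared in YAML are listed, so a DB row for a flag that has since been removed from YAML does not appear on the page.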

4.3 POST /console/flags/<flag_key>/flip

Flip a flag for the current session's active env.

Auth: @require_role("superadmin") by default; @require_role("ops") is sufficient if the flag is tagged risk: low in YAML (see §4.5).

Body: {"value": true | false, "env": "prod" | "staging"}. env must match session.selected_env; a mismatch returns 409.

Response: 204 No Content.

Side effects:
  - Upserts the console_feature_flags row.
  - Writes an audit_log row: action="console.flag.flip", target_type="feature_flag", target_id=flag_key, payload={"flag": flag_key, "env": env, "from": old_value, "to": new_value}.
  - Invalidates the in-process cache entry for (flag_key, env).

TOTP elevation: not required by default. For flags tagged risk: high in YAML, the route additionally requires @require_totp_elevation. This is a design placeholder — the initial sub-cards treat all flags as requiring superadmin only; the risk tag is groundwork for future graduation.
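
The graduated policy described above (not the initial superadmin-only sub-cards) can be sketched as a small decision function. The names ROLE_RANK and can_flip are hypothetical, and the relative order of readonly and support is an assumption; the document only states that both sit below ops.

```python
# Assumed role ordering, lowest to highest privilege.
ROLE_RANK = {"readonly": 0, "support": 1, "ops": 2, "superadmin": 3}


def required_role(risk: str) -> str:
    """ops may flip risk: low flags; everything else needs superadmin."""
    return "ops" if risk == "low" else "superadmin"


def requires_totp(risk: str) -> bool:
    """Only risk: high flags need TOTP elevation on top of the role check."""
    return risk == "high"


def can_flip(role: str, risk: str, totp_elevated: bool = False) -> bool:
    """Return True if this role may flip a flag with the given risk tag."""
    if ROLE_RANK.get(role, -1) < ROLE_RANK[required_role(risk)]:
        return False
    return totp_elevated or not requires_totp(risk)


ops_low = can_flip("ops", "low")
ops_medium = can_flip("ops", "medium")
admin_high_plain = can_flip("superadmin", "high")
admin_high_elevated = can_flip("superadmin", "high", totp_elevated=True)
```

Unknown roles rank below readonly and are always denied, which keeps the check fail-closed.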

4.4 GET /console/flags

The flag management UI page.

Auth: @require_role("ops"). Read access is open to ops and above; write (flip) is gated separately per flag.

Response: HTML page (Jinja2 template). Renders the flag table for the current session.selected_env.

4.5 YAML schema extension

The existing feature_flags.yaml is extended with per-flag metadata that the flag service reads:

flags:
  console_billing:
    default: false
    description: "Gates billing-specific permission checks and ops cap enforcement"
    risk: low           # low | medium | high — governs RBAC gate (see §4.3)
    env_override: true  # whether env var override is allowed (default: true)

Flags that currently have no metadata (broker_fidelity: false) are treated as risk: medium, env_override: true, description: "" by default.
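
The defaulting rule for flags without metadata can be sketched as a normalization step applied to each entry after the YAML is parsed. normalize_flag, DEFAULT_META, and VALID_RISKS are hypothetical names; the input is assumed to be the already-parsed value for one flag (a bare boolean or a mapping), so no YAML library is needed here.

```python
from typing import Any, Dict

DEFAULT_META: Dict[str, Any] = {
    "default": False,
    "description": "",
    "risk": "medium",
    "env_override": True,
}
VALID_RISKS = {"low", "medium", "high"}


def normalize_flag(raw: Any) -> Dict[str, Any]:
    """Expand a bare boolean or partial mapping into full flag metadata."""
    if isinstance(raw, bool):
        raw = {"default": raw}       # legacy form, e.g. broker_fidelity: false
    if not isinstance(raw, dict):
        raise ValueError(f"unsupported flag entry: {raw!r}")
    meta = {**DEFAULT_META, **raw}   # explicit keys override the defaults
    if meta["risk"] not in VALID_RISKS:
        raise ValueError(f"unknown risk level: {meta['risk']!r}")
    return meta


legacy = normalize_flag(False)
extended = normalize_flag({
    "default": False,
    "description": "Gates billing-specific permission checks and ops cap enforcement",
    "risk": "low",
})
```

Rejecting unknown risk values at load time means a typo in the YAML surfaces at startup instead of silently changing which role may flip the flag.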


5. State Machines + Sequences

5.1 Read path (hot path — every flag check)

sequenceDiagram
    participant View as Blueprint/Middleware
    participant FlagSvc as flags.is_on()
    participant Cache as Process LRU Cache (30s TTL)
    participant DB as console_feature_flags
    participant Env as os.environ
    participant YAML as feature_flags.yaml

    View->>FlagSvc: flags.is_on("console_billing", env="prod")
    FlagSvc->>Cache: get("console_billing", "prod")
    alt Cache hit
        Cache-->>FlagSvc: bool value
        FlagSvc-->>View: bool
    else Cache miss
        FlagSvc->>DB: SELECT value FROM console_feature_flags WHERE flag_key=? AND env=?
        alt DB row exists
            DB-->>FlagSvc: value (0 or 1)
            FlagSvc->>Cache: set("console_billing", "prod", bool(value))
            FlagSvc-->>View: bool
        else No DB row
            FlagSvc->>Env: os.environ.get("FLAG_CONSOLE_BILLING", None)
            alt Env var set
                Env-->>FlagSvc: "1" / "true"
                FlagSvc->>Cache: set("console_billing", "prod", True)
                FlagSvc-->>View: True
            else No env var
                FlagSvc->>YAML: flags["console_billing"]["default"]
                YAML-->>FlagSvc: false
                FlagSvc->>Cache: set("console_billing", "prod", False)
                FlagSvc-->>View: False
            end
        end
    end

5.2 Flip path (operator action via UI)

sequenceDiagram
    participant Admin as Operator (console UI)
    participant Page as /console/flags (HTMX toggle)
    participant Handler as POST /console/flags/<key>/flip
    participant RBAC as @require_role
    participant DB as console_feature_flags
    participant Cache as Process LRU Cache
    participant Audit as audit_log

    Admin->>Page: Toggle switch for "console_billing" (prod)
    Page->>Handler: POST /console/flags/console_billing/flip {value: true, env: "prod"}
    Handler->>RBAC: Check role (superadmin or ops + low-risk)
    alt Role denied
        RBAC-->>Admin: 403
    else Role ok
        Handler->>Handler: Validate env == session.selected_env
        alt Env mismatch
            Handler-->>Admin: 409 {error: "env_switched_mid_flow"}
        else Env ok
            Handler->>DB: SELECT old value for (console_billing, prod)
            DB-->>Handler: old_value = false
            Handler->>DB: INSERT OR REPLACE console_feature_flags (flag_key, env, value, updated_by, updated_at)
            Handler->>Cache: invalidate("console_billing", "prod")
            Handler->>Audit: write_audit("console.flag.flip", payload={flag, env, from: false, to: true})
            Handler-->>Admin: 204 No Content
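            Page->>Page: Row re-renders as "On" + updated_by + timestamp
            Note over Page: htmx does not swap on a 204 response, so the row refresh needs a follow-up GET or an HX-Trigger response header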
        end
    end
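
The write side of the flip sequence can be sketched as a single function, assuming Python's built-in sqlite3 module. Everything named here is illustrative: flip_flag, its parameters, and the audit_log columns (taken from the fields listed in §4.3) are assumptions, the role check is left to the caller, and the flag row and audit row are written in one transaction so that neither can land without the other. That ordering is one possible reading of the diagram, not a confirmed design decision.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE console_feature_flags (
    flag_key   TEXT NOT NULL,
    env        TEXT NOT NULL,
    value      INTEGER NOT NULL CHECK (value IN (0, 1)),
    updated_by TEXT,
    updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (flag_key, env)
);
-- Hypothetical audit table; columns follow the fields named in section 4.3.
CREATE TABLE audit_log (action TEXT, target_type TEXT, target_id TEXT, payload TEXT);
"""

UPSERT = """
INSERT INTO console_feature_flags (flag_key, env, value, updated_by)
VALUES (?, ?, ?, ?)
ON CONFLICT (flag_key, env) DO UPDATE SET
    value = excluded.value,
    updated_by = excluded.updated_by,
    updated_at = CURRENT_TIMESTAMP
"""


def flip_flag(conn, flag_key, env, new_value, operator_id,
              selected_env, resolved_value, invalidate):
    """Return 409 on env mismatch, else write flag + audit rows and return 204.

    resolved_value is the env var / YAML result, used as "from" when no DB row exists.
    """
    if env != selected_env:
        return 409
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT value FROM console_feature_flags WHERE flag_key = ? AND env = ?",
            (flag_key, env),
        ).fetchone()
        old_value = bool(row[0]) if row is not None else resolved_value
        conn.execute(UPSERT, (flag_key, env, int(new_value), operator_id))
        payload = {"flag": flag_key, "env": env, "from": old_value, "to": new_value}
        conn.execute(
            "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
            ("console.flag.flip", "feature_flag", flag_key, json.dumps(payload)),
        )
    invalidate(flag_key, env)  # only after the write has committed
    return 204


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
invalidated = []
record = lambda key, env: invalidated.append((key, env))

status = flip_flag(conn, "console_billing", "prod", True, "admin-1", "prod", False, record)
mismatch = flip_flag(conn, "console_billing", "staging", True, "admin-1", "prod", False, record)
audit_rows = conn.execute("SELECT action, payload FROM audit_log").fetchall()
```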

5.3 Bootstrap (first deploy after migration)

sequenceDiagram
    participant Deploy as Heroku deploy
    participant App as console app startup
    participant DB as console_feature_flags
    participant Env as Heroku config vars

    Deploy->>App: Dyno starts
    App->>DB: Confirm migration 0005 applied
    Note over App,DB: No rows in console_feature_flags yet
    App->>App: flags.is_on("console_billing") called by first request
    App->>DB: SELECT — empty
    App->>Env: FLAG_CONSOLE_BILLING=1 (existing Heroku config)
    App-->>App: returns True (env var wins)
    Note over App: DB row written on first FLIP via UI, not on first READ
    Note over App: Heroku env var remains the bootstrap until operator flips in UI

6. Migrations

Migration 0005 (additive)

File: console/db/migrations/0005_feature_flags.sql and 0005_feature_flags_down.sql

The up migration creates the console_feature_flags table (schema in §3.3). The down migration drops it. No data migration required: the resolution chain falls back to env vars and YAML if the table is empty or missing.

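Rollback path: run the down migration to drop the table. flags.is_on() then resolves every flag from the env var and YAML layers, which remain in place as fallbacks, so behavior returns to the pre-migration state. Any flips made through the UI are lost on rollback: affected flags revert to their Heroku config var or YAML default.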

Deployment order:
  1. Apply migration 0005 on the console DB.
  2. Deploy new console dyno (includes the flags.py service module and updated blueprint call sites).
  3. Heroku env vars continue to work as before: no behavior change until first UI flip.
  4. Operator uses UI to flip flags, which writes DB rows and makes Heroku env vars dormant for those pairs.
  5. Ops runbook instructs removal of now-dormant Heroku config vars after all flags have been migrated (non-urgent).


7. Rollout Plan

| Phase | What lands | Gate |
|---|---|---|
| Dark | Migration 0005 applied; flags.py service module exists but flags.is_on() is not yet called from blueprints | Additive only; no behavior change |
| Flag-on-staging | All blueprint call sites switched to flags.is_on(); /console/flags page accessible; staging flag values can be flipped via UI | Feature flag for this feature itself: FLAG_CONSOLE_FLAG_MGMT=1 on staging only |
| Beta | Prod flag values can be flipped via UI; audit log entries verified; Heroku env vars confirmed dormant for all flipped flags | Staging soak >= 24h; audit log review |
| GA | FLAG_CONSOLE_FLAG_MGMT Heroku config var removed; ops runbook updated; Heroku FLAG_CONSOLE_* config vars marked for removal | Prod sign-off |

Note: the feature flag management feature itself is gated by an env var (FLAG_CONSOLE_FLAG_MGMT) during rollout. This is intentionally not a self-referential DB flag — bootstrapping a flag system via itself creates a circular dependency on first deploy.


8. Security Considerations

PII collected: None. updated_by stores admin UUIDs, not PII directly. Admin names are resolved at read time from the admins table for display purposes only.

Retention: console_feature_flags rows are operational state, not audit records. They persist as long as the flag exists. Deleted flags (removed from YAML) can be purged by a cleanup job with no compliance implications. The audit_log rows for console.flag.flip follow the 2-year audit retention policy.

DSR erasure: updated_by references admins.id. If an admin is erased under a DSR, ON DELETE SET NULL removes the FK reference; the flag row is preserved (operational state), the audit log row is preserved (audit log is never erased — the admin_id field is set null per existing erasure policy in rbac-design.md §10).

Audit trail: every flip writes an immutable audit row. Action console.flag.flip, payload includes {flag, env, from, to}. The from value is read atomically within the same DB write path. Old value is never elided — the audit row captures the transition.

No credential storage: flag values are booleans. This system never stores secrets, tokens, or any replayable value. The invariant is enforced at the schema level: value is INTEGER with CHECK (value IN (0, 1)).

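Kill-switch for live execution paths: flags that gate live trading paths (e.g., broker_fidelity, ai_proposer) are managed through this system. If a live-path flag needs emergency disabling, the operator flips it in the console UI and the change reaches every dyno within 30 seconds (cache TTL). For sub-30s kill requirements, a dyno restart clears the per-process caches immediately. Changing the Heroku config var triggers such a restart, but the var's value only takes effect for flag+env pairs that have no DB row (invariant 5), so when a row exists the flip itself must still be made in the UI. The ops runbook documents both paths.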

RBAC gate: mutations require superadmin by default, ops only for explicitly risk: low flags. Read access (viewing the flag page and current values) requires ops or higher. readonly and support roles cannot see the flag management page.

Breach notification: no new PII surface. If console_feature_flags is exfiltrated, it reveals which features are on or off per env — operational sensitivity, not user data. Existing breach notification automation is unaffected.

Secrets location: FLAG_CONSOLE_FLAG_MGMT (the rollout gate) lives in Heroku config vars / Infisical, not in code. Rotatable without redeploy (Heroku config change restarts the dyno).


9. Open Questions

These require a decision before sub-cards can be claimed for GA:

  1. TOTP elevation for high-risk flag flips. The design proposes that flags tagged risk: high in YAML require @require_totp_elevation in addition to superadmin. No flags are currently tagged risk: high. Should the risk taxonomy be scaffolded now (for future flags that gate live-trading paths) or deferred until a specific high-risk flag needs it? If deferred, the YAML schema extension in §4.5 still ships as groundwork but the TOTP branch in the flip handler is not wired.

  2. Flag visibility to non-superadmin roles. Currently proposed: ops can view the flag page (read-only), support and readonly cannot. Is this correct? An argument for restricting the page entirely to superadmin (no read for ops) is that flag state could reveal unreleased features. Counter-argument: ops need to know which features are on to diagnose issues.

  3. Env var removal timeline. Once the DB row wins for a flag+env pair, the corresponding Heroku config var is dormant. The ops runbook will document removal. Is there a hard deadline for removing the Heroku vars (e.g., "within 30 days of GA"), or is it best-effort cleanup?

  4. Self-service flag for this feature. The rollout gate is FLAG_CONSOLE_FLAG_MGMT as an env var (not a DB flag) to avoid circular bootstrapping. After GA, this env var can be removed. Confirm: no flags added to feature_flags.yaml to gate this feature itself post-GA?

  5. Staging-only flags. Some flags (console_env_switcher_banner, console_env_gate) are intended to be on-staging-only during soaking. The UI should make it easy to see "this flag differs between envs." The proposal is to surface this in the flag table, alongside the FlagStatus.source field, as a "differs" badge. Is that visual treatment sufficient, or should there be a hard workflow block preventing prod from being turned on without staging having soaked for N hours?