Status: Draft
Owner: software-architect
Created: 2026-04-28 UTC
Parent epic: #146 (raxx-console)
Related ADRs: 0026, 0027
Related docs: console-env-switcher.md, rbac-design.md, console.md
Feature flags in the console are currently read exclusively from Heroku config vars (os.environ.get("FLAG_CONSOLE_X", "false")). Every flip requires a Heroku CLI config:set command followed by a dyno restart — a tool most operators don't have, and an operation that bypasses audit, env-scoping, and RBAC entirely.
The directive is: flags must be flippable from within the console UI. Heroku CLI should never be the only flip mechanism.
Current state:
- backend_v2/api/feature_flags.yaml — canonical flag declarations (14 flags across epics)
- console/app/blueprints/*.py and console/app/__init__.py — ~12 distinct os.environ.get("FLAG_*", "false") call sites
- console/app/middleware/env_guard.py, cloudflare_origin_guard.py — 2 additional middleware-level flag reads
The new system must: 1. Keep Heroku env vars as a valid bootstrap mechanism (existing configs keep working) 2. Add a DB-backed override layer that wins over the env var when present 3. Expose a UI page for toggling flags with full audit trail 4. Scope flag values per environment (prod / staging can diverge) 5. Gate mutations behind RBAC
All platform invariants apply. Feature-flag-specific:
- Flag values are strictly boolean (true/false). This system must never be extended to store secrets, tokens, or any replayable value.
- Every flip writes an audit row: action console.flag.flip, payload {flag, env, from, to, operator_id}. Immutable. No deletes from console_audit_log.
- A flag is on only if its resolved value is true. Unknown flags are off.
- A flag's value for prod is independent from its value for staging. Operators cannot flip both envs in a single action.
console_feature_flags table
CREATE TABLE console_feature_flags (
id TEXT PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
flag_key TEXT NOT NULL,
env TEXT NOT NULL CHECK (env IN ('prod', 'staging')),
value INTEGER NOT NULL DEFAULT 0 CHECK (value IN (0, 1)),
description TEXT,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_by TEXT REFERENCES admins(id) ON DELETE SET NULL,
UNIQUE (flag_key, env)
);
CREATE INDEX idx_console_feature_flags_key_env ON console_feature_flags (flag_key, env);
Notes:
- value is INTEGER 0/1 rather than BOOLEAN or TEXT ("true"/"false") for SQLite portability and unambiguous comparison.
- (flag_key, env) unique constraint makes upsert semantics clean (INSERT OR REPLACE). SQLite also creates an implicit index for the UNIQUE constraint, so the explicit index is redundant but harmless documentation of the lookup path.
- updated_by is nullable to accommodate automated seed writes (no operator session) and bootstrap inserts.
- No created_at — the audit log is the authoritative history. updated_at is a convenience for the UI ("last changed").
1. console_feature_flags row for (flag_key, env) — wins if present
2. Heroku env var FLAG_<UPPER_FLAG_KEY>=1|true — bootstrap fallback
3. feature_flags.yaml default value — final fallback
If no DB row exists and no env var is set, the YAML default applies. If the YAML has no entry for the key, the flag is false.
On first flip via UI: a DB row is written for that (flag_key, env). The env var continues to exist on Heroku but is now dormant for that pair. Operators should be informed (ops runbook) to remove the Heroku config var to avoid confusion, but its presence is harmless.
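The chain can be sketched as a pure function. This is an illustrative stand-in, not the real flags.py: `resolve`, `db_lookup`, and `YAML_DEFAULTS` are hypothetical names, with `db_lookup` standing in for the SELECT against console_feature_flags.

```python
import os

# Parsed from feature_flags.yaml in the real service; inlined here for the sketch.
YAML_DEFAULTS = {"console_billing": False}

def resolve(flag_key: str, env: str, db_lookup) -> bool:
    # 1. DB override wins when a row exists for (flag_key, env).
    row = db_lookup(flag_key, env)
    if row is not None:
        return bool(row)
    # 2. Heroku env var is the bootstrap fallback.
    raw = os.environ.get(f"FLAG_{flag_key.upper()}")
    if raw is not None:
        return raw.lower() in ("true", "1", "yes")
    # 3. YAML default; unknown flags are off.
    return YAML_DEFAULTS.get(flag_key, False)
```

Note the DB layer is consulted even when the env var is set — that is what makes the env var dormant (not removed) once a row exists for the pair.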
Migration file: console/db/migrations/0005_feature_flags.sql
-- up
CREATE TABLE console_feature_flags (
id TEXT PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
flag_key TEXT NOT NULL,
env TEXT NOT NULL CHECK (env IN ('prod', 'staging')),
value INTEGER NOT NULL DEFAULT 0 CHECK (value IN (0, 1)),
description TEXT,
updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_by TEXT REFERENCES admins(id) ON DELETE SET NULL,
UNIQUE (flag_key, env)
);
CREATE INDEX idx_cff_key_env ON console_feature_flags (flag_key, env);
-- down
DROP INDEX IF EXISTS idx_cff_key_env;
DROP TABLE IF EXISTS console_feature_flags;
Zero-downtime: additive table creation. Existing behavior is unchanged until the new read path is activated by the new flags.py service module.
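The up/down pair can be smoke-tested against an in-memory SQLite database using nothing but the stdlib, confirming the rollback really is drop-and-done (the SQL below is the migration from above, verbatim):

```python
import sqlite3

UP = """
CREATE TABLE console_feature_flags (
  id TEXT PRIMARY KEY DEFAULT (lower(hex(randomblob(16)))),
  flag_key TEXT NOT NULL,
  env TEXT NOT NULL CHECK (env IN ('prod', 'staging')),
  value INTEGER NOT NULL DEFAULT 0 CHECK (value IN (0, 1)),
  description TEXT,
  updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_by TEXT REFERENCES admins(id) ON DELETE SET NULL,
  UNIQUE (flag_key, env)
);
CREATE INDEX idx_cff_key_env ON console_feature_flags (flag_key, env);
"""
DOWN = """
DROP INDEX IF EXISTS idx_cff_key_env;
DROP TABLE IF EXISTS console_feature_flags;
"""

def table_exists(conn: sqlite3.Connection) -> bool:
    # Check sqlite_master for the table created by the up migration.
    return conn.execute(
        "SELECT count(*) FROM sqlite_master WHERE name = 'console_feature_flags'"
    ).fetchone()[0] == 1

conn = sqlite3.connect(":memory:")
conn.executescript(UP)
assert table_exists(conn)
conn.executescript(DOWN)
assert not table_exists(conn)
```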
flags.is_on(flag_key: str, env: str | None = None) -> bool
The drop-in replacement for the os.environ.get("FLAG_X", "false").lower() in ("true", "1", "yes") pattern.
# Before (current pattern in every blueprint):
val = os.environ.get("FLAG_CONSOLE_BILLING", "false").lower()
return val in ("true", "1", "yes")
# After (drop-in replacement):
from app.services.flags import flags
return flags.is_on("console_billing")
Key behaviors:
- env defaults to g.selected_env (the session's active env) if called during a request context, and to prod outside a request context.
- Cache: per-process LRU with 30-second TTL. The cache is keyed on (flag_key, env). A flip via the UI triggers an explicit cache invalidation for that pair. This means a flip takes effect within 30 seconds on all dynos — not instant, but safe and predictable.
- Not per-request cached. A per-process cache with TTL is the right level: hitting the DB on every flag check in a hot path (e.g., middleware that runs on every request) would be expensive, and a per-request cache would still pay one DB read per request. The per-process TTL cache amortizes across requests, and the 30s TTL keeps flips near-realtime for operators.
- Falls back gracefully: if DB is unavailable, falls back to env var, then YAML. Never raises.
flags.get_all(env: str) -> list[FlagStatus]
Returns all known flags (from YAML) merged with any DB overrides for the given env. Used by the UI page.
@dataclass
class FlagStatus:
key: str # e.g. "console_billing"
description: str # from YAML
value: bool # resolved value (DB override or YAML default)
source: Literal["db", "env", "yaml"] # which layer won
updated_at: datetime | None
updated_by_name: str | None
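A minimal in-memory sketch of the merge get_all performs. `yaml_flags` and `db_rows` stand in for the parsed YAML and the SELECT result, and the return shape is a plain dict rather than the FlagStatus dataclass:

```python
import os

def get_all(env: str, yaml_flags: dict, db_rows: list[dict]) -> list[dict]:
    # DB overrides for this env only; prod and staging rows never mix.
    overrides = {r["flag_key"]: r for r in db_rows if r["env"] == env}
    result = []
    for key, meta in yaml_flags.items():
        row = overrides.get(key)
        if row is not None:
            result.append({"key": key, "value": bool(row["value"]), "source": "db"})
            continue
        raw = os.environ.get(f"FLAG_{key.upper()}")
        if raw is not None:
            result.append({"key": key,
                           "value": raw.lower() in ("true", "1", "yes"),
                           "source": "env"})
            continue
        result.append({"key": key,
                       "value": meta.get("default", False),
                       "source": "yaml"})
    return result
```

Because the iteration is driven by the YAML, flags with stale DB rows but no YAML declaration simply disappear from the page — matching the cleanup-job note in the retention section.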
POST /console/flags/<flag_key>/flip
Flip a flag for the current session's active env.
Auth: @require_role("superadmin") minimum. @require_role("ops") if flag is tagged risk: low in YAML (see §4.5).
Body: {"value": true | false, "env": "prod" | "staging"} — env must match session.selected_env; mismatch returns 409.
Response: 204 No Content.
Side effects:
- Upserts console_feature_flags row.
- Writes audit_log row: action="console.flag.flip", target_type="feature_flag", target_id=flag_key, payload={"flag": flag_key, "env": env, "from": old_value, "to": new_value}.
- Invalidates in-process cache entry for (flag_key, env).
TOTP elevation: not required by default. For flags tagged risk: high in YAML, the route additionally requires @require_totp_elevation. This is a design placeholder — the initial sub-cards treat all flags as requiring superadmin only; the risk tag is groundwork for future graduation.
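The mutation path can be sketched with stdlib sqlite3 (no framework plumbing). `flip_flag` and its parameters are illustrative names; RBAC and TOTP checks are assumed to have already run in the route decorators before this point:

```python
import json
import sqlite3

def flip_flag(conn, flag_key, env, new_value, operator_id, session_env, invalidate):
    # Reject flips when the session's active env no longer matches the request.
    if env != session_env:
        return 409
    row = conn.execute(
        "SELECT value FROM console_feature_flags WHERE flag_key = ? AND env = ?",
        (flag_key, env)).fetchone()
    old_value = bool(row[0]) if row else None
    # Upsert the override row; UNIQUE (flag_key, env) makes OR REPLACE target it.
    conn.execute(
        "INSERT OR REPLACE INTO console_feature_flags (flag_key, env, value, updated_by) "
        "VALUES (?, ?, ?, ?)",
        (flag_key, env, int(new_value), operator_id))
    # Immutable audit row capturing the transition, old value included.
    conn.execute(
        "INSERT INTO audit_log (action, payload) VALUES (?, ?)",
        ("console.flag.flip",
         json.dumps({"flag": flag_key, "env": env, "from": old_value,
                     "to": new_value, "operator_id": operator_id})))
    conn.commit()
    invalidate(flag_key, env)
    return 204
```

Reading the old value and writing the upsert plus audit row on the same connection keeps the from/to transition consistent, as the audit-trail invariant requires.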
GET /console/flags
The flag management UI page.
Auth: @require_role("ops") — read access for ops and above; write (flip) is gated separately per flag.
Response: HTML page (Jinja2 template). Renders the flag table for the current session.selected_env.
The existing feature_flags.yaml is extended with per-flag metadata that the flag service reads:
flags:
console_billing:
default: false
description: "Gates billing-specific permission checks and ops cap enforcement"
risk: low # low | medium | high — governs RBAC gate (see §4.3)
env_override: true # whether env var override is allowed (default: true)
Flags that currently have no metadata (broker_fidelity: false) are treated as risk: medium, env_override: true, description: "" by default.
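The documented defaults can be applied with a small normalizer. `normalize_flag` and `DEFAULT_META` are illustrative names, not the real loader; the bare-boolean form is the legacy `broker_fidelity: false` shape mentioned above:

```python
# Defaults for flags declared without metadata, per the rule above.
DEFAULT_META = {"risk": "medium", "env_override": True, "description": ""}

def normalize_flag(raw) -> dict:
    # Bare boolean form: the value is the default, metadata is all defaults.
    if isinstance(raw, bool):
        return {"default": raw, **DEFAULT_META}
    # Mapping form: explicit keys win over defaults.
    merged = {**DEFAULT_META, **raw}
    merged.setdefault("default", False)
    return merged
```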
sequenceDiagram
participant View as Blueprint/Middleware
participant FlagSvc as flags.is_on()
participant Cache as Process LRU Cache (30s TTL)
participant DB as console_feature_flags
participant Env as os.environ
participant YAML as feature_flags.yaml
View->>FlagSvc: flags.is_on("console_billing", env="prod")
FlagSvc->>Cache: get("console_billing", "prod")
alt Cache hit
Cache-->>FlagSvc: bool value
FlagSvc-->>View: bool
else Cache miss
FlagSvc->>DB: SELECT value FROM console_feature_flags WHERE flag_key=? AND env=?
alt DB row exists
DB-->>FlagSvc: value (0 or 1)
FlagSvc->>Cache: set("console_billing", "prod", bool(value))
FlagSvc-->>View: bool
else No DB row
FlagSvc->>Env: os.environ.get("FLAG_CONSOLE_BILLING", None)
alt Env var set
Env-->>FlagSvc: "1" / "true"
FlagSvc->>Cache: set("console_billing", "prod", True)
FlagSvc-->>View: True
else No env var
FlagSvc->>YAML: flags["console_billing"]["default"]
YAML-->>FlagSvc: false
FlagSvc->>Cache: set("console_billing", "prod", False)
FlagSvc-->>View: False
end
end
end
sequenceDiagram
participant Admin as Operator (console UI)
participant Page as /console/flags (HTMX toggle)
participant Handler as POST /console/flags/<key>/flip
participant RBAC as @require_role
participant DB as console_feature_flags
participant Cache as Process LRU Cache
participant Audit as audit_log
Admin->>Page: Toggle switch for "console_billing" (prod)
Page->>Handler: POST /console/flags/console_billing/flip {value: true, env: "prod"}
Handler->>RBAC: Check role (superadmin or ops + low-risk)
alt Role denied
RBAC-->>Admin: 403
else Role ok
Handler->>Handler: Validate env == session.selected_env
alt Env mismatch
Handler-->>Admin: 409 {error: "env_switched_mid_flow"}
else Env ok
Handler->>DB: SELECT old value for (console_billing, prod)
DB-->>Handler: old_value = false
Handler->>DB: INSERT OR REPLACE console_feature_flags (flag_key, env, value, updated_by, updated_at)
Handler->>Cache: invalidate("console_billing", "prod")
Handler->>Audit: write_audit("console.flag.flip", payload={flag, env, from: false, to: true})
Handler-->>Admin: 204 No Content
Page->>Page: HTMX swaps row in table, shows "On" + updated_by + timestamp
end
end
sequenceDiagram
participant Deploy as Heroku deploy
participant App as console app startup
participant DB as console_feature_flags
participant Env as Heroku config vars
Deploy->>App: Dyno starts
App->>DB: Confirm migration 0005 applied
Note over App,DB: No rows in console_feature_flags yet
App->>App: flags.is_on("console_billing") called by first request
App->>DB: SELECT — empty
App->>Env: FLAG_CONSOLE_BILLING=1 (existing Heroku config)
App-->>App: returns True (env var wins)
Note over App: DB row written on first FLIP via UI, not on first READ
Note over App: Heroku env var remains the bootstrap until operator flips in UI
File: console/db/migrations/0005_feature_flags.sql and 0005_feature_flags_down.sql
The up migration creates the console_feature_flags table (schema in §3.3). The down migration drops it. No data migration required: the resolution chain falls back to env vars and YAML if the table is empty or missing.
Rollback path: Drop the table (down migration). Existing os.environ.get call sites reactivated (they are still present as fallback in flags.is_on()). Zero-risk rollback.
Deployment order:
1. Apply migration 0005 on the console DB.
2. Deploy new console dyno (includes the flags.py service module and updated blueprint call sites).
3. Heroku env vars continue to work as before — no behavior change until first UI flip.
4. Operator uses UI to flip flags, which writes DB rows and makes Heroku env vars dormant for those pairs.
5. Ops runbook instructs removal of now-dormant Heroku config vars after all flags have been migrated (non-urgent).
| Phase | What lands | Gate |
|---|---|---|
| Dark | Migration 0005 applied; flags.py service module exists but flags.is_on() is not yet called from blueprints | Additive only; no behavior change |
| Flag-on-staging | All blueprint call sites switched to flags.is_on(); /console/flags page accessible; staging flag values can be flipped via UI | Feature flag for this feature itself: FLAG_CONSOLE_FLAG_MGMT=1 on staging only |
| Beta | Prod flag values can be flipped via UI; audit log entries verified; Heroku env vars confirmed dormant for all flipped flags | Staging soak >= 24h; audit log review |
| GA | FLAG_CONSOLE_FLAG_MGMT Heroku config var removed; ops runbook updated; Heroku FLAG_CONSOLE_* config vars marked for removal | Prod sign-off |
Note: the feature flag management feature itself is gated by an env var (FLAG_CONSOLE_FLAG_MGMT) during rollout. This is intentionally not a self-referential DB flag — bootstrapping a flag system via itself creates a circular dependency on first deploy.
PII collected: None. updated_by stores admin UUIDs, not PII directly. Admin names are resolved at read time from the admins table for display purposes only.
Retention: console_feature_flags rows are operational state, not audit records. They persist as long as the flag exists. Deleted flags (removed from YAML) can be purged by a cleanup job with no compliance implications. The audit_log rows for console.flag.flip follow the 2-year audit retention policy.
DSR erasure: updated_by references admins.id. If an admin is erased under a DSR, ON DELETE SET NULL removes the FK reference; the flag row is preserved (operational state), the audit log row is preserved (audit log is never erased — the admin_id field is set null per existing erasure policy in rbac-design.md §10).
Audit trail: every flip writes an immutable audit row. Action console.flag.flip, payload includes {flag, env, from, to}. The from value is read atomically within the same DB write path. Old value is never elided — the audit row captures the transition.
No credential storage: flag values are booleans. This system never stores secrets, tokens, or any replayable value. Invariant enforced at the type level (INTEGER 0/1).
Kill-switch for live execution paths: flags that gate live trading paths (e.g., broker_fidelity, ai_proposer) are managed through this system. If a live-path flag needs emergency disabling, the operator can flip it in the console UI in under 30 seconds (cache TTL). For sub-30s kill requirements, the Heroku config var (direct dyno restart) remains a valid emergency path — the ops runbook documents both.
RBAC gate: mutations require superadmin by default, ops only for explicitly risk: low flags. Read access (viewing the flag page and current values) requires ops or higher. readonly and support roles cannot see the flag management page.
Breach notification: no new PII surface. If console_feature_flags is exfiltrated, it reveals which features are on or off per env — operational sensitivity, not user data. Existing breach notification automation is unaffected.
Secrets location: FLAG_CONSOLE_FLAG_MGMT (the rollout gate) lives in Heroku config vars / Infisical, not in code. Rotatable without redeploy (Heroku config change restarts the dyno).
These require a decision before sub-cards can be claimed for GA:
TOTP elevation for high-risk flag flips. The design proposes that flags tagged risk: high in YAML require @require_totp_elevation in addition to superadmin. No flags are currently tagged risk: high. Should the risk taxonomy be scaffolded now (for future flags that gate live-trading paths) or deferred until a specific high-risk flag needs it? If deferred, the YAML schema extension in §4.5 still ships as groundwork but the TOTP branch in the flip handler is not wired.
Flag visibility to non-superadmin roles. Currently proposed: ops can view the flag page (read-only), support and readonly cannot. Is this correct? An argument for restricting the page entirely to superadmin (no read for ops) is that flag state could reveal unreleased features. Counter-argument: ops need to know which features are on to diagnose issues.
Env var removal timeline. Once the DB row wins for a flag+env pair, the corresponding Heroku config var is dormant. The ops runbook will document removal. Is there a hard deadline for removing the Heroku vars (e.g., "within 30 days of GA"), or is it best-effort cleanup?
Self-service flag for this feature. The rollout gate is FLAG_CONSOLE_FLAG_MGMT as an env var (not a DB flag) to avoid circular bootstrapping. After GA, this env var can be removed. Confirm: no flags added to feature_flags.yaml to gate this feature itself post-GA?
Staging-only flags. Some flags (console_env_switcher_banner, console_env_gate) are intended to be on-staging-only during soaking. The UI should make it easy to see that a flag differs between envs; the proposed FlagStatus.source field can surface this in the table. Is a visual treatment (a "differs" badge) sufficient, or should there be a hard workflow block preventing a prod flag from being turned on before staging has soaked for N hours?