Design: Flag Operator UX Hardening + Pony-Style Internal Docs
Status: Proposed
Date: 2026-05-17 UTC
Owner: software-architect
ADR: 0098
Related docs:
- console-feature-flags.md
- console-flag-promotion-flow.md
- flag-reconciler-bidirectional-sync-2026-05-13.md
- docs-site-scaffold-2026-05-15.md
1. Context
The console at /console/flags exposes 158 flags across 9 surfaces. An operator
standing in front of the prod toggle asks five questions before flipping: What does
this flag actually do? What else depends on it? Does it require a dyno restart? How
do I confirm it worked? Where's the rollback path?
None of those answers is reachable in one click today. The description field is
a single sentence. The references chips (PR #2230) are rendered but not
semantically typed. There is no "does flipping this need a restart?" signal, no
smoke-test affordance, and no per-flag knowledge page at a stable URL.
This design covers five phases:
| Phase | Topic | Size |
|---|---|---|
| 1 | YAML schema additions (runtime_behavior, docs_path, smoke, dependencies, lint) | S |
| 2 | Console flag-row + promotion-row enrichment (badges, deep-links, pre-flip checklist) | M |
| 3 | Pony-style docs generator + per-flag pages at internal-docs.raxx.app | M |
| 4 | Verify smoke affordance (POST /console/flags/<key>/verify) | S |
| 5 | Pre-flip checklist enrichment on promotion (Sentry error signal, modal gate) | S |
2. Invariants
All platform invariants apply. The following are specifically load-bearing here:
- No stored credentials. smoke block HTTP probes must never carry an Authorization header or any credential. Probes hit public or CF-Access-gated endpoints only. The executor never reads vault secrets at probe time.
- Audit trail. Every POST /console/flags/<key>/verify call appends a console.flag.verify row to console_audit_log with {flag_key, operator_id, overall, duration_ms}. Probe details are included; no credential material is ever logged.
- Fail-closed. If the smoke block is absent from YAML, the Verify button is hidden entirely (not disabled, not grayed out). Per [feedback_hide_dont_gray_unavailable_features].
- Paper-first gating is not a feature flag. If any smoke probe or dependency declaration attempts to reference or gate the paper-first check, stop and escalate. The flag UX system must never provide a path around that invariant.
- Smoke probes are read-only. kind: http probes are GET-only by default; POST probes require explicit method: POST and must not carry a body that triggers a state change. The executor rejects probes that would mutate live-trading state (deny-list of paths: /api/orders/*, /api/positions/*, /api/trading/*).
- GDPR. The smoke verify audit row is subject to the standard 90-day operational log retention. It contains operator_id (no PII beyond that); operator accounts are deleted on DSR and the FK is SET NULL.
- Secrets are never in YAML. The smoke block is checked into source control. Probes may reference env-var names but must not embed actual values. YAML lint enforces no fields matching *key*, *secret*, *token*, *password* in the smoke block.
3. Data Model
3.1 YAML schema additions (Phase 1)
Four optional fields are added to each entry in feature_flags.yaml
(runtime_behavior, docs_path, dependencies, smoke). All are
backwards-compatible: existing flags without them continue to parse and run.
# runtime_behavior: how quickly the flag takes effect after flip
# Values: live | restart-required | cold-start-only
# Default: live (assumed if omitted)
runtime_behavior: restart-required
# docs_path: relative URL on internal-docs.raxx.app for this flag's knowledge page
# If omitted, console deep-link falls back to /flags/<flag_key>
docs_path: /flags/console_status_config
# dependencies: other flag keys that must be ON before this flag is useful
# Used by Phase 2 pre-flip checklist. Empty list = no dependencies.
dependencies:
  - rbac_v2
  - break_glass_session
# smoke: list of probes run by POST /console/flags/<key>/verify
# Supported kinds: http, ping, sql (sql only on console DB, not raptor)
smoke:
  - kind: http
    method: GET
    path: /api/dashboard/status_grid
    assert_status: 200
    assert_body_contains: "public_note_form"
  - kind: ping
    url: https://api.raxx.app/health
  - kind: sql
    query: "SELECT COUNT(*) FROM console_feature_flags WHERE flag_key = 'console_status_config' AND value = 1"
    assert_rows_gte: 1
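The defaults described in the comments above can be sketched as a small loader. This is a minimal illustration that assumes the YAML has already been parsed into a plain dict; the names FlagMeta and parse_flag_meta are hypothetical and do not come from the existing codebase.

```python
from dataclasses import dataclass, field

# Values listed in the schema comment for runtime_behavior.
ALLOWED_RUNTIME_BEHAVIORS = {"live", "restart-required", "cold-start-only"}


@dataclass
class FlagMeta:
    """Hypothetical container for the four optional Phase 1 fields."""
    key: str
    runtime_behavior: str = "live"
    docs_path: str = ""
    dependencies: list = field(default_factory=list)
    smoke: list = field(default_factory=list)


def parse_flag_meta(key: str, entry: dict) -> FlagMeta:
    """Apply the documented defaults to one already-parsed YAML entry."""
    behavior = entry.get("runtime_behavior") or "live"  # default: live
    if behavior not in ALLOWED_RUNTIME_BEHAVIORS:
        raise ValueError(f"{key}: unknown runtime_behavior {behavior!r}")
    return FlagMeta(
        key=key,
        runtime_behavior=behavior,
        # Fallback documented in the schema comment: /flags/<flag_key>
        docs_path=entry.get("docs_path") or f"/flags/{key}",
        dependencies=list(entry.get("dependencies") or []),
        smoke=list(entry.get("smoke") or []),
    )
```

An entry with none of the new fields parses to runtime_behavior "live", docs_path "/flags/<flag_key>", and empty dependencies and smoke lists, which is what keeps existing flags working unchanged.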
3.2 Audit log extension
No new table. One new action value added to console_audit_log:
action = 'console.flag.verify'
payload = {
  flag_key: str,
  overall: 'pass' | 'fail' | 'partial',
  probe_count: int,
  pass_count: int,
  duration_ms: int
}
Probe-level detail is stored in a details JSON column that already exists on
console_audit_log (added in migration 0005).
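As a sketch of how the payload could be derived from probe results: the design does not define when overall is 'partial', so the rule below (some probes pass, some do not) is an assumption, as is treating an empty result list as 'fail'. The function name build_verify_payload is hypothetical.

```python
def build_verify_payload(flag_key: str, results: list, duration_ms: int) -> dict:
    """Summarize per-probe results into the console.flag.verify payload.

    Each item in results is expected to carry a "status" of
    "pass", "fail", or "error", as in the verify response.
    """
    probe_count = len(results)
    pass_count = sum(1 for r in results if r.get("status") == "pass")
    if probe_count > 0 and pass_count == probe_count:
        overall = "pass"
    elif pass_count == 0:
        # Assumption: no passing probes (or no probes at all) is a fail.
        overall = "fail"
    else:
        # Assumption: a mix of passing and non-passing probes is partial.
        overall = "partial"
    return {
        "flag_key": flag_key,
        "overall": overall,
        "probe_count": probe_count,
        "pass_count": pass_count,
        "duration_ms": duration_ms,
    }
```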
3.3 Docs generator output layout
docs-internal/
  flags/
    _index.md          # auto-generated sidebar index
    <flag_key>.md      # one file per flag
  search-index.json    # Lunr.js pre-built index
  sidebar.json         # surface → area → flag hierarchy for left rail
4. APIs / Contracts
4.1 Verify endpoint (Phase 4)
POST /console/flags/<flag_key>/verify
Authorization: CF Access + console session
Rate limit: 6 calls / minute / operator (per-flag bucket)
Response 200:
{
  "flag_key": "console_status_config",
  "results": [
    {
      "kind": "http",
      "label": "GET /api/dashboard/status_grid",
      "status": "pass", // pass | fail | error
      "http_status": 200,
      "duration_ms": 143,
      "detail": "body contains 'public_note_form'"
    }
  ],
  "overall": "pass", // pass | fail | partial
  "verified_at_utc": "2026-05-17T14:22:01Z",
  "audit_id": "abc123"
}
Response 404: flag key not found in YAML
Response 422: smoke block not defined (should not reach UI — button is hidden)
Response 429: rate limit exceeded
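The rate limit could be enforced with a sliding-window counter. The sketch below reads "6 calls / minute / operator (per-flag bucket)" as one bucket per (operator, flag) pair, which is an interpretation rather than something the contract spells out. It keeps state in process memory, so a multi-dyno deployment would need a shared store instead. VerifyRateLimiter is a hypothetical name.

```python
import time
from collections import defaultdict, deque


class VerifyRateLimiter:
    """In-memory sliding-window limiter keyed by (operator_id, flag_key)."""

    def __init__(self, limit: int = 6, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self._calls = defaultdict(deque)

    def allow(self, operator_id: str, flag_key: str) -> bool:
        """Return True and record the call if the bucket has capacity."""
        now = self.clock()
        calls = self._calls[(operator_id, flag_key)]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_s:
            calls.popleft()
        if len(calls) >= self.limit:
            return False  # caller should respond with 429
        calls.append(now)
        return True
```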
4.2 Console flag row enrichment (Phase 2)
Existing flag list API (GET /console/api/flags) extended to include:
{
  "key": "console_status_config",
  "runtime_behavior": "restart-required",
  "docs_url": "https://internal-docs.raxx.app/flags/console_status_config",
  "has_smoke": true,
  "dependencies": ["rbac_v2"],
  "dependency_states": {
    "rbac_v2": { "prod": true, "staging": true }
  },
  "references": [
    { "kind": "pr", "number": 2230, "url": "https://github.com/..." },
    { "kind": "issue", "number": 1692, "url": "..." }
  ]
}
5. State Machines / Sequences
5.1 Pre-flip checklist evaluation (Phase 2 + 5)
stateDiagram-v2
[*] --> Evaluating
Evaluating --> AllGreen : deps ON + smoke defined + docs exist
Evaluating --> YellowWarning : smoke missing OR docs stub OR Sentry errors(24h)
Evaluating --> RedBlocked : required dep OFF
AllGreen --> DeployButton : operator clicks Deploy
YellowWarning --> ConfirmModal : operator clicks Deploy
RedBlocked --> LockedDeploy : Deploy button blocked (not hidden)
ConfirmModal --> DeployButton : operator confirms
DeployButton --> [*]
Note: RedBlocked does NOT hide the Deploy button — it locks it with an
explanatory tooltip naming the unsatisfied dependency. This is different from the
Verify button (which is hidden when no smoke block exists); here the operator must
know why they're blocked, not just that a button is absent.
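The three outcomes of the Evaluating state can be expressed as a pure function. This is a sketch under two assumptions the diagram does not state: RedBlocked takes precedence over YellowWarning when both apply, and the inputs arrive already resolved for the target environment. The function and parameter names are hypothetical.

```python
def evaluate_checklist(dep_states: dict, has_smoke: bool,
                       docs_is_stub: bool, sentry_errors_24h: int):
    """Return (state, reasons) for the pre-flip checklist.

    dep_states maps each dependency flag key to True (ON) or False (OFF).
    """
    # A required dependency that is OFF blocks the flip outright.
    blocked = sorted(key for key, is_on in dep_states.items() if not is_on)
    if blocked:
        return "RedBlocked", blocked  # named in the locked-button tooltip

    # Advisory signals: any one of them routes Deploy through ConfirmModal.
    warnings = []
    if not has_smoke:
        warnings.append("smoke missing")
    if docs_is_stub:
        warnings.append("docs stub")
    if sentry_errors_24h > 0:
        warnings.append("Sentry errors (24h)")
    if warnings:
        return "YellowWarning", warnings

    return "AllGreen", []
```

Returning the reasons alongside the state lets the UI name the unsatisfied dependency in the tooltip and list the warnings in the confirmation modal without re-deriving them.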
5.2 Smoke verify sequence (Phase 4)
sequenceDiagram
participant Op as Operator
participant Console as Console UI
participant API as Console API
participant Target as Probe Target
Op->>Console: clicks Verify on flag row
Console->>API: POST /console/flags/<key>/verify
API->>API: load smoke block from YAML
loop each probe
API->>Target: execute probe (http/ping/sql)
Target-->>API: response
API->>API: evaluate assert_* conditions
end
API->>API: write console_audit_log (action=console.flag.verify)
API-->>Console: {results[], overall, verified_at_utc, audit_id}
Console->>Op: inline result chips (pass/fail per probe + overall)
5.3 Docs generator pipeline (Phase 3)
sequenceDiagram
participant CI as GitHub Actions (push to main)
participant Gen as scripts/build-flag-docs.py
participant YAML as feature_flags.yaml
participant DB as Console audit_log (read replica)
participant CFP as CF Pages (internal-docs.raxx.app)
CI->>Gen: invoke generator
Gen->>YAML: parse all flags
Gen->>DB: fetch flag flip history per flag_key
Gen->>Gen: render one .md per flag (Jinja2 template)
Gen->>Gen: build Lunr.js index JSON
Gen->>Gen: emit sidebar.json
Gen-->>CFP: deploy via CF Pages API (wrangler)
CFP-->>CI: deploy URL confirmation
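The render and sidebar steps of the pipeline might look roughly like the following. To stay dependency-free, this sketch uses the standard library's string.Template in place of the Jinja2 template named in the diagram, and it skips the audit-history fetch and the Lunr.js index. The surface and area entry fields, and the "unknown" and "general" fallbacks, are assumptions made for illustration.

```python
import json
from string import Template

# Stand-in for the Jinja2 page template; the field selection is illustrative.
PAGE = Template(
    "# $key\n\n$description\n\n"
    "Runtime behavior: $runtime_behavior\n\n"
    "Dependencies: $dependencies\n"
)


def render_flag_page(key: str, entry: dict) -> str:
    """Render the markdown body for one flag's knowledge page."""
    deps = entry.get("dependencies") or []
    return PAGE.substitute(
        key=key,
        description=entry.get("description", ""),
        runtime_behavior=entry.get("runtime_behavior", "live"),
        dependencies=", ".join(deps) if deps else "none",
    )


def build_sidebar(flags: dict) -> dict:
    """Group flag keys as surface -> area -> [flag_key, ...]."""
    tree = {}
    for key in sorted(flags):
        entry = flags[key]
        surface = entry.get("surface", "unknown")  # assumed field name
        area = entry.get("area", "general")        # assumed field name
        tree.setdefault(surface, {}).setdefault(area, []).append(key)
    return tree


def emit_sidebar_json(flags: dict) -> str:
    """Serialize the sidebar hierarchy for sidebar.json."""
    return json.dumps(build_sidebar(flags), indent=2, sort_keys=True)
```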
6. Migrations
Phase 1 introduces no schema change. feature_flags.yaml is additive-only;
new optional fields parse as None / absent in all existing readers.
Phase 4 adds one action string variant to console_audit_log. The action
column is TEXT; no migration needed. The details JSON column already exists
(migration 0005).
Phase 2 extends the flags list API response; backwards-compatible (consumers that don't read new fields are unaffected).
Phase 3 adds docs-internal/ directory output to the repo (gitignored in
production build; CF Pages deploy is the artifact). No DB migration.
Rollback per phase:
- Phase 1: revert YAML fields; existing flags unaffected; lint CI check is the only breaking change (revert the lint job config).
- Phase 2: revert console API + UI changes; YAML schema can stay.
- Phase 3: CF Pages rollback to prior deploy; YAML schema unaffected.
- Phase 4: revert verify endpoint + UI button; no table to drop (audit rows remain, harmlessly).
- Phase 5: revert Sentry integration + modal gate; Phase 2 checklist reverts to non-Sentry signals.
7. Rollout Plan
Phase 1 (foundation) → dark deploy (no UI change; YAML + lint only)
Phase 2 (console enrichment) → flag-gated via FLAG_FLAG_DOCS_ENRICHMENT
Phase 3 (docs generator) → CI-only at first; deploy to internal-docs.raxx.app/flags/
Phase 4 (verify affordance) → flag-gated via FLAG_FLAG_VERIFY; shown only when
smoke block exists on a given flag
Phase 5 (checklist + Sentry) → flag-gated via FLAG_FLAG_PREFLIGHT_V2
All three feature flags follow the standard B1 promotion requirement (a migration
row in console_flag_promotions is required). Phases are sequential (each builds on
the prior one); Phase 1 must land before any other phase is claimed.
8. Security Considerations
Smoke probe executor (Phase 4)
- Probes run server-side inside the console dyno. They hit internal endpoints or CF-Access-gated URLs using the console's own service token (not the operator's session credentials). The operator never delegates their credentials to the executor.
- Path deny-list enforced in the executor before dispatch: any probe targeting /api/orders/*, /api/positions/*, /api/trading/*, or any path resolving outside *.raxx.app is rejected with a 422 response and a lint error in YAML validation.
- Rate limit (6/min/operator) prevents probes from becoming a DoS vector against downstream services.
- Audit row written before probe results are returned. If the dyno crashes during probe execution, the audit row records overall: error.
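A minimal sketch of the pre-dispatch target check follows. It assumes that relative paths (as in the path field of an http probe) are in scope by definition, that a bare deny-listed prefix such as /api/orders is rejected along with everything beneath it, and that paths are normalized first so ../ segments cannot slip past the deny-list. probe_target_allowed is a hypothetical name.

```python
import posixpath
from urllib.parse import urlsplit

# Prefixes behind the /api/orders/*, /api/positions/*, /api/trading/* deny-list.
DENIED_PREFIXES = ("/api/orders", "/api/positions", "/api/trading")
ALLOWED_HOST_SUFFIX = ".raxx.app"


def probe_target_allowed(target: str) -> bool:
    """Return False for targets the executor should reject with a 422."""
    parts = urlsplit(target)
    host = parts.hostname
    # Absolute URLs must resolve inside *.raxx.app; relative paths have no host.
    if host is not None and not host.endswith(ALLOWED_HOST_SUFFIX):
        return False
    # Collapse "." and ".." segments before matching against the deny-list.
    path = posixpath.normpath(parts.path or "/")
    for prefix in DENIED_PREFIXES:
        if path == prefix or path.startswith(prefix + "/"):
            return False
    return True
```

Running the same function in both the YAML lint job and the executor would keep the two rejection points from drifting apart.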
YAML lint gate (Phase 1)
- CI rejects PRs that add a new flag without at least one references entry. This enforces lineage: every flag in production can be traced to an issue or PR.
- Lint also rejects smoke blocks whose field names match /(key|secret|token|password|credential)/i or whose values look like secrets (entropy check).
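One way the smoke-block lint could work is sketched below. The field-name pattern is the one quoted above; the entropy check is not specified in this design, so the Shannon-entropy measure, the per-token scan, and the thresholds (20 characters, 4.0 bits per character) are all assumptions that would need tuning against real YAML.

```python
import math
import re
from collections import Counter

FORBIDDEN_NAME = re.compile(r"(key|secret|token|password|credential)", re.IGNORECASE)


def shannon_entropy(text: str) -> float:
    """Bits per character of the string's empirical character distribution."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def lint_smoke_block(smoke: list, min_len: int = 20, max_entropy: float = 4.0) -> list:
    """Return a list of human-readable lint errors (empty means clean)."""
    errors = []
    for index, probe in enumerate(smoke):
        for name, value in probe.items():
            if FORBIDDEN_NAME.search(name):
                errors.append(f"probe {index}: forbidden field name {name!r}")
            if not isinstance(value, str):
                continue
            # Scan whitespace-separated tokens so ordinary SQL or prose
            # is not flagged merely for being long.
            for token in value.split():
                if len(token) >= min_len and shannon_entropy(token) > max_entropy:
                    errors.append(f"probe {index}: value of {name!r} looks like a secret")
                    break
    return errors
```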
Internal docs site (Phase 3)
- internal-docs.raxx.app is CF Access gated to operators (email policy). No customer-facing content lands here.
- Audit log history rendered in docs pages is redacted: operator_id is shown as operator-<last4> only. Full actor detail is available in the console audit log view, not on the docs pages.
- Docs generator reads the console audit DB via a read-only connection string (INTERNAL_DOCS_DB_RO_URL); it cannot write to any table.
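The operator-<last4> redaction is small enough to show in full. The handling of a missing operator_id (for example after a DSR deletion sets the FK to NULL) is not covered by this design; the "operator-unknown" label below is a placeholder assumption.

```python
def redact_operator_id(operator_id) -> str:
    """Render an operator reference for docs pages as operator-<last4>."""
    if operator_id is None:
        # Assumption: deleted operators render with a fixed placeholder.
        return "operator-unknown"
    return f"operator-{str(operator_id)[-4:]}"
```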
Sentry integration (Phase 5)
- Sentry API key is stored in Infisical under /console/sentry-api-token; pulled at dyno start, not embedded in YAML or code.
- The pre-flip checklist reads Sentry error counts for the flag's tagged code paths over the last 24 hours. It is advisory only: a non-zero count turns the checklist item yellow and triggers the confirmation modal; it does not block the flip.
9. Open Questions
These require operator decision before the corresponding phase can be claimed:
- Phase 3, Docs DB access from CI: The docs generator needs to pull flag flip history from the console audit DB to render "Rollout history" sections. Options: (a) read-only Postgres connection from CI (requires SSM param for RO creds), (b) export a static snapshot JSON at deploy time via a console API endpoint, (c) omit history from the generator (populated later via live API call in the browser). Option (b) is recommended because it adds zero new DB credentials to CI. Operator decision needed.
- Phase 5, Sentry per-flag tagging: The Sentry signal only works if Raptor and Antlers emit flag_key as a Sentry tag on errors. This requires a tagging convention to be established and backfilled. How many flags are currently tagged? Is a backfill sprint in scope before launch, or does Phase 5 ship with a "no Sentry data available" fallback? Operator decision needed.
- Phase 3, sql probe kind scope: The design allows kind: sql probes against the console DB only. Should raptor DB queries be in scope (read-only, via a separate RO connection string)? Broader scope = more useful probes; narrower scope = simpler security boundary. Operator decision needed.