Design: Flag Operator UX Hardening + Pony-Style Internal Docs

Status: Proposed
Date: 2026-05-17 UTC
Owner: software-architect
ADR: 0098
Related docs:
- console-feature-flags.md
- console-flag-promotion-flow.md
- flag-reconciler-bidirectional-sync-2026-05-13.md
- docs-site-scaffold-2026-05-15.md

1. Context

The console at /console/flags exposes 158 flags across 9 surfaces. An operator standing in front of the prod toggle asks five questions before flipping: What does this flag actually do? What else depends on it? Does it require a dyno restart? How do I confirm it worked? Where's the rollback path?

None of those answers are accessible in one click today. The description field is a single sentence. references chips (PR #2230) are rendered but not semantically typed. There is no "does flipping this need a restart?" signal, no smoke-test affordance, and no per-flag knowledge page at a stable URL.

This design covers five phases:

Phase	Topic	Size
1	YAML schema additions (`runtime_behavior`, `docs_path`, `smoke`, `dependencies`, lint)	S
2	Console flag-row + promotion-row enrichment (badges, deep-links, pre-flip checklist)	M
3	Pony-style docs generator + per-flag pages at internal-docs.raxx.app	M
4	Verify smoke affordance (`POST /console/flags/<key>/verify`)	S
5	Pre-flip checklist enrichment on promotion (Sentry error signal, modal gate)	S

2. Invariants

All platform invariants apply. The following are specifically load-bearing here:

No stored credentials. smoke block HTTP probes must never carry an Authorization header or any credential. Probes hit public or CF-Access-gated endpoints only. The executor never reads vault secrets at probe time.
Audit trail. Every POST /console/flags/<key>/verify call appends a console.flag.verify row to console_audit_log with {flag_key, operator_id, overall, duration_ms}. Probe details are included; no credential material is ever logged.
Fail-closed. If the smoke block is absent from YAML, the Verify button is hidden entirely (not disabled, not grayed out). Per [feedback_hide_dont_gray_unavailable_features].
Paper-first gating is not a feature flag. If any smoke probe or dependency declaration attempts to reference or gate the paper-first check, stop and escalate. The flag UX system must never provide a path around that invariant.
Smoke probes are read-only. kind: http probes are GET-only by default; POST probes require explicit method: POST and must not carry a body that triggers a state change. The executor rejects probes that would mutate live-trading state (deny-list of paths: /api/orders/*, /api/positions/*, /api/trading/*).
GDPR. The smoke verify audit row is subject to the standard 90-day operational log retention. It contains operator_id (no PII beyond that); operator accounts are deleted on DSR and the FK is SET NULL.
Secrets are never in YAML. The smoke block is checked in to source control. Probes may reference env-var names but must not embed actual values. YAML lint rejects smoke-block fields whose names match *key*, *secret*, *token*, *password*, or *credential*.
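The read-only and no-credential invariants could be checked per probe along these lines. This is a minimal sketch, not the actual executor: `probe_is_allowed` is a hypothetical name, the `headers` field is an assumed extension of the probe shape, and a real check would also normalize the path before matching.

```python
from fnmatch import fnmatch

# Deny-list paths taken from the "Smoke probes are read-only" invariant.
DENIED_PATH_GLOBS = ("/api/orders/*", "/api/positions/*", "/api/trading/*")
# Header names compared case-insensitively.
FORBIDDEN_HEADERS = {"authorization"}


def probe_is_allowed(probe: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for one http probe definition."""
    method = probe.get("method", "GET").upper()
    if method not in ("GET", "POST"):
        return False, f"method {method} is not permitted"

    path = probe.get("path", "")
    for glob in DENIED_PATH_GLOBS:
        # glob[:-2] is the bare prefix, e.g. /api/orders with no trailing segment.
        if fnmatch(path, glob) or path == glob[:-2]:
            return False, f"path {path} matches deny-list entry {glob}"

    # "headers" is a hypothetical field; the schema in this design does not define it.
    header_names = {name.lower() for name in probe.get("headers", {})}
    if header_names & FORBIDDEN_HEADERS:
        return False, "probes must not carry an Authorization header"

    return True, "ok"
```

Returning a reason string alongside the boolean lets the same function feed both the YAML lint error and the 422 response body.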

3. Data Model

3.1 YAML schema additions (Phase 1)

Four optional fields (runtime_behavior, docs_path, dependencies, smoke) are added to each entry in feature_flags.yaml. All are backwards-compatible: existing flags without them continue to parse and run.

# runtime_behavior: how quickly the flag takes effect after flip
# Values: live | restart-required | cold-start-only
# Default: live (assumed if omitted)
runtime_behavior: restart-required

# docs_path: relative URL on internal-docs.raxx.app for this flag's knowledge page
# If omitted, console deep-link falls back to /flags/<flag_key>
docs_path: /flags/console_status_config

# dependencies: other flag keys that must be ON before this flag is useful
# Used by Phase 2 pre-flip checklist. Empty list = no dependencies.
dependencies:
  - rbac_v2
  - break_glass_session

# smoke: list of probes run by POST /console/flags/<key>/verify
# Supported kinds: http, ping, sql (sql only on console DB, not raptor)
smoke:
  - kind: http
    method: GET
    path: /api/dashboard/status_grid
    assert_status: 200
    assert_body_contains: "public_note_form"
  - kind: ping
    url: https://api.raxx.app/health
  - kind: sql
    query: "SELECT COUNT(*) FROM console_feature_flags WHERE flag_key = 'console_status_config' AND value = 1"
    assert_rows_gte: 1
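A reader for these fields might apply the documented defaults as follows. This is a sketch under assumptions: `FlagMeta` and `load_flag_meta` are hypothetical names, and the input is taken to be one flag entry already parsed from YAML into a dict.

```python
from dataclasses import dataclass, field

# Allowed values, as listed in the schema comments above.
RUNTIME_BEHAVIORS = {"live", "restart-required", "cold-start-only"}


@dataclass
class FlagMeta:
    key: str
    runtime_behavior: str = "live"
    docs_path: str = ""
    dependencies: list[str] = field(default_factory=list)
    smoke: list[dict] = field(default_factory=list)


def load_flag_meta(key: str, entry: dict) -> FlagMeta:
    """Build FlagMeta from one parsed YAML entry, filling in defaults."""
    behavior = entry.get("runtime_behavior") or "live"  # default when omitted
    if behavior not in RUNTIME_BEHAVIORS:
        raise ValueError(f"{key}: unknown runtime_behavior {behavior!r}")
    return FlagMeta(
        key=key,
        runtime_behavior=behavior,
        # Fallback deep-link when docs_path is omitted.
        docs_path=entry.get("docs_path") or f"/flags/{key}",
        dependencies=list(entry.get("dependencies") or []),
        smoke=list(entry.get("smoke") or []),
    )
```

An entry with none of the new fields yields `runtime_behavior="live"`, an empty dependency list, and an empty smoke list, which is what keeps existing flags parsing unchanged.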

3.2 Audit log extension

No new table. One new action value added to console_audit_log:

action = 'console.flag.verify'
payload = {
  flag_key: str,
  overall: 'pass' | 'fail' | 'partial',
  probe_count: int,
  pass_count: int,
  duration_ms: int
}

Probe-level detail is stored in a details JSON column that already exists on console_audit_log (added in migration 0005).
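The payload could be assembled from per-probe statuses roughly like this. The rule for deriving `overall` is an assumption, since the design does not spell it out: every probe passing gives `pass`, no probe passing gives `fail`, and any mix gives `partial`. `build_verify_payload` is a hypothetical name.

```python
def build_verify_payload(flag_key: str, statuses: list[str], duration_ms: int) -> dict:
    """Summarize probe statuses ('pass' | 'fail' | 'error') into the audit payload."""
    pass_count = sum(1 for status in statuses if status == "pass")
    if statuses and pass_count == len(statuses):
        overall = "pass"
    elif pass_count == 0:
        # Assumed fail-closed behavior: no passes (or no probes at all) is a fail.
        overall = "fail"
    else:
        overall = "partial"
    return {
        "flag_key": flag_key,
        "overall": overall,
        "probe_count": len(statuses),
        "pass_count": pass_count,
        "duration_ms": duration_ms,
    }
```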

3.3 Docs generator output layout

docs-internal/
  flags/
    _index.md              # auto-generated sidebar index
    <flag_key>.md          # one file per flag
  search-index.json        # Lunr.js pre-built index
  sidebar.json             # surface → area → flag hierarchy for left rail
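The surface → area → flag hierarchy in sidebar.json could be built with a simple grouping pass. This sketch assumes each flag entry carries `surface` and `area` fields; those field names are hypothetical and may not match the real YAML.

```python
import json
from collections import defaultdict


def build_sidebar(flags: list[dict]) -> dict:
    """Group flag keys by surface, then area, with deterministic ordering."""
    tree = defaultdict(lambda: defaultdict(list))
    for flag in flags:
        tree[flag["surface"]][flag["area"]].append(flag["key"])
    # Sort at every level so regenerated output diffs cleanly between builds.
    return {
        surface: {area: sorted(keys) for area, keys in sorted(areas.items())}
        for surface, areas in sorted(tree.items())
    }


def sidebar_json(flags: list[dict]) -> str:
    """Serialize the hierarchy the way it might be written to sidebar.json."""
    return json.dumps(build_sidebar(flags), indent=2)
```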

4. APIs / Contracts

4.1 Verify endpoint (Phase 4)

POST /console/flags/<flag_key>/verify
Authorization: CF Access + console session
Rate limit: 6 calls / minute / operator (per-flag bucket)

Response 200:
{
  "flag_key": "console_status_config",
  "results": [
    {
      "kind": "http",
      "label": "GET /api/dashboard/status_grid",
      "status": "pass",          // pass | fail | error
      "http_status": 200,
      "duration_ms": 143,
      "detail": "body contains 'public_note_form'"
    }
  ],
  "overall": "pass",             // pass | fail | partial
  "verified_at_utc": "2026-05-17T14:22:01Z",
  "audit_id": "abc123"
}

Response 404: flag key not found in YAML
Response 422: smoke block not defined (should not reach UI — button is hidden)
Response 429: rate limit exceeded
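One way to read "6 calls / minute / operator (per-flag bucket)" is a sliding window keyed by (operator, flag). The sketch below is in-memory only and the class name `VerifyRateLimiter` is hypothetical; a deployment with more than one dyno would need a shared store instead.

```python
from collections import defaultdict, deque


class VerifyRateLimiter:
    """Sliding-window limiter with one bucket per (operator_id, flag_key)."""

    def __init__(self, limit: int = 6, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        self._calls = defaultdict(deque)  # bucket key -> timestamps of recent calls

    def allow(self, operator_id: str, flag_key: str, now: float) -> bool:
        """Record the call and return True, or return False if over the limit."""
        bucket = self._calls[(operator_id, flag_key)]
        # Drop timestamps that have aged out of the window.
        while bucket and now - bucket[0] >= self.window_s:
            bucket.popleft()
        if len(bucket) >= self.limit:
            return False  # caller would respond with 429
        bucket.append(now)
        return True
```

Passing `now` explicitly (for example from `time.monotonic()`) keeps the limiter testable without sleeping.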

4.2 Console flag row enrichment (Phase 2)

Existing flag list API (GET /console/api/flags) extended to include:

{
  "key": "console_status_config",
  "runtime_behavior": "restart-required",
  "docs_url": "https://internal-docs.raxx.app/flags/console_status_config",
  "has_smoke": true,
  "dependencies": ["rbac_v2"],
  "dependency_states": {
    "rbac_v2": { "prod": true, "staging": true }
  },
  "references": [
    { "kind": "pr", "number": 2230, "url": "https://github.com/..." },
    { "kind": "issue", "number": 1692, "url": "..." }
  ]
}
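The new response fields can all be derived from the YAML entry plus current flag states. The following is a sketch, not the console's actual serializer: `enrich_flag_row` and the shape of the `states` lookup are assumptions, and treating an unknown dependency as OFF everywhere is a choice made here for safety.

```python
DOCS_BASE = "https://internal-docs.raxx.app"


def enrich_flag_row(key: str, entry: dict, states: dict) -> dict:
    """Build the enriched API row for one flag.

    entry  -- the flag's parsed YAML entry
    states -- hypothetical lookup: flag key -> {"prod": bool, "staging": bool}
    """
    deps = list(entry.get("dependencies") or [])
    unknown = {"prod": False, "staging": False}  # assumed: unknown dependency counts as OFF
    return {
        "key": key,
        "runtime_behavior": entry.get("runtime_behavior") or "live",
        "docs_url": DOCS_BASE + (entry.get("docs_path") or f"/flags/{key}"),
        "has_smoke": bool(entry.get("smoke")),
        "dependencies": deps,
        "dependency_states": {dep: states.get(dep, unknown) for dep in deps},
        "references": list(entry.get("references") or []),
    }
```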

5. State Machines / Sequences

5.1 Pre-flip checklist evaluation (Phase 2 + 5)

stateDiagram-v2
    [*] --> Evaluating
    Evaluating --> AllGreen : deps ON + smoke defined + docs exist
    Evaluating --> YellowWarning : smoke missing OR docs stub OR Sentry errors(24h)
    Evaluating --> RedBlocked : required dep OFF
    AllGreen --> DeployButton : operator clicks Deploy
    YellowWarning --> ConfirmModal : operator clicks Deploy
    RedBlocked --> LockedDeploy : Deploy button blocked (not hidden)
    ConfirmModal --> DeployButton : operator confirms
    DeployButton --> [*]

Note: RedBlocked does NOT hide the Deploy button — it locks it with an explanatory tooltip naming the unsatisfied dependency. This is different from the Verify button (which is hidden when no smoke block exists); here the operator must know why they're blocked, not just that a button is absent.
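The evaluation step of this state machine reduces to a small pure function. This sketch uses hypothetical names (`Checklist`, `evaluate_checklist`) and assumes the caller has already resolved dependency states, docs status, and the Sentry count.

```python
from enum import Enum


class Checklist(Enum):
    ALL_GREEN = "AllGreen"
    YELLOW_WARNING = "YellowWarning"
    RED_BLOCKED = "RedBlocked"


def evaluate_checklist(
    deps_on: dict[str, bool],
    has_smoke: bool,
    docs_is_stub: bool,
    sentry_errors_24h: int,
) -> tuple[Checklist, list[str]]:
    """Return the checklist state plus the reasons to show the operator."""
    blocked = sorted(key for key, is_on in deps_on.items() if not is_on)
    if blocked:
        # Red wins over yellow: the Deploy button is locked, and the
        # tooltip names each unsatisfied dependency.
        return Checklist.RED_BLOCKED, [f"dependency OFF: {key}" for key in blocked]

    warnings = []
    if not has_smoke:
        warnings.append("no smoke block defined")
    if docs_is_stub:
        warnings.append("docs page is a stub")
    if sentry_errors_24h > 0:
        warnings.append(f"{sentry_errors_24h} Sentry errors in the last 24h")
    state = Checklist.YELLOW_WARNING if warnings else Checklist.ALL_GREEN
    return state, warnings
```

Returning the reasons with the state gives the UI what it needs for both the locked-button tooltip and the confirmation modal.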

5.2 Smoke verify sequence (Phase 4)

sequenceDiagram
    participant Op as Operator
    participant Console as Console UI
    participant API as Console API
    participant Target as Probe Target

    Op->>Console: clicks Verify on flag row
    Console->>API: POST /console/flags/<key>/verify
    API->>API: load smoke block from YAML
    loop each probe
        API->>Target: execute probe (http/ping/sql)
        Target-->>API: response
        API->>API: evaluate assert_* conditions
    end
    API->>API: write console_audit_log (action=console.flag.verify)
    API-->>Console: {results[], overall, verified_at_utc, audit_id}
    Console->>Op: inline result chips (pass/fail per probe + overall)
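The per-probe step in the loop above (execute, then evaluate assert_* conditions) might look like this for kind: http. This is a sketch: `run_http_probe` is a hypothetical name, and the `fetch` callable is injected so the example makes no assumptions about the console's real HTTP client.

```python
import time
from typing import Callable

# fetch(method, path) -> (status_code, body); supplied by the caller.
Fetch = Callable[[str, str], tuple[int, str]]


def run_http_probe(probe: dict, fetch: Fetch) -> dict:
    """Execute one http probe and evaluate its assert_* conditions."""
    started = time.monotonic()
    method = probe.get("method", "GET")
    label = f"{method} {probe['path']}"

    def elapsed_ms() -> int:
        return int((time.monotonic() - started) * 1000)

    try:
        status_code, body = fetch(method, probe["path"])
    except Exception as exc:  # transport failure is "error", not "fail"
        return {"kind": "http", "label": label, "status": "error",
                "http_status": None, "duration_ms": elapsed_ms(), "detail": str(exc)}

    failures = []
    if "assert_status" in probe and status_code != probe["assert_status"]:
        failures.append(f"expected status {probe['assert_status']}, got {status_code}")
    needle = probe.get("assert_body_contains")
    if needle is not None and needle not in body:
        failures.append(f"body does not contain {needle!r}")

    return {"kind": "http", "label": label,
            "status": "fail" if failures else "pass",
            "http_status": status_code, "duration_ms": elapsed_ms(),
            "detail": "; ".join(failures) or "all assertions held"}
```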

5.3 Docs generator pipeline (Phase 3)

sequenceDiagram
    participant CI as GitHub Actions (push to main)
    participant Gen as scripts/build-flag-docs.py
    participant YAML as feature_flags.yaml
    participant DB as Console audit_log (read replica)
    participant CFP as CF Pages (internal-docs.raxx.app)

    CI->>Gen: invoke generator
    Gen->>YAML: parse all flags
    Gen->>DB: fetch flag flip history per flag_key
    Gen->>Gen: render one .md per flag (Jinja2 template)
    Gen->>Gen: build Lunr.js index JSON
    Gen->>Gen: emit sidebar.json
    Gen-->>CFP: deploy via CF Pages API (wrangler)
    CFP-->>CI: deploy URL confirmation
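The "render one .md per flag" step can be illustrated without third-party dependencies. The pipeline above names Jinja2; this sketch swaps in the standard library's `string.Template` purely to stay self-contained, and the page layout and `render_flag_page` name are hypothetical.

```python
from string import Template

# Hypothetical page layout; the real template is not specified in this design.
PAGE = Template(
    "# $key\n"
    "\n"
    "$description\n"
    "\n"
    "- Runtime behavior: $runtime_behavior\n"
    "- Dependencies: $dependencies\n"
)


def render_flag_page(key: str, entry: dict) -> str:
    """Render the markdown body for one flag from its parsed YAML entry."""
    deps = ", ".join(entry.get("dependencies") or []) or "none"
    return PAGE.substitute(
        key=key,
        description=entry.get("description", ""),
        runtime_behavior=entry.get("runtime_behavior") or "live",
        dependencies=deps,
    )
```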

6. Migrations

Phase 1 introduces no schema change. feature_flags.yaml is additive-only; new optional fields parse as None / absent in all existing readers.

Phase 4 adds one action string variant to console_audit_log. The action column is TEXT; no migration needed. The details JSON column already exists (migration 0005).

Phase 2 extends the flags list API response; backwards-compatible (consumers that don't read new fields are unaffected).

Phase 3 adds docs-internal/ directory output to the repo (gitignored in production build; CF Pages deploy is the artifact). No DB migration.

Rollback per phase:
- Phase 1: revert YAML fields; existing flags unaffected; lint CI check is the only breaking change (revert the lint job config).
- Phase 2: revert console API + UI changes; YAML schema can stay.
- Phase 3: CF Pages rollback to prior deploy; YAML schema unaffected.
- Phase 4: revert verify endpoint + UI button; no table to drop (audit rows remain, harmlessly).
- Phase 5: revert Sentry integration + modal gate; Phase 2 checklist reverts to non-Sentry signals.

7. Rollout Plan

Phase 1 (foundation) → dark deploy (no UI change; YAML + lint only)
Phase 2 (console enrichment) → flag-gated via FLAG_FLAG_DOCS_ENRICHMENT
Phase 3 (docs generator) → CI-only at first; deploy to internal-docs.raxx.app/flags/
Phase 4 (verify affordance) → flag-gated via FLAG_FLAG_VERIFY; shown only when
                               smoke block exists on a given flag
Phase 5 (checklist + Sentry) → flag-gated via FLAG_FLAG_PREFLIGHT_V2

All three feature flags follow the standard B1 promotion requirement (a migration row in console_flag_promotions is required). Phases are sequential: each builds on the prior one, and Phase 1 must land before any other phase is claimed.

8. Security Considerations

Smoke probe executor (Phase 4)

Probes run server-side inside the console dyno. They hit internal endpoints or CF-Access-gated URLs using the console's own service token (not the operator's session credentials). The operator never delegates their credentials to the executor.
Path deny-list enforced in the executor before dispatch: any probe targeting /api/orders/*, /api/positions/*, /api/trading/*, or any path resolving outside *.raxx.app is rejected with a 422 response and a lint error in YAML validation.
Rate limit (6/min/operator) prevents probes from becoming a DoS vector against downstream services.
Audit row written before probe results are returned. If the dyno crashes during probe execution, the audit row records overall: error.

YAML lint gate (Phase 1)

CI rejects PRs that add a new flag without at least one references entry. This enforces lineage: every flag in production can be traced to an issue or PR.
Lint also rejects smoke blocks containing fields matching /(key|secret|token|password|credential)/i as field names or values that look like secrets (entropy check).
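Both lint rules can be sketched in a few lines. The names `lint_flag` and `shannon_entropy` are hypothetical, and the entropy thresholds (tokens of at least 20 characters with at least 4.0 bits per character) are assumptions chosen for illustration, not values from this design. Deciding which flags count as "new" in a PR is left to the caller.

```python
import math
import re
from collections import Counter

SECRET_NAME_RE = re.compile(r"(key|secret|token|password|credential)", re.IGNORECASE)
MIN_TOKEN_LEN = 20      # assumed threshold
MIN_ENTROPY_BITS = 4.0  # assumed threshold, bits per character


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    return -sum((n / len(text)) * math.log2(n / len(text)) for n in counts.values())


def lint_flag(key: str, entry: dict) -> list[str]:
    """Return lint errors for one flag entry (empty list means clean)."""
    errors = []
    if not entry.get("references"):
        errors.append(f"{key}: at least one references entry is required")
    for index, probe in enumerate(entry.get("smoke") or []):
        for name, value in probe.items():
            if SECRET_NAME_RE.search(name):
                errors.append(f"{key}: smoke[{index}] field {name!r} looks like a secret name")
            # Check whitespace-separated tokens so prose-like values such as
            # SQL queries are not flagged as a whole.
            for token in str(value).split():
                if len(token) >= MIN_TOKEN_LEN and shannon_entropy(token) >= MIN_ENTROPY_BITS:
                    errors.append(f"{key}: smoke[{index}].{name} contains a high-entropy value")
    return errors
```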

Internal docs site (Phase 3)

internal-docs.raxx.app is CF Access gated to operators (email policy). No customer-facing content lands here.
Audit log history rendered in docs pages is redacted: operator_id is shown as operator-<last4> only. Full actor detail is available in the console audit log view, not on the docs pages.
Docs generator reads the console audit DB via a read-only connection string (INTERNAL_DOCS_DB_RO_URL); it cannot write to any table.
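The operator-<last4> redaction is a one-line transform. In this sketch `redact_operator` is a hypothetical name, and the `operator-unknown` label for a NULL operator_id (the FK is SET NULL after a DSR deletion) is an assumption.

```python
from typing import Optional


def redact_operator(operator_id: Optional[str]) -> str:
    """Render an operator reference for docs pages, exposing only the last 4 characters."""
    if operator_id is None:
        # Assumed label for rows whose operator was deleted (FK SET NULL).
        return "operator-unknown"
    return f"operator-{str(operator_id)[-4:]}"
```

For IDs shorter than four characters the slice returns the whole ID, so a real implementation may want to pad or hash such values.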

Sentry integration (Phase 5)

Sentry API key is stored in Infisical under /console/sentry-api-token; pulled at dyno start, not embedded in YAML or code.
The pre-flip checklist reads Sentry error counts for the flag's tagged code paths over the last 24 hours. It is advisory only: a non-zero count turns the checklist item yellow and triggers the confirmation modal; it does not block the flip.

9. Open Questions

These require operator decision before the corresponding phase can be claimed:

Phase 3, docs DB access from CI: The docs generator needs to pull flag flip history from the console audit DB to render "Rollout history" sections. Options: (a) read-only Postgres connection from CI (requires SSM param for RO creds), (b) export a static snapshot JSON at deploy time via a console API endpoint, (c) omit history from the generator (populated later via live API call in the browser). Option (b) is recommended because it adds no new DB credentials to CI. Operator decision needed.
Phase 5, Sentry per-flag tagging: The Sentry signal only works if Raptor and Antlers emit flag_key as a Sentry tag on errors. This requires a tagging convention to be established and backfilled. How many flags are currently tagged? Is a backfill sprint in scope before launch, or does Phase 5 ship with a "no Sentry data available" fallback? Operator decision needed.
Phase 4, sql probe kind scope: The design allows kind: sql probes against the console DB only. Should raptor DB queries be in scope (read-only, via a separate RO connection string)? A broader scope allows more useful probes; a narrower scope keeps the security boundary simpler. Operator decision needed.