
Design: Flag Operator UX Hardening + Pony-Style Internal Docs

Status: Proposed
Date: 2026-05-17 UTC
Owner: software-architect
ADR: 0098
Related docs:
  - console-feature-flags.md
  - console-flag-promotion-flow.md
  - flag-reconciler-bidirectional-sync-2026-05-13.md
  - docs-site-scaffold-2026-05-15.md


1. Context

The console at /console/flags exposes 158 flags across 9 surfaces. An operator standing in front of the prod toggle asks five questions before flipping: What does this flag actually do? What else depends on it? Does it require a dyno restart? How do I confirm it worked? Where's the rollback path?

None of those answers are accessible in one click today. The description field is a single sentence. references chips (PR #2230) are rendered but not semantically typed. There is no "does flipping this need a restart?" signal, no smoke-test affordance, and no per-flag knowledge page at a stable URL.

This design covers five phases:

| Phase | Topic | Size |
|-------|-------|------|
| 1 | YAML schema additions (runtime_behavior, docs_path, smoke, dependencies, lint) | S |
| 2 | Console flag-row + promotion-row enrichment (badges, deep-links, pre-flip checklist) | M |
| 3 | Pony-style docs generator + per-flag pages at internal-docs.raxx.app | M |
| 4 | Verify smoke affordance (POST /console/flags/<key>/verify) | S |
| 5 | Pre-flip checklist enrichment on promotion (Sentry error signal, modal gate) | S |

2. Invariants

All platform invariants apply. The following are specifically load-bearing here:

  1. No stored credentials. smoke block HTTP probes must never carry an Authorization header or any credential. Probes hit public or CF-Access-gated endpoints only. The executor never reads vault secrets at probe time.
  2. Audit trail. Every POST /console/flags/<key>/verify call appends a console.flag.verify row to console_audit_log with {flag_key, operator_id, overall, duration_ms}. Probe details are included; no credential material is ever logged.
  3. Fail-closed. If the smoke block is absent from YAML, the Verify button is hidden entirely (not disabled, not grayed out). Per [feedback_hide_dont_gray_unavailable_features].
  4. Paper-first gating is not a feature flag. If any smoke probe or dependency declaration attempts to reference or gate the paper-first check, stop and escalate. The flag UX system must never provide a path around that invariant.
  5. Smoke probes are read-only. kind: http probes are GET-only by default; POST probes require explicit method: POST and must not carry a body that triggers a state change. The executor rejects probes that would mutate live-trading state (deny-list of paths: /api/orders/*, /api/positions/*, /api/trading/*).
  6. GDPR. The smoke verify audit row is subject to the standard 90-day operational log retention. It contains operator_id (no PII beyond that); operator accounts are deleted on DSR and the FK is SET NULL.
  7. Secrets are never in YAML. The smoke block is checked in to source control. Probes may reference env-var names but must not embed actual values. YAML lint enforces no fields matching *key*, *secret*, *token*, *password* in the smoke block.
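
Invariants 1 and 5 lend themselves to a static check on each kind: http probe before it runs. The following is a minimal sketch under stated assumptions, not the executor's actual code: the function name check_http_probe and the optional headers field on a probe are hypothetical, and the deny-list patterns are copied from invariant 5. It does not attempt to judge whether a POST body would trigger a state change, which cannot be decided from the YAML alone.

```python
from fnmatch import fnmatchcase

# Deny-list patterns copied from invariant 5.
DENIED_PATH_PATTERNS = ("/api/orders/*", "/api/positions/*", "/api/trading/*")
ALLOWED_METHODS = {"GET", "POST"}


def check_http_probe(probe: dict) -> list[str]:
    """Return violations for one kind: http probe (empty list means allowed)."""
    violations = []
    method = str(probe.get("method", "GET")).upper()  # GET is the default
    path = probe.get("path", "")

    if method not in ALLOWED_METHODS:
        violations.append(f"method {method} is not allowed")

    # Invariant 1: probes never carry credentials.
    # The "headers" field is hypothetical; the schema above does not define it.
    header_names = {name.lower() for name in probe.get("headers", {})}
    if "authorization" in header_names:
        violations.append("probe must not carry an Authorization header")

    # Invariant 5: never touch live-trading paths.
    for pattern in DENIED_PATH_PATTERNS:
        prefix = pattern[: -len("/*")]  # e.g. "/api/orders"
        if path == prefix or fnmatchcase(path, pattern):
            violations.append(f"path {path} matches deny-list entry {pattern}")
    return violations
```

A probe that returns a non-empty list would be rejected before any request is sent.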

3. Data Model

3.1 YAML schema additions (Phase 1)

Four optional fields are added to each entry in feature_flags.yaml: runtime_behavior, docs_path, dependencies, and smoke. All are backwards-compatible: existing flags without them continue to parse and run.

# runtime_behavior: how quickly the flag takes effect after flip
# Values: live | restart-required | cold-start-only
# Default: live (assumed if omitted)
runtime_behavior: restart-required

# docs_path: relative URL on internal-docs.raxx.app for this flag's knowledge page
# If omitted, console deep-link falls back to /flags/<flag_key>
docs_path: /flags/console_status_config

# dependencies: other flag keys that must be ON before this flag is useful
# Used by Phase 2 pre-flip checklist. Empty list = no dependencies.
dependencies:
  - rbac_v2
  - break_glass_session

# smoke: list of probes run by POST /console/flags/<key>/verify
# Supported kinds: http, ping, sql (sql only on console DB, not raptor)
smoke:
  - kind: http
    method: GET
    path: /api/dashboard/status_grid
    assert_status: 200
    assert_body_contains: "public_note_form"
  - kind: ping
    url: https://api.raxx.app/health
  - kind: sql
    query: "SELECT COUNT(*) FROM console_feature_flags WHERE flag_key = 'console_status_config' AND value = 1"
    assert_rows_gte: 1
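
The lint step for these fields could look roughly like the sketch below. It works on an already-parsed flag entry (a plain dict), so it makes no assumption about which YAML loader is used. The function name lint_flag and the rule that every dependency must be a known flag key are assumptions for illustration; the allowed runtime_behavior values, the probe kinds, and the secret-name patterns come from the text above.

```python
import re

RUNTIME_BEHAVIORS = {"live", "restart-required", "cold-start-only"}
SMOKE_KINDS = {"http", "ping", "sql"}
# Invariant 7: reject smoke field names matching *key*, *secret*, *token*, *password*.
SECRET_NAME = re.compile(r"key|secret|token|password", re.IGNORECASE)


def lint_flag(flag_key: str, entry: dict, known_keys: set[str]) -> list[str]:
    """Return lint errors for one parsed flag entry (empty list means clean)."""
    errors = []

    behavior = entry.get("runtime_behavior", "live")  # default when omitted
    if behavior not in RUNTIME_BEHAVIORS:
        errors.append(f"{flag_key}: unknown runtime_behavior {behavior!r}")

    docs_path = entry.get("docs_path", f"/flags/{flag_key}")
    if not docs_path.startswith("/"):
        errors.append(f"{flag_key}: docs_path must be a relative URL starting with '/'")

    # Assumption: a dependency on an unknown flag key is a lint error.
    for dep in entry.get("dependencies", []):
        if dep not in known_keys:
            errors.append(f"{flag_key}: dependency {dep!r} is not a known flag")

    for index, probe in enumerate(entry.get("smoke", [])):
        kind = probe.get("kind")
        if kind not in SMOKE_KINDS:
            errors.append(f"{flag_key}: smoke[{index}] has unsupported kind {kind!r}")
        for field in probe:
            if SECRET_NAME.search(field):
                errors.append(f"{flag_key}: smoke[{index}] field {field!r} looks like a secret")
    return errors
```

Only field names are checked against the secret patterns, so a query value that mentions flag_key (as in the sql probe above) still passes.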

3.2 Audit log extension

No new table. One new action value added to console_audit_log:

action = 'console.flag.verify'
payload = {
  flag_key: str,
  overall: 'pass' | 'fail' | 'partial',
  probe_count: int,
  pass_count: int,
  duration_ms: int
}

Probe-level detail is stored in a details JSON column that already exists on console_audit_log (added in migration 0005).
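
The payload above can be derived from the per-probe statuses. The sketch below assumes one plausible rollup rule that this design does not spell out: overall is pass when every probe passes, fail when none pass, and partial otherwise. The function name build_verify_payload is hypothetical.

```python
def build_verify_payload(flag_key: str, statuses: list[str], duration_ms: int) -> dict:
    """Build the console.flag.verify payload from per-probe statuses.

    Each status is one of 'pass', 'fail', or 'error'. The rollup rule below
    is an assumption, not something this design defines.
    """
    pass_count = sum(1 for status in statuses if status == "pass")
    if statuses and pass_count == len(statuses):
        overall = "pass"
    elif pass_count == 0:
        overall = "fail"
    else:
        overall = "partial"
    return {
        "flag_key": flag_key,
        "overall": overall,
        "probe_count": len(statuses),
        "pass_count": pass_count,
        "duration_ms": duration_ms,
    }
```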

3.3 Docs generator output layout

docs-internal/
  flags/
    _index.md              # auto-generated sidebar index
    <flag_key>.md          # one file per flag
  search-index.json        # Lunr.js pre-built index
  sidebar.json             # surface → area → flag hierarchy for left rail

4. APIs / Contracts

4.1 Verify endpoint (Phase 4)

POST /console/flags/<flag_key>/verify
Authorization: CF Access + console session
Rate limit: 6 calls / minute / operator (per-flag bucket)

Response 200:
{
  "flag_key": "console_status_config",
  "results": [
    {
      "kind": "http",
      "label": "GET /api/dashboard/status_grid",
      "status": "pass",          // pass | fail | error
      "http_status": 200,
      "duration_ms": 143,
      "detail": "body contains 'public_note_form'"
    }
  ],
  "overall": "pass",             // pass | fail | partial
  "verified_at_utc": "2026-05-17T14:22:01Z",
  "audit_id": "abc123"
}

Response 404: flag key not found in YAML
Response 422: smoke block not defined (should not reach UI — button is hidden)
Response 429: rate limit exceeded
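
One way to read "6 calls / minute / operator (per-flag bucket)" is a sliding window keyed by the (operator, flag) pair. The sketch below takes that reading as an assumption; the class name VerifyRateLimiter is hypothetical, and the store is in-memory, so it would only hold for a single process.

```python
from collections import defaultdict, deque


class VerifyRateLimiter:
    """Sliding-window limiter: at most `limit` calls per `window_s` per bucket."""

    def __init__(self, limit: int = 6, window_s: float = 60.0):
        self.limit = limit
        self.window_s = window_s
        # Bucket key is (operator_id, flag_key); value is a deque of call times.
        self._calls = defaultdict(deque)

    def allow(self, operator_id: str, flag_key: str, now: float) -> bool:
        """Record and allow the call, or return False (the caller responds 429)."""
        bucket = self._calls[(operator_id, flag_key)]
        # Drop timestamps that have aged out of the window.
        while bucket and now - bucket[0] >= self.window_s:
            bucket.popleft()
        if len(bucket) >= self.limit:
            return False
        bucket.append(now)
        return True
```

Passing now explicitly (for example from time.monotonic()) keeps the limiter deterministic under test.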

4.2 Console flag row enrichment (Phase 2)

Existing flag list API (GET /console/api/flags) extended to include:

{
  "key": "console_status_config",
  "runtime_behavior": "restart-required",
  "docs_url": "https://internal-docs.raxx.app/flags/console_status_config",
  "has_smoke": true,
  "dependencies": ["rbac_v2"],
  "dependency_states": {
    "rbac_v2": { "prod": true, "staging": true }
  },
  "references": [
    { "kind": "pr", "number": 2230, "url": "https://github.com/..." },
    { "kind": "issue", "number": 1692, "url": "..." }
  ]
}
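
The new response fields can be derived from the YAML entry plus current flag states. The sketch below is illustrative only: enrich_flag_row and the shape of the states argument are hypothetical, and treating a dependency with no recorded state as OFF in both environments is an assumed fail-closed default. The references field is left out because it already exists.

```python
DOCS_BASE = "https://internal-docs.raxx.app"


def enrich_flag_row(flag_key: str, entry: dict, states: dict) -> dict:
    """Derive the Phase 2 fields for one flag.

    `states` maps flag key -> {"prod": bool, "staging": bool} (assumed shape).
    """
    # Fall back to /flags/<flag_key> when docs_path is omitted.
    docs_path = entry.get("docs_path") or f"/flags/{flag_key}"
    dependencies = list(entry.get("dependencies", []))
    unknown = {"prod": False, "staging": False}  # assumption: unknown means OFF
    return {
        "key": flag_key,
        "runtime_behavior": entry.get("runtime_behavior", "live"),
        "docs_url": DOCS_BASE + docs_path,
        "has_smoke": bool(entry.get("smoke")),
        "dependencies": dependencies,
        "dependency_states": {dep: states.get(dep, unknown) for dep in dependencies},
    }
```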

5. State Machines / Sequences

5.1 Pre-flip checklist evaluation (Phase 2 + 5)

stateDiagram-v2
    [*] --> Evaluating
    Evaluating --> AllGreen : deps ON + smoke defined + docs exist
    Evaluating --> YellowWarning : smoke missing OR docs stub OR Sentry errors(24h)
    Evaluating --> RedBlocked : required dep OFF
    AllGreen --> DeployButton : operator clicks Deploy
    YellowWarning --> ConfirmModal : operator clicks Deploy
    RedBlocked --> LockedDeploy : Deploy button blocked (not hidden)
    ConfirmModal --> DeployButton : operator confirms
    DeployButton --> [*]

Note: RedBlocked does NOT hide the Deploy button — it locks it with an explanatory tooltip naming the unsatisfied dependency. This is different from the Verify button (which is hidden when no smoke block exists); here the operator must know why they're blocked, not just that a button is absent.
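
The transitions out of Evaluating can be expressed as a small pure function. This is a sketch with hypothetical names (evaluate_checklist and its parameters); the state names and conditions follow the diagram, with an unsatisfied dependency taking priority over any warning.

```python
def evaluate_checklist(
    dependency_on: dict[str, bool],
    has_smoke: bool,
    docs_ready: bool,
    sentry_errors_24h: int,
) -> tuple[str, list[str]]:
    """Return (state, reasons) where state is AllGreen, YellowWarning, or RedBlocked."""
    # Red wins: any required dependency that is OFF blocks the deploy.
    off = sorted(key for key, is_on in dependency_on.items() if not is_on)
    if off:
        # Reasons name each unsatisfied dependency so the tooltip can show them.
        return "RedBlocked", [f"dependency {key} is OFF" for key in off]

    warnings = []
    if not has_smoke:
        warnings.append("no smoke block defined")
    if not docs_ready:
        warnings.append("docs page is a stub")
    if sentry_errors_24h > 0:
        warnings.append(f"{sentry_errors_24h} Sentry errors in the last 24h")

    if warnings:
        return "YellowWarning", warnings
    return "AllGreen", []
```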

5.2 Smoke verify sequence (Phase 4)

sequenceDiagram
    participant Op as Operator
    participant Console as Console UI
    participant API as Console API
    participant Target as Probe Target

    Op->>Console: clicks Verify on flag row
    Console->>API: POST /console/flags/<key>/verify
    API->>API: load smoke block from YAML
    loop each probe
        API->>Target: execute probe (http/ping/sql)
        Target-->>API: response
        API->>API: evaluate assert_* conditions
    end
    API->>API: write console_audit_log (action=console.flag.verify)
    API-->>Console: {results[], overall, verified_at_utc, audit_id}
    Console->>Op: inline result chips (pass/fail per probe + overall)
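
The "evaluate assert_* conditions" step for a kind: http probe might look like the sketch below. It takes the already-received status and body as inputs, so it performs no network I/O. The function name evaluate_http_probe is hypothetical, and the result fields mirror the response example in section 4.1.

```python
def evaluate_http_probe(probe: dict, http_status: int, body: str, duration_ms: int) -> dict:
    """Evaluate assert_status and assert_body_contains for one http probe."""
    expected_status = probe.get("assert_status")
    needle = probe.get("assert_body_contains")

    if expected_status is not None and http_status != expected_status:
        status = "fail"
        detail = f"expected status {expected_status}, got {http_status}"
    elif needle is not None and needle not in body:
        status = "fail"
        detail = f"body does not contain '{needle}'"
    else:
        status = "pass"
        detail = f"body contains '{needle}'" if needle is not None else f"status {http_status}"

    return {
        "kind": "http",
        "label": f"{probe.get('method', 'GET')} {probe['path']}",
        "status": status,
        "http_status": http_status,
        "duration_ms": duration_ms,
        "detail": detail,
    }
```

A transport failure (timeout, DNS error) would map to the error status instead; that path is outside this sketch.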

5.3 Docs generator pipeline (Phase 3)

sequenceDiagram
    participant CI as GitHub Actions (push to main)
    participant Gen as scripts/build-flag-docs.py
    participant YAML as feature_flags.yaml
    participant DB as Console audit_log (read replica)
    participant CFP as CF Pages (internal-docs.raxx.app)

    CI->>Gen: invoke generator
    Gen->>YAML: parse all flags
    Gen->>DB: fetch flag flip history per flag_key
    Gen->>Gen: render one .md per flag (Jinja2 template)
    Gen->>Gen: build Lunr.js index JSON
    Gen->>Gen: emit sidebar.json
    Gen-->>CFP: deploy via CF Pages API (wrangler)
    CFP-->>CI: deploy URL confirmation
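
The render and sidebar steps can be sketched with the standard library alone. This is not scripts/build-flag-docs.py: string.Template stands in for the Jinja2 template to keep the example dependency-free, the page layout is invented, and the surface and area keys on a flag entry are assumed field names for the surface → area → flag hierarchy.

```python
import json
from string import Template

# Stand-in for the Jinja2 template; the real page layout is not specified here.
PAGE = Template("# $key\n\n$description\n\nRuntime behavior: $runtime_behavior\n")


def render_flag_page(flag_key: str, entry: dict) -> str:
    """Render the markdown body for docs-internal/flags/<flag_key>.md."""
    return PAGE.substitute(
        key=flag_key,
        description=entry.get("description", ""),
        runtime_behavior=entry.get("runtime_behavior", "live"),
    )


def build_sidebar(flags: dict[str, dict]) -> dict:
    """Group flag keys as surface -> area -> [flag keys] for sidebar.json."""
    sidebar: dict = {}
    for flag_key in sorted(flags):
        entry = flags[flag_key]
        surface = entry.get("surface", "unknown")  # assumed field name
        area = entry.get("area", "general")        # assumed field name
        sidebar.setdefault(surface, {}).setdefault(area, []).append(flag_key)
    return sidebar


def sidebar_json(flags: dict[str, dict]) -> str:
    """Serialize the sidebar deterministically so CI diffs stay stable."""
    return json.dumps(build_sidebar(flags), indent=2, sort_keys=True)
```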

6. Migrations

Phase 1 introduces no schema change. feature_flags.yaml is additive-only; new optional fields parse as None / absent in all existing readers.

Phase 4 adds one action string variant to console_audit_log. The action column is TEXT; no migration needed. The details JSON column already exists (migration 0005).

Phase 2 extends the flags list API response; backwards-compatible (consumers that don't read new fields are unaffected).

Phase 3 adds docs-internal/ directory output to the repo (gitignored in production build; CF Pages deploy is the artifact). No DB migration.

Rollback per phase:
  - Phase 1: revert YAML fields; existing flags unaffected; lint CI check is the only breaking change (revert the lint job config).
  - Phase 2: revert console API + UI changes; YAML schema can stay.
  - Phase 3: CF Pages rollback to prior deploy; YAML schema unaffected.
  - Phase 4: revert verify endpoint + UI button; no table to drop (audit rows remain, harmlessly).
  - Phase 5: revert Sentry integration + modal gate; Phase 2 checklist reverts to non-Sentry signals.


7. Rollout Plan

Phase 1 (foundation) → dark deploy (no UI change; YAML + lint only)
Phase 2 (console enrichment) → flag-gated via FLAG_FLAG_DOCS_ENRICHMENT
Phase 3 (docs generator) → CI-only at first; deploy to internal-docs.raxx.app/flags/
Phase 4 (verify affordance) → flag-gated via FLAG_FLAG_VERIFY; shown only when
                               smoke block exists on a given flag
Phase 5 (checklist + Sentry) → flag-gated via FLAG_FLAG_PREFLIGHT_V2

All three feature flags follow the standard B1 promotion requirement (a migration row in console_flag_promotions is required). Phases are sequential, each building on the prior one; Phase 1 must land before any other phase is claimed.


8. Security Considerations

Smoke probe executor (Phase 4)

YAML lint gate (Phase 1)

Internal docs site (Phase 3)

Sentry integration (Phase 5)


9. Open Questions

These require operator decision before the corresponding phase can be claimed:

  1. Phase 3 — Docs DB access from CI: The docs generator needs to pull flag flip history from the console audit DB to render "Rollout history" sections. Options: (a) read-only Postgres connection from CI (requires SSM param for RO creds), (b) export a static snapshot JSON at deploy time via a console API endpoint, (c) omit history from the generator (populated later via live API call in the browser). Option (b) is recommended — zero new DB credentials in CI. Operator decision needed.

  2. Phase 5 — Sentry per-flag tagging: The Sentry signal only works if Raptor and Antlers emit flag_key as a Sentry tag on errors. This requires a tagging convention to be established and backfilled. How many flags are currently tagged? Is a backfill sprint in scope before launch, or does Phase 5 ship with a "no Sentry data available" fallback? Operator decision needed.

  3. Phase 4 - sql probe kind scope: The design allows kind: sql probes against the console DB only. Should raptor DB queries be in scope (read-only, via a separate RO connection string)? Broader scope = more useful probes; narrower scope = simpler security boundary. Operator decision needed.