Raxx · internal docs


Console Self-Deploy Web Layer

Status: Proposed
Owner: software-architect
Date: 2026-05-03 UTC
Refs: #988 (parent card), #985 (deploy-button pulse ring), #907 (Velvet v2 — rotation-side, out of scope here)
Related ADRs: ADR-0044, ADR-0034, ADR-0029, ADR-0036
Related docs: console-driven-deploy-flow.md, console-env-switcher.md, prod-deploy-gating.md


1. Context

The console's deploy dispatch panel (PR #541, extended in #810, #921) can fire a GitHub Actions workflow_dispatch for any registered surface. The machinery works. The UX problem is what happens after the operator clicks "deploy console-prod to production" from within console-prod itself: the Heroku dyno restarts mid-session, the operator's HTTP connection terminates, and the browser tab receives a 502 or connection reset. The operator is then bounced through CF Access re-auth and must navigate back manually to find out whether the deploy succeeded.

This design adds the web layer that keeps the operator's browser tab engaged during their own console's restart without requiring re-authentication, re-navigation, or any operator-side action.

Topology clarification

ADR-0029 planned to retire raxx-console-staging after the env-switcher soaked for one week (~2026-05-07). With self-deploy in scope, that retirement is suspended. The two-Heroku-app split (raxx-console-prod + raxx-console-staging) is now a structural feature: console-staging is the "sister console" that keeps the operator informed during console-prod's restart window. ADR-0044 records this topology decision formally. A future single-console architecture is only viable if it self-restarts without operator disruption — which is exactly the problem being solved here. Retirement can be reconsidered after this design is implemented and soaked.

Relationship to #985 (deploy-button pulse ring)

Issue #985 needs a polling endpoint that feeds the pulse ring's animation state. That endpoint is GET /api/internal/deploys/<id> on whichever console is live — prod or the sister. This design defines a cross-env read endpoint that #985 must consume during the prod-restart window. Section 4.2 specifies that contract explicitly. Feature-developer implementing #985 should read §4.2 before starting.

Why not Velvet v2 (#907)

Velvet v2 handles credential rotation. Rotation patches Heroku config vars (no dyno restart). Code deploys trigger a dyno restart. This design is strictly about code deploys. The two systems share no code path and should not be entangled.


2. Invariants

All platform invariants apply. Self-deploy-specific:

  1. No stored credentials. The cross-env service token that authenticates console-staging's read endpoint is injected at runtime from Infisical. It is not persisted in the DB, session, or application code.
  2. No credential replay. The service token is read-only scoped. It cannot trigger deploys or access secrets. Its exposure enables status polling only.
  3. Audit trail for every deploy state change. Already enforced by the existing console.deploy.callback and console.deploy.intent audit rows. This design adds a console.deploy.crossenv_read audit event whenever the sister endpoint is used (§4.2).
  4. Session cookie must not be shared cross-environment. Flask session cookies issued by console-prod must not be accepted by console-staging, and vice versa. They use different SECRET_KEY values. Cross-env polling uses a service token, not the operator's session cookie.
  5. CF Access cookies are bound to the CF identity session, not to a specific Heroku dyno. A post-restart cold reconnect to console-prod will pick up the operator's CF Access cookie without re-auth, provided the cookie has not expired (default 24h) and the CF Access session is still valid. This design requires no changes to CF Access session lifetime.
  6. Paper-first gating invariant does not apply to console deploys. Console deploys are operator-only actions on the ops surface; the paper-first gate is for live-trading code paths.
  7. Kill-switch preserved. DEPLOY_FREEZE_OVERRIDE=1 (ADR-0028) gates all console-triggered deploys. The freeze check fires before workflow_dispatch is called, unchanged from the existing deploy flow.

3. Option Decision: Hybrid of A + C (not pure A, not B)

Issue #988 presents three options. This design selects a hybrid: Option A's edge intercept with Option C's out-of-band notification as a baseline fallback. Option B (WebSocket) is rejected.

Full rationale is in ADR-0044. Summary: keep both Heroku apps (the topology decision); see §3.2 for the single-console approaches considered and §3.3 for the revisit gate.

3.1 The selected topology

┌─────────────────────────────┐     ┌─────────────────────────────┐
│   raxx-console-prod          │     │   raxx-console-staging       │
│   console.raxx.app           │     │   console-staging.raxx.app   │
│   (CF Access: prod policy)   │     │   (CF Access: staging policy)│
│                              │     │                              │
│  console_deploys table       │     │  console_deploys table       │
│  (prod Postgres)             │     │  (staging Postgres)          │
└─────────┬───────────────────┘     └──────────────┬──────────────┘
          │                                         │
          │ deploy_id written to prod DB             │
          │                                         │
          └─────────────────────────────────────────┘
                  cross-env read: staging polls prod
                  via CONSOLE_CROSS_ENV_READ_TOKEN

Each console has its own DB; the console_deploys row lives in the console that was deploying. When console-prod deploys, the row is in prod DB. Console-staging reads it via the cross-env endpoint.

3.2 Topology decision: two apps, not one

The env-switcher (#617 target) imagined one console deployment with a prod/staging routing dropdown. With self-deploy in scope, a single deployment cannot restart itself without disrupting the operator. The possible single-console approaches are:

  1. Serverless / always-on platform (e.g., Fly.io with zero-downtime rolling deploys) — eliminates the restart problem at the infrastructure layer. This is a platform migration, not a feature card. Out of scope here.
  2. Split dyno — one web dyno + one worker dyno; web restarts while worker keeps serving. Heroku eco/basic dynos don't support this split cleanly. Out of scope.
  3. CF Worker as the permanent API gateway — all traffic goes through CF Worker, which can proxy to the live dyno and serve the "be right back" page when the dyno is down. This is essentially Option A generalized to a permanent architecture. Viable long-term but too large for this card.

Decision for this design: keep the two-app topology. The env-switcher's retirement of raxx-console-staging is deferred until after this design soaks and a future card evaluates the single-console options above.

3.3 Single-console revisit gate

When a future card evaluates single-console, this design doc is the input. The cross-env read API defined in §4.2 is the interface that would move into a CF Worker permanent gateway. No rework required on the deploy-status DB schema.


4. APIs / Contracts

4.1 Existing read endpoint (unchanged)

GET /api/internal/deploys/<id> on the console that owns the deploy row.

Auth: session cookie; any authenticated role. ETag/If-None-Match supported.

Response shape (already defined in console-driven-deploy-flow.md §4.2):

{
  "id":                 "<uuid>",
  "surface_id":         "console-prod",
  "target_env":         "production",
  "target_ref":         "main",
  "requested_by":       "operator@raxx.app",
  "requested_at_utc":   "2026-05-03T14:00:00Z",
  "status":             "deploying",
  "github_run_id":      "12345678",
  "github_run_url":     "https://github.com/raxx-app/TradeMasterAPI/actions/runs/12345678",
  "last_status_at_utc": "2026-05-03T14:02:10Z",
  "log_tail":           "...",
  "failure_reason":     null
}

Terminal statuses: succeeded, failed, timed_out.

4.2 Cross-env read endpoint (new)

This is the endpoint that #985 (pulse ring) and the CF Worker intercept both consume during the prod-restart window.

GET /api/internal/deploys/<id>/xenv

Auth: Authorization: Bearer <CONSOLE_CROSS_ENV_READ_TOKEN> (service token, not operator session). No session cookie accepted on this route.

Scope: read-only. Returns the same JSON shape as §4.1. Does not write, does not trigger, does not expose log_tail in full (truncated to last 1 KB to reduce payload size for edge polling).

Rate limit: 30 req/min per token. Enforced by existing rate_limiter middleware with a separate bucket key xenv:<token_hash>.

Feature flag: FLAG_CONSOLE_DEPLOY_XENV_READ (default off). When off: 501.

New audit event: console.deploy.crossenv_read written on each request:

{
  "deploy_id": "<id>",
  "requested_from_env": "staging",
  "token_id": "<first 8 chars of token hash>",
  "timestamp": "..."
}

The requested_from_env field is set from the X-Raxx-Requester-Env request header (value: staging, prod, or worker when set by the CF Worker); if the header is absent, it defaults to unknown.
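The token_id field's hashing scheme is not pinned down above; a plausible sketch (assuming SHA-256, which the contract does not specify) is:

```python
import hashlib

def token_id(token: str) -> str:
    """First 8 hex chars of the token's hash (SHA-256 assumed; not specified in the contract)."""
    return hashlib.sha256(token.encode()).hexdigest()[:8]
```

This keeps the raw token out of the audit log while still letting two reads by the same token be correlated.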

Response shape (same as §4.1 minus full log_tail):

{
  "id":                 "<uuid>",
  "surface_id":         "console-prod",
  "status":             "deploying",
  "target_env":         "production",
  "target_ref":         "main",
  "requested_by":       "operator@raxx.app",
  "requested_at_utc":   "2026-05-03T14:00:00Z",
  "last_status_at_utc": "2026-05-03T14:02:10Z",
  "log_tail":           "...(last 1 KB)...",
  "github_run_url":     "https://github.com/...",
  "failure_reason":     null
}

Where this endpoint lives: the route is registered on both consoles; the CF Worker calls it on whichever console is not currently restarting. When console-prod deploys console-prod, the row lives in the prod DB. During that window the CF Worker routes to the endpoint on console-staging via the CONSOLE_STAGING_BASE_URL config var (see OQ-2 on deploy-row ownership).

Token provisioning: CONSOLE_CROSS_ENV_READ_TOKEN is a randomly generated 32-byte token stored in:

  - Infisical at /Console/prod/CONSOLE_CROSS_ENV_READ_TOKEN (read by console-prod at boot)
  - Infisical at /Console/staging/CONSOLE_CROSS_ENV_READ_TOKEN (same value; read by console-staging)
  - The CF Worker secret CONSOLE_CROSS_ENV_READ_TOKEN (injected via wrangler secret put)

All three must hold the same token value. Rotation procedure: generate a new token, update all three locations simultaneously, then redeploy the CF Worker (wrangler deploy). The console apps read the token from Infisical at boot, so each needs a dyno restart to pick up the new value; this is acceptable, since the console itself is not being deployed during rotation.
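A framework-free sketch of the §4.2 route behavior (flag gate, bearer auth, 1 KB log_tail truncation, audit event). handle_xenv_read, write_audit, and the tuple return shape are illustrative, not the real Flask blueprint:

```python
import hmac

def xenv_authorized(auth_header, expected_token):
    """Constant-time Bearer service-token check; session cookies are never accepted here."""
    if not auth_header or not auth_header.startswith("Bearer "):
        return False
    return hmac.compare_digest(auth_header[len("Bearer "):], expected_token)

def truncate_log_tail(log_tail, limit=1024):
    """Keep only the last `limit` bytes of the log tail for edge polling."""
    return None if log_tail is None else log_tail[-limit:]

def handle_xenv_read(headers, deploy_row, *, flag_on, expected_token, write_audit):
    """Return (status_code, body) per the §4.2 contract."""
    if not flag_on:                     # FLAG_CONSOLE_DEPLOY_XENV_READ off -> 501
        return 501, {"error": "feature_disabled"}
    if not xenv_authorized(headers.get("Authorization"), expected_token):
        return 401, {"error": "unauthorized"}
    write_audit({                       # console.deploy.crossenv_read audit event
        "event": "console.deploy.crossenv_read",
        "deploy_id": deploy_row["id"],
        "requested_from_env": headers.get("X-Raxx-Requester-Env", "unknown"),
    })
    body = dict(deploy_row)
    body["log_tail"] = truncate_log_tail(body.get("log_tail"))
    return 200, body
```

The real route would also apply the xenv:&lt;token_hash&gt; rate-limit bucket before the auth check.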

4.3 CF Worker intercept specification

The CF Worker lives at the console.raxx.app origin. It intercepts requests based on deploy state. Logic:

request arrives at console.raxx.app
  ├── check: is there an active deploy for surface_id=console-prod?
  │     (reads CONSOLE_DEPLOY_STATE KV store — see §4.3.1)
  │     ├── no active deploy → pass through to raxx-console-prod dyno (normal)
  │     └── active deploy found (status in [dispatched, building, deploying])
  │           ├── request is GET /api/internal/deploys/<id>
  │           │     → proxy to console-staging /api/internal/deploys/<id>/xenv
  │           │     → return response to operator's tab (transparent proxy)
  │           ├── request is POST or mutating method
  │           │     → return 503 with JSON { "error": "deploy_in_progress",
  │           │                              "deploy_id": "<id>",
  │           │                              "status_url": "/api/internal/deploys/<id>" }
  │           └── all other GET requests
  │                 → return "be right back" HTML page (served from Worker inline)
  │                   the page polls GET /api/internal/deploys/<id>
  │                   (which the Worker transparently proxies to staging)
  │                   and auto-redirects to original URL when status=succeeded
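The intercept tree above can be expressed as a pure decision function. It is sketched in Python here for readability and testability; the real Worker is JavaScript, and the action names are illustrative:

```python
def route(kv_entry, method, path):
    """Decide the Worker action for one request during (or outside) a deploy window."""
    ACTIVE = {"dispatched", "building", "deploying"}
    if kv_entry is None or kv_entry["status"] not in ACTIVE:
        return ("pass_through", None)               # no active deploy: proxy to prod dyno
    deploy_id = kv_entry["deploy_id"]
    if method == "GET" and path == f"/api/internal/deploys/{deploy_id}":
        return ("proxy_xenv", deploy_id)            # transparent proxy to staging /xenv
    if method in {"POST", "PUT", "PATCH", "DELETE"}:
        return ("reject_503", deploy_id)            # mutating request during deploy window
    return ("brb_page", deploy_id)                  # inline "be right back" page
```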

4.3.1 CONSOLE_DEPLOY_STATE KV store

Cloudflare Workers KV. Key schema:

key:   "active_deploy:console-prod"
value: { "deploy_id": "<uuid>", "status": "deploying", "since_utc": "..." }
TTL:   600 seconds (10 min hard cap — auto-expires if the Worker fails to clean up)

Write path: the console-prod Flask app writes to KV at two points:

  1. When a deploy transitions to dispatched status: PUT active_deploy:console-prod with the deploy record.
  2. When a deploy reaches a terminal status (succeeded, failed, timed_out): DELETE active_deploy:console-prod.

The KV write is best-effort (non-blocking, fire-and-forget in a background thread). A failed KV write does not abort the deploy. The 10-minute TTL is the safety net: if the app crashes before writing the DELETE, the Worker stops intercepting after 10 minutes.

Read path: The CF Worker reads KV on every request to console.raxx.app. KV reads are low-latency (~1ms) and free within CF's KV quota. This is not a DB query; no blocking.

KV namespace: RAXX_CONSOLE_DEPLOY_STATE (configured in wrangler.toml).
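The best-effort write path might be sketched as follows, using Cloudflare's KV REST API (PUT .../values/&lt;key&gt; with an expiration_ttl query parameter). The account/namespace parameters and the fire_and_forget helper are illustrative:

```python
import json
import threading
import urllib.request

KV_BASE = ("https://api.cloudflare.com/client/v4/accounts/{acct}"
           "/storage/kv/namespaces/{ns}/values/{key}")

def kv_put_request(acct, ns, api_token, deploy):
    """Build the Cloudflare KV PUT for the active-deploy marker (10-min TTL)."""
    url = KV_BASE.format(acct=acct, ns=ns, key="active_deploy:console-prod")
    body = json.dumps({"deploy_id": deploy["id"], "status": deploy["status"],
                       "since_utc": deploy["requested_at_utc"]}).encode()
    return urllib.request.Request(
        url + "?expiration_ttl=600", data=body, method="PUT",
        headers={"Authorization": f"Bearer {api_token}"})

def fire_and_forget(req, log):
    """Best-effort KV write: daemon thread, log on failure, never block the deploy."""
    def _send():
        try:
            urllib.request.urlopen(req, timeout=5)
        except Exception as exc:          # a failed KV write is logged, not fatal
            log(f"KV write failed: {exc}")
    threading.Thread(target=_send, daemon=True).start()
```

The DELETE on terminal status would use the same URL with method="DELETE" and no TTL.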

4.3.2 "Be right back" page

A minimal HTML page served inline by the CF Worker. It requires no external assets (no CDN calls during a deploy window). It:

  - Displays the surface name, deploy started-at time, and current status (if available from the xenv poll).
  - Polls GET /api/internal/deploys/<id> every 3 seconds (the Worker proxies this to staging).
  - Animates a status indicator using a CSS animation (no JS library dependency).
  - Auto-redirects to the operator's original requested URL when status === "succeeded".
  - On status === "failed" or status === "timed_out": shows the failure reason, links to the GitHub run URL, and shows a "Refresh to retry" button. Does not auto-redirect.
  - On no response from xenv (sister also unreachable): shows "Check Slack for deploy status" with a link to the Raxx workspace.

The page is stored as an inline HTML string in the CF Worker source file (no external file dependency). It is under 10 KB.

4.3.3 Session survival across the restart

The operator's session cookie (issued by console-prod's Flask app) is an HttpOnly; Secure; SameSite=Strict cookie bound to console.raxx.app. After the dyno restarts:

  - The new dyno picks up the same SECRET_KEY (Infisical-sourced at boot). Flask can validate the cookie against the same key.
  - The session store (DB-backed, not in-process) is intact (same Heroku Postgres). The session row survives the restart.
  - Result: the operator's session is valid the moment the new dyno warms. No re-auth.

CF Access cookies are bound to the CF identity session, not the Heroku process. They survive dyno restarts entirely.

No changes needed to Flask's SECRET_KEY handling or the session DB schema.

4.3.4 Session cookie isolation

The "be right back" page is served from console.raxx.app (prod domain) by the CF Worker. It does not make requests to console-staging.raxx.app. All cross-env polling goes through the CF Worker, which proxies to staging's /xenv endpoint using the service token. The operator's browser never sees console-staging.raxx.app. Session cookie isolation is preserved.


5. State Machine for the Deploy-Window UX

stateDiagram-v2
    [*] --> Normal : operator browsing console-prod

    Normal --> DeployDispatched : operator clicks deploy + TOTP confirm
    note right of DeployDispatched : KV write: active_deploy = {id, status=dispatched}

    DeployDispatched --> DynoRestarting : GitHub Actions triggers Heroku push
    note right of DynoRestarting : CF Worker begins intercepting console.raxx.app requests\nOperator's tab receives "be right back" page

    DynoRestarting --> Polling : Worker proxies GET /deploys/<id> to console-staging xenv
    Polling --> Polling : status in [building, deploying]\n(poll every 3s)

    Polling --> DeploySucceeded : status = succeeded
    DeploySucceeded --> Redirected : Worker removes KV entry\nPage auto-redirects to original URL\nOperator's session cookie works on new dyno

    Polling --> DeployFailed : status = failed or timed_out
    DeployFailed --> ManualRecovery : Page shows failure + GitHub run link\nSlack notification sent

    DynoRestarting --> SisterUnavailable : xenv returns error or timeout (3 consecutive failures, see §10.2)
    SisterUnavailable --> SlackOnly : Page shows "Check Slack" message\nOut-of-band C notification sent

The end-to-end sequence during a self-deploy:

sequenceDiagram
    participant Op as Operator Browser
    participant CFW as CF Worker (console.raxx.app)
    participant Prod as raxx-console-prod dyno
    participant KV as CF KV (RAXX_CONSOLE_DEPLOY_STATE)
    participant Staging as raxx-console-staging (/xenv)
    participant GHA as GitHub Actions

    Op->>Prod: POST /api/internal/deploys {surface_id: "console-prod"}
    Prod->>GHA: workflow_dispatch
    Prod->>KV: PUT active_deploy:console-prod {id, status=dispatched}
    Prod-->>Op: 202 {id, status: "dispatched", status_url: "/api/internal/deploys/<id>"}

    Note over Prod,GHA: GHA runs deploy-console.yml → heroku push → dyno restart

    Op->>CFW: GET /dashboard (or any page)
    CFW->>KV: GET active_deploy:console-prod → found
    CFW-->>Op: "Be right back" HTML page (served from Worker inline)

    loop Poll every 3s while status not terminal
        Op->>CFW: GET /api/internal/deploys/<id>
        CFW->>Staging: GET /api/internal/deploys/<id>/xenv\n  Authorization: Bearer <CROSS_ENV_TOKEN>\n  X-Raxx-Requester-Env: worker
        Staging-->>CFW: 200 {status: "deploying", ...}
        CFW-->>Op: 200 (transparent proxy)
    end

    Note over GHA,Prod: new dyno boots, health check passes, GHA sends callback

    GHA->>Prod: POST /api/internal/deploys/<id>/status {status: "succeeded"}
    Prod->>KV: DELETE active_deploy:console-prod

    Op->>CFW: GET /api/internal/deploys/<id> (next poll)
    CFW->>KV: GET active_deploy:console-prod → not found (or status=succeeded)
    CFW->>Prod: GET /api/internal/deploys/<id> (pass through, new dyno is up)
    Prod-->>CFW: 200 {status: "succeeded"}
    CFW-->>Op: 200 {status: "succeeded"}
    Note over Op: Page auto-redirects to original URL\nSession cookie works on new dyno — no re-auth

6. Data Model Changes

6.1 No schema changes to console_deploys

The existing table (defined in console-driven-deploy-flow.md §3.1) requires no new columns. The status field already carries dispatched | building | deploying | succeeded | failed | timed_out. The CF Worker reads status from the xenv endpoint; the KV store is the edge-layer signal, not the DB.

6.2 New: KV store schema (not DB)

CF Workers KV is the only new storage component. Schema documented in §4.3.1. It is not a database; it is a session-like cache with a hard TTL.

6.3 New audit event in audit_log

console.deploy.crossenv_read (defined in §4.2). No schema change needed; existing audit_log table accommodates new event types via the action text field.

6.4 New: console.deploy.kv_write audit event

Emitted when the Flask app writes to or deletes from the KV store. Payload:

{
  "deploy_id": "<id>",
  "kv_operation": "put" | "delete",
  "surface_id": "console-prod",
  "timestamp": "..."
}

This ensures any discrepancy between DB state and KV state is auditable.


7. Migrations

7.1 DB migrations: none required

The console_deploys schema is unchanged. No ALTER TABLE.

7.2 KV namespace provisioning

A one-time wrangler kv:namespace create RAXX_CONSOLE_DEPLOY_STATE command. This must run before the CF Worker is deployed. The namespace ID is then referenced in wrangler.toml (committed to repo; the ID is not a secret).

Rollback: delete the KV namespace. No data is lost (it is ephemeral deploy-window state only).

7.3 CF Worker deployment

The Worker is a new resource. Deployment via wrangler deploy in CI. No change to existing Heroku app deploy workflows.

7.4 Secret provisioning sequence

Order matters:

  1. Generate CONSOLE_CROSS_ENV_READ_TOKEN (32 bytes, URL-safe base64).
  2. Write to Infisical /Console/prod/CONSOLE_CROSS_ENV_READ_TOKEN.
  3. Write to Infisical /Console/staging/CONSOLE_CROSS_ENV_READ_TOKEN.
  4. Set Worker secret: wrangler secret put CONSOLE_CROSS_ENV_READ_TOKEN.
  5. Deploy console-prod (picks up new token from Infisical at boot).
  6. Deploy console-staging (picks up same token from Infisical at boot).
  7. Deploy CF Worker.
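Generating the token in step 1 can use the stdlib secrets module; a minimal sketch:

```python
import secrets

# 32 random bytes, URL-safe base64 without padding (43 characters)
token = secrets.token_urlsafe(32)
print(token)
```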

Rollback: revert to the previous token in all three locations simultaneously. The reverted token takes effect on next dyno boot; the Worker picks it up immediately after wrangler secret put + wrangler deploy.


8. Rollout Plan

Phase / Gate / What ships:

  - Dark. Gate: none. Ships: KV namespace provisioned; no Worker deployed; no xenv endpoint.
  - Alpha. Gate: FLAG_CONSOLE_DEPLOY_XENV_READ=true on staging console. Ships: xenv endpoint live on console-staging; CF Worker deployed with intercept disabled (KV always empty).
  - Beta. Gate: xenv endpoint live on prod console; KV writes enabled on prod. Ships: CF Worker intercepts when KV has an active deploy; operator must test by deploying console-staging from the prod console.
  - GA. Gate: full intercept active on prod; Slack notification baseline confirmed; ux-polisher reviewed the "be right back" page. Ships: deploy console-prod from the prod console; operator observes no session disruption.

Feature flags:

  - FLAG_CONSOLE_DEPLOY_XENV_READ — enables the /xenv endpoint on each console app (default: off).
  - FLAG_CONSOLE_DEPLOY_KV_WRITE — enables the KV write/delete in the Flask app on status transitions (default: off). Must be on before the CF Worker intercept is meaningful.


9. Security Considerations

PII collected: The xenv endpoint returns requested_by (operator email). This is already in the existing GET /api/internal/deploys/<id> response, subject to the same 2-year retention and DSR tombstone policy. No additional PII surface.

Retention: console.deploy.crossenv_read audit events follow the 2-year audit retention policy. KV entries have a 10-minute TTL (ephemeral; not PII).

DSR: No new PII path. Existing requested_by DSR erasure path is sufficient.

Audit trail: every xenv read is audited (§4.2). Every KV write/delete is audited (§6.4). A complete trace of the deploy-window cross-env activity is in audit_log.

Credential storage:

  - CONSOLE_CROSS_ENV_READ_TOKEN — Infisical at /Console/prod/ and /Console/staging/; CF Worker secret. Never in code, never in DB.
  - CONSOLE_STAGING_BASE_URL — Heroku config var on console-prod. Not a secret; the URL is public. Value: https://console-staging.raxx.app.
  - CLOUDFLARE_KV_NAMESPACE_ID — non-secret; committed to wrangler.toml.

No stored credentials: The service token is read from env at request time. It cannot trigger deploys. Its compromise allows an attacker to read deploy status (deploy IDs, surface names, log tails — operational data). It does not expose user PII beyond operator email. Rotate by regenerating and updating all three locations.

Session cookie isolation: confirmed in §4.3.4. The operator's browser never makes cross-domain requests. All cross-env communication is server-to-server (CF Worker → console-staging xenv).

CF Access session during deploy window: CF Access cookies are bound to the CF identity provider session, not to the Heroku dyno. A 90-second restart window is well within the 24-hour CF Access cookie lifetime. No changes needed.

Edge of CF Access re-auth: if the deploy takes longer than 24 hours (pathological case — all other failure modes trigger first), the CF Access cookie expires. The recovery path is normal re-auth via passkey; no data is lost (session is in the DB, which survived the restart).

Breach: a breach of the xenv endpoint token exposes deploy history (status, log tails, operator emails). It does not expose credentials, trading data, or customer PII. Breach notification: platform GDPR breach path, notify within 72 hours.

Kill-switch: set FLAG_CONSOLE_DEPLOY_XENV_READ=false on the console to disable cross-env reads. Set FLAG_CONSOLE_DEPLOY_KV_WRITE=false to stop writing to KV (Worker falls back to pass-through). Both flags are Heroku config vars, effective on next request without redeploy.

Secrets rotatable without redeploy: CONSOLE_CROSS_ENV_READ_TOKEN rotation requires a CF Worker redeploy (wrangler deploy) after wrangler secret put. Both operations take under 30 seconds and do not touch the Heroku dynos.


10. Failure Modes

10.1 Deploy hangs > 5 minutes

The existing reconciler (console-driven-deploy-flow.md §8.1) marks rows as timed_out after 30 minutes with no callback. For UX purposes, the CF Worker "be right back" page displays a warning banner after 5 minutes if the status is still building or deploying:

"This deploy is taking longer than expected. Check Slack for status updates or view the run on GitHub."

The Slack notification baseline (Option C) fires at deploy-complete in all cases. If the deploy hangs without ever reaching a terminal state, the existing reconciler promotes it to timed_out and the xenv endpoint returns status: "timed_out", which triggers the failure UX on the "be right back" page.

> 5 min UX: warning banner + GitHub run link (if available) + Slack link. Terminal failure UX: red banner + failure reason + GitHub run link + "Refresh to retry" button. No auto-redirect.

10.2 Sister console (raxx-console-staging) is also down

This is the "both broken" worst case. The CF Worker's xenv proxy receives an error (connection refused, 5xx, timeout).

Behavior: after 3 consecutive failed xenv poll attempts, the "be right back" page transitions to:

"The status endpoint is temporarily unavailable. Your deploy is still in progress. Check Slack (#ops-deploys) for status updates."

A direct link to the Slack channel is shown. The Slack notification (Option C baseline) fires independently via the GitHub Actions workflow (not through the console app), so it is not affected by the console being down.

Recovery: when the sister comes back up, the next xenv poll will succeed and the page resumes normal status display.
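The 3-consecutive-failures fallback and first-success recovery described above can be sketched as a small tracker (names are illustrative; the real logic lives in the page's inline JS):

```python
class XenvPollTracker:
    """Track consecutive xenv poll failures; switch the page to the Slack
    fallback after 3 in a row, and recover on the first success."""
    FALLBACK_AFTER = 3

    def __init__(self):
        self.consecutive_failures = 0

    def record(self, ok: bool) -> str:
        """Record one poll result; return which view the page should show."""
        if ok:
            self.consecutive_failures = 0
            return "status_display"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.FALLBACK_AFTER:
            return "slack_fallback"
        return "status_display"
```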

10.3 CF Access session expires mid-deploy

Default CF Access cookie lifetime is 24 hours. A console deploy takes 2–5 minutes. This case is effectively impossible under normal operations.

If it somehow occurs (e.g., a stale token from before the deploy started): - The operator's GET request to console.raxx.app is redirected to CF Access re-auth. - CF Access re-auth requires a passkey assertion. - After re-auth, CF redirects back to the original URL. If the deploy completed, the original URL works. If still deploying, the CF Worker intercepts and serves the "be right back" page. - No data loss. The deploy itself is unaffected (it runs in GitHub Actions, independent of the operator's session).

10.4 CF Worker fails (edge process crash)

CF Workers are highly available (global edge, no single point of failure). If the Worker itself throws, requests fall back to fetching the origin (console-prod dyno), provided the Worker calls passThroughOnException() so uncaught errors fail open. If the dyno is still restarting, the request gets a 502. This is the existing pre-design behavior — no worse than today.

Recovery: CF Workers auto-restart. The window of degraded UX is expected to be under 30 seconds.

10.5 KV write fails (Flask app cannot reach CF KV)

The KV write is fire-and-forget in a background thread. A failed KV write is logged at WARNING level and audited as console.deploy.kv_write with error: true. The deploy proceeds normally. The CF Worker has no KV entry for the active deploy, so it passes all requests through to the origin — the operator gets the normal deploy-interruption experience (502 during restart). This is the pre-design baseline, not a regression.

10.6 Concurrent deploys for the same surface

The existing rate-limit check (5 non-terminal deploys per surface per hour) and idempotency key prevent concurrent dispatches for the same surface. The KV store uses active_deploy:console-prod as a single-value key. If two deploys somehow ran concurrently (break-glass path bypassing rate limits), the last write wins in KV. The "be right back" page would poll the most recent deploy's status. This is acceptable; concurrent deploys to the same surface are already blocked under normal conditions.


11. Open Questions

These block downstream sub-card claims until resolved:

OQ-1 (blocks sub-card S5 — CF Worker): What Cloudflare account and zone does wrangler.toml target? The existing CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN (per reference_cloudflare_tokens.md) does not have Worker or KV write scope. A new token scope is needed. Who provisions the Wrangler API token and at what Infisical path?

OQ-2 (blocks sub-card S2 — xenv endpoint): The xenv endpoint is on whichever console is not restarting. When console-prod deploys console-staging, which console holds the deploy row? (It is console-prod, since the POST was made to console-prod.) Confirm: the console_deploys row always lives in the console that received the POST /api/internal/deploys request — not the console being deployed. This is the expected behavior given the existing service design, but it must be confirmed before implementing the routing logic.

OQ-3 (blocks sub-card S5 — CF Worker routing): CONSOLE_STAGING_BASE_URL must be a CF Worker secret or environment variable, not a KV entry. Confirm the staging URL: is it https://console-staging.raxx.app or does it use a different domain pattern? (The project_environments_mental_model.md memory suggests staging console exists but the URL isn't confirmed in the docs.)

OQ-4 (not blocking, but shapes S4 — Slack notification): The existing slack_notify.py:notify_rotation_success sends to a hardcoded channel. What Slack channel should deploy completion/failure messages go to? Proposed: #ops-deploys. Does this channel exist?

OQ-5 (not blocking, shapes S3 — KV writes): Should FLAG_CONSOLE_DEPLOY_KV_WRITE be a single flag covering both prod and staging consoles, or split into FLAG_CONSOLE_DEPLOY_KV_WRITE_PROD and FLAG_CONSOLE_DEPLOY_KV_WRITE_STAGING? Staging console probably should not write to the same KV namespace as prod (it would incorrectly intercept traffic at console.raxx.app). Proposal: use a single CONSOLE_DEPLOY_KV_SURFACE_ID_FILTER env var; only write to KV when the deploying surface matches the filter value. For console-prod: console-prod. For console-staging: leave unset (no KV writes for staging self-deploys).
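If the OQ-5 proposal is adopted, the filter check could be as small as the following sketch (the env var name is the one proposed above; an unset filter means never write):

```python
import os

def should_write_kv(surface_id: str) -> bool:
    """Write the KV active-deploy marker only when the deploying surface
    matches CONSOLE_DEPLOY_KV_SURFACE_ID_FILTER; unset filter disables writes."""
    return os.environ.get("CONSOLE_DEPLOY_KV_SURFACE_ID_FILTER") == surface_id
```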


12. Sub-Card Breakdown

Filed GitHub issue numbers will be recorded here once the sub-cards are created.

  - S1: Add FLAG_CONSOLE_DEPLOY_XENV_READ + xenv blueprint route. Size: S. Depends on: none. AC: GET /api/internal/deploys/<id>/xenv returns deploy row with 1 KB log_tail; service token auth; audit event written; feature flag gates to 501 when off.
  - S2: xenv service token provisioning + secret distribution. Size: S. Depends on: OQ-2 resolved. AC: CONSOLE_CROSS_ENV_READ_TOKEN in Infisical at both /Console/prod/ and /Console/staging/; both consoles boot successfully; token never logged.
  - S3: KV write/delete on deploy status transitions. Size: M. Depends on: S1. AC: Flask app writes to RAXX_CONSOLE_DEPLOY_STATE KV on dispatched status and deletes on terminal status; fire-and-forget with audit log; FLAG_CONSOLE_DEPLOY_KV_WRITE flag; 10-min TTL on PUT.
  - S4: Slack notification baseline (Option C). Size: S. Depends on: none. AC: slack_notify.py emits on deploy succeeded, failed, timed_out transitions; message includes surface, status, GitHub run URL; fires via existing callback path.
  - S5: CF Worker: KV read + intercept + "be right back" page. Size: L. Depends on: S1, S2, S3, OQ-1, OQ-3. AC: Worker deployed to console.raxx.app; GET status requests proxied to xenv during active deploy; all other requests during deploy window return inline "be right back" page; page auto-redirects on success; Wrangler CI job added.
  - S6: "Be right back" page UX polish. Size: S. Depends on: S5. AC: page reviewed by ux-polisher; 5-min warning banner; failure UX; sister-down fallback message; accessible markup.
  - S7: Integration test: full deploy-window round-trip. Size: M. Depends on: S3, S5. AC: staging smoke test that mocks a console-prod deploy; verifies KV write, xenv poll, KV delete, redirect; does not require a real Heroku restart.

Size legend: S = 0.5–1 day, M = 1–2 days, L = 2–4 days (feature-developer estimate).

Claim order: S1 and S4 are independent and can be claimed simultaneously. S2 can proceed in parallel with S1 once OQ-2 is resolved. S3 depends on S1 (feature flag pattern established). S5 depends on all three plus OQ-1 and OQ-3 — it is the longest card and should not be claimed until OQs are resolved. S6 and S7 follow S5.