Raxx · internal docs


ADR-0044 — Console Self-Deploy Web Layer: Option Selection + Topology

Status: Accepted
Date: 2026-05-03 UTC
Deciders: software-architect (design), Kristerpher (topology constraint)
Scope: Approach selection (A vs B vs C vs hybrid) for keeping the operator's browser session alive during a console self-deploy, plus the two-app vs single-app topology decision.
Design doc: docs/architecture/console-self-deploy-web-layer.md
Refs: Issue #988 (parent card), #985 (deploy-button pulse ring), ADR-0029 (console staging retirement, suspended by this ADR)
Does NOT supersede: ADR-0034 (deploy flow), ADR-0036 (async dispatch). Extends them.
Suspends: ADR-0029 retirement timeline pending soak of this design.

Context

Issue #988 describes three candidate approaches for keeping an operator's browser tab connected during a console self-deploy:

Option A: CF Pages / CF Worker edge intercept. During a pending deploy, the edge serves a "be right back" page that polls the deploy status endpoint on the sister console.
Option B: In-page WebSocket fallback to the sister console. The browser opens a WS connection to the sister at deploy-trigger time; the page freezes but keeps the operator informed.
Option C: Out-of-band Slack/email notification. The operator's tab is allowed to die; they learn the result via Slack DM.

A fourth implicit option was also considered: a hybrid of A + C.

The design must also resolve the topology question: should raxx-console-staging be retired as ADR-0029 planned, or kept as the "sister console" that enables Option A?

Decisions

D1: Select Hybrid A + C

The primary continuity mechanism is a CF Worker edge intercept (Option A). Option C (Slack notification) is deployed unconditionally as a fallback baseline. Neither Option B nor a pure Option C is selected.

D2: Retain both Heroku apps; suspend ADR-0029 retirement timeline

The raxx-console-staging Heroku app is retained. ADR-0029's planned retirement (~2026-05-07) is suspended. The two-app topology is a structural prerequisite for Option A: the sister console must exist for the CF Worker to proxy status reads during a prod-console restart. ADR-0029's retirement can be revisited after this design soaks and a future card evaluates zero-downtime deployment alternatives (rolling deploys on an always-on platform, permanent CF Worker gateway, etc.).

D3: Cross-env polling uses a scoped service token, not session cookies

The CF Worker proxies deploy status reads to the sister console via a dedicated CONSOLE_CROSS_ENV_READ_TOKEN. The operator's session cookie is never forwarded cross-domain. Session isolation between the two console apps is preserved.
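A minimal sketch of the D3 rule, written in Python for illustration only (the real Worker would be JavaScript or TypeScript). It shows the two halves of the contract: the proxy drops the operator's cookies and attaches only the service token, and the sister console checks that token in constant time. The function names, the header allow-list, and the `Authorization: Bearer` scheme are assumptions and do not come from the design doc.

```python
from __future__ import annotations

import hmac

# Hypothetical allow-list: the only inbound headers forwarded cross-env.
FORWARDABLE_HEADERS = {"accept", "user-agent"}


def build_sister_headers(incoming: dict[str, str], service_token: str) -> dict[str, str]:
    """Build headers for the status read sent to the sister console.

    Anything not allow-listed (Cookie, CF Access headers, ...) is dropped,
    so the scoped service token is the only credential that crosses over.
    """
    outgoing = {
        name: value
        for name, value in incoming.items()
        if name.lower() in FORWARDABLE_HEADERS
    }
    # Assumption: the token travels as a bearer credential.
    outgoing["Authorization"] = f"Bearer {service_token}"
    return outgoing


def token_is_valid(presented_header: str | None, expected_token: str) -> bool:
    """Sister-console side: constant-time comparison of the presented token."""
    prefix = "Bearer "
    if not presented_header or not presented_header.startswith(prefix):
        return False
    presented = presented_header[len(prefix):]
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

An allow-list is used rather than a deny-list so that a newly introduced credential header is not forwarded by accident.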

D4: KV is the edge-layer signal; DB is the source of truth

The CF Worker reads a Cloudflare Workers KV entry (active_deploy:console-prod) to decide whether to intercept. The Flask app writes to KV as a fire-and-forget side effect of deploy status transitions. The DB (console_deploys table) remains the authoritative deploy record. KV is an ephemeral cache with a 10-minute hard TTL.
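The write path in D4 could look roughly like the sketch below. It assumes a hypothetical KV client exposed as two callables (`kv_put` and `kv_delete`) and hypothetical status names; only the key name and the 10-minute TTL come from this ADR. The point it illustrates is the ordering: the DB row is already committed, and a KV failure is logged and swallowed rather than propagated.

```python
from __future__ import annotations

import json
import logging
from typing import Callable

logger = logging.getLogger("console.deploy.kv")

KV_KEY = "active_deploy:console-prod"  # key named in D4
KV_TTL_SECONDS = 600                   # the 10-minute hard TTL from D4
# Hypothetical status names; the real values live in console_deploys.
TERMINAL_STATUSES = {"succeeded", "failed"}

# Hypothetical KV client interface.
KvPut = Callable[[str, str, int], None]   # (key, value, ttl_seconds)
KvDelete = Callable[[str], None]          # (key)


def sync_edge_signal(deploy_id: int, status: str, kv_put: KvPut, kv_delete: KvDelete) -> bool:
    """Mirror a deploy status transition into KV without ever raising.

    Intended to run after the DB commit. Returns True when the KV call
    succeeded and False when it failed and was swallowed.
    """
    try:
        if status in TERMINAL_STATUSES:
            kv_delete(KV_KEY)  # deploy is over: stop the edge intercept
        else:
            payload = json.dumps({"deploy_id": deploy_id, "status": status})
            kv_put(KV_KEY, payload, KV_TTL_SECONDS)
        return True
    except Exception:
        # Fire-and-forget: record the failure, keep the deploy flow going.
        logger.exception("KV sync failed for deploy %s (status=%s)", deploy_id, status)
        return False
```

Because every non-terminal write carries the TTL, a missed delete expires on its own instead of leaving the Worker intercepting indefinitely.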

D5: No shared SECRET_KEY or shared session backend between consoles

Each console app has its own SECRET_KEY and its own session DB. Cross-env communication is strictly server-to-server via the service token. This decision follows from D3 and the existing auth posture (ADR-0043: per-surface CF Access policies, no auto-provisioning).

Consequences

Positive

Zero operator action during the deploy window. The CF Worker serves the "be right back" page before the dyno goes down, polls status transparently, and redirects automatically when the deploy completes. The operator's session cookie and CF Access cookie survive intact.
No session cookie sharing. The operator's browser stays on console.raxx.app throughout. No cross-domain cookie issues.
Option C as baseline means the operator is notified even if the CF Worker or sister console is unavailable. Failure modes degrade gracefully to the pre-design state (502 + Slack notification), never to silent failure.
KV TTL is a safety net. If the Flask app crashes before writing the KV DELETE, the Worker automatically stops intercepting after 10 minutes. No manual cleanup.
ADR-0029 is explicitly suspended, not silently contradicted. Future architects can see the retirement was deferred and why.
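The behaviour described in the list above can be summarised as a small decision table. The sketch below models it in Python for readability (the Worker itself would be JavaScript or TypeScript). The `Action` names, the JSON shape of the KV entry, and the status strings are assumptions. The fail-open choices reflect the stated goal that failures degrade to the pre-design state rather than to a stuck holding page.

```python
from __future__ import annotations

import json
from enum import Enum
from typing import Optional


class Action(Enum):
    PASS_THROUGH = "pass_through"    # forward to the prod dyno as usual
    HOLDING_PAGE = "holding_page"    # serve the inline "be right back" page
    REDIRECT_BACK = "redirect_back"  # deploy finished: return the tab to the console


def on_request(kv_value: Optional[str]) -> Action:
    """Decide what to do with an incoming request.

    kv_value is the raw KV entry, or None when the key is absent
    (never written, deleted on completion, or expired by the TTL).
    """
    if kv_value is None:
        return Action.PASS_THROUGH
    try:
        entry = json.loads(kv_value)
    except ValueError:
        return Action.PASS_THROUGH  # unreadable signal: fail open
    if not isinstance(entry, dict) or "deploy_id" not in entry:
        return Action.PASS_THROUGH
    return Action.HOLDING_PAGE


def on_poll(sister_status: Optional[str]) -> Action:
    """Decide what the holding page does after one status poll.

    sister_status is None when the sister console could not be reached.
    """
    if sister_status is None:
        return Action.PASS_THROUGH  # degrade to the pre-design behaviour
    if sister_status in {"succeeded", "failed"}:
        return Action.REDIRECT_BACK
    return Action.HOLDING_PAGE
```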

Negative / Risks

New infrastructure component (CF Worker + KV namespace). Adds operational surface: wrangler.toml, CI deploy job, KV namespace, Worker secret. The feature-developer implementing S5 must be comfortable with Cloudflare Workers tooling.
Three-location secret distribution. CONSOLE_CROSS_ENV_READ_TOKEN must be set in Infisical (prod path), Infisical (staging path), and as a CF Worker secret. Rotation requires updating all three locations together, and the runbook must document the procedure.
KV write is fire-and-forget. A failed KV write degrades silently to the pre-design UX (operator gets a 502). This is acceptable for a maintenance-window feature but must be monitored. The audit log entry (§6.4 of design doc) provides observability.
raxx-console-staging costs money. Retaining the staging Heroku app instead of retiring it as ADR-0029 planned adds ongoing dyno cost. At current scale this is negligible (~$7/month eco dyno). Revisit if cost becomes relevant.
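For the three-location token risk above, a rotation helper might be shaped like the following sketch. It is not the runbook procedure. The setter callables are hypothetical stand-ins for whatever writes to Infisical and to the Worker secret store, and the location names are illustrative. What it shows is one token generated once, pushed everywhere, with partial failures reported instead of hidden.

```python
from __future__ import annotations

import secrets
from typing import Callable, Dict, List, Tuple

# Hypothetical: each setter writes the token to one location and raises on failure.
Setter = Callable[[str], None]


def rotate_cross_env_token(setters: Dict[str, Setter]) -> Tuple[str, List[str]]:
    """Generate one new token and push it to every location.

    Returns the new token and the names of the locations that failed,
    so a partial rotation is visible to whoever runs it.
    """
    new_token = secrets.token_urlsafe(32)
    failed: List[str] = []
    for name, set_secret in setters.items():
        try:
            set_secret(new_token)
        except Exception:
            failed.append(name)
    return new_token, failed
```

A non-empty failure list means the three locations disagree, which is the state the runbook needs to cover.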

Neutral

The "be right back" HTML page is stored inline in the CF Worker. It has no external asset dependencies. The UX is intentionally minimal; ux-polisher reviews it in sub-card S6 but it is not a customer-facing surface.
The console_deploys DB schema is unchanged.

Why Option B was rejected

Option B (WebSocket to sister console) requires:

The operator's browser to open a WebSocket to console-staging.raxx.app directly. This means the operator's CF Access cookie must be accepted by the staging CF Access application. CF Access policies are per-application and per-domain. Connecting the two CF Access apps so the operator's prod cookie works on the staging domain couples the two authorization domains — a security regression against ADR-0043's per-surface policy model.
Client-side JavaScript to maintain the WS connection across a 90-second disconnect, implement reconnection backoff, deduplicate messages, and orchestrate the hand-back to the restarted prod dyno. This is 200–400 lines of stateful JS for a maintenance-window edge case. Option A achieves the same UX goal with the logic server-side in the CF Worker.
The sister console to expose a WebSocket upgrade endpoint with session-compatible auth. Session cookies don't cross domains (SameSite=Strict). The only viable auth for the WS endpoint is the service token — the same token used by Option A. So Option B's auth model reduces to the same service token as Option A, but with more client-side complexity.

Verdict: Option B gives the same UX as Option A with higher complexity and a CF Access coupling risk. Rejected.

Why pure Option C was rejected as the primary path

Option C (out-of-band Slack notification) delivers the deploy result to the operator but breaks the continuity promise: "operator stays engaged without re-login or context loss." After a pure Option C deploy, the operator must re-auth, navigate back to the deploy panel, and manually correlate the Slack notification with the console state. This is exactly the "missed-idealism state" described in #988.

Option C is retained as a baseline fallback because it is already partially built (slack_notify.py) and provides a safety net when the CF Worker or sister console is unavailable.

Alternatives Considered

Alternative: Permanent CF Worker gateway (long-term single-console path)

Instead of using the CF Worker only during deploy windows, run all console.raxx.app traffic through the Worker permanently, proxying to the active dyno. During deploys, the Worker switches to the sister or serves the "be right back" page. This eliminates the two-app topology permanently.

Not selected for this card because it is a platform migration: it changes the normal-path architecture for all console requests, requires the Worker to handle auth forwarding, and makes the CF Worker a required hop for all console traffic. It is the right long-term direction but is out of scope for this design. The KV store and xenv endpoint designed here are compatible with this future architecture, so no rework is needed.

Alternative: Fly.io / always-on rolling deploy

Migrate the console to a platform that supports zero-downtime rolling deploys (Fly.io, Render, Railway). The dyno-restart problem disappears at the infrastructure layer.

Not selected because it is a platform migration. The scope of migrating the console from Heroku to another platform is larger than the self-deploy UX problem it solves. Filed as a "for consideration" comment on #988.

Alternative: Shared session backend (Redis) between consoles

Make both consoles share a Redis session store so the operator's session cookie is valid on both domains. The operator's browser could then make direct requests to the sister console during the deploy window.

Not selected because it collapses session isolation between the two console apps. A compromise of the staging session store would expose prod session tokens. Per ADR-0043, per-surface auth isolation is a design invariant.

References

docs/architecture/console-self-deploy-web-layer.md — full design doc
ADR-0034 — deploy flow (no changes)
ADR-0036 — async dispatch (no changes)
ADR-0029 — staging retirement (suspended by D2)
ADR-0043 — per-surface auth isolation (informs D3 and B rejection)
Issue #988 — problem statement and three candidate approaches
Issue #985 — deploy-button pulse ring (consumes xenv endpoint from §4.2 of design doc)

Auto-generated from docs/ in raxx-app/TradeMasterAPI. Gated behind Cloudflare Access. Re-deployed on every push to main.