ADR-0044 — Console Self-Deploy Web Layer: Option Selection + Topology

Status: Accepted
Date: 2026-05-03 UTC
Deciders: software-architect (design), Kristerpher (topology constraint)
Scope: Approach selection (A vs B vs C vs hybrid) for keeping the operator's browser session alive during a console self-deploy, plus the two-app vs single-app topology decision.
Design doc: docs/architecture/console-self-deploy-web-layer.md
Refs: Issue #988 (parent card), #985 (deploy-button pulse ring), ADR-0029 (console staging retirement, suspended by this ADR)
Does NOT supersede: ADR-0034 (deploy flow), ADR-0036 (async dispatch); extends them.
Suspends: ADR-0029 retirement timeline pending soak of this design.


Context

Issue #988 describes three candidate approaches for keeping an operator's browser tab connected during a console self-deploy:

  - Option A: a CF Worker edge intercept that keeps the operator on console.raxx.app, proxying deploy status reads to the sister console while the prod dyno restarts.
  - Option B: a direct browser WebSocket to the sister console (console-staging.raxx.app).
  - Option C: an out-of-band Slack notification of the deploy result.

A fourth implicit option was also considered: a hybrid of A + C.

The design must also resolve the topology question: should raxx-console-staging be retired as ADR-0029 planned, or kept as the "sister console" that enables Option A?


Decisions

D1: Select Hybrid A + C

The primary continuity mechanism is a CF Worker edge intercept (Option A). Option C (Slack notification) is deployed unconditionally as a fallback baseline. Neither Option B nor a pure Option C is selected.
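
To make the composition concrete, here is a minimal Python sketch of the dispatch path; the helper names (arm_edge_intercept, notify_slack_fallback) are hypothetical, and only the A-primary / C-unconditional structure comes from this decision. Concrete sketches of each arm appear under D4 and in the Option C section below.

```python
# Hypothetical sketch of the hybrid A + C dispatch path. All names here are
# illustrative; D1 specifies only that A is primary and C is unconditional.
import logging

log = logging.getLogger(__name__)

def arm_edge_intercept(deploy_id: str) -> None:
    """Option A arm: write the KV signal the CF Worker intercepts on (see D4)."""

def notify_slack_fallback(deploy_id: str) -> None:
    """Option C arm: out-of-band Slack notification (see the Option C section)."""

def on_self_deploy_started(deploy_id: str) -> None:
    # Option A is primary but best-effort: a failed KV write must not block
    # the deploy, because the Option C baseline still covers the operator.
    try:
        arm_edge_intercept(deploy_id)
    except Exception:
        log.warning("edge intercept arm failed; Slack baseline still covers")
    # Option C runs unconditionally per D1, not only on Option A failure.
    notify_slack_fallback(deploy_id)
```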

D2: Retain both Heroku apps; suspend ADR-0029 retirement timeline

The raxx-console-staging Heroku app is retained. ADR-0029's planned retirement (~2026-05-07) is suspended. The two-app topology is a structural prerequisite for Option A: the sister console must exist for the CF Worker to proxy status reads during a prod-console restart. ADR-0029's retirement can be revisited after this design soaks and a future card evaluates zero-downtime deployment alternatives (rolling deploys on an always-on platform, permanent CF Worker gateway, etc.).

D3: Cross-env polling uses a scoped service token, not session cookies

The CF Worker proxies deploy status reads to the sister console via a dedicated CONSOLE_CROSS_ENV_READ_TOKEN. The operator's session cookie is never forwarded cross-domain. Session isolation between the two console apps is preserved.
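
This ADR does not pin down the endpoint shape. As a minimal Flask sketch of the sister console's xenv read endpoint: the route path, payload, and lookup helper below are illustrative, and only the CONSOLE_CROSS_ENV_READ_TOKEN name comes from this decision.

```python
# Hypothetical sketch of the sister console's cross-env status read endpoint.
import hmac
import os

from flask import Flask, abort, jsonify, request

app = Flask(__name__)
EXPECTED_TOKEN = os.environ["CONSOLE_CROSS_ENV_READ_TOKEN"]

def lookup_deploy(deploy_id: str):
    """Illustrative stand-in for a read-only query of console_deploys."""
    return None

@app.get("/xenv/deploy-status/<deploy_id>")
def xenv_deploy_status(deploy_id: str):
    # Server-to-server auth only: a scoped bearer token, never the operator's
    # session cookie (D3/D5).
    token = request.headers.get("Authorization", "").removeprefix("Bearer ").strip()
    if not hmac.compare_digest(token, EXPECTED_TOKEN):
        abort(401)
    row = lookup_deploy(deploy_id)
    if row is None:
        abort(404)
    return jsonify(status=row["status"], updated_at=row["updated_at"])
```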

D4: KV is the edge-layer signal; DB is the source of truth

The CF Worker reads a Cloudflare Workers KV entry (active_deploy:console-prod) to decide whether to intercept. The Flask app writes to KV as a fire-and-forget side effect of deploy status transitions. The DB (console_deploys table) remains the authoritative deploy record. KV is an ephemeral cache with a 10-minute hard TTL.
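
On the Flask side, the fire-and-forget write could take roughly the shape below, using the Cloudflare KV REST API via the requests library. The KV key and the 10-minute TTL come from this decision; the env var and function names are assumptions.

```python
# Hypothetical sketch of the fire-and-forget KV write from the Flask app.
import os
import threading

import requests

CF_ACCOUNT_ID = os.environ["CF_ACCOUNT_ID"]            # assumed config name
CF_KV_NAMESPACE_ID = os.environ["CF_KV_NAMESPACE_ID"]  # assumed config name
CF_API_TOKEN = os.environ["CF_API_TOKEN"]              # assumed config name

KV_KEY = "active_deploy:console-prod"
KV_TTL_SECONDS = 600  # the 10-minute hard TTL from D4

def _write_kv(value: str) -> None:
    url = (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{CF_ACCOUNT_ID}/storage/kv/namespaces/{CF_KV_NAMESPACE_ID}"
        f"/values/{KV_KEY}"
    )
    try:
        requests.put(
            url,
            params={"expiration_ttl": KV_TTL_SECONDS},
            headers={"Authorization": f"Bearer {CF_API_TOKEN}"},
            data=value,
            timeout=5,
        )
    except requests.RequestException:
        pass  # fire-and-forget: the DB remains the source of truth

def signal_deploy_status(status_json: str) -> None:
    # A detached daemon thread keeps the KV side effect off the request path,
    # so a slow or failed edge write never delays a deploy status transition.
    threading.Thread(target=_write_kv, args=(status_json,), daemon=True).start()
```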

D5: No shared SECRET_KEY or shared session backend between consoles

Each console app has its own SECRET_KEY and its own session DB. Cross-env communication is strictly server-to-server via the service token. This decision follows from D3 and the existing auth posture (ADR-0043: per-surface CF Access policies, no auto-provisioning).
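
In config terms, the invariant is simply that each app reads its own values. A minimal sketch, assuming a factory function and env var names that are illustrative rather than the actual codebase's:

```python
# Hypothetical per-app Flask config sketch for D5. Each console deployment
# gets its own SECRET_KEY and session DB; nothing is shared across the two.
import os

from flask import Flask

def create_console_app() -> Flask:
    app = Flask(__name__)
    # Distinct per app; never copied between the prod and sister consoles.
    app.config["SECRET_KEY"] = os.environ["SECRET_KEY"]
    # Each console points at its own session database (URI name assumed).
    app.config["SESSION_DB_URI"] = os.environ["SESSION_DB_URI"]
    return app
```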


Consequences

Positive

  - The operator stays engaged through a self-deploy without re-login or context loss, which is the continuity promise from #988.
  - Session isolation between the two console apps is preserved: no shared SECRET_KEY, no shared session backend, no cross-domain cookies (D3, D5).
  - Option C runs unconditionally, so a Slack notification still lands even when the CF Worker or sister console is unavailable.

Negative / Risks

  - The two-app Heroku topology persists past ADR-0029's planned retirement date, carrying its operating and maintenance cost until a future card revisits it (D2).
  - The KV edge signal is an ephemeral cache; a missed or stale fire-and-forget write can mislead the Worker until the 10-minute hard TTL clears it (D4).
  - Option A makes the CF Worker and the sister console dependencies of the deploy-window UX.

Neutral

  - ADR-0029 is suspended, not superseded; its retirement can be revisited after this design soaks (D2).
  - The KV store and xenv endpoint are forward-compatible with a permanent CF Worker gateway (see Alternatives Considered).


Why Option B was rejected

Option B (WebSocket to sister console) requires:

  1. The operator's browser to open a WebSocket to console-staging.raxx.app directly. This means the operator's CF Access cookie must be accepted by the staging CF Access application. CF Access policies are per-application and per-domain. Connecting the two CF Access apps so the operator's prod cookie works on the staging domain couples the two authorization domains — a security regression against ADR-0043's per-surface policy model.

  2. Client-side JavaScript to maintain the WS connection across a 90-second disconnect, implement reconnection backoff, deduplicate messages, and orchestrate the hand-back to the restarted prod dyno. This is 200–400 lines of stateful JS for a maintenance-window edge case. Option A achieves the same UX goal with the logic server-side in the CF Worker.

  3. The sister console to expose a WebSocket upgrade endpoint with session-compatible auth. Session cookies are scoped to their own domain and, with SameSite=Strict, are never sent cross-site, so the only viable auth for the WS endpoint is the service token — the same token used by Option A. Option B's auth model therefore reduces to Option A's, but with more client-side complexity.

Verdict: Option B gives the same UX as Option A with higher complexity and a CF Access coupling risk. Rejected.

Why pure Option C was rejected as the primary path

Option C (out-of-band Slack notification) delivers the deploy result to the operator but breaks the continuity promise: "operator stays engaged without re-login or context loss." After a pure Option C deploy, the operator must re-auth, navigate back to the deploy panel, and manually correlate the Slack notification with the console state. This is exactly the "missed-idealism state" described in #988.

Option C is retained as a baseline fallback because it is already partially built (slack_notify.py) and provides a safety net when the CF Worker or sister console is unavailable.
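
For reference, the baseline could be as small as the sketch below, assuming a Slack incoming webhook; slack_notify.py is described above as only partially built, and the webhook env var name and message text here are illustrative.

```python
# Hypothetical sketch of the Option C baseline via a Slack incoming webhook.
import os

import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_DEPLOY_WEBHOOK_URL"]  # assumed name

def notify_deploy_result(deploy_id: str, status: str) -> bool:
    """Post the deploy result out-of-band; return False on any failure."""
    try:
        resp = requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"console self-deploy {deploy_id}: {status}"},
            timeout=5,
        )
        return resp.ok
    except requests.RequestException:
        return False
```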


Alternatives Considered

Alternative: Permanent CF Worker gateway (long-term single-console path)

Instead of using the CF Worker only during deploy windows, run all console.raxx.app traffic through the Worker permanently, proxying to the active dyno. During a deploy, the Worker either fails over to the sister console or serves the "be right back" page. This would allow the two-app topology to be retired permanently.

Not selected for this card because it is a platform migration: it changes the normal-path architecture for all console requests, requires the Worker to handle auth forwarding, and makes the CF Worker a required hop for all console traffic. It is the right long-term direction but is out of scope for this design. The KV store and xenv endpoint designed here are compatible with this future architecture; no rework would be needed.

Alternative: Fly.io / always-on rolling deploy

Migrate the console to a platform that supports zero-downtime rolling deploys (Fly.io, Render, Railway). The dyno-restart problem disappears at the infrastructure layer.

Not selected because it is a platform migration. The scope of migrating the console from Heroku to another platform is larger than the self-deploy UX problem it solves. Filed as a "for consideration" comment on #988.

Alternative: Shared session backend (Redis) between consoles

Make both consoles share a Redis session store so the operator's session cookie is valid on both domains. The operator's browser could then make direct requests to the sister console during the deploy window.

Not selected because it collapses session isolation between the two console apps. A compromise of the staging session store would expose prod session tokens. Per ADR-0043, per-surface auth isolation is a design invariant.

