Status: Accepted
Date: 2026-05-03 UTC
Deciders: software-architect (design), Kristerpher (topology constraint)
Scope: Approach selection (A vs B vs C vs hybrid) for keeping the operator's browser session alive during a console self-deploy, plus the two-app vs single-app topology decision.
Design doc: docs/architecture/console-self-deploy-web-layer.md
Refs: Issue #988 (parent card), #985 (deploy-button pulse ring), ADR-0029 (console staging retirement — suspended by this ADR)
Does NOT supersede: ADR-0034 (deploy flow), ADR-0036 (async dispatch). Extends them.
Suspends: ADR-0029 retirement timeline pending soak of this design.
Issue #988 describes three candidate approaches for keeping an operator's browser tab connected during a console self-deploy: (A) a CF Worker edge intercept, (B) a WebSocket to the sister console, and (C) an out-of-band Slack notification.
A fourth implicit option was also considered: a hybrid of A + C.
The design also must resolve the topology question: should raxx-console-staging be retired as ADR-0029 planned, or kept as the "sister console" that enables Option A?
The primary continuity mechanism is a CF Worker edge intercept (Option A). Option C (Slack notification) is deployed unconditionally as a fallback baseline. Neither Option B nor a pure Option C is selected.
The raxx-console-staging Heroku app is retained. ADR-0029's planned retirement (~2026-05-07) is suspended. The two-app topology is a structural prerequisite for Option A: the sister console must exist for the CF Worker to proxy status reads during a prod-console restart. ADR-0029's retirement can be revisited after this design soaks and a future card evaluates zero-downtime deployment alternatives (rolling deploys on an always-on platform, permanent CF Worker gateway, etc.).
The CF Worker proxies deploy status reads to the sister console via a dedicated CONSOLE_CROSS_ENV_READ_TOKEN. The operator's session cookie is never forwarded cross-domain. Session isolation between the two console apps is preserved.
The CF Worker reads a Cloudflare Workers KV entry (active_deploy:console-prod) to decide whether to intercept. The Flask app writes to KV as a fire-and-forget side effect of deploy status transitions. The DB (console_deploys table) remains the authoritative deploy record. KV is an ephemeral cache with a 10-minute hard TTL.
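The fire-and-forget write can be isolated behind a small helper so that a KV outage never blocks a deploy status transition. A minimal sketch, assuming the Cloudflare KV REST API's `expiration_ttl` query parameter; the helper name (`mark_deploy_active`) and the injected `post` callable are illustrative, not the actual implementation, and the authoritative `console_deploys` DB write happens elsewhere and is not shown:

```python
import json
import logging
from typing import Callable

log = logging.getLogger(__name__)

KV_TTL_SECONDS = 600  # 10-minute hard TTL: stale entries self-expire
KV_KEY = "active_deploy:console-prod"


def mark_deploy_active(
    post: Callable[..., object],  # e.g. requests.put with auth pre-bound
    kv_base_url: str,             # .../storage/kv/namespaces/<ns>/values
    deploy_id: int,
) -> None:
    """Best-effort KV write; the console_deploys row stays authoritative."""
    try:
        post(
            f"{kv_base_url}/{KV_KEY}",
            params={"expiration_ttl": KV_TTL_SECONDS},
            data=json.dumps({"deploy_id": deploy_id, "status": "active"}),
            timeout=2,  # never let KV latency stall the deploy flow
        )
    except Exception:
        # Fire-and-forget: log and move on. The Worker falls back to
        # pass-through behaviour when the key is absent.
        log.warning("KV write for %s failed; DB remains source of truth", KV_KEY)
```

The 2-second timeout and the swallowed exception are the point of the pattern: the deploy transition must complete whether or not the edge cache hears about it.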
Each console app has its own SECRET_KEY and its own session DB. Cross-env communication is strictly server-to-server via the service token. This decision follows from D3 and the existing auth posture (ADR-0043: per-surface CF Access policies, no auto-provisioning).
The operator remains on console.raxx.app throughout. No cross-domain cookie issues.
New operational surface to maintain: wrangler.toml, CI deploy job, KV namespace, Worker secret. The feature-developer implementing S5 must be comfortable with Cloudflare Workers tooling.
CONSOLE_CROSS_ENV_READ_TOKEN must be set in three places: Infisical (prod path), Infisical (staging path), and as a CF Worker secret. Rotation requires all three updates simultaneously; the runbook must document this.
raxx-console-staging costs money. Retaining the staging Heroku app instead of retiring it as ADR-0029 planned adds ongoing dyno cost. At current scale this is negligible (~$7/month eco dyno); revisit if cost becomes relevant.
ux-polisher reviews it in sub-card S6, but it is not a customer-facing surface.
The console_deploys DB schema is unchanged.
Option B (WebSocket to sister console) requires:
The operator's browser to open a WebSocket to console-staging.raxx.app directly. This means the operator's CF Access cookie must be accepted by the staging CF Access application. CF Access policies are per-application and per-domain. Connecting the two CF Access apps so the operator's prod cookie works on the staging domain couples the two authorization domains — a security regression against ADR-0043's per-surface policy model.
Client-side JavaScript to maintain the WS connection across a 90-second disconnect, implement reconnection backoff, deduplicate messages, and orchestrate the hand-back to the restarted prod dyno. This is 200–400 lines of stateful JS for a maintenance-window edge case. Option A achieves the same UX goal with the logic server-side in the CF Worker.
The sister console to expose a WebSocket upgrade endpoint with session-compatible auth. Session cookies don't cross domains (SameSite=Strict). The only viable auth for the WS endpoint is the service token — the same token used by Option A. So Option B's auth model reduces to the same service token as Option A, but with more client-side complexity.
Verdict: Option B gives the same UX as Option A with higher complexity and a CF Access coupling risk. Rejected.
Option C (out-of-band Slack notification) delivers the deploy result to the operator but breaks the continuity promise: "operator stays engaged without re-login or context loss." After a pure Option C deploy, the operator must re-auth, navigate back to the deploy panel, and manually correlate the Slack notification with the console state. This is exactly the "missed-idealism state" described in #988.
Option C is retained as a baseline fallback because it is already partially built (slack_notify.py) and provides a safety net when the CF Worker or sister console is unavailable.
Instead of using the CF Worker only during deploy windows, run all console.raxx.app traffic through the Worker permanently, proxying to the active dyno. During deploys, the Worker switches traffic to the sister console or serves the "be right back" page. This eliminates the two-app topology permanently.
Not selected for this card because it is a platform migration (changes the normal-path architecture for all console requests, requires Worker to handle auth forwarding, introduces CF Worker as a required path for all console traffic). It is the right long-term direction but is out of scope for this design. The KV store and xenv endpoint designed here are compatible with this future architecture; no rework needed.
Migrate the console to a platform that supports zero-downtime rolling deploys (Fly.io, Render, Railway). The dyno-restart problem disappears at the infrastructure layer.
Not selected because it is a platform migration. The scope of migrating the console from Heroku to another platform is larger than the self-deploy UX problem it solves. Filed as a "for consideration" comment on #988.
Make both consoles share a Redis session store so the operator's session cookie is valid on both domains. The operator's browser could then make direct requests to the sister console during the deploy window.
Not selected because it collapses session isolation between the two console apps. A compromise of the staging session store would expose prod session tokens. Per ADR-0043, per-surface auth isolation is a design invariant.