Status: Proposed
Owner: software-architect
Date: 2026-04-30 UTC
Related ADRs: ADR-0034, ADR-0028, ADR-0030
Refs: Kristerpher directive 2026-04-30 UTC, PR #675 (deploy-console.yml), PR #677 (scripts/ops/deploy-console.sh)
The console (console.raxx.app) already shows a health status grid for every
Raxx surface via SURFACE_REGISTRY_BY_ENV in
console/app/services/status_poller.py. Each tile shows operational state but
has no deploy controls. To deploy a surface today, an operator must open a
terminal and run gh workflow run deploy-heroku.yml ... manually.
This design adds a Deploy button to each tile, a type-to-confirm modal (per
ADR-0028), live deploy status feedback within the console, and a structured
console_deploys table as the canonical deploy intent record. GitHub workflows
post lifecycle callbacks to the console via a new webhook endpoint.
The existing workflows (deploy-console.yml, deploy-heroku.yml,
deploy-customer-docs.yml, deploy-status-page.yml) and break-glass scripts
(scripts/ops/deploy-console.sh) are not replaced — they are the execution
engines. The console becomes the preferred initiation and observability layer
on top of those existing engines.
All platform invariants apply. Deploy-flow-specific constraints:
No stored credentials. The GitHub dispatch token and HMAC secret live in vault (Infisical) and are injected at runtime. Neither appears in code or schema.
ADR-0028 friction preserved. The Deploy button does not skip the
type-to-confirm gate. The modal requires the operator to type the exact phrase
before POST /api/internal/deploys fires.
Audit trail. Every deploy intent writes an audit_log row at creation
(console.deploy.intent) and every callback writes another
(console.deploy.callback). No deploy event is unlogged.
Break-glass paths unchanged. Manual gh workflow run and
scripts/ops/deploy-console.sh continue to function. Workflows gracefully
skip the callback step when console_deploy_id is absent.
Secrets rotatable without redeploy. DEPLOY_CALLBACK_HMAC_SECRET and
GITHUB_DISPATCH_TOKEN are read from environment variables at request time,
not baked into the app binary.
Kill-switch is unchanged. Heroku releases:rollback remains the fastest
rollback path and requires no CI or console UI action.
console_deploys table

CREATE TABLE console_deploys (
id TEXT PRIMARY KEY, -- UUID
surface_id TEXT NOT NULL, -- FK to SURFACE_REGISTRY id
target_env TEXT NOT NULL, -- 'staging' | 'production'
target_ref TEXT NOT NULL, -- git SHA or branch name
requested_by TEXT NOT NULL, -- operator email (RAXX_OPERATOR_EMAIL)
requested_at_utc DATETIME NOT NULL,
idempotency_key TEXT NOT NULL UNIQUE, -- UUID; prevents double-click
status TEXT NOT NULL DEFAULT 'requested',
-- requested | dispatched | building
-- | deploying | succeeded
-- | failed | timed_out
github_run_id TEXT, -- populated after dispatch
github_run_url TEXT, -- derived from run_id
last_status_at_utc DATETIME,
streaming_log TEXT DEFAULT '', -- appended; capped at 500 KB
failure_reason TEXT, -- populated on failed | timed_out
audit_correlation_id TEXT, -- FK to audit_log.id (intent row)
workflow_name TEXT NOT NULL -- e.g. 'deploy-console.yml'
);
CREATE INDEX ix_console_deploys_surface_status
ON console_deploys (surface_id, status, requested_at_utc DESC);
CREATE INDEX ix_console_deploys_idempotency
ON console_deploys (idempotency_key);
CREATE INDEX ix_console_deploys_github_run_id
ON console_deploys (github_run_id)
WHERE github_run_id IS NOT NULL;
streaming_log is append-only from the server side. Feature-developer must
enforce the 500 KB cap on each append (truncate oldest bytes if needed, not
newest — preserve the tail).
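A minimal sketch of the append-with-cap logic (function and constant names are hypothetical; the real implementation lives in the deploy service module):

```python
MAX_LOG_BYTES = 500 * 1024  # 500 KB cap from the schema comment


def append_log(existing: str, line: str) -> str:
    """Append a log line, trimming the *head* so the tail survives."""
    combined = existing + line + "\n"
    data = combined.encode("utf-8")
    if len(data) > MAX_LOG_BYTES:
        # Drop the oldest bytes; decode defensively in case the cut
        # lands mid multi-byte character.
        data = data[-MAX_LOG_BYTES:]
        combined = data.decode("utf-8", errors="ignore")
    return combined
```

The same slicing approach serves the log_tail field: returning the last 4 KB is `streaming_log[-4096:]` on the stored text.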
Stored as a Python constant (not in the DB) in the deploy service module:
SURFACE_WORKFLOW_MAP = {
"api-prod": ("deploy-heroku.yml", "production"),
"api-staging": ("deploy-heroku.yml", "staging"),
"console-prod": ("deploy-console.yml", "production"),
"console-staging": ("deploy-console.yml", "staging"),
"getraxx": ("deploy-customer-docs.yml", "production"),
"raxx-mockups": ("deploy-customer-docs.yml", "production"),
}
Surfaces not in this map (e.g., vault, raxx-app-previews) do not show a
Deploy button.
POST /api/internal/deploys

Auth: session cookie; ops or superadmin role required.
Request body:
{
"surface_id": "console-prod",
"target_ref": "main",
"idempotency_key": "550e8400-e29b-41d4-a716-446655440000"
}
target_env is derived from SURFACE_WORKFLOW_MAP — the caller does not set it
directly. target_ref defaults to "main" if omitted.
Behavior:
1. Validate surface_id exists in SURFACE_WORKFLOW_MAP. Return 401 if the
session is unauthenticated, 403 if the role check fails, 422 if the surface
is not deployable.
2. Check idempotency: if a row exists with idempotency_key and status is not
failed/timed_out, return 200 with the existing row's id and
status_url. Prevents double-click.
3. Rate-limit check: if 5 or more non-terminal deploys for this surface_id
exist in the last 60 minutes, return 429.
4. Insert console_deploys row with status: requested.
5. Write audit_log row (console.deploy.intent); store id in
audit_correlation_id.
6. Call GitHub workflow_dispatch API:
POST https://api.github.com/repos/{owner}/{repo}/actions/workflows/{workflow}/dispatches
Authorization: Bearer $GITHUB_DISPATCH_TOKEN
{
"ref": "<target_ref>",
"inputs": {
"environment": "<target_env>",
"console_deploy_id": "<deploy_row_id>"
}
}
7. On 204 from GitHub: update row to status: dispatched. Poll
GET /repos/{owner}/{repo}/actions/runs?event=workflow_dispatch for up to 30
seconds (3 polls × 10s) to capture the run_id; store it if found.
8. Return 201:
{
"id": "<deploy_row_id>",
"status": "dispatched",
"status_url": "/api/internal/deploys/<id>"
}
On GitHub API error: update row to status: failed, failure_reason:
"github_dispatch_failed: <http_status>". Return 502.
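The dispatch call in step 6 can be sketched with the standard library (a sketch, not the implementation; the owner/repo values echo OQ-4, and the helper name is hypothetical):

```python
import json
import os
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner: str, repo: str, workflow: str,
                           target_ref: str, target_env: str,
                           deploy_row_id: str) -> urllib.request.Request:
    """Build the workflow_dispatch POST. The token is read from the
    environment at request time (never stored), per the rotation constraint."""
    url = (f"{GITHUB_API}/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow}/dispatches")
    body = json.dumps({
        "ref": target_ref,
        "inputs": {
            "environment": target_env,
            "console_deploy_id": deploy_row_id,
        },
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_DISPATCH_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
    )
```

Note that a 204 from this endpoint carries no run ID, which is why step 7 has to poll the runs listing afterwards.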
GET /api/internal/deploys/<id>

Auth: session cookie; any authenticated role.
Response:
{
"id": "<uuid>",
"surface_id": "console-prod",
"target_env": "production",
"target_ref": "main",
"status": "building",
"github_run_url": "https://github.com/…/actions/runs/12345",
"last_status_at_utc": "2026-04-30T14:35:10Z",
"log_tail": "…last 50 lines of streaming_log…",
"failure_reason": null
}
log_tail is the last 4 KB of streaming_log — the endpoint does not return
the full log to avoid oversized responses.
POST /api/internal/deploys/<id>/status

Auth: HMAC signature on request body. Header:
X-Raxx-Deploy-Signature: sha256=<hex_digest>.
Verification:
expected = HMAC-SHA256(DEPLOY_CALLBACK_HMAC_SECRET, raw_request_body)
compare(expected, header_value, constant_time=True)
Return 401 if verification fails; log security event to audit_log.
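The verification above maps directly onto the standard library (a sketch; the function name is hypothetical):

```python
import hashlib
import hmac
import os


def verify_callback_signature(raw_body: bytes, header_value: str) -> bool:
    """Verify X-Raxx-Deploy-Signature: sha256=<hex_digest> against the raw body."""
    # Secret is read from the environment at request time, per the
    # rotate-without-redeploy constraint.
    secret = os.environ["DEPLOY_CALLBACK_HMAC_SECRET"].encode("utf-8")
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest provides the constant-time comparison the spec requires
    return hmac.compare_digest(expected, header_value)
```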
Request body:
{
"status": "building",
"log_line": "Step: Install Heroku CLI — started",
"failure_reason": null
}
status must be one of: building, deploying, succeeded, failed.
Behavior:
1. Look up deploy row by id. 404 if not found.
2. Validate status transition is forward-only (no backwards transitions from
terminal states).
3. Append log_line (with UTC timestamp prefix) to streaming_log. Enforce
500 KB cap.
4. Update status, last_status_at_utc.
5. Write audit_log row (console.deploy.callback).
6. Return 204.
requested → dispatched → building → deploying → succeeded
    ↓            ↓           ↓           ↓
  failed       failed      failed      failed

dispatched → timed_out (reconciler promotes after 30 min with no callback)
Terminal states: succeeded, failed, timed_out.
stateDiagram-v2
[*] --> requested : POST /api/internal/deploys (intent created)
requested --> dispatched : GitHub workflow_dispatch 204
requested --> failed : GitHub API error
dispatched --> building : callback status=building
dispatched --> timed_out : reconciler: no callback in 30 min
dispatched --> failed : callback status=failed
building --> deploying : callback status=deploying
building --> failed : callback status=failed
deploying --> succeeded : callback status=succeeded
deploying --> failed : callback status=failed
succeeded --> [*]
failed --> [*]
timed_out --> [*]
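The forward-only transition check in the callback handler (behavior step 2) can be expressed as a lookup table mirroring the diagram above (a sketch; names hypothetical):

```python
# Allowed forward transitions, one entry per state in the diagram
ALLOWED_TRANSITIONS = {
    "requested": {"dispatched", "failed"},
    "dispatched": {"building", "failed", "timed_out"},
    "building": {"deploying", "failed"},
    "deploying": {"succeeded", "failed"},
    # Terminal states allow no further transitions
    "succeeded": set(),
    "failed": set(),
    "timed_out": set(),
}


def is_valid_transition(current: str, new: str) -> bool:
    return new in ALLOWED_TRANSITIONS.get(current, set())
```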
workflow_dispatch input

Each of the four deploy workflows gains one new optional input:
console_deploy_id:
description: "console_deploys row ID (empty when triggered manually)"
required: false
default: ""
This input is passed through the entire workflow via env or outputs so the
notify action can reference it.
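One way to thread the input through is a workflow-level env mapping (a sketch; only the console_deploy_id input itself is specified above):

```yaml
on:
  workflow_dispatch:
    inputs:
      console_deploy_id:
        description: "console_deploys row ID (empty when triggered manually)"
        required: false
        default: ""

env:
  # Visible to every job and step, so the notify action can reference it
  CONSOLE_DEPLOY_ID: ${{ inputs.console_deploy_id }}
```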
.github/actions/notify-deploy-status composite action

Inputs:
- console_deploy_id — the deploy row UUID (may be empty)
- status — building | deploying | succeeded | failed
- log_line — one-line message to append
- failure_reason — only set when status: failed
- console_url — base URL of the console (default: https://console.raxx.app)
- hmac_secret — the DEPLOY_CALLBACK_HMAC_SECRET value
Behavior:
- name: Notify deploy status
  if: inputs.console_deploy_id != ''
  run: |
    # Build the body with jq so log_line and failure_reason are JSON-escaped;
    # a raw printf emits invalid JSON when the message contains quotes.
    BODY=$(jq -cn \
      --arg status "$STATUS" \
      --arg log_line "$LOG_LINE" \
      --arg reason "${FAILURE_REASON:-}" \
      '{status: $status, log_line: $log_line,
        failure_reason: (if $reason == "" then null else $reason end)}')
    SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$HMAC_SECRET" -hex | awk '{print $2}')
    curl -sS -X POST \
      -H "Content-Type: application/json" \
      -H "X-Raxx-Deploy-Signature: sha256=$SIG" \
      --max-time 10 \
      "$CONSOLE_URL/api/internal/deploys/$CONSOLE_DEPLOY_ID/status" \
      -d "$BODY" || true
The || true ensures a callback failure does not abort the workflow. The deploy
is the primary job; the callback is observability.
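A workflow step invoking the composite action might look like this (a sketch; the secret name follows the security section of this spec, other values are illustrative):

```yaml
- name: Notify console — build started
  if: inputs.console_deploy_id != ''
  uses: ./.github/actions/notify-deploy-status
  with:
    console_deploy_id: ${{ inputs.console_deploy_id }}
    status: building
    log_line: "Deploy job started for console (production)"
    hmac_secret: ${{ secrets.DEPLOY_CALLBACK_HMAC_SECRET }}
```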
Each deploy workflow inserts notify steps at these points:

| Point | Status | Log line |
|---|---|---|
| Deploy job start | building | "Deploy job started for <app> (<env>)" |
| After Heroku push succeeds | deploying | "Code pushed to Heroku. Awaiting dyno restart." |
| Health check passes | succeeded | "Health check passed. <app_url>/health → 200" |
| Health check fails | failed | "Health check failed after 5 retries." |
| Any job failure (catch) | failed | "Workflow step failed: <step_name>" |
This section is spec for ux-designer; implementation is for feature-developer.
Each site tile in the status grid (dashboard/_status_grid.html) gains a Deploy
button. The button:
- Is only rendered when the operator session has ops or superadmin role.
- Is only rendered when the tile's surface_id exists in SURFACE_WORKFLOW_MAP.
- Shows a lock icon and tooltip "Deploy frozen" when the deploy freeze flag is
active (deploy_freeze.is_frozen()).
Clicking Deploy opens a modal with:
1. Surface name + target environment (derived from SURFACE_WORKFLOW_MAP).
2. Red/purple env banner consistent with ADR-0024 env-switcher styling.
3. Text input: "Type deploy <surface> to <env> to confirm" where the phrase
is generated from the surface/env. Examples:
- Production: deploy console-prod to production
- Staging: deploy api-staging to staging
4. Target ref field (default: main; editable by operator).
5. Confirm button (disabled until phrase matches exactly).
6. On confirm: POST to /api/internal/deploys with a client-side-generated
idempotency key (UUID). On a 201 response, the modal switches to the
live-status view.
The phrase must be generated dynamically from surface_id and target_env —
not hardcoded — so adding a new surface does not require a modal code change.
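A minimal sketch of the dynamic phrase generation (function name hypothetical; the pattern itself is the one proposed in OQ-1 and may change):

```python
def confirm_phrase(surface_id: str, target_env: str) -> str:
    """Generate the type-to-confirm phrase from the deploy target.

    Derived from surface/env so new surfaces need no modal code change.
    """
    return f"deploy {surface_id} to {target_env}"
```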
After dispatch, the modal body becomes:
- Status badge (matches deploy row status, styled per ADR-0030 state vocabulary).
- GitHub run link (when github_run_url is populated).
- Log tail (last ~30 lines, monospace, auto-scrolled to bottom).
- Polling: GET /api/internal/deploys/<id> every 2 seconds until terminal state.
- On succeeded: green banner, "Close" button.
- On failed/timed_out: red banner, failure reason if present, "Close" and
"View run on GitHub" buttons.
| Failure | Detection | Response |
|---|---|---|
| GitHub API 5xx on dispatch | Immediate | Row set to failed; UI shows error; operator uses break-glass |
| Workflow exits before any callback (GH outage, runner failure) | Reconciler: polls /runs/<id> every 60s | Promotes to succeeded/failed based on GH conclusion; if run not found after 30 min, promotes to timed_out |
| Callback HMAC verification fails | Immediate (401) | Security event logged to audit_log (action: console.deploy.callback.auth_fail); row unchanged |
| Callback for unknown deploy ID | Immediate (404) | Logged; no action |
| Duplicate dispatch (double-click) | idempotency_key UNIQUE constraint | 200 returns existing row; no second dispatch |
| Rate limit exceeded (5/hour/surface) | Count check on insert | 429 returned; no row created |
| Deploy row stuck in dispatched > 30 min | Reconciler | Promoted to timed_out; failure_reason: "reconciler: no callback received in 30 min" |
| streaming_log exceeds 500 KB | Per-append cap | Oldest bytes truncated (head trimmed, tail preserved) |
A background thread in the console process (alongside the existing status poller thread):
- Scans console_deploys for rows in dispatched/building/deploying states with
last_status_at_utc older than 5 minutes.
- For rows with a github_run_id, calls
GET https://api.github.com/repos/{owner}/{repo}/actions/runs/{run_id} and
maps conclusion → deploy status: success → succeeded;
failure / cancelled / timed_out → failed; null (still running) → no change.
- For rows without a github_run_id (dispatch succeeded but run ID not yet
captured) that are older than 30 minutes: promotes to timed_out.
- Each promotion writes an audit_log row (console.deploy.reconciler).

console/migrations/versions/<timestamp>_add_console_deploys.py
Creates console_deploys table and three indexes. No existing tables modified.
Safe to run on an active console deployment; table is new, no locking concern.
Rollback: drop console_deploys and its indexes. No foreign keys from other tables point at
console_deploys (the audit_correlation_id column is a soft reference by
value, not a DB-enforced FK, for the same reason audit_log avoids FKs on
its referencing columns — audit entries outlive their subjects).
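The up/down path can be exercised against a scratch SQLite database (a sketch with an abbreviated column set; the real migration is the Alembic-style revision file named above and uses the full CREATE TABLE from the schema section):

```python
import sqlite3

# Abbreviated schema for illustration; see the full CREATE TABLE above
UP = """
CREATE TABLE console_deploys (
    id TEXT PRIMARY KEY,
    surface_id TEXT NOT NULL,
    idempotency_key TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'requested',
    requested_at_utc DATETIME NOT NULL
);
CREATE INDEX ix_console_deploys_surface_status
    ON console_deploys (surface_id, status, requested_at_utc DESC);
"""
DOWN = """
DROP INDEX ix_console_deploys_surface_status;
DROP TABLE console_deploys;
"""


def table_exists(conn: sqlite3.Connection) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name='console_deploys'"
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.executescript(UP)      # upgrade: new table + index, no existing tables touched
assert table_exists(conn)
conn.executescript(DOWN)    # downgrade: clean drop, since no FKs point here
assert not table_exists(conn)
```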
The migration must land before any of the API endpoints are deployed. The sub-card order enforces this: DB migration is sub-card #1.
| Phase | Gate | What ships |
|---|---|---|
| Dark — migration only | Migration runs clean on staging | console_deploys table exists; no UI yet |
| Alpha — API + action | Sub-cards #1–5 merged | Endpoints live behind feature flag FLAG_CONSOLE_DEPLOY_UI=false; action available for manual testing |
| Beta — workflow integration | Sub-cards #6–7 merged | Notify steps active in all 4 workflows; reconciler running |
| GA — UI | Sub-cards #8–11 merged; ux-designer mockup reviewed | Deploy button visible to ops+ role; flag removed |
Feature flag: FLAG_CONSOLE_DEPLOY_UI (env var, default false). When false:
Deploy button is not rendered; API endpoints return 501. This allows the
back-end to be deployed independently of the UI.
PII: requested_by stores the operator email address. Follows existing
audit_log PII posture: 2-year retention, DSR erasure replaces email with
[redacted] tombstone.
Retention: console_deploys rows retained indefinitely for operational
history (no user PII beyond operator email). Purge policy: rows older than 2
years may be archived. Feature-developer should add an index on
requested_at_utc to support the purge query.
DSR: If an operator account is deleted, requested_by is tombstoned to
[redacted:<admin_id_hash>] by the existing account erasure path.
audit_correlation_id rows in audit_log follow the same redaction path.
Credential storage:
- GITHUB_DISPATCH_TOKEN — env var, sourced from Infisical at boot. Never
logged. Rotatable without redeploy.
- DEPLOY_CALLBACK_HMAC_SECRET — env var on console (Infisical) and GitHub
Environment secret. Must match. Rotating requires updating both locations.
Runbook sub-card covers rotation procedure.
Callback security: HMAC-SHA256 with constant_time_compare. The secret
is shared between the console and the four deploy workflows. If compromised,
rotate in Infisical + all four GitHub Environment secrets simultaneously. A
compromised secret allows a third party to inject fake status updates into
deploy rows — it does not allow triggering deploys or accessing secrets.
Rate limiting: 5 deploys per surface per hour prevents runaway dispatch loops. Limit is enforced server-side; no client-side trust.
GitHub dispatch token scope: GITHUB_DISPATCH_TOKEN needs only the
actions:write scope on this repository. When GitHub Apps (#335, #336) land,
this becomes a fine-grained app installation token scoped to a single
repository and a single permission. Until then, the PAT scope must be
documented and audited.
Breach: A breach of console_deploys exposes deploy history (operator
email, target SHAs, log output). Log output may contain workflow step names and
error messages but must not contain secrets: the notify action must never echo
secret values into log_line. Breach notification follows the platform GDPR
breach path; notify Kristerpher within 72 hours of discovery.
Kill-switch: DEPLOY_FREEZE_OVERRIDE=1 (existing mechanism from ADR-0028)
gates all console-triggered deploys. The freeze check runs before
workflow_dispatch is called.
These block sub-card claims until resolved:
OQ-1 (blocks #8, ux-designer card #11): What is the exact confirmation
phrase pattern for each surface? The spec proposes
deploy <surface_id> to <target_env>. Confirm this is the right shape, or
specify alternatives (e.g., deploy api to production).
OQ-2 (blocks #6): The four workflows in scope are deploy-heroku.yml,
deploy-console.yml, deploy-customer-docs.yml, deploy-status-page.yml.
deploy-status-worker.yml is out of scope for v1 — confirm this is correct
or add it to scope.
OQ-3 (blocks #9): DEPLOY_CALLBACK_HMAC_SECRET must be provisioned in
Infisical before the console can verify callbacks. Who provisions it and at what
path? Proposed: /Console/prod/DEPLOY_CALLBACK_HMAC_SECRET and
/Console/staging/DEPLOY_CALLBACK_HMAC_SECRET. Confirm path convention with
vault operator.
OQ-4 (blocks #2): GITHUB_DISPATCH_TOKEN scope and owner. The token must
have actions:write on raxx-app/TradeMasterAPI. Is the existing agent PAT
adequate, or does a dedicated service account token need to be provisioned?
OQ-5 (blocks #7): The reconciler polls api.github.com every 60 seconds
for active dispatches. At current deploy cadence this is minimal. If more than
~20 deploys are active simultaneously, the reconciler should batch calls or use
a GitHub App token with higher rate limits. Accept this as a v1 limitation?