Status: Proposed
Owner: software-architect
Date: 2026-04-30 UTC
Related ADRs: ADR-0034, ADR-0028, ADR-0030
Refs: Kristerpher directive 2026-04-30 UTC, PR #675 (deploy-console.yml), PR #677 (scripts/ops/deploy-console.sh)
The console (console.raxx.app) already shows a health status grid for every
Raxx surface via SURFACE_REGISTRY_BY_ENV in
console/app/services/status_poller.py. Each tile shows operational state but
has no deploy controls. To deploy a surface today, an operator must open a
terminal and run gh workflow run deploy-heroku.yml ... manually.
This design adds a Deploy button to each tile, a type-to-confirm modal (per
ADR-0028), live deploy status feedback within the console, and a structured
console_deploys table as the canonical deploy intent record. GitHub workflows
post lifecycle callbacks to the console via a new webhook endpoint.
The existing workflows (deploy-console.yml, deploy-heroku.yml,
deploy-customer-docs.yml, deploy-status-page.yml) and break-glass scripts
(scripts/ops/deploy-console.sh) are not replaced — they are the execution
engines. The console becomes the preferred initiation and observability layer
on top of those existing engines.
All platform invariants apply. Deploy-flow-specific constraints:
No stored credentials. The GitHub dispatch token and HMAC secret live in vault (Infisical) and are injected at runtime. Neither appears in code or schema.
ADR-0028 friction preserved. The Deploy button does not skip the
type-to-confirm gate. The modal requires the operator to type the exact phrase
before POST /api/internal/deploys fires.
Audit trail. Every deploy intent writes an audit_log row at creation
(console.deploy.intent) and every callback writes another
(console.deploy.callback). No deploy event is unlogged.
Break-glass paths unchanged. Manual gh workflow run and
scripts/ops/deploy-console.sh continue to function. Workflows gracefully
skip the callback step when console_deploy_id is absent.
Secrets rotatable without redeploy. DEPLOY_CALLBACK_HMAC_SECRET and
GITHUB_DISPATCH_TOKEN are read from environment variables at request time,
not baked into the app binary.
Kill-switch is unchanged. Heroku releases:rollback remains the fastest
rollback path and requires no CI or console UI action.
console_deploys table

CREATE TABLE console_deploys (
id TEXT PRIMARY KEY, -- UUID
surface_id TEXT NOT NULL, -- FK to SURFACE_REGISTRY id
target_env TEXT NOT NULL, -- 'staging' | 'production'
target_ref TEXT NOT NULL, -- git SHA or branch name
requested_by TEXT NOT NULL, -- operator email (RAXX_OPERATOR_EMAIL)
requested_at_utc DATETIME NOT NULL,
idempotency_key TEXT NOT NULL UNIQUE, -- UUID; prevents double-click
status TEXT NOT NULL DEFAULT 'requested',
-- requested | dispatched | building
-- | deploying | succeeded
-- | failed | timed_out
github_run_id TEXT, -- populated after dispatch
github_run_url TEXT, -- derived from run_id
last_status_at_utc DATETIME,
streaming_log TEXT DEFAULT '', -- appended; capped at 500 KB
failure_reason TEXT, -- populated on failed | timed_out
audit_correlation_id TEXT, -- FK to audit_log.id (intent row)
workflow_name TEXT NOT NULL -- e.g. 'deploy-console.yml'
);
CREATE INDEX ix_console_deploys_surface_status
ON console_deploys (surface_id, status, requested_at_utc DESC);
CREATE INDEX ix_console_deploys_idempotency
ON console_deploys (idempotency_key);
CREATE INDEX ix_console_deploys_github_run_id
ON console_deploys (github_run_id)
WHERE github_run_id IS NOT NULL;
streaming_log is append-only from the server side. Feature-developer must
enforce the 500 KB cap on each append (truncate oldest bytes if needed, not
newest — preserve the tail).
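A minimal sketch of the append-with-cap logic (function and constant names are hypothetical; the real implementation lives in the deploy service module):

```python
MAX_LOG_BYTES = 500 * 1024  # 500 KB cap from the schema comment


def append_log(existing: str, line: str) -> str:
    """Append a log line, trimming the *head* so the tail survives."""
    combined = existing + line + "\n"
    data = combined.encode("utf-8")
    if len(data) > MAX_LOG_BYTES:
        # Drop the oldest bytes; decode defensively in case the cut
        # lands mid multi-byte character.
        data = data[-MAX_LOG_BYTES:]
        combined = data.decode("utf-8", errors="ignore")
    return combined
```

The same slicing approach serves the log_tail field: returning the last 4 KB is `streaming_log[-4096:]` on the stored text.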
Stored as a Python constant (not in the DB) in the deploy service module:
SURFACE_WORKFLOW_MAP = {
"api-prod": ("deploy-heroku.yml", "production"),
"api-staging": ("deploy-heroku.yml", "staging"),
"console-prod": ("deploy-console.yml", "production"),
"console-staging": ("deploy-console.yml", "staging"),
"getraxx": ("deploy-customer-docs.yml", "production"),
"raxx-mockups": ("deploy-customer-docs.yml", "production"),
}
Surfaces not in this map (e.g., vault, raxx-app-previews) do not show a
Deploy button.
POST /api/internal/deploys

Auth: session cookie; ops or superadmin role required.
Request body:
{
"surface_id": "console-prod",
"target_ref": "main",
"idempotency_key": "550e8400-e29b-41d4-a716-446655440000"
}
target_env is derived from SURFACE_WORKFLOW_MAP — the caller does not set it
directly. target_ref defaults to "main" if omitted.
Behavior:
1. Validate surface_id exists in SURFACE_WORKFLOW_MAP. Return 401 if the
session is unauthenticated, 403 if the role check fails, 422 if the surface
is not deployable.
2. Check idempotency: if a row exists with idempotency_key and status is not
failed/timed_out, return 200 with the existing row's id and
status_url. Prevents double-click.
3. Rate-limit check: if 5 or more non-terminal deploys for this surface_id
exist in the last 60 minutes, return 429.
4. Insert console_deploys row with status: requested.
5. Write audit_log row (console.deploy.intent); store id in
audit_correlation_id.
6. Call GitHub workflow_dispatch API:
POST https://api.github.com/repos/{owner}/{repo}/actions/workflows/{workflow}/dispatches
Authorization: Bearer $GITHUB_DISPATCH_TOKEN
{
"ref": "<target_ref>",
"inputs": {
"environment": "<target_env>",
"console_deploy_id": "<deploy_row_id>"
}
}
7. On 204 from GitHub: update row to status: dispatched. Poll
GET /repos/{owner}/{repo}/actions/runs?event=workflow_dispatch for up to 30
seconds (3 polls × 10s) to capture the run_id; store it if found.
8. Return 201:
{
"id": "<deploy_row_id>",
"status": "dispatched",
"status_url": "/api/internal/deploys/<id>"
}
On GitHub API error: update row to status: failed, failure_reason:
"github_dispatch_failed: <http_status>". Return 502.
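The dispatch call in step 6 can be sketched with the standard library (a sketch, not the implementation; the owner/repo values echo OQ-4, and the helper name is hypothetical):

```python
import json
import os
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner: str, repo: str, workflow: str,
                           target_ref: str, target_env: str,
                           deploy_row_id: str) -> urllib.request.Request:
    """Build the workflow_dispatch POST. The token is read from the
    environment at request time (never stored), per the rotation constraint."""
    url = (f"{GITHUB_API}/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow}/dispatches")
    body = json.dumps({
        "ref": target_ref,
        "inputs": {
            "environment": target_env,
            "console_deploy_id": deploy_row_id,
        },
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_DISPATCH_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
    )
```

Note that a 204 from this endpoint carries no run ID, which is why step 7 has to poll the runs listing afterwards.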
GET /api/internal/deploys/<id>

Auth: session cookie; any authenticated role.
Response:
{
"id": "<uuid>",
"surface_id": "console-prod",
"target_env": "production",
"target_ref": "main",
"status": "building",
"github_run_url": "https://github.com/…/actions/runs/12345",
"last_status_at_utc": "2026-04-30T14:35:10Z",
"log_tail": "…last 50 lines of streaming_log…",
"failure_reason": null
}
log_tail is the last 4 KB of streaming_log — the endpoint does not return
the full log to avoid oversized responses.
POST /api/internal/deploys/<id>/status

Auth: HMAC signature on request body. Header:
X-Raxx-Deploy-Signature: sha256=<hex_digest>.
Verification:
expected = HMAC-SHA256(DEPLOY_CALLBACK_HMAC_SECRET, raw_request_body)
compare(expected, header_value, constant_time=True)
Return 401 if verification fails; log security event to audit_log.
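The verification above maps directly onto the standard library (a sketch; the function name is hypothetical):

```python
import hashlib
import hmac
import os


def verify_callback_signature(raw_body: bytes, header_value: str) -> bool:
    """Verify X-Raxx-Deploy-Signature: sha256=<hex_digest> against the raw body."""
    # Secret is read from the environment at request time, per the
    # rotate-without-redeploy constraint.
    secret = os.environ["DEPLOY_CALLBACK_HMAC_SECRET"].encode("utf-8")
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest provides the constant-time comparison the spec requires
    return hmac.compare_digest(expected, header_value)
```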
Request body:
{
"status": "building",
"log_line": "Step: Install Heroku CLI — started",
"failure_reason": null
}
status must be one of: building, deploying, succeeded, failed.
Behavior:
1. Look up deploy row by id. 404 if not found.
2. Validate status transition is forward-only (no backwards transitions from
terminal states).
3. Append log_line (with UTC timestamp prefix) to streaming_log. Enforce
500 KB cap.
4. Update status, last_status_at_utc.
5. Write audit_log row (console.deploy.callback).
6. Return 204.
requested → dispatched → building → deploying → succeeded
    ↓            ↓           ↓           ↓
  failed       failed      failed      failed

dispatched → timed_out (reconciler promotes after 30 min with no callback)
Terminal states: succeeded, failed, timed_out.
stateDiagram-v2
[*] --> requested : POST /api/internal/deploys (intent created)
requested --> dispatched : GitHub workflow_dispatch 204
requested --> failed : GitHub API error
dispatched --> building : callback status=building
dispatched --> timed_out : reconciler: no callback in 30 min
dispatched --> failed : callback status=failed
building --> deploying : callback status=deploying
building --> failed : callback status=failed
deploying --> succeeded : callback status=succeeded
deploying --> failed : callback status=failed
succeeded --> [*]
failed --> [*]
timed_out --> [*]
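The forward-only transition check in the callback handler (behavior step 2) can be expressed as a lookup table mirroring the diagram above (a sketch; names hypothetical):

```python
# Allowed forward transitions, one entry per state in the diagram
ALLOWED_TRANSITIONS = {
    "requested": {"dispatched", "failed"},
    "dispatched": {"building", "failed", "timed_out"},
    "building": {"deploying", "failed"},
    "deploying": {"succeeded", "failed"},
    # Terminal states allow no further transitions
    "succeeded": set(),
    "failed": set(),
    "timed_out": set(),
}


def is_valid_transition(current: str, new: str) -> bool:
    return new in ALLOWED_TRANSITIONS.get(current, set())
```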
workflow_dispatch input

Each of the four deploy workflows gains one new optional input:
console_deploy_id:
description: "console_deploys row ID (empty when triggered manually)"
required: false
default: ""
This input is passed through the entire workflow via env or outputs so the
notify action can reference it.
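One way to thread the input through is a workflow-level env mapping (a sketch; only the console_deploy_id input itself is specified above):

```yaml
on:
  workflow_dispatch:
    inputs:
      console_deploy_id:
        description: "console_deploys row ID (empty when triggered manually)"
        required: false
        default: ""

env:
  # Visible to every job and step, so the notify action can reference it
  CONSOLE_DEPLOY_ID: ${{ inputs.console_deploy_id }}
```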
.github/actions/notify-deploy-status composite action

Inputs:
- console_deploy_id — the deploy row UUID (may be empty)
- status — building | deploying | succeeded | failed
- log_line — one-line message to append
- failure_reason — only set when status: failed
- console_url — base URL of the console (default: https://console.raxx.app)
- hmac_secret — the DEPLOY_CALLBACK_HMAC_SECRET value
Behavior:
- name: Notify deploy status
  if: inputs.console_deploy_id != ''
  run: |
    # Build the body with jq so log_line and failure_reason are JSON-escaped;
    # a raw printf emits invalid JSON when the message contains quotes.
    BODY=$(jq -cn \
      --arg status "$STATUS" \
      --arg log_line "$LOG_LINE" \
      --arg reason "${FAILURE_REASON:-}" \
      '{status: $status, log_line: $log_line,
        failure_reason: (if $reason == "" then null else $reason end)}')
    SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$HMAC_SECRET" -hex | awk '{print $2}')
    curl -sS -X POST \
      -H "Content-Type: application/json" \
      -H "X-Raxx-Deploy-Signature: sha256=$SIG" \
      --max-time 10 \
      "$CONSOLE_URL/api/internal/deploys/$CONSOLE_DEPLOY_ID/status" \
      -d "$BODY" || true
The || true ensures a callback failure does not abort the workflow. The deploy
is the primary job; the callback is observability.
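A workflow step invoking the composite action might look like this (a sketch; the secret name follows the security section of this spec, other values are illustrative):

```yaml
- name: Notify console — build started
  if: inputs.console_deploy_id != ''
  uses: ./.github/actions/notify-deploy-status
  with:
    console_deploy_id: ${{ inputs.console_deploy_id }}
    status: building
    log_line: "Deploy job started for console (production)"
    hmac_secret: ${{ secrets.DEPLOY_CALLBACK_HMAC_SECRET }}
```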
Each deploy workflow inserts notify steps at these points:

| Point | Status | Log line |
|---|---|---|
| Deploy job start | building | "Deploy job started for <app> (<env>)" |
| After Heroku push succeeds | deploying | "Code pushed to Heroku. Awaiting dyno restart." |
| Health check passes | succeeded | "Health check passed. <app_url>/health → 200" |
| Health check fails | failed | "Health check failed after 5 retries." |
| Any job failure (catch) | failed | "Workflow step failed: <step_name>" |
This section is spec for ux-designer; implementation is for feature-developer.
Each site tile in the status grid (dashboard/_status_grid.html) gains a Deploy
button. The button:
- Is only rendered when the operator session has ops or superadmin role.
- Is only rendered when the tile's surface_id exists in SURFACE_WORKFLOW_MAP.
- Shows a lock icon and tooltip "Deploy frozen" when the deploy freeze flag is
active (deploy_freeze.is_frozen()).
Clicking Deploy opens a modal with:
1. Surface name + target environment (derived from SURFACE_WORKFLOW_MAP).
2. Red/purple env banner consistent with ADR-0024 env-switcher styling.
3. Text input: "Type deploy <surface> to <env> to confirm" where the phrase
is generated from the surface/env. Examples:
- Production: deploy console-prod to production
- Staging: deploy api-staging to staging
4. Target ref field (default: main; editable by operator).
5. Confirm button (disabled until phrase matches exactly).
6. On confirm: POST to /api/internal/deploys with a client-side-generated
idempotency key (UUID). On a 201 response, the modal switches to the
live-status view.
The phrase must be generated dynamically from surface_id and target_env —
not hardcoded — so adding a new surface does not require a modal code change.
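A minimal sketch of the dynamic phrase generation (function name hypothetical; the pattern itself is the one proposed in OQ-1 and may change):

```python
def confirm_phrase(surface_id: str, target_env: str) -> str:
    """Generate the type-to-confirm phrase from the deploy target.

    Derived from surface/env so new surfaces need no modal code change.
    """
    return f"deploy {surface_id} to {target_env}"
```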
After dispatch, the modal body becomes:
- Status badge (matches deploy row status, styled per ADR-0030 state vocabulary).
- GitHub run link (when github_run_url is populated).
- Log tail (last ~30 lines, monospace, auto-scrolled to bottom).
- Polling: GET /api/internal/deploys/<id> every 2 seconds until terminal state.
- On succeeded: green banner, "Close" button.
- On failed/timed_out: red banner, failure reason if present, "Close" and
"View run on GitHub" buttons.
| Failure | Detection | Response |
|---|---|---|
| GitHub API 5xx on dispatch | Immediate | Row set to failed; UI shows error; operator uses break-glass |
| Workflow exits before any callback (GH outage, runner failure) | Reconciler: polls /runs/<id> every 60s | Promotes to succeeded/failed based on GH conclusion; if run not found after 30 min, promotes to timed_out |
| Callback HMAC verification fails | Immediate (401) | Security event logged to audit_log (action: console.deploy.callback.auth_fail); row unchanged |
| Callback for unknown deploy ID | Immediate (404) | Logged; no action |
| Duplicate dispatch (double-click) | idempotency_key UNIQUE constraint | 200 returns existing row; no second dispatch |
| Rate limit exceeded (5/hour/surface) | Count check on insert | 429 returned; no row created |
| Deploy row stuck in dispatched > 30 min | Reconciler | Promoted to timed_out; failure_reason: "reconciler: no callback received in 30 min" |
| streaming_log exceeds 500 KB | Per-append cap | Oldest bytes truncated (head trimmed, tail preserved) |
A background thread in the console process (alongside the existing status poller thread):
- Scans console_deploys for rows in dispatched/building/deploying states with
last_status_at_utc older than 5 minutes.
- For rows with a github_run_id, calls
GET https://api.github.com/repos/{owner}/{repo}/actions/runs/{run_id} and
maps conclusion → deploy status: success → succeeded;
failure / cancelled / timed_out → failed; null (still running) → no change.
- For rows without a github_run_id (dispatch succeeded but run ID not yet
captured) that are older than 30 minutes: promotes to timed_out.
- Each promotion writes an audit_log row (console.deploy.reconciler).

console/migrations/versions/<timestamp>_add_console_deploys.py
Creates console_deploys table and three indexes. No existing tables modified.
Safe to run on an active console deployment; table is new, no locking concern.
Rollback: drop console_deploys and its indexes. No foreign keys from other tables point at
console_deploys (the audit_correlation_id column is a soft reference by
value, not a DB-enforced FK, for the same reason audit_log avoids FKs on
its referencing columns — audit entries outlive their subjects).
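The up/down path can be exercised against a scratch SQLite database (a sketch with an abbreviated column set; the real migration is the Alembic-style revision file named above and uses the full CREATE TABLE from the schema section):

```python
import sqlite3

# Abbreviated schema for illustration; see the full CREATE TABLE above
UP = """
CREATE TABLE console_deploys (
    id TEXT PRIMARY KEY,
    surface_id TEXT NOT NULL,
    idempotency_key TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL DEFAULT 'requested',
    requested_at_utc DATETIME NOT NULL
);
CREATE INDEX ix_console_deploys_surface_status
    ON console_deploys (surface_id, status, requested_at_utc DESC);
"""
DOWN = """
DROP INDEX ix_console_deploys_surface_status;
DROP TABLE console_deploys;
"""


def table_exists(conn: sqlite3.Connection) -> bool:
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name='console_deploys'"
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.executescript(UP)      # upgrade: new table + index, no existing tables touched
assert table_exists(conn)
conn.executescript(DOWN)    # downgrade: clean drop, since no FKs point here
assert not table_exists(conn)
```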
The migration must land before any of the API endpoints are deployed. The sub-card order enforces this: DB migration is sub-card #1.
| Phase | Gate | What ships |
|---|---|---|
| Dark — migration only | Migration runs clean on staging | console_deploys table exists; no UI yet |
| Alpha — API + action | Sub-cards #1–5 merged | Endpoints live behind feature flag FLAG_CONSOLE_DEPLOY_UI=false; action available for manual testing |
| Beta — workflow integration | Sub-cards #6–7 merged | Notify steps active in all 4 workflows; reconciler running |
| GA — UI | Sub-cards #8–11 merged; ux-designer mockup reviewed | Deploy button visible to ops+ role; flag removed |
Feature flag: FLAG_CONSOLE_DEPLOY_UI (env var, default false). When false:
Deploy button is not rendered; API endpoints return 501. This allows the
back-end to be deployed independently of the UI.
PII: requested_by stores the operator email address. Follows existing
audit_log PII posture: 2-year retention, DSR erasure replaces email with
[redacted] tombstone.
Retention: console_deploys rows retained indefinitely for operational
history (no user PII beyond operator email). Purge policy: rows older than 2
years may be archived. Feature-developer should add an index on
requested_at_utc to support the purge query.
DSR: If an operator account is deleted, requested_by is tombstoned to
[redacted:<admin_id_hash>] by the existing account erasure path.
audit_correlation_id rows in audit_log follow the same redaction path.
Credential storage:
- GITHUB_DISPATCH_TOKEN — env var, sourced from Infisical at boot. Never
logged. Rotatable without redeploy.
- DEPLOY_CALLBACK_HMAC_SECRET — env var on console (Infisical) and GitHub
Environment secret. Must match. Rotating requires updating both locations.
Runbook sub-card covers rotation procedure.
Callback security: HMAC-SHA256 with constant_time_compare. The secret
is shared between the console and the four deploy workflows. If compromised,
rotate in Infisical + all four GitHub Environment secrets simultaneously. A
compromised secret allows a third party to inject fake status updates into
deploy rows — it does not allow triggering deploys or accessing secrets.
Rate limiting: 5 deploys per surface per hour prevents runaway dispatch loops. Limit is enforced server-side; no client-side trust.
GitHub dispatch token scope: GITHUB_DISPATCH_TOKEN needs only the
actions:write scope on this repository. When GitHub Apps (#335, #336) land,
this becomes a fine-grained app installation token scoped to a single
repository and a single permission. Until then, the PAT scope must be
documented and audited.
Breach: A breach of console_deploys exposes deploy history (operator
email, target SHAs, log output). Log output may contain workflow step names and
error messages but must not contain secrets: the notify action must never echo
secret values into log_line. Breach notification follows the platform GDPR
breach path; notify Kristerpher within 72 hours of discovery.
Kill-switch: DEPLOY_FREEZE_OVERRIDE=1 (existing mechanism from ADR-0028)
gates all console-triggered deploys. The freeze check runs before
workflow_dispatch is called.
These block sub-card claims until resolved:
OQ-1 (blocks #8, ux-designer card #11): What is the exact confirmation
phrase pattern for each surface? The spec proposes
deploy <surface_id> to <target_env>. Confirm this is the right shape, or
specify alternatives (e.g., deploy api to production).
OQ-2 (blocks #6): The four workflows in scope are deploy-heroku.yml,
deploy-console.yml, deploy-customer-docs.yml, deploy-status-page.yml.
deploy-status-worker.yml is out of scope for v1 — confirm this is correct
or add it to scope.
OQ-3 (blocks #9): DEPLOY_CALLBACK_HMAC_SECRET must be provisioned in
Infisical before the console can verify callbacks. Who provisions it and at what
path? Proposed: /Console/prod/DEPLOY_CALLBACK_HMAC_SECRET and
/Console/staging/DEPLOY_CALLBACK_HMAC_SECRET. Confirm path convention with
vault operator.
OQ-4 (blocks #2): GITHUB_DISPATCH_TOKEN scope and owner. The token must
have actions:write on raxx-app/TradeMasterAPI. Is the existing agent PAT
adequate, or does a dedicated service account token need to be provisioned?
OQ-5 (blocks #7): The reconciler polls api.github.com every 60 seconds
for active dispatches. At current deploy cadence this is minimal. If more than
~20 deploys are active simultaneously, the reconciler should batch calls or use
a GitHub App token with higher rate limits. Accept this as a v1 limitation?