Raxx · internal docs

Console-Driven Deploy Flow

Status: Proposed
Owner: software-architect
Date: 2026-04-30 UTC
Related ADRs: ADR-0034, ADR-0028, ADR-0030
Refs: Kristerpher directive 2026-04-30 UTC, PR #675 (deploy-console.yml), PR #677 (scripts/ops/deploy-console.sh)


1. Context

The console (console.raxx.app) already shows a health status grid for every Raxx surface via SURFACE_REGISTRY_BY_ENV in console/app/services/status_poller.py. Each tile shows operational state but has no deploy controls. To deploy a surface today, an operator must open a terminal and run gh workflow run deploy-heroku.yml ... manually.

This design adds a Deploy button to each tile, a type-to-confirm modal (per ADR-0028), live deploy status feedback within the console, and a structured console_deploys table as the canonical deploy intent record. GitHub workflows post lifecycle callbacks to the console via a new webhook endpoint.

The existing workflows (deploy-console.yml, deploy-heroku.yml, deploy-customer-docs.yml, deploy-status-page.yml) and break-glass scripts (scripts/ops/deploy-console.sh) are not replaced — they are the execution engines. The console becomes the preferred initiation and observability layer on top of those existing engines.


2. Invariants

All platform invariants apply. Deploy-flow-specific constraints:

  1. No stored credentials. The GitHub dispatch token and HMAC secret live in vault (Infisical) and are injected at runtime. Neither appears in code or schema.

  2. ADR-0028 friction preserved. The Deploy button does not skip the type-to-confirm gate. The modal requires the operator to type the exact phrase before POST /api/internal/deploys fires.

  3. Audit trail. Every deploy intent writes an audit_log row at creation (console.deploy.intent) and every callback writes another (console.deploy.callback). No deploy event is unlogged.

  4. Break-glass paths unchanged. Manual gh workflow run and scripts/ops/deploy-console.sh continue to function. Workflows gracefully skip the callback step when console_deploy_id is absent.

  5. Secrets rotatable without redeploy. DEPLOY_CALLBACK_HMAC_SECRET and GITHUB_DISPATCH_TOKEN are read from environment variables at request time, not baked into the app binary.

  6. Kill-switch is unchanged. Heroku releases:rollback remains the fastest rollback path and requires no CI or console UI action.


3. Data Model

3.1 console_deploys table

CREATE TABLE console_deploys (
    id                    TEXT PRIMARY KEY,          -- UUID
    surface_id            TEXT NOT NULL,             -- key in SURFACE_REGISTRY (Python constant, not a DB FK)
    target_env            TEXT NOT NULL,             -- 'staging' | 'production'
    target_ref            TEXT NOT NULL,             -- git SHA or branch name
    requested_by          TEXT NOT NULL,             -- operator email (RAXX_OPERATOR_EMAIL)
    requested_at_utc      DATETIME NOT NULL,
    idempotency_key       TEXT NOT NULL UNIQUE,      -- UUID; prevents double-click
    status                TEXT NOT NULL DEFAULT 'requested',
                                                     -- requested | dispatched | building
                                                     -- | deploying | succeeded
                                                     -- | failed | timed_out
    github_run_id         TEXT,                      -- populated after dispatch
    github_run_url        TEXT,                      -- derived from run_id
    last_status_at_utc    DATETIME,
    streaming_log         TEXT DEFAULT '',           -- appended; capped at 500 KB
    failure_reason        TEXT,                      -- populated on failed | timed_out
    audit_correlation_id  TEXT,                      -- soft reference to audit_log.id (intent row); not an enforced FK
    workflow_name         TEXT NOT NULL              -- e.g. 'deploy-console.yml'
);

CREATE INDEX ix_console_deploys_surface_status
    ON console_deploys (surface_id, status, requested_at_utc DESC);

CREATE INDEX ix_console_deploys_idempotency
    ON console_deploys (idempotency_key);

CREATE INDEX ix_console_deploys_github_run_id
    ON console_deploys (github_run_id)
    WHERE github_run_id IS NOT NULL;

streaming_log is append-only from the server side. Feature-developer must enforce the 500 KB cap on each append (truncate oldest bytes if needed, not newest — preserve the tail).
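The cap rule above can be sketched as a small helper. This is a minimal illustration, not the console's implementation: the name append_log is hypothetical, and it assumes "500 KB" means 500 × 1024 bytes measured on the UTF-8 encoding.

```python
MAX_LOG_BYTES = 500 * 1024  # assumption: 500 KB = 512,000 bytes of UTF-8


def append_log(existing: str, line: str, max_bytes: int = MAX_LOG_BYTES) -> str:
    """Append one line, then trim the head so the log stays within the cap."""
    combined = existing + line.rstrip("\n") + "\n"
    encoded = combined.encode("utf-8")
    if len(encoded) <= max_bytes:
        return combined
    # Keep the newest bytes (the tail). errors="ignore" drops a multi-byte
    # character that the cut may have split at the new start of the log.
    return encoded[-max_bytes:].decode("utf-8", errors="ignore")
```

Trimming on bytes rather than characters keeps the stored size predictable; the cost is that the first retained line may be partial.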

3.2 Surface-to-workflow mapping

Stored as a Python constant (not in the DB) in the deploy service module:

SURFACE_WORKFLOW_MAP = {
    "api-prod":          ("deploy-heroku.yml",         "production"),
    "api-staging":       ("deploy-heroku.yml",         "staging"),
    "console-prod":      ("deploy-console.yml",        "production"),
    "console-staging":   ("deploy-console.yml",        "staging"),
    "getraxx":           ("deploy-customer-docs.yml",  "production"),
    "raxx-mockups":      ("deploy-customer-docs.yml",  "production"),
}

Surfaces not in this map (e.g., vault, raxx-app-previews) do not show a Deploy button.
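A lookup over this map might look like the sketch below. DeployTarget and resolve_deploy_target are hypothetical names; the map itself is copied from above. A None result corresponds to "no Deploy button" in the UI and, presumably, to the 422 case in the API.

```python
from typing import NamedTuple, Optional

SURFACE_WORKFLOW_MAP = {
    "api-prod":          ("deploy-heroku.yml",         "production"),
    "api-staging":       ("deploy-heroku.yml",         "staging"),
    "console-prod":      ("deploy-console.yml",        "production"),
    "console-staging":   ("deploy-console.yml",        "staging"),
    "getraxx":           ("deploy-customer-docs.yml",  "production"),
    "raxx-mockups":      ("deploy-customer-docs.yml",  "production"),
}


class DeployTarget(NamedTuple):
    workflow_name: str
    target_env: str


def resolve_deploy_target(surface_id: str) -> Optional[DeployTarget]:
    """Return the workflow and environment for a surface, or None if not deployable."""
    entry = SURFACE_WORKFLOW_MAP.get(surface_id)
    return DeployTarget(*entry) if entry else None
```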


4. APIs / Contracts

4.1 POST /api/internal/deploys

Auth: session cookie; ops or superadmin role required.

Request body:

{
  "surface_id":      "console-prod",
  "target_ref":      "main",
  "idempotency_key": "550e8400-e29b-41d4-a716-446655440000"
}

target_env is derived from SURFACE_WORKFLOW_MAP — the caller does not set it directly. target_ref defaults to "main" if omitted.

Behavior:

  1. Validate surface_id exists in SURFACE_WORKFLOW_MAP. 401 if role check fails, 422 if surface not deployable.

  2. Check idempotency: if a row exists with idempotency_key and status is not failed/timed_out, return 200 with the existing row's id and status_url. Prevents double-click.

  3. Rate-limit check: if 5 or more non-terminal deploys for this surface_id exist in the last 60 minutes, return 429.

  4. Insert console_deploys row with status: requested.

  5. Write audit_log row (console.deploy.intent); store id in audit_correlation_id.

  6. Call GitHub workflow_dispatch API:

       POST https://api.github.com/repos/{owner}/{repo}/actions/workflows/{workflow}/dispatches
       Authorization: Bearer $GITHUB_DISPATCH_TOKEN

       {
         "ref": "<target_ref>",
         "inputs": {
           "environment": "<target_env>",
           "console_deploy_id": "<deploy_row_id>"
         }
       }

  7. On 204 from GitHub: update row to status: dispatched. Poll GET /repos/{owner}/{repo}/actions/runs?event=workflow_dispatch for up to 30 seconds (3 polls × 10s) to capture the run_id; store it if found.

  8. Return 201:

       {
         "id": "<deploy_row_id>",
         "status": "dispatched",
         "status_url": "/api/internal/deploys/<id>"
       }
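The idempotency and rate-limit checks (steps 2 and 3) could be sketched as one pre-insert query helper. This assumes a SQLite-style connection and that requested_at_utc is stored as an ISO-8601 UTC string; the function name precheck and its return shape are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

TERMINAL = ("succeeded", "failed", "timed_out")
RATE_LIMIT = 5                      # non-terminal deploys per surface
WINDOW = timedelta(minutes=60)


def precheck(conn: sqlite3.Connection, surface_id: str,
             idempotency_key: str, now: datetime):
    """Return ("existing", id), ("rate_limited", None) or ("ok", None)."""
    row = conn.execute(
        "SELECT id, status FROM console_deploys WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    # Step 2: a live row with the same key is returned instead of re-dispatching.
    if row and row[1] not in ("failed", "timed_out"):
        return ("existing", row[0])
    # Step 3: count non-terminal deploys for this surface inside the window.
    cutoff = (now - WINDOW).isoformat()
    (active,) = conn.execute(
        "SELECT COUNT(*) FROM console_deploys "
        "WHERE surface_id = ? AND requested_at_utc >= ? "
        "AND status NOT IN (?, ?, ?)",
        (surface_id, cutoff, *TERMINAL),
    ).fetchone()
    if active >= RATE_LIMIT:
        return ("rate_limited", None)
    return ("ok", None)
```

One point the sketch leaves open: because idempotency_key is UNIQUE, a retry that reuses the key of a failed or timed_out row cannot simply be inserted again; it would need either a fresh key or an update of the old row.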

On GitHub API error: update row to status: failed, failure_reason: "github_dispatch_failed: <http_status>". Return 502.
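For illustration, the dispatch call in step 6 and the failure_reason format above could be built with the standard library as follows. The helper names are hypothetical, and the Accept header is GitHub's recommended media type rather than something this spec states; sending the request and handling the 204 are left out.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(owner: str, repo: str, workflow: str, target_ref: str,
                           target_env: str, deploy_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the workflow_dispatch POST for one deploy row."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/actions/workflows/{workflow}/dispatches"
    payload = {
        "ref": target_ref,
        "inputs": {"environment": target_env, "console_deploy_id": deploy_id},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # token read from env at request time
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


def dispatch_failure_reason(http_status: int) -> str:
    """Format failure_reason the way the error path above describes it."""
    return f"github_dispatch_failed: {http_status}"
```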

4.2 GET /api/internal/deploys/<id>

Auth: session cookie; any authenticated role.

Response:

{
  "id":                 "<uuid>",
  "surface_id":         "console-prod",
  "target_env":         "production",
  "target_ref":         "main",
  "status":             "building",
  "github_run_url":     "https://github.com/…/actions/runs/12345",
  "last_status_at_utc": "2026-04-30T14:35:10Z",
  "log_tail":           "…last 4 KB of streaming_log…",
  "failure_reason":     null
}

log_tail is the last 4 KB of streaming_log — the endpoint does not return the full log to avoid oversized responses.

4.3 POST /api/internal/deploys/<id>/status

Auth: HMAC signature on request body. Header: X-Raxx-Deploy-Signature: sha256=<hex_digest>.

Verification:

expected = HMAC-SHA256(DEPLOY_CALLBACK_HMAC_SECRET, raw_request_body)
compare(expected, header_value, constant_time=True)

Return 401 if verification fails; log security event to audit_log.
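In Python the verification pseudocode maps onto the standard hmac module. The sketch below is one possible shape, with hypothetical function names; it assumes the handler has access to the raw request bytes, since verifying a re-serialized body would not match the sender's digest.

```python
import hashlib
import hmac


def sign_body(secret: bytes, raw_body: bytes) -> str:
    """Produce the X-Raxx-Deploy-Signature value for a request body."""
    digest = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Constant-time check of the signature header against the raw body."""
    if not header_value or not header_value.startswith("sha256="):
        return False
    expected = sign_body(secret, raw_body)
    # Compare as bytes so a non-ASCII header yields False instead of TypeError.
    return hmac.compare_digest(expected.encode("ascii"), header_value.encode("utf-8"))
```

A handler would return 401 and write the audit_log security event whenever verify_signature returns False.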

Request body:

{
  "status":         "building",
  "log_line":       "Step: Install Heroku CLI — started",
  "failure_reason": null
}

status must be one of: building, deploying, succeeded, failed.

Behavior:

  1. Look up deploy row by id. 404 if not found.

  2. Validate status transition is forward-only (no backwards transitions from terminal states).

  3. Append log_line (with UTC timestamp prefix) to streaming_log. Enforce 500 KB cap.

  4. Update status, last_status_at_utc.

  5. Write audit_log row (console.deploy.callback).

  6. Return 204.


5. State Machine

requested → dispatched → building → deploying → succeeded
                            ↓           ↓
                          failed      failed
dispatched → timed_out  (reconciler promotes after 30 min with no callback)

Terminal states: succeeded, failed, timed_out.

stateDiagram-v2
    [*] --> requested : POST /api/internal/deploys (intent created)
    requested --> dispatched : GitHub workflow_dispatch 204
    requested --> failed : GitHub API error
    dispatched --> building : callback status=building
    dispatched --> timed_out : reconciler: no callback in 30 min
    dispatched --> failed : callback status=failed
    building --> deploying : callback status=deploying
    building --> failed : callback status=failed
    deploying --> succeeded : callback status=succeeded
    deploying --> failed : callback status=failed
    succeeded --> [*]
    failed --> [*]
    timed_out --> [*]
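The diagram can be encoded as a transition table that the callback handler and reconciler consult before writing a new status. This is a sketch with hypothetical names; it mirrors the edges drawn above and nothing more.

```python
ALLOWED_TRANSITIONS = {
    "requested":  {"dispatched", "failed"},
    "dispatched": {"building", "failed", "timed_out"},
    "building":   {"deploying", "failed"},
    "deploying":  {"succeeded", "failed"},
    "succeeded":  set(),
    "failed":     set(),
    "timed_out":  set(),
}

# Terminal states are exactly those with no outgoing edge.
TERMINAL_STATES = {state for state, nxt in ALLOWED_TRANSITIONS.items() if not nxt}


def can_transition(current: str, new: str) -> bool:
    """True only if the diagram has an edge from current to new."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

Two cases are not covered by the diagram and are rejected by this table as written: a repeated callback with the same status (for example several building log lines), and a reconciler promotion straight from dispatched to succeeded. A handler could treat the first as a log-only update; the second would need an extra edge if it is intended.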

6. GitHub Workflow Integration

6.1 New workflow_dispatch input

Each of the four deploy workflows gains one new optional input:

console_deploy_id:
  description: "console_deploys row ID (empty when triggered manually)"
  required: false
  default: ""

This input is passed through the entire workflow via env or outputs so the notify action can reference it.

6.2 .github/actions/notify-deploy-status composite action

Inputs:

- console_deploy_id: the deploy row UUID (may be empty)
- status: building | deploying | succeeded | failed
- log_line: one-line message to append
- failure_reason: only set when status: failed
- console_url: base URL of the console (default: https://console.raxx.app)
- hmac_secret: the DEPLOY_CALLBACK_HMAC_SECRET value

Behavior:

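- name: Notify deploy status
  if: inputs.console_deploy_id != ''
  shell: bash  # run steps in a composite action must declare a shell
  env:
    STATUS: ${{ inputs.status }}
    LOG_LINE: ${{ inputs.log_line }}
    FAILURE_REASON: ${{ inputs.failure_reason }}
    CONSOLE_URL: ${{ inputs.console_url }}
    CONSOLE_DEPLOY_ID: ${{ inputs.console_deploy_id }}
    HMAC_SECRET: ${{ inputs.hmac_secret }}
  run: |
    # Build the body with jq (assumed present on the runner) so quotes in
    # log_line are escaped and an empty failure_reason is sent as null.
    BODY=$(jq -cn --arg status "$STATUS" --arg log_line "$LOG_LINE" --arg reason "$FAILURE_REASON" \
      '{status: $status, log_line: $log_line, failure_reason: (if $reason == "" then null else $reason end)}')
    SIG=$(printf '%s' "$BODY" | openssl dgst -sha256 -hmac "$HMAC_SECRET" -hex | awk '{print $2}')
    curl -sS -X POST \
      -H "Content-Type: application/json" \
      -H "X-Raxx-Deploy-Signature: sha256=$SIG" \
      --max-time 10 \
      "$CONSOLE_URL/api/internal/deploys/$CONSOLE_DEPLOY_ID/status" \
      -d "$BODY" || true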

The || true ensures a callback failure does not abort the workflow. The deploy is the primary job; the callback is observability.

6.3 Lifecycle call points per workflow

Each deploy workflow inserts notify steps at these points:

| Point | Status | Log line |
|---|---|---|
| Deploy job start | building | "Deploy job started for <app> (<env>)" |
| After Heroku push succeeds | deploying | "Code pushed to Heroku. Awaiting dyno restart." |
| Health check passes | succeeded | "Health check passed. <app_url>/health → 200" |
| Health check fails | failed | "Health check failed after 5 retries." |
| Any job failure (catch) | failed | "Workflow step failed: <step_name>" |

7. Console UI Design

This section is the spec for ux-designer; implementation is for feature-developer.

7.1 Deploy button placement

Each site tile in the status grid (dashboard/_status_grid.html) gains a Deploy button. The button:

- Is only rendered when the operator session has ops or superadmin role.
- Is only rendered when the tile's surface_id exists in SURFACE_WORKFLOW_MAP.
- Shows a lock icon and tooltip "Deploy frozen" when the deploy freeze flag is active (deploy_freeze.is_frozen()).

7.2 Confirmation modal

Clicking Deploy opens a modal with:

  1. Surface name + target environment (derived from SURFACE_WORKFLOW_MAP).

  2. Red/purple env banner consistent with ADR-0024 env-switcher styling.

  3. Text input: "Type deploy <surface> to <env> to confirm" where the phrase is generated from the surface/env. Examples:
       - Production: deploy console-prod to production
       - Staging: deploy api-staging to staging

  4. Target ref field (default: main; editable by operator).

  5. Confirm button (disabled until phrase matches exactly).

  6. On confirm: POST to /api/internal/deploys with an idempotency key generated client-side (UUID). The modal switches to the live-status view on 201 response.

The phrase must be generated dynamically from surface_id and target_env — not hardcoded — so adding a new surface does not require a modal code change.
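A minimal sketch of dynamic phrase generation, assuming the proposed "deploy <surface_id> to <target_env>" pattern (which OQ-1 has not yet confirmed). Both function names are hypothetical, and the same logic would need a client-side counterpart for the Confirm button.

```python
def confirmation_phrase(surface_id: str, target_env: str) -> str:
    """Build the type-to-confirm phrase from the surface and its environment."""
    return f"deploy {surface_id} to {target_env}"


def phrase_matches(typed: str, surface_id: str, target_env: str) -> bool:
    """Exact match only: no trimming and no case folding."""
    return typed == confirmation_phrase(surface_id, target_env)
```

Keeping the comparison exact preserves the friction ADR-0028 asks for; trimming whitespace would be a deliberate relaxation.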

7.3 Live-status view (inside the modal)

After dispatch, the modal body becomes:

- Status badge (matches deploy row status, styled per ADR-0030 state vocabulary).
- GitHub run link (when github_run_url is populated).
- Log tail (last ~30 lines, monospace, auto-scrolled to bottom).
- Polling: GET /api/internal/deploys/<id> every 2 seconds until terminal state.
- On succeeded: green banner, "Close" button.
- On failed/timed_out: red banner, failure reason if present, "Close" and "View run on GitHub" buttons.


8. Failure Modes and Reconciliation

| Failure | Detection | Response |
|---|---|---|
| GitHub API 5xx on dispatch | Immediate | Row set to failed; UI shows error; operator uses break-glass |
| Workflow exits before any callback (GH outage, runner failure) | Reconciler: polls /runs/<id> every 60s | Promotes to succeeded/failed based on GH conclusion; if run not found after 30 min, promotes to timed_out |
| Callback HMAC verification fails | Immediate (401) | Security event logged to audit_log (action: console.deploy.callback.auth_fail); row unchanged |
| Callback for unknown deploy ID | Immediate (404) | Logged; no action |
| Duplicate dispatch (double-click) | idempotency_key UNIQUE constraint | 200 returns existing row; no second dispatch |
| Rate limit exceeded (5/hour/surface) | Count check on insert | 429 returned; no row created |
| Deploy row stuck in dispatched > 30 min | Reconciler | Promoted to timed_out; failure_reason: "reconciler: no callback received in 30 min" |
| streaming_log exceeds 500 KB | Per-append cap | Oldest bytes truncated (head trimmed, tail preserved) |

8.1 Reconciler design

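A background thread in the console process (alongside the existing status poller thread) checks non-terminal deploy rows every 60 seconds. For rows with a github_run_id it polls the run on GitHub and promotes the row to succeeded or failed from the run's conclusion; a row still in dispatched after 30 minutes with no callback is promoted to timed_out.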
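The per-row decision the reconciler makes can be sketched as a pure function, based on the failure-mode table in section 8. Several details are assumptions: the function name, the choice to measure the 30 minutes from the row's last status change, and the wording of the failure_reason for a non-success run conclusion. Only the timed_out reason string comes from the table. The run_conclusion argument is the conclusion field of a GitHub workflow run (None while the run is still in progress or not found).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

TERMINAL = {"succeeded", "failed", "timed_out"}
CALLBACK_TIMEOUT = timedelta(minutes=30)


def reconcile(status: str, since: datetime, now: Optional[datetime] = None,
              run_conclusion: Optional[str] = None) -> Optional[Tuple[str, Optional[str]]]:
    """Return (new_status, failure_reason), or None to leave the row alone."""
    now = now or datetime.now(timezone.utc)
    if status in TERMINAL:
        return None
    if run_conclusion == "success":
        return ("succeeded", None)
    if run_conclusion is not None:
        # Hypothetical reason format; the spec does not define one for this case.
        return ("failed", f"reconciler: run concluded {run_conclusion}")
    if status == "dispatched" and now - since > CALLBACK_TIMEOUT:
        return ("timed_out", "reconciler: no callback received in 30 min")
    return None
```

Keeping the decision separate from the polling loop makes it testable without GitHub access; the thread would call it once per non-terminal row on each 60-second tick.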


9. Migrations

9.1 Forward migration

console/migrations/versions/<timestamp>_add_console_deploys.py

Creates console_deploys table and three indexes. No existing tables modified. Safe to run on an active console deployment; table is new, no locking concern.

9.2 Rollback

Drop console_deploys and indexes. No foreign keys from other tables point at console_deploys (the audit_correlation_id column is a soft reference by value, not a DB-enforced FK, for the same reason audit_log avoids FKs on its referencing columns — audit entries outlive their subjects).

9.3 Migration sequencing

The migration must land before any of the API endpoints are deployed. The sub-card order enforces this: DB migration is sub-card #1.


10. Rollout Plan

| Phase | Gate | What ships |
|---|---|---|
| Dark: migration only | Migration runs clean on staging | console_deploys table exists; no UI yet |
| Alpha: API + action | Sub-cards #1–5 merged | Endpoints live behind feature flag FLAG_CONSOLE_DEPLOY_UI=false; action available for manual testing |
| Beta: workflow integration | Sub-cards #6–7 merged | Notify steps active in all 4 workflows; reconciler running |
| GA: UI | Sub-cards #8–11 merged; ux-designer mockup reviewed | Deploy button visible to ops+ role; flag removed |

Feature flag: FLAG_CONSOLE_DEPLOY_UI (env var, default false). When false: Deploy button is not rendered; API endpoints return 501. This allows the back-end to be deployed independently of the UI.
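A flag check along these lines would cover both the button and the 501 behavior. The helper names and the set of accepted truthy strings are assumptions; the flag is read on every call so it can be flipped without a rebuild.

```python
import os
from typing import Mapping, Optional


def deploy_ui_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Read FLAG_CONSOLE_DEPLOY_UI at call time; unset or unrecognized means off."""
    value = environ.get("FLAG_CONSOLE_DEPLOY_UI", "false")
    return value.strip().lower() in ("1", "true", "yes")  # assumed truthy spellings


def flag_gate_status(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return 501 when the flag is off, or None to let the request proceed."""
    return None if deploy_ui_enabled(environ) else 501
```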


11. Security Considerations

PII: requested_by stores the operator email address. Follows existing audit_log PII posture: 2-year retention, DSR erasure replaces email with [redacted] tombstone.

Retention: console_deploys rows retained indefinitely for operational history (no user PII beyond operator email). Purge policy: rows older than 2 years may be archived. Feature-developer should add an index on requested_at_utc (the table has no created_at column) to support the age-based archive query.

DSR: If an operator account is deleted, requested_by is tombstoned to [redacted:<admin_id_hash>] by the existing account erasure path. audit_correlation_id rows in audit_log follow the same redaction path.

Credential storage:

- GITHUB_DISPATCH_TOKEN: env var, sourced from Infisical at boot. Never logged. Rotatable without redeploy.
- DEPLOY_CALLBACK_HMAC_SECRET: env var on console (Infisical) and GitHub Environment secret. Must match. Rotating requires updating both locations. Runbook sub-card covers rotation procedure.

Callback security: HMAC-SHA256 with constant_time_compare. The secret is shared between the console and the four deploy workflows. If compromised, rotate in Infisical + all four GitHub Environment secrets simultaneously. A compromised secret allows a third party to inject fake status updates into deploy rows — it does not allow triggering deploys or accessing secrets.

Rate limiting: 5 deploys per surface per hour prevents runaway dispatch loops. Limit is enforced server-side; no client-side trust.

GitHub dispatch token scope: GITHUB_DISPATCH_TOKEN needs only the actions:write scope on this repository. When GitHub Apps (#335, #336) land, this becomes a fine-grained app installation token scoped to a single repository and a single permission. Until then, the PAT scope must be documented and audited.

Breach: A breach of console_deploys exposes deploy history (operator email, target SHAs, log output). Log output may contain workflow step names and error messages but must not contain secrets: the notify action must never echo secrets into log_line. The || true in the notify step only keeps a failed callback from failing the workflow; it does not scrub or suppress output. Breach notification follows the platform GDPR breach path; notify Kristerpher within 72 hours of discovery.

Kill-switch: DEPLOY_FREEZE_OVERRIDE=1 (existing mechanism from ADR-0028) gates all console-triggered deploys. The freeze check runs before workflow_dispatch is called.


12. Open Questions

These block sub-card claims until resolved:

OQ-1 (blocks #8, ux-designer card #11): What is the exact confirmation phrase pattern for each surface? The spec proposes deploy <surface_id> to <target_env>. Confirm this is the right shape, or specify alternatives (e.g., deploy api to production).

OQ-2 (blocks #6): The four workflows in scope are deploy-heroku.yml, deploy-console.yml, deploy-customer-docs.yml, deploy-status-page.yml. deploy-status-worker.yml is out of scope for v1 — confirm this is correct or add it to scope.

OQ-3 (blocks #9): DEPLOY_CALLBACK_HMAC_SECRET must be provisioned in Infisical before the console can verify callbacks. Who provisions it and at what path? Proposed: /Console/prod/DEPLOY_CALLBACK_HMAC_SECRET and /Console/staging/DEPLOY_CALLBACK_HMAC_SECRET. Confirm path convention with vault operator.

OQ-4 (blocks #2): GITHUB_DISPATCH_TOKEN scope and owner. The token must have actions:write on raxx-app/TradeMasterAPI. Is the existing agent PAT adequate, or does a dedicated service account token need to be provisioned?

OQ-5 (blocks #7): The reconciler polls api.github.com every 60 seconds for active dispatches. At current deploy cadence this is minimal. If more than ~20 deploys are active simultaneously, the reconciler should batch calls or use a GitHub App token with higher rate limits. Accept this as a v1 limitation?