Raxx · internal docs


Prod-Deploy Gating — Universal Pattern + Console First Implementation

Status: Proposed
Owner: software-architect
Date: 2026-04-29 UTC
Refs: #81 (SDLC epic), Kristerpher directive 2026-04-29
Related ADRs: 0028, 0029
Related docs: branch-promotion-strategy.md, console-env-switcher.md


1. Context

Today, raxx-console-prod is deployed via a manual git subtree split and push to Heroku. No confirmation, no diff preview, no audit entry, no post-deploy smoke check. v30 shipped this way on 2026-04-29; the operator had no gate between "git push" and "console is updated in production."

The other prod surfaces have partial gating:

| Surface | Platform / app | Current prod gate |
| --- | --- | --- |
| Raptor (backend) | raxx-api-prod | deploy-heroku.yml workflow_dispatch with environment: production input. Has gate. |
| Antlers (frontend) | CF Pages raxx.app | Tag-gated via deploy-antlers.yml (extracted from deleted deploy.yml; #698 / PR #837); production environment gate (ADR-0020). Has gate. |
| Console | raxx-console-prod | Pure manual subtree push. No gate. |
| Docs/mockups | CF Pages | Auto-deploy on path filter; low blast radius. Acceptable for static content. |
| Lightsail (tickets/vault) | N/A (Terraform) | Manual terraform apply with snapshot-before-apply convention. Adequate. |

raxx-console-staging is being retired approximately 2026-05-07, post env-switcher soak (see console-env-switcher.md §8 and ops-runbook). The console is becoming a single prod deployment with an in-app env switcher. This makes the absence of a console deploy gate more acute: there is no second deployment to catch mistakes.

Objective: Define a universal gating pattern applicable to every prod surface, then implement it for the console as the first concrete application.


2. Invariants

All platform invariants apply. Deploy-gating-specific:

  1. No stored credentials. Deploy secrets (Heroku API key, CLOUDFLARE_API_TOKEN, CLOUDFLARE_EDIT_DNS) live in GitHub Environments secrets or Infisical. They must never appear in workflow YAML files or shell scripts that ship in the repo.

  2. Audit trail for every prod deploy. Every prod deploy must record: actor identity, source SHA, target SHA, target app name, timestamp (UTC), and outcome. This is a platform invariant per the north-star.

  3. Explicit confirmation before prod. No prod deploy may fire as a side effect of any automated push to main. The gate must require a deliberate operator action. Accidental production deploys must be structurally impossible, not just unlikely.

  4. Break-glass preserved. The manual dispatch path (workflow_dispatch) must remain available as an emergency bypass. It is the hotfix and kill-switch path. The gate fires on the break-glass path too, but approval is fast when the operator is already present.

  5. Kill-switch must survive the gate. A Heroku rollback (heroku releases:rollback) or CF Pages "previous deployment" promotion must be achievable in under two minutes without re-running CI. The gate is on the forward path; rollback is unrestricted.

  6. Secrets are rotatable without redeploy. Any secret that the deploy gate reads (e.g. CONSOLE_DEPLOY_CONFIRMATION_PHRASE if used) must be a Heroku config var, not a hardcoded value.


3. Universal Gating Principles

These six principles apply to every prod surface. Each surface implements them differently; Section 4 describes the per-surface patterns.

P1 — Confirmation gate

A human must take an unambiguous action that could not occur by accident. For CI-driven deploys, this is the GitHub Environment required-reviewer approval. For local-script deploys, this is a typed confirmation phrase. The phrase must not be a generic "yes" — it must identify the target (deploy console to prod).
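
For the local-script case, the typed-phrase check can be sketched as below. This is a minimal illustration, not the actual scripts/deploy-console.sh; the function names (phrase_matches, confirm_prod_deploy) and the REQUIRED_PHRASE variable are hypothetical. The comparison is an exact, case-sensitive string match, so a reflexive "yes" or "y" never passes.

```shell
#!/bin/sh
# Hypothetical helper names; the phrase itself comes from this document.
REQUIRED_PHRASE="deploy console to prod"

# Returns 0 only when the typed text equals the required phrase exactly
# (case-sensitive, no extra whitespace); any other input returns 1.
phrase_matches() {
    [ "$1" = "$REQUIRED_PHRASE" ]
}

# Prompts on stderr, reads one line from stdin, and validates it.
confirm_prod_deploy() {
    printf "Type '%s' to continue: " "$REQUIRED_PHRASE" >&2
    IFS= read -r _typed || return 1
    phrase_matches "$_typed"
}
```

A caller would typically abort on failure, for example `confirm_prod_deploy || { echo "Aborted." >&2; exit 1; }`.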

P2 — Diff preview

Before the confirmation gate fires, the operator must see what will actually ship: the from-SHA, the to-SHA, and an abbreviated log of commits in between. This prevents "I thought I was deploying X but actually Y was ahead of me on main" incidents.
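
One way to produce such a preview with plain git is sketched below. The function name show_deploy_diff is hypothetical, and the sketch assumes both refs are already present in the local repository (for example, after fetching the deployed remote).

```shell
#!/bin/sh
# Hypothetical helper: print the from/to SHAs and the commits that would ship.
# $1 = ref currently deployed (from), $2 = ref about to be deployed (to).
show_deploy_diff() {
    _from=$(git rev-parse --short "$1") || return 1
    _to=$(git rev-parse --short "$2") || return 1
    printf 'FROM: %s\nTO:   %s\n' "$_from" "$_to"
    printf 'Commits to ship:\n'
    # "A..B" lists commits reachable from B but not from A.
    git log --oneline "$1..$2"
}
```

If the commit list contains anything the operator did not expect, the deploy should stop before the confirmation prompt is shown.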

P3 — Audit trail

Every prod deploy writes a structured audit record: actor, from-SHA, to-SHA, app name, timestamp (UTC), and outcome (success/failure). For CI-driven deploys, the GitHub Actions run log and environment log are the canonical audit record. For local-script deploys, the record is written to console_audit_log (action: console.deploy.prod).

P4 — Safety net (rollback path)

The rollback path must be documented and fast:

- Heroku: heroku releases:rollback --app <app-name>. No redeploy, no rebuild.
- CF Pages: dashboard "Rollback deployment" or wrangler pages deployment create --env production <previous-build-dir>.
- Terraform (Lightsail): terraform apply from the previous state snapshot.

The safety net does not require re-running CI. Speed matters during an incident.

P5 — Post-deploy smoke check

After every prod deploy, an automated health check runs against the live endpoint. Failure triggers an alert and, for CI-driven deploys, marks the workflow run failed (creating a clear paper trail). The health check must not require operator action to initiate; it runs automatically as part of the deploy workflow.
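
A retry loop for this check could look like the following sketch. The defaults (5 attempts, 10 seconds apart) mirror the values used elsewhere in this document; the function names and the HEALTH_PROBE override are assumptions added for illustration, and the exact /health URL is left to the caller.

```shell
#!/bin/sh
# Default probe (assumption): a plain curl GET that fails on HTTP errors.
probe_http() {
    curl --fail --silent --show-error --max-time 5 --output /dev/null "$1"
}

# poll_health URL [RETRIES] [INTERVAL_SECONDS]
# Returns 0 on the first successful probe, 1 if every attempt fails.
# HEALTH_PROBE may name a replacement probe command (useful for dry runs).
poll_health() {
    _url=$1
    _retries=${2:-5}
    _interval=${3:-10}
    _attempt=1
    while [ "$_attempt" -le "$_retries" ]; do
        if "${HEALTH_PROBE:-probe_http}" "$_url"; then
            echo "health check passed on attempt $_attempt"
            return 0
        fi
        echo "health check attempt $_attempt/$_retries failed" >&2
        # Skip the sleep after the final attempt.
        [ "$_attempt" -lt "$_retries" ] && sleep "$_interval"
        _attempt=$((_attempt + 1))
    done
    return 1
}
```

In a CI step, a non-zero return from poll_health is what would mark the workflow run as failed.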

P6 — Environment visibility

The operator must see, unambiguously, which environment they are targeting at every step:

- CLI scripts: colored banner + prominent "TARGET: PRODUCTION" header before the confirmation prompt.
- CI workflows: environment name in job title, PR comment, and Slack notification.
- Console UI: the red/purple banner from the env-switcher design (see console-env-switcher.md).
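
For the CLI case, a banner helper might look like this sketch. The function name print_target_banner and the specific ANSI colors are assumptions; the only detail taken from this document is the "TARGET: PRODUCTION" wording.

```shell
#!/bin/sh
# Hypothetical helper: print a high-visibility target banner.
# $1 = environment name, $2 = app name.
print_target_banner() {
    _env=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    # Use ANSI bold white-on-red only when stdout is a terminal, so that
    # redirected or captured output stays free of escape codes.
    if [ -t 1 ]; then
        printf '\033[1;97;41m TARGET: %s (%s) \033[0m\n' "$_env" "$2"
    else
        printf 'TARGET: %s (%s)\n' "$_env" "$2"
    fi
}
```

Calling `print_target_banner production raxx-console-prod` before the diff and the prompt keeps the target visible at the moment the operator decides.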


4. Per-Surface Pattern

4.1 Heroku surfaces (console + API + any future Heroku apps)

Normal release path (CI-driven):

push to console/** path → deploy-console.yml fires
  → smoke gate
  → resolve target: staging (auto) or production (manual dispatch)
  → if production: GitHub Environment approval gate (required reviewer) pauses here
  → deploy to Heroku
  → post-deploy /health smoke check (5 retries, 10s intervals)
  → audit entry written
  → PR/commit comment + Slack notification

Break-glass path (local script):

scripts/deploy-console.sh
  → shows colored TARGET: PRODUCTION banner
  → fetches and displays diff (from-SHA → to-SHA, commit log)
  → prompts: "Type 'deploy console to prod' to continue:"
  → validates phrase exactly
  → calls heroku git:remote + git push heroku main
  → polls /health (5 retries)
  → writes audit entry to console_audit_log via Raptor API or direct DB insert
  → prints rollback hint: "heroku releases:rollback --app raxx-console-prod"

The break-glass script exists so that a deploy can proceed if GitHub Actions is unavailable or if a security incident requires bypassing CI entirely. It is not the normal path; the CI workflow is.

Rollback:

heroku releases:rollback --app raxx-console-prod

No redeploy, no CI. Operator can do this in under 60 seconds. The previous release is the safety net.

4.2 Cloudflare Pages (docs, mockups, marketing)

Path-filtered auto-deploy on push to main is acceptable for static content. Blast radius is low — a bad docs deploy does not affect user data or trading. CF Pages retains every deployment and supports one-click rollback via the dashboard.

Additional gate for prod custom domains: any PR that touches docs/ root-level files or console/static/ should receive an automated PR comment (via pr-preview workflow or a new check) that confirms: "Will deploy to docs.raxx.app on merge." This is a visibility check, not a hard gate.

Rollback: CF Pages dashboard "Rollback deployment" or via Wrangler CLI.

4.3 Lightsail / Terraform (tickets, vault)

The mandatory pre-apply snapshot plus terraform plan review is the existing gate. This pattern is adequate, and this document formalizes it as the gating pattern for Terraform-managed surfaces.

No CI automation needed for Terraform at current scale. Runbook-driven is appropriate.


5. State Machine — Console Prod Deploy (CI path)

stateDiagram-v2
    [*] --> Triggered: push to console/** path\nor workflow_dispatch
    Triggered --> SmokeGate: CI job starts
    SmokeGate --> Blocked: smoke fails
    Blocked --> [*]: deploy cancelled, PR comment
    SmokeGate --> Resolved: smoke passes
    Resolved --> ApprovalPending: environment = production
    Resolved --> Deploying: environment = staging (auto)
    ApprovalPending --> Deploying: reviewer approves
    ApprovalPending --> Rejected: reviewer rejects / timeout
    Rejected --> [*]: deploy cancelled, Slack DM
    Deploying --> HealthCheck: Heroku deploy succeeds
    Deploying --> DeployFailed: Heroku deploy fails
    DeployFailed --> [*]: audit entry (failure), Slack DM
    HealthCheck --> AuditWritten: /health 200
    HealthCheck --> RollbackAlert: /health fails after 5 retries
    RollbackAlert --> [*]: audit entry (health failure), Slack DM, rollback hint
    AuditWritten --> [*]: PR comment, Slack DM, success
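
The transitions in the diagram above can also be written as a lookup function, which makes the key property easy to check: a production target has no edge to Deploying that bypasses ApprovalPending. This is an illustrative sketch only; the event names (ci_started, smoke_passed, and so on) are invented here, while the state names are taken from the diagram.

```shell
#!/bin/sh
# next_state STATE EVENT
# Prints the next state, or returns 1 if the diagram has no such transition.
# Event names are illustrative; state names follow the diagram.
next_state() {
    case "$1:$2" in
        Triggered:ci_started)        echo SmokeGate ;;
        SmokeGate:smoke_failed)      echo Blocked ;;
        SmokeGate:smoke_passed)      echo Resolved ;;
        Resolved:env_production)     echo ApprovalPending ;;
        Resolved:env_staging)        echo Deploying ;;
        ApprovalPending:approved)    echo Deploying ;;
        ApprovalPending:rejected|ApprovalPending:timeout)
                                     echo Rejected ;;
        Deploying:deploy_succeeded)  echo HealthCheck ;;
        Deploying:deploy_failed)     echo DeployFailed ;;
        HealthCheck:health_ok)       echo AuditWritten ;;
        HealthCheck:health_failed)   echo RollbackAlert ;;
        *) return 1 ;;
    esac
}
```

Blocked, Rejected, DeployFailed, RollbackAlert, and AuditWritten are terminal, so any event applied to them falls through to the failure branch.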

6. Sequence — Console Prod Deploy (CI path)

sequenceDiagram
    participant Dev as Operator / CI
    participant GH as GitHub Actions
    participant Gate as GH Environment gate
    participant Heroku as raxx-console-prod
    participant Health as /health endpoint
    participant Audit as console_audit_log
    participant Slack as Slack (D0AJ7K184TV)

    Dev->>GH: workflow_dispatch {environment: production, ref: main}
    GH->>GH: smoke gate passes
    GH->>GH: resolve: target=production, app=raxx-console-prod
    GH->>Gate: pause — waiting for approval
    Gate-->>Dev: email + Slack: "Approval required for console → production"
    Dev->>Gate: click Approve (GH Actions UI)
    Gate->>GH: approved, proceed
    GH->>Heroku: git push heroku main (via akhileshns/heroku-deploy)
    Heroku-->>GH: deploy complete
    GH->>Health: GET /health (5 retries, 10s intervals)
    Health-->>GH: 200 OK
    GH->>Audit: write {action: console.deploy.prod, from_sha, to_sha, actor, ts}
    GH->>Slack: "console deployed to prod — SHA → SHA"
    GH-->>Dev: PR/commit comment with deploy summary

7. Sequence — Break-Glass Local Deploy

sequenceDiagram
    participant Op as Operator (local)
    participant Script as scripts/deploy-console.sh
    participant Heroku as raxx-console-prod
    participant Health as /health endpoint
    participant Audit as console_audit_log

    Op->>Script: ./scripts/deploy-console.sh
    Script->>Script: fetch remote HEAD (heroku-console-prod)
    Script->>Script: compute diff: remote HEAD (from_sha) → local HEAD (to_sha)
    Script-->>Op: display "TARGET: PRODUCTION (raxx-console-prod)"
    Script-->>Op: display commit log (from_sha → to_sha)
    Script-->>Op: prompt: "Type 'deploy console to prod' to continue:"
    Op->>Script: types confirmation phrase
    Script->>Script: validate phrase exactly
    Script->>Heroku: git push heroku-console-prod main
    Heroku-->>Script: deploy complete
    Script->>Health: curl /health (5 retries)
    Health-->>Script: 200 OK
    Script->>Audit: write audit entry (actor=$USER, from_sha, to_sha, ts)
    Script-->>Op: "Deploy complete. Rollback: heroku releases:rollback --app raxx-console-prod"

8. Data Model — Audit Log Entry

No new schema required. The existing console_audit_log table in the console database accepts a free-form payload_redacted JSON column. The deploy action adds:

{
  "action": "console.deploy.prod",
  "actor": "kristerpher@raxx.app",
  "from_sha": "abc1234",
  "to_sha": "def5678",
  "target_app": "raxx-console-prod",
  "trigger": "ci" | "local-script",
  "outcome": "success" | "health_check_failed" | "deploy_failed",
  "ts_utc": "2026-04-29T14:33:00Z"
}

For CI-driven deploys, actor is recorded by the GitHub Actions job as the approving reviewer's GitHub handle, extracted from the environment approval event. For local-script deploys, it is $GIT_AUTHOR_EMAIL, with $USER as a fallback.
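
For the local-script path, the actor fallback and the payload shape above could be assembled as in the sketch below. The function names are hypothetical, and the sketch assumes every value is free of double quotes and backslashes, since it does no JSON escaping. How the record is then delivered to console_audit_log is outside this sketch.

```shell
#!/bin/sh
# Resolve the actor for a local deploy: $GIT_AUTHOR_EMAIL, else $USER.
resolve_actor() {
    if [ -n "${GIT_AUTHOR_EMAIL:-}" ]; then
        printf '%s\n' "$GIT_AUTHOR_EMAIL"
    elif [ -n "${USER:-}" ]; then
        printf '%s\n' "$USER"
    else
        return 1
    fi
}

# audit_record FROM_SHA TO_SHA TRIGGER OUTCOME
# Prints one JSON object in the shape shown above. Returns 2 when TRIGGER or
# OUTCOME is not one of the documented values.
# Assumption: no value contains quotes or backslashes (no JSON escaping here).
audit_record() {
    case "$3" in ci|local-script) ;; *) return 2 ;; esac
    case "$4" in success|health_check_failed|deploy_failed) ;; *) return 2 ;; esac
    _actor=$(resolve_actor) || return 1
    _ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    printf '{"action":"console.deploy.prod","actor":"%s","from_sha":"%s","to_sha":"%s","target_app":"raxx-console-prod","trigger":"%s","outcome":"%s","ts_utc":"%s"}\n' \
        "$_actor" "$1" "$2" "$3" "$4" "$_ts"
}
```

Rejecting unknown trigger and outcome values at write time keeps the audit trail queryable by those fields.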


9. Rollout Plan

| Phase | What | Gate |
| --- | --- | --- |
| Sub 1 | deploy-console.yml lands; CI path gates console staging/prod | PR merged, smoke passes |
| Sub 2 | scripts/deploy-console.sh lands; break-glass path available | Sub 1 merged |
| Sub 3 | Audit log integration; deploy entries appear in console audit trail | Sub 2 merged |
| Sub 4 | Post-deploy smoke check wired into both CI and local script | Sub 1 + Sub 2 merged |
| Sub 5 | ENV banner on CLI prompt (Sub 2 refinement) | Sub 2 merged |
| Sub 6 | raxx-console-staging retirement | env-switcher soak complete (≥ 2026-05-07); subs 1–5 stable |

The local script (Sub 2) is the break-glass path for the period between "Sub 1 not yet landed" and "Sub 1 merged." Both paths are additive; neither blocks the other.


10. Security Considerations

PII: actor field in audit log contains the operator's email address. This is intentional — the audit trail must be attributable. Follows existing audit log PII posture (2-year retention, DSR erasure via the operator account erasure path).

Retention: audit entries follow the 2-year platform audit retention policy.

DSR: deploy audit entries are tied to operator admin accounts, not user accounts. If an operator's account is erased, their deploy entries are redacted (actor field nulled, email replaced with a tombstone token) per the existing audit erasure path.

Credential storage: secrets (HEROKU_API_KEY, HEROKU_EMAIL) live in GitHub Environments. The local script reads them from environment variables (HEROKU_API_KEY, set via Infisical dotenv or shell export). Neither the workflow YAML nor the script embeds or stores credentials. The confirmation phrase is not a secret.

Audit coverage: every prod deploy — CI or local script — writes to console_audit_log. There is no unaudited prod deploy path after Sub 3 lands.

Breach: a breach of the deploy audit log reveals which SHAs were deployed when and by whom. No secrets in the payload. No trading data. Operator notification via existing breach-notification path.

Kill-switch: heroku releases:rollback is the kill-switch for a bad deploy. No CI required. Operator can execute in under 60 seconds. The local script prints the rollback command after every successful deploy to ensure it is always one copy-paste away.

Secrets rotation: HEROKU_API_KEY and HEROKU_EMAIL are rotatable in GitHub Environments without redeploying. The local script re-reads them from the environment on every run.


11. Open Questions

These require a decision before the labeled sub-cards can be claimed for GA. Sub-cards are filed but are not claimable until the questions are resolved.

  1. Audit write path for local script. The local script needs to write to console_audit_log. Two options: (a) POST to a Raptor API endpoint (/api/internal/audit) authenticated via a service account token, or (b) direct Postgres insert via a psql one-liner using DATABASE_URL from the environment. Option (a) is cleaner but requires a new Raptor endpoint. Option (b) is simpler but requires the operator to have DATABASE_URL available locally. Decision needed before Sub 3 is claimed.

  2. Approval notification channel. GitHub sends an email when the environment gate pauses. Is a Slack DM to D0AJ7K184TV also required for the console deploy gate, or is email sufficient? (The API deploy gate currently sends email only.)

  3. Smoke check endpoint for console. Does raxx-console-prod already expose a /health endpoint? If not, Sub 4 depends on a /health route being added to the console app, in addition to Sub 1's deploy-console.yml. Feature-developer must confirm before Sub 4 is claimed.

  4. Local script actor identity. For the break-glass script audit entry, $USER on macOS is the macOS login name, not an email. Should the script require RAXX_OPERATOR_EMAIL to be set (and fail if unset), or accept $USER as a fallback with a warning?