Status: Proposed
Owner: software-architect
Date: 2026-04-29 UTC
Refs: #81 (SDLC epic), Kristerpher directive 2026-04-29
Related ADRs: 0028, 0029
Related docs: branch-promotion-strategy.md, console-env-switcher.md
Today, raxx-console-prod is deployed via a manual git subtree split and push to Heroku. No confirmation, no diff preview, no audit entry, no post-deploy smoke check. v30 shipped this way on 2026-04-29; the operator had no gate between "git push" and "console is updated in production."
The other prod surfaces have partial gating:
| Surface | Heroku app | Current prod gate |
|---|---|---|
| Raptor (backend) | raxx-api-prod | deploy-heroku.yml workflow_dispatch with environment: production input. Has gate. |
| Antlers (frontend) | CF Pages raxx.app | Tag-gated via deploy-antlers.yml (extracted from the deleted deploy.yml; #698 / PR #837); production environment gate (ADR-0020). Has gate. |
| Console | raxx-console-prod | Pure manual subtree push. No gate. |
| Docs/mockups | CF Pages | Auto-deploy on path filter; low blast radius. Acceptable for static content. |
| Lightsail (tickets/vault) | N/A (Terraform) | Manual terraform apply with snapshot-before-apply convention. Adequate. |
raxx-console-staging is being retired around 2026-05-07, after the env-switcher soak (see console-env-switcher.md §8 and the ops-runbook). The console is becoming a single prod deployment with an in-app env switcher. This makes the absence of a console deploy gate more acute: there is no second deployment to catch mistakes.
Objective: Define a universal gating pattern applicable to every prod surface, then implement it for the console as the first concrete application.
All platform invariants apply. Deploy-gating-specific:
No stored credentials. Deploy secrets (Heroku API key, CLOUDFLARE_API_TOKEN, CLOUDFLARE_EDIT_DNS) live in GitHub Environments secrets or Infisical. They must never appear in workflow YAML files or shell scripts that ship in the repo.
Audit trail for every prod deploy. Every prod deploy must record: actor identity, source SHA, target SHA, target app name, timestamp (UTC), and outcome. This is a platform invariant per the north-star.
Explicit confirmation before prod. No prod deploy may fire as a side effect of any automated push to main. The gate must require a deliberate operator action. Accidental production deploys must be structurally impossible, not just unlikely.
Break-glass preserved. The manual dispatch path (workflow_dispatch) must remain available as an emergency bypass. It is the hotfix and kill-switch path. The gate fires on the break-glass path too, but approval is fast when the operator is already present.
Kill-switch must survive the gate. A Heroku rollback (heroku releases:rollback) or CF Pages "previous deployment" promotion must be achievable in under two minutes without re-running CI. The gate is on the forward path; rollback is unrestricted.
Secrets are rotatable without redeploy. Any secret that the deploy gate reads (e.g. CONSOLE_DEPLOY_CONFIRMATION_PHRASE if used) must be a Heroku config var, not a hardcoded value.
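As a sketch of this principle applied to the console: if the optional phrase is adopted, it lives on the app and the break-glass script reads it back at run time. The var name is the one floated above; whether to use it at all is not yet decided.

```bash
# Sketch, not final: keep the optional confirmation phrase as a Heroku config var so it
# can be rotated without touching the repo or redeploying.
heroku config:set CONSOLE_DEPLOY_CONFIRMATION_PHRASE="deploy console to prod" --app raxx-console-prod

# The break-glass script would read it at run time instead of hardcoding the value:
EXPECTED_PHRASE="$(heroku config:get CONSOLE_DEPLOY_CONFIRMATION_PHRASE --app raxx-console-prod)"
```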
These six principles apply to every prod surface. Each surface implements them differently; the section below maps each principle to each surface.
A human must take an unambiguous action that could not occur by accident. For CI-driven deploys, this is the GitHub Environment required-reviewer approval. For local-script deploys, this is a typed confirmation phrase. The phrase must not be a generic "yes" — it must identify the target (deploy console to prod).
Before the confirmation gate fires, the operator must see what will actually ship: the from-SHA, the to-SHA, and an abbreviated log of commits in between. This prevents "I thought I was deploying X but actually Y was ahead of me on main" incidents.
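A minimal sketch of that preview, assuming the heroku-console-prod git remote used by the break-glass flow below:

```bash
# Sketch: show exactly what will ship before the confirmation gate fires.
git fetch heroku-console-prod

FROM_SHA="$(git rev-parse --short heroku-console-prod/main)"   # what is live on prod right now
TO_SHA="$(git rev-parse --short HEAD)"                         # what this deploy will ship

echo "Deploying ${FROM_SHA} -> ${TO_SHA}"
git log --oneline "${FROM_SHA}..${TO_SHA}"                     # abbreviated log of commits in between
```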
Every prod deploy writes a structured audit record: actor, from-SHA, to-SHA, app name, timestamp (UTC), and outcome (success/failure). For CI-driven deploys, the GitHub Actions run log and environment log are the canonical audit record. For local-script deploys, the record is written to console_audit_log (action: console.deploy.prod).
The rollback path must be documented and fast:
- Heroku: heroku releases:rollback --app <app-name>. No redeploy, no rebuild.
- CF Pages: dashboard "Rollback deployment" or wrangler pages deployment create --env production <previous-build-dir>.
- Terraform (Lightsail): terraform apply from the previous state snapshot.
The safety net does not require re-running CI. Speed matters during an incident.
After every prod deploy, an automated health check runs against the live endpoint. Failure triggers an alert and, for CI-driven deploys, marks the workflow run failed (creating a clear paper trail). The health check must not require operator action to initiate; it runs automatically as part of the deploy workflow.
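A minimal sketch of that check, assuming a /health route on the prod app (its existence is an open question below) and the 5-retry / 10-second budget used in the deploy flow:

```bash
# Sketch: post-deploy smoke check. The hostname is an assumption; substitute the real prod URL.
HEALTH_URL="https://raxx-console-prod.herokuapp.com/health"

for attempt in 1 2 3 4 5; do
  if curl --fail --silent --max-time 10 "$HEALTH_URL" > /dev/null; then
    echo "health check passed (attempt ${attempt})"
    exit 0
  fi
  echo "health check attempt ${attempt} failed; retrying in 10s..."
  sleep 10
done

echo "health check failed after 5 attempts; rollback hint: heroku releases:rollback --app raxx-console-prod"
exit 1
```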
The operator must see, unambiguously, which environment they are targeting at every step:
- CLI scripts: colored banner + prominent "TARGET: PRODUCTION" header before the confirmation prompt.
- CI workflows: environment name in job title, PR comment, and Slack notification.
- Console UI: the red/purple banner from the env-switcher design (see console-env-switcher.md).
Normal release path (CI-driven):
push to console/** path → deploy-console.yml fires
→ smoke gate
→ resolve target: staging (auto) or production (manual dispatch)
→ if production: GitHub Environment approval gate (required reviewer) pauses here
→ deploy to Heroku
→ post-deploy /health smoke check (5 retries, 10s intervals)
→ audit entry written
→ PR/commit comment + Slack notification
Break-glass path (local script):
scripts/deploy-console.sh
→ shows colored TARGET: PRODUCTION banner
→ fetches and displays diff (from-SHA → to-SHA, commit log)
→ prompts: "Type 'deploy console to prod' to continue:"
→ validates phrase exactly
→ calls heroku git:remote + git push heroku main
→ polls /health (5 retries)
→ writes audit entry to console_audit_log via Raptor API or direct DB insert
→ prints rollback hint: "heroku releases:rollback --app raxx-console-prod"
The break-glass script exists so that a deploy can proceed if GitHub Actions is unavailable or if a security incident requires bypassing CI entirely. It is not the normal path; the CI workflow is.
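A condensed sketch of what scripts/deploy-console.sh might look like; the remote name and banner styling are illustrative, and the diff-preview and /health snippets shown earlier slot into the marked steps:

```bash
#!/usr/bin/env bash
# Sketch of scripts/deploy-console.sh: banner, exact-phrase gate, push, rollback hint.
set -euo pipefail

APP="raxx-console-prod"
REMOTE="heroku-console-prod"   # assumed git remote name

# ENV visibility: unmissable banner before anything destructive happens.
printf '\033[1;41m TARGET: PRODUCTION (%s) \033[0m\n' "$APP"

# ... diff preview step (see the sketch earlier in this ADR) ...

read -r -p "Type 'deploy console to prod' to continue: " reply
if [[ "$reply" != "deploy console to prod" ]]; then
  echo "Confirmation phrase did not match exactly. Aborting."
  exit 1
fi

git push "$REMOTE" main

# ... /health polling and audit-entry steps (see the sketches elsewhere in this ADR) ...

echo "Deploy complete. Rollback: heroku releases:rollback --app ${APP}"
```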
Rollback:
heroku releases:rollback --app raxx-console-prod
No redeploy, no CI. Operator can do this in under 60 seconds. The previous release is the safety net.
Path-filtered auto-deploy on push to main is acceptable for static content. Blast radius is low — a bad docs deploy does not affect user data or trading. CF Pages retains every deployment and supports one-click rollback via the dashboard.
Additional gate for prod custom domains: any PR that touches docs/ root-level files or console/static/ should receive an automated PR comment (via pr-preview workflow or a new check) that confirms: "Will deploy to docs.raxx.app on merge." This is a visibility check, not a hard gate.
Rollback: CF Pages dashboard "Rollback deployment" or via Wrangler CLI.
The mandatory pre-apply snapshot + terraform plan review is the existing gate. This pattern is adequate. The architect formalizes it:
- Before terraform apply, take a Lightsail instance snapshot. The snapshot name must include the UTC timestamp and the operator's identifier (naming sketched below).
- terraform plan output must be reviewed and confirmed before apply.
- No CI automation is needed for Terraform at the current scale. Runbook-driven is appropriate.
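For illustration only, the snapshot step might look like this; the instance name is a placeholder, and RAXX_OPERATOR_EMAIL is the proposed operator-identity variable from the open questions:

```bash
# Sketch: pre-apply snapshot whose name carries the UTC timestamp and operator identifier.
# "raxx-tickets" is a placeholder instance name.
OPERATOR="${RAXX_OPERATOR_EMAIL%%@*}"
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"

aws lightsail create-instance-snapshot \
  --instance-name raxx-tickets \
  --instance-snapshot-name "pre-apply-${STAMP}-${OPERATOR}"

terraform plan -out=tfplan    # review the plan output before applying
terraform apply tfplan
```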
stateDiagram-v2
[*] --> Triggered: push to console/** path\nor workflow_dispatch
Triggered --> SmokeGate: CI job starts
SmokeGate --> Blocked: smoke fails
Blocked --> [*]: deploy cancelled, PR comment
SmokeGate --> Resolved: smoke passes
Resolved --> ApprovalPending: environment = production
Resolved --> Deploying: environment = staging (auto)
ApprovalPending --> Deploying: reviewer approves
ApprovalPending --> Rejected: reviewer rejects / timeout
Rejected --> [*]: deploy cancelled, Slack DM
Deploying --> HealthCheck: Heroku deploy succeeds
Deploying --> DeployFailed: Heroku deploy fails
DeployFailed --> [*]: audit entry (failure), Slack DM
HealthCheck --> AuditWritten: /health 200
HealthCheck --> RollbackAlert: /health fails after 5 retries
RollbackAlert --> [*]: audit entry (health failure), Slack DM, rollback hint
AuditWritten --> [*]: PR comment, Slack DM, success
sequenceDiagram
participant Dev as Operator / CI
participant GH as GitHub Actions
participant Gate as GH Environment gate
participant Heroku as raxx-console-prod
participant Health as /health endpoint
participant Audit as console_audit_log
participant Slack as Slack (D0AJ7K184TV)
Dev->>GH: workflow_dispatch {environment: production, ref: main}
GH->>GH: smoke gate passes
GH->>GH: resolve: target=production, app=raxx-console-prod
GH->>Gate: pause — waiting for approval
Gate-->>Dev: email + Slack: "Approval required for console → production"
Dev->>Gate: click Approve (GH Actions UI)
Gate->>GH: approved, proceed
GH->>Heroku: git push heroku main (via akhileshns/heroku-deploy)
Heroku-->>GH: deploy complete
GH->>Health: GET /health (5 retries, 10s intervals)
Health-->>GH: 200 OK
GH->>Audit: write {action: console.deploy.prod, from_sha, to_sha, actor, ts}
GH->>Slack: "console deployed to prod — SHA → SHA"
GH-->>Dev: PR/commit comment with deploy summary
sequenceDiagram
participant Op as Operator (local)
participant Script as scripts/deploy-console.sh
participant Heroku as raxx-console-prod
participant Health as /health endpoint
participant Audit as console_audit_log
Op->>Script: ./scripts/deploy-console.sh
Script->>Script: fetch remote HEAD (heroku-console-prod)
Script->>Script: compute diff: local HEAD → remote HEAD
Script-->>Op: display "TARGET: PRODUCTION (raxx-console-prod)"
Script-->>Op: display commit log (from_sha → to_sha)
Script-->>Op: prompt: "Type 'deploy console to prod' to continue:"
Op->>Script: types confirmation phrase
Script->>Script: validate phrase exactly
Script->>Heroku: git push heroku-console-prod main
Heroku-->>Script: deploy complete
Script->>Health: curl /health (5 retries)
Health-->>Script: 200 OK
Script->>Audit: write audit entry (actor=$USER, from_sha, to_sha, ts)
Script-->>Op: "Deploy complete. Rollback: heroku releases:rollback --app raxx-console-prod"
No new schema required. The existing audit_log table in the console database accepts a free-form payload_redacted JSON column. The deploy action adds:
{
"action": "console.deploy.prod",
"actor": "kristerpher@raxx.app",
"from_sha": "abc1234",
"to_sha": "def5678",
"target_app": "raxx-console-prod",
"trigger": "ci" | "local-script",
"outcome": "success" | "health_check_failed" | "deploy_failed",
"ts_utc": "2026-04-29T14:33:00Z"
}
For CI-driven deploys, actor is the GitHub handle of the approving reviewer, extracted from the environment approval event. For local-script deploys, it is $GIT_AUTHOR_EMAIL, with $USER as a fallback.
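A sketch of that fallback order in the break-glass script; RAXX_OPERATOR_EMAIL is only a proposal at this point (see the open questions):

```bash
# Sketch: resolve the actor field for a break-glass deploy.
ACTOR="${RAXX_OPERATOR_EMAIL:-${GIT_AUTHOR_EMAIL:-$USER}}"
if [[ "$ACTOR" == "$USER" ]]; then
  echo "warning: audit actor falls back to \$USER ('${USER}'), which is not an email address" >&2
fi
```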
| Phase | What | Gate |
|---|---|---|
| Sub 1 | deploy-console.yml lands; CI path gates console staging/prod | PR merged, smoke passes |
| Sub 2 | scripts/deploy-console.sh lands; break-glass path available | Sub 1 merged |
| Sub 3 | Audit log integration; deploy entries appear in console audit trail | Sub 2 merged |
| Sub 4 | Post-deploy smoke check wired into both CI and local script | Sub 1 + Sub 2 merged |
| Sub 5 | ENV banner on CLI prompt (Sub 2 refinement) | Sub 2 merged |
| Sub 6 | raxx-console-staging retirement | env-switcher soak complete (≥ 2026-05-07); subs 1–5 stable |
The local script (Sub 2) is the break-glass path for the period between "Sub 1 not yet landed" and "Sub 1 merged." Both paths are additive; neither blocks the other.
PII: actor field in audit log contains the operator's email address. This is intentional — the audit trail must be attributable. Follows existing audit log PII posture (2-year retention, DSR erasure via the operator account erasure path).
Retention: audit entries follow the 2-year platform audit retention policy.
DSR: deploy audit entries are tied to operator admin accounts, not user accounts. If an operator's account is erased, their deploy entries are redacted (actor field nulled, email replaced with a tombstone token) per the existing audit erasure path.
Credential storage: secrets (HEROKU_API_KEY, HEROKU_EMAIL) live in GitHub Environments. The local script reads them from environment variables (HEROKU_API_KEY, set via Infisical dotenv or shell export). Neither the workflow YAML nor the script sources or stores credentials. The confirmation phrase is not a secret.
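As an illustration of the two invocation styles (the Infisical invocation is the CLI's standard run pattern; project-scoping flags are omitted):

```bash
# Sketch: two ways to hand HEROKU_API_KEY to the break-glass script without storing it in the repo.

# (a) Infisical injects the secret into the environment for this invocation only.
infisical run -- ./scripts/deploy-console.sh

# (b) Plain shell export for the current session; the value itself lives in the credential store.
export HEROKU_API_KEY="<paste from credential store>"
./scripts/deploy-console.sh
```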
Audit coverage: every prod deploy — CI or local script — writes to console_audit_log. There is no unaudited prod deploy path after Sub 3 lands.
Breach: a breach of the deploy audit log reveals which SHAs were deployed when and by whom. No secrets in the payload. No trading data. Operator notification via existing breach-notification path.
Kill-switch: Heroku releases:rollback is the kill-switch for a bad deploy. No CI required. Operator can execute in under 60 seconds. The local script prints the rollback command after every successful deploy to ensure it is always one copy-paste away.
Secrets rotation: HEROKU_API_KEY and HEROKU_EMAIL are rotatable in GitHub Environments without redeploying. The local script re-reads them from the environment on every run.
These require a decision before the labeled sub-cards can be claimed for GA. The sub-cards are filed, but none is claimable until its question is resolved.
Audit write path for local script. The local script needs to write to console_audit_log. Two options: (a) POST to a Raptor API endpoint (/api/internal/audit) authenticated via a service account token, or (b) direct Postgres insert via a psql one-liner using DATABASE_URL from the environment. Option (a) is cleaner but requires a new Raptor endpoint. Option (b) is simpler but requires the operator to have DATABASE_URL available locally. Decision needed before Sub 3 is claimed.
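To make the trade-off concrete, the two options might look roughly like this; the Raptor hostname, token variable, and the column set on console_audit_log are assumptions, not decisions:

```bash
# Shared payload, matching the audit schema above (variables are filled in by the script).
PAYLOAD="$(printf '{"action":"console.deploy.prod","actor":"%s","from_sha":"%s","to_sha":"%s","target_app":"raxx-console-prod","trigger":"local-script","outcome":"success","ts_utc":"%s"}' \
  "$ACTOR" "$FROM_SHA" "$TO_SHA" "$(date -u +%Y-%m-%dT%H:%M:%SZ)")"

# Option (a): POST to the proposed Raptor endpoint. Hostname and token variable are assumptions.
curl --fail --silent -X POST "https://raxx-api-prod.herokuapp.com/api/internal/audit" \
  -H "Authorization: Bearer ${RAPTOR_SERVICE_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "${PAYLOAD}"

# Option (b): direct insert. Assumes payload_redacted is the only required column; the real
# table likely has more (id, created_at, ...).
psql "$DATABASE_URL" -c \
  "INSERT INTO console_audit_log (payload_redacted) VALUES ('${PAYLOAD}'::jsonb);"
```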
Approval notification channel. GitHub sends an email when the environment gate pauses. Is a Slack DM to D0AJ7K184TV also required for the console deploy gate, or is email sufficient? (The API deploy gate currently sends email only.)
Smoke check endpoint for console. Does a /health endpoint exist on raxx-console-prod today? If not, Sub 4 depends on Sub 1's deploy-console.yml and on adding a /health route to the console app. Feature-developer must confirm before Sub 4 is claimed.
Local script actor identity. For the break-glass script audit entry, $USER on macOS is the macOS login name, not an email. Should the script require RAXX_OPERATOR_EMAIL to be set (and fail if unset), or accept $USER as a fallback with a warning?