Prod-Deploy Gating — Universal Pattern + Console First Implementation
Status: Proposed
Owner: software-architect
Date: 2026-04-29 UTC
Refs: #81 (SDLC epic), Kristerpher directive 2026-04-29
Related ADRs: 0028, 0029
Related docs: branch-promotion-strategy.md, console-env-switcher.md
1. Context
Today, raxx-console-prod is deployed via a manual git subtree split and push to Heroku. No confirmation, no diff preview, no audit entry, no post-deploy smoke check. v30 shipped this way on 2026-04-29; the operator had no gate between "git push" and "console is updated in production."
The other prod surfaces have partial gating:
| Surface | Heroku app | Current prod gate |
|---|---|---|
| Raptor (backend) | raxx-api-prod | deploy-heroku.yml workflow_dispatch with environment: production input. Has gate. |
| Antlers (frontend) | CF Pages raxx.app | Tag-gated via deploy-antlers.yml (extracted from deleted deploy.yml, #698 / PR #837); production environment gate (ADR-0020). Has gate. |
| Console | raxx-console-prod | Pure manual subtree push. No gate. |
| Docs/mockups | CF Pages | Auto-deploy on path filter; low blast radius. Acceptable for static content. |
| Lightsail (tickets/vault) | N/A (Terraform) | Manual terraform apply with snapshot-before-apply convention. Adequate. |
raxx-console-staging is being retired approximately 2026-05-07, post env-switcher soak (see console-env-switcher.md §8 and ops-runbook). The console is becoming a single prod deployment with an in-app env switcher. This makes the absence of a console deploy gate more acute: there is no second deployment to catch mistakes.
Objective: Define a universal gating pattern applicable to every prod surface, then implement it for the console as the first concrete application.
2. Invariants
All platform invariants apply. Deploy-gating-specific:
- No stored credentials. Deploy secrets (Heroku API key, CLOUDFLARE_API_TOKEN, CLOUDFLARE_EDIT_DNS) live in GitHub Environments secrets or Infisical. They must never appear in workflow YAML files or shell scripts that ship in the repo.
- Audit trail for every prod deploy. Every prod deploy must record: actor identity, source SHA, target SHA, target app name, timestamp (UTC), and outcome. This is a platform invariant per the north-star.
- Explicit confirmation before prod. No prod deploy may fire as a side effect of any automated push to main. The gate must require a deliberate operator action. Accidental production deploys must be structurally impossible, not just unlikely.
- Break-glass preserved. The manual dispatch path (workflow_dispatch) must remain available as an emergency bypass. It is the hotfix and kill-switch path. The gate fires on the break-glass path too, but approval is fast when the operator is already present.
- Kill-switch must survive the gate. A Heroku rollback (heroku releases:rollback) or CF Pages "previous deployment" promotion must be achievable in under two minutes without re-running CI. The gate is on the forward path; rollback is unrestricted.
- Secrets are rotatable without redeploy. Any secret that the deploy gate reads (e.g. CONSOLE_DEPLOY_CONFIRMATION_PHRASE, if used) must be a Heroku config var, not a hardcoded value.
3. Universal Gating Principles
These six principles apply to every prod surface. Each surface implements them differently; the section below maps each principle to each surface.
P1 — Confirmation gate
A human must take an unambiguous action that could not occur by accident. For CI-driven deploys, this is the GitHub Environment required-reviewer approval. For local-script deploys, this is a typed confirmation phrase. The phrase must not be a generic "yes" — it must identify the target (deploy console to prod).
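The typed-phrase check for local-script deploys could look like the following. This is a minimal sketch assuming a POSIX-compatible shell; the helper name confirm_phrase is hypothetical, and the phrase is the one quoted in this document.

```shell
# Hypothetical helper: succeeds only when the typed text matches the
# expected phrase exactly (case-sensitive, no trimming).
confirm_phrase() {
  local expected="$1" typed="$2"
  if [ "$typed" = "$expected" ]; then
    return 0
  fi
  echo "Confirmation phrase did not match; aborting." >&2
  return 1
}

# Usage inside a deploy script (not executed here):
#   printf "Type 'deploy console to prod' to continue: "
#   IFS= read -r answer
#   confirm_phrase "deploy console to prod" "$answer" || exit 1
```

An exact string comparison, rather than a y/n prompt, is what makes the action hard to perform by accident.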
P2 — Diff preview
Before the confirmation gate fires, the operator must see what will actually ship: the from-SHA, the to-SHA, and an abbreviated log of commits in between. This prevents "I thought I was deploying X but actually Y was ahead of me on main" incidents.
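One way to produce that preview is with git log over the from..to range. The sketch below assumes it runs inside a git checkout that contains both commits; the helper name preview_diff is hypothetical.

```shell
# Hypothetical helper: print what would ship between two commits.
# from_sha = commit currently deployed, to_sha = commit about to be deployed.
preview_diff() {
  local from_sha="$1" to_sha="$2"
  if [ "$from_sha" = "$to_sha" ]; then
    echo "Nothing to deploy: target is already at $to_sha"
    return 0
  fi
  echo "FROM: $from_sha"
  echo "TO:   $to_sha"
  # Abbreviated log of commits reachable from to_sha but not from from_sha.
  git log --oneline --no-decorate "${from_sha}..${to_sha}"
}

# Usage (not executed here):
#   preview_diff "$deployed_sha" "$(git rev-parse HEAD)"
```

How the currently deployed SHA is obtained is left open here; it depends on the surface.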
P3 — Audit trail
Every prod deploy writes a structured audit record: actor, from-SHA, to-SHA, app name, timestamp (UTC), and outcome (success/failure). For CI-driven deploys, the GitHub Actions run log and environment log are the canonical audit record. For local-script deploys, the record is written to console_audit_log (action: console.deploy.prod).
P4 — Safety net (rollback path)
The rollback path must be documented and fast:
- Heroku: heroku releases:rollback --app <app-name>. No redeploy, no rebuild.
- CF Pages: dashboard "Rollback deployment" or wrangler pages deployment create --env production <previous-build-dir>.
- Terraform (Lightsail): terraform apply from the previous state snapshot.
The safety net does not require re-running CI. Speed matters during an incident.
P5 — Post-deploy smoke check
After every prod deploy, an automated health check runs against the live endpoint. Failure triggers an alert and, for CI-driven deploys, marks the workflow run failed (creating a clear paper trail). The health check must not require operator action to initiate; it runs automatically as part of the deploy workflow.
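A retry loop of this kind could be sketched as below. It assumes curl is installed and that the endpoint returns HTTP 200 when healthy. The helper names probe_http and poll_health are hypothetical, and the defaults (5 retries, 10-second interval) mirror the figures used later in this document.

```shell
# Hypothetical default probe: succeeds only when the URL answers HTTP 200.
probe_http() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || return 1
  [ "$code" = "200" ]
}

# Hypothetical helper: retry a probe until it succeeds or retries run out.
# Args: url [retries=5] [interval_seconds=10] [probe_command=probe_http]
poll_health() {
  local url="$1" retries="${2:-5}" interval="${3:-10}" probe="${4:-probe_http}"
  local attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$probe" "$url"; then
      echo "health check passed on attempt $attempt"
      return 0
    fi
    echo "health check attempt $attempt/$retries failed" >&2
    if [ "$attempt" -lt "$retries" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

The probe is passed in as a parameter so the loop can be exercised without network access. A non-zero return is what would fail the workflow run or trigger the alert.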
P6 — Environment visibility
The operator must see, unambiguously, which environment they are targeting at every step:
- CLI scripts: colored banner + prominent "TARGET: PRODUCTION" header before the confirmation prompt.
- CI workflows: environment name in job title, PR comment, and Slack notification.
- Console UI: the red/purple banner from the env-switcher design (see console-env-switcher.md).
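For the CLI case, a banner along these lines would satisfy the requirement. This is a sketch that assumes a terminal supporting ANSI escape codes; the helper name print_target_banner and the exact colors are illustrative choices, not part of the design.

```shell
# Hypothetical helper: print a red banner naming the target environment and app.
print_target_banner() {
  local env_name="$1" app_name="$2"
  local red reset
  red=$(printf '\033[1;41;97m')   # bold, red background, bright white text
  reset=$(printf '\033[0m')       # reset all attributes
  printf '%s\n' "${red}==============================================${reset}"
  printf '%s\n' "${red}  TARGET: ${env_name} (${app_name})${reset}"
  printf '%s\n' "${red}==============================================${reset}"
}

# Usage (not executed here):
#   print_target_banner PRODUCTION raxx-console-prod
```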
4. Per-Surface Pattern
4.1 Heroku surfaces (console + API + any future Heroku apps)
Normal release path (CI-driven):
push to console/** path → deploy-console.yml fires
→ smoke gate
→ resolve target: staging (auto) or production (manual dispatch)
→ if production: GitHub Environment approval gate (required reviewer) pauses here
→ deploy to Heroku
→ post-deploy /health smoke check (5 retries, 10s intervals)
→ audit entry written
→ PR/commit comment + Slack notification
Break-glass path (local script):
scripts/deploy-console.sh
→ shows colored TARGET: PRODUCTION banner
→ fetches and displays diff (from-SHA → to-SHA, commit log)
→ prompts: "Type 'deploy console to prod' to continue:"
→ validates phrase exactly
→ calls heroku git:remote + git push heroku main
→ polls /health (5 retries)
→ writes audit entry to console_audit_log via Raptor API or direct DB insert
→ prints rollback hint: "heroku releases:rollback --app raxx-console-prod"
The break-glass script exists so that a deploy can proceed if GitHub Actions is unavailable or if a security incident requires bypassing CI entirely. It is not the normal path; the CI workflow is.
Rollback:
heroku releases:rollback --app raxx-console-prod
No redeploy, no CI. Operator can do this in under 60 seconds. The previous release is the safety net.
4.2 Cloudflare Pages (docs, mockups, marketing)
Path-filtered auto-deploy on push to main is acceptable for static content. Blast radius is low — a bad docs deploy does not affect user data or trading. CF Pages retains every deployment and supports one-click rollback via the dashboard.
Additional gate for prod custom domains: any PR that touches docs/ root-level files or console/static/ should receive an automated PR comment (via pr-preview workflow or a new check) that confirms: "Will deploy to docs.raxx.app on merge." This is a visibility check, not a hard gate.
Rollback: CF Pages dashboard "Rollback deployment" or via Wrangler CLI.
4.3 Lightsail / Terraform (tickets, vault)
The mandatory pre-apply snapshot + terraform plan review is the existing gate. This pattern is adequate. The architect formalizes it:
- Before terraform apply: take a Lightsail instance snapshot. The snapshot name must include the UTC timestamp and the operator's identifier.
- terraform plan output must be reviewed and confirmed before apply.
- Post-apply: verify the service responds (manual health check or scripted ping).
- Audit: the Terraform state file change and the snapshot together constitute the audit trail.
No CI automation needed for Terraform at current scale. Runbook-driven is appropriate.
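The snapshot naming rule above can be made mechanical in the runbook. The sketch below assumes a name of the form instance-operator-timestamp. That format, the helper name snapshot_name, and the instance and operator values in the usage comment are all illustrative.

```shell
# Hypothetical helper: build a snapshot name carrying the operator's
# identifier and a UTC timestamp, e.g. tickets-kris-20260429T143300Z.
snapshot_name() {
  local instance="$1" operator="$2"
  local ts
  ts=$(date -u +%Y%m%dT%H%M%SZ)
  echo "${instance}-${operator}-${ts}"
}

# Usage with the AWS CLI (not executed here; instance name is an assumption):
#   aws lightsail create-instance-snapshot \
#     --instance-name tickets \
#     --instance-snapshot-name "$(snapshot_name tickets kris)"
```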
5. State Machine — Console Prod Deploy (CI path)
stateDiagram-v2
[*] --> Triggered: push to console/** path\nor workflow_dispatch
Triggered --> SmokeGate: CI job starts
SmokeGate --> Blocked: smoke fails
Blocked --> [*]: deploy cancelled, PR comment
SmokeGate --> Resolved: smoke passes
Resolved --> ApprovalPending: environment = production
Resolved --> Deploying: environment = staging (auto)
ApprovalPending --> Deploying: reviewer approves
ApprovalPending --> Rejected: reviewer rejects / timeout
Rejected --> [*]: deploy cancelled, Slack DM
Deploying --> HealthCheck: Heroku deploy succeeds
Deploying --> DeployFailed: Heroku deploy fails
DeployFailed --> [*]: audit entry (failure), Slack DM
HealthCheck --> AuditWritten: /health 200
HealthCheck --> RollbackAlert: /health fails after 5 retries
RollbackAlert --> [*]: audit entry (health failure), Slack DM, rollback hint
AuditWritten --> [*]: PR comment, Slack DM, success
6. Sequence — Console Prod Deploy (CI path)
sequenceDiagram
participant Dev as Operator / CI
participant GH as GitHub Actions
participant Gate as GH Environment gate
participant Heroku as raxx-console-prod
participant Health as /health endpoint
participant Audit as console_audit_log
participant Slack as Slack (D0AJ7K184TV)
Dev->>GH: workflow_dispatch {environment: production, ref: main}
GH->>GH: smoke gate passes
GH->>GH: resolve: target=production, app=raxx-console-prod
GH->>Gate: pause — waiting for approval
Gate-->>Dev: email + Slack: "Approval required for console → production"
Dev->>Gate: click Approve (GH Actions UI)
Gate->>GH: approved, proceed
GH->>Heroku: git push heroku main (via akhileshns/heroku-deploy)
Heroku-->>GH: deploy complete
GH->>Health: GET /health (5 retries, 10s intervals)
Health-->>GH: 200 OK
GH->>Audit: write {action: console.deploy.prod, from_sha, to_sha, actor, ts}
GH->>Slack: "console deployed to prod — SHA → SHA"
GH-->>Dev: PR/commit comment with deploy summary
7. Sequence — Break-Glass Local Deploy
sequenceDiagram
participant Op as Operator (local)
participant Script as scripts/deploy-console.sh
participant Heroku as raxx-console-prod
participant Health as /health endpoint
participant Audit as console_audit_log
Op->>Script: ./scripts/deploy-console.sh
Script->>Script: fetch remote HEAD (heroku-console-prod)
Script->>Script: compute diff: remote HEAD (from_sha) → local HEAD (to_sha)
Script-->>Op: display "TARGET: PRODUCTION (raxx-console-prod)"
Script-->>Op: display commit log (from_sha → to_sha)
Script-->>Op: prompt: "Type 'deploy console to prod' to continue:"
Op->>Script: types confirmation phrase
Script->>Script: validate phrase exactly
Script->>Heroku: git push heroku-console-prod main
Heroku-->>Script: deploy complete
Script->>Health: curl /health (5 retries)
Health-->>Script: 200 OK
Script->>Audit: write audit entry (actor=$USER, from_sha, to_sha, ts)
Script-->>Op: "Deploy complete. Rollback: heroku releases:rollback --app raxx-console-prod"
8. Data Model — Audit Log Entry
No new schema is required. The existing console_audit_log table in the console database has a free-form payload_redacted JSON column. A deploy writes the following payload:
{
"action": "console.deploy.prod",
"actor": "kristerpher@raxx.app",
"from_sha": "abc1234",
"to_sha": "def5678",
"target_app": "raxx-console-prod",
"trigger": "ci" | "local-script",
"outcome": "success" | "health_check_failed" | "deploy_failed",
"ts_utc": "2026-04-29T14:33:00Z"
}
For CI-driven deploys, actor is the approving reviewer's GitHub handle, extracted from the environment approval event and recorded by the GitHub Actions workflow. For local-script deploys, it is $GIT_AUTHOR_EMAIL, with $USER as a fallback.
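A local script could assemble this payload as shown below. It is a sketch only: the helper name build_audit_entry is hypothetical, the values are assumed to contain no double quotes or backslashes, and how the record reaches console_audit_log is deliberately left out because the write path is still an open question.

```shell
# Hypothetical helper: emit one audit record as a single-line JSON object.
# Assumes the values contain no double quotes or backslashes (no escaping is done).
# Args: actor from_sha to_sha trigger outcome
build_audit_entry() {
  local actor="$1" from_sha="$2" to_sha="$3" trigger="$4" outcome="$5"
  local ts
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '{"action":"console.deploy.prod","actor":"%s","from_sha":"%s","to_sha":"%s","target_app":"raxx-console-prod","trigger":"%s","outcome":"%s","ts_utc":"%s"}\n' \
    "$actor" "$from_sha" "$to_sha" "$trigger" "$outcome" "$ts"
}

# Usage (not executed here):
#   build_audit_entry "$actor" "$from_sha" "$to_sha" local-script success
```

Here trigger is one of ci or local-script, and outcome is one of success, health_check_failed, or deploy_failed, as listed in the payload above.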
9. Rollout Plan
| Phase | What | Gate |
|---|---|---|
| Sub 1 | deploy-console.yml lands; CI path gates console staging/prod | PR merged, smoke passes |
| Sub 2 | scripts/deploy-console.sh lands; break-glass path available | Sub 1 merged |
| Sub 3 | Audit log integration; deploy entries appear in console audit trail | Sub 2 merged |
| Sub 4 | Post-deploy smoke check wired into both CI and local script | Sub 1 + Sub 2 merged |
| Sub 5 | ENV banner on CLI prompt (Sub 2 refinement) | Sub 2 merged |
| Sub 6 | raxx-console-staging retirement | env-switcher soak complete (≥ 2026-05-07); subs 1–5 stable |
The local script (Sub 2) is the break-glass path for the period between "Sub 1 not yet landed" and "Sub 1 merged." Both paths are additive; neither blocks the other.
10. Security Considerations
PII: actor field in audit log contains the operator's email address. This is intentional — the audit trail must be attributable. Follows existing audit log PII posture (2-year retention, DSR erasure via the operator account erasure path).
Retention: audit entries follow the 2-year platform audit retention policy.
DSR: deploy audit entries are tied to operator admin accounts, not user accounts. If an operator's account is erased, their deploy entries are redacted (actor field nulled, email replaced with a tombstone token) per the existing audit erasure path.
Credential storage: secrets (HEROKU_API_KEY, HEROKU_EMAIL) live in GitHub Environments. The local script reads them from environment variables (HEROKU_API_KEY, set via Infisical dotenv or shell export). Neither the workflow YAML nor the script sources or stores credentials. The confirmation phrase is not a secret.
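The local script can refuse to run when a credential is missing from the environment instead of falling back to a stored value. A minimal sketch follows, assuming the variables are exported (printenv only sees exported variables); the helper name require_env is hypothetical.

```shell
# Hypothetical helper: fail fast when a required, exported variable is
# missing or empty. The value itself is never printed.
require_env() {
  local name="$1"
  if [ -z "$(printenv "$name")" ]; then
    echo "Missing required environment variable: $name" >&2
    return 1
  fi
}

# Usage at the top of a deploy script (not executed here):
#   require_env HEROKU_API_KEY || exit 1
```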
Audit coverage: every prod deploy — CI or local script — writes to console_audit_log. There is no unaudited prod deploy path after Sub 3 lands.
Breach: a breach of the deploy audit log reveals which SHAs were deployed when and by whom. No secrets in the payload. No trading data. Operator notification via existing breach-notification path.
Kill-switch: Heroku releases:rollback is the kill-switch for a bad deploy. No CI required. Operator can execute in under 60 seconds. The local script prints the rollback command after every successful deploy to ensure it is always one copy-paste away.
Secrets rotation: HEROKU_API_KEY and HEROKU_EMAIL are rotatable in GitHub Environments without redeploying. The local script re-reads them from the environment on every run.
11. Open Questions
These require a decision before the labeled sub-cards can be claimed for GA. Sub-cards are filed but not yet claimable until questions are resolved.
- Audit write path for local script. The local script needs to write to console_audit_log. Two options: (a) POST to a Raptor API endpoint (/api/internal/audit) authenticated via a service account token, or (b) direct Postgres insert via a psql one-liner using DATABASE_URL from the environment. Option (a) is cleaner but requires a new Raptor endpoint. Option (b) is simpler but requires the operator to have DATABASE_URL available locally. Decision needed before Sub 3 is claimed.
- Approval notification channel. GitHub sends an email when the environment gate pauses. Is a Slack DM to D0AJ7K184TV also required for the console deploy gate, or is email sufficient? (The API deploy gate currently sends email only.)
- Smoke check endpoint for console. Does a /health endpoint already exist on raxx-console-prod? If not, Sub 4 depends on both Sub 1's deploy-console.yml and a new /health route in the console app. Feature-developer must confirm before Sub 4 is claimed.
- Local script actor identity. For the break-glass script audit entry, $USER on macOS is the macOS login name, not an email. Should the script require RAXX_OPERATOR_EMAIL to be set (and fail if unset), or accept $USER as a fallback with a warning?
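To make the last question concrete, both behaviors can sit behind a single flag. The sketch below does not decide the question. The helper name resolve_actor and its strict argument are hypothetical; RAXX_OPERATOR_EMAIL and $USER are the names used above.

```shell
# Hypothetical helper sketching both options for the actor identity.
# strict=1 (default): require RAXX_OPERATOR_EMAIL and fail if it is unset.
# strict=0: fall back to $USER and print a warning.
resolve_actor() {
  local strict="${1:-1}"
  if [ -n "${RAXX_OPERATOR_EMAIL:-}" ]; then
    echo "$RAXX_OPERATOR_EMAIL"
    return 0
  fi
  if [ "$strict" = "1" ]; then
    echo "RAXX_OPERATOR_EMAIL is not set; refusing to write an unattributable audit entry." >&2
    return 1
  fi
  echo "Warning: RAXX_OPERATOR_EMAIL is not set; falling back to \$USER." >&2
  echo "${USER:-unknown}"
}

# Usage (not executed here):
#   actor=$(resolve_actor 1) || exit 1
```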