ADR 0053 — New-surface deploy workflow template structure

Status: Accepted Date: 2026-05-06 UTC Refs: #649, ADR-0052, docs/architecture/new-surface-convention.md

Context

Existing deploy workflows (deploy-console.yml, deploy-customer-docs.yml, deploy-internal-docs.yml, deploy-status-page.yml) converged organically on a shared structure: freeze-check → build → deploy → health-check → audit emit → notify. But each workflow re-implements that structure from scratch. When a step was improved in one workflow (e.g., the Slack DM on health-check failure landing in deploy-console.yml) it was not propagated to others.

We need to codify the required job sequence so new surfaces get it right the first time and so deviations are visible rather than silent.

Decision

Every deploy workflow for a new surface must contain these steps, in this order, with the documented behavior:

freeze-check — .github/actions/check-deploy-freeze. Runs before any build or deploy step. On Tier B prod-only surfaces, skipped for staging.
Notify building — .github/actions/notify-deploy-status with status: building. Fires at start of build-and-deploy job.
Load vault secrets — .github/actions/load-vault-secrets. No secrets from environment variables or hardcoded values.
Build — surface-specific. Fails fast; failing build blocks deploy.
Infrastructure bootstrap — idempotent steps for CF Pages project creation, DNS CNAME, CF Access (if internal). Safe to re-run.
Notify deploying — .github/actions/notify-deploy-status with status: deploying.
Deploy — wrangler (Tier A/A+) or heroku git push (Tier B).
Health check — 5 retries × 10 s. GET <health-url>/health → 200. Stale data annotations do not fail the gate (pattern from deploy-console.yml).
Emit audit event — success — continue-on-error: true. POST to console internal audit endpoint.
Emit audit event — failure — if: failure(), continue-on-error: true.
Slack DM on prod health-check failure — if: failure(), prod-only, continue-on-error: true. Channel D0AJ7K184TV.
Notify succeeded / failed — .github/actions/notify-deploy-status.

The scaffold script generates this structure from a template. Deviations from the order require a comment explaining the reason.

Consequences

New surfaces start with the full audit/notify/freeze pattern rather than discovering it incrementally.
The console_deploy_id input (for callback tracking) is included in every workflow even if the surface is not yet wired to the console deploy UI — it is a no-op when empty, and its presence means wiring is a one-line change.
The audit emit step uses continue-on-error: true so a console outage during a deploy does not block the deploy itself.
Existing workflows are not required to be retrofitted; this ADR governs new surfaces.

Alternatives Considered

Reusable workflow (workflow_call): A single reusable workflow would enforce the structure mechanically. Rejected because the build step is too surface-specific to parameterise cleanly today. A future ADR could revisit this if the number of surfaces grows to 10+.

Composite action for audit emit: The audit emit block (30 lines of Python) is duplicated across workflows. A composite action would DRY it. Deferred to a separate card — the Python inline block is readable and the duplication is contained to one block per workflow.

Skip freeze-check for Tier A surfaces: CF Pages deploys are lower risk than Heroku deploys, so the freeze check adds latency without proportional value. Rejected: the freeze gate is an ops-wide invariant, not risk-weighted. A frozen deploy window means all surfaces are frozen.