ADR-0095: Deploy Modal Phase Progression — Option A (Fine-Grained Workflow Callbacks)
Date: 2026-05-16 UTC
Status: Accepted
Issue: #2232 (closes #1948)
Design doc: docs/architecture/deploy-modal-phase-progression-2026-05-16.md
Context
The deploy modal's four-step stepper (Smoke gate / Freeze check / Deploy /
Health check) cannot advance beyond step 3 (Deploy) during a live deploy.
deploy-console.yml emits only three status callbacks: building, deploying,
and a terminal succeeded or failed. The smoke and freeze-check jobs run
before the deploy job and are not wired to notify-deploy-status at all.
Three options were presented by sre-agent:
Option A — Add fine-grained callbacks to all four layers (workflow, service model, DB migration, modal JS).
Option B — Infer phase client-side from elapsed time / log_tail keyword matching.
Option C — Simplify the stepper to match actual event granularity (3 states).
Decision
Option A is adopted.
Reasoning
Option B fabricates state. The elapsed-time estimate would advance the stepper
regardless of whether the workflow is actually past the smoke gate. A slow smoke
run or an unexpected stall would produce a false phase display — precisely the
misleading behaviour the operator reported in #1948. The project invariant
(feedback_deterministic_execution_ai_augments.md) prohibits the system from
inferring state it does not actually have: authoritative events, not estimates.
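To illustrate the failure mode described for Option B, the following minimal Python sketch contrasts an elapsed-time guess with an event-driven display. The phase budgets and the function names inferred_phase and reported_phase are hypothetical; none of them come from the modal code.

```python
# Hypothetical per-phase time budgets in seconds (illustrative only).
PHASE_BUDGET_S = [
    ("Smoke gate", 60),
    ("Freeze check", 30),
    ("Deploy", 300),
    ("Health check", 60),
]


def inferred_phase(elapsed_s: float) -> str:
    """Option B style: guess the phase from elapsed time alone."""
    boundary = 0
    for name, budget in PHASE_BUDGET_S:
        boundary += budget
        if elapsed_s < boundary:
            return name
    return PHASE_BUDGET_S[-1][0]


def reported_phase(events: list[str]) -> str:
    """Option A style: show only the phase named by the last received event."""
    return events[-1] if events else "Waiting for first callback"


# A smoke run that has stalled for 200 s; no later callback has arrived.
print(inferred_phase(200))             # Deploy (wrong: the gate has not passed)
print(reported_phase(["Smoke gate"]))  # Smoke gate
```

Under these assumed budgets the time-based guess shows Deploy while the workflow is still held at the smoke gate, which is the kind of false display reported in #1948.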
Option C removes real gates from the operator's view. The smoke gate and freeze-check gate are safety controls. Hiding them from the stepper reduces visibility into which gate a deploy is waiting on. The #1948 user story explicitly requires each phase to produce a visible state transition; Option C satisfies neither the story nor the acceptance criteria.
Option A requires changes across four layers, but each layer is bounded and
independently testable. The callback mechanism is already generic (apply_callback
handles any valid status). The transition graph extension is additive. The workflow
changes amount to two added steps per job. The modal JS routing extends the existing
switch/map structure. The total scope is appropriate for a size:m multi-PR effort.
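A rough Python sketch of what an additive transition graph extension could look like follows. It is not the service's actual model: the four new status names (smoke_running, smoke_passed, freeze_running, freeze_passed), the edges between statuses, and the apply_status helper standing in for apply_callback are all assumptions made for illustration. The sketch also assumes the new statuses precede building, since the smoke and freeze-check jobs run before the deploy job.

```python
# Existing statuses named in this ADR; the edges between them are assumed.
EXISTING = {
    "building": {"deploying", "failed"},
    "deploying": {"succeeded", "failed"},
    "succeeded": set(),
    "failed": set(),
}

# Hypothetical new statuses; adding them leaves every existing entry untouched.
NEW = {
    "smoke_running": {"smoke_passed", "failed"},
    "smoke_passed": {"freeze_running", "failed"},
    "freeze_running": {"freeze_passed", "failed"},
    "freeze_passed": {"building", "failed"},
}

GRAPH = {**EXISTING, **NEW}


class InvalidTransition(ValueError):
    """Raised when a callback names an unknown status or a disallowed edge."""


def apply_status(current: str, new: str, graph: dict[str, set[str]]) -> str:
    """Hypothetical stand-in for apply_callback: validate, then return the new status."""
    if new not in graph:
        raise InvalidTransition(f"unknown status: {new}")
    if new not in graph[current]:
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new


# Walk one successful deploy through the extended graph.
status = "smoke_running"
for nxt in ["smoke_passed", "freeze_running", "freeze_passed",
            "building", "deploying", "succeeded"]:
    status = apply_status(status, nxt, GRAPH)
print(status)  # succeeded
```

Because the new entries are merged in without rewriting EXISTING, a graph check of this shape accepts the old building / deploying / succeeded / failed sequence exactly as before.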
Consequences
Positive:
- Stepper accurately reflects real workflow state at each phase.
- Audit log gains four new granular transition events per deploy.
- No fabricated state; operator can distinguish smoke-gate delay from deploy delay.
Negative:
- Four layers must ship in coordinated PRs (or one large PR).
- New status values require a DB migration (additive; low risk).
- New callback steps in smoke and freeze-check jobs must use
continue-on-error: true to tolerate 422 rejections during a code rollback window.
Neutral:
- Existing building / deploying / succeeded / failed behaviour is unchanged.
- Non-console-triggered deploys (push-to-main, break-glass) are unaffected — they
supply no console_deploy_id and the action no-ops silently.
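The two callback behaviours listed above (a silent no-op without console_deploy_id, and tolerance of 422 rejections) can be sketched in Python as follows. This is an illustration under stated assumptions, not the notify-deploy-status action itself: the function signature, the injected post callable, the returned strings, and the status name smoke_running are hypothetical. The 422 case presumably arises when the service has been rolled back to a version that does not recognise the new statuses.

```python
from typing import Callable, Optional

# Hypothetical transport: takes (deploy_id, status), returns an HTTP status code.
PostFn = Callable[[str, str], int]


def notify_deploy_status(console_deploy_id: Optional[str],
                         status: str,
                         post: PostFn) -> str:
    """Send one status callback, mirroring the behaviours described in this ADR."""
    if not console_deploy_id:
        # Push-to-main and break-glass deploys carry no id: do nothing.
        return "skipped"
    code = post(console_deploy_id, status)
    if code == 422:
        # The service rejected the status; do not fail the gate job over it.
        return "tolerated"
    if code >= 400:
        raise RuntimeError(f"callback failed with HTTP {code}")
    return "sent"


def rolled_back_service(deploy_id: str, status: str) -> int:
    """Fake service that only knows the pre-existing statuses."""
    known = {"building", "deploying", "succeeded", "failed"}
    return 200 if status in known else 422


print(notify_deploy_status(None, "smoke_running", rolled_back_service))   # skipped
print(notify_deploy_status("d-1", "smoke_running", rolled_back_service))  # tolerated
print(notify_deploy_status("d-1", "building", rolled_back_service))       # sent
```

Note that the sketch tolerates only 422, whereas continue-on-error: true in a workflow step ignores any failure of that step, so the real workflow setting is broader than what is modelled here.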
Alternatives Considered
| Option | Rejected reason |
|---|---|
| B — Client-side inference | Fabricates phase state; violates deterministic-execution invariant |
| C — Simplify stepper | Removes real gate visibility; does not satisfy #1948 AC |