Deploy Modal Phase Progression
Issue: #2232 (closes #1948)
Date: 2026-05-16 UTC
ADR: docs/architecture/adr/0095-deploy-modal-phase-option-a.md
1. Context
The deploy modal has four labeled steps in its in-flight stepper: Smoke gate /
Freeze check / Deploy / Health check. The modal cannot advance through them
during a live deploy because deploy-console.yml emits only three status
callbacks — building, deploying, succeeded/failed — and neither the
smoke job nor the freeze-check job has any callback wired into
notify-deploy-status at all.
The result: the operator cannot distinguish a stalled deploy from one that is still in the smoke gate. The four-step label row is static; only steps 3 (Deploy) and 4 (Health check) can ever light up from real events.
SRE root-cause diagnosis (issue #2232) identified three options: add fine-grained workflow callbacks across four layers (Option A), infer phase client-side from elapsed time (Option B), or simplify the stepper to match actual event granularity (Option C).
2. Invariants
- No credential exposure. The notify-deploy-status action carries an HMAC secret (DEPLOY_CALLBACK_HMAC_SECRET). New invocations must pass the same secret through the existing action input; no new secret surfaces.
- Audit trail. Every deploy status transition already writes a console.deploy.callback audit row via apply_callback. New status values (smoke_gate_started, smoke_gate_passed, freeze_check_started, freeze_check_passed) must flow through the same apply_callback path to land in the audit log.
- Deterministic execution. The deploy path is rule-based. The stepper must reflect authoritative events emitted by the workflow, not estimated timing. Option B (inference) violates this principle: it would display state the system does not actually know. The operator's preference, documented in project memory (feedback_deterministic_execution_ai_augments.md), is that AI augments understanding but never fabricates state. The same principle applies here: the UI must not fabricate phase transitions.
- Forward-only transitions. The existing _VALID_TRANSITIONS graph in deploys.py enforces forward-only status changes. New statuses must slot into the graph at the correct positions and must never allow backward movement.
- No stored credentials. Not applicable to this feature.
- Paper-first gating. Not applicable to this feature.
3. Decision: Option A — Fine-Grained Workflow Callbacks
Option B is rejected. Client-side elapsed-time inference would display phase transitions that are not grounded in authoritative system state. A smoke gate that takes longer than expected would show a false "freeze check" phase. A stall would silently advance the stepper. This is precisely the kind of fabricated state the deterministic-execution invariant prohibits.
Option C is rejected. Simplifying the stepper to three states discards visibility that already exists in the workflow's job structure. The smoke and freeze-check jobs are real, sequenced, meaningful gates. The operator story (#1948) specifically requires each phase to produce a visible state transition. Removing steps satisfies neither the user story nor the acceptance criteria.
Option A is chosen. The four steps are real jobs in the workflow. Wiring callbacks into them produces accurate, authoritative events. The scope is bounded: four new status values, one additive DB migration (no new column; at most an expanded CHECK constraint on status, see §7 Layer 3), four new workflow steps, and one modal JS routing update. Each layer is isolated enough to test independently.
4. Data Model
4.1 New DeployStatus values
Add to DeployStatus enum in console/app/models/console_deploy.py:
SMOKE_GATE_STARTED = "smoke_gate_started"
SMOKE_GATE_PASSED = "smoke_gate_passed"
FREEZE_CHECK_STARTED = "freeze_check_started"
FREEZE_CHECK_PASSED = "freeze_check_passed"
4.2 Updated transition graph
requested
└─> dispatched
    └─> smoke_gate_started
        └─> smoke_gate_passed
            └─> freeze_check_started (only if prod; otherwise → building)
                └─> freeze_check_passed
                    └─> building
                        └─> deploying
                            └─> succeeded | failed | timed_out
        └─> failed (smoke gate failed; terminal)
    └─> freeze_check_started (staging skips smoke on direct dispatch; see §4.3)
    └─> building (existing fast path; backward-compat for non-console triggers)
    └─> failed
    └─> timed_out
Backward-compatibility rule: The existing dispatched → building edge is
preserved. Non-console-triggered deploys (push-to-main, break-glass gh workflow
run) do not supply console_deploy_id and thus send no callbacks at all.
The reconciler, seeder, and version manager already handle rows that stay at
dispatched with no intermediate callbacks. No change is needed for that path.
4.3 Freeze-check conditionality
freeze-check runs only for production workflow_dispatch. On a staging deploy
the workflow skips it (existing if: condition). The transition graph must allow
smoke_gate_passed → building for staging. Feature-developer must handle this in
_VALID_TRANSITIONS:
smoke_gate_passed → { freeze_check_started, building, failed }
4.4 _CALLBACK_VALID_STATUSES expansion
Add the four new values to the _CALLBACK_VALID_STATUSES frozenset in deploys.py.
5. API / Contracts
5.1 Callback action inputs (no change to action interface)
notify-deploy-status already accepts status as a free-form string. No action
interface change is needed — the new status strings are valid inputs as-is. The
validation gate moves to _CALLBACK_VALID_STATUSES in the service layer.
5.2 GET /api/internal/deploys/<id> poll response
The poll response already returns status. No new fields are required. The modal
JS only reads the status field to advance the stepper, so the existing contract
is sufficient.
5.3 Audit contract
Each new status callback emits a console.deploy.callback audit row with
action = "console.deploy.callback" and status = <new_value>. No audit schema
changes are needed — the payload is free-form JSON.
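To make the contract concrete, the following is a rough in-memory sketch of the path a callback takes: validate the status, validate the transition, update the row, and append an audit row. It is not the real apply_callback. DeployRow, CallbackRejected, the audit_log list, and the payload keys deploy_id and previous_status are hypothetical, and only a few edges of the graph are included.

```python
import json
from dataclasses import dataclass, field

CALLBACK_VALID_STATUSES = frozenset({
    "smoke_gate_started", "smoke_gate_passed",
    "freeze_check_started", "freeze_check_passed",
    "building", "deploying", "succeeded", "failed",
})

# Only the edges needed for this illustration; the full graph is in §4.2.
VALID_TRANSITIONS = {
    "dispatched": {"smoke_gate_started", "building"},
    "smoke_gate_started": {"smoke_gate_passed", "failed"},
    "smoke_gate_passed": {"freeze_check_started", "building", "failed"},
}


class CallbackRejected(Exception):
    """Raised when a callback carries an unknown status or an invalid transition."""


@dataclass
class DeployRow:
    id: str
    status: str = "dispatched"
    log_lines: list = field(default_factory=list)


def apply_callback(row, status, log_line, audit_log):
    # Gate 1: the status string must be one the service layer knows about.
    if status not in CALLBACK_VALID_STATUSES:
        raise CallbackRejected(f"unknown status: {status}")
    # Gate 2: the move must be a declared forward edge from the current status.
    if status not in VALID_TRANSITIONS.get(row.status, set()):
        raise CallbackRejected(f"invalid transition: {row.status} -> {status}")
    previous = row.status
    row.status = status
    if log_line:
        row.log_lines.append(log_line)
    # Same audit action for old and new statuses; the payload is free-form JSON.
    audit_log.append({
        "action": "console.deploy.callback",
        "payload": json.dumps({
            "deploy_id": row.id,
            "status": status,
            "previous_status": previous,
        }),
    })
    return row
```

A rejected callback leaves both the row and the audit log untouched, which is the property the rollback plan in §7 relies on.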
6. Sequence Diagram
sequenceDiagram
participant WF as GitHub Actions (deploy-console.yml)
participant CB as Console /api/internal/deploys/<id>/callback
participant DB as console_deploys (Postgres)
participant Modal as Browser modal (SSE/poll)
Note over WF: smoke job starts
WF->>CB: POST status=smoke_gate_started
CB->>DB: UPDATE status, append log_line, emit audit
Modal->>CB: GET poll → smoke_gate_started → stepper row 1 active
Note over WF: smoke job succeeds
WF->>CB: POST status=smoke_gate_passed
CB->>DB: UPDATE status, emit audit
Modal->>CB: GET poll → smoke_gate_passed → stepper row 1 done
Note over WF: freeze-check job starts (prod only)
WF->>CB: POST status=freeze_check_started
CB->>DB: UPDATE status, emit audit
Modal->>CB: GET poll → freeze_check_started → stepper row 2 active
Note over WF: freeze-check job passes (prod only)
WF->>CB: POST status=freeze_check_passed
CB->>DB: UPDATE status, emit audit
Modal->>CB: GET poll → freeze_check_passed → stepper row 2 done
Note over WF: deploy job starts (Heroku push)
WF->>CB: POST status=building
CB->>DB: UPDATE status (existing behaviour)
Modal->>CB: GET poll → building → stepper row 3 active
Note over WF: Heroku push done, health check pending
WF->>CB: POST status=deploying
CB->>DB: UPDATE status (existing behaviour)
Modal->>CB: GET poll → deploying → stepper row 3 done, row 4 active
Note over WF: health check passes
WF->>CB: POST status=succeeded
CB->>DB: UPDATE status, notify Slack
Modal->>CB: GET poll → succeeded → stepper row 4 done, done state
7. File Changes by Layer
Layer 1: Workflow (deploy-console.yml)
File: .github/workflows/deploy-console.yml
In the smoke job, add two steps:
- name: Notify — smoke gate started
  uses: ./.github/actions/notify-deploy-status
  # A 422 from older service code must not fail the deploy (see Layer 3 rollback plan).
  continue-on-error: true
  with:
    console_deploy_id: ${{ inputs.console_deploy_id }}
    status: smoke_gate_started
    log_line: "Smoke gate started."
    console_url: ${{ vars.CONSOLE_INTERNAL_URL || '...' }}
    hmac_secret: ${{ secrets.DEPLOY_CALLBACK_HMAC_SECRET }}

# ... existing smoke steps ...

- name: Notify — smoke gate passed
  if: success()
  uses: ./.github/actions/notify-deploy-status
  continue-on-error: true
  with:
    console_deploy_id: ${{ inputs.console_deploy_id }}
    status: smoke_gate_passed
    log_line: "Smoke gate passed."
    console_url: ...
    hmac_secret: ${{ secrets.DEPLOY_CALLBACK_HMAC_SECRET }}
In the freeze-check job, similarly bracket the composite action call with
freeze_check_started (before) and freeze_check_passed (after, if: success()).
Constraint: The new callbacks in the smoke and freeze-check jobs should only
fire when inputs.console_deploy_id is non-empty. The notify-deploy-status action
already no-ops silently when console_deploy_id is empty (non-console triggers),
so no guard is needed beyond what already exists.
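For orientation, here is a hedged sketch of what a signed callback and its no-op guard could look like. The internals of notify-deploy-status are not described in this document, so the signing scheme (HMAC-SHA256 over the raw JSON body), the X-Signature-SHA256 header name, the body fields, and both function names are assumptions; only the callback path comes from the sequence diagram in §6.

```python
import hashlib
import hmac
import json


def build_callback_request(console_deploy_id, status, log_line, hmac_secret):
    """Return (path, body, headers), or None when there is nothing to notify."""
    if not console_deploy_id:
        # Non-console trigger: mirror the action's silent no-op.
        return None
    body = json.dumps(
        {"status": status, "log_line": log_line},
        separators=(",", ":"),
        sort_keys=True,
    ).encode("utf-8")
    signature = hmac.new(hmac_secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    path = f"/api/internal/deploys/{console_deploy_id}/callback"
    headers = {
        "Content-Type": "application/json",
        "X-Signature-SHA256": signature,  # hypothetical header name
    }
    return path, body, headers


def verify_signature(body, signature, hmac_secret):
    """Recompute the digest server-side and compare in constant time."""
    expected = hmac.new(hmac_secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Under this scheme the four new status values need no new secret: the same key signs every callback body, whatever status it carries.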
Layer 2: Service model (deploys.py + console_deploy.py)
File: console/app/models/console_deploy.py
- Extend DeployStatus enum with four new values (§4.1).
File: console/app/services/deploys.py
- Add four new values to _CALLBACK_VALID_STATUSES.
- Update _VALID_TRANSITIONS with the new edges (§4.2).
- No changes to apply_callback logic; it is generic and handles any valid status.
Layer 3: DB migration
File: new migration in console/migrations/versions/
The new status values are strings stored in the existing status VARCHAR column.
No column addition is needed. However, the migration should add the new values to
any CHECK constraint on the status column if one exists.
Inspect the existing migration chain for a CHECK constraint on
console_deploys.status. If present, add an ALTER TABLE to expand it. If absent
(VARCHAR with no CHECK), no migration DDL is needed — only a marker migration
documenting the new valid values.
Rollback plan: If the new status values cause a production incident, the
rollback is:
1. heroku releases:rollback for the console app (reverts Python code).
2. The DB migration is additive (no column drop); rows with new status values
that arrived before rollback will have unrecognised status strings in the
old code. The old code's _CALLBACK_VALID_STATUSES rejects unknown callbacks,
so the deploy row stays at the last known good status. No data loss.
3. The workflow callbacks for the new steps will 422 against the old service code
(unknown status) — continue-on-error: true must be set on new callback steps
to prevent that 422 from failing the overall deploy workflow.
Layer 4: Modal JS (_deploy_modal.html)
File: console/app/templates/dashboard/_deploy_modal.html
Update the STEPS array and the poll response handler to route new status values
to stepper positions:
var STEPS = [
  { key: 'smoke_gate', label: 'Smoke gate' },
  { key: 'freeze_check', label: 'Freeze check' },
  { key: 'deploy', label: 'Deploy' },
  { key: 'health_check', label: 'Health check' },
];
// Status → stepper mapping (in the poll handler switch/if chain):
// 'smoke_gate_started' → step 0 active
// 'smoke_gate_passed' → step 0 done
// 'freeze_check_started' → step 1 active
// 'freeze_check_passed' → step 1 done
// 'building' → step 2 active (existing)
// 'deploying' → step 2 done, step 3 active (existing)
// 'succeeded' → step 3 done → done state (existing)
// 'failed' → failure state (existing)
The FAILURE_STAGE_MAP must also be reviewed. It maps smoke_gate_failed and
freeze_check_failed to stepper failure positions, and these keys already exist in
the map (see current template lines 2885–2919); confirm they still point at the
correct positions in the STEPS array and update them only if they do not.
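The status-to-stepper mapping above can be written down as a small reference model, for example as the expected values in a Python-side test of the modal. This is a sketch rather than the modal's JS: stepper_state is a hypothetical name, rendering every step before the current one as done is an assumption (the document does not say how a skipped freeze check is displayed on staging), and failure rendering via FAILURE_STAGE_MAP is not modeled.

```python
STEPS = ["smoke_gate", "freeze_check", "deploy", "health_check"]

# status -> (index of the step the status refers to, state of that step)
_STATUS_TO_STEP = {
    "smoke_gate_started": (0, "active"),
    "smoke_gate_passed": (0, "done"),
    "freeze_check_started": (1, "active"),
    "freeze_check_passed": (1, "done"),
    "building": (2, "active"),
    "deploying": (3, "active"),   # step 2 done, step 3 active
    "succeeded": (3, "done"),
}


def stepper_state(status):
    """Return one of 'done' / 'active' / 'pending' per step, or None if unmapped.

    Statuses with no stepper position here (requested, dispatched, failed,
    timed_out) return None; the caller decides how to render them.
    """
    if status not in _STATUS_TO_STEP:
        return None
    index, state = _STATUS_TO_STEP[status]
    # Assumption: every step before the current one is rendered as done.
    return ["done"] * index + [state] + ["pending"] * (len(STEPS) - index - 1)
```

For instance, stepper_state("freeze_check_started") yields done for the smoke gate, active for the freeze check, and pending for the remaining two steps.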
8. Migrations
Single additive migration:
- Inspect console_deploys.status column for a CHECK constraint.
- If CHECK exists: ALTER TABLE console_deploys DROP CONSTRAINT <name>; ALTER TABLE console_deploys ADD CONSTRAINT console_deploys_status_check CHECK (status IN (...existing..., 'smoke_gate_started', 'smoke_gate_passed', 'freeze_check_started', 'freeze_check_passed'));
- If no CHECK: migration is a no-op DDL but documents the new values.
Rollback: The new enum values are additive. Removing them requires a follow-on
migration that first bulk-updates any rows with new status values to failed and
then restores the old CHECK constraint (if it existed); re-adding the constraint
first would fail while such rows remain. The workflow rollback (step 3 in §7,
Layer 3) prevents new rows from entering the new status values after a code rollback.
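If a CHECK constraint does exist, the two statements can be generated rather than hand-written, which keeps the status list in one place. The helper below is a sketch under assumptions: build_status_check_sql is a hypothetical name, the old constraint name and the existing status list must be read from the real schema, and the values are trusted, code-defined identifiers rather than user input.

```python
NEW_STATUSES = (
    "smoke_gate_started",
    "smoke_gate_passed",
    "freeze_check_started",
    "freeze_check_passed",
)


def build_status_check_sql(old_constraint_name, existing_statuses,
                           new_statuses=NEW_STATUSES,
                           table="console_deploys",
                           new_constraint_name="console_deploys_status_check"):
    """Return the DROP / ADD CONSTRAINT statements that widen the status CHECK."""
    # Keep existing values first and skip any new value that is already allowed.
    allowed = list(existing_statuses)
    allowed += [s for s in new_statuses if s not in allowed]
    values = ", ".join(f"'{s}'" for s in allowed)
    return [
        f"ALTER TABLE {table} DROP CONSTRAINT {old_constraint_name}",
        f"ALTER TABLE {table} ADD CONSTRAINT {new_constraint_name} "
        f"CHECK (status IN ({values}))",
    ]
```

If the migration chain is Alembic, each returned string could be passed to op.execute inside upgrade(); that tooling choice is likewise an assumption, not something this document states.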
9. Rollout Plan
| Phase | Gate | Description |
|---|---|---|
| Dark | None | Migration + model + service changes deploy; new callback statuses are accepted but workflow does not emit them yet |
| Flag | FLAG_CONSOLE_DEPLOY_PHASE_CALLBACKS | Workflow steps guarded by flag check; modal JS routing always-on (backward-compat with old statuses) |
| Beta | Staging first | Enable flag on staging; verify stepper advances through all four positions on a staging deploy |
| GA | Prod | Enable flag on prod after one successful staging observation |
The flag must be added to feature_flags.yaml with a console_flag_promotions
migration in the same PR (per feedback_new_flag_needs_b1_migration_same_pr.md).
10. Test Plan
- Unit (service layer): Add test cases in test_deploy_observability_636.py or a new test_deploy_phase_callbacks_2232.py for each new _VALID_TRANSITIONS edge. Verify apply_callback accepts new statuses and rejects backward transitions.
- Unit (modal JS): Extend test_promotion_deploy_spinner_1519.py or add a new test that mounts the modal JS and simulates GET poll responses with each new status value, asserting the correct step index becomes active.
- Integration (workflow): Manual staging deploy with console_deploy_id set; confirm the console_deploys row transitions through the new statuses in order (per §4.3, staging skips freeze-check, so only smoke_gate_started and smoke_gate_passed are expected there).
- Regression: Existing tests for building, deploying, succeeded, failed must continue to pass unmodified.
- Rollback smoke: Deploy old service code against a DB with new-status rows; confirm no panic, no data loss, existing terminal-state display intact.
11. Security Considerations
- No new secrets introduced. The DEPLOY_CALLBACK_HMAC_SECRET already covers all callback invocations including the new steps.
- New callback steps in smoke and freeze-check jobs must set continue-on-error: true so a 422 rejection (e.g. during a code rollback window) does not abort a successful deploy.
- The console_deploy_id input is validated as a UUID by the existing callback route before any DB access. No injection surface is added.
- Audit trail: all new status transitions land in console.deploy.callback audit rows via the existing apply_callback path. No audit gap is introduced.
12. Open Questions
None blocking sub-card dispatch. The flag name FLAG_CONSOLE_DEPLOY_PHASE_CALLBACKS
is a suggestion; feature-developer may use a shorter name if it fits the existing
flag naming convention better.