Raxx · internal docs


Deploy Modal Phase Progression

Issue: #2232 (closes #1948)
Date: 2026-05-16 UTC
ADR: docs/architecture/adr/0095-deploy-modal-phase-option-a.md


1. Context

The deploy modal has four labeled steps in its in-flight stepper: Smoke gate / Freeze check / Deploy / Health check. The modal cannot advance through them during a live deploy because deploy-console.yml emits only three status callbacks — building, deploying, succeeded/failed — and neither the smoke job nor the freeze-check job has any callback wired into notify-deploy-status at all.

The result: the operator cannot distinguish a stalled deploy from one that is still in the smoke gate. The four-step label row is static; only steps 3 (Deploy) and 4 (Health check) can ever light up from real events.

SRE root-cause diagnosis (issue #2232) identified three options: add fine-grained workflow callbacks across four layers (Option A), infer phase client-side from elapsed time (Option B), or simplify the stepper to match actual event granularity (Option C).


2. Invariants


3. Decision: Option A — Fine-Grained Workflow Callbacks

Option B is rejected. Client-side elapsed-time inference would display phase transitions that are not grounded in authoritative system state. A smoke gate that takes longer than expected would show a false "freeze check" phase. A stall would silently advance the stepper. This is precisely the kind of fabricated state the deterministic-execution invariant prohibits.

Option C is rejected. Simplifying the stepper to three states discards visibility that already exists in the workflow's job structure. The smoke and freeze-check jobs are real, sequenced, meaningful gates. The operator story (#1948) specifically requires each phase to produce a visible state transition. Removing steps satisfies neither the user story nor the acceptance criteria.

Option A is chosen. The four steps are real jobs in the workflow, so wiring callbacks into them produces accurate, authoritative events. The scope is bounded: four new status values, one additive DB migration (no new column; it only widens a CHECK constraint on status if one exists), four new workflow steps, and one modal JS routing update. Each layer is isolated enough to test independently.


4. Data Model

4.1 New DeployStatus values

Add to DeployStatus enum in console/app/models/console_deploy.py:

SMOKE_GATE_STARTED   = "smoke_gate_started"
SMOKE_GATE_PASSED    = "smoke_gate_passed"
FREEZE_CHECK_STARTED = "freeze_check_started"
FREEZE_CHECK_PASSED  = "freeze_check_passed"

4.2 Updated transition graph

requested
  └─> dispatched
        └─> smoke_gate_started
              └─> smoke_gate_passed
                    └─> freeze_check_started  (only if prod; otherwise → building)
                          └─> freeze_check_passed
                                └─> building
                                      └─> deploying
                                            └─> succeeded | failed | timed_out
              └─> failed          (smoke gate failed — terminal)
        └─> freeze_check_started  (staging skips smoke on direct dispatch; see §4.3)
        └─> building               (existing fast path; backward-compat for non-console triggers)
        └─> failed
        └─> timed_out

Backward-compatibility rule: The existing dispatched → building edge is preserved. Non-console-triggered deploys (push-to-main, break-glass gh workflow run) do not supply console_deploy_id and thus send no callbacks at all. The reconciler, seeder, and version manager already handle rows that stay at dispatched with no intermediate callbacks. No change is needed for that path.

4.3 Freeze-check conditionality

freeze-check runs only for production workflow_dispatch. On a staging deploy the workflow skips it (existing if: condition). The transition graph must allow smoke_gate_passed → building for staging. Feature-developer must handle this in _VALID_TRANSITIONS:

smoke_gate_passed → { freeze_check_started, building, failed }
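
To make the graph in §4.2 and the staging rule above concrete, here is a minimal sketch of a transition table. It assumes _VALID_TRANSITIONS is a mapping from a status to the set of statuses it may move to; the real shape may differ. The member names for the pre-existing statuses and the is_valid_transition helper are hypothetical. Only edges stated in §4.2 and §4.3 are included, so any other failure or timeout edges the real map carries are omitted.

```python
from enum import Enum


class DeployStatus(str, Enum):
    # Pre-existing statuses (member names assumed from the status strings).
    REQUESTED = "requested"
    DISPATCHED = "dispatched"
    BUILDING = "building"
    DEPLOYING = "deploying"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"
    # New statuses from section 4.1.
    SMOKE_GATE_STARTED = "smoke_gate_started"
    SMOKE_GATE_PASSED = "smoke_gate_passed"
    FREEZE_CHECK_STARTED = "freeze_check_started"
    FREEZE_CHECK_PASSED = "freeze_check_passed"


S = DeployStatus

# Only the edges drawn in section 4.2 plus the rule in section 4.3.
_VALID_TRANSITIONS: dict[DeployStatus, frozenset[DeployStatus]] = {
    S.REQUESTED: frozenset({S.DISPATCHED}),
    S.DISPATCHED: frozenset({
        S.SMOKE_GATE_STARTED,
        S.FREEZE_CHECK_STARTED,
        S.BUILDING,  # existing fast path, kept for backward compatibility
        S.FAILED,
        S.TIMED_OUT,
    }),
    S.SMOKE_GATE_STARTED: frozenset({S.SMOKE_GATE_PASSED, S.FAILED}),
    # Staging has no freeze check, so building must be reachable directly.
    S.SMOKE_GATE_PASSED: frozenset({S.FREEZE_CHECK_STARTED, S.BUILDING, S.FAILED}),
    S.FREEZE_CHECK_STARTED: frozenset({S.FREEZE_CHECK_PASSED}),
    S.FREEZE_CHECK_PASSED: frozenset({S.BUILDING}),
    S.BUILDING: frozenset({S.DEPLOYING}),
    S.DEPLOYING: frozenset({S.SUCCEEDED, S.FAILED, S.TIMED_OUT}),
}


def is_valid_transition(current: DeployStatus, new: DeployStatus) -> bool:
    """Return True if `new` is an allowed successor of `current`."""
    # Terminal statuses have no entry, so they allow no successors.
    return new in _VALID_TRANSITIONS.get(current, frozenset())
```

With this table, a staging deploy can go from smoke_gate_passed straight to building, while a jump from smoke_gate_started to building is rejected.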

4.4 _CALLBACK_VALID_STATUSES expansion

Add the four new values to the _CALLBACK_VALID_STATUSES frozenset in deploys.py.


5. API / Contracts

5.1 Callback action inputs (no change to action interface)

notify-deploy-status already accepts status as a free-form string. No action interface change is needed — the new status strings are valid inputs as-is. The validation gate moves to _CALLBACK_VALID_STATUSES in the service layer.

5.2 GET /api/internal/deploys/ poll response

The poll response already returns status. No new fields are required. The modal JS only reads the status field to advance the stepper, so the existing contract is sufficient.

5.3 Audit contract

Each new status callback emits a console.deploy.callback audit row with action = "console.deploy.callback" and status = <new_value>. No audit schema changes are needed — the payload is free-form JSON.


6. Sequence Diagram

sequenceDiagram
    participant WF as GitHub Actions (deploy-console.yml)
    participant CB as Console /api/internal/deploys/<id>/callback
    participant DB as console_deploys (Postgres)
    participant Modal as Browser modal (SSE/poll)

    Note over WF: smoke job starts
    WF->>CB: POST status=smoke_gate_started
    CB->>DB: UPDATE status, append log_line, emit audit
    Modal->>CB: GET poll → smoke_gate_started → stepper row 1 active

    Note over WF: smoke job succeeds
    WF->>CB: POST status=smoke_gate_passed
    CB->>DB: UPDATE status, emit audit
    Modal->>CB: GET poll → smoke_gate_passed → stepper row 1 done

    Note over WF: freeze-check job starts (prod only)
    WF->>CB: POST status=freeze_check_started
    CB->>DB: UPDATE status, emit audit
    Modal->>CB: GET poll → freeze_check_started → stepper row 2 active

    Note over WF: freeze-check job passes (prod only)
    WF->>CB: POST status=freeze_check_passed
    CB->>DB: UPDATE status, emit audit
    Modal->>CB: GET poll → freeze_check_passed → stepper row 2 done

    Note over WF: deploy job starts (Heroku push)
    WF->>CB: POST status=building
    CB->>DB: UPDATE status (existing behaviour)
    Modal->>CB: GET poll → building → stepper row 3 active

    Note over WF: Heroku push done, health check pending
    WF->>CB: POST status=deploying
    CB->>DB: UPDATE status (existing behaviour)
    Modal->>CB: GET poll → deploying → stepper row 3 done, row 4 active

    Note over WF: health check passes
    WF->>CB: POST status=succeeded
    CB->>DB: UPDATE status, notify Slack
    Modal->>CB: GET poll → succeeded → stepper row 4 done, done state

7. File Changes by Layer

Layer 1: Workflow (deploy-console.yml)

File: .github/workflows/deploy-console.yml

In the smoke job, add two steps:

- name: Notify — smoke gate started
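  # A rejected callback (for example a 422 from a rolled-back console) must not fail the deploy.
  continue-on-error: true
  uses: ./.github/actions/notify-deploy-status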
  with:
    console_deploy_id: ${{ inputs.console_deploy_id }}
    status: smoke_gate_started
    log_line: "Smoke gate started."
    console_url: ${{ vars.CONSOLE_INTERNAL_URL || '...' }}
    hmac_secret: ${{ secrets.DEPLOY_CALLBACK_HMAC_SECRET }}

# ... existing smoke steps ...

- name: Notify — smoke gate passed
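  if: success()
  # A rejected callback (for example a 422 from a rolled-back console) must not fail the deploy.
  continue-on-error: true
  uses: ./.github/actions/notify-deploy-status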
  with:
    console_deploy_id: ${{ inputs.console_deploy_id }}
    status: smoke_gate_passed
    log_line: "Smoke gate passed."
    console_url: ...
    hmac_secret: ${{ secrets.DEPLOY_CALLBACK_HMAC_SECRET }}

In the freeze-check job, similarly bracket the composite action call with freeze_check_started (before) and freeze_check_passed (after, if: success()).

Constraint: The new callback steps must take effect only when inputs.console_deploy_id is non-empty. The notify-deploy-status action already no-ops silently when console_deploy_id is empty (non-console triggers), so no guard is needed beyond what already exists.

Layer 2: Service model (deploys.py + console_deploy.py)

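File: console/app/models/console_deploy.py (add the four DeployStatus values from §4.1)

File: console/app/services/deploys.py (add the four values to _CALLBACK_VALID_STATUSES per §4.4)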

Layer 3: DB migration

File: new migration in console/migrations/versions/

The new status values are strings stored in the existing status VARCHAR column, so no column addition is needed.

Inspect the existing migration chain for a CHECK constraint on console_deploys.status. If one is present, add an ALTER TABLE to expand it with the new values. If absent (VARCHAR with no CHECK), no migration DDL is needed; ship only a marker migration documenting the new valid values.

Rollback plan: If the new status values cause a production incident, the rollback is:

  1. heroku releases:rollback for the console app (reverts Python code).
  2. The DB migration is additive (no column drop); rows with new status values that arrived before rollback will have unrecognised status strings in the old code. The old code's _CALLBACK_VALID_STATUSES rejects unknown callbacks, so the deploy row stays at the last known good status. No data loss.
  3. The workflow callbacks for the new steps will 422 against the old service code (unknown status). continue-on-error: true must be set on the new callback steps to prevent that 422 from failing the overall deploy workflow.

Layer 4: Modal JS (_deploy_modal.html)

File: console/app/templates/dashboard/_deploy_modal.html

Update the STEPS array and the poll response handler to route new status values to stepper positions:

var STEPS = [
  { key: 'smoke_gate',    label: 'Smoke gate' },
  { key: 'freeze_check',  label: 'Freeze check' },
  { key: 'deploy',        label: 'Deploy' },
  { key: 'health_check',  label: 'Health check' },
];

// Status → stepper mapping (in the poll handler switch/if chain):
// 'smoke_gate_started'   → step 0 active
// 'smoke_gate_passed'    → step 0 done
// 'freeze_check_started' → step 1 active
// 'freeze_check_passed'  → step 1 done
// 'building'             → step 2 active   (existing)
// 'deploying'            → step 2 done, step 3 active  (existing)
// 'succeeded'            → step 3 done → done state    (existing)
// 'failed'               → failure state               (existing)
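
The mapping above can be checked against whole status sequences. The sketch below is a Python reference model of it, not the modal code itself (which is JS). The "pending" label, the STATUS_TO_STEPPER name, and the replay helper are hypothetical. It applies only the transitions listed in the comments and leaves the stepper untouched for any other status.

```python
# Status -> {step index: new state}, copied from the mapping comments above.
STATUS_TO_STEPPER = {
    "smoke_gate_started": {0: "active"},
    "smoke_gate_passed": {0: "done"},
    "freeze_check_started": {1: "active"},
    "freeze_check_passed": {1: "done"},
    "building": {2: "active"},
    "deploying": {2: "done", 3: "active"},
    "succeeded": {3: "done"},
}


def replay(statuses):
    """Replay a sequence of polled statuses and return the final stepper rows."""
    stepper = ["pending"] * 4  # smoke gate, freeze check, deploy, health check
    for status in statuses:
        # Statuses without a mapping (e.g. 'dispatched') change nothing here.
        for index, state in STATUS_TO_STEPPER.get(status, {}).items():
            stepper[index] = state
    return stepper


PROD_SEQUENCE = [
    "smoke_gate_started", "smoke_gate_passed",
    "freeze_check_started", "freeze_check_passed",
    "building", "deploying", "succeeded",
]
# Staging skips the freeze check (section 4.3).
STAGING_SEQUENCE = [
    "smoke_gate_started", "smoke_gate_passed",
    "building", "deploying", "succeeded",
]
```

Replaying the production sequence marks all four rows done. Replaying the staging sequence leaves the freeze-check row at "pending", since no status ever touches it; the mapping as written does not say how a skipped step should be rendered.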

FAILURE_STAGE_MAP must also map smoke_gate_failed and freeze_check_failed to the correct stepper failure positions. These keys already exist in the map (see current template lines 2885–2919), so the task is to confirm that their positions are still correct for the expanded STEPS array.


8. Migrations

Single additive migration:

  1. Inspect console_deploys.status column for a CHECK constraint.
  2. If CHECK exists: ALTER TABLE console_deploys DROP CONSTRAINT <name>; ALTER TABLE console_deploys ADD CONSTRAINT console_deploys_status_check CHECK (status IN (...existing..., 'smoke_gate_started', 'smoke_gate_passed', 'freeze_check_started', 'freeze_check_passed'));
  3. If no CHECK: migration is a no-op DDL but documents the new values.
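
The three steps above can be sketched as a small helper that produces the DDL for the migration body. This is a sketch under assumptions: expand_status_check_sql is a hypothetical name, the existing constraint name and existing allowed values must come from inspecting the real schema, and the status values are treated as trusted constants (no SQL escaping is applied).

```python
NEW_STATUSES = (
    "smoke_gate_started",
    "smoke_gate_passed",
    "freeze_check_started",
    "freeze_check_passed",
)


def expand_status_check_sql(existing_values, old_constraint_name=None):
    """Return the DDL statements needed to accept the new status values.

    existing_values: statuses the current CHECK constraint allows.
    old_constraint_name: name of the current CHECK constraint, or None if
        the status column has no CHECK (step 3: nothing to run).
    """
    if old_constraint_name is None:
        return []  # marker migration only; no DDL

    # Preserve order and drop duplicates in case the migration is re-derived.
    values = list(dict.fromkeys([*existing_values, *NEW_STATUSES]))
    in_list = ", ".join(f"'{value}'" for value in values)
    return [
        f"ALTER TABLE console_deploys DROP CONSTRAINT {old_constraint_name};",
        "ALTER TABLE console_deploys ADD CONSTRAINT console_deploys_status_check "
        f"CHECK (status IN ({in_list}));",
    ]
```

Running both statements inside one transaction avoids a window in which the column has no constraint at all.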

Rollback: The new enum values are additive. Removing them requires a follow-on migration that first bulk-updates any rows with new status values to failed and then restores the old CHECK constraint (if it existed); in the reverse order, the restored constraint would reject the rows that still hold new values. The workflow rollback (step 3 in §7, Layer 3) prevents new rows from entering the new status values after a code rollback.


9. Rollout Plan

| Phase | Gate | Description |
|-------|------|-------------|
| Dark | None | Migration + model + service changes deploy; new callback statuses are accepted but workflow does not emit them yet |
| Flag | FLAG_CONSOLE_DEPLOY_PHASE_CALLBACKS | Workflow steps guarded by flag check; modal JS routing always-on (backward-compat with old statuses) |
| Beta | Staging first | Enable flag on staging; verify stepper advances through all four positions on a staging deploy |
| GA | Prod | Enable flag on prod after one successful staging observation |

The flag must be added to feature_flags.yaml with a console_flag_promotions migration in the same PR (per feedback_new_flag_needs_b1_migration_same_pr.md).


10. Test Plan


11. Security Considerations


12. Open Questions

None blocking sub-card dispatch. The flag name FLAG_CONSOLE_DEPLOY_PHASE_CALLBACKS is a suggestion; feature-developer may use a shorter name if it fits the existing flag naming convention better.