Raxx · internal docs


Console Deploy — Async Dispatch (H12 Fix)

Status: Accepted
Owner: software-architect
Date: 2026-05-02 UTC
Related ADRs: ADR-0036, ADR-0034, ADR-0028
Refs: #898 (bug), #739 (reconciler), PR #867 (reconciler impl), PR #899 (modal error-render fix already merged)

1. Context

POST /api/internal/deploys synchronously polls GitHub Actions for a run_id for up to 30 s after firing workflow_dispatch. The poll budget is:

_RUN_ID_POLL_ATTEMPTS = 3
_RUN_ID_POLL_INTERVAL_S = 10
_RUN_ID_POLL_TIMEOUT_S = 30

This budget sits exactly at Heroku's 30 s H12 request-timeout ceiling, leaving no headroom for validation, the DB write, or the dispatch call itself. When GitHub takes longer than ~25 s to surface the new run in /actions/runs, Heroku kills the request with HTTP 503. The row is left at status=dispatched, github_run_id=NULL, and the modal shows garbled error text even though the workflow fired and ran to success (reproduced 2026-05-02 07:11 UTC, run 25246579234).
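The arithmetic behind that claim can be checked in a few lines. This is only an illustrative sketch: the three constants are copied from the budget above, while `HEROKU_H12_CEILING_S` and both helper functions are hypothetical names introduced here.

```python
# Constants copied from the poll budget above.
_RUN_ID_POLL_ATTEMPTS = 3
_RUN_ID_POLL_INTERVAL_S = 10
_RUN_ID_POLL_TIMEOUT_S = 30

# Heroku's router raises H12 when a request has not responded within 30 s.
HEROKU_H12_CEILING_S = 30  # hypothetical constant name


def worst_case_poll_seconds(attempts: int, interval_s: int, timeout_s: int) -> int:
    """Upper bound on time spent polling, ignoring per-request latency."""
    return min(attempts * interval_s, timeout_s)


def headroom_seconds() -> int:
    """Seconds left under the H12 ceiling for the rest of the request."""
    polling = worst_case_poll_seconds(
        _RUN_ID_POLL_ATTEMPTS, _RUN_ID_POLL_INTERVAL_S, _RUN_ID_POLL_TIMEOUT_S
    )
    return HEROKU_H12_CEILING_S - polling
```

With the current constants the headroom is zero, so any time spent outside the poll loop pushes the request over the ceiling.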

The reconciler (PR #867) is a 60-second background thread that re-attempts run_id look-up for orphaned rows. It is the correct long-tail safety net, but it cannot prevent the H12 from corrupting the UX for the operator watching the modal.

2. Invariants

All platform invariants apply. Deploy-flow-specific constraints carry forward unchanged from console-driven-deploy-flow.md:

No stored credentials.
ADR-0028 friction preserved (type-to-confirm gate, TOTP for prod).
Audit trail on every state change.
Break-glass paths (gh workflow run, scripts/ops/deploy-console.sh) unchanged.
Secrets rotatable without redeploy.
Kill-switch: heroku releases:rollback unchanged.

New invariant for this design:

POST /api/internal/deploys must return a response before Heroku's H12 ceiling. Budget is: validate + DB write + GH workflow_dispatch POST + commit. No synchronous GH polling within the dispatch request.

3. Chosen Fix Direction

Option (a) — validated and adopted.

Return 202 immediately after workflow_dispatch returns 204. Run-id resolution moves to the GET /api/internal/deploys/<id> polling handler (lazy, one attempt per poll). The 60-second reconciler remains the safety net for the case where the modal is closed before a poll resolves the run_id.

Why not option (b) — background thread per dispatch?

A per-dispatch threadpool worker running inside the Heroku dyno has no persistence guarantee: dyno restarts or scaling events during the 10–30 s window orphan the in-flight look-up with no recovery. The reconciler already fills this niche reliably. Adding a second mechanism creates two competing writers to the same row with no locking.

Why not option (c) — workflow_dispatch echo callback?

The GH workflow could call back with its own run_id as a first step. This would be reliable, but it requires modifying every workflow YAML file and adds a callback round-trip before the row can show a run link. The lazy-GET approach achieves the same outcome using the infrastructure already in place (the modal polls every 2 s).

4. Data Model

No schema changes. The existing console_deploys columns are sufficient:

column	relevant behaviour
`status`	stays `dispatched` until first HMAC callback or lazy run_id success
`github_run_id`	NULL on 202 response; populated lazily by GET handler or reconciler
`github_run_url`	same lifecycle as `github_run_id`
`requested_at_utc`	used as the lower bound for lazy GH look-up window

5. API Contract Changes

POST /api/internal/deploys

Before: Synchronous; returned 201 with github_run_id populated (when luck held) or H12 503 (when it didn't).

After:

Returns 202 Accepted (not 201) when dispatch succeeds.
Body is identical in shape: {id, status, status_url, github_run_id, github_run_url}.
github_run_id and github_run_url will be null in the 202 response. Clients must not assume these are populated on 202.
status is dispatched.
The _poll_and_capture_run_id call is removed from create_deploy.

The change from 201 to 202 is intentional: 202 signals "accepted for processing; outcome not yet known" per RFC 7231. The modal JS already branches on result.status === 201 || result.status === 200; this branch must be updated to also accept 202.

Error cases unchanged: 422 validation, 429 rate-limit, 502 GH dispatch failure (GH returned non-204), 503 token missing.
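A minimal sketch of the new POST success and 502 paths, assuming the GH call and the DB commit are injected as callables. `DeployRow`, `create_deploy_async`, `dispatch_workflow`, `commit`, and the `status_url` format are hypothetical; validation, idempotency, rate-limiting, and audit are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class DeployRow:
    """Hypothetical stand-in for a console_deploys row."""
    id: int
    status: str = "requested"
    github_run_id: Optional[str] = None
    github_run_url: Optional[str] = None


def create_deploy_async(
    row: DeployRow,
    dispatch_workflow: Callable[[], int],  # returns the GH HTTP status code
    commit: Callable[[DeployRow], None],
) -> Tuple[int, dict]:
    """Fire workflow_dispatch, commit, and return without polling GH."""
    if dispatch_workflow() != 204:
        # GH returned non-204: existing 502 error case, nothing dispatched.
        return 502, {"error": "github_dispatch_failed"}
    row.status = "dispatched"
    commit(row)  # commit right after the 204; no polling loop follows
    body = {
        "id": row.id,
        "status": row.status,
        "status_url": f"/api/internal/deploys/{row.id}",  # assumed format
        "github_run_id": row.github_run_id,    # None on the 202 response
        "github_run_url": row.github_run_url,  # None on the 202 response
    }
    return 202, body
```

The point of the sketch is the ordering: the commit follows the 204 directly, and the run fields are returned as null rather than waited for.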

GET /api/internal/deploys/`<id>` — lazy run_id resolution

When the following conditions are ALL true, the GET handler performs one (and only one) GH API look-up before returning:

deploy.github_run_id IS NULL
AND deploy.status = 'dispatched'
AND deploy.requested_at_utc >= now() - 90s

The 90-second window is generous enough to cover typical GH API lag (measured median ~5–15 s post-dispatch; p99 ~50 s). Beyond 90 s the reconciler owns recovery.
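The three conditions can be expressed as a single predicate. The sketch below is an assumption about shape only: `should_attempt_lazy_lookup` and `LAZY_LOOKUP_WINDOW` are hypothetical names, and the caller is assumed to pass timezone-aware UTC datetimes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

LAZY_LOOKUP_WINDOW = timedelta(seconds=90)  # hypothetical constant name


def should_attempt_lazy_lookup(
    status: str,
    github_run_id: Optional[str],
    requested_at_utc: datetime,
    now: datetime,
) -> bool:
    """True only when all three conditions for a lazy GH look-up hold."""
    return (
        github_run_id is None
        and status == "dispatched"
        and requested_at_utc >= now - LAZY_LOOKUP_WINDOW
    )
```

Passing `now` in explicitly keeps the predicate pure, which makes the 90 s boundary easy to unit-test.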

The look-up calls:

GET /repos/{repo}/actions/workflows/{workflow_name}/runs
    ?event=workflow_dispatch
    &created=>{requested_at_utc ISO}
    &per_page=5

Filtering by workflow_name (not just event=workflow_dispatch) reduces false-positive risk on busy repos. If a match is found, the row is updated in-place (run_id + run_url, commit). If not found, the row is returned as-is with github_run_id=null — the modal shows "Dispatched — waiting for run..." and polls again in 2 s.
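For illustration, the look-up URL could be assembled as follows. `build_runs_lookup_url` is a hypothetical helper, the `api.github.com` base URL is assumed, and the timestamp format is one plausible ISO 8601 rendering; the query parameters are the ones listed above.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_runs_lookup_url(
    repo: str, workflow_name: str, requested_at_utc: datetime
) -> str:
    """Build the per-workflow runs URL; requested_at_utc must be tz-aware."""
    stamp = requested_at_utc.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode(
        {
            "event": "workflow_dispatch",
            "created": ">" + stamp,  # only runs created after the dispatch
            "per_page": 5,
        }
    )
    return (
        f"https://api.github.com/repos/{repo}"
        f"/actions/workflows/{workflow_name}/runs?{query}"
    )
```

`urlencode` percent-encodes the `>` and the colons in the timestamp, so the value survives the round trip through the query string.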

Important: the GH look-up has a hard 8 s timeout. If the GH API is slow or unavailable, the GET returns the current row without modification. No errors bubble to the client; the row just stays unresolved until the reconciler cycle.
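The fail-quiet behaviour might look like the sketch below. `lazy_resolve_run` and the injected `fetch_run` callable are hypothetical; `fetch_run` is assumed to take a timeout in seconds and return either a run object with `id` and `html_url` fields or `None`. The DB commit is left out.

```python
from typing import Callable, Optional

GH_LOOKUP_TIMEOUT_S = 8  # hard timeout stated above; constant name is hypothetical


def lazy_resolve_run(
    row: dict, fetch_run: Callable[[float], Optional[dict]]
) -> dict:
    """Make one GH look-up attempt; never raise to the caller."""
    if row.get("github_run_id") is not None:
        return row  # already resolved (e.g. by the callback); do not overwrite
    try:
        run = fetch_run(GH_LOOKUP_TIMEOUT_S)
    except Exception:
        # GH slow or unavailable: return the row unmodified and let the
        # next poll or the reconciler try again.
        return row
    if run is not None:
        row["github_run_id"] = str(run["id"])
        row["github_run_url"] = run["html_url"]
    return row
```

There is no loop inside the function, which matches the one-attempt-per-poll rule: retries come from the modal's next GET, not from the handler.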

One attempt per poll. The GET handler does not loop. Each 2-second poll from the modal triggers exactly one GH attempt (when the conditions are met). This bounds the GH API call rate to at most 1 req / 2 s during the 90-second window, i.e. at most 90 / 2 = 45 GH requests per dispatch in the worst case.

POST /api/internal/deploys/`<id>`/status (HMAC callback)

No contract change. The callback can arrive before the GET handler has resolved the run_id. In this scenario the callback payload includes github_run_id in the existing run_id field (already part of the callback contract per PR #867). The callback handler should write github_run_id and github_run_url from the payload when the row's github_run_id IS NULL. See section 6 for the state machine implications.

6. State Machine

requested
    |
    | (GH POST returns 204)
    v
dispatched [run_id=NULL]
    |
    | -- lazy GET poll resolves run_id ---> dispatched [run_id=SET]
    |                                            |
    | -- HMAC callback "building" arrives        |
    |    (back-fills run_id if still NULL) ------+
    v
building
    |
    v
deploying
    |
    +---> succeeded
    +---> failed
    +---> timed_out  (reconciler after 30 min, or callback)

Key rule: dispatched is the only state where run_id may legitimately be NULL. All other non-terminal states should have run_id set (either by lazy GET, by callback, or by reconciler).
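One way to encode the diagram is a transition table. The sketch below includes only the edges drawn in this document (the state diagram above, plus the reconciler's timed_out path for requested and dispatched rows shown in section 7); the real handler may permit more. `ALLOWED_TRANSITIONS` and `is_valid_transition` are hypothetical names.

```python
# Edges read off the diagrams in this document; an assumption, not the
# authoritative transition list.
ALLOWED_TRANSITIONS = {
    "requested": {"dispatched", "timed_out"},
    "dispatched": {"building", "timed_out"},
    "building": {"deploying"},
    "deploying": {"succeeded", "failed", "timed_out"},
}

TERMINAL_STATUSES = {"succeeded", "failed", "timed_out"}


def is_valid_transition(current: str, new: str) -> bool:
    """True if the diagram draws an edge from current to new."""
    return new in ALLOWED_TRANSITIONS.get(current, set())
```

Terminal states have no entry in the table, so every transition out of them is rejected by default.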

HMAC callback arrives before run_id is known

This is an expected race: the GH "Notify — building" step fires ~30 s after dispatch; the first GET poll fires ~2 s after dispatch. If the callback wins the race, the callback handler must:

Accept the callback (do not reject — it is HMAC-verified and the transition dispatched -> building is valid).
Extract github_run_id from the callback payload's run_id field (if present) and write it to the row at the same time as the status update.
Write github_run_url from the callback payload's run_url field (if present).

This avoids a subsequent GET-handler write racing against an already-resolved row. The callback is the authoritative source of truth for run_id; the lazy-GET is opportunistic and must check github_run_id IS NULL before writing.
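The back-fill rule could be sketched as below, assuming HMAC verification and the transition check have already passed. `apply_callback` is a hypothetical name; `run_id` and `run_url` are the payload fields named above.

```python
def apply_callback(row: dict, payload: dict) -> dict:
    """Apply a verified status callback, back-filling run fields if NULL."""
    row["status"] = payload["status"]
    # Back-fill only; never overwrite a value the lazy GET already wrote.
    if row.get("github_run_id") is None and payload.get("run_id") is not None:
        row["github_run_id"] = str(payload["run_id"])
    if row.get("github_run_url") is None and payload.get("run_url") is not None:
        row["github_run_url"] = payload["run_url"]
    return row
```

Because both writers (callback and lazy GET) only fill NULL fields, whichever arrives second becomes a no-op for the run fields.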

Idempotency under Retry (Idempotency-Key)

The existing idempotency path handles this correctly once 202 is the success code:

Original request fires workflow_dispatch, commits the row at status=dispatched, and returns 202. Even if H12 kills the response in flight, the row is already committed: H12 terminates the HTTP response, not the Flask handler.
User sees "Request timed out" error (or Safari SyntaxError on H12 HTML response) and clicks Retry with the same Idempotency-Key.
_check_idempotency finds the non-terminal dispatched row and raises DeployIdempotencyHit.
Blueprint returns the existing row as a 200 response.
Modal receives 200, reads id, starts polling GET — and the deploy that was already running resolves normally.

Critical edge case: if the row was never committed because Heroku killed the Gunicorn worker process before db.session.commit() (not just the HTTP response), the retry sees no idempotency hit and fires a second workflow_dispatch. This is the same race that existed before this design. Mitigation: the commit happens immediately after the GH 204 response, before any polling loop. The H12 30 s ceiling is far past that point. The risk window is narrow (the ~200 ms between GH returning 204 and the DB commit completing). Acceptable; it is an improvement over the current 30 s window.
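The retry path can be illustrated with an in-memory stand-in for the table. `DeployIdempotencyHit` and the idempotency check are named in this document, but the signatures, the `idempotency_key` field, and `post_deploy` below are assumptions made for the sketch.

```python
from typing import Callable, List, Tuple

TERMINAL_STATUSES = {"succeeded", "failed", "timed_out"}


class DeployIdempotencyHit(Exception):
    """Raised when a non-terminal row already exists for the key."""

    def __init__(self, row: dict) -> None:
        super().__init__(f"idempotency hit on deploy {row['id']}")
        self.row = row


def check_idempotency(rows: List[dict], key: str) -> None:
    for row in rows:
        if row["idempotency_key"] == key and row["status"] not in TERMINAL_STATUSES:
            raise DeployIdempotencyHit(row)


def post_deploy(
    rows: List[dict], key: str, dispatch: Callable[[], None]
) -> Tuple[int, dict]:
    """Return (200, existing row) on a retry, else dispatch and return 202."""
    try:
        check_idempotency(rows, key)
    except DeployIdempotencyHit as hit:
        return 200, hit.row  # no second workflow_dispatch
    dispatch()
    row = {
        "id": len(rows) + 1,
        "idempotency_key": key,
        "status": "dispatched",
        "github_run_id": None,
    }
    rows.append(row)  # stands in for db.session.commit()
    return 202, row
```

The edge case described above corresponds to a crash between `dispatch()` and `rows.append(row)`: the retry then finds no row and dispatches again.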

7. Sequence Diagram

sequenceDiagram
    participant U  as Operator (modal)
    participant C  as Console (Flask)
    participant DB as Postgres
    participant GH as GitHub Actions API
    participant W  as GH Workflow

    U  ->> C : POST /api/internal/deploys {surface_id, target_ref, totp_code}
    C  ->> C : validate + idempotency + rate-limit
    C  ->> DB: INSERT console_deploys (status=requested)
    C  ->> DB: UPDATE status=dispatched (audit: console.deploy.intent)
    C  ->> GH: POST /actions/workflows/{file}/dispatches  [timeout=15s]
    GH -->> C : 204 No Content
    C  ->> DB: commit status=dispatched, run_id=NULL
    C  -->> U : 202 {id, status="dispatched", github_run_id=null}

    note over U: transitions to State 3 (in-flight)
    note over U: polls every 2s

    loop every 2s while status=dispatched AND run_id IS NULL AND age < 90s
        U  ->> C : GET /api/internal/deploys/{id}
        C  ->> GH: GET /actions/workflows/{file}/runs?event=workflow_dispatch&created=>T  [timeout=8s]
        GH -->> C : 200 {workflow_runs: [...]}
        C  ->> C : match on inputs.console_deploy_id
        alt run found
            C  ->> DB: UPDATE run_id, run_url
            C  -->> U : 200 {status="dispatched", github_run_id="25246579234", ...}
        else not found yet
            C  -->> U : 200 {status="dispatched", github_run_id=null}
        end
    end

    note over W: ~30s after dispatch
    W  ->> C : POST /api/internal/deploys/{id}/status {status="building", run_id=...}
    C  ->> C : HMAC verify + transition check
    C  ->> DB: UPDATE status=building, run_id (if still NULL), audit
    C  -->> W : 204

    note over U: next poll sees status=building
    U  ->> C : GET /api/internal/deploys/{id}
    C  -->> U : 200 {status="building", github_run_id="25246579234"}

    W  ->> C : POST status=deploying
    W  ->> C : POST status=succeeded
    C  ->> DB: UPDATE status=succeeded, audit
    U  ->> C : GET /api/internal/deploys/{id}
    C  -->> U : 200 {status="succeeded"}
    note over U: transitions to State 4 (done)

sequenceDiagram
    participant R  as Reconciler (background)
    participant DB as Postgres
    participant GH as GitHub Actions API

    note over DB: row stuck: status=dispatched, run_id=NULL, age>15s

    loop every 60s
        R  ->> DB: SELECT rows WHERE run_id IS NULL AND status IN (requested,dispatched) AND age BETWEEN 15s AND 30min
        R  ->> GH: GET /actions/runs?event=workflow_dispatch&per_page=20
        GH -->> R : 200 {workflow_runs}
        R  ->> R : match on inputs.console_deploy_id
        alt match found
            R  ->> DB: UPDATE run_id, run_url, audit: console.deploy.run_id_recovered
        else age > 30min
            R  ->> DB: UPDATE status=timed_out, audit: console.deploy.timeout
        end
    end
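The reconciler's per-row decision in the second diagram can be read as a three-way triage. This is one interpretation of the diagram, not the PR #867 implementation; `reconciler_action` and the returned labels are hypothetical.

```python
from datetime import timedelta
from typing import Optional

MIN_AGE = timedelta(seconds=15)  # younger rows are left to the lazy GET
MAX_AGE = timedelta(minutes=30)  # older rows are timed out


def reconciler_action(
    status: str, github_run_id: Optional[str], age: timedelta
) -> str:
    """Return 'skip', 'lookup', or 'time_out' for one row."""
    if github_run_id is not None or status not in {"requested", "dispatched"}:
        return "skip"  # not an orphan
    if age < MIN_AGE:
        return "skip"  # too fresh; GH may not have surfaced the run yet
    if age > MAX_AGE:
        return "time_out"
    return "lookup"
```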

8. Reconciler Scope Expansion

The current reconciler (PR #867) matches orphan rows by inputs.console_deploy_id. This is the primary match path and remains unchanged.

A secondary match path by (surface_id, workflow_filename, requested_at_utc ± 30s, target_env) should be added for the scenario where:

console_deploy_id was injected as a workflow input but the GH run record does not yet expose inputs in the list API (GH returns inputs: null for very recent runs in some API versions), OR
the run was triggered manually (break-glass) without a console_deploy_id input and the operator has a matching orphan row that should be linked.

The secondary match must only fire when the primary match returns nothing. It must not over-match: a row is only linked via secondary match when exactly one GH run exists within requested_at_utc ± 30s for (workflow_file, target_env). If zero or two+ runs exist in that window, skip (ambiguous).
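The exactly-one rule is small enough to sketch directly. The run dictionaries below use hypothetical keys (`workflow_file`, `target_env`, `created_at`); how those values are derived from the GH response is not specified here, and the caller is assumed to invoke this only after the primary match returned nothing.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence

MATCH_WINDOW = timedelta(seconds=30)


def secondary_match(
    requested_at_utc: datetime,
    workflow_file: str,
    target_env: str,
    runs: Sequence[dict],
) -> Optional[dict]:
    """Return the single unambiguous candidate run, or None."""
    candidates = [
        run
        for run in runs
        if run["workflow_file"] == workflow_file
        and run["target_env"] == target_env
        and abs(run["created_at"] - requested_at_utc) <= MATCH_WINDOW
    ]
    # Zero candidates: nothing to link. Two or more: ambiguous, so skip.
    return candidates[0] if len(candidates) == 1 else None
```

Returning `None` for both the empty and the ambiguous case means the row simply stays orphaned until a later cycle or the timeout.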

This expansion is scoped to sub-card #5 below.

9. Flag and Rollout

Feature flag: FLAG_CONSOLE_DEPLOY_ASYNC (new, default off).

When off: behavior is unchanged from current (synchronous poll in POST, 201 response). When on: POST returns 202, _poll_and_capture_run_id is skipped, GET handler performs lazy look-up.

The flag is independent of FLAG_CONSOLE_DEPLOY_UI, which gates the entire deploy feature. The async flag is internal — it is not surfaced in the console UI flag promotion panel.
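A sketch of how the flag might gate the two behaviours, assuming it is read from the environment at request time and that the string `true` (case-insensitive) means on. The parsing rule and both function names are assumptions.

```python
import os
from typing import Mapping, Optional


def async_dispatch_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when FLAG_CONSOLE_DEPLOY_ASYNC is set to 'true'; default off."""
    env = os.environ if environ is None else environ
    return env.get("FLAG_CONSOLE_DEPLOY_ASYNC", "").strip().lower() == "true"


def dispatch_success_code(environ: Optional[Mapping[str, str]] = None) -> int:
    """202 on the async path, 201 on the legacy synchronous path."""
    return 202 if async_dispatch_enabled(environ) else 201
```

Treating anything other than `true` as off keeps the default safe: an unset or mistyped value falls back to the existing synchronous behaviour.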

Rollout plan:

Stage	Description
dark	Flag off on prod. Sub-cards 1–3 shipped. Behavior identical to today (minus the H12 for the ~25% of dispatches that were getting hit).
flag on staging	`FLAG_CONSOLE_DEPLOY_ASYNC=true` on staging console. Manual smoke: dispatch staging surface, confirm modal resolves run_id within 3 polls, confirm 202 response.
flag on prod	Enable on prod. Monitor Heroku H12 logs for absence of `/api/internal/deploys` entries.
cleanup	Sub-card 5 (reconciler secondary match) + remove flag constant.

10. Migrations

None. No schema changes. The new FLAG_CONSOLE_DEPLOY_ASYNC env var is an operational toggle, not a DB column.

If a future operator wants to back-fill github_run_id on historical dispatched rows, the reconciler's _find_run_for_deploy static method can be run in a one-shot script. This is out of scope for the initial rollout.

11. Security Considerations

No credential storage change. The same GITHUB_API_DISPATCH_TOKEN used for dispatch is used for the lazy GET look-up. It is read at request time from env. The 202 path adds one more GET call per poll-cycle that uses this token; total rate-limit exposure is bounded (one call per GET request while status=dispatched, bounded to 90 s window).
HMAC callback back-fill. The callback handler accepting a run_id from the callback payload introduces no new trust assumption — the callback is already HMAC-verified with DEPLOY_CALLBACK_HMAC_SECRET. The GH workflow is the only sender that knows the secret.
Idempotency-Key replay. The retry-on-timeout path produces at most one extra workflow_dispatch in the narrow (< 200 ms) window between GH 204 and DB commit. No additional mitigation required beyond rate-limit (5/hour).
Audit trail unchanged. console.deploy.intent fires on row insert. console.deploy.callback fires on each status callback. The new lazy run_id write in the GET handler does not fire an audit row (it is observability data, not a state transition). The reconciler audit (console.deploy.run_id_recovered) continues unchanged.
PII: No new PII collected. requested_by (email) is already in the row. Retention and DSR erasure paths unchanged from ADR-0034.
Breach notification: No change. No new personal data introduced.
Kill-switch: FLAG_CONSOLE_DEPLOY_ASYNC=false reverts to synchronous behavior. FLAG_CONSOLE_DEPLOY_UI=false disables the entire deploy feature.

12. Open Questions

None blocking sub-card #1. The design is complete.

Previously open:

OQ-1: Should 202 or 201 be the success status code? Decision: 202, per RFC 7231 semantics (accepted but not yet complete). Modal JS must be updated.
OQ-2: Should lazy run_id resolution live in GET handler or a background worker? Decision: GET handler. Rationale in section 3.
OQ-3: What is the correct timeout for the lazy GH look-up? Decision: 8 s hard timeout on the GH request; stop attempting lazy look-up after 90 s from requested_at_utc.
OQ-4: Callback before run_id — accept or reject? Decision: accept and back-fill. See section 6.
OQ-5: Reconciler secondary match — in scope for this change? Decision: yes, sub-card #5, not blocking initial fix.

Auto-generated from docs/ in raxx-app/TradeMasterAPI. Gated behind Cloudflare Access. Re-deployed on every push to main.