Console Deploy — Async Dispatch (H12 Fix)
Status: Accepted
Owner: software-architect
Date: 2026-05-02 UTC
Related ADRs: [ADR-0036](https://internal-docs.raxx.app/architecture/adr/0036-deploy-async-run-id-resolution.html), [ADR-0034](https://internal-docs.raxx.app/architecture/adr/0034-console-driven-deploy-flow.html), [ADR-0028](https://internal-docs.raxx.app/architecture/adr/0028-prod-deploy-intentional-friction.html)
Refs: #898 (bug), #739 (reconciler), PR #867 (reconciler impl), PR #899 (modal error-render fix, already merged)
1. Context
POST /api/internal/deploys synchronously polls GitHub Actions for a run_id
for up to 30 s after firing workflow_dispatch. The poll budget is:
_RUN_ID_POLL_ATTEMPTS = 3
_RUN_ID_POLL_INTERVAL_S = 10
_RUN_ID_POLL_TIMEOUT_S = 30
This budget sits exactly at Heroku's H12 request-timeout ceiling. When GitHub
takes longer than ~25 s to surface the new run in /actions/runs, Heroku kills
the request with HTTP 503. The row is left at status=dispatched,
github_run_id=NULL and the modal showed garbled error text even though the
workflow fired and ran to success (reproduction 2026-05-02 07:11 UTC, run
25246579234).
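The arithmetic behind "sits exactly at the ceiling" can be checked directly. This is a small sketch using the three constants listed above; H12_CEILING_S is an assumed name for Heroku's 30 s router timeout and does not appear in the codebase as far as this document shows.

```python
# Constants as listed in the poll budget above.
_RUN_ID_POLL_ATTEMPTS = 3
_RUN_ID_POLL_INTERVAL_S = 10
_RUN_ID_POLL_TIMEOUT_S = 30

# Assumed name: Heroku's router raises H12 after 30 s without a response.
H12_CEILING_S = 30

# Worst case: every attempt misses and the handler sleeps the full interval each time.
worst_case_poll_s = min(
    _RUN_ID_POLL_ATTEMPTS * _RUN_ID_POLL_INTERVAL_S,
    _RUN_ID_POLL_TIMEOUT_S,
)

# Time left for validation, the DB write, and the dispatch POST itself.
headroom_s = H12_CEILING_S - worst_case_poll_s
```

With zero headroom, any dispatch where GitHub is slow to list the run leaves no time for the rest of the request.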
The reconciler (PR #867) is a 60-second background thread that re-attempts run_id look-up for orphaned rows. It is the correct long-tail safety net, but it cannot prevent the H12 from corrupting the UX for the operator watching the modal.
2. Invariants
All platform invariants apply. Deploy-flow-specific constraints carry forward unchanged from console-driven-deploy-flow.md:
- No stored credentials.
- ADR-0028 friction preserved (type-to-confirm gate, TOTP for prod).
- Audit trail on every state change.
- Break-glass paths (gh workflow run, scripts/ops/deploy-console.sh) unchanged.
- Secrets rotatable without redeploy.
- Kill-switch: heroku releases:rollback unchanged.
New invariant for this design:
POST /api/internal/deploys must return a response before Heroku's H12 ceiling. The budget is: validate + DB write + GH workflow_dispatch POST + commit. No synchronous GH polling within the dispatch request.
3. Chosen Fix Direction
Option (a) — validated and adopted.
Return 202 immediately after workflow_dispatch returns 204. Run-id
resolution moves to the GET /api/internal/deploys/<id> polling handler
(lazy, one attempt per poll). The 60-second reconciler remains the safety net
for the case where the modal is closed before a poll resolves the run_id.
Why not option (b) — background thread per dispatch?
A per-dispatch threadpool worker running inside the Heroku dyno has no persistence guarantee: dyno restarts or scaling events during the 10–30 s window orphan the in-flight look-up with no recovery. The reconciler already fills this niche reliably. Adding a second mechanism creates two competing writers to the same row with no locking.
Why not option (c) — workflow_dispatch echo callback?
The GH workflow could call back with its own run_id as a first step. This
would be reliable, but it requires modifying every workflow YAML file and
adds a callback round-trip before the row can show a run link. The lazy-GET
approach achieves the same outcome using the infrastructure already in place
(the modal polls every 2 s).
4. Data Model
No schema changes. The existing console_deploys columns are sufficient:
| column | relevant behaviour |
|---|---|
| status | stays dispatched until the first HMAC callback; lazy run_id resolution fills github_run_id but does not change status |
| github_run_id | NULL on 202 response; populated lazily by GET handler, HMAC callback, or reconciler |
| github_run_url | same lifecycle as github_run_id |
| requested_at_utc | used as the lower bound for lazy GH look-up window |
5. API Contract Changes
POST /api/internal/deploys
Before: Synchronous; returned 201 with github_run_id populated (when
luck held) or H12 503 (when it didn't).
After:
- Returns 202 Accepted (not 201) when dispatch succeeds.
- Body is identical in shape: {id, status, status_url, github_run_id, github_run_url}.
- github_run_id and github_run_url will be null in the 202 response.
Clients must not assume these are populated on 202.
- status is dispatched.
- The _poll_and_capture_run_id call is removed from create_deploy.
The change from 201 to 202 is intentional: 202 signals "accepted for
processing; outcome not yet known" per RFC 7231. The modal JS already
branches on result.status === 201 || result.status === 200; this branch
must be updated to also accept 202.
Error cases unchanged: 422 validation, 429 rate-limit, 502 GH dispatch failure (GH returned non-204), 503 token missing.
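A minimal sketch of the new POST behaviour, assuming a hypothetical dispatch callable that returns GitHub's HTTP status code. DeployRow, the status_url format, and the 502 error body are illustrative stand-ins, not the console's actual model or handler.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeployRow:
    """Illustrative stand-in for a console_deploys row."""
    id: int
    status: str = "requested"
    github_run_id: Optional[str] = None
    github_run_url: Optional[str] = None


def create_deploy(row: DeployRow, dispatch: Callable[[], int]) -> tuple[int, dict]:
    """Fire workflow_dispatch once and answer immediately; never poll for a run_id."""
    gh_status = dispatch()  # hypothetical: returns the HTTP status of the dispatch POST
    if gh_status != 204:
        return 502, {"error": "GitHub dispatch failed", "github_status": gh_status}

    row.status = "dispatched"  # the real handler would commit the row here
    body = {
        "id": row.id,
        "status": row.status,
        "status_url": f"/api/internal/deploys/{row.id}",  # assumed URL shape
        "github_run_id": row.github_run_id,    # still None on 202
        "github_run_url": row.github_run_url,  # still None on 202
    }
    return 202, body
```

A client following this contract would treat 200, 201, and 202 as success and start polling status_url without reading github_run_id from the POST response.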
GET /api/internal/deploys/<id> — lazy run_id resolution
When the following conditions are ALL true, the GET handler performs one (and only one) GH API look-up before returning:
deploy.github_run_id IS NULL
AND deploy.status = 'dispatched'
AND deploy.requested_at_utc >= now() - 90s
The 90-second window is generous enough to cover typical GH API lag (measured median ~5–15 s post-dispatch; p99 ~50 s). Beyond 90 s the reconciler owns recovery.
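The three conditions can be expressed as a single predicate. This is a sketch only; the function name and the LAZY_LOOKUP_WINDOW constant are assumptions introduced for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed constant name; the 90 s value comes from the conditions above.
LAZY_LOOKUP_WINDOW = timedelta(seconds=90)


def should_attempt_lazy_lookup(
    github_run_id: Optional[str],
    status: str,
    requested_at_utc: datetime,
    now: Optional[datetime] = None,
) -> bool:
    """True only when all three conditions for a lazy GH look-up hold."""
    now = now or datetime.now(timezone.utc)
    return (
        github_run_id is None
        and status == "dispatched"
        and requested_at_utc >= now - LAZY_LOOKUP_WINDOW
    )
```

Passing now explicitly keeps the predicate deterministic in tests; rows older than the window fall through to the reconciler.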
The look-up calls:
GET /repos/{repo}/actions/workflows/{workflow_name}/runs
?event=workflow_dispatch
&created=>{requested_at_utc ISO}
&per_page=5
Filtering by workflow_name (not just event=workflow_dispatch) reduces
false-positive risk on busy repos. If a match is found, the row is updated
in-place (run_id + run_url, commit). If not found, the row is returned
as-is with github_run_id=null — the modal shows "Dispatched — waiting for
run..." and polls again in 2 s.
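The look-up URL above could be assembled along these lines. The sketch assumes the path segment is the workflow file name (GitHub's endpoint accepts a workflow ID or file name there) and uses a made-up repo slug in the usage note; build_runs_lookup_url is a hypothetical helper, not an existing function.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode

GITHUB_API = "https://api.github.com"


def build_runs_lookup_url(repo: str, workflow: str, requested_at_utc: datetime) -> str:
    """Build the list-runs URL filtered to dispatch events created after the request time."""
    since = requested_at_utc.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({
        "event": "workflow_dispatch",
        "created": f">{since}",  # date-range filter; urlencode percent-encodes '>' and ':'
        "per_page": 5,
    })
    return f"{GITHUB_API}/repos/{repo}/actions/workflows/{quote(workflow)}/runs?{query}"
```

For example, build_runs_lookup_url("acme/console", "deploy.yml", requested_at) yields a URL whose created parameter is the percent-encoded form of ">2026-05-02T07:11:00Z" when requested_at is that instant.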
Important: the GH look-up has a hard 8 s timeout. If the GH API is slow or unavailable, the GET returns the current row without modification. No errors bubble to the client; the row just stays unresolved until the reconciler cycle.
One attempt per poll. The GET handler does not loop. Each 2-second poll from the modal is exactly one GH attempt (when conditions are met). This bounds the GH API call rate to at most 1 req / 2 s during the 90-second window = ~45 GH requests per dispatch in the worst case.
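The "one attempt, never raise" behaviour and the worst-case call count can be sketched as follows. The lookup callable, the row dict, and the constant names are hypothetical; the real handler operates on the ORM row and the GitHub client.

```python
from typing import Callable, Optional

GH_LOOKUP_TIMEOUT_S = 8  # hard timeout stated above
POLL_INTERVAL_S = 2      # modal poll cadence
LAZY_WINDOW_S = 90       # lazy look-up window


def resolve_run_once(row: dict, lookup: Callable[[float], Optional[dict]]) -> dict:
    """Make at most one GitHub attempt; on any failure return the row untouched."""
    try:
        # Hypothetical callable: returns {"id": ..., "html_url": ...} or None.
        match = lookup(GH_LOOKUP_TIMEOUT_S)
    except Exception:
        return row  # a slow or unavailable GitHub never surfaces as a client error
    if match is not None and row.get("github_run_id") is None:
        row["github_run_id"] = str(match["id"])
        row["github_run_url"] = match["html_url"]
    return row


# Upper bound on GitHub calls per dispatch while the modal stays open.
MAX_GH_CALLS_PER_DISPATCH = LAZY_WINDOW_S // POLL_INTERVAL_S
```

Because the function does not loop, the per-dispatch bound depends only on the poll cadence and the window length.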
POST /api/internal/deploys/<id>/status (HMAC callback)
No contract change. The callback can arrive before the GET handler has
resolved the run_id. In this scenario the callback payload includes
github_run_id in the existing run_id field (already part of the
callback contract per PR #867). The callback handler should write
github_run_id + github_run_url from the payload when the row has
github_run_id IS NULL. See section 6 for the state machine implications.
6. State Machine
requested
|
| (GH POST returns 204)
v
dispatched [run_id=NULL]
|
| -- lazy GET poll resolves run_id ---> dispatched [run_id=SET]
| |
| -- HMAC callback "building" arrives |
| (back-fills run_id if still NULL) ------+
v
building
|
v
deploying
|
+---> succeeded
+---> failed
+---> timed_out (reconciler after 30 min, or callback)
Key rule: once the workflow has been dispatched, dispatched is the only state
where run_id may legitimately be NULL. All later non-terminal states (building,
deploying) should have run_id set (either by lazy GET, by callback, or by
reconciler).
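One reading of the diagram as a transition table, offered as a sketch rather than the console's actual table. It encodes only the edges drawn above; reconciler timeouts of rows still stuck in requested or dispatched are left out, and the helper names are assumptions.

```python
# Edges as drawn in the diagram above; the real transition table may differ.
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "requested": {"dispatched"},
    "dispatched": {"building"},
    "building": {"deploying"},
    "deploying": {"succeeded", "failed", "timed_out"},
}

TERMINAL_STATES = {"succeeded", "failed", "timed_out"}


def is_valid_transition(current: str, new: str) -> bool:
    """True when the diagram draws an edge from current to new."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


def run_id_may_be_null(status: str) -> bool:
    """requested has no run yet; dispatched may still be waiting for resolution."""
    return status in {"requested", "dispatched"}
```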
HMAC callback arrives before run_id is known
This is an expected race: the GH "Notify — building" step fires ~30 s after dispatch; the first GET poll fires ~2 s after dispatch. If the callback wins the race, the callback handler must:
- Accept the callback (do not reject: it is HMAC-verified and the transition dispatched -> building is valid).
- Extract github_run_id from the callback payload's run_id field (if present) and write it to the row at the same time as the status update.
- Write github_run_url from the callback payload's run_url field (if present).
This avoids a subsequent GET-handler write racing against an already-resolved
row. The callback is the authoritative source of truth for run_id; the
lazy-GET is opportunistic and must check github_run_id IS NULL before
writing.
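The back-fill rule can be sketched as below, assuming the callback has already passed HMAC verification and the transition check. apply_callback and the dict-shaped row are illustrative; only the run_id and run_url payload field names come from the text above.

```python
def apply_callback(row: dict, payload: dict) -> dict:
    """Apply a verified callback: update status and back-fill run fields only if NULL."""
    row["status"] = payload["status"]
    if row.get("github_run_id") is None:
        if payload.get("run_id") is not None:
            row["github_run_id"] = str(payload["run_id"])
        if payload.get("run_url"):
            row["github_run_url"] = payload["run_url"]
    return row
```

Writing the status and the run fields in the same update means a later lazy GET sees github_run_id already set and skips its own write.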
Idempotency under Retry (Idempotency-Key)
The existing idempotency path handles this correctly once 202 is the success code:
- Original request fires workflow_dispatch, commits row at status=dispatched, returns 202 (even if H12 killed the response in flight: the Heroku H12 kills the response, not the Flask handler, so the row is already committed).
- User sees "Request timed out" error (or Safari SyntaxError on H12 HTML response) and clicks Retry with the same Idempotency-Key.
- _check_idempotency finds the non-terminal dispatched row and raises DeployIdempotencyHit.
- Blueprint returns the existing row as a 200 response.
- Modal receives 200, reads id, starts polling GET, and the deploy that was already running resolves normally.
Critical edge case: if the row was never committed because Heroku killed
the Gunicorn worker process before db.session.commit() (not just the
HTTP response), the retry sees no idempotency hit and fires a second
workflow_dispatch. This is the same race that existed before this design.
Mitigation: the commit happens immediately after the GH 204 response, before
any polling loop. The H12 30 s ceiling is far past that point. The risk
window is narrow (the ~200 ms between GH returning 204 and the DB commit
completing). Acceptable; it is an improvement over the current 30 s window.
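The retry path can be simulated with an in-memory list standing in for the table. This is a sketch under that assumption: check_idempotency and post_deploy are simplified stand-ins for _check_idempotency and the blueprint, and the idempotency_key field name is hypothetical.

```python
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed", "timed_out"}


class DeployIdempotencyHit(Exception):
    """Raised when a non-terminal row already exists for the same key."""

    def __init__(self, row: dict):
        super().__init__(f"deploy {row['id']} already in flight")
        self.row = row


def check_idempotency(rows: list[dict], key: str) -> None:
    for row in rows:
        if row["idempotency_key"] == key and row["status"] not in TERMINAL_STATES:
            raise DeployIdempotencyHit(row)


def post_deploy(rows: list[dict], key: str, dispatch: Callable[[], None]) -> tuple[int, dict]:
    try:
        check_idempotency(rows, key)
    except DeployIdempotencyHit as hit:
        return 200, hit.row  # replay: no second workflow_dispatch is fired
    dispatch()
    row = {"id": len(rows) + 1, "idempotency_key": key,
           "status": "dispatched", "github_run_id": None}
    rows.append(row)  # stands in for db.session.commit()
    return 202, row
```

The gap between dispatch() and rows.append(...) is the narrow window described above: a worker killed there leaves no row, so a retry would dispatch again.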
7. Sequence Diagram
sequenceDiagram
participant U as Operator (modal)
participant C as Console (Flask)
participant DB as Postgres
participant GH as GitHub Actions API
participant W as GH Workflow
U ->> C : POST /api/internal/deploys {surface_id, target_ref, totp_code}
C ->> C : validate + idempotency + rate-limit
C ->> DB: INSERT console_deploys (status=requested)
C ->> DB: UPDATE status=dispatched (audit: console.deploy.intent)
C ->> GH: POST /actions/workflows/{file}/dispatches [timeout=15s]
GH -->> C : 204 No Content
C ->> DB: commit status=dispatched, run_id=NULL
C -->> U : 202 {id, status="dispatched", github_run_id=null}
note over U: transitions to State 3 (in-flight)
note over U: polls every 2s
loop every 2s while status=dispatched AND run_id IS NULL AND age < 90s
U ->> C : GET /api/internal/deploys/{id}
C ->> GH: GET /actions/workflows/{file}/runs?event=workflow_dispatch&created>=T [timeout=8s]
GH -->> C : 200 {workflow_runs: [...]}
C ->> C : match on inputs.console_deploy_id
alt run found
C ->> DB: UPDATE run_id, run_url
C -->> U : 200 {status="dispatched", github_run_id="25246579234", ...}
else not found yet
C -->> U : 200 {status="dispatched", github_run_id=null}
end
end
note over W: ~30s after dispatch
W ->> C : POST /api/internal/deploys/{id}/status {status="building", run_id=...}
C ->> C : HMAC verify + transition check
C ->> DB: UPDATE status=building, run_id (if still NULL), audit
C -->> W : 204
note over U: next poll sees status=building
U ->> C : GET /api/internal/deploys/{id}
C -->> U : 200 {status="building", github_run_id="25246579234"}
W ->> C : POST status=deploying
W ->> C : POST status=succeeded
C ->> DB: UPDATE status=succeeded, audit
U ->> C : GET /api/internal/deploys/{id}
C -->> U : 200 {status="succeeded"}
note over U: transitions to State 4 (done)
Reconciler path (modal closed before run_id resolved)
sequenceDiagram
participant R as Reconciler (background)
participant DB as Postgres
participant GH as GitHub Actions API
note over DB: row stuck: status=dispatched, run_id=NULL, age>15s
loop every 60s
R ->> DB: SELECT rows WHERE run_id IS NULL AND status IN (requested,dispatched) AND age BETWEEN 15s AND 30min
R ->> GH: GET /actions/runs?event=workflow_dispatch&per_page=20
GH -->> R : 200 {workflow_runs}
R ->> R : match on inputs.console_deploy_id
alt match found
R ->> DB: UPDATE run_id, run_url, audit: console.deploy.run_id_recovered
else age > 30min
R ->> DB: UPDATE status=timed_out, audit: console.deploy.timeout
end
end
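The row selection in the reconciler diagram can be read as a three-way classification per row. The sketch below is one interpretation, not the PR #867 implementation; classify_orphan and the dict-shaped row are assumptions.

```python
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(seconds=15)   # younger rows are left to the lazy GET path
MAX_AGE = timedelta(minutes=30)   # older rows are marked timed_out


def classify_orphan(row: dict, now: datetime) -> str:
    """Return 'skip', 'lookup', or 'timed_out' for one console_deploys row."""
    if row["github_run_id"] is not None:
        return "skip"
    if row["status"] not in {"requested", "dispatched"}:
        return "skip"
    age = now - row["requested_at_utc"]
    if age < MIN_AGE:
        return "skip"
    if age > MAX_AGE:
        return "timed_out"
    return "lookup"
```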
8. Reconciler Scope Expansion
The current reconciler (PR #867) matches orphan rows by inputs.console_deploy_id.
This is the primary match path and remains unchanged.
A secondary match path by (surface_id, workflow_filename, requested_at_utc ± 30s,
target_env) should be added for the scenario where:
- console_deploy_id was injected as a workflow input but the GH run record does not yet expose inputs in the list API (GH returns inputs: null for very recent runs in some API versions), OR
- the run was triggered manually (break-glass) without a console_deploy_id input and the operator has a matching orphan row that should be linked.
The secondary match must only fire when the primary match returns nothing. It
must not over-match: a row is only linked via secondary match when exactly one
GH run exists within requested_at_utc ± 30s for (workflow_file, target_env).
If zero or two+ runs exist in that window, skip (ambiguous).
This expansion is scoped to sub-card #5 below.
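The "exactly one candidate" rule for the secondary match can be sketched as follows. The run dict fields (workflow_file, target_env, created_at) are hypothetical normalised values; how target_env is derived from a GH run is not specified here.

```python
from datetime import datetime, timedelta
from typing import Optional

MATCH_WINDOW = timedelta(seconds=30)


def secondary_match(row: dict, runs: list[dict]) -> Optional[dict]:
    """Link only when exactly one run falls in the window; otherwise treat as ambiguous."""
    candidates = [
        run for run in runs
        if run["workflow_file"] == row["workflow_file"]
        and run["target_env"] == row["target_env"]
        and abs(run["created_at"] - row["requested_at_utc"]) <= MATCH_WINDOW
    ]
    return candidates[0] if len(candidates) == 1 else None
```

Returning None for both zero and multiple candidates keeps the caller's handling simple: skip the row and let the next reconciler cycle retry.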
9. Flag and Rollout
Feature flag: FLAG_CONSOLE_DEPLOY_ASYNC (new, default off).
When off: behavior is unchanged from current (synchronous poll in POST, 201
response). When on: POST returns 202, _poll_and_capture_run_id is skipped,
GET handler performs lazy look-up.
The flag is independent of FLAG_CONSOLE_DEPLOY_UI, which gates the entire
deploy feature. The async flag is internal — it is not surfaced in the console
UI flag promotion panel.
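A sketch of how the flag might gate the success code. The set of truthy strings and both helper names are assumptions; only the env var name and the 201/202 split come from this section.

```python
import os
from typing import Mapping, Optional

# Assumed parsing rule; the console's real flag parser may accept other values.
_TRUTHY = {"1", "true", "yes", "on"}


def async_dispatch_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read FLAG_CONSOLE_DEPLOY_ASYNC; absent or unrecognised values mean off."""
    env = os.environ if env is None else env
    return env.get("FLAG_CONSOLE_DEPLOY_ASYNC", "").strip().lower() in _TRUTHY


def success_status_code(env: Optional[Mapping[str, str]] = None) -> int:
    """202 on the async path, 201 on the legacy synchronous path."""
    return 202 if async_dispatch_enabled(env) else 201
```

Defaulting to off when the variable is missing matches the "default off" rollout stance.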
Rollout plan:
| Stage | Description |
|---|---|
| dark | Flag off on prod. Sub-cards 1–3 shipped. Behavior identical to today (minus the H12 for the ~25% of dispatches that were getting hit). |
| flag on staging | FLAG_CONSOLE_DEPLOY_ASYNC=true on staging console. Manual smoke: dispatch staging surface, confirm modal resolves run_id within 3 polls, confirm 202 response. |
| flag on prod | Enable on prod. Monitor Heroku H12 logs for absence of /api/internal/deploys entries. |
| cleanup | Sub-card 5 (reconciler secondary match) + remove flag constant. |
10. Migrations
None. No schema changes. The new FLAG_CONSOLE_DEPLOY_ASYNC env var is an
operational toggle, not a DB column.
If a future operator wants to back-fill github_run_id on historical
dispatched rows, the reconciler's _find_run_for_deploy static method can
be run in a one-shot script. This is out of scope for the initial rollout.
11. Security Considerations
- No credential storage change. The same GITHUB_API_DISPATCH_TOKEN used for dispatch is used for the lazy GET look-up. It is read at request time from env. The 202 path adds one more GET call per poll-cycle that uses this token; total rate-limit exposure is bounded (one call per GET request while status=dispatched, bounded to 90 s window).
- HMAC callback back-fill. The callback handler accepting a run_id from the callback payload introduces no new trust assumption; the callback is already HMAC-verified with DEPLOY_CALLBACK_HMAC_SECRET. The GH workflow is the only sender that knows the secret.
- Idempotency-Key replay. The retry-on-timeout path produces at most one extra workflow_dispatch in the narrow (< 200 ms) window between GH 204 and DB commit. No additional mitigation required beyond rate-limit (5/hour).
- Audit trail unchanged. console.deploy.intent fires on row insert. console.deploy.callback fires on each status callback. The new lazy run_id write in the GET handler does not fire an audit row (it is observability data, not a state transition). The reconciler audit (console.deploy.run_id_recovered) continues unchanged.
- PII: No new PII collected. requested_by (email) is already in the row. Retention and DSR erasure paths unchanged from ADR-0034.
- Breach notification: No change. No new personal data introduced.
- Kill-switch: FLAG_CONSOLE_DEPLOY_ASYNC=false reverts to synchronous behavior. FLAG_CONSOLE_DEPLOY_UI=false disables the entire deploy feature.
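For reference, HMAC verification of a callback body typically looks like the sketch below. The signing scheme is an assumption (HMAC-SHA256 over the raw body, hex-encoded); this document does not specify how the console or the workflow compute the signature.

```python
import hashlib
import hmac


def verify_callback_signature(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Constant-time check of an assumed hex-encoded HMAC-SHA256 signature."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

Only after this check passes would the handler trust run_id and run_url from the payload for back-fill.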
12. Open Questions
None blocking sub-card #1. The design is complete.
Previously open:
- OQ-1: Should 202 or 201 be the success status code? Decision: 202, per RFC 7231 semantics (accepted but not yet complete). Modal JS must be updated.
- OQ-2: Should lazy run_id resolution live in GET handler or a background worker? Decision: GET handler. Rationale in section 3.
- OQ-3: What is the correct timeout for the lazy GH look-up? Decision: 8 s hard timeout on the GH request; stop attempting lazy look-up after 90 s from requested_at_utc.
- OQ-4: Callback before run_id: accept or reject? Decision: accept and back-fill. See section 6.
- OQ-5: Reconciler secondary match — in scope for this change? Decision: yes, sub-card #5, not blocking initial fix.