Status: Accepted
Owner: software-architect
Date: 2026-05-02 UTC
Related ADRs: ADR-0036, ADR-0034, ADR-0028
Refs: #898 (bug), #739 (reconciler), PR #867 (reconciler impl), PR #899 (modal error-render fix, already merged)
POST /api/internal/deploys synchronously polls GitHub Actions for a run_id
for up to 30 s after firing workflow_dispatch. The poll budget is:
_RUN_ID_POLL_ATTEMPTS = 3
_RUN_ID_POLL_INTERVAL_S = 10
_RUN_ID_POLL_TIMEOUT_S = 30
This budget sits exactly at Heroku's H12 request-timeout ceiling. When GitHub
takes longer than ~25 s to surface the new run in /actions/runs, Heroku kills
the request with an HTTP 503. The row is left at status=dispatched,
github_run_id=NULL, and the modal shows garbled error text even though the
workflow fired and ran to success (reproduced 2026-05-02 07:11 UTC, run
25246579234).
The reconciler (PR #867) is a 60-second background thread that re-attempts run_id look-up for orphaned rows. It is the correct long-tail safety net, but it cannot prevent the H12 from corrupting the UX for the operator watching the modal.
All platform invariants apply. Deploy-flow-specific constraints carry forward
unchanged from console-driven-deploy-flow.md:

- Manual fallback paths (gh workflow run, scripts/ops/deploy-console.sh)
  unchanged.
- heroku releases:rollback unchanged.

New invariant for this design: POST /api/internal/deploys must return a
response before Heroku's H12 ceiling. The request budget is: validate + DB
write + GH workflow_dispatch POST + commit. No synchronous GH polling within
the dispatch request.

Option (a) — validated and adopted.
Return 202 immediately after workflow_dispatch returns 204. Run-id
resolution moves to the GET /api/internal/deploys/<id> polling handler
(lazy, one attempt per poll). The 60-second reconciler remains the safety net
for the case where the modal is closed before a poll resolves the run_id.
A per-dispatch threadpool worker running inside the Heroku dyno has no persistence guarantee: dyno restarts or scaling events during the 10–30 s window orphan the in-flight look-up with no recovery. The reconciler already fills this niche reliably. Adding a second mechanism creates two competing writers to the same row with no locking.
The GH workflow could call back with its own run_id as a first step. This
would be reliable, but it requires modifying every workflow YAML file and
adds a callback round-trip before the row can show a run link. The lazy-GET
approach achieves the same outcome using the infrastructure already in place
(the modal polls every 2 s).
No schema changes. The existing console_deploys columns are sufficient:
| column | relevant behaviour |
|---|---|
| status | stays dispatched until first HMAC callback or lazy run_id success |
| github_run_id | NULL on 202 response; populated lazily by GET handler or reconciler |
| github_run_url | same lifecycle as github_run_id |
| requested_at_utc | used as the lower bound for the lazy GH look-up window |
Before: Synchronous; returned 201 with github_run_id populated (when
luck held) or H12 503 (when it didn't).
After:
- Returns 202 Accepted (not 201) when dispatch succeeds.
- Body is identical in shape: {id, status, status_url, github_run_id, github_run_url}.
- github_run_id and github_run_url will be null in the 202 response;
  clients must not assume these are populated on 202.
- status is dispatched.
- The _poll_and_capture_run_id call is removed from create_deploy.
The change from 201 to 202 is intentional: 202 signals "accepted for
processing; outcome not yet known" per RFC 7231. The modal JS already
branches on result.status === 201 || result.status === 200; this branch
must be updated to also accept 202.
Error cases unchanged: 422 validation, 429 rate-limit, 502 GH dispatch failure (GH returned non-204), 503 token missing.
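A minimal sketch of the new dispatch path under the contract above. All names here (FakeDB, FakeGitHub, create_deploy's signature) are illustrative stand-ins rather than the real console code; the point is the ordering — commit immediately after GH's 204, then return 202 with the run fields null:

```python
class FakeDB:
    """Stand-in for the real DB session; tracks commit for illustration."""
    def __init__(self):
        self.rows, self.committed = {}, False
    def insert(self, **fields):
        row = {"id": len(self.rows) + 1, **fields}
        self.rows[row["id"]] = row
        return row
    def commit(self):
        self.committed = True

class FakeGitHub:
    """Stand-in for the GH client; a real dispatch returns 204 No Content."""
    def dispatch_workflow(self, workflow_file, inputs):
        return 204

def create_deploy(db, gh, req):
    row = db.insert(status="requested", surface_id=req["surface_id"])
    code = gh.dispatch_workflow(req["workflow_file"],
                                inputs={"console_deploy_id": row["id"]})
    if code != 204:
        return 502, {"error": "github dispatch failed"}
    row["status"] = "dispatched"
    db.commit()  # committed before returning — H12 can only kill the response
    return 202, {
        "id": row["id"],
        "status": "dispatched",
        "status_url": f"/api/internal/deploys/{row['id']}",
        "github_run_id": None,   # never populated on the 202 path
        "github_run_url": None,
    }

db, gh = FakeDB(), FakeGitHub()
status, body = create_deploy(db, gh,
                             {"surface_id": "s1", "workflow_file": "deploy.yml"})
```

There is no polling loop anywhere in this path, so the handler's wall-clock cost is the GH dispatch POST plus one DB round-trip.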
GET /api/internal/deploys/<id> — lazy run_id resolution

When the following conditions are ALL true, the GET handler performs one (and only one) GH API look-up before returning:
deploy.github_run_id IS NULL
AND deploy.status = 'dispatched'
AND deploy.requested_at_utc >= now() - 90s
The 90-second window is generous enough to cover typical GH API lag (measured median ~5–15 s post-dispatch; p99 ~50 s). Beyond 90 s the reconciler owns recovery.
The look-up calls:
GET /repos/{repo}/actions/workflows/{workflow_name}/runs
?event=workflow_dispatch
&created=>{requested_at_utc ISO}
&per_page=5
Filtering by workflow_name (not just event=workflow_dispatch) reduces
false-positive risk on busy repos. If a match is found, the row is updated
in-place (run_id + run_url, commit). If not found, the row is returned
as-is with github_run_id=null — the modal shows "Dispatched — waiting for
run..." and polls again in 2 s.
Important: the GH look-up has a hard 8 s timeout. If the GH API is slow or unavailable, the GET returns the current row without modification. No errors bubble to the client; the row just stays unresolved until the reconciler cycle.
One attempt per poll. The GET handler does not loop. Each 2-second poll from the modal is exactly one GH attempt (when conditions are met). This bounds the GH API call rate to at most 1 req / 2 s during the 90-second window = ~45 GH requests per dispatch in the worst case.
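The guard-plus-single-attempt behaviour can be sketched as below. This is illustrative, not the real handler: maybe_resolve_run_id, the row dict shape, and gh_list_runs (a callable wrapping the GH list-runs GET above) are assumptions; the inputs-based match mirrors the reconciler's primary match path:

```python
import time

LAZY_WINDOW_S = 90          # beyond this, the reconciler owns recovery
GH_LOOKUP_TIMEOUT_S = 8     # hard timeout on the GH call

def maybe_resolve_run_id(row, gh_list_runs, now=None):
    now = time.time() if now is None else now
    eligible = (
        row["github_run_id"] is None
        and row["status"] == "dispatched"
        and now - row["requested_at"] <= LAZY_WINDOW_S
    )
    if not eligible:
        return row
    try:
        runs = gh_list_runs(timeout=GH_LOOKUP_TIMEOUT_S)  # one attempt, no loop
    except Exception:
        return row  # GH slow/unavailable: return row unmodified; no client error
    for run in runs:
        if (run.get("inputs") or {}).get("console_deploy_id") == row["id"]:
            row["github_run_id"] = run["id"]
            row["github_run_url"] = run["html_url"]
            break
    return row

row = {"id": 7, "status": "dispatched", "github_run_id": None,
       "github_run_url": None, "requested_at": time.time() - 10}
runs = [{"id": 25246579234,
         "html_url": "https://github.com/example/runs/25246579234",
         "inputs": {"console_deploy_id": 7}}]
row = maybe_resolve_run_id(row, lambda timeout: runs)
```

Because every failure mode falls through to "return the row as-is", the modal only ever sees a well-formed response.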
POST /api/internal/deploys/<id>/status (HMAC callback)

No contract change. The callback can arrive before the GET handler has
resolved the run_id. In this scenario the callback payload includes
github_run_id in the existing run_id field (already part of the
callback contract per PR #867). The callback handler should write
github_run_id + github_run_url from the payload when the row has
github_run_id IS NULL. See section 6 for the state machine implications.
requested
|
| (GH POST returns 204)
v
dispatched [run_id=NULL]
|
| -- lazy GET poll resolves run_id ---> dispatched [run_id=SET]
| |
| -- HMAC callback "building" arrives |
| (back-fills run_id if still NULL) ------+
v
building
|
v
deploying
|
+---> succeeded
+---> failed
+---> timed_out (reconciler after 30 min, or callback)
Key rule: dispatched is the only state where run_id may legitimately be
NULL. All other non-terminal states should have run_id set (either by
lazy GET, by callback, or by reconciler).
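The key rule can be encoded as a transition table. A minimal sketch, with edges read off the diagram above (illustrative; the real handler may accept additional edges):

```python
# Edges from the state diagram; timed_out is reachable from any
# non-terminal state via reconciler (30 min) or callback.
VALID_TRANSITIONS = {
    "requested":  {"dispatched"},
    "dispatched": {"building", "timed_out"},
    "building":   {"deploying", "failed", "timed_out"},
    "deploying":  {"succeeded", "failed", "timed_out"},
}

def check_transition(current, new):
    return new in VALID_TRANSITIONS.get(current, set())

def run_id_may_be_null(status):
    # dispatched is the only state where run_id may legitimately be NULL
    return status == "dispatched"
```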
This is an expected race: the GH "Notify — building" step fires ~30 s after
dispatch; the first GET poll fires ~2 s after dispatch. If the callback wins
the race, the callback handler must:

1. Accept the transition (dispatched -> building is valid).
2. Take github_run_id from the callback payload's run_id field (if
   present) and write it to the row at the same time as the status update.
3. Take github_run_url from the callback payload's run_url field (if present).

This avoids a subsequent GET-handler write racing against an already-resolved
row. The callback is the authoritative source of truth for run_id; the
lazy-GET is opportunistic and must check github_run_id IS NULL before
writing.
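The NULL-check ordering can be sketched as follows (names illustrative; the real callback handler also runs HMAC verification and the transition check before any write):

```python
def apply_callback(row, payload):
    """Apply a verified status callback; back-fill run fields only if unset."""
    row["status"] = payload["status"]
    if row.get("github_run_id") is None and payload.get("run_id"):
        row["github_run_id"] = payload["run_id"]
        row["github_run_url"] = payload.get("run_url")
    return row

def lazy_get_write(row, run_id, run_url):
    """Opportunistic lazy-GET write: must check NULL before writing."""
    if row.get("github_run_id") is None:
        row["github_run_id"], row["github_run_url"] = run_id, run_url
    return row
```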
The existing idempotency path handles this correctly once 202 is the success
code:

1. The first POST fires workflow_dispatch, commits the row at
   status=dispatched, and returns 202 (even if H12 killed the response in
   flight — the Heroku H12 kills the response, not the Flask handler; the
   row is already committed).
2. The operator retries with the same Idempotency-Key. _check_idempotency
   finds the non-terminal dispatched row and raises DeployIdempotencyHit.
3. The retry response carries the existing row's id; the modal starts polling
   GET — and the deploy that was already running resolves normally.

Critical edge case: if the row was never committed because Heroku killed
the Gunicorn worker process before db.session.commit() (not just the
HTTP response), the retry sees no idempotency hit and fires a second
workflow_dispatch. This is the same race that existed before this design.
Mitigation: the commit happens immediately after the GH 204 response, before
any polling loop. The H12 30 s ceiling is far past that point. The risk
window is narrow (the ~200 ms between GH returning 204 and the DB commit
completing). Acceptable; it is an improvement over the current 30 s window.
sequenceDiagram
participant U as Operator (modal)
participant C as Console (Flask)
participant DB as Postgres
participant GH as GitHub Actions API
participant W as GH Workflow
U ->> C : POST /api/internal/deploys {surface_id, target_ref, totp_code}
C ->> C : validate + idempotency + rate-limit
C ->> DB: INSERT console_deploys (status=requested)
C ->> DB: UPDATE status=dispatched (audit: console.deploy.intent)
C ->> GH: POST /actions/workflows/{file}/dispatches [timeout=15s]
GH -->> C : 204 No Content
C ->> DB: commit status=dispatched, run_id=NULL
C -->> U : 202 {id, status="dispatched", github_run_id=null}
note over U: transitions to State 3 (in-flight)
note over U: polls every 2s
loop every 2s while status=dispatched AND run_id IS NULL AND age < 90s
U ->> C : GET /api/internal/deploys/{id}
C ->> GH: GET /actions/workflows/{file}/runs?event=workflow_dispatch&created>=T [timeout=8s]
GH -->> C : 200 {workflow_runs: [...]}
C ->> C : match on inputs.console_deploy_id
alt run found
C ->> DB: UPDATE run_id, run_url
C -->> U : 200 {status="dispatched", github_run_id="25246579234", ...}
else not found yet
C -->> U : 200 {status="dispatched", github_run_id=null}
end
end
note over W: ~30s after dispatch
W ->> C : POST /api/internal/deploys/{id}/status {status="building", run_id=...}
C ->> C : HMAC verify + transition check
C ->> DB: UPDATE status=building, run_id (if still NULL), audit
C -->> W : 204
note over U: next poll sees status=building
U ->> C : GET /api/internal/deploys/{id}
C -->> U : 200 {status="building", github_run_id="25246579234"}
W ->> C : POST status=deploying
W ->> C : POST status=succeeded
C ->> DB: UPDATE status=succeeded, audit
U ->> C : GET /api/internal/deploys/{id}
C -->> U : 200 {status="succeeded"}
note over U: transitions to State 4 (done)
sequenceDiagram
participant R as Reconciler (background)
participant DB as Postgres
participant GH as GitHub Actions API
note over DB: row stuck: status=dispatched, run_id=NULL, age>15s
loop every 60s
R ->> DB: SELECT rows WHERE run_id IS NULL AND status IN (requested,dispatched) AND age BETWEEN 15s AND 30min
R ->> GH: GET /actions/runs?event=workflow_dispatch&per_page=20
GH -->> R : 200 {workflow_runs}
R ->> R : match on inputs.console_deploy_id
alt match found
R ->> DB: UPDATE run_id, run_url, audit: console.deploy.run_id_recovered
else age > 30min
R ->> DB: UPDATE status=timed_out, audit: console.deploy.timeout
end
end
The current reconciler (PR #867) matches orphan rows by inputs.console_deploy_id.
This is the primary match path and remains unchanged.
A secondary match path by (surface_id, workflow_filename, requested_at_utc ± 30s,
target_env) should be added for the scenario where:
- console_deploy_id was injected as a workflow input, but the GH run record
  does not yet expose inputs in the list API (GH returns inputs: null for
  very recent runs in some API versions), OR
- the run carries no console_deploy_id input and the operator has a matching
  orphan row that should be linked.

The secondary match must only fire when the primary match returns nothing. It
must not over-match: a row is only linked via secondary match when exactly one
GH run exists within requested_at_utc ± 30s for (workflow_file, target_env).
If zero or two+ runs exist in that window, skip (ambiguous).
This expansion is scoped to sub-card #5 below.
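The exactly-one-in-window rule can be sketched like this. Field names (workflow_file, target_env, created_at on the run; workflow_filename, requested_at on the row) are illustrative — the real reconciler would derive them from the GH run record and the console_deploys row:

```python
SECONDARY_WINDOW_S = 30  # requested_at_utc ± 30 s

def secondary_match(row, runs):
    """Return the single candidate run, or None when zero or 2+ (ambiguous)."""
    candidates = [
        r for r in runs
        if r["workflow_file"] == row["workflow_filename"]
        and r["target_env"] == row["target_env"]
        and abs(r["created_at"] - row["requested_at"]) <= SECONDARY_WINDOW_S
    ]
    return candidates[0] if len(candidates) == 1 else None
```

Skipping on ambiguity is deliberate: a wrong link would attach an operator's deploy row to someone else's run, which is worse than leaving the row for manual inspection.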
Feature flag: FLAG_CONSOLE_DEPLOY_ASYNC (new, default off).
When off: behavior is unchanged from current (synchronous poll in POST, 201
response). When on: POST returns 202, _poll_and_capture_run_id is skipped,
GET handler performs lazy look-up.
The flag is independent of FLAG_CONSOLE_DEPLOY_UI, which gates the entire
deploy feature. The async flag is internal — it is not surfaced in the console
UI flag promotion panel.
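Like the dispatch token, the flag is read from env at request time. A sketch of the toggle (the real flag helper may parse values differently):

```python
import os

def async_deploy_enabled(env=None):
    """True when FLAG_CONSOLE_DEPLOY_ASYNC is set to a truthy string."""
    env = os.environ if env is None else env
    value = env.get("FLAG_CONSOLE_DEPLOY_ASYNC", "").strip().lower()
    return value in {"1", "true", "yes"}
```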
Rollout plan:
| Stage | Description |
|---|---|
| dark | Flag off on prod. Sub-cards 1–3 shipped. Behavior identical to today (the H12s on the ~25% of dispatches that hit GH lag persist until the flag is enabled). |
| flag on staging | FLAG_CONSOLE_DEPLOY_ASYNC=true on staging console. Manual smoke: dispatch staging surface, confirm modal resolves run_id within 3 polls, confirm 202 response. |
| flag on prod | Enable on prod. Monitor Heroku H12 logs for absence of /api/internal/deploys entries. |
| cleanup | Sub-card 5 (reconciler secondary match) + remove flag constant. |
None. No schema changes. The new FLAG_CONSOLE_DEPLOY_ASYNC env var is an
operational toggle, not a DB column.
If a future operator wants to back-fill github_run_id on historical
dispatched rows, the reconciler's _find_run_for_deploy static method can
be run in a one-shot script. This is out of scope for the initial rollout.
- The GITHUB_API_DISPATCH_TOKEN used for dispatch is also used for the lazy
  GET look-up. It is read at request time from env. The 202 path adds one
  more GET call per poll cycle that uses this token; total rate-limit
  exposure is bounded (one call per GET request while status=dispatched,
  bounded to the 90 s window).
- Accepting run_id from the callback payload introduces no new trust
  assumption — the callback is already HMAC-verified with
  DEPLOY_CALLBACK_HMAC_SECRET. The GH workflow is the only sender that
  knows the secret.
- A retried POST could double-fire workflow_dispatch in the narrow (< 200 ms)
  window between GH 204 and DB commit. No additional mitigation required
  beyond rate-limit (5/hour).
- console.deploy.intent fires on row insert. console.deploy.callback fires
  on each status callback. The new lazy run_id write in the GET handler does
  not fire an audit row (it is observability data, not a state transition).
  The reconciler audit (console.deploy.run_id_recovered) continues unchanged.
- requested_by (email) is already in the row. Retention and DSR erasure
  paths unchanged from ADR-0034.
- FLAG_CONSOLE_DEPLOY_ASYNC=false reverts to synchronous behavior;
  FLAG_CONSOLE_DEPLOY_UI=false disables the entire deploy feature.

Open questions: none blocking sub-card #1. The design is complete.
Previously open:
requested_at_utc.