Status: Accepted
Date: 2026-05-02 UTC
Deciders: Kristerpher (product), software-architect
Refs: #898, docs/architecture/console-deploy-async-dispatch.md
POST /api/internal/deploys has a synchronous GH API polling loop
(_poll_and_capture_run_id, up to 30 s) inside the request lifecycle.
This loop sits exactly at Heroku's H12 30-second request-timeout ceiling.
On 2026-05-02 it caused a real incident: an H12 503, a modal showing garbled
error text, and an operator unsure whether the deploy had fired (it had).
We need to decouple run_id discovery from the dispatch request without introducing new infrastructure dependencies.
Three options were evaluated:
Option 1: lazy resolution in GET /api/internal/deploys/<id> poll handler.
POST /api/internal/deploys removes the _poll_and_capture_run_id call and
returns 202 immediately after committing status=dispatched, run_id=NULL.
GET /api/internal/deploys/<id> performs one GH API look-up per call when:
- github_run_id IS NULL
- status = 'dispatched'
- requested_at_utc >= now() - 90s
Each look-up is a single GET /actions/workflows/{file}/runs?event=workflow_dispatch&created=>={T} call with a hard 8 s timeout.
The 60-second reconciler (PR #867) remains the safety net for orphaned rows when the modal is closed before a poll resolves the run_id.
A new FLAG_CONSOLE_DEPLOY_ASYNC env var (default off) gates the behavior.
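The Option 1 gating logic can be sketched roughly as follows. Function names (`should_resolve_run_id`, `resolve_run_id`) and the injected `fetch_runs` callable are illustrative, not the actual implementation; in the real handler `fetch_runs` would hit the GH workflow-runs endpoint described above.

```python
from datetime import datetime, timedelta, timezone

RESOLVE_WINDOW_S = 90   # only attempt look-up for fresh dispatches
LOOKUP_TIMEOUT_S = 8    # hard cap so GET never approaches the H12 ceiling

def should_resolve_run_id(row: dict, now: datetime) -> bool:
    """Gate for the lazy look-up in GET /api/internal/deploys/<id>."""
    return (
        row["github_run_id"] is None
        and row["status"] == "dispatched"
        and row["requested_at_utc"] >= now - timedelta(seconds=RESOLVE_WINDOW_S)
    )

def resolve_run_id(row: dict, fetch_runs, now: datetime) -> dict:
    """At most one outbound GH call per GET; skipped entirely once
    run_id is set or the 90 s window has expired."""
    if not should_resolve_run_id(row, now):
        return row
    runs = fetch_runs(
        workflow_file=row["workflow_file"],
        created_after=row["requested_at_utc"],
        timeout=LOOKUP_TIMEOUT_S,
    )
    if runs:  # newest matching run wins; the reconciler remains the safety net
        row["github_run_id"] = runs[0]["id"]
    return row
```

Rows older than 90 s fall through to the reconciler rather than triggering look-ups forever.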
Positive:
- POST /api/internal/deploys p99 drops from ~30 s to ~3–5 s (validate +
DB write + GH POST). H12 incidents eliminated under normal GH latency.
- Modal operator experience: run_id appears within 2–4 polls (4–8 s) in the
common case. Indistinguishable from current UX when GH is fast.
- No new infrastructure. The GET endpoint already exists and is called every
2 s by the modal.
- Retry-on-timeout via Idempotency-Key already works correctly: the
idempotency check finds the committed dispatched row and returns it.
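A minimal sketch of why retry-on-timeout stays safe. The handler shape and the `store` dict are illustrative, not the actual implementation; the point is that a retried request with the same Idempotency-Key finds the committed row and never fires a second dispatch.

```python
def post_deploy(idempotency_key: str, store: dict, dispatch) -> tuple[int, dict]:
    """Idempotent POST /api/internal/deploys (sketch)."""
    existing = store.get(idempotency_key)
    if existing is not None:
        return 202, existing          # retry: return committed row, no re-dispatch
    row = {"status": "dispatched", "github_run_id": None}
    dispatch()                        # GH workflow_dispatch POST
    store[idempotency_key] = row      # committed before the 202 is returned
    return 202, row
```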
Negative / trade-offs:
- GET handler now makes a conditional outbound GH API call. Adds ~8 s of
potential latency to GET responses in the 0–90 s window when GH is slow.
Mitigated: hard 8 s timeout; call is skipped entirely once run_id is set.
- The 202 response code change requires a one-line update in the modal JS
(status === 201 || status === 200 → add 202).
- One additional test surface: the lazy look-up branch in get_deploy.
Option 2: a background thread per dispatch. A threading.Thread started per
dispatch polls GH and writes run_id.
Rejected because:
- Heroku dyno restarts and rolling deploys orphan in-flight threads silently.
- Two concurrent writers (thread + reconciler) for the same row with no
  row-level locking. SQLite WAL is not designed for concurrent async writers.
- Adds complexity without reliability benefit: the reconciler already handles
  this niche, and the GET handler approach is simpler and stateless.
Option 3: self-reporting workflows. The GH workflow's first step would POST
its own run_id to the callback endpoint before doing any real work.
Rejected because:
- Requires modifying every workflow YAML (deploy-heroku.yml,
deploy-console.yml, deploy-customer-docs.yml).
- Adds a mandatory network round-trip into every workflow before build starts.
- The callback endpoint already handles run_id back-fill as part of the
"building" callback (section 6 of the design doc) — no workflow change needed.
- Break-glass manual gh workflow run would not supply the run_id, requiring
special-casing.
The modal's "Dispatched — waiting for run..." state shown during the 0–90 s
window when run_id is still null is a new UX state. The existing step
tracker already handles status=dispatched gracefully (no step is marked
active). No new UI components are required; the status text updates
naturally via the existing poll loop.
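The new state slots into the existing status-to-label mapping; a minimal sketch, where only the "waiting for run" label is quoted from this ADR and the other labels are illustrative:

```python
def modal_status_label(row: dict) -> str:
    """Status text rendered by the existing 2 s poll loop (sketch)."""
    if row["status"] == "dispatched" and row["github_run_id"] is None:
        return "Dispatched — waiting for run..."  # new 0-90 s state (label from ADR)
    if row["status"] == "dispatched":
        return "Dispatched"                       # illustrative
    return row["status"]                          # illustrative fallback
```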