Status: Accepted
Date: 2026-05-02 UTC
Deciders: Kristerpher (product), software-architect
Refs: #898, docs/architecture/console-deploy-async-dispatch.md
POST /api/internal/deploys has a synchronous GH API polling loop
(_poll_and_capture_run_id, up to 30 s) inside the request lifecycle.
This loop sits exactly at Heroku's H12 30-second request-timeout ceiling.
On 2026-05-02 it caused a real incident: an H12 503, a modal showing garbled
error text, and an operator unsure whether the deploy had fired (it had).
We need to decouple run_id discovery from the dispatch request without introducing new infrastructure dependencies.
Three options were evaluated:
Option 1: lazy resolution in GET /api/internal/deploys/<id> poll handler.
POST /api/internal/deploys removes the _poll_and_capture_run_id call and
returns 202 immediately after committing status=dispatched, run_id=NULL.
GET /api/internal/deploys/<id> performs one GH API look-up per call when:
- github_run_id IS NULL
- status = 'dispatched'
- requested_at_utc >= now() - 90s
Each look-up is a single GET /actions/workflows/{file}/runs?event=workflow_dispatch&created=>={T} call with a hard 8 s timeout.
The 60-second reconciler (PR #867) remains the safety net for orphaned rows when the modal is closed before a poll resolves the run_id.
A new FLAG_CONSOLE_DEPLOY_ASYNC env var (default off) gates the behavior.
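The Option 1 gating logic can be sketched roughly as follows. Function names (`should_resolve_run_id`, `resolve_run_id`) and the injected `fetch_runs` callable are illustrative, not the actual implementation; in the real handler `fetch_runs` would hit the GH workflow-runs endpoint described above.

```python
from datetime import datetime, timedelta, timezone

RESOLVE_WINDOW_S = 90   # only attempt look-up for fresh dispatches
LOOKUP_TIMEOUT_S = 8    # hard cap so GET never approaches the H12 ceiling

def should_resolve_run_id(row: dict, now: datetime) -> bool:
    """Gate for the lazy look-up in GET /api/internal/deploys/<id>."""
    return (
        row["github_run_id"] is None
        and row["status"] == "dispatched"
        and row["requested_at_utc"] >= now - timedelta(seconds=RESOLVE_WINDOW_S)
    )

def resolve_run_id(row: dict, fetch_runs, now: datetime) -> dict:
    """At most one outbound GH call per GET; skipped entirely once
    run_id is set or the 90 s window has expired."""
    if not should_resolve_run_id(row, now):
        return row
    runs = fetch_runs(
        workflow_file=row["workflow_file"],
        created_after=row["requested_at_utc"],
        timeout=LOOKUP_TIMEOUT_S,
    )
    if runs:  # newest matching run wins; the reconciler remains the safety net
        row["github_run_id"] = runs[0]["id"]
    return row
```

Rows older than 90 s fall through to the reconciler rather than triggering look-ups forever.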
Positive:
- POST /api/internal/deploys p99 drops from ~30 s to ~3–5 s (validate +
DB write + GH POST). H12 incidents eliminated under normal GH latency.
- Modal operator experience: run_id appears within 2–4 polls (4–8 s) in the
common case. Indistinguishable from current UX when GH is fast.
- No new infrastructure. The GET endpoint already exists and is called every
2 s by the modal.
- Retry-on-timeout via Idempotency-Key already works correctly: the
idempotency check finds the committed dispatched row and returns it.
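A minimal sketch of why retry-on-timeout stays safe. The handler shape and the `store` dict are illustrative, not the actual implementation; the point is that a retried request with the same Idempotency-Key finds the committed row and never fires a second dispatch.

```python
def post_deploy(idempotency_key: str, store: dict, dispatch) -> tuple[int, dict]:
    """Idempotent POST /api/internal/deploys (sketch)."""
    existing = store.get(idempotency_key)
    if existing is not None:
        return 202, existing          # retry: return committed row, no re-dispatch
    row = {"status": "dispatched", "github_run_id": None}
    dispatch()                        # GH workflow_dispatch POST
    store[idempotency_key] = row      # committed before the 202 is returned
    return 202, row
```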
Negative / trade-offs:
- GET handler now makes a conditional outbound GH API call. Adds ~8 s of
potential latency to GET responses in the 0–90 s window when GH is slow.
Mitigated: hard 8 s timeout; call is skipped entirely once run_id is set.
- The 202 response code change requires a one-line update in the modal JS
(status === 201 || status === 200 → add 202).
- One additional test surface: the lazy look-up branch in get_deploy.
Option 2: a background thread per dispatch. A threading.Thread started per
dispatch polls GH and writes run_id.
Rejected because:
- Heroku dyno restarts and rolling deploys orphan in-flight threads silently.
- Two concurrent writers (thread + reconciler) for the same row with no
  row-level locking. SQLite WAL is not designed for concurrent async writers.
- Adds complexity without reliability benefit: the reconciler already handles
  this niche, and the GET handler approach is simpler and stateless.
Option 3: self-reporting workflows. The GH workflow's first step would POST
its own run_id to the callback endpoint before doing any real work.
Rejected because:
- Requires modifying every workflow YAML (deploy-heroku.yml,
deploy-console.yml, deploy-customer-docs.yml).
- Adds a mandatory network round-trip into every workflow before build starts.
- The callback endpoint already handles run_id back-fill as part of the
"building" callback (section 6 of the design doc) — no workflow change needed.
- Break-glass manual gh workflow run would not supply the run_id, requiring
special-casing.
The modal's "Dispatched — waiting for run..." state shown during the 0–90 s
window when run_id is still null is a new UX state. The existing step
tracker already handles status=dispatched gracefully (no step is marked
active). No new UI components are required; the status text updates
naturally via the existing poll loop.
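The new state slots into the existing status-to-label mapping; a minimal sketch, where only the "waiting for run" label is quoted from this ADR and the other labels are illustrative:

```python
def modal_status_label(row: dict) -> str:
    """Status text rendered by the existing 2 s poll loop (sketch)."""
    if row["status"] == "dispatched" and row["github_run_id"] is None:
        return "Dispatched — waiting for run..."  # new 0-90 s state (label from ADR)
    if row["status"] == "dispatched":
        return "Dispatched"                       # illustrative
    return row["status"]                          # illustrative fallback
```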