Owner: ops / Kristerpher
Last updated: 2026-04-30 UTC
ADR: docs/architecture/adr/0033-self-hosted-ci-runners.md
Triggering incident: run 25189616454, Deploy customer docs, wrangler transient failure 2026-05-01 UTC
Current posture: GitHub-hosted ubuntu-latest runners + retry hardening on all wrangler/external-API deploy steps.
Escalation path: migrate to Ubicloud managed runners when monthly CI burn exceeds $20 incremental for two consecutive months (see Phase 2 below).
A single-attempt npx wrangler pages deploy or external curl call can fail on a transient network event between GitHub's runner pool and the CF API edge. A retry loop absorbs these failures without manual re-runs.
Apply the retry pattern to every step that calls wrangler or an external API:
| Workflow | Step(s) to wrap |
|---|---|
| deploy-customer-docs.yml | Ensure CF Pages project (raxx-docs), Deploy dist-customer-docs/ to Cloudflare Pages |
| deploy-internal-docs.yml | wrangler project create + wrangler deploy |
| deploy-status-page.yml | wrangler deploy |
| deploy-console.yml | wrangler deploy |
| deploy-status-worker.yml | wrangler deploy |
Do NOT wrap:
- actions/checkout — idempotent and retried internally by the action
- Post-deploy health checks in deploy-heroku.yml — already implement their own retry loop
- git push to Heroku — git push is idempotent; wrap if persistent flake surfaces
attempt=0
max_attempts=3
delay=15
until <command>; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "::error::<step name> failed after $max_attempts attempts"
    exit 1
  fi
  echo "<step name> attempt $attempt failed; retrying in ${delay}s ..."
  sleep "$delay"
  delay=$((delay * 2))
done
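Backoff schedule: 15s, then 30s. With max_attempts=3 the loop exits on the third failure, so the doubled 60s delay is never slept. Max wall-clock delay before hard failure: ~45s per step (excluding command run time). Acceptable for deploy workflows.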
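The same pattern can be factored into a reusable function so each workflow step does not repeat the loop. The following is a minimal sketch, not the implementation used in the workflows: the function name `retry` and its argument order are assumptions made for illustration, and it returns non-zero instead of calling `exit` so the caller decides how to fail.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# (hypothetical helper) Runs COMMAND until it succeeds or MAX_ATTEMPTS
# attempts have failed, doubling the delay after each failed attempt.
# Returns 0 on success, 1 once the attempts are exhausted.
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "::error::'$*' failed after $max_attempts attempts"
      return 1
    fi
    echo "'$*' attempt $attempt failed; retrying in ${delay}s ..."
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Illustrative call, commented out so sourcing this file has no side effects.
# The directory and project name are taken from the table of steps to wrap:
# retry 3 15 npx wrangler pages deploy dist-customer-docs/ --project-name raxx-docs
```

Because the command is passed as separate arguments and run via `"$@"`, quoting inside the wrapped command is preserved; a step that needs pipes or redirection would have to wrap them in `bash -c '...'`.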
1. Push a change under docs/customer/ to trigger deploy-customer-docs.yml.
2. Temporarily insert exit 1 before the wrangler call and verify the loop fires. Revert before merging.

Migrate when both of the following are true:
Check the GH Actions usage dashboard at: Settings > Billing > Actions.
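The cost condition from the escalation path (more than $20 incremental for two consecutive months) can be checked mechanically once the monthly figures are read off the dashboard. This is a rough sketch covering only that cost condition: the function name `should_migrate` is hypothetical, the figures are entered by hand in whole dollars, and nothing here queries the billing page.

```shell
#!/usr/bin/env bash
# should_migrate THRESHOLD COST [COST...]
# (hypothetical helper) Succeeds when the two most recent monthly
# incremental costs, listed oldest first in whole dollars, both exceed
# THRESHOLD. Fails when fewer than two months of data are given.
should_migrate() {
  local threshold=$1
  shift
  local costs=("$@")
  local n=${#costs[@]}
  [ "$n" -ge 2 ] || return 1
  local prev="${costs[n-2]}" last="${costs[n-1]}"
  [ "$prev" -gt "$threshold" ] && [ "$last" -gt "$threshold" ]
}

# Illustrative figures only (not real billing data):
if should_migrate 20 12 25 31; then
  echo "cost trigger met: last two months both above threshold"
fi
```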
- raxx-app/TradeMasterAPI
- ubicloud-standard-2 for 2 vCPU / 8 GB)

Migrate in this order to minimize blast radius:
1. ci.yml: highest minute consumer; safe to migrate first (no deploy side effects)
2. ci-pr.yml: PR-gated; failures are visible before merge
3. deploy-heroku.yml: migrate after 2+ successful ci.yml runs on Ubicloud
4. security-zap.yml: migrate after deploy workflows are stable

# Before
runs-on: ubuntu-latest
# After (Ubicloud)
runs-on: ubicloud-standard-2
For jobs that must stay on GH-hosted (e.g., if a third-party action requires GitHub's environment), keep ubuntu-latest and document the exception inline.
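The label swap can be scripted so it is applied the same way in each workflow file. The sketch below is one possible approach, with `set_runner` as a hypothetical helper name. It rewrites only lines that consist solely of `runs-on: <label>`, so a job whose `runs-on` line carries a trailing inline comment documenting an exception is left untouched.

```shell
#!/usr/bin/env bash
# set_runner FILE FROM_LABEL TO_LABEL
# (hypothetical helper) Rewrites every bare `runs-on: FROM_LABEL` line in
# FILE to use TO_LABEL, preserving indentation. Lines with anything after
# the label (such as an inline exception comment) do not match and are kept.
# Uses a temp file instead of `sed -i` to stay portable across GNU/BSD sed.
set_runner() {
  local file=$1 from=$2 to=$3 tmp
  tmp=$(mktemp)
  if sed "s/^\([[:space:]]*runs-on:[[:space:]]*\)${from}[[:space:]]*\$/\1${to}/" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place so file permissions are kept
  fi
  rm -f "$tmp"
}

# Illustrative call (commented out; the path is an assumption):
# set_runner .github/workflows/ci.yml ubuntu-latest ubicloud-standard-2
```

Swapping the last two arguments gives the reverse change, which may be handy for a temporary fallback to ubuntu-latest; a `git revert` of the migration commit remains the cleaner rollback.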
After updating each workflow file:
1. Trigger a manual run (workflow_dispatch or a test PR).
2. Confirm: vault secret loading works (Infisical outbound), artifact uploads succeed, CF Pages deploy reaches CF API.
3. Wait 2 full business days before migrating the next workflow.
# Revert the runs-on label changes in the affected workflow files
git revert <migration-commit-sha>
git push origin main
No data migration required. All runner state is ephemeral. Rollback takes effect on the next workflow run.
GH-hosted runner outages are rare and typically short (<30 min). Action:
1. Check https://www.githubstatus.com — if "Actions" is degraded, wait.
2. Re-run failed jobs after status clears.
3. For urgent deploys during a GH outage: deploy manually via git push to the Heroku remote and npx wrangler pages deploy from a local terminal with vault secrets loaded.
If a wrangler deploy fails all 3 retry attempts:
1. Check https://www.cloudflarestatus.com — if CF API is degraded, wait and re-run the workflow.
2. If CF API is healthy, check that CF_PAGES_DEPLOY_TOKEN in Infisical is valid and not expired: infisical secrets get CF_PAGES_DEPLOY_TOKEN --env=prod.
3. If the token is valid, check wrangler version (npx wrangler --version) — pin to a specific version if a bad release is suspected.
If a Ubicloud runner shows as offline in the GitHub repo > Settings > Actions > Runners:
1. Log in to cloud.ubicloud.com and check the runner pool status.
2. If the runner pool is degraded, temporarily change runs-on back to ubuntu-latest for the affected workflow(s).
3. File a support ticket with Ubicloud referencing the runner pool name.
Bootstrap secrets (INFISICAL_CLIENT_ID, INFISICAL_CLIENT_SECRET) can be rotated without redeployment:
1. Generate new Infisical machine identity credentials in the Infisical dashboard.
2. Update the GH Actions repository secrets at Settings > Secrets > Actions.
3. Trigger a test workflow run to confirm vault access still works.
4. Revoke the old credentials in Infisical.
| Scenario | Runner | Min/mo | Monthly cost |
|---|---|---|---|
| Current (~3 merges/day) | GH-hosted | ~5,500 | ~$20 ($4 Pro + ~$16 overage) |
| 5x growth (~15 merges/day) | GH-hosted | ~27,500 | ~$200 |
| 5x growth | Ubicloud | ~27,500 | ~$116 |
| 5x growth | Lightsail self-hosted | ~27,500 | ~$10-20 (VM only) |
At 5x growth, Ubicloud saves ~$84/mo vs GH-hosted. Lightsail self-hosted has a lower VM bill but adds ongoing maintenance: counting ~1 hour/month of operator time at $150/hr opportunity cost, its effective cost is ~$160-170/mo, above Ubicloud's ~$116. Ubicloud therefore wins until volume justifies dedicated ops time.
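The comparison can be reproduced with a few lines of integer arithmetic. This is only a worked check of the figures in the table and paragraph above; the variable names are made up for the example, and the inputs are the approximate dollar amounts already quoted.

```shell
#!/usr/bin/env bash
# Approximate monthly costs at 5x growth, in whole dollars (from the table).
gh_hosted=200
ubicloud=116
lightsail_low=10     # VM only, low end
lightsail_high=20    # VM only, high end
ops_hours=1          # operator time per month for self-hosting
ops_rate=150         # opportunity cost per hour

ubicloud_savings=$((gh_hosted - ubicloud))
lightsail_eff_low=$((lightsail_low + ops_hours * ops_rate))
lightsail_eff_high=$((lightsail_high + ops_hours * ops_rate))

echo "Ubicloud saves \$${ubicloud_savings}/mo vs GH-hosted"
echo "Lightsail effective cost: \$${lightsail_eff_low}-\$${lightsail_eff_high}/mo vs Ubicloud \$${ubicloud}/mo"
# Prints:
#   Ubicloud saves $84/mo vs GH-hosted
#   Lightsail effective cost: $160-$170/mo vs Ubicloud $116/mo
```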
- docs/architecture/adr/0033-self-hosted-ci-runners.md
- .github/workflows/deploy-*.yml