Runbook — CI runner posture
Owner: ops / Kristerpher
Last updated: 2026-04-30 UTC
ADR: docs/architecture/adr/0033-self-hosted-ci-runners.md
Triggering incident: run 25189616454, Deploy customer docs, wrangler transient failure 2026-05-01 UTC
Summary
Current posture: GitHub-hosted ubuntu-latest runners + retry hardening on all wrangler/external-API deploy steps.
Escalation path: migrate to Ubicloud managed runners when monthly CI burn exceeds $20 incremental for two consecutive months (see Phase 2 below).
Phase 1 — Retry hardening on GH-hosted runners
What this fixes
A single-attempt npx wrangler pages deploy or external curl call can fail on a transient network event between GitHub's runner pool and the CF API edge. A retry loop absorbs these failures without requiring manual re-runs.
Affected workflows
Apply the retry pattern to every step that calls wrangler or an external API:
| Workflow | Step(s) to wrap |
|---|---|
| deploy-customer-docs.yml | Ensure CF Pages project (raxx-docs), Deploy dist-customer-docs/ to Cloudflare Pages |
| deploy-internal-docs.yml | wrangler project create + wrangler deploy |
| deploy-status-page.yml | wrangler deploy |
| deploy-console.yml | wrangler deploy |
| deploy-status-worker.yml | wrangler deploy |
Do NOT wrap:
- actions/checkout — idempotent and retried internally by the action
- Post-deploy health checks in deploy-heroku.yml — already implement their own retry loop
- git push to Heroku — git push is idempotent; wrap if persistent flake surfaces
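To cross-check the table above against the repository, a quick grep can list every wrangler invocation in the deploy workflows. This is a minimal sketch: the function name list_wrangler_steps is hypothetical, and the default directory assumes the .github/workflows/deploy-*.yml layout noted under Reference.

```shell
# list_wrangler_steps [WORKFLOW_DIR]
# Prints file:line:text for every line mentioning wrangler in the deploy
# workflows, so each hit can be compared with the "Step(s) to wrap" table.
# WORKFLOW_DIR defaults to .github/workflows (assumed repo layout).
list_wrangler_steps() {
  workflow_dir=${1:-.github/workflows}
  grep -Hn 'wrangler' "$workflow_dir"/deploy-*.yml
}

# Usage, from the repository root:
#   list_wrangler_steps
```

Any hit that is not yet inside a retry loop is a candidate for wrapping.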
Retry shell pattern
attempt=0
max_attempts=3
delay=15
until <command>; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "::error::<step name> failed after $max_attempts attempts"
    exit 1
  fi
  echo "<step name> attempt $attempt failed — retrying in ${delay}s ..."
  sleep "$delay"
  delay=$((delay * 2))
done
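Backoff schedule: 15s, then 30s. With max_attempts=3 the loop sleeps only twice, because the third failure exits immediately instead of sleeping for 60s. Max added wall-clock delay before hard failure: ~45s per step. Acceptable for deploy workflows.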
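The schedule can be checked by counting the sleeps the loop as written actually performs. The sketch below is illustrative only (backoff_schedule is a hypothetical helper, not part of any workflow): it mirrors the loop's arithmetic and shows that N attempts produce N-1 sleeps, so a 60s sleep would only occur with max_attempts=4.

```shell
# backoff_schedule MAX_ATTEMPTS INITIAL_DELAY
# Prints the sleeps performed between attempts, then their total.
# N attempts means N-1 sleeps: the loop exits right after the last failure.
backoff_schedule() {
  max_attempts=$1
  delay=$2
  total=0
  schedule=""
  i=1
  while [ "$i" -lt "$max_attempts" ]; do
    schedule="${schedule}${delay}s "
    total=$((total + delay))
    delay=$((delay * 2))
    i=$((i + 1))
  done
  echo "${schedule}total=${total}s"
}

backoff_schedule 3 15   # prints: 15s 30s total=45s
backoff_schedule 4 15   # prints: 15s 30s 60s total=105s
```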
Testing after patching
- Open a test PR that modifies a file in docs/customer/ to trigger deploy-customer-docs.yml.
- Confirm the workflow passes without manual intervention.
- To test the retry path locally: temporarily inject exit 1 before the wrangler call and verify the loop fires. Revert before merging.
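One way to exercise the retry path locally without touching Cloudflare is to swap the wrangler call for a stub that fails a set number of times. The sketch below makes several assumptions: retry and flaky are hypothetical names, the loop is wrapped in a function that returns 1 rather than calling exit, and the delay is set to 0 to keep the dry run fast. Note that a bare exit 1 run in the same shell as the loop ends the script immediately, so the stub returns a non-zero status instead.

```shell
# retry MAX_ATTEMPTS DELAY COMMAND [ARGS...]
# Same loop as the runbook pattern, wrapped in a function for local testing.
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "::error::$1 failed after $max_attempts attempts"
      return 1
    fi
    echo "$1 attempt $attempt failed, retrying in ${delay}s ..."
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Hypothetical stand-in for the wrangler call: fails on its first
# FAILS_BEFORE_SUCCESS invocations, counting calls in COUNTER_FILE.
flaky() {
  count=$(($(cat "$COUNTER_FILE") + 1))
  echo "$count" > "$COUNTER_FILE"
  [ "$count" -gt "$FAILS_BEFORE_SUCCESS" ]
}

COUNTER_FILE=$(mktemp)
echo 0 > "$COUNTER_FILE"
FAILS_BEFORE_SUCCESS=2
retry 3 0 flaky && echo "succeeded on call $(cat "$COUNTER_FILE")"
rm -f "$COUNTER_FILE"
```

With two injected failures and three attempts allowed, the run logs two retry messages and then succeeds on the third call.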
Phase 2 — Ubicloud migration (trigger-gated)
Trigger condition
Migrate when either of the following is true:
- Monthly CI minutes exceed 5,000 for two consecutive months, OR
- Monthly GitHub Actions invoice exceeds $20 incremental (above the free tier) for two consecutive months
Check the GH Actions usage dashboard at: Settings > Billing > Actions.
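The trigger can be expressed as a small check once the two most recent months of figures are copied from the usage dashboard. This is a sketch under stated assumptions: should_migrate is a hypothetical helper, inputs are whole numbers entered by hand, and the two conditions are read as alternatives, per the OR between them.

```shell
# should_migrate MIN_PREV MIN_CURR INVOICE_PREV INVOICE_CURR
# Arguments: CI minutes and incremental invoice dollars (whole numbers) for
# the two most recent months. Returns 0 when either trigger held both months.
should_migrate() {
  min_threshold=5000
  invoice_threshold=20
  if [ "$1" -gt "$min_threshold" ] && [ "$2" -gt "$min_threshold" ]; then
    return 0
  fi
  if [ "$3" -gt "$invoice_threshold" ] && [ "$4" -gt "$invoice_threshold" ]; then
    return 0
  fi
  return 1
}

# Example with made-up figures: minutes above 5,000 in both months.
if should_migrate 5200 5600 14 18; then
  echo "trigger met: start the pre-migration checklist"
else
  echo "trigger not met: stay on GH-hosted runners"
fi
```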
Pre-migration checklist
- [ ] Create account at https://cloud.ubicloud.com
- [ ] Install the Ubicloud GitHub App on raxx-app/TradeMasterAPI
- [ ] Confirm runner pool label (currently ubicloud-standard-2 for 2 vCPU / 8 GB)
- [ ] Note the runner OS image version (must be Ubuntu 22.04-compatible for our workflow assumptions)
Workflow migration order
Migrate in this order to minimize blast radius:
1. ci.yml: highest minute consumer; safe to migrate first (no deploy side effects)
2. ci-pr.yml: PR-gated; failures are visible before merge
3. deploy-heroku.yml: migrate after 2+ successful ci.yml runs on Ubicloud
4. security-zap.yml: migrate after deploy workflows stable
5. All CF Pages deploy workflows: migrate last; these touch CF API and benefit least from runner change
Label change
# Before
runs-on: ubuntu-latest
# After (Ubicloud)
runs-on: ubicloud-standard-2
For jobs that must stay on GH-hosted (e.g., if a third-party action requires GitHub's environment), keep ubuntu-latest and document the exception inline.
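The label swap can be scripted per workflow file so each migration step stays a one-line, reviewable change. The helper below is a hypothetical sketch: switch_runner is not an existing script, it only handles the plain scalar form runs-on: LABEL at the end of a line (no matrix expressions or trailing comments), and it writes through a temp file because sed -i differs between GNU and BSD sed.

```shell
# switch_runner FILE FROM_LABEL TO_LABEL
# Rewrites "runs-on: FROM_LABEL" to "runs-on: TO_LABEL" in one workflow file.
# Only matches when the label is the last thing on the line.
switch_runner() {
  file=$1
  from=$2
  to=$3
  tmp=$(mktemp)
  if sed "s/runs-on: ${from}\$/runs-on: ${to}/" "$file" > "$tmp"; then
    # Copy contents back rather than mv, to keep the file's permissions.
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}

# Usage, from the repository root (review the diff before committing):
#   switch_runner .github/workflows/ci.yml ubuntu-latest ubicloud-standard-2
#   git diff .github/workflows/ci.yml
```

Swapping the last two arguments gives the reverse change if a workflow needs to fall back to ubuntu-latest.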
Validation gate
After updating each workflow file:
1. Trigger a manual run (workflow_dispatch or a test PR).
2. Confirm: vault secret loading works (Infisical outbound), artifact uploads succeed, CF Pages deploy reaches CF API.
3. Wait 2 full business days before migrating the next workflow.
Rollback
# Revert the runs-on label changes in the affected workflow files
git revert <migration-commit-sha>
git push origin main
No data migration required. All runner state is ephemeral. Rollback takes effect on the next workflow run.
Cost tracking after migration
- Ubicloud bills per minute on their platform. Monitor at cloud.ubicloud.com billing dashboard.
- GH Actions minutes drop to ~0 for migrated workflows (self-hosted runners don't consume GH minutes on private repos).
- Expected monthly cost at 5,000 min/mo: ~$21 on Ubicloud vs ~$16 incremental on GH-hosted. Delta: ~$5/mo more, accepted in exchange for improved vendor diversity.
Failure modes and recovery
Runner pool unavailable (GH-hosted)
GH-hosted runner outages are rare and typically short (<30 min). Action:
1. Check https://www.githubstatus.com — if "Actions" is degraded, wait.
2. Re-run failed jobs after status clears.
3. For urgent deploys during a GH outage: deploy manually with git push to the Heroku remote and npx wrangler pages deploy from a local terminal with vault secrets loaded.
Wrangler transient failure (post-retry-hardening)
If a wrangler deploy fails all 3 retry attempts:
1. Check https://www.cloudflarestatus.com — if CF API is degraded, wait and re-run the workflow.
2. If CF API is healthy, check that CF_PAGES_DEPLOY_TOKEN in Infisical is valid and not expired: infisical secrets get CF_PAGES_DEPLOY_TOKEN --env=prod.
3. If the token is valid, check wrangler version (npx wrangler --version) — pin to a specific version if a bad release is suspected.
Ubicloud runner offline (Phase 2 only)
If a Ubicloud runner shows as offline in the GitHub repo > Settings > Actions > Runners:
1. Log in to cloud.ubicloud.com and check the runner pool status.
2. If the runner pool is degraded, temporarily change runs-on back to ubuntu-latest for the affected workflow(s).
3. File a support ticket with Ubicloud referencing the runner pool name.
GH Actions secret rotation
Bootstrap secrets (INFISICAL_CLIENT_ID, INFISICAL_CLIENT_SECRET) can be rotated without a redeployment:
1. Generate new Infisical machine identity credentials in the Infisical dashboard.
2. Update the GH Actions repository secrets at Settings > Secrets > Actions.
3. Trigger a test workflow run to confirm vault access still works.
4. Revoke the old credentials in Infisical.
Cost projection
| Scenario | Runner | Min/mo | Monthly cost |
|---|---|---|---|
| Current (~3 merges/day) | GH-hosted | ~5,500 | ~$20 ($4 Pro + ~$16 overage) |
| 5x growth (~15 merges/day) | GH-hosted | ~27,500 | ~$200 |
| 5x growth | Ubicloud | ~27,500 | ~$116 |
| 5x growth | Lightsail self-hosted | ~27,500 | ~$10-20 (VM only) |
At 5x growth, Ubicloud saves ~$84/mo vs GH-hosted. Lightsail self-hosted has the lowest VM bill but adds ongoing maintenance: at ~1 hour/month of operator time and a $150/hr opportunity cost, its effective cost is ~$160-170/mo, above Ubicloud's ~$116. Ubicloud therefore wins until volume grows enough to justify dedicated ops time.
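The breakeven volume can be estimated from the table's own figures. The sketch below assumes Ubicloud's bill scales linearly with minutes (an assumption, extrapolated from the ~$116 at ~27,500 min row) and that Lightsail's cost is the flat VM price plus one operator hour; breakeven_minutes is a hypothetical helper.

```shell
# Figures taken from the cost table and the surrounding text.
ubicloud_cost=116        # dollars per month at 5x growth
ubicloud_minutes=27500   # minutes per month at 5x growth
operator_cost=150        # 1 hour/month at $150/hr

# breakeven_minutes VM_COST
# Monthly minutes at which the Ubicloud bill equals Lightsail VM cost plus
# operator time, assuming Ubicloud cost is linear in minutes.
breakeven_minutes() {
  lightsail_total=$(($1 + operator_cost))
  echo $((lightsail_total * ubicloud_minutes / ubicloud_cost))
}

breakeven_minutes 10   # prints 37931
breakeven_minutes 20   # prints 40301
```

Under these assumptions the crossover sits at roughly 38,000 to 40,000 min/mo, well above the 5x-growth volume of ~27,500.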
Reference
- ADR 0033: docs/architecture/adr/0033-self-hosted-ci-runners.md
- Triggering incident: GH Actions run 25189616454
- Workflow files: .github/workflows/deploy-*.yml
- CF API status: https://www.cloudflarestatus.com
- GH Actions status: https://www.githubstatus.com
- Ubicloud docs: https://www.ubicloud.com/docs/github-actions-integration