Runbook — CI runner posture

Owner: ops / Kristerpher
Last updated: 2026-04-30 UTC
ADR: docs/architecture/adr/0033-self-hosted-ci-runners.md
Triggering incident: run 25189616454, Deploy customer docs, wrangler transient failure 2026-05-01 UTC

Summary

Current posture: GitHub-hosted ubuntu-latest runners + retry hardening on all wrangler/external-API deploy steps.

Escalation path: migrate to Ubicloud managed runners when monthly CI burn exceeds $20 incremental for two consecutive months (see Phase 2 below).

Phase 1 — Retry hardening on GH-hosted runners

What this fixes

A single-attempt npx wrangler pages deploy or external curl call with no retry can fail on a transient network event between GitHub's runner pool and the CF API edge. Adding a retry loop absorbs these without requiring manual re-runs.

Affected workflows

Apply the retry pattern to every step that calls:

Workflow	Step(s) to wrap
`deploy-customer-docs.yml`	`Ensure CF Pages project (raxx-docs)`, `Deploy dist-customer-docs/ to Cloudflare Pages`
`deploy-internal-docs.yml`	wrangler project create + wrangler deploy
`deploy-status-page.yml`	wrangler deploy
`deploy-console.yml`	wrangler deploy
`deploy-status-worker.yml`	wrangler deploy

Do NOT wrap:

- actions/checkout: idempotent and retried internally by the action
- Post-deploy health checks in deploy-heroku.yml: already implement their own retry loop
- git push to Heroku: git push is idempotent; wrap only if a persistent flake surfaces
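To audit which workflow files still call wrangler without a retry loop, a rough heuristic like the following may help. It is a sketch only: the function name list_unwrapped is hypothetical, and it simply greps for the strings "wrangler" and "until " rather than parsing the YAML, so treat its output as a starting point for manual review.

```shell
# list_unwrapped: print workflow files that mention wrangler but contain no
# "until " retry loop. Heuristic only; it does not parse YAML or match
# individual steps.
list_unwrapped() {
  workflows_dir=$1
  for f in "$workflows_dir"/*.yml; do
    [ -f "$f" ] || continue
    if grep -q 'wrangler' "$f" && ! grep -q 'until ' "$f"; then
      echo "$f"
    fi
  done
}
```

From the repository root this would be invoked as list_unwrapped .github/workflows.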

Retry shell pattern

attempt=0
max_attempts=3
delay=15
until <command>; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "::error::<step name> failed after $max_attempts attempts"
    exit 1
  fi
  echo "<step name> attempt $attempt failed — retrying in ${delay}s ..."
  sleep "$delay"
  delay=$((delay * 2))
done

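Backoff schedule: 15s, then 30s. With max_attempts=3 the loop sleeps only after the first and second failures; the third failure exits immediately, so the doubled 60s delay is never used. Max added wall-clock delay before hard failure: ~45s per step, plus the run time of the three attempts. Acceptable for deploy workflows.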
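The same loop can be packaged as a small function so that each workflow step wraps its command in one line. This is a minimal sketch rather than the project's actual implementation: the function name retry and its argument order are assumptions made for illustration.

```shell
# retry: run a command until it succeeds, up to max_attempts tries,
# doubling the delay after each failed attempt.
# Usage: retry <step name> <max_attempts> <initial_delay_seconds> <command...>
# Note: plain POSIX sh has no "local", so these variables are global.
retry() {
  step_name=$1
  max_attempts=$2
  delay=$3
  shift 3
  attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "::error::${step_name} failed after ${max_attempts} attempts"
      return 1
    fi
    echo "${step_name} attempt ${attempt} failed; retrying in ${delay}s ..."
    sleep "$delay"
    delay=$((delay * 2))
  done
}
```

A step could then call, for example, retry "Deploy customer docs" 3 15 npx wrangler pages deploy dist-customer-docs/ --project-name raxx-docs. The wrangler arguments shown are assumed; use the ones already present in the workflow. The function returns 1 instead of calling exit so the caller decides how to fail the step.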

Testing after patching

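1. Open a test PR that modifies a file in docs/customer/ to trigger deploy-customer-docs.yml.
2. Confirm the workflow passes without manual intervention.
3. To test the retry path: temporarily replace the wrangler call in the until condition with a command that always fails (for example, false) and verify the loop retries and then fails with the ::error:: annotation. A bare exit 1 placed before the loop would end the step without exercising the retry. Revert before merging.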
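The retry path can also be exercised outside CI with a stub that fails a fixed number of times before succeeding. The sketch below is illustrative only: flaky_stub and exercise_retry are hypothetical helpers, and the loop mirrors the retry pattern with the sleep removed so the check runs instantly.

```shell
# flaky_stub: fail on the first <fail_count> calls, then succeed.
# Call count is tracked in a counter file so it survives across invocations.
flaky_stub() {
  counter_file=$1
  fail_count=$2
  calls=$(cat "$counter_file" 2>/dev/null || echo 0)
  calls=$((calls + 1))
  echo "$calls" > "$counter_file"
  [ "$calls" -gt "$fail_count" ]
}

# exercise_retry: run the retry loop (max 3 attempts, no sleep) against the
# stub and report how many calls were made and whether the loop gave up.
exercise_retry() {
  stub_failures=$1
  counter=$(mktemp)
  echo 0 > "$counter"
  attempt=0
  max_attempts=3
  result=0
  until flaky_stub "$counter" "$stub_failures"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      result=1
      break
    fi
  done
  echo "calls=$(cat "$counter") result=$result"
  rm -f "$counter"
  return "$result"
}
```

With two simulated failures the third attempt succeeds; with three or more the loop gives up after the third call.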

Phase 2 — Ubicloud migration (trigger-gated)

Trigger condition

Migrate when either of the following is true:

- Monthly CI minutes exceed 5,000 for two consecutive months, OR
- Monthly GitHub Actions invoice exceeds $20 incremental (above the free tier) for two consecutive months

Check the GH Actions usage dashboard at: Settings > Billing > Actions.
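The trigger can be expressed as a small check over the last two months of figures read from the dashboard. This is a sketch under stated assumptions: the function names are hypothetical, inputs are whole numbers entered by hand, and the two conditions are treated as alternatives, as the OR in the list indicates.

```shell
# two_months_over: succeed if both monthly values exceed the limit.
# args: <previous month> <current month> <limit>
two_months_over() {
  [ "$1" -gt "$3" ] && [ "$2" -gt "$3" ]
}

# should_migrate: succeed if either trigger has held for two consecutive months.
# args: <minutes prev> <minutes cur> <incremental USD prev> <incremental USD cur>
should_migrate() {
  two_months_over "$1" "$2" 5000 || two_months_over "$3" "$4" 20
}
```

For example, should_migrate 5200 5100 10 12 succeeds on the minutes condition alone, while should_migrate 4000 5100 25 19 fails because neither condition held in both months.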

Pre-migration checklist

[ ] Create account at https://cloud.ubicloud.com
[ ] Install the Ubicloud GitHub App on raxx-app/TradeMasterAPI
[ ] Confirm runner pool label (currently ubicloud-standard-2 for 2 vCPU / 8 GB)
[ ] Note the runner OS image version (must be Ubuntu 22.04-compatible for our workflow assumptions)

Workflow migration order

Migrate in this order to minimize blast radius:

1. ci.yml: highest minute consumer; safe to migrate first (no deploy side effects)
2. ci-pr.yml: PR-gated; failures are visible before merge
3. deploy-heroku.yml: migrate after 2+ successful ci.yml runs on Ubicloud
4. security-zap.yml: migrate once the deploy workflow above is stable
5. All CF Pages deploy workflows: migrate last; these touch the CF API and benefit least from a runner change

Label change

# Before
runs-on: ubuntu-latest

# After (Ubicloud)
runs-on: ubicloud-standard-2

For jobs that must stay on GH-hosted (e.g., if a third-party action requires GitHub's environment), keep ubuntu-latest and document the exception inline.
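The label swap can be scripted per workflow file, which also makes the later rollback a one-line inverse. The helper below is a hypothetical sketch: switch_runner_label is not part of the repository, and it only rewrites lines whose entire value is the old label, so matrix expressions or labels with trailing comments would need manual edits.

```shell
# switch_runner_label: rewrite "runs-on: <from>" to "runs-on: <to>" in one file.
# Only lines consisting of the runs-on key and exactly the old label are changed.
switch_runner_label() {
  file=$1
  from=$2
  to=$3
  tmp=$(mktemp)
  sed -E "s/^([[:space:]]*runs-on:[[:space:]]*)${from}[[:space:]]*\$/\\1${to}/" "$file" > "$tmp" &&
    cat "$tmp" > "$file"   # overwrite in place so file permissions are kept
  rm -f "$tmp"
}
```

For example, switch_runner_label .github/workflows/ci.yml ubuntu-latest ubicloud-standard-2 migrates ci.yml; swapping the last two arguments reverses it.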

Validation gate

After updating each workflow file:

1. Trigger a manual run (workflow_dispatch or a test PR).
2. Confirm: vault secret loading works (Infisical outbound), artifact uploads succeed, CF Pages deploy reaches CF API.
3. Wait 2 full business days before migrating the next workflow.

Rollback

# Revert the runs-on label changes in the affected workflow files
git revert <migration-commit-sha>
git push origin main

No data migration required. All runner state is ephemeral. Rollback takes effect on the next workflow run.

Cost tracking after migration

- Ubicloud bills per minute on their platform. Monitor at the cloud.ubicloud.com billing dashboard.
- GH Actions minutes drop to ~0 for migrated workflows (self-hosted runners don't consume GH minutes on private repos).
- Expected monthly cost at 5,000 min/mo: ~$21 vs ~$16 incremental on GH-hosted. Delta: ~$5/mo for improved vendor diversity. Acceptable.

Failure modes and recovery

Runner pool unavailable (GH-hosted)

GH-hosted runner outages are rare and typically short (<30 min). Action:

1. Check https://www.githubstatus.com. If "Actions" is degraded, wait.
2. Re-run failed jobs after status clears.
3. For urgent deploys during a GH outage: deploy manually with a git push to the Heroku remote and wrangler pages deploy from a local terminal with vault secrets loaded.
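The status check in step 1 can be scripted. The sketch below assumes the status site exposes a Statuspage-style JSON summary with a status.indicator field; that endpoint shape is an assumption here, not something this runbook confirms, and the function name status_indicator is hypothetical. The overall indicator does not isolate the "Actions" component, so a non-"none" result still warrants a look at the page itself.

```shell
# status_indicator: read a Statuspage-style status JSON document on stdin and
# print the value of its "indicator" field (for example: none, minor, major).
# Uses sed only, so it works without jq; it expects the field on a single line.
status_indicator() {
  sed -n 's/.*"indicator"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

Assuming the endpoint exists at that path, usage would be: curl -fsS https://www.githubstatus.com/api/v2/status.json | status_indicator.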

Wrangler transient failure (post-retry-hardening)

If a wrangler deploy fails all 3 retry attempts:

1. Check https://www.cloudflarestatus.com. If CF API is degraded, wait and re-run the workflow.
2. If CF API is healthy, check that CF_PAGES_DEPLOY_TOKEN in Infisical is valid and not expired: infisical secrets get CF_PAGES_DEPLOY_TOKEN --env=prod
3. If the token is valid, check the wrangler version (npx wrangler --version) and pin to a specific version if a bad release is suspected.

Ubicloud runner offline (Phase 2 only)

If a Ubicloud runner shows as offline in the GitHub repo > Settings > Actions > Runners:

1. Log in to cloud.ubicloud.com and check the runner pool status.
2. If the runner pool is degraded, temporarily change runs-on back to ubuntu-latest for the affected workflow(s).
3. File a support ticket with Ubicloud referencing the runner pool name.

GH Actions secret rotation

Bootstrap secrets (INFISICAL_CLIENT_ID, INFISICAL_CLIENT_SECRET) can be rotated without redeployment:

1. Generate new Infisical machine identity credentials in the Infisical dashboard.
2. Update the GH Actions repository secrets at Settings > Secrets > Actions.
3. Trigger a test workflow run to confirm vault access still works.
4. Revoke the old credentials in Infisical.
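Step 2 can be done from a terminal with the GitHub CLI instead of the web UI. This is a hedged sketch: the helper name update_bootstrap_secrets and the GH override variable are hypothetical, and passing values with --body exposes them to the local process list, so prefer the web UI or piping the value on stdin on shared machines.

```shell
# update_bootstrap_secrets: set both Infisical bootstrap secrets on a repo
# using "gh secret set". Set GH=echo for a dry run that prints the arguments
# instead of calling the GitHub CLI.
# args: <owner/repo> <new client id> <new client secret>
update_bootstrap_secrets() {
  repo=$1
  new_id=$2
  new_secret=$3
  "${GH:-gh}" secret set INFISICAL_CLIENT_ID --repo "$repo" --body "$new_id" &&
    "${GH:-gh}" secret set INFISICAL_CLIENT_SECRET --repo "$repo" --body "$new_secret"
}
```

For this repository the first argument would be raxx-app/TradeMasterAPI. Revoke the old credentials only after the test workflow run in step 3 succeeds.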

Cost projection

Scenario	Runner	Min/mo	Monthly cost
Current (~3 merges/day)	GH-hosted	~5,500	~$20 ($4 Pro + ~$16 overage)
5x growth (~15 merges/day)	GH-hosted	~27,500	~$200
5x growth	Ubicloud	~27,500	~$116
5x growth	Lightsail self-hosted	~27,500	~$10-20 (VM only)

At 5x growth, Ubicloud saves ~$84/mo vs GH-hosted. Lightsail self-hosted saves more on infrastructure but adds ongoing maintenance. Accounting for ~1 hour/month of operator time at $150/hr opportunity cost, Lightsail's effective cost is ~$160-170/mo against ~$116 for Ubicloud, so Ubicloud wins until volume justifies dedicated ops time.
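The comparison can be rerun with updated figures using simple integer arithmetic. The sketch below uses the table's 5x numbers; the function name effective_monthly_cost is hypothetical, and treating Ubicloud as needing zero operator hours is an assumption made for illustration.

```shell
# effective_monthly_cost: infrastructure cost plus operator time, in whole USD.
# args: <infra USD per month> <operator hours per month> <hourly rate USD>
effective_monthly_cost() {
  echo $(( $1 + $2 * $3 ))
}

# ubicloud_cheaper: succeed if Ubicloud's effective cost is below Lightsail's.
# args: <ubicloud USD> <lightsail VM USD> <operator hours> <hourly rate USD>
ubicloud_cheaper() {
  [ "$(effective_monthly_cost "$1" 0 "$4")" -lt "$(effective_monthly_cost "$2" "$3" "$4")" ]
}
```

With the table's figures, effective_monthly_cost 20 1 150 gives 170 for Lightsail at the upper VM estimate, compared with 116 for Ubicloud.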

Reference

ADR 0033: docs/architecture/adr/0033-self-hosted-ci-runners.md
Triggering incident: GH Actions run 25189616454
Workflow files: .github/workflows/deploy-*.yml
CF API status: https://www.cloudflarestatus.com
GH Actions status: https://www.githubstatus.com
Ubicloud docs: https://www.ubicloud.com/docs/github-actions-integration