Raxx · internal docs

Runbook — CI runner posture

Owner: ops / Kristerpher
Last updated: 2026-04-30 UTC
ADR: docs/architecture/adr/0033-self-hosted-ci-runners.md
Triggering incident: run 25189616454 (Deploy customer docs), wrangler transient failure, 2026-05-01 UTC


Summary

Current posture: GitHub-hosted ubuntu-latest runners + retry hardening on all wrangler/external-API deploy steps.

Escalation path: migrate to Ubicloud managed runners when monthly CI burn exceeds $20 incremental for two consecutive months (see Phase 2 below).


Phase 1 — Retry hardening on GH-hosted runners

What this fixes

A single-attempt npx wrangler pages deploy or external curl call with no retry can fail on a transient network event between GitHub's runner pool and the CF API edge. Adding a retry loop absorbs these without requiring manual re-runs.

Affected workflows

Apply the retry pattern to every wrangler or external-API step in the workflows below:

Workflow | Step(s) to wrap
deploy-customer-docs.yml | Ensure CF Pages project (raxx-docs), Deploy dist-customer-docs/ to Cloudflare Pages
deploy-internal-docs.yml | wrangler project create + wrangler deploy
deploy-status-page.yml | wrangler deploy
deploy-console.yml | wrangler deploy
deploy-status-worker.yml | wrangler deploy

Do NOT wrap:

  - actions/checkout: idempotent and retried internally by the action
  - Post-deploy health checks in deploy-heroku.yml: already implement their own retry loop
  - git push to Heroku: git push is idempotent; wrap it only if a persistent flake surfaces

Retry shell pattern

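# Retry loop: run <command> up to max_attempts times with exponential backoff.
attempt=0
max_attempts=3   # total attempts, including the first
delay=15         # seconds before the first retry; doubles after each failure
until <command>; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    # ::error:: surfaces the message as an error annotation on the run
    echo "::error::<step name> failed after $max_attempts attempts"
    exit 1
  fi
  echo "<step name> attempt $attempt failed, retrying in ${delay}s ..."
  sleep "$delay"
  delay=$((delay * 2))
done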

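Backoff schedule with max_attempts=3: 15s after the first failure, 30s after the second; the third failure exits without sleeping. Max added wall-clock delay before hard failure: ~45s per step, plus the run time of the three attempts. Acceptable for deploy workflows. A 60s third delay (~105s total) applies only if max_attempts is raised to 4.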
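
The same loop can also be packaged as a shell function so that each deploy step calls it in one line. This is a minimal sketch, not the form used in the workflows: the function name retry and its argument order are assumptions made for illustration.

```shell
# retry <max_attempts> <initial_delay_seconds> <command> [args...]
# Hypothetical helper: runs the command until it succeeds or the attempt
# budget is used up, doubling the delay after each failed attempt.
retry() {
  local max_attempts=$1 delay=$2 attempt=0
  shift 2
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "::error::$1 failed after $max_attempts attempts"
      return 1
    fi
    echo "$1 attempt $attempt failed; retrying in ${delay}s ..."
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Example call inside a workflow step (arguments are placeholders):
#   retry 3 15 npx wrangler pages deploy <directory> --project-name <project>
```

The function uses return rather than exit, so the calling step decides whether a final failure should abort the job.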

Testing after patching

  1. Open a test PR that modifies a file in docs/customer/ to trigger deploy-customer-docs.yml.
  2. Confirm the workflow passes without manual intervention.
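  3. To test the retry path locally: temporarily substitute a command that always fails (for example, false) for the wrangler call inside the until condition and verify the loop fires. A bare exit 1 placed before the loop would end the step without exercising the retries. Revert before merging.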
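
One way to exercise the retry path without touching Cloudflare is to run the loop against a stand-in command that fails a fixed number of times. The sketch below is illustrative only: flaky_deploy is a hypothetical stub, and the delay is set to 0 so the dry run finishes immediately.

```shell
# Dry run of the retry loop against a stub instead of wrangler.
calls=0

# flaky_deploy (hypothetical): fails on the first two calls, succeeds on the third.
flaky_deploy() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}

attempt=0
max_attempts=3
delay=0   # zero delay for the dry run; the workflows start at 15
until flaky_deploy; do
  attempt=$((attempt + 1))
  if [ "$attempt" -ge "$max_attempts" ]; then
    echo "::error::dry run failed after $max_attempts attempts"
    exit 1
  fi
  echo "dry run attempt $attempt failed; retrying in ${delay}s ..."
  sleep "$delay"
  delay=$((delay * 2))
done
echo "dry run succeeded on call $calls"
```

Because the stub fails twice, the loop should log two failed attempts before succeeding. Changing the stub so it never succeeds exercises the hard-failure branch instead.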

Phase 2 — Ubicloud migration (trigger-gated)

Trigger condition

Migrate when both of the following are true:

Check the GH Actions usage dashboard at: Settings > Billing > Actions.
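
The Summary states the trigger as monthly CI burn exceeding $20 incremental for two consecutive months. A minimal sketch of that check follows, assuming the two incremental figures have already been read off the billing dashboard and are passed in as whole cents. The function name and argument format are hypothetical.

```shell
# trigger_met <previous_month_incremental_cents> <current_month_incremental_cents>
# Succeeds only if both months are strictly above the $20 threshold.
trigger_met() {
  local threshold_cents=2000
  [ "$1" -gt "$threshold_cents" ] && [ "$2" -gt "$threshold_cents" ]
}

# Example: $25.00 last month and $21.00 this month.
if trigger_met 2500 2100; then
  echo "Trigger met: start the pre-migration checklist"
else
  echo "Trigger not met: stay on GH-hosted runners"
fi
```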

Pre-migration checklist

Workflow migration order

Migrate in this order to minimize blast radius:

  1. ci.yml — highest minute consumer; safe to migrate first (no deploy side effects)
  2. ci-pr.yml — PR-gated; failures are visible before merge
  3. deploy-heroku.yml — migrate after 2+ successful ci.yml runs on Ubicloud
  4. security-zap.yml — migrate after deploy workflows stable
  5. All CF Pages deploy workflows — migrate last; these touch CF API and benefit least from runner change

Label change

# Before
runs-on: ubuntu-latest

# After (Ubicloud)
runs-on: ubicloud-standard-2

For jobs that must stay on GH-hosted (e.g., if a third-party action requires GitHub's environment), keep ubuntu-latest and document the exception inline.
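
To see which jobs carry which label during the migration, the runs-on lines can be listed per workflow file. This sketch assumes the workflows live in the standard .github/workflows directory and use the .yml extension; list_runner_labels is a hypothetical helper name.

```shell
# list_runner_labels [workflow_dir]
# Prints file:line:content for every runs-on line in the workflow files.
list_runner_labels() {
  local dir=${1:-.github/workflows}
  grep -Hn -E '^[[:space:]]*runs-on:' "$dir"/*.yml
}

# Example, run from the repository root:
#   list_runner_labels
```

Any ubuntu-latest line that remains after a workflow has been migrated should have an inline comment documenting the exception.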

Validation gate

After updating each workflow file:

  1. Trigger a manual run (workflow_dispatch or a test PR).
  2. Confirm: vault secret loading works (Infisical outbound), artifact uploads succeed, CF Pages deploy reaches CF API.
  3. Wait 2 full business days before migrating the next workflow.

Rollback

# Revert the runs-on label changes in the affected workflow files
git revert <migration-commit-sha>
git push origin main

No data migration required. All runner state is ephemeral. Rollback takes effect on the next workflow run.

Cost tracking after migration


Failure modes and recovery

Runner pool unavailable (GH-hosted)

GH-hosted runner outages are rare and typically short (<30 min). Action:

  1. Check https://www.githubstatus.com. If "Actions" is degraded, wait.
  2. Re-run failed jobs after status clears.
  3. For urgent deploys during a GH outage: deploy manually from a local terminal with vault secrets loaded, using git push to the Heroku git remote and wrangler pages deploy.

Wrangler transient failure (post-retry-hardening)

If a wrangler deploy fails all 3 retry attempts:

  1. Check https://www.cloudflarestatus.com. If CF API is degraded, wait and re-run the workflow.
  2. If CF API is healthy, check that CF_PAGES_DEPLOY_TOKEN in Infisical is valid and not expired: infisical secrets get CF_PAGES_DEPLOY_TOKEN --env=prod.
  3. If the token is valid, check the wrangler version (npx wrangler --version) and pin to a specific version if a bad release is suspected.

Ubicloud runner offline (Phase 2 only)

If a Ubicloud runner shows as offline in the GitHub repo > Settings > Actions > Runners:

  1. Log in to cloud.ubicloud.com and check the runner pool status.
  2. If the runner pool is degraded, temporarily change runs-on back to ubuntu-latest for the affected workflow(s).
  3. File a support ticket with Ubicloud referencing the runner pool name.
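
The temporary switch back to ubuntu-latest amounts to a one-line label swap per affected workflow. A sketch of that swap, assuming the Ubicloud label is ubicloud-standard-2 as in the Label change section; fallback_to_gh_hosted is a hypothetical helper, and the result should be reviewed with git diff before committing.

```shell
# fallback_to_gh_hosted <workflow_file> [more_files...]
# Rewrites the Ubicloud runs-on label to ubuntu-latest in place.
fallback_to_gh_hosted() {
  local file
  for file in "$@"; do
    # -i.bak is accepted by both GNU and BSD sed; the backup is removed on success.
    sed -i.bak 's/runs-on: ubicloud-standard-2/runs-on: ubuntu-latest/' "$file" \
      && rm -f "$file.bak"
  done
}

# Example, run from the repository root:
#   fallback_to_gh_hosted .github/workflows/ci.yml
```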

GH Actions secret rotation

Bootstrap secrets (INFISICAL_CLIENT_ID, INFISICAL_CLIENT_SECRET) can be rotated without redeployment:

  1. Generate new Infisical machine identity credentials in the Infisical dashboard.
  2. Update the GH Actions repository secrets at Settings > Secrets > Actions.
  3. Trigger a test workflow run to confirm vault access still works.
  4. Revoke the old credentials in Infisical.


Cost projection

Scenario | Runner | Min/mo | Monthly cost
Current (~3 merges/day) | GH-hosted | ~5,500 | ~$20 ($4 Pro + ~$16 overage)
5x growth (~15 merges/day) | GH-hosted | ~27,500 | ~$200
5x growth | Ubicloud | ~27,500 | ~$116
5x growth | Lightsail self-hosted | ~27,500 | ~$10-20 (VM only)

At 5x growth, Ubicloud saves ~$84/mo vs GH-hosted. Lightsail self-hosted has the lowest infrastructure cost but adds ongoing maintenance. Counting ~1 hour/month of operator time at $150/hr opportunity cost, Lightsail's effective cost at 5x is ~$160-170/mo against ~$116/mo for Ubicloud, so Ubicloud wins until volume is high enough to justify dedicated ops time.
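
The comparison can be reproduced with a few lines of shell arithmetic. This is a sketch using the figures from the table and the ~1 hour/month at $150/hr estimate; it assumes Ubicloud needs no operator time, and effective_cost is a hypothetical helper name.

```shell
# effective_cost <infra_dollars> <ops_hours> <hourly_rate_dollars>
# Monthly infrastructure cost plus the opportunity cost of operator time.
effective_cost() {
  echo $(( $1 + $2 * $3 ))
}

ubicloud=$(effective_cost 116 0 150)        # assumption: no operator time
lightsail_low=$(effective_cost 10 1 150)    # low end of the VM cost range
lightsail_high=$(effective_cost 20 1 150)   # high end of the VM cost range

echo "Ubicloud: \$${ubicloud}/mo"
echo "Lightsail self-hosted: \$${lightsail_low}-${lightsail_high}/mo"
```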


Reference