Raxx · internal docs


ADR 0033 — CI runner posture: transient-failure retry + Ubicloud migration trigger

Status: Accepted
Date: 2026-04-30 UTC
Deciders: Kristerpher (product owner)
Refs: triggering incident run 25189616454 (Deploy customer docs wrangler failure); project memory project_ci_billing.md; issue #340 (GH Actions billing posture)


Context

On 2026-05-01 UTC, a transient network failure between a GitHub-hosted runner and the Cloudflare API killed run 25189616454 of the Deploy customer docs workflow at the npx wrangler pages deploy step. The job had no retry logic, so one bad TCP handshake between the runner and CF's API edge produced a hard failure that required a manual re-run.

This incident surfaces two distinct problems:

  1. Flake problem: a single-attempt wrangler/curl call with no retry can fail on transient network noise that is unrelated to code correctness.
  2. Cost / speed problem: GH Actions minutes are accumulating across the smoke gate (10-15 min), ZAP scan (5-10 min), Heroku deploys, multiple CF Pages deploys, secret scans, and nightly security runs. At the current trajectory (several thousand min/month and growing), the 3,000 min/month included with the $4/mo Pro plan will be exhausted, after which ubuntu-latest minutes are billed at $0.008/min.

These are separable problems. Conflating them risks over-engineering a fix for the immediate flake.

Current workflow inventory

| Workflow | Est. run time | Trigger |
|---|---|---|
| ci.yml (backend + frontend tests, SAST, secret scan) | 10-15 min | PR + push to main |
| ci-pr.yml | 5-8 min | PR |
| deploy-heroku.yml (smoke gate + deploy) | 12-20 min | push main + manual |
| deploy-customer-docs.yml | 3-5 min | push main (path-filtered) |
| deploy-internal-docs.yml | 2-4 min | push main (path-filtered) |
| deploy-status-page.yml | 2-4 min | push main |
| deploy-console.yml | 3-5 min | push main |
| pr-preview.yml | 4-7 min | PR |
| security-zap.yml | 5-10 min | PR + schedule |
| nightly-security-scan.yml | 5-10 min | schedule |
| release.yml | 2-5 min | tag |

Rough ceiling: ~60-80 min per push to main if all workflows trigger together. At 3-4 merges/day that is ~180-320 min/day, or ~5,500-10,000 min/month. GitHub Pro includes 3,000 min/month; every minute above that is billed at $0.008.
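The estimate above can be reproduced with a few lines of shell. This is a minimal sketch, not part of any workflow: the function names `monthly_minutes` and `overage_usd` are illustrative, and the 30-day month is an assumption; the 3,000 included minutes and the $0.008/min rate are the figures quoted in this ADR.

```shell
#!/usr/bin/env bash
set -eu

# Estimated minutes per month: minutes per push * merges per day * days.
# The 30-day month is an assumption made for this sketch.
monthly_minutes() {
  local min_per_push="$1" merges_per_day="$2" days="${3:-30}"
  echo $(( min_per_push * merges_per_day * days ))
}

# Overage in USD above the included minutes, using the figures quoted in
# this ADR as defaults (3,000 included minutes, $0.008/min).
overage_usd() {
  local minutes="$1" included="${2:-3000}" rate="${3:-0.008}"
  awk -v m="$minutes" -v i="$included" -v r="$rate" \
    'BEGIN { over = m - i; if (over < 0) over = 0; printf "%.2f\n", over * r }'
}

low=$(monthly_minutes 60 3)    # low end: 60 min/push, 3 merges/day
high=$(monthly_minutes 80 4)   # high end: 80 min/push, 4 merges/day
echo "minutes/month: ${low}-${high}"
echo "overage: \$$(overage_usd "$low")-\$$(overage_usd "$high")"
```

With the unrounded inputs this prints 5400-9600 minutes/month and an overage of $19.20-$52.80, in line with the rounded ~5,500-10,000 range above.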

Infra context

All deploy targets are external SaaS/PaaS. Runners need outbound HTTPS to api.heroku.com, api.cloudflare.com, and infisical endpoints. They do not need VPC/VPN access to any private subnet.


Options evaluated

Option 1 — Stay on GH-hosted + retry on transient steps

Description: Add continue-on-error + shell-level retry loops (or uses: nick-fields/retry@v3) to specific wrangler and curl steps. Accept GH-hosted as the runner fleet.

Cost: $0 incremental for the fix itself. Minutes cost stays as-is (excess billed at $0.008/min once Pro cap exceeded).

Setup time: 1-2 hours (patch existing workflows).

Maintenance: Near-zero. No new infra. Stays entirely within GitHub's security boundary.

Security posture: GH-managed, ephemeral VMs. Best-in-class for isolation. No surface area added.

Network flake mitigation: Partial. Retries absorb single-event transients; sustained CF API degradation would still fail after N attempts.

Limitation: Doesn't address the cost trajectory. Doesn't make jobs faster.


Option 2 — GH Larger Runners (4-vCPU, 16-vCPU)

Description: Switch selected jobs to GitHub's 4-vCPU (ubuntu-latest-4-cores) or larger runners.

Cost: $0.008/min for 4-vCPU to $0.016/min for 8-vCPU. At 5,000 min/mo on 4-vCPU: ~$40/mo. Larger runners do not reduce minutes; they raise cost in exchange for faster wall-clock time.

Network flake mitigation: None. Same GitHub-hosted network.

Verdict: Not appropriate at current scale. Speeds up compute-heavy jobs (builds, tests) but solves neither the flake nor the cost problem.


Option 3 — Ubicloud managed runners

Description: Drop-in GH Actions runner pool hosted by Ubicloud. No VMs to manage. Change runs-on: ubuntu-latest to runs-on: ubicloud-standard-2 (or equivalent). Ubicloud's list price is roughly half that of GH-hosted (~$0.0042/min vs $0.008/min for ubuntu-latest, about 47% cheaper per minute).

Cost: At 5,000 min/mo: ~$21/mo. At 10,000 min/mo: ~$42/mo. For comparison, GH-hosted at 5,000 min/mo costs ~$16/mo (2,000 min above the 3,000-min free tier); at 10,000 min/mo it costs ~$56/mo.
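The comparison above can be checked with a short script. This is a sketch under stated assumptions: the function names are illustrative, the rates are the list prices quoted in this ADR, and Ubicloud is modeled as billing every minute with no free allowance.

```shell
#!/usr/bin/env bash
set -eu

# Monthly cost in USD for GH-hosted: only minutes above the included
# 3,000 are billed, at $0.008/min (figures quoted in this ADR).
gh_hosted_usd() {
  awk -v m="$1" \
    'BEGIN { over = m - 3000; if (over < 0) over = 0; printf "%.2f\n", over * 0.008 }'
}

# Monthly cost in USD for Ubicloud at the ~$0.0042/min list price quoted
# above. Assumption: every minute is billed (no free allowance modeled).
ubicloud_usd() {
  awk -v m="$1" 'BEGIN { printf "%.2f\n", m * 0.0042 }'
}

for minutes in 5000 10000; do
  echo "${minutes} min/mo: GH-hosted \$$(gh_hosted_usd "$minutes"), Ubicloud \$$(ubicloud_usd "$minutes")"
done
```

Under these assumptions the two cost curves cross near 6,300 min/mo (where 0.0042 × m = 0.008 × (m − 3,000)); below that volume GH-hosted is the cheaper of the two.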

Setup time: ~1-2 hours (Ubicloud account, GitHub App install, update runs-on labels).

Maintenance: Ubicloud manages the VMs. Operator action limited to account + GitHub App config.

Security posture: Ubicloud ephemeral VMs; similar isolation model to GH-hosted but Ubicloud is a smaller vendor with a shorter security track record. Acceptable for this threat model (no PII flows through CI, secrets come from GH Actions secrets/vault at runtime).

Network: Ubicloud runs in EU and US regions. Network path to CF API and Heroku is comparable to GH-hosted; no specific advantage on CF API latency. Does not fix the root cause of transient wrangler failures.

Verdict: Correct optimization once burn consistently exceeds the free tier. Not urgent at current scale.


Option 4 — Self-hosted on AWS Lightsail

Description: Add a $5-$20/mo Lightsail instance alongside the FreeScout instance, install the GitHub Actions runner agent, configure as a persistent self-hosted runner.

Cost: $5/mo (512 MB, 1 vCPU) to $20/mo (2 GB, 2 vCPU). No per-minute CI billing once on self-hosted (GH charges $0 for self-hosted runners on public repos; private repos consume no paid minutes).

Setup time: ~1 day (provision instance, install runner agent, configure systemd service, wire to repo, update workflows).

Maintenance: Operator-managed. OS patches, runner agent updates, disk monitoring, SSH key rotation all land on Kristerpher. Ongoing overhead estimated at 1-2 hours/month.

Security posture: Persistent runner = biggest attack surface of all options. A compromised repo secret or malicious PR could run code on the persistent instance. Mitigations required: restrict to non-fork PRs, run runner as a low-privilege user, no secrets stored on disk, job isolation via Docker (adds complexity). For a single-founder private repo with no external contributors, fork-PR risk is negligible today but the habit is bad.

Network: Lightsail US-East is geographically close to CF's US PoPs. Marginally more stable than a random GH runner assignment, but not meaningfully different in practice.

Verdict: Viable option for mature single-product setups with high CI volume. Premature here; maintenance burden is not worth it until monthly CI cost exceeds $30-40/mo with persistent issues.


Option 5 — Actuated.dev (Firecracker microVMs)

Description: Managed ephemeral microVM runners. Each job boots a fresh Firecracker VM. Flat-fee pricing: $59-79/mo for unlimited builds (as of 2026 pricing).

Cost: $59-79/mo regardless of volume. Breakeven vs GH-hosted is ~7,375-9,875 min/mo (at $0.008/min, ignoring the included free minutes). Current estimated burn is below that breakeven.
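The breakeven figures follow from dividing the flat fee by the per-minute rate. A minimal sketch, assuming (as the figures above do) that the included free minutes are ignored; `breakeven_minutes` is an illustrative name.

```shell
#!/usr/bin/env bash
set -eu

# Minutes per month at which a flat fee equals per-minute billing.
# Assumption: free minutes are ignored, so breakeven = flat_fee / rate.
breakeven_minutes() {
  local flat_fee_usd="$1" rate_per_min="${2:-0.008}"
  awk -v f="$flat_fee_usd" -v r="$rate_per_min" 'BEGIN { printf "%.0f\n", f / r }'
}

echo "breakeven at \$59/mo: $(breakeven_minutes 59) min"
echo "breakeven at \$79/mo: $(breakeven_minutes 79) min"
```

If the 3,000 included minutes are counted, each breakeven shifts up by 3,000 min/mo, which makes the flat fee harder to justify at current volume.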

Setup time: ~2-4 hours (account, GitHub App, server requirement — Actuated requires you to provide at least one arm64 or x86_64 server for them to manage; currently requires operator-owned or rented bare metal/VPS).

Security posture: Best-in-class. Each job gets a fresh Firecracker microVM with no leftover state. Better isolation than GH-hosted for persistent-runner threat model.

Maintenance: Actuated manages the microVM hypervisor; operator provides the host server. This introduces an operator-managed server requirement which conflicts with the goal of minimal maintenance overhead.

Verdict: Compelling security story, but requires operator-managed host hardware (even if it's a rented VPS). Flat fee only makes economic sense above ~8,000 min/mo. Not the right fit for current scale.


Option 6 — Kubernetes ARC (Actions Runner Controller)

Excluded per design brief. Overkill for current scale.


Option 7 — Hybrid: GH-hosted by default + self-hosted for heavy jobs

Description: Label specific workflow jobs (smoke gate, ZAP scan) to route to a self-hosted runner while keeping lightweight jobs on GH-hosted.

Verdict: Inherits all the maintenance overhead of Option 4 plus workflow complexity. Only makes sense if certain jobs have strong network co-location requirements. Not the case here.


Decision

Adopt a two-phase approach:

Phase 1 (immediate) — Retry hardening on GH-hosted runners.

Add retry logic to all wrangler and outbound API steps across every deploy workflow. This directly addresses the triggering incident at zero infrastructure cost. Use shell-level retry loops (idiomatic, no third-party action dependency) with exponential backoff and a 3-attempt ceiling.

This is the right first move because:
  - The triggering failure was a single-attempt transient call, not a sustained CF API outage.
  - Retry logic is valuable regardless of which runner fleet we use.
  - It closes the immediate reliability gap in under 2 hours of work.

Phase 2 (at >5,000 min/mo sustained for 2+ consecutive months) — Migrate to Ubicloud.

When monthly CI burn consistently exceeds the 3,000-min free tier by a material margin (indicative threshold: $20+/mo incremental), evaluate Ubicloud as a drop-in replacement. At that point the economics justify the 1-2 hour migration.
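The Phase 2 trigger can be expressed as a small check over monthly minute totals. This is a sketch only: `phase2_triggered` is an illustrative name, the 5,000 min/mo threshold and 2-month window come from this ADR, and the monthly totals in the example are made up.

```shell
#!/usr/bin/env bash
set -eu

# phase2_triggered THRESHOLD MONTHS TOTAL [TOTAL...]
# Succeeds (exit 0) when the most recent MONTHS monthly totals all exceed
# THRESHOLD. Totals are passed oldest first; the function tracks the
# streak of consecutive months above the threshold, ending at the latest.
phase2_triggered() {
  local threshold="$1" months="$2"
  shift 2
  local streak=0 total
  for total in "$@"; do
    if [ "$total" -gt "$threshold" ]; then
      streak=$((streak + 1))
    else
      streak=0
    fi
  done
  [ "$streak" -ge "$months" ]
}

# Hypothetical monthly totals, for illustration only.
if phase2_triggered 5000 2 4200 5300 5600; then
  echo "Phase 2 trigger met: evaluate Ubicloud"
else
  echo "Phase 2 trigger not met"
fi
```

A single month above the threshold, or a dip back below it, resets the streak, which matches the "sustained for 2+ consecutive months" wording.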

Ubicloud is preferred over Actuated.dev because it requires no operator-managed host and over self-hosted Lightsail because it requires no persistent runner maintenance. It is preferred over GH Larger Runners because the per-minute rate is lower.

This is a 2-way-door decision. The retry hardening (Phase 1) is additive and does not couple us to any runner provider. The Ubicloud migration (Phase 2) is straightforward to reverse — changing runs-on labels back to ubuntu-latest takes minutes.


Consequences

Positive:
  - Transient wrangler/CF API failures are absorbed by retry logic; manual re-runs become rare.
  - No new infra to maintain for Phase 1.
  - A clear threshold ($20/mo incremental or >5,000 min/mo for 2 months) gives an objective trigger for Phase 2 rather than a vague "when it gets expensive."

Negative:
  - Phase 1 retry logic increases maximum possible job runtime by (retries × step time) plus the backoff sleeps. For a 30-second wrangler step with 3 attempts and the 15s initial delay from the retry pattern below: worst case +60s of repeated step time and +45s of sleep (15s + 30s), about +105s per job. Acceptable.
  - Phase 1 does not reduce per-minute cost. If burn is already above the free tier, the retry logic slightly increases it on failure paths.
  - Phase 2 (Ubicloud) introduces a dependency on a smaller vendor. Vendor risk is mitigated by the fact that migration back to GH-hosted is a label change.
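The worst-case runtime cost of the retry loop can be computed directly, counting the backoff sleeps as well as the repeated step time. A minimal sketch; `worst_case_extra_seconds` is an illustrative name, and the inputs are the 30-second step, 3-attempt ceiling, and 15s initial delay used in this ADR.

```shell
#!/usr/bin/env bash
set -eu

# Worst-case extra seconds added when every attempt fails:
# (max_attempts - 1) extra runs of the step, plus the sleep before each
# retry, where the sleep starts at `delay` and doubles each time.
worst_case_extra_seconds() {
  local step_seconds="$1" max_attempts="$2" delay="$3"
  local extra=0 retry=1
  while [ "$retry" -lt "$max_attempts" ]; do
    extra=$((extra + delay + step_seconds))
    delay=$((delay * 2))
    retry=$((retry + 1))
  done
  echo "$extra"
}

worst_case_extra_seconds 30 3 15
```

For these inputs the call prints 105: 60s of repeated step time plus 45s of backoff sleep (15s + 30s).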


Alternatives considered (summary)

| Option | Monthly cost (5K min) | Maint overhead | Fixes flake | Fixes cost |
|---|---|---|---|---|
| 1. GH-hosted + retry | ~$16 incremental | Minimal | Yes (retries) | No |
| 2. GH Larger Runners | ~$40 | None | No | Increases |
| 3. Ubicloud | ~$21 | Minimal | No | Yes |
| 4. Lightsail self-hosted | ~$10 + ~1h/mo ops | Medium | Marginal | Yes |
| 5. Actuated.dev | $59-79 flat | Medium (host req) | Yes | Only at >8K min |
| Phase 1+2 (this ADR) | ~$16 now; ~$21 at scale | Minimal | Yes | At threshold |

Retry pattern (normative)

The following shell-level retry loop is the standard pattern to apply to wrangler deploy steps and any other single-shot outbound API calls in deploy workflows:

- name: Deploy to CF Pages (with retry)
  env:
    CLOUDFLARE_API_TOKEN: ${{ env.CLOUDFLARE_API_TOKEN }}
    CLOUDFLARE_ACCOUNT_ID: ${{ env.CLOUDFLARE_ACCOUNT_ID }}
  run: |
    set -euo pipefail
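    attempt=0        # failed attempts so far
    max_attempts=3   # total tries before giving up
    delay=15         # seconds before the first retry; doubles after each failure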
    until npx wrangler@latest pages deploy dist-customer-docs \
        --project-name=raxx-docs \
        --branch=main \
        --commit-dirty=true \
        --commit-hash="${{ github.sha }}" \
        --commit-message="customer docs (${{ github.sha }})"; do
      attempt=$((attempt + 1))
      if [ "$attempt" -ge "$max_attempts" ]; then
        echo "::error::wrangler deploy failed after $max_attempts attempts"
        exit 1
      fi
      echo "wrangler deploy attempt $attempt failed — retrying in ${delay}s ..."
      sleep "$delay"
      delay=$((delay * 2))
    done

Apply this pattern to:
  - npx wrangler pages deploy steps in all CF Pages deploy workflows
  - npx wrangler pages project create steps
  - Any curl call that posts to an external API as a deploy gate (not internal health checks; those already have their own retry loops in deploy-heroku.yml)
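For steps other than the wrangler deploy, the same loop can be factored into a small helper. The following is a sketch rather than the normative pattern: the `retry` function name and its argument order are assumptions made for illustration, and it applies the same 3-attempt, doubling-delay policy to an arbitrary command.

```shell
#!/usr/bin/env bash
set -eu

# retry MAX_ATTEMPTS INITIAL_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have failed,
# doubling the sleep between attempts (exponential backoff).
# Returns 0 on success, otherwise the exit status of the last attempt.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1 status=0
  while true; do
    if "$@"; then
      return 0
    else
      status=$?
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "::error::'$*' failed after ${max_attempts} attempts" >&2
      return "$status"
    fi
    echo "attempt ${attempt} failed (exit ${status}); retrying in ${delay}s ..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, `retry 3 15 curl --fail --silent --show-error -X POST "$DEPLOY_HOOK_URL"` would wrap a deploy-gate call, where `DEPLOY_HOOK_URL` is a hypothetical variable. The `--fail` flag matters here: without it curl exits 0 on HTTP error responses, so the loop would never retry them.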


Ubicloud migration trigger + checklist (Phase 2)

When the billing trigger fires, the migration checklist is:

  1. Create Ubicloud account at cloud.ubicloud.com; connect GitHub App to raxx-app/TradeMasterAPI.
  2. Verify Ubicloud runner pool label (e.g., ubicloud-standard-2).
  3. Update runs-on in ci.yml, ci-pr.yml, deploy-heroku.yml, and security-zap.yml (the high-minute jobs). Leave lightweight workflows on ubuntu-latest as fallback.
  4. Run one full CI cycle on a test PR. Confirm artifact uploads, vault access, and CF Pages deploys succeed.
  5. If stable for 5 business days, update remaining workflows.
  6. Document the migration in this ADR's changelog.

Rollback: revert the runs-on label changes. No data migration required.
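Step 3 of the checklist and the rollback are the same edit in opposite directions, so both can be scripted. This is a hedged sketch: `swap_runner_label` is an illustrative helper, `ubicloud-standard-2` is the example label from this ADR and should be verified first (step 2), and the labels are assumed to contain no regex metacharacters.

```shell
#!/usr/bin/env bash
set -eu

# swap_runner_label FROM TO FILE [FILE...]
# Rewrites `runs-on: FROM` to `runs-on: TO` in each workflow file.
# Only lines whose runs-on value is exactly FROM are changed; matrix
# expressions and other labels are left alone. Output goes through a
# temp file so the same command works with both GNU and BSD sed.
swap_runner_label() {
  local from="$1" to="$2"
  shift 2
  local file tmp
  for file in "$@"; do
    tmp=$(mktemp)
    sed -E "s/^([[:space:]]*runs-on:[[:space:]]*)${from}[[:space:]]*\$/\\1${to}/" "$file" > "$tmp"
    cat "$tmp" > "$file"   # overwrite in place, keeping the file's permissions
    rm -f "$tmp"
  done
}
```

Migration would then be `swap_runner_label ubuntu-latest ubicloud-standard-2 .github/workflows/ci.yml` (plus the other high-minute workflows), and rollback is the same call with the two labels reversed. Reviewing the resulting diff before committing is still advisable.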


Open questions

None currently blocking Phase 1. Phase 2 trigger is objective (billing threshold). No decisions deferred that would block sub-card implementation.