Status: Accepted
Date: 2026-04-30 UTC
Deciders: Kristerpher (product owner)
Refs: triggering incident run 25189616454 (Deploy customer docs wrangler failure); project memory project_ci_billing.md; issue #340 (GH Actions billing posture)
On 2026-05-01 UTC a transient network failure between a GitHub-hosted runner and the Cloudflare API killed run 25189616454 of the Deploy customer docs workflow at the npx wrangler pages deploy step. The job had no retry logic, so one bad TCP handshake between the runner and CF's API edge produced a hard failure that required a manual re-run.
This incident surfaces two distinct problems:
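- Reliability: deploy steps that call external APIs run once with no retry, so a single transient network failure fails the whole job.
- Cost: the trajectory of billed CI minutes on GitHub-hosted ubuntu-latest.

These are separable problems. Conflating them risks over-engineering a fix for the immediate flake.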
| Workflow | Est. run time | Trigger |
|---|---|---|
| ci.yml (backend + frontend tests, SAST, secret scan) | 10-15 min | PR + push to main |
| ci-pr.yml | 5-8 min | PR |
| deploy-heroku.yml (smoke gate + deploy) | 12-20 min | push main + manual |
| deploy-customer-docs.yml | 3-5 min | push main (path-filtered) |
| deploy-internal-docs.yml | 2-4 min | push main (path-filtered) |
| deploy-status-page.yml | 2-4 min | push main |
| deploy-console.yml | 3-5 min | push main |
| pr-preview.yml | 4-7 min | PR |
| security-zap.yml | 5-10 min | PR + schedule |
| nightly-security-scan.yml | 5-10 min | schedule |
| release.yml | 2-5 min | tag |
Rough ceiling: ~60-80 min/push-to-main if all workflows trigger together. At 3-4 merges/day = ~180-320 min/day = ~5,500-10,000 min/month. GitHub Pro includes 3,000 min/month free; everything above that is $0.008/min.
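The projection above can be reproduced with a few lines of shell arithmetic. This is a minimal sketch, not a billing tool: the function names are hypothetical, a 30-day month is assumed, and the $0.008/min rate is held as 8 tenths of a cent so the math stays in integers.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; all figures are the estimates quoted in this ADR.

# Projected CI minutes per month, assuming a 30-day month.
monthly_minutes() {
  local min_per_push=$1 merges_per_day=$2
  echo $(( min_per_push * merges_per_day * 30 ))
}

# Overage in cents at $0.008/min for minutes above the included allowance.
overage_cents() {
  local minutes=$1 included=${2:-3000}
  local over=$(( minutes > included ? minutes - included : 0 ))
  echo $(( over * 8 / 10 ))
}

low=$(monthly_minutes 60 3)    # 60 min/push, 3 merges/day
high=$(monthly_minutes 80 4)   # 80 min/push, 4 merges/day
echo "projected minutes/month: ${low}-${high}"
echo "overage: $(overage_cents "$low")-$(overage_cents "$high") cents"
```

With these inputs the projection comes out at 5,400-9,600 min/month, which the estimate above rounds to ~5,500-10,000.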
All deploy targets are external SaaS/PaaS. Runners need outbound HTTPS to api.heroku.com, api.cloudflare.com, and infisical endpoints. They do not need VPC/VPN access to any private subnet.
Description: Add continue-on-error + shell-level retry loops (or uses: nick-fields/retry@v3) to specific wrangler and curl steps. Accept GH-hosted as the runner fleet.
Cost: $0 incremental for the fix itself. Minutes cost stays as-is (excess billed at $0.008/min once Pro cap exceeded).
Setup time: 1-2 hours (patch existing workflows).
Maintenance: Near-zero. No new infra. Stays entirely within GitHub's security boundary.
Security posture: GH-managed, ephemeral VMs. Best-in-class for isolation. No surface area added.
Network flake mitigation: Partial. Retries absorb single-event transients; sustained CF API degradation would still fail after N attempts.
Limitation: Doesn't address the cost trajectory. Doesn't make jobs faster.
Description: Switch selected jobs to GitHub's 4-vCPU (ubuntu-latest-4-cores) or larger runners.
Cost: $0.008/min for 4-vCPU; $0.016/min for 8-vCPU. At 5,000 min/mo on 4-vCPU: ~$40/mo. Does not reduce minutes; trades higher cost for faster wall-clock time.
Network flake mitigation: None. Same GitHub-hosted network.
Verdict: Not appropriate at current scale. Speeds up compute-heavy jobs (builds, tests) but solves neither the flake nor the cost problem.
Description: Drop-in GH Actions runner pool hosted by Ubicloud. No VMs to manage. Change runs-on: ubuntu-latest to runs-on: ubicloud-standard-2 (or equivalent). At the quoted list prices (~$0.0042/min vs $0.008/min for ubuntu-latest), Ubicloud is roughly 47% cheaper per minute than GH-hosted.
Cost: At 5,000 min/mo: ~$21/mo. At 10,000 min/mo: ~$42/mo. For comparison, GH-hosted bills only the minutes above the 3,000-min free tier: ~$16/mo at 5,000 min/mo and ~$56/mo at 10,000 min/mo.
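The comparison can be checked with a small script. The sketch below is illustrative only: the helper names are hypothetical, the rates are the list prices quoted in this option (held in hundredths of a cent per minute), and it assumes every CI minute moves to Ubicloud, so none of the 3,000 included GH minutes are used after migration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; rates in hundredths of a cent per minute
# (GH-hosted: 80 = $0.008, Ubicloud: 42 = $0.0042).

gh_hosted_cents() {
  local minutes=$1 included=3000
  local over=$(( minutes > included ? minutes - included : 0 ))
  echo $(( over * 80 / 100 ))
}

ubicloud_cents() {
  local minutes=$1
  echo $(( minutes * 42 / 100 ))
}

# Smallest monthly volume (in steps of 100 min) at which moving every
# minute to Ubicloud costs less than staying on GH-hosted.
crossover_minutes() {
  local m
  for (( m = 0; m <= 20000; m += 100 )); do
    if (( $(ubicloud_cents "$m") < $(gh_hosted_cents "$m") )); then
      echo "$m"
      return 0
    fi
  done
  return 1
}

for m in 5000 10000; do
  echo "$m min: GH-hosted $(gh_hosted_cents "$m")c, Ubicloud $(ubicloud_cents "$m")c"
done
echo "crossover near $(crossover_minutes) min/month"
```

Under these assumptions the crossover falls between 6,300 and 6,400 min/month. Keeping lightweight workflows on ubuntu-latest, so the included minutes are still consumed, would move that point lower.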
Setup time: ~1-2 hours (Ubicloud account, GitHub App install, update runs-on labels).
Maintenance: Ubicloud manages the VMs. Operator action limited to account + GitHub App config.
Security posture: Ubicloud ephemeral VMs; similar isolation model to GH-hosted but Ubicloud is a smaller vendor with a shorter security track record. Acceptable for this threat model (no PII flows through CI, secrets come from GH Actions secrets/vault at runtime).
Network: Ubicloud runs in EU and US regions. Network path to CF API and Heroku is comparable to GH-hosted; no specific advantage on CF API latency. Does not fix the root cause of transient wrangler failures.
Verdict: Correct optimization once burn consistently exceeds the free tier. Not urgent at current scale.
Description: Add a $5-$20/mo Lightsail instance alongside the FreeScout instance, install the GitHub Actions runner agent, configure as a persistent self-hosted runner.
Cost: $5/mo (512 MB, 1 vCPU) to $20/mo (2 GB, 2 vCPU). No per-minute CI billing once on self-hosted (GH charges $0 for self-hosted runners on public repos; private repos consume no paid minutes).
Setup time: ~1 day (provision instance, install runner agent, configure systemd service, wire to repo, update workflows).
Maintenance: Operator-managed. OS patches, runner agent updates, disk monitoring, SSH key rotation all land on Kristerpher. Ongoing overhead estimated at 1-2 hours/month.
Security posture: Persistent runner = biggest attack surface of all options. A compromised repo secret or malicious PR could run code on the persistent instance. Mitigations required: restrict to non-fork PRs, run runner as a low-privilege user, no secrets stored on disk, job isolation via Docker (adds complexity). For a single-founder private repo with no external contributors, fork-PR risk is negligible today but the habit is bad.
Network: Lightsail US-East is geographically close to CF's US PoPs. Marginally more stable than a random GH runner assignment, but not meaningfully different in practice.
Verdict: Viable option for mature single-product setups with high CI volume. Premature here; maintenance burden is not worth it until monthly CI cost exceeds $30-40/mo with persistent issues.
Description: Managed ephemeral microVM runners. Each job boots a fresh Firecracker VM. Flat-fee pricing: $59-79/mo for unlimited builds (as of 2026 pricing).
Cost: $59-79/mo regardless of volume. Breakeven vs GH-hosted is ~7,375-9,875 min/mo (at $0.008/min). Current estimated burn is below that ceiling.
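The breakeven range follows directly from dividing the flat fee by the per-minute rate. A minimal sketch, with a hypothetical function name and the $0.008/min rate expressed as a multiplier (1 / 0.008 = 125):

```shell
#!/usr/bin/env bash
# Hypothetical helper: billed GH-hosted minutes that cost as much as a flat
# monthly fee at $0.008/min, i.e. fee_dollars / 0.008 = fee_dollars * 125.
flat_fee_breakeven_minutes() {
  local fee_dollars=$1
  echo $(( fee_dollars * 125 ))
}

echo "breakeven: $(flat_fee_breakeven_minutes 59)-$(flat_fee_breakeven_minutes 79) billed min/month"
```

These are billed minutes. If the 3,000 included minutes are counted as well, the total monthly volume at breakeven would be 3,000 minutes higher.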
Setup time: ~2-4 hours (account, GitHub App, server requirement — Actuated requires you to provide at least one arm64 or x86_64 server for them to manage; currently requires operator-owned or rented bare metal/VPS).
Security posture: Best-in-class. Each job gets a fresh Firecracker microVM with no leftover state. Better isolation than GH-hosted for persistent-runner threat model.
Maintenance: Actuated manages the microVM hypervisor; operator provides the host server. This introduces an operator-managed server requirement which conflicts with the goal of minimal maintenance overhead.
Verdict: Compelling security story, but requires operator-managed host hardware (even if it's a rented VPS). Flat fee only makes economic sense above ~8,000 min/mo. Not the right fit for current scale.
Excluded per design brief. Overkill for current scale.
Description: Label specific workflow jobs (smoke gate, ZAP scan) to route to a self-hosted runner while keeping lightweight jobs on GH-hosted.
Verdict: Inherits all the maintenance overhead of Option 4 plus workflow complexity. Only makes sense if certain jobs have strong network co-location requirements. Not the case here.
Adopt a two-phase approach:
Phase 1 (immediate) — Retry hardening on GH-hosted runners.
Add retry logic to all wrangler and outbound API steps across every deploy workflow. This directly addresses the triggering incident at zero infrastructure cost. Use shell-level retry loops (idiomatic, no third-party action dependency) with exponential backoff and a 3-attempt ceiling.
This is the right first move because:
- The triggering failure was a single-attempt transient call, not a sustained CF API outage.
- Retry logic is valuable regardless of which runner fleet we use.
- It closes the immediate reliability gap in under 2 hours of work.
Phase 2 (at >5,000 min/mo sustained for 2+ consecutive months) — Migrate to Ubicloud.
When monthly CI burn consistently exceeds the 3,000-min free tier by a material margin (indicative threshold: $20+/mo incremental), evaluate Ubicloud as a drop-in replacement. At that point the economics justify the 1-2 hour migration.
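The Phase 2 trigger is simple enough to express as a check. The sketch below is an assumption-laden illustration: the function name is hypothetical, the 5,000-minute threshold is the one stated in this ADR, and how the monthly minute counts are obtained is left out.

```shell
#!/usr/bin/env bash
# Hypothetical check for the Phase 2 trigger.
# Arguments: CI minutes for the two most recent months.
# Returns 0 if both months exceed 5,000 minutes, 1 if not, 2 on bad usage.
phase2_triggered() {
  local threshold=5000 month
  (( $# >= 2 )) || return 2
  for month in "$1" "$2"; do
    (( month > threshold )) || return 1
  done
  return 0
}

if phase2_triggered 5600 6100; then
  echo "Phase 2 trigger met: evaluate Ubicloud"
else
  echo "Stay on GH-hosted"
fi
```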
Ubicloud is preferred over Actuated.dev because it requires no operator-managed host and over self-hosted Lightsail because it requires no persistent runner maintenance. It is preferred over GH Larger Runners because the per-minute rate is lower.
This is a 2-way-door decision. The retry hardening (Phase 1) is additive and does not couple us to any runner provider. The Ubicloud migration (Phase 2) is straightforward to reverse — changing runs-on labels back to ubuntu-latest takes minutes.
Positive:
- Transient wrangler/CF API failures are absorbed by retry logic; manual re-runs become rare.
- No new infra to maintain for Phase 1.
- Clear threshold ($20/mo incremental or >5,000 min/mo for 2 months) gives an objective trigger for Phase 2 rather than a vague "when it gets expensive."
Negative:
- Phase 1 retry logic increases maximum possible job runtime by the extra attempts plus the backoff sleeps between them. For a 30-second wrangler step with 3 attempts and 15s/30s backoff: worst case +105s per job (60s of repeated attempts, 45s of sleep). Acceptable.
- Phase 1 does not reduce per-minute cost. If burn is already above the free tier, the retry logic slightly increases it on failure paths.
- Phase 2 (Ubicloud) introduces a dependency on a smaller vendor. Vendor risk is mitigated by the fact that migration back to GH-hosted is a label change.
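The worst-case runtime cost of retries can be estimated once the backoff sleeps are counted along with the repeated attempts. A minimal sketch, assuming a fixed step duration and a delay that doubles after each failure; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: worst-case seconds added by retries when every attempt fails.
# Arguments: step duration (s), total attempts, initial backoff delay (s).
worst_case_extra_seconds() {
  local step=$1 attempts=$2 delay=$3
  local extra=0 i
  for (( i = 1; i < attempts; i++ )); do
    extra=$(( extra + delay + step ))   # one backoff sleep plus one more attempt
    delay=$(( delay * 2 ))
  done
  echo "$extra"
}

# 30-second step, 3 attempts, 15s initial delay
worst_case_extra_seconds 30 3 15
```

For a 30-second step with 3 attempts and a 15-second initial delay this prints 105: 60 seconds of repeated attempts plus 45 seconds of sleep.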
| Option | Monthly cost (5K min) | Maint overhead | Fixes flake | Fixes cost |
|---|---|---|---|---|
| 1. GH-hosted + retry | ~$16 incremental | Minimal | Yes (retries) | No |
| 2. GH Larger Runners | ~$40 | None | No | Increases |
| 3. Ubicloud | ~$21 | Minimal | Partial | Yes |
| 4. Lightsail self-hosted | ~$10 + ~1h/mo ops | Medium | Marginal | Yes |
| 5. Actuated.dev | $59-79 flat | Medium (host req) | Yes | Only at >8K min |
| Phase 1+2 (this ADR) | ~$16 now; ~$21 at scale | Minimal | Yes | At threshold |
The following shell-level retry loop is the standard pattern to apply to wrangler deploy steps and any other single-shot outbound API calls in deploy workflows:

    - name: Deploy to CF Pages (with retry)
      env:
        CLOUDFLARE_API_TOKEN: ${{ env.CLOUDFLARE_API_TOKEN }}
        CLOUDFLARE_ACCOUNT_ID: ${{ env.CLOUDFLARE_ACCOUNT_ID }}
      run: |
        set -euo pipefail
        attempt=0
        max_attempts=3
        delay=15   # seconds; doubled after each failed attempt
        # Re-run the deploy until it succeeds or max_attempts is reached.
        until npx wrangler@latest pages deploy dist-customer-docs \
            --project-name=raxx-docs \
            --branch=main \
            --commit-dirty=true \
            --commit-hash="${{ github.sha }}" \
            --commit-message="customer docs (${{ github.sha }})"; do
          attempt=$((attempt + 1))
          if [ "$attempt" -ge "$max_attempts" ]; then
            echo "::error::wrangler deploy failed after $max_attempts attempts"
            exit 1
          fi
          echo "wrangler deploy attempt $attempt failed; retrying in ${delay}s ..."
          sleep "$delay"
          delay=$((delay * 2))   # exponential backoff: 15s, then 30s
        done
Apply this pattern to:
- npx wrangler pages deploy steps in all CF Pages deploy workflows
- npx wrangler pages project create steps
- Any curl call that posts to an external API as a deploy gate (not internal health checks — those already have their own retry loops in deploy-heroku.yml)
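For the curl and project-create steps, the same loop could be factored into a reusable shell function so each step does not repeat it. This is a sketch under assumptions, not an existing helper in the repo: the name retry_with_backoff and its argument order are hypothetical, and the URL in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry_with_backoff <max_attempts> <initial_delay_s> <command...>
# Runs the command until it succeeds, doubling the delay after each failure.
retry_with_backoff() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "::error::'$*' failed after ${max_attempts} attempts"
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s ..."
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example usage (placeholder URL, not a real endpoint):
# retry_with_backoff 3 15 curl --fail --silent --show-error -X POST https://example.invalid/deploy-hook
```

Passing the command as arguments rather than as a string avoids eval and keeps quoting intact. With 3 attempts and a 15-second initial delay the behavior matches the inline loop above.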
When the billing trigger fires, the migration checklist is:
- Create a Ubicloud account and install the Ubicloud GitHub App on raxx-app/TradeMasterAPI.
- Select the runner label to use (ubicloud-standard-2).
- Update runs-on in ci.yml, ci-pr.yml, deploy-heroku.yml, and security-zap.yml (the high-minute jobs). Leave lightweight workflows on ubuntu-latest as fallback.

Rollback: revert the runs-on label changes. No data migration required.
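The label change and its rollback could be scripted rather than edited by hand. The sketch below is an assumption, not part of the existing repo: switch_runner is a hypothetical helper, it expects each job to declare its runner on a line ending in exactly runs-on: followed by the label, and it only touches the four high-minute workflows named above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: swap the runner label in the high-minute workflows.
# Usage: switch_runner <workflow_dir> <from_label> <to_label>
switch_runner() {
  local dir=$1 from=$2 to=$3 file
  for file in ci.yml ci-pr.yml deploy-heroku.yml security-zap.yml; do
    [ -f "$dir/$file" ] || continue
    # -i.bak works with both GNU and BSD sed; the backup is removed afterwards.
    # The trailing $ anchor avoids rewriting longer labels that share the prefix.
    sed -i.bak "s/runs-on: ${from}\$/runs-on: ${to}/" "$dir/$file"
    rm -f "$dir/$file.bak"
  done
}

# Migrate:  switch_runner .github/workflows ubuntu-latest ubicloud-standard-2
# Rollback: switch_runner .github/workflows ubicloud-standard-2 ubuntu-latest
```

Reviewing the resulting diff before committing is still advisable, since lines with trailing comments or matrix-driven runs-on values would not be matched.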
None currently blocking Phase 1. Phase 2 trigger is objective (billing threshold). No decisions deferred that would block sub-card implementation.