Raxx · internal docs


Repo Split Strategy + Buildkite Migration Plan

Date: 2026-05-13 UTC
Author: architect-agent
Status: Proposal (operator decision required before any action)
Related: PR #1978 (CI migration candidate filtering), PR #1979 (BLR financial/licensing research)

Not a commitment. Operator + BLR + architect-review decide after reading this document.


1. Executive Summary

The monorepo raxx-app/TradeMasterAPI currently holds seven distinct deployable services (Raptor, Antlers, Console, Velvet, Queue, two static sites), nine Terraform roots, and 34 GitHub Actions workflows in a single flat tree. The operator has identified this as a development-speed drag and is evaluating a split.

This document evaluates three repo strategies (stay monorepo, true polyrepo, or git submodules) and recommends a targeted polyrepo split along service deployment boundaries, partially deferred to post-launch.

Separately, the operator has minted a Buildkite API key (stored at /MooseQuest/buildkite/ in Infisical) and confirmed post-launch cutover intent; a phased Buildkite migration plan (Phase 0-3) is documented here.

The two plans interact: the repo split should happen before or concurrently with the CI migration, not after, to avoid migrating pipelines that later need to be re-homed to new repos. The single question the operator must answer before sub-cards can be claimed is in Section 6.


2. Plan A — Repo Strategy

2.1 Current state inventory

TradeMasterAPI/
  backend_v2/              Raptor — Flask API (Python)            Heroku: raxx-api-{prod,staging}
  console/                 Console — operator console (Python)    Heroku: raxx-console-{prod,staging}
  velvet/                  Velvet — token rotation (Python)       Heroku: raxx-velvet-{prod,staging}
  queue/                   Queue — identity/RBAC/customers (C++)  Heroku: (queue deploy workflow)
  frontend/trademaster_ui/ Antlers — SPA (React/CRA)             CF Pages: raxx-app
  frontend/status-page/    Status page (static HTML)             CF Pages: status.raxx.app
  frontend/getraxx-landing/ getraxx.com marketing site           CF Pages: getraxx.com
  mockups-site/            Product mockups                        CF Pages: mockups.raxx.app
  terraform/               9 infra roots                          n/a
  docs/                    Architecture, ops, legal, etc.         n/a
  scripts/                 CI helpers, vault helpers              n/a
  .github/workflows/       34 workflows                           n/a
  .claude/agents/          Agent personas + project config        n/a

Cross-service coupling today:

Coupling point | Services involved | Notes
Console calls GitHub workflow_dispatch | Console → GHA | Would need to call Buildkite API or per-repo GHA after a split
Console GET /api/internal/deploys polls GitHub run status | Console → GHA | Same coupling; needs updating per CI migration
Velvet client library consumed by Raptor, Console, Queue | Velvet → all three | velvet/client/ directory imported by sibling services
HMAC deploy callback endpoint | Console ← GHA → Heroku | Webhook receiver is Console-side; emitter is GHA workflow
Shared scripts (scripts/ci/, scripts/agents/) | All workflows | 34 workflows import these via path
CLAUDE.md / .claude/agents/ | Agent fleet | Agent personas shared across all services
Terraform TF state cross-references | Terraform roots | Some modules reference sibling roots

2.2 Option matrix

Option 1 — Stay monorepo

Description: No structural change. Address the pain points inside the single repo (better CODEOWNERS partitioning, per-path CI filtering, sparse checkouts).

Honest assessment of benefits:

  - Atomic cross-service commits remain trivially easy (single PR for a Velvet client API change + Raptor consumer update).
  - Single CI cache layer; all 34 workflows share actions/cache entries.
  - One CODEOWNERS file; one branch protection ruleset; one GitHub App token.
  - Agent fleet never resolves inter-repo permission boundaries.
  - The shared velvet/client/ library stays on a single import path with no package versioning overhead.

Developer ergonomics: Good for cross-service work. Poor for service-scoped work because git clone pulls 7 services + terraform + docs. IDE symbol resolution traverses the entire tree regardless of which service you are editing.

CI complexity: 34 workflows can be path-gated; most are already. Main cost is that every workflow merge lands in one git history.

Cross-service refactor cost: Lowest of all options (single PR, single test run).

Onboarding: Simplest for a solo operator + agent fleet. One URL, one clone, one token. Every agent already knows the layout.

Migration cost: Zero.

Reversibility: Trivial — nothing to undo.

Verdict on the operator's stated pain: The monorepo is not the cause of development speed drag. The drag is: (a) the broken ci.yml YAML parse error burning 35 allocations per day since 2026-05-06, (b) two GHA workflows sharing the name "CI" making branch protection semantics unreliable, and (c) the absence of per-service test isolation boundaries. These are all fixable inside the monorepo without a structural change.

Score: Ergonomics 3/5 | CI complexity 2/5 | Cross-service refactor 5/5 | Onboarding 5/5 | Agent tooling 5/5 | Migration cost 5/5 | Reversibility 5/5
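
The per-path CI filtering mentioned under Option 1 can be sketched as a small lookup from changed files to affected services. This is a minimal illustration only: the directory prefixes come from the inventory in 2.1, while the SERVICE_PATHS mapping and the affected_services helper are hypothetical names, not existing repo code.

```python
from pathlib import PurePosixPath

# Directory prefixes taken from the inventory in 2.1.
# The mapping and function names are illustrative, not existing code.
SERVICE_PATHS = {
    "raptor": "backend_v2",
    "console": "console",
    "velvet": "velvet",
    "queue": "queue",
    "antlers": "frontend/trademaster_ui",
    "status-page": "frontend/status-page",
    "getraxx": "frontend/getraxx-landing",
    "mockups": "mockups-site",
}


def affected_services(changed_files):
    """Return the sorted names of services whose directory contains a changed file."""
    hit = set()
    for name in changed_files:
        path = PurePosixPath(name)
        for service, prefix in SERVICE_PATHS.items():
            # Component-wise match, so "console_x/..." does not match "console".
            if path.is_relative_to(prefix):
                hit.add(service)
    return sorted(hit)
```

A change under velvet/client/ would, in practice, also need to trigger the sibling services that import it; the sketch deliberately leaves that coupling out.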


Option 2 — True polyrepo

Description: Each service becomes an independent GitHub repository under the raxx-app org. The monorepo is archived or repurposed as a meta-repo for docs only.

Proposed repo layout:

Repo | Contents | Deploy target
raxx-app/raptor | backend_v2/ | Heroku raxx-api-*
raxx-app/antlers | frontend/trademaster_ui/ | CF Pages raxx-app
raxx-app/console | console/ | Heroku raxx-console-*
raxx-app/velvet | velvet/ | Heroku raxx-velvet-*
raxx-app/queue | queue/ | Heroku (queue deploys)
raxx-app/getraxx | frontend/getraxx-landing/ | CF Pages getraxx.com
raxx-app/status-page | frontend/status-page/, frontend/status-worker/ | CF Pages status.raxx.app
raxx-app/mockups | mockups-site/ | CF Pages mockups.raxx.app
raxx-app/infra | terraform/, scripts/ | n/a (operator-run)
raxx-app/docs | docs/, .claude/agents/, CLAUDE.md | n/a

Developer ergonomics: Excellent for single-service work. Poor for cross-service work — a Velvet client API change now requires coordinating two PRs in two repos with ordering constraints (velvet PR lands first, then raptor/console PRs reference the new version). For a solo operator with an AI agent fleet, cross-repo coordination is a meaningful friction multiplier.

CI complexity (with Buildkite as target): Each repo gets its own pipeline. Independent branch protection. No path-gating needed. But every agent dispatch now requires knowing which repo to target.

Cross-service refactor cost: High. Velvet client versioning alone requires either (a) a package registry (PyPI private or GitHub Packages) with SemVer pins, or (b) git submodule references to velvet inside raptor/console/queue. Neither is free.

Onboarding: Harder. A new engineer (or agent) must understand the repo map before writing a single line.

Migration cost: High. 34 workflows split across 9 repos; branch protection rules re-configured per-repo; GitHub secrets duplicated (or migrated to an org-level store); CODEOWNERS re-written. Minimum 2-3 weeks of mechanical migration work.

Reversibility: Hard. Once history is split across repos, recombining requires git subtree merge with significant manual conflict resolution. The 2026-04-25 incident (#330) showed how destructive an accidental base-branch error can be; a polyrepo split multiplies the surface area for this class of error.

Agent tooling: The current agent fleet (software-architect, feature-developer, etc.) operates with a single worktree isolation model tied to one repo. A polyrepo requires either (a) per-repo worktrees across all agents, or (b) the operator dispatching agents per-repo explicitly. This is a real friction increase for the current agent-driven development model.

Score: Ergonomics 4/5 (single service) / 2/5 (cross-service) | CI complexity 3/5 | Cross-service refactor 2/5 | Onboarding 2/5 | Agent tooling 2/5 | Migration cost 1/5 | Reversibility 1/5


Option 3 — Git submodules / git subtree

Description: A top-level "super-repo" references each service as a git submodule. Each submodule has its own repo and CI. The super-repo pins the submodule refs.

Developer ergonomics: Worst of all options for this team. Git submodules are infamously confusing even for experienced engineers. The "submodule pinning ceremony" (updating the parent commit every time a child changes) adds manual steps that agents do not handle well. Git subtree is safer to use but has its own rebasing complexity.

CI complexity: Very high. Parent CI must understand submodule state. PRs in a submodule do not automatically trigger parent CI. State coherence across the super-repo and all submodule repos requires additional tooling.

Cross-service refactor cost: Same as polyrepo — worse ergonomically because of submodule ceremony.

Onboarding: Hardest. Submodules are a known developer experience trap.

Migration cost: High. Comparable to polyrepo plus submodule ceremony overhead.

Reversibility: Hard.

Score: Ergonomics 1/5 | CI complexity 1/5 | Cross-service refactor 2/5 | Onboarding 1/5 | Agent tooling 1/5 | Migration cost 2/5 | Reversibility 2/5


2.3 Recommendation: Targeted partial polyrepo (Option 2, scoped)

Do not do a full polyrepo split. Do not do submodules.

The recommendation is a targeted partial split of services that have zero shared code with siblings and are already independently deployable. The services with meaningful internal coupling (Raptor + Velvet + Console + Queue, which share the Velvet client and a common scripts/agents/ layer) stay in the monorepo until the Velvet client is properly packaged.

Split now (pre-launch safe, low risk):

Repo | Contents | Rationale
raxx-app/getraxx | frontend/getraxx-landing/ | Zero cross-service coupling. Purely static HTML/JS. Has its own deploy workflow (deploy-getraxx.yml). Splitting removes marketing site churn from service-CI noise.
raxx-app/mockups | mockups-site/ | Zero cross-service coupling. Static. Has its own deploy workflow (deploy-mockups.yml). Safe to move at any time.
raxx-app/status-page | frontend/status-page/, frontend/status-worker/, frontend/artifacts/ | Zero coupling to Raptor/Antlers. Independent deploy. Separating it means status-page deploys cannot accidentally break Antlers CI.

Split post-launch (after v1 stabilizes, in tandem with CI migration):

Repo | Contents | Blocker before split
raxx-app/infra | terraform/ | Terraform workflows (terraform-validate.yml and the forthcoming per-root workflows per ADR-0082) need to move. Do this when Buildkite pipelines are established per Plan B.
raxx-app/antlers | frontend/trademaster_ui/ | Antlers is standalone today (no velvet/client import). Blocked only by migrating deploy-antlers.yml + pr-preview.yml to the new repo. Do this in CI migration Phase 2.
raxx-app/docs | docs/, .claude/agents/ | Move after agent persona update. Agents need to be retargeted to the new docs repo for architecture/ADR commits. Lowest urgency: docs do not have a deploy pipeline.

Stay in monorepo indefinitely (until post-launch re-evaluation):

Why this is better than a full split:


2.4 Pre-launch starter steps (safe now)

These actions have zero risk to running services and can be done before 2026-05-23:

  1. Create raxx-app/getraxx repo. Empty, with README and placeholder CI workflow. No code moves yet — just establishes the destination. Copy org-level secrets as needed.
  2. Create raxx-app/mockups repo. Same as above.
  3. Create raxx-app/status-page repo. Same as above.
  4. Do NOT move code yet. The git filter-repo run that extracts history is a post-launch operation. The empty repos are just scaffolding.
  5. Update frontend/getraxx-landing/README.md with a note: "This directory is slated to move to raxx-app/getraxx post-launch. Do not add cross-service imports."

Why defer the actual git filter-repo run to post-launch: A history extraction on a live repo with active CI is a coordination hazard. During the pre-launch sprint, any PR that touches these directories would need to be coordinated with the extraction. Better to freeze the boundary with a README note and do the clean cut after launch.


2.5 Migration sequencing (post-launch)

Week 1 post-launch (2026-05-24+)
  git filter-repo extract frontend/getraxx-landing/ → raxx-app/getraxx
  git filter-repo extract mockups-site/ → raxx-app/mockups
  git filter-repo extract frontend/status-page/ + frontend/status-worker/ → raxx-app/status-page
  Update deploy-getraxx.yml, deploy-mockups.yml, deploy-status-page.yml to point
    at new repo; remove old paths from monorepo CI path filters.
  Archive monorepo copies (move to _archived/ prefix, do not delete).

Week 3-6 post-launch (concurrent with Buildkite Phase 2)
  Package velvet/client/ as an installable Python package (velvet-client on GitHub Packages).
  Pin version in raptor/console/queue requirements.
  Then: extract frontend/trademaster_ui/ → raxx-app/antlers (with its own CI pipeline).
  Then: extract terraform/ → raxx-app/infra (after Buildkite pipelines for terraform are live).

Post-launch re-evaluation (month 2-3)
  Assess whether Raptor + Console + Velvet + Queue warrant further splits.
  Recommended: keep the four service directories in one "core" monorepo
  (rename to raxx-app/core or raxx-app/platform) rather than splitting into four repos.
  The internal coupling does not dissolve just because the directory structure changes.

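Rollback: Each extraction runs git filter-repo against a fresh clone, so the source directory in the monorepo is never deleted or rewritten. Rollback = stop using the new repo, move the directory back out of _archived/ if it was archived, and restore CI path filters to the original directory. No data loss in any scenario.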
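
As a rough sketch of the extraction step, the commands for each target repo could be generated as below. The source paths follow the inventory in 2.1 and the target names are the proposed repos; EXTRACTIONS, filter_repo_argv, extraction_plan, and the extract/ working directory are hypothetical names chosen for illustration. The sketch assumes git filter-repo's --path option (repeatable, keeps only the listed paths) and only builds command strings; it does not run anything.

```python
import shlex

# Source paths follow the inventory in 2.1; target names are the proposed repos.
EXTRACTIONS = {
    "raxx-app/getraxx": ["frontend/getraxx-landing/"],
    "raxx-app/mockups": ["mockups-site/"],
    "raxx-app/status-page": ["frontend/status-page/", "frontend/status-worker/"],
}


def filter_repo_argv(paths):
    """Build a `git filter-repo` command that keeps only the given paths."""
    argv = ["git", "filter-repo"]
    for path in paths:
        argv += ["--path", path]
    return argv


def extraction_plan(source_url, work_dir="extract"):
    """Return, per target repo, the command lines: fresh clone, then filter."""
    plan = {}
    for target, paths in EXTRACTIONS.items():
        clone_dir = f"{work_dir}/{target.split('/')[-1]}"
        plan[target] = [
            shlex.join(["git", "clone", source_url, clone_dir]),
            # `git -C <dir>` runs the filter inside the fresh clone only.
            shlex.join(["git", "-C", clone_dir] + filter_repo_argv(paths)[1:]),
        ]
    return plan
```

Because every command targets a throwaway clone, the monorepo checkout is never the working directory of a history rewrite.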


3. Plan B — Buildkite Migration

3.1 Vault inventory: /MooseQuest/buildkite/

The Infisical CLI requires a project token for interactive secret reads; the operator has the key at vault.raxx.app. From the operator's 2026-05-13 UTC briefing, the vault path /MooseQuest/buildkite/ is confirmed to contain at minimum BUILDKITE_AGENT_TOKEN and BUILDKITE_API_TOKEN.

What this tells us: A Buildkite organization has already been provisioned (you cannot generate an AGENT_TOKEN without an org). Phase 0 plumbing is further along than a blank-slate start.

What is not yet confirmed in vault: Whether a pipeline has been created, whether the agent has ever been run, and whether the Buildkite GitHub App is installed on raxx-app/TradeMasterAPI.


3.2 Design question answers

Where does the Buildkite agent live?

Self-hosted on a dedicated small VPS (recommended: Hetzner CX22, ~$5/mo, 2 vCPU / 4 GB RAM / 40 GB SSD). Rationale: Buildkite's free tier (1 user, 1 agent) requires operator-owned compute. A Hetzner VPS at $5/mo is the lowest-cost option that gives a persistent agent (not an on-demand EC2 instance that cold-starts per build). The C++ CMake Queue build is CPU-intensive and benefits from a warm runner that caches its build/ directory between runs. Heroku one-off dynos are unsuitable — they cannot persist a Buildkite agent daemon. A local laptop is unsuitable — the agent must be available for scheduled cron pipelines at any time.

At post-launch scale, if build queue wait time becomes an issue, add a second Buildkite agent on the same VPS or scale to a CX32 ($10/mo). The free tier permits multiple concurrent agents from a single registered org.

How does GitHub trigger Buildkite?

Buildkite's official GitHub App (available at https://github.com/apps/buildkite). This is the correct integration path — it posts commit statuses back to GitHub PRs, creates GitHub Deployments for deploy pipelines, and supports the branch-based trigger model. Do not use webhooks or polling. The GitHub App is first-class and supported.

The GitHub App must be installed on each GitHub repo that needs Buildkite CI. For the partial polyrepo model (Plan A), that means:

  - raxx-app/TradeMasterAPI (core services, during Phase 1-2)
  - raxx-app/getraxx, raxx-app/mockups, raxx-app/status-page (when those repos exist in Phase 2)

How do Buildkite secrets bind to Infisical?

A Buildkite agent hook, not Buildkite's native secret store. The recommended pattern:

  1. Install the infisical CLI on the agent VPS at provisioning time.
  2. Write a Buildkite environment hook at /etc/buildkite-agent/hooks/environment that calls infisical export --path /MooseQuest/<service> --format dotenv and sources the output into the build environment.
  3. The Infisical machine identity token for the VPS is stored in /etc/buildkite-agent/infisical-token with mode 0600, owned by the buildkite-agent user.
  4. AWS SSM secrets (workload secrets per memory feedback_aws_workloads_use_ssm_not_vault) are fetched via aws ssm get-parameter in the same hook, using an IAM instance profile attached to the VPS (if EC2) or an IAM user key stored in Infisical (if Hetzner VPS).

This keeps secrets out of Buildkite's SaaS layer entirely. Pipeline YAML never contains secret values. The approach is compatible with the "no stored credentials" invariant because the Infisical token is rotatable independently of any pipeline code.
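
The hook itself would be a shell script, but the parsing logic it relies on can be sketched in Python. The example below assumes the CLI emits plain KEY=VALUE lines (optionally quoted) for --format dotenv; parse_dotenv and fetch_service_env are hypothetical helper names, and multi-line values are not handled.

```python
import subprocess


def parse_dotenv(text):
    """Parse KEY=VALUE lines, skipping blanks and comments, stripping one layer of quotes."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def fetch_service_env(service):
    """Call the CLI as described in step 2 and return the secrets as a dict.

    Not executed at import time; requires the infisical CLI and a valid token.
    """
    result = subprocess.run(
        ["infisical", "export", "--path", f"/MooseQuest/{service}", "--format", "dotenv"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_dotenv(result.stdout)
```

Using check=True means a failed export raises instead of silently yielding an empty environment, which matches the fail-fast intent of the hook.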

How does the Console Ops Dispatch panel talk to Buildkite?

The console/app/services/ops_dispatch.py service currently calls GitHub's POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches REST endpoint. At migration time, this is replaced with a call to Buildkite's POST https://api.buildkite.com/v2/organizations/{org}/pipelines/{pipeline_slug}/builds endpoint, authenticated via BUILDKITE_API_TOKEN (already in vault).

The OpsDispatchLog model and the existing WORKFLOW_REGISTRY pattern are retained — only the HTTP call destination changes. The dispatch_factory() function in ops_dispatch.py is extended with a Buildkite dispatcher implementation that the feature-developer swaps in via feature flag (FLAG_CONSOLE_OPS_DISPATCH_BUILDKITE). GHA dispatch path stays live as fallback until the flag is promoted.
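
A minimal sketch of the Buildkite side of this swap is shown below. It only prepares the create-build request and picks a dispatcher from the flag; build_request and choose_dispatcher are hypothetical names (not the existing dispatch_factory()), the "raxx" org slug in the usage note is the example value from Section 6, and the request body fields (commit, branch, message) and Bearer authentication follow my understanding of Buildkite's REST API.

```python
import json
import urllib.request

BUILDKITE_API = "https://api.buildkite.com/v2"


def build_request(org, pipeline_slug, token, branch="main", commit="HEAD", message=""):
    """Prepare (but do not send) a create-build request for one pipeline."""
    url = f"{BUILDKITE_API}/organizations/{org}/pipelines/{pipeline_slug}/builds"
    body = {"commit": commit, "branch": branch, "message": message}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def choose_dispatcher(flags):
    """Pick a dispatcher name from the feature flag; GHA stays the fallback."""
    if flags.get("FLAG_CONSOLE_OPS_DISPATCH_BUILDKITE"):
        return "buildkite"
    return "github_actions"
```

For example, build_request("raxx", "deploy-console", token) yields a request that could be passed to urllib.request.urlopen once the org slug and pipeline slug are confirmed.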

How does the deploy modal read state from Buildkite?

Two-way integration:

  1. Outbound (Console → Buildkite): POST .../builds returns the Buildkite build number and URL immediately. Console stores these in place of github_run_id (rename the fields to ci_run_id / ci_run_url in a migration to avoid the GitHub-specific naming). UI polls GET /api/internal/deploys/<id>, which proxies GET https://api.buildkite.com/v2/.../builds/<number> for live status.

  2. Inbound (Buildkite → Console): Configure a Buildkite webhook (notification service in the pipeline settings) to POST to https://console.raxx.app/api/internal/deploys/<id>/status, using the existing DEPLOY_CALLBACK_HMAC_SECRET for webhook signature verification. The existing HMAC callback receiver in console/app/blueprints/deploys.py handles this with minor payload adaptation (Buildkite webhook shape differs from the GHA shape; a thin adapter translates field names).
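
The inbound half can be sketched as a signature check plus a thin adapter. Several things here are assumptions to verify against Buildkite's webhook documentation before use: that the signature header has the form timestamp=...,signature=... and is an HMAC-SHA256 over "timestamp.body", and that build events carry number, web_url, and state under a build key. STATE_MAP and the Console-side status names are hypothetical.

```python
import hashlib
import hmac


def verify_signature(header, body, secret):
    """Check a 'timestamp=...,signature=...' header against HMAC-SHA256(secret, 'timestamp.body')."""
    parts = dict(item.split("=", 1) for item in header.split(",") if "=" in item)
    timestamp, signature = parts.get("timestamp"), parts.get("signature")
    if not timestamp or not signature:
        return False
    message = timestamp.encode() + b"." + body
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())


# Hypothetical mapping from Buildkite build states to Console deploy states.
STATE_MAP = {
    "scheduled": "queued",
    "running": "in_progress",
    "passed": "success",
    "failed": "failure",
    "canceled": "cancelled",
}


def adapt_payload(payload):
    """Translate a Buildkite build event into the field names Console stores."""
    build = payload["build"]
    return {
        "ci_run_id": build["number"],
        "ci_run_url": build["web_url"],
        "status": STATE_MAP.get(build["state"], "unknown"),
    }
```

A production receiver would also reject stale timestamps to limit replay; that check is omitted here for brevity.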

How does Branch Protection work when checks come from Buildkite?

Buildkite's GitHub App posts status checks as buildkite/<pipeline-slug> context names on each commit. GitHub branch protection rules are updated to require these contexts instead of GHA workflow names. Migration procedure per workflow:

  1. Run Buildkite pipeline in parallel with GHA for 2 weeks.
  2. Confirm Buildkite consistently posts the expected status.
  3. Update branch protection to add buildkite/<slug> as required check.
  4. Remove GHA workflow name from required checks.
  5. After 1 additional week, disable the GHA workflow.

This is a per-pipeline migration that avoids any window where branch protection is unenforced.
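
Steps 3 and 4 can be expressed as two separate list operations, which is what keeps protection continuously enforced: the Buildkite context is added first, and the GHA name is removed only afterwards. The helper names and the example check name "CI / console" are hypothetical; the buildkite/<slug> context format is the one described above.

```python
def add_required_check(contexts, pipeline_slug):
    """Step 3: return the required-check list with buildkite/<slug> added (idempotent)."""
    context = f"buildkite/{pipeline_slug}"
    return list(contexts) if context in contexts else list(contexts) + [context]


def remove_required_check(contexts, gha_name):
    """Step 4: return the required-check list without the GHA workflow name."""
    return [c for c in contexts if c != gha_name]
```

Applied in order, a list such as ["CI / console"] first becomes ["CI / console", "buildkite/ci-console"] and then ["buildkite/ci-console"], so at no point is the list empty.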


3.3 Phased migration plan

Phase 0 — Pre-launch scaffolding (now through 2026-05-23 UTC)

Goal: Verify the plumbing works before launch. No prod traffic is involved. No GHA workflows are disabled.

Deliverables:

  1. Vault inventory confirmation. Operator reads /MooseQuest/buildkite/ in Infisical and confirms BUILDKITE_AGENT_TOKEN and BUILDKITE_API_TOKEN are present and scoped to the right Buildkite org.
  2. Agent VPS provisioned. A Hetzner CX22 (or equivalent low-cost VPS) with:
     - buildkite-agent installed and registered to the org using BUILDKITE_AGENT_TOKEN.
     - infisical CLI installed.
     - /etc/buildkite-agent/hooks/environment hook written.
     - Agent confirmed connected in the Buildkite dashboard.
  3. GitHub App installed. Buildkite GitHub App installed on raxx-app/TradeMasterAPI (and any pre-split repos created per the Plan A pre-launch starter steps in 2.4).
  4. One trivial pipeline created. A "hello world" pipeline in Buildkite that:
     - Is triggered manually from the Buildkite UI.
     - Runs echo "agent connected; infisical version: $(infisical --version)" on the agent VPS.
     - Posts a green status check on a test commit in a scratch branch.
     This pipeline is NOT in the branch protection required checks.
  5. Vault bridge script tested. The Infisical environment hook fetches at least one non-sensitive secret (e.g., INFISICAL_PROJECT_ID) and makes it visible in the build log (redacted confirmation, not the value itself).

Success criteria:

  - Agent shows connected in Buildkite dashboard.
  - Hello-world pipeline runs to completion on the agent VPS.
  - Buildkite posts a green status to a scratch commit on GitHub.
  - Infisical hook executes without error in the build environment.

Rollback: Delete the VPS. Remove the GitHub App from the repo. No GHA changes were made; nothing reverts.


Phase 1 — Parallel run (post-launch, week 1-2 after 2026-05-23)

Goal: Mirror 2 low-risk pipelines from GHA to Buildkite. Both run simultaneously. No branch protection changes yet.

Candidate pipelines (lowest risk first):

  1. ci-digest-cron.yml → Buildkite scheduled pipeline. This workflow sends a daily Slack digest. A parallel Buildkite version can be disabled if it double-sends; the risk is noise, not data loss.
  2. nightly-security-scan.yml → Buildkite scheduled pipeline. Security scan results are observability-only; a duplicate run wastes a few minutes but does not affect prod.

Deliverables:

  1. .buildkite/pipelines/ci-digest-cron.yml and .buildkite/pipelines/nightly-security-scan.yml pipeline YAML files written.
  2. Buildkite pipelines configured for the same schedule (07:00 UTC daily for digest; 08:07 UTC daily for security scan).
  3. Both GHA and Buildkite versions run in parallel for 5+ business days.
  4. A comparison table of outputs: did Buildkite's digest fire? Did the security scan produce equivalent findings? Were there any secret-injection failures?

Success criteria:

  - Both parallel runs complete successfully for 5 consecutive days without operator intervention.
  - No duplicate Slack messages (Buildkite version is configured to post to a staging Slack channel during the parallel phase, not the ops channel).
  - Secret injection (Infisical hook) succeeds in every run.

Rollback: Delete Buildkite pipeline definitions. GHA versions never stopped running; nothing to restore.


Phase 2 — Per-service migration (post-launch, week 3-6)

Goal: Migrate CI pipelines service by service. Order: lowest-risk first.

Migration order and rationale:

1. Static sites (getraxx, mockups, status-page)
   Rationale: if these repos have been split per Plan A, they get fresh Buildkite
   pipelines from day one. No GHA-to-BK migration needed — the new repos never had
   GHA. If not yet split, migrate the GHA workflows for these paths first (lowest
   blast radius).

2. docs / internal tooling (daily-card-groomer, flag-drift-check, freescout-backup,
   billing-collector-cron, drift-orchestrator-cron, terraform-validate)
   Rationale: failures here are operational annoyances, not customer-facing incidents.

3. Console CI (ci-console.yml, deploy-console.yml, review-app-console.yml)
   Rationale: Console is operator-internal. A broken CI pass on Console does not
   affect Raptor or Antlers availability.

4. Velvet CI (deploy-velvet.yml)
   Rationale: Velvet is called by other services but has no customer-facing endpoint.
   Broken Velvet deploy CI does not take down Raptor.

5. Raptor CI (ci.yml, deploy-heroku.yml, synthetic-gate.yml, waf-synthetic-probe.yml)
   Rationale: Core API. Migrate last among backend services. Parallel-run for 2 full
   weeks before disabling GHA.

6. Antlers CI (pr-preview.yml, deploy-antlers.yml)
   Rationale: Customer-facing SPA. Parallel-run for 2 full weeks. PR preview deploy
   integration with GitHub requires the Buildkite GitHub App deployment feature.

7. Queue CI (deploy-queue.yml)
   Rationale: Queue owns identity/RBAC/customer data. Highest risk. Migrate last.
   Extra parallel-run period: 4 weeks before disabling GHA.

For each pipeline migration:

  1. Write .buildkite/pipelines/<name>.yml.
  2. Run parallel with GHA for the prescribed period.
  3. Update branch protection to add buildkite/<slug> as required check.
  4. Remove GHA workflow name from required checks.
  5. Disable GHA workflow file (rename to _disabled-<name>.yml; do not delete; keep for rollback reference for 90 days).

Key coupling to update during Phase 2:

Success criteria per pipeline:

  - Buildkite pipeline produces equivalent output to GHA pipeline for 10 consecutive runs.
  - No operator-intervention failures in parallel-run period.
  - Secret injection succeeds (confirmed via build log, not value echo).
  - Branch protection enforced on main via Buildkite check (not GHA) without any window of gap.
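
The "10 consecutive runs" criterion could be checked mechanically along these lines. This is a sketch under assumptions: the run records are a hypothetical chronological list of (gha_ok, buildkite_ok, outputs_match) tuples, and the streak is measured over the most recent runs, so any failure resets the count.

```python
def meets_parity(runs, required=10):
    """Return True if the latest `required` runs all passed on both systems with matching output.

    runs: chronological list of (gha_ok, buildkite_ok, outputs_match) tuples.
    """
    streak = 0
    for gha_ok, buildkite_ok, outputs_match in runs:
        if gha_ok and buildkite_ok and outputs_match:
            streak += 1
        else:
            streak = 0  # any failure or mismatch restarts the count
    return streak >= required
```

Measuring the trailing streak rather than the best streak means a pipeline that passed ten times last month but failed yesterday does not qualify.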

Rollback per pipeline: Re-enable the _disabled-<name>.yml GHA workflow. Restore branch protection required check to GHA workflow name. Buildkite pipeline can remain configured for future retry.


Phase 3 — Decommission (post-launch, week 6+)

Goal: Remove dead GHA workflows, update all documentation and runbooks.

Deliverables:

  1. Delete all _disabled-*.yml files from .github/workflows/ (after 90-day hold).
  2. Update docs/architecture/ci-notification-posture.md to reference Buildkite schedules instead of GHA cron triggers.
  3. Update docs/ops/runbooks/ for any runbook that references GHA workflow names.
  4. Update CLAUDE.md and agent personas to reference Buildkite pipeline slugs for any agent that dispatches CI.
  5. Remove GITHUB_TOKEN / GH_TOKEN references in console/app/services/ops_dispatch.py once the GHA dispatcher path is confirmed dead.
  6. Final audit: run grep -r "workflow_dispatch" . on the repo and confirm zero hits in application code (only in historical git log).

Success criteria:

  - Zero active GHA workflow files in .github/workflows/ (only _disabled- archived files pending final delete, or entirely removed).
  - All 34 equivalent Buildkite pipelines running cleanly for 4+ weeks.
  - Console Ops Dispatch surface successfully dispatches all registered actions via Buildkite API.
  - Branch protection on main enforced entirely by Buildkite checks.

Rollback: Full rollback at Phase 3 is high-cost (GHA workflows were deleted). This is why the 90-day hold on _disabled- files exists — they can be re-enabled quickly. The dispatch_factory() GHA path in ops_dispatch.py should not be deleted until Phase 3 is confirmed stable for 4+ weeks.


4. How Plans A and B Interact

Recommended sequencing:

Pre-launch (now to 2026-05-23)
  Plan A: Create empty repos for getraxx, mockups, status-page. README notes only.
  Plan B: Phase 0 — Provision agent VPS, install Buildkite agent, run hello-world
           pipeline, confirm vault bridge works.

Post-launch week 1-2 (2026-05-24 to 2026-06-06)
  Plan A: Execute git filter-repo extractions for getraxx, mockups, status-page.
           These new repos get fresh Buildkite pipelines from day one (no GHA to migrate).
  Plan B: Phase 1 — Parallel run of ci-digest-cron + nightly-security-scan.

Post-launch week 3-6 (2026-06-07 to 2026-07-04)
  Plan A: Package velvet/client as installable Python package.
           Extract frontend/trademaster_ui/ → raxx-app/antlers.
           Extract terraform/ → raxx-app/infra.
  Plan B: Phase 2 — Per-service pipeline migration (static → docs → console → velvet
           → raptor → antlers → queue).

Post-launch month 2-3 (2026-07+)
  Plan A: Evaluate Raptor + Console + Velvet + Queue split (likely stay in core mono).
  Plan B: Phase 3 — Decommission old GHA workflows.

The critical dependency: Do not migrate the Antlers CI pipeline to Buildkite (Plan B Phase 2, step 6) before the Antlers repo exists (Plan A post-launch week 3). If Antlers CI is migrated while still in the monorepo, the pipeline migration must be re-done when the repo moves. Do Plan A extraction first, then Plan B pipeline migration.

The safe order for everything else: Plan B Phase 0 is independent of Plan A. It can and should start immediately. The Buildkite agent on the VPS does not care whether the repo has been split or not — it checks out from wherever the Buildkite GitHub App points.
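
The ordering constraints in this section can be written down as a small dependency graph and checked with a topological sort. The step names below are hypothetical labels for the plan items; only the edges stated above are encoded (Phase 0 before Phase 1, and the Antlers extraction before the Antlers pipeline migration).

```python
from graphlib import TopologicalSorter

# Each key depends on the steps in its value set.
# Step names are illustrative labels for the plan items in this section.
DEPENDS_ON = {
    "plan_b_phase_0": set(),  # independent of Plan A
    "plan_a_extract_static_sites": set(),
    "plan_b_phase_1": {"plan_b_phase_0"},
    "plan_a_extract_antlers": set(),
    "plan_b_migrate_antlers_ci": {"plan_a_extract_antlers", "plan_b_phase_1"},
}


def execution_order():
    """Return one valid ordering of the steps; raises CycleError if the plan is contradictory."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

Any ordering this returns places the Antlers extraction before the Antlers CI migration, which is the critical dependency called out above.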


5. Risks and Dependencies

Risk | Likelihood | Impact | Mitigation
git filter-repo rewrites commit SHAs for extracted files | Certain (it always does) | Med | Document before extraction; close all open PRs touching those directories first; update any external references to old SHAs.
Agent VPS becomes a single point of failure for all CI | Med | High | Add a second agent on the same VPS after Phase 0 validation. Buildkite free tier permits multiple agents. For Queue CI (highest value), consider two agents.
Velvet client packaging breaks existing import paths in Raptor/Console/Queue | Low-Med | High | Pin the new package version in requirements.txt before deleting the old velvet/client/ import path. Run full integration test suite before removing the old path.
Buildkite SaaS controller outage blocks all CI | Low | High | Keep GHA workflows alive in _disabled- state for 90 days post-migration. Restore takes under 5 minutes per workflow.
Console Ops Dispatch Buildkite adapter introduces a regression | Med | Med | Gate behind FLAG_CONSOLE_OPS_DISPATCH_BUILDKITE; keep GHA dispatcher as live fallback. Flag off by default until fully validated.
Infisical environment hook fails silently, exposing missing secrets at build time | Low | High | Add a pre-command hook that asserts all expected env vars are non-empty and exits 1 if any are missing. Fail fast, loudly.
Hetzner VPS buildkite-agent process exits on reboot | Low | Med | Install buildkite-agent as a systemd service with Restart=always. Test reboot in Phase 0.
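
The fail-fast mitigation for the Infisical hook could look roughly like the following. The real pre-command hook would be a shell script on the agent; this sketch only shows the check it would delegate to. missing_vars and check_or_exit are hypothetical names, and the list of required variables is left to the caller.

```python
import os
import sys


def missing_vars(required, environ=None):
    """Return the names from `required` that are unset, empty, or whitespace-only."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


def check_or_exit(required):
    """Exit with status 1 and a loud message if any required variable is missing."""
    missing = missing_vars(required)
    if missing:
        # Only names are printed, never values.
        print("missing required env vars: " + ", ".join(missing), file=sys.stderr)
        sys.exit(1)
```

Treating whitespace-only values as missing catches the case where the export step produced a key with an empty value rather than failing outright.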

6. Open Questions for Operator

The one question that blocks sub-cards:

Q1 (blocks all Plan B sub-cards): Have you logged into the Buildkite dashboard and confirmed the organization name and the org slug? The BUILDKITE_AGENT_TOKEN and BUILDKITE_API_TOKEN in vault are only useful if we know which org they belong to. Pipeline YAML, API calls, and the GitHub App installation all require the org slug. Please confirm the Buildkite org slug (e.g., raxx) so sub-cards can reference it.

Lower-priority questions (do not block Phase 0):


7. Out of Scope (not a decision in this doc)