Repo Split Strategy + Buildkite Migration Plan
Date: 2026-05-13 UTC
Author: architect-agent
Status: Proposal (operator decision required before any action)
Related: PR #1978 (CI migration candidate filtering), PR #1979 (BLR financial/licensing research)
Not a commitment. Operator + BLR + architect-review decides after reading this document.
1. Executive Summary
The monorepo raxx-app/TradeMasterAPI currently holds seven distinct deployable
services (Raptor, Antlers, Console, Velvet, Queue, two static sites), nine Terraform
roots, and 34 GitHub Actions workflows in a single flat tree. The operator has
identified this as a development-speed drag and is evaluating a split. This document
evaluates three repo strategies — stay monorepo, true polyrepo, or git submodules —
and recommends a targeted polyrepo split along service deployment boundaries, deferred
partially to post-launch. Separately, the operator has minted a Buildkite API key
(stored at /MooseQuest/buildkite/ in Infisical) and confirmed post-launch cutover
intent; a phased Buildkite migration plan (Phase 0-3) is documented here. The two
plans interact: repo split should happen before or concurrent with CI migration, not
after, to avoid migrating pipelines you will later need to re-home to new repos. The
single question the operator must answer before sub-cards can be claimed is below in
Section 6.
2. Plan A — Repo Strategy
2.1 Current state inventory
Top-level layout of TradeMasterAPI/:
| Path | Service | Deploy target |
|---|---|---|
| backend_v2/ | Raptor: Flask API (Python) | Heroku: raxx-api-{prod,staging} |
| console/ | Console: operator console (Python) | Heroku: raxx-console-{prod,staging} |
| velvet/ | Velvet: token rotation (Python) | Heroku: raxx-velvet-{prod,staging} |
| queue/ | Queue: identity/RBAC/customers (C++, CMake) | Heroku: (queue deploy workflow) |
| frontend/trademaster_ui/ | Antlers: SPA (React/CRA) | CF Pages: raxx-app |
| frontend/status-page/ | Status page (static HTML) | CF Pages: status.raxx.app |
| frontend/getraxx-landing/ | getraxx.com marketing site | CF Pages: getraxx.com |
| mockups-site/ | Product mockups | CF Pages: mockups.raxx.app |
| terraform/ | 9 infra roots | n/a |
| docs/ | Architecture, ops, legal, etc. | n/a |
| scripts/ | CI helpers, vault helpers | n/a |
| .github/workflows/ | 34 workflows | n/a |
| .claude/agents/ | Agent personas + project config | n/a |
Cross-service coupling today:
| Coupling point | Services involved | Notes |
|---|---|---|
| Console calls GitHub workflow_dispatch | Console → GHA | Would need to call Buildkite API or per-repo GHA after a split |
| Console GET /api/internal/deploys polls GitHub run status | Console → GHA | Same coupling; needs updating per CI migration |
| Velvet client library consumed by Raptor, Console, Queue | Velvet → all three | velvet/client/ directory imported by sibling services |
| HMAC deploy callback endpoint | Console ← GHA → Heroku | Webhook receiver is Console-side; emitter is GHA workflow |
| Shared scripts (scripts/ci/, scripts/agents/) | All workflows | 34 workflows import these via path |
| CLAUDE.md / .claude/agents/ | Agent fleet | Agent personas shared across all services |
| Terraform TF state cross-references | terraform roots | Some modules reference sibling roots |
2.2 Option matrix
Option 1 — Stay monorepo
Description: No structural change. Address the pain points inside the single repo
(better CODEOWNERS partitioning, per-path CI filtering, sparse checkouts).
Honest assessment of benefits:
- Atomic cross-service commits remain trivially easy (single PR for a Velvet client
API change + Raptor consumer update).
- Single CI cache layer; all 34 workflows share actions/cache entries.
- One CODEOWNERS file; one branch protection ruleset; one GitHub App token.
- Agent fleet never resolves inter-repo permission boundaries.
- The shared velvet/client/ library stays on a single import path with no package
versioning overhead.
Developer ergonomics: Good for cross-service work. Poor for service-scoped
work because git clone pulls 7 services + terraform + docs. IDE symbol resolution
traverses the entire tree regardless of which service you are editing.
CI complexity: 34 workflows can be path-gated; most are already. Main cost is that every workflow merge lands in one git history.
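As an illustration of what per-path gating amounts to, the sketch below maps a list of changed files to the services that need CI. The path prefixes come from the inventory in 2.1 and the Velvet client fan-out comes from the coupling table; the service labels and the function name are hypothetical and do not correspond to any existing workflow or script.

```python
# Path prefixes from the 2.1 inventory. The labels on the right are
# illustrative only; they are not names used by existing workflows.
SERVICE_PREFIXES = {
    "backend_v2/": "raptor",
    "console/": "console",
    "velvet/": "velvet",
    "queue/": "queue",
    "frontend/trademaster_ui/": "antlers",
    "frontend/status-page/": "status-page",
    "frontend/getraxx-landing/": "getraxx",
    "mockups-site/": "mockups",
    "terraform/": "infra",
}

# The coupling table says velvet/client/ is consumed by these services,
# so a change there should also trigger their CI.
VELVET_CLIENT_PREFIX = "velvet/client/"
VELVET_CLIENT_CONSUMERS = {"raptor", "console", "queue"}


def services_for_changes(changed_paths):
    """Return the sorted list of services affected by the changed files."""
    affected = set()
    for path in changed_paths:
        for prefix, service in SERVICE_PREFIXES.items():
            if path.startswith(prefix):
                affected.add(service)
        if path.startswith(VELVET_CLIENT_PREFIX):
            affected |= VELVET_CLIENT_CONSUMERS
    return sorted(affected)
```

A docs-only change returns an empty list, which is the case where every service pipeline can be skipped.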
Cross-service refactor cost: Lowest of all options (single PR, single test run).
Onboarding: Simplest for a solo operator + agent fleet. One URL, one clone, one token. Every agent already knows the layout.
Migration cost: Zero.
Reversibility: Trivial — nothing to undo.
Verdict on the operator's stated pain: The monorepo is not the cause of development
speed drag. The drag is: (a) the broken ci.yml YAML parse error burning 35 allocations
per day since 2026-05-06, (b) two GHA workflows sharing the name "CI" making branch
protection semantics unreliable, and (c) the absence of per-service test isolation
boundaries. These are all fixable inside the monorepo without a structural change.
Score: Ergonomics 3/5 | CI complexity 2/5 | Cross-service refactor 5/5 | Onboarding 5/5 | Agent tooling 5/5 | Migration cost 5/5 | Reversibility 5/5
Option 2 — True polyrepo
Description: Each service becomes an independent GitHub repository under the
raxx-app org. The monorepo is archived or repurposed as a meta-repo for docs only.
Proposed repo layout:
| Repo | Contents | Deploy target |
|---|---|---|
| raxx-app/raptor | backend_v2/ | Heroku raxx-api-* |
| raxx-app/antlers | frontend/trademaster_ui/ | CF Pages raxx-app |
| raxx-app/console | console/ | Heroku raxx-console-* |
| raxx-app/velvet | velvet/ | Heroku raxx-velvet-* |
| raxx-app/queue | queue/ | Heroku (queue deploys) |
| raxx-app/getraxx | frontend/getraxx-landing/ | CF Pages getraxx.com |
| raxx-app/status-page | frontend/status-page/, frontend/status-worker/ | CF Pages status.raxx.app |
| raxx-app/mockups | mockups-site/ | CF Pages mockups.raxx.app |
| raxx-app/infra | terraform/, scripts/ | n/a (operator-run) |
| raxx-app/docs | docs/, .claude/agents/, CLAUDE.md | n/a |
Developer ergonomics: Excellent for single-service work. Poor for cross-service work — a Velvet client API change now requires coordinating two PRs in two repos with ordering constraints (velvet PR lands first, then raptor/console PRs reference the new version). For a solo operator with an AI agent fleet, cross-repo coordination is a meaningful friction multiplier.
CI complexity (with Buildkite as target): Each repo gets its own pipeline. Independent branch protection. No path-gating needed. But every agent dispatch now requires knowing which repo to target.
Cross-service refactor cost: High. Velvet client versioning alone requires either (a) a package registry (PyPI private or GitHub Packages) with SemVer pins, or (b) git submodule references to velvet inside raptor/console/queue. Neither is free.
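To make the SemVer-pin overhead concrete, here is a minimal sketch of the kind of drift check a polyrepo would need once the client is a versioned package. It assumes exact `name==X.Y.Z` pins in each consumer's requirements file; the package name velvet-client is one of the candidate names discussed in this document, and the helper names are hypothetical.

```python
import re

# Matches exact pins such as "velvet-client==0.1.0". Other specifier
# styles (>=, ~=, extras) are deliberately ignored in this sketch.
PIN_RE = re.compile(r"^\s*([A-Za-z0-9_.-]+)\s*==\s*(\d+)\.(\d+)\.(\d+)\s*$")


def parse_pin(line):
    """Parse 'name==X.Y.Z' into (name, (X, Y, Z)); return None otherwise."""
    match = PIN_RE.match(line)
    if not match:
        return None
    name, major, minor, patch = match.groups()
    return name.lower(), (int(major), int(minor), int(patch))


def pins_for(package, requirements_by_service):
    """Collect the pinned version of `package` for each service."""
    pins = {}
    for service, text in requirements_by_service.items():
        for line in text.splitlines():
            parsed = parse_pin(line)
            if parsed and parsed[0] == package:
                pins[service] = parsed[1]
    return pins


def drifted(pins):
    """Return services whose pin is behind the newest pinned version."""
    if not pins:
        return []
    newest = max(pins.values())
    return sorted(s for s, version in pins.items() if version != newest)
```

In the monorepo this check is unnecessary because every consumer imports the same working tree; in a polyrepo something like it has to run somewhere on every client release.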
Onboarding: Harder. A new engineer (or agent) must understand the repo map before writing a single line.
Migration cost: High. 34 workflows split across 9 repos; branch protection rules re-configured per-repo; GitHub secrets duplicated (or migrated to an org-level store); CODEOWNERS re-written. Minimum 2-3 weeks of mechanical migration work.
Reversibility: Hard. Once history is split across repos, recombining requires
git subtree merge with significant manual conflict resolution. The 2026-04-25 incident
(#330) showed how destructive an accidental base-branch error can be; a polyrepo split
multiplies the surface area for this class of error.
Agent tooling: The current agent fleet (software-architect, feature-developer, etc.) operates with a single worktree isolation model tied to one repo. A polyrepo requires either (a) per-repo worktrees across all agents, or (b) the operator dispatching agents per-repo explicitly. This is a real friction increase for the current agent-driven development model.
Score: Ergonomics 4/5 (single service) / 2/5 (cross-service) | CI complexity 3/5 | Cross-service refactor 2/5 | Onboarding 2/5 | Agent tooling 2/5 | Migration cost 1/5 | Reversibility 1/5
Option 3 — Git submodules / git subtree
Description: A top-level "super-repo" references each service as a git submodule. Each submodule has its own repo and CI. The super-repo pins the submodule refs.
Developer ergonomics: Worst of all options for this team. Git submodules are infamously confusing even for experienced engineers. The "submodule pinning ceremony" (updating the parent commit every time a child changes) adds manual steps that agents do not handle well. Git subtree is safer to use but has its own rebasing complexity.
CI complexity: Very high. Parent CI must understand submodule state. PRs in a submodule do not automatically trigger parent CI. State coherence across the super-repo and all submodule repos requires additional tooling.
Cross-service refactor cost: Same as polyrepo — worse ergonomically because of submodule ceremony.
Onboarding: Hardest. Submodules are a known developer experience trap.
Migration cost: High. Comparable to polyrepo plus submodule ceremony overhead.
Reversibility: Hard.
Score: Ergonomics 1/5 | CI complexity 1/5 | Cross-service refactor 2/5 | Onboarding 1/5 | Agent tooling 1/5 | Migration cost 2/5 | Reversibility 2/5
2.3 Recommendation: Targeted partial polyrepo (Option 2, scoped)
Do not do a full polyrepo split. Do not do submodules.
The recommendation is a targeted partial split of services that have zero shared
code with siblings and are already independently deployable. The services with meaningful
internal coupling (Raptor + Velvet + Console + Queue, which share the Velvet client and
a common scripts/agents/ layer) stay in the monorepo until the Velvet client is
properly packaged.
Split now (pre-launch safe, low risk):
| Repo | Contents | Rationale |
|---|---|---|
| raxx-app/getraxx | frontend/getraxx-landing/ | Zero cross-service coupling. Purely static HTML/JS. Has its own deploy workflow (deploy-getraxx.yml). Splitting removes marketing site churn from service-CI noise. |
| raxx-app/mockups | mockups-site/ | Zero cross-service coupling. Static. Has its own deploy workflow (deploy-mockups.yml). Safe to move at any time. |
| raxx-app/status-page | frontend/status-page/, frontend/status-worker/, frontend/artifacts/ | Zero coupling to Raptor/Antlers. Independent deploy. Separating it means status-page deploys cannot accidentally break Antlers CI. |
Split post-launch (after v1 stabilizes, in tandem with CI migration):
| Repo | Contents | Blocker before split |
|---|---|---|
| raxx-app/infra | terraform/ | Terraform workflows (terraform-validate.yml, per ADR-0082 forthcoming per-root workflows) need to move. Do this when Buildkite pipelines are established per Plan B. |
| raxx-app/antlers | frontend/trademaster_ui/ | Antlers is standalone today (no velvet/client import). Blocked only by migrating deploy-antlers.yml + pr-preview.yml to the new repo. Do this in CI migration Phase 2. |
| raxx-app/docs | docs/, .claude/agents/ | Move after agent persona update. Agents need to be retargeted to the new docs repo for architecture/ADR commits. Lowest urgency, since docs do not have a deploy pipeline. |
Stay in monorepo indefinitely (until post-launch re-evaluation):
backend_v2/ (Raptor), console/, velvet/, queue/: these four share the velvet/client/ import path and the scripts/agents/ layer. Splitting them requires packaging velvet/client as a proper Python package with a registry. That is a post-v1 task.
Why this is better than a full split:
- Removes the marketing/static-site noise from service CI immediately (the most tangible ergonomics win).
- Does not force Velvet client versioning before launch.
- The agent fleet's worktree isolation model remains valid for the core services.
- The infrastructure repo split (post-launch) lands naturally alongside the CI migration — Buildkite per-repo pipelines are the right time to restructure infra CI.
2.4 Pre-launch starter steps (safe now)
These actions have zero risk to running services and can be done before 2026-05-23:
- Create raxx-app/getraxx repo. Empty, with README and placeholder CI workflow. No code moves yet; this just establishes the destination. Copy org-level secrets as needed.
- Create raxx-app/mockups repo. Same as above.
- Create raxx-app/status-page repo. Same as above.
- Do NOT move code yet. The git filter-repo run that extracts history is a post-launch operation. The empty repos are just scaffolding.
- Update frontend/getraxx-landing/README.md with a note: "This directory is slated to move to raxx-app/getraxx post-launch. Do not add cross-service imports."
Why defer the actual git filter-repo run to post-launch: A history extraction on
a live repo with active CI is a coordination hazard. During the pre-launch sprint, any
PR that touches these directories would need to be coordinated with the extraction.
Better to freeze the boundary with a README note and do the clean cut after launch.
2.5 Migration sequencing (post-launch)
Week 1 post-launch (2026-05-24+)
Extract frontend/getraxx-landing/ → raxx-app/getraxx (git filter-repo)
Extract mockups-site/ → raxx-app/mockups (git filter-repo)
Extract frontend/status-page/ + frontend/status-worker/ → raxx-app/status-page (git filter-repo)
Update deploy-getraxx.yml, deploy-mockups.yml, deploy-status-page.yml to point
at new repo; remove old paths from monorepo CI path filters.
Archive monorepo copies (move to _archived/ prefix, do not delete).
Week 3-6 post-launch (concurrent with Buildkite Phase 2)
Package velvet/client/ as an installable Python package (velvet-client on GitHub Packages).
Pin version in raptor/console/queue requirements.
Then: extract frontend/trademaster_ui/ → raxx-app/antlers (with its own CI pipeline).
Then: extract terraform/ → raxx-app/infra (after Buildkite pipelines for terraform are live).
Post-launch re-evaluation (month 2-3)
Assess whether Raptor + Console + Velvet + Queue warrant further splits.
Recommended: keep the four service directories in one "core" monorepo
(rename to raxx-app/core or raxx-app/platform) rather than splitting into four repos.
The internal coupling does not dissolve just because the directory structure changes.
Rollback: git filter-repo rewrites the clone it runs in, so each extraction must run on a
fresh clone; the source directories in the monorepo are then left untouched. Rollback =
stop using the new repo and restore CI path filters to the original directory. No
monorepo history is lost.
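The week-1 extractions can be sketched as a small command builder. This assumes git filter-repo's `--path` option, which keeps only the named paths and their history, and it only constructs the argument lists; nothing is executed. The `EXTRACTIONS` mapping mirrors the week-1 list above, and the function name is hypothetical.

```python
import shlex

# Destination repo -> monorepo paths to keep, per the week-1 plan.
EXTRACTIONS = {
    "raxx-app/getraxx": ["frontend/getraxx-landing/"],
    "raxx-app/mockups": ["mockups-site/"],
    "raxx-app/status-page": [
        "frontend/status-page/",
        "frontend/status-worker/",
    ],
}


def filter_repo_args(paths):
    """Build the git filter-repo argument list that keeps only `paths`."""
    args = ["git", "filter-repo"]
    for path in paths:
        args += ["--path", path]
    return args


def extraction_plan():
    """Return {destination_repo: shell-quoted command} for review."""
    return {
        repo: shlex.join(filter_repo_args(paths))
        for repo, paths in EXTRACTIONS.items()
    }
```

Each command is meant to run inside its own fresh clone of the monorepo, which is then pushed to the destination repo. Printing the plan first gives the operator something to review before any history is rewritten.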
3. Plan B — Buildkite Migration
3.1 Vault inventory: /MooseQuest/buildkite/
The Infisical CLI requires a project token for interactive secret reads; the operator
has the key at vault.raxx.app. From the operator's 2026-05-13 UTC briefing, the vault
path /MooseQuest/buildkite/ is confirmed to contain at minimum:
- BUILDKITE_AGENT_TOKEN: the agent registration token for a provisioned Buildkite organization. This is the token a self-hosted agent binary presents to the Buildkite SaaS controller at startup. It is per-cluster/per-queue.
- BUILDKITE_API_TOKEN: a REST API token for the Buildkite organization. Used to trigger builds, read pipeline status, and manage pipelines programmatically. This is the token the Console Ops Dispatch surface would use when calling the Buildkite API instead of GitHub's workflow_dispatch.
What this tells us: A Buildkite organization has already been provisioned (you cannot generate an AGENT_TOKEN without an org). Phase 0 plumbing is further along than a blank-slate start.
What is not yet confirmed in vault: Whether a pipeline has been created, whether
the agent has ever been run, and whether the Buildkite GitHub App is installed on
raxx-app/TradeMasterAPI.
3.2 Design question answers
Where does the Buildkite agent live?
Self-hosted on a dedicated small VPS (recommended: Hetzner CX22, ~$5/mo, 2 vCPU /
4 GB RAM / 40 GB SSD). Rationale: Buildkite's free tier (1 user, 1 agent) requires
operator-owned compute. A Hetzner VPS at $5/mo is the lowest-cost option that gives a
persistent agent (not an on-demand EC2 instance that cold-starts per build). The C++
CMake Queue build is CPU-intensive and benefits from a warm runner that caches its
build/ directory between runs. Heroku one-off dynos are unsuitable — they cannot
persist a Buildkite agent daemon. A local laptop is unsuitable — the agent must be
available for scheduled cron pipelines at any time.
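At post-launch scale, if build queue wait time becomes an issue, add a second Buildkite agent on the same VPS or scale to a CX32 ($10/mo). This assumes the plan in use permits more than one concurrent agent; the free-tier figure cited above is 1 user and 1 agent, so confirm the current limit before relying on a second agent.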
How does GitHub trigger Buildkite?
Buildkite's official GitHub App (available at https://github.com/apps/buildkite).
This is the correct integration path — it posts commit statuses back to GitHub PRs,
creates GitHub Deployments for deploy pipelines, and supports the branch-based trigger
model. Do not use webhooks or polling. The GitHub App is first-class and supported.
The GitHub App must be installed on each GitHub repo that needs Buildkite CI. For the
partial polyrepo model (Plan A), that means:
- raxx-app/TradeMasterAPI (core services, during Phase 1-2)
- raxx-app/getraxx, raxx-app/mockups, raxx-app/status-page (when those repos exist
in Phase 2)
How do Buildkite secrets bind to Infisical?
A Buildkite agent hook, not Buildkite's native secret store. The recommended pattern:
- Install the infisical CLI on the agent VPS at provisioning time.
- Write a Buildkite environment hook at /etc/buildkite-agent/hooks/environment that calls infisical export --path /MooseQuest/<service> --format dotenv and sources the output into the build environment.
- The Infisical machine identity token for the VPS is stored in /etc/buildkite-agent/infisical-token with mode 0600, owned by the buildkite-agent user.
- AWS SSM secrets (workload secrets per memory feedback_aws_workloads_use_ssm_not_vault) are fetched via aws ssm get-parameter in the same hook, using an IAM instance profile attached to the VPS (if EC2) or an IAM user key stored in Infisical (if Hetzner VPS).
This keeps secrets out of Buildkite's SaaS layer entirely. Pipeline YAML never contains secret values. The approach is compatible with the "no stored credentials" invariant because the Infisical token is rotatable independently of any pipeline code.
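The hook's core logic (read dotenv output, then refuse to continue if anything expected is missing) can be sketched as follows. The real hook would be a shell script; this Python version only illustrates the parsing and the fail-fast check. The exact quoting that the Infisical CLI emits is an assumption here, and the function and variable names are hypothetical.

```python
def parse_dotenv(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and comments.

    Assumes simple dotenv output: one assignment per line, values
    optionally wrapped in matching single or double quotes.
    """
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def missing_secrets(env, required):
    """Return the required names that are absent or empty, sorted."""
    return sorted(name for name in required if not env.get(name))
```

A hook built on this would exit non-zero when `missing_secrets` returns anything, and would log only the names, never the values.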
How does the Console Ops Dispatch panel talk to Buildkite?
The console/app/services/ops_dispatch.py service currently calls GitHub's
POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches REST endpoint.
At migration time, this is replaced with a call to Buildkite's
POST https://api.buildkite.com/v2/organizations/{org}/pipelines/{pipeline_slug}/builds
endpoint, authenticated via BUILDKITE_API_TOKEN (already in vault).
The OpsDispatchLog model and the existing WORKFLOW_REGISTRY pattern are retained —
only the HTTP call destination changes. The dispatch_factory() function in
ops_dispatch.py is extended with a Buildkite dispatcher implementation that the
feature-developer swaps in via feature flag (FLAG_CONSOLE_OPS_DISPATCH_BUILDKITE).
GHA dispatch path stays live as fallback until the flag is promoted.
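A rough sketch of the flag-gated swap is shown below. The two endpoint URLs are the ones quoted above, and the Buildkite request body uses the `commit`, `branch`, and `env` fields of its create-build API. Everything else is an assumption for illustration: the class names, the `build_request` method, and the `dispatch_factory` signature are hypothetical, since the real ops_dispatch.py is not reproduced in this document. The sketch only builds requests; it sends nothing.

```python
from dataclasses import dataclass


@dataclass
class DispatchRequest:
    """A prepared HTTP request; sending it is left to the caller."""
    method: str
    url: str
    headers: dict
    json: dict


class GitHubDispatcher:
    def __init__(self, owner, repo, token):
        self.owner, self.repo, self.token = owner, repo, token

    def build_request(self, target, ref, inputs):
        # `target` is the workflow id or file name.
        url = (f"https://api.github.com/repos/{self.owner}/{self.repo}"
               f"/actions/workflows/{target}/dispatches")
        headers = {"Authorization": f"Bearer {self.token}",
                   "Accept": "application/vnd.github+json"}
        return DispatchRequest("POST", url, headers,
                               {"ref": ref, "inputs": inputs})


class BuildkiteDispatcher:
    def __init__(self, org, token):
        self.org, self.token = org, token

    def build_request(self, target, ref, inputs):
        # `target` is the pipeline slug; inputs travel as build env vars.
        url = (f"https://api.buildkite.com/v2/organizations/{self.org}"
               f"/pipelines/{target}/builds")
        headers = {"Authorization": f"Bearer {self.token}"}
        body = {"commit": "HEAD", "branch": ref, "env": inputs}
        return DispatchRequest("POST", url, headers, body)


def dispatch_factory(flags, *, github, buildkite):
    """Pick the dispatcher; GHA stays the default until the flag is on."""
    if flags.get("FLAG_CONSOLE_OPS_DISPATCH_BUILDKITE"):
        return buildkite
    return github
```

Because both dispatchers share one method shape, the OpsDispatchLog write and the WORKFLOW_REGISTRY lookup can stay unchanged around the call site.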
How does the deploy modal read state from Buildkite?
Two-way integration:
- Outbound (Console → Buildkite): POST .../builds returns the Buildkite build number and URL immediately. Console stores this as the github_run_id equivalent (rename the field to ci_run_id / ci_run_url in a migration to avoid the GitHub-specific naming). The UI polls GET /api/internal/deploys/<id>, which proxies GET https://api.buildkite.com/v2/.../builds/<number> for live status.
- Inbound (Buildkite → Console): Configure a Buildkite webhook (notification service in the pipeline settings) to POST to https://console.raxx.app/api/internal/deploys/<id>/status, using the existing DEPLOY_CALLBACK_HMAC_SECRET for webhook signature verification. The existing HMAC callback receiver in console/app/blueprints/deploys.py handles this with minor payload adaptation (Buildkite webhook shape differs from the GHA shape; a thin adapter translates field names).
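The thin adapter on the inbound path might look roughly like this. It assumes the webhook body carries a `build` object with `number`, `state`, and `web_url`, as Buildkite build events do, and it uses a plain HMAC-SHA256 over the raw body as a stand-in for the signature check. The exact header and signing scheme Buildkite uses should be confirmed against its webhook documentation before relying on this. The internal status names in `STATE_MAP` are hypothetical, since Console's real vocabulary is not shown in this document.

```python
import hashlib
import hmac
import json

# Buildkite build state -> hypothetical Console status name.
STATE_MAP = {
    "scheduled": "queued",
    "running": "in_progress",
    "passed": "success",
    "failed": "failure",
    "canceled": "cancelled",
}


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the expected and received signature."""
    return hmac.compare_digest(sign(secret, body), signature_hex)


def adapt_buildkite_payload(body: bytes) -> dict:
    """Translate a Buildkite build event into the Console callback shape."""
    build = json.loads(body)["build"]
    return {
        "ci_run_id": build["number"],
        "ci_run_url": build["web_url"],
        "status": STATE_MAP.get(build["state"], "unknown"),
    }
```

Verification has to run on the raw bytes before any JSON parsing, so the receiver should read the body once and pass the same bytes to both functions.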
How does Branch Protection work when checks come from Buildkite?
Buildkite's GitHub App posts status checks as buildkite/<pipeline-slug> context
names on each commit. GitHub branch protection rules are updated to require these
contexts instead of GHA workflow names. Migration procedure per workflow:
- Run Buildkite pipeline in parallel with GHA for 2 weeks.
- Confirm Buildkite consistently posts the expected status.
- Update branch protection to add buildkite/<slug> as required check.
- Remove GHA workflow name from required checks.
- After 1 additional week, disable the GHA workflow.
This is a per-pipeline migration that avoids any window where branch protection is unenforced.
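The add-then-remove ordering is what prevents an unenforced window, and it can be expressed as two small list transformations. This is a sketch over a plain list of required check contexts; the function names are hypothetical and no GitHub API call is made.

```python
def add_buildkite_check(required, slug):
    """Return the required checks with buildkite/<slug> appended once."""
    context = f"buildkite/{slug}"
    if context in required:
        return list(required)
    return [*required, context]


def drop_gha_check(required, gha_name, slug):
    """Remove the GHA check, but only once the Buildkite check is required.

    Raising here enforces the ordering: removing the GHA check first
    would leave the branch with no required check for this pipeline.
    """
    if f"buildkite/{slug}" not in required:
        raise ValueError("add the Buildkite check before removing GHA")
    return [check for check in required if check != gha_name]
```

For example, starting from a single required check named after a GHA workflow, the first call yields both checks and the second yields only the `buildkite/<slug>` context.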
3.3 Phased migration plan
Phase 0 — Pre-launch scaffolding (now through 2026-05-23 UTC)
Goal: Verify the plumbing works before launch. No prod traffic is involved. No GHA workflows are disabled.
Deliverables:
- Vault inventory confirmation. Operator reads /MooseQuest/buildkite/ in Infisical and confirms BUILDKITE_AGENT_TOKEN and BUILDKITE_API_TOKEN are present and scoped to the right Buildkite org.
- Agent VPS provisioned. A Hetzner CX22 (or equivalent low-cost VPS) with:
  - buildkite-agent installed and registered to the org using BUILDKITE_AGENT_TOKEN.
  - infisical CLI installed.
  - /etc/buildkite-agent/hooks/environment hook written.
  - Agent confirmed connected in the Buildkite dashboard.
- GitHub App installed. Buildkite GitHub App installed on raxx-app/TradeMasterAPI (and any empty pre-split repos created per the Plan A pre-launch starter steps in 2.4).
- One trivial pipeline created. A "hello world" pipeline in Buildkite that:
  - Is triggered manually from the Buildkite UI.
  - Runs echo "agent connected; infisical version: $(infisical --version)" on the agent VPS.
  - Posts a green status check on a test commit in a scratch branch.
  - Is NOT in the branch protection required checks.
- Vault bridge script tested. The Infisical environment hook fetches at least one non-sensitive secret (e.g., INFISICAL_PROJECT_ID) and confirms its presence in the build log (a redacted confirmation, not the value itself).
Success criteria:
- Agent shows connected in Buildkite dashboard.
- Hello-world pipeline runs to completion on the agent VPS.
- Buildkite posts a green status to a scratch commit on GitHub.
- Infisical hook executes without error in the build environment.
Rollback: Delete the VPS. Remove the GitHub App from the repo. No GHA changes were made; nothing reverts.
Phase 1 — Parallel run (post-launch, week 1-2 after 2026-05-23)
Goal: Mirror 2 low-risk pipelines from GHA to Buildkite. Both run simultaneously. No branch protection changes yet.
Candidate pipelines (lowest risk first):
- ci-digest-cron.yml → Buildkite scheduled pipeline. This workflow sends a daily Slack digest. A parallel Buildkite version can be disabled if it double-sends; the risk is noise, not data loss.
- nightly-security-scan.yml → Buildkite scheduled pipeline. Security scan results are observability-only; a duplicate run wastes a few minutes but does not affect prod.
Deliverables:
- .buildkite/pipelines/ci-digest-cron.yml and .buildkite/pipelines/nightly-security-scan.yml pipeline YAML files written.
- Buildkite pipelines configured for the same schedule (07:00 UTC daily for digest; 08:07 UTC daily for security scan).
- Both GHA and Buildkite versions run in parallel for 5+ business days.
- A comparison table of outputs: did Buildkite's digest fire? Did the security scan produce equivalent findings? Were there any secret-injection failures?
Success criteria:
- Both parallel runs complete successfully for 5 consecutive days without operator intervention.
- No duplicate Slack messages (Buildkite version is configured to post to a staging Slack channel during the parallel phase, not the ops channel).
- Secret injection (Infisical hook) succeeds in every run.
Rollback: Delete Buildkite pipeline definitions. GHA versions never stopped running; nothing to restore.
Phase 2 — Per-service migration (post-launch, week 2-6)
Goal: Migrate CI pipelines service by service. Order: lowest-risk first.
Migration order and rationale:
1. Static sites (getraxx, mockups, status-page)
Rationale: if these repos have been split per Plan A, they get fresh Buildkite
pipelines from day one. No GHA-to-BK migration needed — the new repos never had
GHA. If not yet split, migrate the GHA workflows for these paths first (lowest
blast radius).
2. docs / internal tooling (daily-card-groomer, flag-drift-check, freescout-backup,
billing-collector-cron, drift-orchestrator-cron, terraform-validate)
Rationale: failures here are operational annoyances, not customer-facing incidents.
3. Console CI (ci-console.yml, deploy-console.yml, review-app-console.yml)
Rationale: Console is operator-internal. A broken CI pass on Console does not
affect Raptor or Antlers availability.
4. Velvet CI (deploy-velvet.yml)
Rationale: Velvet is called by other services but has no customer-facing endpoint.
Broken Velvet deploy CI does not take down Raptor.
5. Raptor CI (ci.yml, deploy-heroku.yml, synthetic-gate.yml, waf-synthetic-probe.yml)
Rationale: Core API. Migrate last among backend services. Parallel-run for 2 full
weeks before disabling GHA.
6. Antlers CI (pr-preview.yml, deploy-antlers.yml)
Rationale: Customer-facing SPA. Parallel-run for 2 full weeks. PR preview deploy
integration with GitHub requires the Buildkite GitHub App deployment feature.
7. Queue CI (deploy-queue.yml)
Rationale: Queue owns identity/RBAC/customer data. Highest risk. Migrate last.
Extra parallel-run period: 4 weeks before disabling GHA.
For each pipeline migration:
1. Write .buildkite/pipelines/<name>.yml.
2. Run parallel with GHA for the prescribed period.
3. Update branch protection to add buildkite/<slug> as required check.
4. Remove GHA workflow name from required checks.
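5. Disable the GHA workflow: rename the file to _disabled-<name>.yml and also turn the
workflow off (for example with gh workflow disable), because GitHub still loads any
.yml file in .github/workflows/ regardless of its name. Do not delete the file; keep
it for rollback reference for 90 days.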
Key coupling to update during Phase 2:
- console/app/services/ops_dispatch.py: extend dispatch_factory() with a Buildkite dispatcher. Gate behind FLAG_CONSOLE_OPS_DISPATCH_BUILDKITE.
- console/app/blueprints/deploys.py: add a Buildkite webhook adapter for the /api/internal/deploys/<id>/status callback. Gate behind FLAG_CONSOLE_DEPLOY_BUILDKITE_CALLBACK.
- Branch protection rules: update via gh api for each pipeline as it completes parallel-run validation.
Success criteria per pipeline:
- Buildkite pipeline produces equivalent output to GHA pipeline for 10 consecutive runs.
- No operator-intervention failures in parallel-run period.
- Secret injection succeeds (confirmed via build log, not value echo).
- Branch protection enforced on main via Buildkite check (not GHA) without any window of gap.
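The "10 consecutive equivalent runs" criterion can be checked mechanically. The sketch below reduces "equivalent output" to "both systems agree on pass or fail", which is a simplification; comparing artifacts or logs is out of scope here. It uses the GHA conclusion value `success` and the Buildkite state `passed` as the passing outcomes. The function names and the paired-run input shape are hypothetical.

```python
# The value each system reports for a passing run.
PASSING = {"gha": "success", "buildkite": "passed"}


def consecutive_matches(pairs):
    """Count trailing runs where GHA and Buildkite agree on pass/fail.

    `pairs` is a list of (gha_conclusion, buildkite_state) tuples,
    oldest first, one tuple per commit built by both systems.
    """
    count = 0
    for gha_conclusion, buildkite_state in reversed(pairs):
        gha_ok = gha_conclusion == PASSING["gha"]
        bk_ok = buildkite_state == PASSING["buildkite"]
        if gha_ok != bk_ok:
            break
        count += 1
    return count


def ready_for_cutover(pairs, required=10):
    """True once the most recent `required` paired runs all agree."""
    return consecutive_matches(pairs) >= required
```

A run where both systems fail still counts as agreement, since a legitimately red commit should be red in both; a single disagreement resets the streak.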
Rollback per pipeline: Re-enable the _disabled-<name>.yml GHA workflow. Restore
branch protection required check to GHA workflow name. Buildkite pipeline can remain
configured for future retry.
Phase 3 — Decommission (post-launch, week 6+)
Goal: Remove dead GHA workflows, update all documentation and runbooks.
Deliverables:
1. Delete all _disabled-*.yml files from .github/workflows/ (after 90-day hold).
2. Update docs/architecture/ci-notification-posture.md to reference Buildkite
schedules instead of GHA cron triggers.
3. Update docs/ops/runbooks/ for any runbook that references GHA workflow names.
4. Update CLAUDE.md and agent personas to reference Buildkite pipeline slugs for
any agent that dispatches CI.
5. Remove GITHUB_TOKEN / GH_TOKEN references in console/app/services/ops_dispatch.py
once the GHA dispatcher path is confirmed dead.
6. Final audit: run grep -r "workflow_dispatch" . on the repo and confirm zero hits
in application code (only in historical git log).
Success criteria:
- Zero active GHA workflow files in .github/workflows/ (only _disabled- archived
files pending final delete, or entirely removed).
- All 34 equivalent Buildkite pipelines running cleanly for 4+ weeks.
- Console Ops Dispatch surface successfully dispatches all registered actions via
Buildkite API.
- Branch protection on main enforced entirely by Buildkite checks.
Rollback: Full rollback at Phase 3 is high-cost (GHA workflows were deleted).
This is why the 90-day hold on _disabled- files exists — they can be re-enabled
quickly. The dispatch_factory() GHA path in ops_dispatch.py should not be deleted
until Phase 3 is confirmed stable for 4+ weeks.
4. How Plans A and B Interact
Recommended sequencing:
Pre-launch (now to 2026-05-23)
Plan A: Create empty repos for getraxx, mockups, status-page. README notes only.
Plan B: Phase 0 — Provision agent VPS, install Buildkite agent, run hello-world
pipeline, confirm vault bridge works.
Post-launch week 1-2 (2026-05-24 to 2026-06-06)
Plan A: Execute git filter-repo extractions for getraxx, mockups, status-page.
These new repos get fresh Buildkite pipelines from day one (no GHA to migrate).
Plan B: Phase 1 — Parallel run of ci-digest-cron + nightly-security-scan.
Post-launch week 3-6 (2026-06-07 to 2026-07-04)
Plan A: Package velvet/client as installable Python package.
Extract frontend/trademaster_ui/ → raxx-app/antlers.
Extract terraform/ → raxx-app/infra.
Plan B: Phase 2 — Per-service pipeline migration (static → docs → console → velvet
→ raptor → antlers → queue).
Post-launch month 2-3 (2026-07+)
Plan A: Evaluate Raptor + Console + Velvet + Queue split (likely stay in core mono).
Plan B: Phase 3 — Decommission old GHA workflows.
The critical dependency: Do not migrate the Antlers CI pipeline to Buildkite (Plan B Phase 2, step 6) before the Antlers repo exists (Plan A post-launch week 3). If Antlers CI is migrated while still in the monorepo, the pipeline migration must be re-done when the repo moves. Do Plan A extraction first, then Plan B pipeline migration.
The safe order for everything else: Plan B Phase 0 is independent of Plan A. It can and should start immediately. The Buildkite agent on the VPS does not care whether the repo has been split or not — it checks out from wherever the Buildkite GitHub App points.
5. Risks and Dependencies
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| git filter-repo rewrites commit SHAs for extracted files | Certain (it always does) | Med | Document before extraction; close all open PRs touching those directories first; update any external references to old SHAs. |
| Agent VPS becomes a single point of failure for all CI | Med | High | Add a second agent after Phase 0 validation, on a separate host where possible: a second agent on the same VPS adds throughput but does not survive a host failure. Confirm the plan's agent limit first. For Queue CI (highest value), consider two agents. |
| Velvet client packaging breaks existing import paths in Raptor/Console/Queue | Low-Med | High | Pin the new package version in requirements.txt before deleting the old velvet/client/ import path. Run full integration test suite before removing the old path. |
| Buildkite SaaS controller outage blocks all CI | Low | High | Keep GHA workflows alive in _disabled- state for 90 days post-migration. Restore takes under 5 minutes per workflow. |
| Console Ops Dispatch Buildkite adapter introduces a regression | Med | Med | Gate behind FLAG_CONSOLE_OPS_DISPATCH_BUILDKITE; keep GHA dispatcher as live fallback. Flag off by default until fully validated. |
| Infisical environment hook fails silently, exposing missing secrets at build time | Low | High | Add a pre-command hook that asserts all expected env vars are non-empty and exits 1 if any are missing. Fail fast, loudly. |
| Hetzner VPS buildkite-agent process does not come back after a reboot | Low | Med | Install buildkite-agent as an enabled systemd service with Restart=always. Test reboot in Phase 0. |
6. Open Questions for Operator
The one question that blocks sub-cards:
Q1 (blocks all Plan B sub-cards): Have you logged into the Buildkite dashboard and confirmed the organization name and the org slug? The BUILDKITE_AGENT_TOKEN and BUILDKITE_API_TOKEN in vault are only useful if we know which org they belong to. Pipeline YAML, API calls, and the GitHub App installation all require the org slug. Please confirm the Buildkite org slug (e.g., raxx) so sub-cards can reference it.
Lower-priority questions (do not block Phase 0):
- Q2 (Plan A, getraxx/mockups/status-page split timing): Do you want the three empty repos created this week as pre-launch scaffolding, or do you prefer to defer all repo work until after 2026-05-23? The risk of creating empty repos now is essentially zero, but it is an operator call.
- Q3 (Plan A, Velvet client packaging): Packaging velvet/client/ as a proper Python package (GitHub Packages or PyPI private) requires naming and versioning decisions. The name raxx-velvet-client or velvet-client both work; the version can start at 0.1.0. Is there a preference, or should feature-developer choose?
- Q4 (Plan B, compute): The recommendation is a Hetzner CX22 VPS for the Buildkite agent. Do you have an existing Hetzner account, or would you prefer a different low-cost VPS provider (DigitalOcean, Linode/Akamai, AWS Lightsail)? The agent setup is identical across all of them; this is only a billing account question.
7. Out of Scope (not a decision in this doc)
- Buildkite vs Ubicloud final choice — BLR PR #1979 established Buildkite as the operator's selected candidate post the financial/licensing review. This doc treats that as a settled decision and plans the Buildkite path. If the operator reverses to Ubicloud, Plan B is irrelevant (Ubicloud requires zero pipeline migration; see PR #1978 Section 7).
- Full polyrepo split of Raptor/Console/Velvet/Queue — this document explicitly recommends against it pre-v1 and defers the evaluation to post-launch month 2-3.
- Migrating secrets from Infisical to Buildkite's native secret store — not recommended. Infisical remains the source of truth; the Buildkite agent hooks are a read path only.