
ADR 0050 — Trunk-based SDLC affirmed; Gitflow rejected; hardening plan for drift and revert friction

Status: Accepted
Date: 2026-05-06 UTC
Refs: ADR-0020, ADR-0028, ADR-0035, docs/architecture/branch-promotion-strategy.md, operator complaint 2026-05-06

TL;DR

Trunk-based stays. Gitflow is rejected — long-lived branches worsen the drift problem rather than solving it, and they multiply merge cost on a solo-operator + parallel-agent fleet. The observed friction (favicon revert, card-details-popup missing from prod, staging/prod flag lag) traces to three specific gaps: no required-reviewer gate on the production Environment (toggle available since #1202 shipped the runbook), no post-merge production-state checklist to catch static asset and flag divergence before the operator notices it live, and no dashboard surface showing which flags are staging-on / prod-off. These gaps are closed by the hardening plan in this ADR — not by switching branching models.

1. Context

Raxx is operated by a solo founder (Kristerpher), with a fleet of parallel agents whose PRs all merge to a single main branch. As of 2026-05-06, the operator reports:

Occasional reverts of apparently-merged features (e.g. favicon suspected lost in a rebase).
A card-details popup verified on staging that has not reached production, stalled for several days.
Flag state lagging between staging and prod with no at-a-glance view of the pending delta.
A production deploy path (deploy-console.yml workflow_dispatch) that recently returned a 502, leaving prod state ambiguous.

The question raised: would Gitflow reduce this friction?

Raxx-specific context that matters for this decision

One trunk, one release stream. There is no parallel major-version maintenance. There is no release-team gating process. There is no long-lived develop branch accumulating features before a batch release.
Agent PRs branch from main and merge to main. This is enforced policy (feedback_pr_base_main.md). It is the correct policy. Any branching model that requires agents to target a different base creates coordination complexity and violates the established incident-driven rule (incidents #330 and #457).
Parallel agent work is the norm, not the exception. Multiple feature branches are in-flight simultaneously, each owned by a separate agent session. The concurrency-cancel CI pattern is already correct (feedback_pr_cancelled_checks_are_duplicates.md).
Staging and prod are structurally separate Heroku apps (not deploy slots of a single app). The promotion path is: merge to main → staging auto-deploys → operator approves prod gate. This is the correct model per ADR-0020 and ADR-0028.

2. Invariants

These apply to this ADR and any sub-cards it produces:

Audit trail for every prod state change. Any hardening step that touches prod must leave a record: who triggered, what SHA, at what timestamp.
No new long-lived branches. Branching from main for short-lived feature work is correct; persistent environment branches are not.
Agent PR branches must base from origin/main. The base_branch_lint CI job enforces this. Nothing in this ADR relaxes it.
Flag flips are a separate action from code deploys. ADR-0035 defines the promotion queue. The hardening plan below does not collapse them.
No prod auto-deploy. Every prod state change (code or flag) requires a deliberate operator action. This preserves ADR-0028 intentional friction.
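
As an illustration of the audit-trail invariant, here is a minimal sketch of a record that captures who triggered a prod change, what SHA, and at what timestamp. The class name `ProdChangeRecord`, its field names, the `kind` values, and the JSON-line serialization are all hypothetical; this ADR only requires that the three facts are recorded somewhere.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProdChangeRecord:
    """Hypothetical audit record: who triggered, what SHA, at what timestamp."""

    actor: str    # identity of whoever approved or triggered the change
    sha: str      # commit SHA that prod moved to (or that the flag flip relates to)
    kind: str     # "code" or "flag"; the two stay separate actions per ADR-0035
    at: datetime  # must be timezone-aware so records compare unambiguously

    def __post_init__(self) -> None:
        if not self.actor or not self.sha:
            raise ValueError("actor and sha are required")
        if self.kind not in ("code", "flag"):
            raise ValueError("kind must be 'code' or 'flag'")
        if self.at.tzinfo is None:
            raise ValueError("timestamp must be timezone-aware")

    def to_json_line(self) -> str:
        """Serialize to one JSON line with the timestamp normalized to UTC."""
        payload = asdict(self)
        payload["at"] = self.at.astimezone(timezone.utc).isoformat()
        return json.dumps(payload, sort_keys=True)
```

Rejecting naive timestamps at construction time is a design choice in this sketch: it keeps a record with an ambiguous time zone from ever entering the trail.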

3. Decision

Retain trunk-based development. Reject Gitflow.

Harden the trunk model with four targeted fixes that close the diagnosed root causes.

4. Gitflow rejection

Decision matrix

Gitflow benefit	Real cost in Raxx's situation	Failure mode this creates
`develop` branch isolates in-progress work from `main`	All agent PRs already branch from `main` and are individually gated by CI. There is no un-gated code accumulating on `main`.	Agents targeting `develop` instead of `main` violates `feedback_pr_base_main.md` and incidents #330/#457. PRs would need re-targeting for every session.
Long-lived `release/*` branches allow last-minute stabilization	Raxx has one release stream and no release-team. Release-please already owns the tag. A `release/*` branch gives the operator a second branch to watch with no additional safety signal.	Merge conflicts between `develop` → `release/` and `release/` → `main` compound the drift problem; two extra merge surfaces for the same code.
`hotfix/*` branches isolate prod fixes from in-progress work	The hotfix path under trunk-based is: merge to `main`, approve gate immediately. This takes minutes. A Gitflow `hotfix/*` branch is cut from `main` and must then be merged back into both `main` and `develop`; either back-merge can conflict.	Back-merge conflicts delay hotfixes. On a solo+agent setup, the agent that produced the hotfix is likely not the agent that opens the backport PR. Cross-agent backport coordination is a new failure surface.
Explicit integration gate via `develop` → `release` PR	CI already gates every PR to `main`. Staging auto-deploys on merge and provides an integration environment. Adding a `develop` → `release` PR gate would be a third merge event for code already tested twice.	Increases merge count per feature by 3x with no new information (the CI gate and staging soak are already happening).
Visual separation of what's "ready" vs "in progress"	GitHub PR labels (`status: staging-verified`, `status: ready-for-prod`) and the flag promotion queue (ADR-0035) provide this without branch topology changes.	Branch proliferation that agents do not clean up becomes stale branch noise in the remote. Each stale branch is a merge conflict surface for the next agent session.

Summary: Gitflow's benefits exist for multi-team, multi-release-stream organizations where main would otherwise be unusable as a stable base. None of those conditions hold for Raxx. Every Gitflow benefit Raxx would get is already provided by the existing CI gate + staging soak + ADR-0028 friction model. The costs are all real and specific to the solo+agent setup.

5. Root cause diagnosis

The reported friction does not trace to the branching model. It traces to three distinct gaps in the existing trunk model, plus one contributing factor in agent workflow:

Gap 1 — Required-reviewer gate on `production` Environment is not yet toggled

ADR-0020 selected the tag + environment approval gate model. #1202 shipped the runbook documenting exactly how to enable it. The toggle has not been set. Without the gate, staging-verified code can reach prod without a deliberate operator checkpoint between "staging looks good" and "prod updated." The favicon revert is consistent with a staging deploy overwriting a prod state that was never explicitly confirmed by the operator.

Root cause: configuration gap, not a branching-model gap.

Gap 2 — No post-merge production-state checklist

After a PR merges and staging deploys, there is no structured step that asks: "Did this change affect static assets? Does prod still have the right asset hashes? Are any flags that are staging-on but prod-off blocking this feature?" The card-details popup stall is consistent with a feature that passed staging CI but depends on a flag that was never promoted to prod (ADR-0035 describes exactly this failure mode).

Root cause: process gap, not a branching-model gap.

Gap 3 — No at-a-glance flag-drift surface

The operator cannot see, at a glance, how many and which flags are staging-on / prod-off. The promotion queue exists in the database (ADR-0035) but has no dashboard widget. Without this, the operator only learns of flag drift when a feature is visibly missing from prod.

Root cause: observability gap, not a branching-model gap.

Gap 4 — Agent rebase hygiene (contributing factor)

When multiple agents open PRs in parallel against main, a slow-moving PR's branch falls behind. If the agent that opened it does not rebase before the PR is merged, the merge commit may silently revert changes made by a concurrent PR that already landed. The favicon incident is consistent with this pattern.

Root cause: agent workflow discipline gap, not a branching-model gap. Gitflow does not fix this — it moves the conflict surface from main to develop, where it is equally possible and harder to observe.

6. Drift — measurable definition

Drift is defined as the aggregate gap between main's current state and production's observed state across three dimensions:

Dimension	Metric	Alert threshold
Code drift	Hours since last prod deploy vs. timestamp of last merge to `main`	> 48 hours without a prod deploy following any merge triggers a console warning
Flag drift	Count of flags where `staging_enabled = true` AND `prod_enabled = false`	> 0 flags pending promotion for > 24 hours triggers a console badge
Static asset drift	Count of files under `dist/` or CF Pages asset manifest where prod content-hash != main's last-built hash	> 0 divergent asset hashes after a prod deploy is a deploy verification failure

These three metrics are observable without Gitflow. A Gitflow develop branch adds a fourth metric (hours since develop was merged to release) without reducing the other three.
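
The three metrics can be sketched as small pure functions. This is an illustration under stated assumptions, not the console's implementation: the function names and input shapes (a list of merge timestamps, a mapping of flag name to pending-since time, and path-to-hash manifests) are hypothetical, and only the 48-hour and 24-hour thresholds come from the table above.

```python
from datetime import datetime, timedelta, timezone

# Thresholds from the drift table.
CODE_DRIFT_LIMIT = timedelta(hours=48)
FLAG_DRIFT_LIMIT = timedelta(hours=24)


def code_drift_warning(merge_times: list[datetime],
                       last_prod_deploy: datetime,
                       now: datetime) -> bool:
    """True when some merge to main has waited more than 48h for a prod deploy."""
    undeployed = [t for t in merge_times if t > last_prod_deploy]
    # The oldest undeployed merge determines how long prod has lagged.
    return bool(undeployed) and now - min(undeployed) > CODE_DRIFT_LIMIT


def flag_drift_badge(pending_since: dict[str, datetime], now: datetime) -> list[str]:
    """Names of flags that have been staging-on / prod-off for more than 24h."""
    return sorted(name for name, since in pending_since.items()
                  if now - since > FLAG_DRIFT_LIMIT)


def asset_drift_count(main_hashes: dict[str, str], prod_hashes: dict[str, str]) -> int:
    """Number of asset paths whose content hash differs between main's build and prod.

    A path present on only one side counts as divergent.
    """
    paths = main_hashes.keys() | prod_hashes.keys()
    return sum(1 for p in paths if main_hashes.get(p) != prod_hashes.get(p))
```

For example, a merge that landed 60 hours ago with no prod deploy since would trip `code_drift_warning`, while a merge from 2 hours ago alone would not.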

7. Hardening plan

H1 — Enable required-reviewer gate on `production` Environment

The runbook from #1202 documents the exact GitHub UI path. This is a settings toggle and requires no YAML changes. Effect: every prod deploy (code or flag) pauses for a deliberate approval from Kristerpher before proceeding.

Owner: operator (Kristerpher, settings toggle only). No sub-card needed — this is a one-minute action against the runbook in #1202.

H2 — Post-merge production-state checklist runbook

A runbook (not automated enforcement — a checklist the operator runs before approving the prod gate) that covers:

Static asset diff. Compare the CF Pages asset manifest for the staging alias vs. the production alias. Any file where the content-hash differs is a candidate for silent revert. Flag these before approving.
Flag-promotion-pending list. Open /flags/promotions in the console (or query console_flag_promotions WHERE state = 'pending'). Any feature that is staging-enabled but not prod-enabled must be a deliberate choice, not an oversight.
Env-var seed check. For any PR that adds a new env var, confirm the var is seeded in both the Heroku prod config and SSM (per feedback_aws_workloads_use_ssm_not_vault.md) before approving the deploy.
Deploy-status verify. After the prod deploy completes, confirm the Heroku release is green (heroku releases --app raxx-console-prod | head -3) and the smoke suite passes.

This runbook lives at docs/runbooks/post-merge-prod-checklist.md and is referenced from the prod deploy approval notification.
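
The checklist is a manual runbook, but parts of it could be supported by a small helper the operator runs before approving. The sketch below covers the env-var seed check and a summary of open blockers. Every name in it is hypothetical, and it assumes the operator has already collected the config key names from Heroku and SSM by whatever means the runbook prescribes; it does not replace the approval itself.

```python
def missing_env_vars(new_vars: set[str],
                     heroku_keys: set[str],
                     ssm_keys: set[str]) -> dict[str, list[str]]:
    """Map each config store to the newly added vars it does not yet contain."""
    gaps = {
        "heroku": sorted(new_vars - heroku_keys),
        "ssm": sorted(new_vars - ssm_keys),
    }
    # Only report stores that actually have a gap.
    return {store: names for store, names in gaps.items() if names}


def approval_blockers(divergent_assets: list[str],
                      pending_flags: list[str],
                      acknowledged_flags: set[str],
                      env_gaps: dict[str, list[str]]) -> list[str]:
    """Human-readable reasons to hold the prod approval; empty means none found."""
    blockers = []
    if divergent_assets:
        blockers.append("asset hashes differ: " + ", ".join(sorted(divergent_assets)))
    # A pending flag is acceptable only when the operator chose to leave it off.
    unreviewed = sorted(set(pending_flags) - acknowledged_flags)
    if unreviewed:
        blockers.append("flags pending without a decision: " + ", ".join(unreviewed))
    for store, names in sorted(env_gaps.items()):
        blockers.append(f"env vars not seeded in {store}: " + ", ".join(names))
    return blockers
```

Separating "pending" from "acknowledged" flags mirrors the checklist wording: a staging-only flag is fine when it is a deliberate choice, and a blocker only when nobody has looked at it.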

Sub-card needed: file a card for feature-developer to write the runbook and wire a link to it into the prod approval notification step.

H3 — Flag-promotion-pending dashboard widget

A console dashboard widget (read-only, no action in v1) that shows:

Count of flags where staging_enabled = true AND prod_enabled = false
List view: flag name, staging-enabled-since timestamp, soak elapsed, soak required
Badge on the console nav: orange dot when count > 0

This gives the operator an at-a-glance view of drift without navigating to /flags/promotions. It surfaces the "card-details popup never reached prod" class of failure before the operator notices it in the live product.
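
A minimal sketch of the read-only view model such a widget could render, assuming a hypothetical row shape for pending promotions (the actual `console_flag_promotions` columns are not listed in this ADR). It produces the count, the badge state, and the list rows with soak elapsed versus soak required.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PendingPromotion:
    """Hypothetical input row: a flag that is staging-on and prod-off."""
    flag_name: str
    staging_enabled_since: datetime
    soak_required: timedelta


@dataclass(frozen=True)
class WidgetRow:
    flag_name: str
    staging_enabled_since: datetime
    soak_elapsed: timedelta
    soak_required: timedelta
    soak_complete: bool


def build_widget(pending: list[PendingPromotion], now: datetime) -> dict:
    """Return count, badge state, and rows ordered oldest-pending first."""
    rows = []
    for p in sorted(pending, key=lambda item: item.staging_enabled_since):
        elapsed = now - p.staging_enabled_since
        rows.append(WidgetRow(
            flag_name=p.flag_name,
            staging_enabled_since=p.staging_enabled_since,
            soak_elapsed=elapsed,
            soak_required=p.soak_required,
            soak_complete=elapsed >= p.soak_required,
        ))
    # Badge rule from the list above: orange dot whenever anything is pending.
    return {"count": len(rows), "show_badge": len(rows) > 0, "rows": rows}
```

Ordering rows oldest-first is a choice made for this sketch so that the longest-stalled flag, such as the one behind a feature missing from prod, appears at the top.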

Sub-card needed: file a card for feature-developer to add the dashboard widget.

H4 — Agent rebase hygiene enforcement

Two changes:

Stale-branch guard in CI. A workflow step that fails the PR if the feature branch has fallen too far behind main at the time the PR is opened for review. Threshold: more than 10 commits or more than 48 hours behind main, whichever limit is hit first. This forces the agent (or operator) to rebase before the PR can merge.
Agent session convention. Each agent that opens a PR should rebase to origin/main immediately before pushing the final commit. This is already implicit in feedback_commit_agent_docs_immediately.md (commit before any rebase storm) but needs to be a named step in the agent workflow conventions.
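
The stale-branch guard could be sketched as follows. This is one possible reading, not the final workflow step: it treats the two limits as independent (the check fails if either is exceeded, matching the wording of card #N3), and it interprets "48 hours behind" as the age of the oldest commit on `origin/main` that the branch lacks. The function names and constants are hypothetical; the `git log` range and `%ct` format are standard git.

```python
import subprocess
from datetime import datetime, timedelta, timezone
from typing import Optional

# Starting thresholds from H4; expected to be tuned after Sprint 1.
MAX_COMMITS_BEHIND = 10
MAX_AGE_BEHIND = timedelta(hours=48)


def missing_commit_times(base_ref: str = "origin/main") -> list[datetime]:
    """Committer timestamps of commits on base_ref that HEAD does not contain."""
    out = subprocess.run(
        ["git", "log", "--format=%ct", f"HEAD..{base_ref}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [datetime.fromtimestamp(int(ts), tz=timezone.utc) for ts in out.split()]


def stale_reason(missing: list[datetime], now: datetime) -> Optional[str]:
    """Explain why the branch is stale, or return None if it is fresh enough."""
    if len(missing) > MAX_COMMITS_BEHIND:
        return f"{len(missing)} commits behind main (limit {MAX_COMMITS_BEHIND})"
    if missing and now - min(missing) > MAX_AGE_BEHIND:
        return "oldest commit missing from this branch is more than 48 hours old"
    return None
```

In a CI step, the result of `stale_reason(missing_commit_times(), datetime.now(timezone.utc))` could be passed to `sys.exit`, which exits non-zero with the message when the branch is stale and exits cleanly on `None`. The job would need `origin/main` fetched with enough history for the range to resolve.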

Sub-card needed: file a card for feature-developer to add the stale-branch CI guard workflow step.

8. Rollout plan

Phase	What	Gate
Immediate	Kristerpher enables required-reviewer gate on `production` Environment per #1202 runbook	Operator action only — no PR needed
Sprint 1	H2: post-merge prod checklist runbook + link in approval notification	Sub-card #N1
Sprint 1	H3: flag-promotion-pending dashboard widget	Sub-card #N2
Sprint 1	H4: stale-branch CI guard	Sub-card #N3
After Sprint 1	Measure: code drift metric, flag drift count, asset hash divergence	Console dashboard or Slack report

9. Security considerations

The required-reviewer gate (H1) records the approver's GitHub identity and timestamp. This satisfies the audit trail invariant for prod state changes.
The post-merge checklist (H2) adds no new credential surface. It reads from existing Heroku CLI output and console UI — both already require authenticated sessions.
The dashboard widget (H3) exposes flag names and soak timestamps. Flag names are internal configuration identifiers, not PII. No retention or GDPR implications.
The stale-branch guard (H4) is a CI check only. It does not access secrets or user data.
None of these hardening steps store credentials, collect PII, or touch the order execution path.

10. Open questions

None block implementation. One for Kristerpher's awareness:

Stale-branch threshold. The 10-commit / 48-hour threshold in H4 is a starting point. If the agent fleet is producing PRs faster than expected, 10 commits may be too tight and cause excessive forced-rebases. Adjust after Sprint 1 observability.

11. Alternatives considered

Gitflow

Fully analyzed in §4. Rejected. Long-lived branches worsen the drift problem, multiply merge cost, and add cross-agent coordination complexity without providing any safety property that trunk-based + the H1–H4 hardening plan does not already provide.

GitLab Flow with `production` branch

A long-lived production branch (GitLab Flow variant) was evaluated in docs/architecture/branch-promotion-strategy.md as Option A and rejected there. The reasoning holds here: it adds a branch-protection ceremony and a second merge event per release with no additional safety signal beyond what the approval gate (H1) already provides. Specifically: agent PRs targeting main would need to be manually re-promoted to production. That is the exact coordination step that is missing today, and adding a branch makes it harder, not easier, to see what is pending.

Automated soak timer (branch-promotion-strategy Option C)

Evaluated and rejected in ADR-0020. Still rejected here. Automated prod deploys without a human checkpoint are not appropriate for the current pre-launch posture. The operator's approval is the signal; a timer is not a substitute.

Action items (sub-cards to file)

These are scoped for feature-developer. Do not claim until card-groomer has processed them.

Card	Title	Depends on
#N1	Write post-merge production-state checklist runbook + wire link into prod approval notification	#1202 (runbook already exists for deploy gate)
#N2	Console dashboard: flag-promotion-pending widget (count badge + list view)	ADR-0035, `console_flag_promotions` table (#552)
#N3	CI: stale-branch guard — fail PR if branch is > N commits or 48h behind `main`	None

Operator action (not a card): enable the required-reviewer gate on the production GitHub Environment per the runbook in #1202. This takes under two minutes and unblocks H1 immediately.

Auto-generated from docs/ in raxx-app/TradeMasterAPI. Gated behind Cloudflare Access. Re-deployed on every push to main.