ADR 0122 — Trunk-based SDLC affirmed; Gitflow rejected; hardening plan for drift and revert friction
Status: Accepted
Date: 2026-05-06 UTC
Refs: ADR-0020, ADR-0028, ADR-0035, docs/architecture/branch-promotion-strategy.md, operator complaint 2026-05-06
TL;DR
Trunk-based stays. Gitflow is rejected — long-lived branches worsen the drift problem rather than solving it, and they multiply merge cost on a solo-operator + parallel-agent fleet. The observed friction (favicon revert, card-details-popup missing from prod, staging/prod flag lag) traces to three specific gaps: no required-reviewer gate on the production Environment (toggle available since #1202 shipped the runbook), no post-merge production-state checklist to catch static asset and flag divergence before the operator notices it live, and no dashboard surface showing which flags are staging-on / prod-off. These gaps are closed by the hardening plan in this ADR — not by switching branching models.
1. Context
Raxx is operated by a solo founder (Kristerpher) with a fleet of parallel agent-produced PRs merging to a single main branch. As of 2026-05-06, the operator reports:
- Occasional reverts of apparently-merged features (e.g. favicon suspected lost in a rebase).
- A card-details popup verified on staging that has not reached production, stalled for several days.
- Flag state lagging between staging and prod with no at-a-glance view of the pending delta.
- A production deploy path (`deploy-console.yml` `workflow_dispatch`) that recently returned a 502, leaving prod state ambiguous.
The question raised: would Gitflow reduce this friction?
Raxx-specific context that matters for this decision
- One trunk, one release stream. There is no parallel major-version maintenance. There is no release-team gating process. There is no long-lived `develop` branch accumulating features before a batch release.
- Agent PRs branch from `main` and merge to `main`. This is enforced policy (`feedback_pr_base_main.md`). It is the correct policy. Any branching model that requires agents to target a different base creates coordination complexity and violates the established incident-driven rule (incidents #330 and #457).
- Parallel agent work is the norm, not the exception. Multiple feature branches are in-flight simultaneously, each owned by a separate agent session. The concurrency-cancel CI pattern is already correct (`feedback_pr_cancelled_checks_are_duplicates.md`).
- Staging and prod are structurally separate Heroku apps (not deploy slots of a single app). The promotion path is: merge to `main` → staging auto-deploys → operator approves prod gate. This is the correct model per ADR-0020 and ADR-0028.
2. Invariants
These apply to this ADR and any sub-cards it produces:
- Audit trail for every prod state change. Any hardening step that touches prod must leave a record: who triggered, what SHA, at what timestamp.
- No new long-lived branches. Branching from `main` for short-lived feature work is correct; persistent environment branches are not.
- Agent PR branches must base from `origin/main`. The `base_branch_lint` CI job enforces this. Nothing in this ADR relaxes it.
- Flag flips are a separate action from code deploys. ADR-0035 defines the promotion queue. The hardening plan below does not collapse them.
- No prod auto-deploy. Every prod state change (code or flag) requires a deliberate operator action. This preserves ADR-0028 intentional friction.
3. Decision
Retain trunk-based development. Reject Gitflow.
Harden the trunk model with four targeted fixes that close the diagnosed root causes.
4. Gitflow rejection
Decision matrix
| Gitflow benefit | Real cost in Raxx's situation | Failure mode this creates |
|---|---|---|
| develop branch isolates in-progress work from main | All agent PRs already branch from main and are individually gated by CI. There is no un-gated code accumulating on main. | Agents targeting develop instead of main violates feedback_pr_base_main.md and incidents #330/#457. PRs would need re-targeting for every session. |
| Long-lived release/* branches allow last-minute stabilization | Raxx has one release stream and no release-team. Release-please already owns the tag. A release/* branch gives the operator a second branch to watch with no additional safety signal. | Merge conflicts between develop → release/* and release/* → main compound the drift problem; two extra merge surfaces for the same code. |
| hotfix/* branches isolate prod fixes from in-progress work | The hotfix path under trunk-based is: merge to main, approve gate immediately. This takes minutes. Gitflow hotfix branches require cherry-picks from main → develop and from main → the hotfix branch, both of which can conflict. | Cherry-pick conflicts delay hotfixes. On a solo+agent setup, the agent that produced the hotfix is likely not the agent that opened the backport PR. Cross-agent cherry-pick coordination is a new failure surface. |
| Explicit integration gate via develop → release PR | CI already gates every PR to main. Staging auto-deploys on merge and provides an integration environment. Adding a develop → release PR gate would be a third merge event for code already tested twice. | Increases merge count per feature by 3x with no new information (the CI gate and staging soak are already happening). |
| Visual separation of what's "ready" vs "in progress" | GitHub PR labels (status: staging-verified, status: ready-for-prod) and the flag promotion queue (ADR-0035) provide this without branch topology changes. | Branch proliferation that agents do not clean up becomes stale branch noise in the remote. Each stale branch is a merge conflict surface for the next agent session. |
Summary: Gitflow's benefits exist for multi-team, multi-release-stream organizations where main would otherwise be unusable as a stable base. None of those conditions hold for Raxx. Every Gitflow benefit Raxx would get is already provided by the existing CI gate + staging soak + ADR-0028 friction model. The costs are all real and specific to the solo+agent setup.
5. Root cause diagnosis
The reported friction does not trace to the branching model. It traces to three distinct gaps in the existing trunk model, plus one contributing factor:
Gap 1 — Required-reviewer gate on production Environment is not yet toggled
ADR-0020 selected the tag + environment approval gate model. #1202 shipped the runbook documenting exactly how to enable it. The toggle has not been set. Without the gate, staging-verified code can reach prod without a deliberate operator checkpoint between "staging looks good" and "prod updated." The favicon revert is consistent with a staging deploy overwriting a prod state that was never explicitly confirmed by the operator.
Root cause: configuration gap, not a branching-model gap.
Gap 2 — No post-merge production-state checklist
After a PR merges and staging deploys, there is no structured step that asks: "Did this change affect static assets? Does prod still have the right asset hashes? Are any flags that are staging-on but prod-off blocking this feature?" The card-details popup stall is consistent with a feature that passed staging CI but depends on a flag that was never promoted to prod (ADR-0035 describes exactly this failure mode).
Root cause: process gap, not a branching-model gap.
Gap 3 — No at-a-glance flag-drift surface
The operator cannot see, at a glance, how many and which flags are staging-on / prod-off. The promotion queue exists in the database (ADR-0035) but has no dashboard widget. Without this, the operator only learns of flag drift when a feature is visibly missing from prod.
Root cause: observability gap, not a branching-model gap.
Gap 4 — Agent rebase hygiene (contributing factor)
When multiple agents open PRs in parallel against main, a slow-moving PR's branch falls behind. If the agent that opened it does not rebase before the PR is merged, the merge commit may silently revert changes made by a concurrent PR that already landed. The favicon incident is consistent with this pattern.
Root cause: agent workflow discipline gap, not a branching-model gap. Gitflow does not fix this — it moves the conflict surface from main to develop, where it is equally possible and harder to observe.
6. Drift — measurable definition
Drift is defined as the aggregate gap between main's current state and production's observed state across three dimensions:
| Dimension | Metric | Alert threshold |
|---|---|---|
| Code drift | Hours since last prod deploy vs. timestamp of last merge to main | > 48 hours without a prod deploy following any merge triggers a console warning |
| Flag drift | Count of flags where staging_enabled = true AND prod_enabled = false | > 0 flags pending promotion for > 24 hours triggers a console badge |
| Static asset drift | Count of files under dist/ or CF Pages asset manifest where prod content-hash != main's last-built hash | > 0 divergent asset hashes after a prod deploy is a deploy verification failure |
These three metrics are observable without Gitflow. A Gitflow develop branch adds a fourth metric (hours since develop was merged to release) without reducing the other three.
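As a rough illustration of how the three thresholds in the table could be evaluated together, here is a minimal Python sketch. It is not the console's implementation. The function name `evaluate_drift`, the `DriftStatus` type, the input shapes, and the flag name used in the usage note are all hypothetical. It also assumes that "hours without a prod deploy following any merge" is measured from the oldest merge to main that prod has not yet received.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

CODE_DRIFT_LIMIT = timedelta(hours=48)    # table: > 48 hours without a prod deploy
FLAG_PENDING_LIMIT = timedelta(hours=24)  # table: pending promotion for > 24 hours


@dataclass
class DriftStatus:
    code_warning: bool   # console warning
    flag_badge: bool     # console badge
    asset_failure: bool  # deploy verification failure


def evaluate_drift(
    now: datetime,
    oldest_undeployed_merge: Optional[datetime],
    pending_flags: Dict[str, datetime],
    divergent_asset_count: int,
) -> DriftStatus:
    """Apply the three alert thresholds to already-measured inputs.

    oldest_undeployed_merge: merge time of the oldest commit on main that prod
        does not have yet, or None if prod is up to date (assumed input).
    pending_flags: flag name -> time it became staging-on while prod-off.
    divergent_asset_count: assets whose prod hash differs from main's build.
    """
    code_warning = (
        oldest_undeployed_merge is not None
        and now - oldest_undeployed_merge > CODE_DRIFT_LIMIT
    )
    flag_badge = any(
        now - since > FLAG_PENDING_LIMIT for since in pending_flags.values()
    )
    asset_failure = divergent_asset_count > 0
    return DriftStatus(code_warning, flag_badge, asset_failure)
```

For example, calling `evaluate_drift` with a merge that has waited 49 hours and a hypothetical flag `card_details_popup` pending for 30 hours would set both `code_warning` and `flag_badge`.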
7. Hardening plan
H1 — Enable required-reviewer gate on production Environment
The runbook from #1202 documents the exact GitHub UI path. This is a settings toggle, zero YAML changes. Effect: every prod deploy (code or flag) pauses for a deliberate Kristerpher approval before proceeding.
Owner: operator (Kristerpher, settings toggle only). No sub-card needed — this is a one-minute action against the runbook in #1202.
H2 — Post-merge production-state checklist runbook
A runbook (not automated enforcement — a checklist the operator runs before approving the prod gate) that covers:
- Static asset diff. Compare the CF Pages asset manifest for the staging alias vs. the production alias. Any file where the content-hash differs is a candidate for silent revert. Flag these before approving.
- Flag-promotion-pending list. Open `/flags/promotions` in the console (or query `console_flag_promotions WHERE state = 'pending'`). Any feature that is staging-enabled but not prod-enabled must be a deliberate choice, not an oversight.
- Env-var seed check. For any PR that adds a new env var, confirm the var is seeded in both the Heroku prod config and SSM (per `feedback_aws_workloads_use_ssm_not_vault.md`) before approving the deploy.
- Deploy-status verify. After the prod deploy completes, confirm the Heroku release is green (`heroku releases --app raxx-console-prod | head -3`) and the smoke suite passes.
This runbook lives at docs/runbooks/post-merge-prod-checklist.md and is referenced from the prod deploy approval notification.
Sub-card needed: file a card for feature-developer to write the runbook and wire a link to it into the prod approval notification step.
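The static asset diff step of the checklist could be scripted along the following lines. This is a sketch only. It assumes each manifest has already been fetched and parsed into a plain mapping from asset path to content hash. The function name `diff_asset_manifests` and the example paths in the usage note are hypothetical, and how the CF Pages manifests are retrieved is out of scope here.

```python
from typing import Dict, List


def diff_asset_manifests(
    staging: Dict[str, str], production: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare two {asset path: content hash} manifests.

    Returns the paths whose hashes differ, plus the paths that exist on only
    one side. Any non-empty list is worth a look before approving the gate.
    """
    changed = sorted(
        path
        for path in staging.keys() & production.keys()
        if staging[path] != production[path]
    )
    only_staging = sorted(staging.keys() - production.keys())
    only_production = sorted(production.keys() - staging.keys())
    return {
        "changed": changed,
        "only_staging": only_staging,
        "only_production": only_production,
    }
```

With a staging manifest of `{"favicon.ico": "a1", "app.js": "b2"}` and a production manifest of `{"favicon.ico": "zz", "old.css": "c3"}`, the result lists `favicon.ico` under `changed`, `app.js` under `only_staging`, and `old.css` under `only_production`.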
H3 — Flag-promotion-pending dashboard widget
A console dashboard widget (read-only, no action in v1) that shows:
- Count of flags in `staging_enabled = true AND prod_enabled = false`
- List view: flag name, staging-enabled-since timestamp, soak elapsed, soak required
- Badge on the console nav: orange dot when count > 0
This gives the operator an at-a-glance view of drift without navigating to /flags/promotions. It surfaces the "card-details popup never reached prod" class of failure before the operator notices it in the live product.
Sub-card needed: file a card for feature-developer to add the dashboard widget.
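One possible shape for the widget's read-only payload is sketched below. The `PendingFlag` type, its field names, and the `widget_payload` function are illustrative assumptions, not the schema of `console_flag_promotions`. The sketch only shows how the count, the badge state, and the list-view columns named above could be derived from a set of pending flags.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List


@dataclass
class PendingFlag:
    # Hypothetical row shape: a flag that is staging-on and prod-off.
    name: str
    staging_enabled_since: datetime
    soak_required: timedelta


def widget_payload(pending: List[PendingFlag], now: datetime) -> Dict[str, object]:
    """Build the count, nav badge state, and list-view rows for the widget."""
    rows = []
    # Oldest pending flag first, so the longest-waiting promotion is on top.
    for flag in sorted(pending, key=lambda f: f.staging_enabled_since):
        elapsed = now - flag.staging_enabled_since
        rows.append(
            {
                "flag": flag.name,
                "staging_enabled_since": flag.staging_enabled_since.isoformat(),
                "soak_elapsed_hours": round(elapsed.total_seconds() / 3600, 1),
                "soak_required_hours": round(
                    flag.soak_required.total_seconds() / 3600, 1
                ),
            }
        )
    # Orange dot on the console nav when anything is pending.
    return {"count": len(rows), "show_badge": len(rows) > 0, "rows": rows}
```

Sorting by `staging_enabled_since` is a design choice made for this sketch; the ADR does not specify an ordering for the list view.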
H4 — Agent rebase hygiene enforcement
Two changes:
- Stale-branch guard in CI. A workflow step that fails the PR if the feature branch is more than N commits behind `main` at the time the PR is opened for review. Threshold: 10 commits or 48 hours, whichever is larger. This forces the agent (or operator) to rebase before the PR can merge.
- Agent session convention. Each agent that opens a PR should rebase to `origin/main` immediately before pushing the final commit. This is already implicit in `feedback_commit_agent_docs_immediately.md` (commit before any rebase storm) but needs to be a named step in the agent workflow conventions.
Sub-card needed: file a card for feature-developer to add the stale-branch CI guard workflow step.
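The decision logic of the stale-branch guard might look like the sketch below. The ADR does not pin down how the commit limit and the time limit combine, so the `require_both` switch is an assumption added to make the two readings explicit: by default the branch is stale when either limit is exceeded, and with `require_both=True` it is stale only when both are. The name `is_stale` and the definition of `time_behind` (here, the age of the branch's merge base with `origin/main`) are likewise hypothetical.

```python
from datetime import timedelta

MAX_COMMITS_BEHIND = 10                 # starting threshold from H4
MAX_TIME_BEHIND = timedelta(hours=48)   # starting threshold from H4


def is_stale(
    commits_behind: int, time_behind: timedelta, require_both: bool = False
) -> bool:
    """Decide whether a feature branch should fail the stale-branch guard.

    commits_behind: commits on origin/main that the branch lacks, e.g. the
        output of `git rev-list --count HEAD..origin/main`.
    time_behind: assumed here to be the age of the merge base with origin/main.
    require_both: False -> stale if either limit is exceeded (stricter);
        True -> stale only if both limits are exceeded (more lenient).
    """
    over_commits = commits_behind > MAX_COMMITS_BEHIND
    over_time = time_behind > MAX_TIME_BEHIND
    if require_both:
        return over_commits and over_time
    return over_commits or over_time
```

A CI step could call `is_stale` with the measured values and exit non-zero when it returns `True`. Keeping the thresholds as module constants makes the post-Sprint 1 adjustment a one-line change.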
8. Rollout plan
| Phase | What | Gate |
|---|---|---|
| Immediate | Kristerpher enables required-reviewer gate on production Environment per #1202 runbook | Operator action only; no PR needed |
| Sprint 1 | H2: post-merge prod checklist runbook + link in approval notification | Sub-card #N1 |
| Sprint 1 | H3: flag-promotion-pending dashboard widget | Sub-card #N2 |
| Sprint 1 | H4: stale-branch CI guard | Sub-card #N3 |
| After Sprint 1 | Measure: code drift metric, flag drift count, asset hash divergence | Console dashboard or Slack report |
9. Security considerations
- The required-reviewer gate (H1) records the approver's GitHub identity and timestamp. This satisfies the audit trail invariant for prod state changes.
- The post-merge checklist (H2) adds no new credential surface. It reads from existing Heroku CLI output and console UI — both already require authenticated sessions.
- The dashboard widget (H3) exposes flag names and soak timestamps. Flag names are internal configuration identifiers, not PII. No retention or GDPR implications.
- The stale-branch guard (H4) is a CI check only. It does not access secrets or user data.
- None of these hardening steps store credentials, collect PII, or touch the order execution path.
10. Open questions
None block implementation. One for Kristerpher's awareness:
- Stale-branch threshold. The 10-commit / 48-hour threshold in H4 is a starting point. If the agent fleet is producing PRs faster than expected, 10 commits may be too tight and cause excessive forced-rebases. Adjust after Sprint 1 observability.
11. Alternatives considered
Gitflow
Fully analyzed in §4. Rejected. Long-lived branches worsen the drift problem, multiply merge cost, and add cross-agent coordination complexity without providing any safety property that trunk-based + the H1–H4 hardening plan does not already provide.
GitLab Flow with production branch
A long-lived production branch (GitLab Flow variant) was evaluated in docs/architecture/branch-promotion-strategy.md as Option A and rejected there. The reasoning holds here: it adds a branch-protection ceremony and a second merge event per release with no safety signal beyond what the approval gate (H1) already provides. Specifically, agent PRs targeting main would need to be manually re-promoted to production. That is the exact coordination step that is missing today, and adding a branch makes it harder, not easier, to see what is pending.
Automated soak timer (branch-promotion-strategy Option C)
Evaluated and rejected in ADR-0020. Still rejected here. Automated prod deploys without a human checkpoint are not appropriate for the current pre-launch posture. The operator's approval is the signal; a timer is not a substitute.
Action items (sub-cards to file)
These are scoped for feature-developer. Do not claim until card-groomer has processed them.
| Card | Title | Depends on |
|---|---|---|
| #N1 | Write post-merge production-state checklist runbook + wire link into prod approval notification | #1202 (runbook already exists for deploy gate) |
| #N2 | Console dashboard: flag-promotion-pending widget (count badge + list view) | ADR-0035, console_flag_promotions table (#552) |
| #N3 | CI: stale-branch guard (fail PR if branch is > N commits or 48h behind main) | None |
Operator action (not a card): enable the required-reviewer gate on the production GitHub Environment per the runbook in #1202. This takes under two minutes and unblocks H1 immediately.