Status: Accepted
Date: 2026-05-05 UTC
Refs: ADR-0050 (trunk SDLC), ADR-0035 (flag promotion queue), #757 (billing destinations drift), #947/#1224 (Velvet Heroku + GH Actions adapters), #552 (flag promotion DB schema), #1217 (reactive post-merge checklist; this ADR augments it)
Drift is the operator's number-one recurring pain. ADR-0050 affirmed trunk-based development and diagnosed three gap types (asset hash divergence, flag promotion lag, secret staleness); it closed them with a post-merge checklist (H2) and two dashboard features. This ADR sits one layer upstream: it replaces the checklist as the primary control with three structural, automated prevention layers (CI asset-checksum guard, flag-promotion enforcement gate, Vault-to-destination auto-distribution) and adds a daily reconciler as a backstop that catches anything the preventive layers miss. The checklist from #1217 is retained as a rare-case safety net — not as the first line of defense.
Concretely: a PR that silently reverts a favicon should fail CI before merge. A direct heroku config:set FLAG_* on a prod app without a console promotion row should be blocked or flagged. A rotated Postmark key should fan out to Heroku and GH Actions automatically — no operator heroku config:set required. A daily job should surface any mismatch that slipped through. Those four outcomes constitute the decision.
| Incident | Date | What drifted | Why it slipped through |
|---|---|---|---|
| Favicon revert | ~2026-05 | Static asset hash on prod reverted to a prior version | Agent rebase silently overwrote a merged commit; no CI check compared prod asset state |
| Card-detail popup missing from prod | 2026-05-06 | FLAG_CONSOLE_INVESTIGATE_FROM_STATUS on staging, never on prod | Manual heroku config:set bypassed the promotion queue; no guard existed |
| Heroku config-var stale (PAT) | 2026-05 | Vault rotated, Heroku stale | Velvet distribution loop not yet wired to all rotatable secrets |
| Billing destinations | #757 | Multiple destinations out of sync | No automated sync; operator manually maintained per-destination config |
ADR-0050 retained trunk-based development, rejected Gitflow, and prescribed four hardening steps (H1–H4): enable the production environment required-reviewer gate, write a post-merge checklist, add a flag-drift dashboard widget, and add a stale-branch CI guard. Sub-cards #N1/#N2/#N3 (from ADR-0050) cover those.
ADR-0050 also explicitly defined drift across three dimensions:
- main
- staging_enabled = true AND prod_enabled = false
This ADR targets all three with preventive controls, then adds a fourth drift dimension: secret drift (vault value != destination value).
Structural controls prevent drift at the point of introduction — a PR that would cause drift fails CI, a direct flag flip that bypasses the queue triggers an alert, a rotated secret distributes automatically. The checklist remains for the residual cases these controls cannot reach.
These apply to every sub-card this ADR produces:
Adopt three preventive layers (a), (b), (c) before falling back to the reactive checklist. Add a daily reconciler as the universal backstop.
What it prevents: Silent static asset regressions (favicon revert class).
Mechanism: Every PR that touches console/app/static/, frontend/trademaster_ui/public/, or frontend/status-page/public/ must regenerate a committed asset manifest (docs/asset-manifest.json or per-surface equivalent). The manifest records the SHA-256 of each tracked asset at the time of the PR's last commit. CI compares the manifest diff to the set of changed files: if a static asset file changed but the manifest did not update to reflect the new hash, CI fails with a descriptive message. If an asset was reverted to a prior hash (agent rebase scenario), the manifest diff shows the revert explicitly — the PR description is forced to acknowledge it.
What this does NOT do: It does not compare the manifest against live prod state at PR time (network call in CI is a flaky dependency). The reconciler (§6) handles the prod-vs-manifest comparison.
Blast radius if it fails: A false-negative (CI passes but asset is silently reverted) would look identical to pre-control behavior — drift is caught by the reconciler. A false-positive (manifest mismatch for a legitimate change) requires the agent or operator to update the manifest; the error message must be prescriptive about how.
Implementation cost: Low. A single CI step (scripts/ci/check-asset-manifest.sh) that walks the target directories, computes SHA-256, and diffs against the committed manifest. The manifest-update script runs as a pre-commit hook and as a PR step.
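The walk-hash-diff step described above can be sketched as follows. This is a minimal illustration, not the contents of scripts/ci/check-asset-manifest.sh: it assumes the manifest is a flat JSON object mapping repo-relative paths to SHA-256 hex digests, and the function names `compute_manifest` and `check_manifest` are hypothetical.

```python
import hashlib
import json
from pathlib import Path

# Directories named in this ADR as containing tracked static assets.
TRACKED_DIRS = (
    "console/app/static",
    "frontend/trademaster_ui/public",
    "frontend/status-page/public",
)


def compute_manifest(repo_root: Path, tracked_dirs=TRACKED_DIRS) -> dict[str, str]:
    """Map each tracked asset's repo-relative POSIX path to its SHA-256 hex digest."""
    manifest = {}
    for rel_dir in tracked_dirs:
        base = repo_root / rel_dir
        if not base.is_dir():
            continue  # a surface may not exist in every checkout
        for path in sorted(base.rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                manifest[path.relative_to(repo_root).as_posix()] = digest
    return manifest


def check_manifest(repo_root: Path, manifest_path: Path) -> list[str]:
    """Return one prescriptive error per asset whose hash disagrees with the manifest.

    Assumed manifest shape (not specified in the ADR): {"path": "sha256hex", ...}.
    An empty list means the committed manifest matches the working tree.
    """
    committed = json.loads(manifest_path.read_text())
    actual = compute_manifest(repo_root)
    errors = []
    for name in sorted(set(committed) | set(actual)):
        if committed.get(name) != actual.get(name):
            errors.append(
                f"{name}: manifest={committed.get(name)} actual={actual.get(name)}; "
                "regenerate the manifest and commit it"
            )
    return errors
```

A CI wrapper would exit non-zero when `check_manifest` returns a non-empty list. A reverted asset shows up the same way as any other hash change, so the revert has to be committed to the manifest explicitly rather than slipping through.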
What it prevents: Direct heroku config:set FLAG_* on prod apps that bypasses the promotion queue (card-detail popup class).
Mechanism — two sub-controls, either sufficient:
B1 — CI guard: Any PR that touches feature_flags.yaml (the declared flag registry) must include either (a) a corresponding console_flag_promotions row migration, or (b) an explicit NO_PROD_FLAG annotation in the PR body. CI reads the diff, identifies flag additions or changes, and fails if neither condition is met. This ensures that newly declared flags are wired into the promotion queue before they ship.
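A file-level sketch of the B1 decision is shown here. It is deliberately simpler than the YAML-aware diff parser the ADR calls for: it only asks whether feature_flags.yaml was touched at all. The function `b1_guard` and the migration-path convention in `looks_like_promotion_migration` are assumptions for illustration; the ADR does not say how promotion-row migrations are named.

```python
from typing import Callable, Iterable

FLAG_REGISTRY = "feature_flags.yaml"  # registry file named in this ADR
ANNOTATION = "NO_PROD_FLAG"           # escape-hatch marker named in this ADR


def looks_like_promotion_migration(path: str) -> bool:
    """Hypothetical convention: a migration file whose path mentions the table."""
    return "migrations/" in path and "console_flag_promotions" in path


def b1_guard(
    changed_files: Iterable[str],
    pr_body: str,
    is_promotion_migration: Callable[[str], bool] = looks_like_promotion_migration,
) -> tuple[bool, str]:
    """Return (passed, message) for a PR, following the two B1 conditions."""
    changed = list(changed_files)
    touches_registry = any(
        p == FLAG_REGISTRY or p.endswith("/" + FLAG_REGISTRY) for p in changed
    )
    if not touches_registry:
        return True, "flag registry untouched"
    if ANNOTATION in pr_body:
        return True, f"{ANNOTATION} annotation present"
    if any(is_promotion_migration(p) for p in changed):
        return True, "promotion row migration included"
    return False, (
        f"{FLAG_REGISTRY} changed: add a console_flag_promotions row migration "
        f"or put {ANNOTATION} in the PR body"
    )
```

Passing the migration predicate as a parameter keeps the guard testable without committing to a directory layout.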
B2 - Heroku pre-config-set hook equivalent: Heroku does not support pre-config-set hooks natively. The equivalent is a CI job (flag-drift-check.yml) that runs on a schedule (every 4 hours) and compares the FLAG_* keys currently set on the prod Heroku app against the console_flag_promotions table. Any FLAG_* key on prod that has no promotion record is flagged as an unauthorized direct-set and surfaces as a console alert. This control is detective, not preventive: it cannot block a direct-set, but it makes one visible within 4 hours.
B1 prevents the forward path (new flags slipping through without a promotion row). B2 detects unauthorized direct-sets and surfaces them within 4 hours.
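The core of the B2 comparison is a set difference. The sketch below assumes the two inputs have already been fetched (the prod app's config-var keys and the flag keys recorded in console_flag_promotions); the fetching code and the function name `unauthorized_flags` are not from this ADR.

```python
from typing import Iterable


def unauthorized_flags(
    prod_config_keys: Iterable[str],
    promoted_flag_keys: Iterable[str],
    prefix: str = "FLAG_",
) -> list[str]:
    """Return FLAG_* keys present on prod that have no promotion record."""
    on_prod = {key for key in prod_config_keys if key.startswith(prefix)}
    return sorted(on_prod - set(promoted_flag_keys))


# Example: one flag was set directly on prod without going through the queue.
alerts = unauthorized_flags(
    prod_config_keys=["DATABASE_URL", "FLAG_NEW_UI", "FLAG_CONSOLE_INVESTIGATE_FROM_STATUS"],
    promoted_flag_keys=["FLAG_NEW_UI"],
)
```

Non-flag config-vars such as DATABASE_URL are ignored by the prefix filter, so the check reports only the flag dimension.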
Blast radius if it fails: B1 false-positive: CI rejects a PR that legitimately sets a flag without a promotion row (e.g., a CI-only flag). The NO_PROD_FLAG annotation escape hatch handles this. B1 false-negative: a PR that adds a flag but the CI diff parser misses it — caught by B2 at next scheduled run. B2 failure: the scheduled job errors. Alert goes to ops@ per standard job-failure routing; reconciler (§6) also covers this dimension.
Implementation cost: Medium. B1 requires a YAML-aware diff parser in CI. B2 requires a job that calls the Heroku API and queries the promotions table — both are already accessible to the console's service account.
What it prevents: Stale secrets on Heroku / GH Actions after a vault rotation (PAT staleness class, billing destinations class #757).
Mechanism: Velvet's service-bus subscription model (#1224, ADR-0037) already distributes to Heroku config-vars and GH Actions secrets for the subset of secrets enrolled in Velvet. This layer expands the enrollment to cover all rotatable secrets:
When vault emits a secret.rotated event for an enrolled secret, Velvet fans out to all registered destinations for that secret. No manual heroku config:set required. The existing Heroku adapter (#1224) and GH Actions adapter (#1224) handle delivery; this layer is an enrollment expansion, not a new adapter.
Blast radius if it fails: A fan-out failure for one destination leaves that destination stale. Velvet already emits a distribution.failed event on adapter error. The reconciler (§6) will catch the mismatch within 24 hours. The Velvet retry policy (ADR-0038) handles transient failures.
Implementation cost: Low-to-medium. For each new secret type, the Velvet consumer manifest gains one entry. The secrets themselves remain in Vault; only metadata (which destinations, which rotation schedule) is added to the manifest.
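To illustrate the fan-out behavior and its per-destination failure isolation, here is a small sketch. It does not reflect Velvet's actual manifest format or adapter interface, neither of which is shown in this ADR: the `enrollment` mapping, the `Adapter` callable signature, and `FanOutResult` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical adapter signature: (secret_name, new_value) -> None, raising on failure.
Adapter = Callable[[str, str], None]


@dataclass
class FanOutResult:
    delivered: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # destination -> error text


def fan_out(
    secret_name: str,
    new_value: str,
    enrollment: dict[str, list[str]],
    adapters: dict[str, Adapter],
) -> FanOutResult:
    """Push a rotated secret to every destination enrolled for it.

    A failure at one destination is recorded and does not stop delivery to the
    others; a caller could turn each entry in `failed` into a
    distribution.failed event.
    """
    result = FanOutResult()
    for destination in enrollment.get(secret_name, []):
        try:
            adapters[destination](secret_name, new_value)
            result.delivered.append(destination)
        except Exception as exc:  # includes a missing adapter (KeyError)
            result.failed[destination] = str(exc)
    return result
```

Under this shape, enrolling a new secret type means adding one entry to `enrollment`, which mirrors the "one manifest entry per secret type" cost estimate above.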
What it prevents: Accumulated drift that all three preventive layers miss — including drift that predates the controls, drift from manual interventions, and drift from partial failures in A/B/C.
Mechanism: A daily job (cron, 06:00 UTC) that runs three diff checks and surfaces mismatches as console alerts:
| Check | Source of truth | Destination | Alert trigger |
|---|---|---|---|
| Secret sync | Vault current value metadata (hash + rotation timestamp) | Heroku config-var metadata (last-set timestamp), GH Actions secret metadata | Mismatch in rotation timestamps or hash |
| Flag sync | feature_flags.yaml declared flags + console_flag_promotions table | Heroku FLAG_* config-vars on each app | Any FLAG_* on prod not in promotions table, or any promoted flag not reflected on Heroku |
| Asset sync | docs/asset-manifest.json at main HEAD | CF Pages asset hashes on prod surface | Any file where prod hash != manifest hash |
Alerts surface in:
1. Console /status dashboard (new "Drift" section, count badge)
2. ops@ email if any check yields > 0 mismatches (daily digest, not per-mismatch spam)
The reconciler does not auto-remediate. It reports. Remediation is a deliberate operator action (Velvet re-push, flag promotion, asset redeploy).
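The report-only behavior and the single daily digest can be sketched as below. The sketch assumes each check has already been reduced to two key-to-value mappings (source of truth and destination); the function names and the digest wording are illustrative, not part of this ADR.

```python
from typing import Optional


def diff_state(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return keys whose value differs between source of truth and destination,
    including keys present on only one side."""
    return sorted(
        key for key in set(expected) | set(actual)
        if expected.get(key) != actual.get(key)
    )


def reconcile(checks: dict[str, tuple[dict[str, str], dict[str, str]]]) -> dict[str, list[str]]:
    """Run every check and report mismatches. Nothing is remediated here."""
    return {name: diff_state(exp, act) for name, (exp, act) in checks.items()}


def build_digest(report: dict[str, list[str]]) -> Optional[str]:
    """Return one digest body for the whole run, or None when nothing drifted."""
    total = sum(len(keys) for keys in report.values())
    if total == 0:
        return None  # no email on a clean run
    lines = [f"Drift digest: {total} mismatch(es)"]
    for name, keys in report.items():
        if keys:
            lines.append(f"- {name}: {', '.join(keys)}")
    return "\n".join(lines)
```

Returning `None` on a clean run keeps the ops@ channel quiet unless there is something to act on, and the per-check counts in the report could feed the dashboard badge.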
Blast radius if it fails: The reconciler's own failure is a monitoring gap, not a drift event. Standard job-failure alert to ops@ applies. The preventive layers (A/B/C) remain active regardless of reconciler state.
Implementation cost: Medium. Three distinct API calls (Vault, Heroku, CF Pages). Result aggregation and console alert surface. The console already has a /status page with alert infrastructure.
| Control | Class of drift prevented | Blast radius (failure mode) | Implementation cost |
|---|---|---|---|
| A: CI asset-checksum guard | Static asset regression (rebase overwrite) | False-negative: caught by reconciler. False-positive: operator updates manifest. | Low |
| B1: CI flag-promotion guard | New flag skipping promotion queue | False-negative: caught by B2 + reconciler. False-positive: NO_PROD_FLAG annotation. | Medium |
| B2: Flag-drift scheduled check | Direct FLAG_* set bypassing queue | Delayed detection (4h window). False-positive: none (purely additive). | Medium |
| C: Velvet enrollment expansion | Stale secrets post-rotation | Fan-out failure: caught by reconciler within 24h. Velvet retry handles transient. | Low-to-medium |
| D: Daily reconciler | All drift types, including pre-control and manual drift | Reconciler failure: monitoring gap. Ops@ alert on job failure. | Medium |
The post-merge checklist (#1217) remains the procedure when:
- The operator suspects drift not yet surfaced by the reconciler.
- A drift alert from the reconciler needs operator-directed resolution.
- A control (A/B/C/D) itself has failed and the operator is in recovery mode.
It is not the procedure the operator runs after every merge. That distinction is the core shift this ADR makes.
The existing #1217 runbook path. Rejected as the primary control because it scales linearly with PR throughput and generates no signal when skipped. Retained as safety net.
Relies on the operator to remember every category of drift after every merge. Fragile by construction — the incidents in §1 demonstrate the failure mode. Rejected.
A single job that checks everything at merge time and blocks the PR if any drift is detected. Rejected because:
- It requires prod-state reads in CI (network dependency, latency, flakiness).
- It conflates different drift types with different remediation paths.
- A single validator failure blocks all PRs regardless of which check failed.
Layered controls fail independently and have independent remediation paths.
Heroku pipeline promotions would solve the flag-lag problem at the slug level (promote the entire slug from staging to prod). Rejected because:
- Raxx's staging and prod are structurally separate apps, not slots in a pipeline.
- Flag state and code state would be coupled: a code deploy to prod would also flip all flags, removing the deliberate flag-promotion queue (ADR-0035).
- It does not address secret drift or asset drift.
No schema changes in this ADR. The reconciler (Layer D) reads existing tables (console_flag_promotions, Velvet secret_registrations). If those tables do not yet have the columns the reconciler needs (e.g., last_hash, last_rotation_timestamp), the sub-cards for N6 and N7 must include the necessary additive migrations and their rollbacks.
Rollback for each layer:
- A: Delete or gate the CI step behind a SKIP_ASSET_MANIFEST_CHECK env var. Does not affect application state.
- B1: Remove the CI step. Flag manifests remain; no drift introduced.
- B2: Disable the scheduled workflow. No application state affected.
- C: Remove the Velvet manifest entries for the new secrets. Secrets remain in Vault; distribution stops.
- D: Disable the cron job. No drift is introduced; visibility is lost.
| Phase | What | Gate |
|---|---|---|
| Sprint 1 | N4: CI asset-checksum guard | Sub-card, mergeable independently |
| Sprint 1 | N5: Heroku flag-drift check (B2) + CI guard (B1) | Sub-card, depends on console_flag_promotions table (#552) |
| Sprint 1 | N6: Velvet enrollment expansion (Postmark, CF, Slack, FreeScout) | Sub-card, depends on #1224 adapters already merged |
| Sprint 2 | N7: Status reconciler cron job + console drift widget | Sub-card, depends on N4/N5/N6 for full coverage |
| After Sprint 2 | Validate: zero drift alerts for 5 business days | Operator sign-off |
| GA | Retire "run this checklist after every merge" language from #1217; update it as emergency-only procedure | PR against #1217 runbook |
feature_flags.yaml and the PR diff; no secrets involved.
None block sub-card implementation. One for operator awareness:
Scoped for feature-developer. Do not claim until card-groomer has processed them.
| Card | Title | Layer | Depends on | Rough sizing |
|---|---|---|---|---|
| N4 | CI: static asset checksum guard — manifest generation + PR diff check | A | None | S (1–2 days) |
| N5 | Flag-promotion enforcement: CI guard (B1) + scheduled flag-drift check (B2) | B | #552 (console_flag_promotions table), #1224 (Heroku API access pattern) | M (2–3 days) |
| N6 | Velvet: expand subscription enrollment to remaining rotatable secrets (Postmark, CF, Slack, FreeScout) | C | #1224 (Heroku + GH Actions adapters merged) | S-M (1–2 days per secret type) |
| N7 | Status reconciler cron job: vault/Heroku/asset drift diffs + console drift widget + ops@ digest | D | N4, N5, N6 for full coverage; can ship partial | L (3–5 days) |