Raxx · internal docs


ADR 0051 — Drift prevention: layered structural controls

Status: Accepted
Date: 2026-05-05 UTC
Refs: ADR-0050 (trunk SDLC), ADR-0035 (flag promotion queue), #757 (billing destinations drift), #947/#1224 (Velvet Heroku + GH Actions adapters), #552 (flag promotion DB schema), #1217 (reactive post-merge checklist; this ADR augments it)


TL;DR

Drift is the operator's number-one recurring pain. ADR-0050 affirmed trunk-based development and diagnosed three gap types (asset hash divergence, flag promotion lag, secret staleness); it closed them with a post-merge checklist (H2) and two dashboard features. This ADR sits one layer upstream: it replaces the checklist as the primary control with three structural, automated prevention layers (CI asset-checksum guard, flag-promotion enforcement gate, Vault-to-destination auto-distribution) and adds a daily reconciler as a backstop that catches anything the preventive layers miss. The checklist from #1217 is retained as a rare-case safety net — not as the first line of defense.

Concretely: a PR that silently reverts a favicon should fail CI before merge. A direct heroku config:set FLAG_* on a prod app without a console promotion row should be blocked or flagged. A rotated Postmark key should fan out to Heroku and GH Actions automatically — no operator heroku config:set required. A daily job should surface any mismatch that slipped through. Those four outcomes constitute the decision.


1. Context

Drift incidents that drove this ADR

| Incident | Date | What drifted | Why it slipped through |
| --- | --- | --- | --- |
| Favicon revert | ~2026-05 | Static asset hash on prod reverted to a prior version | Agent rebase silently overwrote a merged commit; no CI check compared prod asset state |
| Card-detail popup missing from prod | 2026-05-06 | FLAG_CONSOLE_INVESTIGATE_FROM_STATUS on staging, never on prod | Manual heroku config:set bypassed the promotion queue; no guard existed |
| Heroku config-var stale (PAT) | 2026-05 | Vault rotated, Heroku stale | Velvet distribution loop not yet wired to all rotatable secrets |
| Billing destinations #757 | | Multiple destinations out of sync | No automated sync; operator manually maintained per-destination config |

What ADR-0050 already decided

ADR-0050 retained trunk-based development, rejected Gitflow, and prescribed four hardening steps (H1–H4): enable the production environment required-reviewer gate, write a post-merge checklist, add a flag-drift dashboard widget, and add a stale-branch CI guard. Sub-cards #N1/#N2/#N3 (from ADR-0050) cover those.

ADR-0050 also explicitly defined drift across three dimensions:

This ADR targets all three with preventive controls, then adds a fourth drift dimension: secret drift (vault value != destination value).

Why a checklist is not enough

#1217 (the post-merge runbook card) is correct as a safety net. It is not correct as the primary control. A checklist:

Structural controls prevent drift at the point of introduction — a PR that would cause drift fails CI, a direct flag flip that bypasses the queue triggers an alert, a rotated secret distributes automatically. The checklist remains for the residual cases these controls cannot reach.


2. Invariants

These apply to every sub-card this ADR produces:


3. Decision

Adopt three preventive layers (a), (b), (c) before falling back to the reactive checklist. Add a daily reconciler as the universal backstop.


4. Preventive layers

Layer A — CI asset-checksum guard

What it prevents: Silent static asset regressions (favicon revert class).

Mechanism: Every PR that touches console/app/static/, frontend/trademaster_ui/public/, or frontend/status-page/public/ must regenerate a committed asset manifest (docs/asset-manifest.json or per-surface equivalent). The manifest records the SHA-256 of each tracked asset at the time of the PR's last commit. CI compares the manifest diff to the set of changed files: if a static asset file changed but the manifest did not update to reflect the new hash, CI fails with a descriptive message. If an asset was reverted to a prior hash (agent rebase scenario), the manifest diff shows the revert explicitly — the PR description is forced to acknowledge it.
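The comparison described above can be sketched as follows. This is a minimal illustration, not the contents of scripts/ci/check-asset-manifest.sh: the function names are hypothetical, and the manifest is assumed to be a flat JSON object mapping repo-relative paths to SHA-256 hex digests (the ADR does not fix the manifest schema).

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path, asset_dirs: list[str]) -> dict[str, str]:
    """Map each tracked asset's repo-relative POSIX path to its SHA-256.

    Assumed schema: {"path/to/asset": "<hex digest>"}. Directories that do
    not exist in this checkout are skipped.
    """
    manifest: dict[str, str] = {}
    for asset_dir in asset_dirs:
        base = root / asset_dir
        if not base.is_dir():
            continue
        for path in sorted(base.rglob("*")):
            if path.is_file():
                manifest[path.relative_to(root).as_posix()] = sha256_of(path)
    return manifest


def check_manifest(committed: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return prescriptive error lines; an empty list means the guard passes."""
    errors = []
    for path in sorted(set(committed) | set(actual)):
        want, got = committed.get(path), actual.get(path)
        if want == got:
            continue
        if want is None:
            errors.append(f"{path}: asset is not in the manifest; regenerate the manifest")
        elif got is None:
            errors.append(f"{path}: listed in the manifest but missing on disk")
        else:
            errors.append(
                f"{path}: manifest has {want[:12]}, file has {got[:12]}; "
                "regenerate the manifest"
            )
    return errors
```

In CI, the committed side would come from json.load on the manifest file and the actual side from build_manifest over the three tracked directories; a non-empty error list fails the step. Because a revert changes the recorded hash back to an earlier value, it shows up in the manifest diff like any other change.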

What this does NOT do: It does not compare the manifest against live prod state at PR time (a network call in CI is a flaky dependency). The reconciler (§5) handles the prod-vs-manifest comparison.

Blast radius if it fails: A false-negative (CI passes but asset is silently reverted) would look identical to pre-control behavior — drift is caught by the reconciler. A false-positive (manifest mismatch for a legitimate change) requires the agent or operator to update the manifest; the error message must be prescriptive about how.

Implementation cost: Low. A single CI step (scripts/ci/check-asset-manifest.sh) that walks the target directories, computes SHA-256, and diffs against the committed manifest. The manifest-update script runs as a pre-commit hook and as a PR step.


Layer B — Flag-promotion enforcement gate

What it prevents: Direct heroku config:set FLAG_* on prod apps that bypasses the promotion queue (card-detail popup class).

Mechanism — two sub-controls, either sufficient:

B1 — CI guard: Any PR that touches feature_flags.yaml (the declared flag registry) must include either (a) a corresponding console_flag_promotions row migration, or (b) an explicit NO_PROD_FLAG annotation in the PR body. CI reads the diff, identifies flag additions or changes, and fails if neither condition is met. This ensures that newly declared flags are wired into the promotion queue before they ship.
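A minimal sketch of the B1 decision logic, assuming the two versions of feature_flags.yaml have already been parsed into dictionaries keyed by flag name (the registry's real schema is not specified here). The function names and the has_promotion_migration input are hypothetical; how CI detects a console_flag_promotions migration in the diff is left to the sub-card.

```python
def changed_flags(old: dict[str, object], new: dict[str, object]) -> set[str]:
    """Flags added to, or changed in, the registry. Removals are ignored."""
    return {name for name, value in new.items() if name not in old or old[name] != value}


def b1_guard(
    old: dict[str, object],
    new: dict[str, object],
    pr_body: str,
    has_promotion_migration: bool,
) -> tuple[bool, str]:
    """Return (passed, message) for the CI step."""
    flags = changed_flags(old, new)
    if not flags:
        return True, "no flag additions or changes"
    if has_promotion_migration:
        return True, "promotion row migration present"
    if "NO_PROD_FLAG" in pr_body:
        return True, "NO_PROD_FLAG annotation present"
    names = ", ".join(sorted(flags))
    return False, (
        f"flags changed without a promotion path: {names}. "
        "Add a console_flag_promotions row migration, or add NO_PROD_FLAG to the PR body."
    )
```

The failure message names both ways out, so a false positive costs one PR-body edit.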

B2 — Heroku pre-config-set hook equivalent: Heroku does not support pre-config-set hooks natively. The equivalent is a CI job (flag-drift-check.yml) that runs on a schedule (every 4 hours) and compares the FLAG_* keys currently set on the prod Heroku app against the console_flag_promotions table. Any FLAG_* key on prod that has no promotion record is flagged as an unauthorized direct-set and surfaces as a console alert. Because the check runs after the fact, it cannot block a direct-set; it makes one visible within a single schedule interval (at most 4 hours).

B1 prevents the forward path (new flags slipping through without a promotion row). B2 detects unauthorized direct-sets and surfaces them within 4 hours.
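The B2 comparison reduces to a set difference. The sketch below assumes the job has already fetched the prod app's config-vars as a dictionary and the promoted flag names as a set; fetching them (Heroku API call, promotions-table query) is out of scope, and the function name is hypothetical.

```python
def unauthorized_flags(prod_config: dict[str, str], promoted: set[str]) -> list[str]:
    """FLAG_* keys set on prod that have no promotion record, sorted for stable alerts."""
    return sorted(
        key for key in prod_config
        if key.startswith("FLAG_") and key not in promoted
    )


# Example: one flag went through the queue, one was set directly.
prod_config = {
    "FLAG_CONSOLE_INVESTIGATE_FROM_STATUS": "true",
    "FLAG_EXPERIMENT": "true",   # hypothetical direct-set
    "DATABASE_URL": "postgres://example",
}
promoted = {"FLAG_CONSOLE_INVESTIGATE_FROM_STATUS"}
drift = unauthorized_flags(prod_config, promoted)
```

Non-flag config-vars are ignored, so the check stays additive: it can only raise alerts about FLAG_* keys.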

Blast radius if it fails: B1 false-positive: CI rejects a PR that legitimately sets a flag without a promotion row (e.g., a CI-only flag). The NO_PROD_FLAG annotation escape hatch handles this. B1 false-negative: a PR adds a flag but the CI diff parser misses it; B2 catches it at the next scheduled run. B2 failure: the scheduled job errors. The alert goes to ops@ per standard job-failure routing; the reconciler (§5) also covers this dimension.

Implementation cost: Medium. B1 requires a YAML-aware diff parser in CI. B2 requires a job that calls the Heroku API and queries the promotions table — both are already accessible to the console's service account.


Layer C — Vault-to-destination auto-distribution (Velvet expansion)

What it prevents: Stale secrets on Heroku / GH Actions after a vault rotation (PAT staleness class, billing destinations class #757).

Mechanism: Velvet's service-bus subscription model (#1224, ADR-0037) already distributes to Heroku config-vars and GH Actions secrets for the subset of secrets enrolled in Velvet. This layer expands the enrollment to cover all rotatable secrets:

When vault emits a secret.rotated event for an enrolled secret, Velvet fans out to all registered destinations for that secret. No manual heroku config:set required. The existing Heroku adapter (#1224) and GH Actions adapter (#1224) handle delivery; this layer is an enrollment expansion, not a new adapter.
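The fan-out behavior can be sketched as below. This is illustrative only and does not reflect Velvet's actual code: the enrollment mapping, the adapter call signature, and the shape of the returned event dictionaries are all assumptions. Only the event names secret.rotated and distribution.failed come from this ADR.

```python
from typing import Callable

# Assumed adapter signature: (secret_name, new_value) -> None, raising on failure.
Adapter = Callable[[str, str], None]


def fan_out(
    secret_name: str,
    new_value: str,
    enrollment: dict[str, list[str]],
    adapters: dict[str, Adapter],
) -> list[dict[str, str]]:
    """Deliver a rotated secret to every enrolled destination.

    Returns one distribution.failed event per destination that could not be
    updated. A failure on one destination does not stop delivery to the others.
    """
    failed = []
    for destination in enrollment.get(secret_name, []):
        adapter = adapters.get(destination)
        try:
            if adapter is None:
                raise LookupError(f"no adapter registered for {destination}")
            adapter(secret_name, new_value)
        except Exception as exc:  # any adapter error leaves this destination stale
            failed.append({
                "type": "distribution.failed",
                "secret": secret_name,
                "destination": destination,
                "error": str(exc),
            })
    return failed
```

A secret with no enrollment entry produces no deliveries and no failures, which is why enrollment coverage, not adapter code, is the gap this layer closes.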

Blast radius if it fails: A fan-out failure for one destination leaves that destination stale. Velvet already emits a distribution.failed event on adapter error. The reconciler (§5) will catch the mismatch within 24 hours. The Velvet retry policy (ADR-0038) handles transient failures.

Implementation cost: Low-to-medium. For each new secret type, the Velvet consumer manifest gains one entry. The secrets themselves remain in Vault; only metadata (which destinations, which rotation schedule) is added to the manifest.


5. Defense in depth — reconciler as backstop

Layer D — Status reconciler (daily cron)

What it prevents: Accumulated drift that all three preventive layers miss — including drift that predates the controls, drift from manual interventions, and drift from partial failures in A/B/C.

Mechanism: A daily job (cron, 06:00 UTC) that runs three diff checks and surfaces mismatches as console alerts:

| Check | Source of truth | Destination | Alert trigger |
| --- | --- | --- | --- |
| Secret sync | Vault current value metadata (hash + rotation timestamp) | Heroku config-var metadata (last-set timestamp), GH Actions secret metadata | Mismatch in rotation timestamps or hash |
| Flag sync | feature_flags.yaml declared flags + console_flag_promotions table | Heroku FLAG_* config-vars on each app | Any FLAG_* on prod not in promotions table, or any promoted flag not reflected on Heroku |
| Asset sync | docs/asset-manifest.json at main HEAD | CF Pages asset hashes on prod surface | Any file where prod hash != manifest hash |

Alerts surface in:

1. Console /status dashboard (new "Drift" section, count badge)
2. ops@ email if any check yields > 0 mismatches (daily digest, not per-mismatch spam)

The reconciler does not auto-remediate. It reports. Remediation is a deliberate operator action (Velvet re-push, flag promotion, asset redeploy).
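A report-only reconciler pass might look like the sketch below. It assumes each check's inputs have already been fetched into plain dictionaries and sets; the function names and the digest format are hypothetical, and nothing here writes to any destination.

```python
from typing import Optional


def hash_mismatches(truth: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Keys whose observed hash is missing or differs from the source of truth."""
    return sorted(key for key, value in truth.items() if observed.get(key) != value)


def flag_mismatches(prod_config: dict[str, str], promoted: set[str]) -> list[str]:
    """Both directions of the flag check: unpromoted on prod, and promoted but absent."""
    on_prod = {key for key in prod_config if key.startswith("FLAG_")}
    issues = [f"{key}: on prod without a promotion record" for key in sorted(on_prod - promoted)]
    issues += [f"{key}: promoted but not set on prod" for key in sorted(promoted - on_prod)]
    return issues


def build_digest(results: dict[str, list[str]]) -> Optional[str]:
    """One digest body for the day, or None when every check is clean."""
    total = sum(len(mismatches) for mismatches in results.values())
    if total == 0:
        return None
    lines = [f"Drift digest: {total} mismatch(es)"]
    for check, mismatches in results.items():
        lines.extend(f"  [{check}] {item}" for item in mismatches)
    return "\n".join(lines)
```

Returning None on a clean run keeps the ops@ channel quiet, which matches the "digest, not per-mismatch spam" intent; the same results dictionary can feed the dashboard count badge.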

Blast radius if it fails: The reconciler's own failure is a monitoring gap, not a drift event. Standard job-failure alert to ops@ applies. The preventive layers (A/B/C) remain active regardless of reconciler state.

Implementation cost: Medium. The job calls three distinct APIs (Vault, Heroku, CF Pages), aggregates the results, and surfaces them as console alerts. The console already has a /status page with alert infrastructure.


6. Decision matrix

| Control | Class of drift prevented | Blast radius (failure mode) | Implementation cost |
| --- | --- | --- | --- |
| A: CI asset-checksum guard | Static asset regression (rebase overwrite) | False-negative: caught by reconciler. False-positive: operator updates manifest. | Low |
| B1: CI flag-promotion guard | New flag skipping promotion queue | False-negative: caught by B2 + reconciler. False-positive: NO_PROD_FLAG annotation. | Medium |
| B2: Flag-drift scheduled check | Direct FLAG_* set bypassing queue | Delayed detection (4h window). False-positive: none (purely additive). | Medium |
| C: Velvet enrollment expansion | Stale secrets post-rotation | Fan-out failure: caught by reconciler within 24h. Velvet retry handles transient. | Low-to-medium |
| D: Daily reconciler | All drift types, including pre-control and manual drift | Reconciler failure: monitoring gap. Ops@ alert on job failure. | Medium |

Runbook from #1217 — role after these controls

The post-merge checklist (#1217) remains the procedure when:

- The operator suspects drift not yet surfaced by the reconciler.
- A drift alert from the reconciler needs operator-directed resolution.
- A control (A/B/C/D) itself has failed and the operator is in recovery mode.

It is not the procedure the operator runs after every merge. That distinction is the core shift this ADR makes.


7. Rejected alternatives

Pure checklist (reactive toil)

The existing #1217 runbook path. Rejected as the primary control because it scales linearly with PR throughput and generates no signal when skipped. Retained as safety net.

Pure trust (operator vigilance)

Relies on the operator to remember every category of drift after every merge. Fragile by construction — the incidents in §1 demonstrate the failure mode. Rejected.

One monolithic drift validator

A single job that checks everything at merge time and blocks the PR if any drift is detected. Rejected because:

- It requires prod-state reads in CI (network dependency, latency, flakiness).
- It conflates different drift types with different remediation paths.
- A single validator failure blocks all PRs regardless of which check failed.

Layered controls fail independently and have independent remediation paths.

Heroku pipeline promotion (Heroku-native)

Heroku pipeline promotions would solve the flag-lag problem at the slug level (promote the entire slug from staging to prod). Rejected because:

- Raxx's staging and prod are structurally separate apps, not slots in a pipeline.
- Flag state and code state would be coupled: a code deploy to prod would also flip all flags, removing the deliberate flag-promotion queue (ADR-0035).
- It does not address secret drift or asset drift.


8. Migration path

No schema changes in this ADR. The reconciler (Layer D) reads existing tables (console_flag_promotions, Velvet secret_registrations). If those tables do not yet have the columns the reconciler needs (e.g., last_hash, last_rotation_timestamp), the sub-cards for N6 and N7 must include the necessary additive migrations and their rollbacks.

Rollback for each layer:

- A: Delete or gate the CI step behind a SKIP_ASSET_MANIFEST_CHECK env var. Does not affect application state.
- B1: Remove the CI step. Flag manifests remain; no drift introduced.
- B2: Disable the scheduled workflow. No application state affected.
- C: Remove the Velvet manifest entries for the new secrets. Secrets remain in Vault; distribution stops.
- D: Disable the cron job. No drift is introduced; visibility is lost.


9. Rollout plan

| Phase | What | Gate |
| --- | --- | --- |
| Sprint 1 | N4: CI asset-checksum guard | Sub-card, mergeable independently |
| Sprint 1 | N5: Heroku flag-drift check (B2) + CI guard (B1) | Sub-card, depends on console_flag_promotions table (#552) |
| Sprint 1 | N6: Velvet enrollment expansion (Postmark, CF, Slack, FreeScout) | Sub-card, depends on #1224 adapters already merged |
| Sprint 2 | N7: Status reconciler cron job + console drift widget | Sub-card, depends on N4/N5/N6 for full coverage |
| After Sprint 2 | Validate: zero drift alerts for 5 business days | Operator sign-off |
| GA | Retire "run this checklist after every merge" language from #1217; update it as emergency-only procedure | PR against #1217 runbook |

10. Security considerations

GDPR checklist


11. Open questions

None block sub-card implementation. One for operator awareness:

  1. Reconciler alert threshold: daily digest vs. immediate. The current design sends a daily ops@ email digest when any check yields mismatches. If a critical secret is stale, a 24-hour delay before the alert is too slow. One option: secret-drift alerts fire immediately (Velvet already has this signal), while flag and asset drift stay in the daily digest. Feature-developer should confirm the alert routing in the N7 implementation.
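If the split routing is adopted, it could be as small as the sketch below. This is one possible shape for the N7 discussion, not a decided design: the check names, the IMMEDIATE_CHECKS set, and the channel labels are hypothetical.

```python
# Assumption: only secret drift is urgent enough to bypass the daily digest.
IMMEDIATE_CHECKS = {"secret"}


def route_alerts(results: dict[str, list[str]]) -> dict[str, list[str]]:
    """Split reconciler mismatches into an immediate channel and the daily digest."""
    routed: dict[str, list[str]] = {"immediate": [], "digest": []}
    for check, mismatches in results.items():
        channel = "immediate" if check in IMMEDIATE_CHECKS else "digest"
        routed[channel].extend(f"{check}: {item}" for item in mismatches)
    return routed
```

Keeping the urgent set as data means the routing decision can change without touching the checks themselves.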

12. Action items (sub-cards to file)

Scoped for feature-developer. Do not claim until card-groomer has processed them.

| Card | Title | Layer | Depends on | Rough sizing |
| --- | --- | --- | --- | --- |
| N4 | CI: static asset checksum guard (manifest generation + PR diff check) | A | None | S (1–2 days) |
| N5 | Flag-promotion enforcement: CI guard (B1) + scheduled flag-drift check (B2) | B | #552 (console_flag_promotions table), #1224 (Heroku API access pattern) | M (2–3 days) |
| N6 | Velvet: expand subscription enrollment to remaining rotatable secrets (Postmark, CF, Slack, FreeScout) | C | #1224 (Heroku + GH Actions adapters merged) | S-M (1–2 days per secret type) |
| N7 | Status reconciler cron job: vault/Heroku/asset drift diffs + console drift widget + ops@ digest | D | N4, N5, N6 for full coverage; can ship partial | L (3–5 days) |