ADR 0051 — Drift prevention: layered structural controls

Status: Accepted
Date: 2026-05-05 UTC
Refs: ADR-0050 (trunk SDLC), ADR-0035 (flag promotion queue), #757 (billing destinations drift), #947/#1224 (Velvet Heroku + GH Actions adapters), #552 (flag promotion DB schema), #1217 (reactive post-merge checklist; this ADR augments it)

TL;DR

Drift is the operator's number-one recurring pain. ADR-0050 affirmed trunk-based development and diagnosed three gap types (asset hash divergence, flag promotion lag, secret staleness); it closed them with a post-merge checklist (H2) and two dashboard features. This ADR sits one layer upstream: it replaces the checklist as the primary control with three structural, automated prevention layers (CI asset-checksum guard, flag-promotion enforcement gate, Vault-to-destination auto-distribution) and adds a daily reconciler as a backstop that catches anything the preventive layers miss. The checklist from #1217 is retained as a rare-case safety net — not as the first line of defense.

Concretely: a PR that silently reverts a favicon should fail CI before merge. A direct heroku config:set FLAG_* on a prod app without a console promotion row should be blocked or flagged. A rotated Postmark key should fan out to Heroku and GH Actions automatically — no operator heroku config:set required. A daily job should surface any mismatch that slipped through. Those four outcomes constitute the decision.

1. Context

Drift incidents that drove this ADR

Incident	Date	What drifted	Why it slipped through
Favicon revert	~2026-05	Static asset hash on prod reverted to a prior version	Agent rebase silently overwrote a merged commit; no CI check compared prod asset state
Card-detail popup missing from prod	2026-05-06	`FLAG_CONSOLE_INVESTIGATE_FROM_STATUS` on staging, never on prod	Manual `heroku config:set` bypassed the promotion queue; no guard existed
Heroku config-var stale (PAT)	2026-05	Vault rotated, Heroku stale	Velvet distribution loop not yet wired to all rotatable secrets
Billing destinations	#757	Multiple destinations out of sync	No automated sync; operator manually maintained per-destination config

What ADR-0050 already decided

ADR-0050 retained trunk-based development, rejected Gitflow, and prescribed four hardening steps (H1–H4): enable the production environment required-reviewer gate, write a post-merge checklist, add a flag-drift dashboard widget, and add a stale-branch CI guard. Sub-cards #N1/#N2/#N3 (from ADR-0050) cover those.

ADR-0050 also explicitly defined drift across three dimensions:

Code drift: hours since last prod deploy vs. last merge to main
Flag drift: flags where staging_enabled = true AND prod_enabled = false
Static asset drift: prod content-hash != main's last-built hash

This ADR targets all three with preventive controls, then adds a fourth drift dimension: secret drift (vault value != destination value).

Why a checklist is not enough

#1217 (the post-merge runbook card) is correct as a safety net. It is not correct as the primary control. A checklist:

Requires the operator to remember to run it after every merge.
Produces no signal when skipped.
Cannot catch drift that accumulates between runs.
Generates toil that compounds as PR throughput grows.

Structural controls prevent drift at the point of introduction — a PR that would cause drift fails CI, a direct flag flip that bypasses the queue triggers an alert, a rotated secret distributes automatically. The checklist remains for the residual cases these controls cannot reach.

2. Invariants

These apply to every sub-card this ADR produces:

Audit trail for every prod state change. Any control that touches prod config (flag, secret, asset) must produce a structured log entry: actor, timestamp, resource, old value hash, new value hash.
No stored credentials. The reconciler reads secret hashes or metadata — never raw secret values. Drift detection compares hashes or last-rotation timestamps, not plaintext.
Paper-first gating is unaffected. The live-execution kill-switch is a separate code path and is not in scope for any of these controls.
No prod auto-deploy. The preventive layers block bad state from entering. They do not autonomously push code to prod. Every prod state change remains a deliberate operator action.
Secrets in env/secret stores, never in files. The reconciler reads from Vault and Heroku API. It does not write secret values to any file or log.
GDPR by default. Reconciler logs are operational metadata (timestamps, hash comparisons, resource names) — no PII. Retain for 90 days; subject to the standard DPA-ready log format.
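The audit-trail invariant can be illustrated with a minimal Python sketch. The `AuditEntry` type, its field names, and the JSON serialization are assumptions for illustration, not a format this ADR defines; only the five recorded fields come from the invariant above. The sketch assumes the control making the change already holds the old and new values and hashes them before anything is logged.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


def value_hash(value: str) -> str:
    """SHA-256 hex digest of a value, so the raw value never reaches the log."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AuditEntry:
    # Hypothetical field names; the ADR only lists what must be recorded.
    actor: str
    timestamp: str
    resource: str
    old_value_hash: str
    new_value_hash: str


def make_audit_entry(actor: str, resource: str, old_value: str, new_value: str,
                     now: Optional[datetime] = None) -> AuditEntry:
    """Build an entry that carries hashes only, with a UTC ISO-8601 timestamp."""
    now = now or datetime.now(timezone.utc)
    return AuditEntry(
        actor=actor,
        timestamp=now.isoformat(),
        resource=resource,
        old_value_hash=value_hash(old_value),
        new_value_hash=value_hash(new_value),
    )


def to_log_line(entry: AuditEntry) -> str:
    """Serialize the entry as one JSON object per log line."""
    return json.dumps(asdict(entry), sort_keys=True)
```

An unsalted hash of a low-entropy value such as `true` or `false` is easy to reverse, so this shape suits secrets better than flag values; a real implementation would need to decide how to treat the latter.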

3. Decision

Adopt three preventive layers (A, B, C) before falling back to the reactive checklist. Add a daily reconciler (Layer D) as the universal backstop.

4. Preventive layers

Layer A — CI asset-checksum guard

What it prevents: Silent static asset regressions (favicon revert class).

Mechanism: Every PR that touches console/app/static/, frontend/trademaster_ui/public/, or frontend/status-page/public/ must regenerate a committed asset manifest (docs/asset-manifest.json or per-surface equivalent). The manifest records the SHA-256 of each tracked asset at the time of the PR's last commit. CI compares the manifest diff to the set of changed files: if a static asset file changed but the manifest did not update to reflect the new hash, CI fails with a descriptive message. If an asset was reverted to a prior hash (agent rebase scenario), the manifest diff shows the revert explicitly — the PR description is forced to acknowledge it.
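The manifest comparison described above could look roughly like the following Python sketch. The ADR names a shell script for this step, so the language, the function names, and the flat path-to-hash manifest shape are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path, asset_dirs: list[str]) -> dict[str, str]:
    """Map each tracked asset (path relative to root) to its SHA-256."""
    manifest: dict[str, str] = {}
    for asset_dir in asset_dirs:
        base = root / asset_dir
        if not base.is_dir():
            continue  # a surface that does not exist in this checkout
        for path in sorted(base.rglob("*")):
            if path.is_file():
                manifest[path.relative_to(root).as_posix()] = sha256_of(path)
    return manifest


def check_manifest(committed: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return one prescriptive message per asset that disagrees with the manifest."""
    problems: list[str] = []
    for path in sorted(set(committed) | set(actual)):
        if path not in committed:
            problems.append(f"{path}: asset is not listed in the manifest")
        elif path not in actual:
            problems.append(f"{path}: listed in the manifest but missing on disk")
        elif committed[path] != actual[path]:
            problems.append(f"{path}: hash mismatch, regenerate the manifest")
    return problems
```

A CI step built on this would load the committed manifest, call `build_manifest` over the three tracked directories, and exit non-zero when `check_manifest` returns any message.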

What this does NOT do: It does not compare the manifest against live prod state at PR time (a network call in CI is a flaky dependency). The reconciler (§5) handles the prod-vs-manifest comparison.

Blast radius if it fails: A false-negative (CI passes but asset is silently reverted) would look identical to pre-control behavior — drift is caught by the reconciler. A false-positive (manifest mismatch for a legitimate change) requires the agent or operator to update the manifest; the error message must be prescriptive about how.

Implementation cost: Low. A single CI step (scripts/ci/check-asset-manifest.sh) that walks the target directories, computes SHA-256, and diffs against the committed manifest. The manifest-update script runs as a pre-commit hook and as a PR step.

Layer B — Flag-promotion enforcement gate

What it prevents: Direct heroku config:set FLAG_* on prod apps that bypasses the promotion queue (card-detail popup class).

Mechanism — two sub-controls, either sufficient:

B1 — CI guard: Any PR that touches feature_flags.yaml (the declared flag registry) must include either (a) a corresponding console_flag_promotions row migration, or (b) an explicit NO_PROD_FLAG annotation in the PR body. CI reads the diff, identifies flag additions or changes, and fails if neither condition is met. This ensures that newly declared flags are wired into the promotion queue before they ship.
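A minimal sketch of the B1 decision rule follows. It assumes feature_flags.yaml has already been parsed into a mapping of flag name to definition on both sides of the diff (the file's schema is not specified here), and that the set of flags covered by a promotion-row migration in the PR has been extracted separately; `changed_flags` and `check_flag_pr` are hypothetical names.

```python
def changed_flags(old: dict[str, dict], new: dict[str, dict]) -> set[str]:
    """Flags that were added or whose definition changed in this PR."""
    return {name for name, spec in new.items() if old.get(name) != spec}


def check_flag_pr(old: dict[str, dict], new: dict[str, dict],
                  promoted_in_pr: set[str], pr_body: str) -> list[str]:
    """Return CI failure messages; an empty list means the guard passes."""
    # Escape hatch (b): an explicit annotation in the PR body.
    if "NO_PROD_FLAG" in pr_body:
        return []
    # Condition (a): every added or changed flag needs a promotion row migration.
    missing = changed_flags(old, new) - promoted_in_pr
    return [
        f"{name}: add a console_flag_promotions migration "
        "or annotate the PR body with NO_PROD_FLAG"
        for name in sorted(missing)
    ]
```

Treating the annotation as PR-wide follows the wording above; a per-flag annotation would be a stricter variant.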

B2 — Heroku pre-config-set hook equivalent: Heroku does not support pre-config-set hooks natively. The equivalent is a CI job (flag-drift-check.yml) that runs on a schedule (every 4 hours) and compares the FLAG_* keys currently set on the prod Heroku app against the console_flag_promotions table. Any FLAG_* key on prod that has no promotion record is flagged as an unauthorized direct-set and surfaces as a console alert. Because the job runs after the fact, it cannot block a direct-set, but it makes one visible within 4 hours.
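The comparison at the core of B2 reduces to a set difference. The sketch below assumes the prod config vars have already been fetched as a key-to-value mapping (Heroku's Platform API returns an app's config vars as a JSON object) and that the promoted flag names have been read from console_flag_promotions; the function name is hypothetical.

```python
FLAG_PREFIX = "FLAG_"


def unauthorized_flag_keys(prod_config: dict[str, str], promoted: set[str]) -> list[str]:
    """FLAG_* keys set on prod that have no promotion record."""
    return sorted(
        key for key in prod_config
        if key.startswith(FLAG_PREFIX) and key not in promoted
    )
```

Only key names are inspected, so the check never needs to read or log the values of non-flag config vars.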

B1 prevents the forward path (new flags slipping through without a promotion row). B2 detects unauthorized direct-sets and surfaces them within 4 hours.

Blast radius if it fails: B1 false-positive: CI rejects a PR that legitimately sets a flag without a promotion row (e.g., a CI-only flag). The NO_PROD_FLAG annotation escape hatch handles this. B1 false-negative: a PR adds a flag but the CI diff parser misses it; B2 catches it at the next scheduled run. B2 failure: the scheduled job errors. The alert goes to ops@ per standard job-failure routing; the reconciler (§5) also covers this dimension.

Implementation cost: Medium. B1 requires a YAML-aware diff parser in CI. B2 requires a job that calls the Heroku API and queries the promotions table — both are already accessible to the console's service account.

Layer C — Vault-to-destination auto-distribution (Velvet expansion)

What it prevents: Stale secrets on Heroku / GH Actions after a vault rotation (PAT staleness class, billing destinations class #757).

Mechanism: Velvet's service-bus subscription model (#1224, ADR-0037) already distributes to Heroku config-vars and GH Actions secrets for the subset of secrets enrolled in Velvet. This layer expands the enrollment to cover all rotatable secrets:

Postmark API key
Cloudflare API tokens
Slack bot token
FreeScout API key
Any future vendor API token added to the manifest

When vault emits a secret.rotated event for an enrolled secret, Velvet fans out to all registered destinations for that secret. No manual heroku config:set required. The existing Heroku adapter (#1224) and GH Actions adapter (#1224) handle delivery; this layer is an enrollment expansion, not a new adapter.

Blast radius if it fails: A fan-out failure for one destination leaves that destination stale. Velvet already emits a distribution.failed event on adapter error. The reconciler (§5) will catch the mismatch within 24 hours. The Velvet retry policy (ADR-0038) handles transient failures.
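The fan-out and its per-destination failure isolation can be sketched as below. This is not Velvet's code: the registry shape (secret name mapped to a list of destination identifiers), the `deliver` callback, and the destination naming are assumptions chosen only to show that one failing adapter does not stop delivery to the others.

```python
from typing import Callable


def fan_out(secret_name: str,
            registry: dict[str, list[str]],
            deliver: Callable[[str, str], None]) -> tuple[list[str], list[str]]:
    """Deliver a rotated secret to every registered destination.

    Returns (delivered, failed). `deliver(destination, secret_name)` stands in
    for an adapter call and is expected to raise on error.
    """
    delivered: list[str] = []
    failed: list[str] = []
    for destination in registry.get(secret_name, []):
        try:
            deliver(destination, secret_name)
            delivered.append(destination)
        except Exception:
            # A real system would emit distribution.failed here and let the
            # retry policy handle transient errors.
            failed.append(destination)
    return delivered, failed
```

Under this shape, enrolling a new secret means adding one registry entry, which matches the description of this layer as an enrollment expansion rather than a new adapter.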

Implementation cost: Low-to-medium. For each new secret type, the Velvet consumer manifest gains one entry. The secrets themselves remain in Vault; only metadata (which destinations, which rotation schedule) is added to the manifest.

5. Defense in depth — reconciler as backstop

Layer D — Status reconciler (daily cron)

What it prevents: Accumulated drift that all three preventive layers miss — including drift that predates the controls, drift from manual interventions, and drift from partial failures in A/B/C.

Mechanism: A daily job (cron, 06:00 UTC) that runs three diff checks and surfaces mismatches as console alerts:

Check	Source of truth	Destination	Alert trigger
Secret sync	Vault current value metadata (hash + rotation timestamp)	Heroku config-var metadata (last-set timestamp), GH Actions secret metadata	Mismatch in rotation timestamps or hash
Flag sync	`feature_flags.yaml` declared flags + `console_flag_promotions` table	Heroku `FLAG_*` config-vars on each app	Any `FLAG_*` on prod not in promotions table, or any promoted flag not reflected on Heroku
Asset sync	`docs/asset-manifest.json` at `main` HEAD	CF Pages asset hashes on prod surface	Any file where prod hash != manifest hash

Alerts surface in:

1. Console /status dashboard (new "Drift" section, count badge)
2. ops@ email if any check yields > 0 mismatches (daily digest, not per-mismatch spam)

The reconciler does not auto-remediate. It reports. Remediation is a deliberate operator action (Velvet re-push, flag promotion, asset redeploy).
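The report-only behaviour of the three checks can be sketched as follows. Each check is reduced to comparing two mappings of resource name to a hash or timestamp; the `Mismatch` type, the mismatch kinds, and the report keys are hypothetical, and reporting mismatches in both directions for every check is an assumption that goes slightly beyond the alert triggers in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    check: str      # e.g. "secret_sync", "flag_sync", "asset_sync"
    resource: str   # secret name, flag key, or asset path
    kind: str       # hypothetical mismatch categories, see diff_check


def diff_check(check: str, expected: dict[str, str],
               observed: dict[str, str]) -> list[Mismatch]:
    """Compare source-of-truth metadata against what the destination reports."""
    found: list[Mismatch] = []
    for resource in sorted(set(expected) | set(observed)):
        if resource not in observed:
            kind = "missing_at_destination"
        elif resource not in expected:
            kind = "unexpected_at_destination"
        elif expected[resource] != observed[resource]:
            kind = "value_mismatch"
        else:
            continue
        found.append(Mismatch(check, resource, kind))
    return found


def reconcile(inputs: dict[str, tuple[dict[str, str], dict[str, str]]]) -> dict:
    """Run every check and build a report; nothing is remediated here."""
    mismatches = [
        mismatch
        for check, (expected, observed) in inputs.items()
        for mismatch in diff_check(check, expected, observed)
    ]
    return {
        "drift_count": len(mismatches),    # feeds the dashboard count badge
        "send_digest": bool(mismatches),   # ops@ digest only when > 0 mismatches
        "mismatches": mismatches,
    }
```

The inputs carry hashes and timestamps only, which keeps the sketch consistent with the invariant that the reconciler never handles raw secret values.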

Blast radius if it fails: The reconciler's own failure is a monitoring gap, not a drift event. Standard job-failure alert to ops@ applies. The preventive layers (A/B/C) remain active regardless of reconciler state.

Implementation cost: Medium. Three distinct API calls (Vault, Heroku, CF Pages). Result aggregation and console alert surface. The console already has a /status page with alert infrastructure.

6. Decision matrix

Control	Class of drift prevented	Blast radius (failure mode)	Implementation cost
A: CI asset-checksum guard	Static asset regression (rebase overwrite)	False-negative: caught by reconciler. False-positive: operator updates manifest.	Low
B1: CI flag-promotion guard	New flag skipping promotion queue	False-negative: caught by B2 + reconciler. False-positive: `NO_PROD_FLAG` annotation.	Medium
B2: Flag-drift scheduled check	Direct `FLAG_*` set bypassing queue	Delayed detection (4h window). False-positive: none (purely additive).	Medium
C: Velvet enrollment expansion	Stale secrets post-rotation	Fan-out failure: caught by reconciler within 24h. Velvet retry handles transient.	Low-to-medium
D: Daily reconciler	All drift types, including pre-control and manual drift	Reconciler failure: monitoring gap. Ops@ alert on job failure.	Medium

Runbook from #1217 — role after these controls

The post-merge checklist (#1217) remains the procedure when:

- The operator suspects drift not yet surfaced by the reconciler.
- A drift alert from the reconciler needs operator-directed resolution.
- A control (A/B/C/D) itself has failed and the operator is in recovery mode.

It is not the procedure the operator runs after every merge. That distinction is the core shift this ADR makes.

7. Rejected alternatives

Pure checklist (reactive toil)

The existing #1217 runbook path. Rejected as the primary control because it scales linearly with PR throughput and generates no signal when skipped. Retained as safety net.

Pure trust (operator vigilance)

Relies on the operator to remember every category of drift after every merge. Fragile by construction — the incidents in §1 demonstrate the failure mode. Rejected.

One monolithic drift validator

A single job that checks everything at merge time and blocks the PR if any drift is detected. Rejected because:

- It requires prod-state reads in CI (network dependency, latency, flakiness).
- It conflates different drift types with different remediation paths.
- A single validator failure blocks all PRs regardless of which check failed.

Layered controls fail independently and have independent remediation paths.

Heroku pipeline promotion (Heroku-native)

Heroku pipeline promotions would solve the flag-lag problem at the slug level (promote the entire slug from staging to prod). Rejected because:

- Raxx's staging and prod are structurally separate apps, not slots in a pipeline.
- Flag state and code state would be coupled: a code deploy to prod would also flip all flags, removing the deliberate flag-promotion queue (ADR-0035).
- It does not address secret drift or asset drift.

8. Migration path

This ADR itself makes no schema changes. The reconciler (Layer D) reads existing tables (console_flag_promotions, Velvet secret_registrations). If those tables do not yet have the columns the reconciler needs (e.g., last_hash, last_rotation_timestamp), sub-cards N6 and N7 must include the necessary additive migrations and their rollbacks.

Rollback for each layer:

- A: Delete or gate the CI step behind a SKIP_ASSET_MANIFEST_CHECK env var. Does not affect application state.
- B1: Remove the CI step. Flag manifests remain; no drift introduced.
- B2: Disable the scheduled workflow. No application state affected.
- C: Remove the Velvet manifest entries for the new secrets. Secrets remain in Vault; distribution stops.
- D: Disable the cron job. No drift is introduced; visibility is lost.

9. Rollout plan

Phase	What	Gate
Sprint 1	N4: CI asset-checksum guard	Sub-card, mergeable independently
Sprint 1	N5: Heroku flag-drift check (B2) + CI guard (B1)	Sub-card, depends on `console_flag_promotions` table (#552)
Sprint 1	N6: Velvet enrollment expansion (Postmark, CF, Slack, FreeScout)	Sub-card, depends on #1224 adapters already merged
Sprint 2	N7: Status reconciler cron job + console drift widget	Sub-card, depends on N4/N5/N6 for full coverage
After Sprint 2	Validate: zero drift alerts for 5 business days	Operator sign-off
GA	Retire "run this checklist after every merge" language from #1217; update it as emergency-only procedure	PR against #1217 runbook

10. Security considerations

The reconciler reads secret metadata (hashes, rotation timestamps) — never raw secret values. The Vault API supports metadata-only reads; the reconciler must use that endpoint, not the value endpoint.
Reconciler credentials (Vault token, Heroku API key, CF API token) are rotatable secrets enrolled in Velvet. They must not be hardcoded; they live in the reconciler's own SSM path.
Audit log entries from the reconciler record: job run timestamp, check name, resource identifier, mismatch type, operator-visible summary. No secret values in logs. Retain 90 days.
The CI flag guard (B1) reads feature_flags.yaml and the PR diff — no secrets involved.
The asset manifest (A) contains file paths and SHA-256 hashes of static assets — no PII, no secrets.
Breach scenario: if the reconciler's own Vault token is compromised, an attacker gains metadata visibility into secret rotation schedules. Mitigations: reconciler token is scoped read-only to metadata endpoints; token rotates weekly via Velvet.

PII collected: none. Reconciler operates on system metadata.
Retention: operational logs 90 days. Asset manifests are code artifacts, not personal data.
DSR applicability: none.
Credential replay risk: none (reconciler never reads or logs raw secret values).
Breach notification: reconciler token compromise → Velvet auto-rotates within detection window. ops@ notified.
Kill-switch: each layer (A/B/C/D) can be independently disabled via env var or workflow toggle without a redeploy of application code.

11. Open questions

None block sub-card implementation. One for operator awareness:

Reconciler alert threshold: daily digest vs. immediate. The current design sends a daily ops@ email digest when any check yields mismatches. If a critical secret is stale, a 24-hour delay before the alert is too slow. Consider making secret-drift alerts immediate (Velvet already has this signal) while flag and asset drift stay in the daily digest. Feature-developer should confirm the alert routing during N7 implementation.
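If the split suggested above were adopted, the routing rule could be as small as the sketch below. It illustrates the option under discussion rather than a decided design; the check names and channel labels are hypothetical.

```python
# Checks whose mismatches would bypass the daily digest (assumption: secrets only).
IMMEDIATE_CHECKS = frozenset({"secret_sync"})


def route_alerts(mismatches: list[tuple[str, str]]) -> dict[str, list[tuple[str, str]]]:
    """Partition (check, resource) mismatches into immediate and digest channels."""
    routed: dict[str, list[tuple[str, str]]] = {"immediate": [], "digest": []}
    for check, resource in mismatches:
        channel = "immediate" if check in IMMEDIATE_CHECKS else "digest"
        routed[channel].append((check, resource))
    return routed
```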

12. Action items (sub-cards to file)

Scoped for feature-developer. Do not claim until card-groomer has processed them.

Card	Title	Layer	Depends on	Rough sizing
N4	CI: static asset checksum guard — manifest generation + PR diff check	A	None	S (1–2 days)
N5	Flag-promotion enforcement: CI guard (B1) + scheduled flag-drift check (B2)	B	#552 (`console_flag_promotions` table), #1224 (Heroku API access pattern)	M (2–3 days)
N6	Velvet: expand subscription enrollment to remaining rotatable secrets (Postmark, CF, Slack, FreeScout)	C	#1224 (Heroku + GH Actions adapters merged)	S-M (1–2 days per secret type)
N7	Status reconciler cron job: vault/Heroku/asset drift diffs + console drift widget + ops@ digest	D	N4, N5, N6 for full coverage; can ship partial	L (3–5 days)