ADR 0051 — Drift prevention: layered structural controls
Status: Accepted
Date: 2026-05-05 UTC
Refs: ADR-0050 (trunk SDLC), ADR-0035 (flag promotion queue), #757 (billing destinations drift), #947/#1224 (Velvet Heroku + GH Actions adapters), #552 (flag promotion DB schema), #1217 (reactive post-merge checklist; this ADR augments it)
TL;DR
Drift is the operator's number-one recurring pain. ADR-0050 affirmed trunk-based development and diagnosed three gap types (asset hash divergence, flag promotion lag, secret staleness); it closed them with a post-merge checklist (H2) and two dashboard features. This ADR sits one layer upstream: it replaces the checklist as the primary control with three structural, automated prevention layers (CI asset-checksum guard, flag-promotion enforcement gate, Vault-to-destination auto-distribution) and adds a daily reconciler as a backstop that catches anything the preventive layers miss. The checklist from #1217 is retained as a rare-case safety net — not as the first line of defense.
Concretely: a PR that silently reverts a favicon should fail CI before merge. A direct heroku config:set FLAG_* on a prod app without a console promotion row should be blocked or flagged. A rotated Postmark key should fan out to Heroku and GH Actions automatically — no operator heroku config:set required. A daily job should surface any mismatch that slipped through. Those four outcomes constitute the decision.
1. Context
Drift incidents that drove this ADR
| Incident | Date | What drifted | Why it slipped through |
|---|---|---|---|
| Favicon revert | ~2026-05 | Static asset hash on prod reverted to a prior version | Agent rebase silently overwrote a merged commit; no CI check compared prod asset state |
| Card-detail popup missing from prod | 2026-05-06 | FLAG_CONSOLE_INVESTIGATE_FROM_STATUS on staging, never on prod | Manual heroku config:set bypassed the promotion queue; no guard existed |
| Heroku config-var stale (PAT) | 2026-05 | Vault rotated, Heroku stale | Velvet distribution loop not yet wired to all rotatable secrets |
| Billing destinations | #757 | Multiple destinations out of sync | No automated sync; operator manually maintained per-destination config |
What ADR-0050 already decided
ADR-0050 retained trunk-based development, rejected Gitflow, and prescribed four hardening steps (H1–H4): enable the production environment required-reviewer gate, write a post-merge checklist, add a flag-drift dashboard widget, and add a stale-branch CI guard. Sub-cards #N1/#N2/#N3 (from ADR-0050) cover those.
ADR-0050 also explicitly defined drift across three dimensions:
- Code drift: hours since last prod deploy vs. last merge to main
- Flag drift: flags where staging_enabled = true AND prod_enabled = false
- Static asset drift: prod content-hash != main's last-built hash
This ADR targets all three with preventive controls, then adds a fourth drift dimension: secret drift (vault value != destination value).
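The four drift dimensions can be written as small predicates. This is a minimal sketch, not the console's implementation: the `FlagState` type and all function names are hypothetical, and "code drift" is read here as the time main has been ahead of prod, which is one interpretation of the definition above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FlagState:
    """Hypothetical row shape: one flag with its staging and prod state."""
    name: str
    staging_enabled: bool
    prod_enabled: bool


def code_drift_hours(last_merge: datetime, last_prod_deploy: datetime) -> float:
    """Hours that main has been ahead of prod; 0.0 when prod is current.

    Both datetimes must be in the same timezone (the ADR uses UTC).
    """
    delta = last_merge - last_prod_deploy
    return max(delta.total_seconds() / 3600.0, 0.0)


def drifted_flags(flags: list[FlagState]) -> list[str]:
    """Flags where staging_enabled = true AND prod_enabled = false."""
    return [f.name for f in flags if f.staging_enabled and not f.prod_enabled]


def asset_drifted(prod_hash: str, main_built_hash: str) -> bool:
    """Static asset drift: prod content-hash differs from main's last build."""
    return prod_hash != main_built_hash


def secret_drifted(vault_hash: str, destination_hash: str) -> bool:
    """Secret drift, compared by hash only so no plaintext is handled."""
    return vault_hash != destination_hash
```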
Why a checklist is not enough
#1217 (the post-merge runbook card) is correct as a safety net. It is not correct as the primary control. A checklist:
- Requires the operator to remember to run it after every merge.
- Produces no signal when skipped.
- Cannot catch drift that accumulates between runs.
- Generates toil that compounds as PR throughput grows.
Structural controls prevent drift at the point of introduction — a PR that would cause drift fails CI, a direct flag flip that bypasses the queue triggers an alert, a rotated secret distributes automatically. The checklist remains for the residual cases these controls cannot reach.
2. Invariants
These apply to every sub-card this ADR produces:
- Audit trail for every prod state change. Any control that touches prod config (flag, secret, asset) must produce a structured log entry: actor, timestamp, resource, old value hash, new value hash.
- No stored credentials. The reconciler reads secret hashes or metadata — never raw secret values. Drift detection compares hashes or last-rotation timestamps, not plaintext.
- Paper-first gating is unaffected. The live-execution kill-switch is a separate code path and is not in scope for any of these controls.
- No prod auto-deploy. The preventive layers block bad state from entering. They do not autonomously push code to prod. Every prod state change remains a deliberate operator action.
- Secrets in env/secret stores, never in files. The reconciler reads from Vault and Heroku API. It does not write secret values to any file or log.
- GDPR by default. Reconciler logs are operational metadata (timestamps, hash comparisons, resource names) — no PII. Retain for 90 days; subject to the standard DPA-ready log format.
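The audit-trail invariant names five fields: actor, timestamp, resource, old value hash, new value hash. A minimal sketch of such an entry follows. The `AuditEntry` type, the helper names, and the JSON-lines output are assumptions for illustration; the entry accepts hashes only, so a caller never needs to pass a plaintext secret into the logging path.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


def value_hash(value: str) -> str:
    """SHA-256 hex digest of a non-secret value such as a flag setting."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class AuditEntry:
    """Hypothetical structured log entry carrying the fields from the invariant."""
    actor: str
    timestamp: str
    resource: str
    old_value_hash: str
    new_value_hash: str


def audit_entry(actor: str, resource: str, old_value_hash: str,
                new_value_hash: str, now: Optional[datetime] = None) -> AuditEntry:
    """Build an entry from precomputed hashes, timestamped in UTC."""
    now = now or datetime.now(timezone.utc)
    return AuditEntry(actor, now.isoformat(), resource, old_value_hash, new_value_hash)


def to_log_line(entry: AuditEntry) -> str:
    """Serialize to one JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)
```

For a flag flip, a caller might write `audit_entry("operator", "FLAG_X", value_hash("false"), value_hash("true"))`. For secrets, the hashes would come from wherever the value is already legitimately held, not from the reconciler.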
3. Decision
Adopt three preventive layers (A, B, C) as the primary controls, ahead of the reactive checklist. Add a daily reconciler (Layer D) as the universal backstop.
4. Preventive layers
Layer A — CI asset-checksum guard
What it prevents: Silent static asset regressions (favicon revert class).
Mechanism: Every PR that touches console/app/static/, frontend/trademaster_ui/public/, or frontend/status-page/public/ must regenerate a committed asset manifest (docs/asset-manifest.json or per-surface equivalent). The manifest records the SHA-256 of each tracked asset at the time of the PR's last commit. CI compares the manifest diff to the set of changed files: if a static asset file changed but the manifest did not update to reflect the new hash, CI fails with a descriptive message. If an asset was reverted to a prior hash (agent rebase scenario), the manifest diff shows the revert explicitly — the PR description is forced to acknowledge it.
What this does NOT do: It does not compare the manifest against live prod state at PR time (a network call in CI is a flaky dependency). The reconciler (§5) handles the prod-vs-manifest comparison.
Blast radius if it fails: A false-negative (CI passes but asset is silently reverted) would look identical to pre-control behavior — drift is caught by the reconciler. A false-positive (manifest mismatch for a legitimate change) requires the agent or operator to update the manifest; the error message must be prescriptive about how.
Implementation cost: Low. A single CI step (scripts/ci/check-asset-manifest.sh) that walks the target directories, computes SHA-256, and diffs against the committed manifest. The manifest-update script runs as a pre-commit hook and as a PR step.
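The walk-hash-diff logic of the CI step can be sketched as follows. The ADR names a shell script; this sketch uses Python only to show the logic. It assumes the manifest is a flat JSON object mapping repo-relative paths to SHA-256 hex digests (the ADR does not fix a layout), and every function name is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 so large assets are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, tracked_dirs: list[str]) -> dict[str, str]:
    """Map each tracked file (repo-relative, POSIX style) to its SHA-256."""
    manifest: dict[str, str] = {}
    for rel_dir in tracked_dirs:
        base = root / rel_dir
        if not base.is_dir():
            continue  # a surface may not exist in every checkout
        for path in sorted(base.rglob("*")):
            if path.is_file():
                manifest[path.relative_to(root).as_posix()] = sha256_file(path)
    return manifest


def manifest_problems(committed: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return one prescriptive message per mismatch; an empty list means pass."""
    problems = []
    for path in sorted(set(committed) | set(actual)):
        if path not in committed:
            problems.append(f"{path}: not in manifest")
        elif path not in actual:
            problems.append(f"{path}: in manifest but missing on disk")
        elif committed[path] != actual[path]:
            problems.append(f"{path}: hash differs from manifest")
    return problems
```

A CI wrapper would load the committed manifest with `json.load`, call `build_manifest` on the three tracked directories, and exit non-zero if `manifest_problems` returns anything, printing each message along with the command that regenerates the manifest.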
Layer B — Flag-promotion enforcement gate
What it prevents: Direct heroku config:set FLAG_* on prod apps that bypasses the promotion queue (card-detail popup class).
Mechanism: two complementary sub-controls, B1 for the PR path and B2 for direct sets:
B1 — CI guard: Any PR that touches feature_flags.yaml (the declared flag registry) must include either (a) a corresponding console_flag_promotions row migration, or (b) an explicit NO_PROD_FLAG annotation in the PR body. CI reads the diff, identifies flag additions or changes, and fails if neither condition is met. This ensures that newly declared flags are wired into the promotion queue before they ship.
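The B1 decision rule can be sketched as a pure function over already-parsed inputs. This assumes feature_flags.yaml parses to a mapping of flag name to settings, and that a promotion-row migration can be recognized by `console_flag_promotions` appearing in a changed file path. Both are illustrative assumptions, as are the function names.

```python
def changed_flags(old: dict[str, dict], new: dict[str, dict]) -> list[str]:
    """Flags that were added or whose settings changed between base and head."""
    return sorted(name for name, spec in new.items() if old.get(name) != spec)


def b1_violations(old_flags: dict[str, dict], new_flags: dict[str, dict],
                  changed_paths: list[str], pr_body: str) -> list[str]:
    """Return failure messages; an empty list means the PR passes the guard."""
    flags = changed_flags(old_flags, new_flags)
    if not flags:
        return []
    # Hypothetical heuristic: a migration touching the promotions table
    # is identified by its file path.
    has_migration = any("console_flag_promotions" in p for p in changed_paths)
    has_annotation = "NO_PROD_FLAG" in pr_body
    if has_migration or has_annotation:
        return []
    return [
        f"{name}: added or changed without a console_flag_promotions "
        f"migration or a NO_PROD_FLAG annotation in the PR body"
        for name in flags
    ]
```

Keeping the rule free of I/O means the YAML parsing and the PR-metadata lookup can be tested separately from the pass/fail decision.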
B2 (Heroku pre-config-set hook equivalent): Heroku does not support pre-config-set hooks natively. The equivalent is a CI job (flag-drift-check.yml) that runs on a schedule (every 4 hours) and compares the FLAG_* keys currently set on the prod Heroku app against the console_flag_promotions table. Any FLAG_* key on prod that has no promotion record is flagged as an unauthorized direct-set and surfaces as a console alert. This does not block the direct-set; it detects it after the fact and makes it visible at the next scheduled run, at most 4 hours later.
B1 prevents the forward path (new flags slipping through without a promotion row). B2 detects unauthorized direct-sets and surfaces them within 4 hours.
Blast radius if it fails: B1 false-positive: CI rejects a PR that legitimately sets a flag without a promotion row (e.g., a CI-only flag); the NO_PROD_FLAG annotation escape hatch handles this. B1 false-negative: the CI diff parser misses a flag that a PR adds; B2 catches it at the next scheduled run. B2 failure: the scheduled job errors; the alert goes to ops@ per standard job-failure routing, and the reconciler (§5) also covers this dimension.
Implementation cost: Medium. B1 requires a YAML-aware diff parser in CI. B2 requires a job that calls the Heroku API and queries the promotions table — both are already accessible to the console's service account.
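The core of the B2 check is a set difference. A minimal sketch, assuming the prod config-vars have already been fetched as a key-to-value mapping and the promoted flag names have been read from console_flag_promotions; the function name is hypothetical.

```python
FLAG_PREFIX = "FLAG_"


def unauthorized_flags(prod_config_vars: dict[str, str], promoted: set[str]) -> list[str]:
    """FLAG_* keys present on prod that have no promotion record."""
    return sorted(
        key for key in prod_config_vars
        if key.startswith(FLAG_PREFIX) and key not in promoted
    )


# Example: FLAG_Y was set directly on prod and never promoted.
alerts = unauthorized_flags(
    {"FLAG_X": "true", "FLAG_Y": "true", "DATABASE_URL": "redacted"},
    promoted={"FLAG_X"},
)
```

Only key names are inspected, so the check does not need to log or compare config-var values.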
Layer C — Vault-to-destination auto-distribution (Velvet expansion)
What it prevents: Stale secrets on Heroku / GH Actions after a vault rotation (PAT staleness class, billing destinations class #757).
Mechanism: Velvet's service-bus subscription model (#1224, ADR-0037) already distributes to Heroku config-vars and GH Actions secrets for the subset of secrets enrolled in Velvet. This layer expands the enrollment to cover all rotatable secrets:
- Postmark API key
- Cloudflare API tokens
- Slack bot token
- FreeScout API key
- Any future vendor API token added to the manifest
When vault emits a secret.rotated event for an enrolled secret, Velvet fans out to all registered destinations for that secret. No manual heroku config:set required. The existing Heroku adapter (#1224) and GH Actions adapter (#1224) handle delivery; this layer is an enrollment expansion, not a new adapter.
Blast radius if it fails: A fan-out failure for one destination leaves that destination stale. Velvet already emits a distribution.failed event on adapter error. The reconciler (§5) will catch the mismatch within 24 hours. The Velvet retry policy (ADR-0038) handles transient failures.
Implementation cost: Low-to-medium. For each new secret type, the Velvet consumer manifest gains one entry. The secrets themselves remain in Vault; only metadata (which destinations, which rotation schedule) is added to the manifest.
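The fan-out behaviour described above can be sketched as a loop over registered destinations that isolates per-destination failures. This is not Velvet's code: the manifest shape (secret name to a list of `kind:target` strings), the adapter call signature, and the result type are assumptions. The adapter is expected to read the value from Vault itself, so the sketch never handles a secret value.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical adapter signature: (secret_name, destination) -> None, raises on failure.
Adapter = Callable[[str, str], None]


@dataclass
class FanOutResult:
    delivered: list[str] = field(default_factory=list)
    failed: list[dict] = field(default_factory=list)


def fan_out(secret_name: str, manifest: dict[str, list[str]],
            adapters: dict[str, Adapter]) -> FanOutResult:
    """Handle one secret.rotated event: push to every registered destination.

    A failure at one destination is recorded as a distribution.failed
    event and does not stop delivery to the others.
    """
    result = FanOutResult()
    for destination in manifest.get(secret_name, []):
        kind = destination.split(":", 1)[0]  # e.g. "heroku" or "gh_actions"
        try:
            adapters[kind](secret_name, destination)
            result.delivered.append(destination)
        except Exception as exc:  # an unknown adapter kind lands here too
            result.failed.append({
                "event": "distribution.failed",
                "secret": secret_name,
                "destination": destination,
                "error": str(exc),
            })
    return result
```

Under this shape, enrolling a new secret is a data change: one more key in the manifest, with no new adapter code.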
5. Defense in depth — reconciler as backstop
Layer D — Status reconciler (daily cron)
What it prevents: Accumulated drift that all three preventive layers miss — including drift that predates the controls, drift from manual interventions, and drift from partial failures in A/B/C.
Mechanism: A daily job (cron, 06:00 UTC) that runs three diff checks and surfaces mismatches as console alerts:
| Check | Source of truth | Destination | Alert trigger |
|---|---|---|---|
| Secret sync | Vault current value metadata (hash + rotation timestamp) | Heroku config-var metadata (last-set timestamp), GH Actions secret metadata | Mismatch in rotation timestamps or hash |
| Flag sync | feature_flags.yaml declared flags + console_flag_promotions table | Heroku FLAG_* config-vars on each app | Any FLAG_* on prod not in promotions table, or any promoted flag not reflected on Heroku |
| Asset sync | docs/asset-manifest.json at main HEAD | CF Pages asset hashes on prod surface | Any file where prod hash != manifest hash |
Alerts surface in:
1. Console /status dashboard (new "Drift" section, count badge)
2. ops@ email if any check yields > 0 mismatches (daily digest, not per-mismatch spam)
The reconciler does not auto-remediate. It reports. Remediation is a deliberate operator action (Velvet re-push, flag promotion, asset redeploy).
Blast radius if it fails: The reconciler's own failure is a monitoring gap, not a drift event. Standard job-failure alert to ops@ applies. The preventive layers (A/B/C) remain active regardless of reconciler state.
Implementation cost: Medium. Three distinct API calls (Vault, Heroku, CF Pages). Result aggregation and console alert surface. The console already has a /status page with alert infrastructure.
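All three checks reduce to the same operation: compare a source-of-truth mapping with a destination mapping and report differences. A minimal sketch of that core plus the digest step follows. The `Mismatch` type, the check names, and the digest wording are hypothetical, and the inputs are assumed to be already fetched as resource-to-metadata mappings (hashes or timestamps, never raw secret values).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mismatch:
    check: str     # e.g. "secret_sync", "flag_sync", "asset_sync"
    resource: str
    kind: str      # what differs, never the value itself


def diff_check(check: str, source: dict[str, str],
               destination: dict[str, str]) -> list[Mismatch]:
    """Compare source of truth to destination, reporting both directions."""
    out = []
    for resource in sorted(set(source) | set(destination)):
        if resource not in destination:
            out.append(Mismatch(check, resource, "missing_at_destination"))
        elif resource not in source:
            out.append(Mismatch(check, resource, "unexpected_at_destination"))
        elif source[resource] != destination[resource]:
            out.append(Mismatch(check, resource, "value_mismatch"))
    return out


def reconcile(checks: dict[str, tuple[dict[str, str], dict[str, str]]]) -> list[Mismatch]:
    """Run every named check and collect mismatches; report only, no remediation."""
    found: list[Mismatch] = []
    for name, (source, destination) in checks.items():
        found.extend(diff_check(name, source, destination))
    return found


def digest(mismatches: list[Mismatch]) -> Optional[str]:
    """One digest body per run, or None when there is nothing to send."""
    if not mismatches:
        return None
    lines = [f"{len(mismatches)} drift mismatch(es):"]
    lines += [f"- [{m.check}] {m.resource}: {m.kind}" for m in mismatches]
    return "\n".join(lines)
```

Returning `None` for a clean run matches the "digest only if any check yields > 0 mismatches" rule, and the mismatch count can feed the /status badge directly.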
6. Decision matrix
| Control | Class of drift prevented | Blast radius (failure mode) | Implementation cost |
|---|---|---|---|
| A: CI asset-checksum guard | Static asset regression (rebase overwrite) | False-negative: caught by reconciler. False-positive: operator updates manifest. | Low |
| B1: CI flag-promotion guard | New flag skipping promotion queue | False-negative: caught by B2 + reconciler. False-positive: NO_PROD_FLAG annotation. | Medium |
| B2: Flag-drift scheduled check | Direct FLAG_* set bypassing queue | Delayed detection (4h window). False-positive: none (purely additive). | Medium |
| C: Velvet enrollment expansion | Stale secrets post-rotation | Fan-out failure: caught by reconciler within 24h. Velvet retry handles transient. | Low-to-medium |
| D: Daily reconciler | All drift types, including pre-control and manual drift | Reconciler failure: monitoring gap. Ops@ alert on job failure. | Medium |
Runbook from #1217 — role after these controls
The post-merge checklist (#1217) remains the procedure when:
- The operator suspects drift not yet surfaced by the reconciler.
- A drift alert from the reconciler needs operator-directed resolution.
- A control (A/B/C/D) itself has failed and the operator is in recovery mode.
It is not the procedure the operator runs after every merge. That distinction is the core shift this ADR makes.
7. Rejected alternatives
Pure checklist (reactive toil)
The existing #1217 runbook path. Rejected as the primary control because it scales linearly with PR throughput and generates no signal when skipped. Retained as safety net.
Pure trust (operator vigilance)
Relies on the operator to remember every category of drift after every merge. Fragile by construction — the incidents in §1 demonstrate the failure mode. Rejected.
One monolithic drift validator
A single job that checks everything at merge time and blocks the PR if any drift is detected. Rejected because:
- It requires prod-state reads in CI (network dependency, latency, flakiness).
- It conflates different drift types with different remediation paths.
- A single validator failure blocks all PRs regardless of which check failed.
Layered controls fail independently and have independent remediation paths.
Heroku pipeline promotion (Heroku-native)
Heroku pipeline promotions would solve the flag-lag problem at the slug level (promote the entire slug from staging to prod). Rejected because:
- Raxx's staging and prod are structurally separate apps, not slots in a pipeline.
- Flag state and code state would be coupled: a code deploy to prod would also flip all flags, removing the deliberate flag-promotion queue (ADR-0035).
- It does not address secret drift or asset drift.
8. Migration path
No schema changes in this ADR. The reconciler (Layer D) reads existing tables (console_flag_promotions, Velvet secret_registrations). If those tables do not yet have the columns the reconciler needs (e.g., last_hash, last_rotation_timestamp), the sub-cards for N6 and N7 must include the necessary additive migrations and their rollbacks.
Rollback for each layer:
- A: Delete or gate the CI step behind a SKIP_ASSET_MANIFEST_CHECK env var. Does not affect application state.
- B1: Remove the CI step. Flag manifests remain; no drift introduced.
- B2: Disable the scheduled workflow. No application state affected.
- C: Remove the Velvet manifest entries for the new secrets. Secrets remain in Vault; distribution stops.
- D: Disable the cron job. No drift is introduced; visibility is lost.
9. Rollout plan
| Phase | What | Gate |
|---|---|---|
| Sprint 1 | N4: CI asset-checksum guard | Sub-card, mergeable independently |
| Sprint 1 | N5: Heroku flag-drift check (B2) + CI guard (B1) | Sub-card, depends on console_flag_promotions table (#552) |
| Sprint 1 | N6: Velvet enrollment expansion (Postmark, CF, Slack, FreeScout) | Sub-card, depends on #1224 adapters already merged |
| Sprint 2 | N7: Status reconciler cron job + console drift widget | Sub-card, depends on N4/N5/N6 for full coverage |
| After Sprint 2 | Validate: zero drift alerts for 5 business days | Operator sign-off |
| GA | Retire "run this checklist after every merge" language from #1217; update it as emergency-only procedure | PR against #1217 runbook |
10. Security considerations
- The reconciler reads secret metadata (hashes, rotation timestamps) — never raw secret values. The Vault API supports metadata-only reads; the reconciler must use that endpoint, not the value endpoint.
- Reconciler credentials (Vault token, Heroku API key, CF API token) are rotatable secrets enrolled in Velvet. They must not be hardcoded; they live in the reconciler's own SSM path.
- Audit log entries from the reconciler record: job run timestamp, check name, resource identifier, mismatch type, operator-visible summary. No secret values in logs. Retain 90 days.
- The CI flag guard (B1) reads feature_flags.yaml and the PR diff; no secrets are involved.
- The asset manifest (A) contains file paths and SHA-256 hashes of static assets: no PII, no secrets.
- Breach scenario: if the reconciler's own Vault token is compromised, an attacker gains metadata visibility into secret rotation schedules. Mitigations: reconciler token is scoped read-only to metadata endpoints; token rotates weekly via Velvet.
GDPR checklist
- PII collected: none. Reconciler operates on system metadata.
- Retention: operational logs 90 days. Asset manifests are code artifacts, not personal data.
- DSR applicability: none.
- Credential replay risk: none (reconciler never reads or logs raw secret values).
- Breach notification: reconciler token compromise → Velvet auto-rotates within detection window. ops@ notified.
- Kill-switch: each layer (A/B/C/D) can be independently disabled via env var or workflow toggle without a redeploy of application code.
11. Open questions
None block sub-card implementation. One for operator awareness:
- Reconciler alert threshold: daily digest vs. immediate. The current design sends a daily ops@ email digest when any check yields mismatches. If a critical secret is stale, a 24-hour delay before the alert is too slow. Consider making secret-drift alerts immediate (Velvet already has this signal) while flag and asset drift stay in the daily digest. Feature-developer should confirm the alert routing in the N7 implementation.
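The routing option floated in the open question (secret drift immediate, everything else in the daily digest) could look like the sketch below. It illustrates one possible answer, not a decided design; the `check` key, the check name `secret_sync`, and the route names are hypothetical.

```python
def route_alerts(mismatches: list[dict]) -> dict[str, list[dict]]:
    """Split mismatches into an immediate route and the daily digest.

    Assumption: each mismatch is a dict with at least a "check" key,
    and secret drift is reported under the check name "secret_sync".
    """
    routes: dict[str, list[dict]] = {"immediate": [], "daily_digest": []}
    for mismatch in mismatches:
        key = "immediate" if mismatch["check"] == "secret_sync" else "daily_digest"
        routes[key].append(mismatch)
    return routes
```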
12. Action items (sub-cards to file)
Scoped for feature-developer. Do not claim until card-groomer has processed them.
| Card | Title | Layer | Depends on | Rough sizing |
|---|---|---|---|---|
| N4 | CI: static asset checksum guard — manifest generation + PR diff check | A | None | S (1–2 days) |
| N5 | Flag-promotion enforcement: CI guard (B1) + scheduled flag-drift check (B2) | B | #552 (console_flag_promotions table), #1224 (Heroku API access pattern) | M (2–3 days) |
| N6 | Velvet: expand subscription enrollment to remaining rotatable secrets (Postmark, CF, Slack, FreeScout) | C | #1224 (Heroku + GH Actions adapters merged) | S-M (1–2 days per secret type) |
| N7 | Status reconciler cron job: vault/Heroku/asset drift diffs + console drift widget + ops@ digest | D | N4, N5, N6 for full coverage; can ship partial | L (3–5 days) |