ADR-0085 — Flag Reconciler: Bidirectional Sync with Drift-as-Kill-Switch

Status: Accepted
Date: 2026-05-13 UTC
Deciders: Kristerpher (operator)
Relates to: console-flag-promotion-flow.md, ADR-0026 (flag persistence), ADR-0027 (flag env scoping), ADR-0035 (flag promotion staging-to-prod)

Context

Heroku FLAG_* environment variables have been the canonical source of truth for feature flags for approximately 9 months (feedback_bootstrap_via_heroku.md). The console_flag_promotions table records operator intent but has no mechanism to confirm whether that intent was successfully applied to Heroku, nor to detect when Heroku was changed directly via CLI after a promotion.

The existing flag_drift.py reconciler detects mismatches and writes to drift_run_results, but it is read-only and purely advisory. Nothing prevents an operator from acting on stale DB state, and the /console/flags UI can display an incorrect "current state" if Heroku was changed since the last reconciler run.

Three concrete problems:
1. ~50 flags are live on Heroku prod with no corresponding DB rows, so the promotion queue does not reflect reality.
2. A CLI-applied flag change silently diverges from the DB with no gate stopping subsequent UI actions.
3. The console UI has no authoritative way to know whether a "promote" action will succeed (the DB might already agree with Heroku but the row is still marked synced=false from the backfill gap).

Decision

Implement bidirectional sync between console_flag_promotions and Heroku FLAG_* config vars with the following design invariants:

5-minute reconciler. Every 5 minutes the reconciler reads all 4 Heroku apps and diffs them against DB rows. It writes the synced, last_heroku_value, and drift_reason columns on every run.
Drift is a kill-switch. When synced=false, the promote and rollback API endpoints return 409. This is enforced at the service layer, not just the UI.
Operator resolution required. The reconciler never auto-resolves drift. The operator must explicitly invoke mark-synced with a declared winner.
Both sides accept writes. The new POST /flags/<key>/mark-synced endpoint can push DB value to Heroku (winner=db) or adopt Heroku value into DB (winner=heroku). The new POST /flags/<key>/promote endpoint enables UI-triggered Heroku writes in Phase 3.
Backfill as a one-shot script. The ~50 untracked flags are backfilled via a manual idempotent script with a --dry-run flag, not an automatic Alembic data migration, to avoid blocking the migration chain on a Heroku API call.
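
The diff and resolution rules above can be sketched roughly as follows. This is a minimal illustration, not the console's actual code: FlagRow, reconcile, mark_synced, and the drift_reason strings are hypothetical names, and a plain dict stands in for one Heroku app's config vars so that no Heroku API call is assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlagRow:
    """Hypothetical in-memory stand-in for a console_flag_promotions row."""
    key: str
    db_value: str
    synced: bool = False
    last_heroku_value: Optional[str] = None
    drift_reason: Optional[str] = None


def reconcile(rows: list[FlagRow], heroku_vars: dict[str, str]) -> list[str]:
    """Diff DB rows against one app's config vars; never resolves drift itself.

    Returns FLAG_* keys that are live on Heroku but have no DB row
    (candidates for the backfill script).
    """
    for row in rows:
        live = heroku_vars.get(row.key)
        row.last_heroku_value = live
        if live is None:
            row.synced, row.drift_reason = False, "missing_on_heroku"
        elif live != row.db_value:
            row.synced, row.drift_reason = False, "value_mismatch"
        else:
            row.synced, row.drift_reason = True, None
    tracked = {row.key for row in rows}
    return sorted(
        k for k in heroku_vars if k.startswith("FLAG_") and k not in tracked
    )


def mark_synced(row: FlagRow, winner: str, heroku_vars: dict[str, str]) -> None:
    """Apply an operator-declared resolution; winner is 'db' or 'heroku'."""
    if winner == "db":
        # Stands in for the Heroku config-var write on the winner=db path.
        heroku_vars[row.key] = row.db_value
    elif winner == "heroku":
        if row.key not in heroku_vars:
            raise ValueError("cannot adopt a value Heroku does not have")
        row.db_value = heroku_vars[row.key]
    else:
        raise ValueError(f"unknown winner: {winner!r}")
    row.last_heroku_value = heroku_vars[row.key]
    row.synced, row.drift_reason = True, None
```

Keeping reconcile free of any write to Heroku or to db_value mirrors the invariant that only an explicit mark-synced call with a declared winner clears drift.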

Consequences

Positive:
- The promotion queue becomes authoritative and reflects what is actually running on Heroku.
- Operators have a clear, audited path to resolve drift without resorting to CLI commands.
- The drift kill-switch prevents a class of bugs where UI actions are taken on stale DB state.
- The mark-synced audit trail supports forensics when drift is detected.

Negative / trade-offs:
- The kill-switch adds friction: if the reconciler incorrectly marks a row drifted (e.g., Heroku API returns stale data), the operator must resolve it manually. The 5-minute window and the misfire_grace_time mitigate transient flaps, but do not eliminate them.
- The backfill must be run manually in each environment. Until it runs, all existing rows have synced=false (the migration 0055 default), which means ALL existing flags would trip the kill-switch immediately after migration. The deployment order must be: (1) migration 0055, (2) backfill script, (3) enable kill-switch enforcement. The kill-switch is NOT enforced until Phase 2 of the rollout (post-backfill).
- Introducing Heroku API writes from the console service increases the blast radius of a compromised console. Mitigated by TOTP elevation requirement and per-app token scoping.
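
Because enforcement has to stay off until the backfill has run, the gate needs some switch that is flipped only in Phase 2. A minimal sketch of such a gate, assuming a hypothetical enforce_kill_switch setting and hypothetical names (DriftConflict, FlagState, assert_actionable) that do not come from the console codebase:

```python
from dataclasses import dataclass


class DriftConflict(Exception):
    """Raised for a drifted row; an API layer would map this to HTTP 409."""
    status_code = 409


@dataclass
class FlagState:
    """Hypothetical minimal view of a promotion row."""
    key: str
    synced: bool


def assert_actionable(flag: FlagState, enforce_kill_switch: bool) -> None:
    """Service-layer gate to call before promote or rollback.

    enforce_kill_switch is an assumed setting: False after migration 0055
    and during the backfill, True once Phase 2 begins.
    """
    if enforce_kill_switch and not flag.synced:
        raise DriftConflict(
            f"{flag.key} is drifted; resolve it via mark-synced first"
        )
```

With the setting off, a freshly migrated row (synced=false by default) passes the gate; with it on, the same row is refused until an operator resolves the drift.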

Alternatives Considered

Option A: Read-only reconciler with advisory warnings only (current state)

The existing flag_drift.py already does this. Rejected because it has not reduced drift over 9 months of operation. Advisory warnings without enforcement are ignored or acted on too slowly.

Option B: Heroku as the single authoritative source; DB is a cache only

Invert the current model: the console reads from Heroku directly, DB rows are a projection. Rejected because: (1) it makes audit history dependent on Heroku API availability; (2) the promotion queue workflow (soak periods, approval gates) requires DB state; (3) Heroku API latency would make every flag page load dependent on an external call.

Option C: Auto-reconcile — reconciler silently pushes DB value to Heroku on mismatch

Reconciler detects drift and immediately calls Heroku PATCH with the DB value, no operator intervention needed. Rejected because: (1) it assumes DB is always correct, which is false during the backfill gap; (2) it can silently undo an operator's intentional CLI change; (3) it violates the principle that every Heroku write is attributed to an actor. Auto-reconcile may be reconsidered for Phase 4 (post-v1) once the DB is authoritative and the backfill is complete.

Option D: Separate "sync service" microservice

Extract the reconciler and Heroku write logic to a standalone service (like Velvet). Rejected for v1 because it adds operational overhead before the existing console is even fully trusted. The reconciler is scoped to console internals and belongs in the console service for now.

Security Notes

No FLAG_* values are written to logs. Heroku API tokens never appear in log output; only a truncated token hash (SHA-256[:12]) is logged.
Mark-synced winner=db path requires TOTP elevation (same bar as promote/reject).
Heroku API tokens live in Infisical. Rotatable without redeploy.
The kill-switch gate is enforced in PromotionService (service layer), not just templates, preventing bypass via direct API calls.
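
The logging rule above could look roughly like the following. This is a sketch under assumptions: it takes "SHA-256[:12]" to mean the first 12 characters of the hex digest, and token_fingerprint, log_heroku_write, and the logger name are hypothetical. Logging the flag key (but never its value) is also an assumption.

```python
import hashlib
import logging

logger = logging.getLogger("flag_reconciler")  # hypothetical logger name


def token_fingerprint(token: str) -> str:
    """Return the first 12 hex characters of the token's SHA-256 digest."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]


def log_heroku_write(app: str, flag_key: str, token: str) -> str:
    """Log a Heroku write without the flag value or the raw token."""
    message = (
        f"heroku write app={app} flag={flag_key} "
        f"token_sha256={token_fingerprint(token)}"
    )
    logger.info(message)
    return message
```

The fingerprint lets an operator correlate log lines with a specific token across a rotation without the token itself ever being written out.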