Flag Reconciler — Bidirectional Sync (Drift-as-Kill-Switch)
Status: Proposed
Owner: software-architect
Date: 2026-05-13 UTC
Related docs:
- console-flag-promotion-flow.md — prior staging-to-prod promotion design
- [ADR-0085](https://internal-docs.raxx.app/architecture/adr/0085-flag-reconciler-bidirectional-sync.html) — decision record for this design
1. Goal
Establish a bidirectional, audited sync between console_flag_promotions (the DB source of intent) and Heroku FLAG_* environment variables (the runtime source of truth) across all four Raxx apps (raxx-console-staging, raxx-console-prod, raxx-api-staging, raxx-api-prod). A 5-minute reconciler job compares both sides and marks rows synced=true or synced=false. When drift is detected, the console UI disables promote/rollback actions for the affected flag and surfaces a banner with a side-by-side diff. The operator resolves drift by choosing which side wins ("DB is correct" or "Heroku is correct") via a "Mark as synced" action. Only after drift is resolved does the normal promotion workflow resume. Drift is therefore a kill-switch on forward progress — not a warning.
2. Current State and Problem
Heroku has been the canonical source of truth for feature flags for approximately 9 months. Flag values are set via heroku config:set directly (per feedback_bootstrap_via_heroku.md). The console_flag_promotions table exists and records intent (pending, staged, deploying, deployed, failed) but the synced column does not exist. The /console/flags page is gated behind FLAG_CONSOLE_FLAG_PROMOTIONS which is currently OFF on both staging and prod (per project_console_flag_toggle.md), meaning all flag flips still go via CLI.
This creates three types of drift today:
- Orphan-on-Heroku: A FLAG_* var exists on Heroku with no corresponding DB promotion row. This is the dominant case (~50+ flags set in prod; ~7 with DB rows). The flag_drift reconciler (console/app/services/drift_reconcilers/flag_drift.py) already detects this as orphan_on_heroku and writes to drift_run_results, but takes no action beyond alerting.
- Missing-from-Heroku: A promoted DB row exists but Heroku never got the value. The reconciler detects this as missing_from_heroku at severity critical.
- Value-mismatch: DB row says true; Heroku says false (or vice versa). The reconciler detects this as value_mismatch.
The existing reconciler is read-only. This design makes it write-capable (mark synced) and wires it into a UI-level kill-switch.
3. Invariants
All platform invariants apply. Reconciler-specific:
- No stored credentials. The Heroku API token lives in the env/secret store. The reconciler reads it at call time. It must not be written to any DB column, log line, or response body. Log only a SHA-256[:12] hash for token health checks.
- Audit trail for every state change. Every write to console_flag_promotions (promote, rollback, mark-synced) emits a row to console_audit_log. The reconciler's automated updates (setting synced, updating last_heroku_value) also emit audit rows.
- Drift is a kill-switch, not a warning. When synced=false, the promote and rollback actions for that flag are disabled at the API layer (not just the UI layer). A request to promote a drifted row returns 409 {"error": "flag_drifted"}.
- Mark-synced requires operator intent. Automated resolution is not permitted. The operator must explicitly choose which side wins. Auto-reconcile (reconciler silently pushing DB value to Heroku) is out of scope for v1.
- Paper-first gating is not a feature flag. The reconciler must never manage the paper-first gate. Any FLAG_* var that corresponds to a paper-first path must be treated as untouchable and logged with a paper_first_protected guard.
- GDPR baseline. This system stores no PII. last_heroku_value stores a boolean as text. drift_reason stores an enum-like string. No user data.
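
The token-hash rule in the first invariant can be sketched as below. This is a minimal illustration, not the project's implementation: the helper name token_fingerprint is hypothetical, and it assumes "SHA-256[:12]" means the first 12 hex characters of the digest.

```python
import hashlib


def token_fingerprint(token: str) -> str:
    """Return a short, non-reversible fingerprint that is safe to log.

    Assumption: "SHA-256[:12]" means the first 12 hex characters of the
    SHA-256 digest of the UTF-8 encoded token.
    """
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    # Log the fingerprint for token health checks, never the token itself.
    fingerprint = token_fingerprint("example-not-a-real-token")
    print(f"heroku token fingerprint={fingerprint}")
```

Two tokens that differ produce different fingerprints with overwhelming probability, so a rotation shows up in logs without exposing either value.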
4. Data Model
4.1 New columns on console_flag_promotions
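The following columns are added to the existing table. All are nullable except synced, which is NOT NULL with a default of false (see migration 0055). Every change is additive, which allows a zero-downtime migration.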
| Column | Type | Default | Description |
|---|---|---|---|
| synced | BOOLEAN | false | Whether DB value and Heroku value agree for this flag x target_app pair. Indexed. |
| last_heroku_value | TEXT | NULL | Last boolean value the reconciler observed on Heroku. Stored as "true" or "false" string to distinguish "present but false" from "not set". NULL means never checked. |
| last_synced_at | TIMESTAMPTZ | NULL | When synced was last set to true. |
| drift_detected_at | TIMESTAMPTZ | NULL | When synced was last set to false. |
| drift_reason | TEXT | NULL | Enum-like: heroku_unset, heroku_value_mismatch, db_unset_heroku_set, untracked_heroku_flag. |
| sync_source | TEXT | NULL | Enum-like: console_ui_promote, console_ui_rollback, heroku_cli_backfill, mark_synced_db_wins, mark_synced_heroku_wins. Updated on every write. |
4.2 Index additions
CREATE INDEX idx_cfp_synced_false ON console_flag_promotions (synced)
    WHERE synced = false;
A partial index on the drifted rows enables the drift-banner query to be fast even as the table grows.
4.3 Unchanged columns
The existing state machine columns (state, created_at, created_by, target_app, new_value, soak_seconds, etc.) are unchanged. The synced column is orthogonal to state — a row can be in state deployed and synced=false if Heroku was subsequently changed via CLI.
5. APIs and Contracts
5.1 Existing endpoint changes
POST /console/flags/promotions/<id>/promote — existing route.
New pre-condition: if the console_flag_promotions row has synced=false, return 409 {"error": "flag_drifted", "drift_reason": "<reason>", "last_heroku_value": "<val>"} before any other processing. This is enforced in PromotionService.promote_async(), not just the blueprint.
POST /console/flags/promotions/<id>/reject — existing route.
Same pre-condition: synced=false returns 409 {"error": "flag_drifted"} for rollback operations as well.
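
The pre-condition above can be sketched as a small service-layer gate. This is an illustration under stated assumptions: PromotionRow is a hypothetical stand-in for the real model, and drift_gate is an invented helper name; only the 409 body shape comes from this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromotionRow:
    """Hypothetical stand-in for a console_flag_promotions row."""
    flag_key: str
    synced: bool
    drift_reason: Optional[str] = None
    last_heroku_value: Optional[str] = None


def drift_gate(row: PromotionRow) -> Optional[tuple[int, dict]]:
    """Return (409, body) when the row is drifted, or None to let the call proceed."""
    if row.synced:
        return None
    return 409, {
        "error": "flag_drifted",
        "drift_reason": row.drift_reason,
        "last_heroku_value": row.last_heroku_value,
    }
```

Calling the gate as the first statement of the service method, before any state transition, is what keeps a direct API call from bypassing the UI-level disablement.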
5.2 New endpoints
POST /console/flags/<flag_key>/promote
Shortcut that creates a pending promotion row and immediately calls Heroku API (for flags that need a direct UI-driven flip without the soak queue). Writes DB row with sync_source='console_ui_promote', calls Heroku PATCH /apps/{app}/config-vars, then sets synced=true and last_synced_at=now on success or synced=false + drift_reason='heroku_unset' on Heroku failure.
- Auth: @require_role("superadmin") + @require_totp_elevation
- Body: {"value": true|false, "target_app": "raxx-console-prod"}
- Response 202: {"state": "deploying", "status_url": "..."} (follows existing async pattern)
- Response 409: {"error": "flag_drifted"} if row is already drifted
POST /console/flags/<flag_key>/mark-synced
Operator resolves a drift conflict by declaring which side wins.
- Auth: @require_role("superadmin") + @require_totp_elevation
- Body: {"winner": "db" | "heroku", "totp_code": "..."}
- winner="db": Sets sync_source='mark_synced_db_wins'. Schedules an immediate Heroku PATCH to push the DB value. On success: synced=true, last_synced_at=now.
- winner="heroku": Reads current last_heroku_value, updates the DB row's new_value to match, sets synced=true, sync_source='mark_synced_heroku_wins', last_synced_at=now. No Heroku write needed.
- Response 200: {"synced": true, "resolved_value": true|false, "winner": "db"|"heroku"}
- Response 409: if Heroku PATCH fails during winner="db" path, returns {"error": "heroku_push_failed"} and leaves synced=false.
- Emits audit row: action="console.flag.mark_synced", payload {flag_key, winner, resolved_value, previous_db_value, previous_heroku_value}.
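
The two winner paths can be sketched as one function with the Heroku write injected as a callable. Everything here is illustrative: Row and mark_synced are hypothetical names, push_to_heroku stands in for the real PATCH call, audit logging and TOTP checks are omitted, and treating a NULL last_heroku_value as false on the heroku-wins path is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Row:
    """Hypothetical stand-in for a console_flag_promotions row."""
    flag_key: str
    target_app: str
    new_value: bool
    synced: bool = False
    last_heroku_value: Optional[str] = None
    sync_source: Optional[str] = None
    last_synced_at: Optional[datetime] = None


def mark_synced(
    row: Row,
    winner: str,
    push_to_heroku: Callable[[str, str, bool], bool],
) -> tuple[int, dict]:
    """Resolve drift. push_to_heroku(app, flag_key, value) returns True on success."""
    if winner == "db":
        if not push_to_heroku(row.target_app, row.flag_key, row.new_value):
            # Heroku PATCH failed: leave synced=false and sync_source unchanged.
            return 409, {"error": "heroku_push_failed"}
        row.sync_source = "mark_synced_db_wins"
    elif winner == "heroku":
        # Assumption: NULL (never observed / unset) is adopted as false.
        row.new_value = row.last_heroku_value == "true"
        row.sync_source = "mark_synced_heroku_wins"
    else:
        raise ValueError(f"unknown winner: {winner!r}")
    row.synced = True
    row.last_synced_at = datetime.now(timezone.utc)
    return 200, {"synced": True, "resolved_value": row.new_value, "winner": winner}
```

Injecting the push callable keeps the resolution logic testable without network access, which matches the "mock Heroku API" requirement in the acceptance criteria.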
GET /console/flags/drift
Returns all currently-drifted rows for the UI drift banner.
- Auth: @require_role("ops", "superadmin")
- Response 200: {"drifted": [{"flag_key": "...", "target_app": "...", "db_value": true|false, "heroku_value": "true"|"false"|null, "drift_reason": "...", "drift_detected_at": "...UTC..."}]}
- Returns an empty drifted list ({"drifted": []}) when no drift exists (banner hides itself).
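
A sketch of how the drift payload might be assembled from rows follows. It assumes rows arrive as plain dicts keyed by the column names from section 4, and that the empty case keeps the same {"drifted": [...]} envelope; build_drift_response is a hypothetical helper name.

```python
import json


def build_drift_response(rows: list[dict]) -> dict:
    """Keep only drifted rows and map columns to the response field names."""
    drifted = [
        {
            "flag_key": row["flag_key"],
            "target_app": row["target_app"],
            "db_value": row["new_value"],
            # "true" | "false" | None; None serializes to JSON null (not set).
            "heroku_value": row["last_heroku_value"],
            "drift_reason": row["drift_reason"],
            "drift_detected_at": row["drift_detected_at"],
        }
        for row in rows
        if not row["synced"]
    ]
    return {"drifted": drifted}


if __name__ == "__main__":
    example = [{
        "flag_key": "beta_ui", "target_app": "raxx-api-prod", "new_value": True,
        "last_heroku_value": None, "drift_reason": "heroku_unset",
        "drift_detected_at": "2026-05-13T00:00:00Z", "synced": False,
    }]
    print(json.dumps(build_drift_response(example), indent=2))
```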
6. Architecture
+---------------------------+     POST /flags/<key>/mark-synced
|  Console UI               |<---------------------------------+
|  /console/flags           |                                  |
|  - Drift banner           |--GET /console/flags/drift------->|
|  - Per-row drift badge    |                                  |
|  - Disabled promote btn   |                                  |
+---------------------------+                                  |
           |                                                   |
           | POST /flags/promotions/<id>/promote               |
           v                                                   |
+---------------------------+          +---------------------+ |
|  Console API              |          |  Heroku Platform    | |
|  blueprints/flags.py      |--------->|  API                | |
|  services/promotions.py   |  PATCH   |  PATCH config-vars  | |
|                           |<---------|  GET config-vars    | |
+---------------------------+          +---------------------+ |
           |                                    ^              |
           | DB write                           |              |
           v                                    |              |
+---------------------------+   reads           |              |
|  console DB               |<------------------+              |
|  console_flag_promotions  |   (reconciler reads              |
|  + synced columns         |    Heroku + DB, diffs,           |
+---------------------------+    writes synced col)            |
           ^                                                   |
           |                                                   |
+---------------------------+                                  |
|  Reconciler Job           |----------------------------------+
|  services/flag_reconciler |  mark-synced API when
|  .py                      |  winner resolution needed
|  (every 5 min via APSched)|
+---------------------------+
           |
           v
+---------------------------+
|  console_audit_log        |
|  (every state change)     |
+---------------------------+
The reconciler is a separate scheduler job, not a request-scoped service. It runs independently of HTTP traffic.
7. State Machine
7.1 Row states (flag_key x target_app pair)
| State name | Meaning |
|---|---|
| pending | Promotion queued; soak not elapsed. synced is irrelevant until deployed. |
| promoted | DB says ON; reconciler has confirmed Heroku matches. synced=true. |
| rolled_back | DB says OFF; reconciler has confirmed Heroku matches. synced=true. |
| drifted | synced=false. DB and Heroku disagree. Promote/rollback buttons disabled. |
Note: drifted is not a separate state column value — it is a derived view based on synced=false. The underlying state column retains its current values (deployed, pending, etc.). The UI derives "drifted" from synced=false.
7.2 Transition table
| From | Event | To | synced | sync_source |
|---|---|---|---|---|
| any | Reconciler: DB and Heroku agree | same state | true | unchanged |
| any | Reconciler: DB and Heroku disagree | same state | false | unchanged |
| drifted (synced=false) | mark-synced, winner=db, Heroku PATCH succeeds | same state | true | mark_synced_db_wins |
| drifted (synced=false) | mark-synced, winner=db, Heroku PATCH fails | same state | false | unchanged |
| drifted (synced=false) | mark-synced, winner=heroku | same state | true | mark_synced_heroku_wins |
| synced (synced=true) | promote via UI (new endpoint) | deployed | true | console_ui_promote |
| synced (synced=true) | heroku CLI changes FLAG_* directly | same state | false | unchanged (reconciler sets) |
| backfill migration | initial scan | varies | true (if match) | heroku_cli_backfill |
7.3 State machine diagram
stateDiagram-v2
    [*] --> pending : mark-promote
    pending --> deployed : promote (soak elapsed)
    pending --> rejected : operator rejects
    pending --> expired : 7-day TTL
    deployed --> drifted : reconciler detects mismatch
    drifted --> deployed : mark-synced resolves conflict
    deployed --> [*] : terminal
    note right of drifted
        promote/rollback buttons
        DISABLED while in this state.
        Operator must resolve via
        mark-synced before proceeding.
    end note
8. Reconciler Service Shape
File: console/app/services/flag_reconciler.py
This service is a standalone module called by the APScheduler job registered in the console app factory. It does not import from blueprint handlers.
# Pseudo-Python: shapes the public interface; implementation is feature-developer's
from dataclasses import dataclass
from typing import Optional

APPS = [
    "raxx-console-staging",
    "raxx-console-prod",
    "raxx-api-staging",
    "raxx-api-prod",
]


@dataclass
class ReconcileResult:
    app_name: str
    synced_count: int     # rows marked synced=true this run
    drifted_count: int    # rows marked synced=false this run
    skipped_count: int    # rows skipped (e.g. paper-first protected)
    error: Optional[str]  # non-None if Heroku fetch failed for this app


def reconcile_all_apps() -> list[ReconcileResult]:
    """
    Called every 5 minutes by APScheduler.

    For each app in APPS:
    1. Fetch FLAG_* config vars via Heroku Platform API
       GET /apps/{app}/config-vars
       Auth: HEROKU_API_TOKEN (or per-app token from env, see promotions.py)
    2. Read all console_flag_promotions rows for this target_app
       that are in state IN ('deployed', 'pending', 'staged')
    3. Diff: for each DB row, check if Heroku value matches new_value
    4. Write synced=true/false + last_heroku_value + timestamps
    5. Write audit row for each change to synced

    Never raises; errors are captured in ReconcileResult.error.
    Returns list of ReconcileResult, one per app.
    """
    ...


def _fetch_heroku_config_vars(app_name: str) -> dict[str, str]:
    """
    Calls GET /apps/{app_name}/config-vars.
    Returns raw dict of all env vars.
    Raises HerokuFetchError on 4xx/5xx or network timeout.

    Token sourced from env: HEROKU_API_TOKEN or
    per-app override (HEROKU_API_TOKEN_CONSOLE_PROD etc.).

    15-second timeout. No retries in the reconciler path
    (job will retry in 5 minutes automatically).
    """
    ...


def _extract_flag_vars(raw_config: dict) -> dict[str, bool]:
    """
    Filter raw config to FLAG_* keys.
    Returns {flag_key_lowercase: bool} e.g. {"console_flag_promotions": True}.
    Key is lower(key.removeprefix("FLAG_")).
    Value is case-insensitive "true" -> True, anything else -> False.
    """
    ...


def _diff_and_update(
    db_rows: list,      # ConsoleFlagPromotion rows for one target_app
    heroku_vars: dict,  # from _extract_flag_vars
    target_app: str,
    dry_run: bool = False,
) -> tuple[int, int]:
    """
    Compare each DB row against heroku_vars.
    Write synced, last_heroku_value, drift_reason, timestamps.
    Emit audit rows for each change.
    Returns (synced_count, drifted_count).
    """
    ...
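
To make the extract and diff steps concrete, here is a minimal sketch of the two pure pieces. The function names are illustrative rather than the final private helpers, DB writes and audit rows are left out, and treating a missing Heroku var as heroku_unset drift even when the DB value is false is an assumption this design does not settle.

```python
from typing import Optional


def extract_flag_vars(raw_config: dict[str, str]) -> dict[str, bool]:
    """Keep FLAG_* keys, lowercase the suffix; only "true" (any case) is True."""
    return {
        key.removeprefix("FLAG_").lower(): value.lower() == "true"
        for key, value in raw_config.items()
        if key.startswith("FLAG_")
    }


def classify_row(
    flag_key: str,
    db_value: bool,
    heroku_vars: dict[str, bool],
) -> tuple[bool, Optional[str], Optional[str]]:
    """Return (synced, last_heroku_value, drift_reason) for one DB row."""
    if flag_key not in heroku_vars:
        # Assumption: an unset var counts as drift regardless of the DB value.
        return False, None, "heroku_unset"
    heroku_value = heroku_vars[flag_key]
    as_text = "true" if heroku_value else "false"
    if heroku_value == db_value:
        return True, as_text, None
    return False, as_text, "heroku_value_mismatch"


if __name__ == "__main__":
    config = {"FLAG_CONSOLE_FLAG_PROMOTIONS": "True", "DATABASE_URL": "postgres://..."}
    flags = extract_flag_vars(config)
    print(classify_row("console_flag_promotions", False, flags))
```

Note that under the stated parsing rule a value such as "1" maps to False, so flags set with FLAG_X=1 would be read as off by this sketch.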
Scheduler registration (in console/app/__init__.py or equivalent scheduler setup):
scheduler.add_job(
    func=reconcile_all_apps,
    trigger="interval",
    minutes=5,
    id="flag_reconciler",
    replace_existing=True,
    misfire_grace_time=60,  # tolerate up to 60s delay before treating as missed
)
Heroku API dependency: The reconciler uses the same GET /apps/{app}/config-vars endpoint already used by flag_drift.py. It can share the fetch_heroku_flag_vars helper from drift_reconcilers/flag_drift.py (or extract it to a shared heroku_client.py).
Token: HEROKU_API_TOKEN (generic) or per-app tokens HEROKU_API_TOKEN_CONSOLE_PROD, HEROKU_API_TOKEN_API_PROD etc. already in Infisical. Rotatable without redeploy.
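
Token lookup at call time might look like the sketch below. Several things are assumptions: the staging variable names are extrapolated from the two prod names given above, the per-app token is checked before the generic one, and resolve_token is a hypothetical helper rather than the function in promotions.py.

```python
import os
from typing import Mapping, Optional

# Assumption: only the two *_PROD names appear in this document; the staging
# names are extrapolated and may differ in the real secret store.
PER_APP_TOKEN_VARS = {
    "raxx-console-staging": "HEROKU_API_TOKEN_CONSOLE_STAGING",
    "raxx-console-prod": "HEROKU_API_TOKEN_CONSOLE_PROD",
    "raxx-api-staging": "HEROKU_API_TOKEN_API_STAGING",
    "raxx-api-prod": "HEROKU_API_TOKEN_API_PROD",
}


def resolve_token(app_name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Read the token on every call so a rotated secret is picked up without redeploy."""
    env = os.environ if env is None else env
    per_app_var = PER_APP_TOKEN_VARS.get(app_name)
    token = env.get(per_app_var) if per_app_var else None
    # Assumed precedence: per-app override first, then the generic token.
    token = token or env.get("HEROKU_API_TOKEN")
    if not token:
        raise LookupError(f"no Heroku token configured for {app_name}")
    return token
```

Because nothing is cached at import time, the returned value never needs to be stored on a module or model attribute, which keeps it out of DB columns and logs.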
9. Migrations
9.1 Migration 0055 — add sync columns (additive)
File: console/migrations/versions/0055_cfp_sync_columns.py
-- up
ALTER TABLE console_flag_promotions
ADD COLUMN synced BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE console_flag_promotions
ADD COLUMN last_heroku_value TEXT;
ALTER TABLE console_flag_promotions
ADD COLUMN last_synced_at TIMESTAMPTZ;
ALTER TABLE console_flag_promotions
ADD COLUMN drift_detected_at TIMESTAMPTZ;
ALTER TABLE console_flag_promotions
ADD COLUMN drift_reason TEXT;
ALTER TABLE console_flag_promotions
ADD COLUMN sync_source TEXT;
CREATE INDEX idx_cfp_synced_false
ON console_flag_promotions (synced)
WHERE synced = false;
-- down
DROP INDEX IF EXISTS idx_cfp_synced_false;
-- SQLite: no DROP COLUMN; Postgres supports it
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS sync_source;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS drift_reason;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS drift_detected_at;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS last_synced_at;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS last_heroku_value;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS synced;
Depends on: migration 0054 applied. Zero-downtime: additive columns with nullable defaults. Existing code that does not read these columns is unaffected.
SQLite note: SQLite supports ADD COLUMN but not DROP COLUMN before version 3.35.0. The down migration for SQLite environments will use batch_alter_table (Alembic) to recreate the table without the columns, consistent with migration 0054 pattern.
9.2 Migration 0056 — backfill (data migration)
File: console/migrations/versions/0056_cfp_heroku_backfill.py
This migration does NOT run inline. It is implemented as a one-shot script invoked by a sub-card (see Implementation Cards). The migration file defines the schema structure for the backfill operation and a heroku_cli_backfill marker.
The backfill script:
1. Reads all FLAG_* vars from each of the 4 Heroku apps via API.
2. For each FLAG_* var with no existing console_flag_promotions row for that (flag_key, target_app) pair, inserts a new row with:
- state='deployed' (it is live on Heroku)
- new_value = current Heroku boolean
- synced=true
- sync_source='heroku_cli_backfill'
- created_by='system_backfill'
- last_heroku_value = string representation
- last_synced_at = now (UTC)
3. For existing rows where synced is false (all rows post-0055 migration), reconciles against current Heroku state.
The backfill is idempotent. Running it twice produces the same result.
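
The insert half of the backfill can be sketched as a pure planning function, which also makes the idempotency claim easy to test. The function name and the input shapes are illustrative assumptions; the field values are the ones listed in step 2.

```python
from datetime import datetime, timezone


def plan_backfill_inserts(
    heroku_flags: dict[str, dict[str, bool]],  # {target_app: {flag_key: value}}
    existing_pairs: set[tuple[str, str]],      # {(flag_key, target_app)} already in DB
) -> list[dict]:
    """Return one row per FLAG_* var that has no (flag_key, target_app) row yet."""
    now = datetime.now(timezone.utc)
    rows = []
    for target_app, flags in sorted(heroku_flags.items()):
        for flag_key, value in sorted(flags.items()):
            if (flag_key, target_app) in existing_pairs:
                continue  # idempotency: a pair is never inserted twice
            rows.append({
                "flag_key": flag_key,
                "target_app": target_app,
                "state": "deployed",
                "new_value": value,
                "synced": True,
                "sync_source": "heroku_cli_backfill",
                "created_by": "system_backfill",
                "last_heroku_value": "true" if value else "false",
                "last_synced_at": now,
            })
    return rows
```

A --dry-run mode falls out naturally from this shape: print the planned rows and skip the write.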
10. Rollout Plan
| Phase | What ships | Gate |
|---|---|---|
| Phase 0 | Migration 0055 (sync columns, additive). Reconciler job added but writes only to drift_run_results (existing table); it does NOT yet write the synced column. Reconciler runs in read-only mode. | Zero-downtime migration deploy. |
| Phase 1 | Reconciler writes synced, last_heroku_value, timestamps to console_flag_promotions. Backfill script run against staging. GET /console/flags/drift endpoint live. | Staging operator review of drift banner. |
| Phase 2 | POST /console/flags/<key>/mark-synced endpoint + UI modal with side-by-side diff. Drift kill-switch enforcement (synced=false → 409 on promote). | Staging soak; smoke test mark-synced both winner paths. |
| Phase 3 | POST /console/flags/<key>/promote (UI → Heroku write) for direct promotes without CLI. Backfill run against prod. | Post-soak prod deploy. FLAG_CONSOLE_FLAG_PROMOTIONS=1 on prod gates access. |
Phase 0 and Phase 1 are safe to ship before 2026-05-23 launch — they add columns and read-only behavior. Phase 2 (kill-switch enforcement) should also land pre-launch if the team has bandwidth; it closes a gap where the UI could act on stale state. Phase 3 (UI-triggered Heroku writes) is lower urgency pre-launch and can be deferred post-launch if needed.
11. Failure Modes
| Scenario | Behavior | Recovery |
|---|---|---|
| Heroku API down during reconciler run | ReconcileResult.error set; synced values not updated; previous state preserved. Slack alert fired if error persists across 2+ consecutive runs. | Automatic recovery on next 5-min cycle. |
| Heroku API down during mark-synced (winner=db) | 409 {"error": "heroku_push_failed"} returned; synced remains false; operator informed to retry. | Operator retries when Heroku recovers. |
| DB write succeeds, Heroku PATCH fails during promote | Promotion row in state=deploying; synced=false; drift_reason='heroku_unset'. | Existing force-complete / force-fail escape hatch (#1639). Or operator runs mark-synced(winner=heroku) to acknowledge the failure and correct the DB. |
| Two operators promote the same flag concurrently | Existing state=deploying lock in PromotionService.promote_async() returns 409 for the second request. synced is only updated after deploy completes. | No new risk surface. |
| Reconciler runs between DB write and Heroku apply (window ~2s) | Row is in state=deploying; reconciler skips rows in deploying state (they are in-flight). | Skip condition: WHERE state NOT IN ('deploying') in reconciler query. |
| Missed reconciler cycle (dyno restart) | APScheduler misfire_grace_time=60 handles brief gaps. After restart, the job is re-registered and catches up on its next run. With APScheduler's default interval-trigger behavior the first run is one interval (5 minutes) after registration, unless next_run_time is set to fire immediately. | No manual intervention needed. |
| Untracked Heroku FLAG_* var (no DB row) | drift_reason='untracked_heroku_flag' written to a synthetic row. Operator sees it in drift banner. | Backfill script creates the row. Or operator runs mark-synced(winner=heroku) to adopt it. |
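
The "2+ consecutive runs" alert rule in the first row can be sketched as a per-app streak counter. This is an in-memory illustration only: the class name is hypothetical, the state would be lost on a dyno restart, and it signals on every failed run at or past the threshold, so a real alert path would likely need de-duplication.

```python
from collections import defaultdict
from typing import Optional


class FailureStreaks:
    """Track consecutive failed reconciler runs per app."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._streaks: dict[str, int] = defaultdict(int)

    def record(self, app_name: str, error: Optional[str]) -> bool:
        """Record one run's outcome; return True when an alert should fire."""
        if error is None:
            self._streaks[app_name] = 0  # a clean run resets the streak
            return False
        self._streaks[app_name] += 1
        return self._streaks[app_name] >= self.threshold
```

Feeding it ReconcileResult.app_name and ReconcileResult.error after each cycle is enough to drive the alert decision without touching the promotion table.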
12. Security Considerations
- PII collected: None. last_heroku_value is a boolean string. drift_reason is an enum. sync_source is an enum. No user identifiers beyond admin UUIDs (existing created_by).
- Retention: New columns on console_flag_promotions follow the same 90-day operational retention as other columns. console_audit_log entries for mark-synced follow the 2-year audit retention.
- DSR erasure: No new PII columns. Existing created_by / approved_by erasure behavior is unchanged.
- Credential posture: HEROKU_API_TOKEN (and per-app variants) live in Infisical. The reconciler reads them from env at call time. Never logged (hash only). Rotatable without redeploy.
- Audit trail: Every state change to the synced column (reconciler write or mark-synced) emits a console_audit_log row with action="console.flag.sync_updated" or action="console.flag.mark_synced". The sync_source column on the promotion row gives a denormalized trace without needing to query the audit log for the common case.
- Kill-switch enforcement at service layer: The synced=false gate is enforced in PromotionService, not just the template. A direct API call to the promote endpoint will also hit the gate. This prevents bypass via curl/scripts.
- Mark-synced requires TOTP elevation: Because mark-synced can push a DB value to Heroku (modifying the runtime), it requires the same @require_totp_elevation as the existing promote/reject endpoints.
- No auto-resolution: The reconciler never silently resolves drift. It only reads and marks. Resolution always requires explicit operator action.
- Breach posture: The new columns contain flag names and boolean values. No user PII. Exfiltration of console_flag_promotions reveals feature flag state and operator activity, which is operationally sensitive but not GDPR-relevant. Existing breach-notification automation applies.
13. Open Questions
These require a decision from Kristerpher before sub-cards in Phase 2 and Phase 3 can be claimed:
- Reconciler token scope. The existing flag_drift.py reconciler uses a single HEROKU_API_TOKEN or falls back to per-app tokens. The new reconciler will also need write access for the PATCH /apps/{app}/config-vars call (mark-synced winner=db path). Should the write token be the same as the read token, or should a separate write-scoped Heroku token be provisioned for the reconciler? Separate tokens provide blast-radius isolation but require vault setup.
- Backfill timing. The backfill will create ~50+ new console_flag_promotions rows for prod (one per live FLAG_* var with no existing DB row). Should this run as a one-shot admin script, or as an Alembic data migration that runs automatically on deploy? The risk of automatic migration is that it hits the Heroku API during deploy, which could fail and block the migration chain.
- Drift banner placement pre-FLAG_CONSOLE_FLAG_PROMOTIONS-on. The drift banner (Phase 1/2) requires FLAG_CONSOLE_FLAG_PROMOTIONS=1 to access /console/flags. Should the drift endpoint (GET /console/flags/drift) be accessible without the main promotion flag, so a drift banner can appear on the existing dashboard before the full promotion UI ships? Or is it acceptable to gate all of this behind the single flag?
- Grace period before kill-switch fires. The moment a CLI operator sets a FLAG_* var directly on Heroku, the next 5-minute reconciler cycle will mark the row drifted and disable the UI promote button. Is 5 minutes the right grace period, or should the kill-switch only fire after N consecutive drifted cycles (e.g., 15 minutes = 3 cycles) to avoid transient failures from Heroku API flakiness?
- Untracked flags: backfill-only or persistent drift. Flags set directly on Heroku with no DB row are currently orphan_on_heroku warnings. After this design ships, should they create synthetic console_flag_promotions rows automatically on first reconciler detection (enabling the drift banner to surface them), or should they only be onboarded via the backfill script? Automatic creation is more visible but adds complexity to the reconciler.
14. Implementation Cards
PM can lift these cards directly. Each card is sized for a single PR by feature-developer. Cards are ordered by dependency chain.
Card 1 — Schema migration: add sync columns to console_flag_promotions
Title: feat(console): migration 0055 — add sync columns to console_flag_promotions
Body:
Add six new columns to console_flag_promotions that support the bidirectional sync design. See design doc: docs/architecture/flag-reconciler-bidirectional-sync-2026-05-13.md section 4 and section 9.1.
Columns: synced (BOOLEAN, default false), last_heroku_value (TEXT nullable), last_synced_at (TIMESTAMPTZ nullable), drift_detected_at (TIMESTAMPTZ nullable), drift_reason (TEXT nullable), sync_source (TEXT nullable).
Partial index: idx_cfp_synced_false on (synced) WHERE synced = false.
Use batch_alter_table pattern (consistent with migration 0054) for dialect-neutral SQLite/Postgres compatibility.
Acceptance criteria:
- [ ] Migration 0055 applies cleanly on top of 0054 with no data loss.
- [ ] Down migration removes columns and index without error.
- [ ] ConsoleFlagPromotion model in app/models/flag_promotion.py updated to include new fields.
- [ ] Existing tests pass.
- [ ] New test: test_0055_migration.py verifies columns exist post-up and do not exist post-down.
Size: S
Risk: Low — additive columns only.
Dependencies: None (depends on 0054 already applied, which it is).
Card 2 — Reconciler service + scheduler registration
Title: feat(console): flag_reconciler.py — 5-minute bidirectional sync job
Body:
Implement console/app/services/flag_reconciler.py as specified in design doc section 8. The reconciler:
1. Iterates all 4 Heroku apps.
2. Fetches FLAG_* config vars via GET /apps/{app}/config-vars.
3. Reads console_flag_promotions rows for that target_app where state NOT IN ('deploying', 'rejected', 'expired', 'failed').
4. Diffs each row's new_value against the Heroku boolean.
5. Writes synced, last_heroku_value, drift_reason, and timestamp columns.
6. Emits console_audit_log rows with action="console.flag.sync_updated" for each change to synced.
7. Never raises — errors captured per-app in ReconcileResult.
Register in APScheduler: interval=5m, misfire_grace_time=60s, id="flag_reconciler". Skip rows in state='deploying' to avoid race condition during in-flight promotes.
Reuse _fetch_heroku_flag_vars / HerokuConfigFetchError from drift_reconcilers/flag_drift.py or extract to a shared heroku_client.py.
Acceptance criteria:
- [ ] reconcile_all_apps() returns one ReconcileResult per app.
- [ ] synced=true set when DB new_value matches Heroku value.
- [ ] synced=false, drift_reason, drift_detected_at set when mismatch.
- [ ] state='deploying' rows are skipped.
- [ ] Heroku API failure captured in ReconcileResult.error; no exception propagated.
- [ ] Audit row emitted for each synced column change.
- [ ] APScheduler job registered in app factory; visible in scheduler job list.
- [ ] Unit tests with mocked Heroku API covering: clean run, mismatch, orphan-on-Heroku, Heroku failure.
- [ ] No token values in any log output (hash only).
Size: M
Risk: Medium — introduces write behavior; must not corrupt existing promotion state.
Dependencies: Card 1.
Card 3 — Backfill data migration script
Title: feat(console): one-shot backfill — seed console_flag_promotions rows for existing Heroku FLAG_* vars
Body:
Implement scripts/flag_reconciler_backfill.py (not an Alembic migration; invoked manually by operator). See design doc section 9.2.
Script logic:
1. For each of the 4 Heroku apps, fetch all FLAG_* config vars.
2. For each var with no matching (flag_key, target_app) row in console_flag_promotions, insert a new row: state='deployed', new_value=<heroku_boolean>, synced=true, sync_source='heroku_cli_backfill', created_by='system_backfill', last_heroku_value=<str>, last_synced_at=<utcnow>.
3. For existing rows in console_flag_promotions for this (flag_key, target_app) where synced=false (NULL or explicit false), diff against Heroku and update.
4. Print a summary: N rows inserted, M rows updated, K errors.
5. Idempotent: running twice produces the same result.
Include a --dry-run flag that prints the diff without writing.
Acceptance criteria:
- [ ] --dry-run prints all planned inserts/updates without writing.
- [ ] Script is idempotent (run twice = same result).
- [ ] All 4 apps processed.
- [ ] Rows inserted have sync_source='heroku_cli_backfill'.
- [ ] Script exits non-zero on Heroku API failure after logging the error.
- [ ] Unit test with mocked Heroku API verifying idempotency and insert/update logic.
- [ ] Runbook entry in docs/runbooks/ describing when and how to invoke.
Size: S
Risk: Low — script is separate from main app; uses --dry-run for validation.
Dependencies: Card 1.
Card 4 — Mark-synced API endpoints
Title: feat(console): POST /flags/<key>/mark-synced + GET /flags/drift endpoints
Body:
Add two new endpoints to console/app/blueprints/flags.py. See design doc section 5.2.
GET /console/flags/drift: Returns all rows where synced=false. Auth: @require_role("ops", "superadmin"). Response is a JSON list of drifted rows (flag_key, target_app, db_value, heroku_value, drift_reason, drift_detected_at).
POST /console/flags/<flag_key>/mark-synced: Resolves a drift conflict. Body: {"winner": "db"|"heroku", "totp_code": "..."}. Auth: @require_role("superadmin") + @require_totp_elevation. Winner=db: calls Heroku PATCH then sets synced=true. Winner=heroku: updates DB new_value, sets synced=true. Both paths emit audit row action="console.flag.mark_synced". Returns 409 if Heroku PATCH fails (winner=db path).
Also add the drift kill-switch to PromotionService.promote_async() and PromotionService.reject(): return 409 {"error": "flag_drifted"} if the target row has synced=false at the start of the operation.
Acceptance criteria:
- [ ] GET /flags/drift returns empty list when no drift.
- [ ] GET /flags/drift returns correct rows when drift exists.
- [ ] POST /mark-synced winner=heroku updates DB value and sets synced=true.
- [ ] POST /mark-synced winner=db calls Heroku PATCH and sets synced=true on success.
- [ ] POST /mark-synced winner=db returns 409 heroku_push_failed on Heroku failure.
- [ ] POST /mark-synced emits audit row for both winner paths.
- [ ] Promote endpoint returns 409 flag_drifted when synced=false.
- [ ] Reject endpoint returns 409 flag_drifted when synced=false.
- [ ] TOTP elevation enforced on mark-synced.
- [ ] Unit tests for each path; mock Heroku API.
Size: M
Risk: Medium — introduces drift kill-switch enforcement; must not break existing promote flow for non-drifted rows.
Dependencies: Cards 1 and 2.
Card 5 — Drift banner UI
Title: feat(console): /console/flags drift banner and per-row drift badge
Body:
Update /console/flags page and /console/flags/promotions page. See design doc section on UI changes.
On both pages: add a banner above the table that queries GET /console/flags/drift. If response is non-empty, show a yellow/amber banner: "N flag(s) are drifted — DB and Heroku disagree. Promote and rollback actions are disabled for affected flags until drift is resolved." Each drifted flag in the banner links to its row.
Per-row in the flags table: add a drift badge (e.g., "DRIFTED") in the status column when the row has synced=false. The badge should show drift_reason on hover (tooltip).
Promote/rollback buttons for drifted rows: render as disabled (greyed out, no click handler). Show tooltip: "Flag is drifted — resolve drift first."
Banner auto-hides via HTMX polling (reuse existing 5-second poll pattern from promotions page). If GET /flags/drift returns [], banner removes itself.
Acceptance criteria:
- [ ] Banner appears when at least one drifted row exists.
- [ ] Banner disappears when no drifted rows remain.
- [ ] Per-row drift badge visible for drifted flags.
- [ ] Promote/rollback buttons disabled on drifted rows (not just hidden).
- [ ] Banner links to correct rows.
- [ ] HTMX poll refreshes banner without full page reload.
- [ ] Works with existing FLAG_CONSOLE_FLAG_PROMOTIONS gate (404 when off).
- [ ] Snapshot test or Playwright test confirms banner renders and hides.
Size: S
Risk: Low — UI-only; no new write paths.
Dependencies: Card 4.
Card 6 — "Mark as synced" modal with side-by-side diff
Title: feat(console): mark-as-synced modal — side-by-side diff + winner selection
Body:
Add a modal to the /console/flags page triggered by a "Resolve drift" button on drifted rows. See design doc section on UI changes.
Modal content:
- Header: "Resolve drift for <flag_key>"
- Two-column diff table:
  - Left column: "DB value" — shows new_value from the promotion row (true/false)
  - Right column: "Heroku value" — shows last_heroku_value from the row
- drift_reason displayed as a human-readable explanation (e.g., "Heroku value was changed directly via CLI after the last promotion")
- drift_detected_at timestamp
- Two action buttons: "DB value is correct (push to Heroku)" and "Heroku value is correct (update DB)"
- TOTP field (required before either action fires)
- Confirmation: high-risk flags (per YAML risk: high) require typing the flag key before confirming
On confirmation: calls POST /console/flags/<flag_key>/mark-synced with winner: "db"|"heroku". On success: modal closes, row refreshes via HTMX to show synced=true state, promote/rollback buttons re-enable. On 409 error: modal shows error message inline.
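The modal's data and its submit gate can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the 6-digit TOTP format is an assumption (the real check is server-side TOTP elevation), and the `risk` value is assumed to come from the flag's YAML entry.

```python
import re
from typing import Optional


def diff_view(row: dict) -> dict:
    """Build the side-by-side diff shown in the modal from a drifted row."""
    return {
        "db_value": row["new_value"],              # left column
        "heroku_value": row["last_heroku_value"],  # right column
        "drift_reason": row["drift_reason"],
        "drift_detected_at": row["drift_detected_at"],
    }


def can_submit(flag_key: str, risk: str, totp_code: Optional[str],
               typed_confirmation: Optional[str] = None) -> bool:
    """Return True only when the modal may fire either action button."""
    # Assumption: TOTP codes are 6 digits; this only gates the button.
    if not re.fullmatch(r"\d{6}", totp_code or ""):
        return False
    # High-risk flags require typing the flag key before confirming.
    if risk == "high":
        return typed_confirmation == flag_key
    return True


def mark_synced_body(winner: str) -> dict:
    """Request body for POST /console/flags/<flag_key>/mark-synced."""
    if winner not in ("db", "heroku"):
        raise ValueError(f"winner must be 'db' or 'heroku', got {winner!r}")
    return {"winner": winner}
```

The gate is deliberately client-side convenience only; the endpoint still has to enforce TOTP elevation on its own.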
Acceptance criteria:
- [ ] Modal opens on "Resolve drift" button click.
- [ ] Side-by-side diff shows correct DB and Heroku values.
- [ ] DB wins path: calls mark-synced, shows success, re-enables promote button.
- [ ] Heroku wins path: calls mark-synced, shows updated DB value, re-enables promote button.
- [ ] TOTP required before action fires.
- [ ] High-risk flag confirmation phrase required.
- [ ] 409 error displayed inline in modal (not full page redirect).
- [ ] Modal is accessible (keyboard navigable, ARIA labels).
Size: M
Risk: Low-Medium — UI complexity; depends on Card 4 for API.
Dependencies: Card 4.
Card 7 — Promote-via-UI calls Heroku API directly
Title: feat(console): POST /flags/<key>/promote — UI-triggered Heroku config-var write
Body:
Add the new POST /console/flags/<flag_key>/promote endpoint (distinct from the existing POST /flags/promotions/<id>/promote). See design doc section 5.2.
This endpoint is for direct UI-driven flag flips that bypass the soak queue. It:
1. Validates the flag is not drifted (synced=true or no existing row).
2. Creates a console_flag_promotions row with state='deploying', sync_source='console_ui_promote'.
3. Calls PATCH /apps/{target_app}/config-vars on Heroku (async pattern, same as existing promote flow).
4. On success: sets state='deployed', synced=true, last_synced_at=now.
5. On failure: sets state='failed', synced=false, drift_reason='heroku_unset'.
6. Emits audit row.
Return 202 with status_url for frontend polling (same pattern as existing promote_async).
This is the "Phase 3" endpoint that finally enables operators to flip flags from the UI without touching the Heroku CLI.
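The six steps above can be sketched in a simplified, synchronous form. This is not the endpoint implementation: the real flow is async and returns 202 with a status_url, and the names `PromotionRow`, `promote_via_ui`, `push_config_var`, and `emit_audit` are hypothetical stand-ins for the row model, the Heroku config-var PATCH call, and the audit writer.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


class FlagDrifted(Exception):
    """Would map to HTTP 409 flag_drifted in the endpoint."""


@dataclass
class PromotionRow:
    flag_key: str
    target_app: str
    new_value: str
    state: str = "deploying"                 # step 2
    sync_source: str = "console_ui_promote"  # step 2
    synced: Optional[bool] = None
    drift_reason: Optional[str] = None
    last_synced_at: Optional[datetime] = None


def promote_via_ui(
    existing: Optional[PromotionRow],
    flag_key: str,
    target_app: str,
    new_value: str,
    push_config_var: Callable[[str, str, str], None],
    emit_audit: Callable[[dict], None],
) -> PromotionRow:
    # Step 1: proceed only if there is no row or the row is synced=true.
    if existing is not None and existing.synced is not True:
        raise FlagDrifted(flag_key)
    # Step 2: create the row in state='deploying'.
    row = PromotionRow(flag_key, target_app, new_value)
    try:
        # Step 3: stand-in for PATCH /apps/{target_app}/config-vars.
        push_config_var(target_app, flag_key, new_value)
    except Exception:
        # Step 5: Heroku write failed.
        row.state, row.synced = "failed", False
        row.drift_reason = "heroku_unset"
    else:
        # Step 4: Heroku write succeeded.
        row.state, row.synced = "deployed", True
        row.last_synced_at = datetime.now(timezone.utc)
    # Step 6: audit on both outcomes.
    emit_audit({"flag_key": flag_key, "target_app": target_app,
                "state": row.state, "synced": row.synced})
    return row
```

Injecting the Heroku call as a callable keeps the drift guard and both outcome branches testable with a mocked Heroku API, as the acceptance criteria require.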
Acceptance criteria:
- [ ] Returns 202 with status_url on valid request.
- [ ] Returns 409 flag_drifted if existing row is drifted.
- [ ] On Heroku success: state='deployed', synced=true.
- [ ] On Heroku failure: state='failed', synced=false, drift_reason='heroku_unset'.
- [ ] Audit row emitted.
- [ ] TOTP elevation required.
- [ ] End-to-end test with mocked Heroku API.
- [ ] FLAG_CONSOLE_FLAG_PROMOTIONS gate enforced.
Size: M
Risk: Medium-High — this is the first endpoint that writes to Heroku from the UI. Requires thorough testing. Must land after Cards 1-6 are stable on staging.
Dependencies: Cards 1, 2, 4.
Card 8 — Audit log integration for all new actions
Title: feat(console): audit trail — emit trace_events for sync_updated and mark_synced
Body:
Ensure all new state changes emit to both console_audit_log and trace_events (dual-write). See design doc section 12 (security considerations).
New audit actions to register:
- console.flag.sync_updated — emitted by reconciler each time synced changes (true→false or false→true). Payload: {flag_key, target_app, previous_synced, new_synced, drift_reason, last_heroku_value}.
- console.flag.mark_synced — emitted by mark-synced endpoint. Payload: {flag_key, target_app, winner, resolved_value, previous_db_value, previous_heroku_value, admin_id}.
Verify that existing audit actions from the promote flow (console.flag.promoted, console.flag.mark_promote) already dual-write to trace_events. If not, add dual-write in this card.
Verify retention policy: console_audit_log rows for these actions must be covered by the existing 2-year retention job.
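The payload specification and the dual-write can be sketched as below. This is a minimal sketch: the two lists stand in for the console_audit_log and trace_events tables, and `build_audit_record` and `dual_write` are hypothetical helper names. Only the action names and payload fields come from the specification above.

```python
PAYLOAD_FIELDS = {
    "console.flag.sync_updated": {
        "flag_key", "target_app", "previous_synced", "new_synced",
        "drift_reason", "last_heroku_value",
    },
    "console.flag.mark_synced": {
        "flag_key", "target_app", "winner", "resolved_value",
        "previous_db_value", "previous_heroku_value", "admin_id",
    },
}


def build_audit_record(action: str, admin_id: str, payload: dict) -> dict:
    """Validate the payload against the spec and wrap it as an audit record."""
    expected = PAYLOAD_FIELDS[action]  # KeyError for unregistered actions
    if set(payload) != expected:
        missing = sorted(expected - set(payload))
        extra = sorted(set(payload) - expected)
        raise ValueError(f"{action}: missing={missing} extra={extra}")
    return {"action": action, "admin_id": admin_id, "payload": dict(payload)}


def dual_write(record: dict, audit_log: list, trace_events: list) -> None:
    """Append an independent copy of the record to both sinks."""
    audit_log.append({**record, "payload": dict(record["payload"])})
    trace_events.append({**record, "payload": dict(record["payload"])})
```

For reconciler writes, the caller would pass `admin_id="system_reconciler"`. In a real implementation both inserts would presumably share one transaction so that the two tables cannot diverge.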
Acceptance criteria:
- [ ] console.flag.sync_updated written on every reconciler synced column change.
- [ ] console.flag.mark_synced written on every mark-synced endpoint call.
- [ ] Both actions dual-write to trace_events table.
- [ ] Payload fields match specification above.
- [ ] admin_id is 'system_reconciler' for automated reconciler writes.
- [ ] Retention policy confirmed to cover new action types.
- [ ] Unit test: reconciler run triggers correct audit row.
- [ ] Unit test: mark-synced triggers correct audit row.
Size: S
Risk: Low — audit additions only; does not change existing behavior.
Dependencies: Cards 2 and 4.