Flag Reconciler — Bidirectional Sync (Drift-as-Kill-Switch)

Status: Proposed
Owner: software-architect
Date: 2026-05-13 UTC
Related docs:
- console-flag-promotion-flow.md — prior staging-to-prod promotion design
- [ADR-0085](https://internal-docs.raxx.app/architecture/adr/0085-flag-reconciler-bidirectional-sync.html) — decision record for this design


1. Goal

Establish a bidirectional, audited sync between console_flag_promotions (the DB source of intent) and Heroku FLAG_* environment variables (the runtime source of truth) across all four Raxx apps (raxx-console-staging, raxx-console-prod, raxx-api-staging, raxx-api-prod). A 5-minute reconciler job compares both sides and marks rows synced=true or synced=false. When drift is detected, the console UI disables promote/rollback actions for the affected flag and surfaces a banner with a side-by-side diff. The operator resolves drift by choosing which side wins ("DB is correct" or "Heroku is correct") via a "Mark as synced" action. Only after drift is resolved does the normal promotion workflow resume. Drift is therefore a kill-switch on forward progress — not a warning.


2. Current State and Problem

Heroku has been the canonical source of truth for feature flags for approximately 9 months. Flag values are set via heroku config:set directly (per feedback_bootstrap_via_heroku.md). The console_flag_promotions table exists and records intent (pending, staged, deploying, deployed, failed) but the synced column does not exist. The /console/flags page is gated behind FLAG_CONSOLE_FLAG_PROMOTIONS which is currently OFF on both staging and prod (per project_console_flag_toggle.md), meaning all flag flips still go via CLI.

This creates three types of drift today:

  1. Orphan-on-Heroku: A FLAG_* var exists on Heroku with no corresponding DB promotion row. This is the dominant case (~50+ flags set in prod; ~7 with DB rows). The flag_drift reconciler (console/app/services/drift_reconcilers/flag_drift.py) already detects this as orphan_on_heroku and writes to drift_run_results, but takes no action beyond alerting.
  2. Missing-from-Heroku: A promoted DB row exists but Heroku never got the value. The reconciler detects this as missing_from_heroku at severity critical.
  3. Value-mismatch: DB row says true; Heroku says false (or vice versa). Reconciler detects as value_mismatch.

The existing reconciler is read-only. This design makes it write-capable (mark synced) and wires it into a UI-level kill-switch.


3. Invariants

All platform invariants apply. Reconciler-specific:

  1. No stored credentials. The Heroku API token lives in the env/secret store. The reconciler reads it at call time. It must not be written to any DB column, log line, or response body. Log only a SHA-256[:12] hash for token health checks.
  2. Audit trail for every state change. Every write to console_flag_promotions (promote, rollback, mark-synced) emits a row to console_audit_log. The reconciler's automated updates (setting synced, updating last_heroku_value) also emit audit rows.
  3. Drift is a kill-switch, not a warning. When synced=false, the promote and rollback actions for that flag are disabled at the API layer (not just the UI layer). A request to promote a drifted row returns 409 {"error": "flag_drifted"}.
  4. Mark-synced requires operator intent. Automated resolution is not permitted. The operator must explicitly choose which side wins. Auto-reconcile (reconciler silently pushing DB value to Heroku) is out of scope for v1.
  5. Paper-first gating is not a feature flag. The reconciler must never manage the paper-first gate. Any FLAG_* var that corresponds to a paper-first path must be treated as untouchable and logged with a paper_first_protected guard.
  6. GDPR baseline. This system stores no PII. last_heroku_value stores a boolean as text. drift_reason stores an enum-like string. No user data.
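
The token health check in invariant 1 can be sketched as follows. This is a minimal illustration, not the project's implementation: `token_fingerprint` is a hypothetical helper name, and reading "SHA-256[:12]" as the first 12 hex characters of the digest is an assumption.

```python
import hashlib


def token_fingerprint(token: str) -> str:
    """Return the first 12 hex characters of the token's SHA-256 digest.

    Hypothetical helper: the name and the hex encoding are assumptions.
    The fingerprint is safe to log; the token itself is never logged.
    """
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    # Log only the fingerprint so operators can tell tokens apart
    # (for example before and after a rotation) without exposing them.
    print("heroku token fingerprint:", token_fingerprint("example-token"))
```

Two runs with the same token give the same fingerprint, so a rotation shows up as a changed value in the logs.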

4. Data Model

4.1 New columns on console_flag_promotions

The following columns are added to the existing table. All are nullable except synced, which is NOT NULL with a constant default of false (see migration 0055), so the migration stays additive and zero-downtime.

| Column | Type | Default | Description |
| --- | --- | --- | --- |
| synced | BOOLEAN | false | Whether DB value and Heroku value agree for this flag x target_app pair. Indexed. |
| last_heroku_value | TEXT | NULL | Last boolean value the reconciler observed on Heroku. Stored as "true" or "false" string to distinguish "present but false" from "not set". NULL means never checked. |
| last_synced_at | TIMESTAMPTZ | NULL | When synced was last set to true. |
| drift_detected_at | TIMESTAMPTZ | NULL | When synced was last set to false. |
| drift_reason | TEXT | NULL | Enum-like: heroku_unset, heroku_value_mismatch, db_unset_heroku_set, untracked_heroku_flag. |
| sync_source | TEXT | NULL | Enum-like: console_ui_promote, console_ui_rollback, heroku_cli_backfill, mark_synced_db_wins, mark_synced_heroku_wins. Updated on every write. |

4.2 Index additions

CREATE INDEX idx_cfp_synced_false ON console_flag_promotions (synced)
    WHERE synced = false;

A partial index on the drifted rows enables the drift-banner query to be fast even as the table grows.

4.3 Unchanged columns

The existing state machine columns (state, created_at, created_by, target_app, new_value, soak_seconds, etc.) are unchanged. The synced column is orthogonal to state — a row can be in state deployed and synced=false if Heroku was subsequently changed via CLI.


5. APIs and Contracts

5.1 Existing endpoint changes

POST /console/flags/promotions/<id>/promote — existing route.
New pre-condition: if the console_flag_promotions row has synced=false, return 409 {"error": "flag_drifted", "drift_reason": "<reason>", "last_heroku_value": "<val>"} before any other processing. This is enforced in PromotionService.promote_async(), not just the blueprint.

POST /console/flags/promotions/<id>/reject — existing route.
Same pre-condition: synced=false returns 409 {"error": "flag_drifted"} for rollback operations as well.
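
The pre-condition shared by both routes can be sketched as a single guard that runs before any other processing. This is an illustration only: `PromotionRow` and `drift_precondition` are hypothetical names, and the real check lives inside PromotionService as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromotionRow:
    """Hypothetical stand-in for a console_flag_promotions row."""
    synced: bool
    drift_reason: Optional[str] = None
    last_heroku_value: Optional[str] = None


def drift_precondition(row: PromotionRow) -> Optional[tuple[int, dict]]:
    """Return (409, body) when the row is drifted, else None.

    Callers would invoke this first and short-circuit on a non-None
    result, so a drifted row never reaches the promote/reject logic.
    """
    if row.synced:
        return None
    return 409, {
        "error": "flag_drifted",
        "drift_reason": row.drift_reason,
        "last_heroku_value": row.last_heroku_value,
    }
```

Putting the guard in the service layer rather than the blueprint means any caller of the service gets the same 409, which matches invariant 3.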

5.2 New endpoints

POST /console/flags/<flag_key>/promote
Shortcut that creates a pending promotion row and immediately calls Heroku API (for flags that need a direct UI-driven flip without the soak queue). Writes DB row with sync_source='console_ui_promote', calls Heroku PATCH /apps/{app}/config-vars, then sets synced=true and last_synced_at=now on success or synced=false + drift_reason='heroku_unset' on Heroku failure.

POST /console/flags/<flag_key>/mark-synced
Operator resolves a drift conflict by declaring which side wins.

GET /console/flags/drift
Returns all currently-drifted rows for the UI drift banner.


6. Architecture

+---------------------------+     POST /flags/<key>/mark-synced
|   Console UI              |<---------------------------------+
|   /console/flags          |                                  |
|   - Drift banner          |--GET /console/flags/drift------->|
|   - Per-row drift badge   |                                  |
|   - Disabled promote btn  |                                  |
+---------------------------+                                  |
            |                                                  |
            | POST /flags/promotions/<id>/promote              |
            v                                                  |
+---------------------------+          +---------------------+ |
|   Console API             |          |   Heroku Platform   | |
|   blueprints/flags.py     |--------->|   API               | |
|   services/promotions.py  |  PATCH   |   PATCH config-vars | |
|                           |<---------|   GET  config-vars  | |
+---------------------------+          +---------------------+ |
            |                                    ^             |
            | DB write                           |             |
            v                                    |             |
+---------------------------+     reads          |             |
|   console DB              |<-------------------+             |
|   console_flag_promotions |   (reconciler reads             |
|   + synced columns        |    Heroku + DB, diffs,          |
+---------------------------+    writes synced col)            |
            ^                                                  |
            |                                                  |
+---------------------------+                                  |
|   Reconciler Job          |----------------------------------+
|   services/flag_reconciler|  mark-synced API when
|   .py                     |  winner resolution needed
|   (every 5 min via APSched)|
+---------------------------+
            |
            v
+---------------------------+
|   console_audit_log       |
|   (every state change)    |
+---------------------------+

The reconciler is a separate scheduler job, not a request-scoped service. It runs independently of HTTP traffic.


7. State Machine

7.1 Row states (flag_key x target_app pair)

| State name | Meaning |
| --- | --- |
| pending | Promotion queued; soak not elapsed. synced is irrelevant until deployed. |
| promoted | DB says ON; reconciler has confirmed Heroku matches. synced=true. |
| rolled_back | DB says OFF; reconciler has confirmed Heroku matches. synced=true. |
| drifted | synced=false. DB and Heroku disagree. Promote/rollback buttons disabled. |

Note: drifted is not a separate state column value — it is a derived view based on synced=false. The underlying state column retains its current values (deployed, pending, etc.). The UI derives "drifted" from synced=false.

7.2 Transition table

| From | Event | To | synced | sync_source |
| --- | --- | --- | --- | --- |
| any | Reconciler: DB and Heroku agree | same state | true | unchanged |
| any | Reconciler: DB and Heroku disagree | same state | false | unchanged |
| drifted (synced=false) | mark-synced, winner=db, Heroku PATCH succeeds | same state | true | mark_synced_db_wins |
| drifted (synced=false) | mark-synced, winner=db, Heroku PATCH fails | same state | false | unchanged |
| drifted (synced=false) | mark-synced, winner=heroku | same state | true | mark_synced_heroku_wins |
| synced (synced=true) | promote via UI (new endpoint) | deployed | true | console_ui_promote |
| synced (synced=true) | heroku CLI changes FLAG_* directly | same state | false | unchanged (reconciler sets) |
| backfill migration | initial scan | varies | true (if match) | heroku_cli_backfill |
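
The three mark-synced transitions can be sketched as one function. This is a rough illustration under stated assumptions: `Row`, `mark_synced`, and the `push_to_heroku` callback are hypothetical names, and treating a missing Heroku value as false on the winner=heroku path is an assumption the design does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Row:
    """Hypothetical stand-in for a drifted console_flag_promotions row."""
    new_value: bool
    last_heroku_value: Optional[str]
    synced: bool = False
    sync_source: Optional[str] = None


def mark_synced(row: Row, winner: str, push_to_heroku: Callable[[bool], bool]) -> Row:
    """Apply the mark-synced transitions; the state column is never touched.

    push_to_heroku stands in for the Heroku PATCH and returns True on success.
    """
    if row.synced:
        raise ValueError("row is not drifted")
    if winner == "db":
        if push_to_heroku(row.new_value):
            row.last_heroku_value = "true" if row.new_value else "false"
            row.synced = True
            row.sync_source = "mark_synced_db_wins"
        # On PATCH failure: synced stays false and sync_source is unchanged.
    elif winner == "heroku":
        # Assumption: an unset Heroku value (None) is adopted as False.
        row.new_value = row.last_heroku_value == "true"
        row.synced = True
        row.sync_source = "mark_synced_heroku_wins"
    else:
        raise ValueError(f"unknown winner: {winner!r}")
    return row
```

Only the winner=db path can fail, because it is the only one that writes to Heroku; winner=heroku is a DB-only update.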

7.3 State machine diagram

stateDiagram-v2
    [*] --> pending : mark-promote
    pending --> deployed : promote (soak elapsed)
    pending --> rejected : operator rejects
    pending --> expired : 7-day TTL

    deployed --> drifted : reconciler detects mismatch
    drifted --> deployed : mark-synced resolves conflict

    deployed --> [*] : terminal

    note right of drifted
        promote/rollback buttons
        DISABLED while in this state.
        Operator must resolve via
        mark-synced before proceeding.
    end note

8. Reconciler Service Shape

File: console/app/services/flag_reconciler.py

This service is a standalone module called by the APScheduler job registered in the console app factory. It does not import from blueprint handlers.

# Pseudo-Python — shapes the public interface; implementation is feature-developer's

from dataclasses import dataclass
from typing import Optional

APPS = [
    "raxx-console-staging",
    "raxx-console-prod",
    "raxx-api-staging",
    "raxx-api-prod",
]

@dataclass
class ReconcileResult:
    app_name: str
    synced_count: int        # rows marked synced=true this run
    drifted_count: int       # rows marked synced=false this run
    skipped_count: int       # rows skipped (e.g. paper-first protected)
    error: Optional[str]     # non-None if Heroku fetch failed for this app


def reconcile_all_apps() -> list[ReconcileResult]:
    """
    Called every 5 minutes by APScheduler.
    For each app in APPS:
      1. Fetch FLAG_* config vars via Heroku Platform API
         GET /apps/{app}/config-vars
         Auth: HEROKU_API_TOKEN (or per-app token from env, see promotions.py)
      2. Read all console_flag_promotions rows for this target_app
         that are in state IN ('deployed', 'pending', 'staged')
      3. Diff: for each DB row, check if Heroku value matches new_value
      4. Write synced=true/false + last_heroku_value + timestamps
      5. Write audit row for each change to synced
    Never raises — errors are captured in ReconcileResult.error.
    Returns list of ReconcileResult, one per app.
    """
    ...


def _fetch_heroku_config_vars(app_name: str) -> dict[str, str]:
    """
    Calls GET /apps/{app_name}/config-vars.
    Returns raw dict of all env vars.
    Raises HerokuFetchError on 4xx/5xx or network timeout.
    Token sourced from env: HEROKU_API_TOKEN or
    per-app override (HEROKU_API_TOKEN_CONSOLE_PROD etc.).
    15-second timeout. No retries in the reconciler path
    (job will retry in 5 minutes automatically).
    """
    ...


def _extract_flag_vars(raw_config: dict) -> dict[str, bool]:
    """
    Filter raw config to FLAG_* keys.
    Returns {flag_key_lowercase: bool} e.g. {"console_flag_promotions": True}.
    Key is lower(key.removeprefix("FLAG_")).
    Value is case-insensitive "true" -> True, anything else -> False.
    """
    ...


def _diff_and_update(
    db_rows: list,          # ConsoleFlagPromotion rows for one target_app
    heroku_vars: dict,      # from _extract_flag_vars
    target_app: str,
    dry_run: bool = False,
) -> tuple[int, int]:
    """
    Compare each DB row against heroku_vars.
    Write synced, last_heroku_value, drift_reason, timestamps.
    Emit audit rows for each change.
    Returns (synced_count, drifted_count).
    """
    ...

Scheduler registration (in console/app/__init__.py or equivalent scheduler setup):

scheduler.add_job(
    func=reconcile_all_apps,
    trigger="interval",
    minutes=5,
    id="flag_reconciler",
    replace_existing=True,
    misfire_grace_time=60,   # tolerate up to 60s delay before treating as missed
)

Heroku API dependency: The reconciler uses the same GET /apps/{app}/config-vars endpoint already used by flag_drift.py. It can share the fetch_heroku_flag_vars helper from drift_reconcilers/flag_drift.py (or extract it to a shared heroku_client.py).

Token: HEROKU_API_TOKEN (generic) or per-app tokens HEROKU_API_TOKEN_CONSOLE_PROD, HEROKU_API_TOKEN_API_PROD etc. already in Infisical. Rotatable without redeploy.
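
As a sketch of how the `_extract_flag_vars` docstring and the per-row diff might be filled in: the parsing rule below follows the docstring above, while `classify_drift` is a hypothetical helper, and treating every missing Heroku key as heroku_unset (regardless of the DB value) is an assumption.

```python
from typing import Optional


def extract_flag_vars(raw_config: dict[str, str]) -> dict[str, bool]:
    """Keep FLAG_* keys, strip the prefix, lowercase the key.

    Only a case-insensitive "true" maps to True; anything else is False.
    """
    return {
        key.removeprefix("FLAG_").lower(): value.lower() == "true"
        for key, value in raw_config.items()
        if key.startswith("FLAG_")
    }


def classify_drift(
    db_value: bool, heroku_vars: dict[str, bool], flag_key: str
) -> Optional[str]:
    """Return None when DB and Heroku agree, else a drift_reason string."""
    if flag_key not in heroku_vars:
        return "heroku_unset"  # assumption: any missing key counts as drift
    if heroku_vars[flag_key] != db_value:
        return "heroku_value_mismatch"
    return None


if __name__ == "__main__":
    flags = extract_flag_vars({"FLAG_CONSOLE_FLAG_PROMOTIONS": "True", "PORT": "5000"})
    print(flags, classify_drift(False, flags, "console_flag_promotions"))
```

Under the docstring's rule a value such as "1" parses as False, so a var set to 1 on Heroku would be read as off by this sketch.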


9. Migrations

9.1 Migration 0055 — add sync columns (additive)

File: console/migrations/versions/0055_cfp_sync_columns.py

-- up
ALTER TABLE console_flag_promotions
  ADD COLUMN synced           BOOLEAN NOT NULL DEFAULT FALSE;
ALTER TABLE console_flag_promotions
  ADD COLUMN last_heroku_value TEXT;
ALTER TABLE console_flag_promotions
  ADD COLUMN last_synced_at   TIMESTAMPTZ;
ALTER TABLE console_flag_promotions
  ADD COLUMN drift_detected_at TIMESTAMPTZ;
ALTER TABLE console_flag_promotions
  ADD COLUMN drift_reason     TEXT;
ALTER TABLE console_flag_promotions
  ADD COLUMN sync_source      TEXT;

CREATE INDEX idx_cfp_synced_false
  ON console_flag_promotions (synced)
  WHERE synced = false;

-- down
DROP INDEX IF EXISTS idx_cfp_synced_false;
-- SQLite: no DROP COLUMN; Postgres supports it
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS sync_source;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS drift_reason;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS drift_detected_at;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS last_synced_at;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS last_heroku_value;
ALTER TABLE console_flag_promotions DROP COLUMN IF EXISTS synced;

Depends on: migration 0054 applied. Zero-downtime: the columns are additive and either nullable or, for synced, NOT NULL with a constant default. Existing code that does not read these columns is unaffected.

SQLite note: SQLite supports ADD COLUMN but not DROP COLUMN before version 3.35.0. The down migration for SQLite environments will use batch_alter_table (Alembic) to recreate the table without the columns, consistent with migration 0054 pattern.

9.2 Migration 0056 — backfill (data migration)

File: console/migrations/versions/0056_cfp_heroku_backfill.py

This migration does NOT run inline. It is implemented as a one-shot script invoked by a sub-card (see Implementation Cards). The migration file defines the schema structure for the backfill operation and a heroku_cli_backfill marker.

The backfill script:

  1. Reads all FLAG_* vars from each of the 4 Heroku apps via API.
  2. For each FLAG_* var with no existing console_flag_promotions row for that (flag_key, target_app) pair, inserts a new row with:
     - state='deployed' (it is live on Heroku)
     - new_value = current Heroku boolean
     - synced=true
     - sync_source='heroku_cli_backfill'
     - created_by='system_backfill'
     - last_heroku_value = string representation
     - last_synced_at = now (UTC)
  3. For existing rows where synced is false (all rows post-0055 migration), reconciles against current Heroku state.

The backfill is idempotent. Running it twice produces the same result.
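
The insert step and its idempotency can be sketched as a pure planning function. This is a minimal illustration: `plan_backfill` and its arguments are hypothetical, and the real script would also read the DB and call the Heroku API.

```python
from datetime import datetime, timezone


def plan_backfill(
    heroku_flags: dict[str, bool],
    existing_keys: set[tuple[str, str]],
    target_app: str,
    now: datetime,
) -> list[dict]:
    """Return the rows to insert for one app.

    existing_keys holds the (flag_key, target_app) pairs already in the
    table; those are skipped, which is what makes a second run a no-op.
    """
    rows = []
    for flag_key, value in sorted(heroku_flags.items()):
        if (flag_key, target_app) in existing_keys:
            continue  # already tracked: reconciled separately, never re-inserted
        rows.append({
            "flag_key": flag_key,
            "target_app": target_app,
            "state": "deployed",
            "new_value": value,
            "synced": True,
            "sync_source": "heroku_cli_backfill",
            "created_by": "system_backfill",
            "last_heroku_value": "true" if value else "false",
            "last_synced_at": now,
        })
    return rows


if __name__ == "__main__":
    plan = plan_backfill({"a": True}, set(), "raxx-api-staging", datetime.now(timezone.utc))
    print(len(plan), "row(s) would be inserted")
```

Separating planning from writing also gives a dry-run mode for free: print the plan and skip the insert.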


10. Rollout Plan

| Phase | What ships | Gate |
| --- | --- | --- |
| Phase 0 | Migration 0055 (sync columns, additive). Reconciler job added but writes only to drift_run_results (existing table); it does NOT yet write the synced column. Reconciler runs in read-only mode. | Zero-downtime migration deploy. |
| Phase 1 | Reconciler writes synced, last_heroku_value, timestamps to console_flag_promotions. Backfill script run against staging. GET /console/flags/drift endpoint live. | Staging operator review of drift banner. |
| Phase 2 | POST /console/flags/<key>/mark-synced endpoint + UI modal with side-by-side diff. Drift kill-switch enforcement (synced=false → 409 on promote). | Staging soak; smoke test mark-synced both winner paths. |
| Phase 3 | POST /console/flags/<key>/promote (UI → Heroku write) for direct promotes without CLI. Backfill run against prod. | Post-soak prod deploy. FLAG_CONSOLE_FLAG_PROMOTIONS=1 on prod gates access. |

Phase 0 and Phase 1 are safe to ship before 2026-05-23 launch — they add columns and read-only behavior. Phase 2 (kill-switch enforcement) should also land pre-launch if the team has bandwidth; it closes a gap where the UI could act on stale state. Phase 3 (UI-triggered Heroku writes) is lower urgency pre-launch and can be deferred post-launch if needed.


11. Failure Modes

| Scenario | Behavior | Recovery |
| --- | --- | --- |
| Heroku API down during reconciler run | ReconcileResult.error set; synced values not updated; previous state preserved. Slack alert fired if error persists across 2+ consecutive runs. | Automatic recovery on next 5-min cycle. |
| Heroku API down during mark-synced (winner=db) | 409 {"error": "heroku_push_failed"} returned; synced remains false; operator informed to retry. | Operator retries when Heroku recovers. |
| DB write succeeds, Heroku PATCH fails during promote | Promotion row in state=deploying; synced=false; drift_reason='heroku_unset'. | Existing force-complete / force-fail escape hatch (#1639). Or operator runs mark-synced(winner=heroku) to acknowledge the failure and correct the DB. |
| Two operators promote the same flag concurrently | Existing state=deploying lock in PromotionService.promote_async() returns 409 for the second request. synced is only updated after deploy completes. | No new risk surface. |
| Reconciler runs between DB write and Heroku apply (window ~2s) | Row is in state=deploying; reconciler skips rows in deploying state (they are in-flight). | Skip condition: WHERE state NOT IN ('deploying') in reconciler query. |
| Missed reconciler cycle (dyno restart) | APScheduler misfire_grace_time=60 handles brief gaps. After restart, reconciler runs immediately and catches up. | No manual intervention needed. |
| Untracked Heroku FLAG_* var (no DB row) | drift_reason='untracked_heroku_flag' written to a synthetic row. Operator sees it in drift banner. | Backfill script creates the row. Or operator runs mark-synced(winner=heroku) to adopt it. |

12. Security Considerations


13. Open Questions

These require a decision from Kristerpher before sub-cards in Phase 2 and Phase 3 can be claimed:

  1. Reconciler token scope. The existing flag_drift.py reconciler uses a single HEROKU_API_TOKEN or falls back to per-app tokens. The new design also needs write access for the PATCH /apps/{app}/config-vars call (mark-synced winner=db path). Should the write token be the same as the read token, or should a separate write-scoped Heroku token be provisioned? Separate tokens provide blast-radius isolation but require vault setup.

  2. Backfill timing. The backfill will create ~50+ new console_flag_promotions rows for prod (one per live FLAG_* var with no existing DB row). Should this run as a one-shot admin script, or as an Alembic data migration that runs automatically on deploy? The risk of automatic migration is that it hits the Heroku API during deploy, which could fail and block the migration chain.

  3. Drift banner placement pre-FLAG_CONSOLE_FLAG_PROMOTIONS-on. The drift banner (Phase 1/2) requires FLAG_CONSOLE_FLAG_PROMOTIONS=1 to access /console/flags. Should the drift endpoint (GET /console/flags/drift) be accessible without the main promotion flag, so a drift banner can appear on the existing dashboard before the full promotion UI ships? Or is it acceptable to gate all of this behind the single flag?

  4. Grace period before kill-switch fires. The moment a CLI operator sets a FLAG_* var directly on Heroku, the next 5-minute reconciler cycle will mark the row drifted and disable the UI promote button. Is 5 minutes the right grace period, or should the kill-switch only fire after N consecutive drifted cycles (e.g., 15 minutes = 3 cycles) to avoid transient failures from Heroku API flakiness?

  5. Untracked flags: backfill-only or persistent drift. Flags set directly on Heroku with no DB row are currently orphan_on_heroku warnings. After this design ships, should they create synthetic console_flag_promotions rows automatically on first reconciler detection (enabling the drift banner to surface them), or should they only be onboarded via the backfill script? Automatic creation is more visible but adds complexity to the reconciler.
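
To make open question 4 concrete, the "N consecutive drifted cycles" option could look roughly like the sketch below. This illustrates one alternative under discussion, not a decision. `DriftDebouncer` is a hypothetical name, and keeping the streak counters in memory is an assumption: they would reset on a dyno restart unless persisted.

```python
from collections import defaultdict


class DriftDebouncer:
    """Report drift only after `threshold` consecutive mismatched cycles."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self.threshold = threshold
        # Consecutive-mismatch count per (flag_key, target_app) pair.
        self._streaks: dict[tuple[str, str], int] = defaultdict(int)

    def observe(self, flag_key: str, target_app: str, mismatched: bool) -> bool:
        """Record one reconciler cycle; return True if the kill-switch should fire."""
        key = (flag_key, target_app)
        if not mismatched:
            self._streaks.pop(key, None)  # any clean cycle resets the streak
            return False
        self._streaks[key] += 1
        return self._streaks[key] >= self.threshold
```

With threshold=3 and a 5-minute interval the kill-switch fires on the third mismatched cycle in a row, roughly 15 minutes after the first one. With threshold=1 it fires on the first mismatched cycle, which is the behavior described in the rest of this document.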


14. Implementation Cards

PM can lift these cards directly. Each card is sized for a single PR by feature-developer. Cards are ordered by dependency chain.


Card 1 — Schema migration: add sync columns to console_flag_promotions

Title: feat(console): migration 0055 — add sync columns to console_flag_promotions

Body: Add six new columns to console_flag_promotions that support the bidirectional sync design. See design doc: docs/architecture/flag-reconciler-bidirectional-sync-2026-05-13.md section 4 and section 9.1.

Columns: synced (BOOLEAN, default false), last_heroku_value (TEXT nullable), last_synced_at (TIMESTAMPTZ nullable), drift_detected_at (TIMESTAMPTZ nullable), drift_reason (TEXT nullable), sync_source (TEXT nullable).

Partial index: idx_cfp_synced_false on (synced) WHERE synced = false.

Use batch_alter_table pattern (consistent with migration 0054) for dialect-neutral SQLite/Postgres compatibility.

Acceptance criteria:
- [ ] Migration 0055 applies cleanly on top of 0054 with no data loss.
- [ ] Down migration removes columns and index without error.
- [ ] ConsoleFlagPromotion model in app/models/flag_promotion.py updated to include new fields.
- [ ] Existing tests pass.
- [ ] New test: test_0055_migration.py verifies columns exist post-up and do not exist post-down.

Size: S
Risk: Low — additive columns only.
Dependencies: None (depends on 0054 already applied, which it is).


Card 2 — Reconciler service + scheduler registration

Title: feat(console): flag_reconciler.py — 5-minute bidirectional sync job

Body: Implement console/app/services/flag_reconciler.py as specified in design doc section 8. The reconciler:

  1. Iterates all 4 Heroku apps.
  2. Fetches FLAG_* config vars via GET /apps/{app}/config-vars.
  3. Reads console_flag_promotions rows for that target_app where state NOT IN ('deploying', 'rejected', 'expired', 'failed').
  4. Diffs each row's new_value against the Heroku boolean.
  5. Writes synced, last_heroku_value, drift_reason, and timestamp columns.
  6. Emits console_audit_log rows with action="console.flag.sync_updated" for each change to synced.
  7. Never raises; errors are captured per-app in ReconcileResult.

Register in APScheduler: interval=5m, misfire_grace_time=60s, id="flag_reconciler". Skip rows in state='deploying' to avoid race condition during in-flight promotes.

Reuse _fetch_heroku_flag_vars / HerokuConfigFetchError from drift_reconcilers/flag_drift.py or extract to a shared heroku_client.py.

Acceptance criteria:
- [ ] reconcile_all_apps() returns one ReconcileResult per app.
- [ ] synced=true set when DB new_value matches Heroku value.
- [ ] synced=false, drift_reason, drift_detected_at set when mismatch.
- [ ] state='deploying' rows are skipped.
- [ ] Heroku API failure captured in ReconcileResult.error; no exception propagated.
- [ ] Audit row emitted for each synced column change.
- [ ] APScheduler job registered in app factory; visible in scheduler job list.
- [ ] Unit tests with mocked Heroku API covering: clean run, mismatch, orphan-on-Heroku, Heroku failure.
- [ ] No token values in any log output (hash only).

Size: M
Risk: Medium — introduces write behavior; must not corrupt existing promotion state.
Dependencies: Card 1.


Card 3 — Backfill data migration script

Title: feat(console): one-shot backfill — seed console_flag_promotions rows for existing Heroku FLAG_* vars

Body: Implement scripts/flag_reconciler_backfill.py (not an Alembic migration; invoked manually by operator). See design doc section 9.2.

Script logic:

  1. For each of the 4 Heroku apps, fetch all FLAG_* config vars.
  2. For each var with no matching (flag_key, target_app) row in console_flag_promotions, insert a new row: state='deployed', new_value=<heroku_boolean>, synced=true, sync_source='heroku_cli_backfill', created_by='system_backfill', last_heroku_value=<str>, last_synced_at=<utcnow>.
  3. For existing rows in console_flag_promotions for this (flag_key, target_app) where synced=false (NULL or explicit false), diff against Heroku and update.
  4. Print a summary: N rows inserted, M rows updated, K errors.
  5. Idempotent: running twice produces the same result.

Include a --dry-run flag that prints the diff without writing.

Acceptance criteria:
- [ ] --dry-run prints all planned inserts/updates without writing.
- [ ] Script is idempotent (run twice = same result).
- [ ] All 4 apps processed.
- [ ] Rows inserted have sync_source='heroku_cli_backfill'.
- [ ] Script exits non-zero on Heroku API failure after logging the error.
- [ ] Unit test with mocked Heroku API verifying idempotency and insert/update logic.
- [ ] Runbook entry in docs/runbooks/ describing when and how to invoke.

Size: S
Risk: Low — script is separate from main app; uses --dry-run for validation.
Dependencies: Card 1.


Card 4 — Mark-synced API endpoints

Title: feat(console): POST /flags/<key>/mark-synced + GET /flags/drift endpoints

Body: Add two new endpoints to console/app/blueprints/flags.py. See design doc section 5.2.

GET /console/flags/drift: Returns all rows where synced=false. Auth: @require_role("ops", "superadmin"). Response is a JSON list of drifted rows (flag_key, target_app, db_value, heroku_value, drift_reason, drift_detected_at).

POST /console/flags/<flag_key>/mark-synced: Resolves a drift conflict. Body: {"winner": "db"|"heroku", "totp_code": "..."}. Auth: @require_role("superadmin") + @require_totp_elevation. Winner=db: calls Heroku PATCH then sets synced=true. Winner=heroku: updates DB new_value, sets synced=true. Both paths emit audit row action="console.flag.mark_synced". Returns 409 if Heroku PATCH fails (winner=db path).

Also add the drift kill-switch to PromotionService.promote_async() and PromotionService.reject(): return 409 {"error": "flag_drifted"} if the target row has synced=false at the start of the operation.

Acceptance criteria:
- [ ] GET /flags/drift returns empty list when no drift.
- [ ] GET /flags/drift returns correct rows when drift exists.
- [ ] POST /mark-synced winner=heroku updates DB value and sets synced=true.
- [ ] POST /mark-synced winner=db calls Heroku PATCH and sets synced=true on success.
- [ ] POST /mark-synced winner=db returns 409 heroku_push_failed on Heroku failure.
- [ ] POST /mark-synced emits audit row for both winner paths.
- [ ] Promote endpoint returns 409 flag_drifted when synced=false.
- [ ] Reject endpoint returns 409 flag_drifted when synced=false.
- [ ] TOTP elevation enforced on mark-synced.
- [ ] Unit tests for each path; mock Heroku API.

Size: M
Risk: Medium — introduces drift kill-switch enforcement; must not break existing promote flow for non-drifted rows.
Dependencies: Cards 1 and 2.


Card 5 — Drift banner UI

Title: feat(console): /console/flags drift banner and per-row drift badge

Body: Update /console/flags page and /console/flags/promotions page. See design doc section on UI changes.

On both pages: add a banner above the table that queries GET /console/flags/drift. If response is non-empty, show a yellow/amber banner: "N flag(s) are drifted — DB and Heroku disagree. Promote and rollback actions are disabled for affected flags until drift is resolved." Each drifted flag in the banner links to its row.

Per-row in the flags table: add a drift badge (e.g., "DRIFTED") in the status column when the row has synced=false. The badge should show drift_reason on hover (tooltip).

Promote/rollback buttons for drifted rows: render as disabled (greyed out, no click handler). Show tooltip: "Flag is drifted — resolve drift first."

Banner auto-hides via HTMX polling (reuse existing 5-second poll pattern from promotions page). If GET /flags/drift returns [], banner removes itself.

Acceptance criteria:
- [ ] Banner appears when at least one drifted row exists.
- [ ] Banner disappears when no drifted rows remain.
- [ ] Per-row drift badge visible for drifted flags.
- [ ] Promote/rollback buttons disabled on drifted rows (not just hidden).
- [ ] Banner links to correct rows.
- [ ] HTMX poll refreshes banner without full page reload.
- [ ] Works with existing FLAG_CONSOLE_FLAG_PROMOTIONS gate (404 when off).
- [ ] Snapshot test or Playwright test confirms banner renders and hides.

Size: S
Risk: Low — UI-only; no new write paths.
Dependencies: Card 4.


Card 6 — "Mark as synced" modal with side-by-side diff

Title: feat(console): mark-as-synced modal — side-by-side diff + winner selection

Body: Add a modal to the /console/flags page triggered by a "Resolve drift" button on drifted rows. See design doc section on UI changes.

Modal content:
- Header: "Resolve drift for <flag_key>"
- Two-column diff table:
  - Left column: "DB value", showing new_value from the promotion row (true/false)
  - Right column: "Heroku value", showing last_heroku_value from the row
- drift_reason displayed as a human-readable explanation (e.g., "Heroku value was changed directly via CLI after the last promotion")
- drift_detected_at timestamp
- Two action buttons: "DB value is correct (push to Heroku)" and "Heroku value is correct (update DB)"
- TOTP field (required before either action fires)
- Confirmation: high-risk flags (per YAML risk: high) require typing the flag key before confirming

On confirmation: calls POST /console/flags/<flag_key>/mark-synced with winner: "db"|"heroku". On success: modal closes, row refreshes via HTMX to show synced=true state, promote/rollback buttons re-enable. On 409 error: modal shows error message inline.

Acceptance criteria:
- [ ] Modal opens on "Resolve drift" button click.
- [ ] Side-by-side diff shows correct DB and Heroku values.
- [ ] DB wins path: calls mark-synced, shows success, re-enables promote button.
- [ ] Heroku wins path: calls mark-synced, shows updated DB value, re-enables promote button.
- [ ] TOTP required before action fires.
- [ ] High-risk flag confirmation phrase required.
- [ ] 409 error displayed inline in modal (not full page redirect).
- [ ] Modal is accessible (keyboard navigable, ARIA labels).

Size: M
Risk: Low-Medium — UI complexity; depends on Card 4 for API.
Dependencies: Card 4.


Card 7 — Promote-via-UI calls Heroku API directly

Title: feat(console): POST /flags/<key>/promote — UI-triggered Heroku config-var write

Body: Add the new POST /console/flags/<flag_key>/promote endpoint (distinct from the existing POST /flags/promotions/<id>/promote). See design doc section 5.2.

This endpoint is for direct UI-driven flag flips that bypass the soak queue. It:

  1. Validates the flag is not drifted (synced=true or no existing row).
  2. Creates a console_flag_promotions row with state='deploying', sync_source='console_ui_promote'.
  3. Calls PATCH /apps/{target_app}/config-vars on Heroku (async pattern, same as existing promote flow).
  4. On success: sets state='deployed', synced=true, last_synced_at=now.
  5. On failure: sets state='failed', synced=false, drift_reason='heroku_unset'.
  6. Emits audit row.

Return 202 with status_url for frontend polling (same pattern as existing promote_async).

This is the "Phase 3" endpoint that finally enables operators to flip flags from the UI without touching the Heroku CLI.

Acceptance criteria:
- [ ] Returns 202 with status_url on valid request.
- [ ] Returns 409 flag_drifted if existing row is drifted.
- [ ] On Heroku success: state='deployed', synced=true.
- [ ] On Heroku failure: state='failed', synced=false, drift_reason='heroku_unset'.
- [ ] Audit row emitted.
- [ ] TOTP elevation required.
- [ ] End-to-end test with mocked Heroku API.
- [ ] FLAG_CONSOLE_FLAG_PROMOTIONS gate enforced.

Size: M
Risk: Medium-High — this is the first endpoint that writes to Heroku from the UI. Requires thorough testing. Must land after Cards 1-6 are stable on staging.
Dependencies: Cards 1, 2, 4.


Card 8 — Audit log integration for all new actions

Title: feat(console): audit trail — emit trace_events for sync_updated and mark_synced

Body: Ensure all new state changes emit to both console_audit_log and trace_events (dual-write). See design doc section 12 (security considerations).

New audit actions to register:
- console.flag.sync_updated: emitted by reconciler each time synced changes (true→false or false→true). Payload: {flag_key, target_app, previous_synced, new_synced, drift_reason, last_heroku_value}.
- console.flag.mark_synced: emitted by mark-synced endpoint. Payload: {flag_key, target_app, winner, resolved_value, previous_db_value, previous_heroku_value, admin_id}.
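
The two payloads listed above could be assembled as plain dicts before being handed to whatever writes console_audit_log and trace_events. This is a sketch only: the builder names and the outer envelope (action, admin_id, payload keys) are assumptions, while the payload field names come from the list above.

```python
from typing import Optional


def sync_updated_event(
    flag_key: str,
    target_app: str,
    previous_synced: bool,
    new_synced: bool,
    drift_reason: Optional[str],
    last_heroku_value: Optional[str],
) -> dict:
    """Build the event for an automated reconciler change to synced."""
    return {
        "action": "console.flag.sync_updated",
        "admin_id": "system_reconciler",  # automated writes use this actor
        "payload": {
            "flag_key": flag_key,
            "target_app": target_app,
            "previous_synced": previous_synced,
            "new_synced": new_synced,
            "drift_reason": drift_reason,
            "last_heroku_value": last_heroku_value,
        },
    }


def mark_synced_event(
    flag_key: str,
    target_app: str,
    winner: str,
    resolved_value: bool,
    previous_db_value: bool,
    previous_heroku_value: Optional[str],
    admin_id: str,
) -> dict:
    """Build the event for an operator-driven mark-synced resolution."""
    return {
        "action": "console.flag.mark_synced",
        "admin_id": admin_id,
        "payload": {
            "flag_key": flag_key,
            "target_app": target_app,
            "winner": winner,
            "resolved_value": resolved_value,
            "previous_db_value": previous_db_value,
            "previous_heroku_value": previous_heroku_value,
            "admin_id": admin_id,
        },
    }
```

Building the dict once and passing the same object to both sinks keeps the dual-write consistent.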

Verify that existing audit actions from the promote flow (console.flag.promoted, console.flag.mark_promote) already dual-write to trace_events. If not, add dual-write in this card.

Verify retention policy: console_audit_log rows for these actions must be covered by the existing 2-year retention job.

Acceptance criteria:
- [ ] console.flag.sync_updated written on every reconciler synced column change.
- [ ] console.flag.mark_synced written on every mark-synced endpoint call.
- [ ] Both actions dual-write to trace_events table.
- [ ] Payload fields match specification above.
- [ ] admin_id is 'system_reconciler' for automated reconciler writes.
- [ ] Retention policy confirmed to cover new action types.
- [ ] Unit test: reconciler run triggers correct audit row.
- [ ] Unit test: mark-synced triggers correct audit row.

Size: S
Risk: Low — audit additions only; does not change existing behavior.
Dependencies: Cards 2 and 4.