Migration gate runbook

System: Console Alembic migration gate + Raptor Alembic migration gate (CI job migration-gate in .github/workflows/ci-pr.yml) Owner: sre-agent / operator Last incident: 2026-05-11 (gate broken from day one — missing kris@ seed row, missing raptor_app role; fixed in PR #1654 and PR #1656) Last reviewed: 2026-05-31

What the migration gate does

The migration-gate CI job (ci-pr.yml, added in PR #1623, closes issue #1552) runs automatically on every PR that touches:

console/migrations/** (Console Alembic tree)
backend_v2/alembic/versions/** (Raptor Alembic tree)

It spins up an ephemeral postgres:15 service container and runs a full round-trip on both migration trees:

upgrade head → assert single head → downgrade -1 → upgrade head

Each tree gets its own ephemeral database (console_ci, raptor_ci) so there is no schema bleed.

What it catches before merge:

Failure class	First observed	Example PR
Multiple heads (duplicate `down_revision`)	2026-05-10	#1548, #1550
Postgres-incompatible expression (STABLE vs IMMUTABLE)	2026-05-10	#1551
Migration that errors on `upgrade` or `downgrade`	2026-05-11	#1654, #1656
Missing seed data required by a migration	2026-05-11	#1654
Missing Postgres role required by a migration GRANT	2026-05-11	#1656

The gate runs against an ephemeral database with no vault credentials. This bypasses the RDS IAM restriction that prevents CREATE ROLE WITH PASSWORD on Heroku Standard-0 (feedback: heroku_pg_rds_password_gotcha.md).

Seed requirements

The ephemeral CI database starts empty. Two categories of "environment prerequisites" that exist in prod but not in a fresh Postgres must be supplied by the gate itself before running alembic upgrade head.

1. Console: `kris@moosequest.net` admin row (required by migration 0016)

Migration: console/migrations/versions/0016_raxx_platform_owners_group.py

Why it is required: Migration 0016 seeds the raxx-platform-owners RBAC group and enrolls the operator (kris@moosequest.net) as its first member. Step 3 of upgrade() executes SELECT id FROM admins WHERE email = :email. If the row is absent it raises:

RuntimeError: 0016 migration: operator 'kris@moosequest.net' not found in the admins table.

This is intentional on prod — the operator account must exist before the platform-owners group is seeded. In CI there is no prior data, so the gate supplies a stub row.

How the gate supplies it (two-phase upgrade, PR #1654):

Console — alembic upgrade to 0015 (pre-seed) — brings the schema up to the migration that creates the admins table.
Console — seed kris@ admin row (required by 0016) — inserts the stub row:

INSERT INTO admins (id, email, display_name, created_at, updated_at, disabled_at)
VALUES (
  '00000000-0000-0000-0000-000000000001',
  'kris@moosequest.net',
  'CI Seed — do not use in prod',
  '2026-01-01T00:00:00',
  '2026-01-01T00:00:00',
  NULL
)
ON CONFLICT (email) DO NOTHING;

Console — alembic upgrade head — completes from 0016 onward.

Why ON CONFLICT DO NOTHING: The round-trip downgrade -1 / upgrade head must not fail on the second upgrade head. Migration 0016's downgrade() only removes rbac_user_groups membership — the admins row is untouched — so the seed is still present for the second upgrade.

Columns seeded: id, email, display_name, created_at, updated_at, disabled_at. The admins table has no password columns (schema from migration 0001). The stub UUID and timestamps are CI-only and are never used for authentication.

2. Raptor: `raptor_app` Postgres role (required by migration 0002+)

Migration: backend_v2/alembic/versions/0002_paper_orders.py (and all subsequent Raptor migrations that add tables)

Why it is required: Migration 0002 executes:

GRANT SELECT, INSERT, UPDATE, DELETE ON paper_orders TO raptor_app;
GRANT USAGE ON SEQUENCE paper_orders_id_seq TO raptor_app;

On prod, the raptor_app role is provisioned during the Postgres Standard-0 setup (RM-9), outside of Alembic. The CI ephemeral database has no such provisioning. Without the role, upgrade head fails with:

psycopg2.errors.UndefinedObject: role "raptor_app" does not exist

How the gate supplies it (PR #1656):

DO $$ BEGIN
  IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = 'raptor_app') THEN
    CREATE ROLE raptor_app NOLOGIN;
  END IF;
END $$;

This runs before alembic upgrade head on the Raptor tree. The role has no login privilege and no initial permissions — the GRANT statements inside each migration attach the needed permissions.

Rule for future data-dependent migrations

Any migration that queries an existing table for lookup data OR executes a GRANT to a named role is a seed-dependent migration. The gate must be updated in the same PR as the migration. The pattern:

Add a new CI step that runs alembic upgrade <N-1> (up to the migration before the one with the dependency).
Insert or create the required data/role idempotently (ON CONFLICT DO NOTHING / IF NOT EXISTS).
Continue with alembic upgrade head.

Every seed INSERT must use ON CONFLICT DO NOTHING. Every role creation must use the IF NOT EXISTS guard (or the DO $$ BEGIN IF NOT EXISTS ... END $$ block for Postgres < 9.6 syntax, though CI uses Postgres 15). This ensures the gate remains idempotent across re-runs and round-trips.

How to tell it is broken

migration-gate job in GitHub Actions shows FAILED for a PR touching console/migrations/** or backend_v2/alembic/versions/**.
The failing step name is one of:
Console — alembic upgrade to 0015 (pre-seed) — pre-seed step failed
Console — seed kris@ admin row (required by 0016) — psql seed failed
Console — alembic upgrade head — a migration errored
Console — assert single head — multiple heads detected
Console — downgrade -1 (reversibility check) — downgrade errored
Raptor — create raptor_app role (required by 0002+) — role creation failed
Raptor — alembic upgrade head — a Raptor migration errored
Raptor — assert single head — multiple Raptor heads detected

How to diagnose (in order)

Open the failing CI run. Navigate to the migration-gate job and expand the failing step. The Alembic stack trace or psql error is the primary signal.
Classify the error.

Error message	Failure class
`RuntimeError: ... not found in the admins table`	Missing seed row (new migration requires a row that the gate does not supply yet)
`psycopg2.errors.UndefinedObject: role "..." does not exist`	Missing Postgres role (new migration GRANTs to a role the gate does not create yet)
`DETAIL: Multiple heads are present`	Duplicate `down_revision` — two migrations claim the same parent
`psycopg2.errors.DuplicateTable: relation '...' already exists`	Two migrations both call `op.create_table` for the same table — see Failure mode F
`ERROR: cannot use a non-immutable function in index expression`	Postgres-incompatible expression in `CREATE INDEX ... WHERE`
`alembic.util.exc.CommandError: Can't locate revision ...`	Broken `down_revision` pointer — a migration references a revision ID that does not exist
Any other `psycopg2` error	Migration-specific Postgres incompatibility

For "Multiple heads" errors: Run locally to identify the conflicting revisions:

```bash # Console cd console && alembic -c migrations/alembic.ini heads

# Raptor cd backend_v2 && alembic heads ```

Each output line marked (head) is a branch tip. Find the two migrations that share the same down_revision value.

For seed row / role errors: Confirm the CI workflow has the corresponding seed step before alembic upgrade head. If it is missing, the fix is a new seed step in ci-pr.yml (see "Recovery steps" below).

Known failure modes

Failure mode A: Missing admin seed row

Symptom: Console — alembic upgrade head fails with:

RuntimeError: 0016 migration: operator 'kris@moosequest.net' not found in the admins table.

Cause: A new migration was added that runs before or alongside 0016 and depends on the same seed, OR the seed step was removed or reordered in a ci-pr.yml change.

Fix:

Verify the seed step is present and runs after alembic upgrade 0015 but before alembic upgrade head. If the seed step exists but the error still fires, the cause is likely a new migration that executes before 0016 and operates on data that requires a different seed row — diagnose by reading the new migration's upgrade() body.

If a completely new migration introduces its own seed requirement, add the corresponding seed step following the pattern established by PR #1654.

Verification: Re-run the migration-gate job. Console — seed kris@ admin row (required by 0016) must exit 0, followed by Console — alembic upgrade head exiting 0.

Failure mode B: Missing Postgres role

Symptom: Raptor — alembic upgrade head fails with:

psycopg2.errors.UndefinedObject: role "raptor_app" does not exist
LINE 1: GRANT SELECT, INSERT, UPDATE, DELETE ON <table> TO raptor_app

Cause: A new Raptor migration GRANTs to raptor_app (or another named role) that the gate's role-creation step does not cover. Most commonly this happens if a new Raptor migration GRANTs to a role other than raptor_app.

Fix:

Add a role-creation step to the Raptor section of the migration-gate job in ci-pr.yml, modeled on the existing raptor_app step:

DO $$ BEGIN
  IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = '<new_role_name>') THEN
    CREATE ROLE <new_role_name> NOLOGIN;
  END IF;
END $$;

Place this step before Raptor — alembic upgrade head.

Verification: Re-run the migration-gate job. The new role-creation step must exit 0, and Raptor — alembic upgrade head must exit 0 after it.

Failure mode C: Duplicate revision (multiple heads)

Symptom: Console — assert single head or Raptor — assert single head fails with output like:

Console head count: 2
ERROR: Console alembic has 2 heads — expected exactly 1.

Or alembic upgrade head itself may error:

DETAIL: Multiple heads are present; please specify the target revision.

Cause: Two migrations were merged into main that both set down_revision to the same parent revision. This is the duplicate-revision collision pattern. Historical examples:

PR #1548, #1550: 0022, 0024 both set down_revision = "0021" / "0022" respectively — caused by concurrent feature branches merging without rebasing.
PR #1778, #1781 (0038 chain): 0037_promote_trace_mode_transitions.py and 0038_promote_demo_paper_fill.py both claimed revision "0038". Fixed by renumbering:
PR #1781 (2225e2f2): renamed 0037_promote_trace_mode_transitions.py → 0038_promote_trace_mode_transitions.py, updated revision = "0038".
PR #1778 (fd175f8e): renamed 0038_promote_demo_paper_fill.py → 0040_promote_demo_paper_fill.py, updated revision = "0040" and down_revision = "0039", producing the correct linear chain: ...0037 → 0038 (trace_mode) → 0039 (auth_backup_codes) → 0040 (demo_paper_fill).

Fix:

Identify the two colliding migrations. Choose the one that was merged later (higher PR number) and renumber it to the next available revision ID:

Rename the file: XXXX_old_name.py → YYYY_old_name.py where YYYY is the next unused number.
Update revision = "YYYY" inside the file.
Update down_revision = "ZZZZ" to point to the correct predecessor.
Verify the chain is linear: alembic -c console/migrations/alembic.ini heads should return exactly one (head).

The fix can be committed directly to a PR branch. The migration-gate job on that PR will validate the chain.

Verification: Console — assert single head or Raptor — assert single head must report head count: 1.

Failure mode D: Postgres-incompatible index predicate

Symptom: alembic upgrade head fails with:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.InvalidColumnDefinition)
ERROR:  cannot use a non-immutable function in index expression

Cause: A migration creates a partial index with a WHERE clause that uses a non-IMMUTABLE function (e.g., CURRENT_TIMESTAMP, now()). Postgres requires index predicates to be IMMUTABLE. This was the failure mode in PR #1551.

Fix: Change the index predicate to use only IMMUTABLE expressions (literal values, IMMUTABLE-marked functions). If a time-bounded partial index is genuinely needed, use a fixed literal timestamp or restructure the index.

Verification: alembic upgrade head must exit 0. Confirm the index exists with the correct predicate:

SELECT indexname, indexdef FROM pg_indexes WHERE tablename = '<table>';

Failure mode F: DuplicateTable — two migrations both create the same table

Symptom: Raptor — alembic upgrade head fails with:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.DuplicateTable)
ERROR:  relation '<table_name>' already exists

Cause: Two migrations in the linear chain both call op.create_table('<table_name>', ...). Because both run in sequence on a fresh Postgres database (as in CI), the second create fails. This typically happens when a later migration is written with the incorrect assumption that the earlier migration only exists in a different schema chain (e.g., a SQLite-only legacy chain) and does not apply to the Postgres Alembic chain.

First observed: paper_orders duplicate between 0002_paper_orders.py and 0022_mbt_paper_tables.py (fixed in PR that resolves the Settings 6-tab rebuild gate failure, 2026-05-31).

Fix:

The later migration must explicitly drop the earlier table before re-creating it. Determine which schema is authoritative post-migration:

If the later migration replaces the earlier table with a different schema (different columns/purpose), add op.drop_index(...) + op.drop_table(...) calls at the top of the later migration's upgrade(), before the op.create_table(...) call. Add the inverse (recreate the earlier schema) to the later migration's downgrade() so the chain remains consistent if the operator rolls back.
If the later migration is redundant (exactly the same schema), delete the duplicate op.create_table block and replace it with a comment explaining the earlier migration already created the table.

Pattern for case 1 (schema replacement):

def upgrade() -> None:
    # Drop earlier table first — <earlier_migration> created this with a
    # different schema. This migration replaces it.
    op.drop_index("<earlier_index>", table_name="<table_name>")
    op.drop_table("<table_name>")
    # ... now create the new version
    op.create_table("<table_name>", ...)

def downgrade() -> None:
    # Drop new version
    op.drop_table("<table_name>")
    # Restore earlier schema so the chain is consistent
    op.create_table("<table_name>", ...)  # mirror of earlier migration's schema

Verification: alembic upgrade head must exit 0 on a fresh Postgres. Confirm the final table schema matches the later migration's intent:

SELECT column_name, data_type FROM information_schema.columns
WHERE table_name = '<table_name>' ORDER BY ordinal_position;

Failure mode E: Broken `down_revision` pointer

Symptom: alembic upgrade head fails with:

alembic.util.exc.CommandError: Can't locate revision identified by '<hash>'

Cause: A migration's down_revision references a revision ID that does not exist in the chain. This happens when a migration file is deleted, renamed without updating the successor's down_revision, or when a PR adds a migration that references a revision that has not merged yet.

Fix: Update the down_revision in the orphaned migration to point to the correct predecessor revision ID. Run alembic history to see the full chain and identify the gap.

Verification: alembic history --verbose must show a contiguous chain with no gaps.

First failure / next step decision tree

migration-gate fails
│
├── Step: "Console — alembic upgrade to 0015 (pre-seed)"
│   └── Cause: schema error before 0015 — look at migrations 0001–0015
│       → Feature-dev owner of the failing migration must fix it.
│
├── Step: "Console — seed kris@ admin row (required by 0016)"
│   └── Cause: psql syntax error or admins table missing
│       → Check that 0015 actually created the admins table (gate step ordering)
│       → If admins table is missing: upgrade-to-0015 step may have silently exited 0 on an error
│       → Escalate to operator if the admins schema has been modified in a recent migration
│
├── Step: "Console — alembic upgrade head"
│   ├── RuntimeError: operator not found in admins table
│   │   → A new migration requires a seed row that the gate does not supply.
│   │   → SRE adds seed step to ci-pr.yml. Must ship in the same PR as the migration.
│   ├── psycopg2.errors.UndefinedObject (role)
│   │   → A new migration GRANTs to a role the gate does not create.
│   │   → SRE adds role-creation step to ci-pr.yml. Must ship in the same PR as the migration.
│   ├── Other psycopg2 error
│   │   → Migration-specific Postgres incompatibility — feature-dev fixes the migration.
│   └── alembic.util.exc.CommandError: Can't locate revision
│       → Broken down_revision pointer — feature-dev fixes the chain.
│
├── Step: "Console — assert single head" / "Raptor — assert single head"
│   └── head count > 1
│       → Duplicate revision collision. Identify and renumber the later migration.
│       → Feature-dev (whichever PR introduced the collision) fixes it.
│       → SRE verifies the fix resolves the gate before approving.
│
├── Step: "Console — downgrade -1" / "Raptor — downgrade -1"
│   └── Downgrade errored
│       → The migration's downgrade() function has a bug.
│       → Feature-dev fixes the downgrade() path. All migrations must be reversible.
│
├── Step: "Raptor — create raptor_app role (required by 0002+)"
│   └── DO block errored — likely psql connectivity issue or syntax
│       → Check the service container is healthy (rare; usually a transient CI issue)
│       → Re-run the job once. If it fails a second time, file a type:reliability issue.
│
└── Step: "Raptor — alembic upgrade head"
    ├── psycopg2.errors.UndefinedObject (role)
    │   → New migration GRANTs to a role the gate does not create → SRE adds role step
    ├── psycopg2.errors.DuplicateTable
    │   → Two migrations both create the same table (see Failure mode F)
    │   → grep for op.create_table('<table>') across all versions files
    │   → Later migration must drop the earlier table before re-creating (or skip if redundant)
    └── Other error
        → Feature-dev fixes the migration.

Escalation

When SRE handles it directly

Missing seed row: SRE adds the seed step to ci-pr.yml in the same PR as the migration.
Missing role: SRE adds the role-creation step to ci-pr.yml in the same PR as the migration.
Duplicate revision (multiple heads): SRE coordinates with feature-dev to renumber; SRE validates the fix.
Transient CI infrastructure failure (Postgres service container fails to start): re-run the job; file a reliability issue if it recurs.

When to escalate to feature-dev

The migration's upgrade() or downgrade() body contains a bug (Postgres-incompatible expression, wrong SQL, broken FK ordering).
A broken down_revision pointer — requires the migration author to confirm the intended predecessor.
A new seed requirement that was not communicated before merge.

When to escalate to operator

The gate itself is being bypassed or disabled in ci-pr.yml without a documented reason.
A migration failure in prod (not CI) — the prod migration was never gated, implying the migration touches schema that the gate does not cover.
Any migration that touches payment, audit, or RBAC schema and fails — treat as SEV-2 until confirmed non-data-loss.
The gate is failing consistently across all PRs despite no migration changes (CI infrastructure issue).

Contact: operator (kris@moosequest.net) via Slack #ops or direct message (D0AJ7K184TV).

Reference	Description
PR #1623	Gate introduced (`ci-pr.yml`, `migration-gate` job)
PR #1654	`kris@` admin row seed step added (two-phase Console upgrade)
PR #1656	`raptor_app` role creation step added
PR #1781	Emergency recovery — `0037_promote_trace_mode_transitions` renamed to `0038` (duplicate 0037 chain repair)
PR #1778	`0038_promote_demo_paper_fill` renamed to `0040` (duplicate 0038 collision resolved)
Issue #1655	Tracking issue — documents the seed-step pattern
Issue #1552	Original gate request
Issue #1548	Duplicate `down_revision` hotfix #1
Issue #1550	Duplicate `down_revision` hotfix #2
Issue #1551	Postgres-incompatible index predicate hotfix
`feedback_heroku_pg_rds_password_gotcha.md`	Why the gate uses ephemeral Postgres instead of the RDS-backed Heroku Postgres
`feedback_gh_actions_transitive_skip.md`	Why every job downstream of `detect-changes` needs an explicit `if:` clause

Migration gate runbook

What the migration gate does

Seed requirements

1. Console: kris@moosequest.net admin row (required by migration 0016)

2. Raptor: raptor_app Postgres role (required by migration 0002+)

Rule for future data-dependent migrations

How to tell it is broken

How to diagnose (in order)

Known failure modes

Failure mode A: Missing admin seed row

Failure mode B: Missing Postgres role

Failure mode C: Duplicate revision (multiple heads)

Failure mode D: Postgres-incompatible index predicate

Failure mode F: DuplicateTable — two migrations both create the same table

Failure mode E: Broken down_revision pointer

First failure / next step decision tree

Escalation

When SRE handles it directly

When to escalate to feature-dev

When to escalate to operator

Related references

1. Console: `kris@moosequest.net` admin row (required by migration 0016)

2. Raptor: `raptor_app` Postgres role (required by migration 0002+)

Failure mode E: Broken `down_revision` pointer