Migration gate runbook
System: Console Alembic migration gate + Raptor Alembic migration gate (CI job migration-gate in .github/workflows/ci-pr.yml)
Owner: sre-agent / operator
Last incident: 2026-05-11 (gate broken from day one — missing kris@ seed row, missing raptor_app role; fixed in PR #1654 and PR #1656)
Last reviewed: 2026-05-31
What the migration gate does
The migration-gate CI job (ci-pr.yml, added in PR #1623, closes issue #1552) runs automatically on every PR that touches:
console/migrations/**(Console Alembic tree)backend_v2/alembic/versions/**(Raptor Alembic tree)
It spins up an ephemeral postgres:15 service container and runs a full round-trip on both migration trees:
upgrade head → assert single head → downgrade -1 → upgrade head
Each tree gets its own ephemeral database (console_ci, raptor_ci) so there is no schema bleed.
What it catches before merge:
| Failure class | First observed | Example PR |
|---|---|---|
Multiple heads (duplicate down_revision) |
2026-05-10 | #1548, #1550 |
| Postgres-incompatible expression (STABLE vs IMMUTABLE) | 2026-05-10 | #1551 |
Migration that errors on upgrade or downgrade |
2026-05-11 | #1654, #1656 |
| Missing seed data required by a migration | 2026-05-11 | #1654 |
| Missing Postgres role required by a migration GRANT | 2026-05-11 | #1656 |
The gate runs against an ephemeral database with no vault credentials. This bypasses the RDS IAM restriction that prevents CREATE ROLE WITH PASSWORD on Heroku Standard-0 (feedback: heroku_pg_rds_password_gotcha.md).
Seed requirements
The ephemeral CI database starts empty. Two categories of "environment prerequisites" that exist in prod but not in a fresh Postgres must be supplied by the gate itself before running alembic upgrade head.
1. Console: kris@moosequest.net admin row (required by migration 0016)
Migration: console/migrations/versions/0016_raxx_platform_owners_group.py
Why it is required: Migration 0016 seeds the raxx-platform-owners RBAC group and enrolls the operator (kris@moosequest.net) as its first member. Step 3 of upgrade() executes SELECT id FROM admins WHERE email = :email. If the row is absent it raises:
RuntimeError: 0016 migration: operator 'kris@moosequest.net' not found in the admins table.
This is intentional on prod — the operator account must exist before the platform-owners group is seeded. In CI there is no prior data, so the gate supplies a stub row.
How the gate supplies it (two-phase upgrade, PR #1654):
Console — alembic upgrade to 0015 (pre-seed)— brings the schema up to the migration that creates theadminstable.Console — seed kris@ admin row (required by 0016)— inserts the stub row:
INSERT INTO admins (id, email, display_name, created_at, updated_at, disabled_at)
VALUES (
'00000000-0000-0000-0000-000000000001',
'kris@moosequest.net',
'CI Seed — do not use in prod',
'2026-01-01T00:00:00',
'2026-01-01T00:00:00',
NULL
)
ON CONFLICT (email) DO NOTHING;
Console — alembic upgrade head— completes from 0016 onward.
Why ON CONFLICT DO NOTHING: The round-trip downgrade -1 / upgrade head must not fail on the second upgrade head. Migration 0016's downgrade() only removes rbac_user_groups membership — the admins row is untouched — so the seed is still present for the second upgrade.
Columns seeded: id, email, display_name, created_at, updated_at, disabled_at. The admins table has no password columns (schema from migration 0001). The stub UUID and timestamps are CI-only and are never used for authentication.
2. Raptor: raptor_app Postgres role (required by migration 0002+)
Migration: backend_v2/alembic/versions/0002_paper_orders.py (and all subsequent Raptor migrations that add tables)
Why it is required: Migration 0002 executes:
GRANT SELECT, INSERT, UPDATE, DELETE ON paper_orders TO raptor_app;
GRANT USAGE ON SEQUENCE paper_orders_id_seq TO raptor_app;
On prod, the raptor_app role is provisioned during the Postgres Standard-0 setup (RM-9), outside of Alembic. The CI ephemeral database has no such provisioning. Without the role, upgrade head fails with:
psycopg2.errors.UndefinedObject: role "raptor_app" does not exist
How the gate supplies it (PR #1656):
DO $$ BEGIN
IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = 'raptor_app') THEN
CREATE ROLE raptor_app NOLOGIN;
END IF;
END $$;
This runs before alembic upgrade head on the Raptor tree. The role has no login privilege and no initial permissions — the GRANT statements inside each migration attach the needed permissions.
Rule for future data-dependent migrations
Any migration that queries an existing table for lookup data OR executes a GRANT to a named role is a seed-dependent migration. The gate must be updated in the same PR as the migration. The pattern:
- Add a new CI step that runs
alembic upgrade <N-1>(up to the migration before the one with the dependency). - Insert or create the required data/role idempotently (
ON CONFLICT DO NOTHING/IF NOT EXISTS). - Continue with
alembic upgrade head.
Every seed INSERT must use ON CONFLICT DO NOTHING. Every role creation must use the IF NOT EXISTS guard (or the DO $$ BEGIN IF NOT EXISTS ... END $$ block for Postgres < 9.6 syntax, though CI uses Postgres 15). This ensures the gate remains idempotent across re-runs and round-trips.
How to tell it is broken
migration-gatejob in GitHub Actions showsFAILEDfor a PR touchingconsole/migrations/**orbackend_v2/alembic/versions/**.- The failing step name is one of:
Console — alembic upgrade to 0015 (pre-seed)— pre-seed step failedConsole — seed kris@ admin row (required by 0016)— psql seed failedConsole — alembic upgrade head— a migration erroredConsole — assert single head— multiple heads detectedConsole — downgrade -1 (reversibility check)— downgrade erroredRaptor — create raptor_app role (required by 0002+)— role creation failedRaptor — alembic upgrade head— a Raptor migration erroredRaptor — assert single head— multiple Raptor heads detected
How to diagnose (in order)
-
Open the failing CI run. Navigate to the
migration-gatejob and expand the failing step. The Alembic stack trace or psql error is the primary signal. -
Classify the error.
| Error message | Failure class |
|---|---|
RuntimeError: ... not found in the admins table |
Missing seed row (new migration requires a row that the gate does not supply yet) |
psycopg2.errors.UndefinedObject: role "..." does not exist |
Missing Postgres role (new migration GRANTs to a role the gate does not create yet) |
DETAIL: Multiple heads are present |
Duplicate down_revision — two migrations claim the same parent |
psycopg2.errors.DuplicateTable: relation '...' already exists |
Two migrations both call op.create_table for the same table — see Failure mode F |
ERROR: cannot use a non-immutable function in index expression |
Postgres-incompatible expression in CREATE INDEX ... WHERE |
alembic.util.exc.CommandError: Can't locate revision ... |
Broken down_revision pointer — a migration references a revision ID that does not exist |
Any other psycopg2 error |
Migration-specific Postgres incompatibility |
- For "Multiple heads" errors: Run locally to identify the conflicting revisions:
```bash # Console cd console && alembic -c migrations/alembic.ini heads
# Raptor cd backend_v2 && alembic heads ```
Each output line marked (head) is a branch tip. Find the two migrations that share the same down_revision value.
- For seed row / role errors: Confirm the CI workflow has the corresponding seed step before
alembic upgrade head. If it is missing, the fix is a new seed step inci-pr.yml(see "Recovery steps" below).
Known failure modes
Failure mode A: Missing admin seed row
Symptom: Console — alembic upgrade head fails with:
RuntimeError: 0016 migration: operator 'kris@moosequest.net' not found in the admins table.
Cause: A new migration was added that runs before or alongside 0016 and depends on the same seed, OR the seed step was removed or reordered in a ci-pr.yml change.
Fix:
Verify the seed step is present and runs after alembic upgrade 0015 but before alembic upgrade head. If the seed step exists but the error still fires, the cause is likely a new migration that executes before 0016 and operates on data that requires a different seed row — diagnose by reading the new migration's upgrade() body.
If a completely new migration introduces its own seed requirement, add the corresponding seed step following the pattern established by PR #1654.
Verification: Re-run the migration-gate job. Console — seed kris@ admin row (required by 0016) must exit 0, followed by Console — alembic upgrade head exiting 0.
Failure mode B: Missing Postgres role
Symptom: Raptor — alembic upgrade head fails with:
psycopg2.errors.UndefinedObject: role "raptor_app" does not exist
LINE 1: GRANT SELECT, INSERT, UPDATE, DELETE ON <table> TO raptor_app
Cause: A new Raptor migration GRANTs to raptor_app (or another named role) that the gate's role-creation step does not cover. Most commonly this happens if a new Raptor migration GRANTs to a role other than raptor_app.
Fix:
Add a role-creation step to the Raptor section of the migration-gate job in ci-pr.yml, modeled on the existing raptor_app step:
DO $$ BEGIN
IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = '<new_role_name>') THEN
CREATE ROLE <new_role_name> NOLOGIN;
END IF;
END $$;
Place this step before Raptor — alembic upgrade head.
Verification: Re-run the migration-gate job. The new role-creation step must exit 0, and Raptor — alembic upgrade head must exit 0 after it.
Failure mode C: Duplicate revision (multiple heads)
Symptom: Console — assert single head or Raptor — assert single head fails with output like:
Console head count: 2
ERROR: Console alembic has 2 heads — expected exactly 1.
Or alembic upgrade head itself may error:
DETAIL: Multiple heads are present; please specify the target revision.
Cause: Two migrations were merged into main that both set down_revision to the same parent revision. This is the duplicate-revision collision pattern. Historical examples:
- PR #1548, #1550:
0022,0024both setdown_revision = "0021"/"0022"respectively — caused by concurrent feature branches merging without rebasing. - PR #1778, #1781 (0038 chain):
0037_promote_trace_mode_transitions.pyand0038_promote_demo_paper_fill.pyboth claimed revision"0038". Fixed by renumbering: - PR #1781 (
2225e2f2): renamed0037_promote_trace_mode_transitions.py→0038_promote_trace_mode_transitions.py, updatedrevision = "0038". - PR #1778 (
fd175f8e): renamed0038_promote_demo_paper_fill.py→0040_promote_demo_paper_fill.py, updatedrevision = "0040"anddown_revision = "0039", producing the correct linear chain:...0037 → 0038 (trace_mode) → 0039 (auth_backup_codes) → 0040 (demo_paper_fill).
Fix:
Identify the two colliding migrations. Choose the one that was merged later (higher PR number) and renumber it to the next available revision ID:
- Rename the file:
XXXX_old_name.py→YYYY_old_name.pywhereYYYYis the next unused number. - Update
revision = "YYYY"inside the file. - Update
down_revision = "ZZZZ"to point to the correct predecessor. - Verify the chain is linear:
alembic -c console/migrations/alembic.ini headsshould return exactly one(head).
The fix can be committed directly to a PR branch. The migration-gate job on that PR will validate the chain.
Verification: Console — assert single head or Raptor — assert single head must report head count: 1.
Failure mode D: Postgres-incompatible index predicate
Symptom: alembic upgrade head fails with:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.InvalidColumnDefinition)
ERROR: cannot use a non-immutable function in index expression
Cause: A migration creates a partial index with a WHERE clause that uses a non-IMMUTABLE function (e.g., CURRENT_TIMESTAMP, now()). Postgres requires index predicates to be IMMUTABLE. This was the failure mode in PR #1551.
Fix: Change the index predicate to use only IMMUTABLE expressions (literal values, IMMUTABLE-marked functions). If a time-bounded partial index is genuinely needed, use a fixed literal timestamp or restructure the index.
Verification: alembic upgrade head must exit 0. Confirm the index exists with the correct predicate:
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = '<table>';
Failure mode F: DuplicateTable — two migrations both create the same table
Symptom: Raptor — alembic upgrade head fails with:
sqlalchemy.exc.ProgrammingError: (psycopg2.errors.DuplicateTable)
ERROR: relation '<table_name>' already exists
Cause: Two migrations in the linear chain both call op.create_table('<table_name>', ...). Because both run in sequence on a fresh Postgres database (as in CI), the second create fails. This typically happens when a later migration is written with the incorrect assumption that the earlier migration only exists in a different schema chain (e.g., a SQLite-only legacy chain) and does not apply to the Postgres Alembic chain.
First observed: paper_orders duplicate between 0002_paper_orders.py and 0022_mbt_paper_tables.py (fixed in PR that resolves the Settings 6-tab rebuild gate failure, 2026-05-31).
Fix:
The later migration must explicitly drop the earlier table before re-creating it. Determine which schema is authoritative post-migration:
-
If the later migration replaces the earlier table with a different schema (different columns/purpose), add
op.drop_index(...)+op.drop_table(...)calls at the top of the later migration'supgrade(), before theop.create_table(...)call. Add the inverse (recreate the earlier schema) to the later migration'sdowngrade()so the chain remains consistent if the operator rolls back. -
If the later migration is redundant (exactly the same schema), delete the duplicate
op.create_tableblock and replace it with a comment explaining the earlier migration already created the table.
Pattern for case 1 (schema replacement):
def upgrade() -> None:
# Drop earlier table first — <earlier_migration> created this with a
# different schema. This migration replaces it.
op.drop_index("<earlier_index>", table_name="<table_name>")
op.drop_table("<table_name>")
# ... now create the new version
op.create_table("<table_name>", ...)
def downgrade() -> None:
# Drop new version
op.drop_table("<table_name>")
# Restore earlier schema so the chain is consistent
op.create_table("<table_name>", ...) # mirror of earlier migration's schema
Verification: alembic upgrade head must exit 0 on a fresh Postgres. Confirm the final table schema matches the later migration's intent:
SELECT column_name, data_type FROM information_schema.columns
WHERE table_name = '<table_name>' ORDER BY ordinal_position;
Failure mode E: Broken down_revision pointer
Symptom: alembic upgrade head fails with:
alembic.util.exc.CommandError: Can't locate revision identified by '<hash>'
Cause: A migration's down_revision references a revision ID that does not exist in the chain. This happens when a migration file is deleted, renamed without updating the successor's down_revision, or when a PR adds a migration that references a revision that has not merged yet.
Fix: Update the down_revision in the orphaned migration to point to the correct predecessor revision ID. Run alembic history to see the full chain and identify the gap.
Verification: alembic history --verbose must show a contiguous chain with no gaps.
First failure / next step decision tree
migration-gate fails
│
├── Step: "Console — alembic upgrade to 0015 (pre-seed)"
│ └── Cause: schema error before 0015 — look at migrations 0001–0015
│ → Feature-dev owner of the failing migration must fix it.
│
├── Step: "Console — seed kris@ admin row (required by 0016)"
│ └── Cause: psql syntax error or admins table missing
│ → Check that 0015 actually created the admins table (gate step ordering)
│ → If admins table is missing: upgrade-to-0015 step may have silently exited 0 on an error
│ → Escalate to operator if the admins schema has been modified in a recent migration
│
├── Step: "Console — alembic upgrade head"
│ ├── RuntimeError: operator not found in admins table
│ │ → A new migration requires a seed row that the gate does not supply.
│ │ → SRE adds seed step to ci-pr.yml. Must ship in the same PR as the migration.
│ ├── psycopg2.errors.UndefinedObject (role)
│ │ → A new migration GRANTs to a role the gate does not create.
│ │ → SRE adds role-creation step to ci-pr.yml. Must ship in the same PR as the migration.
│ ├── Other psycopg2 error
│ │ → Migration-specific Postgres incompatibility — feature-dev fixes the migration.
│ └── alembic.util.exc.CommandError: Can't locate revision
│ → Broken down_revision pointer — feature-dev fixes the chain.
│
├── Step: "Console — assert single head" / "Raptor — assert single head"
│ └── head count > 1
│ → Duplicate revision collision. Identify and renumber the later migration.
│ → Feature-dev (whichever PR introduced the collision) fixes it.
│ → SRE verifies the fix resolves the gate before approving.
│
├── Step: "Console — downgrade -1" / "Raptor — downgrade -1"
│ └── Downgrade errored
│ → The migration's downgrade() function has a bug.
│ → Feature-dev fixes the downgrade() path. All migrations must be reversible.
│
├── Step: "Raptor — create raptor_app role (required by 0002+)"
│ └── DO block errored — likely psql connectivity issue or syntax
│ → Check the service container is healthy (rare; usually a transient CI issue)
│ → Re-run the job once. If it fails a second time, file a type:reliability issue.
│
└── Step: "Raptor — alembic upgrade head"
├── psycopg2.errors.UndefinedObject (role)
│ → New migration GRANTs to a role the gate does not create → SRE adds role step
├── psycopg2.errors.DuplicateTable
│ → Two migrations both create the same table (see Failure mode F)
│ → grep for op.create_table('<table>') across all versions files
│ → Later migration must drop the earlier table before re-creating (or skip if redundant)
└── Other error
→ Feature-dev fixes the migration.
Escalation
When SRE handles it directly
- Missing seed row: SRE adds the seed step to
ci-pr.ymlin the same PR as the migration. - Missing role: SRE adds the role-creation step to
ci-pr.ymlin the same PR as the migration. - Duplicate revision (multiple heads): SRE coordinates with feature-dev to renumber; SRE validates the fix.
- Transient CI infrastructure failure (Postgres service container fails to start): re-run the job; file a reliability issue if it recurs.
When to escalate to feature-dev
- The migration's
upgrade()ordowngrade()body contains a bug (Postgres-incompatible expression, wrong SQL, broken FK ordering). - A broken
down_revisionpointer — requires the migration author to confirm the intended predecessor. - A new seed requirement that was not communicated before merge.
When to escalate to operator
- The gate itself is being bypassed or disabled in
ci-pr.ymlwithout a documented reason. - A migration failure in prod (not CI) — the prod migration was never gated, implying the migration touches schema that the gate does not cover.
- Any migration that touches payment, audit, or RBAC schema and fails — treat as SEV-2 until confirmed non-data-loss.
- The gate is failing consistently across all PRs despite no migration changes (CI infrastructure issue).
Contact: operator (kris@moosequest.net) via Slack #ops or direct message (D0AJ7K184TV).
Related references
| Reference | Description |
|---|---|
| PR #1623 | Gate introduced (ci-pr.yml, migration-gate job) |
| PR #1654 | kris@ admin row seed step added (two-phase Console upgrade) |
| PR #1656 | raptor_app role creation step added |
| PR #1781 | Emergency recovery — 0037_promote_trace_mode_transitions renamed to 0038 (duplicate 0037 chain repair) |
| PR #1778 | 0038_promote_demo_paper_fill renamed to 0040 (duplicate 0038 collision resolved) |
| Issue #1655 | Tracking issue — documents the seed-step pattern |
| Issue #1552 | Original gate request |
| Issue #1548 | Duplicate down_revision hotfix #1 |
| Issue #1550 | Duplicate down_revision hotfix #2 |
| Issue #1551 | Postgres-incompatible index predicate hotfix |
feedback_heroku_pg_rds_password_gotcha.md |
Why the gate uses ephemeral Postgres instead of the RDS-backed Heroku Postgres |
feedback_gh_actions_transitive_skip.md |
Why every job downstream of detect-changes needs an explicit if: clause |