Raxx · internal docs

internal · gated ↑ index

Migration gate runbook

System: Console Alembic migration gate + Raptor Alembic migration gate (CI job migration-gate in .github/workflows/ci-pr.yml) Owner: sre-agent / operator Last incident: 2026-05-11 (gate broken from day one — missing kris@ seed row, missing raptor_app role; fixed in PR #1654 and PR #1656) Last reviewed: 2026-05-12


What the migration gate does

The migration-gate CI job (ci-pr.yml, added in PR #1623, closes issue #1552) runs automatically on every PR that touches:

It spins up an ephemeral postgres:15 service container and runs a full round-trip on both migration trees:

upgrade head → assert single head → downgrade -1 → upgrade head

Each tree gets its own ephemeral database (console_ci, raptor_ci) so there is no schema bleed.

What it catches before merge:

Failure class First observed Example PR
Multiple heads (duplicate down_revision) 2026-05-10 #1548, #1550
Postgres-incompatible expression (STABLE vs IMMUTABLE) 2026-05-10 #1551
Migration that errors on upgrade or downgrade 2026-05-11 #1654, #1656
Missing seed data required by a migration 2026-05-11 #1654
Missing Postgres role required by a migration GRANT 2026-05-11 #1656

The gate runs against an ephemeral database with no vault credentials. This bypasses the RDS IAM restriction that prevents CREATE ROLE WITH PASSWORD on Heroku Standard-0 (feedback: heroku_pg_rds_password_gotcha.md).


Seed requirements

The ephemeral CI database starts empty. Two categories of "environment prerequisites" that exist in prod but not in a fresh Postgres must be supplied by the gate itself before running alembic upgrade head.

1. Console: kris@moosequest.net admin row (required by migration 0016)

Migration: console/migrations/versions/0016_raxx_platform_owners_group.py

Why it is required: Migration 0016 seeds the raxx-platform-owners RBAC group and enrolls the operator (kris@moosequest.net) as its first member. Step 3 of upgrade() executes SELECT id FROM admins WHERE email = :email. If the row is absent it raises:

RuntimeError: 0016 migration: operator 'kris@moosequest.net' not found in the admins table.

This is intentional on prod — the operator account must exist before the platform-owners group is seeded. In CI there is no prior data, so the gate supplies a stub row.

How the gate supplies it (two-phase upgrade, PR #1654):

  1. Console — alembic upgrade to 0015 (pre-seed) — brings the schema up to the migration that creates the admins table.
  2. Console — seed kris@ admin row (required by 0016) — inserts the stub row:
INSERT INTO admins (id, email, display_name, created_at, updated_at, disabled_at)
VALUES (
  '00000000-0000-0000-0000-000000000001',
  'kris@moosequest.net',
  'CI Seed — do not use in prod',
  '2026-01-01T00:00:00',
  '2026-01-01T00:00:00',
  NULL
)
ON CONFLICT (email) DO NOTHING;
  1. Console — alembic upgrade head — completes from 0016 onward.

Why ON CONFLICT DO NOTHING: The round-trip downgrade -1 / upgrade head must not fail on the second upgrade head. Migration 0016's downgrade() only removes rbac_user_groups membership — the admins row is untouched — so the seed is still present for the second upgrade.

Columns seeded: id, email, display_name, created_at, updated_at, disabled_at. The admins table has no password columns (schema from migration 0001). The stub UUID and timestamps are CI-only and are never used for authentication.

2. Raptor: raptor_app Postgres role (required by migration 0002+)

Migration: backend_v2/alembic/versions/0002_paper_orders.py (and all subsequent Raptor migrations that add tables)

Why it is required: Migration 0002 executes:

GRANT SELECT, INSERT, UPDATE, DELETE ON paper_orders TO raptor_app;
GRANT USAGE ON SEQUENCE paper_orders_id_seq TO raptor_app;

On prod, the raptor_app role is provisioned during the Postgres Standard-0 setup (RM-9), outside of Alembic. The CI ephemeral database has no such provisioning. Without the role, upgrade head fails with:

psycopg2.errors.UndefinedObject: role "raptor_app" does not exist

How the gate supplies it (PR #1656):

DO $$ BEGIN
  IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = 'raptor_app') THEN
    CREATE ROLE raptor_app NOLOGIN;
  END IF;
END $$;

This runs before alembic upgrade head on the Raptor tree. The role has no login privilege and no initial permissions — the GRANT statements inside each migration attach the needed permissions.


Rule for future data-dependent migrations

Any migration that queries an existing table for lookup data OR executes a GRANT to a named role is a seed-dependent migration. The gate must be updated in the same PR as the migration. The pattern:

  1. Add a new CI step that runs alembic upgrade <N-1> (up to the migration before the one with the dependency).
  2. Insert or create the required data/role idempotently (ON CONFLICT DO NOTHING / IF NOT EXISTS).
  3. Continue with alembic upgrade head.

Every seed INSERT must use ON CONFLICT DO NOTHING. Every role creation must use the IF NOT EXISTS guard (or the DO $$ BEGIN IF NOT EXISTS ... END $$ block for Postgres < 9.6 syntax, though CI uses Postgres 15). This ensures the gate remains idempotent across re-runs and round-trips.


How to tell it is broken


How to diagnose (in order)

  1. Open the failing CI run. Navigate to the migration-gate job and expand the failing step. The Alembic stack trace or psql error is the primary signal.

  2. Classify the error.

Error message Failure class
RuntimeError: ... not found in the admins table Missing seed row (new migration requires a row that the gate does not supply yet)
psycopg2.errors.UndefinedObject: role "..." does not exist Missing Postgres role (new migration GRANTs to a role the gate does not create yet)
DETAIL: Multiple heads are present Duplicate down_revision — two migrations claim the same parent
ERROR: cannot use a non-immutable function in index expression Postgres-incompatible expression in CREATE INDEX ... WHERE
alembic.util.exc.CommandError: Can't locate revision ... Broken down_revision pointer — a migration references a revision ID that does not exist
Any other psycopg2 error Migration-specific Postgres incompatibility
  1. For "Multiple heads" errors: Run locally to identify the conflicting revisions:

```bash # Console cd console && alembic -c migrations/alembic.ini heads

# Raptor cd backend_v2 && alembic heads ```

Each output line marked (head) is a branch tip. Find the two migrations that share the same down_revision value.

  1. For seed row / role errors: Confirm the CI workflow has the corresponding seed step before alembic upgrade head. If it is missing, the fix is a new seed step in ci-pr.yml (see "Recovery steps" below).

Known failure modes

Failure mode A: Missing admin seed row

Symptom: Console — alembic upgrade head fails with:

RuntimeError: 0016 migration: operator 'kris@moosequest.net' not found in the admins table.

Cause: A new migration was added that runs before or alongside 0016 and depends on the same seed, OR the seed step was removed or reordered in a ci-pr.yml change.

Fix:

Verify the seed step is present and runs after alembic upgrade 0015 but before alembic upgrade head. If the seed step exists but the error still fires, the cause is likely a new migration that executes before 0016 and operates on data that requires a different seed row — diagnose by reading the new migration's upgrade() body.

If a completely new migration introduces its own seed requirement, add the corresponding seed step following the pattern established by PR #1654.

Verification: Re-run the migration-gate job. Console — seed kris@ admin row (required by 0016) must exit 0, followed by Console — alembic upgrade head exiting 0.


Failure mode B: Missing Postgres role

Symptom: Raptor — alembic upgrade head fails with:

psycopg2.errors.UndefinedObject: role "raptor_app" does not exist
LINE 1: GRANT SELECT, INSERT, UPDATE, DELETE ON <table> TO raptor_app

Cause: A new Raptor migration GRANTs to raptor_app (or another named role) that the gate's role-creation step does not cover. Most commonly this happens if a new Raptor migration GRANTs to a role other than raptor_app.

Fix:

Add a role-creation step to the Raptor section of the migration-gate job in ci-pr.yml, modeled on the existing raptor_app step:

DO $$ BEGIN
  IF NOT EXISTS (SELECT FROM pg_roles WHERE rolname = '<new_role_name>') THEN
    CREATE ROLE <new_role_name> NOLOGIN;
  END IF;
END $$;

Place this step before Raptor — alembic upgrade head.

Verification: Re-run the migration-gate job. The new role-creation step must exit 0, and Raptor — alembic upgrade head must exit 0 after it.


Failure mode C: Duplicate revision (multiple heads)

Symptom: Console — assert single head or Raptor — assert single head fails with output like:

Console head count: 2
ERROR: Console alembic has 2 heads — expected exactly 1.

Or alembic upgrade head itself may error:

DETAIL: Multiple heads are present; please specify the target revision.

Cause: Two migrations were merged into main that both set down_revision to the same parent revision. This is the duplicate-revision collision pattern. Historical examples:

Fix:

Identify the two colliding migrations. Choose the one that was merged later (higher PR number) and renumber it to the next available revision ID:

  1. Rename the file: XXXX_old_name.pyYYYY_old_name.py where YYYY is the next unused number.
  2. Update revision = "YYYY" inside the file.
  3. Update down_revision = "ZZZZ" to point to the correct predecessor.
  4. Verify the chain is linear: alembic -c console/migrations/alembic.ini heads should return exactly one (head).

The fix can be committed directly to a PR branch. The migration-gate job on that PR will validate the chain.

Verification: Console — assert single head or Raptor — assert single head must report head count: 1.


Failure mode D: Postgres-incompatible index predicate

Symptom: alembic upgrade head fails with:

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.InvalidColumnDefinition)
ERROR:  cannot use a non-immutable function in index expression

Cause: A migration creates a partial index with a WHERE clause that uses a non-IMMUTABLE function (e.g., CURRENT_TIMESTAMP, now()). Postgres requires index predicates to be IMMUTABLE. This was the failure mode in PR #1551.

Fix: Change the index predicate to use only IMMUTABLE expressions (literal values, IMMUTABLE-marked functions). If a time-bounded partial index is genuinely needed, use a fixed literal timestamp or restructure the index.

Verification: alembic upgrade head must exit 0. Confirm the index exists with the correct predicate:

SELECT indexname, indexdef FROM pg_indexes WHERE tablename = '<table>';

Failure mode E: Broken down_revision pointer

Symptom: alembic upgrade head fails with:

alembic.util.exc.CommandError: Can't locate revision identified by '<hash>'

Cause: A migration's down_revision references a revision ID that does not exist in the chain. This happens when a migration file is deleted, renamed without updating the successor's down_revision, or when a PR adds a migration that references a revision that has not merged yet.

Fix: Update the down_revision in the orphaned migration to point to the correct predecessor revision ID. Run alembic history to see the full chain and identify the gap.

Verification: alembic history --verbose must show a contiguous chain with no gaps.


First failure / next step decision tree

migration-gate fails
│
├── Step: "Console — alembic upgrade to 0015 (pre-seed)"
│   └── Cause: schema error before 0015 — look at migrations 0001–0015
│       → Feature-dev owner of the failing migration must fix it.
│
├── Step: "Console — seed kris@ admin row (required by 0016)"
│   └── Cause: psql syntax error or admins table missing
│       → Check that 0015 actually created the admins table (gate step ordering)
│       → If admins table is missing: upgrade-to-0015 step may have silently exited 0 on an error
│       → Escalate to operator if the admins schema has been modified in a recent migration
│
├── Step: "Console — alembic upgrade head"
│   ├── RuntimeError: operator not found in admins table
│   │   → A new migration requires a seed row that the gate does not supply.
│   │   → SRE adds seed step to ci-pr.yml. Must ship in the same PR as the migration.
│   ├── psycopg2.errors.UndefinedObject (role)
│   │   → A new migration GRANTs to a role the gate does not create.
│   │   → SRE adds role-creation step to ci-pr.yml. Must ship in the same PR as the migration.
│   ├── Other psycopg2 error
│   │   → Migration-specific Postgres incompatibility — feature-dev fixes the migration.
│   └── alembic.util.exc.CommandError: Can't locate revision
│       → Broken down_revision pointer — feature-dev fixes the chain.
│
├── Step: "Console — assert single head" / "Raptor — assert single head"
│   └── head count > 1
│       → Duplicate revision collision. Identify and renumber the later migration.
│       → Feature-dev (whichever PR introduced the collision) fixes it.
│       → SRE verifies the fix resolves the gate before approving.
│
├── Step: "Console — downgrade -1" / "Raptor — downgrade -1"
│   └── Downgrade errored
│       → The migration's downgrade() function has a bug.
│       → Feature-dev fixes the downgrade() path. All migrations must be reversible.
│
├── Step: "Raptor — create raptor_app role (required by 0002+)"
│   └── DO block errored — likely psql connectivity issue or syntax
│       → Check the service container is healthy (rare; usually a transient CI issue)
│       → Re-run the job once. If it fails a second time, file a type:reliability issue.
│
└── Step: "Raptor — alembic upgrade head"
    ├── psycopg2.errors.UndefinedObject (role)
    │   → New migration GRANTs to a role the gate does not create → SRE adds role step
    └── Other error
        → Feature-dev fixes the migration.

Escalation

When SRE handles it directly

When to escalate to feature-dev

When to escalate to operator

Contact: operator (kris@moosequest.net) via Slack #ops or direct message (D0AJ7K184TV).


Reference Description
PR #1623 Gate introduced (ci-pr.yml, migration-gate job)
PR #1654 kris@ admin row seed step added (two-phase Console upgrade)
PR #1656 raptor_app role creation step added
PR #1781 Emergency recovery — 0037_promote_trace_mode_transitions renamed to 0038 (duplicate 0037 chain repair)
PR #1778 0038_promote_demo_paper_fill renamed to 0040 (duplicate 0038 collision resolved)
Issue #1655 Tracking issue — documents the seed-step pattern
Issue #1552 Original gate request
Issue #1548 Duplicate down_revision hotfix #1
Issue #1550 Duplicate down_revision hotfix #2
Issue #1551 Postgres-incompatible index predicate hotfix
feedback_heroku_pg_rds_password_gotcha.md Why the gate uses ephemeral Postgres instead of the RDS-backed Heroku Postgres
feedback_gh_actions_transitive_skip.md Why every job downstream of detect-changes needs an explicit if: clause