RCA — Postgres owner privilege voids audit-chain append-only invariant (#1455)
- Incident ID: 2026-05-15-audit-role-split-1455
- Date: 2026-05-15
- Severity: SEV-2
- Duration: Detected 2026-05-15 UTC; staging remediation applied same day; prod blocked pending Essential-0 → Standard-0 upgrade (ongoing)
- Blast radius: Pre-launch internal only, no customers. The audit-chain tamper-evidence guarantee was paper-only: the runtime Postgres connection held owner privileges, making every REVOKE in the SC-A2 schema a no-op.
- Author: sre-agent
Summary
Raptor's `DATABASE_URL` Postgres credential is the database owner. Even though migration 016 defined REVOKE-based append-only constraints on `customer_audit_events` and all `*_history` shadow tables, those REVOKEs were meaningless while the application runtime connected as the owner. Any bug or compromised process could DELETE or UPDATE audit rows directly. The operator tagged the issue `pre-launch-blocker` on 2026-05-15 UTC, with T-8 days to v1 launch.

Staging was fully remediated (full privilege matrix applied, verified, flag enabled, application healthy). Prod is blocked because the prod Heroku Postgres addon is on Essential-0, which does not support `pg:credentials:create`. Prod requires an operator-initiated addon upgrade to Standard-0 before role separation can be completed there.
Timeline (all times UTC)
- 2026-05-10 - SC-A1 (#1481) filed; `raptor_app` role created on staging Standard-0 via `heroku pg:credentials:create`; migration 015 written with operator runbook
- 2026-05-10 - SC-A2 (#1482) filed; `customer_audit_events` + 6 shadow tables created; REVOKE-based append-only documented in migration 016 operator runbook; Postgres DDL not yet applied
- 2026-05-12 - RM-11 (#1569) filed; SC-A1 reactivation tracked (GRANT/REVOKE + flag flip)
- 2026-05-14 - Raptor Postgres migration Phase 1–3 sub-cards all closed (RM-1 through RM-10)
- 2026-05-15 06:00 - Operator tags #1455 `pre-launch-blocker`; sre-agent engaged
- 2026-05-15 17:30 - Diagnostics: staging has `raptor_app` role + `FLAG_RAPTOR_APP_ROLE_SEPARATION=1`; only `paper_orders` had grants (all other tables missing)
- 2026-05-15 17:40 - Migration 031 written (`031_audit_role_split.sql`)
- 2026-05-15 17:43 - Block 2 (REVOKE reset) applied to staging Postgres
- 2026-05-15 17:44 - Block 3 (non-audit full DML) applied to staging
- 2026-05-15 17:45 - Block 4 (audit INSERT+SELECT; explicit REVOKE UPDATE/DELETE) applied
- 2026-05-15 17:46 - Block 5 (sequence GRANT) applied
- 2026-05-15 17:47 - Privilege matrix verified: `customer_audit_events` DELETE=f, INSERT=t; all audit/history tables UPDATE=f, DELETE=f; `alembic_version` all=f; application tables all=t
- 2026-05-15 17:48 - Staging health check returning 200; application logs clean
- 2026-05-15 17:49 - Prod diagnostic: `raptor_app` role absent; prod addon = Essential-0; `pg:credentials:create` returns "You can't create a custom credential on Essential-tier databases"
- 2026-05-15 17:50 - Prod escalation issue filed; runbook updated
Impact
- Users affected: 0 (pre-launch)
- User-visible symptoms: none
- Data integrity: at-risk (audit rows could have been modified by owner process — no evidence of tampering; no customers yet)
- Revenue / billing: ok
What went well
- The SC-A1/SC-A2 migrations had the GRANT/REVOKE logic documented inline in SQL comments, making the remediation unambiguous
- The `api/db.py` resolver + `FLAG_RAPTOR_APP_ROLE_SEPARATION` flag allowed zero-downtime activation on staging after grants were applied
- The `raptor-postgres-roles.md` runbook already documented Failure mode B (Essential-tier block), preventing wasted time on `CREATE ROLE` approaches
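
The resolver in `api/db.py` is not reproduced in this RCA. As a rough illustration of how a flag-gated credential resolver of this kind might behave, here is a minimal sketch. The function name `resolve_database_url` and the fail-closed behaviour when the app-role URL is missing are assumptions for illustration; only the variable names `DATABASE_URL`, `RAPTOR_APP_DATABASE_URL`, and `FLAG_RAPTOR_APP_ROLE_SEPARATION` come from this document.

```python
import os
from typing import Mapping, Optional


def resolve_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the Postgres URL the runtime should connect with.

    Hypothetical sketch: with the flag on, use the restricted app-role
    credential; with the flag off, use the owner credential.
    """
    env = os.environ if env is None else env

    if env.get("FLAG_RAPTOR_APP_ROLE_SEPARATION") == "1":
        app_url = env.get("RAPTOR_APP_DATABASE_URL")
        if not app_url:
            # Assumed design choice: fail closed instead of silently
            # falling back to the owner credential.
            raise RuntimeError(
                "FLAG_RAPTOR_APP_ROLE_SEPARATION=1 but "
                "RAPTOR_APP_DATABASE_URL is not set"
            )
        return app_url

    owner_url = env.get("DATABASE_URL")
    if not owner_url:
        raise RuntimeError("DATABASE_URL is not set")
    return owner_url
```

Because the choice is made from configuration alone, flipping the flag after the grants are in place switches credentials without a schema change, which is consistent with the zero-downtime activation noted above.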
What didn't go well
- RM-10 (#1568) was closed as a docs card but the prod Essential-0 → Standard-0 upgrade and role creation were not completed; the issue closure created a false impression that prod was cutover-ready
- Migration 016 operator runbook was not applied (the GRANT/REVOKE steps were in SQL comments but not tracked as a separate operational action item with a due date)
- Staging had the `raptor_app` role and the flag enabled, but only `paper_orders` had grants; the system allowed Raptor to boot and pass health checks without detecting the missing grants
- No CI check verified that `raptor_app` had the expected grant matrix on staging (the lint check `scripts/ci/lint_migration_grants.sh` verifies that new migrations include grants, but does not verify live DB state)
Root cause analysis
- Contributing factor 1: RM-10 closed as docs-only. The RM-10 issue was scoped as "write + execute prod cutover SOP." The doc was written, but execution (addon upgrade, role creation, alembic migration) requires operator action. The system allowed the issue to be closed without tracking the execution step separately, creating a gap between "SOP exists" and "SOP was run."
- Contributing factor 2: Migration 016 Postgres DDL not tracked as an action item. The GRANT/REVOKE steps for the audit tables (SC-A2) lived in SQL comments in migration 016 but had no corresponding GitHub action item with an owner and due date. The system allowed Postgres-specific DDL to be deferred indefinitely without a blocking ticket.
- Contributing factor 3: No live grant verification in CI or monitoring. The CI lint checks verify that new migration files include the correct grant patterns, but no job verifies that the live Postgres instance has the expected privilege matrix. A `raptor_app` role with only `paper_orders` grants and the flag enabled is operationally unsafe but passed all health checks.
- Contributing factor 4: Prod Postgres tier not upgraded before RM-10 closed. The migration plan documented that prod Essential-0 must be upgraded to Standard-0 as step 3 of RM-10. The upgrade requires operator action (cost decision) and has a Heroku maintenance window. The system allowed RM-10 to be closed without confirming the upgrade was complete.
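
To make contributing factor 3 concrete, the gap is the absence of any comparison between the intended privilege matrix and what the live database reports. A minimal sketch of such a comparison follows. The function `diff_privileges` and the dictionary layout are hypothetical, and the `actual` matrix would in practice be populated from `has_table_privilege` results rather than written by hand.

```python
from typing import Dict, List, Optional, Tuple

Privileges = Dict[str, bool]    # privilege name -> granted?
Matrix = Dict[str, Privileges]  # table name -> privileges

Mismatch = Tuple[str, str, bool, Optional[bool]]


def diff_privileges(expected: Matrix, actual: Matrix) -> List[Mismatch]:
    """Return (table, privilege, expected, actual) for every mismatch.

    An actual value of None means the live matrix had no entry for
    that table/privilege pair.
    """
    mismatches: List[Mismatch] = []
    for table, privs in sorted(expected.items()):
        live = actual.get(table, {})
        for priv, want in sorted(privs.items()):
            got = live.get(priv)
            if got != want:
                mismatches.append((table, priv, want, got))
    return mismatches


# Illustrative data modelled on the 17:30 staging diagnostic, where only
# paper_orders had grants.
expected = {
    "customer_audit_events": {"SELECT": True, "INSERT": True, "UPDATE": False, "DELETE": False},
    "paper_orders": {"SELECT": True, "INSERT": True, "UPDATE": True, "DELETE": True},
}
actual = {
    "customer_audit_events": {"SELECT": False, "INSERT": False, "UPDATE": False, "DELETE": False},
    "paper_orders": {"SELECT": True, "INSERT": True, "UPDATE": True, "DELETE": True},
}

for table, priv, want, got in diff_privileges(expected, actual):
    print(f"{table}: {priv} expected={want} actual={got}")
```

A check of this shape flags missing grants as well as excess ones, so it would have caught the staging state in which the role could not INSERT into the audit table at all.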
Detection
- What alerted us: operator-tagged `pre-launch-blocker` on #1455
- Time between cause and detection: ~5 days (migration 016 applied 2026-05-10; flagged 2026-05-15)
- How to detect faster: add a Heroku Scheduler or CI job that runs the `has_table_privilege` query against staging and prod Postgres, alerting on any audit table where `raptor_app` has DELETE or UPDATE
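
The exact query used during the incident is not included here. The following is a minimal sketch of how a scheduled job might build a `has_table_privilege` query and decide what to alert on. The helper names, the identifier allow-list, and the table name `paper_orders_history` are assumptions for illustration; running the SQL against Postgres (driver, connection handling) is deliberately left out.

```python
import re
from typing import Iterable, List, Tuple

# Assumed allow-list: lower-case unquoted identifiers only, so names can
# be embedded as SQL string literals without further escaping.
_IDENT = re.compile(r"[a-z_][a-z0-9_]*")


def _check_ident(name: str) -> str:
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_audit_privilege_query(role: str, audit_tables: Iterable[str]) -> str:
    """Build SQL returning (table_name, can_update, can_delete) per table."""
    role = _check_ident(role)
    values = ", ".join(f"('{_check_ident(t)}')" for t in audit_tables)
    if not values:
        raise ValueError("no audit tables given")
    return (
        "SELECT t.table_name,\n"
        f"       has_table_privilege('{role}', t.table_name, 'UPDATE') AS can_update,\n"
        f"       has_table_privilege('{role}', t.table_name, 'DELETE') AS can_delete\n"
        f"FROM (VALUES {values}) AS t(table_name)\n"
        "ORDER BY t.table_name;"
    )


def tables_to_alert_on(rows: Iterable[Tuple[str, bool, bool]]) -> List[str]:
    """Return audit tables where the role can UPDATE or DELETE."""
    return [table for table, can_update, can_delete in rows if can_update or can_delete]


sql = build_audit_privilege_query(
    "raptor_app", ["customer_audit_events", "paper_orders_history"]
)
print(sql)
```

The job would fail, or page, whenever `tables_to_alert_on` returns a non-empty list for the rows the query produces.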
Resolution
- What was changed: migration `031_audit_role_split.sql` written and all 4 grant blocks applied to staging Postgres; privilege matrix verified via `has_table_privilege`; runbook updated
- Validation: full privilege matrix query confirms DELETE=f, UPDATE=f on all audit/history tables for `raptor_app`; staging health check returning 200; Heroku logs clean
- Prod: blocked on Essential-0 → Standard-0 upgrade (operator action required)
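
The contents of `031_audit_role_split.sql` are not reproduced in this RCA. Based only on the block descriptions in the timeline (REVOKE reset, full DML on non-audit tables, INSERT+SELECT on audit tables with an explicit REVOKE of UPDATE/DELETE, sequence grants), a sketch of how such statements could be generated is shown below. The function `grant_blocks`, the `public` schema default, and the exact statement wording are assumptions, not the migration's actual text.

```python
from typing import Iterable, List


def grant_blocks(
    role: str,
    app_tables: Iterable[str],
    audit_tables: Iterable[str],
    schema: str = "public",
) -> List[str]:
    """Generate GRANT/REVOKE statements in the order of blocks 2 to 5.

    Names are interpolated unquoted, so this is only suitable for
    trusted, hard-coded identifiers.
    """
    stmts: List[str] = []

    # Block 2: reset to a known-empty baseline for the role.
    stmts.append(f"REVOKE ALL ON ALL TABLES IN SCHEMA {schema} FROM {role};")

    # Block 3: full DML on ordinary application tables.
    for table in app_tables:
        stmts.append(
            f"GRANT SELECT, INSERT, UPDATE, DELETE ON {schema}.{table} TO {role};"
        )

    # Block 4: append-only access on audit/history tables.
    for table in audit_tables:
        stmts.append(f"GRANT SELECT, INSERT ON {schema}.{table} TO {role};")
        stmts.append(f"REVOKE UPDATE, DELETE ON {schema}.{table} FROM {role};")

    # Block 5: sequences, needed for INSERTs that use serial/identity columns.
    stmts.append(f"GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {role};")
    return stmts


for stmt in grant_blocks("raptor_app", ["paper_orders"], ["customer_audit_events"]):
    print(stmt)
```

Tables that appear in neither list, such as `alembic_version`, receive nothing after the reset, which matches the all=f result recorded in the verification step.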
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Upgrade raxx-api-prod Postgres from Essential-0 to Standard-0 | operator | 2026-05-19 UTC | #1455 (this incident escalation) |
| 2 | Create raptor_app credential on prod via heroku pg:credentials:create; apply migration 031 grant blocks; set RAPTOR_APP_DATABASE_URL; enable FLAG_RAPTOR_APP_ROLE_SEPARATION=1 on prod | sre-agent (after operator completes item 1) | 2026-05-19 UTC | filed as part of prod cutover |
| 3 | Add CI job: verify live raptor_app privilege matrix on staging Postgres (DELETE=f on audit tables) | sre-agent | 2026-05-21 UTC | new type:reliability issue to file |
| 4 | Reopen or un-close RM-10 (#1568) execution tracking — split "SOP written" from "SOP executed" in any future ops card | operator / PM | immediate | #1568 |
| 5 | Create audit_archiver and raptor_audit_compliance no-login roles; requires Heroku support ticket (RDS owner cannot CREATEROLE) | operator | post-v1 | new issue |
References
- Migration: `backend_v2/db/migrations/031_audit_role_split.sql`
- Runbook: `docs/ops/runbooks/raptor-postgres-roles.md`
- Related issues: #1455 (this incident), #1481 (SC-A1 raptor_app base), #1482 (SC-A2 schema), #1568 (RM-10 prod cutover), #1569 (RM-11 reactivation)
- Feedback: `feedback_heroku_pg_rds_password_gotcha.md`