RCA — Postgres owner privilege voids audit-chain append-only invariant (#1455)

Incident ID: 2026-05-15-audit-role-split-1455
Date: 2026-05-15
Severity: SEV-2
Duration: Detected 2026-05-15 UTC; staging remediation applied same day; prod blocked pending Essential-0 → Standard-0 upgrade (ongoing)
Blast radius: Pre-launch internal only; no customers. Audit-chain tamper-evidence guarantee was paper-only: runtime Postgres connection held owner privileges, making every REVOKE in the SC-A2 schema a no-op.
Author: sre-agent

Summary

Raptor's DATABASE_URL Postgres credential is the database owner. Migration 016 defined REVOKE-based append-only constraints on customer_audit_events and all *_history shadow tables, but those REVOKEs had no practical effect while the application runtime connected as the owner: a table owner keeps, or can re-grant itself, every privilege on its own tables. Any bug or compromised process could therefore UPDATE or DELETE audit rows directly. The operator tagged the issue pre-launch-blocker on 2026-05-15 UTC, at T-8 days to v1 launch.

Staging is fully remediated: the full privilege matrix was applied and verified, the flag is enabled, and the application is healthy. Prod is blocked because the prod Heroku Postgres addon is on Essential-0, which does not support pg:credentials:create. Role separation on prod cannot be completed until the operator upgrades the addon to Standard-0.

Timeline (all times UTC)

2026-05-10 — SC-A1 (#1481) filed; raptor_app role created on staging Standard-0 via heroku pg:credentials:create; migration 015 written with operator runbook
2026-05-10 — SC-A2 (#1482) filed; customer_audit_events + 6 shadow tables created; REVOKE-based append-only documented in migration 016 operator runbook; Postgres DDL not yet applied
2026-05-12 — RM-11 (#1569) filed; SC-A1 reactivation tracked (GRANT/REVOKE + flag flip)
2026-05-14 — Raptor Postgres migration Phase 1–3 sub-cards all closed (RM-1 through RM-10)
2026-05-15 06:00 — Operator tags #1455 pre-launch-blocker; sre-agent engaged
2026-05-15 17:30 — Diagnostics: staging has raptor_app role + FLAG_RAPTOR_APP_ROLE_SEPARATION=1; only paper_orders had grants (all other tables missing)
2026-05-15 17:40 — Migration 031 written (031_audit_role_split.sql)
2026-05-15 17:43 — Block 2 (REVOKE reset) applied to staging Postgres
2026-05-15 17:44 — Block 3 (non-audit full DML) applied to staging
2026-05-15 17:45 — Block 4 (audit INSERT+SELECT; explicit REVOKE UPDATE/DELETE) applied
2026-05-15 17:46 — Block 5 (sequence GRANT) applied
2026-05-15 17:47 — Privilege matrix verified: customer_audit_events DELETE=f, INSERT=t; all audit/history tables UPDATE=f, DELETE=f; alembic_version all=f; application tables all=t
2026-05-15 17:48 — Staging health check returning 200; application logs clean
2026-05-15 17:49 — Prod diagnostic: raptor_app role absent; prod addon = Essential-0; pg:credentials:create returns "You can't create a custom credential on Essential-tier databases"
2026-05-15 17:50 — Prod escalation issue filed; runbook updated
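
The four grant blocks in the timeline (Blocks 2 to 5) can be sketched as generated SQL. This is a minimal illustration, not the contents of 031_audit_role_split.sql: the `public` schema, the `example_history` table name, the exact statements, and the helper name `grant_blocks` are all assumptions made for the example.

```python
# Hypothetical table lists. Only customer_audit_events and paper_orders are
# named in this RCA; "example_history" stands in for the *_history tables.
AUDIT_TABLES = ["customer_audit_events", "example_history"]
APP_TABLES = ["paper_orders"]


def grant_blocks(role, app_tables, audit_tables, schema="public"):
    """Return SQL statements mirroring Blocks 2-5 (assumed shape, not the real migration)."""
    stmts = []
    # Block 2: reset the role to a known-empty baseline.
    stmts.append(f"REVOKE ALL ON ALL TABLES IN SCHEMA {schema} FROM {role};")
    # Block 3: full DML on non-audit application tables.
    for table in app_tables:
        stmts.append(f"GRANT SELECT, INSERT, UPDATE, DELETE ON {table} TO {role};")
    # Block 4: append-only on audit tables (INSERT + SELECT, explicit REVOKE).
    for table in audit_tables:
        stmts.append(f"GRANT SELECT, INSERT ON {table} TO {role};")
        stmts.append(f"REVOKE UPDATE, DELETE ON {table} FROM {role};")
    # Block 5: sequences, so INSERTs that use serial/identity columns still work.
    stmts.append(f"GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {role};")
    return stmts


if __name__ == "__main__":
    for stmt in grant_blocks("raptor_app", APP_TABLES, AUDIT_TABLES):
        print(stmt)
```

Generating the statements from explicit table lists keeps the audit/non-audit split in one place, which is the property the incident showed was easy to lose when grants lived only in SQL comments.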

Impact

Users affected: 0 (pre-launch)
User-visible symptoms: none
Data integrity: at-risk (audit rows could have been modified by owner process — no evidence of tampering; no customers yet)
Revenue / billing: ok

What went well

The SC-A1/SC-A2 migrations had the GRANT/REVOKE logic documented inline in SQL comments, making the remediation unambiguous
The api/db.py resolver + FLAG_RAPTOR_APP_ROLE_SEPARATION flag allowed zero-downtime activation on staging after grants were applied
The raptor-postgres-roles.md runbook already documented Failure mode B (Essential-tier block), preventing wasted time on CREATE ROLE approaches
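
For readers unfamiliar with the flag-gated resolver, the idea can be sketched as follows. This is not the code in api/db.py: the function name `resolve_database_url` and the fail-fast behaviour when the URL is missing are assumptions; only the three environment variable names come from this document.

```python
import os


def resolve_database_url(env=None):
    """Pick the runtime connection string based on the role-separation flag.

    Assumed behaviour: with the flag set to "1", connect as the restricted
    raptor_app credential; otherwise fall back to the owner credential.
    """
    env = os.environ if env is None else env
    if env.get("FLAG_RAPTOR_APP_ROLE_SEPARATION") == "1":
        url = env.get("RAPTOR_APP_DATABASE_URL")
        if not url:
            # Refuse to fall back silently to the owner credential.
            raise RuntimeError(
                "FLAG_RAPTOR_APP_ROLE_SEPARATION=1 but RAPTOR_APP_DATABASE_URL is unset"
            )
        return url
    return env["DATABASE_URL"]
```

Because the switch is a single environment flag, grants can be applied first and the flag flipped afterwards, which is what made the staging activation zero-downtime.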

What didn't go well

RM-10 (#1568) was closed as a docs card but the prod Essential-0 → Standard-0 upgrade and role creation were not completed; the issue closure created a false impression that prod was cutover-ready
Migration 016 operator runbook was not applied (the GRANT/REVOKE steps were in SQL comments but not tracked as a separate operational action item with a due date)
Staging had raptor_app role and the flag enabled but only paper_orders had grants — the system allowed Raptor to boot and pass health checks without detecting the missing grants
No CI check verified that raptor_app had the expected grant matrix on staging (the lint check scripts/ci/lint_migration_grants.sh verifies new migrations include grants, but does not verify live DB state)
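
The gap described above, where only paper_orders had grants yet nothing flagged it, is the kind of drift a matrix comparison would catch. The sketch below assumes the expected matrix from the timeline (audit tables SELECT/INSERT only, application tables full DML, alembic_version nothing); the names `EXPECTED`, `privs`, and `diff_matrix` are hypothetical, and the observed matrix would in practice come from querying the live database.

```python
ALL_DML = ("SELECT", "INSERT", "UPDATE", "DELETE")


def privs(*granted):
    """Build a {privilege: bool} row for one table."""
    return {p: (p in granted) for p in ALL_DML}


# Expected matrix for raptor_app, taken from the verified staging state.
EXPECTED = {
    "customer_audit_events": privs("SELECT", "INSERT"),
    "paper_orders": privs(*ALL_DML),
    "alembic_version": privs(),
}


def diff_matrix(expected, observed):
    """Return (table, privilege, expected, observed) for every mismatch."""
    problems = []
    for table, row in sorted(expected.items()):
        for priv, want in sorted(row.items()):
            got = observed.get(table, {}).get(priv)  # None if not reported
            if got != want:
                problems.append((table, priv, want, got))
    return problems


if __name__ == "__main__":
    # Pre-remediation staging as described: only paper_orders had grants.
    observed = {
        "customer_audit_events": privs(),
        "paper_orders": privs(*ALL_DML),
        "alembic_version": privs(),
    }
    for problem in diff_matrix(EXPECTED, observed):
        print(problem)
```

Note that the check reports missing grants as well as excess ones, so it would flag both the unsafe owner-level access and the under-granted state staging was actually in.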

Root cause analysis

Contributing factor 1: RM-10 closed as docs-only — The RM-10 issue was scoped as "write + execute prod cutover SOP." The doc was written, but execution (addon upgrade, role creation, alembic migration) requires operator action. The system allowed the issue to be closed without tracking the execution step separately, creating a gap between "SOP exists" and "SOP was run."
Contributing factor 2: Migration 016 Postgres DDL not tracked as an action item — The GRANT/REVOKE steps for the audit tables (SC-A2) lived in SQL comments in migration 016 but had no corresponding GitHub action item with an owner and due date. The system allowed Postgres-specific DDL to be deferred indefinitely without a blocking ticket.
Contributing factor 3: No live grant verification in CI or monitoring — The CI lint checks verify that new migration files include the correct grant patterns, but no job verifies that the live Postgres instance has the expected privilege matrix. A raptor_app with only paper_orders grants and the flag enabled is operationally unsafe but passed all health checks.
Contributing factor 4: Prod Postgres tier not upgraded before RM-10 closed — The migration plan documented that prod Essential-0 must be upgraded to Standard-0 as step 3 of RM-10. The upgrade requires operator action (cost decision) and has a Heroku maintenance window. The system allowed RM-10 to be closed without confirming the upgrade was complete.

Detection

What alerted us: operator-tagged pre-launch-blocker on #1455
Time between cause and detection: ~5 days (migration 016 created the audit tables on 2026-05-10 without its GRANT/REVOKE steps being applied; flagged 2026-05-15)
How to detect faster: add a Heroku Scheduler or CI job that runs the has_table_privilege query against staging and prod Postgres, alerting on any audit table where raptor_app has DELETE or UPDATE
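
A scheduled check along these lines could build on Postgres's has_table_privilege function. The sketch below only constructs the query and filters result rows; the function names, the `example_history` table name, and the row shape are assumptions, and executing the query would need a database driver that is not shown here.

```python
# Hypothetical audit table list; "example_history" is a placeholder name.
AUDIT_TABLES = ["customer_audit_events", "example_history"]


def privilege_query(role, tables, privileges=("UPDATE", "DELETE")):
    """Build one SELECT reporting each (table, privilege) pair for a role.

    Inputs are trusted constants from the repo, not user input, so plain
    string formatting is acceptable for this sketch.
    """
    selects = [
        f"SELECT '{table}' AS table_name, '{priv}' AS privilege, "
        f"has_table_privilege('{role}', '{table}', '{priv}') AS granted"
        for table in tables
        for priv in privileges
    ]
    return "\nUNION ALL\n".join(selects) + ";"


def violations(rows):
    """Given (table_name, privilege, granted) rows, keep those that should alert."""
    return [(table, priv) for (table, priv, granted) in rows if granted]


if __name__ == "__main__":
    print(privilege_query("raptor_app", AUDIT_TABLES))
```

Any non-empty result from `violations` means raptor_app can mutate or delete audit rows, which is the alert condition described above.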

Resolution

What was changed: migration 031_audit_role_split.sql written and all 4 grant blocks applied to staging Postgres; privilege matrix verified via has_table_privilege; runbook updated
Validation: full privilege matrix query confirms DELETE=f, UPDATE=f on all audit/history tables for raptor_app; staging health check returning 200; Heroku logs clean
Prod: blocked on Essential-0 → Standard-0 upgrade (operator action required)

Action items

#	Action	Owner	Due	Issue
1	Upgrade raxx-api-prod Postgres from Essential-0 to Standard-0	operator	2026-05-19 UTC	#1455 (this incident escalation)
2	Create `raptor_app` credential on prod via `heroku pg:credentials:create`; apply migration 031 grant blocks; set `RAPTOR_APP_DATABASE_URL`; enable `FLAG_RAPTOR_APP_ROLE_SEPARATION=1` on prod	sre-agent (after op. completes item 1)	2026-05-19 UTC	filed as part of prod cutover
3	Add CI job: verify live `raptor_app` privilege matrix on staging Postgres (DELETE=f on audit tables)	sre-agent	2026-05-21 UTC	new `type:reliability` issue to file
4	Reopen RM-10 (#1568) to track execution; in future ops cards, track "SOP written" and "SOP executed" as separate items	operator / PM	immediate	#1568
5	Create `audit_archiver` and `raptor_audit_compliance` no-login roles — requires Heroku support ticket (RDS owner cannot CREATEROLE)	operator	post-v1	new issue

References

Migration: backend_v2/db/migrations/031_audit_role_split.sql
Runbook: docs/ops/runbooks/raptor-postgres-roles.md
Related issues: #1455 (this incident), #1481 (SC-A1 raptor_app base), #1482 (SC-A2 schema), #1568 (RM-10 prod cutover), #1569 (RM-11 reactivation)
Feedback: feedback_heroku_pg_rds_password_gotcha.md