Raxx · internal docs

internal · gated

Unified Customer Audit — Migration Plan v2

Status: Design v2 — pending sub-card filing
Owner: software-architect
Date: 2026-05-09 UTC
Supersedes: v1 (PR #1451)
Refs: design.md, api-contract.md, ADR-0058, ADR-0059, ADR-0063


Overview

Five existing audit streams are converged into customer_audit_events over five phases (v1 had four; v2 adds Phase 0 for role separation). Historical rows from all five legacy streams are back-filled into the unified table during Phase 2 (option D — full migration before drop). No legacy table is dropped until the unified table contains the complete historical record.


Phase 0 — Role Separation (New — BLOCKING prerequisite)

What: Create raptor_app Postgres role. Switch Raptor's application connection from DATABASE_URL (owner credential) to a restricted raptor_app connection string. Without this, every REVOKE in Phases 1+ is a paper control. See #1455.

Dev-days estimate: 1 dev-day

Dependent cards: none (first in chain)

Actions: 1. Run as Heroku Postgres owner: CREATE ROLE raptor_app WITH LOGIN PASSWORD '<generated>';. 2. Grant DML on existing application tables: GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO raptor_app. Grant USAGE on sequences. 3. Store raptor_app connection string in Infisical at /raxx/prod/raptor/RAPTOR_APP_DB_URL (not in Heroku env). 4. Update Raptor app.py to read RAPTOR_APP_DB_URL from Infisical at startup (fallback to DATABASE_URL for backward compatibility during deploy window, then remove fallback post-verification). 5. Verify Raptor runs correctly with the restricted role on staging.

Rollback: revert app.py to DATABASE_URL. No data impact.

Feature flag: none.

B1 note: this is a plumbing change, not a feature flag. No console_flag_promotions row required.


Phase 1 — Schema Creation (Dark)

What: Alembic migration creates customer_audit_events, all *_history shadow tables, triggers, roles, RLS policies, and pg_audit extension. No writes. No reads. Zero behavioral change.

Dev-days estimate: 3 dev-days (Phase 0 adds role-separation complexity to the migration)

Dependent cards: Phase 0 complete

Alembic migration file: migrations/versions/xxxx_unified_audit_v2_schema.py

Migration actions: 1. Create customer_audit_events with all columns, constraints, indexes per design.md §3.1. Include ticket_state_at_read column and cae_ticket_state index. 2. Create freescout_ticket_cache table. 3. Create users_history, subscriptions_history, positions_history, trades_history, sessions_history, passkeys_history per design.md §3.2. 4. Create trigger functions and attach to primary tables. 5. Create audit_archiver role (no login). 6. Create raptor_audit_compliance role (no login, SELECT only on audit tables). 7. Execute role grants: INSERT/SELECT to raptor_app; SELECT/DELETE to audit_archiver; SELECT to raptor_audit_compliance. Explicitly REVOKE INSERT on customer_audit_events from raptor_migrations. 8. Enable Row Level Security: ALTER TABLE customer_audit_events ENABLE ROW LEVEL SECURITY; FORCE ROW LEVEL SECURITY;. Create RLS policies per design.md §6. 9. Install pg_audit extension: CREATE EXTENSION IF NOT EXISTS pgaudit;. Configure to log all DML on customer_audit_events and *_history tables. Route pg_audit output to external log sink (Heroku log drain → Papertrail/Logtail, write-only). 10. Create audit_archival_runs, audit_integrity_log, dsr_log, freescout_ticket_cache tables. 11. UUID v4 validation: no code change needed at the schema layer; UUID v4 is enforced in the writer service.

CI requirements: - Migration passes on clean staging DB. - Idempotency check (re-run produces no error). - Lint: migration must not contain GRANT UPDATE ON customer_audit_events or GRANT INSERT ON customer_audit_events TO raptor_migrations. - Smoke test: table exists, all columns present, all indexes present, all triggers present, RLS enabled, raptor_app cannot DELETE from customer_audit_events. - pg_audit smoke test: insert a row as raptor_app, verify the event appears in the external log sink.

Rollback:

DROP EXTENSION IF EXISTS pgaudit;
DROP TRIGGER trg_trades_history ON trades;
-- (repeat for all shadow tables)
DROP TABLE IF EXISTS trades_history, positions_history, sessions_history,
  subscriptions_history, users_history, passkeys_history CASCADE;
DROP TABLE IF EXISTS customer_audit_events, freescout_ticket_cache,
  audit_archival_runs, audit_integrity_log, dsr_log CASCADE;
DROP ROLE IF EXISTS audit_archiver, raptor_audit_compliance;

Feature flag: none.


Phase 2 — Dual-Write + Legacy Back-Fill

What: Every existing writer also writes a unified row. Historical rows from all 5 legacy streams are back-filled into customer_audit_events. Legacy tables remain source-of-truth for reads.

Dev-days estimate: 8 dev-days (5 dual-write adapters + back-fill job + reconciler)

Feature flag: FLAG_UNIFIED_AUDIT_DUAL_WRITE (add console_flag_promotions row per B1 enforcement)

Dependent cards: Phase 1 complete

Dual-write adapters

Surface Current mechanism Adapter change
Raptor blueprints (Dim 1) Direct DB + write_audit Call POST /api/customer-audit/event after primary audit write
Console write_audit / write_audit_strict (Dim 3) Direct DB insert to Console audit_log Call POST /api/customer-audit/event for customer-related rows
Reasonator (Dim 2) Unbuilt Implement writer against unified endpoint from the start
Velvet auth/audit.py (Dim 2) Python logger / stdout Call POST /api/customer-audit/event for events touching a customer_id
Per-blueprint logs Python logger.* No change (not structured audit events)

FreeScout webhook registration: register POST /api/internal/freescout-webhook in FreeScout admin UI. Test: change a ticket status; verify freescout_ticket_cache updates within 5 seconds.

Back-fill strategy (legacy → unified, option D):

All 5 legacy audit streams are back-filled into customer_audit_events before the legacy tables can be dropped. Back-fill runs as a one-shot job (jobs/audit_backfill.py) invoked manually by the operator during Phase 2.

Dimension mapping per legacy source:

Legacy source Legacy table / mechanism Target dimension Action namespace Notes
Raptor audit_log backend_v2 DB audit_log table customer_self (if actor = customer) / operator_interaction (if actor = operator) Inferred from action column; map via translation table in jobs/audit_backfill.py Must map ticket_id from legacy context JSONB if present
Console audit_log Console DB audit_log table operator_interaction Map write_audit action names to new namespace (e.g., console.customer.audit_viewedcustomer.data.read.in_ticket) ticket_state_at_read defaults to 'none' for historical rows (ticket state unknowable post-hoc); no notifications fired for back-filled rows
Reasonator Unbuilt pre-dual-write system_automated N/A — no historical rows to back-fill
Velvet velvet.audit stdout logs (log drain archive) system_automated system.velvet.rotation.* — mapped from log message patterns Velvet rotation history has forensic value; parse from log archive. Requires log drain export.
Per-blueprint logs Python logger.* Not back-filled Not structured audit events; retained as operational logs only

Back-fill job design:

# jobs/audit_backfill.py — run once per legacy source, idempotent
# For each legacy row:
#   1. Check if a matching row already exists in customer_audit_events
#      (match on: customer_id + action_namespace + at_utc within 1s window)
#   2. If not: INSERT with schema_version=1 (back-filled), event_hash=KMS.HMAC(...)
#   3. If yes: skip (idempotent)
# For Velvet log archive:
#   1. Parse log drain S3 export for velvet.audit entries
#   2. Map each to customer_id, action, timestamp
#   3. Same INSERT/skip pattern
# No notifications fired for back-filled rows (ticket_state_at_read = 'back_filled')
# Back-fill manifest written to audit_archival_runs

Back-fill completeness verification: after back-fill completes, the reconciler compares row counts between legacy tables and customer_audit_events per customer_id per day. Any gap > 2% fires a Sentry alert.

Reconciler job: jobs/audit_reconciler.py runs nightly. Samples 100 customers/day, compares event counts and action types in 24h windows between legacy and unified tables. Mismatch > 2% alerts ops@raxx.app.

CI requirements: - Integration test: submit a trade; verify row appears in both legacy audit_log and customer_audit_events. - Integration test: console session revoke; verify Dim-3 row in customer_audit_events. - Integration test: FreeScout webhook status change; verify freescout_ticket_cache updated.

Rollback: FLAG_UNIFIED_AUDIT_DUAL_WRITE=0. Back-filled rows are not deleted — they remain in customer_audit_events for future re-enablement.


Phase 3 — Reader Cutover

What: Customer and operator audit viewers read from customer_audit_events. Old table reads disabled via flag flip.

Dev-days estimate: 5 dev-days

Feature flag: FLAG_UNIFIED_AUDIT_READ (add console_flag_promotions row)

Dependent cards: Phase 2 complete + all 7 prerequisites simultaneously green:

Prerequisite Tracking Status
Phase 0 role separation #1455
Postgres RLS #1457
RBAC V2 live in production docs/architecture/rbac-v2/
HMAC chain (#1454) #1454
UUID v4 for replay_uuid design.md §3.1
Mandatory date-range guard api-contract.md §3
Per-customer rate limit api-contract.md §2

Gate: Kristerpher explicit workflow_dispatch approval. Reconciler mismatch rate < 1% for 7 consecutive days. All prerequisites green.

Reader changes:

  1. Customer self-service audit (GET /api/customer-audit/<customer_id>): new endpoint. Replaces ad-hoc Raptor audit_log query used by customer audit viewer from #1050. Filtered to dimension IN ('customer_self', 'system_automated') for customer sessions.
  2. Console customer audit viewer (/console/customers/<user_id>/audit): update CustomerAuditService to call Raptor unified reader endpoint.
  3. Console global audit viewer (/console/audit): continues to read from Console audit_log — not migrated (operator-on-operator events are not in scope for customer_audit_events).

Rollback: FLAG_UNIFIED_AUDIT_READ=0. Readers revert to legacy tables. No data impact.


Phase 4 — Legacy Table Drop

What: Once the unified table contains the complete historical record (back-fill verified) and Phase 3 has soaked 90 days without integrity failures, drop the legacy audit tables.

Dev-days estimate: 1 dev-day

Gate: Phase 3 complete + 90-day soak + reconciler mismatch < 0.5% sustained + Kristerpher explicit approval.

Disposition (option D — full migration before drop): 1. Confirm back-fill is complete and verified (row count parity per day per customer). 2. Export legacy tables to S3 at s3://raxx-audit-archive/legacy/<table>/<year>/<month>/ as additional backup. 3. Write manifest row to audit_archival_runs. 4. Operator confirms S3 archive is queryable (one test query per table). 5. Alembic migration: drop legacy audit_log tables from Raptor DB and Console DB. 6. Velvet auth/audit.py stdout logging retained (operational log; not a DB table).

Rollback: once dropped, restoration requires S3 archive restore (~30 min manual). Operator must sign off explicitly before Phase 4 runs. Irreversible in practice.

Feature flag: none (one-shot).


Trigger-to-Async Refactor Path

When p99 INSERT latency on trades or positions tables exceeds 50ms sustained for 7 consecutive days, execute this refactor (estimated 1–2 dev-days):

  1. Replace shadow table triggers with pg_notify calls: trigger emits NOTIFY audit_shadow_channel, '<json_payload>' instead of direct INSERT.
  2. Background worker (jobs/shadow_writer.py) subscribes to audit_shadow_channel via LISTEN. Processes notifications and INSERTs into *_history tables asynchronously.
  3. Lag monitoring: shadow writer queue depth measured; alert if depth > 1,000 (1 min of burst writes at 10K customers).
  4. During the cutover: brief dual-write window (both trigger + async worker) to ensure no rows are dropped.

Downside: history may be stale by up to ~1s during peak write bursts. Acceptable for forensic use; document in runbooks that "shadow tables have eventual-consistency lag under high write load."


Phase Summary

Phase What Dev-days Flag Rollback Blocker
0 — Role separation raptor_app role; restricted connection 1 None Revert app.py None
1 — Schema Tables, triggers, RLS, pg_audit, roles 3 None Drop tables Phase 0
2 — Dual-write + back-fill All writers + historical back-fill 8 FLAG_UNIFIED_AUDIT_DUAL_WRITE Flag off Phase 1
3 — Reader cutover Readers switch to unified table 5 FLAG_UNIFIED_AUDIT_READ Flag off Phase 2 + 7 prerequisites + <1% mismatch 7d
4 — Legacy drop S3 backup + drop legacy tables 1 None (one-shot) Restore from S3 Phase 3 + 90d soak + approval
Total 18 dev-days

Sub-Cards to File (v2 Slate — 19 sub-cards)

Each sub-card is one PR for one developer. Ordered by dependency. Do not file until operator issues the "file-em-all" command after reviewing v2.

# Title Size Phase Depends on
SC-A1 feat(raptor): Phase 0 — create raptor_app role + switch application DB connection to restricted credential S Phase 0
SC-A2 feat(raptor): Phase 1 — alembic migration — customer_audit_events + shadow tables + triggers + RLS + pg_audit + roles L Phase 1 SC-A1
SC-A3 feat(raptor): POST /api/customer-audit/event — internal writer + HMAC-KMS + deny-list + action allowlist + per-customer rate limit M Phase 2 SC-A2
SC-A4 feat(raptor): FreeScout webhook — POST /api/internal/freescout-webhook + freescout_ticket_cache upsert S Phase 2 SC-A2
SC-A5 feat(raptor): dual-write adapter — Raptor blueprints → customer_audit_events (dim 1 + dim 2) + ticket-state-aware notification trigger M Phase 2 SC-A3, SC-A4
SC-A6 feat(console): dual-write adapter — Console write_audit → customer_audit_events (dim 3) + ticket-state lookup S Phase 2 SC-A3, SC-A4
SC-A7 feat(velvet): dual-write adapter — velvet auth/audit.py → customer_audit_events (dim 2, rotation events) S Phase 2 SC-A3
SC-A8 feat(reasonator): audit writer — Reasonator score events → customer_audit_events (dim 2) S Phase 2 SC-A3
SC-A9 feat(raptor): audit back-fill job — historical rows from 5 legacy streams into customer_audit_events (option D) M Phase 2 SC-A5, SC-A6, SC-A7
SC-A10 feat(raptor): audit reconciler job — nightly mismatch check + Sentry alert on >2% mismatch S Phase 2 SC-A5
SC-A11 feat(raptor): HMAC chain integrity check job — nightly 24h window + monthly full-chain verification from genesis M Phase 2 (before GA) SC-A3
SC-A12 feat(raptor): ticket-state-aware notification service — Path A (welcoming, 5 min SLA) + Path B (security incident) via Postmark M Phase 2 SC-A4, SC-A5
SC-A13 feat(infra): audit archiver job — S3 Glacier export of legacy + unified rows per retention policy + manifest + optional Object Lock M Phase 3 prep SC-A9
SC-A14 feat(infra): pg_audit extension + external log sink — DML audit at Postgres level, survives application compromise S Phase 1 SC-A2
SC-A15 docs(ops): departing employee deprovisioning runbook — Heroku Postgres rotation + audit_archiver SSM credential, 4-hour window S Phase 2 SC-A2
SC-A16 docs(ops): SOC-2 quarterly attestation runbook — docs/ops/runbooks/audit-soc2-attestation.md S Phase 3 SC-A11, SC-A13
SC-A17 feat(ci): action-namespace allowlist CI gate — fail pipeline on unregistered audit action namespaces in application code S Phase 2 SC-A3
SC-A18 feat(rbac-v2): raptor-audit-compliance role — read-only all dimensions, no customer notification, SOC-2 auditor path M Phase 3 RBAC V2 design
SC-A19 feat(raptor): GET /api/customer-audit/<customer_id> + GET /api/customer-audit/<customer_id>/by-replay/<replay_uuid> — reader endpoints + RLS + RBAC + date-range guard L Phase 3 SC-A5, RBAC V2 (SC-A18), all 7 prerequisites

SC-A19 note: this replaces v1's SC-A8 + SC-A9. Split into separate stories if RBAC V2 delays introduce scheduling pressure; the by-replay endpoint can ship independently.

SC-A18 note: depends on RBAC V2. If the parallel RBAC V2 architect PR lands before this sub-card is claimed, update this body to reference the concrete RBAC V2 role endpoints and data model. If this sub-card is claimed before RBAC V2 lands, the implementer stubs the role registration and revisits when RBAC V2 merges.

Total sub-cards: 19 (+6 from v1's 13)

Total dev-days: 18 (migration phases) + ~8 (SOC-2 + ops runbooks + CI gate + pg_audit + compliance role) = ~26 dev-days. Several Phase 2 sub-cards (SC-A5 through SC-A9) can run in parallel after SC-A3 + SC-A4 land.

v1 → v2 delta: - SC-A1 (new): Phase 0 role separation - SC-A3 expanded: adds HMAC-KMS, deny-list, allowlist, per-customer rate limit - SC-A4 (new): FreeScout webhook + ticket cache - SC-A9 (new): back-fill job (option D) - SC-A12 (new): ticket-state-aware notification service (replaces single nightly email) - SC-A14 (new): pg_audit + external sink - SC-A15 (new): deprovisioning runbook - SC-A16 (new): SOC-2 attestation runbook - SC-A17 (new): action-namespace CI gate - SC-A18 (new): compliance role for SOC-2 auditors - SC-A19: merged reader endpoints (was SC-A8 + SC-A9 in v1); gated on all 7 prerequisites


Existing Card Dependencies

Cards that will write to or read from the unified table after Phase 2 lands:

All new surfaces implement against the writer endpoint (SC-A3) once it ships, not against legacy tables.