Unified Customer Audit — Migration Plan v2
Status: Design v2 — pending sub-card filing
Owner: software-architect
Date: 2026-05-09 UTC
Supersedes: v1 (PR #1451)
Refs: design.md, api-contract.md, ADR-0058, ADR-0059, ADR-0063
Overview
Five existing audit streams are converged into customer_audit_events over five phases (v1 had four; v2 adds Phase 0 for role separation). Historical rows from all five legacy streams are back-filled into the unified table during Phase 2 (option D — full migration before drop). No legacy table is dropped until the unified table contains the complete historical record.
Phase 0 — Role Separation (New — BLOCKING prerequisite)
What: Create raptor_app Postgres role. Switch Raptor's application connection from DATABASE_URL (owner credential) to a restricted raptor_app connection string. Without this, every REVOKE in Phases 1+ is a paper control. See #1455.
Dev-days estimate: 1 dev-day
Dependent cards: none (first in chain)
Actions:
1. Run as Heroku Postgres owner: CREATE ROLE raptor_app WITH LOGIN PASSWORD '<generated>';.
2. Grant DML on existing application tables: GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO raptor_app. Grant USAGE on sequences.
3. Store raptor_app connection string in Infisical at /raxx/prod/raptor/RAPTOR_APP_DB_URL (not in Heroku env).
4. Update Raptor app.py to read RAPTOR_APP_DB_URL from Infisical at startup (fallback to DATABASE_URL for backward compatibility during deploy window, then remove fallback post-verification).
5. Verify Raptor runs correctly with the restricted role on staging.
Rollback: revert app.py to DATABASE_URL. No data impact.
Feature flag: none.
B1 note: this is a plumbing change, not a feature flag. No console_flag_promotions row required.
Phase 1 — Schema Creation (Dark)
What: Alembic migration creates customer_audit_events, all *_history shadow tables, triggers, roles, RLS policies, and pg_audit extension. No writes. No reads. Zero behavioral change.
Dev-days estimate: 3 dev-days (Phase 0 adds role-separation complexity to the migration)
Dependent cards: Phase 0 complete
Alembic migration file: migrations/versions/xxxx_unified_audit_v2_schema.py
Migration actions:
1. Create customer_audit_events with all columns, constraints, indexes per design.md §3.1. Include ticket_state_at_read column and cae_ticket_state index.
2. Create freescout_ticket_cache table.
3. Create users_history, subscriptions_history, positions_history, trades_history, sessions_history, passkeys_history per design.md §3.2.
4. Create trigger functions and attach to primary tables.
5. Create audit_archiver role (no login).
6. Create raptor_audit_compliance role (no login, SELECT only on audit tables).
7. Execute role grants: INSERT/SELECT to raptor_app; SELECT/DELETE to audit_archiver; SELECT to raptor_audit_compliance. Explicitly REVOKE INSERT on customer_audit_events from raptor_migrations.
8. Enable Row Level Security: ALTER TABLE customer_audit_events ENABLE ROW LEVEL SECURITY; FORCE ROW LEVEL SECURITY;. Create RLS policies per design.md §6.
9. Install pg_audit extension: CREATE EXTENSION IF NOT EXISTS pgaudit;. Configure to log all DML on customer_audit_events and *_history tables. Route pg_audit output to external log sink (Heroku log drain → Papertrail/Logtail, write-only).
10. Create audit_archival_runs, audit_integrity_log, dsr_log, freescout_ticket_cache tables.
11. UUID v4 validation: no code change needed at the schema layer; UUID v4 is enforced in the writer service.
CI requirements:
- Migration passes on clean staging DB.
- Idempotency check (re-run produces no error).
- Lint: migration must not contain GRANT UPDATE ON customer_audit_events or GRANT INSERT ON customer_audit_events TO raptor_migrations.
- Smoke test: table exists, all columns present, all indexes present, all triggers present, RLS enabled, raptor_app cannot DELETE from customer_audit_events.
- pg_audit smoke test: insert a row as raptor_app, verify the event appears in the external log sink.
Rollback:
DROP EXTENSION IF EXISTS pgaudit;
DROP TRIGGER trg_trades_history ON trades;
-- (repeat for all shadow tables)
DROP TABLE IF EXISTS trades_history, positions_history, sessions_history,
subscriptions_history, users_history, passkeys_history CASCADE;
DROP TABLE IF EXISTS customer_audit_events, freescout_ticket_cache,
audit_archival_runs, audit_integrity_log, dsr_log CASCADE;
DROP ROLE IF EXISTS audit_archiver, raptor_audit_compliance;
Feature flag: none.
Phase 2 — Dual-Write + Legacy Back-Fill
What: Every existing writer also writes a unified row. Historical rows from all 5 legacy streams are back-filled into customer_audit_events. Legacy tables remain source-of-truth for reads.
Dev-days estimate: 8 dev-days (5 dual-write adapters + back-fill job + reconciler)
Feature flag: FLAG_UNIFIED_AUDIT_DUAL_WRITE (add console_flag_promotions row per B1 enforcement)
Dependent cards: Phase 1 complete
Dual-write adapters
| Surface | Current mechanism | Adapter change |
|---|---|---|
| Raptor blueprints (Dim 1) | Direct DB + write_audit |
Call POST /api/customer-audit/event after primary audit write |
Console write_audit / write_audit_strict (Dim 3) |
Direct DB insert to Console audit_log |
Call POST /api/customer-audit/event for customer-related rows |
| Reasonator (Dim 2) | Unbuilt | Implement writer against unified endpoint from the start |
Velvet auth/audit.py (Dim 2) |
Python logger / stdout | Call POST /api/customer-audit/event for events touching a customer_id |
| Per-blueprint logs | Python logger.* |
No change (not structured audit events) |
FreeScout webhook registration: register POST /api/internal/freescout-webhook in FreeScout admin UI. Test: change a ticket status; verify freescout_ticket_cache updates within 5 seconds.
Back-fill strategy (legacy → unified, option D):
All 5 legacy audit streams are back-filled into customer_audit_events before the legacy tables can be dropped. Back-fill runs as a one-shot job (jobs/audit_backfill.py) invoked manually by the operator during Phase 2.
Dimension mapping per legacy source:
| Legacy source | Legacy table / mechanism | Target dimension |
Action namespace | Notes |
|---|---|---|---|---|
Raptor audit_log |
backend_v2 DB audit_log table |
customer_self (if actor = customer) / operator_interaction (if actor = operator) |
Inferred from action column; map via translation table in jobs/audit_backfill.py |
Must map ticket_id from legacy context JSONB if present |
Console audit_log |
Console DB audit_log table |
operator_interaction |
Map write_audit action names to new namespace (e.g., console.customer.audit_viewed → customer.data.read.in_ticket) |
ticket_state_at_read defaults to 'none' for historical rows (ticket state unknowable post-hoc); no notifications fired for back-filled rows |
| Reasonator | Unbuilt pre-dual-write | system_automated |
N/A — no historical rows to back-fill | |
Velvet velvet.audit |
stdout logs (log drain archive) | system_automated |
system.velvet.rotation.* — mapped from log message patterns |
Velvet rotation history has forensic value; parse from log archive. Requires log drain export. |
| Per-blueprint logs | Python logger.* |
Not back-filled | Not structured audit events; retained as operational logs only |
Back-fill job design:
# jobs/audit_backfill.py — run once per legacy source, idempotent
# For each legacy row:
# 1. Check if a matching row already exists in customer_audit_events
# (match on: customer_id + action_namespace + at_utc within 1s window)
# 2. If not: INSERT with schema_version=1 (back-filled), event_hash=KMS.HMAC(...)
# 3. If yes: skip (idempotent)
# For Velvet log archive:
# 1. Parse log drain S3 export for velvet.audit entries
# 2. Map each to customer_id, action, timestamp
# 3. Same INSERT/skip pattern
# No notifications fired for back-filled rows (ticket_state_at_read = 'back_filled')
# Back-fill manifest written to audit_archival_runs
Back-fill completeness verification: after back-fill completes, the reconciler compares row counts between legacy tables and customer_audit_events per customer_id per day. Any gap > 2% fires a Sentry alert.
Reconciler job: jobs/audit_reconciler.py runs nightly. Samples 100 customers/day, compares event counts and action types in 24h windows between legacy and unified tables. Mismatch > 2% alerts ops@raxx.app.
CI requirements:
- Integration test: submit a trade; verify row appears in both legacy audit_log and customer_audit_events.
- Integration test: console session revoke; verify Dim-3 row in customer_audit_events.
- Integration test: FreeScout webhook status change; verify freescout_ticket_cache updated.
Rollback: FLAG_UNIFIED_AUDIT_DUAL_WRITE=0. Back-filled rows are not deleted — they remain in customer_audit_events for future re-enablement.
Phase 3 — Reader Cutover
What: Customer and operator audit viewers read from customer_audit_events. Old table reads disabled via flag flip.
Dev-days estimate: 5 dev-days
Feature flag: FLAG_UNIFIED_AUDIT_READ (add console_flag_promotions row)
Dependent cards: Phase 2 complete + all 7 prerequisites simultaneously green:
| Prerequisite | Tracking | Status |
|---|---|---|
| Phase 0 role separation | #1455 | — |
| Postgres RLS | #1457 | — |
| RBAC V2 live in production | docs/architecture/rbac-v2/ |
— |
| HMAC chain (#1454) | #1454 | — |
UUID v4 for replay_uuid |
design.md §3.1 | — |
| Mandatory date-range guard | api-contract.md §3 | — |
| Per-customer rate limit | api-contract.md §2 | — |
Gate: Kristerpher explicit workflow_dispatch approval. Reconciler mismatch rate < 1% for 7 consecutive days. All prerequisites green.
Reader changes:
- Customer self-service audit (
GET /api/customer-audit/<customer_id>): new endpoint. Replaces ad-hoc Raptoraudit_logquery used by customer audit viewer from #1050. Filtered todimension IN ('customer_self', 'system_automated')for customer sessions. - Console customer audit viewer (
/console/customers/<user_id>/audit): updateCustomerAuditServiceto call Raptor unified reader endpoint. - Console global audit viewer (
/console/audit): continues to read from Consoleaudit_log— not migrated (operator-on-operator events are not in scope forcustomer_audit_events).
Rollback: FLAG_UNIFIED_AUDIT_READ=0. Readers revert to legacy tables. No data impact.
Phase 4 — Legacy Table Drop
What: Once the unified table contains the complete historical record (back-fill verified) and Phase 3 has soaked 90 days without integrity failures, drop the legacy audit tables.
Dev-days estimate: 1 dev-day
Gate: Phase 3 complete + 90-day soak + reconciler mismatch < 0.5% sustained + Kristerpher explicit approval.
Disposition (option D — full migration before drop):
1. Confirm back-fill is complete and verified (row count parity per day per customer).
2. Export legacy tables to S3 at s3://raxx-audit-archive/legacy/<table>/<year>/<month>/ as additional backup.
3. Write manifest row to audit_archival_runs.
4. Operator confirms S3 archive is queryable (one test query per table).
5. Alembic migration: drop legacy audit_log tables from Raptor DB and Console DB.
6. Velvet auth/audit.py stdout logging retained (operational log; not a DB table).
Rollback: once dropped, restoration requires S3 archive restore (~30 min manual). Operator must sign off explicitly before Phase 4 runs. Irreversible in practice.
Feature flag: none (one-shot).
Trigger-to-Async Refactor Path
When p99 INSERT latency on trades or positions tables exceeds 50ms sustained for 7 consecutive days, execute this refactor (estimated 1–2 dev-days):
- Replace shadow table triggers with
pg_notifycalls: trigger emitsNOTIFY audit_shadow_channel, '<json_payload>'instead of direct INSERT. - Background worker (
jobs/shadow_writer.py) subscribes toaudit_shadow_channelviaLISTEN. Processes notifications and INSERTs into*_historytables asynchronously. - Lag monitoring: shadow writer queue depth measured; alert if depth > 1,000 (1 min of burst writes at 10K customers).
- During the cutover: brief dual-write window (both trigger + async worker) to ensure no rows are dropped.
Downside: history may be stale by up to ~1s during peak write bursts. Acceptable for forensic use; document in runbooks that "shadow tables have eventual-consistency lag under high write load."
Phase Summary
| Phase | What | Dev-days | Flag | Rollback | Blocker |
|---|---|---|---|---|---|
| 0 — Role separation | raptor_app role; restricted connection |
1 | None | Revert app.py |
None |
| 1 — Schema | Tables, triggers, RLS, pg_audit, roles | 3 | None | Drop tables | Phase 0 |
| 2 — Dual-write + back-fill | All writers + historical back-fill | 8 | FLAG_UNIFIED_AUDIT_DUAL_WRITE |
Flag off | Phase 1 |
| 3 — Reader cutover | Readers switch to unified table | 5 | FLAG_UNIFIED_AUDIT_READ |
Flag off | Phase 2 + 7 prerequisites + <1% mismatch 7d |
| 4 — Legacy drop | S3 backup + drop legacy tables | 1 | None (one-shot) | Restore from S3 | Phase 3 + 90d soak + approval |
| Total | 18 dev-days |
Sub-Cards to File (v2 Slate — 19 sub-cards)
Each sub-card is one PR for one developer. Ordered by dependency. Do not file until operator issues the "file-em-all" command after reviewing v2.
| # | Title | Size | Phase | Depends on |
|---|---|---|---|---|
| SC-A1 | feat(raptor): Phase 0 — create raptor_app role + switch application DB connection to restricted credential |
S | Phase 0 | — |
| SC-A2 | feat(raptor): Phase 1 — alembic migration — customer_audit_events + shadow tables + triggers + RLS + pg_audit + roles |
L | Phase 1 | SC-A1 |
| SC-A3 | feat(raptor): POST /api/customer-audit/event — internal writer + HMAC-KMS + deny-list + action allowlist + per-customer rate limit |
M | Phase 2 | SC-A2 |
| SC-A4 | feat(raptor): FreeScout webhook — POST /api/internal/freescout-webhook + freescout_ticket_cache upsert |
S | Phase 2 | SC-A2 |
| SC-A5 | feat(raptor): dual-write adapter — Raptor blueprints → customer_audit_events (dim 1 + dim 2) + ticket-state-aware notification trigger |
M | Phase 2 | SC-A3, SC-A4 |
| SC-A6 | feat(console): dual-write adapter — Console write_audit → customer_audit_events (dim 3) + ticket-state lookup |
S | Phase 2 | SC-A3, SC-A4 |
| SC-A7 | feat(velvet): dual-write adapter — velvet auth/audit.py → customer_audit_events (dim 2, rotation events) |
S | Phase 2 | SC-A3 |
| SC-A8 | feat(reasonator): audit writer — Reasonator score events → customer_audit_events (dim 2) |
S | Phase 2 | SC-A3 |
| SC-A9 | feat(raptor): audit back-fill job — historical rows from 5 legacy streams into customer_audit_events (option D) |
M | Phase 2 | SC-A5, SC-A6, SC-A7 |
| SC-A10 | feat(raptor): audit reconciler job — nightly mismatch check + Sentry alert on >2% mismatch |
S | Phase 2 | SC-A5 |
| SC-A11 | feat(raptor): HMAC chain integrity check job — nightly 24h window + monthly full-chain verification from genesis |
M | Phase 2 (before GA) | SC-A3 |
| SC-A12 | feat(raptor): ticket-state-aware notification service — Path A (welcoming, 5 min SLA) + Path B (security incident) via Postmark |
M | Phase 2 | SC-A4, SC-A5 |
| SC-A13 | feat(infra): audit archiver job — S3 Glacier export of legacy + unified rows per retention policy + manifest + optional Object Lock |
M | Phase 3 prep | SC-A9 |
| SC-A14 | feat(infra): pg_audit extension + external log sink — DML audit at Postgres level, survives application compromise |
S | Phase 1 | SC-A2 |
| SC-A15 | docs(ops): departing employee deprovisioning runbook — Heroku Postgres rotation + audit_archiver SSM credential, 4-hour window |
S | Phase 2 | SC-A2 |
| SC-A16 | docs(ops): SOC-2 quarterly attestation runbook — docs/ops/runbooks/audit-soc2-attestation.md |
S | Phase 3 | SC-A11, SC-A13 |
| SC-A17 | feat(ci): action-namespace allowlist CI gate — fail pipeline on unregistered audit action namespaces in application code |
S | Phase 2 | SC-A3 |
| SC-A18 | feat(rbac-v2): raptor-audit-compliance role — read-only all dimensions, no customer notification, SOC-2 auditor path |
M | Phase 3 | RBAC V2 design |
| SC-A19 | feat(raptor): GET /api/customer-audit/<customer_id> + GET /api/customer-audit/<customer_id>/by-replay/<replay_uuid> — reader endpoints + RLS + RBAC + date-range guard |
L | Phase 3 | SC-A5, RBAC V2 (SC-A18), all 7 prerequisites |
SC-A19 note: this replaces v1's SC-A8 + SC-A9. Split into separate stories if RBAC V2 delays introduce scheduling pressure; the by-replay endpoint can ship independently.
SC-A18 note: depends on RBAC V2. If the parallel RBAC V2 architect PR lands before this sub-card is claimed, update this body to reference the concrete RBAC V2 role endpoints and data model. If this sub-card is claimed before RBAC V2 lands, the implementer stubs the role registration and revisits when RBAC V2 merges.
Total sub-cards: 19 (+6 from v1's 13)
Total dev-days: 18 (migration phases) + ~8 (SOC-2 + ops runbooks + CI gate + pg_audit + compliance role) = ~26 dev-days. Several Phase 2 sub-cards (SC-A5 through SC-A9) can run in parallel after SC-A3 + SC-A4 land.
v1 → v2 delta: - SC-A1 (new): Phase 0 role separation - SC-A3 expanded: adds HMAC-KMS, deny-list, allowlist, per-customer rate limit - SC-A4 (new): FreeScout webhook + ticket cache - SC-A9 (new): back-fill job (option D) - SC-A12 (new): ticket-state-aware notification service (replaces single nightly email) - SC-A14 (new): pg_audit + external sink - SC-A15 (new): deprovisioning runbook - SC-A16 (new): SOC-2 attestation runbook - SC-A17 (new): action-namespace CI gate - SC-A18 (new): compliance role for SOC-2 auditors - SC-A19: merged reader endpoints (was SC-A8 + SC-A9 in v1); gated on all 7 prerequisites
Existing Card Dependencies
Cards that will write to or read from the unified table after Phase 2 lands:
-
111 (passkey management) — passkey events become
action=passkey.add/passkey.removedim-1 events -
112 (email verification) — email verification becomes
action=email.verifydim-1 events - Reasonator — score writes become dim-2 events (
action=system.sentiment_score.updated) - Iron-condor — IC strategy events become dim-1 + dim-2 events
- Track B — broker connection events become dim-1 events
All new surfaces implement against the writer endpoint (SC-A3) once it ships, not against legacy tables.