DET-OPS-002 — Postgres p99 latency drift

Rule ID: DET-OPS-002 Title: Postgres p99 query latency exceeding 2× the 7-day baseline per statement Category: ops Last validated: 2026-06-04 (initial catalog, dormant) State: dormant — requires pg_stat_statements extension verified active on raxx-api-prod Postgres

Telemetry source

Postgres view pg_stat_statements (extension; available on Heroku Standard-0+, not yet verified active per campaign prerequisites §P5).
Query: SELECT queryid, query, calls, mean_exec_time, stddev_exec_time, (mean_exec_time + 2*stddev_exec_time) AS p95_estimate FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 100;
Sampling cadence: every 5 minutes via a scheduled job (does not exist yet; see cost_dyno_hour_spike prerequisite §P6 for the shared scraper).

Statistical method + baseline window

Method: K-S distribution shift test on the per-statement mean_exec_time distribution, today vs. 7-day baseline. Backup: hard ratio threshold (today's p99 / baseline's p99 > 2.0).
Baseline window: rolling 7 days. Re-baseline weekly to avoid baseline poisoning by a slow query that has been "the new normal" for a few days.
Fire condition: K-S p-value < 0.01 for a per-statement distribution shift AND today's p99 > 2× baseline p99 AND statement is called >= 100 times in the comparison window.

Threshold + expected FP rate

Threshold: as above; the 100-call floor avoids FP on rarely-executed admin queries.
Expected FP rate: ~2 per week. Cold cache after deploy or after pg_stat_statements_reset() will fire one false positive per occurrence. Tag those as confirmed-cold-cache in _log/.

Alert route

MEDIUM (single statement, 2× threshold, sustained 30+ min): ops@ daily digest.
HIGH (multiple statements simultaneously degraded, or p99 > 5× baseline): #raxx-ops-alert-sev2-5 / #raxx-ops-alert-sev2.

Escalation owner

sre-agent primary — n+1 from recent feature, missing index, table bloat, vacuum lag.
data-scientist if a statement is degrading because the underlying data distribution shifted (e.g., a new feature flag changed the working set).

Test fixture / synthetic positive

See _fixtures/postgres_p99_drift_positive.json for a synthetic pg_stat_statements snapshot where the SELECT FROM strategies WHERE user_id = ? statement shows mean 184ms today vs. baseline μ=42ms.

What to do when this fires

Identify the statement(s). Recent deploy that changed the query? New feature flag that altered the working set? Missing index on a new column?
Run EXPLAIN ANALYZE against the live statement to confirm the plan changed.
If a plan flip explains the degradation: dispatch sre-agent for the appropriate remedy (index, query rewrite, table tune).
Re-baseline pg_stat_statements after remediation: SELECT pg_stat_statements_reset(); — and mark the next baseline window's first day as re-baselined in _log/.

What NOT to do

Do not run pg_stat_statements_reset() to silence the rule. That clears the baseline reference and masks the degradation.
Do not exclude high-degradation queries because they're "expected to be slow" — the rule's job is to flag change, not absolute slowness.
Do not extend baseline beyond 7 days; longer baselines hide degradations as "normal."