DET-OPS-002 — Postgres p99 latency drift
Rule ID: DET-OPS-002
Title: Postgres p99 query latency exceeding 2× the 7-day baseline per statement
Category: ops
Last validated: 2026-06-04 (initial catalog, dormant)
State: dormant — requires pg_stat_statements extension verified active on raxx-api-prod Postgres
Telemetry source
- Postgres view
pg_stat_statements(extension; available on Heroku Standard-0+, not yet verified active per campaign prerequisites §P5). - Query:
SELECT queryid, query, calls, mean_exec_time, stddev_exec_time, (mean_exec_time + 2*stddev_exec_time) AS p95_estimate FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 100; - Sampling cadence: every 5 minutes via a scheduled job (does not exist yet; see
cost_dyno_hour_spikeprerequisite §P6 for the shared scraper).
Statistical method + baseline window
- Method: K-S distribution shift test on the per-statement
mean_exec_timedistribution, today vs. 7-day baseline. Backup: hard ratio threshold (today's p99 / baseline's p99 > 2.0). - Baseline window: rolling 7 days. Re-baseline weekly to avoid baseline poisoning by a slow query that has been "the new normal" for a few days.
- Fire condition: K-S p-value < 0.01 for a per-statement distribution shift AND today's p99 > 2× baseline p99 AND statement is called >= 100 times in the comparison window.
Threshold + expected FP rate
- Threshold: as above; the 100-call floor avoids FP on rarely-executed admin queries.
- Expected FP rate: ~2 per week. Cold cache after deploy or after
pg_stat_statements_reset()will fire one false positive per occurrence. Tag those asconfirmed-cold-cachein_log/.
Alert route
- MEDIUM (single statement, 2× threshold, sustained 30+ min): ops@ daily digest.
- HIGH (multiple statements simultaneously degraded, or p99 > 5× baseline):
#raxx-ops-alert-sev2-5/#raxx-ops-alert-sev2.
Escalation owner
- sre-agent primary — n+1 from recent feature, missing index, table bloat, vacuum lag.
- data-scientist if a statement is degrading because the underlying data distribution shifted (e.g., a new feature flag changed the working set).
Test fixture / synthetic positive
See _fixtures/postgres_p99_drift_positive.json for a synthetic pg_stat_statements snapshot where the SELECT FROM strategies WHERE user_id = ? statement shows mean 184ms today vs. baseline μ=42ms.
What to do when this fires
- Identify the statement(s). Recent deploy that changed the query? New feature flag that altered the working set? Missing index on a new column?
- Run
EXPLAIN ANALYZEagainst the live statement to confirm the plan changed. - If a plan flip explains the degradation: dispatch sre-agent for the appropriate remedy (index, query rewrite, table tune).
- Re-baseline
pg_stat_statementsafter remediation:SELECT pg_stat_statements_reset();— and mark the next baseline window's first day asre-baselinedin_log/.
What NOT to do
- Do not run
pg_stat_statements_reset()to silence the rule. That clears the baseline reference and masks the degradation. - Do not exclude high-degradation queries because they're "expected to be slow" — the rule's job is to flag change, not absolute slowness.
- Do not extend baseline beyond 7 days; longer baselines hide degradations as "normal."