Detection catalog — initial campaign

Campaign date: 2026-06-04 UTC
Author: detection-engineer (first invocation)
Branch: detections/initial-catalog-2026-06-04
Status: docs-only PR; no code changes

This is the first-invocation campaign per the agent charter at .claude/agents/detection-engineer.md. It establishes the initial behavioral-detection catalog for Raxx pre-launch. Subsequent campaign docs re-baseline, retire dormant rules, and add coverage as telemetry sources come online.

North-star posture (recap, not redesign)

Detections target the seams — auth (passkey + session), data (audit log + backtest), cost (Heroku spend), ops (dyno + Postgres), webhook (Stripe + Postmark post-launch).
Statistical methods first (Benford, K-S, Poisson tails, cardinality drift); hard thresholds only when sample size forbids statistics.
Pre-launch traffic is sparse. Every rule in this catalog declares whether it is "live now," "dormant — activate post-first-real-traffic," or "dormant — activate when X telemetry source is wired." Operator can defer the dormant ones safely until launch.
Alert routing respects feedback_pre_launch_digest_notifications. CRITICAL fires into existing #raxx-ops-alert-sev* channels (per project_oncall_severity_routing). HIGH/MEDIUM/LOW batch into the ops@ daily digest. No new per-event surfaces are introduced.
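
As a rough illustration of the "statistical methods first" posture, the sketch below scores an observed event count against a Poisson baseline rather than a fixed cut-off. The function names, the example baseline rate, and the `alpha` value are illustrative assumptions and do not come from the rule files.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_poisson_anomaly(count: int, baseline_rate: float, alpha: float = 1e-4) -> bool:
    """Flag a window whose count is improbably high under the baseline rate.

    alpha is a hypothetical false-positive budget per evaluated window.
    """
    return poisson_upper_tail(count, baseline_rate) < alpha
```

With a baseline of 2 events per window, a count of 3 is unremarkable while a count of 20 is flagged. When the baseline itself rests on only a handful of windows, the tail probability is not trustworthy, which is the case where the posture falls back to a hard threshold.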

Rules built this campaign (10)

#	Rule ID	Category	Severity ceiling	State at ship	Reason this was prioritized
1	`auth_passkey_enumeration`	auth	HIGH	dormant — needs Heroku→Papertrail drain	Passkey-begin is the first probe in a registration farm; cardinality of distinct emails per IP is the canonical signature. Pre-launch traffic is low → easy to baseline once a single signup farm hits.
2	`auth_session_creation_velocity`	auth	HIGH	dormant — needs session-create event log	Session creation per IP per minute is the post-passkey-success velocity check. Catches credential-stuffing or stolen-cookie replay before audit-log forensics can.
3	`auth_rbac_denied_burst`	auth	MEDIUM	live (RBAC denial events log today)	Operator-side defense: catches an operator account whose privileges were silently rotated narrower (mis-flag) or a compromised operator session probing for elevation.
4	`signup_waitlist_velocity_per_origin`	signup	HIGH	live (waitlist_signups table exists; FLAG_WAITLIST_DATASTORE-gated)	getraxx waitlist is the first inbound public surface. Per-Origin velocity catches a single referrer pumping the list.
5	`signup_email_pattern_anomaly`	signup	MEDIUM	live	N waitlist signups in 5 min sharing a domain pattern = synthesized-list seeding. Cheapest possible bot defense; sample size small but adequate for the pattern.
6	`data_audit_log_gap_window`	data	CRITICAL	dormant — needs audit-write cadence baseline (post-first-customer)	A gap > 5 min in `customer_audit_events` during business hours indicates either an outage of the audit writer OR an attacker pausing the log. Both warrant immediate page.
7	`data_audit_log_hash_chain_break`	data	CRITICAL	dormant — needs KMS HMAC hash chain wired (gated on `project_kms_audit_chain_approved`)	The hash chain is the tamper-evidence primitive. Any break is by definition adversarial; CRITICAL on first observation is justified because the hash chain has zero legitimate-break path.
8	`ops_sentry_error_rate_spike`	ops	HIGH	dormant — needs `sentry_backend` flag ON in prod (currently OFF)	Sentry events-per-minute by route is the cheapest deviation signal once Sentry is on; 3σ from a 7-day route-specific baseline.
9	`ops_postgres_p99_drift`	ops	MEDIUM	dormant — needs Heroku `pg_stat_statements` extension verified + scraper wired	Postgres p99 drift catches a noisy-neighbor query or an n+1 from a recently-shipped feature; saves a SEV-2 by surfacing degradation pre-customer-impact.
10	`cost_dyno_hour_spike`	cost	MEDIUM	dormant — needs Heroku Platform API metrics scraper	Dyno-hour rate above the 99.9th percentile of the prior 7d catches runaway autoscaling, infinite-loop dyno burn, or a forgotten worker. Pre-launch dollar exposure is small but the signal still applies.
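
The cardinality signature behind `auth_passkey_enumeration` (distinct emails per IP) can be sketched as a sliding-window counter. This is a minimal illustration only: the class name, the event shape `(ts, ip, email)`, the 5-minute window, and the threshold of 10 are all assumptions, not values taken from the rule file.

```python
from collections import defaultdict, deque


class DistinctEmailsPerIp:
    """Track how many distinct emails each IP has attempted within a sliding window."""

    def __init__(self, window_seconds: float = 300.0, threshold: int = 10):
        self.window_seconds = window_seconds  # hypothetical window length
        self.threshold = threshold            # hypothetical distinct-email limit
        self._events = defaultdict(deque)     # ip -> deque of (ts, email)

    def observe(self, ts: float, ip: str, email: str) -> bool:
        """Record one passkey-begin event; return True if the IP crosses the threshold."""
        events = self._events[ip]
        events.append((ts, email.lower()))
        # Evict events that have aged out of the window.
        while events and events[0][0] <= ts - self.window_seconds:
            events.popleft()
        distinct = len({e for _, e in events})
        return distinct >= self.threshold
```

Repeated attempts for the same email do not raise the count, so a legitimate user retrying registration stays quiet while a farm cycling through addresses from one IP does not.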

Why ten not twelve: The eleventh and twelfth slots (Benford on backtest returns; Postmark hard-bounce rate) are catalogued under "deferred" below because (a) Benford needs > 100 backtest samples and pre-launch we don't have them, and (b) Postmark customer-mail volume is currently zero. Both ship in the next campaign once telemetry is producing.

Rules NOT built this campaign (deferred — 7)

Rule (proposed name)	Why deferred	When to revisit
`data_backtest_benford`	Benford χ² needs 100+ first-digit samples; pre-launch backtest volume is < 10/day from operator dogfooding alone. False-positive rate undefined until real users run backtests.	First campaign refresh after launch, when 7-day rolling backtest count > 200.
`webhook_stripe_signature_fail`	Stripe webhook surface tracked at #3206 and is not wired yet. No `/webhooks/stripe` endpoint exists in `backend_v2/api/routes/`. Detection has nothing to subscribe to.	When #3206 ships — flip this rule live the same campaign refresh.
`webhook_postmark_bounce_spike`	Postmark is out of sandbox (`project_postmark_approved`) but production customer-mail volume is zero. Hard-bounce rate is undefined on a denominator of zero. Internal addresses (ops@/billing@/no-reply@) were on the suppression list as of 2026-05-09; cleanup outstanding.	First campaign refresh after launch, once outbound customer-mail volume exceeds 500/day.
`ops_dyno_restart_cluster`	Heroku platform metrics scraper does not exist. Manual `heroku ps:scale` events from operator-driven deploys swamp any natural restart signal pre-launch.	Same campaign refresh as `cost_dyno_hour_spike` — both need the Heroku Platform API scraper.
`cost_heroku_addon_drift`	Operator is the sole party making config-var changes; a "new addon without operator change record" check needs an authoritative change-record source. `console_audit_events` does not yet cover Heroku config mutations.	When console UI gains Heroku config-var write-through (#3208 stub), the audit-event becomes the change-record source.
`signup_geographic_burst`	Geo IP enrichment is not yet wired into the request pipeline. CF country-code header is available but `waitlist_signups` does not persist it. Without geo on the row, the detection has no grouping key.	When `cf-ipcountry` is persisted on `waitlist_signups` (small migration; will likely come with EU geoblock telemetry per `project_eu_geoblock_decision`).
`auth_login_failure_burst`	The passkey login flow does not have a meaningful "failure burst" — WebAuthn assertion failures are quiet 400s without account-state side-effects. Credential-stuffing attacks against passkey-only auth look like cardinality of distinct emails, not failure-bursts of a single email. Subsumed by `auth_passkey_enumeration`.	Permanently subsumed unless we add fallback password auth (not on roadmap).
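
For reference, the Benford χ² check that `data_backtest_benford` is waiting on could look roughly like the sketch below. It refuses to score fewer than 100 first-digit samples, matching the sample-size floor stated above. The function names and the 0.05 significance level (critical value 15.507 at 8 degrees of freedom) are assumptions for illustration; the eventual rule may choose a different level.

```python
import math
from collections import Counter
from typing import Iterable, Optional, Tuple

# Expected first-digit probabilities under Benford's law.
BENFORD_P = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
CHI2_CRITICAL_DF8 = 15.507  # chi-square critical value, 8 degrees of freedom, alpha = 0.05
MIN_SAMPLES = 100           # below this the test is not run at all


def first_digit(x: float) -> Optional[int]:
    """Return the leading significant digit of x, or None for zero/non-finite input."""
    x = abs(x)
    if x == 0 or not math.isfinite(x):
        return None
    return int(f"{x:.15e}"[0])  # scientific notation puts the leading digit first


def benford_chi2(values: Iterable[float]) -> Optional[Tuple[float, bool]]:
    """Return (chi2 statistic, flagged) or None when the sample is too small."""
    digits = [d for d in map(first_digit, values) if d is not None]
    n = len(digits)
    if n < MIN_SAMPLES:
        return None
    counts = Counter(digits)
    chi2 = sum((counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD_P.items())
    return chi2, chi2 > CHI2_CRITICAL_DF8
```

Returning `None` for small samples keeps the "false-positive rate undefined" case explicit instead of silently producing a statistic nobody should act on.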

Operator-action prerequisites (gate the dormant rules)

These prerequisites block the listed rules from producing real signal. Each is an operator-action card the operator may dispatch when ready. The detection-engineer does NOT file these as GH issues (per feedback_fix_dont_file); they are listed here so the operator can choose which to dispatch.

P1 — biggest unlock (gates 4 rules)

No Heroku Logplex drain to a log aggregator is configured. Today, app logs are accessible only via `heroku logs --tail`; there is no archival drain into Papertrail or S3. Without a drain, none of the rolling-window detections (auth_passkey_enumeration, auth_session_creation_velocity, ops_sentry_error_rate_spike in part, ops_dyno_restart_cluster) can query historical telemetry. The heroku_log_drain blueprint exists in console (console/app/blueprints/heroku_log_drain.py), but it is only a receiver; the upstream drain has to be configured Heroku-side. Sentry partially compensates for ops_sentry_error_rate_spike, because Sentry retains events even without a separate Heroku drain.

P2 — Sentry production rollout

sentry_backend flag is OFF in prod. Per project_apm_vendor_sentry, Sentry is the locked APM, and the sentry_backend flag exists with full init code at backend_v2/api/observability/sentry_init.py. Flipping the flag on (with SENTRY_DSN_BACKEND set per app) is a bootstrap operator-action per feedback_bootstrap_via_heroku. Once on, ops_sentry_error_rate_spike can populate baselines. For any trace-based detection in future campaigns, also set SENTRY_TRACES_SAMPLE_RATE explicitly to a non-zero value rather than relying on the default (currently 0.1).
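
Once baselines exist, the "3σ from a 7-day route-specific baseline" test for ops_sentry_error_rate_spike reduces to a comparison like the sketch below. The function name, the minimum-sample guard of 30, and the representation of the baseline as a list of per-minute counts are assumptions for illustration.

```python
import statistics
from typing import Sequence


def is_rate_spike(
    baseline_counts: Sequence[float],
    current: float,
    sigmas: float = 3.0,
    min_samples: int = 30,  # hypothetical floor; too few points means no verdict
) -> bool:
    """Return True when `current` exceeds the baseline mean by more than `sigmas` std devs.

    baseline_counts: events-per-minute samples for one route over the baseline window.
    """
    if len(baseline_counts) < min_samples:
        return False
    mean = statistics.fmean(baseline_counts)
    sd = statistics.pstdev(baseline_counts)
    if sd == 0:
        # Perfectly flat baseline: any increase is a deviation.
        return current > mean
    return current > mean + sigmas * sd
```

Keeping the baseline per route matters: a route that normally emits zero errors and a route that is routinely noisy should not share one mean.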

P3 — KMS HMAC hash chain ship

data_audit_log_hash_chain_break requires the KMS HMAC chain to be live. Per project_kms_audit_chain_approved, operator approved the ~$2/mo KMS spend; SC-A11 + SC-A14 are gated on deploy. Until those land, the audit-log table has rows but no tamper-evidence chain to check. The rule ships dormant.
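
To make the tamper-evidence idea concrete, the sketch below verifies a generic HMAC chain in which each row's MAC covers the previous row's MAC plus the current payload. The chaining layout, the genesis value, and the use of a local key (standing in for a KMS-held key) are assumptions; the actual SC-A11/SC-A14 design may differ.

```python
import hashlib
import hmac
from typing import Iterable, Optional, Tuple

GENESIS_MAC = b"\x00" * 32  # hypothetical seed for the first row


def chain_mac(key: bytes, prev_mac: bytes, payload: bytes) -> bytes:
    """MAC for one row: HMAC-SHA256 over the previous MAC followed by the payload."""
    return hmac.new(key, prev_mac + payload, hashlib.sha256).digest()


def first_chain_break(key: bytes, rows: Iterable[Tuple[bytes, bytes]]) -> Optional[int]:
    """Return the index of the first row whose stored MAC does not verify, else None.

    rows: (payload, stored_mac) pairs in write order.
    """
    prev = GENESIS_MAC
    for index, (payload, stored_mac) in enumerate(rows):
        expected = chain_mac(key, prev, payload)
        if not hmac.compare_digest(expected, stored_mac):
            return index
        prev = stored_mac
    return None
```

Because every MAC depends on its predecessor, both an edited payload and a deleted row surface as a break at the first affected position.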

P4 — audit-write cadence baseline

data_audit_log_gap_window needs a baseline of normal audit-write cadence per hour. Pre-customer, the only writers are operator-action sessions (sparse, bursty). Baselining the rule requires 7+ days of real-customer audit-write traffic. Until then, the rule's threshold is set to "any gap > 30 min during 12:00–22:00 UTC weekdays" as a conservative pre-launch placeholder, looser than the 5-minute gap named in the catalog table. Re-baseline at first campaign refresh post-launch.
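
The pre-launch placeholder (any gap > 30 min during 12:00–22:00 UTC on weekdays) can be sketched as follows. The function name and the input shape (a list of timezone-aware UTC write timestamps) are assumptions; only the part of a gap that falls inside the weekday window is measured.

```python
from datetime import datetime, time, timedelta, timezone
from typing import Iterable, List, Tuple


def gaps_in_business_hours(
    write_times: Iterable[datetime],
    max_gap: timedelta = timedelta(minutes=30),
    start: time = time(12, 0),
    end: time = time(22, 0),
) -> List[Tuple[datetime, datetime]]:
    """Return (gap_start, gap_end) spans longer than max_gap inside weekday windows.

    write_times must be timezone-aware datetimes in UTC.
    """
    found = []
    ordered = sorted(write_times)
    for a, b in zip(ordered, ordered[1:]):
        day = a.date()
        while day <= b.date():
            if day.weekday() < 5:  # Monday=0 .. Friday=4
                win_start = datetime.combine(day, start, tzinfo=timezone.utc)
                win_end = datetime.combine(day, end, tzinfo=timezone.utc)
                lo, hi = max(a, win_start), min(b, win_end)
                if hi - lo > max_gap:
                    found.append((lo, hi))
            day += timedelta(days=1)
    return found
```

This only compares consecutive writes, so a caller that wants to catch a gap that is still open would append the current time to `write_times` before calling.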

P5 — Postgres `pg_stat_statements` verification

ops_postgres_p99_drift needs the pg_stat_statements extension confirmed enabled on Heroku Postgres. Heroku Standard-0 (current Raptor + Console pg tier) supports the extension, but it has not yet been verified as active. Operator can verify with `heroku pg:psql -a raxx-api-prod -c "SHOW shared_preload_libraries;"` and `heroku pg:psql -a raxx-api-prod -c "SELECT count(*) FROM pg_extension WHERE extname='pg_stat_statements';"`.

P6 — Heroku Platform API scraper

cost_dyno_hour_spike and the deferred ops_dyno_restart_cluster need a scheduled scraper that polls `GET /apps/{app}/dynos` and `GET /apps/{app}/formation` from the Heroku Platform API. This is not wired today; no script in scripts/ops/ or console/app/services/ queries those endpoints. Building the scraper belongs to sre-agent; detection-engineer consumes the resulting telemetry table.
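
For orientation only, a scraper's request construction against those two endpoints might look like the sketch below. The helper names and the way the token is supplied are assumptions; the `Accept` header selects version 3 of the Heroku Platform API, and the actual scraper design is sre-agent's call.

```python
import json
import urllib.request

HEROKU_API = "https://api.heroku.com"
RESOURCES = ("dynos", "formation")


def build_request(app: str, resource: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for /apps/{app}/dynos or /apps/{app}/formation."""
    if resource not in RESOURCES:
        raise ValueError(f"unsupported resource: {resource}")
    return urllib.request.Request(
        f"{HEROKU_API}/apps/{app}/{resource}",
        headers={
            "Accept": "application/vnd.heroku+json; version=3",
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )


def fetch(app: str, resource: str, token: str, timeout: float = 10.0):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(build_request(app, resource, token), timeout=timeout) as resp:
        return json.load(resp)
```

Separating request construction from the network call keeps the scraper testable without touching the live API.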

P7 — Stripe webhook ship

webhook_stripe_signature_fail (deferred this campaign) is gated on #3206. No /webhooks/stripe route exists in backend_v2/api/routes/. When the webhook ships, the rule activates the same campaign refresh.

P8 — Postmark customer-mail volume

webhook_postmark_bounce_spike is gated on real customer-mail volume > 500/day. It activates at the first campaign refresh after that threshold has been sustained for 7 days.

Alert routing posture (confirmed per memory hygiene)

Severity	Channel	Posture	Source memory
CRITICAL	`#raxx-ops-alert-sev1` (Slack `C0B423M38H4`)	Per-event, push override	`project_oncall_severity_routing`
HIGH (ET hours)	`#raxx-ops-alert-sev2-5` (Slack `C0B4611UC2V`)	Per-event, no push override	`project_oncall_severity_routing`
HIGH (off-hours)	`#raxx-ops-alert-sev2` (Slack `C0B445RU95Y`)	Per-event, push	`project_oncall_severity_routing`
MEDIUM	ops@ daily digest	Digest only	`feedback_pre_launch_digest_notifications`
LOW	`docs/detections/_log/` markdown	Silent log	charter §Alert routing

Note: pre-launch, even HIGH findings default to digest unless they describe an active-attack signature with sample size > 1. Per-event alerting is reserved for hash-chain break and audit-log gap (both CRITICAL) and for a confirmed passkey-enumeration burst (HIGH). This is documented per rule in each rule file.
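
The routing table and the pre-launch note combine into a small decision function, sketched below. The function signature and the `et_hours`, `active_attack`, and `sample_size` parameters are hypothetical; how ET hours are determined is left to the caller.

```python
DIGEST = "ops@ daily digest"
LOG_DIR = "docs/detections/_log/"


def route(
    severity: str,
    *,
    et_hours: bool = True,        # caller decides whether it is currently ET hours
    active_attack: bool = False,  # finding matches an active-attack signature
    sample_size: int = 0,         # number of observations behind the finding
) -> str:
    """Return the alert destination for a finding under the pre-launch posture."""
    if severity == "CRITICAL":
        return "#raxx-ops-alert-sev1"
    if severity == "HIGH":
        # Pre-launch: HIGH goes to digest unless it is an active attack with sample size > 1.
        if not (active_attack and sample_size > 1):
            return DIGEST
        return "#raxx-ops-alert-sev2-5" if et_hours else "#raxx-ops-alert-sev2"
    if severity == "MEDIUM":
        return DIGEST
    if severity == "LOW":
        return LOG_DIR
    raise ValueError(f"unknown severity: {severity}")
```

For example, `route("HIGH", active_attack=True, sample_size=3, et_hours=False)` resolves to `#raxx-ops-alert-sev2`, while the same finding with a single observation stays in the digest.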

Roadmap — next campaign refresh

Trigger: earliest of (a) launch + 14 days of real customer traffic, (b) operator-driven request, or (c) one of the deferred-rule prerequisites lands.

Refresh agenda:

Activate those of the 7 dormant rules whose prereqs are met.
Build data_backtest_benford once backtest sample size > 200 over 7 days.
Build webhook_stripe_signature_fail once #3206 ships.
Re-baseline auth_passkey_enumeration from the post-launch traffic distribution; tighten threshold based on observed FP rate.
Self-tune review: per charter §Workflow, any rule that fired ≥5 times in 7 days with no operator action gets severity downgraded.
Retire any rule that has been dormant for 60+ days without its prereq advancing — re-evaluate whether the rule is still worth building.
Add new categories if the threat model shifts: mobile/ (when iOS app is live), agent/ (when other agents start writing to repo at runtime), vault/ (Velvet rotation anomalies once Velvet v2 stabilizes).
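
The self-tune and retirement items in the agenda above can be expressed as a per-rule review step, sketched below. The `RuleStats` fields, the action strings, and the choice to downgrade by exactly one severity level are assumptions; the charter only states that severity is downgraded.

```python
from dataclasses import dataclass

SEVERITIES = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


@dataclass
class RuleStats:
    rule_id: str
    severity: str
    fires_7d: int = 0             # times the rule fired in the last 7 days
    operator_actions_7d: int = 0  # operator actions taken on those firings
    dormant_days: int = 0         # days the rule has been dormant
    prereq_advanced: bool = False # whether its prerequisite has moved forward


def downgrade(severity: str) -> str:
    """Step one level down the severity ladder; LOW stays LOW."""
    return SEVERITIES[max(0, SEVERITIES.index(severity) - 1)]


def review(rule: RuleStats) -> str:
    """Return the refresh action for one rule."""
    if rule.dormant_days >= 60 and not rule.prereq_advanced:
        return "re-evaluate for retirement"
    if rule.fires_7d >= 5 and rule.operator_actions_7d == 0:
        return f"downgrade to {downgrade(rule.severity)}"
    return "keep"
```

Running `review` over every rule at refresh time gives the operator a short action list instead of a manual walk through the catalog.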

Anti-pattern guard: the second campaign should NOT add more rules merely to grow the catalog. Per charter, sliver-cutting detections produces alert fatigue worse than under-coverage. Target ceiling: ~15 active rules. Anything beyond that needs explicit operator approval.

Files in this PR

docs/detections/_campaign/2026-06-04-initial-catalog.md (this doc)
10 rule files under docs/detections/<category>/<rule_name>.md
6 synthetic-positive fixtures under docs/detections/<category>/_fixtures/

No code changes. No new GH issues. No new feature flags. No migrations.

Cross-references

Charter: .claude/agents/detection-engineer.md
Memories consulted: project_apm_vendor_sentry, project_postmark_approved, project_oncall_severity_routing, project_cf_gate_and_attorney_hold_until_raxx_app_perfect, feedback_nightly_scan_dark_is_high, feedback_pre_launch_digest_notifications, feedback_no_progress_recap_framing, feedback_pr_base_main, feedback_no_speculative_markdown, feedback_fix_dont_file, project_kms_audit_chain_approved, feedback_bootstrap_via_heroku
Sentry init: backend_v2/api/observability/sentry_init.py
Audit writer: backend_v2/api/services/customer_audit_writer_service.py, console/app/services/customer_audit.py
Waitlist route: backend_v2/api/routes/waitlist_signups.py
Heroku log drain receiver: console/app/blueprints/heroku_log_drain.py
WAF events flag: backend_v2/api/feature_flags.yaml (flag waf_events)
Latest scan triage: docs/security/remediation/2026-06-04-nightly-scans-triage.md