Detection catalog — initial campaign
Campaign date: 2026-06-04 UTC
Author: detection-engineer (first invocation)
Branch: detections/initial-catalog-2026-06-04
Status: docs-only PR; no code changes
This is the first-invocation campaign per the agent charter at .claude/agents/detection-engineer.md. It establishes the initial behavioral-detection catalog for Raxx pre-launch. Subsequent campaign docs re-baseline, retire dormant rules, and add coverage as telemetry sources come online.
North-star posture (recap, not redesign)
- Detections target the seams — auth (passkey + session), data (audit log + backtest), cost (Heroku spend), ops (dyno + Postgres), webhook (Stripe + Postmark post-launch).
- Statistical methods first (Benford, K-S, Poisson tails, cardinality drift); hard thresholds only when sample size forbids statistics.
- Pre-launch traffic is sparse. Every rule in this catalog declares whether it is "live now," "dormant — activate post-first-real-traffic," or "dormant — activate when X telemetry source is wired." Operator can defer the dormant ones safely until launch.
- Alert routing respects feedback_pre_launch_digest_notifications. CRITICAL fires into the existing #raxx-ops-alert-sev* channels (per project_oncall_severity_routing). HIGH/MEDIUM/LOW batch into the ops@ daily digest. No new per-event surfaces are introduced.
Rules built this campaign (10)
| # | Rule ID | Category | Severity ceiling | State at ship | Reason this was prioritized |
|---|---|---|---|---|---|
| 1 | auth_passkey_enumeration | auth | HIGH | dormant — needs Heroku→Papertrail drain | Passkey-begin is the first probe in a registration farm; cardinality of distinct emails per IP is the canonical signature. Pre-launch traffic is low → easy to baseline once a single signup farm hits. |
| 2 | auth_session_creation_velocity | auth | HIGH | dormant — needs session-create event log | Session creation per IP per minute is the post-passkey-success velocity check. Catches credential-stuffing or stolen-cookie replay before audit-log forensics can. |
| 3 | auth_rbac_denied_burst | auth | MEDIUM | live (RBAC denial events log today) | Operator-side defense: catches an operator account whose privileges were silently rotated narrower (mis-flag) or a compromised operator session probing for elevation. |
| 4 | signup_waitlist_velocity_per_origin | signup | HIGH | live (waitlist_signups table exists; FLAG_WAITLIST_DATASTORE-gated) | getraxx waitlist is the first inbound public surface. Per-Origin velocity catches a single referrer pumping the list. |
| 5 | signup_email_pattern_anomaly | signup | MEDIUM | live | N waitlist signups in 5 min sharing a domain pattern = synthesized-list seeding. Cheapest possible bot defense; sample size small but adequate for the pattern. |
| 6 | data_audit_log_gap_window | data | CRITICAL | dormant — needs audit-write cadence baseline (post-first-customer) | A gap > 5 min in customer_audit_events during business hours indicates either an outage of the audit writer OR an attacker pausing the log. Both warrant immediate page. |
| 7 | data_audit_log_hash_chain_break | data | CRITICAL | dormant — needs KMS HMAC hash chain wired (gated on project_kms_audit_chain_approved) | The hash chain is the tamper-evidence primitive. Any break is by definition adversarial; CRITICAL on first observation is justified because the hash chain has zero legitimate-break path. |
| 8 | ops_sentry_error_rate_spike | ops | HIGH | dormant — needs sentry_backend flag ON in prod (currently OFF) | Sentry events-per-minute by route is the cheapest deviation signal once Sentry is on; 3σ from a 7-day route-specific baseline. |
| 9 | ops_postgres_p99_drift | ops | MEDIUM | dormant — needs Heroku pg_stat_statements extension verified + scraper wired | Postgres p99 drift catches a noisy-neighbor query or an n+1 from a recently-shipped feature; saves a SEV-2 by surfacing degradation pre-customer-impact. |
| 10 | cost_dyno_hour_spike | cost | MEDIUM | dormant — needs Heroku Platform API metrics scraper | Dyno-hour rate above the 99.9th percentile of the prior 7d catches runaway autoscaling, infinite-loop dyno burn, or a forgotten worker. Pre-launch dollar exposure is small but the signal still applies. |
Why ten not twelve: The eleventh and twelfth slots (Benford on backtest returns; Postmark hard-bounce rate) are catalogued under "deferred" below because (a) Benford needs > 100 backtest samples and pre-launch we don't have them, and (b) Postmark customer-mail volume is currently zero. Both ship in the next campaign once telemetry is producing.
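The cardinality signature behind auth_passkey_enumeration (rule 1) can be sketched as a sliding-window count of distinct emails per source IP. This is a minimal illustration, not the shipped rule: the event shape (timestamp, ip, email), the 10-minute window, and the threshold of 5 are all assumptions chosen for the example, and the real values would come from the post-launch baseline.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

# Hypothetical parameters; the rule file would derive these from a baseline.
WINDOW = timedelta(minutes=10)
MAX_DISTINCT_EMAILS = 5


def enumeration_suspects(events, window=WINDOW, threshold=MAX_DISTINCT_EMAILS):
    """Return the IPs that touched more than `threshold` distinct emails
    inside any sliding `window`.

    `events` is an iterable of (timestamp, ip, email) tuples sorted by timestamp.
    """
    recent = defaultdict(deque)  # ip -> deque of (timestamp, email) still in the window
    suspects = set()
    for ts, ip, email in events:
        q = recent[ip]
        q.append((ts, email.lower()))
        # Evict entries that have aged out of the window.
        while q and ts - q[0][0] > window:
            q.popleft()
        if len({e for _, e in q}) > threshold:
            suspects.add(ip)
    return suspects
```

Counting distinct emails rather than raw requests keeps a single user retrying their own passkey registration from tripping the rule.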
Rules NOT built this campaign (deferred — 7)
| Rule (proposed name) | Why deferred | When to revisit |
|---|---|---|
| data_backtest_benford | Benford χ² needs 100+ first-digit samples; pre-launch backtest volume is < 10/day from operator dogfooding alone. False-positive rate undefined until real users run backtests. | First campaign refresh after launch, when 7-day rolling backtest count > 200. |
| webhook_stripe_signature_fail | Stripe webhook surface tracked at #3206 and is not wired yet. No /webhooks/stripe endpoint exists in backend_v2/api/routes/. Detection has nothing to subscribe to. | When #3206 ships — flip this rule live the same campaign refresh. |
| webhook_postmark_bounce_spike | Postmark is out of sandbox (project_postmark_approved) but production customer-mail volume is zero. Hard-bounce rate is undefined on a denominator of zero. Internal addresses (ops@/billing@/no-reply@) were on the suppression list 2026-05-09; cleanup outstanding. | First campaign refresh after launch, once cumulative outbound > 500/day. |
| ops_dyno_restart_cluster | Heroku platform metrics scraper does not exist. Manual heroku ps:scale events from operator-driven deploys swamp any natural restart signal pre-launch. | Same campaign refresh as cost_dyno_hour_spike — both need the Heroku Platform API scraper. |
| cost_heroku_addon_drift | Operator is the sole party making config-var changes; a "new addon without operator change record" check needs an authoritative change-record source. console_audit_events does not yet cover Heroku config mutations. | When console UI gains Heroku config-var write-through (#3208 stub), the audit-event becomes the change-record source. |
| signup_geographic_burst | Geo IP enrichment is not yet wired into the request pipeline. CF country-code header is available but waitlist_signups does not persist it. Without geo on the row, the detection has no grouping key. | When cf-ipcountry is persisted on waitlist_signups (small migration; will likely come with EU geoblock telemetry per project_eu_geoblock_decision). |
| auth_login_failure_burst | The passkey login flow does not have a meaningful "failure burst" — WebAuthn assertion failures are quiet 400s without account-state side-effects. Credential-stuffing attacks against passkey-only auth look like cardinality of distinct emails, not failure-bursts of a single email. Subsumed by auth_passkey_enumeration. | Permanently subsumed unless we add fallback password auth (not on roadmap). |
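For reference when data_backtest_benford is eventually built, the Benford χ² check mentioned in the table could look roughly like the sketch below. It is an illustration under assumptions: the function names are hypothetical, the input is taken to be a flat list of backtest return values, and the 0.05 significance level is a placeholder rather than a decided threshold. The sample-size floor of 100 is the one stated in this doc.

```python
import math
from collections import Counter

MIN_SAMPLES = 100           # from this doc: Benford needs more than 100 samples
CHI2_CRIT_DF8_P05 = 15.507  # chi-square critical value, 8 degrees of freedom, alpha = 0.05


def first_digit(x):
    """Leading significant digit of a finite nonzero number, else None."""
    if x == 0 or not math.isfinite(x):
        return None
    # Scientific notation puts the leading significant digit first.
    return int(f"{abs(x):.15e}"[0])


def benford_chi2(values):
    """Chi-square statistic of first digits against Benford's law.

    Returns None when there are too few usable samples to test.
    """
    digits = [d for d in map(first_digit, values) if d is not None]
    n = len(digits)
    if n <= MIN_SAMPLES:
        return None
    observed = Counter(digits)
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford probability times n
        chi2 += (observed.get(d, 0) - expected) ** 2 / expected
    return chi2
```

A statistic above CHI2_CRIT_DF8_P05 would suggest the first-digit distribution departs from Benford's law at the assumed significance level; returning None below the sample floor makes the "not enough data" case explicit instead of producing a meaningless score.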
Operator-action prerequisites (gate the dormant rules)
These prerequisites block the listed rules from producing real signal. Each is an operator-action card the operator may dispatch when ready. The detection-engineer does NOT file these as GH issues (per feedback_fix_dont_file); listing here so the operator can choose.
P1 — biggest unlock (gates 4 rules)
Heroku Logplex → log-aggregation drain undefined. Today, app logs are accessible only via heroku logs --tail; there is no archival drain into Papertrail or S3. Without a drain, none of the rolling-window detections (auth_passkey_enumeration, auth_session_creation_velocity, ops_sentry_error_rate_spike partial, ops_dyno_restart_cluster) can query historical telemetry. The heroku_log_drain blueprint exists in console (console/app/blueprints/heroku_log_drain.py) but is a receiver; the upstream drain has to be configured Heroku-side. Sentry partially compensates for the Sentry-error-rate rule (Sentry retains events even without a separate Heroku drain).
P2 — Sentry production rollout
sentry_backend flag is OFF in prod. Per project_apm_vendor_sentry, Sentry is the locked APM and sentry_backend flag exists with full init code at backend_v2/api/observability/sentry_init.py. Flipping the flag on (with SENTRY_DSN_BACKEND set per app) is bootstrap operator-action per feedback_bootstrap_via_heroku. Once on, ops_sentry_error_rate_spike can populate baselines. Also wire SENTRY_TRACES_SAMPLE_RATE to a non-zero value (currently defaults to 0.1) for any trace-based detection in future campaigns.
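Once Sentry is producing events, the 3σ comparison that ops_sentry_error_rate_spike describes could be sketched as follows. The function name, the per-minute list input, and the minimum of 30 baseline points are assumptions for illustration; only the 3σ multiplier and the per-route baseline come from the rule description.

```python
from statistics import mean, stdev

SIGMA = 3.0               # from the rule description: 3 sigma from the baseline
MIN_BASELINE_POINTS = 30  # hypothetical floor before the rule is allowed to fire


def is_error_rate_spike(baseline_per_min, current_per_min, sigma=SIGMA):
    """True when the current events/min for a route exceeds
    mean + sigma * stddev of that route's baseline samples."""
    if len(baseline_per_min) < MIN_BASELINE_POINTS:
        return False  # not enough history to judge
    mu = mean(baseline_per_min)
    sd = stdev(baseline_per_min)
    if sd == 0:
        # Perfectly flat baseline: treat any increase as a deviation.
        return current_per_min > mu
    return current_per_min > mu + sigma * sd
```

The check is one-sided on purpose in this sketch, since a drop in error rate is not the failure mode the rule targets.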
P3 — KMS HMAC hash chain ship
data_audit_log_hash_chain_break requires the KMS HMAC chain to be live. Per project_kms_audit_chain_approved, operator approved the ~$2/mo KMS spend; SC-A11 + SC-A14 are gated on deploy. Until those land, the audit-log table has rows but no tamper-evidence chain to check. The rule ships dormant.
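To make concrete what the dormant rule would check once the chain exists, here is a minimal sketch of HMAC chain verification. It uses a local key and Python's hmac module purely for illustration; the production chain is expected to compute MACs through KMS, and the row layout (payload bytes plus stored MAC), the all-zero genesis value, and the function names are assumptions, not the SC-A11/SC-A14 design.

```python
import hashlib
import hmac

GENESIS = b"\x00" * 32  # hypothetical anchor value for the first row


def chain_mac(key: bytes, prev_mac: bytes, payload: bytes) -> bytes:
    """MAC for one row: HMAC-SHA256 over the previous row's MAC plus this row's payload."""
    return hmac.new(key, prev_mac + payload, hashlib.sha256).digest()


def first_chain_break(key, rows):
    """Return the index of the first row whose stored MAC does not match
    the recomputed one, or None if the chain is intact.

    `rows` is a sequence of (payload_bytes, stored_mac_bytes) in write order.
    """
    prev = GENESIS
    for i, (payload, stored) in enumerate(rows):
        expected = chain_mac(key, prev, payload)
        if not hmac.compare_digest(expected, stored):
            return i
        prev = stored
    return None
```

Because each MAC covers the previous one, editing a row or deleting a row both surface as a mismatch at or immediately after the tampered position.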
P4 — audit-write cadence baseline
data_audit_log_gap_window needs a baseline of normal audit-write cadence per hour. Pre-customer, the only writers are operator-action sessions (sparse, bursty). Baselining the rule requires 7+ days of real-customer audit-write traffic. Until then, the rule's threshold is set to "any gap > 30 min during 12:00–22:00 UTC weekdays" as a conservative pre-launch placeholder. Re-baseline at first campaign refresh post-launch.
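The pre-launch placeholder threshold above can be expressed as a small check over one day's audit-write timestamps. This is a sketch under assumptions: the function name is hypothetical, timestamps are taken to be timezone-aware datetimes, and gaps are measured against the window edges so that a silent start or end of the window also counts.

```python
from datetime import date, datetime, time, timedelta, timezone

MAX_GAP = timedelta(minutes=30)                     # placeholder threshold from this doc
WINDOW_START, WINDOW_END = time(12, 0), time(22, 0)  # 12:00-22:00 UTC


def gaps_for_day(day, write_times, max_gap=MAX_GAP):
    """Return (start, end) gaps longer than `max_gap` inside the watch window of `day`.

    `day` is a date; `write_times` are aware datetimes of audit writes in any order.
    """
    if day.weekday() >= 5:
        return []  # weekends are outside the placeholder window
    start = datetime.combine(day, WINDOW_START, tzinfo=timezone.utc)
    end = datetime.combine(day, WINDOW_END, tzinfo=timezone.utc)
    inside = sorted(t for t in write_times if start <= t <= end)
    edges = [start, *inside, end]
    return [(a, b) for a, b in zip(edges, edges[1:]) if b - a > max_gap]


# Usage: a weekday with no writes at all yields one gap covering the whole window.
whole_window = gaps_for_day(date(2026, 6, 4), [])  # 2026-06-04 is a Thursday
```

When the rule is re-baselined after launch, only MAX_GAP and the window bounds should need to change.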
P5 — Postgres pg_stat_statements verification
ops_postgres_p99_drift needs the pg_stat_statements extension confirmed enabled on Heroku Postgres. Heroku Standard-0 (current Raptor + Console pg tier) does support the extension; not yet verified active. Operator can verify with heroku pg:psql -a raxx-api-prod -c "SHOW shared_preload_libraries;" and heroku pg:psql -a raxx-api-prod -c "SELECT count(*) FROM pg_extension WHERE extname='pg_stat_statements';".
P6 — Heroku Platform API scraper
cost_dyno_hour_spike and the deferred ops_dyno_restart_cluster need a scheduled scraper that polls GET /apps/{app}/dynos + GET /apps/{app}/formation from the Heroku Platform API. Not wired today; no script in scripts/ops/ or console/app/services/ queries those endpoints. Belongs to sre-agent to build; detection-engineer consumes the resulting telemetry table.
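As a starting point for whoever builds the scraper, one poll of the two endpoints could look like the sketch below. The function names and the shape of the returned snapshot are hypothetical; the base URL, the versioned Accept header, and bearer-token auth follow the Heroku Platform API conventions. How the token is stored and how the snapshot lands in a telemetry table are left open.

```python
import json
import urllib.request

API = "https://api.heroku.com"
HEADERS = {"Accept": "application/vnd.heroku+json; version=3"}


def http_get_json(path, token):
    """GET a Platform API path and decode the JSON body."""
    req = urllib.request.Request(
        API + path, headers={**HEADERS, "Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def snapshot(app, token, get=http_get_json):
    """One poll: running dyno count and configured quantity per process type.

    `get` is injectable so the parsing can be exercised without network access.
    """
    dynos = get(f"/apps/{app}/dynos", token)
    formation = get(f"/apps/{app}/formation", token)
    running = {}
    for d in dynos:
        if d.get("state") == "up":
            running[d["type"]] = running.get(d["type"], 0) + 1
    configured = {f["type"]: f["quantity"] for f in formation}
    return {"app": app, "running": running, "configured": configured}
```

Persisting one snapshot per poll interval would give cost_dyno_hour_spike a dyno-hour rate to take percentiles over, and successive snapshots would expose the restart clusters that ops_dyno_restart_cluster needs.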
P7 — Stripe webhook ship
webhook_stripe_signature_fail (deferred this campaign) is gated on #3206. No /webhooks/stripe route exists in backend_v2/api/routes/. When the webhook ships, the rule activates the same campaign refresh.
P8 — Postmark customer-mail volume
webhook_postmark_bounce_spike is gated on real customer-mail volume > 500/day. Activates the first campaign refresh after that threshold sustains for 7 days.
Alert routing posture (confirmed per memory hygiene)
| Severity | Channel | Posture | Source memory |
|---|---|---|---|
| CRITICAL | #raxx-ops-alert-sev1 (Slack C0B423M38H4) | Per-event, push override | project_oncall_severity_routing |
| HIGH (ET hours) | #raxx-ops-alert-sev2-5 (Slack C0B4611UC2V) | Per-event, no push override | project_oncall_severity_routing |
| HIGH (off-hours) | #raxx-ops-alert-sev2 (Slack C0B445RU95Y) | Per-event, push | project_oncall_severity_routing |
| MEDIUM | ops@ daily digest | Digest only | feedback_pre_launch_digest_notifications |
| LOW | docs/detections/_log/ markdown | Silent log | charter §Alert routing |
Note: pre-launch, even HIGH findings default to digest unless they describe an active-attack signature with sample size > 1. Per-event HIGH is reserved for: hash-chain break, audit-log gap, passkey-enumeration confirmed-burst. Documented per-rule in each rule file.
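The routing table and the pre-launch note combine into a small decision function, sketched here for clarity. The function signature and the boolean inputs are assumptions: in particular, what counts as "ET hours" and how a finding is classified as an active-attack signature are decided elsewhere and are simply passed in.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


DIGEST = "ops@ daily digest"


def route(severity, *, et_hours, active_attack=False, pre_launch=True):
    """Pick the alert destination for a finding, following the routing table.

    `et_hours`, `active_attack`, and `pre_launch` are supplied by the caller.
    """
    if severity is Severity.CRITICAL:
        return "#raxx-ops-alert-sev1"
    if severity is Severity.HIGH:
        # Pre-launch, HIGH defaults to the digest unless it is an active-attack signature.
        if pre_launch and not active_attack:
            return DIGEST
        return "#raxx-ops-alert-sev2-5" if et_hours else "#raxx-ops-alert-sev2"
    if severity is Severity.MEDIUM:
        return DIGEST
    return "docs/detections/_log/"  # LOW: silent markdown log
```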
Roadmap — next campaign refresh
Trigger: earliest of (a) launch + 14 days of real customer traffic, (b) an operator-driven request, or (c) one of the deferred-rule prerequisites landing.
Refresh agenda:
- Activate whichever of the 7 dormant rules have their prereqs met.
- Build data_backtest_benford once backtest sample size > 200 over 7 days.
- Build webhook_stripe_signature_fail once #3206 ships.
- Re-baseline auth_passkey_enumeration from the post-launch traffic distribution; tighten threshold based on observed FP rate.
- Self-tune review: per charter §Workflow, any rule that fired ≥5 times in 7 days with no operator action gets severity downgraded.
- Retire any rule that has been dormant for 60+ days without its prereq advancing — re-evaluate whether the rule is still worth building.
- Add new categories if the threat model shifts: mobile/ (when iOS app is live), agent/ (when other agents start writing to repo at runtime), vault/ (Velvet rotation anomalies once Velvet v2 stabilizes).
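The self-tune step in the agenda above can be read as a one-level severity downgrade. The sketch below is one interpretation, not the charter's definition: it assumes "no operator action" means none of the firings in the window were acted on, that severity drops by exactly one level, and that LOW is the floor. The function name and the firing tuple shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def self_tune(severity, firings, now, window=timedelta(days=7), min_fires=5):
    """Return the severity after the self-tune review.

    `firings` is a list of (fired_at, operator_acted) tuples. If the rule fired
    at least `min_fires` times inside `window` and no firing was acted on, the
    severity drops one level; otherwise it is unchanged.
    """
    recent = [acted for ts, acted in firings if now - ts <= window]
    if len(recent) >= min_fires and not any(recent):
        return ORDER[max(ORDER.index(severity) - 1, 0)]
    return severity
```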
Anti-pattern guard: the second campaign should NOT add more rules merely to grow the catalog. Per charter, sliver-cutting detections produces alert fatigue worse than under-coverage. Target ceiling: ~15 active rules. Anything beyond that needs explicit operator approval.
Files in this PR
- docs/detections/_campaign/2026-06-04-initial-catalog.md (this doc)
- 10 rule files under docs/detections/<category>/<rule_name>.md
- 6 synthetic-positive fixtures under docs/detections/<category>/_fixtures/
No code changes. No new GH issues. No new feature flags. No migrations.
Cross-references
- Charter: .claude/agents/detection-engineer.md
- Memories consulted: project_apm_vendor_sentry, project_postmark_approved, project_oncall_severity_routing, project_cf_gate_and_attorney_hold_until_raxx_app_perfect, feedback_nightly_scan_dark_is_high, feedback_pre_launch_digest_notifications, feedback_no_progress_recap_framing, feedback_pr_base_main, feedback_no_speculative_markdown, feedback_fix_dont_file, project_kms_audit_chain_approved, feedback_bootstrap_via_heroku
- Sentry init: backend_v2/api/observability/sentry_init.py
- Audit writer: backend_v2/api/services/customer_audit_writer_service.py, console/app/services/customer_audit.py
- Waitlist route: backend_v2/api/routes/waitlist_signups.py
- Heroku log drain receiver: console/app/blueprints/heroku_log_drain.py
- WAF events flag: backend_v2/api/feature_flags.yaml (flag waf_events)
- Latest scan triage: docs/security/remediation/2026-06-04-nightly-scans-triage.md