Raxx · internal docs


Detection catalog — initial campaign

Campaign date: 2026-06-04 UTC
Author: detection-engineer (first invocation)
Branch: detections/initial-catalog-2026-06-04
Status: docs-only PR; no code changes

This is the first-invocation campaign per the agent charter at .claude/agents/detection-engineer.md. It establishes the initial behavioral-detection catalog for Raxx pre-launch. Subsequent campaign docs re-baseline, retire dormant rules, and add coverage as telemetry sources come online.


North-star posture (recap, not redesign)

  1. Detections target the seams — auth (passkey + session), data (audit log + backtest), cost (Heroku spend), ops (dyno + Postgres), webhook (Stripe + Postmark post-launch).
  2. Statistical methods first (Benford, K-S, Poisson tails, cardinality drift); hard thresholds only when sample size forbids statistics.
  3. Pre-launch traffic is sparse. Every rule in this catalog declares whether it is "live now," "dormant — activate post-first-real-traffic," or "dormant — activate when X telemetry source is wired." Operator can defer the dormant ones safely until launch.
  4. Alert routing respects feedback_pre_launch_digest_notifications. CRITICAL fires into existing #raxx-ops-alert-sev* channels (per project_oncall_severity_routing). HIGH/MEDIUM/LOW batch into the ops@ daily digest. No new per-event surfaces are introduced.
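As an illustration of the "statistical methods first" posture in item 2, a Poisson upper-tail check can be sketched as below. This is a minimal sketch, not the catalog's implementation: the function names, the baseline rate, and the alpha cutoff are all assumptions chosen for the example.

```python
import math


def poisson_upper_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_anomalous(observed: int, baseline_rate: float, alpha: float = 1e-3) -> bool:
    """Flag a window whose count is improbably high under the baseline rate.

    alpha is a hypothetical cutoff, not a value taken from any rule file.
    """
    return poisson_upper_tail(observed, baseline_rate) < alpha
```

With a baseline of 2 events per window, a window with 3 events is unremarkable, while a window with 12 events falls far into the tail and would be flagged.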

Rules built this campaign (10)

| # | Rule ID | Category | Severity ceiling | State at ship | Reason this was prioritized |
|---|---|---|---|---|---|
| 1 | auth_passkey_enumeration | auth | HIGH | dormant (needs Heroku→Papertrail drain) | Passkey-begin is the first probe in a registration farm; cardinality of distinct emails per IP is the canonical signature. Pre-launch traffic is low → easy to baseline once a single signup farm hits. |
| 2 | auth_session_creation_velocity | auth | HIGH | dormant (needs session-create event log) | Session creation per IP per minute is the post-passkey-success velocity check. Catches credential-stuffing or stolen-cookie replay before audit-log forensics can. |
| 3 | auth_rbac_denied_burst | auth | MEDIUM | live (RBAC denial events log today) | Operator-side defense: catches an operator account whose privileges were silently rotated narrower (mis-flag) or a compromised operator session probing for elevation. |
| 4 | signup_waitlist_velocity_per_origin | signup | HIGH | live (waitlist_signups table exists; FLAG_WAITLIST_DATASTORE-gated) | getraxx waitlist is the first inbound public surface. Per-Origin velocity catches a single referrer pumping the list. |
| 5 | signup_email_pattern_anomaly | signup | MEDIUM | live | N waitlist signups in 5 min sharing a domain pattern = synthesized-list seeding. Cheapest possible bot defense; sample size small but adequate for the pattern. |
| 6 | data_audit_log_gap_window | data | CRITICAL | dormant (needs audit-write cadence baseline, post-first-customer) | A gap > 5 min in customer_audit_events during business hours indicates either an outage of the audit writer OR an attacker pausing the log. Both warrant an immediate page. |
| 7 | data_audit_log_hash_chain_break | data | CRITICAL | dormant (needs KMS HMAC hash chain wired; gated on project_kms_audit_chain_approved) | The hash chain is the tamper-evidence primitive. Any break is by definition adversarial; CRITICAL on first observation is justified because the hash chain has zero legitimate-break path. |
| 8 | ops_sentry_error_rate_spike | ops | HIGH | dormant (needs sentry_backend flag ON in prod; currently OFF) | Sentry events-per-minute by route is the cheapest deviation signal once Sentry is on; 3σ from a 7-day route-specific baseline. |
| 9 | ops_postgres_p99_drift | ops | MEDIUM | dormant (needs Heroku pg_stat_statements extension verified + scraper wired) | Postgres p99 drift catches a noisy-neighbor query or an n+1 from a recently-shipped feature; saves a SEV-2 by surfacing degradation pre-customer-impact. |
| 10 | cost_dyno_hour_spike | cost | MEDIUM | dormant (needs Heroku Platform API metrics scraper) | Dyno-hour rate above the 99.9th percentile of the prior 7d catches runaway autoscaling, infinite-loop dyno burn, or a forgotten worker. Pre-launch dollar exposure is small but the signal still applies. |

Why ten not twelve: The eleventh and twelfth slots (Benford on backtest returns; Postmark hard-bounce rate) are catalogued under "deferred" below because (a) Benford needs > 100 backtest samples and pre-launch we don't have them, and (b) Postmark customer-mail volume is currently zero. Both ship in the next campaign once telemetry is producing.
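The signature behind auth_passkey_enumeration (cardinality of distinct emails per IP) can be sketched as follows. This is an illustrative sketch only: the event tuple shape, the 10-minute window, and the max_distinct threshold are assumptions, since the real telemetry source (the log drain) is not wired yet.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def distinct_emails_per_ip(events, window_end, window=timedelta(minutes=10)):
    """Count distinct normalized emails per IP inside (window_end - window, window_end].

    events is an iterable of (timestamp, ip, email) tuples; this shape is hypothetical.
    """
    start = window_end - window
    seen = defaultdict(set)
    for ts, ip, email in events:
        if start < ts <= window_end:
            seen[ip].add(email.strip().lower())
    return {ip: len(emails) for ip, emails in seen.items()}


def flag_enumerating_ips(events, window_end, max_distinct=5,
                         window=timedelta(minutes=10)):
    """Return the IPs whose distinct-email count exceeds the (hypothetical) threshold."""
    counts = distinct_emails_per_ip(events, window_end, window)
    return sorted(ip for ip, n in counts.items() if n > max_distinct)
```

Normalizing the email before counting keeps trivial case or whitespace variants from inflating the cardinality; a post-launch re-baseline would replace the fixed threshold with one derived from the observed distribution.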


Rules NOT built this campaign (deferred — 7)

| Rule (proposed name) | Why deferred | When to revisit |
|---|---|---|
| data_backtest_benford | Benford χ² needs 100+ first-digit samples; pre-launch backtest volume is < 10/day from operator dogfooding alone. False-positive rate undefined until real users run backtests. | First campaign refresh after launch, when 7-day rolling backtest count > 200. |
| webhook_stripe_signature_fail | Stripe webhook surface is tracked at #3206 and is not wired yet. No /webhooks/stripe endpoint exists in backend_v2/api/routes/. Detection has nothing to subscribe to. | When #3206 ships; flip this rule live in the same campaign refresh. |
| webhook_postmark_bounce_spike | Postmark is out of sandbox (project_postmark_approved) but production customer-mail volume is zero. Hard-bounce rate is undefined on a denominator of zero. Internal addresses (ops@/billing@/no-reply@) were on the suppression list as of 2026-05-09; cleanup outstanding. | First campaign refresh after launch, once outbound volume > 500/day. |
| ops_dyno_restart_cluster | Heroku platform metrics scraper does not exist. Manual heroku ps:scale events from operator-driven deploys swamp any natural restart signal pre-launch. | Same campaign refresh as cost_dyno_hour_spike; both need the Heroku Platform API scraper. |
| cost_heroku_addon_drift | Operator is the sole party making config-var changes; a "new addon without operator change record" check needs an authoritative change-record source. console_audit_events does not yet cover Heroku config mutations. | When console UI gains Heroku config-var write-through (#3208 stub), the audit-event becomes the change-record source. |
| signup_geographic_burst | Geo IP enrichment is not yet wired into the request pipeline. CF country-code header is available but waitlist_signups does not persist it. Without geo on the row, the detection has no grouping key. | When cf-ipcountry is persisted on waitlist_signups (small migration; will likely come with EU geoblock telemetry per project_eu_geoblock_decision). |
| auth_login_failure_burst | The passkey login flow does not have a meaningful "failure burst": WebAuthn assertion failures are quiet 400s without account-state side-effects. Credential-stuffing attacks against passkey-only auth look like cardinality of distinct emails, not failure-bursts of a single email. Subsumed by auth_passkey_enumeration. | Permanently subsumed unless we add fallback password auth (not on roadmap). |
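For the deferred data_backtest_benford rule, the Benford χ² statistic and the minimum-sample guard can be sketched as below. This is a hedged sketch: the function names are hypothetical, the 100-sample floor comes from the table above, and the 0.05 significance level is an assumption (15.507 is the standard χ² critical value for 8 degrees of freedom at that level).

```python
import math
from collections import Counter

# Expected first-digit probabilities under Benford's law.
BENFORD_P = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
# Chi-square critical value for 8 degrees of freedom at alpha = 0.05 (assumed level).
CHI2_CRIT_8DF_ALPHA_05 = 15.507


def first_digit(value):
    """Return the leading non-zero decimal digit, or None for zero / non-finite input."""
    if value == 0 or not math.isfinite(value):
        return None
    # Scientific notation puts the leading digit first, avoiding float-division drift.
    return int(f"{abs(value):.15e}"[0])


def benford_chi2(values, min_samples=100):
    """Return the chi-square statistic against Benford, or None if the sample is too small."""
    digits = [d for d in map(first_digit, values) if d is not None]
    n = len(digits)
    if n < min_samples:
        return None
    observed = Counter(digits)
    return sum((observed[d] - n * p) ** 2 / (n * p) for d, p in BENFORD_P.items())
```

Returning None below the sample floor makes the "not enough data" case explicit, which matches the reason this rule is deferred rather than shipped dormant with an undefined false-positive rate.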

Operator-action prerequisites (gate the dormant rules)

These prerequisites block the listed rules from producing real signal. Each is an operator-action card the operator may dispatch when ready. The detection-engineer does NOT file these as GH issues (per feedback_fix_dont_file); they are listed here so the operator can choose.

P1 — biggest unlock (gates 4 rules)

No Heroku Logplex → log-aggregation drain is configured. Today, app logs are accessible only via heroku logs --tail; there is no archival drain into Papertrail or S3. Without a drain, none of the rolling-window detections (auth_passkey_enumeration, auth_session_creation_velocity, ops_sentry_error_rate_spike in part, and the deferred ops_dyno_restart_cluster) can query historical telemetry. The heroku_log_drain blueprint exists in console (console/app/blueprints/heroku_log_drain.py), but it is only a receiver; the upstream drain has to be configured Heroku-side. Sentry partially compensates for ops_sentry_error_rate_spike, because Sentry retains events even without a separate Heroku drain.

P2 — Sentry production rollout

The sentry_backend flag is OFF in prod. Per project_apm_vendor_sentry, Sentry is the locked APM, and the sentry_backend flag exists with full init code at backend_v2/api/observability/sentry_init.py. Flipping the flag on (with SENTRY_DSN_BACKEND set per app) is a bootstrap operator action per feedback_bootstrap_via_heroku. Once it is on, ops_sentry_error_rate_spike can populate baselines. Also confirm that SENTRY_TRACES_SAMPLE_RATE resolves to a non-zero value in prod (the default is stated as 0.1), since any trace-based detection in a future campaign depends on it.
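Once baselines exist, the 3σ check that ops_sentry_error_rate_spike describes could look roughly like the sketch below. The function name, the min_points guard, and the choice of population standard deviation are assumptions for illustration; the per-route, 7-day baseline window is the part taken from the catalog.

```python
import statistics


def is_rate_spike(baseline_counts, current, sigmas=3.0, min_points=30):
    """Compare the current events-per-minute count with a route-specific baseline.

    baseline_counts holds per-minute counts for one route over the baseline window.
    min_points is a hypothetical guard against judging on too little history.
    """
    if len(baseline_counts) < min_points:
        return False  # not enough history to judge
    mu = statistics.fmean(baseline_counts)
    sd = statistics.pstdev(baseline_counts)
    if sd == 0:
        # A perfectly flat baseline: any rise above it is a deviation.
        return current > mu
    return (current - mu) / sd > sigmas
```

The zero-variance branch matters pre-launch, where many routes will sit at a flat zero for days and a z-score would otherwise divide by zero.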

P3 — KMS HMAC hash chain ship

data_audit_log_hash_chain_break requires the KMS HMAC chain to be live. Per project_kms_audit_chain_approved, operator approved the ~$2/mo KMS spend; SC-A11 + SC-A14 are gated on deploy. Until those land, the audit-log table has rows but no tamper-evidence chain to check. The rule ships dormant.
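To make the check concrete, here is a minimal sketch of verifying an HMAC hash chain. It assumes a chain where each row's MAC covers the previous row's MAC plus the row payload; the actual chain layout, key handling through KMS, and column names are not specified in this doc, so everything below is hypothetical.

```python
import hashlib
import hmac

GENESIS = b"\x00" * 32  # hypothetical seed value for the first link


def link_mac(key: bytes, prev_mac: bytes, payload: bytes) -> bytes:
    """MAC for one row: HMAC-SHA256 over the previous MAC followed by the payload."""
    return hmac.new(key, prev_mac + payload, hashlib.sha256).digest()


def build_chain(key: bytes, payloads):
    """Build [(payload, mac), ...] so that each MAC commits to all earlier rows."""
    chain, prev = [], GENESIS
    for payload in payloads:
        mac = link_mac(key, prev, payload)
        chain.append((payload, mac))
        prev = mac
    return chain


def first_break(key: bytes, chain):
    """Return the index of the first row whose MAC does not verify, or None if intact."""
    prev = GENESIS
    for i, (payload, mac) in enumerate(chain):
        if not hmac.compare_digest(link_mac(key, prev, payload), mac):
            return i
        prev = mac
    return None
```

In this layout both an edited row and a deleted row surface as a break at the first affected position, which is what lets the rule treat any break as a single CRITICAL signal.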

P4 — audit-write cadence baseline

data_audit_log_gap_window needs a baseline of normal audit-write cadence per hour. Pre-customer, the only writers are operator-action sessions (sparse, bursty). Baselining the rule requires 7+ days of real-customer audit-write traffic. Until then, the rule's threshold is set to "any gap > 30 min during 12:00–22:00 UTC weekdays" as a conservative pre-launch placeholder, looser than the > 5 min gap named in the catalog entry. Re-baseline at the first campaign refresh post-launch.
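The placeholder threshold can be expressed as a small gap finder. This is a sketch under stated assumptions: write timestamps are timezone-aware UTC datetimes, a gap counts only for the portion that falls inside the weekday 12:00–22:00 UTC window, and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone

WINDOW_START, WINDOW_END = time(12, 0), time(22, 0)  # UTC, per the placeholder
MAX_GAP = timedelta(minutes=30)


def business_overlap(start, end):
    """Yield the (lo, hi) portions of [start, end] inside weekday 12:00-22:00 UTC."""
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            lo = max(start, datetime.combine(day, WINDOW_START, tzinfo=timezone.utc))
            hi = min(end, datetime.combine(day, WINDOW_END, tzinfo=timezone.utc))
            if hi > lo:
                yield lo, hi
        day += timedelta(days=1)


def find_gaps(write_times, max_gap=MAX_GAP):
    """Return in-window silent stretches longer than max_gap between consecutive writes."""
    ordered = sorted(write_times)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        for lo, hi in business_overlap(prev, nxt):
            if hi - lo > max_gap:
                gaps.append((lo, hi))
    return gaps
```

Clipping each gap to the window means an overnight or weekend silence does not fire by itself; only the part of the silence that overlaps monitored hours is measured against the threshold.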

P5 — Postgres pg_stat_statements verification

ops_postgres_p99_drift needs the pg_stat_statements extension confirmed enabled on Heroku Postgres. Heroku Standard-0 (current Raptor + Console pg tier) does support the extension; not yet verified active. Operator can verify with heroku pg:psql -a raxx-api-prod -c "SHOW shared_preload_libraries;" and heroku pg:psql -a raxx-api-prod -c "SELECT count(*) FROM pg_extension WHERE extname='pg_stat_statements';".

P6 — Heroku Platform API scraper

cost_dyno_hour_spike and the deferred ops_dyno_restart_cluster need a scheduled scraper that polls GET /apps/{app}/dynos + GET /apps/{app}/formation from the Heroku Platform API. Not wired today; no script in scripts/ops/ or console/app/services/ queries those endpoints. Belongs to sre-agent to build; detection-engineer consumes the resulting telemetry table.
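Once the scraper produces a telemetry table, the cost_dyno_hour_spike comparison against the 99.9th percentile of the prior 7 days might be sketched as below. The nearest-rank percentile method and the function names are assumptions; the scraper itself is out of scope here.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def is_dyno_hour_spike(prior_7d_rates, current_rate, pct=99.9):
    """Flag a dyno-hour rate strictly above the chosen percentile of the prior window."""
    return current_rate > percentile_nearest_rank(prior_7d_rates, pct)
```

One consequence worth noting: with hourly samples, 7 days yields only 168 points, so the 99.9th percentile under nearest-rank is simply the window maximum. A finer scrape interval would be needed for the percentile to differ from "higher than anything seen this week".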

P7 — Stripe webhook ship

webhook_stripe_signature_fail (deferred this campaign) is gated on #3206. No /webhooks/stripe route exists in backend_v2/api/routes/. When the webhook ships, the rule activates in the same campaign refresh.

P8 — Postmark customer-mail volume

webhook_postmark_bounce_spike is gated on real customer-mail volume > 500/day. It activates at the first campaign refresh after that threshold has been sustained for 7 days.
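The volume gate and the zero-denominator problem can be sketched in a few lines. This is illustrative only: the function names and the input shape (a list of daily outbound counts, oldest first) are assumptions.

```python
def hard_bounce_rate(hard_bounces, outbound):
    """Return the hard-bounce rate, or None when there is no outbound mail to divide by."""
    if outbound == 0:
        return None
    return hard_bounces / outbound


def volume_gate_met(daily_outbound, min_per_day=500, sustain_days=7):
    """True when each of the most recent sustain_days days exceeded min_per_day."""
    recent = daily_outbound[-sustain_days:]
    return len(recent) == sustain_days and all(v > min_per_day for v in recent)
```

Returning None rather than 0.0 for an empty denominator keeps "no mail sent" distinguishable from "mail sent, nothing bounced", which is the distinction that keeps this rule deferred pre-launch.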


Alert routing posture (confirmed per memory hygiene)

| Severity | Channel | Posture | Source memory |
|---|---|---|---|
| CRITICAL | #raxx-ops-alert-sev1 (Slack C0B423M38H4) | Per-event, push override | project_oncall_severity_routing |
| HIGH (ET hours) | #raxx-ops-alert-sev2-5 (Slack C0B4611UC2V) | Per-event, no push override | project_oncall_severity_routing |
| HIGH (off-hours) | #raxx-ops-alert-sev2 (Slack C0B445RU95Y) | Per-event, push | project_oncall_severity_routing |
| MEDIUM | ops@ daily digest | Digest only | feedback_pre_launch_digest_notifications |
| LOW | docs/detections/_log/ markdown | Silent log | charter §Alert routing |

Note: pre-launch, even HIGH findings default to digest unless they describe an active-attack signature with sample size > 1. Per-event HIGH is reserved for: hash-chain break, audit-log gap, passkey-enumeration confirmed-burst. Documented per-rule in each rule file.
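The routing table and the pre-launch note combine into a small decision function, sketched below. The function name and parameters are hypothetical, and whether a timestamp falls in "ET hours" is passed in as a flag because this doc does not define those hours.

```python
DIGEST = "ops@ daily digest"


def route_alert(severity, in_et_hours, pre_launch=True,
                active_attack=False, sample_size=0):
    """Pick a destination for a finding, following the routing table above."""
    if severity == "CRITICAL":
        return "#raxx-ops-alert-sev1"
    if severity == "HIGH":
        # Pre-launch, HIGH goes to the digest unless it describes an
        # active-attack signature with sample size > 1.
        if pre_launch and not (active_attack and sample_size > 1):
            return DIGEST
        return "#raxx-ops-alert-sev2-5" if in_et_hours else "#raxx-ops-alert-sev2"
    if severity == "MEDIUM":
        return DIGEST
    if severity == "LOW":
        return "docs/detections/_log/"
    raise ValueError(f"unknown severity: {severity!r}")
```

Keeping the pre-launch override as an explicit parameter makes the post-launch change a one-argument flip rather than a rewrite of the routing logic.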


Roadmap — next campaign refresh

Trigger: earliest of (a) launch + 14 days of real customer traffic, (b) an operator-driven request, or (c) one of the deferred-rule prerequisites landing.

Refresh agenda:

  1. Activate those of the 7 dormant rules whose prereqs are met.
  2. Build data_backtest_benford once backtest sample size > 200 over 7 days.
  3. Build webhook_stripe_signature_fail once #3206 ships.
  4. Re-baseline auth_passkey_enumeration from the post-launch traffic distribution; tighten threshold based on observed FP rate.
  5. Self-tune review: per charter §Workflow, any rule that fired ≥5 times in 7 days with no operator action gets severity downgraded.
  6. Retire any rule that has been dormant for 60+ days without its prereq advancing — re-evaluate whether the rule is still worth building.
  7. Add new categories if the threat model shifts: mobile/ (when iOS app is live), agent/ (when other agents start writing to repo at runtime), vault/ (Velvet rotation anomalies once Velvet v2 stabilizes).
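Agenda items 5 and 6 are mechanical enough to sketch. The code below is one possible reading, with assumptions called out: "no operator action" is interpreted as none of the firings in the window being acted on, a downgrade moves exactly one severity step, and the input shapes and names are hypothetical.

```python
from datetime import datetime, timedelta

SEVERITIES = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]


def tuned_severity(severity, firings, now, min_firings=5, window=timedelta(days=7)):
    """Downgrade one step if the rule fired >= min_firings times in the window, all ignored.

    firings is a list of (fired_at, operator_acted) pairs; this shape is an assumption.
    """
    recent = [acted for fired_at, acted in firings if now - window <= fired_at <= now]
    if len(recent) >= min_firings and not any(recent):
        return SEVERITIES[max(SEVERITIES.index(severity) - 1, 0)]
    return severity


def retirement_candidates(last_prereq_progress, now, max_dormant=timedelta(days=60)):
    """Return dormant rules whose prerequisite has not advanced for max_dormant or longer.

    last_prereq_progress maps rule ID to the datetime its prereq last advanced.
    """
    return sorted(rule for rule, since in last_prereq_progress.items()
                  if now - since >= max_dormant)
```

The output of retirement_candidates is a list to re-evaluate, not to delete automatically, which matches the agenda's wording about reconsidering whether each rule is still worth building.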

Anti-pattern guard: the second campaign should NOT add more rules merely to grow the catalog. Per charter, sliver-cutting detections produces alert fatigue worse than under-coverage. Target ceiling: ~15 active rules. Anything beyond that needs explicit operator approval.


Files in this PR

No code changes. No new GH issues. No new feature flags. No migrations.

Cross-references