Raxx · internal docs

WAF runbook

System: Cloudflare WAF for the raxx.app zone (raxx.app, api.raxx.app, console.raxx.app, vault.raxx.app, tickets.raxx.app) and the getraxx.com zone (getraxx.com, www.getraxx.com)
Owner: operator
Last incident: n/a (initial setup, SC-WAF-01 #1737)
Last reviewed: 2026-05-17

How to tell it's broken

How to diagnose (in order)

  1. Check CF WAF Events dashboard — raxx.app or getraxx.com zone → Security → WAF. Filter by last 30 min. Expected: zero blocking actions in Phase 1 (log-only).
  2. Check Logpush S3 bucket (once SC-WAF-00 is complete) — FirewallMatchesActions field. A block action in Phase 1 indicates a rule error.
  3. Correlate FirewallMatchesRuleIDs with the ruleset IDs from terraform output. Identify which ruleset (managed vs custom vs rate limit) fired.
  4. For Postmark webhook failures: GET /zones/{zone_id}/rulesets/{custom_waf_ruleset_id} — verify the Postmark IP ranges in rule Priority 2 match the current Postmark IP list.
  5. For service token block: GET /zones/{zone_id}/rulesets/{custom_waf_ruleset_id} — verify Priority 1 skip rule expression is (len(http.request.headers["cf-access-client-id"]) gt 0) and is enabled.
  6. Check Terraform state drift: cd terraform/waf && terraform plan. Any non-zero diff against a known-good apply indicates dashboard drift (Failure Mode F11).
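Steps 2 and 3 could be scripted roughly as follows. This is a minimal sketch, assuming a Logpush batch has already been downloaded and decompressed to an NDJSON file (one compact JSON object per line) and that FirewallMatchesActions and FirewallMatchesRuleIDs are serialized as JSON arrays of strings. The function names are hypothetical.

```shell
#!/bin/sh
# Count Logpush records whose FirewallMatchesActions array contains "block".
# Assumes compact NDJSON (no space after the colon), already decompressed.
count_block_actions() {
  # grep -c prints the number of matching lines but exits 1 when that
  # number is 0, so "|| true" keeps callers running under "set -e" alive.
  grep -c '"FirewallMatchesActions":\[[^]]*"block"' "$1" || true
}

# Print the distinct rule ID lists attached to blocking records,
# most frequent first, for comparison with the terraform output IDs.
list_block_rule_ids() {
  grep '"FirewallMatchesActions":\[[^]]*"block"' "$1" \
    | grep -o '"FirewallMatchesRuleIDs":\[[^]]*\]' \
    | sort | uniq -c | sort -rn
}
```

In Phase 1, count_block_actions should print 0 for every batch; any other value is the "rule error" case from step 2, and list_block_rule_ids shows which rule IDs to look up.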

Token setup

This stack requires a CF API token with Zone:WAF:Edit + Zone:Logs:Edit on both zones.

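Verify the token is valid and active before applying. This endpoint reports token status only, so confirm the scopes themselves on the token's entry in the CF dashboard: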

curl -s -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify | python3 -m json.tool
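For use in a script rather than by eye, the response can be checked mechanically. The sketch below assumes the verify response carries a "status": "active" field for a usable token. The helper name is hypothetical.

```shell
#!/bin/sh
# Succeeds only when the verify response on stdin reports an active token.
# Tolerates both compact JSON and the pretty-printed output of json.tool.
token_is_active() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"active"'
}

# Example (requires network access and a real token):
#   curl -s -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
#     https://api.cloudflare.com/client/v4/user/tokens/verify \
#     | token_is_active && echo "token active" || echo "token NOT active"
```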

The CLOUDFLARE_RAXX_AUTOMATION_API_TOKEN documented in terraform/README.md was confirmed to NOT have WAF:Edit scope as of 2026-04-30 (see cloudflare-rate-limiting.md). If that token has not been updated, mint a new WAF-scoped token:

  1. CF dashboard → My Profile → API Tokens → Create Token
  2. Permissions: Zone > WAF > Edit, Zone > Logs > Edit
  3. Zone resources: Include > Specific zone > raxx.app AND getraxx.com (both)
  4. Store in Infisical: POST /api/v3/secrets/raw/CF_WAF_EDIT at path /MooseQuest/cloudflare/
  5. Export at apply time: export CLOUDFLARE_API_TOKEN=$(infisical secrets get CF_WAF_EDIT --path /MooseQuest/cloudflare/ --plain)
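If the Infisical lookup in step 5 fails, the command substitution can leave CLOUDFLARE_API_TOKEN empty without stopping the shell. A small guard, sketched here with a hypothetical function name, makes that failure visible before Terraform runs:

```shell
#!/bin/sh
# Fail fast when the exported token is empty, for example because the
# secret lookup failed and the command substitution produced nothing.
require_cf_token() {
  if [ -z "${CLOUDFLARE_API_TOKEN:-}" ]; then
    echo "CLOUDFLARE_API_TOKEN is empty; aborting before terraform runs" >&2
    return 1
  fi
}

# Example:
#   require_cf_token && terraform plan -out=tfplan
```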

Known failure modes

Failure mode A: False positive — legitimate customer blocked (F1)

Symptom: Customer reports 403 on a valid request. WAF Events log shows an OWASP or CF Managed rule firing on a legitimate path.
Cause: OWASP CRS triggering on a valid JSON body or on API field names containing SQL/XSS patterns. Most common on api.raxx.app with complex order payloads.
Fix:

# Identify the rule ID from WAF Events or Logpush
# Edit terraform/waf/terraform.tfvars: set owasp_action = "log" to revert to observation
# Or apply a per-rule override in terraform/modules/cf-waf/main.tf overrides block
cd terraform/waf
export CLOUDFLARE_API_TOKEN=$(infisical secrets get CF_WAF_EDIT --path /MooseQuest/cloudflare/ --plain)
terraform plan -out=tfplan
terraform apply tfplan

Verification: Customer can complete the previously blocked action. WAF Events shows "log" not "block" for the rule.
Phase impact: Rolling back to log is always safe.
Docs: waf-strategy.md §8 Phase 1, Failure Mode F1.

Failure mode B: Postmark webhook blocked (F5)

Symptom: Postmark webhook delivery failures. FreeScout inbound email stops. Logpush shows block on /api/webhooks/postmark.
Cause: Postmark rotated their delivery IP ranges without notice.
Fix:

# Get current Postmark IP ranges from:
# https://postmarkapp.com/support/article/800-ips-for-rate-limiting-or-firewall-rules
# Update terraform/waf/main.tf postmark_ip_ranges in both module calls
# Then:
cd terraform/waf
terraform plan -out=tfplan
terraform apply tfplan

Verification: curl -X POST https://api.raxx.app/api/webhooks/postmark from a Postmark IP returns 200 (not 403). Logpush shows no block on this path.
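Before editing postmark_ip_ranges, it can help to see exactly which ranges changed. The sketch below compares two plain-text lists, one range per line. Both input files are hypothetical: one holding the ranges currently in terraform/waf/main.tf, the other holding the ranges copied from the Postmark article.

```shell
#!/bin/sh
# Compare two files of IP ranges, one entry per line.
#   $1 = ranges currently configured in Terraform (hypothetical file)
#   $2 = ranges currently published by Postmark   (hypothetical file)
# Prints "+ <range>" for ranges to add and "- <range>" for ranges to remove.
diff_ip_ranges() {
  configured=$(mktemp) && published=$(mktemp) || return 1
  sort -u "$1" > "$configured"
  sort -u "$2" > "$published"
  # comm -13 keeps lines unique to the second file, comm -23 to the first.
  comm -13 "$configured" "$published" | sed 's/^/+ /'
  comm -23 "$configured" "$published" | sed 's/^/- /'
  rm -f "$configured" "$published"
}
```

An empty result means the configured list already matches the published one, which points away from an IP rotation as the cause.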

Failure mode C: CF Access service token challenged or blocked (F6)

Symptom: Velvet, CI, or Console machine calls to Queue or Raptor returning 403 or CAPTCHA challenge.
Cause: New service token not matching the skip rule, or BFM skip rule accidentally disabled.
Fix:

# Verify the skip rule in CF dashboard:
# raxx.app zone → Security → WAF → Custom rules → "Priority 1 — skip BFM..."
# Confirm: expression = (len(http.request.headers["cf-access-client-id"]) gt 0)
# Confirm: Action = Skip, Status = Enabled

# If the rule is present but not working, verify the service token is sending
# the CF-Access-Client-Id header. Trace with:
curl -v -H "CF-Access-Client-Id: <token-id>" -H "CF-Access-Client-Secret: <token-secret>" \
  https://api.raxx.app/health

Verification: Machine caller returns expected response (not 403/challenge). Logpush shows no block on affected path.
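To tell a challenge from a plain block without reading the full curl trace, the response headers can be classified. This sketch assumes Cloudflare marks challenge responses with a cf-mitigated: challenge header. The function name is hypothetical, and a bare 403 may also come from the origin rather than the WAF.

```shell
#!/bin/sh
# Classify a response from its headers, as printed by
#   curl -sS -D - -o /dev/null <url>
# Prints "challenged", "blocked", or "passed:<status>".
classify_waf_response() {
  awk '
    { sub(/\r$/, "") }                      # strip the CR that ends each header line
    /^HTTP\// { status = $2 }               # e.g. "HTTP/2 403" or "HTTP/1.1 200 OK"
    tolower($0) ~ /^cf-mitigated:[ \t]*challenge/ { challenged = 1 }
    END {
      if (challenged)           print "challenged"
      else if (status == "403") print "blocked"
      else                      print "passed:" status
    }'
}

# Example (requires network access and real service token values):
#   curl -sS -D - -o /dev/null \
#     -H "CF-Access-Client-Id: <token-id>" -H "CF-Access-Client-Secret: <token-secret>" \
#     https://api.raxx.app/health | classify_waf_response
```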

Failure mode D: Rate limit too tight — Stripe/payment webhook backlog (F4)

Symptom: Stripe webhook delivery failures. Payment processing lag. Rate limit action fires on /api/v1/billing/webhook.
Cause: Rate limit threshold on global or order path too tight during a Stripe event replay burst.
Fix:

# Immediately revert rate_limit_action to "log" (observation mode):
# In terraform.tfvars: rate_limit_action = "log"
cd terraform/waf
terraform plan -out=tfplan
terraform apply tfplan

Verification: Stripe webhook delivery resumes. Check Stripe dashboard for webhook retry status.

Failure mode E: Terraform state drift (F11)

Symptom: terraform plan shows a diff for a resource that was not intentionally changed. This indicates a direct CF dashboard edit (not via Terraform).
Fix:

cd terraform/waf
# Review the diff carefully. If the dashboard state is correct:
# Import the changed resource into TF state and update main.tf to match.
# If TF state is correct:
terraform apply -target=<resource_address>

Prevention: All WAF changes must go through Terraform. No direct CF dashboard edits after first apply (ADR-0077 D2, ADR-0051).
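Drift can also be detected non-interactively, for example from a scheduled job. The sketch below relies on terraform plan -detailed-exitcode, which exits 0 when there are no changes, 1 on error, and 2 when changes are present. The helper name and verdict strings are hypothetical.

```shell
#!/bin/sh
# Map the exit status of "terraform plan -detailed-exitcode" to a verdict.
#   0 = no changes, 1 = error, 2 = changes present
plan_verdict() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Example (run from terraform/waf with credentials exported):
#   terraform plan -detailed-exitcode -input=false >/dev/null 2>&1
#   plan_verdict "$?"
```

A "drift" verdict against a known-good apply is the F11 condition. An "error" verdict usually means credentials or state access should be checked first.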

Phase advancement

Phase transitions require explicit operator sign-off. Do not advance phases autonomously.

Phase | tfvars change | Gate criteria
Phase 1 → Phase 2 | managed_ruleset_action = "managed_challenge", rate_limit_action = "managed_challenge" | 7-day log soak; false-positive rate <1%
Phase 2 → Phase 3 | managed_ruleset_action = "block", rate_limit_action = "block" | 72h; zero legitimate flows challenged
Phase 4 → Phase 5 | n/a (flag flip: FLAG_ENFORCE_CF_ORIGIN) | 7-day Phase 4 soak; SC-WAF-07 (#1741)

Always run terraform plan and review before terraform apply on any phase change.
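The Phase 1 → Phase 2 gate involves a small calculation. As a sketch only, the check below assumes the false-positive rate is defined as false positives divided by total evaluated requests over the soak window. That definition is an assumption; waf-strategy.md may define it differently. The function name is hypothetical.

```shell
#!/bin/sh
# Check the "<1%" gate, assuming
#   rate = false_positives * 100 / total_evaluated_requests
# Exit status: 0 = gate passes, 1 = gate fails, 2 = no data to evaluate.
fp_gate_passes() {
  awk -v fp="$1" -v total="$2" 'BEGIN {
    if (total <= 0) exit 2
    rate = fp * 100 / total
    printf "false-positive rate: %.2f%%\n", rate
    exit ((rate < 1) ? 0 : 1)
  }'
}

# Example: 12 confirmed false positives out of 4000 evaluated requests
#   fp_gate_passes 12 4000
```

The result only informs the operator's decision; it does not replace the explicit sign-off required above.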

Emergency stop (kill-switch)

Fastest rollback: set all actions to log/simulate and apply. ~30s CF propagation.

cd terraform/waf
# Edit terraform.tfvars:
#   managed_ruleset_action = "log"
#   owasp_action           = "log"
#   auth_challenge_action  = "log"
#   rate_limit_action      = "simulate"
export CLOUDFLARE_API_TOKEN=$(infisical secrets get CF_WAF_EDIT --path /MooseQuest/cloudflare/ --plain)
terraform plan -out=tfplan
terraform apply tfplan
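Under incident pressure, hand-editing four variables is easy to get wrong. The sketch below rewrites them in one call. It assumes assignments are formatted as terraform fmt writes them (key, whitespace, =, whitespace, quoted value); the helper names are hypothetical. The resulting plan still needs review before apply.

```shell
#!/bin/sh
# Set key = "value" in a tfvars file, replacing an existing assignment
# or appending one if the key is absent. Works via a temp file so it
# behaves the same with GNU and BSD tools.
set_tfvar() {
  file=$1 key=$2 value=$3
  tmp=$(mktemp) || return 1
  if awk -v k="$key" -v v="$value" '
       $1 == k && $2 == "=" { printf "%s = \"%s\"\n", k, v; seen = 1; next }
       { print }
       END { if (!seen) printf "%s = \"%s\"\n", k, v }
     ' "$file" > "$tmp"; then
    cat "$tmp" > "$file"    # overwrite in place to keep file permissions
  fi
  rm -f "$tmp"
}

# Apply the kill-switch values listed in this section to the given tfvars file.
kill_switch() {
  set_tfvar "$1" managed_ruleset_action log
  set_tfvar "$1" owasp_action           log
  set_tfvar "$1" auth_challenge_action  log
  set_tfvar "$1" rate_limit_action      simulate
}
```

Usage would be kill_switch terraform.tfvars from inside terraform/waf, followed by the plan and apply commands in this section.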

Full removal (removes all WAF rulesets and rate limits; CF Access unaffected):

cd terraform/waf
terraform destroy

Note: terraform destroy removes WAF only. It does not touch terraform/cf-access/ (separate state file).

How to run this stack

cd terraform/waf

# 1. Set the CF WAF-scoped API token
export CLOUDFLARE_API_TOKEN=$(infisical secrets get CF_WAF_EDIT \
  --path /MooseQuest/cloudflare/ --plain)

# 2. Set zone IDs (populate terraform.tfvars REPLACE_WITH_* placeholders)
#    raxx.app zone: f12dbb5cac57d5591a5058874498a6d1 (from cloudflare-rate-limiting.md)
#    getraxx.com zone: retrieve from CF dashboard

# 3. If SC-WAF-00 is complete, set Logpush vars from SSM:
#    export TF_VAR_logpush_destination_conf=$(aws ssm get-parameter \
#      --name /raxx/waf/logpush_destination_conf --with-decryption \
#      --query Parameter.Value --output text)
#    export TF_VAR_logpush_ownership_challenge=$(aws ssm get-parameter \
#      --name /raxx/waf/logpush_ownership_challenge --with-decryption \
#      --query Parameter.Value --output text)

# 4. Init + plan + apply
terraform init
terraform plan -out=tfplan
# Review: all changes must be additive; no modifications to cf-access/ resources
terraform apply tfplan

# 5. Verify
terraform output
# Check CF dashboard: Security → WAF → Custom rules
# Expected: all rules show mode "log"; no blocking actions
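Step 2 depends on every REPLACE_WITH_* placeholder being filled in. A pre-flight check along these lines, with a hypothetical function name, could catch a forgotten one before terraform plan runs:

```shell
#!/bin/sh
# Refuse to continue while the tfvars file still contains REPLACE_WITH_*
# placeholders. Prints each offending line with its line number.
check_placeholders() {
  if grep -n 'REPLACE_WITH_' "$1"; then
    echo "unresolved placeholders in $1" >&2
    return 1
  fi
}

# Example (from terraform/waf):
#   check_placeholders terraform.tfvars && terraform plan -out=tfplan
```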

Logpush setup dependency (SC-WAF-00)

The logpush_destination_conf and logpush_ownership_challenge variables are empty by default. The Logpush job is not created until SC-WAF-00 (#1736) completes.

SC-WAF-00 operator actions:

  1. Create S3 bucket for WAF logs (raxx-waf-logs-prod recommended).
  2. Create IAM user with s3:PutObject on that bucket only.
  3. Run Cloudflare ownership challenge for the destination.
  4. Store destination conf and challenge token in SSM:
     - /raxx/waf/logpush_destination_conf
     - /raxx/waf/logpush_ownership_challenge
  5. Re-apply this stack with the SSM values injected via TF_VAR_*.

Cross-stack ruleset migration (operator state operations — post #2328)

Code migration status: COMPLETE as of 2026-05-17 (Issue #2183, PR pending).

The freescout_lambda_skip rule has been moved out of terraform/cf-access/freescout_service_token.tf and into terraform/modules/cf-waf/main.tf as a Priority 0 dynamic rule inside cloudflare_ruleset.custom_waf. The cloudflare_ruleset.freescout_lambda_skip resource declaration has been removed from terraform/cf-access.

What remains: two state operations. These require live CF credentials (#2328 token refresh) and cannot be executed until tokens are valid. Until then, terraform plan on terraform/cf-access will show the ruleset as a planned destroy — do not run terraform apply on cf-access until Step 2 below is complete.

State migration — run once after #2328 token refresh:

Step 1 — import the live zone-default ruleset into terraform/waf state:

cd terraform/waf
export CLOUDFLARE_API_TOKEN=$(infisical secrets get CF_ACCESS_MGMT \
  --path /MooseQuest/cloudflare/ --plain)
export TF_VAR_raxx_app_zone_id="f12dbb5cac57d5591a5058874498a6d1"
export TF_VAR_getraxx_zone_id=$(infisical secrets get CF_ZONE_ID_GETRAXX \
  --path /MooseQuest/cloudflare/ --plain)
terraform init
terraform import module.waf_raxx_app.cloudflare_ruleset.custom_waf \
  zones/f12dbb5cac57d5591a5058874498a6d1/17dc768ccadf4d02ae279e133b7b5bfd

Step 2 — remove from terraform/cf-access state (does NOT destroy the CF resource):

cd terraform/cf-access
export CLOUDFLARE_API_TOKEN=$(infisical secrets get CF_ACCESS_MGMT \
  --path /MooseQuest/cloudflare/ --plain)
terraform init
terraform state rm cloudflare_ruleset.freescout_lambda_skip
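Because terraform state rm rewrites state, it may be worth confirming the address is actually present and taking a backup first. A minimal sketch, with a hypothetical helper name and backup filename:

```shell
#!/bin/sh
# Succeeds when the exact resource address appears in the output of
# "terraform state list" supplied on stdin.
state_has() {
  grep -Fxq "$1"
}

# Example (run inside terraform/cf-access with credentials exported):
#   terraform state pull > cf-access.backup.tfstate
#   terraform state list | state_has cloudflare_ruleset.freescout_lambda_skip \
#     && terraform state rm cloudflare_ruleset.freescout_lambda_skip
```

The match is on the whole line, so a similarly named address in another module would not trigger the removal.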

Step 3 — plan both stacks; both must show zero diff on the ruleset:

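# Run from the repo root. Subshells keep each cd relative to the same starting directory.
(cd terraform/waf && terraform plan -out=tfplan)   # saved plan is applied in Step 4
(cd terraform/cf-access && terraform plan)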

Expected terraform/waf plan: new resources for managed WAF, rate limits, zone settings, logpush — zero destroys. The custom_waf ruleset shows as an in-place update (Priority 0 rule added, all other rules preserved).

Expected terraform/cf-access plan: zero changes (ruleset resource gone from both config and state).
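The "zero destroys" expectation can be checked mechanically from the plan summary line. The sketch below assumes the plan was run with -no-color and that the summary has the usual "Plan: N to add, N to change, N to destroy." shape. The function name is hypothetical.

```shell
#!/bin/sh
# Read "terraform plan -no-color" output on stdin and print the number of
# resources scheduled for destruction (0 when no summary line is present,
# as with a "No changes." plan).
planned_destroys() {
  awk '
    /^Plan:/ {
      for (i = 3; i <= NF; i++)
        if ($i == "destroy." || $i == "destroy,") n = $(i - 2)
      found = 1
    }
    END { print (found ? n + 0 : 0) }'
}

# Example (run inside terraform/waf):
#   terraform plan -no-color | planned_destroys
```

A failed plan also produces no summary line, so the terraform exit status should be checked alongside this count.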

Step 4 — apply terraform/waf only:

cd terraform/waf
terraform apply tfplan

Step 5 — verify in CF dashboard:

Tracking: Required before #1735 can be closed.

Cross-references

Escalation

Wake the operator when:

  - A WAF rule is blocking customers in Phase 1 (log mode should never block)
  - CF WAF Events shows a block action that cannot be explained by the ruleset
  - terraform destroy is being considered (impacts all WAF protection for both zones)
  - Any incident that involves the WAF affecting payment or order submission flows