RCA — BFM restored on raxx.app after WAF CF-Access skip rules applied
Incident ID: 2026-06-19-bfm-restored Date: 2026-06-19 Severity: SEV-2 Duration: ~8 days total disable window (2026-06-18 disable → 2026-06-19 restore). sre-agent execution: ~15 minutes. Blast radius: No production outage during disable window. BFM off = automated scanners unchallenged for 8 days. No customer-facing impact observed. CI→vault was broken during disable cause (prior to this session); this session restores both. Author: sre-agent
Summary
Bot Fight Mode on the raxx.app Cloudflare zone was disabled on 2026-06-18 (operator-authorized) because GitHub Actions runners from AWS/Azure ASNs were receiving CF error 1010 before their CF-Access service-token headers could be validated — blocking all CI→vault workflows. The permanent fix required WAF skip rules to be applied before re-enabling BFM. This RCA documents the 2026-06-19 session that applied those skip rules via the CF Rulesets API and restored fight_mode=true, with the CI→vault golden path confirmed working.
Timeline (all times UTC)
- 2026-06-18 (approx) — BFM disabled by operator authorization to restore CI→vault baseline. Issue #3634 opened.
- 2026-06-19T00:00Z — sre-agent session started; operator-authorized to execute remediation per #3634.
- 2026-06-19T00:01Z — Read issues #3634, #2328, #2378 and WAF runbook.
- 2026-06-19T00:03Z — Vault auth confirmed working. Tokens
CF_WAF_EDIT_RAXX_APPandCLOUDFLARE_RAXX_AUTOMATION_API_TOKENconfirmed active. - 2026-06-19T00:04Z —
CF_BOT_MGMT_RAXX_APPconfirmed missing from vault (operator Action B from #3634 not yet executed — automation token used instead, which already has Bot Management Write scope). - 2026-06-19T00:05Z — Live custom ruleset
17dc768ccadf4d02ae279e133b7b5bfdread: only 1 rule present (FreeScout skip). Three required skip rules were NOT live. Cross-stack TF migration (#2378) was never applied. - 2026-06-19T00:06Z —
CLOUDFLARE_RAXX_AUTOMATION_API_TOKENconfirmed to have Bot Management GET/PUT scope via direct API test;fight_mode=falseconfirmed. - 2026-06-19T00:08Z — Four WAF skip rules applied via PUT to
/zones/{zone}/rulesets/{ruleset}usingCF_WAF_EDIT_RAXX_APPtoken. All four rules returned in GET verification withbic=true. - 2026-06-19T00:09Z — BFM re-enabled via PUT
/zones/{zone}/bot_management {"fight_mode":true}usingCLOUDFLARE_RAXX_AUTOMATION_API_TOKEN. Response confirmedfight_mode=true. - 2026-06-19T00:10Z — GET
/bot_managementconfirmedfight_mode=true. - 2026-06-19T00:11Z — CI→vault golden path verified:
vault.raxx.app/api/v1/auth/universal-auth/loginwith CF-Access-Client-Id header returned HTTP 200 + accessToken. Vault read path (404 on non-existent key) confirmed auth token valid. - 2026-06-19T00:12Z — Public path spot-check: getraxx.com, www.getraxx.com, raxx.app/beta/walk/, raxx.app/beta/join/, api.raxx.app/health, tickets.raxx.app — all returned expected status codes with browser UA. No false positives.
- 2026-06-19T00:13Z — WAF runbook updated. RCA written.
- 2026-06-19T00:15Z — Resolved.
Impact
- Users affected: 0 (no customer-facing degradation during this session)
- User-visible symptoms: none
- Data integrity: ok
- Revenue / billing: ok
- Security posture during disable window: BFM off for ~8 days exposed the zone to unchallenged automated scanner traffic. CF Access, OWASP WAF, and rate limits remained active as compensating controls.
What went well
CF_WAF_EDIT_RAXX_APPandCLOUDFLARE_RAXX_AUTOMATION_API_TOKENwere both active and correctly scoped — no token rotation needed.- Vault auth (the path being protected) worked from the start, enabling the entire workflow to proceed without operator intervention.
CLOUDFLARE_RAXX_AUTOMATION_API_TOKENalready has Bot Management Write scope, so the missingCF_BOT_MGMT_RAXX_APPvault secret (operator Action B) was not a blocker for this session.- All four skip rules applied cleanly in a single PUT; no partial-apply required.
- CI→vault golden path verified within 2 minutes of BFM re-enable.
What didn't go well
- The cross-stack Terraform migration (#2378 Option C) was never executed, so the skip rules had to be applied via direct CF API call. This creates TF state drift:
terraform planonterraform/wafwill now show these rules as planned additions. Runningterraform applybefore reconciling state could produce unexpected behavior. CF_BOT_MGMT_RAXX_APPvault secret was not minted (operator Action B from #3634). This was not blocking because the automation token covers Bot Management, but the dedicated token would give cleaner scope separation.- The issue
#3634body listed the Terraform migration as a prerequisite before the CF API direct-write approach was viable. The direct CF API approach was correct and safe because the WAF token has write scope on the ruleset, but it was not documented as the fallback path.
Root cause analysis
- Contributing factor 1: BFM evaluated before CF Access — CF's processing order places WAF (including BFM) before CF Access service-token validation. AWS/Azure ASN egress scored as bot traffic, triggering BFM's JS challenge before the CF-Access-Client-Id header was examined. This is a known CF architecture constraint.
- Contributing factor 2: Skip rules not applied before BFM was enabled — When BFM was first enabled (prior to the Queue go-live), the WAF skip rules for CI traffic were not yet deployed. The root cause of the original outage was skipping step 2 of the re-enable path.
- Contributing factor 3: Terraform cross-stack migration blocked the documented path — The runbook required completing the TF state migration (#2378) before applying skip rules. The migration was blocked on token issues (#2328) and a CF provider bug (#2378). This left the skip rules unapplied for weeks.
- Contributing factor 4: CF_BOT_MGMT_RAXX_APP not minted — The issue asked the operator to mint this token before sre-agent could toggle BFM. In practice,
CLOUDFLARE_RAXX_AUTOMATION_API_TOKENalready had the required scope, so this action item was not actually blocking.
Detection
- What alerted us: Operator-reported CI→vault breakage; issue #3634 filed manually.
- How long between cause and detection: approximately same day as the Queue go-live.
- How to detect faster next time: A synthetic CI probe that periodically runs the vault auth golden path (vault.raxx.app/api/v1/auth/universal-auth/login with service-token headers) and alerts on non-200. This would catch BFM-caused regressions within minutes instead of being discovered via a failed workflow run.
Resolution
- Applied 3 new WAF skip rules (Priority 0.5 vault auth, Priority 0.6 Raptor internal jobs, Priority 1 generic CF-Access) plus preserved the existing Priority 0 FreeScout rule via PUT to CF Rulesets API.
- Re-enabled BFM via PUT to CF Bot Management API.
- Confirmed: GET
/bot_managementreturnsfight_mode=true; vault auth golden path returns HTTP 200 with BFM on.
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Mint CF_BOT_MGMT_RAXX_APP token (dedicated Bot Management scope) and store in vault at /MooseQuest/cloudflare/ — eliminates reliance on broad automation token for BFM toggles |
operator | 2026-06-26 | #3634 Action B |
| 2 | Complete cross-stack TF state migration (#2378 Option C): import new skip rules into terraform/waf state so TF plan shows zero drift; prevents next terraform apply from clobbering the live skip rules |
sre-agent (requires op token sign-off) | 2026-06-26 | #2378 |
| 3 | Add synthetic vault-auth probe to GH Actions (runs every 30m, alerts on non-200) to detect BFM false-positive within minutes | sre-agent | 2026-07-03 | new |
| 4 | Close issue #3634 with link to this RCA | sre-agent | 2026-06-19 | #3634 |
References
- Runbook:
docs/ops/runbooks/waf.md§Bot Fight Mode - Prior disable incident:
docs/ops/incidents/2026-06-18-bfm-disabled-window.md - Issue #3634: BFM restore card
- Issue #2328: CF token/credential refresh
- Issue #2378: CF provider cross-stack migration
- CF Rulesets API:
https://developers.cloudflare.com/api/operations/zone-rulesets-update-a-zone-ruleset - CF Bot Management API:
https://developers.cloudflare.com/api/operations/bot-management-for-a-zone-update-config