RCA — Synthetic gate 403 on authenticated routes due to raw-Heroku routing
- Incident ID: 2026-05-26-synthetic-gate-routed-via-cf
- Date: 2026-05-26
- Severity: SEV-3
- Duration: ~6 days (detection 2026-05-20 → remediation PR 2026-05-26)
- Blast radius: Internal CI only, no user-visible impact. 42 auto-filed GH issue comment events on #2630 + #2631 over 6 days.
- Author: sre-agent
Summary
The synthetic-gate workflow targeted the raw Heroku origin URL (raxx-api-staging-1a19fb3873b9.herokuapp.com) rather than the Cloudflare-proxied API hostname (api-staging.raxx.app). Raptor's cloudflare_origin_guard middleware requires the CF-Connecting-IP header, which Cloudflare injects only when proxying, so direct-to-Heroku requests never carry it. The /health endpoint is allowlisted by the guard and passed, masking the issue; historical_data and backtest returned 403 on every probe run. Over 6 days the auto-file step generated 42 comment events on two deduplicated GitHub issues (#2630 and #2631), all noise with no actionable signal.
Timeline (all times UTC)
- 2026-05-20 ~13:00: First scheduled synthetic-gate run after FLAG_ENFORCE_CF_ORIGIN activated on staging. historical_data + backtest begin returning 403. GH issue auto-filed for each check name (deduplicated per check, so #2630 + #2631 are the same two issues throughout).
- 2026-05-20 to 2026-05-25: Gate runs daily at 13:00. Each run: /health + trading_runtime pass, historical_data + backtest fail 403. Auto-file-issue step reopens / comments on #2630 + #2631 each day (42 comment events total).
- 2026-05-26 ~00:00: Root cause confirmed via curl: raw Heroku URL returns 403 on authenticated routes; api-staging.raxx.app (CF-proxied) returns correct responses when CF-Access-Client-Id + CF-Access-Client-Secret headers are present.
- 2026-05-26: sre-agent investigation + remediation PR opened. Files changed: terraform/cf-access/synthetic_gate_service_token.tf (new), terraform/cf-access/outputs.tf (new outputs), .github/workflows/synthetic-gate.yml (CF URL + secrets), scripts/ci/run_synthetic_remote.py (CF Access headers + UA), docs/ops/runbooks/synthetic-check-diagnosis.md (updated), this RCA.
- Post-merge + operator Terraform apply + GH Actions secret wiring: gate runs against CF URL, all 4 checks expected to pass.
Impact
- Users affected: none (CI-internal only)
- User-visible symptoms: none
- Data integrity: ok
- Revenue / billing: ok
- CI noise: 42 auto-filed GitHub issue comments on #2630 + #2631 over 6 days — no actionable signal.
What went well
- The FLAG_SYNTHETIC_CI_GATE default of 0 (warn, not block) meant the 403 failures did not block any deploys. The gate degraded gracefully.
- The issue deduplication in synthetic_file_issue.py contained the noise to two open issues (#2630, #2631) rather than 42 separate issues.
- The /health allowlist in cloudflare_origin_guard correctly kept the liveness check passing, so no false dyno-crash alerts were triggered.
What didn't go well
- The synthetic-gate workflow was written to target the raw Heroku URL from day one. The cloudflare_origin_guard middleware was added later (staged rollout) without a corresponding update to the probe target. There was no test or check that caught the mismatch.
- The probe's trading_runtime check also hits /health (not an authenticated route), so it passed. Two of four checks passing gave a false impression of partial health. The actually-broken checks (historical_data, backtest) are the most meaningful ones.
- The feedback_python_urllib_needs_ua_header memory item was not applied when the probe script was first written. The default Python-urllib/3.x UA trips CF error 1010. This was a latent bug that would have surfaced as soon as the probe hit the CF-proxied URL.
- No CF Access application or service token existed for api-staging.raxx.app / api.raxx.app. The entire "machine identity through CF Access" path was undocumented for this surface.
Root cause analysis
- Contributing factor 1: probe targeted raw Heroku URL. DEFAULT_STAGING_URL was hardcoded to raxx-api-staging-1a19fb3873b9.herokuapp.com. Direct-to-Heroku requests bypass Cloudflare entirely; CF-Connecting-IP is never injected. The cloudflare_origin_guard middleware (enabled via FLAG_ENFORCE_CF_ORIGIN) rejects all non-allowlisted paths when this header is absent.
- Contributing factor 2: no CF Access application existed for the API surface. The terraform/cf-access stack managed CF Access applications for console.raxx.app, vault.raxx.app, tickets.raxx.app, and moosequest.cloudflareaccess.com, but not for api.raxx.app or api-staging.raxx.app. No documented path existed for machine identities to authenticate through CF Access to the API.
- Contributing factor 3: /health allowlist masked the failure. Both health and trading_runtime checks probe /health, which is exempt from the origin guard. With 2 of 4 checks passing, the failure pattern looked like partial degradation rather than a systematic routing error.
- Contributing factor 4: default User-Agent would have tripped CF error 1010. Python's default Python-urllib/3.x UA triggers Cloudflare bot detection (error 1010) even with valid CF Access headers. This was a latent bug in the probe script that would have caused a secondary failure immediately upon routing to the CF URL.
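The interaction between factors 1 and 3 can be illustrated with a minimal sketch of an origin guard. This is not the code in cloudflare_origin_guard.py: the function name guard_status, the /api/historical_data path, and the example IP are assumptions made for illustration. Only the /health exemption, the CF-Connecting-IP requirement, and the 403 outcome come from this RCA.

```python
from typing import Mapping

# Hypothetical allowlist; this RCA only confirms that /health is exempt.
ALLOWLISTED_PATHS = frozenset({"/health"})


def guard_status(path: str, headers: Mapping[str, str], enforce: bool = True) -> int:
    """Return the HTTP status a guard of this shape would produce."""
    if not enforce or path in ALLOWLISTED_PATHS:
        return 200
    # HTTP header names are case-insensitive, so normalise before the lookup.
    names = {name.lower() for name in headers}
    return 200 if "cf-connecting-ip" in names else 403


# Direct-to-Heroku probe: Cloudflare never saw the request, so no CF header.
direct = {"User-Agent": "Python-urllib/3.x"}
# Proxied probe: Cloudflare added CF-Connecting-IP (documentation-range IP).
proxied = {"CF-Connecting-IP": "203.0.113.7"}

results = {
    "health_direct": guard_status("/health", direct),
    "historical_direct": guard_status("/api/historical_data", direct),
    "historical_proxied": guard_status("/api/historical_data", proxied),
}
print(results)
```

Under these assumptions the direct /health probe passes while the direct authenticated-route probe gets 403, which matches the 2-of-4 pattern observed in the gate.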
Detection
- What alerted us: GitHub issues #2630 and #2631 auto-filed by synthetic_file_issue.py, confirmed by operator curl test on 2026-05-26.
- Time between cause and detection: ~6 days (detection was triggered by the noise volume, not an active alert).
- How to detect faster next time: A SEV-3 trigger on "same GH issue commented on 3+ consecutive days by synthetic gate" would surface stale-failure loops within 72 hours. Also: the "CF Access mode: OFF" warning added to the probe in this PR will appear in the Actions step summary, giving an explicit signal.
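The "3+ consecutive days" trigger proposed above could be sketched as follows. This is an illustrative assumption, not existing tooling: the function names are hypothetical, and the input is assumed to be the calendar dates (UTC) on which the synthetic gate commented on a single issue, however those are fetched.

```python
from datetime import date, timedelta
from typing import Iterable


def longest_consecutive_run(days: Iterable[date]) -> int:
    """Length of the longest streak of back-to-back calendar days."""
    best = current = 0
    previous = None
    for day in sorted(set(days)):  # dedupe: several comments on one day count once
        if previous is not None and day - previous == timedelta(days=1):
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = day
    return best


def is_stale_failure_loop(comment_days: Iterable[date], threshold: int = 3) -> bool:
    """True when the same issue was commented on `threshold`+ days in a row."""
    return longest_consecutive_run(comment_days) >= threshold


# The failure window from this incident: daily comments 2026-05-20 to 2026-05-25.
incident_days = [date(2026, 5, 20) + timedelta(days=i) for i in range(6)]
print(is_stale_failure_loop(incident_days))  # True
```

With a threshold of 3, a check of this shape would have fired on 2026-05-22, the third consecutive day of comments.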
Resolution
- What was changed:
  1. terraform/cf-access/synthetic_gate_service_token.tf: new Terraform file creating CF Access applications for api-staging.raxx.app and api.raxx.app, a raxx-synthetic-gate service token, and non_identity policies on both applications.
  2. terraform/cf-access/outputs.tf: added outputs for the synthetic-gate service token client ID, client secret (sensitive), expiry, and application IDs.
  3. .github/workflows/synthetic-gate.yml: changed DEFAULT_STAGING_URL to https://api-staging.raxx.app; added DEFAULT_PROD_URL and FALLBACK_STAGING_URL; added CF_ACCESS_CLIENT_ID_SYNTHETIC_GATE + CF_ACCESS_CLIENT_SECRET_SYNTHETIC_GATE secrets in workflow_call; added URL-resolution logic to use the CF URL when secrets are present and fall back to raw Heroku when absent; injected CF_ACCESS_CLIENT_ID + CF_ACCESS_CLIENT_SECRET env vars into the probe step; added env choice input for workflow_dispatch (staging/prod).
  4. scripts/ci/run_synthetic_remote.py: added CF_ACCESS_CLIENT_ID + CF_ACCESS_CLIENT_SECRET env var reading; added a _build_headers() helper that always sets User-Agent: RaxxSyntheticGate/1.0 and conditionally injects CF Access headers; applied it to both _get and _post; logs "CF Access mode: ON/OFF" at startup.
  5. docs/ops/runbooks/synthetic-check-diagnosis.md: updated all curl commands to use CF-proxied URL + CF Access headers; added failure mode 6 (HTTP 403 on authenticated routes); updated "How to run checks manually" section.
- Operator action required before gate is fully fixed:
  1. Run terraform apply in terraform/cf-access/. This creates the CF Access apps, service token, and non_identity policies.
  2. Capture token: terraform output -raw synthetic_gate_service_token_client_id + terraform output -raw synthetic_gate_service_token_client_secret.
  3. Create Infisical folder /raxx/synthetic-gate/ (POST /api/v1/folders), then write the CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET secrets.
  4. Set GH Actions secrets: gh secret set CF_ACCESS_CLIENT_ID_SYNTHETIC_GATE + gh secret set CF_ACCESS_CLIENT_SECRET_SYNTHETIC_GATE.
  5. Run workflow_dispatch on the Synthetic Gate workflow, confirm all 4 checks PASS on staging, then confirm prod.
  6. After 2 consecutive green scheduled runs, post resolution comments on #2630 + #2631 quoting the run URLs, then close both with reason: completed.
- Validation: the fix is confirmed when all 4 checks (health, historical_data, backtest, trading_runtime) PASS on both staging and prod, and the Actions step summary shows CF Access mode: ON and target https://api-staging.raxx.app / https://api.raxx.app.
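The header and URL-resolution behaviour described in items 3 and 4 might look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of run_synthetic_remote.py or the workflow: build_headers and resolve_target are hypothetical names, the https:// scheme on the raw Heroku host is assumed, and in the actual change the URL selection lives in the workflow rather than in Python. The CF-Access-Client-Id / CF-Access-Client-Secret header names and the User-Agent value come from this RCA.

```python
import os
from typing import Dict, Mapping, Optional, Tuple

CF_STAGING_URL = "https://api-staging.raxx.app"
# Assumed scheme; the RCA gives only the hostname of the raw origin.
RAW_STAGING_URL = "https://raxx-api-staging-1a19fb3873b9.herokuapp.com"
USER_AGENT = "RaxxSyntheticGate/1.0"


def build_headers(client_id: Optional[str], client_secret: Optional[str]) -> Dict[str, str]:
    """Always set an explicit User-Agent; add CF Access headers only as a pair."""
    headers = {"User-Agent": USER_AGENT}
    if client_id and client_secret:
        headers["CF-Access-Client-Id"] = client_id
        headers["CF-Access-Client-Secret"] = client_secret
    return headers


def resolve_target(env: Mapping[str, str]) -> Tuple[str, Dict[str, str], bool]:
    """Pick the CF-proxied URL when both secrets are present, else fall back."""
    client_id = env.get("CF_ACCESS_CLIENT_ID")
    client_secret = env.get("CF_ACCESS_CLIENT_SECRET")
    cf_mode = bool(client_id and client_secret)
    url = CF_STAGING_URL if cf_mode else RAW_STAGING_URL
    return url, build_headers(client_id, client_secret), cf_mode


url, headers, cf_mode = resolve_target(os.environ)
print(f"CF Access mode: {'ON' if cf_mode else 'OFF'} -> {url}")
```

Treating a half-configured pair (only one of the two secrets set) as OFF keeps the probe from sending a partial credential, and the explicit mode line makes the fallback visible in the step summary.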
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Apply terraform/cf-access/ to create CF Access apps + service token | operator | 2026-05-27 | #2630 |
| 2 | Write service-token credentials to Infisical /raxx/synthetic-gate/ | operator | 2026-05-27 | #2630 |
| 3 | Set GH Actions secrets CF_ACCESS_CLIENT_ID_SYNTHETIC_GATE + CF_ACCESS_CLIENT_SECRET_SYNTHETIC_GATE | operator | 2026-05-27 | #2630 |
| 4 | Run workflow_dispatch, confirm all 4 checks PASS on staging + prod | operator | 2026-05-27 | #2630 |
| 5 | After 2 consecutive green scheduled runs, close #2630 + #2631 with resolution comments | sre-agent | 2026-05-29 | #2630 #2631 |
| 6 | Add "CF Access mode" check to the pre-deploy gate checklist (detect bootstrap-window regression) | sre-agent | 2026-06-02 | new |
| 7 | File calendar reminder 60 days before synthetic_gate_service_token_expires_at for token rotation | operator | after Terraform apply | new |
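For action item 7, the reminder date is the token expiry minus 60 days. The sketch below assumes the synthetic_gate_service_token_expires_at output is an ISO 8601 timestamp; the expiry value used here is a made-up placeholder, not the real token expiry, and rotation_reminder is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def rotation_reminder(expires_at: str, lead_days: int = 60) -> datetime:
    """Return the moment `lead_days` before an ISO 8601 expiry timestamp."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        # Assume UTC when the timestamp carries no offset.
        expiry = expiry.replace(tzinfo=timezone.utc)
    return expiry - timedelta(days=lead_days)


# Placeholder expiry for illustration only.
reminder = rotation_reminder("2027-05-26T00:00:00Z")
print(reminder.date().isoformat())  # 2027-03-27
```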
References
- Runbook: docs/ops/runbooks/synthetic-check-diagnosis.md
- Terraform: terraform/cf-access/synthetic_gate_service_token.tf
- Workflow: .github/workflows/synthetic-gate.yml
- Probe script: scripts/ci/run_synthetic_remote.py
- Middleware: backend_v2/api/middleware/cloudflare_origin_guard.py
- Memory: feedback_cf_access_service_token_needs_non_identity.md
- Memory: feedback_python_urllib_needs_ua_header.md
- Memory: feedback_cf_access_does_not_bypass_bot_fight_mode.md
- Memory: feedback_vault_folder_must_exist.md
- Related issues: #2630, #2631