Synthetic Check Diagnosis Runbook

System: synthetic health gates, covering Raptor API (staging + prod cron) and WAF surface probes
Owner: operator / EyeTok on-call
Last reviewed: 2026-05-28 UTC
Last incident: 2026-05-28 (#3027 #3028; see failure mode 6 below, fixed in #3031)
Source code: scripts/ci/run_synthetic_remote.py, scripts/waf/probe.py
Workflows: .github/workflows/synthetic-gate.yml, .github/workflows/waf-synthetic-probe.yml
Refs: #1108 (this runbook), #323 (observability parent epic), #1739 (WAF probes SC-WAF-06), #2630, #2631, #3027, #3028, #3031

Alert anatomy

Synthetic Gate (Raptor checks)

A failing run posts to the #raxx-ops-alert-sev2 Slack channel with this shape:

:x: Synthetic gate failure
Failed checks: <check_name>[, <check_name>]
Target: https://api-staging.raxx.app
Run: https://github.com/raxx-app/TradeMasterAPI/actions/runs/<run_id>

Note: the target URL changed from the raw Heroku hostname to the CF-proxied hostname on 2026-05-26 (#2630 #2631). If you see raxx-api-staging-1a19fb3873b9.herokuapp.com in the target, the CF service-token secrets have not been wired yet (bootstrap window — see below).

A GitHub issue is also auto-filed with title [synthetic] failing check: <check_name> (deduplicated — one open issue per check name). Source: .github/scripts/synthetic_file_issue.py.
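
The deduplication rule can be sketched in a few lines. This is an illustration only: `issue_title` and `needs_new_issue` are hypothetical helper names, not the actual contents of .github/scripts/synthetic_file_issue.py. Only the title format is taken from this runbook.

```python
def issue_title(check_name: str) -> str:
    """Build the auto-filed issue title in the format documented above."""
    return f"[synthetic] failing check: {check_name}"


def needs_new_issue(check_name: str, open_issue_titles: list[str]) -> bool:
    """Return True only when no open issue already exists for this check name."""
    return issue_title(check_name) not in open_issue_titles
```

When triaging, the practical consequence is that a repeated failure of the same check adds no new issue, so look for the existing open issue before assuming nothing was filed.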

WAF Synthetic Probes

A failing run posts a Slack DM to the operator (channel D0AJ7K184TV) with this shape:

:x: WAF synthetic probe failure (SC-WAF-06)
Failed probes: <probe_name>[, ...]
(:warning: WAF would-block (false-positive risk): <probe_name>[, ...])
Run: https://github.com/raxx-app/TradeMasterAPI/actions/runs/<run_id>

Severity routing (EyeTok on-call)

Alert	Normal hours (08:00–18:30 UTC)	Off-hours
Any Raptor check failing	SEV2 → `#raxx-ops-alert-sev2`	SEV2 → `#raxx-ops-alert-sev2` (page if prod)
WAF probe HTTP 403 (true block in Phase 2+)	SEV1 → `#raxx-ops-alert-sev1`	SEV1 → `#raxx-ops-alert-sev1`
WAF probe content-mismatch or latency	SEV2.5 = SEV2 during business hours	SEV3 → agent autonomous
WAF would-block (Phase 1 log-only)	SEV3 → agent autonomous	SEV3 → agent autonomous

SEV2.5 = business-hours SEV2. Off-hours: monitor, do not page.
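
The routing table above can be read as a small lookup. The sketch below is not part of the alerting code; the alert keys are hypothetical labels for the four table rows, and treating 18:30 UTC as the first off-hours minute is an assumption.

```python
from datetime import datetime, time, timezone

NORMAL_START = time(8, 0)   # 08:00 UTC, inclusive
NORMAL_END = time(18, 30)   # 18:30 UTC, treated as exclusive (assumption)


def is_normal_hours(now: datetime) -> bool:
    """True when a timezone-aware datetime falls inside normal hours (UTC)."""
    current = now.astimezone(timezone.utc).time()
    return NORMAL_START <= current < NORMAL_END


def route(alert: str, now: datetime) -> str:
    """Map a hypothetical alert key to the severity from the routing table."""
    if alert == "raptor_check_failing":
        return "SEV2"
    if alert == "waf_403_true_block":
        return "SEV1"
    if alert == "waf_content_or_latency":
        # SEV2.5: SEV2 during business hours, SEV3 otherwise.
        return "SEV2" if is_normal_hours(now) else "SEV3"
    if alert == "waf_would_block":
        return "SEV3"
    raise ValueError(f"unknown alert key: {alert}")
```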

Check 1 — `health` (GET /health)

What it means: The Raptor API process is up and returning a healthy status object. This is the baseline liveness check. Failure here means nothing else will work.

Alert name: [synthetic] failing check: health

Severity: SEV2 (staging); SEV1 if prod is also confirmed down.

First check:

curl -s https://api-staging.raxx.app/health \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  -H "User-Agent: RaxxSyntheticGate/1.0" \
  | python3 -m json.tool

Expected: {"status": "ok", ...} with HTTP 200.

If you don't have the CF Access credentials, check Infisical at path /raxx/synthetic-gate/ for CF_ACCESS_CLIENT_ID and CF_ACCESS_CLIENT_SECRET. Alternatively, query the raw Heroku URL directly (only /health will pass there):

curl -s https://raxx-api-staging-1a19fb3873b9.herokuapp.com/health | python3 -m json.tool

Diagnosis tree:

1. HTTP connection refused / timeout (no response)
   - Heroku dyno is crashed or scaled to 0.
   - Check: heroku ps --app raxx-api-staging
   - If dynos are DOWN: heroku restart --app raxx-api-staging
   - If dynos are UP but refusing connections: run heroku logs --tail --app raxx-api-staging and look for Python tracebacks at startup.
   - Common cause: a bad migration or import error on boot. Rollback: heroku releases --app raxx-api-staging, then heroku rollback v<N> --app raxx-api-staging.
2. HTTP 429 Too Many Requests
   - Heroku API rate limit hit (9,000/hr cap; raised 2026-05-14). Likely transient.
   - Wait 60 seconds and re-run the check manually.
   - If sustained: check whether a CI batch or deploy loop is hammering the app. Kill the loop.
   - See: docs/ops/runbooks/heroku.md §Rate limits.
3. HTTP 503 Service Unavailable
   - Heroku platform issue or H13 (connection closed without response).
   - Check Heroku status: https://status.heroku.com
   - Check app logs: heroku logs --tail --app raxx-api-staging
   - If H13 appears in logs alongside R14 (memory quota exceeded): dyno memory pressure. Check heroku metrics --app raxx-api-staging.
   - If H14 appears in logs: no web dynos are running. Check heroku ps --app raxx-api-staging.
4. HTTP 200 but status != "ok"
   - App is up but a dependency is unhealthy (database, cache, etc.).
   - Check the full JSON body for a components or details field describing what is unhealthy.
   - If database: see Check 2 (historical_data) for database diagnosis.
5. HTTP 302 redirect to CF Access login
   - The probe is hitting a CF Access gate without a valid service token, OR the service token policy is decision=allow instead of decision=non_identity.
   - Check GH Actions secrets: CF_ACCESS_CLIENT_ID_SYNTHETIC_GATE and CF_ACCESS_CLIENT_SECRET_SYNTHETIC_GATE must be set.
   - Check Terraform: terraform/cf-access/synthetic_gate_service_token.tf. The policy decision must be non_identity. An allow decision requires IdP identity that service tokens don't carry, producing a 302 even with valid credentials.
   - See: docs/ops/runbooks/cf-access-service-token-provisioning.md
6. HTTP 403 on historical_data or backtest (but /health passes)
   - The probe is hitting the raw Heroku URL (raxx-api-staging-1a19fb3873b9.herokuapp.com) rather than the CF-proxied URL (api-staging.raxx.app). Raptor's cloudflare_origin_guard middleware blocks requests without CF-Connecting-IP (injected by Cloudflare when proxying). /health is allowlisted, so it passes and masks the issue. Error body: {'error': 'direct_origin_blocked'}.
   - Root cause A (bootstrap window): CF_ACCESS_CLIENT_ID_SYNTHETIC_GATE secret absent → synthetic-gate.yml falls back to raw Heroku URL.
   - Root cause B (caller override): The workflow that calls synthetic-gate.yml (e.g. deploy-heroku.yml) passes an explicit base_url pointing to the raw Heroku URL. When base_url is non-empty, the CF-proxied URL resolution inside synthetic-gate.yml is skipped entirely. This was the cause of #3027 + #3028 (2026-05-28). Fixed in #3031 by removing the base_url override from deploy-heroku.yml and forwarding the two CF Access secrets.
   - Diagnosis: Check the workflow run inputs. If base_url: https://raxx-api-staging-1a19fb3873b9.herokuapp.com appears in the job's "Inputs" section, root cause B is active.
   - Fix A: Complete Terraform apply (terraform/cf-access/synthetic_gate_service_token.tf) and set GH Actions secrets. See: docs/ops/incidents/2026-05-26-synthetic-gate-routed-via-cf.md.
   - Fix B: Ensure the calling workflow does NOT pass base_url and DOES forward CF_ACCESS_CLIENT_ID_SYNTHETIC_GATE + CF_ACCESS_CLIENT_SECRET_SYNTHETIC_GATE in its secrets: block.
   - Verify the workflow step summary shows CF Access mode: ON (not CF Access mode: OFF).
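
The target-resolution behaviour behind root causes A and B can be modelled as below. This is a sketch under stated assumptions, not the logic in synthetic-gate.yml: `resolve_target` is a hypothetical function, and the rule that CF Access mode is ON whenever both secrets are present is an assumption.

```python
CF_PROXIED_URL = "https://api-staging.raxx.app"
RAW_HEROKU_URL = "https://raxx-api-staging-1a19fb3873b9.herokuapp.com"


def resolve_target(base_url: str, cf_client_id: str, cf_client_secret: str) -> tuple[str, bool]:
    """Return (target_url, cf_access_mode_on) for a gate run."""
    has_token = bool(cf_client_id and cf_client_secret)
    if base_url:
        # Root cause B: an explicit override skips CF-proxied resolution.
        return base_url, has_token
    if has_token:
        return CF_PROXIED_URL, True
    # Root cause A: bootstrap window, secrets absent, raw Heroku fallback.
    return RAW_HEROKU_URL, False


def is_misrouted(target_url: str) -> bool:
    """True when the run targets the raw origin, where only /health passes."""
    return "herokuapp.com" in target_url
```

A run whose target satisfies `is_misrouted` is expected to pass health and fail the other checks with direct_origin_blocked, which matches the symptom described in failure mode 6.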

Common false positives:
- Heroku 429 during a deploy burst (rate limit transient; clears within 60s).
- Cold boot latency on a freshly-scaled dyno causing a 503 on the first probe hit before health check passes. The runner retries 3× with exponential backoff (5s, 10s, 20s) before marking fail.
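
The retry behaviour mentioned above can be sketched as follows. This is a minimal illustration, not the code in scripts/ci/run_synthetic_remote.py: `run_with_retries` is a hypothetical name, and reading "retries 3×" as one initial attempt followed by three retries is an assumption.

```python
import time
from typing import Callable

BACKOFF_SECONDS = (5, 10, 20)  # delays before retry 1, 2 and 3


def run_with_retries(
    check: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run check once, then retry with doubling delays until it passes."""
    if check():
        return True
    for delay in BACKOFF_SECONDS:
        sleep(delay)
        if check():
            return True
    # Every attempt failed, so the check is marked as failing.
    return False
```

Under these assumptions a check is only reported after roughly 35 seconds of consecutive failures plus request time, which is why a single cold-boot 503 should not normally raise an alert.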

Resolution path: docs/ops/runbooks/heroku.md

Escalation: Page the operator if the staging health endpoint is still 503 after 10 minutes. Escalate to SEV1 if prod /health is also failing.

Check 2 — `historical_data` (GET /api/historical-data/data)

What it means: The historical data API can fetch and return OHLCV data from the configured data source (Alpha Vantage or mock fallback). Failure here means the data ingestion pipeline is broken or the external data provider is unavailable.

Alert name: [synthetic] failing check: historical_data

Severity: SEV2.

First check:

curl -s "https://api-staging.raxx.app/api/historical-data/data?symbol=AAPL&source=alpha_vantage&timeframe=daily&start_date=2026-01-01&end_date=2026-01-10&adjusted=true" \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  -H "User-Agent: RaxxSyntheticGate/1.0" \
  | python3 -m json.tool

Expected: HTTP 200 with {"data": [...]} where data is a non-empty list.

Diagnosis tree:

1. HTTP 200 but data is an empty list
   - Alpha Vantage returned no data for the probe date range (possible: market holiday, symbol issue, rate-limit from AV).
   - Check: heroku logs --tail --app raxx-api-staging for Alpha Vantage API errors.
   - Alpha Vantage rate limit (free tier: 5/min, 500/day) is a common cause.
   - Resolution: If AV is rate-limited, the mock fallback should activate. Check whether USE_MOCK_DATA is configured.
   - Check: heroku config --app raxx-api-staging | grep -i mock
2. HTTP 500 Internal Server Error
   - Raptor exception in the historical data route. Check logs for Python traceback.
   - heroku logs --tail --app raxx-api-staging | grep -A5 "historical"
   - Common cause: SQLite lock contention (WAL mode is enabled; concurrent writes from a deploy may block reads briefly).
   - Also check: is the instance/ directory accessible? On Heroku the ephemeral filesystem resets on dyno restart.
3. HTTP 400 Bad Request
   - The probe parameters are malformed or the route validation changed.
   - Check the route schema in backend_v2/api/routes/historical_data.py against the probe params (symbol=AAPL, source=alpha_vantage, timeframe=daily, dates 2026-01-01 to 2026-01-10).
   - If a schema change was deployed without updating the probe, update scripts/ci/run_synthetic_remote.py.
4. HTTP 401 / 403
   - Auth middleware unexpectedly applied to this public endpoint.
   - Check route decorators in backend_v2/api/routes/historical_data.py.
5. Alpha Vantage outage (HTTP 200 with AV error JSON)
   - AV API may return 200 with {"Note": "Thank you for using Alpha Vantage..."} (rate limit) or {"Error Message": "..."}.
   - Check AV status: the response body from the log will contain the AV error message.
   - Resolution: Confirm mock fallback is activating. If AV is down for >1h, the probe will fail every run. This is a false positive from an external dependency outage, not a Raptor failure.
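
When reading a logged or captured response body, the cases above can be told apart mechanically. The sketch below is a hypothetical helper, not part of the probe; it assumes the body has already been parsed into a dict and that the AV keys "Note" and "Error Message" surface at the top level.

```python
def classify_historical_payload(body: dict) -> str:
    """Label an HTTP 200 body from the historical data path."""
    if "Note" in body:
        # Alpha Vantage rate-limit notice delivered with HTTP 200.
        return "av_rate_limited"
    if "Error Message" in body:
        return "av_error"
    data = body.get("data")
    if isinstance(data, list) and data:
        return "ok"
    # Missing key, wrong type, or an empty list all fail the check.
    return "empty_data"
```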

Common false positives:
- Alpha Vantage free-tier rate limit exhausted by CI (500 calls/day shared across all workflows). Clears after midnight UTC.
- AV API 503 during market-open burst, which usually self-resolves within minutes.

Resolution path: docs/ops/runbooks/heroku.md + Alpha Vantage API status page.

Escalation: Page the operator if data is empty for 3+ consecutive runs AND mock fallback is also not activating (indicates Raptor startup or filesystem issue).

Check 3 — `backtest` (POST /api/backtest/run)

What it means: The backtesting engine can accept a run request and begin processing. Accepts HTTP 200 (synchronous result) or 202 (async queued). Failure here means the strategy engine is unavailable or the request shape is rejected.

Alert name: [synthetic] failing check: backtest

Severity: SEV2.

First check:

curl -s -X POST https://api-staging.raxx.app/api/backtest/run \
  -H "Content-Type: application/json" \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
  -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  -H "User-Agent: RaxxSyntheticGate/1.0" \
  -d '{"symbol":"AAPL","start_date":"2026-01-01","end_date":"2026-01-10","strategy":"sma_crossover","initial_capital":10000}' \
  | python3 -m json.tool

Expected: HTTP 200 or 202 with a result or job ID body.

Diagnosis tree:

1. HTTP 400 Bad Request
   - The probe payload does not match the current route validation schema.
   - Check backend_v2/api/routes/backtest.py for required fields.
   - Common cause: a schema change deployed without updating the probe. Update scripts/ci/run_synthetic_remote.py probe payload if necessary.
   - Check for MarketDataValidationError in logs.
2. HTTP 500 Internal Server Error
   - Backtest engine threw an unhandled exception.
   - heroku logs --tail --app raxx-api-staging | grep -A10 "backtest"
   - Check for: missing historical data (the backtest engine may call the same Alpha Vantage path as Check 2; if AV is rate-limited, this will also fail).
   - Check for: arithmetic errors in the strategy (NaN/Inf in returns).
3. HTTP 503 or timeout (>30s)
   - Dyno memory pressure or CPU spike from a large backtest running concurrently.
   - Check: heroku metrics --app raxx-api-staging for memory and response time spikes.
   - The probe uses sma_crossover over 10 days, which should complete in <5s. If it times out, another thread or request is blocking the dyno.
4. HTTP 404
   - Route not registered. Check that the backtest blueprint is still registered in backend_v2/app.py.

Common false positives:
- Alpha Vantage rate limit (same upstream as Check 2). Backtest requires data fetch; if AV is unavailable and mock fallback does not cover the strategy, the backtest will 500.
- Ephemeral SQLite state reset on dyno restart. If the backtest writes intermediate state to SQLite, a fresh dyno won't have it.

Resolution path: docs/ops/runbooks/heroku.md

Escalation: Page the operator if HTTP 500 persists after restart and logs show no dependency issue (pure engine error).

Check 4 — `trading_runtime` (GET /api/system/health)

What it means: The trading runtime system health endpoint is reachable and returns HTTP 200. This is the deeper system health ping (vs. the bare /health liveness check) and confirms the runtime subsystem is operational.

Alert name: [synthetic] failing check: trading_runtime

Severity: SEV2 (staging); SEV1 if confirmed on prod and trading is affected.

First check:

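curl -s https://api-staging.raxx.app/api/system/health \
  -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
  -H "User-Agent: RaxxSyntheticGate/1.0" | python3 -m json.tool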

Expected: HTTP 200. Body may contain {"status": "ok", ...} or similar.

Diagnosis tree:

1. HTTP 200 but body indicates degraded state
   - Check the status field and any components breakdown in the body.
   - Degraded components: trading runtime thread may have exited or be unhealthy.
   - Check: heroku logs --tail --app raxx-api-staging | grep -i "trading_runtime\|runtime"
2. HTTP 500
   - System health route itself is throwing. Check logs for traceback in the system blueprint.
   - heroku logs --tail --app raxx-api-staging | grep -B2 -A10 "system/health"
3. HTTP 503 or connection refused
   - Same as Check 1: dyno is down. Diagnose via heroku ps.
4. HTTP 200 but payload_status is None (body is not JSON)
   - The endpoint returned a non-JSON body (possibly an HTML error page from a proxy).
   - Check whether CF Access or a WAF rule is intercepting this path and returning an HTML 200.
   - Verify the request reaches Raptor: check Heroku logs for the request hit.
5. Check 1 passes but Check 4 fails
   - The app process is up (/health ok) but the trading runtime thread or subsystem is in a failed state.
   - Check: is there a separate thread/process for trading runtime? Look at backend_v2/api/services/trading_runtime.py for thread lifecycle.
   - Resolution: heroku restart --app raxx-api-staging to force thread reinitialisation.
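
The "payload_status is None" case arises when the body cannot be parsed as JSON. A minimal sketch of that behaviour follows; the function is hypothetical and only mirrors the field name used in this runbook, not the probe's actual parsing code.

```python
import json
from typing import Optional


def payload_status(body_text: str) -> Optional[str]:
    """Return the JSON status field, or None when the body is not a JSON object."""
    try:
        payload = json.loads(body_text)
    except json.JSONDecodeError:
        # Typical for an HTML page served by a proxy with HTTP 200.
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("status")
```

A None result with HTTP 200 therefore points at something in front of Raptor answering the request, while a non-"ok" string points at the runtime itself.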

Common false positives:
- Race condition on fresh dyno boot: /health passes before the trading runtime thread fully initialises, causing /api/system/health to return a degraded status for 5–15 seconds. The probe retries 3×; if all three hit this window, the check fails as a false positive. Keep --timeout 30 (the current default), which should be long enough to cover the window.

Resolution path: docs/ops/runbooks/heroku.md

Escalation: Page the operator if trading runtime is degraded for >5 minutes and the trading_runtime service shows errors in logs. This indicates a potential issue with live/paper trading execution paths.

WAF Probes — surface group failures

The WAF synthetic probe workflow (waf-synthetic-probe.yml) exercises four surfaces: getraxx, raxx-app, support, docs. Each surface has 2–3 individual flows.

Severity: SEV3 for WAF would-block in Phase 1 (log mode). SEV1 for a true HTTP 403 block in Phase 2+.

Common probe failure patterns

Pattern W1 — HTTP 403 on any surface (WAF block)

What it means in Phase 1 (log mode): The WAF logged a would-block action. The request got through (200) but would be blocked in Phase 2. Treat as SEV3 — agent autonomous investigation.

What it means in Phase 2+ (challenge/block mode): The WAF is blocking the synthetic probe. Treat as SEV1 — customers may also be blocked.

Diagnosis:

1. Check CF WAF Events dashboard: raxx.app zone → Security → WAF → filter by last 30 min.
2. Identify the rule ID that fired: look for the RaxxProbe/1.0 UA in the event log.
3. The probe carries X-Raxx-Probe: sc-waf-06-synthetic. Confirm the WAF skip rule for this header is enabled.
   - CF dashboard → raxx.app zone → Security → WAF → Custom rules → look for a skip rule keyed on X-Raxx-Probe.
4. If the skip rule is missing or disabled: re-apply terraform/waf, where the skip rule is managed.
5. If the skip rule is present and enabled but the probe is still blocked: the WAF rule is firing before the skip evaluation. Check rule priority ordering.
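
To reproduce a probe request by hand while checking the skip rule, the identifying headers can be set as in the sketch below. The header values come from this runbook; `build_probe_request` is a hypothetical helper, and scripts/waf/probe.py may build its requests differently.

```python
import urllib.request

PROBE_HEADERS = {
    "User-Agent": "RaxxProbe/1.0",
    "X-Raxx-Probe": "sc-waf-06-synthetic",
}


def build_probe_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the probe headers."""
    return urllib.request.Request(url, headers=PROBE_HEADERS, method="GET")
```

Sending the same request with and without X-Raxx-Probe and comparing the resulting WAF events is one way to confirm whether the skip rule is being evaluated.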

Resolution path: docs/ops/runbooks/waf.md §Failure mode C, §Phase advancement.

Pattern W2 — Content mismatch (HTTP 200 but expected strings absent)

What it means: The surface returned HTTP 200 but the response body did not contain any of the expected marker strings. This can indicate a broken deploy, CDN caching an error page, or a full-page redirect replacing content.

Diagnosis by surface:

getraxx-landing: Expected strings: raxx, trading. If missing: check CF Pages deploy status for getraxx.com. Run: curl -v https://getraxx.com/ | grep -i "raxx\|trading".
getraxx-faq / getraxx-pricing: These may 404 pre-launch — HTTP 200 with content-mismatch means a page shell was served but expected content is absent. Check the CF Pages build output.
raxx-app-signin-page: Expected: sign, email, password. If React SPA shell loads but without hydration (no JS), content-mismatch fires. Check Antlers build.
raxx-app-dashboard-stub: Expected: raxx, html. Very broad — almost any HTML page will match. If this fails, the surface is likely returning a non-HTML error.
raxx-app-api-health: Expected: ok, status, healthy. Check GET https://raxx.app/health manually.
support-root: Expected: support, help, ticket. Check FreeScout is running: curl -s https://support.raxx.app/ | head -20. If FreeScout is down, see docs/ops/runbooks/freescout.md.
docs-root / docs-getting-started: Expected: docs, documentation, guide. Check CF Pages deploy for docs.raxx.app.
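
The content-mismatch rule described for Pattern W2 amounts to an any-of match on marker strings. The sketch below is an illustration with a hypothetical function name; case-insensitive matching is an assumption, not confirmed behaviour of scripts/waf/probe.py.

```python
def content_matches(body: str, expected: list[str]) -> bool:
    """True when at least one expected marker string appears in the body."""
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in expected)
```

This also shows why raxx-app-dashboard-stub is so permissive: with markers raxx and html, nearly any HTML document passes, so a failure there suggests a non-HTML response.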

Pattern W3 — Connection refused / timeout (no response)

What it means: The surface is completely unreachable. DNS issue, Heroku dyno down, or CF Pages build not deployed.

Diagnosis:
1. curl -v https://<surface-url>/ to check whether DNS resolves and TCP connects.
2. If DNS fails: check CF DNS for the zone. docs/ops/runbooks/cloudflare-tokens.md.
3. If TCP connects but hangs: check CF Pages or Heroku dyno status.
4. If Heroku: heroku ps --app <app-name> for the relevant app.

Pattern W4 — WAF false-positive log event (probe UA in CF block log)

What it means: CF Firewall Events shows a block action (in log mode) triggered by the probe's RaxxProbe/1.0 User-Agent. This is a WAF rule incorrectly matching legitimate traffic patterns.

Diagnosis:
1. Note the rule ID from waf_rule_id in the probe output JSON.
2. Cross-reference with terraform output from terraform/waf to identify the ruleset.
3. If it is an OWASP/managed rule: add a per-rule exception or lower the anomaly score threshold.
4. If it is a custom rule: review the rule expression in terraform/modules/cf-waf/main.tf.

Resolution path: docs/ops/runbooks/waf.md §Failure mode A (false positive).

How to silence a flapping check

Use silence sparingly. A flapping check indicates a real instability even if the root cause is external.

Synthetic Gate (Raptor checks):

The gate respects FLAG_SYNTHETIC_CI_GATE. When the flag is 0 (default), failures post Slack + file GitHub issues but do not block CI. When the flag is 1, failures block the deploy pipeline.
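
The effect of the flag on the pipeline can be summarised in one function. This is a sketch with a hypothetical name, assuming the gate signals blocking through a non-zero exit code; alerting (Slack post and GitHub issue) happens regardless of the flag and is not modelled here.

```python
def gate_exit_code(failed_checks: list[str], flag_value: str) -> int:
    """Return 1 (block the deploy) only when checks failed and the flag is "1"."""
    blocking = flag_value.strip() == "1"
    return 1 if failed_checks and blocking else 0
```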

To temporarily stop the gate from blocking:

heroku config:set FLAG_SYNTHETIC_CI_GATE=0 --app raxx-api-staging >/dev/null 2>&1

To re-enable:

heroku config:set FLAG_SYNTHETIC_CI_GATE=1 --app raxx-api-staging >/dev/null 2>&1

WAF Probes:

Disable a specific surface via workflow_dispatch with the surfaces input limited to surfaces that are healthy. Example: run only getraxx and raxx-app, skipping support during a FreeScout outage.

To skip the scheduled run entirely for one day: cancel the scheduled run from the GitHub Actions UI. This does not affect the next scheduled trigger.

How to run checks manually

Synthetic Gate (all four Raptor checks — CF-proxied, preferred):

export CF_ACCESS_CLIENT_ID='<from Infisical /raxx/synthetic-gate/ CF_ACCESS_CLIENT_ID>'
export CF_ACCESS_CLIENT_SECRET='<from Infisical /raxx/synthetic-gate/ CF_ACCESS_CLIENT_SECRET>'
python3 scripts/ci/run_synthetic_remote.py \
  --base-url https://api-staging.raxx.app \
  --timeout 30 \
  --output /tmp/synthetic-remote.json
python3 -m json.tool /tmp/synthetic-remote.json

Synthetic Gate (raw Heroku URL — only /health passes, for dyno liveness only):

python3 scripts/ci/run_synthetic_remote.py \
  --base-url https://raxx-api-staging-1a19fb3873b9.herokuapp.com \
  --timeout 30 \
  --output /tmp/synthetic-remote.json
python3 -m json.tool /tmp/synthetic-remote.json

WAF Probes (all surfaces):

python3 scripts/waf/probe.py --timeout 15

WAF Probes (single surface):

python3 scripts/waf/probe.py --surfaces getraxx --timeout 15

Manual workflow dispatch:

Synthetic Gate: GitHub Actions → Synthetic Gate → Run workflow → optionally override base URL.
WAF Probes: GitHub Actions → WAF Synthetic Probes → Run workflow → optionally specify surfaces.

Runbook maintenance

When a new synthetic check is added to scripts/ci/run_synthetic_remote.py or scripts/waf/probe.py, add a corresponding section to this runbook before merging. The check name in the section header must match exactly the name field in CheckResult or ProbeResult so alert-to-runbook matching works.
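
The header-matching requirement could be checked automatically before merging. The sketch below is a hypothetical lint, not an existing script; it assumes section headers follow the "Check N" pattern used in this runbook, with the check name in backticks after the dash.

```python
import re

# Matches headers such as: Check 1 (em dash) `health` (GET /health)
HEADER_RE = re.compile(r"^Check \d+ \u2014 `([^`]+)`", re.MULTILINE)


def documented_checks(runbook_text: str) -> set[str]:
    """Collect the check names that have a section header in the runbook."""
    return set(HEADER_RE.findall(runbook_text))


def missing_sections(check_names: list[str], runbook_text: str) -> list[str]:
    """List check names defined in code that lack a runbook section."""
    return sorted(set(check_names) - documented_checks(runbook_text))
```

Feeding it the name values emitted by the probe scripts and the text of this file would flag any check whose alert cannot be matched to a section.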

Cross-references

Heroku operations: docs/ops/runbooks/heroku.md
WAF detailed runbook: docs/ops/runbooks/waf.md
CF Access tokens: docs/ops/runbooks/cf-access-service-token-provisioning.md
FreeScout: docs/ops/runbooks/freescout.md
CI notification posture: docs/architecture/ci-notification-posture.md
Parent epic: #323 (observability + CI health gates)
WAF probes issue: #1739 (SC-WAF-06)
Synthetic gate flag: FLAG_SYNTHETIC_CI_GATE (see backend_v2/api/feature_flags.yaml)
