Synthetic Check Diagnosis Runbook
System: synthetic health gates — Raptor API (staging + prod cron) + WAF surface probes
Owner: operator / EyeTok on-call
Last reviewed: 2026-05-17 UTC
Source code: scripts/ci/run_synthetic_remote.py, scripts/waf/probe.py
Workflows: .github/workflows/synthetic-gate.yml, .github/workflows/waf-synthetic-probe.yml
Refs: #1108 (this runbook), #323 (observability parent epic), #1739 (WAF probes SC-WAF-06)
Alert anatomy
Synthetic Gate (Raptor checks)
A failing run posts to the #raxx-ops-alert-sev2 Slack channel with this shape:
:x: Synthetic gate failure
Failed checks: <check_name>[, <check_name>]
Target: https://raxx-api-staging-1a19fb3873b9.herokuapp.com
Run: https://github.com/raxx-app/TradeMasterAPI/actions/runs/<run_id>
A GitHub issue is also auto-filed with title [synthetic] failing check: <check_name> (deduplicated — one open issue per check name). Source: .github/scripts/synthetic_file_issue.py.
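The one-open-issue-per-check rule can be sketched as below. This is an illustration only, not the logic in .github/scripts/synthetic_file_issue.py; the function name and the idea of passing in the titles of already-open issues are assumptions.

```python
TITLE_PREFIX = "[synthetic] failing check: "


def titles_to_file(failed_checks, open_issue_titles):
    """Return issue titles for failing checks that have no open issue yet."""
    open_titles = set(open_issue_titles)
    new_titles = []
    for name in failed_checks:
        title = TITLE_PREFIX + name
        # Skip checks that already have an open issue or were queued in this run.
        if title not in open_titles and title not in new_titles:
            new_titles.append(title)
    return new_titles
```

Called with failed checks health and backtest while an issue for health is already open, the sketch returns only the backtest title.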
WAF Synthetic Probes
A failing run posts a Slack DM to the operator (channel D0AJ7K184TV) with this shape:
:x: WAF synthetic probe failure (SC-WAF-06)
Failed probes: <probe_name>[, ...]
(:warning: WAF would-block (false-positive risk): <probe_name>[, ...])
Run: https://github.com/raxx-app/TradeMasterAPI/actions/runs/<run_id>
Severity routing (EyeTok on-call)
| Alert | Normal hours (08:00–18:30 UTC) | Off-hours |
|---|---|---|
| Any Raptor check failing | SEV2 → #raxx-ops-alert-sev2 | SEV2 → #raxx-ops-alert-sev2 (page if prod) |
| WAF probe HTTP 403 (true block in Phase 2+) | SEV1 → #raxx-ops-alert-sev1 | SEV1 → #raxx-ops-alert-sev1 |
| WAF probe content-mismatch or latency | SEV2.5 = SEV2 during business hours | SEV3 → agent autonomous |
| WAF would-block (Phase 1 log-only) | SEV3 → agent autonomous | SEV3 → agent autonomous |
SEV2.5 = business-hours SEV2. Off-hours: monitor, do not page.
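The routing table can be read as a small lookup. The sketch below is hypothetical: the alert kind strings and function names are invented for illustration, and it assumes the 08:00–18:30 UTC window applies every day of the week, which the table does not state.

```python
from datetime import datetime, time, timezone

BUSINESS_START = time(8, 0)   # 08:00 UTC
BUSINESS_END = time(18, 30)   # 18:30 UTC


def is_normal_hours(when_utc):
    """True when the UTC timestamp falls inside 08:00-18:30."""
    return BUSINESS_START <= when_utc.time() <= BUSINESS_END


def route(alert_kind, when_utc):
    """Map a (hypothetical) alert kind to the severity in the routing table."""
    if alert_kind == "raptor_check":
        return "SEV2"
    if alert_kind == "waf_403":
        return "SEV1"
    if alert_kind == "waf_content_or_latency":
        # SEV2.5: SEV2 in normal hours, SEV3 (no page) off-hours.
        return "SEV2" if is_normal_hours(when_utc) else "SEV3"
    if alert_kind == "waf_would_block":
        return "SEV3"
    raise ValueError(f"unknown alert kind: {alert_kind}")
```

For example, route("waf_content_or_latency", datetime(2026, 5, 18, 22, 0, tzinfo=timezone.utc)) falls outside the window and yields SEV3.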
Check 1 — health (GET /health)
What it means: The Raptor API process is up and returning a healthy status object. This is the baseline liveness check. Failure here means nothing else will work.
Alert name: [synthetic] failing check: health
Severity: SEV2 (staging); SEV1 if prod is also confirmed down.
First check:
curl -s https://raxx-api-staging-1a19fb3873b9.herokuapp.com/health | python3 -m json.tool
Expected: {"status": "ok", ...} with HTTP 200.
Diagnosis tree:
- HTTP connection refused / timeout (no response)
  - Heroku dyno is crashed or scaled to 0.
  - Check: heroku ps --app raxx-api-staging
  - If dynos are DOWN: heroku restart --app raxx-api-staging
  - If dynos are UP but refusing connections: run heroku logs --tail --app raxx-api-staging and look for Python tracebacks at startup.
  - Common cause: a bad migration or import error on boot. Rollback: heroku releases --app raxx-api-staging, then heroku rollback v<N> --app raxx-api-staging.
- HTTP 429 Too Many Requests
  - Heroku API rate limit hit (9,000/hr cap; raised 2026-05-14). Likely transient.
  - Wait 60 seconds and re-run the check manually.
  - If sustained: check whether a CI batch or deploy loop is hammering the app. Kill the loop.
  - See: docs/ops/runbooks/heroku.md §Rate limits.
- HTTP 503 Service Unavailable
  - Heroku platform issue or H13 (connection closed without response).
  - Check Heroku status: https://status.heroku.com
  - Check app logs: heroku logs --tail --app raxx-api-staging
  - If H13 or H14 is in the logs: the web process is dropping connections (H13) or no web dyno is running (H14); confirm with heroku ps --app raxx-api-staging.
  - If R14 (memory quota exceeded) is in the logs: dyno memory pressure. Check heroku metrics --app raxx-api-staging.
- HTTP 200 but status != "ok"
  - App is up but a dependency is unhealthy (database, cache, etc.).
  - Check the full JSON body for a components or details field describing what is unhealthy.
  - If database: see Check 2 (historical_data) for database diagnosis.
- HTTP 302 redirect to CF Access login
  - The probe is hitting a CF Access gate without a service token.
  - The synthetic checks do not send CF-Access credentials by default. Confirm the target URL is the Heroku origin (not behind CF Access).
  - If running against the public api.raxx.app URL: ensure the probe carries CF-Access-Client-Id + CF-Access-Client-Secret headers. See: docs/ops/runbooks/cf-access-service-token-provisioning.md.
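The branches of this tree can be summarised as a small classifier. This is a sketch for triage notes, not the logic in scripts/ci/run_synthetic_remote.py; the function name and the returned hint strings are assumptions, and status_code=None is used here to stand for "no response".

```python
import json


def classify_health(status_code, body_text):
    """Map a /health response to the matching diagnosis branch."""
    if status_code is None:
        return "no response: check dyno state with heroku ps"
    if status_code == 429:
        return "rate limited: wait 60 seconds and re-run"
    if status_code == 503:
        return "service unavailable: check Heroku status and app logs"
    if status_code == 302:
        return "redirect: probe is probably hitting a CF Access gate"
    if status_code == 200:
        try:
            status = json.loads(body_text).get("status")
        except (TypeError, ValueError, AttributeError):
            # Body missing, not JSON, or JSON that is not an object.
            return "unexpected: body is not a JSON object"
        if status == "ok":
            return "healthy"
        return "degraded: inspect components or details in the body"
    return f"unexpected: HTTP {status_code}"
```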
Common false positives:
- Heroku 429 during a deploy burst (rate limit transient; clears within 60s).
- Cold boot latency on a freshly-scaled dyno causing a 503 on the first probe hit before the health check passes. The runner retries 3× with exponential backoff (5s, 10s, 20s) before marking fail.
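The retry behaviour described above can be sketched as follows. This is a simplified illustration, not the runner's code: it assumes "3×" means one initial attempt plus up to three retries, with a sleep of 5, 10, then 20 seconds before each retry. The function name and the injectable sleep parameter are hypothetical.

```python
import time


def run_with_retries(check, delays=(5, 10, 20), sleep=time.sleep):
    """Run check() once, then retry after each delay until it passes."""
    if check():
        return True
    for delay in delays:
        # Doubling delays: 5s, 10s, 20s.
        sleep(delay)
        if check():
            return True
    return False
```

Under these assumptions a check is only marked failed after about 35 seconds of waiting, which is why a single cold-boot 503 should not raise an alert.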
Resolution path: docs/ops/runbooks/heroku.md
Escalation: Page the operator if the staging health endpoint is still 503 after 10 minutes. Escalate to SEV1 if prod /health is also failing.
Check 2 — historical_data (GET /api/historical-data/data)
What it means: The historical data API can fetch and return OHLCV data from the configured data source (Alpha Vantage or mock fallback). Failure here means the data ingestion pipeline is broken or the external data provider is unavailable.
Alert name: [synthetic] failing check: historical_data
Severity: SEV2.
First check:
curl -s "https://raxx-api-staging-1a19fb3873b9.herokuapp.com/api/historical-data/data?symbol=AAPL&source=alpha_vantage&timeframe=daily&start_date=2026-01-01&end_date=2026-01-10&adjusted=true" | python3 -m json.tool
Expected: HTTP 200 with {"data": [...]} where data is a non-empty list.
Diagnosis tree:
- HTTP 200 but data is an empty list
  - Alpha Vantage returned no data for the probe date range (possible causes: market holiday, symbol issue, rate limit from AV).
  - Check: heroku logs --tail --app raxx-api-staging for Alpha Vantage API errors.
  - Alpha Vantage rate limit (free tier: 5/min, 500/day) is a common cause.
  - Resolution: If AV is rate-limited, the mock fallback should activate. Check whether USE_MOCK_DATA is configured.
  - Check: heroku config --app raxx-api-staging | grep -i mock
- HTTP 500 Internal Server Error
  - Raptor exception in the historical data route. Check logs for a Python traceback.
  - heroku logs --tail --app raxx-api-staging | grep -A5 "historical"
  - Common cause: SQLite lock contention (WAL mode is enabled; concurrent writes from a deploy may block reads briefly).
  - Also check: is the instance/ directory accessible? On Heroku the ephemeral filesystem resets on dyno restart.
- HTTP 400 Bad Request
  - The probe parameters are malformed or the route validation changed.
  - Check the route schema in backend_v2/api/routes/historical_data.py against the probe params (symbol=AAPL, source=alpha_vantage, timeframe=daily, dates 2026-01-01 to 2026-01-10).
  - If a schema change was deployed without updating the probe, update scripts/ci/run_synthetic_remote.py.
- HTTP 401 / 403
  - Auth middleware unexpectedly applied to this public endpoint.
  - Check route decorators in backend_v2/api/routes/historical_data.py.
- Alpha Vantage outage (HTTP 200 with AV error JSON)
  - AV API may return 200 with {"Note": "Thank you for using Alpha Vantage..."} (rate limit) or {"Error Message": "..."}.
  - Check AV status: the response body from the log will contain the AV error message.
  - Resolution: Confirm mock fallback is activating. If AV is down for >1h, the probe will fail every run. This is a false positive from an external dependency outage, not a Raptor failure.
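To separate an Alpha Vantage problem from a Raptor problem quickly, the response can be classified as sketched below. The function name and the returned labels are hypothetical; the "Note" and "Error Message" keys and the "data" list come from the tree above. The sketch also assumes the AV error JSON reaches the caller unchanged at the top level of the body, which may not hold if Raptor wraps it.

```python
def classify_historical(status_code, payload):
    """Classify a historical-data response (payload is the parsed JSON body)."""
    if status_code != 200:
        return f"http_{status_code}"
    if not isinstance(payload, dict):
        return "non_json_body"
    # Alpha Vantage can signal rate limits or errors inside an HTTP 200.
    if "Note" in payload or "Error Message" in payload:
        return "alpha_vantage_error"
    data = payload.get("data")
    if isinstance(data, list) and data:
        return "ok"
    return "empty_data"
```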
Common false positives:
- Alpha Vantage free-tier rate limit exhausted by CI (500 calls/day shared across all workflows). Clears after midnight UTC.
- AV API 503 during market-open burst; usually self-resolves within minutes.
Resolution path: docs/ops/runbooks/heroku.md + Alpha Vantage API status page.
Escalation: Page the operator if data is empty for 3+ consecutive runs AND mock fallback is also not activating (indicates Raptor startup or filesystem issue).
Check 3 — backtest (POST /api/backtest/run)
What it means: The backtesting engine can accept a run request and begin processing. Accepts HTTP 200 (synchronous result) or 202 (async queued). Failure here means the strategy engine is unavailable or the request shape is rejected.
Alert name: [synthetic] failing check: backtest
Severity: SEV2.
First check:
curl -s -X POST https://raxx-api-staging-1a19fb3873b9.herokuapp.com/api/backtest/run \
-H "Content-Type: application/json" \
-d '{"symbol":"AAPL","start_date":"2026-01-01","end_date":"2026-01-10","strategy":"sma_crossover","initial_capital":10000}' \
| python3 -m json.tool
Expected: HTTP 200 or 202 with a result or job ID body.
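For scripted triage, the same request can be built in Python. This is a sketch using only the standard library; the payload mirrors the curl command above, while the function names are hypothetical and the real probe in scripts/ci/run_synthetic_remote.py may differ.

```python
import json
import urllib.request

PROBE_PAYLOAD = {
    "symbol": "AAPL",
    "start_date": "2026-01-01",
    "end_date": "2026-01-10",
    "strategy": "sma_crossover",
    "initial_capital": 10000,
}


def build_backtest_request(base_url):
    """Build (but do not send) the POST request for the backtest probe."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/backtest/run",
        data=json.dumps(PROBE_PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def backtest_accepted(status_code):
    """The check accepts a synchronous 200 or an async-queued 202."""
    return status_code in (200, 202)
```

Passing the request to urllib.request.urlopen with timeout=30 would send it; note that urlopen raises urllib.error.HTTPError for 4xx and 5xx responses rather than returning them.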
Diagnosis tree:
- HTTP 400 Bad Request
  - The probe payload does not match the current route validation schema.
  - Check backend_v2/api/routes/backtest.py for required fields.
  - Common cause: a schema change deployed without updating the probe. Update the probe payload in scripts/ci/run_synthetic_remote.py if necessary.
  - Check for MarketDataValidationError in logs.
- HTTP 500 Internal Server Error
  - Backtest engine threw an unhandled exception.
  - heroku logs --tail --app raxx-api-staging | grep -A10 "backtest"
  - Check for: missing historical data (the backtest engine may call the same Alpha Vantage path as Check 2; if AV is rate-limited, this will also fail).
  - Check for: arithmetic errors in the strategy (NaN/Inf in returns).
- HTTP 503 or timeout (>30s)
  - Dyno memory pressure or CPU spike from a large backtest running concurrently.
  - Check: heroku metrics --app raxx-api-staging for memory and response time spikes.
  - The probe uses sma_crossover over 10 days, which should complete in <5s. If it times out, another thread or request is blocking the dyno.
- HTTP 404
  - Route not registered. Check that the backtest blueprint is still registered in backend_v2/app.py.
Common false positives:
- Alpha Vantage rate limit (same upstream as Check 2). The backtest requires a data fetch; if AV is unavailable and mock fallback does not cover the strategy, the backtest will 500.
- Ephemeral SQLite state reset on dyno restart. If the backtest writes intermediate state to SQLite, a fresh dyno won't have it.
Resolution path: docs/ops/runbooks/heroku.md
Escalation: Page the operator if HTTP 500 persists after restart and logs show no dependency issue (pure engine error).
Check 4 — trading_runtime (GET /api/system/health)
What it means: The trading runtime system health endpoint is reachable and returns HTTP 200. This is the deeper system health ping (vs. the bare /health liveness check) and confirms the runtime subsystem is operational.
Alert name: [synthetic] failing check: trading_runtime
Severity: SEV2 (staging); SEV1 if confirmed on prod and trading is affected.
First check:
curl -s https://raxx-api-staging-1a19fb3873b9.herokuapp.com/api/system/health | python3 -m json.tool
Expected: HTTP 200. Body may contain {"status": "ok", ...} or similar.
Diagnosis tree:
- HTTP 200 but body indicates degraded state
  - Check the status field and any components breakdown in the body.
  - Degraded components: trading runtime thread may have exited or be unhealthy.
  - Check: heroku logs --tail --app raxx-api-staging | grep -i "trading_runtime\|runtime"
- HTTP 500
  - System health route itself is throwing. Check logs for a traceback in the system blueprint.
  - heroku logs --tail --app raxx-api-staging | grep -B2 -A10 "system/health"
- HTTP 503 or connection refused
  - Same as Check 1: dyno is down. Diagnose via heroku ps.
- HTTP 200 but payload_status is None (body is not JSON)
  - The endpoint returned a non-JSON body (possibly an HTML error page from a proxy).
  - Check whether CF Access or a WAF rule is intercepting this path and returning an HTML 200.
  - Verify the request reaches Raptor: check Heroku logs for the request hit.
- Check 1 passes but Check 4 fails
  - The app process is up (/health ok) but the trading runtime thread or subsystem is in a failed state.
  - Check: is there a separate thread/process for trading runtime? Look at backend_v2/api/services/trading_runtime.py for thread lifecycle.
  - Resolution: heroku restart --app raxx-api-staging to force thread reinitialisation.
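Two of these branches lend themselves to a short sketch: detecting a non-JSON body, and comparing Check 1 against Check 4. The function names and hint strings below are hypothetical; only the payload_status-is-None convention comes from the tree above, and the runner's actual parsing may differ.

```python
import json


def payload_status(body_text):
    """Return the JSON "status" field, or None when the body is not a JSON object."""
    try:
        payload = json.loads(body_text)
    except (TypeError, ValueError):
        # HTML error pages and empty bodies end up here.
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("status")


def compare_checks(health_passed, runtime_passed):
    """Interpret the Check 1 and Check 4 results together."""
    if not health_passed:
        return "process down: diagnose as Check 1"
    if not runtime_passed:
        return "process up, trading runtime failed: inspect thread lifecycle"
    return "both passing"
```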
Common false positives:
- Race condition on fresh dyno boot: /health passes before the trading runtime thread fully initialises, causing /api/system/health to return a degraded status for 5–15 seconds. The probe retries 3× — if all three hit this window, it may false-positive. Set --timeout 30 (current default) and the window should be sufficient.
Resolution path: docs/ops/runbooks/heroku.md
Escalation: Page the operator if trading runtime is degraded for >5 minutes and the trading_runtime service shows errors in logs. This indicates a potential issue with live/paper trading execution paths.
WAF Probes — surface group failures
The WAF synthetic probe workflow (waf-synthetic-probe.yml) exercises four surfaces:
getraxx, raxx-app, support, docs. Each surface has 2–3 individual flows.
Severity: SEV3 for WAF would-block in Phase 1 (log mode). SEV1 for a true HTTP 403 block in Phase 2+.
Common probe failure patterns
Pattern W1 — HTTP 403 on any surface (WAF block)
What it means in Phase 1 (log mode): The WAF logged a would-block action. The request got through (200) but would be blocked in Phase 2. Treat as SEV3 — agent autonomous investigation.
What it means in Phase 2+ (challenge/block mode): The WAF is blocking the synthetic probe. Treat as SEV1 — customers may also be blocked.
Diagnosis:
- Check CF WAF Events dashboard: raxx.app zone → Security → WAF → filter by last 30 min.
- Identify the rule ID that fired: look for RaxxProbe/1.0 UA in event log.
- The probe carries X-Raxx-Probe: sc-waf-06-synthetic. Confirm the WAF skip rule for this header is enabled.
  - CF dashboard → raxx.app zone → Security → WAF → Custom rules → look for a skip rule keyed on X-Raxx-Probe.
- If the skip rule is missing or disabled: re-apply terraform/waf (the skip rule is managed there).
- If the skip rule is present and enabled but the probe is still blocked: the WAF rule is firing before the skip evaluation. Check rule priority ordering.
Resolution path: docs/ops/runbooks/waf.md §Failure mode C, §Phase advancement.
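The rule-ordering check in the last diagnosis step can be expressed as a simple comparison. The sketch below is hypothetical: it assumes the rule IDs have been copied by hand from the CF dashboard or terraform output into a list in evaluation order, and the rule IDs in the usage note are placeholders.

```python
def skip_runs_first(ordered_rule_ids, skip_rule_id, blocking_rule_id):
    """True when the skip rule is evaluated before the blocking rule."""
    if skip_rule_id not in ordered_rule_ids:
        # A missing skip rule can never run first.
        return False
    if blocking_rule_id not in ordered_rule_ids:
        # The blocking rule is not in this ruleset, so ordering here is not the problem.
        return True
    return ordered_rule_ids.index(skip_rule_id) < ordered_rule_ids.index(blocking_rule_id)
```

For example, skip_runs_first(["block-bots", "skip-probe"], "skip-probe", "block-bots") returns False, which matches the "firing before the skip evaluation" case.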
Pattern W2 — Content mismatch (HTTP 200 but expected strings absent)
What it means: The surface returned HTTP 200 but the response body did not contain any of the expected marker strings. This can indicate a broken deploy, CDN caching an error page, or a full-page redirect replacing content.
Diagnosis by surface:
- getraxx-landing: Expected strings: raxx, trading. If missing: check CF Pages deploy status for getraxx.com. Run: curl -v https://getraxx.com/ | grep -i "raxx\|trading".
- getraxx-faq / getraxx-pricing: These may 404 pre-launch. HTTP 200 with content-mismatch means a page shell was served but expected content is absent. Check the CF Pages build output.
- raxx-app-signin-page: Expected: sign, email, password. If the React SPA shell loads but without hydration (no JS), content-mismatch fires. Check Antlers build.
- raxx-app-dashboard-stub: Expected: raxx, html. Very broad: almost any HTML page will match. If this fails, the surface is likely returning a non-HTML error.
- raxx-app-api-health: Expected: ok, status, healthy. Check GET https://raxx.app/health manually.
- support-root: Expected: support, help, ticket. Check FreeScout is running: curl -s https://support.raxx.app/ | head -20. If FreeScout is down, see docs/ops/runbooks/freescout.md.
- docs-root / docs-getting-started: Expected: docs, documentation, guide. Check CF Pages deploy for docs.raxx.app.
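The content-mismatch rule (no expected marker present in the body) can be reproduced locally with a sketch like the one below. The marker lists are copied from the surfaces above; the dictionary, the function name, and the case-insensitive comparison are assumptions, and scripts/waf/probe.py may match differently.

```python
EXPECTED_MARKERS = {
    "getraxx-landing": ["raxx", "trading"],
    "raxx-app-signin-page": ["sign", "email", "password"],
    "raxx-app-dashboard-stub": ["raxx", "html"],
    "raxx-app-api-health": ["ok", "status", "healthy"],
    "support-root": ["support", "help", "ticket"],
    "docs-root": ["docs", "documentation", "guide"],
}


def content_matches(probe_name, body_text):
    """True when at least one expected marker appears in the body."""
    body = body_text.lower()
    return any(marker in body for marker in EXPECTED_MARKERS[probe_name])
```

A body saved with curl -s can be passed in as body_text to confirm whether the mismatch reproduces outside the workflow.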
Pattern W3 — Connection refused / timeout (no response)
What it means: The surface is completely unreachable. DNS issue, Heroku dyno down, or CF Pages build not deployed.
Diagnosis:
1. curl -v https://<surface-url>/ — check if DNS resolves and TCP connects.
2. If DNS fails: check CF DNS for the zone. docs/ops/runbooks/cloudflare-tokens.md.
3. If TCP connects but hangs: check CF Pages or Heroku dyno status.
4. If Heroku: heroku ps --app <app-name> for the relevant app.
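Steps 1 to 3 can be scripted when curl is not available. The sketch below uses only the standard library; the function names are hypothetical and port 443 is assumed because the surfaces are served over HTTPS.

```python
import socket


def resolves(hostname):
    """True when the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False


def tcp_connects(host, port=443, timeout=5.0):
    """True when a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

If resolves() is False the problem is in DNS (step 2); if it is True but tcp_connects() is False, move on to the hosting checks in steps 3 and 4.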
Pattern W4 — WAF false-positive log event (probe UA in CF block log)
What it means: CF Firewall Events shows a block action (in log mode) triggered by the probe's RaxxProbe/1.0 User-Agent. This is a WAF rule incorrectly matching legitimate traffic patterns.
Diagnosis:
1. Note the rule ID from waf_rule_id in the probe output JSON.
2. Cross-reference with terraform output from terraform/waf to identify the ruleset.
3. If it is an OWASP/managed rule: add a per-rule exception or lower the anomaly score threshold.
4. If it is a custom rule: review the rule expression in terraform/modules/cf-waf/main.tf.
Resolution path: docs/ops/runbooks/waf.md §Failure mode A (false positive).
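Step 1 can be automated when several probes report a rule ID at once. The sketch below assumes the probe output is a JSON list of objects that each carry name and waf_rule_id; only the waf_rule_id field is confirmed by this runbook, so the rest of the shape and the function name are assumptions to verify against scripts/waf/probe.py.

```python
import json


def rule_ids_by_probe(probe_output_text):
    """Group probe names by the WAF rule ID they reported."""
    results = json.loads(probe_output_text)
    grouped = {}
    for result in results:
        rule_id = result.get("waf_rule_id")
        if rule_id:
            # Probes without a rule ID did not trigger a would-block event.
            grouped.setdefault(rule_id, []).append(result.get("name"))
    return grouped
```

A rule ID shared by many probes points at a broad managed rule; one that appears for a single probe is more likely a path-specific custom rule.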
How to silence a flapping check
Use silence sparingly. A flapping check indicates a real instability even if the root cause is external.
Synthetic Gate (Raptor checks):
The gate respects FLAG_SYNTHETIC_CI_GATE. When the flag is 0 (default), failures post Slack + file GitHub issues but do not block CI. When the flag is 1, failures block the deploy pipeline.
To temporarily stop the gate from blocking:
heroku config:set FLAG_SYNTHETIC_CI_GATE=0 --app raxx-api-staging >/dev/null 2>&1
To re-enable:
heroku config:set FLAG_SYNTHETIC_CI_GATE=1 --app raxx-api-staging >/dev/null 2>&1
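The effect of the flag on the pipeline can be summarised in a few lines. This is a sketch of the behaviour described above, not the gate's implementation; the function name and the use of a process exit code are assumptions.

```python
def gate_exit_code(failed_checks, flag_value):
    """Return 0 to let the pipeline continue, 1 to block it."""
    blocking = str(flag_value).strip() == "1"
    # Failures are still reported when the flag is 0; they just do not block.
    return 1 if (failed_checks and blocking) else 0
```

With the flag at 0, gate_exit_code(["health"], "0") is 0, so Slack and issue notifications still fire but the deploy proceeds.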
WAF Probes:
Disable a specific surface via workflow_dispatch with the surfaces input limited to surfaces that are healthy. Example: run only getraxx and raxx-app, skipping support during a FreeScout outage.
To skip the scheduled run entirely for one day: cancel the scheduled run from the GitHub Actions UI. This does not affect the next scheduled trigger.
How to run checks manually
Synthetic Gate (all four Raptor checks):
python3 scripts/ci/run_synthetic_remote.py \
--base-url https://raxx-api-staging-1a19fb3873b9.herokuapp.com \
--timeout 30 \
--output /tmp/synthetic-remote.json
python3 -m json.tool /tmp/synthetic-remote.json
WAF Probes (all surfaces):
python3 scripts/waf/probe.py --timeout 15
WAF Probes (single surface):
python3 scripts/waf/probe.py --surfaces getraxx --timeout 15
Manual workflow dispatch:
- Synthetic Gate: GitHub Actions → Synthetic Gate → Run workflow → optionally override base URL.
- WAF Probes: GitHub Actions → WAF Synthetic Probes → Run workflow → optionally specify surfaces.
Runbook maintenance
When a new synthetic check is added to scripts/ci/run_synthetic_remote.py or scripts/waf/probe.py, add a corresponding section to this runbook before merging. The check name in the section header must match exactly the name field in CheckResult or ProbeResult so alert-to-runbook matching works.
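The name-matching requirement can be checked mechanically before merging. The sketch below assumes section headers keep the form "Check N <dash> <name> (" used in this runbook; the function name is hypothetical and the list of check names would have to be supplied by the caller.

```python
import re

# Matches headers such as "Check 2 <dash> historical_data (GET ...".
SECTION_HEADER = re.compile(r"^Check \d+ \S+ (\w+) \(", re.MULTILINE)


def missing_sections(check_names, runbook_text):
    """Return check names that have no matching section header."""
    documented = set(SECTION_HEADER.findall(runbook_text))
    return [name for name in check_names if name not in documented]
```

An empty list means every check name has a section; anything returned needs a section added before the change merges.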
Cross-references
- Heroku operations: docs/ops/runbooks/heroku.md
- WAF detailed runbook: docs/ops/runbooks/waf.md
- CF Access tokens: docs/ops/runbooks/cf-access-service-token-provisioning.md
- FreeScout: docs/ops/runbooks/freescout.md
- CI notification posture: docs/architecture/ci-notification-posture.md
- Parent epic: #323 (observability + CI health gates)
- WAF probes issue: #1739 (SC-WAF-06)
- Synthetic gate flag: FLAG_SYNTHETIC_CI_GATE (see backend_v2/api/feature_flags.yaml)