Raxx · internal docs

Synthetic probes runbook

System: scheduled GH Actions synthetic probes against production API surfaces
Owner: sre-agent / operator
Last incident: 2026-06-05 (see RCA docs/ops/incidents/2026-06-05-session-auth-waitlist-postmark-exemptions.md)
Last reviewed: 2026-06-17


Overview

Synthetic probes are scheduled GH Actions workflows that exercise production API endpoints at a fixed interval (each probe's schedule is listed in its section) and alert on unexpected responses. They are the primary mechanism for detecting auth regressions and route-availability failures before a customer reports them.

All probes:

Distinction from the Synthetic Gate (synthetic-gate.yml): the Synthetic Gate runs against staging and exercises authenticated API paths (historical data, backtest, trading runtime). These probes run against production and exercise public/unauthenticated paths. The two are complementary; neither replaces the other.

Runbook for the Synthetic Gate: docs/ops/runbooks/synthetic-check-diagnosis.md


Probes

waitlist-signup

Workflow: .github/workflows/synth-probe-waitlist.yml
Schedule: every 15 minutes, around the clock
Issue: #3309

What it probes

POST https://api.raxx.app/api/waitlist/signup with payload {"email": "synth-probe+<run_id>@raxx.invalid"}.

The .invalid TLD is reserved by RFC 2606 §2 and is guaranteed never to resolve in the public DNS. Postmark suppresses delivery to .invalid addresses, so no real email is ever sent.
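The payload construction can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the helper name build_probe_payload is hypothetical, and the sketch assumes the run ID is passed in as a string (for example from GitHub Actions' GITHUB_RUN_ID), which this runbook does not confirm.

```python
import json

PROBE_DOMAIN = "raxx.invalid"  # reserved .invalid TLD; mail to it is never delivered


def build_probe_payload(run_id: str) -> str:
    """Return the JSON body for one probe run (hypothetical helper)."""
    email = f"synth-probe+{run_id}@{PROBE_DOMAIN}"
    # Guard against someone repointing the probe at a deliverable domain.
    if not email.endswith(".invalid"):
        raise ValueError(f"probe address must use the .invalid TLD: {email}")
    return json.dumps({"email": email})
```

Embedding the run ID makes each scheduled run's address unique, so a 409 on a scheduled run most likely indicates a re-run of the same workflow run rather than a collision between runs.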

What success looks like

The probe asserts the HTTP response code is one of:

Code | Meaning
202 | Accepted: signup row written; confirmation email queued (Postmark will suppress the .invalid address)
400 | Bad request: schema validation rejected the payload (acceptable; route is alive and parsing)
409 | Conflict: duplicate email (acceptable; route is alive; a prior run's row already exists)

Any other code is a failure.
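The pass/fail rule above amounts to a set-membership check. A minimal sketch, assuming the probe has the numeric status code available; the names ACCEPTABLE_CODES and probe_passed are illustrative and not taken from the workflow:

```python
# Status codes the runbook treats as "route is alive".
ACCEPTABLE_CODES = frozenset({202, 400, 409})


def probe_passed(status_code: int) -> bool:
    """True if the response code is one of the accepted codes."""
    return status_code in ACCEPTABLE_CODES
```

Using an explicit allow-list rather than a range such as "any 2xx or 4xx" is what lets a 401 or 404 register as a failure.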

What failure looks like

Code | What it almost certainly means | First action
401 | Session-auth middleware regression: /api/waitlist/signup not in _EXEMPT_PREFIXES | See §Failure mode 1 below
403 | Cloudflare WAF or CF Access blocking the probe | See §Failure mode 2 below
404 | Route unregistered: blueprint not wired in backend_v2/app.py | See §Failure mode 3 below
500 | Unhandled exception in the waitlist route | See §Failure mode 4 below
503 | Heroku dyno down or H13 | See §Failure mode 5 below
timeout | Heroku dyno boot latency or network issue | See §Failure mode 6 below
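For on-call tooling, the table can be expressed as a lookup from status code to the runbook section to open first. This is a sketch only: the names FAILURE_MODES and triage are hypothetical, and it assumes a timeout is reported as status code 0, as described under Failure mode 6.

```python
# Failing status code -> runbook section to open first.
FAILURE_MODES = {
    401: "Failure mode 1: auth regression",
    403: "Failure mode 2: WAF or CF Access blocking",
    404: "Failure mode 3: route missing",
    500: "Failure mode 4: unhandled exception",
    503: "Failure mode 5: dyno down",
    0: "Failure mode 6: timeout",  # the probe logs a timeout as status code 0
}


def triage(status_code: int) -> str:
    """Return the runbook section for a failing code, or a manual-triage hint."""
    return FAILURE_MODES.get(
        status_code,
        f"Unclassified failure (HTTP {status_code}); triage manually",
    )
```

Codes outside the table still count as failures; the fallback string only signals that no dedicated section exists yet.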

Failure mode 1 — HTTP 401 (auth regression)

Symptom: probe response is {"error":"Authentication required","reason":"missing"} with HTTP 401.

Cause: FLAG_SESSION_AUTH_MIDDLEWARE is on and /api/waitlist/signup is not in session_auth.py's _EXEMPT_PREFIXES list. This is the exact failure that caused the 8-day silent outage from 2026-05-28 to 2026-06-05 (RCA: docs/ops/incidents/2026-06-05-session-auth-waitlist-postmark-exemptions.md).

Diagnosis:

curl -s -X POST https://api.raxx.app/api/waitlist/signup \
  -H "Content-Type: application/json" \
  -H "User-Agent: Raxx-Synth-Probe/1.0" \
  -d '{"email":"manual-diag@raxx.invalid"}' \
  -w "\nHTTP %{http_code}\n"

If the response is {"error":"Authentication required",...}, confirm the route is missing from the exemption list:

heroku run --app raxx-api-prod -- grep -n "waitlist" /app/backend_v2/api/middleware/session_auth.py

Fix: add /api/waitlist/ to _EXEMPT_PREFIXES in backend_v2/api/middleware/session_auth.py, deploy, and re-run the probe manually via workflow_dispatch to confirm.

Verification: curl as above returns HTTP 202 or 409.
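Why adding the prefix fixes the 401 can be illustrated with a small sketch. The real matching logic in session_auth.py is not shown in this runbook; the code below assumes a plain prefix match, and both prefix tuples are hypothetical stand-ins for _EXEMPT_PREFIXES (the /health entry is a placeholder).

```python
# Hypothetical stand-ins for _EXEMPT_PREFIXES before and after the fix.
# The real list lives in backend_v2/api/middleware/session_auth.py and may differ.
EXEMPT_PREFIXES_BEFORE = ("/health",)
EXEMPT_PREFIXES_AFTER = ("/health", "/api/waitlist/")


def is_exempt(path: str, prefixes: tuple[str, ...]) -> bool:
    """Assumed behaviour: a path skips session auth if it starts with any prefix."""
    # str.startswith accepts a tuple and returns True if any element matches.
    return path.startswith(prefixes)
```

A check of this shape is also a natural candidate for a unit test that pins /api/waitlist/signup as exempt, so a future edit to the list fails in CI rather than in production.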


Failure mode 2 — HTTP 403 (WAF or CF Access blocking)

Symptom: probe returns HTTP 403.

Cause options:

Diagnosis:

  1. Check CF Firewall Events for api.raxx.app, last 15 minutes.
  2. Filter by User-Agent Raxx-Synth-Probe/1.0.
  3. Note the rule ID that fired.
  4. If WAF custom rule: check terraform/waf for any rule matching on UA patterns.
  5. If CF Access: confirm /api/waitlist is not covered by any CF Access policy.

Resolution path: docs/ops/runbooks/waf.md


Failure mode 3 — HTTP 404 (route missing)

Symptom: probe returns HTTP 404 with {"error":"not_found"} or similar.

Cause: the waitlist_signups blueprint was not registered in backend_v2/app.py, or a recent deploy removed it.

Diagnosis:

heroku run --app raxx-api-prod -- python -c "
from backend_v2.app import create_app
app = create_app()
print([r.rule for r in app.url_map.iter_rules() if 'waitlist' in r.rule])
"

Expected: ['/api/waitlist/signup'] (or similar).

If empty: the blueprint registration was dropped. File a P0 regression against the deploy that removed it.
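The comparison against the expected output can be made explicit with a small helper that takes the list of rule strings printed by the diagnosis command. A sketch under assumptions: missing_rules and EXPECTED_RULES are illustrative names, and the expected rule is copied from the line above, which itself allows for "or similar".

```python
# Rules the waitlist blueprint is expected to register (per this runbook).
EXPECTED_RULES = frozenset({"/api/waitlist/signup"})


def missing_rules(registered: list[str], expected: frozenset = EXPECTED_RULES) -> set:
    """Return the expected rules that are absent from the registered rules."""
    return set(expected) - set(registered)
```

An empty result means the route is registered; a non-empty result names the rules that were dropped.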


Failure mode 4 — HTTP 500 (unhandled exception)

Symptom: probe returns HTTP 500.

Diagnosis:

heroku logs --tail --app raxx-api-prod | grep -A10 "waitlist"

Look for a Python traceback in the waitlist_signups route. Common causes:

Resolution path: docs/ops/runbooks/heroku.md + docs/ops/runbooks/postmark.md


Failure mode 5 — HTTP 503 (dyno down)

Symptom: probe returns HTTP 503, often with a Heroku H13 or H14 error in the router logs.

This is a general dyno health issue, not specific to the waitlist route.

Diagnosis:

heroku ps --app raxx-api-prod
heroku logs --tail --app raxx-api-prod | grep -E "code=H[0-9]+"

Resolution path: docs/ops/runbooks/heroku.md §H-errors
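When the log stream is noisy, tallying H-codes makes the dominant error obvious. The sketch below assumes router log lines carry the error as a code=H<nn> field, as Heroku router error lines do; the function name count_h_errors is illustrative.

```python
import re
from collections import Counter

# Matches the error field in a Heroku router log line, e.g. "code=H13".
H_CODE = re.compile(r"\bcode=(H\d+)\b")


def count_h_errors(log_lines):
    """Count occurrences of each Heroku H-code across an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = H_CODE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It can be fed saved log output, for example by passing an open file object, and the resulting Counter sorted with most_common() to see which H-code to look up first.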


Failure mode 6 — Timeout (no response in 10s)

Symptom: probe logs timeout (>10s) with status code 0.

Cause options:

Diagnosis:

  1. Re-run manually via workflow_dispatch to confirm it is not a transient flap.
  2. If timeout persists: curl -v --max-time 10 https://api.raxx.app/health from a separate network to rule out CF rate limiting.
  3. Check Heroku metrics for response time spikes.

Common false positive: a single timeout in one 15-minute window is usually a transient network hiccup. Two consecutive timeouts warrant investigation.
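The "two consecutive timeouts" rule can be sketched as a check over recent probe results. This is illustrative only: it assumes results are available as a list of status codes ordered oldest first, with 0 standing for a timeout as described in the symptom above, and the name needs_investigation is hypothetical.

```python
TIMEOUT = 0  # status code the probe logs when no response arrives in 10s


def needs_investigation(recent_codes: list[int], threshold: int = 2) -> bool:
    """True if the most recent `threshold` runs all timed out."""
    if len(recent_codes) < threshold:
        return False
    return all(code == TIMEOUT for code in recent_codes[-threshold:])
```

Only the tail of the history matters: a timeout followed by a passing run resets the count.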


How to run manually

Via workflow_dispatch in the GH Actions UI:

  1. Go to Actions → "Synth probe — waitlist signup" → Run workflow.
  2. Optionally set target_url to override the API base (e.g. for staging: https://api-staging.raxx.app).
  3. Set dry_run=true to suppress the Slack alert during testing.

Via curl from a terminal:

curl -s -X POST https://api.raxx.app/api/waitlist/signup \
  -H "Content-Type: application/json" \
  -H "User-Agent: Raxx-Synth-Probe/1.0" \
  -d '{"email":"manual-diag@raxx.invalid"}' \
  -w "\nHTTP %{http_code}\n"

How to silence a flapping probe

The probe has no feature-flag gate (it is always active when the workflow is enabled). Options for dealing with a flapping probe:

  1. To test without alerting: run with dry_run=true via workflow_dispatch (Slack suppressed, probe still runs). This applies only to the manually dispatched run.
  2. To stop the scheduled run entirely: disable the workflow from the GH Actions UI using the Disable workflow option. Document the reason in the incident log before doing this.

Do NOT disable the workflow without documenting the reason. A silent probe is the same operational gap as having no probe.


Adding a new probe

When adding a probe for a new route:

  1. Create .github/workflows/synth-probe-<name>.yml following the synth-probe-waitlist.yml pattern.
  2. Add a §<name> section to this runbook before merging the PR.
  3. Classify acceptable vs. failure status codes explicitly in a comment in the workflow file.
  4. Link the workflow in the relevant RCA's action-items table.

Cross-references