Synthetic probes runbook
System: scheduled GH Actions synthetic probes — production API surfaces
Owner: sre-agent / operator
Last incident: 2026-06-05 (see RCA docs/ops/incidents/2026-06-05-session-auth-waitlist-postmark-exemptions.md)
Last reviewed: 2026-06-17
Overview
Synthetic probes are scheduled GH Actions workflows that exercise production API endpoints every N minutes and alert on unexpected responses. They are the primary mechanism for detecting auth-regression and route-availability failures before a customer report.
All probes:
- Use User-Agent: Raxx-Synth-Probe/1.0 (required; the default Python-urllib UA triggers CF error 1010 on CF-gated origins; see memory/feedback_python_urllib_needs_ua_header.md)
- Run on a schedule; also dispatchable manually via workflow_dispatch
- Fire a SEV2 Slack alert to #raxx-ops-alert-sev2 on failure
- Write a step summary to the GH Actions UI per run
- On failure, the workflow exits non-zero so the badge reflects the failure
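A minimal sketch of the first and fourth conventions in the list above, assuming the probe is written with Python's urllib (the runbook only implies this by naming the Python-urllib UA). The function names are hypothetical; GITHUB_STEP_SUMMARY is the file path GH Actions exposes for step summaries.

```python
import json
import os
import urllib.request

PROBE_USER_AGENT = "Raxx-Synth-Probe/1.0"  # value taken from this runbook


def build_probe_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST that carries the probe User-Agent (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Without this header urllib sends "Python-urllib/<version>".
            "User-Agent": PROBE_USER_AGENT,
        },
        method="POST",
    )


def write_step_summary(markdown: str) -> None:
    """Append to the GH Actions step summary; does nothing outside Actions."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        with open(summary_path, "a", encoding="utf-8") as handle:
            handle.write(markdown + "\n")
```

Building the request separately from sending it keeps the header logic checkable without touching production.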
Distinction from the Synthetic Gate (synthetic-gate.yml): the Synthetic
Gate runs against staging and exercises authenticated API paths (historical data,
backtest, trading runtime). These probes run against production and exercise
public/unauthenticated paths. Both complement each other; neither replaces the
other.
Runbook for the Synthetic Gate: docs/ops/runbooks/synthetic-check-diagnosis.md
Probes
waitlist-signup
Workflow: .github/workflows/synth-probe-waitlist.yml
Schedule: every 15 minutes, around the clock
Issue: #3309
What it probes
POST https://api.raxx.app/api/waitlist/signup with payload
{"email": "synth-probe+<run_id>@raxx.invalid"}.
The .invalid TLD is reserved by RFC 2606 §2 and never resolves in public DNS, and
Postmark suppresses delivery to .invalid addresses, so no real email is ever sent.
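The payload shape above can be sketched as follows. That <run_id> comes from the GITHUB_RUN_ID environment variable is an assumption (the runbook does not say where it comes from), and both helper names are hypothetical.

```python
import os

PROBE_DOMAIN = "raxx.invalid"  # reserved TLD, so the address is never deliverable


def probe_payload(run_id: str) -> dict:
    """Build the signup payload with a per-run, non-deliverable address."""
    return {"email": f"synth-probe+{run_id}@{PROBE_DOMAIN}"}


def current_run_id() -> str:
    """Read the run id; GITHUB_RUN_ID is set by GH Actions."""
    # "manual" is a hypothetical fallback for runs outside Actions.
    return os.environ.get("GITHUB_RUN_ID", "manual")
```

A unique address per run means most runs should see 202 rather than 409.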
What success looks like
The probe asserts the HTTP response code is one of:
| Code | Meaning |
|---|---|
| 202 | Accepted: signup row written; confirmation email queued (Postmark will suppress the .invalid address) |
| 400 | Bad request: schema validation rejected the payload (acceptable; route is alive and parsing) |
| 409 | Conflict: duplicate email (acceptable; route is alive; a prior run's row already exists) |
Any other code is a failure.
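As a sketch, the assertion reduces to a membership test that drives the process exit code. The names below are hypothetical and not taken from the workflow itself.

```python
# Status codes the runbook treats as "route is alive".
ACCEPTABLE_STATUS = {
    202: "signup accepted",
    400: "payload rejected by schema validation",
    409: "duplicate email from a prior run",
}


def exit_code_for(status: int) -> int:
    """Return 0 for an acceptable status and 1 for anything else."""
    return 0 if status in ACCEPTABLE_STATUS else 1
```

Passing the result to sys.exit would give the non-zero exit that marks the run as failed.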
What failure looks like
| Code | What it almost certainly means | First action |
|---|---|---|
| 401 | Session-auth middleware regression: /api/waitlist/signup not in _EXEMPT_PREFIXES | See §Failure mode 1 below |
| 403 | Cloudflare WAF or CF Access blocking the probe | See §Failure mode 2 below |
| 404 | Route unregistered: blueprint not wired in backend_v2/app.py | See §Failure mode 3 below |
| 500 | Unhandled exception in the waitlist route | See §Failure mode 4 below |
| 503 | Heroku dyno down or H13 | See §Failure mode 5 below |
| timeout | Heroku dyno boot latency or network issue | See §Failure mode 6 below |
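The table above can be carried in an alert or step summary as a lookup. This is a sketch with hypothetical names; it represents a timeout as status 0, matching what the probe logs.

```python
TIMEOUT_STATUS = 0  # the probe logs a timeout as status code 0

# Failing status -> section of this runbook to open first.
FAILURE_MODES = {
    401: "Failure mode 1: auth regression",
    403: "Failure mode 2: WAF or CF Access blocking",
    404: "Failure mode 3: route missing",
    500: "Failure mode 4: unhandled exception",
    503: "Failure mode 5: dyno down",
    TIMEOUT_STATUS: "Failure mode 6: timeout",
}


def runbook_section(status: int) -> str:
    """Name the section to open for a failing status."""
    default = f"HTTP {status} has no dedicated section in this runbook"
    return FAILURE_MODES.get(status, default)
```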
Failure mode 1 — HTTP 401 (auth regression)
Symptom: probe response is {"error":"Authentication required","reason":"missing"} with HTTP 401.
Cause: FLAG_SESSION_AUTH_MIDDLEWARE is on and /api/waitlist/signup is not
in session_auth.py's _EXEMPT_PREFIXES list. This is the exact failure that
caused the 8-day silent outage from 2026-05-28 to 2026-06-05 (RCA:
docs/ops/incidents/2026-06-05-session-auth-waitlist-postmark-exemptions.md).
Diagnosis:
curl -s -X POST https://api.raxx.app/api/waitlist/signup \
-H "Content-Type: application/json" \
-H "User-Agent: Raxx-Synth-Probe/1.0" \
-d '{"email":"manual-diag@raxx.invalid"}'
If the response is {"error":"Authentication required",...}, confirm the route is
missing from the exemption list:
heroku run --app raxx-api-prod -- grep -n "waitlist" /app/backend_v2/api/middleware/session_auth.py
Fix: add /api/waitlist/ to _EXEMPT_PREFIXES in
backend_v2/api/middleware/session_auth.py, deploy, and re-run the probe manually
via workflow_dispatch to confirm.
Verification: curl as above returns HTTP 202 or 409.
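To illustrate why one missing prefix produces a 401, here is a sketch of prefix-based exemption. It assumes the middleware matches with a plain startswith check, which this runbook does not confirm; the prefix tuples are illustrative and not the real contents of session_auth.py.

```python
from typing import Iterable


def is_exempt(path: str, prefixes: Iterable[str]) -> bool:
    """True when the request path starts with any exempt prefix."""
    return any(path.startswith(prefix) for prefix in prefixes)


# Hypothetical list contents before and after the fix.
before_fix = ("/health",)
after_fix = before_fix + ("/api/waitlist/",)

print(is_exempt("/api/waitlist/signup", before_fix))  # False: auth is enforced
print(is_exempt("/api/waitlist/signup", after_fix))   # True: route is public again
```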
Failure mode 2 — HTTP 403 (WAF or CF Access blocking)
Symptom: probe returns HTTP 403.
Cause options:
- Cloudflare WAF custom rule is blocking the Raxx-Synth-Probe/1.0 User-Agent.
- CF Access policy has been applied to api.raxx.app/api/waitlist/signup (should not happen; this is a public route).
Diagnosis:
- Check CF Firewall Events for api.raxx.app, last 15 minutes.
- Filter by User-Agent Raxx-Synth-Probe/1.0.
- Note the rule ID that fired.
- If WAF custom rule: check terraform/waf for any rule matching on UA patterns.
- If CF Access: confirm /api/waitlist is not covered by any CF Access policy.
Resolution path: docs/ops/runbooks/waf.md
Failure mode 3 — HTTP 404 (route missing)
Symptom: probe returns HTTP 404 with {"error":"not_found"} or similar.
Cause: the waitlist_signups blueprint was not registered in
backend_v2/app.py, or a recent deploy removed it.
Diagnosis:
heroku run --app raxx-api-prod -- python -c "
from backend_v2.app import create_app
app = create_app()
print([r.rule for r in app.url_map.iter_rules() if 'waitlist' in r.rule])
"
Expected: ['/api/waitlist/signup'] (or similar).
If empty: the blueprint registration was dropped. File a P0 regression against the deploy that removed it.
Failure mode 4 — HTTP 500 (unhandled exception)
Symptom: probe returns HTTP 500.
Diagnosis:
heroku logs --tail --app raxx-api-prod | grep -A10 "waitlist"
Look for a Python traceback in the waitlist_signups route. Common causes:
- Database write failed (Postgres connection issue, migration not applied).
- Postmark client initialization error (missing POSTMARK_SERVER_TOKEN).
Resolution path: docs/ops/runbooks/heroku.md + docs/ops/runbooks/postmark.md
Failure mode 5 — HTTP 503 (dyno down)
Symptom: probe returns HTTP 503 or Heroku H13/H14 error.
This is a general dyno health issue, not specific to the waitlist route.
Diagnosis:
heroku ps --app raxx-api-prod
heroku logs --tail --app raxx-api-prod | grep -E "H[0-9]+"
Resolution path: docs/ops/runbooks/heroku.md §H-errors
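When the grep output is long, tallying the H-codes shows which error dominates. The sketch below assumes the log text has been saved or piped in, and relies on the code=H<nn> field that Heroku router error lines carry. The function name is hypothetical.

```python
import re
from collections import Counter

# Heroku router error lines include a field such as code=H13.
H_CODE = re.compile(r"\bcode=(H\d+)\b")


def count_h_codes(log_text: str) -> Counter:
    """Count occurrences of each Heroku H-code in raw log text."""
    return Counter(H_CODE.findall(log_text))
```

Calling count_h_codes(text).most_common() lists the codes from most to least frequent.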
Failure mode 6 — Timeout (no response in 10s)
Symptom: probe logs timeout (>10s) with status code 0.
Cause options:
- Heroku dyno cold boot (unlikely at 15-minute probe frequency — dynos stay warm).
- Network path issue between GitHub Actions runner and CF/Heroku.
- CF rate limiting the runner IP.
Diagnosis:
- Re-run manually via workflow_dispatch to confirm it is not a transient flap.
- If timeout persists: run curl -v --max-time 10 https://api.raxx.app/health from a separate network to rule out CF rate limiting.
- Check Heroku metrics for response time spikes.
Common false positive: a single timeout on one 15-minute window is usually a transient network hiccup. Two consecutive timeouts warrant investigation.
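A sketch of how a urllib-based probe could map "no response" to status 0 and apply the two-consecutive-timeouts rule. The names are hypothetical, and the opener parameter exists only so the logic can be exercised without network access. It treats DNS and connection errors the same as a timeout, which is an assumption about the probe.

```python
import socket
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 10  # matches the 10s limit in this section


def status_of(req, opener=urllib.request.urlopen) -> int:
    """Return the HTTP status, or 0 when no response arrives."""
    try:
        with opener(req, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code.
        return err.code
    except (urllib.error.URLError, socket.timeout, TimeoutError):
        # Assumption: DNS and connection failures count as "no response".
        return 0


def needs_investigation(recent_statuses: list) -> bool:
    """True when the two most recent runs both ended with status 0."""
    return len(recent_statuses) >= 2 and recent_statuses[-2:] == [0, 0]
```

HTTPError is caught before URLError because it is a subclass; the reverse order would turn every 4xx/5xx into status 0.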
How to run manually
Via workflow_dispatch in the GH Actions UI:
- Go to Actions → "Synth probe — waitlist signup" → Run workflow.
- Optionally set target_url to override the API base (e.g. for staging: https://api-staging.raxx.app).
- Set dry_run=true to suppress the Slack alert during testing.
Via curl from a terminal:
curl -s -X POST https://api.raxx.app/api/waitlist/signup \
-H "Content-Type: application/json" \
-H "User-Agent: Raxx-Synth-Probe/1.0" \
-d '{"email":"manual-diag@raxx.invalid"}' \
-w "\nHTTP %{http_code}\n"
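The workflow_dispatch route can also be scripted through the GitHub REST API's workflow dispatch endpoint. The sketch below only builds the request. The owner, repository, token, and default ref are placeholders you must supply; the input names target_url and dry_run come from this runbook, and the API expects input values as strings.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, token,
                           ref="main", inputs=None):
    """Build the POST that triggers a workflow_dispatch run (hypothetical helper)."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = {"ref": ref, "inputs": inputs or {}}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

For a quiet test run, pass inputs={"dry_run": "true"} and send the result with urllib.request.urlopen.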
How to silence a flapping probe
The probe has no feature-flag gate (it is always active when the workflow is enabled). To temporarily suppress alerts without disabling the probe:
- Run with dry_run=true via workflow_dispatch (Slack suppressed, probe still runs).
- To disable the scheduled run entirely: cancel future scheduled runs from the GH Actions UI. Document the reason in the incident log before doing this.
Do NOT disable the workflow without documenting the reason. A silent probe is the same operational gap as having no probe.
Adding a new probe
When adding a probe for a new route:
- Create .github/workflows/synth-probe-<name>.yml following the synth-probe-waitlist.yml pattern.
- Add a §<name> section to this runbook before merging the PR.
- Classify acceptable vs. failure status codes explicitly in the code comment.
- Link the workflow in the relevant RCA's action-items table.
Cross-references
- Synthetic Gate (staging, authenticated paths): docs/ops/runbooks/synthetic-check-diagnosis.md
- WAF probes (surface reachability): docs/ops/runbooks/waf.md
- Heroku operations: docs/ops/runbooks/heroku.md
- Postmark: docs/ops/runbooks/postmark.md
- CF Access tokens: docs/ops/runbooks/cf-access-service-token-provisioning.md
- Waitlist incident RCA: docs/ops/incidents/2026-06-05-session-auth-waitlist-postmark-exemptions.md
- Issue: #3309