Raxx · internal docs

internal · gated

Queue service runbook

System: raxx-queue-prod / raxx-queue-staging (Heroku container-stack) Owner: sre-agent Last incident: 2026-06-17 (queue-ci-cpp-build-failures) Last reviewed: 2026-06-16

Overview

Queue is the billing identity service (codename: Queue). It is a C++ service built with the Drogon framework and deployed as a container to Heroku. It owns the billing_customer, billing_subscription, and billing_invoice tables in its own Postgres database, and exposes a read/write API consumed by Console and Raptor over internal bearer-token auth.

Key URLs: - Prod: https://queue.raxx.app (CF-proxied) / https://raxx-queue-prod-327fa047b4b6.herokuapp.com (direct) - Staging: https://raxx-queue-staging-403c1aa5941f.herokuapp.com (direct only; no CF proxy)

Relevant env vars on raxx-queue-prod: - FLAG_QUEUE_BILLING — enables billing handlers (flip to true during go-live) - FLAG_ENFORCE_QUEUE_CF_ORIGIN — enforces CF-Connecting-IP guard (flip to true after CF proxy confirmed) - STRIPE_WEBHOOK_SECRETwhsec_* from Stripe webhook registration (must not be stale) - DATABASE_URL — Heroku-managed Postgres connection string

How to tell it's broken

How to diagnose (in order)

  1. Check dyno count: heroku ps --app raxx-queue-prod — expected: web.1: up
  2. Check health: curl -s https://raxx-queue-prod-327fa047b4b6.herokuapp.com/health | python3 -m json.tool
  3. Check recent logs: heroku logs --app raxx-queue-prod --tail --num 100
  4. Check recent releases: heroku releases --app raxx-queue-prod --num 10
  5. Check CI: gh run list --repo raxx-app/TradeMasterAPI --workflow=deploy-queue.yml --limit=5
  6. Check env flags: heroku config --app raxx-queue-prod | grep FLAG

Known failure modes

Failure mode A: Zero dynos — container never deployed

Symptom: heroku ps --app raxx-queue-prod shows no web dyno. Health check returns connection refused.

Cause: Either the container deploy pipeline never ran (CI failed upstream), or the container push succeeded but Heroku did not scale up.

Fix:

# Trigger prod deploy via CI (requires CI to be green first):
gh workflow run deploy-queue.yml --repo raxx-app/TradeMasterAPI \
  --ref main --field target=prod --field confirm=deploy-prod-now

# OR: if CI is already green and the scale-up was missed:
heroku ps:scale web=1 --app raxx-queue-prod

Verification: heroku ps --app raxx-queue-prodweb.1: up. curl -s https://raxx-queue-prod-327fa047b4b6.herokuapp.com/health | python3 -m json.tool{"service":"raxx-queue","status":"ok","timestamp":"...","version":"0.1.0"}.


Failure mode B: Container crash on boot

Symptom: heroku ps shows the dyno cycling or crashing immediately. Heroku logs show crash within 30s of start.

Cause options: 1. Missing env var (e.g. DATABASE_URL not set, or STRIPE_WEBHOOK_SECRET missing) 2. Postgres connection refused at boot (Heroku Postgres not yet provisioned or credentials rotated) 3. Container image build was corrupted (rare)

Fix:

# Check logs first:
heroku logs --app raxx-queue-prod --num 50

# Verify required env vars are set (non-empty):
heroku config --app raxx-queue-prod | grep -E "DATABASE_URL|STRIPE_WEBHOOK_SECRET|QUEUE_SERVICE_TOKEN"

# If DATABASE_URL is missing: provision Postgres
heroku addons:create heroku-postgresql:standard-0 --app raxx-queue-prod

# If STRIPE_WEBHOOK_SECRET is stale placeholder:
# Operator action required — register webhook in Stripe Dashboard first.
# Endpoint: https://queue.raxx.app/api/v1/billing/webhook
# Then: heroku config:set STRIPE_WEBHOOK_SECRET=whsec_<real_value> --app raxx-queue-prod >/dev/null 2>&1

Verification: heroku ps --app raxx-queue-prod → dyno stable for >60s.


Failure mode C: CI build failure — C++ compilation error

Symptom: deploy-queue.yml "Build + test (Debug, ASan + UBSan)" step fails with exit code 1. gh run view <id> --log-failed shows a C++ compile error.

Cause: Code or test file uses a Drogon API method that does not exist in the pinned version (1.9.13), or a CMake configuration error.

Known Drogon 1.9.13 API gotchas: - getContentTypeString() does NOT exist. Use contentTypeString() (no "get" prefix) or getContentType() (returns ContentType enum). - setContentTypeCode(CT_APPLICATION_JSON) stores content type as an internal enum, NOT in the HTTP headers map. getHeader("Content-Type") will return "" in unit tests. Use getContentType() == drogon::CT_APPLICATION_JSON to assert the content type in tests. - find_package(Drogon) and find_package(libpqxx) in subdirectory CMakeLists.txt files MUST be guarded with if(NOT TARGET Drogon::Drogon) / if(NOT TARGET libpqxx::pqxx) to prevent duplicate imported target errors when the root CMakeLists.txt has already called them.

Fix: Identify the failing file from the log, apply the correct API. Push fix to main; CI will requeue automatically.

Verification: Next push-triggered CI run on deploy-queue.yml passes all 263 tests.


Failure mode D: CI failure — vcpkg binary cache miss (full rebuild)

Symptom: Build takes >45 minutes instead of ~22 minutes. Usually succeeds but is slow.

Cause: The vcpkg binary cache key changed (a dependency version bumped, or the cache was evicted). The next run does a full source build of all 15+ dependencies.

Fix: No action needed — the cache will repopulate on the next successful run. If this recurs frequently, check whether vcpkg.json is being modified without updating the cache key in deploy-queue.yml.

See also: docs/incidents/2026-05-27-queue-deploy-vcpkg-shallow-clone.md — separate issue where a --depth 1 clone of vcpkg caused a baseline SHA lookup failure.


Failure mode E: Billing webhook silently ignoring events

Symptom: Stripe sends events; Queue logs show 400 "invalid signature" or events are not appearing in billing_* tables.

Cause options: 1. STRIPE_WEBHOOK_SECRET is stale (a placeholder whsec_3cH... rather than the live whsec_* from Stripe) 2. Webhook endpoint is not registered in Stripe, or is registered at the wrong URL 3. FLAG_QUEUE_BILLING=false — billing handlers are disabled

Fix:

# Check flag:
heroku config --app raxx-queue-prod | grep FLAG_QUEUE_BILLING

# Enable billing:
heroku config:set FLAG_QUEUE_BILLING=true --app raxx-queue-prod >/dev/null 2>&1

# Verify webhook secret is real (not placeholder):
# Read from vault: GET /api/v1/secrets?workspaceId=29b77751-f761-4afa-b3fa-2c842988f95c&environment=prod&secretPath=/Raxx/Queue&secretName=STRIPE_WEBHOOK_SECRET
# If placeholder: operator must register webhook in Stripe Dashboard and update the secret.

Verification: Send a Stripe test event via stripe trigger customer.created and confirm it appears in Queue logs and in the billing_customer table.


Failure mode F: CF origin guard rejecting all requests

Symptom: https://queue.raxx.app/health returns 403; direct Heroku URL https://raxx-queue-prod-327fa047b4b6.herokuapp.com/health returns 200. Logs show CF-Connecting-IP header missing; rejecting.

Cause: FLAG_ENFORCE_QUEUE_CF_ORIGIN=true but CF proxying is not active (DNS not proxied, or CF Access misconfigured).

Fix:

# Verify CF DNS record is proxied:
# CF Dashboard → raxx.app zone → DNS → queue.raxx.app → Proxy status must be "Proxied" (orange cloud)

# If proxying confirmed but still failing, temporarily disable the guard:
heroku config:set FLAG_ENFORCE_QUEUE_CF_ORIGIN=false --app raxx-queue-prod >/dev/null 2>&1

Verification: curl -s https://queue.raxx.app/health returns 200.


Failure mode G: Deploy summary step failing with HttpError: Resource not accessible by integration

Symptom: Deploy summary job fails at "Post comment on PR / commit" with HttpError: Resource not accessible by integration. This is a known non-blocking error in the deploy workflow.

Cause: The GITHUB_TOKEN used in the Post comment step does not have PR comment permissions when triggered by a push event on main (not a PR). The step attempts to post a comment on the commit, which requires different permission scopes than a PR comment.

Impact: None. The deploy itself completes normally. The Slack DM step is skipped but no service impact occurs.

Fix: continue-on-error: true added to the "Post comment on PR / commit" step in deploy-queue.yml (ops/queue-post-go-live-fixes, merged 2026-06-16). The step still runs and logs the error, but no longer fails the job. If this recurs after the fix, check the actions/github-script@v7 step in the notify job — the fix should have eliminated the phantom failure entirely.


Go-live checklist

Steps to complete a full Queue production go-live (reference: operator-authorized sequence):

  1. queue.raxx.app DNS CNAME attached to shielded-gazelle-a5mdky8xkvx4568m8vqzve6o.herokudns.com (Heroku ACM target) with CF proxy ON — confirmed 2026-06-17
  2. QUEUE_INTERNAL_BASE_URL=https://queue.raxx.app set on raxx-api-prod and raxx-console-prod — confirmed 2026-06-17
  3. CI green (deploy-queue.yml passes all 263 tests)
  4. Dispatch prod deploy: gh workflow run deploy-queue.yml --ref main --field target=prod --field confirm=deploy-prod-now
  5. Verify https://raxx-queue-prod-327fa047b4b6.herokuapp.com/health returns {"service":"raxx-queue","status":"ok",...}
  6. Register Stripe webhook at https://queue.raxx.app/api/v1/billing/webhookOPERATOR ACTION (requires Stripe Dashboard 2FA)
  7. Set STRIPE_WEBHOOK_SECRET=<real whsec_> on raxx-queue-prod — after step 6
  8. Flip FLAG_QUEUE_BILLING=true on raxx-queue-prod
  9. Flip FLAG_ENFORCE_QUEUE_CF_ORIGIN=true on raxx-queue-prod (after CF proxy confirmed working)
  10. Verify end-to-end: Console billing dashboard loads customer list; /api/v1/billing/customers returns 200

Pending operator actions: - Stripe webhook registration (requires Stripe Dashboard + webhook_endpoints:write scope on rk_live) - CF WAF token rotation (CF_WAF_EDIT_RAXX_APP returns HTTP 401; rotate via CLOUDFLARE_ACCESS_MGMT_TOKEN) - CF SSL/TLS upgrade to Full (strict) (currently Full; requires Zone:Settings:Edit permission) - QUEUE_TO_RAPTOR_INTERNAL_TOKEN not yet provisioned in vault


Emergency stop

# Scale down immediately (stops serving traffic):
heroku ps:scale web=0 --app raxx-queue-prod

# Disable billing flag (billing handlers stop but service stays up):
heroku config:set FLAG_QUEUE_BILLING=false --app raxx-queue-prod >/dev/null 2>&1

Escalation

Wake the operator when: - Stripe webhook events are failing AND STRIPE_WEBHOOK_SECRET needs rotation (Stripe Dashboard access required) - DATABASE_URL credentials need rotation (Heroku RDS — use heroku pg:credentials:create, not CREATE ROLE WITH PASSWORD) - CF WAF token rotation is needed (CF_WAF_EDIT_RAXX_APP expired/revoked) - Postgres data loss or corruption is suspected