Queue service runbook
System: raxx-queue-prod / raxx-queue-staging (Heroku container-stack)
Owner: sre-agent
Last incident: 2026-06-17 (queue-ci-cpp-build-failures)
Last reviewed: 2026-06-16
Overview
Queue is the billing identity service (codename: Queue). It is a C++ service
built with the Drogon framework and deployed as a container to Heroku. It owns
the billing_customer, billing_subscription, and billing_invoice tables in
its own Postgres database, and exposes a read/write API consumed by Console and
Raptor over internal bearer-token auth.
Key URLs:
- Prod: https://queue.raxx.app (CF-proxied) / https://raxx-queue-prod-327fa047b4b6.herokuapp.com (direct)
- Staging: https://raxx-queue-staging-403c1aa5941f.herokuapp.com (direct only; no CF proxy)
Relevant env vars on raxx-queue-prod:
- FLAG_QUEUE_BILLING — enables billing handlers (flip to true during go-live)
- FLAG_ENFORCE_QUEUE_CF_ORIGIN — enforces CF-Connecting-IP guard (flip to true after CF proxy confirmed)
- STRIPE_WEBHOOK_SECRET — whsec_* from Stripe webhook registration (must not be stale)
- DATABASE_URL — Heroku-managed Postgres connection string
How to tell it's broken
- https://queue.raxx.app/health returns non-200, times out, or returns invalid JSON
- heroku ps --app raxx-queue-prod shows 0 dynos (no container deployed)
- Heroku logs show container crash on startup
- CI run for deploy-queue.yml is in a persistent failure streak (check deploy-queue-failure-monitor.yml)
- Console billing dashboard shows 503 on all customer/subscription/invoice endpoints
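The health symptom can be checked mechanically. The following is a minimal sketch, not part of the Queue tooling: the function name health_verdict is hypothetical, and it approximates "valid JSON" with a pattern match on the "status":"ok" field that this runbook lists in the expected health payload.

```shell
# health_verdict STATUS BODY
# Prints "ok" when STATUS is 200 and BODY contains "status":"ok",
# otherwise prints "broken".
# Hypothetical helper: the pattern match stands in for real JSON validation.
health_verdict() {
  if [ "$1" = "200" ] &&
     printf '%s' "$2" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'; then
    echo ok
  else
    echo broken
  fi
}
```

One way to feed it is to capture the status with curl -s -o body.json --max-time 10 -w '%{http_code}' https://queue.raxx.app/health and pass the contents of body.json as the second argument. curl reports a status of 000 when the request times out or the connection is refused, which the sketch treats as broken.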
How to diagnose (in order)
1. Check dyno count: heroku ps --app raxx-queue-prod (expected: web.1: up)
2. Check health: curl -s https://raxx-queue-prod-327fa047b4b6.herokuapp.com/health | python3 -m json.tool
3. Check recent logs: heroku logs --app raxx-queue-prod --tail --num 100
4. Check recent releases: heroku releases --app raxx-queue-prod --num 10
5. Check CI: gh run list --repo raxx-app/TradeMasterAPI --workflow=deploy-queue.yml --limit=5
6. Check env flags: heroku config --app raxx-queue-prod | grep FLAG
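The dyno and health checks are usually enough to point at one of the known failure modes in this runbook. A minimal sketch of that mapping follows. The function name triage and the state labels none, crashed, and up are assumptions chosen for illustration, and the rules cover only failure modes A, B, and F.

```shell
# triage DYNO_STATE CF_STATUS DIRECT_STATUS
#   DYNO_STATE    "none", "crashed", or "up" (hypothetical labels)
#   CF_STATUS     HTTP status of /health via https://queue.raxx.app
#   DIRECT_STATUS HTTP status of /health via the direct herokuapp.com URL
# Prints the failure mode suggested by the symptoms in this runbook.
triage() {
  case "$1" in
    none)    echo "A: zero dynos"; return ;;
    crashed) echo "B: crash on boot"; return ;;
  esac
  if [ "$2" = "403" ] && [ "$3" = "200" ]; then
    # Blocked through Cloudflare but healthy on the direct URL.
    echo "F: CF origin guard"
  elif [ "$2" = "200" ] && [ "$3" = "200" ]; then
    echo "healthy"
  else
    echo "unknown: read logs and releases"
  fi
}
```

For example, triage up 403 200 prints "F: CF origin guard". Anything the sketch cannot classify still needs the log, release, and CI checks from the list above.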
Known failure modes
Failure mode A: Zero dynos — container never deployed
Symptom: heroku ps --app raxx-queue-prod shows no web dyno. Health check returns connection refused.
Cause: Either the container deploy pipeline never ran (CI failed upstream), or the container push succeeded but Heroku did not scale up.
Fix:
# Trigger prod deploy via CI (requires CI to be green first):
gh workflow run deploy-queue.yml --repo raxx-app/TradeMasterAPI \
--ref main --field target=prod --field confirm=deploy-prod-now
# OR: if CI is already green and the scale-up was missed:
heroku ps:scale web=1 --app raxx-queue-prod
Verification: heroku ps --app raxx-queue-prod → web.1: up. curl -s https://raxx-queue-prod-327fa047b4b6.herokuapp.com/health | python3 -m json.tool → {"service":"raxx-queue","status":"ok","timestamp":"...","version":"0.1.0"}.
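The dyno check in this verification can be scripted. The sketch below assumes heroku ps prints one line per dyno in the form "web.1: up <timestamp>", which matches the expected output quoted in this runbook. The function name dyno_state is hypothetical.

```shell
# dyno_state: reads `heroku ps` output on stdin and prints the state of the
# first web dyno (for example "up" or "crashed"), or "none" when no web dyno
# line is present.
dyno_state() {
  awk '$1 ~ /^web\.[0-9]+:$/ { print $2; found = 1; exit }
       END { if (!found) print "none" }'
}
```

Usage: heroku ps --app raxx-queue-prod | dyno_state. A result of "none" corresponds to this failure mode. Any state other than "up" deserves a look at the logs.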
Failure mode B: Container crash on boot
Symptom: heroku ps shows the dyno cycling or crashing immediately. Heroku logs show crash within 30s of start.
Cause options:
1. Missing env var (e.g. DATABASE_URL not set, or STRIPE_WEBHOOK_SECRET missing)
2. Postgres connection refused at boot (Heroku Postgres not yet provisioned or credentials rotated)
3. Container image build was corrupted (rare)
Fix:
# Check logs first:
heroku logs --app raxx-queue-prod --num 50
# Verify required env vars are set (non-empty):
heroku config --app raxx-queue-prod | grep -E "DATABASE_URL|STRIPE_WEBHOOK_SECRET|QUEUE_SERVICE_TOKEN"
# If DATABASE_URL is missing: provision Postgres
heroku addons:create heroku-postgresql:standard-0 --app raxx-queue-prod
# If STRIPE_WEBHOOK_SECRET is stale placeholder:
# Operator action required — register webhook in Stripe Dashboard first.
# Endpoint: https://queue.raxx.app/api/v1/billing/webhook
# Then: heroku config:set STRIPE_WEBHOOK_SECRET=whsec_<real_value> --app raxx-queue-prod >/dev/null 2>&1
Verification: heroku ps --app raxx-queue-prod → dyno stable for >60s.
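The env var check in the fix above prints secret values to the terminal. A minimal sketch of a variant that reports only the names of missing or empty variables follows. It assumes the default heroku config output format of one "KEY: value" pair per line. The function name missing_vars is hypothetical.

```shell
# missing_vars NAME...
# Reads `heroku config` output on stdin and prints each NAME that is absent
# or has an empty value. Values themselves are never printed.
missing_vars() {
  cfg=$(cat)
  for name in "$@"; do
    printf '%s\n' "$cfg" |
      awk -v key="${name}:" '$1 == key && $2 != "" { found = 1 } END { exit !found }' ||
      printf '%s\n' "$name"
  done
}
```

Usage: heroku config --app raxx-queue-prod | missing_vars DATABASE_URL STRIPE_WEBHOOK_SECRET QUEUE_SERVICE_TOKEN. No output means all three are set. The sketch cannot tell a stale placeholder from a real secret.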
Failure mode C: CI build failure — C++ compilation error
Symptom: deploy-queue.yml "Build + test (Debug, ASan + UBSan)" step fails with exit code 1. gh run view <id> --log-failed shows a C++ compile error.
Cause: Code or test file uses a Drogon API method that does not exist in the pinned version (1.9.13), or a CMake configuration error.
Known Drogon 1.9.13 API gotchas:
- getContentTypeString() does NOT exist. Use contentTypeString() (no "get" prefix) or getContentType() (returns ContentType enum).
- setContentTypeCode(CT_APPLICATION_JSON) stores content type as an internal enum, NOT in the HTTP headers map. getHeader("Content-Type") will return "" in unit tests. Use getContentType() == drogon::CT_APPLICATION_JSON to assert the content type in tests.
- find_package(Drogon) and find_package(libpqxx) in subdirectory CMakeLists.txt files MUST be guarded with if(NOT TARGET Drogon::Drogon) / if(NOT TARGET libpqxx::pqxx) to prevent duplicate imported target errors when the root CMakeLists.txt has already called them.
Fix: Identify the failing file from the log, apply the correct API. Push fix to main; CI will requeue automatically.
Verification: Next push-triggered CI run on deploy-queue.yml passes all 263 tests.
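The first API gotcha listed above can be caught locally before a push reaches CI. The sketch below greps a source tree for calls to getContentTypeString(). The function name find_bad_drogon_calls is hypothetical, and the check covers only that one gotcha, not the header-map or CMake issues.

```shell
# find_bad_drogon_calls [DIR]
# Prints every line under DIR (default: current directory) that calls
# getContentTypeString(), which this runbook notes does not exist in the
# pinned Drogon version. Returns 1 if any call is found, 0 otherwise.
find_bad_drogon_calls() {
  if grep -rnE 'getContentTypeString[[:space:]]*\(' "${1:-.}"; then
    return 1
  fi
  return 0
}
```

Run it from the repository root, or pass the directory that holds the Queue sources and tests. A non-zero exit status makes it usable as a pre-push hook.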
Failure mode D: CI failure — vcpkg binary cache miss (full rebuild)
Symptom: Build takes >45 minutes instead of ~22 minutes. Usually succeeds but is slow.
Cause: The vcpkg binary cache key changed (a dependency version bumped, or the cache was evicted). The next run does a full source build of all 15+ dependencies.
Fix: No action needed — the cache will repopulate on the next successful run. If this recurs frequently, check whether vcpkg.json is being modified without updating the cache key in deploy-queue.yml.
See also: docs/incidents/2026-05-27-queue-deploy-vcpkg-shallow-clone.md — separate issue where a --depth 1 clone of vcpkg caused a baseline SHA lookup failure.
Failure mode E: Billing webhook silently ignoring events
Symptom: Stripe sends events; Queue logs show 400 "invalid signature" or events are not appearing in billing_* tables.
Cause options:
1. STRIPE_WEBHOOK_SECRET is stale (a placeholder whsec_3cH... rather than the live whsec_* from Stripe)
2. Webhook endpoint is not registered in Stripe, or is registered at the wrong URL
3. FLAG_QUEUE_BILLING=false — billing handlers are disabled
Fix:
# Check flag:
heroku config --app raxx-queue-prod | grep FLAG_QUEUE_BILLING
# Enable billing:
heroku config:set FLAG_QUEUE_BILLING=true --app raxx-queue-prod >/dev/null 2>&1
# Verify webhook secret is real (not placeholder):
# Read from vault: GET /api/v1/secrets?workspaceId=29b77751-f761-4afa-b3fa-2c842988f95c&environment=prod&secretPath=/Raxx/Queue&secretName=STRIPE_WEBHOOK_SECRET
# If placeholder: operator must register webhook in Stripe Dashboard and update the secret.
Verification: Send a Stripe test event via stripe trigger customer.created and confirm it appears in Queue logs and in the billing_customer table.
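When the logs show "invalid signature", it can help to recompute the signature by hand from a captured request. The sketch below follows Stripe's documented scheme: the Stripe-Signature header carries t=<timestamp> and v1=<hex>, and v1 is the HMAC-SHA256 of "<timestamp>.<raw body>" keyed with the endpoint's whsec_* secret. The function names are hypothetical, openssl is assumed to be installed, and Queue's own handler may apply additional checks such as timestamp tolerance.

```shell
# stripe_sig SECRET TIMESTAMP PAYLOAD
# Prints the hex HMAC-SHA256 of "TIMESTAMP.PAYLOAD" keyed with SECRET.
# Note: passing the secret as an argument exposes it in the process list,
# so treat this as a local debugging aid only.
stripe_sig() {
  printf '%s.%s' "$2" "$3" | openssl dgst -sha256 -hmac "$1" | awk '{ print $NF }'
}

# verify_stripe_sig SECRET HEADER PAYLOAD
# HEADER is the Stripe-Signature value, e.g. "t=1718600000,v1=<hex>".
# Returns 0 when the first v1 signature matches, non-zero otherwise.
verify_stripe_sig() {
  ts=$(printf '%s\n' "$2" | tr ',' '\n' | sed -n 's/^t=//p' | head -n 1)
  v1=$(printf '%s\n' "$2" | tr ',' '\n' | sed -n 's/^v1=//p' | head -n 1)
  [ -n "$ts" ] && [ -n "$v1" ] && [ "$(stripe_sig "$1" "$ts" "$3")" = "$v1" ]
}
```

If verification fails with the secret currently set on raxx-queue-prod, a stale STRIPE_WEBHOOK_SECRET is the likely cause. The payload must be the exact raw request body, because any re-serialization of the JSON changes the signature.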
Failure mode F: CF origin guard rejecting all requests
Symptom: https://queue.raxx.app/health returns 403; direct Heroku URL https://raxx-queue-prod-327fa047b4b6.herokuapp.com/health returns 200. Logs show CF-Connecting-IP header missing; rejecting.
Cause: FLAG_ENFORCE_QUEUE_CF_ORIGIN=true but CF proxying is not active (DNS not proxied, or CF Access misconfigured).
Fix:
# Verify CF DNS record is proxied:
# CF Dashboard → raxx.app zone → DNS → queue.raxx.app → Proxy status must be "Proxied" (orange cloud)
# If proxying cannot be restored quickly, or is confirmed but requests still fail, temporarily disable the guard:
heroku config:set FLAG_ENFORCE_QUEUE_CF_ORIGIN=false --app raxx-queue-prod >/dev/null 2>&1
Verification: curl -s https://queue.raxx.app/health returns 200.
Failure mode G: Deploy summary step failing with HttpError: Resource not accessible by integration
Symptom: Deploy summary job fails at "Post comment on PR / commit" with HttpError: Resource not accessible by integration. This is a known non-blocking error in the deploy workflow.
Cause: The GITHUB_TOKEN used in the Post comment step does not have PR comment permissions when triggered by a push event on main (not a PR). The step attempts to post a comment on the commit, which requires different permission scopes than a PR comment.
Impact: None. The deploy itself completes normally. The Slack DM step is skipped but no service impact occurs.
Fix: continue-on-error: true was added to the "Post comment on PR / commit" step in deploy-queue.yml (ops/queue-post-go-live-fixes, merged 2026-06-16). The step still runs and logs the error, but no longer fails the job. If the job fails here again after the fix, check the actions/github-script@v7 step in the notify job, since the fix should have eliminated the phantom failure entirely.
Go-live checklist
Steps to complete a full Queue production go-live (reference: operator-authorized sequence):
1. queue.raxx.app DNS CNAME attached to shielded-gazelle-a5mdky8xkvx4568m8vqzve6o.herokudns.com (Heroku ACM target) with CF proxy ON (confirmed 2026-06-17)
2. QUEUE_INTERNAL_BASE_URL=https://queue.raxx.app set on raxx-api-prod and raxx-console-prod (confirmed 2026-06-17)
3. CI green (deploy-queue.yml passes all 263 tests)
4. Dispatch prod deploy: gh workflow run deploy-queue.yml --ref main --field target=prod --field confirm=deploy-prod-now
5. Verify https://raxx-queue-prod-327fa047b4b6.herokuapp.com/health returns {"service":"raxx-queue","status":"ok",...}
6. Register Stripe webhook at https://queue.raxx.app/api/v1/billing/webhook: OPERATOR ACTION (requires Stripe Dashboard 2FA)
7. Set STRIPE_WEBHOOK_SECRET=<real whsec_> on raxx-queue-prod (after step 6)
8. Flip FLAG_QUEUE_BILLING=true on raxx-queue-prod
9. Flip FLAG_ENFORCE_QUEUE_CF_ORIGIN=true on raxx-queue-prod (after CF proxy confirmed working)
10. Verify end-to-end: Console billing dashboard loads customer list; /api/v1/billing/customers returns 200
Pending operator actions:
- Stripe webhook registration (requires Stripe Dashboard + webhook_endpoints:write scope on rk_live)
- CF WAF token rotation (CF_WAF_EDIT_RAXX_APP returns HTTP 401; rotate via CLOUDFLARE_ACCESS_MGMT_TOKEN)
- CF SSL/TLS upgrade to Full (strict) (currently Full; requires Zone:Settings:Edit permission)
- QUEUE_TO_RAPTOR_INTERNAL_TOKEN not yet provisioned in vault
Emergency stop
# Scale down immediately (stops serving traffic):
heroku ps:scale web=0 --app raxx-queue-prod
# Disable billing flag (billing handlers stop but service stays up):
heroku config:set FLAG_QUEUE_BILLING=false --app raxx-queue-prod >/dev/null 2>&1
Escalation
Wake the operator when:
- Stripe webhook events are failing AND STRIPE_WEBHOOK_SECRET needs rotation (Stripe Dashboard access required)
- DATABASE_URL credentials need rotation (Heroku Postgres: use heroku pg:credentials:create, not CREATE ROLE WITH PASSWORD)
- CF WAF token rotation is needed (CF_WAF_EDIT_RAXX_APP expired/revoked)
- Postgres data loss or corruption is suspected