Raxx · internal docs

Heroku runbook

System: raxx-api-staging + raxx-api-prod (Raptor backend on Heroku)
Owner: operator / sre-agent
Last incident: 2026-04-30 (see docs/incidents/2026-04-30-raxx-api-prod-down.md)
Last reviewed: 2026-05-01

Workflow architecture (deploy source of truth)

As of 2026-05-01 (PR #837, closes issues #834 + #698), the canonical deploy workflows are:

| Workflow | File | Trigger | Purpose |
|---|---|---|---|
| Deploy to Heroku | .github/workflows/deploy-heroku.yml | push to main (staging) / workflow_dispatch (staging or prod) | Single source of truth for all Raptor backend Heroku deploys |
| Deploy Antlers | .github/workflows/deploy-antlers.yml | push to main or v* tag touching frontend/trademaster_ui/** | Antlers (frontend) → Cloudflare Pages |
| PR Preview | .github/workflows/pr-preview.yml | pull_request open/sync/reopen | PR preview deploys for Antlers + mockups (not production) |

.github/workflows/deploy.yml was deleted in PR #837 (closing issue #834). It was the redundant workflow:
  - Its Heroku jobs duplicated deploy-heroku.yml and used the broken akhileshns/heroku-deploy action, with a smoke step that hit /api/system/status and was 403'd by FLAG_ENFORCE_CF_ORIGIN.
  - Its Antlers job was extracted to deploy-antlers.yml.

Bottom line: If a Heroku deploy is failing, look at deploy-heroku.yml. There is no other backend deploy workflow.

How to tell it's broken

App reference

| App | URL | Stack |
|---|---|---|
| raxx-api-staging | https://raxx-api-staging-1a19fb3873b9.herokuapp.com | heroku-24 |
| raxx-api-prod | https://raxx-api-prod-a60a19e5efbf.herokuapp.com | heroku-24 |

How to diagnose (in order)

  1. Check dyno formation: heroku ps --app raxx-api-prod — expected: web.1: up (HH:MM:SS)
  2. Check release history: heroku releases --app raxx-api-prod — look for Deploy ... entries. If all entries say Set ... config vars with no Deploy entry, no code has ever been deployed.
  3. Check slug presence via the Heroku Platform API. Run the token assignment as its own command so that $HEROKU_API_KEY is set before the shell expands it in the curl arguments:
       HEROKU_API_KEY=$(heroku auth:token)
       curl -s -H "Authorization: Bearer $HEROKU_API_KEY" \
         -H "Accept: application/vnd.heroku+json; version=3" \
         "https://api.heroku.com/apps/raxx-api-prod/releases" \
         | python3 -c "import json,sys; r=json.load(sys.stdin); print('Latest slug:', r[-1]['slug'])"
     Expected: {'id': '<uuid>', 'name': 'slug-name'}. If it prints None (JSON null), no code has been deployed.
  4. Check CI workflow runs: gh run list --workflow=deploy-heroku.yml --limit 10 --json event,conclusion,displayTitle | python3 -m json.tool
     - All "event": "push" → only staging has been deployed; production is untouched
     - Any "event": "workflow_dispatch" with "conclusion": "success" → a manually dispatched deploy succeeded at that time. Because workflow_dispatch can target staging or prod, confirm that the run was dispatched with environment=production.
  5. Check GitHub Environments: gh api repos/raxx-app/TradeMasterAPI/environments — confirm production environment exists with protection rules
  6. Check Heroku logs: heroku logs --tail --app raxx-api-prod — look for crash/OOM/boot errors
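
The run-history check in step 4 can be scripted instead of read by eye. The following is a minimal sketch in Python, assuming the input is the JSON array printed by gh run list --json event,conclusion,displayTitle. The function name classify_runs and the labels it returns are hypothetical and not part of any existing tooling.

```python
import json


def classify_runs(runs):
    """Classify deploy-heroku.yml runs using the rules from step 4.

    `runs` is the parsed JSON array from `gh run list --json ...`.
    The returned labels are illustrative names, not official statuses.
    """
    if not runs:
        return "no-runs"
    dispatches = [r for r in runs if r.get("event") == "workflow_dispatch"]
    if not dispatches:
        # Only automatic (push) runs exist, so only staging was deployed.
        return "no-manual-dispatch"
    if any(r.get("conclusion") == "success" for r in dispatches):
        # A manual deploy succeeded; its target environment still needs
        # to be confirmed, since dispatch can target staging or prod.
        return "dispatch-succeeded"
    return "dispatch-never-succeeded"


# Illustrative sample data, not real run history.
sample = json.loads("""
[
  {"event": "push", "conclusion": "success", "displayTitle": "Merge pull request"},
  {"event": "workflow_dispatch", "conclusion": "failure", "displayTitle": "Deploy to Heroku"}
]
""")
print(classify_runs(sample))
```

To use it on real data, the gh output could be redirected to a file and loaded with json.load before calling classify_runs.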

Known failure modes

Failure mode A: No slug deployed — app created but never had code pushed

Symptom: heroku ps shows no dynos; all releases in heroku releases are config-var-only; health endpoint returns 502; slug_size: null in Platform API.
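
As a rough illustration of this symptom, the sketch below summarises a parsed Platform API releases list. It assumes each release object carries version, description, and slug fields, and that code deploys have descriptions beginning with Deploy, as described in the diagnosis steps. The name summarize_releases and the sample data are hypothetical.

```python
def summarize_releases(releases):
    """Summarise the parsed JSON from GET /apps/<app>/releases.

    Picks the newest release by version number rather than by list
    position, so the result does not depend on the response ordering.
    """
    if not releases:
        return {"latest_version": None, "has_slug": False, "deploy_count": 0}
    latest = max(releases, key=lambda r: r["version"])
    deploys = [
        r for r in releases
        if (r.get("description") or "").startswith("Deploy")
    ]
    return {
        "latest_version": latest["version"],
        "has_slug": latest.get("slug") is not None,
        "deploy_count": len(deploys),
    }


# Illustrative data for an app that has only ever had config changes.
config_only = [
    {"version": 1, "description": "Initial release", "slug": None},
    {"version": 2, "description": "Enable Logplex", "slug": None},
    {"version": 3, "description": "Set EXAMPLE_FLAG config vars", "slug": None},
]
print(summarize_releases(config_only))
```

A result with has_slug false and deploy_count 0 corresponds to failure mode A; any other combination points to a different failure mode.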

Cause: The production deploy path requires a manual workflow_dispatch on deploy-heroku.yml. If this has never been triggered, no slug exists. (The old deploy.yml prod path required a v* git tag — that workflow was deleted in #834.)

Fix (requires operator):

Step 1: Ensure the production GitHub Environment exists:
  - URL: https://github.com/raxx-app/TradeMasterAPI/settings/environments
  - If missing: create an environment named production, add the required reviewer (Kristerpher), and restrict deployments to the main branch

Step 2: Trigger the first production deploy:
  - URL: https://github.com/raxx-app/TradeMasterAPI/actions/workflows/deploy-heroku.yml
  - Click "Run workflow" > set environment=production, ref=main > Run
  - Approve the deployment at the required-reviewer gate
  - Monitor until the health check passes

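Verification: heroku ps --app raxx-api-prod shows web.1: up; curl returns HTTP 200 from /api/system/status. A direct request to the herokuapp.com URL may instead return 403 while FLAG_ENFORCE_CF_ORIGIN is enforced, as happened with the old deploy.yml smoke step.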


Failure mode B: Dyno crash loop (R10/H10/H12 errors)

Symptom: heroku ps shows dynos crashing; heroku logs --tail shows R10 (boot timeout), H10 (app crashed), or H12 (request timeout).

Cause: Application boot failure (bad config var, missing dependency, OOM) or a request handler that times out.

Diagnosis:

heroku logs --tail --num 200 --app raxx-api-prod

Look for router lines containing at=error code=H10 or code=H12, and for dyno lines such as Error R10 (Boot timeout) or Error R14 (Memory quota exceeded).

Fix: Depends on the error code:
  - R10 (boot timeout >60s): Check gunicorn startup; if the boot is legitimately slow, increase the timeout with heroku config:set GUNICORN_TIMEOUT=120 --app raxx-api-prod and redeploy
  - R14 (memory exceeded): Scale up the dyno type or profile the memory leak
  - H10 (crash): Check the startup traceback in the logs; the most common cause is a missing env var or an import error
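
When the log excerpt is long, tallying the error codes first can show which fix to try. The sketch below is one possible way to do that in Python, assuming router errors appear as at=error code=Hxx and dyno errors as Error Rxx (...). The names count_error_codes, suggest, and HINTS are hypothetical, the hint text paraphrases this runbook, and the sample lines are illustrative.

```python
import re
from collections import Counter

# Router errors look like: at=error code=H10 desc="App crashed" ...
ROUTER_ERROR = re.compile(r"at=error code=(H\d+)")
# Dyno/runtime errors look like: Error R10 (Boot timeout) -> ...
DYNO_ERROR = re.compile(r"Error (R\d+) \(")

# Hints paraphrased from the fix list above; H12 comes from the cause text.
HINTS = {
    "R10": "boot timeout: check gunicorn startup",
    "R14": "memory exceeded: scale up the dyno type or profile the leak",
    "H10": "app crashed: read the startup traceback",
    "H12": "request timeout: look for a slow request handler",
}


def count_error_codes(lines):
    """Count Heroku H/R error codes found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for pattern in (ROUTER_ERROR, DYNO_ERROR):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


def suggest(counts):
    """Return one hint per code, most frequent code first."""
    return [
        f"{code} x{n}: {HINTS.get(code, 'no runbook entry')}"
        for code, n in counts.most_common()
    ]


# Illustrative log lines, not taken from a real incident.
sample_lines = [
    'heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind',
    'heroku[router]: at=error code=H10 desc="App crashed" method=GET path="/"',
    'heroku[router]: at=error code=H10 desc="App crashed" method=GET path="/x"',
]
for hint in suggest(count_error_codes(sample_lines)):
    print(hint)
```

The input could come from saving heroku logs --num 200 --app raxx-api-prod to a file and passing the open file object to count_error_codes.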

Verification: heroku ps --app raxx-api-prod shows web.1: up for >5 min with no restarts


Failure mode C: CI deploy fails — heroku create auth error (historical)

Status: No longer possible. deploy.yml was deleted in PR #837 (closing issue #834). deploy-heroku.yml uses the heroku CLI with the scoped credential.helper pattern (spike #825), which does not call heroku create.

Historical symptom (for reference): deploy.yml failed at "Deploy to Heroku (staging)" with heroku create <appname>... TypeError: process.stdin.setRawMode is not a function. This was caused by akhileshns/heroku-deploy lacking dontautocreate: true.

If you see a similar error in a different workflow: ensure any akhileshns/heroku-deploy usage sets dontautocreate: true. The canonical deploy-heroku.yml does not use that action.


Failure mode D: Deploy freeze blocks production deploy

Symptom: deploy-heroku.yml workflow runs but freeze-check job fails; run is blocked.

Cause: Global deploy freeze is active (set via console.raxx.app or DEPLOY_FREEZE_OVERRIDE=1 secret).

Diagnosis: Check https://console.raxx.app deploy freeze toggle; check repo secret DEPLOY_FREEZE_OVERRIDE.

Fix:
  - If the freeze is intentional: wait for the operator to lift it via the console
  - If the freeze is stuck (console unreachable): set the DEPLOY_FREEZE_OVERRIDE repo secret to 0 to lift it, or leave it at 1 to enforce the freeze even while the console is down (see docs/ops/runbooks/deploy-freeze.md)
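
The interaction between the console toggle and the override secret can be easier to reason about as a small decision function. The sketch below is one reading of the rules above and not the actual freeze-check job. It assumes that an override of 1 always blocks, that an override of 0 lifts the freeze only when the console is unreachable, and that an unreachable console with no override fails closed. The function name deploy_blocked and these precedence rules are assumptions; docs/ops/runbooks/deploy-freeze.md is the authority.

```python
from typing import Optional


def deploy_blocked(console_frozen: Optional[bool], override: Optional[str]) -> bool:
    """Return True if a deploy should be blocked (hypothetical logic).

    console_frozen: True/False from the console toggle, or None when
                    the console cannot be reached.
    override:       value of DEPLOY_FREEZE_OVERRIDE, or None if unset.
    """
    if override == "1":
        # The override forces a freeze regardless of the console state.
        return True
    if console_frozen is None:
        # Console unreachable: "0" lifts the freeze; anything else is
        # treated as frozen (assumed fail-closed behaviour).
        return override != "0"
    # Console reachable: assume its toggle is authoritative.
    return console_frozen


# Console down, operator has set the override to 0 to unblock deploys.
print(deploy_blocked(None, "0"))
```

Writing the cases out this way makes it clear which combination applies before changing the repo secret.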

Verification: Freeze check job passes; deploy proceeds.


Failure mode E: production GitHub Environment missing — no required-reviewer gate

Symptom: workflow_dispatch on deploy-heroku.yml with environment=production completes without an approval prompt; OR gh api repos/raxx-app/TradeMasterAPI/environments returns only staging.

Cause: The production GitHub Environment was never created. The deploy job references environment: production in the workflow YAML, but GitHub does not enforce protection rules for environments that don't exist as repo environment objects.

Fix (operator):
  1. Go to https://github.com/raxx-app/TradeMasterAPI/settings/environments/new
  2. Name: production (must match exactly; case-sensitive)
  3. Add required reviewer: Kristerpher
  4. Restrict deployment branches to main only
  5. Save

Verification: gh api repos/raxx-app/TradeMasterAPI/environments returns an environment named production with at least one protection rule of type required_reviewers.
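
This verification can be automated against the JSON that the environments endpoint returns. The following is a minimal sketch, assuming the response has a top-level environments array whose entries include name and protection_rules, each rule carrying a type field. The function name production_gate_ok and the sample payloads are hypothetical.

```python
def production_gate_ok(payload, name="production"):
    """Check that the named environment exists with a reviewer gate.

    `payload` is the parsed JSON from
    `gh api repos/<owner>/<repo>/environments`.
    """
    for env in payload.get("environments", []):
        if env.get("name") == name:
            rules = env.get("protection_rules", [])
            return any(rule.get("type") == "required_reviewers" for rule in rules)
    # No environment with that exact name: the gate does not exist.
    return False


# Illustrative payloads: the broken state, then the fixed state.
staging_only = {"total_count": 1, "environments": [
    {"name": "staging", "protection_rules": []},
]}
with_gate = {"total_count": 2, "environments": [
    {"name": "staging", "protection_rules": []},
    {"name": "production", "protection_rules": [
        {"type": "required_reviewers"},
        {"type": "branch_policy"},
    ]},
]}
print(production_gate_ok(staging_only), production_gate_ok(with_gate))
```

The name comparison is exact, which matches the requirement that the environment be named production precisely.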


Emergency stop

To take raxx-api-prod offline cleanly (0 dynos, keeps the app and data intact):

heroku ps:scale web=0 --app raxx-api-prod

To confirm it's offline:

heroku ps --app raxx-api-prod
# should show: No dynos on raxx-api-prod

To restore (after a clean slug exists):

heroku ps:scale web=1 --app raxx-api-prod

Note: scaling to 0 is an emergency stop only. If no slug has been deployed, the app is already effectively at 0 dynos and there is nothing to stop.

Escalation

Wake the operator when:
  - A production deploy run fails after the required-reviewer gate has been passed (indicates a deploy-time failure, not a gate failure)
  - heroku logs shows Error R14 (memory) or repeated Error R10 (boot timeout) and the standard fixes do not resolve it within 30 min
  - A Heroku incident is active (check https://status.heroku.com) that affects the us region
  - The Heroku app is unreachable AND Heroku status is green (possible data corruption or misconfiguration requiring dashboard access)

Who: Kristerpher (ops@raxx.app / Slack #MooseQuest workspace)