Heroku runbook
System: raxx-api-staging + raxx-api-prod (Raptor backend on Heroku); raxx-console-prod (Console app, flag promotion machinery)
Owner: operator / sre-agent
Last incident: 2026-05-11 (see docs/ops/incidents/2026-05-11-stuck-flag-promotion-console-heroku-log-drain-alerting.md)
Last reviewed: 2026-05-14
Rollback runbook: docs/ops/runbooks/rollback.md — step-by-step release rollback for all Heroku apps, comms templates, and drill record. If a production release must be reverted, start there.
Workflow architecture (deploy source of truth)
As of 2026-05-01 (PR #837, closes issues #834 + #698), the canonical deploy workflows are:
| Workflow | File | Trigger | Purpose |
|---|---|---|---|
| Deploy to Heroku | .github/workflows/deploy-heroku.yml |
push to main (staging) / workflow_dispatch (staging or prod) |
Single source of truth for all Raptor backend Heroku deploys |
| Deploy Antlers | .github/workflows/deploy-antlers.yml |
push to main or v* tag touching frontend/trademaster_ui/** |
Antlers (frontend) → Cloudflare Pages |
| PR Preview | .github/workflows/pr-preview.yml |
pull_request open/sync/reopen | PR preview deploys for Antlers + mockups (not production) |
.github/workflows/deploy.yml was deleted in PR #834. It was the redundant workflow:
- Its Heroku jobs duplicated deploy-heroku.yml and used the broken akhileshns/heroku-deploy action with a smoke step that hit /api/system/status (403'd by FLAG_ENFORCE_CF_ORIGIN).
- Its Antlers job was extracted to deploy-antlers.yml.
Bottom line: If a Heroku deploy is failing, look at deploy-heroku.yml. There is no other backend deploy workflow.
How to tell it's broken
- Symptom 1:
curl -sI https://raxx-api-prod-a60a19e5efbf.herokuapp.com/api/system/statusreturns non-200 (502, 503, connection refused) - Symptom 2:
heroku ps --app raxx-api-prodreturns "No dynos on raxx-api-prod" - Symptom 3: CI
deploy-heroku.ymlrun shows "Deploy (production)" job was never triggered (only push-triggered staging runs exist) - Symptom 4: Heroku Platform API
GET /apps/raxx-api-prod/releasesshows"slug": nullon all releases
App reference
| App | URL | Stack |
|---|---|---|
| raxx-api-staging | https://raxx-api-staging-1a19fb3873b9.herokuapp.com |
heroku-24 |
| raxx-api-prod | https://raxx-api-prod-a60a19e5efbf.herokuapp.com |
heroku-24 |
How to diagnose (in order)
- Check dyno formation:
heroku ps --app raxx-api-prod— expected:web.1: up (HH:MM:SS) - Check release history:
heroku releases --app raxx-api-prod— look forDeploy ...entries. If all entries saySet ... config varswith no Deploy entry, no code has ever been deployed. - Check slug presence via Heroku API:
HEROKU_API_KEY=$(heroku auth:token) curl -s -H "Authorization: Bearer $HEROKU_API_KEY" \ -H "Accept: application/vnd.heroku+json; version=3" \ "https://api.heroku.com/apps/raxx-api-prod/releases" \ | python3 -c "import json,sys; r=json.load(sys.stdin); print('Latest slug:', r[-1]['slug'])"Expected:{'id': '<uuid>', 'name': 'slug-name'}. Ifnull, no code has been deployed. - Check CI workflow runs:
gh run list --workflow=deploy-heroku.yml --limit 10 --json event,conclusion,displayTitle | python3 -m json.tool- All"event": "push"→ only staging has been deployed; production is untouched - Any"event": "workflow_dispatch"with"conclusion": "success"→ production deploy succeeded at that time - Check GitHub Environments:
gh api repos/raxx-app/TradeMasterAPI/environments— confirmproductionenvironment exists with protection rules - Check Heroku logs:
heroku logs --tail --app raxx-api-prod— look for crash/OOM/boot errors
Known failure modes
Failure mode A: No slug deployed — app created but never had code pushed
Symptom: heroku ps shows no dynos; all releases in heroku releases are config-var-only; health endpoint returns 502; slug_size: null in Platform API.
Cause: The production deploy path requires a manual workflow_dispatch on deploy-heroku.yml. If this has never been triggered, no slug exists. (The old deploy.yml prod path required a v* git tag — that workflow was deleted in #834.)
Fix (requires operator):
Step 1 — Ensure production GitHub Environment exists:
- URL: https://github.com/raxx-app/TradeMasterAPI/settings/environments
- If missing: create environment named production, add required reviewer (Kristerpher), restrict to main branch
Step 2 — Trigger first production deploy:
- URL: https://github.com/raxx-app/TradeMasterAPI/actions/workflows/deploy-heroku.yml
- Click "Run workflow" > set environment=production, ref=main > Run
- Approve the deployment at the required-reviewer gate
- Monitor until health check passes
Verification: heroku ps --app raxx-api-prod shows web.1: up; curl returns HTTP 200 from /api/system/status
Failure mode B: Dyno crash loop (R10/H10/H12 errors)
Symptom: heroku ps shows dynos crashing; heroku logs --tail shows R10 (boot timeout), H10 (app crashed), or H12 (request timeout).
Cause: Application boot failure (bad config var, missing dependency, OOM) or a request handler that times out.
Diagnosis:
heroku logs --tail --num 200 --app raxx-api-prod
Look for lines with at=error code=R10 (boot timeout) or Error R14 (memory exceeded).
Fix: Depends on error code:
- R10 (boot timeout >60s): Check gunicorn startup; increase timeout if legitimately slow boot: heroku config:set GUNICORN_TIMEOUT=120 --app raxx-api-prod and redeploy
- R14 (memory exceeded): Scale up dyno type or profile the memory leak
- H10 (crash): Check startup traceback in logs; most common cause is missing env var or import error
Verification: heroku ps --app raxx-api-prod shows web.1: up for >5 min with no restarts
Failure mode C: CI deploy fails — heroku create auth error (historical)
Status: No longer possible — deploy.yml was deleted in PR #834. deploy-heroku.yml uses the heroku CLI + scoped credential.helper pattern (spike #825) which does not call heroku create.
Historical symptom (for reference): deploy.yml failed at "Deploy to Heroku (staging)" with heroku create <appname>... TypeError: process.stdin.setRawMode is not a function. This was caused by akhileshns/heroku-deploy lacking dontautocreate: true.
If you see a similar error in a different workflow: Ensure any akhileshns/heroku-deploy usage has dontautocreate: true. But the canonical deploy-heroku.yml does not use that action.
Failure mode D: Deploy freeze blocks production deploy
Symptom: deploy-heroku.yml workflow runs but freeze-check job fails; run is blocked.
Cause: Global deploy freeze is active (set via console.raxx.app or DEPLOY_FREEZE_OVERRIDE=1 secret).
Diagnosis: Check https://console.raxx.app deploy freeze toggle; check repo secret DEPLOY_FREEZE_OVERRIDE.
Fix:
- If freeze is intentional: wait for operator to lift it via console
- If freeze is stuck (console unreachable): set DEPLOY_FREEZE_OVERRIDE=0 repo secret to lift, OR leave at 1 to enforce freeze even when console is down (see docs/ops/runbooks/deploy-freeze.md)
Verification: Freeze check job passes; deploy proceeds.
Failure mode E: production GitHub Environment missing — no required-reviewer gate
Symptom: workflow_dispatch on deploy-heroku.yml with environment=production completes without an approval prompt; OR gh api repos/raxx-app/TradeMasterAPI/environments returns only staging.
Cause: The production GitHub Environment was never created. The deploy job references environment: production in the workflow YAML, but GitHub does not enforce protection rules for environments that don't exist as repo environment objects.
Fix (operator):
1. Go to https://github.com/raxx-app/TradeMasterAPI/settings/environments/new
2. Name: production (must match exactly — case-sensitive)
3. Add required reviewer: Kristerpher
4. Restrict deployment branches to main only
5. Save
Verification: gh api repos/raxx-app/TradeMasterAPI/environments returns an environment named production with at least one protection rule of type required_reviewers.
Failure mode F: Flag promotion stuck in deploying state (raxx-console-prod)
Last occurrence: 2026-05-11 (RCA: docs/ops/incidents/2026-05-11-stuck-flag-promotion-console-heroku-log-drain-alerting.md)
Symptom: /console/flags/promotions shows a promotion with state "Deploying..." and a disabled button. The soak remaining field reads "deploying..." rather than a countdown. The promotion has been in this state for more than 30 minutes.
Cause: promotions.py uses a daemon thread (promote_async) to run the Heroku config-var PATCH + release poll. When the PATCH succeeds, Heroku triggers an immediate dyno restart. Daemon threads are killed on process exit. The thread has ~10–30 seconds to finish db.session.commit() on the state=deployed transition before the restart kills it. If the thread loses that race, the flag is live in Heroku but the DB row stays in state=deploying with no watchdog to resolve it. See issue #1639 for the fix.
Diagnosis:
Step 1 — Confirm the flag value in Heroku:
heroku config:get FLAG_<FLAG_KEY_UPPER> -a raxx-console-prod
Step 2 — Check the DB row state and reason column:
heroku run -a raxx-console-prod -- python -c "
import os, sys; os.chdir('/app'); sys.path.insert(0, '/app')
from app import create_app; app = create_app()
with app.app_context():
from app.models.flag_promotion import ConsoleFlagPromotion
row = ConsoleFlagPromotion.query.filter_by(flag_key='<flag_key>').first()
print(f'state={row.state!r} deployed_at={row.deployed_at} reason={row.reason!r}')
"
Interpretation:
- state=deploying, deployed_at=None, reason does NOT start with FAILED:, AND flag IS set correctly in Heroku → daemon thread killed before commit; roll forward is safe.
- state=deploying, deployed_at=None, reason starts with FAILED: → Heroku write failed; inspect the error; decide based on actual Heroku value.
- state=deploying, flag is NOT set in Heroku → the PATCH never succeeded; do NOT mark deployed; investigate HEROKU_API_TOKEN_CONSOLE_PROD.
Fix (roll-forward — most common case):
Only proceed if Step 1 confirms the flag is set to the expected value.
heroku run -a raxx-console-prod -- python -c "
import os, sys, json, uuid; os.chdir('/app'); sys.path.insert(0, '/app')
from app import create_app
from datetime import datetime, timezone
app = create_app()
with app.app_context():
from app.models.db import db
from app.models.flag_promotion import ConsoleFlagPromotion, PromotionState
from app.models.audit import AuditLog
row = ConsoleFlagPromotion.query.filter_by(flag_key='<flag_key>', target_app='raxx-console-prod').first()
print(f'BEFORE: state={row.state!r} deployed_at={row.deployed_at}')
assert row.state == 'deploying', f'expected deploying, got {row.state!r}'
now = datetime.now(timezone.utc)
row.state = PromotionState.DEPLOYED.value
row.deployed_at = now
db.session.commit()
print(f'AFTER: state={row.state!r} deployed_at={row.deployed_at}')
audit_row = AuditLog(
id=str(uuid.uuid4()), admin_id=None,
action='console.flag.promotion.ops_complete',
target_type='feature_flag', target_id=row.flag_key,
payload_redacted=json.dumps({'promotion_id': row.id, 'before_state': 'deploying',
'after_state': 'deployed', 'actor': 'sre-agent',
'reason': 'manual ops remediation — daemon thread killed by dyno restart before commit'}),
request_id=None, ip_address='sre-agent', user_agent='sre-agent/ops-remediation',
created_at=now,
)
db.session.add(audit_row); db.session.commit()
print(f'audit row written: {audit_row.id}')
"
Verification: Re-run Step 2 — confirm state=deployed and deployed_at is set. Reload /console/flags/promotions — promotion should appear in the terminal (completed) section.
Do NOT use this fix if:
- The Heroku flag value does not match new_value in the promotion row
- The reason column starts with FAILED: — investigate that error first
Failure mode G: WebAuthn registration challenge miss — verify-with-token returns 400 in 1ms
Last occurrence: 2026-05-25 (RCA: docs/ops/incidents/2026-05-25-signup-challenge-store-miss.md)
Symptom: POST /api/auth/register/verify-with-token returns 400 "Passkey verification failed" in 1ms, repeatedly, even after fresh token mints and single-worker deploys. No webauthn.registration_verify_failed WARNING log emitted (the failure is before webauthn.verify_registration_response is called). The challenge IS stored by begin-with-token (which returns 200) but cannot be found by verify-with-token.
Cause: The _challenge_store dict in webauthn_service.py is a module-level Python object. With gunicorn --preload, the module is imported in the master process before fork. Each worker receives a copy-on-write clone of the module at fork time. The first write to the dict in any process creates a private copy invisible to other processes. If the master process services the begin request (storing the challenge) while the worker services verify (looking up the challenge), the lookup fails.
Diagnosis:
-
Check Heroku logs for the challenge store messages (post-fix, with Redis):
heroku logs -a raxx-api-prod --tail | grep "webauthn.challenge_"Expected on begin:webauthn.challenge_stored store=redis user_id=N challenge_hash=...Expected on verify:webauthn.challenge_popped store=redis user_id=N challenge_hash=...If you seestore=memoryon prod, Redis init failed — checkREDIS_URL. -
Confirm Redis is reachable:
heroku run -a raxx-api-prod -- python3 -c "import os,redis; r=redis.from_url(os.environ['REDIS_URL'],ssl_cert_reqs=None,decode_responses=True); print(r.ping())"Expected:True -
Confirm
REDIS_URLis set:heroku config:get REDIS_URL -a raxx-api-prod
Fix: The fix is already in webauthn_service.py (PR #2728, deployed in v47+). If this failure recurs, Redis is likely down or REDIS_URL is unset. Remediate the Redis issue; do NOT revert to the in-process dict.
Verification: After mint + retry, Heroku logs show webauthn.challenge_popped store=redis and verify-with-token returns 200.
IMPORTANT — do NOT ship begin-with-token without verify-with-token in the same PR. If the begin endpoint ships first (without verify), the frontend falls back to the old /register/verify endpoint on first attempt, consuming the challenge. Every subsequent attempt with a new token fails because the user_id exists but the challenge from the second begin is not consumed. The fix is a single atomic PR containing both endpoints.
Emergency stop
To take raxx-api-prod offline cleanly (0 dynos, keeps the app and data intact):
heroku ps:scale web=0 --app raxx-api-prod
To confirm it's offline:
heroku ps --app raxx-api-prod
# should show: No dynos on raxx-api-prod
To restore (after a clean slug exists):
heroku ps:scale web=1 --app raxx-api-prod
Note: scaling to 0 is an emergency stop only. If no slug has been deployed, the app is already effectively at 0 dynos and there is nothing to stop.
Escalation
Wake the operator when:
- A production deploy run fails after the required-reviewer gate has been passed (indicates a deploy-time failure, not a gate failure)
- heroku logs shows Error R14 (memory) or repeated Error R10 (boot timeout) and the standard fixes do not resolve within 30 min
- A Heroku incident is active (check https://status.heroku.com) that affects the us region
- The Heroku app is unreachable AND Heroku status is green (possible data corruption or misconfiguration requiring dashboard access)
Who: Kristerpher (ops@raxx.app / Slack #MooseQuest workspace)