Raxx · internal docs

Heroku runbook

System: raxx-api-staging + raxx-api-prod (Raptor backend on Heroku); raxx-console-prod (Console app, flag promotion machinery)
Owner: operator / sre-agent
Last incident: 2026-05-11 (see docs/ops/incidents/2026-05-11-stuck-flag-promotion-console-heroku-log-drain-alerting.md)
Last reviewed: 2026-05-14
Rollback runbook: docs/ops/runbooks/rollback.md (step-by-step release rollback for all Heroku apps, comms templates, and drill record). If a production release must be reverted, start there.

Workflow architecture (deploy source of truth)

As of 2026-05-01 (PR #837, closes issues #834 + #698), the canonical deploy workflows are:

| Workflow | File | Trigger | Purpose |
|---|---|---|---|
| Deploy to Heroku | .github/workflows/deploy-heroku.yml | push to main (staging) / workflow_dispatch (staging or prod) | Single source of truth for all Raptor backend Heroku deploys |
| Deploy Antlers | .github/workflows/deploy-antlers.yml | push to main or v* tag touching frontend/trademaster_ui/** | Antlers (frontend) → Cloudflare Pages |
| PR Preview | .github/workflows/pr-preview.yml | pull_request open/sync/reopen | PR preview deploys for Antlers + mockups (not production) |

.github/workflows/deploy.yml was deleted in PR #837 (which closed issue #834). It was the redundant workflow:
- Its Heroku jobs duplicated deploy-heroku.yml and used the broken akhileshns/heroku-deploy action, with a smoke step that hit /api/system/status (which returns 403 when FLAG_ENFORCE_CF_ORIGIN is enforced).
- Its Antlers job was extracted to deploy-antlers.yml.

Bottom line: If a Heroku deploy is failing, look at deploy-heroku.yml. There is no other backend deploy workflow.

How to tell it's broken

App reference

| App | URL | Stack |
|---|---|---|
| raxx-api-staging | https://raxx-api-staging-1a19fb3873b9.herokuapp.com | heroku-24 |
| raxx-api-prod | https://raxx-api-prod-a60a19e5efbf.herokuapp.com | heroku-24 |

How to diagnose (in order)

  1. Check dyno formation:
       heroku ps --app raxx-api-prod
     Expected: web.1: up (HH:MM:SS)
  2. Check release history:
       heroku releases --app raxx-api-prod
     Look for Deploy ... entries. If every entry says Set ... config vars and there is no Deploy entry, no code has ever been deployed.
  3. Check slug presence via the Heroku API. Set the token on its own line first; if the assignment shares a line with curl, the shell expands $HEROKU_API_KEY before the assignment takes effect and the header is sent empty.
       HEROKU_API_KEY=$(heroku auth:token)
       curl -s -H "Authorization: Bearer $HEROKU_API_KEY" \
         -H "Accept: application/vnd.heroku+json; version=3" \
         "https://api.heroku.com/apps/raxx-api-prod/releases" \
         | python3 -c "import json,sys; r=json.load(sys.stdin); print('Latest slug:', r[-1]['slug'])"
     Expected: {'id': '<uuid>', 'name': 'slug-name'}. If it prints None (JSON null), no code has been deployed.
  4. Check CI workflow runs:
       gh run list --workflow=deploy-heroku.yml --limit 10 --json event,conclusion,displayTitle | python3 -m json.tool
     - All "event": "push" → only staging has been deployed; production is untouched
     - Any "event": "workflow_dispatch" with "conclusion": "success" → a manually dispatched deploy succeeded at that time. Because workflow_dispatch can target staging or prod, confirm on the run page that it targeted production.
  5. Check GitHub Environments:
       gh api repos/raxx-app/TradeMasterAPI/environments
     Confirm the production environment exists with protection rules.
  6. Check Heroku logs:
       heroku logs --tail --app raxx-api-prod
     Look for crash/OOM/boot errors.
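
Steps 2 and 3 above can be folded into one check. The following is a minimal sketch, not code from the repo: summarize_releases is a hypothetical helper that takes the parsed JSON list from the releases endpoint and reports whether any release carries a slug. It assumes each release dict has the version and slug fields, and it picks the latest release by version number instead of relying on list order.

```python
def summarize_releases(releases):
    """Summarize a parsed Heroku Platform API release list.

    `releases` is the list of dicts returned by GET /apps/<app>/releases.
    Returns (latest_version, latest_slug_id, code_ever_deployed).
    """
    if not releases:
        return None, None, False

    # Pick the newest release explicitly instead of trusting list order.
    latest = max(releases, key=lambda r: r["version"])

    # `slug` is null (None) when no code has been deployed yet.
    slug = latest.get("slug")
    latest_slug_id = slug["id"] if slug else None

    # True as soon as any release in the history references a slug.
    code_ever_deployed = any(r.get("slug") for r in releases)
    return latest["version"], latest_slug_id, code_ever_deployed
```

To use it, save the curl output from step 3 to a file, load it with json.load, and pass the resulting list in. A result whose last element is False corresponds to failure mode A.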

Known failure modes

Failure mode A: No slug deployed — app created but never had code pushed

Symptom: heroku ps shows no dynos; all releases in heroku releases are config-var-only; health endpoint returns 502; slug_size: null in Platform API.

Cause: The production deploy path requires a manual workflow_dispatch on deploy-heroku.yml. If this has never been triggered, no slug exists. (The old deploy.yml prod path required a v* git tag — that workflow was deleted in #834.)
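
The workflow-run check from the diagnosis list can be automated against the JSON that gh run list --json event,conclusion,displayTitle prints. This is a sketch under that assumption; both function names are hypothetical. Note that a successful workflow_dispatch may have targeted staging, so a True result from the first function still needs a look at the run page.

```python
def manual_deploy_succeeded(runs):
    """True if any run was a successful workflow_dispatch.

    `runs` is the parsed list from
    `gh run list --workflow=deploy-heroku.yml --json event,conclusion,displayTitle`.
    """
    return any(
        run.get("event") == "workflow_dispatch" and run.get("conclusion") == "success"
        for run in runs
    )


def only_push_deploys(runs):
    """True if there is at least one run and every run was triggered by a push.

    In this runbook's terms that means only staging has been deployed.
    """
    return bool(runs) and all(run.get("event") == "push" for run in runs)
```

If only_push_deploys returns True, production has never been deployed through this workflow and the fix below applies.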

Fix (requires operator):

Step 1: Ensure the production GitHub Environment exists:
- URL: https://github.com/raxx-app/TradeMasterAPI/settings/environments
- If missing: create an environment named production, add the required reviewer (Kristerpher), and restrict it to the main branch

Step 2: Trigger the first production deploy:
- URL: https://github.com/raxx-app/TradeMasterAPI/actions/workflows/deploy-heroku.yml
- Click "Run workflow" > set environment=production, ref=main > Run
- Approve the deployment at the required-reviewer gate
- Monitor until the health check passes

Verification: heroku ps --app raxx-api-prod shows web.1: up; curl returns HTTP 200 from /api/system/status


Failure mode B: Dyno crash loop (R10/H10/H12 errors)

Symptom: heroku ps shows dynos crashing; heroku logs --tail shows R10 (boot timeout), H10 (app crashed), or H12 (request timeout).

Cause: Application boot failure (bad config var, missing dependency, OOM) or a request handler that times out.

Diagnosis:

heroku logs --tail --num 200 --app raxx-api-prod

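Look for router errors of the form at=error code=H10 (app crashed) or at=error code=H12 (request timeout), and dyno errors of the form Error R10 (Boot timeout) or Error R14 (Memory quota exceeded).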

Fix: Depends on the error code:
- R10 (boot timeout, >60s): Check gunicorn startup. If the boot is legitimately slow, increase the timeout with heroku config:set GUNICORN_TIMEOUT=120 --app raxx-api-prod and redeploy
- R14 (memory exceeded): Scale up the dyno type or profile the memory leak
- H10 (crash): Check the startup traceback in the logs; the most common cause is a missing env var or an import error
- H12 (request timeout): Use the path in the router log line to identify the request handler that timed out

Verification: heroku ps --app raxx-api-prod shows web.1: up for >5 min with no restarts
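
When the log window is noisy, counting error codes makes the dominant failure obvious. The sketch below is illustrative only: count_error_codes is a hypothetical helper, and the two patterns assume the usual Heroku shapes (router lines containing at=error code=Hnn, dyno lines containing Error Rnn).

```python
import re
from collections import Counter

# Router errors, e.g.: at=error code=H10 desc="App crashed" ...
ROUTER_ERROR = re.compile(r"\bat=error code=(H\d{2})\b")
# Dyno runtime errors, e.g.: Error R10 (Boot timeout) -> ...
RUNTIME_ERROR = re.compile(r"\bError (R\d{2})\b")


def count_error_codes(log_lines):
    """Count Heroku H*/R* error codes found in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        for pattern in (ROUTER_ERROR, RUNTIME_ERROR):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

Feed it the output of heroku logs --num 200 --app raxx-api-prod, split into lines, and match the most frequent code against the fix list above.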


Failure mode C: CI deploy fails — heroku create auth error (historical)

Status: No longer possible. deploy.yml was deleted in PR #837 (which closed issue #834). deploy-heroku.yml uses the heroku CLI + scoped credential.helper pattern (spike #825), which does not call heroku create.

Historical symptom (for reference): deploy.yml failed at "Deploy to Heroku (staging)" with heroku create <appname>... TypeError: process.stdin.setRawMode is not a function. This was caused by akhileshns/heroku-deploy lacking dontautocreate: true.

If you see a similar error in a different workflow: Ensure any akhileshns/heroku-deploy usage has dontautocreate: true. But the canonical deploy-heroku.yml does not use that action.


Failure mode D: Deploy freeze blocks production deploy

Symptom: deploy-heroku.yml workflow runs but freeze-check job fails; run is blocked.

Cause: Global deploy freeze is active (set via console.raxx.app or DEPLOY_FREEZE_OVERRIDE=1 secret).

Diagnosis: Check https://console.raxx.app deploy freeze toggle; check repo secret DEPLOY_FREEZE_OVERRIDE.

Fix:
- If the freeze is intentional: wait for the operator to lift it via the console
- If the freeze is stuck (console unreachable): set the DEPLOY_FREEZE_OVERRIDE repo secret to 0 to lift it, OR leave it at 1 to enforce the freeze even when the console is down (see docs/ops/runbooks/deploy-freeze.md)

Verification: Freeze check job passes; deploy proceeds.


Failure mode E: production GitHub Environment missing — no required-reviewer gate

Symptom: workflow_dispatch on deploy-heroku.yml with environment=production completes without an approval prompt; OR gh api repos/raxx-app/TradeMasterAPI/environments returns only staging.

Cause: The production GitHub Environment was never created. The deploy job references environment: production in the workflow YAML, but GitHub does not enforce protection rules for environments that don't exist as repo environment objects.

Fix (operator):
  1. Go to https://github.com/raxx-app/TradeMasterAPI/settings/environments/new
  2. Name: production (must match exactly; case-sensitive)
  3. Add required reviewer: Kristerpher
  4. Restrict deployment branches to main only
  5. Save

Verification: gh api repos/raxx-app/TradeMasterAPI/environments returns an environment named production with at least one protection rule of type required_reviewers.
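
The verification can be scripted against the JSON that gh api prints. This is a minimal sketch; production_gate_present is a hypothetical name, and it assumes the response has a top-level environments list whose entries carry name and protection_rules.

```python
def production_gate_present(payload, name="production"):
    """True if the named environment exists and has a required_reviewers rule.

    `payload` is the parsed JSON from
    `gh api repos/raxx-app/TradeMasterAPI/environments`.
    """
    for env in payload.get("environments", []):
        # Environment names are matched exactly (case-sensitive).
        if env.get("name") == name:
            rules = env.get("protection_rules", [])
            return any(rule.get("type") == "required_reviewers" for rule in rules)
    # Environment not found at all: failure mode E.
    return False
```

A False result means either the environment is missing or it exists without a reviewer gate; both need the operator fix above.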


Failure mode F: Flag promotion stuck in deploying state (raxx-console-prod)

Last occurrence: 2026-05-11 (RCA: docs/ops/incidents/2026-05-11-stuck-flag-promotion-console-heroku-log-drain-alerting.md)

Symptom: /console/flags/promotions shows a promotion with state "Deploying..." and a disabled button. The soak remaining field reads "deploying..." rather than a countdown. The promotion has been in this state for more than 30 minutes.

Cause: promotions.py uses a daemon thread (promote_async) to run the Heroku config-var PATCH + release poll. When the PATCH succeeds, Heroku triggers an immediate dyno restart. Daemon threads are killed on process exit. The thread has ~10–30 seconds to finish db.session.commit() on the state=deployed transition before the restart kills it. If the thread loses that race, the flag is live in Heroku but the DB row stays in state=deploying with no watchdog to resolve it. See issue #1639 for the fix.
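
The daemon-thread half of this race is easy to reproduce in isolation. The sketch below is not the promotions.py code: it starts a child interpreter whose promote_async stand-in prints, sleeps in place of the release poll, and then would print a commit message. The main thread exits as soon as the worker has started, and the second print never happens. It uses a normal interpreter exit, not a dyno restart, so it only illustrates that daemon threads are not waited for.

```python
import subprocess
import sys
import textwrap

# Script run in a child interpreter. promote_async here is a stand-in,
# not the real function from promotions.py.
CHILD = textwrap.dedent("""
    import threading, time

    started = threading.Event()

    def promote_async():
        print("PATCH sent", flush=True)
        started.set()
        time.sleep(2)  # stand-in for the release poll
        print("commit done", flush=True)  # not reached: process exits first

    threading.Thread(target=promote_async, daemon=True).start()
    started.wait()
    # The main thread ends here. The interpreter does not wait for
    # daemon threads, so the sleeping thread is abandoned.
""")


def run_demo():
    """Run the child script and return everything it wrote to stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout
```

The output of run_demo() contains the PATCH line but not the commit line, which is the same shape as the stuck row: the Heroku write happened, the DB write did not.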

Diagnosis:

Step 1 — Confirm the flag value in Heroku:

heroku config:get FLAG_<FLAG_KEY_UPPER> -a raxx-console-prod

Step 2 — Check the DB row state and reason column:

heroku run -a raxx-console-prod -- python -c "
import os, sys; os.chdir('/app'); sys.path.insert(0, '/app')
from app import create_app; app = create_app()
with app.app_context():
    from app.models.flag_promotion import ConsoleFlagPromotion
    row = ConsoleFlagPromotion.query.filter_by(flag_key='<flag_key>').first()
    print(f'state={row.state!r} deployed_at={row.deployed_at} reason={row.reason!r}')
"

Interpretation:
- state=deploying, deployed_at=None, reason does NOT start with FAILED:, AND the flag IS set correctly in Heroku → the daemon thread was killed before commit; roll forward is safe.
- state=deploying, deployed_at=None, reason starts with FAILED: → the Heroku write failed; inspect the error and decide based on the actual Heroku value.
- state=deploying, flag is NOT set in Heroku → the PATCH never succeeded; do NOT mark deployed; investigate HEROKU_API_TOKEN_CONSOLE_PROD.
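
The three cases above can be written as a small decision function, which is handy as a checklist before touching the row. This is a sketch only: classify_stuck_promotion and its return labels are hypothetical, and it assumes the values from Step 1 and Step 2 are passed in as plain strings (heroku config:get prints a string, so compare against the expected value as a string).

```python
def classify_stuck_promotion(state, deployed_at, reason, heroku_value, expected_value):
    """Map Step 1 / Step 2 findings to one of the runbook's outcomes.

    Returns one of: "not-stuck", "inspect-error",
    "do-not-mark-deployed", "roll-forward".
    """
    # Only rows still in deploying with no deployed_at are "stuck".
    if state != "deploying" or deployed_at is not None:
        return "not-stuck"
    # A recorded failure always needs a human look at the error first.
    if (reason or "").startswith("FAILED:"):
        return "inspect-error"
    # The PATCH never landed: the flag is not live in Heroku.
    if heroku_value != expected_value:
        return "do-not-mark-deployed"
    # Flag is live, no failure recorded: thread was killed before commit.
    return "roll-forward"
```

Only the "roll-forward" outcome matches the fix that follows; every other outcome means stop and investigate.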

Fix (roll-forward — most common case):

Only proceed if Step 1 confirms the flag is set to the expected value.

heroku run -a raxx-console-prod -- python -c "
import os, sys, json, uuid; os.chdir('/app'); sys.path.insert(0, '/app')
from app import create_app
from datetime import datetime, timezone
app = create_app()
with app.app_context():
    from app.models.db import db
    from app.models.flag_promotion import ConsoleFlagPromotion, PromotionState
    from app.models.audit import AuditLog
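    row = ConsoleFlagPromotion.query.filter_by(flag_key='<flag_key>', target_app='raxx-console-prod').first()
    # Fail with a clear message if the lookup matched nothing.
    assert row is not None, 'no promotion row found for this flag_key/target_app'
    print(f'BEFORE: state={row.state!r} deployed_at={row.deployed_at}')
    assert row.state == 'deploying', f'expected deploying, got {row.state!r}'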
    now = datetime.now(timezone.utc)
    row.state = PromotionState.DEPLOYED.value
    row.deployed_at = now
    db.session.commit()
    print(f'AFTER: state={row.state!r} deployed_at={row.deployed_at}')
    audit_row = AuditLog(
        id=str(uuid.uuid4()), admin_id=None,
        action='console.flag.promotion.ops_complete',
        target_type='feature_flag', target_id=row.flag_key,
        payload_redacted=json.dumps({'promotion_id': row.id, 'before_state': 'deploying',
            'after_state': 'deployed', 'actor': 'sre-agent',
            'reason': 'manual ops remediation — daemon thread killed by dyno restart before commit'}),
        request_id=None, ip_address='sre-agent', user_agent='sre-agent/ops-remediation',
        created_at=now,
    )
    db.session.add(audit_row); db.session.commit()
    print(f'audit row written: {audit_row.id}')
"

Verification: Re-run Step 2 — confirm state=deployed and deployed_at is set. Reload /console/flags/promotions — promotion should appear in the terminal (completed) section.

Do NOT use this fix if:
- The Heroku flag value does not match new_value in the promotion row
- The reason column starts with FAILED: (investigate that error first)

Failure mode G: WebAuthn registration challenge miss — verify-with-token returns 400 in 1ms

Last occurrence: 2026-05-25 (RCA: docs/ops/incidents/2026-05-25-signup-challenge-store-miss.md)

Symptom: POST /api/auth/register/verify-with-token returns 400 "Passkey verification failed" in 1ms, repeatedly, even after fresh token mints and single-worker deploys. No webauthn.registration_verify_failed WARNING log emitted (the failure is before webauthn.verify_registration_response is called). The challenge IS stored by begin-with-token (which returns 200) but cannot be found by verify-with-token.

Cause: The _challenge_store dict in webauthn_service.py is a module-level Python object. With gunicorn --preload, the module is imported in the master process before fork. Each worker receives a copy-on-write clone of the module at fork time. The first write to the dict in any process creates a private copy invisible to other processes. If the process that services the begin request (storing the challenge) is not the same process that services verify (looking up the challenge), the lookup fails.

Diagnosis:

  1. Check Heroku logs for the challenge store messages (post-fix, with Redis):
       heroku logs -a raxx-api-prod --tail | grep "webauthn.challenge_"
     Expected on begin: webauthn.challenge_stored store=redis user_id=N challenge_hash=...
     Expected on verify: webauthn.challenge_popped store=redis user_id=N challenge_hash=...
     If you see store=memory on prod, Redis init failed; check REDIS_URL.

  2. Confirm Redis is reachable:
       heroku run -a raxx-api-prod -- python3 -c "import os,redis; r=redis.from_url(os.environ['REDIS_URL'],ssl_cert_reqs=None,decode_responses=True); print(r.ping())"
     Expected: True

  3. Confirm REDIS_URL is set:
       heroku config:get REDIS_URL -a raxx-api-prod
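
The log check in step 1 can be turned into a quick scan for the store=memory fallback. This is a sketch that assumes the log lines have exactly the shape quoted in step 1 (event name followed by store=<backend>); both function names are hypothetical.

```python
import re

# Matches e.g. "webauthn.challenge_stored store=redis user_id=7 ..."
CHALLENGE_EVENT = re.compile(r"webauthn\.challenge_(stored|popped)\s+store=(\w+)")


def challenge_store_backends(log_lines):
    """Return {"stored": {...}, "popped": {...}} with the backends seen per event."""
    seen = {"stored": set(), "popped": set()}
    for line in log_lines:
        match = CHALLENGE_EVENT.search(line)
        if match:
            seen[match.group(1)].add(match.group(2))
    return seen


def redis_init_suspect(log_lines):
    """True if any challenge event was served from the in-memory store."""
    seen = challenge_store_backends(log_lines)
    return any("memory" in backends for backends in seen.values())
```

If redis_init_suspect returns True for production logs, continue with steps 2 and 3 to find out why Redis was not used.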

Fix: The fix is already in webauthn_service.py (PR #2728, deployed in v47+). If this failure recurs, Redis is likely down or REDIS_URL is unset. Remediate the Redis issue; do NOT revert to the in-process dict.

Verification: After mint + retry, Heroku logs show webauthn.challenge_popped store=redis and verify-with-token returns 200.

IMPORTANT — do NOT ship begin-with-token without verify-with-token in the same PR. If the begin endpoint ships first (without verify), the frontend falls back to the old /register/verify endpoint on first attempt, consuming the challenge. Every subsequent attempt with a new token fails because the user_id exists but the challenge from the second begin is not consumed. The fix is a single atomic PR containing both endpoints.


Emergency stop

To take raxx-api-prod offline cleanly (0 dynos, keeps the app and data intact):

heroku ps:scale web=0 --app raxx-api-prod

To confirm it's offline:

heroku ps --app raxx-api-prod
# should show: No dynos on raxx-api-prod

To restore (after a clean slug exists):

heroku ps:scale web=1 --app raxx-api-prod

Note: scaling to 0 is an emergency stop only. If no slug has been deployed, the app is already effectively at 0 dynos and there is nothing to stop.

Escalation

Wake the operator when:
- A production deploy run fails after the required-reviewer gate has been passed (indicates a deploy-time failure, not a gate failure)
- heroku logs shows Error R14 (memory) or repeated Error R10 (boot timeout) and the standard fixes do not resolve it within 30 min
- A Heroku incident affecting the us region is active (check https://status.heroku.com)
- The Heroku app is unreachable AND Heroku status is green (possible data corruption or misconfiguration requiring dashboard access)

Who: Kristerpher (ops@raxx.app / Slack #MooseQuest workspace)