Incident ID: 2026-04-30-raxx-api-prod-down
Date: 2026-04-30
Severity: SEV-1
Duration: 8 days latent (app created 2026-04-22, first detected 2026-04-30); active investigation 2026-04-30 ~16:30–17:30 UTC
Blast radius: raxx-api-prod (Raptor production backend) serving HTTP 502 on all requests. Pre-launch, so 0 external users affected. Internal systems that depend on the production API URL (console.raxx.app prod surface, Sentry DSN, status worker) could not reach the health endpoint.
Author: sre-agent
raxx-api-prod was provisioned on 2026-04-22 with config vars but no code was ever deployed to it. All 7 Heroku releases were config-var-only (slug null on every release). Two CI/CD workflows exist that are supposed to cover the production deploy path, but both have structural gaps that prevented any production deploy from occurring in the 8 days since the app was created. The deploy.yml staging path has been failing on every push to main due to a missing dontautocreate: true flag; its production path has never been triggered because no v* tag has been pushed. The deploy-heroku.yml production path requires a workflow_dispatch that has never been triggered, and the required production GitHub Environment does not exist. Remediation requires: (1) operator creates the production GitHub Environment, (2) operator triggers workflow_dispatch on deploy-heroku.yml with environment=production, and (3) a PR to fix the missing dontautocreate: true in deploy.yml.
raxx-api-prod Heroku app created (v1)workflow_dispatch run ever executed, production GH Environment missing, deploy.yml broken on staging for entire periodhttps://raxx-api-prod-a60a19e5efbf.herokuapp.com/health returns HTTP 502 on all requestsdeploy-heroku.yml workflow is correctly structured for prod — it has the right app name, freeze-check gate, smoke gate, and dontautocreate: true. Once the production GH Environment is created and the dispatch is triggered, the first deploy should succeed without code changes.raxx-api-prod — the app is ready to receive a slug.Procfile is valid and correctly points to backend_v2/app.py via gunicorn.deploy-heroku.yml have been succeeding reliably, validating the deploy mechanism.deploy.yml and deploy-heroku.yml) with different trigger models, different gating approaches, and different levels of correctness. This created confusion about which workflow was the authoritative prod deploy path.deploy.yml has been failing its staging deploy-staging job on every single push to main since before the app existed. These failures were not escalated or acted on — they became background noise.production GitHub Environment was never created. The deploy-heroku.yml deploy job references environment: production but the environment itself does not exist in the repo, so the required-reviewer protection (per ADR-0020) was never enforced.Contributing factor 1: No workflow_dispatch ever triggered for production — The deploy-heroku.yml workflow correctly gates production behind workflow_dispatch. However, no one ever triggered it manually. The system allowed production to remain un-deployed indefinitely because the push-to-main automation only targets staging, and the staging deploys appeared successful (masking the fact that production was never seeded).
Contributing factor 2: production GitHub Environment not provisioned — The deploy job in deploy-heroku.yml uses environment: production. GitHub Environments gate approvals; without the environment existing, the environment protection rules (required reviewers, wait timers) that ADR-0020 mandates for prod are not enforced. The deploy job would still run without the environment record, but the required-reviewer gate would be absent — meaning prod could be deployed without approval if someone triggered the dispatch.
Contributing factor 3: deploy.yml missing dontautocreate: true on staging deploy — The deploy.yml workflow uses akhileshns/heroku-deploy@v3.13.15 without dontautocreate: true. When raxx-api-staging already exists, the action's default behavior is to call heroku create <appname>, which in a non-interactive CI environment triggers the Heroku CLI browser-login prompt, crashes with TypeError: process.stdin.setRawMode is not a function, and fails the job. This has caused 100% failure rate on the deploy-staging job in deploy.yml since the workflow was created. Because deploy-heroku.yml was independently succeeding on staging, this persistent failure went unnoticed.
Contributing factor 4: No health monitor on production app slug presence — There is no alerting that would fire when raxx-api-prod has 0 dynos or slug: null. The incident was detected incidentally by a console triage agent, not by a proactive monitor.
/apps/raxx-api-prod/releases — if latest release has slug: null, fire a SEV-2 alert. Add to nightly SRE sweep.Not yet complete — pending operator actions below.
The code and configuration to deploy are ready. The blockers are two GitHub administrative actions that require operator (Kristerpher) execution:
production GitHub EnvironmentIn the GitHub repo settings under Environments, create a new environment named exactly production. Configure:
- Required reviewers: Kristerpher (minimum 1)
- Wait timer: 0 min (can increase after first successful deploy)
- Deployment branches: main only (restrict to protected branch)
URL: https://github.com/MooseQuest/TradeMasterAPI/settings/environments/new
deploy-heroku.ymlAfter the production environment exists:
1. Go to: https://github.com/MooseQuest/TradeMasterAPI/actions/workflows/deploy-heroku.yml
2. Click "Run workflow"
3. Set environment = production
4. Set ref = main
5. Click "Run workflow"
6. Approve the deployment when the required-reviewer gate appears
7. Monitor the run — the smoke gate will run first, then the deploy job, then the health check
8. Verify: curl -s https://raxx-api-prod-a60a19e5efbf.herokuapp.com/health returns HTTP 200
deploy.yml missing dontautocreatePR filed (see action items) to add dontautocreate: true to the deploy-staging and deploy-prod jobs in deploy.yml. This is a low-risk one-line addition that matches the pattern already in deploy-heroku.yml.
heroku ps --app raxx-api-prod shows web.1: upheroku releases --app raxx-api-prod shows a Deploy ... entry (not config-var-only)curl -s https://raxx-api-prod-a60a19e5efbf.herokuapp.com/api/system/status returns HTTP 200slug_size is non-null| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Create production GitHub Environment with required-reviewer protection |
Kristerpher | 2026-04-30 | #690 |
| 2 | Trigger workflow_dispatch on deploy-heroku.yml with environment=production, ref=main |
Kristerpher | 2026-04-30 | #690 |
| 3 | Fix missing dontautocreate: true in deploy.yml staging + prod jobs |
sre-agent (PR) | 2026-05-01 | #691 |
| 4 | Add nightly monitor: Heroku API check that raxx-api-prod latest release has non-null slug |
sre-agent | 2026-05-07 | #692 |
| 5 | Add Heroku dyno-count monitor: alert SEV-2 if raxx-api-prod has 0 running dynos for >5 min |
sre-agent | 2026-05-07 | #692 |
| 6 | Deprecate or align deploy.yml vs deploy-heroku.yml — two overlapping workflows for the same target is a maintenance hazard |
operator | 2026-05-07 | #693 |
docs/ops/runbooks/heroku.md (created this incident)GET /apps/raxx-api-prod — slug_size: null, repo_size: nulldeploy.yml — Deploy to Heroku staging job, error: heroku create raxx-api-staging: TypeError: process.stdin.setRawModedeploy-heroku.yml — all push-to-main runs to staging pass