RCA — raxx-api-prod: 0 dynos, no slug, never deployed
Incident ID: 2026-04-30-raxx-api-prod-down
Date: 2026-04-30
Severity: SEV-1
Duration: 8 days latent (app created 2026-04-22, first detected 2026-04-30); active investigation 2026-04-30 ~16:30–17:30 UTC
Blast radius: raxx-api-prod (Raptor production backend) serving HTTP 502 on all requests. Pre-launch, so 0 external users affected. Internal systems that depend on the production API URL (console.raxx.app prod surface, Sentry DSN, status worker) could not reach the health endpoint.
Author: sre-agent
Summary
raxx-api-prod was provisioned on 2026-04-22 with config vars but no code was ever deployed to it. All 7 Heroku releases were config-var-only (slug null on every release). Two CI/CD workflows exist that are supposed to cover the production deploy path, but both have structural gaps that prevented any production deploy from occurring in the 8 days since the app was created. The deploy.yml staging path has been failing on every push to main due to a missing dontautocreate: true flag; its production path has never been triggered because no v* tag has been pushed. The deploy-heroku.yml production path requires a workflow_dispatch that has never been triggered, and the required production GitHub Environment does not exist. Remediation requires: (1) operator creates the production GitHub Environment, (2) operator triggers workflow_dispatch on deploy-heroku.yml with environment=production, and (3) a PR to fix the missing dontautocreate: true in deploy.yml.
Timeline (all times UTC)
- 2026-04-22 06:53 —
raxx-api-prodHeroku app created (v1) - 2026-04-22 06:53 — Logplex enabled (v2)
- 2026-04-22 06:55 — PostgreSQL addon provisioned, DATABASE_URL set (v3)
- 2026-04-22 06:58 — FLASK_ENV, FLASK_DEBUG, FRONTEND_ORIGIN config vars set (v4)
- 2026-04-22 07:27 — SENTRY_DSN_BACKEND set (v5)
- 2026-04-30 02:21 — STATUS_INTERNAL_WRITE_TOKEN set (v6)
- 2026-04-30 02:21 — STATUS_WORKER_URL set (v7)
- 2026-04-30 ~09:00 — Prior SRE agent triage of console.raxx.app detects degraded prod API surface; incident opened as #690
- 2026-04-30 ~16:30 — This investigation started; current state confirmed via Heroku Platform API + heroku CLI
- 2026-04-30 ~16:45 — Root cause identified: no
workflow_dispatchrun ever executed,productionGH Environment missing,deploy.ymlbroken on staging for entire period - 2026-04-30 ~17:00 — RCA drafted; remediation steps documented; sub-issues filed
- Unresolved — pending operator action to trigger first prod deploy
Impact
- Users affected: 0 (Raxx is pre-launch; no external customers)
- User-visible symptoms:
https://raxx-api-prod-a60a19e5efbf.herokuapp.com/healthreturns HTTP 502 on all requests - Data integrity: ok (PostgreSQL addon present, no data written without a running API)
- Revenue / billing: ok (pre-launch)
What went well
- The
deploy-heroku.ymlworkflow is correctly structured for prod — it has the right app name, freeze-check gate, smoke gate, anddontautocreate: true. Once theproductionGH Environment is created and the dispatch is triggered, the first deploy should succeed without code changes. - Config vars are all present on
raxx-api-prod— the app is ready to receive a slug. - The
Procfileis valid and correctly points tobackend_v2/app.pyvia gunicorn. - Staging deploys via
deploy-heroku.ymlhave been succeeding reliably, validating the deploy mechanism.
What didn't go well
- The Heroku app was created and config vars were set across 8 days with no system check that a slug had ever been deployed. The system allowed an app to exist in a "configured but not deployed" state indefinitely with no alert.
- Two workflows exist for overlapping purposes (
deploy.ymlanddeploy-heroku.yml) with different trigger models, different gating approaches, and different levels of correctness. This created confusion about which workflow was the authoritative prod deploy path. deploy.ymlhas been failing its stagingdeploy-stagingjob on every single push to main since before the app existed. These failures were not escalated or acted on — they became background noise.- The
productionGitHub Environment was never created. Thedeploy-heroku.ymldeployjob referencesenvironment: productionbut the environment itself does not exist in the repo, so the required-reviewer protection (per ADR-0020) was never enforced. - No runbook existed for the Heroku deploy path prior to this incident.
Root cause analysis
-
Contributing factor 1: No
workflow_dispatchever triggered for production — Thedeploy-heroku.ymlworkflow correctly gates production behindworkflow_dispatch. However, no one ever triggered it manually. The system allowed production to remain un-deployed indefinitely because the push-to-main automation only targets staging, and the staging deploys appeared successful (masking the fact that production was never seeded). -
Contributing factor 2:
productionGitHub Environment not provisioned — Thedeployjob indeploy-heroku.ymlusesenvironment: production. GitHub Environments gate approvals; without the environment existing, the environment protection rules (required reviewers, wait timers) that ADR-0020 mandates for prod are not enforced. The deploy job would still run without the environment record, but the required-reviewer gate would be absent — meaning prod could be deployed without approval if someone triggered the dispatch. -
Contributing factor 3:
deploy.ymlmissingdontautocreate: trueon staging deploy — Thedeploy.ymlworkflow usesakhileshns/heroku-deploy@v3.13.15withoutdontautocreate: true. Whenraxx-api-stagingalready exists, the action's default behavior is to callheroku create <appname>, which in a non-interactive CI environment triggers the Heroku CLI browser-login prompt, crashes withTypeError: process.stdin.setRawMode is not a function, and fails the job. This has caused 100% failure rate on thedeploy-stagingjob indeploy.ymlsince the workflow was created. Becausedeploy-heroku.ymlwas independently succeeding on staging, this persistent failure went unnoticed. -
Contributing factor 4: No health monitor on production app slug presence — There is no alerting that would fire when
raxx-api-prodhas 0 dynos orslug: null. The incident was detected incidentally by a console triage agent, not by a proactive monitor.
Detection
- What alerted us: Console.raxx.app degraded surface triage by prior SRE agent (2026-04-30 ~09:00 UTC)
- How long between cause and detection: 8 days (app created 2026-04-22, detected 2026-04-30)
- How to detect faster next time: Heroku Platform API check on
/apps/raxx-api-prod/releases— if latest release hasslug: null, fire a SEV-2 alert. Add to nightly SRE sweep.
Resolution
Not yet complete — pending operator actions below.
The code and configuration to deploy are ready. The blockers are two GitHub administrative actions that require operator (Kristerpher) execution:
Operator action 1 — Create production GitHub Environment
In the GitHub repo settings under Environments, create a new environment named exactly production. Configure:
- Required reviewers: Kristerpher (minimum 1)
- Wait timer: 0 min (can increase after first successful deploy)
- Deployment branches: main only (restrict to protected branch)
URL: https://github.com/MooseQuest/TradeMasterAPI/settings/environments/new
Operator action 2 — Trigger first production deploy via deploy-heroku.yml
After the production environment exists:
1. Go to: https://github.com/MooseQuest/TradeMasterAPI/actions/workflows/deploy-heroku.yml
2. Click "Run workflow"
3. Set environment = production
4. Set ref = main
5. Click "Run workflow"
6. Approve the deployment when the required-reviewer gate appears
7. Monitor the run — the smoke gate will run first, then the deploy job, then the health check
8. Verify: curl -s https://raxx-api-prod-a60a19e5efbf.herokuapp.com/health returns HTTP 200
Code fix — deploy.yml missing dontautocreate
PR filed (see action items) to add dontautocreate: true to the deploy-staging and deploy-prod jobs in deploy.yml. This is a low-risk one-line addition that matches the pattern already in deploy-heroku.yml.
Validation
heroku ps --app raxx-api-prodshowsweb.1: upheroku releases --app raxx-api-prodshows aDeploy ...entry (not config-var-only)curl -s https://raxx-api-prod-a60a19e5efbf.herokuapp.com/api/system/statusreturns HTTP 200- Heroku Platform API:
slug_sizeis non-null
Action items
| # | Action | Owner | Due | Issue |
|---|---|---|---|---|
| 1 | Create production GitHub Environment with required-reviewer protection |
Kristerpher | 2026-04-30 | #690 |
| 2 | Trigger workflow_dispatch on deploy-heroku.yml with environment=production, ref=main |
Kristerpher | 2026-04-30 | #690 |
| 3 | Fix missing dontautocreate: true in deploy.yml staging + prod jobs |
sre-agent (PR) | 2026-05-01 | #691 |
| 4 | Add nightly monitor: Heroku API check that raxx-api-prod latest release has non-null slug |
sre-agent | 2026-05-07 | #692 |
| 5 | Add Heroku dyno-count monitor: alert SEV-2 if raxx-api-prod has 0 running dynos for >5 min |
sre-agent | 2026-05-07 | #692 |
| 6 | Deprecate or align deploy.yml vs deploy-heroku.yml — two overlapping workflows for the same target is a maintenance hazard |
operator | 2026-05-07 | #693 |
References
- Runbook:
docs/ops/runbooks/heroku.md(created this incident) - Tracking issue: https://github.com/MooseQuest/TradeMasterAPI/issues/690
- Heroku Platform API response:
GET /apps/raxx-api-prod—slug_size: null, repo_size: null - Failing workflow:
deploy.yml—Deploy to Heroku stagingjob, error:heroku create raxx-api-staging: TypeError: process.stdin.setRawMode - Successful staging workflow:
deploy-heroku.yml— all push-to-main runs to staging pass