RCA — raxx-api-prod: 0 dynos, no slug, never deployed

Incident ID: 2026-04-30-raxx-api-prod-down
Date: 2026-04-30
Severity: SEV-1
Duration: 8 days latent (app created 2026-04-22, first detected 2026-04-30); active investigation 2026-04-30 ~16:30–17:30 UTC
Blast radius: raxx-api-prod (Raptor production backend) serving HTTP 502 on all requests. Pre-launch, so 0 external users affected. Internal systems that depend on the production API URL (console.raxx.app prod surface, Sentry DSN, status worker) could not reach the health endpoint.
Author: sre-agent

Summary

raxx-api-prod was provisioned on 2026-04-22 with config vars, but no code was ever deployed to it. None of its 7 Heroku releases carried a slug (slug null on every release). Two CI/CD workflows are meant to cover the production deploy path, but structural gaps in both prevented any production deploy in the 8 days since the app was created. The deploy.yml staging path has failed on every push to main because its deploy step lacks dontautocreate: true; its production path has never been triggered because no v* tag has been pushed. The deploy-heroku.yml production path requires a workflow_dispatch that has never been triggered, and the production GitHub Environment it references does not exist. Remediation requires: (1) operator creates the production GitHub Environment, (2) operator triggers workflow_dispatch on deploy-heroku.yml with environment=production, and (3) a PR to add the missing dontautocreate: true to deploy.yml.

Timeline (all times UTC)

2026-04-22 06:53 — raxx-api-prod Heroku app created (v1)
2026-04-22 06:53 — Logplex enabled (v2)
2026-04-22 06:55 — PostgreSQL addon provisioned, DATABASE_URL set (v3)
2026-04-22 06:58 — FLASK_ENV, FLASK_DEBUG, FRONTEND_ORIGIN config vars set (v4)
2026-04-22 07:27 — SENTRY_DSN_BACKEND set (v5)
2026-04-30 02:21 — STATUS_INTERNAL_WRITE_TOKEN set (v6)
2026-04-30 02:21 — STATUS_WORKER_URL set (v7)
2026-04-30 ~09:00 — Prior SRE agent triage of console.raxx.app detects degraded prod API surface; incident opened as #690
2026-04-30 ~16:30 — This investigation started; current state confirmed via Heroku Platform API + heroku CLI
2026-04-30 ~16:45 — Root cause identified: no workflow_dispatch run ever executed, production GH Environment missing, deploy.yml broken on staging for entire period
2026-04-30 ~17:00 — RCA drafted; remediation steps documented; sub-issues filed
Unresolved — pending operator action to trigger first prod deploy

Impact

Users affected: 0 (Raxx is pre-launch; no external customers)
User-visible symptoms: https://raxx-api-prod-a60a19e5efbf.herokuapp.com/health returns HTTP 502 on all requests
Data integrity: ok (PostgreSQL addon present, no data written without a running API)
Revenue / billing: ok (pre-launch)

What went well

The deploy-heroku.yml workflow is correctly structured for prod — it has the right app name, freeze-check gate, smoke gate, and dontautocreate: true. Once the production GH Environment is created and the dispatch is triggered, the first deploy should succeed without code changes.
Config vars are all present on raxx-api-prod — the app is ready to receive a slug.
The Procfile is valid and correctly points to backend_v2/app.py via gunicorn.
Staging deploys via deploy-heroku.yml have been succeeding reliably, validating the deploy mechanism.

What didn't go well

The Heroku app was created and config vars were set across 8 days with no system check that a slug had ever been deployed. The system allowed an app to exist in a "configured but not deployed" state indefinitely with no alert.
Two workflows exist for overlapping purposes (deploy.yml and deploy-heroku.yml) with different trigger models, different gating approaches, and different levels of correctness. This created confusion about which workflow was the authoritative prod deploy path.
deploy.yml has been failing its deploy-staging job on every push to main since before raxx-api-prod was created. These failures were never escalated or acted on, and became background noise.
The production GitHub Environment was never created. The deploy-heroku.yml deploy job references environment: production but the environment itself does not exist in the repo, so the required-reviewer protection (per ADR-0020) was never enforced.
No runbook existed for the Heroku deploy path prior to this incident.

Root cause analysis

Contributing factor 1: No workflow_dispatch ever triggered for production — The deploy-heroku.yml workflow correctly gates production behind workflow_dispatch. However, no one ever triggered it manually. The system allowed production to remain un-deployed indefinitely because the push-to-main automation only targets staging, and the staging deploys appeared successful (masking the fact that production was never seeded).
Contributing factor 2: production GitHub Environment not provisioned. The deploy job in deploy-heroku.yml uses environment: production. Protection rules (required reviewers, wait timers) are configured on the environment itself, so with no production environment in the repo, the rules that ADR-0020 mandates for prod cannot be enforced. GitHub Actions creates a referenced environment on first use with no protection rules attached, so the deploy job would still run. Anyone who triggered the dispatch could therefore have deployed to prod without approval.
Contributing factor 3: deploy.yml missing dontautocreate: true on staging deploy — The deploy.yml workflow uses akhileshns/heroku-deploy@v3.13.15 without dontautocreate: true. When raxx-api-staging already exists, the action's default behavior is to call heroku create <appname>, which in a non-interactive CI environment triggers the Heroku CLI browser-login prompt, crashes with TypeError: process.stdin.setRawMode is not a function, and fails the job. This has caused 100% failure rate on the deploy-staging job in deploy.yml since the workflow was created. Because deploy-heroku.yml was independently succeeding on staging, this persistent failure went unnoticed.
Contributing factor 4: No health monitor on production app slug presence — There is no alerting that would fire when raxx-api-prod has 0 dynos or slug: null. The incident was detected incidentally by a console triage agent, not by a proactive monitor.
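The gap in contributing factor 3 is the kind of thing a small lint check could catch before merge. The following is a minimal sketch, not the project's tooling: it assumes the workflow file is read as plain text and that each step begins at a "- " list item. The function names are hypothetical, and a line-based heuristic like this is weaker than a real YAML parse.

```python
import re

# Step that uses the heroku-deploy action named in this RCA.
USES_RE = re.compile(r"uses:\s*akhileshns/heroku-deploy@")
# The input set to true, with optional quotes around the value.
FLAG_RE = re.compile(r"dontautocreate:\s*['\"]?true\b")
# Any YAML list item ("- name: ...", "- uses: ...") starts a new chunk.
ITEM_START_RE = re.compile(r"^\s*-\s")


def split_items(workflow_text: str) -> list[str]:
    """Split workflow text into chunks that each begin at a '- ' list item."""
    chunks: list[list[str]] = []
    for line in workflow_text.splitlines():
        if ITEM_START_RE.match(line) or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    return ["\n".join(chunk) for chunk in chunks]


def steps_missing_dontautocreate(workflow_text: str) -> list[str]:
    """Return the text of each heroku-deploy step lacking dontautocreate: true."""
    return [
        item
        for item in split_items(workflow_text)
        if USES_RE.search(item) and not FLAG_RE.search(item)
    ]
```

Run against the contents of deploy.yml, a non-empty result would identify the steps that need the flag; an empty list means every heroku-deploy step already sets it.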

Detection

What alerted us: console.raxx.app degraded-surface triage by prior SRE agent (2026-04-30 ~09:00 UTC)
How long between cause and detection: 8 days (app created 2026-04-22, detected 2026-04-30)
How to detect faster next time: Heroku Platform API check on /apps/raxx-api-prod/releases — if latest release has slug: null, fire a SEV-2 alert. Add to nightly SRE sweep.
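The slug check described above might look roughly like the following. This is a sketch under stated assumptions: the input is the already-parsed JSON list from the releases endpoint, the field names version and slug follow the Heroku Platform API release schema, and fetching, authentication, and alert delivery are left out. The function names and message wording are hypothetical.

```python
from typing import Any, Optional


def latest_release(releases: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the release with the highest version number, or None if empty."""
    if not releases:
        return None
    return max(releases, key=lambda r: r["version"])


def slug_alert(releases: list[dict[str, Any]]) -> Optional[str]:
    """Return an alert message if the latest release has no slug, else None."""
    latest = latest_release(releases)
    if latest is None:
        return "SEV-2: no releases found"
    if latest.get("slug") is None:
        return f"SEV-2: release v{latest['version']} has slug: null (no code deployed)"
    return None
```

Picking the latest release by version rather than by list position avoids depending on the order the API returns releases in.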

Resolution

Not yet complete — pending operator actions below.

The code and configuration to deploy are ready. The blockers are two GitHub administrative actions that require operator (Kristerpher) execution:

Operator action 1 — Create `production` GitHub Environment

In the GitHub repo settings under Environments, create a new environment named exactly production. Configure:
- Required reviewers: Kristerpher (minimum 1)
- Wait timer: 0 min (can increase after first successful deploy)
- Deployment branches: main only (restrict to protected branch)

URL: https://github.com/MooseQuest/TradeMasterAPI/settings/environments/new

Operator action 2 — Trigger first production deploy via `deploy-heroku.yml`

After the production environment exists:
1. Go to: https://github.com/MooseQuest/TradeMasterAPI/actions/workflows/deploy-heroku.yml
2. Click "Run workflow"
3. Set environment = production
4. Set ref = main
5. Click "Run workflow"
6. Approve the deployment when the required-reviewer gate appears
7. Monitor the run: the smoke gate runs first, then the deploy job, then the health check
8. Verify: curl -s https://raxx-api-prod-a60a19e5efbf.herokuapp.com/health returns HTTP 200

Code fix — `deploy.yml` missing `dontautocreate`

PR filed (see action items) to add dontautocreate: true to the deploy-staging and deploy-prod jobs in deploy.yml. This is a low-risk one-line addition to each job that matches the pattern already in deploy-heroku.yml.

Validation

heroku ps --app raxx-api-prod shows web.1: up
heroku releases --app raxx-api-prod shows a Deploy ... entry (not config-var-only)
curl -s https://raxx-api-prod-a60a19e5efbf.herokuapp.com/api/system/status returns HTTP 200
Heroku Platform API: slug_size is non-null
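The four checks above could be combined into a single pass/fail helper. The sketch below is illustrative only: it assumes the dyno list, release list, and app record have already been fetched and parsed from the Heroku Platform API (fields type, state, description, slug_size), and that the HTTP status code of the status endpoint was obtained separately. The function name and failure messages are hypothetical.

```python
from typing import Any


def validation_failures(
    dynos: list[dict[str, Any]],
    releases: list[dict[str, Any]],
    status_code: int,
    app_info: dict[str, Any],
) -> list[str]:
    """Return a list of failed checks; an empty list means validation passed."""
    failures: list[str] = []
    # Check 1: at least one web dyno is up.
    if not any(d.get("type") == "web" and d.get("state") == "up" for d in dynos):
        failures.append("no web dyno in state 'up'")
    # Check 2: at least one release is a code deploy, not a config change.
    if not any(str(r.get("description", "")).startswith("Deploy") for r in releases):
        failures.append("no Deploy release found")
    # Check 3: the status endpoint answers HTTP 200.
    if status_code != 200:
        failures.append(f"status endpoint returned HTTP {status_code}")
    # Check 4: the app record reports a slug size.
    if app_info.get("slug_size") is None:
        failures.append("slug_size is null")
    return failures
```

Returning every failure instead of stopping at the first one gives the operator the full picture in a single run.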

Action items

#	Action	Owner	Due	Issue
1	Create `production` GitHub Environment with required-reviewer protection	Kristerpher	2026-04-30	#690
2	Trigger `workflow_dispatch` on `deploy-heroku.yml` with `environment=production, ref=main`	Kristerpher	2026-04-30	#690
3	Fix missing `dontautocreate: true` in `deploy.yml` staging + prod jobs	sre-agent (PR)	2026-05-01	#691
4	Add nightly monitor: Heroku API check that `raxx-api-prod` latest release has non-null slug	sre-agent	2026-05-07	#692
5	Add Heroku dyno-count monitor: alert SEV-2 if `raxx-api-prod` has 0 running dynos for >5 min	sre-agent	2026-05-07	#692
6	Deprecate or align `deploy.yml` vs `deploy-heroku.yml` — two overlapping workflows for the same target is a maintenance hazard	operator	2026-05-07	#693
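Action item 5 needs a notion of "0 dynos for more than 5 minutes", which means tracking state across polls. One possible shape for that logic is sketched below, assuming a poller (not shown) supplies the running-dyno count and a timestamp on each check. The class name and its interface are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


class ZeroDynoMonitor:
    """Tracks how long an app has continuously reported zero running dynos."""

    def __init__(self, threshold: timedelta = timedelta(minutes=5)) -> None:
        self.threshold = threshold
        # Time of the first poll in the current zero-dyno streak, if any.
        self.zero_since: Optional[datetime] = None

    def observe(self, running_dynos: int, now: datetime) -> bool:
        """Record one poll result; return True if the alert should fire."""
        if running_dynos > 0:
            # Any running dyno ends the streak.
            self.zero_since = None
            return False
        if self.zero_since is None:
            self.zero_since = now
        return now - self.zero_since > self.threshold
```

Resetting the streak on any non-zero reading keeps brief restarts from alerting, at the cost of needing a poll interval shorter than the threshold.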

References

Runbook: docs/ops/runbooks/heroku.md (created this incident)
Tracking issue: https://github.com/MooseQuest/TradeMasterAPI/issues/690
Heroku Platform API response: GET /apps/raxx-api-prod — slug_size: null, repo_size: null
Failing workflow: deploy.yml — Deploy to Heroku staging job, error: heroku create raxx-api-staging: TypeError: process.stdin.setRawMode
Successful staging workflow: deploy-heroku.yml — all push-to-main runs to staging pass