Incident: Review-app teardown failures — Heroku 429 rate limiting
Date: 2026-05-14 (UTC)
Severity: SEV-3 (self-healing; no customer impact; billing risk only)
Status: Resolved; fix shipped in PR #2050 (merged 2026-05-14T06:23:31Z)
Owner: raxx-dev-bot / operator
Summary
During a high-PR-close window, 14 raxx-console-pr-* Heroku review apps failed to
tear down. The root cause was a missing retry around heroku apps:destroy: the CLI
returns exit code 1 on HTTP 429 (Heroku API rate limit), and the teardown job used
set -euo pipefail with no retry, so every 429 aborted the job immediately. Each
failed teardown left a review app running and added to rate-limit pressure, creating
a positive-feedback loop.
No customer data was affected. Review apps are isolated (stub vault creds, dedicated Essential-0 Postgres, no prod access). The only consequence was unnecessary Heroku billing for the orphaned apps until they were manually destroyed.
Timeline (all times UTC)
| Time | Event |
|---|---|
| 2026-05-13 ~21:00 | High-volume PR merge window — ~8 PRs closed within 15 minutes |
| 2026-05-13 ~21:05 | First teardown job fails with heroku apps:destroy exit 1 |
| 2026-05-13 ~21:20 | All teardown jobs for that window have failed; 14 orphaned apps |
| 2026-05-14 ~00:30 | Heroku API rate-limit cap hit; support ticket filed (see docs/ops/heroku-rate-limit-support-ticket-2026-05-14.md) |
| 2026-05-14 ~02:00 | Root cause identified: no retry on 429 in teardown job |
| 2026-05-14 ~05:00 | Fix PR #2050 opened: add 4-attempt/30s retry around apps:info + apps:destroy |
| 2026-05-14 ~06:23 | PR #2050 merged |
| 2026-05-14 ~07:00 | Orphaned apps manually destroyed via heroku apps:destroy |
| 2026-05-14 ~12:00 | Heroku Support confirmed rate-limit cap raised to 9,000 req/hr |
Root cause
Primary: No retry on 429 in teardown job
review-app-console.yml teardown step:
- name: Delete review app (if it exists)
  run: |
    set -euo pipefail
    heroku apps:destroy --app "$APP" --confirm "$APP"
heroku apps:destroy returns exit 1 on HTTP 429. set -euo pipefail terminates the
job immediately. No retry. App stays live.
The deploy job had already added retry logic around heroku apps:create (3 attempts,
30s back-off) after issue #1899, but that pattern was never applied to the teardown path.
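The abort described above can be reproduced without touching the Heroku API. The following is a minimal sketch, not the workflow's actual code: destroy_stub is a hypothetical stand-in for heroku apps:destroy receiving a 429, and the step body runs in a child bash process so the demo itself survives the failure.

```shell
#!/usr/bin/env bash
# Minimal reproduction of the failure mode, assuming only that the CLI exits 1
# on HTTP 429. destroy_stub is a hypothetical stand-in for
# `heroku apps:destroy`; no Heroku call is made.

step_body='
set -euo pipefail
destroy_stub() { echo "simulated HTTP 429" >&2; return 1; }
destroy_stub
echo "teardown finished"
'

# Run the step body in a child shell so its abort does not end this script.
if step_output="$(bash -c "$step_body" 2>&1)"; then
  step_status=0
else
  step_status=$?
fi

echo "exit status: ${step_status}"
echo "output: ${step_output}"
```

The child shell exits with status 1 at the failing command, so the final echo in the step body never runs. This is the same way a single 429 ended the teardown step before the app was removed.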
Secondary: Rate-limit pressure compounds failures
Each failed teardown job retries the workflow from scratch (GH Actions re-queues the
closed event handler), making additional API calls, which further exhausts the
rate-limit budget and causes more failures.
Fix
PR #2050 (.github/workflows/review-app-console.yml):
- Wrapped heroku apps:info in a 4-attempt / 30s back-off loop before checking whether the app exists.
- Wrapped heroku apps:destroy in a 4-attempt / 30s back-off loop.
- Added if: always() to the sticky-comment step so teardown status is surfaced on the PR even when destroy fails.
This mirrors the existing retry pattern in the deploy job's apps:create path.
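A retry wrapper along these lines could implement the 4-attempt / 30s pattern. This is a hedged sketch, not the code merged in PR #2050: the retry function name, its argument order, and the flaky_destroy stub are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is
# reached, sleeping DELAY_SECONDS between attempts.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Running the command as an `if` condition keeps `set -e` from aborting
    # the script when the command fails.
    if "$@"; then
      return 0
    fi
    if (( attempt >= max_attempts )); then
      echo "retry: '$*' failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${attempt}/${max_attempts} failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Demo with a hypothetical stub that fails twice and then succeeds.
# The delay is 0 here only to keep the demo fast.
calls=0
flaky_destroy() { calls=$(( calls + 1 )); (( calls >= 3 )); }
retry 4 0 flaky_destroy
echo "succeeded after ${calls} calls"
```

In the teardown step this might be invoked as retry 4 30 heroku apps:destroy --app "$APP" --confirm "$APP". The sketch retries on any non-zero exit rather than trying to detect a 429 specifically, so a persistent non-rate-limit error would also be retried until the attempts run out.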
Action items
| # | Action | Status |
|---|---|---|
| 1 | Add retry to teardown apps:info | Done (PR #2050) |
| 2 | Add retry to teardown apps:destroy | Done (PR #2050) |
| 3 | Add if: always() to teardown sticky comment | Done (PR #2050) |
| 4 | File Heroku support ticket for rate-limit cap increase | Done (2026-05-14) |
| 5 | Audit other workflows for raxx-api-pr-* / raxx-app-pr-* review apps with same gap | Done (PR #2530, issue #2053) |
Scope confirmed (action item #5)
Audit per issue #2053 confirmed:
- deploy-heroku.yml: deploys to raxx-api-staging and raxx-api-prod only. No per-PR apps. Fires on push to main or workflow_dispatch.
- deploy-antlers.yml: deploys to Cloudflare Pages. No Heroku per-PR apps.
- pr-preview.yml: Cloudflare Pages previews for Antlers and Mockups. No Heroku.
- No review-app-api.yml or review-app-antlers.yml exists.
- No raxx-api-pr-* or raxx-app-pr-* Heroku apps have ever existed.
Only Console uses Heroku review apps. The 429 teardown vulnerability was Console-only and is fully remediated by PR #2050.
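A first pass at this kind of audit can be scripted. The sketch below is not the method used for issue #2053; the function name, the default directory, and the list of commands searched for are assumptions. It only lists workflow files that invoke Heroku app lifecycle commands, so each hit still needs a manual check for retry handling.

```shell
#!/usr/bin/env bash
# Sketch: list workflow files that invoke Heroku app lifecycle commands.
# The function name, default directory, and command list are assumptions
# made for illustration.
list_heroku_app_workflows() {
  local dir="${1:-.github/workflows}"
  # -r recurse, -l print matching file names only, -E extended regex.
  # `|| true` keeps a failing pipeline from aborting a caller using `set -e`.
  grep -rlE --include='*.yml' --include='*.yaml' \
    -e 'heroku apps:(create|destroy|info)' "$dir" | sort || true
}

# Run against the repository's workflows when the directory is present.
if [ -d .github/workflows ]; then
  list_heroku_app_workflows
fi
```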
References
- Fix PR: https://github.com/raxx-app/TradeMasterAPI/pull/2050
- Audit issue: https://github.com/raxx-app/TradeMasterAPI/issues/2053
- Heroku support ticket: docs/ops/heroku-rate-limit-support-ticket-2026-05-14.md
- Console review apps runbook: docs/ops/runbooks/console-review-apps.md
- Workflow: .github/workflows/review-app-console.yml