Raxx · internal docs

internal · gated

Incident: Review-app teardown failures — Heroku 429 rate limiting

Date: 2026-05-14 (UTC) Severity: SEV-3 (self-healing; no customer impact; billing risk only) Status: Resolved — fix shipped in PR #2050 (merged 2026-05-14T06:23:31Z) Owner: raxx-dev-bot / operator


Summary

During a high-PR-close window, 14 raxx-console-pr-* Heroku review apps failed to tear down. The root cause was a missing retry around heroku apps:destroy: the CLI returns exit code 1 on HTTP 429 (Heroku API rate limit), and the teardown job used set -euo pipefail with no retry, so every 429 aborted the job immediately. Each failed teardown left a review app running and added to rate-limit pressure, creating a positive-feedback loop.

No customer data was affected. Review apps are isolated (stub vault creds, dedicated Essential-0 Postgres, no prod access). The only consequence was unnecessary Heroku billing for the orphaned apps until they were manually destroyed.


Timeline (all times UTC)

Time Event
2026-05-13 ~21:00 High-volume PR merge window — ~8 PRs closed within 15 minutes
2026-05-13 ~21:05 First teardown job fails with heroku apps:destroy exit 1
2026-05-13 ~21:20 All teardown jobs for that window have failed; 14 orphaned apps
2026-05-14 ~00:30 Heroku API rate-limit cap hit; support ticket filed (see ops/heroku-rate-limit-support-ticket-2026-05-14.md)
2026-05-14 ~02:00 Root cause identified: no retry on 429 in teardown job
2026-05-14 ~05:00 Fix PR #2050 opened: add 4-attempt/30s retry around apps:info + apps:destroy
2026-05-14 ~06:23 PR #2050 merged
2026-05-14 ~07:00 Orphaned apps manually destroyed via heroku apps:destroy
2026-05-14 ~12:00 Heroku Support confirmed rate-limit cap raised to 9,000 req/hr

Root cause

Primary: No retry on 429 in teardown job

review-app-console.yml teardown step:

- name: Delete review app (if it exists)
  run: |
    set -euo pipefail
    heroku apps:destroy --app "$APP" --confirm "$APP"

heroku apps:destroy returns exit 1 on HTTP 429. set -euo pipefail terminates the job immediately. No retry. App stays live.

The deploy job had already added retry logic around heroku apps:create (3 attempts, 30s back-off) after issue #1899, but that pattern was never applied to the teardown path.

Secondary: Rate-limit pressure compounds failures

Each failed teardown job retries the workflow from scratch (GH Actions re-queues the closed event handler), making additional API calls, which further exhausts the rate-limit budget and causes more failures.


Fix

PR #2050 (.github/workflows/review-app-console.yml):

  1. Wrapped heroku apps:info in a 4-attempt / 30s back-off loop before checking whether the app exists.
  2. Wrapped heroku apps:destroy in a 4-attempt / 30s back-off loop.
  3. Added if: always() to the sticky-comment step so teardown status is surfaced on the PR even when destroy fails.

This mirrors the existing retry pattern in the deploy job's apps:create path.


Action items

# Action Status
1 Add retry to teardown apps:info Done — PR #2050
2 Add retry to teardown apps:destroy Done — PR #2050
3 Add if: always() to teardown sticky comment Done — PR #2050
4 File Heroku support ticket for rate-limit cap increase Done — 2026-05-14
5 Audit other workflows for raxx-api-pr-* / raxx-app-pr-* review apps with same gap Done — PR #2530 (issue #2053)

Scope confirmed (action item #5)

Audit per issue #2053 confirmed:

Only Console uses Heroku review apps. The 429 teardown vulnerability was Console-only and is fully remediated by PR #2050.


References