Production rollback runbook
System: raxx-api-prod · raxx-api-staging · raxx-console-prod · raxx-console-staging · raxx-velvet-
Owner: operator
Last incident: n/a (initial authoring — #98)
Last reviewed: 2026-05-14 UTC
Related: docs/ops/runbooks/heroku.md · docs/ops/runbooks/deploy-freeze.md · docs/ops/runbooks/migration-gate.md
When to use this runbook
Use this runbook when a production release must be reverted immediately. Roll back when:
- 5xx error rate climbs above baseline for >2 minutes after a deploy.
- Sentry CRIT or HIGH alert fires and the event fingerprint points to code shipped in the current release.
- A live demo or critical customer flow is broken and the breakage is traced to the release, not a dependency outage.
- The deploy passed CI but a smoke check on the live surface returns an unexpected response.
- A config-var-only release (e.g., a secret rotation) produces unexpected behavior and the previous value is known-good.
Roll forward instead of rolling back when:
- The defect is a one-line fix and a new deploy can reach prod in under 10 minutes — prefer the forward patch.
- The current release added a DB migration. Rolling the slug back will leave the schema in the migrated state. Evaluate impact before rolling back; a forward patch that accommodates both schema states is safer. See DB migration caveat.
- The incident is caused by an upstream dependency (Heroku platform, a third-party API) — rollback won't help; escalate.
Expected wall-clock from decision to verified recovery: under 5 minutes.
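The criteria above can be condensed into a small decision helper. This is only a sketch under simplifying assumptions: the function name rollback_decision and its three arguments are hypothetical and not part of any existing tooling, and the 10-minute threshold is taken from the roll-forward list. It does not replace operator judgment.

```bash
# Hypothetical helper: encodes the roll-back vs roll-forward criteria from this section.
#   $1  yes|no   incident is caused by an upstream dependency
#   $2  yes|no   current release added a DB migration
#   $3  integer  estimated minutes for a forward patch to reach prod
rollback_decision() {
  local upstream_outage="$1"
  local has_migration="$2"
  local fix_eta_min="$3"
  if [ "$upstream_outage" = "yes" ]; then
    echo "escalate"             # rollback will not help
  elif [ "$has_migration" = "yes" ]; then
    echo "evaluate-migration"   # check the DB migration caveat first
  elif [ "$fix_eta_min" -lt 10 ]; then
    echo "roll-forward"         # forward patch lands in under 10 minutes
  else
    echo "rollback"
  fi
}
```

For example, rollback_decision no no 25 prints rollback, because no upstream outage or migration is involved and the forward patch would take too long.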
Pre-rollback checks
- Confirm the release is the cause. Check Sentry or heroku logs --tail -a <app> for a traceback that references code or config from the current release, not a dependency.
- Check dyno health before rolling back:
  heroku ps -a raxx-api-prod
  If dynos are in a crash loop, rollback is likely correct. If dynos are up but serving errors, confirm the request path before acting.
- Identify the known-good release:
  heroku releases -a raxx-api-prod
  Look for the last Deploy <sha> entry before the bad one. Config-var-only releases (e.g., Set STRIPE_API_KEY config vars) do not change the slug, but rolling back past them reverts the config vars too. Note this before proceeding.
- Check for a DB migration in the bad release. If heroku releases -a <app> shows a Deploy entry and that deploy included a migration, see DB migration caveat before executing rollback.
- Verify you are targeting the right app. The app names follow raxx-<service>-<env>:
  | App | URL |
  |---|---|
  | raxx-api-prod | https://raxx-api-prod-a60a19e5efbf.herokuapp.com |
  | raxx-api-staging | https://raxx-api-staging-1a19fb3873b9.herokuapp.com |
  | raxx-console-prod | https://console.raxx.app |
  | raxx-console-staging | https://console-staging.raxx.app |
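The app-name check can be made mechanical before any destructive command runs. The sketch below assumes the raxx-<service>-<env> pattern with only prod and staging as environments (the two that appear in the table). The function name parse_app_name is hypothetical.

```bash
# Hypothetical guard: accept only names of the form raxx-<service>-<env>,
# where <env> is prod or staging, and print "<service> <env>".
parse_app_name() {
  local app="$1"
  local env service
  case "$app" in
    raxx-*-prod)    env="prod" ;;
    raxx-*-staging) env="staging" ;;
    *) echo "unexpected app name: $app" >&2; return 1 ;;
  esac
  service="${app#raxx-}"        # drop the raxx- prefix
  service="${service%-$env}"    # drop the -<env> suffix
  if [ -z "$service" ]; then
    echo "unexpected app name: $app" >&2
    return 1
  fi
  echo "$service $env"
}
```

Running parse_app_name "$APP" || exit 1 at the top of a rollback script stops a typo such as raxx-api-production before it reaches heroku rollback.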
Rollback procedure — Heroku release rollback
This is the canonical path for slug-based (git-push) apps: raxx-api-* and raxx-console-*.
Step 1 — Identify the target release
heroku releases -a raxx-api-prod
Note the version number of the known-good release (e.g., v84). The broken release is the current, topmost version (e.g., v85).
Step 2 — Execute rollback
heroku rollback v84 -a raxx-api-prod
Heroku creates a new release (e.g., v86: Rollback to v84) and immediately routes traffic to that slug. The command completes in seconds; dyno restart takes 10–30 seconds.
Step 3 — Verify recovery
# Confirm the new release appears at the top of the release list
heroku releases -a raxx-api-prod --num 3
# Expected: top row reads "Rollback to v84"
# Confirm dynos are up
heroku ps -a raxx-api-prod
# Expected: web.1: up
# Smoke-check the health endpoint
# Note: direct Heroku URLs return 403 when FLAG_ENFORCE_CF_ORIGIN is on.
# Use the CF-fronted URL instead:
curl -sf -o /dev/null -w "%{http_code}" https://api.raxx.app/api/system/status
# Expected: 200
For console:
heroku releases -a raxx-console-prod --num 3
heroku ps -a raxx-console-prod
curl -sf -o /dev/null -w "%{http_code}" https://console.raxx.app/health
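Because dynos take 10–30 seconds to restart, a single smoke check right after the rollback can fail spuriously. One way to handle this is to retry the probe a few times. The following is a minimal sketch: the helper name retry_probe and its argument order are hypothetical, and the attempt count and delay are illustrative values rather than tuned ones.

```bash
# Hypothetical helper: retry a probe command until it succeeds or attempts run out.
#   $1    maximum number of tries
#   $2    seconds to sleep between tries
#   $3..  the probe command and its arguments
retry_probe() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0                  # probe succeeded
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"            # wait before the next try
    fi
    i=$((i + 1))
  done
  return 1                      # every attempt failed
}
```

A possible invocation is retry_probe 6 5 curl -sf -o /dev/null https://api.raxx.app/api/system/status, which tries up to six times, five seconds apart. With -f, curl exits non-zero on HTTP errors, so the loop keeps going until the endpoint answers successfully.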
Rollback procedure — tagged-image redeploy (container apps)
raxx-velvet-* and any future service deployed with the heroku container: commands use the container stack, not the git slug stack. The heroku rollback command still works for these apps (it flips the release pointer), but if the prior release's image has been garbage-collected, or you need to re-pin to a specific image tag, use this path.
Step 1 — Identify the known-good image
heroku releases -a raxx-velvet-prod --num 10
Find the last Deploy entry with a known-good commit SHA. Cross-reference it against the GitHub Container Registry (GHCR) or your CI artifact log to find the corresponding image reference (e.g., a digest sha256:<digest> or a tag like main-<sha>). A digest is pulled as velvet@sha256:<digest> rather than velvet:<tag>.
Step 2 — Pull and re-release the prior image
# Pull the known-good image to your local Docker daemon
docker pull ghcr.io/raxx-app/trademasterapi/velvet:<prior-tag>
# Re-tag for the Heroku registry (web process type; the tag defaults to latest)
docker tag ghcr.io/raxx-app/trademasterapi/velvet:<prior-tag> \
  registry.heroku.com/raxx-velvet-prod/web
# Push to Heroku registry
docker push registry.heroku.com/raxx-velvet-prod/web
# Release the image
heroku container:release web -a raxx-velvet-prod
Step 3 — Verify
heroku releases -a raxx-velvet-prod --num 3
heroku ps -a raxx-velvet-prod
Note: If heroku rollback v<N> succeeds for a container app (the image is still available in the Heroku slug cache), prefer that path — it is faster and does not require local Docker access.
DB migration caveat
Forward-only migrations make rollback partial. If the bad release ran a migration that added a column, table, or index:
- Rolling the slug back will restore old application code but the schema remains migrated.
- Old code may crash on columns it does not expect, or silently ignore new columns it never writes.
- Assessment before rolling back: check whether the old slug is compatible with the new schema. If yes, proceed. If no, a forward patch is safer.
Migration reviews must reject DROP COLUMN, DROP TABLE, and destructive ALTER statements on the rollback path — these are non-reversible and break rollback entirely. See docs/ops/runbooks/migration-gate.md for the gate checklist.
For v1.0, DB migrations are forward-only by policy. If a migration must be reversed, file it as a separate forward migration (re-add the removed column as nullable, etc.) rather than attempting a true rollback.
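The review rule against DROP COLUMN and DROP TABLE can be backed by a quick text scan. This is a rough sketch, assuming migrations are available as plain SQL text. The function name find_destructive_sql is hypothetical. It will also match statements inside SQL comments, and it does not catch other destructive ALTER forms (such as type changes), so it supplements the gate checklist rather than replacing it.

```bash
# Hypothetical pre-review check: read migration SQL on stdin and print every
# line (with its line number) that contains DROP COLUMN or DROP TABLE.
# Exit status is 0 when at least one such line is found, 1 otherwise.
find_destructive_sql() {
  grep -Ein 'DROP[[:space:]]+(COLUMN|TABLE)'
}
```

A possible use is find_destructive_sql < migration.sql, where migration.sql stands in for whatever file the migration under review lives in.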
Comms template
User-facing incident note (brief, plain language)
We are investigating an issue affecting [surface, e.g., the trading platform]. Our team is on it and we will post an update within 15 minutes. No account data has been affected.
We have rolled back to the previous release. The platform is recovering. We will confirm full recovery shortly.
The platform has recovered. Thank you for your patience. We are conducting a post-incident review.
Internal Slack (operator DM — D0AJ7K184TV)
Incident open: [app] [brief symptom]. Investigating. Started: [HH:MM UTC]
Rolling back [app] from v[N] to v[N-1]. Initiated at [HH:MM UTC].
Rollback confirmed: v[N+1] (Rollback to v[N-1]) is live, dynos up, smoke check passing. [HH:MM UTC]. Wall-clock: [X] min.
Incident closed. Post-incident review: [link or TBD].
Post to the daily digest (not a separate per-event ping) for pre-launch incidents unless the incident runs into the next day or affects a live customer flow.
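The Wall-clock: [X] min field in the Slack template can be computed from the two [HH:MM UTC] timestamps. The sketch below is illustrative only: the helper names hhmm_to_min and wall_clock_min are hypothetical, input is assumed to be zero-padded HH:MM, and the incident is assumed to last under 24 hours.

```bash
# Hypothetical helper: convert a zero-padded HH:MM string to minutes since midnight.
hhmm_to_min() {
  local h="${1%%:*}"
  local m="${1##*:}"
  # Strip one leading zero so "08" and "09" are not read as invalid octal.
  echo $(( ${h#0} * 60 + ${m#0} ))
}

# Hypothetical helper: minutes elapsed from start ($1) to end ($2), both HH:MM UTC.
wall_clock_min() {
  local s e
  s=$(hhmm_to_min "$1")
  e=$(hhmm_to_min "$2")
  # Add a day before taking the remainder so a rollover past midnight stays positive.
  echo $(( (e - s + 1440) % 1440 ))
}
```

For instance, wall_clock_min 21:24 21:28 prints 4, which would go into the Rollback confirmed message.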
Drill record
| Date (UTC) | App | Operator | Bad release | Rolled to | Wall-clock (min) | Outcome |
|---|---|---|---|---|---|---|
| 2026-05-14 21:24:25 UTC | raxx-api-staging | raxx-dev-bot (agent, #98) | v431 | v430 | <1 | Success — v432 appeared, dyno up, rolled forward to v433 |
Add a row each time this runbook is executed in staging or production.