Post-merge production-state checklist — runbook
System: release → main production boundary (ADR-0115)
Owner: operator / sre-agent
Last reviewed: 2026-06-30
Related ADR: docs/architecture/adr/0115-develop-release-main-branching-model.md
Wire-in target: docs/ops/runbooks/prod-deploy-approval.md (approval notification from #1202)
Closes: #1217
Purpose
This checklist runs after a release → main merge triggers the production deploy.
It answers: "The deploy workflow succeeded — is production actually healthy and in
the expected state?"
A passing deploy workflow is necessary but not sufficient. This runbook verifies the post-merge state categories that the deploy pipeline does not check automatically:
- App boot health
- Static asset / favicon revert state
- Flag-promotion-pending drift
- Env-var seed completeness
- Database migration applied
- Smoke-test pass against prod
- Observability / Sentry baseline
- Rollback pointer recorded
Work through the checklist in order. All steps must be green before the production deploy is considered closed.
Scope
Applies to the release → main production boundary only. Does not apply to:
- develop → release (staging boundary): see docs/ops/runbooks/gatekeeper-develop-to-release.md
- Hotfix merges that target main directly: use the emergency hotfix path in ADR-0115 §Emergency hotfix path; run this checklist immediately after the hotfix deploy and the subsequent cherry-pick back to develop.
Pre-requisites
Confirm all of the following before starting:
- [ ] The release → main PR was approved and merged by the operator.
- [ ] deploy-heroku.yml production job completed with conclusion success:
  gh run list --workflow=deploy-heroku.yml --branch=main --limit=3 \
    --json conclusion,headSha,createdAt \
    --jq '.[] | select(.conclusion != null) | {conclusion, headSha, createdAt}'
- [ ] The GitHub Environment approval log records the approval:
  https://github.com/raxx-app/TradeMasterAPI/deployments/activity_log?environments_filter=production
- [ ] CF Access tokens for console access are valid. Stale tokens silently block any checklist step that calls the console API (see session-bootstrap.md and #680). Validate early:
  curl -s -o /dev/null -w "%{http_code}" \
    -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
    -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
    -H "User-Agent: raxx-sre/1.0" \
    https://console.raxx.app/health
  Expected: 200. If 401 or 403, refresh tokens from vault (docs/ops/runbooks/vault-access.md) before continuing.
Step 1 — App boot health
Confirm the production API booted cleanly and is serving requests.
# Health endpoint — must return 200 OK
curl -s -o /dev/null -w "%{http_code}" \
https://raxx-api-prod-a60a19e5efbf.herokuapp.com/health
Expected: 200. Any other status is a SEV-1; consult docs/ops/runbooks/raptor.md.
Check for crash loops (recent prod logs):
heroku logs --app raxx-api-prod --num=50 2>&1
Signs of trouble: repeated Starting process with command, Error R10, code=H10 on router lines,
or Stopping all processes with SIGTERM within seconds of boot.
Confirm the deployed version matches the merged SHA:
heroku releases --app raxx-api-prod --num=3
The most recent release should correspond to the release → main merge commit. The
release number vNNN from this output is needed for Step 8 (rollback pointer).
Expected state: 200 health, no crash loop, release number matches. If a crash loop
is detected, execute the rollback in Step 8 before proceeding with any other step.
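The Step 1 decision can be made less subjective with a small helper. The sketch below is illustrative only: the function names, the loose match patterns, and the default threshold of two signals (one stop plus one start is normal for a deploy) are assumptions, not part of the existing tooling.

```bash
#!/usr/bin/env bash
# Sketch: turn the Step 1 observations into a single verdict line.
# count_crash_signals and boot_verdict are hypothetical helper names.

count_crash_signals() {
  # Reads log text on stdin and prints the number of lines that look like
  # boot or crash activity. "|| true" keeps a zero count from failing the
  # caller when the script runs under "set -e".
  grep -cE 'Error R10|H10|Starting process with command|with SIGTERM' || true
}

boot_verdict() {
  # $1 = HTTP status from /health
  # $2 = crash-signal count from count_crash_signals
  # $3 = highest tolerated count (assumed default: 2)
  local status="$1" signals="$2" max_signals="${3:-2}"
  if [ "$status" != "200" ]; then
    echo "FAIL: health returned $status"
  elif [ "$signals" -gt "$max_signals" ]; then
    echo "FAIL: $signals crash signals in recent logs"
  else
    echo "OK"
  fi
}
```

One possible use: pipe the output of heroku logs --app raxx-api-prod --num=50 into count_crash_signals, then pass the curl status code and that count to boot_verdict. Anything other than OK means stopping and following the guidance above.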
Step 2 — Static asset / favicon revert check
Temporary favicon or static asset overrides (demo-mode favicons, placeholder images) must be reverted before the release is declared complete.
# Confirm canonical favicon is being served — not a placeholder or demo icon
curl -s -I https://getraxx.com/favicon.ico \
| grep -i "content-type\|content-length\|etag"
curl -s -I https://app.raxx.app/favicon.ico \
| grep -i "content-type\|content-length\|etag"
Manual browser check:
- Open https://getraxx.com and https://app.raxx.app.
- Confirm the tab icon is the Raxx brand icon, not a browser default, placeholder,
or demo variant.
- Confirm the page title reflects the production brand.
If this release touched console/app/static/**, confirm the asset manifest was
regenerated (ADR-0051 Layer A guard):
git show origin/main:docs/asset-manifest.json | python3 -m json.tool | head -20
The manifest timestamp and content hashes must match the files in the release. A
manifest mismatch with no accompanying static-file change is acceptable; a static-file
change with a stale manifest is not — file a type:reliability ticket.
Expected: no placeholder or override assets. A missed favicon revert alone is not a rollback trigger unless it is a brand-safety issue; otherwise file a ticket and note it in the deploy log.
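To compare the favicon headers against a known-good value rather than by eye, something like the following could work. It is a sketch under assumptions: header_value and favicon_matches are hypothetical helpers, and the expected ETag is a value the operator would have recorded from a healthy release. The carriage-return stripping matters because HTTP header lines end in CRLF, which otherwise makes string comparisons fail silently.

```bash
#!/usr/bin/env bash
# Sketch: extract one header from `curl -s -I` output and compare it.

header_value() {
  # $1 = header name (matched case-insensitively); raw headers on stdin.
  local name="$1"
  tr -d '\r' | awk -v name="$name" '
    BEGIN { FS = ": *"; name = tolower(name) }
    !found && tolower($1) == name {
      sub(/^[^:]*: */, "")   # drop "Header-Name: " and keep the value
      print
      found = 1
    }
  '
}

favicon_matches() {
  # $1 = expected ETag (assumed to be recorded from a known-good release).
  # Succeeds only when the ETag header on stdin equals it exactly.
  local expected="$1" actual
  actual="$(header_value etag)"
  [ "$actual" = "$expected" ]
}
```

For example, curl -s -I https://getraxx.com/favicon.ico could be piped into favicon_matches with the recorded ETag as its argument. A non-zero exit would prompt the manual browser check above, not an automatic rollback.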
Step 3 — Flag-promotion-pending review
Confirm that every feature flag enabled on staging and disabled on production is either intentionally still in staging-only state, or promoted to production in this release.
# Produce a diff of FLAG_ vars between staging and prod
heroku config --app raxx-api-staging 2>&1 | grep FLAG_ | sort > /tmp/flags-staging.txt
heroku config --app raxx-api-prod 2>&1 | grep FLAG_ | sort > /tmp/flags-prod.txt
diff /tmp/flags-staging.txt /tmp/flags-prod.txt
Lines beginning with < are staging entries that are missing or set differently on prod (prod-off drift). For each:
- Confirm the drift is intentional and document it in the deploy log.
- If the flag was supposed to be promoted in this release, promote it now:
  heroku config:set FLAG_EXAMPLE=1 --app raxx-api-prod >/dev/null 2>&1
  # Verify presence:
  heroku config --app raxx-api-prod 2>&1 | grep FLAG_EXAMPLE
Note: FLAG_CONSOLE_FLAG_PROMOTIONS is currently off on prod. Once enabled, the
/console/flags page surfaces this drift visually. Until then, use the
heroku config diff above.
Expected: every drift entry is either documented as intentional or promoted. No unreviewed staging-on / prod-off drift at close.
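Raw diff output can be hard to scan when many flags differ. As an alternative view, the sketch below prints one line per drifting flag with both values side by side. The flag_drift name is hypothetical, and it assumes each line of the two files has the form "FLAG_NAME:  value" with no spaces inside the value.

```bash
#!/usr/bin/env bash
# Sketch: list flags whose staging value is absent or different on prod.

flag_drift() {
  # $1 = staging flags file, $2 = prod flags file
  # (for example /tmp/flags-staging.txt and /tmp/flags-prod.txt).
  # The prod file is read first so its values can be looked up by name.
  # FILENAME is compared to ARGV[1] instead of using NR == FNR, because
  # NR == FNR misbehaves when the first file is empty.
  awk '
    { name = $1; sub(/:$/, "", name); value = $2 }
    FILENAME == ARGV[1] { prod[name] = value; seen[name] = 1; next }
    !(name in seen)     { print name " staging=" value " prod=<unset>"; next }
    prod[name] != value { print name " staging=" value " prod=" prod[name] }
  ' "$2" "$1"
}
```

Each printed line is one item to document as intentional or to promote. No output means staging and prod agree on every flag present on staging.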
Step 4 — Env-var seed check
Every config var required by the merged PR must be present on raxx-api-prod
(and raxx-console-prod if the console was part of the release).
Identify new vars by reviewing the PR diff and body, then verify each:
# Check presence without printing the value (heroku config lines look like "VAR_NAME:  value")
heroku config --app raxx-api-prod 2>&1 | grep -q "^VAR_NAME:" && echo present || echo MISSING
If a required var is absent:
# Set it; discard the CLI output so the value is not echoed back to the terminal
heroku config:set VAR_NAME=value --app raxx-api-prod >/dev/null 2>&1
# Confirm presence without printing the value:
heroku config --app raxx-api-prod 2>&1 | grep -q "^VAR_NAME:" && echo present || echo MISSING
Always suppress output from heroku config:set with >/dev/null 2>&1. A bare
heroku config:set echoes the values it just set to the terminal; this has leaked secrets
before (see feedback_heroku_config_set_echoes_secrets.md).
Consult docs/ops/required-config-vars.yaml for the authoritative list of
required vars by service. If a new var introduced in this release is not yet in
that file, add it in a follow-up PR to develop.
Expected: all required vars present. A missing var that affects a live user-visible workflow is SEV-2.
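When a release introduces several vars, checking them one at a time is error-prone. The helper below is a minimal sketch that checks a list of names in one pass and prints only the names that are absent, never any value. The missing_vars name is hypothetical, and it assumes heroku config output uses "NAME:  value" lines.

```bash
#!/usr/bin/env bash
# Sketch: report required config vars that are absent, by name only.

missing_vars() {
  # Arguments: the required var names.
  # stdin: `heroku config` output for the target app.
  # Values are held in a shell variable and never written to stdout.
  local config name
  config="$(cat)"
  for name in "$@"; do
    if ! printf '%s\n' "$config" | grep -q "^${name}:"; then
      echo "$name"
    fi
  done
}
```

For instance, the output of heroku config --app raxx-api-prod could be piped into missing_vars followed by the names taken from the PR body. Empty output means every listed var is present.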
Step 5 — Database migration check
Confirm any Alembic migrations included in this release were applied to the production database.
# Current migration revision on prod — must end with (head)
heroku run "cd backend_v2 && alembic current" --app raxx-api-prod 2>&1
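Cross-reference against the head in the merged code. The listing below is only a rough guide, since it assumes revision filenames sort chronologically; running alembic heads from backend_v2 reports the head revision directly: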
ls -1 backend_v2/alembic/versions/ | sort | tail -5
If the revision is behind (head):
heroku run "cd backend_v2 && alembic upgrade head" --app raxx-api-prod 2>&1
# Re-verify:
heroku run "cd backend_v2 && alembic current" --app raxx-api-prod 2>&1
For migrations marked -- POSTGRES-ONLY (PL/pgSQL blocks): these are safe to run
on prod Postgres and were already validated in staging. If a migration fails on prod
but passed on staging, treat as SEV-1 and escalate immediately before any further
upgrade attempts.
See docs/ops/runbooks/migration-gate.md for known failure modes and rollback steps.
Expected: alembic current shows (head). Any pending migration on a deployed
release is SEV-2.
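The "(head)" check can be scripted so that it yields an exit code instead of relying on a visual scan. This is a sketch under assumptions: at_head is a hypothetical name, and the pattern expects the plain (non-verbose) alembic current format, where the revision line reads "<revision> (head)". Carriage returns are stripped because output relayed through heroku run may carry them.

```bash
#!/usr/bin/env bash
# Sketch: succeed only if `alembic current` output reports a head revision.

at_head() {
  # stdin: output of `alembic current`, possibly mixed with INFO log lines.
  # Exit status 0 when some line is "<revision> (head)", non-zero otherwise.
  tr -d '\r' | grep -qE '^[[:alnum:]_]+ \(head\)[[:space:]]*$'
}
```

A possible use is piping the heroku run "cd backend_v2 && alembic current" output into at_head and treating a non-zero exit as "migration pending". A project with multiple Alembic heads would need a stricter check than this.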
Step 6 — Smoke-test gate against production
Run the combined smoke suite against the production endpoint to confirm the golden path is working end-to-end, not just that the process is alive.
TARGET_ENV=production bash scripts/ci/run_smoke.sh
If TARGET_ENV override is not yet supported, run the read-only backend smoke:
python3 scripts/ci/smoke_backend.py \
--base-url https://raxx-api-prod-a60a19e5efbf.herokuapp.com
Expected: all checks pass. A failing smoke against production after a successful deploy is SEV-1 or SEV-2 depending on blast radius. Do not close the production deploy until smoke passes or a rollback is in progress.
Step 7 — Observability / Sentry check
Confirm no new error class spiked in the 15 minutes following the deploy.
In Sentry (projects: raxx-backend, raxx-frontend):
- Filter: last 30 minutes.
- Sort: by count (highest first).
- Look for any issue with a first-seen timestamp matching the deploy window.
- If a new issue class appears with count > 10 in the first 15 minutes post-deploy: treat as SEV-2, execute Step 8 rollback, then investigate.
Sentry issues UI:
https://sentry.io/organizations/moosequest/issues/?project=<id>&query=is:unresolved&sort=date
Check Heroku router error rates:
heroku logs --app raxx-api-prod --num=200 2>&1 \
| grep -E "code=H[0-9]+|Error R[0-9]+" | head -20
Any H12 (request timeout) or H10 (app crashed) entries post-deploy require immediate triage. H12 bursts often indicate a migration that took locks or a cold-start memory pressure issue.
Expected: no new error class spike, no H10/H12 burst. A single isolated H10 immediately at boot (before the first successful health check) is acceptable; a recurring H10 is not.
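To tell a single isolated H10 from a burst, it helps to count codes, not read raw lines. The following sketch tallies error codes from log text. The error_code_counts name is hypothetical, and it assumes router lines carry code=Hnn while dyno errors appear as "Error Rnn".

```bash
#!/usr/bin/env bash
# Sketch: tally Heroku H and R error codes found in log text.

error_code_counts() {
  # stdin: Heroku log lines.
  # stdout: "<count> <code>" per code, most frequent first; empty if none.
  grep -oE 'code=H[0-9]+|Error R[0-9]+' \
    | sed -E 's/^(code=|Error )//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

Piping heroku logs --app raxx-api-prod --num=200 into error_code_counts gives a quick read on whether H10 or H12 is recurring. What counts as a burst remains the operator's judgement against the guidance above.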
Step 8 — Rollback pointer (record before closing)
Record the rollback target before declaring the deploy complete. Do this while context is fresh.
# Identify the release immediately before this deploy
heroku releases --app raxx-api-prod --num=5
Find the release that was live BEFORE the current deploy. Record:
- Previous Heroku release number: vNNN (heroku releases lists newest first, so this is the row directly below the current release)
- Previous git SHA: visible in heroku releases or in the release → main PR as the parent of the merge commit
- Rollback command:
  heroku rollback vNNN --app raxx-api-prod
For Antlers Next (Cloudflare Pages), record the previous CF Pages deployment ID:
gh api \
"repos/raxx-app/TradeMasterAPI/deployments?environment=production&per_page=5" \
--jq '.[] | {id, sha: .sha, created_at}'
If rollback is triggered during this checklist (crash loop in Step 1 or Sentry spike in Step 7):
heroku rollback vNNN --app raxx-api-prod
# Re-run Step 1 to confirm the prior release is healthy
Full rollback procedure: docs/ops/runbooks/rollback.md.
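Picking the rollback target by eye is easy to get wrong under pressure. The sketch below extracts it from the release listing and formats the command for the deploy log. Both function names are hypothetical, and the parsing assumes the plain-text heroku releases layout: newest release first, with the release number in the first column. If a config change created a release between deploys, the row it selects may not be a code deploy, so the operator should still confirm the choice.

```bash
#!/usr/bin/env bash
# Sketch: find the release that was live before the current one.

previous_release() {
  # stdin: `heroku releases` output, newest first.
  # Prints the release number on the second release row, if there is one.
  awk '$1 ~ /^v[0-9]+$/ { n++; if (n == 2) print $1 }'
}

rollback_command() {
  # $1 = release number (for example v101), $2 = app name.
  # Only prints the command so it can be recorded; it does not run it.
  printf 'heroku rollback %s --app %s\n' "$1" "$2"
}
```

For example, the output of heroku releases --app raxx-api-prod --num=5 could be piped into previous_release, and the result passed to rollback_command together with raxx-api-prod to produce the line to paste into the deploy log.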
Approval-notification wire-in
The GitHub Environment approval banner pauses the production deploy job for
operator review (see docs/ops/runbooks/prod-deploy-approval.md). Include
the following in the review comment before clicking "Approve and deploy":
Post-merge checklist: docs/ops/runbooks/post-merge-prod-state-checklist.md
Run all 8 steps after deploy completes before closing.
This surfaces the checklist at the moment of approval, not after the fact — the critical delivery requirement from issue #1217.
To embed the checklist link permanently in the notify-deploy-status action
that deploy-heroku.yml uses (Phase 2 of #1217), add the following to the
deploy-complete notification step body:
Post-merge checklist: https://github.com/raxx-app/TradeMasterAPI/blob/main/docs/ops/runbooks/post-merge-prod-state-checklist.md
File a type:reliability ticket against develop for that automation step if
it has not yet shipped.
Emergency stop
If any checklist step reveals a production-breaking condition:
- Execute the rollback immediately (Step 8 rollback command).
- Freeze all further production deploys:
  heroku config:set DEPLOY_FREEZE=true --app raxx-console-prod >/dev/null 2>&1
- File a SEV-1 incident under docs/incidents/YYYY-MM-DD-<slug>.md.
- Notify the operator.
Full freeze / unfreeze procedure: docs/ops/runbooks/deploy-freeze.md.
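During an incident it is convenient to derive the incident file path mechanically. The sketch below builds a path in the docs/incidents/YYYY-MM-DD-<slug>.md shape from a date and a free-text title. The incident_path name and the slug rules (lowercase, runs of non-alphanumerics collapsed to a single hyphen) are assumptions; the repository may have its own convention.

```bash
#!/usr/bin/env bash
# Sketch: build an incident file path from a date and a title.

incident_path() {
  # $1 = incident date as YYYY-MM-DD, $2 = free-text title.
  local date="$1" slug
  slug="$(printf '%s' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')"
  printf 'docs/incidents/%s-%s.md\n' "$date" "$slug"
}
```

Calling incident_path with today's date and a short title yields a path that can be passed straight to an editor.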
Escalation
Escalate to operator when:
- Smoke gate fails and rollback does not restore a green smoke result.
- A migration fails on alembic upgrade head after succeeding on staging.
- A required env var value is missing from vault and the correct value is unknown.
- A Sentry spike is novel (first-ever error class, unknown root cause).
- A rollback itself fails (releases are locked or the target release is unavailable).
- Cost implications of a forward-fix exceed $50/mo or $500/year.
References
- ADR: docs/architecture/adr/0115-develop-release-main-branching-model.md (branching model, release→main boundary, emergency hotfix path)
- Gatekeeper (develop→release): docs/ops/runbooks/gatekeeper-develop-to-release.md (the promotion step that precedes this checklist)
- Prod deploy approval: docs/ops/runbooks/prod-deploy-approval.md (GitHub Environment reviewer gate; wire-in target from #1202)
- Rollback: docs/ops/runbooks/rollback.md
- Migration gate: docs/ops/runbooks/migration-gate.md
- Deploy freeze: docs/ops/runbooks/deploy-freeze.md
- Required config vars: docs/ops/required-config-vars.yaml
- CF Access token staleness: docs/ops/runbooks/session-bootstrap.md (see also #680)
- Issue: https://github.com/raxx-app/TradeMasterAPI/issues/1217
- Parent epic: https://github.com/raxx-app/TradeMasterAPI/issues/1208