Post-merge production-state checklist — runbook

System: release → main production boundary (ADR-0115)
Owner: operator / sre-agent
Last reviewed: 2026-06-30
Related ADR: docs/architecture/adr/0115-develop-release-main-branching-model.md
Wire-in target: docs/ops/runbooks/prod-deploy-approval.md (approval notification from #1202)
Closes: #1217

Purpose

This checklist runs after a release → main merge triggers the production deploy. It answers: "The deploy workflow succeeded — is production actually healthy and in the expected state?"

A passing deploy workflow is necessary but not sufficient. This runbook verifies the post-merge state categories that the deploy pipeline does not check automatically:

App boot health
Static asset / favicon revert state
Flag-promotion-pending drift
Env-var seed completeness
Database migration applied
Smoke-test pass against prod
Observability / Sentry baseline
Rollback pointer recorded

Work through the checklist in order. All steps must be green before the production deploy is considered closed.

Scope

Applies to the release → main production boundary only. Does not apply to:

develop → release (staging boundary) — see docs/ops/runbooks/gatekeeper-develop-to-release.md
Hotfix merges that target main directly. The merge itself follows the emergency hotfix path in ADR-0115 §Emergency hotfix path, not the release → main flow; still run this checklist immediately after the hotfix deploy and the subsequent cherry-pick back to develop.

Pre-requisites

Confirm all of the following before starting:

[ ] The release → main PR was approved and merged by the operator.
[ ] deploy-heroku.yml production job completed with conclusion success (runs still in progress have no conclusion yet and are filtered out):
    gh run list --workflow=deploy-heroku.yml --branch=main --limit=3 \
      --json conclusion,headSha,createdAt \
      --jq '.[] | select(.conclusion != null and .conclusion != "") | {conclusion, headSha, createdAt}'
[ ] The GitHub Environment approval log records the approval: https://github.com/raxx-app/TradeMasterAPI/deployments/activity_log?environments_filter=production
[ ] CF Access tokens for console access are valid. Stale tokens silently block any checklist step that calls the console API (see session-bootstrap.md and #680). Validate early:
    curl -s -o /dev/null -w "%{http_code}" \
      -H "CF-Access-Client-Id: $CF_ACCESS_CLIENT_ID" \
      -H "CF-Access-Client-Secret: $CF_ACCESS_CLIENT_SECRET" \
      -H "User-Agent: raxx-sre/1.0" \
      https://console.raxx.app/health
    Expected: 200. If 401 or 403, refresh tokens from vault (docs/ops/runbooks/vault-access.md) before continuing.
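The token check above maps three outcomes to three actions, which can be sketched as a small function. This is a minimal sketch, not part of the existing tooling: the function name check_cf_access, the 10-second timeout, and the return codes are assumptions made for illustration.

```shell
# Hypothetical helper: map the CF Access probe result to the runbook's next action.
# Reads CF_ACCESS_CLIENT_ID / CF_ACCESS_CLIENT_SECRET from the environment.
check_cf_access() {
  url="${1:-https://console.raxx.app/health}"
  # -w prints only the HTTP status code; "|| true" keeps a connection failure
  # (reported by curl as 000) from aborting a caller that uses `set -e`
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -H "CF-Access-Client-Id: ${CF_ACCESS_CLIENT_ID:-}" \
    -H "CF-Access-Client-Secret: ${CF_ACCESS_CLIENT_SECRET:-}" \
    -H "User-Agent: raxx-sre/1.0" \
    "$url" || true)"
  case "$code" in
    200)     echo "ok: tokens valid"; return 0 ;;
    401|403) echo "stale: refresh tokens from vault (HTTP $code)"; return 1 ;;
    *)       echo "unexpected: HTTP $code, check connectivity"; return 2 ;;
  esac
}
```

Calling check_cf_access with no arguments probes the console health URL from the prerequisite; a non-zero return means the remaining console-dependent steps should wait.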

Step 1 — App boot health

Confirm the production API booted cleanly and is serving requests.

# Health endpoint — must return 200 OK
curl -s -o /dev/null -w "%{http_code}" \
  https://raxx-api-prod-a60a19e5efbf.herokuapp.com/health

Expected: 200. Any other status is a SEV-1; consult docs/ops/runbooks/raptor.md.
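A single curl taken while dynos are still restarting can report a transient failure. The sketch below polls the health endpoint a few times before giving a verdict. It is an illustration only: the function name wait_for_health, the default of 5 attempts, and the 3-second delay are assumptions, and the SEV-1 rule above still applies once the attempts are exhausted.

```shell
# Hypothetical helper: poll a health URL until it returns 200 or attempts run out.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  url="$1"
  attempts="${2:-5}"
  delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress, -o /dev/null discards the body, -w prints the status code
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: got HTTP $code" >&2
    i=$((i + 1))
    # Sleep only if another attempt follows
    if [ "$i" -le "$attempts" ]; then sleep "$delay"; fi
  done
  echo "unhealthy after $attempts attempt(s)"
  return 1
}
```

Example: wait_for_health https://raxx-api-prod-a60a19e5efbf.herokuapp.com/health 5 3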

Check for crash loops (recent prod logs):

heroku logs --app raxx-api-prod --num=50 2>&1

Signs of trouble: repeated Starting process with command lines, Error R10 (boot timeout), code=H10 (app crashed) in router lines, or Stopping all processes with SIGTERM within seconds of boot.
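Scanning 50 log lines by eye is error-prone, so the signatures above can be counted mechanically. This is a rough sketch under assumptions: the function name count_crash_signals and the default threshold of 3 are invented for illustration, and a healthy deploy with several dynos will legitimately log a few start and stop lines, so tune the threshold rather than trusting it.

```shell
# Hypothetical helper: read Heroku log lines on stdin and count crash-loop signatures.
# Returns 0 when fewer than THRESHOLD lines match, 1 otherwise.
count_crash_signals() {
  threshold="${1:-3}"
  # grep -c prints the number of matching lines; it exits 1 when that number
  # is 0, so the exit status is deliberately ignored
  hits="$(grep -cE 'Starting process with command|Error R10|code=H10|Stopping (all )?process(es)? with SIGTERM' || true)"
  echo "crash signals: $hits"
  [ "$hits" -lt "$threshold" ]
}
```

Example: heroku logs --app raxx-api-prod --num=50 2>&1 | count_crash_signals 3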

Confirm the deployed version matches the merged SHA:

heroku releases --app raxx-api-prod --num=3

The most recent release should correspond to the release → main merge commit. The release number vNNN from this output is needed for Step 8 (rollback pointer).

Expected state: 200 health, no crash loop, release number matches. If a crash loop is detected, execute the rollback in Step 8 before proceeding with any other step.

Step 2 — Static asset / favicon revert check

Temporary favicon or static asset overrides (demo-mode favicons, placeholder images) must be reverted before the release is declared complete.

# Confirm canonical favicon is being served — not a placeholder or demo icon
curl -s -I https://getraxx.com/favicon.ico \
  | grep -i "content-type\|content-length\|etag"
curl -s -I https://app.raxx.app/favicon.ico \
  | grep -i "content-type\|content-length\|etag"

Manual browser check:

Open https://getraxx.com and https://app.raxx.app.
Confirm the tab icon is the Raxx brand icon, not a browser default, placeholder, or demo variant.
Confirm the page title reflects the production brand.

If this release touched console/app/static/**, confirm the asset manifest was regenerated (ADR-0051 Layer A guard):

git show origin/main:docs/asset-manifest.json | python3 -m json.tool | head -20

The manifest timestamp and content hashes must match the files in the release. A manifest mismatch with no accompanying static-file change is acceptable; a static-file change with a stale manifest is not — file a type:reliability ticket.

Expected: no placeholder or override assets. A favicon revert oversight is not a rollback trigger in isolation unless it represents a brand-safety issue; file a ticket and note it in the deploy log.

Step 3 — Flag-promotion-pending review

Confirm that every feature flag enabled on staging and disabled on production is either intentionally still in staging-only state, or promoted to production in this release.

# Produce a diff of FLAG_ vars between staging and prod
heroku config --app raxx-api-staging 2>&1 | grep FLAG_ | sort > /tmp/flags-staging.txt
heroku config --app raxx-api-prod    2>&1 | grep FLAG_ | sort > /tmp/flags-prod.txt
diff /tmp/flags-staging.txt /tmp/flags-prod.txt

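Lines beginning with < appear only on staging (prod-off drift); lines beginning with > appear only on prod. A flag set on both apps with different values shows up as a < / > pair. For each < line: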

Confirm the drift is intentional and document it in the deploy log.
If the flag was supposed to be promoted in this release, promote it now:

heroku config:set FLAG_EXAMPLE=1 --app raxx-api-prod >/dev/null 2>&1
# Verify presence:
heroku config --app raxx-api-prod 2>&1 | grep FLAG_EXAMPLE

Note: FLAG_CONSOLE_FLAG_PROMOTIONS is currently off on prod. Once enabled, the /console/flags page surfaces this drift visually. Until then, use the heroku config diff above.

Expected: every drift entry is either documented as intentional or promoted. No unreviewed staging-on / prod-off drift at close.
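A raw diff mixes missing flags with changed values. The sketch below reads the two files produced above and reports each drifting flag by name on one line. It is an illustration, not existing tooling: the function name staging_only_flags is hypothetical, and it assumes heroku config prints one NAME: value pair per line with simple values that contain no colon followed by a space.

```shell
# Hypothetical helper: list FLAG_ names whose staging value is missing or
# different on prod. Inputs are files of "NAME: value" lines.
# Usage: staging_only_flags STAGING_FILE PROD_FILE
staging_only_flags() {
  staging_file="$1"
  prod_file="$2"
  # awk reads the prod file first and remembers each flag's value, then walks
  # the staging file and reports every flag that is absent or differs on prod
  awk -F': *' '
    FILENAME == ARGV[1] { if ($1 ~ /^FLAG_/) prod[$1] = $2; next }
    $1 ~ /^FLAG_/ {
      if (!($1 in prod))       print $1 ": missing on prod"
      else if (prod[$1] != $2) print $1 ": staging=" $2 " prod=" prod[$1]
    }
  ' "$prod_file" "$staging_file"
}
```

Example: staging_only_flags /tmp/flags-staging.txt /tmp/flags-prod.txt. Empty output means no staging-on / prod-off drift remains to review.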

Step 4 — Env-var seed check

Every config var required by the merged PR must be present on raxx-api-prod (and raxx-console-prod if the console was part of the release).

Identify new vars by reviewing the PR diff and body, then verify each:

# Check presence only: -c prints the match count (1 = set, 0 = absent), never the value
heroku config --app raxx-api-prod 2>&1 | grep -c '^VAR_NAME:'

If a required var is absent:

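# Set it; discard the command's output so the value is not echoed back to the terminal.
# The typed command can still land in shell history, so prefer reading the value from a variable.
heroku config:set VAR_NAME=value --app raxx-api-prod >/dev/null 2>&1
# Confirm presence (count only; the value is not printed):
heroku config --app raxx-api-prod 2>&1 | grep -c '^VAR_NAME:'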

Always suppress the output of heroku config:set with >/dev/null 2>&1. A bare heroku config:set echoes the values it sets to the terminal; this has leaked secrets before (see feedback_heroku_config_set_echoes_secrets.md).

Consult docs/ops/required-config-vars.yaml for the authoritative list of required vars by service. If a new var introduced in this release is not yet in that file, add it in a follow-up PR to develop.

Expected: all required vars present. A missing var that affects a live user-visible workflow is SEV-2.
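When a release introduces several vars, checking them one grep at a time invites both omissions and accidental value leaks. The sketch below checks a list of names in one pass and prints names only. It is an assumption-laden illustration: the function name check_required_vars is hypothetical, and it relies on heroku config printing each var as NAME: value at the start of a line.

```shell
# Hypothetical helper: read `heroku config` output on stdin and report which of
# the named vars are present. Only names are printed, never values.
# Usage: heroku config --app APP 2>&1 | check_required_vars NAME [NAME ...]
check_required_vars() {
  config="$(cat)"   # held in a variable and never echoed
  missing=0
  for name in "$@"; do
    # Match "NAME:" at the start of a line; -q suppresses all grep output
    if printf '%s\n' "$config" | grep -q "^${name}:"; then
      echo "present: $name"
    else
      echo "MISSING: $name"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when nothing is missing
  [ "$missing" -eq 0 ]
}
```

The non-zero return on any missing name makes the helper usable as a gate in a wrapper script.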

Step 5 — Database migration check

Confirm any Alembic migrations included in this release were applied to the production database.

# Current migration revision on prod — must end with (head)
heroku run "cd backend_v2 && alembic current" --app raxx-api-prod 2>&1

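Cross-reference against the head in the merged code, from a checkout of main with the backend dependencies installed:

(cd backend_v2 && alembic heads)
# Rough cross-check only: revision filenames do not necessarily sort in revision order
ls -1 backend_v2/alembic/versions/ | sort | tail -5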

If the revision is behind (head):

heroku run "cd backend_v2 && alembic upgrade head" --app raxx-api-prod 2>&1
# Re-verify:
heroku run "cd backend_v2 && alembic current" --app raxx-api-prod 2>&1

For migrations marked -- POSTGRES-ONLY (PL/pgSQL blocks): these are safe to run on prod Postgres and were already validated in staging. If a migration fails on prod but passed on staging, treat as SEV-1 and escalate immediately before any further upgrade attempts.

See docs/ops/runbooks/migration-gate.md for known failure modes and rollback steps.

Expected: alembic current shows (head). Any pending migration on a deployed release is SEV-2.
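The output of heroku run includes dyno startup and Alembic INFO lines, so the (head) marker is easy to miss. A minimal sketch that extracts the verdict from that output follows. The function name current_revision_status is hypothetical, and the sketch assumes the revision line has the form REVISION (head), which is how alembic current marks a revision that is at head.

```shell
# Hypothetical helper: read `alembic current` output on stdin and report
# whether a revision marked "(head)" is present.
current_revision_status() {
  # -F treats "(head)" as a literal string rather than a regex group
  rev_line="$(grep -F '(head)' | head -n 1)"
  if [ -n "$rev_line" ]; then
    # Strip everything from the first space onward, leaving the revision id
    echo "at head: ${rev_line%% *}"
    return 0
  fi
  echo "NOT at head"
  return 1
}
```

Example: heroku run "cd backend_v2 && alembic current" --app raxx-api-prod 2>&1 | current_revision_status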

Step 6 — Smoke-test gate against production

Run the combined smoke suite against the production endpoint to confirm the golden path is working end-to-end, not just that the process is alive.

TARGET_ENV=production bash scripts/ci/run_smoke.sh

If TARGET_ENV override is not yet supported, run the read-only backend smoke:

python3 scripts/ci/smoke_backend.py \
  --base-url https://raxx-api-prod-a60a19e5efbf.herokuapp.com

Expected: all checks pass. A failing smoke against production after a successful deploy is SEV-1 or SEV-2 depending on blast radius. Do not close the production deploy until smoke passes or a rollback is in progress.

Step 7 — Observability / Sentry check

Confirm no new error class spiked in the 15 minutes following the deploy.

In Sentry (projects: raxx-backend, raxx-frontend):

Filter: last 30 minutes.
Sort: by count (highest first).
Look for any issue with a first-seen timestamp matching the deploy window.
If a new issue class appears with count > 10 in the first 15 minutes post-deploy: treat as SEV-2, execute Step 8 rollback, then investigate.

Sentry issues UI:

https://sentry.io/organizations/moosequest/issues/?project=<id>&query=is:unresolved&sort=freq

Check Heroku router error rates:

# Router errors are logged as code=Hnn; dyno errors are logged as "Error Rnn"
heroku logs --app raxx-api-prod --num=200 2>&1 \
  | grep -E "code=H[0-9]+|Error R[0-9]+" | head -20

Any H12 (request timeout) or H10 (app crashed) entries post-deploy require immediate triage. H12 bursts often indicate a migration that took locks or a cold-start memory pressure issue.

Expected: no new error class spike, no H10/H12 burst. A single isolated H10 immediately at boot (before the first successful health check) is acceptable; a recurring H10 is not.
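Telling a single isolated H10 from a burst is easier with a per-code count than with raw log lines. The following sketch tallies error codes from log output. The function name tally_heroku_errors is hypothetical, and the patterns assume the usual Heroku log shapes (code=Hnn in router lines, Error Rnn in dyno lines).

```shell
# Hypothetical helper: tally Heroku error codes from log lines on stdin,
# most frequent first, one "CODE: COUNT" pair per line.
tally_heroku_errors() {
  # -o prints only the matched text, so each output line is one code occurrence
  grep -oE 'code=H[0-9]+|Error R[0-9]+' \
    | sed -E 's/^(code=|Error )//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $2 ": " $1 }'
}
```

Example: heroku logs --app raxx-api-prod --num=200 2>&1 | tally_heroku_errors. Empty output means no H or R codes appeared in the sampled lines.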

Step 8 — Rollback pointer (record before closing)

Record the rollback target before declaring the deploy complete. Do this while context is fresh.

# Identify the release immediately before this deploy
heroku releases --app raxx-api-prod --num=5

Find the release that was live BEFORE the current deploy. Record:

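Previous Heroku release number: vNNN (the row directly below this deploy's release; heroku releases lists newest first). Any heroku config:set run in Step 3 or Step 4 created its own release, which appears above the deploy row.
Previous git SHA: visible in heroku releases or in the release → main PR as the parent of the merge commit
Rollback command: heroku rollback vNNN --app raxx-api-prod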
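Every heroku config:set creates a release of its own, so after Steps 3 and 4 the newest row may not be the deploy. The sketch below finds the most recent deploy row and the row beneath it. It is illustrative only: the function name rollback_pointer is hypothetical, and it assumes the text layout of heroku releases (newest first, each row starting with vNNN followed by a description that begins with Deploy for code deploys).

```shell
# Hypothetical helper: read `heroku releases` text output on stdin and print
# the most recent deploy and the release that was live before it.
rollback_pointer() {
  awk '
    $1 ~ /^v[0-9]+$/ {
      # The first release row after the deploy row is the rollback target
      if (deploy != "") { previous = $1; exit }
      if ($2 == "Deploy") deploy = $1
    }
    END {
      if (previous == "") { print "no rollback target found"; exit 1 }
      print "deploy=" deploy " previous=" previous
    }
  '
}
```

Example: heroku releases --app raxx-api-prod --num=5 | rollback_pointer. Confirm the reported previous release against the release → main PR before recording it.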

For Antlers Next (Cloudflare Pages), record the previous CF Pages deployment ID:

gh api \
  "repos/raxx-app/TradeMasterAPI/deployments?environment=production&per_page=5" \
  --jq '.[] | {id, sha: .sha, created_at}'

If rollback is triggered during this checklist (crash loop in Step 1 or Sentry spike in Step 7):

heroku rollback vNNN --app raxx-api-prod
# Re-run Step 1 to confirm the prior release is healthy

Full rollback procedure: docs/ops/runbooks/rollback.md.

Approval-notification wire-in

The GitHub Environment approval banner pauses the production deploy job for operator review (see docs/ops/runbooks/prod-deploy-approval.md). Include the following in the review comment before clicking "Approve and deploy":

Post-merge checklist: docs/ops/runbooks/post-merge-prod-state-checklist.md
Run all 8 steps after deploy completes before closing.

This surfaces the checklist at the moment of approval, not after the fact — the critical delivery requirement from issue #1217.

To embed the checklist link permanently in the notify-deploy-status action that deploy-heroku.yml uses (Phase 2 of #1217), add the following to the deploy-complete notification step body:

Post-merge checklist: https://github.com/raxx-app/TradeMasterAPI/blob/main/docs/ops/runbooks/post-merge-prod-state-checklist.md

File a type:reliability ticket against develop for that automation step if it has not yet shipped.

Emergency stop

If any checklist step reveals a production-breaking condition:

Execute the rollback immediately (Step 8 rollback command).
Freeze all further production deploys:
    heroku config:set DEPLOY_FREEZE=true --app raxx-console-prod >/dev/null 2>&1
File a SEV-1 incident under docs/incidents/YYYY-MM-DD-<slug>.md.
Notify the operator.

Full freeze / unfreeze procedure: docs/ops/runbooks/deploy-freeze.md.

Escalation

Escalate to operator when:

Smoke gate fails and rollback does not restore a green smoke result.
A migration fails on alembic upgrade head after succeeding on staging.
A required env var value is missing from vault and the correct value is unknown.
A Sentry spike is novel (first-ever error class, unknown root cause).
A rollback itself fails (releases are locked or the target release is unavailable).
Cost implications of a forward-fix exceed $50/mo or $500/year.

References

ADR: docs/architecture/adr/0115-develop-release-main-branching-model.md — branching model, release→main boundary, emergency hotfix path
Gatekeeper (develop→release): docs/ops/runbooks/gatekeeper-develop-to-release.md — the promotion step that precedes this checklist
Prod deploy approval: docs/ops/runbooks/prod-deploy-approval.md — GitHub Environment reviewer gate (wire-in target from #1202)
Rollback: docs/ops/runbooks/rollback.md
Migration gate: docs/ops/runbooks/migration-gate.md
Deploy freeze: docs/ops/runbooks/deploy-freeze.md
Required config vars: docs/ops/required-config-vars.yaml
CF Access token staleness: docs/ops/runbooks/session-bootstrap.md (see also #680)
Issue: https://github.com/raxx-app/TradeMasterAPI/issues/1217
Parent epic: https://github.com/raxx-app/TradeMasterAPI/issues/1208