Raxx · internal docs

internal · gated

SOP — Heroku Rack Apps Bootstrap (Eco Tier, Pre-Launch)

Owner: Operator (Kristerpher) + agent
Last updated: 2026-05-31 UTC
Refs: #1401 (Reasonator service scaffold), #1398 (Pro/Pro+ sentiment surface), docs/architecture/reasonator/design.md, docs/architecture/reasonator/adr/0054-rack-deployment-target.md


Terminology note

The service is documented internally as Reasonator (per project_codenames.md and docs/architecture/reasonator/design.md). The Heroku app names retain the original codename rack (raxx-rack-staging, raxx-rack-prod) per the operator decision of 2026-05-31. The parent card (#1401) and the ADR-0054 deployment target both reference rack-raxx / rack-raxx-staging, the names proposed at scaffold time, but the operator's app-naming convention in project_heroku_app_names.md is raxx-<surface>-<env>, and this runbook follows it. The internal feature flag is FLAG_REASONATOR regardless of the Heroku-side naming.


When to run this

You need this runbook the first time the Reasonator service is deployed to Heroku — once for staging, once for prod. Both apps are created at Eco tier ($5/mo each) per the operator decision 2026-05-31 UTC and stay there until the first paid customer of the Reasonator-backed surface signs up; after that they upgrade to Standard-1X (or Standard-2X per ADR-0054's original spec — see "Upgrade trigger" below).

This runbook covers:


Pre-conditions


Step 1 — Create the apps

heroku apps:create raxx-rack-staging --region us --team mooseQuest >/dev/null 2>&1
heroku apps:create raxx-rack-prod    --region us --team mooseQuest >/dev/null 2>&1

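Because the redirect also silences stderr, a failed create shows up only as a non-zero exit status (check with echo $?); re-run the command without the redirect to read the error. If --team mooseQuest errors with "team not found," check the exact team slug with heroku teams first. The default region is us; do NOT use eu. Reasonator scores US-market news, so latency to the US is what matters, and there is no GDPR data-residency obligation per the EU geo-block decision (project_eu_geoblock_decision.md).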

Both heroku apps:create calls are silenced because the command echoes the app's HTTPS URL and git remote. Both are available afterward via heroku apps:info and don't need to linger in terminal scrollback or session logs.
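Silencing both stdout and stderr also hides failures. A minimal sketch of a wrapper that keeps output suppressed but still reports a non-zero exit status is shown here; run_quiet is a hypothetical helper name, not part of the Heroku CLI, and it assumes a bash-compatible shell.

```shell
# Hypothetical helper: run a command with all output discarded, but report
# a failure on stderr. Only the program name is printed, never its arguments,
# so values passed on the command line are not echoed back.
run_quiet() {
  local rc=0
  "$@" >/dev/null 2>&1 || rc=$?
  if [ "$rc" -ne 0 ]; then
    printf 'FAILED (exit %d): %s\n' "$rc" "$1" >&2
  fi
  return "$rc"
}
```

Usage would look like run_quiet heroku apps:create raxx-rack-staging --region us --team mooseQuest; on failure, re-run the bare command to read the actual error.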


Step 2 — Buildpack + runtime

Reasonator is Python 3.11 per docs/architecture/reasonator/design.md §packaging and the runtime.txt checked into rack/. Heroku auto-detects the Python buildpack from requirements.txt and reads the Python version from runtime.txt, but set the buildpack explicitly to avoid surprises:

heroku buildpacks:set heroku/python --app raxx-rack-staging >/dev/null 2>&1
heroku buildpacks:set heroku/python --app raxx-rack-prod    >/dev/null 2>&1

The transformers + torch (CPU build) wheel install is the load-bearing one; verify requirements.txt pins torch==2.x.x+cpu and --extra-index-url https://download.pytorch.org/whl/cpu is set, otherwise Heroku pulls the GPU build and the slug exceeds the 500 MB limit.
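The pin check above can be automated before a deploy. The following is a rough sketch, assuming requirements.txt carries the pin and the index URL on their own lines; check_torch_cpu_pin is a hypothetical helper name, and the patterns may need loosening if the file uses a different layout.

```shell
# Hypothetical pre-deploy check: succeeds only if the given requirements file
# pins a torch 2.x "+cpu" build AND points pip at the PyTorch CPU wheel index.
check_torch_cpu_pin() {
  local req_file=$1
  grep -Eq '^torch==2\.[0-9]+\.[0-9]+\+cpu([[:space:]]|$)' "$req_file" &&
    grep -Eq '^--extra-index-url[[:space:]=]+https://download\.pytorch\.org/whl/cpu/?([[:space:]]|$)' "$req_file"
}
```

For example, check_torch_cpu_pin requirements.txt || echo "torch CPU pin missing" would flag a file that is likely to pull the GPU build.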


Step 3 — Dyno tier + add-ons (Eco)

Per the operator decision 2026-05-31 UTC: stay at Eco until the first paid customer of the Reasonator-backed surface signs up. The Eco tier sleeps after 30 minutes of inactivity; the keep-alive cron from #1401 (*/10 * * * * ping on /v1/health) is what keeps the dyno warm during operator-testing.

# Eco dyno scale (one web dyno per app)
heroku ps:scale web=1 --app raxx-rack-staging >/dev/null 2>&1
heroku ps:scale web=1 --app raxx-rack-prod    >/dev/null 2>&1

# Eco dyno type
heroku ps:type eco --app raxx-rack-staging >/dev/null 2>&1
heroku ps:type eco --app raxx-rack-prod    >/dev/null 2>&1

Add-ons at Eco / Mini tier:

# Postgres Mini ($5/mo each) — Reasonator is stateless per design.md §state,
# but Mini Postgres is required if cards under #1401 add a job-queue table
# (e.g., batch_score_jobs). Provision at bootstrap to avoid a downtime swap later.
heroku addons:create heroku-postgresql:mini --app raxx-rack-staging >/dev/null 2>&1
heroku addons:create heroku-postgresql:mini --app raxx-rack-prod    >/dev/null 2>&1

# Redis Mini — provision ONLY if the implementation needs it for the
# batch queue. Reasonator's design.md §queue uses an in-memory queue at v1
# (acceptable per I-3 graceful-degradation). Hold Redis provision until
# the first card actually needs a persistent queue.
# heroku addons:create heroku-redis:mini --app raxx-rack-staging >/dev/null 2>&1
# heroku addons:create heroku-redis:mini --app raxx-rack-prod    >/dev/null 2>&1

Heroku Postgres Mini gotcha — pg:credentials:create, not CREATE ROLE WITH PASSWORD. Per feedback_heroku_pg_rds_password_gotcha.md, RDS-backed Heroku Postgres reserves password operations to rds_password members and the DATABASE_URL owner is NOT in that group. Any role-creation script must use heroku pg:credentials:create --name <name> --app raxx-rack-prod, not direct SQL.

Cost envelope (Eco):

| Line item | Monthly cost |
| --- | --- |
| Heroku Eco dyno × 2 (staging + prod) | $10 |
| Heroku Postgres Mini × 2 | $10 |
| Heroku Redis Mini × 2 (if needed) | $6 |
| Combined pre-launch (no Redis) | ~$20/mo |
| Combined pre-launch (with Redis) | ~$26/mo |

This aligns with the operator's stated $25-30/mo envelope.
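The totals in the cost envelope can be re-derived with shell arithmetic. This sketch uses the per-unit prices stated in this runbook for the Eco dyno and Postgres Mini ($5 each); the $3 Redis Mini unit price is inferred from the $6 line item for two apps and is an assumption.

```shell
# Per-unit monthly prices in USD. redis_mini is inferred from "$6 for 2".
eco_dyno=5
pg_mini=5
redis_mini=3
apps=2   # staging + prod

base=$(( apps * (eco_dyno + pg_mini) ))
with_redis=$(( base + apps * redis_mini ))

echo "no Redis: \$${base}/mo, with Redis: \$${with_redis}/mo"
# prints: no Redis: $20/mo, with Redis: $26/mo
```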


Step 4 — Vault bootstrap

Create the /rack/ folder first (per feedback_vault_folder_must_exist.md). Use environment-scoped writes — Infisical separates staging and prod:

curl -X POST "https://app.infisical.com/api/v1/folders" \
  -H "Authorization: Bearer $INFISICAL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"workspaceId":"<wsid>","environment":"prod","name":"rack","path":"/"}' \
  >/dev/null 2>&1

curl -X POST "https://app.infisical.com/api/v1/folders" \
  -H "Authorization: Bearer $INFISICAL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"workspaceId":"<wsid>","environment":"staging","name":"rack","path":"/"}' \
  >/dev/null 2>&1

Then write the secrets per #1401:

| Secret name | Path | Value |
| --- | --- | --- |
| RACK_SERVICE_TOKEN | /rack/ (env: prod) | 32-byte random base64url; mint with openssl rand -base64 32 \| tr -d '=' \| tr '/+' '_-' |
| RACK_SERVICE_TOKEN | /rack/ (env: staging) | Separate value from prod; never reuse |
| SENTRY_DSN | /rack/ (env: prod) | From Sentry project Reasonator → Settings → Client Keys |
| SENTRY_DSN | /rack/ (env: staging) | Separate DSN per env; do NOT cross-mix |
| FINBERT_MODEL_SHA | /rack/ (env: prod) | TODO: per #1401 OQ-2, awaiting operator confirmation. Use unknown placeholder until resolved |

Never paste any value into a chat, PR, or commit (per feedback_no_inline_secrets_in_repo.md).
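The token recipe for RACK_SERVICE_TOKEN can be wrapped so that the value never reaches the terminal. This is a sketch, assuming openssl is on the PATH; mint_token is a hypothetical helper name. A 32-byte value encodes to 44 base64 characters including one "=" pad, so the unpadded base64url result is 43 characters.

```shell
# Hypothetical helper: mint a 32-byte random token as unpadded base64url.
# tr -d also strips the trailing newline so the value is exactly the token.
mint_token() {
  openssl rand -base64 32 | tr -d '=\n' | tr '/+' '_-'
}

token=$(mint_token)
# Print only the length, never the value itself.
echo "token length: ${#token}"
```

Mint a separate value per environment, write each to the vault directly from the variable, then unset token.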


Step 5 — Heroku config

The Reasonator app reads everything from vault at boot via the agent's CF-Access-headered REST calls (per feedback_secrets_in_vault_sop.md). Only three kinds of Heroku config key are needed: the bootstrap credentials (Infisical and Cloudflare Access), the env label, and the feature flag.

heroku config:set \
  INFISICAL_PROJECT_ID=<from-vault> \
  INFISICAL_CLIENT_ID=<from-vault> \
  INFISICAL_CLIENT_SECRET=<from-vault> \
  CF_ACCESS_CLIENT_ID=<from-vault> \
  CF_ACCESS_CLIENT_SECRET=<from-vault> \
  ENV=production \
  FLAG_REASONATOR=0 \
  --app raxx-rack-prod >/dev/null 2>&1

heroku config:set \
  INFISICAL_PROJECT_ID=<from-vault-staging> \
  INFISICAL_CLIENT_ID=<from-vault-staging> \
  INFISICAL_CLIENT_SECRET=<from-vault-staging> \
  CF_ACCESS_CLIENT_ID=<from-vault-staging> \
  CF_ACCESS_CLIENT_SECRET=<from-vault-staging> \
  ENV=staging \
  FLAG_REASONATOR=0 \
  --app raxx-rack-staging >/dev/null 2>&1

FLAG_REASONATOR=0 means the customer-facing surfaces gated by Reasonator (Pro retrospective panel, Pro+ real-time chip — #1398) stay off until the operator flips it. Per feedback_new_flag_needs_b1_migration_same_pr.md, the PR that lands this flag must also include the console_flag_promotions migration; this runbook only documents the Heroku-side value.

Every heroku config:set is silenced per feedback_heroku_config_set_echoes_secrets.md — even bootstrap IDs that aren't secret-grade are cleaner kept out of shell history.
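Because the config:set calls are silenced, it can help to confirm afterwards that every expected key exists without printing any values. This sketch compares a list of key names on stdin against the seven keys set above; missing_config_keys is a hypothetical helper name.

```shell
# The seven config keys this runbook sets on each app.
required_keys="INFISICAL_PROJECT_ID INFISICAL_CLIENT_ID INFISICAL_CLIENT_SECRET CF_ACCESS_CLIENT_ID CF_ACCESS_CLIENT_SECRET ENV FLAG_REASONATOR"

# Hypothetical helper: reads key NAMES (one per line) on stdin and prints
# each required key that is absent. Values are never read or printed.
missing_config_keys() {
  local present key
  present=$(cat)
  for key in $required_keys; do
    printf '%s\n' "$present" | grep -qxF "$key" || echo "$key"
  done
}
```

One way to feed it is heroku config --shell --app raxx-rack-prod | cut -d= -f1 | missing_config_keys, which strips values before anything is displayed; empty output means all keys are present.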

To verify (length-check, never value-dump):

heroku config:get FLAG_REASONATOR --app raxx-rack-prod
# Expected: 0

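heroku config:get RACK_SERVICE_TOKEN --app raxx-rack-prod | wc -c
# Expected: 44 (43 chars of unpadded base64url from 32 bytes, plus the trailing newline)
# (If the token is held only in vault, as described above, this prints 0 or 1 instead.)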
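To avoid the off-by-one from the trailing newline, the length check can strip newlines before counting. This is a small sketch; value_len is a hypothetical helper name.

```shell
# Hypothetical helper: print the character count of stdin, ignoring newlines.
# The second tr removes the padding some wc implementations add.
value_len() {
  tr -d '\n' | wc -c | tr -d '[:space:]'
  echo
}
```

Piping a config:get call through value_len then reports the value's true length directly, so an unpadded 32-byte base64url token should show 43.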

Step 6 — Domain attach (Heroku + Cloudflare)

Attach the custom domain to each app, then update Cloudflare DNS to point at the Heroku DNS target.

heroku domains:add rack-staging.raxx.app --app raxx-rack-staging
heroku domains:add rack.raxx.app --app raxx-rack-prod

Capture the DNS target each command emits (looks like xxxxxx.herokudns.com). NOTE: these commands DO need to emit their output — the DNS target is needed for Cloudflare. They are not silenced.

Cloudflare DNS update (separate operator step — UI navigation or wrangler / cf-api):

  1. Cloudflare Dashboard → DNS → Records for raxx.app
  2. Add CNAME rack → <prod-heroku-dns-target>.herokudns.com, proxied (orange cloud)
  3. Add CNAME rack-staging → <staging-heroku-dns-target>.herokudns.com, proxied (orange cloud)
  4. Confirm SSL on both records (Cloudflare Universal SSL covers this by default)
  5. If Cloudflare Access is enabled on *.raxx.app (per feedback_cf_access_does_not_bypass_bot_fight_mode.md), pair the Access policy with the WAF skip rule keyed on CF-Access-Client-Id for the Raptor → Reasonator call path

Wait for DNS propagation (1-5 minutes for orange-cloud records). Verify:

curl -i https://rack.raxx.app/v1/health
# Expected: HTTP/2 503 with {"status":"warming_up","model_loaded":false} body
# (-i prints headers and body; -I would send a HEAD request and return no body)
# (FinBERT is ~400 MB; first boot takes ~30s; 503 is correct until model load completes)

Step 7 — Deploy workflow + keep-alive

Per #1401, two GitHub Actions workflows ship in the same scaffold PR:

  1. .github/workflows/deploy-rack.yml: Heroku deploy following ADR-0053
  2. .github/workflows/rack-keep-alive.yml: */10 * * * * cron pinging /v1/health on staging and prod

The deploy workflow needs the Heroku API key in the repo secrets:

The keep-alive workflow is critical at Eco tier — without it, the dyno sleeps after 30 minutes and the first request after sleep takes ~30s for FinBERT to reload. With the 10-minute cron, the dyno stays warm during US-market hours.

Pre-launch digest framing per feedback_pre_launch_digest_notifications.md: keep-alive failures + deploy failures route to the daily digest, not individual Slack pings. Only health-endpoint 5xx persisting >3 successive cron windows (30 minutes) escalates to per-event alerting.
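The escalation rule above reduces to a single threshold on consecutive failing cron windows. This sketch reads ">3 successive cron windows" as strictly more than three; route_health_alert and the two route labels are hypothetical names, not identifiers from the alerting config.

```shell
# Hypothetical routing rule: up to 3 consecutive failing 10-minute windows
# go to the daily digest; a 4th consecutive failure escalates to per-event.
route_health_alert() {
  local consecutive_failures=$1
  if [ "$consecutive_failures" -gt 3 ]; then
    echo "per-event"
  else
    echo "digest"
  fi
}
```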


Step 8 — Initial smoke

After Step 6 propagates:

# Health endpoint: should return 503 warming_up immediately,
# then 200 ok within ~60s after the model loads.
# No -f on the first call: -f makes curl exit non-zero on a 503 and hide the body.
curl -s https://rack-staging.raxx.app/v1/health
sleep 60
curl -sf https://rack-staging.raxx.app/v1/health | jq .
# Expected: {"status":"ok","model_loaded":true,"model_sha":"...","queue_depth":{...},"uptime_seconds":N,"version":"1.0.0"}

# Same for prod
curl -s https://rack.raxx.app/v1/health
sleep 60
curl -sf https://rack.raxx.app/v1/health | jq .

If the prod app's health endpoint returns 200 with model_loaded: true, Step 8 is complete and the bootstrap is done.
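A fixed sleep 60 can be swapped for a bounded polling loop that returns as soon as the endpoint is healthy. The following is a generic sketch; poll_until is a hypothetical helper name, and the attempt count and delay are caller-chosen.

```shell
# Hypothetical helper: retry a command until it succeeds.
#   poll_until <attempts> <delay_seconds> <command...>
# Returns 0 on the first success, 1 if every attempt fails.
poll_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt remains.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

With curl's -f flag a 503 counts as a failure, so poll_until 12 10 curl -sf https://rack-staging.raxx.app/v1/health waits up to about two minutes, matching the persistence window used in the checks below.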

If health endpoint returns persistent 503 after >2 minutes, check:

  1. Slug size: heroku slugs --app raxx-rack-prod. If >500 MB, the GPU torch wheel slipped in — revert requirements.txt to pin the CPU build.
  2. Boot log: heroku logs --tail --app raxx-rack-prod | grep -E "(error|Traceback|finbert)". The most common boot failure is the FinBERT model fetch from HuggingFace timing out; if you see it, set WEB_CONCURRENCY=1 and raise the gunicorn timeout (--timeout 180).
  3. Vault reachability: heroku run --app raxx-rack-prod python -c "from rack.config import vault_health; print(vault_health())". If this fails, the CF Access headers are stale per project_session_env_staleness.md — re-rotate.

Upgrade trigger (Eco → Standard-1X)

When to upgrade:

Upgrade command (no downtime, dyno restarts in place):

heroku ps:type standard-1x --app raxx-rack-prod >/dev/null 2>&1

ADR-0054 originally specced Standard-2X. Re-evaluate the 1X vs 2X choice based on actual queue depth and p99 latency in the first 30 days of paid traffic — the design's "min 1 dyno always running" requirement is satisfied at 1X, the 2X step is a memory-headroom margin that may or may not be needed.

When Standard-1X is in place, also upgrade Postgres from Mini ($5) to Basic ($9) if the row count on the audit / score-event tables exceeds 10K (Mini's row limit).
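The row-count trigger can be turned into a simple status check so the Postgres upgrade is planned before the limit is reached. This is a sketch only: row_limit_status is a hypothetical helper name, and the 80% warning threshold is an assumption rather than something this runbook specifies.

```shell
# Hypothetical helper: classify a row count against the Mini row limit.
#   row_limit_status <rows> [limit]   (limit defaults to 10000)
# Prints "ok", "warn" (at or above 80% of the limit, an assumed threshold),
# or "over" (strictly above the limit).
row_limit_status() {
  local rows=$1 limit=${2:-10000}
  if [ "$rows" -gt "$limit" ]; then
    echo "over"
  elif [ $((rows * 100)) -ge $((limit * 80)) ]; then
    echo "warn"
  else
    echo "ok"
  fi
}
```

The row count itself would come from heroku pg:info or a count query run by the operator; how it is obtained is left out of the sketch.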

Per project_oncall_severity_routing.md, the upgrade is a SEV3 routine ops change — agent-autonomous; no PagerDuty needed.


Rollback / teardown (rare)

If the operator decides to retire Reasonator pre-launch (e.g., Alpaca display-rights review per docs/blr/2026-05-31-alpaca-display-rights-memo.md returns "no display permitted" and the surface is dropped):

heroku config:set FLAG_REASONATOR=0 --app raxx-rack-prod >/dev/null 2>&1
heroku ps:scale web=0 --app raxx-rack-prod >/dev/null 2>&1
# Do NOT destroy the app — keeping it at scale=0 retains the config + add-ons
# for forensic and audit-log purposes. Destroy only after a 90-day window
# with operator written approval.

Same pattern for staging if needed.


Common pitfalls

  1. Eco dyno sleeps without keep-alive. The keep-alive cron is the load-bearing piece; without it operator-testing latency is awful. Verify the cron is scheduled before Step 8.

  2. Slug size blows past 500 MB. PyTorch GPU build is the usual culprit. Pin CPU build with --extra-index-url https://download.pytorch.org/whl/cpu in requirements.txt.

  3. FLAG_REASONATOR=0 does not gate Reasonator's own deployment. The flag gates the customer-facing sentiment surfaces in Antlers / Raptor (#1398). The Reasonator service itself runs regardless; flipping the flag to 0 does not stop the dyno, only the consumer.

  4. Heroku Postgres Mini row limit (10K) is small. The limit applies to the whole database, not a single table; once the total exceeds 10K rows, Heroku can revoke INSERT privileges and writes to the audit table start failing. Monitor the row count via heroku pg:info --app raxx-rack-prod and plan the upgrade to Basic before the limit is reached.

  5. CF DNS update is a separate operator step. Step 6 has both Heroku-side domains:add and Cloudflare-side DNS. The Heroku side is in this runbook; the Cloudflare side is in the operator's manual flow. Do not assume Cloudflare auto-updates from heroku domains:add.

  6. raxx-rack-* vs rack-raxx-* naming inconsistency. Per project_heroku_app_names.md the operator's convention is raxx-<surface>-<env>. #1401 was written before this convention was locked and uses rack-raxx-*. This runbook uses the convention-correct names. If the ops-deploy workflow in .github/workflows/deploy-rack.yml still uses the older names, update it in the same PR that lands the bootstrap.

  7. Per feedback_pre_launch_digest_notifications.md, Reasonator health-check failures route to the daily digest pre-launch, not per-event Slack pings. Verify the alert routing before Step 8 to avoid noise.


Refs